Database Search Performance Optimization: From SQL LIKE, MSSQL Full-Text Search to Elasticsearch

As website content and user bases grow, search functionality often becomes a system bottleneck. This article documents the considerations, research, and trade-offs the author worked through when the system hit such bottlenecks, and explores how to optimize step by step from basic SQL LIKE queries to an enterprise-grade full-text search solution.
Performance Bottlenecks in Traditional Search Methods
When the number of articles on a website increases and users search more frequently, traditional search methods run into obvious performance issues.
Traditional Approach: Using SQL LIKE Syntax Queries
```sql
select * from posts
where content like '%search_text%'
```
- Pros: Simple to use, supports fuzzy matching
- Cons: The database cannot use an index, so every query causes a full table scan, which becomes very slow on large datasets.
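The cost difference can be sketched in Python: a LIKE '%…%' query must examine every row, while an index turns the lookup into a single step (a toy illustration, not how a database engine is actually implemented):

```python
# 100,000 fake rows: id -> text.
rows = {i: f"post number {i}" for i in range(100_000)}

# LIKE '%term%': must examine every row (a full table scan).
def like_search(term):
    return [pid for pid, text in rows.items() if term in text]

# A simplified index built once up front: token -> row ids.
index = {}
for pid, text in rows.items():
    for token in text.split():
        index.setdefault(token, []).append(pid)

# Index lookup: a single dictionary access.
def indexed_search(token):
    return index.get(token, [])

print(like_search("99999"))     # [99999], after checking 100,000 rows
print(indexed_search("99999"))  # [99999], in one step
```

A B-tree index orders entries by their prefix, which is why a pattern with a leading wildcard like '%search_text%' cannot use it at all.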
Search Performance Optimization Strategies
When facing large-scale text search requirements, we need better solutions. Essentially, every approach builds on the concept of indexing.
Core Principles: Tokenizer and Inverted Index
Tokenizer (Word Breaker)

When data is written, the content to be searched is segmented into terms, and an inverted index is built from them to speed up later searches.

```json
"apple is a red fruit"
["apple", "is", "a", "red", "fruit"]

"蘋果是一種紅色的水果"
["蘋果", "是", "一種", "紅色", "的", "水果"]
```
(The Chinese sentence means "an apple is a kind of red fruit.")
Building an Inverted Index

- Forward Index: each document records which terms it contains
- Inverted Index: each term records which documents it appears in

```json
[
  { "post_id": 1, "content": "apple is a red fruit" },
  { "post_id": 2, "content": "apple is green" },
  { "post_id": 3, "content": "banana and apple" }
]
```
The following simulates how this is actually stored in the database:

Original Data (Forward Index):

post_id | content |
---|---|
1 | apple is a red fruit |
2 | apple is green |
3 | banana and apple |

Inverted Index Results:

term | post_ids |
---|---|
apple | [1, 2, 3] |
is | [1, 2] |
a | [1] |
red | [1] |
fruit | [1] |
green | [2] |
banana | [3] |
and | [3] |

So by simply looking up a keyword, we can use the inverted index to quickly find the corresponding original rows.
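The idea can be sketched in a few lines of Python (a toy illustration using the corpus above; real engines add stemming, stop-word handling, and relevance ranking):

```python
from collections import defaultdict

# Toy corpus matching the tables above.
docs = {
    1: "apple is a red fruit",
    2: "apple is green",
    3: "banana and apple",
}

# Build the inverted index: term -> set of post_ids.
inverted = defaultdict(set)
for post_id, content in docs.items():
    for term in content.split():  # naive whitespace tokenizer
        inverted[term].add(post_id)

def search(term):
    """Return the sorted post_ids containing the exact term."""
    return sorted(inverted.get(term, set()))

print(search("apple"))  # [1, 2, 3]
print(search("green"))  # [2]
```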
Next, we'll introduce two main optimization methods.
MSSQL Full-Text Search
How to Enable Full-Text Search

1. Verify Full-Text Search Feature Installation

When installing SQL Server, Full-Text Search is an optional component. To verify whether it's installed:

```sql
SELECT FULLTEXTSERVICEPROPERTY('IsFullTextInstalled') AS IsFullTextInstalled;
```
- Returns 1 → installed
- Returns 0 → not installed; add the feature via SQL Server setup
2. Create a Full-Text Catalog

Full-text indexes need a catalog to live in.

```sql
CREATE FULLTEXT CATALOG ftCatalog AS DEFAULT;
```
This creates a full-text catalog named ftCatalog and sets it as the default catalog.
3. Create the Full-Text Index

Assume there's an Articles table with Title and Content columns:

```sql
CREATE TABLE Articles (
    ID INT PRIMARY KEY,
    Title NVARCHAR(200),
    Content NVARCHAR(MAX)
);
```
You must first add a unique index to the table as the "anchor" (key index) for the full-text index:

```sql
CREATE UNIQUE INDEX IX_Articles_ID ON Articles(ID);
```
Then create the full-text index:

```sql
CREATE FULLTEXT INDEX ON Articles
(
    Title LANGUAGE 1033,   -- English
    Content LANGUAGE 1028  -- Traditional Chinese
)
KEY INDEX IX_Articles_ID
ON ftCatalog
WITH CHANGE_TRACKING AUTO;
```
Explanation:
- LANGUAGE 1033 → English (LCID)
- LANGUAGE 1028 → Traditional Chinese (LCID)
- CHANGE_TRACKING AUTO → automatic index updates (fine for test environments; consider MANUAL with periodic batch updates in production)
4. Query Testing

```sql
-- Using CONTAINS ('資料庫' means "database")
SELECT * FROM Articles WHERE CONTAINS(Content, N'資料庫');
-- Using FREETEXT ('SQL 很強大' means "SQL is very powerful")
SELECT * FROM Articles WHERE FREETEXT(Content, N'SQL 很強大');
```
5. Maintaining the Full-Text Index

Full-text indexes grow over time and require regular maintenance:

```sql
-- Rebuild the full-text catalog
ALTER FULLTEXT CATALOG ftCatalog REBUILD;
-- Reorganize the full-text catalog
ALTER FULLTEXT CATALOG ftCatalog REORGANIZE;
```
6. Common Errors

- Error: full-text service not found → the Full-Text Search feature is not installed
- Error: cannot create full-text index due to a missing unique index → first create a unique index on the primary key or another unique column
- Chinese content cannot be found → make sure the column uses NVARCHAR and that LANGUAGE 1028 (Traditional Chinese) was specified when creating the index
Differences Between CONTAINS and FREETEXT

Both CONTAINS and FREETEXT share the same basic syntax:

```sql
SELECT *
FROM your_table
WHERE CONTAINS(column_name, @query_text)
-- or
-- WHERE FREETEXT(column_name, @query_text)
```
Multi-column query:

```sql
SELECT *
FROM your_table
WHERE CONTAINS( (column1, column2, column3), @query_text );
```
CONTAINS

Original content: "SQL is powerful". SQL Server tokenizes it into [SQL], [is], [powerful], and those become the inverted index tokens.

```sql
CONTAINS(content, 'sq')  -- ❌ Not found: 'sq' is not a complete token
```
Original content: "SQL很強大" ("SQL is very powerful"). SQL Server tokenizes it into [SQL], [很], [強大], and those become the inverted index tokens.

```sql
CONTAINS(content, 'SQL')    -- ✅ Found
CONTAINS(content, '強大')   -- ✅ Found (matches a whole token)
CONTAINS(content, '很強大') -- ❌ Not found (the query doesn't match any single token)
```
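The token-matching behavior can be mimicked in Python: CONTAINS is a membership test against the index's token set, while LIKE is a raw substring test (a toy sketch, not SQL Server's actual matching logic):

```python
tokens = {"SQL", "很", "強大"}  # inverted-index tokens for "SQL很強大"

def contains(term):
    """CONTAINS-style lookup: the query must equal a whole token."""
    return term in tokens

def like(text, term):
    """LIKE '%term%'-style lookup: any substring matches."""
    return term in text

print(contains("SQL"))            # True
print(contains("很強大"))          # False: not a single token
print(like("SQL很強大", "很強大"))  # True: substring search still finds it
```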
FREETEXT

The principle is to preprocess the input query, expand the search scope, and try to match semantically related content:

- "很強大" → ["很", "強", "大"]
- "running" → "run"
- "driving" → "drive"

That is, the input query string is segmented (and stemmed) before searching, in the hope of matching the inverted index tokens more effectively.
There's no need for exact terms and fuzzy matching is supported; performance is slightly lower than CONTAINS, but still much faster than LIKE.
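The stemming step can be sketched as a toy suffix stripper in Python (this is only an illustration; SQL Server's actual word breaker is dictionary- and rule-based and far more sophisticated):

```python
def stem(word):
    """Toy stemmer: strips '-ing' and patches up two common cases."""
    if word.endswith("ing") and len(word) > 5:
        base = word[:-3]
        if len(base) >= 2 and base[-1] == base[-2]:
            return base[:-1]   # running -> runn -> run (doubled consonant)
        if base.endswith("v"):
            return base + "e"  # driving -> driv -> drive (restore final e)
        return base
    return word

print(stem("running"))  # run
print(stem("driving"))  # drive
```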
To better meet business requirements and avoid empty results when tokenization doesn't match, the two syntaxes can be combined:

```sql
DECLARE @keyword NVARCHAR(100) = N'內容';  -- '內容' means "content"

-- First try CONTAINS
IF EXISTS (
    SELECT 1 FROM Articles
    WHERE CONTAINS(Content, @keyword)
)
BEGIN
    SELECT ID, Title, Content
    FROM Articles
    WHERE CONTAINS(Content, @keyword);
END
ELSE
BEGIN
    -- If nothing was found, fall back to FREETEXT
    SELECT ID, Title, Content
    FROM Articles
    WHERE FREETEXT(Content, @keyword);
END
```
Elasticsearch
Elasticsearch is a distributed search engine based on Apache Lucene, specifically designed to handle search and analysis of large volumes of text data. In modern website architectures, it has almost become the standard choice for building search functionality.
Why Choose Elasticsearch for Large Data Volumes
When it comes to searching millions or tens of millions of records, Elasticsearch is widely regarded as the strongest solution available today, and for good reason:
1. Born for Distributed Design
Elasticsearch was designed from the beginning to handle large amounts of data. When your data grows, you can easily add more machines to share the workload. Data is automatically distributed across different nodes, so no single machine becomes particularly busy.
Moreover, when many people search simultaneously, these query requests are also distributed to different nodes for processing, with everyone helping to answer questions together. Even if one machine fails, other machines can continue to provide service without affecting users.
2. Architecture Born for Search
Elasticsearch also uses inverted indexes and caches data in memory, so searches can be tens of times faster than in a traditional database.
Additionally, it can use multiple threads simultaneously to handle different parts of queries, like having many people searching different bookshelves for you at the same time, which is naturally much faster than one person searching slowly.
3. Actual Performance Comparison
Here's a rough performance comparison to make this concrete:

Data Volume | Traditional DB LIKE | MSSQL Full-Text | Elasticsearch |
---|---|---|---|
1 million records | Tens of seconds | Seconds | Milliseconds |
10 million records | Minutes | Tens of seconds | Seconds |
Hundreds of millions+ | Impractical | Severe degradation | Still fast |
As you can see, when data volume increases, Elasticsearch's advantages become more and more apparent.
When to Consider Using Elasticsearch
If your system matches any of the following, consider Elasticsearch directly:

- Data volume exceeds 1 million records: traditional databases start to struggle at this point.
- More than 10,000 search requests per day: the search feature is heavily used and needs a more robust architecture.
- Millisecond-level response required: users expect results the moment they finish typing.
- Complex data analysis required: e.g., popular-keyword statistics, user behavior analysis.

Elasticsearch also exposes its data through a RESTful API, which simplifies integration and development.
Query Methods
```json
POST /articles/_search
{
  "query": {
    "match": {
      "content": "Elasticsearch"
    }
  }
}
```
Enable fuzzy search:

```json
{
  "query": {
    "fuzzy": {
      "title": {
        "value": "elasticsarch",   // deliberate typo
        "fuzziness": "AUTO",       // auto-determine the allowed edit distance
        "prefix_length": 1,        // how many leading characters must match exactly
        "max_expansions": 3        // limit on the number of candidate terms
      }
    }
  }
}
```
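What "fuzziness": "AUTO" tolerates is edit distance (Levenshtein distance): with AUTO, Elasticsearch allows 1 edit for terms of length 3–5 and 2 edits for longer terms. A minimal Python sketch of the metric:

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# The typo above is one edit away, so AUTO fuzziness still matches it.
print(levenshtein("elasticsarch", "elasticsearch"))  # 1
```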
Chinese text has no spaces between words; punctuation separates clauses, not words, unlike English where spaces delimit individual words. This makes tokenization and index building more complex, because the accuracy of query results depends on the tokenization algorithm. Choosing a good tokenizer therefore matters a great deal. In MSSQL full-text search the tokenizer is built in and we can only rely on Microsoft, while Elasticsearch offers multiple tokenizers, letting developers pick the one that achieves the effect they actually want.
Main Differences Between Chinese and English in Inverted Indexing
Item | English | Chinese |
---|---|---|
Space-based tokenization | ✅ Words are separated by spaces and can be segmented directly | ❌ No spaces; an additional tokenizer is required |
Tokenization difficulty | Simple: split on spaces and punctuation | Complex: requires semantics and a dictionary for word segmentation |
Analyzer | StandardAnalyzer etc. | Requires ChineseAnalyzer, IK Analyzer, jieba etc. |
Inverted index results | Direct, precise vocabulary | Different segmentations are possible, affecting indexing and querying |

- Chinese fuzzy-search support is limited: since Chinese is not composed of letters, you cannot compare "letter differences" the way you can in English.
- Elasticsearch can do fuzzy search on Chinese, but the results are often unintuitive, because they are based on character differences rather than semantics or pinyin.
Fuzzy Search Strategies
The following compares common fuzzy-search strategies:

Strategy | Description | Example |
---|---|---|
Prefix matching | Search term is at the beginning of a word | "天" matches "天氣", "天文" |
Substring matching | Search term appears anywhere in a word | "氣" matches "天氣", "氣候" |
Synonym matching | Searching "天氣" also finds "氣象" | Requires a synonym dictionary |
Pinyin fuzzy | Typing the pinyin "tianqi" finds "天氣" | Requires a pinyin converter |
Typo tolerance | Typing "田氣" still finds "天氣" | Requires NLP or a fuzzy-matching algorithm |
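The first two strategies are plain string operations; here's a minimal sketch using the example words from the table:

```python
vocabulary = ["天氣", "天文", "氣候", "氣象"]

def prefix_match(term):
    """Words that start with the search term."""
    return [w for w in vocabulary if w.startswith(term)]

def substring_match(term):
    """Words that contain the search term anywhere."""
    return [w for w in vocabulary if term in w]

print(prefix_match("天"))     # ['天氣', '天文']
print(substring_match("氣"))  # ['天氣', '氣候', '氣象']
```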
Fuzzy Search Comparison
Feature | MSSQL Full-Text Search | Elasticsearch | Description |
---|---|---|---|
🔍 Prefix matching | ✅ CONTAINS('word*') | ✅ prefix query, edge_ngram | Both support prefix fuzzy search |
🔎 Substring matching | ❌ | ✅ wildcard, ngram | MSSQL full-text cannot index %word% patterns; Elasticsearch can simulate them with ngram |
🧠 Synonym search | ❌ | ✅ Synonym filter | MSSQL has no built-in synonym feature; Elasticsearch can use a synonym dictionary |
🈚 Pinyin fuzzy search | ❌ | ✅ Pinyin plugin (e.g. ik-pinyin) | MSSQL has no pinyin support; Elasticsearch needs the corresponding plugin |
🔁 Typo tolerance / fuzzy matching | ❌ | ✅ fuzzy query | MSSQL doesn't support Levenshtein distance; Elasticsearch has native support |
Overall Comparison
Feature/Characteristic | Elasticsearch | MSSQL Full-Text Search |
---|---|---|
🔍 Full-text search accuracy | Very strong: word segmentation, relevance scoring, highlighting etc. | Usable, but segmentation quality is limited |
🈶 Chinese word segmentation | ✅ Supported (requires installing an IK/Jieba tokenizer) | ⚠️ Basic support only; no choice of tokenizer, no customization |
🔧 Tokenizer adjustability | ✅ Highly flexible, custom dictionaries | ❌ Fixed, non-adjustable |
⚡ Query performance and scalability | Very high, horizontally scalable | Medium, limited by the database itself |
📦 Complex queries and sorting | ✅ Relevance sorting, fuzzy spelling, proximity | ❌ Basic; fuzzy matching is difficult |
🧠 Natural language search experience | Highly optimized; can deliver a "Google-like" experience | Average; no intelligent queries |
📊 Additional analytics (aggregation, statistics) | ✅ Built-in powerful aggregation functionality | ❌ No full-text statistical analysis |
🔌 Integration and API support | Fully RESTful, easy to integrate with websites/backend services | Limited to SQL commands or CLR components |
📈 Big data / high search volume | ✅ Very suitable | ❌ Slows down easily |
Advanced Supplement
Keyword Search vs Semantic Search

Comparison Item | Keyword Search | Vector Semantic Search |
---|---|---|
🔍 Query method | Matches whether the text appears in documents (usually via an inverted index) | Converts queries and documents into a vector space and compares similarity (e.g. cosine similarity) |
📄 Content requirements | Search terms must literally appear in the text | Terms need not appear; semantically similar content also matches |
🌐 Language understanding | Weak: purely literal matching | Strong: understands semantic associations (causality, description, context) |
📦 Execution performance | Efficient (especially with an inverted index) | Slower (vector comparison is costly; needs ANN techniques for acceleration) |
🔧 Setup cost | Low, quick to deploy | High: requires semantic models (e.g. BERT) and a vector database |
📌 Suitable scenarios | Exact matching: regulations, technical documentation, classic search engines | Intelligent Q&A, customer service, recommendations, unstructured data understanding |
Example — query: "How to buy iPhone"

System Type | Query Result |
---|---|
Keyword Search | Only finds articles literally containing "how", "buy" and "iPhone" |
Semantic Search | Also finds content titled "Apple Phone Purchase Process", "Apple Ordering Methods" etc. |
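The vector comparison can be sketched with toy embeddings (the numbers below are made up for illustration; a real system gets them from a model such as BERT):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical 3-dimensional embeddings.
query        = [0.9, 0.1, 0.3]  # "How to buy iPhone"
doc_purchase = [0.8, 0.2, 0.4]  # "Apple Phone Purchase Process"
doc_weather  = [0.1, 0.9, 0.2]  # an unrelated article

# The purchase article scores higher despite sharing no keywords.
print(cosine(query, doc_purchase) > cosine(query, doc_weather))  # True
```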
Actual Performance Test Results
To verify the optimization's effectiveness, I ran a comparison in a test environment containing thousands of articles:

Query Method | Execution Time | Performance Improvement |
---|---|---|
SQL LIKE Query | 25.423 seconds | Baseline |
MSSQL CONTAINS | 0.093 seconds | ~273x faster |
Capacity Planning and Scalability Considerations
When planning a search system, we need to consider how data growth affects performance:

Data Volume and Performance Relationship

Article Count | Original Data Size | Index Size | Expected Performance |
---|---|---|---|
10,000 articles | ~100 MB | ~150 MB | Excellent |
100,000 articles | ~1 GB | ~1.5 GB | Good; maintenance needs attention |
1,000,000 articles | ~10 GB | ~15 GB | Requires partitioning optimization |
Several million+ articles | >10 GB | >15 GB | Elasticsearch recommended |
Capacity Calculation Explanation:
- Original data estimation: Based on an average of 10KB per article (including title, content, tags and other metadata)
- Full-text index size estimation: Based on actual testing experience, index size is typically 1.2-1.8 times the original data
- MSSQL full-text index: Includes inverted index, term frequency statistics, position information etc.
- Elasticsearch index: Includes tokenization, inverted index, segment merging, compression and other optimizations, index size may be larger depending on tokenizer and settings
- Performance test environment: SQL Server 2019, 8GB RAM, SSD hard drive
⚠️ Note: Actual index size will vary significantly due to content type, language, and tokenizer settings. It's recommended to test based on actual data.
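The size estimates above follow from simple arithmetic (10 KB per article, index at roughly the midpoint of the 1.2–1.8× range):

```python
KB, GB = 1024, 1024 ** 3

def estimate(articles, avg_kb=10, index_factor=1.5):
    """Rough raw-data and full-text-index sizes, in GB."""
    raw = articles * avg_kb * KB
    return raw / GB, raw * index_factor / GB

raw_gb, index_gb = estimate(100_000)
print(round(raw_gb, 2), round(index_gb, 2))  # 0.95 1.43  (~1 GB raw, ~1.5 GB index)
```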
Optimization Recommendations
Based on the actual business requirements, MSSQL Full-Text Search already meets the goals here. Below are several directions for continued optimization of the search system:

Optimization Strategy | Description | Applicable Scenarios |
---|---|---|
Computed column with HTML stripped | Use dbo.StripHTML(content) to reduce index noise | All scenarios |
Regular full-text index maintenance | Rebuild indexes monthly to keep queries efficient | Medium to large systems |
Separate historical and recent data | Full-text index covers only recent data, reducing maintenance cost | Large systems |
Batch index updates | Set CHANGE_TRACKING to MANUAL to avoid real-time updates | High-write-frequency systems |
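The dbo.StripHTML helper mentioned above is a user-defined function, not a SQL Server built-in; the idea behind it, sketched in Python with the standard library:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only the text content, dropping all tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(html):
    """Return the text of an HTML fragment with tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    return "".join(parser.parts)

print(strip_html("<p>SQL is <b>powerful</b></p>"))  # SQL is powerful
```

Indexing the stripped text keeps tag names like `div` or `href` from polluting the inverted index.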
Technology Selection Recommendations
Choosing the right search technology requires considering your actual needs:
Small Projects (less than 10,000 articles)
- Recommendation: SQL LIKE or MSSQL Full-Text Search
- Reason: Simple implementation, low maintenance cost, sufficient performance for daily use
Medium Projects (10,000-100,000 articles)
- Recommendation: MSSQL Full-Text Search
- Reason: Relatively simple implementation, reasonable maintenance cost, much better performance than LIKE queries
Large Projects (over 100,000 articles)
- Strong Recommendation: Elasticsearch
- Reasons:
  - Distributed design: add machines at any time to handle larger data volumes
  - Very fast: maintains millisecond-level query speed even with large amounts of data
  - Rich functionality: besides search, supports various kinds of data analysis
  - Easy to scale: no need to worry about future data growth
Super Large Projects (over 10 million articles)
- Only Choice: Elasticsearch cluster deployment
- Recommended Configuration:
  - Form a cluster from multiple machines
  - Plan data sharding properly
  - Use Kibana for data reporting
  - Store old and new data separately
Enterprise Applications (any scale)
- Recommendation: Elasticsearch + Vector Semantic Search
- Reason: Can provide a smart, Google-like search experience
Conclusion
Search functionality optimization is a gradual process:
- SQL LIKE: Suitable for small systems (less than 10,000 records)
- MSSQL Full-Text Search: Stable choice for medium systems (10,000-100,000 records)
- Elasticsearch: Best solution for large-scale data (over 100,000 records)
- Vector Semantic Search: Future development direction
👉 If system data volume exceeds millions of records, Elasticsearch is almost the only solution, as it can simultaneously balance performance, scalability, and cost-effectiveness.
Small systems don't need over-engineering; MSSQL Full-Text Search can meet daily needs.

Alvin
Software engineer who dislikes pointless busyness, enjoys solving problems with logic, and strives to find balance between the blind pursuit of achievements and a relaxed lifestyle.