Database Search Performance Optimization: From SQL LIKE, MSSQL Full-Text Search to Elasticsearch

As website content and user bases grow, search functionality often becomes a system bottleneck. This article documents the considerations, research, and trade-offs the author worked through when the system hit such bottlenecks, and explores how to optimize step by step from basic SQL LIKE queries to an enterprise-grade full-text search solution.
Performance Bottlenecks in Traditional Search Methods
When the number of articles on a website increases and users search more frequently, traditional search methods run into obvious performance issues.
Traditional Approach: Using SQL LIKE Syntax Queries
```sql
select * from posts
where content like '%search_text%'
```
- Pros: Simple to use, supports fuzzy matching
- Cons: The database cannot use an index, so every query causes a full table scan, which becomes very slow on large datasets.
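The cost difference can be sketched in Python: a LIKE '%…%' query must examine every row, while an index turns the lookup into a single step (a toy illustration, not how a database engine is actually implemented):

```python
# 100,000 fake rows: id -> text.
rows = {i: f"post number {i}" for i in range(100_000)}

# LIKE '%term%': must examine every row (a full table scan).
def like_search(term):
    return [pid for pid, text in rows.items() if term in text]

# A simplified index built once up front: token -> row ids.
index = {}
for pid, text in rows.items():
    for token in text.split():
        index.setdefault(token, []).append(pid)

# Index lookup: a single dictionary access.
def indexed_search(token):
    return index.get(token, [])

print(like_search("99999"))     # [99999], after checking 100,000 rows
print(indexed_search("99999"))  # [99999], in one step
```

A B-tree index orders entries by their prefix, which is why a pattern with a leading wildcard like '%search_text%' cannot use it at all.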
Search Performance Optimization Strategies
When facing large-scale text search requirements, we need better solutions. Essentially, every approach builds on the concept of indexing.
Core Principles: Tokenizer and Inverted Index
Tokenizer (Word Breaker)

When data is written, the content to be searched is segmented into terms, and an inverted index is built from them to speed up later searches.

```json
"apple is a red fruit"
["apple", "is", "a", "red", "fruit"]

"蘋果是一種紅色的水果"
["蘋果", "是", "一種", "紅色", "的", "水果"]
```
(The Chinese sentence means "an apple is a kind of red fruit.")
Building an Inverted Index

- Forward Index: each document records which terms it contains
- Inverted Index: each term records which documents it appears in

```json
[
  { "post_id": 1, "content": "apple is a red fruit" },
  { "post_id": 2, "content": "apple is green" },
  { "post_id": 3, "content": "banana and apple" }
]
```
The following simulates how this is actually stored in the database:

Original Data (Forward Index):

post_id | content |
---|---|
1 | apple is a red fruit |
2 | apple is green |
3 | banana and apple |

Inverted Index Results:

term | post_ids |
---|---|
apple | [1, 2, 3] |
is | [1, 2] |
a | [1] |
red | [1] |
fruit | [1] |
green | [2] |
banana | [3] |
and | [3] |

So by simply looking up a keyword, we can use the inverted index to quickly find the corresponding original rows.
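The idea can be sketched in a few lines of Python (a toy illustration using the corpus above; real engines add stemming, stop-word handling, and relevance ranking):

```python
from collections import defaultdict

# Toy corpus matching the tables above.
docs = {
    1: "apple is a red fruit",
    2: "apple is green",
    3: "banana and apple",
}

# Build the inverted index: term -> set of post_ids.
inverted = defaultdict(set)
for post_id, content in docs.items():
    for term in content.split():  # naive whitespace tokenizer
        inverted[term].add(post_id)

def search(term):
    """Return the sorted post_ids containing the exact term."""
    return sorted(inverted.get(term, set()))

print(search("apple"))  # [1, 2, 3]
print(search("green"))  # [2]
```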
Next, we'll introduce two main optimization methods.
MSSQL Full-Text Search
How to Enable Full-Text Search

1. Verify Full-Text Search Feature Installation

When installing SQL Server, Full-Text Search is an optional component. To verify whether it's installed:

```sql
SELECT FULLTEXTSERVICEPROPERTY('IsFullTextInstalled') AS IsFullTextInstalled;
```
- Returns 1 → installed
- Returns 0 → not installed; add the feature via SQL Server setup
2. Create a Full-Text Catalog

Full-text indexes need a catalog to live in.

```sql
CREATE FULLTEXT CATALOG ftCatalog AS DEFAULT;
```
This creates a full-text catalog named ftCatalog and sets it as the default catalog.
3. Create the Full-Text Index

Assume there's an Articles table with Title and Content columns:

```sql
CREATE TABLE Articles (
    ID INT PRIMARY KEY,
    Title NVARCHAR(200),
    Content NVARCHAR(MAX)
);
```
You must first add a unique index to the table as the "anchor" (key index) for the full-text index:

```sql
CREATE UNIQUE INDEX IX_Articles_ID ON Articles(ID);
```
Then create the full-text index:

```sql
CREATE FULLTEXT INDEX ON Articles
(
    Title LANGUAGE 1033,   -- English
    Content LANGUAGE 1028  -- Traditional Chinese
)
KEY INDEX IX_Articles_ID
ON ftCatalog
WITH CHANGE_TRACKING AUTO;
```
Explanation:
- LANGUAGE 1033 → English (LCID)
- LANGUAGE 1028 → Traditional Chinese (LCID)
- CHANGE_TRACKING AUTO → automatic index updates (fine for test environments; consider MANUAL with periodic batch updates in production)
4. Query Testing

```sql
-- Using CONTAINS ('資料庫' means "database")
SELECT * FROM Articles WHERE CONTAINS(Content, N'資料庫');
-- Using FREETEXT ('SQL 很強大' means "SQL is very powerful")
SELECT * FROM Articles WHERE FREETEXT(Content, N'SQL 很強大');
```
5. Maintaining the Full-Text Index

Full-text indexes grow over time and require regular maintenance:

```sql
-- Rebuild the full-text catalog
ALTER FULLTEXT CATALOG ftCatalog REBUILD;
-- Reorganize the full-text catalog
ALTER FULLTEXT CATALOG ftCatalog REORGANIZE;
```
6. Common Errors

- Error: full-text service not found → the Full-Text Search feature is not installed
- Error: cannot create full-text index due to a missing unique index → first create a unique index on the primary key or another unique column
- Chinese content cannot be found → make sure the column uses NVARCHAR and that LANGUAGE 1028 (Traditional Chinese) was specified when creating the index
Differences Between CONTAINS and FREETEXT

Both CONTAINS and FREETEXT share the same basic syntax:

```sql
SELECT *
FROM your_table
WHERE CONTAINS(column_name, @query_text)
-- or
-- WHERE FREETEXT(column_name, @query_text)
```
Multi-column query:

```sql
SELECT *
FROM your_table
WHERE CONTAINS( (column1, column2, column3), @query_text );
```
CONTAINS

Original content: "SQL is powerful". SQL Server tokenizes it into [SQL], [is], [powerful], and those become the inverted index tokens.

```sql
CONTAINS(content, 'sq')  -- ❌ Not found: 'sq' is not a complete token
```
Original content: "SQL很強大" ("SQL is very powerful"). SQL Server tokenizes it into [SQL], [很], [強大], and those become the inverted index tokens.

```sql
CONTAINS(content, 'SQL')    -- ✅ Found
CONTAINS(content, '強大')   -- ✅ Found (matches a whole token)
CONTAINS(content, '很強大') -- ❌ Not found (the query doesn't match any single token)
```
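The token-matching behavior can be mimicked in Python: CONTAINS is a membership test against the index's token set, while LIKE is a raw substring test (a toy sketch, not SQL Server's actual matching logic):

```python
tokens = {"SQL", "很", "強大"}  # inverted-index tokens for "SQL很強大"

def contains(term):
    """CONTAINS-style lookup: the query must equal a whole token."""
    return term in tokens

def like(text, term):
    """LIKE '%term%'-style lookup: any substring matches."""
    return term in text

print(contains("SQL"))            # True
print(contains("很強大"))          # False: not a single token
print(like("SQL很強大", "很強大"))  # True: substring search still finds it
```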
FREETEXT

The principle is to preprocess the input query, expand the search scope, and try to match semantically related content:

- "很強大" → ["很", "強", "大"]
- "running" → "run"
- "driving" → "drive"

That is, the input query string is segmented (and stemmed) before searching, in the hope of matching the inverted index tokens more effectively.
There's no need for exact terms and fuzzy matching is supported; performance is slightly lower than CONTAINS, but still much faster than LIKE.
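The stemming step can be sketched as a toy suffix stripper in Python (this is only an illustration; SQL Server's actual word breaker is dictionary- and rule-based and far more sophisticated):

```python
def stem(word):
    """Toy stemmer: strips '-ing' and patches up two common cases."""
    if word.endswith("ing") and len(word) > 5:
        base = word[:-3]
        if len(base) >= 2 and base[-1] == base[-2]:
            return base[:-1]   # running -> runn -> run (doubled consonant)
        if base.endswith("v"):
            return base + "e"  # driving -> driv -> drive (restore final e)
        return base
    return word

print(stem("running"))  # run
print(stem("driving"))  # drive
```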
To better meet business requirements and avoid empty results when tokenization doesn't match, the two syntaxes can be combined:

```sql
DECLARE @keyword NVARCHAR(100) = N'內容';  -- '內容' means "content"

-- First try CONTAINS
IF EXISTS (
    SELECT 1 FROM Articles
    WHERE CONTAINS(Content, @keyword)
)
BEGIN
    SELECT ID, Title, Content
    FROM Articles
    WHERE CONTAINS(Content, @keyword);
END
ELSE
BEGIN
    -- If nothing was found, fall back to FREETEXT
    SELECT ID, Title, Content
    FROM Articles
    WHERE FREETEXT(Content, @keyword);
END
```
Elasticsearch
Elasticsearch is a distributed search engine based on Apache Lucene, specifically designed to handle search and analysis of large volumes of text data. In modern website architectures, it has almost become the standard choice for building search functionality.
Why Choose Elasticsearch for Large Data Volumes
When it comes to searching millions or tens of millions of records, Elasticsearch is widely regarded as the strongest solution available today, and for good reason:
1. Born for Distributed Design
Elasticsearch was designed from the beginning to handle large amounts of data. When your data grows, you can easily add more machines to share the workload. Data is automatically distributed across different nodes, so no single machine becomes particularly busy.
Moreover, when many people search simultaneously, these query requests are also distributed to different nodes for processing, with everyone helping to answer questions together. Even if one machine fails, other machines can continue to provide service without affecting users.
2. Architecture Born for Search
Elasticsearch also uses inverted indexes and caches data in memory, so searches can be tens of times faster than in a traditional database.
Additionally, it can use multiple threads simultaneously to handle different parts of queries, like having many people searching different bookshelves for you at the same time, which is naturally much faster than one person searching slowly.
3. Actual Performance Comparison
Here's a rough performance comparison to make this concrete:

Data Volume | Traditional DB LIKE | MSSQL Full-Text | Elasticsearch |
---|---|---|---|
1 million records | Tens of seconds | Seconds | Milliseconds |
10 million records | Minutes | Tens of seconds | Seconds |
Hundreds of millions+ | Impractical | Severe degradation | Still fast |
As you can see, when data volume increases, Elasticsearch's advantages become more and more apparent.
When to Consider Using Elasticsearch
If your system matches any of the following, consider Elasticsearch directly:

- Data volume exceeds 1 million records: traditional databases start to struggle at this point.
- More than 10,000 search requests per day: the search feature is heavily used and needs a more robust architecture.
- Millisecond-level response required: users expect results the moment they finish typing.
- Complex data analysis required: e.g., popular-keyword statistics, user behavior analysis.

Elasticsearch also exposes its data through a RESTful API, which simplifies integration and development.
Query Methods
```json
POST /articles/_search
{
  "query": {
    "match": {
      "content": "Elasticsearch"
    }
  }
}
```
Enable fuzzy search:

```json
{
  "query": {
    "fuzzy": {
      "title": {
        "value": "elasticsarch",   // deliberate typo
        "fuzziness": "AUTO",       // auto-determine the allowed edit distance
        "prefix_length": 1,        // how many leading characters must match exactly
        "max_expansions": 3        // limit on the number of candidate terms
      }
    }
  }
}
```
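What "fuzziness": "AUTO" tolerates is edit distance (Levenshtein distance): with AUTO, Elasticsearch allows 1 edit for terms of length 3–5 and 2 edits for longer terms. A minimal Python sketch of the metric:

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# The typo above is one edit away, so AUTO fuzziness still matches it.
print(levenshtein("elasticsarch", "elasticsearch"))  # 1
```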
Chinese text has no spaces between words; punctuation separates clauses, not words, unlike English where spaces delimit individual words. This makes tokenization and index building more complex, because the accuracy of query results depends on the tokenization algorithm. Choosing a good tokenizer therefore matters a great deal. In MSSQL full-text search the tokenizer is built in and we can only rely on Microsoft, while Elasticsearch offers multiple tokenizers, letting developers pick the one that achieves the effect they actually want.
Main Differences Between Chinese and English in Inverted Indexing
Item | English | Chinese |
---|---|---|
Space-based tokenization | ✅ Words are separated by spaces and can be segmented directly | ❌ No spaces; an additional tokenizer is required |
Tokenization difficulty | Simple: split on spaces and punctuation | Complex: requires semantics and a dictionary for word segmentation |
Analyzer | StandardAnalyzer etc. | Requires ChineseAnalyzer, IK Analyzer, jieba etc. |
Inverted index results | Direct, precise vocabulary | Different segmentations are possible, affecting indexing and querying |

- Chinese fuzzy-search support is limited: since Chinese is not composed of letters, you cannot compare "letter differences" the way you can in English.
- Elasticsearch can do fuzzy search on Chinese, but the results are often unintuitive, because they are based on character differences rather than semantics or pinyin.
Fuzzy Search Strategies
The following compares common fuzzy-search strategies:

Strategy | Description | Example |
---|---|---|
Prefix matching | Search term is at the beginning of a word | "天" matches "天氣", "天文" |
Substring matching | Search term appears anywhere in a word | "氣" matches "天氣", "氣候" |
Synonym matching | Searching "天氣" also finds "氣象" | Requires a synonym dictionary |
Pinyin fuzzy | Typing the pinyin "tianqi" finds "天氣" | Requires a pinyin converter |
Typo tolerance | Typing "田氣" still finds "天氣" | Requires NLP or a fuzzy-matching algorithm |
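The first two strategies are plain string operations; here's a minimal sketch using the example words from the table:

```python
vocabulary = ["天氣", "天文", "氣候", "氣象"]

def prefix_match(term):
    """Words that start with the search term."""
    return [w for w in vocabulary if w.startswith(term)]

def substring_match(term):
    """Words that contain the search term anywhere."""
    return [w for w in vocabulary if term in w]

print(prefix_match("天"))     # ['天氣', '天文']
print(substring_match("氣"))  # ['天氣', '氣候', '氣象']
```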
Fuzzy Search Comparison
Feature | MSSQL Full-Text Search | Elasticsearch | Description |
---|---|---|---|
🔍 Prefix matching | ✅ CONTAINS('word*') | ✅ prefix query, edge_ngram | Both support prefix fuzzy search |
🔎 Substring matching | ❌ | ✅ wildcard, ngram | MSSQL full-text cannot index %word% patterns; Elasticsearch can simulate them with ngram |
🧠 Synonym search | ❌ | ✅ Synonym filter | MSSQL has no built-in synonym feature; Elasticsearch can use a synonym dictionary |
🈚 Pinyin fuzzy search | ❌ | ✅ Pinyin plugin (e.g. ik-pinyin) | MSSQL has no pinyin support; Elasticsearch needs the corresponding plugin |
🔁 Typo tolerance / fuzzy matching | ❌ | ✅ fuzzy query | MSSQL doesn't support Levenshtein distance; Elasticsearch has native support |
Overall Comparison
Feature/Characteristic | Elasticsearch | MSSQL Full-Text Search |
---|---|---|
🔍 Full-text search accuracy | Very strong: word segmentation, relevance scoring, highlighting etc. | Usable, but segmentation quality is limited |
🈶 Chinese word segmentation | ✅ Supported (requires installing an IK/Jieba tokenizer) | ⚠️ Basic support only; no choice of tokenizer, no customization |
🔧 Tokenizer adjustability | ✅ Highly flexible, custom dictionaries | ❌ Fixed, non-adjustable |
⚡ Query performance and scalability | Very high, horizontally scalable | Medium, limited by the database itself |
📦 Complex queries and sorting | ✅ Relevance sorting, fuzzy spelling, proximity | ❌ Basic; fuzzy matching is difficult |
🧠 Natural language search experience | Highly optimized; can deliver a "Google-like" experience | Average; no intelligent queries |
📊 Additional analytics (aggregation, statistics) | ✅ Built-in powerful aggregation functionality | ❌ No full-text statistical analysis |
🔌 Integration and API support | Fully RESTful, easy to integrate with websites/backend services | Limited to SQL commands or CLR components |
📈 Big data / high search volume | ✅ Very suitable | ❌ Slows down easily |
Advanced Supplement
Keyword Search vs Semantic Search

Comparison Item | Keyword Search | Vector Semantic Search |
---|---|---|
🔍 Query method | Matches whether the text appears in documents (usually via an inverted index) | Converts queries and documents into a vector space and compares similarity (e.g. cosine similarity) |
📄 Content requirements | Search terms must literally appear in the text | Terms need not appear; semantically similar content also matches |
🌐 Language understanding | Weak: purely literal matching | Strong: understands semantic associations (causality, description, context) |
📦 Execution performance | Efficient (especially with an inverted index) | Slower (vector comparison is costly; needs ANN techniques for acceleration) |
🔧 Setup cost | Low, quick to deploy | High: requires semantic models (e.g. BERT) and a vector database |
📌 Suitable scenarios | Exact matching: regulations, technical documentation, classic search engines | Intelligent Q&A, customer service, recommendations, unstructured data understanding |
Example — query: "How to buy iPhone"

System Type | Query Result |
---|---|
Keyword Search | Only finds articles literally containing "how", "buy" and "iPhone" |
Semantic Search | Also finds content titled "Apple Phone Purchase Process", "Apple Ordering Methods" etc. |
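The vector comparison can be sketched with toy embeddings (the numbers below are made up for illustration; a real system gets them from a model such as BERT):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical 3-dimensional embeddings.
query        = [0.9, 0.1, 0.3]  # "How to buy iPhone"
doc_purchase = [0.8, 0.2, 0.4]  # "Apple Phone Purchase Process"
doc_weather  = [0.1, 0.9, 0.2]  # an unrelated article

# The purchase article scores higher despite sharing no keywords.
print(cosine(query, doc_purchase) > cosine(query, doc_weather))  # True
```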
Actual Performance Test Results
To verify the optimization's effectiveness, I ran a comparison in a test environment containing thousands of articles:

Query Method | Execution Time | Performance Improvement |
---|---|---|
SQL LIKE Query | 25.423 seconds | Baseline |
MSSQL CONTAINS | 0.093 seconds | ~273x faster |
Capacity Planning and Scalability Considerations
When planning a search system, we need to consider how data growth affects performance:

Data Volume and Performance Relationship

Article Count | Original Data Size | Index Size | Expected Performance |
---|---|---|---|
10,000 articles | ~100 MB | ~150 MB | Excellent |
100,000 articles | ~1 GB | ~1.5 GB | Good; maintenance needs attention |
1,000,000 articles | ~10 GB | ~15 GB | Requires partitioning optimization |
Several million+ articles | >10 GB | >15 GB | Elasticsearch recommended |
Capacity Calculation Explanation:
- Original data estimation: Based on an average of 10KB per article (including title, content, tags and other metadata)
- Full-text index size estimation: Based on actual testing experience, index size is typically 1.2-1.8 times the original data
- MSSQL full-text index: Includes inverted index, term frequency statistics, position information etc.
- Elasticsearch index: Includes tokenization, inverted index, segment merging, compression and other optimizations, index size may be larger depending on tokenizer and settings
- Performance test environment: SQL Server 2019, 8GB RAM, SSD hard drive
⚠️ Note: Actual index size will vary significantly due to content type, language, and tokenizer settings. It's recommended to test based on actual data.
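The size estimates above follow from simple arithmetic (10 KB per article, index at roughly the midpoint of the 1.2–1.8× range):

```python
KB, GB = 1024, 1024 ** 3

def estimate(articles, avg_kb=10, index_factor=1.5):
    """Rough raw-data and full-text-index sizes, in GB."""
    raw = articles * avg_kb * KB
    return raw / GB, raw * index_factor / GB

raw_gb, index_gb = estimate(100_000)
print(round(raw_gb, 2), round(index_gb, 2))  # 0.95 1.43  (~1 GB raw, ~1.5 GB index)
```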
Optimization Recommendations
Based on the actual business requirements, MSSQL Full-Text Search already meets the goals here. Below are several directions for continued optimization of the search system:

Optimization Strategy | Description | Applicable Scenarios |
---|---|---|
Computed column with HTML stripped | Use dbo.StripHTML(content) to reduce index noise | All scenarios |
Regular full-text index maintenance | Rebuild indexes monthly to keep queries efficient | Medium to large systems |
Separate historical and recent data | Full-text index covers only recent data, reducing maintenance cost | Large systems |
Batch index updates | Set CHANGE_TRACKING to MANUAL to avoid real-time updates | High-write-frequency systems |
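The dbo.StripHTML helper mentioned above is a user-defined function, not a SQL Server built-in; the idea behind it, sketched in Python with the standard library:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only the text content, dropping all tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(html):
    """Return the text of an HTML fragment with tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    return "".join(parser.parts)

print(strip_html("<p>SQL is <b>powerful</b></p>"))  # SQL is powerful
```

Indexing the stripped text keeps tag names like `div` or `href` from polluting the inverted index.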
Technology Selection Recommendations
Choosing the right search technology requires considering your actual needs:
Small Projects (less than 10,000 articles)
- Recommendation: SQL LIKE or MSSQL Full-Text Search
- Reason: Simple implementation, low maintenance cost, sufficient performance for daily use
Medium Projects (10,000-100,000 articles)
- Recommendation: MSSQL Full-Text Search
- Reason: Relatively simple implementation, reasonable maintenance cost, much better performance than LIKE queries
Large Projects (over 100,000 articles)
- Strong Recommendation: Elasticsearch
- Reasons:
  - Distributed design: add machines at any time to handle larger data volumes
  - Very fast: maintains millisecond-level query speed even with large amounts of data
  - Rich functionality: besides search, supports various kinds of data analysis
  - Easy to scale: no need to worry about future data growth
Super Large Projects (over 10 million articles)
- Only Choice: Elasticsearch cluster deployment
- Recommended Configuration:
  - Form a cluster from multiple machines
  - Plan data sharding properly
  - Use Kibana for data reporting
  - Store old and new data separately
Enterprise Applications (any scale)
- Recommendation: Elasticsearch + Vector Semantic Search
- Reason: Can provide a smart, Google-like search experience
Conclusion
Search functionality optimization is a gradual process:
- SQL LIKE: Suitable for small systems (less than 10,000 records)
- MSSQL Full-Text Search: Stable choice for medium systems (10,000-100,000 records)
- Elasticsearch: Best solution for large-scale data (over 100,000 records)
- Vector Semantic Search: Future development direction
👉 If system data volume exceeds millions of records, Elasticsearch is almost the only solution, as it can simultaneously balance performance, scalability, and cost-effectiveness.
Small systems don't need over-engineering; MSSQL Full-Text Search can meet daily needs.

Alvin
Software engineer who dislikes pointless busyness, enjoys solving problems with logic, and strives to find balance between the blind pursuit of achievements and a relaxed lifestyle.