Database Search Performance Optimization: From SQL LIKE, MSSQL Full-Text Search to Elasticsearch


October 7, 2025
Program

Database Search Performance Optimization: From SQL LIKE, MSSQL Full-Text Search to Elasticsearch


In-depth exploration of database search performance optimization, comparing the pros and cons of SQL LIKE, MSSQL Full-Text Search, and Elasticsearch with implementation details.

As website content and user base grow, search functionality often becomes a system bottleneck. This article documents the author's considerations, research, and trade-offs made based on system usage development when encountering bottlenecks, exploring how to gradually optimize from basic SQL LIKE queries to enterprise-level full-text search solutions.

Performance Bottlenecks in Traditional Search Methods

🔗

When the number of website articles increases and user search frequency rises, traditional search methods encounter obvious performance issues.

Traditional Approach: Using SQL LIKE Syntax Queries

🔗
sql
select * from posts
where content like '%search_text%'
  • Pros: Simple to use, supports fuzzy search

  • Cons: Database cannot use indexes, causing full table scans, which becomes very slow with large datasets.


Search Performance Optimization Strategies

🔗

When facing large-scale text search requirements, we need better solutions. Basically, all approaches extend from the concept of indexing.

Core Principles: Tokenizer and Inverted Index

🔗

Tokenizer (Word Breaker)

🔗

When data is written, the content to be searched is segmented, and inverted indexes are built to improve search performance.

json
"apple is a red fruit"

["apple", "is", "a", "red", "fruit"]

"蘋果是一種紅色的水果"

["蘋果", "是", "一種", "紅色", "的", "水果"]

Building Inverted Index

🔗
  • Forward Index: Each document records what terms it contains
  • Inverted Index: Each term records which documents it appears in
json
[
  { "post_id": 1, "content": "apple is a red fruit" },
  { "post_id": 2, "content": "apple is green" },
  { "post_id": 3, "content": "banana and apple" }
]

The following simulates the actual storage format in the database:

Original Data (Forward Index):

post_idcontent
1apple is a red fruit
2apple green
3banana and apple

Inverted Index Results:

termpost_ids
apple[1, 2, 3]
red[1]
fruit[1]
green[2]
banana[3]
and[3]

So by simply querying keywords, we can quickly find the corresponding original data through the inverted index.


Next, we'll introduce two main optimization methods.

MSSQL Full-Text Search

🔗

How to Enable Full-Text Search

🔗

1. Verify Full-Text Search Feature Installation

🔗

When installing SQL Server, Full-Text Search is an optional component. To verify if it's installed:

sql
SELECT FULLTEXTSERVICEPROPERTY('IsFullTextInstalled') AS IsFullTextInstalled;
  • Returns 1 → Installed
  • Returns 0 → Not installed, need to reinstall SQL Server features

2. Create Full-Text Catalog

🔗

Full-text indexes need a Catalog for storage.

sql
CREATE FULLTEXT CATALOG ftCatalog AS DEFAULT;

This creates a full-text catalog named ftCatalog and sets it as the default catalog.

3. Create Full-Text Index

🔗

Assume there's an Articles table with Title and Content columns:

sql
CREATE TABLE Articles (
    ID INT PRIMARY KEY,
    Title NVARCHAR(200),
    Content NVARCHAR(MAX)
);

You must first add a Unique Index to the table as an "anchor" for the full-text index:

sql
CREATE UNIQUE INDEX IX_Articles_ID ON Articles(ID);

Then create the full-text index:

sql
CREATE FULLTEXT INDEX ON Articles
(
    Title LANGUAGE 1033,   -- English
    Content LANGUAGE 1028  -- Traditional Chinese
)
KEY INDEX IX_Articles_ID
ON ftCatalog
WITH CHANGE_TRACKING AUTO;

Explanation:

  • LANGUAGE 1033 → English (LCID)
  • LANGUAGE 1028 → Traditional Chinese (LCID)
  • CHANGE_TRACKING AUTO → Automatic index updates (recommended for test environments; consider MANUAL for production with periodic batch updates)

4. Query Testing

🔗
sql
-- Using CONTAINS
SELECT * FROM Articles WHERE CONTAINS(Content, N'資料庫');

-- Using FREETEXT
SELECT * FROM Articles WHERE FREETEXT(Content, N'SQL 很強大');

5. Maintaining Full-Text Index

🔗

Full-text indexes expand over time and require regular maintenance:

sql
-- Rebuild full-text index
ALTER FULLTEXT CATALOG ftCatalog REBUILD;

-- Reorganize full-text index
ALTER FULLTEXT CATALOG ftCatalog REORGANIZE;

6. Common Errors

🔗
  • Error: Full-text service not found → Full-Text Search feature not installed
  • Error: Cannot create full-text index due to missing unique index → Must first create an index on primary key or unique column
  • Cannot find Chinese content → Ensure column uses NVARCHAR and specify LANGUAGE 1028 (Traditional Chinese) when creating index

Differences Between CONTAINS and FREETEXT

🔗

You can use CONTAINS and FREETEXT syntax

sql
SELECT * 
FROM your_table
WHERE CONTAINS(column_name, @query_text)  
or 
WHERE FREETEXT(column_name, @query_text)

Multi-column Query

sql
SELECT *
FROM your_table
WHERE CONTAINS( (column1, column2, column3), @query_text );


CONTAINS

🔗
sql
Original content:
SQL is powerful

SQL Server tokenizes to:
[SQL], [is], [powerful]

The inverted index tokens are [SQL], [is], [powerful]

sql
CONTAINS(content, 'sq')  //If input 'sq', nothing will be found
sql
Original content:
SQL很強大
SQL Server tokenizes to:
[SQL],[很],[強大]

The inverted index tokens are [SQL],[很],[強大]

sql
CONTAINS(content, 'SQL')  -- ✅ Found
CONTAINS(content, '強大')          -- ✅ Found (if searching entire segment)
CONTAINS(content, '很強大')            -- ❌ Not found (due to tokenization)

FREETEXT

🔗

The principle is to preprocess the input query terms, expand search scope, and attempt to search for semantically related content

  • "很強大"["很", "強", "大"]
  • "running""run"
  • driving""drive"

This means preprocessing the input query string by segmenting it and then attempting to search, hoping to effectively match the inverted index tokens.

No need for exact terms, supports fuzzy search, but performance is slightly lower than CONTAINS, yet still much faster than LIKE

To better meet business requirements and prevent no search results when tokenization doesn't match, both syntaxes are used together.

sql
DECLARE @keyword NVARCHAR(100) = N'內容';

-- First try CONTAINS
IF EXISTS (
    SELECT 1 FROM Articles
    WHERE CONTAINS(Content, @keyword)
)
BEGIN
    SELECT ID, Title, Content
    FROM Articles
    WHERE CONTAINS(Content, @keyword);
END
ELSE
BEGIN
    -- If not found, fall back to FREETEXT
    SELECT ID, Title, Content
    FROM Articles
    WHERE FREETEXT(Content, @keyword);
END



Elasticsearch

🔗

Elasticsearch is a distributed search engine based on Apache Lucene, specifically designed to handle search and analysis of large volumes of text data. In modern website architectures, it has almost become the standard choice for building search functionality.

Why Choose Elasticsearch for Large Data Volumes

🔗

When it comes to handling millions or tens of millions of records for search, Elasticsearch is indeed the best solution available today. This is not without reason:

1. Born for Distributed Design

🔗

Elasticsearch was designed from the beginning to handle large amounts of data. When your data grows, you can easily add more machines to share the workload. Data is automatically distributed across different nodes, so no single machine becomes particularly busy.

Moreover, when many people search simultaneously, these query requests are also distributed to different nodes for processing, with everyone helping to answer questions together. Even if one machine fails, other machines can continue to provide service without affecting users.

2. Architecture Born for Search

🔗

Elasticsearch also uses inverted indexes and utilizes memory to cache data, so search speed can be tens of times faster than traditional databases.

Additionally, it can use multiple threads simultaneously to handle different parts of queries, like having many people searching different bookshelves for you at the same time, which is naturally much faster than one person searching slowly.

3. Actual Performance Comparison

🔗

Here's an actual performance comparison table to give everyone a more concrete understanding:

Data VolumeTraditional DB LIKEMSSQL Full-TextElasticsearch
1 Million RecordsTens of secondsSecondsMillisecond-level
10 Million RecordsMinutesTens of secondsSecond-level
Hundreds of millions+ImpracticalSevere performance degradationStill fast

As you can see, when data volume increases, Elasticsearch's advantages become more and more apparent.

When to Consider Using Elasticsearch

🔗

If your system has the following situations, please directly consider using Elasticsearch~

Data volume exceeds 1 million records: At this point, traditional databases will start to struggle.

Daily search requests exceed 10,000: This means your search functionality is heavily used and needs a more robust architecture.

Requires millisecond-level response: If users expect to see results immediately after entering keywords.

Requires complex data analysis: Such as statistics on popular search keywords, user behavior analysis, etc.

Elasticsearch's benefits also include providing data in API format, reducing development workflow.

Query Methods

🔗
json
POST /articles/_search
{
  "query": {
    "match": {
      "content": "Elasticsearch"
    }
  }
}

Enable fuzzy search

json
{
  "query": {
    "fuzzy": {
      "title": {
        "value": "elasticsarch",      ← Typo
        "fuzziness": "AUTO",          ← Auto-determine allowed errors
        "prefix_length": 1,           ← How many initial characters must be correct
        "max_expansions": 3          ← Limit of matching terms
      }
    }
  }
}

Chinese text in articles is always separated by commas and periods, unlike English which uses spaces to distinguish individual words. This creates complexity in tokenization and index building, because the tokenization algorithm corresponds to the accuracy of query results. Therefore, choosing a good tokenizer becomes very important. In MSSQL full-text search, the tokenizer is built-in and we can only rely on Microsoft, while Elasticsearch provides multiple tokenizers, allowing developers to choose different options to achieve the actual effects they want.

Main Differences Between Chinese and English in Inverted Indexing

🔗
ItemEnglishChinese
Space-based tokenization✅ Words have spaces between them, can be directly segmented❌ No spaces, cannot be directly segmented, requires additional tokenizer
Tokenization difficultySimple: Direct segmentation by spaces and punctuationComplex: Requires semantics and dictionary for word segmentation
AnalyzerStandardAnalyzer etc.Requires ChineseAnalyzer, IK Analyzer, jieba etc.
Inverted index resultsMore direct, precise vocabularyMay produce different segmentation results, affecting indexing and querying
  • Chinese fuzzy search support is limited, because Chinese is not composed of letters, it cannot compare "letter differences" like English.

Elasticsearch can do fuzzy search on Chinese, but results are often unintuitive because it's not based on semantics or pinyin, but on character encoding differences

Fuzzy Search Strategies

🔗

The following compares currently common fuzzy search strategies

StrategyDescriptionExample
Prefix matchingSearch term is at the beginning of words"天" can match "天氣", "天文"
Substring matchingSearch term appears anywhere in words"氣" can match "天", "氣候"
Synonym matchingSearch "天氣" also finds "氣象"Requires synonym dictionary
Pinyin fuzzyMisspelled "tianqi" → finds "天氣"Requires pinyin converter
Typo toleranceUser types "田氣" also finds "天氣"Requires NLP or fuzzy matching algorithm

Fuzzy Search Comparison

🔗
FeatureMSSQL Full-Text SearchElasticsearchDescription
🔍 Prefix matchingCONTAINS('word*')prefix query, edge_ngramSupports prefix fuzzy search
🔎 Substring matchingwildcard, ngramMSSQL doesn't support %word% full-text indexing, Elasticsearch can simulate with ngram
🧠 Synonym search✅ Synonym filterMSSQL has no built-in synonym feature, Elasticsearch can use synonym dictionary
🈚 Pinyin fuzzy search✅ Pinyin plugin (like ik-pinyin)MSSQL has no pinyin support, Elasticsearch requires corresponding plugin
🔁 Typo tolerance / Fuzzy matchingfuzzy queryMSSQL doesn't support Levenshtein distance, Elasticsearch has native support

Overall Comparison

🔗
Feature/CharacteristicElasticsearchMSSQL Full-Text Search
🔍 Full-text search accuracyVery strong, supports word segmentation, relevance scoring, highlighting etc.Basically usable, but limited word segmentation effects
🈶 Chinese word segmentation support✅ Supported (requires installing IK/Jieba tokenizer)⚠️ Basic support, but no tokenizer choice and cannot customize
🔧 Tokenizer adjustability✅ Highly flexible, can customize dictionary❌ Fixed and non-adjustable
⚡ Query performance and scalabilityVery high, horizontally scalableMedium, limited by database itself
📦 Support for complex queries and sorting✅ Supports relevance sorting, fuzzy spelling, proximity distance❌ More basic, difficult to do fuzzy matching
🧠 Natural language search experienceHighly optimized, can create "Google-like" search experienceAverage, cannot provide intelligent queries
📊 Additional data analysis (aggregation, statistics)✅ Built-in powerful Aggregation functionality❌ No full-text statistical analysis support
🔌 Integration and API supportFully RESTful, easy to integrate with websites/backend servicesLimited to SQL commands or CLR components
📈 Suitable for big data/high search volume scenarios✅ Very suitable❌ Easily becomes slow

Advanced Supplement

🔗

Keyword Search vs Semantic Search

🔗
Comparison ItemKeyword SearchVector Semantic Search
🔍 Query methodMatches whether text appears in documents (usually using inverted index)Converts queries and documents to vector space for similarity comparison (like cosine similarity)
📄 Content requirementsSearch terms must explicitly appear in textSearch terms don't need to appear completely, semantically similar can also match
🌐 Language understanding abilityWeak, purely literal matchingStrong, can understand semantic associations (like causality, description, context)
📦 Execution performanceEfficient (especially inverted index)Relatively slow (vector comparison is costly, requires ANN technology for acceleration)
🔧 Setup costLow, quick deploymentHigh, requires semantic models (like BERT) and vector databases
📌 Suitable scenariosPrecise matching, regulations, technical documentation, early search enginesIntelligent Q&A, customer service queries, recommendation systems, unstructured data understanding

Example:

Query: "How to buy iPhone"

System TypeQuery Result Description
Keyword SearchCan only find articles containing "how", "buy" and "iPhone"
Semantic SearchCan find content with titles like "Apple Phone Purchase Process", "Apple Ordering Methods" etc.


Actual Performance Test Results

🔗

To verify optimization effectiveness, I conducted performance comparisons in a test environment containing thousands of articles:

Query MethodExecution TimePerformance Improvement
SQL LIKE Query25.423 secondsBaseline
MSSQL CONTAINS0.093 seconds273x improvement

Capacity Planning and Scalability Considerations

🔗

When planning search systems, we need to consider the impact of data growth on performance:

Data Volume and Performance Relationship

🔗
Article CountOriginal Data SizeIndex SizeExpected Performance
10,000 articles~100 MB~150 MBExcellent
100,000 articles~1 GB~1.5 GBGood, needs attention to maintenance
1,000,000 articles~10 GB~15 GBRequires partitioning optimization
Millions+ articles>10 GB>15 GBRecommend Elasticsearch

Capacity Calculation Explanation:

  • Original data estimation: Based on an average of 10KB per article (including title, content, tags and other metadata)
  • Full-text index size estimation: Based on actual testing experience, index size is typically 1.2-1.8 times the original data
    • MSSQL full-text index: Includes inverted index, term frequency statistics, position information etc.
    • Elasticsearch index: Includes tokenization, inverted index, segment merging, compression and other optimizations, index size may be larger depending on tokenizer and settings
  • Performance test environment: SQL Server 2019, 8GB RAM, SSD hard drive

⚠️ Note: Actual index size will vary significantly due to content type, language, and tokenizer settings. It's recommended to test based on actual data.



Optimization Recommendations

🔗

Based on actual business requirements, using MSSQL Full-Text Search can already achieve the goals. Here are several directions for continuous optimization of search systems:

Optimization StrategyDescriptionApplicable Scenarios
Create computed columns to remove HTMLUse dbo.StripHTML(content) to reduce index noiseAll scenarios
Regular full-text index maintenanceRebuild indexes monthly to maintain query efficiencyMedium to large systems
Separate historical and latest dataFull-text index only maintains recent data, reducing maintenance costsLarge systems
Batch index updatesSet CHANGE_TRACKING to MANUAL, avoid real-time updatesHigh write frequency systems

Technology Selection Recommendations

🔗

Choosing the right search technology requires considering your actual needs:

Small Projects (less than 10,000 articles)

🔗
  • Recommendation: SQL LIKE or MSSQL Full-Text Search
  • Reason: Simple implementation, low maintenance cost, sufficient performance for daily use

Medium Projects (10,000-100,000 articles)

🔗
  • Recommendation: MSSQL Full-Text Search
  • Reason: Relatively simple implementation, reasonable maintenance cost, much better performance than LIKE queries

Large Projects (over 100,000 articles)

🔗
  • Strong Recommendation: Elasticsearch
  • Reasons:
    • Distributed design: Can add machines anytime to handle larger data volumes
    • Very fast: Can maintain millisecond-level query speed even with large amounts of data
    • Rich functionality: Besides search, can also do various data analysis
    • Easy to scale: No need to worry about future data growth issues

Super Large Projects (over 10 million articles)

🔗
  • Only Choice: Elasticsearch cluster deployment
  • Recommended Configuration:
    • Use multiple machines to form clusters
    • Properly plan data sharding
    • Use Kibana for data reporting
    • Separate old and new data storage

Enterprise Applications (any scale)

🔗
  • Recommendation: Elasticsearch + Vector Semantic Search
  • Reason: Can provide smart search experience similar to Google

Conclusion

🔗

Search functionality optimization is a gradual process:

  • SQL LIKE: Suitable for small systems (less than 10,000 records)
  • MSSQL Full-Text Search: Stable choice for medium systems (10,000-100,000 records)
  • Elasticsearch: Best solution for large-scale data (over 100,000 records)
  • Vector Semantic Search: Future development direction

👉 If system data volume exceeds millions of records, Elasticsearch is almost the only solution, as it can simultaneously balance performance, scalability, and cost-effectiveness.
Small systems don't need over-engineering, MSSQL Full-Text Search can meet daily needs.



References

SQLElasticsearchFull-Text SearchPerformance Optimization



Avatar

Alvin

Software engineer who dislikes pointless busyness, enjoys solving problems with logic, and strives to find balance between the blind pursuit of achievements and a relaxed lifestyle.

Related Posts