Building a Scalable Event-Driven Search Architecture with Postgres’ Full-Text Search | by Jawad Margih | November, 2022

Designing full-text search based on Postgres with event-driven architecture


Software projects of all sizes and levels of complexity share the challenge of building a scalable search solution. Who hasn’t seen an application running search with plain RDBMS SQL statements? You must be wondering whether this is a good solution. Well, it depends.

Searching with SQL may be sufficient for your use case, but as your project grows, more advanced features (synonyms, multilingual search, or even enabling machine learning) will be needed. When they are required, a relational database may not suffice.

Full-text search is supported in PostgreSQL. Full-text indexing allows documents to be pre-processed and an index saved for faster searching later.

In some cases, PostgreSQL full-text search may be sufficient and should definitely be considered. For this and other reasons, many projects start with the database they already have and, over time, move to search engines like ElasticSearch or Solr.

“…Full-text search refers to techniques for searching a single computer-stored document or collection in a full-text database…”

— Wikipedia

Recently, I started exploring full-text search options. The use case is searching documents with three searchable fields: title and description (using full-text search), plus the document ID (matching strings that contain the document ID). The search should run over a million documents and respond in no more than 200ms.

Stemming

Stemming is the process of reducing a word to its word stem, the part that attaches to suffixes and prefixes or to the root of the word (known as a lemma), to ensure that variants of that word match during search. For example, Managing, Manager, and Management can all be reduced to the stem manag, so a search for manag will return results for any variant of that word.

Source: https://thinkinfi.com/difference-between-stemming-and-lemmatizing-and-where-to-use/
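To see stemming in action, you can ask Postgres for the tsvector it builds. A minimal illustration (not code from the article), using the built-in english configuration:

SELECT to_tsvector('english', 'Managing managers manage management');
-- 'manag':1,2,3,4  (every variant reduces to the stem "manag")

SELECT to_tsvector('english', 'Management reports') @@ to_tsquery('english', 'managing');
-- true, because both sides stem to "manag"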

N-gram

An N-gram is like a sliding window that moves across the word: a contiguous sequence of characters up to a specified length. For example, the word word can be broken into {'w', 'wo', 'wor', 'ord', 'rd'}. N-grams can be used to find a word by its parts, or even from the middle. The most commonly used N-gram is the trigram (N = 3).
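As an illustration (not from the article), the pg_trgm extension can show the trigrams Postgres extracts from a word; note that it pads the word with spaces before slicing it:

CREATE EXTENSION IF NOT EXISTS pg_trgm;

SELECT show_trgm('word');
-- {"  w"," wo","ord","rd ","wor"}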

Fuzziness

‘Fuzzy’ refers to solutions that do not look for a perfect, position-by-position match when comparing two strings. Instead, they allow some mismatch (or ‘fuzziness’). For example, searching for the word succesful will also return results for successful. Common applications include spell checking and spam filtering.
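With the pg_trgm extension installed, the % operator gives exactly this kind of fuzzy matching: it returns true when two strings share enough trigrams (more than the pg_trgm.similarity_threshold setting, 0.3 by default). A minimal illustration, not code from the article:

SELECT 'successful' % 'succesful';
-- true: the misspelling still shares most of its trigrams with the correct word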

Similarity

The similarity of two words can be measured by counting the number of trigrams they share. This simple idea is very effective for measuring the similarity of words in many natural languages.
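In Postgres this measure is exposed by pg_trgm's similarity() function, which returns the fraction of trigrams two strings have in common (again an illustration, not the article's code):

SELECT similarity('successful', 'succesful');
-- 0.75: the two words share 9 of the 12 distinct trigrams between them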

Ranking

Ranking attempts to measure how relevant documents are to a particular query so that, when there are many matches, the most relevant ones are shown first. Postgres supports ranking and weighted ranking. Weights are typically used to mark words from special areas of the document, such as the title or an initial abstract, so they can be treated as more or less important than words in the body of the document.
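A weighted ranking query might look like the sketch below; the documents table and its title and description columns are assumptions for illustration, not the article's schema:

SELECT id,
       ts_rank(
         setweight(to_tsvector('english', coalesce(title, '')), 'A') ||        -- title weighted higher
         setweight(to_tsvector('english', coalesce(description, '')), 'B'),    -- body weighted lower
         to_tsquery('english', 'search & engine')
       ) AS rank
FROM documents
ORDER BY rank DESC
LIMIT 10;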

Before rushing to implement full-text search, there are a few points to consider.

Knowing the data

Before implementing full-text search, we must know what the data looks like and understand the business requirements.

Data size

There are different things to consider regarding data size, including storage, retention policies, speed, indexes, etc. It could be argued that you would need such a scheme regardless of data size, but two things have to be acknowledged:

  1. Scale increases complexity.
  2. At a small scale, architectural mistakes (long-running queries, missing indexes, etc.) are more forgiving.

With regard to the technical process required to navigate a large data set, size matters. It matters in the planning and processing of data. It matters in tool selection and database choice.

Business requirements and features

We need to be very clear about the goal we want our search engine to accomplish. What kind of information does our search need to return? Who uses the search?

What other features should we offer besides full-text search? Which fields need to be indexable/searchable? Which search types should be supported? And, in addition to search, do we want to allow other capabilities like filtering?

Finally, this brings me to the last point: defining key performance indicators (KPIs) to measure the success of our search implementation.

KPI

KPIs are detailed specifications measured and analyzed by the software testing team to ensure process compliance with business objectives. Moreover, they help the team to take necessary steps if the performance of the product does not meet the defined objectives.

Our goal was to show customers the best match results at the top of the search results. In most cases, the customer will leave if the desired result is not listed in the first X results on the search page – or the first three pages of search results.

We knew that a customer could not have more than 500,000 documents; on average, customers have 50K documents. Also, as a business requirement, the search response time should not exceed 200ms. So, we decided to use one million documents as our baseline for the performance test, keeping the 200ms response-time target.

ElasticSearch is optimized for search only, but setting up and maintaining the infrastructure can be time-consuming. In addition, installing it yourself requires dedicated servers or services, which are more expensive than the PostgreSQL alternative. Also, in ElasticSearch, schema changes require a complete re-indexing of an index. By default, results are sorted by relevance—with the most relevant docs first. However, it is also possible to sort by field values. ElasticSearch also supports filtering by field and pagination out of the box.

The team is already familiar with Postgres, which is already integrated into our product, making it a cheaper and faster option, as it requires no additional setup or maintenance. We understand that the amount of data will increase over time. However, we also know that with the current rate of data growth, our indexing and searching will still be within acceptable limits. Although PostgreSQL is slower, with [likely] slightly worse results and [possibly] limited capacity, it is still “good enough” at a fairly large scale.

Figure 1. Full-text search based on Postgres with event-driven architecture

Indexing Service (Write)

The indexing service consumes events that contain document data that needs to be indexed. Any service that needs its documents to be searchable is required to publish an event that complies with a predefined schema created by the indexing service.

{
  updated_on: number;
  document_id: string;
  document_type: string;
  document_title: string;
  document_description: string;
  customer_id: string;
  metadata: object;
}
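For each consumed event, the indexing service writes or updates a row in the index-storage table. A hypothetical upsert of the event fields is sketched below; the search_documents table and its column names are assumptions (a fuller schema sketch appears in the optimization section later):

INSERT INTO search_documents (customer_id, document_id, document_type,
                              document_title, document_description, updated_on)
VALUES ($1, $2, $3, $4, $5, $6)
ON CONFLICT (customer_id, document_id)
DO UPDATE SET document_type        = EXCLUDED.document_type,
              document_title       = EXCLUDED.document_title,
              document_description = EXCLUDED.document_description,
              updated_on           = EXCLUDED.updated_on;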

In addition to consuming events, the indexing service provides operational APIs such as running VACUUM, health checks, re-indexing single or multiple documents, synchronization, etc.
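The maintenance work behind those APIs can be plain SQL. For instance (table and index names here are hypothetical, matching the schema sketch later in the article):

VACUUM (ANALYZE) search_documents;
REINDEX INDEX CONCURRENTLY search_documents_ts_idx;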

Search service (read)

The search service (self-explanatory) is responsible for finding all types of documents supported by our product. It provides capabilities such as pagination, sorting (default is by relevance), and filtering.
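A sketch of the kind of read query the search service might run, combining full-text matching, a fuzzy fallback, filtering, relevance sorting, and pagination; the table and column names are assumptions carried over from the indexing sketches:

SELECT document_id, document_type, document_title,
       ts_rank(ts, websearch_to_tsquery('english', $1)) AS rank
FROM search_documents
WHERE customer_id = $2
  AND (ts @@ websearch_to_tsquery('english', $1)      -- full-text match
       OR $1 <% words)                                 -- fuzzy fallback via pg_trgm word similarity
  AND ($3::text IS NULL OR document_type = $3)         -- optional filter by document type
ORDER BY rank DESC                                     -- default sort: relevance
LIMIT 20 OFFSET 0;                                     -- pagination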

Read and write separation

The motivation behind splitting this into two microservices, one for indexing documents and a second for searching them, is that it allows us to scale each one appropriately. The indexing service is expected to perform a heavier task than the search service, and we do not want bulk indexing of documents to affect the search functionality for our end users. So, our approach was to deploy them separately, without sharing the same resources and connection pool.

After researching how to optimize our queries for Postgres full-text search, here is what we ended up with:

  • We added a new column, ts, to the table to store the preprocessed search document (i.e., the list of lexemes). ts is a generated column (new as of Postgres 12) that automatically stays in sync with the source data. We then created a GIN index on the ts column, which is of type tsvector.
  • To enable fuzzy search, we used the pg_trgm Postgres extension and added a words column to the table to store the searchable text: a concatenated string of the searchable fields.
  • Finally, the pg_trgm extension provides GiST and GIN index operator classes, which allow us to create an index on the words text column for fast similarity searches. A schema sketch follows this list.
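Putting those pieces together, a minimal sketch of the index-storage table might look like this (table, column, and index names are assumptions, not the article's actual schema):

CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE TABLE search_documents (
    customer_id          text   NOT NULL,
    document_id          text   NOT NULL,
    document_type        text,
    document_title       text,
    document_description text,
    updated_on           bigint,
    -- concatenated searchable text used for trigram (fuzzy) search
    words text GENERATED ALWAYS AS (
        coalesce(document_title, '') || ' ' || coalesce(document_description, '')
    ) STORED,
    -- preprocessed search document (list of lexemes), kept in sync automatically,
    -- with the title weighted higher than the description for ranking
    ts tsvector GENERATED ALWAYS AS (
        setweight(to_tsvector('english', coalesce(document_title, '')), 'A') ||
        setweight(to_tsvector('english', coalesce(document_description, '')), 'B')
    ) STORED,
    PRIMARY KEY (customer_id, document_id)
);

-- GIN index on the tsvector column for full-text search
CREATE INDEX search_documents_ts_idx ON search_documents USING GIN (ts);

-- trigram GIN index on the concatenated text for fast similarity search
CREATE INDEX search_documents_words_idx ON search_documents USING GIN (words gin_trgm_ops);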

We ran the benchmark tests on a machine with the following specs: CPU i9 2.3GHz (4 cores/threads) and 8GB of RAM.

We filled the index storage with 3.5 million documents and ran the benchmark tests against a customer with one million indexed documents.

We tested the query below, which returns 436 rows, using the pgbench tool.

We were able to achieve ~170 transactions per second.

scaling factor: 1
query mode: simple
number of clients: 20
number of threads: 20
duration: 60 s
number of transactions actually processed: 10187
latency average = 58.925 ms
initial connection time = 19.787 ms
tps = 169.707403 (without initial connection time)

However, a large amount of data is stored in the database, and performance and scalability suffer as the data grows. Partitioning solves this by splitting large tables into smaller ones, reducing memory-swap problems and table scans, and increasing performance.
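A hypothetical way to partition the index-storage table is to hash-partition it by customer_id; the article doesn't say which partitioning scheme was used, so this is only a sketch:

CREATE TABLE search_documents (
    customer_id text NOT NULL,
    document_id text NOT NULL,
    -- in practice this carries the same columns, generated columns, and indexes as the earlier sketch
    PRIMARY KEY (customer_id, document_id)
) PARTITION BY HASH (customer_id);

CREATE TABLE search_documents_p0 PARTITION OF search_documents
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE search_documents_p1 PARTITION OF search_documents
    FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE search_documents_p2 PARTITION OF search_documents
    FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE search_documents_p3 PARTITION OF search_documents
    FOR VALUES WITH (MODULUS 4, REMAINDER 3);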

After partitioning the index storage table, we achieved about 60% improvement in query performance.

scaling factor: 1
query mode: simple
number of clients: 20
number of threads: 20
duration: 60 s
number of transactions actually processed: 14936
latency average = 46.390 ms
initial connection time = 26.722 ms
tps = 248.933333 (without initial connection time)

Here’s what it looks like when running the query with EXPLAIN ANALYZE:

With partitioning
"Planning Time: 0.230 ms"
"Execution Time: 25.808 ms"
Without partitioning
"Planning Time: 0.204 ms"
"Execution Time: 32.868 ms"

Postgres provides a variety of tools to support full-text search. In this article, you saw how we integrated full-text search into our product by separating writes from reads, and how some FTS functions work. You also saw how we took advantage of the SearchVectorField (tsvector) class with a GIN index to optimize performance, in conjunction with table partitioning to scale further in the future.

If you already have Postgres as part of your stack, it’s worth experimenting with it before moving on to an external and more expensive option that demands more attention to operational complexity.

So, if your project doesn’t have many millions of records or a very large amount of data, Postgres full-text search will be a great option.

Thanks for reading!
