The evolution and future trends of search database technology
Search-oriented databases
Trend
2024-03-03

Overview #

With the rapid advancement of digital technology and the explosive growth of information, search engines such as Google have become one of the preferred means of accessing information. However, as user demands continue to evolve, traditional search technologies have struggled to meet the need for real-time, personalized, and diverse information.

This demand is even more pronounced within enterprises. As digital transformation continues to deepen within organizations, unstructured data has become a major source of data growth and a crucial component of data ecosystems, carrying immense value. The importance of efficiently storing and leveraging unstructured data is increasingly evident. Enterprises require more efficient management and retrieval of massive internal data to support business decisions and operational requirements.

According to IDC’s projections, by 2025, 80% of data will be unstructured. Furthermore, Gartner’s data indicates that the volume of unstructured data is expected to double from 2019 to 2024. Unstructured data nevertheless poses several challenges: it takes diverse forms, is complex to manage, and its value is hard to extract. Traditional database systems often fail to meet the real-time and diverse search requirements of enterprises. To address these challenges, search-oriented databases have emerged, with technologies such as automatic tokenization, inverted indexing, relevance scoring, and vector retrieval engines at their core. Since their inception in the 1990s, these databases have continued to evolve and are becoming an indispensable branch of the database field.

What is a search-oriented database? #

Search-oriented databases, formerly known as full-text databases or enterprise search engines, are specialized database systems designed to store and manage large volumes of textual data while facilitating efficient text search and information retrieval. With the continuous development of technology and the increasing diversity of application scenarios, they have evolved beyond handling long text alone. They can now handle common structured data such as numbers, dates, IP addresses, and geographical locations, as well as unstructured data such as images and audio-visual content. Their scope of application continues to expand, from accelerating retrieval for business systems, IT operations observability, and aggregate query analysis to a wide range of scenarios that include multimodal data search.

Typical search databases exhibit the following characteristics:

1. Flexible indexing capabilities: Search databases can handle various types of data, including unstructured data such as text, images, audio, and video. They employ techniques like automated tokenization and inverted indexing to efficiently process different formats and types of data, offering flexible search and retrieval functionalities.

2. Efficient query performance: Search databases index and retrieve large-scale data quickly. Leveraging optimized index structures and query algorithms, they can return relevant results within a short time, improving user search efficiency. They are commonly deployed alongside relational databases to take over high-concurrency retrieval workloads that relational databases handle poorly.

3. Support for complex search features: Search databases provide diverse search functionalities, including full-text search, fuzzy search, exact search, range search, vector search, geospatial information retrieval, and more. Users can flexibly select and combine different search features based on their specific needs and scenarios, enabling them to obtain desired search results.

4. High performance and scalability: Search databases are designed with high performance and scalability in mind, capable of handling massive amounts of data and concurrent access. They employ distributed architectures and parallel computing techniques, enabling horizontal scalability to meet the ever-growing demands of data volume and user traffic.
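To make the tokenization, inverted indexing, and relevance scoring mentioned above more concrete, here is a minimal Python sketch. It is illustrative only and not how any particular product is implemented: the class name `InvertedIndex`, the simple regex tokenizer, and the sample documents are all assumptions, and BM25 is used simply as one widely used relevance function with commonly quoted default parameters (`k1=1.2`, `b=0.75`).

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    """Toy in-memory inverted index with BM25 scoring (illustrative only)."""

    def __init__(self, k1=1.2, b=0.75):
        self.k1 = k1
        self.b = b
        self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}
        self.doc_len = {}                  # doc_id -> number of tokens

    def add(self, doc_id, text):
        """Tokenize a document and record each term's frequency."""
        tokens = tokenize(text)
        self.doc_len[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            self.postings[term][doc_id] = tf

    def search(self, query, top_k=3):
        """Return up to top_k doc ids ranked by BM25 score."""
        n_docs = len(self.doc_len)
        if n_docs == 0:
            return []
        avg_len = sum(self.doc_len.values()) / n_docs
        scores = defaultdict(float)
        for term in tokenize(query):
            docs = self.postings.get(term, {})
            if not docs:
                continue
            # Inverse document frequency; the "1 +" keeps the value positive.
            idf = math.log(1 + (n_docs - len(docs) + 0.5) / (len(docs) + 0.5))
            for doc_id, tf in docs.items():
                # Length normalization: longer documents are penalized.
                norm = 1 - self.b + self.b * self.doc_len[doc_id] / avg_len
                scores[doc_id] += idf * tf * (self.k1 + 1) / (tf + self.k1 * norm)
        ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
        return [doc_id for doc_id, _ in ranked[:top_k]]


# Hypothetical sample documents.
index = InvertedIndex()
index.add("d1", "Search databases use inverted indexing for full-text search")
index.add("d2", "Relational databases store structured rows")
index.add("d3", "Vector retrieval finds similar images")
```

Calling `index.search("inverted search")` only touches the posting lists for those two terms instead of scanning every document, which is the main reason inverted indexes scale well for text retrieval. A production engine would add analyzers, compression, deletes, and on-disk segments, none of which are modeled here.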

In summary, search databases combine flexible indexing of structured and unstructured data, efficient query performance, a rich set of search features, and high performance with horizontal scalability. They serve as crucial tools for processing large-scale data and providing efficient search services.

The applications of a search-oriented database #

The application scenarios of search-oriented databases are widespread across various industries. Here are some typical examples:

1. Retail and E-commerce: In the retail and e-commerce industry, search-oriented databases are extensively utilized in product search and recommendation systems. Through the search functionality, customers can easily find desired products, and personalized recommendation systems can suggest relevant products based on users' search history and behavioral patterns, thereby enhancing shopping experiences and transaction conversion rates.

2. Healthcare: In the healthcare sector, search-oriented databases are employed for medical literature retrieval, disease diagnosis, medication search, and more. Physicians and researchers can utilize the search functionality to find relevant medical literature and research findings, aiding in disease diagnosis and treatment plan formulation.

3. Financial Services: Search-oriented databases find applications in financial data retrieval, market analysis, investment decision-making, and related areas within the financial services industry. Investors can utilize the search functionality to access relevant financial data and market information, assisting them in making more accurate investment decisions.

4. Manufacturing: Search-oriented databases are used in manufacturing for production process monitoring, quality control, fault diagnosis, and similar purposes. Engineers can leverage the search functionality to find relevant production data and technical information, helping them solve production-related issues and challenges.

5. Media and Entertainment: In the media and entertainment industry, search-oriented databases are employed for content retrieval, copyright management, user recommendation, and more. Users can utilize the search functionality to find interesting news, music, videos, and other content, while personalized recommendation systems can suggest relevant content based on users' search history and preferences.

6. Education and Training: Search-oriented databases are utilized in education and training for learning resource retrieval, course management, learning analytics, and related aspects. Students and teachers can use the search functionality to find relevant learning resources and course content, while learning analytics systems can analyze students' search behavior and learning performance, providing references and support for teaching.

7. IT Operations Observability: Through search-oriented databases, real-time monitoring of system status, performance indicators, and log data becomes possible, assisting operations teams in promptly detecting and resolving system failures, performance issues, and anomalies to ensure stable system operation.

8. Security Monitoring and Threat Detection: By leveraging search-oriented databases, auditing and monitoring of security logs can be conducted to monitor user access behavior and system operations, promptly detecting abnormal behavior and security incidents. Additionally, search-oriented databases can integrate with threat intelligence data, enabling correlation analysis of internal log data, facilitating swift identification and response to various security threats and attack behaviors, thereby safeguarding system and data security.

In conclusion, search-oriented databases play a vital role in diverse industries, catering to data scales ranging from gigabytes to petabytes. They are present in various aspects of our lives, providing efficient, accurate, and personalized information search and retrieval services, thereby driving industry development and progress. With continuous innovation and advancement in search technologies, the application of search-oriented databases across industries will become increasingly widespread, delivering more convenient and intelligent search experiences to users.

The evolution of search-based databases #

The developmental trajectory of search-centric databases can be summarized into four distinctive stages:

1. Inception Phase (1990s): The foundation of search-centric databases can be traced back to the 1990s, when full-text retrieval emerged as the primary technological approach, initially applied to document retrieval and web search. Prominent examples during this phase include AltaVista and Excite.

2. Technological Breakthrough (2000s): With the rapid growth of the internet, search-centric databases expanded into various domains such as e-commerce and social networks. The advent of open-source search engines like Lucene and Sphinx spurred advancements in search technology.

3. Commercialization (2010s): The commercialization phase saw the rise of search-centric databases, with commercial search engines like Elasticsearch leading the way. Organizations began widespread adoption of search-centric databases for managing and retrieving extensive amounts of data.

4. Intelligent Transformation (2020s): As artificial intelligence technology advanced, search-centric databases gradually underwent an intelligent transformation by incorporating techniques like machine learning and natural language processing. This enabled personalized recommendations and intelligent search services. Moreover, search-centric databases found applications in various sectors, such as healthcare and financial services.

In conclusion, the developmental trajectory of search-centric databases includes the inception phase, technological breakthroughs, commercialization, and intelligent transformation. This journey underscores their significance in the field of information retrieval and highlights the ongoing progress and expanding scope of search technology. With the continued maturation of artificial intelligence, search-centric databases are expected to make significant strides in intelligence and personalization, offering users even more enhanced search experiences.

Search-oriented database products #

The development of search-centric databases has led to several mature products and vendors in the market. However, the boundaries of search-centric databases can sometimes be blurry, as with other types of databases. Many databases serve multiple purposes, including document storage, multimodal data handling, and vector data storage. Nevertheless, commonly encountered search-centric databases have emerged in the following ways:

  • Search databases derived from search engine core libraries, such as Elasticsearch.

  • Search databases expanded from other databases, such as Postgres Full-Text Search.

  • Search databases designed from scratch, such as INFINI Pizza.

By referring to the popular DB-Engines search engine ranking, one can gain a preliminary understanding of the popularity trends for mainstream search-centric databases, as shown in the graph below:

It is evident that Elastic’s Elasticsearch has held a dominant position ever since it overturned Splunk’s lead in log management more than a decade ago. Having charted a new course in log management, it went on to surpass the competition and has stayed at the top of the search ranking. Elastic’s commercial growth has remained robust, with revenue exceeding $1 billion in 2023.

OpenSearch, an open-source fork of Elasticsearch initiated by AWS, came into being in response to Elastic’s decision to relicense Elasticsearch under the Elastic License and SSPL, a change aimed specifically at cloud vendors. OpenSearch is derived from Elasticsearch version 7.10, is released under the Apache 2.0 license, and has already garnered a considerable user base.

Splunk is a software platform used for searching, monitoring, and analyzing large-scale machine-generated data. It is primarily used for log management and security analysis and is a commercial proprietary product. In September 2023, it made headlines when Cisco announced it would acquire the company for roughly $28 billion in cash, causing a sensation among professionals in the industry. Interestingly, among the top four contenders in this field, all except Splunk are built on the Lucene core.

Established in 2001, MarkLogic positions itself as a NoSQL, multi-model database vendor. It operates as a commercial proprietary software company. While it boasts a mature ecosystem, its system can be considered overly complex, with a steep learning curve. In early 2023, MarkLogic was acquired by Progress Software for $355 million, which can be seen as a favorable outcome for the company.

Certainly, apart from the mentioned products, there are many outstanding contenders eagerly preparing for the challenge.

Some of these projects include Vespa, Rockset, Doris, ClickHouse, Quickwit, Pinot, SingleStore, Qdrant, Milvus, Algolia, MeiliSearch, Typesense, Manticore Search, and many more. These projects may not all position themselves solely as search-oriented databases; some focus on the AI field, while others emphasize real-time analytics and so on. Each of them has its own unique strengths, but they all possess certain capabilities in search and analysis. It wouldn’t be surprising if each claims to outperform Elasticsearch in its own way.

The development of search-based databases in China #

At the beginning of 2023, the “Technical Requirements for Searchable Databases” standard was officially released. The effort was led by the Institute of Cloud Computing and Big Data of the China Academy of Information and Communications Technology, supported by the Big Data Technology Standards Promotion Committee of the China Communications Standards Association, and jointly compiled by more than 30 enterprises, including Tors, INFINI Labs, and Transwarp Technology. The standard has become an industry bellwether for technology selection and product development in search-based databases, and INFINI Labs’s INFINI Easysearch was the first product to pass it.

The Modb community has also opened up a search-based database ranking, with a total of 6 companies' products on the list:

The domestic market for search-based databases is still in its early stages, with a limited number of vendors and product options. However, as the market matures, I firmly believe we will see rapid growth in the near future.

Future trends of search-oriented databases #

The field of search-based databases is undergoing notable trends as technology, scenarios, and data continue to evolve. These trends further drive the evolution and expansion of search technology’s application scope. The following directions represent significant trends that I have observed:

Trend 1: Real-time Search and Analysis

  • Real-time search is a crucial development trend in the field of search-based databases. Business applications are increasingly moving towards real-time operations, and there is a growing demand for instant access to the latest data and content.

  • Real-time search technology, achieved through real-time indexing and updating mechanisms, enables fast data retrieval and updates. It provides up-to-date search results that satisfy users' need for real-time access to information.

  • Currently, most search-based databases built on the Lucene core achieve near real-time (NRT) search rather than true real-time search. Frequent updates remain challenging and waste resources. More efficient real-time capabilities would greatly improve both the user search experience and real-time decision-making.
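The gap between "written" and "searchable" that characterizes near real-time search can be sketched with a toy model. This is a simplified illustration under stated assumptions, not Lucene's actual implementation: the class `NearRealTimeIndex`, its `buffer` and `segments` fields, and the sample log lines are hypothetical, and real engines also handle deletes, segment merging, and durability.

```python
from collections import defaultdict


class NearRealTimeIndex:
    """Toy model: writes are buffered and become searchable on refresh()."""

    def __init__(self):
        self.buffer = []    # pending (doc_id, text) pairs, not yet searchable
        self.segments = []  # published segments: term -> set of doc ids

    def add(self, doc_id, text):
        """Accept a write; it stays invisible to search until refresh()."""
        self.buffer.append((doc_id, text))

    def refresh(self):
        """Publish buffered documents as one new immutable segment."""
        if not self.buffer:
            return
        segment = defaultdict(set)
        for doc_id, text in self.buffer:
            for term in text.lower().split():
                segment[term].add(doc_id)
        self.segments.append(dict(segment))
        self.buffer = []

    def search(self, term):
        """Look the term up in every published segment."""
        hits = set()
        for segment in self.segments:
            hits |= segment.get(term.lower(), set())
        return sorted(hits)


nrt = NearRealTimeIndex()
nrt.add("log-1", "disk latency warning")
before_refresh = nrt.search("latency")   # document is still in the buffer
nrt.refresh()
after_refresh = nrt.search("latency")    # document is now visible
nrt.add("log-2", "network latency spike")
nrt.refresh()
segment_count = len(nrt.segments)        # one segment per refresh
```

In this model every refresh creates another small segment that each query must visit, which hints at why very frequent refreshes can become costly: the more often data is made visible, the more small segments accumulate and eventually need to be merged.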

Trend 2: Multi-Modal Hybrid Search

  • Multi-modal search involves considering multiple forms of information, such as text, images, videos, etc., during the search process to improve the accuracy and comprehensiveness of search results.

  • This technology, by analyzing and understanding the correlations between various forms of information, provides users with more comprehensive and rich search results. It is suitable for search scenarios that require the integration of different media formats. As real-world data becomes increasingly complex, and the utilization of unstructured data grows, multi-modal search can provide businesses with flexible analytical and exploratory capabilities. The ability to perform hybrid searches is particularly appealing.
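One way to picture hybrid search is as the fusion of several ranked result lists, for example one from keyword matching and one from vector similarity over text or image embeddings. The sketch below uses reciprocal rank fusion (RRF), which is one common fusion technique and not necessarily what any given product uses. The function name, the constant `k=60`, and the sample ids are assumptions for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one list using RRF.

    Each id receives 1 / (k + rank) from every list it appears in,
    so items ranked well by several retrievers rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_hits = ["img-7", "doc-2", "doc-9"]  # hypothetical full-text results
vector_hits = ["doc-2", "vid-4", "img-7"]   # hypothetical embedding results
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Because RRF works on ranks rather than raw scores, it avoids having to normalize a text relevance score against a vector distance, which makes it a convenient default when the modalities are scored on incompatible scales.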

Trend 3: AI-powered Semantic Search

  • The exploration of large-scale AI models and AI-powered search technology has progressed rapidly. AI techniques are employed to achieve intelligence, semantic understanding, and personalization during the search process. This involves leveraging technologies such as natural language processing and machine learning to analyze user intent and provide intelligent and personalized search services.

  • With the rise of large-scale models, search-based databases are beginning to adopt approaches like RAG (Retrieval-Augmented Generation) to enhance search effectiveness. By combining retrieval with generation, RAG grounds a model’s answers in retrieved documents, delivering more accurate and comprehensive results and providing users with intelligent and personalized search services.

  • Search-based databases have become ideal testing grounds for AI applications. Elastic, the company behind Elasticsearch, has seen its stock price rebound after embracing AI and large-scale models, which is worthy of celebration.
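The retrieval half of a retrieval-augmented generation pipeline can be sketched in a few lines. This is a minimal illustration under stated assumptions: the helper names (`cosine`, `retrieve`, `build_prompt`), the hand-made three-dimensional vectors, and the sample passages are all hypothetical. A real system would obtain vectors from an embedding model and would send the resulting prompt to a language model, a step that is deliberately left out here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, store, top_k=2):
    """Return the top_k passages whose vectors are closest to the query."""
    ranked = sorted(
        store,
        key=lambda item: cosine(query_vec, item["vector"]),
        reverse=True,
    )
    return [item["text"] for item in ranked[:top_k]]


def build_prompt(question, passages):
    """Assemble retrieved passages and the question into one prompt string."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Hypothetical hand-made embeddings; not produced by any real model.
store = [
    {"text": "Elasticsearch is built on Lucene.", "vector": [0.9, 0.1, 0.0]},
    {"text": "Splunk focuses on log management.", "vector": [0.1, 0.9, 0.0]},
    {"text": "Vision Pro is a spatial computing device.", "vector": [0.0, 0.1, 0.9]},
]
question = "Which engine is built on Lucene?"
passages = retrieve([1.0, 0.2, 0.0], store)
prompt = build_prompt(question, passages)
```

The search database's role in this flow is the `retrieve` step: the better the retrieval, the more relevant the context handed to the model, which is why search quality matters so much for RAG-style applications.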

Trend 4: Cloud-native, Storage-Compute Separation, Serverless

  • As cloud computing technology advances, search-based databases are gradually transitioning to cloud-native architectures. Cloud-native search-based databases utilize technologies such as containerization and microservices to achieve higher flexibility, scalability, fault tolerance, and cost efficiency. They provide enterprises with more stable and efficient search services at lower costs and with greater elasticity.

  • Storage-compute separation is another important trend in the development of search-based databases. By decoupling storage and compute, search-based databases can better adapt to changing demands in data storage and processing, improving system performance and efficiency. Storage-compute separation enables search-based databases to achieve higher concurrent access and faster data processing speeds, providing users with smoother and more stable search experiences.

  • Serverless architectures offer out-of-the-box experiences, lower costs, and increased flexibility. This direction is currently being actively explored by many search service providers.

Trend 5: Augmented Reality (AR) Search

  • With the advancement of augmented reality technology, particularly through the introduction of Apple’s wearable device, Vision Pro, a revolutionary spatial computing device seamlessly integrating digital content into the physical world, search technology is gradually merging with augmented reality. This integration aims to provide users with more intuitive and immersive search experiences. Augmented reality search combines search results with the real-world context and leverages AI technology to offer personalized and convenient search services. This represents a new field with significant opportunities.

Trend 6: Efficient Utilization of Modern Hardware

  • Modern hardware and software environments have undergone significant transformations. With technologies such as on-chip computing, edge computing, FPGAs, DPUs, and GPUs, devices with hundreds of cores and terabytes of memory have become a reality. However, the software running on these devices often lags behind in terms of architecture, relying on designs from decades ago. For instance, Lucene, the core of Elasticsearch, was established in 1997, which means it’s been 27 years since its inception. Despite its continuous evolution, certain architectural and design concepts are no longer cutting-edge.

  • Leveraging more advanced algorithms, updated data structures, and design theories on modern hardware, along with utilizing the latest CPU instruction sets, vectorization, and batch processing, can fully tap into the advantages of multi-core processors, vast memory, and SSDs. This can achieve higher efficiency and lower costs and address previously impossible problems. It holds great potential and should be a focus for the next generation of search engines.

As the boundaries of various database functionalities become increasingly blurred, application scenarios overlap extensively, and market competition intensifies, I believe there are still significant opportunities for vertical-focused search-based databases. However, the market space for all-encompassing database products has dwindled. To thrive, it is crucial to have a specific and dedicated focus in vertical domains. At INFINI Labs, we are developing the next-generation search engine INFINI Pizza using Rust, with a focus on end-user scenarios. Our aim is to address the core business needs of real-time retrieval in high-concurrency and low-latency environments, particularly in the face of massive data updates.

Summary #

In conclusion, the field of search-based databases is currently in a stage of rapid development. With the continuous growth of internet data and evolving user needs, search database technology will continue to innovate and advance to meet users' demands for more instantaneous, personalized, and diverse information retrieval. In the future, as artificial intelligence technology further develops and is applied, search-based databases will become more intelligent, ubiquitous, and diverse. They will provide users with more efficient, accurate, and personalized search services, facilitating convenient access to and utilization of internet information.
