Web crawling and text embeddings are core building blocks of modern search engines. By analyzing text at a granular level, a search engine can return more precise results. In this article, we’ll explore how these two pieces work in Rozz, a searchbox that draws on the HTML and PDF content of a website. We’ll also look at how chunk size, measured in tokens, affects how well context is captured across different content types: HTML, PDFs and, eventually, social media.
Web Crawling: The Foundation of Search
Web crawling involves the systematic browsing of the internet to index and analyze the content of websites. This forms the foundation of search engines, enabling them to find and serve the most relevant information in response to user queries.
At Rozz, we use Selenium for web crawling. Selenium is a browser automation tool used mainly for testing web applications. Because it drives a real browser, it renders a web page just as a user would see it, including content generated by JavaScript. This matters on today’s web, where dynamic JavaScript-based websites have become the norm.
This capability allows us to go beyond the surface HTML code to access the fully rendered content of the web page, resulting in a more comprehensive index. While currently we focus on text, including both the HTML content and PDFs, future plans entail expanding this scope to other media such as images and social media content. This holistic approach ensures that we have a diverse and accurate representation of the web’s content, enhancing the search experience we provide.
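As a rough illustration of crawling rendered content, the sketch below loads a page in headless Chrome and returns the visible text. It is not Rozz’s crawler: the function names, the choice of Chrome, the timeout and the `readyState` wait condition are all assumptions made for the example.

```python
import re


def clean_text(raw: str) -> str:
    """Collapse runs of whitespace so the text is ready for chunking."""
    return re.sub(r"\s+", " ", raw).strip()


def fetch_rendered_text(url: str, timeout_s: float = 10.0) -> str:
    """Load `url` in headless Chrome and return the visible text of <body>."""
    # Imported here so clean_text() stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless=new")  # assumption: a recent Chrome is available
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until the browser reports the document as loaded. Pages that
        # keep fetching data afterwards would need a more specific condition.
        WebDriverWait(driver, timeout_s).until(
            lambda d: d.execute_script("return document.readyState") == "complete"
        )
        # .text holds the rendered, visible text, including text that
        # JavaScript inserted into the page.
        body = driver.find_element(By.TAG_NAME, "body")
        return clean_text(body.text)
    finally:
        driver.quit()  # always release the browser process
```

Calling `fetch_rendered_text("https://example.com")` would return the page text as a single whitespace-normalized string, which can then be passed on to the embedding step.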
Text Embeddings: The Key to Context
Once the content is crawled, it is processed through a technique called text embedding, where text data is converted into numerical vectors. These vectors can be processed by machine learning models to identify patterns and similarities, allowing for the intelligent retrieval of content.
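To make the idea of comparing vectors concrete, here is a minimal sketch of ranking chunks by cosine similarity, one common way to measure how close two embeddings are. The article does not say which similarity measure Rozz uses, so treat that choice as an assumption; the vectors and chunk names are made up for the example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def rank_chunks(query_vec: list[float], chunk_vecs: dict[str, list[float]]) -> list[str]:
    """Return chunk ids sorted from most to least similar to the query."""
    scored = {cid: cosine_similarity(query_vec, vec) for cid, vec in chunk_vecs.items()}
    return sorted(scored, key=scored.get, reverse=True)


# Made-up 3-dimensional vectors; real embeddings have far more dimensions.
chunks = {
    "pricing-page": [0.9, 0.1, 0.0],
    "careers-page": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.0]
print(rank_chunks(query, chunks))
```

In a real system the vectors would come from an embedding model rather than being written by hand, and the ranking would usually be delegated to a vector index.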
The embedding process begins with tokenization, where the text is broken down into tokens, the small units (typically words or pieces of words) that a model reads. Tokens are then grouped into chunks, which can be as short as a single sentence or as long as several paragraphs, and each chunk serves as the basis for one numerical representation of the text.
The challenge lies in choosing the right chunk size, measured in tokens. Smaller chunks are better for pinpointing the exact URL or part of a page, while larger chunks capture a broader context, which is often necessary for understanding complex content such as PDFs.
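The trade-off is easier to see with a small chunking sketch. It makes two simplifying assumptions: whitespace splitting stands in for a real tokenizer (which would split into sub-word pieces), and the optional `overlap` parameter, a common way to keep context across chunk boundaries, is not something the article describes.

```python
def chunk_tokens(tokens: list[str], chunk_size: int, overlap: int = 0) -> list[list[str]]:
    """Split `tokens` into consecutive chunks of at most `chunk_size` tokens.

    Each chunk repeats the last `overlap` tokens of the previous one.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the text
    return chunks


# Whitespace split is only a stand-in for a model's tokenizer.
text = "small chunks pinpoint a passage while large chunks keep more context"
tokens = text.split()
for chunk in chunk_tokens(tokens, chunk_size=4, overlap=1):
    print(" ".join(chunk))
```

With a small `chunk_size` each chunk maps to a narrow passage, which helps locate it on a page; raising it keeps more surrounding text together at the cost of precision.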
Token Sizing: A Balancing Act
At Rozz, we’ve been testing various sizes for the embedded text chunks, from 200 tokens at the small end to 1,000 tokens at the large end. The chunk size has to be chosen so that we stay within the input limit of our machine learning model. Currently, we work with a model that accepts 4K tokens of input, but we are prepared to move to a 16K-token model when necessary.
In the case of HTML content, smaller chunks are desirable as they provide a more targeted approach to search. They help us pinpoint the exact URL or part of the page relevant to a user’s query.
Conversely, for PDF content, larger chunks are more beneficial. PDFs are typically long-form and contain complex arguments and explanations that require broader context for understanding. Larger chunks capture this context, thereby delivering more accurate and relevant results to users.
As we plan to expand to social media content, like tweets, the token limit should become less of a problem. Because such content is inherently brief, small chunks can be used without sacrificing context.
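The arithmetic behind this constraint can be sketched as follows. The article does not say how chunks are passed to the model, so the example assumes that several retrieved chunks share one input window with the user’s question; the exact window sizes (4,096 and 16,384 tokens) and the 1,000 tokens reserved for the question, instructions and answer are also assumptions.

```python
def max_chunks_in_context(context_window: int, chunk_size: int, reserved: int) -> int:
    """Return how many whole chunks fit once `reserved` tokens are set aside."""
    available = context_window - reserved
    if available <= 0:
        return 0
    return available // chunk_size


RESERVED_TOKENS = 1000  # assumption: room for question, instructions, answer

for window in (4096, 16384):  # assumed sizes for the "4K" and "16K" models
    for chunk_size in (200, 1000):  # the two ends of the tested range
        n = max_chunks_in_context(window, chunk_size, RESERVED_TOKENS)
        print(f"window={window} chunk_size={chunk_size} -> {n} chunks")
```

Under these assumptions a 4,096-token window holds 15 chunks of 200 tokens but only 3 chunks of 1,000 tokens, while a 16,384-token window holds 76 and 15 respectively, which is why larger chunks put more pressure on the input limit.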
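One way to express these per-content-type choices is a small lookup table, sketched below. The specific values are illustrative only: 200 and 1,000 are the two ends of the range mentioned above, and assigning them to HTML, PDF and social content this way is an assumption rather than Rozz’s actual configuration.

```python
import math
from enum import Enum


class ContentType(Enum):
    HTML = "html"
    PDF = "pdf"
    SOCIAL = "social"


# Illustrative chunk sizes in tokens (hypothetical, not Rozz's settings).
CHUNK_SIZE_TOKENS = {
    ContentType.HTML: 200,    # small chunks: pinpoint a part of a page
    ContentType.PDF: 1000,    # large chunks: keep long-form context together
    ContentType.SOCIAL: 200,  # short posts usually fit in a single chunk
}


def chunks_needed(content_type: ContentType, n_tokens: int) -> int:
    """Return how many chunks a document of `n_tokens` tokens is split into."""
    if n_tokens < 0:
        raise ValueError("n_tokens must be non-negative")
    return math.ceil(n_tokens / CHUNK_SIZE_TOKENS[content_type])


print(chunks_needed(ContentType.HTML, 450))
print(chunks_needed(ContentType.PDF, 450))
print(chunks_needed(ContentType.SOCIAL, 60))
```

Keeping the sizes in one table makes it straightforward to tune each content type separately as testing continues.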
Rozz: Enhancing the Search Experience
Rozz’s primary goal is to provide a superior search experience for users. By crawling rendered pages with Selenium and tuning chunk sizes for text embeddings, Rozz aims to return search results that are not only relevant but also contextually accurate.
Balancing chunk size is an ongoing process, as each type of content presents its own challenges. HTML content requires pinpoint precision, PDFs require a broader context for comprehension, and tweets need small chunks that still capture the full meaning.
As we continue to refine our processes and expand our scope to include images and other forms of social media, our commitment to delivering an exceptional search experience remains steadfast. Through intelligent technology and innovative approaches, Rozz is shaping the future of search, one query at a time.