Structural Change of Domain-Specific Web-Page Repository for Efficient Searching

spot8flesh
Jan 8, 2019

A Web crawler crawls Web-pages from the World Wide Web (WWW) and stores them in a repository. Domain-specific Web crawlers crawl only domain-specific Web-pages [1–4]. A domain-specific Web search engine produces its search results from the domain-specific Web-pages that the domain-specific Web crawler has already crawled [5, 6]. The storage structure of the repository is therefore crucial to how quickly the search engine can respond [7, 8]. We have already discussed the generation mechanism of the Relevance Page Tree (RPaT) [9] for single-domain Web-pages and the Relevance Page Graph (RPaG) [10] for multiple-domain Web-pages. Both RPaT and RPaG are generated from typical Ontology-based domain-specific Web-pages. However, those models were slow to retrieve data when searched, especially with large data stores. Against this background, we propose three new models that gradually improve our system: the High-Efficient Relevance Page Tree (HERT), the Index-Based Acyclic Graph (IBAG), and the Multilevel Index-Based Acyclic Graph (M-IBAG).

The present post is organized as follows. Sect. 2 briefly describes our proposed approach and is divided into three main subsections: Sect. 2.1 presents the HERT model with its construction and searching mechanisms, Sect. 2.2 discusses the IBAG model and its attributes, and Sect. 2.3 describes the M-IBAG model. The detailed experimental analyses are given in Sect. 3. Finally, we summarize the important findings and the conclusions drawn from the experiments in the closing section.

Proposed Approach

In our approach, we have constructed a new model, HERT, from RPaT. RPaT is constructed from the original crawl and supports a single Ontology. Because HERT is constructed from RPaT, it also supports a single Ontology. We then extended the concept and constructed another new model called IBAG. IBAG is constructed from RPaG and, like RPaG, supports multiple domains. Furthermore, to achieve better time complexity, we have introduced a third model called M-IBAG. M-IBAG is constructed from IBAG and also supports multiple domains.

HERT Model

This subsection describes the High-Efficient Relevant Page Tree (HERT) model. The name has two parts: “High Efficient” and “Relevant Page”. “High efficient” means fast access, that is, reduced search time. “Relevant page” means a Web-page related to our domain, that is, to the Ontology under consideration.

HERT holds relevant Web-pages in an organized way and is generated from RPaT. A sample HERT is shown in the accompanying figure. The Web-pages in an RPaT relate to a single Ontology, so the HERT generated from that RPaT relates to the same Ontology. Each node of the HERT in the figure contains a Web-page URL and its relevance value. HERT is divided into relevance span levels, and each span has an Index: Index 0 points to the root page, Index 1 points to the first Web-page of the next level, and so on. Constructing HERT requires the “Maximum Relevance Span Value” (αrsv), the “Minimum Relevance Span Value” (βrsv) and the “Number of Relevance Span levels” (nrsl), from which the “Gap Factor” is calculated.
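The post names the inputs to the Gap Factor but does not give its formula. As a minimal sketch, assuming the relevance spans are equal in width (so that the Gap Factor is (αrsv − βrsv) / nrsl) and assuming level 0 holds the most relevant pages, the span level of a page could be derived as follows. The function names gap_factor and span_level are hypothetical.

```python
def gap_factor(alpha_rsv: float, beta_rsv: float, n_rsl: int) -> float:
    """Width of one relevance span.

    ASSUMPTION: spans are equal in width; the post does not state the formula.
    """
    if n_rsl <= 0:
        raise ValueError("n_rsl must be positive")
    if alpha_rsv <= beta_rsv:
        raise ValueError("alpha_rsv must exceed beta_rsv")
    return (alpha_rsv - beta_rsv) / n_rsl


def span_level(relevance: float, alpha_rsv: float, beta_rsv: float, n_rsl: int) -> int:
    """Map a relevance value to a span level in 0..n_rsl-1.

    ASSUMPTION: level 0 is the highest-relevance span. Values outside
    [beta_rsv, alpha_rsv] are clamped to the nearest level.
    """
    gap = gap_factor(alpha_rsv, beta_rsv, n_rsl)
    level = int((alpha_rsv - relevance) // gap)
    return min(max(level, 0), n_rsl - 1)
```

Under these assumptions, with αrsv = 20, βrsv = 0 and nrsl = 4, the Gap Factor is 5, so a page with relevance 18 falls in level 0 and a page with relevance 7 falls in level 2.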

Searching a Web-Page from HERT Model

In the HERT model, we need to traverse fewer Web-pages to find a given Web-page. Searching takes place as follows. We first look up the RANGE_INDEX table, shown in Table 1, which holds the index of each range. For a given search, we find the matching range and then the index that corresponds to that range. We start searching from that index. Note that we use linear searching within HERT.
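The lookup described above can be sketched as a two-step search: pick the range, then scan linearly from its index. This is only an illustration under stated assumptions: pages are held in a flat list in descending relevance order, the RANGE_INDEX table is modelled as (lower bound, start index) pairs, and the Page class, the search_hert function and the example.org URLs are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Page:
    url: str
    relevance: float


def search_hert(
    pages: List[Page],
    range_index: List[Tuple[float, int]],
    url: str,
    relevance: float,
) -> Tuple[Optional[Page], int]:
    """Return (page or None, number of pages visited).

    ASSUMPTION: `pages` is sorted by descending relevance and `range_index`
    lists (lower_bound, start_index) pairs, also in descending order.
    """
    # Step 1: find the range that contains the requested relevance value.
    start = None
    lower = 0.0
    for low, first in range_index:
        if relevance >= low:
            start, lower = first, low
            break
    if start is None:
        return None, 0

    # Step 2: linear search from that range's index, stopping at the range end.
    visited = 0
    for page in pages[start:]:
        if page.relevance < lower:
            break
        visited += 1
        if page.url == url:
            return page, visited
    return None, visited


pages = [
    Page("example.org/a", 19), Page("example.org/b", 16),
    Page("example.org/c", 12), Page("example.org/d", 11),
    Page("example.org/e", 7), Page("example.org/f", 2),
]
range_index = [(15, 0), (10, 2), (5, 4), (0, 5)]
found, visited = search_hert(pages, range_index, "example.org/d", 11)
```

In this sample, the search for example.org/d starts at index 2 and visits two pages rather than scanning from the root, which is the saving the range index is meant to provide.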

Challenges Faced While Constructing HERT

To construct HERT from RPaT, we first define the ranges. Suppose there are four ranges: x1–x2, x2–x3, x3–x4 and above x4. Now consider a situation in which no Web-page belongs to the range x3–x4, while the other ranges contain many Web-pages. In this situation, we cannot assign any parent Web-page identifier, which hampers HERT construction. To resolve this problem, we introduced the Dummy Page concept. Initially, we create a dummy structure of HERT in which every Web-page is a dummy page, and each level contains a dummy page.
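The Dummy Page idea can be illustrated with a small sketch. The post only says that the initial structure holds one dummy page per level; the sketch further assumes that a real page replaces the dummy of its level, and that a level with no real pages keeps its dummy so the level below still has a parent. The names HertPage, build_dummy_structure, place_page and parent_anchor are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HertPage:
    url: Optional[str]
    relevance: Optional[float]
    is_dummy: bool = False


def build_dummy_structure(n_levels: int) -> List[List[HertPage]]:
    """Create the initial skeleton: one dummy page per level."""
    return [[HertPage(None, None, is_dummy=True)] for _ in range(n_levels)]


def place_page(levels: List[List[HertPage]], level: int, page: HertPage) -> None:
    """Put a real page into its level.

    ASSUMPTION: the first real page replaces the level's dummy page;
    later pages are appended after it.
    """
    slot = levels[level]
    if len(slot) == 1 and slot[0].is_dummy:
        slot[0] = page
    else:
        slot.append(page)


def parent_anchor(levels: List[List[HertPage]], level: int) -> Optional[HertPage]:
    """First page of the previous level, real or dummy (None for the root level)."""
    return None if level == 0 else levels[level - 1][0]
```

If pages are placed in levels 0, 1 and 3 only, level 2 keeps its dummy page, and parent_anchor(levels, 3) returns that dummy instead of failing for lack of a parent.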

Searching a Web-Page from IBAG Model

The existing IBAG model supports three Ontologies; hence every level starts with three Ontology Indexes. At each level, “Ontology 1 Index” points to the first Web-page in that level that supports “Ontology 1”. Similarly, “Ontology 2 Index” and “Ontology 3 Index” point to the first Web-pages that support “Ontology 2” and “Ontology 3”, respectively.

Every page in the IBAG model contains three link fields: the first for “Ontology 1”, the second for “Ontology 2” and the third for “Ontology 3”. If a page supports all three Ontologies, we can traverse to the next page of each Ontology through that page. If a page supports only “Ontology 1” and “Ontology 3”, we traverse to the next “Ontology 1” supported page and the next “Ontology 3” supported page through that page. Say we would like to search for a Web-page “m” in the IBAG model, where “m” supports “Ontology 1” and “Ontology 3” and belongs to level 3. The Web-page will then definitely be read at level 3 when starting from “Ontology 1 Index” or from “Ontology 3 Index”. This is the reading mechanism of IBAG.
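The reading mechanism for a single IBAG level can be sketched as one linked chain per Ontology, each entered through its Ontology Index. This is a simplified illustration, not the authors' implementation: the IbagPage class, the build_level and search_level functions, and the sample pages “k”, “l” and “n” are hypothetical, and links between levels are left out.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional, Set, Tuple


@dataclass
class IbagPage:
    url: str
    ontologies: Set[int]
    # One link field per Ontology: the next page in this level supporting it.
    next_link: Dict[int, "IbagPage"] = field(default_factory=dict)


def build_level(
    pages: List[IbagPage], ontology_ids: Iterable[int] = (1, 2, 3)
) -> Dict[int, Optional[IbagPage]]:
    """Wire the link fields of one level and return its Ontology Indexes."""
    index: Dict[int, Optional[IbagPage]] = {o: None for o in ontology_ids}
    last: Dict[int, IbagPage] = {}
    for page in pages:
        for o in page.ontologies:
            if index[o] is None:
                index[o] = page            # first occurrence of this Ontology
            else:
                last[o].next_link[o] = page
            last[o] = page
    return index


def search_level(
    index: Dict[int, Optional[IbagPage]], ontology: int, url: str
) -> Tuple[Optional[IbagPage], int]:
    """Follow one Ontology's chain; return (page or None, pages visited)."""
    page = index.get(ontology)
    visited = 0
    while page is not None:
        visited += 1
        if page.url == url:
            return page, visited
        page = page.next_link.get(ontology)
    return None, visited


level3 = [
    IbagPage("k", {2}), IbagPage("l", {1, 2}),
    IbagPage("m", {1, 3}), IbagPage("n", {3}),
]
index3 = build_level(level3)
```

With this sample level, page “m” is reached from “Ontology 1 Index” after visiting two pages and from “Ontology 3 Index” after visiting one, while the “Ontology 2” chain never reaches it.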