Crawl specific pages and data and make it searchable [closed]


Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
Important note: the questions below aren't meant to break ANY data copyrights. All crawled and saved data is being linked directly to the source.
For a client I'm gathering information for building a search engine/web spider combination. I do have experience with indexing webpages' inner links with a specific depth. I also have experience in scraping data from webpages. However, in this case, the volume is larger than I have experience with so I was hoping to gain some knowledge and insights in the best practice to do so.
First of all, what I need to make clear is that the client is going to deliver a list of websites that are going to be indexed. So, in fact, a vertical search engine. The results only need to have a link, title and description (like the way Google displays results). The main purpose of this search engine is to make it easier for visitors to search large amounts of sites and results to find what they need.
So: Website A contains a bunch of links -> save all links together with meta data.
Secondly, there's a more specific search engine, one that also indexes all the links to (let's call them) articles. These articles are spread over many smaller sites, each with fewer articles than the sites that end up in the vertical search engine. The reason is simple: the articles found on these pages have to be scraped in as much detail as possible. This is where the first problem lies: it would take a huge amount of time to write a scraper for each website. Data that needs to be collected is, for example: city name, article date, article title. So: Website B contains more detailed articles than website A, and we are going to index these articles and scrape useful data.
I do have a method in my mind which might work, but that involves writing a scraper for each individual website, in fact it's the only solution I can think of right now. Since the DOM of each page is completely different I see no option to build a fool-proof algorithm that searches the DOM and 'knows' what part of the page is a location (however... it's a possibility if you can match the text against a full list of cities).
A few things that crossed my mind:
Vertical Search Engine
For the vertical search engine it's pretty straightforward: we have a list of webpages that need to be indexed, so it should be fairly simple to crawl all pages that match a regular expression and store the full list of these URLs in a database.
I might want to split up saving page data (meta description, title, etc.) into a separate process to speed up the indexing.
There is a possibility that there will be duplicate data in this search engine due to websites that have matching results/articles. I haven't made up my mind on how to filter these duplicates. Perhaps on article title, but in the business segment where the data comes from there's a high chance of duplicate titles on different articles.
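One way to approach the duplicate question is PHP's built-in similar_text(), which reports a similarity percentage between two strings. The following is only a minimal sketch: the helper names and the 80% threshold are assumptions for illustration, and because titles alone can collide on different articles, it is probably safer to compare a description or body excerpt rather than the title.

```php
<?php
// Hypothetical helper: normalise text so trivial differences do not matter.
function normalize_text(string $text): string
{
    $text = strtolower(trim($text));
    // Collapse any run of whitespace into a single space.
    return preg_replace('/\s+/', ' ', $text);
}

// Returns true when the two strings are at least $threshold percent similar.
// The default threshold of 80 is an assumption taken from the question.
function is_probable_duplicate(string $a, string $b, float $threshold = 80.0): bool
{
    // similar_text() writes the similarity percentage into $percent.
    similar_text(normalize_text($a), normalize_text($b), $percent);
    return $percent >= $threshold;
}
```

similar_text() gets slow on long strings, so for full article bodies it may be worth comparing only the first few hundred characters, or comparing hashes of normalised text first to catch exact duplicates cheaply.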
Page scraping
Indexing the 'to-be-scraped' pages can be done in a similar way, as long as we know what regex to match the URLs with. We can save the list of URLs in a database.
Use a separate process that runs through all individual pages. Based on the URL, the scraper should know what regex to use to match the needed details on the page and write these to the database.
There are enough sites that index results already, so my guess is there should be a way to create a scraping algorithm that knows how to read the pages without having to match the regex completely. As I said before: if I have a full list of city names, there must be an option to use a search algorithm to get the city name without having to say the city name lies in "#content .about .city".
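The idea of finding a city by matching page text against a known list, instead of hard-coding a selector per site, could be sketched roughly as follows. The function name and the sample city list are hypothetical, and the approach assumes the first matching city in the visible text is the right one, which will not hold for every page.

```php
<?php
// Returns the first known city found in the page's visible text, or null.
// $cities is the client-supplied list; its contents are assumed here.
function find_city(string $html, array $cities): ?string
{
    // Strip tags and decode entities so we only match visible text.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');

    // Try longer names first so "New York" wins over "York".
    usort($cities, fn($a, $b) => strlen($b) <=> strlen($a));

    foreach ($cities as $city) {
        // \b keeps "Ede" from matching inside "Sweden"; /i ignores case.
        $pattern = '/\b' . preg_quote($city, '/') . '\b/iu';
        if (preg_match($pattern, $text) === 1) {
            return $city;
        }
    }
    return null;
}
```

For a long city list, one regex per city gets slow. Splitting the text into words once and looking them up in a hash map of city names would scale better, at the cost of extra handling for multi-word names.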
Data redundancy
An important part of the spider/crawler is to prevent it from indexing duplicate data. What I was hoping to do is to keep track of the time a crawler starts indexing a website and when it ends. I'd also keep track of the 'last update time' of an article (based on the URL to the article) and remove all articles whose last update time is older than the starting time of the crawl, because as far as I can see, these articles no longer exist.
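The "remove everything not seen since the crawl started" rule can be expressed as a small filter. This is a sketch under the assumption that each article URL is stored with the Unix timestamp of the last crawl that saw it; the function name is made up for illustration.

```php
<?php
// $lastSeen maps article URL => Unix timestamp of the last crawl that saw it.
// Anything not seen since $crawlStartedAt was absent from the latest crawl.
function find_stale_urls(array $lastSeen, int $crawlStartedAt): array
{
    $stale = [];
    foreach ($lastSeen as $url => $seenAt) {
        if ($seenAt < $crawlStartedAt) {
            $stale[] = $url;
        }
    }
    return $stale;
}
```

It is probably wise to run this only after a crawl has finished without errors. Otherwise a site that was temporarily unreachable would have all of its articles removed.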
Data redundancy is easier to handle with the page scraper, since my client made a list of "good sources" (read: pages with unique articles). Data redundancy for the vertical search engine is harder, because the sites that are being indexed already make their own selection of articles from "good sources". So there's a chance that multiple sites have a selection from the same sources.
How to make the results searchable
This is a question apart from how to crawl and scrape pages, because once all data is stored in the database, it needs to be searchable at high speed. The amount of data that is going to be saved is still unknown. Based on some of the competition, my client had an indication of about 10,000 smaller records (vertical search) and maybe 4,000 larger records with more detailed information.
I understand that this is still a small amount compared to some databases you've possibly been working on. But in the end there might be up to 10-20 search fields that a user can use to find what they are looking for. With a high traffic volume and a lot of these searches I can imagine that using regular MySQL queries for search isn't a clever idea.
So far I've found SphinxSearch and ElasticSearch. I haven't worked with any of them and haven't really looked into the possibilities of both, only thing I know is that both should perform well with high volume and larger search queries within data.
To sum things up
To sum all things up, here's a shortlist of questions I have:
Is there an easy way to create a search algorithm that is able to match DOM data without having to specify the exact div the content lies within?
What is the best practice for crawling pages (links, title & description)?
Should I split crawling URLs and saving page title/description for speed?
Are there out-of-the-box solutions for PHP to find (possible) duplicate data in a database (even if there are minor differences, like: if 80% matches -> mark as duplicate)?
What is the best way to create a future-proof search engine for the data (keep in mind that the amount of data can increase, as well as the site traffic and search requests)?
I hope I made all things clear and I'm sorry for the huge amount of text. I guess it does show that I have already spent some time trying to figure things out myself.

I have experience building large scale web scrapers and can testify that there will always be big challenges to overcome when undertaking this task. Web scrapers run into problems ranging from CPU issues to storage to network problems and any custom scraper needs to be built modular enough to prevent changes in one part from breaking the application as a whole. In my projects I have taken the following approach:
Figure out where your application can be logically split up
For me this meant building 3 distinct sections:
Web Scraper Manager
Web Scraper
HTML Processor
The work could then be divided up like so:
1) The Web Scraper Manager
The Web Scraper Manager pulls URLs to be scraped and spawns Web Scrapers. The Web Scraper Manager needs to flag all URLs that have been sent to the web scrapers as being "actively scraped" and know not to pull them down again while they are in that state. Upon receiving a message from a scraper, the manager will either delete the row or leave it in the "actively scraped" state if no errors occurred; otherwise it will reset it back to "inactive".
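The claim/report cycle described for the manager might look something like the sketch below. It keeps state in a PHP array purely for illustration; a real manager would keep it in a database table so several processes can share it. The class name, method names and the 'done' state are assumptions, not part of the original design (which deletes finished rows or leaves them flagged).

```php
<?php
// In-memory stand-in for the URL table; a real manager would use a database.
class UrlQueue
{
    private array $state = []; // url => 'inactive' | 'active' | 'done'

    public function add(string $url): void
    {
        // Never re-queue a URL we already know about.
        if (!isset($this->state[$url])) {
            $this->state[$url] = 'inactive';
        }
    }

    // Hand out up to $limit inactive URLs and flag them as actively scraped.
    public function claim(int $limit): array
    {
        $claimed = [];
        foreach ($this->state as $url => $status) {
            if (count($claimed) >= $limit) {
                break;
            }
            if ($status === 'inactive') {
                $this->state[$url] = 'active';
                $claimed[] = $url;
            }
        }
        return $claimed;
    }

    // Called when a scraper reports back: failures become claimable again.
    public function report(string $url, bool $success): void
    {
        $this->state[$url] = $success ? 'done' : 'inactive';
    }

    public function status(string $url): ?string
    {
        return $this->state[$url] ?? null;
    }
}
```

With a database-backed version, the claim step would need to be atomic (for example a single UPDATE that flags rows, or a transaction) so two managers cannot hand out the same URL.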
2) The Web Scraper
The Web Scraper receives a URL to scrape and goes about CURLing it and downloading the HTML. All of this HTML can then be stored in a relational database with the following structure:
ID | URL | HTML (BLOB) | PROCESSING
Processing is an integer flag which indicates whether or not the data is currently being processed. This lets other parsers know not to pull the data if it is already being looked at.
3) The HTML Processor
The HTML Processor will continually read from the HTML table, marking rows as active every time it pulls a new entry. The HTML processor has the freedom to operate on the HTML for as long as needed to parse out any data. This can be links to other pages in the site which could be placed back in the URL table to start the process again, any relevant data (meta tags, etc.), images etc.
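As a rough illustration of what the HTML Processor does with one stored row, the sketch below pulls the title, meta description and link targets out of a raw HTML string using PHP's DOM extension. The function name and the shape of the returned array are assumptions made for this example.

```php
<?php
// Pulls the title, meta description and link targets out of raw HTML.
function extract_page_data(string $html): array
{
    $data = ['title' => '', 'description' => '', 'links' => []];
    if (trim($html) === '') {
        return $data; // loadHTML() rejects empty input
    }

    // Real-world HTML is rarely valid, so collect libxml warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $titleNode = $doc->getElementsByTagName('title')->item(0);
    if ($titleNode !== null) {
        $data['title'] = trim($titleNode->textContent);
    }

    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if (strtolower($meta->getAttribute('name')) === 'description') {
            $data['description'] = trim($meta->getAttribute('content'));
            break;
        }
    }

    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = trim($a->getAttribute('href'));
        if ($href !== '') {
            $data['links'][] = $href;
        }
    }
    // Drop repeated links but keep a clean 0..n index.
    $data['links'] = array_values(array_unique($data['links']));

    return $data;
}
```

The links returned here are still relative or absolute as written in the page. Before putting them back into the URL table they would need to be resolved against the page's own URL and filtered to the sites being indexed.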
Once all relevant data has been parsed out the HTML processor would send all this data into an ElasticSearch cluster. ElasticSearch provides lightning-fast full text searches which could be made even faster by splitting the data into various keys:
{
  "url" : "http://example.com",
  "meta" : {
    "title" : "The meta title from the page",
    "description" : "The meta description from the page",
    "keywords" : "the,keywords,for,this,page"
  },
  "body" : "The body content in its entirety",
  "images" : [
    "image1.png",
    "image2.png"
  ]
}
Now your website/service can have access to the latest data in real time. The parser would need to be verbose enough to handle any errors so it can set the processing flag to false if it cannot pull data out, or at least log it somewhere so it can be reviewed.
What are the advantages?
The advantage of this approach is that at any time, if you want to change the way you are pulling data, processing data or storing data, you can change just that piece without having to re-architect the entire application. Further, if one part of the scraper/application breaks, the rest can continue to run without any data loss and without stopping other processes.
What are the disadvantages?
It's a big complex system. Any time you have a big complex system you are asking for big complex bugs. Unfortunately web scraping and data processing are complex undertakings, and in my experience there is no way around having a complex solution to this particularly complex problem.

The crawling and indexing actions can take a while, but you won't be crawling the same site every 2 minutes, so you can consider an algorithm in which you put more effort in crawling and indexing your data, and another algorithm to help you get a faster search.
You can keep crawling your data all the time and update the rest of the tables in the background (every X minutes/hours), so your search results will be fresh all the time but you won't have to wait for the crawl to end.
Crawling
Just get all the data you can (probably all the HTML code) and store it in a simple table. You'll need this data for the indexing analysis. This table might be big but you don't need good performance while working with it because it's going to be part of a background use and it's not going to be exposed for user's searches.
ALL_DATA
____________________________________________
| Url | Title | Description | HTML_Content |
‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾
Tables and Indexing
Create a big table that contains URLs and keywords
KEYWORDS
_________________
| URL | Keyword |
‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾
This table will contain most of the words in each URL's content (I would remove words like "the", "on", "with", "a", etc.).
Create a table with keywords. For each occurrence add 1 to the occurrences column
KEYWORDS
_______________________________
| URL | Keyword | Occurrences |
‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾
Create another table with "hot" keywords which will be much smaller
HOT_KEYWORDS
_________________
| URL | Keyword |
‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾
This table content will be loaded later according to search queries.
The most common search words will be stored in the HOT_KEYWORDS table.
Another table will hold cached search results
CACHED_RESULTS
_________________
| Keyword | Url |
‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾
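Filling the KEYWORDS table with its Occurrences column comes down to splitting the page text into words, dropping the common ones, and counting the rest. A minimal sketch, assuming the text has already been stripped of HTML; the function name and the default stop-word list (taken from the examples above) are illustrative only.

```php
<?php
// Counts keyword occurrences in plain text, skipping common stop words.
function keyword_counts(string $text, array $stopWords = ['the', 'on', 'with', 'a']): array
{
    // Split on anything that is not a letter or digit.
    $words = preg_split('/[^\p{L}\p{N}]+/u', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    $counts = [];
    foreach ($words as $word) {
        if (in_array($word, $stopWords, true)) {
            continue;
        }
        $counts[$word] = ($counts[$word] ?? 0) + 1;
    }
    arsort($counts); // most frequent first
    return $counts;
}
```

Each keyword => count pair would then become one row of (URL, Keyword, Occurrences) for the page being indexed.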
Searching algorithm
First, you'll search the cached result table. In case you have enough results, select them. If you don't, search the bigger KEYWORDS table. Your data is not that big so searching according to the keyword index won't take too long. If you find more relevant results add them to the cache for later usage.
Note: You have to select an algorithm in order to keep your CACHED_RESULTS table small (maybe to save the last use of the record and remove the oldest record if the cache is full).
This way the cache table will help you reduce the load on the keywords tables and give you ultra fast results for the common searches.
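The cache-first lookup, including the "remove the oldest record if the cache is full" rule from the note, could be sketched like this. Plain arrays stand in for the KEYWORDS and CACHED_RESULTS tables, and the class and method names are hypothetical; the eviction policy shown is least-recently-used, which is one possible reading of the note.

```php
<?php
// Cache-first keyword lookup with least-recently-used eviction.
class CachedSearch
{
    private array $index;      // keyword => list of URLs (the big KEYWORDS table)
    private array $cache = []; // keyword => list of URLs (CACHED_RESULTS)
    private int $capacity;

    public function __construct(array $index, int $capacity)
    {
        $this->index = $index;
        $this->capacity = max(1, $capacity);
    }

    public function search(string $keyword): array
    {
        $keyword = strtolower($keyword);

        if (isset($this->cache[$keyword])) {
            // Cache hit: move the entry to the end to mark it as recently used.
            $urls = $this->cache[$keyword];
            unset($this->cache[$keyword]);
            $this->cache[$keyword] = $urls;
            return $urls;
        }

        // Cache miss: fall back to the full keyword index.
        $urls = $this->index[$keyword] ?? [];
        if ($urls !== []) {
            if (count($this->cache) >= $this->capacity) {
                // Evict the least recently used entry (first key in the array).
                unset($this->cache[array_key_first($this->cache)]);
            }
            $this->cache[$keyword] = $urls;
        }
        return $urls;
    }

    // Exposed so the cache contents can be inspected.
    public function cachedKeywords(): array
    {
        return array_keys($this->cache);
    }
}
```

This relies on PHP arrays preserving insertion order: the first key is always the entry that has gone longest without being used.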

Take a look at Solr and the Solr wiki. It's an open source search platform from the Lucene project (similar to Elasticsearch).
For a web crawler, you can use either Aperture or Nutch. Both are written in Java. Aperture is a lightweight crawler, but with Nutch we can crawl 1,000 or even more websites.
Nutch will handle the process of crawling the websites. Moreover, Nutch provides Solr support. This means that you can index the data crawled by Nutch directly into Solr.
Using SolrCloud we can set up multiple clusters with shards and replication to prevent data loss and speed up data retrieval.
Implementing your own web crawler is not that easy, and for search, retrieving the data from a regular RDBMS at run time is much more complicated.

I've had my experiences with crawling websites, and it is a really complicated topic.
Whenever I've got a problem within this area, I look at what the best people at this do (yup, Google).
They have a lot of nice presentations about what they are doing and they even release some of their tools.
phpQuery, for example, is a great tool when it comes to searching for specific data on a website. I'd recommend having a look at it if you don't know it yet.
A little trick I've done in a similar project was to have two tables for the data.
The data had to be as up to date as possible, so the crawler was running most of the time and there were problems with locked tables. So whenever the crawler wrote into one table, the other one was free to the search engine and vice versa.

I have built a web crawler for detecting news sites, and it's performing very well.
It basically downloads the whole page, saves it, and prepares it for a second scraping pass that looks for keywords. It then tries to determine whether the site is relevant using those keywords. Dead simple.
You can find the sourcecode for it here. Please help contribute :-)
It's a focused crawler which doesn't really do anything other than look for sites and rank them according to the presence of keywords. It's not usable for huge data loads, but it's quite good at finding relevant sites.
https://github.com/herreovertidogrom/crawler.git
It's a bit poorly documented - but I will get around to that.
If you want to do searches of the crawled data, and you have a lot of data, and aspire to build a future-proof service, you should NOT create a table with N columns, one for each search term. This is a common design if you think the URL is the primary key. Rather, you should avoid a wide-table design like the plague, because disk reads get incredibly slow on wide-table designs. You should instead store all data in one table, specify the key and the value, and then partition the table on variable name.
Avoiding duplicates is always hard. In my experience from data warehousing: design the primary key and let the DB do the job. Using source + key + value as the primary key avoids double counting and has few restrictions.
May I suggest you create a table like this:
URL, variable, value, and make all three together the primary key.
Then write all data into that table, partition on distinct variable and implement search on this table only.
It avoids duplicates, it's fast and it's easily compressible.

Did you try http://simplehtmldom.sourceforge.net/manual.htm? I found it useful for scraping pages, and it might help with searching the contents.
Use an asynchronous approach to crawl and store the data, so that you can run multiple crawl and store jobs in parallel.
ElasticSearch will be useful to search the stored data.

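You can search the HTML using this code:
<?php
// Get the HTML (file_get_contents returns the page as a string)
$page = file_get_contents('http://www.google.com');
// Parse the HTML; real-world markup is rarely valid, so collect
// libxml warnings instead of printing them
libxml_use_internal_errors(true);
$html = new DOMDocument();
$html->loadHTML($page);
libxml_clear_errors();
// Get the elements you are interested in...
$divArr = $html->getElementsByTagName('div');
foreach ($divArr as $div) {
    echo $div->nodeValue;
}
?>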

Related

best way to set up a MySQL database for storing web data

I will be using curl to retrieve thousands of adult websites. My goal is to store them in MySQL to allow users to easily search the new database and find their desired page without having to endure all the popups, spyware, etc.
It will be an adult website search engine... kinda the google of adult websites, but without the malware sites that find their way onto google from time to time.
At first run I downloaded about 700K lines at about 20 GB of data. Initially I stored all info in a single table with columns for URL, HTML PAGE CODE, PAGE WITH NO HTML TAGS, KEY WORDS, TITLE and a couple more.
I use a MATCH AGAINST query to search for the user's desired page within TITLE, KEY WORDS, PAGE WITH NO HTML in any variety of combinations or singularly.
My question is... would I be better off to break all these columns into separate tables and would this improve the speed of the searches enough that it would matter?
Is there any advantage to storing all the data in multiple tables and then using JOIN's to pull the data out?
I am just wondering if I need to be proactive and thinking about high user search loads.

MySQL isn't good with full-text search and never was.
Look into Sphinx or Lucene/Solr, they're the best fit for the job. I'd suggest sticking to the former.

User content site wide search - PHP/ MySQL

The user content website I am creating has lots of sub-sections: Movies, Jobs, People, Photos, Mail, etc. It's like a Yahoo portal but very, very detailed with information search; I am niching as deep as possible per topic, unlike any site out there. I have the site being developed in CodeIgniter, PHP and MySQL. Search can be global across all sub-sites and per sub-section, as we see on Google and Yahoo. There are 22 possible user content objects on my system, each with about 12-15 search fields which I call object meta data. I am also storing historical data (like user content version control), which I want to include in the search.
Now the question is for per sub-section search it seems reasonable because the scope is limited so I think I can pull it off well with mysql. I don't foresee any performance issue. But for site wide search it will search not just title names, but keywords, tags, description, including people's mail, comments, historical data etc. So my worry is performance. Since this is a startup, I have limited hardware resources, so I have to depend 100% on the database and code to pull it off.
So what are the best practices for implementing such a search from the code and database point of view and should a mixture of databases be used depending on the sub-site? Currently everything is stored in 1 mysql database. But I see issues where it may work fine for people search, movies search, etc but not if i include mail search, geo locations, historical data search and even having to go searching items like photo tags, photo descriptions, etc -> all part of the global search there can be performance issues due to the high number of joins and number of rows.

I don't know about PHP, but for my ruby-on-rails projects, I always use Sphinx search engine to do such things. It is a standalone search engine that indexes your database, and when the user submits a search query, the query is matched against Sphinx's index database instead of the actual db. It is blazingly fast, and offers great control over how to index/search.
Sphinx Search Engine
PHP: Sphinx Extension (not sure if this is relevant)

For a generalized site wide search on a budget you could constrain one of the major search APIs to just your domain and handle and display results as if they had come from your own search.

I don't have a solution exactly, but am running into a similar problem with my in-development website.
I'm beginning to think a solution might lie in determining where the bulk of your searches lie, and limiting searches to those queries. If the user search requires a bit more in-depth results (such as your mail search, geo locations, historical data) then you can send the user to a second mysql query. Get the majority of your users to search using your simpler, low-performance queries, and the rest can use more resources if necessary.
As an example, the majority of my site's users will be searching the news, calendar, and media sections, so my search looks there first. But visitors could also be searching for other users, groups, forum posts, tags/categories, and so on. I'm going to let a second, more complicated script handle that.

Realtime MySQL search results on an advanced search page

I'm a hobbyist, and started learning PHP last September solely to build a hobby website that I had always wished and dreamed another more competent person might make.
I enjoy programming, but I have little free time and enjoy a wide range of other interests and activities.
I feel learning PHP alone can probably allow me to create 98% of the desired features for my site, but that last 2% is awfully appealing:
The most powerful tool of the site is an advanced search page that picks through a 1000+ record game scenario database. Users can data-mine to tremendous depths - this advanced page has upwards of 50 different potential variables. It's designed to allow the hardcore user to search on almost any possible combination of data in our database and it works well. Those who aren't interested in wading through the sea of options may use the Basic Search, which is comprised of the most popular parts of the Advanced search.
Because the advanced search is so comprehensive, and because the database is rather small (less than 1,200 potential hits maximum), with each variable you choose to include the likelihood of getting any qualifying results at all drops dramatically.
In my fantasy land where I can wield AJAX as if it were Excalibur, my users would have a realtime Total Results counter in the corner of their screen as they used this page, which would automatically update its query structure and report how many results will be displayed with the addition of each variable. In this way it would be effortless to know just how many variables are enough, and when you've gone and added one that zeroes out the results set.
A somewhat similar implementation, at least visually, would be the Subtotal sidebar when building a new custom computer on IBuyPower.com
For those of you actually still reading this, my question is really rather simple:
Given the time & ability constraints outlined above, would I be able to learn just enough AJAX (or whatever) needed to pull this one feature off without too much trouble? would I be able to more or less drop-in a pre-written code snippet and tweak to fit? or should I consider opening my code up to a trusted & capable individual in the future for this implementation? (assuming I can find one...)
Thank you.

This is a great project for a beginner to tackle.
First I'd say look into using a library like jQuery (jquery.com). It will simplify the JavaScript part of this and the manual is very good.
What you're looking to do can be broken down into a few steps:
The user changes a field on the advanced search page.
The user's browser collects all the field values and sends them back to the server.
The server performs a search with the values and returns the number of results.
The user's browser receives the number of results and updates the display.
Now for implementation details:
This can be accomplished with javascript events such as onchange and onfocus.
You could collect the field values into a javascript object, serialize the object to json and send it using ajax to a php page on your server.
The server page (in php) will read the json object and use the data to search, then send back the result as markup or text.
You can then display the result directly in the browser.
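The server-side step could be sketched in PHP as below. The function builds a parameterised COUNT query from whatever filters the browser sent, so the endpoint only has to run it and return the number. The function name, the table name `scenarios` and the column names in the usage comment are assumptions for illustration, and the sketch only handles simple equality filters.

```php
<?php
// Builds a parameterised COUNT query from the filters the browser sent.
// $allowed whitelists column names so user input never reaches the SQL text.
function build_count_query(array $filters, array $allowed, string $table = 'scenarios'): array
{
    $where = [];
    $params = [];
    foreach ($filters as $field => $value) {
        // Ignore unknown fields and fields the user left empty.
        if (!in_array($field, $allowed, true) || $value === '' || $value === null) {
            continue;
        }
        $where[] = "$field = ?";
        $params[] = $value;
    }
    $sql = "SELECT COUNT(*) FROM $table";
    if ($where) {
        $sql .= ' WHERE ' . implode(' AND ', $where);
    }
    return [$sql, $params];
}

// Endpoint usage, assuming a PDO connection in $pdo and a JSON request body:
// $filters = json_decode(file_get_contents('php://input'), true) ?: [];
// [$sql, $params] = build_count_query($filters, ['year', 'map', 'players']);
// $stmt = $pdo->prepare($sql);
// $stmt->execute($params);
// echo json_encode(['total' => (int) $stmt->fetchColumn()]);
```

Returning only a count keeps each request cheap, which matters because the browser will fire one request per field change.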
This may seem like a lot to take in but you can break each step down further and learn about the details bit by bit.

Hard to answer your question without knowing your level of expertise, but check out this short description of AJAX: http://blog.coderlab.us/rasmus-30-second-ajax-tutorial
If this makes some sense then your feature may be within reach "without too much trouble". If it seems impenetrable, then probably not.

Integrating search on a website where the backend is MYSQL

I have a location search website for a city, we started out with collecting data for all possible categories in the city like Schools, Colleges, Departmental Stores etc and stored their information in a separate table, as each entry had different details apart from their name, address and phone number.
We had to integrate search in the website to enable people to find information, so we built an index table in which we stored the categories, the related keywords for each category, and the table which must be fetched if that category was searched for. Later on we added the functionality of searching on the name and address as well, by adding another master table containing those fields from all the tables in one place. Now my doubts are the following:
The application design is improper, and we have written queries like select * from master where name like "%$input%" all over. Since our database is MySQL with PHP on the server side, is there any suggestion for improving the design of the system?
People want more features like splitting the keywords and ranking them according to relevance, etc. Is there any ready framework available which runs search on a database?
I tried using Full Text Search in MYSQL and it seems effective to me, is that enough?
Correct me if i am wrong, i had a look into Lucene and Google Custom Search, don't they work on making an index by crawling existing webpages and building their own index? I have a collection of tables on a mysql database on which i have to apply searching. What options do i have?

To address your points:
Using %input% is very bad. That will cause a full table scan every query. Under any amount of load or on even a remotely large dataset your DB server will choke.
An RDBMS alone is not a good solution for this. You are looking in the right place by seeking a separate solution for search. Something which can communicate well with your RDBMS is good; something that runs inside an RDBMS won't do what you need.
Full Text Search in MySQL is workable for very basic keyword searches, nothing more. The scope of usefulness is extremely limited - you need a highly predictable usage model to leverage the built-in searching. It is called "search" but it's not really search the way most people think of it. Compared to the quality of search results we have come to expect from Google and Bing, it does not compare. In that sense of the word "search", it is something else - like Notepad vs Word. They both are things to type in, but that's about it.
As far as separate systems for handling search, Lucene is very good. Lucene works however you want it to work, essentially. You can interact with it programmatically to insert indexable documents. Likewise, a Google Appliance (not Google Custom Search) can be given direct meta feeds which expose whatever you want to be indexed, such as data directly from a database.
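For the basic keyword case that MySQL's built-in full-text search can handle, a first step away from LIKE "%$input%" might look like the sketch below. It splits the user's input into words and builds a boolean-mode search string to pass as a bound parameter. The function name is hypothetical, and the usage comment assumes a FULLTEXT index on master(name, address), which the question does not confirm.

```php
<?php
// Turns free-text input into a MySQL boolean-mode search string in which
// every word is required (+) and matched as a prefix (*).
function to_boolean_query(string $input): string
{
    // Keep letters and digits only; this also drops boolean-mode operators.
    $words = preg_split('/[^\p{L}\p{N}]+/u', $input, -1, PREG_SPLIT_NO_EMPTY);
    $terms = array_map(fn($word) => '+' . $word . '*', $words);
    return implode(' ', $terms);
}

// Usage with PDO, assuming a FULLTEXT index on master(name, address):
// $sql = 'SELECT *, MATCH(name, address) AGAINST (? IN BOOLEAN MODE) AS score
//         FROM master
//         WHERE MATCH(name, address) AGAINST (? IN BOOLEAN MODE)
//         ORDER BY score DESC';
// $q = to_boolean_query($input);
// $stmt = $pdo->prepare($sql);
// $stmt->execute([$q, $q]);
```

Very short words and stopwords are typically not indexed by MySQL, so requiring every word may affect results for such input. This is one of the limits that makes a dedicated engine such as Lucene or Sphinx the better fit once requirements grow.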

Take a look at sphinx: http://www.sphinxsearch.com/
Per their site:
How do you implement full-text search for that 10+ million row table, keep up with the load, and stay relevant? Sphinx is good at those kinds of riddles.
It's quite popular with a lot of people in the rails community right now, and they all rave about how awesome it is :)

Catalog entries: Updated html files or on-the-fly from database?

I've got a database site that will serve approximately 1,200 primary entries at launch, with the prospect of adding ~100 new entries per year. Each entry would be composed of ~20 basic values from the database, as well as a calculated average rating, and a variable amount of user comments.
The rating and comments output would have to be generated upon page request, but the rest of the data would be static unless an error correction had been made by an admin.
Considering bandwidth, database load, and download times, would it be better to just generate the entire page on the fly with a few queries after a GET, or would it be better to have HTML files and append the ratings & comments, then write a refresher script that would update all the HTML records when run?
In my mind the pros & cons are:
On-the-fly
+ saves hosting space by building pages only when needed
- leaves nothing for a search engine to find?
- slightly slower due to extra queries
HTML & appends
+ search engine friendly
+ slightly faster due to fewer queries
- uses disk space
- slightly more complex code requirements
Neutral
= templating would be the same either way
Thoughts?

Do whatever's easiest to code: you'll find that in practice all your pros and cons are actually the same for both options.
Personally, given that you will have to go to the database anyway to deliver the comments, I'd go for a completely on-the-fly generated page.
Creating a web page dynamically from 1,200 database records (in fact, frankly, 1,200,000 database records) is well within the capabilities of MySQL and PHP on even a moderately specified shared host. There are plenty of examples of sites that use this combination with millions of records, so you won't find performance to be an issue for a long time!
And as it happens you'll probably not save hosting space as the database records take up space on the host in the same way that static data does.
A search engine will replicate what a user's browser does. It issues a good ol' HTTP GET request to the root of your site, then analyses each of the links and requests each of them until the spider has got every page that it can. So to make sure that a database-driven site is indexed by a search engine, provide plain HTML links (<a href="...">Link text</a>) in your page to each record.
Something as simple as an A-Z list of entries with a grid underneath would do. For an example, a site I'm working on at the moment, http://arkive.org, does just that.
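An index page of plain links can be generated straight from the database rows. A minimal sketch, assuming the entries arrive as an id => title array; the function name and the entry.php?id= URL scheme are made up for illustration.

```php
<?php
// Renders an A-Z index of plain links so crawlers can reach every record.
// $entries maps record id => title; the entry.php?id= URL scheme is assumed.
function render_index(array $entries): string
{
    // Case-insensitive alphabetical sort that keeps ids attached to titles.
    asort($entries, SORT_FLAG_CASE | SORT_STRING);

    $html = "<ul>\n";
    foreach ($entries as $id => $title) {
        $url = 'entry.php?id=' . rawurlencode((string) $id);
        // Escape both the URL and the title before putting them into markup.
        $html .= '  <li><a href="' . htmlspecialchars($url) . '">'
               . htmlspecialchars($title) . "</a></li>\n";
    }
    return $html . "</ul>\n";
}
```

With roughly 1,200 entries a single page of links is manageable. A much larger catalog would probably want one page per letter.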
