Building Search with Special Searches/Tools - php

I'm building a website that is basically a "Discover Your City" kind of website (eg, business directory, weather, news...). The main function of my website is just to be a business directory. I've already built a business listing search that populates from a couple mysql tables (there is only about 10,000 businesses in my town, meaning only 10,000 listings, so I don't really have issues with speed and needing to move to Fulltext or Sphinx).
Now, I want to build special searches, meaning a user, instead of typing in "Restaurants" or "Joe's Crab Shack" instead types in "News" or "Weather" and is presented with the current weather in town, or local news. Both news and weather are just pulled from external websites. Google has a good example of this here.
How can I build this? A switch statement is one way, I think, but I'm not sure it would end up being a great solution.
Thanks for all help!

Make a 'keywords' table to store these special words in. Do a lookup on that table first, and if you get zero results, run the actual search. As long as you always do exact lookups on an index (WHERE term='$search') instead of using a LIKE, the query should be near-instant, so there's not much of a performance hit in doing two queries.
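A minimal sketch of that lookup-first flow. The table and column names (`keywords`, `term`, `handler`) are assumptions for illustration, and an in-memory SQLite database stands in for MySQL:

```php
<?php
// Assumed schema: a small keywords table mapping special search terms
// to handler names. SQLite in-memory stands in for MySQL here.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec("CREATE TABLE keywords (term TEXT PRIMARY KEY, handler TEXT)");
$db->exec("INSERT INTO keywords VALUES ('weather', 'show_weather'),
                                       ('news',    'show_news')");

// Returns the name of a special handler, or null to fall through
// to the normal business-listing search.
function findSpecialHandler(PDO $db, string $search): ?string
{
    // Exact match on the indexed term column -- no LIKE, so the
    // lookup can use the primary-key index.
    $stmt = $db->prepare("SELECT handler FROM keywords WHERE term = ?");
    $stmt->execute([strtolower(trim($search))]);
    $row = $stmt->fetch(PDO::FETCH_ASSOC);
    return $row ? $row['handler'] : null;
}
```

If `findSpecialHandler()` returns null, you run the regular listing search; otherwise you dispatch to the matching special page (weather, news, etc.).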

Related

Crawl specific pages and data and make it searchable [closed]

Important note: the questions below aren't meant to violate ANY data copyrights. All crawled and saved data links directly to its source.
For a client I'm gathering information for building a search engine/web spider combination. I do have experience with indexing webpages' inner links to a specific depth, and I also have experience scraping data from webpages. However, in this case the volume is larger than I have experience with, so I was hoping to gain some knowledge and insight into best practices.
First of all, what I need to make clear is that the client is going to deliver a list of websites that are going to be indexed. So, in fact, a vertical search engine. The results only need to have a link, title and description (like the way Google displays results). The main purpose of this search engine is to make it easier for visitors to search large amounts of sites and results to find what they need.
So: Website A contains a bunch of links -> save all links together with their metadata.
Secondly, there's a more specific search engine, one that also indexes all the links to (let's call them) articles. These articles are spread over many smaller sites, each with a smaller number of articles compared to the sites that end up in the vertical search engine. The reason is simple: the articles found on these pages have to be scraped in as much detail as possible. This is where the first problem lies: it would take a huge amount of time to write a scraper for each website. Data that needs to be collected is, for example: city name, article date, article title. So: Website B contains more detailed articles than Website A; we are going to index these articles and scrape the useful data.
I do have a method in my mind which might work, but that involves writing a scraper for each individual website, in fact it's the only solution I can think of right now. Since the DOM of each page is completely different I see no option to build a fool-proof algorithm that searches the DOM and 'knows' what part of the page is a location (however... it's a possibility if you can match the text against a full list of cities).
A few things that crossed my mind:
Vertical Search Engine
For the vertical search engine it's pretty straight forward, we have a list of webpages that need to be indexed, it should be fairly simple to crawl all pages that match a regular expression and store the full list of these URLs in a database.
I might want to split saving page data (meta description, title, etc.) into a separate process to speed up the indexing.
There is a possibility of duplicate data in this search engine, due to websites that have matching results/articles. I haven't made up my mind on how to filter these duplicates, perhaps on article title, but in the business segment the data comes from there's a good chance of duplicate titles for different articles.
Page scraping
Indexing the 'to-be-scraped' pages can be done in a similar way, as long as we know which regex to match the URLs with. We can save the list of URLs in a database.
Use a separate process that runs over all individual pages; based on the URL, the scraper should know which regex to use to match the needed details on the page and write these to the database.
There are enough sites that index results already, so my guess is there should be a way to create a scraping algorithm that knows how to read the pages without having to match the regex completely. As I said before: if I have a full list of city names, there must be an option to use a search algorithm to get the city name without having to say the city name lies in "#content .about .city".
Data redundancy
An important part of the spider/crawler is to prevent it from indexing duplicate data. What I was hoping to do is keep track of the time a crawler starts indexing a website and when it ends; then I'd also keep track of the 'last update time' of an article (based on the URL to the article) and remove all articles that are older than the starting time of the crawl, because as far as I can see, those articles no longer exist.
The data redundancy is easier with the page scraper, since my client made a list of "good sources" (read: pages with unique articles). Data redundancy for the vertical search engine is harder, because the sites being indexed already make their own selection of articles from "good sources". So there's a chance that multiple sites have a selection from the same sources.
How to make the results searchable
This is a question apart from how to crawl and scrape pages, because once all data is stored in the database, it needs to be searchable at high speed. The amount of data that is going to be saved is still unknown; compared to some competitors, my client had an indication of about 10,000 smaller records (vertical search) and maybe 4,000 larger records with more detailed information.
I understand that this is still a small amount compared to some databases you've possibly been working on. But in the end there might be up to 10-20 search fields that a user can use to find what they are looking for. With a high traffic volume and a lot of these searches, I can imagine that using regular MySQL queries for search isn't a clever idea.
So far I've found SphinxSearch and ElasticSearch. I haven't worked with any of them and haven't really looked into the possibilities of both, only thing I know is that both should perform well with high volume and larger search queries within data.
To sum things up
To sum all things up, here's a shortlist of questions I have:
Is there an easy way to create a search algorithm that is able to match DOM data without having to specify the exact div the content lies within?
What is the best practice for crawling pages (links, title & description)?
Should I split crawling URLs and saving page titles/descriptions into separate processes for speed?
Are there out-of-the-box solutions for PHP to find (possible) duplicate data in a database (even if there are minor differences, like: if 80% matches -> mark as duplicate)?
What is the best way to create a future-proof search engine for the data (keeping in mind that the amount of data can increase, as well as the site traffic and search requests)?
I hope I made everything clear, and I'm sorry for the huge amount of text. I guess it does show that I've already spent some time trying to figure things out myself.
I have experience building large scale web scrapers and can testify that there will always be big challenges to overcome when undertaking this task. Web scrapers run into problems ranging from CPU issues to storage to network problems and any custom scraper needs to be built modular enough to prevent changes in one part from breaking the application as a whole. In my projects I have taken the following approach:
Figure out where your application can be logically split up
For me this meant building 3 distinct sections:
Web Scraper Manager
Web Scraper
HTML Processor
The work could then be divided up like so:
1) The Web Scraper Manager
The Web Scraper Manager pulls URLs to be scraped and spawns Web Scrapers. The manager needs to flag all URLs that have been sent to the web scrapers as being "actively scraped" and know not to pull them again while they are in that state. Upon receiving a message from a scraper, the manager will either delete the row or leave it in the "actively scraped" state if no errors occurred; otherwise it will reset it back to "inactive".
2) The Web Scraper
The Web Scraper receives a URL to scrape and goes about cURLing it and downloading the HTML. All of this HTML can then be stored in a relational database with the following structure:
ID | URL | HTML (BLOB) | PROCESSING
Processing is an integer flag which indicates whether or not the data is currently being processed. This lets other parsers know not to pull the data if it is already being looked at.
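One way to make that flag safe when several processors run in parallel is to re-check it in the UPDATE itself. This is a sketch under assumptions (a `pages` table following the structure above; SQLite in-memory stands in for the real database):

```php
<?php
// Assumed schema mirroring the ID | URL | HTML | PROCESSING table above.
// SQLite in-memory is used here just for illustration.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec("CREATE TABLE pages (id INTEGER PRIMARY KEY, url TEXT,
                               html BLOB, processing INTEGER DEFAULT 0)");
$db->exec("INSERT INTO pages (url, html)
           VALUES ('http://example.com', '<html>...</html>')");

// Try to claim one unprocessed row; return its id, or null if none left
// (or if another worker claimed it first).
function claimRow(PDO $db): ?int
{
    $id = $db->query("SELECT id FROM pages WHERE processing = 0 LIMIT 1")
             ->fetchColumn();
    if ($id === false) {
        return null; // nothing left to process
    }
    // The WHERE clause re-checks the flag: if another worker got the row
    // first, rowCount() is 0 and we report no claim.
    $stmt = $db->prepare("UPDATE pages SET processing = 1
                          WHERE id = ? AND processing = 0");
    $stmt->execute([$id]);
    return $stmt->rowCount() === 1 ? (int)$id : null;
}
```

The same claim-then-verify pattern works for the URL table in the Web Scraper Manager.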
3) The HTML Processor
The HTML Processor will continually read from the HTML table, marking rows as active every time it pulls a new entry. The HTML processor has the freedom to operate on the HTML for as long as needed to parse out any data. This can be links to other pages in the site which could be placed back in the URL table to start the process again, any relevant data (meta tags, etc.), images etc.
Once all relevant data has been parsed out the HTML processor would send all this data into an ElasticSearch cluster. ElasticSearch provides lightning-fast full text searches which could be made even faster by splitting the data into various keys:
{
    "url" : "http://example.com",
    "meta" : {
        "title" : "The meta title from the page",
        "description" : "The meta description from the page",
        "keywords" : "the,keywords,for,this,page"
    },
    "body" : "The body content in its entirety",
    "images" : [
        "image1.png",
        "image2.png"
    ]
}
Now your website/service can have access to the latest data in real time. The parser would need enough error handling to set the processing flag back to false if it cannot pull data out, or at least log the failure somewhere so it can be reviewed.
What are the advantages?
The advantage of this approach is that at any time, if you want to change the way you are pulling, processing or storing data, you can change just that piece without having to re-architect the entire application. Further, if one part of the scraper/application breaks, the rest can continue to run without any data loss and without stopping the other processes.
What are the disadvantages?
It's a big, complex system, and any time you have a big, complex system you are asking for big, complex bugs. Unfortunately, web scraping and data processing are complex undertakings, and in my experience there is no way around having a complex solution to this particularly complex problem.
The crawling and indexing actions can take a while, but you won't be crawling the same site every 2 minutes, so you can consider an approach in which you put more effort into crawling and indexing your data, and a separate algorithm that gives you faster search.
You can keep crawling your data all the time and update the rest of the tables in the background (every X minutes/hours), so your search results will be fresh all the time but you won't have to wait for the crawl to end.
Crawling
Just get all the data you can (probably all the HTML code) and store it in a simple table. You'll need this data for the indexing analysis. This table might be big, but you don't need good performance while working with it, because it's part of a background process and isn't exposed to users' searches.
ALL_DATA
____________________________________________
| Url | Title | Description | HTML_Content |
‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾
Tables and Indexing
Create a big table that contains URLs and keywords
KEYWORDS
_________________
| URL | Keyword |
‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾
This table will contain most of the words in each URL's content (I would remove stop words like "the", "on", "with", "a", etc.).
Create a table with keywords. For each occurrence add 1 to the occurrences column
KEYWORDS
_______________________________
| URL | Keyword | Occurrences |
‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾
Create another table with "hot" keywords which will be much smaller
HOT_KEYWORDS
_________________
| URL | Keyword |
‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾
This table's content will be loaded later according to search queries.
The most common search words will be stored in the HOT_KEYWORDS table.
Another table will hold cached search results
CACHED_RESULTS
_________________
| Keyword | Url |
‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾
Searching algorithm
First, search the cached results table. If you have enough results, select them. If you don't, search the bigger KEYWORDS table. Your data is not that big, so a search using the keyword index won't take too long. If you find more relevant results, add them to the cache for later use.
Note: You have to select an algorithm in order to keep your CACHED_RESULTS table small (maybe to save the last use of the record and remove the oldest record if the cache is full).
This way the cache table will help you reduce the load on the keywords tables and give you ultra fast results for the common searches.
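The cache-first flow above can be sketched like this. The table and column names follow the answer's schemas but are simplified, and SQLite in-memory stands in for MySQL:

```php
<?php
// Simplified KEYWORDS and CACHED_RESULTS tables from the answer above.
// SQLite in-memory stands in for MySQL.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec("CREATE TABLE keywords (url TEXT, keyword TEXT, occurrences INTEGER)");
$db->exec("CREATE TABLE cached_results (keyword TEXT, url TEXT)");
$db->exec("CREATE INDEX idx_kw ON keywords (keyword)");
$db->exec("INSERT INTO keywords VALUES ('http://a.example', 'pizza', 12),
                                       ('http://b.example', 'pizza', 3)");

function search(PDO $db, string $keyword): array
{
    // 1) Try the small cache table first.
    $stmt = $db->prepare("SELECT url FROM cached_results WHERE keyword = ?");
    $stmt->execute([$keyword]);
    $urls = $stmt->fetchAll(PDO::FETCH_COLUMN);
    if ($urls) {
        return $urls;
    }
    // 2) Fall back to the big KEYWORDS table, best matches first.
    $stmt = $db->prepare("SELECT url FROM keywords
                          WHERE keyword = ? ORDER BY occurrences DESC");
    $stmt->execute([$keyword]);
    $urls = $stmt->fetchAll(PDO::FETCH_COLUMN);
    // 3) Warm the cache for next time (an eviction policy would go here).
    $ins = $db->prepare("INSERT INTO cached_results VALUES (?, ?)");
    foreach ($urls as $url) {
        $ins->execute([$keyword, $url]);
    }
    return $urls;
}
```

The first call for a keyword pays for the big-table lookup; repeat searches are served from the small cache table.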
Just look at Solr and the Solr wiki. It's an open-source search platform from the Lucene project (similar to Elasticsearch).
For the web crawler, you can use either Aperture or Nutch. Both are written in Java. Aperture is a lightweight crawler, while Nutch can crawl a thousand or more websites.
Nutch will handle the crawling process for websites. Moreover, Nutch provides Solr support, meaning you can index the data crawled by Nutch directly into Solr.
Using SolrCloud, you can set up multiple clusters with shards and replication to prevent data loss and provide fast data retrieval.
Implementing your own web crawler is not that easy, and for search, a regular RDBMS makes it much more complicated to retrieve the data at run time.
I've had my experiences with crawling websites, and it is a really complicated topic.
Whenever I've hit a problem in this area, I look at what the best people at this do (yup, Google).
They have a lot of nice presentations about what they are doing and they even release some (of their) tools.
phpQuery, for example, is a great tool when it comes to searching for specific data on a website; I'd recommend having a look at it if you don't know it yet.
A little trick I've done in a similar project was to have two tables for the data.
The data had to be as up to date as possible, so the crawler was running most of the time, and there were problems with locked tables. So whenever the crawler wrote into one table, the other one was free for the search engine, and vice versa.
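A sketch of that double-buffer trick, under assumptions (table names `pages_live`/`pages_staging` are illustrative). In MySQL you can swap both names atomically with a single RENAME TABLE statement; SQLite, used here for illustration, needs three renames inside a transaction:

```php
<?php
// The crawler writes into the staging table while searches read the live
// one; when the crawl finishes, the two tables are swapped.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec("CREATE TABLE pages_live    (url TEXT)");
$db->exec("CREATE TABLE pages_staging (url TEXT)");
$db->exec("INSERT INTO pages_live    VALUES ('http://old.example')");
$db->exec("INSERT INTO pages_staging VALUES ('http://fresh.example')");

function swapTables(PDO $db): void
{
    // MySQL equivalent (atomic):
    //   RENAME TABLE pages_live TO pages_tmp,
    //                pages_staging TO pages_live,
    //                pages_tmp TO pages_staging
    $db->beginTransaction();
    $db->exec("ALTER TABLE pages_live    RENAME TO pages_tmp");
    $db->exec("ALTER TABLE pages_staging RENAME TO pages_live");
    $db->exec("ALTER TABLE pages_tmp     RENAME TO pages_staging");
    $db->commit();
}

swapTables($db);
```

After the swap, the search engine keeps querying `pages_live` and immediately sees the fresh crawl, while the crawler starts refilling the (now stale) staging table.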
I have built a Web Crawler for detecting news sites - and its performing very well.
It basically downloads the whole page, saves it, and prepares it for another scraping pass that looks for keywords. It then tries to determine whether the site is relevant using those keywords. Dead simple.
You can find the sourcecode for it here. Please help contribute :-)
It's a focused crawler that doesn't really do anything other than look for sites and rank them according to the presence of keywords. It's not usable for huge data loads, but it's quite good at finding relevant sites.
https://github.com/herreovertidogrom/crawler.git
It's a bit poorly documented - but I will get around to that.
If you want to search the crawled data, you have a lot of data, and you aspire to build a future-proof service, you should NOT create a table with N columns, one for each search term. This is a common design if you think of the URL as the primary key. Rather, you should avoid a wide-table design like the plague, because disk reads get incredibly slow on wide-table designs. You should instead store all data in one table, specifying the key and the value, and then partition the table on the variable name.
Avoiding duplicates is always hard. In my experience from data warehousing: design the primary key and let the DB do the job. Using source + key + value as the primary key avoids double counting and imposes few restrictions.
May I suggest you create a table like this:
URL, variable, value, and make that the primary key.
Then write all data into that table, partition on distinct variable, and implement search on this table only.
It avoids duplicates, it's fast, and it's easily compressible.
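A sketch of that (URL, variable, value) design, with the composite primary key doing the de-duplication. The names are illustrative, and SQLite in-memory stands in for the real database:

```php
<?php
// Key-value layout: one narrow table instead of one wide column per
// search term. The composite primary key rejects exact duplicates.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec("CREATE TABLE scraped (
               url      TEXT,
               variable TEXT,
               value    TEXT,
               PRIMARY KEY (url, variable, value)
           )");

// INSERT OR IGNORE lets the key silently drop exact duplicates
// (MySQL's equivalent is INSERT IGNORE).
$ins = $db->prepare("INSERT OR IGNORE INTO scraped VALUES (?, ?, ?)");
$ins->execute(['http://a.example/article', 'city',  'Amsterdam']);
$ins->execute(['http://a.example/article', 'title', 'Some headline']);
$ins->execute(['http://a.example/article', 'city',  'Amsterdam']); // duplicate, ignored
```

Search then only ever touches this one narrow table, filtered on `variable`, regardless of how many distinct fields the scrapers collect.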
Have you tried http://simplehtmldom.sourceforge.net/manual.htm? I found it useful for scraping pages, and it might be helpful for searching the contents.
Use an asynchronous approach to crawl and store the data, so that you can run multiple crawls and stores in parallel.
ElasticSearch will be useful for searching the stored data.
You can search the HTML using this code:
<?php
// Get the HTML
$page = file_get_contents('http://www.google.com');

// Parse the HTML (suppress warnings from imperfect real-world markup)
libxml_use_internal_errors(true);
$html = new DOMDocument();
$html->loadHTML($page);

// Get the elements you are interested in...
$divArr = $html->getElementsByTagName('div');
foreach ($divArr as $div) {
    echo $div->nodeValue;
}

Will my searches be slow in my database design?

I'm building a database of IT candidates for a friend who owns a recruitment company. He currently has a database of thousands of candidates in an Excel spreadsheet, and I'm converting it into a MySQL database.
Each candidate has a skill field with their skills listed as a string e.g. "javascript, php, nodejs..." etc.
My friend will have employees under him who will also search the database, however we want to make it so they are limited to search results with candidates with specific skills depending on what vacancy they are working on for security reasons (so they don't steal large sections of the database and go and setup their own recruitment company with the data).
So if an employee is working on a javascript role, they will be limited to search results where the candidate has the word "javascript" in their skills field. So if they searched for all candidates named "Michael" then it would only return "Michaels" with javascript skills for instance.
My concern is that searches might take too long, since every search must scan the skills field, which can sometimes be a long string.
Is my concern justified? If so is there a way to optimize this?
If the number of records are in the thousands, you probably won't have any speed issues (just make sure you're not querying more often than you should).
You've tagged this question with a 'mysql' tag, so I'm assuming that's the database you're using. Make sure you add a FULLTEXT index to speed up the search. Please note, however, that this type of index is only available for InnoDB tables starting with MySQL 5.6.
Try the built-in search first, but if you find it to be too slow, or not accurate enough in its results, you can look at external full-text search engines. I've personally had very good experience with the Sphinx search server, which easily indexed millions of text records and returned good results.
Your queries will require a full table scan (unless you use a full text index). I highly recommend that you change the data structure in the database by introducing two more tables: Skills and CandidateSkills.
The first would be a list of available skills, containing rows such as:
SkillId SkillName
1 javascript
2 php
3 nodejs
The second would say which skills each person has:
CandidateId SkillId
1 1
2 1
2 2
This will speed up the searches, but that is not the primary reason. The primary reason is to fix problems and enable functionality such as:
Preventing spelling errors in the list of skills.
Providing a basis for enabling synonym searches.
Making sure thought goes into adding new skills (because they need to be added to the Skills table).
Allowing the database to scale.
If you attempt to do what you want using a full-text index, you will learn a few things. For instance, the default minimum word length is 4, which would be a problem if your skills include "C" or "C++". MySQL doesn't support synonyms, so you'd have to muck around to get that functionality. And you might get unexpected results if you have skills that are multiple words.
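A sketch of the normalized schema and the "Michaels with javascript" query from the question. The `Candidates` table and sample data are assumptions added for illustration, and SQLite in-memory stands in for MySQL:

```php
<?php
// Normalized layout: skills live in their own table, and a junction
// table links candidates to skills -- exact index lookups, no LIKE
// over a comma-separated string.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec("CREATE TABLE Candidates      (CandidateId INTEGER PRIMARY KEY, Name TEXT)");
$db->exec("CREATE TABLE Skills          (SkillId INTEGER PRIMARY KEY, SkillName TEXT UNIQUE)");
$db->exec("CREATE TABLE CandidateSkills (CandidateId INTEGER, SkillId INTEGER,
                                         PRIMARY KEY (CandidateId, SkillId))");
$db->exec("INSERT INTO Candidates VALUES (1, 'Michael'), (2, 'Sarah'), (3, 'Michael')");
$db->exec("INSERT INTO Skills VALUES (1, 'javascript'), (2, 'php'), (3, 'nodejs')");
$db->exec("INSERT INTO CandidateSkills VALUES (1, 1), (2, 1), (2, 2), (3, 2)");

// All candidates with the given name who have the given skill.
function findCandidates(PDO $db, string $name, string $skill): array
{
    $stmt = $db->prepare(
        "SELECT c.Name
           FROM Candidates c
           JOIN CandidateSkills cs ON cs.CandidateId = c.CandidateId
           JOIN Skills s           ON s.SkillId      = cs.SkillId
          WHERE c.Name = ? AND s.SkillName = ?"
    );
    $stmt->execute([$name, $skill]);
    return $stmt->fetchAll(PDO::FETCH_COLUMN);
}
```

The employee-restriction requirement then becomes a fixed `SkillName` filter appended server-side to every query the employee runs.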

best way to set up a MySQL database for storing web data

I will be using curl to retrieve thousands of adult websites. My goal is to store them in MySQL to allow users to easily search the new database and find their desired page without having to endure all the popups, spyware, etc.
It will be an adult website search engine... kinda the google of adult websites, but without the malware sites that find their way onto google from time to time.
At first run I downloaded about 700K lines at about 20 GB of data. Initially I stored all info in a single table with columns for URL, HTML PAGE CODE, PAGE WITH NO HTML TAGS, KEY WORDS, TITLE and a couple more.
I use a MATCH AGAINST query to search for the user's desired page within TITLE, KEY WORDS and PAGE WITH NO HTML, in any variety of combinations or singly.
My question is... would I be better off to break all these columns into separate tables and would this improve the speed of the searches enough that it would matter?
Is there any advantage to storing all the data in multiple tables and then using JOIN's to pull the data out?
I am just wondering if I need to be proactive and thinking about high user search loads.
MySQL isn't good with full-text search and never was.
Look into Sphinx or Lucene/Solr, they're the best fit for the job. I'd suggest sticking to the former.

User content site wide search - PHP/ MySQL

For a user-content website I am creating, there are lots of sub-sections: Movies, Jobs, People, Photos, Mail, etc. It's like a Yahoo portal, but very detailed in its information search; I am niching as deep as possible per topic, unlike any site out there. The site is being developed in CodeIgniter (PHP) and MySQL. Search can be global across all sub-sites or per sub-section, as we see on Google and Yahoo. There are 22 possible user-content objects in my system, each with about 12-15 search fields, which I call object metadata. I am also storing historical data (like user-content version control), which I want to include in the search.
Now, per-sub-section search seems reasonable because the scope is limited, so I think I can pull it off well with MySQL; I don't foresee any performance issues. But site-wide search will cover not just title names, but keywords, tags, descriptions, including people's mail, comments, historical data, etc. So my worry is performance. Since this is a startup, I have limited hardware resources, so I have to depend 100% on the database and code to pull it off.
So what are the best practices for implementing such a search from the code and database point of view, and should a mixture of databases be used depending on the sub-site? Currently everything is stored in one MySQL database. But I see issues where it may work fine for people search, movie search, etc., but not if I include mail search, geo-locations, historical-data search, and even having to search items like photo tags and photo descriptions. As part of the global search there can be performance issues due to the high number of joins and number of rows.
I don't know about PHP, but for my ruby-on-rails projects, I always use Sphinx search engine to do such things. It is a standalone search engine that indexes your database, and when the user submits a search query, the query is matched against Sphinx's index database instead of the actual db. It is blazingly fast, and offers great control over how to index/search.
Sphinx Search Engine
PHP: Sphinx Extension (not sure if this is relevant)
For a generalized site wide search on a budget you could constrain one of the major search APIs to just your domain and handle and display results as if they had come from your own search.
I don't have a solution exactly, but am running into a similar problem with my in-development website.
I'm beginning to think a solution might lie in determining where the bulk of your searches lie and limiting searches to those queries. If the user's search requires more in-depth results (such as your mail search, geo-locations, historical data), then you can send the user to a second MySQL query. Get the majority of your users searching with your simpler, cheaper queries, and the rest can use more resources if necessary.
As an example, the majority of my site's users will be searching the news, calendar and media sections, so my search looks there first. But visitors could also be searching for other users, groups, forum posts, tags/categories, and so on. I'm going to let a second, more complicated script handle that.

Integrating search on a website where the backend is MYSQL

I have a location search website for a city, we started out with collecting data for all possible categories in the city like Schools, Colleges, Departmental Stores etc and stored their information in a separate table, as each entry had different details apart from their name, address and phone number.
We had to integrate search into the website to enable people to find information, so we built an index table in which we stored the categories, related keywords for each category, and the table that must be fetched if that category was searched for. Later on we added the ability to search on name and address as well, by adding another master table containing those fields from all the tables in one place. Now my doubts are the following:
The application design is improper, and we have written queries like select * from master where name like "%$input%" all over. Since our database is MySQL with PHP on the server side, are there any suggestions for improving the design of the system?
People want more features, like splitting the keywords and ranking them according to relevance; is there any ready-made framework available that runs search on a database?
I tried using Full-Text Search in MySQL, and it seems effective to me. Is that enough?
Correct me if I am wrong: I had a look into Lucene and Google Custom Search; don't they work by crawling existing webpages and building their own index? I have a collection of tables in a MySQL database on which I have to apply searching. What options do I have?
To address your points:
Using %input% is very bad. It will cause a full table scan on every query. Under any amount of load, or on even a remotely large dataset, your DB server will choke.
An RDBMS alone is not a good solution for this. You are looking in the right place by seeking a separate solution for search. Something which can communicate well with your RDBMS is good; something that runs inside an RDBMS won't do what you need.
Full-Text Search in MySQL is workable for very basic keyword searches, nothing more. Its scope of usefulness is extremely limited: you need a highly predictable usage model to leverage the built-in searching. It is called "search", but it's not really search the way most people think of it; it does not come close to the quality of results we have come to expect from Google and Bing. In that sense of the word "search", it is something else - like Notepad vs. Word. They are both things to type in, but that's about it.
As far as separate systems for handling search go, Lucene is very good. Lucene works however you want it to work, essentially; you can interact with it programmatically to insert indexable documents. Likewise, a Google Search Appliance (not Google Custom Search) can be given direct metadata feeds that expose whatever you want indexed, such as data directly from a database.
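To see concretely why the leading wildcard is the problem: a pattern like '%input%' defeats any index on the column, while a prefix pattern ('input%') can use one. A sketch using SQLite's EXPLAIN QUERY PLAN for illustration (table and index names are made up; the same rule applies to MySQL's EXPLAIN):

```php
<?php
// A tiny indexed table; COLLATE NOCASE so the (case-insensitive) LIKE
// optimization can use the index for prefix patterns.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec("CREATE TABLE master (name TEXT COLLATE NOCASE)");
$db->exec("CREATE INDEX idx_name ON master (name)");
$db->exec("INSERT INTO master VALUES ('Green Park School'),
                                     ('Park Lane College'),
                                     ('City Hospital')");

// Return the human-readable detail of the query plan's first row.
function queryPlan(PDO $db, string $sql): string
{
    $row = $db->query("EXPLAIN QUERY PLAN " . $sql)->fetch(PDO::FETCH_NUM);
    return (string) end($row); // last column holds the plan text
}

// '%park%' must examine every row; 'park%' can seek into the index.
$scanPlan   = queryPlan($db, "SELECT * FROM master WHERE name LIKE '%park%'");
$searchPlan = queryPlan($db, "SELECT * FROM master WHERE name LIKE 'park%'");
```

The first plan reports a scan of the whole table, the second an index search over a bounded range, which is why prefix searches (or a dedicated search engine) scale and `%$input%` does not.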
Take a look at sphinx: http://www.sphinxsearch.com/
Per their site:
How do you implement full-text search for that 10+ million row table, keep up with the load, and stay relevant? Sphinx is good at those kinds of riddles.
It's quite popular with a lot of people in the rails community right now, and they all rave about how awesome it is :)
