I'm a hobbyist, and started learning PHP last September solely to build a hobby website that I had always wished and dreamed another more competent person might make.
I enjoy programming, but I have little free time and enjoy a wide range of other interests and activities.
I feel learning PHP alone can probably allow me to create 98% of the desired features for my site, but that last 2% is awfully appealing:
The most powerful tool of the site is an advanced search page that picks through a 1000+ record game scenario database. Users can data-mine to tremendous depths - this advanced page has upwards of 50 different potential variables. It's designed to allow the hardcore user to search on almost any possible combination of data in our database and it works well. Those who aren't interested in wading through the sea of options may use the Basic Search, which is comprised of the most popular parts of the Advanced search.
Because the advanced search is so comprehensive, and because the database is rather small (less than 1,200 potential hits maximum), with each variable you choose to include the likelihood of getting any qualifying results at all drops dramatically.
In my fantasy land where I can wield AJAX as if it were Excalibur, my users would have a realtime Total Results counter in the corner of their screen as they used this page, which would automatically update its query structure and report how many results will be displayed with the addition of each variable. In this way it would be effortless to know just how many variables are enough, and when you've gone and added one that zeroes out the results set.
A somewhat similar implementation, at least visually, would be the Subtotal sidebar shown when building a new custom computer on IBuyPower.com.
For those of you actually still reading this, my question is really rather simple:
Given the time and ability constraints outlined above, would I be able to learn just enough AJAX (or whatever is needed) to pull this one feature off without too much trouble? Would I be able to more or less drop in a pre-written code snippet and tweak it to fit? Or should I consider opening my code up to a trusted and capable individual in the future for this implementation? (Assuming I can find one...)
Thank you.
This is a great project for a beginner to tackle.
First I'd say look into using a library like jQuery (jquery.com). It will simplify the JavaScript part of this, and the documentation is very good.
What you're looking to do can be broken down into a few steps:
1. The user changes a field on the advanced search page.
2. The user's browser collects all the field values and sends them back to the server.
3. The server performs a search with the values and returns the number of results.
4. The user's browser receives the number of results and updates the display.
Now for implementation details:
The change can be detected with JavaScript events such as onchange and onfocus.
You can collect the field values into a JavaScript object, serialize it to JSON, and send it via AJAX to a PHP page on your server.
The server page (in PHP) reads the JSON object, uses the data to run the search, and sends back the result as markup or plain text.
You can then display the result directly in the browser.
This may seem like a lot to take in but you can break each step down further and learn about the details bit by bit.
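Sketched in plain JavaScript, step 2 above could look like the following; the field names and the count.php endpoint are hypothetical examples, not anything from your site:

```javascript
// Collect non-empty field values into a query string (step 2).
function buildCountQuery(fields) {
  return Object.keys(fields)
    .filter(function (key) { return fields[key] !== '' && fields[key] != null; })
    .map(function (key) {
      return encodeURIComponent(key) + '=' + encodeURIComponent(fields[key]);
    })
    .join('&');
}

// In the browser this would be wired to change events, e.g. with jQuery:
//   $('#advanced-search :input').on('change', function () {
//     $.get('count.php?' + buildCountQuery(collectFields()), function (n) {
//       $('#total-results').text(n);
//     });
//   });

console.log(buildCountQuery({ scenario_year: 1944, map_size: '', players: 2 }));
// scenario_year=1944&players=2
```

Skipping empty fields keeps the counter query from filtering on variables the user hasn't touched yet.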
Hard to answer your question without knowing your level of expertise, but check out this short description of AJAX: http://blog.coderlab.us/rasmus-30-second-ajax-tutorial
If this makes some sense then your feature may be within reach "without too much trouble". If it seems impenetrable, then probably not.
I am in the process of developing a video sharing site. On the video page I display "similar videos" using just a database query (based on tags/category), and I haven't run into any problems with this. But I've been debating using my custom search function instead, to match similar videos even more closely (so it's based not only on similar categories but also on tags, similar words, etc.). However, my fear is that running this on every video view would be too much in terms of resources, and just not worth it since it's not a major part of the site.
So I was debating doing that, but storing the results (maybe storing 50 and pulling 6 from that 50 by ID). I could update them maybe once a week or whenever (again, since it's not a major part of the site, I don't need live searching). My question is: are there any downsides or upsides to this?
I'm looking specifically at caching the similar video results, versus simply saying "never mind it" and keeping it based on tags. Does anyone have any experience/knowledge of how sites deal with offering similar items for something like this?
(I'm using PHP and MySQL, built on the Laravel framework; search is a custom class built on top of Laravel Scout.)
Every decision you make as a developer is a tradeoff. If you cache results, you gain speed on display, but take on more complexity in cache management (and probably bugs). You have to decide whether it's worth it, as we do not know your page load time requirements (or other KPIs), user load, hardware, etc.
But in general I would cache this data.
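A minimal sketch of the store-50-serve-6-refresh-weekly idea from the question; the function and field names here are hypothetical, and the in-memory object stands in for whatever table you cache into:

```javascript
// Cache up to 50 precomputed "similar video" IDs per video, serve 6 per
// page view, and recompute only when the entry is older than a week.
var WEEK_MS = 7 * 24 * 60 * 60 * 1000;

function getSimilar(videoId, cache, findSimilar, now) {
  var entry = cache[videoId];
  if (!entry || now - entry.computedAt > WEEK_MS) {
    // Run the expensive custom search only when the cache is stale.
    entry = { ids: findSimilar(videoId).slice(0, 50), computedAt: now };
    cache[videoId] = entry;
  }
  return entry.ids.slice(0, 6); // cheap read on every view
}
```

The expensive search runs at most once per video per week; every other view is a plain cache read.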
Important note: the questions below aren't meant to violate ANY data copyrights. All crawled and saved data is linked directly to the source.
For a client I'm gathering information for building a search engine/web spider combination. I do have experience with indexing webpages' inner links to a specific depth, and I also have experience scraping data from webpages. However, in this case the volume is larger than I have experience with, so I was hoping to gain some knowledge and insight into best practices.
First of all, what I need to make clear is that the client is going to deliver a list of websites that are going to be indexed. So, in fact, a vertical search engine. The results only need to have a link, title and description (like the way Google displays results). The main purpose of this search engine is to make it easier for visitors to search large amounts of sites and results to find what they need.
So: website A contains a bunch of links -> save all links together with their metadata.
Secondly, there's a more specific search engine: one that also indexes all the links to (let's call them) articles. These articles are spread over many smaller sites, each with fewer articles than the sites that end up in the vertical search engine. The reason is simple: the articles found on these pages have to be scraped in as much detail as possible. This is where the first problem lies: it would take a huge amount of time to write a scraper for each website. Data that needs to be collected includes, for example: city name, article date, and article title. So: website B contains more detailed articles than website A; we are going to index these articles and scrape the useful data.
I do have a method in mind which might work, but it involves writing a scraper for each individual website; in fact, it's the only solution I can think of right now. Since the DOM of each page is completely different, I see no way to build a foolproof algorithm that searches the DOM and 'knows' what part of the page is a location (although... it might be possible if you can match the text against a full list of cities).
A few things that crossed my mind:
Vertical Search Engine
For the vertical search engine it's pretty straightforward: we have a list of webpages that need to be indexed, and it should be fairly simple to crawl all pages that match a regular expression and store the full list of these URLs in a database.
I might want to split saving page data (meta description, title, etc.) into a separate process to speed up the indexing.
There is a possibility that there will be duplicate data in this search engine, due to websites having matching results/articles. I haven't made up my mind on how to filter these duplicates; perhaps on article title, but in the business segment the data comes from there's a huge chance of duplicate titles on different articles.
Page scraping
Indexing the 'to-be-scraped' pages can be done in a similar way, as long as we know which regex to match the URLs with. We can save the list of URLs in a database.
Use a separate process that runs over all the individual pages; based on the URL, the scraper should know which regex to use to match the needed details on the page and write these to the database.
There are enough sites that index results already, so my guess is there should be a way to create a scraping algorithm that knows how to read the pages without having to match the regex completely. As I said before: if I have a full list of city names, there must be an option to use a search algorithm to get the city name without having to say the city name lies in "#content .about .city".
Data redundancy
An important part of the spider/crawler is preventing it from indexing duplicate data. What I was hoping to do is keep track of the time a crawler starts indexing a website and when it ends; I'd also keep track of the 'last update time' of each article (based on the URL of the article) and remove all articles that are older than the starting time of the crawl. Because, as far as I can see, those articles no longer exist.
The data redundancy is easier with the page scraper, since my client made a list of "good sources" (read: pages with unique articles). Data redundancy for the vertical search engine is harder, because the sites being indexed already make their own selection of articles from "good sources". So there's a chance that multiple sites have a selection from the same sources.
How to make the results searchable
This is a question apart from how to crawl and scrape pages, because once all the data is stored in the database, it needs to be searchable at high speed. The amount of data that will be saved is still unknown; compared to some competitors, my client had an indication of about 10,000 smaller records (vertical search) and maybe 4,000 larger records with more detailed information.
I understand that this is still a small amount compared to some databases you've possibly worked on. But in the end there might be up to 10-20 search fields that a user can use to find what they are looking for. With a high traffic volume and a lot of these searches, I can imagine that using regular MySQL queries for search isn't a clever idea.
So far I've found SphinxSearch and ElasticSearch. I haven't worked with either of them and haven't really looked into the possibilities of both; the only thing I know is that both should perform well with high volume and larger search queries.
To sum things up
To sum all things up, here's a shortlist of questions I have:
Is there an easy way to create a search algorithm that is able to match DOM data without having to specify the exact div the content lies within?
What is the best practice for crawling pages (links, title & description)?
Should I split crawling URLs and saving page title/description for speed?
Are there out-of-the-box solutions for PHP to find (possible) duplicate data in a database, even when there are minor differences (e.g. if 80% matches -> mark as duplicate)?
What is the best way to create a future-proof search engine for the data (keeping in mind that the amount of data can increase, as well as the site traffic and search requests)?
I hope I've made everything clear, and I'm sorry for the huge amount of text. I guess it does show that I've already spent some time trying to figure things out myself.
I have experience building large scale web scrapers and can testify that there will always be big challenges to overcome when undertaking this task. Web scrapers run into problems ranging from CPU issues to storage to network problems and any custom scraper needs to be built modular enough to prevent changes in one part from breaking the application as a whole. In my projects I have taken the following approach:
Figure out where your application can be logically split up
For me this meant building 3 distinct sections:
Web Scraper Manager
Web Scraper
HTML Processor
The work could then be divided up like so:
1) The Web Scraper Manager
The Web Scraper Manager pulls URLs to be scraped and spawns Web Scrapers. The Web Scraper Manager needs to flag all URLs that have been sent to the web scrapers as "actively scraped" and know not to pull them down again while they are in that state. Upon receiving a message from the scrapers, the manager will either delete the row or leave it in the "actively scraped" state if no errors occurred; otherwise it will reset it back to "inactive".
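That state handling might be sketched like this, with an in-memory array standing in for the URL table and all names being illustrative:

```javascript
// A URL row moves inactive -> active when handed to a scraper,
// then is deleted on success or reset to inactive on error.
function pullInactiveUrl(rows) {
  var row = rows.find(function (r) { return r.state === 'inactive'; });
  if (row) row.state = 'active'; // flag as "actively scraped"
  return row || null;
}

function finishUrl(rows, url, hadError) {
  var i = rows.findIndex(function (r) { return r.url === url; });
  if (i === -1) return;
  if (hadError) rows[i].state = 'inactive'; // reset so it can be retried
  else rows.splice(i, 1);                   // done: delete the row
}
```

In a real system the array would be a database table and the state change an UPDATE inside a transaction, so two managers can't claim the same row.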
2) The Web Scraper
The Web Scraper receives a URL to scrape and goes about cURLing it and downloading the HTML. All of this HTML can then be stored in a relational database with the following structure:
ID | URL | HTML (BLOB) | PROCESSING
Processing is an integer flag which indicates whether or not the data is currently being processed. This lets other parsers know not to pull the data if it is already being looked at.
3) The HTML Processor
The HTML Processor will continually read from the HTML table, marking rows as active every time it pulls a new entry. The HTML Processor has the freedom to operate on the HTML for as long as needed to parse out any data: links to other pages on the site (which could be placed back in the URL table to start the process again), any relevant data (meta tags, etc.), images, and so on.
Once all relevant data has been parsed out the HTML processor would send all this data into an ElasticSearch cluster. ElasticSearch provides lightning-fast full text searches which could be made even faster by splitting the data into various keys:
{
    "url" : "http://example.com",
    "meta" : {
        "title" : "The meta title from the page",
        "description" : "The meta description from the page",
        "keywords" : "the,keywords,for,this,page"
    },
    "body" : "The body content in its entirety",
    "images" : [
        "image1.png",
        "image2.png"
    ]
}
Now your website/service can have access to the latest data in real time. The parser would need to be robust enough to handle any errors, so it can set the processing flag back to false if it cannot pull data out, or at least log the failure somewhere so it can be reviewed.
What are the advantages?
The advantage of this approach is that at any time, if you want to change the way you are pulling, processing, or storing data, you can change just that piece without having to re-architect the entire application. Furthermore, if one part of the scraper/application breaks, the rest can continue to run without any data loss and without stopping other processes.
What are the disadvantages?
It's a big, complex system, and any time you have a big, complex system you are asking for big, complex bugs. Unfortunately, web scraping and data processing are complex undertakings, and in my experience there is no way around a complex solution to this particularly complex problem.
The crawling and indexing actions can take a while, but you won't be crawling the same site every 2 minutes, so you can consider an algorithm in which you put more effort in crawling and indexing your data, and another algorithm to help you get a faster search.
You can keep crawling your data all the time and update the rest of the tables in the background (every X minutes/hours), so your search results will be fresh all the time but you won't have to wait for the crawl to end.
Crawling
Just get all the data you can (probably all the HTML code) and store it in a simple table. You'll need this data for the indexing analysis. This table might be big, but you don't need good performance while working with it, because it's only used by background processing and is never exposed to users' searches.
ALL_DATA
____________________________________________
| Url | Title | Description | HTML_Content |
‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾
Tables and Indexing
Create a big table that contains URLs and keywords
KEYWORDS
_________________
| URL | Keyword |
‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾
This table will contain most of the words in each URL's content (I would remove stop words like "the", "on", "with", "a", etc.).
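Building those URL/keyword rows might look like the following sketch; the stop word list here is just a tiny sample, and the function name is made up:

```javascript
// Split page text into words, drop stop words, and emit one
// { url, keyword } row per distinct remaining word.
var STOP_WORDS = new Set(['the', 'on', 'with', 'a', 'and', 'of', 'to', 'in']);

function extractKeywords(url, text) {
  var words = text.toLowerCase().match(/[a-z0-9]+/g) || [];
  var seen = new Set();
  var rows = [];
  words.forEach(function (w) {
    if (!STOP_WORDS.has(w) && !seen.has(w)) {
      seen.add(w);
      rows.push({ url: url, keyword: w });
    }
  });
  return rows;
}
```

A real implementation would use a full stop word list for the content's language and probably stem the words as well.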
Create a table with keywords. For each occurrence add 1 to the occurrences column
KEYWORDS
_______________________________
| URL | Keyword | Occurrences |
‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾
Create another table with "hot" keywords which will be much smaller
HOT_KEYWORDS
_________________
| URL | Keyword |
‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾
This table content will be loaded later according to search queries.
The most common search words will be stored in the HOT_KEYWORDS table.
Another table will hold cached search results
CACHED_RESULTS
_________________
| Keyword | Url |
‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾
Searching algorithm
First, you'll search the cached results table. If you have enough results there, use them. If you don't, search the bigger KEYWORDS table; your data is not that big, so searching via the keyword index won't take too long. If you find more relevant results, add them to the cache for later use.
Note: You have to select an algorithm in order to keep your CACHED_RESULTS table small (maybe to save the last use of the record and remove the oldest record if the cache is full).
This way the cache table will help you reduce the load on the keywords tables and give you ultra fast results for the common searches.
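The cache-first lookup with a simple least-recently-used eviction could be sketched like this, with a JavaScript Map standing in for the CACHED_RESULTS table and a plain object for the full keyword index:

```javascript
// Look up a keyword: serve from cache if possible, otherwise fall back
// to the full index, populate the cache, and evict the oldest entry
// when the cache grows past maxCacheSize.
function search(keyword, cache, fullIndex, maxCacheSize) {
  if (cache.has(keyword)) {
    var hit = cache.get(keyword);
    cache.delete(keyword);   // re-insert to mark as most recently used
    cache.set(keyword, hit);
    return hit;
  }
  var results = fullIndex[keyword] || []; // the slower KEYWORDS lookup
  cache.set(keyword, results);
  if (cache.size > maxCacheSize) {
    // Map preserves insertion order, so the first key is the oldest entry.
    cache.delete(cache.keys().next().value);
  }
  return results;
}
```

This is the "save the last use of the record and remove the oldest" policy from the note above, in miniature.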
Just look at Solr and the Solr wiki. It's an open source search platform from the Lucene project (similar to Elasticsearch).
For the web crawler, you can use either Aperture or Nutch. Both are written in Java. Aperture is a lightweight crawler, but with Nutch we can crawl 1,000 or even more websites.
Nutch will handle the crawling process for websites. Moreover, Nutch provides Solr support, which means you can index the data crawled by Nutch directly into Solr.
Using SolrCloud we can set up multiple clusters with shards and replication to prevent data loss and speed up data retrieval.
Implementing your own web crawler is not that easy, and for search, a regular RDBMS makes it much more complicated to retrieve the data at run time.
I've had my own experiences with crawling websites, and it is a really complicated topic.
Whenever I've run into a problem in this area, I look at what the best people in the field do (yup, Google).
They have a lot of nice presentations about what they are doing and they even release some (of their) tools.
phpQuery for example is a great tool when it comes to searching specific data on a website, I'd recommend to have a look at it if you don't know it yet.
A little trick I've used in a similar project was to have two tables for the data.
The data had to be as up to date as possible, so the crawler was running most of the time, and there were problems with locked tables. So whenever the crawler wrote into one table, the other one was free for the search engine, and vice versa.
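That two-table trick can be sketched as a simple swap; the table names here are hypothetical:

```javascript
// Double-buffered tables: searches always read from the "active" table
// while the crawler writes into the "inactive" one; when a crawl
// finishes, the roles swap, so neither side ever blocks the other.
function makeSwapper() {
  var active = 'data_a';
  var inactive = 'data_b';
  return {
    readTable: function () { return active; },    // search engine reads here
    writeTable: function () { return inactive; }, // crawler writes here
    swap: function () {                           // call when a crawl completes
      var tmp = active; active = inactive; inactive = tmp;
    }
  };
}
```

In SQL terms the swap is often done with an atomic RENAME TABLE so readers never see a half-written table.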
I have built a web crawler for detecting news sites, and it's performing very well.
It basically downloads the whole page and saves it, preparing it for another scraping pass that looks for keywords. It then basically tries to determine whether the site is relevant, using the keywords. Dead simple.
You can find the source code for it here. Please help contribute :-)
It's a focused crawler that doesn't really do anything other than look for sites and rank them according to the presence of keywords. It's not usable for huge data loads, but it's quite good at finding relevant sites.
https://github.com/herreovertidogrom/crawler.git
It's a bit poorly documented - but I will get around to that.
If you want to do searches of the crawled data, you have a lot of data, and you aspire to build a future-proof service, you should NOT create a table with N columns, one for each search term. This is a common design if you think of the URL as the primary key. Rather, you should avoid a wide-table design like the plague, because disk IO reads get incredibly slow on wide table designs. You should instead store all data in one table, specify the key and the value, and then partition the table on the variable name.
Avoiding duplicates is always hard. In my experience from data warehousing: design the primary key and let the DB do the job. Using source + key + value as the primary key makes you avoid double counting and imposes few restrictions.
May I suggest you create a table like this:
URL, variable, value, and make that the primary key.
Then write all data into that table, partition on the distinct variable names, and implement search on this table only.
It avoids duplicates, it's fast, and it's easily compressible.
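The duplicate-avoidance property of that composite primary key can be sketched with a set keyed on all three columns; the store here is an in-memory stand-in for the table:

```javascript
// A narrow (URL, variable, value) store: because the composite key covers
// all three columns, writing the same row twice is a no-op, so re-crawling
// a page can never produce duplicate rows or double counting.
function makeStore() {
  var rows = new Set();
  return {
    put: function (url, variable, value) {
      rows.add([url, variable, value].join('\u0000')); // composite primary key
    },
    size: function () { return rows.size; }
  };
}
```

In MySQL the equivalent is simply PRIMARY KEY (url, variable, value) plus INSERT IGNORE or ON DUPLICATE KEY UPDATE.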
Have you tried http://simplehtmldom.sourceforge.net/manual.htm? I found it useful for scraping pages, and it might be helpful for searching the contents.
Use an asynchronous approach to crawl and store the data, so that you can run multiple crawling and storing processes in parallel.
ElasticSearch will be useful to search the stored data.
You can search the HTML using this code:
<?php
// Get the HTML
$page = file_get_contents('http://www.google.com');

// Parse the HTML (suppress warnings caused by malformed markup)
$html = new DOMDocument();
libxml_use_internal_errors(true);
$html->loadHTML($page);
libxml_clear_errors();

// Get the elements you are interested in...
$divArr = $html->getElementsByTagName('div');

foreach ($divArr as $div) {
    echo $div->nodeValue;
}
?>
I have an application that is fetching several e-commerce websites using Curl, looking for the best price.
This process returns a table comparing the prices of all searched websites.
But now we have a problem: the number of stores is starting to increase, and the loading time is unacceptable from the user experience side (currently a 10s page load).
So we decided to create a database and start injecting all the cURL-filtered results into it, in order to reduce the DNS calls and improve page load.
I want to know: despite all our efforts, is there still an advantage to implementing a Memcache module?
I mean, will it help even more or it is just a waste of time?
The Memcache idea was inspired by this topic, of a guy that had a similar problem: Memcache to deal with high latency web services APIs - good idea?
Memcache could be helpful, but (in my opinion) it's kind of a weird way to approach the issue. If it was me, I'd go about it this way:
Firstly, I would indeed cache everything I could in my database. When the user searches, or whatever interaction triggers this, I'd show them a "searching" page with whatever results the server currently has, and a progress bar that fills up as the asynchronous searches complete.
I'd use AJAX to add additional results as they become available. I'm imagining that the search takes about ten seconds - it might take longer, and that's fine. As long as you've got a progress bar, your users will appreciate and understand that Stuff Is Going On.
Obviously, the more searches go through your system, the more up-to-date data you'll have in your database. I'd use cached results that are under a half-hour old, and I'd also record search terms and make sure I kept the top 100 (or so) searches cached at all times.
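A rough sketch of the two caching rules above, the half-hour freshness window and the top-searches tally; the names are illustrative, not from any real API:

```javascript
// Discard cached price rows older than 30 minutes, and keep a running
// tally of search terms so the most popular queries stay warm.
var HALF_HOUR_MS = 30 * 60 * 1000;

function isFresh(cachedAt, now) {
  return now - cachedAt <= HALF_HOUR_MS;
}

function topSearches(tally, term, n) {
  tally[term] = (tally[term] || 0) + 1;
  return Object.keys(tally)
    .sort(function (a, b) { return tally[b] - tally[a]; })
    .slice(0, n); // the n most-searched terms so far
}
```

A cron job could walk the top-100 list and re-run those searches so their cached rows never go stale for real users.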
Know your customers and have what they want available. This doesn't have much to do with any specific technology, but it is all about your ability to predict what they want (or write software that predicts for you!)
Oh, and there's absolutely no reason why PHP can't handle the job. Tying together a bunch of unrelated interfaces is one of the things PHP is best at.
Your solution lies outside the bounds of PHP alone. Don't bother hacking together a result in PHP when a cron job could easily populate your database and your PHP script could simply query it.
If you plan to stick with only PHP, then I suggest you change your script to read from the database you have populated. To populate the results, have a cron job ping a PHP script, not accessible to users, which performs all of your cURL functionality.
I'm currently developing a Zend Framework project, using Doctrine as ORM.
I ran into the typical situation where you have to show a list of items (around 400) in a table, and of course, I don't want to show them all at once.
I've already used Zend_Paginator before (only some basic usage), but I always used to fetch all the items from the DB and then paginate them, and now that doesn't feel quite right.
My question is this: is it better to get all items from the DB first and then paginate them, or to fetch "pages" of items as they are requested? Which would have a larger impact on performance?
For me, it is better to fetch part of the data and then paginate through it.
If you get all the data from the DB, you have to paginate with the help of JavaScript:
The first load of the page will take a long time (though for 400 records it's still OK).
The browser has limited memory. If a user opens a lot of tabs,
and your page takes up a lot of memory with its data,
this will slow down both the browser and your application.
You have only 400 records now, but data tends to grow, and it often grows quickly.
At worst, the whole browser may freeze when the page is opened.
And what if the browser doesn't support JS at all?
If you fetch part of the data from the DB, the only drawback is for
users with a very slow Internet connection (but that is also a drawback of the first option, on the initial page load).
Moving to another page will take a little longer than with JavaScript.
The second option is better (for me) in the long run, because once it works, it will keep working for years.
The database engines are usually best suited to do the retrieval for you. So, in general, if you can delegate a data-retrieval task to the DB engine instead of doing it in-memory and using your programming language, the best bet for performance is to let the DB engine do it for you.
But also remember that if you don't configure the indices correctly or don't run a good query, you won't get the best result out of your DB engine.
However, most DB engines nowadays are capable of optimizing your queries for you and running them in their most efficient form.
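As a sketch of what delegating the retrieval to the engine means in practice, the classic approach in MySQL is to let LIMIT/OFFSET do the page slicing; the table and column names here are hypothetical:

```javascript
// Compute the LIMIT/OFFSET pair for a 1-based page number and build the
// query the DB engine will run, so only one page of rows ever leaves the DB.
function pageQuery(table, page, perPage) {
  var offset = (page - 1) * perPage;
  return 'SELECT * FROM ' + table +
         ' ORDER BY id LIMIT ' + perPage + ' OFFSET ' + offset;
}
```

Zend_Paginator's DbSelect adapter does essentially this for you, fetching only the current page plus a COUNT(*) for the page links.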
I'd like to find a way to take a piece of user supplied text and determine what addresses on the map are mentioned within the text. I'd be happy to use a free web service if it exists or use a script which will not consume too many resources.
One way I can imagine doing this is taking a gigantic database of addresses and searching for each of them individually in the text, but this does not seem efficient. Is there a better algorithm or technique one can suggest?
My basic idea is to take the location information and turn it into markers on a Google Map. If it is too difficult or CPU intensive to determine the locations automatically, I could require users to add information in a location field if necessary but I would prefer not to do this as some of the users are going to be quite young students.
This needs to be done in PHP as that is the scripting language available on my school hosted server.
Note this whole set-up will happen within the context of a Drupal node, and I plan on using a filter to collect the necessary location information from the individual node, so this parsing would only happen once (when the new text enters the database).
You could use something like OpenCalais to tag your text. One of the categories it returns is "city"; you could then use another third-party module to show the location of the city.
If you did have a gigantic list of locations in a relational database, and you're only concerned with 500 to 1,000 words, then you could definitely just pass an SQL command to find matches for the 500-1,000 words, and it would be quite efficient.
But even if you did have to call a slow API, you could feasibly request for 500 words one by one. If you kept a cache of the matches, then the cache would probably quickly fill up with all the stop words (you know, like "the", "if", "and") and then using the cache, it'd be likely that you would be searching much less than 500 words each time.
I think you might be surprised at how fast the brute force approach would work.
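A sketch of that brute-force pass, assuming the city list fits in memory; multi-word names like "New York" would need extra handling (n-gram matching), which this deliberately skips:

```javascript
// Scan a text word by word against a set of known city names.
// Each lookup is O(1), so a 500-1000 word text is a single cheap pass.
function findCities(text, citySet) {
  var words = text.match(/[A-Za-z]+/g) || [];
  var found = new Set();
  words.forEach(function (w) {
    // Normalize capitalization so "paris" and "PARIS" both match "Paris".
    var name = w.charAt(0).toUpperCase() + w.slice(1).toLowerCase();
    if (citySet.has(name)) found.add(name);
  });
  return Array.from(found);
}
```

The returned names could then be geocoded and dropped onto the Google Map as markers.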
For future reference, I would just like to mention the Yahoo API called Placemaker, and the service GeoMaker that is built on top of it.
These tools can be used to parse locations out of a text, as requested here. Unfortunately no Drupal module seems to exist right now, but a custom solution seems easy to code.