How to scrape SERP with PHP (for small project) - php

I thought this would be fairly simple, but it's proving challenging. Google uses https:// now, and Bing redirects to remove http://.
How can I grab the top 5 URLs for a given search term?
I've tried several methods (including loading results into an iframe), but keep hitting brick walls with everything I try.
I wouldn't even need a proxy, as I'm talking about a very small amount of results to be harvested, and I'll only use it for 20-30 terms once every few months. Hardly enough to trigger whiplash from the search giants.
Any help would be much appreciated!
Here's one example of what I've tried:
$query = urlencode("test");
preg_match_all('/<a title=".*?" href=(.*?)>/', file_get_contents("http://www.bing.com/search?q=" . $query), $matches); // $query is already URL-encoded above
echo implode("<br>", $matches[1]);

There are three main ways to do this. Firstly, use the official API for the search engine you're using. Google has one, and most of the others do too. These are often volume limited, but for the numbers you're talking about, you'll be fine.
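For example, a minimal sketch of the API route using Google's Custom Search JSON API (YOUR_API_KEY and YOUR_CSE_ID are placeholders you create in the Google developer console, and the free quota of roughly 100 queries/day easily covers your volume):
<?php
// Hedged sketch: query the Custom Search JSON API and print the result links.
$url = 'https://www.googleapis.com/customsearch/v1?' . http_build_query([
    'key' => 'YOUR_API_KEY',
    'cx'  => 'YOUR_CSE_ID',
    'q'   => 'test',
    'num' => 5,
]);

$results = json_decode(file_get_contents($url), true);

foreach ($results['items'] ?? [] as $item) {
    echo $item['link'], "<br>";
}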
The second way is to use a scraper to visit the search page, enter a search term, and submit the associated form. Since you've specified PHP, I'd recommend Goutte. Internally it uses Guzzle and Symfony components, so it must be good! The Goutte README shows you how easy it is, and selection of HTML fragments is done using either XPath or CSS selectors, so it's flexible too.
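For example, a minimal Goutte sketch might look like this (the CSS selector for Bing's organic results is an assumption and will need checking against the live markup):
<?php
// Minimal sketch using Goutte (composer require fabpot/goutte).
require 'vendor/autoload.php';

use Goutte\Client;

$client  = new Client();
$crawler = $client->request('GET', 'https://www.bing.com/search?q=' . urlencode('test'));

// "li.b_algo h2 a" is an assumed selector for Bing result links; adjust as needed.
$urls = $crawler->filter('li.b_algo h2 a')->each(function ($node) {
    return $node->attr('href');
});

print_r(array_slice($urls, 0, 5));   // top 5 result URLs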
Lastly, given the low volume of required scrapes, consider downloading a free software package from Import.io. This lets you build a scraper using a point-and-click interface, and it learns how to scrape various areas of the page before storing the data in a local or cloud database.

You can also use a third party service like Serp Api to get Google results.
It should be pretty easy to integrate:
$query = [
    "q" => "Coffee",
    "google_domain" => "google.com",
];

$serp = new GoogleSearchResults();
$json_results = $serp->get_json($query);
GitHub project.

Related

URL Pattern Matching (PHP)?

(Programming Language: PHP v5.3)
I am working on a website where I perform searches on specific websites using the Google and Bing search APIs.
The Project:
A user can select a website to search from a drop-down list. We have an admin panel on this website. If the admin wants to add a new website to the drop-down list, he has to provide two sample URLs from the site as shown below.
On submit of the form, code goes through the input and generates a regex that we later use for pattern matching. The regex is stored in the database for later use.
In a different form, the visiting user selects a website from the drop-down list. He then enters the search "query" in a text box. We fetch results as JSON using the search APIs (as mentioned above), using the following query syntax as the search string:
"site:website query"
(where we replace "website" with the website user chose for search and replace "query" with user's search query).
The Problem
Now what we have to do is get the best match for the URL. The reason for doing a pattern match is that sometimes there are unwanted links in the search results. For example, let's say I search the website "www.example.com" for an article named "abcd". Search engines might return these two URLs:
1) www.example.com/articles/854/abcd
2) www.example.com/search/abcd
The first url is the one that I want. Now I have two issues to resolve.
1) I know that the code I wrote to generate a regex pattern from sample URLs is never going to be perfect, considering that the admin adds websites on a regular basis. There can never be enough conditions to cover pattern creation for different websites in the same code. Is there a better way to do this, or is regex my only option?
2) I am developing on a machine running Windows 7. preg_match_all() returns results there. But when I move the code to the server, which runs Linux, preg_match_all() does not return any results for the same parameters. I can't figure out why that is happening. Does anyone know why?
I have been working with web technologies for only the past few weeks, so I don't know if I have better options than regex. I would be very grateful if you could assist me or guide me towards resources where I can find solutions to my problems.
About question 1:
I can't quite grasp what you're trying to accomplish so I can't give any valid opinion.
Regarding question 2:
If both servers are running the same version of PHP, the regex library used ought to be the same. You can test this by running the regex against a mock static file or string and checking whether the results are the same on both machines.
Since you're grabbing results from the search engines and then parsing them, the data retrieved might not be identical: Google/Bing can vary the returned markup depending on the requesting client, and that can change the preg_match_all() results.
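One quick way to check that, with made-up values you'd swap for your real pattern and data:
<?php
// Hypothetical sanity check: run the same pattern against a fixed string on
// both machines. If the counts match, the difference comes from the data the
// search APIs return to each server, not from PCRE itself.
$pattern = '#www\.example\.com/articles/\d+/[\w-]+#';
$fixture = 'see www.example.com/articles/854/abcd and www.example.com/search/abcd';

$count = preg_match_all($pattern, $fixture, $matches);

var_dump(PHP_VERSION, $count, $matches[0], preg_last_error());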

Is it better to try for one mega screen scraper or split it into a scraper for different sites?

I will explain my situation.
Our Social Media Manager (yay) suddenly wants something to scrape a list of about 40 websites for information about our company; for example, there are a lot of review sites in the list.
(I have read a ton of tutorials and SO questions but still) My questions are:
Is it possible to build a generic scraper that will work across all of these sites or do I need a separate scraper for each site?
I think I understand how to parse an individual web page, but how do you do it where, for example, the website structure is review-website.com/company-name and that page holds titles and snippets of reviews that then link to the actual full-page reviews?
i.e. Crawling and scraping multiple pages on multiple sites. Some are 'easier' than others because they have dedicated pages like the urls previously mentioned but some are forums etc with no particular structure that just happen to mention our company name so I don't know how to get relevant information on those.
Is the time spent creating this justified, given that the Social Media Manager could just search these sites manually himself? Especially considering that an HTML change on any of the sites could end up breaking the scraper.
I really don't think this is a good idea yet my Line Manager seems to think it will take a morning's worth of work to write a scraper for all of these sites and I have no idea how to do it!
UPDATE
Thank you very much for the answers so far, I also thought I'd provide a list of the sites just to clarify what I think is an extreme task:
Facebook - www.facebook.com
Social Mention - www.socialmention.com
Youtube - www.youtube.com
Qype - www.qype.co.uk
Money Saving Expert - www.moneysavingexpert.co.uk
Review Centre - www.reviewcentre.com
Dooyoo - www.dooyoo.co.uk
Yelp - www.yelp.co.uk
Ciao - www.ciao.co.uk
All in London - www.allinlondon.co.uk
Touch Local - www.touchlocal.com
Tipped - www.tipped.co.uk
What Clinic - www.whatclinic.com
Wahanda - www.wahanda.com
Up My Street - www.upmystreet.com
Lasik Eyes - www.lasik-eyes.co.uk/
Lasik Eyes (Forum) - forums.lasik-eyes.co.uk/default.asp
Laser Eye Surgery - www.laser-eye-surgery-review.com/
Treatment Saver - www.treatmentsaver.com/lasereyesurgery
Eye Surgery Compare - www.eyesurgerycompare.co.uk/best-uk-laser-eye-surgery-clinics
The Good Surgeon Guide - www.thegoodsurgeonguide.co.uk/
Private Health - www.privatehealth.co.uk/hospitaltreatment/find-a-treatment/laser-eye-surgery/
Laser Eye Surgery Wiki - www.lasereyesurgerywiki.co.uk
PC Advisor - www.pcadvisor.co.uk/forums/2/consumerwatch/
Scoot - www.scoot.co.uk
Cosmetic Surgery Reviews - www.cosmetic-surgery-reviews.co.uk
Lasik Reviews - www.lasikreviews.co.uk
Laser Eye Surgery Costs - www.lasereyesurgerycosts.co.uk
Who Calls Me - www.whocallsme.com
Treatment Adviser - www.treatmentadviser.com/
Complaints Board - http://www.complaintsboard.com
Toluna - http://uk.toluna.com/
Mums Net - http://www.mumsnet.com
Boards.ie - http://www.boards.ie
AV Forums - http://www.avforums.com
Magic Mum - http://www.magicmum.com
That really depends on what sort of websites and data you face.
Option 1: DOM / XPATH based
If you need to parse tables and very detailed things, you need to parse each site with a separate algorithm. One way would be to parse each specific site into a DOM representation and address each value via XPath. This takes time, is affected by structure changes, and if you have to scrape each of these sites this way it will cost you more than a morning.
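For example, a minimal DOM/XPath sketch in PHP might look like this (the URL and XPath expression are placeholders you would tailor per site):
<?php
// Option 1 sketch: load one page into a DOM and address values via XPath.
$html = file_get_contents('https://www.reviewcentre.com/some-company-page');

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // real-world HTML is rarely well-formed
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
foreach ($xpath->query('//div[@class="review"]//p') as $node) {   // placeholder XPath
    echo trim($node->textContent), PHP_EOL;
}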
Option 2: Density based
However, if you need to parse something like a blog article and you only want to extract the article text, there are pretty good density-based algorithms which work across HTML structure changes. One of those is described here: https://www2.cs.kuleuven.be/cwis/research/liir/publication_files/978AriasEtAl2009.pdf
An implementation is provided here: http://apoc.sixserv.org/code/ce_density.rb
You would have to port it to PHP. For blogs and news sites this is a really effective approach.
Option 3: Pragmatic
If you do not care about layout and structure and only want the raw data, you might download the contents and simply strip the tags. However, this will leave a lot of noise in the resulting text.
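A rough sketch of that pragmatic route:
<?php
// Option 3 sketch: grab the raw page and strip the markup. Quick, but noisy.
$html = file_get_contents('https://www.example-review-site.com/company'); // placeholder URL
$text = strip_tags($html);
$text = preg_replace('/\s+/', ' ', $text);   // collapse whitespace
echo $text;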
Update
After your update, I would work through the following in order:
Check which pages you are not allowed to scrape; some of the sites on this list will certainly prohibit it.
You will need much more time than a day. I would discuss this, and the legal questions, with the project lead.
Choose one of the options above per page.
I would create a scraper for each site, but build a library with common functionality (e.g. opening a page, converting it to a DOM, reporting errors, storing results); see the sketch after this list.
Try to avoid regular expressions when scraping: a small change will break the scraper. Use the website's DOM structure instead (XPath). Much more reliable.
Tell your boss it is going to take quite a bit of time.
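As a rough illustration of the shared-library point above (class names, URLs and XPath queries are made up):
<?php
// Hypothetical skeleton only: common plumbing in one base class, one small
// subclass per site.
abstract class BaseScraper
{
    protected function fetchDom(string $url): DOMXPath
    {
        $doc = new DOMDocument();
        libxml_use_internal_errors(true);
        $doc->loadHTML(file_get_contents($url));
        libxml_clear_errors();
        return new DOMXPath($doc);
    }

    protected function report(string $message): void
    {
        error_log('[scraper] ' . $message);   // central error reporting
    }

    /** @return string[] mentions/reviews found on the site */
    abstract public function scrape(): array;
}

class YelpScraper extends BaseScraper
{
    public function scrape(): array
    {
        $xpath = $this->fetchDom('https://www.yelp.co.uk/biz/your-company');   // placeholder URL
        $mentions = [];
        foreach ($xpath->query('//p[contains(@class, "comment")]') as $node) { // placeholder XPath
            $mentions[] = trim($node->textContent);
        }
        return $mentions;
    }
}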
Good luck.

Determine context/meaning of a web page (or paragraph of text)

Of course Google has been doing this for years! However, rather than start from scratch, spend 10+ years and squander large sums of money :) I was wondering if anyone knows of a simple PHP library that would return a list of important words (and/or some sort of context) from a web page or chunk of text.
On a basic level, I am guessing that most spiders will pull in words, remove words without real meaning, and then count the rest. The most frequently occurring words would most likely be what I'm interested in.
Any sort of pointers would be really appreciated!
Latent Semantic Indexing.
I can give you pointers, but you want to look up/research Latent Semantic Indexing.
Rather than explain it, here is a quick snippet from a webpage.
Latent semantic indexing is essentially a way of extracting the meaning from a document without matching a specific phrase. A simple example would be that a document featuring the words ‘Windows’, ‘Bing’, ‘Excel’ and ‘Outlook’ would be about Microsoft. You wouldn’t need ‘Microsoft’ to appear again and again to know that.
This example also highlights the importance of taking into account related words because if ‘windows’ appeared on a page that also featured ‘glazing’, it would most likely be an entirely different meaning.
You can of course go down the easy route of dropping all stop words from the text corpus, but LSI is definitely more accurate.
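That easy route might look something like this in PHP (the stop-word list is just a tiny illustrative sample and the URL is a placeholder):
<?php
// Naive keyword extraction: strip markup, drop stop words, count what's left.
$stopWords = ['the', 'a', 'an', 'and', 'or', 'of', 'to', 'in', 'is', 'it', 'that', 'for', 'on', 'with'];

$text = strtolower(strip_tags(file_get_contents('https://example.com/article')));
preg_match_all('/[a-z]{3,}/', $text, $matches);    // crude tokeniser: words of 3+ letters
$words = array_diff($matches[0], $stopWords);      // remove stop words

$counts = array_count_values($words);
arsort($counts);
print_r(array_slice($counts, 0, 10, true));        // ten most frequent words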
I will update this post with more info in about 30 minutes.
(Still intending to update this post - Got too busy with work).
Update
Okay, so the basic idea behind LSA is to offer a new/different approach for retrieving a document based on a particular search term. You could also use it quite easily to determine the meaning of a document, though.
One of the problems with the search engines of yesteryear was that they were based on keyword analysis. If you take Yahoo/Altavista from the late 1990s through to probably 2002/03 (don't quote me on this), they were extremely dependent on ONLY using keywords as a factor in retrieving a document from their index. Keywords, however, don't translate to anything other than the keyword which they represent.
The keyword "hot", though, means lots of things depending on the context in which it is placed. If you were to take the term "hot" and identify that it was placed around other terms such as "chillies", "spices" or "herbs", then conceptually it means something totally different from the term "hot" when surrounded by other terms such as "heat" or "warmth", or "sexy" and "girl".
LSA attempts to overcome these deficiencies by working on a matrix of statistical probabilities (which you build yourself).
Anyway, on to some tools that help you to build this matrix of documents/terms (and cluster them by proximity within the corpus). This works to the benefit of search engines by transposing keywords into concepts, so that if you search for a particular keyword, that keyword might not even appear in the documents which are retrieved, but the concept which the keyword represents does.
I've always used Lucene / Solr for search, and a quick Google search for "Solr LSA LSI" returned a few links.
http://www.ccri.com/blog/2010/4/2/latent-semantic-analysis-in-solr-using-clojure.html
This guy seems to have created a plugin for it.
http://github.com/algoriffic/lsa4solr
I might check it out over the next few weeks and see how it gets on.
Go have a look at Calais and Zemanta. Very cool stuff!
Personally, I'd be inclined to use something like a Brill parser to identify the part of speech of each word, discarding pronouns, verbs, etc and using that to extract a list of nouns (possibly with any qualifying adjectives) to build that list of keywords. You can find a PHP implementation of a Brill Parser on Ian Barber's PHP/IR site.

How to do search without using a scripting language, by using jQuery?

I have an HTML site with around 100 HTML files, and I want to develop a search feature. If the user types any word and hits search, I want to display the related content containing that keyword. Is it possible to do this without using any server-side scripting? Can it be implemented using jQuery or JavaScript? Please let me know if you have any ideas!
Thanks in advance.
Possible? Yes. You can download all the files via AJAX, save their contents in an array of strings, and search the array.
The performance however would be dreadful. If you need full text search, then for any decent performance you will need a database and a special fulltext search engine.
There are three approaches:
A series of Ajax indexing requests: very slow, not recommended.
Use a DB to store key terms/page references and perform a fulltext search.
Utilise off-the-shelf functionality, such as that offered by Google.
The only way this can work is if you have a list of all the pages on the page you are searching from. So you could do this:
pages = new Array("page1.htm","page2.htm"...)
and so on. The problem with that is that to search for the results, the browser would need to do a GET request for every page:
for (var i in pages)
    $.get(pages[i], function (result) { searchThisPage(result) });
Doing that 100 times would mean a long wait for the user. Another way I can think of is to have all the content served in an array:
pages = {
    "index" : "Some content for the index",
    "second_page" : "Some content for the second page"
    // etc...
}
Then each page could reference this one script to get all the content, include the content for itself in its own content section, and use the rest for searching. If you have a lot of data, this would be a lot to load in one go when the user first arrives at your site.
The final option I can think of is to use the Google search API: http://code.google.com/apis/customsearch/v1/overview.html
Quite simply - no.
Client-side javascript runs in the client's browser. The client does not have any way to know about the contents of the documents within your domain. If you want to do a search, you'll need to do it server-side and then return the appropriate HTML to the client.
The only way to technically do this client-side would be to send the client all the data about all of the documents, and then get them to do the searching via some JS function. And that's ridiculously inefficient, such that there is no excuse for getting them to do so when it's easier, lighter-weight and more efficient to simply maintain a search database on the server (likely through some nicely-packaged third party library) and use that.
some useful resources
http://johnmc.co/llum/how-to-build-search-into-your-site-with-jquery-and-yahoo/
http://tutorialzine.com/2010/09/google-powered-site-search-ajax-jquery/
http://plugins.jquery.com/project/gss
If your site is allowing search engine indexing, then fcalderan's approach is definitely the simplest approach.
If not, it is possible to generate a text file that serves as an index of the HTML files. This would probably be only rudimentarily successful, but it is possible. You could use something like the keywording in Toby Segaran's book to build a JSON text file. Then use jQuery to load the text file, find the instances of the keywords, de-duplicate the resulting filenames and display the results.

How do I grab just the parsed Infobox of a wikipedia article?

I'm still stuck on my problem of trying to parse articles from Wikipedia. Actually, I wish to parse the infobox section of Wikipedia articles: my application has references to countries, and on each country page I would like to be able to show the infobox from the corresponding Wikipedia article for that country. I'm using PHP here. I would greatly appreciate it if anyone has any code snippets or advice on what I should be doing.
Thanks again.
EDIT
Well, I have a db table with names of countries, and I have a script that takes a country and shows its details. I would like to grab the infobox (the blue box with all the country details, images, etc.) as-is from Wikipedia and show it on my page. I would like to know a really simple and easy way to do that, or to have a script that just downloads the infobox information to a local system which I could access myself later on. I mean, I'm open to ideas here, except that the end result I want is to see the infobox on my page, of course with a little "Content by Wikipedia" link at the bottom :)
EDIT
I think I found what I was looking for on http://infochimps.org: they have loads of datasets, I think in the YAML format. I can use this information straight up as it is, but I would need a way to update it from Wikipedia now and then, although I believe infoboxes rarely change, especially for countries, unless some nation decides to change its capital city or so.
I'd use the Wikipedia (MediaWiki) API. You can get data back in JSON, XML, PHP native format, and others. You'll then still need to parse the returned information to extract and format the info you want, but the infobox start, stop, and information types are clear.
Run your query with rvsection=0, as this first section gets you the material before the first section break, including the infobox. Then you'll need to parse the infobox content, which shouldn't be too hard. See en.wikipedia.org/w/api.php for the formal Wikipedia API documentation, and www.mediawiki.org/wiki/API for the manual.
Run, for example, the query: http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xmlfm&titles=fortran&rvsection=0
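As a rough PHP sketch of that approach (switching format to json so the response can be decoded directly; the infobox extraction at the end is only a crude placeholder, real template parsing takes more care):
<?php
// Fetch the raw wikitext of section 0 (the lead section, including the infobox).
$url = 'https://en.wikipedia.org/w/api.php?' . http_build_query([
    'action'    => 'query',
    'prop'      => 'revisions',
    'rvprop'    => 'content',
    'rvsection' => 0,
    'format'    => 'json',
    'titles'    => 'Niger',
]);

$data     = json_decode(file_get_contents($url), true);
$page     = current($data['query']['pages']);
$wikitext = $page['revisions'][0]['*'];

// Very rough: show the wikitext from where the infobox template starts.
if (preg_match('/\{\{Infobox.*/si', $wikitext, $m)) {
    echo substr($m[0], 0, 500), '...';
}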
I suggest you use DBpedia instead, which has already done the work of turning the data in Wikipedia into usable, linkable, open forms.
It depends what route you want to go. Here are some possibilities:
Install MediaWiki with appropriate modifications. It is, after all, a PHP app designed precisely to parse wikitext...
Download the static HTML version, and parse out the parts you want.
Use the Wikipedia API with appropriate caching.
DO NOT just hit the latest version of the live page and redo the parsing every time your app wants the box. This is a huge waste of resources for both you and Wikimedia.
There are a number of semantic data providers from which you can extract structured data instead of trying to parse it manually:
DBpedia - as already mentioned, provides a SPARQL endpoint which can be used for data queries (a rough sketch follows after this list). There are a number of client libraries available for multiple platforms, including PHP.
Freebase - another Creative Commons data provider. The initial dataset is based on parsed Wikipedia data, but some information is taken from other sources. The data set can be edited by anyone and, in contrast to Wikipedia, you can add your own data into your own namespace using a custom-defined schema. It uses its own query language called MQL, which is based on JSON. Data has WebID links back to the corresponding Wikipedia articles. Freebase also provides a number of downloadable data dumps and has a number of client libraries, including PHP.
Geonames - a database of geographical locations. Has an API which provides country and region information for given coordinates, plus nearby locations (e.g. city, railway station, etc.).
OpenStreetMap - a community-built map of the world. Has an API allowing you to query for objects by location and type.
Wikimapia API - another location service
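As a rough illustration of the DBpedia option above (the SPARQL query and the dbo:capital property are assumptions to verify against the DBpedia ontology):
<?php
// Query DBpedia's public SPARQL endpoint for a country's capital.
$sparql = '
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>
    SELECT ?capital WHERE { dbr:Niger dbo:capital ?capital }
';

$url = 'https://dbpedia.org/sparql?' . http_build_query([
    'query'  => $sparql,
    'format' => 'application/sparql-results+json',
]);

$data = json_decode(file_get_contents($url), true);
foreach ($data['results']['bindings'] as $row) {
    echo $row['capital']['value'], PHP_EOL;   // a dbpedia.org/resource URI
}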
To load the parsed first section, simply add this parameter to the end of the API URL:
rvparse
Like this:
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xmlfm&titles=fortran&rvsection=0&rvparse
Then parse the HTML to get the infobox table (using a regex):
$url = "http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=json&titles=Niger&rvsection=0&rvparse";
$data = json_decode(file_get_contents($url), true);
$data = current($data['query']['pages']);
$regex = '#<\s*?table\b[^>]*>(.*)</table\b[^>]*>#s';
$code = preg_match($regex, $data["revisions"][0]['*'], $matches);
echo($matches[0]);
If you want to parse all the articles in one go, Wikipedia makes its full article database available as XML dumps:
http://en.wikipedia.org/wiki/Wikipedia_database
Otherwise you can screen scrape individual articles.
To update this a bit: a lot of the data in Wikipedia infoboxes is now taken from Wikidata, which is a free database of structured information. See the data page for Germany, for example, and https://www.wikidata.org/wiki/Wikidata:Data_access for information on how to access the data programmatically.
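For instance, a hedged PHP sketch against the Wikidata API (Q183 is Germany and P36 is the 'capital' property; verify both before relying on them):
<?php
// Fetch the claims for Germany (Q183) and read the 'capital' statement (P36).
// The value is another item id (Berlin is Q64) that a second wbgetentities
// call would resolve to a human-readable label.
$url = 'https://www.wikidata.org/w/api.php?' . http_build_query([
    'action' => 'wbgetentities',
    'ids'    => 'Q183',
    'props'  => 'claims',
    'format' => 'json',
]);

$data   = json_decode(file_get_contents($url), true);
$claims = $data['entities']['Q183']['claims'];

echo $claims['P36'][0]['mainsnak']['datavalue']['value']['id'], PHP_EOL;   // e.g. "Q64"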
# Python example using requests and BeautifulSoup to scrape the rendered infobox
import re
import requests
from bs4 import BeautifulSoup

def extract_infobox(term):
    url = "https://en.wikipedia.org/wiki/" + term
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    tbl = soup.find("table", {"class": "infobox"})
    if not tbl:
        return {}
    list_of_table_rows = tbl.findAll('tr')
    info = {}
    for tr in list_of_table_rows:
        th = tr.find("th")
        td = tr.find("td")
        if th is not None and td is not None:
            innerText = ''
            for elem in td.recursiveChildGenerator():
                if isinstance(elem, str):
                    # remove references like [1]
                    clean = re.sub(r"([\[]).*?([\]])", r"\g<1>\g<2>", elem.strip())
                    # add a simple space after removing references for word separation
                    innerText += clean.replace('[]', '') + ' '
                elif elem.name == 'br':
                    innerText += '\n'
            info[th.text] = innerText
    return info
I suggest performing a web request against Wikipedia. From there you will have the page, and you can simply parse or query out the data that you need using a regex, a character crawl, or some other method you are familiar with. Essentially a screen scrape!
EDIT - I would add that you can use HtmlAgilityPack for those in C# land; for PHP, Simple HTML DOM looks like the equivalent. Having said that, it looks like Wikipedia has a more than adequate API. This question probably answers your needs best:
Is there a Wikipedia API?
