Parse Google search results in PHP - Keyword checker [closed] - php

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
I want to do a little script, where I can Google for my keywords daily.
What is the best approach for this?
If I use an API (though I don't think there is one for this task), is there a limit?
I want to check for the first 100-200 results.

Do your search manually once, copy the resulting URL that points to the results page
Write a PHP script that:
fetches the content from that URL using file_get_contents()
parses the full HTML result back to a PHP array containing only search result data that is relevant to you
writes the array to database or file system
Run the PHP-script as a cron job on your server (hourly, daily, whatever you prefer)
Be prepared to update your script whenever Google changes the format of its results page
Get yourself a lawyer
Better yet, get yourself a commercial license as indicated by mario. That way you can skip all the steps above (especially steps 4 and 5 can be skipped).
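A minimal sketch of the fetch / parse / store script described above; the query URL and the XPath selector are assumptions and will need updating whenever Google changes its results page:

<?php
// Sketch only: Google's markup changes often and automated querying may violate its ToS.
$url = 'https://www.google.com/search?q=my+keyword&num=100'; // hypothetical results URL

// 1) Fetch the results page.
$context = stream_context_create([
    'http' => ['header' => "User-Agent: Mozilla/5.0 (compatible; KeywordChecker/1.0)\r\n"],
]);
$html = file_get_contents($url, false, $context);

// 2) Parse the full HTML back to an array containing only the relevant data.
$doc = new DOMDocument();
libxml_use_internal_errors(true);            // the page is not valid XML
$doc->loadHTML($html);
libxml_clear_errors();

$xpath   = new DOMXPath($doc);
$results = [];
foreach ($xpath->query('//a[h3]') as $a) {   // assumption: organic results wrap an <h3> title
    $results[] = [
        'title' => trim($a->getElementsByTagName('h3')->item(0)->textContent),
        'url'   => $a->getAttribute('href'),
    ];
}

// 3) Write the array to the file system (a DB insert works the same way).
file_put_contents('results-' . date('Y-m-d') . '.json', json_encode($results, JSON_PRETTY_PRINT));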

Your big problem is that Google results are very customised now - depending on what you are searching for, results can be customised based on your exact location (not just country), time of day, search history, etc.
Hence, your results probably won't be completely constant and certainly won't be the same as for somebody a few miles away with a different browser history, even if they search for exactly the same thing.
There are various SEO companies offering tools to make the results more standardised, and these tools won't break the Google Terms of Service.
Try: http://www.seomoz.org/tools and http://tools.seobook.com/firefox/rank-checker/

I wrote a PHP script which parses/scrapes the top 1000 results gracefully, without any personalized effects from Google, along with a better version called true Google Search API (which generalizes the task, returning an array of nicely formatted results).
Both of these scripts work server-side and parse the results directly from the results page using cURL and regex.

A few months ago I worked with the GooHackle guys, and they have a web application that does exactly what you're looking for; plus the cost is not high, they have plans under $30/month.
Like Blowski already said, nowadays Google results are very customized, but if you always search using the same country and query parameters, you can get a pretty accurate view of your rankings for several keywords and domains.
If you want to develop the app yourself, it's not going to be too difficult either; you can use PHP or any other language to periodically run the queries and save the results in a DB. There are basically only two points to resolve: do the HTTP queries (easily done with cURL) and parse the results (you can use regex or the DOM structure). Then, if you want to monitor thousands of keywords and domains, things turn a little more difficult, because Google starts to ban your IP addresses.
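For those two points (HTTP query with cURL, then parse), a rough sketch; the query parameters and the regex are assumptions and deliberately simple:

<?php
// Fetch a results page with cURL, keeping country/language parameters constant
// so day-to-day rankings stay comparable (parameter values are assumptions).
function fetchSerp(string $keyword): string
{
    $ch = curl_init('https://www.google.com/search?' . http_build_query([
        'q'   => $keyword,
        'num' => 100,
        'hl'  => 'en',  // interface language
        'gl'  => 'us',  // country
    ]));
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; RankChecker/0.1)',
        CURLOPT_TIMEOUT        => 30,
    ]);
    $html = curl_exec($ch);
    curl_close($ch);
    return $html !== false ? $html : '';
}

// Very rough regex parse; a DOM parser is usually the safer choice.
preg_match_all('/<a href="(https?:\/\/[^"]+)"[^>]*><h3/i', fetchSerp('my keyword'), $matches);
print_r($matches[1]); // result URLs in ranking order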
I think that the apps like this from the "big guys" have hundreds or thousands of different IP addresses, from different countries. That allows them to collect the Google results for a huge number of keywords.
Regarding the online tool that I initially mentioned, they also have an online Google scraper that anybody can use and shows how this works, just query and parse.

SEOPanel works around a similar issue: you could download the open source code and extract a simple keyword parser for your search results. The "trick" it uses is to slow down the query rate; the project itself is hosted on Google Code.

Related

AWS S3, speed and server load [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 7 years ago.
I hope this question is the right sort (if not, instead of just down-voting, could someone point me to where I can get this answered?).
I am trying to be a forward thinker and tackle a problem before it arises.
Scenario
I have a small mailing list website where every week I post links to things web related that I like and find useful.
There is a latest article page which shows everything that I added into the database that month. At the moment there are 6 sections: Intro, News, Design, Development, Twitter, Q&A. The website shows these like so:
Section
Check database for all entries that match {Month} && {Section}
Foreach {Section} return {Title} {Desc} {Link}
I usually have about 3 links per section. This also means 6 db requests per page view.
Concerns
When/if the site gains in popularity, let's say I get a 5k visitor spike: that's 30,000 DB requests, which I don't think my server host will like or look the other way on, probably ending up with me being charged a lot and my site crashing.
Question
Which one of these solutions do you think will be the wisest in terms of speed and lowering server resources with requests:
1) Use PHP to make one DB request getting all the entries, adding them into an array and then looping through the array to generate the sections
2) Use a cron to generate all the month's entries, make them into a JSON file and parse that JSON on page load, hosted on my server
3) Use a cron to generate all the month's entries, make them into a JSON file and parse that JSON on page load, hosted on AWS S3
4) Use a cron to generate all the entries as separate text files, for example february-2016-intro-one.txt, save them on S3, then on the latest article page fetch the text files for each one and parse them
Discussion
If you have any other ideas, I would be happy to hear them :)
Thanks for your patience in reading this; looking forward to your replies.
CloudFront is a web service that speeds up distribution of your static and dynamic web content, for example, .html, .css, .php, and image files, to end users.
My advice would be to stick with the cron job and save the JSON file to S3, but also use CloudFront to host the file from your S3 bucket. Though hosting from your server may seem fastest, because it is located in only 1 place or region, speed will vary depending on where people access it from. If a user is viewing your site from far away, they will have a slower load time than someone who is closer.
With CloudFront, your file(s) are distributed and cached on Amazon's 50+ edge locations around the world, giving you the fastest and most reliable delivery times.
Also, in the future, if you want your users to see new content as soon as it is updated, take a look into Lambda. It runs code in response to events from other AWS services. So whenever your database is updated (if it's DynamoDB or RDS) you can automatically generate a new JSON file and save it to S3. It will still be distributed by CloudFront once you've got that connection set up.
More information about CloudFront here
More information about Lambda here
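As a rough sketch of that cron step (build the month's JSON once, push it to S3, let CloudFront serve it), assuming the AWS SDK for PHP and hypothetical table, column and bucket names:

<?php
require 'vendor/autoload.php';

use Aws\S3\S3Client;

// One query for the whole month, grouped by section so the page needs no further DB work.
$pdo  = new PDO('mysql:host=localhost;dbname=newsletter;charset=utf8', 'user', 'pass');
$stmt = $pdo->prepare('SELECT section, title, `desc`, link FROM entries WHERE month = ?');
$stmt->execute([date('Y-m')]);

$bySection = [];
foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
    $bySection[$row['section']][] = $row;
}

// Push the JSON to S3; point a CloudFront distribution at this bucket.
$s3 = new S3Client(['version' => 'latest', 'region' => 'us-east-1']);
$s3->putObject([
    'Bucket'      => 'my-newsletter-bucket',            // assumption
    'Key'         => 'latest/' . date('Y-m') . '.json',
    'Body'        => json_encode($bySection),
    'ContentType' => 'application/json',
]);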
Build a cache!
When you do one of these queries, store the result in a cache together with the current time. When you next go to do one of these queries, check the cache -- if it's within a desired timeframe (eg 10 hours?), just use the stored results; if the time has expired, do another query and store the results again.
This way, you'll only ever do such a complex query once per time period.
As to making the cache: it could be something as simple as storing the result in a file, but it would work better kept in memory (global variable?). I'm not a PHP person, but there seems to be an apc_store() command.
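A minimal sketch of that pattern using APCu (the modern replacement for apc_store()); the table and query are assumptions, and the TTL replaces the manual "check the stored time" step:

<?php
// Serve the month's entries from memory; only hit the DB when the cache has expired.
function getMonthEntries(PDO $pdo, string $month): array
{
    $cacheKey = 'entries_' . $month;

    $cached = apcu_fetch($cacheKey, $hit);
    if ($hit) {
        return $cached;                      // cache hit: no DB query at all
    }

    $stmt = $pdo->prepare('SELECT section, title, `desc`, link FROM entries WHERE month = ?');
    $stmt->execute([$month]);
    $entries = $stmt->fetchAll(PDO::FETCH_ASSOC);

    apcu_store($cacheKey, $entries, 36000);  // keep for 10 hours
    return $entries;
}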

Crawl specific pages and data and make it searchable [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
Important note: the questions below aren't meant to break ANY data copyrights. All crawled and saved data is being linked directly to the source.
For a client I'm gathering information for building a search engine/web spider combination. I do have experience with indexing webpages' inner links with a specific depth. I also have experience in scraping data from webpages. However, in this case, the volume is larger than I have experience with so I was hoping to gain some knowledge and insights in the best practice to do so.
First of all, what I need to make clear is that the client is going to deliver a list of websites that are going to be indexed. So, in fact, a vertical search engine. The results only need to have a link, title and description (like the way Google displays results). The main purpose of this search engine is to make it easier for visitors to search large amounts of sites and results to find what they need.
So: Website A contains a bunch of links -> save all links together with meta data.
Secondly, there's a more specific search engine. One that also indexes all the links to (let's call them) articles; these articles are spread over many smaller sites with a smaller number of articles compared to the sites that end up in the vertical search engine. The reason is simple: the articles found on these pages have to be scraped in as much detail as possible. This is where the first problem lies: it would take a huge amount of time to write a scraper for each website; the data that needs to be collected is, for example: city name, article date, article title. So: Website B contains more detailed articles than website A; we are going to index these articles and scrape useful data.
I do have a method in my mind which might work, but that involves writing a scraper for each individual website, in fact it's the only solution I can think of right now. Since the DOM of each page is completely different I see no option to build a fool-proof algorithm that searches the DOM and 'knows' what part of the page is a location (however... it's a possibility if you can match the text against a full list of cities).
A few things that crossed my mind:
Vertical Search Engine
For the vertical search engine it's pretty straightforward: we have a list of webpages that need to be indexed, so it should be fairly simple to crawl all pages that match a regular expression and store the full list of these URLs in a database.
I might want to split saving page data (meta description, title, etc.) into a separate process to speed up the indexing.
There is a possibility that there will be duplicate data in this search engine due to websites that have matching results/articles. I haven't made my mind up on how to filter these duplicates, perhaps on article title, but in the business segment the data comes from there's a huge chance of duplicate titles for different articles.
Page scraping
Indexing the 'to-be-scraped' pages can be done in a similar way, as long as we know what regex to match the URLs with. We can save the list of URLs in a database.
Use a separate process that runs over all individual pages; based on the URL, the scraper should know what regex to use to match the needed details on the page and write these to the database.
There are enough sites that index results already, so my guess is there should be a way to create a scraping algorithm that knows how to read the pages without having to match the regex completely. As I said before: if I have a full list of city names, there must be an option to use a search algorithm to get the city name without having to say the city name lies in "#content .about .city".
Data redundancy
An important part of the spider/crawler is to prevent it from indexing duplicate data. What I was hoping to do is keep track of the time a crawler starts indexing a website and when it ends; then I'd also keep track of the 'last update time' of an article (based on the URL to the article) and remove all articles that are older than the starting time of the crawl. Because, as far as I can see, these articles no longer exist.
Data redundancy is easier with the page scraper, since my client made a list of "good sources" (read: pages with unique articles). Data redundancy for the vertical search engine is harder, because the sites that are being indexed already make their own selection of articles from "good sources". So there's a chance that multiple sites have a selection from the same sources.
How to make the results searchable
This is a question apart from how to crawl and scrape pages, because once all data is stored in the database, it needs to be searchable at high speed. The amount of data that is going to be saved is still unknown; based on some competitors, my client had an indication of about 10,000 smaller records (vertical search) and maybe 4,000 larger records with more detailed information.
I understand that this is still a small amount compared to some databases you've possibly been working on. But in the end there might be up to 10-20 search fields that a user can use to find what they are looking for. With a high traffic volume and a lot of these searches, I can imagine that using regular MySQL queries for search isn't a clever idea.
So far I've found SphinxSearch and ElasticSearch. I haven't worked with either of them and haven't really looked into the possibilities of both; the only thing I know is that both should perform well with high volume and larger search queries within data.
To sum things up
To sum all things up, here's a shortlist of questions I have:
Is there an easy way to create a search algorithm that is able to match DOM data without having to specify the exact div the content lies within?
What is the best practice for crawling pages (links, title & description)?
Should I split crawling URLs and saving page title/description, for speed?
Are there out-of-the-box solutions for PHP to find (possible) duplicate data in a database (even if there are minor differences, like: if 80% matches -> mark as duplicate)?
What is the best way to create a future proof search engine for the data (keep in mind that the amounts of data can increase, as well as the site traffic and search requests)?
I hope I made all things clear, and I'm sorry for the huge amount of text. I guess it does show that I have already spent some time trying to figure things out myself.
I have experience building large scale web scrapers and can testify that there will always be big challenges to overcome when undertaking this task. Web scrapers run into problems ranging from CPU issues to storage to network problems and any custom scraper needs to be built modular enough to prevent changes in one part from breaking the application as a whole. In my projects I have taken the following approach:
Figure out where your application can be logically split up
For me this meant building 3 distinct sections:
Web Scraper Manager
Web Scraper
HTML Processor
The work could then be divided up like so:
1) The Web Scraper Manager
The Web Scraper Manager pulls URLs to be scraped and spawns Web Scrapers. The Web Scraper Manager needs to flag all URLs that have been sent to the web scrapers as being "actively scraped" and know not to pull them down again while they are in that state. Upon receiving a message from the scrapers, the manager will either delete the row or leave it in the "actively scraped" state if no errors occurred; otherwise it will reset it back to "inactive".
2) The Web Scraper
The Web Scraper receives a URL to scrape, cURLs it, and downloads the HTML. All of this HTML can then be stored in a relational database with the following structure:
ID | URL | HTML (BLOB) | PROCESSING
Processing is an integer flag which indicates whether or not the data is currently being processed. This lets other parsers know not to pull the data if it is already being looked at.
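A sketch of what that scraper step could look like, assuming a MySQL table shaped like the one above (function, table and column names are hypothetical):

<?php
// Fetch one URL handed over by the manager and store the raw HTML with
// PROCESSING = 0 so an HTML Processor can claim it later.
function scrapeUrl(PDO $pdo, int $urlId, string $url): bool
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT        => 30,
    ]);
    $html = curl_exec($ch);
    $ok   = ($html !== false && curl_getinfo($ch, CURLINFO_HTTP_CODE) === 200);
    curl_close($ch);

    if ($ok) {
        $stmt = $pdo->prepare(
            'INSERT INTO scraped_html (id, url, html, processing) VALUES (?, ?, ?, 0)'
        );
        $stmt->execute([$urlId, $url, $html]);
    }
    return $ok; // on false, the manager resets the URL back to "inactive"
}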
3) The HTML Processor
The HTML Processor will continually read from the HTML table, marking rows as active every time it pulls a new entry. The HTML processor has the freedom to operate on the HTML for as long as needed to parse out any data. This can be links to other pages in the site which could be placed back in the URL table to start the process again, any relevant data (meta tags, etc.), images etc.
Once all relevant data has been parsed out the HTML processor would send all this data into an ElasticSearch cluster. ElasticSearch provides lightning-fast full text searches which could be made even faster by splitting the data into various keys:
{
    "url" : "http://example.com",
    "meta" : {
        "title" : "The meta title from the page",
        "description" : "The meta description from the page",
        "keywords" : "the,keywords,for,this,page"
    },
    "body" : "The body content in its entirety",
    "images" : [
        "image1.png",
        "image2.png"
    ]
}
Now your website/service can have access to the latest data in real time. The parser would need to be robust enough to handle any errors, so it can set the processing flag to false if it cannot pull data out, or at least log the failure somewhere so it can be reviewed.
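For the indexing step, a hedged sketch of pushing one processed page into Elasticsearch over its REST API; the host, index name and exact endpoint path (which varies between Elasticsearch versions) are assumptions:

<?php
// Index one processed page; the document shape follows the JSON example above.
function indexPage(array $doc): bool
{
    $id = md5($doc['url']); // deterministic ID so re-crawls update rather than duplicate
    $ch = curl_init('http://localhost:9200/pages/_doc/' . $id);
    curl_setopt_array($ch, [
        CURLOPT_CUSTOMREQUEST  => 'PUT',
        CURLOPT_POSTFIELDS     => json_encode($doc),
        CURLOPT_HTTPHEADER     => ['Content-Type: application/json'],
        CURLOPT_RETURNTRANSFER => true,
    ]);
    curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    return $status === 200 || $status === 201; // 201 = created, 200 = updated
}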
What are the advantages?
The advantage of this approach is that at any time, if you want to change the way you are pulling data, processing data or storing data, you can change just that piece without having to re-architect the entire application. Further, if one part of the scraper/application breaks, the rest can continue to run without any data loss and without stopping other processes.
What are the disadvantages?
It's a big, complex system. Any time you have a big, complex system you are asking for big, complex bugs. Unfortunately, web scraping and data processing are complex undertakings, and in my experience there is no way around having a complex solution to this particularly complex problem.
The crawling and indexing actions can take a while, but you won't be crawling the same site every 2 minutes, so you can consider an algorithm in which you put more effort in crawling and indexing your data, and another algorithm to help you get a faster search.
You can keep crawling your data all the time and update the rest of the tables in the background (every X minutes/hours), so your search results will be fresh all the time but you won't have to wait for the crawl to end.
Crawling
Just get all the data you can (probably all the HTML code) and store it in a simple table. You'll need this data for the indexing analysis. This table might be big, but you don't need good performance while working with it because it's only used in the background and isn't exposed to users' searches.
ALL_DATA
____________________________________________
| Url | Title | Description | HTML_Content |
‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾
Tables and Indexing
Create a big table that contains URLs and keywords
KEYWORDS
_________________
| URL | Keyword |
‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾
This table will contain most of the words in each URL's content (I would remove stop words like "the", "on", "with", "a", etc.).
Create a table with keywords. For each occurrence add 1 to the occurrences column
KEYWORDS
_______________________________
| URL | Keyword | Occurrences |
‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾
Create another table with "hot" keywords which will be much smaller
HOT_KEYWORDS
_________________
| URL | Keyword |
‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾
This table content will be loaded later according to search queries.
The most common search words will be stored in the HOT_KEYWORDS table.
Another table will hold cached search results
CACHED_RESULTS
_________________
| Keyword | Url |
‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾
Searching algorithm
First, you'll search the cached result table. In case you have enough results, select them. If you don't, search the bigger KEYWORDS table. Your data is not that big so searching according to the keyword index won't take too long. If you find more relevant results add them to the cache for later usage.
Note: You have to select an algorithm in order to keep your CACHED_RESULTS table small (maybe to save the last use of the record and remove the oldest record if the cache is full).
This way the cache table will help you reduce the load on the keywords tables and give you ultra fast results for the common searches.
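A sketch of that search order in PHP (cache table first, then the big KEYWORDS table, then warm the cache); table and column names follow the diagrams above, and the cache eviction policy is left out:

<?php
function search(PDO $pdo, string $keyword, int $limit = 20): array
{
    $limit = (int) $limit;

    // 1) Try the small cached-results table first.
    $stmt = $pdo->prepare("SELECT url FROM cached_results WHERE keyword = ? LIMIT $limit");
    $stmt->execute([$keyword]);
    $urls = $stmt->fetchAll(PDO::FETCH_COLUMN);
    if (count($urls) >= $limit) {
        return $urls;
    }

    // 2) Fall back to the big KEYWORDS table, most occurrences first.
    $stmt = $pdo->prepare(
        "SELECT url FROM keywords WHERE keyword = ? ORDER BY occurrences DESC LIMIT $limit"
    );
    $stmt->execute([$keyword]);
    $urls = $stmt->fetchAll(PDO::FETCH_COLUMN);

    // 3) Warm the cache so the next identical query is served from step 1
    //    (assumes a unique key on (keyword, url)).
    $insert = $pdo->prepare('INSERT IGNORE INTO cached_results (keyword, url) VALUES (?, ?)');
    foreach ($urls as $url) {
        $insert->execute([$keyword, $url]);
    }
    return $urls;
}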
Just look at Solr and the Solr wiki. It's an open source search platform from the Lucene project (similar to Elasticsearch).
For the web crawler, you can use either Aperture or Nutch. Both are written in Java. Aperture is a lightweight crawler, but with Nutch you can crawl 1,000 or even more websites.
Nutch will handle the process of crawling the websites. Moreover, Nutch provides Solr support, meaning you can index the data crawled by Nutch directly into Solr.
Using SolrCloud you can set up multiple clusters with shards and replication to prevent data loss and enable fast data retrieval.
Implementing your own web crawler is not that easy, and for search, a regular RDBMS makes it very complicated to retrieve the data at run time.
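Once Nutch has filled the Solr index, querying it from PHP is just an HTTP call; a small sketch (host and core name are assumptions):

<?php
// Query a Solr core over its HTTP API and decode the JSON response.
function searchSolr(string $query, int $rows = 10): array
{
    $url = 'http://localhost:8983/solr/articles/select?' . http_build_query([
        'q'    => $query,
        'rows' => $rows,
        'wt'   => 'json',
    ]);
    $response = json_decode(file_get_contents($url), true);
    return $response['response']['docs'] ?? [];
}

print_r(searchSolr('title:crawler'));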
I've had my experiences with crawling websites, and it is a really complicated topic.
Whenever I've got some problem within this area, I look at what the best people at this do (yup, Google).
They have a lot of nice presentations about what they are doing and they even release some (of their) tools.
phpQuery for example is a great tool when it comes to searching specific data on a website, I'd recommend to have a look at it if you don't know it yet.
A little trick I've done in a similar project was to have two tables for the data.
The data had to be as up to date as possible, so the crawler was running most of the time and there were problems with locked tables. So whenever the crawler wrote into one table, the other one was free for the search engine, and vice versa.
I have built a web crawler for detecting news sites, and it's performing very well.
It basically downloads the whole page, saves it, and prepares it for another scraping pass which looks for keywords. It then basically tries to determine if the site is relevant using keywords. Dead simple.
You can find the source code for it here. Please help contribute :-)
It's a focused crawler which doesn't really do anything other than look for sites and rank them according to the presence of keywords. It's not usable for huge data loads, but it's quite good at finding relevant sites.
https://github.com/herreovertidogrom/crawler.git
It's a bit poorly documented - but I will get around to that.
If you want to do searches of the crawled data, and you have a lot of data, and aspire to build a future proof service, you should NOT create a table with N columns, one for each search term. This is a common design if you think the URL is the primary key. Rather, you should avoid a wide-table design like the plague, because disk I/O reads get incredibly slow on wide table designs. You should instead store all data in one table, specify the key and the value, and then partition the table on the variable name.
Avoiding duplicates is always hard. In my experience from data warehousing: design the primary key and let the DB do the job. Using source + key + value as a primary key helps you avoid double counting, and has few restrictions.
May I suggest you create a table like this :
URL, variable, value and make that primary key.
Then write all data into that table, partition on distinct variable and implement search on this table only.
It avoids duplicates, it's fast, and it's easily compressible.
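A small sketch of a search against that narrow (URL, variable, value) table; the table and column names are hypothetical:

<?php
// Find every URL whose stored value for one variable matches a search term.
function findUrls(PDO $pdo, string $variable, string $needle): array
{
    $stmt = $pdo->prepare(
        'SELECT DISTINCT url FROM crawled_data WHERE variable = ? AND value LIKE ?'
    );
    $stmt->execute([$variable, '%' . $needle . '%']);
    return $stmt->fetchAll(PDO::FETCH_COLUMN);
}

// Example: every article whose "city" value mentions Amsterdam.
print_r(findUrls($pdo, 'city', 'Amsterdam'));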
Have you tried http://simplehtmldom.sourceforge.net/manual.htm? I found it useful for scraping pages, and it might be helpful for searching the contents.
Use an asynchronous approach to crawl and store the data, so that you can run multiple crawl and store operations in parallel.
ElasticSearch will be useful to search the stored data.
You can search the HTML using this code:
<?php
// Get the HTML
$page = file_get_contents('http://www.google.com');
// Parse the HTML
$html = new DOMDocument();
libxml_use_internal_errors(true); // real-world HTML is rarely valid, so silence parser warnings
$html->loadHTML($page);
// Get the elements you are interested in...
$divArr = $html->getElementsByTagName('div');
foreach ($divArr as $div) {
    echo $div->nodeValue;
}
?>

How to add a search functionality to a PHP website, without using Java or Google site search? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 9 years ago.
I would like to add a search field to my site. The site is based on PHP and the Yii framework. The web server assembles data from multiple sources (files and APIs) before serving the resulting web page (the web server will get these pieces of data out of a MySQL database sooner or later, but at the moment it's just files and API results).
Apache's Lucene could answer the problem, but there is no way to use Java on the server - I am on a shared Linux host.
Google Site Search (or Bing's, etc.) could answer the problem, but I would like to have a fully-customizable search box, and to add some results of my own to the proposed results.
I could create my own search engine, indexing pages and using different weights according to where each piece of data comes from, to get precise results; but I think there must be something out there that would be more efficient, and quicker to implement.
What would be a way to add quick search functionality to a PHP-based website, without using Java or Google site search?
I use Zend Framework and consequently Zend_Search_Lucene. It's a pure PHP implementation of Lucene-style full-text search. You can define your own "document" (as an aggregate of your data), weight axes, and build indexes relatively straightforwardly. The downside, in my experience, is that it's much slower at indexing and querying than (e.g.) Solr.
Update 1
In response to comment, here's a link: how I use Zend_Search_Lucene for spatial searches. The code there demonstrates a few things:
Lines 54-62 show how to add a "document" to the index. In this example, the document only has two fields (longitude & latitude), but you get the idea. Just put this in a loop and add documents to your index. In production operation, I keep track of changes to data, and update the index when any data going into indexed documents changes. The initial import is very slow -- empirically, I found the algorithm is at least O(n log n) with a pretty big K, while solr was more like O(log n).
Lines 42-52 show how to search an index. This search is a bit more complicated than usual, because I have to encode longitude and latitude in the same way its encoded in the index. The article explains why this has to be done, but suffice to say: if you just have text data, the index searching is not this hard.
Line 40 is creating the index, which both the "add" and "search" mentioned in the previous two bullets requires. Note that keeping the index on a fast medium (like SD storage) lowers the K in the algorithm, but it's still (empirically, not analytically) O(n log n).
Lines 1-38 are the helpers needed to normalize a longitude and latitude into a format that Zend_Search_Lucene supports. Again, if you have only text data, this complication isn't necessary; see the plain-text sketch below.
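For that simpler text-only case, the add/search steps above boil down to something like this (Zend Framework 1 API; field names and paths are assumptions):

<?php
require_once 'Zend/Search/Lucene.php'; // assumes ZF1 is on the include path

// Create (or later, open) the index on disk.
$index = Zend_Search_Lucene::create('/path/to/index');

// Add a "document": an aggregate of whatever data should be searchable.
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::Text('title', 'My page title'));
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents', 'Full body text of the page'));
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('url', 'http://example.com/page'));
$index->addDocument($doc);

// Search it.
foreach ($index->find('title:page') as $hit) {
    echo $hit->score, ' ', $hit->url, PHP_EOL;
}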
Update 2 Responding to the comment on performance. Putting the index on a fast medium (SD, RAM disk w/ sync, whatever) speeds it up a bit. Using unstored fields also helps a bit. Both of these reduce the constant in the empirical O(n log n), but still the dominant problem is that n multiplier. What Zend appears to do is, upon each add, re-shuffle most or all of the previous adds to the index. As far as I can tell, this is the algorithm in play during index build and can't be modified.
The way I got around that n-multiplier was to use a Zend Page Cache based on the stemmed query (so if someone types "blueberries", "blueberry", "blue berry", "blu bary", etc. they all get stemmed and fixed to the soundex phonetic "blue-bear-ee"). Common queries get almost instant results, and since the particular domain was read-heavy and insert-latent, this was an acceptable solution. Obviously in general it's not.
In other circumstances, there is the setResultSetLimit() method, which when used with scoring, will return results faster. If you don't care about all possible results, just the top N results, then this is the way to go.
Finally, all this experience is with respect to Zend 1.x. I do not know if this has been addressed in 2.x.
There are a lot of search engines. Personally I like Sphinx Search, but you need to be able to compile and run it on your (or a remote) server. You can also look at PHP-based search engines like SeekQuarry.
You need to have all the data (page name and URL) in a database, and then you can make the search function using the LIKE operator in a MySQL query:
mysql_query("
    SELECT *
    FROM `table`
    WHERE name LIKE '%keyword%'
");

I need a local site search with a particular set of features [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Closed 10 years ago.
I'm having to build a directory type website that will be fairly feature intensive on the search side of things.
It will have a lot of doctors in it (doctors' names, addresses, biographies, specialties, etc.). All of this will be stored in a database. There will be a few preferred member doctors who, when showing up as a search result, will be at the top of the search above all others.
I need the user to be able to search by entering their query into ONE field, with the option of adding their zip code to a second field by the search (but is not required). If they add their zip code, the search will bring up doctors closest to them first (like a Google local search). If they don't enter their zip code, can I geo-target them and still use the same feature to show doctors closest to them first?
I'm entering all of the doctors' information into a database table, because I don't want to have to manually create a page for each doctor; that would take forever as there are over 4,000 doctors. I'm going to just create a page they can each log onto to type their info into fields and hit submit, and have that place it in the tables for me (this part I'm fine with, I can handle it).
Can anyone suggest a search engine or search tool/program I can use that I can tweak to get this all to work?
I've used Sphider on several sites and really like it, but I can't imagine how I could use it and get all of these features. I know this is a very ambitious project, so any help anywhere is invaluable.
Thank you all! Wish me luck!
I just need to understand one thing: as you said, the doctors' data is stored in a table of your database, so why do you need an extra or third-party tool to search? You can easily write a search script for your table.
As I understood, your table includes about 4,000 records. I have an application with a table of 6,236 records (the verses of the Quran), and I manage my own search through it easily.
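As a sketch of such a script, a single-field search over the doctors table that always lists preferred members first (table and column names are assumptions; proximity sorting by zip code would be layered on top of this):

<?php
function searchDoctors(PDO $pdo, string $query): array
{
    $stmt = $pdo->prepare(
        'SELECT name, address, specialty, preferred
           FROM doctors
          WHERE name LIKE ? OR specialty LIKE ? OR biography LIKE ?
          ORDER BY preferred DESC, name ASC'   // preferred members float to the top
    );
    $like = '%' . $query . '%';
    $stmt->execute([$like, $like, $like]);
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}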

best way to set up a MySQL database for storing web data

I will be using curl to retrieve thousands of adult websites. My goal is to store them in MySQL to allow users to easily search the new database and find their desired page without having to endure all the popups, spyware, etc.
It will be an adult website search engine... kinda the google of adult websites, but without the malware sites that find their way onto google from time to time.
At first run I downloaded about 700K lines at about 20 GB of data. Initially I stored all info in a single table with columns for URL, HTML PAGE CODE, PAGE WITH NO HTML TAGS, KEY WORDS, TITLE and a couple more.
I use a MATCH AGAINST query to search for the user's desired page within TITLE, KEY WORDS, and PAGE WITH NO HTML, in any variety of combinations or individually.
My question is... would I be better off to break all these columns into separate tables and would this improve the speed of the searches enough that it would matter?
Is there any advantage to storing all the data in multiple tables and then using JOIN's to pull the data out?
I am just wondering if I need to be proactive and thinking about high user search loads.
MySQL isn't good with full-text search and never was.
Look into Sphinx or Lucene/Solr, they're the best fit for the job. I'd suggest sticking to the former.
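One practical detail: Sphinx's searchd speaks the MySQL wire protocol (SphinxQL), so PHP can query it with PDO; a sketch, assuming an index called `pages` whose url and title are also stored as string attributes:

<?php
// Connect to searchd's SphinxQL listener (default port 9306) and run a full-text match.
$sphinx = new PDO('mysql:host=127.0.0.1;port=9306');
$stmt   = $sphinx->prepare('SELECT id, url, title FROM pages WHERE MATCH(?) LIMIT 20');
$stmt->execute([$keyword]);
foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
    echo $row['title'], ' -> ', $row['url'], PHP_EOL;
}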
