Can't scrape search results, Google keeps changing HTML structure - PHP

My goal is to scrape Google search results with PHP Simple HTML DOM Parser,
which has been working fine for me. But every day or two, Google changes its HTML structure and my code stops working.
Here's my code that was working before:
include("simple_html_dom.php");
$data = file_get_contents('https://www.google.com/search?q=stackoverflow');
$html = str_get_html($data);
$i=0;
$linkObjs = $html->find('h3[class=r] a');
foreach ($linkObjs as $linkObj) {
$i++;
$url = trim($linkObj->href);
$trim = substr($url, 0, 7);
if ($trim=="/url?q=") {
$url = substr($url, 7);
}
$trim_2 = stripos($url, '&sa=U');
if ($trim_2 != false) {
$url = substr($url, 0, $trim_2);
}
echo "$i:".$url.'<br>';
}
They usually change the class names and tag names along with the structure of the result links.

I had the same problem. Try
$linkObjs = $html->find('div[class=jfp3ef] a');
and it will work again.

I had a similar experience. When I search Google from the ordinary user interface, the URLs of the "hit" pages still show up in an <a> tag (of course) inside a div with class 'r'. But when I run my scraping program with the exact same search terms and parameters, the 'r' changes to 'kCrYT'. I changed that in my code and got the program working again. (Yay!)
But I suspect the class will change regularly when Google detects that someone is submitting the search programmatically. So this might not be a permanent solution.
Maybe I could add a little extra code that determines what class name is currently being used for this, so that my program could automatically adapt to these changes.
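Something like this untested sketch might work (using Simple HTML DOM, and assuming the result links still start with "/url?q="): find one such link, read the class of its parent element, and build the selector from that instead of hard-coding it.

include("simple_html_dom.php");
$data = file_get_contents('https://www.google.com/search?q=stackoverflow');
$html = str_get_html($data);

// Find one result link and read the class name of its wrapping element,
// so the selector adapts when Google renames the class (e.g. 'r' -> 'kCrYT').
$resultClass = null;
foreach ($html->find('a') as $a) {
    $parent = $a->parent();
    if (strpos($a->href, '/url?q=') === 0 && $parent && $parent->class) {
        $resultClass = $parent->class;
        break;
    }
}

// Fall back to the last known class name if detection fails.
$selector = $resultClass ? "div[class=$resultClass] a" : 'div[class=kCrYT] a';
$linkObjs = $html->find($selector);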

Related

How can I copy two tables from someone's website to my WordPress website?

I'm attempting to copy two tables from a specific website address and mimic them onto my website, but I'm unsure how to go about it.
I've looked into Simple HTML DOM, and I actually got it to work locally on my machine, to a certain extent, but I have a couple of problems with this approach.
Although I got it to work, it wasn't perfect: I also dragged over some random text, and I copied across 5 tables instead of the intended 2.
I would rather use a different method, ideally without 3rd-party scripts.
The tables that I'm attempting to copy can be located here:
https://www.gov.uk/government/publications/rates-and-allowances-monthly-euro-conversion-rates-for-calculating-duty/monthly-euro-conversion-rates-for-calculating-duty
I only want the tables containing 2017 and 2016 data (with the headings).
Rates for 2017
  - Table Headings
  - Table Contents
Rates for 2016
  - Table Headings
  - Table Contents
This would be for my WordPress website.
I don't know if this could be programmed in PHP, or with a library that WordPress natively supports, without the use of SQL, 3rd-party scripts or anything like that.
Thank You
--- HUGE UPDATE ---
Ok, so I've been playing around and trying to debug the code, and I'm almost there!!
I've managed to copy over the first table only (second table will be easy).
The final part is trying to add classes to the createElement lines. Is that possible?
Here's my almost finished code.
<h1 class="roeheader">Monthly Industrial Euro Rate:</h1>
[insert_php]
$url = "https://www.gov.uk/government/publications/rates-and-allowances-monthly-euro-conversion-rates-for-calculating-duty/monthly-euro-conversion-rates-for-calculating-duty";
$html = file_get_contents($url);
libxml_use_internal_errors(true);
$doc = new \DOMDocument();
if($doc->loadHTML($html))
{
$result = new \DOMDocument();
$result->formatOutput = true;
$table = $result->appendChild($result->createElement("table"));
$thead = $table->appendChild($result->createElement("thead"));
$tbody = $table->appendChild($result->createElement("tbody"));
$xpath = new \DOMXPath($doc);
$newRow = $thead->appendChild($result->createElement("tr"));
foreach($xpath->query("//table[1]/thead/tr/th[position()>0]") as $header)
{
$newRow->appendChild($result->createElement("th", trim($header->nodeValue)));
}
foreach($xpath->query("//table[1]/tbody/tr") as $row)
{
$newRow = $tbody->appendChild($result->createElement("tr"));
foreach($xpath->query("./td[position()>0 and position()<5]", $row) as $cell)
{
$newRow->appendChild($result->createElement("td", trim($cell->nodeValue)));
}
}
echo $result->saveXML($result->documentElement);
}
[/insert_php]
I'm trying to change this line:
$newRow->appendChild($result->createElement("td", trim($cell->nodeValue)));
To this:
$newRow->appendChild($result->createElement("td class"roe"", trim($cell->nodeValue)));
But it's not working; the page just refuses to load. I guess it's because of the double quotation marks.
Thanks
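To answer the class question: DOMDocument::createElement only takes a tag name (and optional text content), so the class has to be set separately on the created node with setAttribute. Roughly like this (an untested sketch using the variable names from your code):

$td = $result->createElement("td", trim($cell->nodeValue));
$td->setAttribute("class", "roe"); // add the class after creating the element
$newRow->appendChild($td);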
Now you can inspect the tables, see the HTML and CSS behind them, and copy those to your website to make them exactly the same. A good WordPress plugin for making tables is https://wordpress.org/plugins/tablepress/. Please let me know if you have further questions.

Fetch data from site Page by Page & go through sub links

URL : http://www.sayuri.co.jp/used-cars
Example : http://www.sayuri.co.jp/used-cars/B37753-Toyota-Wish-japanese-used-cars
Hey guys, I need some help with one of my personal projects. I've already written the code to fetch data from each single car URL (example above) and post it on my site.
Now I need to go through the main URL, sayuri.co.jp/used-cars, and:
1) Make an array/list of all the URLs for the single cars in it, then run my internal code on each one to fetch data, then move on to the next one.
I already have the code to save each URL into a log file when completed (I don't think it will be necessary if it goes link by link without starting from the top, but it will ensure no repetition).
2) When all links are done for the page, it should move to the next page and do the same thing until the end (there are 5-6 pages max).
I've been stuck on this part since last night and would really appreciate any help. Thanks
My code to get data from the main URL:
$content = file_get_contents('http://www.sayuri.co.jp/used-cars/');
// echo $content;
and
$dom = new DOMDocument;
$dom->loadHTML($content);
//echo $dom;
I'm guessing you already know this since you say you've gotten data from the car entries themselves, but a good point to start is by dissecting the page's DOM and seeing if there are any elements you can use to jump around quickly. Most browsers have page inspection tools to help with this.
In this case, <div id="content"> serves nicely. You'll note it contains a collection of tables with the required links and a <div> that contains the text telling us how many pages there are.
Disclaimer: it's been years since I've done PHP and I have not tested this, so it is probably neither correct nor optimal, but it should get you started. You'll need to tie the functions together yourself (what's the fun in me doing it?) to achieve what you want — there's a rough sketch of that after the last function — but these should grab the data required.
You'll be working with the DOM on each page, so a convenience to grab the DOMDocument:
function get_page_document($index) {
    $content = file_get_contents("http://www.sayuri.co.jp/used-cars/page:{$index}");
    $document = new DOMDocument;
    $document->loadHTML($content);
    return $document;
}
You need to know how many pages there are in total in order to iterate over them, so grab it:
function get_page_count($document) {
    $content = $document->getElementById('content');
    $count_div = $content->childNodes->item($content->childNodes->length - 4);
    $count_text = $count_div->firstChild->textContent;

    if (preg_match('/Page \d+ of (\d+)/', $count_text, $matches) === 1) {
        return $matches[1];
    }

    return -1;
}
It's a bit ugly, but the links are available inside each <table> in the contents container. Rip 'em out and push them in an array. If you use the link itself as the key, there is no concern for duplicates as they'll just rewrite over the same key-value.
function get_page_links($document) {
    $content = $document->getElementById('content');
    $tables = $content->getElementsByTagName('table');
    $links = array();

    foreach ($tables as $table) {
        if ($table->getAttribute('class') === 'itemlist-table') {
            // table > tbody > tr > td > a
            $link = $table->firstChild->firstChild->firstChild->firstChild->getAttribute('href');
            // No duplicates because they just overwrite the same entry.
            $links[$link] = "http://www.sayuri.co.jp{$link}";
        }
    }

    return $links;
}
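And a rough, untested sketch of tying them together; process_car_url() is a stand-in for whatever code you already have that scrapes and posts a single car page:

$first_page = get_page_document(1);
$page_count = get_page_count($first_page);

for ($i = 1; $i <= $page_count; $i++) {
    $document = ($i === 1) ? $first_page : get_page_document($i);

    foreach (get_page_links($document) as $url) {
        process_car_url($url); // your existing per-car scraping code
    }
}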
Perhaps also obvious, but these will break if this site changes their formatting. You'd be better off asking if they have a REST API or some such available for long term use, though I'm guessing you don't care as much if it's just a personal project for tinkering.
Hope it helps prod you in the right direction.

How to handle ?_escaped_fragment_= for AJAX crawlers?

I'm struggling to make an AJAX-based website SEO-friendly. As recommended in tutorials on the web, I've added "pretty" hash-bang href attributes to links (for example, a "contact" link) and, in a div where content is loaded with AJAX by default, a PHP script for crawlers:
$files = glob('./pages/*.php');
foreach ($files as &$file) {
    $file = substr($file, 8, -4);
}

if (isset($_GET['site'])) {
    if (in_array($_GET['site'], $files)) {
        include("./pages/" . $_GET['site'] . ".php");
    }
}
I have a feeling that at the beginning I need to additionally cut the _escaped_fragment_= part from (...)/index.php?_escaped_fragment_=site=about, because otherwise the script won't be able to GET the site value from the URL. Am I right?
But, anyway, how do I know that the crawler transforms pretty links (those with #!) into ugly links (containing ?_escaped_fragment_=)? I've been told that it happens automatically and I don't need to provide this mapping, but Fetch as Googlebot doesn't give me any information about what happens to the URL.
Googlebot will automatically query for ?_escaped_fragment_= URLs.
So from www.example.com/index.php#!site=about
Googlebot will query: www.example.com/index.php?_escaped_fragment_=site=about
On the PHP side you will get it as $_GET['_escaped_fragment_'] = "site=about".
If you want to get the value of the "site" you need to do something like this:
if (isset($_GET['_escaped_fragment_'])) {
    $escaped = explode("=", $_GET['_escaped_fragment_']);
    if (isset($escaped[1]) && in_array($escaped[1], $files)) {
        include("./pages/" . $escaped[1] . ".php");
    }
}
Take a look at the documentation:
https://developers.google.com/webmasters/ajax-crawling/docs/specification

PHP-Retrieve specific content from multiple pages of a website

What I want to accomplish might be a little hardcore, but I want to know if it's possible:
The question:
My question is the same as PHP-Retrieve content from page, but I want to use it on multiple pages.
The situation:
I'm using a website about TV shows. All the TV shows have the same URL and then the name of the show:
http://bierdopje.com/shows/NAME_OF_SHOW
On every show page, there's a line which tells you if the show is cancelled or still running. I want to retrieve that line to make an overview of the cancelled shows (the website only supports an overview of running shows, so I want to make an extra functionality).
The real question:
How can I tell DOM to retrieve all the shows and check for the status of the show?
(http://bierdopje.com/shows/*).
The Note:
I understand that this process may take a while because it is reading the whole website (or is it too much data?).
Use this code to fetch all the links from a single page:
include_once('simple_html_dom.php');

$html = file_get_html('http://www.couponrani.com/');

// Find all links
foreach ($html->find('a') as $element) {
    echo $element->href . '<br>';
}
I use phpQuery to fetch data from a web page; it works like jQuery on the DOM.
For example, to get the list of all shows, you can do this:
<?php
require_once 'phpQuery/phpQuery/phpQuery.php';

$doc = phpQuery::newDocumentHTML(
    file_get_contents('http://www.bierdopje.com/shows')
);

foreach (pq('.listing a') as $key => $a) {
    $url  = pq($a)->attr('href'); // will give "/shows/07-ghost"
    $show = pq($a)->text();       // will give "07 Ghost"
}
Now you can process all shows individually: make a new phpQuery::newDocumentHTML for each show and extract the information you need with a selector.
Get the status of a show
$html = file_get_contents('http://www.bierdopje.com/shows/alcatraz');
$doc = phpQuery::newDocumentHTML($html);
$status = pq('.content>span:nth-child(6)')->text();
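Putting the two together, a rough, untested sketch (the '.content>span:nth-child(6)' selector is the one from above and may need adjusting per show page):

require_once 'phpQuery/phpQuery/phpQuery.php';

phpQuery::newDocumentHTML(
    file_get_contents('http://www.bierdopje.com/shows')
);

// Collect the show names and their relative URLs first.
$shows = array();
foreach (pq('.listing a') as $a) {
    $shows[pq($a)->text()] = pq($a)->attr('href');
}

// Then load each show page and read its status line.
$statuses = array();
foreach ($shows as $name => $path) {
    $page = phpQuery::newDocumentHTML(
        file_get_contents('http://www.bierdopje.com' . $path)
    );
    $statuses[$name] = pq('.content>span:nth-child(6)', $page)->text();
}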

Changing base URL on part of a page only

I have a page on my site that fetches and displays news items from the database of another (legacy) site on the same server. Some of the items contain relative links that should be fixed so that they direct to the external site instead of causing 404 errors on the main site.
I first considered using the <base> tag on the fetched news items, but this changes the base URL of the whole page, breaking the relative links in the main navigation - and it feels pretty hackish too.
I'm currently thinking of creating a regex to find the relative URLs (they all start with /index.php?) and prepending them with the desired base URL. Are there any more elegant solutions to this? The site is built on Symfony 2 and uses jQuery.
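For reference, the kind of regex I have in mind would be roughly this (an untested sketch; othersite.tld and $newsItemHtml are placeholders for the legacy domain and a fetched news item):

// Prefix relative href/src values that point at the legacy /index.php
$fixed = preg_replace(
    '#(href|src)="(/index\.php\?)#i',
    '$1="http://othersite.tld$2',
    $newsItemHtml // placeholder: the HTML of one news item from the database
);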
Here is how I would tackle the problem:
function prepend_url ($prefix, $path) {
    // Prepend $prefix to $path if $path is not a full URL
    $parts = parse_url($path);
    return empty($parts['scheme']) ? rtrim($prefix, '/').'/'.ltrim($path, '/') : $path;
}

// The URL scheme and domain name of the other site
$otherDomain = 'http://othersite.tld';

// Create a DOM object
$dom = new DOMDocument('1.0');
$dom->loadHTML($inHtml); // $inHtml is an HTML string obtained from the database

// Create an XPath object
$xpath = new DOMXPath($dom);

// Find candidate nodes
$nodesToInspect = $xpath->query('//*[@src or @href]');

// Loop candidate nodes and update attributes
foreach ($nodesToInspect as $node) {
    if ($node->hasAttribute('src')) {
        $node->setAttribute('src', prepend_url($otherDomain, $node->getAttribute('src')));
    }
    if ($node->hasAttribute('href')) {
        $node->setAttribute('href', prepend_url($otherDomain, $node->getAttribute('href')));
    }
}

// Find all nodes to export
$nodesToExport = $xpath->query('/html/body/*');

// Iterate and stringify them
$outHtml = '';
foreach ($nodesToExport as $node) {
    $outHtml .= $node->C14N();
}

// $outHtml now contains the "fixed" HTML as a string
You can override the base tag by putting http:// in front of the link. That is, give a full URL, not a relative URL.
Well, not actually a solution, but mostly a tip...
You could start playing around with the ExceptionController.
There, just for example, you could look for a 404 error and check the query string appended to the request:
$request = $this->container->get('request');
// ...
if (404 === $exception->getStatusCode()) {
    $query = $request->server->get('QUERY_STRING');
    // ...handle your logic
}
The other solution would be to define a special route with its own controller for such purposes, which would catch requests to index.php and do redirects and so on. Just define index.php in the route's requirements and move this route to the top of your routing.
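For the second option, a rough, untested sketch of what such a controller action might look like (the bundle/class names and the legacy domain are made up for illustration):

namespace Acme\LegacyBundle\Controller;

use Symfony\Bundle\FrameworkBundle\Controller\Controller;
use Symfony\Component\HttpFoundation\Request;

class LegacyController extends Controller
{
    // Action behind a route whose path is /index.php; it forwards the
    // original query string to the legacy site.
    public function redirectAction(Request $request)
    {
        return $this->redirect('http://othersite.tld/index.php?' . $request->getQueryString());
    }
}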
Not the clearest answer ever, but I hope at least I gave you a direction...
Cheers ;)
