regex to scrape data from web page - php

I tried to scrape data from a web page, but my attempt gives DOM warnings. So I want to know: is it possible to use regex to scrape the date, review, and rating value from this page?
http://www.yelp.com/biz/franchino-san-francisco?start=80
Here is the DOM version, which gives an error: https://eval.in/143074
It works for a smaller snippet: https://eval.in/143036
Is this possible using regex?
<?php
$html= file_get_contents('http://www.yelp.com/biz/franchino-san-francisco?start=80');
$html = escapeshellarg($html) ;
$html = nl2br($html);
$classname = 'rating-qualifier';
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$results = $xpath->query("//*[@class='" . $classname . "']");
if ($results->length > 0) {
echo $review = $results->item(0)->nodeValue;
}
$classname = 'review_comment ieSucks';
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$results = $xpath->query("//*[@class='" . $classname . "']");
if ($results->length > 0) {
echo $review = $results->item(0)->nodeValue;
}
$meta = $dom->documentElement->getElementsByTagName("meta");
echo $meta->item(0)->getAttribute('content');
?>

Related

Extracting information from <i> tag from HTML using PHP

My code returns an HTTP 500 error and I am a bit confused. I need to extract the numeric weather values from a weather forecast website and add them to my own website.
Here is the code:
orai_class.php
<?php
Class orai{
var $url;
function generate_orai($url){
$html = file_get_contents($url);
$classname = 'wi wi-1';
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$results = $xpath->query("//*[@class='" . $classname . "']");
$i=0;
foreach($results as $node)
{
if ($results->length > 0) {
$array[] = $results->item($i)->nodeValue;
}
$i++;
}
return $array;
}
}
?>
index.php
<?php
include("orai.class.php");
$orai = new orai();
print_r($orai->generate_orai('https://orai.15min.lt/prognoze/vilnius'));
?>
Thank You.

I'm currently working on a scraper, but how do I scrape more than one thing?

I've been working on a scraper for the past few days and I was wondering if it was possible to scrape urls from "A" tags and echo them as well under the "$titles->nodeValue".
<?php
$html = file_get_contents('https://webpage/apps.html');
$doc = new DOMDocument();
libxml_use_internal_errors(TRUE);
if(!empty($html)){
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);
$title_row = $xpath->query('//div[@class="item-title"]');
$button = $xpath->query('//a[@style="background: #F0F1F6; color: #007AFF; font-weight:bold;"]');
if($title_row->length > 0){
foreach($title_row as $titles){
echo "<li>" . $titles->nodeValue . "</li>";
}
}
}
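One possible way to collect the URLs is to query the a elements and read each href with getAttribute. This is a minimal sketch, assuming a small inline sample in place of the real page; the sample markup, the example.com URLs, and the $urls array are illustrative and not taken from the question:

```php
<?php
// Hypothetical sample markup standing in for the real page.
$html = '<div class="item-title">App One</div><a href="https://example.com/one">Get</a>'
      . '<div class="item-title">App Two</div><a href="https://example.com/two">Get</a>';

libxml_use_internal_errors(true);   // buffer parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// Collect the href attribute of every <a> element that has one.
$urls = [];
foreach ($xpath->query('//a[@href]') as $link) {
    $urls[] = $link->getAttribute('href');
}

foreach ($urls as $url) {
    echo "<li>" . htmlspecialchars($url) . "</li>\n";
}
```

The same getAttribute call should work on the nodes returned by the $button query, since those are also a elements.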

How to parse body class with Xpath?

I'm trying to parse a page with XPath, but I can't get the body class.
Here is what I'm trying :
<?php
$url = 'http://figurinepop.com/mickey-paintbrush-disney-funko';
$html = file_get_contents($url);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXpath($doc);
$nodes = $xpath->query('//link[@rel="canonical"]/@href');
foreach($nodes as $node) {
$canonical = $node->nodeValue;
}
$nodes = $xpath->query('//html/body/@class');
foreach($nodes as $node) {
$bodyclass = $node->nodeValue;
}
$output['canonical'] = $canonical;
$output['bodyclass'] = $bodyclass;
echo '<pre>'; print_r ($output); echo '</pre>';
?>
Here is what I get :
Array
(
[canonical] => http://figurinepop.com/mickey-paintbrush-disney-funko
[bodyclass] =>
)
It works for many elements (title, canonical, div...) but not for the body class.
I've tested the Xpath query with a chrome extension and it seems well written.
What is wrong ?

Iterate through all class blocks using DOM

I am scraping data from a web page using the DOM classes.
The page has several div blocks, each with a review, image, date, rating, etc.
The code below scrapes data for a particular class, but only from the first matching element. How can I iterate so that I get the details from all matching blocks?
Here is my code:
libxml_use_internal_errors(true);
$html= file_get_contents('http://www.yelp.com/biz/franchino-san-francisco?start=80');
$html = escapeshellarg($html) ;
$html = nl2br($html);
$classname = 'rating-qualifier';
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$results = $xpath->query("//*[@class='" . $classname . "']");
if ($results->length > 0) {
echo $review = $results->item(0)->nodeValue;
}
$classname = 'review_comment ieSucks';
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$results = $xpath->query("//*[@class='" . $classname . "']");
if ($results->length > 0) {
echo $review = $results->item(0)->nodeValue;
}
$meta = $dom->documentElement->getElementsByTagName("meta");
echo $meta->item(0)->getAttribute('content');
Output: http://codepad.viper-7.com/j0cTNi
UPDATE
http://codepad.viper-7.com/lHS9jk
Here I added :
$classname = 'review-wrapper';
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$results = $xpath->query("//*[@class='" . $classname . "']");
foreach($results as $node)
{
// scrapping code here
}
But it scrapes the same class value during each iteration. See the result: http://codepad.viper-7.com/lHS9jk
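One possible way to read each block's own values is to run the inner queries relative to the current node, by passing that node as the second argument to DOMXPath::query and starting the expression with a dot. This is a minimal sketch, assuming a small inline sample in place of the live page; the sample markup and the $reviews array are illustrative, and only the class names come from the question:

```php
<?php
// Hypothetical sample markup standing in for the live page.
$html = <<<HTML
<html><body>
<div class="review-wrapper">
  <span class="rating-qualifier">1/2/2014</span>
  <p class="review_comment ieSucks">Great pasta.</p>
</div>
<div class="review-wrapper">
  <span class="rating-qualifier">3/4/2014</span>
  <p class="review_comment ieSucks">Slow service.</p>
</div>
</body></html>
HTML;

libxml_use_internal_errors(true);   // buffer parser warnings instead of printing them
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$reviews = [];
foreach ($xpath->query("//*[@class='review-wrapper']") as $node) {
    // The leading dot plus the context node restrict each search to $node.
    $date = $xpath->query(".//*[@class='rating-qualifier']", $node);
    $text = $xpath->query(".//*[@class='review_comment ieSucks']", $node);
    $reviews[] = [
        'date'   => $date->length > 0 ? trim($date->item(0)->nodeValue) : null,
        'review' => $text->length > 0 ? trim($text->item(0)->nodeValue) : null,
    ];
}
print_r($reviews);
```

An expression that starts with // searches the whole document even when a context node is passed, which may explain why the same value comes back on every iteration.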

PHP Xpath Error already defined in Entity not showing results

I am getting errors in this PHP XPath app and I cannot fix them. I would love some help if possible.
<?php
//Get Username
$username = $_GET["u"];
$html = file_get_contents('http://us.playstation.com/publictrophy/index.htm?onlinename=' .$username);
$html = tidy_repair_string($html);
$doc = new DomDocument();
$doc->loadHtml($html);
$xpath = new DomXPath($doc);
// Now query the document:
foreach ($xpath->query('//*[@id="id-handle"]') as $node) {
echo $node, "\n";
}
foreach ($xpath->query('//*[@id="leveltext"]') as $node1) {
echo $node1, "\n";
}
?>
Put @ before $dom->loadHTML($html), because loadHTML usually raises a lot of warnings and notices:
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
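An alternative to suppressing the warnings on loadHTML is libxml_use_internal_errors, which buffers the parser messages so they can still be inspected. This is a minimal sketch, assuming an inline sample in place of the fetched page; the sample markup and the $errors variable are illustrative, and only the id-handle id comes from the question:

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$html = '<html><body><div id="id-handle">player1</div></body></html>';

libxml_use_internal_errors(true);   // buffer libxml messages instead of emitting PHP warnings
$dom = new DOMDocument();
$dom->loadHTML($html);
$errors = libxml_get_errors();      // array of LibXMLError objects, possibly empty
libxml_clear_errors();

$xpath = new DOMXPath($dom);
foreach ($xpath->query('//*[@id="id-handle"]') as $node) {
    // Print the node's text content; a DOMElement cannot be echoed directly.
    echo $node->nodeValue, "\n";
}
echo count($errors), " parser message(s) buffered\n";
```

Unlike the error-suppression operator, this approach leaves unrelated errors on the same line visible.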
