XPath does not retrieve some content - PHP

I'm a newbie trying to code a crawler to gather some stats from a forum.
Here is my code:
<?php
$ch = curl_init();
$timeout = 0; // set to zero for no timeout
curl_setopt($ch, CURLOPT_URL, 'http://m.jeuxvideo.com/forums/42-51-61913988-1-0-1-0-je-code-un-bot-pour-le-forom-je-vous-le-montre-en-action.htm');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$file_contents = curl_exec($ch);
curl_close($ch);

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($file_contents);
$xpath = new DOMXPath($dom);
$posts = $xpath->query("//div[@class='who-post']/a");
$dates = $xpath->query("//div[@class='date-post']");
$contents = $xpath->query("//div[@class='message text-enrichi-fmobile text-crop-fmobile']/p");

$tab = [];
$i = 0;
foreach ($posts as $post) {
    $nodes = $post->childNodes;
    foreach ($nodes as $node) {
        $value = trim($node->nodeValue);
        $tab[$i]['author'] = $value;
        $i++;
    }
}

$i = 0;
foreach ($dates as $date) {
    $nodes = $date->childNodes;
    foreach ($nodes as $node) {
        $value = trim($node->nodeValue);
        $tab[$i]['date'] = $value;
        $i++;
    }
}

$i = 0;
foreach ($contents as $content) {
    $nodes = $content->childNodes;
    foreach ($nodes as $node) {
        $value = $node->nodeValue;
        echo $value;
        $tab[$i]['content'] = trim($value);
        $i++;
    }
}
?>
<h1>Participants</h1>
<pre>
<?php
print_r($tab);
?>
</pre>
As you can see, the code does not retrieve some of the content. For example, I'm trying to retrieve the posts from: http://m.jeuxvideo.com/forums/42-51-61913988-1-0-1-0-je-code-un-bot-pour-le-forom-je-vous-le-montre-en-action.htm
The second post is a picture and my code does not handle it.
On the other hand, I guess I made some errors; I find my code ugly.
Can you help me please?

You could simply select the posts first, then grab each piece of data separately using:
DOMXPath::evaluate combined with normalize-space() to retrieve plain text,
DOMXPath::query combined with DOMDocument::saveHTML to retrieve the message paragraphs.
Code:
$xpath = new DOMXPath($dom);
$postsElements = $xpath->query('//*[@class="post"]');
$posts = [];
foreach ($postsElements as $postElement) {
    $author = $xpath->evaluate('normalize-space(.//*[@class="who-post"])', $postElement);
    $date = $xpath->evaluate('normalize-space(.//*[@class="date-post"])', $postElement);
    $message = '';
    foreach ($xpath->query('.//*[contains(@class, "message")]/p', $postElement) as $messageParagraphElement) {
        $message .= $dom->saveHTML($messageParagraphElement);
    }
    $posts[] = (object)compact('author', 'date', 'message');
}
print_r($posts);
Unrelated note: scraping a website's HTML is not illegal in itself, but you should refrain from displaying their data on your own app/website without their consent. Also, this might break just about anytime if they decide to alter their HTML structure/CSS class names.

Related

Why is new SimpleXMLElement causing a 500 error?

I have a simple script that until yesterday had worked fine for 2 years. I'm just taking an XML feed from a WP site and formatting it to be displayed on a different website. Here is the code:
<?php
function download_page($path) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $path);
    curl_setopt($ch, CURLOPT_FAILONERROR, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);
    $retValue = curl_exec($ch);
    curl_close($ch);
    return $retValue;
}

$sXML = download_page('https://example.com/tradeblog/feed/atom/');
$oXML = new SimpleXMLElement($sXML);
$items = $oXML->entry;
$i = 0;
foreach ($items as $item) {
    $title = $item->title;
    $link = $item->link;
    echo '<li>';
    foreach ($link as $links) {
        $loc = $links['href'];
        $href = str_replace("/feed/atom/", "", $loc);
        echo "<a href=\"$href\" target=\"_blank\">";
    }
    echo $title;
    echo "</a>";
    echo "</li>";
    if (++$i == 3) break;
}
?>
I can echo out $sXML and it will display the entire XML contents as expected. When I try to echo $oXML, I get the 500 error. Any use of $oXML causes the 500. What changed? Is there a different/better way to do this using PHP?
It seems your XML source is not exactly valid XML. I tried to validate it using the W3Schools validator and it throws an error. I tried another validator too, and got the same error.
Not sure why, but this worked:
<?php
$rss = new DOMDocument();
$rss->load('https://example.com/tradeblog/feed/rss2/');
$feed = array();
foreach ($rss->getElementsByTagName('item') as $node) {
    $item = array(
        'title' => $node->getElementsByTagName('title')->item(0)->nodeValue,
        'link'  => $node->getElementsByTagName('link')->item(0)->nodeValue,
    );
    array_push($feed, $item);
}
$limit = 3;
for ($x = 0; $x < $limit; $x++) {
    $title = str_replace(' &amp; ', ' & ', $feed[$x]['title']); // decode the encoded ampersand in titles
    $link = $feed[$x]['link'];
    echo '<li>' . $title . '</li>';
}
?>
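Note that new SimpleXMLElement throws an Exception on malformed XML, and an uncaught exception is exactly the kind of thing that surfaces as a 500. If you want to see the actual parse error instead, you can let libxml buffer the errors and print them; a minimal sketch, assuming the same feed URL and the download_page() helper from the question:
<?php
$sXML = download_page('https://example.com/tradeblog/feed/atom/');

libxml_use_internal_errors(true);      // buffer XML errors instead of raising them
$oXML = simplexml_load_string($sXML);  // returns false on failure rather than throwing
if ($oXML === false) {
    foreach (libxml_get_errors() as $error) {
        // line and column point at the offending spot in the feed
        echo $error->line . ':' . $error->column . ' ' . trim($error->message) . "\n";
    }
    libxml_clear_errors();
}
?>
This usually pins down what changed in the feed (a stray control character, a truncated response, and so on).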

Why doesn't the query match the DOM?

Here is my code:
$res = file_get_contents("http://www.lenzor.com/photo/search/index/type/user/%D8%B9%D9%84%DB%8C//text/%D9%81%D8%A7%D8%B7%D9%85%D9%87");
$doc = new \DOMDocument();
@$doc->loadHTMLFile($res);
$xpath = new \DOMXpath($doc);
$links = $xpath->query("//ul[@class='user_box']/li");
$result = array();
if (!is_null($links)) {
    foreach ($links as $link) {
        $href = $link->getAttribute('class');
        $result[] = [$href];
    }
}
print_r($result);
Here is the content I'm working on; I mean, it's the result of echo $res.
OK, well, the result of my code is an empty array, so $links is empty and that foreach won't be executed. Why? Why doesn't the //ul[@class='user_box']/li query match the DOM?
The expected result is an array containing the class attributes of the li elements.
Try this; hope it will be helpful. There are a few mistakes in your code:
1. You should search like this: '//ul[@class="user_box clearfix"]/li', because the class attribute in that HTML source contains two classes and @class compares the full attribute value (see the sketch after the code for a more robust match).
2. You should use loadHTML instead of loadHTMLFile, because $res already contains the page's HTML rather than a file name.
<?php
ini_set('display_errors', 1);
libxml_use_internal_errors(true);
$res = file_get_contents("http://www.lenzor.com/photo/search/index/type/user/%D8%B9%D9%84%DB%8C//text/%D9%81%D8%A7%D8%B7%D9%85%D9%87");
$doc = new \DOMDocument();
$doc->loadHTML($res);
$xpath = new \DOMXpath($doc);
$links = $xpath->query('//ul[@class="user_box clearfix"]/li');
$result = array();
if (!is_null($links)) {
    foreach ($links as $link) {
        $href = $link->getAttribute('class');
        $result[] = [$href];
    }
}
print_r($result);
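XPath 1.0 has no real "has class" operator, so matching the full attribute value as above breaks as soon as the site reorders or adds classes. A more robust (if uglier) sketch, using the same $xpath object:
$links = $xpath->query('//ul[contains(concat(" ", normalize-space(@class), " "), " user_box ")]/li');
The concat()/normalize-space() wrapping ensures " user_box " only matches the whole class name, not substrings like "user_box_small".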

PHP script working locally but not when placed on webserver

The following code scrapes a list of links from a given webpage and then feeds them to a second script that scrapes the text from those links and writes the data into a CSV document. The code runs perfectly on localhost (WampServer, PHP 5.5) but fails horribly when placed on the domain.
You can check out the functionality of the script at http://miskai.tk/ANOFM/csv.php.
Also, file_get_html and cURL are both enabled on the server.
<?php
header('Content-Type: application/excel');
header('Content-Disposition: attachment; filename="Mehedinti.csv"');
include_once 'simple_html_dom.php';
include_once 'csv.php';

$urls = scrape_main_page();

function scraping($url) {
    // create HTML DOM
    $html = file_get_html($url);
    // get article block
    if ($html && is_object($html) && isset($html->nodes)) {
        foreach ($html->find('/html/body/table') as $article) {
            // get title
            $item['titlu'] = trim($article->find('/tbody/tr[1]/td/div', 0)->plaintext);
            // get body
            $item['tr2'] = trim($article->find('/tbody/tr[2]/td[2]', 0)->plaintext);
            $item['tr3'] = trim($article->find('/tbody/tr[3]/td[2]', 0)->plaintext);
            $item['tr4'] = trim($article->find('/tbody/tr[4]/td[2]', 0)->plaintext);
            $item['tr5'] = trim($article->find('/tbody/tr[5]/td[2]', 0)->plaintext);
            $item['tr6'] = trim($article->find('/tbody/tr[6]/td[2]', 0)->plaintext);
            $item['tr7'] = trim($article->find('/tbody/tr[7]/td[2]', 0)->plaintext);
            $item['tr8'] = trim($article->find('/tbody/tr[8]/td[2]', 0)->plaintext);
            $item['tr9'] = trim($article->find('/tbody/tr[9]/td[2]', 0)->plaintext);
            $item['tr10'] = trim($article->find('/tbody/tr[10]/td[2]', 0)->plaintext);
            $item['tr11'] = trim($article->find('/tbody/tr[11]/td[2]', 0)->plaintext);
            $item['tr12'] = trim($article->find('/tbody/tr[12]/td/div', 0)->plaintext);
            $ret[] = $item;
        }
        // clean up memory
        $html->clear();
        unset($html);
        return $ret;
    }
}

$output = fopen("php://output", "w");
foreach ($urls as $url) {
    $ret = scraping($url);
    foreach ($ret as $v) {
        fputcsv($output, $v);
    }
}
fclose($output);
exit();
The second file:
<?php
function get_contents($url) {
    // We could just use file_get_contents, but using curl makes it more future-proof (setting a timeout, for example)
    $ch = curl_init($url);
    curl_setopt_array($ch, array(CURLOPT_RETURNTRANSFER => true));
    $content = curl_exec($ch);
    curl_close($ch);
    return $content;
}

function scrape_main_page() {
    set_time_limit(300);
    libxml_use_internal_errors(true); // prevent DOMDocument from spraying errors onto the page; collect them internally instead
    $html = get_contents("http://lmvz.anofm.ro:8080/lmv/index2.jsp?judet=26");
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    //die(var_dump($html)); // leftover debug line; it would stop the script before anything is returned
    $xpath = new DOMXPath($dom);
    $results = $xpath->query("//table[@width=\"645\"]/tr");
    $all = array();
    //var_dump($results);
    for ($i = 1; $i < $results->length; $i++) {
        $tr = $results->item($i);
        $id = $tr->childNodes->item(0)->textContent;
        $requesturl = "http://lmvz.anofm.ro:8080/lmv/detalii.jsp?UNIQUEJVID=" . urlencode($id) . "&judet=26";
        $details = scrape_detail_page($requesturl);
        $newObj = new stdClass();
        $newObj = $id;
        $all[] = $newObj;
    }
    foreach ($all as $xtr) {
        $urls[] = "http://lmvz.anofm.ro:8080/lmv/detalii.jsp?UNIQUEJVID=" . $xtr . "&judet=26";
    }
    return $urls;
}

scrape_main_page();
Yeah, the problem here is likely your php.ini configuration. Make sure the server supports cURL and allows fopen on URLs. If not, consider running your own Linux server.
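To confirm what the webserver actually allows, here is a minimal diagnostic sketch you could run once on both hosts and compare (these are all standard PHP functions):
<?php
var_dump(extension_loaded('curl'));          // is the cURL extension loaded?
var_dump(ini_get('allow_url_fopen'));        // can file_get_contents()/fopen open URLs?
var_dump(function_exists('file_get_html'));  // did simple_html_dom.php get included?
var_dump(ini_get('max_execution_time'));     // shared hosts often enforce a low limit
?>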

Parsing XML in PHP DOM via cURL - can't get nodeValue if it is a URL address or date

I have this strange problem parsing an XML document in PHP loaded via cURL. I cannot get a nodeValue containing a URL address (I'm trying to implement a simple RSS reader into my CMS). The strange thing is that it works for every node except those containing URL addresses and dates (<link> and <pubDate>).
Here is the code (I know it is a stupid solution, but I'm kind of a newbie at working with the DOM and parsing XML documents).
function file_get_contents_curl($url) {
    $ch = curl_init();                           // initialize curl handle
    curl_setopt($ch, CURLOPT_URL, $url);         // set url to post to
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return into a variable
    curl_setopt($ch, CURLOPT_TIMEOUT, 4);        // times out after 4s
    $result = curl_exec($ch);                    // run the whole process
    return $result;
}

function vypis($adresa) {
    $html = file_get_contents_curl($adresa);
    $doc = new DOMDocument();
    @$doc->loadHTML($html);
    $nodes = $doc->getElementsByTagName('title');
    $desc = $doc->getElementsByTagName('description');
    $ctg = $doc->getElementsByTagName('category');
    $pd = $doc->getElementsByTagName('pubDate');
    $ab = $doc->getElementsByTagName('link');
    $aut = $doc->getElementsByTagName('author');
    for ($i = 1; $i < $desc->length; $i++) {
        $dsc = $desc->item($i);
        $titles = $nodes->item($i);
        $categorys = $ctg->item($i);
        $pubDates = $pd->item($i);
        $links = $ab->item($i);
        $autors = $aut->item($i);
        $description = $dsc->nodeValue;
        $title = $titles->nodeValue;
        $category = $categorys->nodeValue;
        $pubDate = $pubDates->nodeValue;
        $link = $links->nodeValue;
        $autor = $autors->nodeValue;
        echo 'Title:' . $title . '<br/>';
        echo 'Description:' . $description . '<br/>';
        echo 'Category:' . $category . '<br/>';
        echo 'Datum ' . gmdate("D, d M Y H:i:s", strtotime($pubDate)) . " GMT" . '<br/>';
        echo "Autor: $autor" . '<br/>';
        echo 'Link: ' . $link . '<br/><br/>';
    }
}
Can you please help me with this?
To read RSS you shouldn't use loadHTML, but loadXML. One reason your links don't show is that the <link> tag in HTML ignores its contents. See also here: http://www.w3.org/TR/html401/struct/links.html#h-12.3
Also, I find it easier to just iterate over the <item> tags and then iterate over their child nodes. Like so:
$d = new DOMDocument;
// don't show xml warnings
libxml_use_internal_errors(true);
$d->loadXML($xml_contents);
// clear xml warnings buffer
libxml_clear_errors();
$items = array();
// iterate all item tags
foreach ($d->getElementsByTagName('item') as $item) {
    $item_attributes = array();
    // iterate over children
    foreach ($item->childNodes as $child) {
        $item_attributes[$child->nodeName] = $child->nodeValue;
    }
    $items[] = $item_attributes;
}
var_dump($items);
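Here, $xml_contents is assumed to be the raw feed string; with the cURL helper from the question that would be:
$xml_contents = file_get_contents_curl($adresa); // fetch the feed before parsing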

Getting the src of an image in cURLed HTML with DOM

function getPage($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $result = curl_exec($ch);
    curl_close($ch);
    return $result;
}

$page = getPage(trim('http://localhost/test/test.html'));
$dom = new DOMDocument();
$dom->loadHTML($page);
$xp = new DOMXPath($dom);
$result = $xp->query("//img[@class='wallpaper']");
I'm trying to find all images with the class wallpaper and now I'm stuck at that point. I tried var_dump($result) but it's giving me a weird object(DOMNodeList)[3]. How do I finally get the src of the image?
$result is a DOMNodeList object.
You can find out how many items it contains with
$count = $result->length;
You access items individually using DOMNodeList::item()
if ($result->length > 0) {
    $first = $result->item(0);
    $src = $first->getAttribute('src');
}
You can also iterate it like an array, e.g.
foreach ($result as $img) {
    $src = $img->getAttribute('src');
}
In addition to @Phil's answer, you can also grab the src attribute directly in your XPath query instead of grabbing the img element:
$srcs = array();
$result = $xp->query("//img[@class='wallpaper']/@src");
foreach ($result as $attr) {
    $srcs[] = $attr->value;
}
You can access the images in the DOMNodeList with a foreach loop.
foreach ($result as $img) {
    echo $img->getAttribute('src');
}
You could get the first with echo $result->item(0)->getAttribute('src'). You may want to confirm the DOMNodeList has items by checking the length property of $result.
Try
echo $result->item(0)->getAttribute('src');
since $result itself is a DOMNodeList, not an element.
