PHP creating multiple DOMDocuments in a loop issue - php

I have a list of items to be added to the end of a base url and am trying to retrieve the html from each of these generated url's in a loop. However, I am encountering an error and i've really been struggling to fix it!
current code:
($items is just an array of strings)
$output = "";
foreach($items as $item) {
$url = $baseUrl . $item;
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTMLFile($url);
$output = $output . json_encode($dom->saveHTML());
}
echo $output;
Can anyone tell me why I can't load multiple HTML documents like this?
Annoyingly i'm not getting any PHP error logs and the ajax xhr text is not providing any useful info, it's just returning a section of the first html page loaded as the 'error' (it seems to be able to load the first item in the array but then fails)

You were almost there. This way it should do the trick:
$output = "";
foreach($items as $item) {
$url = $baseUrl . $item;
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTMLFile($url);
$output .= json_encode($dom->saveHTML(),JSON_ERROR_UTF8);
}
echo $output;

Related

Webscraping with Goutte and Guzzle

I have the following method from my controller that gets the data from the site:
$goutteClient = new Client();
$guzzleClient = new GuzzleClient([
'timeout' => 60,
]);
$goutteClient->setClient($guzzleClient);
$crawler = $goutteClient->request('GET', 'https://html.duckduckgo.com/html/?q=Laravel');
$crawler->filter('.result__title .result__a')->each(function ($node) {
dump($node->text());
});
The above code gives me the title of contents from the search results. I also want to get the link of the corresponding search result. That resides in class result__extras__url.
How do I filter the link in and the title at once? Or do I have to run another method for that?
Try to inspect the attributes of the nodes. Once you get the href attribute, parse it to get the URL.
$crawler->filter('.result__title .result__a')->each(function ($node) {
$parts = parse_url(urldecode($node->attr('href')));
parse_str($parts['query'], $params);
$url = $params['uddg']; // DDG puts their masked URL and places the actual URL as a query param.
$title = $node->text();
});
For parsing, I usually do the following:
$doc = new DOMDocument();
$doc->loadHTML((string)$crawler->getBody());
from then on, you can access using getElementsByTagName functions on your DOMDocument.
for example:
$rows = $doc->getElementsByTagName('tr');
foreach ($rows as $row) {
$cols = $row->getElementsByTagName('td');
$value = trim($cols->item(0)->nodeValue);
}
You can find more information in
https://www.php.net/manual/en/class.domdocument.php

Second DOMDocument on php web crawler returns Internal Server Error 500

I am writing a web crawler in php, and everything was going ok until i tried to get information from secondary pages, now my code works for a few second and then returns a Internal Server Error 500. Can someone tell me why?
$dom = new DOMDocument('1.0');
libxml_use_internal_errors(true);
#$dom->loadHTMLFile($curPage);
$dom_xpath = new DOMXPath($dom);
$aElements = $dom_xpath->query("//a[#class='js-publication-title-link ga-publication-item']");
foreach ($aElements as $element) {
$href = $element->getAttribute('href');
if(0 === stripos($href,'publication/')){
$num = $num+1;
$publicationNum = $publicationNum+1;
$spans = $dom_xpath->query(".//span[#class='publication-title js-publication-title']",$element);
$publicationName = $spans->item(0)->nodeValue;
$publicationUrl = "https://www.researchgate.net/".$href;
//Here' where things start to go wrong
getPublicationData($publicationUrl);
That function receives a url and tries to extract some data from it.
function getPublicationData($url){
static $seen = array();
if (isset($seen[$url])) {
return;
}
$seen[$url] = true;
$dom= new DOMDocument('1.0');
libxml_use_internal_errors(true);
$dom->loadHTMLFile($url);
$dom_xpath = new DOMXPath($dom);
//metodo 1
$strongElements = $dom_xpath->query("//strong[#class='publication-meta-type']");
foreach( $strongElements as $strongElement){
echo $strongElement->nodeValue;
}
}
Then after a few seconds working fine ( I know that its working fine because the code is inside a loop, and it only crashes after a few loops) it returnsa Internal Server Error 500.
edit
I already using ini_set('display_errors', 1);and it doenst show me anything :(

preg_match_all to show multiple results

I'm trying to get this messages (METAR) from a page and show everything just in other php file without the styles and extra info.
At the moment I'm using this code:
<?php
$options = array('http' => array(
'method' => 'GET',
));
$config= stream_context_create($options);
$config_final=file_get_contents('http://www.smn.gov.ar/mensajes/index.php?observacion=metar&operacion=consultar&87582=on&87641=on&87750=on&87765=on&87222=on&87761=on&87860=on&87395=on&87344=on&87166=on&87904=on&87571=on&87347=on&87803=on&87576=on&87162=on&87532=on&87497=on&87097=on&87046=on&87548=on&87217=on&87506=on&87692=on&87418=on&87574=on&87715=on&87374=on&87289=on&87852=on&87178=on&87896=on&87823=on&87270=on&87155=on&87453=on&87925=on&87934=on&87480=on&87047=on&87553=on&87311=on&87909=on&87436=on&87509=on&87912=on&87623=on&87444=on&87129=on&87371=on&87645=on&87022=on&87127=on&87828=on&87121=on&87938=on&87791=on&87448=on',false, $config);
preg_match_all("|<td width=\"100%\">METAR (.*)</td>|sU", $config_final, $tiempo);
echo $tiempo[1][0];
?>
</div>
Using that code I can get only the first METAR, Waht I need is to see all of them in different lines, like showing different results.
Any ideas?
Thanks in advance
I suggest you utilize PHP's built in HTML Parsers for this, in particular the DOMDocument, and use DOMXpath to search for those needle.
Example:
$url = 'http://www.smn.gov.ar/mensajes/index.php?observacion=metar&operacion=consultar&87582=on&87641=on&87750=on&87765=on&87222=on&87761=on&87860=on&87395=on&87344=on&87166=on&87904=on&87571=on&87347=on&87803=on&87576=on&87162=on&87532=on&87497=on&87097=on&87046=on&87548=on&87217=on&87506=on&87692=on&87418=on&87574=on&87715=on&87374=on&87289=on&87852=on&87178=on&87896=on&87823=on&87270=on&87155=on&87453=on&87925=on&87934=on&87480=on&87047=on&87553=on&87311=on&87909=on&87436=on&87509=on&87912=on&87623=on&87444=on&87129=on&87371=on&87645=on&87022=on&87127=on&87828=on&87121=on&87938=on&87791=on&87448=on';
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTMLFile($url);
libxml_clear_errors();
$xpath = new DOMXpath($dom);
// search for td's with contains METAR
$metars = $xpath->query('//td[contains(text(), "METAR")]');
if($metars->length <= 0) {
echo 'no metars found';
exit;
}
$data = array();
foreach($metars as $metar) {
$data[] = $metar->nodeValue;
}
echo '<pre>';
print_r($data);

PHP DOMDocument how to get element?

I am trying to read a website's content but i have a problem i want to get images, links these elements but i want to get elements them selves not the element content for instance i want to get that: i want to get that entire element.
How can i do this..
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.link.com");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$output = curl_exec($ch);
$dom = new DOMDocument;
#$dom->loadHTML($output);
$items = $dom->getElementsByTagName('a');
for($i = 0; $i < $items->length; $i++) {
echo $items->item($i)->nodeValue . "<br />";
}
curl_close($ch);;
?>
You appear to be asking for the serialized html of a DOMElement? E.g. you want a string containing link text? (Please make your question clearer.)
$url = 'http://example.com';
$dom = new DOMDocument();
$dom->loadHTMLFile($url);
$anchors = $dom->getElementsByTagName('a');
foreach ($anchors as $a) {
// Best solution, but only works with PHP >= 5.3.6
$htmlstring = $dom->saveHTML($a);
// Otherwise you need to serialize to XML and then fix the self-closing elements
$htmlstring = saveHTMLFragment($a);
echo $htmlstring, "\n";
}
function saveHTMLFragment(DOMElement $e) {
$selfclosingelements = array('></area>', '></base>', '></basefont>',
'></br>', '></col>', '></frame>', '></hr>', '></img>', '></input>',
'></isindex>', '></link>', '></meta>', '></param>', '></source>',
);
// This is not 100% reliable because it may output namespace declarations.
// But otherwise it is extra-paranoid to work down to at least PHP 5.1
$html = $e->ownerDocument->saveXML($e, LIBXML_NOEMPTYTAG);
// in case any empty elements are expanded, collapse them again:
$html = str_ireplace($selfclosingelements, '>', $html);
return $html;
}
However, note that what you are doing is dangerous because it could potentially mix encodings. It is better to have your output as another DOMDocument and use importNode() to copy the nodes you want. Alternatively, use an XSL stylesheet.
I'm assuming you just copy-pasted some example code and didn't bother trying to learn how it actually works...
Anyway, the ->nodeValue part takes the element and returns the text content (because the element has a single text node child - if it had anything else, I don't know what nodeValue would give).
So, just remove the ->nodeValue and you have your element.

Prevent save() from overwriting dtd in an xml file

I'm writing a script that adds nodes to an xml file. In addition to this I have an external dtd I made to handle the organization to the file. However the script I wrote keeps overwriting the dtd in the empty xml file when it's done appending nodes. How can I stop this from happening?
Code:
<?php
/*Dom vars*/
$dom = new DOMDocument("1.0", "UTF-8");
$previous_value = libxml_use_internal_errors(TRUE);
$dom->load('post.xml');
libxml_clear_errors();
libxml_use_internal_errors($previous_value);
$dom->formatOutput = true;
$entry = $dom->getElementsByTagName('entry');
$date = $dom->getElementsByTagName('date');
$para = $dom->getElementsByTagname('para');
$link = $dom->getElementsByTagName('link');
/* Dem POST vars used by dat Ajax mah ziggen, yeah boi*/
if (isset($_POST['Text'])){
$text = trim($_POST['Text']);
}
/*
function post(){
global $dom, $entry, $date, $para, $link,
$home, $about, $contact, $text;
*/
$entryC = $dom->createElement('entry');
$dateC = $dom->createElement('date', date("m d, y H:i:s")) ;
$entryC->appendChild($dateC);
$tab = "\n";
$frags = explode($tab, $text);
$i = count($frags);
$b = 0;
while($b < $i){
$paraC = $dom->createElement('para', $frags[$b]);
$entryC->appendChild($paraC);
$b++;
}
$linkC = $dom->createElement('link', rand(100000, 999999));
$entryC->appendChild($linkC);
$dom->appendChild($entryC);
$dom->save('post.xml');
/*}
post();
*/echo 1;
?>
It looks like in order to do this, you'd have to create a DOMDocumentType using
DOMImplementation::createDocumentType
then create an empty document using the DOMImplementation, and pass in the DOMDocumentType you just created, then import the document you loaded. This post: http://pointbeing.net/weblog/2009/03/adding-a-doctype-declaration-to-a-domdocument-in-php.html and the comments looked useful.
I'm guessing this is happening because after parsing/validation, the DTD isn't part of the DOM anymore, and PHP therefore isn't able to include it when the document is serialized.
Do you have to use a DTD? XML Schemas can be linked via attributes (and the link is therefore part of the DOM). Or there's RelaxNG, which can be linked via a processing instruction. DTDs have all this baggage that comes with them as a holdover from SGML. There are better alternatives.

Categories