How to scrape multiple divs? - php

Hello, I've got a bunch of divs I'm trying to scrape content values from, and I've managed to successfully pull out one of the values (result!). However, I've hit a brick wall: I now want to pull out the value after it using the code I've already written. I'd appreciate any help.
Here is the bit of code I'm currently using:
foreach ($arr as &$value) {
    $file = $DOCUMENT_ROOT . $value;
    $doc = new DOMDocument();
    $doc->loadHTMLFile($file);
    $xpath = new DOMXpath($doc);
    // Note: query() returns false (not null) on a malformed expression
    $elements = $xpath->query("//*[contains(@class, 'covGroupBoxContent')]//div[3]//div[2]");
    if ($elements !== false) {
        foreach ($elements as $element) {
            foreach ($element->childNodes as $node) {
                $maps = $node->nodeValue;
                echo $maps;
            }
        }
    }
}
I simply want them all to have separate outputs that I can echo out.
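For what it's worth, here is a sketch of one way to keep the outputs separate, based on the question's own code (it assumes, as above, that $arr holds file paths relative to the document root):
$results = [];
foreach ($arr as $value) {
    $doc = new DOMDocument();
    $doc->loadHTMLFile($DOCUMENT_ROOT . $value);
    $xpath = new DOMXpath($doc);
    $elements = $xpath->query("//*[contains(@class, 'covGroupBoxContent')]//div[3]//div[2]");
    foreach ($elements as $element) {
        $results[] = trim($element->textContent); // one array entry per matched div
    }
}
foreach ($results as $i => $text) {
    echo "Value {$i}: {$text}\n"; // each value echoed separately
}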

I recommend you use Simple HTML DOM. Beyond that, I'd need to see a sample of the HTML you are scraping.
If you are scraping a website outside your domain, I'd recommend saving the source HTML to a file for review and testing. Some websites combat scraping, so what you see in the browser is not necessarily what your scraper sees.
Also, I'd recommend setting a random user agent via ini_set(); a sketch of one way to do that follows the snippet below.
<?php
// file_get_html() comes from the Simple HTML DOM library
$html = file_get_html($url);
if ($html) {
    $myfile = fopen("testing.html", "w") or die("Unable to open file!");
    fwrite($myfile, $html);
    fclose($myfile);
}
?>
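As for the random user agent, here's a minimal sketch of the kind of function I mean (the agent strings below are illustrative, not a canonical list):
<?php
// Pick a random user agent for subsequent HTTP requests made by
// file_get_contents() / file_get_html(); the strings are just examples.
function set_random_user_agent() {
    $agents = array(
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15',
        'Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0',
    );
    ini_set('user_agent', $agents[array_rand($agents)]);
}
set_random_user_agent();
?>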

Related

Getting image source in PHP is extremely slow

I need to get the image src based on the class of the image.
This is the code I wrote.
It works but it is extremely slow.
$url = 'https://' . $_SERVER['HTTP_HOST'] . $_SERVER['REQUEST_URI'];
$html = file_get_contents($url);
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$nodes = $xpath->query("//img[@class='imgbanner']");
if ($nodes->length > 0) {
    $src = $nodes->item(0)->getAttribute('src');
} else {
    $src = null;
}
Any clues on how to improve speed?
Assuming you're trying to parse a somewhat convoluted HTML document, and especially considering your rather limited use case, you might be better off resorting to regular expressions and some string parsing (again, only in these concrete circumstances; cf. this post's closing remarks).
For testing purposes, let's set up an HTML document with 10,000 image tags, each of them looking like this one:
<img class="imgbanner" src="a49851fb74.jpg">
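(A quick way to generate such a test file, in case you want to reproduce the numbers; the pics.html filename matches the benchmark snippets below:)
<?php
// Write 10,000 image tags with random 10-character filenames to pics.html.
$tags = '';
for ($i = 0; $i < 10000; $i++) {
    $tags .= '<img class="imgbanner" src="' . bin2hex(random_bytes(5)) . '.jpg">' . "\n";
}
file_put_contents('pics.html', "<html><body>\n" . $tags . "</body></html>");
?>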
To benchmark both approaches more easily (XPath vs. regular expression + string parsing), let's wrap them in two functions (the first one is pretty much the same as the sample code you've provided):
function xpath(string $html): array {
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query("//img[@class='imgbanner']");
    $src = [];
    if ($nodes) {
        foreach ($nodes as $node) {
            $src[] = $node->getAttribute('src');
        }
    }
    libxml_clear_errors(); // Free up memory
    return $src;
}
function regex(string $html): array {
    preg_match_all("/<img[^>]+src=[\"']([^\"']+)[\"'][^>]*>/i", $html, $matches);
    $matches = array_combine($matches[0], $matches[1]);
    $filtered = [];
    foreach ($matches as $key => $value) {
        // Strict !== false comparison: strpos() returns 0 for a match at offset 0
        if (strpos($key, 'class="imgbanner"') !== false || strpos($key, "class='imgbanner'") !== false) {
            $filtered[] = $value;
        }
    }
    return $filtered;
}
Since the HTML document doesn't contain much besides the image tags, XPath is pretty fast (~0.06 seconds over the course of ten runs):
$start = microtime(true);
$html = file_get_contents('pics.html'); // 10,000 random image tags
$src = xpath($html);
$time_elapsed_secs = (microtime(true) - $start);
echo "Total execution time: {$time_elapsed_secs}\n"; // ~0.06 sec
Nevertheless, the second approach turned out to be about ten times faster (~0.005 seconds over the course of ten runs):
$start = microtime(true);
$html = file_get_contents('pics.html'); // 10,000 random image tags
$src = regex($html);
$time_elapsed_secs = (microtime(true) - $start);
echo "Total execution time: {$time_elapsed_secs}\n"; // ~0.005 sec
While the second approach is obviously faster for this very limited use case, bear in mind it's usually a bad idea to parse HTML using regular expressions:
https://blog.codinghorror.com/parsing-html-the-cthulhu-way/
If your parsing needs grow in complexity (i.e., anything beyond the case at hand), you should consider factoring the parsing out into a dedicated command-line script and caching its results.
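A minimal file-based caching sketch, just to illustrate (the cache path and one-hour TTL are arbitrary choices):
<?php
// Serve the extracted src values from a cache file unless it is stale.
$cacheFile = '/tmp/imgbanner_src.json'; // hypothetical location
$ttl = 3600; // one hour
if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $ttl) {
    $src = json_decode(file_get_contents($cacheFile), true);
} else {
    $src = regex(file_get_contents('pics.html')); // regex() as defined above
    file_put_contents($cacheFile, json_encode($src));
}
?>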

PHP Write Code To File From Current Page On Click Of Button

Goal
What I'm aiming to achieve is a small web app where the user can edit certain elements such as img src, href, etc., and then save the result to a file.
I've created the basic code which allows the user to edit; however, I'm struggling to take the amended code and save it to a .html file.
Progress
<?php
function getHTMLByID($id, $html) {
    $dom = new DOMDocument;
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    $node = $dom->getElementById($id);
    if ($node) {
        return $dom->saveXML($node);
    }
    return false;
}

$html = file_get_contents('http://www.mysql.com/');
$codeString = getHTMLByID('l1-nav-container', $html);
echo $codeString;

$myfile = fopen("newfile.html", "w") or die("Unable to open file!");
fwrite($myfile, $codeString);
fclose($myfile);
?>
I've created some basic code which saves a string received from another website to a file; however, I can't work out how to get the edited code from the current page.
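One possible approach (a sketch, not a tested answer from the thread): have the page POST the edited markup back to the script in a form field, then write it out with file_put_contents(). The field name "code" here is hypothetical:
<?php
// If the form below was submitted, persist the edited markup.
if ($_SERVER['REQUEST_METHOD'] === 'POST' && isset($_POST['code'])) {
    if (file_put_contents('newfile.html', $_POST['code']) === false) {
        die('Unable to write file!');
    }
}
?>
<form method="post">
    <textarea name="code"><?php echo htmlspecialchars($codeString); ?></textarea>
    <button type="submit">Save</button>
</form>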

php file_get_contents from different URL if first one not available

I have the following code to read an XML file which works well when the URL is available:
$url = 'http://www1.blahblah.com' . "param1" . "param2";
$xml = file_get_contents($url);
$obj = simplexml_load_string($xml);
How can I change the above code to cycle through a number of different URLs if the first one is unavailable for any reason? I have a list of four URLs, all containing the same file, but I'm unsure how to go about it.
Replace your code with, for example, this:
// instead of a simple variable, use an array of links
$urls = ['http://www1.blahblah.com' . "param1" . "param2",
         'http://www1.anotherblahblah.com' . "param1" . "param2",
         'http://www1.andanotherblahblah.com' . "param1" . "param2",
         'http://www1.andthelastblahblah.com' . "param1" . "param2"];

// try each link in turn until one returns content
foreach ($urls as $url) {
    $xml = @file_get_contents($url);
    // if the content was read without failure, process it and break the loop
    if ($xml !== false) {
        $obj = simplexml_load_string($xml);
        break;
    }
}
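One refinement worth considering: on an unreachable host, file_get_contents() can block for the default socket timeout (often a minute), so a short per-request timeout via a stream context helps the loop move on quickly. A sketch:
<?php
// Give up on each URL after 5 seconds instead of the default timeout.
$context = stream_context_create(array('http' => array('timeout' => 5)));
foreach ($urls as $url) {
    $xml = @file_get_contents($url, false, $context);
    if ($xml !== false) {
        $obj = simplexml_load_string($xml);
        break;
    }
}
?>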

Need help with PHP DOMDocument

I want to append some text to all divs that have the same class.
$dom = new DOMDocument();
$dom->formatOutput = true;
@$dom->loadHTMLFile('first.html');
$xpath = new DOMXPath($dom);
$after = new DOMText('Newly appended text');
$elements = $xpath->query('//div[@class="mix"]');
foreach ($elements as $element) {
    $element->appendChild($after);
    //echo $dom->saveHTML();
}
$dom->saveHTMLFile('first.html');
But when I open first.html, the appended text has only been appended to the last div with that class.
If I uncomment saveHTML(), the echoed output is perfect; the problem only appears after saving.
You cannot append the same DOM node to multiple points in the tree, which is what you are doing here. You need to create a separate (but identical) node each time:
foreach ($elements as $element) {
    $after = new DOMText('Newly appended text'); // moved this inside the loop
    $element->appendChild($after);
}
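Alternatively, you can create the node once and append a fresh clone on each iteration:
$after = new DOMText('Newly appended text');
foreach ($elements as $element) {
    // cloneNode() yields a new node every time, so each div gets its own copy
    $element->appendChild($after->cloneNode(true));
}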

PHP: regex search a pattern in a file and pick it up

I am really confused by regular expressions in PHP.
Anyway, I can't read the whole tutorial right now because I have a bunch of HTML files in which I have to find links ASAP. I came up with the idea of automating it with PHP, which is the language I know.
So I think I can use this script:
$address = "file.txt";
$input = @file_get_contents($address) or die("Could not access file: $address");
$regexp = "??????????";
if (preg_match_all("/$regexp/siU", $input, $matches)) {
    // $matches[2] = array of link addresses
    // $matches[3] = array of link text - including HTML code
}
My problem is with $regexp
My required pattern is like this:
href="/content/r807215r37l86637/fulltext.pdf" title="Download PDF
I want to search for and pick up the /content/r807215r37l86637/fulltext.pdf part from lines like the one above, of which there are many in the files.
Any help?
==================
Edit
The title attributes are important to me, and all of the links I want are titled
title="Download PDF"
Once again: regexes are bad for parsing HTML.
Save your sanity and use the built-in DOM libraries.
$dom = new DOMDocument();
@$dom->loadHTML($html);
$x = new DOMXPath($dom);
$data = array();
foreach ($x->query("//a[@title='Download PDF']") as $node) {
    $data[] = $node->getAttribute("href");
}
Edit: updated the code based on ircmaxell's comment.
That's easier with phpQuery or QueryPath:
foreach (qp($html)->find("a") as $a) {
    if ($a->attr("title") == "Download PDF") {
        print $a->attr("href");
        print $a->innerHTML();
    }
}
With regexps it depends on some consistency of the source:
preg_match_all('#<a[^>]+href="([^>"]+)"[^>]+title="Download PDF"[^>]*>(.*?)</a>#sim', $input, $m);
Matching a fixed title="..." attribute is doable, but more fragile, as it depends on the attribute's position before the closing bracket.
Try something like this. If it does not work, show some examples of the links you want to parse.
<?php
$address = "file.txt";
$input = @file_get_contents($address) or die("Could not access file: $address");
$regexp = '#<a[^>]*href="([^"]*)"[^>]*title="Download PDF"#';
if (preg_match_all($regexp, $input, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        printf("Url: %s<br/>", $match[1]);
    }
}
Edit: updated so it searches for "Download PDF" entries only.
The best way is to use DOMXPath to do the search in one step:
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$links = array();
foreach ($xpath->query('//a[contains(@title, "Download PDF")]') as $node) {
    $links[] = $node->getAttribute("href");
}
Or even:
$links = array();
$query = '//a[contains(@title, "Download PDF")]/@href';
foreach ($xpath->evaluate($query) as $attr) {
    $links[] = $attr->value;
}
href="([^"]+)" will get you all the links of that form.
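For completeness, used with preg_match_all() that looks like this (note it grabs every href in the file, not just the "Download PDF" ones):
preg_match_all('/href="([^"]+)"/', $input, $matches);
$links = $matches[1]; // array of all href values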
