I'm creating a tool that works with file strings, and I need to get the line number where a node is found. That is, I have this:
$dom = new DOMDocument('1.0');
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach ($xpath->query("//text()") as $q) {
// $line = WHAT???
$strings[trim($q->nodeValue)] = $line;
}
and I need to know on which line the string I'm storing in the $strings array begins. Is it possible to get it?
Each DOMNode object has a getLineNo() method that returns the line number. In your case the node is a DOMText object, which extends DOMNode:
foreach ($xpath->query("//text()") as $q) {
$line = $q->getLineNo();
$strings[trim($q->nodeValue)] = $line;
}
getLineNo() is available as of PHP 5.3, so you might need to upgrade if you have not done so yet.
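As a rough, self-contained sketch of the answer above (the sample markup and the whitespace filter in the XPath query are assumptions added for illustration, not part of the original question), the loop can be tried on an inline string. The number reported for a text node is whatever libxml recorded while parsing, so it is worth checking it against your own input:

```php
<?php
// Hypothetical sample input; in the question this comes from a file string.
$html = "<html>\n<body>\n<p>First string</p>\n<p>Second string</p>\n</body>\n</html>";

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument('1.0');
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$strings = [];

// normalize-space() skips text nodes that contain only whitespace.
foreach ($xpath->query('//text()[normalize-space()]') as $node) {
    $strings[trim($node->nodeValue)] = $node->getLineNo();
}

foreach ($strings as $text => $line) {
    echo "{$text}: line {$line}\n";
}
```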
I need to get the image src based on the class of the image.
This is the code I wrote.
It works but it is extremely slow.
$url='https://' . $_SERVER['HTTP_HOST'].$_SERVER['REQUEST_URI'];
$html= file_get_contents($url);
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$nodes = $xpath->query("//img[@class='imgbanner']");
if ($nodes->length > 0) {
$src = $nodes->item(0)->getAttribute('src');
}
else {
$src = null;
}
Any clues on how to improve speed?
Assuming you're trying to parse a somewhat convoluted HTML document, and especially considering your rather limited use case, you might be better off resorting to regular expressions and some string parsing (again, only in these specific circumstances; cf. this post's closing remarks).
For testing purposes, let's set up an HTML document with 10,000 image tags, each of them looking like this one:
<img class="imgbanner" src="a49851fb74.jpg">
To benchmark both approaches more easily (XPath vs. regular expression + string parsing), let's wrap them in two functions (the first one is pretty much the same as the sample code you've provided):
function xpath(string $html): array {
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$nodes = $xpath->query("//img[@class='imgbanner']");
$src = [];
if ($nodes) {
foreach ($nodes as $node) {
$src[] = $node->getAttribute('src');
}
}
libxml_clear_errors(); // Free up memory
return $src;
}
function regex(string $html): array {
preg_match_all("/<img[^>]+src=[\"']([^\"']+)[\"'][^>]*>/i", $html, $matches);
$matches = array_combine($matches[0], $matches[1]);
$filtered = [];
foreach ($matches as $key => $value) {
if (strpos($key, 'class="imgbanner"') !== false || strpos($key, "class='imgbanner'") !== false) {
$filtered[] = $value;
}
}
return $filtered;
}
Since the HTML document doesn't contain much else but the image tags, XPath is pretty fast (~0.06 seconds over the course of ten runs):
$start = microtime(true);
$html = file_get_contents('pics.html'); // 10,000 random image tags
$src = xpath($html);
$time_elapsed_secs = (microtime(true) - $start);
echo "Total execution time: {$time_elapsed_secs}\n"; // ~0.06 sec
Nevertheless, the second approach turned out to be about ten times faster (~0.005 seconds over the course of ten runs):
$start = microtime(true);
$html = file_get_contents('pics.html'); // 10,000 random image tags
$src = regex($html);
$time_elapsed_secs = (microtime(true) - $start);
echo "Total execution time: {$time_elapsed_secs}\n"; // ~0.005 sec
While the second approach is obviously faster for this very limited use case, bear in mind it's usually a bad idea to parse HTML using regular expressions:
https://blog.codinghorror.com/parsing-html-the-cthulhu-way/
If your parsing needs grow in complexity (i.e., anything beyond the case at hand), you should consider factoring out the parsing into a dedicated command-line script and caching its results.
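The caching idea could look roughly like the sketch below. The function name cachedImageSources, the JSON cache file, and the one-hour lifetime are all assumptions made for illustration; the answer itself does not prescribe a cache format:

```php
<?php
// Hypothetical helper: parse the HTML once and reuse the stored result
// until the cache file is older than $ttl seconds.
function cachedImageSources(string $html, string $cacheFile, int $ttl = 3600): array
{
    clearstatcache(true, $cacheFile);

    // Serve from the cache while it is still fresh.
    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $ttl) {
        $cached = json_decode(file_get_contents($cacheFile), true);
        if (is_array($cached)) {
            return $cached;
        }
    }

    // Cache miss: do the (comparatively slow) DOM/XPath parsing.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $src = [];
    foreach ($xpath->query("//img[@class='imgbanner']") as $node) {
        $src[] = $node->getAttribute('src');
    }

    file_put_contents($cacheFile, json_encode($src), LOCK_EX);
    return $src;
}
```

With something like this in place, only the first request after the cache expires pays for the parsing; later requests just read a small JSON file.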
I'm very new to PHP. I understand that echo is how you output text, but I'm not sure how to apply it in the scenario below, where data is scraped and written to a file. Is there a way with file_put_contents to add text to the output? The text I'm trying to add is a "%". The output of the code below is a number that changes daily, and it is in fact a percentage, so I'd like to add the sign to the end of the output every time.
Thanks so much for any assistance.
// get japanchange
function getJapanchange(){
$doc = new DOMDocument;
// We don't want to bother with white spaces
$doc->preserveWhiteSpace = false;
// Most HTML Developers are chimps and produce invalid markup...
$doc->strictErrorChecking = false;
$doc->recover = true;
$doc->loadHTMLFile('http://________________//global-indices/');
$xpath = new DOMXPath($doc);
$query = "//div[@class='MT10']";
$entries = $xpath->query($query);
foreach ($entries as $entry) {
$result = trim($entry->textContent);
$ret_ = explode(' ', $result);
//make sure no element in the array starts or ends with whitespace
foreach ($ret_ as $key=>$val){
$ret_[$key]=trim($val);
}
//delete the empty element and the element is blank "\n" "\r" "\t"
//I modify this line
$ret_ = array_values(array_filter($ret_, 'deleteBlankInArray'));
//echo the last element
file_put_contents(globalVars::$_cache_dir . "japanchange",
$ret_[56]);
}
}
If you just want to add a % to the end of the output written to the file you're already using, you could simply do:
file_put_contents(globalVars::$_cache_dir . "japanchange",
$ret_[56].'%');
I want to append some text to all divs that have the same class.
$dom = new DOMdocument();
$dom->formatOutput = true;
@$dom->loadHTMLFile('first.html');
$xpath = new DOMXPath($dom);
$after = new DOMText('Newly appended text');
$elements = $xpath->query('//div[@class="mix"]');
foreach($elements as $element)
{
$element->appendChild($after);
//echo $dom->saveHTML();
}
$dom->saveHTMLFile('first.html');
But when I open first.html, the appended text is only appended to the last div of the above class.
If I uncomment saveHTML(), it shows the correct result inside the loop. The problem only appears after saving.
You cannot append the same DOM node to multiple points in the tree, which is what you are doing here. You need to create a separate (but identical) node each time:
foreach($elements as $element)
{
$after = new DOMText('Newly appended text'); // moved this inside the loop
$element->appendChild($after);
}
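A variation on the same fix, sketched under the assumption that the markup is available as an inline string (the sample HTML and the variable name $template are hypothetical): create one text node as a template and append a clone of it to each div, so that every div receives its own node.

```php
<?php
// Hypothetical sample markup standing in for first.html.
$html = '<div class="mix">one</div><div class="mix">two</div><div class="other">three</div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// One template node; it is never inserted into the tree itself.
$template = $dom->createTextNode(' Newly appended text');

foreach ($xpath->query('//div[@class="mix"]') as $element) {
    // cloneNode() returns a fresh copy, so each div gets its own text node.
    $element->appendChild($template->cloneNode());
}

$output = $dom->saveHTML();
echo $output;
```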
foreach ($filePaths as $filePath) {
/*Open a file, run a function to write a new file
that rewrites the information to meet design specifications */
$fileHandle = fopen($filePath, "r+");
$newHandle = new DOMDocument();
$newHandle->loadHTMLFile( $filePath );
$metaTitle = trim(retrieveTitleText($newHandle));
$pageMeta = array('metaTitle' => $metaTitle, 'pageTitle' => 'Principles of Biology' );
$attributes = retrieveBodyAttributes($filePath);
cleanfile($fileHandle, $filePath);
fclose($fileHandle);
}
function retrieveBodyAttributes($filePath) {
$dom = new DOMDocument;
$dom->loadHTMLFile($filePath);
$p = $dom->getElementsByTagName('body')->item(0);
/*if (!$p->hasAttribute('body')) {
$bodyAttr[] = array('attr'=>" ", 'value'=>" ");
return $bodyAttr;
}*/
if ($p->hasAttributes()) {
foreach ($p->attributes as $attr) {
$name = $attr->nodeName;
$value = $attr->nodeValue;
$bodyAttr[] = array('attr'=>$name, 'value'=>$value);
}
return $bodyAttr;
}
}
$filePaths is an array of strings. When I run the code, it gives me a "Call to member function hasAttributes() on non-object" error for the line that calls hasAttributes. When the commented-out block is enabled, I get the same error on the line that calls hasAttribute('body'). I tried a var_dump on $p, on the line just after the call to getElementsByTagName, and I got "object (DOMElement) [5]". The number changed because I was running the code on multiple files at once, but I didn't know what the number meant. I can't find what I'm doing wrong.
With:
$p = $dom->getElementsByTagName('body')->item(0);
You are calling DOMNodeList::item (see: http://www.php.net/manual/en/domnodelist.item.php), which returns NULL if no element is found at the given index.
But you're not checking for that possibility; you're just expecting $p to be non-null.
Try adding something like:
if ($p instanceof DOMNode) {
// the hasAttributes code
}
Although, if you're sure that there should be a body element, you'll probably have to check your file paths.
This is most likely because there is no <body> tag in your DOM document.
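Putting the null check together with the attribute loop, a defensive version might look like the sketch below. The function name bodyAttributes and the choice to take a DOMDocument and return a name => value array are assumptions for illustration, not the asker's original signature:

```php
<?php
// Hypothetical helper: returns the <body> attributes as name => value pairs,
// or an empty array when the document has no <body> element at all
// (for example because loading the file failed).
function bodyAttributes(DOMDocument $dom): array
{
    $body = $dom->getElementsByTagName('body')->item(0);

    // item() returns null when nothing exists at that index.
    if (!$body instanceof DOMElement) {
        return [];
    }

    $attributes = [];
    foreach ($body->attributes as $attr) {
        $attributes[$attr->nodeName] = $attr->nodeValue;
    }
    return $attributes;
}
```

Returning an empty array in both the "no body" and the "no attributes" case means the caller never has to deal with null.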
I am really confused by regular expressions in PHP.
Anyway, I can't read a whole tutorial right now, because I have a bunch of HTML files in which I have to find links ASAP. I came up with the idea of automating it with PHP, which is the language I know.
So I think I can use this script:
$address = "file.txt";
$input = @file_get_contents($address) or die("Could not access file: $address");
$regexp = "??????????";
if(preg_match_all("/$regexp/siU", $input, $matches)) {
// $matches[2] = array of link addresses
// $matches[3] = array of link text - including HTML code
}
My problem is with $regexp.
My required pattern looks like this:
href="/content/r807215r37l86637/fulltext.pdf" title="Download PDF
I want to search for and extract the /content/r807215r37l86637/fulltext.pdf part from lines like the one above, of which there are many in the files.
Any help?
==================
edit
The title attributes are important to me, and all the links I want are titled
title="Download PDF"
Once again, regular expressions are bad for parsing HTML.
Save your sanity and use the built-in DOM libraries.
$dom = new DOMDocument();
@$dom->loadHTML($html);
$x = new DOMXPath($dom);
$data = array();
foreach($x->query("//a[@title='Download PDF']") as $node)
{
$data[] = $node->getAttribute("href");
}
Edit
Updated code based on ircmaxell comment.
That's easier with phpQuery or QueryPath:
foreach (qp($html)->find("a") as $a) {
if ($a->attr("title") == "Download PDF") {
print $a->attr("href");
print $a->innerHTML();
}
}
With regexps it depends on some consistency of the source:
preg_match_all('#<a[^>]+href="([^>"]+)"[^>]+title="Download PDF"[^>]*>(.*?)</a>#sim', $input, $m);
Looking for a fixed title="..." attribute is doable, but more difficult as it depends on the position before the closing bracket.
Try something like this. If it does not work, show some examples of the links you want to parse.
<?php
$address = "file.txt";
$input = @file_get_contents($address) or die("Could not access file: $address");
$regexp = '#<a[^>]*href="([^"]*)"[^>]*title="Download PDF"#';
if(preg_match_all($regexp, $input, $matches, PREG_SET_ORDER)) {
foreach ($matches as $match) {
printf("Url: %s<br/>", $match[1]);
}
}
Edit: updated so it searches for "Download PDF" entries only.
The best way is to use DomXPath to do the search in one step:
$dom = new DomDocument();
$dom->loadHTML($html);
$xpath = new DomXPath($dom);
$links = array();
foreach($xpath->query('//a[contains(@title, "Download PDF")]') as $node) {
$links[] = $node->getAttribute("href");
}
Or even:
$links = array();
$query = '//a[contains(@title, "Download PDF")]/@href';
foreach($xpath->evaluate($query) as $attr) {
$links[] = $attr->value;
}
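To illustrate why the XPath route is less fragile than the regular expressions above, here is a small sketch on made-up input (the second and third links are hypothetical; only the first path comes from the question). The second link has its title before its href, which the regex patterns would miss, while the XPath query does not care about attribute order:

```php
<?php
// Sample input: the first link is from the question, the others are invented.
$html = <<<HTML
<a href="/content/r807215r37l86637/fulltext.pdf" title="Download PDF">PDF</a>
<a title="Download PDF" href="/content/example/fulltext.pdf">PDF</a>
<a href="/about.html" title="About">About</a>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$links = [];

// Select the href attribute nodes of all matching links directly.
foreach ($xpath->query('//a[@title="Download PDF"]/@href') as $attr) {
    $links[] = $attr->value;
}

print_r($links);
```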
href="([^"]+)" will get you all the links of that form.