I am really confused by regular expressions in PHP.
I can't read a whole tutorial right now because I have a bunch of HTML files in which I need to find links ASAP. I came up with the idea of automating it with PHP, which is the language I know.
So I think I can use this script:
$address = "file.txt";
$input = @file_get_contents($address) or die("Could not access file: $address");
$regexp = "??????????";
if(preg_match_all("/$regexp/siU", $input, $matches)) {
// $matches[2] = array of link addresses
// $matches[3] = array of link text - including HTML code
}
My problem is with $regexp.
My required pattern looks like this:
href="/content/r807215r37l86637/fulltext.pdf" title="Download PDF
I want to search for and extract /content/r807215r37l86637/fulltext.pdf from lines like the one above; there are many of them in the files.
Any help?
==================
Edit:
The title attributes are important to me; all the links I want are titled
title="Download PDF"
Once again: regular expressions are bad for parsing HTML.
Save your sanity and use the built-in DOM libraries.
$dom = new DOMDocument();
@$dom->loadHTML($html);
$x = new DOMXPath($dom);
$data = array();
foreach($x->query("//a[@title='Download PDF']") as $node)
{
$data[] = $node->getAttribute("href");
}
Edit
Updated code based on ircmaxell comment.
That's easier with phpQuery or QueryPath:
foreach (qp($html)->find("a") as $a) {
if ($a->attr("title") == "Download PDF") {
print $a->attr("href");
print $a->innerHTML();
}
}
With regexps it depends on some consistency of the source:
preg_match_all('#<a[^>]+href="([^>"]+)"[^>]+title="Download PDF"[^>]*>(.*?)</a>#sim', $input, $m);
Looking for a fixed title="..." attribute is doable, but more difficult as it depends on the position before the closing bracket.
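To illustrate the position problem: the single-pass pattern above only matches when href comes before title inside the tag. A minimal sketch of a two-step approach that does not care about attribute order is shown below. The sample markup and the /content/a1, /content/b2 and /content/c3 paths are made up for illustration, and it still assumes double-quoted attribute values.

```php
<?php
// Hypothetical sample: the same kind of link written with both attribute orders.
$input = '<a href="/content/a1/fulltext.pdf" title="Download PDF">PDF</a>'
       . '<a title="Download PDF" href="/content/b2/fulltext.pdf">PDF</a>'
       . '<a href="/content/c3/abstract.html" title="Abstract">Abstract</a>';

// Step 1: grab every opening <a ...> tag, whatever its attribute order.
preg_match_all('#<a\b[^>]*>#i', $input, $tags);

$links = [];
foreach ($tags[0] as $tag) {
    // Step 2: keep only tags that carry the wanted title ...
    if (!preg_match('#\btitle="Download PDF"#i', $tag)) {
        continue;
    }
    // Step 3: ... and read the href from that same tag.
    if (preg_match('#\bhref="([^"]*)"#i', $tag, $m)) {
        $links[] = $m[1];
    }
}

print_r($links);
```

This keeps each pattern simple, at the cost of running two small matches per tag.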
Try something like this. If it does not work, show some examples of the links you want to parse.
<?php
$address = "file.txt";
$input = @file_get_contents($address) or die("Could not access file: $address");
$regexp = '#<a[^>]*href="([^"]*)"[^>]*title="Download PDF"#';
if(preg_match_all($regexp, $input, $matches, PREG_SET_ORDER)) {
foreach ($matches as $match) {
printf("Url: %s<br/>", $match[1]);
}
}
Edit: updated so it searches for "Download PDF" entries only.
The best way is to use DomXPath to do the search in one step:
$dom = new DomDocument();
$dom->loadHTML($html);
$xpath = new DomXPath($dom);
$links = array();
foreach($xpath->query('//a[contains(@title, "Download PDF")]') as $node) {
$links[] = $node->getAttribute("href");
}
Or even:
$links = array();
$query = '//a[contains(@title, "Download PDF")]/@href';
foreach($xpath->evaluate($query) as $attr) {
$links[] = $attr->value;
}
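For anyone who wants to try this end to end, here is a self-contained sketch of the same idea. The inline markup (including the /content/x1 link) is a made-up sample, and it uses libxml_use_internal_errors() to silence parser warnings on sloppy HTML; in practice $html would come from file_get_contents().

```php
<?php
// Hypothetical sample markup, for illustration only.
$html = <<<HTML
<html><body>
<a href="/content/r807215r37l86637/fulltext.pdf" title="Download PDF">PDF</a>
<a href="/content/r807215r37l86637/abstract.html" title="Abstract">Abstract</a>
<a title="Download PDF" href="/content/x1/fulltext.pdf">PDF</a>
</body></html>
HTML;

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$links = [];
// Select the href attribute nodes of all anchors with the exact title.
foreach ($xpath->query('//a[@title="Download PDF"]/@href') as $attr) {
    $links[] = $attr->value;
}

print_r($links);
```

Note that the attribute order inside the tag does not matter here, which is the main advantage over a regular expression.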
href="([^"]+)" will get you all the links of that form.
Related
I need to get the image src based on the class of the image.
This is the code I wrote.
It works but it is extremely slow.
$url='https://' . $_SERVER['HTTP_HOST'].$_SERVER['REQUEST_URI'];
$html= file_get_contents($url);
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$nodes = $xpath->query("//img[@class='imgbanner']");
if ($nodes->length > 0) {
$src = $nodes->item(0)->getAttribute('src');
}
else {
$src = null;
}
Any clues on how to improve speed?
Assuming you're trying to parse a somewhat convoluted HTML document, and especially considering your rather limited use case, you might be better off resorting to regular expressions and some string parsing (again, only in these concrete circumstances; cf. this post's closing remarks).
For testing purposes, let's set up an HTML document with 10,000 image tags, each of them looking like this one:
<img class="imgbanner" src="a49851fb74.jpg">
To benchmark both approaches more easily (XPath vs. regular expression + string parsing), let's wrap them in two functions (the first one is pretty much the same as the sample code you've provided):
function xpath(string $html): array {
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$nodes = $xpath->query("//img[@class='imgbanner']");
$src = [];
if ($nodes) {
foreach ($nodes as $node) {
$src[] = $node->getAttribute('src');
}
}
libxml_clear_errors(); // Free up memory
return $src;
}
function regex(string $html): array {
preg_match_all("/<img[^>]+src=[\"']([^\"']+)[\"'][^>]*>/i", $html, $matches);
$matches = array_combine($matches[0], $matches[1]);
$filtered = [];
foreach ($matches as $key => $value) {
if (strpos($key, 'class="imgbanner"') !== false || strpos($key, "class='imgbanner'") !== false) {
$filtered[] = $value;
}
}
return $filtered;
}
Since the HTML document doesn't contain much else but the image tags, XPath is pretty fast (~0.06 seconds over the course of ten runs):
$start = microtime(true);
$html = file_get_contents('pics.html'); // 10,000 random image tags
$src = xpath($html);
$time_elapsed_secs = (microtime(true) - $start);
echo "Total execution time: {$time_elapsed_secs}\n"; // ~0.06 sec
Nevertheless, the second approach turned out to be about ten times faster (~0.005 seconds over the course of ten runs):
$start = microtime(true);
$html = file_get_contents('pics.html'); // 10,000 random image tags
$src = regex($html);
$time_elapsed_secs = (microtime(true) - $start);
echo "Total execution time: {$time_elapsed_secs}\n"; // ~0.005 sec
While the second approach is obviously faster for this very limited use case, bear in mind it's usually a bad idea to parse HTML using regular expressions:
https://blog.codinghorror.com/parsing-html-the-cthulhu-way/
If your parsing needs grow in complexity (i.e., anything above the case at hand), you should consider factoring out the parsing into a dedicated command line script and cache its results.
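As a rough idea of what caching the results could look like, here is a minimal sketch that re-parses only when the HTML file is newer than a cached JSON file. The function name cachedBannerSources() and the cache file layout are assumptions made for this example, not part of any library.

```php
<?php
/**
 * Return the banner image sources for $htmlFile, re-parsing only when the
 * HTML file is newer than the cached JSON result.
 * Function name and cache layout are illustrative assumptions.
 */
function cachedBannerSources(string $htmlFile, string $cacheFile): array {
    if (is_file($cacheFile) && filemtime($cacheFile) >= filemtime($htmlFile)) {
        $cached = json_decode(file_get_contents($cacheFile), true);
        if (is_array($cached)) {
            return $cached;               // cache hit: no parsing at all
        }
    }

    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML(file_get_contents($htmlFile));
    libxml_clear_errors();

    $src = [];
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query("//img[@class='imgbanner']") as $node) {
        $src[] = $node->getAttribute('src');
    }

    file_put_contents($cacheFile, json_encode($src), LOCK_EX);
    return $src;
}

// Usage with throwaway files in the system temp directory.
$htmlFile  = tempnam(sys_get_temp_dir(), 'pics');
$cacheFile = $htmlFile . '.cache.json';
file_put_contents($htmlFile, '<html><body><img class="imgbanner" src="a49851fb74.jpg"><img class="other" src="x.jpg"></body></html>');

$first  = cachedBannerSources($htmlFile, $cacheFile); // parses and writes the cache
$second = cachedBannerSources($htmlFile, $cacheFile); // served from the cache
print_r($second);
```

With a cache like this, the cost of the parser matters only when the source document actually changes.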
Hello, I've got a bunch of divs I'm trying to scrape the content values from, and I've managed to pull out one of the values successfully. However, I've hit a brick wall: I now want to pull out the value after it, using the code I already have. I would appreciate any help.
Here is the bit of code I'm currently using.
foreach ($arr as &$value) {
$file = $DOCUMENT_ROOT. $value;
$doc = new DOMDocument();
$doc->loadHTMLFile($file);
$xpath = new DOMXpath($doc);
$elements = $xpath->query("//*[contains(@class, 'covGroupBoxContent')]//div[3]//div[2]");
if (!is_null($elements)) {
foreach ($elements as $element) {
$nodes = $element->childNodes;
foreach ($nodes as $node) {
$maps = $node->nodeValue;
echo $maps;
}
}
}
}
I simply want them all to have separate outputs that I can echo out.
I recommend you use Simple HTML DOM. Beyond that I need to see a sample of the HTML you are scraping.
If you are scraping a website outside your domain I'd recommend saving the source HTML to a file for review and testing. Some websites combat scraping, thus what you see in the browser is not what your scraper would see.
Also, I'd recommend setting a random user agent via ini_set(). If you need a function for this I have one.
<?php
$html = file_get_html($url);
if ($html) {
$myfile = fopen("testing.html", "w") or die("Unable to open file!");
fwrite($myfile, $html);
fclose($myfile);
}
?>
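For the user agent part, a minimal sketch could look like the following. The helper name setRandomUserAgent() and the agent strings are made up for this example. ini_set('user_agent', ...) sets the default agent for PHP's HTTP stream wrapper, so it should apply to anything that fetches through file_get_contents().

```php
<?php
// Hypothetical helper: pick one user-agent string at random and make it the
// default for PHP's HTTP stream wrapper (file_get_contents(), loadHTMLFile(), ...).
function setRandomUserAgent(array $agents): string {
    $agent = $agents[array_rand($agents)];
    ini_set('user_agent', $agent);
    return $agent;
}

// Example strings only; substitute whatever agents suit your use case.
$agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0',
    'Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/2.0',
];

$chosen = setRandomUserAgent($agents);
echo "Using user agent: {$chosen}\n";
```

Call it once before the first request; every later HTTP fetch in the same script then sends the chosen agent.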
I want to check whether an <img> tag has alt="" text or not, and I also need to find the line number of that img tag in the DOM. So far I have written the following code, but I am stuck on finding the line number.
For example:
$doc = new DOMDocument();
$doc->loadHTMLFile('http://www.google.com');
$htmlElement = $doc->getElementsByTagName('html');
$tags = $doc->getElementsByTagName('img');
echo $tags->item(0)->getLineNo();
foreach ($tags as $image) {
// Get sizes of elements via width and height attributes
$alt = $image->getAttribute('alt');
if($alt == ""){
$src = $image->getAttribute('src');
echo "No alt text ";
echo '<img src="http://google.com/'.$src.'" alt=""/>'. '<br>';
}
else{
$src = $image->getAttribute('src');
echo '<img src="http://google.com/'.$src.'" alt=""/>'. '<br>';
}
}
With the above code I currently get the images, with the text "No alt text" beside each image that lacks one, but I also want the line number on which that img tag appears.
For example, here the line number is 57:
56. <div class="work_item">
57. <p class="pich"><img src="images/works/1.jpg" alt=""></p>
58. </div>
Use DOMNode::getLineNo(), e.g. $line = $image->getLineNo().
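A small sketch of how that could fit into the loop from the question, collecting the line number and src of every image whose alt text is empty or missing. The sample markup below is made up for illustration.

```php
<?php
// Hypothetical sample page, made up for illustration.
$html = <<<HTML
<html>
<body>
<p class="pich"><img src="images/works/1.jpg" alt=""></p>
<p class="pich"><img src="images/works/2.jpg" alt="Second work"></p>
<p class="pich"><img src="images/works/3.jpg"></p>
</body>
</html>
HTML;

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$missing = [];
foreach ($doc->getElementsByTagName('img') as $image) {
    // getAttribute() returns '' both for alt="" and for a missing alt attribute.
    if (trim($image->getAttribute('alt')) === '') {
        $missing[] = [
            'line' => $image->getLineNo(),
            'src'  => $image->getAttribute('src'),
        ];
    }
}

foreach ($missing as $item) {
    printf("Line %d: no alt text for %s\n", $item['line'], $item['src']);
}
```

The reported numbers refer to lines of the source that was parsed, so they only match what you see in an editor if you load the same, unmodified markup.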
HTML has no real concept of line numbers, since they are just whitespace.
With that in mind, you might be able to count how many newlines there are in all the text nodes preceding the target node. You might be able to do this with DOMXPath:
$xpath = new DOMXPath($doc);
$node = /* your target node */;
$textnodes = $xpath->query("./preceding::*[contains(text(),'\n')]",$node);
$line = 1;
foreach($textnodes as $textnode) $line += substr_count($textnode->textContent,"\n");
// $line is now the line number of the node.
Please note that I have not tested this, nor have I ever used axes in xpath.
I think I have figured out what I was trying to achieve, but I am not sure whether it is the right way. It does the job. Please leave comments or any other ideas on how I can improve it.
If you go to the following site and type in any URL, it will produce a report of the accessibility issues in that web page. It is an accessibility checker tool.
http://valet.webthing.com/page/
All I am trying to do is achieve that kind of layout. The code below produces the DOM of the supplied URL and finds any image tag that does not have alternative text.
<html>
<body>
<?php
$dom = new DOMDocument();
// keep white space (must be set before the document is loaded)
$dom->preserveWhiteSpace = true;
// load the HTML into the object; $yourURLAddress holds the URL to check
$dom->loadHTMLFile($yourURLAddress);
// nicely format output
$dom->formatOutput = true;
$new = htmlspecialchars($dom->saveHTML(), ENT_QUOTES);
$lines = preg_split('/\r\n|\r|\n/', $new); //split the string on new lines
echo "<pre>";
//find 'alt=""' and print the line number and html tag
foreach ($lines as $lineNumber => $line) {
if (strpos($line, htmlspecialchars('alt=""')) !== false) {
echo "\r\n" . $lineNumber . ". " . $line;
}
}
echo "\n\n\nBelow is the whole DOM\n\n\n";
//print out the whole DOM including line numbers
foreach ($lines as $lineNumber => $line) {
echo "\r\n" . $lineNumber . ". " . $line;
}
echo "</pre>";
?>
</body>
</html>
I'd like to thank everyone who helped, especially "chwagssd" and Mike Johnson.
I'm creating a tool that works with file strings, and I need to get the line number where a node is found. That is, I have this:
$dom = new DOMDocument('1.0');
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach ($xpath->query("//text()") as $q) {
// $line = WHAT???
$strings[trim($q->nodeValue)] = $line;
}
and I need to know on which line the string I'm storing in the $strings array begins. Is it possible to get it?
Each DOMNode object has a getLineNo() function that returns this. In your case it's a DOMText object that extends from DOMNode:
foreach ($xpath->query("//text()") as $q) {
$line = $q->getLineNo();
$strings[trim($q->nodeValue)] = $line;
}
You might need to upgrade to PHP 5.3 or later, if you have not already, to make use of that function.
Hi, I've got these lines here. I am trying to extract the first paragraph found in the file, but this either fails to return any results or returns results that are not even in <p> tags, which is odd.
$file = $_SERVER['DOCUMENT_ROOT'].$_SERVER['REQUEST_URI'];
$hd = fopen($file,'r');
$cn = fread($hd, filesize($file));
fclose($hd);
$cnc = preg_replace('/<p>(.+?)<\/p>/','$1',$cn);
Try this:
$html = file_get_contents("http://localhost/foo.php");
preg_match('/<p>(.*?)<\/p>/s', $html, $match);
echo($match[1]);
I would use DOM parsing for that:
// SimpleHtmlDom example
// Create DOM from URL or file
$html = file_get_html('http://localhost/blah.php');
// Find all paragraphs
foreach($html->find('p') as $element)
echo $element->innerText() . '<br>';
It would allow you to more reliably replace some of the markup:
$html->find('p', 0)->innertext = 'foo';
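If you would rather not pull in an extra library, the same first-paragraph lookup can be sketched with PHP's built-in DOMDocument. The sample markup here is made up; in practice $html would be the contents of your file.

```php
<?php
// Hypothetical sample markup, for illustration only.
$html = '<html><body><div>intro</div><p>First <b>paragraph</b>.</p><p>Second paragraph.</p></body></html>';

libxml_use_internal_errors(true);   // keep warnings about sloppy HTML quiet
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// item(0) is the first <p> in document order, or null if there is none.
$first = $dom->getElementsByTagName('p')->item(0);
$text = $first !== null ? trim($first->textContent) : null;
echo $text, "\n";
```

textContent drops nested tags such as <b>; use $dom->saveHTML($first) instead if you need the paragraph's markup.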