I want to check whether a <img> tag has alt="" text or not and also need to find what line number in DOM that img tag is. At the moment I have the following codes written but stuck with finding the line number.
for example:
$doc = new DOMDocument();
$doc->loadHTMLFile('http://www.google.com');
$htmlElement = $doc->getElementsByTagName('html');
$tags = $doc->getElementsByTagName('img');
echo $tags->item(0)->getLineNo();
foreach ($tags as $image) {
// Get sizes of elements via width and height attributes
$alt = $image->getAttribute('alt');
if($alt == ""){
$src = $image->getAttribute('src');
echo "No alt text ";
echo '<img src="http://google.com/'.$src.'" alt=""/>'. '<br>';
}
else{
$src = $image->getAttribute('src');
echo '<img src="http://google.com/'.$src.'" alt=""/>'. '<br>';
}
}
from the above code at the moment I am getting images and text saying that "no alt text" beside the image, but I want to get what line number that img tag appears.
for example here the line number is 57,
56. <div class="work_item">
57. <p class="pich"><img src="images/works/1.jpg" alt=""></p>
58. </div>
Use DOMNode::getLineNo(), e.g.$line = $image->getLineNo().
HTML has no real concept of line numbers, since they are just whitespace.
With that in mind, you might be able to count how many newlines there are in all the text nodes preceding the target node. You might be able to do this with DOMXPath:
$xpath = new DOMXPath($doc);
$node = /* your target node */;
$textnodes = $xpath->query("./preceding::*[contains(text(),'\n')]",$node);
$line = 1;
foreach($textnodes as $textnode) $line += substr_count($textnode->textContent,"\n");
// $line is now the line number of the node.
Please note that I have not tested this, nor have I ever used axes in xpath.
I think i have figured out what i was trying to achieve but not sure is that the right way. It is doing the job. Please leave comments or any other idea how can i improve it.
If you go to the following site and type any URL. It will produce a report with accessibility issues in a webpage. It is an accessibility checker tool.
http://valet.webthing.com/page/
All i am trying to do is achieve that kind of layout. The code below will produce the DOM of supplied URL and find any image tag that does not have alternative text.
<html>
<body>
<?php
$dom = new domDocument;
// load the html into the object
$dom->loadHTMLFile('$yourURLAddress');
// keep white space
$dom->preserveWhiteSpace = true;
// nicely format output
$dom->formatOutput = true;
$new = htmlspecialchars($dom->saveHTML(), ENT_QUOTES);
$lines = preg_split('/\r\n|\r|\n/', $new); //split the string on new lines
echo "<pre>";
//find 'alt=""' and print the line number and html tag
foreach ($lines as $lineNumber => $line) {
if (strpos($line, htmlspecialchars('alt=""')) !== false) {
echo "\r\n" . $lineNumber . ". " . $line;
}
}
echo "\n\n\nBelow is the whole DOM\n\n\n";
//print out the whole DOM including line numbers
foreach ($lines as $lineNumber => $line) {
echo "\r\n" . $lineNumber . ". " . $line;
}
echo "</pre>";
?>
</body>
</html>
I like to thank everyone who helped specially "chwagssd" and Mike Johnson.
Related
I'm very new to php. I understand that echo is how you output text, but not sure how to apply it with the below scenario. Below, data is being scraped and outputted. Wondering if there's a way with the file_put_contents to add a text to the output, and the text I'm trying to add is a "%". Reason is the output of the below code is a random number that changes daily, and it's in fact a percent, so I'd like to add that to the end of the output every time.
Thanks so much for any assistance.
// get japanchange
function getJapanchange(){
$doc = new DOMDocument;
// We don't want to bother with white spaces
$doc->preserveWhiteSpace = false;
// Most HTML Developers are chimps and produce invalid markup...
$doc->strictErrorChecking = false;
$doc->recover = true;
$doc->loadHTMLFile('http://________________//global-
indices/');
$xpath = new DOMXPath($doc);
$query = "//div[#class='MT10']";
$entries = $xpath->query($query);
foreach ($entries as $entry) {
$result = trim($entry->textContent);
$ret_ = explode(' ', $result);
//make sure every element in the array don't start or end with blank
foreach ($ret_ as $key=>$val){
$ret_[$key]=trim($val);
}
//delete the empty element and the element is blank "\n" "\r" "\t"
//I modify this line
$ret_ = array_values(array_filter($ret_,deleteBlankInArray));
//echo the last element
file_put_contents(globalVars::$_cache_dir . "japanchange",
$ret_[56]);
}
}
If you just want to add a % to the end of the output to the file your already using. You could simple do
file_put_contents(globalVars::$_cache_dir . "japanchange",
$ret_[56].'%');
say i have
» Download MP4 « - <b>144p (Video Only)</b> - <span> 19.1</span> MB<br />
html page like this i wanna parse it with simple dom php parser and i wanna get download mp4 114p 19.1 as out put while i tried this code
foreach($displaybody->find('a ') as $element) {
// echo $element->innertext . '<br/>';
it returned me download mp4 only how do i parse remaining values download mp4 114p 19.1 please help me out
You can't use the <a> tag anymore since some of the text you're trying to access isn't inside it anymore, target the document itself and then use ->plaintext:
$html = <<<EOT
» Download MP4 « - <b>144p (Video Only)</b> - <span> 19.1</span> MB<br />
EOT;
$displaybody = str_get_html($html);
echo $displaybody->plaintext;
Here is another way of accessing each row thru DOMDocument with xpath:
// load the sites html page in DOMDocument
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$html_page = file_get_contents('http://www.mohammediatechnologies.in/download/downloadtest.php?name=8KPEiGqDQHg');
$dom->loadHTML(mb_convert_encoding($html_page, 'HTML-ENTITIES', 'UTF-8'));
libxml_clear_errors();
$xpath = new DOMXpath($dom);
$data = array();
// target elements which is inside an anchor and a line break (treat them as each row)
$links = $xpath->query('//*[following-sibling::a and preceding-sibling::br]');
$temp = '';
foreach($links as $link) { // for each rows of the link
$temp .= $link->textContent . ' '; // get all text contents
if($link->tagName == 'br') {
$unit = $xpath->evaluate('string(./preceding-sibling::text()[1])', $link);
$data[] = $temp . $unit; // push them inside an array
$temp = '';
}
}
echo '<pre>';
print_r($data);
Sample Output
I am really confused with regular expressions for PHP.
Anyway, I cant read the whole tutorial thing now because I have a bunch of files in html which I have to find links in there ASAP. I came up with the idea to automate it with a php code which it is the language I know.
so I think I can user this script :
$address = "file.txt";
$input = #file_get_contents($address) or die("Could not access file: $address");
$regexp = "??????????";
if(preg_match_all("/$regexp/siU", $input, $matches)) {
// $matches[2] = array of link addresses
// $matches[3] = array of link text - including HTML code
}
My problem is with $regexp
My required pattern is like this:
href="/content/r807215r37l86637/fulltext.pdf" title="Download PDF
I want to search and get the /content/r807215r37l86637/fulltext.pdf from above lines which I have many of them in the files.
any help?
==================
edit
title attributes are important for me and all of them which I want, are titled
title="Download PDF"
Once again regexp are bad for parsing html.
Save your sanity and use the built in DOM libraries.
$dom = new DOMDocument();
#$dom->loadHTML($html);
$x = new DOMXPath($dom);
$data = array();
foreach($x->query("//a[#title='Download PDF']") as $node)
{
$data[] = $node->getAttribute("href");
}
Edit
Updated code based on ircmaxell comment.
That's easier with phpQuery or QueryPath:
foreach (qp($html)->find("a") as $a) {
if ($a->attr("title") == "PDF") {
print $a->attr("href");
print $a->innerHTML();
}
}
With regexps it depends on some consistency of the source:
preg_match_all('#<a[^>]+href="([^>"]+)"[^>]+title="Download PDF"[^>]*>(.*?)</a>#sim', $input, $m);
Looking for a fixed title="..." attribute is doable, but more difficult as it depends on the position before the closing bracket.
try something like this. If it does not work, show some examples of links you want to parse.
<?php
$address = "file.txt";
$input = #file_get_contents($address) or die("Could not access file: $address");
$regexp = '#<a[^>]*href="([^"]*)"[^>]*title="Download PDF"#';
if(preg_match_all($regexp, $input, $matches, PREG_SET_ORDER)) {
foreach ($matches as $match) {
printf("Url: %s<br/>", $match[1]);
}
}
edit: updated so it searches for Download "PDF entries" only
The best way is to use DomXPath to do the search in one step:
$dom = new DomDocument();
$dom->loadHTML($html);
$xpath = new DomXPath($dom);
$links = array();
foreach($xpath->query('//a[contains(#title, "Download PDF")]') as $node) {
$links[] = $node->getAttribute("href");
}
Or even:
$links = array();
$query = '//a[contains(#title, "Download PDF")]/#href';
foreach($xpath->evaluate($query) as $attr) {
$links[] = $attr->value;
}
href="([^]+)" will get you all the links of that form.
I want to remove empty paragraphs from an HTML document using simple_html_dom.php. I know how to do it using the DOMDocument class, but, because the HTML files I work with are prepared in MS Word, the DOMDocument's loadHTMLFile() function gives this exception "Namespaces are not defined".
This is the code I use with the DOMDocument object for HTML files not prepared in MS Word:
<?php
/* Using the DOMDocument class */
/* Create a new DOMDocument object. */
$html = new DOMDocument("1.0", "UTF-8");
/* Load HTML code from an HTML file into the DOMDocument. */
$html->loadHTMLFile("HTML File With Empty Paragraphs.html");
/* Assign all the <p> elements into the $pars DOMNodeList object. */
$pars = $html->getElementsByTagName("p");
echo "The initial number of paragraphs is " . $pars->length . ".<br />";
/* The trim() function is used to remove leading and trailing spaces as well as
* newline characters. */
for ($i = 0; $i < $pars->length; $i++){
if (trim($pars->item($i)->textContent) == ""){
$pars->item($i)->parentNode->removeChild($pars->item($i));
$i--;
}
}
echo "The final number of paragraphs is " . $pars->length . ".<br />";
// Write the HTML code back into an HTML file.
$html->saveHTMLFile("HTML File WithOut Empty Paragraphs.html");
?>
This is the code I use with the simple_html_dom.php module for HTML files prepared in MS Word:
<?php
/* Using simple_html_dom.php */
include("simple_html_dom.php");
$html = file_get_html("HTML File With Empty Paragraphs.html");
$pars = $html->find("p");
for ($i = 0; $i < count($pars); $i++) {
if (trim($pars[$i]->plaintext) == "") {
unset($pars[$i]);
$i--;
}
}
$html->save("HTML File without Empty Paragraphs.html");
?>
It is almost the same, except that that the $pars variable is a DOMNodeList when using DOMDocument and an array when using simple_html_dom.php. But this code does not work. First it runs for two minutes and then reports these errors: "Undefined offset: 1" and "Trying to get property of nonobject" for this line: "if (trim($pars[$i]->plaintext) == "") {".
Does anyone know how I can fix this?
Thank you.
I also asked on php devnetwork.
Looking at the documentation for Simple HTML DOM Parser, I think this should do the trick:
include('simple_html_dom.php');
$html = file_get_html('HTML File With Empty Paragraphs.html');
$pars = $html->find('p');
foreach($pars as $par)
{
if(trim($par->plaintext) == '')
{
// Remove an element, set it's outertext as an empty string
$par->outertext = '';
}
}
$html->save('HTML File without Empty Paragraphs.html');
I did a quick test and this works for me:
include('simple_html_dom.php');
$html = str_get_html('<html><body><h1>Test</h1><p></p><p>Test</p></body></html>');
$pars = $html->find("p");
foreach($pars as $par)
{
if(trim($par->plaintext) == '')
{
$par->outertext = '';
}
}
echo $html;
// Output: <html><body><h1>Test</h1><p>Test</p></body></html>
Empty paragraphs looks like <p [attributes]> [spaces or newlines] </p> (case-insensitive). You can use preg_replace (or str_replace) for removing empty paragraphs.
The following will only work if an empty paragraph is <p></p>:
$oldHtml = file_get_contents('File With Empty Paragraphs.html');
$newHtml = str_replace('<p></p>', '', $oldHtml);
// and write the new HTML to the file
$fh = fopen('File Without Empty Paragraphs.html', 'w');
fwrite($fh, $newHtml);
fclose($fh);
This will also work on paragraphs with attributes, like <p class="msoNormal"> </p>:
$oldHtml = file_get_contents('File With Empty Paragraphs.html');
$newHtml = preg_replace('#<p[^>]*>\s*</p>#i', '', $oldHtml);
// and write the new HTML to the file
$fh = fopen('File Without Empty Paragraphs.html', 'w');
fwrite($fh, $newHtml);
fclose($fh);
Hi I've got these lines here, I am trying to extract the first paragraph found in the file, but this fails to return any results, if not it returns results that are not even in <p> tags which is odd?
$file = $_SERVER['DOCUMENT_ROOT'].$_SERVER['REQUEST_URI'];
$hd = fopen($file,'r');
$cn = fread($hd, filesize($file));
fclose($hd);
$cnc = preg_replace('/<p>(.+?)<\/p>/','$1',$cn);
Try this:
$html = file_get_contents("http://localhost/foo.php");
preg_match('/<p>(.*)<\/p>/', $html, $match);
echo($match[1]);
I would use DOM parsing for that:
// SimpleHtmlDom example
// Create DOM from URL or file
$html = file_get_html('http://localhost/blah.php');
// Find all paragraphs
foreach($html->find('p') as $element)
echo $element->innerText() . '<br>';
It would allow you to more reliably replace some of the markup:
$html->find('p', 0)->innertext() = 'foo';