How to get the image source from an img tag using a PHP function
Or, you can use the built-in DOM functions (if you use PHP 5+):
$doc = new DOMDocument();
$doc->loadHTMLFile($url);
$xpath = new DOMXpath($doc);
$imgs = $xpath->query("//img");
for ($i=0; $i < $imgs->length; $i++) {
$img = $imgs->item($i);
$src = $img->getAttribute("src");
// do something with $src
}
This keeps you from having to use external classes.
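As a rough, self-contained sketch of the same idea, the XPath query can select the src attributes directly instead of the img elements. The sample markup and the $sources variable are made up for illustration; with a real page you would load it with loadHTMLFile() as above.

```php
<?php
// Hypothetical sample markup; a real page would come from a file or URL.
$html = '<html><body>'
      . '<img src="demo/banner_31.png" alt="Banner">'
      . '<p><img src="demo/my_code.png"></p>'
      . '</body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Each result of this query is a DOMAttr node, so ->value holds the src string.
$sources = array();
foreach ($xpath->query('//img/@src') as $attr) {
    $sources[] = $attr->value;
}

print_r($sources);
```

Querying the attribute nodes saves the getAttribute() call, and img tags without a src attribute are simply skipped.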
Consider taking a look at this.
I'm not sure if this is an accepted method of solving your problem, but check this code snippet out:
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
// Find all images
foreach($html->find('img') as $element)
echo $element->src . '<br>';
// Find all links
foreach($html->find('a') as $element)
echo $element->href . '<br>';
You can use PHP Simple HTML DOM Parser (http://simplehtmldom.sourceforge.net/)
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
// Find all images
foreach($html->find('img') as $element) {
echo $element->src.'<br>';
}
// Find all links
foreach($html->find('a') as $element) {
echo $element->href.'<br>';
}
$path1 = 'http://example.com/index.html'; // path of the HTML page
$file = file_get_contents($path1);
$dom = new DOMDocument;
@$dom->loadHTML($file); // @ suppresses warnings about malformed markup
$a = array();
$links = $dom->getElementsByTagName('img');
foreach ($links as $link)
{
    $re = $link->getAttribute('src');
    $a[] = $re;
}
print_r($a);
Output:
Array
(
[0] => demo/banner_31.png
[1] => demo/my_code.png
)
Say I have an HTML page like this:
» Download MP4 « - <b>144p (Video Only)</b> - <span> 19.1</span> MB<br />
I want to parse it with the PHP Simple HTML DOM Parser and get "Download MP4 144p 19.1" as output. I tried this code:
foreach($displaybody->find('a') as $element) {
echo $element->innertext . '<br/>';
}
It returned only "Download MP4". How do I get the remaining values, so that the output is "Download MP4 144p 19.1"? Please help me out.
You can't target the <a> tag here, since some of the text you're trying to access isn't inside it. Target the document itself and then use ->plaintext:
$html = <<<EOT
» Download MP4 « - <b>144p (Video Only)</b> - <span> 19.1</span> MB<br />
EOT;
$displaybody = str_get_html($html);
echo $displaybody->plaintext;
Here is another way of accessing each row thru DOMDocument with xpath:
// load the sites html page in DOMDocument
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$html_page = file_get_contents('http://www.mohammediatechnologies.in/download/downloadtest.php?name=8KPEiGqDQHg');
$dom->loadHTML(mb_convert_encoding($html_page, 'HTML-ENTITIES', 'UTF-8'));
libxml_clear_errors();
$xpath = new DOMXpath($dom);
$data = array();
// target elements that have a preceding <br> sibling and a following <a> sibling (treat each group as one row)
$links = $xpath->query('//*[following-sibling::a and preceding-sibling::br]');
$temp = '';
foreach($links as $link) { // for each rows of the link
$temp .= $link->textContent . ' '; // get all text contents
if($link->tagName == 'br') {
$unit = $xpath->evaluate('string(./preceding-sibling::text()[1])', $link);
$data[] = $temp . $unit; // push them inside an array
$temp = '';
}
}
echo '<pre>';
print_r($data);
Sample Output
I want to append some text to all divs that have the same class.
$dom = new DOMDocument();
$dom->formatOutput = true;
@$dom->loadHTMLFile('first.html');
$xpath = new DOMXPath($dom);
$after = new DOMText('Newly appended text');
$elements = $xpath->query('//div[@class="mix"]');
foreach($elements as $element)
{
    $element->appendChild($after);
    //echo $dom->saveHTML();
}
$dom->saveHTMLFile('first.html');
But when I open first.html, the text has been appended only to the last div of that class.
If I uncomment saveHTML(), the output it prints looks perfect. The problem only shows up after saving.
You cannot append the same DOM node to multiple points in the tree, which is what you are doing here. You need to create a separate (but identical) node each time:
foreach($elements as $element)
{
$after = new DOMText('Newly appended text'); // moved this inside the loop
$element->appendChild($after);
}
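A minimal, self-contained sketch of the same fix, assuming an inline sample document instead of first.html. It uses DOMDocument::createTextNode(), which returns a fresh node owned by the document on every call; the sample markup and the $appended counter are illustrative only.

```php
<?php
// Hypothetical sample markup standing in for first.html.
$html = '<html><body>'
      . '<div class="mix">one</div>'
      . '<div class="mix">two</div>'
      . '<div class="other">three</div>'
      . '</body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

$appended = 0;
foreach ($xpath->query('//div[@class="mix"]') as $element) {
    // A new text node per iteration, so no node is moved from one div to the next.
    $element->appendChild($dom->createTextNode(' Newly appended text'));
    $appended++;
}

$result = $dom->saveHTML();
echo $result;
```

Reusing one node object would move it from div to div, which is why only the last div kept the text in the saved file.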
I want to check whether an <img> tag has alt text or not, and I also need to find the line number at which that img tag appears in the DOM. I have written the following code so far, but I am stuck on finding the line number.
For example:
$doc = new DOMDocument();
$doc->loadHTMLFile('http://www.google.com');
$htmlElement = $doc->getElementsByTagName('html');
$tags = $doc->getElementsByTagName('img');
echo $tags->item(0)->getLineNo();
foreach ($tags as $image) {
// Get sizes of elements via width and height attributes
$alt = $image->getAttribute('alt');
if($alt == ""){
$src = $image->getAttribute('src');
echo "No alt text ";
echo '<img src="http://google.com/'.$src.'" alt=""/>'. '<br>';
}
else{
$src = $image->getAttribute('src');
echo '<img src="http://google.com/'.$src.'" alt=""/>'. '<br>';
}
}
With the above code I currently get the images, with "No alt text" printed beside any image that lacks it, but I also want the line number at which each img tag appears.
For example, here the line number is 57:
56. <div class="work_item">
57. <p class="pich"><img src="images/works/1.jpg" alt=""></p>
58. </div>
Use DOMNode::getLineNo(), e.g. $line = $image->getLineNo();
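Here is a small sketch of how that call could be combined with the alt check, assuming the markup is loaded from a string. The sample HTML and the $missing array are invented for illustration, and the reported numbers are the parser's view of the input, so they may not match an editor exactly for badly malformed markup.

```php
<?php
// Hypothetical sample markup with one image lacking alt text.
$html = <<<EOT
<html>
<body>
<div class="work_item">
<p class="pich"><img src="images/works/1.jpg" alt=""></p>
</div>
<p><img src="images/works/2.jpg" alt="Second work"></p>
</body>
</html>
EOT;

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

// Record the line and src of every image whose alt is missing or empty.
$missing = array();
foreach ($doc->getElementsByTagName('img') as $image) {
    if (trim($image->getAttribute('alt')) === '') {
        $missing[] = array(
            'line' => $image->getLineNo(),
            'src'  => $image->getAttribute('src'),
        );
    }
}

foreach ($missing as $m) {
    echo "Line {$m['line']}: no alt text for {$m['src']}\n";
}
```

getAttribute() returns an empty string both when alt is absent and when it is alt="", so the check above treats the two cases alike.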
HTML has no real concept of line numbers, since they are just whitespace.
With that in mind, you might be able to count how many newlines there are in all the text nodes preceding the target node. You might be able to do this with DOMXPath:
$xpath = new DOMXPath($doc);
$node = /* your target node */;
$textnodes = $xpath->query("./preceding::*[contains(text(),'\n')]",$node);
$line = 1;
foreach($textnodes as $textnode) $line += substr_count($textnode->textContent,"\n");
// $line is now the line number of the node.
Please note that I have not tested this, nor have I ever used axes in xpath.
I think I have figured out what I was trying to achieve, but I'm not sure it is the right way. It does the job. Please leave comments or any other ideas on how I can improve it.
If you go to the following site and type in any URL, it will produce a report of the accessibility issues in that web page. It is an accessibility checker tool.
http://valet.webthing.com/page/
All I am trying to do is achieve that kind of layout. The code below loads the DOM of the supplied URL and finds any image tag that has no alternative text.
<html>
<body>
<?php
$dom = new DOMDocument;
// keep white space (must be set before loading)
$dom->preserveWhiteSpace = true;
// nicely format output
$dom->formatOutput = true;
// load the HTML into the object; $yourURLAddress holds the URL to check
$dom->loadHTMLFile($yourURLAddress);
$new = htmlspecialchars($dom->saveHTML(), ENT_QUOTES);
$lines = preg_split('/\r\n|\r|\n/', $new); //split the string on new lines
echo "<pre>";
//find 'alt=""' and print the line number and html tag
foreach ($lines as $lineNumber => $line) {
    if (strpos($line, htmlspecialchars('alt=""')) !== false) {
        // array keys start at 0, so add 1 to get a 1-based line number
        echo "\r\n" . ($lineNumber + 1) . ". " . $line;
    }
}
echo "\n\n\nBelow is the whole DOM\n\n\n";
//print out the whole DOM including line numbers
foreach ($lines as $lineNumber => $line) {
    echo "\r\n" . ($lineNumber + 1) . ". " . $line;
}
echo "</pre>";
?>
</body>
</html>
I'd like to thank everyone who helped, especially "chwagssd" and Mike Johnson.
I need to parse an XML file, and I also need to parse its doctype. I've tried XMLReader, but when I find a node of type 10 (doctype), I can't get its value.
Is there a way to extract the doctype from an XML file with XMLReader?
Edit: as requested, here is some sample code. It is nothing more than a dump right now.
$reader = new XMLReader( );
$filename = 'test.xhtml';
$reader->open($filename);
while( $reader->read( ) )
{
$nodeType = $reader->nodeType;
$nodeName = $reader->name;
$nodeValue = $reader->value;
if( $nodeType == 10 )
{
echo $nodeType ."\n";
echo $nodeName ."\n";
echo $nodeValue ."\n";
echo $reader->localName ."\n";
echo $reader->namespaceURI ."\n";
echo $reader->prefix ."\n";
echo $reader->xmlLang ."\n";
echo $reader->readString() . "\n";
echo $reader->readInnerXML() . "\n";
while( $reader->moveToNextAttribute( ) )
{
echo $reader->name . "=" . $reader->value;
}
}
}
You can use DOM to read the DOCTYPE data:
$doc = new DOMDocument();
$doc->loadXML($xmlData);
var_dump($doc->doctype->publicId);
var_dump($doc->doctype->systemId);
var_dump($doc->doctype->name);
var_dump($doc->doctype->entities);
var_dump($doc->doctype->notations);
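To make that concrete, here is a small sketch that runs against an inline XHTML string, so it needs no file. The sample document is made up for illustration; loadXML() does not fetch the external DTD unless asked to, so nothing is downloaded.

```php
<?php
// Hypothetical sample document with an XHTML 1.0 Strict doctype.
$xmlData = <<<EOT
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head><title>Test</title></head><body/></html>
EOT;

$doc = new DOMDocument();
$doc->loadXML($xmlData);

// $doc->doctype is a DOMDocumentType node, or null when the document has no doctype.
$doctype = $doc->doctype;
if ($doctype !== null) {
    echo 'name:     ' . $doctype->name . "\n";
    echo 'publicId: ' . $doctype->publicId . "\n";
    echo 'systemId: ' . $doctype->systemId . "\n";
}
```

Checking for null first avoids a warning on documents that have no DOCTYPE declaration at all.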
I have not found a way to do this with XMLReader despite a lot of looking. However you can use DOMDocument to read the doctype quite easily, then revert to XMLReader to read the rest of the stream. For example, to get the system ID part of the doctype before processing the rest of the XML file:
$doc = new DOMDocument();
$doc->load($xmlfile);
$systemId = $doc->doctype->systemId;
unset($doc);
// Then proceed with XMLReader:
$reader = new XMLReader();
$reader->open($xmlfile);
while($reader->read())
{
// etc
}
I suppose that this may not be practical in all circumstances but it worked for me while processing very large XML files for which I needed to read the system ID from the doctype.
I am really confused by regular expressions in PHP.
Anyway, I can't read a whole tutorial right now, because I have a bunch of HTML files in which I have to find links ASAP. I came up with the idea of automating it with PHP, which is the language I know.
So I think I can use this script:
$address = "file.txt";
$input = @file_get_contents($address) or die("Could not access file: $address");
$regexp = "??????????";
if(preg_match_all("/$regexp/siU", $input, $matches)) {
// $matches[2] = array of link addresses
// $matches[3] = array of link text - including HTML code
}
My problem is with $regexp
My required pattern is like this:
href="/content/r807215r37l86637/fulltext.pdf" title="Download PDF
I want to search for and extract /content/r807215r37l86637/fulltext.pdf from lines like the one above; there are many of them in the files.
Any help?
==================
Edit
The title attribute is important to me: every link I want is titled
title="Download PDF"
Once again, regexps are bad for parsing HTML.
Save your sanity and use the built-in DOM libraries.
$dom = new DOMDocument();
@$dom->loadHTML($html); // @ suppresses warnings about malformed markup
$x = new DOMXPath($dom);
$data = array();
foreach($x->query("//a[@title='Download PDF']") as $node)
{
    $data[] = $node->getAttribute("href");
}
Edit
Updated code based on ircmaxell comment.
That's easier with phpQuery or QueryPath:
foreach (qp($html)->find("a") as $a) {
if ($a->attr("title") == "PDF") {
print $a->attr("href");
print $a->innerHTML();
}
}
With regexps it depends on some consistency of the source:
preg_match_all('#<a[^>]+href="([^>"]+)"[^>]+title="Download PDF"[^>]*>(.*?)</a>#sim', $input, $m);
Looking for a fixed title="..." attribute is doable, but more difficult as it depends on the position before the closing bracket.
Try something like this. If it does not work, show some examples of the links you want to parse.
<?php
$address = "file.txt";
$input = @file_get_contents($address) or die("Could not access file: $address");
$regexp = '#<a[^>]*href="([^"]*)"[^>]*title="Download PDF"#';
if(preg_match_all($regexp, $input, $matches, PREG_SET_ORDER)) {
foreach ($matches as $match) {
printf("Url: %s<br/>", $match[1]);
}
}
Edit: updated so it searches for "Download PDF" entries only.
The best way is to use DomXPath to do the search in one step:
$dom = new DomDocument();
$dom->loadHTML($html);
$xpath = new DomXPath($dom);
$links = array();
foreach($xpath->query('//a[contains(@title, "Download PDF")]') as $node) {
$links[] = $node->getAttribute("href");
}
Or even:
$links = array();
$query = '//a[contains(@title, "Download PDF")]/@href';
foreach($xpath->evaluate($query) as $attr) {
$links[] = $attr->value;
}
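For comparison, here is a self-contained sketch that runs both an exact title match and the contains() variant against a small sample document. The markup, the second and third links, and the $exact and $partial names are hypothetical, chosen only to show where the two queries differ.

```php
<?php
// Hypothetical sample markup; only the first link comes from the question.
$html = '<html><body>'
      . '<a href="/content/r807215r37l86637/fulltext.pdf" title="Download PDF">PDF</a>'
      . '<a href="/content/r807215r37l86637/fulltext.html" title="View HTML">HTML</a>'
      . '<a href="/content/abc/fulltext.pdf" title="Download PDF (1.2 MB)">PDF</a>'
      . '</body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Exact match: the title must be exactly "Download PDF".
$exact = array();
foreach ($xpath->query('//a[@title="Download PDF"]/@href') as $attr) {
    $exact[] = $attr->value;
}

// Substring match: the title only has to contain "Download PDF".
$partial = array();
foreach ($xpath->query('//a[contains(@title, "Download PDF")]/@href') as $attr) {
    $partial[] = $attr->value;
}

print_r($exact);
print_r($partial);
```

Which form fits depends on how consistent the titles are in the real files; the exact match is stricter, the contains() form more forgiving.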
href="([^"]+)" will get you all the links of that form.