$doc->getElementById('id'), $doc->getElementsByName('id') not working [duplicate] - php

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
PHP HTML DomDocument getElementById problems
I'm trying to extract info from Google searches in PHP and find that I can read the search URLs without problem, but getting anything out of them is a whole different issue. After reading numerous posts and the applicable PHP docs, I came up with the following:
// get large panoramas of montana
$url = 'http://www.google.com/search?q=montana+panorama&tbm=isch&biw=1408&bih=409';
$html = file_get_contents($url);
// was getting tons of "entity parse" errors, so added
$html = htmlentities($html, ENT_COMPAT, 'UTF-8', true); // tried false as well
$doc = new DOMDocument();
//$doc->strictErrorChecking = false; // tried both true and false here, same result
$result = $doc->loadHTML($html);
//echo $doc->saveHTML(); this shows that the tags I'm looking for are in fact in $doc
if ($result === true)
{
var_dump($result); // prints 'true'
$tags = $doc->getElementById('center_col');
$tags = $doc->getElementsByTagName('td');
var_dump($tags); // previous 2 lines both print NULL
}
I've verified that the ids and tags I'm looking for are in the html by error_log($html) and in the parsed doc with $doc->saveHTML(). Anyone see what I'm doing wrong?
Edit:
Thanks all for the help, but I've hit a wall with DOMDocument. Nothing in any of the docs, or other threads, works with Google image queries. Here's what I tried:
I looked at the @Jon link and tried all the suggestions there, then looked at the getElementById docs and read all the comments there as well. Still getting empty result sets. Better than NULL, but not much.
I tried the xpath trick:
$xpath = new DOMXPath($doc);
$ccol = $xpath->query("//*[@id='center_col']");
Same result, an empty set.
I did a error_log($html) directly after the file read and the document has a doctype "" so it's not that.
I also see there that user "carl2088" says "From my experience, getElementById seem to work fine without any setups if you have loaded a HTML document". Not in the case of Google image queries, it would appear.
In desperation, I tried
echo count(explode('center_col', $html))
to see if for some strange reason it disappears after the initial error_log($html). It's definitely there, the string is split into 4 chunks.
I checked my version of PHP (5.3.15), compiled Aug. 25 2012, so it's not a version too old to support getElementById.
Before yesterday, I had been using an extremely ugly series of "explodes" to get the info, and while it's horrid code, it took 45 minutes to write and it works.
I'd really like to ditch my "explode" hack, but 5 hours to achieve nothing vs 45 minutes to get something that works, makes it really difficult to do things the right way.
If anyone else with experience using DOMDocument has some additional tricks I could try, it would be much appreciated.

Are you using the JavaScript getElementById and getElementsByTagName? If yes, then this is the problem:
$tags = $doc->getElementById('center_col');
$tags = $doc->getElementsByTagName('td');

Before $doc->getElementById('center_col') will work, you need to validate the document with DOMDocument->validate() or set DOMDocument->validateOnParse before loading it:
$doc->validateOnParse = true;
$doc->loadHTML($html);
stackoverflow: getelementbyid-problem
http://php.net/manual/de/domdocument.getelementbyid.php
It's in the question @Jon linked in his comment!
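For what it's worth, here is a minimal sketch of the same lookup on a tiny inline page. The $html string is a made-up stand-in for the fetched page, not Google's real markup. It assumes the htmlentities() call from the question is dropped: that call escapes every < and >, so DOMDocument is left with plain text and no elements to find. libxml_use_internal_errors() is used to quiet the parse warnings instead.

```php
<?php
// Hypothetical stand-in for the downloaded page (not real Google markup).
$html = '<!DOCTYPE html><html><body><div id="center_col">'
      . '<table><tr><td>cell</td></tr></table></div></body></html>';

// Collect parse warnings internally instead of escaping the markup.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Note the @ in the attribute test: [@id='...'], not [#id='...'].
$xpath = new DOMXPath($doc);
$nodes = $xpath->query("//*[@id='center_col']");
$cells = $doc->getElementsByTagName('td');

echo $nodes->length, "\n"; // 1
echo $cells->length, "\n"; // 1

// The same page run through htmlentities() first: the tags become text.
$escaped = new DOMDocument();
$escaped->loadHTML(htmlentities($html, ENT_COMPAT, 'UTF-8'));
$escapedCells = $escaped->getElementsByTagName('td');
libxml_clear_errors();

echo $escapedCells->length, "\n"; // 0
```

Whether this is the whole story for the live Google page is a guess on my part, but the escaped variant shows how the string can still contain 'center_col' while the DOM has no such element.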

Related

Can't get picture url by Parsing [duplicate]

This question already has answers here:
Why does my XPath query (scraping HTML tables) only work in Firebug, but not the application I'm developing?
(2 answers)
Closed 8 years ago.
I'm building a script that gives me a product array by parsing HTML from a list of websites.
I believe I'm doing everything right, but for some reason I have a lot of difficulty with just one website, Makita.ca.
I'm using DOMXPath to retrieve elements, and I am providing the raw HTML that I'm getting from makita.ca.
The pictures I want are the ones on the left of the page.
Please also note that the only thing I need is the link to each image, not the actual image.
The page in question is at http://www.makita.ca/index2.php?event=tool&id=100
$productArray = array();
$Dom = new DOMDocument();
@$Dom->loadHTML($this->html);
$xpath = new DOMXPath($Dom);
echo $xpath->query('//*[@id="content_other"]/table[2]/tbody/tr/td[1]/table/tbody/tr[4]/td/table/tbody/tr[1]/td/div/a/img')->length;
if($xpath->query('//*[@id="content_other"]/table[2]/tbody/tr/td[1]/table/tbody/tr[4]/td/table')->length > 0)
{
for($i=0;$i<$xpath->query('//*[@id="content_other"]/table[2]/tbody/tr/td[1]/table/tbody/tr[4]/td/table/tbody/tr')->length;$i++)
{
if($xpath->query('//*[@id="content_other"]/table[2]/tr/td[1]/table/tr[4]/td/table/tr['.$i.']/td/div/a/img') > 0)
$productArray['picture'][] = $xpath->query('//*[@id="content_other"]/table[2]/tr/td[1]/table/tr[4]/td/table/tr['.$i.']/td/div/a/img')->item(0)->nodeValue;
}
}
Do you see what my mistake is? I'm really lost now.
Edit:
OK, for test purposes I am echoing the length of the query() result, which should tell me how many elements match the query.
So I retyped the whole query by hand, so that it can't contain any non-ASCII characters:
'//*[@id="content_other"]/table[2]//tr/td[1]/table//tr[4]/td/table//tr[1]/td/div/a/img'
The result is 0.
So I removed the end of the query part by part:
//*[@id="content_other"]/table[2]//tr/td[1]/table//tr[4]/td/table//tr[1]/td/div/a = 0
//*[@id="content_other"]/table[2]//tr/td[1]/table//tr[4]/td/table//tr[1]/td/div = 0
//*[@id="content_other"]/table[2]//tr/td[1]/table//tr[4]/td/table//tr[1]/td = 0
//*[@id="content_other"]/table[2]//tr/td[1]/table//tr[4]/td/table//tr[1] = 0
//*[@id="content_other"]/table[2]//tr/td[1]/table//tr[4]/td/table = 0
//*[@id="content_other"]/table[2]//tr/td[1]/table//tr[4]/td = 0
//*[@id="content_other"]/table[2]//tr/td[1]/table//tr = 5
Wooo, I got some elements matching here!
OK, let's try the last element, which is the one I need.
Since it is zero based, to get tr number 5 I need to enter this as the path:
//*[@id="content_other"]/table[2]//tr/td[1]/table//tr[4]
But I still get 0... So I don't know what to do any more.
//div[@class='product_heading']/ancestor-or-self::table[1]//a/img first selects the "Action Shots" heading, then all the images found under this block.
This XPath expression will be more reliable than yours because it uses few positional expressions, which tend to break easily as the markup changes.
//div[@class='product_heading']/ancestor-or-self::table[1]//a[@rel='thumbnail']/img would be safer still.
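A rough sketch of that expression in use, run against a small made-up table rather than the real Makita page (the markup and image paths below are assumptions for illustration). It also shows three things that seem relevant to the question: the parser behind loadHTML() does not add the tbody elements that Firebug displays, XPath positions start at 1 rather than 0, and the image link lives in the src attribute, not in nodeValue.

```php
<?php
// Hypothetical markup loosely modelled on the description in the question.
$html = '<html><body><table>'
      . '<tr><td><div class="product_heading">Action Shots</div></td></tr>'
      . '<tr><td><div><a rel="thumbnail" href="#"><img src="/images/a.jpg"></a></div></td></tr>'
      . '<tr><td><div><a rel="thumbnail" href="#"><img src="/images/b.jpg"></a></div></td></tr>'
      . '</table></body></html>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Browsers insert <tbody> when rendering; the source here has none,
// so a path copied from Firebug with /tbody/ in it matches nothing.
$withTbody    = $xpath->query('//table/tbody/tr')->length;
$withoutTbody = $xpath->query('//table/tr')->length;

// XPath positions are 1-based: tr[1] is the first row, there is no tr[0].
$firstRow = $xpath->query('//table/tr[1]')->length;

$sources = array();
$query = "//div[@class='product_heading']/ancestor-or-self::table[1]//a[@rel='thumbnail']/img";
foreach ($xpath->query($query) as $img) {
    // An <img> has no text content, so read the src attribute instead.
    $sources[] = $img->getAttribute('src');
}

echo $withTbody, ' ', $withoutTbody, ' ', $firstRow, "\n"; // 0 3 1
print_r($sources); // /images/a.jpg and /images/b.jpg
```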

Going crazy getElementsByTagName not working on PHP 5.3.3

I have a very simple code like this:
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($data);
libxml_clear_errors();
$dom->preserveWhiteSpace = false;
foreach($dom->getElementsByTagName('img') as $img) {
// do something here
}
The variable $data contains HTML from a external URL. Ok, if i test this code using my local webserver using PHP 5.3.6 it works and returns ALL img tags in that page, but same code running on another server with PHP 5.3.3 DOESN'T WORK! It doesn't return all img tags from the SAME $data value... it only returns the first 13 images.
I suspect that this has something to do with encoding; maybe some characters in $data are badly encoded or something, but I don't know how to solve it. Is there a known bug in PHP 5.3.3 related to this?
I'd suggest you check out the comments on the php docs page,
looks like there's some useful advice on the DOMDocument usage:
http://de.php.net/manual/en/domdocument.getelementsbytagname.php
And before you ask about possibly known bugs on stackoverflow,
you might want to look for them on https://bugs.php.net/
Edit:
I think I've found the bug related to that behavior:
https://bugs.php.net/bug.php?id=60762
Even though it's flagged 5.4.0 RC 5 I couldn't replicate
the behavior mentioned.
Probably an issue with the HTML data (as you mentioned).

PHP & file_get_contents() leading in failing HTTP request!

I wrote a PHP script which loads the XML/HTML of a given URL, parses it and writes it to a database. For the last few hours I've been getting the strange error mentioned above, not every time but definitely too often.
Do you have any suggestions as to what is going wrong?
Here are the lines of code which are supposed to cause the logged error:
libxml_use_internal_errors( true );
$data = file_get_contents($item->get_link());
$dom = new DOMDocument();
$dom->loadHTML($data);
Well, there's not much to work off, so here's some possible issues:
1) $item->get_link() is not returning a valid URL
2) You assume file_get_contents will always get the data. What happens when there's a network issue? The server is down? You need to make sure $data is valid before doing something with it.
$data = file_get_contents($item->get_link());
3) $data is not parse-able by the dom parser, possibly for one of the previous reasons.
$dom = new DOMDocument();
$dom->loadHTML($data);
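Points 2 and 3 could be sketched roughly like this. The fetchDom() helper is a hypothetical name, not part of any library; it simply refuses to parse when the download failed or came back empty.

```php
<?php
// Hypothetical helper: returns a DOMDocument, or null when nothing usable was fetched.
function fetchDom($url)
{
    // file_get_contents() returns false on failure; @ hides the warning
    // so the failure can be handled here instead.
    $data = @file_get_contents($url);
    if ($data === false || trim($data) === '') {
        return null;
    }

    // Collect parser complaints internally rather than emitting warnings.
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $loaded = $dom->loadHTML($data);
    libxml_clear_errors();

    return $loaded ? $dom : null;
}

// Usage with a path that does not exist, standing in for a failed request.
$dom = fetchDom(__DIR__ . '/no-such-page-for-this-example.html');
if ($dom === null) {
    echo "Could not load the page, skipping\n";
}
```

In the real script the caller would log the failing URL and move on to the next item instead of handing false to loadHTML().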

Find what part of Html is invalid with PHP

I tried several methods to find out what part of a html string is invalid
$dom->loadHTML($badHtml);
$tidy->cleanRepair();
simplexml_load_string($badHtml);
None of them is clear about what part of the html is invalid. Maybe an extra config option for one of them can fix that. Any ideas?
I need this to manually fix html input from users. I don't want to rely on automated processes.
I'd try loading the offending HTML into a DOM Document (as you are already doing) and then using simplexml to fix things. You should be able to run a quick diff to see where the errors are.
error_reporting(0);
$badHTML = '<p>Some <em><strong>badly</em> nested</stong> tags</p>';
$doc = new DOMDocument();
$doc->encoding = 'UTF-8';
$doc->loadHTML($badHTML);
$goodHTML = simplexml_import_dom($doc)->asXML();
You can compare cleaned and bad version with PHP Inline-Diff found in answer to that stackoverflow question.
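Another option, sketched here as an alternative to the diff approach rather than as part of the answer above: libxml can hand back the parse errors it collected, each with a line, a column and a message. The exact messages depend on the libxml version installed, so no sample output is shown.

```php
<?php
// Same deliberately broken sample as above.
$badHTML = '<p>Some <em><strong>badly</em> nested</stong> tags</p>';

// Keep libxml's complaints in an internal buffer instead of printing warnings.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($badHTML);

// Each entry is a LibXMLError object with line, column and message properties.
$errors = libxml_get_errors();
libxml_clear_errors();

foreach ($errors as $error) {
    printf("line %d, column %d: %s\n", $error->line, $error->column, trim($error->message));
}
```

The line and column point into the string passed to loadHTML(), which should be enough to locate the offending tag by hand.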

How do I insert HTML into a PHP DOM object? [duplicate]

This question already has answers here:
How to insert HTML to PHP DOMNode?
(5 answers)
Closed 7 years ago.
I am using PHP's DOM object to create HTML pages for my website. This works great for my head, however since I will be entering a lot of HTML into the body (not via DOM), I would think I would need to use DOM->createElement($bodyHTML) to add my HTML from my site to the DOM object.
However, DOM->createElement seems to escape all the HTML as entities, so my end result displayed the HTML source on the page and not the actual rendered HTML.
I am currently using a hack to get this to work,
$body = $this->DOM
->createComment('DOM Glitch--><body>'.$bodyHTML."</body><!--Woot");
This puts all my site code in a comment; I then break out of the comment and manually add the <body> tags.
Currently this method works, but I believe there should be a more proper way of doing this. Ideally something like DOM->createElement() that will not parse any of the string.
I also tried using DOM->createDocumentFragment() However it does not like some of the string so it would error and not work (Along with take up extra CPU power to re-parse the body's HTML).
So, my question is, is there a better way of doing this other than using DOM->createComment()?
You use the DOMDocumentFragment object to insert arbitrary HTML chunks into another document.
$dom = new DOMDocument();
@$dom->loadHTML($some_html_document); // @ to suppress a bajillion parse errors
$frag = $dom->createDocumentFragment(); // create fragment
$frag->appendXML($some_other_html_snippet); // insert arbitrary html into the fragment
$node = // some operations to find whatever node you want to insert the fragment into
$node->appendChild($frag); // stuff the fragment into the original tree
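Note that appendXML() expects well-formed XML, which may be why it errors on some HTML. A different technique, sketched here as an assumption about what would suit the question rather than as part of the answer above, is to parse the snippet with loadHTML() in a scratch document and copy the resulting nodes over with importNode(). The snippet and the wrapper div are made up for illustration.

```php
<?php
// Hypothetical snippet with things XML dislikes: a named entity and an unclosed <br>.
$snippet = '<p>Hello &copy; <br>world</p><i>more</i>';

libxml_use_internal_errors(true);

// The document being built.
$target = new DOMDocument();
$target->loadHTML('<html><head><title>t</title></head><body></body></html>');
$body = $target->getElementsByTagName('body')->item(0);

// Parse the snippet as HTML in a scratch document, wrapped in a div
// so its top-level nodes are easy to find again.
$scratch = new DOMDocument();
$scratch->loadHTML('<div>' . $snippet . '</div>');
libxml_clear_errors();

$wrapper = $scratch->getElementsByTagName('div')->item(0);
foreach ($wrapper->childNodes as $child) {
    // importNode() makes a deep copy owned by $target; the scratch tree is left alone.
    $body->appendChild($target->importNode($child, true));
}

echo $target->saveHTML();
```

This does re-parse the snippet, so it does not address the CPU concern from the question, and loadHTML() assumes ISO-8859-1 unless the markup declares another encoding.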
I FOUND THE SOLUTION but it's not a pure php solution, but works very well. A little hack for everybody who lost countless hours, like me, to fix this
$dom = new DomDocument;
// main object
$object = $dom->createElement('div');
// html attribute
$attr = $dom->createAttribute('value');
// ugly html string
$attr->value = "<div> this is a really html string ©</div><i></i> with all the © that XML hates!";
$object->appendChild($attr);
// jquery fix (or javascript as well)
$('div').html($(this).attr('value')); // and it works!
$('div').removeAttr('value'); // to clean-up
loadHTML works just fine.
<?php
$dom = new DOMDocument();
$dom->loadHTML("<font color='red'>Hey there mrlanrat!</font>");
echo $dom->saveHTML();
?>
which outputs Hey there mrlanrat! in red.
or
<?php
$dom = new DOMDocument();
$bodyHTML = "here is the body, a nice body I might add";
$dom->loadHTML("<body> " . $bodyHTML . " </body>");
// this would even work as well.
// $bodyHTML = "<body>here is the body, a nice body I might add</body>";
// $dom->loadHTML($bodyHTML);
echo $dom->saveHTML();
?>
Which outputs:
here is the body, a nice body I might add and inside of your HTML source code, its wrapped inside body tags.
I spent a lot of time working on Anthony Forloney's answer, but I cannot seem to get the html to append to the body without it erroring.
@Mark B: I have tried doing that, but as I said in the comments, it errored on my html.
I forgot to add the below, my solution:
I decided to make my html object much simpler and to allow me to do this by not using DOM and just use strings.
