This question already has answers here:
Why does my XPath query (scraping HTML tables) only work in Firebug, but not the application I'm developing?
(2 answers)
Closed 8 years ago.
I'm building a script that gives me a product array by parsing HTML from a list of websites.
I believe I'm doing everything right, but for some reason I'm having a lot of difficulty with just one website, Makita.ca.
I'm using DOMXPath to retrieve elements, and I'm providing the raw HTML that I get from makita.ca.
The pictures I want are the ones on the left of the page.
Please also note that the only thing I need is the link to each image, not the actual image.
The page in question is at http://www.makita.ca/index2.php?event=tool&id=100
$productArray = array();
$Dom = new DOMDocument();
@$Dom->loadHTML($this->html);
$xpath = new DOMXPath($Dom);
echo $xpath->query('//*[@id="content_other"]/table[2]/tbody/tr/td[1]/table/tbody/tr[4]/td/table/tbody/tr[1]/td/div/a/img')->length;
if ($xpath->query('//*[@id="content_other"]/table[2]/tbody/tr/td[1]/table/tbody/tr[4]/td/table')->length > 0)
{
    for ($i = 0; $i < $xpath->query('//*[@id="content_other"]/table[2]/tbody/tr/td[1]/table/tbody/tr[4]/td/table/tbody/tr')->length; $i++)
    {
        if ($xpath->query('//*[@id="content_other"]/table[2]/tr/td[1]/table/tr[4]/td/table/tr['.$i.']/td/div/a/img') > 0)
            $productArray['picture'][] = $xpath->query('//*[@id="content_other"]/table[2]/tr/td[1]/table/tr[4]/td/table/tr['.$i.']/td/div/a/img')->item(0)->nodeValue;
    }
}
Do you see what my mistake is? I'm really lost at this point.
Edit:
OK, for test purposes I am echoing the length of the query() result, which should tell me how many elements match the query.
I retyped the whole query by hand so that it can't contain any non-ASCII characters:
'//*[@id="content_other"]/table[2]//tr/td[1]/table//tr[4]/td/table//tr[1]/td/div/a/img'
The result is 0.
So I removed the end of the query part by part:
//*[@id="content_other"]/table[2]//tr/td[1]/table//tr[4]/td/table//tr[1]/td/div/a = 0
//*[@id="content_other"]/table[2]//tr/td[1]/table//tr[4]/td/table//tr[1]/td/div = 0
//*[@id="content_other"]/table[2]//tr/td[1]/table//tr[4]/td/table//tr[1]/td = 0
//*[@id="content_other"]/table[2]//tr/td[1]/table//tr[4]/td/table//tr[1] = 0
//*[@id="content_other"]/table[2]//tr/td[1]/table//tr[4]/td/table = 0
//*[@id="content_other"]/table[2]//tr/td[1]/table//tr[4]/td = 0
//*[@id="content_other"]/table[2]//tr/td[1]/table//tr = 5
Whoa, I got some matching elements here!
OK, let's try the last element, which is the one I need.
Since I assume it is zero-based, to get tr number 5 I need to enter this path:
//*[@id="content_other"]/table[2]//tr/td[1]/table//tr[4]
But I still get 0... So I don't know what to do any more.
//div[@class='product_heading']/ancestor-or-self::table[1]//a/img first selects the "Action Shots" heading, then all the images found under that block.
This XPath expression will be more reliable than yours because it uses few positional expressions, which tend to break easily as the markup changes.
//div[@class='product_heading']/ancestor-or-self::table[1]//a[@rel='thumbnail']/img would be even more robust.
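A minimal sketch of how that expression could be used to collect the image links. The HTML here is a made-up stand-in for the makita.ca page (the real markup will differ), and the file names are assumptions for illustration. Note that the link is read with getAttribute('src'), since an img element has no text for nodeValue to return.

```php
<?php
// Hypothetical stand-in for the makita.ca markup; the real page will differ.
$html = <<<HTML
<html><body><div id="content_other">
<table>
  <tr><td><div class="product_heading">Action Shots</div></td></tr>
  <tr><td><div><a rel="thumbnail" href="big1.jpg"><img src="thumb1.jpg"></a></div></td></tr>
  <tr><td><div><a rel="thumbnail" href="big2.jpg"><img src="thumb2.jpg"></a></div></td></tr>
</table>
</div></body></html>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$images = $xpath->query(
    "//div[@class='product_heading']/ancestor-or-self::table[1]//a[@rel='thumbnail']/img"
);

$productArray = array('picture' => array());
foreach ($images as $img) {
    // The image link lives in the src attribute, not in the node's text.
    $productArray['picture'][] = $img->getAttribute('src');
}
print_r($productArray['picture']);
```

Because the raw HTML is parsed as served, no tbody steps appear in the path; browsers add those elements to their own DOM, which is one reason a path copied from Firebug may not match in PHP.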
Update
To avoid deleting this question: after a few comments I realized that PHP uses XPath 1.0, whereas I was trying to use functions from XPath 2.0.
Thanks for your feedback. I hope some high-reputation users can suggest whether it's better to delete the question or leave it with this update.
I have been searching for this problem, but none of the solutions posted worked for me. I'm using DOMXPath in PHP, and I have used the Template Tester for XPath 2.0 to test my queries: http://videlibri.sourceforge.net/cgi-bin/xidelcgi
This is the first part of the code to start working on it:
$dom_html = new DOMDocument();
libxml_use_internal_errors(true);
$dom_html->loadHTMLFile($file);
libxml_clear_errors();
$xpath = new DOMXPath($dom_html);
Then I try this query in the Xpath 2.0 Tester:
//table/tbody/tr/td/div/count(table)
This query gives me the number of tables that each div contains, which is exactly what I need:
11
2
1
1
2
14
4
19
4
4
3
2
9
16
But when I tried to do the same in PHP, I did not obtain those numbers. I have tried the following solutions:
$quantity = $xpath->evaluate('count(//table/tbody/tr/td/div/table)');
But this gives me the total count, not the per-div counts I want.
$quantity = $xpath->query('//table/tbody/tr/td/div/count(table)');
With this query, I tried two different ways to obtain my desired answer, but neither of them works for me:
1)
foreach ($quantity as $content)
{
echo $content->nodeValue;
}
2)
foreach ($quantity as $content)
{
echo $content->textContent;
}
Thanks
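One way to get per-div counts with XPath 1.0 is to select the div elements first and then evaluate count(table) relative to each one. This is only a sketch on a small made-up document (the real file is not shown in the question), using the second argument of DOMXPath::evaluate() as the context node.

```php
<?php
// Made-up sample: an outer table whose cell holds two divs,
// the first with two child tables and the second with one.
$html = '<table><tbody><tr><td>'
      . '<div><table><tr><td>a</td></tr></table><table><tr><td>b</td></tr></table></div>'
      . '<div><table><tr><td>c</td></tr></table></div>'
      . '</td></tr></tbody></table>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$counts = array();
foreach ($xpath->query('//table/tbody/tr/td/div') as $div) {
    // count() is an XPath 1.0 function; evaluate() returns it as a float,
    // and the second argument makes the expression relative to this div.
    $counts[] = (int) $xpath->evaluate('count(table)', $div);
}
echo implode("\n", $counts), "\n";
```

The XPath 2.0 form div/count(table) is not available here because a path step in XPath 1.0 cannot be a function call.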
I've written a content generator tool for a project I'm working on, to help me batch-import fake content into text fields of a database. It just helps make the site look populated.
I'm using an external class called lorem-php-sum to actually generate the strings that I am inserting. It's incredibly simple really: it just produces paragraphs of text wrapped in <p> tags (a random number of them each time), and I then insert these strings into my chosen table within a big loop.
Now the thing is, I want to slightly advance what content is being randomly generated and to add some html list tags, horizontal line tags and other stuff. I want my new html elements to be placed randomly within the paragraphs that I get returned from this paragraph generator class.
The problem is that whilst I can easily insert list tags into my big paragraph string at some random point, I fear sometimes it may insert my new html tags within the existing markup in a way that will break the html.
Does anyone have a trick for inserting HTML into another string while following some rules? I imagine that the PHP DOMDocument class might help with this, but I'm not sure how.
You'd need to incorporate some kind of state machine in your generator.
You can think of something like this:
Step 1: Choose which element to render: a textnode, a paragraph, a list node.
When you pick a textnode you randomly generate some text and return to Step 1.
When you pick a paragraph you emit <p> and generate some text, emit </p> and return to Step 1.
In the case of a list node you can only make list elements <li>, so pick a random number of elements and fill them with same rules from Step 1.
--
You can also allow nesting. In <li> you can add <strong> and <em>, similar for <p>.
You can make it as crazy as you want I guess :)
Tweak the coefficients a bit to get good results. Try to make a generator that produces random but predictable output; total length might be a good thing to control.
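The steps above can be sketched roughly as follows. This is a simplified, flat version (no nesting inside list items), and randomText(), generateBlocks() and the word pool are hypothetical names standing in for the lorem ipsum class, not part of any library.

```php
<?php
// Hypothetical stand-in for the lorem ipsum generator.
function randomText($words) {
    $pool = array('lorem', 'ipsum', 'dolor', 'sit', 'amet', 'consectetur');
    $out = array();
    for ($i = 0; $i < $words; $i++) {
        $out[] = $pool[mt_rand(0, count($pool) - 1)];
    }
    return implode(' ', $out);
}

// Emits $count top-level blocks. Every branch opens and closes its own
// tags, so the output can never contain a half-open element.
function generateBlocks($count) {
    $html = '';
    for ($n = 0; $n < $count; $n++) {
        // Step 1: choose which element to render.
        switch (mt_rand(0, 2)) {
            case 0: // paragraph
                $html .= '<p>' . randomText(mt_rand(3, 8)) . "</p>\n";
                break;
            case 1: // list: only <li> children are allowed inside <ul>
                $html .= "<ul>\n";
                $items = mt_rand(1, 4);
                for ($i = 0; $i < $items; $i++) {
                    $html .= '  <li>' . randomText(3) . "</li>\n";
                }
                $html .= "</ul>\n";
                break;
            default: // horizontal rule
                $html .= "<hr>\n";
        }
    }
    return $html;
}

mt_srand(42); // fixed seed so repeated runs give repeatable output
$html = generateBlocks(6);
echo $html;
```

The branch probabilities are the "coefficients" to tune; here each element type is equally likely.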
You could hierarchically loop through multidimensional arrays. No cell without a row, no row without a table, as such no li without a ul.
$tags = array("<table>%s</table>\n",
    array(" <tr>%s</tr>\n",
        array(" <td>%s</td>\n")),
    "<ul>%s</ul>\n",
    array(" <li>%s</li>\n") // continue with more tags
);
$tags_simple = array("%s", "<strong>%s</strong>",
    "<i>%s</i>", "<p>%s</p>\n", "%s<br />\n"
); // etc.; "%s" alone means no tag at all, add more if you like
Pick a random entry from $tags, loop through its nested levels, sprintf the random sentences into them and add random simple tags. It's one standalone possibility.
So I managed to work this out with other code samples and using domDocument.
I ended up making a function that explodes the string via paragraph tags and returns it as an array containing each paragraph as a separate item.
function splitTextByPara($string, $split_on = "p") {
    // Add alternative tags to split on with syntax: |//ul|//br
    $dom = new DOMDocument();
    $dom->loadHTML($string);
    $domx = new DOMXPath($dom);
    $entries = $domx->evaluate("//" . $split_on);
    $result = array();
    foreach ($entries as $entry) {
        $result[] = $entry->ownerDocument->saveHTML($entry);
    }
    // utf8_decode converts UTF-8 to ISO-8859-1; here it undoes the extra
    // encoding pass that loadHTML/saveHTML applied to the input string
    $result = array_map("utf8_decode", $result);
    return $result;
}
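Once the paragraphs are separate items, extra markup can be placed between them rather than inside them, so the existing tags are never broken. The following is only a sketch of that idea; insertBetweenParagraphs() is a hypothetical helper name, and the sample strings are made up.

```php
<?php
// Hypothetical helper: re-emits each <p> and puts $snippet in front of the
// paragraph at index $position (0-based). If $position is past the last
// paragraph, nothing is inserted.
function insertBetweenParagraphs($html, $snippet, $position) {
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($dom);

    $out = '';
    $index = 0;
    foreach ($xpath->query('//p') as $p) {
        if ($index === $position) {
            $out .= $snippet;
        }
        $out .= $dom->saveHTML($p); // serialise just this paragraph
        $index++;
    }
    return $out;
}

$paragraphs = '<p>one</p><p>two</p><p>three</p>';
$list = '<ul><li>item</li></ul>';
$result = insertBetweenParagraphs($paragraphs, $list, 1);
echo $result, "\n";
```

In the generator, $position would presumably be picked with mt_rand() between 1 and the number of paragraphs minus one.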
Possible Duplicate:
PHP HTML DomDocument getElementById problems
I'm trying to extract info from Google searches in PHP and find that I can read the search urls without problem, but getting anything out of them is a whole different issue. After reading numerous posts, and applicable PHP docs, I came up with the following
// get large panoramas of montana
$url = 'http://www.google.com/search?q=montana+panorama&tbm=isch&biw=1408&bih=409';
$html = file_get_contents($url);
// was getting tons of "entity parse" errors, so added
$html = htmlentities($html, ENT_COMPAT, 'UTF-8', true); // tried false as well
$doc = new DOMDocument();
//$doc->strictErrorChecking = false; // tried both true and false here, same result
$result = $doc->loadHTML($html);
//echo $doc->saveHTML(); this shows that the tags I'm looking for are in fact in $doc
if ($result === true)
{
var_dump($result); // prints 'true'
$tags = $doc->getElementById('center_col');
$tags = $doc->getElementsByTagName('td');
var_dump($tags); // previous 2 lines both print NULL
}
I've verified that the ids and tags I'm looking for are in the HTML with error_log($html), and in the parsed doc with $doc->saveHTML(). Anyone see what I'm doing wrong?
Edit:
Thanks all for the help, but I've hit a wall with DOMDocument. Nothing in any of the docs, or other threads, works with Google image queries. Here's what I tried:
I looked at the link from @Jon and tried all the suggestions there, looked at the getElementById docs and read all the comments there as well. Still getting empty result sets. Better than NULL, but not much.
I tried the xpath trick:
$xpath = new DOMXPath($doc);
$ccol = $xpath->query("//*[@id='center_col']");
Same result, an empty set.
I did a error_log($html) directly after the file read and the document has a doctype "" so it's not that.
I also see there that user "carl2088" says "From my experience, getElementById seem to work fine without any setups if you have loaded a HTML document". Not in the case of Google image queries, it would appear.
In desperation, I tried
echo count(explode('center_col', $html))
to see if for some strange reason it disappears after the initial error_log($html). It's definitely there, the string is split into 4 chunks.
I checked my version of PHP (5.3.15), compiled Aug. 25 2012, so it's not a version too old to support getElementById.
Before yesterday, I had been using an extremely ugly series of "explodes" to get the info, and while it's horrid code, it took 45 minutes to write and it works.
I'd really like to ditch my "explode" hack, but 5 hours to achieve nothing vs 45 minutes to get something that works, makes it really difficult to do things the right way.
If anyone else with experience using DOMDocument has some additional tricks I could try, it would be much appreciated.
Are you using the JavaScript-style getElementById and getElementsByTagName? If yes, then this is the problem:
$tags = $doc->getElementById('center_col');
$tags = $doc->getElementsByTagName('td');
You will need to validate your document with DOMDocument->validate(), or set DOMDocument->validateOnParse, before calling $doc->getElementById('center_col'):
$doc->validateOnParse = true;
$doc->loadHTML($html);
stackoverflow: getelementbyid-problem
http://php.net/manual/de/domdocument.getelementbyid.php
It's in the question @Jon posted in his comment!
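One likely culprit worth checking in the question's code is the htmlentities() call: it turns every < and > into &lt; and &gt;, so the parser receives plain text with no tags left to find. The sketch below compares both cases on a tiny made-up document (not the real Google page); countMatches() is a hypothetical helper, and libxml_use_internal_errors() is used to silence the parse warnings instead of escaping the input.

```php
<?php
// Tiny made-up document standing in for the fetched page.
$html = '<html><body><div id="center_col"><table><tr><td>cell</td></tr></table></div></body></html>';

// Hypothetical helper: parse a string and count what the question looks for.
function countMatches($html) {
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // keep "entity parse" style warnings quiet
    $doc->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);
    return array(
        'center_col' => $xpath->query("//*[@id='center_col']")->length,
        'td'         => $doc->getElementsByTagName('td')->length,
    );
}

// Escaped input: the markup has become text, so nothing matches.
$escaped = countMatches(htmlentities($html, ENT_COMPAT, 'UTF-8'));
// Raw input: the elements are really in the DOM.
$raw = countMatches($html);

var_dump($escaped, $raw);
```

This would also explain why saveHTML() output still appears to contain the tags when viewed in a browser: the escaped text renders as if it were markup.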
Possible Duplicate:
How to use XPath function in a XPathExpression instance programatically?
I'm trying to find all of the rows of a nested table that contain an image with an id that ends with '_imgProductImage'.
I'm using the following query:
"//tr[/td/a/img[ends-with(@id,'_imgProductImage')]"
I'm getting the error: xmlXPathCompOpEval: function ends-with not found
My Google searches suggest this should be a valid query/function. What's the actual function I'm looking for if it's not "ends-with"?
from How to use XPath function in a XPathExpression instance programatically?
One can easily construct an XPath 1.0 expression, the evaluation of which produces the same result as the function ends-with():
$str2 = substring($str1, string-length($str1)- string-length($str2) +1)
produces the same boolean result (true() or false()) as:
ends-with($str1, $str2)
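So for your example, the following XPath should work (the suffix is 16 characters long, hence the offset of 15; td is written as a relative step so that it is matched inside each row):
//tr[td/a/img['_imgProductImage' = substring(@id, string-length(@id) - 15)]]
You will probably want to add a comment that this is an XPath 1.0 reformulation of ends-with().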
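A small sketch of that reformulation run through DOMXPath, with the offset derived from the suffix length instead of being hard-coded. The nested table and the id values are made up for illustration.

```php
<?php
// Made-up nested table; two of the three images have the wanted id suffix.
$html = '<table><tr><td><table>'
      . '<tr><td><a href="#"><img id="ctl00_imgProductImage" src="a.jpg"></a></td></tr>'
      . '<tr><td><a href="#"><img id="ctl00_imgOther" src="b.jpg"></a></td></tr>'
      . '<tr><td><a href="#"><img id="ctl01_imgProductImage" src="c.jpg"></a></td></tr>'
      . '</table></td></tr></table>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$suffix = '_imgProductImage';
// XPath 1.0 ends-with(): compare the last strlen($suffix) characters of @id.
// substring() is 1-based, so the start is string-length(@id) - (length - 1).
$offset = strlen($suffix) - 1;
$query = "//tr[td/a/img[substring(@id, string-length(@id) - $offset) = '$suffix']]";

$ids = array();
foreach ($xpath->query($query) as $row) {
    // Read the image id back from each matching row.
    $ids[] = $xpath->query('td/a/img', $row)->item(0)->getAttribute('id');
}
print_r($ids);
```

The suffix is interpolated straight into the expression, so this form only suits suffixes without a single quote in them.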
It seems that ends-with() is an XPath 2.0 function.
DOMXPath only supports XPath 1.0
Edit after the comment : In your case, I suppose you'll have to :
Find all images, using a simpler XPath query, that will return more images than what you want -- but include those you want to keep.
Loop over those, testing in PHP, for each one of them, if the id attribute (see the getAttribute method) matches what you want.
To test if the attribute is OK, you could use something like this, in the loop that iterates over the images :
$id = $currentNode->getAttribute('id');
if (preg_match('/_imgProductImage$/', $id)) {
// the current node is OK ;-)
}
Note that, in my regex pattern, I used a $ to indicate end of string.
There is no ends-with function in XPath 1.0, but you can fake it:
"//tr[td/a/img[substring(@id, string-length(@id) - 15) = '_imgProductImage']]"
If you're on PHP 5.3.0 or later, you can use registerPHPFunctions to call any PHP function you want, although the syntax is a little odd. For example,
$xpath = new DOMXPath($document);
$xpath->registerNamespace("php", "http://php.net/xpath");
$xpath->registerPHPFunctions("ends_with");
$nodes = $xpath->query("//tr[td/a/img[php:function('ends_with', @id, '_imgProductImage')]]");

function ends_with($node, $value) {
    // $node is an array of the matched DOM nodes (here: the id attribute);
    // it is empty when the img has no id at all
    return isset($node[0]) && substr($node[0]->nodeValue, -strlen($value)) == $value;
}
I'm looking for a way to get all the form inputs and respective values from a page given a specific URL and form name.
function GetForm($url, $name)
{
return array
(
'field_name_1' => 'value_1',
'field_name_2' => 'value_2',
'select_field_name' => array('option_1', 'option_2', 'option_3'),
);
}
GetForm('http://www.google.com/', 'f');
Can anyone provide me with the necessary regular expressions to accomplish this?
EDIT: I understand that querying the DOM would be far more reliable, however what I'm looking for is a website agnostic solution that allows me to get all the fields of a given form. I don't believe this is possible with DOM without knowing the document nodes first, am I wrong?
I don't need a bullet proof solution, just something that works on standard web pages, for the FORM tag I've come up with the following RegEx;
'~<form.*?name=[\'"]?' . $name . '[\'"]?.*?>(.+?)</form>~is'
I believe that doing something similar for input fields won't be difficult, what I find most challenging is the RegEx for the select and option fields.
Using regex to parse HTML is probably not the best way to go.
You might take a look at DOMDocument::loadHTML, which will allow you to work with an HTML document using DOM methods (and XPath queries, for instance, if you know those).
You might also want to take a look at Zend_Dom and Zend_Dom_Query, btw, which are quite nice if you can use some parts of Zend Framework in your application.
They are used to fetch data from HTML pages when doing functional testing with Zend_Test, for instance -- and work quite well ;-)
It may seem harder in the first place... But, considering the mess some HTML pages are, it is probably a much wiser idea...
EDIT after the comment and the edit of the OP
Here are a couple of thoughts about, to begin with something "simple", an input tag:
it can spread across several lines
it can have many attributes
considering only name and value are of interest to you, you have to deal with the fact that those two can be in any possible order
attributes can have double quotes, single quotes, or even nothing around their values
tags / attributes can be either lower-case or upper-case
tags don't always have to be closed
Well, some of those points are not valid HTML; but they still work in the most common web browsers, so they have to be taken into account...
With those points alone, I wouldn't like to be the one writing the regex ^^
But I suppose there might be others difficulties I didn't think about.
On the other side, you have DOM and xpath... To get the value of an input name="q" (example is this page), it's a matter of something like this :
$url = 'http://www.google.fr/search?q=test&ie=utf-8&oe=utf-8&aq=t&rls=com.ubuntu:en-US:unofficial&client=firefox-a';
$html = file_get_contents($url);
$dom = new DOMDocument();
if (@$dom->loadHTML($html)) {
    // yep, not necessarily valid-html...
    $xpath = new DOMXpath($dom);
    $nodeList = $xpath->query('//input[@name="q"]');
    if ($nodeList->length > 0) {
        for ($i=0 ; $i<$nodeList->length ; $i++) {
            $node = $nodeList->item($i);
            var_dump($node->getAttribute('value'));
        }
    }
} else {
    // too bad...
}
What matters here ? The XPath query, and only that... And is there anything static/constant in it ?
Well, I say I want all <input> that have a name attribute that is equal to "q".
And it just works : I'm getting this result :
string 'test' (length=4)
string 'test' (length=4)
(I checked : there are two input name="q" on the page ^^ )
Do I know the structure of the page ? Absolutely not ;-)
I just know I/you/we want input tags named q ;-)
And that's what we get ;-)
EDIT 2 : and a bit fun with select and options :
Well, just for fun, here's what I came up for select and option :
$url = 'http://www.google.fr/language_tools?hl=fr';
$html = file_get_contents($url);
$dom = new DOMDocument();
if (@$dom->loadHTML($html)) {
    // yep, not necessarily valid-html...
    $xpath = new DOMXpath($dom);
    $nodeListSelects = $xpath->query('//select');
    if ($nodeListSelects->length > 0) {
        for ($i=0 ; $i<$nodeListSelects->length ; $i++) {
            $nodeSelect = $nodeListSelects->item($i);
            $name = $nodeSelect->getAttribute('name');
            $nodeListOptions = $xpath->query('option[@selected="selected"]', $nodeSelect); // We want options that are inside the current select
            if ($nodeListOptions->length > 0) {
                for ($j=0 ; $j<$nodeListOptions->length ; $j++) {
                    $nodeOption = $nodeListOptions->item($j);
                    $value = $nodeOption->getAttribute('value');
                    var_dump("name='$name' => value='$value'");
                }
            }
        }
    }
} else {
    // too bad...
}
And I get as an output :
string 'name='sl' => value='fr'' (length=23)
string 'name='tl' => value='en'' (length=23)
string 'name='sl' => value='en'' (length=23)
string 'name='tl' => value='fr'' (length=23)
string 'name='sl' => value='en'' (length=23)
string 'name='tl' => value='fr'' (length=23)
Which is what I expected.
Some explanations ?
Well, first of all, I get all the select tags of the page, and keep their name in memory.
Then, for each one of those, I get the selected option tags that are its descendants (there's always only one, btw).
And here, I have the value.
A bit more complicated than the previous example... But still much easier than regex, I believe... Took me maybe 10 minutes, not more... And I still won't have the courage (madness ?) to start thinking about some kind of mutant regex that would be able to do that :-D
Oh, and, as a sidenote : I still have no idea what the structure of the HTML document looks like : I have not even taken a single look at its source ^^
I hope this helps a bit more...
Who knows, maybe I'll convince you regex are not a good idea when it comes to parsing HTML... maybe ? ;-)
Still : have fun !
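Putting the two examples together, something close to the GetForm() asked for in the question might look like the sketch below. It is an assumption-laden illustration, not a complete solution: getFormFields() is a hypothetical name, it takes an HTML string instead of a URL so that it can run offline, it ignores textarea and button elements, and the sample form is made up.

```php
<?php
// Hypothetical helper: returns input name => value pairs and, for each
// select, the list of its option values, for the form with the given name.
// $formName is interpolated into the XPath, so it must not contain quotes.
function getFormFields($html, $formName) {
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($dom);

    $fields = array();
    $forms = $xpath->query('//form[@name="' . $formName . '"]');
    if ($forms->length === 0) {
        return $fields; // no such form on the page
    }
    $form = $forms->item(0);

    // All named inputs anywhere inside the form.
    foreach ($xpath->query('.//input[@name]', $form) as $input) {
        $fields[$input->getAttribute('name')] = $input->getAttribute('value');
    }
    // All named selects, each with the values of its options.
    foreach ($xpath->query('.//select[@name]', $form) as $select) {
        $options = array();
        foreach ($xpath->query('.//option', $select) as $option) {
            $options[] = $option->getAttribute('value');
        }
        $fields[$select->getAttribute('name')] = $options;
    }
    return $fields;
}

$html = '<form name="f" action="/search">'
      . '<input type="text" name="q" value="test">'
      . '<input type="hidden" name="hl" value="en">'
      . '<select name="lang"><option value="fr">French</option>'
      . '<option value="en" selected>English</option></select>'
      . '</form>';
$fields = getFormFields($html, 'f');
print_r($fields);
```

For a live page, the string could come from file_get_contents($url) as in the examples above.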