I'm using XPath to extract data from a web site, but I have a problem with the XPath selector. Assume I have this HTML code:
<div id="_parent">
<div class="my">
Hi!
<p>I am a child!</p>
<span class="someclass">I am a <b>span</b> child!</span>
</div>
</div>
What I get:
Hi!
I am a child!
I am a span child!
What I should get:
<div class="my">
Hi!
<p>I am a child!</p>
<span class="someclass">I am a <b>span</b> child!</span>
</div>
My current XPath PHP code:
$xpath = new DOMXPath($doc);
$entries = $xpath->query("//div[@class='my']");
When in Chrome I open the console and enter this in it:
document.evaluate( "//div[@class='my']", document, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null ).singleNodeValue;
then what I get is:
<div class="my">
Hi!
<p>I am a child!</p>
<span class="someclass">I am a <b>span</b> child!</span>
</div>
So the XPath expression actually works as intended. I infer that the way you apply it must be wrong; however, you did not show us the code that applies the XPath expression and outputs the result.
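The question does not show how the matched node is printed, so the following is only a guess at the cause. Assuming the PHP code echoes `nodeValue` (which returns the concatenated text of a node without tags), a minimal sketch of getting the markup instead is to pass the node to `DOMDocument::saveHTML()`. The variable names `$textOnly` and `$markup` are illustrative.

```php
<?php
// Sample markup from the question, kept on one line for brevity.
$html = '<div id="_parent"><div class="my">Hi!<p>I am a child!</p>'
      . '<span class="someclass">I am a <b>span</b> child!</span></div></div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$entries = $xpath->query("//div[@class='my']");

foreach ($entries as $entry) {
    $textOnly = $entry->nodeValue;      // concatenated text, tags dropped
    $markup   = $doc->saveHTML($entry); // the element serialized with its tags
    echo $markup, "\n";
}
```

Passing a node to `saveHTML()` serializes only that node and its descendants, which is the closest DOM counterpart to what the Chrome console displays.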
Related
I am trying to extract raw html from a web-page using simplehtmldom. I was wondering if it is possible using that library.
For example, let's say I have this web page I am trying to extract data from.
<div class="class1">
<div class="class2">
<div class="class3">
<p>p1</p>
<h1>header here!</h1>
<p>p2</p>
<img src="someimage"></img>
</div>
</div>
</div>
My goal is to extract everything within div class3 including the raw html code so when I get the data I can enter it to a text box which allows input for source code so it is formatted the same way it is from the webpage.
I have looked at simplehtmldom manuals and did some searching but have yet to find a solution.
Thank you.
Using your example html string
$html = str_get_html('<div class="class1">
<div class="class2">
<div class="class3">
<p>p1</p>
<h1>header here!</h1>
<p>p2</p>
<img src="someimage"></img>
</div>
</div>
</div>');
// Find all divs with class3
foreach($html->find('div[class=class3]') as $element) {
echo $element->outertext;
}
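If only the contents of the div are wanted, without the wrapping `<div class="class3">` tag, the same result can be sketched with PHP's built-in DOM extension instead of simplehtmldom. This is an alternative technique, not part of the answer above; `innerHtml()` is a hypothetical helper name, and the sample markup is the question's HTML with whitespace removed.

```php
<?php
// Hypothetical helper: serialize the children of a node, i.e. its "inner HTML".
function innerHtml(DOMNode $node): string
{
    $out = '';
    foreach ($node->childNodes as $child) {
        $out .= $node->ownerDocument->saveHTML($child);
    }
    return $out;
}

$html = '<div class="class1"><div class="class2"><div class="class3">'
      . '<p>p1</p><h1>header here!</h1><p>p2</p><img src="someimage">'
      . '</div></div></div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // suppress warnings about imperfect HTML
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$node  = $xpath->query("//div[@class='class3']")->item(0);

$inner = innerHtml($node);
echo $inner, "\n";
```

The resulting string can then be placed in a text box; if it is echoed into an HTML page, it would normally be passed through `htmlspecialchars()` first so the tags are shown rather than rendered.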
Here is a sample of code I have extracted from a webpage...
<div class="user-details-narrow">
<div class="profileheadtitle">
<span class=" headline txtBlue size15">
Profession
</span>
</div>
<div class="profileheadcontent-narrow">
<span class="txtGrey size15">
administration
</span>
</div>
</div>
When displayed on the webpage it shows as "Profession administration". What I want to do is extract the profession, in this case "administration". However, it's not as simple as it might seem because this piece of code is repeated many times for various other questions, such as
<div class="user-details-narrow">
<div class="profileheadtitle">
<span class=" headline txtBlue size15">
Industry
</span>
</div>
<div class="profileheadcontent-narrow">
<span class="txtGrey size15">
banking
</span>
</div>
</div>
Any ideas on a good solution?
Please, do not use regular expressions for getting node values from a page.
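PHP has a very nice class named DOMDocument. You can load a page into a DOMDocument and query it with XPath:
$dom = new DOMDocument;
libxml_use_internal_errors(true); // real-world HTML is rarely valid; suppress parser warnings
$dom->loadHTMLFile("http://test.de/page.html");
$finder = new DOMXPath($dom);
// matches every element whose class attribute contains "size15"
$spaner = $finder->query("//*[contains(@class, 'size15')]");
echo $spaner->item(0)->nodeValue . "/" . $spaner->item(1)->nodeValue;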
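Because the same block repeats for every question (Profession, Industry, and so on), picking items by position can be fragile. A possible refinement, sketched under the assumption that every block has the structure shown in the question, is to select the value by its label. `findDetail()` is a hypothetical helper, and the label is assumed not to contain a single quote, since it is interpolated into the XPath string.

```php
<?php
$html = <<<HTML
<div class="user-details-narrow">
  <div class="profileheadtitle"><span class=" headline txtBlue size15"> Profession </span></div>
  <div class="profileheadcontent-narrow"><span class="txtGrey size15"> administration </span></div>
</div>
<div class="user-details-narrow">
  <div class="profileheadtitle"><span class=" headline txtBlue size15"> Industry </span></div>
  <div class="profileheadcontent-narrow"><span class="txtGrey size15"> banking </span></div>
</div>
HTML;

// Hypothetical helper: return the value shown next to the given label,
// or an empty string when no block carries that label.
function findDetail(DOMXPath $xpath, string $label): string
{
    $query = "normalize-space(//div[@class='user-details-narrow']"
           . "[div[@class='profileheadtitle']/span[normalize-space(.)='$label']]"
           . "/div[@class='profileheadcontent-narrow']/span)";
    return $xpath->evaluate($query);
}

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

echo findDetail($xpath, 'Profession'), "\n";
echo findDetail($xpath, 'Industry'), "\n";
```

`normalize-space()` trims the surrounding whitespace that the page places around both the label and the value.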
Hello, and thanks for looking at my question.
I need to grab some data from an HTML snippet.
The source is a trusted, structured one, so I think it's OK to use regex on this HTML. DOM and other advanced features in PHP are overkill, I guess.
Here is the format of the HTML snippet.
<div id="d-container">
<div id="row-custom_1">
<div class="label">Type</div>
<div class="content">John Smith</div>
<div class="clear"></div>
</div>
</div>
In the above, please note that the first 2 DIV tags have IDs set. There could be several div tags like row-custom_1, so I will need to escape them.
I'm actually very poor at regex, so I'm hoping for your help to grab the John Smith from the above HTML snippet.
It could be something like
<div * id="row-custom_1" * > * <div * class="content" * >GRAB THIS </div>
but I don't know how to do it in regex.
The John Smith part won't contain any HTML for sure. It's from a trusted source that strips all HTML and gives the data in the above format.
I can understand that regex is never a good idea to process HTML anyway.
Thank you very much for any assistance.
Edit just after 30 minutes:
Many of the awesome people suggested using an HTML parser, so I did; it worked like a charm. So if anyone comes here with a similar question, as the stupid question author, I'd recommend using DOM for the job.
Here is a simple DOM based code to get your value from the given HTML:
$html = <<< EOF
<div id="d-container">
<div id="row-custom_1">
<div class="label">Type</div>
<div class="content">John Smith</div>
<div class="clear"></div>
</div>
</div>
EOF;
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html); // loads your html
$xpath = new DOMXPath($doc);
$value = $xpath->evaluate("string(//div[@id='d-container']
/div[@id='row-custom_1']/div[@class='content']/text())");
echo "User Name: [$value]\n"; // prints your user name
OUTPUT:
User Name: [John Smith]
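The question mentions that there could be several `row-custom_N` divs. One way this might be handled, sketched here under the assumption that every row has a `label` and a `content` child, is to loop over all rows and evaluate relative XPath expressions against each one. The second row (`City` / `Springfield`) is made-up sample data for illustration.

```php
<?php
$html = '<div id="d-container">'
      . '<div id="row-custom_1"><div class="label">Type</div>'
      . '<div class="content">John Smith</div><div class="clear"></div></div>'
      // Made-up second row, only to show the loop handling more than one.
      . '<div id="row-custom_2"><div class="label">City</div>'
      . '<div class="content">Springfield</div><div class="clear"></div></div>'
      . '</div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// Every direct child of the container whose id starts with "row-custom_".
$rows = $xpath->query("//div[@id='d-container']/div[starts-with(@id, 'row-custom_')]");

$data = [];
foreach ($rows as $row) {
    // The second argument makes the expression relative to the current row.
    $label   = $xpath->evaluate("string(div[@class='label'])", $row);
    $content = $xpath->evaluate("string(div[@class='content'])", $row);
    $data[$label] = $content;
}

print_r($data);
```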
I'm trying to use DOMDocument and XPath to search an HTML document using PHP. I want to search by a number such as '022222', and it should return the value of the corresponding h2 tag. Any thoughts on how this would be done?
The HTML document can be found at http://pastie.org/1211369
How about this?
$sxml = simplexml_load_string($data);
$find = "022222";
print_r($sxml->xpath("//li[.='".$find."']/../../../div[@class='content']/h2"));
It returns:
Array
(
[0] => SimpleXMLElement Object
(
[0] => Item 2
)
)
//li[.='xxx'] will locate the li you're searching for. Then we use ../ to step up three levels before descending into the content div, as specified by div[@class='content']. Finally, we choose the h2 child.
Just FYI, here's how to do it using DOM:
$dom = new DOMDocument();
$dom->loadXML($data);
$find = "022222";
$xpath = new DOMXpath($dom);
$res = $xpath->evaluate("//li[.='".$find."']/../../../div[@class='content']/h2");
if ($res->length > 0) {
$node = $res->item(0);
echo $node->firstChild->wholeText."\n";
}
I want to search by a number such as '022222', and it should return the value of the corresponding h2 tag. Any thoughts on how this would be done?
The HTML document can be found at http://pastie.org/1211369
To start with, the text at the provided link is not a well-formed XML or XHtml document and cannot be directly parsed with XPath.
Therefore I have wrapped it in an <html> element.
On this XML document one of the XPath expressions that selects exactly the wanted text node is:
/*/div[div/ul/li = '022222']/div[@class='content']/h2/text()
Among other advantages, this XPath expression doesn't use any reverse axes and is thus more readable.
The complete XML document on which this XPath expression is evaluated is the following:
<html>
<div class="item">
<div class="content"><h2>Item 1</h2></div>
<div class="phone">
<ul class="phone-single">
<li>01234 567890</li>
</ul>
</div>
</div>
<div class="item">
<div class="content"><h2>Item 2</h2></div>
<div class="phone">
<ul class="phone-multiple">
<li>022222</li>
<li>033333</li>
</ul>
</div>
</div>
<div class="item">
<div class="content"><h2>Item 3</h2></div>
<div class="phone">
<ul class="phone-single">
<li>02345 678901</li>
</ul>
</div>
</div>
<div class="item">
<div class="content"><h2>Item 4</h2></div>
<div class="phone">
<ul class="phone-multiple">
<li>099999999</li>
<li>088888888</li>
</ul>
</div>
</div>
</html>
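For completeness, here is a minimal sketch of evaluating that forward-axis expression from PHP with DOMDocument and DOMXPath. It assumes the wrapped document is well-formed XML and uses only the first two items of it to stay short. The search value is interpolated directly, so it is assumed not to contain a single quote.

```php
<?php
$xml = <<<XML
<html>
  <div class="item">
    <div class="content"><h2>Item 1</h2></div>
    <div class="phone"><ul class="phone-single"><li>01234 567890</li></ul></div>
  </div>
  <div class="item">
    <div class="content"><h2>Item 2</h2></div>
    <div class="phone"><ul class="phone-multiple"><li>022222</li><li>033333</li></ul></div>
  </div>
</html>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

$find = '022222';
// string() turns the selected text node into a plain PHP string.
$title = $xpath->evaluate(
    "string(/*/div[div/ul/li = '$find']/div[@class='content']/h2/text())"
);

echo $title, "\n"; // prints "Item 2"
```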
I was trying to do it with "getElementsByTagName", but it wasn't working. I'm new to using DOMDocument to parse HTML, as I used to use regex until yesterday, when some kind folks here told me that DOMDocument would be better for the job, so I'm giving it a try :)
I googled around for a while looking for some explanations but didn't find anything that helped (not with the class anyway).
So I want to capture "Capture this text 1" and "Capture this text 2" and so on.
It doesn't look too hard, but I can't figure it out :(
<div class="main">
<div class="text">
Capture this text 1
</div>
</div>
<div class="main">
<div class="text">
Capture this text 2
</div>
</div>
If you want to get :
The text
that's inside a <div> tag with class="text"
that's, itself, inside a <div> with class="main"
I would say the easiest way is not to use DOMDocument::getElementsByTagName -- which will return all tags that have a specific name (while you only want some of them).
Instead, I would use an XPath query on your document, using the DOMXpath class.
For example, something like this should do, to load the HTML string into a DOM object and instantiate the DOMXPath class:
$html = <<<HTML
<div class="main">
<div class="text">
Capture this text 1
</div>
</div>
<div class="main">
<div class="text">
Capture this text 2
</div>
</div>
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
And, then, you can use XPath queries, with the DOMXPath::query method, that returns the list of elements you were searching for :
$tags = $xpath->query('//div[@class="main"]/div[@class="text"]');
foreach ($tags as $tag) {
var_dump(trim($tag->nodeValue));
}
And executing this gives me the following output :
string 'Capture this text 1' (length=19)
string 'Capture this text 2' (length=19)
You can use http://simplehtmldom.sourceforge.net/
It is a very simple, easy-to-use DOM parser written in PHP, with which you can easily fetch the content of a div tag.
Something like this:
// Find all <div> which have attribute id=text
$ret = $html->find('div[id=text]');
See the documentation of it for more help.