How to use symfony dom parser - php

I am trying to use Symfony Crawler.
So I have checked this article.
What I want to do is get the 3,355.00 (the second value).
For now, I am trying a statement like this, but it does not work:
$crawler = $crawler->filter('body > div[@class="cell_label"]');
How can I do it?
<body>
<div class="cell__label"> Value1 </div> <div class="cell__value cell__value_"> 2,355.00 </div>
<div class="cell__label"> Value2 </div> <div class="cell__value cell__value_"> 3,355.00 </div>
<div class="cell__label"> Value3 </div> <div class="cell__value cell__value_"> 4,355.00 </div>
</body>
$crawler = new Crawler($url);
$crawler = $crawler->filter('body > div[@class="cell_label"]'); // does not work
foreach ($crawler as $domElement) {
var_dump($domElement);
}

I can see several issues here:
$crawler->filter() expects a CSS selector, not an XPath expression, so use 'body > div.cell__label', or 'body div[class^="cell__"]' if you need every div whose class starts with cell__. You also have a typo: you wrote cell_label (one underscore), but the class is cell__label.
The Crawler constructor accepts a DOMNodeList, a DOMNode, an array or a string, not a URL to a remote resource (although I assume $url may just be an arbitrary variable name you used there). Technically a URL is a string as well, but it is not a string of markup that can be parsed.
If you want to use an XPath expression, use $crawler->filterXPath(), like this:
$nodes = $crawler->filterXPath('//div[contains(@class, "cell__label")]');
Here is the documentation on how to use XPath: https://www.w3.org/TR/xpath/

The Crawler's filter() method can handle jQuery-like selectors, so you can do:
$crawler = $crawler->filter('.cell__value');
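
For comparison, here is a minimal sketch of picking out the second value with PHP's built-in DOMDocument and DOMXPath instead of the Symfony Crawler. It assumes the markup from the question is already available as a string ($html is a stand-in here, not something fetched from a URL). The same XPath expressions could be passed to filterXPath() on a Crawler.

```php
<?php
// Markup copied from the question; in practice this would be the fetched page body.
$html = <<<'HTML'
<body>
<div class="cell__label"> Value1 </div> <div class="cell__value cell__value_"> 2,355.00 </div>
<div class="cell__label"> Value2 </div> <div class="cell__value cell__value_"> 3,355.00 </div>
<div class="cell__label"> Value3 </div> <div class="cell__value cell__value_"> 4,355.00 </div>
</body>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Option 1: take the second value cell by position (item() is zero-based).
$values = $xpath->query('//div[contains(@class, "cell__value")]');
$second = trim($values->item(1)->textContent);

// Option 2: find the label by its text, then step to the div right after it.
$byLabel = $xpath->query('//div[normalize-space()="Value2"]/following-sibling::div[1]');
$secondByLabel = trim($byLabel->item(0)->textContent);

echo $second, PHP_EOL;
echo $secondByLabel, PHP_EOL;
```

Selecting by label text tends to be more robust than selecting by position if rows can be added or reordered.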

Related

How to get xPath nodeValue USD dollar amount

I'm trying to start from the <span> element that has text Value when transacted
Then get its parent <div> and get following sibling which is a <div> and from that <div> get the text of the child <span>.
From what I can tell, the code is correct and should echo $1,034.29.
It echoes $0.00 instead.
What am I missing here?
php code:
$a = new DOMXPath($doc);
$dep_val_txt = $a->query("//span[contains(text(), 'Value when transacted')]");
$dep_val_nxt_elem = $a->query("parent::div", $dep_val_txt[0]);
$dep_val_elem = $a->query("following-sibling::*[1]", $dep_val_nxt_elem[0]);
$dep_val = $dep_val_elem->item(0)->childNodes->item(0)->nodeValue;
echo $dep_val;
html code:
<div class="sc-8sty72-0 cyLejs">
<span class="sc-1ryi78w-0 bFGdFC sc-16b9dsl-1 iIOvXh sc-1n72lkw-0 bKaZjn" opacity="1">Value when transacted</span>
</div>
<div class="sc-8sty72-0 cyLejs">
<span class="sc-1ryi78w-0 bFGdFC sc-16b9dsl-1 iIOvXh u3ufsr-0 gXDEBk" opacity="1">$1,034.29</span>
</div>
In case someone else stumbles upon this question in the future, I will summarize the solution which was concluded by conversation with OP in the comments:
The issue here is not with the DOM selectors, as observed by the fact that his output is $0.00 even though he is not formatting the value to appear as a currency. This led me to believe that the website being scraped is in fact using placeholder values which are updated on the client side using Javascript. The reason this cannot be resolved with selectors is because the DOM received by PHP will be the initial render, which does not contain the values we wish to scrape.
So the solution is to examine the website being scraped to determine where and how the values are being fetched before being added to the DOM on the client side. For example, if the website is using an API call to fetch the values, one can simply use the same API to fetch the intended data without having to scrape the HTML DOM at all.
If you follow the OP's question literally:
start from the <span> element that has text "Value when transacted"
get its parent <div>
get the following sibling which is a <div>
get the text of the child <span>
then the XPath expression should be
//span[text()='Value when transacted']/parent::div/following-sibling::div[1]/span
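
As a rough sketch, this is how a single expression of that shape can be run against the sample markup with DOMXPath. The class lists are shortened for readability, and [1] is used so that only the nearest following div is considered. It only helps if the value is actually present in the HTML that PHP receives.

```php
<?php
// Sample markup from the question (class attributes shortened).
$html = <<<'HTML'
<div class="sc-8sty72-0 cyLejs">
<span class="sc-1ryi78w-0 bFGdFC" opacity="1">Value when transacted</span>
</div>
<div class="sc-8sty72-0 cyLejs">
<span class="sc-1ryi78w-0 bFGdFC" opacity="1">$1,034.29</span>
</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$nodes = $xpath->query(
    "//span[text()='Value when transacted']/parent::div/following-sibling::div[1]/span"
);

// Guard against an empty result instead of calling item(0) blindly.
$value = $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;
echo $value, PHP_EOL;
```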
You might find it easier and faster to process using a regex to match the price, here's a quick example in PHP:
<?php
// Your input HTML (as per your example)
$inputHtml = <<<HTML
<div class="sc-8sty72-0 cyLejs">
<span class="sc-1ryi78w-0 bFGdFC sc-16b9dsl-1 iIOvXh sc-1n72lkw-0 bKaZjn" opacity="1">Value when transacted</span>
</div>
<div class="sc-8sty72-0 cyLejs">
<span class="sc-1ryi78w-0 bFGdFC sc-16b9dsl-1 iIOvXh u3ufsr-0 gXDEBk" opacity="1">$1,034.29</span>
</div>
HTML;
$matches = [];
// Look for any div > span element which contains a string starting with $ and then match a number (allowing for a , or . within the price matched).
if (preg_match_all('#<div.*>\s*<span.*?>\$([0-9.,]+)</span>\s*</div>#mis', $inputHtml, $matches)) {
echo 'Price found: ' . $matches[1][0] . PHP_EOL;
}
Console output from this:
Price found: 1,034.29

How to extract HTML element from a source file

Using PHP, I need to replace an HTML section, identified by a tag id, in source code that is a combination of HTML and PHP. If it were pure HTML, a DOM parser could be used; if there were no DIV inside a DIV, I can imagine how to use preg_match. This is what I am trying to do: I have code (loaded into a string) like:
<div>
<img >
</div>
<? include(); ?>
<div id="mydiv">
<div>
<div>
<img >
</div>
</div>
</div>
and my task is to replace the content of the "mydiv" DIV with new content, e.g.
<div id="newdiv">
some text
</div>
so the string will look like this after the change:
<div>
<img >
</div>
<? include(); ?>
<div id="mydiv">
<div id="newdiv">
some text
</div>
</div>
I have already tried:
1) parsing the code using DOMdocument's loadHTML => it produces a lot of errors in case PHP code is included.
2) I played around a bit with regexes like preg_match_all('/<div id="myid"([^<]*)<\/div>/', $src, $matches), which fails in case more child divs are included.
The best approach I have found so far is:
1) find id="mydiv" string
2) search for '<' and '>' chars and count them like '<'=1 and '>'=-1 (not exactly, but it gives the idea)
3) once I get sum == 0 I should be at the position of the closing tag, so I know which portion of the string I should exchange
This is quite a "heavy" solution, which can stop working in cases where the code is different (e.g. the on-page PHP code contains those characters as well, instead of just a simple "include"). So I am looking for a better solution.
You could try something like this:
$file = 'filename.php';
$content = file_get_contents($file);
$array_one = explode( '<div id="mydiv">' , $content );
$my_div_content = explode("</div>" , $array_one[1] )[0];
Or use preg_match like you said:
preg_match('/<div id="mydiv"(.*?)<\/div>/s', $content, $matches)
Yes, there is. First you need a function that reads the content of the file. Let's call the file homepage.php:
$homepageString = file_get_contents('homepage.php');
Now you have a string with all the content. The next step is to use the preg_replace() function to remove the part of the code that you want to take out:
$newHomepageString = preg_replace('/id="mydiv"/',"", $homepageString);
Now you overwrite the existing homepage.php file with the new source code:
file_put_contents("homepage.php", $newHomepageString);
Let me know if it worked for you! :)
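
For nested divs, the counting idea from the question can be sketched roughly as below. It counts <div> open and close tags rather than single characters. The function name replaceDivContent is hypothetical, and the sketch assumes the embedded PHP code does not itself contain literal <div or </div> text.

```php
<?php
/**
 * Replace the inner content of the first <div id="..."> in mixed PHP/HTML
 * source by tracking the nesting depth of <div> tags (hypothetical helper).
 * Returns null if the div is not found or the markup is unbalanced.
 */
function replaceDivContent(string $src, string $id, string $newInner): ?string
{
    // Locate the opening tag of the target div.
    $openPattern = '/<div\b[^>]*\bid="' . preg_quote($id, '/') . '"[^>]*>/i';
    if (!preg_match($openPattern, $src, $m, PREG_OFFSET_CAPTURE)) {
        return null;
    }
    $innerStart = $m[0][1] + strlen($m[0][0]);

    // Collect every later <div ...> or </div> tag together with its offset.
    preg_match_all('/<div\b[^>]*>|<\/div\s*>/i', $src, $tags, PREG_OFFSET_CAPTURE, $innerStart);

    $depth = 1; // we are inside the target div
    foreach ($tags[0] as [$tag, $offset]) {
        $depth += ($tag[1] === '/') ? -1 : 1;
        if ($depth === 0) {
            // $offset points at the closing tag that matches the target div.
            return substr($src, 0, $innerStart) . $newInner . substr($src, $offset);
        }
    }
    return null;
}

$src = '<div><img></div><? include(); ?><div id="mydiv"><div><div><img></div></div></div>';
$result = replaceDivContent($src, 'mydiv', '<div id="newdiv">some text</div>');
echo $result, PHP_EOL;
```

Because the source is never handed to an HTML parser, the embedded PHP tags pass through untouched.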

Get hrefs that match regex expression using PHP & XPath

I have a page that contains several hyperlinks. The ones I want to get are of the format:
<html>
<body>
<div id="diva">
<a href="/123" >text2</a>
</div>
<div id="divb">
<a href="/345" >text1</a>
<a href="/678" >text2</a>
</div>
</body>
</html>
I want to extract the three hrefs 123,345,and 678.
I know how to get all the hyperlinks using $gm = $xpath->query("//a") and then loop through them to get the href attribute.
Is there some sort of regexp to get only the attributes with the above format (i.e. "/digits")?
Thanks
XPath 1.0, which is the version supported by DOMXPath, has no regex functionality. However, you can easily register a PHP function that runs a regular expression and call it from DOMXPath if you need one, as mentioned in this other answer.
There is an XPath 1.0 way to test whether a value is a number, which you can apply to the part of the href attribute value after the / character to check that it follows the pattern /digits:
//a[number(substring-after(@href,'/')) = substring-after(@href,'/')]
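
A small sketch of that pure XPath 1.0 test run through DOMXPath. The extra /about link is an assumption added to the sample so that there is something to filter out.

```php
<?php
// Sample markup from the question, plus one non-numeric link for contrast.
$html = <<<'HTML'
<html>
<body>
<div id="diva">
<a href="/123">text2</a>
<a href="/about">not wanted</a>
</div>
<div id="divb">
<a href="/345">text1</a>
<a href="/678">text2</a>
</div>
</body>
</html>
HTML;

$doc = new DOMDocument();
$doc->loadXML($html); // the sample is well-formed, so loadXML() is enough here
$xpath = new DOMXPath($doc);

// number() yields NaN for non-numeric text, and NaN never compares equal.
$query = "//a[number(substring-after(@href,'/')) = substring-after(@href,'/')]";

$hrefs = [];
foreach ($xpath->query($query) as $a) {
    $hrefs[] = $a->getAttribute('href');
}
print_r($hrefs);
```

Note that this test accepts anything number() can parse, so a value such as /12.5 would also pass. If only plain digits are acceptable, a regex is stricter.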
UPDATE :
For the sake of completeness, here is a working example of calling PHP function preg_match from DOMXPath::query() to accomplish the same task :
$raw_data = <<<XML
<html>
<body>
<div id="diva">
<a href="/123" >text2</a>
</div>
<div id="divb">
<a href="/345" >text1</a>
<a href="/678" >text2</a>
</div>
</body>
</html>
XML;
$doc = new DOMDocument;
$doc->loadXML($raw_data);
$xpath = new DOMXPath($doc);
$xpath->registerNamespace("php", "http://php.net/xpath");
$xpath->registerPHPFunctions("preg_match");
// php:function's parameters below are :
// parameter 1: PHP function name
// parameter 2: PHP function's 1st parameter, the pattern
// parameter 3: PHP function's 2nd parameter, the string
$gm = $xpath->query("//a[php:function('preg_match', '~^/\d+$~', string(@href))]");
foreach ($gm as $a) {
echo $a->getAttribute("href") . "\n";
}

php: how can I work with html as xml ? how do i find specific nodes and get the text inside these nodes?

Let's say I have the following web page:
<html>
<body>
<div class="transform">
<span>1</span>
</div>
<div class="transform">
<span>2</span>
</div>
<div class="transform">
<span>3</span>
</div>
</body>
</html>
I would like to find all div elements that have the class transform and fetch the text in each div element.
I know I can do that easily with regular expressions, but I would like to know how to do it without regular expressions, by parsing the XML and finding the nodes I need.
update
i know that in this example i can just iterate through all the divs. but this is an example just to illustrate what i need.
in this example i need to query for divs that contain the attribute class=transform
thanks!
Could use SimpleXML - see the example below:
$string = "<?xml version='1.0'?>
<html>
<body>
<div class='transform'>
<span>1</span>
</div>
<div>
<span>2</span>
</div>
<div class='transform'>
<span>3</span>
</div>
</body>
</html>";
$xml = simplexml_load_string($string);
$result = $xml->xpath("//div[@class = 'transform']");
foreach($result as $node) {
echo "span " . $node->span . "<br />";
}
Updated it with xpath...
You can use XPath to address the items. For that particular query, you'd use:
//div[contains(concat(" ",@class," "), concat(" ","transform"," "))]
Full PHP example:
<?php
// $html is assumed to hold the page markup as a string
$document = new DomDocument();
$document->loadHtml($html);
$xpath = new DomXPath($document);
foreach ($xpath->query('//div[contains(concat(" ",@class," "), concat(" ","transform"," "))]') as $div) {
var_dump($div);
}
If you know CSS, here's a handy CSS-selector to XPath-expression mapping: http://plasmasturm.org/log/444/ -- You can find the above example listed there, as well as other common queries.
If you use it a lot, you might find my csslib library handy. It offers a wrapper csslib_DomCssQuery, which is similar to DomXPath, but using CSS-selectors instead.
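
A self-contained sketch of the class test above that collects the text of each matching div. The sample markup is a variation of the question's page, with an extra class and a non-matching div added as assumptions to show that the match is on whole class names.

```php
<?php
$html = <<<'HTML'
<html>
<body>
<div class="transform"><span>1</span></div>
<div class="other"><span>skip</span></div>
<div class="box transform"><span>3</span></div>
</body>
</html>
HTML;

$document = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$document->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($document);

// Pad the class attribute with spaces so " transform " matches a whole class name only.
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " transform ")]';

$texts = [];
foreach ($xpath->query($query) as $div) {
    $texts[] = trim($div->textContent);
}
print_r($texts);
```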
ok what i wanted can be easily achieved using php xpath:
example:
http://ditio.net/2008/12/01/php-xpath-tutorial-advanced-xml-part-1/

Regular expression for DIV elements

Say I had this piece of HTML for example:
<div id="gallery2" class="galleryElement">
<h2>My Photos</h2>
<div class = "imageElement">
<h3>#Embassy - VIP </h3>
<p><b>Image URL:</b>
http://photos-p.friendster.com/photos/78/86/77426887/1_119466535.jpg</p>
<img src = "http://photos-p.friendster.com/photos/78/86/77426887/1_119466535.jpg" class = "full"/>
<img src = "http://photos-p.friendster.com/photos/78/86/77426887/1_887303260m.jpg" class = "thumbnail"/>
</div>
<div class = "imageElement">
<h3>#Embassy - VIP </h3>
<p><b>Image URL:</b>
http://photos-p.friendster.com/photos/78/86/77426887/1_119466535.jpg</p>
<img src = "http://photos-p.friendster.com/photos/78/86/774534426887/1_119466535.jpg" class = "full"/>
<img src = "http://photos-p.friendster.com/photos/78/86/774534426887/1_887303260m.jpg" class = "thumbnail"/>
</div>
</div>
I need to build the proper regular expression to parse each div class'ed as imageElement and store the contents (as text) in an array starting from the opening <div class = "imageElement"> till its ending div pair </div>. Also, there really are spaces on class = "imageElement". So far I have the expression:
\<div class = "imageElement">[\s\S\d\D]*</div>
but it only gets the whole set of elements. Thanks in advance.
This is a pretty common question here ("How do I parse this XML/HTML with a regular expression?") and I'll give you the same answer: don't.
Regular expressions are notoriously bad at this kind of thing. HTML/XML is not "regular" in the regex sense.
PHP comes with at least 3 XML parsers (SimpleXML, DOMDocument and XMLReader spring to mind) that will do this reliably. Use one of those.
Take a look at Parse HTML With PHP And DOM as an example.
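
As an illustration of the parser route, here is a minimal sketch using DOMDocument and DOMXPath. The markup is a shortened version of the question's HTML, and the file names a.jpg and b.jpg are placeholders. It collects each imageElement div, from its opening tag to its closing tag, as a string.

```php
<?php
$html = <<<'HTML'
<div id="gallery2" class="galleryElement">
<h2>My Photos</h2>
<div class = "imageElement">
<h3>First</h3>
<img src = "a.jpg" class = "full"/>
</div>
<div class = "imageElement">
<h3>Second</h3>
<img src = "b.jpg" class = "full"/>
</div>
</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

$chunks = [];
foreach ($xpath->query('//div[@class="imageElement"]') as $div) {
    // saveHTML() with a node argument serializes just that element and its children.
    $chunks[] = $doc->saveHTML($div);
}
print_r($chunks);
```

The parser normalizes the markup, so the spaces around = in the attributes do not matter, and nested divs inside an imageElement would not break the extraction.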
It sounds like the trouble you're having is that the * is greedy, i.e. it matches as much as possible, whereas you want it to match as little as possible.
If the data inside your divs does not contain "</div>" then you can keep the parsing pretty simple. If it can contain arbitrary HTML data (specifically nested divs), you'll need to parse it more.
If it stays basic, you could do the whole thing without regex. It's a little hackish, but as long as your data stays simple and as expected, it should work really fast:
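// explode() takes the delimiter first, then the string to split.
$chunks = explode('<div class = "imageElement">', $body);
array_shift($chunks); // drop everything before the first imageElement div
$matches = array();
foreach ($chunks as $chunk) {
    // strpos() takes the haystack first, then the needle.
    $pos = strpos($chunk, '</div>');
    if ($pos !== false) {
        $matches[] = substr($chunk, 0, $pos);
    }
}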
If you need something more flexible, use a real html parser.
