Get hrefs that match regex expression using PHP & XPath - php

I have a page that contains several hyperlinks. The ones I want to get are of the format:
<html>
<body>
<div id="diva">
<a href="/123" >text2</a>
</div>
<div id="divb">
<a href="/345" >text1</a>
<a href="/678" >text2</a>
</div>
</body>
</html>
I want to extract the three hrefs: 123, 345, and 678.
I know how to get all the hyperlinks using $gm = $xpath->query("//a") and then loop through them to get the href attribute.
Is there some sort of regexp to get only the attributes with the above format (i.e. "/digits")?
Thanks

XPath 1.0, which is the version supported by DOMXPath, has no regex functionality. However, you can easily register your own PHP function to run a regular expression and call it from DOMXPath if you need one, as mentioned in this other answer.
There is an XPath 1.0 way to test whether a value is a number. You can apply it to the part of the href value after the / character to test whether the attribute follows the pattern /digits:
//a[number(substring-after(@href,'/')) = substring-after(@href,'/')]
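The numeric test can be tried out with a small script. This is a minimal sketch: the markup is the question's sample plus two extra links (/about and an absolute URL) that are assumptions added here to show what gets rejected.

```php
<?php
// Sample markup from the question, plus two extra links (assumed for
// illustration) that should NOT match the "/digits" pattern.
$html = <<<HTML
<html><body>
<div id="diva"><a href="/123">text2</a></div>
<div id="divb">
<a href="/345">text1</a>
<a href="/678">text2</a>
<a href="/about">no digits</a>
<a href="http://example.com/">absolute</a>
</div>
</body></html>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// number('about') is NaN, and NaN never compares equal to anything,
// so only links whose text after the first "/" is numeric survive.
$query = "//a[number(substring-after(@href, '/')) = substring-after(@href, '/')]";

$hrefs = [];
foreach ($xpath->query($query) as $a) {
    $hrefs[] = $a->getAttribute('href');
}
print_r($hrefs);
```

Note that this is a numeric test rather than a strict digits test: values such as /1.5 or /-3 would also pass, and substring-after only looks at the first slash. If that matters, the preg_match approach is the stricter option.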
UPDATE :
For the sake of completeness, here is a working example of calling PHP function preg_match from DOMXPath::query() to accomplish the same task :
$raw_data = <<<XML
<html>
<body>
<div id="diva">
<a href="/123" >text2</a>
</div>
<div id="divb">
<a href="/345" >text1</a>
<a href="/678" >text2</a>
</div>
</body>
</html>
XML;
$doc = new DOMDocument;
$doc->loadXML($raw_data);
$xpath = new DOMXPath($doc);
$xpath->registerNamespace("php", "http://php.net/xpath");
$xpath->registerPHPFunctions("preg_match");
// php:function's parameters below are :
// parameter 1: PHP function name
// parameter 2: PHP function's 1st parameter, the pattern
// parameter 3: PHP function's 2nd parameter, the string
$gm = $xpath->query("//a[php:function('preg_match', '~^/\d+$~', string(@href))]");
foreach ($gm as $a) {
    echo $a->getAttribute("href") . "\n";
}

Related

How to use symfony dom parser

I am trying to use the Symfony Crawler.
I have checked this article.
What I want to do is get 3,355.00 (the second value).
For now I am trying a statement like this, but it is wrong:
$crawler = $crawler->filter('body > div[@class="cell_label"]');
How can I do it?
<body>
<div class="cell__label"> Value1 </div> <div class="cell__value cell__value_"> 2,355.00 </div>
<div class="cell__label"> Value2 </div> <div class="cell__value cell__value_"> 3,355.00 </div>
<div class="cell__label"> Value3 </div> <div class="cell__value cell__value_"> 4,355.00 </div>
</body>
$crawler = new Crawler($url);
$crawler = $crawler->filter('body > div[@class="cell_label"]'); // does not work
foreach ($crawler as $domElement) {
    var_dump($domElement);
}
I can see several issues here:
Using $crawler->filter() means you must pass a CSS selector as the parameter, not an XPath expression, so use 'body > div.cell__label', or 'body div[class^="cell__"]' if you need to select every div whose class starts with cell__. By the way, you have a typo in cell_label (one underscore instead of two).
The Crawler constructor accepts a DOMNodeList, DOMNode, array or string, not a URL to a remote resource (though I assume $url may just be an arbitrary variable name you used there). Technically a URL is a string as well, but it is not a string of markup.
If you want to use an XPath expression, use $crawler->filterXPath(), like this:
$nodes = $crawler->filterXPath('//div[contains(@class, "cell__label")]');
Here is the documentation on XPath: https://www.w3.org/TR/xpath/
The Crawler's filter() can handle jQuery-like selectors, so you can write:
$crawler = $crawler->filter('.cell__value');
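To pick out just the second value, an indexed XPath expression is one option. The sketch below uses PHP's built-in DOMDocument and DOMXPath instead of the Crawler so that it runs without Composer; that substitution is a choice made for this example. The markup is copied from the question.

```php
<?php
// Markup from the question; in practice this would be the fetched page body.
$html = <<<HTML
<body>
<div class="cell__label"> Value1 </div> <div class="cell__value cell__value_"> 2,355.00 </div>
<div class="cell__label"> Value2 </div> <div class="cell__value cell__value_"> 3,355.00 </div>
<div class="cell__label"> Value3 </div> <div class="cell__value cell__value_"> 4,355.00 </div>
</body>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Parenthesise before indexing so that [2] means "second match in the
// whole document" rather than "second div child of its parent".
$nodes = $xpath->query('(//div[contains(@class, "cell__value")])[2]');
$second = $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;

echo $second, "\n";
```

With the Crawler itself, something along the lines of $crawler->filter('.cell__value')->eq(1)->text() should give the same value, assuming the Crawler was constructed from the page's HTML and the CssSelector component is installed.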

Search and replace a string of HTML using the PHP DOM Parser

How can I search and replace a specific string (text + html tags) in a web page using the native PHP DOM Parser?
For example, search for
<p> Check this site </p>
This string is somewhere inside an HTML tree.
I would like to find it and replace it with another string. For example,
<span class="highligher"><p> Check this site </p></span>
Bear in mind that there is no ID to the <p> or <a> nodes. There can be many of those identical nodes, holding different pieces of text.
I tried str_replace, however it fails with complex html markup, so I have turned to HTML Parsers now.
EDIT:
The string to be found and replaced might contain a variety of HTML tags, such as divs, headings, bold text, etc. So I am looking for a solution that can construct a regex or DOM XPath query depending on the contents of the string being searched.
Thanks!
Is this what you wanted?
<?php
// load the document
$doc = new DOMDocument();
$doc->loadHTMLFile("filename.html");
// copy the live node list into an array, since the loop below restructures the tree
$p_elements = iterator_to_array($doc->getElementsByTagName('p'));
foreach ($p_elements as $p_element) {
    // check the direct children of this "p" element for "a" nodes
    foreach ($p_element->childNodes as $node) {
        // found an "a" node - this check should be made more specific!
        if (strtolower($node->nodeName) === 'a') {
            // create the new span element
            $span_element = $doc->createElement('span');
            $span_element->setAttribute('class', 'highlighter');
            // put the span where the "p" element was
            $p_element->parentNode->replaceChild($span_element, $p_element);
            // move the "p" element into the span
            $span_element->appendChild($p_element);
            // wrap each paragraph only once, even if it holds several links
            break;
        }
    }
}
// output
echo '<pre>';
echo htmlentities($doc->saveHTML());
echo '</pre>';
This HTML is the basis for conversion:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><title>Your Title Here</title></head><body bgcolor="FFFFFF">
<hr>Link Name
is a link to another nifty site
<h1>This is a Header</h1>
<h2>This is a Medium Header</h2>
<p> Check this site </p>
Send me mail at <a href="mailto:support@yourcompany.com">
support@yourcompany.com</a>.
<p> This is a new paragraph!
</p><hr><p> Check this site </p>
</body></html>
The output looks like that, it wraps the elements you mentioned:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><title>Your Title Here</title></head><body bgcolor="FFFFFF">
<hr>Link Name
is a link to another nifty site
<h1>This is a Header</h1>
<h2>This is a Medium Header</h2>
<span class="highlighter"><p> Check this site </p></span>
Send me mail at <a href="mailto:support@yourcompany.com">
support@yourcompany.com</a>.
<p> This is a new paragraph!
</p><hr><span class="highlighter"><p> Check this site </p></span>
</body></html>
You could use a regular expression with preg_replace.
preg_replace("/<\s*p[^>]*>(.*?)<\s*\/\s*p>/", '<span class="highligher"><p>$1</p></span>', '<p> Check this site</p>');
The fourth parameter of preg_replace ($limit) can be used to restrict the number of replacements.
http://php.net/manual/en/function.preg-replace.php
http://www.pagecolumn.com/tool/all_about_html_tags.htm - for more examples on regular expressions for HTML
You will need to edit the regular expression to only capture the p tags with the google href
EDIT
preg_replace("/<\s*\w.*?><a href\s*=\s*\"?\s*(.*)(google.com)\s*\">(.*?)<\/a>\s*<\/\s*\w.*?>/", '<span class="highligher"><p>$3</p></span>', $string);
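As a small illustration of limiting replacements: preg_replace takes the limit as its fourth argument and can report the number of replacements through a fifth, by-reference argument. The input string below is a hypothetical two-paragraph snippet, and the pattern is the simple <p> pattern from this answer, so the usual caveats about regexes on HTML apply.

```php
<?php
// Hypothetical input with two identical paragraphs.
$html = '<p> Check this site </p><hr><p> Check this site </p>';

$pattern = '/<\s*p[^>]*>(.*?)<\s*\/\s*p>/';
$replacement = '<span class="highlighter"><p>$1</p></span>';

// Fourth argument: maximum number of replacements (-1, the default, means no limit).
// Fifth argument (by reference): receives how many replacements were actually made.
$result = preg_replace($pattern, $replacement, $html, 1, $count);

echo $result, "\n";
echo $count, "\n";
```

Only the first paragraph is wrapped here; dropping the limit argument would wrap both.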

PHP or Javascript: Simply Remove and Replace HTML Code

I have this code on my page, but the link has different names and ids:
<div class="myclass">
<a href="http://www.example.com/?vstid=00575000&veranstaltung=http://www.example.com/page.html">
Example Text</a>
</div>
How can I remove it and replace it with this:
<div class="myclass">Sorry no link</div>
With PHP or Javascript? I tried it with str.replace
Thank you!
I assume you mean dynamically? You won't be able to do this with PHP, because it runs on the server and has nothing to do with the HTML once it has been sent to the browser.
See http://www.tizag.com/javascriptT/javascript-innerHTML.php for the JavaScript approach.
Or you could use jQuery, which is nicer than trying to write cross-browser-compatible JavaScript yourself.
$('.myclass').html('Sorry...');
If the page is still on the server when you need to make the replacement, do this:
<?php if (allowed_to_see_link()) { ?>
<div class="myclass">
<a href="http://www.example.com/?vstid=00575000&veranstaltung=http://www.example.com/page.html">
Example Text</a>
</div>
<?php } else { ?>
non-link-text
<?php } ?>
and also write the named function (allowed_to_see_link() here).
You might want to clarify what you are trying to do. If that is your file, you can simply open it in an editor and remove those portions. If you want to modify HTML with PHP, you can use the native DOM extension:
$dom = new DOMDocument;
$dom->loadHTML($htmlString);
$xPath = new DOMXPath($dom);
foreach ($xPath->query('//div[@class="myclass"]/a') as $link) {
    $link->parentNode->replaceChild(new DOMText('Sorry no link'), $link);
}
echo $dom->saveHTML();
The above code would replace any direct <a> element children of any <div> elements that have a class attribute of myclass with the Textnode "Sorry no link".

php: how can I work with html as xml ? how do i find specific nodes and get the text inside these nodes?

Let's say I have the following web page:
<html>
<body>
<div class="transform">
<span>1</span>
</div>
<div class="transform">
<span>2</span>
</div>
<div class="transform">
<span>3</span>
</div>
</body>
</html>
I would like to find all div elements that have the class transform and fetch the text in each of them.
I know I can do that easily with regular expressions, but I would like to know how to do it without regular expressions, by parsing the XML and finding the nodes I need.
Update
I know that in this example I could just iterate through all the divs, but this is only an example to illustrate what I need.
In this example I need to query for divs that have the attribute class=transform.
Thanks!
Could use SimpleXML - see the example below:
$string = "<?xml version='1.0'?>
<html>
<body>
<div class='transform'>
<span>1</span>
</div>
<div>
<span>2</span>
</div>
<div class='transform'>
<span>3</span>
</div>
</body>
</html>";
$xml = simplexml_load_string($string);
$result = $xml->xpath("//div[@class = 'transform']");
foreach ($result as $node) {
    echo "span " . $node->span . "<br />";
}
Updated it with xpath...
You can use XPath to address the items. For that particular query, you'd use:
//div[contains(concat(" ",@class," "), concat(" ","transform"," "))]
Full PHP example:
<?php
$document = new DomDocument();
$document->loadHtml($html); // $html holds the page markup
$xpath = new DomXPath($document);
foreach ($xpath->query('//div[contains(concat(" ",@class," "), concat(" ","transform"," "))]') as $div) {
    var_dump($div);
}
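The reason for the concat() padding becomes clearer when three ways of testing the class attribute are compared side by side. The markup below is hypothetical and chosen to include a multi-class element and a look-alike class name. The $texts helper is likewise just a convenience for this sketch.

```php
<?php
// Hypothetical markup: one exact match, one multi-class match, one look-alike.
$html = <<<HTML
<html><body>
<div class="transform"><span>1</span></div>
<div class="box transform"><span>2</span></div>
<div class="transformed"><span>3</span></div>
</body></html>
HTML;

$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);

// Helper for this sketch: trimmed text of every node matched by an expression.
$texts = function (string $expr) use ($xpath): array {
    $out = [];
    foreach ($xpath->query($expr) as $node) {
        $out[] = trim($node->textContent);
    }
    return $out;
};

// Exact comparison misses elements that carry additional classes.
$exact = $texts('//div[@class = "transform"]');
// Plain substring matching also catches "transformed".
$substr = $texts('//div[contains(@class, "transform")]');
// Padding both sides with spaces matches whole class tokens only.
$token = $texts('//div[contains(concat(" ", @class, " "), " transform ")]');

print_r(compact('exact', 'substr', 'token'));
```

The padded form behaves like the CSS selector div.transform as long as class names are separated by plain spaces.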
If you know CSS, here's a handy CSS-selector to XPath-expression mapping: http://plasmasturm.org/log/444/ -- You can find the above example listed there, as well as other common queries.
If you use it a lot, you might find my csslib library handy. It offers a wrapper csslib_DomCssQuery, which is similar to DomXPath, but using CSS-selectors instead.
OK, what I wanted can easily be achieved using PHP XPath.
Example:
http://ditio.net/2008/12/01/php-xpath-tutorial-advanced-xml-part-1/

Grep... What patterns to extract href attributes, etc. with PHP's preg_grep?

I'm having trouble with grep. Which four patterns should I use with PHP's preg_grep to extract all instances of the "__________" stuff in the strings below?
1. <h2><a ....>_____</a></h2>
2. <cite><a href="_____" .... >...</a></cite>
3. <cite><a .... >________</a></cite>
4. <span>_________</span>
The dots denote some arbitrary characters while the underscores denote what I want.
An example string is:
</style></head>
<body><div id="adBlock"><h2>Ads by Google</h2>
<div class="ad"><div>Spider-<b>Man</b> Animated Serie</div>
<span>See Your Favorite Spiderman
<br>
Episodes for Free. Only on Crackle.</span>
<cite>www.Crackle.com/Spiderman</cite></div> <div class="ad"><div>Kids <b>Batman</b> Costumes</div>
<span>Great Selection of <b>Batman</b> & Batgirl
<br>
Costumes For Kids. Ships Same Day!</span>
<cite>www.CostumeExpress.com</cite></div> <div class="ad"><div><b>Batman</b> Costume</div>
<span>Official <b>Batman</b> Costumes.
<br>
Huge Selection & Same Day Shipping!</span>
<cite>www.OfficialBatmanCostumes.com</cite></div> <div class="ad"><div>Discount <b>Batman</b> Costumes</div>
<span>Discount adult and kids <b>batman</b>
<br>
superhero costumes.</span>
<cite>www.discountsuperherocostumes.com</cite></div></div></body>
<script type="text/javascript">
var relay = "";
</script>
<script type="text/javascript" src="/uds/?file=ads&v=1&packages=searchiframe&nodependencyload=true"></script></html>
Thanks!
First of all, you should not use regex to extract data from an HTML string.
Instead, you should use a DOM parser!
Here, you could use:
DOMDocument::loadHTML to load the HTML string,
possibly with the @ operator to silence warnings, as your HTML is not quite valid.
The DOMXPath class to run XPath queries on the document.
DOM methods to work on the results of the query.
See the classes in the Document Object Model section of the manual, and their methods.
For example, you could load your document and instantiate the DOMXPath class this way:
$html = <<<HTML
....
....
HTML;
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
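As an alternative to suppressing warnings with the @ operator, libxml can be told to collect parse problems internally, which keeps them available for inspection. This is a sketch under assumptions: the markup is a hypothetical, deliberately sloppy fragment, and how many messages libxml reports for it may vary between versions.

```php
<?php
// Hypothetical, deliberately sloppy markup (unclosed <b>, bare ampersand).
$html = '<div class="ad"><span>Great Selection of <b>Batman & Batgirl</span>'
      . '<cite>www.example.com</cite></div>';

// Collect parser complaints internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Array of LibXMLError objects that can be inspected or logged.
$errors = libxml_get_errors();
libxml_clear_errors();
// Restore whatever setting was active before.
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);
$cite = $xpath->query('//cite')->item(0)->nodeValue;

echo count($errors), " parser message(s)\n";
echo $cite, "\n";
```

Unlike @, this approach does not hide unrelated errors raised on the same line.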
And, then, use XPath to find the elements you are looking for.
For example, in the first case, you could use something like this, to find all <a> tags that are children of <h2> tags :
// <h2><a ....>_____</a></h2>
$tags = $xpath->query('//h2/a');
foreach ($tags as $tag) {
    var_dump($tag->nodeValue);
}
echo '<hr />';
Then, for the second and third cases, you are searching for <a> tags that are children of <cite> tags, and once you have found them, you check whether they have an href attribute or not:
// <cite><a href="_____" .... >...</a></cite>
// <cite><a .... >________</a></cite>
$tags = $xpath->query('//cite/a');
foreach ($tags as $tag) {
    if ($tag->hasAttribute('href')) {
        var_dump($tag->getAttribute('href'));
    } else {
        var_dump($tag->nodeValue);
    }
}
echo '<hr />';
And, finally, for the last one, you just want <span> tags :
// <span>_________</span>
$tags = $xpath->query('//span');
foreach ($tags as $tag) {
    var_dump($tag->nodeValue);
}
Not that hard, and much easier to read than regexes, isn't it? ;-)
