xpath cannot get attribute value - php

Well, the HTML code is something like this:
<a href="javascript:my_win('.......'">
<img src="...." border=0>
<font color="red">title
</a>
</font>
and I want to identify the color only for those a's whose href contains the word:
javascript:my_win
This is my query:
$xpath->query('//a[contains(@href,"javascript:my_win")]/font');
but I get nothing.
If I change my query to this, I get all the hrefs as expected, so there is no chance of a misspelling.
$elements = $xpath->query('//a');
If I change my query to this, every color is printed out.
$elements = $xpath->query('//a/font');
Whole code is here:
$elements = $xpath->query('//a[contains(@href,"javascript:my_win")]/font');
foreach ( $elements as $element ) {
    $str1=$element->getAttribute('color');
}

Use the @ character to refer to attributes in XPath:
//a[contains(@href, 'some required text')]
When you write just href, the XPath processor looks for a child element named <href> whose text content includes the specified string.
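To see the difference in practice, here is a minimal sketch that runs both forms of the predicate against markup modelled on the question. The href values and the second link are made-up placeholders, and libxml_use_internal_errors() is only there to silence warnings about legacy tags.

```php
<?php
// Sample markup modelled on the question; both href values are placeholders.
$html = '<a href="javascript:my_win(1)"><font color="red">title</font></a>'
      . '<a href="/plain"><font color="blue">other</font></a>';

libxml_use_internal_errors(true); // silence parser warnings on legacy markup
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// @href tests the attribute; a bare href tests a child <href> element instead.
$withAt    = $xpath->query('//a[contains(@href, "javascript:my_win")]/font');
$withoutAt = $xpath->query('//a[contains(href, "javascript:my_win")]/font');

$colors = [];
foreach ($withAt as $font) {
    $colors[] = $font->getAttribute('color');
}

echo implode(',', $colors), "\n"; // colors of the fonts found via @href
echo $withoutAt->length, "\n";    // number of fonts found via bare href
```

With the attribute form only the first link matches; the bare form matches nothing because no <a> has a child element called href.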

Related

Regex: How to only recognise string if it is within an ID or class attribute?

Let's use 3 string examples:
Example 1:
<div id="something">I have a really nice signature, it goes like this</div>
Example 2:
<div>I like balloons</div><div id="signature-xyz">Sent from my iPhone</div>
Example 3:
<div>I like balloons</div><div class="my_signature-xyz">Get iOS</div>
I'd like to remove the entire contents of the "signature" div in examples 2 and 3. Example 1 should not be affected. I don't know ahead of time as to what the div's exact class or ID will be, but I do know it will contain the string 'signature'.
I'm using the code below, which gets me half way there.
$pm = "/signature/i";
if (preg_match($pm, $message, $matches) == 1) {
$message = preg_split($pm, $message, 2)[0];
}
What should I do to achieve the above? Thanks
You can build your code on the following sample:
$dom = new DOMDocument();
$dom->loadHTML($inputHTML);
$xpathsearch = new DOMXPath($dom);
$nodes = $xpathsearch->query("//div[not(@*[contains(., 'signature')])]");
foreach($nodes as $node) {
    //do your stuff
}
Where the XPath:
//div[not(@*[contains(., 'signature')])]
selects all div nodes that have no attribute containing the string signature. The inner predicate is applied to each attribute in turn; contains(@*, 'signature') would test only the first attribute, because XPath 1.0 converts a node-set to the string value of its first node.
Regex should never be used to parse HTML/XML/JSON, where the structure can theoretically be nested to unlimited depth. Ref: Regular Expression Vs. String Parsing
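Since the goal in the question is to remove the signature div rather than to select the other divs, the opposite selection can be sketched as follows. This is only one possible approach under a few assumptions: it checks just the id and class attributes, it is case-sensitive (unlike the /i regex in the question), and it re-serialises the children of the <body> element that loadHTML() wraps around a fragment.

```php
<?php
// Example 2 from the question, used as sample input.
$message = '<div>I like balloons</div><div id="signature-xyz">Sent from my iPhone</div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($message);
$xpath = new DOMXPath($dom);

// Select divs whose id or class contains "signature" (case-sensitive).
$nodes = $xpath->query('//div[contains(@id, "signature") or contains(@class, "signature")]');

// Copy to an array first so that removing nodes cannot disturb the iteration.
foreach (iterator_to_array($nodes) as $node) {
    $node->parentNode->removeChild($node);
}

// Re-serialise only the children of <body>.
$body  = $dom->getElementsByTagName('body')->item(0);
$clean = '';
foreach ($body->childNodes as $child) {
    $clean .= $dom->saveHTML($child);
}

echo $clean, "\n";
```

Example 1 would pass through untouched, because the word signature appears there only in the text and not in an attribute.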

How to save xpath query data to saveHTML with HTML tags?

I'm trying to understand how I can save the HTML string found by the query so that I can access its elements.
I'm using the following query to find the below ul list.
$data = $xpath->query('//h2[contains(.,"Hurricane Data")]/following-sibling::ul/li');
<h2>Hurricane Data</h2>
<ul>
<li><strong>12 items</strong> found, see <a href="...">here</a> for more information</li>
<li><strong>19 items</strong> found, see <a href="...">here</a> for more information</li>
<li><strong>13 items</strong> found, see <a href="...">here</a> for more information</li>
</ul>
If I print_r($data), I get the following DOMNodeList Object ( [length] => 3 ) which refers to the 3 elements found.
If I foreach() into the $data I get a DOMElement Object with all 3 li data.
What I'm trying to accomplish is to put each li data into an accessible array, but I want to parse the html strong & a tags inside too.
Now, I've already done everything I want to do, except that the strong and a tags aren't being inserted into the arrays. Here is what I've come up with.
$string = [];
$query = $xpath->query('//h2[contains(.,"Hurricane Data")]/following-sibling::ul/li');
foreach($query as $values){
$try = new \DOMDocument;
$try->loadHTML(mb_convert_encoding($values->textContent, 'HTML-ENTITIES', 'UTF-8'));
$string[] = $try->saveHTML();
}
echo $string[0];
// outputs = 12 items found, see here for more information
// no strong tags, no hyperlinks
You don't need to reprocess the data; you can just ask the document to save that particular node...
foreach($query as $values){
$string[] = $doc->saveHTML($values);
}
Where $doc is the document used as the basis for your XPath query.
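Put together, a self-contained sketch of this could look like the following. The sample markup is shortened from the question (the links are left out), and $doc and $xpath are set up here only so the snippet runs on its own.

```php
<?php
// Shortened sample markup based on the question.
$html = '<h2>Hurricane Data</h2><ul>'
      . '<li><strong>12 items</strong> found</li>'
      . '<li><strong>19 items</strong> found</li>'
      . '</ul>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$query = $xpath->query('//h2[contains(., "Hurricane Data")]/following-sibling::ul/li');

$string = [];
foreach ($query as $li) {
    // Passing a node to saveHTML() serialises that node with its markup intact.
    $string[] = $doc->saveHTML($li);
}

echo $string[0], "\n";
```

Each array entry now holds the whole li element including its strong tag, whereas textContent returns only the text.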

How do I replace part of this string with a .* type of regex in php?

I am using explode to manipulate information I am scraping from a website. I am trying to eliminate something specific from the string so that it will return what I want and also add the rest of the items to the array.
$pageArray = explode('<td class="player-label"><a href="/nfl/players/antonio-brown.php?type=overall&week=draft">', $fantasyPros);
I would like to skip the antonio-brown section and use a regular expression or whatever is best to replace it so that it will not look for a specific name but every name on the list and add them to my array. Do you have any suggestions on what I should use here? I appreciate any assistance.
Seems like a parser job to me with appropriate xpath functions, e.g. not().
Consider the following code:
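<?php
// The href values below are illustrative placeholders.
$data = <<<DATA
<td class="player-label">
<a href="/nfl/players/antonio-brown.php">Some brown link here</a>
<a href="/nfl/players/some-other-player.php">Some green link here</a>
</td>
DATA;
$dom = new DOMDocument();
$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
$green_links = $xpath->query("//a[not(contains(@href, 'antonio-brown'))]");
foreach ($green_links as $link) {
    // do something useful here, e.g. read $link->getAttribute('href')
}
?>
This selects every link whose href does not contain antonio-brown.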
You can easily adjust this to td or any other element.
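If the aim is to collect every player on the list rather than to skip one, a query on the td class avoids naming anyone. The following is a rough sketch under that assumption; the table markup and the second player are invented for illustration, and only the first href follows the question.

```php
<?php
// Hypothetical markup shaped like the question's cells; the second player is invented.
$fantasyPros = '<table><tr>'
    . '<td class="player-label"><a href="/nfl/players/antonio-brown.php?type=overall&amp;week=draft">Antonio Brown</a></td>'
    . '<td class="player-label"><a href="/nfl/players/example-player.php?type=overall&amp;week=draft">Example Player</a></td>'
    . '</tr></table>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($fantasyPros);
$xpath = new DOMXPath($dom);

$players = [];
// Every link sitting directly inside a td with class="player-label".
foreach ($xpath->query('//td[@class="player-label"]/a') as $link) {
    $players[] = [
        'name' => trim($link->textContent),
        'href' => $link->getAttribute('href'),
    ];
}

print_r($players);
```

This replaces the explode() call entirely, so no player-specific string is needed.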

Matching wildcard without adding to the array with preg_match_all

I'm trying to capture the table text from an element that looks like this:
<span id="ctl00_MainContent_ListView2_ctrl2_ctl01_Label17" class="vehicledetailTable" style="display:inline-block;width:475px;">OWNED</span><br />
My preg_match_all looks like:
preg_match_all('~475px;">(.*?)</span><br />~', $ret, $vehicle);
The problem is there are other tables on the page that also match but have data not relevant to my query. The data that I want are all in "ListView2", but the "ctl01_Label17" part varies: Label18, Label19, Label20, etc.
Since I'm not interested in capturing the label, is there a method to match the subject string without capturing the match? Something along the lines of:
<span id="ctl00_MainContent_ListView2_ctrl2_ctl01_[**WILDCARD HERE**]" class="vehicledetailTable" style="display:inline-block;width:475px;">OWNED</span><br />
Any help would be greatly appreciated.
Here is a regex for the approach you are currently considering, although it is a very poor solution:
<span\b[^<>]*\bid="ctl00_MainContent_ListView2_ctrl2_ctl01_[^"]*"[^<>]*475px;">(.*?)</span><br\s*/>
It makes sure we found a <span> tag that has an id attribute starting with ctl00_MainContent_ListView2_ctrl2_ctl01_, followed by some attribute (you know it is style) ending with 475px;, and then it captures everything up to the closing </span> tag.
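For reference, this is roughly how that pattern plugs into preg_match_all. The ~ delimiter is taken from the question's own call, and $ret here holds just the sample line from the question rather than a full page.

```php
<?php
// The sample line from the question stands in for the full page source.
$ret = '<span id="ctl00_MainContent_ListView2_ctrl2_ctl01_Label17" class="vehicledetailTable" '
     . 'style="display:inline-block;width:475px;">OWNED</span><br />';

// [^"]* consumes the varying LabelNN suffix without capturing it;
// the only capture group is the span's text.
$pattern = '~<span\b[^<>]*\bid="ctl00_MainContent_ListView2_ctrl2_ctl01_[^"]*"[^<>]*475px;">(.*?)</span><br\s*/>~';

preg_match_all($pattern, $ret, $vehicle);

print_r($vehicle[1]); // the captured span contents
```

Only $vehicle[1] is of interest; $vehicle[0] holds the full matched tags.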
You can get this with DOM and XPath, which is a much safer solution that uses the same logic as above:
$html = "<span id=\"ctl00_MainContent_ListView2_ctrl2_ctl01_Label17\" class=\"vehicledetailTable\" style=\"display:inline-block;width:475px;\">OWNED</span><br />";
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$spans = $xpath->query("//span[starts-with(@id,'ctl00_MainContent_ListView2_ctrl2_ctl01_') and @class='vehicledetailTable' and contains(@style,'475px;')]");
$data = array();
foreach ($spans as $span) {
array_push($data, $span->textContent);
}
print_r($data);
Output: [0] => OWNED
Note that the XPath expression contains 3 conditions; feel free to modify any of them:
//span - get all span tags that
starts-with(@id,'ctl00_MainContent_ListView2_ctrl2_ctl01_') - have an id attribute with a value starting with ctl00_MainContent_ListView2_ctrl2_ctl01_
@class='vehicledetailTable' - and have a class attribute with a value equal to vehicledetailTable
contains(@style,'475px;') - and have a style attribute whose value contains 475px;.
Conditions are enclosed in [...] and are joined with or or and. They can also be grouped with round brackets, and you can use not(...) to invert a condition. XPath is very helpful in such situations.
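As an illustration of grouping and not(...), here is a small sketch. The LEASED and IGNORED spans, the ListView1 id and the class name "other" are invented for the example; only the first span comes from the question.

```php
<?php
// First span is from the question; the other two are invented test data.
$html = '<span id="ctl00_MainContent_ListView2_ctrl2_ctl01_Label17" class="vehicledetailTable">OWNED</span>'
      . '<span id="ctl00_MainContent_ListView2_ctrl2_ctl01_Label18" class="vehicledetailTable">LEASED</span>'
      . '<span id="ctl00_MainContent_ListView1_ctrl0_ctl01_Label17" class="vehicledetailTable">IGNORED</span>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// and / or / not() combined, with round brackets grouping the "or" part.
$expr = '//span[starts-with(@id, "ctl00_MainContent_ListView2_")'
      . ' and (@class="vehicledetailTable" or @class="other")'
      . ' and not(contains(@id, "Label18"))]';

$data = [];
foreach ($xpath->query($expr) as $span) {
    $data[] = $span->textContent;
}

print_r($data);
```

The ListView1 span fails the starts-with() test and the Label18 span is excluded by not(...), so only the first span is left.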

PHP - Extracting two values from a line

I'm a beginner with regular expressions and am working on a server where I cannot install anything (does using DOM methods require installing anything?).
I have a problem that I cannot solve with my current knowledge.
I would like to extract from the line below the album id and image url.
There are more lines and other url elements in the string (file), but the album ids and image urls I need are all in strings similar to the one below:
<a href="...?album=774"><img alt="/" src="http://img255.imageshack.us/img00/000/000001.png" height="133" width="113"></a>
So in this case I would like to get '774' and 'http://img255.imageshack.us/img00/000/000001.png'
I've seen multiple examples of extracting just the url or one other element from a string, but I really need to keep these both together and store these in one record of the database.
Any help is really appreciated!
Since you are new to this, I'll explain that you can use PHP's HTML parser known as DOMDocument to extract what you need. You should not use a regular expression as they are inherently error prone when it comes to parsing HTML, and can easily result in many false positives.
To start, let's say you have your HTML:
$html = '<a href="...?album=774"><img alt="/" src="http://img255.imageshack.us/img00/000/000001.png" height="133" width="113"></a>';
And now, we load that into DOMDocument:
$doc = new DOMDocument;
$doc->loadHTML( $html);
Now that the HTML is loaded, it's time to find the elements we need. Let's assume the document may contain other <a> tags, so we want only those <a> tags that have an <img> tag as a direct child. Once we have the correct nodes, we extract the information from them. So, let's have at it:
$results = array();
// Loop over all of the <a> tags in the document
foreach( $doc->getElementsByTagName( 'a') as $a) {
    // If there are no children, continue on
    if( !$a->hasChildNodes()) continue;
    // Find the child <img> tag, if it exists
    foreach( $a->childNodes as $child) {
        if( $child->nodeType == XML_ELEMENT_NODE && $child->tagName == 'img') {
            // Now we have the <a> tag in $a and the <img> tag in $child
            // Get the information we need:
            parse_str( parse_url( $a->getAttribute('href'), PHP_URL_QUERY), $a_params);
            $results[] = array( $a_params['album'], $child->getAttribute('src'));
        }
    }
}
A print_r( $results); now leaves us with:
Array
(
[0] => Array
(
[0] => 774
[1] => http://img255.imageshack.us/img00/000/000001.png
)
)
Note that this omits basic error checking. One thing you can add is in the inner foreach loop, you can check to make sure you successfully parsed an album parameter from the <a>'s href attribute, like so:
if( isset( $a_params['album'])) {
    $results[] = array( $a_params['album'], $child->getAttribute('src'));
}
Every function I've used in this can be found in the PHP documentation.
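The same lookup can also be written with DOMXPath, which is part of the same DOM extension. This is only an alternative sketch, not the code above: the href path album.php is an assumed placeholder (only the album=774 parameter comes from the question), and the isset() check is already folded in.

```php
<?php
// The href path is an assumed placeholder; only album=774 comes from the question.
$html = '<a href="album.php?album=774"><img alt="/" '
      . 'src="http://img255.imageshack.us/img00/000/000001.png" height="133" width="113"></a>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$results = [];
// a[img] keeps only <a> elements that have a direct <img> child.
foreach ($xpath->query('//a[img]') as $a) {
    $query = parse_url($a->getAttribute('href'), PHP_URL_QUERY);
    if (!is_string($query)) {
        continue; // this href has no query string
    }
    parse_str($query, $params);
    if (isset($params['album'])) {
        // Relative query: the first <img> child of the current <a>.
        $img = $xpath->query('img', $a)->item(0);
        $results[] = [$params['album'], $img->getAttribute('src')];
    }
}

print_r($results);
```

Each entry keeps the album id and the image URL together, so it can be stored as one database record.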
If you've already narrowed it down to this line, then you can use a regex like the following:
$matches = array();
preg_match('#.+album=(\d+).+src="([^"]+)#', $yourHtmlLineHere, $matches);
Now if you run:
echo $matches[1];
echo " ";
echo $matches[2];
You'll get the following:
774 http://img255.imageshack.us/img00/000/000001.png
