I have the below XML file. There are 4 rows constantly repeated for different websites.
These are _Url, _Away, _Home and _Draw, each prefixed with the website name. I need to compare all of the _Away rows to find the highest value, but sometimes there is only 1 of these rows and other times there can be as many as 32. Is there a way to select them by matching the end of the element name, without having to explicitly declare the full name for each website?
<XMLSOCCER.COM>
<Odds>
<Id>1547</Id>
<_10Bet_Home_Home>1.31</_10Bet_Home_Home>
<_10Bet_Home_Url>http://en.10bet.com</_10Bet_Home_Url>
<_10Bet_Home_Away>8.50</_10Bet_Home_Away>
<_10Bet_Home_Draw>5.40</_10Bet_Home_Draw>
<Bet_At_Home_Home>1.25</Bet_At_Home_Home>
<Bet_At_Home_Url>http://www.bet-at-home.com/</Bet_At_Home_Url>
<Bet_At_Home_Away>10.00</Bet_At_Home_Away>
<Bet_At_Home_Draw>5.75</Bet_At_Home_Draw>
<Bet365_Url>http://www.bet365.com/</Bet365_Url>
<Bet365_Home>1.30</Bet365_Home>
<Bet365_Away>9.00</Bet365_Away>
<Bet365_Draw>5.50</Bet365_Draw>
<BetVictor_Home>1.30</BetVictor_Home>
<BetVictor_Url>http://www.betvictor.com/</BetVictor_Url>
<BetVictor_Away>9.00</BetVictor_Away>
<BetVictor_Draw>5.40</BetVictor_Draw>
<Bwin_Home>1.28</Bwin_Home>
</Odds>
</XMLSOCCER.COM>
You can use XPath to fetch all nodes ending with _Away. Here's a code snippet that accomplishes what you want:
<?php
$xml = <<<XML
<XMLSOCCER.COM>
<Odds>
<Id>1547</Id>
<_10Bet_Home_Home>1.31</_10Bet_Home_Home>
<_10Bet_Home_Url>http://en.10bet.com</_10Bet_Home_Url>
<_10Bet_Home_Away>8.50</_10Bet_Home_Away>
<_10Bet_Home_Draw>5.40</_10Bet_Home_Draw>
<Bet_At_Home_Home>1.25</Bet_At_Home_Home>
<Bet_At_Home_Url>http://www.bet-at-home.com/</Bet_At_Home_Url>
<Bet_At_Home_Away>10.00</Bet_At_Home_Away>
<Bet_At_Home_Draw>5.75</Bet_At_Home_Draw>
<Bet365_Url>http://www.bet365.com/</Bet365_Url>
<Bet365_Home>1.30</Bet365_Home>
<Bet365_Away>9.00</Bet365_Away>
<Bet365_Draw>5.50</Bet365_Draw>
<BetVictor_Home>1.30</BetVictor_Home>
<BetVictor_Url>http://www.betvictor.com/</BetVictor_Url>
<BetVictor_Away>9.00</BetVictor_Away>
<BetVictor_Draw>5.40</BetVictor_Draw>
<Bwin_Home>1.28</Bwin_Home>
</Odds>
</XMLSOCCER.COM>
XML;
$sxe = new SimpleXMLElement($xml);
$nodesEndingWithAway = $sxe->xpath('//*[substring(name(),string-length(name())-3) = "Away"]');
$highestValue = 0;
$nodeName = '';
foreach ($nodesEndingWithAway as $node) {
if ((float) $node > $highestValue) {
$highestValue = (float) $node;
$nodeName = $node->getName();
}
}
echo "Highest value is {$highestValue} from node {$nodeName}.\n";
Output:
Highest value is 10 from node Bet_At_Home_Away.
Note: I think it would be possible to accomplish it with a single XPath expression without the need to process the nodes with the foreach.
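For what it's worth, here is a rough sketch of that single-expression idea. XPath 1.0 (which is what SimpleXML supports) has no max() function, so the usual workaround is to select the node for which no other matching node is larger. The variable names ($isAway, $expr, $max) and the trimmed sample data are just for illustration:

```php
<?php
// Trimmed copy of the sample data from the question.
$xml = <<<XML
<XMLSOCCER.COM>
  <Odds>
    <Id>1547</Id>
    <_10Bet_Home_Away>8.50</_10Bet_Home_Away>
    <Bet_At_Home_Away>10.00</Bet_At_Home_Away>
    <Bet365_Away>9.00</Bet365_Away>
    <BetVictor_Away>9.00</BetVictor_Away>
    <Bwin_Home>1.28</Bwin_Home>
  </Odds>
</XMLSOCCER.COM>
XML;

$sxe = new SimpleXMLElement($xml);

// Predicate matching any element whose name ends with "Away".
$isAway = 'substring(name(), string-length(name()) - 3) = "Away"';

// Keep only the *Away nodes for which no *Away node has a larger value.
$expr = "//*[{$isAway}][not(. < //*[{$isAway}])]";
$max = $sxe->xpath($expr);

$maxName = $max[0]->getName();
$maxValue = (float) $max[0];
echo "Highest value is {$maxValue} from node {$maxName}.\n";
```

If several sites share the top price, the expression returns all of them, so $max can hold more than one node.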
You can do this with XPath.
$doc = new DOMDocument();
$doc->load($filename);
$xpath = new DOMXPath($doc);
$elements = $xpath->query('/XMLSOCCER.COM/Odds/*[substring(name(),string-length(name())-3) = "Away"]');
$maxValue = 0;
foreach ($elements as $element) {
$value = floatval($element->nodeValue);
$maxValue = max($maxValue, $value);
}
EDIT: very compressed:
$maxbid = max(array_map('floatval', $xml->xpath("//*[substring(name(),string-length(name())-" . (strlen($search) - 1) . ") = '$search']")));
in several steps:
use simplexml and xpath:
$search = "_Away";
$xml = simplexml_load_string($x);
$results = $xml->xpath("//*[substring(name(),string-length(name())-" . (strlen($search) - 1) . ") = '$search']");
Loop through your results:
foreach ($results as $result) echo "$result <br />";
Print highest result:
echo "highest: " . number_format(max(array_map('floatval', $results)), 2, '.', ',');
See it working: http://codepad.viper-7.com/iEpGz9
I am trying to create a simple screen scraper that gets me the price of a specific item. Here is an example of a product I want to get the price from:
https://www.flanco.ro/telefon-mobil-apple-iphone-14-5g-128gb-purple.html
This is the portion of the html code I am interested in:
[Image: screenshot of the relevant HTML]
I want to get the '4699' thing.
Here is what I have been trying to do but it does not seem to work:
$html = file_get_contents("https://www.flanco.ro/telefon-mobil-apple-iphone-14-5g-128gb-purple.html");
$doc = new DomDocument();
$doc->loadHtml($html);
$xpath = new DomXPath($doc);
//Now query the document:
foreach ($xpath->query('/<span class="price">[0-9]*\\.[0-9]+/i') as $node) {
echo $node, "\n";
}
You could just use standard PHP string functions to get the price out of the $html:
$url = "https://www.flanco.ro/telefon-mobil-apple-iphone-14-5g-128gb-purple.html";
$html = file_get_contents($url);
$seek = '<span class="special-price"><span class="price">';
$end = strpos($html, $seek) + strlen($seek);
$price = substr($html, $end, strpos($html, ',', $end) - $end);
Or something similar. This is all the code you need. This code returns:
4.699
My point is: In this particular case you don't need to parse the DOM and use a regular expression to get that single price.
Since there are a few price classes on the page, I would specifically target the pricesPrp class.
Also, in your foreach you are trying to convert a DOMElement object into a string, which won't work.
Update your XPath query as follows:
$query = $xpath->query('//div[@class="pricesPrp"]//span[@class="special-price"]//span[@class="price"]');
If you want to see the different nodes:
echo '<pre>';
foreach ($query as $node) {
var_dump($node);
}
And if you want to get that specific price :
$price = $query->item(0)->nodeValue;
echo $price;
$html = file_get_contents('PASTE_URL');
$doc = new DOMDocument();
@$doc->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8"));
$selector = new DOMXPath($doc);
$result = $selector->query('//span[@class="price"]');
foreach($result as $node) {
echo $node->nodeValue;
}
I've been trying unsuccessfully with PHP to loop through two XML files and print the result to the screen. The aim is to take a country's name and output its regions/states/provinces as the case may be.
The first block of code successfully prints all the countries but the loop through both files gives me a blank screen.
The countries file is in the format:
<row>
<id>6</id>
<name>Andorra</name>
<iso2>AD</iso2>
<phone_code>376</phone_code>
</row>
And the states.xml:
<row>
<id>488</id>
<name>Andorra la Vella</name>
<country_id>6</country_id>
<country_code>AD</country_code>
<state_code>07</state_code>
</row>
so that country_id = id.
This gives a perfect list of countries:
$xml = simplexml_load_file("countries.xml");
$xml1 = simplexml_load_file("states.xml");
foreach($xml->children() as $key => $children) {
print((string)$children->name); echo "<br>";
}
This gives me a blank screen except for the HTML stuff on the page:
$xml = simplexml_load_file("countries.xml");
$xml1 = simplexml_load_file("states.xml");
$s = "Jamaica";
foreach($xml->children() as $child) {
foreach($xml1->children() as $child2){
if ($child->id == $child2->country_id && $child->name == $s) {
print((string)$child2->name);
echo "<br>";
}
}
}
Where have I gone wrong?
Thanks.
I suspect your problem is not casting the name to a string before doing your comparison. But why are you starting the second loop before checking if it's needed? You're looping through every single item in states.xml needlessly.
$countries = simplexml_load_file("countries.xml");
$states = simplexml_load_file("states.xml");
$search = "Jamaica";
foreach($countries->children() as $country) {
if ((string)$country->name !== $search) {
continue;
}
foreach($states->children() as $state) {
if ((string)$country->id === (string)$state->country_id) {
echo (string)$state->name . "<br/>";
}
}
}
Also, note that naming your variables in a descriptive manner makes it much easier to figure out what's going on with code.
You could probably get rid of the loops altogether using an XPath query to match the sibling value. I don't use SimpleXML, but here's what it would look like with DomDocument:
$search = "Jamaica";
$countries = new DomDocument();
$countries->load("countries.xml");
$xpath = new DomXPath($countries);
$country = $xpath->query("//row[name/text() = '$search']/id/text()");
$country_id = $country[0]->nodeValue;
$states = new DomDocument();
$states->load("states.xml");
$xpath = new DomXPath($states);
$states = $xpath->query("//row[country_id/text() = '$country_id']/name/text()");
foreach ($states as $state) {
echo $state->nodeValue . "<br/>";
}
I have this as a part of my XML that I am loading in a DOM Document:
<error n='\Author'/>
Some Text 1
<formula type='inline'><math xmlns='http://www.w3.org/1998/Math/MathML'><msup><mrow/> <mrow><mn>1</mn><mo>,</mo></mrow> </msup></math></formula>
Some Text 2
<formula type='inline'><math xmlns='http://www.w3.org/1998/Math/MathML'><msup><mrow/> <mn>2</mn> </msup></math></formula>
<error n='\address' />
My goal is to get everything as nodeValue between the
<error n='\Author' />
And
<error n='\address' />
How can this be done?
I tested this:
$author_node = $xpath_xml->query("//error[@n='\Author']/following-sibling::*[1]")->item(0);
if ($author_node != null) {
$i = 1;
$nextNodeName = "";
$author = "";
while ($nextNodeName != "error" && $i < 20) {
$nextNodeName = $xpath_xml->query("//error[@n='\Author']/following-sibling::*[$i]")->item(0)->tagName;
if ($nextNodeName == "error")
continue;
$author .= $nextNode->nodeValue;
}
But I am getting only the formula content, not the text between the formulas.
Thank you.
The * only selects element nodes, not text nodes, so only the <formula> elements are selected. You need to use node() instead. You could also use XPath directly to select the needed nodes; look for an explanation of the Kayessian method.
$dom = new DOMDocument();
$dom->loadXml($xml);
$xpath = new DOMXpath($dom);
$nodes = $xpath->evaluate(
    '//error[@n="\\Author"][1]
        /following-sibling::node()
        [
            count(
                .|
                //error[@n="\\Author"][1]
                    /following-sibling::error[@n="\\address"][1]
                    /preceding-sibling::node()
            )
            =
            count(
                //error[@n="\\Author"][1]
                    /following-sibling::error[@n="\\address"][1]
                    /preceding-sibling::node()
            )
        ]'
);
$result = '';
foreach ($nodes as $node) {
$result .= $node->nodeValue;
}
var_dump($result);
Demo: https://eval.in/125494
If you want to save not only the text content, but the XML fragment, you can use DOMDocument::saveXml() with the node as argument.
$result = '';
foreach ($nodes as $node) {
$result .= $node->ownerDocument->saveXml($node);
}
var_dump($result);
I'm trying to extract 2 elements using PHP Curl and Xpath!
So far have the element separated in foreach but I would like to have them in the same time:
@$dom->loadHTML($html);
$xpath = new DOMXpath($dom);
$elements = $xpath->evaluate("//p[@class='row']/a/@href");
//$elements = $xpath->query("//p[@class='row']/a");
foreach ($elements as $element) {
$url = $element->nodeValue;
//$title = $element->nodeValue;
}
When I echo each one outside the foreach I only get 1 element, and when it's echoed inside the foreach I get all of them.
My question is: how can I get both (URL and title) at the same time, and what's the best way to add them into MySQL using PDO?
Thank you.
There is no need, in this case, to use XPath twice. You could do one query and navigate to the associated other node(s).
For example, find all of the hrefs that you are interested in and get their ownerElement's (the <a>) node value.
$hrefs = $xpath->query("//p[@class='row']/a/@href");
foreach ($hrefs as $href) {
$url = $href->value;
$title = $href->ownerElement->nodeValue;
// Insert into db here
}
Or, find all of the <a>s that you are interested in and get their href attributes.
$anchors = $xpath->query("//p[@class='row']/a[@href]");
foreach ($anchors as $anchor) {
$url = $anchor->getAttribute("href");
$title = $anchor->nodeValue;
// Insert into db here
}
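As for the "Insert into db here" part, a prepared statement executed once per anchor is the usual approach. Below is a minimal sketch, not a definitive implementation: the sample markup, the links table and its columns are all made up for illustration, and it uses an in-memory SQLite database (this needs the pdo_sqlite extension) so it can run on its own. For MySQL only the DSN and credentials passed to PDO would change.

```php
<?php
// Hypothetical markup standing in for the page fetched with cURL.
$html = '<html><body>'
      . '<p class="row"><a href="/item/1">First item</a></p>'
      . '<p class="row"><a href="/item/2">Second item</a></p>'
      . '</body></html>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Assumption: in-memory SQLite keeps the sketch self-contained.
// A MySQL DSN would look like "mysql:host=...;dbname=...".
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE links (url TEXT, title TEXT)');

// Prepare once, execute once per anchor with bound values.
$insert = $pdo->prepare('INSERT INTO links (url, title) VALUES (:url, :title)');
foreach ($xpath->query("//p[@class='row']/a[@href]") as $anchor) {
    $insert->execute([
        ':url'   => $anchor->getAttribute('href'),
        ':title' => trim($anchor->nodeValue),
    ]);
}

$count = (int) $pdo->query('SELECT COUNT(*) FROM links')->fetchColumn();
$first = $pdo->query('SELECT url, title FROM links ORDER BY rowid LIMIT 1')
             ->fetch(PDO::FETCH_ASSOC);
```

Binding the values through the prepared statement means the scraped text never gets concatenated into the SQL string.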
You're overwriting $url on each iteration. Maybe use an array?
@$dom->loadHTML($html);
$xpath = new DOMXpath($dom);
$elements = $xpath->evaluate("//p[@class='row']/a/@href");
//$elements = $xpath->query("//p[@class='row']/a");
$urls = array();
foreach ($elements as $element){
array_push($urls, $element->nodeValue);
//$title = $element->nodeValue;
}
We have the following code that lists the xpaths where $value is found.
For a given URL (see picture) we have detected a non-standard tag, td1, which in addition has no closing tag. The site developers have probably put it there intentionally, as you can see in the screenshot.
This element causes problems when identifying the correct XPath for nodes.
A broken Xpath example :
/html/body/div[2]/div[2]/table/tr[2]/td/table/tr[1]/td[2]/table/tr[2]/td[2]/table[3]/tr[2]/**td1**/td[2]/span/u[1]
(as you see td1 is identified and chained in the Xpath)
We think that removing this element would help us build the valid XPath we are after.
A valid example is
/html/body/div[2]/div[2]/table/tr[2]/td/table/tr[1]/td[2]/table/tr[2]/td[2]/table[3]/tr[2]/td[2]/span/u[1]
How can we remove it prior to loading the document into DOMXpath? Do you have some other approach?
We would like to remove all the invalid tags which may be other than td1, as h8, diw, etc...
private function extract($url, $value) {
$dom = new DOMDocument();
$file = 'content.txt';
//$current = file_get_contents($url);
$current = CurlTool::downloadFile($url, $file);
//file_put_contents($file, $current);
@$dom->loadHTMLFile($current);
//use DOMXpath to navigate the html with the DOM
$dom_xpath = new DOMXpath($dom);
$elements = $dom_xpath->query("//*[text()[contains(., '" . $value . "')]]");
var_dump($elements);
if (!is_null($elements)) {
foreach ($elements as $element) {
var_dump($element);
echo "\n1.[" . $element->nodeName . "]\n";
$nodes = $element->childNodes;
foreach ($nodes as $node) {
if( ($node->nodeValue != null) && ($node->nodeValue === $value) ) {
echo '2.' . $node->nodeValue . "\n";
$xpath = preg_replace("/\/text\(\)/", "", $node->getNodePath());
echo '3.' . $xpath . "\n";
}
}
}
}
}
You could use XPath to find the offending nodes and remove them, while promoting their children into their place in the DOM. Then your paths will be correct.
$dom_xpath = new DOMXpath($dom);
$results = $dom_xpath->query('//td1'); // (or any offending element)
foreach ($results as $invalidNode)
{
    $parentNode = $invalidNode->parentNode;
    // Move each child in front of the invalid node until none are left.
    // (childNodes is always a truthy object, so test firstChild instead.)
    while ($invalidNode->firstChild)
    {
        $firstChild = $invalidNode->firstChild;
        $parentNode->insertBefore($firstChild, $invalidNode);
    }
    $parentNode->removeChild($invalidNode);
}
EDIT:
You could also build a list of offending elements by using a list of valid elements and negating it.
// Build list manually from the HTML spec:
// See: http://www.w3.org/TR/html5/section-index.html#elements-1
$validTags = array();
// Convert list to XPath:
$validTagsStr = '';
foreach ($validTags as $tag)
{
    if ($validTagsStr)
    { $validTagsStr .= ' or '; }
    $validTagsStr .= 'self::'.$tag;
}
// Note the closing "]" of the predicate.
$results = $dom_xpath->query('//*[not('.$validTagsStr.')]');
Sooo... perhaps str_replace("<td1 va-laign=\"top\">", "", $current) could do the trick? (Note the argument order: search, replace, subject.)