How to break down and parse specific Wikipedia text - php

I have the following working example that retrieves a specific Wikipedia page and returns a SimpleXMLElement object:
ini_set('user_agent', 'michael@example.com');
$doc = new DOMDocument();
$doc->load('http://en.wikipedia.org/w/api.php?action=parse&page=Main%20Page&format=xml');
$xml = simplexml_import_dom($doc);
print '<pre>';
print_r($xml);
print '</pre>';
Which returns:
SimpleXMLElement Object
(
[parse] => SimpleXMLElement Object
(
[@attributes] => Array
(
[title] => Main Page
[revid] => 472210092
[displaytitle] => Main Page
)
[text] => <body><table id="mp-topbanner" style="width: 100%;"...
Silly question/mind blank. What I am trying to do is capture the $xml->parse->text element and in turn parse that. So ultimately I want the following object returned. How do I achieve this?
SimpleXMLElement Object
(
[body] => SimpleXMLElement Object
(
[table] => SimpleXMLElement Object
(
[@attributes] => Array
(
[id] => mp-topbanner
[style] => width:100% ...

After grabbing a fresh tea and eating a banana, here's the solution I've come up with:
ini_set('user_agent', 'michael@example.com');
$doc = new DOMDocument();
$doc->load('http://en.wikipedia.org/w/api.php?action=parse&page=Main%20Page&format=xml');
$nodes = $doc->getElementsByTagName('text');
$str = $nodes->item(0)->nodeValue;
$html = new DOMDocument();
$html->loadHTML($str);
This then allows me to get an element's value, which is what I was after. For example:
echo "Some value: ";
echo $html->getElementById('someid')->nodeValue;
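The same two-step idea (pull the escaped HTML out of the XML, then parse it as HTML) can be tried without a network call. This is only a sketch: the inline XML below is a made-up stand-in for the API response, 'someid' is a placeholder id, and an XPath lookup is used in place of getElementById.

```php
<?php
// Hypothetical stand-in for the API response: the HTML is escaped inside <text>.
$apiXml = '<api><parse title="Main Page"><text>'
        . '&lt;div id="someid"&gt;Hello&lt;/div&gt;'
        . '</text></parse></api>';

// Step 1: parse the XML envelope and read the text node as a plain string.
$doc = new DOMDocument();
$doc->loadXML($apiXml);
$str = $doc->getElementsByTagName('text')->item(0)->nodeValue;

// Step 2: parse that string as HTML. Real pages often trigger parser
// warnings, so collect them internally instead of printing them.
libxml_use_internal_errors(true);
$html = new DOMDocument();
$html->loadHTML($str);
libxml_clear_errors();

// Look the element up by its id attribute.
$xpath = new DOMXPath($html);
$node  = $xpath->query('//*[@id="someid"]')->item(0);
$value = $node !== null ? $node->textContent : null;

echo "Some value: ", $value, "\n";
```

Checking the node for null before reading it avoids a fatal error when the id is not present in the page.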

Related

PHP unable to access Xpath node of a SimpleXMLElement Object

$x = new DomDocument();
$x->loadXML($responseXml);
$xml = simplexml_import_dom($x);
Outputting the array using print_r($xml) gives the following:
SimpleXMLElement Object
(
[Timestamp] => 2014-11-09T18:28:47.843Z
[Ack] => Success
[Version] => 897
[Build] => E897_UNI_API5_17253832_R1
[Store] => SimpleXMLElement Object
(
[Name] => test
[SubscriptionLevel] => Basic
[Description] => Welcome Message.
)
)
Using $xml->Store->Description outputs "Welcome Message."
When I use xpath to return the Description node using the following code, I get an empty array:
$xpath = new DOMXPath($x);
$result = $xpath->query("/Store/Description");
Why does this fail?
It's easier with SimpleXML.
$xml = new SimpleXMLElement($string);
$result = $xml->xpath('/Store/Description');
http://php.net/manual/en/simplexmlelement.xpath.php
Simple really, I just needed to register the namespace:
$xml->registerXPathNamespace('urn', 'ebay:apis:eBLBaseComponents');
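To illustrate why registering the namespace matters, here is a small sketch. The document, the prefix "s" and the namespace URI "urn:example:store" are invented for the example and are not the real eBay values. Note also that the query starts with // because Store is a child of the root element, not the root itself.

```php
<?php
// Hypothetical response with a default namespace (URI invented for this sketch).
$responseXml = <<<XML
<?xml version="1.0"?>
<GetStoreResponse xmlns="urn:example:store">
  <Ack>Success</Ack>
  <Store>
    <Name>test</Name>
    <Description>Welcome Message.</Description>
  </Store>
</GetStoreResponse>
XML;

$xml = new SimpleXMLElement($responseXml);

// Unprefixed names in XPath 1.0 only match elements in no namespace,
// so this query finds nothing in a namespaced document.
$plain = $xml->xpath('//Store/Description');

// Bind a prefix of our choosing to the document's namespace URI,
// then use that prefix on every step of the path.
$xml->registerXPathNamespace('s', 'urn:example:store');
$namespaced = $xml->xpath('//s:Store/s:Description');

echo count($plain), "\n";
echo (string) $namespaced[0], "\n";
```

The prefix used in the query does not have to match anything in the document; only the URI has to match.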

How to reach till the desired node in xpath result?

As the title says, I am using the code below and trying to reach a specific node in the XPath result.
<?php
$xpath = '//*[@id="topsection"]/div[3]/div[2]/div[1]/div/div[1]';
$html = new DOMDocument();
@$html->loadHTMLFile('http://www.flipkart.com/samsung-galaxy-ace-s5830/p/itmdfndpgz4nbuft');
$xml = simplexml_import_dom($html);
if (!$xml) {
echo 'Error while parsing the document';
exit;
}
$source = $xml->xpath($xpath);
echo "<pre>";
print_r($source);
?>
This is the source code. I am using it to scrape a price from an e-commerce site.
It works and gives the output below:
Array
(
[0] => SimpleXMLElement Object
(
[@attributes] => Array
(
[class] => line
)
[div] => SimpleXMLElement Object
(
[@attributes] => Array
(
[class] => prices
[itemprop] => offers
[itemscope] =>
[itemtype] => http://schema.org/Offer
)
[span] => Rs. 10300
[div] => (Prices inclusive of taxes)
[meta] => Array
(
[0] => SimpleXMLElement Object
(
[@attributes] => Array
(
[itemprop] => price
[content] => Rs. 10300
)
)
[1] => SimpleXMLElement Object
(
[@attributes] => Array
(
[itemprop] => priceCurrency
[content] => INR
)
)
)
)
)
)
Now, how do I get directly to [content] => Rs. 10300?
I tried:
echo $source[0]['div']['meta']['@attributes']['content'];
but it doesn't work.
Try echo (String) $source[0]->div->meta[0]['content'];.
Basically, when print_r shows that an element is an object, you can't access it like an array; you need to use the object -> syntax for child elements.
The print_r of a SimpleXMLElement does not show the real object structure. So you need to have some knowledge:
$source[0]->div->meta['content']
| | | `- attribute access
| | `- element access, defaults to the first one
| `- element access, defaults to the first one
|
standard array access to get
the first SimpleXMLElement of xpath()
operation
With your URL, that expression gives the following (print_r again):
SimpleXMLElement Object
(
[0] => Rs. 10300
)
Cast it to string in case you want the text-value:
$rs = (string) $source[0]->div->meta['content'];
However, you can also select that node directly with the XPath expression (if you only need that single value).
Learn more about accessing a SimpleXMLElement in the "Basic SimpleXML usage" examples in the PHP manual.
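As a rough, self-contained illustration of the access rules described above, the snippet below rebuilds a small fragment shaped like the print_r output in the question. The markup is an assumption reconstructed from that output, not the real page.

```php
<?php
// Assumed fragment, modelled on the print_r output shown in the question.
$fragment = <<<XML
<div class="line">
  <div class="prices" itemprop="offers">
    <span>Rs. 10300</span>
    <meta itemprop="price" content="Rs. 10300"/>
    <meta itemprop="priceCurrency" content="INR"/>
  </div>
</div>
XML;

$line = new SimpleXMLElement($fragment);

// Child elements use ->name, repeated siblings use [index],
// attributes use ['name']; cast to string to get the plain value.
$price    = (string) $line->div->meta[0]['content'];
$currency = (string) $line->div->meta[1]['content'];

// Alternative: let XPath pick the attribute by its itemprop value,
// which does not depend on the order of the <meta> elements.
$matches       = $line->xpath('.//meta[@itemprop="price"]/@content');
$priceViaXpath = (string) $matches[0];

echo $price, " / ", $currency, " / ", $priceViaXpath, "\n";
```

Selecting by itemprop is usually less brittle than a positional index when the page layout changes.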

How to diff SimpleXML multidimensional array?

I have a SimpleXML Object made from merging multiple XMLs from PubMed (snippet below) but there is repetition from the merge. How can I compare all first child arrays - array[][0], array[][1] etc - and discard any duplicates?
I thought perhaps serialising was the answer, but you can't serialise a SimpleXML object afaik.
I'm not sure where to start?
Array
(
[0] => Array
(
[title] => SimpleXMLElement Object
(
[0] => Superstructure of the centromeric complex of TubZRC plasmid partitioning systems.
)
[link] => SimpleXMLElement Object
(
[@attributes] => Array
(
[Version] => 1
)
[0] => 23010931
)
[author] => Aylett, CH., Löwe, J.
[journal] => SimpleXMLElement Object
(
[0] => Proc. Natl. Acad. Sci. U.S.A.
)
[pubdate] => 2012-9-27
[day] => SimpleXMLElement Object
(
[0] => 25
)
[month] => SimpleXMLElement Object
(
[0] => Sep
)
[year] => SimpleXMLElement Object
(
[0] => 2012
)
)
[1] => Array
(
[title] => SimpleXMLElement Object
(
[0] => Superstructure of the centromeric complex of TubZRC plasmid partitioning systems.
)
[link] => SimpleXMLElement Object
(
[@attributes] => Array
(
[Version] => 1
)
[0] => 23010931
)
[author] => Aylett, CH., Löwe, J.
[journal] => SimpleXMLElement Object
(
[0] => Proc. Natl. Acad. Sci. U.S.A.
)
[pubdate] => 2012-9-27
[day] => SimpleXMLElement Object
(
[0] => 25
)
[month] => SimpleXMLElement Object
(
[0] => Sep
)
[year] => SimpleXMLElement Object
(
[0] => 2012
)
)
Alternatively it could be done at the initial XML merge stage - I use the code below at the moment if anyone can suggest how to modify it to remove duplicates?
function simplexml_merge (SimpleXMLElement &$xml1, SimpleXMLElement $xml2) {
$dom1 = new DomDocument();
$dom2 = new DomDocument();
$dom1->loadXML($xml1->asXML());
$dom2->loadXML($xml2->asXML());
$xpath = new domXPath($dom2);
$xpathQuery = $xpath->query('/*/*');
for ($i = 0; $i < $xpathQuery->length; $i++) {
$dom1->documentElement->appendChild(
$dom1->importNode($xpathQuery->item($i), true));
}
$xml1 = simplexml_import_dom($dom1);
}
$xml1 = new SimpleXMLElement($search1);
$xml2 = new SimpleXMLElement($search2);
simplexml_merge($xml1, $xml2);
Thanks.
...
...
For clarity - here's the XML source layout that I am importing into SimpleXML - each PubmedArticle is one "element" I am interested in comparing and ensuring there are no duplicates -
<xml...>
<Document>
<PubmedArticle>
<MedlineCitation>
<PMID version="1">xxx</PMID>
...
</MedlineCitation>
...
</PubmedArticle>
<PubmedArticle>
<MedlineCitation>
<PMID version="1">xxx</PMID>
...
</MedlineCitation>
...
</PubmedArticle>
etc
</Document>
</xml>
The PMID node is unique so can be used to check for duplicates.
...
...
Using the link from @Gordon, I now use:
//Get my source XML
$xml1 = new SimpleXMLElement($search1);
$xml2 = new SimpleXMLElement($search2);
//Run through $xml1 and build a query based on its PMIDs
$query = array();
foreach ($xml1->PubmedArticle as $paper) {
$query[] = sprintf('(PMID != %s)',$paper->MedlineCitation->PMID);
}
$query = implode(' and ', $query);
//Run through $xml2 and get the nodes that don't have a PMID matching $xml1
foreach ($xml2->xpath(sprintf('PubmedArticle/MedlineCitation[%s]', $query)) as $paper) {
echo $paper->asXml();
}
However I still have one problem - getting the output merged.
The output of $xml2 is missing the <PubmedArticle> node around each 'match' for a start. Then I presume I can use the same merge code (above) to do the merge.
Can you point me in the right direction?
Convert it to an array (which I'm not going to write for you, just iterate and add.), then array_diff().
Decided to follow @Gordon's suggestion as it keeps everything in XML. Eventually got it all working:
//function to check 2 xml inputs for duplicate nodes
function dedupeXML($xml1, $xml2) {
$query = array();
foreach ($xml1->PubmedArticle as $paper) {
$query[] = sprintf('(MedlineCitation/PMID != %s)',$paper->MedlineCitation->PMID);
}
$query = implode(' and ', $query);
$xmlClean = '<Document>';
foreach ($xml2->xpath(sprintf('PubmedArticle[%s]', $query)) as $paper) {
$xmlClean .= $paper->asXML();
}
$xmlClean .= '</Document>';
$xmlClean = new SimpleXMLElement($xmlClean);
return $xmlClean;
}
//function to merge 2 xml inputs
function mergeXML (SimpleXMLElement &$xml1, SimpleXMLElement $xml2) {
// convert SimpleXML objects into DOM ones
$dom1 = new DomDocument();
$dom2 = new DomDocument();
$dom1->loadXML($xml1->asXML());
$dom2->loadXML($xml2->asXML());
// pull all child elements of second XML
$xpath = new domXPath($dom2);
$xpathQuery = $xpath->query('/*/*');
for ($i = 0; $i < $xpathQuery->length; $i++) {
// and pump them into first one
$dom1->documentElement->appendChild(
$dom1->importNode($xpathQuery->item($i), true));
}
$xml = simplexml_import_dom($dom1);
return $xml;
}
$xml1 = new SimpleXMLElement($search1);
$xml2 = new SimpleXMLElement($search2);
$xml3 = new SimpleXMLElement($search3);
//dedupe and merge inputs
//input 1 & 2
$xml2Clean = dedupeXML($xml1, $xml2);
$xml12 = mergeXML($xml1, $xml2Clean);
//input 1+2 & 3
$xml3Clean = dedupeXML($xml12, $xml3);
$xml123 = mergeXML($xml12, $xml3Clean);
This would be easy to adapt to other data sources - just modify the dedupeXML function to match the data structure of your XML.
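An alternative that avoids building a long XPath predicate is to track the PMIDs already seen in an associative array. The following is only a sketch under the assumption that every input has the Document/PubmedArticle/MedlineCitation/PMID layout shown earlier; the function name uniqueByPmid and the sample PMIDs are made up for illustration.

```php
<?php
/**
 * Collect each PubmedArticle once, keyed by PMID (first occurrence wins).
 * Hypothetical helper; assumes the Document/PubmedArticle layout above.
 *
 * @param SimpleXMLElement[] $documents
 * @return string[] XML fragments keyed by PMID
 */
function uniqueByPmid(array $documents): array
{
    $seen = [];
    foreach ($documents as $document) {
        foreach ($document->PubmedArticle as $article) {
            $pmid = (string) $article->MedlineCitation->PMID;
            if (!isset($seen[$pmid])) {
                $seen[$pmid] = $article->asXML();
            }
        }
    }
    return $seen;
}

// Tiny made-up inputs: PMID 222 appears in both documents.
$wrap = function (array $pmids): SimpleXMLElement {
    $xml = '<Document>';
    foreach ($pmids as $pmid) {
        $xml .= '<PubmedArticle><MedlineCitation><PMID Version="1">'
              . $pmid . '</PMID></MedlineCitation></PubmedArticle>';
    }
    return new SimpleXMLElement($xml . '</Document>');
};

$unique = uniqueByPmid([$wrap([111, 222]), $wrap([222, 333])]);

// Rebuild one merged document from the unique fragments.
$merged = new SimpleXMLElement('<Document>' . implode('', $unique) . '</Document>');
echo count($merged->PubmedArticle), " unique articles\n";
```

This handles any number of inputs in one pass, at the cost of holding the serialized articles in memory.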

How to set attribute for nodes with text content?

I am trying to iterate over a set of nodes returned by XPath and set a certain attribute on each node. However, it works only for nodes without content or with empty (whitespace) content. I have tried two approaches with the same result (maybe they are the same on some deeper level, I don't know). The commented-out line is the second approach.
$temp = simplexml_load_string (
'<toolbox>
<hammer/>
<screwdriver> </screwdriver>
<knife>
sharp
</knife>
</toolbox>' );
echo "vanilla toolbox: ";
print_r($temp);
$nodes = $temp->xpath('//*[not(@id)]');
foreach($nodes as $obj) {
$tempdom = dom_import_simplexml($obj);
$tempdom->setAttributeNode(new DOMAttr('id', 5));
//$obj->addAttribute('bagr', 5);
}
echo "processed toolbox: ";
print_r($temp);
This is the output. The id attribute is missing from the knife node:
vanilla toolbox: SimpleXMLElement Object
(
[hammer] => SimpleXMLElement Object
(
)
[screwdriver] => SimpleXMLElement Object
(
[0] =>
)
[knife] =>
sharp
)
processed toolbox: SimpleXMLElement Object
(
[@attributes] => Array
(
[id] => 5
)
[hammer] => SimpleXMLElement Object
(
[@attributes] => Array
(
[id] => 5
)
)
[screwdriver] => SimpleXMLElement Object
(
[@attributes] => Array
(
[id] => 5
)
[0] =>
)
[knife] =>
sharp
I'm unable to reproduce what you describe, the changed XML is:
<?xml version="1.0"?>
<toolbox id="5">
<hammer id="5"/>
<screwdriver id="5"> </screwdriver>
<knife id="5">
sharp
</knife>
</toolbox>
It's exactly your code, maybe you're using a different LIBXML version? See the LIBXML_VERSION constant (codepad viper has 20626 (2.6.26)).
But most likely it is only the print_r output for a SimpleXMLElement object that misleads you.
print_r does not show the attributes of an element that has only text content, even on a brand new object, but the attribute is still accessible.
You will see when you print_r($temp->knife['id']); that the attribute is set (as you can see in the earlier XML output).
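A quick way to convince yourself is to skip print_r and check the attribute and the serialized XML directly. This sketch uses a compacted version of the toolbox from the question and the addAttribute approach; the value 5 is the same placeholder as above.

```php
<?php
// Compact version of the toolbox from the question.
$toolbox = simplexml_load_string(
    '<toolbox><hammer/><screwdriver> </screwdriver><knife>sharp</knife></toolbox>'
);

// Add an id to every element that does not have one yet.
foreach ($toolbox->xpath('//*[not(@id)]') as $node) {
    $node->addAttribute('id', '5');
}

// Read the attribute directly instead of relying on print_r.
$knifeId = (string) $toolbox->knife['id'];

// asXML() shows the real document, including the knife's attribute.
$xmlOut = $toolbox->asXML();

echo "knife id: ", $knifeId, "\n";
echo $xmlOut;
```

When in doubt about what a SimpleXMLElement contains, asXML() is a more reliable view than print_r or var_dump.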

Why does SimpleXML change my array to the array's first element when I use it?

Here is my code:
$string = <<<XML
<?xml version='1.0'?>
<test>
<testing>
<lol>hello</lol>
<lol>there</lol>
</testing>
</test>
XML;
$xml = simplexml_load_string($string);
echo "All of the XML:\n";
print_r($xml);
echo "\n\nJust the 'lol' array:";
print_r($xml->testing->lol);
Output:
All of the XML:
SimpleXMLElement Object
(
[testing] => SimpleXMLElement Object
(
[lol] => Array
(
[0] => hello
[1] => there
)
)
)
Just the 'lol' array:
SimpleXMLElement Object
(
[0] => hello
)
Why does it output only the [0] instead of the whole array? I don't get it.
What @Yottatron suggested is true, but it does not cover every case, as this example shows.
If your XML looked like this:
<?xml version='1.0'?>
<testing>
<lol>
<lolelem>Lol1</lolelem>
<lolelem>Lol2</lolelem>
<notlol>NotLol1</notlol>
<notlol>NotLol1</notlol>
</lol>
</testing>
Simplexml's output would be:
SimpleXMLElement Object
(
[lol] => SimpleXMLElement Object
(
[lolelem] => Array
(
[0] => Lol1
[1] => Lol2
)
[notlol] => Array
(
[0] => NotLol1
[1] => NotLol1
)
)
)
and by writing
$xml->lol->lolelem
you'd expect your result to be
Array
(
[0] => Lol1
[1] => Lol2
)
but instead of it, you would get :
SimpleXMLElement Object
(
[0] => Lol1
)
and by
$xml->lol->children()
you would get:
SimpleXMLElement Object
(
[lolelem] => Array
(
[0] => Lol1
[1] => Lol2
)
[notlol] => Array
(
[0] => NotLol1
[1] => NotLol1
)
)
What you need to do if you want only the lolelem's:
$xml->xpath("//lol/lolelem")
That gives the following result (not in the shape you might expect, but it contains the right elements):
Array
(
[0] => SimpleXMLElement Object
(
[0] => Lol1
)
[1] => SimpleXMLElement Object
(
[0] => Lol2
)
)
It's because you have two lol elements. In order to access the second you need to do this:
$xml->testing->lol[1];
this will give you "there"
$xml->testing->lol[0];
Will give you "hello"
The children() method of the SimpleXMLElement will give you an object containing all the children of an element for example:
$xml->testing->children();
will give you an object containing all the children of the "testing" SimpleXMLElement.
If you need to iterate, you can use the following code:
foreach($xml->testing->children() as $ele)
{
var_dump($ele);
}
There is more information about SimpleXMLElement here:
http://www.php.net/manual/en/class.simplexmlelement.php
What you might want to do is use JSON encode/decode:
$jsonArray = json_decode(json_encode($xml), true);
With the second argument set to true you get nested arrays, so you access values with ['name'] instead of ->name.
So an example would be:
$xml = simplexml_load_file("someXmlFile.xml");
$jsonArray = json_decode(json_encode($xml), true);
foreach ($jsonArray['title'] as $title) {
// Do something with $title
}
If an element has attributes, they typically end up under an @attributes key. You can read them with: $title = $title['@attributes']
Hope it could help.
Ah yes, I remember simple XML nearly doing my head in with this parsing arrays issue
Try the below code. It will give you an array of LOL elements, or, if you've just got a single LOL element, it will return that in an array as well.
The main advantage of that is you can do something like foreach ($lol as $element) and it will still work on a single (or on 0) LOL element.
<?php
$string = <<<XML
<?xml version='1.0'?>
<test>
<testing>
<lol>hello</lol>
<lol>there</lol>
</testing>
</test>
XML;
$xml = simplexml_load_string($string);
echo "<pre>";
echo "All of the XML:\n";
print_r($xml);
echo "\n\nJust the 'lol' array:\n";
$test_lols = $xml->testing->children();
$childcount = count($test_lols);
if ($childcount < 2) {
$lol = array($test_lols->lol);
}
else {
$lol = (array) $test_lols;
$lol = $lol['lol'];
}
print_r($lol);
?>
Ran into this issue...
Xpath can be a little slow, so you can achieve the same effect with a simple for loop.
for ($i = 0; $i < count($xml->testing->lol); $i++) {
print_r($xml->testing->lol[$i]);
}
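A foreach over the same property is another option: although print_r shows only the first match, iterating $xml->testing->lol visits every sibling with that name. A minimal sketch using the XML from the question:

```php
<?php
$xml = simplexml_load_string(
    '<test><testing><lol>hello</lol><lol>there</lol></testing></test>'
);

// Iterating the property walks all <lol> siblings, not just the first.
$values = [];
foreach ($xml->testing->lol as $lol) {
    $values[] = (string) $lol; // cast to get the text content
}

// count() on the property reports how many siblings share the name.
$total = count($xml->testing->lol);

print_r($values);
echo "Total: ", $total, "\n";
```

The loop also runs zero times when there is no lol element, so no special case is needed for empty or single-element input.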
