Extract pattern from xml file using PHP?

I have a remote XML file. I need to read it, find some values and save them in an array.
I have loaded the file with the following (no problem with this):
$xml_external_path = 'http://example.com/my-file.xml';
$xml = file_get_contents($xml_external_path);
In this file there are many instances of:
<unico>4241</unico>
<unico>234</unico>
<unico>534534</unico>
<unico>2345334</unico>
I need to extract just the numbers from these elements and save them in an array. I guess I need to use a pattern like:
$pattern = '/<unico>(.*?)<\/unico>/';
But I'm not sure what to do next. Keep in mind that it is an .xml file.
Result should be a populated array like this:
$my_array = array (4241, 234, 534534,2345334);

It is better to use XPath to read values out of an XML file. In PHP, DOMXPath is a class that runs XPath queries against a DOMDocument. You query it with path expressions whose syntax resembles Unix paths: // means anywhere in the document and ./ means relative to the selected node. DOMXPath::query() returns a DOMNodeList with all the nodes that match the expression. Note that loadXML() needs a well-formed document with a single root element, so the sample below wraps the elements in a <root> element. The following code will do what you want:
$xmlFile = "<root>
<unico>4241</unico>
<unico>234</unico>
<unico>534534</unico>
<unico>2345334</unico>
</root>";
$xmlDoc = new DOMDocument();
$xmlDoc->loadXML($xmlFile);
$xpath = new DOMXPath($xmlDoc);
// Returns a DOMNodeList of all unico elements in the document.
$unicos = $xpath->query("//unico");
// Prints how many nodes matched the expression (an integer).
echo $unicos->length;
You can find more info on XPath and its syntax here: XPath on Wikipedia#syntax
DOMNodeList implements Traversable, so you can use foreach() to traverse it. If you really want a flat array, you can convert it with a simple loop like the one in question #15807314:
$unicosArr = array();
foreach($unicos as $node){
$unicosArr[] = $node->nodeValue;
}
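To get exactly the integer array from the question, the pieces above can be put together roughly like this. It is only a sketch: the <items> root element is an assumption about the real file, which is not shown, and the cast to int assumes every unico element holds a plain number.

```php
<?php
// Assumption: the real file has a single root element; <items> is a made-up name.
$xmlString = '<items>
<unico>4241</unico>
<unico>234</unico>
<unico>534534</unico>
<unico>2345334</unico>
</items>';

$doc = new DOMDocument();
$doc->loadXML($xmlString);
$xpath = new DOMXPath($doc);

$myArray = array();
foreach ($xpath->query('//unico') as $node) {
    // nodeValue is a string; cast it to get integers as in the question
    $myArray[] = (int) trim($node->nodeValue);
}

print_r($myArray);
```

For the remote file, the string returned by file_get_contents() in the question would take the place of $xmlString.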

Using preg_match_all:
<?php
$xml = '<unico>4241</unico>
<unico>234</unico>
<unico>534534</unico>
<unico>2345334</unico>';
$pattern = '/<unico>(.*?)<\/unico>/';
preg_match_all($pattern, $xml, $result);
// $result[0] holds the full matches including the tags; $result[1] holds the captured numbers.
print_r($result[1]);
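If it is unclear which index of the result holds what, this small sketch may help. It runs the same pattern with both ordering flags of preg_match_all; the variable names $byPattern, $bySet and $numbers are made up for the illustration.

```php
<?php
$xml = '<unico>4241</unico>
<unico>234</unico>';
$pattern = '/<unico>(.*?)<\/unico>/';

// PREG_PATTERN_ORDER (the default): index 0 = full matches, index 1 = first capture group
preg_match_all($pattern, $xml, $byPattern, PREG_PATTERN_ORDER);

// PREG_SET_ORDER: one entry per match, each holding [0 => full match, 1 => capture group]
preg_match_all($pattern, $xml, $bySet, PREG_SET_ORDER);

// Turn the captured strings into integers
$numbers = array_map('intval', $byPattern[1]);

print_r($byPattern);
print_r($bySet);
print_r($numbers);
```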

You could try this. It loops through each line of the file and collects whatever is between the <unico> tags. It assumes each <unico> element sits on its own line.
<?php
$file = "./your.xml";
$pattern = '/<unico>(.*?)<\/unico>/';
$allVars = array();
$currentFile = fopen($file, "r");
if ($currentFile) {
    // Read through the file line by line
    while (($line = fgets($currentFile)) !== false) {
        // Keep the captured text if the line contains a <unico> element
        if (preg_match($pattern, $line, $match)) {
            $allVars[] = $match[1];
        }
    }
    // Only close the handle if it was opened successfully
    fclose($currentFile);
}
print_r($allVars);
Is this sort of what you want? :)

Related

How to get iTunes-specific child nodes of RSS feeds?

I'm trying to process an RSS feed using PHP and there are some tags such as 'itunes:image' which I need to process. The code I'm using is below, and for some reason these elements are not returning any value. The output length is 0.
How can I read these tags and get their attributes?
$f = $_REQUEST['feed'];
$feed = new DOMDocument();
$feed->load($f);
$items = $feed->getElementsByTagName('channel')->item(0)->getElementsByTagName('item');
foreach($items as $key => $item)
{
$title = $item->getElementsByTagName('title')->item(0)->firstChild->nodeValue;
$pubDate = $item->getElementsByTagName('pubDate')->item(0)->firstChild->nodeValue;
$description = $item->getElementsByTagName('description')->item(0)->textContent; // textContent
$arrt = $item->getElementsByTagName('itunes:image');
print_r($arrt);
}

getElementsByTagName is specified by DOM, and PHP is just following that. It doesn't consider namespaces. Instead, use getElementsByTagNameNS, which requires the full namespace URI (not the prefix). This appears to be http://www.itunes.com/dtds/podcast-1.0.dtd*. So:
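// getElementsByTagNameNS returns a DOMNodeList, not a single element
$imgs = $item->getElementsByTagNameNS('http://www.itunes.com/dtds/podcast-1.0.dtd', 'image');
// Set preemptive fallback, then set value if the item has an itunes:image element
$urlImage = '';
if ($imgs->length > 0) {
    $urlImage = $imgs->item(0)->getAttribute('href');
}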
Or put the namespace in a constant.
You might be able to get away with simply removing the prefix and getting all image tags of any namespace with getElementsByTagName.
Make sure to check whether a given item has an itunes:image element at all (example now given); in the example podcast, some don't, and I suspect that was also giving you trouble. (If there's no href attribute, getAttribute will return either null or an empty string per the DOM spec without erroring out.)
*In case you're wondering, there is no actual DTD file hosted at that location, and there hasn't been for about ten years.
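Here is a self-contained sketch of the namespace lookup, with the namespace in a constant. The two-item feed, the constant name ITUNES_NS and the example.com URL are invented for the illustration; only the namespace URI comes from the answer above.

```php
<?php
// Made-up feed: only the first item has an itunes:image element.
$rss = '<rss version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd">
<channel>
<item><title>One</title><itunes:image href="http://example.com/one.jpg"/></item>
<item><title>Two</title></item>
</channel>
</rss>';

// Hypothetical constant name holding the namespace URI
const ITUNES_NS = 'http://www.itunes.com/dtds/podcast-1.0.dtd';

$feed = new DOMDocument();
$feed->loadXML($rss);

$images = array();
foreach ($feed->getElementsByTagName('item') as $item) {
    $title = $item->getElementsByTagName('title')->item(0)->textContent;
    // Look the element up by namespace URI and local name, not by prefix
    $imgs = $item->getElementsByTagNameNS(ITUNES_NS, 'image');
    // Fall back to an empty string when the item has no itunes:image
    $images[$title] = $imgs->length > 0 ? $imgs->item(0)->getAttribute('href') : '';
}

print_r($images);
```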

<?php
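$rss_feed = simplexml_load_file("url link");
if (!empty($rss_feed)) {
    foreach ($rss_feed->channel->item as $feed_item) {
        // Children of this item in the itunes namespace; true means 'itunes' is a prefix
        $itunes = $feed_item->children('itunes', true);
        // Skip items that have no itunes:image element
        if (isset($itunes->image)) {
            echo $itunes->image->attributes()->href;
        }
    }
}
?>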
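The same idea with SimpleXML can be tried without a real feed. This is a sketch under assumptions: the feed string and URL are made up, and it passes the full namespace URI to children() instead of the prefix, which keeps working if a feed declares the namespace under a different prefix.

```php
<?php
// Made-up feed: only the first item has an itunes:image element.
$rss = '<rss version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd">
<channel>
<item><title>One</title><itunes:image href="http://example.com/one.jpg"/></item>
<item><title>Two</title></item>
</channel>
</rss>';

$feed = simplexml_load_string($rss);

$hrefs = array();
foreach ($feed->channel->item as $item) {
    // Children of this item that live in the iTunes namespace
    $itunes = $item->children('http://www.itunes.com/dtds/podcast-1.0.dtd');
    if (isset($itunes->image)) {
        // attributes() without arguments returns the un-namespaced attributes such as href
        $hrefs[] = (string) $itunes->image->attributes()->href;
    }
}

print_r($hrefs);
```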

Extract text between multilevel repetitive xml tags using Php

I am trying to extract text between Multilevel XML tags.
This is the data file
<eSearchResult>
<Count>7117</Count>
<RetMax>10</RetMax>
<RetStart>0</RetStart>
<QueryKey>1</QueryKey>
<WebEnv>
NCID_1_457044331_130.14.22.215_9001_1401819380_1399850995
</WebEnv>
<IdList>
<Id>24887359</Id>
<Id>24884828</Id>
<Id>24884718</Id>
<Id>24884479</Id>
<Id>24882343</Id>
<Id>24879340</Id>
<Id>24871662</Id>
<Id>24870721</Id>
<Id>24864115</Id>
<Id>24863809</Id>
</IdList>
<TranslationSet/>
<TranslationStack>
<TermSet>
<Term>BRCA1[tiab]</Term>
.
.
.
</TranslationStack>
</eSearchResult>
I just want to extract the ten IDs between the <Id></Id> tags enclosed inside <IdList></IdList>.
Regex gets me just the first value out of the ten.
preg_match_all('~<Id>(.+?)<\/Id>~', $temp_str, $pids);
The XML data is stored in the $temp_str variable and I am trying to get the values stored in $pids.
Any other suggestions for how to go about this?

Using preg_match_all (http://www.php.net/manual/en/function.preg-match-all.php), I've included a regex that matches digits within an <Id> tag. The trickiest part (I think) is in the foreach loop, where I iterate over $out[1]. This is because, according to the manual page above, PREG_PATTERN_ORDER
Orders results so that $matches[0] is an array of full pattern
matches, $matches[1] is an array of strings matched by the first
parenthesized subpattern, and so on.
preg_match_all('/<Id>\s*(\d+)\s*<\/Id>/',
"<eSearchResult>
<Count>7117</Count>
<RetMax>10</RetMax>
<RetStart>0</RetStart>
<QueryKey>1</QueryKey>
<WebEnv>
NCID_1_457044331_130.14.22.215_9001_1401819380_1399850995
</WebEnv>
<IdList>
<Id>24887359</Id>
<Id>24884828</Id>
<Id>24884718</Id>
<Id>24884479</Id>
<Id>24882343</Id>
<Id>24879340</Id>
<Id>24871662</Id>
<Id>24870721</Id>
<Id>24864115</Id>
<Id>24863809</Id>
</IdList>
<TranslationSet/>
<TranslationStack>
<TermSet>
<Term>BRCA1[tiab]</Term>
</TranslationStack>
</eSearchResult>",
$out,PREG_PATTERN_ORDER);
foreach ($out[1] as $o){
echo $o;
echo "\n";
}
?>

You should use PHP's XPath capabilities, as explained here:
http://www.w3schools.com/php/func_simplexml_xpath.asp
Example:
<?php
$xml = simplexml_load_file("searchdata.xml");
$result = $xml->xpath("IdList/Id");
print_r($result);
?>
XPath is flexible, can be used conditionally, and is supported in a wide variety of other languages as well. It is also more readable and easier to write than regex, as you can construct conditional queries without using lookaheads.
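Note that xpath() returns an array of SimpleXMLElement objects, not strings. A minimal sketch of turning them into plain strings, using a shortened inline copy of the data instead of searchdata.xml (the inline string and the variable name $ids are assumptions for the example):

```php
<?php
// Shortened stand-in for the eSearchResult document from the question
$xmlString = '<eSearchResult>
<Count>7117</Count>
<IdList>
<Id>24887359</Id>
<Id>24884828</Id>
</IdList>
</eSearchResult>';

$xml = simplexml_load_string($xmlString);

// The path is relative to the root element, so IdList/Id finds the Id children of IdList
$ids = array_map('strval', $xml->xpath('IdList/Id'));

print_r($ids);
```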

Use this pattern with preg_match_all (PHP has no g option; preg_match_all already returns every match): (?:\<IdList\>|\G)\s*\<Id\>(\d+)\<\/Id\>

Do not use PCRE to parse XML. There are CSS selectors and, even better, XPath to fetch parts of an XML DOM.
If you want any Id element in the first IdList of the eSearchResult:
/eSearchResult/IdList[1]/Id
As you can see, XPath "knows" about the actual structure of an XML document. PCRE does not.
You need to create a DOMXPath object for a DOM document:
$dom = new DOMDocument();
$dom->loadXML($xmlString);
$xpath = new DOMXPath($dom);
$result = [];
foreach ($xpath->evaluate('/eSearchResult/IdList[1]/Id') as $id) {
    $result[] = trim($id->nodeValue);
}
var_dump($result);

Parsing XML with PHP (simplexml)

Firstly, may I point out that I am a newcomer to all things PHP so apologies if anything here is unclear and I'm afraid the more layman the response the better. I've been having real trouble parsing an xml file in to php to then populate an HTML table for my website. At the moment, I have been able to get the full xml feed in to a string which I can then echo and view and all seems well. I then thought I would be able to use simplexml to pick out specific elements and print their content but have been unable to do this.
The xml feed will be constantly changing (structure remaining the same) and is in compressed format. From various sources I've identified the following commands to get my feed in to the right format within a string although I am still unable to print specific elements. I've tried every combination without any luck and suspect I may be barking up the wrong tree. Could someone please point me in the right direction?!
$file = fopen("compress.zlib://$url", 'r');
$xmlstr = file_get_contents($url);
$xml = new SimpleXMLElement($url,null,true);
foreach($xml as $name) {
echo "{$name->awCat}\r\n";
}
Many, many thanks in advance,
Chris
PS The actual feed

Since no one followed my closevote, I think I can just as well put my own comments as an answer:
First of all, SimpleXml can load URIs directly and it can do so with stream wrappers, so your three calls in the beginning can be shortened to (note that you are not using $file at all)
$merchantProductFeed = new SimpleXMLElement("compress.zlib://$url", null, TRUE);
To get the values you can either use the implicit SimpleXml API and drill down to the wanted elements (like shown multiple times elsewhere on the site):
foreach ($merchantProductFeed->merchant->prod as $prod) {
echo $prod->cat->awCat , PHP_EOL;
}
or you can use an XPath query to get at the wanted elements directly
$xml = new SimpleXMLElement("compress.zlib://$url", null, TRUE);
foreach ($xml->xpath('/merchantProductFeed/merchant/prod/cat/awCat') as $awCat) {
echo $awCat, PHP_EOL;
}
Live Demo
Note that fetching all awCat elements from the source XML is rather pointless though, because all of them have "Bodycare & Fitness" as their value. Of course you can also mix XPath and the implicit API and just fetch the prod elements and then drill down to their various children.
Using XPath should be somewhat faster than iterating over the SimpleXMLElement object graph, though the difference is negligible (read 0.000x vs 0.000y) for your feed. Still, if you plan to do more XML work, it pays off to familiarize yourself with XPath, because it's quite powerful. Think of it as SQL for XML.
For additional examples see
A simple program to CRUD node and node values of xml file and
PHP Manual - SimpleXml Basic Examples
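To give an idea of the "SQL for XML" point, here is a sketch of an XPath predicate acting like a WHERE clause. The element names follow the structure used in the answers on this page, but the two products, their names and their prices are invented for the example.

```php
<?php
// Invented sample with the same nesting as the feed discussed above
$feed = '<merchantProductFeed><merchant>
<prod><pId>1</pId><text><name>Yoga mat</name></text><cat><awCat>Bodycare &amp; Fitness</awCat></cat><price><buynow>12.50</buynow></price></prod>
<prod><pId>2</pId><text><name>Treadmill</name></text><cat><awCat>Bodycare &amp; Fitness</awCat></cat><price><buynow>499.00</buynow></price></prod>
</merchant></merchantProductFeed>';

$xml = new SimpleXMLElement($feed);

// The predicate in square brackets keeps only products whose buynow price is below 100
$names = array();
foreach ($xml->xpath('/merchantProductFeed/merchant/prod[price/buynow < 100]/text/name') as $name) {
    $names[] = (string) $name;
}

print_r($names);
```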

Try this...
$url = "http://datafeed.api.productserve.com/datafeed/download/apikey/58bc4442611e03a13eca07d83607f851/cid/97,98,142,144,146,129,595,539,147,149,613,626,135,163,168,159,169,161,167,170,137,171,548,174,183,178,179,175,172,623,139,614,189,194,141,205,198,206,203,208,199,204,201,61,62,72,73,71,74,75,76,77,78,79,63,80,82,64,83,84,85,65,86,87,88,90,89,91,67,92,94,33,54,53,57,58,52,603,60,56,66,128,130,133,212,207,209,210,211,68,69,213,216,217,218,219,220,221,223,70,224,225,226,227,228,229,4,5,10,11,537,13,19,15,14,18,6,551,20,21,22,23,24,25,26,7,30,29,32,619,34,8,35,618,40,38,42,43,9,45,46,651,47,49,50,634,230,231,538,235,550,240,239,241,556,245,244,242,521,576,575,577,579,281,283,554,285,555,303,304,286,282,287,288,173,193,637,639,640,642,643,644,641,650,177,379,648,181,645,384,387,646,598,611,391,393,647,395,631,602,570,600,405,187,411,412,413,414,415,416,649,418,419,420,99,100,101,107,110,111,113,114,115,116,118,121,122,127,581,624,123,594,125,421,604,599,422,530,434,532,428,474,475,476,477,423,608,437,438,440,441,442,444,446,447,607,424,451,448,453,449,452,450,425,455,457,459,460,456,458,426,616,463,464,465,466,467,427,625,597,473,469,617,470,429,430,615,483,484,485,487,488,529,596,431,432,489,490,361,633,362,366,367,368,371,369,363,372,373,374,377,375,536,535,364,378,380,381,365,383,385,386,390,392,394,396,397,399,402,404,406,407,540,542,544,546,547,246,558,247,252,559,255,248,256,265,259,632,260,261,262,557,249,266,267,268,269,612,251,277,250,272,270,271,273,561,560,347,348,354,350,352,349,355,356,357,358,359,360,586,590,592,588,591,589,328,629,330,338,493,635,495,507,563,564,567,569,568/mid/2891/columns/merchant_id,merchant_name,aw_product_id,merchant_product_id,product_name,description,category_id,category_name,merchant_category,aw_deep_link,aw_image_url,search_price,delivery_cost,merchant_deep_link,merchant_image_url/format/xml/compression/gzip/";
$zd = gzopen($url, "r");
$data = gzread($zd, 1000000);
gzclose($zd);
if ($data !== false) {
$xml = simplexml_load_string($data);
foreach ($xml->merchant->prod as $pr) {
echo $pr->cat->awCat . "<br>";
}
}
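One caveat with the code above: gzread($zd, 1000000) returns at most 1,000,000 uncompressed bytes, so a larger feed would be cut off and fail to parse. A possible variation is to read in chunks until gzeof() reports the end. The sketch below writes a tiny gzip file to a temporary path first so it can run without the real feed; that file and its content are made up, and the zlib extension is assumed to be available.

```php
<?php
// Build a small gzip file so the sketch runs without network access
$path = tempnam(sys_get_temp_dir(), 'feed');
file_put_contents($path, gzencode(
    '<merchantProductFeed><merchant><prod><cat><awCat>Bodycare &amp; Fitness</awCat></cat></prod></merchant></merchantProductFeed>'
));

// Read the whole stream in chunks instead of a single fixed-size read
$zd = gzopen($path, 'r');
$data = '';
while (!gzeof($zd)) {
    $data .= gzread($zd, 8192);
}
gzclose($zd);
unlink($path);

$xml = simplexml_load_string($data);
$cats = array();
foreach ($xml->merchant->prod as $pr) {
    $cats[] = (string) $pr->cat->awCat;
}

print_r($cats);
```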

<?php
$xmlstr = file_get_contents("compress.zlib://$url");
$xml = simplexml_load_string($xmlstr);
// you can traverse the XML tree however you want
foreach ($xml->merchant->prod as $line) {
// $line->cat->awCat -> you can use this
}
more information here

Use print_r($xml) to see the structure of the parsed XML feed.
Then it becomes obvious how you would traverse it:
foreach ($xml->merchant->prod as $prod) {
print $prod->pId;
print $prod->text->name;
print $prod->cat->awCat; # <-- which is what you wanted
print $prod->price->buynow;
}

$url = 'your URL here';
$f = gzopen ($url, 'r');
$xml = new SimpleXMLElement (fread ($f, 1000000));
foreach($xml->xpath ('//prod') as $name)
{
echo (string) $name->cat->awCatId, "\r\n";
}

Remove multiple empty nodes with SimpleXML

I want to delete all the empty nodes in my XML document using SimpleXML
Here is my code :
$xs = file_get_contents('liens.xml')or die("Fichier XML non chargé");
$doc_xml = new SimpleXMLElement($xs);
foreach($doc_xml->xpath('//*[not(text())]') as $torm)
unset($torm);
$doc_xml->asXML("liens.xml");
I saw with a print_r() that XPath is grabbing something, but nothing is removed from my XML file.

$file = 'liens.xml';
$xpath = '//*[not(text())]';
if (!$xml = simplexml_load_file($file)) {
throw new Exception("Fichier XML non chargé");
}
foreach ($xml->xpath($xpath) as $remove) {
unset($remove[0]);
}
$xml->asXML($file);

I know this post is a bit old, but in your foreach, $torm is just a local variable that is reassigned on every iteration. This means your unset($torm) only destroys that variable and does nothing to the original $doc_xml object.
Instead you will need to remove the element itself by using a SimpleXMLElement self-reference:
foreach($doc_xml->xpath('//*[not(text())]') as $torm)
unset($torm[0]);
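A small sketch to try the self-reference on an inline document rather than liens.xml. The sample content is made up. One thing to watch: //*[not(text())] also matches container elements that have child elements but no text of their own, so the sample keeps whitespace inside the root to stop the root itself from matching.

```php
<?php
// Made-up stand-in for liens.xml; the line breaks give the root element text nodes
$doc = new SimpleXMLElement('<liens>
<lien>http://example.com</lien>
<lien/>
<lien></lien>
</liens>');

foreach ($doc->xpath('//*[not(text())]') as $torm) {
    // $torm[0] refers to the element itself, so unset() removes it from the document
    unset($torm[0]);
}

echo $doc->asXML();
```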

PHP: Get array of text from particular XML node type?

I am not totally new to PHP or XML but I am 100% new to parsing XML with PHP. I have an XML string that has several nodes, but the only ones I am interested in are the <keyword> nodes. There is an uncertain number of them, each containing a phrase like so: <keyword>blue diamond jewelry</keyword>. For example, say the string looked like this:
<xml>
<pointless_node/>
<seq>
<keyword>diamond ring</keyword>
<keyword>ruby necklace</keyword>
<keyword>mens watch</keyword>
</seq>
<some_node/>
</xml>
I would want an array like this:
['diamond ring','ruby necklace','mens watch']
I tried looking at the PHP manual and just get confused and not sure what to do. Can someone please walk me through how to do this? I am using PHP4.
THANKS!

This turns $keywords into an array of
Objects. Is there a way to get the
text from the objects?
Sure, see this.
$dom = domxml_open_mem($str);
$keywords = $dom->get_elements_by_tagname('keyword');
foreach($keywords as $keyword) {
$text = $keyword->get_content();
// Whatever
}

xml_parse_into_struct() might be what you're looking for.
It works in PHP versions >= 4.
http://se.php.net/xml_parse_into_struct
http://www.w3schools.com/PHP/func_xml_parse_into_struct.asp
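A rough sketch of how xml_parse_into_struct() could produce the wanted array. It fills two arrays: $values with one entry per parsed element, and $index mapping each tag name to positions in $values. The variable names are made up for the example.

```php
<?php
$xmlstr = '<xml>
<pointless_node/>
<seq>
<keyword>diamond ring</keyword>
<keyword>ruby necklace</keyword>
<keyword>mens watch</keyword>
</seq>
<some_node/>
</xml>';

$parser = xml_parser_create();
// Keep tag names as written instead of folding them to upper case
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
xml_parse_into_struct($parser, $xmlstr, $values, $index);

$keywords = array();
// $index['keyword'] lists the positions of all keyword elements in $values
foreach ($index['keyword'] as $i) {
    $keywords[] = $values[$i]['value'];
}

print_r($keywords);
```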

I think the easiest is:
$dom = domxml_open_mem($str);
$keywords = $dom->get_elements_by_tagname('keyword');

see: http://www.php.net/simplexml-element-xpath
Try the following xpath and array construction
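$xmlstr = "<xml>
<pointless_node/>
<seq>
<keyword>diamond ring</keyword>
<keyword>ruby necklace</keyword>
<keyword>mens watch</keyword>
</seq>
<some_node/>
</xml>";
$xml = domxml_open_mem($xmlstr);
$xpath = $xml->xpath_new_context();
$result = xpath_eval($xpath, '//keyword');
// Collect the text of each node in a separate array so $result is not overwritten
$keywords = array();
foreach ($result->nodeset as $node)
{
    // get_content() returns the text only; dump_node() would include the tags
    $keywords[] = $node->get_content();
}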
edit: modified code to reflect php 4 requirements
edit: modified to account for poorly documented behaviour of xpath_new_context (php docs comments point out the error)
