I am trying to use PHP to read a large XML file (gzipped). The file consists of repeated products (actually books). Each book has 1 or more contributors. This is an example of a product.
<Product>
<ProductIdentifier>
<IDTypeName>EAN.UCC-13</IDTypeName>
<IDValue>9999999999999</IDValue>
</ProductIdentifier>
<Contributor>
<SequenceNumber>1</SequenceNumber>
<ContributorRole>A01</ContributorRole>
<PersonNameInverted>Bloggs, Joe</PersonNameInverted>
</Contributor>
<Contributor>
<SequenceNumber>2</SequenceNumber>
<ContributorRole>A01</ContributorRole>
<PersonNameInverted>Jones, John</PersonNameInverted>
</Contributor>
<Contributor>
<SequenceNumber>3</SequenceNumber>
<ContributorRole>B01</ContributorRole>
<PersonNameInverted>Other, An</PersonNameInverted>
</Contributor>
</Product>
The output I want for this example is:
Array
(
[1] => 9999999999999
[2] => Bloggs, Joe(A01)
[3] => Jones, John(A01)
[4] => Other, An(B01)
)
My code loads the gzipped XML file and handles the repeated sequence of products with no problem, but I cannot get it to handle the repeated sequence of contributors. My code for handling the products and the first contributor is shown below. I have tried various ways of looping through the contributors but cannot achieve what I need. I'm a beginner with PHP and XML, although I have been an IT professional for many years.
$reader = new XMLReader();
//load the selected XML file to the DOM
if(!$reader->open("compress.zlib://filename.xml.gz","r")){
die('Failed to open file!');
}
while ($reader->read()):
if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'Product')
{
$xml = simplexml_load_string($reader->readOuterXML());
list($result) = $xml->xpath('//ProductIdentifier[IDTypeName = "EAN.UCC-13"]');
$line[1] = (string)$result->IDValue;
list($result) = $xml->xpath('//Contributor');
$contributorname = (string)$result->PersonNameInverted;
$role = (string)$result->ContributorRole;
$line[2] = $contributorname."(".$role.")";
echo '<pre>'; print_r($line); echo '</pre>';
}
endwhile;
Since a product can have several contributors, you must treat the XPath result as an array and loop over it to build your final variable:
<?php
$reader = new XMLReader();
// open the gzipped XML file as a stream (nothing is loaded into a DOM here);
// the second argument of open() is an encoding, not a file mode, so "r" is dropped
if (!$reader->open("compress.zlib://filename.xml.gz")) {
    die('Failed to open file!');
}
while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'Product') {
        $xml = simplexml_load_string($reader->readOuterXML());
        // start a fresh row for every product so entries from the previous one cannot leak in
        $line = array();
        list($result) = $xml->xpath('//ProductIdentifier[IDTypeName = "EAN.UCC-13"]');
        $line[1] = (string)$result->IDValue;
        // get all contributors in an array
        $contributors = $xml->xpath('//Contributor');
        $i = 2;
        // go through all contributors
        foreach ($contributors as $contributor) {
            $contributorname = (string)$contributor->PersonNameInverted;
            $role = (string)$contributor->ContributorRole;
            $line[$i] = $contributorname."(".$role.")";
            $i++;
        }
        echo '<pre>'; print_r($line); echo '</pre>';
    }
}
This will give you the following output:
Array
(
[1] => 9999999999999
[2] => Bloggs, Joe(A01)
[3] => Jones, John(A01)
[4] => Other, An(B01)
)
EDIT: Some explanation of what is wrong with your code. Instead of taking all the contributors, you take only the first one with list() (http://php.net/manual/en/function.list.php), which assigns the values of an array to a fixed set of variables. Since you don't know in advance how many contributors a product has, you cannot use it here.
You then assign that first contributor to $line, so you always end up with only one.
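To try the loop without a gzipped file, here is a minimal sketch that feeds an inline XML string to XMLReader through XML() instead of open(). The sample data (two products, the second with fewer contributors) and the $lines collector are assumptions made for illustration; they are not part of the original question.

```php
<?php
// Made-up sample data: the second product has fewer contributors than the first.
$xmlString = <<<'XML'
<Products>
  <Product>
    <ProductIdentifier><IDTypeName>EAN.UCC-13</IDTypeName><IDValue>9999999999999</IDValue></ProductIdentifier>
    <Contributor><ContributorRole>A01</ContributorRole><PersonNameInverted>Bloggs, Joe</PersonNameInverted></Contributor>
    <Contributor><ContributorRole>B01</ContributorRole><PersonNameInverted>Other, An</PersonNameInverted></Contributor>
  </Product>
  <Product>
    <ProductIdentifier><IDTypeName>EAN.UCC-13</IDTypeName><IDValue>8888888888888</IDValue></ProductIdentifier>
    <Contributor><ContributorRole>A01</ContributorRole><PersonNameInverted>Jones, John</PersonNameInverted></Contributor>
  </Product>
</Products>
XML;

$reader = new XMLReader();
$reader->XML($xmlString);

$lines = [];
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'Product') {
        $product = simplexml_load_string($reader->readOuterXml());

        // Fresh row per product, so nothing from the previous product survives.
        $line = [];
        $ids = $product->xpath('ProductIdentifier[IDTypeName = "EAN.UCC-13"]/IDValue');
        $line[1] = isset($ids[0]) ? (string) $ids[0] : '';

        // xpath() returns an array, so every contributor can be visited in turn.
        $i = 2;
        foreach ($product->xpath('Contributor') as $contributor) {
            $line[$i++] = (string) $contributor->PersonNameInverted
                . '(' . (string) $contributor->ContributorRole . ')';
        }
        $lines[] = $line;
    }
}
$reader->close();

print_r($lines);
```

Resetting $line inside the loop matters as soon as products have different numbers of contributors; without it, the second product in this sample would still carry the third entry from the first.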
I am trying to read XML data in PHP. The XML data is coming from an API whose link is the following: https://seekingalpha.com/api/sa/combined/AAPL.xml
I just need the News Headline, News Link, Published Date and Author Name of the first five news items from the API. To do this, I am using the following PHP code:
$note = "https://seekingalpha.com/api/sa/combined/".$symbolValue.".xml";
$xml=simplexml_load_file($note);
$jsonArray = array();
for ($i=0; $i<5; $i++) {
$newsHeadline = $xml->channel->item[$i]->title->__toString();
$newsLink = $xml->channel->item[$i]->link->__toString();
$publishedDate = $xml->channel->item[$i]->pubDate->__toString();
$authorName = $xml->channel->item[$i]->sa:author_name->__toString();
$temp = array('Title' => $newsHeadline, 'Link' => $newsLink,'Publish'=>$publishedDate,'Author'=>$authorName);
array_push($jsonArray,$temp);
}
$jsonNews = json_encode($jsonArray);
$completeData[9] = $jsonNews;
In the above code, $note contains the link to the API. $symbolValue is the value I am getting from the front end. My code works fine until I access the author name, i.e. the following line of code:
$authorName = $xml->channel->item[$i]->sa:author_name->__toString();
I am getting the following error:
Parse error: syntax error, unexpected ':' in /home/File/Path
It seems like I am not supposed to use the ":" for fetching the author name.
So, how do I get the author name and put it in $temp such that the key for the author name is "Author"?
Please have a look at the API to get an idea about the XML file.
The children() method supports an argument for the namespace you want to read. This is required when you want to read elements which are not in the current/default namespace.
$xmldata = <<<'XML'
<?xml version="1.0"?>
<foobar xmlns:x="http://example.org/">
<abc>test</abc>
<x:def>content</x:def>
</foobar>
XML;
$xml = new SimpleXMLElement($xmldata);
$other = $xml->children('http://example.org/');
var_dump((string)$other->def);
This will output the value "content". The expression $xml->def will not, because that element is not in the current/default namespace.
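As a small sketch of the same idea, children() also accepts the prefix instead of the namespace URI when its second argument is true, and xpath() can reach the element once a prefix is registered. The prefix name ex below is an arbitrary choice for illustration.

```php
<?php
$xmldata = <<<'XML'
<?xml version="1.0"?>
<foobar xmlns:x="http://example.org/">
<abc>test</abc>
<x:def>content</x:def>
</foobar>
XML;

$xml = new SimpleXMLElement($xmldata);

// Select the namespaced children by namespace URI.
$byUri = (string) $xml->children('http://example.org/')->def;

// Select them by prefix: the second argument marks the first as a prefix.
$byPrefix = (string) $xml->children('x', true)->def;

// Plain property access only sees elements without a namespace.
$plain = (string) $xml->def;

// XPath works too, using any prefix bound to the same URI.
$xml->registerXPathNamespace('ex', 'http://example.org/');
$viaXpath = (string) $xml->xpath('ex:def')[0];

var_dump($byUri, $byPrefix, $plain, $viaXpath);
```

Selecting by URI is usually the more robust choice, since a feed is free to change its prefix while keeping the same namespace.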
If your property name contains, for example, a :, you could use curly braces so that the code at least parses:
->{'sa:author_name'}
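That alone does not return the author, though: the values you are looking for are in the https://seekingalpha.com/api/1.0 namespace, and SimpleXML does not resolve a prefix inside a property name.
You could use the children() method of SimpleXMLElement and pass the namespace: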
$authorName = (string)$xml->channel->item[$i]->children('https://seekingalpha.com/api/1.0')->{'author_name'};
Or you could use xpath.
$authorName = (string)$xml->channel->item[$i]->xpath('sa:author_name')[0];
For example:
$jsonArray = array();
for ($i = 0; $i < 5; $i++) {
$newsHeadline = $xml->channel->item[$i]->title->__toString();
$newsLink = $xml->channel->item[$i]->link->__toString();
$publishedDate = $xml->channel->item[$i]->pubDate->__toString();
$authorName = (string)$xml->channel->item[$i]->xpath('sa:author_name')[0];
// or with children() and the namespace:
// $authorName = (string)$xml->channel->item[$i]->children('https://seekingalpha.com/api/1.0')->{'author_name'};
$temp = array('Title' => $newsHeadline, 'Link' => $newsLink, 'Publish' => $publishedDate, 'Author' => $authorName);
array_push($jsonArray, $temp);
}
$jsonNews = json_encode($jsonArray);
print_r($jsonArray);
Will give you:
Array
(
[0] => Array
(
[Title] => Apple acknowledges iPhone X issue in some devices, plans fix
[Link] => https://seekingalpha.com/symbol/AAPL/news?source=feed_symbol_AAPL
[Publish] => Fri, 10 Nov 2017 12:04:42 -0500
[Author] => Brandy Betz
)
etc...
I am working with XML and PHP. I am trying to turn an array of PHP results into this XML format with two <Customer> elements. I can produce one, but I don't know how to do it for two customers from an array.
<?xml version="1.0"?>
<NewSLSCase>
<Identification CientNo="4xxx" CLientBatchNo="1" IJBatchNo="1" HubNo="201048" ClientEmailAddress1="etunimi.sukunimi@yritys.fi" ClientEmailAddress2="etunimi.sukunimi@yritys.fi"/>
<LedgerSet>
<Ledger ProductionUnit="04" LedgerNo="4xxxxx">
<CustomerSet>
<Customer No="123456" Name="Farhana" Address1="Myhouse" ZipCode="40100" City="Vantaa" LanguageCode="FIN" CountryCode="FI" VatNo="01011-111A" TypeCode="C"/>
<Customer No="123457" Name="Asiakas Anna" Address1="Asiakastie 2" ZipCode="00100" City=" Helsinki" LanguageCode="FIN" CountryCode="FI" VATNo="010111-111A" TypeCode="C"/>
</CustomerSet>
<InvoiceSet>
<Invoice InvoiceNo="123456"/>
<InvoiceHeader InvoiceHeaderType="L" InvoiceCustomerNo="123456" InvoiceCurrency="EUR" InvoiceDate="2011-04-08" InvoiceDueDate="2011-05-15" InvoiceAmount="57.00" ClientOCRReference="908000101800031186" InvoiceReferenceText3="150,00" InvoiceReferenceText4="Laskuille tulostuva" InvoiceReferenceText5="IBAN" InvoiceReferenceText6="BIC"/>
</InvoiceSet>
</Ledger>
</LedgerSet>
</NewSLSCase>
I wrote the code for producing just one <Customer>, but I am truly not sure how to add another one.
Here is the array. I did not copy the whole thing since it's big; I am using a small part of it for testing.
array(2) (
[0] => stdClass object {
ID => (string) 177835
currency => (string) Asiakas Anna
type => (string) Asiakastie 2
}
[1] => stdClass object {
ID => (string) 177840
Name=> (string) Farhana
Address=> (string) Myhouse
.....
Here is the code that builds the XML:
$result = $this->invoice_page->getInvoiceByID('20241'); //this is the array
$xml = new DomDocument('1.0');
$xml->formatOutput = true;
$NewSLSCase = $xml->createElement("NewSLSCase");
$xml->appendChild($NewSLSCase);
$Identification = $xml->createElement("Identification");
$Identification->setAttribute("CientNo","4xxx");
$Identification->setAttribute("CLientBatchNo", "1");
$Identification->setAttribute("IJBatchNo", "1");
$Identification->setAttribute("HubNo", "201048");
$Identification->setAttribute("ClientEmailAddress1", "etunimi.sukunimi@yritys.fi");
$Identification->setAttribute("ClientEmailAddress2", "etunimi.sukunimi@yritys.fi");
$NewSLSCase->appendChild($Identification);
$LedgerSet = $xml->createElement("LedgerSet");
$NewSLSCase->appendChild($LedgerSet);
$Ledger = $xml->createElement("Ledger");
$Ledger->setAttribute("ProductionUnit","04");
$Ledger->setAttribute("LedgerNo", "4xxxxx");
$LedgerSet->appendChild($Ledger);
$CustomerSet = $xml->createElement("CustomerSet");
$Ledger->appendChild($CustomerSet);
$Customer = $xml->createElement("Customer");
$Customer->setAttribute("No","123456");
$Customer->setAttribute("Name", $result['Fullname']);
$Customer->setAttribute("Address1", $result['Address_street']);
$Customer->setAttribute("ZipCode", $result['Address_zip']);
$Customer->setAttribute("City",$result['Address_city']);
$Customer->setAttribute("LanguageCode", "FIN");
$Customer->setAttribute("CountryCode","FI");
$Customer->setAttribute("VatNo", "01011-111A");
$Customer->setAttribute("TypeCode","C");
$CustomerSet->appendChild($Customer);
$InvoiceSet = $xml->createElement("InvoiceSet");
$Ledger->appendChild($InvoiceSet);
$Invoice = $xml->createElement("Invoice");
$Invoice->setAttribute("InvoiceNo","123456");
$InvoiceSet->appendChild($Invoice);
$InvoiceHeader = $xml->createElement("InvoiceHeader");
$InvoiceHeader->setAttribute("InvoiceHeaderType", "L");
$InvoiceHeader->setAttribute("InvoiceCustomerNo","123456");
$InvoiceHeader->setAttribute("InvoiceCurrency", "EUR");
$InvoiceHeader->setAttribute("InvoiceDate","2011-04-08");
$InvoiceHeader->setAttribute("InvoiceDueDate", "2011-05-15");
$InvoiceHeader->setAttribute("InvoiceAmount","57.00");
$InvoiceHeader->setAttribute("ClientOCRReference", "908000101800031186");
$InvoiceHeader->setAttribute("InvoiceReferenceText3","150,00");
$InvoiceHeader->setAttribute("InvoiceReferenceText4","Laskuille tulostuva");
$InvoiceHeader->setAttribute("InvoiceReferenceText5",$result['IBAN']);
$InvoiceHeader->setAttribute("InvoiceReferenceText6",$result['BIC']);
$InvoiceSet->appendChild($InvoiceHeader);
echo "<xmp>".$xml->saveXML()."</xmp>";
What the above code gives me is this:
<?xml version="1.0"?>
<NewSLSCase>
<Identification CientNo="4xxx" CLientBatchNo="1" IJBatchNo="1" HubNo="201048" ClientEmailAddress1="etunimi.sukunimi@yritys.fi" ClientEmailAddress2="etunimi.sukunimi@yritys.fi"/>
<LedgerSet>
<Ledger ProductionUnit="04" LedgerNo="4xxxxx">
<CustomerSet>
<Customer No="123456" Name="Farhana" Address1="Myhouse" ZipCode="40100" City="Vantaa" LanguageCode="FIN" CountryCode="FI" VatNo="01011-111A" TypeCode="C"/>
</CustomerSet>
<InvoiceSet>
<Invoice InvoiceNo="123456"/>
<InvoiceHeader InvoiceHeaderType="L" InvoiceCustomerNo="123456" InvoiceCurrency="EUR" InvoiceDate="2011-04-08" InvoiceDueDate="2011-05-15" InvoiceAmount="57.00" ClientOCRReference="908000101800031186" InvoiceReferenceText3="150,00" InvoiceReferenceText4="Laskuille tulostuva" InvoiceReferenceText5="IBAN" InvoiceReferenceText6="BIC"/>
</InvoiceSet>
</Ledger>
</LedgerSet>
</NewSLSCase>
I have no clue how to add a foreach to get the information for two customers.
Use a foreach loop over the customers:
foreach ($customersArray as $customerData) {
$Customer = $xml->createElement("Customer");
$Customer->setAttribute("No",$customerData->ID);
//set all attributes like that
$CustomerSet->appendChild($Customer);
}
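A fuller sketch of that loop follows. The $customers array and its property names (ID, Name, Address) are made-up stand-ins shaped like the objects in the question; adjust them to whatever getInvoiceByID() actually returns.

```php
<?php
// Hypothetical sample data: objects shaped like the array in the question.
$customers = [
    (object) ['ID' => '177835', 'Name' => 'Asiakas Anna', 'Address' => 'Asiakastie 2'],
    (object) ['ID' => '177840', 'Name' => 'Farhana', 'Address' => 'Myhouse'],
];

$xml = new DOMDocument('1.0');
$xml->formatOutput = true;

$customerSet = $xml->createElement('CustomerSet');
$xml->appendChild($customerSet);

// One <Customer> element per entry; create a new element on every iteration.
foreach ($customers as $customerData) {
    $customer = $xml->createElement('Customer');
    $customer->setAttribute('No', $customerData->ID);
    $customer->setAttribute('Name', $customerData->Name);
    $customer->setAttribute('Address1', $customerData->Address);
    $customerSet->appendChild($customer);
}

echo $xml->saveXML();
```

The important detail is that createElement() is called inside the loop. Reusing a single $Customer element and only changing its attributes would leave one element in the output, because appendChild() moves an existing node rather than copying it.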
I have an XML file like so:
<GenResponse>
<Detail1></Detail1>
<Detail2></Detail2>
<DataNodes>
<DataNode>
<NodeDetails1>
<node4>Parrot Musky Truck Moo</node4>
<node5>Tinker Singer Happy Fool</node5>
<node6>
<FurtherDetails>
<Node>Musky</Node>
<Node>Lorem Ipsum</Node>
</FurtherDetails>
</node6>
</NodeDetails1>
<NodeDetails2>ID</NodeDetails2>
</DataNode>
<DataNode>
<NodeDetails1>
<node4>Sky Star Panet Shoe</node4>
<node5>Rusky Husky Musky Boo</node5>
</NodeDetails1>
<NodeDetails2>ID</NodeDetails2>
</DataNode>
</DataNodes>
</GenResponse>
I would like to know how to pass a search string such as "Musky" to a PHP function and get back every matching <DataNode>...</DataNode>.
Essentially, I would like to search a huge XML file for a string and return all the DataNode elements that contain it.
If this is possible with SimpleXML it would be great. Else any other solution is also fine.
EDIT: Notice how "Musky" can be in different nodes under <DataNode>
Use XPath:
$xmlStr = file_get_contents('data/your_XML_File.xml');
$xml = new SimpleXMLElement($xmlStr);
// search records by tag value:
// find nodes with text
$res = $xml->xpath("node2[contains(., 'Musky')]");
print_r($res);
For testing purposes, just copy and paste the following code into an editor. I didn't use a separate XML file for the test.
<?php
//$xmlStr = file_get_contents('test.xml');
$xmlStr = '<node1>
<node2>
<node3>
<node4>Parrot Singer Truck Moo</node4>
<node5>Tinker Musky Happy Fool</node5>
</node3>
<node7>ID</node7>
</node2>
<node2>
<node3>
<node4>Sky Star Panet Shoe</node4>
<node5>Rusky Husky Musky Boo</node5>
</node3>
<node7>ID</node7>
</node2>
</node1>';
$xml = new SimpleXMLElement($xmlStr);
// search records by tag value:
// find nodes with text
$res = $xml->xpath("node2[contains(., 'Musky')]");
echo "<pre>";
print_r($res);
?>
It gives the proper output when I tried it:
Array
(
[0] => SimpleXMLElement Object
(
[node3] => SimpleXMLElement Object
(
[node4] => Parrot Singer Truck Moo
[node5] => Tinker Musky Happy Fool
)
[node7] => ID
)
[1] => SimpleXMLElement Object
(
[node3] => SimpleXMLElement Object
(
[node4] => Sky Star Panet Shoe
[node5] => Rusky Husky Musky Boo
)
[node7] => ID
)
)
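Applied to the element names from the question, the same contains() test can be wrapped in a function. This is a sketch only: findDataNodes() is a hypothetical helper, the sample XML is a trimmed stand-in for the real file, and because XPath 1.0 has no string escaping the helper simply rejects search terms containing a single quote.

```php
<?php
// Hypothetical helper: returns every <DataNode> whose text contains $term.
function findDataNodes(SimpleXMLElement $xml, string $term): array
{
    // XPath 1.0 cannot escape quotes inside a literal, so refuse them here.
    if (strpos($term, "'") !== false) {
        throw new InvalidArgumentException('Search term must not contain a single quote.');
    }
    // "." is the concatenated text of all descendants, so nesting depth does not matter.
    return $xml->xpath("//DataNode[contains(., '" . $term . "')]");
}

// Trimmed sample in the shape of the question's document.
$xmlStr = <<<'XML'
<GenResponse>
  <DataNodes>
    <DataNode><NodeDetails1><node4>Parrot Musky Truck Moo</node4></NodeDetails1><NodeDetails2>ID</NodeDetails2></DataNode>
    <DataNode><NodeDetails1><node4>Sky Star Shoe</node4></NodeDetails1><NodeDetails2>ID</NodeDetails2></DataNode>
    <DataNode><NodeDetails1><node5>Rusky Husky Musky Boo</node5></NodeDetails1><NodeDetails2>ID</NodeDetails2></DataNode>
  </DataNodes>
</GenResponse>
XML;

$xml = new SimpleXMLElement($xmlStr);
$matches = findDataNodes($xml, 'Musky');

// asXML() gives back the full <DataNode>...</DataNode> markup of each hit.
foreach ($matches as $node) {
    echo $node->asXML(), PHP_EOL;
}
```

Note that contains() is case-sensitive, so searching for "musky" would find nothing in this sample.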
Use this code to find your search word. I have made it a function: just pass your keyword and it reports every match.
function findWord($findVar)
{
    $catalog = simplexml_load_file("xmlfile.xml");
    $category = $catalog->node2;
    $found = 0;
    foreach ($category as $c)
    {
        foreach ($c->node3 as $node3)
        {
            $node4 = (string) ($node3->node4);
            $node5 = (string) ($node3->node5);
            // stripos() is already case-insensitive; compare with !== false,
            // because a match at position 0 would otherwise count as "not found"
            if (stripos($node4, $findVar) !== false)
            {
                echo 'Found!!'.'<br/>';
                $found++;
            }
            if (stripos($node5, $findVar) !== false)
            {
                echo 'Found!!'.'<br/>';
                $found++;
            }
        }
        if (stripos((string) $c->node7, $findVar) !== false)
        {
            echo 'Found!!'.'<br/>';
            $found++;
        }
    }
    if ($found == 0)
    {
        echo "No result";
    }
}
$findVar = 'Musky';
findWord($findVar);
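A pitfall worth knowing when checking stripos() results: it returns the position of the match, and a match at the very start of the string is position 0, which PHP treats as false in a plain if. The short sketch below, with a made-up sample string, shows the difference between a loose check and a strict one.

```php
<?php
$haystack = 'Musky Truck Moo';

// The match starts at the first character, so the position is 0.
$pos = stripos($haystack, 'musky');

// A loose truthiness check misreads position 0 as "not found".
$looseFound = (bool) $pos;

// Comparing against false with !== distinguishes 0 from "no match".
$strictFound = ($pos !== false);

// A term that is genuinely absent yields false.
$missing = stripos($haystack, 'parrot');

var_dump($pos, $looseFound, $strictFound, $missing);
```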
I am parsing scores from http://sports.in.msn.com/football-world-cup-2014/south-africa-v-brazil/1597383
I am able to parse all the attributes, but I am not able to parse the time.
I used:
$homepages = file_get_html("http://sports.in.msn.com/football-world-cup-2014/south-africa-v-brazil/1597383");
$teama = $homepages->find('span[id="clock"]');
Kindly help me
Since that particular site loads the values dynamically (through an AJAX request), you can't really parse the value from the initial page load.
<span id="clock"></span> // this tends to be empty on initial load
Normal scraping:
$homepages = file_get_contents("http://sports.in.msn.com/football-world-cup-2014/south-africa-v-brazil/1597383");
$doc = new DOMDocument();
@$doc->loadHTML($homepages);
$xpath = new DOMXPath($doc);
$query = $xpath->query("//span[@id='clock']");
foreach($query as $value) {
echo $value->nodeValue; // the traversal is correct, but this will be empty
}
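The effect can be reproduced offline. The sketch below runs the same XPath query against a made-up HTML fragment in which the clock span is empty, as it would be before any script has run; the score span is an invented contrast case, not markup from the real page.

```php
<?php
// Made-up markup standing in for the page as delivered by the server.
$html = '<html><body>'
      . '<span id="clock"></span>'
      . '<span id="score">0-5</span>'
      . '</body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$clock = $xpath->query("//span[@id='clock']")->item(0);
$score = $xpath->query("//span[@id='score']")->item(0);

// The element is found, but it has no text until JavaScript fills it in.
var_dump($clock !== null, $clock->textContent, $score->textContent);
```

So the query itself is fine; there is just nothing in the static HTML for it to return.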
My suggestion: rather than scraping the page, request the data the same way the page does. The clock is a time, so it keeps changing until the match has ended, and the site fetches it through its own AJAX request, which you can call directly:
$url = 'http://sports.in.msn.com/liveplayajax/SOCCERMATCH/match/gsm/en-in/1597383';
$contents = file_get_contents($url);
$data = json_decode($contents, true);
echo '<pre>';
print_r($data);
echo '</pre>';
Should yield something like (a part of it actually):
[2] => Array
(
[Code] =>
[CommentId] => -1119368663
[CommentType] => manual
[Description] => FULL-TIME: South Africa 0-5 Brazil.
[Min] => 90'
[MinExtra] => (+3)
[View] =>
[ViewHint] =>
[ViewIndex] => 0
[EditKey] =>
[TrackingValues] =>
[AopValue] =>
)
You should get the 90' by using foreach. Consider this example:
foreach($data['Commentary']['CommentaryItems'] as $key => $value) {
if(stripos($value['Description'], 'FULL-TIME') !== false) {
echo $value['Min'];
break;
}
}
Should print: 90'
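For experimenting without the live endpoint, here is a sketch that runs the same search over a hard-coded JSON string. The sample is made up to mirror the structure shown above (only the FULL-TIME entry comes from that output), and findFullTimeMinute() is a hypothetical helper name.

```php
<?php
// Made-up sample mirroring the response structure; not fetched from the site.
$contents = <<<'JSON'
{"Commentary": {"CommentaryItems": [
  {"CommentType": "manual", "Description": "Second half begins.", "Min": "46'", "MinExtra": ""},
  {"CommentType": "manual", "Description": "FULL-TIME: South Africa 0-5 Brazil.", "Min": "90'", "MinExtra": "(+3)"}
]}}
JSON;

// Hypothetical helper: returns the minute of the FULL-TIME entry, or null if absent.
function findFullTimeMinute(array $data): ?string
{
    $items = $data['Commentary']['CommentaryItems'] ?? [];
    foreach ($items as $item) {
        if (stripos($item['Description'] ?? '', 'FULL-TIME') !== false) {
            return $item['Min'] ?? null;
        }
    }
    return null;
}

$data = json_decode($contents, true);
if (!is_array($data)) {
    die('Could not decode the JSON response.');
}

$minute = findFullTimeMinute($data);
echo $minute, PHP_EOL;
```

Checking that json_decode() returned an array before indexing into it guards against an empty or non-JSON response, which is a realistic failure mode for a remote request.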
I am trying to scrape some data from this site: http://laperuanavegana.wordpress.com/. Specifically, I want the title of each recipe and its ingredients. The ingredients are located between two specific keywords. I am trying to get this data using regex and simplehtmldom, but it shows the full HTML text, not just the ingredients. Here is my code:
<?php
include_once('simple_html_dom.php');
$base_url = "http://laperuanavegana.wordpress.com/";
traverse($base_url);
function traverse($base_url)
{
$html = file_get_html($base_url);
$k1="Ingredientes";
$k2="Preparación";
preg_match_all("/$k1(.*)$k2/s",$html->innertext,$out);
echo $out[0][0];
}
?>
There are multiple ingredient lists on this page and I want all of them, so I am using preg_match_all().
It would be helpful if anybody could spot the bug in this code.
Thanks in advance.
When you are already using an HTML parser (even a poor one like SimpleHtmlDom), why are you trying to mess up things with Regex then? That's like using a scalpel to open up the patient and then falling back to a sharpened spoon for the actual surgery.
Since I strongly believe no one should use SimpleHtmlDom because it has a poor codebase and is much slower than libxml based parsers, here is how to do it with PHP's native DOM extension and XPath. XPath is effectively the Regex or SQL for X(HT)ML documents. Learn it, so you will never ever have to touch Regex for HTML again.
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTMLFile('https://laperuanavegana.wordpress.com/2011/06/11/ensalada-tibia-de-quinua-mango-y-tomate/');
libxml_clear_errors();
$recipe = array();
$xpath = new DOMXPath($dom);
$contentDiv = $dom->getElementById('content');
$recipe['title'] = $xpath->evaluate('string(div/h2/a)', $contentDiv);
foreach ($xpath->query('div/div/ul/li', $contentDiv) as $listNode) {
$recipe['ingredients'][] = $listNode->nodeValue;
}
print_r($recipe);
This will output:
Array
(
[title] => Ensalada tibia de quinua, mango y tomate
[ingredients] => Array
(
[0] => 250gr de quinua cocida tibia
[1] => 1 mango grande
[2] => 2 tomates
[3] => Unas hojas de perejil
[4] => Sal
[5] => Aceite de oliva
[6] => Vinagre balsámico
)
)
Note that we are not parsing http://laperuanavegana.wordpress.com/ but the actual blog post. The main URL will change content whenever the blog owner adds a new post.
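Because the live page can change, it may help to test the XPath expressions against a fixed fragment first. The markup below is a made-up approximation of the structure those expressions assume (a content div containing a post with an h2 link and a list); it is not copied from the blog.

```php
<?php
// Hypothetical markup matching the paths div/h2/a and div/div/ul/li under #content.
$html = <<<'HTML'
<html><body>
<div id="content">
  <div class="post">
    <h2><a href="#">Ensalada tibia de quinua, mango y tomate</a></h2>
    <div class="entry">
      <ul><li>1 mango grande</li><li>2 tomates</li><li>Sal</li></ul>
    </div>
  </div>
</div>
</body></html>
HTML;

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Locate the container with XPath, then evaluate the relative paths against it.
$contentDiv = $xpath->query('//div[@id="content"]')->item(0);

$recipe = array('ingredients' => array());
$recipe['title'] = $xpath->evaluate('string(div/h2/a)', $contentDiv);
foreach ($xpath->query('div/div/ul/li', $contentDiv) as $listNode) {
    $recipe['ingredients'][] = $listNode->nodeValue;
}

print_r($recipe);
```

If the real page nests its elements differently, only the two relative paths need to change; the rest of the code stays the same.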
To get all the Recipes from the main page, you can use
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTMLFile('https://laperuanavegana.wordpress.com');
libxml_clear_errors();
$contentDiv = $dom->getElementById('content');
$xp = new DOMXPath($dom);
$recipes = array();
foreach ($xp->query('div/h2/a|div/div/ul/li', $contentDiv) as $node) {
echo
($node->nodeName === 'a') ? "\n# " : '- ',
$node->nodeValue,
PHP_EOL;
}
This will output
# Ensalada tibia de quinua, mango y tomate
- 250gr de quinua cocida tibia
- 1 mango grande
- 2 tomates
- Unas hojas de perejil
- Sal
- Aceite de oliva
- Vinagre balsámico
# Flan de lúcuma
- 1 lúcuma grandota o 3 pequeñas
- 1/2 litro de leche de soja evaporada
…
and so on
Also see
How do you parse and process HTML/XML in PHP?
DOMDocument in php
You need to add a question mark there. It makes the pattern ungreedy; otherwise it will match everything from the first $k1 to the last $k2 on the page. With the question mark, each match stops at the next $k2.
preg_match_all("/$k1(.*?)$k2/s",$html->innertext,$out);
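The difference is easy to see on a short string. In this sketch the sample text is made up, preg_quote() is added as a precaution for keywords that might contain regex metacharacters, and the u modifier is an assumption that the input is UTF-8.

```php
<?php
// Made-up text with two keyword pairs, standing in for the page content.
$text = 'Ingredientes: rice, beans Preparación: boil. '
      . 'Ingredientes: flour, water Preparación: bake.';

$k1 = preg_quote('Ingredientes', '/');
$k2 = preg_quote('Preparación', '/');

// Greedy: (.*) runs on to the LAST $k2, so both recipes collapse into one match.
preg_match_all("/$k1(.*)$k2/su", $text, $greedy);

// Ungreedy: (.*?) stops at the NEXT $k2, giving one match per recipe.
preg_match_all("/$k1(.*?)$k2/su", $text, $lazy);

// Index 1 holds the captured text between the keywords;
// index 0 would hold the full match including the keywords themselves.
print_r($greedy[1]);
print_r($lazy[1]);
```

Reading $out[1] rather than $out[0] also drops the two keywords from the result, which is closer to what the question asks for.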