Regular Expressions - PHP and XML - php

I'm in college and new to PHP regular expressions but I have somewhat of an idea what I need to do I think. Basically I need to create a PHP program to read XML source code containing several 'stories' and store their details in a mySQL database. I've managed to create an expression that selects each story but I need to break this expression down further in order to get each element within the story. Here's the XML:
XML
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="test.xsl"?>
<latestIssue>
<issue number="256" />
<date>
<day> 21 </day>
<month> 1 </month>
<year> 2011 </year>
</date>
<story>
<title> Is the earth flat? </title>
<author> A. N. Redneck </author>
<url> http://www.HotStuff.ie/stories/story123456.xml </url>
</story>
<story>
<title> What the actress said to the bishop </title>
<author> Brated Film Critic </author>
<url> http://www.HotStuff.ie/stories/story123457.xml </url>
</story>
<story>
<title> What the year has in store </title>
<author> Stargazer </author>
<url> http://www.HotStuff.ie/stories/story123458.xml </url>
</story>
</latestIssue>
So I need to get the title, author and url from each story and add them as a row in my database. Here's what I have so far:
PHP
<?php
$url = fopen("http://address/to/test.xml", "r");
$contents = fread($url,10000000);
$exp = preg_match_all("/<title>(.+?)<\/url>/s", $contents, $matches);
foreach($matches[1] as $match) {
// NO IDEA WHAT TO DO FROM HERE
// $exp2 = "/<title>(.+?)<\/title><author>(.+?)<\/author><url>(.+?)<\/url>/";
// This is what I had but I'm not sure if it's right or what to do after
}
?>
I'd really appreciate the help guys, I've been stuck on this all day and I can't wrap my head around regular expressions at all. Once I've managed to get each story's details I can easily update the database.
EDIT:
Thanks for replying but are you sure this can't be done with regular expressions? It's just the question says "Use regular expressions to analyse the XML and extract the relevant data that you need. Note that information about each story is spread across several lines of XML". Maybe he made a mistake but I don't see why he'd write it like that if it can't be done this way.

First of all, start using
file_get_contents("UrlHere");
to gather the content from a page.
Now if you want to parse the XML use the XML parser in PHP for example.
You could also use third-party XML parsers

Regular expressions are not the correct tool to use here. You want to use a XML parser. I like PHP's SimpleXML
$sXML = new SimpleXMLElement('http://address/to/test.xml', 0, TRUE);
$stories = $sXML->story;
foreach($stories as $story){
$title = (string)$story->title;
$author = (string)$story->author;
$url = (string)$story->url;
}

You should never use regexp to parse an XML document (Ok, never is a big word, in some rare cases the regexp can be better but not in your case).
As it's a document reading, I suggest you to use the SimpleXML class and XPath queries.
For example :
$ cat test.php
#!/usr/bin/php
<?php
function xpathValueToString(SimpleXMLElement $xml, $xpath){
$arrayXpath = $xml->xpath($xpath);
return ($arrayXpath) ? trim((string) $arrayXpath[0]) : null;
}
$xml = new SimpleXMLElement(file_get_contents("test.xml"));
$arrayXpathStories = $xml->xpath("/latestIssue/story");
foreach ($arrayXpathStories as $story){
echo "Title : " . xpathValueToString($story, 'title') . "\n";
echo "Author : " . xpathValueToString($story, 'author') . "\n";
echo "URL : " . xpathValueToString($story, 'url') . "\n\n";
}
?>
$ ./test.php
Title : Is the earth flat?
Author : A. N. Redneck
URL : http://www.HotStuff.ie/stories/story123456.xml
Title : What the actress said to the bishop
Author : Brated Film Critic
URL : http://www.HotStuff.ie/stories/story123457.xml
Title : What the year has in store
Author : Stargazer
URL : http://www.HotStuff.ie/stories/story123458.xml

Related

Xpath query for HTML table within XML in PHP DOMDocument

I have an XML file with following tree structure.
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:media="http://search.yahoo.com/mrss/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0">
<channel>
<title>Videos</title>
<link>https://www.example.com/r/videos/</link>
<description>A long description of the video.</description>
<image>...</image>
<atom:link rel="self" href="http://www.example.com/videos/.xml" type="application/rss+xml"/>
<item>
<title>The most used Jazz lick in history.</title>
<link>
http://www.example.com/
</link>
<guid isPermaLink="true">
http://www.example.com/
</guid>
<pubDate>Mon, 07 Sep 2015 14:43:34 +0000</pubDate>
<description>
<table>
<tr>
<td>
<a href="http://www.example.com/">
<img src="http://www.example.com/.jpg" alt="The most used Jazz lick in history." title="The most used Jazz lick in history." />
</a>
</td>
<td> submitted by
jcepiano
<br/>
[link]
<a href="http://www.example.com/">
[508 comments]
</a>
</td>
</tr>
</table>
</description>
<media:title>The most used Jazz lick in history.</media:title>
<media:thumbnail url="http://example.jpg"/>
</item>
</channel>
</rss>
Here, the html table element is embedded inside XML and that's confusing me.
Now I want to pick the text node values for //channel/item/title and href value for //channel/item/description/table/tr/td[1]/a[1] (with a text node value = "[link]")
Above in 2nd case, I am looking for the value of 2nd a (with a text node value = "[link]"), inside 2nd td inside tr, table, description, item, channel.
I am using PHP DOMDocument();
I have been looking for a perfect solution for this for 2 days now, can you please let me know how would this happen?
Also I need to count the total number of items in the feed, right now I am doing like this:
...
$queryResult = $xpathvar->query('//item/title');
$total = 1;
foreach($queryResult as $result){
$total++;
}
echo $title;
And I also need a reference link for XPath query selectors' rules.
Thanks in advance! :)
You wrote that you wanted the length of the result set of the following query:
$queryResult = $xpathvar->query('//item/title');
I assume that $xpathvar here is of type DOMXPath. If so, it has a length property as described here. Instead of using foreach, simply use:
$length = $xpathvar->query('//item/title')->length;
Now I want to pick the text node values for //channel/item/title
Which you can get with the expression //channel/item/title/text().
and href value for //channel/item/description/table/tr/td[1]/a[1] (with a text node value = "[link]")
Your expression here selects any tr, the first td under that, then the first a. But the first a does not have a value of "[link]" in your source. If you want that, though, you can use:
//channel/item/description/table/tr/td[1]/a[1]/#href
but it looks like you rather want:
//channel/item/description/table/tr/td/a[. = "[link]"][1]/#href
which finds the first a element in the tree that has the value (text node) that is "[link]".
Above in 2nd case, I am looking for the value of 2nd a (with a text node value = "[link]"), inside 2nd td inside tr, table, description, item, channel.
Not sure if this was a separate question or meant to explain the previous one. Regardless, the answer the same as in the previous one, unless you explicitly want to search for 2nd a etc (i.e., search by position), in which case you can use numeric predicates.
Note: you start most of your expressions with //expr, which essentially means: search the whole tree at any depth for the expression expr. This is potentially expensive and if all you need is a (relative) root node for which you know the starting point or expression, it is better, and far more performant, to use a direct path. In your case, you can replace //channel for /*/channel (because it is the first under the root element).
I finally could make it work with the code below
$url = "https://www.example.com/r/videos/.xml";
$feed_dom = new domDocument;
$feed_dom->load($url);
$feed_dom->preserveWhiteSpace = false;
$items = $feed_dom->getElementsByTagName('item');
foreach($items as $item){
$title = $item->getElementsByTagName('title')->item(0)->nodeValue;
$desc_table = $item->getElementsByTagName('description')->item(0)->nodeValue;
echo $title . "<br>";
$table_dom = new domDocument;
$table_dom->loadHTML($desc_table);
$xpath = new DOMXpath($table_dom);
$table_dom->preserveWhiteSpace = false;
$yt_link_node = $xpath->query("//table/tr/td[2]/a[2]");
foreach($yt_link_node as $yt_link){
$yt = $yt_link->getAttribute('href');
echo $yt . "<br>";
echo "<br>";
}
}
I thank Abel, your help was greatly useful to achieve the tasks. :)

Show all items that Match in XML using PHP

This is my XML file named: full.xml
I need your help. I need a PHP script that open "full.xml"
and only display all values of the nodes that have .email
Example of the Output I want:
sales#company1.com
sales#company2.com
sales#company3.com
Thanks! I will thank you so much!
EDIT
$Connect = simplexml_load_file("full.xml");
return $Connect->table[0]->*.email;
The design of your XML is not very smart. With this xpath expression, you select all nodes with .email at the end of their name:
$xml = simplexml_load_string($x); // assume XML in $x
$results = $xml->xpath("//*[substring(name(),string-length(name())-" . (strlen('.email') - 1) . ") = '.email']");
--> result is an array with the selected nodes.
BTW: if you have any chance of CHANGING the structure of the XML, AVOID combining information within node names like <company1.email>, but do it like this:
...
<companies>
<company id="1">
<email>info#company1.com</email>
<tel>+498988123456</tel>
<name>somename</name>
</company>
<company id="2">
<email>info#company2.com</email>
<tel>+498988123457</tel>
<name>someothername</name>
</company>
</companies>
....
It will be much easier to read and parse.

PHP script to echo VLC now playing XML attributes

I've been searching for a while on this and haven't had much luck. I've found plenty of resources showing how to echo data from dynamic XML, but I'm a PHP novice, and nothing I've written seems to grab and print exactly what I want, though from everything I've heard, it should be relatively easy. The source XML (located at 192.168.0.15:8080/requests/status.xml) is as follows:
<root>
<fullscreen>0</fullscreen>
<volume>97</volume>
<repeat>false</repeat>
<version>2.0.5 Twoflower</version>
<random>true</random>
<audiodelay>0</audiodelay>
<apiversion>3</apiversion>
<videoeffects>
<hue>0</hue>
<saturation>1</saturation>
<contrast>1</contrast>
<brightness>1</brightness>
<gamma>1</gamma>
</videoeffects>
<state>playing</state>
<loop>true</loop>
<time>37</time>
<position>0.22050105035305</position>
<rate>1</rate>
<length>168</length>
<subtitledelay>0</subtitledelay>
<equalizer/>
<information>
<category name="meta">
<info name="description">
000003EC 00000253 00000D98 000007C0 00009C57 00004E37 000068EB 00003DC5 00015F90 00011187
</info>
<info name="date">2003</info>
<info name="artwork_url"> file://brentonshp04/music%24/Music/Hackett%2C%20Steve/Guitar%20Noir%20%26%20There%20Are%20Many%20Sides%20to%20the%20Night%20Disc%202/Folder.jpg
</info>
<info name="artist">Steve Hackett</info>
<info name="publisher">Recall</info>
<info name="album">Guitar Noir & There Are Many Sides to the Night Disc 2
</info>
<info name="track_number">5</info>
<info name="title">Beja Flor [Live]</info>
<info name="genre">Rock</info>
<info name="filename">Beja Flor [Live]</info>
</category>
<category name="Stream 0">
<info name="Bitrate">128 kb/s</info>
<info name="Type">Audio</info>
<info name="Channels">Stereo</info>
<info name="Sample rate">44100 Hz</info>
<info name="Codec">MPEG Audio layer 1/2/3 (mpga)</info>
</category>
</information>
<stats>
<lostabuffers>0</lostabuffers>
<readpackets>568</readpackets>
<lostpictures>0</lostpictures>
<demuxreadbytes>580544</demuxreadbytes>
<demuxbitrate>0.015997290611267</demuxbitrate>
<playedabuffers>0</playedabuffers>
<demuxcorrupted>0</demuxcorrupted>
<sendbitrate>0</sendbitrate>
<sentbytes>0</sentbytes>
<displayedpictures>0</displayedpictures>
<demuxreadpackets>0</demuxreadpackets>
<sentpackets>0</sentpackets>
<inputbitrate>0.016695899888873</inputbitrate>
<demuxdiscontinuity>0</demuxdiscontinuity>
<averagedemuxbitrate>0</averagedemuxbitrate>
<decodedvideo>0</decodedvideo>
<averageinputbitrate>0</averageinputbitrate>
<readbytes>581844</readbytes>
<decodedaudio>0</decodedaudio>
</stats>
</root>
What I'm trying to write is a simple PHP script that echoes the artist's name (In this example Steve Hackett). Actually I'd like it to echo the artist, song and album, but I'm confident that if I'm shown how to retrieve one, I can figure out the rest on my own.
What little of my script which actually seems to work goes as follows. I've tried more than what's below, but I left out the bits that I know for a fact aren't working.
<?PHP
$file = file_get_contents('http://192.168.0.15:8080/requests/status.xml');
$sxe = new SimpleXMLElement($file);
foreach($sxe->...
echo "Artist: "...
?>
I think I need to use foreach and echo, but I can't figure out how to do it in a way that will print what's between those info brackets.
I'm sorry if I've left anything out. I'm not only new to PHP, but I'm new to StackOverflow too. I've referenced this site in other projects, and it's always been incredibly helpful, so thanks in advance for your patience and help!
////////Finished Working Script - Thanks to Stefano and all who helped!
<?PHP
$file = file_get_contents('http://192.168.0.15:8080/requests/status.xml');
$sxe = new SimpleXMLElement($file);
$artist_xpath = $sxe->xpath('//info[#name="artist"]');
$album_xpath = $sxe->xpath('//info[#name="album"]');
$title_xpath = $sxe->xpath('//info[#name="title"]');
$artist = (string) $artist_xpath[0];
$album = (string) $album_xpath[0];
$title = (string) $title_xpath[0];
echo "<B>Artist: </B>".$artist."</br>";
echo "<B>Title: </B>".$title."</br>";
echo "<B>Album: </B>".$album."</br>";
?>
Instead of using a for loop, you can obtain the same result with XPath:
// Extraction splitted across two lines for clarity
$artist_xpath = $sxe->xpath('//info[#name="artist"]');
$artist = (string) $artist_xpath[0];
echo $artist;
You will have to adjust the xpath expression (i.e. change #name=... appropriately), but you get the idea. Also notice that [0] is necessary because xpath will return an array of matches (and you only need the first) and the cast (string) is used to extract text contained in the node.
Besides, your XML is invalid and will be rejected by the parser because of the literal & appearing in the <info name="album"> tag.
If you look at your code again, you are missing a function that turns the first result of the xpath expression into a string of a SimpleXMLElement (casting).
One way to write this once is to extend from SimpleXMLElement:
class BetterXMLElement extends SimpleXMLElement
{
public function xpathString($expression) {
list($result) = $this->xpath($expression);
return (string) $result;
}
}
You then create the more specific SimpleXMLElement like you did use the less specific before:
$file = file_get_contents('http://192.168.0.15:8080/requests/status.xml');
$sxe = new BetterXMLElement($file);
And then you benefit in your following code:
$artist = $sxe->xpathString('//info[#name="artist"]');
$album = $sxe->xpathString('//info[#name="album"]');
$title = $sxe->xpathString('//info[#name="title"]');
echo "<B>Artist: </B>".$artist."</br>";
echo "<B>Title: </B>".$title."</br>";
echo "<B>Album: </B>".$album."</br>";
This spares you some repeated code. This means as well less places you can make an error in :)
Sure you can further on optimize this by allowing to pass an array of multiple xpath queries and returning all values named then. But that is something you need to write your own according to your specific needs. So use what you learn in programming to make programming more easy :)
If you want some more suggestions, here is another, very detailed example using DOMDocument, the sister-library of SimpleXML. It is quite advanced but might give you some good inspiration, I think something similar is possible with SimpleXML as well and this is probably what you're looking for in the end:
Extracting data from HTML using PHP and xPath

Trying to Parse Only the Images from an RSS Feed

First, I am a php newbie. I have looked at the question and solution here. For my needs however, the parsing does not go deep enough into the various articles.
A small sampling of my rss feed reads like this:
<channel>
<atom:link href="http://mywebsite.com/rss" rel="self" type="application/rss+xml" />
<title>My Web Site</title>
<description>My Feed</description>
<link>http://mywebsite.com/</link>
<image>
<url>http://mywebsite.com/views/images/banner.jpg</url>
<title>My Title</title>
<link>http://mywebsite.com/</link>
<description>Visit My Site</description>
</image>
<item>
<title>Article One</title>
<guid isPermaLink="true">http://mywebsite.com/details/e8c5106</guid>
<link>http://mywebsite.com/geturl/e8c5106</link>
<comments>http://mywebsite.com/details/e8c5106#comments</comments>
<pubDate>Wed, 09 Jan 2013 02:59:45 -0500</pubDate>
<category>Category 1</category>
<description>
<![CDATA[<div>
<img src="http://mywebsite.com/myimages/1521197-main.jpg" width="120" border="0" />
<ul><li>Poster: someone's name;</li>
<li>PostDate: Tue, 08 Jan 2013 21:49:35 -0500</li>
<li>Rating: 5</li>
<li>Summary:Lorem ipsum dolor </li></ul></div><div style="clear:both;">]]>
</description>
</item>
<item>..
The image links that I want to parse out are the ones way inside each Item > Description
The code in my php file reads:
<?php
$xml = simplexml_load_file('http://mywebsite.com/rss?t=2040&dl=1&i=1&r=ceddfb43483437b1ed08ab8a72cbc3d5');
$imgs = $xml->xpath('/item/description/img');
foreach($imgs as $image) {
echo $image->src;
}
?>
Can someone please help me figure out how to configure the php code above?
Also a very newbie question... once I get the resulting image urls, how can I display the images in a row on my html?
Many thanks!!!
Hernando
The <img> tags inside that RSS feed are not actually elements of the XML document, contrary to the syntax highlighting on this site - they are just text inside the <description> element which happen to contain the characters < and >.
The string <![CDATA[ tells the XML parser that everything from there until it encounters ]]> is to be treated as a raw string, regardless of what it contains. This is useful for embedding HTML inside XML, since the HTML tags wouldn't necessarily be valid XML. It is equivalent to escaping the whole HTML (e.g. with htmlspecialchars) so that the <img> tags would look like <img>. (I went into more technical details on another answer.)
So to extract the images from the RSS requires two steps: first, get the text of each <description>, and second, find all the <img> tags in that text.
$xml = simplexml_load_file('http://mywebsite.com/rss?t=2040&dl=1&i=1&r=ceddfb43483437b1ed08ab8a72cbc3d5');
$descriptions = $xml->xpath('//item/description');
foreach ( $descriptions as $description_node ) {
// The description may not be valid XML, so use a more forgiving HTML parser mode
$description_dom = new DOMDocument();
$description_dom->loadHTML( (string)$description_node );
// Switch back to SimpleXML for readability
$description_sxml = simplexml_import_dom( $description_dom );
// Find all images, and extract their 'src' param
$imgs = $description_sxml->xpath('//img');
foreach($imgs as $image) {
echo (string)$image['src'];
}
}
I don't have much experience with xPath, but you could try the following:
$imgs = $xml->xpath('item//img');
This will select all img-elements which are inside item-elements, regardless if there are other elements inbetween. Removing the leading slash will search for item anywhere in the documet, not just from the root. Otherwise, you'd need something like /rss/channel/item....
As for displaying the images: Just output <img>-tags followed by line-breaks, like so:
foreach($imgs as $image) {
echo '<img src="' . $image->src . '" /><br />';
}
The preferred way would be to use CSS instead of <br>-tags, but I think they are simpler for a start.

How can I extract data from XML to PHP using SimpleXML and echo?

I've successfully integrated the LinkedIn API with my website, but I'm struggling to extract information from the XML. At the moment I'm just trying to print it out so I can proceed to use the user's information once they have logged in and given permission.
Below is the format of the XML, and further down is the code I am using to extract the information. The "first name", "last name" and "headline" calls work perfectly, but where an element has sub-headings, nothing is printed out. I've tried using
echo 'Positions: ' . $xml->{'positions:(title)'};
but it doesn't work.
Here is the XML:
<person>
<id>
<first-name />
<last-name />
<headline>
<location>
<name>
<country>
<code>
</country>
</location>
<industry>
<summary/>
<positions total="">
<position>
<id>
<title>
<summary>
<start-date>
<year>
<month>
</start-date>
<is-current>
<company>
<name>
</company>
</position>
</person>
This is the code I've been using to try to extract the information. I know I have to include the sub-heading somehow but I just don't know how!
echo 'First Name: ' . $xml->{'first-name'};
echo '<br/>';
echo 'Last Name: ' . $xml->{'last-name'};
echo '<br/>';
echo 'Headline: ' . $xml->{'headline'};
echo '<br/>';
echo 'Positions: ' . $xml->{'positions'};
Any help would be greatly appreciated, thanks for reading!
Using SimpleXML, you'd access the LinkedIn XML data properties as follows:
Anything with a dash in the property gets {}, so first-name becomes:
$xml->{'first-name'}
Anything without a dash such as headline, is referenced like:
$xml->headline
Anything that is a collection, such as positions, is referenced like:
foreach($xml-positions as $position) {
echo $position->title;
echo $position->{'is-current'};
}
Your XML is not valid, its not well formed. Anyway here's a sample XML and how to use it.
$v = <<<ABC
<vitrine>
<canal>Hotwords</canal>
<product id="0">
<descricao>MP3 Apple iPod Class...</descricao>
<loja>ApetreXo.com</loja>
<preco>à vista R$765,22</preco>
<urlImagem>http://im</urlImagem>
<urlProduto>http://</urlProduto>
</product>
</vitrine>
ABC;
$xml = simplexml_load_string($v);
foreach ($xml->product as $c){
echo $c->loja; //echoing out value of 'loja'
}
Try to use PHP's XML Parser instead:
http://www.php.net/manual/en/function.xml-parse.php
Tried Paul's answer above for:
foreach($xml-positions as $position) {
echo $position->title;
echo $position->{'is-current'};
}
didn't work for me - so I used this - not as elegant but works
for($position_num = 0; $position_num < 10;$position_num++){
echo $xml->positions->position[$position_num]->company->name;
}
Your XML is not well-formed... there are several elements without close tags. So we have no way to know for sure the structure of your XML. (You can't do that in XML like you can in HTML.)
That being said, assuming that <person> is the context node, you can probably get the content of the <title> element using an XPath expression, as in
$xml->xpath('positions/position/title');
I'm assuming $xml is a SimpleXMLElement object.

Categories