I'm writing an RSS to JSON parser and as a part of that, I need to use htmlentities() on any tag found inside the description tag. Currently, I'm trying to use preg_replace(), but I'm struggling a little with it. My current (non-working) code looks like:
$pattern[0] = "/\<description\>(.*?)\<\/description\>/is";
$replace[0] = '<description>'.htmlentities("$1").'</description>';
$rawFeed = preg_replace($pattern, $replace, $rawFeed);
If you have a more elegant solution to this as well, please share. Thanks.
Simple. Use preg_replace_callback:
function _handle_match($match)
{
return '<description>' . htmlentities($match[1]) . '</description>';
}
$pattern = "/\<description\>(.*?)\<\/description\>/is";
$rawFeed = preg_replace_callback($pattern, '_handle_match', $rawFeed);
It accepts any callback type, so also methods in classes.
The more elegant solution would be to employ SimpleXML. Or a third party library such as XML_Feed_Parser or Zend_Feed to parse the feed.
Here is a SimpleXML example:
<?php
$rss = file_get_contents('http://rss.slashdot.org/Slashdot/slashdot');
$xml = simplexml_load_string($rss);
foreach ($xml->item as $item) {
echo "{$item->description}\n\n";
}
?>
Keep in mind that RSS and RDF and Atom look different, which is why it can make sense to employ one of the above libraries I mentioned.
Related
I have HTML markup bearing the form
<div id='abcd1234A'><p id='wxyz1234A'>Hello</p></div>
which I need to replace to bear the form
<div id='abcd1234AN'><p id='wxyz1234AN'>Hello</p></div>
where N may be 1,2.. .
The best I have been able to do is as follows
function cloneIt($a,$b)
{
return substr_replace($a,$b,-1);
}
$ndx = "1'";
$str = "<div id='abcd1234A'><p id='wxyz1234A'>Hello</p></div>";
preg_match_all("/id='[a-z]{4}[0-9]{4}A'/",$str,$matches);
$matches = $matches[0];
$reps = array_merge($matches);
$ndxs = array_fill(0,count($reps),$ndx);
$reps = array_map("cloneIt",$reps,$ndxs);
$str = str_replace($matches,$reps,$str);
echo htmlspecialchars($str);
which works just fine. However, my REGEX skills are not much to write home about so I suspect that there is probably a better way to do this. I'd be most obliged to anyone who might be able to suggest a neater/quicker way of accomplishing the same result.
You can optimize your regex like this:
/id='[a-z]{4}\d{4}A'/
Sample code
preg_match_all("/id='[a-z]{4}\\d{4}A'/",$str,$matches);
However an alternative would consist in using en HTML parser. Here I'll use simple html dom:
// Load the HTML from URL or file
$html = file_get_html('http://www.mysite.com/');
// You can also load $html from string: $html = str_get_html($my_string);
// Find div with id attribute
foreach($html->find('div[id]') as $div) {
if (preg_match("/id='([a-z]{4}\\d{4})A'/" , $div->id, $matches)) {
$div->id = $matches[1] + $ndx;
}
}
echo $html->save();
Did you notice how elegant, concise and clear the code becomes with an html parser ?
References
Simple Html Dom Documentation
I'm using a crawler to retrieve the HTML content of certain pages on the web. I currently have the entire HTML stored in a single PHP variable:
$string = "<PRE>".htmlspecialchars($crawler->results)."</PRE>\n";
What I want to do is select all "p" tags (for example) and store their in an array. What is the proper way to do that?
I've tried the following, by using xpath, but it doesn't show anything (most probably because the document itself isn't an XML, I just copy-pasted the example given in its documentation).
$xml = new SimpleXMLElement ($string);
$result=$xml->xpath('/p');
while(list( , $node)=each($result)){
echo '/p: ' , $node, "\n";
}
Hopefully someone with (a lot) more experience in PHP will be able to help me out :D
Try using DOMDocument along with DOMDocument::getElementsByTagName. The workflow should be quite simple. Something like:
$doc = DOMDocument::loadHTML(htmlspecialchars($crawler->results));
$pNodes = $doc->getElementsByTagName('p');
Which will return a DOMNodeList.
I vote for use regexp. For tag p
preg_match_all('/<p>(.*)<\/p>/', '<p>foo</p><p>foo 1</p><p>foo 2</p>', $arr, PREG_PATTERN_ORDER);
if(is_array($arr))
{
foreach($arr as $value)
{
echo $value."</br>";
}
}
Check out Simple HTML Dom. It will grab external pages and process them with fairly accurate detail.
http://simplehtmldom.sourceforge.net/
It can be used like this:
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
// Find all images
foreach($html->find('img') as $element)
echo $element->src . '<br>';
I have been coding for a while now but just can't seem to get my head around regular expressions.
This brings me to my question which is the following: is it bad practice to use PHP's explode for breaking up a string of html code to select bits of text? I need to scrape a page for various bits of information and due to my horrific regex knowledge (In a full software engineering degree I had to write maybe one....) I decided upon using explode().
I have provided my code below so someone more seasoned than me can tell me if it's essential that I use regex for this or not!
public function split_between($start, $end, $blob)
{
$strip = explode($start,$blob);
$strip2 = explode($end,$strip[1]);
return $strip2[0];
}
public function get_abstract($pubmed_id)
{
$scrapehtml = file_get_contents("http://www.ncbi.nlm.nih.gov/m/pubmed/".$pubmed_id);
$data['title'] = $this->split_between('<h2>','</h2>',$scrapehtml);
$data['authors'] = $this->split_between('<div class="auth">','</div>',$scrapehtml);
$data['journal'] = $this->split_between('<p class="j">','</p>',$scrapehtml);
$data['aff'] = $this->split_between('<p class="aff">','</p>',$scrapehtml);
$data['abstract'] = str_replace('<p class="no_t_m">','',str_replace('</p>','',$this->split_between('<h3 class="no_b_m">Abstract','</div>',$scrapehtml)));
$strip = explode('<div class="ids">', $scrapehtml);
$strip2 = explode('</div>', $strip[1]);
$ids[] = $strip2[0];
$id_test = strpos($strip[2],"PMCID");
if (isset($strip[2]) && $id_test !== false)
{
$step = explode('</div>', $strip[2]);
$ids[] = $step[0];
}
$id_count = 0;
foreach ($ids as &$value) {
$value = str_replace("<h3>", "", $value);
$data['ids'][$id_count]['id'] = str_replace("</h3>", "", str_replace('<span>','',str_replace('</span>','',$value)));
$id_count++;
}
$jsonAbstract = json_encode($data);
echo $this->indent($jsonAbstract);
}
I highly recommend you try out the PHP Simple HTML DOM Parser library. It handles invalid HTML and has been designed to solve the same problem you're working on.
A simple example from the documentation is as follows:
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
// Find all images
foreach($html->find('img') as $element)
echo $element->src . '<br>';
// Find all links
foreach($html->find('a') as $element)
echo $element->href . '<br>';
It's not essential to use regular expressions for anything, although it'll be useful to get comfortable with them and know when to use them.
It looks like your scraping PubMed, which I'm guessing has fairly static mark-up in terms of mark-up. If what you have works and performs as you hope I can't see any reason to switch over to using regular expressions, they're not necessarily going to be any quicker in this example.
Learn regular expressions and try to use a language that has libraries for this kind of task like perl or python. It will save you a lot of time.
At first they might seem daunting but they are really easy for most of the tasks.
Try reading this: http://perldoc.perl.org/perlre.html
I need read in and parse data from a third party website which sends XML data. All of this needs to be done server side.
What is the best way to do this using PHP?
You can obtain the remote XML data with, e.g.
$xmldata = file_get_contents("http://www.example.com/xmldata");
or with curl. Then use SimpleXML, DOM, whatever.
A good way of parsing XML is often to use XPP (XML Pull Parsing) librairy, PHP has an implementation of it, it's called XMLReader.
http://php.net/manual/en/class.xmlreader.php
I would suggest you to use DOMDocument (PHP inline built class)
A simple example of its power could be the following code:
/***********************************************************************************************
Takes the RSS news feeds found at $url and prints them as HTML code.
Each news is rendered in a <div class="rss"> block in the order: date + title + description.
***********************************************************************************************/
function Render($url, $max_feeds = 1000)
{
$doc = new DOMDocument();
if(#$doc->load($url, LIBXML_NOCDATA|LIBXML_NOBLANKS))
{
$feed_count = 0;
$items = $doc->getElementsByTagName("item");
//echo $items->length; //DEBUG
foreach($items as $item)
{
if($feed_count > $max_feeds)
break;
//Unfortunately inside <item> node elements are not always in same order, therefor we have to call many times getElementsByTagName
//WARNING: using iconv function instead of utf8_decode because this last one did not convert properly some characters like apostrophe 0x19 from techsport.it feeds.
$title = iconv('UTF-8', 'CP1252', $item->getElementsByTagName("title")->item(0)->firstChild->textContent); //can use "CP1252//TRANSLIT"
$description = iconv('UTF-8', 'CP1252', $item->getElementsByTagName("description")->item(0)->firstChild->textContent); //can use "CP1252//TRANSLIT"
$link = iconv('UTF-8', 'CP1252', $item->getElementsByTagName("link")->item(0)->firstChild->textContent); //can use "CP1252//TRANSLIT"
//pubDate tag is not mandatory in RSS [RSS2 spec: http://cyber.law.harvard.edu/rss/rss.html]
$pub_date = $item->getElementsByTagName("pubDate"); $date_html = "";
//play with date here if you want
echo "<div class='rss'>\n<p class='title'><a href='" . $link . "'>" . $title . "</a></p>\n<p class='description'>" . $description . "</p>\n</div>\n\n";
$feed_count++;
}
}
else
echo "<div class='rss'>Service not available.</div>";
}
I have been using simpleXML for a while.
I am returned the following:
<links>
<image_link>http://img357.imageshack.us/img357/9606/48444016.jpg</image_link>
<thumb_link>http://img357.imageshack.us/img357/9606/48444016.th.jpg</thumb_link>
<ad_link>http://img357.imageshack.us/my.php?image=48444016.jpg</ad_link>
<thumb_exists>yes</thumb_exists>
<total_raters>0</total_raters>
<ave_rating>0.0</ave_rating>
<image_location>img357/9606/48444016.jpg</image_location>
<thumb_location>img357/9606/48444016.th.jpg</thumb_location>
<server>img357</server>
<image_name>48444016.jpg</image_name>
<done_page>http://img357.imageshack.us/content.php?page=done&l=img357/9606/48444016.jpg</done_page>
<resolution>800x600</resolution>
<filesize>38477</filesize>
<image_class>r</image_class>
</links>
I wish to extract the image_link in PHP as simply and as easily as possible. How can I do this?
Assume, I can not make use of any extra libs/plugins for PHP. :)
Thanks all
At Josh's answer, the problem was not escaping the "/" character. So the code Josh submitted would become:
$text = 'string_input';
preg_match('/<image_link>([^<]+)<\/image_link>/gi', $text, $regs);
$result = $regs[0];
Taking usoban's answer, an example would be:
<?php
// Load the file into $content
$xml = new SimpleXMLElement($content) or die('Error creating a SimpleXML instance');
$imagelink = (string) $xml->image_link; // This is the image link
?>
I recommend using SimpleXML because it's very easy and, as usoban said, it's builtin, that means that it doesn't need external libraries in any way.
You can use SimpleXML as it is built in PHP.
use regular expressions
$text = 'string_input';
preg_match('/<image_link>([^<]+)</image_link>/gi', $text, $regs);
$result = $regs[0];