Working with HTML encoding in PHP (intelligent way to decode) - php

From a PHP script I'm downloading an RSS feed like this:
$fp = fopen('http://news.google.es/news?cf=all&ned=es_ve&hl=es&output=rss','r')
or die('Error reading RSS data.');
The feed is a Spanish news feed. After downloading the file I parse all the info into one variable that holds only the content of the <description> tag of every <item>. The issue is that when I echo the variable, the text appears with some kind of HTML encoding, like this:
echo($result); // this prints: el ministerio pãºblico investigarã¡ la publicaciã³n en la primera pã¡gina
I could write a HUGE switch statement that looks for every such sequence and replaces it with the corresponding character (ã¡ for Á, and so on), but is there a way to do this with a single function? Or, even better, is there a way to download the content into $fp without this encoding? Thanks!
Actual code:
<?php
$acumula="";
$insideitem = false;
$tag = '';
$title = '';
$description = '';
$link = '';
function startElement($parser, $name, $attrs) {
global $insideitem, $tag, $title, $description, $link;
if ($insideitem) {
$tag = $name;
} elseif ($name == 'ITEM') {
$insideitem = true;
}
}
function endElement($parser, $name) {
global $insideitem, $tag, $title, $description, $link, $acumula;
if ($name == 'ITEM') {
$acumula = $acumula . (trim($title)) . "<br>" . (trim($description));
$title = '';
$description = '';
$link = '';
$insideitem = false;
}
}
function characterData($parser, $data) {
global $insideitem, $tag, $title, $description, $link;
if ($insideitem) {
switch ($tag) {
case 'TITLE':
$title .= $data;
break;
case 'DESCRIPTION':
$description .= $data;
break;
case 'LINK':
$link .= $data;
break;
}
}
}
$xml_parser = xml_parser_create();
xml_set_element_handler($xml_parser, 'startElement', 'endElement');
xml_set_character_data_handler($xml_parser, "characterData");
$fp = fopen('http://news.google.es/news?cf=all&ned=es_ve&hl=es&output=rss','r')
or die('Error reading RSS data.');
while ($data = fread($fp, 4096)) {
xml_parse($xml_parser, $data, feof($fp))
or die(sprintf('XML error: %s at line %d',
xml_error_string(xml_get_error_code($xml_parser)),
xml_get_current_line_number($xml_parser)));
}
//echo $acumula;
fclose($fp);
xml_parser_free($xml_parser);
echo($acumula); // THIS IS $RESULT!
?>

EDIT
Since you're already using the XML parser, you're guaranteed the encoding is UTF-8.
If your page is encoded in ISO-8859-1, or even ASCII, you can do this to convert:
$result = mb_convert_encoding($result, "HTML-ENTITIES", "UTF-8");
Use a library that handles this for you, e.g. the DOM extension or SimpleXML. Example:
$d = new DOMDocument();
$d->load('http://news.google.es/news?cf=all&ned=es_ve&hl=es&output=rss');
//now all the data you get will be encoded in UTF-8
Example with SimpleXML:
$url = 'http://news.google.es/news?cf=all&ned=es_ve&hl=es&output=rss';
if ($sxml = simplexml_load_file($url)) {
echo htmlspecialchars($sxml->channel->title); //UTF-8
}
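The garbled output in the question looks like UTF-8 bytes being displayed as ISO-8859-1. The sketch below reproduces that effect and shows the usual remedies. It assumes the mbstring extension is loaded, and the sample word is only an illustration. Note that the "HTML-ENTITIES" target used with mb_convert_encoding() is deprecated as of PHP 8.2, which is why the sketch uses mb_encode_numericentity() for that step.

```php
<?php
// Sample text as UTF-8, which is what the XML parser hands back by default.
$utf8 = "p\u{00FA}blico"; // "público"

// Reading those UTF-8 bytes as ISO-8859-1 reproduces the garbling from the question.
$garbled = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
echo $garbled, "\n"; // pÃºblico

// Remedy 1: keep UTF-8 and tell the browser so, before any output is sent:
// header('Content-Type: text/html; charset=UTF-8');

// Remedy 2: convert to the encoding the page really uses.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

// Remedy 3: emit numeric entities, which survive in any ASCII-compatible page.
$entities = mb_encode_numericentity($utf8, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');
echo $entities, "\n"; // p&#250;blico
```

If the page itself can be served as UTF-8, remedy 1 is usually the least intrusive, because the data never has to be converted.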

You can use PHP's DOMDocument to strip the HTML encoding tags, and PHP's encoding conversion functions to change the encoding of the string.

Related

How do I get the node names from xml_parser()

I am trying to pre-sort and slice a big XML file for later processing via xml_parser:
function CreateXMLParser($CHARSET, $bareXML = false) {
$CURRXML = xml_parser_create($CHARSET);
xml_parser_set_option( $CURRXML, XML_OPTION_CASE_FOLDING, false);
xml_parser_set_option( $CURRXML, XML_OPTION_TARGET_ENCODING, $CHARSET);
xml_set_element_handler($CURRXML, 'startElement', 'endElement');
xml_set_character_data_handler($CURRXML, 'dataHandler');
xml_set_default_handler($CURRXML, 'defaultHandler');
if ($bareXML) {
xml_parse($CURRXML, '<?xml version="1.0"?>', 0);
}
return $CURRXML;
}
function ChunkXMLBigFile($file, $tag = 'item', $howmany = 1000) {
global $CHUNKON, $CHUNKS, $ITEMLIMIT;
$CHUNKON = $tag;
$ITEMLIMIT = $howmany;
$xml = CreateXMLParser('UTF-8', false);
$fp = fopen($file, "r");
$CHUNKS = 0;
while(!feof($fp)) {
$chunk = fgets($fp, 10240);
xml_parse($xml, $chunk, feof($fp));
}
xml_parser_free($xml);
processChunk();
}
function processChunk() {
global $CHUNKS, $PAYLOAD, $ITEMCOUNT;
if ('' == $PAYLOAD) {
return;
}
$xp = fopen($file = "xmlTemp/slices/slice_".$CHUNKS.".xml", "w");
fwrite($xp, '<?xml version="1.0" ?>'."\n");
fwrite($xp, "<producten>");
fwrite($xp, $PAYLOAD);
fwrite($xp, "</producten>");
fclose($xp);
print "Written ".$file."<br>";
$CHUNKS++;
$PAYLOAD = '';
$ITEMCOUNT = 0;
}
function startElement($xml, $tag, $attrs = array()) {
global $PAYLOAD, $CHUNKS, $ITEMCOUNT, $CHUNKON;
if (!($CHUNKS||$ITEMCOUNT)) {
if ($CHUNKON == strtolower($tag)) {
$PAYLOAD = '';
}
} else {
$PAYLOAD .= "<".$tag;
}
foreach($attrs as $k => $v) {
$PAYLOAD .= " $k=".'"'.addslashes($v).'"';
}
$PAYLOAD .= '>';
}
function endElement($xml, $tag) {
global $CHUNKON, $ITEMCOUNT, $ITEMLIMIT;
dataHandler(null, "</$tag>"); // append the closing tag to the payload
if ($CHUNKON == strtolower($tag)) {
if (++$ITEMCOUNT >= $ITEMLIMIT) {
processChunk();
}
}
}
function dataHandler($xml, $data) {
global $PAYLOAD;
$PAYLOAD .= $data;
}
But how can I access the node name?
I have to filter out some items (each with n nodes) before the slice file is saved. The XML is parsed line by line, right? So I have to store the nodes of a whole item temporarily and then decide whether the item is going to be written to the file. Is there a way to do this?
Your code is effectively reading the entire source file every time you call the ChunkXMLBigFile function.
After your while loop you have all the elements, which you can then manipulate any way you like.
See the following questions about how to approach this:
How to sort a xml file using DOM
Sort XML nodes with PHP
If you parse the chunks after that in batches of $howmany you are where you want to be.
Tip: there are many examples online where this functionality is presented in an object-oriented programming (OOP) approach, with all the functions inside a class. This also eliminates the need for global variables, which can cause some (read: a lot of) frustration and confusion.
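A class-based version might look like the sketch below. The node name is simply the second argument passed to the start and end handlers. The class name ItemCollector, the sample XML, and the price filter are all hypothetical. The sketch assumes flat items (no nested child elements) and parses a string rather than a file to stay short.

```php
<?php
// Hypothetical helper class: collects each item and keeps only those
// accepted by a caller-supplied filter. Not part of the original code.
class ItemCollector
{
    private $itemTag;
    private $filter;
    private $current = null; // fields of the item being read, or null outside an item
    private $field = null;   // name of the child node currently being read
    public $items = [];

    public function __construct(string $itemTag, callable $filter)
    {
        $this->itemTag = $itemTag;
        $this->filter = $filter;
    }

    public function parse(string $xml): void
    {
        $parser = xml_parser_create('UTF-8');
        xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
        xml_set_element_handler($parser, [$this, 'start'], [$this, 'end']);
        xml_set_character_data_handler($parser, [$this, 'data']);
        xml_parse($parser, $xml, true);
    }

    // $name is the node name of the element that was just opened.
    public function start($parser, string $name, array $attrs): void
    {
        if ($name === $this->itemTag) {
            $this->current = [];
        } elseif ($this->current !== null) {
            $this->field = $name;
            $this->current[$name] = '';
        }
    }

    public function data($parser, string $data): void
    {
        if ($this->current !== null && $this->field !== null) {
            $this->current[$this->field] .= $data;
        }
    }

    // When an item closes, the whole item is available and can be kept or dropped.
    public function end($parser, string $name): void
    {
        if ($name === $this->itemTag) {
            if (($this->filter)($this->current)) {
                $this->items[] = $this->current;
            }
            $this->current = null;
        }
        $this->field = null;
    }
}

$xml = '<producten><item><name>a</name><price>5</price></item>'
     . '<item><name>b</name><price>50</price></item></producten>';
$collector = new ItemCollector('item', function (array $item) {
    return (int) $item['price'] >= 10; // keep only items costing 10 or more
});
$collector->parse($xml);
print_r($collector->items);
```

Because the decision is made in end(), an item is written out only after all of its nodes have been seen, which is the temporary buffering the question asks about.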

Quote issue in PHP

I am scraping data from a Telugu site.
When I get a string like "Suriya’s ‘24’ in legal tangle", the curly quotes are not recognized by the PHP functions and are converted into different characters (Issue Link).
Code:
//
include "simple_html_dom.php";
// Get news from telugu site
$url = "http://www.123telugu.com/category/mnews";
$html = file_get_html($url);
$divs = $html->find('div.leading');
$result = array();
$status = FALSE;
$i = 0;
foreach ($divs as $d) {
$status = TRUE;
$title = $d->find('a', 0)->plaintext;
$result[$i]['Title'] = $title;
$link = $d->find('a', 0)->href;
$result[$i]['Link'] = $link;
$title = trim(mysql_real_escape_string($title)); // code for title
$html = file_get_html($link);
// code for image
$image = '';
foreach ($html->find('div.post-content') as $im) {
$image = $im->find('img', 0)->src; // code for image
}
$image = trim(str_replace('//', '', $image));
$result[$i]['Image'] = $image;
// code for content
$content = '';
foreach ($html->find('div.post-content p') as $co) {
$content.= $co->plaintext; // code for content
}
$result[$i]['Content'] = $content;
$i++;
}
echo json_encode(array('Status' => $status, 'Data' => $result));
Adding the following tag at the top of the page will solve the issue:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
Solution:
$string= iconv('utf-8', 'us-ascii//TRANSLIT', $string);
htmlspecialchars_decode() might be the function that you are looking for. Just run the final output from the scraper with this function. It should decode all the special HTML encoded characters.
Check out: http://php.net/htmlspecialchars_decode
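One caveat, as a sketch: htmlspecialchars_decode() only reverses the handful of entities that htmlspecialchars() produces, so if the scraped text carries the curly quotes as numeric entities (an assumption about the source page), html_entity_decode() is the function that converts them. The sample string below is hypothetical.

```php
<?php
// Hypothetical scraped title in which the curly quotes arrive as numeric entities.
$scraped = 'Suriya&#8217;s &#8216;24&#8217; in legal tangle';

// htmlspecialchars_decode() leaves numeric entities such as &#8217; untouched.
$partial = htmlspecialchars_decode($scraped);

// html_entity_decode() converts them, given the target charset.
$title = html_entity_decode($scraped, ENT_QUOTES | ENT_HTML5, 'UTF-8');
echo $title, "\n";

// Keep the characters readable in the JSON output instead of \uXXXX escapes.
echo json_encode(['Title' => $title], JSON_UNESCAPED_UNICODE), "\n";
```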

Rss Feed, generating the image

I am trying to generate an RSS feed on my site using the code below. The RSS is appearing, but I have two issues:
When the feed shows on my page the images do not show up; instead the img markup appears directly on the page, like this... <img src="http://graphics8.nytimes.com/images/2011/11/18/movies/18RDP_GARBO/18RDP_GARBO-thumbStandard.jpg" border="0" height="75" width="75" hspace="4" align="left">
How do I limit the number of articles that appear on my site?
Here is the link to the RSS: Spy RSS FEED
Here is the code I am using:
<?php
$insideitem = false;
$tag = "";
$title = "";
$description = "";
$link = "";
$locations = array('http://topics.nytimes.com/topics/reference/timestopics/subjects/e/espionage/index.html?rss=1');
srand((float) microtime() * 10000000); // seed the random gen
$random_key = array_rand($locations);
function startElement($parser, $name, $attrs) {
global $insideitem, $tag, $title, $description, $link;
if ($insideitem) {
$tag = $name;
} elseif ($name == "ITEM") {
$insideitem = true;
}
}
function endElement($parser, $name) {
global $insideitem, $tag, $title, $description, $link;
if ($name == "ITEM") {
printf("<dt><b><a href='%s' target=new>%s</a></b></dt>",
trim($link),htmlspecialchars(trim($title)));
printf("<dt>%s</dt><br><br>",htmlspecialchars(trim($description)));
$title = "";
$description = "";
$link = "";
$insideitem = false;
}
}
function characterData($parser, $data) {
global $insideitem, $tag, $title, $description, $link;
if ($insideitem) {
switch ($tag) {
case "TITLE":
$title .= $data;
break;
case "DESCRIPTION":
$description .= $data;
break;
case "LINK":
$link .= $data;
break;
}
}
}
$xml_parser = xml_parser_create();
xml_set_element_handler($xml_parser, "startElement", "endElement");
xml_set_character_data_handler($xml_parser, "characterData");
$fp = fopen($locations[$random_key], 'r')
or die("Error reading RSS data.");
while ($data = fread($fp, 4096))
xml_parse($xml_parser, $data, feof($fp))
or die(sprintf("XML error: %s at line %d",
xml_error_string(xml_get_error_code($xml_parser)),
xml_get_current_line_number($xml_parser)));
fclose($fp);
xml_parser_free($xml_parser);
?>
In endElement(), the feed content is output with printf("<dt>%s</dt><br><br>",htmlspecialchars(trim($description)));
If you remove the htmlspecialchars() call, images and other HTML should display properly instead of < being converted to &lt; and so on.
Given that code, there is no built-in way to limit the number of feed items. The NYTimes may have an option you can pass in the query string to restrict the number of results, but I am not sure about that.
A quick fix is to add a global variable called $numShown or similar, increment it at the beginning of endElement(), and check whether it is above some value; if so, skip the printf calls that output the feed item.
<?php
$numShown = 0; // number of feed items seen so far
function endElement($parser, $name) {
global $insideitem, $tag, $title, $description, $link, $numShown;
if ($name == "ITEM") {
$numShown++;
// Print only the first 5 items; the state below is reset for every item.
if ($numShown <= 5) {
printf ( "<dt><b><a href='%s' target=new>%s</a></b></dt>", trim ( $link ), htmlspecialchars ( trim ( $title ) ) );
printf ( "<dt>%s</dt><br><br>", trim ( $description ) );
}
$title = "";
$description = "";
$link = "";
$insideitem = false;
}
}
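Printing the description without htmlspecialchars() outputs whatever markup the feed contains. A slightly more conservative sketch is to allow only the tags you expect, using strip_tags() with an allow-list. The sample description is hypothetical, and strip_tags() does not filter attributes, so this is not a complete sanitizer.

```php
<?php
// Hypothetical description text as it might arrive from a feed.
$description = '<img src="http://example.com/a.jpg" width="75"><script>alert(1)</script>Some text';

// Keep only <img> and <a>; other tags are removed, but their text content stays.
$safer = strip_tags($description, '<img><a>');
echo $safer, "\n";
```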

How to add rel="nofollow" to links with preg_replace()

The function below is designed to apply rel="nofollow" attributes to all external links and no internal links unless the path matches a predefined root URL defined as $my_folder below.
So given the variables...
$my_folder = 'http://localhost/mytest/go/';
$blog_url = 'http://localhost/mytest';
And the content...
internal
internal cloaked link
external
The end result, after replacement should be...
internal
internal cloaked link
external
Notice that the first link is not altered, since it's an internal link.
The link on the second line is also an internal link, but since it matches our $my_folder string, it gets the nofollow too.
The third link is the easiest: since it does not match $blog_url, it's obviously an external link.
However, in the script below, ALL of my links are getting nofollow. How can I fix the script to do what I want?
function save_rseo_nofollow($content) {
$my_folder = $rseo['nofollow_folder'];
$blog_url = get_bloginfo('url');
preg_match_all('~<a.*>~isU',$content["post_content"],$matches);
for ( $i = 0; $i <= sizeof($matches[0]); $i++){
if ( !preg_match( '~nofollow~is',$matches[0][$i])
&& (preg_match('~' . $my_folder . '~', $matches[0][$i])
|| !preg_match( '~'.$blog_url.'~',$matches[0][$i]))){
$result = trim($matches[0][$i],">");
$result .= ' rel="nofollow">';
$content["post_content"] = str_replace($matches[0][$i], $result, $content["post_content"]);
}
}
return $content;
}
Here is the DOMDocument solution...
$str = 'internal
internal cloaked link
external
external
external
external
';
$dom = new DOMDocument();
$dom->preserveWhiteSpace = FALSE; // the property name is case-sensitive
$dom->loadHTML($str);
$a = $dom->getElementsByTagName('a');
$host = strtok($_SERVER['HTTP_HOST'], ':');
foreach($a as $anchor) {
$href = $anchor->attributes->getNamedItem('href')->nodeValue;
if (preg_match('/^https?:\/\/' . preg_quote($host, '/') . '/', $href)) {
continue;
}
$noFollowRel = 'nofollow';
$oldRelAtt = $anchor->attributes->getNamedItem('rel');
if ($oldRelAtt == NULL) {
$newRel = $noFollowRel;
} else {
$oldRel = $oldRelAtt->nodeValue;
$oldRel = explode(' ', $oldRel);
if (in_array($noFollowRel, $oldRel)) {
continue;
}
$oldRel[] = $noFollowRel;
$newRel = implode(' ', $oldRel); // separator first; the reversed order was removed in PHP 8
}
$newRelAtt = $dom->createAttribute('rel');
$noFollowNode = $dom->createTextNode($newRel);
$newRelAtt->appendChild($noFollowNode);
$anchor->appendChild($newRelAtt);
}
var_dump($dom->saveHTML());
Output
string(509) "<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
internal
internal cloaked link
external
external
external
external
</body></html>
"
Try to make it more readable first, and only afterwards make your if rules more complex:
function save_rseo_nofollow($content) {
$content["post_content"] =
preg_replace_callback('~<(a\s[^>]+)>~isU', "cb2", $content["post_content"]);
return $content;
}
function cb2($match) {
list($original, $tag) = $match; // regex match groups
$my_folder = "/hostgator"; // re-add quirky config here
$blog_url = "http://localhost/";
if (strpos($tag, "nofollow")) {
return $original;
}
elseif (strpos($tag, $blog_url) && (!$my_folder || !strpos($tag, $my_folder))) {
return $original;
}
else {
return "<$tag rel='nofollow'>";
}
}
Gives following output:
[post_content] =>
internal
<a href="http://localhost/mytest/go/hostgator" rel=nofollow>internal cloaked link</a>
<a href="http://cnn.com" rel=nofollow>external</a>
The problem in your original code might have been $rseo which wasn't declared anywhere.
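To illustrate why an undeclared $rseo marks every link: $my_folder ends up empty, the pattern becomes '~~', and an empty pattern matches any string. The sketch below shows that, plus a hypothetical helper needs_nofollow() that treats an empty folder setting as "no match". The /about URL is made up for the example.

```php
<?php
// Inside the function $rseo is not defined, so $my_folder ends up empty
// and the pattern becomes '~~', which matches every string.
$my_folder = '';
$matchesEverything = preg_match('~' . $my_folder . '~', '<a href="http://localhost/mytest/">');
var_dump($matchesEverything); // int(1)

// Hypothetical helper: decides whether an <a ...> tag should get rel="nofollow".
function needs_nofollow(string $tag, string $blogUrl, string $myFolder): bool
{
    if (stripos($tag, 'nofollow') !== false) {
        return false; // already marked
    }
    $isInternal = strpos($tag, $blogUrl) !== false;
    // An empty folder setting must never count as a match.
    $isCloaked = $myFolder !== '' && strpos($tag, $myFolder) !== false;
    return $isCloaked || !$isInternal;
}

$blog = 'http://localhost/mytest';
$folder = 'http://localhost/mytest/go/';
var_dump(needs_nofollow('<a href="http://localhost/mytest/about">', $blog, $folder));        // bool(false)
var_dump(needs_nofollow('<a href="http://localhost/mytest/go/hostgator">', $blog, $folder)); // bool(true)
var_dump(needs_nofollow('<a href="http://cnn.com">', $blog, $folder));                       // bool(true)
```

Plain strpos() comparisons also sidestep the need to escape the URLs with preg_quote() before putting them into a pattern.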
Try this one (PHP 5.3+). It can:
skip a selected address
keep a manually set rel parameter
Here is the code:
function nofollow($html, $skip = null) {
return preg_replace_callback(
"#(<a[^>]+?)>#is", function ($mach) use ($skip) {
return (
!($skip && strpos($mach[1], $skip) !== false) &&
strpos($mach[1], 'rel=') === false
) ? $mach[1] . ' rel="nofollow">' : $mach[0];
},
$html
);
}
Examples:
echo nofollow('something');
// stays the same because it already contains a rel parameter
echo nofollow('something');
// adds the rel="nofollow" parameter to the anchor
echo nofollow('something', 'localhost');
// skips this link because it is an internal link
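The example strings above have lost their markup, so here is a self-contained sketch of the same idea with hypothetical anchors. The function is renamed nofollow_sketch() to avoid clashing with the one above, and its pattern requires whitespace after <a so that tags such as <abbr> are not touched. That tweak is an assumption on my part, not part of the original answer.

```php
<?php
// Sketch of the closure-based approach; the anchors below are made-up examples.
function nofollow_sketch(string $html, ?string $skip = null): string
{
    return preg_replace_callback(
        '#(<a\s[^>]*?)>#is',
        function (array $m) use ($skip) {
            $isSkipped = $skip !== null && strpos($m[1], $skip) !== false;
            $hasRel = strpos($m[1], 'rel=') !== false;
            // Leave skipped links and links with an existing rel untouched.
            return ($isSkipped || $hasRel) ? $m[0] : $m[1] . ' rel="nofollow">';
        },
        $html
    );
}

echo nofollow_sketch('<a href="http://example.com/">x</a>'), "\n";
// <a href="http://example.com/" rel="nofollow">x</a>
echo nofollow_sketch('<a href="http://example.com/" rel="author">x</a>'), "\n";
// unchanged: a rel attribute is already present
echo nofollow_sketch('<a href="http://localhost/page">x</a>', 'localhost'), "\n";
// unchanged: the link contains the skipped address
```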
Using regular expressions to do this job properly would be quite complicated. It would be easier to use an actual parser, such as the one from the DOM extension. DOM isn't very beginner-friendly, so what you can do is load the HTML with DOM then run the modifications with SimpleXML. They're backed by the same library, so it's easy to use one with the other.
Here's what it can look like:
$my_folder = 'http://localhost/mytest/go/';
$blog_url = 'http://localhost/mytest';
$html = '<html><body>
internal
internal cloaked link
external
</body></html>';
$dom = new DOMDocument;
$dom->loadHTML($html);
$sxe = simplexml_import_dom($dom);
// grab all <a> nodes with an href attribute
foreach ($sxe->xpath('//a[@href]') as $a)
{
if (substr($a['href'], 0, strlen($blog_url)) === $blog_url
&& substr($a['href'], 0, strlen($my_folder)) !== $my_folder)
{
// skip all links that start with the URL in $blog_url, as long as they
// don't start with the URL from $my_folder;
continue;
}
if (empty($a['rel']))
{
$a['rel'] = 'nofollow';
}
else
{
$a['rel'] .= ' nofollow';
}
}
$new_html = $dom->saveHTML();
echo $new_html;
As you can see, it's really short and simple. Depending on your needs, you may want to use preg_match() in place of the substr() comparisons, for example:
// change the regexp to your own rules, here we match everything under
// "http://localhost/mytest/" as long as it's not followed by "go"
if (preg_match('#^http://localhost/mytest/(?!go)#', $a['href']))
{
continue;
}
Note
I missed the last code block in the OP when I first read the question. The code I posted (and basically any solution based on DOM) is better suited to processing a whole page than an HTML block. Otherwise, DOM will attempt to "fix" your HTML and may add a <body> tag, a DOCTYPE, etc.
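If you do need to process a fragment with DOM, one workaround (a sketch, assuming a hypothetical ASCII-only fragment) is to let DOM add its wrappers on load and then serialize only the children of <body>, so the DOCTYPE and the <html>/<body> tags are left out of the result.

```php
<?php
// Hypothetical fragment; the DOM extension wraps it in <html><body> when loading.
$fragment = '<p><a href="http://example.com/">external</a></p>';

$dom = new DOMDocument();
$dom->loadHTML($fragment);

foreach ($dom->getElementsByTagName('a') as $a) {
    $a->setAttribute('rel', 'nofollow');
}

// Serialize only the children of <body>, which leaves out the DOCTYPE and wrappers.
$body = $dom->getElementsByTagName('body')->item(0);
$out = '';
foreach ($body->childNodes as $child) {
    $out .= $dom->saveHTML($child);
}
echo $out, "\n";
```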
Thanks @alex for your nice solution. However, I had a problem with Japanese text, which I fixed as follows. This code can also skip multiple domains via the $whiteList array.
public function addRelNoFollow($html, $whiteList = [])
{
$dom = new \DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
$a = $dom->getElementsByTagName('a');
/** @var \DOMElement $anchor */
foreach ($a as $anchor) {
$href = $anchor->attributes->getNamedItem('href')->nodeValue;
$domain = parse_url($href, PHP_URL_HOST);
// Skip whiteList domains
if (in_array($domain, $whiteList, true)) {
continue;
}
// Check & get existing rel attribute values
$noFollow = 'nofollow';
$rel = $anchor->attributes->getNamedItem('rel');
if ($rel) {
$values = explode(' ', $rel->nodeValue);
if (in_array($noFollow, $values, true)) {
continue;
}
$values[] = $noFollow;
$newValue = implode(' ', $values); // separator first; the reversed order was removed in PHP 8
} else {
$newValue = $noFollow;
}
// Create new rel attribute
$rel = $dom->createAttribute('rel');
$node = $dom->createTextNode($newValue);
$rel->appendChild($node);
$anchor->appendChild($rel);
}
// saveHTML() and saveXML() reportedly do not save UTF-8 characters correctly on Unix,
// although they work on Windows, so the document element is serialized explicitly.
// @see https://stackoverflow.com/a/20675396/1710782
return $dom->saveHTML($dom->documentElement);
}
<?
$str='internal
internal cloaked link
external';
function test($x){
if (preg_match('#localhost/mytest/(?!go/)#i',$x[0])>0) return $x[0];
return 'rel="nofollow" '.$x[0];
}
echo preg_replace_callback('/href=[\'"][^\'"]+/i', 'test', $str);
?>
Here is another solution, which has a whitelist option and adds the target="_blank" attribute.
It also checks whether a rel attribute already exists before adding a new one.
function Add_Nofollow_Attr($Content, $Whitelist = [], $Add_Target_Blank = true)
{
$Whitelist[] = $_SERVER['HTTP_HOST'];
foreach ($Whitelist as $Key => $Link)
{
$Host = preg_replace('#^https?://#', '', $Link);
$Host = "https?://". preg_quote($Host, '/');
$Whitelist[$Key] = $Host;
}
if(preg_match_all("/<a .*?>/", $Content, $matches, PREG_SET_ORDER))
{
foreach ($matches as $Anchor_Tag)
{
$IS_Rel_Exist = $IS_Follow_Exist = $IS_Target_Blank_Exist = $Is_Valid_Tag = false;
if(preg_match_all("/(\w+)\s*=\s*['|\"](.*?)['|\"]/",$Anchor_Tag[0],$All_matches2))
{
foreach ($All_matches2[1] as $Key => $Attr_Name)
{
if($Attr_Name == 'href')
{
$Is_Valid_Tag = true;
$Url = $All_matches2[2][$Key];
// bypass #.. or internal links like "/"
if(preg_match('/^\s*[#|\/].*/', $Url))
{
continue 2;
}
foreach ($Whitelist as $Link)
{
if (preg_match("#$Link#", $Url)) {
continue 3;
}
}
}
else if($Attr_Name == 'rel')
{
$IS_Rel_Exist = true;
$Rel = $All_matches2[2][$Key];
preg_match("/[n|d]ofollow/", $Rel, $match, PREG_OFFSET_CAPTURE);
if( count($match) > 0 )
{
$IS_Follow_Exist = true;
}
else
{
$New_Rel = 'rel="'. $Rel . ' nofollow"';
}
}
else if($Attr_Name == 'target')
{
$IS_Target_Blank_Exist = true;
}
}
}
$New_Anchor_Tag = $Anchor_Tag;
if(!$IS_Rel_Exist)
{
$New_Anchor_Tag = str_replace(">",' rel="nofollow">',$Anchor_Tag);
}
else if(!$IS_Follow_Exist)
{
$New_Anchor_Tag = preg_replace("/rel=[\"|'].*?[\"|']/",$New_Rel,$Anchor_Tag);
}
if($Add_Target_Blank && !$IS_Target_Blank_Exist)
{
$New_Anchor_Tag = str_replace(">",' target="_blank">',$New_Anchor_Tag);
}
$Content = str_replace($Anchor_Tag,$New_Anchor_Tag,$Content);
}
}
return $Content;
}
To use it:
$Page_Content = 'internal
internal
google
example
stackoverflow';
$Whitelist = ["http://yoursite.com","http://localhost"];
echo Add_Nofollow_Attr($Page_Content,$Whitelist,true);
WordPress solution:
function replace__method($match) {
list($original, $tag) = $match; // regex match groups
$my_folder = "/articles"; // re-add quirky config here
$blog_url = 'https://'.$_SERVER['SERVER_NAME'];
if (strpos($tag, "nofollow")) {
return $original;
}
elseif (strpos($tag, $blog_url) && (!$my_folder || !strpos($tag, $my_folder))) {
return $original;
}
else {
return "<$tag rel='nofollow'>";
}
}
add_filter( 'the_content', 'add_nofollow_to_external_links', 1 );
function add_nofollow_to_external_links( $content ) {
$content = preg_replace_callback('~<(a\s[^>]+)>~isU', "replace__method", $content);
return $content;
}
A script that adds nofollow automatically and keeps the other attributes:
function nofollow(string $html, ?string $baseUrl = null) {
return preg_replace_callback(
'#<a([^>]*)>(.+)</a>#isU', function ($mach) use ($baseUrl) {
list ($a, $attr, $text) = $mach;
if (preg_match('#href=["\']([^"\']*)["\']#', $attr, $url)) {
$url = $url[1];
if (is_null($baseUrl) || !str_starts_with($url, $baseUrl)) {
if (preg_match('#rel=["\']([^"\']*)["\']#', $attr, $rel)) {
$relAttr = $rel[0];
$rel = $rel[1];
}
$rel = 'rel="' . ($rel ? (strpos($rel, 'nofollow') !== false ? $rel : $rel . ' nofollow') : 'nofollow') . '"';
$attr = isset($relAttr) ? str_replace($relAttr, $rel, $attr) : $attr . ' ' . $rel;
$a = '<a ' . $attr . '>' . $text . '</a>';
}
}
return $a;
},
$html
);
}

PHP: Fatal error: Cannot redeclare startElement()

For some reason I get the error below when I use multiple require() calls in my PHP. Basically, I use a couple of require() calls to include a couple of XML parser pages.
Does anyone know how to fix this? If this isn't descriptive enough, please say so below and I will try to fix it. Thank you; I appreciate any positive feedback. Also, I'm just learning PHP, so please don't be too harsh on me. The code follows.
Here is the error:
Fatal error: Cannot redeclare startElement() (previously declared in /Applications/XAMPP/xamppfiles/htdocs/yournewsflow/news/sports.php:27) in /Applications/XAMPP/xamppfiles/htdocs/yournewsflow/news/political.php on line 34
Here are the require functions:
<?php
require("news/sports.php");
require("news/political.php");
?>
Here is the xml parser used for a couple pages:
<?php
$tag = "";
$title = "";
$description = "";
$link = "";
$pubDate = "";
$show= 50;
$feedzero = "http://feeds.finance.yahoo.com/rss/2.0/category-stocks?region=US&lang=en-US"; $feedone = "http://feeds.finance.yahoo.com/rss/2.0/category-ideas-and-strategies?region=US&lang=en-US";
$feedtwo = "http://feeds.finance.yahoo.com/rss/2.0/category-earnings?region=US&lang=en-US"; $feedthree = "http://feeds.finance.yahoo.com/rss/2.0/category-bonds?region=US&lang=en-US";
$feedfour = "http://feeds.finance.yahoo.com/rss/2.0/category-economy-govt-and-policy?region=US&lang=en-US";
$insideitem = false;
$counter = 0;
$outerData;
function startElement($parser, $name, $attrs) {
global $insideitem, $tag, $title, $description, $link, $pubDate;
if ($insideitem) {
$tag = $name;
} elseif ($name == "ITEM") {
$insideitem = true;
} }
function endElement($parser, $name) {
global $insideitem, $tag, $counter, $show, $showHTML, $outerData;
global $title, $description, $link, $pubDate;
if ($name == "ITEM" && $counter < $show) {
echo "<table>
<tr>
<td>
".htmlspecialchars($description)."
</td>
</tr>";
// if you chose to show the HTML
if ($showHTML) {
$title = htmlspecialchars($title);
$description = htmlspecialchars($description);
$link = htmlspecialchars($link);
$pubDate = htmlspecialchars($pubDate);
// if you chose not to show the HTML
} else {
$title = strip_tags($title);
$description = strip_tags($description);
$link = strip_tags($link);
$pubDate = strip_tags($pubDate);
}
// fill the innerData array
$innerData["title"] = $title;
$innerData["description"] = $description;
$innerData["link"] = $link;
$innerData["pubDate"] = $pubDate;
// fill one index of the outerData array
$outerData["data".$counter] = $innerData;
// make all the variables blank for the next iteration of the loop
$title = "";
$description = "";
$link = "";
$pubDate = "";
$insideitem = false;
// add one to the counter
$counter++;
}
}
function characterData($parser, $data) {
global $insideitem, $tag, $title, $description, $link, $pubDate;
if ($insideitem) {
switch ($tag) {
case "TITLE":
$title .= $data;
break;
case "DESCRIPTION":
$description .= $data;
break;
case "LINK":
$link .= $data;
break;
case "PUBDATE":
$pubDate .= $data;
break;
}
}
}
// Create an XML parser
$xml_parser = xml_parser_create();
// Set the functions to handle opening and closing tags
xml_set_element_handler($xml_parser, "startElement", "endElement");
// Set the function to handle blocks of character data
xml_set_character_data_handler($xml_parser, "characterData");
// if you started with feed:// fix it to html://
// Open the XML file for reading
$feedzeroFp = fopen($feedzero, 'r') or die("Error reading RSS data.");
$feedoneFp = fopen($feedone, 'r') or die("Error reading RSS data.");
$feedtwoFp = fopen($feedtwo, 'r') or die("Error reading RSS data.");
$feedthreeFp = fopen($feedthree, 'r') or die("Error reading RSS data.");
$feedfourFp = fopen($feedfour, 'r') or die("Error reading RSS data.");
// Read the XML file 4KB at a time
while ($data = fread($feedoneFp, 4096))
//Parse each 4KB chunk with the XML parser created above
xml_parse($xml_parser,$data,feof($feedoneFp))
//Handle errors in parsing
or die(sprintf("XML error: %s at line %d",
xml_error_string(xml_get_error_code($xml_parser)),
xml_get_current_line_number($xml_parser)));
// Close the XML file
fclose($feedoneFp);
while ($data = fread($feedtwoFp, 4096))
//Parse each 4KB chunk with the XML parser created above
xml_parse($xml_parser,$data,feof($feedtwoFp))
//Handle errors in parsing
or die(sprintf("XML error: %s at line %d",
xml_error_string(xml_get_error_code($xml_parser)),
xml_get_current_line_number($xml_parser)));
// Close the XML file
fclose($feedtwoFp);
while ($data = fread($feedthreeFp, 4096))
//Parse each 4KB chunk with the XML parser created above
xml_parse($xml_parser,$data,feof($feedthreeFp))
//Handle errors in parsing
or die(sprintf("XML error: %s at line %d",
xml_error_string(xml_get_error_code($xml_parser)),
xml_get_current_line_number($xml_parser)));
// Close the XML file
fclose($feedthreeFp);
while ($data = fread($feedfourFp, 4096))
//Parse each 4KB chunk with the XML parser created above
xml_parse($xml_parser,$data,feof($feedfourFp))
//Handle errors in parsing
or die(sprintf("XML error: %s at line %d",
xml_error_string(xml_get_error_code($xml_parser)),
xml_get_current_line_number($xml_parser)));
// Close the XML file
fclose($feedfourFp);
// Free up memory used by the XML parser
xml_parser_free($xml_parser);
?>
You can't require the same "parser" more than once because you've already defined the functions in that file. You need to restructure your code:
In parser.functions.php:
function startElement($parser, $name, $attrs) {
global $insideitem, $tag, $title, $description, $link, $pubDate;
if ($insideitem) {
$tag = $name;
} elseif ($name == "ITEM") {
$insideitem = true;
} }
function endElement($parser, $name) {
global $insideitem, $tag, $counter, $show, $showHTML, $outerData;
global $title, $description, $link, $pubDate;
if ($name == "ITEM" && $counter < $show) {
echo "<table>
<tr>
<td>
".htmlspecialchars($description)."
</td>
</tr>";
// if you chose to show the HTML
if ($showHTML) {
$title = htmlspecialchars($title);
$description = htmlspecialchars($description);
$link = htmlspecialchars($link);
$pubDate = htmlspecialchars($pubDate);
// if you chose not to show the HTML
} else {
$title = strip_tags($title);
$description = strip_tags($description);
$link = strip_tags($link);
$pubDate = strip_tags($pubDate);
}
// fill the innerData array
$innerData["title"] = $title;
$innerData["description"] = $description;
$innerData["link"] = $link;
$innerData["pubDate"] = $pubDate;
// fill one index of the outerData array
$outerData["data".$counter] = $innerData;
// make all the variables blank for the next iteration of the loop
$title = "";
$description = "";
$link = "";
$pubDate = "";
$insideitem = false;
// add one to the counter
$counter++;
}
}
function characterData($parser, $data) {
global $insideitem, $tag, $title, $description, $link, $pubDate;
if ($insideitem) {
switch ($tag) {
case "TITLE":
$title .= $data;
break;
case "DESCRIPTION":
$description .= $data;
break;
case "LINK":
$link .= $data;
break;
case "PUBDATE":
$pubDate .= $data;
break;
}
}
}
In your actual page PHP files:
$tag = "";
$title = "";
$description = "";
$link = "";
$pubDate = "";
$show= 50;
$feedzero = "http://feeds.finance.yahoo.com/rss/2.0/category-stocks?region=US&lang=en-US"; $feedone = "http://feeds.finance.yahoo.com/rss/2.0/category-ideas-and-strategies?region=US&lang=en-US";
$feedtwo = "http://feeds.finance.yahoo.com/rss/2.0/category-earnings?region=US&lang=en-US"; $feedthree = "http://feeds.finance.yahoo.com/rss/2.0/category-bonds?region=US&lang=en-US";
$feedfour = "http://feeds.finance.yahoo.com/rss/2.0/category-economy-govt-and-policy?region=US&lang=en-US";
$insideitem = false;
$counter = 0;
$outerData;
require_once('path/to/parser.functions.php');
// Create an XML parser
$xml_parser = xml_parser_create();
// Set the functions to handle opening and closing tags
xml_set_element_handler($xml_parser, "startElement", "endElement");
// Set the function to handle blocks of character data
xml_set_character_data_handler($xml_parser, "characterData");
// if you started with feed:// fix it to html://
// Open the XML file for reading
$feedzeroFp = fopen($feedzero, 'r') or die("Error reading RSS data.");
$feedoneFp = fopen($feedone, 'r') or die("Error reading RSS data.");
$feedtwoFp = fopen($feedtwo, 'r') or die("Error reading RSS data.");
$feedthreeFp = fopen($feedthree, 'r') or die("Error reading RSS data.");
$feedfourFp = fopen($feedfour, 'r') or die("Error reading RSS data.");
// Read the XML file 4KB at a time
while ($data = fread($feedoneFp, 4096))
//Parse each 4KB chunk with the XML parser created above
xml_parse($xml_parser,$data,feof($feedoneFp))
//Handle errors in parsing
or die(sprintf("XML error: %s at line %d",
xml_error_string(xml_get_error_code($xml_parser)),
xml_get_current_line_number($xml_parser)));
// Close the XML file
fclose($feedoneFp);
while ($data = fread($feedtwoFp, 4096))
//Parse each 4KB chunk with the XML parser created above
xml_parse($xml_parser,$data,feof($feedtwoFp))
//Handle errors in parsing
or die(sprintf("XML error: %s at line %d",
xml_error_string(xml_get_error_code($xml_parser)),
xml_get_current_line_number($xml_parser)));
// Close the XML file
fclose($feedtwoFp);
while ($data = fread($feedthreeFp, 4096))
//Parse each 4KB chunk with the XML parser created above
xml_parse($xml_parser,$data,feof($feedthreeFp))
//Handle errors in parsing
or die(sprintf("XML error: %s at line %d",
xml_error_string(xml_get_error_code($xml_parser)),
xml_get_current_line_number($xml_parser)));
// Close the XML file
fclose($feedthreeFp);
while ($data = fread($feedfourFp, 4096))
//Parse each 4KB chunk with the XML parser created above
xml_parse($xml_parser,$data,feof($feedfourFp))
//Handle errors in parsing
or die(sprintf("XML error: %s at line %d",
xml_error_string(xml_get_error_code($xml_parser)),
xml_get_current_line_number($xml_parser)));
// Close the XML file
fclose($feedfourFp);
// Free up memory used by the XML parser
xml_parser_free($xml_parser);
This means the function startElement was already defined. You cannot have more than one function with the same name.
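Besides require_once, another way to avoid the name clash is to not declare global handler functions at all. The sketch below wraps the parser in a single function and passes closures as handlers. The function name parse_feed_items() and the sample feed are hypothetical, and the sketch parses a string rather than reading a URL.

```php
<?php
// Hypothetical helper: the handlers are closures, so using this from several
// pages cannot collide with another startElement() declaration.
function parse_feed_items(string $xml, int $limit = 50): array
{
    $items = [];
    $current = null; // item being read, or null outside <item>
    $tag = '';

    $parser = xml_parser_create();
    xml_set_element_handler(
        $parser,
        // Element names arrive upper-cased because case folding is on by default.
        function ($p, $name, $attrs) use (&$current, &$tag) {
            if ($name === 'ITEM') {
                $current = ['title' => '', 'description' => '', 'link' => ''];
            } elseif ($current !== null) {
                $tag = strtolower($name);
            }
        },
        function ($p, $name) use (&$items, &$current, &$tag, $limit) {
            if ($name === 'ITEM') {
                if (count($items) < $limit) {
                    $items[] = array_map('trim', $current);
                }
                $current = null;
            }
            $tag = '';
        }
    );
    xml_set_character_data_handler(
        $parser,
        function ($p, $data) use (&$current, &$tag) {
            // Only collect text for the fields we track inside an item.
            if ($current !== null && isset($current[$tag])) {
                $current[$tag] .= $data;
            }
        }
    );
    if (!xml_parse($parser, $xml, true)) {
        throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
    }
    return $items;
}

$rss = '<rss><channel><title>Feed</title>'
     . '<item><title>One</title><link>http://example.com/1</link></item>'
     . '<item><title>Two</title><link>http://example.com/2</link></item>'
     . '</channel></rss>';
$result = parse_feed_items($rss, 1);
print_r($result);
```

The function itself would still live in one file loaded with require_once, but sports.php and political.php could then each call it with their own feed instead of redeclaring the handlers.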
