I have a problem using XMLReader and simplexml_import_dom. I'm trying to search four files of 2 MB each and one 27 MB file. The problem is not memory but execution time (around 50 s). How can I optimize the code?
public function searchInFeed()
{
    $feed =& $this->getModel('feed', 'afiliereFeeduriModel');
    $myfeeds = $feed->selectFeed();
    foreach ($myfeeds as $f)
    {
        $x = new XMLReader();
        $x->open($f->url);
        $z = microtime(true);
        $doc = new DOMDocument('1.0', 'UTF-8');
        set_time_limit(0);
        while ($x->read())
        {
            if ($x->nodeType === XMLReader::ELEMENT)
            {
                $nod = simplexml_import_dom($doc->importNode($x->expand(), true));
                $data['text'] = 'Chicco termometru';
                $data['titlu'] = 'title';
                $data['nod'] = &$nod;
                if ($this->searchInXML($data))
                {
                    echo $nod->title."<br>";
                }
                $x->next();
            }
        }
    }
    echo microtime(true) - $z."<br>";
    echo memory_get_usage()/1024/1024;
    die();
}
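For reference, one optimization worth trying (a sketch of my own, not from the post): only expand the elements you actually search, instead of importing every element of the document into DOM/SimpleXML. The element name 'product' below is an assumption about the feed's structure; adjust it to the real item tag.

$x = new XMLReader();
$x->open($f->url);
$doc = new DOMDocument('1.0', 'UTF-8');
while ($x->read()) {
    // Expand only the feed items we actually search; skip everything else.
    if ($x->nodeType === XMLReader::ELEMENT && $x->localName === 'product') {
        $nod = simplexml_import_dom($doc->importNode($x->expand(), true));
        if ($this->searchInXML(array('text' => 'Chicco termometru',
                                     'titlu' => 'title',
                                     'nod' => $nod))) {
            echo $nod->title."<br>";
        }
        $x->next(); // jump past the subtree we just expanded
    }
}
$x->close();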
Hello, I am trying to get all the data from an XML file, so I have used an xml2assoc function. It works for a 5 MB file but not for files larger than 9 MB.
Here is my code; I have modified the function to produce JSON:
function xml2assoc($xml, $name)
{
    $tree = null;
    while ($xml->read())
    {
        if ($xml->nodeType == XMLReader::END_ELEMENT)
        {
            return $tree;
        }
        else if ($xml->nodeType == XMLReader::ELEMENT)
        {
            // Record the tag name before recursing so children can refer to it.
            $node = array('tag' => $xml->name);
            if ($xml->hasAttributes)
            {
                $attributes = array();
                while ($xml->moveToNextAttribute())
                {
                    $attributes[$xml->name] = $xml->value;
                }
                $node['attr'] = $attributes;
            }
            if (!$xml->isEmptyElement)
            {
                // Recurse to collect this element's children.
                $node['childs'] = xml2assoc($xml, $node['tag']);
            }
            $tree[] = $node;
        }
        else if ($xml->nodeType == XMLReader::TEXT)
        {
            // Text content of the current element.
            $tree['text'] = $xml->value;
        }
    }
    return $tree;
}
I have used this function to return the JSON by passing a URL:
function PARSE_XML_JSON($url)
{
    $text = "";
    $xml = new XMLReader();
    $xml->open($url);
    $assoc = xml2assoc($xml, "root");
    $xml->close();
    if (isset($assoc['text']))
    {
        $text = $assoc['text'];
    }
    //StoreInTxtFile($text);
    return $text;
}
I have also tried to save the data to a file:
function StoreInTxtFile($data)
{
    $myFile = 'jsonfile-'.time().'.txt';
    $fh = fopen($myFile, 'w') or die("can't open file");
    fwrite($fh, $data);
    fclose($fh);
}
Please tell me what I'm missing.
Thanks
Use LIBXML_PARSEHUGE. By default libxml2 refuses to parse documents that exceed its hardcoded limits (for example on text node size); this flag lifts those limits:
$xml = new XMLReader();
$xml->open($url, NULL, LIBXML_PARSEHUGE);
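If the files keep growing, a streaming approach avoids building one huge tree at all. A minimal sketch, assuming the records you want are <item> elements (adjust the name to your document):

$xml = new XMLReader();
$xml->open($url, null, LIBXML_PARSEHUGE);
while ($xml->read()) {
    if ($xml->nodeType === XMLReader::ELEMENT && $xml->name === 'item') {
        // Parse just this one record; memory use stays flat per item.
        $node = simplexml_load_string($xml->readOuterXml());
        // ... process $node here ...
    }
}
$xml->close();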
The scrape works, but the strange thing is that the result is ["-3°"].
I have tried many different things to get just -3°.
How can [" and "] show up when they are not in the source?
Can someone give me some direction on how to achieve this?
The code I am using is:
<?php
function scrape($url)
{
    $output = file_get_contents($url);
    return $output;
}

function fetchdata($data, $start, $end)
{
    $data = stristr($data, $start);        // Stripping all data from before $start
    $data = substr($data, strlen($start)); // Stripping $start
    $stop = stripos($data, $end);          // Getting the position of the $end of the data to scrape
    $data = substr($data, 0, $stop);       // Stripping all data from after and including the $end
    return $data;                          // Returning the scraped data from the function
}

$page = scrape("https://weather.gc.ca/city/pages/bc-37_metric_e.html");
$result = fetchdata($page, "<p class=\"text-center mrgn-tp-md mrgn-bttm-sm lead\"><span class=\"wxo-metric-hide\">", "<abbr title=\"Celsius\">C</abbr>");
echo json_encode(array($result));
?>
Thanks in advance for your help!
You can use DOMDocument to parse the HTML page. (As an aside, the [" and "] in your output come from json_encode(array($result)), which wraps the value in a JSON array; echo the string directly if you just want -3°.)
$page = file_get_contents("https://weather.gc.ca/city/pages/bc-37_metric_e.html");
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($page);
libxml_use_internal_errors(false);
$paragraphs = $doc->getElementsByTagName('p');
foreach ($paragraphs as $p) {
    if ($p->getAttribute('class') == 'text-center mrgn-tp-md mrgn-bttm-sm lead') {
        foreach ($p->getElementsByTagName('span') as $attr) {
            if ($attr->getAttribute('class') == 'wxo-metric-hide') {
                foreach ($attr->getElementsByTagName('abbr') as $abbr) {
                    if ($abbr->getAttribute('title') == 'Celsius') {
                        echo trim($attr->nodeValue);
                    }
                }
            }
        }
    }
}
Output:
-3°C
This is assuming the classes and structure are consistent...
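The same extraction can also be done with a single XPath query instead of three nested loops. A sketch, under the same assumption that the page structure stays stable:

$xpath = new DOMXPath($doc);
// Match the lead paragraph's metric span that contains the Celsius <abbr>.
$nodes = $xpath->query("//p[contains(@class, 'lead')]/span[@class='wxo-metric-hide'][abbr[@title='Celsius']]");
if ($nodes->length > 0) {
    echo trim($nodes->item(0)->nodeValue); // e.g. -3°C
}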
I am trying to get the coin data from this website: http://www.tf2wh.com, with this script:
$name = $_POST["item"];
$url = file_get_contents("http://www.tf2wh.com/allitems");
$dom = new DOMDocument();
@$dom->loadHTML($url);
$dom->saveHTML();
$code = "";
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//div[contains(attribute::class, "entry qual")]') as $e) {
    $code .= $e->nodeValue;
}
$code = substr($code, strpos($code, $name) - 30, 30);
$code = explode("(", $code);
$coins = "";
for ($i = 0; $i < strlen($code[0]); $i++) {
    if (is_numeric($code[0][$i])) {
        $coins .= $code[0][$i];
    }
}
echo $coins;
It works fine, but there are two problems. First, it is very slow: the time between request and response is around 15-30 seconds. Second, sometimes this error occurs:
Fatal error: Maximum execution time of 30 seconds exceeded in
C:\xampp\htdocs\steammarket\getCoins.php on line 6
How can I fix this performance issue?
The remote site is slow to connect to. First, raise PHP's execution time limit with set_time_limit(0); or ini_set('max_execution_time', 300); // 300 seconds = 5 minutes
<?php
set_time_limit(0);
$name = $_POST["item"];
$url = file_get_contents("http://www.tf2wh.com/allitems");
$dom = new DOMDocument();
@$dom->loadHTML($url);
$dom->saveHTML();
$code = "";
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//div[contains(attribute::class, "entry qual")]') as $e) {
    $code .= $e->nodeValue;
}
$code = substr($code, strpos($code, $name) - 30, 30);
$code = explode("(", $code);
$coins = "";
for ($i = 0; $i < strlen($code[0]); $i++) {
    if (is_numeric($code[0][$i])) {
        $coins .= $code[0][$i];
    }
}
echo $coins;
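Raising the limit only hides the delay, though: this script re-downloads http://www.tf2wh.com/allitems on every request. A sketch of a simple file cache that avoids that (the 300-second TTL and cache path are my own assumptions):

function fetch_allitems_cached($ttl = 300)
{
    $cacheFile = sys_get_temp_dir() . '/tf2wh_allitems.html';
    // Serve the cached copy while it is still fresh.
    if (is_file($cacheFile) && time() - filemtime($cacheFile) < $ttl) {
        return file_get_contents($cacheFile);
    }
    $html = file_get_contents('http://www.tf2wh.com/allitems');
    if ($html !== false) {
        file_put_contents($cacheFile, $html);
    }
    return $html;
}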
I must have a memory leak or something eating memory on my server somewhere in this class. For example, if I file_get_contents('http://www.theknot.com'), it can no longer connect to the server even though the site is not down, or MySQL closes the connection, or in extreme situations the server is knocked out completely for a while and we cannot even get a ping. I know the problem is somewhere within the preg_match_all if block, but I don't know what runs away; I can only assume the regex does a lot of processing on whatever content is fetched from the remote site. Any ideas?
<?php
class Utils_Linkpreview extends Zend_Db_Table
{
    public function getPreviews($url)
    {
        $link = $url;
        $width = 200;
        $height = 200;
        $regex = '/<img[^\/]+src="([^"]+\.(jpe?g|gif|png))/';
        // $regex = '/<img[^\/]+src="([^"]+)/';
        $thumbs = false;
        try {
            $data = file_get_contents($link);
        } catch (Exception $e) {
            print "Caught exception when attempting to find images: ".$e->getMessage()."\n";
        }
        if (($data) && preg_match_all($regex, $data, $m, PREG_PATTERN_ORDER)) {
            if (isset($m[1]) && is_array($m[1])) {
                $thumbs = array();
                foreach (array_unique($m[1]) as $url) {
                    if (
                        ($url = $this->rel2abs($url, $link)) &&
                        ($i = @getimagesize($url)) &&
                        $i[0] >= ($width - 10) &&
                        $i[1] >= ($height - 10)
                    ) {
                        $thumbs[] = $url;
                    }
                }
            }
        }
        return $thumbs;
    }

    private function rel2abs($url, $host)
    {
        if (substr($url, 0, 4) == 'http') {
            return $url;
        } else {
            $hparts = explode('/', $host);
            if ($url[0] == '/') {
                return implode('/', array_slice($hparts, 0, 3)) . $url;
            } else if ($url[0] != '.') {
                array_pop($hparts);
                return implode('/', $hparts) . '/' . $url;
            }
        }
    }
}
?>
EDIT - Amal Murali's comment pointed me in a better direction using PHP's DomDocument. Thanks bud!
Here is the result:
public function getPreviews($url)
{
    $link = $url;
    $thumbs = false;
    try {
        $html = file_get_contents($link);
    } catch (Exception $e) {
        print "Caught exception when attempting to find images: ".$e->getMessage()."\n";
    }
    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    $x = new DOMXPath($dom);
    foreach ($x->query("//img[@width > 200 or substring-before(@width, 'px') > 200 or @height > 200 or substring-before(@height, 'px') > 200]") as $node)
    {
        $url = $node->getAttribute("src");
        $thumbs[] = $this->rel2abs($url, $link);
    }
    return $thumbs;
}
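One more hardening step worth considering (my own suggestion, not from the post): give file_get_contents a timeout via a stream context, so one slow remote host cannot tie up the server the way described above. The 5-second timeout is an assumption; tune it as needed.

// Assumption: abort the fetch after 5 seconds instead of hanging.
$ctx = stream_context_create(array('http' => array('timeout' => 5)));
$html = file_get_contents($link, false, $ctx);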
I am parsing an XML feed, but there is a tag which contains both an image and text, and I want to separate the image and the text into different columns of a table in my layout. I don't know how to do it; please help me. My PHP file is:
<?php
$RSS_Content = array();

function RSS_Tags($item, $type)
{
    $y = array();
    $tnl = $item->getElementsByTagName("title");
    $tnl = $tnl->item(0);
    $title = $tnl->firstChild->textContent;
    $tnl = $item->getElementsByTagName("link");
    $tnl = $tnl->item(0);
    $link = $tnl->firstChild->textContent;
    $tnl = $item->getElementsByTagName("description");
    $tnl = $tnl->item(0);
    $img = $tnl->firstChild->textContent;
    $y["title"] = $title;
    $y["link"] = $link;
    $y["description"] = $img;
    $y["type"] = $type;
    return $y;
}

function RSS_Channel($channel)
{
    global $RSS_Content;
    $items = $channel->getElementsByTagName("item");
    // Processing channel
    $y = RSS_Tags($channel, 0); // get description of channel, type 0
    array_push($RSS_Content, $y);
    // Processing articles
    foreach ($items as $item)
    {
        $y = RSS_Tags($item, 1); // get description of article, type 1
        array_push($RSS_Content, $y);
    }
}

function RSS_Retrieve($url)
{
    global $RSS_Content;
    $doc = new DOMDocument();
    $doc->load($url);
    $channels = $doc->getElementsByTagName("channel");
    $RSS_Content = array();
    foreach ($channels as $channel)
    {
        RSS_Channel($channel);
    }
}

function RSS_RetrieveLinks($url)
{
    global $RSS_Content;
    $doc = new DOMDocument();
    $doc->load($url);
    $channels = $doc->getElementsByTagName("channel");
    $RSS_Content = array();
    foreach ($channels as $channel)
    {
        $items = $channel->getElementsByTagName("item");
        foreach ($items as $item)
        {
            $y = RSS_Tags($item, 1);
            array_push($RSS_Content, $y);
        }
    }
}

function RSS_Links($url, $size = 15)
{
    global $RSS_Content;
    $page = "<ul>";
    RSS_RetrieveLinks($url);
    if ($size > 0)
        $recents = array_slice($RSS_Content, 0, $size + 1);
    foreach ($recents as $article)
    {
        $type = $article["type"];
        if ($type == 0) continue;
        $title = $article["title"];
        $link = $article["link"];
        $img = $article["description"];
        $page .= "$title\n";
    }
    $page .= "</ul>\n";
    return $page;
}

function RSS_Display($url, $click, $size = 8, $site = 0, $withdate = 0)
{
    global $RSS_Content;
    $opened = false;
    $page = "";
    $site = (intval($site) == 0) ? 1 : 0;
    RSS_Retrieve($url);
    if ($size > 0)
        $recents = array_slice($RSS_Content, $site, $size + 1 - $site);
    foreach ($recents as $article)
    {
        $type = $article["type"];
        if ($type == 0)
        {
            if ($opened == true)
            {
                $page .= "</ul>\n";
                $opened = false;
            }
            $page .= "<b>";
        }
        else
        {
            if ($opened == false)
            {
                $page .= "<table width='369' border='0'>
                    <tr>";
                $opened = true;
            }
        }
        $title = $article["title"];
        $link = $article["link"];
        $img = $article["description"];
        $page .= "<td width='125' align='center' valign='middle'>
            <div align='center'>$img</div></td>
            <td width='228' align='left' valign='middle'><div align='left'><a
            href=\"$click\" target='_top'>$title</a></div></td>";
        if ($withdate)
        {
            $date = $article["date"];
            $page .= ' <span class="rssdate">'.$date.'</span>';
        }
        if ($type == 0)
        {
            $page .= "<br />";
        }
    }
    if ($opened == true)
    {
        $page .= "</tr>
            </table>";
    }
    return $page."\n";
}
?>
To separate the image and the text you need to parse the HTML stored inside the description element again, as XML. Luckily it is valid XML inside that element, so you can do this straightforwardly with SimpleXML. The following code example takes the URL and converts each item's description into the text only, extracting the src attribute of the image and storing it as a separate image element:
<item>
    <title>Fake encounter: BJP backs Kataria, says CBI targeting Modi</title>
    <link>http://ibnlive.in.com/news/fake-encounter-bjp-backs-kataria-says-cbi-targeting-modi/391802-37-64.html</link>
    <description>The BJP lashed out at the CBI and questioned its 'shoddy investigation' into the Sohrabuddin fake encounter case.</description>
    <pubDate>Wed, 15 May 2013 13:48:56 +0530</pubDate>
    <guid>http://ibnlive.in.com/news/fake-encounter-bjp-backs-kataria-says-cbi-targeting-modi/391802-37-64.html</guid>
    <image>http://static.ibnlive.in.com/ibnlive/pix/sitepix/05_2013/bjplive_kataria3.jpg</image>
</item>
The code example is:
$url = 'http://ibnlive.in.com/ibnrss/top.xml';
$feed = simplexml_load_file($url);
$items = $feed->xpath('(//channel/item)');
foreach ($items as $item) {
    list($description, $image) =
        simplexml_load_string("<r>$item->description</r>")
            ->xpath('(/r|/r//@src)');
    $item->description = (string) $description;
    $item->image = (string) $image;
}
You can then import the SimpleXML into a DOMElement with dom_import_simplexml(). Honestly, though, I would just do that little bit of HTML generation with SimpleXML as well: you can use a LimitIterator for the paging just as you could with DOMDocument, and the data you need is easily at hand as SimpleXMLElements. It is simpler to pass the XML elements along as SimpleXMLElements than to parse everything into an array first and then process the array. That point is moot anyway.
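To make the LimitIterator point concrete, a sketch of rendering the separated data (the markup here is my own simplified version of the question's two-column table, and the page size of 8 is an assumption):

// Page through the first 8 items and emit one image/title row each.
$rows = '';
foreach (new LimitIterator(new ArrayIterator($items), 0, 8) as $item) {
    $rows .= "<tr>"
           . "<td><img src='".htmlspecialchars($item->image)."'></td>"
           . "<td><a href='".htmlspecialchars($item->link)."'>"
           . htmlspecialchars($item->title)."</a></td>"
           . "</tr>\n";
}
echo "<table>$rows</table>";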