SimpleXML - fails loading some remote URLs - PHP

I'm using SimpleXML to fetch a remote XML file, and I'm having some issues because sometimes SimpleXML can't load the XML. I don't know the exact reason, but I suspect the remote site takes longer than usual to return data, resulting in a timeout.
The code I use is the following:
$xml = @simplexml_load_file($url);

if (!$xml) {
    $database = Config_helper::get_config_option('mysql');
    $db = new \DB($database['database'], $database['server'], $database['user'], $database['password']);
    $date = date('Y-m-d H:i:s');
    $db->query("INSERT INTO gearman_job_error (timestamp, data, attempt)
                VALUES ('$date', '{$job->workload()}', '1')");
    //$db->query("INSERT INTO gearman_job_error (timestamp, data, attempt) VALUES ({$date}, {$job->workload()}, 1);");

    return $job->sendFail();
}
else {
    foreach ($xml->point as $key => $value):
        $length = count($value);
        $timestamp = (string) $value->data[0];

        $j = 0;
        for ($i = 1; $i < $length; $i++) {
            $forecast[$timestamp][$time_request][] = array($variables[$j] => (string) $value->data[$i]);
            $j++;
        }
    endforeach;

    return serialize($forecast);
}
The URLs I can't load are stored in the database, and by checking them I can confirm that they load correctly in the browser; there is no problem with them.
Example: http://mandeo.meteogalicia.es/thredds/ncss/modelos/WRF_HIST/d02/2015/02/wrf_arw_det_history_d02_20150211_0000.nc4?latitude=40.393288&longitude=-8.873433&var=rh%2Ctemp%2Cswflx%2Ccfh%2Ccfl%2Ccfm%2Ccft&point=true&accept=xml&time_start=2015-02-11T00%3A00Z&time_end=2015-02-14T20%3A00Z
My question is: how can I tell SimpleXML to take its time loading the URL? My goal is that only after a reasonable time should it give up, assume it can't load the file, and store the failure in the database.

simplexml_load_file itself doesn't have any support for specifying timeouts, but you can combine file_get_contents and simplexml_load_string, like this:
<?php
$timeout = 30;
$url = 'http://...';

// Apply a read timeout (in seconds) to the HTTP request via a stream context
$context = stream_context_create(['http' => ['timeout' => $timeout]]);

$data = file_get_contents($url, false, $context);
$xml = simplexml_load_string($data);

print_r($xml);

I figured out a way of doing this that suits me for now.
I set a maximum number of tries to fetch the XML; if none of them succeeds, that means the XML is possibly damaged or missing.
I have tested it and the results are accurate. It's simple and, in my case, more effective than setting a timeout, though you can always set a timeout as well.
$maxTries = 5;

do {
    $content = @file_get_contents($url);
} while (!$content && --$maxTries);

if ($content) {
    try {
        $xml = @simplexml_load_string($content);
        // Do what you have to do here
    }
    catch (Exception $exception) {
        print($exception->getMessage());
    }
}
else {
    echo $url;
    $job->sendFail();
}

Related

PHP webscraper does not produce errors nor start the loop/create output

I am writing a web scraper in PHP using Gitpod. After a while I managed to solve all the reported problems, but even though no errors are left, the code neither opens the browser nor produces any output.
Does anybody have an idea why that could be the case?
<?php
if (file_exists('vendor/autoload.php')) {
    require 'vendor/autoload.php';
}

use Goutte\Client;

$client = new Goutte\Client();

// Create a new array to store the scraped data
$data = array();

// Loop through the pages
if ($client->getResponse()->getStatus() != 200) {
    echo 'Failed to access website. Exiting script.';
    exit();
}

for ($i = 0; $i < 3; $i++) {
    // Make a request to the website
    $crawler = $client->request('GET', 'https://ec.europa.eu/info/law/better-regulation/have-your-say/initiatives_de?page=' . $i);

    // Find all the initiatives on the page
    $crawler->filter('.initiative')->each(function ($node) use (&$data) {
        // Extract the information for each initiative
        $title = $node->filter('h3')->text();
        $link = $node->filter('a')->attr('href');
        $description = $node->filter('p')->text();
        $deadline = $node->filter('time')->attr('datetime');

        // Append the data for the initiative to the data array
        $data[] = array($title, $link, $description, $deadline);
    });

    // Sleep for a random amount of time between 5 and 10 seconds
    $sleep = rand(5, 10);
    sleep($sleep);
}

// Open the output file
$fp = fopen('initiatives.csv', 'w');

// Write the header row
fputcsv($fp, array('Title', 'Link', 'Description', 'Deadline'));

// Write the data rows
foreach ($data as $row) {
    fputcsv($fp, $row);
}

// Close the output file
fclose($fp);
?>

WordPress - file_get_contents loop takes down homepage for a while - alternative?

I have the following problem:
Into a function I pass an array with at least 700 names, and I get back an array with all the information about their releases from the last 10 days.
The function fetches a JSON response via the iTunes API, which I want to use for further analysis.
Problem:
- While the function is executing, it takes about 3 minutes to finish.
- The homepage is not reachable for others while I execute it:
(Error on server: (70007)The timeout specified has expired: AH01075: Error dispatching request to : (polling)) --> running out of memory?
Questions:
- How can I code this function more efficiently?
- How can I code this function without using too much memory? Should I use unset(...)?
Code:
function getreleases($artists) {
    # print_r($artists);
    $releases = array();

    foreach ($artists as $artist) {
        $artist = str_replace(" ", "%20", $artist);
        $ituneslink = "https://itunes.apple.com/search?term=".$artist."&media=music&entity=album&limit=2&country=DE";
        $itunesstring = file_get_contents($ituneslink);
        $itunesstring = json_decode($itunesstring);
        /* Results being decoded from JSON to an array */

        if (($itunesstring->resultCount) > 0) {
            foreach ($itunesstring->results as $value) {
                if ((date_diff(date_create('now'), date_create(($value->releaseDate)))->format('%a')) < 10) {
                    #echo '<br>Gefunden: ' . $artist;
                    $releases[] = $value;
                }
            }
        } else {
            echo '<br><span style="color:red">Nicht gefunden bei iTunes: ' . $artist . '</span>';
        }

        unset($ituneslink);
        unset($itunesstring);
        unset($itunesstring2);
    }

    return $releases;
}
The problem lies in the fact that every time that function is executed, your server needs to make 700+ API calls, parse the data, and run your logic on it.
One potential solution is to use WordPress's transients to cache the value (or perhaps even the whole output). That way it won't have to execute that strenuous function on every connection; it will just pull the data from the transient. You can set an expiry time for a transient, so you can have it refetch the information every X days/hours.
Take a look at this article from CSS Tricks that walks you through a simple example using transients.
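For illustration, here is a minimal sketch of that idea, assuming the getreleases() function from the question is available; the transient name and the 12-hour expiry are arbitrary choices, not something prescribed by WordPress:
// Sketch only: cache the expensive result in a WordPress transient.
// 'artist_releases' and the 12-hour expiry are illustrative values.
function get_cached_releases($artists) {
    $releases = get_transient('artist_releases');

    if (false === $releases) {
        // Cache miss: run the slow function once, then store the result.
        $releases = getreleases($artists);
        set_transient('artist_releases', $releases, 12 * HOUR_IN_SECONDS);
    }

    return $releases;
}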
But the problem is not fixed. While updating the data and fetching the 700 items from the iTunes API inside the loop, the server runs out of memory and the homepage is not reachable from my computer. I tried a "timeout" or "sleep" so that the script only fetches something every few seconds, but it doesn't change anything.
I just made one improvement: I changed "foreach" to "for" because of memory reasons, so now variables are not being copied. Are there more problems? :-/
I've got two loops in there. Maybe $itunesstring is being copied?
if (!function_exists('get_contents')) {
    function get_contents(&$url) {
        // if cURL is available, use it...
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        $cache = curl_exec($ch);
        curl_close($ch);
        return $cache;
    }
}

function getfromituneslink(&$link, &$name) {
    $name = str_replace("'", "", $name);
    $name = substr($name, 0, 14);

    $result = get_transient("getlink_itunes_{$name}");
    if (false === $result) {
        $result = get_contents($link);
        set_transient("getlink_itunes_{$name}", $result, 12*HOUR_IN_SECONDS);
    }
    return $result;
}

function getreleases(&$artists) {
    $releases = array();

    while (0 < count($artists)) {
        $itunesstring = array();
        $artist = array_shift($artists);
        $artist = str_replace(" ", "%20", $artist);
        $ituneslink = "https://itunes.apple.com/search?term=".$artist."&media=music&entity=album&limit=2&country=DE";
        $itunesstring = getfromituneslink($ituneslink, $artist);
        unset($ituneslink);
        $itunesstring = json_decode($itunesstring);

        if (($itunesstring->resultCount) > 0) {
            #for($i=0; $i< (count($itunesstring->results))-1; ++$i)
            while (0 < count(($itunesstring->results))) {
                $value = array_shift($itunesstring->results);
                #$value = &$itunesstring[results][$i];
                #foreach ($itunesstring->results as $value)
                if ((date_diff(date_create('now'), date_create(($value->releaseDate)))->format('%a')) < 6) {
                    $releases[] = array($value->artistName, $value->collectionName, $value->releaseDate, str_replace("?uo=4", "", $value->collectionViewUrl));
                    unset($value);
                }
            }
        } else {
            echo '<br><span style="color:red">Nicht gefunden bei iTunes: ' . str_replace("%20", " ", $artist) . '</span>';
        }

        unset($ituneslink);
        unset($itunesstring);
    }

    return $releases;
}
I don't know where the problem is. :-(
Is there any other possibility to let the function run so that it fetches the information one item after another?

How do I store mostly static data from a JSON api?

My PHP project is using the reddit JSON API to grab the title of the current page's submission.
Right now I am running some code every time the page is loaded, and I'm running into some problems, even though there is no real API limit.
I would like to store the title of the submission locally somehow. Can you recommend the best way to do this? The site is running on AppFog. What would you recommend?
This is my current code:
<?php
/* settings */
$url = "http://".$_SERVER['HTTP_HOST'].$_SERVER['REQUEST_URI'];
$reddit_url = 'http://www.reddit.com/api/info.{format}?url='.$url;
$format = 'json'; //use XML if you'd like...JSON FTW!
$title = '';

/* action */
$content = get_url(str_replace('{format}', $format, $reddit_url)); //again, can be xml or json
if ($content) {
    if ($format == 'json') {
        $json = json_decode($content, true);
        foreach ($json['data']['children'] as $child) { // we want all children for this example
            $title = $child['data']['title'];
        }
    }
}

/* output */

/* utility function: go get it! */
function get_url($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 1);
    $content = curl_exec($ch);
    curl_close($ch);
    return $content;
}
?>
Thanks!
Introduction
Here is a modified version of your code
$url = "http://stackoverflow.com/";
$loader = new Loader();
$loader->parse($url);
printf("<h4>New List : %d</h4>", count($loader));
printf("<ul>");
foreach ( $loader as $content ) {
printf("<li>%s</li>", $content['title']);
}
printf("</ul>");
Output
New List : 7
New podcast from Joel Spolsky and Jeff Atwood.
Good site for example code/ Pyhton
stackoverflow.com has clearly the best Web code ever conceived in the history of the Internet and reddit should better start copying it.
A reddit-like, OpenID using website for programmers
Great developer site. Get your questions answered and by someone who knows.
Stack Overflow launched into public
Stack Overflow, a programming Q & A site. & Reddit could learn a lot from their interface!
Simple Demo
The Problem
I see a couple of things you want to achieve here, namely:
I would like to store the title of the submission locally somehow
Right now I am doing running some code every time the page is loaded
From what I understand, what you need is a simple cached copy of your data so that you don't have to load the URL every time.
Simple Solution
A simple caching system you can use is Memcache.
Example A
$url = "http://stackoverflow.com/";
// Start cache
$m = new Memcache();
$m->addserver("localhost");
$cache = $m->get(sha1($url));
if ($cache) {
// Use cache copy
$loader = $cache;
printf("<h2>Cache List: %d</h2>", count($loader));
} else {
// Start a new Loader
$loader = new Loader();
$loader->parse($url);
printf("<h2>New List : %d</h2>", count($loader));
$m->set(sha1($url), $loader);
}
// Oupput all listing
printf("<ul>");
foreach ( $loader as $content ) {
printf("<li>%s</li>", $content['title']);
}
printf("</ul>");
Example B
You can use the last modification date as the cache key, so that you only save a new copy when the document has been modified:
$headers = get_headers(sprintf("http://www.reddit.com/api/info.json?url=%s", $url), true);
$time = strtotime($headers['Date']); // get last modification date
$cache = $m->get($time);

if ($cache) {
    $loader = $cache;
}
Since the Loader class implements JsonSerializable, you can JSON-encode your result and also store it in a database like MongoDB or MySQL:
$data = json_encode($loader);
// Save to DB
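For illustration, a rough sketch of the database route, assuming a PDO connection and a hypothetical page_cache table with url_hash and payload columns; adjust the DSN, credentials and schema to your own setup:
// Sketch only: persist the JSON-encoded loader keyed by a hash of the URL.
// The DSN, credentials and `page_cache` table are hypothetical.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'password');

$stmt = $pdo->prepare(
    "INSERT INTO page_cache (url_hash, payload) VALUES (:hash, :payload)
     ON DUPLICATE KEY UPDATE payload = VALUES(payload)"
);
$stmt->execute([
    ':hash'    => sha1($url),
    ':payload' => json_encode($loader),
]);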
Class Used
class Loader implements IteratorAggregate, Countable, JsonSerializable {
    private $request = "http://www.reddit.com/api/info.json?url=%s";
    private $data = array();
    private $total;

    function parse($url) {
        $content = json_decode($this->getContent(sprintf($this->request, $url)), true);
        $this->data = array_map(function ($v) {
            return $v['data'];
        }, $content['data']['children']);
        $this->total = count($this->data);
    }

    public function getIterator() {
        return new ArrayIterator($this->data);
    }

    public function count() {
        return $this->total;
    }

    public function getType() {
        return $this->type;
    }

    public function jsonSerialize() {
        return $this->data;
    }

    function getContent($url) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 1);
        $content = curl_exec($ch);
        curl_close($ch);
        return $content;
    }
}
I'm not sure what your question is exactly, but the first thing that pops out is the following:
foreach ($json['data']['children'] as $child) { // we want all children for this example
    $title = $child['data']['title'];
}
Are you sure you want to overwrite $title? In effect, that will only hold the last $child title.
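If the intention is to keep every title, a minimal sketch of the fix would be to collect them into an array instead:
// Sketch: collect all titles rather than overwriting a single variable.
$titles = array();
foreach ($json['data']['children'] as $child) {
    $titles[] = $child['data']['title'];
}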
Now, to your question. I assume you're looking for some kind of mechanism to cache the contents of the requested URL so you don't have to re-issue the request every time, am I right? I don't have any experience with AppFog, only with orchestra.io, but I believe they have the same restrictions regarding writing to files, as in you can only write to temporary files.
My suggestion would be to cache the (processed) response in either:
- APC shared memory with a short TTL
- temporary files
- a database
You could use the hash of the URL + arguments as the lookup key; doing this check inside get_url() would mean you wouldn't need to change any other part of your code, and it would only take ~3 LOC.
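As a rough sketch of the APC option (assuming the APCu extension is available; the key prefix and the 60-second TTL are arbitrary choices), the existing get_url() could become:
// Sketch only: memoise get_url() in APCu with a short TTL.
function get_url($url) {
    $key = 'get_url_' . sha1($url);

    $content = apcu_fetch($key, $hit);
    if ($hit) {
        return $content; // cache hit: skip the HTTP request entirely
    }

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 1);
    $content = curl_exec($ch);
    curl_close($ch);

    apcu_store($key, $content, 60); // short TTL, as suggested above
    return $content;
}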
After this:
if ($format == 'json') {
    $json = json_decode($content, true);
    foreach ($json['data']['children'] as $child) { // we want all children for this example
        $title = $child['data']['title'];
    }
}
Then store the title in a JSON file and dump it into your local website folder path:
$storeTitle = array('title' => $title);
$fp = fopen('../pathToJsonFile/title.json', 'w');
fwrite($fp, json_encode($storeTitle));
fclose($fp);
Then you can always read the JSON file back next time, decode it, and extract the title into a variable for use.
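Reading it back later is then just the reverse (using the same hypothetical path as above):
// Sketch: load the cached title back from the JSON file.
$cached = json_decode(file_get_contents('../pathToJsonFile/title.json'), true);
$title = $cached['title'];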
I usually just store the data as-is in a flat file, like so:
<?php
define('TEMP_DIR', 'temp/');
define('TEMP_AGE', 3600);

function getinfo($url) {
    $temp = TEMP_DIR . urlencode($url) . '.json';

    if (!file_exists($temp) OR time() - filemtime($temp) > TEMP_AGE) {
        $info = "http://www.reddit.com/api/info.json?url=$url";
        $json = file_get_contents($info);
        file_put_contents($temp, $json);
    }
    else {
        $json = file_get_contents($temp);
    }

    $json = json_decode($json, true);

    $titles = array();
    foreach ($json['data']['children'] as $child) {
        $titles[] = $child['data']['title'];
    }

    return $titles;
}

$test = getinfo('http://imgur.com/');
print_r($test);
PS:
I use file_get_contents to get the JSON data; you might have your own reasons to use cURL.
Also, I don't check the format, because you clearly prefer JSON.

Fatal error: Out of memory PHP

I am not sure why, but this was working fine last night, and this morning I am getting
Fatal error: Out of memory (allocated 1611137024) (tried to allocate
1610350592 bytes) in /home/twitcast/public_html/system/index.php on
line 121
The section of code being run is as follows:
function podcast()
{
    $fetch = new server();
    $fetch->connect("TCaster");
    $collection = $fetch->db->shows;

    // find everything in the collection
    $cursor = $collection->find();

    if ($cursor->count() > 0)
    {
        $test = array();

        // iterate through the results
        while ($cursor->hasNext()) {
            $test[] = ($cursor->getNext());
        }

        $i = 0;
        foreach ($test as $d) {
            for ($i = 0; $i <= 3; $i++) {
                $url = $d["streams"][$i];
                $xml = file_get_contents($url);

                $doc = new DOMDocument();
                $doc->preserveWhiteSpace = false;
                $doc->loadXML($xml); // $xml = file_get_contents( "http://www.c3carlingford.org.au/podcast/C3CiTunesFeed.xml")

                // Initialize XPath
                $xpath = new DOMXpath($doc);
                // Register the itunes namespace
                $xpath->registerNamespace('itunes', 'http://www.itunes.com/dtds/podcast-1.0.dtd');

                $items = $doc->getElementsByTagName('item');
                foreach ($items as $item) {
                    $title = $xpath->query('title', $item)->item(0)->nodeValue;
                    $published = strtotime($xpath->query('pubDate', $item)->item(0)->nodeValue);
                    $author = $xpath->query('itunes:author', $item)->item(0)->nodeValue;
                    $summary = $xpath->query('itunes:summary', $item)->item(0)->nodeValue;
                    $enclosure = $xpath->query('enclosure', $item)->item(0);
                    $url = $enclosure->attributes->getNamedItem('url')->value;
                    $fname = basename($url);

                    $collection = $fetch->db->shows_episodes;
                    $cursorfind = $collection->find(array("internal_url" => "http://twitcatcher.russellharrower.com/videos/$fname"));
                    if ($cursorfind->count() < 1)
                    {
                        $copydir = "/home/twt/public_html/videos/";
                        $data = file_get_contents($url);
                        $file = fopen($copydir . $fname, "w+");
                        fputs($file, $data);
                        fclose($file);

                        $collection->insert(array("show_id" => new MongoId($d["_id"]), "stream" => $i, "episode_title" => $title, "episode_summary" => $summary, "published" => $published, "internal_url" => "http://twitcatcher.russellharrower.com/videos/$fname"));
                        echo "$title <br> $published <br> $summary <br> $url<br><br>\n\n";
                    }
                }
            }
        }
    }
}
line 121 is
$data = file_get_contents($url);
You want to add 1.6GB of memory usage for a single PHP thread? While you can increase the memory limit, my strong advice is to look at another way of doing what you want.
Probably the easiest solution: you can use cURL to request a byte range of the source file (using cURL is wiser than file_get_contents anyway for remote files). You can get 100K at a time, write it to the local file, then get the next 100K and append it to the file, and so on until the entire file has been pulled in. A rough sketch follows at the end of this answer.
You may also do something with streams, but that gets a little more complex. It may be your only option if the remote server won't let you fetch part of a file by bytes.
Finally, there are Linux commands such as wget, run through exec(), if your server has the permissions.
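For illustration, a minimal sketch of the byte-range idea, assuming the remote server honours HTTP Range requests; the chunk size, URL and destination path are arbitrary examples:
// Sketch only: pull the remote file down 100K at a time so it never sits in memory whole.
$url  = 'http://example.com/big-episode.mp4';              // hypothetical source
$dest = '/home/twt/public_html/videos/big-episode.mp4';    // hypothetical destination
$chunkSize = 100 * 1024;
$offset = 0;

$out = fopen($dest, 'w');
do {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FAILONERROR, true); // treat 4xx/5xx (e.g. past EOF) as failure
    curl_setopt($ch, CURLOPT_RANGE, $offset . '-' . ($offset + $chunkSize - 1));
    $chunk = curl_exec($ch);
    curl_close($ch);

    if ($chunk === false || $chunk === '') {
        break; // request failed or nothing left to read
    }

    fwrite($out, $chunk);       // append this chunk to the local copy
    $offset += strlen($chunk);  // advance the byte window
} while (strlen($chunk) === $chunkSize);
fclose($out);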
Memory limit - take a look at the memory_limit directive; I suppose that is what you need.
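For example (sketch only; the value is arbitrary, and the same directive can also be set in php.ini or .htaccess):
// Raise the per-script memory ceiling for this request only.
ini_set('memory_limit', '2048M');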
Or you may try to use copy() instead of reading the file into memory (it is a video file, as I understand, so it is not strange that it takes a lot of memory):
$copydir = "/home/twt/public_html/videos/";
copy($url, $copydir . $fname);
Looks like the files opened last night were just smaller. :)

'Node no longer exists' error in PHP

I'm using the following code to turn a user's IP into latitude/longitude information using the hostip web service:
//get user's location
$ip = $_SERVER['REMOTE_ADDR'];

function get_location($ip) {
    $content = file_get_contents('http://api.hostip.info/?ip='.$ip);

    if ($content != FALSE) {
        $xml = new SimpleXmlElement($content);

        $coordinates = $xml->children('gml', TRUE)->featureMember->children('', TRUE)->Hostip->ipLocation->children('gml', TRUE)->pointProperty->Point->coordinates;
        $longlat = explode(',', $coordinates);

        $location['longitude'] = $longlat[0];
        $location['latitude'] = $longlat[1];
        $location['citystate'] = '==>'.$xml->children('gml', TRUE)->featureMember->children('', TRUE)->Hostip->children('gml', TRUE)->name;
        $location['country'] = '==>'.$xml->children('gml', TRUE)->featureMember->children('', TRUE)->Hostip->countryName;

        return $location;
    }
    else return false;
}

$data = get_location($ip);
$center_long = $data['latitude'];
$center_lat = $data['longitude'];
This works fine for me: using $center_long and $center_lat, the Google map on the page is centered on my city. But I have a friend in Thailand who tested it from there, and he got this error:
Warning: get_location() [function.get-location]: Node no longer exists in /home/bedbugs/registry/index.php on line 21
So I'm completely confused by this: how could he be getting an error when I don't? I tried googling it, and it has something to do with parsing XML data, but the parsing process is the same for him and me. Note that line 21 is the one that starts with '$coordinates ='.
You need to check that the service actually has an <ipLocation> listed. You're doing:
$xml->children('gml', TRUE)->featureMember->children('', TRUE)->Hostip->ipLocation
->children('gml', TRUE)->pointProperty->Point->coordinates
but the XML output for my IP is:
<HostipLookupResultSet version="1.0.1" xsi:noNamespaceSchemaLocation="http://www.hostip.info/api/hostip-1.0.1.xsd">
    <gml:description>This is the Hostip Lookup Service</gml:description>
    <gml:name>hostip</gml:name>
    <gml:boundedBy>
        <gml:Null>inapplicable</gml:Null>
    </gml:boundedBy>
    <gml:featureMember>
        <Hostip>
            <ip>...</ip>
            <gml:name>(Unknown City?)</gml:name>
            <countryName>(Unknown Country?)</countryName>
            <countryAbbrev>XX</countryAbbrev>
            <!-- Co-ordinates are unavailable -->
        </Hostip>
    </gml:featureMember>
</HostipLookupResultSet>
The last part, ->children('gml', TRUE)->pointProperty->Point->coordinates, gives the error because for some IPs the <ipLocation> node doesn't exist, so there are no children to descend into.
You can add a basic check to see whether the <ipLocation> node exists, like this (assuming the service always returns at least up to the <Hostip> node):
function get_location($ip) {
    $content = file_get_contents('http://api.hostip.info/?ip='.$ip);
    if ($content === FALSE) return false;

    $location = array('latitude' => 'unknown', 'longitude' => 'unknown');

    $xml = new SimpleXmlElement($content);
    $hostIpNode = $xml->children('gml', TRUE)->featureMember->children('', TRUE)->Hostip;

    if ($hostIpNode->ipLocation) {
        $coordinates = $hostIpNode->ipLocation->children('gml', TRUE)->pointProperty->Point->coordinates;
        $longlat = explode(',', $coordinates);
        $location['longitude'] = $longlat[0];
        $location['latitude'] = $longlat[1];
    }

    $location['citystate'] = '==>'.$hostIpNode->children('gml', TRUE)->name;
    $location['country'] = '==>'.$hostIpNode->countryName;

    return $location;
}
