How to speed up the execution of this PHP script

I have written a script for web scraping where I fetch each link from a page and then load that URL in the code. It runs extremely slowly: it takes about 50 seconds for the first output and an age to get through roughly 100 links. I don't understand why it is so slow. I am thinking about caching, but I don't know how it would help. The option I have considered:
1) Page caching or an opcode cache.
The code is:
public function searchForum(){
    global $wpdb;
    $sUrl = $this->getSearchUrl();
    $this->logToCrawler();
    $cid = $this->getCrawlId();
    $html = file_get_dom($sUrl);
    $c = 1;
    foreach($html('div.gridBlobTitle a:first-child') as $element){
        $post_page = file_get_dom($element->href);
        $post_meta = array();
        foreach($post_page('table#mytable img:first-child') as $img){
            if(isset($img->src)){
                $post_meta['image_main'] = self::$forumurl.$img->src;
            }
            else{
                $post_meta['image_main'] = NULL;
            }
        }
        foreach($post_page('table.preferences td:odd') as $elm){
            $post_meta[] = strip_tags($elm->getInnerText());
            unset($elm);
        }
        /* Check if we can call getPlainText for the description fetch */
        $object = $post_page('td.collection', 2);
        $methodVariable = array($object, 'getPlainText');
        if(is_callable($methodVariable, true, $callable_name)){
            $post_meta['description'] = utf8_encode($object->getPlainText());
        }
        else{
            $post_meta['description'] = NULL;
        }
        $methodVariable = array($object, 'getInnerText');
        if(is_callable($methodVariable, true, $callable_name)){
            /* Get all the images we found */
            $rough_html = $object->getInnerText();
            preg_match_all("/<img .*?(?=src)src=\"([^\"]+)\"/si", $rough_html, $matches);
            $images = array_map('self::addUrlToItems', $matches[1]);
            $images = json_encode($images);
        }
        if($post_meta[8] == 'WTB: Want To Buy'){
            $status = 'buy';
        }
        else{
            $status = 'sell';
        }
        $lastdate = strtotime(date('Y-m-d', strtotime("-1 month")));
        $listdate = strtotime(date('Y-m-d', strtotime($post_meta[9])));
        /* Check for date */
        if($listdate >= $lastdate){
            $wpdb->query("INSERT
                INTO tbl_scrubed_data SET
                keywords='".esc_sql($this->getForumSettings()->search_meta)."',
                url_to_post='".esc_sql($element->href)."',
                description='".esc_sql($post_meta['description'])."',
                date_captured=now(), crawl_id='".$cid."',
                image_main='".esc_sql($post_meta['image_main'])."',
                images='".esc_sql($images)."', brand='".esc_sql($post_meta[0])."',
                series='".esc_sql($post_meta[1])."', model='".esc_sql($post_meta[2])."',
                watch_condition='".esc_sql($post_meta[3])."', box='".esc_sql($post_meta[4])."',
                papers='".esc_sql($post_meta[5])."', year='".esc_sql($post_meta[6])."',
                case_size='".esc_sql($post_meta[7])."', status='".esc_sql($post_meta[8])."',
                listed='".esc_sql($post_meta[9])."', asking_price='".esc_sql($post_meta[10])."',
                retail_price='".esc_sql($post_meta[11])."', payment_info='".esc_sql($post_meta[12])."',
                forum_id='".$this->getForumSettings()->ID."'");
            unset($element, $post_page, $images);
        } /* END: Check for date */
    }
    $c++;
}
Notes:
1) I am using the [Ganon DOM Parser][1] for parsing the HTML.
[1]: https://code.google.com/p/ganon/wiki/AccesElements
2) Running on Windows XP with WAMP, MySQL 5.5, PHP 5.3 and 1 GB of RAM.
If you need more info, please comment.
Thanks

You need to figure out what parts of your program are being slow. There are two ways to do that.
1) Put in some print statements that print out the time in various places, so you can say "Hey, look, this took 5 seconds to go from here to here."
2) Use a profiler like Xdebug, which analyzes your program while it runs so you can see which parts of the code are slow.
Just by looking at a program you can't say "Oh, that's the slow part to speed up." Without knowing what's slow, you'll probably waste time speeding up parts that aren't the bottleneck.
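For approach 1), a minimal timing sketch (reusing the variable names from the question, purely for illustration) could look like this:
$start = microtime(true);
$html = file_get_dom($sUrl);                 // time the initial search-page fetch
echo 'search page fetch: '.round(microtime(true) - $start, 2)." s\n";

$start = microtime(true);
$post_page = file_get_dom($element->href);   // time one post fetch inside the loop
echo 'post fetch: '.round(microtime(true) - $start, 2)." s\n";
If most of the time turns out to be spent in the file_get_dom() calls, the bottleneck is the network round trip for each of the ~100 links rather than the parsing or the database inserts.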

Related

Detect Languages; CakePHP updateAll Bad Performance

UPDATE: I think the CakePHP updateAll is the problem. If I take out the updateAll and just pr() the results, I get as many language detections in 1-2 seconds as I otherwise get in 5 minutes! I only need to update one row, and I can identify that row by author and title... is there a better and faster way?
I'm using detectlanguage.com to detect all English texts in my SQL database. My database consists of about 500,000 rows. I have tried many things to detect the language of all my texts faster, but at this rate it will take many days... :/
I only send 20% of each text (see my code).
I tried copying my function and running it several times in parallel; the copy below handles all texts with a title starting with A.
I can only run 6 functions at the same time (localhost)... I tried a 7th one in a new tab, but got:
Waiting for available socket....
public function detectLanguageA()
{
    set_time_limit(0);
    ini_set('max_execution_time', 0);
    $mydatas = $this->datas;
    $alldatas = $mydatas->find('all')
        ->where(['SUBSTRING(datas.title,1,1) =' => 'A'])
        ->where(['datas.lang =' => '']);
    foreach ($alldatas as $row) {
        $text = $row->text;
        $textLength = round(strlen($text) * 0.2);
        $text = substr($text, 0, $textLength); // was $ltextLength, an undefined variable
        $title = $row->title;
        $author = $row->author;
        $languageCode = DetectLanguage::simpleDetect($text);
        $mydatas->updateAll(
            ['lang' => $languageCode],                     // fields
            ['author' => $author, 'textTitle' => $title]); // conditions
    }
}
I hope someone has an idea for my problem... At this rate the language detection for all my texts will take more than a week :/ :/
My computer has been running for over 20 hours with only short interruptions, but I have only detected the language of about 13,000 texts, and my database has 500,000.
I also tried sending the texts in batches, but that is too slow as well. I always send 20 texts in one array, and I think that is the maximum.
Is it possible that the CakePHP 3.x updateAll function is what makes it so slow?
THE PROBLEM WAS THE CAKEPHP updateAll.
Now I'm using http://book.cakephp.org/3.0/en/orm/saving-data.html#updating-data with a for loop, and everything is fast and good:
use Cake\ORM\TableRegistry;

$articlesTable = TableRegistry::get('Articles');
for ($i = 1; $i < 460000; $i++) {
    $oneArticle = $articlesTable->get($i);
    $languageCode = DetectLanguage::simpleDetect($oneArticle->lyrics);
    $oneArticle->lang = $languageCode;
    $articlesTable->save($oneArticle); // was $oneSong, an undefined variable
}
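If the per-row save above is still too slow, one optimization worth trying (not from the original post, just a suggestion) is to wrap batches of saves in a single transaction so MySQL doesn't commit after every row. A rough sketch, assuming the default CakePHP 3 connection and the same Articles table:
use Cake\Datasource\ConnectionManager;
use Cake\ORM\TableRegistry;

$articlesTable = TableRegistry::get('Articles');
$connection = ConnectionManager::get('default');

// Process the rows in chunks of 500, one transaction per chunk.
for ($offset = 1; $offset < 460000; $offset += 500) {
    $connection->transactional(function () use ($articlesTable, $offset) {
        for ($i = $offset; $i < $offset + 500; $i++) {
            $oneArticle = $articlesTable->get($i);
            $oneArticle->lang = DetectLanguage::simpleDetect($oneArticle->lyrics);
            $articlesTable->save($oneArticle);
        }
        return true; // returning a non-false value commits the batch
    });
}
Keep in mind that each DetectLanguage::simpleDetect() call is still a network request, so the API calls themselves may remain the dominant cost.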

How can I get only a part of a json file instead of the entire thing with php?

I'm connecting to the trakt.tv API. I want to create a little app for myself that displays movie posters with ratings etc.
This is what I'm currently using to retrieve their .json file containing all the info I need:
$json = file_get_contents('http://api.trakt.tv/movies/trending.json/2998fbac88fd207cc762b1cfad8e34e6');
$movies = json_decode($json, true);
$movies = array_slice($movies, 0, 20);

foreach ($movies as $movie) {
    echo $movie['images']['fanart'];
}
Because the .json file is huge, it loads pretty slowly. I only need a couple of attributes from the file, like the title, rating and poster link, and only for the first 20 entries or so. How can I make sure to load only a part of the .json file so it loads faster?
Besides that, I'm not experienced with PHP in combination with JSON, so if my code is garbage and you have suggestions, I would love to hear them.
Unless the API provides a limit parameter or similar, I don't think you can limit the query on your side, and on a quick look it doesn't seem to provide one. It also doesn't look like it really returns that much data (under 100 KB), so I guess the API is just slow.
Given the slow API, I'd cache the data you receive and only update it once per hour or so. You could save it to a file on your server using file_put_contents and record the time it was saved too. When you need the data, if the saved copy is over an hour old, refresh it.
This quick sketch of the idea works:
function get_trending_movies() {
    if (!file_exists('trending-cache.php')) {
        return cache_trending_movies();
    }
    include('trending-cache.php');
    if (time() - $movies['retrieved-timestamp'] > 60 * 60) { // 60*60 = 1 hour
        return cache_trending_movies();
    } else {
        unset($movies['retrieved-timestamp']);
        return $movies;
    }
}

function cache_trending_movies() {
    $json = file_get_contents('http://api.trakt.tv/movies/trending.json/2998fbac88fd207cc762b1cfad8e34e6');
    $movies = json_decode($json, true);
    $movies = array_slice($movies, 0, 20);
    $movies_with_date = $movies;
    $movies_with_date['retrieved-timestamp'] = time();
    file_put_contents('trending-cache.php', '<?php $movies = ' . var_export($movies_with_date, true) . ';');
    return $movies;
}
print_r(get_trending_movies());
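Since only a few attributes are needed, it also helps to trim each cached entry down before displaying it. The keys used below ('title', 'ratings', 'images') are assumptions about the trakt.tv payload, so adjust them to whatever the real response contains:
$movies = get_trending_movies();

foreach ($movies as $movie) {
    // Keep only the fields the page actually uses.
    $title  = isset($movie['title']) ? $movie['title'] : '';
    $rating = isset($movie['ratings']['percentage']) ? $movie['ratings']['percentage'] : null;
    $poster = isset($movie['images']['poster']) ? $movie['images']['poster'] : '';
    echo $title . ' (' . $rating . ') - ' . $poster . "\n";
}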

ajax calls how to retrieve partial results during long executions with PHP

Good day all.
I have a page that calls a script via AJAX; the script calls a Prestashop web service and has to insert several items at a time. My problem is that the script seems "frozen" for most of the time, then after 2 or 3 minutes it starts to print results and continues until the end. What I would like is to receive something from the script each time it inserts an item, instead of it "buffering" hundreds of results and showing them all at once.
This is the code (stripped of unnecessary parts) that I'm using:
<?php
function PS_new_product(/* all product attributes */) {
    global $webService;
    try {
        $opt = array('resource' => 'products');
        $opt['postXml'] = $xml->asXML();
        $xml = $webService->add($opt); // this should return each product state (whether it was inserted or not)
        return true;
    } catch (PrestaShopWebserviceException $ex) {
        return false;
    }
}

function inserisciProdottiRAW() {
    set_time_limit(30);
    $sql_prodotti = "SELECT everything i need to import";
    if ($prodotti = mysql_query($sql_prodotti)) {
        while ($row = mysql_fetch_assoc($prodotti)) {
            $webService = new PrestaShopWebservice(PS_SHOP_PATH, PS_WS_AUTH_KEY, DEBUG);
            $opt = array('resource' => 'products');
            $opt['filter[reference]'] = "[".$row["modello"]."]";
            $xml = $webService->get($opt);
            $prodotto = $xml->children()->children();
            if ($prodotto->product['id'] == "") { // original had product[#id]
                PS_new_product(/* all product attributes */);
            }
        }
    }
    echo "ok";
}
inserisciProdottiRAW();
?>
I would like something that I could catch in the calling page so I know, for example, which item it has reached at a given time... is that possible? Or do I have to implement something that counts the items inserted in the database every... mh... 30 seconds?
If you need a quick and dirty solution: just echo something after every insertion and make sure it includes a newline and is big enough to flush the output buffers in PHP/Apache (4 KB should do it). For example, use a method like this:
function logProgress($message)
{
    echo($message);
    // Pad the output so it is large enough to push the server's buffer out to the client.
    for ($i = 0; $i < 4096; $i++) {
        echo(" ");
    }
    echo("\n");
    flush(); // also ask PHP to push its buffered output to the client
}
If you use gzip, this may not be enough because the padding compresses to almost nothing; in that case mix in other random whitespace characters or disable compression for this script.
If you want to show progress to a user, you can instead run the insertion script in the background, save its state in a database table, and poll that table from a different script.
Running a background job can be done with a fork, with curl, or, if you need a proper job manager, with gearman.
Also be warned that if you use sessions you cannot have two scripts running at the same time: one will wait for the other to finish. If you know you won't be using the session any more in your script, you can call session_write_close() to get rid of this locking issue.
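A minimal sketch of that polling approach (the table and column names here are invented for illustration): the background worker keeps a counter row up to date while it inserts, and a tiny progress script is polled from the page via AJAX.
<?php
// worker.php - called from inside the insertion loop (assumes a helper table `import_progress`)
function updateProgress(mysqli $db, $done, $total) {
    $stmt = $db->prepare("REPLACE INTO import_progress (id, done, total) VALUES (1, ?, ?)");
    $stmt->bind_param('ii', $done, $total);
    $stmt->execute();
}

// progress.php - polled from the page every few seconds via AJAX
$db = new mysqli('localhost', 'user', 'pass', 'shop');
$row = $db->query("SELECT done, total FROM import_progress WHERE id = 1")->fetch_assoc();
header('Content-Type: application/json');
echo json_encode($row ?: array('done' => 0, 'total' => 0));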

Can I retry file_get_contents() until it opens a stream?

I am using PHP to get the contents of an API. The problem is that sometimes the API just sends back a 502 Bad Gateway error, and then the PHP code can't parse the JSON and set the variables correctly. Is there some way I can keep trying until it works?
This is not an easy question because PHP is a synchronous language by default.
You could do this:
$a = false;
$i = 0;
while ($a == false && $i < 10) {
    $a = file_get_contents($path);
    $i++;
    usleep(250000); // 250 ms pause between attempts (usleep() takes microseconds)
}
$result = json_decode($a);
The usleep() call adds a pause between attempts so your server doesn't hammer the API every time it is unavailable (note that usleep() takes microseconds, hence the large value above). The function also gives up after 10 attempts, which prevents it from freezing completely in case of a long outage.
Since you didn't provide any code it's kind of hard to help you. But here is one way to do it.
$data = null;
while (!$data) {
    $json = file_get_contents($url);
    $data = json_decode($json); // json_decode() returns null when the JSON is not valid
}
// The while loop won't stop until the JSON was valid and $data contains an object
var_dump($data);
I suggest you add some sort of counter variable in there to stop attempting after X tries.
Based on your comment, here is what I would do:
1) Write a PHP script that makes the API call and, if successful, records the price and when that price was acquired.
2) Put that script in a cronjob/scheduled task that runs every 10 minutes.
3) Have your PHP view pull the most recent price from the database and use that for whatever display/calculations it needs. If pertinent, also show the date/time that price was captured.
The other answers suggest doing a loop. A combined approach probably works best here: in the cron script, retry a few times in case the interface is down for a short blip; if it's still not up after, say, a minute, keep using the old value until the next run. A sketch of the cron script follows.
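A minimal sketch of the cron-side script, assuming a `prices` table with `price` and `captured_at` columns (the table, credentials and API URL are placeholders):
<?php
// fetch_price.php - run from cron every 10 minutes
$pdo = new PDO('mysql:host=localhost;dbname=myapp', 'user', 'pass');

$json = false;
for ($attempt = 0; $attempt < 3 && $json === false; $attempt++) {
    $json = @file_get_contents('https://api.example.com/price'); // placeholder URL
    if ($json === false) {
        sleep(5); // brief pause before retrying
    }
}

if ($json !== false && ($data = json_decode($json, true)) !== null) {
    $stmt = $pdo->prepare("INSERT INTO prices (price, captured_at) VALUES (?, NOW())");
    $stmt->execute(array($data['price'])); // the 'price' key is an assumption about the API response
}
// If every attempt failed, do nothing: the view keeps showing the last stored price.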
A loop can solve this problem, but so can a recursive function like this one:
function file_get_contents_retry($url, $attemptsRemaining = 3) {
    $content = file_get_contents($url);
    $attemptsRemaining--;
    if (empty($content) && $attemptsRemaining > 0) {
        return file_get_contents_retry($url, $attemptsRemaining);
    }
    return $content;
}

// Usage:
$retryAttempts = 6; // Default is 3.
echo file_get_contents_retry("http://google.com", $retryAttempts);

Simple Html DOM Caching

I'm using Simple HTML DOM to scrape (with permission) some websites. I basically scrape around 50 different websites with statistical data, which is updated around four times a day.
As you can imagine, the scraping takes time, so I need to speed up the process by doing some caching.
My vision is:
DATA-PRESENTATION.php // where all the results are shown
SCRAPING.php // the code that does the job
I want to set up a cron job on SCRAPING.php so that it executes 4 times a day and saves all the data to a cache, which is then read by DATA-PRESENTATION.php, making the experience much faster for the user.
My question is how I can implement this cache. I'm a real rookie at PHP; I've been reading tutorials, but they are not very helpful and there are only a few, so I just couldn't really learn how to do it.
I know another solution might be to use a database, but I don't want to do that. I've also been reading about high-end solutions like memcached, but the site is very simple and for personal use, so I don't need that kind of thing.
Thanks!!
SCRAPING.PHP
<?php
include("simple_html_dom.php");
// Labour stats
$html7 = file_get_html('http://www.website1.html');
$web_title = $html7->find(".title h1");
$web_figure = $html7->find(".figures h2");
?>
DATA-PRESENTATION.PHP
<div class="news-pitch">
    <h1>Website: <?php echo utf8_encode($web_title[0]->plaintext); ?></h1>
    <p>Unemployment rate: <?php echo utf8_encode($web_figure[0]->plaintext); ?></p>
</div>
FINAL CODE! Many thanks #jerjer and #PaulD.Waite, I couldn't really get this done without your help!
Files:
1- DataPresentation.php // here I show the data read from Cache.html
2- Scraping.php // here I scrape the sites and save the results to Cache.html
3- Cache.html // here the scraping results are saved
I set up a cron job on Scraping.php telling it to overwrite Cache.html each time. (A sample crontab entry follows.)
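For reference, a crontab entry that runs the scraper four times a day might look like this (the PHP binary and script paths are placeholders):
# m h dom mon dow command
0 0,6,12,18 * * * /usr/bin/php /path/to/Scraping.php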
1- DataPresentation.php
<?php
include("simple_html_dom.php");
$html = file_get_html("cache/test.html");
$title = $html->find("h1");
echo $title[0]->plaintext;
?>
2- Scraping.php
<?php
include("simple_html_dom.php");
// by adding "->find("h1")" I speed up things as it only retrieves the information I'll be using and not the whole page.
$filename = "cache/test.html";
$content = file_get_html ('http://www.website.com/')->find("h1");
file_put_contents($filename, $content);
?>
3- Cache.html
<h1>Current unemployment 7,2%</h1>
It loads immediately, and by setting things up this way I make sure there is always a cache file to be loaded.
Here is a sample of a file-based caching:
<?php
// Labour stats
$filename = "cache/website1.html";
if (!file_exists($filename)) {
    // Note: this cache never expires on its own; delete or overwrite the file (e.g. from cron) to refresh it.
    $content = file_get_contents('http://www.website1.html');
    file_put_contents($filename, $content);
}
$html7 = file_get_html($filename);
$web_title = $html7->find(".title h1");
$web_figure = $html7->find(".figures h2");
?>
Try using the Zend_Cache library from Zend Framework. It's quite simple to use:
function loadHtmlWithCache($webAddress) {
    $frontendOptions = array(
        'lifetime' => 7200,                // cache lifetime of 2 hours
        'automatic_serialization' => true
    );
    $backendOptions = array(
        'cache_dir' => './tmp/'            // directory where to put the cache files
    );
    // getting a Zend_Cache_Core object
    $cache = Zend_Cache::factory('Core',
                                 'File',
                                 $frontendOptions,
                                 $backendOptions);
    $cacheId = md5($webAddress);           // cache IDs may only contain [a-zA-Z0-9_]
    if (($result = $cache->load($cacheId)) === false) {
        $html7 = file_get_html($webAddress);
        $web_title = $html7->find(".title h1");
        $web_figure = $html7->find(".figures h2");
        // save() takes the data first, then the cache ID (the original had them reversed)
        $cache->save(array('title' => $web_title, 'figure' => $web_figure), $cacheId);
    } else {
        // cache hit! shout so that we know
        $web_title = $result['title'];
        $web_figure = $result['figure'];
    }
    return array('title' => $web_title, 'figure' => $web_figure);
}
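With the return value added above, usage from DATA-PRESENTATION.php could look like this (a sketch, not part of the original answer):
$stats = loadHtmlWithCache('http://www.website1.html');
echo utf8_encode($stats['title'][0]->plaintext);
echo utf8_encode($stats['figure'][0]->plaintext);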
