I'm using Simple HTML DOM to scrape (with permission) some websites. I basically scrape around 50 different websites with statistical data which is updated around four times a day.
As you can imagine, the scraping takes time, so I need to speed up the process with some caching.
My vision is:
DATA-PRESENTATION.php // where all the results are shown
SCRAPING.php // the code that makes the job
I want to set up a cron job on SCRAPING.PHP so that it executes four times a day and saves all the data to a cache, which DATA-PRESENTATION.PHP then reads, making the experience much faster for the user.
My question is: how can I implement this cache? I'm a rookie at PHP; I've been reading tutorials, but they are few and not very helpful, so I couldn't really work out how to do it.
I know another solution might be to use a database, but I don't want to do that. I've also read about high-end solutions like memcached, but the site is very simple and for personal use, so I don't need that kind of thing.
Thanks!!
SCRAPING.PHP
<?php
include("simple_html_dom.php");
// Labour stats
$html7 = file_get_html('http://www.website1.html');
$web_title = $html7->find(".title h1");
$web_figure = $html7->find(".figures h2");
?>
DATA-PRESENTATION.PHP
<div class="news-pitch">
<h1>Website: <?php echo utf8_encode($web_title[0]->plaintext); ?></h1>
<p>Unemployment rate: <?php echo utf8_encode($web_figure[0]->plaintext); ?></p>
</div>
FINAL CODE! Many thanks @jerjer and @PaulD.Waite, I couldn't have done this without your help!
Files:
1- DataPresentation.php // here I show the data read from Cache.html
2- Scraping.php // here I scrape the sites and then save the results to Cache.html
3- Cache.html // here the scraping results are saved
I set up a cron job on Scraping.php so that it overwrites Cache.html on each run.
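For reference, a crontab entry along these lines runs the scraper four times a day; the PHP binary path and script path here are assumptions, so adjust them to your setup:

0 0,6,12,18 * * * /usr/bin/php /path/to/Scraping.php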
1- DataPresentation.php
<?php
include("simple_html_dom.php");
$html = file_get_html("cache/test.html");
$title = $html->find("h1");
echo $title[0]->plaintext;
?>
2- Scraping.php
<?php
include("simple_html_dom.php");
// by adding ->find("h1") I speed things up: only the information I'll actually use gets saved, not the whole page
$filename = "cache/test.html";
$content = file_get_html ('http://www.website.com/')->find("h1");
file_put_contents($filename, $content);
?>
3- Cache.html
<h1>Current unemployment 7,2%</h1>
It loads immediately, and by setting things up this way I make sure there is always a cache file ready to be loaded.
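One caveat: while the cron job is rewriting the cache file, a visitor could catch it half-written. A small variation on Scraping.php (a sketch only, using the same placeholder URL and paths as above) avoids that by writing to a temporary file first and renaming it, which swaps the file in one step:

<?php
include("simple_html_dom.php");

$filename = "cache/test.html";
$content  = file_get_html('http://www.website.com/')->find("h1");

// write to a temporary file first, then swap it in, so readers never see a partial file
$tmpfile = $filename . ".tmp";
file_put_contents($tmpfile, $content);
rename($tmpfile, $filename);
?>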
Here is a sample of file-based caching:
<?php
// Labour stats
$filename = "cache/website1.html";
if (!file_exists($filename)) {
    $content = file_get_contents('http://www.website1.html');
    file_put_contents($filename, $content);
}
$html7 = file_get_html($filename);
$web_title = $html7->find(".title h1");
$web_figure = $html7->find(".figures h2");
?>
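The sample above only fetches the page when the cache file is missing, so it never refreshes on its own. Since the source data changes about four times a day, a variation that also checks the file's age keeps the cache from going stale; the six-hour limit here is an assumption to tune:

<?php
include("simple_html_dom.php");

// Labour stats
$filename = "cache/website1.html";
$max_age  = 6 * 60 * 60; // refresh after 6 hours (assumed; adjust to the update schedule)

// re-download when the cache file is missing or older than $max_age
if (!file_exists($filename) || time() - filemtime($filename) > $max_age) {
    $content = file_get_contents('http://www.website1.html');
    file_put_contents($filename, $content);
}

$html7 = file_get_html($filename);
$web_title  = $html7->find(".title h1");
$web_figure = $html7->find(".figures h2");
?>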
Try the Zend_Cache component from Zend Framework. It's quite simple to use:
// assumes Zend Framework 1 is on the include_path
require_once 'Zend/Cache.php';

function loadHtmlWithCache($webAddress)
{
    $frontendOptions = array(
        'lifetime' => 7200,                // cache lifetime of 2 hours
        'automatic_serialization' => true  // needed because we store an array
    );
    $backendOptions = array(
        'cache_dir' => './tmp/'            // directory where to put the cache files
    );

    // getting a Zend_Cache_Core object
    $cache = Zend_Cache::factory('Core',
                                 'File',
                                 $frontendOptions,
                                 $backendOptions);

    // cache ids may only contain letters, digits and underscores, so hash the URL
    $cacheId = md5($webAddress);

    if (($result = $cache->load($cacheId)) === false) {
        // cache miss: scrape the page and store the extracted text
        $html7 = file_get_html($webAddress);
        $web_title  = $html7->find(".title h1");
        $web_figure = $html7->find(".figures h2");
        $result = array(
            'title'  => $web_title[0]->plaintext,
            'figure' => $web_figure[0]->plaintext
        );
        // note: save() takes the data first, then the cache id
        $cache->save($result, $cacheId);
    } // else: cache hit, $result already holds the cached array

    return $result;
}
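Called like this (with one of the question's placeholder URLs), the first request scrapes the page and later requests are served from the file cache until the two-hour lifetime expires:

$data = loadHtmlWithCache('http://www.website1.html');
echo $data['title'] . ' - ' . $data['figure'];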
So I'm trying to make a PHP crawler (for personal use).
The code should display "found" for each eBay auction item that ends in less than one hour, but there seems to be a problem: the crawler can't get all the span elements, and the "remaining time" element is a span.
The simple_html_dom.php is downloaded and not edited.
<?php
include_once('simple_html_dom.php');

// URL which I want to crawl (contains GET data)
$url = 'http://www.ebay.de/sch/Apple-Notebooks/111422/i.html?LH_Auction=1&Produktfamilie=MacBook%7CMacBook%2520Air%7CMacBook%2520Pro%7C%21&LH_ItemCondition=1000%7C1500%7C2500%7C3000&_dcat=111422&rt=nc&_mPrRngCbx=1&_udlo&_udhi=20';

$html = new simple_html_dom();
$html->load_file($url);

foreach ($html->find('span') as $part) {
    // when I echo $part it displays many span elements, but not the remaining-time ones
    echo $part;
    $cur_class = $part->class;
    // the class attribute of an auction item that ends in less than an hour is "MINUTES timeMs alert60Red"
    if ($cur_class == 'MINUTES timeMs alert60Red') {
        echo 'found';
    }
}
?>
Any answers would be useful, thanks in advance
Looking at the fetched HTML, it seems the class alert60Red is set through JavaScript, so you can't find it: the JavaScript is never executed when you just fetch the page.
Searching for just MINUTES timeMs should be stable enough:
<?php
include_once('simple_html_dom.php');
$url = 'http://www.ebay.de/sch/Apple-Notebooks/111422/i.html?LH_Auction=1&Produktfamilie=MacBook%7CMacBook%2520Air%7CMacBook%2520Pro%7C%21&LH_ItemCondition=1000%7C1500%7C2500%7C3000&_dcat=111422&rt=nc&_mPrRngCbx=1&_udlo&_udhi=20';
$html = new simple_html_dom();
$html->load_file($url);
foreach ($html->find('span') as $part) {
$cur_class = $part->class;
if (strpos($cur_class, 'MINUTES timeMs') !== false) {
echo 'found';
}
}
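If you also want to see which auction matched, the same branch can print the span's text, for example:

if (strpos($cur_class, 'MINUTES timeMs') !== false) {
    echo 'found: ' . $part->plaintext . "\n"; // the remaining-time text of the matched span
}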
If a snippet of code is included in another PHP file, or HTML is generated by PHP, your browser cannot see that source.
So no web-crawling API can detect it. I think your best bet is to find the location of simple_html_dom.php and try to crawl that file somehow, though you may not even be able to get access to it. It's tricky.
You could also try finding elements by id, if your API has that function.
UPDATE: I think CakePHP's updateAll() is the problem. If I comment out the updateAll() and just pr() the results, I get as many language detections in 1-2 seconds as I otherwise get in 5 minutes! I only need to update one row, and I can identify that row by author and title... is there a better and faster way?
I'm using detectlanguage.com to detect all English texts in my SQL database. My database consists of about 500,000 rows. I've tried many things to detect the language of all my texts faster, but right now it will take many days... :/
I only send 20% of each text (see my code).
I tried to copy my function and run it several times in parallel; the copied code below shows the function for all texts with a title starting with A.
I can only run 6 functions at the same time (localhost)... I tried a 7th one in a new tab, but got:
Waiting for available socket...
public function detectLanguageA()
{
    set_time_limit(0);
    ini_set('max_execution_time', 0);

    $mydatas = $this->datas;
    $alldatas = $mydatas->find('all')
        ->where(['SUBSTRING(datas.title,1,1) =' => 'A'])
        ->where(['datas.lang =' => '']);

    foreach ($alldatas as $row) {
        $text = $row->text;
        $textLength = round(strlen($text) * 0.2);
        $text = substr($text, 0, $textLength); // only send the first 20% of the text (was $ltextLength, a typo)
        $title = $row->title;
        $author = $row->author;

        $languageCode = DetectLanguage::simpleDetect($text);

        $mydatas->updateAll(
            ['lang' => $languageCode],                    // fields
            ['author' => $author, 'textTitle' => $title]  // conditions
        );
    }
}
I hope someone has an idea for my problem... at this rate the language detection for all my texts will take more than a week :/
My computer has been running for over 20 hours with only short interruptions, but I've only detected the language of about 13,000 texts, and there are 500,000 texts in my database...
I also tried sending the texts in batches, but that's too slow as well... I always send 20 texts in one array, which I think is the maximum...
Is it possible that the CakePHP 3.x updateAll() function is what makes it so slow?
THE PROBLEM WAS THE CAKEPHP updateAll
Now I'm using http://book.cakephp.org/3.0/en/orm/saving-data.html#updating-data with a for loop, and everything is fast and good:
use Cake\ORM\TableRegistry;

$articlesTable = TableRegistry::get('Articles');

for ($i = 1; $i < 460000; $i++) {
    $oneArticle = $articlesTable->get($i);
    $languageCode = DetectLanguage::simpleDetect($oneArticle->lyrics);
    $oneArticle->lang = $languageCode;
    $articlesTable->save($oneArticle); // was $oneSong, which is undefined here
}
I'm connecting to the trakt.tv API; I want to create a little app for myself that displays movie posters with ratings, etc.
This is what I'm currently using to retrieve their .json file containing all the info I need.
$json = file_get_contents('http://api.trakt.tv/movies/trending.json/2998fbac88fd207cc762b1cfad8e34e6');
$movies = json_decode($json, true);
$movies = array_slice($movies, 0, 20);
foreach($movies as $movie) {
echo $movie['images']['fanart'];
}
Because the .json file is huge, it loads pretty slowly. I only need a couple of attributes from the file, like the title, rating and poster link, and beyond that I only need the first 20 entries or so. How can I load only part of the .json file so it loads faster?
Apart from that, I'm not experienced with PHP in combination with JSON, so if my code is garbage and you have suggestions, I'd love to hear them.
Unless the API provides a limit parameter or similar, I don't think you can limit the query at your side. On a quick look it doesn't seem to provide this. It also doesn't look like it really returns that much data (under 100KB), so I guess it is just slow.
Given the slow API I'd cache the data you receive and only update it once per hour or so. You could save it to a file on your server using file_put_contents and record the time it was saved too. When you need to use the data, if the saved data is over an hour old, refresh it.
This quick sketch of an idea works:
function get_trending_movies() {
if(! file_exists('trending-cache.php')) {
return cache_trending_movies();
}
include('trending-cache.php');
if(time() - $movies['retreived-timestamp'] > 60 * 60) { // 60*60 = 1 hour
return cache_trending_movies();
} else {
unset($movies['retreived-timestamp']);
return $movies;
}
}
function cache_trending_movies() {
$json = file_get_contents('http://api.trakt.tv/movies/trending.json/2998fbac88fd207cc762b1cfad8e34e6');
$movies = json_decode($json, true);
$movies = array_slice($movies, 0, 20);
$movies_with_date = $movies;
$movies_with_date['retreived-timestamp'] = time();
file_put_contents('trending-cache.php', '<?php $movies = ' . var_export($movies_with_date, true) . ';');
return $movies;
}
print_r(get_trending_movies());
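You cannot make the API send less, but you can at least keep only the fields you actually use once the data is cached. A rough sketch; apart from the images key seen in the question, the exact key names ('title', 'ratings', 'poster') are assumptions, so check what your JSON really contains:

$movies = get_trending_movies();
$slim = array();
foreach ($movies as $movie) {
    $slim[] = array(
        'title'  => isset($movie['title']) ? $movie['title'] : null,
        'rating' => isset($movie['ratings']['percentage']) ? $movie['ratings']['percentage'] : null,
        'poster' => isset($movie['images']['poster']) ? $movie['images']['poster'] : null,
    );
}
print_r($slim);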
Good day all.
I have a page that calls a script via AJAX; the script calls a PrestaShop web service and has to insert several items at a time. My problem is that the script seems "frozen" for most of the time, then after 2 or 3 minutes it starts to print out results and continues to the end. What I would like is to get something back from the script each time it inserts an item, instead of it "buffering" hundreds of results that I then see all at once.
This is the code (stripped of unnecessary parts) that I'm using.
<?php
function PS_new_product(/* all product attributes */) {
    global $webService;
    try {
        $opt = array('resource' => 'products');
        // $xml is built here from the product attributes (stripped for brevity)
        $opt['postXml'] = $xml->asXML();
        $xml = $webService->add($opt); // this should return each product's state (inserted or not)
        return true;
    } catch (PrestaShopWebserviceException $ex) {
        return false;
    }
}

function inserisciProdottiRAW(){
    set_time_limit(30);
    $sql_prodotti = "SELECT everything i need to import";
    if ($prodotti = mysql_query($sql_prodotti)) {
        while ($row = mysql_fetch_assoc($prodotti)) {
            $webService = new PrestaShopWebservice(PS_SHOP_PATH, PS_WS_AUTH_KEY, DEBUG);
            $opt = array('resource' => 'products');
            $opt['filter[reference]'] = "[".$row["modello"]."]";
            $xml = $webService->get($opt);
            $prodotto = $xml->children()->children();
            if ($prodotto->product['id'] == "") { // was product[#id], which does not parse
                PS_new_product(/* all product attributes */);
            }
        }
    }
    echo "ok";
}

inserisciProdottiRAW();
?>
I would like something I can catch in the calling page so I know, for example, which item it has reached at a given time... is that possible? Or do I have to implement something that counts the items inserted in the database every... hmm... 30 seconds?
If you need a quick and dirty solution, just echo something after every insertion and make sure it includes a newline and is big enough to flush the output buffers in PHP/Apache (4 KB should do it). For example, use a method like this:
function logProgress($message)
{
    echo $message;              // was $messsage (typo), so nothing was actually printed
    echo str_repeat(" ", 4096); // pad to roughly 4 KB so the buffers fill up
    echo "\n";
    flush();                    // push the output out to the web server / browser
    if (ob_get_level() > 0) {
        ob_flush();
    }
}
If you use gzip, that might not be enough, because the identical spaces compress away; mix in other random whitespace characters as well.
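To call it from the existing loop (reusing the names from the question's code), something along these lines works:

// inside the while loop of inserisciProdottiRAW(), after checking the product does not exist yet
if ($prodotto->product['id'] == "") {
    $inserted = PS_new_product(/* all product attributes */);
    logProgress("Product " . $row["modello"] . ": " . ($inserted ? "inserted" : "failed"));
}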
If you want to show the progress to a user, you can run your insertion script in the background, save its state in a database table (or a file) and poll it from a different script.
Running a background job can be done with pcntl_fork(), a curl request to your own server, or, if you need a proper job manager, Gearman.
Also be warned that if you use sessions, you cannot have two scripts running at the same time: one will wait for the other to finish. If you know you won't be using the session any more in your script, you can call session_write_close() to get rid of this locking issue.
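If you go the polling route, a minimal sketch is to have the import write its progress to a small status file (progress.json is a hypothetical name) that a second script returns to the page's AJAX poll:

<?php
// writer side: call this from the import loop after each item
function saveProgress($done, $total)
{
    file_put_contents('progress.json', json_encode(array(
        'done'  => $done,
        'total' => $total
    )));
}

// reader side: a separate progress.php polled via AJAX could simply do
//   header('Content-Type: application/json');
//   echo file_exists('progress.json') ? file_get_contents('progress.json') : '{}';
?>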
I have written a script for web scraping where I fetch each link from the page and then load that URL in the code. It works extremely slowly: it takes about 50 seconds for the first output and an age to complete roughly 100 links, and I don't understand why. I'm thinking about caching, but I don't know how it could help here:
1) Page caching or opcode cache.
The code is:
public function searchForum(){
global $wpdb;
$sUrl = $this->getSearchUrl();
$this->logToCrawler();
$cid = $this->getCrawlId();
$html = file_get_dom($sUrl);
$c=1;
foreach($html('div.gridBlobTitle a:first-child') as $element){
$post_page = file_get_dom($element->href);
$post_meta = array();
foreach($post_page('table#mytable img:first-child') as $img){
if(isset($img->src)){
$post_meta['image_main'] = self::$forumurl.$img->src;
}
else{
$post_meta['image_main']=NULL;
}
}
foreach($post_page('table.preferences td:odd') as $elm){
$post_meta[] = strip_tags($elm->getInnerText());
unset($elm);
}
/*Check if can call getPlainText for description fetch*/
$object = $post_page('td.collection',2);
$methodVariable = array($object, 'getPlainText');
if(is_callable($methodVariable, true, $callable_name)){
$post_meta['description'] = utf8_encode($object->getPlainText());
}
else{
$post_meta['description'] = NULL;
}
$methodVariable = array($object, 'getInnerText');
if(is_callable($methodVariable, true, $callable_name)){
/*Get all the images we found*/
$rough_html = $object->getInnerText();
preg_match_all("/<img .*?(?=src)src=\"([^\"]+)\"/si", $rough_html, $matches);
$images = array_map('self::addUrlToItems',$matches[1]);
$images = json_encode($images);
}
if($post_meta[8]=='WTB: Want To Buy'){
$status='buy';
}
else{
$status='sell';
}
$lastdate = strtotime(date('Y-m-d',strtotime("-1 month")));
$listdate = strtotime(date('Y-m-d',strtotime($post_meta[9])));
/*Check for date*/
if($listdate>=$lastdate){
$wpdb->query("INSERT
INTO tbl_scrubed_data SET
keywords='".esc_sql($this->getForumSettings()->search_meta)."',
url_to_post='".esc_sql($element->href)."',
description='".esc_sql($post_meta['description'])."',
date_captured=now(),crawl_id='".$cid."',
image_main='".esc_sql($post_meta['image_main'])."',
images='".esc_sql($images)."',brand='".esc_sql($post_meta[0])."',
series='".esc_sql($post_meta[1])."',model='".esc_sql($post_meta[2])."',
watch_condition='".esc_sql($post_meta[3])."',box='".esc_sql($post_meta[4])."',
papers='".esc_sql($post_meta[5])."',year='".esc_sql($post_meta[6])."',case_size='".esc_sql($post_meta[7])."',status='".esc_sql($post_meta[8])."',listed='".esc_sql($post_meta[9])."',
asking_price='".esc_sql($post_meta[10])."',retail_price='".esc_sql($post_meta[11])."',payment_info='".esc_sql($post_meta[12])."',forum_id='".$this->getForumSettings()->ID."'");
unset($element,$post_page,$images);
} /*END: Check for date*/
}
$c++;
}
Notes:
1) I am using [Ganon DOM Parser][1] for parsing the HTML.
[1]: https://code.google.com/p/ganon/wiki/AccesElements
2) I'm on Windows XP with WAMP, MySQL 5.5, PHP 5.3 and 1 GB of RAM.
If you need more info, please ask in the comments.
Thanks
You need to figure out what parts of your program are being slow. There are two ways to do that.
1) Put in some print statements that print out the time in various places, so you can say "Hey, look, this took 5 seconds to go from here to here."
2) Use a profiler like xdebug that will run your program and analyze it while it's running and then you can know which parts of the code are slow.
Just looking at a program you can't say "Oh, that's the slow part to speed up." Without knowing what's slow, you'll probably waste time speeding up parts that aren't the slow parts.
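For the first approach, a couple of microtime() checkpoints around the suspect calls are usually enough to see where the seconds go. A rough sketch using the loop from the question (most likely the per-link HTTP fetch is the expensive part):

$t0 = microtime(true);
$html = file_get_dom($sUrl);
printf("search page fetched in %.2fs\n", microtime(true) - $t0);

foreach ($html('div.gridBlobTitle a:first-child') as $element) {
    $t1 = microtime(true);
    $post_page = file_get_dom($element->href); // one HTTP request per link
    printf("fetched %s in %.2fs\n", $element->href, microtime(true) - $t1);
    // ... parse and insert as before ...
}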