PHP fast scraping

My goal is to collect headlines from different news outlets and then echo them on my page. I've tried using Simple HTML DOM and then running an if statement to check for keywords. It works, but it is very slow! The code can be found below. Is there a better way to go about this, and if so, how would it be written?
Thanks in advance.
<?php
require 'simple_html_dom.php';

// URL and keyword selector
$syds = file_get_html('http://www.sydsvenskan.se/nyhetsdygnet');
$syds_key = 'a.newsday__title';

// Debug counter
$i = 0;

// Check the link text (not the raw HTML) for the keyword "a", case-insensitively
foreach ($syds->find($syds_key) as $element) {
    if (stripos($element->plaintext, 'a') !== false) {
        echo $element->href . '<br>';
        $i++;
    }
}
echo "<h1>$i were found</h1>";
?>

How slow are we talking?
1-2 seconds would be pretty good.
If you're using this for a website,
I'd advise splitting the crawling and the display into two separate scripts, and caching the results of each crawl.
You could:
have a crawl.php file that runs periodically to update your links.
then have a webpage.php that reads the results of the last crawl and displays it however you need for your website.
This way:
Every time you refresh your webpage, it doesn't re-request info from the news site.
It matters less if the news site takes a while to respond.
Decouple crawling/display
You will want to decouple crawling and display completely.
Have a "crawler.php" that runs over all the news sites one at a time, saving the raw links to a file. This can run every 5-10 minutes to keep the news updated; be warned, at less than 1 minute some news sites may get annoyed!
crawler.php
<?php
// Run this file from the CLI every 5-10 minutes;
// it doesn't matter if it takes 20-30 seconds.
require 'simple_html_dom.php';

$html_output = ""; // use this to build up the HTML output

$sites = array(
    array('http://www.sydsvenskan.se/nyhetsdygnet', 'a.newsday__title')
    /* more sites go here, like this */
    // array('URL', 'KEY')
);

// loop over each site
foreach ($sites as $site) {
    $url = $site[0];
    $key = $site[1];

    // fetch site
    $syds = file_get_html($url);

    // loop over each link
    foreach ($syds->find($key) as $element) {
        // add link to $html_output (double quotes so \n is a real newline)
        $html_output .= $element->href . "<br>\n";
    }
}

// save $html_output to a local file
file_put_contents("links.php", $html_output);
?>
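To trigger crawler.php on that schedule you could use cron; a minimal crontab entry (the path is a placeholder for wherever the script lives) would be:

*/10 * * * * php /path/to/crawler.php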
display.php
/* other display stuff here */
<?php
// include the file of links
include("links.php");
?>
Still want faster?
If you want it any faster, I'd suggest looking into node.js; it's much faster at TCP connections and HTML parsing.

The bottlenecks are:
blocking IO - each request waits for the previous one to finish; you can switch to a library like Guzzle that can send requests concurrently (see the sketch below)
parsing - you can switch to a different parser for better parsing speed
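For instance, here is a rough, untested sketch of concurrent fetching with Guzzle promises. It assumes Guzzle is installed via Composer; Utils::settle lives in recent guzzlehttp/promises versions (older ones expose it as the function \GuzzleHttp\Promise\settle()).

<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Promise\Utils;

$urls = array(
    'http://www.sydsvenskan.se/nyhetsdygnet',
    // more sites here
);

$client = new Client(array('timeout' => 10));

// fire off all requests at once instead of waiting on each in turn
$promises = array();
foreach ($urls as $url) {
    $promises[$url] = $client->getAsync($url);
}

// settle() waits for every request, fulfilled or failed
$results = Utils::settle($promises)->wait();

foreach ($results as $url => $result) {
    if ($result['state'] === 'fulfilled') {
        $html = (string) $result['value']->getBody();
        // hand $html to your parser here
    }
}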

Related

Fetch data from site Page by Page & go through sub links

URL : http://www.sayuri.co.jp/used-cars
Example : http://www.sayuri.co.jp/used-cars/B37753-Toyota-Wish-japanese-used-cars
Hey guys, I need some help with one of my personal projects. I've already written the code to fetch data from each single car URL (example above) and post it on my site.
Now I need to go through the main URL, sayuri.co.jp/used-cars, and:
1) Make an array / list / nodes of all the URLs for all the single cars in it, then run my internal code for each one to fetch data, then move on to the next one.
I already have the code to save each URL into a log file when completed (I don't think it will be necessary if it goes link by link without starting from the top, but it will ensure no repetition).
2) When all links are done for the page, it should move to the next page and do the same thing until the end (there are 5-6 pages max).
I've been stuck on this part since last night and would really appreciate any help. Thanks.
My code to get data from the main URL:
$content = file_get_contents('http://www.sayuri.co.jp/used-cars/');
// echo $content;
and
$dom = new DOMDocument;
$dom->loadHTML($content);
//echo $dom;
I'm guessing you already know this since you say you've gotten data from the car entries themselves, but a good place to start is dissecting the page's DOM and seeing if there are any elements you can use to jump around quickly. Most browsers have page inspection tools to help with this.
In this case, <div id="content"> serves nicely. You'll note it contains a collection of tables with the required links and a <div> that contains the text telling us how many pages there are.
Disclaimer: it's been years since I've done PHP and I have not tested this, so it is probably neither correct nor optimal, but it should get you started. You'll need to tie the functions together (what's the fun in me doing it?) to achieve what you want, but these should grab the data required.
You'll be working with the DOM on each page, so here's a convenience function to grab the DOMDocument:
function get_page_document($index) {
    $content = file_get_contents("http://www.sayuri.co.jp/used-cars/page:{$index}");
    $document = new DOMDocument;
    $document->loadHTML($content);
    return $document;
}
You need to know how many pages there are in total in order to iterate over them, so grab it:
function get_page_count($document) {
    $content = $document->getElementById('content');
    $count_div = $content->childNodes->item($content->childNodes->length - 4);
    $count_text = $count_div->firstChild->textContent;
    if (preg_match('/Page \d+ of (\d+)/', $count_text, $matches) === 1) {
        return $matches[1];
    }
    return -1;
}
It's a bit ugly, but the links are available inside each <table> in the contents container. Rip 'em out and push them into an array. If you use the link itself as the key, there is no concern for duplicates, as they'll just overwrite the same key-value pair.
function get_page_links($document) {
    $content = $document->getElementById('content');
    $tables = $content->getElementsByTagName('table');
    $links = array();
    foreach ($tables as $table) {
        if ($table->getAttribute('class') === 'itemlist-table') {
            // table > tbody > tr > td > a
            $link = $table->firstChild->firstChild->firstChild->firstChild->getAttribute('href');
            // No duplicates because they just overwrite the same entry.
            $links[$link] = "http://www.sayuri.co.jp{$link}";
        }
    }
    return $links;
}
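If it helps, one untested way to wire these together (my sketch, not part of the original answer, and it assumes the site accepts page:1 for the first page):

$first_page = get_page_document(1);
$page_count = get_page_count($first_page);

$all_links = array();
for ($i = 1; $i <= $page_count; $i++) {
    $document = ($i == 1) ? $first_page : get_page_document($i);
    // string keys mean duplicate links simply overwrite each other
    $all_links = array_merge($all_links, get_page_links($document));
}

// $all_links now maps each relative href to its absolute URL;
// feed each entry into your existing per-car scraping code.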
Perhaps also obvious, but these will break if the site changes its formatting. You'd be better off asking if they have a REST API or something similar available for long-term use, though I'm guessing you don't care as much if it's just a personal project for tinkering.
Hope it helps prod you in the right direction.

PHP-Retrieve specific content from multiple pages of a website

What I want to accomplish might be a little hardcore, but I want to know if it's possible:
The question:
My question is the same as PHP-Retrieve content from page, but I want to use it on multiple pages.
The situation:
I'm using a website about TV shows. All the TV shows have the same URL and then the name of the show:
http://bierdopje.com/shows/NAME_OF_SHOW
On every show page, there's a line which tells you if the show is cancelled or still running. I want to retrieve that line to make an overview of the cancelled shows (the website only supports an overview of running shows, so I want to make an extra functionality).
The real question:
How can I tell DOM to retrieve all the shows and check for the status of the show?
(http://bierdopje.com/shows/*).
The Note:
I understand that this process may take a while because it is reading the whole website (or is it too much data?).
Use this code to fetch only the links from a single page:
include_once('simple_html_dom.php');
$html = file_get_html('http://www.couponrani.com/');

// Find all links
foreach ($html->find('a') as $element) {
    echo $element->href . '<br>';
}
I use phpQuery to fetch data from a web page, like jQuery in the DOM.
For example, to get the list of all shows, you can do this :
<?php
require_once 'phpQuery/phpQuery/phpQuery.php';

$doc = phpQuery::newDocumentHTML(
    file_get_contents('http://www.bierdopje.com/shows')
);

foreach (pq('.listing a') as $key => $a) {
    $url = pq($a)->attr('href'); // will give "/shows/07-ghost"
    $show = pq($a)->text();      // will give "07 Ghost"
}
Now you can process all shows individually: make a new phpQuery::newDocumentHTML for each show and extract the information you need with a selector.
Get the status of a show
$html = file_get_contents('http://www.bierdopje.com/shows/alcatraz');
$doc = phpQuery::newDocumentHTML($html);
$status = pq('.content>span:nth-child(6)')->text();
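Putting the two snippets together, a rough untested sketch (the stripos test for "cancel" is my guess at what the status line contains; adjust it to the site's actual wording):

$doc = phpQuery::newDocumentHTML(
    file_get_contents('http://www.bierdopje.com/shows')
);
$cancelled = array();
foreach (pq('.listing a') as $a) {
    $url  = pq($a)->attr('href');  // e.g. "/shows/07-ghost"
    $show = pq($a)->text();        // e.g. "07 Ghost"

    // load the individual show page and grab its status line
    $page = phpQuery::newDocumentHTML(
        file_get_contents('http://www.bierdopje.com' . $url)
    );
    $status = pq('.content>span:nth-child(6)', $page)->text();

    if (stripos($status, 'cancel') !== false) {
        $cancelled[] = $show;
    }
}
// $cancelled now holds the names of the cancelled shows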

Checking up on a link exchange

I have made a link exchange with another site. Three days later, the site removed my link.
Is there a simple PHP script to help me monitor link exchanges and notify me if my link has been removed?
I need it as simple as possible, not a whole ad system manager.
If you know the URL of the webpage where your ad (link) exists, you can use Simple HTML DOM Parser to get all links of that webpage in an array, and then use PHP's in_array function to check whether your link exists in that array. You can run this script on a daily basis using crontab.
// Create DOM from URL
$html = file_get_html('http://www.example.com/');

// Find all links
$allLinks = array();
foreach ($html->find('a') as $element) {
    $allLinks[] = $element->href;
}

// Check your link.
$adLink = "http://www.mylink.com";
if (in_array($adLink, $allLinks)) {
    echo "My link exists.";
} else {
    echo "My link is removed.";
}
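Since the goal is to be notified rather than to read echoes, you could swap the output for PHP's mail() when the cron job runs; a sketch, with placeholder addresses:

// notify yourself instead of echoing (placeholder addresses)
if (!in_array($adLink, $allLinks)) {
    mail('you@example.com',
         'Link exchange alert',
         'My link was not found on http://www.example.com/ today.');
}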
Technically there's no way to know if someone's website links to yours unless you have traffic directed from their website or you look at their website.
Your best bet would be either:
A script which records every time their site loads your image. This is simple enough by mixing PHP and .htaccess
.htaccess:
RewriteRule path/to/myImage.jpg path/to/myScript.php
myScript.php:
<?php
/* Record (to a database, file, or however) that the image was accessed */
header("Content-type: image/jpeg");
echo file_get_contents("path/to/myImage.jpg");
Or a script which looks at their website every X minutes/hours/days and searches the returned HTML for the link to your image. The challenge here is making the script run periodically; this can be done with crontab or similar.
myScript.php:
$html = file_get_contents("http://www.theirsite.com");
if (strpos($html, 'path/to/myImage.jpg') !== false) {
    /* Happiness */
} else {
    /* ALERT! */
}

Generating cache file for Twitter rss feed

I'm working on a site with a simple PHP-generated Twitter box with user timeline tweets pulled from the user_timeline RSS feed, cached to a local file to cut down on loads and as a backup for when Twitter goes down. I based the caching on this: http://snipplr.com/view/8156/twitter-cache/. It all seemed to be working well yesterday, but today I discovered the cache file was blank. Deleting it and then loading again generated a fresh file.
The code I'm using is below. I've edited it to try to get it to work with what I was already using to display the feed, and probably messed something crucial up.
The changes I made are the following (and I strongly believe that any of these could be the cause):
- Revised the time difference code (the linked example seemed to use a custom function that wasn't included in the code).
- Removed the "serialize" function from the "fwrites". This is purely because I couldn't figure out how to unserialize once I loaded it in the display code. I truthfully don't understand the role that serialize plays or how it works, so I'm sure I should have kept it in. If that's the case, I just need to understand where/how to deserialize in the second part of the code so that it can be parsed.
- Removed the $rss variable in favor of just loading up the cache file in my original tweet display code.
So, here are the relevant parts of the code I used:
<?php
$feedURL = "http://twitter.com/statuses/user_timeline/#######.rss";

// START CACHING
$cache_file = dirname(__FILE__).'/cache/twitter_cache.rss';

// Start with the cache
if (file_exists($cache_file)) {
    $mtime = (strtotime("now") - filemtime($cache_file));
    if ($mtime > 600) {
        $cache_rss = file_get_contents('http://twitter.com/statuses/user_timeline/75168146.rss');
        $cache_static = fopen($cache_file, 'wb');
        fwrite($cache_static, $cache_rss);
        fclose($cache_static);
    }
    echo "<!-- twitter cache generated ".date('Y-m-d h:i:s', filemtime($cache_file))." -->";
}
else {
    $cache_rss = file_get_contents('http://twitter.com/statuses/user_timeline/#######.rss');
    $cache_static = fopen($cache_file, 'wb');
    fwrite($cache_static, $cache_rss);
    fclose($cache_static);
}
// END CACHING

// START DISPLAY
$doc = new DOMDocument();
$doc->load($cache_file);
$arrFeeds = array();
foreach ($doc->getElementsByTagName('item') as $node) {
    $itemRSS = array(
        'title' => $node->getElementsByTagName('title')->item(0)->nodeValue,
        'date' => $node->getElementsByTagName('pubDate')->item(0)->nodeValue
    );
    array_push($arrFeeds, $itemRSS);
}
// the rest of the formatting and display code....
?>
ETA 6/17 Nobody can help…?
I'm thinking it has something to do with writing a blank cache file over a good one when Twitter is down; otherwise this should be happening every ten minutes when the cache file is overwritten again, but it doesn't happen that frequently.
I made the following change to the part where it checks how old the file is to overwrite it:
$cache_rss = file_get_contents('http://twitter.com/statuses/user_timeline/75168146.rss');
if ($mtime > 600 && $cache_rss != '') {
    $cache_static = fopen($cache_file, 'wb');
    fwrite($cache_static, $cache_rss);
    fclose($cache_static);
}
…so now, it will only write the file if it's over ten minutes old and there's actual content retrieved from the rss page. Do you think this will work?
Yes, your code is problematic, because it writes out whatever Twitter sends you, even an empty response.
You should test the file you get from Twitter like this:
if (($mtime > 600) && ($cache_rss = file_get_contents($feedURL)))
{
    file_put_contents($cache_file, $cache_rss);
}
file_get_contents() returns false if there is an error; check it before caching the new content.

Measuring online time on website

I would like to measure how much time a user spends on my website. It's needed for a community site where you can say: "User X has been spending 1397 minutes here."
After reading some documents about this, I know that there is no perfect way to achieve this. You can't measure the exact time. But I'm looking for an approach which gives a good approximation.
How could you do this? My ideas:
1) Adding 30 seconds to the online time counter on every page view.
2) On every page view, save the current timestamp. On the next view, add the difference between the saved timestamp and the current timestamp to the online time counter.
I use PHP and MySQL if this does matter.
I hope you can help me. Thanks in advance!
This is probably pointless.... what if the user has three tabs open and is "visiting" your site while actually working on the other two tabs? Do you want to count that?
Two factors are working against you -
You can only collect point-in-time statistics (page views), and there's no reasonable way to detect what happened between those points;
Even then, you'd be counting browser window time, not user time; users can easily have multiple tabs open on multiple browser instances simultaneously.
I suspect your best approximation is attributing some average amount of attention time per click and then multiplying. But then you might just as well measure clicks.
Why not just measure what actually can be measured: referrals, page views, click-throughs, etc.?
Collecting and advertising these kinds of numbers is completely in line with the rest of the world of web metrics.
Besides—if someone were to bring up a web page and then, say, go on a two week holiday, how best to account for it?
What you could do is check whether a user is active on the page and then send an AJAX request to your server every X seconds (would 60 secs be fine?) saying whether or not the user is active on the page.
Then you can use the second method you mentioned to calculate the time difference between two 'active' timestamps that are not separated by more than one or two intervals. Adding these up gives the time spent by the user on your site; see the sketch below.
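The server side of that heartbeat could look something like this minimal, untested sketch. The name heartbeat.php and the 60-second interval are my assumptions, and the session stands in for your MySQL users table:

<?php
// heartbeat.php - hypothetical endpoint the page pings via AJAX every 60 seconds
session_start();

$interval = 60;   // expected seconds between pings (assumed)
$now = time();

if (isset($_SESSION['last_ping'])) {
    $gap = $now - $_SESSION['last_ping'];
    // credit the gap only if it spans at most two intervals;
    // a longer gap means the user was away or closed the tab
    if ($gap <= 2 * $interval) {
        if (!isset($_SESSION['online_seconds'])) {
            $_SESSION['online_seconds'] = 0;
        }
        $_SESSION['online_seconds'] += $gap;
        // in a real setup, persist this to your MySQL users table instead
    }
}
$_SESSION['last_ping'] = $now;
?>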
Google Analytics includes a very powerful event logging/tracking mechanism you can customize and tap into to get really good measurements of user behavior - I'd look into that.
A very simple solution is to use a hidden iframe that loads a PHP page periodically. The loaded page logs the start time (if it doesn't exist yet) and the stop time. When the person leaves the site you are left with the time the person first came to the site and the last time they were there. In this case, the timestamp is updated every 3 seconds.
I use files to hold the log information. The filename consists of the day-month-year date followed by the visitor's IP address.
Example iframe php code. Put this in yourwebsite/yourAnalyticsiFrameCode.php:
<?php
// get the IP address of the sender
$clientIpAddress = $_SERVER['REMOTE_ADDR'];
$folder = "yourAnalyticsDataFolder";

// Combine the IP address with the current date.
$clientFileRecord = $folder."/".date('d-M-Y')." ".$clientIpAddress;
$startTimeDate = "";

// check to see if the folder to store analytics exists
if (!file_exists($folder)) {
    if (!mkdir($folder))
        return; // error - just bail
}

if (file_exists($clientFileRecord)) {
    // read the contents of the clientFileRecord
    $lines = file($clientFileRecord);
    $count = 0;
    // loop through the lines; the first one holds the start time
    foreach ($lines as $line_num => $line) {
        echo($line);
        if ($count == 0)
            $startTimeDate = rtrim($line);
        $count++;
    }
}

if ($startTimeDate == "")
    $startTimeDate = date('H:i:s d-M-Y');
$endTimeDate = date('H:i:s d-M-Y');

// write the start and stop times back out to the file
$file = fopen($clientFileRecord, "w");
fwrite($file, $startTimeDate."\n".$endTimeDate);
fclose($file);
?>
The JavaScript to periodically reload the iframe in the main web page:
<!-- Javascript to reload the analytics code -->
<script>
window.setInterval("reloadIFrame();", 3000);
function reloadIFrame() {
document.getElementById('AnalyticsID').src = document.getElementById('AnalyticsID').src
// document.frames["AnalyticsID"].location.reload();
}
</script>
The iframe in the main web page looks like this:
<iframe id="AnalyticsID" name="AnalyticsID" src="http://yourwebsite/yourAnalyticsiFrameCode.php"
        width="1" height="1" frameborder="0" style="visibility:hidden;display:none">
</iframe>
A very simple way to display the time stamp files:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title></title>
</head>
<body>
Analytics results
<br>
<?php
$folder = "yourAnalyticsDataFolder";
$files1 = scandir($folder);

// Loop through the files
foreach ($files1 as $fn) {
    // skip the "." and ".." directory entries scandir() returns
    if ($fn == '.' || $fn == '..')
        continue;
    echo ($fn."<br>\n");
    $lines = file($folder."/".$fn);
    foreach ($lines as $line_num => $line) {
        echo (" ".$line."<br>\n");
    }
    echo ("<br>\n <br>");
}
?>
</body>
</html>
You get a results page like this:
22-Mar-2015 104.37.100.30
18:09:03 22-Mar-2015
19:18:53 22-Mar-2015
22-Mar-2015 142.162.20.133
18:10:06 22-Mar-2015
18:10:21 22-Mar-2015
I think client-side JavaScript analytics is the solution for this.
You have Google Analytics, Piwik, and there are also commercial JS tools that do exactly that.
