I have a script that does what I think is pretty basic scraping, but it takes at least 6 seconds on average. Is it possible to speed it up? The $date variables are only there for timing the code and don't add anything significant to the run time. I have set timing markers, and the two slow stretches take roughly 3 seconds each. An example URL for testing is below.
$date = date('m/d/Y h:i:s a', time());
echo "start of timing $date<br /><br />";
include('simple_html_dom.php');
function getUrlAddress()
{
$url = $_SERVER['HTTPS'] == 'on' ? 'https' : 'http';
return $url .'://'.$_SERVER['HTTP_HOST'].$_SERVER['REQUEST_URI'];
}
$date = date('m/d/Y h:i:s a', time()); echo "<br /><br />after geturl $date<br /><br />";
$parts = explode("/",$url);
$html = file_get_html($url);
$date = date('m/d/Y h:i:s a', time()); echo "<br /><br />after file_get_url $date<br /><br />";
$file_string = file_get_contents($url);
preg_match('/<title>(.*)<\/title>/i', $file_string, $title);
$title_out = $title[1];
foreach($html->find('img') as $e){
$image = $e->src;
if (preg_match("/orangeBlue/", $image)) { $image = ''; }
if (preg_match("/BeaconSprite/", $image)) { $image = ''; }
if($image != ''){
if (preg_match("/http/", $image)) { $image = $image; }
elseif (preg_match("*//*", $image)) { $image = 'http:'.$image; }
else { $image = $parts['0']."//".$parts[1].$parts[2]."/".$image; }
$size = getimagesize($image);
if (($size[0]>110)&&($size[1]>110)){
if (preg_match("/http/", $image)) { $image = $image; }
echo '<img src='.$image.'><br>';
}
}
}
$date = date('m/d/Y h:i:s a', time()); echo "<br /><br />end of timing $date<br /><br />";
Example URL
UPDATE
This is actually what the timing markers show:
start of timing 01/24/2012 12:31:50 am
after geturl 01/24/2012 12:31:50 am
after file_get_url 01/24/2012 12:31:53 am
end of timing 01/24/2012 12:31:57 am
http://www.ebay.co.uk/itm/Duke-Nukem-Forever-XBOX-360-Game-BRAND-NEW-SEALED-UK-PAL-UK-Seller-/170739972246?pt=UK_PC_Video_Games_Video_Games_JS&hash=item27c0e53896
It's probably the getimagesize function: it goes and fetches every image on the page so it can determine the size. Maybe you can write something with curl that requests only the headers and reads Content-Length (that gives the file size in bytes rather than the pixel dimensions, but it may be enough to skip tiny images). As far as I know, getimagesize has to read at least the start of the image data itself, so it cannot get by on headers alone.
At any rate, back in the day I wrote a few spiders and it was kind of slow to do. Even with internet speeds better than ever, it's still a separate fetch for each element. And I wasn't even concerned with images.
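A rough sketch of the headers-only idea, using PHP's built-in get_headers() with a HEAD request instead of curl. The function names content_length_from_headers and remote_content_length are made up for this example, and it assumes the server answers HEAD requests and sends a Content-Length header, which not every server does. Note that this gives the size in bytes, not width and height, so it can only act as a cheap pre-filter before getimagesize.

```php
<?php
// Hypothetical helper: pick the Content-Length value out of raw header lines
// such as those returned by get_headers(). Returns null when it is absent.
function content_length_from_headers(array $headerLines): ?int
{
    $length = null;
    foreach ($headerLines as $line) {
        if (preg_match('/^Content-Length:\s*(\d+)/i', $line, $m)) {
            // Keep the last match so the value after any redirects wins.
            $length = (int) $m[1];
        }
    }
    return $length;
}

// Hypothetical wrapper: send a HEAD request so no image body is downloaded.
function remote_content_length(string $url): ?int
{
    $context = stream_context_create([
        'http' => ['method' => 'HEAD', 'timeout' => 5],
    ]);
    $headers = @get_headers($url, false, $context);
    return $headers === false ? null : content_length_from_headers($headers);
}
```

In the loop you could then skip any image whose reported size is below some threshold of your choosing and only call getimagesize on the rest. It is still one request per image, so the gain depends on how many images get filtered out.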
I'm not a PHP guy, but it looks to me like you're going out to the web to get the file twice...
First using this:
$html = file_get_html($url);
Then again using this:
$file_string = file_get_contents($url);
So if each hit takes a couple of seconds, you might be able to reduce your timing by finding a way to cut this down to a single web-hit.
Either that, or I'm blind. Which is a real possibility!
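For what it's worth, a sketch of the single-fetch idea: download the page once with file_get_contents and pull both the title and the image sources out of that one string. To keep the example self-contained it uses PHP's built-in DOMDocument rather than simple_html_dom; if you stay with simple_html_dom, its str_get_html($file_string) should play the same role as file_get_html($url) without the second download. The function name extract_title_and_images is made up for this example.

```php
<?php
// Parse an already-downloaded HTML string once and return its title
// and the src attribute of every <img> that has one.
function extract_title_and_images(string $htmlString): array
{
    $doc = new DOMDocument();
    // Real-world pages are rarely valid HTML, so collect libxml warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($htmlString);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $titleNode = $doc->getElementsByTagName('title')->item(0);
    $title = $titleNode ? trim($titleNode->textContent) : '';

    $images = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        if ($src !== '') {
            $images[] = $src;
        }
    }
    return ['title' => $title, 'images' => $images];
}

// Usage: one network round trip instead of two.
// $file_string = file_get_contents($url);
// $page = extract_title_and_images($file_string);
```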
Related
I have the following PHP code: When it is commented out (as it is now) from just after the comment // ... more stuff in here to the end of that comment block, my page renders html from that 2nd php block as expected as in this screenshot.
If I uncomment that block it renders like the screenshot after the code below which is not what I want (I want the rendered html as in the first screenshot). The only thing not showing in my comment block is a function call that curls a webpage, parses html with DomXNode types of things, and returns an array with 3 elements. How can I get the original rendering of the html back and what am I possibly doing that is ruining that for me? I tried echo instead of print and that makes no difference.
I honestly did search for the answer on here and found lots of pages describing how to do just the opposite of what I want, so please be gentle with me. I was surprised that I couldn't find a similar question, and I know there has to be an easy answer here. Thanks!
<?php
// ... more stuff in here
/*
include("../../includes/curl_fx.php");
if ($doAppend === "parcel") {
$lines = explode(PHP_EOL, $Data);
foreach($lines as $line) {
if(strpos($line, "http") > 0) {
$start = stripos(strval($line), "http");
$fullLength = strlen($line);
$urlLength = ($fullLength - $start);
$fullUrl = substr($line, $start, $urlLength);
$arraySDAT = getSDAT($fullUrl);
$line .= ", " . $arraySDAT[0] . ", " . $arraySDAT[1] . ", " . $arraySDAT[2] . "\n";
fwrite($Handle, $line);
}
}
}
*/
?>
<?php
if ($DataAdded === true) {
print "<h2>YourFile.txt</h2>Data has been added.<br />Close this window or tab to return to the web map.<br />";
} else {
print "Data may not have been added. Check the file.<br />";
}
fclose($Handle);
print $doAppendAnswer;
print "<br />";
?>
EDIT: Here is the function.
<?php
function getSDAT ($fullUrl="") {
$ch = curl_init($fullUrl);
if (! $ch) {
die( "Cannot allocate a new PHP-CURL handle" );
}
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$data = curl_exec($ch);
header("Content-type: text");
curl_close($ch);
libxml_use_internal_errors(true);
libxml_clear_errors();
$doc = new DOMDocument();
$doc->loadHTML($data); // loadHTML is an instance method; calling it statically fails on PHP 8
$xpath = new DOMXPath($doc);
$ownName1query = '//table/tr/td/span[@id="MainContent_MainContent_cphMainContentArea_ucSearchType_wzrdRealPropertySearch_ucDetailsSearch_dlstDetaisSearch_lblOwnerName_0"][@class="text"]';
$ownName2query = '//table/tr/td/span[@id="MainContent_MainContent_cphMainContentArea_ucSearchType_wzrdRealPropertySearch_ucDetailsSearch_dlstDetaisSearch_lblOwnerName2_0"][@class="text"]';
$ownAddrquery = '//table/tr/td/span[@id="MainContent_MainContent_cphMainContentArea_ucSearchType_wzrdRealPropertySearch_ucDetailsSearch_dlstDetaisSearch_lblMailingAddress_0"][@class="text"]';
$entries = $xpath->query($ownName1query);
foreach($entries as $entry) {
$ownname1 = $entry->nodeValue;
}
$entries = $xpath->query($ownName2query);
foreach($entries as $entry) {
$ownname2 = $entry->nodeValue;
}
$entries = $xpath->query($ownAddrquery);
$pattern = '#<br\s*/?>#i';
$replacement = ", ";
$i=0;
foreach($entries as $entry) {
$ownAddr = $entry->nodeValue;
if(!$entry->childNodes == 0) {
$ownAddr = $doc->saveHTML($entry);
}
$ownAddr2 = preg_replace($pattern, $replacement, $ownAddr, 15, $count); // replace <br/> with a comma
$ownAddr3 = strip_tags($ownAddr2);
}
return array($ownname1, $ownname2, $ownAddr3);
}
Your problem is:
header("Content-type: text");
Just remove that. Why is it there?
As mentioned, it's the header that is causing the problem. header() sends a raw HTTP response header, which tells the browser what type of content the current document is and how it should be handled. It is sent ahead of the page body; it is not the same thing as the <head>...</head> part of an HTML document, although a <meta http-equiv> tag there can mimic some of it. You can use headers for declaring the content type, controlling the cache, redirecting, and so on.
When you use header("Content-type: text"), you are declaring that the content of the current document, "yourdocument.php", is text rather than the default, which is HTML, so the browser no longer renders your markup as a page.
header("Content-type: text/html");
echo "<html>This would make mypage.php behave as an HTML</html>";
// This is usually unnecessary since text/html is already the default header
header("Content-type: text/javascript");
echo "this would make mypage.php behave as a javascript";
header("Content-type: text/css");
echo "this would make mypage.php behave as a CSS";
header('Content-type: image/jpeg');
readfile("source/to/my/file.jpg");
// this would make mypage.php display file.jpg and act as a jpg
header('Content-type: image/png');
readfile("source/to/my/file.png");
// this would make mypage.php display file.png and act as a png
header('Content-type: image/gif');
readfile("source/to/my/file.gif");
// this would make mypage.php display file.gif and act as a gif
header('Content-type: image/x-icon');
readfile("source/to/my/file.ico");
// this would make mypage.php display file.ico and act as an icon
header('Content-type: image/x-win-bitmap');
readfile("source/to/my/file.cur");
// this would make mypage.php display file.cur and act as a cursor
I'm working on a project and it's something new for me. I need to fetch RSS content from websites and display the description, title and images (thumbnails). I've noticed that some feeds provide thumbnails in an enclosure tag and others don't. I have the code for both cases, but I need to understand how I can create a conditional like:
If the rss returns enclosure image { Do something }
Else { get the common thumb }
Here is the code that grabs the images:
ENCLOSURE TAG IMAGE:
if ($enclosure = $block->get_enclosure())
{
echo "<img src=\"" . $enclosure->get_link() . "\">";
}
NOT ENCLOSURE:
if ($enclosure = $block->get_enclosure())
{
echo '<img src="'.$enclosure->get_thumbnail().'" title="'.$block->get_title().'" width="200" height="200">';
}
=================================================================================================
PS: The two snippets are almost the same; the only difference is get_thumbnail versus get_link.
Is there a way i can create a conditional to use the correct code and always shows the thumbnail?
Thanks everyone in advance!
EDITED
Here is the full code i have right now:
include_once(ABSPATH . WPINC . '/feed.php');
if(function_exists('fetch_feed')) {
$feed = fetch_feed('http://feeds.bbci.co.uk/news/world/africa/rss.xml'); // this is the external website's RSS feed URL
if (!is_wp_error($feed)) : $feed->init();
$feed->set_output_encoding('UTF-8'); // this is the encoding parameter, and can be left unchanged in almost every case
$feed->handle_content_type(); // this double-checks the encoding type
$feed->set_cache_duration(21600); // 21,600 seconds is six hours
$feed->handle_content_type();
$limit = $feed->get_item_quantity(18); // fetches the 18 most recent RSS feed stories
$items = $feed->get_items(0, $limit); // this sets the limit and array for parsing the feed
endif;
}
$blocks = array_slice($items, 0, 3); // Items zero through two will be displayed here
foreach ($blocks as $block) {
//echo $block->get_date("m d Y");
echo '<div class="single">';
if ($enclosure = $block->get_enclosure())
{
echo '<img class="image_post" src="'.$enclosure->get_link().'" title="'.$block->get_title().'" width="150" height="100">';
}
echo '<div class="description">';
echo '<h3>'. $block->get_title() .'</h3>';
echo '<p>'.$block->get_description().'</p>';
echo '</div>';
echo '<div class="clear"></div>';
echo '</div>';
}
And here are the XML pieces with 2 different tags for images:
Using Thumbnails: view-source:http://feeds.bbci.co.uk/news/world/africa/rss.xml
Using Enclosure: http://feeds.news24.com/articles/news24/SouthAfrica/rss
Is there a way i can create a conditional to use the correct code and always shows the thumbnail?
Sure there is. You haven't said in your question what is blocking you, so I have to guess at the reason, and I can imagine several.
Is the reason a decision with more than two alternatives?
Your code already handles the scenario of a feed item having either no image or an image:
if ($enclosure = $block->get_enclosure())
{
echo '<img class="image_post" src="'.$enclosure->get_link().'" title="'.$block->get_title().'" width="150" height="100">';
}
With your current scenario there is only one additional alternative, which makes three: the enclosure may carry a thumbnail rather than a link:
No image (no enclosure)
Image from link (enclosure with link)
Image from thumbnail (enclosure with thumbnail)
And then you don't know how to turn that into a decision. This is basically what else-if is for:
if (!$enclosure = $block->get_enclosure())
{
echo "no enclosure: ", "-/-", "\n";
} elseif ($enclosure->get_link()) {
echo "enclosure link: ", $enclosure->get_link(), "\n";
} elseif ($enclosure->get_thumbnail()) {
echo "enclosure thumbnail: ", $enclosure->get_thumbnail(), "\n";
}
This is basically then doing the output based on that. However if you assign the image URL to a variable, you can decide on the output later on:
$image = NULL;
if (!$enclosure = $block->get_enclosure())
{
// nothing to do
} elseif ($enclosure->get_link()) {
$image = $enclosure->get_link();
} elseif ($enclosure->get_thumbnail()) {
$image = $enclosure->get_thumbnail();
}
if (isset($image)) {
// display image
}
And if you then move this more or less complex decision into a function of its own, it becomes even easier to read:
$image = feed_item_get_image($block);
if (isset($image)) {
// display image
}
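One possible shape for such a function, as a sketch only. The name feed_item_get_image is just the one used in the call above; it is not part of SimplePie or WordPress. It assumes $block is a feed item with get_enclosure(), and that the enclosure offers get_link() and get_thumbnail() as in the snippets above.

```php
<?php
// Hypothetical helper: return the image URL for a feed item, or null if none.
// Order of preference mirrors the else-if chain above: link first, then thumbnail.
function feed_item_get_image($block)
{
    $enclosure = $block->get_enclosure();
    if (!$enclosure) {
        return null; // no enclosure at all
    }
    if ($enclosure->get_link()) {
        return $enclosure->get_link();
    }
    if ($enclosure->get_thumbnail()) {
        return $enclosure->get_thumbnail();
    }
    return null; // enclosure present, but no usable image URL
}
```

If you would rather prefer the thumbnail whenever one exists, swap the two inner checks.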
This works quite well until the decision becomes even more complex, but that would go beyond the scope of an answer on Stack Overflow.
I posted this question last night but I worded it wrong and didn't explain correctly. I am trying to cache googlemaps geocoding results for use in a firefighting mapping thing I have made and I am getting close to google's limits hence the need to cache the results. Help!
The code below kind of works; however, it creates a new SQL record each time it runs, regardless of whether the address is already in the database. It seems to call Google only once and then save the data, and it then loads the data from the database OK, so the rest looks like it works, but I just can't see where I am going wrong. As I said, it creates a new record every time the code runs, so I would end up with a database full of identical records. My brain hurts, but I really need to get this working so I can continue to help fire trucks get to fires. :)
Oh, I am using the geocoding results on googlemaps (as permitted by the T&C) and this code is really only to get it working. I hope that makes sense and I hope someone can help... Thanks :)
<?php
$jobadd = "1000 BURWOOD HWY, BURWOOD";
// connect to the database
include('connect-db.php');
// get results from database and find needle
$result = mysql_query("SELECT * FROM geocache")
or die(mysql_error());
while($row = mysql_fetch_array( $result ))
{
$needle = '' . $row['address'] . '';
if (strpos($jobadd,$needle) !== false)
{
$status = "CACHED";
$latitude = '' . $row['latitude'] . '';
$longitude = '' . $row['longitude'] . '';
}
else
{
$status = "GOOGLE";
$address2 = "$jobadd, Victoria, Australia";
define("MAPS_HOST", "maps.google.com");
$base_url = "http://" . MAPS_HOST . "/maps/api/geocode/xml";
$request_url = $base_url . "?address=" . urlencode($address2) ."&sensor=false";
$xml = new SimpleXMLElement(file_get_contents($request_url));
$latitude = $xml->result->geometry->location->lat;
$longitude = $xml->result->geometry->location->lng;
// save the data to the database
mysql_query("INSERT geocache SET address='$jobadd', latitude='$latitude', longitude='$longitude' ")
or die(mysql_error());
}
}
echo $status;
echo '<BR>';
echo $jobadd;
echo '<BR>';
echo 'LAT:';
echo $latitude;
echo '<BR>';
echo 'LON:';
echo $longitude;
?>
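Have you tried outputting the result of strpos? The issue looks to be around the if condition: the else branch sits inside the while loop, so it runs once for every row that does not match, and each of those runs does a Google lookup and an INSERT. What do the records look like in the DB?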
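One way to avoid the duplicate rows, sketched under some assumptions: ask the database for the one address you need with a WHERE clause instead of looping over every row, and only geocode and INSERT when nothing comes back. The sketch uses PDO with prepared statements rather than the mysql_* functions (which no longer exist in PHP 7 and later). The function name cached_geocode and the $geocode callable are made up for this example; the callable stands in for the Google request and is assumed to return [latitude, longitude]. The table and column names are taken from the question.

```php
<?php
// Hypothetical helper: look the address up first, geocode only on a cache miss.
function cached_geocode(PDO $db, string $address, callable $geocode): array
{
    // 1. Ask the database for this exact address only.
    $select = $db->prepare('SELECT latitude, longitude FROM geocache WHERE address = ?');
    $select->execute([$address]);
    $row = $select->fetch(PDO::FETCH_ASSOC);
    if ($row !== false) {
        return [
            'status'    => 'CACHED',
            'latitude'  => (float) $row['latitude'],
            'longitude' => (float) $row['longitude'],
        ];
    }

    // 2. Cache miss: call the geocoder once and store the result.
    [$lat, $lng] = $geocode($address);
    $insert = $db->prepare('INSERT INTO geocache (address, latitude, longitude) VALUES (?, ?, ?)');
    $insert->execute([$address, $lat, $lng]);

    return [
        'status'    => 'GOOGLE',
        'latitude'  => (float) $lat,
        'longitude' => (float) $lng,
    ];
}
```

This matches on the exact address string, whereas the original used strpos for a substring match, so addresses would need to be stored in the same form they are looked up in. A unique index on the address column would also guard against duplicates at the database level.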
I have this piece of code which lets me retrieve information from a link. Now it says "failed to open stream". Here is the code:
Thanks!
$b = time ();
$date1 =date( "Y-m-d;h:i:s" , mktime(date("h")+6, date("i"), date("s"), date("m") , date("d"), date("Y")));
$str_time = "";
$str_msg = "";
$str_from = "";
$str_zip = "";
echo file_get_contents('href="http://testext.i-movo.com/api/receivesms.aspx?".$str_from."".$str_zip."".$phone."".$str_time."".$date1."".$str_msg.""');
}
This:
echo file_get_contents('href="http://testext.i-movo.com/api/receivesms.aspx?".$str_from."".$str_zip."".$phone."".$str_time."".$date1."".$str_msg.""');
Should be this:
echo file_get_contents("http://testext.i-movo.com/api/receivesms.aspx?".$str_from.$str_zip.$phone.$str_time.$date1.$str_msg);
Please read the documentation for file_get_contents.
The example there says:
$homepage = file_get_contents('http://www.example.com/');
echo $homepage;
You are using something like:
$homepage = file_get_contents('href="http://www.example.com/');
So what is wrong? The string you pass has to be the URL itself, without the href=" prefix.
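To make the difference concrete, here is a small sketch. Inside single quotes PHP evaluates nothing, so the dots and variable names end up in the string as literal text, and the leading href=" means the whole thing is treated as a local file name that does not exist. The value of $str_from and the helper name fetch_or_null are made up for illustration.

```php
<?php
$str_from = 'from=123'; // hypothetical value, for illustration only

// Single-quoted: one literal string, the "concatenation" inside is never run.
$literal = 'href="http://www.example.com/?".$str_from.""';

// Double-quoted pieces joined with the . operator: a real URL.
$built = "http://www.example.com/?" . $str_from;

// Hypothetical wrapper: file_get_contents returns false on failure,
// so check for that before echoing the result.
function fetch_or_null(string $url): ?string
{
    $body = @file_get_contents($url);
    return $body === false ? null : $body;
}
```

Echoing $literal and $built side by side shows the problem immediately: the first still contains the text $str_from, the second contains from=123.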
The script I am using 'gets' an HTML page and parses it, showing only the .jpg images within, but I need to make some modifications, and when I do it simply fails...
This works:
include('simple_html_dom.php');
function getUrlAddress() {
$url = $_SERVER['HTTPS'] == 'on' ? 'https' : 'http';
return $url .'://'.$_SERVER['HTTP_HOST'].$_SERVER['REQUEST_URI'];
}
$html = file_get_html($url);
foreach($html->find('img[src$=jpg]') as $e)
echo '<img src='.$e->src .'><br>';
However, there are some problems... I only want to show images over a certain size, plus some sites do not put the full URL in the img tag, so I need to try to get around that too. So I have done the following:
include('simple_html_dom.php');
function getUrlAddress() {
$url = $_SERVER['HTTPS'] == 'on' ? 'https' : 'http';
return $url .'://'.$_SERVER['HTTP_HOST'].$_SERVER['REQUEST_URI'];
}
$html = file_get_html($url);
foreach($html->find('img[src$=jpg]') as $e)
$image = $e->src;
// check to see if src has domain
if (preg_match("/http/", $e->src)) {
$image = $image;
} else {
$parts = explode("/",$url);
$image = $parts['0']."//".$parts[1].$parts[2].$e->src;
}
$size = getimagesize($image);
echo "<br /><br />size is {$size[0]}";
echo '<img src='.$image.'><br>';
This works, but only returns the first image.
On the example link below there are 5 images, which the first code finds but cannot display, as the src is without the leading domain.
Example link as mentioned above
Is there a better way to do this? And why does the loop fail?
You seem to be missing a {:
foreach($html->find('img[src$=jpg]') as $e) {
You forgot your brackets:
foreach($html->find('img[src$=jpg]') as $e){
$image = $e->src;
// check to see if src has domain
if (preg_match("/http/", $e->src)) { $image = $image; }
else {
$parts = explode("/",$url);
$image = $parts['0']."//".$parts[1].$parts[2].$e->src;
}
$size = getimagesize($image);
echo "<br /><br />size is {$size[0]}";
echo '<img src='.$image.'><br>';
}
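On the "is there a better way" part: the explode approach only covers a src that starts with a slash. A slightly more general sketch using parse_url is below. The function name absolutize_src is made up for this example, and it deliberately does not handle ../ segments or a <base> tag, so treat it as a starting point rather than a complete resolver.

```php
<?php
// Hypothetical helper: turn an <img> src into an absolute URL,
// given the URL of the page it was found on.
function absolutize_src(string $src, string $pageUrl): string
{
    // Already absolute: starts with http:// or https://
    if (preg_match('#^https?://#i', $src)) {
        return $src;
    }

    $base   = parse_url($pageUrl);
    $scheme = $base['scheme'] ?? 'http';
    $host   = $base['host'] ?? '';

    // Protocol-relative: //cdn.example.com/a.jpg
    if (strpos($src, '//') === 0) {
        return $scheme . ':' . $src;
    }

    $origin = $scheme . '://' . $host . (isset($base['port']) ? ':' . $base['port'] : '');

    // Root-relative: /images/a.jpg
    if (strpos($src, '/') === 0) {
        return $origin . $src;
    }

    // Path-relative: resolve against the directory of the page.
    $path = $base['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $src;
}
```

Inside the loop this would replace the if/else block, e.g. $image = absolutize_src($e->src, $url);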