How to get images using file_get_contents as array - php

I have the following problem with getting images as an array.
In this code I'm trying to check whether images exist for the search term Test 1 - if yes, display them; if not, try Test 2 and stop there. The current code works, but it is super slow.
The if (sizeof($matches[1]) > 3) { check is there because the first 3 matches on the crawled website are sometimes advertisements, so this is my safeguard for skipping them.
My question is: how can I speed up the code below so it reaches the if (sizeof($matches[1]) > 3) { check faster? I believe this is what makes the code so slow, because the array may contain up to 1000 images.
$get_search = 'Test 1';
$ch_foreach = 0;
$html = file_get_contents('https://www.everypixel.com/search?q='.$get_search.'&is_id=1&st=free');
preg_match_all('|<img.*?src=[\'"](.*?)[\'"].*?>|i', $html, $matches);
if (sizeof($matches[1]) > 3) {
    $ch_foreach = 1;
}
if ($ch_foreach == 0) {
    $get_search = 'Test 2';
    $html = file_get_contents('https://www.everypixel.com/search?q='.$get_search.'&is_id=1&st=free');
    preg_match_all('|<img.*?src=[\'"](.*?)[\'"].*?>|i', $html, $matches);
    if (sizeof($matches[1]) > 3) {
        $ch_foreach = 1;
    }
}
$tmp = 0;
foreach ($matches[1] as $match) {
    if ($tmp++ < 20) {
        if (@getimagesize($match)) {
            // display image
            echo $match;
        }
    }
}

$html = file_get_contents('https://www.everypixel.com/search?q='.$get_search.'&is_id=1&st=free');
Unless the www.everypixel.com server is on the same LAN (in which case the compression overhead may cost more than transferring the page uncompressed), curl with CURLOPT_ENCODING should fetch this faster than file_get_contents; an empty string for CURLOPT_ENCODING tells curl to accept any compression the server supports. Even on the same LAN, curl should still be faster, because file_get_contents keeps reading until the server closes the connection, while curl stops as soon as Content-Length bytes have been read, which is faster than waiting for the server to close the socket. So do this instead:
$ch = curl_init('https://www.everypixel.com/search?q='.$get_search.'&is_id=1&st=free');
curl_setopt_array($ch, array(CURLOPT_ENCODING => '', CURLOPT_RETURNTRANSFER => 1));
$html = curl_exec($ch);
About your regex:
preg_match_all('|<img.*?src=[\'"](.*?)[\'"].*?>|i', $html, $matches);
DOMDocument with getElementsByTagName("img") and getAttribute("src") should be faster than using your regex, so do this instead:
$domd = new DOMDocument();
@$domd->loadHTML($html); // suppress warnings from imperfect real-world HTML
$urls = [];
foreach ($domd->getElementsByTagName("img") as $img) {
    $url = $img->getAttribute("src");
    if (!empty($url)) {
        $urls[] = $url;
    }
}
And probably the slowest part of your entire code: the @getimagesize($match) call inside a loop that may contain over 1000 URLs. Every call to getimagesize() with a URL makes PHP download the whole image, and it uses the same download mechanism as file_get_contents, meaning it suffers from the same Content-Length issue that makes file_get_contents slow. In addition, all the images are downloaded sequentially; downloading them in parallel should be much faster, which can be done with the curl_multi API. Doing that properly is a complex task and I won't write a full implementation here, but I can point you to an example: https://stackoverflow.com/a/54717579/1067003
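If you do go the curl_multi route, a minimal sketch could look like this. This is my own illustration, not a drop-in replacement: it assumes $urls is the array built by the DOMDocument loop above, keeps the first-20 limit from your $tmp++ < 20 check, and validates each download locally with getimagesizefromstring() instead of calling getimagesize() on a URL.
// Download up to 20 images in parallel, then validate them locally.
$mh = curl_multi_init();
$handles = [];
foreach (array_slice($urls, 0, 20) as $url) {
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => 1,
        CURLOPT_ENCODING       => '',
        CURLOPT_FOLLOWLOCATION => 1,
        CURLOPT_TIMEOUT        => 10,
    ]);
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}
// Run all transfers until every handle has finished.
do {
    $status = curl_multi_exec($mh, $active);
    if ($active) {
        curl_multi_select($mh); // wait for activity instead of busy-looping
    }
} while ($active && $status == CURLM_OK);
foreach ($handles as $url => $ch) {
    $body = curl_multi_getcontent($ch);
    if ($body !== '' && @getimagesizefromstring($body)) {
        // display image
        echo $url;
    }
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);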

Related

preg_match misses some ids while fetching data with cURL

For learning purposes, I'm trying to fetch data from the Steam Store: if the image game_header_image_full exists on a page, I've reached a game. Both alternatives sort of work, but there's a catch. One is really slow, and the other seems to miss some data and therefore doesn't write those URLs to the text file.
For some reason, Simple HTML DOM managed to catch 9 URLs, whilst the second one (cURL) only caught 8 URLs with preg_match.
Question 1.
Is $reg written in such a way that $html->find('img.game_header_image_full') would match something my preg_match misses? Or is the problem something else?
Question 2.
Am I doing things correctly here? I'm planning to go with the cURL alternative, but can I make it faster somehow?
Simple HTML DOM Parser (Time to search 100 ids: 1 min 39 s. Returned: 9 URLs.)
<?php
include('simple_html_dom.php');
$i = 0;
$times_to_run = 100;
set_time_limit(0);
while ($i++ < $times_to_run) {
    // Find target image
    $url = "http://store.steampowered.com/app/".$i;
    $html = file_get_html($url);
    $element = $html->find('img.game_header_image_full');
    if ($i == $times_to_run) {
        echo "Success!";
    }
    foreach ($element as $key => $value) {
        // Check if image was found
        if (strpos($value, 'img') == false) {
            // Do nothing, repeat loop with $i++;
        } else {
            // Add (don't overwrite) to file steam.txt
            file_put_contents('steam.txt', $url.PHP_EOL, FILE_APPEND);
        }
    }
}
?>
vs. the cURL alternative (Time to search 100 ids: 34 s. Returned: 8 URLs.)
<?php
$i = 0;
$times_to_run = 100;
set_time_limit(0);
while ($i++ < $times_to_run) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, 'http://store.steampowered.com/app/'.$i);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $content = curl_exec($ch);
    curl_close($ch); // release the handle each iteration
    $url = "http://store.steampowered.com/app/".$i;
    $reg = "/<\\s*img\\s+[^>]*class=['\"][^'\"]*game_header_image_full[^'\"]*['\"]/i";
    if (preg_match($reg, $content)) {
        file_put_contents('steam.txt', $url.PHP_EOL, FILE_APPEND);
    }
}
?>
Well, you shouldn't use regex on HTML. It mostly works, but when it doesn't, you have to go through hundreds of pages, figure out which one is failing and why, correct the regex, and then hope and pray that nothing like that ever happens again. Spoiler alert: it will.
Long story short, read this funny answer: RegEx match open tags except XHTML self-contained tags
Don't use regex to parse HTML. Use an HTML parser: a complicated piece of machinery that doesn't rely on regex and is reliable (as long as the HTML is valid). You are already using one in the first example. Yes, it's slow, because it does more than just search for a string within a document, but it's reliable. You can also play with other implementations, especially the native ones, like http://php.net/manual/en/domdocument.loadhtml.php
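For instance, here is a hedged sketch of how the native DOMDocument/DOMXPath route could replace the preg_match in your cURL version (assuming $content holds the HTML returned by curl_exec() and $url is the store URL, as in your second snippet):
// Detect the header image with a real HTML parser instead of a regex.
$dom = new DOMDocument();
@$dom->loadHTML($content); // suppress warnings from imperfect real-world HTML
$xpath = new DOMXPath($dom);
// Match <img> elements whose class attribute contains game_header_image_full.
$nodes = $xpath->query('//img[contains(concat(" ", normalize-space(@class), " "), " game_header_image_full ")]');
if ($nodes->length > 0) {
    file_put_contents('steam.txt', $url . PHP_EOL, FILE_APPEND);
}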

Recursive Web Link Search in PHP

I am trying to do a recursive web link search using PHP, but the code doesn't seem to work: I get a timeout error.
function linksearch($url)
{
    $text = file_get_contents($url);
    if (!empty($text))
    {
        $res1 = preg_match_all("/\b(?:(?:https?|ftp):\/\/|www\.)[-a-z0-9+&@#\/%?=~_|!:,.;]*[-a-z0-9+&@#\/%=~_|]/i",
            $text,
            $matches);
        if ($res1)
        {
            foreach (array_unique($matches[0]) as $link)
            {
                linksearch($url);
            }
        }
        else
        {
            // echo "No links found.";
        }
    }
}
You have a never-ending loop there, because you call your function again inside itself with the same argument:
linksearch($url);
You need a condition that terminates the recursion. Proper recursion means the input changes on each call and eventually hits a stopping condition; here the input is the same $url every single time.
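A rough sketch of what a terminating version could look like (the depth limit and recursing on $link instead of $url are the essential changes; the $visited list is an extra assumption of mine to avoid revisiting pages):
function linksearch($url, $depth = 2, array &$visited = [])
{
    if ($depth <= 0 || in_array($url, $visited)) {
        return; // termination condition
    }
    $visited[] = $url;
    $text = @file_get_contents($url);
    if (empty($text)) {
        return;
    }
    preg_match_all("/\b(?:(?:https?|ftp):\/\/|www\.)[-a-z0-9+&@#\/%?=~_|!:,.;]*[-a-z0-9+&@#\/%=~_|]/i",
        $text,
        $matches);
    foreach (array_unique($matches[0]) as $link) {
        linksearch($link, $depth - 1, $visited); // recurse on the found link, not on $url
    }
}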
Why don't you first save the page locally and tune your script against a local test file, instead of making a remote call on every run? You won't get a timeout error from the evaluation code that follows file_get_contents(), unless the HTML file is humongously large.
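For example (a tiny sketch of that idea; test.html is just a hypothetical file name):
// Run once to snapshot the remote page:
file_put_contents('test.html', file_get_contents($url));

// Then, while tuning the parsing code, work against the local copy:
$text = file_get_contents('test.html');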

Can I retry file_get_contents() until it opens a stream?

I am using PHP to get the contents of an API. The problem is, sometimes that API just sends back a 502 Bad Gateway error and the PHP code can’t parse the JSON and set the variables correctly. Is there some way I can keep trying until it works?
This is not an easy question because PHP is a synchronous language by default.
You could do this:
$a = false;
$i = 0;
while ($a == false && $i < 10)
{
    $a = file_get_contents($path);
    $i++;
    usleep(10);
}
$result = json_decode($a);
$result = json_decode($a);
Adding a usleep() call keeps your script from hammering the API in a tight loop each time it is unavailable (note that usleep() takes microseconds, so usleep(10) only pauses for 10 µs; something like usleep(500000), half a second, is a more realistic pause). And the function will give up after 10 attempts, which prevents it from freezing completely in case of a longer outage.
Since you didn't provide any code it's kind of hard to help you. But here is one way to do it.
$data = null;
while (!$data) {
    $json = file_get_contents($url);
    $data = json_decode($json); // Returns null if the JSON is not valid
}
// The loop won't stop until the JSON was valid and $data contains an object
var_dump($data);
I suggest you throw some sort of increment variable in there to stop attempting after X attempts.
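For example, a small sketch of that suggestion (the limit of 5 attempts is arbitrary):
$data = null;
$attempts = 0;
while (!$data && $attempts < 5) {
    $json = file_get_contents($url);
    $data = json_decode($json); // stays null while the response is not valid JSON
    $attempts++;
}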
Based on your comment, here is what I would do:
1) You have a PHP script that makes the API call and, if successful, records the price and when that price was acquired.
2) You put that script in a cronjob/scheduled task that runs every 10 minutes.
3) Your PHP view pulls the most recent price from the database and uses it for whatever display/calculations it needs. If pertinent, also show the date/time that price was captured.
The other answers suggest doing a loop. A combo approach probably works best here: in your script, put in a few retries in case the API is down for a short blip. If it's still not up after, say, a minute, use the old value until your next try.
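A rough sketch of such a cron script (everything specific here is hypothetical for illustration: the API endpoint, the price field, and the database, table, and column names):
// fetch_price.php - run from cron, e.g. every 10 minutes:
//   */10 * * * * php /path/to/fetch_price.php
// NOTE: the endpoint, credentials and table/column names below are hypothetical.
$json = @file_get_contents('https://api.example.com/price');
$data = json_decode($json);
if ($data !== null && isset($data->price)) {
    $pdo  = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
    $stmt = $pdo->prepare('INSERT INTO prices (price, captured_at) VALUES (?, NOW())');
    $stmt->execute([$data->price]);
}
// If the API answered with a 502 on this run, nothing is written and the view
// simply keeps showing the most recent stored price.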
A loop can solve this problem, but so can a recursive function like this one:
function file_get_contents_retry($url, $attemptsRemaining = 3) {
    $content = file_get_contents($url);
    $attemptsRemaining--;
    if (empty($content) && $attemptsRemaining > 0) {
        return file_get_contents_retry($url, $attemptsRemaining);
    }
    return $content;
}

// Usage:
$retryAttempts = 6; // Default is 3.
echo file_get_contents_retry("http://google.com", $retryAttempts);

How to speed up the execution of this PHP script

I have written a script for web scraping where I fetch each link from a page and then load that URL in the code. It works extremely slowly: about 50 seconds for the first output, and it takes an age to complete roughly 100 links. I don't understand why it is so slow. I am thinking about caching, but I don't know how that could help.
1) Page caching OR Opcode cache.
The code is:
public function searchForum() {
    global $wpdb;
    $sUrl = $this->getSearchUrl();
    $this->logToCrawler();
    $cid = $this->getCrawlId();
    $html = file_get_dom($sUrl);
    $c = 1;
    foreach ($html('div.gridBlobTitle a:first-child') as $element) {
        $post_page = file_get_dom($element->href);
        $post_meta = array();
        foreach ($post_page('table#mytable img:first-child') as $img) {
            if (isset($img->src)) {
                $post_meta['image_main'] = self::$forumurl.$img->src;
            } else {
                $post_meta['image_main'] = NULL;
            }
        }
        foreach ($post_page('table.preferences td:odd') as $elm) {
            $post_meta[] = strip_tags($elm->getInnerText());
            unset($elm);
        }
        /* Check if we can call getPlainText for the description fetch */
        $object = $post_page('td.collection', 2);
        $methodVariable = array($object, 'getPlainText');
        if (is_callable($methodVariable, true, $callable_name)) {
            $post_meta['description'] = utf8_encode($object->getPlainText());
        } else {
            $post_meta['description'] = NULL;
        }
        $methodVariable = array($object, 'getInnerText');
        if (is_callable($methodVariable, true, $callable_name)) {
            /* Get all the images we found */
            $rough_html = $object->getInnerText();
            preg_match_all("/<img .*?(?=src)src=\"([^\"]+)\"/si", $rough_html, $matches);
            $images = array_map('self::addUrlToItems', $matches[1]);
            $images = json_encode($images);
        }
        if ($post_meta[8] == 'WTB: Want To Buy') {
            $status = 'buy';
        } else {
            $status = 'sell';
        }
        $lastdate = strtotime(date('Y-m-d', strtotime("-1 month")));
        $listdate = strtotime(date('Y-m-d', strtotime($post_meta[9])));
        /* Check for date */
        if ($listdate >= $lastdate) {
            $wpdb->query("INSERT
                INTO tbl_scrubed_data SET
                keywords='".esc_sql($this->getForumSettings()->search_meta)."',
                url_to_post='".esc_sql($element->href)."',
                description='".esc_sql($post_meta['description'])."',
                date_captured=now(),crawl_id='".$cid."',
                image_main='".esc_sql($post_meta['image_main'])."',
                images='".esc_sql($images)."',brand='".esc_sql($post_meta[0])."',
                series='".esc_sql($post_meta[1])."',model='".esc_sql($post_meta[2])."',
                watch_condition='".esc_sql($post_meta[3])."',box='".esc_sql($post_meta[4])."',
                papers='".esc_sql($post_meta[5])."',year='".esc_sql($post_meta[6])."',case_size='".esc_sql($post_meta[7])."',status='".esc_sql($post_meta[8])."',listed='".esc_sql($post_meta[9])."',
                asking_price='".esc_sql($post_meta[10])."',retail_price='".esc_sql($post_meta[11])."',payment_info='".esc_sql($post_meta[12])."',forum_id='".$this->getForumSettings()->ID."'");
            unset($element, $post_page, $images);
        } /* END: Check for date */
    }
    $c++;
}
Note:
1) I am using the Ganon DOM Parser (https://code.google.com/p/ganon/wiki/AccesElements) for parsing the HTML.
2) Running on Windows XP with WAMP, MySQL 5.5, PHP 5.3, and 1 GB of RAM.
If you need more info, please ask in the comments. Thanks.
You need to figure out what parts of your program are being slow. There are two ways to do that.
1) Put in some print statements that print out the time in various places, so you can say "Hey, look, this took 5 seconds to go from here to here."
2) Use a profiler such as Xdebug, which analyzes your program while it runs, so you can see which parts of the code are slow.
Just by looking at a program you can't say "Oh, that's the slow part to speed up." Without knowing what's slow, you'll probably waste time speeding up parts that aren't the bottleneck.
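A minimal sketch of option 1), using microtime(true) around the two fetches in your code that are most likely to dominate (the labels are just placeholders):
// around the search-page fetch:
$t0 = microtime(true);
$html = file_get_dom($sUrl);
echo 'search page fetch: ' . round(microtime(true) - $t0, 2) . " s\n";

// inside the foreach, around each post-page fetch:
$t1 = microtime(true);
$post_page = file_get_dom($element->href);
echo 'post page fetch: ' . round(microtime(true) - $t1, 2) . " s\n";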

Modification to a code to merge two parts of it with similar characteristics

Below is a link crawler that gets the URLs of a page down to a given depth. At the end of it I added a regular expression to match all the e-mail addresses on the page that was just crawled. As you can see in the second part, it calls file_get_contents on the same page it just downloaded, meaning twice the execution time, bandwidth, etc.
The question is: how can I merge those two parts so they reuse the page downloaded the first time, to avoid fetching it again? Thank you.
function crawler($url, $depth = 2) {
    $dom = new DOMDocument('1.0');
    if (!$parts || !@$dom->loadHTMLFile($url)) {
        return;
    }
    ...
    // this is where the second part starts
    $text = file_get_contents($url);
    $res = preg_match_all("/[a-z0-9]+([_\\.-][a-z0-9]+)*@([a-z0-9]+([\.-][a-z0-9]+)*)+\\.[a-z]{2,}/i", $text, $matches);
}
Replace:
$text = file_get_contents($url);
with:
$text = $dom->saveHTML();
http://www.php.net/manual/en/domdocument.savehtml.php
Alternatively, in the first part of your function, you could save the HTML into a variable using file_get_contents, then pass it to $dom->loadHTML. That way you can then reuse the variable with your regex.
http://www.php.net/manual/en/domdocument.loadhtml.php
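A short sketch of that second option (only the way the document is loaded changes; the link-crawling code elided with "..." above stays as it is):
function crawler($url, $depth = 2) {
    $text = @file_get_contents($url);   // download the page once
    if (empty($text)) {
        return;
    }
    $dom = new DOMDocument('1.0');
    if (!@$dom->loadHTML($text)) {      // parse the string we already have
        return;
    }
    // ... the existing link-crawling part goes here, unchanged ...
    // reuse $text for the e-mail regex instead of downloading the page again
    preg_match_all("/[a-z0-9]+([_\\.-][a-z0-9]+)*@([a-z0-9]+([\.-][a-z0-9]+)*)+\\.[a-z]{2,}/i", $text, $matches);
}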
