preg_match misses some ids while fetching data with cURL - php

For learning purposes, I'm trying to fetch data from the Steam Store, where if the image game_header_image_full exists, I've reached a game. Both alternatives are sort of working, but there's a catch. One is really slow, and the other seems to miss some data and therefore not writing the URL's to a text file.
For some reason, Simple HTML DOM managed to catch 9 URL's, whilst the 2nd one (cURL) only caught 8 URL's with preg_match.
Question 1.
Is $reg formatted in a way that $html->find('img.game_header_image_full') would catch, but not my preg_match? Or is the problem something else?
Question 2.
Am I doing things correctly here? Planning to go for the cURL alternative, but can I make it faster somehow?
Simple HTML DOM Parser (Time to search 100 ids: 1 min, 39s. Returned: 9 URL.)
<?php
include('simple_html_dom.php');
$i = 0;
$times_to_run = 100;
set_time_limit(0);
while ($i++ < $times_to_run) {
// Find target image
$url = "http://store.steampowered.com/app/".$i;
$html = file_get_html($url);
$element = $html->find('img.game_header_image_full');
if($i == $times_to_run) {
echo "Success!";
}
foreach($element as $key => $value){
// Check if image was found
if (strpos($value,'img') == false) {
// Do nothing, repeat loop with $i++;
} else {
// Add (don't overwrite) to file steam.txt
file_put_contents('steam.txt', $url.PHP_EOL , FILE_APPEND);
}
}
}
?>
vs. the cURL alternative.. (Time to search 100 ids: 34s. Returned: 8 URL.)
<?php
$i = 0;
$times_to_run = 100;
set_time_limit(0);
while ($i++ < $times_to_run) {
$ch = curl_init();
curl_setopt( $ch, CURLOPT_URL, 'http://store.steampowered.com/app/'.$i);
curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true);
$content = curl_exec($ch);
$url = "http://store.steampowered.com/app/".$i;
$reg = "/<\\s*img\\s+[^>]*class=['\"][^'\"]*game_header_image_full[^'\"]*['\"]/i";
if(preg_match($reg, $content)) {
file_put_contents('steam.txt', $url.PHP_EOL , FILE_APPEND);
}
}
?>

Well you shouldn't use regex with HTML. It mostly works, but when it doesn't, you have to go through hundreds of pages and figuring out which one is the failing one, and why, and correct the regex, then hope and pray that in the future nothing like that will ever happen again. Spoiler alert: it will.
Long story short, read this funny answer: RegEx match open tags except XHTML self-contained tags
Don't use regex to parse HTML. Use HTML parsers, which are complicated algorithms that don't use regex, and are reliable (as long as the HTML is valid). You are using one already, in the first example. Yes, it's slow, because it does more than just searching for a string within a document. But it's reliable. You can also play with other implementations, especially the native ones, like http://php.net/manual/en/domdocument.loadhtml.php

Related

How to get images using file_get_contents as array

I have the following problem with getting images as array.
In this code I'm trying to check if images for search Test 1 exist - if yes, then display, if not then try with Test 2 and that's it. Current code can do it but is super slow.
This if (sizeof($matches[1]) > 3) { because this 3 sometimes contains advertisement on crawled website, so this is my secure how to skip it.
My question is how I can speed up code below to get if (sizeof($matches[1]) > 3) { faster? I believe that this makes code very slow, because this array may contain up to 1000 images
$get_search = 'Test 1';
$html = file_get_contents('https://www.everypixel.com/search?q='.$get_search.'&is_id=1&st=free');
preg_match_all('|<img.*?src=[\'"](.*?)[\'"].*?>|i', $html, $matches);
if (sizeof($matches[1]) > 3) {
$ch_foreach = 1;
}
if ($ch_foreach == 0) {
$get_search = 'Test 2';
$html = file_get_contents('https://www.everypixel.com/search?q='.$get_search.'&is_id=1&st=free');
preg_match_all('|<img.*?src=[\'"](.*?)[\'"].*?>|i', $html, $matches);
if (sizeof($matches[1]) > 3) {
$ch_foreach = 1;
}
}
foreach ($matches[1] as $match) if ($tmp++ < 20) {
if (#getimagesize($match)) {
// display image
echo $match;
}
}
$html = file_get_contents('https://www.everypixel.com/search?q='.$get_search.'&is_id=1&st=free');
unless the www.everypixel.com server is is on the same LAN (in which case compression overhead may be slower than transferring it in plain), curl with CURLOPT_ENCODING should do this faster than file_get_contents, and even if it is on the same lan, curl should be faster than file_get_contents because file_get_contents keeps reading until the server close the connection, but curl keeps reading until Content-Length bytes has been read, which is faster than waiting for a server to close a socket, so do this instead:
$ch=curl_init('https://www.everypixel.com/search?q='.$get_search.'&is_id=1&st=free');
curl_setopt_array($ch,array(CURLOPT_ENCODING=>'',CURLOPT_RETURNTRANSFER=>1));
$html=curl_exec($ch);
about your regex:
preg_match_all('|<img.*?src=[\'"](.*?)[\'"].*?>|i', $html, $matches);
DOMDocument with getElementsByTagName("img") and getAttribute("src") should be faster than using your regex, so do this instead:
$domd=#DOMDocument::loadHTML($html);
$urls=[];
foreach($domd->getElementsByTagName("img") as $img){
$url=$img->getAttribute("src");
if(!empty($url)){
$urls[]=$url;
}
}
and probably the slowest part of your entire code, the #getimagesize($match) inside a loop potentially containing over 1000 urls, every call to getimagesize() with an url makes php download the image, and it uses the file_get_contents method meaning it suffers from the same Content-Length issue that makes file_get_contents slow. in addition, all the images are downloaded sequentially, downloading them in parallel should be much faster, which can be done with the curl_multi api, but doing that is a complex task and i cba writing an example for you, but i can point you to an example: https://stackoverflow.com/a/54717579/1067003

Need help decoding JSON from Riot API with PHP

As a part of an assignment I am trying to pull some statistics from the Riot API (JSON data for League of Legends). So far I have managed to find summoner id (user id) based on summoner name, and I have filtered out the id's of said summoner's previous (20) games. However now I can't figure out how to get the right values from the JSON data. So this is when I'll show you my code I guess:
$matchIDs is an array of 20 integers (game IDs)
for ($i = 1; $i <= 1; $i++)
{
$this_match_data = get_match($matchIDs[$i], $server, $api);
$processed_data = json_decode($this_match_data, true);
var_dump($processed_data);
}
As shown above my for loop is set to one, as I'm just focusing on figuring out one before continuing with all 20. The above example is how I got the match IDs and the summoner IDs. I'll add those codes here for comparison:
for ($i = 0; $i <= 19; $i++)
{
$temp = $data['matches'][$i]['matchId'];
$matchIDs[$i] = json_decode($temp, true);
}
$data is the variable I get when I pull all the info from the JSON page, it's the same method I use to get $this_match_data in the first code block.
function match_list($summoner_id, $server, $api)
{
$summoner_enc = rawurlencode($summoner);
$summoner_lower = strtolower($summoner_enc);
$curl =curl_init('https://'.$server.'.api.pvp.net/api/lol/'.$server.'/v2.2/matchlist/by-summoner/'.$summoner_id.'?api_key='.$api.'');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
$result = curl_exec($curl);
curl_close($curl);
return $result;
}
Now to the root of the problem, This is where I put the data I get from the site, so you can see what I am working with. Now by using the following code I can get the first value in that file, the match ID.
echo $processed_data['matchId'];
But I can't seem to lock down any other information from this .json file. I've tried typing stuff like ['region'] instead of ['matchId'] with no luck as well as inserting index numbers like $processed_data[0], but nothing happens. This is just how I get the right info from the first examples and I am really lost here.
Ok, so I think I've figured it out myself. By adding this to the code I can print out the json file in a way more human-friendly way, and that should make it much easier to handle the data.
echo ("<pre>");
var_dump($processed_data);
echo ("</pre>");

Make a list of Items from steam Marketplace

I want to fill a database table with certain items from the Steam Marketplace, specifically at the moment, guns from CSGO. I can't seem to find any database or list already of all the gun names, skin names and skin qualities, which is what I want.
One way I thought of to do it is to get to the list of items I want, EG "Shotguns", and save each item on the page into the database, and go through each page of that search. EG:
http://steamcommunity.com/market/search?appid=730&q=shotgun#p1_default_desc
http://steamcommunity.com/market/search?appid=730&q=shotgun#p2_default_desc
Ect..
Firstly, I'm not exactly sure how I would do that, and secondly, I wanted to know if there would be an easier way.
I plan on using the names of items to later get the prices by substituting the names into this: http://steamcommunity.com/market/priceoverview/?currency=3&appid=730&market_hash_name=StatTrak%E2%84%A2%20P250%20%7C%20Steel%20Disruption%20%28Factory%20New%29
And updating the prices every hour or so by running that check for every item. (probably at least a few thousand..)
The general gist of what you need to do boils down to:
Identify the urls you need to parse. In your case you'll notice that the results are loaded via ajax. Right-click the page, click 'inspect element' and go to the network tab. You'll see that the actual url is: http://steamcommunity.com/market/search/render/?query=&start=<STARTVALUE>&count=<NUMBEROFRESULTS>&search_descriptions=0&sort_column=quantity&sort_dir=desc&appid=730&category_730_ItemSet%5B%5D=any&category_730_TournamentTeam%5B%5D=any&category_730_Weapon%5B%5D=any&category_730_Type%5B%5D=tag_CSGO_Type_Pistol&category_730_Type%5B%5D=tag_CSGO_Type_SMG&category_730_Type%5B%5D=tag_CSGO_Type_Rifle&category_730_Type%5B%5D=tag_CSGO_Type_SniperRifle&category_730_Type%5B%5D=tag_CSGO_Type_Shotgun&category_730_Type%5B%5D=tag_CSGO_Type_Machinegun&category_730_Type%5B%5D=tag_CSGO_Type_Knife
Identify what the response type is. In this case it is json, and the data we want is inside a html-snippet
Find the framework required to parse it. You can use json_decode(...) to decode the json string. This question will give more information how to parse html.
You can now feed these urls to a function that loads the page. You can use file_get_contents(...) or the curl library.
Enter the values you parse from the response into your database. Make sure that the script does not get killed when it runs for too long. This question will give you more information about that.
You can use the following as a framework. You'll have to figure the structure of the html yourself, and lookup a tutorial of the html parser and mysql library you want to use.
<?php
//Prevent this script from being killed. Please note that if this script never
//ends, you'll have to kill it manually
set_time_limit( 0 );
//The api does not allow for more than 100 results at a time
$start = 0;
$count = 100;
$maxresults = PHP_INT_MAX;
$baseurl = "http://steamcommunity.com/market/search/render/?query=&start=$1&count=$2&search_descriptions=0&sort_column=quantity&sort_dir=desc&appid=730&category_730_ItemSet%5B%5D=any&category_730_TournamentTeam%5B%5D=any&category_730_Weapon%5B%5D=any&category_730_Type%5B%5D=tag_CSGO_Type_Pistol&category_730_Type%5B%5D=tag_CSGO_Type_SMG&category_730_Type%5B%5D=tag_CSGO_Type_Rifle&category_730_Type%5B%5D=tag_CSGO_Type_SniperRifle&category_730_Type%5B%5D=tag_CSGO_Type_Shotgun&category_730_Type%5B%5D=tag_CSGO_Type_Machinegun&category_730_Type%5B%5D=tag_CSGO_Type_Knife";
while( $start < $maxresults ) {
//Constructing the next url
$url = str_replace( "$1", $start, $baseurl );
$url = str_replace( "$2", $count, $url );
//Doing the request
$ch = curl_init();
curl_setopt( $ch, CURLOPT_URL, $url );
curl_setopt( $ch, CURLOPT_RETURNTRANSFER, 1 );
$result = json_decode( curl_exec( $ch ), TRUE );
curl_close( $ch );
//Doing things with the result
//
//First let's see if everything went according to plan
if( $result == NULL || $result["success"] !== TRUE ) {
echo "Something went horribly wrong. Please edit the script to take this error into account and rerun it.";
exit( -1 );
}
//Bookkeeping for the next url we have to fetch
$count = $result["pagesize"];
$start += $count;
$maxresults = $result["total_count"];
//This is the html we have to parse
$html = $result["results_html"];
//Look up an example how to parse html, and how to get data from it
//Look up how to make a database connection and how to insert data into
//your database
}
echo "And we are done!";

Can I retry file_get_contents() until it opens a stream?

I am using PHP to get the contents of an API. The problem is, sometimes that API just sends back a 502 Bad Gateway error and the PHP code can’t parse the JSON and set the variables correctly. Is there some way I can keep trying until it works?
This is not an easy question because PHP is a synchronous language by default.
You could do this:
$a = false;
$i = 0;
while($a == false && $i < 10)
{
$a = file_get_contents($path);
$i++;
usleep(10);
}
$result = json_decode($a);
Adding usleep(10) allows your server not to get on his knees each time the API will be unavailable. And your function will give up after 10 attempts, which prevents it to freeze completely in case of long unavailability.
Since you didn't provide any code it's kind of hard to help you. But here is one way to do it.
$data = null;
while(!$data) {
$json = file_get_contents($url);
$data = json_decode($json); // Will return false if not valid JSON
}
// While loop won't stop until JSON was valid and $data contains an object
var_dump($data);
I suggest you throw some sort of increment variable in there to stop attempting after X scripts.
Based on your comment, here is what I would do:
You have a PHP script that makes the API call and, if successful, records the price and when that price was acquired
You put that script in a cronjob/scheduled task that runs every 10 minutes.
Your PHP view pulls the most recent price from the database and uses that for whatever display/calculations it needs. If pertinent, also show the date/time that price was captured
The other answers suggest doing a loop. A combo approach probably works best here: in your script, put in a few loops just in case the interface is down for a short blip. If it's not up after say a minute, use the old value until your next try.
A loop can solve this problem, but so can a recursive function like this one:
function file_get_contents_retry($url, $attemptsRemaining=3) {
$content = file_get_contents($url);
$attemptsRemaining--;
if( empty($content) && $attemptsRemaining > 0 ) {
return file_get_contents_retry($url, $attemptsRemaining);
}
return $content;
}
// Usage:
$retryAttempts = 6; // Default is 3.
echo file_get_contents_retry("http://google.com", $retryAttempts);

How to parse an HTML page using PHP?

Parsing HTML / JS codes to get info using PHP.
www.asos.com/Asos/Little-Asos-Union-Jack-T-Shirt/Prod/pgeproduct.aspx?iid=1273626
Take a look at this page, it's a clothes shop for kids. This is one of their items and I want to point out the size section. What we need to do here is to get all the sizes for this item and check whether the sizes are available or not. Right now all the sizes for this items are:
3-4 years
4-5 years
5-6 years
7-8 years
How can you say if the sizes are available or not?
Now take a look at this page first and check the sizes again:
www.asos.com/Ralph-Lauren/Ralph-Lauren-Long-Sleeve-Big-Horse-Stripe-Rugby-Top/Prod/pgeproduct.aspx?iid=1111751
This item has the following sizes:
12 months
18 months - Not Available
24 months
As you can see 18 months size is not available, it is indicated by the "Not Available" text next to the size.
What we need to do is go the page of an item, get the sizes and check the availability of each sizes. How can I do this in PHP?
EDIT:
Added a working code and a new problem to tackle.
Working code but it needs more work:
<?php
function getProductVariations($url) {
//Use CURL to get the raw HTML for the page
$ch = curl_init();
curl_setopt_array($ch,
array(
CURLOPT_RETURNTRANSFER=>true,
CURLOPT_HEADER => false,
CURLOPT_URL => $url
)
);
$raw_html = curl_exec($ch);
//If we get an invalid response back from the server fail
if ($raw_html===false) {
throw new Exception(curl_error($ch));
}
curl_close($ch);
//Find the variation JS declarations and extract them
$raw_variations = preg_match_all("/arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct\[[0-9]+\].*Array\((.*)\);/",$raw_html,$raw_matches);
//We are done with the Raw HTML now
unset($raw_html);
//Check that we got some results back
if (is_array($raw_matches) && isset($raw_matches[1]) && sizeof($raw_matches[1])==$raw_variations && $raw_variations>0) {
//This is where the matches will go
$matches = array();
//Go through the results of the bracketed expression and convert them to a PHP assoc array
foreach($raw_matches[1] as $match) {
//As they are declared in javascript we can use json_decode to process them nicely, they just need wrapping
$proc=json_decode("[$match]");
//Label the fields as best we can
$proc2=array(
"variation_id"=>$proc[0],
"size_desc"=>$proc[1],
"colour_desc"=>$proc[2],
"available"=>(trim(strtolower($proc[3]))=="true"),
"unknown_col1"=>$proc[4],
"price"=>$proc[5],
"unknown_col2"=>$proc[6], /*Always seems to be zero*/
"currency"=>$proc[7],
"unknown_col3"=>$proc[8],
"unknown_col4"=>$proc[9], /*Negative price*/
"unknown_col5"=>$proc[10], /*Always seems to be zero*/
"unknown_col6"=>$proc[11] /*Always seems to be zero*/
);
//Push the processed variation onto the results array
$matches[$proc[0]]=$proc2;
//We are done with our proc2 array now (proc will be unset by the foreach loop)
unset($proc2);
}
//Return the matches we have found
return $matches;
} else {
throw new Exception("Unable to find any product variations");
}
}
//EXAMPLE USAGE
try {
$variations = getProductVariations("http://www.asos.com/Asos/Prod/pgeproduct.aspx?iid=803846");
//Do something more useful here
print_r($variations);
} catch(Exception $e) {
echo "Error: " . $e->getMessage();
}
?>
The above code works, but there's a problem when the product needs you to select a colour first before the sizes are displayed.
Like this one:
http://www.asos.com/Little-Joules/Little-Joules-Stewart-Venus-Fly-Trap-T-Shirt/Prod/pgeproduct.aspx?iid=1171006
Any idea how to go about this?
SOLUTION:
function curl($url){
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
return curl_exec($ch);
curl_close ($ch);
}
$html = curl('http://www.asos.com/pgeproduct.aspx?iid=1111751');
preg_match_all('/arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct\[(.*?)\] \= new Array\((.*?),\"(.*?)\",\"(.*?)\",\"(.*?)\"/is',$html,$bingo);
echo print_r($bingo);
Link: http://debconf11.com/stackoverflow.php
You are on your own now :)
EDIT2:
Ok, we are close to solution...
<script type="text/javascript">var arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct = new Array;
arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct[0] = new Array(1164,"12 months","SailingOrange","True","","59.00","0.00","£","","-59.00","0.00","0");
arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct[1] = new Array(1165,"18 months","SailingOrange","False","","59.00","0.00","£","","-59.00","0.00","0");
arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct[2] = new Array(1167,"24 months","SailingOrange","True","","59.00","0.00","£","","-59.00","0.00","0");
</script>
It is not loaded via ajax, instead array is in javascript variable. You can parse this with PHP, you can clearly see that 18 months is a False, which means it is not available.
EDIT:
This sizes are loaded via javascript, therefore you cannot parse them since they are not there.
I can extract only this...
<select name="drpdwnSize" id="ctl00_ContentMainPage_ctlSeparateProduct_drpdwnSize" onchange="drpdwnSizeChange(this, 'ctl00_ContentMainPage_ctlSeparateProduct', arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct);">
<option value="-1">Select Size</option>
</select>
You can sniff JS to check if you can load sizes based on product id.
First you need: http://simplehtmldom.sourceforge.net/
Forget file_get_contents() it is ~5 slower than cURL.
You then parse this piece of code (html with id ctl00_ContentMainPage_ctlSeparateProduct_drpdwnSize)
<select id="ctl00_ContentMainPage_ctlSeparateProduct_drpdwnSize" name="ctl00$ContentMainPage$ctlSeparateProduct$drpdwnSize" onchange="drpdwnSizeChange(this, 'ctl00_ContentMainPage_ctlSeparateProduct', arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct);">
<option value="-1">Select Size</option><option value="1164">12 months</option><option value="1165">18 months - Not Available</option><option value="1167">24 months</option></select>
You can then use preg_match(),explode(),str_replace() and others to filter out values you want. I can write it but I don't have time right now :)
The most simple way to fetch the content of a URL is to rely on fopen wrappers and just use file_get_contents with the URL. You can use the tidy extension to parse the HTML and extract content. http://php.net/tidy
You can download the file using fopen() or file_get_contents(), as Raoul Duke said, but if you have experience with the JavaScript DOM model, the DOM extension might be a bit easier to use than Tidy.
I know for a fact that the DOM extension is enabled by default in PHP, but I am a bit unsure if Tidy is (the manual page only says it's "bundeled", so I suspect that it might not be enabled).

Categories