I want to fill a database table with certain items from the Steam Marketplace, specifically at the moment, guns from CSGO. I can't seem to find any database or list already of all the gun names, skin names and skin qualities, which is what I want.
One way I thought of to do it is to get to the list of items I want, EG "Shotguns", and save each item on the page into the database, and go through each page of that search. EG:
http://steamcommunity.com/market/search?appid=730&q=shotgun#p1_default_desc
http://steamcommunity.com/market/search?appid=730&q=shotgun#p2_default_desc
Ect..
Firstly, I'm not exactly sure how I would do that, and secondly, I wanted to know if there would be an easier way.
I plan on using the names of items to later get the prices by substituting the names into this: http://steamcommunity.com/market/priceoverview/?currency=3&appid=730&market_hash_name=StatTrak%E2%84%A2%20P250%20%7C%20Steel%20Disruption%20%28Factory%20New%29
And updating the prices every hour or so by running that check for every item. (probably at least a few thousand..)
The general gist of what you need to do boils down to:
Identify the urls you need to parse. In your case you'll notice that the results are loaded via ajax. Right-click the page, click 'inspect element' and go to the network tab. You'll see that the actual url is: http://steamcommunity.com/market/search/render/?query=&start=<STARTVALUE>&count=<NUMBEROFRESULTS>&search_descriptions=0&sort_column=quantity&sort_dir=desc&appid=730&category_730_ItemSet%5B%5D=any&category_730_TournamentTeam%5B%5D=any&category_730_Weapon%5B%5D=any&category_730_Type%5B%5D=tag_CSGO_Type_Pistol&category_730_Type%5B%5D=tag_CSGO_Type_SMG&category_730_Type%5B%5D=tag_CSGO_Type_Rifle&category_730_Type%5B%5D=tag_CSGO_Type_SniperRifle&category_730_Type%5B%5D=tag_CSGO_Type_Shotgun&category_730_Type%5B%5D=tag_CSGO_Type_Machinegun&category_730_Type%5B%5D=tag_CSGO_Type_Knife
Identify what the response type is. In this case it is json, and the data we want is inside a html-snippet
Find the framework required to parse it. You can use json_decode(...) to decode the json string. This question will give more information how to parse html.
You can now feed these urls to a function that loads the page. You can use file_get_contents(...) or the curl library.
Enter the values you parse from the response into your database. Make sure that the script does not get killed when it runs for too long. This question will give you more information about that.
You can use the following as a framework. You'll have to figure the structure of the html yourself, and lookup a tutorial of the html parser and mysql library you want to use.
<?php
//Prevent this script from being killed. Please note that if this script never
//ends, you'll have to kill it manually
set_time_limit( 0 );
//The api does not allow for more than 100 results at a time
$start = 0;
$count = 100;
$maxresults = PHP_INT_MAX;
$baseurl = "http://steamcommunity.com/market/search/render/?query=&start=$1&count=$2&search_descriptions=0&sort_column=quantity&sort_dir=desc&appid=730&category_730_ItemSet%5B%5D=any&category_730_TournamentTeam%5B%5D=any&category_730_Weapon%5B%5D=any&category_730_Type%5B%5D=tag_CSGO_Type_Pistol&category_730_Type%5B%5D=tag_CSGO_Type_SMG&category_730_Type%5B%5D=tag_CSGO_Type_Rifle&category_730_Type%5B%5D=tag_CSGO_Type_SniperRifle&category_730_Type%5B%5D=tag_CSGO_Type_Shotgun&category_730_Type%5B%5D=tag_CSGO_Type_Machinegun&category_730_Type%5B%5D=tag_CSGO_Type_Knife";
while( $start < $maxresults ) {
//Constructing the next url
$url = str_replace( "$1", $start, $baseurl );
$url = str_replace( "$2", $count, $url );
//Doing the request
$ch = curl_init();
curl_setopt( $ch, CURLOPT_URL, $url );
curl_setopt( $ch, CURLOPT_RETURNTRANSFER, 1 );
$result = json_decode( curl_exec( $ch ), TRUE );
curl_close( $ch );
//Doing things with the result
//
//First let's see if everything went according to plan
if( $result == NULL || $result["success"] !== TRUE ) {
echo "Something went horribly wrong. Please edit the script to take this error into account and rerun it.";
exit( -1 );
}
//Bookkeeping for the next url we have to fetch
$count = $result["pagesize"];
$start += $count;
$maxresults = $result["total_count"];
//This is the html we have to parse
$html = $result["results_html"];
//Look up an example how to parse html, and how to get data from it
//Look up how to make a database connection and how to insert data into
//your database
}
echo "And we are done!";
Related
As a part of an assignment I am trying to pull some statistics from the Riot API (JSON data for League of Legends). So far I have managed to find summoner id (user id) based on summoner name, and I have filtered out the id's of said summoner's previous (20) games. However now I can't figure out how to get the right values from the JSON data. So this is when I'll show you my code I guess:
$matchIDs is an array of 20 integers (game IDs)
for ($i = 1; $i <= 1; $i++)
{
$this_match_data = get_match($matchIDs[$i], $server, $api);
$processed_data = json_decode($this_match_data, true);
var_dump($processed_data);
}
As shown above my for loop is set to one, as I'm just focusing on figuring out one before continuing with all 20. The above example is how I got the match IDs and the summoner IDs. I'll add those codes here for comparison:
for ($i = 0; $i <= 19; $i++)
{
$temp = $data['matches'][$i]['matchId'];
$matchIDs[$i] = json_decode($temp, true);
}
$data is the variable I get when I pull all the info from the JSON page, it's the same method I use to get $this_match_data in the first code block.
function match_list($summoner_id, $server, $api)
{
$summoner_enc = rawurlencode($summoner);
$summoner_lower = strtolower($summoner_enc);
$curl =curl_init('https://'.$server.'.api.pvp.net/api/lol/'.$server.'/v2.2/matchlist/by-summoner/'.$summoner_id.'?api_key='.$api.'');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
$result = curl_exec($curl);
curl_close($curl);
return $result;
}
Now to the root of the problem, This is where I put the data I get from the site, so you can see what I am working with. Now by using the following code I can get the first value in that file, the match ID.
echo $processed_data['matchId'];
But I can't seem to lock down any other information from this .json file. I've tried typing stuff like ['region'] instead of ['matchId'] with no luck as well as inserting index numbers like $processed_data[0], but nothing happens. This is just how I get the right info from the first examples and I am really lost here.
Ok, so I think I've figured it out myself. By adding this to the code I can print out the json file in a way more human-friendly way, and that should make it much easier to handle the data.
echo ("<pre>");
var_dump($processed_data);
echo ("</pre>");
For learning purposes, I'm trying to fetch data from the Steam Store, where if the image game_header_image_full exists, I've reached a game. Both alternatives are sort of working, but there's a catch. One is really slow, and the other seems to miss some data and therefore not writing the URL's to a text file.
For some reason, Simple HTML DOM managed to catch 9 URL's, whilst the 2nd one (cURL) only caught 8 URL's with preg_match.
Question 1.
Is $reg formatted in a way that $html->find('img.game_header_image_full') would catch, but not my preg_match? Or is the problem something else?
Question 2.
Am I doing things correctly here? Planning to go for the cURL alternative, but can I make it faster somehow?
Simple HTML DOM Parser (Time to search 100 ids: 1 min, 39s. Returned: 9 URL.)
<?php
include('simple_html_dom.php');
$i = 0;
$times_to_run = 100;
set_time_limit(0);
while ($i++ < $times_to_run) {
// Find target image
$url = "http://store.steampowered.com/app/".$i;
$html = file_get_html($url);
$element = $html->find('img.game_header_image_full');
if($i == $times_to_run) {
echo "Success!";
}
foreach($element as $key => $value){
// Check if image was found
if (strpos($value,'img') == false) {
// Do nothing, repeat loop with $i++;
} else {
// Add (don't overwrite) to file steam.txt
file_put_contents('steam.txt', $url.PHP_EOL , FILE_APPEND);
}
}
}
?>
vs. the cURL alternative.. (Time to search 100 ids: 34s. Returned: 8 URL.)
<?php
$i = 0;
$times_to_run = 100;
set_time_limit(0);
while ($i++ < $times_to_run) {
$ch = curl_init();
curl_setopt( $ch, CURLOPT_URL, 'http://store.steampowered.com/app/'.$i);
curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true);
$content = curl_exec($ch);
$url = "http://store.steampowered.com/app/".$i;
$reg = "/<\\s*img\\s+[^>]*class=['\"][^'\"]*game_header_image_full[^'\"]*['\"]/i";
if(preg_match($reg, $content)) {
file_put_contents('steam.txt', $url.PHP_EOL , FILE_APPEND);
}
}
?>
Well you shouldn't use regex with HTML. It mostly works, but when it doesn't, you have to go through hundreds of pages and figuring out which one is the failing one, and why, and correct the regex, then hope and pray that in the future nothing like that will ever happen again. Spoiler alert: it will.
Long story short, read this funny answer: RegEx match open tags except XHTML self-contained tags
Don't use regex to parse HTML. Use HTML parsers, which are complicated algorithms that don't use regex, and are reliable (as long as the HTML is valid). You are using one already, in the first example. Yes, it's slow, because it does more than just searching for a string within a document. But it's reliable. You can also play with other implementations, especially the native ones, like http://php.net/manual/en/domdocument.loadhtml.php
So checked via a phpinfo() and Safe Mode on my server is off, Curl is activated and there are no reasons for it not to work.
I also made sure Sharrre.php is in my root directory. Even included the Curlurl to the php file. Tried both absolute and relative linking. The google button with the counter shows as soon it is uploaded but not as expected because the counter shows 0 the entire time.
The culprit seems to be: $json = array('url'=>'','count'=>0);
After a few lines of other code we got this:
if(filter_var($_GET['url'], FILTER_VALIDATE_URL)){
if($type == 'googlePlus'){ //source http://www.helmutgranda.com/2011/11/01/get-a-url-google-count-via-php/
$contents = parse('https://clients6.google.com/rpc?key=AIzaSyCKSbrvQasunBoV16zDH9R33D88CeLr9gQurl=' . $url . '&count=true');
preg_match( '/window\.__SSR = {c: ([\d]+)/', $contents, $matches );
if(isset($matches[0])){
$json['count'] = (int)str_replace('window.__SSR = {c: ', '', $matches[0]);
}
}
So either the google url code is not valid anymore or... well maybe there is something wrong with the suspected culprit because:
when changed to a value higher than 0 $json = array('url'=>'','count'=>15);
It shows 15 counts as you can see. I want it to be dynamic though and get the counts I already have and update those per click.
What can be done to solve this?
In my particular case the problem was in the asignement of the URL to the Curl object.
The original script sharrre.php sets the URL by asigning it to an array element of the curl object, but this is not working and causes Google counter not retrieve any amount.
Instead, the URL must be asigned by the curl_setopt() function.
This resolved this problem in my case:
sharrre.php:
//...
$ch = curl_init();
//$options[CURLOPT_URL] = $encUrl; // <<<--- not working! comment this line.
curl_setopt_array($ch, $options);
curl_setopt($ch, CURLOPT_URL, $encUrl ); // <<<--- Yeeaa, working! Add this line.
//...
Hope this help.
I have a simple list of orders which a user can filter by status (open,dispatched,closed). The filter dropdown triggers a post to the server and sends the filter value through. Orders are listed out 10 to a page with pagination links for any results greater than 10. Problem is when I click the pagination links to view the next page of results the filter value in the post is lost.
public function filter_orders() {
$page = ($this->uri->segment(4)) ? $this->uri->segment(4) : 0;
$filter = $this->input->post('order_status_filter');
$config = array();
$config["base_url"] = base_url() . "control/orders/filter_orders";
$config["per_page"] = 10;
$config['next_link'] = 'Next';
$config["uri_segment"] = 4;
$config['total_rows'] = $this->model_order->get_all_orders_count($this->input->post('order_status_filter'));
}
How can I make the pagination and filter work together. I've thought about injecting a query string in to the pagination links but it doesn't seem like a great solution.
The answer is very simple, use $_GET. You can also use URI segments.
i.e, index.php/cars/list/5/name-asc/price-desc'
The main reason you'll want to use $_GET is so you can link other users so they see the same result set you see. I'm sure users of your web app will want this functionality if you can imagine them linking stuff to each other.
That said, it would be ok to ALSO store the filters in the session so that if the user navigates away from the result set and then goes back, everything isn't reset.
Your best bet is to start a session and store the POST data in the session. In places in your code where you check to see if the user has sent POST data, you can check for session data (if POST is empty).
In other words, check for POST data (as you already do). If you got POST data, store it in the session. If a page has no POST data, check to see if you have session data. If you do, proceed as if it was POSTed. If you have both, overwrite the session with POST. You'll want to use new data your user sent you to overwrite older data they previously sent.
You either put everything in $_GET or if the data is sensible, put it in $_SESSION. Then it travels between pages.
In your case there seem to be no reason to put your filter data anywhere else than in $_GET.
A query string does seem the best solution. You could store it in the session or in cookies as well, but it makes sense to also store it in the query string.
Store it in cookies or the session if you want to remember the user's choice. Which seems like a friendly solution. It allows the user to keep their settings for a next visit, or for another page.
Store it in the query string, because going to 'page 2' doesn't tell you anything if you don't know about filters, page size or sorting. So if a user wants to bookmark page 2 or send it by e-mail, let them be able to send a complete link that contains this meta information.
Long story short: Store it in both.
maybe its not a right answer but, give it a try
<?php
// example url
$url = "index.php?page=6&filter1=value1&filter2=value2";
// to get the current url
//$url = "http://".$_SERVER["HTTP_HOST"].$_SERVER["REQUEST_URI"];
// change the page to 3 without changing any other values
echo url_change_index( $url, "page", 3 );
// will output "index.php?page=3&filter1=value1&filter2=value2"
// remove page index from url
echo url_change_index( $url, "page" );
// will output "index.php?filter1=value1&filter2=value2"
// the function
function url_change_index( $url, $name = null, $value = null ) {
$query = parse_url( $url, PHP_URL_QUERY );
$filter = str_replace( $query, "", $url );
parse_str( $query, $parsed );
$parsed = ( !isset( $parsed ) || !is_array( $parsed ) ) ? array() : $parsed;
if ( empty( $value ) ) {
unset( $parsed[$name] );
}
else {
$parsed[$name] = $value;
}
return $filter.http_build_query( $parsed );
}
?>
Parsing HTML / JS codes to get info using PHP.
www.asos.com/Asos/Little-Asos-Union-Jack-T-Shirt/Prod/pgeproduct.aspx?iid=1273626
Take a look at this page, it's a clothes shop for kids. This is one of their items and I want to point out the size section. What we need to do here is to get all the sizes for this item and check whether the sizes are available or not. Right now all the sizes for this items are:
3-4 years
4-5 years
5-6 years
7-8 years
How can you say if the sizes are available or not?
Now take a look at this page first and check the sizes again:
www.asos.com/Ralph-Lauren/Ralph-Lauren-Long-Sleeve-Big-Horse-Stripe-Rugby-Top/Prod/pgeproduct.aspx?iid=1111751
This item has the following sizes:
12 months
18 months - Not Available
24 months
As you can see 18 months size is not available, it is indicated by the "Not Available" text next to the size.
What we need to do is go the page of an item, get the sizes and check the availability of each sizes. How can I do this in PHP?
EDIT:
Added a working code and a new problem to tackle.
Working code but it needs more work:
<?php
function getProductVariations($url) {
//Use CURL to get the raw HTML for the page
$ch = curl_init();
curl_setopt_array($ch,
array(
CURLOPT_RETURNTRANSFER=>true,
CURLOPT_HEADER => false,
CURLOPT_URL => $url
)
);
$raw_html = curl_exec($ch);
//If we get an invalid response back from the server fail
if ($raw_html===false) {
throw new Exception(curl_error($ch));
}
curl_close($ch);
//Find the variation JS declarations and extract them
$raw_variations = preg_match_all("/arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct\[[0-9]+\].*Array\((.*)\);/",$raw_html,$raw_matches);
//We are done with the Raw HTML now
unset($raw_html);
//Check that we got some results back
if (is_array($raw_matches) && isset($raw_matches[1]) && sizeof($raw_matches[1])==$raw_variations && $raw_variations>0) {
//This is where the matches will go
$matches = array();
//Go through the results of the bracketed expression and convert them to a PHP assoc array
foreach($raw_matches[1] as $match) {
//As they are declared in javascript we can use json_decode to process them nicely, they just need wrapping
$proc=json_decode("[$match]");
//Label the fields as best we can
$proc2=array(
"variation_id"=>$proc[0],
"size_desc"=>$proc[1],
"colour_desc"=>$proc[2],
"available"=>(trim(strtolower($proc[3]))=="true"),
"unknown_col1"=>$proc[4],
"price"=>$proc[5],
"unknown_col2"=>$proc[6], /*Always seems to be zero*/
"currency"=>$proc[7],
"unknown_col3"=>$proc[8],
"unknown_col4"=>$proc[9], /*Negative price*/
"unknown_col5"=>$proc[10], /*Always seems to be zero*/
"unknown_col6"=>$proc[11] /*Always seems to be zero*/
);
//Push the processed variation onto the results array
$matches[$proc[0]]=$proc2;
//We are done with our proc2 array now (proc will be unset by the foreach loop)
unset($proc2);
}
//Return the matches we have found
return $matches;
} else {
throw new Exception("Unable to find any product variations");
}
}
//EXAMPLE USAGE
try {
$variations = getProductVariations("http://www.asos.com/Asos/Prod/pgeproduct.aspx?iid=803846");
//Do something more useful here
print_r($variations);
} catch(Exception $e) {
echo "Error: " . $e->getMessage();
}
?>
The above code works, but there's a problem when the product needs you to select a colour first before the sizes are displayed.
Like this one:
http://www.asos.com/Little-Joules/Little-Joules-Stewart-Venus-Fly-Trap-T-Shirt/Prod/pgeproduct.aspx?iid=1171006
Any idea how to go about this?
SOLUTION:
function curl($url){
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
return curl_exec($ch);
curl_close ($ch);
}
$html = curl('http://www.asos.com/pgeproduct.aspx?iid=1111751');
preg_match_all('/arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct\[(.*?)\] \= new Array\((.*?),\"(.*?)\",\"(.*?)\",\"(.*?)\"/is',$html,$bingo);
echo print_r($bingo);
Link: http://debconf11.com/stackoverflow.php
You are on your own now :)
EDIT2:
Ok, we are close to solution...
<script type="text/javascript">var arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct = new Array;
arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct[0] = new Array(1164,"12 months","SailingOrange","True","","59.00","0.00","£","","-59.00","0.00","0");
arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct[1] = new Array(1165,"18 months","SailingOrange","False","","59.00","0.00","£","","-59.00","0.00","0");
arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct[2] = new Array(1167,"24 months","SailingOrange","True","","59.00","0.00","£","","-59.00","0.00","0");
</script>
It is not loaded via ajax, instead array is in javascript variable. You can parse this with PHP, you can clearly see that 18 months is a False, which means it is not available.
EDIT:
This sizes are loaded via javascript, therefore you cannot parse them since they are not there.
I can extract only this...
<select name="drpdwnSize" id="ctl00_ContentMainPage_ctlSeparateProduct_drpdwnSize" onchange="drpdwnSizeChange(this, 'ctl00_ContentMainPage_ctlSeparateProduct', arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct);">
<option value="-1">Select Size</option>
</select>
You can sniff JS to check if you can load sizes based on product id.
First you need: http://simplehtmldom.sourceforge.net/
Forget file_get_contents() it is ~5 slower than cURL.
You then parse this piece of code (html with id ctl00_ContentMainPage_ctlSeparateProduct_drpdwnSize)
<select id="ctl00_ContentMainPage_ctlSeparateProduct_drpdwnSize" name="ctl00$ContentMainPage$ctlSeparateProduct$drpdwnSize" onchange="drpdwnSizeChange(this, 'ctl00_ContentMainPage_ctlSeparateProduct', arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct);">
<option value="-1">Select Size</option><option value="1164">12 months</option><option value="1165">18 months - Not Available</option><option value="1167">24 months</option></select>
You can then use preg_match(),explode(),str_replace() and others to filter out values you want. I can write it but I don't have time right now :)
The most simple way to fetch the content of a URL is to rely on fopen wrappers and just use file_get_contents with the URL. You can use the tidy extension to parse the HTML and extract content. http://php.net/tidy
You can download the file using fopen() or file_get_contents(), as Raoul Duke said, but if you have experience with the JavaScript DOM model, the DOM extension might be a bit easier to use than Tidy.
I know for a fact that the DOM extension is enabled by default in PHP, but I am a bit unsure if Tidy is (the manual page only says it's "bundeled", so I suspect that it might not be enabled).