I am trying to build a list of all the city pages on ghix.com, which does not have a complete directory of them. To do this I am using their 'city id', which is unique for each city but does not follow any particular order.
I am using cURL and PHP to loop through the possible URLs to find those that match actual cities. Simple enough. See the code below, which produces a 500 Internal Server Error.
If it worked, the output would be a list of 'city ids' that do not match actual cities: if the URL matches an actual city, the page will not contain (0); if it does not match a city, it will.
I have looked over and corrected this several times; what is causing the error?
<html>
<?php
for ($i = 1; ; $i <= 1000000; $i++) {
    $url = "http://www.ghix.com/goto/dynamic/city?CityID=" . $i;
    $term = "(0)";
    curl_setopt($ch, CURLOPT_URL, trim($url));
    $html = curl_exec($ch);
    if ($html !== FALSE && stristr($html, $term) !== FALSE) { // Found!
        echo $i;
        Echo "br/";
    }
}
?>
</html>
UPDATE
A slightly different approach I tried, with the same effect...
<html>
<?php
for ($i = 1; $i <= 100; $i++) {
    $url = "http://www.ghix.com/goto/dynamic/city?CityID=" . $i;
    $term = "(0)";
    curl_setopt($ch, CURLOPT_URL, trim($url));
    $html = curl_exec($ch);
    if (strpos($ch, $term)) {
        echo $url;
        echo "<br>";
    }
?>
</html>
In your first chunk of code, you have an extra ; in the for conditions. Next, you need to initialize cURL with $ch = curl_init(); near the beginning; that opens the handle $ch that you call on later. Finally, use != for the false condition in the if instead of the exclamation and double equals (!==). After these fixes, I'm not getting any 500 errors. After that, it's just a matter of collecting the data from the pages and putting it in the right places.
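For reference, a minimal corrected version of the first loop might look like this. It is only a sketch: it assumes the endpoint and the (0) marker behave as described in the question, and it also adds CURLOPT_RETURNTRANSFER so that curl_exec() returns the page body as a string instead of printing it:
<?php
// Open one cURL handle once and reuse it for every request
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$term = "(0)";
for ($i = 1; $i <= 1000000; $i++) {
    curl_setopt($ch, CURLOPT_URL, "http://www.ghix.com/goto/dynamic/city?CityID=" . $i);
    $html = curl_exec($ch);
    // stristr() returns FALSE when the term is absent
    if ($html !== FALSE && stristr($html, $term) !== FALSE) {
        echo $i, "<br/>";
    }
}
curl_close($ch);
?>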
In the second chunk of code, you still need to initialize cURL. Then you need another curly bracket at the end to close out the for loop. And then it's a matter of dealing with the output from cURL.
It seems you are getting more typo errors than anything. Watch your server logs and they will tell you more about what you are looking for. And for cURL, read up on the options you can set in PHP on the PHP site; it's a good read. Good luck.
What I have found is that you can use the auto-complete JSON request to find all the city ids.
The JSON request url is http://www.ghix.com/goto/dynamic/suggest?InputVarName=q&q=FRAGMENT&Type=json&SkipPrerequisites=1, where FRAGMENT is the letters you type in the input box. Iteratively requesting that URL with different fragments would reveal all the CityIDs you are looking for.
Be aware, they may have bot protection for such ajax queries.
I saw that their JSON is malformed. It can be fixed using the Services_JSON PEAR package.
require_once 'Services/JSON.php';
$Services_JSON = new Services_JSON();
$json = $Services_JSON->decode($jsons);
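A rough sketch of that iteration (hedged: the single-letter fragments, the response being a list, and the CityID field name are all assumptions; adjust them to the actual response you get back):
require_once 'Services/JSON.php';

$Services_JSON = new Services_JSON();
$city_ids = array();
// Try every single-letter fragment; longer fragments can be added the same way
foreach (range('a', 'z') as $fragment) {
    $url = "http://www.ghix.com/goto/dynamic/suggest?InputVarName=q&q=" . $fragment . "&Type=json&SkipPrerequisites=1";
    $jsons = file_get_contents($url);
    $data = $Services_JSON->decode($jsons);
    if (!is_array($data)) {
        continue;
    }
    foreach ($data as $entry) {
        if (isset($entry->CityID)) {
            // De-duplicate by using the id as the key
            $city_ids[$entry->CityID] = true;
        }
    }
}
print_r(array_keys($city_ids));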
Using dxtool/WebGet, the following code seems to work.
require_once("WebGet.php");
// common headers to make this request more real.
$headers = array(
"Accept-Charset" => "ISO-8859-1,utf-8;q=0.7,*;q=0.7",
"Accept-Encoding" => "gzip, deflate",
"Accept-Language" => "en-us,en;q=0.5",
"Connection" => "keep-alive",
"Referer" => "http://www.ghix.com/goto/dynamic/search",
"User-Agent" => "Mozilla/5.0 (Ubuntu; X11; Linux x86_64; rv:9.0.1) Gecko/20100101 Firefox/9.0.1"
);
$w = new WebGet();
$w->cookieFile = dirname(__FILE__) . "/cookie.txt";
// do a page landing
$w->requestContent("http://www.ghix.com/goto/dynamic/search", array(), array(), $headers);
$city_ids = array();
for ($i = 1; $i <= 1000; $i++) {
$url = "http://www.ghix.com/goto/dynamic/city?CityID=" . $i;
$term = "> (0)<";
$w->requestContent($url, array(), array(), $headers);
// sleep some times to make it more like request from human
usleep(10000);
if (strpos($w->cachedContent, $term) !== FALSE){
$city_ids[] = $i;
echo $url, PHP_EOL;
}
}
print_r($city_ids);
I am looking to collect the titles of all of the posts on a subreddit, and I wanted to know what the best way of going about this would be.
I've looked around and found some stuff talking about Python and bots. I've also had a brief look at the API and am unsure which direction to go in.
As I do not want to get 90% of the way through only to find out it won't work, I ask if someone could point me in the right direction in terms of language and any extras needed, for example pip for Python.
My own experience is in web languages such as PHP, so I initially thought a web app would do the trick, but I am unsure whether this would be the best way and how to go about it.
So, as my question stands:
What would be the best way to collect the titles (in bulk) of a subreddit?
Or, if that is too subjective:
How do I retrieve and store all the post titles of a subreddit?
Preferably it needs to:
do more than 1 page of (25) results
save to a .txt file
Thanks in advance.
PHP; in 25 lines:
$subreddit = 'pokemon';
$max_pages = 10;

// Set variables with default data
$page = 0;
$after = '';
$titles = '';

do {
    // Set the URL you want to fetch
    $url = 'http://www.reddit.com/r/' . $subreddit . '/new.json?limit=25&after=' . $after;
    $ch = curl_init($url);
    // Don't include the response headers in the output (we don't need them)
    curl_setopt($ch, CURLOPT_HEADER, 0);
    // Set nobody to false, as we do need the body
    curl_setopt($ch, CURLOPT_NOBODY, 0);
    // Set a curl timeout of 5 seconds
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);
    // Have curl return the output as a string
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    // Execute curl
    $output = curl_exec($ch);
    // Get the HTTP code of the request
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    // Close curl
    curl_close($ch);
    // If the HTTP code is 200 (success)
    if ($status == 200) {
        // Decode the JSON into a PHP object
        $json = json_decode($output);
        // Set $after for the next iteration (reddit's pagination)
        $after = $json->data->after;
        // Loop through each post and collect its title
        foreach ($json->data->children as $k => $v) {
            $titles .= $v->data->title . "\n";
        }
    }
    // Increment the page number
    $page++;
// Loop while the current page number is less than the maximum
} while ($page < $max_pages);

// Save the titles to a text file
file_put_contents(dirname(__FILE__) . '/' . $subreddit . '.txt', $titles);
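If the subreddit is large, you may prefer to append each page's titles to the file inside the loop rather than building one big $titles string in memory. A minimal variation (same variables as above), replacing the foreach inside the if ($status == 200) block:
foreach ($json->data->children as $v) {
    // Append each title immediately instead of accumulating in $titles
    file_put_contents(dirname(__FILE__) . '/' . $subreddit . '.txt', $v->data->title . "\n", FILE_APPEND);
}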
I'm using cURL.
I got the result in a variable named $resp, as you can see below, which I can index like an array. But I don't know how to save only the first 19 characters into another array, like $var1[0].
Right now I only print them. (It's like $resp[0]='a', $resp[1]='b', $resp[2]='c', ... and I want the first 19 characters saved into one cell, like $var1[0]='abc...'.)
I also tried implode and array_merge, but with no success.
$curl = curl_init();
// Set some options - we are passing in a useragent too here
curl_setopt_array($curl, array(
    CURLOPT_RETURNTRANSFER => 1,
    CURLOPT_URL => $var
    //CURLOPT_USERAGENT => 'Codular Sample cURL Request'
));
// Send the request & save response to $resp
$resp = curl_exec($curl);
echo($resp);
// Close request to clear up some resources
curl_close($curl);

for ($i = 0; $i <= 18; ++$i) {
    echo($resp[$i]);
}
Assuming that a character is stored at each index of $resp, you could take the following action:
$nineteen = '';
for ($i = 0; $i <= 18; ++$i) {
    $nineteen .= $resp[$i];
}
$var = array($nineteen);
This code appends all 19 chars to a string and then puts that string into the $var array. Simple and readable.
You can also do this in one step with substr(), since $resp is really a string (that is what curl_exec() returns when CURLOPT_RETURNTRANSFER is set); both approaches have roughly the same performance.
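A minimal sketch of the substr() version:
// Take the first 19 characters of the response string
$var = array(substr($resp, 0, 19));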
(I'm scraping this stuff with the permission of the website in question, by the way).
Pretty simple web scraper. It was working fine when I was loading all the links by hand, but when I tried to load them in via JSON and variables (so I can do lots of scraping with the one script and make the process more modular by just adding more links to the JSON), it runs in an infinite loop.
(The page has been loading for about 15 minutes now.)
Here is my JSON. Only one store is in there for testing purposes, but there are going to be about 15 more.
[
    {
        "store":"Incu Men",
        "cat":"Accessories",
        "general_cat":"Accessories",
        "spec_cat":"accessories",
        "url":"http://www.incuclothing.com/shop-men/accessories/",
        "baseurl":"http://www.incuclothing.com",
        "next_select":"a.next",
        "prod_name_select":".infobox .fn",
        "label_name_select":".infobox .brand",
        "desc_select":".infobox .description",
        "price_select":"#price",
        "mainImg_select":"",
        "more_imgs":".product-images",
        "product_url":".hproduct .photo-link"
    }
]
Here is the PHP scraper code:
<?php
// Set infinite time limit
set_time_limit(0);
// Include simple html dom
include('simple_html_dom.php');

// Defining the basic cURL function
function curl($url) {
    // Initialising cURL
    $ch = curl_init();
    // Setting cURL's URL option with the $url variable passed into the function
    curl_setopt($ch, CURLOPT_URL, $url);
    // Setting cURL's option to return the webpage data
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    // Executing the cURL request and assigning the returned data to the $data variable
    $data = curl_exec($ch);
    // Closing cURL
    curl_close($ch);
    // Returning the data from the function
    return $data;
}

function getLinks($catURL, $prodURL, $baseURL, $next_select) {
    $urls = array();
    while ($catURL) {
        echo "Indexing: $url" . PHP_EOL;
        $html = str_get_html(curl($catURL));
        foreach ($html->find($prodURL) as $el) {
            $urls[] = $baseURL . $el->href;
        }
        $next = $html->find($next_select, 0);
        $url = $next ? $baseURL . $next->href : null;
        echo "Results: $next" . PHP_EOL;
    }
    return $urls;
}

$string = file_get_contents("jsonWorkers/incuMens.json");
$json_array = json_decode($string, true);

foreach ($json_array as $value) {
    $baseURL = $value['baseurl'];
    $catURL = $value['url'];
    $store = $value['store'];
    $general_cat = $value['general_cat'];
    $spec_cat = $value['spec_cat'];
    $next_select = $value['next_select'];
    $prod_name = $value['prod_name_select'];
    $label_name = $value['label_name_select'];
    $description = $value['desc_select'];
    $price = $value['price_select'];
    $prodURL = $value['product_url'];
    if (!is_null($value['mainImg_select'])) {
        $mainImg = $value['mainImg_select'];
    }
    $more_imgs = $value['more_imgs'];
    $allLinks = getLinks($catURL, $prodURL, $baseURL, $next_select);
}
?>
Any ideas why the script would be running infinitely and not returning anything/stopping/printing anything to screen? I'm just going to let it run until it stops. When I was doing this by hand it would only take a minute or so, sometimes less, so I'm sure the problem lies with my variables/JSON, but I can't for the life of me see what it is.
Can anyone take a quick look and point me in the right direction?
There is a problem with your while($catURL) loop: inside it you assign the next page to $url, but the condition tests $catURL, which is never updated, so the loop can never terminate (and the echo references $url before it is ever set). What do you want it to do?
Moreover, you can force information to display in your browser while the script runs with the flush() command.
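For illustration, here is getLinks() with the loop variable fixed (a sketch that keeps the question's curl() helper and simple_html_dom's str_get_html() exactly as used above):
function getLinks($catURL, $prodURL, $baseURL, $next_select) {
    $urls = array();
    while ($catURL) {
        echo "Indexing: $catURL" . PHP_EOL;
        $html = str_get_html(curl($catURL));
        foreach ($html->find($prodURL) as $el) {
            $urls[] = $baseURL . $el->href;
        }
        // Advance the variable the loop actually tests, so it ends when there is no next page
        $next = $html->find($next_select, 0);
        $catURL = $next ? $baseURL . $next->href : null;
    }
    return $urls;
}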
Well, I am attempting to reuse the handles I've spawned in the initial process; however, after the first run it simply stops working. If I remove the handles and add them again (or recreate the entire handler), it works fine. What could be the culprit?
My code currently looks like this:
<?php
echo 'Handler amount: ';
$threads = (int) trim(fgets(STDIN));
if ($threads < 1) {
    $threads = 1;
}

$s = microtime(true);
$url = 'http://mywebsite.com/some-script.php';
$mh = curl_multi_init();
$ch = array();

for ($i = 0; $i < $threads; $i++) {
    $ch[$i] = curl_init($url);
    curl_setopt_array($ch[$i], array(
        CURLOPT_USERAGENT => 'Mozilla/5.0 (X11; Linux i686; rv:21.0) Gecko/20130213 Firefox/21.0',
        CURLOPT_REFERER => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_NOBODY => true
    ));
    curl_multi_add_handle($mh, $ch[$i]);
}

while ($mh) {
    $running = null;
    do {
        curl_multi_exec($mh, $running);
    } while ($running > 0);
    $e = microtime(true);
    $totalTime = number_format($e - $s, 2);
    if ($totalTime >= 1) {
        echo floor($threads / $totalTime) . ' requests per second (total time ' . $totalTime . 's)' . "\r";
        $s = microtime(true);
    }
}

foreach ($ch as $handler) {
    curl_multi_remove_handle($mh, $handler);
    curl_close($handler);
}
curl_multi_close($mh);
?>
When I have CURLOPT_VERBOSE set to true, I see many "additional stuff not fine transfer.c:1037: 0 0" messages; I read about them in a different question, and it seems they are caused by some obvious things:
Too fast
Firewall
ISP restricting
AFAIK, this is not it, because if I recreate the handles every time, they successfully complete at about 79 requests per second (about 529 bytes each)
My process for reusing the handles:
Create the multi handler, and add the specified number of handles to the multi handler
While the multi handler is working, execute all the handles
After the while loop has stopped (it seems very unlikely that it will), close all the handles and the multi curl handler
It executes all handles once and then stops.
This is really stumping me. Any ideas?
I ran into the same problem (using C++, though) and found out that I need to remove the curl easy handle(s) and add them back in again. My solution was to remove all handles at the end of the curl_multi_perform loop and add them back in at the beginning of the outer loop in which I reuse existing keep-alive connections:
for (;;) // loop using keep-alive connections
{
    curl_multi_add_handle(...)
    while ( stillRunning ) // curl_multi_perform loop
    {
        ...
        curl_multi_perform(...)
        ...
    }
    curl_multi_remove_handle(...)
}
Perhaps this also applies to your PHP scenario. Remember: don't curl_easy_cleanup or curl_easy_init the curl handle in between.
If you turn on CURLOPT_VERBOSE you can follow along in the console and verify that your connections are indeed being reused. That is what solved this problem for me.
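A rough PHP adaptation of that pattern might look like this (a sketch only, assuming the same fix carries over: the easy handles are removed and re-added around each curl_multi_exec cycle, and never closed or re-created in between):
// $mh and the $ch array are set up once, as in the question
for ($run = 0; $run < 10; $run++) {
    // Re-add the easy handles for this round
    foreach ($ch as $handle) {
        curl_multi_add_handle($mh, $handle);
    }
    $running = null;
    do {
        curl_multi_exec($mh, $running);
    } while ($running > 0);
    // Remove (but do not close) the handles so they can be reused next round
    foreach ($ch as $handle) {
        curl_multi_remove_handle($mh, $handle);
    }
}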
[Updated At Bottom]
Hi everyone.
Start With Short URLs:
Imagine that you've got a collection of 5 short urls (like http://bit.ly) in a php array, like this:
$shortUrlArray = array("http://bit.ly/123",
                       "http://bit.ly/123",
                       "http://bit.ly/123",
                       "http://bit.ly/123",
                       "http://bit.ly/123");
End with Final, Redirected URLs:
How can I get the final url of these short urls with php? Like this:
http://www.example.com/some-directory/some-page.html
http://www.example.com/some-directory/some-page.html
http://www.example.com/some-directory/some-page.html
http://www.example.com/some-directory/some-page.html
http://www.example.com/some-directory/some-page.html
I have one method (found online) that works well with a single url, but when looping over multiple urls, it only works with the final url in the array. For your reference, the method is this:
function get_web_page( $url )
{
    $options = array(
        CURLOPT_RETURNTRANSFER => true,     // return web page
        CURLOPT_HEADER         => true,     // return headers
        CURLOPT_FOLLOWLOCATION => true,     // follow redirects
        CURLOPT_ENCODING       => "",       // handle all encodings
        CURLOPT_USERAGENT      => "spider", // who am i
        CURLOPT_AUTOREFERER    => true,     // set referer on redirect
        CURLOPT_CONNECTTIMEOUT => 120,      // timeout on connect
        CURLOPT_TIMEOUT        => 120,      // timeout on response
        CURLOPT_MAXREDIRS      => 10,       // stop after 10 redirects
    );
    $ch = curl_init( $url );
    curl_setopt_array( $ch, $options );
    $content = curl_exec( $ch );
    $err     = curl_errno( $ch );
    $errmsg  = curl_error( $ch );
    $header  = curl_getinfo( $ch );
    curl_close( $ch );
    //$header['errno']   = $err;
    //$header['errmsg']  = $errmsg;
    //$header['content'] = $content;
    print($header[0]);
    return $header;
}
// Using the above method in a for loop
$finalURLs = array();
$lineCount = count($shortUrlArray);
for ($i = 0; $i <= $lineCount; $i++) {
    $singleShortURL = $shortUrlArray[$i];
    $myUrlInfo = get_web_page( $singleShortURL );
    $rawURL = $myUrlInfo["url"];
    array_push($finalURLs, $rawURL);
}
Close, but not enough
This method works, but only with a single url; I can't use it in a for loop, which is what I want to do. When used in the above example in a for loop, the first four elements come back unchanged and only the final element is converted into its final url. This happens whether the array is 5 elements or 500 elements long.
Solution Sought:
Please give me a hint as to how you'd modify this method to work when used inside a for loop with a collection of urls (rather than just one).
-OR-
If you know of code that is better suited for this task, please include it in your answer.
Thanks in advance.
Update:
After some further prodding, I've found that the problem lies not in the above method (which, after all, seems to work fine in for loops) but possibly in encoding. When I hard-code an array of short urls, the loop works fine. But when I pass in a block of newline-separated urls from an html form using GET or POST, the above-mentioned problem ensues. Are the urls somehow being changed into a format not compatible with the method when I submit the form?
New Update:
You guys, I've found that my problem was due to something unrelated to the above method. My problem was that the URL encoding of my short urls converted what I thought were just newline characters (separating the urls) into %0D%0A, which is a carriage return plus a line feed. All short urls save for the final one in the collection had this "ghost" character appended to the tail, making it impossible to retrieve the final urls for those. I identified the ghost character, corrected my php explode, and all works fine now. Sorry, and thanks.
This may be of some help: How to put string in array, split by new line?
You would probably do something like this, assuming you're getting the URLs returned in POST:
$final_urls = array();

// You can replace chr(10) with "\n" or "\r\n", depending on how you get your urls.
// And of course, change $_POST['short_urls'] to the source of your string.
$short_urls = explode( chr(10), $_POST['short_urls'] );

foreach ( $short_urls as $short ) {
    $final_urls[] = get_web_page( $short );
}
I get the following output, using var_dump($final_urls); and your bit.ly url:
http://codepad.org/8YhqlCo1
And my source: $_POST['short_urls'] = "http://bit.ly/123\nhttp://bit.ly/123\nhttp://bit.ly/123\nhttp://bit.ly/123";
I also got an error using your function: Notice: Undefined offset: 0 in /var/www/test.php on line 27. Line 27 is print($header[0]); and I'm not sure what you wanted there, since curl_getinfo() returns an associative array with no 0 key...
Here's my test.php, if it will help: http://codepad.org/zI2wAOWL
I think you almost have it there; the loop condition needs $i < $lineCount rather than $i <= $lineCount, otherwise the last iteration reads past the end of the array. Try this:
$shortUrlArray = array("http://yhoo.it/2deaFR",
                       "http://bit.ly/900913",
                       "http://bit.ly/4m1AUx");

$finalURLs = array();
$lineCount = count($shortUrlArray);
for ($i = 0; $i < $lineCount; $i++) {
    $singleShortURL = $shortUrlArray[$i];
    $myUrlInfo = get_web_page( $singleShortURL );
    $rawURL = $myUrlInfo["url"];
    // echo rather than printf, so a % in the url is not treated as a format specifier
    echo $rawURL . "\n";
    array_push($finalURLs, $rawURL);
}
I implemented this to get, for each line of a plain text file with one shortened url per line, the corresponding redirect url:
<?php
// input: text file with one bitly-shortened url per line
$plain_urls = file_get_contents('in.txt');
$bitly_urls = explode("\r\n", $plain_urls);

// output: where we should write
$w_out = fopen("out.csv", "a+") or die("Unable to open file!");

foreach ($bitly_urls as $bitly_url) {
    $c = curl_init($bitly_url);
    curl_setopt($c, CURLOPT_USERAGENT, 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36');
    curl_setopt($c, CURLOPT_FOLLOWLOCATION, 0);
    curl_setopt($c, CURLOPT_HEADER, 1);
    curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($c, CURLOPT_CONNECTTIMEOUT, 20);
    // curl_setopt($c, CURLOPT_PROXY, 'localhost:9150');
    // curl_setopt($c, CURLOPT_PROXYTYPE, CURLPROXY_SOCKS5);
    $r = curl_exec($c);
    // get the redirect url:
    $redirect_url = curl_getinfo($c)['redirect_url'];
    // free the handle before moving on to the next url
    curl_close($c);
    // write output as csv
    $out = '"' . $bitly_url . '";"' . $redirect_url . '"' . "\n";
    fwrite($w_out, $out);
}
fclose($w_out);
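Note that with CURLOPT_FOLLOWLOCATION disabled, 'redirect_url' only gives the first hop. If a short url chains through several redirects, one option (a sketch, not part of the original script) is to let cURL follow the chain and read the final effective url instead:
curl_setopt($c, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($c, CURLOPT_MAXREDIRS, 10);
$r = curl_exec($c);
// The url cURL ended up at after following all redirects
$redirect_url = curl_getinfo($c, CURLINFO_EFFECTIVE_URL);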
Have fun and enjoy!
pw