I have a database set up, and all it does is give me at most 10 URLs. I need to POST data to those 10 URLs when the page is loaded. That is, the script that sends the message hits "send.php?message=Foo", and it posts 'Foo' to the pages stored in the database. There is no way around this; I need to be able to do it. Right now I am using regular cURL requests in a while loop, but that only posts to the first URL. How do I use the cURL multi functions to do this:
$query = 'SELECT * FROM `chat` LIMIT 0, 20;';
$result = mysql_query($query);
$init = 0;
while ($row = mysql_fetch_array($result)) {
    $URL = $row['url'];
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $URL);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_exec($ch);
    curl_close($ch);
}
Are you sure it only posts to the first URL? Try adding some debugging statements. Check if you're hitting a timeout in PHP.
What about timeouts? Look at the example at http://au2.php.net/manual/en/function.curl-multi-init.php. Start a multi handle before the while loop, create a cURL handle for each URL inside the loop and add it to the multi handle, then after the loop execute the multi handle and close everything. Do check for errors reported by cURL; a URL may time out, for example.
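Roughly, that structure looks like the sketch below (a sketch only; the `chat` table comes from the question above, and $message is a placeholder for whatever you want to POST):
// Sketch: one multi handle, one easy handle per URL, then a single
// parallel execution pass. $message is a placeholder for the data to POST.
$mh      = curl_multi_init();
$handles = array();

$result = mysql_query('SELECT * FROM `chat` LIMIT 0, 10;');
while ($row = mysql_fetch_array($result)) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $row['url']);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10); // don't let one slow URL block the rest
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_POSTFIELDS, 'message=' . urlencode($message)); // placeholder payload
    curl_multi_add_handle($mh, $ch);
    $handles[] = $ch;
}

// Run all requests in parallel.
$running = null;
do {
    $status = curl_multi_exec($mh, $running);
    if ($running) {
        curl_multi_select($mh); // wait for activity instead of busy-looping
    }
} while ($running && $status == CURLM_OK);

// Collect responses and clean up.
foreach ($handles as $ch) {
    $response = curl_multi_getcontent($ch);
    $code     = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    // ... check $code / $response, log failures, etc. ...
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);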
Related
I am writing a program in PHP that retrieves the list of who a user is following on Instagram. The problem I have is that their API only returns 50 results per call, and the rest is paginated.
I know that there is a 'next page' as the returned JSON has a pagination->next_url.
Currently, the code I have gets the JSON and decodes it. Immediately afterwards, a call is made to get the next page using the URL from the first API call.
Have a look:
function getFollows($url) {
    $client_id = "my client id";
    //echo "A url: ".$url."</br>";
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_TIMEOUT, 20);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
    $result = curl_exec($ch);
    curl_close($ch);
    return json_decode($result);
}

$url = 'https://api.instagram.com/v1/users/'.$user_id.'/follows/?client_id='.$client_id.'&access_token='.$token;
$first_page = getFollows($url);
$second_page = getFollows($first_page->pagination->next_url);
What I would like to do instead is check the JSON for a next_url and make a call to it, then check that response for a next_url, and repeat. All the collected JSON would then be merged into one list which I can iterate through to echo each individual person.
My question is: how can I, every time there is pagination, get the next URL, merge the JSON, and repeat until there are no more pages to go through?
I could keep making $third_page, $fourth_page, and so on, but that breaks down if the user has more than four pages of followers, and it is wasteful if they only have 10 followers.
I have tried using an if statement to check whether there is pagination, together with array_merge(), but to no avail. Maybe I was doing it wrong.
Can someone please point me in the right direction?
Thanks,
-DH
You can use ready-made code; this library already handles pagination: https://github.com/cosenary/Instagram-PHP-API
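If you would rather do it yourself, a do/while loop around the getFollows() function from the question is enough. Something like the sketch below, assuming each page has a data array and a pagination->next_url whenever more pages exist (field names like data and username are assumptions based on the question's JSON):
// Sketch only: keep calling getFollows() until there is no pagination->next_url,
// merging each page's `data` array into one list.
$url = 'https://api.instagram.com/v1/users/' . $user_id . '/follows/?client_id=' . $client_id . '&access_token=' . $token;

$follows = array();
do {
    $page = getFollows($url);
    if (isset($page->data)) {
        $follows = array_merge($follows, $page->data);
    }
    // Move on to the next page, or stop if there isn't one.
    $url = isset($page->pagination->next_url) ? $page->pagination->next_url : null;
} while ($url);

foreach ($follows as $person) {
    echo $person->username . '<br>'; // assumes each entry has a `username` field
}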
I am working on a small project to get information from several web pages based on the HTML markup of each page, and I do not know where to start at all.
The basic idea is to get the title from the <h1></h1> tags, the content from the <p></p> tags, and whatever other information is required.
I would have to set up a separate case for each source for it to work the way it needs to. I believe the right approach is to use PHP's $_GET method. The goal of the project is to build a database of information.
What is the best method to grab the information which I need?
First of all: PHP's $_GET is not a method. As you can see in the documentation, $_GET is simply an array initialized with the GET parameters your web server received during the current request. As such it is not what you want to use for this kind of thing.
What you should look into is cURL, which allows you to compose even fairly complex requests, send them to the destination server, and retrieve the response. For example, for a POST request you could do something like:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.mysite.com/tester.phtml");
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS,
            "postvar1=value1&postvar2=value2&postvar3=value3");
// in real life you should use something like:
// curl_setopt($ch, CURLOPT_POSTFIELDS,
//             http_build_query(array('postvar1' => 'value1')));

// receive server response ...
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

$server_output = curl_exec($ch);
curl_close($ch);
Of course, if you don't need to do any complex requests but only simple GET requests, you can go with PHP's file_get_contents function.
After you have received the web page content you have to parse it. IMHO the best way to do this is with PHP's DOM functions. How to use them should really be another question, but you can find tons of examples without much effort.
<?php
$remote = file_get_contents('http://www.remote_website.html');

$doc = new DomDocument();
@$doc->loadHTML($remote);

// Collect all <h1> titles.
$titles = array();
foreach ($doc->getElementsByTagName('h1') as $cell) {
    $titles[] = $cell->nodeValue;
}

// Collect the contents of all <p> tags.
$content = array();
foreach ($doc->getElementsByTagName('p') as $cell) {
    $content[] = $cell->nodeValue;
}
?>
You can get the HTML source of a page with:
<?php
$html= file_get_contents('http://www.example.com/');
echo $html;
?>
Then, once you have the structure of the page, you can extract the tags you need with substr() and strpos().
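For a very simple page, that could look something like the sketch below (it assumes a plain <h1> tag with no attributes; for anything more complicated, the DOM approach above is more reliable):
$html = file_get_contents('http://www.example.com/');

// Find the text between the first <h1> and </h1>.
$start = strpos($html, '<h1>');
if ($start !== false) {
    $start += strlen('<h1>');
    $end   = strpos($html, '</h1>', $start);
    $title = trim(substr($html, $start, $end - $start));
    // $title now holds the page's first heading
}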
I'm trying to write my first crawler using PHP with the cURL library. My aim is to fetch data from one site systematically, which means the code doesn't follow all hyperlinks on the site, only specific ones.
The logic of my code is to go to the main page, get the links for several categories, and store those in an array. Once that's done, the crawler goes to those category pages and checks whether the category has more than one page. If so, it stores the subpages in another array as well. Finally I merge the arrays to get all the links for the pages that need to be crawled, and start to fetch the required data.
I call the function below to start a cURL session and fetch data into a variable, which I later pass to a DOM object and parse with XPath. I store cURL's total_time and http_code in a log file.
The problem is that the crawler runs for 5-6 minutes, then stops and doesn't fetch all the required links for the subpages. I print the contents of the arrays to check the result. I can't see any HTTP error in my log; all sites return an HTTP 200 status code. I can't see any PHP-related error either, even when I turn on PHP debugging on my localhost.
I assume that the site blocks my crawler after a few minutes because of too many requests, but I'm not sure. Is there any way to get more detailed debugging output? Do you think PHP is adequate for this type of activity, given that I want to use the same mechanism to fetch content from more than 100 other sites later on?
My cURL code is as follows:
function get_url($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
    curl_setopt($ch, CURLOPT_URL, $url);
    $data = curl_exec($ch);
    $info = curl_getinfo($ch);

    // Log fetch time and HTTP status for each page.
    $logfile = fopen("crawler.log", "a");
    fwrite($logfile, 'Page ' . $info['url'] . ' fetched in ' . $info['total_time'] . ' seconds. Http status code: ' . $info['http_code'] . "\n");
    fclose($logfile);

    curl_close($ch);
    return $data;
}

// Start to crawl the main page.
$site2crawl = 'http://www.site.com/';
$dom = new DOMDocument();
@$dom->loadHTML(get_url($site2crawl));
$xpath = new DomXpath($dom);
Use set_time_limit() to extend the amount of time your script can run. That is why you are getting "Fatal error: Maximum execution time of 30 seconds exceeded" in your error log.
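For example, at the top of the crawler script:
set_time_limit(0);      // 0 removes the execution time limit entirely
// or give it a fixed ceiling instead, e.g. ten minutes:
// set_time_limit(600);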
Do you need to run this on a web server? If not, you should try the CLI version of PHP; it is exempt from common restrictions such as the execution time limit.
$query = 'SELECT * FROM `chat` LIMIT 0, 24334436743;';
$result = mysql_query($query);
while ($row = mysql_fetch_array($result)) {
    $URL = $row['url'];
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $URL);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    //curl_setopt($ch, CURLOPT_POSTFIELDS, "user=unrevoked clarity&randominfo=hi");
    curl_exec($ch);
    curl_close($ch);
}
Alright, the above snippet is me pulling a whole bunch of URLs from a database and trying to send data to each of them. But it seems to gum the page up (even with only one or two URLs). Is there a built-in system to handle this or something?
You can initialize multiple requests using the curl_multi_*() functions and have them sent all at once. There is probably a limit to how many requests can be pooled, and the overall processing time will take as long as the slowest connection/server.
So your approach (many, many URLs at once) is still problematic. Maybe you can rewrite it to do the processing in the browser instead, starting multiple AJAX requests with some visual feedback.
Requesting a URL over the network is an expensive operation, and even downloading a few will noticeably increase the latency of your page. Can you cache the contents of the pages in a database? Do you have to download the URLs yourself, or can you make the client do it with an iframe?
Here is my structure:
MySQL: table `toys` ---> columns: id, url. How do I get my PHP script to check all of those URLs to see if they are alive or return a 404? Try not to echo or display the results on the page; I need to record them in MySQL in an extra column, "checks".
Results will be in this format:
http://asdasd.adas --- up --- 404
It should be in PHP/cURL if possible. I have been trying for ages, so I gave up and decided to ask here.
The URLs are all located in my database.
In cURL, there is the curl_getinfo() function, which returns information about the current handle:
<?php
// Create a cURL handle
$ch = curl_init('http://www.yahoo.com/');

// Keep cURL from echoing the page, then execute
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_exec($ch);

// Add error/timeout checks here.
$http_code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
I trust you're able to run a SQL query and iterate over the results, so here's the cURL part. For each URL, send a HEAD request and check the result code.
<?php
$handle = curl_init($yourURL);
curl_setopt($handle, CURLOPT_NOBODY, true);          // HEAD request: headers only, no body
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);  // keeps cURL from printing anything
curl_exec($handle);
$result = curl_getinfo($handle, CURLINFO_HTTP_CODE);
curl_close($handle);
// $result now contains the HTTP result code the page sent
?>
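Putting it together with the database, a rough sketch could look like this. It assumes a `toys` table with `id`, `url`, and the extra `checks` column mentioned in the question, and uses PDO purely for illustration (the connection details are placeholders); swap in your own DB layer as needed.
<?php
// Sketch: HEAD-request every URL in `toys` and record the HTTP status code
// in the `checks` column. A code of 0 means the host could not be reached at all.
$db     = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass'); // placeholder credentials
$update = $db->prepare('UPDATE toys SET checks = :code WHERE id = :id');

foreach ($db->query('SELECT id, url FROM toys') as $row) {
    $handle = curl_init($row['url']);
    curl_setopt($handle, CURLOPT_NOBODY, true);          // HEAD request, no body
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);  // nothing gets echoed to the page
    curl_setopt($handle, CURLOPT_TIMEOUT, 10);
    curl_exec($handle);

    $code = curl_getinfo($handle, CURLINFO_HTTP_CODE);
    curl_close($handle);

    $update->execute(array(':code' => $code, ':id' => $row['id']));
}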