I was playing around with the Google Complete API looking for a quick way to get hold of the top 26 most searched terms for various question prefixes - one for each letter of the alphabet.
I wouldn't count myself a programmer but it seemed like a fun task!
My script works fine locally but it takes too long on my shared server and times out after 30 seconds - and as it's shared I can't access the php.ini to lengthen the max execution time.
It made me wonder if there is a more efficient way of making the requests to the API. Here is my code:
<?php
$prep = $_POST['question'];
$letters = range('a', 'z'); // build the alphabet once, outside the loop

foreach ($letters as $letter) {
    $term = $prep . $letter;

    // One blocking HTTP request per letter - 26 in total, one after another.
    if (!$xml = simplexml_load_file('http://google.com/complete/search?output=toolbar&q=' . urlencode($term))) {
        trigger_error('Error reading XML file', E_USER_ERROR);
    }

    $result  = ucfirst($xml->CompleteSuggestion->suggestion->attributes()->data);
    $queries = number_format((int)$xml->CompleteSuggestion->num_queries->attributes()->int);

    echo '<p><span>' . ucfirst($letter) . ':</span> ' . $result . '?</p>';
    echo '<p class="queries">Number of queries: ' . $queries . '</p><br />';
}
?>
I also wrote a few lines that fed the question into the Yahoo Answers API, which worked pretty well, although it made the results take even longer, and I couldn't do an exact match on the search term through the API, so I got a few odd answers back!
Basically, is the above code the most efficient way of calling an API multiple times?
Thanks,
Rich
You should look at this issue from the user's perspective and ask yourself: would you like to wait 30 seconds for a web page to load?
Obviously not.
So, how can you make the page load faster?
You are depending on an external resource (the Google API), and you are not calling it once but 26 times, one request after another.
If you make those 26 calls asynchronously instead, the total wall-clock time drops from roughly 26 requests' worth to about one (at the expense of network bandwidth).
Take a look at http://php.net/manual/en/function.curl-multi-exec.php - that is the first step of the optimization.
Once you have that done, the time you spend on the external resource could drop by up to 95%.
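As a rough sketch only (reusing the endpoint and XML handling from the question; the timeout value and minimal error handling are my assumptions), the parallel version could look like this:

<?php
// Sketch: fetch all 26 suggestion URLs in parallel with curl_multi.
$prep    = $_POST['question'];
$letters = range('a', 'z');

$mh      = curl_multi_init();
$handles = array();

foreach ($letters as $letter) {
    $ch = curl_init('http://google.com/complete/search?output=toolbar&q=' . urlencode($prep . $letter));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5); // assumed timeout, tune as needed
    curl_multi_add_handle($mh, $ch);
    $handles[$letter] = $ch;
}

// Run all requests at once instead of one after another.
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh);
} while ($running > 0);

foreach ($handles as $letter => $ch) {
    $xml = simplexml_load_string(curl_multi_getcontent($ch));
    if ($xml && isset($xml->CompleteSuggestion)) {
        $result  = ucfirst($xml->CompleteSuggestion->suggestion->attributes()->data);
        $queries = number_format((int)$xml->CompleteSuggestion->num_queries->attributes()->int);
        echo '<p><span>' . ucfirst($letter) . ':</span> ' . $result . '?</p>';
        echo '<p class="queries">Number of queries: ' . $queries . '</p><br />';
    }
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);
?>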
Will this be good enough?
Obviously not yet.
Any call to an external resource is unreliable, even if it is Google:
if the network is down or DNS does not resolve, your page goes down too.
How do you prevent that?
You need a cache. Basically the logic is (a minimal sketch follows below):
search for an existing cache entry; if found, return it from the cache
if not found, query the Google API on the spot (from a to z)
store the result in the cache
return the result
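As a minimal sketch only, using a file-based cache, a one-hour TTL and a placeholder fetch_from_google_api() function (all of these are assumptions, not requirements):

<?php
// Sketch of the cache-first logic above.
function get_suggestions($question) {
    $cacheFile = sys_get_temp_dir() . '/suggest_' . md5($question) . '.cache';
    $ttl = 3600; // assumed: cache results for one hour

    // 1. Return from the cache if we have a fresh copy.
    if (file_exists($cacheFile) && time() - filemtime($cacheFile) < $ttl) {
        return unserialize(file_get_contents($cacheFile));
    }

    // 2. Otherwise query the Google API (a to z), e.g. with the curl_multi
    //    approach shown above. fetch_from_google_api() is a placeholder.
    $results = fetch_from_google_api($question);

    // 3. Store the result in the cache and return it.
    file_put_contents($cacheFile, serialize($results));
    return $results;
}
?>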
However, on-demand processing is still not ideal (the first user to issue a request has to wait the longest).
If you know the permutations of user input (hopefully not that many),
you can use a scheduler (cron job) to periodically pull results from the Google API
and store them locally.
I recommend using cron jobs for this kind of work. That way you can either change the max execution time with a parameter, or split the work into multiple operations and run the cron job more regularly, running one operation after another.
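For example (a sketch with placeholder paths and limits), a crontab entry can both raise the execution limit on the command line and run the work in small batches:

# Run every 5 minutes, allowing up to 5 minutes of execution time per run
*/5 * * * * /usr/bin/php -d max_execution_time=300 /path/to/script.php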
Related
I'm currently working on a project that involves a website which gets data from a game API.
Problem is, I am bound to a specific rate limit (currently 500 requests per 10 minutes) which I must not exceed.
How do I keep track of the current count of requests other than writing/reading it to a file or database every time someone requests the data? I guess that wouldn't be the best approach and could potentially lead to problems with a few hundred people accessing the website at the same time.
The website calls a PHP script with the necessary information the user provides to get the data from the API.
You can use APC for this.
The Alternative PHP Cache (APC) is a free and open opcode cache for
PHP. Its goal is to provide a free, open, and robust framework for
caching and optimizing PHP intermediate code.
You don't need any external library to build this extension. Saving and fetching a variable across requests are as easy as this:
<?php
$bar = 'BAR';
apc_add('foo', $bar);
var_dump(apc_fetch('foo'));
echo "\n";
$bar = 'NEVER GETS SET';
apc_add('foo', $bar); // apc_add() will not overwrite an existing key, so 'foo' keeps its first value ('BAR')
var_dump(apc_fetch('foo'));
echo "\n";
?>
Here is the documentation.
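Applied to the rate limit in the question, a rough sketch using a fixed 10-minute window could look like this (the key name is an assumption, and note that APC counts per server and per cache, so this is not a shared cluster-wide counter):

<?php
// Sketch: fixed-window request counter in APC.
$limit  = 500;   // requests allowed per window (from the question)
$window = 600;   // window length in seconds
$key    = 'game_api_request_count';

// apc_add() only creates the key if it does not exist yet; the TTL lets
// the counter expire (and so reset) once the window has passed.
apc_add($key, 0, $window);

if (apc_inc($key) > $limit) {
    // Over the limit: skip the external call, serve cached data, or queue it.
    exit('Rate limit reached, try again shortly.');
}

// ...safe to call the game API here...
?>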
Since all requests are separate, they don't know anything about the other requests. There is no way to have a "shared" variable in PHP.
Best way is probably to create a database table and create a record in there every time you do a request. Keep track of when each request was made with a datetime column.
That way you can quickly check how many requests were made in the last 10 minutes by counting the records created in that window.
Run a generic delete query on the table on occasion.
A simple query like that will not really hurt your performance unless you have a really busy site.
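In code, the table-based counter could look roughly like this (the table and column names, the credentials and the daily clean-up are all assumptions):

<?php
// Sketch: count requests in a 10-minute window using a log table.
$db = new mysqli('localhost', 'user', 'pass', 'mydb'); // placeholder credentials

// Record this request.
$db->query("INSERT INTO api_requests (requested_at) VALUES (NOW())");

// How many requests were made in the last 10 minutes?
$result = $db->query(
    "SELECT COUNT(*) FROM api_requests
     WHERE requested_at > NOW() - INTERVAL 10 MINUTE"
);
list($count) = $result->fetch_row();

if ($count >= 500) {
    // Over the limit: serve cached data or ask the visitor to retry later.
}

// Occasionally prune old rows so the table stays small.
$db->query("DELETE FROM api_requests WHERE requested_at < NOW() - INTERVAL 1 DAY");
?>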
Another solution might be to cache the results from the API and re-use them for each request, then refresh the results from the API every couple of seconds (one request every 2 seconds works out at 300 per 10 minutes). But that would require the data to actually be cacheable and re-usable.
I've been looking at asynchronous database requests in PHP using mysqlnd. The code is working correctly, but when I compare the performance of pulling data from one reasonably sized table against the same data split across multiple tables queried asynchronously, I'm not getting anywhere near the performance I would expect, although it does seem fairly changeable depending on the hardware setup.
As I understand it, rather than:
x = a + b + c + d
I should be achieving:
x = max(a, b, c, d)
Where x is the total time taken and a to d are the times for individual requests. What I am actually seeing is a rather minor increase in performance on some setups and on others worse performance as if requests weren't asynchronous at all. Any thoughts or experiences from others who may have worked with this and come across the same are welcome.
EDIT: Measuring the timings here, we are talking about queries spread over 10 tables. Individually the queries take no more than around 8 seconds to complete, and adding up the time each individual request takes to complete (not asynchronously) gives a total of around 18 seconds.
Performing the same requests asynchronously, the total query time is also around 18 seconds. So clearly the requests are not being executed in parallel against the database.
EDIT: The code used is exactly as shown in the documentation here:
<?php
$link1 = mysqli_connect();
$link1->query("SELECT 'test'", MYSQLI_ASYNC); // fire the query without waiting for the result
$all_links = array($link1);
$processed = 0;

do {
    $links = $errors = $reject = array();
    foreach ($all_links as $link) {
        $links[] = $errors[] = $reject[] = $link;
    }

    // Wait up to 1 second for any of the connections to become ready.
    if (!mysqli_poll($links, $errors, $reject, 1)) {
        continue;
    }

    foreach ($links as $link) {
        if ($result = $link->reap_async_query()) {
            print_r($result->fetch_row());
            if (is_object($result)) {
                mysqli_free_result($result);
            }
        } else {
            die(sprintf("MySQLi Error: %s", mysqli_error($link)));
        }
        $processed++;
    }
} while ($processed < count($all_links));
?>
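For comparison, running several queries in flight needs one connection per asynchronous query, since a single connection can only have one outstanding async query at a time. A rough sketch, with connection details and queries reduced to placeholders:

<?php
// Sketch: several MYSQLI_ASYNC queries in flight, one connection each.
$queries = array(
    "SELECT SLEEP(1), 'a'",
    "SELECT SLEEP(1), 'b'",
    "SELECT SLEEP(1), 'c'",
);

$all_links = array();
foreach ($queries as $sql) {
    $link = mysqli_connect("localhost", "user", "pass", "db"); // placeholder credentials
    $link->query($sql, MYSQLI_ASYNC);
    $all_links[] = $link;
}

$processed = 0;
do {
    $links = $errors = $reject = array();
    foreach ($all_links as $link) {
        $links[] = $errors[] = $reject[] = $link;
    }
    if (!mysqli_poll($links, $errors, $reject, 1)) {
        continue; // nothing ready yet, poll again
    }
    foreach ($links as $link) {
        if ($result = $link->reap_async_query()) {
            print_r($result->fetch_row());
            mysqli_free_result($result);
        } else {
            die(sprintf("MySQLi Error: %s", mysqli_error($link)));
        }
        $processed++;
    }
} while ($processed < count($all_links));
?>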
I'll expand on my comments and try to explain why you won't gain any performance with your current setup.
Asynchronous, in your case, means that the process of retrieving data runs asynchronously relative to the rest of your code: the two moving parts, fetching the data and working with the data, are separate and are executed one after the other, but only once the data has arrived.
This implies that you want to utilize the CPU to its fullest, so you won't invoke PHP code until the data is ready.
In order for that to work, you must seize control of the PHP process and make it use one of the operating system's event interfaces (epoll on Linux, IOCP on Windows). Since PHP is either embedded into a web server (mod_php) or runs as its own standalone FastCGI server (php-fpm), the best use of asynchronous data fetching is from a CLI PHP script, since it's quite difficult to utilize event interfaces otherwise.
However, let's focus on your problem and why your code isn't faster.
You assumed that you are CPU bound, and your solution was to retrieve the data in chunks and process it that way. That's fine, but since none of it yields faster execution, it means you are 100% I/O bound.
The process of retrieving data from a database forces the hard disk to seek. No matter how much you "chunk" the work, if the disk is slow and the data is scattered across it, that part will stay slow, and creating more workers that each deal with part of the data will only make the system slower, since every worker hits the same problem retrieving its data.
I'd conclude that your issue lies in a slow hard disk and a data set that is too big, or laid out poorly, for chunked retrieval. I suggest updating this question, or creating another one, focused on how to retrieve the data faster and in a more optimal way.
I can't get into too many specifics as this is a project for work, but anyways..
I'm in the process of writing a SOAP client in PHP that pushes all responses to a MySQL database. My main script makes an initial soap request that retrieves a large set of items (approximately ~4000 at the moment, but the list is expected to grow into hundreds of thousands at some point).
Once this list of 4000 items is returned, I use exec("/usr/bin/php path/to/my/historyScript.php &") to send a history request for each item. The web service API supports up to 30 requests/sec. Below is some pseudo code for what I am currently doing:
$count = 0;
foreach( $items as $item )
{
    if ( $count == 30 )
    {
        sleep(1); // Sleep for one second before calling the next 30 requests
        $count = 0;
    }
    exec('/usr/bin/php path/to/history/script.php &');
    $count++;
}
The problem I'm running into is that I am unsure when the processes finish and my development server is starting to crash. Since data is expected to grow, I know this is a very poor solution to my problem.
Might there be a better approach I should consider for a task like this? I just feel that this is more of a 'hack'.
I am not sure, but I suspect the reason for your application crashing is that you are keeping a large data set in PHP variables; depending on the available RAM, the data size alone can bring the system down. My suggestion is to limit the amount of data the external service returns per request, rather than limiting the number of requests to the service.
I have a tricky problem. I am on basic shared hosting and have created a good scraping script using cURL and PHP.
Because multi-threading with cURL is not really multi-threading, and even the best cURL multi-threading scripts I have used only speed up the scraping by a factor of 1.5-2, I came to the conclusion that I need to run a massive number of cron tasks (like 50) per minute on my PHP script, which interacts with a MySQL table, in order to offer fast web scraping to my customers.
My problem is that I get a "MySQL server has gone away" error when lots of cron tasks run at the same time. If I decrease the number of cron tasks it continues to work, but always slowly.
I have also tried a browser-based solution that reloads the script every time the while loop finishes. It works better, but with the same problem: when I run the script 10 times at the same time, it begins to overload the MySQL server or the web server (I don't know which).
To resolve this I acquired a MySQL server where I can set my.cnf... but the problem stays approximately the same.
=========
My question is: where is the problem coming from? The table size? Do I need a big 100 Mbps dedicated server, and if so, are you sure it will resolve the problem, and how fast would it be? Bear in mind that I want the extraction speed to reach approximately 100 URLs per second (at the moment it is about 1 URL per 15 seconds, which is incredibly slow...).
There is only one while loop in the script. It loads each page, extracts data with preg_match or DOM, and inserts it into the MySQL database.
I extract lots of data, which is why a table quickly contains millions of entries... but when I remove them, maybe it goes a bit faster, though the core problem stays the same: it is impossible to run tasks massively in parallel in order to accelerate the process.
I don't think the problem is coming from my script. In any case, even perfectly optimized, it will not go as fast as I want.
I tested the script without proxies for scraping, but the difference is very small, not significant.
My conclusion is that I need to use a dedicated server, but I don't want to invest around $100 per month if I am not sure it will resolve the problem and let me run these massive numbers of cron tasks/calls against the MySQL database without problems.
I would have to see the code but essentially it does look like you are being rate limited by your host.
Is it possible to run your cron just once every minute or two, but batch the scrapes onto one SQL connection in your script?
Essentially, the goal would be to open the SQL socket once and run multiple URL scrapes on that connection, versus your current one scrape per MySQL connection, hopefully avoiding the rate limiting by your host.
Pseudo-code:
<?php
$link = mysqli_connect("127.0.0.1", "my_user", "my_password", "my_db");
$sql = "SELECT url FROM urls_table WHERE scraped='0' LIMIT 100";
$result = mysqli_query($link, $sql);

while ($row = mysqli_fetch_array($result, MYSQLI_NUM)) {
    $url_to_scrape = $row[0];
    //TODO: your scrape code goes here
}

//Only AFTER you've scraped multiple URLs do we close the connection.
//This drastically reduces the number of SQL connects and should help.
mysqli_close($link);
?>
It's simple enough: never run multiple threads against the same URL; many different URLs are fine. But try to respect a certain delay between requests. You can do that with:
$random = rand(15, 35); // in seconds
sleep($random);
I'm attempting to make a PHP script that loads the current weather forecast. It uses a bit of XML pre-processing to digest the input, but it is accessed quite often and reloaded. The problem begins with my current host, which (yes, I do understand why) limits the amount of processing power a script can take up.
It currently takes an entire process for every execution, which is around 3 seconds per execution. I'm limited to 12 processes, yet I get quite a few hits.
My question to you guys is: what methods, if any, can I use to cache the output of the script so that it does not have to re-process something it already did 5 minutes ago? Since it is weather, I can tolerate a time difference of up to 2 hours.
I am quite familiar with php too, so don't worry xD.
~Thank you very much,
Jonny :D
You could run a cron job that generates the weather forecast data and then just serve the whole thing from cache. You could use APC so it is always loaded in memory (plus all the other added advantages).
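As a minimal sketch, assuming APC is available and using a placeholder buildForecast() function (the key name and the 2-hour TTL are assumptions based on the question):

<?php
// Sketch: serve the forecast from APC, regenerating it only when the cache is empty or expired.
$forecast = apc_fetch('weather_forecast');
if ($forecast === false) {
    $forecast = buildForecast();                     // the expensive XML processing
    apc_store('weather_forecast', $forecast, 7200);  // keep it for up to 2 hours
}
echo $forecast;
?>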
The Zend Framework provides the Zend_Cache component with multiple backends (File, Memcached, APC). Or you can roll your own with something like:
$cacheFile = "/path/to/cache/file";
$ttl = 60; // 60 second time to live
if (!file_exists($cacheFile) || time() - filemtime($cacheFile) > $ttl) {
    $data = getWeatherData(); // Go off and get the fresh data
    file_put_contents($cacheFile, serialize($data));
} else {
    $data = unserialize(file_get_contents($cacheFile));
}
We'd need a code snippet to see what kind of processing you are doing; consider using Xdebug to better optimize your code.
Also, you may use a benchmarking tool such as ab (ApacheBench) to see how many requests your server can handle.
There are several different caching mechanisms available, but without seeing what kind of processing you are doing it is hard to say...
3 seconds is an extremely long execution time. As already asked, some code would be nice to see how you process the 'input' and what format said input is in.
A quick and dirty article about caching script output to a file can be found here:
http://codestips.com/?p=153