I have a website scraping project. Look at this code:
<?php
include('db.php');

$r    = mysql_query("SELECT * FROM urltable");
$rows = mysql_num_rows($r);

for ($j = 0; $j < $rows; ++$j) {
    // Fetch the next row and download the page at its URL.
    $row  = mysql_fetch_assoc($r);
    $html = file_get_contents($row['url']);

    // Write the HTML to 0.txt, 1.txt, ...
    $file = fopen($j . ".txt", "w");
    fwrite($file, $html);
    fclose($file);
}
?>
I have a list of URLs. This code creates one text file containing the HTML contents of each URL.
When I run it, I can only produce about one file per second (each file is roughly 20 KB). My connection provides 3 Mbps download speed, but this code doesn't come close to using it.
How do I speed up file_get_contents()? Or how can I speed this code up with threading, a php.ini setting, or any other method?
As this was not one of the suggestions on the duplicate page I will add it here.
Take a close look at the curl_multi page in the PHP manual.
It's not totally straightforward, but once you get it running it's very fast.
Basically you issue multiple cURL requests and then collect the data as and when each one returns. The responses come back in any order, so a bit of bookkeeping is required.
I have used this on a data collection process to reduce 3-4 hours of processing to 30 minutes.
The only issue could be that you swamp a site with multiple requests and the owner considers that an issue and bans your access. But with a bit of sensible sleep()'ing added to your process you should be able to reduce that possibility to a minimum.
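To make that concrete, here is a minimal sketch of the curl_multi approach applied to the original task (saving each URL's HTML to a numbered file). It assumes the URLs have already been read out of urltable into a $urls array; the option values are only reasonable defaults, not requirements.
$urls = array(/* ... the URLs from urltable ... */);

$mh      = curl_multi_init();
$handles = array();

foreach ($urls as $i => $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    curl_multi_add_handle($mh, $ch);
    $handles[$i] = $ch;
}

// Drive all transfers in parallel until every handle has finished.
do {
    $status = curl_multi_exec($mh, $active);
    if ($active) {
        curl_multi_select($mh); // wait for activity instead of busy-looping
    }
} while ($active && $status == CURLM_OK);

// Collect the results (they completed in arbitrary order).
foreach ($handles as $i => $ch) {
    file_put_contents($i . ".txt", curl_multi_getcontent($ch));
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);
For a large URL list, process it in chunks (array_chunk) and sleep() between chunks so you don't hammer the target site, as noted above.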
You can add a few controls with streams.
But cURL should be much better, if available.
$stream_options = array(
    'http' => array(
        'method'        => 'GET',
        'header'        => 'Accept-language: en',
        'timeout'       => 30,
        'ignore_errors' => true,
    ),
);

$stream_context = stream_context_create($stream_options);
$fc = file_get_contents($url, false, $stream_context);
I want to run more than 2,500 calls at the same time, so I have created batches of 100 (2500/100 = 25 batches in total).
// REQUEST_BATCH_LIMIT = 100
$insert_chunks = array_chunk(['array', 'i want', 'to', 'insert'], REQUEST_BATCH_LIMIT);

$mh = $running = $ch = [];

foreach ($insert_chunks as $chunk_key => $insert_chunk) {
    $mh[$chunk_key] = curl_multi_init();
    $ch[$chunk_key] = [];

    // Create one handle per record in this batch.
    foreach ($insert_chunk as $ch_key => $_POST) {
        $ch[$chunk_key][$ch_key] = curl_init('[Dynamic path of API]');
        curl_setopt($ch[$chunk_key][$ch_key], CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($mh[$chunk_key], $ch[$chunk_key][$ch_key]);
    }

    // Run the whole batch in parallel.
    do {
        curl_multi_exec($mh[$chunk_key], $running[$chunk_key]);
        curl_multi_select($mh[$chunk_key]);
    } while ($running[$chunk_key] > 0);

    // Collect the responses.
    foreach (array_keys($ch[$chunk_key]) as $ch_key) {
        $response      = curl_getinfo($ch[$chunk_key][$ch_key]);
        $returned_data = curl_multi_getcontent($ch[$chunk_key][$ch_key]);
        curl_multi_remove_handle($mh[$chunk_key], $ch[$chunk_key][$ch_key]);
    }

    curl_multi_close($mh[$chunk_key]);
}
When I run this locally, the system hangs completely.
Also, the batch size that works (100, 500, etc.) is not the same across different devices and servers. What is the reason for that, and what should I change to increase it?
If I add 1,000 records in batches of 50, every batch should insert 50 records, but each batch actually inserts a random number like 40, 42, or 48. Why are these calls being skipped? (If I send one record at a time with a simple cURL loop, it works fine.)
P.S. I am using this code for the BigCommerce API.
The BigCommerce API definitely throttles requests. The limits are different depending on which plan you are on.
https://support.bigcommerce.com/s/article/Platform-Limits
The "Standard Plan" is 20,000 per hour. I'm not sure how that is really implemented, though, because in my own experience, I've been throttled before hitting 20,000 requests in an hour.
As Nico Haase suggests, the key is for you to log every response you get from the BigCommerce API. While not a perfect system, they do usually provide a response that is helpful to understand the failure.
I run a process that makes thousands of API requests every day. I do sometimes have requests that fail as if the BigCommerce API simply dropped the connection.
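To illustrate the logging suggestion, here is a rough sketch of what the collection loop in the question's code could record per handle. It only uses the handles already held in $ch[$chunk_key]; the status-code thresholds are general HTTP conventions, not anything BigCommerce-specific.
foreach ($ch[$chunk_key] as $ch_key => $handle) {
    $http_code = curl_getinfo($handle, CURLINFO_HTTP_CODE);
    $body      = curl_multi_getcontent($handle);

    // 0 usually means the connection was dropped before any response arrived;
    // 429 typically indicates rate limiting; other 4xx/5xx are other failures.
    if ($http_code === 0 || $http_code >= 400) {
        error_log("Batch {$chunk_key}, call {$ch_key} failed: HTTP {$http_code} - {$body}");
    }

    curl_multi_remove_handle($mh[$chunk_key], $handle);
}
Once every response is logged, the "randomly skipped" inserts usually turn out to be throttled or dropped calls that can be retried.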
I have to make a thousand requests to the IGDB API and I am having trouble making this work. Every time I run my script, it loads for a while and then my web host tells me "Error: there is a problem... It seems that something went wrong." (not very helpful, I know).
Since I believe the issue comes from the number of requests, I have tried reducing it, but I am down to 60 requests with a 4-second pause between each and still have no success.
My latest try:
$splice = array_splice($array, 0, 60);

foreach ($splice as $key => $value) {
    $request = wp_remote_get(
        'https://igdbcom-internet-game-database-v1.p.mashape.com/games/?fields=*&search=' . $value['Name'],
        array('headers' => array(
            'Accept'        => 'application/json',
            'X-Mashape-Key' => 'Key',
        ))
    );

    $body     = wp_remote_retrieve_body($request);
    $data_api = json_decode($body, true);

    sleep(4);
}
Would anyone know what I am doing wrong? I am running out of ideas...
That's very likely to be nothing more than the timeout response from either PHP or the server.
Although there are ways to get around these safeguards, they exist for a reason.
You should use the CLI to execute these long-running jobs, not CGI. CGI access is for regular users, whatever their role/privileges. As a developer, you have access to the code and to the server (or at least your sysadmin does if you are in a team). You SHOULD use the command line for these queries. It will take less time, will be less likely to fail, and you'll see the error logs printed right away unless you redirect them to a file.
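As a rough sketch of that approach, using the loop from the question and assuming WordPress is bootstrapped on the command line (for example by running the file through WP-CLI with wp eval-file):
// run-igdb-fetch.php - hypothetical CLI script, run e.g. as:
//   wp eval-file run-igdb-fetch.php
// On the CLI there is no web-server timeout, but it still helps to lift
// PHP's own execution limit and keep a pause between requests.
set_time_limit(0);

foreach ($array as $value) {
    $request = wp_remote_get(
        'https://igdbcom-internet-game-database-v1.p.mashape.com/games/?fields=*&search=' . urlencode($value['Name']),
        array('headers' => array(
            'Accept'        => 'application/json',
            'X-Mashape-Key' => 'Key',
        ))
    );
    $data_api = json_decode(wp_remote_retrieve_body($request), true);

    // ... persist $data_api somewhere (database, file, transient, ...) ...

    sleep(4); // stay polite with the API
}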
I'm working on a template animation system, so I have different folders in S3 with different files inside (HTML, images, etc.).
What I do is:
I change the folder policy like this:
function changeFolderPolicy($folderPath, $client = null, $public) {
    if (!$client) {
        $client = getClientS3();
    }

    $effect = 'Allow';
    if (!$public) {
        $effect = 'Deny';
    }

    $policy = json_encode(array(
        'Statement' => array(
            array(
                'Sid'    => 'AllowPublicRead',
                'Action' => array(
                    's3:GetObject'
                ),
                'Effect'   => $effect,
                'Resource' => array(
                    "arn:aws:s3:::" . __bucketS3__ . "/" . $folderPath . "*"
                ),
                'Principal' => array(
                    'AWS' => array(
                        "*"
                    )
                )
            )
        )
    ));

    $client->putBucketPolicy(array(
        'Bucket' => __bucketS3__,
        'Policy' => $policy
    ));
}
After changing the policy, the frontend gets all the necessary files.
However, sometimes some files aren't loaded because of a 403 Forbidden. It's not always the same files; sometimes all are loaded, sometimes none... I don't have a clue why, since putBucketPolicy is a synchronous call.
Thank you very much.
First, putBucketPolicy is not exactly synchronous. The validation of the policy is synchronous, but applying the policy takes a nonspecific amount of time to replicate through the infrastructure.
There is no mechanism exposed for determining whether the policy has propagated.
Second, you're using bucket policies in a way that fundamentally makes no sense.
Of course, this setup makes the implicit assumption that only one copy of this code would ever run at the same time, which is usually an unsafe assumption, even if it seems true right now.
But worse... toggling a prefix to be publicly readable so you can copy those files, then (presumably) putting it back when you're done, instead of using the service correctly by using your credentials to sign requests for the individual objects you need... frankly, if I am correctly understanding what you're doing here, I am at a loss for words to describe just how wrong this solution is.
This seems comparable to a bank manager securing the bank vault with a bicycle lock instead of using the vault's hardened, high-security, built-in access-control mechanisms because a bicycle lock "is easier to open."
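For comparison, here is a sketch of the recommended approach, assuming the AWS SDK for PHP v3 (so $client is an Aws\S3\S3Client, as presumably returned by the question's getClientS3()). The objects stay private, and the frontend is handed a short-lived pre-signed URL per file instead.
function getPresignedUrl($client, $key, $expires = '+10 minutes') {
    // Build the GetObject command for the one object the frontend needs.
    $cmd = $client->getCommand('GetObject', array(
        'Bucket' => __bucketS3__,
        'Key'    => $key,
    ));

    // Sign it with your own credentials; no bucket policy changes required.
    $request = $client->createPresignedRequest($cmd, $expires);

    return (string) $request->getUri();
}
Because each URL is signed per object and expires on its own, there is no shared mutable state to race on, unlike the toggled bucket policy.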
I know that this PHP function creates and returns a stream context with any options supplied in the options array preset. I also know that I can use it to do what I want, which is to pass my username and password credentials... But I still don't quite get how to use it or what exactly it does; the complicated wording in the description is really confusing me. I have the code from the example, but I don't know how I can use this function to pass credentials to www.confluence.com (only I can access it since it's on an Apache server). Can someone please explain or give an example of how I can use it to pass the credentials?
EDIT: Here is the overall summary (pretty short, so I don't know if you'd call it a summary). I have been assigned an app to build, and there are about ten different ways to do it; due to the limitations I've been given, I can only work with the one way I found. There is a much easier approach using Google Calendar, but they refuse it for security reasons, so I am stuck with the Confluence calendar. To make this harder, Confluence is hosted by an external company, so I cannot even use file_get_contents to access the Confluence calendar directly, because it asks for login credentials. I've spent hours over the past three weeks finding solutions only to have them rejected. I've finally got file_get_contents working, but now the login credentials are the sticking point, and I have never dealt with this before. I have absolutely no idea what stream_context_create does or how to use it to pass my credentials. I can log in to Confluence manually with my credentials, but my code cannot. Confluence is on an Apache server, so other people cannot access it, and I cannot share my login details since that is a company security issue. Sorry if I made no sense; I am very frustrated and my mind is half dead from coding and reading.
Believe me, I have done a lot of searching and googling... but I've finally reached the point where I'm stuck.
<?php
$opts = array(
    'http' => array(
        'method' => "GET",
        'header' => "Accept-language: en\r\n" .
                    "Cookie: foo=bar\r\n"
    )
);

$context = stream_context_create($opts);

/* Sends an HTTP request to www.example.com
   with the additional headers shown above */
$fp = fopen('http://www.example.com', 'r', false, $context);
fpassthru($fp);
fclose($fp);
?>
I found some additional code here that is similar to what I am doing, but I'm not quite sure what to do with it or how to modify it.
$data = array(
    'account'  => 'javier',
    'password' => '12345',
    'submit'   => 'SUBMIT'
);

$content = file_get_contents('http://localhost/misc/login.php', false, stream_context_create(array(
    'http' => array(
        'method'  => 'POST',
        'content' => http_build_query($data)
    )
)));

$sFind  = 'Logged in';
$search = strpos($content, $sFind);
echo $content;

if ($search === false) {
    echo 'Invalid Account';
} else {
    echo 'Valid Account';
}
Another idea I have is to use JavaScript to do the login... I don't know which is better, but I am not allowed to download any libraries, so suggestions are welcome.
You are using the 'GET' method; it is uncommon for login information to be passed like this.
Are you sure the page you are requesting doesn't use POST?
Also, personally, when retrieving webpages, and especially when sending POST or GET variables, I prefer to use cURL.
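Here is a rough sketch of the cURL approach: POST the credentials to the login page, keep the session cookie in a cookie jar, and reuse it for the page you actually want. The URL and the field names (os_username, os_password) are placeholders; use whatever the Confluence login form actually posts.
$cookieJar = tempnam(sys_get_temp_dir(), 'cookies');

$ch = curl_init('https://confluence.example.com/dologin.action'); // placeholder URL
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array(
    'os_username' => 'your-username',   // placeholder field names
    'os_password' => 'your-password',
)));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieJar);   // write the session cookie here
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieJar);  // and send it back on later requests
curl_exec($ch);

// The jar now holds the session cookie, so this request is authenticated.
curl_setopt($ch, CURLOPT_URL, 'https://confluence.example.com/calendar-page'); // placeholder
curl_setopt($ch, CURLOPT_HTTPGET, true);
$page = curl_exec($ch);
curl_close($ch);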
I have a PHP site, www.test.com.
On the index page of this site, I am updating another site's (www.pgh.com) database with the following PHP code.
$url = "https://pgh.com/test.php?userID=".$userName. "&password=" .$password ;
$response = file_get_contents($url);
But now the site www.pgh.com is down, so it is also affecting my site www.test.com.
How can I add some exception handling or another safeguard to this code, so that my site keeps working when the other site is down?
$response = file_get_contents($url);

if ($response === false) {
    // Handle the error (file_get_contents returns false on failure)
}
From the PHP manual
Adding a timeout:
$ctx = stream_context_create(array(
    'http' => array(
        'timeout' => 1
    )
));

file_get_contents("http://example.com/", false, $ctx);
file_get_contents returns false on failure.
You have two options:
Add a timeout to the file_get_contents call using a stream context created with stream_context_create() (the manual has good examples; see the docs for the HTTP context's timeout option). This is not perfect, as even a one-second timeout will cause a noticeable pause when loading the page.
More complex but better: use a caching mechanism. Do the file_get_contents request in a separate script that is run regularly (e.g. every 15 minutes) by a cron job, if you have access to one. Write the result to a local file, which your actual script then reads; a sketch of this follows below.
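A minimal sketch of that caching setup, with hypothetical file names and paths (the cron schedule and the cache location are assumptions, not requirements):
// fetch-pgh.php - run by cron, e.g.:  */15 * * * * php /path/to/fetch-pgh.php
// $url is built the same way as in the question.
$ctx = stream_context_create(array('http' => array('timeout' => 5)));
$response = file_get_contents($url, false, $ctx);
if ($response !== false) {
    file_put_contents('/path/to/cache/pgh_response.txt', $response);
}

// index.php - never talks to pgh.com directly, it only reads the cache:
$cacheFile = '/path/to/cache/pgh_response.txt';
$response  = is_readable($cacheFile) ? file_get_contents($cacheFile) : false;
if ($response === false) {
    // pgh.com has been unreachable and no cache exists yet; degrade gracefully
}
This way, if pgh.com goes down, index.php keeps serving the last cached response instead of hanging or failing.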