Quickly validate a large list of URLs in PHP?

I have a database of content with free text in it. There are about 11000 rows of data, and each row has 87 columns, so there are (potentially) around 957000 fields in which to check whether URLs are valid.
I used a regular expression to extract everything that looks like a URL (http/s, etc.) and built up an array called $urls. I then loop through it, passing each $url to my curl_exec() call.
I have tried cURL (for each $url):
$ch = curl_init();
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT_MS, 250);
curl_setopt($ch, CURLOPT_NOBODY, 1);
curl_setopt($ch, CURLOPT_FAILONERROR, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECT_ONLY, 1);
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_HTTPGET, 1);
foreach ($urls as $url) {
curl_setopt($ch, CURLOPT_URL, $url);
$exec = curl_exec($ch);
// Extra stuff here... it does add overhead, but not that much.
}
curl_close($ch);
As far as I can tell, this SHOULD work and be as fast as I can go, but it takes around 2-3 seconds per URL.
There has to be a faster way?
I am planning on running this via a cron job that first checks my local database to see whether a URL has been checked in the last 30 days, and only checks it again if not, so over time the workload will shrink. I just want to know if cURL is the best solution, and whether I am missing something that would make it faster.
EDIT:
Based on the comment by Nick Zulu below, I now have this code:
function ODB_check_url_array($urls, $debug = true) {
if (!empty($urls)) {
$mh = curl_multi_init();
foreach ($urls as $index => $url) {
$ch[$index] = curl_init($url);
curl_setopt($ch[$index], CURLOPT_CONNECTTIMEOUT_MS, 10000);
curl_setopt($ch[$index], CURLOPT_NOBODY, 1);
curl_setopt($ch[$index], CURLOPT_FAILONERROR, 1);
curl_setopt($ch[$index], CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch[$index], CURLOPT_CONNECT_ONLY, 1);
curl_setopt($ch[$index], CURLOPT_HEADER, 1);
curl_setopt($ch[$index], CURLOPT_HTTPGET, 1);
curl_multi_add_handle($mh, $ch[$index]);
}
$running = null;
do {
curl_multi_exec($mh, $running);
} while ($running);
foreach ($ch as $index => $response) {
$return[$ch[$index]] = curl_multi_getcontent($ch[$index]);
curl_multi_remove_handle($mh, $ch[$index]);
curl_close($ch[$index]);
}
curl_multi_close($mh);
return $return;
}
}

Let's see:
Use the curl_multi API (it's the only sane choice for doing this in PHP).
Have a maximum simultaneous connection limit; don't just create a connection for each URL (you'll get out-of-memory or out-of-resource errors if you create a million simultaneous connections, and I wouldn't even trust the timeout errors in that case).
Only fetch the headers, because downloading the body would be a waste of time and bandwidth.
Here is my attempt:
// if return_fault_reason is false, then the return is a simple array of strings of urls that validated.
// otherwise it's an array with the url as the key containing array(bool validated,int curl_error_code,string reason) for every url
// note: the (int) casts on curl handles below assume PHP < 8, where handles are resources; on PHP 8+ handles are CurlHandle objects, so use spl_object_id() for the keys instead
function validate_urls(array $urls, int $max_connections, int $timeout_ms = 10000, bool $consider_http_300_redirect_as_error = true, bool $return_fault_reason = false) : array
{
if ($max_connections < 1) {
throw new InvalidArgumentException("max_connections MUST be >=1");
}
foreach ($urls as $key => $foo) {
if (!is_string($foo)) {
throw new \InvalidArgumentException("all urls must be strings!");
}
if (empty($foo)) {
unset($urls[$key]); //?
}
}
unset($foo);
$urls = array_unique($urls); // remove duplicates.
$ret = array();
$mh = curl_multi_init();
$workers = array();
$work = function () use (&$ret, &$workers, &$mh, &$return_fault_reason, &$consider_http_300_redirect_as_error) {
// > If an added handle fails very quickly, it may never be counted as a running_handle
while (1) {
curl_multi_exec($mh, $still_running);
if ($still_running < count($workers)) {
break;
}
$cms=curl_multi_select($mh, 10);
//var_dump('sr: ' . $still_running . " c: " . count($workers)." cms: ".$cms);
}
while (false !== ($info = curl_multi_info_read($mh))) {
//echo "NOT FALSE!";
//var_dump($info);
{
if ($info['msg'] !== CURLMSG_DONE) {
continue;
}
if ($info['result'] !== CURLM_OK) {
if ($return_fault_reason) {
$ret[$workers[(int)$info['handle']]] = array(false, $info['result'], "curl_exec error " . $info['result'] . ": " . curl_strerror($info['result']));
}
} elseif (CURLE_OK !== ($err = curl_errno($info['handle']))) {
if ($return_fault_reason) {
$ret[$workers[(int)$info['handle']]] = array(false, $err, "curl error " . $err . ": " . curl_strerror($err));
}
} else {
$code = (string)curl_getinfo($info['handle'], CURLINFO_HTTP_CODE);
if ($code[0] === "3") {
if ($consider_http_300_redirect_as_error) {
if ($return_fault_reason) {
$ret[$workers[(int)$info['handle']]] = array(false, -1, "got a http " . $code . " redirect, which is considered an error");
}
} else {
if ($return_fault_reason) {
$ret[$workers[(int)$info['handle']]] = array(true, 0, "got a http " . $code . " redirect, which is considered a success");
} else {
$ret[] = $workers[(int)$info['handle']];
}
}
} elseif ($code[0] === "2") {
if ($return_fault_reason) {
$ret[$workers[(int)$info['handle']]] = array(true, 0, "got a http " . $code . " code, which is considered a success");
} else {
$ret[] = $workers[(int)$info['handle']];
}
} else {
// all non-2xx and non-3xx are always considered errors (500 internal server error, 400 client error, 404 not found, etcetc)
if ($return_fault_reason) {
$ret[$workers[(int)$info['handle']]] = array(false, -1, "got a http " . $code . " code, which is considered an error");
}
}
}
curl_multi_remove_handle($mh, $info['handle']);
assert(isset($workers[(int)$info['handle']]));
unset($workers[(int)$info['handle']]);
curl_close($info['handle']);
}
}
//echo "NO MORE INFO!";
};
foreach ($urls as $url) {
while (count($workers) >= $max_connections) {
//echo "TOO MANY WORKERS!\n";
$work();
}
$neww = curl_init($url);
if (!$neww) {
trigger_error("curl_init() failed! probably means that max_connections is too high and you ran out of resources", E_USER_WARNING);
if ($return_fault_reason) {
$ret[$url] = array(false, -1, "curl_init() failed");
}
continue;
}
$workers[(int)$neww] = $url;
curl_setopt_array($neww, array(
CURLOPT_NOBODY => 1,
CURLOPT_SSL_VERIFYHOST => 0,
CURLOPT_SSL_VERIFYPEER => 0,
CURLOPT_TIMEOUT_MS => $timeout_ms
));
curl_multi_add_handle($mh, $neww);
//curl_multi_exec($mh, $unused_here); LIKELY TO BE MUCH SLOWER IF DONE IN THIS LOOP: TOO MANY SYSCALLS
}
while (count($workers) > 0) {
//echo "WAITING FOR WORKERS TO BECOME 0!";
//var_dump(count($workers));
$work();
}
curl_multi_close($mh);
return $ret;
}
here is some test code
$urls = [
'www.example.org',
'www.google.com',
'https://www.google.com',
];
var_dump(validate_urls($urls, 1000, 1, true, false));
returns
array(0) {
}
because they all timed out (1 millisecond timeout), and fail reason reporting was disabled (that's the last argument). With fail reason reporting enabled,
$urls = [
'www.example.org',
'www.google.com',
'https://www.google.com',
];
var_dump(validate_urls($urls, 1000, 1, true, true));
returns
array(3) {
["www.example.org"]=>
array(3) {
[0]=>
bool(false)
[1]=>
int(28)
[2]=>
string(39) "curl_exec error 28: Timeout was reached"
}
["www.google.com"]=>
array(3) {
[0]=>
bool(false)
[1]=>
int(28)
[2]=>
string(39) "curl_exec error 28: Timeout was reached"
}
["https://www.google.com"]=>
array(3) {
[0]=>
bool(false)
[1]=>
int(28)
[2]=>
string(39) "curl_exec error 28: Timeout was reached"
}
}
increasing the timeout limit to 1000 we get
var_dump(validate_urls($urls, 1000, 1000, true, false));
=
array(3) {
[0]=>
string(14) "www.google.com"
[1]=>
string(22) "https://www.google.com"
[2]=>
string(15) "www.example.org"
}
and
var_dump(validate_urls($urls, 1000, 1000, true, true));
=
array(3) {
["www.google.com"]=>
array(3) {
[0]=>
bool(true)
[1]=>
int(0)
[2]=>
string(50) "got a http 200 code, which is considered a success"
}
["www.example.org"]=>
array(3) {
[0]=>
bool(true)
[1]=>
int(0)
[2]=>
string(50) "got a http 200 code, which is considered a success"
}
["https://www.google.com"]=>
array(3) {
[0]=>
bool(true)
[1]=>
int(0)
[2]=>
string(50) "got a http 200 code, which is considered a success"
}
}
and so on :) the speed should depend on your bandwidth and $max_connections variable, which is configurable.
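The 30-day re-check idea from the question can be sketched as a filter that runs before any network request is made. This is only a minimal sketch: urls_due_for_check() is a hypothetical helper, and $last_checked is assumed to be a url => unix timestamp map loaded from the local database.

```php
<?php
// Hypothetical helper: keep only URLs that have not been checked recently.
// $last_checked maps url => unix timestamp of the last check (assumed to
// come from the local database mentioned in the question).
function urls_due_for_check(array $urls, array $last_checked, int $now, int $max_age_days = 30): array
{
    $max_age_seconds = $max_age_days * 86400;
    $due = array();
    foreach (array_unique($urls) as $url) {
        // never checked, or checked too long ago: check it again
        if (!isset($last_checked[$url]) || ($now - $last_checked[$url]) >= $max_age_seconds) {
            $due[] = $url;
        }
    }
    return $due;
}

$now = 1700000000; // fixed timestamp so the example is reproducible
$urls = array('https://a.example/', 'https://b.example/', 'https://c.example/', 'https://a.example/');
$last_checked = array(
    'https://a.example/' => $now - 5 * 86400,  // checked 5 days ago: skip
    'https://b.example/' => $now - 45 * 86400, // checked 45 days ago: re-check
);
$due = urls_due_for_check($urls, $last_checked, $now);
var_dump($due);
```

Only the URLs in $due would then need to be handed to validate_urls(), which keeps each cron run small once most URLs have a recent result.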

This is the fastest I could get it real quick, by using a tiny ping:
$domains = ['google.nl', 'blablaasdasdasd.nl', 'bing.com'];
foreach($domains as $domain){
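// escapeshellarg() guards against shell injection; ping exits with 0 only when a reply was received
exec("ping -c1 -s1 -t1 ".escapeshellarg($domain)." 2>/dev/null", $output, $exit_code);
$exists = $exit_code===0;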
echo $domain.' '.($exists?'exists':'gone');
echo '<br />'.PHP_EOL;
}
-c: count (1 is enough)
-s: size (1 is all we need)
-t: timeout when there is no response. You might want to tweak this one. (On Linux, -t sets the TTL and the timeout flag is -W; -t is the timeout on macOS/BSD.)
Please keep in mind that some servers don't respond to ping. I don't know what percentage do, but I suggest implementing a better second check for all those that fail the ping check; that should be a significantly smaller set.

Related

How to get the result of executing curl_multi_exec?

I fetch JSON data asynchronously (curl_multi_exec).
How can I split the received data into two variables ($response_url and $response_url2)?
I need two variables to continue working with each JSON separately.
$urls = [
"https://rssbot.ru/1.json",
"https://rssbot.ru/2.json"
];
$mh = curl_multi_init();
$allResponse = [];
foreach($urls as $k => $url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_multi_add_handle($mh, $ch);
$allResponse[$k] = $ch;
}
$running = null;
do {
curl_multi_exec($mh, $running);
} while($running > 0);
foreach($allResponse as $id => $ch) {
$response = curl_multi_getcontent($ch);
curl_multi_remove_handle($mh, $ch);
$response = (json_decode($response));
var_dump($response);
}
curl_multi_close($mh);
echo $response_url;
echo $response_url2;
var_dump:
array(2) {
[0]=>
object(stdClass)#1 (3) {
["symbol"]=>
string(7) "XRPBUSD"
["price"]=>
string(6) "0.3400"
["time"]=>
int(1671427537235)
}
[1]=>
object(stdClass)#2 (3) {
["symbol"]=>
string(7) "MKRUSDT"
["price"]=>
string(6) "542.60"
["time"]=>
int(1671427559567)
}
}
array(3) {
[0]=>
object(stdClass)#2 (2) {
["symbol"]=>
string(6) "ETHBTC"
["price"]=>
string(10) "0.07081400"
}
[1]=>
object(stdClass)#1 (2) {
["symbol"]=>
string(6) "LTCBTC"
["price"]=>
string(10) "0.00377700"
}
[2]=>
object(stdClass)#3 (2) {
["symbol"]=>
string(6) "BNBBTC"
["price"]=>
string(10) "0.01482300"
}
}
Thanks!
$results = [];
$prev_running = $running = null;
do {
curl_multi_exec($mh, $running);
if ($running != $prev_running) {
$info = curl_multi_info_read($mh);
if (is_array($info) && ($ch = $info['handle'])) {
$content = curl_multi_getcontent($ch);
$url = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
$results[$url] = ['content' => $content, 'status' => $info['result'], 'status_text' => curl_error($ch)];
}
$prev_running = $running;
}
} while ($running > 0);
Use the list function.
An example:
$urls = [
"https://rssbot.ru/1.json",
"https://rssbot.ru/2.json"
];
list($a, $b) = $urls;
// $a : string(24) "https://rssbot.ru/1.json"
// $b : string(24) "https://rssbot.ru/2.json"
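The same destructuring can be applied to the decoded responses rather than to the URLs. A minimal sketch, assuming the response bodies have already been collected in the same order as $urls; the $bodies strings below are made-up stand-ins for what curl_multi_getcontent() would return, shaped like the dump in the question.

```php
<?php
// Stand-ins for the strings curl_multi_getcontent() would return, in the
// same order as $urls. The JSON is sample data, not real API output.
$bodies = array(
    '[{"symbol":"XRPBUSD","price":"0.3400"},{"symbol":"MKRUSDT","price":"542.60"}]',
    '[{"symbol":"ETHBTC","price":"0.07081400"}]',
);

// decode each body, keeping the original index
$decoded = array();
foreach ($bodies as $k => $body) {
    $decoded[$k] = json_decode($body);
}

// split the two decoded responses into separate variables
list($response_url, $response_url2) = $decoded;

echo $response_url[0]->symbol, "\n";
echo $response_url2[0]->symbol, "\n";
```

Keeping the index from the foreach over $urls is what makes the order predictable; the transfers themselves may finish in any order.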

Ordering by JSON

We are importing our JSON from an API. The JSON is pulling through fine, but unordered.
We want to order the JSON by the name field. We have used uasort, but it does not seem to take effect.
$url="https://dev-api.ourwebsite.com";
$ch = curl_init();
// Disable SSL verification
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
// Will return the response, if false it print the response
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Set the url
curl_setopt($ch, CURLOPT_URL,$url);
// Execute
$result=curl_exec($ch);
// DUMPING THE JSON
$json=json_decode($result, true);
uasort($json, 'name');
foreach($json as $value) {
$course_name=$value["name"];
}
usort() (or uasort() if you need to keep the keys of the array) is what you need:
<?php
// mocking some data
$json = [
["name" => "paul"],
["name" => "jeff"],
["name" => "anna"]
];
uasort($json,
// this callable must return an integer less than, equal to, or greater than zero
// when $a should sort before, the same as, or after $b
function($a, $b) {
return strcmp($a['name'], $b['name']);
}
});
var_dump($json);
foreach($json as $value) {
$course_name=$value["name"];
echo $course_name."<br>";
}
// output:
array(3) {
[2]=>
array(1) {
["name"]=>
string(4) "anna"
}
[1]=>
array(1) {
["name"]=>
string(4) "jeff"
}
[0]=>
array(1) {
["name"]=>
string(4) "paul"
}
}
anna
jeff
paul
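To make the difference between the two functions concrete, here is a small sketch using the same mocked data. It assumes PHP 7 or later for the <=> operator; $by_name, $keep_keys and $reindexed are names introduced only for this illustration.

```php
<?php
$json = array(
    array("name" => "paul"),
    array("name" => "jeff"),
    array("name" => "anna"),
);

// <=> returns -1, 0 or 1, which is what the sort functions expect
$by_name = function ($a, $b) {
    return $a['name'] <=> $b['name'];
};

$keep_keys = $json;
uasort($keep_keys, $by_name); // sorted, original keys preserved

$reindexed = $json;
usort($reindexed, $by_name);  // sorted, keys renumbered from 0

print_r(array_keys($keep_keys));
print_r(array_keys($reindexed));
```

If the sorted data is later re-encoded with json_encode(), usort() is usually the safer choice, because a list whose keys are not 0, 1, 2, ... in order is encoded as a JSON object instead of a JSON array.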

Newsletter2Go REST API - How to add new recipient to list?

Is it possible to add a new recipient via the REST API of Newsletter2Go?
I tried it like this (snippet):
public function subscribeAction()
{
$this->init();
$email = $this->getRequest()->getParam('email');
$gender = $this->getRequest()->getParam('gender');
$first_name = $this->getRequest()->getParam('first_name');
$last_name = $this->getRequest()->getParam('last_name');
$endpoint = "/recipients";
$data = array(
"list_id" => $this->listId,
"email" => $email,
"gender" => $gender,
"first_name" => $first_name,
"last_name" => $last_name,
);
$response = $this->curl($endpoint, $data);
var_dump($response);
}
/**
* @param $endpoint string the endpoint to call (see docs.newsletter2go.com)
* @param $data array the data to submit. In case of POST and PATCH it is submitted as the body of the request. In case of GET and DELETE it is used as GET-Params. See docs.newsletter2go.com for supported parameters.
* @param string $type GET,PATCH,POST,DELETE
* @return \stdClass
* @throws \Exception
*/
public function curl($endpoint, $data, $type = "GET")
{
if (!isset($this->access_token) || strlen($this->access_token) == 0) {
$this->getToken();
}
if (!isset($this->access_token) || strlen($this->access_token) == 0) {
throw new \Exception("Authentication failed");
}
return $this->_curl('Bearer ' . $this->access_token, $endpoint, $data, $type);
}
private function _curl($authorization, $endpoint, $data, $type = "GET")
{
$ch = curl_init();
$data_string = json_encode($data);
$get_params = "";
if ($type == static::METHOD_POST || $type == static::METHOD_PATCH) {
curl_setopt($ch, CURLOPT_POSTFIELDS, $data_string);
if ($type == static::METHOD_POST) {
curl_setopt($ch, CURLOPT_POST, true);
}
} else {
if ($type == static::METHOD_GET || $type == static::METHOD_DELETE) {
$get_params = "?" . http_build_query($data);
} else {
throw new \Exception("Invalid HTTP method: " . $type);
}
}
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, $type);
curl_setopt($ch, CURLOPT_URL, static::BASE_URL . $endpoint . $get_params);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
'Content-Type: application/json',
'Authorization: ' . $authorization,
'Content-Length: ' . (($type == static::METHOD_GET || $type == static::METHOD_DELETE) ? 0 : strlen($data_string)) // parentheses needed: "." binds tighter than "?:"
));
if (!$this->sslVerification) {
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
}
$response = curl_exec($ch);
curl_close($ch);
return json_decode($response);
}
https://docs.newsletter2go.com/#!/Recipient/createRecipient
But I get a big response object, even though I only try to add one single recipient, and the new recipient is not added to the list.
object(stdClass)#188 (3) {
["status"]=>
int(200)
["info"]=>
object(stdClass)#169 (3) {
["links"]=>
object(stdClass)#43 (3) {
["_href"]=>
string(60) "https://api.newsletter2go.com/recipients?_limit=50&_offset=0"
["_next"]=>
string(61) "https://api.newsletter2go.com/recipients?_limit=50&_offset=50"
["_last"]=>
string(63) "https://api.newsletter2go.com/recipients?_limit=50&_offset=3433"
}
["count"]=>
int(3483)
["additional"]=>
object(stdClass)#167 (1) {
["active"]=>
int(0)
}
}
["value"]=>
array(50) {
[0]=>
object(stdClass)#85 (2) {
["_href"]=>
string(49) "https://api.newsletter2go.com/recipients/n9yldrar"
["id"]=>
string(8) "n9yldrar"
}
[1]=>
object(stdClass)#185 (2) {
["_href"]=>
string(49) "https://api.newsletter2go.com/recipients/pgwvgwmr"
["id"]=>
string(8) "pgwvgwmr"
}
...
[49]=>
object(stdClass)#87 (2) {
["_href"]=>
string(49) "https://api.newsletter2go.com/recipients/usa0dx4n"
["id"]=>
string(8) "usa0dx4n"
}
You must use a POST request to add a recipient; right now you are making a GET request. Change
$response = $this->curl($endpoint, $data);
to
$response = $this->curl($endpoint, $data, 'POST');
then it should work!
GET requests are mostly used to retrieve data, while POST requests are used to create or change data. The GET request you've posted returns all your recipients.
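Why the default call returned a recipient list can be seen by looking at where the data ends up for each method. The sketch below mirrors the branching in _curl() without sending anything; describe_request() is a hypothetical helper, not part of the Newsletter2Go client, and the list_id and email values are made up.

```php
<?php
// Hypothetical helper: it only reports where the data would end up for a
// given HTTP method, following the same branching as _curl() above.
function describe_request(string $endpoint, array $data, string $type = "GET"): array
{
    if ($type === "POST" || $type === "PATCH") {
        // data travels in the request body as JSON
        return array('method' => $type, 'path' => $endpoint, 'body' => json_encode($data));
    }
    if ($type === "GET" || $type === "DELETE") {
        // data is appended to the URL as query parameters, no body is sent
        return array('method' => $type, 'path' => $endpoint . "?" . http_build_query($data), 'body' => null);
    }
    throw new InvalidArgumentException("Invalid HTTP method: " . $type);
}

$data = array("list_id" => "abc123", "email" => "jane@example.com");
$get = describe_request("/recipients", $data);          // default type is GET
$post = describe_request("/recipients", $data, "POST");

print_r($get);
print_r($post);
```

With the default type the recipient fields become query parameters on /recipients, which is a read; only the POST variant carries them in a body that can create a recipient.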

not receiving any values from php server

I want to retrieve all names from my database and send them to all the registered phones via GCM.
I get an error on the Android side:
Error
java.lang.NullPointerException: Attempt to invoke virtual method 'java.lang.Object java.util.ArrayList.get(int)' on a null object reference
at reminder.com.org.remind.GCMPushReceiverService.onMessageReceived(GCMPushReceiverService.java:27)
this is the result of var_dump($message)
array(1) { ["message"]=> array(3) { [0]=> string(5) "name1" [1]=> string(5) "name2" [2]=> string(5) "name3" } }
Please help. Thanks.
PHP Server Code
<?php
include('db_connect.php');
DEFINE('GOOGLE_API_KEY','my_google_api_key');
$db = new DB_CONNECT();
$conn = $db->connect();
$gcmRegids = array();
$names = array();
$sql = "SELECT * FROM Test";
$result = $conn->query($sql);
while ($row = $result->fetch_assoc()) {
array_push($gcmRegids, $row['reg_id']);
array_push($names, $row['name']);
}
if(isset($gcmRegids)) {
$e = "ads";
$message = array('message' => $names);
var_dump($message);
$pushStatus = sendPushNotification($gcmRegids,$message);
}
function sendPushNotification($reg_ids, $message) {
$url = 'https://android.googleapis.com/gcm/send';
$fields = array(
'registration_ids' => $reg_ids,
'data' => $message,
);
$headers = array (
'Authorization: key='. GOOGLE_API_KEY,
'Content-type: application/json'
);
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_HTTPHEADER,$headers);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER,false);
curl_setopt($ch, CURLOPT_POSTFIELDS,json_encode($fields));
$result = curl_exec($ch);
if($result === false) {
die('Curl failed:'. curl_error($ch));
}
curl_close($ch);
//echo $result;
return $result;
}
?>
array(1) { ["message"]=> array(3) { [0]=> string(5) "name1" [1]=> string(5) "name2" [2]=> string(5) "name3" } }
Android code
public class GCMPushReceiverService extends GcmListenerService {
public final static String s = "msg";
ArrayList<String> arr = new ArrayList<String>();
@Override
public void onMessageReceived(String from, Bundle data) {
if(data != null) {
arr = data.getStringArrayList("message");
Log.d("Names:",arr.get(0));
}
}
.....
}
You can't simply pass a PHP array into an Android/Java array; you have to convert it. Try changing your PHP script to:
$message = array('message' => json_encode($names));
This converts your PHP array ($names) into JSON.
Then get the JSON in Android as:
String message = data.getString("message");
Log.e(TAG, "Message: " + message);
Now convert it into a JSONArray (org.json; the constructor throws JSONException):
JSONArray array = new JSONArray(message);
Now loop over it to get the contents of the PHP array of names.
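On the PHP side, the effect of the extra json_encode() can be checked without sending anything. A minimal sketch: the registration id is a made-up placeholder, and the decode step at the end only simulates in PHP what the Android receiver would do with the string.

```php
<?php
$names = array("name1", "name2", "name3");

// encode the list as one JSON string so the "message" field is a plain string
$message = array('message' => json_encode($names));

$fields = array(
    'registration_ids' => array('hypothetical-registration-id'), // placeholder
    'data' => $message,
);
$payload = json_encode($fields);
echo $payload, "\n";

// simulate the receiver: read "message" as a string, then parse it as JSON
$received = json_decode($payload, true);
$decoded_names = json_decode($received['data']['message'], true);
print_r($decoded_names);
```

The point of the round trip is that "message" arrives as a single string value, which is what data.getString("message") expects.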
I think this is where you are going wrong.
Change your code as below and try again:
if (isset($gcmRegids) && count($gcmRegids) > 0) {
$e = "ads";
$message = array('message' => $names);
var_dump($message);
foreach ($gcmRegids as $gcmid) {
// registration_ids must be an array, even for a single id
$pushStatus = sendPushNotification(array($gcmid), $message);
}
}

PHP/JSON - how to parsing before import to mysql

I want to save some items from JSON to a database.
The JSON source is here:
http://api.worldoftanks.eu/2.0/account/tanks/?application_id=d0a293dc77667c9328783d489c8cef73&account_id=500014703
I want to save only these values to the database:
"tank_id":
"battles":
"wins":
from each section.
To check the JSON source I am using this part of the script, which works fine:
function readWotResponse($url, $isUrl = true)
{
if ($isUrl && !ini_get('allow_url_fopen')) {
throw new Exception('allow_url_fopen = Off');
}
$content = file_get_contents($url);
if ($isUrl) {
$status_code = substr($http_response_header[0], 9, 3);
if ($status_code != 200) {
throw new Exception('Got status code ' . $status_code . ' expected 200');
}
}
$json = json_decode($content);
if (!$json) {
throw new Exception('Response contained an invalid JSON response');
}
if ($json->status != 'ok') {
throw new Exception($json->status_code);
}
return $json;
}
For parsing I tried:
$tanks_id_url = "http://api.worldoftanks.eu/2.0/account/tanks/?application_id=d0a293dc77667c9328783d489c8cef73&account_id=500014703";
try {
$tjson = readWotResponse($tanks_id_url);
print_r($tjson);
} catch (Exception $e) {
// a try block needs a catch (or finally) to be valid PHP
echo $e->getMessage();
}
Now I need help parsing this properly to get the right values.
When I use var_dump I get this:
object(stdClass)#1 (3) { ["status"]=> string(2) "ok" ["count"]=>
int(1) ["data"]=> object(stdClass)#2 (1) {
["500014703"]=>
array(106) {
[0]=>
object(stdClass)#3 (6) {
["achievements"]=>
object(stdClass)#4 (59) {
["medal_dumitru"]=>
int(0) . . . . .
This part is the problem for me:
array(106) {
[0]=>
because when I try to use
$value = $tjson->data->500014703->O->achievements->medal_dumitru;
I get an error :-(
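One way to read such a structure is the curly-brace property syntax for the numeric key, plus a normal array index for the list (the number 0, not the letter O). The sketch below runs against a small hand-written sample shaped like the var_dump above. Where "battles" and "wins" sit inside each tank entry is an assumption, since the dump is truncated, so adjust the property paths to the real response.

```php
<?php
// Hand-written sample shaped like the var_dump in the question. Placing
// "battles" and "wins" directly on each tank entry is an assumption.
$content = '{"status":"ok","count":1,"data":{"500014703":[
    {"tank_id":1,"battles":10,"wins":6,"achievements":{"medal_dumitru":0}},
    {"tank_id":2,"battles":4,"wins":1,"achievements":{"medal_dumitru":2}}
]}}';
$tjson = json_decode($content);

$account_id = '500014703';

// a property whose name is numeric must be wrapped in {...};
// the tank list is an array, so it is indexed with [0]
$value = $tjson->data->{$account_id}[0]->achievements->medal_dumitru;

// collect only the fields that should go into the database
$rows = array();
foreach ($tjson->data->{$account_id} as $tank) {
    $rows[] = array(
        'tank_id' => $tank->tank_id,
        'battles' => $tank->battles,
        'wins'    => $tank->wins,
    );
}
print_r($rows);
```

Each element of $rows could then be bound to a prepared INSERT statement; alternatively, json_decode($content, true) returns plain arrays, where $tjson['data']['500014703'][0] works without the brace syntax.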
