Attempting to get array with curl_exec [duplicate] - php

This question already has answers here:
How can I access an array/object?
(6 answers)
Closed 9 months ago.
I am trying to get an array from the AnimeCharactersDatabase in order to then produce a table with the results. I had it working once before but cannot remember how I got it to work.
Looking at the url (http://www.animecharactersdatabase.com/api_series_characters.php?character_q=Usagi), "search_results" should be itself an array of characters which have arrays of info within them.
<?php
$url = "http://www.animecharactersdatabase.com/api_series_characters.php?character_q=Usagi";

/* gets the data from a URL */
function get_acdb($url)
{
    //ACDB requires certain agents for the query per their documentation.
    $agents = array(
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:7.0.1) Gecko/20100101 Firefox/7.0.1',
        'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.9) Gecko/20100508 SeaMonkey/2.0.4',
        'Mozilla/5.0 (Windows; U; MSIE 7.0; Windows NT 6.0; en-US)',
        'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_7; da-dk) AppleWebKit/533.21.1 (KHTML, like Gecko) Version/5.0.5 Safari/533.21.1',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0 Waterfox/56.2.14',
        'Lynx/2.8.7dev.4 libwww-FM/2.14 SSL-MM/1.4.1 OpenSSL/0.9.8d',
        'Lynx/2.8.9dev.8 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/3.4.9',
        'Lynx/2.8.3dev.9 libwww-FM/2.14 SSL-MM/1.4.1 OpenSSL/0.9.6',
        'Opera/9.80 (Windows NT 5.3; U; x64; en-US) Presto/2.12.388 Version/12.18'
    );
    $ch = curl_init();
    $timeout = 5;
    curl_setopt($ch, CURLOPT_URL, $url);
    //I believe this should make it return the data, not just true.
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_USERAGENT, $agents[array_rand($agents)]);
    $data = curl_exec($ch);
    curl_close($ch);
    return json_decode($data, true);
}
/* Parse the data into Characters. */
$arr = get_acdb($url);
$arr = array($arr["search_results"]);
//Upon testing, this is always an array of 1, and $arr[0] shows no value. This is where I need help, making the foreach loop actually add characters to the new array SearchArray.
$i = -1;
foreach ($arr[0] as $xyz) {
    //I never get inside this loop because $arr[0] doesn't seem to be an array.
    $i = $i + 1;
    $CharName   = $arr[0][$i]["name"];
    $CharID     = $arr[0][$i]["id"];
    $SeriesID   = $arr[0][$i]["anime_id"];
    $SeriesName = $arr[0][$i]["anime_name"];
    $medialenA  = strlen($CharName) + 25;
    $medialenB  = strlen($SeriesName) + 2;
    $mediatype  = substr($arr[0][$i]["desc"], $medialenA);
    $mediatype  = substr($mediatype, 0, strlen($mediatype) - $medialenB);
    $CharSex    = $arr[0][$i]["gender"];
    //Add relevant matches to the array.
    $SearchArray[] = array(
        'CharID'     => $CharID,
        'Name'       => $CharName,
        'SeriesID'   => $SeriesID,
        'SeriesName' => $SeriesName,
        'Sex'        => $CharSex,
    );
}
?>

I have modified your code a bit, but it is fully functional:
<?php
$url = "https://www.animecharactersdatabase.com/api_series_characters.php?character_q=Usagi";

function getData($url)
{
    $ch = curl_init();
    $timeout = 5;
    curl_setopt($ch, CURLOPT_URL, $url);
    //I believe this should make it return the data, not just true.
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:7.0.1) Gecko/20100101 Firefox/7.0.1');
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}

$results = getData($url);
$objectResults = json_decode($results);
foreach ($objectResults->search_results as $character) {
    echo $character->anime_id . "\n";
    echo $character->anime_name . "\n";
    echo $character->anime_image . "\n";
    echo $character->character_image . "\n";
    echo $character->id . "\n";
    echo $character->gender . "\n";
    echo $character->name . "\n";
    echo $character->desc . "\n";
    echo "\n\n";
}
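One difference worth noting between this answer and the question's code: json_decode's second argument controls whether you get objects or associative arrays. A small self-contained sketch (the sample payload below is made up to mirror the API's shape, not real ACDB data):

```php
<?php
// Made-up sample shaped like the ACDB response, for illustration only.
$json = '{"search_results":[{"id":1,"name":"Usagi Tsukino","anime_id":99,"anime_name":"Sailor Moon","gender":"Female"}]}';

// Second argument true => nested associative arrays ($arr["key"] style).
$arr = json_decode($json, true);
echo $arr['search_results'][0]['name'] . "\n";

// Second argument omitted => stdClass objects ($obj->key style).
$obj = json_decode($json);
echo $obj->search_results[0]->name . "\n";
```

Either style works; just don't mix them, and don't re-wrap the decoded result in another `array(...)`.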

Related

How to get title from URL in PHP from sites returning 403 Forbidden

I am trying to get the title of a few pages in PHP with this code. It works fine with almost every link except for a few, for example, with 9gag.
function download_page($url)
{
    $agent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36';
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_VERBOSE, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    curl_setopt($ch, CURLOPT_URL, $url);
    $data = curl_exec($ch);
    return $data;
}

function get_title_tag($str)
{
    $pattern = '/<title[^>]*>(.*?)<\/title>/is';
    if (preg_match_all($pattern, $str, $out)) {
        return $out[1][0];
    }
    return false;
}

$url = "https://9gag.com/gag/avPBX3b";
$data = download_page($url);
echo $extracted_title = get_title_tag($data);
It echoes
Attention Required! | Cloudflare
which seems to be a Cloudflare bot verification page. But when I try to post this link on any social network, they are able to get the title and all the metadata required. How is that possible?
Edit:
Even if I use the opengraph.io API, I get:
{
    "root": {
        "error": {
            "code": 2005,
            "message": "Got 403 error from server."
        }
    }
}
Just replace the agent string and it should work OK. From:
$agent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36';
to:
$agent = 'facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)';
It appears that Cloudflare enables captcha verification when standard browser agent strings are present, so this easily bypasses it. I'm puzzled by the security logic here, but that is outside the scope of this question.
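Whichever agent string finally gets past Cloudflare, the question's get_title_tag() can be verified on its own against a static snippet, so you know the parsing step is sound once the fetch succeeds:

```php
<?php
// The question's regex-based extractor, exercised on static HTML
// so no HTTP request is needed.
function get_title_tag($str)
{
    $pattern = '/<title[^>]*>(.*?)<\/title>/is';
    if (preg_match_all($pattern, $str, $out)) {
        return $out[1][0];
    }
    return false;
}

$html = '<html><head><title>Example Page</title></head><body></body></html>';
echo get_title_tag($html) . "\n"; // prints "Example Page"
```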
You can make use of Facebook's Graph API.
https://graph.facebook.com/v7.0/?fields=og_object&id=https://9gag.com/gag/avPBX3b
JSON Output:
{
"og_object": {
"id": "994417753967326",
"description": "More memes, funny videos and pics on 9GAG",
"title": "32 Places People Have Mispronounced Their Entire Life",
"type": "article",
"updated_time": "2020-06-12T15:54:27+0000"
},
"id": "https://9gag.com/gag/avPBX3b"
}
You can read more about its usage here.
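Once you have fetched that JSON (with curl or file_get_contents), extracting the title is a plain json_decode away; a minimal sketch using the sample payload shown above:

```php
<?php
// Sample Graph API payload from above; in practice $json would come
// from curl or file_get_contents against the graph.facebook.com URL.
$json = '{"og_object":{"id":"994417753967326","description":"More memes, funny videos and pics on 9GAG","title":"32 Places People Have Mispronounced Their Entire Life","type":"article","updated_time":"2020-06-12T15:54:27+0000"},"id":"https://9gag.com/gag/avPBX3b"}';

$data = json_decode($json, true);
echo $data['og_object']['title'] . "\n";
```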

Why would a PHP cURL request work on localhost but not on server (getting 403 forbidden)? [duplicate]

I am trying to make a site scraper. I made it on my local machine and it works fine there. When I execute the same code on my server, it shows a 403 Forbidden error.
I am using the PHP Simple HTML DOM Parser. The error I get on the server is this:
Warning: file_get_contents(http://example.com/viewProperty.html?id=7715888) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 403 Forbidden in /home/scraping/simple_html_dom.php on line 40
The code triggering it is:
$url="http://www.example.com/viewProperty.html?id=".$id;
$html=file_get_html($url);
I have checked the php.ini on the server and allow_url_fopen is On. A possible solution could be using curl, but I need to know where I am going wrong.
I know it's quite an old thread but thought of sharing some ideas.
Most likely, if you don't get any content while accessing a webpage, the site doesn't want you to be able to get that content. So how does it identify that a script, not a human, is trying to access the webpage? Generally, it is via the User-Agent header in the HTTP request sent to the server.
So to make the website think the script accessing it is also a human, you must change the User-Agent header in the request. Most web servers will likely allow your request if you set the User-Agent header to a value used by some common web browser.
Some common user agents used by browsers are listed below:
Chrome: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36
Firefox: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0
etc.
$context = stream_context_create(
    array(
        "http" => array(
            "header" => "User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"
        )
    )
);
echo file_get_contents("https://www.google.com", false, $context);
This piece of code fakes the user agent and sends the request to https://www.google.com.
References:
stream_context_create
Cheers!
This is not a problem with your script, but with the resource you are requesting. The web server is returning the "forbidden" status code.
It could be that it blocks PHP scripts to prevent scraping, or your IP if you have made too many requests.
You should probably talk to the administrator of the remote server.
Add this after you include simple_html_dom.php:
ini_set('user_agent', 'My-Application/2.5');
You can also change it like this in the parser class, from line 35 onwards:
function curl_get_contents($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}

function file_get_html()
{
    $dom = new simple_html_dom;
    $args = func_get_args();
    $dom->load(call_user_func_array('curl_get_contents', $args), true);
    return $dom;
}
Have you tried another site?
It seems that the remote server has some type of blocking. It may be by user agent; if that's the case, you can try using curl to simulate a web browser's user agent, like this:
$url="http://www.example.com/viewProperty.html?id=".$id;
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
$html = curl_exec($ch);
curl_close($ch);
Write this in simple_html_dom.php; for me it worked:
function curl_get_contents($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}
function file_get_html($url, $use_include_path = false, $context = null, $offset = -1, $maxLen = -1, $lowercase = true, $forceTagsClosed = true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN = true, $defaultBRText = DEFAULT_BR_TEXT, $defaultSpanText = DEFAULT_SPAN_TEXT)
{
    $dom = new simple_html_dom;
    $args = func_get_args();
    $dom->load(call_user_func_array('curl_get_contents', $args), true);
    return $dom;
}
I realize this is an old question, but...
I was just setting up my local sandbox on Linux with PHP 7 and ran across this. When you run scripts from the terminal, PHP uses the php.ini for the CLI. I found that the "user_agent" option was commented out there; I uncommented it, added a Mozilla user agent, and now it works.
Did you check the permissions on the file? I set 777 on my file (on localhost, obviously) and that fixed the problem.
You may also need some additional information in the context to make the website believe that the request comes from a human. What I did was visit the website in a browser and copy any extra information that was sent in the HTTP request.
$context = stream_context_create(
    array(
        "http" => array(
            'method' => "GET",
            "header" => "User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36\r\n" .
                "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3\r\n" .
                "Accept-Language: es-ES,es;q=0.9,en;q=0.8,it;q=0.7\r\n" .
                "Accept-Encoding: gzip, deflate, br\r\n"
        )
    )
);
In my case, the server was rejecting the HTTP 1.0 protocol via its .htaccess configuration. It seems file_get_contents uses HTTP 1.0 by default.
Use the code below.
If you use file_get_contents:
$context = stream_context_create(
    array(
        "http" => array(
            "header" => "User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"
        )
    )
);
If you use curl:
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36');

2 JSON APIs. Only 1 works. json_decode php

This is seriously giving me gray hair.
I would like to echo the [ask] data from https://api.gdax.com/products/btc-usd/ticker/
But it returns null.
When I try another API with almost the same JSON, it works perfectly.
This example works
<?php
$url = "https://api.bitfinex.com/v1/ticker/btcusd";
$json = json_decode(file_get_contents($url), true);
$ask = $json["ask"];
echo $ask;
This example return null
<?php
$url = "https://api.gdax.com/products/btc-usd/ticker/";
$json = json_decode(file_get_contents($url), true);
$ask = $json["ask"];
echo $ask;
Does anybody have a good explanation of what's wrong with the code returning null?
The server behind the null result is preventing the PHP agent from connecting, and so returns an HTTP 400 error. You need to specify a user_agent value for your HTTP request.
e.g.
$ua = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36';
$options = array('http' => array('user_agent' => $ua));
$context = stream_context_create($options);
$url = "https://api.gdax.com/products/btc-usd/ticker/";
$json = json_decode(file_get_contents($url, false, $context), true);
$ask = $json["ask"];
echo $ask;
You can also use any user_agent string you want in the $ua variable, as long as your target server allows it.
You can't access this URL without passing headers. This happens sometimes when the host checks where the request comes from.
$ch = curl_init();
$header = array(
    'Host: api.gdax.com',
    'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language: en-US,en;q=0.8',
    'Cache-Control: max-age=0',
    'Connection: keep-alive',
    'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36',
);
curl_setopt($ch, CURLOPT_URL, "https://api.gdax.com/products/btc-usd/ticker/");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 0);
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
$result = curl_exec($ch);
Then you can use json_decode() on $result !
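From there, decoding works the same way as in the working bitfinex example; a sketch with a made-up ticker payload standing in for $result (the real GDAX response shape may differ, so verify against a live reply):

```php
<?php
// Made-up ticker payload standing in for $result; the "ask" field
// follows the question's $json["ask"] usage, but is illustrative only.
$result = '{"trade_id":1,"price":"334.00","bid":"333.99","ask":"334.01","volume":"100.5"}';

$ticker = json_decode($result, true);
echo $ticker['ask'] . "\n";
```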

How to parse image feed with php

I am using a Wikipedia API link to get the main image for some well-known characters/events.
Example: (Stanislao Mattei)
This shows up as follows.
Now my question: I'd like to parse the XML to get the image URL so it can be shown.
Here is the code I'm willing to use, if it's right ~ thanks to ccKep ~
<?php
ini_set("user_agent", "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1");
$url = "http://en.wikipedia.org/w/api.php?action=query&list=allimages&aiprop=url&format=xml&ailimit=1&aifrom=Stanislao%20Mattei";
$xml = simplexml_load_file($url);
$extracts = $xml->xpath("/api/query/allimages");
var_dump($extracts);
?>
It gives results as an array. How can I get from it the exact URL of the image, which should be:
http://upload.wikimedia.org/wikipedia/en/a/a1/Stanislaus.jpg
to put it in HTML code:
<img src="http://upload.wikimedia.org/wikipedia/en/a/a1/Stanislaus.jpg">
~ Thanks a lot
Did you try $xml->query->allimages->img->attributes()->url?
Your code will look like this:
<?php
ini_set("user_agent", "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1");
$url = "http://en.wikipedia.org/w/api.php?action=query&list=allimages&aiprop=url&format=xml&ailimit=1&aifrom=Stanislao%20Mattei";
$xml = simplexml_load_file($url);
$url = $xml->query->allimages->img->attributes()->url;
echo "URL: " . $url . "<br/>";
echo '<img src="' . $url . '">';
?>
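If you want to verify the attribute access without hitting the API, the same property chain works on a static fragment (the element and attribute names below mirror the MediaWiki allimages response, but double-check them against a live reply):

```php
<?php
// Static fragment shaped like the MediaWiki allimages XML response.
$xmlString = '<api><query><allimages><img name="Stanislaus.jpg" url="http://upload.wikimedia.org/wikipedia/en/a/a1/Stanislaus.jpg"/></allimages></query></api>';

$xml = simplexml_load_string($xmlString);
$url = (string) $xml->query->allimages->img->attributes()->url;
echo '<img src="' . $url . '">';
```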

Curl 400 error when using UserAgent

Why am I sometimes getting this error?
**Bad Request**
Your browser sent a request that this server could not understand.
Apache Server at control.digitalcoding.com Port 80
When
$UserAgent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11";
everything works fine, but not with
Opera/7.52 (Windows NT 5.1; U) [en]
Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1
Mozilla/5.0 (Windows NT 6.1; rv:10.0.1) Gecko/20100101 Firefox/10.0.1
for example. What is the problem?
HtmlReciever.php
<?php
if (empty($_GET["Link"])) {
    echo "empty";
    die;
}

$LinkToFetch = urldecode($_GET["Link"]);
$UserAgent = urldecode($_GET["UserAgent"]);

function iscurlinstalled()
{
    return in_array('curl', get_loaded_extensions());
}

// If curl is installed
if (iscurlinstalled() == true) {
    $ch = curl_init($LinkToFetch);
    curl_setopt($ch, CURLOPT_USERAGENT, $UserAgent);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);
    $HtmlCode = curl_exec($ch);
    curl_close($ch);
} else {
    $HtmlCode = file_get_contents($LinkToFetch);
}

echo $HtmlCode;
?>
I should mention that I'm running RecieverHtml.php from another .php file with GET, like this:
http://127.0.0.1/reciever/RecieverHtml.php?Link=http%3A%2F%2Fwww.digitalcoding.com%2Ftools%2Fdetect-browser-settings.html&UserAgent=Mozilla%2F5.0+%28Windows+NT+6.1%3B+rv%3A10.0.1%29+Gecko%2F20100101+Firefox%2F10.0.1%0D%0A
This depends on the server your request is sent to. If the server checks the user agent and allows only requests that match a limited/incomplete/outdated list of common browser user agents, the server might return a generic 400 status code.
If you don't have control over the server and want your script to work, use the user agent that works and forget about the others. The user agent you provide with your request is "wrong" anyway, as it is not Chrome doing the actual request but your server running your PHP script.
EDIT:
You can also pass the user agent of the browser that requests your PHP script by using the following code:
curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
Just keep in mind that the value might be empty or exotic (like Lynx/2.8.8dev.3 libwww-FM/2.14 SSL-MM/1.4.1) and be rejected by the server.
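To guard against the empty case, one option is falling back to a fixed agent string (a sketch; the fallback value here is arbitrary):

```php
<?php
// Use the caller's agent when present, otherwise a fixed fallback.
// (HTTP_USER_AGENT is absent for some clients, and always on the CLI.)
$fallback = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)';
$agent = isset($_SERVER['HTTP_USER_AGENT']) && $_SERVER['HTTP_USER_AGENT'] !== ''
    ? $_SERVER['HTTP_USER_AGENT']
    : $fallback;
```

Then pass $agent to curl_setopt($ch, CURLOPT_USERAGENT, ...) as before.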