This is quite a mystery for me.
I am trying to load external HTML via cURL and get elements by tag name (to get specific property)
function getData($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
curl_setopt($ch, CURLOPT_TIMEOUT, 60);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_URL, $url);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$url = getData("https://example.com");
$dom = new DOMDocument();
$dom->loadHTML($url);
$aInputs = $dom->getElementsByTagName("input");
foreach($aInputs as $node) {
echo $node->getAttribute("name");
}
but script dumps different values than expected. When I dump variable $url and inspect those inputs via inspector it shows correct values of attributes, but after creating DOMDocument they are wrong. What could possibly cause this?
EDIT: As I discovered that JS could be the fault. In browser JS changes some stuff which cURL doesn't. Is there a workaround to make cURL act like a browser?
Related
I try to load and parse feed with PHP. But when I'm loading it from PHP it contain only 40 [item] entries. I try to load it with cURL, or file_get_contents(). If I open feed in brouser, then save it local, then parse it - I have full numbers of [item]s.
What is the cause of the problem can be?
Example code:
$ch = curl_init();
$xml_url = 'http://tokyotosho.info/rss.php?filter=1&entries=300&cat=1';
curl_setopt($ch, CURLOPT_URL, $xml_url);
curl_setopt($ch, CURLOPT_FRESH_CONNECT, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; ru-RU; rv:1.7.12) Gecko/20050919 Firefox/1.0.7");
$result = curl_exec($ch);
curl_close($ch);
$xml_data = simplexml_load_string($result);
I'm trying to get information from the Spotify API. When accessing this URL in my browser, it all works perfect; https://api.spotify.com/v1/search?q=Led%20Zeppelin%20Kashmir&type=track
However, then I use this code to try to get the data I'm just getting a white page. I've Googled and searched Stackoverflow, but still no cigar. Does anyone know why this code doesn't work?
Appreciate any help on this.
$artist = 'Led Zeppelin';
$title = 'Kashmir';
$spotifyURL = 'https://api.spotify.com/v1/search?q='.$artist.'%20'.$title.'&type=track';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $spotifyURL);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:x.x.x) Gecko/20041107 Firefox/x.x");
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
$json = curl_exec($ch);
$json = json_decode($json);
curl_close($ch);
echo '<pre>'.print_r($json, true).'</pre>';
Your URL contains spaces. Use the following line instead:
$spotifyURL = 'https://api.spotify.com/v1/search?q='.urlencode($artist.' '.$title).'&type=track';
When running on my localhost or my server I can’t load a particular external a page. However, if I load the page in my browser it loads or using Postman it loads fine.
How can I fix this and how is Spotify preventing this?
The URL of the content I want to load is this.
<?php
$url="https://embed.spotify.com/?uri=spotify:user:spotify:playlist:4hOKQuZbraPDIfaGbM3lKI";
$page = file_get_contents($url);
echo $page; //returns nothing
$page = get_data($url);
echo $page; //returns nothing with a http code of 0
$url="https://www.google.com";
$page = file_get_contents($url);
echo $page; //returns google
$page = get_data($url);
echo $page; //returns google with a http code of 200
/* gets the data from a URL */
function get_data($url) {
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$data = curl_exec($ch);
echo curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
return $data;
}
?>
Try setting CURLOPT_POSTFIELDS to true & set the URL parameters in CURLOPT_POSTFIELDS like this. Note the URL changes as a result since the parameters are now in CURLOPT_POSTFIELDS. I set the params as an array called $post_fields since I find that to be an easier way to read that way when debugging.
UPDATE: The post params didn’t work. But adding CURLOPT_SSL_VERIFYHOST set to false as well as CURLOPT_SSL_VERIFYPEER set to false seems to do the trick on my side.
Here is my cleaned up version of your code. Removed your tests & commented out the post param stuff I thought would help before:
// Set the `url`.
$url="https://embed.spotify.com/?uri=spotify:user:spotify:playlist:4hOKQuZbraPDIfaGbM3lKI";
// Set the `post` fields.
$post_fields = array();
// $post_fields['uri'] = 'spotify:user:spotify:playlist:4hOKQuZbraPDIfaGbM3lKI';
// Set the `post` fields.
$page = get_data($url, $post_fields);
// Echo the output.
echo $page;
// Gets the data from a `url`.
function get_data($url, $post_fields) {
$curl_timeout = 5;
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
// curl_setopt($ch, CURLOPT_POST, true);
// curl_setopt($ch, CURLOPT_POSTFIELDS, $post_fields);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $curl_timeout);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$data = curl_exec($ch);
// echo curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
curl_close($ch);
return $data;
}
M trying to crawl some data from a URL
with the help of simple html dom.
But when id start my crawler its giving an error
** failed to open stream: HTTP request failed! HTTP/1.1 404 Not Found**
i have tried cUrl but 404 error is thrown.
here my php simple dom code
function getURLContent($url)
{
$html = new simple_html_dom();
$html->load_file($url);
/* i perfome some opetions here*/
}
and with cUrl
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_HEADER, false);
$data = curl_exec($curl);
echo $data;
curl_close($curl);
How could i do this..?
Thanks in advance..
Yes try to configure the useragent
curl_setopt($curl,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
add these to your code and try
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1");
curl_setopt($ch, CURLOPT_HEADER, $url);
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers); //set headers
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); // set true for https urls
404 Error is obvious, page not found. Try Fiddler for catching the parameters needed as your physical browser catches, and pass the same parameters via cURL in your script.
If you are getting Blocked error page, means try changing User-Agent OR use a proxy address(you can easily get free proxy on internet) OR try to maintaining the session while requesting your page, Fiddler will help you in this.
I am trying to use cURL to automate a login with multiple steps involved. The problem I am running into is that I get the first page of the login fine but the next page I hit I must select or hit a link to continue. How the heck do I "keep going". I've tried taking the next URL and putting it into my cURL code but it does not work as it just goes directly to that page and errors because I have not gone to the first page of the login process. Here is my code.
$ch = curl_init();
$data = array('fp_software' => '', 'fp_screen' => '', 'fp_browser' => '','txtUsername' => "$username", 'btnLogin' => 'Log In');
curl_setopt($ch, CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
curl_setopt($ch, CURLOPT_URL, 'https://www.website.com/Login.aspx');
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $data);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_exec($ch);
curl_close ($ch);
The next url is www.website.com/PassMarkFrame.aspx - Basically I need to crawl threw this login process.
I tried this...but it didn't work.
curl_setopt($ch, CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
curl_setopt($ch, CURLOPT_URL, 'https://www.website.com/Login.aspx'); // use the URL that shows up in your <form action="...url..."> tag
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $data);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_exec($ch);
curl_setopt($ch, CURLOPT_URL, 'https://www.website.com/PassMarkFrame.aspx'); // use the URL that shows up in your <form action="...url..."> tag
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_exec($ch);
curl_close ($ch);
Is that the right syntax?
Don't close the curl handle after each stage. if cookies are being set, and you haven't configured the cookiejar/cookiefile options, then you start with a brand new sparkly fresh and clean CURL with no "memory" of the previous requests.
Keep the same curl handle going, and any cookies set by the site will be preserved.