I recently checked one of our websites and realized that the search for postal code wasn't working anymore.
I get the following error:
'Failed to load external entity'
If instead I use simplexml_load_string() I receive
'Start tag expected, '<' not found'.
This is the code I'm using:
libxml_use_internal_errors(true);
$xml = simplexml_load_file('https://nominatim.openstreetmap.org/search?postalcode=28217&country=DE&format=xml&polygon=1&addressdetails=1&boundary=postalcode');
if (false === $xml) {
$errors = libxml_get_errors();
var_dump($errors);
}
I read somewhere that it might actually have something to do with HTTP headers, but I did not find any useful information on this.
OSM Nominatim's usage policy states that you need to provide a User-Agent or HTTP-Referer request header to identify the application. As such, using a user agent that masquerades as an end-user browser is really not great etiquette.
You can find the usage policy here. It also says that the default values used by HTTP libraries (like the one simplexml_load_file() uses) are not acceptable.
You say you are using simplexml_load_string(), but you do not say how you are getting the XML to pass to that function. The most likely scenario is that, whichever method you use to fetch the XML, you are also neglecting to send the mandatory headers.
As such, I'd create a request using php-curl, provide one of these headers to identify your app, and parse the resulting XML string with simplexml_load_string().
E.g.:
// setup variables
$nominatim_url = 'https://nominatim.openstreetmap.org/search?postalcode=28217&country=DE&format=xml&polygon=1&addressdetails=1&boundary=postalcode';
$user_agent = 'ID_Identifying_Your_App v100';
$http_referer = 'http://www.urltoyourapplication.com';
$timeout = 10;
// curl initialization
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $nominatim_url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
// these are the bits you are missing
// Setting curl's user-agent
curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
// you can also use this one (http-referer); it's up to you. Either one or both.
curl_setopt($ch, CURLOPT_REFERER, $http_referer);
// get the XML
$data = curl_exec($ch);
curl_close($ch);
// load it in simplexml
$xml = simplexml_load_string($data);
// This was your code, left as it was
if (false === $xml) {
$errors = libxml_get_errors();
var_dump($errors);
}
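If you would rather keep simplexml_load_file(), a stream context can carry the identifying headers instead of cURL. This is only a minimal sketch: the helper name buildIdentifiedContext() is made up for illustration, and the agent and referer strings are the same placeholders used above.

```php
<?php
// Hypothetical helper: builds an HTTP stream context that identifies the application.
function buildIdentifiedContext(string $userAgent, string $referer, int $timeout = 10)
{
    return stream_context_create([
        'http' => [
            'user_agent' => $userAgent,               // sent as the User-Agent header
            'header'     => "Referer: $referer\r\n",  // optional second identifier
            'timeout'    => $timeout,                 // read timeout in seconds
        ],
    ]);
}

$context = buildIdentifiedContext(
    'ID_Identifying_Your_App v100',
    'http://www.urltoyourapplication.com'
);

// Usage (performs a network request, so it is left commented out here):
// libxml_use_internal_errors(true);
// libxml_set_streams_context($context);
// $xml = simplexml_load_file($nominatim_url);
```

libxml_set_streams_context() applies the context to the next document libxml loads, so it has to be called right before simplexml_load_file().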
You can use cURL and add custom headers. I hope this code is useful for you:
<?php
$request_url='https://nominatim.openstreetmap.org/search?postalcode=28217&country=DE&format=xml&polygon=1&addressdetails=1&boundary=postalcode';
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
'Accept-Language: en-US,en;q=0.9,fa;q=0.8,und;q=0.7',
'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.79 Safari/537.36'));
curl_setopt($ch, CURLOPT_URL, $request_url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$data = curl_exec($ch);
curl_close($ch);
echo($data);
I implemented this function in order to parse HTML pages using two different "methods".
As you can see both are using the very handy class called simple_html_dom.
The difference is the first method is also using curl to load the HTML while the second is not using curl
Both methods are working fine on a lot of pages but I'm struggling with this specific call:
searchThroughDOM('https://fr.shopping.rakuten.com/offer/buy/3458931181/new-york-1997-4k-ultra-hd-blu-ray-blu-ray-bonus-edition-boitier-steelbook.html', 'simple_html_dom');
In both cases, I end up with a 403 access denied response.
Did I do something wrong?
Or is there another method in order to avoid this type of denial?
function searchThroughDOM ($url, $method)
{
echo '$url = '.$url.'<br>'.'$method = '.$method.'<br><br>';
$time_start = microtime(true);
switch ($method) {
case 'curl':
$curl = curl_init();
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl, CURLOPT_HEADER, false);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_REFERER, $url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36');
$str = curl_exec($curl);
curl_close($curl);
// Create a DOM object
$html = new simple_html_dom();
// Load HTML from a string
$html->load($str);
break;
case 'simple_html_dom':
$html = new simple_html_dom();
$html->load_file($url);
break;
}
$collection = $html->find('h1');
foreach($collection as $x => $x_value) {
echo 'x = '.$x.' => value = '.$x_value.'<br>';
}
$html->save('result.htm');
$html->clear();
$time_end = microtime(true);
echo 'Elapsed Time (DOM) = '.($time_end - $time_start).'<br><br>';
}
From my point of view, there is nothing wrong with simple_html_dom.
You can remove the simple_html_dom part of the code and keep only the cURL part, which I assume is the source of the problem.
There are many reasons why cURL might not work on a page.
First, I can see you added
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
so you could also try setting CURLOPT_SSL_VERIFYHOST to false.
Second, check your cURL version to see whether it is too old.
Third, if none of the above works, you may want to enable cookies. With cookies disabled, the website may detect that a machine, not a real person, is sending the request.
Lastly, if all of the above fails, try another library or even file_get_contents().
cURL is not your only option, although it is the most powerful one.
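Enabling cookies in cURL could look roughly like the sketch below. The helper name createCookieCurl() and the cookie file location are assumptions for illustration, and whether this gets past the 403 depends entirely on how the site detects automated requests.

```php
<?php
// Hypothetical helper: returns a cURL handle that stores and resends cookies.
function createCookieCurl(string $url, string $cookieFile)
{
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    // Read existing cookies from this file when the transfer starts...
    curl_setopt($curl, CURLOPT_COOKIEFILE, $cookieFile);
    // ...and write received cookies back to it when the handle is released.
    curl_setopt($curl, CURLOPT_COOKIEJAR, $cookieFile);
    return $curl;
}

$curl = createCookieCurl('https://example.com/', sys_get_temp_dir() . '/cookies.txt');
// $str = curl_exec($curl); // network call, left out of the sketch
```

Using the same file for CURLOPT_COOKIEFILE and CURLOPT_COOKIEJAR keeps the session across several requests.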
I am not very familiar with PHP and cURL, and I need to convert an advanced PHP cURL POST request to its Python equivalent.
The code is from a payment gateway site called PayGate; I am using their sample PHP API from developer.paygate.co.za/. The code that I tried to convert into Python is below:
<?php
//The PayGate PayXML URL
define( "SERVER_URL", "https://www.paygate.co.za/payxml/process.trans" );
//Construct the XML document header
$XMLHeader = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><!DOCTYPE protocol SYSTEM \"https://www.paygate.co.za/payxml/payxml_v4.dtd\">";
// - Then construct the full transaction XML
$XMLTrans = '<protocol ver="4.0" pgid="10011013800" pwd="test"><authtx cref="ABCqwerty1234" cname="Patel Sunny" cc="5200000000000015" exp="032022" budp="0" amt="10000" cur="ZAR" cvv="123" rurl="http://localhost/pg_payxml_php_final.php" nurl="http://localhost/pg_payxml_php_notify.php" /></protocol>';
// Construct the request XML by combining the XML header and transaction
$Request = $XMLHeader.$XMLTrans;
// Create the POST data header containing the transaction
$header[] = "Content-type: text/xml";
$header[] = "Content-length: ".strlen($Request)."\r\n";
$header[] = $Request;
// Use cURL to post the transaction to PayGate
// - first instantiate cURL; if it fails then quit now.
$ch = curl_init();
if (!$ch) die("ERROR: cURL initialization failed. Check your cURL/PHP configuration.");
// - then set the cURL options; to ignore SSL invalid certificates; set timeouts etc.
curl_setopt ($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt ($ch, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_TIMEOUT, 60);
curl_setopt ($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)");
curl_setopt ($ch, CURLOPT_CUSTOMREQUEST, "POST");
// - then set the PayXML URL and the transaction data
curl_setopt ($ch, CURLOPT_URL, SERVER_URL);
curl_setopt ($ch, CURLOPT_HTTPHEADER, $header);
// Connect to PayGate PayXML and send data
$Response = curl_exec ($ch);
// Check for any connection errors and then close the connection.
$curlError = curl_errno($ch);
curl_close($ch);
I know about basic requests in Python, but I couldn't pass the attributes in that request, and I am also confused about how to pass the cURL data in requests.
I am trying it like:
import requests
post_data = {'pgid':'10011013800',
'pwd':'test',
'cref': 'ABCX1yty36858gh',
'cname':'PatelSunny',
'cc':'5200000470000015',
'exp':'032022',
'budp':'0',
'amt':'50000',
'cur':'ZAR',
'cvv':'123',
'rurl':'http://localhost/pg_payxml_php_final.php',
'nurl':'http://localhost/pg_payxml_php_notify.php',
'submit':'Submit'
}
r = requests.get('https://www.paygate.co.za/payxml/process.trans', params=post_data,headers=headers)
# print(r.url)
print r.text
But it shows error
405 - HTTP verb used to access this page is not allowed.
Finally solved it,
import requests
import xml.etree.ElementTree as ET
headers = {'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Accept-Encoding':'gzip, deflate',
'Accept-Language':'en-US,en;q=0.8',
'Cache-Control':'max-age=0',
'Connection':'keep-alive',
'Content-Length':'112',
'Content-Type':'application/x-www-form-urlencoded',
'User-Agent':"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36",}
xml = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE protocol SYSTEM "https://www.paygate.co.za/payxml/payxml_v4.dtd">
<protocol ver="4.0" pgid="10011013800" pwd="test">
<authtx cref="ABCX1j64564" cname="Patel Sunny" cc="5200000000000015" exp="032022" budp="0" amt="10000" cur="ZAR" cvv="123"
rurl="http://localhost/pg_payxml_php_final.php" nurl="http://localhost/pg_payxml_php_notify.php" />
</protocol>
"""
headers = {'Content-Type': 'application/xml'} # set what your server accepts; this replaces the dict above
response = requests.post('https://www.paygate.co.za/payxml/process.trans', data=xml, headers=headers).text
tree = ET.fromstring(response)
for node in tree.iter('authrx'):
    sdesc = node.attrib.get('sdesc') # STATUS MESSAGE
    tid = node.attrib.get('tid') # TRANSACTION ID
    cref = node.attrib.get('cref') # REFERENCE NO. like invoice_no or sale_order_no
    auth = node.attrib.get('auth')
    rdesc = node.attrib.get('rdesc') # Result Code description.
    print(sdesc, tid, cref)
I have a repetitive task that I do daily. Log in to a web portal, click a link that pops open a new window, and then click a button to download an Excel spreadsheet. It's a time consuming task that I would like to automate.
I've been doing some research with PHP and cURL, and while it seems like it should be possible, I haven't found any good examples. Has anyone ever done something like this, or do you know of any tools that are better suited for it?
Are you familiar with the basics of HTTP requests? Like, do you know the difference between a POST and a GET request? If what you're doing amounts to nothing more than GET requests, then it's actually super simple and you don't need to use cURL at all. But if "clicking a button" means submitting a POST form, then you will need cURL.
One way to check this is by using a tool such as Live HTTP Headers and watching what requests happen when you click on your links/buttons. It's up to you to figure out which variables need to get passed along with each request and which URLs you need to use.
But assuming that there is at least one POST request, here's a basic script that will post data and get back whatever HTML is returned.
<?php
if ( $ch = curl_init() ) {
$data = 'field1=' . urlencode('somevalue');
$data .= '&field2[]=' . urlencode('someothervalue');
$url = 'http://www.website.com/path/to/post.asp';
$userAgent = 'Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)';
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $data);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
$html = curl_exec($ch);
curl_close($ch);
} else {
$html = false;
}
// write code here to look through $html for
// the link to download your excel file
?>
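As a side note, the POST body does not have to be concatenated by hand. A small sketch using http_build_query(), which encodes the keys and values for you; the helper name buildPostBody() and the field names and values are placeholders, not anything the target site is known to expect:

```php
<?php
// Hypothetical helper: turns an associative array into a URL-encoded POST body.
function buildPostBody(array $fields): string
{
    // The separator is passed explicitly so the result does not depend on php.ini.
    return http_build_query($fields, '', '&');
}

$data = buildPostBody(['field1' => 'some value', 'field2' => 'a&b']);
// $data is now 'field1=some+value&field2=a%26b'
// curl_setopt($ch, CURLOPT_POSTFIELDS, $data);
```

Note that for nested arrays http_build_query() writes explicit indexes (field2[0]) rather than field2[], which a non-PHP backend may treat differently.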
Try this:
$ch = curl_init();
$csrf_token = $this->getCSRFToken($ch); // your own function that fetches the CSRF token from the website, if you need one
$ch = $this->signIn($ch, $csrf_token); // your own sign-in function; it must log in and return the cURL handle
curl_setopt($ch, CURLOPT_HTTPGET, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 300); // allow more time in case the file is large
curl_setopt($ch, CURLOPT_URL, "https://your-URL/anything");
$return=curl_exec($ch);
// the important part
$destination ="files.xlsx";
if (file_exists( $destination)) {
unlink( $destination);
}
$file=fopen($destination,"w+");
fputs($file,$return);
if(fclose($file))
{
echo "downloaded";
}
curl_close($ch);
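The file-writing part can be condensed and made a little safer. The sketch below assumes CURLOPT_RETURNTRANSFER is enabled somewhere in the sign-in code, so that curl_exec() returns the body as a string; the helper name saveDownload() is made up for illustration.

```php
<?php
// Hypothetical helper: writes a downloaded body to disk and reports success.
function saveDownload($body, string $destination): bool
{
    // curl_exec() returns false on failure; never write that to the file.
    if (!is_string($body) || $body === '') {
        return false;
    }
    // file_put_contents() truncates an existing file, so no unlink() is needed.
    return file_put_contents($destination, $body, LOCK_EX) === strlen($body);
}

// Usage with the variables from the snippet above:
// if (saveDownload($return, 'files.xlsx')) { echo "downloaded"; }
```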
I make an ordinary cURL request to get an XML file:
$userAgent = 'Mozilla/5.0 (Windows NT 6.1; rv:21.0) Gecko/20100101 Firefox/21.0';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $requestUrl);
curl_setopt($ch, CURLOPT_REFERER, $requestUrl);
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
$output = curl_exec($ch);
curl_close($ch);
then I use
$xml = simplexml_load_string($output);
but I came across a URL that returns invalid XML, so simplexml_load_string() gives errors. When I open the URL in the browser, it shows this:
This page contains the following errors:
error on line 3 at column 53: invalid character in attribute value
So I want to check whether the response contains that error, so that I can handle it in my code.
I have tried something like this:
if (strpos($output, "This page contains the following errors:") === false) {
echo "valid xml";
} else {
echo "invalid xml structure";
}
but it does not work: even when the URL returns invalid XML, it cannot find that text in the response.
Thanks
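You can use libxml_use_internal_errors(true) to suppress the warnings and libxml_get_errors() to fetch the error information as needed. The "This page contains the following errors" text is generated by the browser's own XML parser, so it is not part of the response body and strpos() will not find it.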
http://php.net/manual/en/function.libxml-use-internal-errors.php
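A minimal sketch of that approach follows. The helper name parseXmlSafely() is made up for illustration; it returns the parsed document (or null) together with the collected libxml messages.

```php
<?php
// Hypothetical helper: parses a string and returns [document or null, list of error messages].
function parseXmlSafely(string $xmlString): array
{
    // Collect parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $xml = simplexml_load_string($xmlString);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d, column %d: %s', $error->line, $error->column, trim($error->message));
    }

    // Restore the previous error-handling mode for the rest of the script.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return [$xml === false ? null : $xml, $messages];
}

[$xml, $errors] = parseXmlSafely('<root><item id="1"/></root>');
if ($xml === null) {
    echo "invalid xml structure\n";
} else {
    echo "valid xml\n";
}
```

In the real script you would pass $output from curl_exec(), which requires CURLOPT_RETURNTRANSFER to be set so that the body is returned as a string.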
I read over 20 related questions on this site, searched in Google but no use. I'm new to PHP and am using PHP Simple HTML DOM Parser to fetch a URL. While this script works with local test pages, it just won't work with the URL that I need the script for.
Here is the code that I wrote for this, following an example file that came with the PHP Simple DOM parser library:
<?php
include('simple_html_dom.php');
$html = file_get_html('http://www.farmersagent.com/Results.aspx?isa=1&name=A&csz=AL');
foreach($html->find('li.name ul#generalListing') as $e)
echo $e->plaintext;
?>
And this is the error message that I get:
Warning: file_get_contents(http://www.farmersagent.com/Results.aspx?isa=1&name=A&csz=AL) [function.file-get-contents]: failed to open stream: Redirection limit reached, aborting in /home/content/html/website.in/test/simple_html_dom.php on line 70
Please guide me what should be done to make it work. I'm new so please suggest a way that is simple. While reading other questions and their answers on this site, I tried cURL method to create a handle but I failed to make it work. The cURL method that I tried keeps returning "Resources" or "Objects". I don't know how to pass that to Simple HTML DOM Parser to make $html->find() work properly.
Please help!
Thanks!
I had a similar problem today. I was using cURL and it wasn't returning any error to me. I tested with file_get_contents() and got...
failed to open stream: Redirection limit reached, aborting in
After a few searches I ended up with this function, which works in my case...
function getPage ($url) {
$useragent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.89 Safari/537.36';
$timeout= 120;
$dir = dirname(__FILE__);
$cookie_file = $dir . '/cookies/' . md5($_SERVER['REMOTE_ADDR']) . '.txt';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_file);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_file);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true );
curl_setopt($ch, CURLOPT_ENCODING, "" );
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true );
curl_setopt($ch, CURLOPT_AUTOREFERER, true );
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout );
curl_setopt($ch, CURLOPT_TIMEOUT, $timeout );
curl_setopt($ch, CURLOPT_MAXREDIRS, 10 );
curl_setopt($ch, CURLOPT_USERAGENT, $useragent);
curl_setopt($ch, CURLOPT_REFERER, 'http://www.google.com/');
$content = curl_exec($ch);
if (curl_errno($ch)) {
    echo 'error:' . curl_error($ch);
    $content = false;
}
// close the handle before returning; a return placed before this line would skip it
curl_close($ch);
return $content;
}
The website was checking for a valid user agent and for cookies.
The cookie issue was causing it! :)
Peace!
Resolved with:
<?php
$context = stream_context_create(
array(
'http' => array(
'max_redirects' => 101
)
)
);
$content = file_get_contents('http://example.org/', false, $context);
?>
You can also specify a proxy if there is one in the middle:
$aContext = array('http'=>array('proxy'=>$proxy,'request_fulluri'=>true));
$cxContext = stream_context_create($aContext);
More details on: https://cweiske.de/tagebuch/php-redirection-limit-reached.htm (thanks #jqpATs2w)
Using cURL you would need to have the CURLOPT_RETURNTRANSFER option set to true in order to return the body of the request with call to curl_exec like this:
$url = 'http://www.farmersagent.com/Results.aspx?isa=1&name=A&csz=AL';
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
// you may set this option if you need to follow redirects, though I didn't get any in your case
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
$content = curl_exec($curl);
curl_close($curl);
$html = str_get_html($content);
I also needed to add the HTTP context option ignore_errors:
See: https://www.php.net/manual/en/context.http.php
$arrContextOptions = array(
"ssl" => array(
// skip error "Failed to enable crypto" + "SSL operation failed with code 1."
"verify_peer" => false,
"verify_peer_name" => false,
),
// skip error "failed to open stream: operation failed" + "Redirection limit reached"
'http' => array(
'max_redirects' => 101,
'ignore_errors' => '1'
),
);
$file = file_get_contents($file_url, false, stream_context_create($arrContextOptions));
Obviously, I only use it for quick debugging purpose on my local environment. It is not for production.
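With ignore_errors enabled, file_get_contents() returns the body even for 4xx and 5xx responses, so it helps to check the status yourself. After the call, the HTTP wrapper fills the local variable $http_response_header with the raw header lines. A small sketch, with a made-up helper name finalStatusCode():

```php
<?php
// Hypothetical helper: extracts the last HTTP status code from a list of header lines.
function finalStatusCode(array $headerLines): ?int
{
    $status = null;
    foreach ($headerLines as $line) {
        // Each response in a redirect chain starts with its own status line,
        // so the last match belongs to the final response.
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $status = (int) $m[1];
        }
    }
    return $status;
}

// Usage after the file_get_contents() call above:
// $status = finalStatusCode($http_response_header);
// if ($status !== 200) { /* handle the error page in $file */ }
```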
I'm not sure exactly why you redefined the $html object with the string returned by the HTML getter. The object is meant to be used for searching the string. If you overwrite the object with a string, the object no longer exists and cannot be used.
In any case, this is how to search the string returned from cURL:
<?php
$url = 'http://www.example.com/Results.aspx?isa=1&name=A&csz=AL';
include('simple_html_dom.php');
# create object
$html = new simple_html_dom();
#### CURL BLOCK ####
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
# you may set this option if you need to follow redirects.
# Though I didn't get any in your case
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
$content = curl_exec($curl);
curl_close($curl);
# load the cURL response string into the object.
$html->load($content);
#### END CURL BLOCK ####
# without the cURL block above you would just use this instead:
# $html->load_file($url);
# choose the tag to find; you're not looking for attributes here.
# find() returns an array of the matching elements (anchor tags in this case).
foreach ($html->find('a') as $element) {
    # you output the attribute's contents using the name of the attribute.
    echo $element->href;
}
?>
You might be searching for a different tag; the method is the same:
# just outputting a different tag attribute
echo $element->class;
echo $element->id;