simple_html_dom ignores special characters - php

The code I am using is below. It works perfectly fine until I hit a URL with Japanese or other special characters. From what I have observed, whenever the URL contains such characters the request only reaches the domain name, so I keep getting random results that I never intended to retrieve.
include_once 'simple_html_dom.php';
header('Content-Type: text/html; charset=utf-8');
$url_link = 'http://kissanime.com/Anime/Knights-of-Ramune-VS騎士ラムネ&40FRESH';
$html = file_get_html($url_link);
echo $html->find('.bigChar', 0)->innertext;
I should be getting "Knights of Ramune", since that is the element I am trying to retrieve. Instead, $url_link gets redirected to the bare domain 'http://kissanime.com/' without the 'Anime/Knights-of-Ramune-VS騎士ラムネ&40FRESH' part, and the search for the '.bigChar' class then returns a random value from the homepage.

The real problem here is how to retrieve data from a URL containing UTF-8 characters, not simple_html_dom itself.
First of all, we need to encode the characters:
$url_link = 'http://kissanime.com/Anime/Knights-of-Ramune-VS騎士ラムネ&40FRESH';
$strPosLastPart = strrpos($url_link, '/') + 1;
$lastPart = substr($url_link, $strPosLastPart);
$encodedLastPart = rawurlencode($lastPart);
$url_link = str_replace($lastPart, $encodedLastPart, $url_link);
Normally this should work. I tested it, however, and it did not, so to find out why I made the same request with cURL and got this error back:
Object reference not set to an instance of an object. Description: An
unhandled exception occurred during the execution of the current web
request. Please review the stack trace for more information about the
error and where it originated in the code.
Exception Details: System.NullReferenceException: Object reference not
set to an instance of an object.
Now we know the page is written in ASP.NET, but not yet why the request fails. I added a User-Agent header, and voilà:
$ch = curl_init($url_link);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0');
$data = curl_exec($ch);
echo $data;
All together (working):
$url_link = 'http://kissanime.com/Anime/Knights-of-Ramune-VS騎士ラムネ&40FRESH';
//Encode Characters
$strPosLastPart = strrpos($url_link, '/') + 1;
$lastPart = substr($url_link, $strPosLastPart);
$encodedLastPart = rawurlencode($lastPart);
$url_link = str_replace($lastPart, $encodedLastPart, $url_link);
//Download data (CURLOPT_RETURNTRANSFER is needed so $data holds the HTML instead of it being echoed)
$ch = curl_init($url_link);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$data = curl_exec($ch);
curl_close($ch);
//Load the data into simple_html_dom (untested, since I am not using this lib)
$html = str_get_html($data);
The difference is that you now feed $data into simple_html_dom via str_get_html() instead of letting file_get_html() fetch the URL itself.
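For completeness, a minimal sketch of pulling the question's element out of the cURL response, under the same assumptions as above and equally untested:
include_once 'simple_html_dom.php';
$html = str_get_html($data);
if ($html) {
    $element = $html->find('.bigChar', 0);
    echo $element ? $element->innertext : 'Element not found';
} else {
    echo 'Could not parse the downloaded HTML';
}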
Cheers

Related

File_get_html return empty html in PHP Simple HTML DOM Parser

I made a script that gets content from another site using Simple HTML DOM Parser. It looked like this:
include_once('simple_html_dom.php');
$html = file_get_html('http://csgolounge.com/'.$tradeid);
foreach($html->find('div[id=tradediv]') as $trade) {
$when = $trade->find('.tradeheader')[0];
}
I was probably fetching the content too often (every 30 seconds), and now I get empty HTML back.
I tried changing the user agent like this:
$context = stream_context_create();
stream_context_set_params($context, array('user_agent' => 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6\r\n'));
$html = file_get_html('http://csgolounge.com/profile?id='.$steamid, 0, $context);
But I am still getting back empty HTML.
The problem was that the HTML file was too big. Simple HTML DOM caps the file size it will parse with define('MAX_FILE_SIZE', 600000). I changed it to 900000 and now it's working again.
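Depending on your copy of the library you may not even have to edit simple_html_dom.php: newer releases only define MAX_FILE_SIZE when it is not already defined, so a sketch under that assumption would be:
// Raise the parser's size limit before including the library.
// If your version defines the constant unconditionally, this will
// emit a notice and you may prefer to edit the library file instead.
define('MAX_FILE_SIZE', 900000);
include_once('simple_html_dom.php');
$html = file_get_html('http://csgolounge.com/'.$tradeid);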

file_get_contents returns unreadable text for a specific url

When I try to read the RSS feeds of kat.cr with PHP's file_get_contents function, I get unreadable text, but when I open the feed in my browser it is fine.
I have tried it from several other hosts with no luck getting the correct data.
I have even tried setting the user agent to different browsers, but still no change.
This is a simple piece of code that I've tried:
$options = array('http' => array('user_agent' => 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1'));
$url = 'https://kat.cr/movies/?rss=1';
$data = file_get_contents($url, FILE_TEXT, stream_context_create($options));
echo $data;
I'm curious how they're doing it and what I can do to overcome the problem.
A part of unreadable text:
‹ي]يrم6–‎?Oپي©™ت,à7{»‌âgw&يؤe;éN¹\S´HK\S¤–¤l+ے÷ِùِIِ”(إژzA5‌ةض؛غ%K4ـ{qtqy½ùوa^ »¬nٍھ|ûٹSِ eه¤Jَrِْصڈ1q^}sü§7uسlدزؤYً¾²yفVu‌•يغWGG·Iس&m>،“j~$ےzؤ(?zï‍ج’²جٹم?!ّ÷¦حغ";‏گ´Yس¢ï³{tر5ز ³َsgYٹْ.ں#
Actually, every time I open the link there is some different unreadable text.
As I mentioned in the comment, the contents returned are gzip-encoded, so you need to un-gzip the data. Depending on your version of PHP you may or may not have gzdecode() available; I don't, but the fallback function below does the trick.
if (!function_exists('gzdecode')) {
    function gzdecode($data) {
        // write the gzipped data to a temp file, then let readgzfile() decompress it
        $g = tempnam('/tmp', 'ff');
        file_put_contents($g, $data);
        ob_start();
        readgzfile($g);
        $d = ob_get_clean();
        unlink($g);
        return $d;
    }
}
$data=gzdecode( file_get_contents( $url ) );
echo $data;
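If the cURL extension is available, another option is to let it negotiate and decode the compression itself; a minimal sketch:
$curl = curl_init('https://kat.cr/movies/?rss=1');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
// An empty string tells cURL to advertise all encodings it supports
// and to decompress the response automatically.
curl_setopt($curl, CURLOPT_ENCODING, '');
$data = curl_exec($curl);
curl_close($curl);
echo $data;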

Login to amazon using CURL

I'm trying to log in to Amazon using cURL, but when I send the POST data I don't get anything back. I want to use cURL only; I don't want to use any API. This is the code that I tried:
<?php
$curl_crack = curl_init();
CURL_SETOPT($curl_crack,CURLOPT_URL,"https://www.amazon.com/ap/signin?_encoding=UTF8&openid.assoc_handle=usflex&openid.claimed_id=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.identity=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.mode=checkid_setup&openid.ns=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0&openid.ns.pape=http%3A%2F%2Fspecs.openid.net%2Fextensions%2Fpape%2F1.0&openid.pape.max_auth_age=0&openid.return_to=https%3A%2F%2Fwww.amazon.com%2F%3Fref_%3Dnav_custrec_signin");
CURL_SETOPT($curl_crack,CURLOPT_USERAGENT,$_SERVER['HTTP_USER_AGENT']);
//CURL_SETOPT($curl_crack,CURLOPT_PROXY,trim($socks[$sockscount]));
//CURL_SETOPT($curl_crack,CURLOPT_PROXYTYPE,CURLPROXY_SOCKS5);
CURL_SETOPT($curl_crack,CURLOPT_POST,True);
CURL_SETOPT($curl_crack,CURLOPT_POSTFIELDS,"appAction=SIGNIN&email=test@hotmail.com&create=0&password=test123");
CURL_SETOPT($curl_crack,CURLOPT_RETURNTRANSFER,True);
CURL_SETOPT($curl_crack,CURLOPT_COOKIEFILE,"cookie.txt");
curl_setopt($curl_crack, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl_crack, CURLOPT_FOLLOWLOCATION, 1);
CURL_SETOPT($curl_crack,CURLOPT_TIMEOUT,30);
echo $check = curl_exec($curl_crack);
?>
Here you go. Tested & working.
EDIT: This code stopped working sometime before June 2016. Amazon has added client side Javascript browser fingerprinting that breaks automated logins like the one below. It's actually not that hard to bypass but I haven't spent time on engineering PHP code to do so which would be easily breakable by minor changes.
Instead, I've posted an example below the old PHP code that uses CasperJS to log in. PhantomJS or Selenium could also be used.
To supply a little background, an extra form field called metaData1 is populated by Javascript with a base64-encoded string of obfuscated browser information. Some of it might be compared against data collected server-side.
Here's an example string (before encoding):
9E0AC647#{"version":"2.3.6-AUI","start":1466184997409,"elapsed":5,"userAgent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.84 Safari/537.36","plugins":"Chrome PDF Viewer Shockwave Flash 2100Widevine Content Decryption Module 148885Native Client ||1600-1200-1150-24---","dupedPlugins":"Chrome PDF Viewer Shockwave Flash 2100Widevine Content Decryption Module 148885Native Client Chrome PDF Viewer ||1600-1200-1150-24---","flashVersion":"21.0.0","timeZone":-8,"lsUbid":"X69-8317848-6241674:1466184997","mercury":{"version":"2.1.0","start":1467231996334,"ubid":"X69-8317848-6241674:1466184997","trueIp":"1020304","echoLatency":831},"timeToSubmit":57868,"interaction":{"keys":47,"copies":0,"cuts":0,"pastes":0,"clicks":6}}
As you can see, the string contains some creepy information: which browser plugins are loaded, your keystroke and mouse click counts on the page, trueIp (your computer's IP address packed into a 32-bit integer), your time zone, screen and viewport resolution, and how long you were on the login page. There's quite a bit more it can collect, but this is a sample from my browser.
The value 9E0AC647 is a crc32 checksum of the string after the # - it won't match because I changed trueIp and other data. This data then goes through some transformation (encoding) using some values from Javascript, is base64 encoded, and then added to the login form.
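As a small illustration of the relationship described above (only a sketch assuming a plain CRC-32 over the JSON part; the variable name is hypothetical):
// Hypothetical example: split a "CHECKSUM#{...json...}" string and
// compare the prefix against a CRC-32 of the part after the '#'.
list($checksum, $payload) = explode('#', $metaData1String, 2);
$computed = sprintf('%08X', crc32($payload));
echo ($computed === strtoupper($checksum)) ? 'checksum matches' : 'checksum differs';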
Here's a permanent paste of the JS code responsible for all of this.
The steps:
Fetch the home page to establish cookies
Parse HTML to extract login URL
Fetch login page
Parse HTML and find signin form
Extract form inputs for login (there are quite a few required hidden fields)
Build post array for login
Submit login form
Check for success or failure
PHP Code (no longer working - see example below):
<?php
// amazon username & password
$username = 'you@example.com';
$password = 'yourpassword';
// http headers for requests
$headers = array(
'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language: en-US,en;q=0.5',
'Connection: keep-alive',
'DNT: 1', // :)
);
// initialize curl
$ch = curl_init('https://www.amazon.com/');
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:42.0) Gecko/20100101 Firefox/42.0');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_COOKIEFILE, '');
curl_setopt($ch, CURLOPT_ENCODING, 'gzip, deflate');
// fetch homepage to establish cookies
$result = curl_exec($ch);
// parse HTML looking for login URL
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($result);
// find link to login page
$xpath = new DOMXPath($dom);
$elements = $xpath->query('//*[@id="nav-link-yourAccount"]');
if ($elements->length == 0) {
die('Did not find "sign-in" link!');
}
// get login url
$url = $elements->item(0)->attributes->getNamedItem('href')->nodeValue;
if (strpos($url, 'http') !== 0) {
$url = 'https://www.amazon.com' . $url;
}
// fetch login page
curl_setopt($ch, CURLOPT_URL, $url);
$result = curl_exec($ch);
// parse html to get form inputs
$dom->loadHTML($result);
$xpath = new DOMXPath($dom);
// find sign in form inputs
$inputs = $xpath->query('//form[@name="signIn"]//input');
if ($inputs->length == 0) {
die('Failed to find login form fields!');
}
// get login post url
$url = $xpath->query('//form[@name="signIn"]');
$url = $url->item(0)->attributes->getNamedItem('action')->nodeValue; // form action (login URL)
// array of form fields to submit
$fields = array();
// build list of form inputs and values
for ($i = 0; $i < $inputs->length; ++$i) {
$attribs = $inputs->item($i)->attributes;
if ($attribs->getNamedItem('name') !== null) {
$val = (null !== $attribs->getNamedItem('value')) ? $attribs->getNamedItem('value')->nodeValue : '';
$fields[$attribs->getNamedItem('name')->nodeValue] = $val;
}
}
// populate login form fields
$fields['email'] = $username;
$fields['password'] = $password;
// prepare for login
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($fields));
// execute login post
$result = curl_exec($ch);
$info = curl_getinfo($ch);
// if login failed, url should be the same as the login url
if ($info['url'] == $url) {
echo "There was a problem logging in.<br>\n";
var_dump($result);
} else {
// if successful, we are redirected to homepage so URL is different than login url
echo "Should be logged in!<br>\n";
var_dump($result);
}
Working CasperJS code:
var casper = require('casper').create();
casper.userAgent('Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:46.0) Gecko/20100101 Firefox/46.0');
phantom.cookiesEnabled = true;
var AMAZON_USER = 'you@yoursite.com';
var AMAZON_PASS = 'some crazy password';
casper.start('https://www.amazon.com/').thenClick('a#nav-link-yourAccount', function() {
this.echo('Title: ' + this.getTitle());
var emailInput = 'input#ap_email';
var passInput = 'input#ap_password';
this.mouseEvent('click', emailInput, '15%', '48%');
this.sendKeys('input#ap_email', AMAZON_USER);
this.wait(3000, function() {
this.mouseEvent('click', passInput, '12%', '67%');
this.sendKeys('input#ap_password', AMAZON_PASS);
this.mouseEvent('click', 'input#signInSubmit', '50%', '50%');
});
});
casper.then(function(e) {
this.wait(5000, function() {
this.echo('Capping');
this.capture('amazon.png');
});
});
casper.run(function() {
console.log('Done');
casper.done();
});
You should really extend this code to act more like a human!

Using compression to get an external XML feed

Using PHP, I am accessing an external URL, which is an XML feed file, and I'm parsing the results into my database. The XML file is large, around 27 MB.
How can I compress that file before the data transfer is initiated so I receive something much smaller than 27 MB? My guess is gzip should be used, but I don't know how.
This is the code I'm using for retrieving the data from the XML file:
$url = "http://www.website.com/feed.xml";
$xmlStr = file_get_contents("$url") or die("can't get file");
$xmlLinq = simplexml_load_string($xmlStr);
EDIT: The file is already using default gzip/deflate compression, but I seem to be accessing the non-compressed one.
EDIT: I got this piece of code from the owner of the feed; it is supposed to show how to solve the problem, but it is in C#. I need the equivalent in PHP:
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
request.Timeout = 60000;
request.Headers.Add(HttpRequestHeader.AcceptEncoding, "gzip,deflate");
request.KeepAlive = false;
request.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.0; ru; rv:1.9) Gecko/2008052906 Firefox/3.0 (.NET CLR 3.5.30729)";
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
Stream responseStream = response.GetResponseStream();
if (response.ContentEncoding.ToLower().Contains("gzip"))
responseStream = new GZipStream(responseStream, CompressionMode.Decompress);
else if (response.ContentEncoding.ToLower().Contains("deflate"))
responseStream = new DeflateStream(responseStream, CompressionMode.Decompress);
StreamReader reader = new StreamReader(responseStream, Encoding.UTF8);
Expanding on my comment, web servers will only send content compressed using Gzip if the request's Accept-Encoding header contains gzip. To fire off a request containing this header, you can use the following:
$url = "http://www.website.com/feed.xml";
$curl = curl_init($url);
curl_setopt_array($curl, array(
CURLOPT_ENCODING => '', // specify that we accept all supported encoding types
CURLOPT_RETURNTRANSFER => true));
$xml = curl_exec($curl);
curl_close($curl);
if($xml === false) {
die('Can\'t get file');
}
$xmlLinq = simplexml_load_string($xml);
This uses the cURL extension, which is a very flexible library for making HTTP requests.
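If cURL is not an option, a rough equivalent with a stream context is sketched below; note that file_get_contents() hands you the still-compressed body, so it has to be decompressed manually (gzdecode() is built in as of PHP 5.4):
$url = "http://www.website.com/feed.xml";
$context = stream_context_create(array(
    'http' => array(
        'header' => "Accept-Encoding: gzip\r\n",
    ),
));
$raw = file_get_contents($url, false, $context) or die("can't get file");
// Only decompress if the server actually sent gzip (magic bytes 1f 8b).
$xmlStr = (substr($raw, 0, 2) === "\x1f\x8b") ? gzdecode($raw) : $raw;
$xmlLinq = simplexml_load_string($xmlStr);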

Get content from a url using php

I want to get the dynamic contents from a particular url:
I have used the code
echo $content=file_get_contents('http://www.punoftheday.com/cgi-bin/arandompun.pl');
I am getting the following results:
document.write('"Bakers have a great knead to make bread."
') document.write('© 1996-2007 Pun of the Day.com
')
How can I get just the string "Bakers have a great knead to make bread."?
Only the string inside the first document.write will change; the rest of the output stays constant.
Regards,
Pankaj
You are fetching a JavaScript snippet that is meant to be embedded directly into a page, not queried by a script. What it serves is JavaScript code.
You could pull out the text using a regular expression, but I would advise against it. First, it's probably not legal to do. Second, the format of the data they serve can change at any time, breaking your script.
I think you should take a look at their RSS feed instead. You can parse that programmatically far more easily than the JavaScript.
Check out this question on how to do that: Best way to parse RSS/Atom feeds with PHP
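As a rough idea of what that could look like with SimpleXML (the feed URL below is only a placeholder for illustration; use whatever feed address the site actually publishes):
// Hypothetical feed URL for illustration only.
$rss = simplexml_load_file('http://www.punoftheday.com/rss/feed.xml');
if ($rss !== false) {
    foreach ($rss->channel->item as $item) {
        echo $item->title, "\n"; // each pun as plain text
    }
}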
1) several local methods
<?php
echo readfile("http://example.com/"); //outputs the page, then echoes the byte count it returns; needs "allow_url_fopen" enabled
echo include("http://example.com/"); //needs "allow_url_include" enabled
echo file_get_contents("http://example.com/"); //needs "allow_url_fopen" enabled
echo stream_get_contents(fopen('http://example.com/', "rb")); //you may use "r" instead of "rb"; needs "allow_url_fopen" enabled
?>
2) A better way is cURL:
echo get_remote_data('http://example.com'); // GET request
echo get_remote_data('http://example.com', "var2=something&var3=blabla" ); // POST request
//============= https://github.com/tazotodua/useful-php-scripts/ ===========
function get_remote_data($url, $post_paramtrs=false) { $c = curl_init();curl_setopt($c, CURLOPT_URL, $url);curl_setopt($c, CURLOPT_RETURNTRANSFER, 1); if($post_paramtrs){curl_setopt($c, CURLOPT_POST,TRUE); curl_setopt($c, CURLOPT_POSTFIELDS, "var1=bla&".$post_paramtrs );} curl_setopt($c, CURLOPT_SSL_VERIFYHOST,false);curl_setopt($c, CURLOPT_SSL_VERIFYPEER,false);curl_setopt($c, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.1; rv:33.0) Gecko/20100101 Firefox/33.0"); curl_setopt($c, CURLOPT_COOKIE, 'CookieName1=Value;'); curl_setopt($c, CURLOPT_MAXREDIRS, 10); $follow_allowed= ( ini_get('open_basedir') || ini_get('safe_mode')) ? false:true; if ($follow_allowed){curl_setopt($c, CURLOPT_FOLLOWLOCATION, 1);}curl_setopt($c, CURLOPT_CONNECTTIMEOUT, 9);curl_setopt($c, CURLOPT_REFERER, $url);curl_setopt($c, CURLOPT_TIMEOUT, 60);curl_setopt($c, CURLOPT_AUTOREFERER, true); curl_setopt($c, CURLOPT_ENCODING, 'gzip,deflate');$data=curl_exec($c);$status=curl_getinfo($c);curl_close($c);preg_match('/(http(|s)):\/\/(.*?)\/(.*\/|)/si', $status['url'],$link);$data=preg_replace('/(src|href|action)=(\'|\")((?!(http|https|javascript:|\/\/|\/)).*?)(\'|\")/si','$1=$2'.$link[0].'$3$4$5', $data);$data=preg_replace('/(src|href|action)=(\'|\")((?!(http|https|javascript:|\/\/)).*?)(\'|\")/si','$1=$2'.$link[1].'://'.$link[3].'$3$4$5', $data);if($status['http_code']==200) {return $data;} elseif($status['http_code']==301 || $status['http_code']==302) { if (!$follow_allowed){if(empty($redirURL)){if(!empty($status['redirect_url'])){$redirURL=$status['redirect_url'];}} if(empty($redirURL)){preg_match('/(Location:|URI:)(.*?)(\r|\n)/si', $data, $m);if (!empty($m[2])){ $redirURL=$m[2]; } } if(empty($redirURL)){preg_match('/href\=\"(.*?)\"(.*?)here\<\/a\>/si',$data,$m); if (!empty($m[1])){ $redirURL=$m[1]; } } if(!empty($redirURL)){$t=debug_backtrace(); return call_user_func( $t[0]["function"], trim($redirURL), $post_paramtrs);}}} return "ERRORCODE22 with $url!!<br/>Last status codes<b/>:".json_encode($status)."<br/><br/>Last data got<br/>:$data";}
NOTICE: It automatically handles redirects (the FOLLOWLOCATION problem), and relative URLs in the response are rewritten to absolute ones (src="./imageblabla.png" becomes src="http://example.com/path/imageblabla.png").
P.S. On GNU/Linux distro servers you might need to install the php5-curl package to use it.
Pekka's answer is probably the best way of doing this, but here's the regex you might want to use in case you find yourself doing something like this and can't rely on an RSS feed:
document\.write\(' // start tag
([^)]*) // the data to match
'\) // end tag
EDIT for example:
<?php
$subject = "document.write('"Paying for college is often a matter of in-tuition."<br />')\ndocument.write('<i>© 1996-2007 <a target=\"_blank\" href=\"http://www.punoftheday.com\">Pun of the Day.com</a></i><br />')";
$pattern = "/document\.write\('([^)]*)'\)/";
preg_match($pattern, $subject, $matches);
print_r($matches);
?>
