Font or Unicode issue on Scraping [duplicate]

This question already has answers here:
PHP DOMDocument failing to handle utf-8 characters (☆)
(3 answers)
Closed 7 years ago.
I am trying to scrape information from a site.
The site shows an address like this:
127 East Zhongshan No 2 Rd; 中山东二路127号
But when I scrape it and echo it, it shows:
127 East Zhongshan No 2 Rd; ä¸å±±ä¸äºè·¯127å·
I also tried setting UTF-8.
Here is my PHP code; please help me solve this problem.
function GrabPage($site){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
    curl_setopt($ch, CURLOPT_TIMEOUT, 40);
    curl_setopt($ch, CURLOPT_COOKIEFILE, "cookie.txt");
    curl_setopt($ch, CURLOPT_URL, $site);
    // Fetch first, then close the handle; statements placed after a return never run.
    $result = curl_exec($ch);
    curl_close($ch);
    return $result;
}
$GrabData = GrabPage($site);
$dom = new DOMDocument();
@$dom->loadHTML($GrabData);
$xpath = new DOMXpath($dom);
$mainElements = array();
$mainElements = $xpath->query("//div[@class='col--one-whole mv--col--one-half wv--col--one-whole'][1]/dl/dt");
foreach ($mainElements as $Names2) {
    $Name2 = $Names2->nodeValue;
    echo "$Name2";
}

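First, send the charset header at the top of the PHP file, before any output:
header('Content-Type: text/html; charset=utf-8');
Then convert the HTML markup you fetched with mb_convert_encoding before loading it, because DOMDocument otherwise treats markup without a charset declaration as ISO-8859-1:
@$dom->loadHTML(mb_convert_encoding($GrabData, 'HTML-ENTITIES', 'UTF-8'));
Note that the 'HTML-ENTITIES' target is deprecated as of PHP 8.2.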
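A commonly used alternative to the 'HTML-ENTITIES' conversion is to prefix the markup with an XML encoding declaration, which hints to libxml that the input is UTF-8. The following is only a minimal sketch under that assumption: the sample markup and the helper name loadUtf8Html are hypothetical and stand in for the scraped page.

```php
<?php
// Hypothetical sample markup standing in for the scraped page (assumed to be UTF-8).
$html = '<html><body><dl><dt>127 East Zhongshan No 2 Rd; 中山东二路127号</dt></dl></body></html>';

// Hypothetical helper: load UTF-8 HTML into a DOMDocument without mangling it.
function loadUtf8Html(string $html): DOMDocument
{
    $dom = new DOMDocument();
    // Collect parser warnings internally instead of suppressing them with @.
    $previous = libxml_use_internal_errors(true);
    // The XML declaration prefix tells libxml to read the bytes as UTF-8.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $dom;
}

$dom = loadUtf8Html($html);
$xpath = new DOMXPath($dom);
$text = $xpath->query('//dl/dt')->item(0)->nodeValue;
echo $text, "\n";
```

The header() call from the answer is still needed so the browser interprets the echoed bytes as UTF-8.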

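First, check whether the captured HTML source is properly encoded. If it is, try
utf8_decode($Name2)
This converts the string from UTF-8 to ISO-8859-1, so it only helps for characters that exist in ISO-8859-1; Chinese characters like the ones in the question would come out as question marks.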

Related

JSON API request in PHP not working Properly [duplicate]

This question already has answers here:
Decode gzipped web page retrieved via cURL in PHP
(2 answers)
Closed 2 years ago.
I'm struggling to retrieve data from an API using PHP.
My PHP code:
<?php
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, "https://api.coronavirus.data.gov.uk/v1/data?filters=areaName=United%2520Kingdom;areaType=overview&structure={%22areaName%22:%22areaName%22,%22date%22:%22date%22,%22newCasesByPublishDate%22:%22newCasesByPublishDate%22,%22cumCasesByPublishDate%22:%22cumCasesByPublishDate%22}");
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
$output = curl_exec($curl);
curl_close($curl);
echo $output;
?>
Sadly this just outputs a random list of characters like this:
��9|_��M�\�E�JЀ���"�H*p�Kx���,:VK�`>���m��a˒��܆�V^��gt�d�e]���������������f�ӷ�7�?_�\=�^Z)�b�j��߽�����������{����������tu�������7����+/J���~���������>������o>|�y�v�����s��Et{���i��L;�+>�Pq�F7��ĵQ�u�q9s��e��*͈f hkh�T����Ӵn�ճ�4�Dk�4��E��q��J����XUi�4-r��j�9UZB���s)�J9��a&�z�� �$�Ͳ��n &��4���򺁖�� �a* ��UP�:Gz�4В:�Z�Pc�#KZsښJU'���mĔ3�#Kj�4KSU�#Kꜧ7�ڇ��ZR����ZUyݨ2I����`�֭�ӴQ�u-Y{JZX�i�%5*dwZ�;����4�Ű�FI�%A��3�������E�u��{#�e#TZm�uWidsVm� j�˱�`�V$��hrz�\���4�������Fǜ��%˩e�xQ��J-F�*�к�m��.��Q��y�m��2�4�+ݚJ��Ts�R�(!Ge�t���\��F�e���)� �kȤ�ZN4Tf��-���ςhَ&*���T�ɕ����ZA �R."��0�Y�#��Ԓ-��ֿ0�l#?թL�Uj��l,$�+md��t0�ȒLv�VM����f�$Abm�L#%)�hU�d�Sf�ݬ�vL�FǼ�k�VE�ܔ�%Y�/z�n��c�=UE&35��yd $3`�ʲ>jf��֬t�E��TvY��׬��(#"]�9%�u��4�'E���U�utQcК�}�T��*�浨0�J%T�Z� ��AkVr�kFj����iL ��0�+S��Y�H9�>u���S�O�S�I0��|�95-���g���Q +1d(HHԮ�Y�M=���IM�i8���T�PA��Ta�Ԇ��O���$*%�2����o�aU<��h��RR�2�.�#��cf�aT�8��JkF �!�O#y&����h$˔ #׎���yς�&è� ��|DO�;Z��$X3U��2yX�b��e��0��Ȩ�4���ZT<�iCK�D�t�i��D���m���Pa� k!����D3&^��������e֔i�2:��s�4�j��*��]z�3�G�tT�g<.�&7����V����^yY٦N$9�%���TF�a?ÿ�T׍��Q���e��6�w2�~L��m� �C�E��#wԏV^�7l5�uk�DPixZ\˅Ny��6�'���"�d�ͭ�5lj��}�J��88��Z�J���S�Ir5V��g�M��#[cx�5�}M�:��k{Suo� \7˔i�%��[�=BUe�U�p���%U��V]#[my�Z��.�V�*���z�4����݇�]�R^&�V�� o� �ɜ&^stS^(&���ʁv�W�p}l�\T)��]�#;�f *��S�U�YD�������\%��ft��*�(���ZO�z���#G�͐6L9*O�W�-[Qu�S������L��n��݊��B�sJ���M�Q����>�+ �F�)9�|��f7?�D�ó�%��Ҁ�������al6m�c#GUǄ(�pyl䨖Z��cVS�q�O�t�PV��1I�j��%Ŭ��F���(�*�)�7�h%�*�U��\����VA�ڠ�~Fл��{k����Rh�b{���i�nv��M� |Ƙhf-TZ����{���&6�z��uW��UC��&�7kC�����D�n��T3�i�.�'��6��w1���,M��B{7e�r#M��81��м�Tc���]TiDR�G�Y K5���`���d��9�G�%����=�r(Й �W���Zx���q�#�^�P���b Oi�S1� &���sB#��sF#G�oB�M(e U�_�26��M(� %6��M(�99�Y�?T�mB)爲J�']N3*��W���7��93&S��}W)mJ݄�P��O1l=����U]�>]_� �[��V��(��V���n�#�h+e�����R�Jl+e�����R�Jl+e�����R���2����l�e+e([)C�J��������7W��ǫ����w������_��}u�������������ot�����������������˻����W��?<�?}��t�|ֿ~����_��_/��}����_��>~۩ώ?����S�=��<}w��������5-��_��_fk��������ç�����m��3W��ߦ���x����⶷nx
Putting the URL directly into a browser works fine and returns well-formed JSON.
I'm really scratching my head over how to get this to parse properly in PHP.
Can anyone help? I'm completely out of my depth here.

Excellent....thank you soooo much !!!!!, you have just solved a problem I have been trying to sort out on-and-off for some time.

The retrieved data is compressed (content-encoded). To have cURL decode it, add this line:
curl_setopt($curl, CURLOPT_ENCODING, '');
Passing an empty string as the parameter makes cURL advertise every encoding it supports in the Accept-Encoding header and decode the response automatically.
With the line added, your code will look like this.
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, "https://api.coronavirus.data.gov.uk/v1/data?filters=areaName=United%2520Kingdom;areaType=overview&structure={%22areaName%22:%22areaName%22,%22date%22:%22date%22,%22newCasesByPublishDate%22:%22newCasesByPublishDate%22,%22cumCasesByPublishDate%22:%22cumCasesByPublishDate%22}");
curl_setopt($curl, CURLOPT_ENCODING, ''); // <--- ADDED LINE!
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
$output = curl_exec($curl);
curl_close($curl);
echo $output;
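To see what the "random characters" are, it can help to decode a compressed body by hand. The following is only an illustrative sketch that needs no network access: the JSON payload is made up, gzencode() simulates what the server sends, and decodeBody is a hypothetical helper. It assumes the zlib extension is available. In real code, CURLOPT_ENCODING does this work for you.

```php
<?php
// Made-up payload standing in for the API response; gzencode() simulates
// the compressed bytes that curl_exec() returns when decoding is not enabled.
$json = '{"areaName":"United Kingdom","newCasesByPublishDate":123}';
$body = gzencode($json);

// Hypothetical helper: decompress the body if it looks like a gzip stream.
function decodeBody(string $body): string
{
    // gzip streams start with the magic bytes 0x1f 0x8b.
    if (strncmp($body, "\x1f\x8b", 2) === 0) {
        $decoded = gzdecode($body);
        if ($decoded !== false) {
            return $decoded;
        }
    }
    // Not gzip (or decoding failed): return the body unchanged.
    return $body;
}

$data = json_decode(decodeBody($body), true);
echo $data['areaName'], "\n";
```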

You just need to set the encoding option. This is working code:
<?php
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, "https://api.coronavirus.data.gov.uk/v1/data?filters=areaName=United%2520Kingdom;areaType=overview&structure={%22areaName%22:%22areaName%22,%22date%22:%22date%22,%22newCasesByPublishDate%22:%22newCasesByPublishDate%22,%22cumCasesByPublishDate%22:%22cumCasesByPublishDate%22}");
curl_setopt($curl, CURLOPT_ENCODING, ''); // let cURL decode compressed responses
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
$output = curl_exec($curl);
curl_close($curl);
echo $output;
?>

cURL input to DOMDocument UTF-8

I am reading the HTML from a URL, and even though the browser reports it as UTF-8, I have to run it through iconv with Windows-1252//IGNORE to get the correct result.
$ch = curl_init();
$timeout = 10;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$html = curl_exec($ch);
curl_close($ch);
$html = iconv("UTF-8", "Windows-1252//IGNORE", $html);
echo ($html);
Output (a long HTML file; this is the relevant raw output): <span class="price">€30 and under</span>
To parse it with DOMDocument I tried different approaches, including enforcing UTF-8 encoding, but basically the code is:
$tmp = new DOMDocument();
//$tmp->encoding = 'UTF-8';
$tmp->loadHTML($html);
echo $tmp->saveXML();
which outputs the HTML as <span class="price">30 and under</span>. This character is a Windows 1252 Character for €, but I cannot figure out how to convert it back to the original (same for other special characters).
Thanks for any ideas on how to explain or fix this really strange DOMDoc behaviour!
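One way to approach this kind of problem is to normalise the bytes to UTF-8 first and then tell DOMDocument explicitly that the input is UTF-8. The sketch below is an assumption about the situation, not a confirmed diagnosis: it assumes the raw bytes really are Windows-1252 (where byte 0x80 is the euro sign), and the sample fragment is hypothetical.

```php
<?php
// Hypothetical fragment encoded as Windows-1252: byte 0x80 is the euro sign there.
$raw = "<span class=\"price\">\x8030 and under</span>";

// Convert Windows-1252 bytes to UTF-8 before handing them to the parser.
$utf8 = iconv('Windows-1252', 'UTF-8', $raw);

$dom = new DOMDocument();
$previous = libxml_use_internal_errors(true);
// The XML declaration prefix hints to libxml that the input is UTF-8.
$dom->loadHTML('<?xml encoding="UTF-8">' . $utf8);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$price = $dom->getElementsByTagName('span')->item(0)->textContent;
echo $price, "\n";
```

If the source really is UTF-8 already, the iconv step should be skipped and only the encoding hint kept.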
fj

How do I extract text data from a web page? [duplicate]

This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 9 years ago.
Okay, so I have the following function that grabs the web page I need:
function login2($url2) {
$fp = fopen("cookie.txt", "w");
fclose($fp);
$login2 = curl_init();
curl_setopt($login2, CURLOPT_COOKIEJAR, "cookies.txt");
curl_setopt($login2, CURLOPT_COOKIEFILE, "cookies.txt");
curl_setopt($login2, CURLOPT_TIMEOUT, 40000);
curl_setopt($login2, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($login2, CURLOPT_URL, $url2);
curl_setopt($login2, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
curl_setopt($login2, CURLOPT_FOLLOWLOCATION, false);
[...]
I then issue this to use the function:
echo login2("https://example.com/clue/holes.aspx");
This echoes the page I am requesting but I only want it to echo a specific piece of data from the HTML source. Here's the specific markup:
<h4>
<label id="cooling percent" for="symbol">*</label>
4.50
</h4>
The only piece of information I want is the figure, which in this specific example is 4.50.
So how can I go about this and make my cURL grab this and echo it instead of echoing the entire page?

You can solve this with XPath:
$html = login2('https://example.com/clue/holes.aspx');
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$value = $xpath->query('//label[@id="ctl00_ctl00_PageContainer_MyAccountContainer_symPound"]/following-sibling::text()')->item(0)->nodeValue;
echo $value;
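The same XPath idea can be tried without any network access. This is a minimal sketch with hypothetical markup: the id symPound is made up for the example, and the value is trimmed because the text node also contains the surrounding whitespace.

```php
<?php
// Hypothetical markup modelled on the question; the id "symPound" is made up.
$html = '<html><body><h4>
<label id="symPound" for="symbol">*</label>
4.50
</h4></body></html>';

$dom = new DOMDocument();
$previous = libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);
// Select the text node that directly follows the label inside the same parent.
$nodes = $xpath->query('//label[@id="symPound"]/following-sibling::text()');

// Guard against a missing node instead of calling ->nodeValue on null.
$value = $nodes->length > 0 ? trim($nodes->item(0)->nodeValue) : null;
echo $value, "\n";
```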

<<<TEXT is being used in a PHP variable to output XML? [duplicate]

This question already has answers here:
In PHP, what does "<<<" represent?
(8 answers)
Closed 8 years ago.
$xmlSend = <<<TEXT
<?xml version="1.0" encoding="UTF-8"?>
<VMCAMEXM>
<business>
<id_company>{$idCompany}</id_company>
<id_branch>{$idBranch}</id_branch>
<country>{$Country}</country>
<user>{$User}</user>
<pwd>{$pwd}</pwd>
</business>
<transacction>
<merchant>{$Merchant}</merchant>
<reference>50000</reference>
<tp_operation>13</tp_operation>
<creditcard>
<crypto>{$Crypto}</crypto>
<type>V/MC</type>
<name>{$name}</name>
<number>{$number}</number>
<expmonth>{$expmonth}</expmonth>
<expyear>{$expyear}</expyear>
<cvv-csc>{$cvv}</cvv-csc>
</creditcard>
<amount>{$cantidad}</amount>
<currency>{$Currency}</currency>
<usrtransacction>1</usrtransacction>
</transacction>
</VMCAMEXM>
TEXT;
echo "<pre>";
print_r(htmlspecialchars($xmlSend));
echo "</pre>";
//$url = $tUrl;
$vars = "&xml=" . $rc4->limpiaVariable(urlencode($xmlSend));
$header[] = "Content-type: application/x-www-form-urlencoded";
$ch = curl_init();
$postfields = "info_asj3=1".$vars;
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_URL,$Url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_TIMEOUT, 250);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $postfields);
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
$data = curl_exec($ch);
if (curl_errno($ch)) {
$data = curl_error($ch);
} else {
curl_close($ch);
}
I am working with a new company that uses this to connect to an XML feed. But the opening <<<TEXT messes up all the code below it. I do not get any errors and the code works, but in my editor all the PHP code below it is shown in black, which makes it hard to manage. If I take it out, the XML feed does not function properly. Can someone tell me why this works and whether there is a better way to accomplish this? I have searched everywhere and can find nothing on the topic. PLEASE HELP!
Thank you to anyone in advance for taking the time to answer!

This is heredoc string syntax.
Heredoc is convenient for multi-line strings and avoids having to escape quotes inside the string.

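This is a heredoc string. If you are running into problems with it, it's likely due to the indentation of the closing identifier. Before PHP 7.3 it must start in the first column of the line; from PHP 7.3 on it may be indented:
TEXT; //<-- must be in column 0 of the text file (before PHP 7.3).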
http://php.net/manual/en/language.types.string.php#language.types.string.syntax.heredoc
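A small sketch of heredoc in action may make the syntax less mysterious. The variable values and the shortened XML below are hypothetical, chosen only to show that {$variable} placeholders are interpolated and that the string ends at the closing TEXT; line.

```php
<?php
// Hypothetical values; in the question these come from the application.
$idCompany = 'ACME';
$amount = '10.00';

// Everything between <<<TEXT and the closing TEXT; is one string.
// {$name} placeholders are replaced with the variable values.
$xmlSend = <<<TEXT
<business>
<id_company>{$idCompany}</id_company>
<amount>{$amount}</amount>
</business>
TEXT;

echo $xmlSend, "\n";
```

If the editor's highlighting breaks after the heredoc, the PHP code itself is usually still fine; the closing identifier is what ends the string.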

How to extract innerHTML using the PHP Dom [duplicate]

This question already has answers here:
How to get innerHTML of DOMNode?
(9 answers)
Closed 2 years ago.
I'm currently using nodeValue to get the HTML output, but it strips the HTML tags and just gives me plain text. Does anyone know how I can modify my code to get the inner HTML of an element by using its ID?
function getContent($url, $id){
// This first section gets the HTML stuff using a URL
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);
// This second section analyses the HTML and outputs it
$newDom = new domDocument;
$newDom->loadHTML($html);
$newDom->preserveWhiteSpace = false;
$newDom->validateOnParse = true;
$sections = $newDom->getElementById($id)->nodeValue;
echo $sections;
}

This works for me:
$sections = $newDom->saveXML($newDom->getElementById($id));
http://www.php.net/manual/en/domdocument.savexml.php
If you have PHP 5.3.6 or later, this is also an option:
$sections = $newDom->saveHTML($newDom->getElementById($id));
http://www.php.net/manual/en/domdocument.savehtml.php
Both calls return the element including its own opening and closing tags.
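To get only the inner HTML, without the element's own tags, one option is to serialise each child node and concatenate the results. This is a minimal sketch: the helper name innerHtml and the sample markup are hypothetical, and it assumes PHP 5.3.6 or later for saveHTML() with a node argument.

```php
<?php
// Hypothetical helper: concatenate the serialised child nodes of an element.
function innerHtml(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML() with a node argument serialises just that node.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

// Hypothetical sample markup standing in for the fetched page.
$newDom = new DOMDocument();
$previous = libxml_use_internal_errors(true);
$newDom->loadHTML('<html><body><div id="colophon"><p>Hello <b>world</b></p></div></body></html>');
libxml_clear_errors();
libxml_use_internal_errors($previous);

$element = $newDom->getElementById('colophon');
// Guard against a missing id before serialising.
$sections = $element !== null ? innerHtml($element) : '';
echo $sections, "\n";
```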

I have modified the code, and it works fine for me. Here it is:
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);
$newDom = new domDocument;
libxml_use_internal_errors(true);
$newDom->loadHTML($html);
libxml_use_internal_errors(false);
$newDom->preserveWhiteSpace = false;
$newDom->validateOnParse = true;
$sections = $newDom->saveHTML($newDom->getElementById('colophon'));
echo $sections;
