Removing Accents from cURL'd page

I have a simple function that uses cURL to grab a page and pull out the first name and surname:
$base_url = 'http://www.behindthename.com/random/random.php';
$query = http_build_query($params);
$url = $base_url . '?' . $query;
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
$result = curl_exec($curl);
curl_close($curl);
A sample $params array could look like this:
Array (
[number] => 1
[gender] => f
[surname] =>
[randomsurname] => yes
[all] => no
[usage_lth] => 1
)
Some of the names that come back have accents in them (which is fine, and I would like that to happen). However, I need to remove the accents when I am doing certain things with the names.
I have tried using WordPress's remove_accents function, but it never seems to get past the first !preg_match conditional. The conditional always evaluates to true and the original string just gets returned.
However, if I copy paste a name with accents in it, hard code it, and then run remove_accents on it, everything works. For example:
$name = 'Þýri';
echo remove_accents($name);
Returns 'THyri'.
I don't really understand, since as far as I can tell, the result from curl_exec is utf8, which should work fine.
I have tried calling remove_accents directly on the result returned by cURL (to make sure that my method of pulling out the names wasn't messing anything up), and that doesn't work either - the accents remain.
I have also tried removing the !preg_match conditional; in this case seems_utf8 comes back true, but the accents still don't get removed.
What am I doing wrong?

I think you could give htmlentities a try. The function converts accented characters to their respective HTML entities; you can read the documentation here: http://php.net/manual/en/function.htmlentities.php
Using this function will convert, for example:
$string = 'noè';
echo htmlentities($string);
This will output:
no&egrave;
Which a browser will render as
noè
Otherwise, if you just need to replace an accented character with a plain letter, you can use the str_replace function, which looks for a given value and replaces it with a target value. Here is an example:
echo str_replace('è', 'e', $string);
This will output
noe
In this case you will have to manually list every accented character you want to replace.
UPDATED
In your case the accents are already encoded as HTML entities, so you can either decode them and then swap in plain letters, or store them with the accents intact, which is possible if your database encoding (UTF-8) allows it.
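The "decode, then swap" route from the update could look roughly like the sketch below. It assumes the fetched page delivers names as HTML entities, as the update suggests; the $scraped value and the small $map array are made up for illustration and are nowhere near a complete accent table.

```php
<?php
// Hypothetical scraped value: the name arrives as HTML entities, not raw UTF-8.
$scraped = '&THORN;&yacute;ri';

// Step 1: turn the entities back into real UTF-8 characters.
$decoded = html_entity_decode($scraped, ENT_QUOTES, 'UTF-8');

// Step 2: swap accented characters for plain ones.
// Illustrative map only; extend it with every character you need.
$map = array('Þ' => 'TH', 'ý' => 'y', 'è' => 'e');
$plain = strtr($decoded, $map);

echo $decoded, "\n";
echo $plain, "\n";
```

Once the string has been decoded, a fuller transliteration routine can take the place of the hand-written map.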

Related

PHP parse_str of URL query returns array keys with 'amp' in key name

Environment: Wordpress (latest) on test site on Host (JustHost), with PHP version 5.4.43.
Using parse_str on a URL query, the resulting array returns array key names of amp;keyname . Example code:
$querystring = "ie=UTF8&qid=1468851514&sr=8-1&keywords=there&tag=sitename-20";
parse_str($querystring, $queryarray);
echo "<pre>";
print_r($queryarray);
echo "</pre>";
Returns
Array
(
[ie] => UTF8
[amp;qid] => 1468851514
[amp;sr] => 8-1
[amp;keywords] => there
[amp;tag] => sitename-20
)
Why is the 'amp;' in the key name?
Is this a PHP version issue (it seems to work OK in a EasyPHP local environment - version 5.4.24, but not on my test WP server on my hosting place)? Or am I confused (again)?

&amp; must only be used when outputting URLs in HTML/XML data.
You can try using html_entity_decode():
$querystring = "ie=UTF8&qid=1468851514&sr=8-1&keywords=there&tag=sitename-20";
parse_str(html_entity_decode($querystring), $queryarray);
echo "<pre>";
print_r($queryarray);
echo "</pre>";
I hope it helps.

If you are sure that $querystring doesn't contain other encoded entities, you can use html_entity_decode, as @Hugo E Sachez suggests.
But in some complex systems $querystring may come from a place that you don't control, and it may contain entities that were encoded on purpose as a safety measure, like &quot;, &lt;, or &gt;.
So if you decode all entities, parse the data, and then somehow return it to the user, an unwanted <script> could end up being executed.
I assume it would be safer to replace only &amp; with &, and keep the other entities encoded.
Like this: parse_str(str_replace('&amp;', '&', $querystring), $queryarray);
Correct me if I am wrong
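As a rough illustration of the difference, the sketch below parses the same query string twice, once as-is and once after replacing only &amp;. It assumes the string reached the script already HTML-escaped, which is what the amp; keys in the question suggest.

```php
<?php
// Assumed input: the query string was HTML-escaped before it reached parse_str().
$querystring = 'ie=UTF8&amp;qid=1468851514&amp;sr=8-1&amp;keywords=there&amp;tag=sitename-20';

// Parsed as-is, every "&amp;" is split at "&", leaving "amp;" glued to the next key.
parse_str($querystring, $broken);

// Replacing only "&amp;" restores the separators and touches no other entity.
parse_str(str_replace('&amp;', '&', $querystring), $fixed);

print_r(array_keys($broken));
print_r(array_keys($fixed));
```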

Æøå in returned JSON result - the data doesn't look like it's supposed to

I have fetched some data from a url request using JSON with the following code:
$url = 'https://recruit.zoho.com/ats/private/xml/JobOpenings/getRecords?authtoken=$at&scope=recruitapi';
$request = new WP_Http;
$result = $request->request($url, $data = array());
$input = json_encode($result, true);
var_dump($input);
This code worked absolutely fine, except the data coming out looked really weird, such as:
"content-encoding":"gzip","vary":"Accept-Encoding","strict-transport-security":"max-age=15768000"},"body":"\u003C?xml version=\"1.0\" encoding=\"UTF-8\" ?\u003E\n\u003Cresponse uri=\"\/ats\/private\/xml\/JobOpenings\/getRecords\"\u003E\u003Cresult\u003E\u003CJobOpenings\u003E\u003Crow no=\"1\"\u003E\u003CFL val=\"JOBOPENINGID\"\u003E\u003C![CDATA[213748000001263043]]\u003E\u003C\/FL\u003E\u003CFL val=\"Published in website\"\u003E\u003C![CDATA[false]]\u003E\u003C\/FL\u003E\u003CFL val=\"Modified by\"\u003E\u003C![CDATA
After some research, I realize that part of the problem most likely is the fact that there are æ, ø, and å in the data I'm requesting. Others have solved the problem this way:
$input = json_encode(utf8_decode($result), true);
However this gives me this error:
Warning: utf8_decode() expects parameter 1 to be string, array given in
I know the array is not a string, but how else do I deal with this? It seems to have worked for others, and I can't figure out why.
Thanks.
Edit:
I noticed this in the beginning of the printed data.
string(31486) "{"headers":{"server":"ZGS","date":"Wed, 12 Aug 2015 13:59:32 GMT","content-type":"text\/xml;charset=utf-8"
Does that mean it is already UTF-8 and I'm totally off?

What you receive in $result is UTF-8 data that seems to represent a URL of some sort. In any case, json_encode escapes Unicode characters as \uXXXX sequences.
If you don't want UTF-8 characters to be escaped, this question is relevant to you: Why does the PHP json_encode function convert UTF-8 strings to hexadecimal entities?
Everything seems to work fine from what I can see, although the string you have provided seems to be truncated; I guess that is an error on your part.
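To see the escaping in isolation, here is a small sketch using a made-up $title value. It relies on the JSON_UNESCAPED_UNICODE flag, which is available from PHP 5.4 onward; either form decodes back to the same string.

```php
<?php
// Hypothetical value containing the characters from the question.
$title = 'Æøå';

// Default behaviour: non-ASCII characters become \uXXXX escape sequences.
$escaped = json_encode($title);

// With the flag, the UTF-8 characters are written out unchanged.
$readable = json_encode($title, JSON_UNESCAPED_UNICODE);

echo $escaped, "\n";
echo $readable, "\n";

// Both representations are valid JSON and decode to the original text.
var_dump(json_decode($escaped) === json_decode($readable));
```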

Extract JSON from HTML using PHP

I'm reading source code of an online shop website, and on each product page I need to find a JSON string which shows product SKUs and their quantity.
Here are 2 samples:
'{"sku-SV023435_B_M":7,"sku-SV023435_BL_M":10,"sku-SV023435_PU_M":11}'
The sample above shows 3 SKUs.
'{"sku-11430_B_S":"20","sku-11430_B_M":"17","sku-11430_B_L":"30","sku-11430_B_XS":"13","sku-11430_BL_S":"7","sku-11430_BL_M":"17","sku-11430_BL_L":"4","sku-11430_BL_XS":"16","sku-11430_O_S":"8","sku-11430_O_M":"6","sku-11430_O_L":"22","sku-11430_O_XS":"20","sku-11430_LBL_S":"27","sku-11430_LBL_M":"25","sku-11430_LBL_L":"22","sku-11430_LBL_XS":"10","sku-11430_Y_S":"24","sku-11430_Y_M":36,"sku-11430_Y_L":"20","sku-11430_Y_XS":"6","sku-11430_RR_S":"4","sku-11430_RR_M":"35","sku-11430_RR_L":"47","sku-11430_RR_XS":"6"}',
The sample above shows many more SKUs.
The number of SKUs in the JSON string can range from one to infinity.
Now, I need a regex pattern to extract this JSON string from each page. At that point, I can easily use json_decode().
Update:
I found another problem; sorry that my question was not complete. There is another, similar JSON string that also starts with sku-. Please have a look at the source code of the link below and you will understand. The only difference is that the values in that one are alphanumeric, while the values in the one we need are numeric. Also please note that our final goal is to extract the SKUs with their quantities, so maybe you have a more straightforward solution.
Source
@chris85
Second update:
Here is another strange issue, which is a bit off topic.
When I fetch the URL content using the code below, there is no JSON string in the source!
$html = file_get_contents("http://www.dresslink.com/womens-candy-color-basic-coat-slim-suit-jacket-blazer-p-8131.html");
But when I open the URL in my browser, the JSON is there! I'm really confused about this :(

Trying to extract specific data from JSON directly with a regexp is nearly always a bad idea because of the way JSON is encoded. The best way is to match the whole JSON string with a regexp, then decode it using the PHP function json_decode.
The issue with the missing data is due to a missing required cookie. See my comments in the code below.
<?php
function getHtmlFromDresslinkUrl($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // You must send the currency cookie to the website for it to return the JSON you want to scrape
    curl_setopt($ch, CURLOPT_HTTPHEADER, array(
        'Cookie: currencies_code=USD;',
    ));
    $output = curl_exec($ch);
    curl_close($ch);
    return $output;
}
$html = getHtmlFromDresslinkUrl("http://www.dresslink.com/womens-candy-color-basic-coat-slim-suit-jacket-blazer-p-8131.html");
// Get the arguments of this specific JS function call only
if (preg_match("/DL\.items\_list\.initItemAttr\((.+)\)\;/", $html, $matches)) {
    $arguments = $matches[1];
    // Split by argument separator.
    // I know, this isn't great but it seems to work.
    $args_array = explode(", ", $arguments);
    // You need the 5th argument (index 4)
    $fifth_arg = $args_array[4];
    // Strip quotes
    $fifth_arg = trim($fifth_arg, "'");
    // json_decode
    $qty_data = json_decode($fifth_arg, true);
    // Then you can work with the PHP array
    foreach ($qty_data as $name => $qtty) {
        echo "Found " . $qtty . " of " . $name . "<br />";
    }
}
?>
Special thanks to @chris85 for making me read the question again. Sorry, but I couldn't undo my downvote.

You will want to use preg_match_all() to perform the regex matching operation (documentation here).
The following should do it for you. It will match each substring beginning with "sku-" and running through the colon and any digits that follow it.
preg_match_all("/sku\-.+?:[0-9]*/", $input, $matches)
Working example here.
Alternatively, if you want to extract the entire string, you can use:
preg_match_all("/{.sku\-.*}/", $input, $matches)
This will grab everything between the opening and closing braces.
Working example here.
Please note that $input denotes the input string and $matches receives the matches.

A simple /'(\{"[^\}]+\})'/ will match all these JSON strings. Demo: https://regex101.com/r/wD5bO4/2
The first capture group will contain the JSON string for json_decode:
preg_match_all("/'(\{\"[^\}]+\})'/", $html, $matches);
$html is the HTML to be parsed; with the default PREG_PATTERN_ORDER, the JSON strings will be in $matches[1][0], $matches[1][1], $matches[1][2], etc.
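Put together, the capture-and-decode step might look like the sketch below. The $html value is a shortened, made-up stand-in for a product page, built from the first sample in the question; a real page may need a stricter pattern to skip the other sku- string mentioned in the update.

```php
<?php
// Hypothetical stand-in for the fetched page source.
$html = "initItemAttr(1, 'x', '{\"sku-SV023435_B_M\":7,\"sku-SV023435_BL_M\":10,\"sku-SV023435_PU_M\":11}');";

// Capture every single-quoted {...} block; group 1 excludes the quotes.
preg_match_all('/\'(\{"[^}]+\})\'/', $html, $matches);

$stock = array();
foreach ($matches[1] as $json) {
    $decoded = json_decode($json, true);
    // Keep only candidates that really are valid JSON objects.
    if (is_array($decoded)) {
        $stock[] = $decoded;
    }
}

print_r($stock);
```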

Character Loss converting _GET Array to URL string

Having a very bizarre issue with the conversion of a $_GET request into a string.
(PHP 5.2.17)
Here is a small snippet of the problem area of the array from print_r():
_GET (array)
...
[address_country_code] => GB
[address_name] => Super Mario
[notify_version] => 3.7
...
There are two cases the _GET data is used:
Case 1): Saved then used later:
// Script1.php
$data = json_encode($_GET);
# > Save to MySQL Database ($data)
// Script2.php (For Viewing & Testing URL later)
# > Load from Database ($result)
echo http_build_query(json_decode($result,true));
Result of above array snippet: (CORRECT OUTPUT)
address_country_code=GB&address_name=Super+Mario&notify_version=3.7
Case 2): Used in same script as Case 1) just before its saved in Case 1):
// Script1.php
echo http_build_query($_GET);
Results in: (INCORRECT OUTPUT)
address_country_code=GB&address_name=Super+Mario¬ify_version=3.7
How is it possible that a few chars are output as a ¬ in case 2, yet case 1 is fine?
It is driving me insane :(
I have also tried replacing http_build_query with a custom function that builds the URL by calling urlencode() on the key and value inside a foreach loop; this just resulted in the ¬ being changed to %C2%AC in one of my test cases!

Everything is ok with your data. You can verify it if you do:
$query = http_build_query($_GET);
parse_str($query, $data);
print_r($data);
You will get the correct, uncorrupted data.
The reason you see the ¬ symbol is the way the browser interprets HTML entities. ¬ is written as &not; but the browser will render it even without the semicolon at the end.

You're most likely displaying this data in a web browser and that is interpreting
&not
as a special HTML entity.
Please see this: https://code.google.com/p/doctype-mirror/wiki/NotCharacterEntity
Try doing
var_dump(http_build_query($_GET))
instead of:
echo http_build_query($_GET)
and view the HTML source to verify the actual string.

So, even though both cases output to a web browser and both convert from an array using http_build_query(), only Case 2 was affected.
I fixed the problem in Case 2 by replacing http_build_query (Case 1 still uses it) with this call:
htmlspecialchars(http_build_query($_GET));
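A small sketch of why this works, using a hand-written $params array in place of $_GET (the values are copied from the snippet in the question). The separator is passed explicitly so the result does not depend on the arg_separator.output ini setting.

```php
<?php
// Stand-in for $_GET, using the values shown in the question.
$params = array(
    'address_country_code' => 'GB',
    'address_name'         => 'Super Mario',
    'notify_version'       => '3.7',
);

// The raw query string contains "&notify_version", which a browser
// rendering it as HTML may read as the entity &not followed by "ify_version".
$raw = http_build_query($params, '', '&');

// Escaping for HTML output turns each "&" into "&amp;", so nothing is
// left for the browser to interpret as an entity.
$safe = htmlspecialchars($raw);

echo $safe, "\n";
```

The escaped form is only for display in a page; the unescaped $raw is still the string to use when actually requesting the URL.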

Jquery, ajax and the ampersand conundrum

I know that I should encodeURI any url passed to anything else, because I read this:
http://www.digitalbart.com/jquery-and-urlencode/
I want to share the current time of the current track I am listening to.
So I installed the excellent yoururls shortener.
And I have a bit of code that puts all the bits together, and makes the following:
track=2&time=967
As I don't want everyone seeing my private key, I have a little php file which takes the input, and appends the following, so it looks like this:
http://myshorten.example/yourls-api.php?signature=x&action=shorturl&format=simple&url=http://urltoshorten?track=2&time=967
So in the main page, I call the jquery of $("div.shorturl").load(loadall);
It then does a little bit of CURL and then shortener returns a nice short URL.
Like this:
$myurl='http://myshorten.example/yourls-api.php?signature=x&action=shorturl&format=simple&url=' . $theurl;
$ch = curl_init($myurl);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$data = curl_exec($ch);
curl_close($ch);
if ($data === false) {
echo 'cURL failed';
exit;
}
echo $data;
All perfect.
Except... the URL that gets shortened is always of the form http://urltoshorten?track=2 - anything after the ampersand is lost.
I have tried wrapping the whole URL in PHP's urlencode, I've wrapped the track=2&time=967 in both encodeURI and encodeURIComponent, and I've even tried wrapping the whole thing in one or both.
And still, the & breaks it, even though I can see the submitted url looks like track=1%26time%3D5 at the end.
If I paste this or even the "plain" version with the unencoded url either into the yoururls interface, or submit it to the yoururls via the api as a normal URL pasted into the location bar of the browser, again it works perfectly.
So it's not yoururls at fault, it seems like the url is being encoded properly, the only thing I can think of is CURL possibly?
Now at this point you might be thinking "why not replace the & with a * and then convert it back again?".
OK, so when the url is expanded, I get the values from
var track = $.getUrlVar('track');
var time = $.getUrlVar('time');
so I COULD lose the time var, then do a bit of finding on where the * is in track and then assume the rest of anything after * is the time, but it's a bit ugly, and more to the point, it's not really the correct way to do things.
If anyone could help me, it would be appreciated.

I have tried wrapping the whole URL in php's URLencode
That is indeed what you have to do (assuming by ‘URL’ you mean inner URL being passed as a component of the outer URL). Any time you put a value in a URL component, you need to URL-encode, whether the value you're setting is a URL or not.
$myurl='http://...?...&url='.rawurlencode($theurl);
(urlencode() is OK for query parameters like this, but rawurlencode() is also OK for path parts, so unless you really need spaces to look slightly prettier [+ vs %20], I'd go for rawurlencode() by default.)
This will give you a final URL like:
http://myshorten.example/yourls-api.php?signature=x&action=shorturl&format=simple&url=http%3A%2F%2Furltoshorten%3Ftrack%3D2%26time%3D967
Which you should be able to verify works. If it doesn't, there is something wrong with yourls-api.php.
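As a quick check of the encoding step, the sketch below builds the outer URL from the example values in the question and then parses it back, roughly the way the receiving script would see it. The variable names $base and $query are only for illustration.

```php
<?php
// Values taken from the question; the signature "x" is a placeholder.
$theurl = 'http://urltoshorten?track=2&time=967';
$base   = 'http://myshorten.example/yourls-api.php?signature=x&action=shorturl&format=simple&url=';

// Encode the inner URL once, as a single query-parameter value.
$myurl = $base . rawurlencode($theurl);
echo $myurl, "\n";

// Parsing the outer query string recovers the inner URL in one piece.
parse_str(parse_url($myurl, PHP_URL_QUERY), $query);
echo $query['url'], "\n";
```

If the inner URL is encoded a second time somewhere along the way (for example once in JavaScript and again in PHP), the receiving side would see literal %26 sequences instead of &, so it is worth checking that the encoding happens exactly once.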

I have tried wrapping the whole URL in php's URLencode, I've wrapped the track=2&time=967 in both encodeURI and encodeURIComponent, I've even tried wrapping the whole thing in one or both. And still, the & breaks it, even though I can see the submitted url looks like track=1%26time%3D5 at the end.
Maybe an explanation of how HTTP variables work will help you out.
If I'm getting a page with the following variables and values:
var1 = Bruce Oxford
var2 = Brandy&Wine
var3 = ➋➌➔ (unicode chars)
We URI-encode the name and the value of each var, i.e.:
var1 = Bruce+Oxford
var2 = Brandy%26Wine
var3 = %E2%9E%8B%E2%9E%8C%E2%9E%94
What we are not doing is encoding the delimiting characters, so the request data for the above will look like this:
?var1=Bruce+Oxford&var2=Brandy%26Wine&var3=%E2%9E%8B%E2%9E%8C%E2%9E%94
Rather than:
%3Fvar1%3DBruce+Oxford%26var2%3DBrandy%26Wine%26var3%3D%E2%9E%8B%E2%9E%8C%E2%9E%94
Which is of course just gibberish.
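In PHP, http_build_query() does this for you: it encodes each name and value and leaves the = and & delimiters alone. A minimal sketch with the three variables above, assuming the source file is saved as UTF-8:

```php
<?php
// The three example variables from the explanation above.
$vars = array(
    'var1' => 'Bruce Oxford',
    'var2' => 'Brandy&Wine',
    'var3' => '➋➌➔',
);

// Names and values are encoded; the "=" and "&" between them are not.
$query = http_build_query($vars, '', '&');
echo '?' . $query, "\n";

// Decoding the query string gives back the original values.
parse_str($query, $roundtrip);
var_dump($roundtrip === $vars);
```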
