How to fix funny characters coming from twitter API

I'm using Twitter's RESTful API 1.1 and on odd occasions, usually when there is a URL embedded in the tweet, the text comes through with funny characters, e.g.:
#MyHandle_123 RT #ThinkAfricaFeed: Controversy & acrimony may surround Nigeria's country's federalist system but it may be the country's best option: httÃ¢Â€Â¦
I tried calling utf8_decode, but it still renders funny characters in my browser.
Any ideas on how I can get the returned values to show correctly?

I ran into a similar problem. Since you tried utf8_decode and it didn't work, try this:
htmlentities($td->text, ENT_NOQUOTES, 'UTF-8');
where $td is the object whose text property is being referenced.
Hope that helps.
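A minimal sketch of the suggestion above, assuming the tweet text arrives as valid UTF-8 and the page is served as UTF-8 too. The helper name render_tweet_text is made up for illustration and is not part of any Twitter library.

```php
<?php
// Hypothetical helper: encode tweet text for HTML output, treating it as UTF-8.
function render_tweet_text(string $text): string
{
    // ENT_NOQUOTES leaves ' and " alone; other characters with a named
    // HTML entity (including the ellipsis) are converted.
    return htmlentities($text, ENT_NOQUOTES, 'UTF-8');
}

// Assumption: the page itself is also declared as UTF-8, e.g.
// header('Content-Type: text/html; charset=utf-8');
echo render_tweet_text("best option: htt\u{2026}"), PHP_EOL;
```

If the characters are still wrong after this, the text was probably mis-decoded before it reached this point, and encoding it as entities cannot repair that.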

Related

PHP Escape a string if it hasn't already been escaped with entities

I'm using a 3rd party API that seems to return its data with the entity codes already in there. Such as The Lion&#8217;s Pride.
If I print the string as-is from the API it renders just fine in the browser (in the example above it would put in an apostrophe). However, I can't trust that the API will always use the entities in the future so I want to use something like htmlentities or htmlspecialchars myself before I print it. The problem with this is that it will encode the ampersand in the entity code again and the end result will be The Lion&amp;#8217;s Pride in the HTML source, which doesn't render anything user friendly.
How can I use htmlentities or htmlspecialchars only if it hasn't already been used on the string? Is there a built-in way to detect if entities are already present in the string?

No one seems to be answering your actual question, so I will.
How can I use htmlentities or htmlspecialchars only if it hasn't already been used on the string? Is there a built-in way to detect if entities are already present in the string?
It's impossible. What if I'm making an educational post about HTML entities and I want to actually print this on the screen:
The Lion&#8217;s Pride
... it would need to be encoded as...
The Lion&amp;#8217;s Pride
But what if that was the actual string we wanted to print on the screen? ... and so on.
Bottom line is, you have to know what you've been given and work from there – which is where the advice from the other answers comes in – which is still just a workaround.
What if they give you double-encoded strings? What if they start wrapping the HTML-encoded strings in XML? And then wrap that in JSON? ... And then the JSON is converted to binary strings? The possibilities are endless.
It's not impossible for the API you depend on to suddenly switch the output type, but it's also a pretty big violation of the original contract with your users. To some extent, you have to put some trust in the API to do what it says it's going to do. Unit/Integration tests make up the rest of the trust.
And because you could never write a program that works for any possible change they could make, it's senseless to try to anticipate any change at all.

Decode the string, then re-encode the entities (using html_entity_decode()):
$string = htmlspecialchars(html_entity_decode($string));
https://eval.in/662095

There is NO WAY to do what you ask for!
You must know what kind of data is the service giving back.
Anything else would be guessing.
Example:
What if the service is giving back &amp; but is not escaping?
You would guess it IS escaping, so you would wrongly interpret it as & while the correct value is &amp;

I think the best solution is to first decode all HTML entities/special chars from the original string, and then HTML-encode the string again.
That way you will end up with a correctly encoded string, no matter whether the original string was encoded or not.
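The decode-then-encode idea can be sketched as follows. The helper name normalize_entities is hypothetical, and the sketch assumes UTF-8 input.

```php
<?php
// Hypothetical helper: normalise a string that may or may not already contain entities.
function normalize_entities(string $s): string
{
    // Step 1: turn any existing entities back into raw characters.
    $raw = html_entity_decode($s, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Step 2: escape the HTML special characters exactly once.
    return htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');
}

// Both inputs end up as the same, singly encoded string.
echo normalize_entities('Tom &amp; Jerry'), PHP_EOL;
echo normalize_entities('Tom & Jerry'), PHP_EOL;
```

Note the trade-off raised in the other answers: a string that was meant to display entity text literally is flattened by the decode step, so this only works if you accept that guess.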

You also have the option of using htmlspecialchars_decode();
$string = htmlspecialchars_decode($string);

It's already in htmlentities:
php > echo htmlentities('Hi&amp;mom', ENT_HTML5, ini_get('default_charset'), false);
Hi&amp;mom
php > echo htmlentities('Hi&amp;mom', ENT_HTML5, ini_get('default_charset'), true);
Hi&amp;amp;mom
Just use the optional fourth argument (double_encode) to NOT double-encode.
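For reference, htmlspecialchars takes the same fourth argument (double_encode). A small sketch with a made-up helper name, escape_once, assuming UTF-8 input:

```php
<?php
// Hypothetical helper: escape for HTML but leave existing entities untouched.
function escape_once(string $s): string
{
    // Fourth argument (double_encode) = false: existing entities are not re-encoded.
    return htmlspecialchars($s, ENT_QUOTES, 'UTF-8', false);
}

// The numeric entity survives, while the raw < and > are escaped.
echo escape_once('The Lion&#8217;s Pride <b>'), PHP_EOL;
```

This still cannot tell an intentional literal entity from an already-escaped one; it only avoids encoding the ampersand a second time.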

Why is rawurlencode() in PHP adding additional escape characters to ampersands?

I think I'm missing something obvious here but it is driving me crazy and I can't figure it out. I'm developing a WordPress plugin and part of it needs to take the WordPress post title and send that to a RESTful web service to do something else. So of course I want to rawurlencode() the post title since who knows what text might be in there. However, for some reason the output I'm getting has extra escape characters and I have no idea where they are coming from (and it's causing problems with the web service I'm calling obviously).
My code is fairly straightforward:
$topic = get_the_title($post_id);
$curl_post_fields = 'name=' . rawurlencode( $topic );
Yet when I print the output of those two strings I get:
topic=a & b
name=a%20%26%23038%3B%20b
Whereas I would expect the URL encoded string to be
name=a%20%26%20b
I have no idea where that extra %23038%3B could be coming from. If I'm reading the encoding on that correctly it translates to #038; but I still don't know where it's coming from.

There seems to be an HTML encoding step in between as well: instead of &, &#038; is in the encoded string. Probably because & has to be escaped in HTML, and the get_the_title function escapes it using htmlspecialchars or something like that.
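If that is what is happening, one possible fix is to decode the HTML entities before URL-encoding. A minimal sketch, where the literal string stands in for the return value of get_the_title($post_id) (an assumption, since WordPress is not loaded here):

```php
<?php
// Stand-in for get_the_title($post_id); assumed to return HTML-escaped text.
$topic = 'a &#038; b';

// Decode the HTML entities first, then URL-encode the raw text.
$curl_post_fields = 'name=' . rawurlencode(html_entity_decode($topic, ENT_QUOTES | ENT_HTML5, 'UTF-8'));

echo $curl_post_fields, PHP_EOL;
```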

I had some problems with that when I used an older PHP version.

Problem with cyrillic characters in friendly url

Here's the thing. I have friendly urls like
http://site.com/blog/read/мъдростта-на-вековете
http://site.com/blog/read/green-apple
The last segment is actually the friendly title of the blog article. The problem is that when I try to pass that segment to the database, the Cyrillic characters turn into something like %D1%8A%D0%B4%D1%80%D0%BE%D1%81%D1%8 and can't be matched against the database record. In the address bar in my browser it looks normal (мъдростта-на-вековете), but if I choose 'copy url location' the last segment again turns into these strange characters. I'm using CodeIgniter and everything is set to UTF-8.
Please help! :(

The text is just being encoded to fit the specification for URLs.
Echo out the data to a log to see what you are actually trying to pass to the database.
You should be able to decode it with urldecode.
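A short sketch of that round trip, assuming the segment is percent-encoded UTF-8 (the variable names are illustrative only):

```php
<?php
// The friendly title as stored in the database (UTF-8).
$title = 'мъдростта-на-вековете';

// What the browser actually sends: each UTF-8 byte becomes %XX.
$encoded = rawurlencode($title);

// Decoding on the server restores the original text for the database lookup.
$decoded = urldecode($encoded);

echo $encoded, PHP_EOL;
echo $decoded, PHP_EOL;
```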

The above answers are OK, but if you want to use routing with Cyrillic, they aren't enough. For example, if you have http://site.com/блог/статия/мъдростта-на-вековете you will have to do something like this:
In config/routes.php: $route['блог/статия/(:any)'] = "blog/article/$1";
In system/core/URI.php , in the function _explode_segments(), you can change
$val = trim($this->_filter_uri($val));
to
$val = urldecode(trim($this->_filter_uri($val)));
This will solve the above problem, and it also covers controller and function names.

Actually, Firefox is cheating you here: the URL actually is url-encoded, but is shown as if it wasn't. So copy-pasting and retrieving it on the server will have the URL encoded.
(Not sure if other browsers behave in the same way.)

Twitter URL encoding

We're about to launch a little twitter Christmas competition, and I've run into a little snag.
To enter, people will need to post a tweet in the following format:
@user blah, blah, blah #hashtag
Currently, I have a form where they enter their answer (the blah, blah, blah) and a PHP script which encodes the entire statement and adds on the twitter url:
http://www.twitter.com/home?status=%40user%20blah%2Cblah%2Cblah%20%23hashtag
Then takes the user to twitter and puts the status in the update field.
However, whilst the spaces (%20) are decoded fine, the @ and # characters remain as %40 and %23 respectively, even when the tweet is posted. I cannot put the actual characters in the URL, as Twitter mistakes this for a search.
Is there any way to solve this? I'd like to do it without requiring username & password etc if possible.
Any help will be greatly appreciated.

I've had the same problem, and the solution was very simple.
Just use
http://twitter.com/home?status= instead of
http://www.twitter.com/home?status=
and it'll work as expected, even if the text isn't in ASCII.
If you want to know more details about this strange behavior see this blog post:
http://www.kilometer0.com/blog/2010/01/21/twitter-status-urls-and-ampersands/
Hope this helps someone.

Encode the spaces as + and it works:
http://twitter.com/home?status=%40user+blah%2Cblah%2Cblah+%23hashtag

You could try just posting right to Twitter:
<form action="http://www.twitter.com/home" method="GET">
<textarea name="status">
...

Hmm. At least when using the new Twitter layout ... this:
http://twitter.com/home?status=This+is+a+test+%26+So+is+this
... redirects to this (when logged in):
http://twitter.com/?status=This%20is%20a%20test%20&%20So%20is%20this
(notice the unencoded &) ... and the tweet-in-waiting becomes:
This is a test
:(
Myriad adjustments and variations didn't help. (Sigh.)
Admittedly sketchy workaround: change & (%26) to + (%2B). It may be advisable to do this with plain text, before (re-)introducing entities into the equation (e.g., don't change %26 to %2B). Measure twice, cut once, as they say.

After a while I got this... You have to send the text UTF-8 encoded. You can use JavaScript to do that, but I prefer PHP because my text also came from the database...
SHARE ON TWITTER you can also put a twitter icon here...
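The code for that link did not survive here, so the following is only a sketch of how such a share link might be built in PHP, not the answerer's original. It assumes the status text is already UTF-8 and uses the twitter.com/home?status= URL from the answers above; the helper name twitter_share_url is hypothetical.

```php
<?php
// Hypothetical helper: build a "share on Twitter" URL for a status text.
// Assumes $status is already UTF-8.
function twitter_share_url(string $status): string
{
    // rawurlencode percent-encodes @, #, commas and spaces (as %20).
    return 'http://twitter.com/home?status=' . rawurlencode($status);
}

$url = twitter_share_url('@user blah,blah,blah #hashtag');

// Escape the URL once more for use inside an HTML attribute.
echo '<a href="' . htmlspecialchars($url, ENT_QUOTES, 'UTF-8') . '">SHARE ON TWITTER</a>', PHP_EOL;
```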

I've done it using this function from MDN
To be more stringent in adhering to RFC 3986 (which reserves !, ', (, ), and *), even though these characters have no formalized URI delimiting uses, the following can be safely used:
function fixedEncodeURIComponent(str) {
  return encodeURIComponent(str).replace(/[!'()*]/g, function (c) {
    return '%' + c.charCodeAt(0).toString(16);
  });
}
source

Scrape a price off a website

I'm trying to scrape a price from a web page using PHP and Regexes. The price will be in the format £123.12 or $123.12 (i.e., pounds or dollars).
I'm loading up the contents using libcurl. The output of which is then going into preg_match_all. So it looks a bit like this:
$contents = curl_exec($curl);
preg_match_all('/(?:\$|£)[0-9]+(?:\.[0-9]{2})?/', $contents, $matches);
So far so simple. The problem is, PHP isn't matching anything at all - even when there are prices on the page. I've narrowed it down to there being a problem with the '£' character - PHP doesn't seem to like it.
I think this might be a charset issue. But whatever I do, I can't seem to get PHP to match it! Anyone have any ideas?
(Edit: I should note if I try using the Regex Test Tool using the same regex and page content, it works fine)

Have you tried putting \ in front of £?
preg_match_all('/(\$|\£)[0-9]+(\.[0-9]{2})/', $contents, $matches);
I have tried this expression in .NET with \£ and it works. I just edited it and removed some ":".
Read my comment on this post about the possibility of cURL giving you a bad encoding.

Maybe the pound sign has been replaced by its HTML entity? I think you should try your regexp with some sort of testing program (i.e. match it against fixed text locally).
I'd change the regexp like this: '/(?:\$|£)\d+(?:\.\d{2})?/'

This should work for simple values.
'#(?:\$|\£|\€)(\d+(?:\.\d+)?)#'
This will not work with thousands separators, as in 234,343 and 34,454.45.
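If the cause really is a charset mismatch, as the question suspects, one way to rule it out is to match in UTF-8 mode and convert the fetched content first. This is only a sketch: the helper name find_prices is made up, and treating any non-UTF-8 content as ISO-8859-1 is an assumption that should be checked against the page's actual charset.

```php
<?php
// Hypothetical helper: extract prices such as £123.12 or $123.12 from fetched HTML.
function find_prices(string $html): array
{
    // Assumption: content that is not valid UTF-8 is ISO-8859-1.
    // preg_match with the /u modifier returns false for invalid UTF-8 subjects.
    if (preg_match('//u', $html) === false) {
        $html = iconv('ISO-8859-1', 'UTF-8', $html);
    }
    // \x{A3} is the pound sign; /u makes PCRE treat pattern and subject as UTF-8.
    preg_match_all('/(?:\$|\x{A3})[0-9]+(?:\.[0-9]{2})?/u', $html, $matches);
    return $matches[0];
}

print_r(find_prices("Was \u{A3}123.12, now \$99"));
```

Writing the pound sign as \x{A3} keeps the pattern independent of the encoding the PHP source file itself is saved in.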
