Cannot convert special (ASCII/UTF-8) characters - php

I am trying to take a group of characters which are ASCII and which contains things like ☺ ☻ ♥.
When I try to echo them out with PHP I get ? for each one. And when I try to use htmlentities() it works for the heart but returns garbled output instead of ☺.
Ultimately, what I am trying to do is take a string of text that I have no control over, but which is in ASCII, display it with HTML, and store it with SQL.
I am sorry if this is a poorly formed question, but I am not sure how this whole section of conversion works.
Thanks.

Related

PHP Converting UTF-8 to TeX/LaTeX Formatting?

I have some strings that include encoding that would work with TeX, (For example they look like "Pi/~na Colada" instead of Piña Colada). Is there a simple way to convert this to show properly without creating my own function to convert the characters?
No.
But:
The tex.stackexchange.com wiki has a decent list of TeX accents.
Then you just need to correlate them with their UTF-8 combining mark.
Make sure you move the combining mark to after the character you want it combined with.
e.g. "/~n" becomes "n" . $combining_mark
You might then want to run it through intl's Normalizer in NFC form to compose the character into a single codepoint, if it exists.
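The steps above can be sketched roughly as follows. This is a minimal sketch, not a complete TeX converter: the function name tex_accents_to_utf8 is hypothetical, the accent table covers only five common accents, and the pattern accepts both the usual backslash form and the "/~n" spelling from the question (which means it could also mangle things like "/~user" in a URL). Composition relies on the intl extension's Normalizer and is skipped if it is not loaded.

```php
<?php
// Hypothetical helper: converts a few TeX accent commands to Unicode.
// The accent table is a small illustrative subset, not a complete list.
function tex_accents_to_utf8(string $text): string
{
    // TeX accent command => Unicode combining mark (assumed subset).
    $marks = [
        '~' => "\u{0303}", // combining tilde
        "'" => "\u{0301}", // combining acute accent
        '`' => "\u{0300}", // combining grave accent
        '^' => "\u{0302}", // combining circumflex
        '"' => "\u{0308}", // combining diaeresis
    ];

    // Match \~n, \~{n}, and the "/~n" spelling used in the question.
    $converted = preg_replace_callback(
        '#[\\\\/]([~\'`^"])\{?([A-Za-z])\}?#',
        function (array $m) use ($marks): string {
            // The combining mark goes AFTER the base letter.
            return $m[2] . $marks[$m[1]];
        },
        $text
    );

    // Compose letter + mark into a single code point where one exists.
    if (class_exists('Normalizer')) {
        $composed = Normalizer::normalize($converted, Normalizer::FORM_C);
        if ($composed !== false) {
            return $composed;
        }
    }
    return $converted;
}

echo tex_accents_to_utf8('Pi/~na Colada'), "\n";
```

With intl available the result contains a single precomposed ñ; without it, the string still renders the same in most environments but holds "n" followed by a combining tilde.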

PHP - Convert unicode string to HTML entities

I need to take a unicode string to HTML Entities.
Let's say the string is: ιи¢яє∂ιвℓє ѕкιℓℓѕ
I want to convert it to its HTML-entity form (e.g. &#953; for ι)
In order to be able to store it in my SQL db and then output the desired text later.
This basically has the functionality I want, where you type in a string and it converts it: http://www.online-toolz.com/tools/unicode-html-entities-convertor.php
I'm not very familiar with UTF-8/ASCII/different types of special characters in general, but I'd like it to be able to handle all the Unicode characters here: https://unicode-table.com/en/
Sorry if this is a dumb question, but I tried all combinations of utf8_decode, htmlentities, htmlspecialchars and mb_convert_encoding and I can't get the desired result.
Thank you for looking!
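One possible approach, sketched here under assumptions rather than taken from the thread: mbstring's mb_encode_numericentity() can turn every non-ASCII character into a decimal numeric entity. The helper name unicode_to_entities is hypothetical, the input is assumed to be valid UTF-8, and the mbstring extension must be loaded.

```php
<?php
// Hypothetical helper; requires the mbstring extension and UTF-8 input.
function unicode_to_entities(string $text): string
{
    // Conversion map: [first code point, last code point, offset, mask].
    // Everything outside the ASCII range becomes a decimal numeric entity.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($text, $map, 'UTF-8');
}

// Greek iota, Cyrillic i, cent sign (the first three characters of the sample).
$sample  = "\u{03B9}\u{0438}\u{00A2}";
$encoded = unicode_to_entities($sample);

echo $encoded, "\n"; // &#953;&#1080;&#162;

// Round trip back to the original UTF-8 text.
echo html_entity_decode($encoded, ENT_QUOTES | ENT_HTML5, 'UTF-8'), "\n";
```

Note that this leaves ASCII characters such as < and & untouched, so htmlspecialchars() is still needed when the text is echoed into a page.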

Strip Base64 strings from long text

I really wonder if I'm really the first one asking this question, or am I just too blind to find anything about it...
I have a longer text and I want to strip base64-encoded strings from it:
I am a text and have some lines with some content
There are more than one line but sometimes I have
aSBhbSBhIG5vcm1hbCB0ZXh0IHRoYXQgd2FzIGNvZ
GVkIGluIGJhc2UgNjQgYW5kIG5vdyBpIHdhcyB0cmFu
c2xhdGVkIGJhY2sgdG8gYmxhbmsgdGV4dGZvcm1hd
C4gaSB0aGFuayB5b3UgZm9yIHBheWluZyBhdHRlbnRp
b24uIGJ5ZQ==
and this is what I want to strip / extract by using php
As you can see, there is base64-encoded data in the text, and I want to extract/strip these lines.
I already tried a lot of regex samples from SO, something like
$regex = '#^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$#m';
preg_match($regex, $content, $output_array );
but this did not solve anything...
What I need is a regex that selects only the base64 strings...
Is this even possible? I mean, is base64 selectable by regex? I guess :)
EDIT: The string source is the content of an email.
EDIT2: I guess the best approach in this case would be to track strings that have more than one uppercase character, can contain numbers, and have no whitespace. But regex is not my daily bread :D
First of all: You can not reliably do this!
Why?
Simple: the reason base64 is so great in some cases is that it encodes all the data with "standard" characters, the ones used in normal texts, sentences, and yes, even words.
Background
Is "Hello" a base64-encoded string? Well, yes, in the sense that it consists only of valid base64 characters. Decoding it probably returns a lot of gibberish, but a lenient decoder will accept it.
Therefore, you can only decide on a length beyond which you consider a run of characters without any spaces to be base64 encoded. Of course, in languages such as German you may have quite some trouble here, as there are compound nouns such as "Bäckerfachverkäuferinnenhosenherstellungsautomatenzuliefererdienst" (just made that up).
Workaround
So you have to decide on the length yourself, and then you can go with this:
[a-zA-Z0-9\+\/\=]{20,}
Also see the example here: https://regex101.com/r/uK5gM1/1
I considered 20 to be the minimum length for "base64-encoded stuff" here, but as I said, it is up to you. Also, as a small side note, the = is not really encoded content but padding, but I still added it to the regex.
Edit: Gnah... you can even see in my example that I did not catch the last line :) When changing the number to 12 it works fine here, but there may be words with more than 12 characters... so, as I said, this is not reliably possible in this manner.
For the snippet in the example, /^\w{53}$/gm does the job, if you can rely on the length, of course.
EDIT:
Considering the circumstances and updates, I would go with /\n([\w=\n]{50,})\n/gs, but without metadata it may be tricky to guess the MIME type of the decoded content, and almost impossible to restore filenames etc.
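As a rough illustration of the length-based heuristic discussed above, the sketch below combines it with two extra checks: the candidate must span whole lines, and it must survive PHP's strict base64_decode(). The function name extract_base64_blocks and the default minimum length of 20 are assumptions for this example, and the caveat from the answers still applies: a long word standing alone on a line can be a false positive.

```php
<?php
// Hypothetical helper: returns the decoded payload of each base64-looking block.
function extract_base64_blocks(string $text, int $minLength = 20): array
{
    // A block is a run of consecutive lines made up only of base64 characters.
    $pattern = '/^[A-Za-z0-9+\/]+={0,2}(?:\R[A-Za-z0-9+\/]+={0,2})*$/m';
    preg_match_all($pattern, $text, $matches);

    $decodedBlocks = [];
    foreach ($matches[0] as $candidate) {
        // Join the wrapped lines before decoding.
        $joined = preg_replace('/\s+/', '', $candidate);
        // Padded base64 always has a length that is a multiple of 4.
        if (strlen($joined) < $minLength || strlen($joined) % 4 !== 0) {
            continue;
        }
        // Strict mode rejects characters outside the base64 alphabet.
        $decoded = base64_decode($joined, true);
        if ($decoded !== false) {
            $decodedBlocks[] = $decoded;
        }
    }
    return $decodedBlocks;
}

// Build a small mail-like sample: prose, a wrapped base64 block, more prose.
$payload = 'hello world, this is a longer payload';
$mail = "Some intro line\n"
      . chunk_split(base64_encode($payload), 16, "\n")
      . "and a closing line\n";

print_r(extract_base64_blocks($mail));
```

Because the source is an email, parsing the MIME structure and reading the Content-Transfer-Encoding header of each part would be more reliable than any regex, where that is an option.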

converting special characters in HTML into the appropriate coding for PHP

I am making a website where the user fills out a form and it creates a PDF. The user will be able to enter diacritics and special characters. The way I am sending the characters to PHP, they arrive as HTML entities, e.g. &agrave;. I need to change them into something PHP will read, so that when I put the text through our PDF maker it shows the diacritic character and not the HTML code for it.
I wrote a test to try this out but I haven't been able to figure it out. If I have to, I will end up writing an array of every possible character they can use and translating the incoming string, but I am trying to find an easier solution.
Here is the code of my test:
$title = "Test of Title for use With This Project and it should also wrap because it is s&ograve; long! Acutally it is even longer than previously expected!";
$ti = htmlspecialchars_decode($title);
I have been attempting to use htmlspecialchars_decode() to convert it, but it still comes out as &ograve; and not ò. Is there an easy way to do this?
See the documentation, which tells you that it won't touch most of the characters you care about and that you should use html_entity_decode instead.
Use the html_entity_decode function instead of htmlspecialchars_decode (which only decodes the special HTML characters such as &amp;, &quot;, &lt; and &gt;, not all entities).
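A short comparison of the two functions may make the difference concrete. The sample string is made up for this sketch, and the output is assumed to be wanted as UTF-8.

```php
<?php
$title = 'it is s&ograve; long &amp; wide';

// Only the few "special" entities are decoded; &ograve; is left alone.
$special = htmlspecialchars_decode($title);

// Every named and numeric entity is decoded into UTF-8 text.
$full = html_entity_decode($title, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $special, "\n"; // it is s&ograve; long & wide
echo $full, "\n";    // it is sò long & wide
```

If the PDF library expects a single-byte encoding rather than UTF-8, the third argument of html_entity_decode() would need to name that encoding instead.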

How to parse unicode format (e.g. \u201c, \u2014) using PHP

I am pulling data from the Facebook graph which has characters encoded like so: \u2014 and \u2014
Is there a function to convert those characters into HTML? i.e \u2014 -> —
If you have some further reading on these character codes, or suggested reading about Unicode in general, I would appreciate it. This is so confusing to me. I don't know what to call these codes... I guess Unicode, but Unicode seems to mean a whole lot of things.
That's not entirely true, bobince.
How do you handle JSON containing Spanish accents?
There are two problems.
I call FB.api(url, function(response)
... var s=JSON.stringify(response);
and pass it to a PHP script via $.post.
First, I get a truncated string, so I need escape(JSON.stringify(response)).
Then I get a full JSON-encoded string with Spanish accents.
As a test, I put it in a text file, load it with file_get_contents, apply PHP's json_decode, and get nothing.
You first need utf8_encode.
And then you get the object you were waiting for.
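The "get nothing" symptom is consistent with json_decode() returning null because the input is not valid UTF-8. A minimal sketch of the workaround follows, assuming the bytes really are ISO-8859-1 (which is what utf8_encode() assumes too). The helper name decode_json_lenient is hypothetical, and mb_convert_encoding() is used here because utf8_encode() is deprecated as of PHP 8.2.

```php
<?php
// Hypothetical helper: decode JSON that may arrive as ISO-8859-1 bytes.
function decode_json_lenient(string $raw)
{
    $data = json_decode($raw, true);
    if ($data === null && json_last_error() === JSON_ERROR_UTF8) {
        // Assumption: the bytes are ISO-8859-1; re-encode and retry.
        $utf8 = mb_convert_encoding($raw, 'UTF-8', 'ISO-8859-1');
        $data = json_decode($utf8, true);
    }
    return $data;
}

$latin1 = "{\"name\":\"Jos\xE9\"}"; // \xE9 is "é" in ISO-8859-1

var_dump(json_decode($latin1));        // NULL, because the input is not valid UTF-8
print_r(decode_json_lenient($latin1)); // the decoded array, now in UTF-8
```

Checking json_last_error() first avoids re-encoding input that was already valid UTF-8, which would otherwise double-encode the accents.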
After a full day of testing and googling without finding any way to decode the Unicode properly, I found your post.
So many thanks to you.
Someone asked me to solve the problem of Arabic text in the Facebook JSON archive. Maybe this code helps someone who is looking for a way to read Arabic text from Facebook (or Instagram) JSON:
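$str = '\u00d8\u00ae\u00d9\u0084\u00d8\u00b5';
function decode_encoded_utf8($string){
    // Turn each \uXXXX escape into the UTF-8 encoding of that code point.
    return preg_replace_callback('#\\\\u([0-9a-f]{4})#ism', function($matches) {
        return mb_convert_encoding(pack("H*", $matches[1]), "UTF-8", "UCS-2BE");
    }, $string);
}
// Each escape here is really one UTF-8 byte, so map every code point back to a single byte.
echo iconv("UTF-8", "ISO-8859-1//TRANSLIT", decode_encoded_utf8($str));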
Facebook Graph API returns JSON objects. Use json_decode() to read them into PHP and you do not have to worry about handling string literal escapes like \uNNNN. Don't try to decode JSON/JavaScript string literals by yourself, or extract chosen properties using regex.
Having read the string value, you'll have a UTF-8-encoded string. If your target HTML is also UTF-8-encoded, you don't need to replace — (U+2014) with any entity reference. Just use htmlspecialchars() on the string when outputting it, so that any < or & characters in the string are properly encoded.
If you do for some reason need to produce ASCII-safe HTML, use htmlentities() with the charset arg set to 'utf-8'.
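A minimal sketch of that flow is below. The JSON payload is made up for illustration (a real Graph API response has many more fields), and the variable names are hypothetical.

```php
<?php
// Made-up sample payload with \uNNNN escapes and some markup in the text.
$json = '{"message":"caf\u00e9 \u2014 <b>bold</b>"}';

// json_decode() resolves the \uNNNN escapes and returns UTF-8 strings.
$data = json_decode($json, true);
$text = $data['message'];

// For a UTF-8 page: escape only the HTML-significant characters.
$forUtf8Page = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

// For ASCII-safe output: also turn accented characters into named entities.
$asciiSafe = htmlentities($text, ENT_QUOTES, 'UTF-8');

echo $forUtf8Page, "\n";
echo $asciiSafe, "\n"; // caf&eacute; &mdash; &lt;b&gt;bold&lt;/b&gt;
```

Keep in mind that htmlentities() only replaces characters that have a named entity, so characters without one stay as raw UTF-8 in the output.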
