Convert ASCII to plaintext in PHP - php

I am scraping some sites, and have ASCII text that I want to convert to plain text for storing in a DB. For example I want
I have got to tell anyone who will listen that this is
one of THE best adventure movies I've ever seen.
It's almost impossible to convey how pumped I am
now that I've seen it.
converted to
I have got to tell anyone who will listen that this is
one of THE best adventure movies I've ever seen. It's
almost impossible to convey how pumped I am now that
I've seen it.
I have googled my fingers bloody, any help?

You can use html_entity_decode:
echo html_entity_decode('...', ENT_QUOTES, 'UTF-8');
Few notes:
Please note that it looks like you actually want to convert from HTML-encoded string(with entities like ) to ASCII AKA plaintext.
This example converts to UTF-8 which is ASCII-compatible character encoding for all ASCII characters (i.e. with char codes below 128). If you really want plain ASCII (thus loosing all accented characters and characters from foreign languages) you should strip all offending characters separately.
Last argument ('UTF-8') is necessary to keep compatibility with different PHP versions since the default value has changed since PHP 5.4.0.
Update: Example with your text in ideone.
Update2: Changed ENT_COMPAT to ENT_QUOTES by #Daan's suggestion.

Related

PHP - Strlen behaves very strange, the same things - different results, lol numbers?

tresc and tresc_pelna
The same type, the same content
The same content. 876 characters in total.
Taken from db by ...AS data_dodania, p.data_modyfikacji, p.tresc, p.tresc_pelna, p.url, count(k.id)...
Echeon to website by <?= strlen($post['tresc_pelna']).'----'.strlen($post['tresc']) ?>
And guess what?
This is the output
876----3248
What the...?
I have completly no Idea what is happening here xD.
Please help guys :D
Both fields utf8_polish_ci and exactly same content
<?= mb_strlen($post['tresc_pelna'], 'utf-8').'----'.mb_strlen($post['tresc'], 'utf-8') ?>
Still bad result.
tresc over 3 thousands... what the... How? why?
MySQL has two built-in functions for determining the length of variable-length items. One, which counts distinct unicode characters, is called CHAR_LENGTH(). The other counts octets (bytes), and is called LENGTH().
In PHP, strlen() counts octets, like MySQL's LENGTH(). Many unicode strings, especially those encoded in utf8, have a variable number of octets per character. You can use grapheme_strlen() to count those.
I've found it's sometimes helpful to do SELECT HEX(unicode_column) to figure out what's stashed in MySQL. Just fetching the column data puts you at the mercy of the character rendering of the MySQL client you use, and can be very confusing.
It's also possible your database columns have entitized data in them (for example the string é rather than the Unicode character é. If that entity text gets sent to a web browser, it renders as the letter.
The difference between LENGTH and CHAR_LENGTH could explain a ratio of under 1.2x for most European text. It won't explain 3248:876, which is nearly 4x.
Perhaps these are part of the answer:
Htmlentities, such as ó which is taking 8 bytes to represent a 2-byte utf8 character. We can't see whether one of them has < and the other has <.
Formatting tags, such as <p>. Again, possibly <p>
Still, that is not enough to explain nearly 4x. For example, a simple letter, such as a, will be one byte, regardless of how it is encoded. Please provide the HEX for a small sample.

Convert ä to Umlaut

I am parsing a Document via xpath and fetch info from a metatag.
I am passing thsi string through utf8_decode( $metadesc ) but still get no normal Umlauts. The Document is UTF-8.
I want to convert Ã&#xA4 to ä.
I am debugging via the console in firebug and write the data also into a DB.
In both cases, I get the same result.
For text inside Div's it works. Only that one of the metatag is wrong.
Many Thanks
Well, it's true that xC3A4 is the UTF-8 encoding of the Unicode character xE4 which is ä. But in XML, the sequence ä represents something quite different: it represents "capital A with tilde" followed by "currency sign" (that is, ä). If you use an XML parser, you will see these two characters, and you won't get any indication that they started life as hex character references.
If possible, you should try and fix the program that generated this incorrect encoding of the character: that's much better than trying to repair the damage later.
If you do want to do it by a "repair" operation, you need to take into account that the sequence ä might actually represent the two characters that XML says it represents: how will you tell the difference? I don't know any PHP, but basically the way to do it is to extract the hex value xC3A4 and then put this through UTF-8 decoding.

PHP - Can't remove strange character

I'd really appreciate some help with this. I've wasted days on this problem and none of the suggestions I have found online seem to give me a fix.
I have a CSV file from a supplier. It appears to have been exported from an Microsoft system.
I'm using PHP to import the data into MySQL (both latest versions).
I have one particular record which contains a strange character that I can't get rid of. Manual editing to remove the character is possible, but I would prefer an automated solution as this will happen multiple times a day.
The character appears to be an interpretation of a “smart quote”. A hex editor tells me that the character codes are C2 and 92. In the hex editor it looks like a weird A followed by a smart quote. In other editors and Calc, Writer etc it just appears as a box. メ
I'm using mb_detect_encoding to determine the encoding. All records in the CSV file are returned as ASCII, except the one with the strange character, which is returned as UTF-8.
I can insert the offending record into MySQL and it just appears in Workbench as a square.
MySQL tables are configured to utf-8 – utf8_unicode_ci and other unusual UTF characters (eg fractions) are ok.
I've tried lots of solutions to this...
How to detect malformed utf-8 string in PHP?
Remove non-utf8 characters from string
Removing invalid/incomplete multibyte characters
How to detect malformed utf-8 string in PHP?
How to replace Microsoft-encoded quotes in PHP
etc etc but none of them have worked for me.
All I really want to do is remove or replace the offending character, ideally with a search and replace for the hex values but none of the examples I have tried have worked.
Can anyone help me move forward with this one please?
EDIT:
Can't post answer as not enough reputation:
Thanks for your input. Much appreciated.
I'm just going to go with the hex search and replace:
$DodgyText = preg_replace("/\xEF\xBE\x92/", "" ,$DodgyText);
I know it's not the elegant solution, but I need a quick fix and this works for me.
Another solution is:
$contents = iconv('UTF-8', 'Windows-1251//IGNORE',$contents);
$contents = iconv('Windows-1251', 'UTF-8//IGNORE',$contents);
Where you can replace Windows-1251 to your local encoding.
At a quick glance, this looks like a UTF-8 file. (UTF-8 is identical with the first 128 characters in the ASCII table, hence everything is detected as ASCII except for the special character.)
It should work if your database connection is also UTF-8 encoded (which it may not be by default).
How to do that depends on your database library, let us know which one you're using if you need help setting the connection encoding.
updated code based on established findings
You can do search & replace on strings using hexadecimal notation:
str_replace("\xEF\xBE\x92", '', $value);
This would return the value with the special code removed
That said, if your database table is UTF-8, you shouldn't need that conversion; instead you could look at the connection (or session) character set (i.e. SET NAMES utf8;). Configuring this depends on what library you use to connect to your database.
To debug the value you could use bin2hex(); this usually helps in doing searches online.

Converting UTF8 text for use in a URL

I'm developing an international site which uses UTF8 to display non english characters. I'm also using friendly URLS which contain the item name. Obviously I can't use the non english characters in the URL.
Is there some sort of common practice for this conversion? I'm not sure which english characters i should be replacing them with. Some are quite obvious (like è to e) but other characters I am not familiar with (such as ß).
You can use UTF-8 encoded data in URL paths. You just need to encoded it additionally with the Percent encoding (see rawurlencode):
// ß (U+00DF) = 0xC39F (UTF-8)
$str = "\xC3\x9F";
echo ''.$str.'';
This will echo a link to http://en.wikipedia.org/wiki/ß. Modern browsers will display the character ß itself in the location bar instead of the percentage encoded representation of that character in UTF-8 (%C3%9F).
If you don’t want to use UTF-8 but only ASCII characters, I suggest to use transliteration like Álvaro G. Vicario suggested.
I normally use iconv() with the 'ASCII//TRANSLIT' option. This takes input like:
último año
and produces output like:
'ultimo a~no
Then I use preg_replace() to replace white spaces with dashes:
'ultimo-a~no
... and remove unwanted chars, e.g.
[^a-z0-9-]
It's probably useless with Arabic or Chinese but it works fine with Spanish, French or German.
Obviously I can't use the non english characters in the URL.
In fact, you can. The Wikipedia software (built in PHP) supports this, e.g. en.wikipedia.org/wiki/☃.
Notice that you need to encode the URL appropriately, as shown in the other answers.
Use rawurlencode to encode your name for the URL, and rawurldecode to convert the name in the URL back to the original string. These two functions convert strings to and from URLs in compliance with RFC 1738.
Last time I tried (about a week ago), UTF-8 (specifically japanese) characters worked fine in URLs without any additional encoding. Even looked right in address bars across all browsers I tested with (Safari, Chrome and Firefox, all on Mac) and I have no idea what browser my girlfriend was using on windows. Aside from most windows installations i've run across just showing squares for japanese characters because they lack the required fonts to display them, it seems to work fine there as well.
The URL I tried is: http://www.webghoul.de.private-void.net/cache/black-f-with-あい-50.png (WMD does not seem to like it)
Proof by screenshot http://heavymetal.theredhead.nl/~kris/stackoverflow/screenshot-utf8-url.png
So it might not actually be allowed by the spec, for what i've seen it works well across the board, except maybe in editors that like the spec a lot ;-)
I wouldn't actually recommend using these types of characters in URLs, but I also wouldn't make it a first priority to "fix".

Special characters in Flex

I am working on a Flex app that has a MySQL database. Data is retrieved from the DB using PHP then I am using AMFPHP to pass the data on to Flex
The problem that I am having is that the data is being copied from Word documents which sometimes result in some of the more unusual characters are not displaying properly. For example, Word uses different characters for starting and ending double quotes instead of just " (the standard double quotes). Another example is the long dash instead of -.
All of these characters result in one or more accented capital A characters appearing instead. Not only that, each time the document is saved, the characters are replaced again resulting in an ever-increasing number of these accented A's appearing.
Doing a search and replace for each troublesome character to swap it for one of the none characters seems to work but obviously this requires compiling a list of all the characters that may appear and means there is scope for this continuing as new characters are used for the first time. It also seems like a bit of a brute force way of getting round the problem rather than a proper solution.
Does anyone know what causes this and have any good workarounds / fixes? I have had similar problems when using utf-8 characters in html documents that aren't set to use utf-8. Is this the same thing and if so, how do I get flex to use utf-8?
Many thanks
Adam
It is the same thing, and smart quotes aren't special as such: you will in fact be failing for every non-ASCII character. As such a trivial ad-hoc replace for the smart quote characters will be pointless.
At some point, someone is mis-decoding a sequence of bytes as ISO-8859-1 or Windows code page 1252 when it should have been UTF-8. Difficult to say where without detail/code.
What is “the document”? What format is it? Does that format support UTF-8 content? If it does not, you will need to encode output you put into it at the document-creation phase to the encoding the consumer of that document expects, eg. using iconv.

Categories