converting special characters in HTML into the appropriate coding for PHP - php

I am making a website where one fills out a form and it creates a PDF. The user will be able to put in diacritic and special characters. The way I am sending the characters to the PHP, those characters will come into the PHP as HTML coded characters i.e. à. I need to change this to whatever it is PHP will read so when I put it through the PDF maker we have it has the diacritic character and not the HTML code for it.
I wrote a test to try this out but I haven't been able to figure it out. If I have to I will end up writing an array for every possible character they can use and translate the incoming string but I am trying to find an easier solution.
Here is the code of my test:
$title = "Test of Title for use With This Project and it should also wrap because it is sò long! Acutally it is even longer than previously expected!";
$ti = htmlspecialchars_decode($title);
I have been attempting to use the htmlspecialchars_decode() to convert it but it still comes out as &ograve and not ò. Is there an easy way to do this?

See the documentation which tells you it won't touch most of the characters you care about and to use html_entity_decode instead.

Use the html_entity_decode function instead of htmlspecialchars_decode (which only decodes entities such as &, ", < and > = special HTML chars, not all entities).

Related

PHP Converting UTF-8 to TeX/LaTeX Formatting?

I have some strings that include encoding that would work with TeX, (For example they look like "Pi/~na Colada" instead of Piña Colada). Is there a simple way to convert this to show properly without creating my own function to convert the characters?
No.
But:
The tex.stackexchange.com wiki has a decent list of TeX accents.
Then you just need to correlate them with their UTF-8 combining mark.
Make sure you move the combining mark to after the character you want it combined with.
eg: "/~n" to "n" . $combining_mark
You might then want to run it through intl's Normalizer in NFC form to compose the character into a single codepoint, if it exists.

Remove certain special HTML characters from string in PHP

I am scraping information from a website and I was wondering how could I ignore or replace some special HTML characters such as "á", "á", "’" and "&amp". These characters cannot be scraped into a database. I have already replaced " " using this:
$nbsp = utf8_decode('á');
$mystring = str_replace($nbsp, '', $mystring);
But I cannot seem to do the same with these other characters. I am scraping from the website using XPath. This returns the exact content that I am looking for but keeps the HTML characters that I do not want as they don't seem to be allowed into a database.
Thanks for any help with this.
It sounds like you've got a collation issue. I suggest ensuring that your database collation is set to utf8_ci, and that your web page's content encoding is also set to UTF-8. This may well solve your problem.
The best way to strip all special characters is to run the string through htmlspecialchars(), then do a case-insensitive regex find and replace using the following pattern:
&([a-z]{2,8}+|#[0-9]{2,5}|#x[0-9a-f]{2,4});
This should match named HTML entities (e.g. &ohm; or ) as well as decimal (e.g. &#01234) and hex-based (e.g. &x0BEE;) entities. The regex will strip them out completely.
Alternatively, just use the output of htmlspecialchars() to store it with the weird characters intact. Not ideal, but it works.

Convert HTML numbered entities in php to unicode for use on iPhone

I'm creating a web service to transfer json to an iPhone app. I'm using json-framework to receive the json, and that works great because it automatically decodes things like "\u2018". The problem I'm running into is there doesn't seem to be a comprehensive way to get all the characters in one fell swoop.
For example html_entity_decode() gets most things, but it leaves behind stuff like ‘ (‘). In order to catch these entities and convert them to something json-framework can use (e.g., \u2018), I'm using this code to convert the &# to \u, convert the numbers to hex, and then strip the ending semicolon.
function func($matches) {
return "\u" . dechex($matches[1]);
}
$json = preg_replace_callback("/&#(\d{4});/", "func", $json);
This is working for me at the moment, but it just doesn't feel right. It seems like I'm surely missing some characters that are going to come back to haunt me later.
Does anyone see flaws in this approach? Can anyone think of characters this approach will miss?
Any help would be most appreciated!
From where are you getting this HTML-encoded input? If you're scraping a web page you should be using an HTML parser, which will decode both entity and character references for you. If you are getting them in form input data, you've got a problem with encodings (make sure to serve the page containing the form as UTF-8 to avoid this).
If you must convert an HTML-encoded stretch of literal text to JSON, you should do it by HTML-decoding first then JSON-encoding, rather than attempting to go straight to JSON format (which will fail for a bunch of other characters that need escaping). Use the built-in decoder and encoder functions rather than trying to create JSON-encoded characters like \u.... yourself (as there are traps there).
$html= 'abc " def Ӓ ghi ሴ jkl \n mno';
$raw= html_entity_decode($html, ENT_COMPAT, 'utf-8');
$json= json_encode($raw);
"abc \" def \u04d2 ghi \u1234 jkl \\n mno"
‘ is a decimal numbered entity, while I believe \u2018 is a hexadecimal representation. HTML also supports hexadecimal numbered entities (e.g., ‘), but once you've found # as the entity prefix you're looking at either decimal or hex. There are also named entities (e.g., &) but it doesn't sound like you need to cover those cases in your code.
$html_escape = ""Love sex magic rise" & 尹真希 ‘";
$utf8 = mb_convert_encoding($html_escape, 'UTF-8', 'HTML-ENTITIES');
echo json_encode(array(
"title" => $utf8
));
// {"title":"\"Love sex magic rise\" & \u5c39\u771f\u5e0c \u2018"}
This work well for me

getting json_encode to not escape html entities

I send json_encoded data from my PHP server to iPhone app. Strings containing html entities, like '&' are escaped by json_encode and sent as &.
I am looking to do one of two things:
make json_encode not escape html entities. Doc says 'normal' mode shouldn't escape it but it doesn't work for me. Any ideas?
make the iPhone app un-escape html entities cheaply. The only way I can think of doing it now involves spinning up a XML/HTML parser which is very expensive. Any cheaper suggestions?
Thanks!
Neither PHP 5.3 nor PHP 5.2 touch the HTML entities.
You can test this with the following code:
<?php
header("Content-type: text/plain"); //makes sure entities are not interpreted
$s = 'A string with & &#x6F8 entities';
echo json_encode($s);
You'll see the only thing PHP does is to add double quotes around the string.
json_encode does not do that. You have another component that is doing the HTML encoding.
If you use the JSON_HEX_ options you can avoid that any < or & characters appear in the output (they'd get converted to \u003C or similar JS string literal escapes), thus possibly avoiding the problem:
json_encode($s, JSON_HEX_TAG|JSON_HEX_AMP|JSON_HEX_QUOT)
though this would depend on knowing exactly which characters are being HTML-encoded further downstream. Maybe non-ASCII characters too?
Based on the manual it appears that json_encode shouldn't be escaping your entities, unless you explicitly tell it to, in PHP 5.3. Are you perhaps running an older version of PHP?
Going off of Artefacto's answer, I would recommend using this header, it's specifically designed for JSON data instead of just using plain text.
<?php
header('Content-Type: application/json'); //Also makes sure entities are not interpreted
$s = 'A string with & &#x6F8 entities';
echo json_encode($s);
Make sure you check out this post for more specific reasons why to use this content type, What is the correct JSON content type?

Correct character encoding

I'm currently scraping a website for various pieces of textual data (with permission, of course). The issue I'm seeing is that certain characters aren't correctly encoded in the process. This is particularly prominent with apostrophes ('): leading to characters such as: .
Currently, I use the following code to convert various HTML entities from the scraped data:
htmlentities($content, ENT_COMPAT, 'UTF-8', FALSE)
Is there a better way to handle this sort of thing?
HTML entities have two goals:
Escape characters that have a special meaning in HTML, such as angle quotes, so they can be used as literals.
Display characters that are not supported by the character set you are using, such as the euro symbol in an ISO-8859-1 document.
They are not exactly an encoding tool.
If you want to convert from one charset into another one, I suggest you use iconv(). However, you must know both the source and the target charset. The source charset should be mentioned in the Content-Type response header and the target charset is something you decided when you started the site (although in your case it looks like UTF-8 is the most reasonable option).
You don't want to use htmlentities right away, I would use that on the data at the last point before you store it. One of the problems you'll run into is people don't always encode their entities properly anyway. Not everyone uses ™ they just copy the trademark in. If you put some logic in to try and grab whatever they put in and encode it properly you may be better off. For Example:
$patterns = array();
$patterns[0] = '/—/';
$patterns[1] = '/&nsbsp;/';
$patterns[2] = '/®/';
$replacements = array();
$replacements[2] = '&151;';
$replacements[1] = '&160;';
$replacements[0] = '&174;';
$ourhtml = preg_replace($patterns, $replacements, $html);
You could find all the "gotcha" characters like dashes and single quotes, apostrophes etc and encode them by hand, as well as use a set standard to the entities (text or numeric).
You could also use regular expressions to do the same thing, and would probably be a more elegant solution. But my suggestion would be to take some time filtering out what you don't want by hand, and then you know your data will be prepared exactly how you like.
It's a little bit difficult to suggest things based on the information provided. Can you provide an example snippet of text maybe?
Failing that, I'll employee the shotgun approach (e.g., suggesting a bunch of things and hoping one of them hits)
First of all, are you sure the page you're accessing is encoded in UTF-8? What does mb_detect_encoding say?
One option (may not work depending on your needs) would be to use iconv with the TRANSLIT option to convert the characters into something easier to handle using PHP. You could also look at using the mb_* functions for working with multibyte strings.
Are you sure htmlentities is the problem? If the content is UTF-8, and your site is set to serve ISO-8859-1, you're going to see odd characters. Check the encoding your browser is using to make sure it matches the encoding of the characters you're producing.
I don't see any issue with using htmlentities() as long as you pass false as the last parameter. This will ensure that you don't encode anything twice (such as turning & into &amp;).

Categories