I am translating with Poedit. However, Poedit seems to be ignoring apostrophes. For example, "shouldn't" is coming through as "shouldnt". I am encoding in UTF-8. Does anyone know why this is the case and whether there is a solution?
I assure you that Poedit isn't somehow ignoring or eating apostrophes — that's preposterous. It's just an editor that puts whatever you wrote, exactly as you wrote it (yes, including ' or any Unicode characters), into your PO and MO files.
Your problem is in your PHP code where you incorrectly escape the (translated) strings before printing them — how and in what context you do that is unfortunately something you didn't share.
But this is why e.g. WordPress has functions like esc_attr_e that do any necessary escaping and do it correctly, so that you don't have to do anything ridiculous (and painful to work with!) like substituting ' with &rsquo; in all your translations (which wouldn't even work when using untranslated text…).
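As a rough illustration of escaping at output time rather than inside the translations, here is a minimal sketch in plain PHP. The names translate_stub and escape_html are hypothetical stand-ins (not Poedit or WordPress APIs); a real project would call gettext() or its framework's own escaping helpers.

```php
<?php
// Hypothetical stand-in for a gettext lookup; the catalog content is made up.
function translate_stub(string $msgid): string
{
    $catalog = ['should_not' => "You shouldn't do that"];
    return $catalog[$msgid] ?? $msgid;
}

// Escape text for use inside HTML. ENT_QUOTES converts both single and
// double quotes, so the apostrophe is encoded instead of being lost.
function escape_html(string $text): string
{
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

// The translation itself keeps its plain apostrophe; escaping happens once,
// at the point where the string is written into markup.
$title = escape_html(translate_stub('should_not'));
echo "<a href='#' title='{$title}'>link</a>\n";
```

The translation file stays readable, and the same escaping applies to untranslated fallback text.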
You need to use the HTML entity: &rsquo;
Source: http://geektnt.com/tag/poedit
Some text characters need to be converted into HTML entities, otherwise they will not display correctly. A very common example is a word containing an apostrophe or single quote (') which needs to be replaced with &rsquo;. For example, Chloe O'Brian should be written as Chloe O&rsquo;Brian. For a complete list of HTML entities, visit W3Schools.
We are working on a large amount of HTML data that needs to be converted to plain text. In the process we found that neither html_entity_decode() nor htmlspecialchars_decode() converts more than a few entities, e.g. &lt;, &gt;, &quot;, &amp;, and that's it.
However, modern HTML pages use quite a few other common entities:
&rarr;
&raquo;
&deg;
&reg;
&copy;
&apos;
&pound;
&yen;
&euro;
&sum;
&trade;
Which are all ignored by these functions.
What are my options to convert them to their corresponding character? I guess my best option would be to manually write a string replace function to do this?
html_entity_decode should be the answer; it just has unhelpful defaults, so you are probably calling it without the flags it needs. Try:
$text = html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8');
Alternatively,
$dom = new DOMDocument();
@$dom->loadHTML('<root>' . $html . '</root>'); // @ hides warnings about the non-standard <root> tag
$text = $dom->getElementsByTagName('root')->item(0)->textContent;
may also work.
P.S. I have no idea what all the downvotes are about, but I didn't read the comments either.
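To make the flag combination concrete, here is a small sketch of the decode call on a sample string. The sample input is made up for illustration, and the file is assumed to be saved as UTF-8.

```php
<?php
// Sample input containing named entities beyond the basic four.
$html = '&rarr; &euro; &trade; &apos; &lt;b&gt;';

// ENT_HTML5 selects the HTML5 entity table (which includes &apos;),
// ENT_QUOTES makes the function decode single-quote entities as well,
// and 'UTF-8' sets the encoding of the returned string.
$text = html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $text, "\n"; // prints: → € ™ ' <b>
```

Passing all three arguments explicitly avoids depending on defaults, which have changed between PHP versions.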
I'm currently facing a very strange encoding issue when dealing with HTML source code.
I got the following line:
"requête présentée par..."
When an external library runs utf8_decode on it, I get:
"reque^te présente´e par..."
So the accents are placed to the right of the accented characters. If I run utf8_encode on that result, I don't get the original "requête présentée par..." back; I still have "reque^te présente´e par..."
Even stranger: If I open the original html in Notepad++, encoding is utf8 without BOM (so far, so good) but I can actually select half of the character with the text selection (keyboard or mouse). Yes, half of it. As if the real code was "e^" but it was displayed as "ê". When I try to copy it to my IDE it copies "ê" but pastes "e^".
I have come up with a basic replacement function:
"e^" => "ê",
"e´" => "é",
...
and some other french cases, and it's working properly for now.
But as the HTML comes in different languages, I'm pretty sure I won't be able to replace every character affected by this encoding issue.
Has anybody faced this issue before and (hopefully) found a more general solution?
Thanks in advance.
It sounds like your HTML source is using combining characters. That is, instead of using a single Unicode character to represent the ê, it uses a regular e followed by a combining character that adds the circumflex. You can verify this with a hex editor: the combining circumflex is code point U+0302, which appears as the bytes CC 82 in UTF-8.
See also Unicode equivalence.
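A general fix for this, instead of a hand-written replacement table, is Unicode normalization to the composed form (NFC). The sketch below assumes PHP's intl extension is installed, since that is what provides the Normalizer class; without it the string is left unchanged.

```php
<?php
// "e" + U+0302 (combining circumflex) and "e" + U+0301 (combining acute):
// the decomposed spelling of "requête présentée".
$decomposed = "reque\u{0302}te pre\u{0301}sente\u{0301}e";

if (class_exists('Normalizer')) {
    // FORM_C composes each base letter and its combining marks into a
    // single precomposed code point where one exists.
    $composed = Normalizer::normalize($decomposed, Normalizer::FORM_C);
} else {
    // Assumption: intl is unavailable, so leave the input untouched.
    $composed = $decomposed;
}

echo strlen($decomposed), ' bytes before, ', strlen($composed), " bytes after\n";
```

Normalizing before calling utf8_decode matters because ISO-8859-1 has precomposed accented letters but no combining marks.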
I have the following value from a field in a database that I would like to output exactly like this...
Use the term “fully accredited”-there is no such thing as a partial accreditation.
I need to include the quotes around the term "fully accredited".
Here's my output in PHP...
echo "<p><strong>Never:</strong> <span id=\"nevermsg\">".$results['never1']."</span></p>";
But, when I render the data on the page, it's showing these little diamond shapes with question marks inside them.
*(The 'span id' is there for styling and isn't relevant)
I don't think escaping would work here because quotes are not used in all the data values.
Not sure what to do...
The quotes in your string are extended characters.
You could fix this problem pretty quickly by simply replacing them with standard " quote characters rather than the curly quotes “ ” you've got now.
However, in the long term, you probably need to be able to handle extended characters, as it includes all kinds of things you're likely to need in your text, not just curly quote marks.
To fix this problem properly, you need to ensure that your system uses UTF-8 encoding at all levels. This includes within the database, your PHP code files, and the data that is sent to the browser.
I suggest reading up further on this here: UTF-8 all the way through
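As a stopgap while the encoding is being aligned end to end, something like the following sketch can help. The to_utf8 helper is hypothetical, and it guesses that any non-UTF-8 value is Windows-1252, which is an assumption about the data. It also relies on the mbstring extension.

```php
<?php
// Hypothetical helper: returns $value unchanged if it is already valid
// UTF-8; otherwise assumes Windows-1252 (a guess) and converts it.
function to_utf8(string $value): string
{
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value;
    }
    return mb_convert_encoding($value, 'UTF-8', 'Windows-1252');
}

// Simulated database value with Windows-1252 curly quotes (bytes 0x93, 0x94).
$results = ['never1' => "Use the term \x93fully accredited\x94"];

// Tell the browser which encoding the page uses.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

$never = htmlspecialchars(to_utf8($results['never1']), ENT_QUOTES, 'UTF-8');
echo '<p><strong>Never:</strong> <span id="nevermsg">' . $never . '</span></p>', "\n";
```

The more durable fix is still on the connection side, for example calling mysqli::set_charset('utf8mb4') if the data comes through mysqli, so that no per-value conversion is needed.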
I'm allowing some user input on my website, that later is read in XML. Every once in a while I get these weird single or double quotes like this ”’. These are directly copied from the source that broke my XML. I'm wondering if there is an easy way to correct these types of characters in my xml. htmlentities did not seem to touch them.
Where do these characters come from? I'm not even sure how I'd go about typing them out unintentionally.
EDIT- I forgot to clarify these quotes are not being used in attributes, but in the following way:
<SomeTag>User’s Input</SomeTag>
Don't disallow and/or modify foreign characters; that's just annoying for your users! This is just an encoding issue. I don't know what parser you're using to read the XML, but if it's reasonably sophisticated, you can solve your problem by including the following encoding pragma at the top of your XML files:
<?xml version="1.0" encoding="UTF-8"?>
There may also be a UTF-8 option in the parser's API.
Edit: I just read that you're reading the XML directly in a browser. Most browsers listen to the encoding pragma!
Edit 2: Apparently, those quotes aren't even legal in UTF-8, so ignore what I said above. Instead, you might find what you're looking for here, where a similar problem is being discussed.
Are these quotes being used in text content, or to delimit attributes? For attribute delimiters, XML requires typewriter quotes (single or double). Microsoft and other word-processing applications often try to be smart and replace typewriter quotes with typographical quotes, which is almost certainly the answer to the question "where are they coming from?".
If you need to get rid of them, a simple global replace using a text editor will do the job fine.
But you might try to work out first why they are causing a problem. Perhaps your data flow can't handle ANY non-ASCII characters, in which case that's a deeper problem that you really ought to fix (it would typically imply some unwanted transcoding is happening somewhere along the line).
If the input string is UTF-8 encoded, maybe you need to specify that to htmlentities(), for example:
$html = htmlentities( '”’', ENT_COMPAT, "utf-8" );
echo $html;
For me gives:
&rdquo;&rsquo;
whereas
$html = htmlentities( '”’' );
echo $html;
gets confused:
â??â??
If the input string is non-UTF-8, then you'd need to adjust the encoding arg for htmlentities() accordingly.
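One caveat for the XML case: named entities such as &rdquo; are HTML entities and are not predefined in XML, so a plain XML parser may reject them. A possible alternative, sketched below on made-up input, is to leave the curly quotes as literal UTF-8 characters and escape only the XML-special characters.

```php
<?php
// Made-up user input containing curly quotes and an ampersand.
$input = "User\u{2019}s \u{201C}Input\u{201D} & more";

// ENT_XML1 restricts escaping to the characters XML itself reserves;
// the curly quotes pass through untouched as UTF-8.
$escaped = htmlspecialchars($input, ENT_XML1 | ENT_QUOTES, 'UTF-8');

// Declare the encoding so the parser does not have to guess.
$xml = '<?xml version="1.0" encoding="UTF-8"?>' . "\n"
     . "<SomeTag>{$escaped}</SomeTag>";

echo $xml, "\n";
```

This only works if the input really is UTF-8; otherwise it needs converting first.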
Stay away from Microsoft Office apps. Word, Excel etc. have a nasty habit of replacing matching pairs of single quotes and double quotes with "smart quotes".
These quote characters are not part of ASCII and never made it into the official Latin-1 (ISO-8859-1) character set; they come from Windows-1252 and Unicode. All the MS Office apps "helpfully" replace standard quote characters with these abominations.
Just google for "undoing smartquotes" or "convert smartquotes back" for hints, tips and regexes to get rid of these.
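Use
$s = 'User’s Input';
// The /u modifier makes the character classes match whole UTF-8 characters, not single bytes.
$descriptfix = preg_replace('/[“”]/u', '"', $s);
$descriptfix = preg_replace('/[‘’]/u', "'", $descriptfix);
echo '<SomeTag>' . htmlspecialchars($descriptfix, ENT_QUOTES, 'UTF-8') . '</SomeTag>';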
I've run into an interesting behavior in the native PHP 5 implementation of json_encode(). Apparently when serializing an object to a json string, the encoder will null out any properties that are strings containing "curly" quotes, the kind that would potentially be copy-pasted out of MS Word documents with the auto conversion enabled.
Is this an expected behavior of the function? What can I do to force these kinds of characters to convert to their basic equivalents? I've checked for character encoding mismatches between the database returning the data and the administration page that inserts it, and everything is set up correctly. It definitely seems like the encoder just refuses these values because of these characters. Has anyone else encountered this behavior?
EDIT:
To clarify;
MS Word will take standard quotation marks and apostrophes and convert them to more aesthetic "fancy" or "curly" quotes. These characters can cause problems when placed in content managers that have charset mismatches between their editing interface (in the HTML) and the database encoding.
That's not the problem here, though. For example, I have a json_object representing a person's profile and the string:
Jim O’Shea
The Unicode escape for that apostrophe being \u2019
Will come out null in the json object when fetched from database and directly json_encoded.
{"model_name":"Bio","logged":true,"BioID":"17","Name":null,"Body":"Profile stuff!","Image":"","Timestamp":"2011-09-23 11:15:24","CategoryID":"1"}
Never had this specific problem (i.e. with json_encode()) but a simple - albeit a bit ugly - solution I have used in other places is to loop through your data and pass it through this function I got from somewhere (will credit it when I find out where I got it):
function convert_fancy_quotes ($str) {
    // chr(145)-chr(148) and chr(151) are the Windows-1252 bytes for curly quotes and the em dash.
    return str_replace(array(chr(145),chr(146),chr(147),chr(148),chr(151)),array("'","'",'"','"','-'),$str);
}
json_encode has the nasty habit of silently dropping strings that it finds invalid (i.e. non-UTF8) characters in. (See here for background: How to keep json_encode() from dropping strings with invalid characters)
My guess is the curly quotes are in the wrong character set, or get converted along the way. For example, it could be that your database connection is ISO-8859-1 encoded.
Can you clarify where the data comes from in what format?
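To check whether this is what is happening, a sketch like the following may help. It assumes PHP 7 or later, where json_encode() returns false for the whole value on invalid UTF-8 instead of nulling one field as the PHP 5 behavior in the question does. It also assumes the stray byte is Windows-1252, which is a guess about the data.

```php
<?php
// Simulated row: 0x92 is the Windows-1252 byte for a right single quote,
// which is not valid UTF-8 on its own.
$row = ['BioID' => '17', 'Name' => "Jim O\x92Shea"];

$json = json_encode($row);

if ($json === false && json_last_error() === JSON_ERROR_UTF8) {
    // Assumption: the offending bytes are Windows-1252, so convert them.
    $row['Name'] = mb_convert_encoding($row['Name'], 'UTF-8', 'Windows-1252');
    // JSON_UNESCAPED_UNICODE keeps the apostrophe readable instead of \u2019.
    $json = json_encode($row, JSON_UNESCAPED_UNICODE);
}

echo $json, "\n";
```

If json_last_error() reports JSON_ERROR_UTF8, the data is arriving in a non-UTF-8 encoding, and fixing the database connection charset is likely a better long-term answer than converting each value.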
If I ever need to do that, I first copy the text into Notepad and then copy it from there. Notepad forces it to be normal quotes. Never had to do it through code though...