WordPress special characters and the_content - php

I'm trying to solve a really strange bug on a website I've developed.
In specific tags of the markup the web font is not rendering special characters like "éáçã".
The font allows such characters. You can even see parts of the page with such characters.
One strange behavior is that if you type the character again with the developer tools the character gets displayed correctly.
Hope someone can help me on this one.
Best
Peter

OK, so it turns out the simplest fix is to delete the affected text, save the post/page, and type it all in again.

Related

user-submitted hyperlinks in a text editor are broken by quotes

I run a WordPress job board on which users can post job listings; the job description input uses TinyMCE.
The problem: when some users insert hyperlinks into the body of their listing, the link is broken in the published ad, with extra quote marks inserted, e.g. http://example.com/en became ””//example.com/en/”””
This happens once in a while, not for every ad and not for every link within an ad.
I recently found out that what seems to be happening is that the following characters %E2%80%9D are getting inserted within the links, and these encode double quote marks. Say the page on my site where the link is posted is https://example.org/mypage/, and that the link the user is trying to post is https://usersite.com/theirpage/, the resulting URL will be some weird mash-up of the two:
https://example.org/mypage/%E2%80%9D/usersite.com/theirpage/
From some googling, it seems this might be caused by users copy/pasting hyperlinks from Word or web pages.
I'm trying to find a way to automatically prevent this so I don't have to manually clean up tens of links a day. I figure there must be something I can change in the TinyMCE settings.
I have found seemingly related questions from several years ago that suggest it could have to do with magic quotes (example1, example2), but I don't know how to implement the proposed solutions as I'm no PHP pro.
How would you go about solving this for a WordPress website? Any advice would be appreciated!
Your issue is occurring because users are inserting content from Word with double quotes (”), but they're not your standard double quotes.
Standard: " (straight)
Non-standard: ” (on an angle)
The sequence in your URL (%E2%80%9D) is the percent-encoded form of the three UTF-8 bytes (E2 80 9D) that make up the curly closing double quote (”, U+201D). A character like this is not allowed to appear literally in a URL, so it is percent-encoded on its way through browsers, web servers and databases.
In JavaScript, decoding this sequence is demonstrated by the following:
decodeURI('%E2%80%9D')
”
I don't believe this is an issue with TinyMCE.
How to fix the issue?
There are a few solutions, and they’re going to require some development knowledge. Two safe options that come to mind are:
Before the content is stored, remove these characters using PHP string replace
If the issue is how users are inserting the content into TinyMCE, it might be worthwhile adding TinyMCE Link Checker (https://www.tiny.cloud/tinymce/features/link-checker). Note, this is a premium feature.
Disclaimer: I'm affiliated with Tiny. The comments above are my own.
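For the first option, here is a minimal sketch of the idea, written in Python for illustration rather than as WordPress code. The function name strip_curly_quotes is hypothetical, and the sketch assumes a curly quote is never a legitimate part of a submitted URL:

```python
import re

# Curly quotes, either literal or already percent-encoded as UTF-8 bytes:
# %E2%80%98 / %99 are the single quotes, %E2%80%9C / %9D the double quotes.
_CURLY_QUOTES = re.compile(
    "[\u2018\u2019\u201c\u201d]|%E2%80%9[89CD]", re.IGNORECASE
)


def strip_curly_quotes(url: str) -> str:
    """Remove curly quotes (literal or percent-encoded) from a URL string."""
    return _CURLY_QUOTES.sub("", url).strip()


print(strip_curly_quotes("\u201dhttps://usersite.com/theirpage/\u201d"))
print(strip_curly_quotes("%E2%80%9Dhttps://usersite.com/theirpage/"))
```

Both calls print https://usersite.com/theirpage/. In PHP the same replacement could be done with str_replace on the link before the content is saved.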

Encoding Dilemma! %C3%A9 vs e%CC%81

If you look at the two links below, they look the same, but one of them works and the other does not. After analysing the problem, it seems there is a difference in how the "é" character (and every other accented character) is interpreted: the encoder treats it either as a single character (%C3%A9) or as the unaccented letter followed by a separate accent character (e%CC%81).
This problem is breaking the images on the website, even though the files are there when I look over FTP.
The question is how to fix it: is the fix in WordPress, the database or the server?
Thanks, and sorry for my poor English.
http://r20med.regions20.org/wp-content/uploads/2016/07/Portes-Ouvertes-sur-le-tri-sélectif-à-Hai-Essabah-Oran_017.jpg
http://r20med.regions20.org/wp-content/uploads/2016/07/Portes-Ouvertes-sur-le-tri-sélectif-à-Hai-Essabah-Oran_017.jpg
If you look at the screen capture below (taken from a text editor after copy-pasting your links above), you can see the difference.
The easiest solution I found was to change the filenames in an editor that only uses the accentuation you can see in the first row.
Conclusion: never trust uploads! Always check them for everything that's around! :) (This is true for filenames, texts, HTML, etc. - even if they are not intended to be harmful, they may block functions of your website / app and cause other problems!)
Note:
not all text editors show them to be different, so choose one that does!
if possible, get rid of accents in filenames (if they are uploaded, use a function to sanitise them, or even turn them into a slug as WordPress does with sanitize_title() or sanitize_title_with_dashes())
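The two forms from the question title can be reproduced with Unicode normalization. This is only a sketch in Python to show the difference, not WordPress code; the helper name to_composed is hypothetical:

```python
import unicodedata
from urllib.parse import quote


def to_composed(name: str) -> str:
    """Return the name in NFC form, where e + combining accent becomes one 'é'."""
    return unicodedata.normalize("NFC", name)


composed = "s\u00e9lectif"     # é as a single code point (U+00E9)
decomposed = "se\u0301lectif"  # e followed by a combining acute accent (U+0301)

print(quote(composed))    # s%C3%A9lectif
print(quote(decomposed))  # se%CC%81lectif
print(composed == decomposed)               # False: they only look the same
print(to_composed(decomposed) == composed)  # True once both are in NFC
```

Normalizing filenames to one form at upload time (or stripping the accents altogether, as suggested above) keeps the URL and the file on disk in agreement.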

Diamond with question mark in fonts

My website uses the Malayalam language. Everything works fine except the excerpts. The front page shows only a short excerpt (<10 words) for each product, and most of the time there is a black diamond with a question mark at the end of the excerpt.
When I open the product listing page for the same product (with the same content), the question mark is not there. See this:
Can someone please explain why this happens and suggest a possible solution?
PS: The website is in WordPress!
I think your WordPress Malayalam font is not supported by your CSS. Please declare the HTML character encoding so that it matches the language of your site.
Generally, the ? stands for a character that is not supported in the page's encoding. For example, if we type the copyright symbol directly into the page it may display as ?, but if we use &copy; instead, the correct character is displayed.
If someone else has this problem, I guess this could help:
The issue was that the theme creators used the substr function to trim the excerpt. substr cuts by bytes rather than by characters, so it was cutting through characters of this language and leaving behind pieces that are meaningless on their own. The browser has no way to represent these, so it shows the question marks in the diamonds. I removed the substr call and instead limited the excerpt to a smaller number of words, which achieves the same result without breaking the layout of the theme.
And it worked!
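The effect is easy to reproduce outside WordPress. Below is a rough sketch in Python (the helper names cut_by_bytes and cut_by_words are made up for the illustration), assuming the text is stored as UTF-8, where each Malayalam letter takes three bytes:

```python
def cut_by_bytes(text: str, n_bytes: int) -> str:
    """Truncate the UTF-8 bytes, which is roughly what a byte-based substr does."""
    raw = text.encode("utf-8")[:n_bytes]
    # A character cut in half cannot be decoded, so it is shown as U+FFFD,
    # the "diamond with a question mark".
    return raw.decode("utf-8", errors="replace")


def cut_by_words(text: str, n_words: int) -> str:
    """Truncate on whitespace, so no character is ever split."""
    return " ".join(text.split()[:n_words])


sample = "മലയാളം ഭാഷ"
print(cut_by_bytes(sample, 7))  # two letters, then the replacement character
print(cut_by_words(sample, 1))  # the first word, intact
```

In PHP, mb_substr counts characters instead of bytes, so it would avoid the broken bytes as well, although limiting by words, as described above, also keeps letters together with their vowel signs.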

(?) marks in HTML. Encoding issue from content from the database?

Any idea why this is happening?
It looks to be happening mainly with apostrophes and hyphens. Any ideas if I can fix this? I pull the data from my database and print it to the page like:
<div class="block">
<?=$details['agenda'] ?>
</div>
As other commenters may have mentioned, this is a character encoding problem. If you're lucky, you can force your HTML page to render in UTF-8 and that will resolve it.
Unfortunately, if you're not lucky, you'll discover that the characters are stored in the database in the wrong encoding. Or maybe the database converts them. Or maybe the character encoding data has been destroyed along the path! There's no way of knowing in advance where those characters have been damaged.
The best way I know to fix problems like this is to force every step along your path to follow UTF-8 content encoding. For example, you probably go through steps like this:
Content author writes a document in Microsoft Word containing "SmartQuotes"
Content author copies-and-pastes into the edit box of a content management system.
Content management system saves to the database.
Database may or may not store data in Unicode internally - make sure you use nvarchar (or whatever Unicode type your database supports).
Reading from the database may need to scan for characters.
However, it's very tricky to fix this! A long time ago, I used to have a habit of writing "detect-and-fix" routines like this:
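// Replace Word's curly double quotes with a straight double quote.
$smartquotes = array("”", "“");
// str_replace returns the new string, so the result has to be assigned.
$mytext = str_replace($smartquotes, '"', $mytext);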
Of course you know what the problem is - I'd keep discovering new characters I had to fix. Microsoft Word likes to do tons of unusual characters - copyright, registration marks, apostrophes, hyphens, and so on. I'd keep adding to this function, over and over, until I went crazy. So nowadays I just go through my entire content delivery path and force everything to obey UTF-8 rules; that seems to resolve it in most cases.
Good luck!
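To see why a single non-UTF-8 step is enough to produce the (?) marks, here is a small Python sketch. It assumes, purely for illustration, that one step stores the text as Windows-1252 while the page is read as UTF-8; the actual culprit in a given setup may be a different step or encoding:

```python
# A curly apostrophe and an en dash, as Word would typically insert them.
word_text = "It\u2019s 9\u20135"

# One step in the path saves the text as Windows-1252...
legacy_bytes = word_text.encode("cp1252")

# ...and a later step reads those bytes as UTF-8. The bytes 0x92 and 0x96
# are not valid UTF-8 on their own, so each one becomes U+FFFD.
shown = legacy_bytes.decode("utf-8", errors="replace")
print(shown)

# When every step uses UTF-8, the text survives the round trip unchanged.
print(word_text.encode("utf-8").decode("utf-8") == word_text)
```

This matches the symptom in the question: only the "fancy" punctuation breaks, because plain ASCII letters have the same bytes in both encodings.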

What would cause an &nbsp; to turn into a Unicode character?

I've got some documents on my website which users can edit via a rich text editor and then save them (to the DB) and print them. Some users are experiencing an issue (only happening on the live site) where some of the characters are getting screwed up. I've checked the DB, and the funny characters are in the DB, so it's not a display issue. It either happens when they save the document (submit the form on the site) or they've put something weird in there or their browser changed some of the characters.
The character that keeps appearing everywhere is Â . It's an accented A followed by a space. Looking at the source HTML, it appears that the affected documents had all their &nbsp;'s converted. But whenever I try it, they come out fine.
What would cause an &nbsp; to turn into a Unicode character, but only in limited cases?
Misinterpreting the UTF-8 encoding as Latin-1 will cause this.
>>> '\xa0'.encode('utf-8').decode('latin-1')
'Â\xa0'
>>> print('\xa0*'.encode('utf-8').decode('latin-1'))
Â *
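If the damaged text is already in the database, the same misreading can often be undone by reversing it. This is a hedged sketch only: the function name repair_mojibake is hypothetical, and it assumes the damage was exactly one UTF-8-read-as-Latin-1 step:

```python
def repair_mojibake(text: str) -> str:
    """Undo one round of UTF-8 bytes having been decoded as Latin-1."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text does not look like this kind of damage; leave it alone.
        return text


damaged = "\xa0".encode("utf-8").decode("latin-1")  # 'Â' followed by U+00A0
print(repair_mojibake(damaged) == "\xa0")  # True: the non-breaking space is back
print(repair_mojibake("café"))             # unchanged
```

Text that legitimately contains a sequence such as Â followed by a non-breaking space would be altered too, so it is safer to run something like this only on rows known to be affected.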
