user-submitted hyperlinks in a text editor are broken by quotes

I run a WordPress job board on which users can post job listings; the job description input uses TinyMCE.
The problem: when some users insert hyperlinks into the body of their listing, the link is broken in the published ad, with extra quote marks inserted, e.g. http://example.com/en became ””//example.com/en/”””
This happens once in a while, not for every ad and not for every link within an ad.
I recently found out that the characters %E2%80%9D are getting inserted within the links, and these are the URL-encoded form of a curly double quote mark (”). Say the page on my site where the link is posted is https://example.org/mypage/, and the link the user is trying to post is https://usersite.com/theirpage/. The resulting URL will be some weird mash-up of the two:
https://example.org/mypage/%E2%80%9D/usersite.com/theirpage/
From some googling it seems this might be caused by users copy/pasting hyperlinks from Word or web pages.
I'm trying to find a way to automatically prevent this so I don't have to manually clean up tens of links a day. I figure there must be something I can change in the TinyMCE settings.
I have found seemingly related questions from several years ago that suggest it could have to do with magic quotes (example1, example2), but I don't know how to implement the proposed solutions as I'm no PHP pro.
How would you go about solving this for a WordPress website? Any advice would be appreciated!

Your issue occurs because users are pasting content from Word that contains double quotes (”), and these are not standard double quotes.
Standard: " (straight)
Non-standard: ” (curly, on an angle)
The string you found in the links (%E2%80%9D) is the percent-encoded (URL-encoded) form of the three UTF-8 bytes E2 80 9D, which represent the curly closing double quote (”, U+201D). Characters that are not allowed in a URL are encoded this way so they can pass through browsers, web servers, and databases.
In JavaScript, decoding this sequence is demonstrated in the following:
decodeURI('%E2%80%9D')
”
I don't believe this is an issue with TinyMCE.
How to fix the issue?
There are a few solutions, and they’re going to require some development knowledge. Two safe options that come to mind are:
Before the content is stored, remove these characters using PHP string replace
If the issue is how users are inserting the content into TinyMCE, it might be worthwhile adding TinyMCE Link Checker (https://www.tiny.cloud/tinymce/features/link-checker). Note, this is a premium feature.
Disclaimer: I'm affiliated with Tiny. The comments above are my own.
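The first option (removing these characters before the content is stored) could look roughly like the sketch below. It is only an illustration: the function name strip_curly_quotes_from_hrefs is made up here, and it assumes the saved HTML wraps each href value in straight quotes, with the curly quotes (or their percent-encoded form) sitting inside the value. Curly quotes in ordinary text are left alone.

```php
<?php
// Hypothetical helper, not part of WordPress or TinyMCE.
// Assumption: href values are delimited by straight quotes in the stored HTML.
function strip_curly_quotes_from_hrefs(string $html): string
{
    // Curly double quotes, raw and percent-encoded (both hex cases).
    $junk = [
        "\u{201C}", "\u{201D}",
        '%E2%80%9C', '%E2%80%9D',
        '%e2%80%9c', '%e2%80%9d',
    ];

    $cleaned = preg_replace_callback(
        '/href\s*=\s*(["\'])(.*?)\1/isu',
        function (array $m) use ($junk): string {
            // $m[1] is the straight-quote delimiter, $m[2] the raw URL.
            $url = trim(str_replace($junk, '', $m[2]));
            return 'href=' . $m[1] . $url . $m[1];
        },
        $html
    );

    // preg_replace_callback() returns null on failure (e.g. invalid UTF-8);
    // fall back to the untouched input in that case.
    return $cleaned ?? $html;
}

$dirty = '<a href="' . "\u{201D}" . 'https://usersite.com/theirpage/' . "\u{201D}" . '">Apply</a>';
echo strip_curly_quotes_from_hrefs($dirty), PHP_EOL;
```

In WordPress, a function like this might be attached to a filter such as content_save_pre, but whether the job board plugin passes its descriptions through that filter is an assumption that would need checking.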

Related

php remove unknown characters

I am building a web application which will run in Electron, with Angular as the frontend framework and Laravel as the backend framework. In the application it's possible to log in with a smartcard (thanks to node-pcsclite); it reads the bytes on the smartcard and then I convert them.
The smartcard contains a code which is linked to the staff table in my MSSQL database. I can retrieve the code from the smartcard, and I can log into the application when it uses MySQL as the database server.
Now when I try to do the same with MSSQL, I get an error which should be viewed in HTML mode instead of the error page itself.
(The code can be alphanumeric)
So it adds all these strange characters (probably non-existing characters). Not that much of a problem, right? At least, that's what I thought. So I tried to fix it by using this code inside my Laravel controller:
preg_replace('/[^A-Za-z0-9\-]/', '', $string);
This didn't solve anything. Then I thought I might have a problem with the query, so I ran SQL Profiler. The problem is that (probably because of the special characters) the query is broken:
select top 1 * from [Staff] where [CodeInit] = '
go
So does anyone know how to really remove the strange characters?
If you need more information feel free to ask.

I had this problem and landed on this question when searching for a solution. I was unable to find any fix.
My string with non-printable characters was retrieved from mdecrypt_generic(), so I wanted a way to remove those characters. When I copy and paste the retrieved value from the browser into the Brackets text editor, it shows them as red dots.
I pasted one into Google and it was encoded as %10. Nothing has helped so far, so as a temporary solution I just used rtrim() to remove those dots.
Copy the dot from Brackets and paste it in place of "DOT_HERE".
rtrim(rtrim($pvp, "DOT_HERE"), "\0\4");
The "\0\4" argument removes only NUL and EOT bytes, not the dot character (%10), which is why the dot is trimmed separately.
Here is a screenshot with that red dot. You can use the Brackets text editor to see it.
Note that $pvp is the decrypted text.
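The same trimming can be written without pasting the invisible character into the source file. The sketch below is a minimal illustration with made-up function names; it relies on rtrim() accepting ".." ranges in its character list, and it treats every ASCII control byte (0x00 to 0x1F, which includes the %10 byte) as unwanted. That assumption is fine for a short code, but note that it also trims trailing tabs and newlines.

```php
<?php
// Hypothetical helpers; the names are illustrative only.

// Trim any run of ASCII control bytes (0x00-0x1F) from the end of a string.
// rtrim() accepts "a..b" ranges in its character list.
function strip_trailing_control_chars(string $text): string
{
    return rtrim($text, "\x00..\x1F");
}

// Remove ASCII control bytes (and DEL, 0x7F) anywhere in the string.
function strip_all_control_chars(string $text): string
{
    // preg_replace() returns null on error; keep the input in that case.
    return preg_replace('/[\x00-\x1F\x7F]/', '', $text) ?? $text;
}

$pvp = "secret\x10\x10\x10\x00\x04"; // stand-in for the decrypted text
var_dump(strip_trailing_control_chars($pvp)); // string(6) "secret"
```

The second helper is the closer match for the question above, where the stray bytes may not sit only at the end of the smartcard code.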

WordPress special characters and the_content

I'm trying to solve a really strange bug on a website I've developed.
In specific tags of the markup the web font is not rendering special characters like "éáçã".
The font allows such characters. You can even see parts of the page with such characters.
One strange behavior is that if you type the character again with the developer tools the character gets displayed correctly.
Hope someone can help me on this one.
Best
Peter

OK, so it turns out the simplest fix is to delete the text that shows the bug, save the post/page, and type it all in again.

Should I be using html_entity_decode to escape a Google Analytics custom variable?

I'm working on a WordPress site with some other developers, and the code they wrote to set up custom variables for Google Analytics, via _setCustomVar, uses html_entity_decode. They pointed to the well-known and much-used Yoast plugin, which uses a similar technique. I can't figure out why you would use it that way though.
At no point (that I can see) does the string get encoded, so the function doesn't do anything. WordPress delivers whole strings, even with accents on them, never anything encoded, so there aren't rogue encoded characters to worry about. In fact, the one thing you don't want to do is send Google Analytics a mess of HTML, right?
I've changed it because I'm pretty sure html_entity_decode does not remove single quotes. In a JS script where strings are delimited by single quotes, that means any variable containing an apostrophe breaks Google Analytics tracking entirely.
Instead, I'm cleaning strings using strip_tags and esc_js (a WordPress function).
I'm a little concerned because the linked script is very commonly used. It seems like I must be wrong about something and I don't want to screw up my own script because of it.
What am I missing?

The answer seems to be that Yoast uses that code as a 'just in case' measure for strings that might have encoded characters in them. It still doesn't seem to take care of quote marks though, which is a pretty big deal.
Here's the code I wrote to solve all the issues: https://gist.github.com/AramZS/8930496
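For readers without the gist at hand, here is a rough sketch of the same idea in plain PHP. It is not the code from the gist and it does not use esc_js; it swaps in json_encode() with the JSON_HEX_* flags, which is one way to get a string that is safe inside a script regardless of apostrophes. The function name and the 'Category' variable name are made up for the example.

```php
<?php
// Hypothetical helper; a plain PHP stand-in, not WordPress's esc_js().
function js_string_literal(string $value): string
{
    // Drop markup, then turn entities such as &eacute; into real characters.
    $text = html_entity_decode(strip_tags($value), ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // json_encode() emits a double-quoted JS string; the JSON_HEX_* flags
    // escape ', ", <, > and & as \uXXXX so the literal cannot end early.
    return json_encode(
        $text,
        JSON_HEX_APOS | JSON_HEX_QUOT | JSON_HEX_TAG | JSON_HEX_AMP | JSON_THROW_ON_ERROR
    );
}

$category = "<b>Caf&eacute; d'Or</b>";
echo "_gaq.push(['_setCustomVar', 1, 'Category', " . js_string_literal($category) . ", 3]);", PHP_EOL;
```

Because the helper returns its own surrounding quotes, the value is dropped into the script without wrapping it in single quotes, so an apostrophe in the data can no longer terminate the string.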

(?) marks in HTML. Encoding issue from content from the database?

Any idea why this is happening?
It looks to be happening mainly with apostrophes and hyphens. Any ideas if I can fix this? I pull the data from my database and print it to the page like:
<div class="block">
<?=$details['agenda'] ?>
</div>

As other commenters may have mentioned, this is a character encoding problem. If you're lucky, you can force your HTML page to render in UTF-8 and that will resolve it.
Unfortunately, if you're not lucky, you'll discover that the characters are stored in the database in the wrong encoding. Or maybe the database converts them. Or maybe the character encoding data has been destroyed along the path! There's no way of knowing in advance where those characters have been damaged.
The best way I know to fix problems like this is to force every step along your path to follow UTF-8 content encoding. For example, you probably go through steps like this:
Content author writes a document in Microsoft Word containing "SmartQuotes"
Content author copies-and-pastes into the edit box of a content management system.
Content management system saves to the database.
Database may or may not store data in Unicode internally - make sure you use nvarchar (or whatever unicode type your database supports).
Code that reads from the database may need to scan for problem characters.
However, it's very tricky to fix this! A long time ago, I used to have a habit of writing "detect-and-fix" routines like this:
// Map curly double quotes to a straight quote; str_replace() returns the result
$smartquotes = array("”", "“");
$mytext = str_replace($smartquotes, '"', $mytext);
Of course you know what the problem is: I'd keep discovering new characters I had to fix. Microsoft Word likes to insert tons of unusual characters, such as copyright and registration marks, apostrophes, hyphens, and so on. I'd keep adding to this function, over and over, until I went crazy. So nowadays I just go through my entire content delivery path and force everything to obey UTF-8 rules; that seems to resolve it in most cases.
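As a rough illustration of "forcing UTF-8" at one step of that path, the sketch below normalizes incoming text once instead of patching individual characters. It rests on a loud assumption: anything that is not already valid UTF-8 is treated as Windows-1252 (CP1252), which is common for text pasted from older Word documents but is not guaranteed. The function name is made up, and the commented lines at the end assume a MySQL database reached through PDO, which the question does not confirm.

```php
<?php
// Hypothetical helper. Assumption: input that is not valid UTF-8 arrived as
// Windows-1252 (CP1252). Adjust $assumedSource for other inputs.
function ensure_utf8(string $text, string $assumedSource = 'CP1252'): string
{
    // preg_match() with the /u flag succeeds only on valid UTF-8.
    if (preg_match('//u', $text) === 1) {
        return $text;
    }

    // Convert from the assumed legacy encoding; iconv() returns false on failure.
    $converted = @iconv($assumedSource, 'UTF-8', $text);
    if ($converted === false) {
        throw new RuntimeException('Could not convert text to UTF-8');
    }
    return $converted;
}

// Bytes 0x93 and 0x94 are the CP1252 curly double quotes.
var_dump(ensure_utf8("\x93Agenda\x94") === "\u{201C}Agenda\u{201D}"); // bool(true)

// Other steps on the path, shown as comments because they need a live setup:
// header('Content-Type: text/html; charset=utf-8');
// $pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8mb4', $user, $pass);
```

Text that is already valid UTF-8 passes through untouched, so the helper can be applied more than once without double-encoding anything.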
Good luck!

"Smart Quotes" not displaying properly in email from phpmailer

I'm dealing with a LAMP web server. I have forms that users use to submit text, which is stored in a text field in MySQL. This text is often copied and pasted from Microsoft Office products, so I'm getting a lot of smart quotes and em dashes. These characters display properly if I retrieve them from the database and display them on the web page, but I run into trouble when sending the text in an email using the PHPMailer class. I get stuff that looks like this: â€“ (where it should be an em dash).
One thing that may be important: if I pull up a MySQL console and select a field that has an em dash or smart quote in it, it displays incorrectly on my console (â€“). However, as stated above, if my PHP page (using PDO) selects the field and displays it, it displays correctly in a browser (as an em dash in this case).
I'm not sure if there's a way to select a character set in PHPMailer (maybe it's a simple setting somewhere?) or if there is a better way around this problem. I should be clear, though, that "search and replace smart quotes and em dashes with their regular equivalents" is NOT the answer I'm looking for (hopefully that's not the only solution).
I found this information:
My php webpage: utf-8
mysql client encoding: latin1
mysql server encoding: latin1
phpmailer character set: iso-8859-1

The character set can be switched in PHPMailer with the following code:
$myMail->CharSet = "UTF-8";
This solved my issue. Typographic quotes and double dashes now show up in my emails from PHPMailer as expected. This may have been a sorta noobish question (blush). Thanks, Col. Shrapnel, for prompting me to look into what encoding all the pieces of the puzzle were using. I'd vote you up but don't have the reputation.
For anyone interested in homework, this link really helped me understand the basics of encoding:
http://www.joelonsoftware.com/articles/Unicode.html

The PEAR Mail_MIME package lets you do this via http://pear.php.net/manual/en/package.mail.mail-mime.get.php. I am pretty certain I have used this feature before, but I'm not positive.
You may also need to run things through iconv to normalize the character sets to a single one, if there are multiple data sources.
