I have a text in Burmese language, UTF-8. I am using PHP to work with the text. At some point along the way, some ZWSPs have crept in and I would like to remove them. I have tried two different ways of removing the characters, and neither seems to work.
First I have tried to use:
$newBody = str_replace("&#8203;", "", $newBody);
to search for the HTML entity and remove it, as this is how it appears under Web Inspector. The spaces don't get removed. I have also tried it as:
$newBody = str_replace("​", "", $newBody);
and get the same result: nothing is removed.
The second method I tried was found on this question Remove ZERO WIDTH NON-JOINER character from a string in PHP
which looked like this:
$newBody = str_replace("\xE2\x80\x8C", "", $newBody);
but I also got no result. The ZWSP was not removed.
An example word in the text ($newBody) looks like this : ယူကရိန်
And I want to make it look like this : ယူကရိန်း
Any ideas? Would a preg_replace work better somehow?
So I did try
$newBody = preg_replace("/\xE2\x80\x8B/", "", $newBody);
and it appears to be working, but now there is another issue.
<a class="defined" title="Ukraine">ယူကရိန်း</a>
gets transformed into
<a class="defined _tt_t_" title="Ukraine" style="font-family: 'Masterpiece Uni Sans', TharLon, Myanmar3, Yunghkio, Padauk, Parabaik, 'WinUni Innwa', 'Win Uni Innwa', 'MyMyanmar Unicode', Panglong, 'Myanmar Sangam MN', 'Myanmar MN';">ယူကရိန်း</a>
I don't want it to add all that extra stuff. Any ideas why this is happening? Apart from coming up with some way to target only the text in between the tags, is there another way to prevent the preg_replace from adding all this extra stuff? Btw, I am using Google Chrome on a Mac. It seems to act a bit differently with Firefox...
This:
$newBody = str_replace("&#8203;", "", $newBody);
presumes the text is HTML entity encoded. This:
$newBody = str_replace("\xE2\x80\x8C", "", $newBody);
should work if the offending characters are not encoded, but it matches the wrong character: 0xE2808C is U+200C (ZERO WIDTH NON-JOINER). To match the same character as &#8203; (U+200B, ZERO WIDTH SPACE) you need 0xE2808B:
$newBody = str_replace("\xE2\x80\x8B", "", $newBody);
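Before choosing which bytes to replace, it can help to dump the string as hex to see which invisible character is actually present. The following is a minimal sketch, assuming the text is UTF-8; the sample string is made up for illustration.

```php
<?php
// Hypothetical sample: "ab" + ZWSP (U+200B) + "cd" + ZWNJ (U+200C).
$newBody = "ab\xE2\x80\x8Bcd\xE2\x80\x8C";

// bin2hex() prints the raw bytes, so invisible characters become visible.
echo bin2hex($newBody), "\n";

// Remove only the zero width space; the ZWNJ is left in place.
$withoutZwsp = str_replace("\xE2\x80\x8B", "", $newBody);

// Equivalent regex form: with the /u modifier, \x{200B} names the code point.
$viaRegex = preg_replace('/\x{200B}/u', '', $newBody);

echo bin2hex($withoutZwsp), "\n";
```

Leaving the ZWNJ alone is deliberate in this sketch, since it can be meaningful in some scripts.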
The result from the Google+ API has \ufeff appended to the end of every "content" result (I don't really know why?)
What is the best way to remove this unicode character from the json result? It is producing a '?' in some of the output I am displaying.
Example:
https://developers.google.com/+/api/latest/activities/get#try-it
enter activity id
z12pvrsoaxqlw5imi22sdd35jwvkglj5204
and click Execute, result will be:
{
.....
"object": {
......
"content": "CONTENT OF GOOGLE PLUS POST HERE \ufeff",
......
example PHP code which shows a '?' where the '\ufeff' is:
<?php
$data = json_decode($result_from_google_plus_api, true);
echo $data['object']['content'];
// outputs "CONTENT OF GOOGLE PLUS POST HERE ?"
echo trim($data['object']['content']);
// outputs "CONTENT OF GOOGLE PLUS POST HERE ?"
Or am I going about this the wrong way? Should I be fixing the '?' issue rather than trying to remove the '\ufeff'?
In your case, you could use this regexp:
$str = preg_replace('/\x{feff}$/u', '', $str);
That way you can exactly match that code point value and have it removed.
In my experience, there are a lot more whitespace-like characters you will want to remove. This works well for me:
# I like to call this unicodeTrim()
$str = preg_replace(
'/
^
[\pZ\p{Cc}\x{feff}]+
|
[\pZ\p{Cc}\x{feff}]+$
/ux',
'',
$str
);
I found http://www.regular-expressions.info/unicode.html a pretty good resource about the fine details:
\pZ - match any kind of whitespace or invisible separator
\p{Cc} - match control characters
\x{feff} - match BOM
I've seen regexes that suggest matching \pC instead of \p{Cc}. However, this is dangerous because \pC includes any code point to which no character has been assigned. I've had actual data (certain emojis or other stuff) being removed because of this.
But YMMV; I can't stress this enough.
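The pattern above can be wrapped in a small function. This is only a sketch: the name unicodeTrim() is taken from the comment in the answer, and the sample string with a leading and trailing BOM is an assumption for illustration.

```php
<?php
// Hypothetical wrapper around the trimming pattern shown above.
function unicodeTrim(string $str): string
{
    $result = preg_replace(
        '/^[\pZ\p{Cc}\x{feff}]+|[\pZ\p{Cc}\x{feff}]+$/u',
        '',
        $str
    );

    // preg_replace() returns null on failure (for example invalid UTF-8),
    // in which case the input is returned unchanged.
    return $result ?? $str;
}

// Made-up sample with a BOM and whitespace at both ends.
$content = "\u{FEFF}  CONTENT OF GOOGLE PLUS POST HERE \u{FEFF}\n";
var_dump(unicodeTrim($content));
```

Spaces inside the string are kept, because each alternative is anchored to the start or the end.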
With respect to all the answers:
I tested most of the answers, but finally found a solution here: GitHub
$field = preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $field);
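Note that this pattern has no /u modifier, so it works on bytes and removes every non-ASCII byte, not just the BOM. Here is a small check, with a made-up input, of what that means for UTF-8 text, next to a code point based alternative:

```php
<?php
// Hypothetical input: "café" (é written as \u{E9}) followed by a BOM (U+FEFF).
$field = "caf\u{E9}\u{FEFF}";

// Byte-based class: strips control bytes and every byte from 0x80 to 0xFF,
// which includes both bytes of the UTF-8 encoded "é".
$asciiOnly = preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $field);
var_dump($asciiOnly);

// Code point based alternative that only targets the BOM.
$noBom = preg_replace('/\x{FEFF}/u', '', $field);
var_dump($noBom);
```

The first form is fine if the output is supposed to be ASCII only; the second keeps accented letters intact.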
Something I have noticed on the StackOverflow website:
If you visit the URL of a question on StackOverflow.com:
"https://stackoverflow.com/questions/10721603"
The website adds the name of the question to the end of the URL, so it turns into:
"https://stackoverflow.com/questions/10721603/grid-background-image-using-imagebrush"
This is great, I understand that this makes the URL more meaningful and is probably good as a technique for SEO.
What I wanted to Achieve after seeing this Implementation on StackOverflow
I wish to implement the same thing with my website. I am happy using a header() 301 redirect in order to achieve this, but I am attempting to come up with a tight script that will do the trick.
My Code so Far
Please see it working by clicking here
// Set the title of the page article (This could be from the database). Trimming any spaces either side
$original_name = trim(' How to get file creation & modification date/times in Python with-dash?');
// Replace any characters that are not A-Za-z0-9 or a dash with a space
$replace_strange_characters = preg_replace('/[^\da-z-]/i', " ", $original_name);
// Replace any spaces (or multiple spaces) with a single dash to make it URL friendly
$replace_spaces = preg_replace("/([ ]{1,})/", "-", $replace_strange_characters);
// Remove any leading or trailing dashes (this also removes runs of two or more dashes)
$removed_dashes = preg_replace("/^([\-]{0,})|([\-]{2,})|([\-]{0,})$/", "", $replace_spaces);
// Show the finished name on the screen
print_r($removed_dashes);
The Problem
I have created this code and it works fine by the looks of things: it makes the string URL friendly and readable to the human eye. However, I would like to see if it is possible to simplify it or "tighten it up" a bit, as I feel my code is probably overcomplicated.
It is not so much that I want it put onto one line, because I could do that by nesting the functions into one another, but I feel that there might be an overall simpler way of achieving it - I am looking for ideas.
In summary, the code achieves the following:
Removes any "strange" characters and replaces them with a space
Replaces any spaces with a dash to make it URL friendly
Returns a string without any spaces, with words separated with dashes and has no trailing spaces or dashes
String is readable (doesn't contain percentage signs and + symbols, as it would after simply using urlencode())
Thanks for your help!
Potential Solutions
I found out whilst writing this article that I am looking for what is known as a URL 'slug', and that they are indeed useful for SEO.
I found this library on Google code which appears to work well in the first instance.
There is also a notable question on this on SO which can be found here, which has other examples.
I tried to play with preg like you did. However it gets more and more complicated when you start looking at foreign languages.
What I ended up doing was simply trimming the title, and using urlencode
$url_slug = urlencode($title);
Also I had to add those:
$title = str_replace('/','',$title); //Apache doesn't like this character even encoded
$title = str_replace('\\','',$title); //Apache doesn't like this character even encoded
There are also 3rd party libraries such as: http://cubiq.org/the-perfect-php-clean-url-generator
Indeed, you can do that:
$original_name = ' How to get file creation & modification date/times in Python with-dash?';
$result = preg_replace('~[^a-z0-9]++~i', '-', $original_name);
$result = trim($result, '-');
To deal with other alphabets you can use this pattern instead:
~\P{Xan}++~u
or
~[^\pL\pN]++~u
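Putting the Unicode-aware pattern and the trim() call together gives a compact slug function. This is a sketch only; the function name slugify() is hypothetical and not part of the original answer.

```php
<?php
// Hypothetical helper built from the two steps shown above.
function slugify(string $title): string
{
    // Replace every run of characters that are neither letters nor digits
    // with a single dash; fall back to '' if preg_replace() fails.
    $slug = preg_replace('~[^\pL\pN]++~u', '-', $title) ?? '';

    // Drop the dashes left over at either end.
    return trim($slug, '-');
}

$original_name = ' How to get file creation & modification date/times in Python with-dash?';
echo slugify($original_name), "\n";
```

Because the pattern matches whole runs, no separate step is needed to collapse repeated spaces or dashes.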
I am working with this daily data feed. To my surprise, one of the fields didn't look right after it was in MySQL. (I have no control over who provides the feed.)
So I did a mysqldump and discovered the zip code and the city for this record contained a non-printing char. It displayed it in 'vi' as this:
<200e>
I'm working in PHP and I parse this data and put it into the MySQL database. I have used the trim function on this, but that doesn't get rid of it. The problem is, if you do a query on a zipcode in the MySQL database, it doesn't find the record with the non-printing character.
I'd like to clean this up before it's put into the MySQL database.
What can I do in PHP? At first I thought regular expression to only allow a-z,A-Z, and 0-9, but that's not good for addresses. Addresses use periods, commas, hyphens and perhaps other things I'm not thinking of at the moment.
What's the best approach? I don't know what it's called to define it exactly other than printing characters should only be allowed. Is there another PHP function like trim that does this job? Or regular expression? If so, I'd like an example. Thanks!
I have looked into using the PHP function filter_var, and saw this posted at PHP.NET:
<?php
$a = "\tcafé\n";
//This will remove the tab and the line break
echo filter_var($a, FILTER_SANITIZE_STRING, FILTER_FLAG_STRIP_LOW);
//This will remove the é.
echo filter_var($a, FILTER_SANITIZE_STRING, FILTER_FLAG_STRIP_HIGH);
?>
While using FILTER_FLAG_STRIP_HIGH does indeed strip out the <200e> I mentioned seeing in 'vi', I'm concerned that it would also strip out the accented letter in a name such as André.
Maybe a regular expression is the solution?
You can use PHP filters: http://www.php.net/manual/en/function.filter-var.php
I would recommend using the FILTER_SANITIZE_STRING filter (note that it is deprecated as of PHP 8.1), or any other filter that fits what you need.
I think you could use this little regex replace:
preg_replace( '/[^[:print:]]+/', '', $your_value);
It basically strips out all non-printing characters from $your_value.
I tried this:
<?php
$string = "\tabcde éç ÉäÄéöÖüÜß.,!-\n";
$string = preg_replace('/[^a-z0-9\!\.\, \-éâëïüÿçêîôûéäöüß]/iu', '', $string);
print "[$string]";
It gave:
[abcde éç ÉäÄéöÖüÜß.,!-]
Add all the special characters you need to the regexp.
If you work in English and do not need to support unicode characters, then allow just [\x20-\x7E]
...and remove all others:
$s = preg_replace('/[^\x20-\x7E]+/', '', $s);
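If accented letters have to survive, one possible alternative is to remove only control (Cc) and format (Cf) characters by Unicode property. U+200E, shown as <200e> in vi, is a format character. This is a sketch under the assumption that the input is valid UTF-8; the helper name stripInvisible() is hypothetical.

```php
<?php
// Hypothetical helper: remove control (Cc) and format (Cf) characters only.
function stripInvisible(string $s): string
{
    $clean = preg_replace('/[\p{Cc}\p{Cf}]+/u', '', $s);

    // preg_replace() returns null on invalid UTF-8; keep the input in that case.
    return $clean ?? $s;
}

// Made-up samples: a zip code followed by U+200E, and a name with an accent.
$zip  = "90210\u{200E}";
$name = "\tAndr\u{E9}\n";

var_dump(stripInvisible($zip));
var_dump(stripInvisible($name));
```

Note that Cc includes tabs and line breaks, and Cf includes joiners that are meaningful in some scripts, so this fits short fields such as zip codes and cities better than free text.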
I've tried about everything to delete some extra \n characters in a web application I'm working with. I was hoping someone has encountered this issue before and knows what can be causing this. All my JS and PHP files are UTF-8 encoded with no BOM.
And yes I've tried things like
In JS:
text.replace(/\n/g,"")
In PHP:
preg_replace("[\n]","",$result);
str_replace("\n","",$result);
and when I try
text.replace(/\n/g,"")
in the Firebug console using the same string I get from the server, it works, but for some reason it doesn't work in a JS file.
I'm desperate, picky and this is killing me. Any input is appreciated.
EDIT:
If it helps, I know how to use the replace functions above. I'm able to replace any other string or pattern except \n for some reason.
Answer Explanation:
Some people use what works because it just works. If you are like me, though, you always like to know why what works WORKS!
In my case:
Why this works? str_replace('\n', '', $result)
And this doesn't? str_replace("\n", '', $result)
Looks identical right?
Well, it seems that when you enclose an escape sequence like \n in double quotes ("\n"), it is seen as its character value, NOT as a string. On the other hand, if you enclose it in single quotes ('\n'), it really is seen as the literal string \n. At least that is what I concluded after my three-hour headache.
If what I concluded is a setup specific issue OR is erroneous please do let me know or edit.
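That conclusion matches how PHP parses string literals. If the single-quoted search is the one that works, the data most likely contains the two characters backslash and n rather than a real line feed, although that is an assumption about the data. A minimal sketch with made-up strings:

```php
<?php
// Double quotes: \n is interpreted as a single line feed character.
$realNewline = "line1\nline2";

// Single quotes: \n stays as two characters, a backslash and the letter n.
$literalChars = 'line1\nline2';

var_dump(strlen($realNewline));
var_dump(strlen($literalChars));

// Each search string only matches the kind of data it actually represents.
$a = str_replace("\n", '', $realNewline);   // removes the line feed
$b = str_replace('\n', '', $literalChars);  // removes backslash + n
$c = str_replace('\n', '', $realNewline);   // no match, string unchanged
```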
In php, use str_replace(array('\r','\n'), '', $string).
I guess the problem is you also have \r's in your code (carriage returns, also displayed as newlines).
In javascript, the .replace() method doesn't modify the string. It returns a new modified string, so you need to reference the result.
text = text.replace(/\n/g,"")
Both of the PHP functions you tried return the altered string, they do not alter their arguments:
$result = preg_replace("[\n]","",$result);
$result = str_replace("\n","",$result);
Strangely, using
str_replace(array('\r','\n'), '', $string)
didn't work for me. I can't really work out why either.
In my situation I needed to take output from a WordPress custom meta field and format it as HTML in a JavaScript array, for later use as info windows in a Google Maps instance on my site.
If I did the following:
$stockist_address = $stockist_post_custom['stockist_address'][0];
$stockist_address = apply_filters( 'the_content', $stockist_address);
$stockist_sites_html .= str_replace(array('\r','\n'), '', $stockist_address);
This did not give me a string with the html on a single line. This therefore threw an error on Google Maps.
What I needed to do instead was:
$stockist_address = $stockist_post_custom['stockist_address'][0];
$stockist_address = apply_filters( 'the_content', $stockist_address);
$stockist_sites_html .= trim( preg_replace( '/\s+/', ' ', $stockist_address ) );
This worked like a charm for me.
I believe that \s in regular expressions matches spaces, tabs, line breaks and carriage returns.
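A likely reason the str_replace() call had no effect is the quoting: in single quotes, '\r' and '\n' are literal backslash sequences, not CR and LF. The sketch below compares the three variants on a made-up address string, assuming the field contains real line breaks.

```php
<?php
// Hypothetical field content with real CR/LF line breaks.
$html = "<p>1 High Street</p>\r\n<p>London</p>\n";

// Single quotes: searches for backslash + r and backslash + n, finds nothing.
$unchanged = str_replace(array('\r', '\n'), '', $html);

// Double quotes: searches for the actual CR and LF characters.
$stripped = str_replace(array("\r", "\n"), '', $html);

// Regex variant from the post: collapse any whitespace run to one space.
$collapsed = trim(preg_replace('/\s+/', ' ', $html));

var_dump($unchanged === $html);
var_dump($stripped);
var_dump($collapsed);
```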
"replace newline" seems to be a question that has been asked here a hundred times already. However, I haven't found a working solution for myself yet.
I have a textarea that I use to save data into the DB. Then, using AJAX, I want to get the data from a TEXT field in the DB on the backend and pass it to the frontend as JSON. But parsing the JSON returns an error, as the new lines from the DB are not valid JSON syntax. I guess I should use \n instead...
But how do I replace a newline from the DB with \n?
I've tried this
$t = str_replace('<br />', '\n', nl2br($t));
and this
$t = preg_replace("/\r\n|\n\r|\r|\n/", "\n", $t);
and using CHAR(13) and CHAR(10), and I still get an error.
The new line in the textarea is equivalent to, I guess:
$t = 'text with a
newline';
It gives the same error. And in Notepad I can clearly see that it is CRLF.
You need to escape all the characters that have a special meaning in JSON, not only line feeds. And you also need to convert to UTF-8.
There's no need to reinvent the wheel, json_encode() can do everything for you.
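As a minimal sketch of that suggestion, with a made-up value standing in for the TEXT field from the database:

```php
<?php
// Hypothetical value read from the DB: a CRLF line break and double quotes.
$t = "text with a\r\nnewline and \"quotes\"";

// json_encode() escapes line breaks, quotes and other special characters.
$json = json_encode(array('text' => $t));
echo $json, "\n";

// Decoding the JSON gives back the original string, line break included.
$decoded = json_decode($json, true);
var_dump($decoded['text'] === $t);
```

json_encode() expects UTF-8 input and returns false when the string is not valid UTF-8, which is why the conversion mentioned above matters.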
Prfff... >_< silly me.
I was missing another backslash in the replacement string for \n:
$t = preg_replace("/\r\n|\n\r|\r|\n/", "\\n", $t);