How do I strip out everything but printing characters in PHP?

I am working with this daily data feed. To my surprise, one of the fields didn't look right after it was in MySQL. (I have no control over who provides the feed.)
So I did a mysqldump and discovered the zip code and the city for this record contained a non-printing char. It displayed it in 'vi' as this:
<200e>
I'm working in PHP and I parse this data and put it into the MySQL database. I have used the trim function on this, but that doesn't get rid of it. The problem is, if you do a query on a zipcode in the MySQL database, it doesn't find the record with the non-printing character.
I'd like to clean this up before it's put into the MySQL database.
What can I do in PHP? At first I thought of a regular expression that only allows a-z, A-Z, and 0-9, but that's no good for addresses. Addresses use periods, commas, hyphens and perhaps other things I'm not thinking of at the moment.
What's the best approach? I don't know exactly what to call it, other than that only printing characters should be allowed. Is there another PHP function like trim that does this job? Or a regular expression? If so, I'd like an example. Thanks!
I have looked into using the PHP filter_var function, and saw this posted at PHP.NET:
<?php
$a = "\tcafé\n";
//This will remove the tab and the line break
echo filter_var($a, FILTER_SANITIZE_STRING, FILTER_FLAG_STRIP_LOW);
//This will remove the é.
echo filter_var($a, FILTER_SANITIZE_STRING, FILTER_FLAG_STRIP_HIGH);
?>
While using FILTER_FLAG_STRIP_HIGH does indeed strip out the <200e> that I saw in 'vi', I'm concerned that it would also strip out the accented letter in a name such as André.
Maybe a regular expression is the solution?

You can use PHP filters: http://www.php.net/manual/en/function.filter-var.php
I would recommend using the FILTER_SANITIZE_STRING filter, or whichever filter fits what you need.

I think you could use this little regex replace:
preg_replace( '/[^[:print:]]+/', '', $your_value);
It basically strips out all non-printing characters from $your_value.
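If accented letters such as the é in André have to survive, a Unicode-aware variant of the same idea may be worth trying. The following is only a sketch, assuming the feed is UTF-8 encoded; the helper name strip_nonprinting is hypothetical. It removes everything in the Unicode "Other" category (\p{C}), which covers control characters and invisible format characters such as U+200E, and it falls back to a plain ASCII filter if the input is not valid UTF-8.

```php
<?php
// Hypothetical helper, assuming UTF-8 input.
function strip_nonprinting(string $value): string
{
    // \p{C} is the Unicode "Other" category: control, format, unassigned,
    // private-use and surrogate code points. The u modifier makes PCRE
    // treat both pattern and subject as UTF-8.
    $clean = preg_replace('/\p{C}+/u', '', $value);

    // preg_replace() returns null when the subject is not valid UTF-8.
    // In that case keep only printable ASCII instead.
    if ($clean === null) {
        $clean = preg_replace('/[^\x20-\x7E]+/', '', $value);
    }

    return $clean;
}

$zip  = "90210\u{200E}";     // zip code followed by U+200E (left-to-right mark)
$name = "\tAndr\u{E9}\n";    // tab, "André", line break

var_dump(strip_nonprinting($zip) === '90210');        // bool(true)
var_dump(strip_nonprinting($name) === "Andr\u{E9}");  // bool(true)
```

Note that tabs and line breaks are control characters too, so this removes them along with the invisible mark.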

I tried this:
<?php
$string = "\tabcde éç ÉäÄéöÖüÜß.,!-\n";
$string = preg_replace('/[^a-z0-9\!\.\, \-éâëïüÿçêîôûéäöüß]/iu', '', $string);
print "[$string]";
It gave:
[abcde éç ÉäÄéöÖüÜß.,!-]
Add all the special characters you need to the regexp.

If you work in English and do not need to support Unicode characters, then allow just [\x20-\x7E]
...and remove all others:
$s = preg_replace('/[^\x20-\x7E]+/', '', $s);

Related

PHP highlight query and escape html special characters

I'm trying to program a search function that highlights the search query in the result. At the moment I'm using this code for highlighting, which works fine:
$hightlight = preg_replace('/'.strtolower($query).'/', '<span class=hightlight>'.strtolower($query).'</span>', strtolower($text));
The text I'm searching in is a string from a database. The problem is that if the text contains HTML special characters, for example <test>, and the user searches for <te, I get the following result:
<span class=hightlight><te</span>st>
which is interpreted as st>. This makes sense, but it is not what I want. I want <test> as the result, with <te highlighted. So I need to escape the special characters. I know that there is the function htmlspecialchars, but how can I use it in this case? Or is there another function? I can't escape them before searching, because then I'm also searching in the HTML codes. I also can't escape them after searching, because then the <span> tags are already in the text and would also be converted to HTML codes. I hope you understand my problem. Does anyone have a solution for that?
Using a combination of htmlspecialchars() and a regex negative lookahead, I think we're able to solve this.
<?php
$text = "this is just my really basic <test> of words";
$query = "<te";
$text = htmlspecialchars($text);
$query = htmlspecialchars($query);
$highlight = preg_replace('/'.strtolower($query).'(?![^\&]*\;)/', '<span class=highlight>'.strtolower($query).'</span>', strtolower($text));
echo $highlight;
?>
(small note, I took the liberty of changing hightlight to highlight)
The part of this that solves the issue mentioned in your comment is the negative lookahead: (?![^\&]*\;)
That basically means anything not between & and ;.
Now, this could obviously run into issues in some edge cases where & and ; are both part of the actual text. If you're not doing any sort of text and query limitation/sanitation, I'm not sure that there's anything that will work for all possible cases.
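A different technique that sidesteps the & and ; edge cases is to split the raw text on the query first and escape each piece afterwards. This is a sketch of that alternative, not the lookahead approach above; the function name highlight_escaped is hypothetical. It also uses preg_quote() so that characters in the query with a special meaning in a regex are matched literally, and it keeps the original letter case of the text.

```php
<?php
// Hypothetical helper: highlight $query inside $text, escaping for HTML.
function highlight_escaped(string $text, string $query): string
{
    if ($query === '') {
        return htmlspecialchars($text);
    }

    // Capture the match so preg_split() returns it between the other pieces.
    $pattern = '/(' . preg_quote($query, '/') . ')/i';
    $parts = preg_split($pattern, $text, -1, PREG_SPLIT_DELIM_CAPTURE);

    $out = '';
    foreach ($parts as $i => $part) {
        $escaped = htmlspecialchars($part);
        // Odd indexes hold the captured matches, even indexes the text around them.
        $out .= ($i % 2 === 1)
            ? '<span class="highlight">' . $escaped . '</span>'
            : $escaped;
    }

    return $out;
}

echo highlight_escaped('this is just my really basic <test> of words', '<te'), "\n";
// this is just my really basic <span class="highlight">&lt;te</span>st&gt; of words
```

Because the search runs on the unescaped text, a query such as lt or amp cannot accidentally match inside an entity.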

PHP: remove MySQL formatting such as line break and commas

I'm making a CSV file. One of the database fields stores a section called comments/notes, which obviously contains some commas and line breaks too. I looked around the web and found preg_replace(). I'm not very familiar with regular expressions, so I combined two different ones, but the result is totally blank, even though I know all records have some sort of comment in them.
This is what I used:
preg_replace( "/\r|\n/|/[,]/", "", $string )
What do I need to do here to get the text back without line breaks and commas?
Regards
You can do it using strip_tags() and preg_replace();
$clean_str = trim(preg_replace('/\s\s+/', '', strip_tags($string)));
or try this
$clean_str = str_replace(array("\r\n","\r","\n", ","), '', strip_tags($string));
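As for why the pattern in the question returns nothing: with / as the delimiter, the pattern ends at the second /, so the remaining |/[,]/ is read as modifiers. PHP raises an "Unknown modifier" warning and preg_replace() returns null. A single character class avoids the problem. The sketch below is one possible version, with a hypothetical helper name; it replaces each run of whitespace and commas with one space rather than an empty string, on the assumption that words on either side of a line break should stay separated.

```php
<?php
// Hypothetical helper for flattening a comments/notes column.
function flatten_for_csv(string $text): string
{
    // \s covers \r, \n, tabs and spaces; the comma is added to the class.
    // Each run of those characters collapses into one space.
    $flat = preg_replace('/[\s,]+/', ' ', $text);

    return trim($flat);
}

echo flatten_for_csv("First line,\r\nsecond line, third"), "\n";
// First line second line third
```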

Replace characters in a string with their HTML coding

I need to replace characters in a string with their HTML coding.
Ex. The "quick" brown fox, jumps over the lazy (dog).
I need to replace the quotation marks with &quot; and the brackets with &#40; and &#41;
I have tried str_replace, but I can only get 1 character to be replaced. Is there a way to replace multiple characters using str_replace? Or is there a better way to do this?
Thanks!
I suggest using the function htmlentities().
Have a look at the Manual.
PHP has a number of functions to deal with this sort of thing:
Firstly, htmlentities() and htmlspecialchars().
But as you already found out, they won't deal with ( and ) characters, because these are not characters that ever need to be rendered as entities in HTML. I guess the question is why you want to convert these specific characters to entities? I can't really see a good reason for doing it.
If you really do need to do it, str_replace() will do multiple string replacements, using arrays in both the search and replace parameters:
$output = str_replace(array('(',')'), array('&#40;','&#41;'), $input);
You can also use the strtr() function in a similar way:
$conversions = array('('=>'&#40;', ')'=>'&#41;');
$output = strtr($input, $conversions);
Either of these would do the trick for you. Again, I don't know why you'd want to though, because there's nothing special about ( and ) brackets in this context.
While you're looking into the above, you might also want to look up get_html_translation_table(), which returns an array of entity conversions as used in htmlentities() or htmlspecialchars(), in a format suitable for use with strtr(). You could load that array and add the extra characters to it before running the conversion; this would allow you to convert all the normal entity characters at the same time.
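A short sketch of that idea, using the example sentence from the question. The variable names are illustrative only.

```php
<?php
// Start from the conversions that htmlspecialchars() would apply.
$table = get_html_translation_table(HTML_SPECIALCHARS);

// Add the two extra characters the question asks about.
$table['('] = '&#40;';
$table[')'] = '&#41;';

$input = 'The "quick" brown fox, jumps over the lazy (dog).';

// strtr() with an array replaces every key in one pass and never
// re-scans its own output, so the & in &#40; is not escaped again.
echo strtr($input, $table), "\n";
// The &quot;quick&quot; brown fox, jumps over the lazy &#40;dog&#41;.
```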
I would point out that if you serve your page with the UTF-8 character set, you won't need to convert any characters to entities (except for the HTML reserved characters <, > and &). This may be an alternative solution for you.
You also asked in a separate comment about converting line feeds. These can be converted with PHP's nl2br() function, but could also be done using str_replace() or strtr(), so could be added to a conversion array with everything else.

Change spaces to -

I am inserting an alias field into my database, stored in $alias.
How do I write the code (I am using PHP for the MySQL insert)
to replace every space with "-"? (I am trying to turn it into a web URL format, i.e. one without spaces.)
Thanks
Here's the method I use to sanitize strings for SEF URLs:
$slug = trim(strtolower($value));
$slug = preg_replace('/[^a-z0-9 _-]/', '', $slug);
return preg_replace('/\s+/', '-', $slug);
Feel free to add additional allowed characters to the first regex.
Please note that this is NOT Unicode or even full ISO-8859 safe. Well, it is, but it'll drop anything that isn't a-z. That is, you may need to normalize the string beforehand (i.e., replace accented characters with their closest ASCII equivalent). There are a number of SO questions and answers dealing with this that I've seen before, but I can't find them at the moment. I'll edit them in here if I stumble upon any.
For just removing spaces, you want the str_replace method. However, when working with URLs, you might want to consider the urlencode and rawurlencode methods as well.
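Since the three-line method above ends in a return, it needs to live inside a function. A minimal sketch of how that might look, with slugify as a hypothetical name:

```php
<?php
// Hypothetical wrapper around the slug method shown earlier.
function slugify(string $value): string
{
    $slug = trim(strtolower($value));
    // Keep only lowercase letters, digits, spaces, underscores and hyphens.
    $slug = preg_replace('/[^a-z0-9 _-]/', '', $slug);
    // Collapse each run of spaces into a single hyphen.
    return preg_replace('/\s+/', '-', $slug);
}

$alias = slugify('  My New Alias!  ');
echo $alias, "\n"; // my-new-alias
```

As noted, accented letters are simply dropped by the first regex, so "Café" would come out as "caf".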

Regex to change spaces in images into entities

I'm having a lot of difficulty matching an image url with spaces.
I need to make this
http://site.com/site.com/files/images/img 2 (5).jpg
into a div like this:
.replace(/(http:\/\/([^\s]+\.(jpg|png|gif)))/ig, "<div style=\"background: url($1)\"></div>")
Here's the thread about that:
regex matching image url with spaces
Now I've decided to first make the spaces into entities so that the above regex will work.
But I'm really having a lot of difficulty doing so.
Something like this:
.replace(/http:\/\/(.*)\/([^\<\>?:;]*?) ([^\<\>?:;]*)(\.(jpe?g|png|gif))/ig, "http://$1/$2%20$3$4")
Replaces one space, but all the rest are still spaces.
I need to write a regex that says, make all spaces between http:// and an image extension (png|jpg|gif) into %20.
At this point, frankly not sure if it's even possible. Any help is appreciated, thanks.
Trying Paolo's escape:
.escape(/http:\/\/(.*)\/([^\<\>?:;]*?) ([^\<\>?:;]*)(\.(jpe?g|png|gif))/)
Another way I can do this is to escape serverside in PHP, and in PHP I can directly mess with the file name without having to match it in regex.
But as far as I know, something like htmlentities does not apply to spaces. Any hints in this direction would be great as well.
Try the escape function:
>>> escape("test you");
test%20you
If you want to control the replacement character but don't want to use a regular expression, a simple...
$destName = str_replace(' ', '-', $sourceName);
..would probably be the more efficient solution.
Let's say you have the string variable urlWithSpaces, which is set to a URL that contains spaces.
Simply go:
urlWithoutSpaces = escape(urlWithSpaces);
What about urlencode() - that may do what you want.
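One caveat on the PHP side: urlencode() turns spaces into +, while rawurlencode() produces %20, which is what the question asks for. Running either over the whole URL would also encode the : and / characters, so the sketch below encodes only the file name after the last slash. The helper name encode_image_filename is hypothetical.

```php
<?php
// Hypothetical helper: percent-encode only the last path segment of a URL.
function encode_image_filename(string $url): string
{
    $pos = strrpos($url, '/');
    if ($pos === false) {
        // No slash at all: treat the whole string as a file name.
        return rawurlencode($url);
    }

    // Leave scheme, host and directories untouched; encode the file name.
    return substr($url, 0, $pos + 1) . rawurlencode(substr($url, $pos + 1));
}

echo encode_image_filename('http://site.com/site.com/files/images/img 2 (5).jpg'), "\n";
// http://site.com/site.com/files/images/img%202%20%285%29.jpg
```

rawurlencode() also encodes the parentheses, which is harmless here and avoids trouble inside an unquoted CSS url(...).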
On the JS side you should be using encodeURI(), and escape() only as a fallback. The reason to use encodeURI() is that it uses UTF-8 for encoding, while escape() uses ISO Latin. The same problem applies to decoding.
encodeURI = encodeURI || escape;
alert(encodeURI('image name.png'));
