How to remove unknown character in a text file?

I have a text file containing a character that I cannot identify. I have already searched on Google, but I'm having a hard time getting useful results since I do not know the general term for this kind of character.
I tried removing it with the code below, but nothing happens. I also tried "\f" because I thought the character was a form feed, but it still wasn't removed.
$replace = str_replace("\0", ' ', $str);
EDIT:
The character really is a form feed, but somehow the code below is not working for me.
$replace = str_replace("\f", ' ', $str);

You can use the preg_replace function to perform a regular expression search and replace.
$replace = preg_replace( '/[^A-Za-z0-9 _\-\+\&]/', '',$str);
Note: The character class in the first parameter is negated, so it lists the characters to keep; everything else is removed. Adjust it to match the set of characters you actually want to get rid of. You might be interested in removing non-printable characters.

I do not know why, but str_replace does not remove the 'FF' (form feed) character for me:
$replace = str_replace("\f", ' ', $str); // not working
The code below solves my problem. It does not look great, because it uses a regex to replace a single character, but it works:
$replace = preg_replace('/[\f]/', " ", $str);
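For anyone hitting the same problem, here is a minimal sketch of two ways to target the form feed without relying on the "\f" escape. The sample string is made up for illustration, and the control-character range is my assumption about what counts as "unwanted" (it keeps tab, line feed and carriage return).

```php
<?php
// Sample input containing a form feed (0x0C) and a NUL byte; assumed for illustration.
$str = "page one" . chr(12) . "page two" . chr(0) . "end";

// Replace only the form feed, addressed by its byte value instead of "\f".
$onlyFormFeed = str_replace(chr(12), ' ', $str);

// Replace every ASCII control character except tab, line feed and carriage return.
// The pattern is single-quoted, so the \x escapes are interpreted by the regex engine.
$clean = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/', ' ', $str);

// bin2hex() is a quick way to check which bytes are really in the string.
echo bin2hex($str), "\n";
echo $clean, "\n";
```

If bin2hex() does not show 0c where the odd character sits, the character is probably not a form feed after all.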

This looks like an issue with your header. Try again after adding this to your headers:
header('Content-Type: text/html; charset=UTF-8');
Hope this helps

Related

Preg_match for different language URLs

I have some text like this :
$text = "Some thing is there http://example.com/جميع-وظائف-فى-السليمانية
http://www.example.com/جميع-وظائف-فى-السليمانية nothing is there
Check me http://example.com/test/for_me first
testing http://www.example.com/test/for_me the url
Should be test http://www.example.com/翻译-英语教师-中文教师-外贸跟单
simple text";
I need to preg_match the URL, but they are of different languages.
So, I need to get the URL itself, from each line.
I was doing like this :
$text = preg_replace("/[\n]/", " <br>", $text);
$lines = explode("<br>", $text);
foreach($lines as $textLine){
if (preg_match("/(http\:\/\/(.*))/", $textLine, $match )) {
// some code
// Here I need the url
}
}
My current regex is /(http\:\/\/(.*))/, please suggest how I can make this compatible with the URLs in different languages?

A regular expression like this may work for you.
In my test it worked with the sample text you gave, although it is not very advanced. It simply selects all characters after http:// or https:// until a whitespace character occurs (space, new line, tab, etc.).
/(https?\:\/\/(?:[^\s]+))/gi
Here is a visual example of what would be matched from your sample string:
http://regex101.com/r/bR0yE9

You don't need to work line by line, you can search directly:
if (preg_match_all('~\bhttp://\S+~', $text, $matches))
print_r($matches);
Here \S means "any character that is not whitespace". There is no special internationalisation problem.
Note: if you want to replace all newlines with <br/> afterwards, I suggest using $text = preg_replace('~\R~', '<br/>', $text);, because \R handles several types of newline whereas \n matches only Unix newlines.
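Putting the two suggestions above together, a sketch might look like the following. PHP has no g modifier (preg_match_all already finds every match), so only i is kept; the u modifier is my addition and assumes the text is valid UTF-8. The sample text is shortened from the question.

```php
<?php
// Shortened sample from the question; assumed to be UTF-8 encoded.
$text = "Some thing is there http://example.com/جميع-وظائف-فى-السليمانية\n"
      . "http://www.example.com/جميع-وظائف-فى-السليمانية nothing is there\n"
      . "Check me http://example.com/test/for_me first";

// https? accepts both schemes, \S+ runs up to the next whitespace character.
// i = case-insensitive, u = treat pattern and subject as UTF-8.
$urls = array();
if (preg_match_all('~\bhttps?://\S+~iu', $text, $matches)) {
    $urls = $matches[0]; // full matches, one per URL found
}

print_r($urls);
```

Because \S+ stops only at whitespace, trailing punctuation such as a full stop directly after a URL would be captured too; strip it afterwards if that matters for your input.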

PHP removing noise words regular expression with boundaries with ' character

I'm attempting to remove noise words from a string, and I have what I believe is a good algorithm for it, but I'm running into a snag. Before I do my preg_replace I remove all punctuation except the apostrophe ('). Then I put it through this preg_replace:
$content = preg_replace('/\b('.implode('|', self::$noiseWords).')\b/','',$content);
Which works great, except for words that do indeed have that ' character. preg_replace seems to be treating that as a boundary character. This is a problem for me.
Is there a way I can get around this? A different solution perhaps?
Thanks!
Here is the example I'm using:
$content = strtolower(strip_tags($content));
$content = preg_replace("/(?!['])\p{P}/u", "", $content);// remove punctuation
echo $content;// i've added striptags for editing as well should still workyep it doesnbsp
$content = preg_replace("/\b(?<!')(".implode('|', self::$noiseWords).")(?!')\b/",'',$content);
$contentArray = explode(" ", $content);
print_r($contentArray);
On the 3rd line you'll see the comment of what $content is right before the preg_replace
And though I'm assuming you can guess what my noiseWords array looks like, here's just a small fraction of it:
$noiseWords = array("a", "able","about","above","abroad","according","accordingly","across",
"actually","adj","after","afterwards","again",......)

You can use a negative lookbehind and a negative lookahead to make sure you're not "around" a quote character:
$regex = "/\b(?<!')(".implode('|', self::$noiseWords).")(?!')\b/";
Now your regex will not match anything that is preceded or followed by a single quote.
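To see the lookarounds at work, here is a small self-contained sketch. The word list and the sample sentence are hypothetical, and the preg_quote() call is my addition in case a noise word ever contains a regex metacharacter.

```php
<?php
// Hypothetical noise-word list, for illustration only.
$noiseWords = array('a', 'about', 'it', 't');

// Escape each word so it is treated literally inside the pattern.
$alternatives = implode('|', array_map(function ($word) {
    return preg_quote($word, '/');
}, $noiseWords));

// (?<!') rejects a word that directly follows an apostrophe,
// (?!') rejects a word that is directly followed by one.
$regex = "/\b(?<!')(" . $alternatives . ")(?!')\b/";

$content = "it's about a dog that don't bark";
$stripped = preg_replace($regex, '', $content);

// Removing words leaves runs of spaces behind; collapse them.
$result = trim(preg_replace('/\s+/', ' ', $stripped));

echo $result, "\n";
```

Here "about" and "a" are removed, while the "it" in "it's" and the "t" in "don't" survive because each touches an apostrophe.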

How can I remove the ascii character x02 from my string with php?

I have tried a lot of things but nothing happens. My generated xml is not well formed because of the ascii character x02 (in VIM it is '^B'). I have tried it with the following line:
$keywords = preg_replace('/\x02/', '', $keywords);
But that won't work. Do you have an idea?

Why use regexp?
$keywords = str_replace(chr(2), '', $keywords);

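Use double quotes if you want PHP itself to interpret the escape sequence. (PCRE also understands \x02 inside a single-quoted pattern, so both forms should match the same byte.) Replace your code with: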
$keywords = preg_replace("/\x02/", '', $keywords);
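As a rough comparison of the approaches above, the sketch below removes the same byte three ways and then shows a broader cleanup. The sample value is made up, and the wider character range is my assumption based on the control characters that XML 1.0 does not allow (everything below 0x20 except tab, line feed and carriage return).

```php
<?php
// Sample value with an embedded STX (0x02) byte; the content is made up for illustration.
$keywords = "php" . chr(2) . ",xml";

// Three ways to address the same byte.
$a = str_replace(chr(2), '', $keywords);
$b = str_replace("\x02", '', $keywords);    // double quotes: PHP expands \x02
$c = preg_replace('/\x02/', '', $keywords); // single quotes: the regex engine expands \x02

// Broader cleanup: drop every control character that XML 1.0 rejects.
$xmlSafe = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $keywords);

echo $xmlSafe, "\n";
```

If none of these changes the string, it is worth checking that the result is assigned back to the variable and that the byte really is 0x02, for example with bin2hex().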

Remove part of a string with regex

I'm trying to strip part of a string (which happens to be a URL) with a regex. I'm getting better at regex but can't figure out how to tell it that content before or after the string is optional. Here is what I have:
$string='http://www.example.com/username?refid=22';
$new_string= preg_replace('/[/?refid=0-9]+/', '', $string);
echo $new_string;
I'm trying to remove the ?refid=22 part to get http://www.example.com/username
Ideas?
EDIT
I think I need to use a regex instead of explode because sometimes the URL looks like http://example.com/profile.php?id=9999&refid=22. In this case I also want to remove the refid but keep id=9999.

parse_url() is good for parsing URLs :)
$string = 'http://www.example.com/username?refid=22';
$url = parse_url($string);
// Ditch the query.
unset($url['query']);
echo array_shift($url) . '://' . implode($url);
Output
http://www.example.com/username
If you only wanted to remove that specific GET param, do this...
parse_str($url['query'], $get);
unset($get['refid']);
$url['query'] = http_build_query($get);
Output
http://example.com/profile.php?id=9999
If you have the extension, you can rebuild the URL with http_build_url().
Otherwise you can make assumptions about username/password/port and build it yourself.
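Building it yourself could look roughly like the sketch below. removeQueryParam is a hypothetical helper name, not a built-in, and it assumes the URL has a scheme and host and carries no username, password or port.

```php
<?php
// Hypothetical helper: drops one query parameter and rebuilds the URL by hand.
// Assumes scheme and host are present and there is no user, password or port.
function removeQueryParam($url, $param)
{
    $parts = parse_url($url);

    // Parse the query string into an array and remove the unwanted key.
    $query = array();
    if (isset($parts['query'])) {
        parse_str($parts['query'], $query);
        unset($query[$param]);
    }

    $result = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['path'])) {
        $result .= $parts['path'];
    }
    if (!empty($query)) {
        $result .= '?' . http_build_query($query);
    }
    if (isset($parts['fragment'])) {
        $result .= '#' . $parts['fragment'];
    }
    return $result;
}

echo removeQueryParam('http://www.example.com/username?refid=22', 'refid'), "\n";
echo removeQueryParam('http://example.com/profile.php?id=9999&refid=22', 'refid'), "\n";
```

Note that http_build_query() re-encodes the remaining parameters, so unusual characters in other values may come back in a different (but equivalent) encoded form.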
Update
Just for fun, here is the correction for your regular expression.
preg_replace('/\?refid=\d+\z/', '', $string);
[] is a character class. You were trying to put a specific order of characters in there.
\ is the escape character, not /.
\d is a short version of the character class [0-9].
I put the last character anchor (\z) there because it appears it will always be at the end of your string. If not, remove it.

Don't use regexes if you don't have to:
echo current( explode( '?', $string ) );

PHP trim problem

I asked earlier how I can get rid of extra hyphens and whitespace at the beginning and end of user-submitted text. For example, -ruby-on-rails- should become ruby-on-rails. You suggested trim(), which worked fine by itself, but when I added it to my code it did not work at all and actually did some funky things to my code.
I tried placing the trim() call everywhere in my code but nothing worked. Can someone help me get rid of the extra hyphens and whitespace at the beginning and end of user-submitted text?
Here is my PHP code.
$tags = preg_split('/,/', strip_tags($_POST['tag']), -1, PREG_SPLIT_NO_EMPTY);
$tags = str_replace(' ', '-', $tags);

Update the trim statement to the following in order to update each item in the array:
foreach($tags as $key=>$value) {
$tags[$key] = trim($value, '-');
}
That should allow you to trim each value based on a string being expected.

If you have a string you can do this to strip hyphens from the beginning and end:
$tag = trim($tag, '-');
Your problem is that preg_split returns an array, but trim takes a string. You need to do the above for every string in the array.
Regarding trimming whitespace: if you are first converting all whitespace to hyphens then it should not be necessary to trim whitespace afterwards - the whitespace will already be gone. But be careful because the terms "whitespace" and "space" have different meanings. Your question seems to muddle these two terms.
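A sketch of the whole pipeline, applying trim() to each element with array_map, might look like this. The $input value stands in for $_POST['tag'] and is made up; trimming spaces and hyphens before converting inner spaces is my choice of order, not something the question prescribes.

```php
<?php
// Stand-in for $_POST['tag']; the value is made up for illustration.
$input = ' -ruby-on-rails- , - ruby on rails - ,, -regex';

// Split on commas, dropping empty pieces, as in the question.
$tags = preg_split('/,/', strip_tags($input), -1, PREG_SPLIT_NO_EMPTY);

// For each tag: strip leading/trailing spaces and hyphens first,
// then turn the remaining inner spaces into hyphens.
$tags = array_map(function ($tag) {
    return str_replace(' ', '-', trim($tag, '- '));
}, $tags);

print_r($tags);
```

A tag made up only of spaces and hyphens would come out as an empty string, so you may want to run array_filter() over the result.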

Verify that the hyphen character you're attempting to trim is the same hyphen character that is wrapping -ruby-on-rails-. For example, these are all different characters that look similar: -, –, —, ―.

I'm new to StackOverflow.com, so I hope the function I wrote helps you in some way. You can specify which characters you want it to trim in the second parameter; for your example I've set it to remove whitespace and dashes by default. I've tested it using 'ruby-on-rails' and a somewhat extreme example of '- -- - - ruby-on-rails - -- - - -', and both produce the result 'ruby-on-rails'.
The regular expression might be a bit of a quick-and-dirty way of going about it, but I hope it helps. Just reply if you have any problems implementing it.
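function customTrim($s, $c = '- ')
{
    // Character class for "anything that is not one of the characters in $c".
    $a = '[^' . preg_quote($c, '#') . ']';
    // Match from the first such character to the last one.
    if (preg_match('#' . $a . '(?:.*' . $a . ')?#', $s, $match)) {
        return $match[0];
    }
    return ''; // nothing is left once the unwanted characters are stripped
}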
