converting C style unicode escapes to htmlentities - php

I want to convert every C-style Unicode escape into an HTML entity. I've written this function to do that:
function ununicode($text) {
    $text = preg_replace('/\\\\u([0-9a-f]{4})/i', '&#x$1;', $text);
    return $text;
}
It works, but it ignores the second escape in input such as \u00f6\u00df, i.e. it produces: &#x00f6;\u00df
What's wrong with my regex?

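Don't add a g flag: PHP's PCRE functions have no g modifier, and preg_replace already replaces every match by default (its $limit parameter defaults to -1). A pattern like the following fails with an "Unknown modifier 'g'" warning, and preg_replace returns NULL:
$text = preg_replace('/\\\\u([0-9a-f]{4})/ig', '&#x$1;', $text);
Edit:
Your code as written works for me: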
php > $text = "\u00f6\u00df";
php > print $text;
\u00f6\u00df
php > $text2 = preg_replace('/\\\\u([0-9a-f]{4})/i', '&#x$1;', $text);
php > print $text2;
&#x00f6;&#x00df;
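For reference, here is a minimal sketch (not from the original thread) showing that preg_replace replaces every match without any extra flag. The optional $limit and $count arguments belong to preg_replace itself; the variable names are only illustrative.

```php
<?php
// Single quotes keep the backslashes literal, so $text holds \u00f6\u00df as typed.
$text = '\u00f6\u00df';

// $limit = -1 (the default) means "replace all"; $count receives the number of replacements.
$out = preg_replace('/\\\\u([0-9a-f]{4})/i', '&#x$1;', $text, -1, $count);

echo $out, "\n";   // &#x00f6;&#x00df;
echo $count, "\n"; // 2
```

If only the first escape is converted in your setup, the input string is probably not what you expect, so it is worth dumping it with var_dump before the call.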

Related

PHP regex replace based on \v character (vertical tab)

I have a character string like (ascii codes):
32,13,7,11,11,
"string1,blah;like: this...", 10,10, 32,32,32,32, 138,138, 32,32,32,32, 13,7, 11,11,
"string2/lorem/example-text...", 10,10, 32,32,32,32,32, 143,143,143,143,143
So the sequence is:
any characters, followed by my search string, followed by any characters
11,11
the string I want to replace
any non-printable characters
If the block contains string1 then I need to replace the next string with something else. The second string always starts directly after the 11,11.
I'm using PHP.
I thought something like this, but I am not getting the correct result:
$updated = preg_replace("/(.*string1.*?\\v+)([[:print:]]+)([[:ascii:]]*)/mi", "$1"."new string"."$3", $orig);
This puts "new string" between the 10,10 and the 138,138 (and replaces the 32's).
Also tried \xb instead of \v.
Normally I test with regex101, but I'm not sure how to do that with non-printable characters. Any suggestions from regex gurus?
Edit: the expected output is the sequence:
32,13,7,11,11,
"string1,blah;like: this...", 10,10, 32,32,32,32, 138,138, 32,32,32,32, 13,7, 11,11,
"new string", 10,10, 32,32,32,32,32, 143,143,143,143,143
Edit: sorry for the confusion regarding the ascii codes.
Here's a complete example:
<?php
$s = chr(32).chr(32).chr(7).chr(11).chr(11);
$s .= "string1,blah;like: this...". chr(10).chr(10).chr(32).chr(32).chr(32).chr(32).chr(138).chr(138);
$s .= chr(32).chr(32).chr(32).chr(32).chr(13).chr(7).chr(11).chr(11);
$s .= "string2/lorem/example-text...". chr(10).chr(10).chr(32).chr(32).chr(32).chr(32).chr(32).chr(143).chr(143).chr(143);
$result = preg_replace('/(.*string1.*?\v+)([[:print:]]+)([[:ascii:]]*)/mi', "$1"."new string"."$3", $s);
echo "\n------------------------\n";
echo $result;
echo "\n------------------------\n";
The text string2/lorem/example-text... should be replaced by new string.
My php-cli halted every time preg_match reached chr(138), and I don't know why.
I'll throw my hat in the ring with this regex (note: \v also matches a newline; no flags are set):
"[^"]*"[^\x0b]+\v{2}"\K[^"]*
PHP code:
$source = chr(32).chr(13).chr(7).chr(11).chr(11)."\"string1,blah;like: this...\"".chr(10).
chr(10).chr(32).chr(32).chr(32).chr(32).chr(138).chr(138).chr(32).chr(32).chr(32).chr(32).
chr(13).chr(7).chr(11).chr(11)."\"string2/lorem/example-text...\"".chr(10).chr(10).chr(32).
chr(32).chr(32).chr(32).chr(32).chr(143).chr(143).chr(143).chr(143).chr(143);
echo preg_replace('~"[^"]*"[^\x0b]+\v{2}"\K[^"]*~', "new string", $source);
Beautiful output:
"string1,blah;like: this..."
��
"new string"
�����
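Solved. It was a combination of things:
/mis was needed (instead of /mi): the s modifier lets . match newlines, so .* can span the line feeds in the data
\x0b was needed (instead of \v): in PCRE, \v matches any vertical whitespace, including the line feed chr(10), not just the vertical tab chr(11)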
Complete working example:
<?php
$s = chr(32).chr(32).chr(7).chr(11).chr(11);
$s .= "string1,blah;like: this...". chr(10).chr(10).chr(32).chr(32).chr(32).chr(32).chr(138).chr(138);
$s .= chr(32).chr(32).chr(32).chr(32).chr(13).chr(7).chr(11).chr(11);
$s .= "string2/lorem/example-text...". chr(10).chr(10).chr(32).chr(32).chr(32).chr(32).chr(32).chr(143).chr(143).chr(143);
$result = preg_replace('/(.*string1.*?\x0b+)([[:print:]]+)/mis', "$1"."new string", $s);
echo "\n------------------------\n";
echo $result;
echo "\n------------------------\n";
Thanks for everyone's suggestions. It put me on the right track.
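To illustrate why \v gave the wrong result, here is a small sketch (my own test string, not the data from the question) comparing \v with \x0b in PCRE:

```php
<?php
// Test string: "a", two line feeds, two vertical tabs, "b".
$s = "a\n\n" . chr(11) . chr(11) . "b";

// \v matches any vertical whitespace, so it swallows the line feeds as well.
preg_match('/\v+/', $s, $anyVertical);
// \x0b matches only the vertical tab character.
preg_match('/\x0b+/', $s, $onlyVt);

echo strlen($anyVertical[0]), "\n"; // 4
echo strlen($onlyVt[0]), "\n";      // 2
```

So in the question's data the first run matched by \v+ can be the 10,10 pair rather than the 11,11 pair, which matches the misplaced replacement that was reported.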

How to remove < and > symbol?

I have the text #chatfun <chinu,25,M,123456> and I want #chatfun chinu,25,M,123456. How can I remove the < and > characters?
My code:
<?php
$text = '#chatfun <chinu,25,M,123456>';
echo strip_tags($text);
?>
With the above code I get blank text. How do I get the result I want?
Just use str_replace with the angle brackets in an array.
echo str_replace(array('>', '<'), '', $text);
Another option is to use regex with preg_replace
echo preg_replace("/[<>]+/", '', $text);
Have you tried str_replace?
$text = str_replace(array('<', '>'), '', '#chatfun <chinu,25,m,123456>');
That will replace the unwanted characters with nothing. Only caveat is that if the characters show up anywhere else, they, too, will be replaced.
<?php
$to_remove = array("<",">");
$text = '#chatfun <chinu,25,M,123456>';
echo str_replace($to_remove,"",$text);
?>
See the reference for str_replace.
You're getting blank text because the text between the < and > is considered part of a tag.
Use str_replace(array('<','>'),'',$text)
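A short side-by-side sketch of the two calls discussed above (variable names are mine):

```php
<?php
$text = '#chatfun <chinu,25,M,123456>';

// strip_tags() treats everything from "<" to ">" as a tag and drops it, contents included.
$stripped = strip_tags($text);

// str_replace() removes only the two bracket characters and keeps what is between them.
$clean = str_replace(array('<', '>'), '', $text);

var_dump($stripped);
echo $clean, "\n"; // #chatfun chinu,25,M,123456
```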

Making a preview of a long text

I'm working in PHP and I want to create a function that, given a text of arbitrary length and height, returns a restricted version of the same text with a maximum of 500 characters and 10 lines.
This is what I have so far:
function preview($str)
{
    $partialPreview = explode("\n", substr($str, 0, 500));
    $partialPreviewHeight = count($partialPreview);
    $finalPreview = "";
    // if it has more than 10 lines
    if ($partialPreviewHeight > 10) {
        for ($i = 0; $i < 10; $i++) {
            $finalPreview .= $partialPreview[$i];
        }
    } else {
        $finalPreview = substr($str, 0, 500);
    }
    return $finalPreview;
}
I have two questions:
Is using \n proper to detect new line feeds? I know that some systems use \n, others \r\n and others \r, but \n is the most common.
Sometimes, if there's an HTML entity like &quot; (quotation mark) at the end, it's left as &quot, and therefore it's not valid HTML. How can I prevent this?
First replace <br /> tags with <br />\n and </p><p> or </div><div> with </p>\n<p> and </div>\n<div> respectively.
Then use PHP's strip_tags function, which should yield plain text with a newline in every place a newline should be.
Then you could replace \r\n with \n for consistency. Only after that should you extract the desired length of text.
You may want to use word wrapping to achieve your 10-line goal. For word wrapping to work you need to define a number of characters per line; wordwrap then takes care of not breaking mid-word.
You may want to use html_entity_decode before using wordwrap, as @PeeHaa suggested.
Is using \n proper to detect new line feeds? I know that some systems use \n, others \r\n and others \r, but \n is the most common.
It depends where the data is coming from. Different operating systems have different line breaks.
Windows uses \r\n, *nix (including macOS) uses \n, and (very) old Macs used \r. If the data is coming from the web (e.g. a textarea) it will (or at least should) always be \r\n, because that's what the spec states user agents should do.
Sometimes, if there's an HTML entity like &quot; (quotation mark) at the end, it's left as &quot, and therefore it's not valid HTML. How can I prevent this?
Before cutting the text you may want to convert HTML entities back to normal text, using either htmlspecialchars_decode() or html_entity_decode() depending on your needs. That way you won't break entities (don't forget to encode the text again if needed).
Another option would be to only break the text on whitespace characters rather than a hard character limit. This way you will only have whole words in your "summary".
I've created a class which should deal with most issues. As I already stated when the data is coming from a textarea it will always be \r\n, but to be able to parse other linebreaks I came up with something like the following (untested):
class Preview
{
    protected $maxCharacters;
    protected $maxLines;
    protected $encoding;
    protected $lineBreaks;
    public function __construct($maxCharacters = 500, $maxLines = 10, $encoding = 'UTF-8', array $lineBreaks = array("\r\n", "\r", "\n"))
    {
        $this->maxCharacters = $maxCharacters;
        $this->maxLines = $maxLines;
        $this->encoding = $encoding;
        $this->lineBreaks = $lineBreaks;
    }
    public function makePreview($text)
    {
        $text = $this->normalizeLinebreaks($text);
        // decode first so that entities such as &quot; cannot be cut in half
        $text = html_entity_decode($text, ENT_QUOTES, $this->encoding);
        $text = $this->limitLines($text);
        if (mb_strlen($text, $this->encoding) > $this->maxCharacters) {
            $text = $this->limitCharacters($text);
        }
        // encode again so the preview is valid HTML
        return htmlentities($text, ENT_QUOTES, $this->encoding);
    }
    protected function normalizeLinebreaks($text)
    {
        // "\r\n" comes first in the list, so it is not turned into two "\n"
        return str_replace($this->lineBreaks, "\n", $text);
    }
    protected function limitLines($text)
    {
        $lines = explode("\n", $text);
        $limitedLines = array_slice($lines, 0, $this->maxLines);
        return implode("\n", $limitedLines);
    }
    protected function limitCharacters($text)
    {
        // mb_substr counts characters, matching the mb_strlen check above
        return mb_substr($text, 0, $this->maxCharacters, $this->encoding);
    }
}
$preview = new Preview();
echo $preview->makePreview('Some text which will be turned into a preview.');
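The other option mentioned above, cutting on whitespace instead of at a hard character limit, could be sketched roughly like this. truncateAtWord is a hypothetical helper name, and the sketch assumes UTF-8 input and the mbstring extension:

```php
<?php
// Hypothetical helper: returns at most $max characters, backing up to the last whitespace.
function truncateAtWord(string $text, int $max): string
{
    if (mb_strlen($text, 'UTF-8') <= $max) {
        return $text;
    }
    // Take one extra character so a word that ends exactly at the limit is kept whole.
    $cut = mb_substr($text, 0, $max + 1, 'UTF-8');
    // Drop the last run of whitespace together with any partial word after it.
    $trimmed = preg_replace('/\s+\S*$/u', '', $cut);
    // No whitespace at all (one very long word) or invalid UTF-8: fall back to a hard cut.
    if ($trimmed === null || $trimmed === $cut) {
        return mb_substr($cut, 0, $max, 'UTF-8');
    }
    return $trimmed;
}

echo truncateAtWord('The quick brown fox', 12), "\n"; // The quick
```

Run it on the decoded text (after html_entity_decode), so an entity can never be split either.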

php non latin to hex function

I have a website that's in win-1251 encoding and it needs to stay that way. But I also need to be able to echo a few links that contain non-Latin, non-Cyrillic characters like šžāņūī...
I need a function that converts this
"māja un man tā patīk"
to
"m&#257;ja un man t&#257; pat&#299;k"
and that does not touch HTML, so if there is <b> it needs to stay as <b>, not &lt;b&gt;
And please, no advice about the encoding and how wrong that is.
$str = "<b>Obāchan</b> おばあちゃん";
$str = preg_replace_callback('/./u', function ($matches) {
    $chr = $matches[0];
    if (strlen($chr) > 1) {
        $chr = mb_convert_encoding($chr, 'HTML-ENTITIES', 'UTF-8');
    }
    return $chr;
}, $str);
This expects the original $str to be UTF-8 encoded, i.e. your PHP file should be saved in UTF-8. It encodes all non-ASCII compatible code points to HTML entities. Since all HTML special characters are ASCII characters, they remain untouched. The resulting string is pure ASCII. Since the lower Win-1251 code points are ASCII compatible, the resulting string is also a valid Win-1251 string. The above $str converts to:
<b>Ob&#257;chan</b> &#12362;&#12400;&#12354;&#12385;&#12419;&#12435;
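Note that the 'HTML-ENTITIES' pseudo-encoding of mb_convert_encoding is deprecated as of PHP 8.2. A possible alternative, sketched here under the assumption that the input is valid UTF-8 and PHP 7.2+ is available, builds the numeric entity from the code point with mb_ord. encodeNonAscii is a hypothetical name:

```php
<?php
// Hypothetical helper: turns every non-ASCII character into a decimal numeric entity.
function encodeNonAscii(string $str): string
{
    $out = preg_replace_callback('/[^\x00-\x7F]/u', function ($m) {
        // mb_ord() returns the Unicode code point of the matched character.
        return '&#' . mb_ord($m[0], 'UTF-8') . ';';
    }, $str);
    // preg_replace_callback() returns null on invalid UTF-8; leave the input untouched then.
    return $out ?? $str;
}

echo encodeNonAscii("<b>Ob\u{0101}chan</b>"), "\n"; // <b>Ob&#257;chan</b>
```

As with the version above, ASCII characters (and therefore all HTML markup) pass through unchanged, and the result is pure ASCII.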
The main things you probably don't want to encode are <, > and &. Those are really the only special characters. So how about encoding everything first, and then just decoding <, > and &? I feel you should be fine.
This is untested:
$output =
htmlspecialchars_decode(
htmlentities($input, ENT_NOQUOTES, 'CP-1251')
);
let me know
What Evert suggests looks logical to me too! If you insist, this is a way to do it if there are only two letters that bother you. For more letters the script will not be as effective and needs to change.
<?PHP
function myConvert($str)
{
    $chars['&#257;'] = 'ā';
    $chars['&#299;'] = 'ī';
    foreach ($chars as $key => $value) {
        // replace in $str itself so that every substitution accumulates
        $str = str_replace($key, $value, $str);
    }
    echo $str;
}
myConvert("m&#257;ja un man t&#257; pat&#299;k");
?>
==================edited==============
For many characters maybe this one can help you:
<?PHP
function myConvert($str)
{
    $final = null;
    $parts = preg_split("/&#[0-9]*;/i", $str); // get all text parts
    preg_match_all("/&#[0-9]*;/i", $str, $delimiters); // get delimiters
    $delimiters[0][] = ''; // make arrays equal size
    foreach ($parts as $key => $value) {
        $final .= $value . mb_convert_encoding($delimiters[0][$key], "UTF-8", "HTML-ENTITIES");
    }
    return $final;
}
$fh = fopen("testFile.txt", 'w');
fwrite($fh, myConvert("m&#257;ja un man t&#257; pat&#299;k&#299;"));
fclose($fh);
?>
The desired output is written to the text file. This code, exactly as it is (not merged into some project), does what it claims to do: it converts codes like &#257; to the characters they represent.

Remove all text between <hr> and <embed> tag?

<hr>I want to remove this text.<embed src="stuffinhere.html"/>
I tried using regex but nothing works.
Thanks in advance.
P.S. I tried this: $str = preg_replace('#(<hr>).*?(<embed)#', '$1$2', $str)
You'll get a lot of advice to use an HTML parser for this kind of thing. You should do that.
The rest of this answer is for when you've decided that the HTML parser is too slow, doesn't handle ill-formed (i.e. standard in the wild) HTML, or is a pain in the ass to integrate into the system you don't control. I created the following small script
$str = '<hr>I want to remove this text.<embed src="stuffinhere.html"/>';
$str = preg_replace('#(<hr>).*?(<embed)#', '$1$2', $str);
var_dump($str);
//outputs
string(35) "<hr><embed src="stuffinhere.html"/>"
and it did remove the text, so I'd check your source documents and any other PHP code around your RegEx. You're not feeding preg_replace the string you think you are. My best guess is your source document has irregular case, or there's whitespace between the <hr /> and <embed>. Try the following regular expression instead.
$str = '<hr>I want to remove
this text.
<EMBED src="stuffinhere.html"/>';
$str = preg_replace('#(<hr>).*?(<embed)#si', '$1$2', $str);
var_dump($str);
//outputs
string(35) "<hr><EMBED src="stuffinhere.html"/>"
The "i" modifier says "make this search case insensitive". The "s" modifier says "the . character should also match newlines", which it does not do by default.
But use a proper parser if you can. Seriously.
I think the code is self-explanatory and pretty easy to understand since it does not use regex (and it might be faster)...
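$start='<hr>';
$end='<embed src="stuff...';
$str=' html here... ';
// Returns the text between the first $t1 and the next $t2 (case-insensitive),
// or false if either marker is missing.
function between($t1,$t2,$page) {
    $p1=stripos($page,$t1);
    if($p1===false) {
        return false;
    }
    $p2=stripos($page,$t2,$p1+strlen($t1));
    if($p2===false) {
        return false;
    }
    return substr($page,$p1+strlen($t1),$p2-$p1-strlen($t1));
}
$found=between($start,$end,$str);
// Stop once nothing is left between the markers; otherwise the loop never ends.
while($found!==false && $found!=='') {
    // str_ireplace matches the case-insensitive search done by stripos above
    $str=str_ireplace($start.$found.$end,$start.$end,$str);
    $found=between($start,$end,$str);
}
// do something with $str here...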
$text = '<hr>I want to remove this text.<embed src="stuffinhere.html"/>';
$text = preg_replace('#(<hr>).*?(<embed.*?>)#', '$1$2', $text);
echo $text;
If you want to hard code src in embed tag:
$text = '<hr>I want to remove this text.<embed src="stuffinhere.html"/>';
$text = preg_replace('#(<hr>).*?(<embed src="stuffinhere.html"/>)#', '$1$2', $text);
echo $text;
