How to convert a Unicode string to HTML entities? (HEX not decimal)
For example, convert Français to Fran&#xE7;ais.
For the missing hex-encoding in the related question:
$output = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) {
list($utf8) = $match;
$binary = mb_convert_encoding($utf8, 'UTF-32BE', 'UTF-8');
$entity = vsprintf('&#x%X;', unpack('N', $binary));
return $entity;
}, $input);
This is similar to @Baba's answer: it converts to UTF-32BE, then uses unpack and vsprintf for the formatting.
If you prefer iconv over mb_convert_encoding, it's similar:
$output = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) {
list($utf8) = $match;
$binary = iconv('UTF-8', 'UTF-32BE', $utf8);
$entity = vsprintf('&#x%X;', unpack('N', $binary));
return $entity;
}, $input);
I find this string manipulation a bit clearer than the one in Get hexcode of html entities.
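If PHP 7.2+ is available, the same conversion can probably be written without the UTF-32 round trip. This is only a sketch: to_hex_entities is a made-up name, and it assumes mbstring is loaded and the input is valid UTF-8.

```php
<?php
// Hypothetical helper; assumes PHP 7.2+ (for mb_ord) and valid UTF-8 input.
function to_hex_entities(string $input): ?string
{
    // Every code point above ASCII becomes &#xHEX;.
    // preg_replace_callback returns null when $input is not valid UTF-8.
    return preg_replace_callback(
        '/[\x{80}-\x{10FFFF}]/u',
        function (array $match): string {
            return sprintf('&#x%X;', mb_ord($match[0], 'UTF-8'));
        },
        $input
    );
}

echo to_hex_entities("Fran\u{E7}ais"), "\n"; // Fran&#xE7;ais
```

The null return for broken input is worth checking in the caller, since an invalid byte sequence makes the whole replacement fail rather than just the bad character.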
Your string looks like UCS-4 encoding; you can try:
$first = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($m) {
$char = current($m);
$utf = iconv('UTF-8', 'UCS-4', $char);
return sprintf("&#x%s;", ltrim(strtoupper(bin2hex($utf)), "0"));
}, $string);
Output
string 'Fran&#xE7;ais' (length=13)
Firstly, when I faced this problem recently, I solved it by making sure my code files, DB connection, and DB tables were all UTF-8. Then simply echoing the text works. If you must escape the output from the DB, use htmlspecialchars() and not htmlentities(), so that the UTF-8 symbols are left alone and no attempt is made to escape them.
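To illustrate the difference between the two functions, here is a small sketch. The sample string is made up, and the non-ASCII character is written as an escape so the file encoding does not matter.

```php
<?php
$text = "Fran\u{E7}ais <b>&</b>"; // "Français <b>&</b>"

// Only the HTML-significant characters are escaped; ç stays as UTF-8.
$special = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

// Every character that has a named entity is escaped, including ç.
$entities = htmlentities($text, ENT_QUOTES, 'UTF-8');

echo $special, "\n";  // Français &lt;b&gt;&amp;&lt;/b&gt;
echo $entities, "\n"; // Fran&ccedil;ais &lt;b&gt;&amp;&lt;/b&gt;
```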
I would like to document an alternative solution, because it solved a similar problem for me.
I was using PHP's utf8_encode() to escape 'special' characters.
I wanted to convert them into HTML entities for display. I wrote this code because I wanted to avoid iconv and similar functions as far as possible, since not all environments necessarily have them (do correct me if that is not so!):
function unicode2html($string) {
return preg_replace('/\\\\u([0-9a-z]{4})/', '&#x$1;', $string);
}
$foo = 'This is my test string \u03b50';
echo unicode2html($foo);
Hope this helps somebody in need :-)
See How to get the character from unicode code point in PHP? for some code that allows you to do the following :
Example use :
echo "Get string from numeric DEC value\n";
var_dump(mb_chr(50319, 'UCS-4BE'));
var_dump(mb_chr(271));
echo "\nGet string from numeric HEX value\n";
var_dump(mb_chr(0xC48F, 'UCS-4BE'));
var_dump(mb_chr(0x010F));
echo "\nGet numeric value of character as DEC string\n";
var_dump(mb_ord('ď', 'UCS-4BE'));
var_dump(mb_ord('ď'));
echo "\nGet numeric value of character as HEX string\n";
var_dump(dechex(mb_ord('ď', 'UCS-4BE')));
var_dump(dechex(mb_ord('ď')));
echo "\nEncode / decode to DEC based HTML entities\n";
var_dump(mb_htmlentities('tchüß', false));
var_dump(mb_html_entity_decode('tch&#252;&#223;'));
echo "\nEncode / decode to HEX based HTML entities\n";
var_dump(mb_htmlentities('tchüß'));
var_dump(mb_html_entity_decode('tch&#xFC;&#xDF;'));
echo "\nUse JSON encoding / decoding\n";
var_dump(codepoint_encode("tchüß"));
var_dump(codepoint_decode('tch\u00fc\u00df'));
Output :
Get string from numeric DEC value
string(4) "ď"
string(2) "ď"
Get string from numeric HEX value
string(4) "ď"
string(2) "ď"
Get numeric value of character as DEC int
int(50319)
int(271)
Get numeric value of character as HEX string
string(4) "c48f"
string(3) "10f"
Encode / decode to DEC based HTML entities
string(15) "tch&#252;&#223;"
string(7) "tchüß"
Encode / decode to HEX based HTML entities
string(15) "tch&#xFC;&#xDF;"
string(7) "tchüß"
Use JSON encoding / decoding
string(15) "tch\u00fc\u00df"
string(7) "tchüß"
You can also use mb_encode_numericentity which is supported by PHP 4.0.6+ (link to PHP doc).
function unicode2html($value) {
return mb_encode_numericentity($value, [
// start codepoint
// | end codepoint
// | | offset
// | | | mask
0x0000, 0x001F, 0x0000, 0xFFFF,
0x0021, 0x002C, 0x0000, 0xFFFF,
0x002E, 0x002F, 0x0000, 0xFFFF,
0x003C, 0x003C, 0x0000, 0xFFFF,
0x003E, 0x003E, 0x0000, 0xFFFF,
0x0060, 0x0060, 0x0000, 0xFFFF,
0x0080, 0xFFFF, 0x0000, 0xFFFF
], 'UTF-8', true);
}
In this way it is also possible to indicate which ranges of characters to convert into hexadecimal entities and which ones to preserve as characters.
Usage example:
$input = array(
'"Meno più, PIÙ o meno"',
'\'ÀÌÙÒLÈ PERCHÉ perché è sempre così non si sà\'',
'<script>alert("XSS");</script>',
'"`'
);
$output = array();
foreach ($input as $str)
$output[] = unicode2html($str);
Result:
$output = array(
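'&#x22;Meno pi&#xF9;&#x2C; PI&#xD9; o meno&#x22;',
'&#x27;&#xC0;&#xCC;&#xD9;&#xD2;L&#xC8; PERCH&#xC9; perch&#xE9; &#xE8; sempre cos&#xEC; non si s&#xE0;&#x27;',
'&#x3C;script&#x3E;alert&#x28;&#x22;XSS&#x22;&#x29;;&#x3C;&#x2F;script&#x3E;',
'&#x22;&#x60;'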
);
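Note that the map above stops at 0xFFFF, so code points outside the Basic Multilingual Plane (emoji, for instance) are left as they are. If the goal is just "everything non-ASCII becomes a hex entity", a single wider range should do. A minimal sketch, assuming mbstring and UTF-8 input; all_non_ascii_to_hex is a made-up name:

```php
<?php
// Hypothetical helper; assumes the mbstring extension and UTF-8 input.
function all_non_ascii_to_hex(string $value): string
{
    // One range: U+0080..U+10FFFF, offset 0, mask wide enough for 21-bit code points.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($value, $map, 'UTF-8', true);
}

// "più" followed by an emoji, written with escapes.
$encoded = all_non_ascii_to_hex("pi\u{F9} \u{1F600}");
echo $encoded, "\n";
```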
This solution is like @hakre's (Nov 8, 2012 at 0:35), but it produces HTML entity names where they exist:
$output = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) {
list($utf8) = $match;
$char = htmlentities($utf8, ENT_HTML5 | ENT_IGNORE);
if ($char[0]!=='&' || (strlen($char)<2)) {
$binary = mb_convert_encoding($utf8, 'UTF-32BE', 'UTF-8');
$char = vsprintf('&#x%X;', unpack('N', $binary));
} // (else $char is "&entity;", which is better)
return $char;
}, $input);
$input = "Ob\xC3\xB3z w\xC4\x99drowny Ko\xC5\x82a";
// => $output: "Ob&oacute;z w&eogon;drowny Ko&lstrok;a"
// while the code from @hakre and @Baba gives:
// => $output: "Ob&#xF3;z w&#x119;drowny Ko&#x142;a"
But there is always a problem when invalid UTF-8 is encountered, e.g.:
$input = "Ob\xC3\xB3z w\xC4\x99drowny Ko\xC5\x82a - ok\xB3adka";
// means "Obóz wędrowny Koła - - okładka" in html ("\xB3" is ISO-8859-2/windows-1250 "ł")
but here
// => $output: (empty)
and the same happens with @hakre's code... :(
It was hard to find the cause. This is the only solution I know (does anyone know a simpler one?):
function utf_entities($input) {
$output = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) {
list($utf8) = $match;
$char = htmlentities($utf8, ENT_HTML5 | ENT_IGNORE);
if ($char[0]!=='&' || (strlen($char)<2)) {
$binary = mb_convert_encoding($utf8, 'UTF-32BE', 'UTF-8');
$char = vsprintf('&#x%X;', unpack('N', $binary));
} // (else $char is "&entity;", which is better)
return $char;
}, $input);
if (empty($output) && (!empty($input))) { // Trouble... Maybe not UTF-8 code inside UTF-8 string...
/* Processing string against not UTF-8 chars... */
$output = ''; // New - repaired
for ($i=0; $i<strlen($input); $i++) {
if (($char = $input[$i])<"\x80") {
$output .= $char;
} else { // maybe UTF-8 (0b ..110xx..) or not UTF-8 (i.e. 0b11111111 etc.)
$j = 0; // how many chars more in UTF-8
$char = ord($char);
do { // checking first UTF-8 code char bits
$char = ($char << 1) % 0x100;
$j++;
} while (($j<4 /* 6 before RFC 3629 */)&& (($char & 0b11000000) === 0b11000000));
$k = $i+1;
if ($j<4 /* 6 before RFC 3629 */ && (($char & 0b11000000) === 0b10000000)) { // maybe UTF-8...
for ($k=$i+$j; $k>$i && ((ord($input[$k]) & 0b11000000) === 0b10000000); $k--) ; // ...checking next bytes for valid UTF-8 codes
}
if ($k>$i || ($j>=4 /* 6 before RFC 3629 */) || (($char & 0b11000000) !== 0b10000000)) { // Not UTF-8
$output .= '&#x'.dechex(ord($input[$i])).';'; // "&#xXX;"
} else { // UTF=8 !
$output .= substr($input, $i, 1+$j);
$i += $j;
}
}
}
return utf_entities($output); // recursively after repairing
}
return $output;
}
I.e.:
echo utf_entities("o\xC5\x82a - k\xB3a"); // oła - k³a - UTF-8 + fixed
echo utf_entities("o".chr(0b11111101).chr(0b10111000).chr(0b10111000).chr(0b10111000).chr(0b10111000).chr(0b10111000)."a");
// oñ¸¸¸¸¸a - invalid UTF-8 (6-bytes UTF-8 valid before RFC 3629), fixed
echo utf_entities("o".chr(0b11110001).chr(0b10111000).chr(0b10111000).chr(0b10111000)."a - k\xB3a");
// oa - k³a - UTF-8 + fixed ("\xB3")
echo utf_entities("o".chr(0b11110001).chr(0b10111000).chr(0b10111000).chr(0b10111000)."a");
// oa - valid UTF-8!
echo utf_entities("o".chr(0b11110001).'a'.chr(0b10111000).chr(0b10111000)."a");
// oña¸¸a - invalid UTF-8, fixed
I have trouble with a PHP rainbow text function.
When I run this function, the output text does not support Vietnamese.
For example: "tôi yêu em".
<?php
function rainbow($text)
{
/*** initialize the return string ***/
$ret = '';
/*** an array of colors ***/
$colors = array(
'ff00ff', 'ff0099', 'ff0033', 'ff3300',
'ff9900', 'ffff00', '99ff00', '33ff00',
'00ff33', '00ff99', '00ffff', '0099ff',
'0033ff', '3300ff', '9900ff'
);
/*** a counter ***/
$i = 0;
/*** get the length of the text ***/
$textlength = strlen($text);
/*** loop over the text ***/
while($i<=$textlength)
{
/*** loop through the colors ***/
foreach($colors as $value)
{
if ($text[$i] != "")
{
$ret .= '<span style="color:#'.$value.';">'.$text[$i]."</span>";
}
$i++;
}
}
/*** return the highlighted string ***/
$ret = html_entity_decode($ret, ENT_QUOTES, 'UTF-8');
return $ret;
}
echo rainbow('tôi yêu em');
?>
You're going to get uninitialized string offset notices in any language with that function due to the way you're iterating over the string bytes + colors. Better to access the colors via an InfiniteIterator which will just loop around and around.
Your specific problem with Vietnamese is that some of those characters are composed of multiple bytes. Functions like strlen() and accessing offsets via array brackets like $text[$i] are not multi-byte safe - they work on individual bytes rather than characters.
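A quick way to see the byte/character mismatch (a small sketch; the ô is written as an escape so the file encoding does not matter):

```php
<?php
$text = "t\u{F4}i"; // "tôi"

var_dump(strlen($text));             // int(4): ô takes two bytes in UTF-8
var_dump(mb_strlen($text, 'UTF-8')); // int(3)
var_dump(bin2hex($text[1]));         // string(2) "c3": only the first byte of ô
```

Wrapping that lone 0xC3 byte in its own span is what produces the broken output.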
While it might be tempting to just use mb_strlen() in place of strlen() to get the number of characters rather than the number of bytes, and mb_substr() rather than $text[$i] to get a character rather than a byte, you'll still end up breaking up graphemes like è (which is here encoded as e followed by a combining grave accent.) A solution is to break up the string into an array with a regular expression.
Example:
function rainbow($text)
{
$text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
$return = '';
$colors = new InfiniteIterator(
new ArrayIterator(
['ff00ff', 'ff0099', 'ff0033', 'ff3300',
'ff9900', 'ffff00', '99ff00', '33ff00',
'00ff33', '00ff99', '00ffff', '0099ff',
'0033ff', '3300ff', '9900ff']
)
);
$colors->rewind();
// Match any codepoint along with any combining marks.
preg_match_all('/.\pM*+/su', $text, $matches);
foreach ($matches[0] as $char)
{
if (preg_match('/^\pZ$/u', $char)) {
// No need to color whitespace or invisible separators.
$return .= $char;
} else {
$return .= "<span style='color:#{$colors->current()};'>$char</span>";
$colors->next();
}
}
return $return;
}
echo rainbow('tôi yêu em evè foo baz');
Output:
<span style='color:#ff00ff;'>t</span><span style='color:#ff0099;'>ô</span><span style='color:#ff0033;'>i</span> <span style='color:#ff3300;'>y</span><span style='color:#ff9900;'>ê</span><span style='color:#ffff00;'>u</span> <span style='color:#99ff00;'>e</span><span style='color:#33ff00;'>m</span> <span style='color:#00ff33;'>e</span><span style='color:#00ff99;'>v</span><span style='color:#00ffff;'>è</span> <span style='color:#0099ff;'>f</span><span style='color:#0033ff;'>o</span><span style='color:#3300ff;'>o</span> <span style='color:#9900ff;'>b</span><span style='color:#ff00ff;'>a</span><span style='color:#ff0099;'>z</span>
We have a database of Canadian addresses, all in caps. The client requested that we transform them to lower case, except for the first letter and the letter after a '-'.
So I made this function, but I'm having a problem with French accented letters.
With the file and charset as ISO-8859-1 it works fine, but when I try to make it UTF-8 it doesn't work anymore.
Example of input: 'damien-claude élanger'
Output: Damien-Claude élanger
The é in UTF-8 becomes �
function cap_letter($string) {
$lower = str_split("àáâçèéêë");
$caps = str_split("ÀÁÂÇÈÉÊË");
$letters = str_split(strtolower($string));
foreach($letters as $code => $letter) {
if($letter === '-' || $letter === ' ') {
$position = array_search($letters[$code+1],$lower);
if($position !== false) {
// test
echo $letters[$code+1] . ' == ' . $caps[$position] ;
$letters[$code+1] = $caps[$position];
}
else {
$letters[$code+1] = mb_strtoupper($letters[$code+1]);
}
}
}
//return ucwords(implode($letters)) ;
return implode($letters) ;
}
The other solution I have in mind is ucwords(strtolower($str)): since all the addresses are already in caps, the É will stay É even after applying strtolower.
But then I'll have the problem of É inside a word, e.g. XXXÉXXÉ.
Try mb_* string functions for multibyte characters.
echo mb_convert_case(mb_strtolower($str), MB_CASE_TITLE, "UTF-8");
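If you would rather not depend on how MB_CASE_TITLE decides where a word starts, one option is to lowercase everything and then uppercase each letter that does not follow another letter. This is only a sketch; title_case_words is a made-up name and it assumes UTF-8 input.

```php
<?php
// Hypothetical helper; assumes mbstring and valid UTF-8 input.
function title_case_words(string $s): ?string
{
    $lower = mb_strtolower($s, 'UTF-8');
    // Uppercase any letter that is not preceded by a letter or combining mark,
    // i.e. the first letter of the string and letters after spaces, '-', etc.
    return preg_replace_callback(
        '/(?<![\p{L}\p{M}])\p{L}/u',
        function (array $m): string {
            return mb_strtoupper($m[0], 'UTF-8');
        },
        $lower
    );
}

echo title_case_words("DAMIEN-CLAUDE \u{C9}LANGER"), "\n"; // Damien-Claude Élanger
```

Be aware that this also capitalizes after apostrophes, which may or may not be what you want for addresses.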
I had the same problem in Spanish, and I wrote this function:
function capitalize($string)
{
// Convert to UTF-8 first when the input is not already valid UTF-8
// (such input is assumed to be ISO-8859-1).
if (!mb_check_encoding($string, 'UTF-8')) {
$string = mb_convert_encoding($string, 'UTF-8', 'ISO-8859-1');
}
return mb_convert_case($string, MB_CASE_TITLE, 'UTF-8');
}
I have the string "&euro;".
I want to convert it to hex to get the value "\u20AC" so that I can send it to Flash.
The same goes for all currency symbols:
&pound; -> \u00A3
&dollar; -> \u0024
etc.
First, note that &dollar; is not a known entity in HTML 4.01. It is, however, in HTML 5, and, in PHP 5.4, you can call html_entity_decode with ENT_QUOTES | ENT_HTML5 to decode it.
You have to decode the entity and only then convert it:
//assumes $str is in UTF-8 (or ASCII)
function foo($str) {
$dec = html_entity_decode($str, ENT_QUOTES, "UTF-8");
//convert to UTF-16BE
$enc = mb_convert_encoding($dec, "UTF-16BE", "UTF-8");
$out = "";
foreach (str_split($enc, 2) as $f) {
$out .= "\\u" . sprintf("%04X", ord($f[0]) << 8 | ord($f[1]));
}
return $out;
}
If you want to replace only the entities, you can use preg_replace_callback to match the entities and then use foo as a callback.
function repl_only_ent($str) {
return preg_replace_callback('/&[^;]+;/',
function($m) { return foo($m[0]); },
$str);
}
echo repl_only_ent("&euro;foobar &acute;");
gives:
\u20ACfoobar \u00B4
You might try the following function for string to hex conversion:
function strToHex($string) {
$hex='';
for ($i=0; $i < strlen($string); $i++) {
$hex .= dechex(ord($string[$i]));
}
return $hex;
}
From Greg Winiarski which is the fourth hit on Google.
In combination with html_entity_decode(). So something like this:
$currency_symbol = "&euro;";
$hex = strToHex(html_entity_decode($currency_symbol));
This code is untested and therefore may require further modification to return the exact result you require
I want to truncate some text (loaded from a database or text file), but it contains HTML so as a result the tags are included and less text will be returned. This can then result in tags not being closed, or being partially closed (so Tidy may not work properly and there is still less content). How can I truncate based on the text (and probably stopping when you get to a table as that could cause more complex issues).
substr("Hello, my <strong>name</strong> is <em>Sam</em>. I&acute;m a web developer.",0,26)."..."
Would result in:
Hello, my <strong>name</st...
What I would want is:
Hello, my <strong>name</strong> is <em>Sam</em>. I&acute;m...
How can I do this?
While my question is for how to do it in PHP, it would be good to know how to do it in C#... either should be OK as I think I would be able to port the method over (unless it is a built in method).
Also note that I have included an HTML entity, &acute;, which would have to be considered as a single character (rather than 7 characters as in this example).
strip_tags is a fallback, but I would lose formatting and links and it would still have the problem with HTML entities.
Assuming you are using valid XHTML, it's simple to parse the HTML and make sure tags are handled properly. You simply need to track which tags have been opened so far, and make sure to close them again "on your way out".
<?php
header('Content-type: text/plain; charset=utf-8');
function printTruncated($maxLength, $html, $isUtf8=true)
{
$printedLength = 0;
$position = 0;
$tags = array();
// For UTF-8, we need to count multibyte sequences as one character.
$re = $isUtf8
? '{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;|[\x80-\xFF][\x80-\xBF]*}'
: '{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;}';
while ($printedLength < $maxLength && preg_match($re, $html, $match, PREG_OFFSET_CAPTURE, $position))
{
list($tag, $tagPosition) = $match[0];
// Print text leading up to the tag.
$str = substr($html, $position, $tagPosition - $position);
if ($printedLength + strlen($str) > $maxLength)
{
print(substr($str, 0, $maxLength - $printedLength));
$printedLength = $maxLength;
break;
}
print($str);
$printedLength += strlen($str);
if ($printedLength >= $maxLength) break;
if ($tag[0] == '&' || ord($tag) >= 0x80)
{
// Pass the entity or UTF-8 multibyte sequence through unchanged.
print($tag);
$printedLength++;
}
else
{
// Handle the tag.
$tagName = $match[1][0];
if ($tag[1] == '/')
{
// This is a closing tag.
$openingTag = array_pop($tags);
assert($openingTag == $tagName); // check that tags are properly nested.
print($tag);
}
else if ($tag[strlen($tag) - 2] == '/')
{
// Self-closing tag.
print($tag);
}
else
{
// Opening tag.
print($tag);
$tags[] = $tagName;
}
}
// Continue after the tag.
$position = $tagPosition + strlen($tag);
}
// Print any remaining text.
if ($printedLength < $maxLength && $position < strlen($html))
print(substr($html, $position, $maxLength - $printedLength));
// Close any open tags.
while (!empty($tags))
printf('</%s>', array_pop($tags));
}
printTruncated(10, '<b>&lt;Hello&gt;</b> <img src="world.png" alt="" /> world!'); print("\n");
printTruncated(10, '<table><tr><td>Heck, </td><td>throw</td></tr><tr><td>in a</td><td>table</td></tr></table>'); print("\n");
printTruncated(10, "<em><b>Hello</b>w\xC3\xB8rld!</em>"); print("\n");
Encoding note: The above code assumes the XHTML is UTF-8 encoded. ASCII-compatible single-byte encodings (such as Latin-1) are also supported, just pass false as the third argument. Other multibyte encodings are not supported, though you may hack in support by using mb_convert_encoding to convert to UTF-8 before calling the function, then converting back again in every print statement.
(You should always be using UTF-8, though.)
Edit: Updated to handle character entities and UTF-8. Fixed bug where the function would print one character too many, if that character was a character entity.
I've written a function that truncates HTML just as you suggest, but instead of printing it out it keeps everything in a string variable. It handles HTML entities as well.
/**
* function to truncate and then clean up end of the HTML,
* truncates by counting characters outside of HTML tags
*
* @author alex lockwood, alex dot lockwood at websightdesign
*
* @param string $str the string to truncate
* @param int $len the number of characters
* @param string $end the end string for truncation
* @return string $truncated_html
*
* **/
public static function truncateHTML($str, $len, $end = '…'){
//find all tags
$tagPattern = '/(<\/?)([\w]*)(\s*[^>]*)>?|&[\w#]+;/i'; //match html tags and entities
preg_match_all($tagPattern, $str, $matches, PREG_OFFSET_CAPTURE | PREG_SET_ORDER );
//WSDDebug::dump($matches); exit;
$i = 0;
$openTags = array(); // tags opened so far and not yet closed
$warnings = array();
//loop through each found tag that is within the $len, add those characters to the len,
//also track open and closed tags
// $matches[$i][0] = the whole tag string --the only applicable field for html enitities
// IF its not matching an &htmlentity; the following apply
// $matches[$i][1] = the start of the tag either '<' or '</'
// $matches[$i][2] = the tag name
// $matches[$i][3] = the end of the tag
//$matches[$i][$j][0] = the string
//$matches[$i][$j][1] = the string offset
while(!empty($matches[$i]) && $matches[$i][0][1] < $len){
$len = $len + strlen($matches[$i][0][0]);
if(substr($matches[$i][0][0],0,1) == '&' )
$len = $len-1;
//if $matches[$i][2] is undefined then its an html entity, want to ignore those for tag counting
//ignore empty/singleton tags for tag counting
if(!empty($matches[$i][2][0]) && !in_array($matches[$i][2][0],array('br','img','hr', 'input', 'param', 'link'))){
//double check
if(substr($matches[$i][3][0],-1) !='/' && substr($matches[$i][1][0],-1) !='/')
$openTags[] = $matches[$i][2][0];
elseif(end($openTags) == $matches[$i][2][0]){
array_pop($openTags);
}else{
$warnings[] = "html has some tags mismatched in it: $str";
}
}
$i++;
}
$closeTagString = '';
if (!empty($openTags)){
$openTags = array_reverse($openTags);
foreach ($openTags as $t){
$closeTagString .="</".$t . ">";
}
}
if(strlen($str)>$len){
// Finds the last space from the string new length
$lastWord = strpos($str, ' ', $len);
if ($lastWord) {
//truncate with new len last word
$str = substr($str, 0, $lastWord);
//finds last character
$last_character = (substr($str, -1, 1));
//add the end text
$truncated_html = ($last_character == '.' ? $str : ($last_character == ',' ? substr($str, 0, -1) : $str) . $end);
}
//restore any open tags
$truncated_html .= $closeTagString;
}else
$truncated_html = $str;
return $truncated_html;
}
100% accurate, but pretty difficult approach:
Iterate characters using DOM
Use DOM methods to remove remaining elements
Serialize the DOM
Easy brute-force approach:
Split string into tags (not elements) and text fragments using preg_split('/(<tag>)/') with PREG_SPLIT_DELIM_CAPTURE.
Measure text length you want (it'll be every second element from split, you might use html_entity_decode() to help measure accurately)
Cut the string (trim &[^\s;]+$ at the end to get rid of possibly chopped entity)
Fix it with HTML Tidy
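The first three brute-force steps could look roughly like the sketch below. cut_html is a made-up name, and instead of trimming a chopped entity with a regex it decodes each text fragment, cuts it, and re-escapes it, which is a slightly different technique from the one listed. Tags may be left unclosed; the Tidy step is not shown.

```php
<?php
// Hypothetical helper: returns $html cut after $max visible characters.
// Assumes UTF-8 input and the mbstring extension.
function cut_html(string $html, int $max): string
{
    // Even-indexed parts are text fragments, odd-indexed parts are tags.
    $parts = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    $out = '';
    $used = 0;
    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            // A tag: copy it through, it does not count towards the length.
            $out .= $part;
            continue;
        }
        // Decode so that an entity such as &acute; counts as one character.
        $text = html_entity_decode($part, ENT_QUOTES | ENT_HTML5, 'UTF-8');
        $len = mb_strlen($text, 'UTF-8');
        if ($used + $len <= $max) {
            $out .= $part;
            $used += $len;
            continue;
        }
        // This fragment crosses the limit: cut the decoded text and re-escape it.
        $cut = mb_substr($text, 0, $max - $used, 'UTF-8');
        return $out . htmlspecialchars($cut, ENT_QUOTES, 'UTF-8');
    }
    return $out;
}

echo cut_html('Hello, my <strong>name</strong> is', 12), "\n"; // Hello, my <strong>na
```

The result would then go through Tidy (or a DOM round trip) to close the dangling <strong>.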
I used a nice function found at http://alanwhipple.com/2011/05/25/php-truncate-string-preserving-html-tags-words, apparently taken from CakePHP
The following is a simple state-machine parser which handles your test case successfully. It fails on nested tags, though, as it doesn't track the tags themselves. It also chokes on entities within HTML tags (e.g. in an href attribute of an <a> tag). So it cannot be considered a 100% solution to this problem, but because it's easy to understand it could be the basis for a more advanced function.
function substr_html($string, $length)
{
$count = 0;
/*
* $state = 0 - normal text
* $state = 1 - in HTML tag
* $state = 2 - in HTML entity
*/
$state = 0;
for ($i = 0; $i < strlen($string); $i++) {
$char = $string[$i];
if ($char == '<') {
$state = 1;
} else if ($char == '&') {
$state = 2;
$count++;
} else if ($char == ';') {
$state = 0;
} else if ($char == '>') {
$state = 0;
} else if ($state === 0) {
$count++;
}
if ($count === $length) {
return substr($string, 0, $i + 1);
}
}
return $string;
}
you can use tidy as well:
function truncate_html($html, $max_length) {
return tidy_repair_string(substr($html, 0, $max_length),
array('wrap' => 0, 'show-body-only' => TRUE), 'utf8');
}
You could use DOMDocument in this case, with a nasty regex hack; the worst that would happen is a warning if there's a broken tag:
$dom = new DOMDocument();
$dom->loadHTML(substr("Hello, my <strong>name</strong> is <em>Sam</em>. I&acute;m a web developer.",0,26));
$html = preg_replace("/\<\/?(body|html|p)>/", "", $dom->saveHTML());
echo $html;
Should give output : Hello, my <strong>**name**</strong>.
I've made light changes to Søren Løvborg's printTruncated function, making it UTF-8 compatible:
/* Truncate HTML, close opened tags
*
* @param int, maxlength of the string
* @param string, html
* @return $html
*/
function html_truncate($maxLength, $html){
mb_internal_encoding("UTF-8");
$printedLength = 0;
$position = 0;
$tags = array();
ob_start();
while ($printedLength < $maxLength && preg_match('{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;}', $html, $match, PREG_OFFSET_CAPTURE, $position)){
list($tag, $tagPosition) = $match[0];
// Print text leading up to the tag.
$str = mb_strcut($html, $position, $tagPosition - $position);
if ($printedLength + mb_strlen($str) > $maxLength){
print(mb_strcut($str, 0, $maxLength - $printedLength));
$printedLength = $maxLength;
break;
}
print($str);
$printedLength += mb_strlen($str);
if ($tag[0] == '&'){
// Handle the entity.
print($tag);
$printedLength++;
}
else{
// Handle the tag.
$tagName = $match[1][0];
if ($tag[1] == '/'){
// This is a closing tag.
$openingTag = array_pop($tags);
assert($openingTag == $tagName); // check that tags are properly nested.
print($tag);
}
else if ($tag[mb_strlen($tag) - 2] == '/'){
// Self-closing tag.
print($tag);
}
else{
// Opening tag.
print($tag);
$tags[] = $tagName;
}
}
// Continue after the tag.
$position = $tagPosition + mb_strlen($tag);
}
// Print any remaining text.
if ($printedLength < $maxLength && $position < mb_strlen($html))
print(mb_strcut($html, $position, $maxLength - $printedLength));
// Close any open tags.
while (!empty($tags))
printf('</%s>', array_pop($tags));
$bufferOuput = ob_get_contents();
ob_end_clean();
$html = $bufferOuput;
return $html;
}
Bounce added multi-byte character support to Søren Løvborg's solution - I've added:
support for unpaired HTML tags (e.g. <hr>, <br>, <col> etc. don't get closed; in HTML a '/' is not required at the end of these, though it is in XHTML),
customisable truncation indicator (defaults to &hellip;, i.e. …),
return as a string without using output buffer, and
unit tests with 100% coverage.
All this at Pastie.
Here is another light change to Søren Løvborg's printTruncated function, making it UTF-8 compatible (it needs mbstring) and making it return a string instead of printing one. I think that's more useful.
My code doesn't use output buffering like Bounce's variant, just one more variable.
UPD: to make it work properly with UTF-8 characters in tag attributes you need the mb_preg_match function, listed below.
Many thanks to Søren Løvborg for that function, it's very good.
/* Truncate HTML, close opened tags
*
* @param int, maxlength of the string
* @param string, html
* @return $html
*/
function htmlTruncate($maxLength, $html)
{
mb_internal_encoding("UTF-8");
$printedLength = 0;
$position = 0;
$tags = array();
$out = "";
while ($printedLength < $maxLength && mb_preg_match('{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;}', $html, $match, PREG_OFFSET_CAPTURE, $position))
{
list($tag, $tagPosition) = $match[0];
// Print text leading up to the tag.
$str = mb_substr($html, $position, $tagPosition - $position);
if ($printedLength + mb_strlen($str) > $maxLength)
{
$out .= mb_substr($str, 0, $maxLength - $printedLength);
$printedLength = $maxLength;
break;
}
$out .= $str;
$printedLength += mb_strlen($str);
if ($tag[0] == '&')
{
// Handle the entity.
$out .= $tag;
$printedLength++;
}
else
{
// Handle the tag.
$tagName = $match[1][0];
if ($tag[1] == '/')
{
// This is a closing tag.
$openingTag = array_pop($tags);
assert($openingTag == $tagName); // check that tags are properly nested.
$out .= $tag;
}
else if ($tag[mb_strlen($tag) - 2] == '/')
{
// Self-closing tag.
$out .= $tag;
}
else
{
// Opening tag.
$out .= $tag;
$tags[] = $tagName;
}
}
// Continue after the tag.
$position = $tagPosition + mb_strlen($tag);
}
// Print any remaining text.
if ($printedLength < $maxLength && $position < mb_strlen($html))
$out .= mb_substr($html, $position, $maxLength - $printedLength);
// Close any open tags.
while (!empty($tags))
$out .= sprintf('</%s>', array_pop($tags));
return $out;
}
function mb_preg_match(
$ps_pattern,
$ps_subject,
&$pa_matches,
$pn_flags = 0,
$pn_offset = 0,
$ps_encoding = NULL
) {
// WARNING! - All this function does is to correct offsets, nothing else:
// (code is independent of PREG_PATTERN_ORDER / PREG_SET_ORDER)
if (is_null($ps_encoding)) $ps_encoding = mb_internal_encoding();
$pn_offset = strlen(mb_substr($ps_subject, 0, $pn_offset, $ps_encoding));
$ret = preg_match($ps_pattern, $ps_subject, $pa_matches, $pn_flags, $pn_offset);
if ($ret && ($pn_flags & PREG_OFFSET_CAPTURE))
foreach($pa_matches as &$ha_match) {
$ha_match[1] = mb_strlen(substr($ps_subject, 0, $ha_match[1]), $ps_encoding);
}
return $ret;
}
Use the function truncateHTML() from:
https://github.com/jlgrall/truncateHTML
Example: truncate after 9 characters including the ellipsis:
truncateHTML(9, "<p><b>A</b> red ball.</p>", ['wholeWord' => false]);
// => "<p><b>A</b> red ba…</p>"
Features: UTF-8, configurable ellipsis, include/exclude length of ellipsis, self-closing tags, collapsing spaces, invisible elements (<head>, <script>, <noscript>, <style>, <!-- comments -->), HTML &entities;, truncating at last whole word (with option to still truncate very long words), PHP 5.6 and 7.0+, 240+ unit tests, returns a string (doesn't use the output buffer), and well commented code.
I wrote this function, because I really liked Søren Løvborg's function above (especially how he managed encodings), but I needed a bit more functionality and flexibility.
The CakePHP framework has an HTML-aware truncate() function in the Text Helper that works for me. See Text. MIT license. Link to source (provided by @Quentin).
This is very difficult to do without using a validator and a parser. Imagine you have:
<div id='x'>
<div id='y'>
<h1>Heading</h1>
500
lines
of
html
...
etc
...
</div>
</div>
How do you plan to truncate that and end up with valid HTML?
After a brief search, I found this link which could help.
I am creating an RSS feed file for my application in which I want to remove HTML tags, which is done by strip_tags. But strip_tags does not remove HTML special character codes such as:
&amp; &copy;
etc.
Please tell me a function I can use to remove these special character codes from my string.
Either decode them using html_entity_decode or remove them using preg_replace:
$Content = preg_replace("/&#?[a-z0-9]+;/i","",$Content);
(From here)
EDIT: Alternative according to Jacco's comment
might be nice to replace the '+' with
{2,8} or something. This will limit
the chance of replacing entire
sentences when an unencoded '&' is
present.
$Content = preg_replace("/&#?[a-z0-9]{2,8};/i","",$Content);
Use html_entity_decode to convert HTML entities.
You'll need to set charset to make it work correctly.
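For example (a small sketch; the sample markup is made up):

```php
<?php
$html = '<p>Caf&eacute; &amp; bar &copy; 2010</p>';

// Strip the tags first, then decode the entities into UTF-8 characters.
$plain = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $plain, "\n"; // Café & bar © 2010
```

Since the result goes into an RSS feed, the decoded text still needs XML escaping on output, e.g. htmlspecialchars($plain, ENT_XML1, 'UTF-8'), otherwise the bare & makes the feed invalid.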
In addition to the good answers above, PHP also has a built-in filter function that is quite useful: filter_var.
To remove HTML characters, use the following (note that FILTER_SANITIZE_STRING is deprecated as of PHP 8.1):
$cleanString = filter_var($dirtyString, FILTER_SANITIZE_STRING);
More info:
function.filter-var
filter_sanitize_string
You may want take a look at htmlentities() and html_entity_decode() here
$orig = "I'll \"walk\" the <b>dog</b> now";
$a = htmlentities($orig);
$b = html_entity_decode($a);
echo $a; // I'll &quot;walk&quot; the &lt;b&gt;dog&lt;/b&gt; now
echo $b; // I'll "walk" the <b>dog</b> now
This might work well to remove special characters.
$modifiedString = preg_replace("/[^a-zA-Z0-9_.\s-]/", "", $content); // hyphen moved last so it is a literal, not a range
If you want to convert the HTML special characters and not just remove them as well as strip things down and prepare for plain text this was the solution that worked for me...
function htmlToPlainText($str){
$str = str_replace('&nbsp;', ' ', $str);
$str = html_entity_decode($str, ENT_QUOTES | ENT_COMPAT , 'UTF-8');
$str = html_entity_decode($str, ENT_HTML5, 'UTF-8');
$str = html_entity_decode($str);
$str = htmlspecialchars_decode($str);
$str = strip_tags($str);
return $str;
}
$string = '<p>this is (&nbsp;) a test</p>
<div>Yes this is! &amp; does it get "processed"? </div>';
htmlToPlainText($string);
// "this is ( ) a test. Yes this is! & does it get processed?"
html_entity_decode w/ ENT_QUOTES | ENT_XML1 converts things like &#39;
htmlspecialchars_decode converts things like &amp;
html_entity_decode converts things like &lt;
and strip_tags removes any HTML tags left over.
EDIT - Added str_replace('&nbsp;', ' ', $str); and several other html_entity_decode() calls, as continued testing has shown a need for them.
A plain vanilla strings way to do it without engaging the preg regex engine:
function remEntities($str) {
if(substr_count($str, '&') && substr_count($str, ';')) {
// Find amper
$amp_pos = strpos($str, '&');
//Find the ;
$semi_pos = strpos($str, ';');
// Only if the ; is after the &
if($semi_pos > $amp_pos) {
//is a HTML entity, try to remove
$tmp = substr($str, 0, $amp_pos);
$tmp = $tmp. substr($str, $semi_pos + 1, strlen($str));
$str = $tmp;
//Has another entity in it?
if(substr_count($str, '&') && substr_count($str, ';'))
$str = remEntities($tmp);
}
}
return $str;
}
What I did was to use html_entity_decode, then strip_tags to remove them.
Try this:
<?php
$str = "\x8F!!!";
// Outputs an empty string
echo htmlentities($str, ENT_QUOTES, "UTF-8");
// Outputs "!!!"
echo htmlentities($str, ENT_QUOTES | ENT_IGNORE, "UTF-8");
?>
It looks like what you really want is:
function xmlEntities($string) {
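// Request the table in UTF-8 so that every key is one complete character
$translationTable = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES, 'UTF-8');
$from = $to = array();
foreach ($translationTable as $char => $entity) {
$from[] = $entity;
// mb_ord() (PHP 7.2+, mbstring) returns the whole code point; ord() reads only the first byte
$to[] = '&#'.mb_ord($char, 'UTF-8').';';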
}
return str_replace($from, $to, $string);
}
It replaces the named entities with their numeric equivalents.
<?php
function strip_only($str, $tags, $stripContent = false) {
$content = '';
if(!is_array($tags)) {
// A tag list such as '<a><b>' is split into array('a', 'b'); check $tags, not $str
$tags = (strpos($tags, '>') !== false
? explode('>', str_replace('<', '', $tags))
: array($tags));
if(end($tags) == '') array_pop($tags);
}
foreach($tags as $tag) {
if ($stripContent)
$content = '(.+</'.$tag.'[^>]*>|)';
$str = preg_replace('#</?'.$tag.'[^>]*>'.$content.'#is', '', $str);
}
return $str;
}
$str = '<font color="red">red</font> text';
$tags = 'font';
$a = strip_only($str, $tags); // red text
$b = strip_only($str, $tags, true); // text
?>
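The function I used for this task, incorporating the improvement suggested by schnaader, is:
mysql_real_escape_string(
preg_replace_callback("/&#?[a-z0-9]+;/i", function($m) {
// $m[0] is the whole matched entity; the pattern has no capture group, so $m[1] is undefined
return mb_convert_encoding($m[0], "UTF-8", "HTML-ENTITIES");
}, strip_tags($row['cuerpo'])))
This strips every HTML tag and converts every HTML entity to UTF-8, leaving the text ready to save in MySQL. (mysql_real_escape_string() was removed in PHP 7; on current versions use mysqli or PDO instead.)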
You can try htmlspecialchars_decode($string). It works for me.
http://www.w3schools.com/php/func_string_htmlspecialchars_decode.asp
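A short illustration of what htmlspecialchars_decode() does and does not touch. The sample string is made up. Only the special characters produced by htmlspecialchars() are decoded (&amp;, &lt;, &gt; and the quote entities), so other named entities such as &copy; are left as they are.

```php
<?php
$encoded = '&lt;b&gt;Tom &amp; Jerry&lt;/b&gt; &copy;';

// Decodes &lt; &gt; &amp; only; &copy; is outside its scope.
$decoded = htmlspecialchars_decode($encoded);

echo $decoded; // <b>Tom & Jerry</b> &copy;
```

If the remaining entities should be decoded too, html_entity_decode() is the function to reach for.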
If you are working in WordPress and are like me and simply need to check for an empty field (and there are a copious amount of random html entities in what seems like a blank string) then take a look at:
sanitize_title_with_dashes( string $title, string $raw_title = '', string $context = 'display' )
Link to wordpress function page
For people not working in WordPress: I found this function really useful as a basis for writing my own sanitizer. Take a look at the full code; it is quite in-depth.
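For the "is this field really empty?" case outside WordPress, something along these lines may be enough. This is only a sketch under the assumption of UTF-8 input; isEffectivelyEmpty() is a hypothetical name and has nothing to do with the WordPress function above.

```php
<?php
// Hypothetical helper, not a WordPress API: true when nothing visible is left
// after decoding entities, stripping tags and removing whitespace.
function isEffectivelyEmpty(string $value): bool
{
    $text = html_entity_decode($value, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    $text = strip_tags($text);

    // \p{Z} covers Unicode separators, including the no-break space that
    // &nbsp; and &#160; decode to. preg_replace() returns null on invalid
    // UTF-8, in which case the text is kept unchanged.
    $text = preg_replace('/[\s\p{Z}]+/u', '', $text) ?? $text;

    return $text === '';
}

var_dump(isEffectivelyEmpty('<p>&nbsp; &#160;</p>')); // bool(true)
var_dump(isEffectivelyEmpty('<p>&nbsp;hi</p>'));      // bool(false)
```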
$string = "äáčé";
$convert = Array(
'ä'=>'a',
'Ä'=>'A',
'á'=>'a',
'Á'=>'A',
'à'=>'a',
'À'=>'A',
'ã'=>'a',
'Ã'=>'A',
'â'=>'a',
'Â'=>'A',
'č'=>'c',
'Č'=>'C',
'ć'=>'c',
'Ć'=>'C',
'ď'=>'d',
'Ď'=>'D',
'ě'=>'e',
'Ě'=>'E',
'é'=>'e',
'É'=>'E',
'ë'=>'e',
);
$string = strtr($string , $convert );
echo $string; //aace
What If By "Remove HTML Special Chars" You Meant "Replace Appropriately"?
After all, just look at your example...
&nbsp; &amp; &copy;
If you're stripping this for an RSS feed, shouldn't you want the equivalents?
" ", &, ©
Or maybe you don't exactly want the equivalents. Maybe you'd want to have &nbsp; just be ignored (to prevent too much space), but then have &copy; actually get replaced. Let's work out a solution that solves anyone's version of this problem...
How to SELECTIVELY-REPLACE HTML Special Chars
The logic is simple: preg_match_all('/(&#[0-9]+;)/' grabs all of the matches, and then we simply build a list of matchables and replaceables, such as str_replace([searchlist], [replacelist], $term). Before we do this, we also need to convert named entities to their numeric counterparts, i.e., "&nbsp;" is unacceptable, but "&#160;" is fine. (Thanks to it-alien's solution to this part of the problem.)
Working Demo
In this demo, I replace &#123; with "HTML Entity #123". Of course, you can fine-tune this to any kind of find-replace you want for your case.
Why did I make this? I use it with generating Rich Text Format from UTF8-character-encoded HTML.
function FixUTF8($args) {
$output = $args['input'];
$output = convertNamedHTMLEntitiesToNumeric(['input'=>$output]);
preg_match_all('/(&#[0-9]+;)/', $output, $matches, PREG_OFFSET_CAPTURE);
$full_matches = $matches[0];
$found = [];
$search = [];
$replace = [];
for($i = 0; $i < count($full_matches); $i++) {
$match = $full_matches[$i];
$word = $match[0];
if(!isset($found[$word])) {
$found[$word] = TRUE;
$search[] = $word;
$replacement = str_replace(['&#', ';'], ['HTML Entity #', ''], $word);
$replace[] = $replacement;
}
}
$new_output = str_replace($search, $replace, $output);
return $new_output;
}
function convertNamedHTMLEntitiesToNumeric($args) {
$input = $args['input'];
return preg_replace_callback("/(&[a-zA-Z][a-zA-Z0-9]*;)/",function($m){
$c = html_entity_decode($m[0],ENT_HTML5,"UTF-8");
# return htmlentities($c,ENT_XML1,"UTF-8"); -- see update below
$convmap = array(0x80, 0xffff, 0, 0xffff);
return mb_encode_numericentity($c, $convmap, 'UTF-8');
}, $input);
}
print(FixUTF8(['input'=>"Oggi &egrave; un bel&nbsp;giorno"]));
Input:
"Oggi &egrave; un bel&nbsp;giorno"
Output:
Oggi HTML Entity #232 un belHTML Entity #160giorno