PHP function UTF-8 characters issue - php

Here is my function that makes the first character of the first word of a sentence uppercase:
function sentenceCase($str)
{
$cap = true;
$ret = '';
for ($x = 0; $x < strlen($str); $x++) {
$letter = substr($str, $x, 1);
if ($letter == "." || $letter == "!" || $letter == "?") {
$cap = true;
} elseif ($letter != " " && $cap == true) {
$letter = strtoupper($letter);
$cap = false;
}
$ret .= $letter;
}
return $ret;
}
It converts "sample sentence" into "Sample sentence". The problem is that it doesn't capitalize non-ASCII UTF-8 characters.
What am I doing wrong?

The most straightforward way to make your code UTF-8 aware is to use mbstring functions instead of the plain dumb ones in the three cases where the latter appear:
function sentenceCase($str)
{
$cap = true;
$ret = '';
for ($x = 0; $x < mb_strlen($str); $x++) { // mb_strlen instead
$letter = mb_substr($str, $x, 1); // mb_substr instead
if ($letter == "." || $letter == "!" || $letter == "?") {
$cap = true;
} elseif ($letter != " " && $cap == true) {
$letter = mb_strtoupper($letter); // mb_strtoupper instead
$cap = false;
}
$ret .= $letter;
}
return $ret;
}
You can then configure mbstring to work with UTF-8 strings and you are ready to go:
mb_internal_encoding('UTF-8');
echo sentenceCase ("üias skdfnsknka");
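To see why the plain functions fail, here is a small sketch comparing byte-based and character-based calls on the same string. It assumes the mbstring extension is loaded and that the source file is saved as UTF-8; the variable names are only for illustration.

```php
<?php
// Assumption: this file is saved as UTF-8 and mbstring is available.
$word = "üias";

$bytes = strlen($word);                        // counts bytes; "ü" takes two bytes in UTF-8
$chars = mb_strlen($word, 'UTF-8');            // counts characters

$firstByte = substr($word, 0, 1);              // only the first half of "ü"
$firstChar = mb_substr($word, 0, 1, 'UTF-8');  // the whole character

echo $bytes, ' ', $chars, ' ', bin2hex($firstByte), ' ', $firstChar, "\n";
```

Because substr() hands strtoupper() half of a character, there is nothing for it to capitalize, which is why the original loop leaves "ü" untouched.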
Bonus solution
Specifically for UTF-8 you can also use a regular expression, which will result in less code:
$str = "üias skdfnsknka";
echo preg_replace_callback(
'/((?:^|[!.?])\s*)(\p{Ll})/u',
function($match) { return $match[1].mb_strtoupper($match[2], 'UTF-8'); },
$str);
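If you want to reuse the regular expression, it can be wrapped in a function. This is only a sketch: the name sentence_case_regex is made up here, and the pattern is the same one as above.

```php
<?php
// Hypothetical wrapper name; the pattern is taken unchanged from the snippet above.
function sentence_case_regex($str)
{
    return preg_replace_callback(
        '/((?:^|[!.?])\s*)(\p{Ll})/u',
        function ($match) {
            // $match[1]: start of string or sentence-ending punctuation plus whitespace
            // $match[2]: the lowercase letter that starts the sentence
            return $match[1] . mb_strtoupper($match[2], 'UTF-8');
        },
        $str
    );
}

echo sentence_case_regex("élan vital. ça va? ok"), "\n";
```

The /u modifier makes the pattern treat the input as UTF-8, and \p{Ll} matches any lowercase letter, not just a-z.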

Related

in_array returns false on same characters, single character string matching

function make_ascii($str) {
$special = array('ľ','š','č','ť','ž','ý','á','í','é','ú','ä','ô','ň','ě');
$ascii = array('l','s','c','t','z','y','a','i','e','u','a','o','n','e');
$str = str_split($str);
foreach ($str as $k => $c) {
if(ctype_upper($c)) {
$u = true;
$c = strtolower($c);
} else {
$u = false;
}
if(in_array($c, $special, false)) {
$c = $ascii[array_search($c, $special)];
}
if($u) {
$c = strtoupper($c);
}
$str[$k] = $c;
}
return join($str);
}
In this function, in_array() returns false every time, even when I feed it characters from the $special array. If I var_dump() the result on the regular text I try to parse, the output is just bool(false) with no match, even if I copy and paste the character from the source into the array. I'm also looking for a way to make this character replacement work.
Since str_split will not work for multibyte strings, you have to use the mb_ functions to perform multibyte string operations:
function make_ascii($str) {
//return $str;
$special = array('ľ','š','č','ť','ž','ý','á','í','é','ú','ä','ô','ň','ě');
$ascii = array('l','s','c','t','z','y','a','i','e','u','a','o','n','e');
$str = array_map(function ($i) use ($str) {
return mb_substr($str, $i, 1);
}, range(0, mb_strlen($str) -1));
foreach ($str as $k => $c) {
if(ctype_upper($c)) {
$u = true;
$c = strtolower($c);
} else {
$u = false;
}
// print_r($c);
if(in_array($c, $special)) {
$c = $ascii[array_search($c, $special)];
}
if($u) {
$c = strtoupper($c);
}
$str[$k] = $c;
}
return join($str);
}
var_dump(make_ascii('áé'));
If you have an issue with uppercase letters, you have to change the functions to mb_strtoupper and mb_strtolower. ctype_upper will not work either, so change that as well:
function make_ascii($str) {
//return $str;
$special = array('ľ','š','č','ť','ž','ý','á','í','é','ú','ä','ô','ň','ě');
$ascii = array('l','s','c','t','z','y','a','i','e','u','a','o','n','e');
$str = array_map(function ($i) use ($str) {
return mb_substr($str, $i, 1);
}, range(0, mb_strlen($str) -1));
foreach ($str as $k => $c) {
if( mb_strtoupper($c, "UTF-8") == $c) {
$u = true;
$c = mb_strtolower($c);
} else {
$u = false;
}
// print_r($c);
if(in_array($c, $special)) {
$c = $ascii[array_search($c, $special)];
}
if($u) {
$c = mb_strtoupper($c);
}
$str[$k] = $c;
}
return join($str);
}
$str = "ľÁľa ýellow";
var_dump(make_ascii($str));
As I mentioned str_split can't handle multi-byte strings. Don't believe me? Have a look.
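A quick sketch of the difference, assuming the file is saved as UTF-8 (the variable names are just for the demonstration):

```php
<?php
// Assumption: this file is saved as UTF-8.
$str = "ľa";

$bytes = str_split($str);                                   // one array element per byte
$chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);  // one element per UTF-8 character

echo count($bytes), ' ', count($chars), "\n";

var_dump(in_array('ľ', $bytes, true)); // the character was split into two separate bytes
var_dump(in_array('ľ', $chars, true)); // the character is intact
```

This is why in_array() never finds anything after str_split(): each accented letter has been cut into two one-byte strings, neither of which equals the two-byte entry in $special.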
At any rate here's a version that can split multibyte strings:
function make_ascii($str) {
$special = array('ľ','š','č','ť','ž','ý','á','í','é','ú','ä','ô','ň','ě');
$ascii = array('l','s','c','t','z','y','a','i','e','u','a','o','n','e');
$str = preg_split("//u",$str);
foreach ($str as $k => $c) {
if(ctype_upper($c)) {
$u = true;
$c = mb_strtolower($c);
} else {
$u = false;
}
if(in_array($c, $special, false)) {
$c = $ascii[array_search($c, $special)];
}
if($u) {
$c = mb_strtoupper($c);
}
$str[$k] = $c;
}
return join($str);
}
$str = "ľaľa ýellow";
print_r(make_ascii($str));
Prints:
lala yellow
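A different technique, not used in the answers above, is to skip the splitting entirely and let strtr() do the replacement with a lookup table. The sketch below assumes mbstring is available and that the file is saved as UTF-8; make_ascii_strtr is a made-up name.

```php
<?php
// Hypothetical alternative: build a map once and let strtr() replace whole sequences.
function make_ascii_strtr($str)
{
    $special = array('ľ','š','č','ť','ž','ý','á','í','é','ú','ä','ô','ň','ě');
    $ascii   = array('l','s','c','t','z','y','a','i','e','u','a','o','n','e');

    $map = array();
    foreach ($special as $i => $char) {
        $map[$char] = $ascii[$i];
        // Add the uppercase variant as well, so no case juggling is needed later.
        $map[mb_strtoupper($char, 'UTF-8')] = strtoupper($ascii[$i]);
    }

    // strtr() matches the full byte sequence of each key, so multibyte keys are safe.
    return strtr($str, $map);
}

echo make_ascii_strtr("ľÁľa ýellow"), "\n";
```

Characters that are not in the map pass through unchanged, so the uppercase check from the original function is no longer needed.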

Changed formation while use character limit in TBS library [duplicate]

I have various HTML strings to cut to 100 characters (of the stripped content, not the original) without stripping tags and without breaking HTML.
Original HTML string (288 characters):
$content = "<div>With a <span class='spanClass'>span over here</span> and a
<div class='divClass'>nested div over <div class='nestedDivClass'>there</div>
</div> and a lot of other nested <strong><em>texts</em> and tags in the air
<span>everywhere</span>, it's a HTML taggy kind of day.</strong></div>";
Standard trim: Trim to 100 characters and HTML breaks, stripped content comes to ~40 characters:
$content = substr($content, 0, 100)."..."; /* output:
<div>With a <span class='spanClass'>span over here</span> and a
<div class='divClass'>nested div ove... */
Stripped HTML: Outputs the correct character count but obviously loses formatting:
$content = substr(strip_tags($content), 0, 100)."..."; /* output:
With a span over here and a nested div over there and a lot of other nested
texts and tags in the ai... */
Partial solution: using HTML Tidy or HTML Purifier to close off tags outputs clean HTML, but the 100 characters are counted in the HTML source, not in the displayed content.
$content = substr($content, 0, 100)."...";
$tidy = new tidy; $tidy->parseString($content); $tidy->cleanRepair(); /* output:
<div>With a <span class='spanClass'>span over here</span> and a
<div class='divClass'>nested div ove</div></div>... */
Challenge: To output clean HTML and n characters (excluding character count of HTML elements):
$content = cutHTML($content, 100); /* output:
<div>With a <span class='spanClass'>span over here</span> and a
<div class='divClass'>nested div over <div class='nestedDivClass'>there</div>
</div> and a lot of other nested <strong><em>texts</em> and tags in the
ai</strong></div>... */
Similar Questions
How to clip HTML fragments without breaking up tags
Cutting HTML strings without breaking HTML tags
Not amazing, but works.
function html_cut($text, $max_length)
{
$tags = array();
$result = "";
$is_open = false;
$grab_open = false;
$is_close = false;
$in_double_quotes = false;
$in_single_quotes = false;
$tag = "";
$i = 0;
$stripped = 0;
$stripped_text = strip_tags($text);
while ($i < strlen($text) && $stripped < strlen($stripped_text) && $stripped < $max_length)
{
$symbol = $text[$i]; // curly-brace string offsets were removed in PHP 8
$result .= $symbol;
switch ($symbol)
{
case '<':
$is_open = true;
$grab_open = true;
break;
case '"':
if ($in_double_quotes)
$in_double_quotes = false;
else
$in_double_quotes = true;
break;
case "'":
if ($in_single_quotes)
$in_single_quotes = false;
else
$in_single_quotes = true;
break;
case '/':
if ($is_open && !$in_double_quotes && !$in_single_quotes)
{
$is_close = true;
$is_open = false;
$grab_open = false;
}
break;
case ' ':
if ($is_open)
$grab_open = false;
else
$stripped++;
break;
case '>':
if ($is_open)
{
$is_open = false;
$grab_open = false;
array_push($tags, $tag);
$tag = "";
}
else if ($is_close)
{
$is_close = false;
array_pop($tags);
$tag = "";
}
break;
default:
if ($grab_open || $is_close)
$tag .= $symbol;
if (!$is_open && !$is_close)
$stripped++;
}
$i++;
}
while ($tags)
$result .= "</".array_pop($tags).">";
return $result;
}
Usage example:
$content = html_cut($content, 100);
I'm not claiming to have invented this, but there is a very complete Text::truncate() method in CakePHP which does what you want:
function truncate($text, $length = 100, $ending = '...', $exact = true, $considerHtml = false) {
if (is_array($ending)) {
extract($ending);
}
if ($considerHtml) {
if (mb_strlen(preg_replace('/<.*?>/', '', $text)) <= $length) {
return $text;
}
$totalLength = mb_strlen($ending);
$openTags = array();
$truncate = '';
preg_match_all('/(<\/?([\w+]+)[^>]*>)?([^<>]*)/', $text, $tags, PREG_SET_ORDER);
foreach ($tags as $tag) {
if (!preg_match('/img|br|input|hr|area|base|basefont|col|frame|isindex|link|meta|param/s', $tag[2])) {
if (preg_match('/<[\w]+[^>]*>/s', $tag[0])) {
array_unshift($openTags, $tag[2]);
} else if (preg_match('/<\/([\w]+)[^>]*>/s', $tag[0], $closeTag)) {
$pos = array_search($closeTag[1], $openTags);
if ($pos !== false) {
array_splice($openTags, $pos, 1);
}
}
}
$truncate .= $tag[1];
$contentLength = mb_strlen(preg_replace('/&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};/i', ' ', $tag[3]));
if ($contentLength + $totalLength > $length) {
$left = $length - $totalLength;
$entitiesLength = 0;
if (preg_match_all('/&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};/i', $tag[3], $entities, PREG_OFFSET_CAPTURE)) {
foreach ($entities[0] as $entity) {
if ($entity[1] + 1 - $entitiesLength <= $left) {
$left--;
$entitiesLength += mb_strlen($entity[0]);
} else {
break;
}
}
}
$truncate .= mb_substr($tag[3], 0 , $left + $entitiesLength);
break;
} else {
$truncate .= $tag[3];
$totalLength += $contentLength;
}
if ($totalLength >= $length) {
break;
}
}
} else {
if (mb_strlen($text) <= $length) {
return $text;
} else {
$truncate = mb_substr($text, 0, $length - strlen($ending));
}
}
if (!$exact) {
$spacepos = mb_strrpos($truncate, ' ');
if (isset($spacepos)) {
if ($considerHtml) {
$bits = mb_substr($truncate, $spacepos);
preg_match_all('/<\/([a-z]+)>/', $bits, $droppedTags, PREG_SET_ORDER);
if (!empty($droppedTags)) {
foreach ($droppedTags as $closingTag) {
if (!in_array($closingTag[1], $openTags)) {
array_unshift($openTags, $closingTag[1]);
}
}
}
}
$truncate = mb_substr($truncate, 0, $spacepos);
}
}
$truncate .= $ending;
if ($considerHtml) {
foreach ($openTags as $tag) {
$truncate .= '</'.$tag.'>';
}
}
return $truncate;
}
Use PHP's DOMDocument class to normalize an HTML fragment:
$dom= new DOMDocument();
$dom->loadHTML('<div><p>Hello World');
$xpath = new DOMXPath($dom);
$body = $xpath->query('/html/body');
echo($dom->saveXml($body->item(0)));
This question is similar to an earlier question and I've copied and pasted one solution here. If the HTML is submitted by users you'll also need to filter out potential Javascript attack vectors like onmouseover="do_something_evil()" or .... Tools like HTML Purifier were designed to catch and solve these problems and are far more comprehensive than any code that I could post.
I made another function to do it, it supports UTF-8:
/**
* Limit string without break html tags.
* Supports UTF8
*
* @param string $value
* @param int $limit Default 100
*/
function str_limit_html($value, $limit = 100)
{
if (mb_strwidth($value, 'UTF-8') <= $limit) {
return $value;
}
// Strip text with HTML tags, sum html len tags too.
// Is there another way to do it?
do {
$len = mb_strwidth($value, 'UTF-8');
$len_stripped = mb_strwidth(strip_tags($value), 'UTF-8');
$len_tags = $len - $len_stripped;
$value = mb_strimwidth($value, 0, $limit + $len_tags, '', 'UTF-8');
} while ($len_stripped > $limit);
// Load as HTML ignoring errors
$dom = new DOMDocument();
@$dom->loadHTML('<?xml encoding="utf-8" ?>'.$value, LIBXML_HTML_NODEFDTD);
// Fix the html errors
$value = $dom->saveHtml($dom->getElementsByTagName('body')->item(0));
// Remove body tag
$value = mb_strimwidth($value, 6, mb_strwidth($value, 'UTF-8') - 13, '', 'UTF-8'); // <body> and </body>
// Remove empty tags
return preg_replace('/<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|"[^"]*"|[\w\-.:]+))?)*\s*\/?>\s*<\/\1\s*>/', '', $value);
}
I recommend using html_entity_decode at the start of the function, so that it preserves the UTF-8 characters:
$value = html_entity_decode($value);
Use a HTML parser and stop after 100 characters of text.
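A minimal sketch of that idea with DOMDocument, assuming the dom and mbstring extensions are available. The function names are made up for illustration, and for non-ASCII input you would still need an encoding hint when loading, as other answers here do.

```php
<?php
// Hypothetical helper: walks the tree and keeps at most $remaining characters of text.
function truncate_node_sketch(DOMNode $node, &$remaining)
{
    if ($node instanceof DOMText) {
        $len = mb_strlen($node->data, 'UTF-8');
        if ($len > $remaining) {
            $node->data = mb_substr($node->data, 0, $remaining, 'UTF-8');
            $len = $remaining;
        }
        $remaining -= $len;
        return;
    }
    // Copy the child list first: removing nodes from the live list while iterating skips entries.
    foreach (iterator_to_array($node->childNodes) as $child) {
        if ($remaining <= 0) {
            $node->removeChild($child);
        } else {
            truncate_node_sketch($child, $remaining);
        }
    }
}

function truncate_html_sketch($html, $limit)
{
    libxml_use_internal_errors(true); // keep parser warnings out of the output
    $dom = new DOMDocument();
    $dom->loadHTML('<div>' . $html . '</div>'); // wrapper div so the fragment has a single root
    $wrapper = $dom->getElementsByTagName('body')->item(0)->firstChild;

    $remaining = $limit;
    truncate_node_sketch($wrapper, $remaining);

    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo truncate_html_sketch('<p>Hello <b>brave new</b> world</p>', 9), "\n";
```

Because the parser owns the tree, every element that survives the cut is serialized with its closing tag, so there is nothing to repair afterwards.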
You should use Tidy HTML. You cut the string and then you run Tidy to close the tags.
(Credits where credits are due)
Regardless of the 100 count issues you state at the beginning, you indicate in the challenge the following:
output the character count of strip_tags (the number of characters in the actual displayed text of the HTML)
retain HTML formatting
close any unfinished HTML tag
Here is my proposal:
Basically, I parse through each character, counting as I go. I make sure NOT to count any characters in any HTML tag. I also check at the end to make sure I am not in the middle of a word when I stop. Once I stop, I backtrack to the first available SPACE or > as a stopping point.
$position = 0;
$length = strlen($content)-1;
// process the content putting each 100 character section into an array
while($position < $length)
{
$next_position = get_position($content, $position, 100);
$data[] = substr($content, $position, $next_position - $position); // third argument is a length, not an end position
$position = $next_position;
}
// show the array
print_r($data);
function get_position($content, $position, $chars = 100)
{
$count = 0;
// count to 100 characters skipping over all of the HTML
while($count <> $chars){
$char = substr($content, $position, 1);
if($char == '<'){
do{
$position++;
$char = substr($content, $position, 1);
} while($char !== '>');
$position++;
$char = substr($content, $position, 1);
}
$count++;
$position++;
}
echo $count."\n";
// find out where there is a logical break before 100 characters
$data = substr($content, 0, $position);
$space = strrpos($data, " ");
$tag = strrpos($data, ">");
// return the position of the logical break
if($space > $tag)
{
return $space;
} else {
return $tag;
}
}
This will also count the return codes etc. Considering they will take space, I have not removed them.
Here is a function I'm using in one of my projects. It's based on DOMDocument, works with HTML5 and is about 2x faster than other solutions I've tried (at least on my machine, 0.22 ms vs 0.43 ms using html_cut($text, $max_length) from the top answer on a 500 text-node-characters string with a limit of 400).
function cut_html ($html, $limit) {
$dom = new DOMDocument();
$dom->loadHTML(mb_convert_encoding("<div>{$html}</div>", "HTML-ENTITIES", "UTF-8"), LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
cut_html_recursive($dom->documentElement, $limit);
return substr($dom->saveHTML($dom->documentElement), 5, -6);
}
function cut_html_recursive ($element, $limit) {
if($limit > 0) {
if($element->nodeType == 3) {
$limit -= strlen($element->nodeValue);
if($limit < 0) {
$element->nodeValue = substr($element->nodeValue, 0, strlen($element->nodeValue) + $limit);
}
}
else {
for($i = 0; $i < $element->childNodes->length; $i++) {
if($limit > 0) {
$limit = cut_html_recursive($element->childNodes->item($i), $limit);
}
else {
$element->removeChild($element->childNodes->item($i));
$i--;
}
}
}
}
return $limit;
}
Here is my try at the cutter. Maybe you guys can catch some bugs. The problem I found with the other parsers is that they don't close tags properly and they cut in the middle of a word (blah).
function cutHTML($string, $length, $patternsReplace = false) {
$i = 0;
$count = 0;
$isParagraphCut = false;
$htmlOpen = false;
$openTag = false;
$tagsStack = array();
while ($i < strlen($string)) {
$char = substr($string, $i, 1);
if ($count >= $length) {
$isParagraphCut = true;
break;
}
if ($htmlOpen) {
if ($char === ">") {
$htmlOpen = false;
}
} else {
if ($char === "<") {
$j = $i;
$char = substr($string, $j, 1);
while ($j < strlen($string)) {
if($char === '/'){
$i++;
break;
}
elseif ($char === ' ') {
$tagsStack[] = substr($string, $i, $j);
}
$j++;
}
$htmlOpen = true;
}
}
if (!$htmlOpen && $char != ">") {
$count++;
}
$i++;
}
if ($isParagraphCut) {
$j = $i;
while ($j > 0) {
$char = substr($string, $j, 1);
if ($char === " " || $char === ";" || $char === "." || $char === "," || $char === "<" || $char === "(" || $char === "[") {
break;
} else if ($char === ">") {
$j++;
break;
}
$j--;
}
$string = substr($string, 0, $j);
foreach($tagsStack as $tag){
$tag = strtolower($tag);
if($tag !== "img" && $tag !== "br"){
$string .= "</$tag>";
}
}
$string .= "...";
}
if ($patternsReplace) {
foreach ($patternsReplace as $value) {
if (isset($value['pattern']) && isset($value["replace"])) {
$string = preg_replace($value["pattern"], $value["replace"], $string);
}
}
}
return $string;
}
try this function
// trim the string function
function trim_word($text, $length, $startPoint=0, $allowedTags=""){
$text = html_entity_decode(htmlspecialchars_decode($text));
$text = strip_tags($text, $allowedTags);
return $text = substr($text, $startPoint, $length);
}
and
echo trim_word("<h2 class='zzzz'>abcasdsdasasdas</h2>","6");
I know this is quite old, but I've recently made a small class for cutting HTML for previews: https://github.com/Simbiat/HTMLCut/
Why would you want to use that instead of the other suggestions? Here are a few things that come to my mind (taken from readme):
Preserve HTML tags, unless they are empty.
Preserve words.
Remove some orphaned punctuation signs at the end of the cut string.
Remove HTML tags, that you would not want in a preview (optional).
Limit number of paragraphs (optional).
Add an ellipsis if text was cut (optional).
Class operates with DOM, but also uses Regex in some places (mainly for cutting and trimming). Perhaps it can be of use to some.
Try the following:
<?php echo strip_tags(mb_strimwidth($VARIABLE_HERE, 0, 160, "...")); ?>
This will strip the HTML (strip_tags) and limit the text (mb_strimwidth) to 160 characters.
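Note that the one-liner trims before it strips, so the cut can land inside a tag and leave fewer visible characters than intended. A sketch with the order swapped, assuming mbstring is available (preview_text is a made-up name):

```php
<?php
// Hypothetical helper: strip tags first so the limit applies to visible text only.
function preview_text($html, $width = 160)
{
    // The width passed to mb_strimwidth() includes the "..." trim marker.
    return mb_strimwidth(strip_tags($html), 0, $width, '...', 'UTF-8');
}

echo preview_text('<b>Hello</b> wonderful world', 10), "\n"; // Hello w...
```

Either way the result is plain text, so this only fits previews where losing the formatting is acceptable.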

Sentence case without regular expressions

Is there a built-in php function, or a simple (efficient!) way to combine built-in functions, to give a string sentence case ("Sentence one. Sentence two.")?
PHP has similar built-in functions, but none that I can find fit my purposes:
ucfirst(strtolower("SENTENCE ONE. AND HERE'S TWO.")) returns "Sentence one. and here's two."; ucwords(strtolower("SENTENCE ONE. AND HERE'S TWO.")) returns "Sentence One. And Here's Two."
function sentence_case($str) {
$cap = true;
$ret='';
for($x = 0; $x < strlen($str); $x++){
$letter = substr($str, $x, 1);
if($letter == "." || $letter == "!" || $letter == "?"){
$cap = true;
}elseif($letter != " " && $cap == true){
$letter = strtoupper($letter);
$cap = false;
}
$ret .= $letter;
}
return $ret;
}
This will preserve existing proper noun capitals, acronyms and abbreviations.
You could split the string on ".", then ucfirst each sentence. Not the most elegant solution, but it works.
$sentences = explode(".",$paragraph);
$text = "";
foreach($sentences as $sentence) {
$text .= ucfirst(strtolower($sentence)).".";
}
Try this:
function sentenceCase($s){
$str = strtolower($s);
$cap = true;
$ret = ''; // initialize the result so .= does not hit an undefined variable
for($x = 0; $x < strlen($str); $x++){
$letter = substr($str, $x, 1);
if($letter == "." || $letter == "!" || $letter == "?"){
$cap = true;
}elseif($letter != " " && $cap == true){
$letter = strtoupper($letter);
$cap = false;
}
$ret .= $letter;
}
return $ret;
}
Taken from php.net. Works with more than just periods as line endings.
I came up with this solution using preg_split. It will try to split sentences on . boundaries where there is one or more spaces after the period.
It is still pretty efficient, but arguably less so than its explode counterpart.
<?php
$str = "SENTENCE ONE. AND HERE'S TWO.";
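$sentences = preg_split('/(\.\s+)/', $str, -1, PREG_SPLIT_DELIM_CAPTURE);
// call-time pass-by-reference and create_function() no longer work in current PHP, so map with a closure
$sentences = array_map(function ($val) { return ucfirst(strtolower($val)); }, $sentences);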
$str = implode('', $sentences);
echo $str; // Sentence one. And here's two.
Will work with new line breaks not only spaces.
function sentenceCase($text){
$cap = true; $newText = '';
for($x = 0; $x < strlen($text); $x++){
$letter = substr($text, $x, 1);
if($letter == '.' || $letter == '!' || $letter == '?' || $letter == "\n"){
$cap = true;
} elseif($letter != ' ' && $cap == true){
$letter = strtoupper($letter);
$cap = false;
}
$newText .= $letter;
}
return $newText;
}

Cutting text without destroying html tags

Is there a way to do this without writing my own function?
For example:
$text = 'Test <span><a>something</a> something else</span>.';
$text = cutText($text, 2, null, 20, true);
//result: Test <span><a>something</a></span>
I need to make this function indestructible
My problem is similar to this thread, but I need a better solution. I would like to keep nested tags untouched.
So far my algorithm is:
function cutText($content, $max_words, $max_chars, $max_word_len, $html = false) {
$len = strlen($content);
$res = '';
$word_count = 0;
$word_started = false;
$current_word = '';
$current_word_len = 0;
if ($max_chars == null) {
$max_chars = $len;
}
$inHtml = false;
$openedTags = array();
for ($i = 0; $i<$max_chars;$i++) {
if ($content[$i] == '<' && $html) {
$inHtml = true;
}
if ($inHtml) {
$max_chars++;
}
if ($html && !$inHtml) {
if ($content[$i] != ' ' && !$word_started) {
$word_started = true;
$word_count++;
}
$current_word .= $content[$i];
$current_word_len++;
if ($current_word_len == $max_word_len) {
$current_word .= '- ';
}
if (($content[$i] == ' ') && $word_started) {
$word_started = false;
$res .= $current_word;
$current_word = '';
$current_word_len = 0;
if ($word_count == $max_words) {
return $res;
}
}
}
if ($content[$i] == '<' && $html) {
$inHtml = true;
}
}
return $res;
}
But of course it won't work. I thought about remembering opened tags and closing them if they were not closed but maybe there is a better way?
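The "remember opened tags and close them" idea can be sketched with a simple stack. This is an illustration only, not a full HTML parser: the function name and the list of void tags are assumptions made for the example.

```php
<?php
// Hypothetical helper: appends closing tags for everything still open at the end of $html.
function close_open_tags_sketch($html)
{
    $void  = array('br', 'img', 'hr', 'input', 'meta', 'link'); // tags that never get a closing tag
    $stack = array();

    // Group 1: "/" for closing tags, group 2: tag name, group 3: "/" for self-closed tags.
    preg_match_all('/<(\/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*?(\/?)>/', $html, $matches, PREG_SET_ORDER);

    foreach ($matches as $m) {
        $name = strtolower($m[2]);
        if ($m[3] === '/' || in_array($name, $void, true)) {
            continue; // nothing to close
        }
        if ($m[1] === '/') {
            if (end($stack) === $name) {
                array_pop($stack); // matching closing tag found
            }
        } else {
            $stack[] = $name; // opening tag: remember it
        }
    }

    // Close whatever is still open, innermost first.
    while ($stack) {
        $html .= '</' . array_pop($stack) . '>';
    }
    return $html;
}

echo close_open_tags_sketch('Test <span><a>something'), "\n";
```

It will be fooled by a ">" inside an attribute value, so it only suits markup you already trust to be reasonably clean.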
This works perfectly for me:
function trimContent ($str, $trimAtIndex) {
$beginTags = array();
$endTags = array();
for($i = 0; $i < strlen($str); $i++) {
if( $str[$i] == '<' )
$beginTags[] = $i;
else if($str[$i] == '>')
$endTags[] = $i;
}
foreach($beginTags as $k=>$index) {
// Trying to trim in between tags. Trim after the last tag
if( ( $trimAtIndex >= $index ) && ($trimAtIndex <= $endTags[$k]) ) {
$trimAtIndex = $endTags[$k];
}
}
return substr($str, 0, $trimAtIndex);
}
Try something like this
function cutText($inputText, $start, $length) {
$temp = $inputText;
$res = array();
while (strpos($temp, '>')) {
$ts = strpos($temp, '<');
$te = strpos($temp, '>');
if ($ts > 0) $res[] = substr($temp, 0, $ts);
$res[] = substr($temp, $ts, $te - $ts + 1);
$temp = substr($temp, $te + 1, strlen($temp) - $te);
}
if ($temp != '') $res[] = $temp;
$pointer = 0;
$end = $start + $length - 1;
foreach ($res as &$part) {
if (substr($part, 0, 1) != '<') {
$l = strlen($part);
$p1 = $pointer;
$p2 = $pointer + $l - 1;
$partx = "";
if ($start <= $p1 && $end >= $p2) $partx = "";
else {
if ($start > $p1 && $start <= $p2) $partx .= substr($part, 0, $start-$pointer);
if ($end >= $p1 && $end < $p2) $partx .= substr($part, $end-$pointer+1, $l-$end+$pointer);
if ($partx == "") $partx = $part;
}
$part = $partx;
$pointer += $l;
}
}
return join('', $res);
}
Parameters:
$inputText - input text
$start - position of first character
$length - how many characters we want to remove
Example #1 - Removing first 3 characters
$text = 'Test <span><a>something</a> something else</span>.';
$text = cutText($text, 0, 3);
var_dump($text);
Output (removed "Tes")
string(47) "t <span><a>something</a> something else</span>."
Removing first 10 characters
$text = cutText($text, 0, 10);
Output (removed "Test somet")
string(40) "<span><a>hing</a> something else</span>."
Example 2 - Removing inner characters - "es" from "Test "
$text = cutText($text, 1, 2);
Output
string(48) "Tt <span><a>something</a> something else</span>."
Removing "thing something el"
$text = cutText($text, 9, 18);
Output
string(32) "Test <span><a>some</a>se</span>."
Hope this helps.
Well, maybe this is not the best solution but it's everything I can do at the moment.
Ok I solved this thing.
I divided this in 2 parts.
First cutting text without destroying html:
function cutHtml($content, $max_words, $max_chars, $max_word_len) {
$len = strlen($content);
$res = '';
$word_count = 0;
$word_started = false;
$current_word = '';
$current_word_len = 0;
$letter_count = 0; // used below; initialize it to avoid an undefined variable
if ($max_chars == null) {
$max_chars = $len;
}
$inHtml = false;
$openedTags = array();
$i = 0;
while ($i < $max_chars) {
//skip any html tags
if ($content[$i] == '<') {
$inHtml = true;
while (true) {
$res .= $content[$i];
$i++;
while($content[$i] == ' ') { $res .= $content[$i]; $i++; }
//skip any values
if ($content[$i] == "'") {
$res .= $content[$i];
$i++;
while(!($content[$i] == "'" && $content[$i-1] != "\\")) {
$res .= $content[$i];
$i++;
}
}
//skip any values
if ($content[$i] == '"') {
$res .= $content[$i];
$i++;
while(!($content[$i] == '"' && $content[$i-1] != "\\")) {
$res .= $content[$i];
$i++;
}
}
if ($content[$i] == '>') { $res .= $content[$i]; $i++; break;}
}
$inHtml = false;
}
if (!$inHtml) {
while($content[$i] == ' ') { $res .= $content[$i]; $letter_count++; $i++; } //skip spaces
$word_started = false;
$current_word = '';
$current_word_len = 0;
while (!in_array($content[$i], array(' ', '<', '.', ','))) {
if (!$word_started) {
$word_started = true;
$word_count++;
}
$current_word .= $content[$i];
$current_word_len++;
if ($current_word_len == $max_word_len) {
$current_word .= '-';
$current_word_len = 0;
}
$i++;
}
if ($letter_count > $max_chars) {
return $res;
}
if ($word_count < $max_words) {
$res .= $current_word;
$letter_count += strlen($current_word);
}
if ($word_count == $max_words) {
$res .= $current_word;
$letter_count += strlen($current_word);
return $res;
}
}
}
return $res;
}
And next thing is closing unclosed tags:
function cleanTags(&$html) {
$count = strlen($html);
$i = -1;
$openedTags = array();
while(true) {
$i++;
if ($i >= $count) break;
if ($html[$i] == '<') {
$tag = '';
$closeTag = '';
$reading = false;
//reading whole tag
while($html[$i] != '>') {
$i++;
while($html[$i] == ' ') $i++; //skip any spaces (need to be idiot proof)
if (!$reading && $html[$i] == '/') { //closing tag
$i++;
while($html[$i] == ' ') $i++; //skip any spaces
$closeTag = '';
while($html[$i] != ' ' && $html[$i] != '>') { //start reading first actual string
$reading = true;
$html[$i] = strtolower($html[$i]); //tags to lowercase
$closeTag .= $html[$i];
$i++;
}
$c = count($openedTags);
if ($c > 0 && $openedTags[$c-1] == $closeTag) array_pop($openedTags);
}
if (!$reading) //read only tag
while($html[$i] != ' ' && $html[$i] != '>') { //start reading first actual string
$reading = true;
$html[$i] = strtolower($html[$i]); //tags to lowercase
$tag .= $html[$i];
$i++;
}
//skip any values
if ($html[$i] == "'") {
$i++;
while(!($html[$i] == "'" && $html[$i-1] != "\\")) {
$i++;
}
}
//skip any values
if ($html[$i] == '"') {
$i++;
while(!($html[$i] == '"' && $html[$i-1] != "\\")) {
$i++;
}
}
if ($reading && $html[$i] == '/') { //self closed tag
$tag = '';
break;
}
}
if (!empty($tag)) $openedTags[] = $tag;
}
}
while (count($openedTags) > 0) {
$tag = array_pop($openedTags);
$html .= "</$tag>";
}
}
It's not idiot proof but tinymce will clear this thing out so further cleaning is not necessary.
It may be a little long, but I don't think it will eat a lot of resources and it should be faster than regex.

Easiest way to remove all whitespace from a code file?

I'm participating in one of the Code Golf competitions where the smaller your file size is, the better.
Rather than manually removing all whitespace, etc., I'm looking for a program or website which will take a file, remove all whitespace (including new lines) and return a compact version of the file. Any ideas?
You could use:
sed -E 's/\s\s+/ /g' yourfile > yourpackedfile
There is also this online tool.
You can even do it in PHP (how marvelous is life):
$data = file_get_contents('foobar.php');
$data = preg_replace('/\s\s+/', ' ', $data);
file_put_contents('foobar2.php', $data);
You have to note this won't take care of a string variable like $bar = ' asd aa a'; it might be a problem depending on what you are doing. The online tool seems to handle this properly.
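To make that caveat concrete, here is a small sketch that runs the same pattern over a source string containing such a literal (the sample source is made up for the demonstration):

```php
<?php
// A tiny PHP source held in a nowdoc so nothing inside it is interpreted.
$source = <<<'PHP'
<?php
$bar = ' asd   aa  a';


echo $bar;
PHP;

// Same pattern as above: every run of two or more whitespace characters becomes one space.
$packed = preg_replace('/\s\s+/', ' ', $source);

echo $packed, "\n";
```

The blank lines are gone, but so are the extra spaces inside the quoted string, which changes what the packed program prints.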
$ tr -d ' \n' <oldfile >newfile
In PowerShell (v2) this can be done with the following little snippet:
(-join(gc my_file))-replace"\s"
or longer:
(-join (Get-Content my_file)) -replace "\s"
It will join all lines together and remove all spaces and tabs.
However, for some languages you probably don't want to do that. In PowerShell for example you don't need semicolons unless you put multiple statements on a single line so code like
while (1) {
"Hello World"
$x++
}
would become
while(1){"HelloWorld"$x++}
when applying aforementioned statements naïvely. It both changed the meaning and the syntactical correctness of the program. Probably not too much to look out for in numerical golfed solutions but the issue with lines joined together still remains, sadly. Just putting a semicolon between each line doesn't actually help either.
This is a PHP function that will do the work for you:
function compress_php_src($src) {
// Whitespace to the left and right of these signs can be ignored
static $IW = array(
T_CONCAT_EQUAL, // .=
T_DOUBLE_ARROW, // =>
T_BOOLEAN_AND, // &&
T_BOOLEAN_OR, // ||
T_IS_EQUAL, // ==
T_IS_NOT_EQUAL, // != or <>
T_IS_SMALLER_OR_EQUAL, // <=
T_IS_GREATER_OR_EQUAL, // >=
T_INC, // ++
T_DEC, // --
T_PLUS_EQUAL, // +=
T_MINUS_EQUAL, // -=
T_MUL_EQUAL, // *=
T_DIV_EQUAL, // /=
T_IS_IDENTICAL, // ===
T_IS_NOT_IDENTICAL, // !==
T_DOUBLE_COLON, // ::
T_PAAMAYIM_NEKUDOTAYIM, // ::
T_OBJECT_OPERATOR, // ->
T_DOLLAR_OPEN_CURLY_BRACES, // ${
T_AND_EQUAL, // &=
T_MOD_EQUAL, // %=
T_XOR_EQUAL, // ^=
T_OR_EQUAL, // |=
T_SL, // <<
T_SR, // >>
T_SL_EQUAL, // <<=
T_SR_EQUAL, // >>=
);
if(is_file($src)) {
if(!$src = file_get_contents($src)) {
return false;
}
}
$tokens = token_get_all($src);
$new = "";
$c = sizeof($tokens);
$iw = false; // Ignore whitespace
$ih = false; // In HEREDOC
$ls = ""; // Last sign
$ot = null; // Open tag
for($i = 0; $i < $c; $i++) {
$token = $tokens[$i];
if(is_array($token)) {
list($tn, $ts) = $token; // tokens: number, string, line
$tname = token_name($tn);
if($tn == T_INLINE_HTML) {
$new .= $ts;
$iw = false;
}
else {
if($tn == T_OPEN_TAG) {
if(strpos($ts, " ") || strpos($ts, "\n") || strpos($ts, "\t") || strpos($ts, "\r")) {
$ts = rtrim($ts);
}
$ts .= " ";
$new .= $ts;
$ot = T_OPEN_TAG;
$iw = true;
} elseif($tn == T_OPEN_TAG_WITH_ECHO) {
$new .= $ts;
$ot = T_OPEN_TAG_WITH_ECHO;
$iw = true;
} elseif($tn == T_CLOSE_TAG) {
if($ot == T_OPEN_TAG_WITH_ECHO) {
$new = rtrim($new, "; ");
} else {
$ts = " ".$ts;
}
$new .= $ts;
$ot = null;
$iw = false;
} elseif(in_array($tn, $IW)) {
$new .= $ts;
$iw = true;
} elseif($tn == T_CONSTANT_ENCAPSED_STRING
|| $tn == T_ENCAPSED_AND_WHITESPACE)
{
if($ts[0] == '"') {
$ts = addcslashes($ts, "\n\t\r");
}
$new .= $ts;
$iw = true;
} elseif($tn == T_WHITESPACE) {
$nt = @$tokens[$i+1];
if(!$iw && (!is_string($nt) || $nt == '$') && !in_array($nt[0], $IW)) {
$new .= " ";
}
$iw = false;
} elseif($tn == T_START_HEREDOC) {
$new .= "<<<S\n";
$iw = false;
$ih = true; // in HEREDOC
} elseif($tn == T_END_HEREDOC) {
$new .= "S;";
$iw = true;
$ih = false; // in HEREDOC
for($j = $i+1; $j < $c; $j++) {
if(is_string($tokens[$j]) && $tokens[$j] == ";") {
$i = $j;
break;
} else if($tokens[$j][0] == T_CLOSE_TAG) {
break;
}
}
} elseif($tn == T_COMMENT || $tn == T_DOC_COMMENT) {
$iw = true;
} else {
if(!$ih) {
$ts = strtolower($ts);
}
$new .= $ts;
$iw = false;
}
}
$ls = "";
}
else {
if(($token != ";" && $token != ":") || $ls != $token) {
$new .= $token;
$ls = $token;
}
$iw = true;
}
}
return $new;
}
// This is an example
$src = file_get_contents('foobar.php');
file_put_contents('foobar3.php',compress_php_src($src));
If your code editor supports regular expressions, you can try this:
Find this: [\r\n]{2,}
Replace with this: \n
Then Replace All
Notepad++ is quite a nice editor if you are on Windows, and it has a lot of predefined macros, trimming down code and removing whitespace among them.
It can do regular expressions and has a plethora of features to help the code hacker or script kiddie.
Notepad++ website
Run php -w on it!
php -w myfile.php
Unlike a regular expression, this is smart enough to leave strings alone, and it removes comments too.
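The same stripping is also available from inside a script through php_strip_whitespace(). A small sketch; the temporary file and its contents are made up for the demonstration:

```php
<?php
// Write a throwaway source file to strip (hypothetical content).
$file = tempnam(sys_get_temp_dir(), 'golf');
file_put_contents($file, "<?php\n// a comment that should disappear\n\$x   =   1;\n\n\necho   \$x;\n");

// Returns the source with comments and redundant whitespace removed.
$packed = php_strip_whitespace($file);
unlink($file);

echo $packed, "\n";
```

Like php -w, it works on the token stream, so string literals are left as they are.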
