How to remove html special chars? [duplicate] - php

This question already has an answer here:
Convert HTML entities and special characters to UTF8 text in PHP
(1 answer)
Closed 9 months ago.
I am creating a RSS feed file for my application in which I want to remove HTML tags, which is done by strip_tags. But strip_tags is not removing HTML special code chars:
& ©
etc.
Please tell me any function which I can use to remove these special code chars from my string.

Either decode them using html_entity_decode or remove them using preg_replace:
$Content = preg_replace("/&#?[a-z0-9]+;/i","",$Content);
(From here)
EDIT: Alternative according to Jacco's comment
might be nice to replace the '+' with
{2,8} or something. This will limit
the chance of replacing entire
sentences when an unencoded '&' is
present.
$Content = preg_replace("/&#?[a-z0-9]{2,8};/i","",$Content);

Use html_entity_decode to convert HTML entities.
You'll need to set charset to make it work correctly.

In addition to the good answers above, PHP also has a built-in filter function that is quite useful: filter_var.
To remove HTML characters, use:
$cleanString = filter_var($dirtyString, FILTER_SANITIZE_STRING);
More info:
function.filter-var
filter_sanitize_string

You may want take a look at htmlentities() and html_entity_decode() here
$orig = "I'll \"walk\" the <b>dog</b> now";
$a = htmlentities($orig);
$b = html_entity_decode($a);
echo $a; // I'll "walk" the <b>dog</b> now
echo $b; // I'll "walk" the <b>dog</b> now

This might work well to remove special characters.
$modifiedString = preg_replace("/[^a-zA-Z0-9_.-\s]/", "", $content);

If you want to convert the HTML special characters and not just remove them as well as strip things down and prepare for plain text this was the solution that worked for me...
function htmlToPlainText($str){
$str = str_replace(' ', ' ', $str);
$str = html_entity_decode($str, ENT_QUOTES | ENT_COMPAT , 'UTF-8');
$str = html_entity_decode($str, ENT_HTML5, 'UTF-8');
$str = html_entity_decode($str);
$str = htmlspecialchars_decode($str);
$str = strip_tags($str);
return $str;
}
$string = '<p>this is ( ) a test</p>
<div>Yes this is! & does it get "processed"? </div>'
htmlToPlainText($string);
// "this is ( ) a test. Yes this is! & does it get processed?"`
html_entity_decode w/ ENT_QUOTES | ENT_XML1 converts things like '
htmlspecialchars_decode converts things like &
html_entity_decode converts things like '<
and strip_tags removes any HTML tags left over.
EDIT - Added str_replace(' ', ' ', $str); and several other html_entity_decode() as continued testing has shown a need for them.

A plain vanilla strings way to do it without engaging the preg regex engine:
function remEntities($str) {
if(substr_count($str, '&') && substr_count($str, ';')) {
// Find amper
$amp_pos = strpos($str, '&');
//Find the ;
$semi_pos = strpos($str, ';');
// Only if the ; is after the &
if($semi_pos > $amp_pos) {
//is a HTML entity, try to remove
$tmp = substr($str, 0, $amp_pos);
$tmp = $tmp. substr($str, $semi_pos + 1, strlen($str));
$str = $tmp;
//Has another entity in it?
if(substr_count($str, '&') && substr_count($str, ';'))
$str = remEntities($tmp);
}
}
return $str;
}

What I have done was to use: html_entity_decode, then use strip_tags to removed them.

try this
<?php
$str = "\x8F!!!";
// Outputs an empty string
echo htmlentities($str, ENT_QUOTES, "UTF-8");
// Outputs "!!!"
echo htmlentities($str, ENT_QUOTES | ENT_IGNORE, "UTF-8");
?>

It looks like what you really want is:
function xmlEntities($string) {
$translationTable = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES);
foreach ($translationTable as $char => $entity) {
$from[] = $entity;
$to[] = '&#'.ord($char).';';
}
return str_replace($from, $to, $string);
}
It replaces the named-entities with their number-equivalent.

<?php
function strip_only($str, $tags, $stripContent = false) {
$content = '';
if(!is_array($tags)) {
$tags = (strpos($str, '>') !== false
? explode('>', str_replace('<', '', $tags))
: array($tags));
if(end($tags) == '') array_pop($tags);
}
foreach($tags as $tag) {
if ($stripContent)
$content = '(.+</'.$tag.'[^>]*>|)';
$str = preg_replace('#</?'.$tag.'[^>]*>'.$content.'#is', '', $str);
}
return $str;
}
$str = '<font color="red">red</font> text';
$tags = 'font';
$a = strip_only($str, $tags); // red text
$b = strip_only($str, $tags, true); // text
?>

The function I used to perform the task, joining the upgrade made by schnaader is:
mysql_real_escape_string(
preg_replace_callback("/&#?[a-z0-9]+;/i", function($m) {
return mb_convert_encoding($m[1], "UTF-8", "HTML-ENTITIES");
}, strip_tags($row['cuerpo'])))
This function removes every html tag and html symbol, converted in UTF-8 ready to save in MySQL

You can try htmlspecialchars_decode($string). It works for me.
http://www.w3schools.com/php/func_string_htmlspecialchars_decode.asp

If you are working in WordPress and are like me and simply need to check for an empty field (and there are a copious amount of random html entities in what seems like a blank string) then take a look at:
sanitize_title_with_dashes( string $title, string $raw_title = '', string $context = 'display' )
Link to wordpress function page
For people not working on WordPress, I found this function REALLY useful to create my own sanitizer, take a look at the full code and it's really in depth!

$string = "äáčé";
$convert = Array(
'ä'=>'a',
'Ä'=>'A',
'á'=>'a',
'Á'=>'A',
'à'=>'a',
'À'=>'A',
'ã'=>'a',
'Ã'=>'A',
'â'=>'a',
'Â'=>'A',
'č'=>'c',
'Č'=>'C',
'ć'=>'c',
'Ć'=>'C',
'ď'=>'d',
'Ď'=>'D',
'ě'=>'e',
'Ě'=>'E',
'é'=>'e',
'É'=>'E',
'ë'=>'e',
);
$string = strtr($string , $convert );
echo $string; //aace

What If By "Remove HTML Special Chars" You Meant "Replace Appropriately"?
After all, just look at your example...
& ©
If you're stripping this for an RSS feed, shouldn't you want the equivalents?
" ", &, ©
Or maybe you don't exactly want the equivalents. Maybe you'd want to have just be ignored (to prevent too much space), but then have © actually get replaced. Let's work out a solution that solves anyone's version of this problem...
How to SELECTIVELY-REPLACE HTML Special Chars
The logic is simple: preg_match_all('/(&#[0-9]+;)/' grabs all of the matches, and then we simply build a list of matchables and replaceables, such as str_replace([searchlist], [replacelist], $term). Before we do this, we also need to convert named entities to their numeric counterparts, i.e., " " is unacceptable, but "&#00A0;" is fine. (Thanks to it-alien's solution to this part of the problem.)
Working Demo
In this demo, I replace { with "HTML Entity #123". Of course, you can fine-tune this to any kind of find-replace you want for your case.
Why did I make this? I use it with generating Rich Text Format from UTF8-character-encoded HTML.
See full working demo:
Full Online Working Demo
function FixUTF8($args) {
$output = $args['input'];
$output = convertNamedHTMLEntitiesToNumeric(['input'=>$output]);
preg_match_all('/(&#[0-9]+;)/', $output, $matches, PREG_OFFSET_CAPTURE);
$full_matches = $matches[0];
$found = [];
$search = [];
$replace = [];
for($i = 0; $i < count($full_matches); $i++) {
$match = $full_matches[$i];
$word = $match[0];
if(!$found[$word]) {
$found[$word] = TRUE;
$search[] = $word;
$replacement = str_replace(['&#', ';'], ['HTML Entity #', ''], $word);
$replace[] = $replacement;
}
}
$new_output = str_replace($search, $replace, $output);
return $new_output;
}
function convertNamedHTMLEntitiesToNumeric($args) {
$input = $args['input'];
return preg_replace_callback("/(&[a-zA-Z][a-zA-Z0-9]*;)/",function($m){
$c = html_entity_decode($m[0],ENT_HTML5,"UTF-8");
# return htmlentities($c,ENT_XML1,"UTF-8"); -- see update below
$convmap = array(0x80, 0xffff, 0, 0xffff);
return mb_encode_numericentity($c, $convmap, 'UTF-8');
}, $input);
}
print(FixUTF8(['input'=>"Oggi è un bel giorno"]));
Input:
"Oggi è un bel giorno"
Output:
Oggi HTML Entity #232 un belHTML Entity #160giorno

Related

How to remove "<>" brackets from string in php?

$cont=htmlspecialchars(file_get_contents("https://myanimelist.net/anime/30276/One_Punch_Man"));
function getBetween($string, $start = "", $end = ""){
if (strpos($string, $start)) { // required if $start not exist in $string
$startCharCount = strpos($string, $start) + strlen($start);
$firstSubStr = substr($string, $startCharCount, strlen($string));
$endCharCount = strpos($firstSubStr, $end);
if ($endCharCount == 0) {
$endCharCount = strlen($firstSubStr);
}
return substr($firstSubStr, 0, $endCharCount);
} else {
return '';
}
}
$name=getBetween($cont,'title',' - MyAnimeList.net');
//$name=preg_replace('/[^a-zA-Z0-9 \p{L}]/m', '', $name);
preg_replace('/(*UTF8)[\>\<]/m', '', $name);
trim($name," ");
//$name=str_replace("gt", "", $name);
echo $name;
i want to find the text between title tags. how to do this?
for example in this page title contains 'One Punch Man - MyAnimeList.net' i want to get that
Just use string replace function:
$string = '<BoomBox>';
$string = str_replace('<', '', $string);
$string = str_replace('>', '', $string);
echo $string; // output: Boombox
http://php.net/manual/en/function.str-replace.php
You edited your answer, and we can now see you are dealing with XML/HTML. It's always better to work with the DOM classes. Never use regex! There is a famous Stack Overflow post explaining why never to parse html with regex. Try this solution instead:
<?php
$dom = new DOMDocument();
$dom->loadHTML('<title>BoomBox</title>');
echo $dom->getElementsByTagName('title')->item(0)->textContent;
http://php.net/manual/en/class.domdocument.php
http://php.net/manual/en/class.domnode.php
See it working here https://3v4l.org/EjPQd
You can use preg_replace();, or strip_tags();.
Example preg_replace();:
$str = '> One Punch Man';
$new = preg_replace('/[^a-zA-Z0-9 \p{L}]/m', '', $str);
echo $new;
Output: One Punch Man
Above example will only allow a-z, A-Z and 0-9. You can expand this.
Example strip_tags();:
$str = '<title> BoomBox </title>';
$another = strip_tags($str);
echo $another;
Output: BoomBox
Documentation:
http://php.net/manual/en/function.preg-replace.php // preg_replace();
http://php.net/manual/en/function.strip-tags.php // strip_tags();
You can also use a single call to str_replace with the ['<','>'] as the search argument:
$string = '<BoomBox>';
echo str_replace(['<', '>'], '', $string) . PHP_EOL;
// => Boombox
Or, you may use a regex with preg_replace (especially, if you plan on adding more restrictions for in-context matching to it):
echo preg_replace('~[<>]~', '', $string);
// => Boombox
See the PHP demo.

How can I pad each multibyte character / emoji with spaces around it in a string?

I'd like to pad each multibyte character with spaces on either side. I can strip them out just fine, but I'd like to leave them in and just pad them.
For example: 👉😀👈 to 👉 😀 👈.
Using underscores to represent spaces: 👉😀👈 to _👉__😀__👈_
Use this monsterous-already-cooked regex:
$regex = "[\\x{fe00}-\\x{fe0f}\\x{2712}\\x{2714}\\x{2716}\\x{271d}\\x{2721}\\x{2728}\\x{2733}\\x{2734}\\x{2744}\\x{2747}\\x{274c}\\x{274e}\\x{2753}-\\x{2755}\\x{2757}\\x{2763}\\x{2764}\\x{2795}-\\x{2797}\\x{27a1}\\x{27b0}\\x{27bf}\\x{2934}\\x{2935}\\x{2b05}-\\x{2b07}\\x{2b1b}\\x{2b1c}\\x{2b50}\\x{2b55}\\x{3030}\\x{303d}\\x{1f004}\\x{1f0cf}\\x{1f170}\\x{1f171}\\x{1f17e}\\x{1f17f}\\x{1f18e}\\x{1f191}-\\x{1f19a}\\x{1f201}\\x{1f202}\\x{1f21a}\\x{1f22f}\\x{1f232}-\\x{1f23a}\\x{1f250}\\x{1f251}\\x{1f300}-\\x{1f321}\\x{1f324}-\\x{1f393}\\x{1f396}\\x{1f397}\\x{1f399}-\\x{1f39b}\\x{1f39e}-\\x{1f3f0}\\x{1f3f3}-\\x{1f3f5}\\x{1f3f7}-\\x{1f4fd}\\x{1f4ff}-\\x{1f53d}\\x{1f549}-\\x{1f54e}\\x{1f550}-\\x{1f567}\\x{1f56f}\\x{1f570}\\x{1f573}-\\x{1f579}\\x{1f587}\\x{1f58a}-\\x{1f58d}\\x{1f590}\\x{1f595}\\x{1f596}\\x{1f5a5}\\x{1f5a8}\\x{1f5b1}\\x{1f5b2}\\x{1f5bc}\\x{1f5c2}-\\x{1f5c4}\\x{1f5d1}-\\x{1f5d3}\\x{1f5dc}-\\x{1f5de}\\x{1f5e1}\\x{1f5e3}\\x{1f5ef}\\x{1f5f3}\\x{1f5fa}-\\x{1f64f}\\x{1f680}-\\x{1f6c5}\\x{1f6cb}-\\x{1f6d0}\\x{1f6e0}-\\x{1f6e5}\\x{1f6e9}\\x{1f6eb}\\x{1f6ec}\\x{1f6f0}\\x{1f6f3}\\x{1f910}-\\x{1f918}\\x{1f980}-\\x{1f984}\\x{1f9c0}\\x{3297}\\x{3299}\\x{a9}\\x{ae}\\x{203c}\\x{2049}\\x{2122}\\x{2139}\\x{2194}-\\x{2199}\\x{21a9}\\x{21aa}\\x{231a}\\x{231b}\\x{2328}\\x{2388}\\x{23cf}\\x{23e9}-\\x{23f3}\\x{23f8}-\\x{23fa}\\x{24c2}\\x{25aa}\\x{25ab}\\x{25b6}\\x{25c0}\\x{25fb}-\\x{25fe}\\x{2600}-\\x{2604}\\x{260e}\\x{2611}\\x{2614}\\x{2615}\\x{2618}\\x{261d}\\x{2620}\\x{2622}\\x{2623}\\x{2626}\\x{262a}\\x{262e}\\x{262f}\\x{2638}-\\x{263a}\\x{2648}-\\x{2653}\\x{2660}\\x{2663}\\x{2665}\\x{2666}\\x{2668}\\x{267b}\\x{267f}\\x{2692}-\\x{2694}\\x{2696}\\x{2697}\\x{2699}\\x{269b}\\x{269c}\\x{26a0}\\x{26a1}\\x{26aa}\\x{26ab}\\x{26b0}\\x{26b1}\\x{26bd}\\x{26be}\\x{26c4}\\x{26c5}\\x{26c8}\\x{26ce}\\x{26cf}\\x{26d1}\\x{26d3}\\x{26d4}\\x{26e9}\\x{26ea}\\x{26f0}-\\x{26f5}\\x{26f7}-\\x{26fa}\\x{26fd}\\x{2702}\\x{2705}\\x{2708}-\\x{270d}\\x{270f}]|\\x{23}\\x{20e3}|\\x{2a}\\x{20e3}|\\x{30}\\x{20e3}|\\x{31}\\x{20e3}|\\x{32}\\x{20e3}|\\x{33}\\x{20e3}|\\x{34}\\x{20e3}|\\x{35}\\x{20e3}|\\x{36}\\x{20e3}|\\x{37}\\x{20e3}|\\x{38}\\x{20e3}|\\x{39}\\x{20e3}|\\x{1f1e6}[\\x{1f1e8}-\\x{1f1ec}\\x{1f1ee}\\x{1f1f1}\\x{1f1f2}\\x{1f1f4}\\x{1f1f6}-\\x{1f1fa}\\x{1f1fc}\\x{1f1fd}\\x{1f1ff}]|\\x{1f1e7}[\\x{1f1e6}\\x{1f1e7}\\x{1f1e9}-\\x{1f1ef}\\x{1f1f1}-\\x{1f1f4}\\x{1f1f6}-\\x{1f1f9}\\x{1f1fb}\\x{1f1fc}\\x{1f1fe}\\x{1f1ff}]|\\x{1f1e8}[\\x{1f1e6}\\x{1f1e8}\\x{1f1e9}\\x{1f1eb}-\\x{1f1ee}\\x{1f1f0}-\\x{1f1f5}\\x{1f1f7}\\x{1f1fa}-\\x{1f1ff}]|\\x{1f1e9}[\\x{1f1ea}\\x{1f1ec}\\x{1f1ef}\\x{1f1f0}\\x{1f1f2}\\x{1f1f4}\\x{1f1ff}]|\\x{1f1ea}[\\x{1f1e6}\\x{1f1e8}\\x{1f1ea}\\x{1f1ec}\\x{1f1ed}\\x{1f1f7}-\\x{1f1fa}]|\\x{1f1eb}[\\x{1f1ee}-\\x{1f1f0}\\x{1f1f2}\\x{1f1f4}\\x{1f1f7}]|\\x{1f1ec}[\\x{1f1e6}\\x{1f1e7}\\x{1f1e9}-\\x{1f1ee}\\x{1f1f1}-\\x{1f1f3}\\x{1f1f5}-\\x{1f1fa}\\x{1f1fc}\\x{1f1fe}]|\\x{1f1ed}[\\x{1f1f0}\\x{1f1f2}\\x{1f1f3}\\x{1f1f7}\\x{1f1f9}\\x{1f1fa}]|\\x{1f1ee}[\\x{1f1e8}-\\x{1f1ea}\\x{1f1f1}-\\x{1f1f4}\\x{1f1f6}-\\x{1f1f9}]|\\x{1f1ef}[\\x{1f1ea}\\x{1f1f2}\\x{1f1f4}\\x{1f1f5}]|\\x{1f1f0}[\\x{1f1ea}\\x{1f1ec}-\\x{1f1ee}\\x{1f1f2}\\x{1f1f3}\\x{1f1f5}\\x{1f1f7}\\x{1f1fc}\\x{1f1fe}\\x{1f1ff}]|\\x{1f1f1}[\\x{1f1e6}-\\x{1f1e8}\\x{1f1ee}\\x{1f1f0}\\x{1f1f7}-\\x{1f1fb}\\x{1f1fe}]|\\x{1f1f2}[\\x{1f1e6}\\x{1f1e8}-\\x{1f1ed}\\x{1f1f0}-\\x{1f1ff}]|\\x{1f1f3}[\\x{1f1e6}\\x{1f1e8}\\x{1f1ea}-\\x{1f1ec}\\x{1f1ee}\\x{1f1f1}\\x{1f1f4}\\x{1f1f5}\\x{1f1f7}\\x{1f1fa}\\x{1f1ff}]|\\x{1f1f4}\\x{1f1f2}|\\x{1f1f5}[\\x{1f1e6}\\x{1f1ea}-\\x{1f1ed}\\x{1f1f0}-\\x{1f1f3}\\x{1f1f7}-\\x{1f1f9}\\x{1f1fc}\\x{1f1fe}]|\\x{1f1f6}\\x{1f1e6}|\\x{1f1f7}[\\x{1f1ea}\\x{1f1f4}\\x{1f1f8}\\x{1f1fa}\\x{1f1fc}]|\\x{1f1f8}[\\x{1f1e6}-\\x{1f1ea}\\x{1f1ec}-\\x{1f1f4}\\x{1f1f7}-\\x{1f1f9}\\x{1f1fb}\\x{1f1fd}-\\x{1f1ff}]|\\x{1f1f9}[\\x{1f1e6}\\x{1f1e8}\\x{1f1e9}\\x{1f1eb}-\\x{1f1ed}\\x{1f1ef}-\\x{1f1f4}\\x{1f1f7}\\x{1f1f9}\\x{1f1fb}\\x{1f1fc}\\x{1f1ff}]|\\x{1f1fa}[\\x{1f1e6}\\x{1f1ec}\\x{1f1f2}\\x{1f1f8}\\x{1f1fe}\\x{1f1ff}]|\\x{1f1fb}[\\x{1f1e6}\\x{1f1e8}\\x{1f1ea}\\x{1f1ec}\\x{1f1ee}\\x{1f1f3}\\x{1f1fa}]|\\x{1f1fc}[\\x{1f1eb}\\x{1f1f8}]|\\x{1f1fd}\\x{1f1f0}|\\x{1f1fe}[\\x{1f1ea}\\x{1f1f9}]|\\x{1f1ff}[\\x{1f1e6}\\x{1f1f2}\\x{1f1fc}]";
Inside a preg_replace_callback():
var_dump(preg_replace_callback("#$regex#u", function($match) {
return $match[0]." ";
}, '👉😀👈'));
Outputs:
string(18) "👉 😀 👈 "
Live demo
I found this function that someone had added in the PHP docs that splits a multibyte string into an array of characters (like str_split) and modified it.
function addSpaces($string) {
$strlen = mb_strlen($string);
$new_string = '';
while ($strlen) {
$char = mb_substr($string,0,1,"UTF-8");
if (strlen($char) > 1) {
$new_string .= " $char ";
} else {
$new_string .= $char;
}
$string = mb_substr($string,1,$strlen,"UTF-8");
$strlen = mb_strlen($string);
}
return $new_string;
}
This question has other ways to do that split that could be similarly modified. The modification is, if strlen of one of the split characters is greater than 1, then it's multibyte, so add the spaces.
Simple regex replace could work as well...
mb_regex_encoding("UTF-8");
echo mb_ereg_replace(
'([^\p{L}\s])',
' \\1 ',
'text 👉😀👈 other text 👉😀👈'
);
outputs: text 👉 😀 👈 other text 👉 😀 👈
function pad_emojis($string) {
$default_encoding = mb_regex_encoding();
mb_regex_encoding("UTF-8");
$string = mb_ereg_replace('([^\p{L}\s])', ' \\1 ', $string);
mb_regex_encoding($default_encoding);
return $string;
}

Regular expression for finding multiple patterns from a given string

I am using regular expression for getting multiple patterns from a given string.
Here, I will explain you clearly.
$string = "about us";
$newtag = preg_replace("/ /", "_", $string);
print_r($newtag);
The above is my code.
Here, i am finding the space in a word and replacing the space with the special character what ever i need, right??
Now, I need a regular expression that gives me patterns like
about_us, about-us, aboutus as output if i give about us as input.
Is this possible to do.
Please help me in that.
Thanks in advance!
And finally, my answer is
$string = "contact_us";
$a = array('-','_',' ');
foreach($a as $b){
if(strpos($string,$b)){
$separators = array('-','_','',' ');
$outputs = array();
foreach ($separators as $sep) {
$outputs[] = preg_replace("/".$b."/", $sep, $string);
}
print_r($outputs);
}
}
exit;
You need to do a loop to handle multiple possible outputs :
$separators = array('-','_','');
$string = "about us";
$outputs = array();
foreach ($separators as $sep) {
$outputs[] = preg_replace("/ /", $sep, $string);
}
print_r($outputs);
You can try without regex:
$string = 'about us';
$specialChar = '-'; // or any other
$newtag = implode($specialChar, explode(' ', $string));
If you put special characters into an array:
$specialChars = array('_', '-', '');
$newtags = array();
foreach ($specialChars as $specialChar) {
$newtags[] = implode($specialChar, explode(' ', $string));
}
Also you can use just str_replace()
foreach ($specialChars as $specialChar) {
$newtags[] = str_replace(' ', $specialChar, $string);
}
Not knowing exactly what you want to do I expect that you might want to replace any occurrence of a non-word (1 or more times) with a single dash.
e.g.
preg_replace('/\W+/', '-', $string);
If you just want to replace the space, use \s
<?php
$string = "about us";
$replacewith = "_";
$newtag = preg_replace("/\s/", $replacewith, $string);
print_r($newtag);
?>
I am not sure that regexes are the good tool for that. However you can simply define this kind of function:
function rep($str) {
return array( strtr($str, ' ', '_'),
strtr($str, ' ', '-'),
str_replace(' ', '', $str) );
}
$result = rep('about us');
print_r($result);
Matches any character that is not a word character
$string = "about us";
$newtag = preg_replace("/(\W)/g", "_", $string);
print_r($newtag);
in case its just that... you would get problems if it's a longer string :)

Uppercase the first character of each word in a string except 'and', 'to', etc

How can I make upper-case the first character of each word in a string accept a couple of words which I don't want to transform them, like - and, to, etc?
For instance, I want this - ucwords('art and design') to output the string below,
'Art and Design'
is it possible to be like - strip_tags($text, '<p><a>') which we allow and in the string?
or I should use something else? please advise!
thanks.
None of these are really UTF8 friendly, so here's one that works flawlessly (so far)
function titleCase($string, $delimiters = array(" ", "-", ".", "'", "O'", "Mc"), $exceptions = array("and", "to", "of", "das", "dos", "I", "II", "III", "IV", "V", "VI"))
{
/*
* Exceptions in lower case are words you don't want converted
* Exceptions all in upper case are any words you don't want converted to title case
* but should be converted to upper case, e.g.:
* king henry viii or king henry Viii should be King Henry VIII
*/
$string = mb_convert_case($string, MB_CASE_TITLE, "UTF-8");
foreach ($delimiters as $dlnr => $delimiter) {
$words = explode($delimiter, $string);
$newwords = array();
foreach ($words as $wordnr => $word) {
if (in_array(mb_strtoupper($word, "UTF-8"), $exceptions)) {
// check exceptions list for any words that should be in upper case
$word = mb_strtoupper($word, "UTF-8");
} elseif (in_array(mb_strtolower($word, "UTF-8"), $exceptions)) {
// check exceptions list for any words that should be in upper case
$word = mb_strtolower($word, "UTF-8");
} elseif (!in_array($word, $exceptions)) {
// convert to uppercase (non-utf8 only)
$word = ucfirst($word);
}
array_push($newwords, $word);
}
$string = join($delimiter, $newwords);
}//foreach
return $string;
}
Usage:
$s = 'SÃO JOÃO DOS SANTOS';
$v = titleCase($s); // 'São João dos Santos'
since we all love regexps, an alternative, that also works with interpunction (unlike the explode(" ",...) solution)
$newString = preg_replace_callback("/[a-zA-Z]+/",'ucfirst_some',$string);
function ucfirst_some($match)
{
$exclude = array('and','not');
if ( in_array(strtolower($match[0]),$exclude) ) return $match[0];
return ucfirst($match[0]);
}
edit added strtolower(), or "Not" would remain "Not".
How about this ?
$string = str_replace(' And ', ' and ', ucwords($string));
You will have to use ucfirst and loop through every word, checking e.g. an array of exceptions for each one.
Something like the following:
$exclude = array('and', 'not');
$words = explode(' ', $string);
foreach($words as $key => $word) {
if(in_array($word, $exclude)) {
continue;
}
$words[$key] = ucfirst($word);
}
$newString = implode(' ', $words);
I know it is a few years after the question, but I was looking for an answer to the insuring proper English in the titles of a CMS I am programming and wrote a light weight function from the ideas on this page so I thought I would share it:
function makeTitle($title){
$str = ucwords($title);
$exclude = 'a,an,the,for,and,nor,but,or,yet,so,such,as,at,around,by,after,along,for,from,of,on,to,with,without';
$excluded = explode(",",$exclude);
foreach($excluded as $noCap){$str = str_replace(ucwords($noCap),strtolower($noCap),$str);}
return ucfirst($str);
}
The excluded list was found at:
http://www.superheronation.com/2011/08/16/words-that-should-not-be-capitalized-in-titles/
USAGE: makeTitle($title);

How can I shorten a very long string in PHP

I have a problem with a PHP breadcrumb function I am using, when the page name is very long, it overflows out of the box, which then looks really ugly.
My question is, how can I achieve this: "This is a very long string" to "This is..." with PHP?
Any other ideas on how I could handle this problem would also be appreciated, thanx in advance!
Here is the breadcrumb function:
function breadcrumbs() {
// Breadcrumb navigation
if (is_page() && !is_front_page() || is_single() || is_category()) {
echo '<ul class="breadcrumbs">';
echo '<li class="front_page">'.get_bloginfo('name').' <span style="color: #FFF;">»</span> </li>';
if (is_page()) {
$ancestors = get_post_ancestors($post);
if ($ancestors) {
$ancestors = array_reverse($ancestors);
foreach ($ancestors as $crumb) {
echo '<li>'.get_the_title($crumb).' <span style="color: #FFF;">»</span> </li>';
}
}
}
if (is_single()) {
$category = get_the_category();
echo '<li>'.$category[0]->cat_name.'</li>';
}
if (is_category()) {
$category = get_the_category();
echo '<li>'.$category[0]->cat_name.'</li>';
}
// Current page
if (is_page() || is_single()) {
echo '<li class="current">'.get_the_title().'</li>';
}
echo '</ul>';
} elseif (is_front_page()) {
// Front page
echo '<ul class="breadcrumbs">';
echo '<li class="front_page">'.get_bloginfo('name').'</li>';
echo '<li class="current">Home Page</li>';
echo '</ul>';
}
}
If you want a more nice (word limited) trucation you can use explode to split the string by spaces and then append each word (array entry) until you've reached your max limit
Something like:
define("MAX_LEN", 15);
$sentance = "Hello this is a long sentance";
$words = explode(' ', $sentance);
$newStr = "";
foreach($words as $word) {
if(strlen($newStr." ".$word) >= MAX_LEN) {
break;
}
$newStr = $newStr." ".$word;
}
If you are working with UTF-8 as charset, I suggest using the mb_strimwidth method as it is multibyte safe and won´t mess up multibyte chars. It also appends a placeholder string like ... automatically, with substr you´d have to do that in an additional step.
Usage sample:
echo mb_strimwidth("Hello World", 0, 10, "...", "UTF-8"); // .. or some other charset
// outputs Hello W...
You can safely use substr.
and eventually wordwrap() to break long words
$string = "This is a very long string";
$newString = substr( $string, 0, 7)."...";
// Output = This is...
Ideally, it should be done on the client side. You can use CSS/JS for the same.
Set this CSS property: text-overflow: ellipsis.
However, it will work only in IE. To use the same in Firefox as well, you can do something like this.
If you do not mind javascript plugins, use one of the jQuery ellipsis plugin.
Edit: These methods will work even when dealing with unicode, which can be a bit tricky if you try to handle this using php. (Like substr function)
Edit 2: If your problem is just the overflowing text and you do not mind not having the "..." at the end then it is even more simple. Simply, use the CSS: text-overflow: hidden;.
You can truncate the string at max length and then search for the last space:
Multibyte safe (Requires PHP > = 4.2)
function mb_TruncateString($string, $length = 40, $marker = "...")
{
if (mb_strlen($string) <= $length)
return $string;
// Trim at given length
$string = mb_substr($string, 0, $length);
// Get the text before the last space
if(mb_ereg("(.*)\s", $string, $matches))
$string = $matches[1];
return $string . $marker;
}
Following is not multibyte safe
function TruncateString($string, $length = 40, $marker = "...")
{
if (strlen($string) <= $length)
return $string;
// Trim at given length
$string = substr($string, 0, $length);
// Get the text before the last space
if(preg_match("/(.*)\s/i", $string, $matches))
$string = $matches[1];
return $string . $marker;
}
You're after a truncate function. This is what I use:
/**
* #param string $str
* #param int $length
* #return string
*/
function truncate($str, $length=100)
{
$str = substr($str, $length);
$words = explode(' ', $str); // separate words into an array
array_pop($words); // discard last item, as 9/10 times it's a partial word
$str = implode(' ', $words); // re-glue the string
return $str;
}
And usage:
echo truncate('This is a very long page name that will eventually be truncated', 15);

Categories