PHP - Substring after X characters with special-characters - php

Sorry for the title, I really didn't know how to say this...
I often have a string that needs to be cut after X characters. My problem is that this string often contains HTML entities such as &egrave;
So I'm wondering: is there a way in PHP, without transforming my string, to know whether cutting it would land in the middle of an entity?
Example
This is my string with a special char : &egrave; - and I want it to cut in the middle of the "&egrave;" but still keeping the string intact
So right now my result with a substring would be:
This is my string with a special char : &egra
but I want to have something like this:
This is my string with a special char : &egrave;

The best thing to do here is store your string as UTF-8 without any html entities, and use the mb_* family of functions with utf8 as the encoding.
But if your string is ASCII or ISO-8859-1/Windows-1252, you can use the special HTML-ENTITIES encoding of the mbstring extension (note that this encoding is deprecated as of PHP 8.2):
$s = 'This is my string with a special char : &egrave; - and I want it to cut in the middle of the "&egrave;" but still keeping the string intact';
echo mb_substr($s, 0, 40, 'HTML-ENTITIES');
echo mb_substr($s, 0, 41, 'HTML-ENTITIES');
However, if your underlying string is UTF-8 or some other multibyte encoding, using HTML-ENTITIES is not safe! This is because HTML-ENTITIES really means "win1252 with high-bit characters as html entities". This is an example of where this can go wrong:
// Assuming that é is in utf8:
mb_substr('é ', 0, 2, 'HTML-ENTITIES') === 'é'
// should be 'é '
When your string is in a multibyte encoding, you must instead convert all html entities to a common encoding before you split. E.g.:
$strings_actual_encoding = 'UTF-8';
$s_noentities = html_entity_decode($s, ENT_QUOTES, $strings_actual_encoding);
$s_trunc_noentities = mb_substr($s_noentities, 0, 41, $strings_actual_encoding);

The best solution would be to store your text as UTF-8 instead of storing it with HTML entities. Other than that, if you don't mind the count being off (&egrave; counts as one character instead of 8), the following snippet should work:
<?php
$string = 'This is my string with a special char : &egrave; - and I want it to cut in the middle of the "&egrave;" but still keeping the string intact';
// Decode the entities, cut by characters, then re-encode.
$cut_string = htmlentities(mb_substr(html_entity_decode($string, ENT_QUOTES, 'UTF-8'), 0, 45, 'UTF-8'), ENT_QUOTES, 'UTF-8')."<br><br>";
Note: If you use a different function to encode the text (e.g. htmlspecialchars()), use that function instead of htmlentities(), and its inverse (htmlspecialchars_decode()) instead of html_entity_decode(). If you use a custom encoding function, use it in place of htmlentities() and a custom function that reverses it in place of html_entity_decode().

The longest named entity in HTML 4 (&thetasym;) is 10 characters long, including the ampersand and semicolon; HTML5 defines longer names, so widen the window if you need to support them. If you intend to cut the string at X bytes, check bytes X-9 through X-1 for an ampersand. If the corresponding semicolon appears at byte X or later, cut the string after the semicolon instead of after byte X.
However, if you're willing to preprocess the string, Mike's solution will be more accurate because his cuts the string at X characters, not bytes.
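The byte-window check described above could be sketched roughly as follows. The function name cut_outside_entity(), its parameters, and the extra "does this look like an entity" test are assumptions made for illustration, not part of the original answer; the default of 10 is the entity length quoted above.

```php
<?php
// Hypothetical helper: cut at $limit bytes, but extend the cut to the closing
// semicolon when byte $limit falls inside an HTML entity.
function cut_outside_entity(string $text, int $limit, int $maxEntityLen = 10): string
{
    if (strlen($text) <= $limit) {
        return $text; // nothing to cut
    }
    $head = substr($text, 0, $limit);

    // Look for an ampersand in the last ($maxEntityLen - 1) bytes before the cut.
    $amp = strrpos($head, '&');
    if ($amp === false || $limit - $amp >= $maxEntityLen) {
        return $head;
    }
    // The entity (if any) is already closed before the cut.
    if (strpos($head, ';', $amp) !== false) {
        return $head;
    }
    // Find the first semicolon at or after the cut.
    $semi = strpos($text, ';', $limit);
    if ($semi === false) {
        return $head;
    }
    // Only extend the cut if the candidate really looks like an entity
    // (this guards against a bare "&" followed much later by a ";").
    $candidate = substr($text, $amp, $semi - $amp + 1);
    if (strlen($candidate) > $maxEntityLen || !preg_match('/^&#?[A-Za-z0-9]+;$/', $candidate)) {
        return $head;
    }
    return substr($text, 0, $semi + 1);
}

echo cut_outside_entity('This is my string with a special char : &egrave; - and more', 45), "\n";
```

Because this works on bytes, the result can be a few bytes longer than the requested limit whenever the cut lands inside an entity.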

You can use html_entity_decode() first to decode all the HTML entities. Then split your string. Then htmlentities() to re-encode the entities.
$decoded_string = html_entity_decode($original_string);
// implement logic to split string here
// then for each string part do the following:
$encoded_string_part = htmlentities($split_string_part);
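For the simple case of keeping only the first N characters, the decode / split / re-encode steps could be combined into one function. This is a minimal sketch that assumes the text is UTF-8 and that the mbstring extension is available; truncate_entities() and its parameters are hypothetical names.

```php
<?php
// Hypothetical helper: keep the first $chars characters, counting each
// HTML entity as a single character.
function truncate_entities(string $html, int $chars, string $encoding = 'UTF-8'): string
{
    $flags = ENT_QUOTES | ENT_HTML401;
    // 1. Turn entities into real characters.
    $decoded = html_entity_decode($html, $flags, $encoding);
    // 2. Cut by characters, not bytes.
    $cut = mb_substr($decoded, 0, $chars, $encoding);
    // 3. Re-encode the kept part.
    return htmlentities($cut, $flags, $encoding);
}

echo truncate_entities('This is my string with a special char : &egrave; - and more', 41), "\n";
```

Note that htmlentities() re-encodes every character that has a named entity, so characters that were stored literally in the input come back as entities too.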

A somewhat brute-force solution, which I'm not really happy with, would be a PCRE expression. Let's say that you want to keep 80 characters and that the longest possible HTML entity is 7 chars long:
$regex = '~^(.{73}([^&]{7}|.{0,7}$|[^&]{0,6}&[^;]+;))(.*)~mx';
// Note, this could return a slightly shorter text
return preg_replace($regex, '$1', $text);
Just so you know:
.{73} - 73 characters
[^&]{7} - okay, we may fill it with anything that doesn't contain &
.{0,7}$ - keep in mind the possible end (this shouldn't be necessary because shorter text wouldn't match at all)
[^&]{0,6}&[^;]+; - up to 6 characters (you'd be at 79th), then & and let it finish
Something that seems much better, but requires a bit of playing with numbers, is this:
// Text that is not longer than $N needs no cutting :)
if (strlen($text) <= $N) {
    return $text;
}
$head = substr($text, 0, $N);
// Get the last & inside the first $N bytes
$pos = strrpos($head, '&');
// We have to check this too (no entities at all) :)
if ($pos === false) {
    return $head;
}
// Okay, entity closed before the cut (; is after the last &)
if (strpos($head, ';', $pos) !== false) {
    return $head;
}
// The entity is still open at the cut, so find the first ; at or after $N
$end = strpos($text, ';', $N);
if ($end === false) {
    // Not valid HTML, unclosed entity: do whatever you want.
    // One option is to drop the dangling fragment.
    return substr($text, 0, $pos);
}
// Keep everything up to and including the semicolon
return substr($text, 0, $end + 1);
Check numbers, there may be +/-1 somewhere in indexes...

I think you would have to use a combination of strpos and strrpos to find the next and previous spaces, parse the text between the spaces, check that against a known list of special characters, and if it matches, extend your "cut" to the position of the next space. If you had a code sample of what you have now, we could give you a better answer.

Related

Chop() is chopping more than what is asked

I have multiple random strings, and I'm trying to pull "SpottedBlanket" out of the string. Some of them work fine:
DarkBaySpottedBlanket --
DarkBay
BaySpottedBlanket --
Bay
but others are cutting out more than it should.
RedRoanSpottedBlanket --
RedR
BlackSpottedBlanket --
Blac
DunSpottedBlanket --
Du
this is the code I'm using, but I thought it would be self explanatory:
$AppyShortcut = chop($AppyColor,"SpottedBlanket");
$AppyColor would obviously be the random generated string. Any clue why this is happening?
The chop function takes the string in the second argument - which in this case is "SpottedBlanket", and removes any contiguous characters that it finds from the right hand side.
So for the case of "RedRoanSpottedBlanket", you'd get back "RedR" because "o", "a", and "n" are letters that can be found in the string "SpottedBlanket".
chop() is usually used to remove trailing white space - a way of cleaning user input before performing some action on it.
Given your array:
$strings = ["DarkBaySpottedBlanket", "RedRoanSpottedBlanket", "BlackSpottedBlanket", "DunSpottedBlanket"];
What you might be looking for is something like this:
foreach ($strings as $string) {
print substr($string, 0, strrpos($string, "SpottedBlanket")) . "\n";
}
This finds the position of the string from the end using strrpos(), then returns the start of the string until that position, using substr().
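To see the character-list behaviour in isolation, here is a small sketch. rtrim() is the function that chop() aliases; remove_suffix() is a hypothetical helper that strips the suffix only when the string really ends with it.

```php
<?php
// The second argument is a list of characters, not a substring:
// trailing "o", "a" and "n" are stripped too, because they occur in the list.
$stripped = rtrim('RedRoanSpottedBlanket', 'SpottedBlanket');
echo $stripped, "\n";

// Hypothetical helper: remove $suffix only if $value ends with it.
function remove_suffix(string $value, string $suffix): string
{
    $len = strlen($suffix);
    if ($len > 0 && substr($value, -$len) === $suffix) {
        return substr($value, 0, strlen($value) - $len);
    }
    return $value;
}

foreach (['DarkBaySpottedBlanket', 'RedRoanSpottedBlanket', 'BlackSpottedBlanket', 'DunSpottedBlanket'] as $color) {
    echo remove_suffix($color, 'SpottedBlanket'), "\n";
}
```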

PHP detect variable length string contains any character other than 1

Using PHP I sometimes have strings that look like the following:
111
110
011
1111
0110012
What is the most efficient way (preferably without regex) to determine if a string contains any character other than the character 1?
Here's a one-line code solution that can be put into a conditional etc.:
strlen(str_replace('1','',$mystring))==0
It strips out the "1"s and sees if there's anything left.
User Don't Panic commented that str_replace could be replaced by trim:
strlen(trim($mystring, '1'))==0
which removes leading and trailing 1s and sees if there's anything left. This would work for the particular case in OP's request but the first option will also tell you how many non-"1" characters you have (if that information matters). Depending on implementation, trim might run slightly faster because PHP doesn't have to check any characters between the first and last non-"1" characters.
You could also use a string like a character array and iterate through from the beginning until you find a character which is not =='1' (in which case, return true) or reach the end of the array (in which case, return false).
Finally, though OP here said "preferably without regex," others open to regexes might use one:
preg_match("/[^1]/", $mystring)==1
Another way to do it:
if (base_convert($string, 2, 2) === $string) {
// $string has only 0 and 1 characters.
}
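Since your $string is basically a binary number, you can check it with base_convert(). Two caveats: base_convert() drops leading zeros, so a string such as '011' comes back as '11' and fails the comparison, and since PHP 7.4 it emits a deprecation notice when the input contains invalid characters.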
How it works:
var_dump(base_convert('110', 2, 2)); // 110
var_dump(base_convert('11503', 2, 2)); // 110
var_dump(base_convert('9111111111111111111110009', 2, 2)); // 11111111111111111111000
If the value returned by base_convert() differs from the input, the string contains characters other than 0 and 1.
If you want to check whether the string contains only 1 characters:
if(array_sum(str_split($string)) === strlen($string)) {
// $string has only 1 characters.
}
You split the string into single digits with str_split() and sum them with array_sum(). If the result differs from the length of the string, the string contains a digit other than 1. Note that the reverse does not always hold: '201' also sums to its own length, so this check can give false positives.
Another option is to treat the string like an array of characters and look for anything that is not 1. If one is found, break out of the for loop:
for ($i = 0; $i < strlen($mystring); $i++) {
if ($mystring[$i] != '1') {
echo 'FOUND!';
break;
}
}
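Another regex-free option, not mentioned in the answers above, is strspn(), which measures the leading run of characters taken from a given set. The sketch below wraps it in a hypothetical has_non_one() helper:

```php
<?php
// Hypothetical helper: true when $value contains any character other than '1'.
function has_non_one(string $value): bool
{
    // strspn() returns the length of the initial segment made up only of '1';
    // if that segment is shorter than the string, something else is in there.
    return strspn($value, '1') !== strlen($value);
}

foreach (['111', '110', '011', '1111', '0110012'] as $candidate) {
    echo $candidate, ' => ', has_non_one($candidate) ? 'has other characters' : 'only 1s', "\n";
}
```

Like the loop above, it stops scanning at the first character that is not 1.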

how to use similar text php code in arabic

Trying to use php similar_text() with arabic, but it's not working.
However it works great with english.
<?php
$var = similar_text("ياسر", "عمار", $per);
echo $var;
?>
output: 5
That's the wrong result; it should be 2. Is there a similar_text() that works with Arabic letters?
Here's one I'm using
//from http://www.phperz.com/article/14/1029/31806.html
function mb_split_str($str) {
preg_match_all("/./u", $str, $arr);
return $arr[0];
}
//based on http://www.phperz.com/article/14/1029/31806.html, added percent
function mb_similar_text($str1, $str2, &$percent) {
$arr_1 = array_unique(mb_split_str($str1));
$arr_2 = array_unique(mb_split_str($str2));
$similarity = count($arr_2) - count(array_diff($arr_2, $arr_1));
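// Use character counts (mb_strlen), not byte counts (strlen), for the percentage
$percent = ($similarity * 200) / (mb_strlen($str1, 'UTF-8') + mb_strlen($str2, 'UTF-8'));
return $similarity;
}
So
$var = mb_similar_text('عمار', 'ياسر', $per);
output: $var = 2, $per = 50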
Because Arabic text consists of multibyte characters, byte-oriented PHP functions such as similar_text() do not give the expected result.
echo(strlen("عمار"));
The above code outputs: 8
echo(mb_strlen("عمار", "UTF-8"));
Using the mb_strlen function with the UTF-8 encoding specified, the output is: 4 (the correct number of characters).
You can use the mb_ functions to make your own version of the similar_text function: http://php.net/manual/en/ref.mbstring.php
Just for the record, I want to clarify how similar_text() behaves when it is given multi-byte strings (including Arabic text).
The function simply treats each byte of the input string as an individual character, which means it supports neither multi-byte characters nor Unicode.
The byte streams of the عمار and ياسر strings can be represented as follows. Each character is shown as its two-byte code point (as in UTF-16BE; the actual UTF-8 bytes differ, but the count comes out the same). Bytes are in hexadecimal, separated by a dot, with a colon marking the end of each character:
06.39:06.45:06.27:06.31 <-- Byte stream for عمار
|| || || || ||
06.4A:06.27:06.33:06.31 <-- Byte stream for ياسر
As you can tell, five bytes match, which is why the function returns 5 in this case (every two hexadecimal digits represent one byte).
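The byte-wise behaviour can also be checked directly on UTF-8 strings. The sketch below assumes UTF-8 and the mbstring extension; the strings are written with \u{...} escapes so the result does not depend on how the file is saved.

```php
<?php
$a = "\u{0639}\u{0645}\u{0627}\u{0631}"; // عمار
$b = "\u{064A}\u{0627}\u{0633}\u{0631}"; // ياسر

// Raw UTF-8 bytes of each string, two hex digits per byte.
echo bin2hex($a), "\n"; // d8b9d985d8a7d8b1
echo bin2hex($b), "\n"; // d98ad8a7d8b3d8b1

echo strlen($a), "\n";             // 8 bytes
echo mb_strlen($a, 'UTF-8'), "\n"; // 4 characters

// similar_text() compares bytes, so shared lead bytes count as matches.
echo similar_text($b, $a), "\n";   // 5
```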

PHP Strip String, Convert to int

I have a string $special which is formatted like £130.00 and is an ex-tax (VAT) price.
I need to strip the first character so I can run some simple addition.
$str= substr($special, 1, 0); // Strip first char '£'
echo $str ; // Echo Value to check its worked
$endPrice = (0.20*$str)+$str ; // Work out VAT
I don't receive any value when I echo on the second line. Also, would I then need to convert the string to an integer in order to run the addition?
Thanks
Matt
+++ UPDATE
Thanks for your help with this. I took your code and added some of my own. There are more than likely nicer ways to do this, but it works :) I found out that a price below 1000 looks like £130.00, while a larger value includes a thousands separator, i.e. £1,400.22.
$str = str_replace('£', '', $price);
$str2 = str_replace(',', '', $str);
$vatprice = (0.2 * $str2) + $str2;
$display_vat_price = sprintf('%0.2f', $vatprice);
echo "£";
echo $display_vat_price ;
echo " (Inc VAT)";
Thanks again, Matt
You cannot use substr() the way you are currently using it. You are trying to remove the £ character, which takes two bytes in UTF-8, and substr() works on bytes rather than characters. You can either use $str = substr($string, 2) or, better, str_replace() like this:
$string = '£130.00';
$str = str_replace('£', '', $string);
echo (0.2 * $str) + $str; // 156
Original answer
I'll keep this version as it can still give some insight. The answer would be fine if £ were not a two-byte character in UTF-8. Knowing this, you can still use it, but you need to start the substring at offset 2 instead of 1.
Your usage of substr is wrong. It should be:
$str = substr($special, 1);
Check the documentation: the third parameter is the length of the substring. You passed 0, so you got an empty string. If you omit the third parameter, substr() returns everything from the offset given in the second parameter to the end of the original string.
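Putting the pieces from this thread together, a small helper could strip the currency sign and thousands separator, apply the VAT, and format the result. This is only a sketch: price_with_vat() and its $rate parameter are hypothetical names, and it assumes the source file is saved as UTF-8.

```php
<?php
// Hypothetical helper: "£1,400.22" (ex VAT) -> "£1,680.26" (inc VAT at 20%).
function price_with_vat(string $price, float $rate = 0.20): string
{
    // Drop the currency sign and thousands separators, then convert to float.
    $number = (float) str_replace(['£', ','], '', $price);
    // Two decimals, "." as decimal point, "," as thousands separator.
    return '£' . number_format($number * (1 + $rate), 2, '.', ',');
}

echo price_with_vat('£130.00'), " (Inc VAT)\n";
echo price_with_vat('£1,400.22'), " (Inc VAT)\n";
```

The explicit (float) cast answers the conversion question: PHP would also convert a numeric string automatically in arithmetic, but the cast makes the intent visible.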

PHP - How do I generate strings with control characters or binary data?

For testing purposes I need strings such as:
"test\x00string"
I would like to loop over the control characters (00-1F) and generate the strings automatically so I don't have to clutter my code with 32 lines like this, but I don't know how to do that in PHP.
Also for testing malformed utf I might want to insert other byte sequences into strings.
For certain characters there are predefined escape sequences, which can be used in double quotes:
$nullByte = "\0";
However, if you're gonna loop, your best bet would be chr():
$string = '';
foreach (range( 0x00, 0x1F ) as $i)
{
$string .= chr($i);
}
And as a one-liner:
$string = implode('', array_map('chr', range(0x00, 0x1F)));
$nullByte = chr(0);
You can just concatenate bytes to make a multibyte string.
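As a sketch of how this could be applied to the original request, the loop below builds one test string per control character and then two malformed UTF-8 samples. The variable names and the chosen invalid byte sequences are illustrative assumptions, and mb_check_encoding() requires the mbstring extension.

```php
<?php
// One test string per control character, keyed by its code (0x00-0x1F).
$samples = [];
foreach (range(0x00, 0x1F) as $code) {
    $samples[$code] = 'test' . chr($code) . 'string';
}

// Malformed UTF-8, built by concatenating raw bytes:
// 0x80 is a continuation byte with no lead byte before it.
$loneContinuation = 'abc' . chr(0x80) . 'def';
// 0xC3 starts a two-byte sequence, but nothing follows it.
$truncatedSequence = 'abc' . chr(0xC3);

var_dump(mb_check_encoding($loneContinuation, 'UTF-8'));
var_dump(mb_check_encoding($truncatedSequence, 'UTF-8'));
```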
