I am having trouble with a preg_match_all on a string that contains a degree symbol. The sample of code is below.
//Sample data
$x = "<array_0>
<id>text-21650</id>
<text>Lat/Long 38° 57' 34 N, 106° 21' 38 W</text>
</array_0>";
$reels = '/<(\w+)\s*([^\/>]*)\s*(?:\/>|>(.*)<\/\s*\\1\s*>)/s';
preg_match_all($reels, $x, $elements);
foreach ($elements[1] as $ie => $xx) {
$name = $elements[1][$ie];
$cdend = strpos($elements[3][$ie], "<");
if ($cdend > 0) {
$xmlary[$name] = substr($elements[3][$ie], 0, $cdend - 1);
}
if (preg_match($reels, $elements[3][$ie]))
$xmlary[$name] = processEl($elements[3][$ie]);
else if ($elements[3][$ie] !== null) {
$xmlary[$name] = $elements[3][$ie];
}
}
For some reason it doesn't work properly with the degree symbols in there. If I take them out, it works. I would really like to find a way to keep them in there without changing them. I am also wondering whether other extended characters could cause a problem too.
Any help would be greatly appreciated.
Thanks
Have a look at this previous answer on StackOverflow.
Basically you will have to switch to Unicode matching.
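As a minimal sketch of what Unicode matching looks like with the preg functions: the u modifier makes PCRE treat both the pattern and the subject as UTF-8, so the degree sign is handled as one character rather than two bytes. The sample text is taken from the question; the pattern and the $matches name are only for illustration, and the file is assumed to be saved as UTF-8.

```php
<?php
// Assumes this source file and the input string are UTF-8 encoded.
$text = "Lat/Long 38° 57' 34 N, 106° 21' 38 W";

// \x{00B0} is the degree sign; the trailing u switches PCRE to UTF-8 mode.
preg_match_all('/(\d+)\x{00B0}/u', $text, $matches);

// $matches[1] holds the numbers that directly precede a degree sign.
print_r($matches[1]);
```

The same modifier can be appended to an existing pattern, for example by ending it with /su instead of /s.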
Use mb_ereg_match instead to support UTF-8 chars. Docs:
http://php.net/manual/en/book.mbstring.php
Initialize mb* like this:
mb_regex_encoding('UTF-8'); mb_internal_encoding('UTF-8');
I had the same problem, and this other post from Stack Overflow helped me. Basically, to look for a degree symbol, you'd use \x{00B0} together with the u (UTF-8) modifier, i.e.
preg_match_all('/\x{00B0}/u', $x, $elements);
I just wanted to share my experience of needing a language-independent version of ucfirst.
The problem arises when you mix English text with Japanese, Chinese or other languages (in my case sometimes Swedish, with ÅÄÖ): the traditional ucfirst has trouble capitalizing such strings.
Some time ago, however, I stumbled across the following code snippet here on Stack Overflow:
function myucfirst($str) {
$fc = mb_strtoupper(mb_substr($str, 0, 1));
return $fc.mb_substr($str, 1);
}
It works fine in most cases, but recently I also needed to autogenerate translated texts in dynamic PDFs using TCPDF.
This is where I was stumped as to why TCPDF had issues with the text. I had no problems anywhere else and the character encoding was UTF-8, but it still broke.
For Japanese kanji I simply skipped the capitalization function above, but then with Swedish I hit the same failure when I needed to capitalize ÅÄÖ.
That led me to realize that the problem with the function above is that it only looks at the first character. In UTF-8, each of Å, Ä and Ö takes up 2 bytes and Chinese or Japanese kanji take up 3 bytes; the function above did not account for that, which broke TCPDF.
To give more context: when generating PDF documents with TCPDF, the font handling ends up with errors because the general mb_string function turns the first character into "?�"vrigt for the Swedish word Övrigt and, for instance, "?��"のととろ for the Japanese 隣のトトロ (My Neighbour Totoro). The font lookup for the � then fails. For ÅÄÖ you need to convert the first two bytes, substr($str, 0, 2), for the letter to be converted properly.
Also, in case the code examples I gave are not visible: since neither Chinese nor Japanese uses upper-case letters, I am excluding every character that takes 3 bytes, because those have no upper or lower case at all. I don't really want to exclude them, but passing them through the mb_string functions leads to similar errors in TCPDF, so my examples are a workaround for now unless someone has a better solution.
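To make the byte versus character distinction concrete, here is a small sketch. It assumes the source file is saved as UTF-8 and that the mbstring extension is loaded; the variable names are only for illustration. Passing the encoding explicitly to the mb_* calls avoids depending on whatever mb_internal_encoding() happens to be set to.

```php
<?php
// Assumes UTF-8 source and the mbstring extension.
$word = 'övrigt';

$bytes = strlen('ö');              // counts bytes
$chars = mb_strlen('ö', 'UTF-8');  // counts characters

// Split on a character boundary, not a byte boundary.
$first = mb_substr($word, 0, 1, 'UTF-8');
$rest  = mb_substr($word, 1, null, 'UTF-8');
$capitalized = mb_strtoupper($first, 'UTF-8') . $rest;

var_dump($bytes, $chars, $capitalized);
```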
So my approach was to solve the above problem with the following function.
function myucfirst($str) {
if ($str[0] !== "?"){
for($i = 1; $i <= 3; $i++){
$first = substr($str, 0, $i);
$first = mb_convert_case($first, MB_CASE_UPPER, "UTF-8");
if ($first !== '?'){
$rest = substr($str, $i);
break;
}
}
if ($i < 3){
$ret_string = $first . $rest;
} else {
$ret_string = $str;
}
} else {
$ret_string = $str;
}
return $ret_string;
}
Thanks to Steven Penny's help below, this is the solution that works with both Swedish and Japanese/Chinese special characters, even when the string is passed to the TCPDF library for dynamically creating PDFs:
function myucfirst($str) {
$ret_string = mb_convert_case($str, MB_CASE_TITLE, 'UTF-8');
return $ret_string;
}
and the following is a similar fix for ucwords:
function myucwords($str) {
    // Split on single spaces, capitalize each word, then rejoin
    $words = explode(' ', trim($str));
    return implode(' ', array_map('myucfirst', $words));
}
The myucwords function uses myucfirst to capitalize each word.
Since I am not that experienced as a developer or as a Stack Overflow contributor, I would really appreciate hearing of better ways to write these functions. You should be able to see 3 code examples above; for now, for those who have a similar problem, please enjoy!
/Chris
The examples you gave are poor, as with Övrigt the input is exactly the same as the output. So I modified the examples so they can be useful. See below:
<?php
# example 1
$s1 = mb_convert_case('åäö', MB_CASE_TITLE);
# example 2
$s2 = mb_convert_case('övrigt', MB_CASE_TITLE);
# example 3
$s3 = mb_convert_case('隣のトトロ', MB_CASE_TITLE);
# print
var_dump($s1 == 'Åäö', $s2 == 'Övrigt', $s3 == '隣のトトロ');
Note you will need this in your php.ini, if it's not already there:
extension = mbstring
https://php.net/function.mb-convert-case
My aim is to validate a last name by allowing it to only contain letters or a single quote.
I do not know what the fastest way is; maybe a regex, I suppose.
Anyway, so far I have this:
function check_surname($surname)
{
$c = str_split($surname,1);
$i = 0;
$test = 1; // Wrong surname
while($i < strlen($surname))
{
if(ctype_alpha($c[$i]) or $c[$i] == '\'')
{
$test = 0;
$i++;
}
else
{
return false;
}
}
}
I can feel that something is wrong here but I can't see where it is.
Could anyone help me out?
There are some good suggestions in the comments, and I definitely agree with @Cyclone that you should take into account diacritics (accented letters).
Fortunately, PHP regexes support Unicode classes, so this is easy to do. Unicode includes a class L for any letter (uppercase, lowercase, modified, and title case). This will allow accented letters in the name.
I would also recommend that you allow for dashes (Katherine Zeta-Jones) and spaces (Guido van Rossum). Given all that, I would use the following regex:
preg_match("/^[\p{L} '-]+$/u", $lname);
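Wrapped into the function from the question, that could look like the sketch below. This is only one way to write it: it uses the u modifier so that \p{L} sees UTF-8 characters rather than single bytes, and it returns a real boolean, which the original loop never did for a valid name.

```php
<?php
// Hypothetical rewrite of check_surname() around the regex above.
function check_surname(string $surname): bool
{
    // \p{L} = any Unicode letter; u = treat pattern and input as UTF-8.
    // Spaces, apostrophes and hyphens are also accepted.
    return preg_match("/^[\p{L} '-]+$/u", $surname) === 1;
}

var_dump(check_surname("O'Brien"));
var_dump(check_surname('Åström'));
var_dump(check_surname('Smith3'));
```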
I'm struggling to find the best way to do this. Basically I am provided strings that are like this with the task of printing out the string with the math parsed.
Jack has a [0.8*100]% chance of passing the test. Katie has a [(0.25 + 0.1)*100]% chance.
The mathematical equations are always encapsulated by square brackets. Why I'm dealing with strings like this is a long story, but I'd really appreciate the help!
There are plenty of math evaluation libraries for PHP. A quick web search turns up this one.
Writing your own parser is also an option, and if it's just basic arithmetic it shouldn't be too difficult. With the resources out there, I'd stay away from this.
You could take a simpler approach and use eval. Be careful to sanitize your input first. On the eval docs page, there are comments with code to do that. Here's one example:
Disclaimer: I know eval is just a misspelling of evil, and it's a horrible horrible thing, and all that. If used right, it has uses, though.
<?php
$test = '2+3*pi';
// Remove whitespaces
$test = preg_replace('/\s+/', '', $test);
$number = '(?:\d+(?:[,.]\d+)?|pi|π)'; // What is a number
$functions = '(?:sinh?|cosh?|tanh?|abs|acosh?|asinh?|atanh?|exp|log10|deg2rad|rad2deg|sqrt|ceil|floor|round)'; // Allowed PHP functions
$operators = '[+\/*\^%-]'; // Allowed math operators
$regexp = '/^(('.$number.'|'.$functions.'\s*\((?1)+\)|\((?1)+\))(?:'.$operators.'(?2))?)+$/'; // Final regexp, heavily using recursive patterns
if (preg_match($regexp, $test))
{
$test = preg_replace('!pi|π!', 'pi()', $test); // Replace pi with pi function
eval('$result = '.$test.';');
}
else
{
$result = false;
}
?>
preg_match_all('/\[(.*?)\]/', $string, $out);
foreach ($out[1] as $k => $v)
{
eval("\$result = $v;");
$string = str_replace($out[0][$k], $result, $string);
}
This code is highly dangerous if the strings are user input, because it allows arbitrary code to be executed.
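One way to reduce that risk is to refuse anything that is not plain arithmetic before it reaches eval. The sketch below is an illustration under that assumption, not a vetted sanitizer: render_bracket_math is a made-up name, only digits, whitespace, parentheses, a decimal point and + - * / are accepted, and it relies on PHP 7 or later, where parse errors inside eval can be caught.

```php
<?php
// Hypothetical helper: evaluates each [expression] found in $text.
function render_bracket_math(string $text): string
{
    return preg_replace_callback('/\[(.*?)\]/', function (array $m): string {
        $expr = $m[1];

        // Whitelist: digits, whitespace, parentheses, decimal point, + - * /
        if (!preg_match('/^[0-9\s().+\-*\/]+$/', $expr)) {
            return $m[0]; // leave anything unexpected untouched
        }

        try {
            $value = eval('return ' . $expr . ';');
        } catch (\Throwable $e) {
            return $m[0]; // malformed arithmetic, e.g. unbalanced parentheses
        }

        return (string) $value;
    }, $text);
}

echo render_bracket_math(
    'Jack has a [0.8*100]% chance of passing the test. Katie has a [(0.25 + 0.1)*100]% chance.'
), "\n";
```

Brackets containing letters or any other characters are left exactly as they were, so ordinary bracketed text in the string is not disturbed.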
The eval approach updated from PHP doc examples.
<?php
function calc($equation)
{
// Remove whitespaces
$equation = preg_replace('/\s+/', '', $equation);
echo "$equation\n";
$number = '((?:0|[1-9]\d*)(?:\.\d*)?(?:[eE][+\-]?\d+)?|pi|π)'; // What is a number
$functions = '(?:sinh?|cosh?|tanh?|acosh?|asinh?|atanh?|exp|log(10)?|deg2rad|rad2deg|sqrt|pow|abs|intval|ceil|floor|round|(mt_)?rand|gmp_fact)'; // Allowed PHP functions
$operators = '[\/*\^\+-,]'; // Allowed math operators
$regexp = '/^([+-]?('.$number.'|'.$functions.'\s*\((?1)+\)|\((?1)+\))(?:'.$operators.'(?1))?)+$/'; // Final regexp, heavily using recursive patterns
if (preg_match($regexp, $equation))
{
$equation = preg_replace('!pi|π!', 'pi()', $equation); // Replace pi with pi function
echo "$equation\n";
eval('$result = '.$equation.';');
}
else
{
$result = false;
}
return $result;
}
?>
Sounds like your homework... but whatever.
You need to use string manipulation. PHP has a lot of built-in functions, so you're in luck. Check out the explode() function for sure, and str_split().
Here is a full list of functions specifically related to strings: http://www.w3schools.com/php/php_ref_string.asp
Good Luck.
Arrrgh. Does anyone know how to create a function that's the multibyte character equivalent of the PHP count_chars($string, 3) command?
Such that it will return a list of ONLY ONE INSTANCE of each unique character. If that was English and we had
"aaabggxxyxzxxgggghq xcccxxxzxxyx"
It would return "abgh qxyz" (Note the space IS counted).
(The order isn't important in this case, can be anything).
If Japanese kanji (not sure browsers will all support this):
漢漢漢字漢字私私字私字漢字私漢字漢字私
And it will return just the 3 kanji used:
漢字私
It needs to work on any UTF-8 encoded string.
Hey Dave, you're never going to see this one coming.
php > $kanji = '漢漢漢字漢字私私字私字漢字私漢字漢字私';
php > $not_kanji = 'aaabcccbbc';
php > $pattern = '/(.)\1+/u';
php > echo preg_replace($pattern, '$1', $kanji);
漢字漢字私字私字漢字私漢字漢字私
php > echo preg_replace($pattern, '$1', $not_kanji);
abcbc
What, you thought I was going to use mb_substr again?
In regex-speak, it's looking for any one character, then one or more instances of that same character. The matched region is then replaced with the one character that matched.
The u modifier turns on UTF-8 mode in PCRE, in which it deals with UTF-8 sequences instead of 8-bit characters. As long as the string being processed is UTF-8 already and PCRE was compiled with Unicode support, this should work fine for you.
Hey, guess what!
$not_kanji = 'aaabbbbcdddbbbbccgggcdddeeedddaaaffff';
$l = mb_strlen($not_kanji);
$unique = array();
for($i = 0; $i < $l; $i++) {
$char = mb_substr($not_kanji, $i, 1);
if(!array_key_exists($char, $unique))
$unique[$char] = 0;
$unique[$char]++;
}
echo join('', array_keys($unique));
This uses the same general trick as the shuffle code. We grab the length of the string, then use mb_substr to extract it one character at a time. We then use that character as a key in an array. We're taking advantage of the fact that PHP arrays are ordered: keys stay in the order in which they were first inserted. Once we've gone through the string and identified all of the characters, we grab the keys and join them back together in the order in which they first appeared in the string. You also get a per-character count from this technique.
This would have been much easier if there was such a thing as mb_str_split to go along with str_split.
(No Kanji example here, I'm experiencing a copy/paste bug.)
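For what it's worth, PHP 7.4 and later do ship mb_str_split, which shortens this considerably. A minimal sketch, assuming PHP 7.4+ with mbstring and a UTF-8 source file; mb_unique_chars is a made-up name:

```php
<?php
// Hypothetical helper: each distinct character once, in order of first appearance.
function mb_unique_chars(string $input): string
{
    // Split into an array of single UTF-8 characters.
    $chars = mb_str_split($input, 1, 'UTF-8');

    // array_unique() keeps the first occurrence of every value.
    return implode('', array_unique($chars));
}

echo mb_unique_chars('漢漢漢字漢字私私字私字漢字私漢字漢字私'), "\n";
echo mb_unique_chars('aaabbbbcdddbbbbccgggcdddeeedddaaaffff'), "\n";
```

If the per-character counts are needed as well, array_count_values() on the split array gives them, with the same caveat as above that purely numeric characters end up as integer keys.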
Here, try this on for size:
function mb_count_chars_kinda($input) {
$l = mb_strlen($input);
$unique = array();
for($i = 0; $i < $l; $i++) {
$char = mb_substr($input, $i, 1);
if(!array_key_exists($char, $unique))
$unique[$char] = 0;
$unique[$char]++;
}
return $unique;
}
function mb_string_chars_diff($one, $two) {
$left = array_keys(mb_count_chars_kinda($one));
$right = array_keys(mb_count_chars_kinda($two));
return array_diff($left, $right);
}
print_r(mb_string_chars_diff('aabbccddeeffgg', 'abcde'));
/* =>
Array
(
[5] => f
[6] => g
)
*/
You'll want to call this twice, the second time with the left string on the right, and the right string on the left. The output will be different -- array_diff just gives you the stuff in the left side that's missing from the right, so you have to do it twice to get the whole story.
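The two calls can also be folded into one helper that returns the characters found in only one of the two strings. This is just a sketch with a made-up name, and it uses mb_str_split (PHP 7.4+, mbstring) instead of the counting function above to stay self-contained:

```php
<?php
// Hypothetical helper: characters that occur in exactly one of the two strings.
function mb_string_chars_symmetric_diff(string $one, string $two): array
{
    $left  = array_unique(mb_str_split($one, 1, 'UTF-8'));
    $right = array_unique(mb_str_split($two, 1, 'UTF-8'));

    // Left-only characters first, then right-only ones, reindexed from 0.
    return array_values(array_merge(
        array_diff($left, $right),
        array_diff($right, $left)
    ));
}

print_r(mb_string_chars_symmetric_diff('aabbccddeeffgg', 'abcdexyz'));
```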
Please have a look at the iconv_strlen PHP standard library function. I can't speak for East Asian encodings, but it works fine for Western and Eastern European languages. In any case, it gives you some freedom!
$name = "My string";
$name_array = str_split($name);
$name_array_uniqued = array_unique($name_array);
print_r($name_array_uniqued);
Much easier. Use str_split to turn the phrase into an array with each character as an element. Then use array_unique to remove duplicates. Pretty simple. Nothing complicated. I like it that way. Note that str_split works on bytes, so this only suits single-byte strings.
The code below gives me this mysterious error, and I cannot fathom it. I am new to regular expressions and so am stumped. The regular expression should validate any international phone number.
Any help would be much appreciated.
function validate_phone($phone)
{
$phoneregexp ="^(\+[1-9][0-9]*(\([0-9]*\)|-[0-9]*-))?[0]?[1-9][0-9\- ]*$";
$phonevalid = 0;
if (ereg($phoneregexp, $phone))
{
$phonevalid = 1;
}else{
$phonevalid = 0;
}
}
Hmm well the code you pasted isn't quite valid, I fixed it up by adding the missing quotes, missing delimiters, and changed preg to preg_match. I didn't get the warning.
Edit: after seeing the other comment, you meant "ereg" not "preg"... that gives the warning. Try using preg_match() instead ;)
<?php
function validate_phone($phone) {
    $phoneregexp = '/^(\+[1-9][0-9]*(\([0-9]*\)|-[0-9]*-))?[0]?[1-9][0-9\- ]*$/';
    // preg_match() returns 1 on a match; hand the flag back to the caller
    $phonevalid = preg_match($phoneregexp, $phone) ? 1 : 0;
    return $phonevalid;
}
var_dump(validate_phone("123456"));
?>
If this is PHP, then the regex must be enclosed in quotes. Furthermore, what's preg? Did you mean preg_match?
Another thing. PHP knows boolean values. The canonical solution would rather look like this:
return preg_match($regex, $phone) === 1;
EDIT: Or, using ereg:
return ereg($regex, $phone) !== FALSE;
(Here, the explicit test against FALSE isn't strictly necessary but since ereg returns a number upon success I feel safer coercing the value into a bool).
It's the [0-9\\- ] part of your RE - it's not escaping the "-" properly. Change it to [0-9 -] and you should be OK (a "-" at the last position in a character class is treated as literal, not part of a range specification).
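Putting that character class into a preg_match version of the function might look like the sketch below. It keeps the rest of the pattern from the question unchanged and makes no claim that the pattern covers every international format.

```php
<?php
// Sketch: the question's pattern with the hyphen moved to the end of the class.
function validate_phone(string $phone): bool
{
    $pattern = '/^(\+[1-9][0-9]*(\([0-9]*\)|-[0-9]*-))?0?[1-9][0-9 -]*$/';

    // preg_match() returns 1 on a match, 0 on no match, false on error.
    return preg_match($pattern, $phone) === 1;
}

var_dump(validate_phone('+44(0)20 7946 0958'));
var_dump(validate_phone('123456'));
var_dump(validate_phone('abc'));
```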
Just to provide some reference material please read
Regular Expressions (Perl-Compatible)
preg_match()
or if you'd like to stick with the POSIX regexp:
Regular Expression (POSIX Extended)
ereg()
The correct sample code has already been given above.