preg_replace - NULL result? - php

Here's a small example (download, rename to .php and execute it in your shell):
test.txt
Why does preg_replace return NULL instead of the original string?
\x{2192} is the same as the HTML entity "&rarr;" ("→").

I got a NULL result when my regular expression included the u (UTF-8) PCRE modifier. If your source text is not valid UTF-8 and you use this modifier, you'll get a NULL result.
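A minimal sketch of that failure mode, assuming the subject contains a lone 0xA0 byte (as Latin-1 text would); the sample string and variable names are made up for illustration:

```php
<?php
// Subject with lone 0xA0 bytes: valid ISO-8859-1, but not valid UTF-8.
$subject = "GTA\xA0(184\xA0kW)";

// With the u modifier, PCRE checks that the subject is valid UTF-8 before matching.
$result = preg_replace('/\x{2192}/u', '->', $subject);

var_dump($result);                                   // NULL
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR); // bool(true)
```

Checking preg_last_error() after a NULL result is usually the quickest way to tell an encoding problem apart from a broken pattern.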

From the documentation on preg_replace():
Return Values
preg_replace() returns an array if the
subject parameter is an array, or a
string otherwise.
If matches are found, the new subject
will be returned, otherwise subject
will be returned unchanged or NULL if
an error occurred.
In your pattern, I don't think the u flag is supported. (Edit: this was WRONG.)
Edit: It seems like some kind of encoding issue with the subject. When I erase '147 3.2 V6 - GTA (184 kW)' and manually re-type it everything seems to work.
Edit 2: In the subject you provided, there are 3 spaces that give the regex engine trouble. Their byte value is 160 (0xA0, a non-breaking space in ISO-8859-1) rather than 32 for a normal space, and a lone 0xA0 byte is not valid UTF-8. When I replace those spaces with normal ones, it works.
I've replaced the offending spaces with underscores below:
'147 3.2 V6 - GTA (184 kW)'
'147 3.2_V6 - GTA_(184_kW)'
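One possible way to repair such a subject in code rather than by re-typing it, sketched under the assumption that any input which is not valid UTF-8 is really ISO-8859-1. The helper name toUtf8 is hypothetical, and the mbstring extension must be enabled:

```php
<?php
// Hypothetical helper: treat anything that is not valid UTF-8 as ISO-8859-1.
function toUtf8(string $s): string
{
    return mb_check_encoding($s, 'UTF-8')
        ? $s
        : mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1');
}

// The subject from the question, with the three 0xA0 bytes written explicitly.
$subject = "147 3.2\xA0V6 - GTA\xA0(184\xA0kW)";

$utf8  = toUtf8($subject);                          // each 0xA0 becomes 0xC2 0xA0
$clean = preg_replace('/\x{00A0}/u', ' ', $utf8);   // NBSP -> normal space

var_dump($clean); // "147 3.2 V6 - GTA (184 kW)"
```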

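You are using single quotes, which means the only things you can escape are single quotes and the backslash itself. To enable escape sequences (e.g. \x32), use double quotes (""). Note, however, that PCRE itself understands \x{2192}, so that escape also works inside a single-quoted pattern.
I am not a UTF-8 expert, but the escape code \x2192 is not correct either: in a double-quoted string, \x takes at most two hex digits. You can do \x21\x92 to get both bytes into your string, although those two bytes are not the UTF-8 encoding of U+2192 (that is \xE2\x86\x92). You may also want to look at utf8_encode and utf8_decode.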
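A short sketch of how the two quoting styles treat these escapes. The \u{...} string escape assumes PHP 7 or later, and the sample strings are illustrative only:

```php
<?php
var_dump(strlen('\x21'));      // int(4): backslash, x, 2, 1 are kept literally
var_dump(strlen("\x21"));      // int(1): the single byte 0x21 ("!")
var_dump(bin2hex("\u{2192}")); // "e28692": the UTF-8 bytes of U+2192

// PCRE parses \x{2192} itself, so a single-quoted pattern is fine here.
$found = preg_match('/\x{2192}/u', "a \u{2192} b");
var_dump($found);              // int(1)
```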
Your source string has invalid characters in it, or something. PHP gives:
Warning: preg_replace(): Compilation failed: invalid UTF-8 string at offset 0 in test.php on line 7

I believe there is also a fault in your regex: ~\x{2192}~u
Try replacing it with this and see if that works for you: /\x{2192}/u

Related

Having en-dash at the end of the string doesn't allow json_encode

I am trying to extract n characters from a string using
substr($originalText,0,250);
The nth character is an en-dash, so the last character shows up as â€ when I view it in Notepad. In my editor, Brackets, I can't even open the log file, since it only supports UTF-8 encoding.
I also cannot run json_encode on this string.
However, when I use substr($originalText,0,251), it works just fine. I can open the log file and it shows an en-dash instead of â€. json_encode also works fine.
I can use mb_convert_encoding($mystring, "UTF-8", "Windows-1252") to circumvent the problem, but could anyone tell me why having these characters at the end specifically causes an error?
Moreover, on doing this, my log file shows â€ in brackets, which is confusing too.
My question is: why is having the en-dash at the end of the string different from having it anywhere else (followed by other characters)?
Hopefully my question is clear, if not I can try to explain further.
Thanks.
Pid's answer explains why this is happening; this answer just looks at what you can do about it...
Use mb_substr()
The multibyte string module was designed for exactly this situation, and provides a number of string functions that handle multibyte characters correctly. I suggest having a look through there as there are likely other ones that you will need in other places of your application.
You may need to install or enable this module if you get a function not found error. Instructions for this are platform dependent and out-of-scope for this question.
The function you want for the case in your question is mb_substr(). It is called the same way as substr(), but takes an additional optional encoding argument.
UTF-8 uses multi-byte sequences, which extend the encoding beyond ASCII to accommodate many more characters.
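A minimal sketch comparing the two functions, assuming the mbstring extension is enabled and PHP 7+ for the \u{...} escape. The five-character sample string stands in for the 250-character text from the question:

```php
<?php
$originalText = "ab\u{2013}cd";   // U+2013 EN DASH takes 3 bytes in UTF-8

// substr() counts bytes: this keeps "ab" plus only 1 of the 3 dash bytes.
$cut = substr($originalText, 0, 3);
var_dump(mb_check_encoding($cut, 'UTF-8'));   // bool(false)
var_dump(json_encode($cut));                  // bool(false)

// mb_substr() counts characters, so the dash is kept whole.
$safe = mb_substr($originalText, 0, 3, 'UTF-8');
var_dump($safe === "ab\u{2013}");             // bool(true)
var_dump(json_encode($safe));                 // the JSON text "ab\u2013"
```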
A single UTF-8 character may be coded into one, two, three or four bytes, depending on the character.
You cut the string right in the middle of a multi-byte character:
[<-character->]
[byte-0|byte-1]
       ^
       You cut the string right here in the middle!
[<-----character---->]
[byte-0|byte-1|byte-2]
       ^      ^
       Or anywhere here if it's 3 bytes long.
So the decoder has the first byte(s) but can't read the entire character because the string ends prematurely.
This causes all the effects you are witnessing.
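To make the byte picture concrete, here is a small sketch (assuming mbstring and PHP 7+) showing the three bytes of an en-dash and what the first two look like when they are read as Windows-1252, which would explain the â€ seen in the question:

```php
<?php
$dash = "\u{2013}";                 // EN DASH
var_dump(bin2hex($dash));           // "e28093": three bytes

// Cutting after two bytes leaves an incomplete sequence: e2 80.
$truncated = substr($dash, 0, 2);

// Read as Windows-1252, 0xE2 is "â" and 0x80 is "€".
$seen = mb_convert_encoding($truncated, 'UTF-8', 'Windows-1252');
var_dump($seen === "\u{E2}\u{20AC}"); // bool(true)
```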
The solution to this problem is here in Dezza's answer.

PHP - preg_match() - matching substitution character black diamond with question mark

I have a problem with the substitution character, the black diamond with a question mark (�), in text I'm reading with SplFileObject. This character is already present in my text file, so nothing can be done to convert it to some other encoding. I decided to search for it with preg_match(), but PHP can't find any occurrence of it. PHP probably sees it as a different character than �. I don't want to just remove this character from the text, which is why I want to search for it with preg_match(). Is there any way to match this character in PHP?
I tried the regex /.�./i, but without success.
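Try this code. The hexadecimal code point of the � character is FFFD:
$line = "�";
// Single quotes pass \x{FFFD} to PCRE untouched; the u modifier enables UTF-8 mode.
if (preg_match('/\x{FFFD}/u', $line, $match)) {
    print "Match found!";
}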
PHP with SplFileObject seems to read the file a little differently: instead of U+FFFD it sees U+0093 and U+0094. If you have the same problem, I suggest using hexdump to find out how the unrecognized character is encoded in the file. Then, as recommended by @stribizhev in the comments, use this snippet and a conversion tool to get the hex code that PHP actually sees. Once you know the correct hex code of the unrecognized character, you can use the preg_...() functions. Here's the solution to my problem:
preg_replace("/(?|\x93|\x94)/i", "'", $text);
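If hexdump is not at hand, bin2hex() gives the same byte-level view from inside PHP. The sketch below uses a made-up sample line and an equivalent character-class form of the pattern; without the u modifier, PCRE matches raw bytes, so \x93 and \x94 match the bytes 0x93 and 0x94:

```php
<?php
// Hypothetical sample: bytes 0x93 and 0x94 are the curly double quotes in Windows-1252.
$text = "He said \x93hi\x94";

// Inspect the raw bytes PHP actually holds.
echo bin2hex($text), "\n";

// Byte-wise replacement; the result must be assigned, preg_replace() does not modify $text.
$fixed = preg_replace('/[\x93\x94]/', "'", $text);
var_dump($fixed); // "He said 'hi'"
```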

PHP very strict complex string validation using preg_match

I am trying to validate a string against the following regular expression which has been imposed upon me:
[-,.:; 0-9A-Z&#$£¥€'"«»‘’“”?!/\\()\[\]{}<>]{3}[-,.:; 0-9A-Z&#$£¥€'"«»‘’“”?!/\\()\[\]{}<>*=#%+]{0,157}
Can anybody help with writing a preg_match in PHP to validate an input string against this? I am struggling because:
my knowledge of regex isn't that great in the first place
I see special characters in the regex itself which I feel sure PHP won't be happy about me inserting directly into a string (e.g. $£¥€)
In vain hope I just tried sticking it into preg_match, escaping the double quotes, thus:
$ste = "Some string input";
if(preg_match("/[-,.:; 0-9A-Z&#$£¥€'\"«»‘’“”?!/\\()\[\]{}<>]{3}[-,.:; 0-9A-Z&#$£¥€'\"«»‘’“”?!/\\()\[\]{}<>*=#%+]{0,157}/",$ste))
{
echo "OK";
}
else
{
echo "Not OK";
}
Thanks in advance!!
PHP will be perfectly happy with the "special" characters in the expression, provided you do the following:
Make sure the input string is encoded with UTF-8 encoding.
Make sure your PHP program file is saved using UTF-8 encoding. (Obviously you'll need to use UTF-8 encoding in all other parts of your system too, or you'll get borked characters showing up somewhere along the line, but that's outside the scope of this question.)
Add the u modifier to the end of the regex pattern string to tell the regex parser to handle UTF-8 characters, i.e.:
preg_match("/....../u", ...);
                    ^
                    add this
Other than that, you've got it pretty much spot on already.
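A small sketch of what the u modifier changes for a character class containing non-ASCII characters. It assumes this file is itself saved as UTF-8, and the two-character class is a cut-down stand-in for the one in the question:

```php
<?php
// This file must be saved as UTF-8 for the literals below to be UTF-8.
$input = "£";   // 2 bytes in UTF-8 (0xC2 0xA3)

// Without u, the class is a set of single bytes, so a 2-byte input cannot match ^[...]$.
$withoutU = preg_match('/^[£€]$/', $input);

// With u, the class is a set of whole characters.
$withU = preg_match('/^[£€]$/u', $input);

var_dump($withoutU); // int(0)
var_dump($withU);    // int(1)
```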
You can do that:
if (preg_match('~^[ -"$&-),-<>?-\]{}£¥€«»‘’“”]{3}[ -\]{}£¥€«»‘’“”]{0,157}$~u', $ste))
echo 'OK';
else
echo 'Not OK';
I have added the "u" modifier for Unicode, and reduced the size of the character classes using ranges (for example, ,-< means all characters between , and < in the Unicode table).
Most importantly, I have added the anchors ^ and $, which mean the start and end of the string respectively.
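The effect of the anchors can be seen with a deliberately simplified pattern (the classes and lengths below are stand-ins, not the ones from the question):

```php
<?php
// Simplified stand-in: 3 uppercase letters, then up to 5 uppercase letters or digits.
$core = '[A-Z]{3}[A-Z0-9]{0,5}';

// Unanchored: succeeds as soon as any part of the input fits.
$loose = preg_match('~' . $core . '~u', 'abcXYZdef');

// Anchored: the whole input has to fit the pattern.
$strict = preg_match('~^' . $core . '$~u', 'abcXYZdef');
$valid  = preg_match('~^' . $core . '$~u', 'XYZ123');

var_dump($loose, $strict, $valid); // int(1) int(0) int(1)
```

Without anchors, validation passes whenever a valid fragment appears anywhere in the input, which is rarely what a validator wants.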

PHP regex not matching utf-8 decoded string

I am having trouble with a regex statement. I'm not sure why it is doing this, however I think it may have something to do with character encoding.
So I am using curl to receive the page content from a website. Then I am using domXPath query to get a certain element, then from that element I get its content, then from that content I perform a regex statement. However the regex statement is not working and I don't know why.
This is what I receive from the element:
X: asdasdfgdgdrrY: dfgdfgfgZ: ukuykyukjghj
a B 7dd.
Now when I try to match it with this code:
/X: (?P<x>.*)Y: (?P<y>.*)Z: (?P<z>.*)\s*(?P<a>[a-zA-Z]+) (?P<b>[a-zA-Z]+) (?P<c>[0-9]+)dd/
I have tested this in Dreamweaver and it matches, so I have no idea why it wouldn't match online.
Also, the page I am receiving is encoded as UTF-8.
I attempt to convert the content to remove the UTF-8 characters by using
iconv('utf-8', 'ISO-8859-1//IGNORE', $td->item(0)->nodeValue);
If I don't remove the UTF-8 characters, there are weird Á symbols after the 'a', 'b' and 'c' variable values.
OK, I figured it out.
All I had to do to get rid of these invisible invalid characters was:
$value = preg_replace("/[^a-zA-Z0-9 %():\$.\/-]/",' ',$value);
Pretty much just replace any character that wasn't valid with a space or a blank. In my case I used a space because it appeared that some of the spaces were invalid.
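A narrower alternative, sketched on the assumption (not confirmed in the thread) that the invisible characters are UTF-8 non-breaking spaces, U+00A0. In that case the text can stay in UTF-8 and only those characters need replacing; the sample value is made up:

```php
<?php
// Hypothetical sample: non-breaking spaces where normal spaces were expected.
$value = "a\u{00A0}B\u{00A0}7dd";

// Replace each U+00A0 with a normal space; the u modifier makes PCRE work on characters.
$value = preg_replace('/\x{00A0}/u', ' ', $value);

// The tail of the pattern from the question now matches.
$ok = preg_match('/(?P<a>[a-zA-Z]+) (?P<b>[a-zA-Z]+) (?P<c>[0-9]+)dd/', $value, $m);

var_dump($value);   // "a B 7dd"
var_dump($ok);      // int(1)
```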

how to fix a malformed JSON in php

I have this JSON string that I want to decode with the json_decode() function:
{"phase":2,"id":"pagelet_profile_picture","css":["VCxcl","Ix2pq"],"js":["fZYUE","VfnZ3"],"content":{"pagelet_profile_picture":"\u003cdiv class=\"profile-picture\">\u003cspan class=\"profile-picture-overlay\">\u003c\/span>\u003cimg class=\"photo img\" src\=\"http:\/\/profile.ak.fbcdn.net\/hprofile-ak-snc4\/222_111_2222_n.jpg\" alt=\"bla bla\" id=\"profile_pic\" \/>\u003c\/div>"}}
There is json_last_error(), but it is not helping me (I get JSON_ERROR_STATE_MISMATCH and sometimes JSON_ERROR_SYNTAX).
I want to know what is wrong with this JSON string and how I can fix it automatically in PHP so I can decode it.
Some code would be very helpful.
Thanks.
Using a JSON lint, it seems the problem is the src\=
The \ escapes the = sign, which makes no sense: \= is not a valid JSON escape sequence.
If you replace src\= with src= it passes the validator.
The fix:
Fix the code that generates the json string in the first place.
or
use str_replace to change 'src\=' to 'src='
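The str_replace option can be sketched as follows, using a shortened, made-up version of the JSON from the question:

```php
<?php
// Shortened sample with the same invalid \= escape (single quotes keep the backslashes).
$json = '{"content":"\u003cimg src\=\"x.jpg\" \/>"}';

var_dump(json_decode($json));                     // NULL
var_dump(json_last_error() !== JSON_ERROR_NONE);  // bool(true)

// Remove the stray backslash, then decode again.
$fixed = str_replace('src\=', 'src=', $json);
$data  = json_decode($fixed, true);

var_dump($data['content']); // "<img src="x.jpg" />"
```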
The problem with a wrong encoding is that it's just a wrong encoding. Things then break.
If the problem is related to invalid escape sequences, as Ben pointed out in his answer, you can try to fix those sequences in the input string, probably with a smarter algorithm that looks for any unneeded escape sequence and replaces it with its non-escaped value by removing the escape character \.
You can do so by creating a list of characters that actually need to be escaped, then scanning the whole string for the escape character; when you find one, check whether the next character requires escaping and act accordingly.
However, that's only one strategy, and because the input is not properly encoded, it's not easy to just fix things: they are already broken.
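One way the scanning strategy could look, as a rough sketch rather than a complete repair tool. The helper name stripInvalidJsonEscapes is hypothetical; it does not verify that \u is followed by four hex digits, and it cannot repair other kinds of damage:

```php
<?php
// Hypothetical helper: drop the backslash from any escape that JSON does not define.
function stripInvalidJsonEscapes(string $json): string
{
    // Consume "backslash + next character" pairs from left to right, so an
    // escaped backslash is handled as one unit and never misread.
    $result = preg_replace_callback(
        '/\\\\(.)/s',
        static function (array $m): string {
            // Characters that may legally follow a backslash in JSON.
            return strpos('"\\/bfnrtu', $m[1]) !== false ? $m[0] : $m[1];
        },
        $json
    );

    // preg_replace_callback() returns null on error; fall back to the input.
    return $result ?? $json;
}

$bad  = '{"a":"src\=\"x\""}';
$good = stripInvalidJsonEscapes($bad);

var_dump($good);                         // {"a":"src=\"x\""}
var_dump(json_decode($good, true)['a']); // src="x"
```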
