PHP regex not matching utf-8 decoded string - php

I am having trouble with a regex. I'm not sure why it fails, but I think it may have something to do with character encoding.
I use curl to fetch a page from a website, run a DOMXPath query to get a certain element, take that element's content, and then match a regex against that content. However, the regex does not match and I don't know why.
This is what I receive from the element:
X: asdasdfgdgdrrY: dfgdfgfgZ: ukuykyukjghj
a B 7dd.
Now when I try to match it with this code:
/X: (?P<x>.*)Y: (?P<y>.*)Z: (?P<z>.*)\s*(?P<a>[a-zA-Z]+) (?P<b>[a-zA-Z]+) (?P<c>[0-9]+)dd/
I have tested this in Dreamweaver and it matches there, so I have no idea why it doesn't match online.
Also, the page I am receiving is encoded as UTF-8.
I try to convert the content to get rid of the UTF-8 characters by using
iconv('utf-8', 'ISO-8859-1//IGNORE', $td->item(0)->nodeValue);
If I don't remove the UTF-8 characters, there are weird Á symbols after the 'a', 'b' and 'c' values.

OK, I figured it out.
All I had to do to get rid of these invisible invalid characters was:
$value = preg_replace("/[^a-zA-Z0-9 %():\$.\/-]/",' ',$value);
Pretty much just replace any character that wasn't valid with a space or an empty string. In my case I used a space because it appeared some of the spaces themselves were invalid.
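The "invalid spaces" described here behave like non-breaking spaces (U+00A0, bytes C2 A0 in UTF-8), which a literal space in a pattern does not match. That is an assumption, since the real page content is not shown. A minimal sketch under that assumption, using a made-up sample string, that replaces only those characters instead of everything outside a whitelist:

```php
<?php
// Hypothetical sample: "\u{00A0}" is a non-breaking space (assumed, not taken from the real page).
$value = "X: asdasdfgdgdrrY: dfgdfgfgZ: ukuykyukjghj\na\u{00A0}B\u{00A0}7dd.";

$pattern = '/X: (?P<x>.*)Y: (?P<y>.*)Z: (?P<z>.*)\s*(?P<a>[a-zA-Z]+) (?P<b>[a-zA-Z]+) (?P<c>[0-9]+)dd/';

// The literal spaces in the pattern do not match the two-byte non-breaking space.
$before = preg_match($pattern, $value);

// Replace every U+00A0 with an ordinary space; the u modifier treats the subject as UTF-8.
$clean = preg_replace('/\x{00A0}/u', ' ', $value);
$after = preg_match($pattern, $clean, $m);

var_dump($before, $after, $m['a'], $m['b'], $m['c']);
```

This keeps every other character intact, which may matter if the captured values can contain letters outside a-z.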

Related

PHP - preg_match() - matching substitution character black diamond with question mark

I have a problem with the substitution character (the black diamond with a question mark, �) in text I'm reading with SplFileObject. This character is already present in my text file, so nothing can be done to convert it to some other encoding. I decided to search for it with preg_match(), but PHP can't find any occurrence of it; PHP probably sees it as a different character than �. I don't want to simply remove this character from the text, which is why I want to search for it with preg_match(). Is there any way to match this character in PHP?
I tried the regex /.�./i, but without success.
Try this code. The hexadecimal code point of the � character is FFFD.
$line = "�";
if (preg_match("/\x{FFFD}/u", $line, $match))
print "Match found!";
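As a small illustration of why the u modifier matters here: U+FFFD is stored as three bytes in UTF-8, so a pattern either has to work on code points (with u) or spell out the bytes. The sample string below is made up.

```php
<?php
// "\u{FFFD}" is the replacement character; in UTF-8 it is the three bytes EF BF BD.
$line = "abc\u{FFFD}def";

// With the u modifier the pattern works on code points.
$byCodePoint = preg_match('/\x{FFFD}/u', $line);

// Without it the pattern works on bytes, so the three bytes are spelled out.
$byBytes = preg_match("/\xEF\xBF\xBD/", $line);

// bin2hex() shows what PHP actually holds in the string.
$hex = bin2hex("\u{FFFD}");

var_dump($byCodePoint, $byBytes, $hex);
```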
PHP with SplFileObject seems to read the file a little differently: instead of U+FFFD it sees U+0093 and U+0094. If you have the same problem I had, I suggest using hexdump to find out how the unrecognized character is encoded in the file. Afterwards, use the snippet recommended by @stribizhev in the comments to get the hex code as PHP sees it. Once you have figured out the correct hex code of the unrecognized character (the conversion tool suggested by @stribizhev in the comments gives the correct value), you can use the preg_...() functions. Here's the solution to my problem:
preg_replace("/(?|\x93|\x94)/i", "'", $text);
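A sketch of the inspection step described above, done from PHP itself with bin2hex() instead of an external hexdump. The input string is hypothetical; it assumes the file contains the single bytes 0x93 and 0x94, which are the curly double quotes in Windows-1252.

```php
<?php
// Hypothetical input: bytes 0x93 and 0x94 are the curly double quotes in Windows-1252.
$text = "He said \x93hello\x94.";

// Print the raw bytes as hex to see how the odd characters are really encoded.
echo bin2hex($text), "\n";

// No u modifier: the subject is not valid UTF-8, so the pattern matches raw bytes.
$fixed = preg_replace('/[\x93\x94]/', "'", $text);
echo $fixed, "\n";
```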

PHP strpos says different croatian chars are the same: š č

I have the following code:
$text = 'Tomáš';
echo strpos($text, "č");
# result is 4
I believe they are different chars so why is PHP telling me they are the same?
What is going on and how can I correct this?
The encoding you chose to save your source code file in cannot encode the characters you're trying to save. Whatever characters PHP is seeing, it's not comparing the strings you think it is. Save your source code in an encoding that can encode all characters, preferably UTF-8.
You should try the mb_strpos function. From the manual:
Performs a multi-byte safe strpos() operation based on number of characters. The first character's position is 0, the second character position is 1, and so on.
With a regular setup, your code returns false for me.
However, if you have trouble with such special characters, using mb_strpos instead of strpos should help.
http://php.net/manual/en/function.mb-strpos.php
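A short sketch of the difference, assuming the string is UTF-8. The characters are written as escapes so the result does not depend on the encoding the source file is saved in.

```php
<?php
// Escapes keep the example independent of the encoding this file is saved in.
$text = "Tom\u{00E1}\u{0161}"; // "Tomáš" as UTF-8

// "č" (U+010D) does not occur in the string at all.
$missing = strpos($text, "\u{010D}");

// strpos() counts bytes: "á" takes two bytes, so "š" starts at byte 5.
$bytePos = strpos($text, "\u{0161}");

// mb_strpos() counts characters: "š" is the fifth character, index 4.
$charPos = mb_strpos($text, "\u{0161}", 0, 'UTF-8');

var_dump($missing, $bytePos, $charPos);
```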

Remove certain special HTML characters from string in PHP

I am scraping information from a website and I was wondering how I could ignore or replace some special HTML characters such as "á", "á", "’" and "&amp;". These characters cannot be inserted into the database. I have already replaced "&nbsp;" using this:
$nbsp = utf8_decode('á');
$mystring = str_replace($nbsp, '', $mystring);
But I cannot seem to do the same with these other characters. I am scraping from the website using XPath. This returns the exact content that I am looking for but keeps the HTML characters that I do not want as they don't seem to be allowed into a database.
Thanks for any help with this.
It sounds like you've got a collation issue. I suggest ensuring that your database uses a UTF-8 collation (for example utf8_general_ci) and that your web page's content encoding is also set to UTF-8. This may well solve your problem.
The best way to strip all special characters is to run the string through htmlspecialchars(), then do a case-insensitive regex find and replace using the following pattern:
&([a-z]{2,8}+|#[0-9]{2,5}|#x[0-9a-f]{2,4});
This should match named HTML entities (e.g. &ohm; or &nbsp;) as well as decimal (e.g. &#01234;) and hex-based (e.g. &#x0BEE;) entities. Replacing the matches with an empty string strips them out completely.
Alternatively, just use the output of htmlspecialchars() to store it with the weird characters intact. Not ideal, but it works.
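A sketch of the pattern above applied with preg_replace(), on a made-up sample string. The delimiters and the i modifier are added here; the last lines show html_entity_decode() as an alternative that converts entities back to real characters rather than dropping them.

```php
<?php
// Made-up sample containing named, decimal and hexadecimal entities.
$html = 'caf&eacute; &amp; bar&nbsp;&#8217;&#x2019;!';

// The pattern from the answer, with delimiters and the i modifier added.
$pattern = '/&([a-z]{2,8}+|#[0-9]{2,5}|#x[0-9a-f]{2,4});/i';
$stripped = preg_replace($pattern, '', $html);

// Alternative: turn entities back into real characters instead of dropping them.
$decoded = html_entity_decode('caf&eacute; &amp; bar', ENT_QUOTES | ENT_HTML5, 'UTF-8');

var_dump($stripped, $decoded);
```

Note that stripping also removes the letter that an entity such as &eacute; stood for, so decoding is usually the less lossy choice when the database can store UTF-8.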

remove invalid chars from html document

I have a bunch of files which are supposed to be HTML documents for the most part. However, sometimes the editors copied and pasted text from other sources into them, so now I come across some weird chars every now and then, for example a non-encoded copyright sign, weird things that look like a dash or minus but are something else (ASCII #146?), or a single char that looks like "...".
I had a look at get_html_translation_table(), but this only replaces the "usual" special chars like &, euro signs etc. It seems like I need a regex that specifies only the allowed chars and discards all the unknown ones. I tried this, but it didn't work at all:
function fixNpChars($string)
{
//characters in the hexadecimal ranges 00–08, 0B–0C, 0E–1F, 7F, and 80–9F cannot be used in an HTML document, not even by reference.
$pattern = '/[\x{0000}-\x{0008}][\x{000B}-\x{000C}][\x{000E}-\x{001F}][\x{0080}-\x{009F}][x{007F}]/u';
$replacement = '';
return preg_replace($pattern, $replacement, $string);
}
Any idea what's wrong here?
EDIT:
The database where I store my imported files and the PHP side are all set to UTF-8 (content type utf-8, db table charset utf8/utf8_general_ci, mysql_set_charset('utf8',$this->mHandle); executed after the db connection is established). Most of the imported files are either utf8 or iso-8859-1.
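Your regex syntax looks a little problematic. Five character classes written one after another only match five consecutive characters, one from each class, and the last class is missing its backslash. Maybe put all ranges into a single class?:
$pattern = '/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F]/u';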
Don't think of removing the invalid characters as the best option, this problem can be solved using htmlentities and html_entity_decode functions.
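A sketch of how the function from the question could look with all ranges merged into one character class. The helper name is hypothetical, and the code assumes the input is already valid UTF-8; an ISO-8859-1 file would have to be converted first, otherwise the u modifier makes preg_replace() fail.

```php
<?php
/**
 * Removes control characters that HTML does not allow
 * (U+0000-0008, U+000B, U+000C, U+000E-001F and U+007F-009F).
 * Hypothetical helper name; expects $string to be valid UTF-8.
 */
function stripControlChars(string $string): string
{
    $pattern = '/[\x{0000}-\x{0008}\x{000B}\x{000C}\x{000E}-\x{001F}\x{007F}-\x{009F}]/u';
    $result = preg_replace($pattern, '', $string);

    // preg_replace() returns null on error, e.g. when $string is not valid UTF-8.
    if ($result === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $result;
}

// Tab (0x09) is allowed and stays; NUL, U+0085 and DEL are removed.
$clean = stripControlChars("a\x00b\u{0085}c\x7Fd\te");
var_dump($clean);
```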

preg_replace - NULL result?

Here's a small example (download, rename to .php and execute it in your shell):
test.txt
Why does preg_replace return NULL instead of the original string?
\x{2192} is the same as HTML "&rarr;" ("→").
I got a null response when my regular expression included the u (UTF-8) PCRE modifier. If your source text is not UTF-8 and you use this modifier, you'll get a null result.
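A minimal sketch of that behaviour, with a hypothetical subject string. It assumes the text is ISO-8859-1 and contains a non-breaking space as the single byte 0xA0, which is not valid UTF-8 on its own. preg_last_error() reports why the call returned null.

```php
<?php
// Hypothetical subject: a lone 0xA0 byte (non-breaking space in ISO-8859-1) is not valid UTF-8.
$subject = "147 3.2\xA0V6";

// With the u modifier an invalid UTF-8 subject makes preg_replace() return null.
$result = preg_replace('/\x{2192}/u', '-', $subject);
$error = preg_last_error();

// Converting the subject to UTF-8 first (assuming it really is ISO-8859-1) avoids the error.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $subject);
$fixed = preg_replace('/\x{00A0}/u', ' ', $utf8);

var_dump($result, $error === PREG_BAD_UTF8_ERROR, $fixed);
```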
From the documentation on preg_replace():
Return Values
preg_replace() returns an array if the subject parameter is an array, or a string otherwise.
If matches are found, the new subject will be returned, otherwise subject will be returned unchanged or NULL if an error occurred.
In your pattern, I don't think the u flag is supported. (Edit: this was WRONG.)
Edit: It seems like some kind of encoding issue with the subject. When I erase '147 3.2 V6 - GTA (184 kW)' and manually re-type it everything seems to work.
Edit 2: In the pattern you provided, there are 3 spaces that seem to be giving issues to the regex engine. When I convert them to decimal their value is 160 (as opposed to normal space 32). When I replace those spaces with normal ones it seems to work.
I've replaced the offending spaces with underscores below:
'147 3.2 V6 - GTA (184 kW)'
'147 3.2_V6 - GTA_(184_kW)'
You are using single quotes, which means the only thing that you can escape is other single quotes. To enable escape sequences (e.g. \x32), use double quotes ("").
I am not a UTF8 expert, but the escape code \x2192 is not correct either. You can do: \x21\x92 to get both bytes into your string, but you may want to look at utf8_encode and utf8_decode
Your source string has invalid characters in it, or something. PHP gives:
Warning: preg_replace(): Compilation failed: invalid UTF-8 string at offset 0 in test.php on line 7
I believe there is also a fault in your Regex expression: ~\x{2192}~u
Try replacing what I have and see if that works out for you: /\x{2192}/u
