Match exact string without worrying about special characters - php

I have a large amount of text data that contains a lot of special characters, and I need to validate it against other text sources.
My question is: is it possible to "escape" a string in a regex so that its special characters are not interpreted?
Example:
$text = "My Random St. [486] s/n 445 (don't call these guys)";
preg_match("/$text/", $other_text);
In this example there are a lot of special characters in $text. This is a massive amount of incoming text to be compared against a large number of $other_text strings, and sometimes the data actually contains regexps, so I need to use preg_match.
What I'm getting at: is there a "turn special characters off in this string" type of delimiter?
Using the above example:
$text = "My Random St. [486] s/n 445 (don't call these guys)";
preg_match("/%$text%/", $other_text);
Here, the % characters surrounding $text would indicate that the string is to be taken literally rather than as containing regex metacharacters.
Any ideas?

The function you are looking for is called preg_quote. It escapes every character that has a special meaning in a regex, so you get a "plain, stupid" string match.
http://php.net/manual/en/function.preg-quote.php
$text = "My Random St. [486] s/n 445 (don't call these guys)";
preg_match("/". preg_quote($text, "/") . "/", $other_text);
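The second argument to preg_quote makes it escape the / delimiter as well (needed here for "s/n"), so the pattern matches the given string literally wherever it occurs in $other_text.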
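As a rough illustration of what the quoting does, the sketch below prints the escaped pattern and checks it against a sample haystack. The contents of $other_text are invented for the example; only $text comes from the question. The small "a.c" check at the end is likewise just an illustration of the difference between a quoted and an unquoted dot.

```php
<?php
$text = "My Random St. [486] s/n 445 (don't call these guys)";

// Hypothetical haystack, invented for this example.
$other_text = "Ship to: My Random St. [486] s/n 445 (don't call these guys), Dock 2";

// preg_quote backslash-escapes regex metacharacters; passing "/" as the
// second argument also escapes the delimiter that occurs in "s/n".
$pattern = "/" . preg_quote($text, "/") . "/";
echo $pattern, PHP_EOL;

$found = preg_match($pattern, $other_text) === 1;
var_dump($found);

// Unquoted, "." is a wildcard, so "a.c" matches "abc"; quoted, it does not.
$unquoted = preg_match("/a.c/", "abc") === 1;
$quoted   = preg_match("/" . preg_quote("a.c", "/") . "/", "abc") === 1;
var_dump($unquoted, $quoted);
```

If no part of the pattern ever needs to behave as a real regex, a plain substring search such as strpos($other_text, $text) !== false would probably be enough.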

Related

PHP: Split a string at the first period that isn't the decimal point in a price or the last character of the string

I want to split a string as per the parameters laid out in the title. I've tried a few different things including using preg_match with not much success so far and I feel like there may be a simpler solution that I haven't clocked on to.
I have a regex that matches the "price" mentioned in the title (see below).
/(?=.)\£(([1-9][0-9]{0,2}(,[0-9]{3})*)|[0-9]+)?(\.[0-9]{1,2})?/
And here are a few example scenarios and what my desired outcome would be:
Example 1:
input: "This string should not split as the only periods that appear are here £19.99 and also at the end."
output: n/a
Example 2:
input: "This string should split right here. As the period is not part of a price or at the end of the string."
output: "This string should split right here"
Example 3:
input: "There is a price in this string £19.99, but it should only split at this point. As I want it to ignore periods in a price"
output: "There is a price in this string £19.99, but it should only split at this point"
I suggest using
preg_split('~\£(?:[1-9]\d{0,2}(?:,\d{3})*|[0-9]+)?(?:\.\d{1,2})?(*SKIP)(*F)|\.(?!\s*$)~u', $string)
The first alternative matches your price pattern, \£(?:[1-9]\d{0,2}(?:,\d{3})*|[0-9]+)?(?:\.\d{1,2})?, and discards it with (*SKIP)(*F); otherwise, the second alternative, \.(?!\s*$), matches a . that is not the final one in the string (trailing whitespace after the final . is allowed).
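A minimal sketch of how that preg_split call behaves on Examples 1 and 3 from the question. The variable names are made up for the illustration, and the file is assumed to be saved as UTF-8 because the pattern contains a literal £ and uses the u modifier.

```php
<?php
// Price alternative is skipped via (*SKIP)(*F); only a non-final "." splits.
$pattern = '~\£(?:[1-9]\d{0,2}(?:,\d{3})*|[0-9]+)?(?:\.\d{1,2})?(*SKIP)(*F)|\.(?!\s*$)~u';

$example1 = "This string should not split as the only periods that appear are here £19.99 and also at the end.";
$example3 = "There is a price in this string £19.99, but it should only split at this point. As I want it to ignore periods in a price";

$parts1 = preg_split($pattern, $example1);
$parts3 = preg_split($pattern, $example3);

// Example 1 has no qualifying period, so it comes back as a single piece.
echo count($parts1), PHP_EOL;   // 1

// Example 3 splits once; the first piece is the wanted output.
echo $parts3[0], PHP_EOL;
```

Note that the second piece keeps the space that followed the period, so a trim() may be wanted before using it.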
If you really only need to split on the first occurrence of the qualifying dot you can use a matching approach:
preg_match('~^((?:\£(?:[1-9]\d{0,2}(?:,\d{3})*|[0-9]+)?(?:\.\d{1,2})?|[^.])+)\.(.*)~su', $string, $match)
Here,
^ - matches a string start position
((?:\£(?:[1-9]\d{0,2}(?:,\d{3})*|[0-9]+)?(?:\.\d{1,2})?|[^.])+) - Group 1: one or more occurrences of your currency pattern or any one char other than a . char
\. - a . char
(.*) - Group 2: the rest of the string.
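The matching approach can be tried out like this on Example 3 from the question. This is only a sketch: the variable names $before and $after are invented here, and the file is assumed to be UTF-8 encoded.

```php
<?php
$pattern = '~^((?:\£(?:[1-9]\d{0,2}(?:,\d{3})*|[0-9]+)?(?:\.\d{1,2})?|[^.])+)\.(.*)~su';

$string = "There is a price in this string £19.99, but it should only split at this point. As I want it to ignore periods in a price";

$before = null;
$after  = null;
if (preg_match($pattern, $string, $match)) {
    $before = $match[1];          // Group 1: everything up to the qualifying period
    $after  = ltrim($match[2]);   // Group 2: the rest, minus the leading space
}

echo $before, PHP_EOL;
echo $after, PHP_EOL;

// Unlike the preg_split pattern, this one has no (?!\s*$) check: a string
// whose only qualifying period is the last character still matches,
// leaving group 2 empty.
$example1 = "This string should not split as the only periods that appear are here £19.99 and also at the end.";
$matchedFinalPeriod = preg_match($pattern, $example1, $m1) === 1 && $m1[2] === '';
```

Because of that last point, it seems sensible to treat an empty group 2 as "no split" if the final period must be ignored.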
To split a text into sentences while avoiding pitfalls such as dots or thousands separators in numbers and some abbreviations (like etc.), the best tool is IntlBreakIterator, which is designed to deal with natural language:
$str = 'There is a price in this string £19.99, but it should only split at this point. As I want it to ignore periods in a price';
$si = IntlBreakIterator::createSentenceInstance('en-US');
$si->setText($str);
$si->next();
echo substr($str, 0, $si->current());
IntlBreakIterator::createSentenceInstance returns an iterator that gives the offsets of the sentence boundaries in the string.
It takes ?, ! and ... into account too. In addition to the number and price pitfalls, it also works well with this kind of string:
$str = 'John Smith, Jr. was running naked through the garden crying "catch me! catch me!", but no one was chasing him. His psychatre looked at him from the window with a circumspect eye.';
More about rules used by IntlBreakIterator here.
You could simply use this regex:
\.
Since you only have a space after the first sentence (and not a price), this should work just as well, right?

regex to remove complete HTML entity

We have a requirement to remove special characters from text strings. For example, we may get a string that looks like this, where &#174; is the entity for the registered trademark symbol (®):
PEPSI&#174; Bottle 20 oz<br><br>
I'm not great with regex, and can't figure out how to edit the existing code to produce that.
Here's what we currently have:
$ui = "PEPSI&#174; Bottle 20 oz<br><br>";
$ui = preg_replace('/[^A-Za-z0-9\.\' -]/', '', $ui);
This results in PEPSI174 Bottle 20 ozbrbr.
Our desired result is PEPSI Bottle 20 oz<br><br>.
How can I edit the regex to make sure that
It doesn't remove valid HTML tags like <br>, and
If it does find a special character entity, it removes not only the special characters (the & and #), but also the numbers and semicolon?
We don't want to have it remove all the numbers, as obviously the string can contain numbers; it's only numbers that are part of the entity code that we need to remove.
You could use this, but I can't guarantee it covers all possible HTML entities:
$res = preg_replace('/&[A-Za-z0-9#]+;/', '', $ui);
That says: replace any substring that
- starts with &
- is followed by one or more alphanumeric characters or # in any order
- ends with ;.
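For reference, here is a small sketch that runs the pattern on the string from the question, written out with its &#174; entity. The second, stricter pattern is an assumption added for comparison and is not part of the answer: it only accepts decimal references, hexadecimal references, or a name that starts with a letter.

```php
<?php
$ui = "PEPSI&#174; Bottle 20 oz<br><br>";

// Pattern from the answer: "&", then letters/digits/"#", then ";".
$loose = preg_replace('/&[A-Za-z0-9#]+;/', '', $ui);
echo $loose, PHP_EOL;   // PEPSI Bottle 20 oz<br><br>

// Hypothetical stricter variant: &#123; or &#x1F; or &name;
$strictPattern = '/&(?:#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);/';
$sample = "Caf&eacute; &#xAE; &#174; 100% <br>";
$strict = preg_replace($strictPattern, '', $sample);
echo $strict, PHP_EOL;
```

Both variants leave tags such as <br> alone, because neither < nor > is part of the match.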

Insert space into string

Could anyone tell me how to insert a space between characters in a string using PHP, depending on its length, for a UK postcode?
e.g. if the string is 5 characters, insert a space after the second character? If it is 6 characters, insert it after the third, etc.?
Use a regex that inserts the space before the last three characters:
$formatted = preg_replace('/([a-zA-Z0-9]{3})$/', ' \1', $postalCode);
Note that this only works on alphanumeric characters, but I'm assuming that's what the scope of the input should be.
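A short sketch of the same idea wrapped in a function. The helper name formatPostcode is made up for this example, and it relies on the fact that the inward part of a UK postcode is always the last three characters. Stripping whitespace and upper-casing first are additions, so that input which already contains a space does not end up with two.

```php
<?php
// Hypothetical helper, not part of the original answer.
function formatPostcode(string $postalCode): string
{
    // Drop any existing whitespace and normalise case first.
    $compact = strtoupper(preg_replace('/\s+/', '', $postalCode));

    // Insert a space in front of the final three characters.
    return preg_replace('/([A-Z0-9]{3})$/', ' \1', $compact);
}

echo formatPostcode('M11AE'), PHP_EOL;    // M1 1AE
echo formatPostcode('b338th'), PHP_EOL;   // B33 8TH
echo formatPostcode('SW1A 1AA'), PHP_EOL; // SW1A 1AA
```

This only places the space; it does not validate that the input is a real postcode.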

Splitting string containing letters and numbers not separated by any particular delimiter in PHP

Currently I am developing a web application that fetches a Twitter stream, and I am trying to build the natural language processing on my own.
Since my data is from Twitter (limited to 140 characters), many words are shortened or, in this case, have the space omitted.
For example:
"Hi, my name is Bob. I m 19yo and 170cm tall"
Should be tokenized to:
- hi
- my
- name
- bob
- i
- 19
- yo
- 170
- cm
- tall
Notice that 19 and yo in 19yo have no space between them. I use it mostly for extracting numbers with their units.
Simply put, what I need is a way to 'explode' each token that has a number in it into chunks of digits or letters, with no delimiter between them.
'123abc' will be ['123', 'abc']
'abc123' will be ['abc', '123']
'abc123xyz' will be ['abc', '123', 'xyz']
and so on.
What is the best way to achieve it in PHP?
I found something close to it, but it's in C# and specifically for day/month splitting: How do I split a string in C# based on letters and numbers
You can use preg_split
$string = "Hi, my name is Bob. I m 19yo and 170cm tall";
$parts = preg_split("/(,?\s+)|((?<=[a-z])(?=\d))|((?<=\d)(?=[a-z]))/i", $string);
var_dump ($parts);
When matching against the digit-letter boundary, the regular expression match must be zero-width. The characters themselves must not be included in the match. For this the zero-width lookarounds are useful.
http://codepad.org/i4Y6r6VS
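To see the zero-width boundaries on their own, this sketch applies only the two lookaround alternatives (without the whitespace part) to the three sample tokens from the question. The variable names are invented for the illustration.

```php
<?php
// Split only where a letter meets a digit or a digit meets a letter.
// Both alternatives are zero-width, so no characters are consumed.
$boundary = '/(?<=[a-z])(?=\d)|(?<=\d)(?=[a-z])/i';

$a = preg_split($boundary, '123abc');     // ['123', 'abc']
$b = preg_split($boundary, 'abc123');     // ['abc', '123']
$c = preg_split($boundary, 'abc123xyz');  // ['abc', '123', 'xyz']

print_r($a);
print_r($b);
print_r($c);
```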
How about this:
Extract the numbers from the string with a regex and store them in an array, then replace each number in the string with a special placeholder that 'holds' its position. After parsing the string, which now consists only of placeholders and normal characters, put the numbers from the array back into their reserved places.
Just an idea, but IMHO it might work for you.
EDIT:
Try running this short code; hopefully you will see my point in the output. (This code doesn't work on codepad; I don't know why.)
<?php
$str = "Hi, my name is Bob. I m 19yo and 170cm tall";
preg_match_all("#\d+#", $str, $matches);
$str = preg_replace("!\d+!", "#SPEC#", $str);
print_r($matches[0]);
print $str;

How to correctly parse a mixed latin/ideographic full text query with regex?

I'm trying to sanitize/format some input using regex for a mixed Latin/ideographic (Chinese/Japanese/Korean) full-text search.
I found an old example of someone's attempt at sanitizing a Latin/Asian language string on a forum that I cannot find again (full credit to the original author of this code).
I am having trouble fully understanding the regex portion of the function, in particular why it seems to treat the digits 0, 2 and 3 differently from the other digits 1 and 4-9 (basically it treats 1 and 4-9 properly, but 0, 2 and 3 in the query are treated as if they were Asian characters).
For example. I am trying to sanitize the following string:
"hello 1234567890 蓄積した abc123def"
and it will turn into:
"hello 1 456789 abc1 def 2 3 0 蓄 積 し た 2 3"
the correct output for this sanitized string should be:
"hello 1234567890 蓄 積 し た abc123def"
As you can see, it properly spaces out the Asian characters, but the digits 0, 2 and 3 are treated differently from all the other digits. Any help on why the regex treats 0, 2 and 3 differently would be great (or if you know of a better way of achieving a similar result)! Thank you.
I have included the function below
function prepareString($str) {
$str = mb_strtolower(trim(preg_replace('#[^\p{L}\p{Nd}\.]+#u', ' ', $str)));
return trim(preg_replace('#\s\s+#u', ' ', preg_replace('#([^\12544-\65519])#u', ' ', $str) . ' ' . implode(' ', preg_split('#([\12544-\65519\s])?#u', $str, -1, PREG_SPLIT_NO_EMPTY))));
}
UPDATE: Providing context for clarity
I am authoring a website that will be launched in China. This website will have a search function and I am trying to write a parser for the search query input.
Unlike the English language which uses a " " as the delimiter between words in a sentence, Chinese does not use spaces between words. Because of this, I have to re-format a search query by breaking apart each Chinese character and searching for each character individually within the database. Chinese users will also use latin/english characters for things such as brand names which they can mix together with their Chinese characters (eg. Ivy牛仔舖).
What I would like to do is separate all of the English words from the Chinese characters, and separate each Chinese character with a space.
A search query could look like this: Ivy牛仔舖
And I would want to parse it so that it looks like this: Ivy 牛 仔 舖
The problem appears to be with the regex [^\12544-\65519]. That looks like it's supposed to be a range defined by two, five-digit octal escapes, but it doesn't work that way. The actual breakdown is like this:
\125 => octal escape for 'U'
4 => '4'
4 => '4'
-
\655 => octal escape for... (something)
1 => '1'
9 => '9'
Which is effectively the same as:
[^14-\655]
What \655 means as the top of a range isn't clear, but the character class matches anything except a '1', a '4', or any ASCII character with a code point higher than '4' (which includes '9' and 'U'). It doesn't really matter though; the important point is that octal escapes can contain a maximum of three digits, which makes them unsuitable for your needs. I suggest you use PHP's \x{nnn} hexadecimal notation instead.
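As a sketch of the \x{...} notation: assuming the two numbers in the original pattern were meant as decimal code points, 12544 is U+3100 and 65519 is U+FFEF, so the range could be written as below. That reading of the numbers is an assumption, and the sample string is made up for the example.

```php
<?php
// Assumption: 12544 and 65519 were intended as decimal code points.
echo dechex(12544), ' ', dechex(65519), PHP_EOL;   // 3100 ffef

$str = "ivy牛仔舖 1234567890";

// \x{...} takes a hexadecimal code point; the u modifier makes PCRE
// treat both pattern and subject as UTF-8.
$range = '#[\x{3100}-\x{FFEF}]#u';

// Count the characters that fall inside the range ...
$inRange = preg_match_all($range, $str);

// ... and blank them out, leaving the Latin letters and all digits alone.
$latinOnly = preg_replace($range, ' ', $str);

echo $inRange, PHP_EOL;   // 3
echo $latinOnly, PHP_EOL;
```

With the range written this way, none of the ASCII digits are affected.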
I'm not set up to work with either PHP or Chinese, so I can't give you a definitive answer, but this should at least help you refine the question. As I see it, it's basically a four-step process:
get rid of undesirable characters like punctuation, replacing them with whitespace
normalize whitespace: get rid of leading and trailing spaces, and collapse runs of two or more spaces to one space
normalize case: replace any uppercase letters with their lowercase equivalents
wherever a Chinese character is next to another non-whitespace character, separate the two characters with a space
For the first three steps, the first line of the code you posted should suffice:
$str = mb_strtolower(trim(preg_replace('#[^\p{L}\p{Nd}\.]+#u', ' ', $str)));
For the final step, I would suggest lookarounds:
$str = preg_replace(
'#(?<=\S)(?=\p{Han})|(?<=\p{Han})(?=\S)#u',
' ', $str);
That should insert a space at any position where the next character is Chinese and the previous character is not whitespace, or the previous character is Chinese and the next character is not whitespace.
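A minimal sketch of that final step on the "Ivy牛仔舖" query from the question, using PCRE's \p{Han} script property for the Chinese characters. The variable names are invented for the example, and the input is assumed to be lowercased UTF-8 already.

```php
<?php
// Insert a space at every position where a Han character touches
// another non-whitespace character (on either side).
$pattern = '#(?<=\S)(?=\p{Han})|(?<=\p{Han})(?=\S)#u';

$spaced   = preg_replace($pattern, ' ', 'ivy牛仔舖');
$reversed = preg_replace($pattern, ' ', '牛仔ivy');

echo $spaced, PHP_EOL;    // ivy 牛 仔 舖
echo $reversed, PHP_EOL;  // 牛 仔 ivy
```

\p{Han} covers Han ideographs only; if kana or hangul should be spaced out as well, properties such as \p{Hiragana}, \p{Katakana} or \p{Hangul} would presumably have to be added to the lookarounds.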
After further research and with the help of Alan's comments, I was able to find the correct regex combinations for a query-parsing function that separates Latin and ideographic (Chinese/Japanese) characters, and I'm happy with it:
function prepareString($str) {
$str = mb_strtolower(trim(preg_replace('#[^\p{L}\p{Nd}]+#u', ' ', $str)));
return trim(preg_replace('#\s\s+#u', ' ', preg_replace('#\p{Han}#u', ' ', $str) . ' ' . implode(' ', preg_split('#\P{Han}?#u', $str, -1, PREG_SPLIT_NO_EMPTY))));
}
$query = "米娜Mi-NaNa日系時尚館╭☆ 旅行 渡假風格 【A6402】korea拼接條紋口袋飛鼠棉";
echo prepareString($query); //"mi nana a6402 korea 米 娜 日 系 時 尚 館 旅 行 渡 假 風 格 拼 接 條 紋 口 袋 飛 鼠 棉"
Disclaimer: I cannot read Mandarin, and the string above was copied from a Chinese website. If it says anything offensive, please let me know and I will remove it.
