Could anyone tell me how to insert a space between characters in a string using PHP, depending on the length, for a UK post code?
e.g. if the string is 5 characters, insert a space after the second character; if it is 6 characters, insert it after the third, etc.
Use regex:
$formatted = preg_replace('/([a-zA-Z0-9]{3})$/', ' \1', $postalCode);
Note that this only works on alphanumeric characters, but I'm assuming that's what the scope of the input should be.
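A minimal sketch of how that could be wrapped up, assuming the input may arrive in mixed case or with stray whitespace. The function name formatPostcode is made up for illustration, and the code does not validate that the result is a real postcode:

```php
<?php
// Hypothetical helper: normalise a UK postcode and insert the space
// before the three-character inward code.
function formatPostcode(string $postcode): string
{
    // Drop any existing whitespace and upper-case the rest.
    $compact = strtoupper(preg_replace('/\s+/', '', $postcode));

    // Capture the last three alphanumeric characters and put a space before them.
    return preg_replace('/([A-Z0-9]{3})$/', ' $1', $compact);
}

echo formatPostcode('m11aa'), "\n";   // M1 1AA
echo formatPostcode('sw1a1aa'), "\n"; // SW1A 1AA
```

Because the space always goes before the last three characters, the same pattern covers the 5, 6 and 7 character cases without checking the length.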
I have a large amount of text data which here and there contains a lot of special characters, and I need to validate it against other data (text) sources.
My question is: is it possible to "escape" a string in a regex so that the special characters are not interpreted?
Example:
$text = "My Random St. [486] s/n 445 (don't call these guys)";
preg_match("/$text/", $other_text);
In the example here, there are a lot of special characters in $text, as this is a massive amount of incoming text to be compared to a large number of $other_text strings (and sometimes the data actually contains regexps), so I need to use preg_match.
What I'm getting at: is there a "turn special characters off in this string" type of delimiter?
Using the above example:
$text = "My Random St. [486] s/n 445 (don't call these guys)";
preg_match("/%$text%/", $other_text);
Here, the % characters surrounding $text indicate that the string is to be taken literally, as opposed to containing regex characters.
Any ideas?
The function you are looking for is called preg_quote. It escapes every character that has a special meaning in a regex, so you get a "plain, stupid" string match.
http://php.net/manual/en/function.preg-quote.php
$text = "My Random St. [486] s/n 445 (don't call these guys)";
preg_match("/". preg_quote($text, "/") . "/", $other_text);
This matches the given string literally wherever it occurs in the other text.
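For reference, a small self-contained run of the same idea. The content of $other_text is invented here purely for the demonstration:

```php
<?php
$text = "My Random St. [486] s/n 445 (don't call these guys)";
// Made-up haystack, only for this example.
$other_text = "Deliver to: My Random St. [486] s/n 445 (don't call these guys), rear door";

// preg_quote() escapes regex metacharacters; the second argument
// additionally escapes the delimiter used around the pattern.
$pattern = '/' . preg_quote($text, '/') . '/';

var_dump(preg_match($pattern, $other_text)); // int(1)

// The brackets are now literal, so a different house number does not match.
var_dump(preg_match($pattern, 'My Random St. 4 s/n 445')); // int(0)
```

Passing "/" as the second argument matters in this case because $text itself contains a slash (s/n).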
I'm trying to check if a string has a certain number of occurrences of a character.
Example:
$string = '123~456~789~000';
I want to verify if this string has exactly 3 instances of the character ~.
Is that possible using regular expressions?
Yes
/^[^~]*~[^~]*~[^~]*~[^~]*$/
Explanation:
^ ... $ means the whole string in many regex dialects
[^~]* a string of zero or more non-tilde characters
~ a tilde character
The string can have as many non-tilde characters as necessary, appearing anywhere in the string, but must have exactly three tildes, no more and no less.
As a single character is technically a substring, and the task is to count the number of its occurrences, I suppose the most efficient approach is to use a dedicated PHP function, substr_count:
$string = '123~456~789~000';
if (substr_count($string, '~') === 3) {
// string is valid
}
Obviously, this approach won't work if you need to count the number of pattern matches (for example, while you can count the number of '0' characters in your string with substr_count, you had better use preg_match_all to count digits).
Yet for this specific question it should be faster overall, as substr_count is optimized for one specific goal (counting substrings), whereas preg_match_all is more general-purpose.
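To make the difference concrete, here is a short sketch that counts a fixed character with substr_count and a character class with preg_match_all. The variable names are only for illustration:

```php
<?php
$string = '123~456~789~000';

// Fixed substring: substr_count() is enough.
$tildes = substr_count($string, '~');

// Pattern (any digit): this needs a regex, so preg_match_all() is used.
// It returns the number of full pattern matches.
$digits = preg_match_all('/\d/', $string);

echo $tildes, "\n"; // 3
echo $digits, "\n"; // 12
```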
I believe this should work for a variable number of characters:
^(?:[^~]*~[^~]*){3}$
The advantage here is that you just replace 3 with however many you want to check.
To make it more efficient, it can be written as
^[^~]*(?:~[^~]*){3}$
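As a rough sketch, the second pattern could be built for any single-byte character and any count like this. The helper name hasExactly is hypothetical and not something from PHP itself:

```php
<?php
// Hypothetical helper: true when $string contains exactly $count copies of $char.
// Assumes $char is a single-byte character.
function hasExactly(string $string, string $char, int $count): bool
{
    // Escape the character in case it is a regex metacharacter.
    $c = preg_quote($char, '/');

    // ^[^c]*(?:c[^c]*){count}$
    $pattern = '/^[^' . $c . ']*(?:' . $c . '[^' . $c . ']*){' . $count . '}$/';

    return preg_match($pattern, $string) === 1;
}

var_dump(hasExactly('123~456~789~000', '~', 3)); // bool(true)
var_dump(hasExactly('123~456~789~000', '~', 2)); // bool(false)
```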
This is what you are looking for:
<?php
$string = '123~456~789~000';
$total = preg_match_all('/~/', $string);
echo $total; // Shows 3
I'm reluctant to ask, but I can't figure out PHP's preg_replace and how to ignore certain bits of the string.
$string = '2012042410000102';
$string needs to look like _0424_102
The digits shown are variable and always changing, and 2012 changes every year.
What I've tried:
^\d{4}[^\d{4}]10000[^\d{3}]$
^\d{4}[^\d]{4}10000[^\d]{3}$
Any help would be appreciated. I know it's a noob question but easy points for whoever helps.
Thanks
Your first regex is looking for:
The start of the string
Four digits (the year)
Any single character that is not a digit nor { or }
The number 10000
Any single character that is not a digit nor { or }
The end of the string
Your second regex is looking for:
The start of the string
Four digits (the year)
Any four characters that are not digits
The number 10000
Any three characters that are not digits
The end of the string
The regex you're looking for is:
^\d{4}(\d{4})10000(\d{3})$
And the replacement should be:
_$1_$2
This regex looks for:
The start of the string
Four digits (the year)
Capture four digits (the month and day)
The number 10000
Capture three digits (the 102 at the end in your example)
The end of the string
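Put together in PHP, the pattern and replacement above would look roughly like this (the variable names are just for the example):

```php
<?php
$string = '2012042410000102';

// $1 is the month/day group, $2 is the trailing three digits.
$result = preg_replace('/^\d{4}(\d{4})10000(\d{3})$/', '_$1_$2', $string);
echo $result, "\n"; // _0424_102

// Input that does not fit the pattern is returned unchanged.
$unchanged = preg_replace('/^\d{4}(\d{4})10000(\d{3})$/', '_$1_$2', '20120424');
echo $unchanged, "\n"; // 20120424
```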
Try the following:
^\d{4}|10000(?=\d{3}$)
This will match either the first four digits in a string, or the string '10000' if there are three digits after '10000' before the end of the string.
You would use it like this:
preg_replace('/^\d{4}|10000(?=\d{3}$)/', '_', $string);
http://codepad.org/itTgEGo4
Just use simple string functions:
$string = '2012042410000102';
$new = '_'.str_replace('10000', '_', substr($string, 4));
http://codepad.org/elRSlCIP
If they're always in the same character locations, regular expressions seem unnecessary. You could use substrings to get the parts you want, like
sprintf('_%s_%s', substr($string,4,4), substr($string,13))
or
'_' . substr($string,4,4) . '_' . substr($string,13)
How do I count characters, including white space, and then break after a certain length? For instance, how would I break a string onto a new line after 25 characters using PHP?
Fortunately somebody's already done the work. Use wordwrap.
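A quick sketch of what that might look like for the 25-character case. The sample sentence is made up, and the fourth argument (true) tells wordwrap to cut words that are longer than the width:

```php
<?php
// Made-up sample text for the demonstration.
$text = 'The quick brown fox jumps over the lazy dog near the riverbank.';

// Wrap at 25 characters, break with a newline, cut over-long words.
$wrapped = wordwrap($text, 25, "\n", true);

echo $wrapped, "\n";
```

Note that wordwrap prefers to break at spaces, so lines can end up shorter than 25 characters.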
If you really want to reinvent the wheel for learning's sake, here are a few pieces to get you started:
for (...) { }
strlen()
$str[$x] to access character x of string $str
%
.
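One possible way to assemble those pieces, as a sketch only. The helper name breakEvery is invented, and it counts bytes rather than multibyte characters:

```php
<?php
// Hypothetical helper: insert $break after every $width bytes of $str.
function breakEvery(string $str, int $width, string $break = "\n"): string
{
    $out = '';
    $len = strlen($str);

    for ($x = 0; $x < $len; $x++) {
        // At every multiple of $width (except position 0), emit a break first.
        if ($x > 0 && $x % $width === 0) {
            $out .= $break;
        }
        $out .= $str[$x];
    }

    return $out;
}

echo breakEvery('abcdefghij', 4), "\n";
// abcd
// efgh
// ij
```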
Try chunk_split() if you don't mind cutting words in half. It treats whitespace as any other char.
I'm trying to sanitize/format some input using regex for a mixed Latin/ideographic (Chinese/Japanese/Korean) full-text search.
I found an old example of someone's attempt at sanitizing a Latin/Asian language string on a forum which I cannot find again (full credit to the original author of this code).
I am having trouble fully understanding the regex portion of the function, in particular why it seems to treat the numbers 0, 2, and 3 differently from the other digits 1 and 4-9 (basically it treats the numbers 1 and 4-9 properly, but the numbers 0, 2, and 3 in the query are treated as if they were Asian characters).
For example. I am trying to sanitize the following string:
"hello 1234567890 蓄積した abc123def"
and it will turn into:
"hello 1 456789 abc1 def 2 3 0 蓄 積 し た 2 3"
the correct output for this sanitized string should be:
"hello 1234567890 蓄 積 し た abc123def"
As you can see, it properly spaces out the Asian characters, but the numbers 0, 2, and 3 are treated differently from all other numbers. Any help on why the regex is treating the numbers 0, 2, and 3 differently would be a great help (or if you know of a better way of achieving a similar result)! Thank you
I have included the function below
function prepareString($str) {
$str = mb_strtolower(trim(preg_replace('#[^\p{L}\p{Nd}\.]+#u', ' ', $str)));
return trim(preg_replace('#\s\s+#u', ' ', preg_replace('#([^\12544-\65519])#u', ' ', $str) . ' ' . implode(' ', preg_split('#([\12544-\65519\s])?#u', $str, -1, PREG_SPLIT_NO_EMPTY))));
}
UPDATE: Providing context for clarity
I am authoring a website that will be launched in China. This website will have a search function and I am trying to write a parser for the search query input.
Unlike the English language which uses a " " as the delimiter between words in a sentence, Chinese does not use spaces between words. Because of this, I have to re-format a search query by breaking apart each Chinese character and searching for each character individually within the database. Chinese users will also use latin/english characters for things such as brand names which they can mix together with their Chinese characters (eg. Ivy牛仔舖).
What I would like to do is separate all of the English words out from the Chinese characters, and separate each Chinese character with a space.
A search query could look like this: Ivy牛仔舖
And I would want to parse it so that it looks like this: Ivy 牛 仔 舖
The problem appears to be with the regex [^\12544-\65519]. That looks like it's supposed to be a range defined by two five-digit octal escapes, but it doesn't work that way. The actual breakdown is like this:
\125 => octal escape for 'U'
4 => '4'
4 => '4'
-
\655 => octal escape for... (something)
1 => '1'
9 => '9'
Which is effectively the same as:
[^14-\655]
What \655 means as the top of a range isn't clear, but the character class matches anything except a '1', a '4', or any ASCII character with a code point higher than '4' (which includes '9' and 'U'). It doesn't really matter though; the important point is that octal escapes can contain a maximum of three digits, which makes them unsuitable for your needs. I suggest you use PHP's \x{nnn} hexadecimal notation instead.
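As a sketch of that suggestion, and assuming the two numbers in the original class were meant as decimal code points (12544 is U+3100 and 65519 is U+FFEF), the range could be written like this. The sample string is invented:

```php
<?php
// Assumption: 12544 and 65519 were intended as decimal code points.
echo dechex(12544), ' ', dechex(65519), "\n"; // 3100 ffef

// Made-up sample input.
$str = 'abc123 蓄積';

// Single quotes keep PHP from interpreting \x itself; PCRE sees \x{3100}.
// The u modifier is required for code points above 0xFF.
$latinOnly = preg_replace('#[\x{3100}-\x{FFEF}]#u', ' ', $str);

echo trim($latinOnly), "\n"; // abc123

// ASCII digits are outside the range, so none of them match.
var_dump(preg_match('#[\x{3100}-\x{FFEF}]#u', '0123456789')); // int(0)
```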
I'm not set up to work with either PHP or Chinese, so I can't give you a definitive answer, but this should at least help you refine the question. As I see it, it's basically a four-step process:
get rid of undesirable characters like punctuation, replacing them with whitespace
normalize whitespace: get rid of leading and trailing spaces, and collapse runs of two or more spaces to one space
normalize case: replace any uppercase letters with their lowercase equivalents
wherever a Chinese character is next to another non-whitespace character, separate the two characters with a space
For the first three steps, the first line of the code you posted should suffice:
$str = mb_strtolower(trim(preg_replace('#[^\p{L}\p{Nd}\.]+#u', ' ', $str)));
For the final step, I would suggest lookarounds:
$str = preg_replace(
'#(?<=\S)(?=\p{Han})|(?<=\p{Han})(?=\S)#u',
' ', $str);
That should insert a space at any position where the next character is Chinese and the previous character is not whitespace, or the previous character is Chinese and the next character is not whitespace.
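Here is a small runnable sketch of the lookaround approach. It assumes the script property is spelled \p{Han}, which is the name PCRE uses for Chinese ideographs, and it takes the sample query from the question:

```php
<?php
$str = 'ivy牛仔舖';

// Match the empty position before a Han character that follows a
// non-space, or after a Han character that precedes a non-space.
$spaced = preg_replace(
    '#(?<=\S)(?=\p{Han})|(?<=\p{Han})(?=\S)#u',
    ' ',
    $str
);

echo $spaced, "\n"; // ivy 牛 仔 舖
```

Since the pattern only matches positions and never consumes characters, nothing is removed from the string. Kana and Hangul are not covered by \p{Han}, so mixed Japanese or Korean input would need additional script properties.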
After further research, and with the help of Alan's comments, I was able to find the correct regex combinations for a query parsing function that separates Latin and ideographic (Chinese/Japanese) characters and that I'm happy with:
function prepareString($str) {
$str = mb_strtolower(trim(preg_replace('#[^\p{L}\p{Nd}]+#u', ' ', $str)));
return trim(preg_replace('#\s\s+#u', ' ', preg_replace('#\p{Han}#u', ' ', $str) . ' ' . implode(' ', preg_split('#\P{Han}?#u', $str, -1, PREG_SPLIT_NO_EMPTY))));
}
$query = "米娜Mi-NaNa日系時尚館╭☆ 旅行 渡假風格 【A6402】korea拼接條紋口袋飛鼠棉";
echo prepareString($query); //"mi nana a6402 korea 米 娜 日 系 時 尚 館 旅 行 渡 假 風 格 拼 接 條 紋 口 袋 飛 鼠 棉"
Disclaimer: I cannot read Mandarin and the string above was copied from a Chinese website. If it says anything offensive, please let me know and I will remove it.