Is there an equivalent of the PHP function preg_split for JavaScript?
Any string in javascript can be split using the string.split function, e.g.
"foo:bar".split(/:/)
where split takes as an argument either a regular expression or a literal string.
You can use regular expressions with split.
The problem is the escape characters in the string as the (? opens a non capturing group but there is no corresponding } to close the non capturing group it identifies the string to look for as '
If you want support for all of the preg_split arguments see https://github.com/kvz/phpjs/blob/master/_workbench/pcre/preg_split.js (though not sure how well tested it is).
Just bear in mind that JavaScript's regex syntax is somewhat different from PHP's (mostly less expressive). We would like to integrate XRegExp at some point as that makes up for some of the missing features of PHP regexes (as well as fixes the many browser reliability problems with functions like String.split()).
Related
I am no PHP expert. I am looking for the PHP equivalent of isLetter() in Java, but I can't find it. Does it exist?
I need to extract letters from a given string and make them lower case, for example: "Ap.ér4i5T i6f;" should give "apéritif'. So, yes, there are accentuated characters in my strings.
ctype_alpha().
In addition to regex / preg_replace, you can also use strtoupper($string) and strtolower($string), if you need to universally upper-case a string. As Konrad mentioned, preg_replace is probably your best bet though.
http://php.net/manual/en/function.strtoupper.php
http://www.php.net/manual/en/function.strtolower.php
In PHP (and in Java) you wouldn’t use isLetter to implement it, you’d rather replace all characters that aren’t letters using a regular expression:
echo preg_replace('/\P{L}/', '', input);
Loop up the documentation of preg_replace and the regex pattern syntax desciption, in particular the relevant Unicode character classes.
You could probably use the php-slugs source code, with appropriate modifications.
I know I can use preg_match but I was wondering if php had a way to evaluate to a regular expression like:
if(substr($example, 0, 1) == /\s/){ echo 'whitespace!'; }
PHP does not have first-class regular expressions.
You will need to use the functions provided by the default PCRE extension. Sorry. It's a backslash-escaping nightmare, but it's all we've got.
(There's also the now-deprecated POSIX regex extension, but you should not use them any longer. They are slower, less featureful, and most important, they aren't Unicode-safe. Modern PCRE versions understand Unicode very well, even if PHP itself is ignorant about it.)
With regard to the backslash-escaping nightmare, you can keep the horror to a minimum by using single quotes to enclose the string containing the regex instead of doubles, and picking an appropriate delimiter. Compare:
"/^http:\\/\\/www.foo.bar\\/index.html\\?/"
versus
'!^http://www.foo.bar/index.html\?!'
Inside single quotes, you only need to backslash-escape backslashes and single quotes, and picking a different delimiter avoids needing to escape the delimiter inside the regex.
:)
if(substr($example, 0, 1) == " "){ echo 'whitespace!';}
You should not be using regexp when it is not needed.
There would also be the microoptimization option:
if (strstr(" \t\r\n", $example{0})) {
The {0} is an outdated way to get the first character (same as [0] actually). And strstr simply checks if the character is contained in the list of whitespace characters. Another option would be strspn, at least in your example case.
I have the following problem.
Let's take the input (wikitext)
======hello((my first program)) world======
I want to match "hello", "my first program" and " world" (notice the space).
But for the input:
======hello(my first program)) world======
I want to match "hello(my first program" and " world".
In other words, I want to match any letters, spaces and additionally any single symbols (no double or more).
This should be done with the unicode character properties like \p{L}, \p{S} or \p{Z}, as documented here.
Any ideas?
Addendum 1
The regex has just to stop before any double symbol or punctuation in unicode terms, that is, before any \p{S}{2,} or \p{P}{2,}.
I'm not trying to parse the whole wikitext with this, read my question carefully. The regex I'm looking for IS for the lexer I'm working on, and making it match such inputs will simplify my parser incredibly.
Addendum 2
The pattern must work with preg_match(). I can imagine how I'd have to split it first. Perhaps it would use some lookahead, I don't know, I've tried everything that I could imagine.
Using only preg_match() is a requirement set in stone by the current implementation of the lexer. It must be that way, because that's the natural way of how lexers work: they match sequences in the input stream.
return preg_split('/([\pS\pP])\\1+/', $theString);
Result: http://www.ideone.com/YcbIf
(You need to get rid of the empty strings manually.)
Edit: as a preg_match regex:
'/(?:^|([\pS\pP])\\1+)((?:[^\pS\pP]|([\pS\pP])(?!\\3))*)/'
take the 2nd capture group when it is matched. Example: http://www.ideone.com/ErTVA
But you could just consume ([\pS\pP])\\1+ and discard, or if doesn't match, consume (?:[^\pS\pP]|([\pS\pP])(?!\\3))* and record, since your lexer is going to use more than 1 regex anyway?
Regular expressions are notoriously overused and ill-suited for parsing languages like this. You can get away with it for a little while, but eventually you will find something that breaks your parser, requiring tweak after tweak and a huge library of unit tests to ensure compliance.
You should seriously consider writing a proper lexer and parser instead.
I'm trying to write a regular expression for matching the following HTML.
<span class="hidden_text">Some text here.</span>
I'm struggling to write out the condition to match it and have tried the following, but in some cases it selects everything after the span as well.
$condition = "/<span class=\"hidden_text\">(.*)<\/span>/";
If anyone could highlight what I'm doing wrong that would be great.
You need to use a non-greedy selection by adding ? after .* :
$condition = "/<span class=\"hidden_text\">(.*?)<\/span>/";
Note : If you need to match generic HTML, you should use a XML parser like DOM.
You shouldn’t try to use regular expressions on a non-regular language like HTML. Better use a proper HTML parser to parse the document.
See the following questions for further information on how to do that with PHP:
How to parse HTML with PHP?
Best methods to parse HTML
$condition = "/<span class=\"hidden_text\">(?<=^|>)[^><]+?(?=<|$)<\/span>/";
I got it. ;)
Chances are that you have multiple spans, and the regexp you're using will default to greedy mode
It's a lot easier using PHP's DOM Parser to extract content from HTML
I think this is what they call a teachable moment. :P Let us now compare and contrast the regex in your self-answer:
"/<span class=\"hidden_text\">(?<=^|>)[^><]+?(?=<|$)<\/span>/"
...and this one:
'~<span class="hidden_text">[^><]++</span>~'
PHP's double-quoted strings are subject to interpolation of embedded variables ($my_var) and evaluation of source code wrapped in braces ({return "foo"}). If you aren't using those features, it's best to use single-quoted strings to avoid surprises. As a bonus, you don't have to escape those double-quotes any more.
PHP allows you to use almost any ASCII punctuation character for the regex delimiters. By replacing your slashes with ~ I eliminated the need to escape the slash in the closing tag.
The lookbehind - (?<=^|>) - was not doing anything useful. It would only ever be evaluated immediately after the opening tag had been matched, so the previous character was always >.
[^><]+? is good (assuming you don't want to allow other tags in the content), but the quantifier doesn't need to be reluctant. [^><]+ can't possibly overrun the closing </span> tag, so there's point sneaking up on it. In fact, go ahead and kick the door in with a possessive quantifier: [^><]++.
Like the lookbehind before it, (?=<|$) was only taking up space. If [^><]+ consumes everything it can and the next character not <, you don't need a lookahead to tell you the match is going to fail.
Note that I'm just critiquing your regex, not fixing it; your regex and mine would probably yield the same results every time. There are many ways both of them can go wrong, even if the HTML you're working with is perfectly valid. Matching HTML with regexes is like trying to catch a greased pig.
I get this warning from php after the change from split to preg_split for php 5.3 compatibility :
PHP Warning: preg_split(): Delimiter must not be alphanumeric or backslash
the php code is :
$statements = preg_split("\\s*;\\s*", $content);
How can I fix the regex to not use anymore \
Thanks!
The error is because you need a delimiter character around your regular expression.
$statements = preg_split("/\s*;\s*/", $content);
Although the question was tagged as answered two minutes after being asked, I'd like to add some information for the records.
Similar to the way strings are delimited by quotation marks, regular expressions in many languages, such as Perl or JavaScript, are delimited by forward slashes. This will lead to expressions looking like this:
/\s*;\s*/
This syntax also allows to specify modifiers:
/\s*;\s*/Ui
PHP's Perl-compatible regular expressions (aka preg_... functions) inherit this. However, PHP itself doesn't support this syntax so feeding preg_split() with /\s*;\s*/ would raise a parse error. Instead, you enclose it with quotes to build a regular string.
One more thing you must take into account is that PHP allows to change the delimiter. For instance, you can use this:
#\s*;\s*#Ui
What is it good for? It simplifies the use of forward slashes inside the expression since you don't need to escape them. Compare:
/^\/home\/.*$/i
#^/home/.*$#i
If you don't like delimiters, you can use T-Regx tool:
pattern("\\s*;\\s*")->split($content):
You can also use Pattern::of("\\s*;\\s*")->split()