php preg_split functionality - php

is there a way to understand the following logic contained in the splitting pattern:
preg_split("/[\s,]+/", "hypertext language, programming");
in the grand scheme of things i understand what it is doing, but i really want a granular understand of how to use the escapes and special character notation. is there a granular explanation of this anywhere? if not could someone please provide a breakdown of how this works. it is something very useful, and something i would like to have completely under in my belt so to speak.

+ means 1 or more
[\s,] means a space and/or comma character
This will split the text by 1 or more spaces and commas together
definitely read http://www.regular-expressions.info/ as Silfverstrom recommended. Also what helped me learn was this game: http://www.javaregex.com/agame.html

you should have a look at regular expressions, this might be a good place to start
http://www.regular-expressions.info/reference.html

Related

What's the best approach to find words from a set of words in a string?

I must detect the presence of some words (even polyrematic, like in "bag of words") in a user-submitted string.
I need to find the exact word, not part of it, so the strstr/strpos/stripos family is not an option for me.
My current approach (PHP/PCRE regex) is the following:
\b(first word|second word|many other words)\b
Is there any other better approach? Am I missing something important?
Words are about 1500.
Any help is appreciated
A regular expression the way you're demonstrating will work. It may be challenging to maintain if the list of words grows long or changes.
The method you're using will work in the event that you need to look for phrases with spaces and the list doesn't grow much.
If there are no spaces in the words you're looking for, you could split the input string on space characters (\s+, see https://www.php.net/manual/en/function.preg-split.php ), then check to see if any of those words are in a Set (https://www.php.net/manual/en/class.ds-set.php) made up of the words you're looking for. This will be a bit more code, but less regex maintenance, so ymmv based on your application.
If the set has spaces, consider instead using Trie. Wiktor Stribiżew suggests: https://github.com/sters/php-regexp-trie

Convert regex from gskinner to PHP

I know that I'd likely hear "Don't parse HTML with regex", so let me say that this question is just academic at this point because I actually solved my problem using the DOM, but on my road to a solution, I ran across this pattern that works on the gskinner website, but I can't figure out how to make it work in PHP preg_match().
(?<=href\=")[^]+?(?=")
I think that the [^] is causing the problem, but I'm not certain what to do about it.
What it is intended to do is pull the substring from between the quotes of an href. (One would expect it to be a web-address or at least part of one.)
[^] is a difficult construct. Basically it is an empty negated character class. But what should it match? That depends on the implementation. Some languages are interpreting it as negation of nothing, so it will match every character, that is what gskinner (means ActionScript 3) seems to be doing.
I would never use this, because it is ambiguous.
The most readable way is to use ., the meta character that matches every character (without newlines), if newlines are also wanted, just add the modifier s that enables the dotall mode, this would be exactly what you wanted to achieve with [^].
A workaround that is sometimes used is to use a character class something like this [\s\S] or [\w\W]. Those will also match every character (including newlines), because they are matching some predefined character class and their negation.

Basic Regular Expression for

For some reason I always get stuck making anything past extremely basic regular expressions.
I'm trying to make a regular expression that kind of looks like a URL. I only want basic checking.
I would like it to match the following patterns where X is "something".
X://X.X
X://X.X... etc.
X.X
X.X... etc
If the string contains one of these patterns, it is sufficient checking for me. This way a url like www.example.com:8888 will still match. I have tried many different REGEX combinations with preg_match and cannot seem to get any to behave the way I want it to. I have consulted many other related REGEX questions on SO but my readings have not helped me.
Any help? I will be happy to provide more information if you would like but I don't know what else you would need.
It takes practice but here is one that I made using a regex tester (http://www.regextester.com/) to check my pattern:
^.+(:\/\/|\.)([a-zA-Z0-9]+\.)+.+
My approach is to slowly build my pattern from the beginning and add on one piece at a time. This cheatsheet is extremely helpful for remembering http://www.cheatography.com/davechild/cheat-sheets/regular-expressions/ what everything is.
Basically the pattern starts at the beginning of the string and checks for any characters followed by either :// or . then checks for groupings of letters and numbers followed by a . ending with any number of characters.
The pattern could probably be improved with groupings to not pass on invalid characters. But this one was quick and dirty. You could replace the first and last . with the characters that would be valid.
UPDATE
Per the comments here is an updated pattern:
^.+?(:\/\/|\.)?([a-zA-Z0-9]+?\.)+.+
/^(.+:\/\/)?[^.]+\.[^.\/]+([.\/][^.\/]+)*$/

Regex to check if exact string exists

I am looking for a way to check if an exact string match exists in another string using Regex or any better method suggested. I understand that you tell regex to match a space or any other non-word character at the beginning or end of a string. However, I don't know exactly how to set it up.
Search String: t
String 1: Hello World, Nice to see you! t
String 2: Hello World, Nice to see you!
String 3: T Hello World, Nice to see you!
I would like to use the search string and compare it to String 1, String 2 and String 3 and only get a positive match from String 1 and String 3 but not from String 2.
Requirements:
Search String may be at any character position in the Subject.
There may or may not be a white-space character before or after it.
I do not want it to match if it is part of another string; such as part of a word.
For the sake of this question:
I think I would do this using this pattern: /\bt\b/gi
/\b{$search_string}\b/gi
Does this look right? Can it be made better? Any situations where this pattern wouldn't work?
Additional info: this will be used in PHP 5
Your suggestion of /\bt\b/gi will work and is probably the way to go. You've correctly used \b for word boundaries. You're using the global and case-insensitive modifiers which will find all matches in both cases. Simple, straight forward, clean. Look no further than what you've already come up with.
Looks fine to me. You might want to check the exact meaning of the \b assertion to make sure it's exactly what you need.
Can't really name any situation where this pattern "wouldn't work" without a more elaborate description, but \b would work fine for your testcases.
According to the old saying give a man a reg expression and he is happy for a day, teach him to write regular expression and he is happy for a lifetime (or something to that effect) try out the "regulator"
It provides a GUI and some pretty good examples for reg exp needs.

Regex, encoding, and characters that look a like

First, a brief example, let's say I have this /[0-9]{2}°/ RegEx and this text "24º". The text won't match, obviously ... (?) really, it depends on the font.
Here is my problem, I do not have control on which chars the user uses, so, I need to cover all possibilities in the regex /[0-9]{2}[°º]/, or even better, assure that the text has only the chars I'm expecting °. But I can't just remove the unknown chars otherwise the regex won't work, I need to change it to the chars that looks like it and I'm expecting. I have done this through a little function that maps the "look like" to "what I expect" and change it, the problem is, I have not covered all possibilities, for example, today I found a new -, now we got three of them, just like latex =D - -- --- ,cool , but the regex didn't work.
Does anyone knows how I might solve this?
There is no way to include characters with a "similar appearance" in a regular expression, so basically you can't.
For a specific character, you may have luck with the Unicode specification, which may list some of the most common mistakes, but you have no guarantee. In case of the degree sign, the Unicode code chart lists four similar characters (\u02da, \u030a, \u2070 and \u2218), but not your problematic character, the masculine ordinal indicator.
Unfortunately not in PHP. ASP.NET has unicode character classes that cover things like this, but as you can see here, :So covers too much. Also as it's not PHP doesn't help anyway. :)
In PHP you are going to be limited to selecting the most common character sets and using them.
This should help:
http://unicode.org/charts/charindex.html
There is only one degree symbol. Using something that looks similar is not correct. There are also symbols for degree Fahrenheit and celsius. There are tons of minus signs unfortunately.
Your regular expression will indeed need to list all the characters that you want to accept. If you can't know the string's encoding in advance, you can specify your regular expression to be UTF-8 using the /u modifier in PHP: "/[0-9]{2}[°º]/u" Then you can include all Unicode characters that you want to accept in your character class. You will need to convert the subject string to UTF-8 also before using the regex on it.
I just stumbled into good references for this question:
http://www.unicode.org/Public/6.3.0/ucd/NameAliases.txt
https://docs.python.org/3.4/library/unicodedata.html#unicodedata.normalize
https://www.rfc-editor.org/rfc/rfc3454.html
Ok, if you're looking to pull temp you'll probably need to start with changing a few things first.
temperatures can come in 1 to 3 digits so [0-9]{1,3} (and if someone is actually still alive to put in a four digit temperature then we are all doomed!) may be more accurate for you.
Now the degree signs are the tricky part as you've found out. If you can't control the user (more's the pity), can you just pull whatever comes next?
[0-9]{1,3}.
You might have to beef up the first part though with a little position handling like beginning of the string or end.
You may also exclude all the regular characters you don't want.
[0-9]{1,3}[^a-zA-Z]
That will pick up all the punctuation marks (only one though).

Categories