Preg_split matching more than what it should


Code:
$pattern = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
$urls = array();
preg_match($pattern, $comment, $urls);
return $urls;
According to an online regex tester, this regex is correct and should be working:
http://regexr.com?35nf9
I am outputting the $links array using:
$linkItems = $model->getLinksInComment($model->comments);
//die(print_r($linkItems));
echo '<ul>';
foreach($linkItems as $link) {
echo '<li>'.$link.'</li>';
}
echo '</ul>';
The output looks like the following:
http://google.com
http
The $model->comments looks like the following:
destined for surplus
RT#83015
RT#83617
http://google.com
https://google.com
non-link
The generated list is only supposed to contain links, and there should be no empty lines. Is there something wrong with what I did? The regex seems to be correct.

If I'm understanding right, you should use preg_match_all in your getLinksInComment function instead:
preg_match_all($pattern, $comment, $matches);
if (isset($matches[0])) {
return $matches[0];
}
return array(); #in case there are no matches
preg_match_all gets all matches in a string (even if the string contains newlines) and puts them into the array you supply as the third argument. However, anything matched by your regex's capture groups (e.g. (http|https|ftp|ftps)) will also be put into your $matches array (as $matches[1] and so on). That's why you want to return just $matches[0] as your final array of matches.
I just ran this exact code:
$line = "destined for surplus\n
RT#83015\n
RT#83617\n
http://google.com\n
https://google.com\n
non-link";
$pattern = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
preg_match_all($pattern, $line, $matches);
var_dump($matches);
and got this for my output:
array(3) {
[0]=>
array(2) {
[0]=>
string(17) "http://google.com"
[1]=>
string(18) "https://google.com"
}
[1]=>
array(2) {
[0]=>
string(4) "http"
[1]=>
string(5) "https"
}
[2]=>
array(2) {
[0]=>
string(0) ""
[1]=>
string(0) ""
}
}
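If the extra $matches[1] and $matches[2] entries are not wanted at all, one option is to make the groups non-capturing with (?:...). The following is only a sketch: extractLinks is a hypothetical standalone name (the question's method is getLinksInComment), the sample comment is shortened, and the pattern is meant to keep the same logic as the original with # as the delimiter so the slashes need no escaping.

```php
<?php
// Hypothetical helper name; the question's own method is getLinksInComment().
function extractLinks(string $comment): array
{
    // (?:...) groups without capturing, so only the full matches are recorded.
    $pattern = '#(?:https?|ftps?)://[a-zA-Z0-9.-]+\.[a-zA-Z]{2,3}(?:/\S*)?#';

    // preg_match_all() returns the number of matches, or false on error.
    if (preg_match_all($pattern, $comment, $matches) === false) {
        return array();
    }
    return $matches[0];
}

$comment = "destined for surplus\nRT#83015\nhttp://google.com\nhttps://google.com\nnon-link";
print_r(extractLinks($comment));
```

With no capture groups in the pattern, $matches contains only index 0, so nothing but whole URLs can leak into the list.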

Your comment is structured as multiple lines, some of which contain the URLs in which you're interested and nothing else. This being the case, you need not use anything remotely resembling that disaster of a regex to try to pick URLs out of the full comment text; you can instead split by newline, and examine each line individually to see whether it contains a URL. You might therefore implement a much more reliable getLinksInComment() thus:
function getLinksInComment($comment) {
    $links = array();
    // Split on Unix or Windows line endings and keep only lines that start with "http".
    foreach (preg_split('/\r?\n/', $comment) as $line) {
        if (!preg_match('/^http/', $line)) {
            continue;
        }
        $links[] = $line;
    }
    return $links;
}
With suitable adjustment to serve as an object method instead of a bare function, this should solve your problem entirely and free you to go about your day.

Related

Extracting data using hash sign in preg_match_all() pattern does not work

I am new to RegEx. I am parsing an HTML page, and because it is buggy I cannot use an XML or HTML parser. So I am using a regular expression.
My code looks like this:
$html = '<html><div data-id="ABC012" data-index="123" ...';
preg_match_all('/<div data-id="[A-Z\\d]+" data-index="\\d+"/', $html, $result);
var_dump($result);
The output looks good so the code is working. Now I want to extract the matched values. I did it exactly as described in this answer and now the code looks like this:
$html = '<html><div data-id="ABC012" data-index="123" ...';
preg_match_all('/<div data-id="#([A-Z\\d]+)" data-index="#(\\d+)"/', $html, $result);
var_dump($result);
But it outputs an empty array. What is wrong? Please don't improve the pattern by adding the closing '>' or making it robust against white spaces. I just need to get the code running.

You could write the code and the pattern like this, using a single backslash to match digits \d and omit the # in the pattern as that is not in the example data:
$html = '<html><div data-id="ABC012" data-index="123" ...';
preg_match_all('/<div data-id="([A-Z\d]+)" data-index="(\d+)"/', $html, $result);
var_dump($result);
Output
array(3) {
[0]=>
array(1) {
[0]=>
string(38) "<div data-id="ABC012" data-index="123""
}
[1]=>
array(1) {
[0]=>
string(6) "ABC012"
}
[2]=>
array(1) {
[0]=>
string(3) "123"
}
}
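To actually read the extracted values, note that with the default PREG_PATTERN_ORDER each group gets its own sub-array, so the first captured ID is $result[1][0]. As a sketch, named groups can make the lookup more readable; the group names id and index are chosen here purely for illustration.

```php
<?php
$html = '<html><div data-id="ABC012" data-index="123" ...';

// (?<name>...) stores the capture under a string key in addition to its number.
preg_match_all('/<div data-id="(?<id>[A-Z\d]+)" data-index="(?<index>\d+)"/', $html, $result);

$firstId = $result['id'][0];        // same value as $result[1][0]
$firstIndex = $result['index'][0];  // same value as $result[2][0]

echo $firstId, ' ', $firstIndex, "\n";
```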

PHP pregmatch textarea

OK, so I have this form that contains textareas, and I want to verify that they don't contain any illegal characters.
Html:
<textarea minlength="100" required name="Description" maxlength="800">
</textarea>
Php:
if(!preg_match("/^[-\p{L}\p{N} #&()!*,.;'\/\\\\]+$/u",$_POST["Description"])){
//error
}
I have tried multiple completely legal texts but it returns false.
What am I missing?

I guess your expression works fine, you might want to remove the u flag:
if(!preg_match("/^[-\p{L}\p{N} #&()!*,.;'\/\\\\]+$/s",$_POST["Description"])){
//error
}
Or, you might be trying to do,
if(!preg_match("/^[^-\p{L}\p{N} #&()!*,.;'\/\\\\]+$/s",$_POST["Description"])){
//error
}
if you want to exclude those characters instead.
Test
$re = '/^[-\p{L}\p{N} #&()!*,.;\'\/\\\\]+$/m';
$str = 'abcd
abcd\\\\\\\\\\\\
&*&??
abc?
';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
var_dump($matches);
Output
array(2) {
[0]=>
array(1) {
[0]=>
string(4) "abcd"
}
[1]=>
array(1) {
[0]=>
string(10) "abcd\\\\\\"
}
}
If you wish to simplify, modify, or explore the expression, it is explained in the top-right panel of regex101.com, where you can also see how it matches against sample inputs.
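One more thing that may be worth checking, as an assumption about the input rather than something shown in the question: text submitted from a textarea often contains line breaks (\r\n), and neither \r nor \n is in the character class, so a multi-line value fails the original ^...$ test. The sketch below uses made-up sample strings and a variant pattern that adds \r\n to the class; whether line breaks should count as legal is up to the application.

```php
<?php
// Pattern from the question (single-quoted here to keep the escaping readable).
$strict = '/^[-\p{L}\p{N} #&()!*,.;\'\/\\\\]+$/u';
// Variant that also allows \r and \n; treating line breaks as legal is an assumption.
$relaxed = '/^[-\p{L}\p{N} #&()!*,.;\'\/\\\\\r\n]+$/u';

$singleLine = 'A plain description, nothing special.';
$multiLine = "First line.\r\nSecond line.";  // made-up multi-line textarea value

$singleStrict = preg_match($strict, $singleLine);   // 1: every character is in the class
$multiStrict = preg_match($strict, $multiLine);     // 0: \r and \n are not in the class
$multiRelaxed = preg_match($relaxed, $multiLine);   // 1: line breaks are now permitted

var_dump($singleStrict, $multiStrict, $multiRelaxed);
```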

Regex preg_match word

I'm trying to get a part of a string that starts with for example Name:. If the whole string looks like Name: Carl, I just want the Carl part and not the Name: prefix.
How can I do that? I have tried with:
$data = file_get_contents('page.html');
$regex = '/Name:.*/';
preg_match($regex,$data,$match);
var_dump($match);
But I get the output:
array(1) { [0]=> string(28) "Name: Carl"
The other thing I don't understand is why the array(1) { [0]=> string(28) is showing.

You have to put what you want to retrieve in ():
'/Name:(.*)/i'

For your match line, do the following instead:
$regex = '/Name:(.*)/';
The captured portion (whatever (.*) matched) will be in $match[1]; $match[0] still holds the full match.
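As a small sketch of reading the captured group, using a made-up input string instead of page.html: the capture includes the space after the colon, so trim() is applied. The array(1) { [0]=> string(...) text in the question is simply how var_dump() prints an array, with each string's byte length.

```php
<?php
$data = "Name: Carl\nAge: 42";  // made-up sample input

if (preg_match('/Name:(.*)/', $data, $match)) {
    // $match[0] is the whole match ("Name: Carl"); $match[1] is the first group (" Carl").
    $name = trim($match[1]);
    echo $name, "\n";
}
```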

How to match this specific string in RE?

Once again I'm stuck on a regular expression. I can't find any good material for learning the more advanced usage.
I'm trying to match [image width="740" height="249" parameters=""]51lca7dn56.jpg[/image] to $cache->image_tag("$4", $1, $2, "$3").
Everything works great if all the [image] parameters are there, but I need it to match, even if something is missing. So for example [image width="740"]51lca7dn56.jpg[/image].
Current code is:
$text = preg_replace('#\[image width=\"(.*?)\" height=\"(.*?)\" parameters=\"(.*?)\"\](.*?)\[/image\]#e', '$cache->image_tag("$4", $1, $2, "$3")', $text);
Regular expressions are the one thing that always gets me stuck, so if anybody could also point me to a good resource so that I can handle these kinds of issues myself, it would be much appreciated.
My dummy version what I'm trying to do is this:
// match only [image]
$text = preg_replace('#\[image\](.*?)\[/image\]#si', '$cache->image_tag("$1", 0, 0, "")', $text);
// match only width
$text = preg_replace('#\[image width=\"(.*?)\"\](.*?)\[/image\]#si', '$cache->image_tag("$2", $1, 0, "")', $text);
// match only width and height
$text = preg_replace('#\[image width=\"(.*?)\" height=\"(.*?)\"\](.*?)\[/image\]#si', '$cache->image_tag("$3", $1, $2, "")', $text);
// match only all
$text = preg_replace('#\[image width=\"(.*?)\" height=\"(.*?)\" parameters=\"(.*?)\"\](.*?)\[/image\]#si', '$cache->image_tag("$4", $1, $2, $3)', $text);
(This code doesn't actually work as expected, but it should make my point clearer.) Basically, I hope to put all this horrible mess into one RE call.
Final code tested and working based on Ωmega's answer:
// Match: [image width="740" height="249" parameters="bw"]51lca7dn56.jpg[/image]
$text = preg_replace('#\[image\b(?=(?:[^\]]*\bwidth="(\d+)"|))(?=(?:[^\]]*\bheight="(\d+)"|))(?=(?:[^\]]*\bparameters="([^"]+)"|))[^\]]*\]([^\[]*)\[\/image\]#si', '$cache->image_tag("$4", $1, $2, "$3")', $text); // the end is #si so that it is easier to debug; in reality it's #e
However, if width or height is missing, the corresponding capture comes back as an empty string rather than NULL. So I adopted drew's idea of using preg_replace_callback():
$text = preg_replace_callback('#\[image\b(?=(?:[^\]]*\bwidth="(\d+)"|))(?=(?:[^\]]*\bheight="(\d+)"|))(?=(?:[^\]]*\bparameters="([^"]+)"|))[^\]]*\]([^\[]*)\[\/image\]#', create_function(
'$matches',
'global $cache; return $cache->image_tag($matches[4], ($matches[1] ? $matches[1] : 0), ($matches[2] ? $matches[2] : 0), $matches[3]);'), $text);
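Note that the /e modifier was removed in PHP 7.0 and create_function() in PHP 8.0, so on current PHP versions the same idea needs a closure. The following is a sketch under stated assumptions: the $cache object here is a stand-in whose image_tag() just returns an <img> tag (the real class is not shown in the question), and the sample text is made up.

```php
<?php
// Stand-in for the asker's $cache object; the real image_tag() is not shown in the question.
$cache = new class {
    public function image_tag(string $file, int $width, int $height, string $params): string
    {
        return sprintf('<img src="%s" width="%d" height="%d" data-params="%s">', $file, $width, $height, $params);
    }
};

$pattern = '#\[image\b(?=(?:[^\]]*\bwidth="(\d+)"|))(?=(?:[^\]]*\bheight="(\d+)"|))(?=(?:[^\]]*\bparameters="([^"]+)"|))[^\]]*\]([^\[]*)\[\/image\]#i';

$text = 'Before [image width="740"]51lca7dn56.jpg[/image] after';

// "use ($cache)" replaces the "global $cache" of the create_function() version.
$text = preg_replace_callback($pattern, function (array $m) use ($cache) {
    // Groups that did not participate arrive as empty strings; the int cast turns them into 0.
    return $cache->image_tag($m[4], (int) $m[1], (int) $m[2], $m[3]);
}, $text);

echo $text, "\n";
```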

Maybe try a regex like this instead which tries to grab extra params in the image tag (if any). This way, the parameters can be in any order with any combination of included and omitted parameters:
$string = 'this is some code and it has bbcode in it like [image width="740" height="249" parameters=""]51lca7dn56.jpg[/image] for example.';
if (preg_match('/\[image([^\]]*)\](.*?)\[\/image\]/i', $string, $match)) {
var_dump($match);
}
Resulting match:
array(3) {
[0]=>
string(68) "[image width="740" height="249" parameters=""]51lca7dn56.jpg[/image]"
[1]=>
string(39) " width="740" height="249" parameters="""
[2]=>
string(14) "51lca7dn56.jpg"
}
So you can then examine $match[1] and parse out the parameters. You may need to use preg_replace_callback to implement the logic inside the callback.
Hope that helps.
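A rough sketch of that second step, parsing the attribute string inside a preg_replace_callback() callback. The helper name parseImageAttributes and the sprintf() output are illustrative assumptions; in the question's code the callback would call $cache->image_tag() instead.

```php
<?php
// Hypothetical helper: turns ' width="740" height="249"' into ['width' => '740', 'height' => '249'].
function parseImageAttributes(string $raw): array
{
    $attributes = array();
    // PREG_SET_ORDER yields one array per attribute: [full match, name, value].
    preg_match_all('/(\w+)="([^"]*)"/', $raw, $pairs, PREG_SET_ORDER);
    foreach ($pairs as $pair) {
        $attributes[strtolower($pair[1])] = $pair[2];
    }
    return $attributes;
}

$string = 'like [image height="249" width="740"]51lca7dn56.jpg[/image] for example.';

$result = preg_replace_callback('/\[image([^\]]*)\](.*?)\[\/image\]/i', function (array $m) {
    $attr = parseImageAttributes($m[1]);
    // Missing attributes fall back to 0 via the null coalescing operator (PHP 7+).
    $width = (int) ($attr['width'] ?? 0);
    $height = (int) ($attr['height'] ?? 0);
    // Placeholder output; the asker's code would call $cache->image_tag() here.
    return sprintf('<img src="%s" width="%d" height="%d">', $m[2], $width, $height);
}, $string);

echo $result, "\n";
```

Because the attributes are collected into an array keyed by name, their order in the tag does not matter and any of them may be omitted.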

I would suggest using this regex:
\[image\b(?=(?:[^\]]*\bwidth="(\d+)"|))(?=(?:[^\]]*\bheight="(\d+)"|))(?=(?:[^\]]*\bparameters="([^"]+)"|))[^\]]*\]([^\[]*)\[\/image\]
Edit:
$string = 'this is some code and it has bbcode in it like [image width="740" height="249" parameters=""]51lca7dn56.jpg[/image] for example and [image parameters="" height="123" width="456"]12345.jpg[/image].';
if (preg_match_all('/\[image\b(?=(?:[^\]]*\bwidth="(\d+)"|))(?=(?:[^\]]*\bheight="(\d+)"|))(?=(?:[^\]]*\bparameters="([^"]+)"|))[^\]]*\]([^\[]*)\[\/image\]/i', $string, $match) > 0) {
var_dump($match);
}
Output:
array(5) {
[0]=>
array(2) {
[0]=>
string(68) "[image width="740" height="249" parameters=""]51lca7dn56.jpg[/image]"
[1]=>
string(63) "[image parameters="" height="123" width="456"]12345.jpg[/image]"
}
[1]=>
array(2) {
[0]=>
string(3) "740"
[1]=>
string(3) "456"
}
[2]=>
array(2) {
[0]=>
string(3) "249"
[1]=>
string(3) "123"
}
[3]=>
array(2) {
[0]=>
string(0) ""
[1]=>
string(0) ""
}
[4]=>
array(2) {
[0]=>
string(14) "51lca7dn56.jpg"
[1]=>
string(9) "12345.jpg"
}
}

Regexp for extracting a mailto: address

I'd like a reg exp which can take a block of string, and find the strings matching the format:
....
And for all strings which match this format, it will extract out the email address found after the mailto:. Any thoughts?
This is needed for an internal app and not for any spammer purposes!

If you want to match the whole anchor element:
$r = '`\<a([^>]+)href\=\"mailto\:([^">]+)\"([^>]*)\>(.*?)\<\/a\>`ism';
preg_match_all($r,$html, $matches, PREG_SET_ORDER);
To make it faster and shorter:
$r = '`\<a([^>]+)href\=\"mailto\:([^">]+)\"([^>]*)\>`ism';
preg_match_all($r,$html, $matches, PREG_SET_ORDER);
The 2nd matching group will be whatever email it is.
Example:
$html ='<div><a href="mailto:test@live.com">test</a></div>';
$r = '`\<a([^>]+)href\=\"mailto\:([^">]+)\"([^>]*)\>(.*?)\<\/a\>`ism';
preg_match_all($r,$html, $matches, PREG_SET_ORDER);
var_dump($matches);
Output:
array(1) {
[0]=>
array(5) {
[0]=>
string(39) "<a href="mailto:test@live.com">test</a>"
[1]=>
string(1) " "
[2]=>
string(13) "test@live.com"
[3]=>
string(0) ""
[4]=>
string(4) "test"
}
}

There are plenty of different options on regexp.info
One example would be:
\b[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}\b
The "mailto:" is trivial to prepend to that.

/(mailto:)(.+)(\")/
The second matching group will be the email address.

You can work with the internal PHP filter http://us3.php.net/manual/en/book.filter.php
(they have one which is specially there for validating or sanitizing email -> FILTER_VALIDATE_EMAIL)
Greets
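The two approaches can be combined: pull candidates out with a regex, then let filter_var() decide which ones are well-formed. This is only a sketch; the markup and the simplified pattern below are made up for illustration.

```php
<?php
// Made-up sample markup for illustration.
$html = '<p><a href="mailto:alice@example.com">Alice</a> and <a class="x" href="mailto:not-an-email">Bob</a></p>';

// Capture whatever follows "mailto:" up to the closing quote.
preg_match_all('/href="mailto:([^">]+)"/i', $html, $matches);

$valid = array();
foreach ($matches[1] as $candidate) {
    // filter_var() returns the address on success and false on failure.
    if (filter_var($candidate, FILTER_VALIDATE_EMAIL) !== false) {
        $valid[] = $candidate;
    }
}

print_r($valid);
```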

For me, ~<mailto(.*?)>~ worked.
It will return an array containing the elements found.
Here you can test it: https://regex101.com/r/rTmKR4/1
