regex regular expression to match most of URLs needs improvement - php

I need a function which will check for the existing URLs in a string.
function linkcleaner($url) {
$regex="(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))";
if(preg_match($regex, $url, $matches)) {
echo $matches[0];
}
}
The regular expression is taken from John Gruber's blog, where he addressed the problem of creating a regex that matches all URLs.
Unfortunately, I can't make it work. The problem seems to come from the double quotes inside the regex or from the other punctuation symbols at the end of the expression.
Any help is appreciated.
Thank you!

You need to escape the " with a \

Apart from @tandu's answer, you also need delimiters for a regex in PHP.
The easiest option is to start and end your pattern with a #, as that character does not appear in it:
$regex="#(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'\".,<>?«»“”‘’]))#";

Jack Maney's comment...EPIC :D
On a more serious note, it does not work because you terminated the string literal right in the middle.
To include a double quote (") in a string, you need to escape it using a \
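So, the line will be the following. Because / is used as the delimiter here, the literal / characters inside the pattern have to be escaped as well:
$regex="/(?i)\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'\".,<>?«»“”‘’]))/";
Notice I've escaped the ' as well. That is not required inside a double-quoted string, but it is needed when you define the string between two single quotes.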
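One way to sidestep the quoting problem entirely is a nowdoc, which PHP treats as raw text, so neither " nor ' needs a backslash. The following is only a sketch, assuming PHP 5.3 or later and a UTF-8 source file. The function name find_first_url and the added u modifier are my own assumptions, not part of Gruber's pattern:

```php
<?php
// Hypothetical helper: returns the first URL-like substring, or null if none is found.
function find_first_url($text)
{
    // Nowdoc: no escape processing at all, so the pattern is pasted verbatim.
    // '#' is the delimiter. The trailing 'u' (an assumption) makes PCRE read the
    // pattern and the subject as UTF-8, so the curly quotes are single characters.
    $regex = <<<'REGEX'
#(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))#u
REGEX;

    return preg_match($regex, $text, $matches) === 1 ? $matches[0] : null;
}

echo find_first_url('Visit http://www.example.com/path?x=1, then come back.'), "\n";
```

The trailing comma is left out of the match because the last character class of the pattern excludes common punctuation.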

I am not sure how you guys read this regex, cause it's a real pain to read/modify... ;)
try this (this is not a one-liner, yes, but it is easy to understand and modify if needed):
<?php
$re_proto = "(?:https?|ftp|gopher|irc|whateverprotoyoulike)://";
$re_ipv4_segment = "[12]?[0-9]{1,2}";
$re_ipv4 = "(?:{$re_ipv4_segment}[.]){3}".$re_ipv4_segment;
$re_hostname = "[a-z0-9_]+(?:[.-][a-z0-9_]+){0,}";
$re_hostname_fqdn = "[a-z0-9_](?:[a-z0-9_-]*[.][a-z0-9]+){1,}";
$re_host = "(?:{$re_ipv4}|{$re_hostname})";
$re_host_fqdn = "(?:{$re_ipv4}|{$re_hostname_fqdn})";
$re_port = ":[0-9]+";
$re_uri = "(?:/[a-z0-9_.%-]*){0,}";
$re_querystring = "[?][a-z0-9_.%&=-]*";
$re_anchor = "#[a-z0-9_.%-]*";
$re_url = "(?:(?:{$re_proto})(?:{$re_host})|{$re_host_fqdn})(?:{$re_port})?(?:{$re_uri})?(?:{$re_querystring})?(?:{$re_anchor})?";
$text = <<<TEXT
http://www.example.com
http://www.example.com/some/path/to/file.php?f1=v1&f2=v2#foo
http://localhost.localdomain/
http://localhost/docs/???
www....wwhat?
www.example.com
ftp://ftp.mozilla.org/pub/firefox/latest/
Some new Mary-Kate Olsen pictures I found: the splendor of the Steiner Street Picture of href… http://t.co/tJ2NJjnf
TEXT;
$count = preg_match_all("\01{$re_url}\01is", $text, $matches);
var_dump($count);
var_dump($matches);
?>

Related

Get html or text from inside quotes including escape quotes with RegEx

What I want to do is to get the attribute value from a simple text I'm parsing. I want to be able to contain HTML as well inside the quotes, so that's what got me stalling right now.
$line = 'attribute = "<p class=\"qwerty\">Hello World</p>" attribute2 = "value2"'
I've gotten to the point (substring) where I'm getting the value
$line = '"<p class=\"qwerty\">Hello World</p>" attribute2 = "value2"'
My current regex works if there are no escaped quotes inside the text. However, when I try to escape the HTML quotes, it doesn't work at all. Also, using .* is going to the end of the second attribute.
What I'm trying to obtain from the string above is
$result = '<p class=\"qwerty\">Hello World</p>'
This is how far I've gotten with my trial and error regex-ing.
$value_regex = "/^\"(.+?)\"/"
if (preg_match($value_regex, $line, $matches))
$result = $matches[1];
Thank you very much in advance!
You can use negative lookbehind to avoid matching escaped quotes:
(?<!\\)"(.+?)(?<!\\)"
Here (?<!\\) is a negative lookbehind that avoids matching \".
However, I would caution you against using regex to parse HTML; it is better to use DOM for that.
PHP Code:
$value_regex = '~(?<!\\\\)"(.+?)(?<!\\\\)"~';
if (preg_match($value_regex, $line, $matches))
$result = $matches[1];
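To see the lookbehind at work on the string from the question, here is a small sketch. The helper name first_quoted_value is made up for illustration:

```php
<?php
// Hypothetical helper wrapping the pattern from the answer above.
function first_quoted_value($line)
{
    // In a single-quoted PHP string, \\\\ becomes \\ in the pattern,
    // which PCRE reads as one literal backslash inside the lookbehind.
    $value_regex = '~(?<!\\\\)"(.+?)(?<!\\\\)"~';
    return preg_match($value_regex, $line, $matches) === 1 ? $matches[1] : null;
}

// Single quotes keep the \" sequences exactly as typed.
$line = '"<p class=\"qwerty\">Hello World</p>" attribute2 = "value2"';
$result = first_quoted_value($line);
echo $result, "\n";
```

The match stops at the first quote that is not preceded by a backslash, so the second attribute is not swallowed.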

Erasing C comments with preg_replace

I need to erase all comments in $string which contains data from some C file.
The thing I need to replace looks like this:
something before that shouldnt be replaced
/*
* some text in between with / or * on many lines
*/
something after that shouldnt be replaced
and the result should look like this:
something before that shouldnt be replaced
something after that shouldnt be replaced
I have tried many regular expressions, but none of them work the way I need.
Here are the latest ones:
$string = preg_replace("/\/\*(.*?)\*\//u", "", $string);
and
$string = preg_replace("/\/\*[^\*\/]*\*\//u", "", $string);
Note: the text is in UTF-8, the string can contain multibyte characters.
You also need to add the s modifier to tell the regex that . should match newlines. I always think of s as meaning "treat the input text as a single line".
So something like this should work:
$string = preg_replace("/\\/\\*(.*?)\\*\\//us", "", $string);
Example: http://codepad.viper-7.com/XVo9Tp
Edit: Added extra escape slashes to the regex as Brandin suggested because he is right.
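As a quick check of the s modifier combined with a lazy quantifier, here is a sketch on a made-up input. The function name strip_block_comments and the sample string are assumptions for illustration, and ~ is used as the delimiter so the slashes need no escaping:

```php
<?php
// Hypothetical helper: removes /* ... */ blocks, including ones that span lines.
function strip_block_comments($source)
{
    // s: the dot also matches newlines; u: treat pattern and subject as UTF-8.
    // .*? is lazy, so each comment ends at the nearest closing */.
    $result = preg_replace('~/\*.*?\*/~us', '', $source);
    // preg_replace returns null on error (for example invalid UTF-8 with /u).
    return $result === null ? $source : $result;
}

$code = "int a; /* one\n * two / * three\n */ int b; /* four */ int c;";
$clean = strip_block_comments($code);
echo $clean, "\n";
```

This still treats /* inside a string literal as the start of a comment, so it is only suitable for simple input.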
I don't think a regexp fits well here. What about writing a very small parser to remove the comments? I haven't written PHP in a long time, so treat this as an untested sketch that only shows the idea:
function strip_comments($string) {
    $buf = '';                 // holds the source code without comments
    $len = strlen($string);
    $pos = 0;
    while ($pos < $len) {
        // A block comment starts with "/*"; a lone "/" is ordinary code.
        if ($string[$pos] == '/' && $pos + 1 < $len && $string[$pos + 1] == '*') {
            $pos += 2;         // skip the opening "/*"
            while ($pos < $len) {
                if ($string[$pos] == '*' && $pos + 1 < $len && $string[$pos + 1] == '/') {
                    $pos += 2; // skip the closing "*/"
                    break;
                }
                $pos++;
            }
            continue;
        }
        $buf .= $string[$pos++];
    }
    return $buf;
}
where:
$string is the C source code
$buf is a string that grows as needed
It is very hard to do this perfectly without ending up writing a full C parser.
Consider the following, for example:
// Not using /*-style comment here.
// This line has an odd number of " characters.
while (1) {
printf("Wheee!
(*\/*)
\\// - I'm an ant!
");
/* This is a multiline comment with a // in, and
// an odd number of " characters. */
}
So, from the above, we can see that our problems include:
multiline comment sequences should be ignored within double quotes, unless those double quotes are part of a comment.
single-line comment sequences can be contained in double-quoted strings, and in multiline strings.
Here's one possibility to address some of those issues, but far from perfect.
// Remove "-strings, //-comments and /*block-comments*/, then restore "-strings.
// Based on regex by mauke of Efnet's #regex.
$file = preg_replace('{("[^"]*")|//[^\n]*|(/\*.*?\*/)}s', '\1', $file);
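A short sketch of how that replacement behaves, using a made-up input. The function name strip_c_comments and the sample source are assumptions. Only the first capture group (the string literal) is put back; comments leave group 1 empty and so disappear:

```php
<?php
// Hypothetical wrapper around the regex shown above.
function strip_c_comments($file)
{
    // Alternative 1 captures a "-string and is restored via \1.
    // Alternatives 2 and 3 match // and /* */ comments and are replaced by nothing.
    return preg_replace('{("[^"]*")|//[^\n]*|(/\*.*?\*/)}s', '\1', $file);
}

$src = 'puts("a /* not a comment */ b"); // trailing' . "\n"
     . 'int x; /* block' . "\n"
     . 'comment */ int y;';
$out = strip_c_comments($src);
echo $out, "\n";
```

Note that "[^"]*" does not understand escaped quotes such as \" inside a string, nor character literals like '"', which is part of why this stays far from perfect.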
try this:
$string = preg_replace("#\/\*\n?(.*)\*\/\n?#ms", "", $string);
Use # as the regexp delimiter, and replace the u modifier with the right ones: m (PCRE_MULTILINE) and s (PCRE_DOTALL).
Reference: http://php.net/manual/en/reference.pcre.pattern.modifiers.php
It is important to note that my regexp does not handle more than one "comment block": because .* is greedy, it removes everything from the first /* to the last */. Use of "dot match all" is generally not a good idea.
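The difference between the greedy and the lazy quantifier can be seen on a made-up one-line input (the sample string is an assumption for illustration):

```php
<?php
$code = "a /* one */ b /* two */ c";

// Greedy: .* runs to the LAST closing */, so the code between the comments is lost.
$greedy = preg_replace('#/\*.*\*/#s', '', $code);

// Lazy: .*? stops at the NEAREST closing */, so each block is removed on its own.
$lazy = preg_replace('#/\*.*?\*/#s', '', $code);

var_dump($greedy, $lazy);
```

With the greedy version the b between the two comments disappears; with the lazy version it is kept.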

Saving strings in quotes (") using regex

I've got a simple string that looks like a:104:{i:143;a:5:{s:5:"naz";s:7:"Alb";s:10:"base"}} and I'd like to save all the text inside quotation marks, stripping out things like s:5, using regex. Is this possible?
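Want to get everything between quotes? Use ".*?" as your search string (escape the " characters as required). The lazy ? matters here: a plain ".*" would match from the first quote to the last one on the line.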
..also you can check out http://www.zytrax.com/tech/web/regex.htm for more help with regex. (It's got a great tool where you can test input text, RE, and see what you get out)
As long as the double quotes are matched, the following call
preg_match_all('/"([^"]*)"/',$input_string,$matches);
will give you all the text between the quotes as array of strings in $matches[1]
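Applied to the string from the question, that call can be tried like this (a minimal sketch; only the variable names are assumed):

```php
<?php
$input_string = 'a:104:{i:143;a:5:{s:5:"naz";s:7:"Alb";s:10:"base"}}';

// [^"]* cannot cross a quote, so each match is exactly one quoted value.
preg_match_all('/"([^"]*)"/', $input_string, $matches);

// $matches[0] holds the matches with quotes, $matches[1] the text between them.
print_r($matches[1]);
```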
function session_raw_decode ($data) {
$vars = preg_split('/([a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff^|]*)\|/', $data, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
$result = array();
for($i = 0; isset($vars[$i]); $i++)
$result[$vars[$i++]] = unserialize($vars[$i]);
return $result;
}
I found this snippet somewhere on my server... (no idea where it came from or whether I wrote it myself)
You can use it and do:
$json = json_encode(session_raw_decode($string));
This should do the job.

Preg_match, Replace and back to string

Sorry, but I can't solve my problem; you know, I'm a noob.
I need to find something in a string with preg_match, then replace it with a new word using preg_replace. That part is OK, but I don't understand how to put the replaced word back into that string.
This is what I got
$text ='zda i "zda"';
preg_match('/"(\w*)"/', $text);
$najit = '/zda/';
$nahradit = 'zda';
$o = '/zda/';
$a = 'if';
$ahoj = preg_replace($najit, $nahradit, $match[1]);
Please, can you help me once again?
You can use e.g. the following code utilizing negative lookarounds to accomplish what you want:
$newtext = preg_replace('/(?<!")zda|zda(?!")/', 'if', $text);
It will replace any occurrence of zda that is not enclosed in quotes on both sides (e.g. in U"Vzda"W the zda will be replaced, because it is not directly enclosed in quotes).
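Run against the string from the question, the lookarounds behave as follows (a small sketch; the second input is the U"Vzda"W case mentioned above):

```php
<?php
// Matches zda unless it has a quote directly before AND directly after it.
$pattern = '/(?<!")zda|zda(?!")/';

$text = 'zda i "zda"';
$newtext = preg_replace($pattern, 'if', $text);

// Here zda is preceded by V, not by a quote, so it is replaced.
$other = preg_replace($pattern, 'if', 'U"Vzda"W');

echo $newtext, "\n", $other, "\n";
```

There is no separate "put it back" step: preg_replace returns the whole string with the replacements already in place.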

Supposedly valid regular expression doesn't return any data in PHP

I am using the following code:
<?php
$stock = $_GET['s']; //returns stock ticker symbol eg GOOG or YHOO
$first = $stock[0];
$url = "http://biz.yahoo.com/research/earncal/".$first."/".$stock.".html";
$data = file_get_contents($url);
$r_header = '/Prev. Week(.+?)Next Week/';
$r_date = '/\<b\>(.+?)\<\/b\>/';
preg_match($r_header,$data,$header);
preg_match($r_date, $header[1], $date);
echo $date[1];
?>
I've checked the regular expressions here and they appear to be valid. If I check just $url or $data they come out correctly and if I print $data and check the source the code that I'm looking for to use in the regex is in there. If you're interested in checking anything, an example of a proper URL would be http://biz.yahoo.com/research/earncal/g/goog.html
I've tried everything I could think of, including both var_dump($header) and var_dump($date), both of which return empty arrays.
I have been able to create other regular expressions that works. For instance, the following correctly returns "Earnings":
$r_header = '/Company (.+?) Calendar/';
preg_match($r_header,$data,$header);
echo $header[1];
I am going nuts trying to figure out why this isn't working. Any help would be awesome. Thanks.
Your regex doesn't allow for the line breaks in the HTML. Try:
$r_header = '/Prev\. Week((?s:.*))Next Week/';
The (?s:...) group turns on the s (dot-all) option, so . also matches newline characters.
Problem is that the HTML has newlines in it, which you need to incorporate with the s regex modifier, as below
<?php
$stock = "goog";//$_GET['s']; //returns stock ticker symbol eg GOOG or YHOO
$first = $stock[0];
$url = "http://biz.yahoo.com/research/earncal/".$first."/".$stock.".html";
$data = file_get_contents($url);
$r_header = '/Prev. Week(.+?)Next Week/s';
$r_date = '/\<b\>(.+?)\<\/b\>/s';
preg_match($r_header,$data,$header);
preg_match($r_date, $header[1], $date);
var_dump($header);
?>
Dot does not match newlines by default. Use /your-regex/s
$r_header should probably be /Prev\. Week(.+?)Next Week/s
FYI: You don't need to escape < and > in a regex.
You want to add the s (PCRE_DOTALL) modifier. By default . doesn't match newline, and I see the page has them between the two parts you look for.
Side note: although they don't hurt (except readability), you don't need a backslash before < and >.
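The effect of the s modifier can be reproduced without fetching the page. In this sketch, $data is a made-up stand-in for the downloaded HTML, not the real Yahoo markup:

```php
<?php
// Hypothetical snippet standing in for the page: the two markers sit on different lines.
$data = "Prev. Week\n<td><b>Jan 20</b></td>\nNext Week";

// Without s, the dot stops at the first newline, so nothing matches.
$without = preg_match('/Prev\. Week(.+?)Next Week/', $data, $m1);

// With s (PCRE_DOTALL), the dot crosses line breaks.
$with = preg_match('/Prev\. Week(.+?)Next Week/s', $data, $m2);

// < and > need no backslash; only the / delimiter inside </b> is escaped.
preg_match('/<b>(.+?)<\/b>/s', $m2[1], $date);

var_dump($without, $with, $date[1]);
```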
I think this is because you're applying the values to the regex as if it's plain text. However, it's HTML. For example, your regex should be modified to parse:
Prev. Week ...
Not to parse regular plain text like: "Prev. Week ...."
