PHP: Regular Expression to get a URL from a string [duplicate]

This question already has answers here:
Closed 12 years ago.
Possible Duplicates:
Identifying if a URL is present in a string
Php parse links/emails
I'm working on some PHP code which takes input from various sources and needs to find the URLs and save them somewhere. The kind of input that needs to be handled is as follows:
http://www.youtube.com/watch?v=IY2j_GPIqRA
Try google: http://google.com! (note exclamation mark is not part of the URL)
Is http://somesite.com/ down for anyone else?
Output:
http://www.youtube.com/watch?v=IY2j_GPIqRA
http://google.com
http://somesite.com/
I've already borrowed one regular expression from the internet which works, but unfortunately wipes the query string out - not good!
Any help putting together a regular expression, or perhaps another solution to this problem, would be appreciated.

Jan Goyvaerts, Regex Guru, has addressed this issue in his blog. There are quite a few caveats, for example extracting URLs inside parentheses correctly. What you need exactly depends on the "quality" of your input data.
For the examples you provided, \b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$] works when used in case-insensitive mode.
So to find all matches in a multiline string, use
preg_match_all('/\b(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)[-A-Z0-9+&@#\/%=~_|$?!:,.]*[A-Z0-9+&@#\/%=~_|$]/i', $subject, $result, PREG_PATTERN_ORDER);
$result = $result[0];
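As a rough check, the call above can be run against the three sample lines from the question. This is only a sketch: it assumes the character classes contain @ and #, and the variable name $urls is introduced here for illustration.

```php
<?php
// Pattern from the answer; '/' is the delimiter, so literal slashes are escaped.
// Assumption: the character classes include both '@' and '#'.
$pattern = '/\b(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)[-A-Z0-9+&@#\/%=~_|$?!:,.]*[A-Z0-9+&@#\/%=~_|$]/i';

// Sample input taken from the question, joined into one multiline string.
$subject = implode("\n", [
    'http://www.youtube.com/watch?v=IY2j_GPIqRA',
    'Try google: http://google.com! (note exclamation mark is not part of the URL)',
    'Is http://somesite.com/ down for anyone else?',
]);

preg_match_all($pattern, $subject, $result, PREG_PATTERN_ORDER);
$urls = $result[0]; // index 0 holds the full matches

print_r($urls);
```

The trailing "!" is dropped because the last character of a match must come from the second, stricter character class, which excludes sentence punctuation.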

Why not try this one. It is the first result of Googling "URL regular expression".
((https?|ftp|gopher|telnet|file|notes|ms-help):((\/\/)|(\\\\))+[\w\d:#@%\/;$()~_?\+-=\\\.&]*)
Not PHP, but it should work, I just slightly modified it by escaping forward slashes.
source

Related

Don't know how to write this preg

I'm making a function call to a library that is returning a malformed json array. I can work around this if I can get a preg written to extract the part that I want.
The array is a jumbled mess, but buried deep inside it is a string that looks like this:
token=??????,
I need to write a preg to grab the characters represented by the question marks. I wrote this, but it's not getting the part of the text that I want:
$token = preg_match('#^(?:token=)?([^,]+)#i', $badJson, $matches);
Can anyone help me? Thanks.

You can try:
/token=([^,]+)/i
and then use the first sub-match to extract the token. Being more specific is usually a good idea with regex (e.g. does the token have a set length? does it only contain hex characters?).
Side note: https://leaverou.github.io/regexplained/ is a great site for testing regular expressions.
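A minimal sketch of how that pattern might be used. The contents of $badJson are made up for this example. Note that preg_match() returns a match count rather than the matched text, so the token has to be read from $matches.

```php
<?php
// Hypothetical malformed payload, invented for illustration only.
$badJson = '{"items":[1,2,{"meta":token=9f8a7b6c,"x":}]';

// preg_match() returns 1 on a match, 0 on no match, and false on error,
// so the captured text must be taken from $matches, not the return value.
$token = null;
if (preg_match('/token=([^,]+)/i', $badJson, $matches) === 1) {
    $token = $matches[1]; // first sub-match: everything up to the next comma
}

echo $token, "\n";
```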

Strip RTF strings with PHP - Regex [duplicate]

This question already has answers here:
Regular Expression for extracting text from an RTF string
(11 answers)
Closed 9 years ago.
A column in the database I work with contains RTF strings, I would like to strip these out using PHP, leaving just the sentence between.
It is a MS SQL database 2005 if I recall correctly.
An example of the kind of strings pulled from the database (need any more let me know, all the rest are similar):
{\rtf1\ansi\ansicpg1252\deff0\deflang2057{\fonttbl{\f0\fnil\fcharset0 Tahoma;}}
\viewkind4\uc1\pard\lang1033\f0\fs17 ASSEMBLE COMPONENTS AS DETAILED ON DRAWING.\lang2057\fs17\par
}
I would like this to be stripped to only return:
ASSEMBLE COMPONENTS AS DETAILED ON DRAWING.
Now, I have successfully managed to strip the characters in ASP.NET for a previous project, however I would like to do so using PHP. Here is the regular expression I used in ASP.NET, which works flawlessly may I add:
"(\{.*\})|}|(\\\S+)"
However when I try to use the same expression in PHP with a preg_replace it does not strip half of the characters.
Any regex gurus out there?

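Use this code; it should work.
$string = preg_replace('/(\{.*\})|}|(\\\\\S+)/', '', $string);
Note that I added a '/' delimiter at the beginning and at the end of the regex, which preg_replace requires. The pattern is also single-quoted with the backslash doubled once more, because PHP string escaping would otherwise collapse \\\S into \\S, which matches a backslash followed by the letter S rather than a backslash followed by non-whitespace.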
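Here is a sketch of that replacement applied to the sample string from the question. The helper name stripRtf is hypothetical, and the pattern is written in single quotes with four backslashes so that the regex engine receives an escaped literal backslash followed by \S+. It is only known to handle RTF shaped like the sample, not RTF in general.

```php
<?php
// Hypothetical helper; handles simple RTF like the sample, not the full format.
function stripRtf(string $rtf): string
{
    // In a single-quoted PHP string, '\\\\' becomes the two characters \\,
    // which the regex engine reads as one literal backslash.
    $pattern = '/(\{.*\})|}|(\\\\\S+)/';
    return trim(preg_replace($pattern, '', $rtf));
}

// Sample taken from the question (single quotes keep the backslashes intact).
$rtf = implode("\n", [
    '{\rtf1\ansi\ansicpg1252\deff0\deflang2057{\fonttbl{\f0\fnil\fcharset0 Tahoma;}}',
    '\viewkind4\uc1\pard\lang1033\f0\fs17 ASSEMBLE COMPONENTS AS DETAILED ON DRAWING.\lang2057\fs17\par',
    '}',
]);

$plain = stripRtf($rtf);
echo $plain, "\n";
```

The trim() call removes the newlines and the leading space left behind after the control words are stripped.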

PHP Regular expression: Get all urls with question mark

I have this regular expression:
preg_match_all("/<a\s.*?href\s*=\s*['|\"](.*?)(?=#|\"|')/si", $data, $matches);
to find all URLs. It works fine, but how can I modify it to find ONLY URLs with question marks?
Example:
0123
And preg_match_all will return:
http://site.com/index.php?id=1
http://site.com/calc/index.php?id=1&scheme=Venus

preg_match_all("#<a\s*href\s*=[\'\"]([^\'\"]+\?[^\'\"]+)[\'\"]#si", $data, $matches);
Try this.

Don't try to make everything happen in one regex. Use your existing method, and then separately check the URL that you get back to see if it has a question mark in it.
That said, don't use regular expressions to parse HTML. You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration down the road. As soon as the HTML changes from your expectations, your code will be broken. See http://htmlparsing.com/php for examples of how to properly parse HTML with PHP modules that have already been written, tested and debugged.
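One way to follow both pieces of advice is sketched below: collect href values with PHP's bundled DOM extension, then filter for a question mark in a separate step. The function name linksWithQuery and the sample markup are assumptions made for this example.

```php
<?php
// Hypothetical helper: returns the href of every <a> whose URL contains '?'.
function linksWithQuery(string $html): array
{
    // Collect libxml parse errors internally instead of emitting warnings,
    // since real-world HTML is rarely well-formed.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $urls = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        // Separate, simple check for the query-string marker.
        if (strpos($href, '?') !== false) {
            $urls[] = $href;
        }
    }
    return $urls;
}

// Made-up sample markup resembling the question's example.
$data = '<a href="http://site.com/index.php?id=1">0</a>'
      . '<a href="http://site.com/about.php">1</a>'
      . '<a href="http://site.com/calc/index.php?id=1&amp;scheme=Venus">2</a>';

print_r(linksWithQuery($data));
```

Because the parser decodes entities, the second URL comes back with a plain & rather than &amp;.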

Andy Lester gave you the right answer about what to do.
Here's your regex though:
<a\s.*?href\s*=\s*['|\"](.*?\?.*?)(?=#|\"|')
as seen here:
http://rubular.com/r/LHi11VMMR9

Can't use OR( | ) in php Regular expression

I'm a newbie here. I'm facing a weird problem in using regex in PHP.
$result = "some very long long string with different kind of links";
$regex='/<.*?href.*?="(.*?net.*?)"/'; //this is the regex rule
preg_match_all($regex,$result,$parts);
With this code I'm trying to get the links from the result string, but it only gives me the links that contain .net. I also want the links that contain .com. For this I tried this code:
$regex='/<.*?href.*?="(.*?net|com.*?)"/';
But it shows nothing.
Sorry for my bad English.
Thanks in advance.
Update 1:
Now I'm using this:
$regex='/<.*?href.*?="(.*?)"/';
This rule grabs all the links from the string, but it is not perfect, because it also grabs other substrings like "javascript".

The | character applies to everything within the capturing group, so (.*?net|com.*?) will match either .*?net or com.*?. I think what you want is (.*?(net|com).*?).
If you do not want the extra capturing group, you can use (.*?(?:net|com).*?).
You could also use (.*?net.*?|.*?com.*?), but this is not recommended because of the unnecessary repetition.
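To see the grouped alternation in action, here is a small sketch using the non-capturing variant. The sample HTML in $result is invented for the example.

```php
<?php
// Made-up input with one .net link and one .com link.
$result = '<a href="http://example.net/a">x</a> <a href="http://example.com/b">y</a>';

// (?:net|com) limits the alternation to just those two words,
// so the surrounding .*? parts apply to both branches.
$regex = '/<.*?href.*?="(.*?(?:net|com).*?)"/';

preg_match_all($regex, $result, $parts);
$links = $parts[1]; // group 1 holds the captured href values

print_r($links);
```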

Your regex gets interpreted as .*?net or com.*?. You'll want (.*?(net|com).*?).

Try this:
$regex='/<.*?href.*?="(.*?\.(?:net|com)\b.*?)"/i';
or better:
$regex='/<a .*?href\s*+=\s*+"\K.*?\.(?:net|com)\b[^"]*+/i';

<.*?href
is a problem. This will match from the first < on the current line to the first href, regardless of whether they belong to the same tag.
Generally, it's unwise to try and parse HTML with regexes; if you absolutely insist on doing that, at least be a bit more specific (but still not perfect):
$regex='/<[^<>]*href[^<>=]*="(?:[^"]*(net|com)[^"]*)"/';

Another eregi replacement issue, for web page navigation [duplicate]

This question already has answers here:
How can I convert ereg expressions to preg in PHP?
(4 answers)
Closed 3 years ago.
My hosts recently updated their PHP to 5.3 (without warning) and I now have to replace the code for my page navigation on my index page. This is what I'm currently using:
<? if (eregi(".shtml", $load)) {if (!@readfile("$load")) { readfile("error.shtml"); } } if (!eregi(".shtml", $load)) {if (!@readfile(include("/home/content/j/p/l/jplegacy/html/coranto/news.txt") )) ;}?>
This is what I use as a sample in my links to navigate with:
Archive
About Us
I looked at two different techniques, preg_match() and stristr().
preg_match() outputs my error page and news.txt file from Coranto, but doesn't navigate me to the pages like in the links above. What can I do here to make that work?
stristr() doesn't give me the warnings, but it doesn't navigate the page and only outputs my Coranto news page. If this would be better, what can I do to make this work?
What do I need to do to fix this? I am completely lost. :(

ereg() and eregi() are deprecated as of PHP 5.3. Use preg_match for regular expressions instead; the PCRE engine is also faster than the old one. Also, if you want "ends with .shtml", your regex is wrong: the unescaped dot matches any character, and nothing anchors the match to the end of the string. A regex for "ends with .shtml" is "^.*\.shtml$". If you do choose to use preg_match, the pattern must be wrapped in delimiters, conventionally "/" (slash) as in Perl, so the correct call would look like:
preg_match("/^.*\.shtml$/", $load)
But on to the better solution for "ends with .shtml". Generally you shouldn't use regular expressions unless the situation actually calls for them (they're significantly slower than using substr and matching the last part of the string), and simple matches like this one don't qualify.
if (substr($load, -6) === '.shtml') { ... }
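A short sketch comparing the two approaches to the "ends with .shtml" check. Both function names are hypothetical and exist only for this example. A negative start offset makes substr() count from the end of the string, so no length arithmetic is needed.

```php
<?php
// Hypothetical helper: plain string comparison on the last six characters.
function endsWithShtml(string $load): bool
{
    // '.shtml' is six characters long; -6 counts from the end of the string.
    return substr($load, -6) === '.shtml';
}

// Hypothetical helper: the regex equivalent, with the dot escaped
// and the match anchored to the end of the string.
function endsWithShtmlRegex(string $load): bool
{
    return preg_match('/\.shtml$/', $load) === 1;
}

var_dump(endsWithShtml('archive.shtml'));      // plain substr check
var_dump(endsWithShtmlRegex('archive.shtml')); // regex check
```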
