This is how far I got.
This is working:
$urls = $this->match_all('/<a href="(http:\/\/www.imdb.de\/title\/tt.*?)".*?>.*?<\/a>/ms',
$content, 1);
Now I want to do the same with a different site.
But the link of the site has different structure:
http://www.example.org/ANYTHING
I don't know what I am doing wrong but with this other site (example.org) it is not working.
Here is what I have tried
$urls = $this->match_all('/<a href="(http:\/\/www.example.org\/.*?)".*?>.*?<\/a>/ms',
$content, 1);
Thank you for your help. Stackoverflow is so awesome!
ANYTHING is usually represented by .*? (which you already use in your original regex). You could also use [^"]+ as placeholder in your case.
It sounds like you want the following regular expression:
'/<a href="(http:\/\/example\.org\/.*?)".*?>.*?<\/a>/ms'
You can also use a different delimiter to avoid escaping the forward slashes:
'#<a href="(http://example\.org/.*?)".*?>.*?</a>#ms'
Note the escaping of the . in the domain name, as you intend to match a literal ., not any character.
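A minimal sketch of how that pattern could be applied. The sample $content is made up, and since match_all() from the question is the asker's own helper, plain preg_match_all() is used here as an assumption about what it wraps:

```php
<?php
// Hypothetical page content; the real $content would come from the fetched page.
$content = '<p><a href="http://example.org/foo">Foo</a> and '
         . '<a class="x" href="http://other.test/">Other</a> and '
         . '<a href="http://example.org/bar?id=1" title="t">Bar</a></p>';

// # as delimiter, so the slashes need no escaping; the dot in the domain is escaped.
// [^"]+ is used as the "ANYTHING" placeholder, as suggested above.
$pattern = '#<a href="(http://example\.org/[^"]+)"[^>]*>.*?</a>#s';

preg_match_all($pattern, $content, $matches);
$urls = $matches[1]; // contents of the first capture group for every match

print_r($urls);
```

Only the two example.org links end up in $urls; the link whose href is not the first attribute is skipped, which is one of the limits of matching HTML this way.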
I think this should help
/<a href="(http:\/\/www.example.org\/.*?)".*?>.*?<\/a>/ms
<a href="http://www.example.org/ANYTHING">text</a>
Result:
Array
(
[0] => <a href="http://www.example.org/ANYTHING">text</a>
[1] => http://www.example.org/ANYTHING
)
EDIT: I always find this site very useful when I want to try out preg_match - http://www.solmetra.com/scripts/regex/index.php
Hi I have the following text:
file:/home/dx/reader/validation-garage/IDON/test-test-test#2016-10-04.txt#/
I need to retrieve test-test-test#2016-10-04.txt# from the string above. If I can also exclude the hash even better.
I've tried looking at examples like this Regex to find text between second and third slashes but having trouble getting it working, can anyone help?
I'm using PHP regex to do this.
You may try regex expression below
\/([a-z\-]*\#[0-9\-\.]*[a-z]{3}\#)\/
A working example is here: https://www.regex101.com/r/RYsh7H/1
Explanation:
[a-z\-]* => Matches the test-test-test part: lowercase letters and dashes
\# => Matches the literal # sign
[0-9\-\.]* => Matches the date part of the file name: digits, dashes and dots
[a-z]{3}\# => Matches the three-letter extension and the trailing #
PS: If you do not actually need the #, you do not have to use regex at all. You may consider using PHP's parse_url function instead.
Hope this helps.
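A small sketch of the same pattern in PHP, with the trailing # moved outside the capture group so it is excluded from the result. The ~ delimiter and the variable names are my own choices, not part of the answer above:

```php
<?php
$path = 'file:/home/dx/reader/validation-garage/IDON/test-test-test#2016-10-04.txt#/';

// Same character classes as in the explanation; the final # sits outside the group.
$pattern = '~/([a-z\-]*#[0-9\-.]*[a-z]{3})#/~';

if (preg_match($pattern, $path, $m)) {
    echo $m[1], "\n"; // test-test-test#2016-10-04.txt
}
```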
basename() also works, so you can also do like this:
echo basename('file:/home/dx/reader/validation-garage/IDON/test-test-test#2016-10-04.txt#/');
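If the trailing hash should go as well, one possible follow-up (a sketch, assuming the hash is always at the very end of the name) is to trim it after basename():

```php
<?php
$path = 'file:/home/dx/reader/validation-garage/IDON/test-test-test#2016-10-04.txt#/';

$name = basename($path);     // last path component; the trailing slash is ignored
$name = rtrim($name, '#');   // drop any # characters at the end

echo $name, "\n"; // test-test-test#2016-10-04.txt
```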
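Without regex you can do:
$url_parts = parse_url('file:/home/dx/reader/validation-garage/IDON/test-test-test#2016-10-04.txt#/');
// end() takes its argument by reference, so store the array in a variable first
$segments = explode('/', $url_parts['path']);
echo end($segments);
or better:
$url_path = parse_url('file:/home/dx/reader/validation-garage/IDON/test-test-test#2016-10-04.txt#/', PHP_URL_PATH);
$segments = explode('/', $url_path);
echo end($segments);
Note that parse_url() treats everything after the first # as the fragment, so the path here ends at test-test-test.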
I want to match all href values in my page content. I wrote regex for that and tested it on regex101
href[ ]*=[ ]*("|')(.+?)\1
This finds all my href values properly. If I use
href[ ]*=[ ]*(?:"|')(.+?)(?:"|')
it's even better since I do not have to use a certain group later.
With " and ' in regex string I cannot run the regex properly with
$matches = array();
$pattern = "/href[ ]*=[ ]*("|')(.+?)\1/"; // syntax error
$numOfMatches = preg_match_all($pattern, $pattern, $matches);
print_r($matches);
If I "escape" double quote and thus repair the syntax error I get no matches.
So - what is the correct way to apply the given regex in PHP?
Thanks for any help
Notes:
addslashes or preg_quote won't help since I need to pass legit string first
escaping all the special chars \ + * ? [ ^ ] $ ( ) { } = ! < > | : - didn't help either
EDIT: Ok, I see I really shouldn't be doing this with regex. Could you please provide some helpful DOM parsers or any other tool I 'should' use with PHP for instance ?
For your case, the following should work:
/<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>/siU
Given the nature of the WWW there are always going to be cases where the regular expression breaks down. Small changes to the patterns can fix these.
spaces around the = after href:
/<a\s[^>]*href\s*=\s*(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>/siU
matching only links starting with http:
/<a\s[^>]*href=(\"??)(http[^\" >]*?)\\1[^>]*>(.*)<\/a>/siU
single quotes around the link address:
/<a\s[^>]*href=([\"\']??)([^\" >]*?)\\1[^>]*>(.*)<\/a>/siU
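Since the edit to the question asks about DOM parsers: PHP ships with DOMDocument (part of the dom extension, which is usually enabled). A minimal sketch with made-up sample HTML:

```php
<?php
// Hypothetical sample markup; in practice this would be the page content.
$html = '<p><a href="one.html">One</a> '
      . '<a class="x" href=\'two.html\'>Two</a> '
      . '<a name="top">No href</a></p>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // real-world HTML is rarely valid; collect warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$hrefs = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    if ($a->hasAttribute('href')) {
        $hrefs[] = $a->getAttribute('href');
    }
}

print_r($hrefs);
```

Quoting style, attribute order and spacing around = no longer matter here, which is the main reason to prefer a parser over a pattern.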
I had to use this regex to make it work. Next time I will definitely try with DOM parser :)
$regexForHREF = "/href[ ]*=[ ]*(?:\"|')(.+?)(?:\"|')/";
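For what it's worth, the backreference version from the question can also be kept. A sketch, assuming a single-quoted PHP string so that only the ' inside the pattern needs a backslash. Note that the subject passed to preg_match_all() has to be the page content, not the pattern itself:

```php
<?php
// Hypothetical sample content.
$html = '<a href="a.html">A</a> <a href = \'b.html\'>B</a>';

// In a single-quoted string \' becomes ' and \1 is passed through to PCRE unchanged.
$pattern = '/href[ ]*=[ ]*("|\')(.+?)\1/';

$numOfMatches = preg_match_all($pattern, $html, $matches);

print_r($matches[2]); // group 2 holds the href values
```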
I'm creating some custom BBcode for a forum. I'm trying to get the regular expression right, but it has been eluding me for weeks. Any expert advice is welcome.
Sample input (a very basic example):
[quote=Bob]I like Candace. She is nice.[/quote]
Ashley Ryan Thomas
Essentially, I want to encase any names (from a specified list) in [user][/user] BBcode... except, of course, those being quoted, because doing that causes some terrible parsing errors.
The desired output:
[quote=Bob]I like [user]Candace[/user]. She is nice.[/quote]
[user]Ashley[/user] [user]Ryan[/user] [user]Thomas[/user]
My current code:
$searchArray = array(
'/(?i)([^=]|\b|\s|\/|\r|\n|\t|^)(Ashley|Bob|Candace|Ryan|Thomas)(\s|\r|\n|\t|,|\.(\b|\s|\.|$)|;|:|\'|"|-|!|\?|\)|\/|\[|$)/'
);
$replaceArray = array(
"\\1[user]\\2[/user]\\3"
);
$text = preg_replace($searchArray, $replaceArray, $input);
What it currently produces:
[quote=Bob]I like [user]Candace[/user]. She is nice.[/quote]
[user]Ashley[/user] Ryan [user]Thomas[/user]
Notice that Ryan isn't encapsulated by [user] tags. Also note that much of the additional regex matching characters were added on an as-needed basis as they cropped up on the forums, so removing them will simply make it fail to match in other situations (i.e. a no-no). Unless, of course, you spot a glaring error in the regex itself, in which case please do point it out.
Really, though, any assistance would be greatly appreciated! Thank you.
It's quite simple: you are matching delimiters (\s|\r|...) at both ends of the searched names. The poor Ashley and Ryan share a single space character in your test string. But the regex can only match it once - as left or right border.
The solution here is to use assertions. Enclose the left list in (?<= ) and the right in (?= ) so they become:
(?<=[^=]|\b|\s|\/|^)
(?=\s|,|\.(\b|\s|\.|$)|;|:|\'|"|-|!|\?|\)|\/|\[|$)
Btw, \s already contains \r|\n|\t so you can probably remove those.
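To illustrate the difference, here is a cut-down sketch. The delimiter lists are reduced to whitespace and string boundaries for brevity, so this is not the full pattern from the question:

```php
<?php
$input = 'Ashley Ryan Thomas';
$names = 'Ashley|Bob|Candace|Ryan|Thomas';

// Delimiters are consumed: the space after Ashley is used up, so Ryan has no left border.
$consuming = preg_replace(
    '/(\s|^)(' . $names . ')(\s|$)/i',
    '$1[user]$2[/user]$3',
    $input
);

// Delimiters are only asserted: the same space can serve both neighbours.
$lookaround = preg_replace(
    '/(?<=\s|^)(' . $names . ')(?=\s|$)/i',
    '[user]$1[/user]',
    $input
);

echo $consuming, "\n";  // [user]Ashley[/user] Ryan [user]Thomas[/user]
echo $lookaround, "\n"; // [user]Ashley[/user] [user]Ryan[/user] [user]Thomas[/user]
```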
Since you don't really need to match the spaces on either side (just make sure they're there, right?) try replacing your search expression with this:
$searchArray = array(
'/\b(Ashley|Bob|Candace|Ryan|Thomas)\b/i'
);
$replaceArray = array(
'[user]$1[/user]'
);
$text = preg_replace($searchArray, $replaceArray, $input);
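One caveat: \b alone would also wrap the Bob in [quote=Bob]. A possible variation (my own assumption, not part of the answer above) adds a negative lookbehind for the = sign:

```php
<?php
$input = "[quote=Bob]I like Candace. She is nice.[/quote]\nAshley Ryan Thomas";

// (?<!=) rejects a name that directly follows "=", as in [quote=Bob].
$text = preg_replace(
    '/(?<!=)\b(Ashley|Bob|Candace|Ryan|Thomas)\b/i',
    '[user]$1[/user]',
    $input
);

echo $text, "\n";
```

With the sample input from the question this leaves quote=Bob untouched and wraps the other four names.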
I am trying to pull the anchor text from a link that is formatted this way:
<h3><b>File</b> : <a href="/en/browse/file/variable_text">i_want_this</a></h3>
I want only the anchor text for the link : "i_want_this"
"variable_text" varies according to the filename so I need to ignore that.
I am using this regex:
<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>
This is matching of course the complete link.
PHP's preg_* functions use PCRE (Perl Compatible Regular Expressions), which is very close to Perl's own regex syntax. If you want to learn a lot about regex, visit perlretut.org. Also, look into regex builders like Expresso.
For your use, know that regex quantifiers are greedy by default. That means that when you ask for something, followed by anything (any number of repetitions), followed by something else, the "anything" part grabs as much as it can and stops only at the last place where the second something can still match.
to be more clear, what you want is this:
<a href="
any character, any number of times (regex = .* )
">
any character, any number of times (regex = .* )
</a>
Beyond that, you want to capture the second "any character, any number of times". You can do that with capture groups: anything inside parentheses is captured as a group that you can refer to later (through the matches array or a back reference).
I would also look into named subpatterns - with those, you can reference your choice with a human-readable string rather than an array index. The syntax for those in PHP is (?P<name>pattern), where name is the name you want and pattern is the actual regex. I'll use that below.
So all that being said, here's the "lazy web" for your regex:
<?php
$str = '<h3><b>File</b> : <a href="/en/browse/file/variable_text">i_want_this</a></h3>';
$regex = '/(<a href\=".*">)(?P<target>.*)(<\/a>)/';
preg_match($regex, $str, $matches);
print $matches['target'];
?>
//This should output "i_want_this"
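The greedy behaviour only becomes visible once there is more than one link in the string. A small sketch with a made-up two-link string, comparing .* with the lazy .*?:

```php
<?php
// Hypothetical input with two links.
$str = '<a href="/a">first</a> <a href="/b">second</a>';

// Greedy: the first .* runs up to the last "> it can find.
preg_match('/<a href=".*">(.*)<\/a>/', $str, $greedy);

// Lazy: .*? stops at the first possible place.
preg_match('/<a href=".*?">(.*?)<\/a>/', $str, $lazy);

echo $greedy[1], "\n"; // second
echo $lazy[1], "\n";   // first
```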
Oh, and one final thought. Depending on what you are doing exactly, you may want to look into SimpleXML instead of using regex for this. This would probably require that the tags we see are just snippets of a larger whole, as SimpleXML requires well-formed XML (or XHTML).
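A rough sketch of the SimpleXML route, assuming the snippet is well-formed on its own and that the link looks like the one the question's regex targets:

```php
<?php
// Assumed markup, reconstructed from the href used in the question's regex.
$snippet = '<h3><b>File</b> : <a href="/en/browse/file/variable_text">i_want_this</a></h3>';

$xml = simplexml_load_string($snippet); // returns false if the XML is not well-formed

if ($xml !== false) {
    $text = (string) $xml->a;          // text content of the <a> child
    $href = (string) $xml->a['href'];  // attribute access uses array syntax
    echo $text, "\n"; // i_want_this
}
```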
I'm sure someone will probably have a more elegant solution, but I think this will do what you want done.
Where:
$subject = "<h3><b>File</b> : <a href=\"/en/browse/file/variable_text\">i_want_this</a></h3>";
Option 1:
$pattern1 = '/(<a href=")(.*)(">)(.*)(<\/a>)/i';
preg_match($pattern1, $subject, $matches1);
print($matches1[4]);
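Option 2 (ereg() was deprecated in PHP 5.3 and removed in PHP 7.0, so this only runs on old PHP versions):
$pattern2 = '(<a href=")(.*)(">)(.*)(</a>)';
ereg($pattern2, $subject, $matches2);
print($matches2[4]);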
Do not use regex to parse HTML. Use a DOM parser. Specify the language you're using, too.
Since it's in a captured group and since you claim it's matching, you should be able to reference it through $1 or \1 depending on the language.
$blah = preg_match( $pattern, $subject, $matches );
print_r($matches);
The thing to remember is that a regex match returns everything the pattern matched. You need to specify that you only care about the part you've surrounded with parentheses (the anchor text). I'm not sure what language you're using the regex in, but here's an example in Ruby:
string = '<a href="/en/browse/file/variable_text">i_want_this</a>'
data = string.match(/<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>/)
puts data # => outputs the whole link: '<a href="/en/browse/file/variable_text">i_want_this</a>'
If you specify what you want in parentheses, you can reference it:
string = '<a href="/en/browse/file/variable_text">i_want_this</a>'
data = string.match(/<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>/)[1]
puts data # => outputs 'i_want_this'
Perl will have you use $1 instead of [1] like this:
$string = '<a href="/en/browse/file/variable_text">i_want_this</a>';
$string =~ m/<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>/;
$data = $1;
print $data . "\n";
Hope that helps.
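For PHP (which the question turned out to be about), the same idea as a sketch: index 0 of the matches array is the whole match, index 1 is the first group. The sample string is assumed from the regex in the question:

```php
<?php
$string = '<a href="/en/browse/file/variable_text">i_want_this</a>';

preg_match('#<a href="/en/browse/file/variable_text">(.*?)</a>#', $string, $matches);

echo $matches[0], "\n"; // the complete link
echo $matches[1], "\n"; // i_want_this
```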
I'm not 100% sure if I understand what you want. This will match the content between the anchor tags. The URL must start with /en/browse/file/, but may end with anything.
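#<a href="/en/browse/file/.*?">(.*?)</a>#
I used # as a delimiter as it made it clearer. It'll also help if you put them in single quotes instead of double quotes so you don't have to escape anything at all.
If you want to limit to numbers instead, you can use:
#<a href="/en/browse/file/[0-9]+">(.*?)</a>#
If it should have just 5 numbers:
#<a href="/en/browse/file/[0-9]{5}">(.*?)</a>#
If it should have between 3 and 6 numbers:
#<a href="/en/browse/file/[0-9]{3,6}">(.*?)</a>#
If it should have more than 2 numbers:
#<a href="/en/browse/file/[0-9]{3,}">(.*?)</a>#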
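To try out the quantifier variants, something like the following could be used. The helper name and the sample link are made up for illustration, and the patterns assume the /en/browse/file/ prefix from the question:

```php
<?php
// Hypothetical helper: returns the anchor text if the file part is all digits
// and its length satisfies the given quantifier, otherwise null.
function anchorTextForNumericFile(string $html, string $quantifier): ?string
{
    $pattern = '#<a href="/en/browse/file/[0-9]' . $quantifier . '">(.*?)</a>#';
    return preg_match($pattern, $html, $m) ? $m[1] : null;
}

$link = '<a href="/en/browse/file/12345">report</a>';

var_dump(anchorTextForNumericFile($link, '+'));     // one or more digits: matches
var_dump(anchorTextForNumericFile($link, '{5}'));   // exactly 5 digits: matches
var_dump(anchorTextForNumericFile($link, '{3,6}')); // 3 to 6 digits: matches
var_dump(anchorTextForNumericFile($link, '{6,}'));  // at least 6 digits: no match
```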
This should work:
<a href="[^"]*">([^<]*)
This says: take everything you find until you meet "
[^"]*
Same again: take everything until you meet <
[^<]*
The parentheses around [^<]*
([^<]*)
group it, so you can collect that data in PHP. If you look up preg_match in the PHP manual you will see many fine examples there.
Good luck!
And for your concrete example:
<a href="/en/browse/file/variable_text">([^<]*)
I use
[^<]*
because in some cases
.*?
can be extremely slow. You shouldn't use it if you can use
[^<]*
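A short sketch of the negated character classes in use, with a made-up string containing two links. Because [^"]* and [^<]* cannot run past the closing quote or the next tag, each match stays inside its own link:

```php
<?php
// Hypothetical input.
$html = '<a href="/en/browse/file/a">first</a><a href="/en/browse/file/b">second</a>';

preg_match_all('~<a href="[^"]*">([^<]*)~', $html, $m);

print_r($m[1]); // anchor texts: first, second
```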
You could also use the tool Expresso for creating regular expressions. Pretty handy.
http://www.ultrapico.com/Expresso.htm
I'm new to preg_replace() and I've been trying to get this to work, but I couldn't, so StackOverflow is my last chance.
I have a string with a few of these:
('pm_IDHERE', 'NameHere');">
I want it to be replaced with nothing, so it would require 2 wildcards for NameHere and pm_IDHERE.
But I've tried it and failed myself, so could someone give me the right code please, and thanks :)
Update:
You are almost there, you just have to make the replacement an empty string and escape the parentheses properly, otherwise they will be treated as a capture group (which you don't need btw):
$str = preg_replace("#\('pm_.+?', '.*?'\);#si", "", $str);
You probably also don't need the modifiers s and i but that is up to you.
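A quick check of that call on a made-up string (the surrounding letters are only there to show what is left over):

```php
<?php
// Hypothetical input containing two of the fragments to remove.
$str = "a('pm_1564', 'Alice');b('pm_22', 'Bob');c";

// \( and \) match literal parentheses; .+? and .*? are the two wildcards.
$str = preg_replace("#\('pm_.+?', '.*?'\);#si", "", $str);

echo $str, "\n"; // abc
```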
Old answer:
Probably str_replace() is sufficient:
$str = "Some string that contains pm_IDHERE and NameHere";
$str = str_replace(array('pm_IDHERE', 'NameHere'), '', $str);
If this is not what you mean and pm_IDHERE is actually something like pm_1564 then yes, you probably need regular expressions for that. But if NameHere has no actual pattern or structure, you cannot replace it with regular expression.
And you definitely have to explain better what kind of string you have and what part of it you want to replace.