Fetch All URLs from a Page using Regex - php

Original format:
<a href="http://www.example.com/t434234.html" ...>
1. I need to fetch all URLs of this format:
http://www.example.com/t[ANY CHARACTER].html
ANY CHARACTER is where value changes from URL to another. The rest are fixed.
Here is my attempt:
preg_match("#http:\/\/www\.aqarcity\.com\/t[a-zA-Z0-9_]\.html#", $page, $urls);
I get empty results. I don't know where i went wrong...

The problem appears to be that [a-zA-Z0-9_] will only match exactly one character. If you want to match zero or more characters, use [a-zA-Z0-9_]*. For one or more, use [a-zA-Z0-9_]+. For exactly six characters, use [a-zA-Z0-9_]{6}. For e.g. one to six characters, use [a-zA-Z0-9_]{1,6}.
Also note that, since you're using # as the delimiter, you don't need to escape the / characters. As far as I know this will not make your code misbehave, but it'll be easier to read if you remove the backslashes before the slashes.
Finally, please realize that regular expressions are a rather dangerous way to work with HTML. In this case, you may pick up matching URLs from comments, Javascript code, and other things that aren't links. It is literally impossible to correctly parse HTML with unaugmented regular expressions—they don't have the expressive power necessary to do so. I don't know what sorts of HTML parsers are available for PHP, but you may want to look into them.

Related

PHP: Setting missing " in HTML-code with preg_replace?

I've got a database with a lot of user made entries grown about 10 years. The users had the option to put HTML-code in their content. And some didn't that well. So I've a lot of content in where the quotes are missing. Need a valid HTML-code for an ex/import via XML.
Had tested to replace width but my regex doesn't work. Do you've an idea where's my fault?
$out=preg_replace("/<a href=h(.)*>/","<a href=\"h$1\">",$out);
PS: If you have an idea how to automatically make a correction on wrong html source this would alternatively be great.
I think you wanted to use "/<a href=h(.*)>/" (mind the star inside the parenthesis) since you want to capture all characters after the h and before the > inside the capture group.
You can also use <a href=([^"].*)> since the href may not start with h. This regex captures all href values that do not start with ".
Yet, all of these assume that the href is the last attribute in your a, i.e.., ending with >.
As a more general rule, I came up with (?<key>\w*)\s*=\s*(?<value>[^"][^\s>]*) that finds attribute-value pairs, separated by =. The values may not start with ", and they go until the next whitespace or >. Use this with caution, since it may fail in serveral circumstances: Multi-line html, inline JavaScript, etc.
Whether it is a good idea to use RegEx for such a task is a different discussion.

how to replace '\\\' to '\'?

my code is not working ? and i dont want to use str_replace , for there maybe more slashes than 3 to be replaced. how can i do the job using preg_replace?
my code here like this:
<?php
$str='<li>
<span class=\"highlight\">Color</span>
Can\\\'t find the exact color shown on the model pictures? Just leave a message (eg: color as shown in the first picture...) when you place order.
Please note that colors on your computer monitor may differ slightly from actual product colors depending on your monitor settings.
</li>';
$str=preg_replace("#\\+#","\\",$str);
echo $str;
There is merit in the other answers, but to me it looks like what you're actually trying to accomplish is something very different. In the php code \\\' is not three slashes followed by an apostrophe, it's one escaped slash followed by an escaped apostrophe, and in the rendered output, that's exactly what you see—a slash followed by an apostrophe (with no need to escape them in the rendered html). It's important to realize that the escape character is not actually part of the string; it's merely a way to help you represent a character that normally has very different meaning in within php—in this case, an apostrophe normally terminates a string literal. What looks like 4 characters in php is actually only 2 characters in the string.
If this is the extent of your code, there's no need for string manipulation or regular expressions. What you actually need is just this:
<?php
$str='<li>
<span class="highlight">Color</span>
Can\'t find the exact color shown on the model pictures? Just leave a message (eg: color as shown in the first picture...) when you place order.
Please note that colors on your computer monitor may differ slightly from actual product colors depending on your monitor settings.
</li>';
echo $str;
?>
Only one escape character is needed here for the apostrophe, and in the rendered HTML you will see no slashes at all.
Further Reading:
Escape sequences
The root of this problem is actually in how it was written into your database and likely to be caused by magic_quotes_gpc; this was used in older versions and a really bad idea.
The best fix
This requires a few steps:
Fix the script that puts the HTML inside your database by disabling magic_quotes_gpc.
Write a script that reads all existing database entries, applies stripslashes() and saves the changes.
Fix the presentation part (though, that may need no changes at all.
Alternative patch
Use stripslashes() before you present the HTML.
use this pattern
preg_replace('#\\+#', '\\', $text);
This replaces two or more \ symbols preceding an ' symbol with \'
$theConvertedString = preg_replace("/\\{2,}'/", "\'", $theSourceString);
Ideally, you shouldn't have code causing this issue in the first place so I would have a look at why you have \\' in your code to begin with. If you've manually put it in your variables, take it out. Often, this also happens with multiple calls to addslashes() or mysql_real_escape_string() or a cheap hosting providers' automatic transformation of all POST request variables to escape slashes, combined with your server side PHP code to do the same.

regex for preg_match not working

I need to scrape some data from a website. For that I am using preg_match, but I am not able to write the regex for it. The data on the website is
title="Russia"/></a>
<small>*</small> <a href="/profile/roman
I have written the regex as #title=\"Russia\"\/><\/a>((\n|\r)*)<small>*<\/small> <a href=\"/profile/(.+?)\"#sx
But this is not working and I dont know why ? When I echo my regex it says #title="Russia"\/><\/a>(( | )*)*<\/small> . Where are the others gone? And why is it not working ?
Try this:
#title=\"Russia\"/></a>(\s*)<small>\*</small>\s+<a\s+href=\"/profile/(.+?)\"#sx
I have escaped the * because its a metacharacter. Without it, you would match strings containing the word small followed by zero or more >s.
You really should not use regexes to evaluate markup content, especially when you acquire it by scrapping pages.
In your case there are at least three reasons that might be responsible for breaking your regex.
Do not attempt to write your own whitespace evaluators when you can simply use \s which stands for "any whitespace character"
In regular expressions asterisk (*) has a special meaning which is why you can't simply use it to identify asterisks. If you want to collect content inside the small attribute you should use <small>(.*)</small> instead. If on the other hand you are actually expecting an asterisk then you have to escape it like this <small>\*</small>.
Your regex expects a closing quote for your href attribute on that last <a> but in your sample markup you have none. Provided that on the original page you do have a closing quote the following regex should do the trick.
#title=\"Russia\"\/><\/a>(\s*)<small>\*</small> <a href="/profile/(.+)?\"#sx
However once again I have to advise using a DOM parser like DOMDocument for this not only because it is much more reliable when handling markup content but also because it can interpret bad markup as well (if its loaded as HTML of course).

How to match anything except a pattern between two tags

I am attempting to match a string which is composed of HTML. Basically it is an image gallery so there is a lot of similarity in the string. There are a lot of <dl> tags in the string, but I am looking to match the last <dl>(.?)+</dl> combo that comes before a </div>.
The way I've devised to do this is to make sure that there aren't any <dl's inside the <dl></dl> combo I'm matching. I don't care what else is there, including other tags and line breaks.
I decided I had to do it with regular expressions because I can't predict how long this substring will be or anything that's inside it.
Here is my current regex that only returns me an array with two NULL indicies:
preg_match_all('/<dl((?!<dl).)+<\/dl>(?=<\/div>)/', $foo, $bar)
As you can see I use negative lookahead to try and see if there is another <dl> within this one. I've also tried negative lookbehind here with the same results. I've also tried using +? instead of just + to no avail. Keep in mind that there's no pattern <dl><dl></dl> or anything, but that my regex is either matching the first <dl> and the last </dl> or nothing at all.
Now I realize . won't match line breaks but I've tried anything I could imagine there and it still either provides me with the NULL indicies or nearly the whole string (from the very first occurance of <dl to </dl></div>, which includes several other occurances of <dl>, exactly what I didn't want). I honestly don't know what I'm doing incorrectly.
Thanks for your help! I've spent over an hour just trying to straighten out this one problem and it's about driven me to pulling my hair out.
Don't use regular expressions for irregular languages like HTML. Use a parser instead. It will save you a lot of time and pain.
I would suggest to use tidy instead. You can easily extra all the desired tags with their contents, even for broken HTML.
In general I would not recommend to write a parser using regex.
See http://www.php.net/tidy
As crazy as it is, about 2 minutes after I posted this question, I found a way that worked.
preg_match_all('/<dl([^\z](?!<dl))+?<\/dl>(?=<\/div>)/', $foo, $bar);
The [^\z] craziness is just a way I used to say "match all characters, even line breaks"

PHP regex for filtering out urls from specific domains for use in a vBulletin plug-in

I'm trying to put together a plug-in for vBulletin to filter out links to filesharing sites. But, as I'm sure you often hear, I'm a newb to php let alone regexes.
Basically, I'm trying to put together a regex and use a preg_replace to find any urls that are from these domains and replace the entire link with a message that they aren't allowed. I'd want it to find the link whether it's hyperlinked, posted as plain text, or enclosed in [CODE] bb tags.
As for regex, I would need it to find URLS with the following, I think:
Starts with http or an anchor tag. I believe that the URLS in [CODE] tags could be processed the same as the plain text URLS and it's fine if the replacement ends up inside the [CODE] tag afterward.
Could contain any number of any characters before the domain/word
Has the domain somewhere in the middle
Could contain any number of any characters after the domain
Ends with a number of extentions such as (html|htm|rar|zip|001) or in a closing anchor tag.
I have a feeling that it's numbers 2 and 4 that are tripping me up (if not much more). I found a similar question on here and tried to pick apart the code a bit (even though I didn't really understand it). I now have this which I thought might work, but it doesn't:
<?php
$filterthese = array('domain1', 'domain2', 'domain3');
$replacement = 'LINKS HAVE BEEN FILTERED MESSAGE';
$regex = array('!^http+([a-z0-9-]+\.)*$filterthese+([a-z0-9-]+\.)*(html|htm|rar|zip|001)$!',
'!^<a+([a-z0-9-]+\.)*$filterthese+([a-z0-9-]+\.)*</a>$!');
$this->post['message'] = preg_replace($regex, $replacement, $this->post['message']);
?>
I have a feeling that I'm way off base here, and I admit that I don't fully understand php let alone regexes. I'm open to any suggestions on how to do this better, how to just make it work, or links to RTM (though I've read up a bit and I'm going to continue).
Thanks.
You can use parse_url on the URLs and look into the hashmap it returns. That allows you to filter for domains or even finer-grained control.
I think you can avoid the overhead of this in using the filter_var built-in function.
You may use this feature since PHP 5.2.0.
$good_url = filter_var( filter_var( $raw_url, FILTER_SANITIZE_URL), FILTER_VALIDATE_URL);
Hmm, my first guess: You put $filterthese directly inside a single-quoted string. That single quotes don't allow for variable substitution. Also, the $filterthese is an array, that should first be joined:
var $filterthese = implode("|", $filterthese);
Maybe I'm way off, because I don't know anything about vBulletin plugins and their embedded magic, but that points seem worth a check to me.
Edit: OK, on re-checking your provided source, I think the regexp line should read like this:
$regex = '!(?#
possible "a" tag [start]: )(<a[^>]+href=["\']?)?(?#
offending link: )https?://(?#
possible subdomains: )(([a-z0-9-]+\.)*\.)?(?#
domains to block: )('.implode("|", $filterthese).')(?#
possible path: )(/[^ "\'>]*)?(?#
possible "a" tag [end]: )(["\']?[^>]*>)?!';

Categories