I've got a database with a lot of user-made entries that has grown over about 10 years. The users had the option to put HTML code in their content, and some didn't do that well, so I have a lot of content in which the quotes around attribute values are missing. I need valid HTML for an export/import via XML.
I tried a replacement with preg_replace, but my regex doesn't work. Do you have an idea where my mistake is?
$out=preg_replace("/<a href=h(.)*>/","<a href=\"h$1\">",$out);
PS: If you have an idea how to automatically correct invalid HTML source, that would be a great alternative.
I think you wanted to use "/<a href=h(.*)>/" (mind the star inside the parentheses) since you want to capture all characters after the h and before the > inside the capture group. With (.)* the group itself is repeated, so $1 only holds the last character it matched.
You can also use <a href=([^"].*)> since the href may not start with h. This regex captures all href values that do not start with ".
Yet all of these assume that the href is the last attribute in your a, i.e., that it is followed directly by >.
As a more general rule, I came up with (?<key>\w*)\s*=\s*(?<value>[^"][^\s>]*), which finds attribute-value pairs separated by =. The values may not start with ", and they run until the next whitespace or >. Use this with caution, since it may fail in several circumstances: multi-line HTML, inline JavaScript, etc.
Whether it is a good idea to use RegEx for such a task is a different discussion.
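For what it's worth, here is a minimal sketch of the href fix discussed above. The sample string is made up, and the character class [^\s>"']+ is my own variation on the pattern from the answer: it stops at whitespace or >, so a following attribute is left alone. It only handles href on a tags and is not a general HTML repair.

```php
<?php
// Hypothetical sample input: an unquoted href followed by another attribute.
$out = '<p><a href=http://example.com/page target=_blank>link</a></p>';

// Quote href values that do not already start with a quote.
// [^\s>"']+ stops at whitespace or ">", so later attributes survive.
$fixed = preg_replace('/<a\s+href=([^\s>"\']+)/i', '<a href="$1"', $out);

echo $fixed, PHP_EOL;
// <p><a href="http://example.com/page" target=_blank>link</a></p>
```

Note that target=_blank is still unquoted afterwards, which is exactly why a per-attribute regex gets tedious on ten years of user content.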
I need to scrape some data from a website. For that I am using preg_match, but I am not able to write the regex for it. The data on the website is
title="Russia"/></a>
<small>*</small> <a href="/profile/roman
I have written the regex as #title=\"Russia\"\/><\/a>((\n|\r)*)<small>*<\/small> <a href=\"/profile/(.+?)\"#sx
But this is not working and I don't know why. When I echo my regex it says #title="Russia"\/><\/a>(( | )*)*<\/small> . Where have the other parts gone? And why is it not working?
Try this:
#title=\"Russia\"/></a>(\s*)<small>\*</small>\s+<a\s+href=\"/profile/(.+?)\"#sx
I have escaped the * because it's a metacharacter. Without the escape, <small>* matches <small followed by zero or more > characters.
You really should not use regexes to evaluate markup content, especially when you acquire it by scraping pages.
In your case there are at least three reasons that might be responsible for breaking your regex.
Do not attempt to write your own whitespace evaluators when you can simply use \s which stands for "any whitespace character"
In regular expressions the asterisk (*) has a special meaning, which is why you can't simply use it to match a literal asterisk. If you want to capture the content of the small element, use <small>(.*)</small> instead. If, on the other hand, you are actually expecting an asterisk, you have to escape it like this: <small>\*</small>.
Your regex expects a closing quote for your href attribute on that last <a>, but in your sample markup you have none. Provided that the original page does have a closing quote, the following regex should do the trick. Note that with the x modifier literal spaces in the pattern are ignored, so the spaces are written as \s+ here.
#title=\"Russia\"\/><\/a>(\s*)<small>\*<\/small>\s+<a\s+href=\"/profile/(.+?)\"#sx
However, once again I have to advise using a DOM parser like DOMDocument for this, not only because it is much more reliable when handling markup content but also because it can interpret bad markup as well (if it's loaded as HTML, of course).
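As a rough illustration of the DOMDocument route, assuming markup shaped like the sample in the question (the surrounding div, the image and the link text are invented for the example):

```php
<?php
// Hypothetical markup modelled on the sample in the question.
$html = '<div><a href="/country/ru"><img src="ru.png" title="Russia"/></a>'
      . "\n" . '<small>*</small> <a href="/profile/roman">roman</a></div>';

$doc = new DOMDocument();
// Collect libxml warnings about sloppy markup instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$profiles = [];
// Select every link whose href starts with "/profile/".
foreach ($xpath->query('//a[starts-with(@href, "/profile/")]') as $a) {
    $profiles[] = substr($a->getAttribute('href'), strlen('/profile/'));
}

print_r($profiles); // one entry: "roman"
```

The XPath expression is only one way to pick the link; if the page has several profile links you would probably want to anchor on the title="Russia" image first.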
I need to match all three types of comments that PHP might have:
# Single line comment
// Single line comment
/* Multi-line comments */
/**
* And all of its possible variations
*/
Something I should mention: I am doing this in order to recognize whether a PHP closing tag (?>) is inside a comment or not. If it is, ignore it; if not, count it as a closing tag. This is going to be used inside an XML document in order to improve Sublime Text's recognition of the closing tag (because it's driving me nuts!). I tried for a couple of hours but couldn't get it to work. How can I translate it to work with XML?
So if you could also include the if-then-else logic I would really appreciate it. BTW, I really need it to be a pure regular expression, no language features or anything. :)
Like Eicon reminded me, I need all of them to be able to match at the start of the line, or at the end of a piece of code, so I also need the following with all of them:
<?php
echo 'something'; # this is a comment
?>
Parsing a programming language seems too much for regexes to do. You should probably look for a PHP parser.
But these would be the regexes you are looking for. I assume for all of them that you use the DOTALL or SINGLELINE option (although the first two would work without it as well):
~#[^\r\n]*~
~//[^\r\n]*~
~/\*.*?\*/~s
Note that any of these will cause problems, if the comment-delimiting characters appear in a string or somewhere else, where they do not actually open a comment.
You can also combine all of these into one regex:
~(?:#|//)[^\r\n]*|/\*.*?\*/~s
If you use some tool or language that does not require delimiters (like Java or C#), remove those ~. In this case you will also have to apply the DOTALL option differently. But without knowing where you are going to use this, I cannot tell you how.
If you cannot/do not want to set the DOTALL option, this would be equivalent (I also left out the delimiters to give an example):
(?:#|//)[^\r\n]*|/\*[\s\S]*?\*/
Now if you also want to capture the contents of the comments in a group, then you could do this
(?|(?:#|//)([^\r\n]*)|/\*([\s\S]*?)\*/)
Regardless of the type of comment, the comment's content (without the syntax delimiters) will be found in capture group 1.
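A small sketch of the branch-reset pattern in PHP, assuming a made-up three-line snippet as input. The variable names and the trim() step are mine and not part of the regex:

```php
<?php
// Hypothetical PHP source held in a nowdoc so nothing is interpolated.
$code = <<<'CODE'
$a = 1; # first
$b = 2; // second
/* third
   spans lines */
CODE;

// (?| ... ) resets group numbering per alternative, so the comment text
// always lands in group 1 whichever comment style matched.
preg_match_all('~(?|(?:#|//)([^\r\n]*)|/\*([\s\S]*?)\*/)~', $code, $m);

// Trim the whitespace that follows or precedes the delimiters.
$contents = array_map('trim', $m[1]);
print_r($contents);
```

This prints three entries, one per comment, with the multi-line comment kept as a single string.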
Single-line comments
singleLineComment = /'[^']*'|"[^"]*"|((?:#|\/\/).*$)/gm
With this regex you have to replace (or remove) everything that was captured by ((?:#|\/\/).*$). This regex will ignore contents of strings that would look like comments (e.g. $x = "You are the #1"; or $y = "You can start comments with // or # in PHP, but I'm a code string";)
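Here is how that replace step might look in PHP, as a sketch only. The pattern is the one above with PHP delimiters, and the callback and sample input are my own assumptions. It relies on group 1 being present only when the comment alternative matched:

```php
<?php
// Hypothetical input: comment markers inside strings plus two real comments.
$code = <<<'CODE'
$x = "You are the #1"; # real comment
$y = 'a // b'; // another one
CODE;

$pattern = '~\'[^\']*\'|"[^"]*"|((?:#|//).*$)~m';

$stripped = preg_replace_callback($pattern, function ($m) {
    // Group 1 exists only for the comment alternative; strings are kept as-is.
    return isset($m[1]) ? '' : $m[0];
}, $code);

echo $stripped, PHP_EOL;
```

Both string literals survive untouched, while the two trailing comments are removed. Strings with escaped quotes, or apostrophes inside comments, would need a more careful pattern.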
Multiline comments
multilineComment = /^\s*\/\*\*?[^!][.\s\t\S\n\r]*?\*\//gm
Ok, so here's my issue:
I have a link, say: http://www.blablabla.com/watch?v=1lyu1KKwC74&feature=list_other&playnext=1&list=AL94UKMTqg-9CfMhPFKXPXcvJ_j65v7UuV
And the link is between two tags say like this:
<br>http://www.blablabla.com/watch?v=1lyu1KKwC74&feature=list_other&playnext=1&list=AL94UKMTqg-9CfMhPFKXPXcvJ_j65v7UuV<br></p>
Using this regex with preg_replace:
'#(^|[^\/]|[^>])('.addcslashes($link,'.?+').')([^\w\/]|[^<]$)#i'
As such:
preg_replace('#(^|[^\/]|[^>])('.addcslashes($link,'.?+').')([^\w\/]|[^<]$)#i', "***",$strText);
The resulting string is:
<br***p>
Which is wrong!!
It should have been
<br>***<br></p>
How can I get the desired result? I have been banging my head against the wall trying to solve this one.
I would like to mention that str_replace replaces even the link within another valid link, so it's not a good method, I need an exact match between two boundaries, even if the boundary is text or another HTML tag.
Assuming you don't want to use a DOM parser for some reason, I believe doing what you intended is as simple as the following:
preg_replace('#(^|[^\/]|[^>])('.addcslashes($link,'.?+').')([^\w\/]|[^<]$)#i', "$1***$3",$strText);
This uses $1 and $3 to put back the delimiting text you matched in your regular expression.
As others have pointed out, using a DOM parser is more reliable.
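If you stay with a regex, a variation worth considering is to let preg_quote() do the escaping and to use lookarounds for the boundaries, so nothing around the link is consumed and no backreferences are needed. This is a sketch under my own assumptions: the shortened link, the second a element and the boundary class [\w/] are invented for the example and may not match your real boundary rules.

```php
<?php
// Hypothetical data: the link on its own, and again as part of a longer URL.
$link = 'http://www.blablabla.com/watch?v=1lyu1KKwC74&feature=list_other';
$strText = '<br>' . $link . '<br></p> <a href="' . $link . '/more">x</a>';

// preg_quote() escapes every regex metacharacter in the link,
// and the "#" delimiter when it is passed as the second argument.
$pattern = '#(?<![\w/])' . preg_quote($link, '#') . '(?![\w/])#i';

// The lookarounds assert the boundaries without consuming them.
$result = preg_replace($pattern, '***', $strText);

echo $result, PHP_EOL;
```

The standalone link becomes ***, while the occurrence followed by /more is left alone because the lookahead rejects a following slash.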
Does this do what you want?
Let's assume I do preg_replace as follows:
preg_replace ("/<my_tag>(.*)<\/my_tag>/U", "<my_new_tag>$1</my_new_tag>", $source);
That works but I do also want to grab the attribute of the my_tag - how would I do it with this:
<my_tag my_attribute_that_know_the_name_of="some_value">tra-la-la</my_tag>
You don't use regex. You use a real parser, because this stuff cannot be parsed with regular expressions. You'll never know if you've got all the corner cases quite right and then your regex has turned into a giant bloated monster and you'll wish you'd just taken fredley's advice and used a real parser.
For a humorous take, see this famous post.
preg_replace('#<my_tag\b([^>]*)>(.*?)</my_tag>#',
'<my_new_tag$1>$2</my_new_tag>', $source)
The ([^>]*) captures anything after the tag name and before the closing >. Of course, > is legal inside HTML attribute values, so watch out for that (but I've never seen it in the wild). The \b prevents matches of tag names that happen to start with my_tag, preventing bogus matches like this:
<my_tag_xyz>ooga-booga</my_tag_xyz><my_tag>tra-la-la</my_tag>
But that will still break on <my_tag> elements wrapped in other <my_tag> elements, yielding results like this:
<my_tag><my_tag>tra-la-la</my_tag>
If you know you'll never need to match tags with other tags inside them, you can replace the (.*?) with ([^<>]++).
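To make that concrete, here is a short sketch using the ([^<>]++) variant on an invented input string (the attribute name my_attribute is shortened from the question):

```php
<?php
// Hypothetical input: a look-alike tag followed by the tag we want to rename.
$source = '<my_tag_xyz>ooga-booga</my_tag_xyz>'
        . '<my_tag my_attribute="some_value">tra-la-la</my_tag>';

$result = preg_replace(
    // \b keeps <my_tag_xyz> from matching; $1 carries the attributes over.
    '#<my_tag\b([^>]*)>([^<>]++)</my_tag>#',
    '<my_new_tag$1>$2</my_new_tag>',
    $source
);

echo $result, PHP_EOL;
// <my_tag_xyz>ooga-booga</my_tag_xyz><my_new_tag my_attribute="some_value">tra-la-la</my_new_tag>
```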
I get pretty tired of the glib "don't do that" answers too, but as you can see, there are good reasons behind them--I could come up with this many more without having to consult any references. When you ask "How do I do this?" with no background or qualification, we have no idea how much of this you already know.
Forget regexes, use this instead:
http://simplehtmldom.sourceforge.net/
I am attempting to match a string which is composed of HTML. Basically it is an image gallery so there is a lot of similarity in the string. There are a lot of <dl> tags in the string, but I am looking to match the last <dl>(.?)+</dl> combo that comes before a </div>.
The way I've devised to do this is to make sure that there aren't any <dl's inside the <dl></dl> combo I'm matching. I don't care what else is there, including other tags and line breaks.
I decided I had to do it with regular expressions because I can't predict how long this substring will be or anything that's inside it.
Here is my current regex, which only returns an array with two NULL indices:
preg_match_all('/<dl((?!<dl).)+<\/dl>(?=<\/div>)/', $foo, $bar)
As you can see, I use negative lookahead to try to see if there is another <dl> within this one. I've also tried negative lookbehind here with the same results. I've also tried using +? instead of just +, to no avail. Keep in mind that there's no pattern <dl><dl></dl> or anything, but that my regex is either matching the first <dl> and the last </dl> or nothing at all.
Now I realize . won't match line breaks, but I've tried everything I could imagine there and it still either gives me the NULL indices or nearly the whole string (from the very first occurrence of <dl to </dl></div>, which includes several other occurrences of <dl>, exactly what I didn't want). I honestly don't know what I'm doing incorrectly.
Thanks for your help! I've spent over an hour just trying to straighten out this one problem and it's about driven me to pulling my hair out.
Don't use regular expressions for irregular languages like HTML. Use a parser instead. It will save you a lot of time and pain.
I would suggest using tidy instead. You can easily extract all the desired tags with their contents, even from broken HTML.
In general I would not recommend to write a parser using regex.
See http://www.php.net/tidy
As crazy as it is, about 2 minutes after I posted this question, I found a way that worked.
preg_match_all('/<dl([^\z](?!<dl))+?<\/dl>(?=<\/div>)/', $foo, $bar);
The [^\z] craziness is just a way I used to say "match all characters, even line breaks"
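For comparison, a sketch of the same idea using the s modifier, which is the documented way to let . match line breaks, with the lookahead placed before each character so no new <dl can be stepped onto. The sample markup is invented, and this is a variation on the pattern above rather than a drop-in replacement:

```php
<?php
// Hypothetical gallery markup: two <dl> blocks, the last one right before </div>.
$foo = "<div><dl>\n<dt>one</dt>\n</dl><dl>\n<dt>two</dt>\n</dl></div>";

// (?:(?!<dl).)+? consumes one character at a time and refuses to start
// on a new "<dl"; the s modifier lets "." match line breaks.
preg_match_all('~<dl(?:(?!<dl).)+?</dl>(?=</div>)~s', $foo, $bar);

print_r($bar[0]); // only the second <dl>...</dl> block
```

Only the block that directly precedes </div> is returned, because any attempt starting at the first <dl would have to cross the second one.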