How can I create my own regex for "parse" HTML links?


The strings look like hyperlinks, such as http://somethings. This is what I need:
They should be matched only if they don't start with the character "; that is, if there is no " immediately before them, they must be matched;
"somethings" means that any kind of character can be used (it is a link, after all) except whitespace, which marks the end of the link; I know it's permitted by the RFC, but it is the only way I know to find the end;
these strings are previously filtered with htmlentities($str, ENT_QUOTES, "UTF-8"), which is why any kind of character can be used. Is it secure? Or do I risk problems with XSS or broken HTML?
the replacement can occur multiple times, not only once, and must be case insensitive.
This is my current regex:
preg_replace('#\b[^"](((http|https|ftp)://).+)#', '<a class="lforum" href="$1">$1</a>', $str);
But it matches only those strings that START with ", and I want the opposite. Any help answering this question would be appreciated. Thanks!

For both of your cases you'll want lookbehind assertions.
\b(?<!")(\w)\b - negative lookbehind to match only if not preceded by "
(?<=ThisShouldBePresent://)(.*) - positive lookbehind to match only if preceded by your string.
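A minimal sketch of how the negative lookbehind could be combined with the pattern from the question. The function name linkify is hypothetical, and the sketch assumes, as the question does, that a URL ends at the first whitespace character. Keep in mind that after htmlentities() with ENT_QUOTES a double quote appears as &quot;, so the lookbehind may need to test for that entity instead of a literal ".

```php
<?php
// Hypothetical helper: wrap bare URLs in an anchor unless a double quote
// immediately precedes them.
function linkify(string $str): string
{
    // (?<!")  negative lookbehind: the URL must not follow a double quote
    // \S+     assumption from the question: a URL ends at the first whitespace
    // i       case-insensitive; preg_replace() replaces every occurrence
    return preg_replace(
        '~(?<!")\b((?:https?|ftp)://\S+)~i',
        '<a class="lforum" href="$1">$1</a>',
        $str
    );
}

echo linkify('see HTTP://example.com/a and "http://example.org/b"'), "\n";
```

Only the first URL is wrapped; the second one is left alone because a " comes directly before it.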

Something like this: preg_match('/\b[^"]/',$input_string);
This looks for a word-break (\b), followed by any character other than a double quote ([^"]).
Something like this: preg_match('~(((ThisShouldBePresent)://).+)~', $input_string);
I've assumed the brackets you specified in the question (and the plus sign) were intended as part of the regex rather than characters to search for.
I've also taken @ThiefMaster's advice and changed the delimiter to ~ to avoid having to escape the //.

Related

PHP regex remove all digits except character codes

As per this thread it's pretty easy to remove all digits from a string in PHP.
For example:
$no_digits = preg_replace('/\d/', '', 'This string contains digits! 1234');
But I don't want digits removed that are part of HTML character codes such as:
&#41;
&#169;
How can I get Regex to ignore numbers that are part of an HTML character code, i.e. numbers that are sandwiched between &# and ; characters?

You can use the (*SKIP)(*F) verbs:
echo preg_replace('/&#\d+;(*SKIP)(*F)|\d+/', '',
'This string contains digits! 1234 &#41; &#169; 5678');
//=> This string contains digits!  &#41; &#169;
&#\d+;(*SKIP)(*F) will skip the match if the regex matches the &#\d+; pattern.
Alternatively you can use lookarounds:
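echo preg_replace('/(?<!&#|\d)\d+|\d+(?![;\d])/', '',
'This string contains digits! 1234 &#41; &#169; 5678');
Which means: match 1 or more digits that are either not preceded by &# OR not followed by ; thus making it skip the &#\d+; pattern. The extra \d inside each lookaround stops the regex from matching only part of a number that belongs to a character code.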

You can use
var output = Regex.Replace(input, @"[\d-]", string.Empty);
The \d identifier simply matches any digit character.

As an option, you could convert your code to UTF-8 encoding (if it’s not already UTF-8), then convert HTML entities to corresponding characters with html_entity_decode(), then remove numbers with a regexp, then, if needed, convert special characters to corresponding entities again with htmlentities() (in UTF-8, it’s actually enough to escape just a minimal subset of special characters via htmlspecialchars()), then convert code back to your original encoding (if the original string was not in UTF-8).
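The decode, strip, re-encode sequence described above might look roughly like this. It is only a sketch: the function name strip_digits_keep_entities is made up for illustration, and the input is assumed to be UTF-8 already, so the encoding conversion steps are left out.

```php
<?php
// Hypothetical helper illustrating the decode -> strip -> re-encode idea.
// Assumes the input is already UTF-8.
function strip_digits_keep_entities(string $html): string
{
    // Turn entities such as &#169; into the characters they stand for.
    $decoded = html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Every digit that is left is a real digit and can be removed.
    $stripped = preg_replace('/\d+/', '', $decoded);
    // Re-escape only the characters that are special in HTML.
    return htmlspecialchars($stripped, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

echo strip_digits_keep_entities('Digits 1234 &#169; 5678'), "\n";
```

Two consequences of this approach are worth knowing: the character code comes back as the literal character (here the copyright sign) rather than as &#169;, and a digit that was itself written as a character code (for example &#49;) is removed along with the others.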

You can use look behind and look ahead.
$no_digits = preg_replace('/(?<!&#)\d+(?![;\d])/', '', 'This string contains &#41; digits! 1234');
So basically, (?<!&#) tells RegEx to look behind \d+ to make sure that there is no &# and (?![;\d]) tells RegEx to look ahead of \d+ to make sure that what follows is not a semicolon or another digit. The lookahead is negative so that digits at the very end of the string are still matched.
I like this solution a bit better as it can be used on most RegEx like in Java and JavaScript.
Hope this helps.
Edit: added a missing < character.

Regex: Replace Characters In-between Two Characters

I'm having trouble using Regex to replace strings that have a ? in between two characters. Two examples of what I'd like Regex to match are:
• Replace thi?s question mark but not this one?
• ? Replace the lonely question mark
What's the best way to:
a) Match a character surrounded by other characters
b) Match a character that is on its own and has no characters before it or after it
I'm using PHP preg_match and MySQL REGEXP to perform these pattern matches. For MySQL I've tried:
SELECT description
FROM locations
WHERE description
REGEXP '/|([^?]+)\/'
For PHP I've tried:
preg_match('/|([^?]+)\/', $string);

I suggest this one for PHP:
(?<!\w(?=\? ))\?(?!\s*$)\s*
(?!\s*$) is a negative lookahead that will prevent a ? from matching if it is at the end of a sentence (I added whitespaces just in case).
(?<!\w(?=\? )) is a little more complex. It will prevent a match if the ? is preceded by a \w character (typically read as [a-zA-Z0-9_]) and followed by a space.
regex101 demo
I don't know whether mysql supports lookbehinds though.
|([^?]+)\
This is your current regex and I don't think your PHP code runs. The \ at the end is not escaping anything (in fact, it's trying to escape the delimiter) so... :s
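To see how the suggested pattern behaves on the two examples from the question, it can be dropped into preg_replace(). This is just a sketch; the ~ delimiter and the variable names are choices made here, not part of the answer.

```php
<?php
// The pattern from the answer above, wrapped in ~ delimiters.
$pattern = '~(?<!\w(?=\? ))\?(?!\s*$)\s*~';

$samples = [
    'Replace thi?s question mark but not this one?',
    '? Replace the lonely question mark',
];

foreach ($samples as $sample) {
    // Remove each matched question mark together with any whitespace after it.
    echo preg_replace($pattern, '', $sample), "\n";
}
```

The question mark inside "thi?s" and the lonely one at the start are removed, while the one that ends the first sentence is kept.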

Check this Demo Code Viper
Pattern
/(\w+)?(\w+)/g
Test this Pattern
PHP
<?php
echo preg_replace("/(\w+)?(\w+)/i", "thi?s", "?");
?>
Result
?
Hope this helps!

PHP regex lookbehind with wildcard

I have two strings in PHP:
$string = '<a href="http://localhost/image1.jpeg" /></a>';
and
$string2 = '[caption id="attachment_5" align="alignnone" width="483"]<a href="http://localhost/image1.jpeg" /></a>[/caption]';
I'm trying to match strings of the first type. That is strings that are not surrounded by '[caption ... ]' and '[/caption]'. So far, I would like to use something like this:
$pattern = '/(?<!\[caption.*\])(?!\[\/caption\])(<a.*><img.*><\/a>)/';
but PHP matches out the first string as well with this pattern even though it is NOT preceded by '[caption' and zero or more characters followed by ']'. What gives? Why is this and what's the correct pattern?
Thanks.

Variable length look-behind is not supported in PHP, so this part of your pattern is not valid:
(?<!\[caption.*\])
It should be warning you about this.
In addition, .* always matches the largest possible amount. Thus your pattern may result in a match that overlaps multiple tags. Instead, use [^>] (match anything that is not a closing bracket), because closing brackets should not occur inside the img tag.
To solve the look-behind problem, why not just check for the closing tag only? This should be sufficient (assuming the caption tags are only used in a way similar to what you have shown).
$pattern = '|(<a[^>]*><img[^>]*></a>)(?!\[/caption\])|';
When matching patterns that contain /, use another character as the pattern delimiter to avoid leaning toothpick syndrome. You can use nearly any non-alphanumeric character around the pattern.
Update: the previous regex is based on the example regex you gave, rather than the example data. If you want to match links that don't contain images, do this:
$pattern = '|(<a[^>]*>[^<]*</a>)(?!\[/caption\])|';
Note that this doesn't allow any tags in the middle of the link. If you allow tags (such as by using .*?), a regex could match something starting within the [caption] and ending elsewhere.
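As a quick check, the second pattern can be run against the two strings from the question. This is a small sketch; only the way $string2 is split across two lines is new.

```php
<?php
// Pattern from the update above: a link without nested tags that is
// not immediately followed by [/caption].
$pattern = '|(<a[^>]*>[^<]*</a>)(?!\[/caption\])|';

$string  = '<a href="http://localhost/image1.jpeg" /></a>';
$string2 = '[caption id="attachment_5" align="alignnone" width="483"]'
         . '<a href="http://localhost/image1.jpeg" /></a>[/caption]';

var_dump(preg_match($pattern, $string));   // int(1): plain link matches
var_dump(preg_match($pattern, $string2));  // int(0): link inside a caption does not
```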

I don't see how your regexp could match either string, since you're looking for <a.*><img.*><\/a>, and both anchors don't contain an <img... tag. Also, the two subexpressions looking for and prohibiting the caption-bits look oddly positioned to me. Finally, you need to ensure your tag-matching bits don't act greedy, i.e. don't use .* but [^>]*.
Do you mean something like this?
$pattern = '/(<a[^>]*>(<img[^>]*>)?<\/a>)(?!\[\/caption\])/';
Test it on regex101.
Edit: Removed useless lookahead as per dan1111's suggestion and updated regex101 link.

Lookbehind doesn't allow variable-length patterns (i.e. the *, + and ? quantifiers). I think /<a.*><\/a>(?!\[\/caption\])/ is enough for your requirement.

regex: remove all text within "double-quotes" (multiline included)

I'm having a hard time removing text within double-quotes, especially those spread over multiple lines:
$file=file_get_contents('test.html');
$replaced = preg_replace('/"(\n.)+?"/m','', $file);
I want to remove ALL text within double-quotes (included). Some of the text within them will be spread over multiple lines.
I read that newlines can be \r\n and \n as well.

Try this expression:
"[^"]+"
Also make sure you replace globally (usually with a g flag - my PHP is rusty so check the docs).

Another edit: daalbert's solution is best: a quote followed by one or more non-quotes ending with a quote.
I would make one slight modification if you're parsing HTML: make it 0 or more non-quote characters...so the regex will be:
"[^"]*"
EDIT:
On second thought, here's a better one:
"[\S\s]*?"
This says: "a quote followed by either a non-whitespace character or white-space character any number of times, non-greedily, ending with a quote"
The one below uses capture groups when it isn't necessary...and the use of a wildcard here isn't explicit about showing that wildcard matches everything but the new-line char...so it's more clear to say: "either a non-whitespace char or whitespace char" :) -- not that it makes any difference in the result.
there are many regexes that can solve your problem but here's one:
"(.*?(\s)*?)*?"
this reads as:
find a quote optionally followed by: (any number of characters that are not new-line characters non-greedily, followed by any number of whitespace characters non-greedily), repeated any number of times non-greedily
Greedy means it will go to the end of the string and try matching it. If it can't find the match, it steps back one character from the end and tries to match again, and so on. So non-greedy means it will match as few characters as possible while still meeting the criteria.
great link on regex: http://www.regular-expressions.info
great link to test regexes: http://regexpal.com/
Remember that your regex may have to change slightly based on what language you're using to search using regex.

You can use single line mode (also known as dotall) and the dot will match even newlines (whatever they are):
/".+?"/s
You are using multiline mode, which simply changes the meaning of ^ and $ from beginning/end of string to beginning/end of each line. You don't need it here.

"[^"]+"

Something like below. s is dotall mode where . will match even newline:
/".+?"/s

$replaced = preg_replace('/"[^"]*"/s','', $file);
will do this for you. However, note that it won't allow for any escaped double quotes (e.g. A "test \" quoted string" B will result in A quoted string" B with a leading space, not in A B as you might expect).
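The variants suggested in these answers can be compared side by side. The sketch below uses a made-up sample string; the third pattern, which treats \" as an escaped quote, goes beyond what the answers propose and is only one possible way to handle that case.

```php
<?php
$text = "keep \"drop\nthis\" keep \"and this\" end";

// A negated character class matches newlines too, so no flag is needed.
$a = preg_replace('/"[^"]*"/', '', $text);

// With the s (dotall) flag the dot also matches newlines.
$b = preg_replace('/".*?"/s', '', $text);

// Assumption beyond the original answers: treat \" as an escaped quote
// so that it does not end the quoted span.
$c = preg_replace('/"(?:[^"\\\\]|\\\\.)*"/s', '', 'A "test \" quoted string" B');

var_dump($a === $b); // both variants remove the same spans
echo $c, "\n";
```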

rexexp solution for php

I have tried to work this out myself (even bought a Kindle book!), but I am struggling with backreferences in php.
What I want is like the following example:
$html = "hello %world|/worldlink/% again";
output:
hello world again
I tried stuff like:
preg_replace('/%([a-z]+)|([a-z]+)%/', '\1', $html);
but with no joy.
Any ideas please? I am sure someone will post the exact answer but I would like an explanation as well please - so that I don't have to keep asking these questions :)

The slashes "/" are not included in your allowed range [a-z]. Instead use
preg_replace('/%([a-z]+)\|([a-z\/]+)%/', '\1', $html);

Your expression:
'/%([a-z]+)|([a-z]+)%/'
Is only capturing one thing. The | in the middle means "OR". You're trying to capture both, so you don't need an OR in there. You want a literal | symbol so you need to escape it:
'/%([a-z]+)\|([a-z\/]+)%/'
The / character also needs to be included in your char set, and escaped as above.

Your regex (/%([a-z]+)|([a-z]+)%/) reads this way:
Match % followed by + (= one or more) a-z characters (and store this into backreference #1).
Or (the |):
Match + (= one or more) a-z characters (and store this into backreference #2) followed by a %.
What you are looking for is:
preg_replace('~%([a-z]+)[|]([a-z/]+)%~', '$1', $html);
Basically I just escaped the | regex meta character (you can do this by either surrounding it with [] like I did or just prepending a backslash \; personally I find the former easier to read), and added a / to the second capture group.
I also changed your delimiters from / to ~ because tildes are much less likely to appear in strings. If you want to keep using / as your delimiter, you also have to escape its occurrences in your regex.
It's also recommended that you use the $ syntax instead of \ in your replacement backreferences:
$replacement may contain references of the form \\n or (since PHP 4.0.4) $n, with the latter form being the preferred one.
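Putting the corrected pattern and the $n replacement syntax together gives something like the sketch below. The second call, which builds a link from both capture groups, is only an assumption about what the markup might be used for; the question itself asks only for the label.

```php
<?php
$html = 'hello %world|/worldlink/% again';

// [|] matches a literal pipe; ~ as delimiter avoids escaping the slashes.
$pattern = '~%([a-z]+)[|]([a-z/]+)%~';

// $1 is the label before the pipe, $2 the path after it.
$label_only = preg_replace($pattern, '$1', $html);
$as_link    = preg_replace($pattern, '<a href="$2">$1</a>', $html);

echo $label_only, "\n"; // hello world again
echo $as_link, "\n";    // hello <a href="/worldlink/">world</a> again
```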

Here is a version that works according to the data and information the OP provided (using a non-slash delimiter to avoid having to escape slashes):
preg_replace('#%([a-z]+)\|([a-z/]+)%#', '\1', $html);
Outputs:
hello world again
The Explanation
Why yours did not work: first, the | is an OR operator and, in your example, should be escaped. Second, since you are using or expecting slashes, it is better to use a non-slash delimiter, such as #. Third, the slash needed to be added to the list of allowed characters. You may also want to allow a few more characters, as any word containing numbers, underscores, periods or hyphens will fail to match. Hopefully that is the explanation you were looking for.

Here's what works for me:
preg_replace('/%([a-z]+)\|([a-z\/]+)%/', '\1', $html);

Your regular expression doesn't escape the |, and doesn't include the proper characters for the URL.
Here's a basic live example supporting only a-z and slashes:
preg_replace('/%([a-z]+)\|([a-z\/]+)%/', '\1', $html);
In reality, you're going to want to change those [a-z]+ blocks to something more expressive. Do some searches for URL-matching regular expressions, and pick one that fits what you want.

$html = "hello %world|/worldlink/% again";
echo preg_replace('/([A-Za-z_ ]*)%(.+)\|(.+)%([A-Za-z_ ]*)/', '$1$2$4', $html);
output:
hello world again
Here is working code: http://www.ideone.com/0qhZ8
