regex for preg_match not working - php

I need to scrape some data from a website. For that I am using preg_match, but I am not able to write the regex for it. The data on the website is
title="Russia"/></a>
<small>*</small> <a href="/profile/roman
I have written the regex as #title=\"Russia\"\/><\/a>((\n|\r)*)<small>*<\/small> <a href=\"/profile/(.+?)\"#sx
But this is not working and I don't know why. When I echo my regex, it prints #title="Russia"\/><\/a>(( | )*)*<\/small> . Where has the rest gone? And why is it not working?

Try this:
#title=\"Russia\"/></a>(\s*)<small>\*</small>\s+<a\s+href=\"/profile/(.+?)\"#sx
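I have escaped the * because it's a metacharacter. Without the escape, >* would match zero or more > characters after <small, so the literal asterisk in the markup would never match. I have also replaced the literal spaces with \s+, because the x modifier makes the engine ignore unescaped whitespace in the pattern.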
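A minimal sketch of how the pattern above could be used with preg_match. The sample markup is an assumption reconstructed from the question (the img element and the link text are made up), and the profile name ends up in the second capture group because (\s*) is the first:

```php
<?php
// Hypothetical sample markup, modelled on the snippet in the question.
$html = "<img title=\"Russia\"/></a>\n<small>*</small> <a href=\"/profile/roman\">roman</a>";

// Pattern from the answer: \* matches a literal asterisk, \s handles whitespace.
$pattern = '#title=\"Russia\"/></a>(\s*)<small>\*</small>\s+<a\s+href=\"/profile/(.+?)\"#sx';

$profile = null;
if (preg_match($pattern, $html, $m)) {
    $profile = $m[2]; // group 1 is the whitespace, group 2 is the profile name
}
echo $profile, "\n";
```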

You really should not use regexes to parse markup, especially when you acquire it by scraping pages.
In your case there are at least three things that might be breaking your regex.
Do not write your own whitespace matchers when you can simply use \s, which stands for "any whitespace character".
In regular expressions the asterisk (*) has a special meaning, which is why you can't simply use it to match a literal asterisk. If you want to capture the content of the small element, use <small>(.*)</small> instead. If, on the other hand, you are actually expecting an asterisk, you have to escape it like this: <small>\*</small>.
Your regex expects a closing quote for the href attribute on that last <a>, but your sample markup has none. Provided that the original page does have a closing quote, the following regex should do the trick.
#title=\"Russia\"\/><\/a>(\s*)<small>\*</small>\s+<a\s+href=\"/profile/(.+?)\"#sx
However, once again I have to advise using a DOM parser like DOMDocument for this, not only because it is much more reliable when handling markup but also because it can cope with bad markup (if it's loaded as HTML, of course).
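As a rough illustration of the DOMDocument suggestion, the sketch below loads a fragment as HTML and uses DOMXPath to find the profile link that follows the element titled "Russia". The sample markup, the XPath query, and the variable names are assumptions for illustration, not taken from the original page:

```php
<?php
// Hypothetical sample markup; the <img> element and hrefs are assumptions.
$html = '<a href="/country/ru"><img src="ru.png" title="Russia"/></a>' . "\n"
      . '<small>*</small> <a href="/profile/roman">roman</a>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// First <a> sibling after the "Russia" link whose href starts with /profile/.
$xpath = new DOMXPath($doc);
$query = '//*[@title="Russia"]/ancestor::a'
       . '/following-sibling::a[starts-with(@href, "/profile/")][1]';
$nodes = $xpath->query($query);

$profile = null;
if ($nodes !== false && $nodes->length > 0) {
    $href = $nodes->item(0)->getAttribute('href');
    $profile = substr($href, strlen('/profile/'));
}
echo $profile, "\n";
```

The query depends on the two links being siblings, so it would need adjusting if the real page nests them differently.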

Related

PHP: Setting missing " in HTML-code with preg_replace?

I have a database with a lot of user-made entries that has grown over about 10 years. The users had the option to put HTML code in their content, and some didn't do that well, so a lot of the content has missing quotes. I need valid HTML for an export/import via XML.
I tested a replacement, but my regex doesn't work. Do you have an idea where my mistake is?
$out=preg_replace("/<a href=h(.)*>/","<a href=\"h$1\">",$out);
PS: If you have an idea how to automatically correct invalid HTML source, that would be a great alternative.
I think you wanted to use "/<a href=h(.*)>/" (mind the star inside the parentheses) since you want to capture all characters after the h and before the > inside the capture group.
You can also use <a href=([^"].*)> since the href may not start with h. This regex captures all href values that do not start with ".
Yet, all of these assume that the href is the last attribute in your a, i.e., that it ends with >.
As a more general rule, I came up with (?<key>\w*)\s*=\s*(?<value>[^"][^\s>]*) that finds attribute-value pairs, separated by =. The values may not start with ", and they go until the next whitespace or >. Use this with caution, since it may fail in several circumstances: multi-line HTML, inline JavaScript, etc.
Whether it is a good idea to use RegEx for such a task is a different discussion.
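A small sketch of the href repair discussed above, under the same assumption that href is the last attribute of the tag. The pattern is a slightly narrower variant of the ones in the answer (it stops at whitespace or > and also skips single-quoted values), and the sample input is made up:

```php
<?php
// Hypothetical input: one unquoted href and one that is already quoted.
$out = '<p><a href=http://example.com/page>old</a> and '
     . '<a href="http://example.com/ok">fine</a></p>';

// Capture an unquoted value: first character is not a quote, whitespace or >,
// then everything up to the next whitespace or >.
$fixed = preg_replace('/<a href=([^"\'\s>][^\s>]*)>/', '<a href="$1">', $out);

echo $fixed, "\n";
```

Links that are already quoted are left alone because the first character class rejects a leading quote.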

Finding substring whilst ignoring HTML tags

I need to match parts of a string while ignoring HTML tags. For example, suppose the user looks for the string "foo and foo1" in this source code:
Two strings, <u>foo</u> and foo1
They would not get a match, because of the tags.
I've tried regex, but since the tags may or may not be there, it seems rather too complicated.
It's not a server-side script. It will be an application run from the console.
To be more specific: it is for syntax highlighting. The user wants "foo and foo1" to be italic, but part of it is already underlined and so wouldn't match. That's why I can't strip the tags from the string.
Use the PHP function strip_tags to remove the HTML tags from the text. Then do your search.
http://php.net/manual/en/function.strip-tags.php
Use strip_tags as you have been advised; it is really the best way. However, if you want to have fun or experiment and benchmark your regex engine :) you can insert (?:<\/?[^>]+>)? at the very beginning of the query (otherwise the opening tag won't be captured) and after each character of the query, and you will have a match.
Here is an example for "foo and foo1":
(?:<\/?[^>]+>)?f(?:<\/?[^>]+>)?o(?:<\/?[^>]+>)?o(?:<\/?[^>]+>)? (?:<\/?[^>]+>)?a(?:<\/?[^>]+>)?n(?:<\/?[^>]+>)?d(?:<\/?[^>]+>)? (?:<\/?[^>]+>)?f(?:<\/?[^>]+>)?o(?:<\/?[^>]+>)?o(?:<\/?[^>]+>)?1(?:<\/?[^>]+>)?
This will match <u>foo</u> and foo1.
https://regex101.com/r/aF8fJ8/4
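Writing that pattern by hand is tedious, so here is a sketch of how it might be generated from the query. The helper name buildTagTolerantPattern is hypothetical, and the code assumes a single-byte (ASCII) query, since str_split works on bytes:

```php
<?php
// Hypothetical helper: put an optional-tag group before, between and after
// every character of the query.
function buildTagTolerantPattern(string $query): string
{
    $tag = '(?:<\/?[^>]+>)?';
    $chars = array_map(
        fn (string $ch): string => preg_quote($ch, '/'),
        str_split($query) // assumes a single-byte (ASCII) query
    );
    return '/' . $tag . implode($tag, $chars) . $tag . '/';
}

$source  = 'Two strings, <u>foo</u> and foo1';
$pattern = buildTagTolerantPattern('foo and foo1');

preg_match($pattern, $source, $m);
$matched = $m[0] ?? null;

// Wrap the whole match, tags included, in <i> for highlighting.
$highlighted = preg_replace($pattern, '<i>$0</i>', $source);

echo $matched, "\n", $highlighted, "\n";
```

Because the match includes any tags it spans, wrapping it can produce badly nested markup when a tag opens inside the match and closes outside it.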
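This regex is meant to ignore the <> and slashes in HTML tags and extract only words. Note that the lookahead only rejects a match that starts at a <, so tag names such as the u in <u> are still matched.
(?!</?[^>]+>)([a-zA-Z]+)
Just replace the [a-zA-Z]+ with what you want to match.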

PHP regex: match a string between two HTML tags when the tags are unknown

Ok, so here's my issue:
I have a link, say: http://www.blablabla.com/watch?v=1lyu1KKwC74&feature=list_other&playnext=1&list=AL94UKMTqg-9CfMhPFKXPXcvJ_j65v7UuV
And the link is between two tags say like this:
<br>http://www.blablabla.com/watch?v=1lyu1KKwC74&feature=list_other&playnext=1&list=AL94UKMTqg-9CfMhPFKXPXcvJ_j65v7UuV<br></p>
Using this regex with preg_replace:
'#(^|[^\/]|[^>])('.addcslashes($link,'.?+').')([^\w\/]|[^<]$)#i'
As such:
preg_replace('#(^|[^\/]|[^>])('.addcslashes($link,'.?+').')([^\w\/]|[^<]$)#i', "***",$strText);
The resulting string is:
<br***p>
Which is wrong!!
It should have been
<br>***<br></p>
How can I get the desired result? I have racked my brain trying to solve this one.
I would like to mention that str_replace replaces the link even when it is part of another valid link, so it's not a good method. I need an exact match between two boundaries, whether the boundary is text or another HTML tag.
Assuming you don't want to use a DOM parser for some reason, I believe doing what you intended is as simple as the following:
preg_replace('#(^|[^\/]|[^>])('.addcslashes($link,'.?+').')([^\w\/]|[^<]$)#i', "$1***$3",$strText);
This uses $1 and $3 to put back the delimiting text you matched in your regular expression.
As others have pointed out, using a DOM parser is more reliable.
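A simplified sketch of the backreference idea. It swaps addcslashes for preg_quote (which escapes every regex metacharacter and, given the second argument, the # delimiter) and drops the redundant alternatives from the boundary groups; the shortened link and the variable names are assumptions:

```php
<?php
// Shortened version of the link from the question.
$link    = 'http://www.blablabla.com/watch?v=1lyu1KKwC74&feature=list_other';
$strText = '<br>' . $link . '<br></p>';

// Group 1: start of string or any character except "/".
// Group 3: a character that cannot continue a URL path, or end of string.
$pattern = '#(^|[^\/])(' . preg_quote($link, '#') . ')([^\w\/]|$)#i';

// $1 and $3 put the matched boundary characters back around the replacement.
$result = preg_replace($pattern, '$1***$3', $strText);

// The same link embedded in a longer URL is left untouched.
$longer    = '<br>' . $link . '/more<br>';
$untouched = preg_replace($pattern, '$1***$3', $longer);

echo $result, "\n", $untouched, "\n";
```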
Does this do what you want?

Regular Expression formatting help required

I am trying to remove a part of a document on the fly using preg_replace().
/* target example:
<li id="footer-poweredbyico">
<img src="//bits.wikimedia.org/skins-1.18/common/images/poweredby_mediawiki_88x31.png" alt="Powered by MediaWiki" width="88" height="31" />
</li>
*/
$reg = preg_quote('<li id="footer-poweredbyico">.*?</li>');
preg_replace($reg,"",$str);
Ignore any errors in PHP, this question is about how to format the regular expression correctly to remove anything matching the target example opening and closing tags. The contents of the containing HTML tags will be different each time, hence .*? (I think that's wrong).
The preg_quote function actually does the opposite of what you want: its purpose is to disable all regex-features in a string. So in your case, what you currently have is (roughly) looking for an actual .*? in your HTML, instead of looking for zero or more characters. What you want is:
$str = preg_replace('/<li id="footer-poweredbyico">.*?<\/li>/s', '', $str);
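To see the effect described above, a tiny sketch that prints what preg_quote does to the wildcard part of the pattern (the variable name is just for illustration):

```php
<?php
// preg_quote escapes every regex metacharacter, so ".*?" becomes literal text.
$quoted = preg_quote('.*?');

echo $quoted, "\n";
```

Each of the three characters comes back with a backslash in front of it, which is why the quoted pattern only matches the literal text .*? in the HTML.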
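The .*? portion of your regex is being escaped, so it is treated as literal text instead of a wildcard. Quote only the literal parts, and add delimiters and the s modifier so that . also matches newlines:
$reg = '/' . preg_quote('<li id="footer-poweredbyico">', '/') . '.*?' . preg_quote('</li>', '/') . '/s';
$str = preg_replace($reg, '', $str);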
You don't need this hack. Read the FAQ:
"How can I edit / remove the Powered by MediaWiki image in the footer?"
preg_quote() will disable all the special characters you used, like .*?.
Try something like:
preg_replace('#<li id="footer-poweredbyico">.*?</li>#s', '', $str);
Now, the difficult question is whether to make this regex "greedy". Right now, it's ungreedy, which means it will break your page if there's another <li> inside the one you're trying to remove. But if you make it greedy, it will remove everything from the beginning of the <li> tag until the end of the last <li> element in the page, even if it's a different <li> element. Neither is ideal. This is why a proper HTML parser usually does a better job at manipulating HTML.
But if the page is simple enough, a regex will work.
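The difference between the two modes can be seen on a small made-up list. This is only a sketch; the markup is a hypothetical stand-in for the real footer:

```php
<?php
// Hypothetical footer with the target <li> followed by another <li>.
$str = '<ul><li id="footer-poweredbyico"><img src="x.png" /></li>'
     . '<li id="other">keep</li></ul>';

// Ungreedy: stops at the first </li> after the opening tag.
$lazy = preg_replace('#<li id="footer-poweredbyico">.*?</li>#s', '', $str);

// Greedy: runs on to the last </li> in the input and removes the second item too.
$greedy = preg_replace('#<li id="footer-poweredbyico">.*</li>#s', '', $str);

echo $lazy, "\n", $greedy, "\n";
```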
EDIT: Corrected a gross error, thanks to @Nilpo.

matching html attributes with regex in php

I'm trying to make an expression that will search through a page like how2bypass.co.cc and return the contents of the "action" attribute in the "form" tag, and the contents of the "name" and "type" attributes in any input tags. I can't use an html parser because my ultimate goal is to automatically detect if a given page is a web proxy, and once sites catch on that I'm doing that they're probably going to start doing silly things like writing the entire document with javascript to stop me from parsing it.
I'm using the code
preg_match_all('/<form.*action\="(.*?)".*>[^<]*<input.*type\=/i', $pageContents, $inputMatches);
which works fine for the action attribute, but once I put a " after type\= the code stops working. Why is this? Why does it work fine once, but not twice?
Regular expression quantifiers are greedy by default...
If you inspect the page source, the following is probably matching the first <input with the last type=, and capturing everything in between.
`<input.*type\=`
You're not going to be able to capture the form and all inputs with your current expression because not every input is prefixed with the form markup. You need to approach it one of the following ways:
Capture the entire form markup, <form>...</form>, and then use a second regex to match all the inputs in the capture.
Adjust your current expression to be non-greedy, .*?, and allow for multiple captures of input markup.
Without seeing the target page that you want to extract from, there are only a few things to guess:
The type= attribute might not have double quotes, as type=text is valid too. Or it might have single quotes instead, or some whitespace around the =.
The .* placeholders might fail if there are newlines between or within the tags. Using the /s regex flag is advisable.
And it's usually more reliable to use negated character classes like [^<>]* or [^"] instead of .* anyway.
You don't need to escape the \= equal sign.
And maybe you should split it up. Use one regex to extract the <form>..</form> block. And then search for the <input> tags within.
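The split-it-up suggestion could look roughly like the sketch below. The page source is hypothetical, only double-quoted attribute values are handled, and negated character classes replace .* inside the tags as recommended above:

```php
<?php
// Hypothetical page source; only double-quoted attribute values are handled.
$pageContents = '<form method="post" action="/browse.php">'
              . '<input type="text" name="u">'
              . '<input type="submit" name="go" value="Go">'
              . '</form>';

$action = null;
$inputs = [];

// Step 1: capture the action attribute and the body of the form.
$formRe = '/<form\b[^>]*\baction\s*=\s*"([^"]*)"[^>]*>(.*?)<\/form>/is';
if (preg_match($formRe, $pageContents, $form)) {
    $action = $form[1];

    // Step 2: find each <input> tag inside the captured body.
    preg_match_all('/<input\b[^>]*>/i', $form[2], $tags);
    foreach ($tags[0] as $tag) {
        // Step 3: read name and type from each tag, in whatever order they appear.
        $name = preg_match('/\bname\s*=\s*"([^"]*)"/i', $tag, $m) ? $m[1] : null;
        $type = preg_match('/\btype\s*=\s*"([^"]*)"/i', $tag, $m) ? $m[1] : null;
        $inputs[] = ['name' => $name, 'type' => $type];
    }
}

echo $action, "\n";
print_r($inputs);
```

Handling unquoted or single-quoted values would need extra alternatives in the attribute patterns.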
