I am trying to parse a string of HTML tag attributes in PHP. There can be 3 cases:
attribute="value" //inside the quotes there can be everything also other escaped quotes
attribute //without the value
attribute=value //without quotes so there are only alphanumeric characters
Can someone help me find a regex that captures the attribute name in the first group and the attribute value (if it's present) in the second?
Never ever use regular expressions for processing HTML, especially if you're writing a library and don't know what your input will look like. Take a look at SimpleXML, for example.
Give this a try and see if it is what you want to extract from the tags.
preg_match_all('/( \\w{1,}="\\w{1,}"| \\w{1,}=\\w{1,}| \\w{1,})/i',
$content,
$result,
PREG_PATTERN_ORDER);
$result = $result[0];
The regex pulls each attribute, excludes the tag name, and puts the results in an array so you will be able to loop over the first and second attributes.
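For the three cases in the question, a pattern with one capture group for the name and alternatives for the value might look like the sketch below. The function name parse_attributes is hypothetical, the pattern assumes that quotes inside a double-quoted value are escaped with a backslash, and PREG_UNMATCHED_AS_NULL needs PHP 7.2 or later.

```php
<?php
// Hypothetical helper: maps each attribute name to its value,
// or to null when the attribute has no value.
function parse_attributes(string $attrs): array
{
    // Group 1: attribute name.
    // Group 2: double-quoted value (backslash-escaped characters allowed).
    // Group 3: unquoted value.
    // The whole "=value" part is optional.
    $pattern = '/([\w:-]+)(?:\s*=\s*(?:"((?:[^"\\\\]|\\\\.)*)"|([^\s"\'>]+)))?/';
    preg_match_all($pattern, $attrs, $matches, PREG_SET_ORDER | PREG_UNMATCHED_AS_NULL);

    $result = [];
    foreach ($matches as $m) {
        // Prefer the unquoted value, then the quoted one, else null.
        $result[$m[1]] = $m[3] ?? $m[2] ?? null;
    }
    return $result;
}

print_r(parse_attributes('class="a \"b\" c" disabled id=main'));
```

Escaped quotes are returned as written (the backslashes are not removed), and single-quoted values are not handled by this sketch.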
I have the following string inside the source of some website:
user_count: <b>5.122.512</b>
Is it possible to get the number out of this string even if the tags around the number were different? I mean, the "user_count:" part won't change, but the tags can change, to strong for example. Or the tags could be doubled, or whatever.
How can I do that?
You can use
user_count:\s*<.*?>(.*?)<.*?>
See DEMO
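Note that with doubled tags such as <b><i>...</i></b>, the lazy group in the pattern above captures an empty string, because the second <.*?> can match the inner opening tag. A variant that skips any run of tags after the label could be sketched as follows; the function name extract_user_count is hypothetical, and the sketch assumes the number contains no markup of its own.

```php
<?php
// Hypothetical helper: returns the text that follows "user_count:",
// ignoring however many tags are wrapped around it.
function extract_user_count(string $html): ?string
{
    // Skip zero or more tags after the label, then capture up to the next "<".
    if (preg_match('/user_count:\s*(?:<[^>]*>\s*)*([^<]+)/', $html, $m)) {
        return trim($m[1]);
    }
    return null;
}

echo extract_user_count('user_count: <strong><i>5.122.512</i></strong>');
```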
I'd imagine you have to use JS to extract the content between the tags <b>5.122.512</b> from the DOM.
If you can assign an ID to the element, you can probably use document.getElementById('NAME_OF_YOUR_ID').innerHTML; to extract the number inside it. If you need to process this inside a PHP script, you would probably need to POST it back to the server.
There are a couple of ways to get the number out of the string. One would be just to strip the tags and run a regular expression.
$s = "user_count: <b>5.122.512</b>";
preg_match_all("#user_count: (.+)#", strip_tags($s), $matches);
print_r($matches);
$matches[1][0] should contain the number.
I need to do some cleanup on strings that look like this:
$author_name = '<a href="http://en.wikipedia.org/wiki/Robert_Jones_Burdette>Robert Jones Burdette </a>';
Notice the href attribute doesn't have a closing quote - I'm using the DOMParser on a large table of these to extract the text, and it borks on this.
I would like to look at the string in $author_name;
IF the first > does NOT have a " before it, replace it with "> to close the tag correctly. If it is okay, just skip and do the next step. Be sure not to replace the second > at all.
Using php regex, I haven't been able to find a working solution - I could chop up the whole thing and check its parts, but that would be slow and I think there must be a regex that can do what I want.
TIA
What you can do is find the first >, with or without a double quote (") before it, and replace it with ">:
$author_name = preg_replace('/(.+?)"?>(.+?)/', '$1">$2', $author_name);
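A stricter variant anchors the pattern to the opening <a href="..."> tag and limits preg_replace to one replacement, so a later > can never be touched. This is only a sketch: the function name close_href_quote is hypothetical, and it assumes every string starts with <a href=" as in the question.

```php
<?php
// Hypothetical helper: adds the missing closing quote to the href
// of a leading <a> tag, and leaves well-formed tags unchanged.
function close_href_quote(string $html): string
{
    // Group 1: everything up to the end of the URL.
    // "? consumes the closing quote if it is already there.
    $fixed = preg_replace('/^(<a\s+href="[^">]*)"?>/', '$1">', $html, 1);
    return $fixed ?? $html; // preg_replace returns null on error
}

$author_name = '<a href="http://en.wikipedia.org/wiki/Robert_Jones_Burdette>Robert Jones Burdette </a>';
echo close_href_quote($author_name);
```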
http://www.barattalo.it/html-fixer/
Download that, then include it in your php.
The rest is quite easy:
$dirty_html = ".....bad html here......";
$a = new HtmlFixer();
$clean_html = $a->getFixedHtml($dirty_html);
It's common for people to want to use regular expressions, but you must remember that HTML is not a regular language, so a regular expression cannot parse it reliably in the general case.
I am using curl to fetch a large string of text, and one of three things can happen. The string could contain:
a div with a unique name inside it, for example "class=\"asl bwd asd\">{Valid user}\u003C\/div>"
"The email you entered does not exist"
a div with a unique name inside it, for example "class=\"asl bwd asd\">{UNIQUE STRING}\u003C\/div>"
Could someone help me write 3 separate preg matches so I can then do something if one of the three strings is found? The string will never contain more than one of the three.
Do not try to parse XML or HTML with Regular Expressions. Neither is fully expressible using RegEx.
Use the XML parser functions of PHP instead.
Or something like PHPQuery (I just found that one, I like the idea)
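As a rough illustration of the parser route, the sketch below uses PHP's DOMDocument and DOMXPath to look up a div by its exact class attribute. The function name find_div_text is hypothetical, and the sketch assumes the response has already been decoded into plain HTML (the \u003C and \/ escapes in the question suggest it may arrive JSON-encoded) and that the class value contains no double quotes.

```php
<?php
// Hypothetical helper: returns the text of the first <div> whose class
// attribute equals $class exactly, or null if there is none.
function find_div_text(string $html, string $class): ?string
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query(sprintf('//div[@class="%s"]', $class));
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

echo find_div_text('<p>x</p><div class="asl bwd asd">Valid user</div>', 'asl bwd asd');
```

The fixed message "The email you entered does not exist" needs no parser at all; a plain strpos check on the response would be enough for that case.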
Ok, so here's my issue:
I have a link, say: http://www.blablabla.com/watch?v=1lyu1KKwC74&feature=list_other&playnext=1&list=AL94UKMTqg-9CfMhPFKXPXcvJ_j65v7UuV
And the link is between two tags say like this:
<br>http://www.blablabla.com/watch?v=1lyu1KKwC74&feature=list_other&playnext=1&list=AL94UKMTqg-9CfMhPFKXPXcvJ_j65v7UuV<br></p>
Using this regex with preg_replace:
'#(^|[^\/]|[^>])('.addcslashes($link,'.?+').')([^\w\/]|[^<]$)#i'
As such:
preg_replace('#(^|[^\/]|[^>])('.addcslashes($link,'.?+').')([^\w\/]|[^<]$)#i', "***",$strText);
The resulting string is:
<br***p>
Which is wrong!!
It should have been
<br>***<br></p>
How can I get the desired result? I have been banging my head against the wall trying to solve this one.
I would like to mention that str_replace replaces even the link within another valid link, so it's not a good method, I need an exact match between two boundaries, even if the boundary is text or another HTML tag.
Assuming you don't want to use a DOM parser for some reason, I believe doing what you intended is as simple as the following:
preg_replace('#(^|[^\/]|[^>])('.addcslashes($link,'.?+').')([^\w\/]|[^<]$)#i', "$1***$3",$strText);
This uses $1 and $3 to put back the delimiting text you matched in your regular expression.
As others have pointed out, using a DOM parser is more reliable.
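If the goal is to replace the link only when it is not part of a longer URL, lookarounds avoid consuming the boundary characters in the first place, and preg_quote escapes every regex metacharacter in the link rather than just . ? and +. The sketch below is one way to do it; the function name mask_link and the exact sets of boundary characters are assumptions, not something taken from the question.

```php
<?php
// Hypothetical helper: replaces $link with $mask only where the link
// is not embedded in a longer URL.
function mask_link(string $text, string $link, string $mask = '***'): string
{
    // (?<!...) : the link must not be preceded by a URL-like character.
    // (?!...)  : the link must not be followed by one either.
    $pattern = '#(?<![\w/:.])' . preg_quote($link, '#') . '(?![\w/?&=%\#-])#i';
    return preg_replace($pattern, $mask, $text) ?? $text;
}

$link = 'http://www.blablabla.com/watch?v=1lyu1KKwC74';
echo mask_link('<br>' . $link . '<br></p>', $link);
```

Because lookarounds are zero-width, no $1 or $3 back-references are needed in the replacement.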
Does this do what you want?
I'm trying to make an expression that will search through a page like how2bypass.co.cc and return the contents of the "action" attribute in the "form" tag, and the contents of the "name" and "type" attributes in any input tags. I can't use an HTML parser because my ultimate goal is to automatically detect if a given page is a web proxy, and once sites catch on that I'm doing that they're probably going to start doing silly things like writing the entire document with JavaScript to stop me from parsing it.
I'm using the code
preg_match_all('/<form.*action\="(.*?)".*>[^<]*<input.*type\=/i', $pageContents, $inputMatches);
which works fine for the action attribute, but once I put a " after type\= the code stops working. Why is this? It works fine once, but not twice?
Regular expression quantifiers such as .* are greedy by default: they match as much text as they can.
If you inspect the page source, the following is probably matching from the first <input to the last type=, and capturing everything in between.
`<input.*type\=`
You're not going to be able to capture the form and all inputs with your current expression because not every input is prefixed with the form markup. You need to approach it one of the following ways:
Capture the entire form markup, <form>...</form>, and then a regex to match all the inputs in the capture
Adjust your current expression to be non-greedy, .*?, and allow for multiple captures of input markup.
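The difference between the greedy and the non-greedy quantifier can be seen on a small made-up input (the $html string below is an assumption for illustration, not the target page):

```php
<?php
$html = '<input type="text"><input type="hidden">';

// Greedy: .* runs to the LAST "type=" in the string.
preg_match('/<input.*type=/', $html, $greedy);

// Non-greedy: .*? stops at the FIRST "type=" it can reach.
preg_match('/<input.*?type=/', $html, $lazy);

echo $greedy[0], "\n"; // <input type="text"><input type=
echo $lazy[0], "\n";   // <input type=
```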
Without seeing the target page that you want to extract from, there are only a few things to guess:
The type= attribute might not have double quotes, as type=text is valid too. Or it might have single quotes instead, or some whitespace around the =.
The .* placeholders might fail if there are newlines between or within the tags. Using the /s regex flag is advisable.
And it's usually more reliable to use negated character classes like [^<>]* or [^"] instead of .* anyway.
You don't need to escape the \= equal sign.
And maybe you should split it up. Use one regex to extract the <form>..</form> block. And then search for the <input> tags within.
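Putting these points together, a two-step extraction might look like the sketch below: first isolate the <form>...</form> block, then look at each <input> tag inside it. The function names attribute_value and extract_form_fields are hypothetical, the sketch handles only the first form on the page, and PREG_UNMATCHED_AS_NULL needs PHP 7.2 or later.

```php
<?php
// Hypothetical helper: reads one attribute from a single tag, accepting
// double-quoted, single-quoted, and unquoted values.
function attribute_value(string $tag, string $name): ?string
{
    $pattern = '/(?<![\w-])' . preg_quote($name, '/')
        . '\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s"\'>]+))/i';
    if (!preg_match($pattern, $tag, $m, PREG_UNMATCHED_AS_NULL)) {
        return null;
    }
    return $m[1] ?? $m[2] ?? $m[3] ?? null;
}

// Hypothetical helper: returns the action of the first form and the
// name/type of each input inside it, or null if there is no form.
function extract_form_fields(string $html): ?array
{
    // Step 1: the opening <form ...> tag and the form body (/s lets . span newlines).
    if (!preg_match('/(<form\b[^>]*>)(.*?)<\/form>/is', $html, $form)) {
        return null;
    }
    // Step 2: every <input ...> tag inside the body.
    preg_match_all('/<input\b[^>]*>/i', $form[2], $inputs);

    $fields = [];
    foreach ($inputs[0] as $tag) {
        $fields[] = [
            'name' => attribute_value($tag, 'name'),
            'type' => attribute_value($tag, 'type'),
        ];
    }
    return ['action' => attribute_value($form[1], 'action'), 'inputs' => $fields];
}

$page = "<form method=post action=\"/go.php\">\n<input type=\"text\" name=u>\n<input name='go' type=submit>\n</form>";
print_r(extract_form_fields($page));
```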