What's wrong with my PHP regex? - php

I'm trying to pull a specific link from a feed where all of the content is on one line and multiple links are present. The one I want has "[link]" as the content of the A tag. Here's my example:
test1 test2 [link] test3test4
... could be more links before and/or after
How do I isolate just the href with the content "[link]"?
This regex goes to the correct end of the block I want, but starts at the first link:
(?<=href\=\").*?(?=\[link\])
Any help would be greatly appreciated! Thanks.

Try this updated regex:
(?<=href\=\")[^<]*?(?=\">\[link\])
See demo.
The problem is that the dot matches too many characters, including those of earlier tags. To get the right href, restrict it to [^<]*?, which cannot cross into another tag.
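To make the difference concrete, here is a minimal sketch comparing the two patterns. The feed string is an assumption: only the amazingpage.com URL appears in this thread, and the example.com links are placeholders.

```php
<?php
// Hypothetical one-line feed; the example.com URLs are placeholders.
$feed = '<a href="http://example.com/a">test1</a> '
      . '<a href="http://www.amazingpage.com/">[link]</a> '
      . '<a href="http://example.com/b">test3</a>';

// Original pattern: the dot may run across tags, so the match
// starts at the first href in the string.
preg_match('/(?<=href=").*?(?=\[link\])/', $feed, $loose);

// Restricted pattern: [^<] cannot cross a tag boundary, so only the
// href directly in front of ">[link] can match.
preg_match('/(?<=href=")[^<]*?(?=">\[link\])/', $feed, $tight);

echo $loose[0], PHP_EOL; // starts with http://example.com/a
echo $tight[0], PHP_EOL; // http://www.amazingpage.com/
```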

Alternatively :)
This code :
$string = 'test1 test2 <a href="http://www.amazingpage.com/">[link]</a> test3test4';
$regex = '/href="([^"]*)">\[link\]/i';
$result = preg_match($regex, $string, $matches);
var_dump($matches);
Will return :
array(2) {
[0] =>
string(41) "href="http://www.amazingpage.com/">[link]"
[1] =>
string(27) "http://www.amazingpage.com/"
}

You can avoid regular expressions and use DOM to do this instead.
$doc = new DOMDocument;
$doc->loadHTML('
test1
test2
[link]
test3
test4
');
foreach ($doc->getElementsByTagName('a') as $link) {
if ($link->nodeValue == '[link]') {
echo $link->getAttribute('href');
}
}

With DOMDocument and XPath:
$dom = new DOMDocument;
$dom->loadHTML($yourHTML);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//a[. = "[link]"]/@href') as $node) {
echo $node->nodeValue;
}
or if you are looking for only one result:
$dom = new DOMDocument;
$dom->loadHTML($yourHTML);
$xpath = new DOMXPath($dom);
// the parentheses make [1] select the first match in the whole document
$nodeList = $xpath->query('(//a[. = "[link]"])[1]/@href');
if ($nodeList->length)
echo $nodeList->item(0)->nodeValue;
xpath query details:
//a # 'a' tag everywhere in the DOM tree
[. = "[link]"] # (condition) which has "[link]" as value
/@href # "href" attribute
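As a self-contained sketch of the XPath approach, the snippet below runs the query against a small sample document. The markup in $html is an assumption made for illustration (the original feed is not shown in full); only the amazingpage.com URL comes from this thread.

```php
<?php
// Hypothetical sample markup; the example.com link is a placeholder.
$html = '<p><a href="http://example.com/a">test1</a> '
      . '<a href="http://www.amazingpage.com/">[link]</a></p>';

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Collect the href of every anchor whose text is exactly "[link]".
$hrefs = [];
foreach ($xpath->query('//a[. = "[link]"]/@href') as $attr) {
    $hrefs[] = $attr->nodeValue;
}

print_r($hrefs);
```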
The reason your regex pattern doesn't work:
The regex engine walks from left to right and tries to succeed at each position in the string. So even with a non-greedy quantifier, you always get the leftmost match: laziness only shortens where the match ends, not where it starts.

Related

Matching string without specific pattern between specific places

$example_string = '<a class="190"><br>hello.. 8/10<br><a class="154"><br>9/10<br>';
What I need to match is the classes and the "rating" part (8/10).
Something like this, except I don't know how to write (ANYTHING EXCEPT <br> here) in a regexp:
preg_match_all('#class="([0-9]{3})"><br>(ANYTHING EXCEPT <br> here)*?([0-9]/10)#',
$example_string, $matches);
So a preg_match_all should give these results:
$matches[1][1] = '190';
$matches[1][2] = '8/10';
$matches[2][1] = '154';
$matches[2][2] = '9/10';
To work from your pattern and answer your question:
class="([0-9]{3})"><br>(?:(?!<br>).)*?([0-9]\/10)
Demo
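A minimal PHP sketch of the pattern above, using the question's sample string. It swaps the delimiter to # so the slash in "/10" needs no escaping, and uses PREG_SET_ORDER so each match arrives as one row; both are choices made here, not part of the original answer.

```php
<?php
$example_string = '<a class="190"><br>hello.. 8/10<br><a class="154"><br>9/10<br>';

// (?:(?!<br>).)*? consumes one character at a time, but only while
// the upcoming text is not "<br>".
$pattern = '#class="([0-9]{3})"><br>(?:(?!<br>).)*?([0-9]/10)#';

preg_match_all($pattern, $example_string, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    echo $m[1], ' => ', $m[2], PHP_EOL;
}
// 190 => 8/10
// 154 => 9/10
```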
I don't know PHP, but it should work as it does in Python.
Get the matches between "classes", then iterate to extract your data from the returned matched strings:
import re # the regex module
example_string = '"<a class="190"><br>hello.. 8/10<br><a class="154"><br>9/10<br>"'
for match in re.findall(r'(?:class[^\d]")([^\/]+)(?!class)', example_string):
print(list(re.findall(r'(\d+)', match)))
yields the following lists:
['190', '8']
['154', '9']
A simple DOM parser would be able to give you that information:
$example_string = '<a class="190"><br>hello.. 8/10<br><a class="154"><br>9/10<br>';
$dom = new DOMDocument;
$dom->loadHTML($example_string);
$xpath = new DOMXPath($dom);
// get all text nodes that have an anchor parent with a class attribute
$query = '//text()[parent::a[@class]]';
foreach ($xpath->query($query) as $node) {
echo $node->textContent, "\n";
echo "parent node: ", $node->parentNode->getAttribute('class'), "\n";
}
Output
hello.. 8/10
parent node: 190
9/10
parent node: 154
(?<=class=")(\d+)|(\d+\/\d+)
Try this. See demo.
https://regex101.com/r/yR3mM3/58
$re = "/(?<=class=\")(\\d+)|(\\d+\\/\\d+)/";
$str = "<a class=\"190\"><br>hello.. 8/10<br><a class=\"154\"><br>9/10<br>";
preg_match_all($re, $str, $matches);

Extract attribute values from xPath query PHP [duplicate]

Trying to find the links on a page.
my regex is:
/<a\s[^>]*href=(\"\'??)([^\"\' >]*?)[^>]*>(.*)<\/a>/
but seems to fail at
<a title="this" href="that">what?</a>
How would I change my regex to deal with href not placed first in the a tag?
Reliable regexes for HTML are difficult to write. Here is how to do it with DOM:
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $node) {
echo $dom->saveHtml($node), PHP_EOL;
}
The above would find and output the "outerHTML" of all A elements in the $html string.
To get all the text values of the node, you do
echo $node->nodeValue;
To check if the href attribute exists you can do
echo $node->hasAttribute( 'href' );
To get the href attribute you'd do
echo $node->getAttribute( 'href' );
To change the href attribute you'd do
$node->setAttribute('href', 'something else');
To remove the href attribute you'd do
$node->removeAttribute('href');
You can also query for the href attribute directly with XPath
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//a/@href');
foreach($nodes as $href) {
echo $href->nodeValue; // echo current attribute value
$href->nodeValue = 'new value'; // set new attribute value
$href->parentNode->removeAttribute('href'); // remove attribute
}
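Putting the DOM calls above together, here is a small sketch that collects the href and text of each link regardless of where href sits among the attributes. The second anchor in $html is a made-up addition to show that anchors without href are skipped.

```php
<?php
// First anchor is the question's sample; the second is a hypothetical
// anchor without an href attribute.
$html = '<a title="this" href="that">what?</a><a title="no link">plain</a>';

$dom = new DOMDocument;
$dom->loadHTML($html);

$links = [];
foreach ($dom->getElementsByTagName('a') as $node) {
    // Attribute order does not matter to the parser.
    if ($node->hasAttribute('href')) {
        $links[] = [
            'href' => $node->getAttribute('href'),
            'text' => $node->nodeValue,
        ];
    }
}

print_r($links);
```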
Also see:
Best methods to parse HTML
DOMDocument in php
On a sidenote: I am sure this is a duplicate and you can find the answer somewhere in here
I agree with Gordon, you MUST use an HTML parser to parse HTML. But if you really want a regex you can try this one :
/^<a.*?href=(["\'])(.*?)\1.*$/
This matches <a at the beginning of the string, followed by any number of any character (non-greedy, .*?), then href= followed by the link surrounded by either " or '.
$str = '<a title="this" href="that">what?</a>';
preg_match('/^<a.*?href=(["\'])(.*?)\1.*$/', $str, $m);
var_dump($m);
Output:
array(3) {
[0]=>
string(37) "<a title="this" href="that">what?</a>"
[1]=>
string(1) """
[2]=>
string(4) "that"
}
The pattern you want to look for would be the link anchor pattern, like (something):
$regex_pattern = "/<a href=\"(.*)\">(.*)<\/a>/";
why don't you just match
"<a.*?href\s*=\s*['"](.*?)['"]"
<?php
$str = '<a title="this" href="that">what?</a>';
$res = array();
preg_match_all("/<a.*?href\s*=\s*['\"](.*?)['\"]/", $str, $res);
var_dump($res);
?>
then
$ php test.php
array(2) {
[0]=>
array(1) {
[0]=>
string(27) "<a title="this" href="that""
}
[1]=>
array(1) {
[0]=>
string(4) "that"
}
}
which works. I've just removed the first capture braces.
For anyone still looking for a solution, it is very easy and fast with SimpleXML:
$a = new SimpleXMLElement('<a href="www.something.com">Click here</a>');
echo $a['href']; // will echo www.something.com
It's working for me.
Quick test: <a\s+[^>]*href=(\"\'??)([^\1]+)(?:\1)>(.*)<\/a> seems to do the trick, with the 1st match being " or ', the second the 'href' value 'that', and the third the 'what?'.
The reason I left the first match of "/' in there is that you can use it to backreference it later for the closing "/' so it's the same.
See live example on: http://www.rubular.com/r/jsKyK2b6do
I'm not sure what you're trying to do here, but if you're trying to validate the link then look at PHP's filter_var()
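If validation is the goal, a minimal sketch with filter_var() could look like this. The helper name isValidUrl() is hypothetical; FILTER_VALIDATE_URL is the built-in filter.

```php
<?php
// isValidUrl() is a hypothetical helper name used for illustration.
function isValidUrl(string $candidate): bool
{
    // filter_var() returns the value on success and false on failure.
    return filter_var($candidate, FILTER_VALIDATE_URL) !== false;
}

var_dump(isValidUrl('http://www.amazingpage.com/')); // bool(true)
var_dump(isValidUrl('that'));                        // bool(false)
```

Note that this only checks the URL syntax; it does not extract links from markup.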
If you really need to use a regular expression then check out this tool, it may help:
http://regex.larsolavtorvik.com/
Using your regex, I modified it a bit to suit your need.
<a.*?href=("|')(.*?)("|').*?>(.*)<\/a>
I personally suggest you use a HTML Parser
EDIT: Tested
The following is working for me and returns both href and value of the anchor tag.
preg_match_all("'\<a.*?href=\"(.*?)\".*?\>(.*?)\<\/a\>'si", $html, $match);
if($match) {
foreach($match[0] as $k => $e) {
$urls[] = array(
'anchor' => $e,
'href' => $match[1][$k],
'value' => $match[2][$k]
);
}
}
The multidimensional array called $urls contains now associative sub-arrays that are easy to use.
preg_match_all("/(]>)(.?)(</a)/", $contents, $impmatches, PREG_SET_ORDER);
It is tested, and it fetches all a tags from any HTML code.

php regular expression to get the specific url

I would like to get the urls from a webpage that starts with "../category/" from these tags below:
<a href="../category/product/pc.html">PC</a><br>
<a href="../category/product/carpet.html">Carpet</a><br>
Any suggestion would be very much appreciated.
Thanks!
No regular expression is required. A simple XPath query with DOM will suffice:
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//a[starts-with(@href, "../category/")]');
foreach ($nodes as $node) {
echo $node->nodeValue.' = '.$node->getAttribute('href').PHP_EOL;
}
Will print:
PC = ../category/product/pc.html
Carpet = ../category/product/carpet.html
This regex searches for your ../category/ string:
preg_match_all('#......="(\.\./category/.*?)"#', $test, $matches);
All text literals are used for matching. You can replace the ..... to make it more specific. Only the \. need escaping. The .*? looks for a variable length string. And () captures the matched path name, so it appears in $matches. The manual explains the rest of the syntax. http://www.php.net/manual/en/book.pcre.php
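As a sketch of making the pattern more specific, the six dots can be replaced with the literal attribute name. The markup in $test is an assumption reconstructed from the output shown earlier, and the "About" link is a made-up non-matching case.

```php
<?php
// Hypothetical input; the third link is a placeholder that should not match.
$test = '<a href="../category/product/pc.html">PC</a><br>'
      . '<a href="../category/product/carpet.html">Carpet</a><br>'
      . '<a href="../about.html">About</a><br>';

// "href" replaces the six wildcard dots; only the literal dots of
// "../" need escaping.
preg_match_all('#href="(\.\./category/.*?)"#', $test, $matches);

print_r($matches[1]);
```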

Reg expression to remove empty Tags (any of them)?

I would like to remove any HTML tag that is empty or contains only spaces.
For example, to get from:
$string = "<b>text</b><b><span> </span></b><p> <br/></p><b></b><font size='4'></font>";
to:
$string = "<b>text</b>";
Here is an approach with DOM:
// init the document
$dom = new DOMDocument;
$dom->loadHTML($string);
// fetch all the wanted nodes
$xp = new DOMXPath($dom);
foreach($xp->query('//*[not(node()) or normalize-space() = ""]') as $node) {
$node->parentNode->removeChild($node);
}
// output the cleaned markup
echo $dom->saveXml(
$dom->getElementsByTagName('body')->item(0)
);
This would output something like
<body><b>text</b></body>
XML documents require a root element, so there is no way to omit that. You can str_replace it though. The above can handle broken HTML.
If you want to selectively remove specific nodes, adjust the XPath query.
Also see
How do you parse and process HTML/XML in PHP?
Locating the node by value containing whitespaces using XPath
function stripEmptyTags ($result)
{
$regexps = array (
'~<(\w+)\b[^\>]*>\s*</\\1>~',
'~<\w+\s*/>~'
);
do
{
$string = $result;
$result = preg_replace ($regexps, '', $string);
}
while ($result != $string);
return $result;
}
$string = "<b>text</b><b><span> </span></b><p> <br/></p><b></b><font size='4'></font>";
echo stripEmptyTags ($string);
You will need to run the code multiple times in order to do this with regular expressions alone.
The regex that does this is:
/<(?:(\w+)(?: [^>]*)?> *<\/\1>)|(?:<\w+(?: [^>]*)?\/>)/g
For example, on your string you have to run it at least twice: the first pass removes the <br/>, and the second removes the remaining <p> </p>.
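The multi-pass behaviour can be observed by counting how many passes actually change the string. This sketch reuses the two patterns from the stripEmptyTags function earlier in this thread rather than the single regex above; the $passes counter is added here for illustration.

```php
<?php
$patterns = [
    '~<(\w+)\b[^>]*>\s*</\1>~', // paired tag containing only whitespace
    '~<\w+\s*/>~',              // self-closing tag such as <br/>
];

$string = "<b>text</b><b><span> </span></b><p> <br/></p><b></b><font size='4'></font>";

$passes = 0;
do {
    // $count receives the number of replacements made in this pass.
    $string = preg_replace($patterns, '', $string, -1, $count);
    if ($count > 0) {
        $passes++;
    }
} while ($count > 0);

echo $string, PHP_EOL; // <b>text</b>
echo $passes, PHP_EOL; // 2
```

Each pass can only remove tags that are already empty, so a tag that becomes empty once its children are removed has to wait for the next pass.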
