Extract attribute values from xPath query PHP [duplicate] - php

Trying to find the links on a page.
my regex is:
/<a\s[^>]*href=(\"\'??)([^\"\' >]*?)[^>]*>(.*)<\/a>/
but seems to fail at
<a title="this" href="that">what?</a>
How would I change my regex to deal with href not placed first in the a tag?

Reliable regexes for HTML are difficult to write. Here is how to do it with DOM:
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $node) {
echo $dom->saveHtml($node), PHP_EOL;
}
The above would find and output the "outerHTML" of all A elements in the $html string.
To get all the text values of the node, you do
echo $node->nodeValue;
To check if the href attribute exists you can do
echo $node->hasAttribute( 'href' );
To get the href attribute you'd do
echo $node->getAttribute( 'href' );
To change the href attribute you'd do
$node->setAttribute('href', 'something else');
To remove the href attribute you'd do
$node->removeAttribute('href');
You can also query for the href attribute directly with XPath
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//a/@href');
foreach($nodes as $href) {
echo $href->nodeValue; // echo current attribute value
$href->nodeValue = 'new value'; // set new attribute value
$href->parentNode->removeAttribute('href'); // remove attribute
}
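The DOM calls above can be combined into one small routine. The following is only a minimal sketch: the function name extractLinks and the sample markup are made up for illustration, and libxml_use_internal_errors() is used on the assumption that real pages are rarely valid HTML and would otherwise trigger parser warnings.

```php
<?php
// Hypothetical helper (not part of the answer above): returns href/text pairs
// for every <a> element that has an href attribute.
function extractLinks(string $html): array
{
    $dom = new DOMDocument;
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    $xpath = new DOMXPath($dom);
    // //a[@href] skips anchors without an href (for example named anchors).
    foreach ($xpath->query('//a[@href]') as $a) {
        $links[] = [
            'href' => $a->getAttribute('href'),
            'text' => trim($a->textContent),
        ];
    }
    return $links;
}

// Sample input: the second anchor has no href and should be ignored.
$links = extractLinks('<p><a title="this" href="that">what?</a> <a name="top">no href</a></p>');
print_r($links);
```

Because the attribute is read from the parsed tree, the position of href inside the tag does not matter.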
Also see:
Best methods to parse HTML
DOMDocument in php
On a sidenote: I am sure this is a duplicate and you can find the answer somewhere in here

I agree with Gordon, you MUST use an HTML parser to parse HTML. But if you really want a regex you can try this one :
/^<a.*?href=(["\'])(.*?)\1.*$/
This matches <a at the beginning of the string, followed by any number of any char (non-greedy) .*?, then href= followed by the link surrounded by either " or '.
$str = '<a title="this" href="that">what?</a>';
preg_match('/^<a.*?href=(["\'])(.*?)\1.*$/', $str, $m);
var_dump($m);
Output:
array(3) {
[0]=>
string(37) "<a title="this" href="that">what?</a>"
[1]=>
string(1) """
[2]=>
string(4) "that"
}

The pattern you want to look for would be the link anchor pattern, like (something):
$regex_pattern = "/<a href=\"(.*)\">(.*)<\/a>/";

why don't you just match
"<a.*?href\s*=\s*['"](.*?)['"]"
<?php
$str = '<a title="this" href="that">what?</a>';
$res = array();
preg_match_all("/<a.*?href\s*=\s*['\"](.*?)['\"]/", $str, $res);
var_dump($res);
?>
then
$ php test.php
array(2) {
[0]=>
array(1) {
[0]=>
string(27) "<a title="this" href="that""
}
[1]=>
array(1) {
[0]=>
string(4) "that"
}
}
which works. I've just removed the first capture braces.

For anyone who still hasn't got a solution, this is very easy and fast using SimpleXML:
$a = new SimpleXMLElement('<a href="www.something.com">Click here</a>');
echo $a['href']; // will echo www.something.com
It's working for me.

Quick test: <a\s+[^>]*href=(\"\'??)([^\1]+)(?:\1)>(.*)<\/a> seems to do the trick, with the 1st match being " or ', the second the 'href' value 'that', and the third the 'what?'.
The reason I left the first match of "/' in there is that you can use it to backreference it later for the closing "/' so it's the same.
See live example on: http://www.rubular.com/r/jsKyK2b6do

I'm not sure what you're trying to do here, but if you're trying to validate the link then look at PHP's filter_var()
If you really need to use a regular expression then check out this tool, it may help:
http://regex.larsolavtorvik.com/
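If validation is the goal, a sketch along these lines shows how filter_var() behaves with FILTER_VALIDATE_URL. The sample values are illustrative only. Note that this filter expects a scheme, so bare values such as "that" or "www.something.com" are rejected.

```php
<?php
// Illustrative sample values, not taken from a real page.
$candidates = ['http://www.example.com/page?x=1', 'that', 'www.something.com'];

$results = [];
foreach ($candidates as $candidate) {
    // filter_var() returns the filtered value on success and false on failure.
    $results[$candidate] = filter_var($candidate, FILTER_VALIDATE_URL) !== false;
}
var_dump($results);
```

This only checks the form of a URL that has already been extracted; it does not find links in markup.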

Using your regex, I modified it a bit to suit your need.
<a.*?href=("|')(.*?)("|').*?>(.*)<\/a>
I personally suggest you use a HTML Parser
EDIT: Tested

The following is working for me and returns both href and value of the anchor tag.
preg_match_all("'\<a.*?href=\"(.*?)\".*?\>(.*?)\<\/a\>'si", $html, $match);
if($match) {
foreach($match[0] as $k => $e) {
$urls[] = array(
'anchor' => $e,
'href' => $match[1][$k],
'value' => $match[2][$k]
);
}
}
The multidimensional array called $urls contains now associative sub-arrays that are easy to use.

preg_match_all("/(]>)(.?)(</a)/", $contents, $impmatches, PREG_SET_ORDER);
It is tested, and it fetches all a tags from any HTML code.

Related

Build Stripped HTML Array from String in PHP

I have a String which looks something like this:
$html_string = "<p>Some content</p><p>separated by</p><p>paragraphs</p>"
I'd like to do some parsing on the content inside the tags, so I think that creating an array from this would be easiest. Currently I'm using a series of explode and implode to achieve what I want:
$stripped = explode('<p>', $html_string);
$joined = implode(' ', $stripped);
$parsed = explode('</p>', $joined);
which in effect gives:
array('Some content', 'separated by', 'paragraphs');
Is there a better, more robust way to create an array from HTML tags? Looking at the docs, I didn't see any mention of parsing via a regular expression.
Thanks for your help!
If it's really that simple, with few or no other tags inside the content, you can simply use a regex for that:
$string = '<p>Some content</p><p>separated by</p><p>paragraphs</p>';
preg_match_all('/<p>([^<]*?)<\/p>/mi', $string, $matches);
var_dump($matches[1]);
which creates this output:
array(3) {
[0]=>
string(12) "Some content"
[1]=>
string(12) "separated by"
[2]=>
string(10) "paragraphs"
}
Keep in mind that this is neither the most effective nor the fastest way, but it's shorter than using DOMDocument or anything like that.
If you need to do some html parsing in php, there is a nice library for that, called php html parser.
https://github.com/paquettg/php-html-parser
which can give you a jquery like api, to parse html.
an example:
// Assuming you installed from Composer:
require "vendor/autoload.php";
use PHPHtmlParser\Dom;
$dom = new Dom;
$dom->load('<p>Some content</p><p>separated by</p><p>paragraphs</p>');
$pTags = $dom->find('p');
foreach ($pTags as $tag)
{
// do something with the html
$content = $tag->innerHtml;
}
Here is the DOMDocument solution (native PHP), which will also work when your p tags have attributes, or contain other tags like <br>, or have lots of white-space in between them (which is irrelevant in HTML rendering), or contain HTML entities like &nbsp; or &lt;, etc.:
$html_string = "<p>Some content</p><p>separated by</p><p>paragraphs</p>";
$doc = new DOMDocument();
$doc->loadHTML($html_string);
foreach($doc->getElementsByTagName('p') as $p ) {
$paras[] = $p->textContent;
}
// Output array:
print_r($paras);
If you really want to stick with regular expressions, then at least allow tag attributes and HTML entities, translating the latter to their corresponding characters:
$html_string = "<p>Some content &amp; text</p><p>separated by</p><p style='background:yellow'>paragraphs</p>";
preg_match_all('/<p(?:\s.*?)?>\s*(.*?)\s*<\/p\s*>/si', $html_string, $matches);
// array_map returns the decoded strings; array_walk would discard them
$paras = array_map('html_entity_decode', $matches[1]);
print_r($paras);

What's wrong with my PHP regex?

I'm trying to pull a specific link from a feed where all of the content is on one line and there are multiple links present. The one I want has the content "[link]" in the A tag. Here's my example:
test1 test2 [link] test3test4
... could be more links before and/or after
How do I isolate just the href with the content "[link]"?
This regex goes to the correct end of the block I want, but starts at the first link:
(?<=href\=\").*?(?=\[link\])
Any help would be greatly appreciated! Thanks.
Try this updated regex:
(?<=href\=\")[^<]*?(?=\">\[link\])
See demo.
The problem is that the dot matches too many characters and in order to get the right 'href' you need to just restrict the regex to [^<]*?.
Alternatively :)
This code :
$string = 'test1 test2 <a href="http://www.amazingpage.com/">[link]</a> test3test4';
$regex = '/href="([^"]*)">\[link\]/i';
$result = preg_match($regex, $string, $matches);
var_dump($matches);
Will return :
array(2) {
[0] =>
string(41) "href="http://www.amazingpage.com/">[link]"
[1] =>
string(27) "http://www.amazingpage.com/"
}
You can avoid using regular expression and use DOM to do this.
$doc = new DOMDocument;
$doc->loadHTML('
test1
test2
<a href="http://www.amazingpage.com/">[link]</a>
test3
test4
');
foreach ($doc->getElementsByTagName('a') as $link) {
if ($link->nodeValue == '[link]') {
echo $link->getAttribute('href');
}
}
With DOMDocument and XPath:
$dom = new DOMDocument;
$dom->loadHTML($yourHTML);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//a[. = "[link]"]/@href') as $node) {
echo $node->nodeValue;
}
or if you are looking for only one result:
$dom = new DOMDocument;
$dom->loadHTML($yourHTML);
$xpath = new DOMXPath($dom);
$nodeList = $xpath->query('//a[. = "[link]"][1]/@href');
if ($nodeList->length)
echo $nodeList->item(0)->nodeValue;
xpath query details:
//a # 'a' tag everywhere in the DOM tree
[. = "[link]"] # (condition) which has "[link]" as value
/@href # "href" attribute
The reason your regex pattern doesn't work:
The regex engine walks from left to right, and at each position in the string it tries to succeed. So even if you use a non-greedy quantifier, you always obtain the leftmost result.
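To see the leftmost-match behaviour in action, here is a small sketch. The feed line and its URLs are made up for illustration; the second pattern uses [^"]* (one possible restriction, similar in spirit to the [^<]*? suggested above) so the match cannot run past the end of an attribute value.

```php
<?php
// Hypothetical feed line; both URLs are invented for this example.
$feed = '<a href="http://example.com/one">test1</a> <a href="http://example.com/two">[link]</a>';

// Lazy dot: the match still starts right after the FIRST href=" in the string,
// then grows until the lookahead for ">[link] finally succeeds.
preg_match('/(?<=href=").*?(?=">\[link\])/', $feed, $lazy);

// Excluding the quote keeps the match inside a single attribute value,
// so the attempt at the first href fails and the engine moves on.
preg_match('/(?<=href=")[^"]*(?=">\[link\])/', $feed, $bounded);

echo $lazy[0], PHP_EOL;    // starts at the first link's URL
echo $bounded[0], PHP_EOL; // only the URL of the [link] anchor
```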

Matching string without specific pattern between specific places

$example_string = '<a class="190"><br>hello.. 8/10<br><a class="154"><br>9/10<br>';
What I need to match is the classes and the "rating" part (8/10).
Something like this, except I don't know how to write (ANYTHING EXCEPT <br> here) in a regexp:
preg_match_all('#class="([0-9]{3})"><br>(ANYTHING EXCEPT <br> here)*?([0-9]/10)#',
$example_string, $matches);
So a preg_match_all should give these results:
$matches[1][1] = '190';
$matches[1][2] = '8/10';
$matches[2][1] = '154';
$matches[2][2] = '9/10';
to work off of your pattern, and to answer your question
class="([0-9]{3})"><br>(?:(?!<br>).)*?([0-9]\/10)
Demo
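In PHP, the pattern above might be used as in the following sketch. The # delimiter (as in the question) avoids escaping the slash, and PREG_SET_ORDER is just one convenient choice so that each match carries its class and rating together.

```php
<?php
$example_string = '<a class="190"><br>hello.. 8/10<br><a class="154"><br>9/10<br>';

// (?:(?!<br>).)*? consumes one character at a time, but only at positions
// where "<br>" does not start, so the match cannot cross a <br>.
$pattern = '#class="([0-9]{3})"><br>(?:(?!<br>).)*?([0-9]/10)#';

preg_match_all($pattern, $example_string, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // $m[1] is the class, $m[2] the rating
    echo $m[1], ' => ', $m[2], PHP_EOL;
}
```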
I don't know php, but it should work as it does in python...
get the matches between "classes", and iterate to get your data in the returned matched strings
import re # the regex module
example_string = '"<a class="190"><br>hello.. 8/10<br><a class="154"><br>9/10<br>"'
for match in re.findall(r'(?:class[^\d]")([^\/]+)(?!class)', example_string):
print(list(re.findall(r'(\d+)', match)))
yields the following lists:
['190', '8']
['154', '9']
A simple DOM parser would be able to give you that information:
$example_string = '<a class="190"><br>hello.. 8/10<br><a class="154"><br>9/10<br>';
$dom = new DOMDocument;
$dom->loadHTML($example_string);
$xpath = new DOMXPath($dom);
// get all text nodes that have an anchor parent with a class attribute
$query = '//text()[parent::a[@class]]';
foreach ($xpath->query($query) as $node) {
echo $node->textContent, "\n";
echo "parent node: ", $node->parentNode->getAttribute('class'), "\n";
}
Output
hello.. 8/10
parent node: 190
9/10
parent node: 154
(?<=class=")(\d+)|(\d+\/\d+)
Try this. See the demo:
https://regex101.com/r/yR3mM3/58
$re = "/(?<=class=\")(\\d+)|(\\d+\\/\\d+)/";
$str = "<a class=\"190\"><br>hello.. 8/10<br><a class=\"154\"><br>9/10<br>";
preg_match_all($re, $str, $matches);


