How to get specific attribute of html in string using php? - php

I have a string and I need to extract all the data-id numbers from it.
This is the string:
<li data-type="mentionable" data-id="2">bla bla...
<li data-type="mentionable" data-id="812">some test
<li>bla bla </li>more text
<li data-type="mentionable" data-id="282">
In the end it should give me: 2,812,282

Use DOMDocument instead:
<?php
$data = <<<DATA
<li data-type="mentionable" data-id="2">bla bla...
<li data-type="mentionable" data-id="812">some test
<li>bla bla </li>more text
<li data-type="mentionable" data-id="282">
DATA;
$doc = new DOMDocument();
$doc->loadHTML($data, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($doc);
$ids = [];
foreach ($xpath->query("//li[@data-id]") as $item) {
$ids[] = $item->getAttribute('data-id');
}
print_r($ids);
?>
This gives you 2, 812, 282; see a demo on ideone.com.

You can use a regex with preg_match_all() to find the target part of the string.
preg_match_all("/data-id=\"(\d+)\"/", $str, $matches);
// $matches[1] is an array containing the target values
echo implode(',', $matches[1]); // prints 2,812,282
See the result of the code in the demo.
Because your string is HTML, you can also use the DOMDocument class to parse it and find the target attribute in the document.

Related

Match multiple results single line php regex

I would like to match multiple results in a single-line string, but I only get the last iteration of the result I expected.
For example I have this string : <ul><li><a href="#">test1</a></li><li><a href="#">test2</a></li><li><a href="#">test3</a></li></ul>
I would like to get :
test1
test2
test3
as the result, but I only get "test3".
I tested this regex <ul>(<li><a.*>(.*)<\/a><\/li>)*<\/ul> on https://regex101.com/ but I don't know what I did wrong.
Use a parser instead:
<?php
$html = <<<DATA
<ul>
<li><a href="#">test1</a></li>
<li><a href="#">test2</a></li>
<li><a href="#">test3</a></li>
</ul>
DATA;
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
// select every link that sits directly inside a list item
$links = $xpath->query("//li/a");
foreach ($links as $link) {
    echo $link->textContent, "\n";
}
?>
This sets up the DOM and uses an xpath expression to get the element(s).
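As for why the original pattern returned only "test3": a capturing group that is repeated with * keeps only the text from its last repetition. The sketch below illustrates this, assuming the list items wrap links the way the asker's pattern suggests; the sample string and the second pattern are illustrative and not taken from the question.

```php
<?php
// Assumed sample input: each list item wraps a link, as the asker's pattern implies.
$str = '<ul><li><a href="#">test1</a></li><li><a href="#">test2</a></li><li><a href="#">test3</a></li></ul>';

// One match for the whole list: group 2 is overwritten on every repetition.
preg_match('~<ul>(<li><a.*?>(.*?)</a></li>)*</ul>~', $str, $single);
echo $single[2], PHP_EOL; // only the last repetition survives

// One match per link: preg_match_all keeps every captured text.
preg_match_all('~<a\b[^>]*>(.*?)</a>~', $str, $all);
echo implode(PHP_EOL, $all[1]), PHP_EOL;
```

Matching each link separately with preg_match_all avoids the repeated group altogether, although a parser remains the more robust choice for real HTML.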
Try like this:
(?<=(<a href="#">))([\s\S]| |w\[0-9]| )+?(?=(<\/a>))
or
(?<=(">))([\s\S]| |w\[0-9]| )+?(?=(<\/a>))
or
(?<=(<a href="#">))(.)+?(?=(<\/a>))
link with example:
https://regex101.com/r/MHnxxh/1
or
https://regex101.com/r/MHnxxh/2
<?php
$str = '
<ul>
<li><a href="#">test1</a></li>
<li><a href="#">test2</a></li>
<li><a href="#">test3</a></li>
</ul>
';
preg_match_all('/(?<=(#">))([\s\S]| |w\[0-9]| )+?(?=(<\/a>))/', $str, $matches);
// display array if need
echo "<pre>";
print_r($matches);
// display list
foreach ($matches[0] as $key => $value) {
echo $value ."\r\n";
}
?>
Try this regex pattern with preg_match_all(); it extracts only the text strings inside the links:
preg_match_all('~#">([a-z]\w+)</a>~', $str, $out, PREG_PATTERN_ORDER);
// $out[1] holds the captured text strings
You can also use preg_replace to strip the tags:
$test = '<ul><li><a href="#">test1</a></li><li><a href="#">test2</a></li><li><a href="#">test3</a></li></ul>';
echo preg_replace('/<[^>]*>/', ' ', $test);

How to generate PHP regex to match all consecutive digits preceded by the hash character unless in an anchor / link

I would expect lines 1, 3, 4 and 6 to be matched but only line 6 is being matched.
Regex:
(#[0-9]+\b)(?!.*?\<\/a\>)
Sample string:
#2222
<a target="_blank" href="http://localhost/#/app/job/2222/1">#2222</a>
#3535
#3553
<a target="_blank" href="http://localhost/#/app/job/5242/1">#5242</a>
#3333
The regex is shown here:
https://regex101.com/r/JpyfzQ/3
In the live demo you set the s modifier, which you shouldn't; set the g (global) modifier instead.
Regex (a better way):
<a\b[^>]*>.*<\/a>(*SKIP)(*F)|#\d+\b
Live demo
PHP (note the ~ delimiter, because the pattern itself contains #):
preg_replace('~<a\b[^>]*>.*</a>(*SKIP)(*F)|#\d+\b~', 'replacement', $input);
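To check which tokens the (*SKIP)(*F) pattern actually picks up, here is a small sketch that runs it through preg_match_all on the sample string from the question. The variable names are illustrative.

```php
<?php
// Sample input copied from the question.
$input = <<<DATA
#2222
<a target="_blank" href="http://localhost/#/app/job/2222/1">#2222</a>
#3535
#3553
<a target="_blank" href="http://localhost/#/app/job/5242/1">#5242</a>
#3333
DATA;

// First alternative: consume a whole <a>...</a> element, then force a failure
// and skip past it. Second alternative: a hash followed by digits.
$pattern = '~<a\b[^>]*>.*</a>(*SKIP)(*F)|#\d+\b~';

preg_match_all($pattern, $input, $matches);
echo implode(',', $matches[0]), PHP_EOL; // prints #2222,#3535,#3553,#3333
```

Because . does not match newlines without the s modifier, each anchor is consumed only up to the end of its own line, which is why the s modifier should stay off here.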
A more sophisticated way would be to use DOM functions in conjunction with regex functions, such as preg_match():
<?php
$html = <<<DATA
#2222
<a target="_blank" href="http://localhost/#/app/job/2222/1">#2222</a>
#3535
#3553
<a target="_blank" href="http://localhost/#/app/job/5242/1">#5242</a>
#3333
DATA;
$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED);
$xpath = new DOMXPath($dom);
// you need to register the namespace "php" to make it available in the query
$xpath->registerNamespace("php", "http://php.net/xpath");
$xpath->registerPHPFunctions('preg_match');
$regex = '~#\d+~';
$items = $xpath->query("//*[not(self::a)][php:functionString('preg_match', '$regex', text()) = '1']");
foreach ($items as $item) {
print_r($item);
}
?>

Get text between 2 tags that change (regex)(php)

How should I get the text between 2 html tags that are not always the same. How should I let regex "ignore" a part.
Lets say this is my html:
<html>
...
<span id="ctl00_ContentPlaceHolder1_gvDomain_ctl03_lblName">stirng 1</span>
...
<span id="ctl00_ContentPlaceHolder1_gvDomain_ctl04_lblName">string 2</span>
...
<span id="ctl00_ContentPlaceHolder1_gvDomain_ctl53_lblName">string 3</span>
...
</html>
As you see the ctlxx part is not always the same, this code only gets the first string:
preg_match('#\\<span id="ctl00_ContentPlaceHolder1_gvDomain_ctl03_lblName">(.+)\\</span>#s',$html,$matches);
$match = $matches[0];
echo $match;
How can I let regex ignore the ctlxx part and echo all the strings?
Thanks in advance
You can do it with DOMDocument and DOMXPath, calling preg_match from inside the XPath expression:
$dom = new DOMDocument();
$dom->loadHTML($html);
$x = new DOMXPath($dom);
// The next two lines enable PHP functions within the XPath expression
$x->registerNamespace("php", "http://php.net/xpath");
$x->registerPHPFunctions();
// Select span tags whose id matches the pattern; the result is compared with 1 explicitly
foreach($x->query('//span[php:functionString("preg_match", "/ctl00_ContentPlaceHolder1_gvDomain_ctl\d+_lblName/", @id) = 1]') as $node)
    echo $node->nodeValue, "\n";
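If you want to solve it with a regular expression, use preg_match_all with a non-greedy group and let \d+ stand in for the changing ctlxx part:
<?php
preg_match_all('/<span id="ctl00_ContentPlaceHolder1_gvDomain_ctl\d+_lblName">(.+?)<\/span>/is', $html, $matches);
// $matches[1] holds the text of every matching span
echo implode("\n", $matches[1]);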

Regex preg_replace find image in string WITH img attributes

I'm trying to find ALL images in my blog posts with regex. The code below returns images IF the code is clean and the SRC tag comes right after the IMG tag. However, I also have images with other attributes such as height and width. The regex I have does not pick that up... Any ideas?
The following code returns images that look like this:
<img src="blah_blah_blah.jpg">
But not images that look like this:
<img width="290" height="290" src="blah_blah_blah.jpg">
Here is my code
$pattern = '/<img\s+src="([^"]+)"[^>]+>/i';
preg_match($pattern, $data, $matches);
echo $matches[1];
Use DOM or another parser for this, don't try to parse HTML with regular expressions.
$html = <<<DATA
<img width="290" height="290" src="blah.jpg">
<img src="blah_blah_blah.jpg">
DATA;
$doc = new DOMDocument();
$doc->loadHTML($html); // load the html
$xpath = new DOMXPath($doc);
$imgs = $xpath->query('//img');
foreach ($imgs as $img) {
echo $img->getAttribute('src') . "\n";
}
Output
blah.jpg
blah_blah_blah.jpg
Ever think of using the DOM object instead of regex?
$doc = new DOMDocument();
$doc->loadHTML('<img src="http://example.com/img/image.jpg" ... />');
$imageTags = $doc->getElementsByTagName('img');
foreach($imageTags as $tag) {
echo $tag->getAttribute('src');
}
It is better to use a parser, but here is a way to do it with regex:
$pattern = '/<img\s.*?src="([^"]+)"/i';
The problem is that your pattern only allows whitespace between <img and src, so any other attribute in between breaks the match. Try this instead:
$pattern = '/<img\s+[^>]*?src="([^"]+)"[^>]*>/i';
preg_match($pattern, $data, $matches);
echo $matches[1];
Try this:
$pattern = '/<img\s.*?src=["\']([^"\']+)["\']/i';
Single or double quote and dynamic src attr position.
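A short sketch of that last pattern applied with preg_match_all, so that every image is collected rather than only the first. The sample markup is made up for illustration.

```php
<?php
// Hypothetical sample: one tag with extra attributes, one with single quotes.
$data = '<img width="290" height="290" src="blah.jpg">'
      . "<img src='blah_blah_blah.jpg'>";

// Lazy .*? skips over any attributes that come before src;
// the character classes accept either quote style.
$pattern = '/<img\s.*?src=["\']([^"\']+)["\']/i';

preg_match_all($pattern, $data, $matches);
print_r($matches[1]); // blah.jpg, blah_blah_blah.jpg
```

Note that .*? is not limited to a single tag, so an img tag without a src attribute could make the match run on into the next tag. That is one more reason to prefer the DOM approach.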

PHP regex - Find the highest value

I need to find the highest number in a string like this:
Example
<div id='pages'>
<a href='pages.php?start=0&end=20'>Page 1</a>
<a href='pages.php?start=20&end=40'>Page 2</a>
<a href='pages.php?start=40&end=60'>Page 3</a>
<a href='pages.php?start=60&end=80'>Page 4</a>
<a href='pages.php?start=80&end=89'>Page 5</a>
</div>
In this example, I should get 89, because it's the highest of the "end" values.
I think I should use regex, but I don't know how :(
Any help would be very appreciated!
You shouldn't be doing this with a regex. In fact, I don't even know how you would. You should be using an HTML parser, parsing out the end parameter from each <a> tag's href attribute with parse_str(), and then finding the max() of them, like this:
$doc = new DOMDocument;
$doc->loadHTML( $str); // All & should be encoded as &amp;
$xpath = new DOMXPath( $doc);
$end_vals = array();
foreach( $xpath->query( '//div[@id="pages"]/a') as $a) {
parse_str( $a->getAttribute( 'href'), $params);
$end_vals[] = $params['end'];
}
echo max( $end_vals);
The above will print 89, as seen in this demo.
Note that this assumes your HTML entities are properly escaped, otherwise DOMDocument will issue a warning.
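If the markup cannot be fixed at the source, one option is to have libxml collect parse errors instead of raising PHP warnings. This is a minimal sketch under that assumption; the one-line sample markup is made up, and which messages get reported depends on the libxml version.

```php
<?php
// Collect libxml parse errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$doc = new DOMDocument;
// The "&" in the href is deliberately left unescaped.
$doc->loadHTML("<div id='pages'><a href='pages.php?start=0&end=20'>Page 1</a></div>");

// Inspect whatever the parser reported, then clear the error buffer.
foreach (libxml_get_errors() as $error) {
    echo trim($error->message), PHP_EOL;
}
libxml_clear_errors();

// The attribute is still readable despite the unescaped ampersand.
$href = $doc->getElementsByTagName('a')->item(0)->getAttribute('href');
echo $href, PHP_EOL;
```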
One optimization you can do is instead of keeping an array of end values, just compare the max value seen with the current value. However this will only be useful if the number of <a> tags grows larger.
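A sketch of that running-maximum variant, assuming the same page structure as in the question. The sample markup is shortened, and $max and $query are illustrative names.

```php
<?php
// Shortened sample markup; "&" is written as "&amp;" so the parser stays quiet.
$str = <<<HTML
<div id="pages">
<a href="pages.php?start=0&amp;end=20">Page 1</a>
<a href="pages.php?start=60&amp;end=80">Page 4</a>
<a href="pages.php?start=80&amp;end=89">Page 5</a>
</div>
HTML;

$doc = new DOMDocument;
$doc->loadHTML($str);
$xpath = new DOMXPath($doc);

$max = null; // highest "end" value seen so far
foreach ($xpath->query('//div[@id="pages"]/a') as $a) {
    // Parse only the query-string part of the href.
    $query = (string) parse_url($a->getAttribute('href'), PHP_URL_QUERY);
    parse_str($query, $params);
    if (isset($params['end'])) {
        $end = (int) $params['end'];
        if ($max === null || $end > $max) {
            $max = $end;
        }
    }
}
echo $max, PHP_EOL; // prints 89
```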
Edit: As DaveRandom points out, if we can make the assumption that the <a> tag that holds the highest end value is the last <a> tag in this list, simply due to how paginated links are presented, then we don't need to iterate or keep a list of other end values, as shown in the following example.
$doc = new DOMDocument;
$doc->loadHTML( $str);
$xpath = new DOMXPath( $doc);
parse_str( $xpath->evaluate( 'string(//div[@id="pages"]/a[last()]/@href)'), $params);
echo $params['end'];
To find the highest number in the entire string, regardless of position, you can use
preg_split — Split string by a regular expression
max — Find highest value
Example (demo)
echo max(preg_split('/\D+/', $html, -1, PREG_SPLIT_NO_EMPTY)); // prints 89
This works by splitting the string by anything that is not a number, leaving you with an array containing all the numbers in the string and then fetching the highest number from that array.
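To make the intermediate step visible, this sketch prints the array that preg_split produces for a shortened, made-up two-link sample before max() is applied.

```php
<?php
// Shortened sample with two of the links from the question.
$html = "<a href='pages.php?start=60&end=80'>Page 4</a>\n"
      . "<a href='pages.php?start=80&end=89'>Page 5</a>";

// Split on every run of non-digits; what remains are the digit runs.
$numbers = preg_split('/\D+/', $html, -1, PREG_SPLIT_NO_EMPTY);
print_r($numbers); // 60, 80, 4, 80, 89, 5 (as strings)

// max() compares numeric strings by value.
echo max($numbers), PHP_EOL; // prints 89
```

The array also contains the start values and the page labels, so this shortcut gives the right answer only while no other number in the markup exceeds the largest end value.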
First extract all the numbers from the links, then apply the max function:
$str = "<div id='pages'>
<a href='pages.php?start=0&end=20'>Page 1</a>
<a href='pages.php?start=20&end=40'>Page 2</a>
<a href='pages.php?start=40&end=60'>Page 3</a>
<a href='pages.php?start=60&end=80'>Page 4</a>
<a href='pages.php?start=80&end=89'>Page 5</a>
</div>";
if(preg_match_all("/href=['][^']+end=([0-9]+)[']/i", $str, $matches))
{
$maxVal = max($matches[1]);
echo $maxVal;
}
function getHighest($html) {
$my_document = new DOMDocument();
$my_document->loadHTML($html);
$nodes = $my_document->getElementsByTagName('a');
$numbers = array();
foreach ($nodes as $node) {
if (preg_match('/\d+$/', $node->getAttribute('href'), $match) == 1) {
$numbers[] = intval($match[0]);
}
}
return max($numbers);
}
