PHP regex - Find the highest value

I need to find the highest number in a string like this:
Example
<div id='pages'>
<a href='pages.php?start=0&end=20'>Page 1</a>
<a href='pages.php?start=20&end=40'>Page 2</a>
<a href='pages.php?start=40&end=60'>Page 3</a>
<a href='pages.php?start=60&end=80'>Page 4</a>
<a href='pages.php?start=80&end=89'>Page 5</a>
</div>
In this example, I should get 89, because it's the highest value of the "end" parameter.
I think I should use regex, but I don't know how :(
Any help would be very appreciated!

You shouldn't be doing this with a regex. In fact, I don't even know how you would. You should be using an HTML parser, parsing out the end parameter from each <a> tag's href attribute with parse_str(), and then finding the max() of them, like this:
$doc = new DOMDocument;
$doc->loadHTML( $str); // All & should be encoded as &amp;
$xpath = new DOMXPath( $doc);
$end_vals = array();
foreach( $xpath->query( '//div[@id="pages"]/a') as $a) {
parse_str( $a->getAttribute( 'href'), $params);
$end_vals[] = $params['end'];
}
echo max( $end_vals);
The above will print 89, as seen in this demo.
Note that this assumes your HTML entities are properly escaped, otherwise DOMDocument will issue a warning.
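If the markup cannot be fixed at the source, the warnings can be collected instead of printed. This is only a sketch: the two-link $str is made-up sample input with raw & characters, and parse_url() is used here (it is not in the code above) to isolate the query string before parse_str().

```php
<?php
// Sample markup with raw, unescaped "&" in the href values (assumed input).
$str = "<div id='pages'>
<a href='pages.php?start=0&end=20'>Page 1</a>
<a href='pages.php?start=80&end=89'>Page 5</a>
</div>";

// Collect libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument;
$doc->loadHTML($str);

// Inspect or discard whatever the parser complained about, then restore the old setting.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($doc);
$href = $xpath->evaluate('string(//div[@id="pages"]/a[last()]/@href)');

// Keep only the part after "?" so the keys come out as plain "start" and "end".
parse_str((string) parse_url($href, PHP_URL_QUERY), $params);
echo $params['end'], PHP_EOL;
```

$errors holds LibXMLError objects, so they can still be logged if you want to know what the parser skipped over.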
One optimization you can do is instead of keeping an array of end values, just compare the max value seen with the current value. However this will only be useful if the number of <a> tags grows larger.
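A rough sketch of that running-maximum variant, assuming the same markup as in the question. The (int) cast and the isset() guard are additions of mine, so that links without an end parameter are skipped rather than compared.

```php
<?php
$str = "<div id='pages'>
<a href='pages.php?start=0&amp;end=20'>Page 1</a>
<a href='pages.php?start=20&amp;end=40'>Page 2</a>
<a href='pages.php?start=40&amp;end=60'>Page 3</a>
<a href='pages.php?start=60&amp;end=80'>Page 4</a>
<a href='pages.php?start=80&amp;end=89'>Page 5</a>
</div>";

$doc = new DOMDocument;
$doc->loadHTML($str);
$xpath = new DOMXPath($doc);

$max = null; // highest end value seen so far
foreach ($xpath->query('//div[@id="pages"]/a') as $a) {
    // Parse only the query string part of the href.
    parse_str((string) parse_url($a->getAttribute('href'), PHP_URL_QUERY), $params);
    if (!isset($params['end'])) {
        continue; // link without an end parameter
    }
    $end = (int) $params['end'];
    if ($max === null || $end > $max) {
        $max = $end;
    }
}
echo $max, PHP_EOL; // prints 89
```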
Edit: As DaveRandom points out, if we can make the assumption that the <a> tag that holds the highest end value is the last <a> tag in this list, simply due to how paginated links are presented, then we don't need to iterate or keep a list of other end values, as shown in the following example.
$doc = new DOMDocument;
$doc->loadHTML( $str);
$xpath = new DOMXPath( $doc);
parse_str( $xpath->evaluate( 'string(//div[@id="pages"]/a[last()]/@href)'), $params);
echo $params['end'];

To find the highest number in the entire string, regardless of position, you can use
preg_split — Split string by a regular expression
max — Find highest value
Example (demo)
echo max(preg_split('/\D+/', $html, -1, PREG_SPLIT_NO_EMPTY)); // prints 89
This works by splitting the string by anything that is not a number, leaving you with an array containing all the numbers in the string and then fetching the highest number from that array.
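To make the intermediate step visible, here is a small sketch on a single link (the shortened $html and the "Page 100" variant are made-up inputs). It also shows the catch: every number in the string takes part, including the link text.

```php
<?php
$html = "<a href='pages.php?start=80&end=89'>Page 5</a>";

// Split on every run of non-digits; what remains are the digit runs.
$numbers = preg_split('/\D+/', $html, -1, PREG_SPLIT_NO_EMPTY);
print_r($numbers);           // the three strings 80, 89 and 5
echo max($numbers), PHP_EOL; // 89

// Caveat: a number in the link text is picked up as well.
$html2 = "<a href='pages.php?start=80&end=89'>Page 100</a>";
$numbers2 = preg_split('/\D+/', $html2, -1, PREG_SPLIT_NO_EMPTY);
echo max($numbers2), PHP_EOL; // 100, not 89
```

So this is fine for the markup in the question, but it does not specifically target the end parameter.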

First extract all the end values from the links, then apply the max function:
$str = "<div id='pages'>
<a href='pages.php?start=0&end=20'>Page 1</a>
<a href='pages.php?start=20&end=40'>Page 2</a>
<a href='pages.php?start=40&end=60'>Page 3</a>
<a href='pages.php?start=60&end=80'>Page 4</a>
<a href='pages.php?start=80&end=89'>Page 5</a>
</div>";
if(preg_match_all("/href=['][^']+end=([0-9]+)[']/i", $str, $matches))
{
$maxVal = max($matches[1]);
echo $maxVal;
}

function getHighest($html) {
$my_document = new DOMDocument();
$my_document->loadHTML($html);
$nodes = $my_document->getElementsByTagName('a');
$numbers = array();
foreach ($nodes as $node) {
if (preg_match('/\d+$/', $node->getAttribute('href'), $match) == 1) {
$numbers[] = intval($match[0]);
}
}
return max($numbers);
}

Related

Regex to get href value of links that do not have rel='nofollow'

I have a string that contains HTML link tags and I need to use PHP's preg_match_all to get the href value of the tags, but only if the tag does not have a rel='nofollow' attribute. I found the following expression that gets the href value of all the links.
$regex= "/<a\s[^>]*href=([\"\']??)([^\" >]*?)\\1[^>]*>(.*)<\/a>/siU";
How can I modify it to only get the links I want? Here is what it should look like:
$string= "<a href='link1.php'>Link</a>";
$string.= "<a href='link2.php'>Link2</a>";
$string.= "<a href='link3.php' rel='nofollow'>Link3</a>";
$string.= "<a href='link4.php'>Link4</a>";
preg_match_all($regex, $string, $links);
so links should be:
$links[0] => 'link1.php';
$links[1] => 'link2.php';
$links[2] => 'link4.php';
I need the expression to pick up links that use both single and double quotes. Bonus would be to pick up ill formatted but still valid links. If it's not possible to get just the links I want then just a way to find the links I don't want and remove them from the array. Note string is generated dynamically and may not have the same attribute order and will contain other tags and characters besides just the links.
@revo is correct, this is not a job for regular expressions. Use a proper HTML parser to deconstruct the HTML, and then an XPath query to find the information you need.
$html = <<<HTML
<html>
<head>
<title>Example</title>
</head>
<body>
<a href='link1.php'>Link</a>
<a href="link's 2.php">Link2</a>
<a class="link" href='link3.php' rel='nofollow'>Link3</a>
<a href='link4.php'><span>Link4</span></a>
</body>
</html>
HTML;
$doc = new DOMDocument();
$valid = $doc->loadHTML($html);
$result = [];
if ($valid) {
$xpath = new DOMXpath($doc);
// find any <a> elements that do not have a rel="nofollow" attribute,
// then pick up their href attribute
$elements = $xpath->query("//a[not(@rel='nofollow')]/@href");
if ($elements !== false) {
foreach ($elements as $element) {
$result[] = $element->nodeValue;
}
}
}
print_r($result);
# => Array
# (
# [0] => link1.php
# [1] => link's 2.php
# [2] => link4.php
# )

Difficulties with the function preg_match_all

I would like to get back the number which is between span HTML tags. The number may change!
<span class="topic-count">
::before
"
24
"
::after
</span>
I've tried the following code:
preg_match_all("#<span class=\"topic-count\">(.*?)</span>#", $source, $nombre[$i]);
But it doesn't work.
Entire code:
$result=array();
$page = 201;
while ($page>=1) {
$source = file_get_contents ("http://www.jeuxvideo.com/forums/0-27047-0-1-0-".$page."-0-counter-strike-global-offensive.htm");
preg_match_all("#<span class=\"topic-count\">(.*?)</span>#", $source, $nombre[$i]);
$result = array_merge($result, $nombre[$i][1]);
print("Page : ".$page ."\n");
$page-=25;
}
print_r ($nombre);
You can do this with
preg_match_all(
'#<span class="topic-count">[^\d]*(\d+)[^\d]*?</span>#s',
$html,
$matches
);
which would capture any digits before the end of the span.
However, note that this regex will only work for exactly this piece of html. If there is a slight variation in the markup, for instance, another class or another attribute, the pattern will not work anymore. Writing reliable regexes for HTML is hard.
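A quick way to see the brittleness, as a sketch with two made-up snippets: the pattern above matches the exact markup but fails as soon as a second class is added to the span.

```php
<?php
$pattern = '#<span class="topic-count">[^\d]*(\d+)[^\d]*?</span>#s';

// Markup exactly as the pattern expects it.
$exact = '<span class="topic-count"> 24 </span>';
// Same span with one extra (hypothetical) class.
$varied = '<span class="topic-count highlight"> 24 </span>';

$exactFound  = preg_match($pattern, $exact, $m1);
$variedFound = preg_match($pattern, $varied, $m2);

var_dump($exactFound);  // int(1), and $m1[1] is "24"
var_dump($variedFound); // int(0), the extra class breaks the literal match
```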
Hence the recommendation to use a DOM parser instead, e.g.
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTMLFile('http://www.jeuxvideo.com/forums/0-27047-0-1-0-1-0-counter-strike-global-offensive.htm');
libxml_use_internal_errors(false);
$xpath = new DOMXPath($dom);
foreach ($xpath->evaluate('//span[contains(@class, "topic-count")]') as $node) {
if (preg_match_all('#\d+#s', $node->nodeValue, $topics)) {
echo $topics[0][0], PHP_EOL;
}
}
DOM will parse the entire page into a tree of nodes, which you can then query conveniently via XPath. Note the expression
//span[contains(@class, "topic-count")]
which will give you all the span elements with a class attribute containing the string topic-count. Then if any of these nodes contain a digit, echo it.
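One thing to keep in mind is that contains() is a plain substring test, so a class such as topic-count-label would match too. A common XPath 1.0 workaround is to pad the class list with spaces and look for the padded token. The sketch below uses made-up markup (the topic-count-label class is hypothetical) to compare both expressions.

```php
<?php
$html = '<span class="topic-count">24</span>'
      . '<span class="topic-count-label">topics</span>';

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Substring test: also matches "topic-count-label".
$loose = $xpath->query('//span[contains(@class, "topic-count")]');

// Token test: pad the class list with spaces and search for " topic-count ".
$strict = $xpath->query(
    '//span[contains(concat(" ", normalize-space(@class), " "), " topic-count ")]'
);

echo $loose->length, PHP_EOL;  // 2
echo $strict->length, PHP_EOL; // 1
```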

Get text between 2 tags that change (regex)(php)

How should I get the text between 2 HTML tags that are not always the same? How can I make the regex "ignore" a part?
Lets say this is my html:
<html>
...
<span id="ctl00_ContentPlaceHolder1_gvDomain_ctl03_lblName">string 1</span>
...
<span id="ctl00_ContentPlaceHolder1_gvDomain_ctl04_lblName">string 2</span>
...
<span id="ctl00_ContentPlaceHolder1_gvDomain_ctl53_lblName">string 3</span>
...
</html>
As you see the ctlxx part is not always the same, this code only gets the first string:
preg_match('#\\<span id="ctl00_ContentPlaceHolder1_gvDomain_ctl03_lblName">(.+)\\</span>#s',$html,$matches);
$match = $matches[0];
echo $match;
How can I let regex ignore the ctlxx part and echo all the strings?
Thanks in advance
You can do it with DOMDocument and DOMXPath, calling preg_match from inside the XPath expression:
$dom = new DOMDocument();
$dom->loadHTML($str);
$x = new DOMXpath($dom);
// The next two lines enable PHP functions inside XPath expressions
$x->registerNamespace("php", "http://php.net/xpath");
$x->registerPHPFunctions();
// Select span tags with proper id
foreach($x->query('//span[php:functionString("preg_match", "/ctl00_ContentPlaceHolder1_gvDomain_ctl\d+_lblName/", @id)]') as $node)
echo $node->nodeValue;
If you want to solve it using regular expression then you can do something like this
<?php
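// \d+ stands in for the changing ctlXX part; the lazy (.+?) stops at the first </span>
preg_match_all('/<span id="ctl00_ContentPlaceHolder1_gvDomain_ctl\d+_lblName">(.+?)<\/span>/is', $html, $matches);
// $matches[1] holds the text captured between the tags
echo implode("\n", $matches[1]);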

PHP: Removing duplicate words from between quotes

How can I remove the duplicates from between class="" in the following string?
<li class="active active">Sample Page</li>
Please note that the classes shown can change and be in different positions.
You can use DOM parser then explode and array_unique:
$html = '<li class="active active">
Sample Page</li>';
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html); // loads your html
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query("//li");
for($i=0; $i < $nodelist->length; $i++) {
$node = $nodelist->item($i);
$tok = explode(' ', $node->getAttribute('class'));
$tok = array_unique($tok);
$node->setAttribute('class', implode(' ', $tok));
}
$html = $doc->saveHTML();
echo $html;
OUTPUT:
<html><body>
<li class="active">Sample Page</li>
</body></html>
With regex you could use a lookbehind and lookahead for finding duplicates:
$pattern = '/(?<=class=")(?:([-\w]+) (?=\1[ "]))+/i';
This would replace multiple instances of capture group 1 ([-\w]+) in a sequence.
$str = '<li class="active active">';
echo preg_replace($pattern, "", $str);
output:
<li class="active">
Test at regex101
EDIT 08.04.2014
To remove duplicates, that are not directly after the lookbehind (?<=class=")...
The problem is that a lookbehind assertion must be of fixed length, so something like (?<=class="[^"]*?) is not possible. As an alternative, \K can be used, which resets the beginning of the reported match. A pattern could be:
$pattern = '/class="[^"]*?\K(?<=[ "])(?:([-\w]+) (?=\1[ "]))+/i';
You could imagine everything before \K as a virtual lookbehind of variable length.
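As a small sketch of what that means in practice (the input string with the extra nav class is made up): everything matched before \K is kept in place, and only the part after it is replaced.

```php
<?php
$pattern = '/class="[^"]*?\K(?<=[ "])(?:([-\w]+) (?=\1[ "]))+/i';

// The duplicate does not directly follow class=" here.
$str = '<li class="nav active active">';

// Only the text matched after \K (the first "active ") is removed.
$result = preg_replace($pattern, '', $str);
echo $result, PHP_EOL; // <li class="nav active">
```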
This regex, as the first one, would only replace multiple instances of one duplicate in a sequence.
EDIT 11.09.2014
Finally, I think a single regex that strips out all the different duplicates gets rather complex:
/(?>(?<=class=")|(?!^)\G)(?>\b([-\w]++)\b(?=[^"]*?\s\1[\s"])\s+|[-\w]+\s+\K)/
This one uses continuous matching as soon as class=" is found.
Test at regex101; Also see SO Regex FAQ
A simpler way using regex would be preg_replace_callback():
$html = '<li class="a1 a1 li li-home active li li active a1">';
$html = preg_replace_callback('/\sclass="\K[^"]+/', function ($m) {
return trim(implode(" ",array_unique(preg_split('~\s+~', $m[0]))));
}, $html);
Note that anonymous functions require PHP 5.3 or later; on older versions, pass the name of a normal function instead.
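For reference, a sketch of the same call with a named function in place of the closure. The name dedupe_class_list is just something I picked for illustration.

```php
<?php
// Named callback: receives the match array, returns the de-duplicated class list.
function dedupe_class_list(array $m)
{
    return trim(implode(' ', array_unique(preg_split('~\s+~', $m[0]))));
}

$html = '<li class="a1 a1 li li-home active li li active a1">';

// Pass the function name as a string instead of a closure.
$html = preg_replace_callback('/\sclass="\K[^"]+/', 'dedupe_class_list', $html);
echo $html, PHP_EOL; // <li class="a1 li li-home active">
```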
Another way would be to put these values into an array and filter out the duplicates. Here is how it can be done:
<?php
preg_match('/class="([A-Za-z0-9 ]+)"/', $htmlString, $result);
// $result[1] is the captured class list, without the surrounding class="..."
$classes = explode(" ", $result[1]);
$classes = array_unique($classes);
echo "<li class=\"".implode(" ",$classes)."\">Sample Page</li>";
?>

php preg_match_all words starting with?

I'm pulling in a calendar from an external site with file_get_contents, so I can use jQuery .load on it.
In order to fix the relative path issues with this approach, I'm using
preg_match_all.
So doing
preg_match_all("/<a href='([^\"]*)'/iU", $string, $match);
Gets me all the occurrences of <a href = ''
What I'm after are just the links inside the single quotes.
Now each link starts with "?date" so I have <a href='?date=4%2F9%2F2014&a' etc.
How can I efficiently get the string between single quotes in all <a href= occurrences.
Use a DOM parser to get the href from each <a> tag:
<?php
$file = "your.html";
$doc = new DOMDocument();
$doc->loadHTMLFile($file);
$elements = $doc->getElementsByTagName('a');
foreach ($elements as $tag) {
echo $tag->getAttribute('href');
}
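
Since the goal was to fix the relative paths, the collected values can be prefixed with the page they came from. This is only a sketch: $base is a hypothetical URL and the two links are made-up sample input. It keeps only the hrefs starting with "?date", as described in the question.

```php
<?php
// Made-up sample markup; in practice this would come from file_get_contents().
$string = "<a href='?date=4%2F9%2F2014'>9</a> <a href='?date=4%2F10%2F2014'>10</a> <a href='/about'>About</a>";

// Hypothetical address of the external calendar page.
$base = 'http://example.com/calendar.php';

$doc = new DOMDocument();
$doc->loadHTML($string);

$links = array();
foreach ($doc->getElementsByTagName('a') as $tag) {
    $href = $tag->getAttribute('href');
    // Only the calendar links, which begin with "?date".
    if (strpos($href, '?date') === 0) {
        $links[] = $base . $href;
    }
}
print_r($links);
```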
