PHP: How can I search through this string so that when I find class="font8text">N</span> it gives me 'EARLL', which is in the next <span>?
<div align="left" style=";">
<span style="width:15px; padding:1px; border:1pt solid #999999; background-color:#CCFFCC; text-align:center;" class="font8text">Y</span>
<span style="text-align:left; white-space:nowrap;" class="font8text">DINNIMAN</span>
</div>
<div align="left" style="background-color:#F8F8FF;">
<span style="width:15px; padding:1px; border:1pt solid #999999; background-color:#FFCCCC; text-align:center;" class="font8text">N</span>
<span style="text-align:left; white-space:nowrap;" class="font8text">EARLL</span>
</div>
Use a DOM-parser like: http://simplehtmldom.sourceforge.net/
As mentioned (a countless number of times), regex is not a good way to parse HTML. Actually, you can't really parse HTML with regex: HTML is not a regular language. You can only extract bits, and even that is (in most cases) very unreliable.
It's better to use a DOM parser, because a parser that turns the HTML into a document makes it easier to traverse.
Example:
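include_once('simple_html_dom.php');
$dom = str_get_html($html); // $html holds the markup; file_get_html() expects a file path or URL instead
foreach ($dom->find('span.font8text') as $element) {
    if (trim($element->innertext) === 'N') {
        die($element->next_sibling()->innertext); // the name is in the next <span>
    }
}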
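If pulling in a third-party library is not an option, roughly the same lookup can be sketched with PHP's bundled DOM extension and XPath. This is only a sketch: nameAfterFlag() is a hypothetical helper name, the sample markup is a trimmed copy of the question's HTML, and it assumes the flag value never contains a double quote.

```php
<?php
// Sketch only: nameAfterFlag() is a hypothetical helper, not part of any library.
function nameAfterFlag(string $html, string $flag): ?string
{
    libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // Find the span.font8text whose text equals the flag, then take its next sibling span.
    // Assumption: $flag contains no double quotes, so plain concatenation is safe here.
    $query = '//span[@class="font8text"][normalize-space(.)="' . $flag . '"]'
           . '/following-sibling::span[1]';
    $node = $xpath->query($query)->item(0);

    return $node === null ? null : trim($node->textContent);
}

$html = <<<HTML
<div align="left" style=";">
<span style="background-color:#CCFFCC;" class="font8text">Y</span>
<span style="white-space:nowrap;" class="font8text">DINNIMAN</span>
</div>
<div align="left" style="background-color:#F8F8FF;">
<span style="background-color:#FFCCCC;" class="font8text">N</span>
<span style="white-space:nowrap;" class="font8text">EARLL</span>
</div>
HTML;

echo nameAfterFlag($html, 'N'), PHP_EOL; // EARLL
```

Because the XPath anchors on the exact text of the flag span, a name such as DINNIMAN that merely contains an N is not mistaken for the flag.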
I think you're better off using strpos and substr in conjunction with each other.
Example:
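$str = '...'; // insert your string here
$marker = 'class="font8text">N</span>'; // the span that flags the wanted row
$find = 'class="font8text">'; // the text immediately before the wanted value
$pos = strpos($str, $marker); // locate the "N" span
$start = strpos($str, $find, $pos + strlen($marker)) + strlen($find); // start of the text in the next span
$len = strpos($str, '<', $start) - $start; // find the end, then subtract the start for the length
$text = substr($str, $start, $len); // result ('EARLL' for the sample above)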
This would do it:
/class="font8text">N.*?class="font8text">(.*?)</m
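EARLL would be in the first capture group. Try it on Rubular. Note that Rubular uses Ruby syntax, where the m modifier lets . match newlines; in PHP's preg functions the equivalent modifier is s.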
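For PHP itself, a minimal sketch of the same idea with preg_match might look like this. The sample string is a shortened stand-in for the question's markup, and the pattern is slightly tightened (it requires N</span>) so that a name that merely starts with N is not treated as the flag; that tightening is my assumption, not part of the pattern above.

```php
<?php
// Shortened stand-in for the markup in the question.
$html = <<<HTML
<span class="font8text">Y</span>
<span class="font8text">DINNIMAN</span>
<span class="font8text">N</span>
<span class="font8text">EARLL</span>
HTML;

// "s" lets "." match newlines; "~" as the delimiter means "/" needs no escaping.
$pattern = '~class="font8text">N</span>.*?class="font8text">(.*?)<~s';

// preg_match() returns 1 on a match; group 1 then holds the text of the next span.
$name = preg_match($pattern, $html, $m) === 1 ? $m[1] : null;
echo $name, PHP_EOL; // EARLL
```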
Related
I have a PHP script which generates an HTML email. To keep the size under Google's 102kB limit, I'm trying to squeeze as many unnecessary characters out of the code as possible.
I currently use Emogrifier to inline the css and then TinyMinify to minify.
The output from this still has spaces between properties and values in the inlined styles (eg style="color: #ffffff; font-weight: 16px")
I've developed the following regex to remove the extra whitespace, but it also affects the actual content too (eg this & that becomes this &that)
$out = preg_replace("/(;|:)\s([a-zA-Z0-9#])/", "$1$2", $newsletter);
How can I modify this regex so that it is limited to inline styles, or is there a better approach?
There is no bulletproof way to avoid matching the payload (style="" can appear anywhere) and to avoid matching actual CSS values (as in content: 'a: b'). Furthermore, also consider
shortening the values: red is shorter than #f00, which is shorter than #ff0000
removing leading and trailing cruft, like whitespace and semicolons
redesigning your HTML: e.g. using <ins> and <strong> can be effectively shorter than using inline CSS
One approach would be to match all inline style HTML attributes first and then operate on their content only, but you have to test for yourself how well this works:
$out= preg_replace_callback
( '/( style=")([^"]*)("[ >])/' // Find all appropriate HTML attributes
, function( $aMatch ) { // Per match
// Kill any amount of any kind of spaces after colon or semicolon only
$sInner= preg_replace
( '/([;:])\\s*([a-zA-Z0-9#])/' // Escaping backslash in PHP string context
, '$1$2'
, $aMatch[2] // Second sub match
);
// Kill any amount of leading and trailing semicolons and/or spaces
$sInner= preg_replace
( array( '/^\\s*;*\\s*/', '/\\s*;*\\s*$/' )
, ''
, $sInner
);
return $aMatch[1]. $sInner. $aMatch[3]; // New HTML attribute
}
, $newsletter
);
You haven't provided sample input for us to use, but you have mentioned that you are dealing with html. This should sound alarm bells that using regex as a direct solution is ill-advised. When intending to process valid html, you should be using a dom parser to isolate the style attributes.
Why shouldn't you use regex to isolate the inline style declarations? Simply put: regex is "dom-unaware". It doesn't know when it is inside or outside of a tag (I'll provide a contrived monkeywrench in my demo to express this vulnerability). Furthermore, using a dom parser will add the benefit of correctly handling different types of quoting. While regex can be written to match/acknowledge balanced quoting, it adds considerable bloat (when executed well) and damages the readability and maintainability of your script.
In my demo, I'll show how spaces after colons, semicolons, and commas can be simply/accurately purged after isolating true inline style declarations. I've gone that little bit farther (since color hexcode condensing was mentioned on this page) to show how regex can be used to reduce some six character hexcodes to three characters.
Code: (Demo)
$html = <<<HTML
<div style='font-family: "Times New Roman", Georgia, serif; background-color: #ffffff; '>
<p>Some text
<span class="ohyeah" style="font-weight: bold; color: #ff6633 !important; border: solid 1px grey;">
Monkeywrench: style="padding: 3px;"
</span>
&
<strong style="text-decoration: underline; ">Underlined</strong>
</p>
<h1 style="margin: 1px 2px 3px 4px;">Heading</h1>
<span style="background-image: url('images/not_a_hexcode_ffffff.png'); ">Text</span>
</div>
HTML;
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
foreach ($dom->getElementsByTagName('*') as $node) {
$style = $node->getAttribute('style');
if ($style) {
$patterns = ['~[:;,]\K\s+~', '~#\K([\da-f])\1([\da-f])\2([\da-f])\3~i'];
$replaces = ['', '\1\2\3'];
$node->setAttribute('style', preg_replace($patterns, $replaces, $style));
}
}
$html = $dom->saveHtml();
echo $html;
Output:
<div style='font-family:"Times New Roman",Georgia,serif;background-color:#fff;'>
<p>Some text
<span class="ohyeah" style="font-weight:bold;color:#f63 !important;border:solid 1px grey;">
Monkeywrench: style="padding: 3px;"
</span>
&
<strong style="text-decoration:underline;">Underlined</strong>
</p>
<h1 style="margin:1px 2px 3px 4px;">Heading</h1>
<span style="background-image:url('images/not_a_hexcode_ffffff.png');">Text</span>
</div>
The above snippet uses \K in the patterns to avoid the use of lookaround and excess capture groups.
I am not writing a pattern that removes the space before !important because I have read some (not so recent) posts saying that some browsers exhibit buggy behavior without the space.
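To make the \K remark concrete, here is a small side-by-side sketch. The $style value is just an illustrative string of my own; both calls should produce the same result, the first without any capture group.

```php
<?php
$style = 'font-weight: bold; color: #ff6633 !important;';

// With \K: whatever was matched before \K is dropped from the reported match,
// so only the whitespace itself is replaced (with nothing).
$withK = preg_replace('~[:;,]\K\s+~', '', $style);

// Equivalent with a capture group: the punctuation is consumed, then put back via $1.
$withGroup = preg_replace('~([:;,])\s+~', '$1', $style);

var_dump($withK === $withGroup); // bool(true)
echo $withK, PHP_EOL;            // font-weight:bold;color:#ff6633 !important;
```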
Looking for the best way to get the content of some HTML text in some random pieces of HTML
I cannot seem to figure out the regex for it.
<td valign="top" style="border: solid 1px black; padding: 4px;">
<h4>Dec 05, 2015 23:16:52</h4>
<h3>rron7pam has won</h3>
</td>
<table width="100%" style="border: 1px solid #DED3B9" id="attack_info_att">
<tbody>
<tr>
<th style="width:20%">Attacker:</th>
<th><a title="..." href="/guest.php?screen=info_player&id=255995">Bliksem</a></th>
</tr>
</tbody>
</table>
The above are only examples, but for these examples, I am interested in
Getting the date (date = Dec 05, 2015 23:16:52)
Who won the battle (name = rron7pam)
The name of the attacker (name = Bliksem)
Attacker's ID (id = 255995)
There is a lot more information that I need from separate pieces of HTML code, but if I can get one or two right, I might be able to get some more.
EDIT based on comments and answers:
There could be any arbitrary text in the HTML, depending on how the report was set up (to hide attacker's units, etc.) I need to look for patterns of specific HTML tags
In the example above, "The text between the <h4></h4> tags directly following a set of <h3></h3> tags inside a <td>" will be the date that I need.
Some examples of links with different formats:
https://enp2.tribalwars.net/public_report/70d3a2a55461e9eb09f543958b608304
https://enp2.tribalwars.net/public_report/5216e0e16c9d3657f981ce7e3cb02580
There are elements that will always be the same, as far as I can tell, e.g., as per the above to get the date.
An example with DOMDocument:
$url = 'https://enp2.tribalwars.net/public_report/70d3a2a55461e9eb09f543958b608304';
// prevent warnings from being displayed
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTMLFile($url);
$xp = new DOMXPath($dom);
# let's find the interesting nodes:
// td that contains all the needed information (the nearest common ancestor, in other words)
$rootNode = $xp->query('(//table[@class="vis"]/tr/td[./h4])[1]')->item(0);
// first h4 node that contains the date
$dateNode = $xp->query('(./h4)[1]', $rootNode)->item(0);
// following h3 node that contains the player name
$winnerNode = $xp->query('(./following-sibling::h3)[1]', $dateNode)->item(0);
$attackerNode = $xp->query('(./table[@id="attack_info_att"]/tr/th/a)[1]', $rootNode)->item(0);
# extract special values
$winner = preg_replace('~ has won$~', '', $winnerNode->nodeValue);
$attackerID = html_entity_decode($attackerNode->getAttribute('href'));
$attackerID = parse_url($attackerID, PHP_URL_QUERY);
parse_str($attackerID, $queryVars);
$attackerID = $queryVars['id'];
$result = [ 'date' => $dateNode->nodeValue,
'winner' => $winner,
'attacker' => $attackerNode->nodeValue,
'attackerID' => $attackerID ];
print_r($result);
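To try the approach without a network request, the same steps can be sketched against a string. The markup below is assembled from the two fragments in the question; how they nest inside each other is my assumption, so the queries use // to stay tolerant of the real page's structure.

```php
<?php
// Sample markup assembled from the question's fragments; the nesting is an assumption.
$html = <<<HTML
<table class="vis"><tr>
<td valign="top" style="border: solid 1px black; padding: 4px;">
<h4>Dec 05, 2015 23:16:52</h4>
<h3>rron7pam has won</h3>
<table width="100%" id="attack_info_att">
<tr>
<th style="width:20%">Attacker:</th>
<th><a title="..." href="/guest.php?screen=info_player&amp;id=255995">Bliksem</a></th>
</tr>
</table>
</td>
</tr></table>
HTML;

libxml_use_internal_errors(true);   // keep parser warnings quiet
$dom = new DOMDocument;
$dom->loadHTML($html);
$xp = new DOMXPath($dom);

$dateNode     = $xp->query('//td/h4')->item(0);
$winnerNode   = $xp->query('./following-sibling::h3[1]', $dateNode)->item(0);
$attackerNode = $xp->query('//table[@id="attack_info_att"]//th/a')->item(0);

// Pull the "id" parameter out of the link's query string.
$query = (string) parse_url($attackerNode->getAttribute('href'), PHP_URL_QUERY);
parse_str($query, $queryVars);

$result = [
    'date'       => trim($dateNode->nodeValue),
    'winner'     => preg_replace('~ has won$~', '', trim($winnerNode->nodeValue)),
    'attacker'   => trim($attackerNode->nodeValue),
    'attackerID' => $queryVars['id'] ?? null,
];
print_r($result);
```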
It wouldn't be pretty, but could you use strpos to return the start and end positions of the tags/content, then use substr to return that portion of the string?
string substr ( string $string , int $start [, int $length ] )
mixed strpos ( string $haystack , mixed $needle [, int $offset = 0 ] )
I would say that having to do it like this probably means there is something wrong with how you're receiving the data further upstream. I really don't think it's going to be efficient to keep scanning the DOM over and over.
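As a rough illustration of the strpos/substr idea, here is a sketch of a small helper. textBetween() is a hypothetical name of my own, and the sample string is a compacted version of the question's first fragment. It returns null when either marker is missing rather than silently cutting at the wrong place.

```php
<?php
// Hypothetical helper: returns the text between $open and the next $close, or null.
function textBetween(string $haystack, string $open, string $close, int $offset = 0): ?string
{
    $start = strpos($haystack, $open, $offset);
    if ($start === false) {
        return null;               // opening marker not found
    }
    $start += strlen($open);       // skip past the marker itself
    $end = strpos($haystack, $close, $start);
    if ($end === false) {
        return null;               // closing marker not found
    }
    return substr($haystack, $start, $end - $start);
}

$html = '<td><h4>Dec 05, 2015 23:16:52</h4><h3>rron7pam has won</h3></td>';
echo textBetween($html, '<h4>', '</h4>'), PHP_EOL;          // Dec 05, 2015 23:16:52
echo textBetween($html, '<h3>', ' has won</h3>'), PHP_EOL;  // rron7pam
```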
I've got this table generated with PHP:
A function generates a string with all the HTML code:
<table><tr><td>1</td><td>2</td><td>3</td><td>4</td><td>5</td><td>6</td><td>7</td><td>8</td><td>9</td><td>10</td></tr><tr><td>2</td><td>4</td><td>6</td><td>8</td><td>10</td><td>12</td><td>14</td><td>16</td><td>18</td><td>20</td></tr><tr><td>3</td><td>6</td><td>9</td><td>12</td><td>15</td><td>18</td><td>21</td><td>24</td><td>27</td><td>30</td></tr><tr><td>4</td><td>8</td><td>12</td><td>16</td> .... </table>
Now I want to make the numbers 1 to 10 bold. I'm trying to replace '<td>(10|[0-9])</td>' with <td style="font-weight: bold">THE-ORIGINAL-NUMBER</td>.
Thanks in advance!
P.S. I know there are a lot of similar answers out there, but I just couldn't figure it out. Is there an actually noob-friendly tutorial/glossary of regex out there? I couldn't really find a modern-day site.
If you are matching this regular expression:
<td>(10|[0-9])</td>
You are capturing 10|[0-9] into capture group #1. This can be referenced in your replacement with either of the following backreferences:
\1
$1
Full PHP code:
$html = '<td>1</td>';
$html = preg_replace(
'~<td>(10|[0-9])</td>~',
'<td style="font-weight: bold">\1</td>',
$html
);
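To see what this does on more than one cell, here is a short sketch on a made-up two-row table (the sample string is mine, not the asker's full output).

```php
<?php
// Made-up sample: two short rows of a multiplication table.
$html = '<table><tr><td>1</td><td>2</td><td>10</td></tr>'
      . '<tr><td>2</td><td>4</td><td>20</td></tr></table>';

$bold = preg_replace(
    '~<td>(10|[0-9])</td>~',
    '<td style="font-weight: bold">$1</td>',   // $1 and \1 are interchangeable here
    $html
);
echo $bold, PHP_EOL;
```

Note that the pattern matches on the cell's value, not its position, so the 2 and 4 in the second row are made bold as well while 20 is left alone. If a literal digit ever has to follow the backreference directly, the ${1} form keeps it unambiguous.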
use this regex
(?<=<td>)(10|[0-9])(?=<\/td>)
replace group #1 with:
<span class="BoldText">$1</span>
Style:
.BoldText {
font-weight: bold;
}
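In PHP, the lookaround version could be applied roughly like this; the sample row is made up for illustration. Since the lookarounds do not consume the <td> tags, only the number itself gets wrapped.

```php
<?php
$html = '<tr><td>7</td><td>10</td><td>14</td></tr>';

// "~" as the delimiter means the "/" in "</td>" needs no escaping.
$out = preg_replace(
    '~(?<=<td>)(10|[0-9])(?=</td>)~',
    '<span class="BoldText">$1</span>',
    $html
);
echo $out, PHP_EOL;
// <tr><td><span class="BoldText">7</span></td><td><span class="BoldText">10</span></td><td>14</td></tr>
```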
using <b> may be useful:
replace
'~<td>(10|[0-9])</td>~'
with
'<td><b>\1</b></td>'
I am trying to replace a string having span tag with the input tag as follows
original string:
<span style="font-family: Times New Roman; font-size: 12pt;"><img width="56" height="25" src="image023.gif" style="vertical-align:middle"></span>
the string i want to change:
<input type="radio" value="1" name="choice"><img width="56" height="25" src="image023.gif" style="vertical-align:middle"></input>
mycode is:
$oldstr1='<span style="font-family: Times New Roman; font-size: 12pt;">';
$oldstr2='</span>';
$newstr1='<input type="radio" value="1" name="choice">';
$newstr2="</input>";
$str = '...'; // a superset of HTML content that contains the span mentioned above
while (preg_match($oldstr1, $str) && preg_match($oldstr2, $str)) {
$str = preg_replace($oldstr1,$newstr1, $str, 1);
$str = preg_replace($oldstr2,$newstr2, $str, 1);
}
return $str;
However, the output I am getting has extra "<" and ">" characters: a "<", then the radio button with proper tags, and then an extra ">" at the end. Please suggest a fix.
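The problem is in your patterns, $oldstr1 and $oldstr2. The preg_* functions expect the pattern to be wrapped in delimiters, and PHP also accepts bracket-style pairs such as < and > as delimiters. Your strings begin with < and end with >, so those two characters are consumed as delimiters and only the text between them is matched and replaced, which leaves the original < and > around the replacement.
@Flosi posted a correct answer, but here is an alternative solution: in your case you can use str_replace, which will be faster (no while loop, and you don't need to change your patterns):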
$str = str_replace($oldstr1,$newstr1, $str);
$str = str_replace($oldstr2,$newstr2, $str);
You didn't set your delimiters, and your strings are not properly escaped. It works if you do that, e.g.
$oldstr1='/\<span style="font-family: Times New Roman; font-size: 12pt;"\>/';
$oldstr2='/\<\/span\>/';
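If you would rather not escape the string by hand, preg_quote() can build the pattern from the literal text. A minimal sketch, using the strings from the question and a shortened sample $str of my own:

```php
<?php
$oldstr1 = '<span style="font-family: Times New Roman; font-size: 12pt;">';
$newstr1 = '<input type="radio" value="1" name="choice">';

// preg_quote() escapes regex metacharacters and, via its second argument, the delimiter.
$pattern = '/' . preg_quote($oldstr1, '/') . '/';

// Shortened sample input for illustration.
$str = $oldstr1 . '<img src="image023.gif"></span>';
$str = preg_replace($pattern, $newstr1, $str, 1);
echo $str, PHP_EOL;
// <input type="radio" value="1" name="choice"><img src="image023.gif"></span>
```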
Try to add '/' to your old string. Like this:
$oldstr1='/<span style="font-family: Times New Roman; font-size: 12pt;">/';
$oldstr2='/<\/span>/';
EDIT: I guess in your case it would be better to use @MarkS's answer and just use a plain replace instead of regex.
I've got some HTML that looks like this:
<tr class="row-even">
<td align="center">abcde</td>
<td align="center"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></td>
</tr>
<tr class="row-odd">
<td align="center">efgh</td>
<td align="center"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></td>
</tr>
<tr class="row-even">
<td align="center">ijkl</td>
<td align="center"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></td>
</tr>
And I need to retrieve the values, abcde, efgh, and ijkl
This is the regex I'm currently using:
preg_match_all('/(<tr class="row-even">|<tr class="row-odd">)<td align="center">(.*)<\/td><\/tr>/xs', $html, $matches);
Yes, I'm not very good at them. As with most of my regex attempts, this is not working. Can anyone tell me why?
Also, I know about html/xml parsers, but it would require a significant code revisit to make that happen. So that's for later. We need to stick with regex for now.
EDIT: To clarify, I need the values between the first <td align="center"></td> tag after either <tr class="row-even"> or <tr class="row-odd">
~<tr class="row-(even|odd)">\s*<td align="center">(.*?)</td>~m
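Notice the use of \s*, which absorbs the line break between the tags. (The m modifier only changes how ^ and $ match, so it has no effect on this pattern and can be dropped.)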
Also, you can make the first group non-capturing via ?:. I.e., (?:even|odd) as you're probably not interested in the class attribute :)
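Put together with the non-capturing group, a sketch of the full call might look like this. The sample is the first two rows from the question; with the group non-capturing, the wanted values land in $matches[1].

```php
<?php
$html = <<<HTML
<tr class="row-even">
<td align="center">abcde</td>
<td align="center"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></td>
</tr>
<tr class="row-odd">
<td align="center">efgh</td>
<td align="center"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></td>
</tr>
HTML;

// \s* absorbs the line break after <tr ...>; (.*?) stops at the first </td>.
preg_match_all(
    '~<tr class="row-(?:even|odd)">\s*<td align="center">(.*?)</td>~',
    $html,
    $matches
);
print_r($matches[1]); // abcde, efgh
```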
Try this:
preg_match_all('/(?:<tr class="row-even">|<tr class="row-odd">).<td align="center">(.*?)<\/td>/s', $html, $matches);
Changes made:
You hadn't accounted for the newline between the tags.
You don't need the x modifier, as it makes the regex ignore the literal spaces in the pattern.
Make the matching non-greedy by using .*? in place of .*.
Actually, you don't need too big a change in your codebase. Fetching text nodes is always the same with DOM and XPath. All that changes is the XPath, so you could wrap the DOM code into a function that replaces your preg_match_all. That would be just a tiny change, e.g.
include_once "dom.php";
$matches = dom_match_all('//tr/td[1]', $html);
where dom.php just contains:
// dom.php
function dom_match_all($query, $html, array $matches = array()) {
$dom = new DOMDocument;
libxml_use_internal_errors(TRUE);
$dom->loadHTML($html);
libxml_clear_errors();
$xPath = new DOMXPath($dom);
foreach( $xPath->query($query) as $node ) {
$matches[] = $node->nodeValue;
}
return $matches;
}
and would return
Array
(
[0] => abcde
[1] => efgh
[2] => ijkl
)
But if you want a Regex, use a Regex. I am just giving ideas.
This is what I came up with
<td align="center">([^<]+)</td>
I'll explain. One of the challenges here is that what's between the tags could be either the text you're looking for or an <img> tag. In the regex, [^<]+ says to match one or more characters that are not the < character. That's great, because it means the <img> cell won't match, and the group will only match up to the point where the closing </td> tag is found.
Disclaimer: Using regexps to parse HTML is dangerous.
To get the innerhtml of the first TD in each TR, use this regexp:
/<tr[^>]*>\s*<td[^>]*>(.+?)<\/td>/si
This is just a quick and dirty regex to meet your needs. It could easily be cleaned up and optimized, but it's a start.
<tr[^>]+>[^\n]*\n #Match the opening <tr> tag
\s*<td[^>]+>([^<]+)[^\n]+\n #Group the wanted data
[^\n]+\n #Match next line
</tr> #Match closing tag
Here is an alternative way, which may be more robust:
deluserconfirm.html\?user=([^"]+)