I'm having some trouble with the delimiter for explode. I have a rather chunky string as a delimiter, and it seems to break down when I add another letter (the start of a word), but it doesn't get fixed when I remove the first letter, which would indicate it isn't about length.
To wit, the (working) code is:
$boom = htmlspecialchars("<td width=25 align=\"center\" ");
$arr[1] = explode($boom, $arr[1]);
The full string I'd like to use is <td width=25 align=\"center\" class=\", and when I start adding class, explode breaks down and nothing gets done. That happens as soon as I add the c, and it doesn't go away if I remove the <, which it would if it were just a matter of string length.
Basically, the problem isn't dire, since I can just replace class=" with "" after the explode and get the same result, but this has given me headaches to diagnose, and it seems like a really weird problem. For what it's worth, I'm using PHP 5.3.0 in XAMPP 1.7.2.
Thanks in advance!
You could try converting every occurrence of the delimiter in the original string
"<td width=25 align=\"center\" "
into something more manageable like:
"banana"
and then explode on that word.
Have you tried adding htmlspecialchars to the explode?
$arr[1] = explode($boom, htmlspecialchars($arr[1]));
I get unexpected results without it, but with it it works perfectly.
$s = '<td width=25 align="center" class="asdjasd">sdadasd</td><td width=25 align="center" >asdasD</td>';
$boom = htmlspecialchars("<td width=25 align=\"center\" class=");
$sex = explode($boom, $s);
print_r($sex);
Outputs:
Array
(
[0] => <td width=25 align="center" class="asdjasd">sdadasd</td><td width=25 align="center" >asdasD</td>
)
Whereas
$s = '<td width=25 align="center" class="asdjasd">sdadasd</td><td width=25 align="center" >asdasD</td>';
$boom = htmlspecialchars("<td width=25 align=\"center\" class=");
$sex = explode($boom, htmlspecialchars($s));
print_r($sex);
Outputs
Array
(
[0] =>
[1] => "asdjasd">sdadasd</td><td width=25 align="center" >asdasD</td>
)
This is because $boom is htmlspecialchars-encoded: < and > get transformed into &lt; and &gt;, which explode cannot find in the raw string, so it just returns the whole string.
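A minimal sketch of that mismatch, reusing a shortened version of the sample markup above. The variable names are illustrative only. It shows what the encoded delimiter looks like and that explode() only splits when both sides are in the same form:

```php
<?php
// Shortened sample markup, modelled on the answer above.
$s = '<td width=25 align="center" class="asdjasd">sdadasd</td>';
$delimiter = '<td width=25 align="center" class=';

// Encoding only the delimiter turns < and " into entities...
$encoded = htmlspecialchars($delimiter);

// ...so it no longer occurs in the raw string and explode() returns one piece.
$miss = explode($encoded, $s);

// Leaving both sides raw (or encoding both) lets the delimiter match again.
$hit = explode($delimiter, $s);

echo $encoded, "\n";
echo count($miss), " piece(s) vs ", count($hit), " piece(s)\n";
```

Whether to encode both sides or neither depends on what the rest of the script expects; the only requirement for explode() is that they agree.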
So I wanted to enable UBB Code on my Website using preg_replace
$text = $dbentry['text'];
$bbformat = array(
"/\\r?\\n/i" => "<br>",
"/\[i\](.*?)\[\/i\]/i" => "<i>$1</i>",
"/\[code\](.*?)\[\/code\]/i" => "<div align=\"center\"><center><table border=\"1\" cellpadding=\"0\" cellspacing=\"0\" width=\"600\"><tr bgcolor=\"magenta\"><td>Codeblock:</td></tr><tr><td><form><textarea rows=\"10\" cols=\"82\">$1</textarea></form></td></tr></table></center></div>",
....
);
foreach($bbformat as $match=>$replacement){
$text = preg_replace($match, $replacement, $text);
}
echo $text;
This works; however, it also replaces text between [code] and [/code], resulting in <br> or <i></i> tags in the textarea, where they do not belong.
I would like to skip any text that is between [code][/code] UBB elements and output the raw text as it is stored in the DB.
How can this be achieved in PHP?
I actually tried "/\[code\][^>]+\[\/code\](*SKIP)(*FAIL)|\\r?\\n/i" => "<br>", (a technique I had read about), but then the later rule "/\[code\](.*?)\[\/code\]/i", which replaces [code][/code] itself with <textarea></textarea>, is also skipped, and the output on the website is [code]bl \n ah[i]test[/i][/code] instead of <textarea>bl \n ah[i]test[/i]</textarea>. I could work around this by adding these two rules to the array:
"/\[code\]/i" => "<div align=\"center\"><center><table border=\"1\" cellpadding=\"0\" cellspacing=\"0\" width=\"600\"><tr bgcolor=\"magenta\"><td>Codeblock:</td></tr><tr><td><form><textarea rows=\"10\" cols=\"82\">",
"/\[\/code\]/i" => "</textarea></form></td></tr></table></center></div>",
This replaces [code] and [/code] separately, with the ugly result that the website looks like garbage if someone forgets to write [/code]. It looks as if the rule "/\[code\](.*?)\[\/code\]/i" no longer matches within the same array because of (*SKIP)(*FAIL), so why does replacing each tag on its own work?
There must be a better solution...what am I missing?
You are dealing with one string at a time, and preg_replace does not know about the UBB markup, so it replaces every \n occurrence with <br>. The only way I can see to do this in PHP is something like preg_replace_callback_array:
$bbformat = array(
"/\\r?\\n/i" => function ($m) {
return "<br>";
},
"/\[i\](.*?)\[\/i\]/i" => function ($m) {
return "<i>".$m[1]."</i>";
},
"/\[code\](.*?)\[\/code\]/i" => function ($m) {
return "<div align=\"center\"><center><table border=\"1\" cellpadding=\"0\" cellspacing=\"0\" width=\"600\"><tr bgcolor=\"magenta\"><td>Codeblock:</td></tr><tr><td><form><textarea rows=\"10\" cols=\"82\">".str_replace("<br>", "\n", $m[1])."</textarea></form></td></tr></table></center></div>";
},
);
$text = preg_replace_callback_array($bbformat, $text);
echo $text;
But as you can see, you must convert the <br> back to \n again, which may consume a lot of memory in some cases.
However, you can adjust the regex itself so that the \n => <br> replacement is skipped inside the [code] block, by changing your first pattern from /\\r?\\n/i to /\\r?\\n+(?![^\[].*\])/i, or even /\\r?\\n+(?![^\[code\]].*\[\/code\])/i if you want to keep your [code] block.
So your final code may look like this:
$bbformat = array(
"/\\r?\\n+(?![^\[code\]].*\[\/code\])/i" => "<br>",
"/\[i\](.*?)\[\/i\]/i" => "<i>$1</i>",
"/\[code\](.*?)\[\/code\]/i" => "<div align=\"center\"><center><table border=\"1\" cellpadding=\"0\" cellspacing=\"0\" width=\"600\"><tr bgcolor=\"magenta\"><td>Codeblock:</td></tr><tr><td><form><textarea rows=\"10\" cols=\"82\">$1</textarea></form></td></tr></table></center></div>",
);
foreach($bbformat as $match=>$replacement){
$text = preg_replace($match, $replacement, $text);
}
echo $text;
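A different way to get the same effect, sketched here as an alternative rather than as the method from the answer above: split the text on complete [code]...[/code] blocks first, then apply the formatting rules only to the pieces outside them. This avoids the lookahead, whose [^\[code\]] part is a character class (one character that is not [, c, o, d, e or ]) rather than the literal tag. The function name format_ubb, the shortened textarea markup, and the htmlspecialchars() call are assumptions added for illustration.

```php
<?php
// Hypothetical helper: format UBB text but leave [code]...[/code] bodies untouched.
function format_ubb($text)
{
    // With PREG_SPLIT_DELIM_CAPTURE, odd-indexed parts are complete code blocks.
    $parts = preg_split('/(\[code\].*?\[\/code\])/is', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
    $out = '';
    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            // Code block: drop the tags, escape the body, keep newlines as stored.
            $body = preg_replace('/^\[code\]|\[\/code\]$/i', '', $part);
            $out .= '<textarea rows="10" cols="82">' . htmlspecialchars($body) . '</textarea>';
        } else {
            // Ordinary text: apply the usual rules.
            $part = preg_replace('/\[i\](.*?)\[\/i\]/is', '<i>$1</i>', $part);
            $out .= preg_replace('/\r?\n/', '<br>', $part);
        }
    }
    return $out;
}

echo format_ubb("a\nb[code]x\n[i]y[/i][/code][i]z[/i]"), "\n";
```

An unclosed [code] never matches the split pattern, so it stays visible as literal text instead of opening a textarea that swallows the rest of the page.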
I have a title="" attribute in an anchor that contains HTML. I'm trying to remove the title attribute entirely but for whatever reason the preg replace I'm using will not work. I've tried:
$output = preg_replace( '/title=\"(.*?)\"/', '', $output );
$output = preg_replace( '/\title="(.*?)"/', '', $output );
$output = preg_replace( '` title="(.+)"`', '', $output );
None of the above works, but I can use something like:
$output = str_replace( 'title', 'class', $output );
Just to prove that I was able to do something ( and I wasn't uploading the wrong file or something ). Output looks like this:
<a href="#" title="<table border=\"0\" width=\"100%\" cellspacing=\"0\" cellpadding=\"0\">
<tbody>
<tr>
<td colspan=\"2\" align=\"center\" valign=\"top\"></td>
</tr>
<tr>
<td valign=\"top\" width=\"50%\">
table content
</td>
<td valign=\"top\" width=\"50%\">
table content
</td>
</tr>
</tbody>
</table>">Link Title</a>
So what I'm trying to do is filter $output and remove the title attribute entirely including everything inside the title attribute. Why will the preg_replace() above not work and what are my options?
I would not use a regex to do operations on [x]html, I'd use a html parser instead.
But if you still want to use a regex then you can use a regex like this:
title="[\s\S]*?"
You can have this code:
$re = "/title=\"[\\s\\S]*?\"/";
$str = "Link Title";
$subst = "";
$result = preg_replace($re, $subst, $str);
Update: You can see a clear example of why you shouldn't use regex to parse HTML in Andrei P.'s comment.
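A minimal sketch of why the dot-based attempts in the question can fail on a multi-line title: by default . does not match a newline, so (.*?) cannot reach the closing quote. The sample markup is hypothetical and assumes the attribute value contains no double quote; with the s modifier (or [\s\S] as above) the match can span lines.

```php
<?php
// Hypothetical sample: a title attribute whose value spans two lines.
$output = "<a href=\"#\" title=\"<b>line one\nline two</b>\">Link Title</a>";

// Without the s modifier, . stops at the newline, so nothing is replaced.
$noFlag = preg_replace('/\s*title="(.*?)"/', '', $output);

// With the s modifier, . also matches newlines and the attribute is removed.
$withS = preg_replace('/\s*title="(.*?)"/s', '', $output);

var_dump($noFlag === $output);
echo $withS, "\n";
```

If the value can contain literal double quotes, any pattern that stops at the first quote will cut it short, which is one more reason to prefer a parser.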
I have sets of HTML anchor elements enclosing image elements. For each set, using PHP-CLI, I want to pull the URLs and classify them according to their types. The type of anchor can only be determined by an attribute of its child image element. It would be easy if there was only one of each type per set. My problem is when two anchor elements of one type are separated by one or more of the other types. My non-greedy parenthesized sub-pattern seems to become greedy and expands to find the second relevant child attribute. In my test script I'm trying to pull the 'Userlink' URLs from amongst the other types. Using a simple pattern like:
#<a href="(.*?)" custattr="value1"><img alt="Userlink"#
On a set like:
<li><img alt="Userlink" class="common_link_class" height="123" src="pic0.png" width="123" style="width: 123px;"></li><li><img alt="Socnet1" class="common_link_class" height="123" src="pic1.png" width="123" style="width: 123px;"></li><li><img alt="Socnet2" class="common_link_class" height="123" src="pic2.png" width="123" style="width: 123px;"></li><li><img alt="Usermail" class="common_link_class" height="123" src="pic3.png" width="123" style="width: 123px;"></li><li><img alt="Userlink" class="common_link_class" height="123" src="pic4.png" width="123" style="width: 123px;"></li>
(sorry, but the actual html is on one line like that)
My sub-pattern captures from the beginning of the first "Userlink" URL to the end of the last one.
I've tried many variations of look-aheads, not sure I should list them all here. So far they've either returned no match at all or the same as described above.
Here's my test script (running in a Bash shell):
#!/usr/bin/php
<?php
$lines = 0;
$input = "";
$matches = array();
while ($line = fgets(STDIN)){
$input .= $line;
$lines++;
}
fwrite(STDERR, "Processing $lines\n");
$pcre = '#<a href="(.*?)" custattr="value1"><img alt="Userlink"#';
if (preg_match_all($pcre,$input,$matches)){
fwrite(STDERR, "\$matches has " . count($matches) . " elements\n");
foreach ($matches[1] as $match){
fwrite(STDOUT, $match . "\n");
}
}
?>
What PCRE pattern for PHP's preg_match_all() would return the two "Userlink" URLs in the above example?
I have taken the liberty of changing your variable names:
$pattern = '~<a href="([^"]++)" custattr="value1"><img alt="Userlink"~';
if ($nb = preg_match_all($pattern, $input, $matches)) {
fwrite(STDERR, "\$matches has " . $nb . " elements\n");
fwrite(STDOUT, implode("\n", $matches[1]) . "\n");
}
Note that the preg_match_all function returns the number of matches.
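To illustrate why the lazy (.*?) appears to turn greedy, here is a small sketch on made-up input (the example.com URLs and the shortened img tags are assumptions, not the asker's data). A lazy quantifier still expands as far as needed for the rest of the pattern to match, whereas [^"] cannot leave the href value:

```php
<?php
// Hypothetical input: two anchor types on one line.
$input = '<li><a href="http://example.com/mail" custattr="value1"><img alt="Usermail"></a></li>'
       . '<li><a href="http://example.com/user" custattr="value1"><img alt="Userlink"></a></li>';

// Lazy dot: starts at the first anchor and stretches until "Userlink" follows.
preg_match_all('#<a href="(.*?)" custattr="value1"><img alt="Userlink"#', $input, $lazy);

// Negated class: the capture can never cross the closing quote of href.
preg_match_all('#<a href="([^"]+)" custattr="value1"><img alt="Userlink"#', $input, $strict);

print_r($lazy[1]);
print_r($strict[1]);
```

The first capture spans both anchors, while the second contains only the Userlink URL.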
This regex should work -
<a href="([^"]*?)"[^>]*\><img alt="Userlink"
Testing it -
$pcre = '/<a href="([^"]*?)"[^>]*\><img alt="Userlink"/';
if (preg_match_all($pcre,$input,$matches)){
var_dump($matches);
//$matches[1] will be the array containing the urls.
}
/*
OUTPUT-
array
0 =>
array
0 => string '<a href="http://www.userlink1.com/my/page.html" custattr="value1"><img alt="Userlink"' (length=85)
1 => string '<a href="http://www.userlink2.com/my/page.html" custattr="value1"><img alt="Userlink"' (length=85)
1 =>
array
0 => string 'http://www.userlink1.com/my/page.html' (length=37)
1 => string 'http://www.userlink2.com/my/page.html' (length=37)
*/
I am stuck on a regular expression problem.
I have a huge HTML file and I need to extract some text (model numbers) from it.
<table>......
<td colspan="2" align="center" class="thumimages"><b>SK10014</b></td></tr>
.......
<table>/.....
<td colspan="2" align="center" class="thumimages"><b>SK1998</b></td></tr>
.... so on
This is a huge page; the whole site is built with tables and has no divs.
The class "thumimages" repeats in almost every td, so there is no way to single out the required content by class alone.
There are about 10000 model numbers and I need to extract them.
Is there any way to do this with a regex, like
"/<td colspan="2" align="center" class="thumimages"><b>{[1-9]}</b></td></tr>/"
and return an array of all the matched results? Note that I have tried HTML parsing, but the document contains too many HTML validation errors.
Any help would be greatly appreciated.
Description
This will match all td fields with class="thumimages" and retrieve the contents of the inner b tag. The inner text needs to have some value, and any leading or trailing spaces will be removed.
<td\b(?=\s)(?=[^>]*\s\bclass=(["'])thumimages\1)[^>]*><b>\s*(?!<)([^<\s]+)\s*<\/b><\/td>
Groups
Group 0 gets the entire td tag from open tag to close tag
Group 1 gets the open quote around the class value to ensure the correct closing quote is also found
Group 2 gets the desired text
PHP Code Example:
Input text
<table>......
<td colspan="2" align="center" class="thumimages"><b>SK10014</b></td></tr>
.......
<table>/.....
<td colspan="2" align="center" class="thumimages"><b> </b></td></tr>
<table>/.....
<td colspan="2" align="center" class="thumimages"><b> SK1998 </b></td></tr>
Code
<?php
$sourcestring="your source string";
preg_match_all('/<td\b(?=\s)(?=[^>]*\s\bclass=(["\'])thumimages\1)[^>]*><b>\s*(?!<)([^<\s]+)\s*<\/b><\/td>/imsx',$sourcestring,$matches);
echo "<pre>".print_r($matches,true);
?>
Matches
$matches Array:
(
[0] => Array
(
[0] => <td colspan="2" align="center" class="thumimages"><b>SK10014</b></td>
[1] => <td colspan="2" align="center" class="thumimages"><b> SK1998 </b></td>
)
[1] => Array
(
[0] => "
[1] => "
)
[2] => Array
(
[0] => SK10014
[1] => SK1998
)
)
Method with DOMDocument:
// $html stands for your html content
$doc = new DOMDocument();
@$doc->loadHTML($html);
$td_nodes = $doc->getElementsByTagName('td');
foreach($td_nodes as $td_node){
if ($td_node->getAttribute('class')=='thumimages')
echo $td_node->firstChild->textContent.'<br/>';
}
Method with regex:
$pattern = <<<'LOD'
~
<td (?>[^>c]++|\bc(?!lass\b))+ # beginning of td tag until the word "class"
class \s*+ = \s*+ # "class=" with variable spaces around the "="
(["']?+) thumimages\b \1 # "thumimages" between quotes or not
(?>[^>]++|(?<!b)>)+> # all characters until the ">" from "<b>"
\s*+ \K # any spaces and pattern reset
[^<\s]++ # all chars that are not a "<" or a space
~xi
LOD;
preg_match_all($pattern, $html, $matches);
echo '<pre>' . print_r($matches[0], true);
/(<td colspan="2" align="center" class="thumimages"><b>)([a-z0-9]+)(<\/b><\/td><\/tr>)/i
This works.
You can use php DOMDocument Class
<?php
$dom = new DOMDocument();
@$dom->loadHTMLFile('load.html');
$xpath = new DOMXPath($dom);
foreach($xpath->query('//tr') as $tr){
echo $xpath->query('.//td[@class="thumimages"]', $tr)->item(0)->nodeValue.'<br/>';
}
?>
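A self-contained variant of the DOM approach, as a sketch only: it queries the b elements directly, so rows without a matching td cannot cause a null dereference, and it uses libxml_use_internal_errors() in place of the @ operator. The inline sample markup and the $models name are assumptions for illustration; real input would come from the file.

```php
<?php
// Small sample modelled on the question's markup.
$html = '<table>'
      . '<tr><td colspan="2" align="center" class="thumimages"><b>SK10014</b></td></tr>'
      . '<tr><td colspan="2" align="center" class="thumimages"><b> SK1998 </b></td></tr>'
      . '</table>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$models = array();
foreach ($xpath->query('//td[@class="thumimages"]/b') as $b) {
    $text = trim($b->textContent);  // drop padding around the model number
    if ($text !== '') {
        $models[] = $text;
    }
}
print_r($models);
```

Note that @class="thumimages" is an exact comparison, so a td with several classes would need a contains() test instead.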
I've got some HTML that looks like this:
<tr class="row-even">
<td align="center">abcde</td>
<td align="center"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></td>
</tr>
<tr class="row-odd">
<td align="center">efgh</td>
<td align="center"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></td>
</tr>
<tr class="row-even">
<td align="center">ijkl</td>
<td align="center"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></td>
</tr>
And I need to retrieve the values, abcde, efgh, and ijkl
This is the regex I'm currently using:
preg_match_all('/(<tr class="row-even">|<tr class="row-odd">)<td align="center">(.*)<\/td><\/tr>/xs', $html, $matches);
Yes, I'm not very good at them. As with most of my regex attempts, this is not working. Can anyone tell me why?
Also, I know about html/xml parsers, but it would require a significant code revisit to make that happen. So that's for later. We need to stick with regex for now.
EDIT: To clarify, I need the values between the first <td align="center"></td> tag after either <tr class="row-even"> or <tr class="row-odd">
~<tr class="row-(even|odd)">\s*<td align="center">(.*?)</td>~m
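Notice the use of \s* to absorb the newline and indentation between the tags. (The m modifier only changes how ^ and $ behave, so it has no effect on this particular pattern.)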
Also, you can make the first group non-capturing via ?:. I.e., (?:even|odd) as you're probably not interested in the class attribute :)
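A quick check of that pattern against markup shaped like the question's, using the non-capturing group. The two-row sample and its indentation are assumptions made for this sketch:

```php
<?php
// Two rows copied in shape from the question, with assumed indentation.
$html = <<<'HTML'
<tr class="row-even">
    <td align="center">abcde</td>
    <td align="center"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></td>
</tr>
<tr class="row-odd">
    <td align="center">efgh</td>
    <td align="center"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></td>
</tr>
HTML;

// \s* absorbs the line break and indentation between <tr> and the first <td>.
preg_match_all('~<tr class="row-(?:even|odd)">\s*<td align="center">(.*?)</td>~', $html, $matches);
print_r($matches[1]);
```

Because the group is non-capturing, the wanted values are in $matches[1] rather than $matches[2].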
Try this:
preg_match_all('/(?:<tr class="row-even">|<tr class="row-odd">).<td align="center">(.*?)<\/td>/s', $html, $matches);
Changes made:
You've not accounted for the newline between the tags.
You don't need the x modifier, as it discards the whitespace in the regex.
Make the matching non-greedy by using .*? in place of .*.
Actually, you don't need that big a change to your codebase. Fetching text nodes is always the same with DOM and XPath. All that changes is the XPath query, so you could wrap the DOM code into a function that replaces your preg_match_all. That would be just a tiny change, e.g.
include_once "dom.php";
$matches = dom_match_all('//tr/td[1]', $html);
where dom.php just contains:
// dom.php
function dom_match_all($query, $html, array $matches = array()) {
$dom = new DOMDocument;
libxml_use_internal_errors(TRUE);
$dom->loadHTML($html);
libxml_clear_errors();
$xPath = new DOMXPath($dom);
foreach( $xPath->query($query) as $node ) {
$matches[] = $node->nodeValue;
}
return $matches;
}
and would return
Array
(
[0] => abcde
[1] => efgh
[2] => ijkl
)
But if you want a Regex, use a Regex. I am just giving ideas.
This is what I came up with
<td align="center">([^<]+)</td>
I'll explain. One of the challenges here is that what's between the td tags could be either the text you're looking for or an <img> tag. In the regex, [^<]+ says to match one or more characters that are not the < character. That's great, because it means the <img> won't match, and the group will only match until the </td> tag is found.
Disclaimer: Using regexps to parse HTML is dangerous.
To get the innerhtml of the first TD in each TR, use this regexp:
/<tr[^>]*>\s*<td[^>]*>(.+?)<\/td>/si
This is just a quick and dirty regex to meet your needs. It could easily be cleaned up and optimized, but it's a start.
<tr[^>]+>[^\n]*\n #Match the opening <tr> tag
\s*<td[^>]+>([^<]+)[^\n]+\n #Group the wanted data
[^\n]+\n #Match next line
</tr> #Match closing tag
Here is an alternative way, which may be more robust:
deluserconfirm.html\?user=([^"]+)