need help in Regular Expression - php

I am in a weird scenario where I need to show content in multiple columns. I am using the CSS3 column-count property, and the jQuery Columnizer plugin for older versions of IE.
The problem is that I do not have complete control over the data, as it is served by an external web service.
In most cases the content is wrapped in multiple paragraph tags:
Content#1
<p><strong>Heading</strong><br>This is a content</p>
<p><strong>Heading</strong><br>This is a content</p>
But in a few cases the data is not wrapped in <p> tags and looks like this:
Content#2
<strong>Day 1: xyz </strong><br>
lorem lipsum <br> <br>
<strong>Dag 2: lorem lipsum</strong><br>
Morgonflyg till Arequipa i södra Peru.
<br> <br>
The real problem is that the jQuery Columnizer plugin hangs the browser when it is asked to columnize such markup.
I want to transform Content#2 into Content#1 with a regular expression, i.e. wrap the contents into sensible paragraphs. I hope I have made myself clear.
I am using PHP.
Thank you in advance!

Your content is not stable, and a regular expression won't do magic with content that varies like this. Since you receive the data from another website, there is a good chance that it will someday return a different pattern, and your rules will stop working. You need a reliable source to get a reliable result.
This is crude string manipulation, but it will get you what you need as long as the pattern stays consistent. I still insist that you should use a reliable source.
$str = "<strong>Day 1: xyz </strong><br>
lorem lipsum <br> <br>
<strong>Dag 2: lorem lipsum</strong><br>
Morgonflyg till Arequipa i södra Peru.
<br> <br> ";
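function parse($data)
{
    // Content that already starts with a paragraph needs no conversion.
    if (substr(ltrim($data), 0, 3) == "<p>") return $data;
    $chunks = explode("<strong>", $data);
    $out = array();
    foreach ($chunks as $i => $chunk)
    {
        // explode() yields an empty first chunk when the data starts with <strong>.
        if (trim($chunk) === '') continue;
        // explode() removed the delimiter, so put the opening tag back.
        $item = ($i > 0 ? "<strong>" : "") . $chunk;
        // strpos() returns false (not -1) when nothing is found.
        $last_br = strpos($item, "<br> <br>");
        if ($last_br !== false) { $item = substr($item, 0, $last_br); }
        $out[] = "<p>" . trim($item) . "</p>";
    }
    return implode("\n", $out);
}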
echo parse($str);

You can use this pattern:
/(?<!^<p>)(<strong>.*?)(<strong>.*)$/gs
Demo
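Note that the exclusion in the negative lookbehind will ONLY work if your string starts with a <p>, so consider trimming it before applying the regex. PHP's preg functions do not accept the g flag: preg_replace() and preg_match_all() already work globally, so keep only the s modifier there.
The <br> tags have to be removed with another regex or with str_replace().
Also, consider using an approach other than regex to parse HTML, such as a DOM parser.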
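As a rough sketch of the wrapping step itself, and assuming that a run of two or more <br> tags always marks a paragraph break in the loose markup, the content can be split on those runs and each piece wrapped in <p>. The function name wrapInParagraphs is hypothetical, not something from the question or the answers.

```php
<?php
// Hypothetical helper: wraps loose "<strong>...</strong><br>text<br> <br>" runs in <p> tags.
function wrapInParagraphs(string $html): string
{
    $html = trim($html);
    // Leave content alone when it already starts with a paragraph.
    if (preg_match('#^<p[\s>]#i', $html)) {
        return $html;
    }
    // Two or more <br> tags in a row (whitespace allowed between them) mark a break.
    $parts = preg_split('#(?:<br\s*/?>\s*){2,}#i', $html, -1, PREG_SPLIT_NO_EMPTY);
    $out = [];
    foreach ($parts as $part) {
        $part = trim($part);
        if ($part !== '') {
            $out[] = '<p>' . $part . '</p>';
        }
    }
    return implode("\n", $out);
}

$loose = "<strong>Day 1: xyz </strong><br>\nlorem lipsum <br> <br>\n"
       . "<strong>Dag 2: lorem lipsum</strong><br>\nMorgonflyg till Arequipa.\n<br> <br> ";
$wrapped = wrapInParagraphs($loose);
```

A single <br> after the heading is kept inside the paragraph, which matches the shape of Content#1. If the feed ever separates paragraphs differently, the split pattern would need to change.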

Get the feed value from particular string in php

I have the below feed value
<item>
<description><strong>Contact Number:</strong> +91-00-000-000<br /><br /><strong>Rate:</strong> xx.xx<br /><br /><strong>Fees and Comments:<br /></strong><ul><li>$0 fees</li><li>Indicative Exchange Rate</li></description>
</item>
Now I want to get the contact number, the rate, and the fees and comments as separate values.
How can I get these values?
Description
You should probably read this with a parsing engine. However, if your use case is this simple, this regex will capture each of the fields and allow the fields to appear in any order:
^(?=.*?Contact\sNumber:<\/strong>([^<]*))(?=.*?Rate:<\/strong>([^<]*))(?=.*?Fees\sand\sComments:.*?<li>([^<]*)<.*?<li>([^<]*)<)
Live Example: http://www.rubular.com/r/j0aStij3L8
It kind of depends on what reliable patterns there are to the rest of your feed (or future feeds). It doesn't look like an XML parser is going to work here as the example doesn't look like well formed XML.
A good way to start is using explode to split the string into an array of strings; it looks like <br /> is a good delimiter to split on. So this would look like:
$split_feed = explode("<br />",$feed);
where $feed is your feed input in the question, and $split_feed will be your output array.
Then, from that split feed, you can use strpos (or stripos) to test for keys in your string, to determine which field it references, and replace to get the value out of the key/value string.
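The explode-then-search idea above might look roughly like this. It is only a sketch: the function name parseFeedFields is hypothetical, and it assumes each simple field has the form "<strong>Label:</strong> value" inside one <br />-delimited piece.

```php
<?php
// Hypothetical helper: pulls "Label: value" pairs out of <br />-separated pieces.
function parseFeedFields(string $feed, array $labels): array
{
    $fields = [];
    foreach (explode('<br />', $feed) as $piece) {
        // Drop the markup so only "Label: value" text is left.
        $text = trim(strip_tags($piece));
        foreach ($labels as $label) {
            // stripos() returns false when the label is absent, so compare strictly.
            $pos = stripos($text, $label . ':');
            if ($pos !== false) {
                // Everything after "Label:" is the value.
                $fields[$label] = trim(substr($text, $pos + strlen($label) + 1));
            }
        }
    }
    return $fields;
}

$feed = '<strong>Contact Number:</strong> +91-00-000-000<br /><br /><strong>Rate:</strong> xx.xx<br /><br /><strong>Fees and Comments:<br /></strong><ul><li>$0 fees</li><li>Indicative Exchange Rate</li>';
$fields = parseFeedFields($feed, ['Contact Number', 'Rate']);
```

The "Fees and Comments" field is not handled here, because its list items land in a different piece than its label; it would need separate treatment.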
I think this is what you want:
<?php
$value = '<strong>Contact Number:</strong> +91-00-000-000<br /><br />
<strong>Rate:</strong> xx.xx<br /><br />
<strong>Fees and Comments:<br /></strong><ul><li>$0 fees</li>
<li>Indicative Exchange Rate</li>';
$steps = explode('<br /><br />', $value);
$step_2_for_contact_number = explode('</strong>', $steps[0]);
$contact_number = $step_2_for_contact_number[1];
$step_for_rate = explode('</strong>', $steps[1]);
$rate = $step_for_rate[1];
$feed_n_comment_s_1 = explode('</li>', $steps[2]);
$feed_n_comment_s_2 = explode('<li>', $feed_n_comment_s_1[0]);
$feed_n_comment = $feed_n_comment_s_2[1];
echo $contact_number;
echo "<br/>";
echo $rate;
echo "<br/>";
echo $feed_n_comment;
?>
You can also have a look at this pattern: (uses named groups)
(?<key>[a-zA-Z\d\s]+)(?=\:).*?\>(?<value>[^<]+)
Live Demo
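For illustration, the named-group pattern could be applied with preg_match_all() roughly as follows. The ~ delimiter and the variable names are choices made for this sketch, and the feed is assumed to be on a single line.

```php
<?php
$feed = '<strong>Contact Number:</strong> +91-00-000-000<br /><br /><strong>Rate:</strong> xx.xx<br /><br /><strong>Fees and Comments:<br /></strong><ul><li>$0 fees</li><li>Indicative Exchange Rate</li>';

// The pattern from the answer, wrapped in ~ delimiters.
$pattern = '~(?<key>[a-zA-Z\d\s]+)(?=\:).*?\>(?<value>[^<]+)~';

$pairs = [];
// PREG_SET_ORDER gives one array per match, with the named groups as keys.
if (preg_match_all($pattern, $feed, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        $pairs[trim($match['key'])] = trim($match['value']);
    }
}
```

With this pattern only the first list item ends up as the value of "Fees and Comments"; the remaining items would have to be collected separately.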

strip_tags disallow some tags

Based on the strip_tags documentation, the second parameter takes the allowable tags. In my case, however, I want to do the reverse: accept the tags that strip_tags would normally accept, but strip only the <script> tag. Is there any possible way to do this?
I'm not asking somebody to code it for me; input on possible ways to achieve this (if possible) is greatly appreciated.
EDIT
To use the HTML Purifier HTML.ForbiddenElements config directive, it seems you would do something like:
require_once '/path/to/HTMLPurifier.auto.php';
$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.ForbiddenElements', array('script','style','applet'));
$purifier = new HTMLPurifier($config);
$clean_html = $purifier->purify($dirty_html);
http://htmlpurifier.org/docs
HTML.ForbiddenElements should be set to an array. What I don't know is what form the array members should take:
array('script','style','applet')
Or:
array('<script>','<style>','<applet>')
Or... Something else?
I think it's the first form, without delimiters; HTML.AllowedElements uses a form of configuration string somewhat common to TinyMCE's valid elements syntax:
tinyMCE.init({
...
valid_elements : "a[href|target=_blank],strong/b,div[align],br",
...
});
So my guess is it's just the term, and no attributes should be provided (since you're banning the element... although there is a HTML.ForbiddenAttributes, too). But that's a guess.
I'll add this note from the HTML.ForbiddenAttributes docs, as well:
Warning: This directive complements %HTML.ForbiddenElements,
accordingly, check out that directive for a discussion of why you
should think twice before using this directive.
Blacklisting is just not as "robust" as whitelisting, but you may have your reasons. Just beware and be careful.
Without testing, I'm not sure what to tell you. I'll keep looking for an answer, but I will likely go to bed first. It is very late. :)
Although I think you really should use HTML Purifier and utilize its HTML.ForbiddenElements configuration directive, I think a reasonable alternative if you really, really want to use strip_tags() is to derive a whitelist from the blacklist. In other words, remove what you don't want and then use what's left.
For instance:
function blacklistElements($blacklisted = '', &$errors = array()) {
if ((string)$blacklisted == '') {
$errors[] = 'Empty string.';
return array();
}
$html5 = array(
"<menu>","<command>","<summary>","<details>","<meter>","<progress>",
"<output>","<keygen>","<textarea>","<option>","<optgroup>","<datalist>",
"<select>","<button>","<input>","<label>","<legend>","<fieldset>","<form>",
"<th>","<td>","<tr>","<tfoot>","<thead>","<tbody>","<col>","<colgroup>",
"<caption>","<table>","<math>","<svg>","<area>","<map>","<canvas>","<track>",
"<source>","<audio>","<video>","<param>","<object>","<embed>","<iframe>",
"<img>","<del>","<ins>","<wbr>","<br>","<span>","<bdo>","<bdi>","<rp>","<rt>",
"<ruby>","<mark>","<u>","<b>","<i>","<sup>","<sub>","<kbd>","<samp>","<var>",
"<code>","<time>","<data>","<abbr>","<dfn>","<q>","<cite>","<s>","<small>",
"<strong>","<em>","<a>","<div>","<figcaption>","<figure>","<dd>","<dt>",
"<dl>","<li>","<ul>","<ol>","<blockquote>","<pre>","<hr>","<p>","<address>",
"<footer>","<header>","<hgroup>","<aside>","<article>","<nav>","<section>",
"<body>","<noscript>","<script>","<style>","<meta>","<link>","<base>",
"<title>","<head>","<html>"
);
$list = trim(strtolower($blacklisted));
$list = preg_replace('/[^a-z ]/i', '', $list);
$list = '<' . str_replace(' ', '> <', $list) . '>';
$list = array_map('trim', explode(' ', $list));
return array_diff($html5, $list);
}
Then run it:
$blacklisted = '<html> <bogus> <EM> em li ol';
// $errors is filled by reference, so it has to be passed in.
$errors = array();
$whitelist = blacklistElements($blacklisted, $errors);
if (count($errors)) {
    echo "There were errors.\n";
    print_r($errors);
    echo "\n";
} else {
    // Do strip_tags() ...
}
http://codepad.org/LV8ckRjd
So if you pass in what you don't want to allow, it will give you back the HTML5 element list in an array form that you can then feed into strip_tags() after joining it into a string:
$stripped = strip_tags($html, implode('', $whitelist));
Caveat Emptor
Now, I've kind of hacked this together, and I know there are some issues I haven't thought through yet. For instance, from the strip_tags() man page for the $allowable_tags argument:
Note:
This parameter should not contain whitespace. strip_tags() sees a tag
as a case-insensitive string between < and the first whitespace or >.
It means that strip_tags("<br/>", "<br>") returns an empty string.
It's late and for some reason I can't quite figure out what this means for this approach, so I'll have to think about that tomorrow. I also compiled the HTML element list in the function's $html5 array from this MDN documentation page. Sharp-eyed readers might notice all of the tags are in this form:
<tagName>
I'm not sure how this will affect the outcome, or whether I need to take into account variations such as the short tag <tagName/> and some of the, ahem, odder variations. And, of course, there are more tags out there.
So it's probably not production ready. But you get the idea.
First, see what others have said on this topic:
Strip <script> tags and everything in between with PHP?
and
remove script tag from HTML content
It seems you have 2 choices, one is a Regex solution, both the links above give them. The second is to use HTML Purifier.
If you are stripping the script tag for some other reason than sanitation of user content, the Regex could be a good solution. However, as everyone has warned, it is a good idea to use HTML Purifier if you are sanitizing input.
PHP(5 or greater) solution:
If you want to remove <script> tags (or any other), and also you want to remove the content inside tags, you should use:
OPTION 1 (simplest):
preg_replace('#<script(.*?)>(.*?)</script>#is', '', $text);
OPTION 2 (more versatile):
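<?php
$html = "<p>Your HTML code</p><script>With malicious code</script>";
$dom = new DOMDocument();
$dom->loadHTML($html);
$script = $dom->getElementsByTagName('script');
// The node list is live, so collect the nodes first and remove them afterwards.
$remove = [];
foreach($script as $item)
{
    $remove[] = $item;
}
foreach($remove as $item)
{
    $item->parentNode->removeChild($item);
}
$html = $dom->saveHTML();
Then $html will contain:
<p>Your HTML code</p>
(wrapped in the doctype, <html> and <body> tags that loadHTML() adds by default).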
This is what I use to strip out a list of forbidden tags. It can both remove tags that wrap content and remove tags together with their content, and it trims leftover whitespace.
$description = trim(preg_replace([
# Strip tags around content
'/\<(.*)doctype(.*)\>/i',
'/\<(.*)html(.*)\>/i',
'/\<(.*)head(.*)\>/i',
'/\<(.*)body(.*)\>/i',
# Strip tags and content inside
'/\<(.*)script(.*)\>(.*)<\/script>/i',
], '', $description));
Input example:
$description = '<html>
<head>
</head>
<body>
<p>This distinctive Mini Chopper with Desire styling has a powerful wattage and high capacity which makes it a very versatile kitchen accessory. It also comes equipped with a durable glass bowl and lid for easy storage.</p>
<script type="application/javascript">alert('Hello world');</script>
</body>
</html>';
Output result:
<p>This distinctive Mini Chopper with Desire styling has a powerful wattage and high capacity which makes it a very versatile kitchen accessory. It also comes equipped with a durable glass bowl and lid for easy storage.</p>
I use the following:
function strip_tags_with_forbidden_tags($input, $forbidden_tags)
{
foreach (explode(',', $forbidden_tags) as $tag) {
$tag = preg_replace(array('/^</', '/>$/'), array('', ''), $tag);
$input = preg_replace(sprintf('/<%s[^>]*>([^<]+)<\/%s>/', $tag, $tag), '$1', $input);
}
return $input;
}
Then you can do:
echo strip_tags_with_forbidden_tags('<cancel>abc</cancel>xpto<p>def></p><g>xyz</g><t>xpto</t>', 'cancel,g');
Output: 'abcxpto<p>def></p>xyz<t>xpto</t>'
echo strip_tags_with_forbidden_tags('<cancel>abc</cancel> xpto <p>def></p> <g>xyz</g> <t>xpto</t>', 'cancel,g');
Outputs: 'abc xpto <p>def></p> xyz <t>xpto</t>'

How to Ignore Whitespaces using preg_match()

I have a string that looks like:
">ANY CONTENT</span>(<a id="show
I need to fetch ANY CONTENT. However, there are spaces in between
</span> and (<a id="show
Here is my preg_match:
$success = preg_match('#">(.*?)</span>\s*\(<a id="show#s', $basicPage, $content);
\s* represents spaces. I get an empty array!
Any idea how to fetch CONTENT?
Use a real HTML parser. Regular expressions are not really suitable for the job. See this answer for more detail.
You can use DOMDocument::loadHTML() to parse into a structured DOM object that you can then query, like this very basic example (you need to do error checking though):
$dom = new DOMDocument;
$dom->loadHTML($data);
$span = $dom->getElementsByTagName('span');
$content = $span->item(0)->textContent;
I just had to define this part more precisely:
">
There were too many "> sequences in the page, so the regex could not know which one I meant. It therefore returned everything from the first "> up to the (.
Solution:
.">
Sample:
$success = preg_match('#\.">(.*?)</span>\s*\(<a id="show#s', $basicPage, $content);
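To make the cause concrete, here is a small sketch with a made-up page fragment (the markup in $basicPage is an assumption, not the asker's real page). It shows that the lazy (.*?) still starts at the first "> in the input, and that either a more specific prefix or a character class that cannot cross tags narrows the match.

```php
<?php
// Hypothetical input: an earlier "> exists before the one we care about.
$basicPage = '<a class="x">link</a> <span class="name.">ANY CONTENT</span>   (<a id="show">more</a>)';

// Original pattern: matching begins at the FIRST "> and runs up to </span>.
preg_match('#">(.*?)</span>\s*\(<a id="show#s', $basicPage, $m);

// The fix from the answer: require a literal dot before the ">.
preg_match('#\.">(.*?)</span>\s*\(<a id="show#s', $basicPage, $m2);

// Alternative sketch: forbid < and > inside the capture so it cannot span tags.
preg_match('#">([^<>]*)</span>\s*\(<a id="show#', $basicPage, $m3);
```

Here $m[1] contains the leftover markup from the first link, while $m2[1] and $m3[1] hold only the span text. The [^<>]* variant does not depend on a dot happening to precede the quote.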

Replacing words with tag links in PHP

I have a text ($text) and an array of words ($tags). These words should be replaced in the text with links to other pages, without breaking the links that already exist in the text. CakePHP has a method in TextHelper for doing this, but it is broken: it corrupts the existing HTML links in the text. The method is supposed to work like this:
$text=Text->highlight($text,$tags,'\1',1);
Below there is existing code in CakePHP TextHelper:
function highlight($text, $phrase, $highlighter = '<span class="highlight">\1</span>', $considerHtml = false) {
if (empty($phrase)) {
return $text;
}
if (is_array($phrase)) {
$replace = array();
$with = array();
foreach ($phrase as $key => $value) {
$key = $value;
$value = $highlighter;
$key = '(' . $key . ')';
if ($considerHtml) {
$key = '(?![^<]+>)' . $key . '(?![^<]+>)';
}
$replace[] = '|' . $key . '|ix';
$with[] = empty($value) ? $highlighter : $value;
}
return preg_replace($replace, $with, $text);
} else {
$phrase = '(' . $phrase . ')';
if ($considerHtml) {
$phrase = '(?![^<]+>)' . $phrase . '(?![^<]+>)';
}
return preg_replace('|'.$phrase.'|i', $highlighter, $text);
}
}
You can see (and run) this algorithm here:
http://www.exorithm.com/algorithm/view/highlight
It can be made a little better and simpler with a few changes, but it still isn't perfect. Though less efficient, I'd recommend one of Ben Doom's solutions.
Replacing text in HTML is fundamentally different than replacing plain text. To determine whether text is part of an HTML tag requires you to find all the tags in order not to consider them. Regex is not really the tool for this.
I would attempt one of the following solutions:
Find the positions of all the words. Working from last to first, determine if each is part of a tag. If not, add the anchor.
Split the string into blocks. Each block is either a tag or plain text. Run your replacement(s) on the plain text blocks, and re-assemble.
I think the first one is probably a bit more efficient, but more prone to programmer error, so I'll leave it up to you.
If you want to know why I'm not approaching this problem directly, look at all the questions on the site about regex and HTML, and how regex is not a parser.
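The second option (split into tag and text blocks, replace only in the text blocks, re-assemble) could be sketched like this. The function name linkTags and the /tags/ URL scheme are hypothetical, and the sketch assumes reasonably well-formed markup with no > inside attribute values.

```php
<?php
// Hypothetical helper; the /tags/ URL scheme is an assumption for illustration.
function linkTags(string $html, array $tags): string
{
    if (!$tags) {
        return $html;
    }
    // Split into alternating text and tag blocks; the capture group keeps the tags.
    $blocks = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    // Build one alternation of all tag words, escaped for use inside /.../.
    $quoted = array_map(function ($t) { return preg_quote($t, '/'); }, $tags);
    $pattern = '/\b(' . implode('|', $quoted) . ')\b/i';
    $insideLink = 0;
    foreach ($blocks as $i => $block) {
        if ($i % 2 === 1) {
            // Odd indexes hold the tags: track whether we are inside an existing <a>.
            if (preg_match('/^<a\b/i', $block)) {
                $insideLink++;
            } elseif (preg_match('/^<\/a\s*>/i', $block)) {
                $insideLink = max(0, $insideLink - 1);
            }
            continue;
        }
        if ($insideLink === 0) {
            // Even indexes hold plain text: only here is it safe to add anchors.
            $blocks[$i] = preg_replace_callback($pattern, function ($m) {
                return '<a href="/tags/' . rawurlencode(strtolower($m[1])) . '">' . $m[1] . '</a>';
            }, $block);
        }
    }
    return implode('', $blocks);
}

$linked = linkTags('Learn <a href="/php">php here</a> and php <img alt="php">', ['php']);
```

Because replacements only touch the text blocks outside existing anchors, attribute values such as alt="php" and the text of existing links are left alone.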
This code works just fine. What you may need to do is check the CSS for <span class="highlight"> and make sure it is set to a color that lets you see that the text is highlighted.
.highlight { background-color: #FFE900; }
Amorphous - I noticed Gert edited your post. Are the two code fragments exactly as you posted them?
So even though the original code was designed for highlighting, I understand you're trying to repurpose it for generating links - it should, and does work fine for that (tested as posted).
HOWEVER escaping in the first code fragment could be an issue.
$text=Text->highlight($text,$tags,'\1',1);
Works fine... but if you use double quotes rather than single quotes, PHP treats the backslash as the start of an escape sequence, so the backslash itself has to be escaped. If you don't, you get %01 links.
The correct way with double quotes is:
$text=Text->highlight($text,$tags,"\\1",1);
(Notice the use of \\1 instead of \1.)
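The difference is easy to check in isolation. A minimal sketch (the variable names are just for illustration):

```php
<?php
$single  = '\1';   // single quotes: two characters, a backslash and "1"
$double  = "\1";   // double quotes: one character, the octal escape for byte 0x01
$escaped = "\\1";  // double quotes with an escaped backslash: two characters again

// rawurlencode() shows why the broken links contain %01.
$encoded = rawurlencode($double);
```

Only $single and $escaped reach preg_replace() as the backreference \1; $double arrives as a control character.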

PHP - Strings - Remove a HTML tag with a specific class, including its contents

I have a string like this:
<div class="container">
<h3 class="hdr"> Text </h3>
<div class="main">
text
<h3> text... </h3>
....
</div>
</div>
how do I remove the H3 tag with the .hdr class using as little code as possible ?
Using as little code as possible? Shortest code isn't necessarily best. However, if your HTML h3 tag always looks like that, this should suffice:
$html = preg_replace('#<h3 class="hdr">(.*?)</h3>#', '', $html);
Generally speaking, using regex for parsing HTML isn't a particularly good idea though.
Something like this is what you're looking for...
$output = preg_replace("#<h3 class=\"hdr\">(.*?)</h3>#is", "", $input);
Use "is" at the end of the regex: i makes it case-insensitive, and s lets . match newlines, so a heading that spans several lines is still matched.
Stumbled upon this via Google - for anyone else feeling dirty using regex to parse HTML, here's a DOMDocument solution I feel much safer with going:
function removeTagByClass(string $html, string $className) {
$dom = new \DOMDocument();
$dom->loadHTML($html);
$finder = new \DOMXPath($dom);
$nodes = $finder->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' {$className} ')]");
foreach ($nodes as $node) {
$node->parentNode->removeChild($node);
}
return $dom->saveHTML();
}
Thanks to this other answer for the XPath query.
try a preg_match, then a preg_replace on the following pattern:
/(<h3
[\s]+
[^>]*?
class=[\"\'][^\"\']*?hdr[^\"\']*?[\"\']
[^>]*?>
[\s\S\d\D\w\W]*?
<\/h3>)/i
It's messy, and it should work fine only if the h3 tag doesn't have inline javascript which might contain sequences that this regular expression will react to. It is far from perfect, but in simple cases where h3 tag is used it should work.
Haven't tried it though, might need adjustments.
Another way would be to copy that function, use your copy, without the h3, if it's possible.
This may help someone if the solutions above don't work. It removes an iframe and content with '-webkit-overflow-scrolling: touch;', as I had. :)
RegEx, or regular expressions is code for what you would like to remove, and PHP function preg_replace() will remove all div or divs matching, or replacing them with something else. In the examples below, $incoming_data is where you put all your content before removing elements, and $result is the final product. Basically we are telling the code to find all divs with class=”myclass” and replace them with ” ” (nothing).
How to remove a div and its contents by class in PHP
Just change “myclass” to whatever class your div has.
$result = preg_replace('#<div class="myclass">(.*?)</div>#', ' ',
$incoming_data);
How to remove a div and its contents by ID in PHP
Just change “myid” to whatever ID your div has.
$result = preg_replace('#<div id="myid">(.*?)</div>#', ' ', $incoming_data);
If your div has multiple classes?
Just change “myid” to whatever ID your div has like this.
$result = preg_replace('#<div id="myid(.*?)</div>#', ' ', $incoming_data);
or if div don’t have an ID, filter on the first class of the div like this.
$result = preg_replace('#<div class="myclass(.*?)</div>#', ' ', $incoming_data);
How to remove all headings in PHP
This is how to remove all headings.
$result = preg_replace('#<h1>(.*?)</h1>#', ' ', $incoming_data);
and if the heading have a class, do something like this:
$result = preg_replace('#<h1 class="myclass">(.*?)</h1>#', ' ', $incoming_data);
Source: http://www.lets-develop.com/html5-html-css-css3-php-wordpress-jquery-javascript-photoshop-illustrator-flash-tutorial/php-programming/remove-div-by-class-php-remove-div-contents/
$content = preg_replace('~(.*?)~', '', $content);
The code above only works if the opening and closing div tags are on the same line. What if they aren't?
$content = preg_replace('~[^|]*?~', '', $content);
This works even if there is a line break in between, but it fails if the rarely used | symbol appears in between. Does anyone know a better way?
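One option for the line-break case is the s modifier, which lets . match newlines, so no stand-in character class is needed. A minimal sketch, reusing the "myclass" example from above and assuming the div contains no nested divs:

```php
<?php
$incoming_data = "<p>keep</p>\n<div class=\"myclass\">line one\nline two</div>\n<p>also keep</p>";

// Without s, "." stops at the newline, so nothing is removed.
$without = preg_replace('#<div class="myclass">(.*?)</div>#', '', $incoming_data);

// With s, "." also matches newlines, so the whole div is removed.
$with = preg_replace('#<div class="myclass">(.*?)</div>#s', '', $incoming_data);
```

With nested divs the lazy match stops at the first </div>, which is one more reason to prefer a DOM parser for anything beyond simple markup.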
