Manipulate the content of HTML strings without changing the HTML - php

If I have a string of HTML, maybe like this...
<h2>Header</h2><p>all the <span class="bright">content</span> here</p>
And I want to manipulate the string so that all words are reversed for example...
<h2>redaeH</h2><p>lla eht <span class="bright">tnetnoc</span> ereh</p>
I know how to extract the string from the HTML and manipulate it by passing to a function and getting a modified result, but how would I do so whilst retaining the HTML?
I would prefer a non-language specific solution, but it would be useful to know php/javascript if it must be language specific.
Edit
I also want to be able to manipulate text that spans several DOM elements...
Quick<em>Draw</em>McGraw
warGcM<em>warD</em>kciuQ
Another Edit
Currently, I am thinking to somehow replace all HTML nodes with a unique token, whilst storing the originals in an array, then doing a manipulation which ignores the token, and then replacing the tokens with the values from the array.
This approach seems overly complicated, and I am not sure how to replace all the HTML without using REGEX which I have learned you can go to the stack overflow prison island for.
Yet Another Edit
I want to clarify an issue here. I want the text manipulation to happen over x number of DOM elements - so for example, if my formula randomly moves letters in the middle of a word, leaving the start and end the same, I want to be able to do this...
<em>going</em><i>home</i>
Converts to
<em>goonh</em><i>gmie</i>
So the HTML elements remain untouched, but the string content inside is manipulated (as a whole - so goinghome is passed to the manipulation formula in this example) in any way chosen by the manipulation formula.

If you want to achieve a similar visual effect without changing the text, you could cheat with CSS:
h2, p {
direction: rtl;
unicode-bidi: bidi-override;
}
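This reverses the displayed text of each matched element as a whole line (so the word order is mirrored too, not just the letters within each word); the markup itself is unchanged.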
example fiddle: http://jsfiddle.net/pn6Ga/

I ran into this situation a long time ago and used the following code. Here is a rough version:
<?php
function keepcase($word, $replace) {
$replace[0] = (ctype_upper($word[0]) ? strtoupper($replace[0]) : $replace[0]);
return $replace;
}
// regex - match the contents grouping into HTMLTAG and non-HTMLTAG chunks
$re = '%(</?\w++[^<>]*+>) # grab HTML open or close TAG into group 1
| # or...
([^<]*+(?:(?!</?\w++[^<>]*+>)<[^<]*+)*+) # grab non-HTMLTAG text into group 2
%x';
$contents = '<h2>Header</h2><p>the <span class="bright">content</span> here</p>';
// walk through the content, chunk by chunk, replacing words in non-HTMLTAG chunks only
$contents = preg_replace_callback($re, 'callback_func', $contents);
function callback_func($matches) { // here's the callback function
if ($matches[1]) { // Case 1: this is a HTMLTAG
return $matches[1]; // return HTMLTAG unmodified
}
elseif (isset($matches[2])) { // Case 2: a non-HTMLTAG chunk.
// reverse each word in the chunk, keeping the case of its first letter;
// a callback is used because the /e modifier was removed in PHP 7
return preg_replace_callback('/\b\w+\b/', function ($m) {
return keepcase($m[0], strrev($m[0]));
}, $matches[2]);
}
exit("Error!"); // never get here
}
echo ($contents);
?>

Parse the HTML with something that will give you a DOM API to it.
Write a function that loops over the child nodes of an element.
If a node is a text node, get the data as a string, split it on words, reverse each one, then assign it back.
If a node is an element, recurse into your function.
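The steps above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the names `transformText` and `reverseWords` are made up for the example, and the only assumption about the nodes is that they expose `nodeType`, `childNodes` and `nodeValue` the way DOM nodes do.

```javascript
// Node type constants as defined by the DOM (Node.ELEMENT_NODE, Node.TEXT_NODE).
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Hypothetical manipulation: reverse every run of word characters.
function reverseWords(text) {
  return text.replace(/\w+/g, (word) => word.split('').reverse().join(''));
}

// Walk the children of `node`; rewrite text nodes, recurse into elements.
function transformText(node, transform) {
  for (let i = 0; i < node.childNodes.length; i++) {
    const child = node.childNodes[i];
    if (child.nodeType === TEXT_NODE) {
      child.nodeValue = transform(child.nodeValue);
    } else if (child.nodeType === ELEMENT_NODE) {
      transformText(child, transform);
    }
  }
}
```

In a browser, the tree could come from `new DOMParser().parseFromString(html, 'text/html')`, and the result could be read back from `doc.body.innerHTML` after calling `transformText(doc.body, reverseWords)`. Because each text node is handled on its own, this version does not treat a word split across elements as one word.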

You could use jQuery:
$('div *').each(function(){
// caution: .text(value) replaces an element's children, so nested tags inside it are lost
var text = $(this).text();
text = text.split('');
text = text.reverse();
text = text.join('');
$(this).text(text);
});
See here - http://jsfiddle.net/GCAvb/

I implemented a version that seems to work quite well - although I still use (rather general and shoddy) regex to extract the html tags from the text. Here it is now in commented javascript:
Method
/**
* Manipulate text inside HTML according to passed function
* @param html the html string to manipulate
* @param manipulator the function to manipulate with (will be passed a single word)
* @returns manipulated string including unmodified HTML
*
* Currently limited in that manipulator operates on words determined by regex
* word boundaries, and must return same length manipulated word
*
*/
var manipulate = function(html, manipulator) {
var block, tag, words, i,
final = '', // used to prepare return value
tags = [], // used to store tags as they are stripped from the html string
x = 0; // used to track the number of characters the html string is reduced by during stripping
// remove tags from html string, and use callback to store them with their index
// then split by word boundaries to get plain words from original html
words = html.replace(/<.+?>/g, function(match, index) {
tags.unshift({
match: match,
index: index - x
});
x += match.length;
return '';
}).split(/\b/);
// loop through each word and build the final string
// appending the word, or manipulated word if not a boundary
for (i = 0; i < words.length; i++) {
final += i % 2 ? words[i] : manipulator(words[i]);
}
// loop through each stored tag, and insert into final string
for (i = 0; i < tags.length; i++) {
final = final.slice(0, tags[i].index) + tags[i].match + final.slice(tags[i].index);
}
// ready to go!
return final;
};
The function defined above accepts a string of HTML, and a manipulation function to act on words within the string regardless of if they are split by HTML elements or not.
It works by first removing all HTML tags, and storing the tag along with the index it was taken from, then manipulating the text, then adding the tags into their original position in reverse order.
Test
/**
* Test our function with various input
*/
var reverse, rutherford, shuffle, text, titleCase;
// set our test html string
text = "<h2>Header</h2><p>all the <span class=\"bright\">content</span> here</p>\nQuick<em>Draw</em>McGraw\n<em>going</em><i>home</i>";
// function used to reverse words
reverse = function(s) {
return s.split('').reverse().join('');
};
// function used by rutherford to return a shuffled array
shuffle = function(a) {
return a.sort(function() {
return Math.round(Math.random()) - 0.5;
});
};
// function used to shuffle the middle of words, leaving each end undisturbed
rutherford = function(inc) {
var m = inc.match(/^(.?)(.*?)(.)$/);
return m[1] + shuffle(m[2].split('')).join('') + m[3];
};
// function to make word Title Cased
titleCase = function(s) {
return s.replace(/./, function(w) {
return w.toUpperCase();
});
};
console.log(manipulate(text, reverse));
console.log(manipulate(text, rutherford));
console.log(manipulate(text, titleCase));
There are still a few quirks, like the heading and paragraph text not being recognized as separate words (because they are in separate block level tags rather than inline tags) but this is basically a proof of method of what I was trying to do.
I would also like it to be able to handle the string manipulation formula actually adding and removing text, rather than replacing/moving it (so variable string length after manipulation), but that opens up a whole new can of worms I am not yet ready for.
Now I have added some comments to the code, and put it up as a gist in javascript, I hope that someone will improve it - especially if someone could remove the regex part and replace with something better!
Gist: https://gist.github.com/3309906
Demo: http://jsfiddle.net/gh/gist/underscore/1/3309906/
(outputs to console)
And now finally using an HTML parser
(http://ejohn.org/files/htmlparser.js)
Demo: http://jsfiddle.net/EDJyU/
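For comparison, here is a rough sketch of the same idea without any regex over the markup, working on an already parsed tree. It is only an illustration under assumptions: `collectTextNodes` and `manipulateAcrossNodes` are hypothetical names, the nodes are assumed to expose `nodeType`, `childNodes` and `nodeValue` like DOM nodes, and the manipulator must return a word of the same length (the same limitation as the function above).

```javascript
// Gather all text nodes under `node`, in document order.
function collectTextNodes(node, out = []) {
  for (const child of Array.from(node.childNodes || [])) {
    if (child.nodeType === 3) {
      out.push(child);                 // text node
    } else if (child.nodeType === 1) {
      collectTextNodes(child, out);    // element: descend
    }
  }
  return out;
}

// Join the text of all text nodes, manipulate each word of the joined
// string, then hand the characters back to the nodes by original length.
function manipulateAcrossNodes(root, manipulator) {
  const nodes = collectTextNodes(root);
  const original = nodes.map((n) => n.nodeValue).join('');
  const changed = original.replace(/\w+/g, (word) => manipulator(word));
  if (changed.length !== original.length) {
    throw new Error('manipulator must preserve the text length');
  }
  let offset = 0;
  for (const n of nodes) {
    const len = n.nodeValue.length;
    n.nodeValue = changed.slice(offset, offset + len);
    offset += len;
  }
}
```

As with the index-based version, the tags stay at the same character offsets, so a word such as goinghome that spans two elements is passed to the manipulator in one piece and then split back at the original boundary.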

You can use setInterval to change the text at a fixed interval (every 2 seconds here), for example:
const TITTLE = document.getElementById("Tittle") // get the div
setInterval(()=> {
let TITTLE2 = document.getElementById("rotate") // get the current span at the moment of execution
let spanTittle = document.createElement("span"); // create the new "span" element
spanTittle.setAttribute("id","rotate"); // give the new element the same id
(TITTLE2.textContent == "TEXT1") // check which string is currently shown
? spanTittle.appendChild(document.createTextNode(`TEXT2`))
: spanTittle.appendChild(document.createTextNode(`TEXT1`))
TITTLE.replaceChild(spanTittle,TITTLE2) // finally, replace the old span with the new one
},2000)
<html>
<head></head>
<body>
<div id="Tittle">TEST YOUR <span id="rotate">TEXT1</span></div>
</body>
</html>

Related

XHP with Regex for link Replacement

I am trying to implement a simple function that given a text input, returns the text modified with xhp_a when a link is detected, within a paragraph xhp_p.
Consider this class
class Urlifier {
protected static $reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
public static function convertParagraphWithLink(?string $input):xhp_p{
if (!$input)
return <p></p>;
else
{
if (preg_match(self::$reg_exUrl,$input,$url_match)) //match found
{
return <p>{preg_replace($reg_exUrl, '<a href="'.$url_match[0].'>'.$url_match[0].'</a>', $input)}<p>;
}else{//no link inside
<p>{$input}</p>
}
}
}
The problem here is that XHP escapes HTML, so the links are not shown as expected. I suppose this happens because I do not create a DOM hierarchy as expected (with the appendChild method, for example), and thus everything the regex replaces is just a string.
So my other approach to this problem was to use preg_replace_callback with a callback function that would create an xhp_a and add it to the hierarchy under xhp_p, but that did not work either.
Am I wrong somewhere? If not, would there be any security risk or bigger overhead in just finding and replacing the HTML on load on the client side instead of on the server?
Thanks for your time !
Since XHP maintains object hierarchy that maps to DOM, simply replacing parts of a string won't create any new objects. To manipulate XHP objects corresponding methods should be used, e.g. appendChild.
Here's an example of how what you need can be achieved with XHP manipulation.
class Urlifier {
public static function convertParagraphWithLink(
?string $input,
): xhp_p {
$url_pattern = re"/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
if (HH\Lib\Str\is_empty($input)) {
return <p/>;
}
$input = $input as nonnull;
// Extract links
$link_matches = HH\Lib\Regex\every_match($input, $url_pattern);
$links = HH\Lib\Vec\map($link_matches, $m ==> $m[0]);
$a_elements = HH\Lib\Vec\map($links, $link ==> <a href={$link}>{$link}</a>);
// Extract all pieces between matches
$texts = HH\Lib\Regex\split($input, $url_pattern);
$p_elements = HH\Lib\Vec\map($texts, $text ==> <p>{$text}</p>);
// Merge texts and links
$pairs = HH\Lib\Vec\zip($p_elements, $a_elements);
$elements = HH\Lib\Vec\flatten($pairs);
// Because there's one more p element than a element, append last p
$elements[] = HH\Lib\C\last($p_elements);
$result = <p/>;
$result->appendChild($elements);
return $result;
}
}

How to save regex backreferences to an array during preg_replace or preg_replace_callback

Here's the problem: I have a database full of articles marked up in XHTML. Our application uses Prince XML to generate PDFs. An artifact of that is that footnotes are marked up inline, using the following pattern:
<p>Some paragraph text<span class="fnt">This is the text of a footnote</span>.</p>
Prince replaces every span.fnt with a numeric footnote marker, and renders the enclosed text as a footnote at the bottom of the page.
We want to render the same content in ebook formats, and XHTML is a great starting point, but the inline footnotes are terrible. What I want to do is convert the footnotes to endnotes in my ebook build script.
This is what I'm thinking:
Create an empty array called $endnotes to store the endnote text.
Set a variable $endnote_no to zero. This variable will hold the current endnote number, to display inline as an endnote marker, and to be used in linking the endnote marker to the particular endnote.
Use preg_replace or preg_replace_callback to find every instance of <span class="fnt">(.*?)</span>.
Increment $endnote_no for each instance, and replace the inline span with '<sup><a href="#endnote_' . $endnote_no . '">' . $endnote_no . '</a></sup>'
Push the footnote text to the $endnotes array so that I can use it at the end of the document.
After replacing all the footnotes with numeric endnote references, iterate through the $endnotes array to spit out the endnotes as an ordered list in XHTML.
This process is a bit beyond my PHP comprehension, and I get lost when I try to translate this into code. Here's what I have so far, which I mainly cobbled together based on code examples I found in the PHP documentation:
$endnotes = array();
$endnote_no = 0;
class Endnoter {
public function replace($subject) {
$this->endnote_no = 0;
return preg_replace_callback('`<span class="fnt">(.*?)</span>`', array($this, '_callback'), $subject);
}
public function _callback($matches) {
array_push($endnotes, $1);
return '<sup>' . $this->endnote_no . '</sup>';
}
}
...
$replacer = new Endnoter();
$replacer->replace($body);
echo '<pre>';
print_r($endnotes); // Just checking to see if the $endnotes are there.
echo '</pre>';
Any guidance would be helpful, especially if there is a simpler way to get there.
Don't know about a simpler way, but you were halfway there. This seems to work.
I just cleaned it up a bit, moved the variables inside your class and added an output method to get the footnote list.
class Endnoter
{
private $number_of_notes = 0;
private $footnote_texts = array();
public function replace($input) {
return preg_replace_callback('#<span class="fnt">(.*?)</span>#i', array($this, 'replace_callback'), $input);
}
protected function replace_callback($matches) {
// the text sits in the matches array
// see http://php.net/manual/en/function.preg-replace-callback.php
$this->footnote_texts[] = $matches[1];
// count the note first so that the markers start at 1
return '<sup>'.(++$this->number_of_notes).'</sup>';
}
public function getEndnotes() {
$out = array();
$out[] = '<ol>';
foreach($this->footnote_texts as $text) {
$out[] = '<li>'.$text.'</li>';
}
$out[] = '</ol>';
return implode("\n", $out);
}
}
First, you're best off not using a regex for HTML manipulation; see here:
How do you parse and process HTML/XML in PHP?
However, if you really want to go that route, there are a few things wrong with your code:
return '<sup><a href="#endnote_' . $this->endnote_no++ . '">' . $this->endnote_no . '</a></sup>';
If endnote_no is 1, for example, this will produce
'<sup><a href="#endnote_1">2</a></sup>';
If those values are both supposed to be the same, you want to increment endnote_no first:
return '<sup><a href="#endnote_' . ++$this->endnote_no . '">' . $this->endnote_no . '</a></sup>';
Note the ++ in front of the variable instead of after.
array_push($endnotes, $1);
$1 is not a defined value. You're looking for the array you passed in to the callback, so you want $matches[1]
print_r($endnotes);
$endnotes is not defined outside the class, so you either want a getter function to retrieve $endnotes (usually preferable) or make the variable public in the class. With a getter:
class Endnoter {
private $endnotes = array();
//replace any references to $endnotes in your class with $this->endnotes and add a function:
public function getEndnotes() {
return $this->endnotes;
}
}
//and then outside
print_r($replacer->getEndnotes());
preg_replace_callback doesn't pass by reference, so you aren't actually modifying the original string. $replacer->replace($body); should be $body = $replacer->replace($body); unless you want to pass body by reference into the replace() function and update its value there.
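The whole flow discussed in this thread (number each marker, collect the note texts, then emit an ordered list) can be sketched compactly. The sketch below is in JavaScript purely to illustrate the logic; the callback form of `String.prototype.replace` plays the role of preg_replace_callback. The function name `convertFootnotes` and the `id="endnote_N"` attribute on the list items are assumptions added so that the links have a target, and the pattern still assumes footnotes contain no nested span.

```javascript
// Replace inline footnote spans with numbered markers and collect the texts.
function convertFootnotes(html) {
  const endnotes = [];
  const body = html.replace(/<span class="fnt">(.*?)<\/span>/g, (whole, noteText) => {
    endnotes.push(noteText);          // keep the footnote text for later
    const n = endnotes.length;        // 1-based number of this note
    return '<sup><a href="#endnote_' + n + '">' + n + '</a></sup>';
  });
  // Build the ordered list of endnotes (ids are an assumption of this sketch).
  const items = endnotes.map(
    (text, i) => '<li id="endnote_' + (i + 1) + '">' + text + '</li>'
  );
  const list = '<ol>\n' + items.join('\n') + '\n</ol>';
  return { body, list };
}
```

Returning both pieces from one call avoids the shared state that caused trouble in the question: the counter and the collected texts live only inside the function.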

get wrapping element using preg_match php

I want a preg_match code that will detect a given string and get its wrapping element.
I have a string and a html code like:
$string = "My text";
$html = "<div><p class='text'>My text</p><span>My text</span></div>";
So i need to create a function that will return the element wrapping the string like:
$element = get_wrapper($string, $html);
function get_wrapper($str, $code){
//code here that has preg_match and return the wrapper element
}
The returned value will be array since it has 2 possible returning values which are <p class='text'></p> and <span></span>
Anyone can give me a regex pattern on how to get the HTML element that wraps the given string?
Thanks! Answers are greatly appreciated.
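It's a bad idea to use regex for this task. You can use DOMDocument:
$oDom = new DOMDocument('1.0', 'UTF-8');
$oDom->loadXML("<div>" . $html . "</div>"); // wrap so there is a single root element
$wrappers = get_wrapper($string, $oDom);
where get_wrapper walks the tree recursively:
function get_wrapper($s, $oNode, array &$found = array()) {
foreach ($oNode->childNodes as $oItem) {
if ($oItem->nodeType !== XML_ELEMENT_NODE) {
continue; // skip text nodes, comments, etc.
}
if ($oItem->nodeValue == $s) {
$found[] = $oItem; // wrapper element; its tag name is $oItem->nodeName
}
else {
get_wrapper($s, $oItem, $found);
}
}
return $found;
}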
The simple pattern would be the following, but it assumes a lot of things. Regexes shouldn't be used with these. You should look at something like the Simple HTML DOM parser which is more intelligent.
Anyway, the regex that would match the wrapper tags and surrounding html elements is as follows.
/[A-Za-z'= <]*>My text<[A-Za-z\/>]*/g
Even though regex is never the correct answer in the domain of DOM parsing, I came up with another (quite simple) solution:
(<[^>/]+?>)My String(</.+?>)
This works if the HTML is well formed (i.e. it has closing tags, < is written as &lt;, and so on). The first regex group then holds the opening tag and the second the closing one.

Ignore html tags in preg_replace

How do I ignore html tags in this preg_replace.
I have a foreach function for a search, so if someone searches for "apple span" the preg_replace also applies a span to the span and the html breaks:
preg_replace("/($keyword)/i","<span class=\"search_hightlight\">$1</span>",$str);
Thanks in advance!
I suggest basing your function on DOMDocument and DOMXPath rather than on regular expressions. Although regular expressions are quite powerful, you run into problems like the one you describe, which are not (always) easy to solve robustly with them.
The general saying is: Don't parse HTML with regular expressions.
It's a good rule to keep in mind and, although like any rule it does not always apply, it's worth thinking it through.
XPath allows you to find all texts that contain the search terms within text only, ignoring all XML elements.
Then you only need to wrap those texts in a <span> and you're done.
Edit: Finally some code ;)
First it makes use of xpath to locate elements that contain the search text. My query looks like this, this might be written better, I'm not a super xpath pro:
'//*[contains(., "'.$search.'")]/*[FALSE = contains(., "'.$search.'")]/..'
$search contains the text to search for, not containing any " (quote) character (this would break it, see Cleaning/sanitizing xpath attributes for a workaround if you need quotes).
This query will return all parents that contain textnodes which put together will be a string that contain your search term.
As such a list is not easy to process further as-is, I created a TextRange class that represents a list of DOMText nodes. It is useful to do string-operations on a list of textnodes as if they were one string.
This is the base skeleton of the routine:
$str = '...'; # some XML
$search = 'text that span';
printf("Searching for: (%d) '%s'\n", strlen($search), $search);
$doc = new DOMDocument;
$doc->loadXML($str);
$xp = new DOMXPath($doc);
$anchor = $doc->getElementsByTagName('body')->item(0);
if (!$anchor)
{
throw new Exception('Anchor element not found.');
}
// search elements that contain the search-text
$r = $xp->query('//*[contains(., "'.$search.'")]/*[FALSE = contains(., "'.$search.'")]/..', $anchor);
if (!$r)
{
throw new Exception('XPath failed.');
}
// process search results
foreach($r as $i => $node)
{
$textNodes = $xp->query('.//child::text()', $node);
// extract $search textnode ranges, create fitting nodes if necessary
$range = new TextRange($textNodes);
$ranges = array();
while(FALSE !== $start = strpos($range, $search))
{
$base = $range->split($start);
$range = $base->split(strlen($search));
$ranges[] = $base;
};
// wrap every each matching textnode
foreach($ranges as $range)
{
foreach($range->getNodes() as $node)
{
$span = $doc->createElement('span');
$span->setAttribute('class', 'search_hightlight');
$node = $node->parentNode->replaceChild($span, $node);
$span->appendChild($node);
}
}
}
For my example XML:
<html>
<body>
This is some <span>text</span> that span across a page to search in.
and more text that span</body>
</html>
It produces the following result:
<html>
<body>
This is some <span><span class="search_hightlight">text</span></span><span class="search_hightlight"> that span</span> across a page to search in.
and more <span class="search_hightlight">text that span</span></body>
</html>
This shows that the approach even finds text that is distributed across multiple tags, which is not easy to do with regular expressions.
You find the full code here: http://codepad.viper-7.com/U4bxbe (including the TextRange class that I have taken out of the answers example).
It's not working properly on the viper codepad because of an older LIBXML version that site is using. It works fine for my LIBXML version 20707. I created a related question about this issue: XPath query result order.
A note of warning: This example uses a binary string search (strpos) and the resulting offsets for splitting text nodes with the DOMText::splitText function. That can lead to wrong offsets, as the function needs the UTF-8 character offset. The correct method is to use mb_strpos to obtain the UTF-8 based value.
The example works anyway because it's only making use of US-ASCII which has the same offsets as UTF-8 for the example-data.
For a real life situation, the $search string should be UTF-8 encoded and mb_strpos should be used instead of strpos:
while(FALSE !== $start = mb_strpos($range, $search, 0, 'UTF-8'))
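To illustrate the idea behind TextRange (treating a list of text nodes as if they were one string), here is a small sketch that maps a match in the joined text back to per-node offsets. It is not the TextRange class from the answer; `locateAcrossNodes` is a hypothetical helper, written in JavaScript and operating on plain strings standing in for the text nodes.

```javascript
// texts: the contents of consecutive text nodes, in document order.
// Returns, for the first occurrence of `search` in the joined text, the
// slice of each node that belongs to the match.
function locateAcrossNodes(texts, search) {
  const joined = texts.join('');
  const start = joined.indexOf(search);
  if (start === -1) {
    return [];
  }
  const end = start + search.length;
  const pieces = [];
  let offset = 0; // position of the current node within the joined string
  texts.forEach((text, nodeIndex) => {
    const from = Math.max(start, offset);
    const to = Math.min(end, offset + text.length);
    if (from < to) {
      pieces.push({ nodeIndex, from: from - offset, to: to - offset });
    }
    offset += text.length;
  });
  return pieces;
}
```

Each returned piece says where a node would have to be split so that the matching part can be wrapped in its own span, which is what the split calls in the skeleton above do on real DOMText nodes.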

Recursive replacement of matching tags with regular expressions

I have the following string:
<?foo?> <?bar?> <?baz?> hello world <?/?> <?/?> <?/?>
I need a regular expression to convert it into
<?foo?> <?bar?> <?baz?> hello world <?/baz?> <?/bar?> <?/foo?>
The following code works for non-recursive tags:
$x=preg_replace_callback('/.*?<\?\/\?>/',function($x){
return preg_replace('/(.*<\?([^\/][\w]+)\?>)(.*?)(<\?\/?\?>)/s',
'\1\3<?/\2?>',$x[0]);
},$str);
You can't do this with regular expressions. You need to write a parser!
So create a stack (an array where you add and remove items from the end. use array_push() array_pop() ).
Iterate through the tags, pushing known opening tags on the stack.
When you come to a closing tag, pop the stack and that will tell you the tag you need to close.
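The stack approach described above can be sketched like this. It is a minimal illustration in JavaScript, assuming that opening tags look like <?name?> and anonymous closing tags look like <?/?>, as in the question; `closeTags` is a name made up for the example. A regex is used only to find the individual tags, while the nesting is tracked by the stack.

```javascript
// Fill in the name of each anonymous closing tag using a stack of open tags.
function closeTags(input) {
  const stack = [];
  // Group 1 is set for an opening tag such as <?foo?>, unset for <?/?>.
  return input.replace(/<\?(?:(\w+)|\/)\?>/g, (whole, name) => {
    if (name !== undefined) {
      stack.push(name);               // opening tag: remember it
      return whole;
    }
    if (stack.length === 0) {
      throw new Error('closing tag without a matching opening tag');
    }
    return '<?/' + stack.pop() + '?>'; // closing tag: name it after the last open one
  });
}
```

Calling `closeTags` on the string from the question yields the desired output, and sibling tags work as well because each closing tag pops only the most recent opening tag.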
For a recursive structure, make a recursive function. In some form of pseudo-code:
tags = ['<?foo?>', '<?bar?>', '<?baz?>']
// output consumed stream to 'output' and return the rest
function close_matching(line, output) {
for (tag in tags) {
if line.startswith(tag) {
output.append(tag)
line = close_matching(line.substring(tag.length()), output)
i = line.indexof('<')
... // check i for not found
output.append(line.substring(0, i))
j = line.indexof('>')
... // check j for error, and check what's between i,j is valid for close tag
output.append(closetag_for_tag(tag))
line = line.substring(j + 1)
}
}
return line;
}
This should give you a basic structure that works.
