Convert HTML text to Leet (1337) Speak with XPath - php

I want to transform a webpage into leet (1337) speak with XPath and PHP.
It can be done with PHP alone, but then the HTML tags are also converted to leet speak.
Example ($html is the webpage):
$find = array("a","b","c","d","e","f","g","h","i","j"."k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z");
$repl = array("4","b","c","d","3","f","g","h","1","j","k","1","m","n","0","p","9","r","5","7","u","v","w","x","y","2");
$html = str_replace($find, $repl, $html);
That also replaces the HTML tags themselves.
Can this be done with XPath and PHP with the XPath selector text()? Example ($html is the webpage):
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$xpath->query('//text()');
// HERE THE REPLACING IN XPATH

Try this:
$dom = new DOMDocument;
$dom->loadHTML( $html );
$xpath = new DOMXPath( $dom );
$nodes = $xpath->query( '//text()' );
foreach( $nodes as $node )
{
$node->nodeValue = str_replace( $find, $repl, $node->nodeValue );
}
echo $dom->saveHTML();
Note that this is probably a more useful xpath query for your needs:
$nodes = $xpath->query( '//head/title/text() | //body//text()' );
... as this will only replace text in <head><title> or text being descendants of <body>. Probably wouldn't want to replace possible styles, Javascript and what have you. ;-)
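If the page has inline scripts or styles inside <body>, //body//text() still matches their contents. A minimal sketch that filters those out with an ancestor test. The sample markup and the shortened $map are made up for illustration, and strtr() with a map stands in for the two-array str_replace():

```php
<?php
// Sample page (made up for illustration): one inline script inside <body>.
$html = '<html><head><title>Leet test</title></head>'
      . '<body><p>Hello <b>World</b></p><script>var text = 1;</script></body></html>';

// Shortened replacement map, for illustration only.
$map = array('a' => '4', 'e' => '3', 'l' => '1', 'o' => '0', 's' => '5', 't' => '7');

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Title text, plus body text that is not inside <script> or <style>.
$nodes = $xpath->query(
    '//head/title/text() | //body//text()[not(ancestor::script) and not(ancestor::style)]'
);
foreach ($nodes as $node) {
    $node->nodeValue = strtr($node->nodeValue, $map);
}

$out = $dom->saveHTML();
echo $out;
```

The script body is left alone because its text node has a script ancestor, while the paragraph text is converted.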
On a side note: I've tested this with your arrays of find and replace characters, but there's something fishy going on with them, that I can't figure out. The replacement characters don't seem to line up with the found characters all the time. I have no idea why that is.
I've recreated the arrays, and these work for me:
$find = array('a','b','c','d','e','f','g','h','i','j'.'k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z');
$repl = array('4','b','c','d','3','f','g','h','1','j'.'k','1','m','n','0','p','9','r','5','7','u','v','w','x','y','2');
I just can't figure out why your arrays are not working for me. :-/ Perhaps an encoding issue? If anybody wants to chime in and venture a guess, please do.
Edit: As rxdazn noticed, "j"."k" was the problem in the first array, which I totally overlooked as you can see from my recreated arrays (I copied $find over to $repl, replaced quotes, and filled in the leet characters).
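To make the "j"."k" problem concrete: the dot is PHP's string concatenation operator, so that entry becomes the single string "jk" and the find array ends up one element short. A small sketch follows; the associative map and the leetify() helper are my own illustration, not part of the question. With a single map passed to strtr(), there are no two lists that can drift out of alignment:

```php
<?php
// The dot concatenates, so this is array("i", "jk", "l"): 3 elements, not 4.
$find = array("i", "j"."k", "l");
$repl = array("1", "j", "k", "1");
var_dump(count($find)); // int(3)
var_dump(count($repl)); // int(4)

// One associative map keeps every character next to its replacement.
// Only the characters that actually change need to be listed.
$leet = array(
    'a' => '4', 'e' => '3', 'i' => '1', 'l' => '1', 'o' => '0',
    'q' => '9', 's' => '5', 't' => '7', 'z' => '2',
);

// Hypothetical helper: strtr() replaces each character once, using the map.
function leetify(string $text, array $map): string
{
    return strtr($text, $map);
}

echo leetify('leet speak', $leet), "\n"; // 1337 5p34k
```

Inside the foreach loop from the answer, the call would be $node->nodeValue = leetify($node->nodeValue, $leet);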

Related

preg_replace and preg_match_all to move img from wordpress $content

I am using preg_replace to delete from $content certain <img>:
$content=preg_replace('/(?!<img.+?id="img_menu".*?\/>)(?!<img.+?id="featured_img".*?\/>)<img.+?\/>/','',$content);
When I now display the content using the WordPress the_content function, the <img>s have indeed been removed from $content.
Beforehand, I'd like to get these images so I can place them elsewhere in the template. I am using the same regex pattern with preg_match_all:
preg_match_all('/(?!<img.+?id="img_menu".*?\/>)(?!<img.+?id="featured_img".*?\/>)<img.+?\/>/', $content, $matches);
But I can't get my imgs?
preg_match_all('/(?!<img.+?id="img_menu".*?\/>)(?!<img.+?id="featured_img".*?\/>)<img.+?\/>/', $content, $matches);
print_r($matches);
Array ( [0] => Array ( ) )
assuming and hopefully you are using php5, this is a task for DOMDocument and xpath. regex with html elements mostly will work, but check the following example from
<img alt=">" src="/path.jpg" />
regex will fail. since there aren't many guarantees in programming, take the guarantee that xpath will find EXACTLY what you want, at a performance cost, so to code it:
$doc = new DOMDocument();
$doc->loadHTML('<span><img src="com.png" /><img src="com2.png" /></span>');
$xpath = new DOMXPath($doc);
$imgs = $xpath->query('//span/img');
$html = '';
foreach($imgs as $img){
$html .= $doc->saveXML($img);
}
now you have all img elements in $html, use str_replace() to remove them from $content, and from there you can have a drink and be pleased that xpath with html elements is painless, just a little slower
ps. i couldn't be bothered to understand your regex, i just think xpath is better in your situation
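A variation on the same idea that skips the str_replace() step: collect the images and remove them from the DOM in one pass, keeping the ones whose id should stay. This is only a sketch. extractImages() and the sample markup are hypothetical, the wrapping <div> is just a handle for getting the fragment back out, and the input is assumed to be plain ASCII (UTF-8 content needs extra care with loadHTML()):

```php
<?php
// Hypothetical helper: returns array($contentWithoutImages, $removedImages).
function extractImages(string $content, array $keepIds): array
{
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    // Wrap the fragment so it can be found again after parsing.
    $doc->loadHTML('<div>' . $content . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $removed = array();
    // Copy the node list to an array before removing nodes from the tree.
    foreach (iterator_to_array($xpath->query('//img')) as $img) {
        if (in_array($img->getAttribute('id'), $keepIds, true)) {
            continue; // e.g. img_menu and featured_img stay in place
        }
        $removed[] = $doc->saveHTML($img);
        $img->parentNode->removeChild($img);
    }

    // Serialize the children of the wrapper to get the fragment back.
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $rest = '';
    foreach ($wrapper->childNodes as $child) {
        $rest .= $doc->saveHTML($child);
    }
    return array($rest, $removed);
}

list($rest, $imgs) = extractImages(
    '<p>Text <img id="featured_img" src="a.png" /> more <img src="b.png" /></p>',
    array('img_menu', 'featured_img')
);
echo $rest, "\n";
print_r($imgs);
```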
at the end i have used preg_replace_callback:
$content2 = get_the_content();
$removed_imgs = array();
$content2 = preg_replace_callback('#(?!<img.+?id="featured_img".*?\/>)(<img.+? />)#',function($r) use (&$removed_imgs) {
// captured by reference instead of "global", so this also works inside a function
$removed_imgs[] = $r[1];
return '';
},$content2);
foreach($removed_imgs as $img){
echo $img;
}

Replace strings possible with output of PHP file_get_contents?

I'm using PHP to get content from an external website.
I want to know if it's possible to find and replace strings from the output so I can make all links absolute.
I need to convert "/ and '/ to "$url/
If it's possible to do that, I can figure out how to do the rest. I don't know if it's possible though.
Thanks
For simple string replacement, use str_replace(), eg
$html = str_replace(array("'/", '"/'), array("'$url/", '"' . $url . '/'), $html);
If you're after a more robust solution, I'd suggest loading the HTML string into a DOMDocument, looping over all the tags with an href starting with / and changing the attribute of each before writing out the HTML.
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$anchors = $xpath->query('//*[starts-with(@href, "/")]');
foreach ($anchors as $anchor) {
$href = $anchor->getAttribute('href');
$anchor->setAttribute('href', $url . $href);
}
$html = $doc->saveHTML();
You'll probably want to do the same for tags with src attributes.
You could also use preg_replace(), though the DOMDocument parsing is the most robust.
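A sketch of the DOMDocument approach covering both href and src in one query. absolutizeUrls() is a hypothetical name and the sample markup is made up. The extra not(starts-with(., "//")) test is my own addition so that protocol-relative URLs such as //cdn.example/app.js are left untouched:

```php
<?php
// Hypothetical helper: prefixes root-relative href/src values with $url.
function absolutizeUrls(string $html, string $url): string
{
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Select the attribute nodes themselves: values starting with "/" but not "//".
    $attrs = $xpath->query(
        '//@href[starts-with(., "/") and not(starts-with(., "//"))]'
        . ' | //@src[starts-with(., "/") and not(starts-with(., "//"))]'
    );
    foreach ($attrs as $attr) {
        $attr->ownerElement->setAttribute($attr->name, rtrim($url, '/') . $attr->value);
    }
    return $doc->saveHTML();
}

$html = '<a href="/about">About</a><img src="/logo.png">'
      . '<a href="http://other.example/x">X</a>'
      . '<script src="//cdn.example/app.js"></script>';
$result = absolutizeUrls($html, 'http://example.com');
echo $result;
```

Note that saveHTML() returns a full document (doctype, html and body wrappers), which is fine when $html was a complete page to begin with.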

Regex / DOMDocument - match and replace text not in a link

I need to find and replace all text matches in a case insensitive way, unless the text is within an anchor tag - for example:
<p>Match this text and replace it</p>
<p><a>Don't match this text</a></p>
<p>We still need to match this text and replace it</p>
Searching for 'match this text' would only replace the first instance and last instance.
[Edit] As per Gordon's comment, it may be preferred to use DOMDocument in this instance. I'm not at all familiar with the DOMDocument extension, and would really appreciate some basic examples for this functionality.
Here is a UTF-8-safe solution, which works not only with properly formatted documents but also with document fragments.
The mb_convert_encoding is needed because loadHtml() seems to have a bug with UTF-8 encoding (see here and here).
The mb_substr is trimming the body tag from the output, this way you get back your original content without any additional markup.
<?php
$html = '<p>Match this text and replace it</p>
<p><a>Don\'t match this text</a></p>
<p>We still need to match this text and replace itŐŰ</p>
<p><a>This is a link <span>with <strong>don\'t match this text</strong> content</span></a></p>';
$dom = new DOMDocument();
// loadXml needs properly formatted documents, so it's better to use loadHtml, but it needs a hack to properly handle UTF-8 encoding
$dom->loadHtml(mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8"));
$xpath = new DOMXPath($dom);
foreach($xpath->query('//text()[not(ancestor::a)]') as $node)
{
$replaced = str_ireplace('match this text', 'MATCH', $node->wholeText);
$newNode = $dom->createDocumentFragment();
$newNode->appendXML($replaced);
$node->parentNode->replaceChild($newNode, $node);
}
// get only the body tag with its contents, then trim the body tag itself to get only the original content
echo mb_substr($dom->saveXML($xpath->query('//body')->item(0)), 6, -7, "UTF-8");
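Passing 'HTML-ENTITIES' to mb_convert_encoding() is deprecated as of PHP 8.2. One commonly used alternative is to declare the charset in the markup itself, so the parser reads the input as UTF-8. A sketch, assuming the input is a UTF-8 fragment; loadUtf8Fragment() is a hypothetical helper name:

```php
<?php
// Hypothetical helper: parse a UTF-8 HTML fragment without mb_convert_encoding().
function loadUtf8Fragment(string $html): DOMDocument
{
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    // The meta tag tells the HTML parser which encoding the bytes are in.
    $dom->loadHTML(
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html
    );
    libxml_clear_errors();
    return $dom;
}

$dom = loadUtf8Fragment('<p>replace itŐŰ</p>');
$p = $dom->getElementsByTagName('p')->item(0);
echo $p->textContent, "\n";
```

The injected meta element ends up in the head, so it does not show up when only the body is serialized as in the answer above.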
References:
1. find and replace keywords by hyperlinks in an html fragment, via php dom
2. Regex / DOMDocument - match and replace text not in a link
3. php problem with russian language
4. Why Does DOM Change Encoding?
I read dozens of answers on the subject, so I am sorry if I forgot somebody (please leave a comment and I will add yours as well).
Thanks to Gordon and stillstanding for commenting on my other answer.
Try this one:
$dom = new DOMDocument;
$dom->loadHTML($html_content);
function preg_replace_dom($regex, $replacement, DOMNode $dom, array $excludeParents = array()) {
if (!empty($dom->childNodes)) {
foreach ($dom->childNodes as $node) {
if ($node instanceof DOMText &&
!in_array($node->parentNode->nodeName, $excludeParents))
{
$node->nodeValue = preg_replace($regex, $replacement, $node->nodeValue);
}
else
{
preg_replace_dom($regex, $replacement, $node, $excludeParents);
}
}
}
}
preg_replace_dom('/match this text/i', 'IT WORKS', $dom->documentElement, array('a'));
This is the stackless non-recursive approach using pre-order traversal of the DOM tree.
libxml_use_internal_errors(TRUE);
$dom=new DOMDocument('1.0','UTF-8');
$dom->substituteEntities=FALSE;
$dom->recover=TRUE;
$dom->strictErrorChecking=FALSE;
$dom->loadHTMLFile($file);
$root=$dom->documentElement;
$node=$root;
$flag=FALSE;
for (;;) {
if (!$flag) {
if ($node->nodeType==XML_TEXT_NODE &&
$node->parentNode->tagName!='a') {
$node->nodeValue=preg_replace(
'/match this text/is',
$replacement, $node->nodeValue
);
}
if ($node->firstChild) {
$node=$node->firstChild;
continue;
}
}
if ($node->isSameNode($root)) break;
if ($flag=$node->nextSibling)
$node=$node->nextSibling;
else
$node=$node->parentNode;
}
echo $dom->saveHTML();
libxml_use_internal_errors(TRUE); and the 3 lines of code after $dom=new DOMDocument; should be able to handle any malformed HTML.
$a='<p>Match this text and replace it</p>
<p><a>Don\'t match this text</a></p>
<p>We still need to match this text and replace it</p>';
echo preg_replace('~match this text(?![^<]*</a>)~i','replacement',$a);
The negative lookahead ensures the replacement happens only if the next tag is not a closing link tag (</a>). It works fine with your example, though it won't work if you happen to use other tags inside your links.
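A quick way to see both the behavior and the limitation is to run the same pattern against three made-up inputs: text directly inside a link, text inside a tag nested in a link, and text outside any link.

```php
<?php
$pattern = '~match this text(?![^<]*</a>)~i';

// Sample inputs, made up for illustration.
$plainLink  = '<a href="#">Don\'t match this text</a>';
$nestedLink = '<a href="#">Don\'t <b>match this text</b> here</a>';
$outside    = '<p>Match this text and replace it</p>';

$r1 = preg_replace($pattern, 'replacement', $plainLink);
$r2 = preg_replace($pattern, 'replacement', $nestedLink);
$r3 = preg_replace($pattern, 'replacement', $outside);

echo $r1, "\n"; // unchanged: the next tag is </a>
echo $r2, "\n"; // replaced, although it sits inside a link: the next tag is </b>
echo $r3, "\n"; // replaced: no link involved
```

The second case is the one the answer warns about: the lookahead only inspects the very next tag, so any nested element hides the enclosing link from it.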
You can use PHP Simple HTML DOM Parser. It is similar to DOMDocument, but in my opinion it's simpler to use.
Here is the alternative in parallel with Netcoder's DomDocument solution:
function replaceWithSimpleHtmlDom($html_content, $search, $replace, $excludedParents = array()) {
require_once('simple_html_dom.php');
$html = str_get_html($html_content);
foreach ($html->find('text') as $element) {
if (!in_array($element->parent()->tag, $excludedParents))
$element->innertext = str_ireplace($search, $replace, $element->innertext);
}
return (string)$html;
}
I have just profiled this code against my DomDocument solution (which prints the exact same output), and the DomDocument is (not surprisingly) way faster (~4ms against ~77ms).
<?php
$a = '<p>Match this text and replace it</p>
<p><a>Don\'t match this text</a></p>
<p>We still need to match this text and replace it</p>
';
$res = preg_replace("#[^<a.*>]match this text#",'replacement',$a);
echo $res;
?>
This works. I assume you really want it to be case sensitive, so it only matches lowercase.
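HTML parsing with regexes is a huge challenge, and they can very easily end up getting too complex and taking up loads of memory. I would say the best way is to do this:
// $html is assumed to hold the markup; first replace everywhere
$html = preg_replace('/match this text/i', 'replacement text', $html);
// then put the original text back wherever the replacement landed inside a link
$html = preg_replace('/(<a[^>]*>(?:(?!<\/a>).)*?)replacement text(.*?<\/a)/is', '$1match this text$2', $html);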
If your replacement text is something which might occur otherwise, you might want to add an intermediate step with some unique identifier.

Extract text from HTML

Actors: example world
How can I extract "example world" from this using a regular expression in PHP?
preg_match('/<strong class="nfpd">Actors<\/strong>:([^<]+)<br \/>/', $text, $matches);
print_r($matches);
Like Gumbo already pointed out in the comments to this question and like you have also been told in a number of your previous questions as well, Regex aint the right tool for parsing HTML.
The following will use DOM to get the first following sibling of any strong elements with a class attribute of nfpd. In the case of the example HTML, this would be the content of the TextNode, e.g. : example world.
Example HTML:
$html = <<< HTML
<p>
<strong class="nfpd">Actors</strong>: example world <br />
something else
</p>
HTML;
And extraction with DOM
libxml_use_internal_errors(TRUE);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xPath = new DOMXPath($dom);
libxml_clear_errors();
$nodes = $xPath->query('//strong[@class="nfpd"]/following-sibling::text()[1]');
foreach($nodes as $node) {
echo $node->nodeValue; // : example world
}
You can also do it without XPath, though it gets more verbose then:
$nodes = $dom->getElementsByTagName('strong');
foreach($nodes as $node) {
if($node->hasAttribute('class') &&
$node->getAttribute('class') === 'nfpd' &&
$node->nextSibling) {
echo $node->nextSibling->nodeValue; // : example world
}
}
Removing the colon and whitespace is trivial: Use trim.
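For completeness, a minimal sketch of that last step, reusing the example HTML from above. The second argument to trim() lists the characters to strip from both ends, here the colon plus ordinary whitespace:

```php
<?php
$html = '<p>
<strong class="nfpd">Actors</strong>: example world <br />
something else
</p>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xPath = new DOMXPath($dom);
$nodes = $xPath->query('//strong[@class="nfpd"]/following-sibling::text()[1]');

$actors = '';
foreach ($nodes as $node) {
    // Strip the leading colon and the surrounding whitespace.
    $actors = trim($node->nodeValue, ": \t\n\r");
}
echo $actors, "\n"; // example world
```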

Using regex to remove HTML tags

I need to convert
$text = 'We had <i>fun</i>. Look at <a href="http://example.com">this photo</a> of Joe';
[Edit] There could be multiple links in the text.
to
$text = 'We had fun. Look at this photo (http://example.com) of Joe';
All HTML tags are to be removed and the href value from <a> tags needs to be added like above.
What would be an efficient way to solve this with regex? Any code snippet would be great.
First do a preg_replace to keep the link. You could use:
preg_replace('/<a href="(.*?)">(.*?)<\/a>/', '$2 ($1)', $str);
Then use strip_tags which will finish off the rest of the tags.
try an xml parser to replace any tag with its inner html and the a tags with their href attribute.
http://www.php.net/manual/en/book.domxml.php
The DOM solution:
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach($xpath->query('//a[@href]') as $node) {
$textNode = new DOMText(sprintf('%s (%s)',
$node->nodeValue, $node->getAttribute('href')));
$node->parentNode->replaceChild($textNode, $node);
}
echo strip_tags($dom->saveHTML());
and the same without XPath:
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach($dom->getElementsByTagName('a') as $node) {
if($node->hasAttribute('href')) {
$textNode = new DOMText(sprintf('%s (%s)',
$node->nodeValue, $node->getAttribute('href')));
$node->parentNode->replaceChild($textNode, $node);
}
}
echo strip_tags($dom->saveHTML());
All it does is load any HTML into a DomDocument instance. In the first case it uses an XPath expression, which is kinda like SQL for XML, and gets all links with an href attribute. It then creates a text node element from the innerHTML and the href attribute and replaces the link. The second version just uses the DOM API and no Xpath.
Yes, it's a few lines more than Regex but this is clean and easy to understand and it won't give you any headaches when you need to add additional logic.
I've done things like this using variations of substring and replace. I'd probably use regex today but you wanted an alternative so:
For the <i> tags, I'd do something like:
$text = str_replace("<i>", "", $text);
$text = str_replace("</i>", "", $text);
(My php is really rusty; str_replace is the function I had in mind -- but the idea is what I'm sharing.)
The <a> tag is a bit more tricky. But, it can be done. You need to find the point that <a starts and that the > ends with. Then you extract the entire length and replace the closing </a>
That might go something like:
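$start = strpos( $text, "<a" );          // where the opening <a begins
$close = strpos( $text, ">", $start );   // where the opening tag ends
$end = strpos( $text, "</a>", $close );  // where the closing </a> begins
$inner = substr( $text, $close + 1, $end - $close - 1 );  // the text between the tags
$text = substr( $text, 0, $start ) . $inner . substr( $text, $end + 4 );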
(I don't know if this will work, again the idea is what I want to communicate. I hope the code fragments help but they probably don't work "out of the box". There are also a lot of possible bugs in the code snippets depending on your exact implementation and environment)
Reference:
strrpos - http://www.php.net/manual/en/function.strrpos.php
str_replace - http://www.php.net/manual/en/function.str-replace.php
substr - http://php.net/manual/en/function.substr.php
It's also very easy to do with a parser:
# available from http://simplehtmldom.sourceforge.net
include('simple_html_dom.php');
# parse and echo
$html = str_get_html('We had <i>fun</i>. Look at <a href="http://example.com">this photo</a> of Joe');
$a = $html->find('a');
$a[0]->outertext = "{$a[0]->innertext} ( {$a[0]->href} )";
echo strip_tags($html);
And that produces the code you want in your test case.
