Regex syntax question - trying to understand

I'm a self-taught PHP programmer and I'm only now starting to grasp regex. I'm well aware of what it can do when it is done right, but it is something I still need to dive into, so maybe someone can help me and save me some hours of experimenting.
I have this string:
here is the <img src="http://www.somewhere.com/1.png" alt="some' /> and there is not a chance...
Now I need to preg_match this string, find the a href tag that has an image in it, and replace it with the same tag with one small difference: after the title attribute inside the tag, I want to add a rel="here" attribute.
Of course, it should ignore links (a hrefs) that don't have an img tag inside.

First of all: never ever ever use regex for html!
You're much better off using an XML parser: create a DOMDocument, load your HTML, and then use XPath to get the node you want.
Something like this:
$str = 'here is the <img src="http://www.somewhere.com/1.png" alt="some" /> and there is not a chance...';
$doc = new DOMDocument();
$doc->loadHTML($str);
$xpath = new DOMXPath($doc);
$results = $xpath->query('//a/img');
foreach ($results as $result) {
    // $result is the <img> node; the enclosing <a> is $result->parentNode
}
echo $doc->saveHTML();

Ideally you should use an HTML (or XML) parser for this purpose. Here is an example using PHP's built-in XML manipulation functions:
<?php
error_reporting(E_ALL);
$doc = new DOMDocument();
$doc->loadHTML('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html><body>
<p>here is the <img src="http://www.somewhere.com/1.png" alt="some" /> and there is not a chance...</p>
</body></html>');
$xpath = new DOMXPath($doc);
$result = $xpath->query('//a[img]');
foreach ($result as $r) {
    // It is unclear whether you want a hard-coded "here" or the value of the title attribute
    $r->setAttribute('rel', $r->getAttribute('title'));
}
echo $doc->saveHTML();
Output
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html><body>
<p>here is the <img src="http://www.somewhere.com/1.png" alt="some"> and there is not a chance...</p>
</body></html>
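For the hard-coded rel="here" variant described in the question, a minimal sketch might look like the following. The href and title values are made up for illustration, since the question only states that such an anchor wraps the image. Because setAttribute() appends a new attribute, rel should end up after title.

```php
<?php
// Hypothetical input: the two anchors, their href values and titles are assumptions.
$html = '<p>here is the <a href="http://example.com/page" title="some">'
      . '<img src="http://www.somewhere.com/1.png" alt="some" /></a>'
      . ' and a <a href="http://example.com/other" title="text">plain link</a></p>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
// a[.//img] matches only anchors that contain an <img> somewhere inside them.
foreach ($xpath->query('//a[.//img]') as $a) {
    $a->setAttribute('rel', 'here');
}

$out = $doc->saveHTML();
echo $out;
```

The anchor without an image is left untouched because the XPath predicate filters it out before the loop runs.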

Here are a couple of links that might help you with regex:
RegEx Tutorial
Email Samples of RegEx
I used the website in the last link extensively in my previous job. It is a great collection of regexes that you can also test against your specific case.
The first two links should help you learn more about the topic.

Related

DOMDocument moves <p> tags outside of <h1> tags - what now?

According to this summary of the HTML standards, https://stackoverflow.com/a/19779520/2113148, <p></p> elements must not appear inside headings.
In PHP, this is translated to the following DOMDocument behaviour:
$DOM = new DOMDocument('1.0', 'UTF-8');
$html = "<html><body><h1><p>Hello</p></h1></body></html>";
$DOM->loadHTML($html);
$output = $DOM->saveHTML();
Output:
"""
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><h1></h1><p>Hello</p></body></html>
"""
The paragraph tag is moved outside of the heading tag.
But this is not very "user-friendly": websites don't stick to the rules, and it is easy to see what the original HTML means.
How can one disable or work around this behaviour and make DOMDocument more lenient?
Browsers don't parse it the way DOMDocument does, and we also understand it differently.

Strangely enough, if your html is also well formed xml, you can parse it as xml:
$DOM->loadXML($html);
$output = $DOM->saveXML($DOM->documentElement);
echo $output;
Output:
<html><body><h1><p>Hello</p></h1></body></html>
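A minimal sketch of this approach with a fallback, assuming the input may or may not be well-formed XML. The fallback to loadHTML() and the variable names are illustrative choices, not part of the answer above.

```php
<?php
// Collect libxml errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$html = '<html><body><h1><p>Hello</p></h1></body></html>';
$dom  = new DOMDocument('1.0', 'UTF-8');

if ($dom->loadXML($html)) {
    // Well-formed XML: the nesting is kept exactly as written.
    $output = $dom->saveXML($dom->documentElement);
} else {
    // Not well-formed: fall back to the HTML parser, which may move tags around.
    $dom->loadHTML($html);
    $output = $dom->saveHTML();
}
libxml_clear_errors();

echo $output;
```

Note that saveXML() serializes empty elements in XML style (for example <br/>), which may or may not suit the consumer of the markup.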

Replace and return partial HTML with DOMDocument without adding body, doctype, etc

I want to perform some replacements on a partial HTML document. Let's say I want to change the src attribute of img tags.
(Example) Replace:
<p>hello</p><img src="REPLACE" /><p></p>
By:
<p>hello</p><img src="http://example.org/image.jpeg" /><p></p>
I wanted to use DOMDocument to achieve this, so I coded something like this:
$doc = new \DOMDocument( '1.0', 'utf-8');
$doc->loadHTML('<p>hello</p><img src="REPLACE" /><p></p>');
$tags = $doc->getElementsByTagName('img');
foreach ($tags as $tag) {
$tag->setAttribute('src', 'http://example.org/image.jpeg');
}
var_dump($doc->saveHTML());
But it returns:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<p>hello</p><img src="http://example.org/image.jpeg"><p></p>
</body></html>
There are several problems with this output:
It uses a strange doctype: HTML 4.0!
It adds a doctype, an html tag and a body tag.
I know it's "normal" for DOMDocument to add the doctype, html and body tags, but can I avoid this? Is there any way to "just" get back my HTML slice, with only the replacement I performed? Using regex is not an option, because there are posts everywhere saying it's bad practice.
Side note: I use Laravel, so if there is something out of the box for Laravel, it could be great too!

You can use the extra options available in loadHTML() to achieve what you want; check its options parameter and the libxml constants in the PHP manual. Note that these flags are available since PHP 5.4. For example:
...
$doc->loadHTML('<p>hello</p><img src="REPLACE" /><p></p>',
LIBXML_HTML_NOIMPLIED |
LIBXML_HTML_NODEFDTD);
...
$doc->saveHTML();
Update
If you see UTF-8 characters being changed to some odd characters, then using mb_convert_encoding can fix this, like:
$doc->loadHTML(
mb_convert_encoding('<p>hello</p><img src="REPLACE" /><p></p>', 'HTML-ENTITIES', 'UTF-8'),
LIBXML_HTML_NOIMPLIED |
LIBXML_HTML_NODEFDTD
);
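As a self-contained sketch of the same idea: as far as I know, when the fragment has several top-level elements, LIBXML_HTML_NOIMPLIED can make libxml treat the first one as the document root and nest the rest inside it. The wrapper <div> below is my own workaround for that (an assumption, not part of the answer above); only its children are serialized at the end.

```php
<?php
$fragment = '<p>hello</p><img src="REPLACE" /><p></p>';

$doc = new DOMDocument();
// Wrap the fragment in a single root so the parser has one top-level element.
$doc->loadHTML(
    '<div>' . $fragment . '</div>',
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);

foreach ($doc->getElementsByTagName('img') as $img) {
    $img->setAttribute('src', 'http://example.org/image.jpeg');
}

// Serialize only the children of the wrapper, dropping the wrapper itself.
$result = '';
foreach ($doc->documentElement->childNodes as $child) {
    $result .= $doc->saveHTML($child);
}

echo $result;
```

Passing a node to saveHTML() serializes just that node, so no doctype, html or body tag is added around the slice.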

If you want to use the Laravel option, you can just render the partial that you have and let it return the HTML for you:
$src = "http://example.org/image.jpeg";
return view('path_to_partial', compact('src'))->render();

How to convert HTML <TAGS> to <tags> in PHP?

I have a lot of HTML data to import which uses uppercase tag and attribute names. Unfortunately the receiving system does not allow this, insisting that they are all lower case.
How can I safely change all the tags and attribute names?
I would jump to a regular expression preg_replace_callback, but I know that can end up really tricky when it comes to parsing HTML - kind of reinventing the wheel.
Is there a DOMDocument or other safer solution?

As @niet suggested, you can try to use DOMDocument, then save it and output the result.
Consider this example:
<?php
$html_with_uppercase_tags = '<BODY><DIV class="container"><H1>Headers</H1><P>This is paragraph one</P></DIV></BODY>';
$dom = new DOMDocument();
$dom->loadHTML($html_with_uppercase_tags);
echo htmlentities($dom->saveHTML()); // check the tags
// http://www.php.net/manual/en/domdocument.savehtml.php
?>
Should yield something like:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"> <html><body><div class="container"><h1>Headers</h1><p>This is paragraph one</p></div></body></html>

automatic link creation using php without breaking the html tags

I want to convert text links in my content page into active links using PHP. I have tried every script I could find; they all work, but the problem is that they also convert links inside the img src attribute. They convert links everywhere and break the HTML code.
I found a good script that does exactly what I want, but it is in JavaScript. It is called jquery-linkify.
You can find the script here:
http://github.com/maranomynet/linkify/
The trick in the script is that it converts text links without breaking the HTML code. I tried to convert the script to PHP but failed.
I can't use the script on my website because other scripts there conflict with jQuery.
Could anyone rewrite this script in PHP, or at least guide me on how to do it?
Thanks.

First, parse the text with an HTML parser, with something like DOMDocument::loadHTML. Note that poor HTML can be hard to parse, and depending on the parser, you might get slightly different output in the browser after running such a function.
PHP's DOMDocument isn't very flexible in that regard. You may have better luck by parsing with other tools. But if you are working with valid HTML (and you should try to, if it's within your control), none of that is a concern.
After parsing the text, you need to look at the text nodes for links and replace them. Using a regular expression is the simplest way.
Here's a sample script that does just that:
<?php
function linkify($text)
{
    // "@" is the pattern delimiter, so a literal "@" inside the pattern is escaped as "\@".
    $re = "@\b(https?://)?(([0-9a-zA-Z_!~*'().&=+$%-]+:)?[0-9a-zA-Z_!~*'().&=+$%-]+\@)?(([0-9]{1,3}\.){3}[0-9]{1,3}|([0-9a-zA-Z_!~*'()-]+\.)*([0-9a-zA-Z][0-9a-zA-Z-]{0,61})?[0-9a-zA-Z]\.[a-zA-Z]{2,6})(:[0-9]{1,4})?((/[0-9a-zA-Z_!~*'().;?:\@&=+$,%#-]+)*/?)@";
    preg_match_all($re, $text, $matches, PREG_OFFSET_CAPTURE);
    $matches = $matches[0];
    $i = count($matches);
    // Walk the matches backwards so earlier offsets stay valid after each replacement.
    while ($i--)
    {
        $url = $matches[$i][0];
        if (!preg_match('#^https?://#', $url))
            $url = 'http://'.$url;
        $text = substr_replace($text, '<a href="'.$url.'">'.$matches[$i][0].'</a>', $matches[$i][1], strlen($matches[$i][0]));
    }
    return $text;
}
$dom = new DOMDocument();
$dom->loadHTML('<b>stackoverflow.com</b> test');
$xpath = new DOMXpath($dom);
foreach ($xpath->query('//text()') as $text)
{
$frag = $dom->createDocumentFragment();
$frag->appendXML(linkify($text->nodeValue));
$text->parentNode->replaceChild($frag, $text);
}
echo $dom->saveHTML();
?>
I did not come up with that regular expression, and I cannot vouch for its accuracy. I also did not test the script, except for this above case. However, this should be more than enough to get you going.
Output:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body>
<b><a href="http://stackoverflow.com">stackoverflow.com</a></b>
test
</body>
</html>
Note that saveHTML() adds the surrounding tags. If that's a problem, you can strip them out with substr().

Use an HTML parser and only search for URLs within text nodes.
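A rough sketch of that idea, building the links as DOM nodes instead of strings so that attribute values such as img src are never touched. The function name linkifyTextNodes and the deliberately simple URL pattern are assumptions for illustration; a production pattern would need more care.

```php
<?php
// Hypothetical helper: wraps http(s) URLs found in text nodes in <a> elements.
function linkifyTextNodes(string $html): string
{
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    // Only text nodes that are not already inside a link.
    $nodes = iterator_to_array($xpath->query('//text()[not(ancestor::a)]'));

    foreach ($nodes as $node) {
        // With one capture group, odd indexes of the result hold the URLs.
        $parts = preg_split('~(https?://[^\s<>"]+)~', $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // no URL in this text node
        }
        $frag = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($part === '') {
                continue;
            }
            if ($i % 2 === 1) {
                $a = $doc->createElement('a');
                $a->setAttribute('href', $part);
                $a->appendChild($doc->createTextNode($part));
                $frag->appendChild($a);
            } else {
                $frag->appendChild($doc->createTextNode($part));
            }
        }
        $node->parentNode->replaceChild($frag, $node);
    }
    return $doc->saveHTML();
}

$out = linkifyTextNodes('<p>See http://example.com/page and <img src="http://example.com/image.jpg" alt="pic"></p>');
echo $out;
```

This sketch does not skip script or style elements; the XPath predicate would need extending for pages that contain them.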

I think the trick is to keep track of the single (') and double (") quotes in your PHP code and nest them correctly, so that you put single quotes inside double quotes or vice versa.
For Example,
<?PHP
//old html tags
echo "<h1>Header1</h1>";
echo "<div>some text</div>";
//your added links
echo "<p><a href='link1.php'>Link1</a><br>";
echo "<a href='link1.php'>Link1</a></p>";
//old html tags
echo "<h1>Another Header</h1>";
echo "<div>some text</div>";
?>
I hope this helps you ..

$text = 'Any text ... link http://example123.com and image <img src="http://exaple.com/image.jpg" />';
$text = preg_replace('!([^\"])(http:\/\/(?:[\w\.]+))([^\"])!', '\\1<a href="\\2">\\2</a>\\3', $text);

How to use PHP's DOM extension loadHTML

It was suggested to me that in order to close some "dangling" HTML tags, I should use PHP's DOM extension and loadHTML.
I've been trying for a while, searching for tutorials, reading this page, trying various things, but can't seem to figure out how to use it to accomplish what I want.
I have this string: <div><p>The quick brown <a href="">fox jumps...
I need to write a function which closes the opened HTML tags.
Just looking for a starting point here. I can usually figure things out pretty quick.

This can be done with PHP's DOMDocument class, using the DOMDocument::loadHTML() and DOMDocument::normalizeDocument() methods.
<?php
$html = '<div><p>The quick brown <a href="">fox jumps';
$DDoc = new DOMDocument();
$DDoc->loadHTML($html);
$DDoc->normalizeDocument();
echo $DDoc->saveHTML();
?>
Output:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><div><p>The quick brown <a href="">fox jumps</a></p></div></body></html>
From there, just substr & strpos away the html that you don't want, like so:
<?php
$html = '<div><p>The quick brown <a href="">fox jumps';
$DDoc = new DOMDocument();
$DDoc->loadHTML($html);
$DDoc->normalizeDocument();
$html = $DDoc->saveHTML();
# Remove Everything Before & Including The Opening HTML & Body Tags.
$html = substr($html, strpos($html, '<html><body>') + 12);
# Remove Everything After & Including The Closing HTML & Body Tags.
$html = substr($html, 0, strrpos($html, '</body></html>'));
echo $html;
?>
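An alternative sketch that avoids counting characters: serialize only the children of the body element. The function name closeTags is hypothetical, and the approach assumes the parser places the whole fragment inside body.

```php
<?php
// Hypothetical helper: returns the fragment with its dangling tags closed.
function closeTags(string $html): string
{
    // Suppress parser warnings about the incomplete markup.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Passing a node to saveHTML() serializes just that node.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$fixed = closeTags('<div><p>The quick brown <a href="">fox jumps');
echo $fixed;
```

This keeps working if the doctype line or the whitespace around the html and body tags differs between libxml versions.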

While I'm sure you could get DOM to do what you want I'm pretty sure you'd be better off with Tidy.

OK, what about http://htmlpurifier.org/ ?
Also http://www.bioinformatics.org/phplabware/internal_utilities/htmLawed/index.php
Can you use Tidy? http://php.net/manual/en/book.tidy.php

I think you're following the wrong approach: You have to use the DOM stuff to truncate the string, not after truncating it.
This is how I would do it:
Find the place where you want to truncate the string
Delete all child nodes after that point
Truncate the string
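The steps above could be sketched roughly as follows. The function name truncateHtml is hypothetical, the limit counts bytes of text content rather than characters (substr() is not multibyte-aware), and the result is taken from the children of body.

```php
<?php
// Hypothetical helper: keeps at most $limit bytes of text and drops everything after.
function truncateHtml(string $html, int $limit): string
{
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    $remaining = $limit;
    foreach (iterator_to_array($xpath->query('//body//text()')) as $text) {
        $len = strlen($text->nodeValue);
        if ($len >= $remaining) {
            // 1. This is the place to cut: shorten the text node itself.
            $text->nodeValue = substr($text->nodeValue, 0, $remaining);
            // 2. Delete every node that follows the cut point in document order.
            foreach (iterator_to_array($xpath->query('following::node()', $text)) as $node) {
                if ($node->parentNode !== null) {
                    $node->parentNode->removeChild($node);
                }
            }
            break;
        }
        $remaining -= $len;
    }

    // 3. Serialize what is left; the open tags are closed by the serializer.
    $out = '';
    foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$short = truncateHtml('<div><p>The quick <b>brown fox</b> jumps over</p><p>the lazy dog</p></div>', 13);
echo $short;
```

Because the cut happens on the DOM rather than on the string, the ancestors of the truncated text node stay in place and come out properly closed.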
