I'm trying to find a regular expression that would allow me to replace the SRC attribute in an image. Here is what I have:
function getURL($matches) {
global $rootURL;
return $rootURL . "?type=image&URL=" . base64_encode($matches['1']);
}
$contents = preg_replace_callback("/<img[^>]*src *= *[\"']?([^\"']*)/i", 'getURL', $contents);
For the most part, this works well, except that anything before the src=" attribute is eliminated when $contents is echoed to the screen. In the end, SRC is updated properly and all of the attributes after the updated image URL are returned to the screen.
I am not interested in using a DOM or XML parsing library, since this is such a small application.
How can I fix the regex so that only the value for SRC is updated?
Thank you for your time!
Use a lazy star instead of a greedy one.
This may be your problem:
/<img[^>]*src *= *[\"']?([^\"']*)/
         ^
Change it to:
/<img[^>]*?src *= *[\"']?([^\"']*)/
This way, the [^>]* matches the smallest possible number of your bracket expression, rather than the largest possible.
Do another grouping and prepend it to the return value?
function getURL($matches) {
global $rootURL;
return $matches[1] . $rootURL . "?type=image&URL=" . base64_encode($matches['2']);
}
$contents = preg_replace_callback("/(<img[^>]*src *= *[\"']?)([^\"']*)/i", 'getURL', $contents);
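The extra-group idea can also be written without the global, as a minimal sketch. The function name proxy_image_sources and the example root URL are made up for illustration, and the pattern is a slightly tightened variant of the one above, not the original poster's exact regex:

```php
<?php
// Hypothetical helper: route every <img> src through a proxy URL.
// $rootURL is passed in instead of being read from a global.
function proxy_image_sources(string $contents, string $rootURL): string
{
    return preg_replace_callback(
        // Group 1: everything from "<img" up to and including the opening quote.
        // Group 2: the original src value.
        '/(<img[^>]*?\bsrc\s*=\s*["\']?)([^"\'\s>]*)/i',
        function (array $m) use ($rootURL): string {
            // Put group 1 back so the attributes before src are preserved.
            return $m[1] . $rootURL . '?type=image&URL=' . base64_encode($m[2]);
        },
        $contents
    );
}

echo proxy_image_sources('<img alt="x" src="a.png" width="1">', 'https://example.com/');
```

Because only the two groups are matched, whatever follows the src value (closing quote, later attributes) is left untouched.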
I am not interested in using a DOM or XML parsing library, since this is such a small application.
Nevertheless, that is the correct approach regardless of your application size.
Remember, when you modify elements with DOMDocument, you should iterate in reverse to avoid unexpected oddities - in particular if you remove anything.
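To illustrate the reverse-iteration advice, here is a small sketch (the sample markup is invented). The list returned by getElementsByTagName is live, so removing a node shifts the indexes of the nodes after it; walking from the end avoids skipping any:

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<div><img src="a.png"><p>text</p><img src="b.png"></div>');

$images = $dom->getElementsByTagName('img');

// Walk the live list backwards so removals never shift an unvisited index.
for ($i = $images->length - 1; $i >= 0; $i--) {
    $img = $images->item($i);
    $img->parentNode->removeChild($img);
}

// Number of <img> elements left after the loop.
echo $dom->getElementsByTagName('img')->length . "\n";
```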
Here's a working example using DOMDocument. It's more complicated than a regex, but not terribly difficult and a lot more flexible and robust for any other tweaking that may be required.
function inner_html($node) {
$innerHTML = "";
foreach ($node->childNodes as $child) {
$innerHTML .= $node->ownerDocument->saveHTML($child);
}
return $innerHTML;
}
function replace_src($html) {
$rootURL = 'https://example.com';
$dom = new DOMDocument();
if (mb_detect_encoding($html, 'UTF-8', true) == 'UTF-8') {
$html = mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8');
}
$dom->loadHTML('<body>' . $html . '</body>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
for ($els = $dom->getElementsByTagname('img'), $i = $els->length - 1; $i >= 0; $i--) {
$src = $els->item($i)->getAttribute('src');
$els->item($i)->setAttribute('src', $rootURL . '?type=image&URL=' . $src);
}
return inner_html($dom->documentElement);
}
$html = '
<div>
<img src="test123">
<img src="test456">
</div>
';
echo replace_src($html);
OUTPUT:
<div>
<img src="https://example.com?type=image&URL=test123">
<img src="https://example.com?type=image&URL=test456">
</div>
You can check for spaces too
Use this:
/<\s*img[^>]*?src\s*=\s*(["'])([^"']+)\1[^>]*?>/iu
https://regex101.com/r/jmMoio/1
My html content looks like this:
<div class="preload"><img src="PRODUCTPAGE_files/like_icon_u10_normal.png" width="1" height="1"/><img src="PRODUCTPAGE_files/read_icon_u12_normal.png" width="1" height="1"/><img src="PRODUCTPAGE_files/line_u14_line.png" width="1" height="1"/>
It is one unbroken long line with no newlines separating each img element with no indentation whatsoever.
The php code I use is as follows:
/**
*
* Take in html content as string and find all the <script src="yada.js" ... >
* and add $prepend to the src values except when there is http: or https:
*
* @param $html String The html content
* @param $prepend String The prepend we expect in front of all the href in css tags
* @return String The new $html content after find and replace.
*
*/
protected static function _prependAttrForTags($html, $prepend, $tag) {
if ($tag == 'css') {
$element = 'link';
$attr = 'href';
}
else if ($tag == 'js') {
$element = 'script';
$attr = 'src';
}
else if ($tag == 'img') {
$element = 'img';
$attr = 'src';
}
else {
// wrong tag so return unchanged
return $html;
}
// this checks for all the "yada.*"
$html = preg_replace('/(<'.$element.'\b.+'.$attr.'=")(?!http)([^"]*)(".*>)/', '$1'.$prepend.'$2$3$4', $html);
// this checks for all the 'yada.*'
$html = preg_replace('/(<'.$element.'\b.+'.$attr.'='."'".')(?!http)([^"]*)('."'".'.*>)/', '$1'.$prepend.'$2$3$4', $html);
return $html;
}
}
I want my function to work regardless how badly formed the img element is.
It must work regardless the position of the src attribute.
The only thing it is supposed to do is to prepend the src value with something.
Also note that this preg_replace will not happen if the src value starts with http.
Right now, my code works only if my content is:
<div class="preload">
<img src="PRODUCTPAGE_files/like_icon_u10_normal.png" width="1" height="1"></img>
<img src="PRODUCTPAGE_files/read_icon_u12_normal.png" width="1" height="1"/><img src="PRODUCTPAGE_files/line_u14_line.png" width="1" height="1"/><img src="PRODUCTPAGE_files/line_u15_line.png" width="1" height="1"/>
As you probably can guess, it successfully does it but only for the first img element because it goes to the next line and there is no / at the end of the opening img tag.
Please advise how to improve my function.
UPDATE:
I used DOMDocument and it worked a treat!
After prepending the src values, I need to replace it with a php code snippet
So original:
<img src="PRODUCTPAGE_files/read_icon_u12_normal.png" width="1" height="1"/>
After using DOMDocument and adding my prepend string:
<img src="prepended/PRODUCTPAGE_files/read_icon_u12_normal.png" width="1" height="1" />
Now I need to replace the whole thing with:
<?php echo $this->Html->img('prepended/PRODUCTPAGE_files/read_icon_u12_normal.png', array('width'=>'1', 'height'=>'1')); ?>
Can I still use DOMDocument? Or do I need to use preg_replace?
DOMDocument was built to parse HTML no matter how messed up it is. Rather than building your own HTML parser, why not use it?
With a combination of DomDocument and XPath you can do it like this:
<?php
$html = <<<HTML
<script src="test"/><link href="test"/><div class="preload"><img src="PRODUCTPAGE_files/like_icon_u10_normal.png" width="1" height="1"/><img src="PRODUCTPAGE_files/read_icon_u12_normal.png" width="1" height="1"/><img src="PRODUCTPAGE_files/line_u14_line.png" width="1" height="1"/><img width="1" height="1" src="httpPRODUCTPAGE_files/line_u14_line.png"/>
HTML;
$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$searchTags = $xpath->query('//img | //link | //script');
$length = $searchTags->length;
for ($i = 0; $i < $length; $i++) {
$element = $searchTags->item($i);
if ($element->tagName == 'link')
$attr = 'href';
else
$attr = 'src';
$src = $element->getAttribute($attr);
if (!startsWith($src, 'http'))
{
$element->setAttribute($attr, "whatever" . $src);
}
}
// this small function will check the start of a string
// with a given term, in your case http or http://
function startsWith($haystack, $needle)
{
return !strncmp($haystack, $needle, strlen($needle));
}
$result = $doc->saveHTML();
echo $result;
If your HTML is messed up (missing closing tags, etc.), you can set these before @$doc->loadHTML($html);:
$doc->recover = true;
$doc->strictErrorChecking = false;
If you want the output formatted, you can set this before @$doc->loadHTML($html);:
$doc->formatOutput = true;
With XPath, we are only capturing the data you need to edit so we don't worry about other elements.
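As a rough sketch of how far the XPath query itself can narrow things down, the "does not start with http" check can be moved into the expression. The sample markup and the "prepended/" prefix are placeholders, not values from the question:

```php
<?php
$doc = new DOMDocument();
@$doc->loadHTML('<div><img src="a.png"><img src="http://cdn.example.com/b.png"></div>');
$xpath = new DOMXPath($doc);

// Select only <img> elements that have a src not starting with "http".
$query = '//img[@src and not(starts-with(@src, "http"))]';
foreach ($xpath->query($query) as $img) {
    $img->setAttribute('src', 'prepended/' . $img->getAttribute('src'));
}

// Collect the resulting src values to inspect them.
$sources = [];
foreach ($doc->getElementsByTagName('img') as $img) {
    $sources[] = $img->getAttribute('src');
}
print_r($sources);
```

This keeps the PHP loop free of conditionals, at the cost of a slightly denser query string.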
Keep in mind that if your HTML is missing tags such as doctype, html, head, or body, DOMDocument will add them automatically; if they are already present, it should leave them alone.
However, if you want to remove them, you can use the following instead of just $doc->saveHTML();:
$result = preg_replace('~<(?:!DOCTYPE|/?(?:html|head|body))[^>]*>\s*~i', '', $doc->saveHTML());
If you want to replace the element with a newly created element in its place, you can use this:
$newElement = $doc->createElement($element->tagName, '');
$newElement->setAttribute($attr, "prepended/" . $src);
$myArrayWithAttributes = array ('width' => '1', 'height' => '1');
foreach ($myArrayWithAttributes as $attribute=>$value)
$newElement->setAttribute($attribute, $value);
$element->parentNode->replaceChild($newElement, $element);
To replace it with a PHP snippet instead, create a document fragment:
$frag = $doc->createDocumentFragment();
$frag->appendXML('<?php echo $this->Html->img("prepended/PRODUCTPAGE_files/read_icon_u12_normal.png", array("width"=>"1", "height"=>"1")); ?>');
$element->parentNode->replaceChild($frag, $element);
You can format the HTML with tidy:
$tidy = tidy_parse_string($result, array(
'indent' => TRUE,
'output-xhtml' => TRUE,
'indent-spaces' => 4
));
$tidy->cleanRepair();
echo $tidy;
I am a PHP newbie, but I am pretty sure this will be hard to accomplish and very demanding on the server. Still, I want to ask and get the opinion of much smarter users than myself.
Here is what I am trying to do:
I have a list of URLs, an array of URLs actually.
For each URL, I want to count the outgoing links - which DO NOT HAVE REL="nofollow" attribute - on that page.
So in a way, I'm afraid I'll have to make php load the page and preg match using regular expressions all the links?
Would this work if I had, let's say, 1000 links?
Here is what I am thinking, putting it in code:
$homepage = file_get_contents('http://www.site.com/');
$homepage = htmlentities($homepage);
// Do a preg_match for http:// and count the number of appearances:
$urls = preg_match();
// Do a preg_match for rel="nofollow" and count the nr of appearances:
$nofollow = preg_match();
// Do a preg_match for the number of "domain.com" appearances so we can subtract the website's internal links:
$internal_links = preg_match();
// Subtract and get the final result:
$result = $urls - $nofollow - $internal_links;
Hope you can help, and if the idea is right maybe you can help me with the preg_match functions.
You can use PHP's DOMDocument class to parse the HTML and parse_url to parse the URLs:
$url = 'http://stackoverflow.com/';
$pUrl = parse_url($url);
// Load the HTML into a DOMDocument
$doc = new DOMDocument;
@$doc->loadHTMLFile($url);
// Look for all the 'a' elements
$links = $doc->getElementsByTagName('a');
$numLinks = 0;
foreach ($links as $link) {
// Exclude if not a link or has 'nofollow'
preg_match_all('/\S+/', strtolower($link->getAttribute('rel')), $rel);
if (!$link->hasAttribute('href') || in_array('nofollow', $rel[0])) {
continue;
}
// Exclude if internal link
$href = $link->getAttribute('href');
if (substr($href, 0, 2) === '//') {
// Deal with protocol relative URLs as found on Wikipedia
$href = $pUrl['scheme'] . ':' . $href;
}
$pHref = @parse_url($href);
if (!$pHref || !isset($pHref['host']) ||
strtolower($pHref['host']) === strtolower($pUrl['host'])
) {
continue;
}
// Increment counter otherwise
echo 'URL: ' . $link->getAttribute('href') . "\n";
$numLinks++;
}
echo "Count: $numLinks\n";
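For trying the same logic without fetching a live page, here is a minimal sketch that works on an HTML string. The function name count_followed_external_links and the sample markup are hypothetical, and the host comparison is deliberately simple (it treats subdomains as different hosts):

```php
<?php
// Hypothetical helper: count links in $html that point to another host
// and do not carry rel="nofollow". $pageUrl is where the HTML came from.
function count_followed_external_links(string $html, string $pageUrl): int
{
    $pageHost = strtolower((string) parse_url($pageUrl, PHP_URL_HOST));

    $doc = new DOMDocument();
    @$doc->loadHTML($html); // suppress warnings from sloppy markup

    $count = 0;
    foreach ($doc->getElementsByTagName('a') as $link) {
        // rel may hold several space-separated tokens, e.g. "external nofollow".
        $rel = preg_split('/\s+/', strtolower($link->getAttribute('rel')), -1, PREG_SPLIT_NO_EMPTY);
        if (in_array('nofollow', $rel, true)) {
            continue;
        }
        // Relative links have no host and are treated as internal.
        $host = parse_url($link->getAttribute('href'), PHP_URL_HOST);
        if (!$host || strtolower($host) === $pageHost) {
            continue;
        }
        $count++;
    }
    return $count;
}

$sample = '<a href="/about">a</a>'
    . '<a href="http://example.com/x">b</a>'
    . '<a href="http://other.org/" rel="nofollow">c</a>'
    . '<a href="http://other.org/y">d</a>'
    . '<a href="//cdn.example.net/z" rel="external">e</a>';

echo count_followed_external_links($sample, 'http://example.com/') . "\n";
```

For a list of 1000 URLs the parsing itself is cheap; most of the time would likely go into downloading the pages, so the fetch step is the part worth batching or caching.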
You can use SimpleHTMLDOM:
// Create DOM from URL or file
$html = file_get_html('http://www.site.com/');
// Find all links
foreach($html->find('a[href][rel!=nofollow]') as $element) {
echo $element->href . '<br>';
}
I'm not sure whether SimpleHTMLDOM supports a :not selector, and [rel!=nofollow] might only return a tags that have a rel attribute (skipping those without one), so you may have to use:
foreach($html->find('a[href][!rel][rel!=nofollow]') as $element)
Note the added [!rel]. Or, do it manually instead of with a CSS attribute selector:
// Find all links
foreach($html->find('a[href]') as $element) {
if (strtolower($element->rel) != 'nofollow') {
echo $element->href . '<br>';
}
}
I hate to have to write down a lot of CSS rules and then enter my styles in it, so I'd like to develop a tiny php script that would parse the HTML I'd pass to it and then return empty CSS rules.
I decided to use PHP's DomDocument.
The question is: How could I loop through the whole structure? (I saw that for example DomDocument only has getElementByTag or getElementById and no getFirstElement for example)
I only want to get the ids and the classes in a given block of HTML code, I'd pass things like:
<div id="testId">
<div class="testClass">
<span class="message error">hello world</span>
</div>
</div>
I only want to know how could I loop through every node?
Thanks!
You can pass an asterisk (*) to getElementsByTagName to get all tags and then loop through them...
<?php
// $xml is assumed to be a DOMDocument that already has the HTML loaded
$nodes = $xml->getElementsByTagName("*");
$css = "";
for ($i = 0; $i < $nodes->length; $i ++)
{
$node = $nodes->item($i);
if ($node->hasAttribute("class")) {
$css = $css . "." . $node->getAttribute("class") . " { }\n";
} elseif ($node->hasAttribute("id")) {
$css = $css . "#" . $node->getAttribute("id") . " { }\n";
}
}
echo $css;
?>
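A slightly fuller sketch of the same loop, under a few assumptions of my own: the helper name empty_css_rules is made up, multi-class attributes such as "message error" are split into one selector per class, and duplicates are dropped:

```php
<?php
// Hypothetical helper: build empty CSS rules for every id and class in $html.
function empty_css_rules(string $html): string
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // suppress warnings from fragments or sloppy markup

    $selectors = []; // selector => true, keeps insertion order and removes duplicates
    foreach ($doc->getElementsByTagName('*') as $node) {
        if ($node->hasAttribute('id')) {
            $selectors['#' . $node->getAttribute('id')] = true;
        }
        // A class attribute can list several classes separated by whitespace.
        $classes = preg_split('/\s+/', $node->getAttribute('class'), -1, PREG_SPLIT_NO_EMPTY);
        foreach ($classes as $class) {
            $selectors['.' . $class] = true;
        }
    }

    $css = '';
    foreach (array_keys($selectors) as $selector) {
        $css .= $selector . " { }\n";
    }
    return $css;
}

echo empty_css_rules('<div id="testId"><div class="testClass"><span class="message error">hello world</span></div></div>');
```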
The SimpleXML extension for PHP may help you. It works well for navigating the tree, provided the markup is well-formed XML.
http://www.php.net/manual/en/simplexml.examples-basic.php
I have a string (not XML):
<headername>X-Mailer-Recptid</headername>
<headervalue>15772348</headervalue>
</header>
From this, I need to get the value 15772348, that is, the value of headervalue. How is this possible?
Use PHP DOM and traverse the headervalue tag using getElementsByTagName():
<?php
$doc = new DOMDocument;
@$doc->loadHTML('<headername>X-Mailer-Recptid</headername><headervalue>15772348</headervalue></header>');
$items = $doc->getElementsByTagName('headervalue');
for ($i = 0; $i < $items->length; $i++) {
echo $items->item($i)->nodeValue . "\n";
}
?>
This gives the following output:
15772348
[EDIT]: Code updated to suppress non-HTML warning about invalid headername and headervalue tags as they are not really HTML tags. Also, if you try to load it as XML, it totally fails to load.
This looks XML-like to me. Anyway, if you don't want to parse the string as XML (which might be a good idea), you could try something like this:
<?php
$str = "<headervalue>15772348</headervalue>";
preg_match("/<headervalue\>([0-9]+)<\/headervalue>/", $str, $matches);
print_r($matches);
?>
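// find string short way
function my_url_search($se_action_data)
{
// $regex = '/https?\:\/\/[^\" ]+/i';
$regex = "/<headervalue>([0-9]+)<\/headervalue>/";
preg_match_all($regex, $se_action_data, $matches);
// $matches[1] holds the captured digits, without the surrounding tags
$get_url = array_reverse($matches[1]);
return array_unique($get_url);
}
// the function returns an array, so print it with print_r rather than echo
print_r(my_url_search($se_action_data));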
<?php
require_once("simple_html_dom.php");
$html = str_get_html("<headername>X-Mailer-Recptid</headername><headervalue>15772348</headervalue></header>"); // Use Html dom here
$get_value=$html->find("headervalue", 0)->plaintext;
echo $get_value;
?>
http://simplehtmldom.sourceforge.net/manual.htm#section_find
I am looking for suitable replacement code that allows me to replace the content inside of any HTML tag that has a certain class, e.g.
$class = "blah";
$content = "new content";
$html = '<div class="blah">hello world</div>';
// code to replace, $html now looks like:
// <div class="blah">new content</div>
Bear in mind that:
It won't necessarily be a div; it could be <h2 class="blah">
The class attribute can hold more than one class and the content still needs to be replaced, e.g. <div class="foo blah green">hello world</div>
I am thinking regular expressions should be able to do this, if not I am open to other suggestions such as using the DOM class (although I would rather avoid this if possible because it has to be PHP4 compatible).
Do not use regular expressions to parse HTML. You can use the built in DOMDocument, or something like simple_html_dom:
require_once("simple_html_dom.php");
$class = "blah";
$content = "new content";
$html = '<div class="blah">hello world</div>';
$doc = new simple_html_dom();
$doc->load($html);
foreach ( $doc->find("." . $class) as $node ) {
$node->innertext = $content;
}
Sorry, I didn't see the PHP4 requirement. Here's a solution using the standard DOMDocument as mentioned above.
function DOM_getElementByClassName($referenceNode, $className, $index=false) {
$className = strtolower($className);
$response = array();
foreach ( $referenceNode->getElementsByTagName("*") as $node ) {
$nodeClass = strtolower($node->getAttribute("class"));
if (
$nodeClass == $className ||
preg_match("/\b" . preg_quote($className, "/") . "\b/", $nodeClass)
) {
$response[] = $node;
}
}
if ( $index !== false ) {
return isset($response[$index]) ? $response[$index] : false;
}
return $response;
}
$doc = new DOMDocument();
$doc->loadHTML($html);
foreach ( DOM_getElementByClassName($doc, $class) as $node ) {
$node->nodeValue = $content;
}
echo $doc->saveHTML();
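An alternative sketch for the class lookup uses DOMXPath instead of a hand-written loop. The padded-with-spaces idiom below matches whole class names only, so "foo blah green" matches while "tooMuchBlahNow" does not. The sample markup is invented, and $class is assumed to be a plain class name without quotes:

```php
<?php
$class = 'blah';
$content = 'new content';
$html = '<div class="foo blah green">hello world</div><h2 class="tooMuchBlahNow">keep me</h2>';

$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Pad the class attribute with spaces so only whole class names match.
$query = sprintf('//*[contains(concat(" ", normalize-space(@class), " "), " %s ")]', $class);
foreach ($xpath->query($query) as $node) {
    $node->nodeValue = $content; // replaces the element's children with plain text
}

echo $doc->saveHTML();
```

Note that XPath string comparison is case-sensitive, unlike the strtolower-based helper above.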
If you are sure that $html is valid HTML code, you could use a HTML parser or even XML parser if it's valid XML code.
But the quick and dirty way in Regex would be something like:
$html = preg_replace('/(<[^>]+ class="[^>]*' . $class . '[^"]*"[^>]*>)[^<]+(<\/[^>]+>)/siU', '$1' . $content . '$2', $html);
Didn't test it too much, but it should work. Tell me if you find cases where it doesn't. ;)
Edit: Added "and dirty"... ;)
Edit 2: New version of the RegEx:
<?php
$class = "blah";
$content = "new content";
$html = '<div class="blah test"><h1><span>hello</span> world</h1></div><div class="other">other content</div><h2 class="blah">remove this</h2>';
$html = preg_replace('/<([\w]+)(\s[^>]*class="[^"]*' . $class . '[^"]*"[^>]*>).+(<\/\\1>)/siU', '<$1$2' . $content . '$3', $html);
echo $html;
?>
The last problem left is when a class merely contains "blah" in its name, like "tooMuchBlahNow". Let's see how we can address that. Btw: Is it obvious already that I love playing with RegEx? ;)
There is no need to use the DOM class, this would probably be done quickest using jQuery, as Khnle said, or you could use the preg_replace() function. Give me some time, I may write a quick regex for you.
But I would recommend using something like jQuery, this way you can serve the page up to the user quickly and allow their computer to do the processing instead of your server.