PHP regex replace content in img src attribute? - php

I'm new to PHP and have following issue. I have to sanitize images and need to replace all img src if it starts with http:// with /files/images/.
Is that possible to do with regex?
I found a similar solution here but it does't fulfil everything I need.

A modification of the thread you linked to do what you are requesting:
$dom = new DOMDocument();
$dom->loadHTML($content);
foreach ($dom->getElementsByTagName('img') as $img) {
$current_src = $img->getAttribute('src');
$new_src = preg_replace('http://', '/files/images', $current_src);
$img->setAttribute( 'src', $new_src );
}
$content = $dom->saveHTML();
I don't have a machine with PHP readily available to test this for you (not at home), so let me know if you have any issues with it and I would be happy to troubleshoot for you.

Related

How do I use str_replace with DomDocument

I am using DomDocument to pull content from a specific div on a page.
I would then like to replace all instances of links with a path equal to http://example.com/test/ with http://example.com/test.php.
$url = "http://pugetsoundbasketball.com/stackoverflow_sample.php";
$doc = new DomDocument('1.0', 'UTF-8');
$doc->loadHtml(file_get_contents($url));
$div = $doc->getElementById('upcoming_league_dates');
foreach ($div->getElementsByTagName('a') as $item) {
$item->setAttribute('href', 'http://example.com/test.php');
}
echo $doc->saveHTML($div);
As you can see in the example above, str_replace causes problems after I target the upcoming_league_dates div with getElementById. I understand this but unfortunately I don't know where to go from here!
I've tried several different ways including executing the str_replace above the getElementById function (I figured I could replace the strings first and then target the specific div), with no luck.
What am I missing here?
EDIT: UPDATED CODE TO SHOW WORKING SOLUTION
You can't just use str_replace on that node. You need to access it properly first. Thru the DOMElement class you can use the method ->setAttribute() and make the replacement.
Example:
$url = "http://pugetsoundbasketball.com/stackoverflow_sample.php";
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTMLFile($url);
$xpath = new DOMXpath($dom); // use xpath
$needle = 'http://example.com/test/';
$replacement = 'http://example.com/test.php';
// target the link
$links = $xpath->query("//div[#id='upcoming_league_dates']/a[contains(#href, '$needle')]");
foreach($links as $anchor) {
// replacement of those href values
$anchor->setAttribute('href', $replacement);
}
echo $dom->saveHTML();
Update: After your revision, your code is now working anyway. This is just to answer your logic replacement (ala str_replace search/replace) on your previous question.

Regex to find target="_blank" links and add text before closing </a> tag

I need to be able to parse some text and find all the instances where an tag has target="_blank".... and for each match, add (for example): This link opens in a new window before the closeing tag.
For example:
Before:
Go here now
After:
Go here now<span>(This link opens in a new window)</span>
This is for a PHP site, so i assume preg_replace() will be the method... i just dont have the skills to write the regex properly.
Thanks in advance for any help anyone can offer.
You should never use a regex to parse HTML, except maybe in extremely well-defined and controlled circumstances.
Instead, try a built-in parser:
$dom = new DOMDocument();
$dom->loadHTML($your_html_source);
$xpath = new DOMXPath($dom);
$links = $xpath->query("//a[#target='_blank']");
foreach($links as $link) {
$link->appendChild($dom->createTextNode(" (This link opens in a new window)"));
}
$output = $dom->saveHTML();
Aternatively, if this is being output to the browser, you can just use CSS:
a[target='_blank']:after {
content: ' (This link opens in a new window)';
}
This will work for anchor tag replacement....
$string = str_replace('<a ','<a target="_blank" ',$string);
Well #Kolink is right, but there's my RegExp version.
$string = '<p>mess</p>Google<p>mess</p>';
echo preg_replace("/(\<a.*?target=\"_blank\".*?>)(.*?)(\<\/a\>)/miU","$1$2(This link opens in a new window)$3",$string);
This does the job:
$newText = '<span>(This link opens in a new window)</span>';
$pattern = '~<a\s[^>]*?\btarget\s*=(?:\s*([\'"])_blank\1|_blank\b)[^>]*>[^<]*(?:<(?!/a>)[^<]*)*\K~i';
echo preg_replace($pattern, $newText, $html);
However this direct string approach may replace also commented html parts, strings or comments in css or javascript code and eventually inside javascript literal regexes, that is at best unneeded and at worst unwanted at all. That's why you should use a DOM approach if you want to avoid these pitfalls. All you have to do is to append a new node to each link with the desired attribute:
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$xp = new DOMXPath($dom);
$nodeList = $xp->query('//a[#target="_blank"]');
foreach($nodeList as $node) {
$newNode = dom->createElement('span', '(This link opens in a new window)');
$node->appendChild($newNode);
}
$html = $dom->saveHTML();
To finish, a last alternative consists to not change the html at all and to play with css:
a[target="_blank"]::after {
content: " (This link opens in a new window)";
font-style: italic;
color: red;
}
You won't be able to write a regex that will evaluate an infinitely long string. I suggest:
$h = explode('>', $html);
This will give you the chance to traverse it like any other array and then do:
foreach($h as $k){
if(!preg_match('/^<a href=/', $k){
continue;
}elseif(!preg_match(/target="_blank")/, $k){
continue;
}else{
$h[$k + 1] .= '(open in new window);
}
}
$html = implode('>', $h);
This is how I would approach such a problem. of course, I just threw this out off the top of my head and is note guaranteed to work as is, but with a few possible tweaks to your exact logic, and you will have what you need.

Geting links from a context using preg_match_all

I'm trying to grab all the links and their content from a text, but my problem is that the links might also have other attributes like class or id. What would be the pattern for this?
What i tried so far is:
/<a href="(.*)">(.*)<\/a\>/
Thank You,
Radu
As the comment to your question states, avoid using regex for HTML. The correct way to do it is using DOMDocument
$dom = new DOMDocument;
$dom->load($html);
$xpath = new DOMXPath($dom);
$links = $xpath->query('//*/a');
foreach ($links as $link) {
/* do something with this */
$href = $link->getAttribute('href');
$text = $link->nodeValue;
}
Edit:
An even better answer on the subject
This should do it:
/<a .*?href="(.*?)"[^>]*>([^<]*)<\/a>/i
Read this and see if you still want to use it.

Identify image links in PHP

I have a PHP page which looks through some HTML for links and replaces them with links to a local PHP page; the problem is finding image links. I currently use this code:
$data = preg_replace('|(<a\s*[^>]*href=[\'"]?)|','\1newjs.php?url=', $data);
Which matches things like
Google
and will replace them with
Google
I am looking to do a similar things with image files (jpg, gif, png) and replace something like:
Image
With this:
Image
Note the '&image=1' in the new URL. Is it possible for me to do this using PHP, preferably with regular expressions?
As per usual with anything involving regexes and HTML: https://stackoverflow.com/a/1732454/118068
The proper solution is to use DOM operations:
$dom = new DOMDocument();
$dom->loadHTML(...);
$xp = new DOMXPath($dom);
$anchors = $xp->query('//a');
foreach($anchors as $a) {
$href = $a->getAttribute('href');
if (is_image_link($href)) { //
$a->setAttribute('href', ... new link here ...);
}
}

Regex to extract images from HTML - how to get only JPGs?

I am using this PHP function to grab all <img> tags within any given HTML.
function extract_images($content)
{
$img = strip_tags(html_entity_decode($content),'<img>');
$regex = '~src="[^"]*"~';
preg_match_all($regex, $img, $all_images);
return $all_images;
}
This works and returns all images (gif, png, jpg, etc).
Anyone know how to change the regex...
~src="[^"]*"~
in order to only get files with JPG or JPEG extension?
Thanks a bunch.
Sooner or later the Regex Enforcement Agency will show up. It might as well be me :)
The proper way to do this is with a proper HTML DOM parser. Here's a DOMDocument solution. The usefulness of this is in that it's more robust than parsing the HTML by regex, and also gives you the ability to access or modify other HTML attributes on your <img> nodes at the same time.
$dom = new DOMDocument();
$dom->loadHTML($content);
// To hold all your links...
$links = array();
// Get all images
$imgs = $dom->getElementsByTagName("img");
foreach($imgs as $img) {
// Check the src attr of each img
$src = "";
$src = $img->getAttribute("src");
if (preg_match("/\.jp[e]?g$/i", $src) {
// Add it onto your $links array.
$links[] = $src;
}
See other answers for the simple regex solution, or adapt from the regex inside my foreach loop.
/src="[^"]*\.(jpg|jpeg)"/i
i -> case insensitive match

Categories