php regular expression to remove unwanted code

The editor I am using is adding extraneous coding that I would like to remove via php before writing to the database.
The code looks like this:
<img style="width: 250px;" src="files/school-big.jpg" data-cke-saved-src="files/school-big.jpg" alt="">
<img style="width: 250px;" src="files/firemen.jpg" data-cke-saved-src="files/firemen.jpg" alt="">
What I need to get rid of is the data-cke-saved-src="files/image-name". My understanding of regex is somewhere below weak so how would I build a regex to grab the image name without grabbing the end of the line or the rest of the content?
Thank you kindly,

Try this:
$data = preg_replace('#\s(data-cke-saved-src)="[^"]+"#', '', $data);
Or do it in jQuery before going into PHP with this:
$('img').removeAttr('data-cke-saved-src')
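
To see what that pattern does, here is a minimal sketch run against one of the sample lines from the question. The variable names are for illustration only. The leading \s consumes the space in front of the attribute, so no double space is left behind, and [^"]+ stops at the closing quote, so nothing after the attribute is touched.

```php
<?php
// Sample line taken from the question.
$data = '<img style="width: 250px;" src="files/school-big.jpg" data-cke-saved-src="files/school-big.jpg" alt="">';

// \s     : the whitespace in front of the attribute
// [^"]+  : everything up to (but not including) the closing quote
$clean = preg_replace('#\sdata-cke-saved-src="[^"]+"#', '', $data);

echo $clean, "\n";
// <img style="width: 250px;" src="files/school-big.jpg" alt="">
```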

Try adding and using this function:
/*
*I am assuming you get all the data in a single variable.
*/
function remove_data_cke($text) {
// Get all data-cke-saved-src="..." tags from the html.
$result = array();
preg_match_all('|data-cke-saved-src="[^"]*"|U', $text, $result);
// Replace all occurrences with an empty string.
foreach($result[0] as $data_cke) {
$text = str_replace($data_cke, '', $text);
}
return $text;
}

You can use DOM to easily remove the attribute:
$doc = new DOMDocument;
@$doc->loadHTML($html); // load the HTML data (@ silences warnings about imperfect markup)
foreach ($doc->getElementsByTagName('img') as $img) {
$img->removeAttribute('data-cke-saved-src');
}
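
The DOM approach also needs a step that turns the document back into a string before it is written to the database. A minimal sketch of the full round trip follows, assuming the editor output is an HTML fragment held in a single string. The function name stripCkeSavedSrc and the wrapping div are assumptions made for this example, not part of the answer above. Non-ASCII content may additionally need a charset hint when loading.

```php
<?php
// Hypothetical helper: removes data-cke-saved-src from every <img> in an HTML fragment.
function stripCkeSavedSrc(string $html): string
{
    // Collect libxml warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    // Wrap the fragment so all of its top-level nodes share one parent.
    $doc->loadHTML('<div>' . $html . '</div>');

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('img') as $img) {
        $img->removeAttribute('data-cke-saved-src');
    }

    // Serialize only the children of the wrapper, not the html/body shell.
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }

    return $out;
}

echo stripCkeSavedSrc('<img style="width: 250px;" src="files/firemen.jpg" data-cke-saved-src="files/firemen.jpg" alt="">'), "\n";
```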

Related

How to use htmlspecialchars only on <code></code> tags.

I'm using WordPress and would like to create a function that applies the PHP function htmlspecialchars only to code contained between <code></code> tags. I appreciate this may be fairly simple but I'm new to PHP and can't find any references on how to do this.
So far I have the following:
function FilterCodeOnSave( $content, $post_id ) {
return htmlspecialchars($content, ENT_NOQUOTES);
}
Obviously the above is very simple and performs htmlspecialchars on the entire content of my page. I need to limit the function to only apply to the HTML between code tags (there may be multiple code tags on each page).
Any help would be appreciated.
Thanks,
James

EDIT: updated to avoid multiple CODE tags
Try this:
<?php
// test data
$textToScan = "Hi <code>test12</code><br>
Line 2 <code><br>
Test <b>Bold</b><br></code><br>
Test
";
// debug:
echo $textToScan . "<hr>";
// the regex pattern (i = case-insensitive, s = the dot also matches newlines)
$search = "~<code>(.*?)</code>~is";
// first look for all CODE tags and their content
preg_match_all($search, $textToScan, $matches);
//print_r($matches);
// now replace all the CODE tags and their content with a htmlspecialchars() content
foreach($matches[1] as $match){
$replace = htmlspecialchars($match);
// now replace the previously found CODE block
$textToScan = str_replace($match, $replace, $textToScan);
}
// output result
echo $textToScan;
?>
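
A possible variation on the same idea is preg_replace_callback, which escapes each match in place. That avoids the separate str_replace pass, which can touch identical text outside the code tags or escape a repeated snippet twice. This is a sketch only: the function name escapeCodeBlocks is an assumption, and the opening and closing tags are rewritten in lower case.

```php
<?php
// Hypothetical helper: applies htmlspecialchars() only to the text inside <code>...</code>.
function escapeCodeBlocks(string $content): string
{
    return preg_replace_callback(
        '~<code>(.*?)</code>~is',
        function (array $m): string {
            // $m[1] is the text between the tags; everything else is left untouched.
            return '<code>' . htmlspecialchars($m[1], ENT_NOQUOTES) . '</code>';
        },
        $content
    );
}

echo escapeCodeBlocks('Hi <code><b>x</b></code> <b>y</b>'), "\n";
// Hi <code>&lt;b&gt;x&lt;/b&gt;</code> <b>y</b>
```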

Use DOMDocument to get all <code> tags:
// in this example $dom is an instance of DOMDocument
$code_tags = $dom->getElementsByTagName('code');
if ($code_tags) {
foreach ($code_tags as $node) {
// [...]
}
// [...]
}

I know this is a little late, but you can call the htmlspecialchars function first and then call the htmlspecialchars_decode function when outputting.

PHP find string of substring using Regex

I have a web page source code that I want to use in my project. I want to use an image link in this code. So, I want to reach this link using regex in PHP.
That's it:
img src="http://imagelinkhere.com" class="image"
There is only one line like this.
My logic is to get the string between
="
and
" class="image"
characters.
How can I do that with REGEX? Thank you very much.

Don't use regex for HTML; try DOMDocument instead:
$html = '<html><img src="http://imagelinkhere.com" class="image" /></html>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$img = $dom->getElementsByTagName("img");
foreach ( $img as $v ) {
if ($v->getAttribute("class") == "image")
print($v->getAttribute("src"));
}
Output
http://imagelinkhere.com

preg_match('#(http://.*?)"#', $text, $matches);
var_dump($matches);
The link would be in $matches[1].

Using
.*="(.*)?" .*
with preg_replace gives you only the URL, captured in the first regex group (\1).
Put together, it looks like this:
$str='img src="http://imagelinkhere.com" class="image"';
$str=preg_replace('/.*="(.*)?" .*/','$1',$str);
echo $str;
-->
http://imagelinkhere.com
Edit:
Or just follow Baba's advice and use a DOM parser. Keep in mind that regex will give you headaches when you parse HTML with it.

There are several ways to do so:
1. You can use SimpleHTML Dom Parser, which I prefer for simple HTML.
2. You can also use preg_match:
$foo = '<img class="foo bar test" title="test image" src="http://example.com/img/image.jpg" alt="test image" class="image" />';
$array = array();
preg_match( '/src="([^"]*)"/i', $foo, $array ) ;
see this thread

I can hear the sound of hooves, so I have gone with DOM parsing instead of regex.
$dom = new DOMDocument();
$dom->loadHTMLFile('path/to/your/file.html');
foreach ($dom->getElementsByTagName('img') as $img)
{
if ($img->hasAttribute('class') && $img->getAttribute('class') == 'image')
{
echo $img->getAttribute('src');
}
}
This will echo only the src attribute of an img tag with a class="image"

Try using preg_match_all, like this:
preg_match_all('/img src="([^"]*)"/', $source, $images);
That should put all the URLs of the images in the $images variable: $images[0] holds the full matches and $images[1] holds the captured URLs. The regex finds every img src bit in the code and captures the part between the quotes.
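
As a small illustration of how the result is laid out, here is a sketch with made-up sample markup (the $source string is an assumption). Note that the pattern only matches when src is the first attribute directly after img.

```php
<?php
// Made-up sample markup for illustration.
$source = '<p><img src="a.jpg" alt=""></p><img src="img/b.png">';

preg_match_all('/img src="([^"]*)"/', $source, $images);

// $images[0] holds the full matches, $images[1] the captured URLs.
$urls = $images[1];

print_r($urls);
```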

Strip directory structure in HTML

I have a PHP application that reads in a bit of HTML. In this HTML there may be an img tag. What I want to do is strip the directory structure from the src of the image tag e.g.
<img src="dir1/dir2/dir3/image1.jpg">
to
<img src="image1.jpg">
Anyone have any pointers?
Thanks,
Mark

As a suggestion, rather than trying to parse a whole document with regex, you may be better off using something like the SimpleXML class to traverse the HTML. That way you can find the img tags and their src attributes and change them easily. Once you have the src value, explode the string on the "/" delimiter and use the last element of the resulting array as the new src attribute.
PHP.net's SimpleXML Manual: http://php.net/manual/en/book.simplexml.php

This is a tutorial on how to change all links in an HTML document: Scraping Links From HTML.
With a slight modification of the example, this could do it:
<?php
require('FluentDOM/FluentDOM.php');
$html = '<img src="dir1/dir2/dir3/image1.jpg">';
$fd = FluentDOM($html, 'html')->find('//img[@src]')->each(
function ($node) {
$item = FluentDOM($node);
$item->attr('src', basename($item->attr('src')));
}
);
$fd->contentType = 'xml';
header('Content-type: text/xml');
echo $fd;
?>

If you want to try this with regexp this could work:
$subject = "dir1/dir2/dir3/image1.jpg";
$pattern = '/^.*\//';
$result = preg_replace($pattern, '', $subject);
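
The snippet above works on a bare path. To apply the same idea to the src attributes inside a larger HTML string, one option is preg_replace_callback combined with basename(). The sketch below assumes double-quoted src attributes; the function name stripImageDirectories is made up for this example.

```php
<?php
// Hypothetical helper: reduces every <img src="..."> path to its file name.
function stripImageDirectories(string $html): string
{
    return preg_replace_callback(
        // Group 1: the tag up to the opening quote, group 2: the path, group 3: the closing quote.
        '/(<img\b[^>]*?\ssrc=")([^"]*)(")/i',
        function (array $m): string {
            return $m[1] . basename($m[2]) . $m[3];
        },
        $html
    );
}

echo stripImageDirectories('<p>x</p><img alt="" src="dir1/dir2/dir3/image1.jpg">'), "\n";
// <p>x</p><img alt="" src="image1.jpg">
```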

Scrape unique image URLs from HTML

Using PHP to curl a web page (some URL entered by user, let's assume it's valid).
Example: http://www.youtube.com/watch?v=Hovbx6rvBaA
I need to parse the HTML and extract all de-duplicated URLs that look like an image. Not just the ones in img src="" but any URL ending in jpe?g|bmp|gif|png, etc. on that page. (In other words, I don't want to parse the DOM; I want to use regex.)
I plan to then curl the URLs for their width and height information and ensure that they are indeed images, so don't worry about security related stuff.

What's wrong with using the DOM? It gives you much better control over the context of the information and a much higher likelihood that the things you pull out are actually URLs.
<?php
$resultFromCurl = '
<html>
<body>
<img src="hello.jpg" />
Yep
<table background="yep.jpg">
</table>
<p>
Perhaps you should check out foo.jpg! I promise it
is safe for work.
</p>
</body>
</html>
';
// these are all the attributes i could think of that
// can contain URLs.
$queries = array(
'//table/@background',
'//img/@src',
'//input/@src',
'//a/@href',
'//area/@href',
'//img/@longdesc',
);
$dom = new DOMDocument();
@$dom->loadHTML($resultFromCurl);
$xpath = new DOMXPath($dom);
$urls = array();
foreach ($queries as $query) {
foreach ($xpath->query($query) as $link) {
if (preg_match('#\.(gif|jpe?g|png)$#', $link->textContent))
$urls[$link->textContent] = true;
}
}
if (preg_match_all('#\b[^\s]+\.(?:gif|jpe?g|png)\b#', $dom->textContent, $matches)) {
// $matches[0] holds every full match found in the text
foreach ($matches[0] as $m) {
$urls[$m] = true;
}
}
$urls = array_keys($urls);
var_dump($urls);

Collect all image urls into an array, then use array_unique() to remove duplicates.
$my_image_links = array_unique( $my_image_links );
// No more duplicates
If you really want to do this w/ a regex, then we can assume each image name will be surrounded by either ', ", or spaces, tabs, or line breaks or beginning of line, >, <, and whatever else you can think of. So, then we can do:
$pattern = '/[\'" >\t^]([^\'" \n\r\t]+\.(jpe?g|bmp|gif|png))[\'" <\n\r\t]/i';
preg_match_all($pattern, html_entity_decode($resultFromCurl), $matches);
$imgs = array_unique($matches[1]);
The above will capture the image link in stuff like:
<p>Hai guys look at this ==> http://blah.com/lolcats.JPEG</p>
Live example
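
As a self-contained sketch of the regex-plus-array_unique() route, the following uses a simplified pattern of my own rather than the one above, and a made-up $html sample in place of the curl result. It treats any run of characters without whitespace, quotes, angle brackets or parentheses that ends in a known image extension as a candidate URL.

```php
<?php
// Made-up sample standing in for the fetched page.
$html = '<img src="a.jpg"><a href="a.jpg">x</a> see http://blah.com/lolcats.JPEG now';

// Simplified pattern (an assumption, not the answer's pattern).
preg_match_all('#[^\s"\'<>()]+\.(?:jpe?g|bmp|gif|png)\b#i', $html, $matches);

// array_unique() keeps the first occurrence of each value; array_values() reindexes.
$imgs = array_values(array_unique($matches[0]));

print_r($imgs);
```

The candidates still need the follow-up check described in the question, since a matching extension does not guarantee an image.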

Matching SRC attribute of IMG tag using preg_match

I'm attempting to run preg_match to extract the SRC attribute from the first IMG tag in an article (in this case, stored in $row->introtext).
preg_match('/\< *[img][^\>]*[src] *= *[\"\']{0,1}([^\"\']*)/i', $row->introtext, $matches);
Instead of getting something like
images/stories/otakuzoku1.jpg
from
<img src="images/stories/otakuzoku1.jpg" border="0" alt="Inside Otakuzoku's store" />
I get just
0
The regex should be right, but I can't tell why it appears to be matching the border attribute and not the src attribute.
Alternatively, if you've had the patience to read this far without skipping straight to the reply field and typing 'use a HTML/XML parser', can a good tutorial for one be recommended as I'm having trouble finding one at all that's applicable to PHP 4.
PHP 4.4.7

Your expression is incorrect. Try:
preg_match('/< *img[^>]*src *= *["\']?([^"\']*)/i', $row->introtext, $matches);
Note the removal of brackets around img and src and some other cleanups.
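
The brackets matter because [img] and [src] are character classes: each matches a single character from the set, not the whole word. In the original pattern, [src] followed by = can therefore match the final r of border, which is why 0 was captured. A short sketch comparing both patterns on the tag from the question (the variable names are for illustration only):

```php
<?php
$html = '<img src="images/stories/otakuzoku1.jpg" border="0" alt="Inside Otakuzoku\'s store" />';

// Pattern from the question: [img] and [src] each match ONE character.
preg_match('/\< *[img][^\>]*[src] *= *[\"\']{0,1}([^\"\']*)/i', $html, $broken);

// Corrected pattern: img and src are matched as literal words.
preg_match('/< *img[^>]*src *= *["\']?([^"\']*)/i', $html, $fixed);

echo $broken[1], "\n"; // 0
echo $fixed[1], "\n";  // images/stories/otakuzoku1.jpg
```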

Here's a way to do it with built-in functions (php >= 4):
$parser = xml_parser_create();
xml_parse_into_struct($parser, $html, $values);
foreach ($values as $key => $val) {
if ($val['tag'] == 'IMG') {
$first_src = $val['attributes']['SRC'];
break;
}
}
echo $first_src; // images/stories/otakuzoku1.jpg

If you need to use preg_match() itself, try this:
preg_match('/(?<!_)src=([\'"])?(.*?)\\1/',$content, $matches);

Try:
include ("htmlparser.inc"); // from: http://php-html.sourceforge.net/
$html = 'bla <img src="images/stories/otakuzoku1.jpg" border="0" alt="Inside Otakuzoku\'s store" /> noise <img src="das" /> foo';
$parser = new HtmlParser($html);
while($parser->parse()) {
if($parser->iNodeName == 'img') {
echo $parser->iNodeAttributes['src'];
break;
}
}
which will produce:
images/stories/otakuzoku1.jpg
It should work with PHP 4.x.

The regex I used was much simpler. My code assumes that the string being passed to it contains exactly one img tag with no other markup:
$pattern = '/src="([^"]*)"/';
See my answer here for more info: How to extract img src, title and alt from html using php?

This task should be executed by a dom parser because regex is dom-ignorant.
Code: (Demo)
$row = (object)['introtext' => '<div>test</div><img src="source1"><p>text</p><img src="source2"><br>'];
$dom = new DOMDocument();
$dom->loadHTML($row->introtext);
echo $dom->getElementsByTagName('img')->item(0)->getAttribute('src');
Output:
source1
This says:
Parse the whole html string
Isolate all of the img tags
Isolate the first img tag
Isolate its src attribute value
Clean, appropriate, easy to read and manage.
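
One caveat: item(0) returns null when the markup contains no img tag, so the chained call above would fail on such input. A guarded variant might look like the sketch below. The function name firstImageSrc is an assumption, and the type declarations require PHP 7.1 or later, so this does not apply to the PHP 4 setup in the question.

```php
<?php
// Hypothetical helper: returns the src of the first <img>, or null if there is none.
function firstImageSrc(string $html): ?string
{
    // Collect libxml warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $img = $dom->getElementsByTagName('img')->item(0);
    if ($img === null || !$img->hasAttribute('src')) {
        return null;
    }

    return $img->getAttribute('src');
}

var_dump(firstImageSrc('<div>test</div><img src="source1"><p>text</p><img src="source2"><br>'));
var_dump(firstImageSrc('<p>no image here</p>'));
```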
