Help with Regex expression?

I'm trying to use preg_replace to filter member comments, specifically script and img tags. If the src points to my site, the tag should be kept as is; if it points to another site, only the src URL should be shown.
Regex:
<(\w+).+src=[\x22|'](?![^\x22']+mysite\.com[^\x22']+)([^\x22']+)[\x22|'].*>(?:</\1>)?
Using:
preg_replace($pattern, '$2', $comment);
Comment :
Hi look at this!
<img src="http://www.mysite.com/blah/blah/image.jpg"></img>
<img src="http://mysite.com/blah/blah/image.jpg"></img>
<img src="http://subdomain.mysite.com/blah/blah/image.jpg"/>
<img src="http://www.mysite.fakesite.com/blah/blah/image.jpg"></img>
<img src="http://www.fakesite.com/blah/blah/image.jpg"></img>
<img src="http://fakesite.com/blah/blah/image.jpg"></img>
Which one is your favorite?
Wanted Outcome:
Hi look at this!
<img src="http://www.mysite.com/blah/blah/image.jpg"></img>
<img src="http://mysite.com/blah/blah/image.jpg"></img>
<img src="http://subdomain.mysite.com/blah/blah/image.jpg"/>
http://www.mysite.fakesite.com/blah/blah/image.jpg (notice that it's just url, because it's not from my site)
http://www.fakesite.com/blah/blah/image.jpg
http://fakesite.com/blah/blah/image.jpg
Which one is your favorite?
Anyone see anything wrong?

I'm trying to use preg_replace to filter member comments. To filter script and img tags.
HTML Purifier is going to be the best tool for this purpose, though you want a whitelist of acceptable tags and attributes, not a blacklist of specific harmful tags.

The biggest thing wrong I can see is trying to use regex to modify HTML.
You should use DOMDocument.
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTML($content);
// Copy the live node list first: removing nodes while iterating it directly would skip elements
foreach (iterator_to_array($dom->getElementsByTagName('img')) as $element) {
    if (!$element->hasAttribute('src')) {
        continue;
    }
    $src = $element->getAttribute('src');
    $elementHost = parse_url($src, PHP_URL_HOST);
    $thisHost = $_SERVER['SERVER_NAME'];
    // Replace images hosted elsewhere with their URL as plain text
    if ($elementHost != $thisHost) {
        $element->parentNode->insertBefore($dom->createTextNode($src), $element);
        $element->parentNode->removeChild($element);
    }
}
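To match the wanted outcome in the question, where subdomain.mysite.com counts as the own site but www.mysite.fakesite.com does not, the strict host comparison could be loosened to a suffix check. A minimal sketch, assuming a hypothetical helper isOwnHost() and mysite.com as the site domain (neither comes from the answer above):

```php
<?php
// Hypothetical helper: true when the URL's host is $siteDomain or one of its subdomains.
function isOwnHost(string $url, string $siteDomain): bool
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        // Assumption of this sketch: relative or malformed URLs are treated as foreign.
        return false;
    }

    $host = strtolower($host);
    $siteDomain = strtolower($siteDomain);
    $suffix = '.' . $siteDomain;

    // Exact match, or the host ends with ".mysite.com" (a real subdomain).
    return $host === $siteDomain
        || substr($host, -strlen($suffix)) === $suffix;
}

var_dump(isOwnHost('http://subdomain.mysite.com/blah/blah/image.jpg', 'mysite.com'));
var_dump(isOwnHost('http://www.mysite.fakesite.com/blah/blah/image.jpg', 'mysite.com'));
```

In the loop above, the condition would then become something like if (!isOwnHost($src, 'mysite.com')). Checking the parsed host rather than searching the whole URL is what keeps www.mysite.fakesite.com out.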

You should use the i and m modifiers:
#<(\w+).+src=[\x22|'](?![^\x22']+mysite\.com[^\x22']+)([^\x22']+)[\x22|'].*>(?:</\1>)?#im

Related

Remove everything except image tag from string using regular expression

I have a string that contains all kinds of HTML elements, and I have to remove everything except the images.
Currently I am using this code:
$e->outertext = "<p class='images'>".str_replace(' ', ' ', str_replace('Â','',preg_replace('/#.*?(<img.+?>).*?#is', '',$e)))."</p>";
It serves my purpose but is very slow in execution. Any other way to do the same would be appreciated.

The code you provided does not seem to work as intended, and the regex is malformed: remove the initial slash / so that it reads #.*?(<img.+?>).*?#is.
Your approach is to remove everything and leave just the image tags, which is not a good way to do it. A better way is to capture all image tags and then use the matches to construct the output. First let's capture the image tags. That can be done using this regex:
/<img.*>/U
The U flag inverts greediness, so .* becomes lazy and each match stops at the first > it encounters.
Now, to construct the output, let's use the function preg_match_all and collect the results in a string. That can be done using the following code:
<?php
// defining the input
$e =
'<div class="topbar-links"><div class="gravatar-wrapper-24">
<img src="https://www.gravatar.com/avatar" alt="" width="24" height="24" class="avatar-me js-avatar-me">
</div>
</div> <img test2> <img test3> <img test4>';
// defining the regex
$re = "/<img.*>/U";
// put all matches into $matches
preg_match_all($re, $e, $matches);
// start creating the result
$result = "<p class='images'>";
// loop to get all the images
for($i=0; $i<count($matches[0]); $i++) {
$result .= $matches[0][$i];
}
// print the final result
echo $result."</p>";
A further way to improve that code is to use functional programming (array_reduce, for example). But I'll leave that as homework.
Note: There is another way to accomplish this which is parsing the html document and using XPath to find the elements. Check out this answer for more information.
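The parsing approach mentioned in the note could look roughly like this. It is only a sketch using PHP's built-in DOMDocument and DOMXPath; the sample markup is made up, and the output keeps the same <p class='images'> wrapper as the code above:

```php
<?php
// Made-up input: two images surrounded by other markup.
$html = '<div class="topbar-links"><img src="a.png" alt=""></div> <p>text</p> <img src="b.png">';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // real-world markup often triggers parser warnings
$dom->loadHTML($html);
libxml_clear_errors();

// Select every img element, wherever it sits in the document.
$xpath = new DOMXPath($dom);

$result = "<p class='images'>";
foreach ($xpath->query('//img') as $img) {
    // saveHTML() with a node argument serializes just that element.
    $result .= $dom->saveHTML($img);
}
$result .= '</p>';

echo $result;
```

Because the parser finds the elements, attribute order, quoting style, and line breaks inside a tag no longer matter.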

preg_replace for images in PHP

I have a question about preg_replace. I have the following HTML in WordPress:
<img width="256" height="256" src="http://localhost/wp-content/uploads/2015/08/spiderman-avatar.png" class="attachment-post-thumbnail wp-post-image" alt="spiderman-avatar">
I change it to the following:
<img src="" data-breakpoint="http://localhost/wp-content/uploads/2015/08/" data-img="theme-{folder}.jpg" class="srcbox" alt="spiderman-avatar">
with the following preg_replace:
$html = preg_replace(
'/src="(https?:\/\/.+\/)(.+\-)([0-9]+)(.jpg|.jpeg|.png|.gif)"/',
'src="" data-breakpoint="$1" data-img="$2{folder}$4"', // Replace and split src attribute into two new attributes
preg_replace(
'/(width|height)="[0-9]*"/',
'', // Remove width and height attributes
preg_replace(
'/<img ?([^>]*)class="([^"]*)"?/',
'<img $1 class="$2 srcbox"', // Add class srcbox to class attribute
$html
)
)
);
I have the feeling I have written some seriously slow code, and that it can be done in a single preg_replace.
Chris85 mentioned the HTML parser, so I found this and got this so far:
http://nimishprabhu.com/top-10-best-usage-examples-php-simple-html-dom-parser.html
include('simple_html_dom.php');
$html = file_get_html($html);
From here I COULD loop through all images and change the attributes. But how do I put the new element back where it came from?

You would be better off using DOM
http://php.net/manual/de/domdocument.loadhtml.php
and reading the attributes with it.
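As for putting the element back where it came from: with DOM the element is changed in place, so nothing has to be re-inserted. A rough sketch covering only two of the three steps from the question (dropping width/height and adding the srcbox class); the wrapper div and the "Intro" paragraph are assumptions added so the fragment can be serialized again afterwards:

```php
<?php
$html = '<p>Intro</p><img width="256" height="256" '
      . 'src="http://localhost/wp-content/uploads/2015/08/spiderman-avatar.png" '
      . 'class="attachment-post-thumbnail wp-post-image" alt="spiderman-avatar">';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
// Wrap the fragment so its nodes can be found again after loadHTML() adds html/body.
$dom->loadHTML('<div>' . $html . '</div>');
libxml_clear_errors();

foreach ($dom->getElementsByTagName('img') as $img) {
    // Attributes are edited on the node itself; its position in the tree is untouched.
    $img->removeAttribute('width');
    $img->removeAttribute('height');
    $img->setAttribute('class', trim($img->getAttribute('class') . ' srcbox'));
}

// Serialize only the children of the wrapper div, i.e. the original fragment.
$wrapper = $dom->getElementsByTagName('div')->item(0);
$output = '';
foreach ($wrapper->childNodes as $child) {
    $output .= $dom->saveHTML($child);
}

echo $output;
```

Splitting src into data-breakpoint and data-img would go in the same loop, using getAttribute('src') and setAttribute().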

How to scrape img src value of each li tag

<ul class="vehicle__gallery cf">
<li><img src="AETV19098412_2a.jpg"></li>
<li><img src="AETV19098412_3a.jpg"></li>
<li><img src="AETV19098412_4a.jpg"></li>
</ul>
and my preg match syntax is as below:
preg_match_all('/<ul class="vehicle__gallery cf">.*?<li>.*?<a(.*?)href="(.*?)"(.*?)>(.*?)<\/a>.*?<\/li>.*?<\/ul>/s', $html_image,$posts, PREG_SET_ORDER);

Please don't use regular expressions to parse HTML. PHP has a fine DOM implementation you can use to loadHTML() and query() it with XPath expressions such as //ul/li//img/@src to retrieve what you're after, or maybe import it as a SimpleXML object if you prefer that toolset.
Example:
$html = <<<HTML
<ul class="vehicle__gallery cf">
<li><img src="AETV19098412_2a.jpg"></li>
<li><img src="AETV19098412_3a.jpg"></li>
<li><img src="AETV19098412_4a.jpg"></li>
</ul>
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
// @src selects the attribute; // before img also matches images nested inside links
$imgs = $xpath->query("//ul/li//img/@src");
foreach ($imgs as $img) {
    echo $img->nodeValue . "\n";
}
Output:
AETV19098412_2a.jpg
AETV19098412_3a.jpg
AETV19098412_4a.jpg

Don't use regex to parse HTML. It won't work reliably:
<li> tags don't always have a closing tag, and neither do <img> tags.
A tag can have any number of attributes.
Attribute values aren't always enclosed in double quotes.
Use an HTML parser like simpledomparser.
I won't even attempt to come up with a regex for this, because at some point it would fail.

If you give your img tags a class or something, for example:
<img class="gallery_item" src="AETV19098412_2a.jpg">
<img class="gallery_item" src="AETV19098412_3a.jpg">
you can do it more easily:
preg_match_all('/<img class="gallery_item" src="(.*?)">/', $html_image, $matches);
However, this is still very hacky: if you ever add a CSS class or another HTML attribute, or otherwise modify your markup, this code might stop working.
This solution is anything but clean. You should consider using jQuery or a form, as stated in my earlier comment; that would make your life a lot easier, and the code would not break because of minor HTML changes that might come up any day.

Another approach is to use JavaScript (jQuery).
var imgArr = [];
$("ul.vehicle__gallery li img").each(function () {
    imgArr.push($(this).attr('src'));
});

Getting the first image in string with php

I'm trying to get the first image from each of my posts. The code below works great if I only have one image. But if I have more than one, it gives me an image, though not always the first.
I really only want the first image. A lot of the time the second image is a "next" button.
$texthtml = 'Who is Sara Bareilles on Sing Off<br>
<img alt="Sara" title="Sara" src="475993565.jpg"/><br>
<img alt="Sara" title="Sara two" src="475993434343434.jpg"/><br>';
preg_match_all('/<img.+src=[\'"]([^\'"]+)[\'"].*>/i', $texthtml, $matches);
$first_img = $matches [1] [0];
Now I can take $first_img and put it in front of the short description:
<img alt="Sara" title="Sara" src="<?php echo $first_img;?>"/>

If you only need the first src value, preg_match should do instead of preg_match_all. Does this work for you?
<?php
$texthtml = 'Who is Sara Bareilles on Sing Off<br>
<img alt="Sara" title="Sara" src="475993565.jpg"/><br>
<img alt="Sara" title="Sara two" src="475993434343434.jpg"/><br>';
preg_match('/<img.+src=[\'"](?P<src>.+?)[\'"].*>/i', $texthtml, $image);
echo $image['src'];
?>

Don't use regex to parse HTML.
Use an HTML-parsing library or class, such as phpQuery:
require 'phpQuery-onefile.php';
$texthtml = 'Who is Sara Bareilles on Sing Off<br>
<img alt="Sarahehe" title="Saraxd" src="475993565.jpg"/><br>
<img alt="Sara" title="Sara two" src="475993434343434.jpg"/><br>';
$pq = phpQuery::newDocumentHTML($texthtml);
$img = $pq->find('img:first');
$src = $img->attr('src');
echo "<img alt='foo' title='baa' src='{$src}'>";
Download: http://code.google.com/p/phpquery/

After testing an answer from "Using regular expressions to extract the first image source from html codes?", I got better results, with fewer broken image links, than with the answer provided here.
While regular expressions can be good for a large variety of tasks, I find they usually fall short when parsing an HTML DOM. The problem with HTML is that the structure of your document is so variable that it is hard to extract a tag accurately (and by accurately I mean a 100% success rate with no false positives).
For more consistent results use this object http://simplehtmldom.sourceforge.net/ which allows you to manipulate HTML.
An example is provided in the response in the first link I posted.
function get_first_image($html) {
    require_once('SimpleHTML.class.php');
    $post_html = str_get_html($html);
    // The second argument 0 asks for the first matching element only
    $first_img = $post_html->find('img', 0);
    if ($first_img !== null) {
        return $first_img->src;
    }
    return null;
}
Enjoy

Using regular expressions to extract the first image source from html codes?

I would like to know how this can be achieved.
Assume: there is a lot of HTML code containing tables, divs, images, etc.
Problem: how can I get matches for all occurrences? More specifically, how can I get the source of each img tag (src = ?)?
Example:
<img src="http://example.com/g.jpg" alt="" />
How can I print out http://example.com/g.jpg in this case? As mentioned, assume there are other tags in the HTML code as well, and possibly more than one image. Would it be possible to get an array of all image sources in the HTML code?
I know this can be achieved one way or another with regular expressions, but I can't get the hang of it.
Any help is greatly appreciated.

While regular expressions can be good for a large variety of tasks, I find they usually fall short when parsing an HTML DOM. The problem with HTML is that the structure of your document is so variable that it is hard to extract a tag accurately (and by accurately I mean a 100% success rate with no false positives).
What I recommend you do is use a DOM parser such as SimpleHTML and use it as such:
function get_first_image($html) {
    require_once('SimpleHTML.class.php');
    $post_html = str_get_html($html);
    // The second argument 0 asks for the first matching element only
    $first_img = $post_html->find('img', 0);
    if ($first_img !== null) {
        return $first_img->src;
    }
    return null;
}
Some may think this is overkill, but in the end, it will be easier to maintain and also allows for more extensibility. For example, using the DOM parser, I can also get the alt attribute.
A regular expression could be devised to achieve the same goal but would be limited in such way that it would force the alt attribute to be after the src or the opposite, and to overcome this limitation would add more complexity to the regular expression.
Also, consider the following. To properly match an <img> tag using regular expressions and to get only the src attribute (captured in group 2), you need the following regular expression:
<\s*?img\s+[^>]*?\s*src\s*=\s*(["'])((\\?+.)*?)\1[^>]*?>
And then again, the above can fail if:
The attribute or tag name is in capitals and the i modifier is not used.
Quotes are not used around the src attribute.
An attribute other than src uses the > character somewhere in its value.
Some other reason I have not foreseen.
So again, simply don't use regular expressions to parse a dom document.
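To illustrate the failure cases listed above, here is a small sketch. It uses PHP's built-in DOMDocument rather than SimpleHTML (a swapped-in parser, chosen only because it ships with PHP), and the sample markup and the simple quoted-attribute pattern are made up for the demonstration:

```php
<?php
// Upper-case names, an unquoted src, a ">" inside another attribute, and single quotes.
$html = '<IMG ALT="a > b" SRC=first.jpg><img src=\'second.jpg\'>';

// A typical pattern that expects lower-case names and double-quoted values.
preg_match_all('/<img[^>]+src="([^"]+)"/', $html, $m);
echo count($m[1]) . " match(es) from the regex\n";

// The parser normalizes case and quoting before we ever look at the attribute.
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$sources = [];
foreach ($dom->getElementsByTagName('img') as $img) {
    $sources[] = $img->getAttribute('src');
}
print_r($sources);
```

The pattern could of course be extended to cover each case, but every extension makes it harder to read, which is the point made above.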
EDIT: If you want all the images:
function get_images($html) {
    require_once('SimpleHTML.class.php');
    $post_dom = str_get_html($html);
    // Without an index, find() returns every matching element
    $img_tags = $post_dom->find('img');
    $images = array();
    foreach ($img_tags as $image) {
        $images[] = $image->src;
    }
    return $images;
}

Use this, it is more effective:
preg_match_all('/<img [^>]*src=["\']([^"\']+)/i', $html, $matches);
foreach ($matches[1] as $key=>$value) {
echo $value."<br>";
}
Example:
$html = '
<ul>
<li><a target="_new" href="http://www.manfromuranus.com">Man from Uranus</a></li>
<li><a target="_new" href="http://www.thevichygovernment.com/">The Vichy Government</a></li>
<li><a target="_new" href="http://www.cambridgepoetry.org/">Cambridge Poetry</a></li>
<img width="190" height="197" border="0" align="right" alt="upload.jpg" title="upload.jpg" class="noborder" src="value1.jpg" />
<li>Electronaut Records</li>
<img width="190" height="197" border="0" align="right" alt="upload.jpg" title="upload.jpg" class="noborder" src="value2.jpg" />
<li><a target="_new" href="http://www.catseye-crew.com">Catseye Productions</a></li>
<img width="190" height="197" border="0" align="right" alt="upload.jpg" title="upload.jpg" class="noborder" src="value3.jpg" />
</ul>
<img width="190" height="197" border="0" align="right" alt="upload.jpg" title="upload.jpg" class="noborder" src="res/upload.jpg" />
<li><a target="_new" href="http://www.manfromuranus.com">Man from Uranus</a></li>
<li><a target="_new" href="http://www.thevichygovernment.com/">The Vichy Government</a></li>
<li><a target="_new" href="http://www.cambridgepoetry.org/">Cambridge Poetry</a></li>
<img width="190" height="197" border="0" align="right" alt="upload.jpg" title="upload.jpg" class="noborder" src="value4.jpg" />
<li>Electronaut Records</li>
<img src="value5.jpg" />
<li><a target="_new" href="http://www.catseye-crew.com">Catseye Productions</a></li>
<img width="190" height="197" border="0" align="right" alt="upload.jpg" title="upload.jpg" class="noborder" src="value6.jpg" />
';
preg_match_all('/<img [^>]*src=["\']([^"\']+)/i', $html, $matches);
foreach ($matches[1] as $key=>$value) {
echo $value."<br>";
}
Output:
value1.jpg
value2.jpg
value3.jpg
res/upload.jpg
value4.jpg
value5.jpg
value6.jpg

This works for me:
preg_match('#<img.+src="(.*)".*>#Uims', $html, $matches);
$src = $matches[1];

I assume all your src= values have " around the URL:
<img[^>]+src=\"([^\"]+)\"
The other answers posted here make other assumptions about your code.

I agree with Andrew Moore. Using the DOM is much, much better. The HTML DOM images collection will return to you a reference to all image objects.
Let's say in your header you have,
<script type="text/javascript">
function getFirstImageSource()
{
var img = document.images[0].src;
return img;
}
</script>
and then in your body you have,
<script type="text/javascript">
alert(getFirstImageSource());
</script>
This will return the 1st image source. You can also loop through them along the lines of, (in head section)
function getAllImageSources()
{
var returnString = "";
for (var i = 0; i < document.images.length; i++)
{
returnString += document.images[i].src + "\n"
}
return returnString;
}
(in body)
<script type="text/javascript">
alert(getAllImageSources());
</script>
If you're using JavaScript to do this, remember that you can't call a function that reads the images collection while the head is still being processed. In other words, you can't do something like this,
<script type="text/javascript">
function getFirstImageSource()
{
var img = document.images[0].src;
return img;
}
window.onload = getFirstImageSource(); // bad: the parentheses call the function right away
</script>
because this won't work. No image has been parsed yet when the head is executed, so document.images[0] is undefined and reading its src fails.
Hopefully this can help in some way. If possible, I'd make use of the DOM. You'll find that a good deal of your work is already done for you.

I don't know if you MUST use regex to get your results. If not, you could try out SimpleXML and XPath, which would be much more reliable for your goal:
First, import the HTML into a DOMDocument object. If you get errors, turn errors off for this part and be sure to turn them back on afterward:
$dom = new DOMDocument();
$dom->loadHTMLFile("filename.html");
Next, import the DOM into a SimpleXML object, like so:
$xml = simplexml_import_dom($dom);
Now you can use a few methods to get all of your image elements (and their attributes) into an array. XPath is the one I prefer, because I've had better luck traversing the DOM with it:
$images = $xml->xpath('//img/@src');
This variable can now be treated like an array of your image URLs:
foreach ($images as $image) {
    echo "<img src=\"$image\" /><br />\n";
}
Presto, all of your images, none of the fat.
Here's the non-annotated version of the above:
$dom = new DOMDocument();
$dom->loadHTMLFile("filename.html");
$xml = simplexml_import_dom($dom);
$images = $xml->xpath('//img/@src');
foreach ($images as $image) {
    echo "<img src=\"$image\" /><br />\n";
}

I really think you cannot predict all the cases with one regular expression.
The best way is to use the DOM with the PHP5 class DOMDocument and XPath. It's the cleanest way to do what you want.
$dom = new DOMDocument();
$dom->loadHTML($htmlContent);
$xml = simplexml_import_dom($dom);
$images = $xml->xpath('//img/@src');

You can try this:
preg_match_all('/<img\s+src="([^"]+)"/i', $html, $matches);
// $matches[1] holds the captured src values; $matches itself is an array of arrays
foreach ($matches[1] as $key => $value) {
    echo $key . ", " . $value . "<br>";
}

Since you're not worrying about validating the HTML, you might try using strip_tags() on the text first to clear out most of the cruft; pass '<img>' as the second argument so that the img tags themselves are kept.
Then you can search for an expression like
"/\<img .+ \/\>/i"
The backslash before / is needed because / is also the pattern delimiter; the ones before < and > are harmless but not required.
.+ insists that there be 1 or more of any character inside the img tag.
You can capture part of the expression by putting parentheses around it, e.g. (.+) captures the middle part of the img tag.
When you decide which part of the middle you specifically wish to capture, you can change the (.+) to something more specific.
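Put together, the idea might look like the sketch below. It assumes strip_tags() is told to keep img tags through its second argument, and the sample markup and the more specific src="..." capture are made up for illustration:

```php
<?php
// Made-up input with images mixed into other markup.
$html = '<div><p>Hello <b>world</b></p><img src="one.jpg" alt="One" />'
      . '<a href="#">link</a><img src="two.jpg" /></div>';

// Drop every tag except <img>; the text content of removed tags stays behind.
$onlyImages = strip_tags($html, '<img>');

// Capture just the src value instead of the whole middle of the tag.
preg_match_all('/<img [^>]*src="([^"]+)"/i', $onlyImages, $matches);

print_r($matches[1]);
```

This still shares the limits of any regex approach (double quotes are assumed here), so it suits quick clean-up of markup you control rather than arbitrary HTML.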

<?php
/* PHP Simple HTML DOM Parser # http://simplehtmldom.sourceforge.net */
require_once('simple_html_dom.php');
$html = file_get_html('http://example.com');
$image = $html->find('img')[0]->src;
echo "<img src='{$image}'/>"; // BOOM!
PHP Simple HTML DOM Parser will do the job in few lines of code.
