php match pattern to get images from text file - php

I have seen many answers to questions about how to extract the actual image URLs from web page content, text, etc. However, in my database, sadly, I have this syntax:
&lt;img class=&quot;photo&quot; src=&quot;http://domain.com/image.jpg&quot; alt=&quot;alt goes here&quot; /&gt;
So the typical approach, $pattern = '/src=["|\']([^"|\']+)/is';, does not work in my case because of those &quot; entities...
I have been trying for hours; I must be doing something very, very wrong...
Any help is much appreciated!

First of all, the 'usual way' is to use an HTML/XML parser, not regular expressions.
Secondly, what you have is HTML code encoded as HTML text, which smells bad for two reasons:
it's not HTML any more (why encode it as HTML text when it is in fact HTML code?)
you shouldn't encode HTML before putting it into the DB, but rather when writing it out to the user.
With these two issues aside, what you need to do is run that stuff through htmlspecialchars_decode() and pass the result to a parser:
$stuff = '&lt;img class=&quot;photo&quot; src=&quot;http://domain.com/image.jpg&quot; alt=&quot;alt goes here&quot; /&gt;';
$code = htmlspecialchars_decode($stuff, ENT_QUOTES);
$xml = simplexml_load_string($code);
That said, to me this sounds like a hack to fix badly written code. But there may be a valid reason why it's there in the first place.
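Putting the two steps together, a minimal sketch might look like the following. It uses DOMDocument instead of SimpleXML (my assumption, since DOMDocument copes better with markup that is not well-formed XML), and $stored is a hypothetical stand-in for the value read from the database.

```php
<?php
// Sketch: decode the stored entities, then read each img src via the DOM extension.
// $stored is a hypothetical stand-in for the value read from the database.
$stored = '&lt;img class=&quot;photo&quot; src=&quot;http://domain.com/image.jpg&quot; alt=&quot;alt goes here&quot; /&gt;';

// Turn &lt; &gt; &quot; back into real markup.
$html = htmlspecialchars_decode($stored, ENT_QUOTES);

$dom = new DOMDocument();
// Collect libxml warnings internally instead of printing them for sloppy markup.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$sources = [];
foreach ($dom->getElementsByTagName('img') as $img) {
    $sources[] = $img->getAttribute('src');
}

print_r($sources);
```

If a row contains several images, $sources simply ends up with one entry per img element.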

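Don't use regular expressions!
Use an XML/DOM library such as Simple HTML DOM.
BTW, the regular expression you are looking for is:
$pattern = '/src=(["\'])(.+?)(?=\1)/i';
The lazy .+? stops at the first matching closing quote, so attributes that follow src (such as alt) are not swallowed.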
Test Case (Optional):
Here is a simple program to test it. Obviously you need to use htmlspecialchars_decode() first to decode it from entity format.
$str = array(
"<script type=\"text/javascript\" src=\"script.js\"></script>",
"<script type=\"text/javascript\" src='script.js'></script>",
'<script type="text/javascript" src="script.js"></script>',
'<script type="text/javascript" src=\'script.js\'></script>',
);
$pattern = '/src=(["\'])(.+?)(?=\1)/i';
foreach($str as $s){
preg_match($pattern, $s, $m);
echo $m[2], PHP_EOL;
}
Output
script.js
script.js
script.js
script.js

You can test Regex here:
http://gskinner.com/RegExr/
What's not working?

Related

Change the src part of an img tag

I've got a string containing HTML code, and I want to change <img src="anything.jpg"> to <img src="'.DOC_ROOT .'anything.jpg"> every time it occurs in the string. I really don't want to use an HTML parser, since this will be the only thing I'll be using it for. Does anyone know how to do this in PHP, using a regex for example?
You really should use a parser but since you made clear that you really don't want to do that, you can use the following regex replace:
$string = preg_replace('/<img([^>]*)src=["\']([^"\'\\/][^"\']*)["\']/', '<img\1src="'.DOC_ROOT.'\2"', $string);
This regular expression will not modify any URLs that already begin with a slash. Change it to the following if you do want to match those:
$string = preg_replace('/<img([^>]*)src=["\']["\'\\/]?([^"\']*)["\']/', '<img\1src="'.DOC_ROOT.'\2"', $string);
If you absolutely have to use regular expressions instead of a DOM parser, you could use this.
Not sure where DOC_ROOT is coming from though, since it's not a valid PHP variable (maybe a constant?). Also be aware that you won't be able to use an embedded variable inside the string if you have single quotes.
You probably want something more like:
img.*?src=['"](.*?)['"]
Replacing with:
img src="$_SERVER['DOCUMENT_ROOT']$1"
Which converts:
echo "<img src='anything.jpg'>"; //into:
echo "<img src='{$_SERVER['DOCUMENT_ROOT']}/anything.jpg'>";
http://regex101.com/r/vN7lN9
In php, the code would look like this:
$string = "<img src='anything.jpg'>";
echo preg_replace('/img.*?src=[\'\"](.*?)[\'\"]/', "img src='$_SERVER[DOCUMENT_ROOT]/$1'", $string);
Be warned that if your markup contains irregular HTML (a tag misplaced here and there, spaces around the = sign), you're liable to end up causing a lot of problems. That's where a DOM parser like DOMDocument comes in handy.
A lot of people state the importance of using a DOM parser, but too few answers actually demonstrate how to execute the task.
Regex, even when tempting to write a one-liner or to change a single character, is unsuitable for parsing html because it is DOM-ignorant -- it treats your input as a string and nothing more. I've crafted a demonstration of how regex (from the accepted answer) will make unintended replacements.
Code:
$html = <<<HTML
<p>Some random text <img src="anything.jpg"> text <iframe data-whoops="<img" src="anything.jpg"></iframe></p>
HTML;
define('DOC_ROOT', 'www.example.com/');
echo "With regex:\n";
echo preg_replace('/<img([^>]*)src=["\']([^"\'\\/][^"\']*)["\']/', '<img\1src="'.DOC_ROOT.'\2"', $html);
echo "\n\n---\n\nWith a parser:\n";
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
foreach ($dom->getElementsByTagName('img') as $img) {
$img->setAttribute('src', DOC_ROOT . $img->getAttribute('src'));
}
echo $dom->saveHTML();
Output:
With regex:
<p>Some random text <img src="www.example.com/anything.jpg"> text <iframe data-whoops="<img" src="www.example.com/anything.jpg"></iframe></p>
---
With a parser:
<p>Some random text <img src="www.example.com/anything.jpg"> text <iframe data-whoops="<img" src="anything.jpg"></iframe></p>
If you need to make conditional replacements on an img tag's url, there are additional tools like a url parser or Xpath that can be implemented to serve your requirements.
https://stackoverflow.com/a/60263813/2943403
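As a rough sketch of such a conditional replacement: the snippet below prefixes only relative src values and leaves absolute URLs and root-relative paths alone. The sample markup and the rule itself are my assumptions for illustration; DOC_ROOT is defined the same way as in the demo above.

```php
<?php
// Sketch: prefix only relative img src values; absolute and rooted URLs are left alone.
// DOC_ROOT and the sample markup are illustrative assumptions.
define('DOC_ROOT', 'www.example.com/');

$html = '<p><img src="anything.jpg"> <img src="http://cdn.example.com/logo.png"> <img src="/rooted.jpg"></p>';

$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);

$updated = [];
// Select only img elements that actually carry a src attribute.
foreach ($xpath->query('//img[@src]') as $img) {
    $src = $img->getAttribute('src');
    // parse_url() reports a scheme for absolute URLs such as http://...
    $hasScheme = parse_url($src, PHP_URL_SCHEME) !== null;
    $isRooted  = strncmp($src, '/', 1) === 0;
    if (!$hasScheme && !$isRooted) {
        $img->setAttribute('src', DOC_ROOT . $src);
    }
    $updated[] = $img->getAttribute('src');
}

echo $dom->saveHTML();
```

The condition lives in ordinary PHP rather than inside a pattern, so adding another rule (for example, skipping data: URIs) is a one-line change.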
Ultimately, my advice is to forget about how many lines of code you write; just write robust/reliable code.
That's what you are looking for, I think:
$pictureName = 'anything.jpg';
$html = str_replace($pictureName, DOC_ROOT.$pictureName, $html);

[php]how to extract a single simple text from a long html source

I have HTML like this:
......whatever very long html.....
<span class="title">hello world!</span>
......whatever very long html......
It is a very long HTML document, and I only want the content 'hello world!' from it.
I got this HTML with:
$result = file_get_contents($url , false, $context);
Many people use the Simple HTML DOM parser, but I think in this case a regex would be more efficient.
How should I do it? Any suggestions or help would be great.
Thanks in advance!
Stick with the DOM parser - it is better. Having said that, you could use a REGEX like this...
// where the html is stored in `$html`
preg_match('/<span class="title">(.+?)<\/span>/', $html, $m);
$whatYouWant = $m[1];
preg_match() stores an array of all the elements captured inside brackets in the regex, and a 0th element which is the entire captured string. The regex is very simple in this case, being almost a direct string match for what you want, with the closing span tag's slash escaped. The captured part just means any character (.) one or more times (+) un-greedily (?).
No, I really don't think regEx or similar functions would be either more effective or easier.
If you would use SimpleHTML DOM, you could quickly get the data you are looking for like this:
//Get your file
$html = file_get_html('myfile.html');
//Use jQuery style selectors; the index 0 returns the first match instead of an array
$spanValue = $html->find('span.title', 0)->plaintext;
echo($spanValue);
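If you'd rather not pull in a third-party library, roughly the same lookup can be sketched with PHP's bundled DOM extension. The inline $html below is a made-up stand-in for the page fetched with file_get_contents().

```php
<?php
// Sketch: read the text of the first <span class="title"> with the bundled DOM extension.
// The inline $html is a stand-in for the page fetched with file_get_contents().
$html = '<html><body><div>intro</div><span class="title">hello world!</span><p>more</p></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // tolerate sloppy real-world markup
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Match "title" as a whole class token, even when other classes are present.
$nodes = $xpath->query('//span[contains(concat(" ", normalize-space(@class), " "), " title ")]');

// null signals that no matching span was found.
$title = $nodes->length > 0 ? $nodes->item(0)->textContent : null;
echo $title, PHP_EOL;
```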
with preg_match you could do like this:
preg_match("/<span class=\"title\">([^`]*?)<\/span>/", $data, $matches);
or this, if there are multiple spans with the class "title":
preg_match_all("/<span class=\"title\">([^`]*?)<\/span>/", $data, $matches);

automatic link creation using php without breaking the html tags

I want to convert text links in my content page into active links using PHP. I tried every script I could find; they all work, but the problem is that they also convert links inside img src attributes. They convert links everywhere and break the HTML code.
I found a good script that does exactly what I want, but it is in JavaScript. It is called jquery-linkify.
You can find the script here:
http://github.com/maranomynet/linkify/
The trick in the script is that it converts text links without breaking the HTML code. I tried to convert the script to PHP but failed.
I can't use the script on my website because there are other scripts that conflict with jQuery.
Could anyone rewrite this script for PHP, or at least guide me on how to do it?
Thanks.
First, parse the text with an HTML parser, with something like DOMDocument::loadHTML. Note that poor HTML can be hard to parse, and depending on the parser, you might get slightly different output in the browser after running such a function.
PHP's DOMDocument isn't very flexible in that regard. You may have better luck by parsing with other tools. But if you are working with valid HTML (and you should try to, if it's within your control), none of that is a concern.
After parsing the text, you need to look at the text nodes for links and replace them. Using a regular expression is the simplest way.
Here's a sample script that does just that:
<?php
function linkify($text)
{
$re = "#\b(https?://)?(([0-9a-zA-Z_!~*'().&=+$%-]+:)?[0-9a-zA-Z_!~*'().&=+$%-]+@)?(([0-9]{1,3}\.){3}[0-9]{1,3}|([0-9a-zA-Z_!~*'()-]+\.)*([0-9a-zA-Z][0-9a-zA-Z-]{0,61})?[0-9a-zA-Z]\.[a-zA-Z]{2,6})(:[0-9]{1,4})?((/[0-9a-zA-Z_!~*'().;?:@&=+$,%\#-]+)*/?)#";
preg_match_all($re, $text, $matches, PREG_OFFSET_CAPTURE);
$matches = $matches[0];
$i = count($matches);
while ($i--)
{
$url = $matches[$i][0];
if (!preg_match('#^https?://#', $url))
$url = 'http://'.$url;
$text = substr_replace($text, '<a href="'.$url.'">'.$matches[$i][0].'</a>', $matches[$i][1], strlen($matches[$i][0]));
}
return $text;
}
$dom = new DOMDocument();
$dom->loadHTML('<b>stackoverflow.com</b> test');
$xpath = new DOMXpath($dom);
foreach ($xpath->query('//text()') as $text)
{
$frag = $dom->createDocumentFragment();
$frag->appendXML(linkify($text->nodeValue));
$text->parentNode->replaceChild($frag, $text);
}
echo $dom->saveHTML();
?>
I did not come up with that regular expression, and I cannot vouch for its accuracy. I also did not test the script, except for this above case. However, this should be more than enough to get you going.
Output:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body>
<b><a href="http://stackoverflow.com">stackoverflow.com</a></b>
test
</body>
</html>
Note that saveHTML() adds the surrounding tags. If that's a problem, you can strip them out with substr().
Use a HTML parser and only search for URLs within text nodes.
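Here is a minimal sketch of that idea which builds the links with DOM methods instead of appendXML(). The function name linkifyTextNodes() and the deliberately simple URL pattern are my own assumptions; the pattern only catches http(s) URLs and may pick up trailing punctuation.

```php
<?php
// Sketch: turn bare http(s) URLs into links, touching text nodes only.
// linkifyTextNodes() is a hypothetical helper; the URL pattern is intentionally simple.
function linkifyTextNodes(DOMDocument $dom): void
{
    $xpath = new DOMXPath($dom);
    // Copy the node list first, and skip text that already sits inside a link.
    $textNodes = iterator_to_array($xpath->query('//text()[not(ancestor::a)]'));

    foreach ($textNodes as $node) {
        $parts = preg_split('#(https?://[^\s<>"]+)#', $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // no URL in this text node
        }
        $frag = $dom->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {
                // Odd indexes hold the captured URLs.
                $a = $dom->createElement('a');
                $a->setAttribute('href', $part);
                $a->appendChild($dom->createTextNode($part));
                $frag->appendChild($a);
            } elseif ($part !== '') {
                $frag->appendChild($dom->createTextNode($part));
            }
        }
        $node->parentNode->replaceChild($frag, $node);
    }
}

$dom = new DOMDocument();
$dom->loadHTML(
    '<p>See http://example.com/page and <img src="http://example.com/image.jpg"></p>',
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);
linkifyTextNodes($dom);
$result = trim($dom->saveHTML());
echo $result, PHP_EOL;
```

Because attribute values are never text nodes, the img src is not touched.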
I think the trick is in keeping track of the single quotes (') and double quotes (") in your PHP code and nesting them correctly, so you put '' inside "" or vice versa.
For Example,
<?PHP
//old html tags
echo "<h1>Header1</h1>";
echo "<div>some text</div>";
//your added links
echo "<p><a href='link1.php'>Link1</a><br>";
echo "<a href='link1.php'>Link1</a></p>";
//old html tags
echo "<h1>Another Header</h1>";
echo "<div>some text</div>";
?>
I hope this helps you ..
$text = 'Any text ... link http://example123.com and image <img src="http://exaple.com/image.jpg" />';
$text = preg_replace('!([^\"])(http:\/\/(?:[\w\.]+))([^\"])!', '\\1<a href="\\2">\\2</a>\\3', $text);

php - preg_match string not within the href attribute

I find regex kind of confusing, so I got stuck with this problem:
I need to insert <b> tags around certain keywords in a given text. The problem is that if the keyword is within an href attribute, the result is a broken link.
the code goes like this:
$text = preg_replace('/(\b'.$keyword.'\b)/i','<b>\1</b>',$text);
so for cases like
<a href="keyword.php">this keyword here</a>
I end up with:
<a href="<b>keyword</b>.php">this <b>keyword</b> here</a>
I tried all sorts of combinations, but I still couldn't get the right pattern.
Thanks!
You can't do that with regex alone. Regexes are powerful, but they can't parse a recursive grammar like HTML.
Instead, you should properly parse the HTML using an existing HTML parser. You just echo the HTML as-is until you encounter a piece of text (a text node). In that case, you run your preg_replace on the text before echoing it.
If your HTML is valid XHTML, you can use the xml_parse function. If it's not, then use whatever HTML parser is available.
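A rough sketch of that approach using DOMDocument (my choice of parser, not something the question requires) could look like this. boldKeyword() is a hypothetical helper name, and the temporary wrapper div is only there so fragments without a single root element load cleanly.

```php
<?php
// Sketch: wrap a keyword in <b> inside text nodes only, so href values stay intact.
// boldKeyword() is a hypothetical helper name; the sample markup is illustrative.
function boldKeyword(string $html, string $keyword): string
{
    $dom = new DOMDocument();
    // A temporary wrapper <div> gives the fragment a single root element.
    $dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    $xpath = new DOMXPath($dom);
    $pattern = '/(\b' . preg_quote($keyword, '/') . '\b)/i';

    foreach (iterator_to_array($xpath->query('//text()')) as $node) {
        $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // keyword not present in this text node
        }
        $frag = $dom->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($part === '') {
                continue;
            }
            $textNode = $dom->createTextNode($part);
            if ($i % 2 === 1) {
                // Odd indexes are the captured keyword occurrences.
                $b = $dom->createElement('b');
                $b->appendChild($textNode);
                $frag->appendChild($b);
            } else {
                $frag->appendChild($textNode);
            }
        }
        $node->parentNode->replaceChild($frag, $node);
    }

    // Serialize only the children of the wrapper <div>.
    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$result = boldKeyword('<a href="keyword.php">this keyword here</a>', 'keyword');
echo $result, PHP_EOL;
```

The keyword inside the href is never visited, because attribute values are not text nodes.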
You can use preg_replace again after the first replacement to remove b tags from href:
$text=preg_replace('#(href="[^"]*)<b>([^"]*)</b>#i',"$1$2",$text);
Yes, you can use regex like that, but the code might become a little convoluted.
Here is a quick example
$string = '<a href="keyword.php">link text with keyword and stuff</a>';
$keyword = 'keyword';
$text = preg_replace(
'/(<a href=")('.$keyword.')(.php">)(.*)(<\/a>)/',
"$1$2$3<b>$4</b>$5",
$string
);
echo $string."\n";
echo $text."\n";
The content inside () are stored in variables $1,$2 ... $n, so I don't have to type stuff over again. The match can also be made more generic to match different kinds of url syntax if needed.
Seeing this solution you might want to rethink the way you plan to do matching of keywords in your code. :)
output:
<a href="keyword.php">link text with keyword and stuff</a>
<a href="keyword.php"><b>link text with keyword and stuff</b></a>

How do I extract HTML content using Regex in PHP

I know, I know... regex is not the best way to extract HTML text. But I need to extract article text from a lot of pages, and I can store regexes in the database for each website. I'm not sure how XML parsers would work with multiple websites. You'd need a separate function for each website.
In any case, I don't know much about regexes, so bear with me.
I've got an HTML page in a format similar to this
<html>
<head>...</head>
<body>
<div class=nav>...</div><p id="someshit" />
<div class=body>....</div>
<div class=footer>...</div>
</body>
I need to extract the contents of the body class container.
I tried this.
$pattern = "/<div class=\"body\">\(.*?\)<\/div>/sui"
$text = $htmlPageAsIs;
if (preg_match($pattern, $text, $matches))
echo "MATCHED!";
else
echo "Sorry gambooka, but your text is in another castle.";
What am I doing wrong? My text ends up in another castle.
*EDIT: ooohh... never mind, I found readability's code
You are matching class="body", but your document has class=body: the quotes are missing. The escaped \( and \) also match literal parentheses instead of forming a capture group. Use "/<div class=\"?body\"?>(.*?)<\/div>/sui".
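A quick sketch of the corrected pattern in action, run against made-up markup shaped like the sample in the question:

```php
<?php
// Sketch: the corrected pattern applied to markup shaped like the question's sample.
// The div contents are invented for illustration.
$html = "<div class=nav>menu</div>\n<div class=body>Article\ntext</div>\n<div class=footer>end</div>";

// \"? makes each quote optional; the capturing parentheses are left unescaped.
$pattern = "/<div class=\"?body\"?>(.*?)<\/div>/sui";

// $matches[1] holds the captured contents of the body container.
$content = preg_match($pattern, $html, $matches) ? $matches[1] : null;
var_dump($content);
```

Keep in mind that the lazy (.*?) stops at the first </div>, so a body container with nested divs would be cut short. That is one more reason the parser route tends to be safer.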
