Extract two parts of a string using regex in PHP

I have this string:
&lt;img src=images/imagename.gif alt='descriptive text here'&gt;
and I am trying to split it into the following two strings (an array of two strings, whatever, just broken up):
imagename.gif
descriptive text here
Note that yes, it's actually &lt; and not <. The same goes for the &gt; at the end of the string.
I know regex is the answer, but I'm not good enough at regex to know how to pull it off in PHP.

Try this:
<?php
$s="<img src=images/imagename.gif alt='descriptive text here'>";
preg_match("/^[^\/]+\/([^ ]+)[^']+'([^']+)/", $s, $a);
print_r($a);
Output:
Array
(
[0] => <img src=images/imagename.gif alt='descriptive text here
[1] => imagename.gif
[2] => descriptive text here
)
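If the positional pattern above feels hard to read, here is a minimal sketch of the same idea with named groups. The group names file and alt are my own choice, and it assumes the attributes always appear in the order src, then alt, with the alt value in single quotes, as in the question.

```php
<?php
$s = "<img src=images/imagename.gif alt='descriptive text here'>";

// (?:[^\s>]*/)?  optionally skips the directory part of the src value
// (?P<file>...)  captures the file name, (?P<alt>...) the quoted alt text
$pattern = "~src=(?:[^\s>]*/)?(?P<file>[^\s/>]+)\s+alt='(?P<alt>[^']*)'~";

if (preg_match($pattern, $s, $m)) {
    echo $m['file'], "\n"; // file name without the directory
    echo $m['alt'], "\n";  // alt text without the quotes
}
```

Named groups keep the calling code readable if the pattern later grows more captures.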

Better use DOM xpath rather than regex
<?php
$your_string = html_entity_decode("&lt;img src=images/imagename.gif alt='descriptive text here'&gt;");
$dom = new DOMDocument;
$dom->loadHTML($your_string);
$x = new DOMXPath($dom);
foreach($x->query("//img") as $node)
{
echo $node->getAttribute("src");
echo $node->getAttribute("alt");
}
?>
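Note that getAttribute("src") returns the whole path (images/imagename.gif), while the question asks for the file name only. A minimal sketch of one way to get there, assuming the input is entity-encoded as the question describes and contains a single img tag:

```php
<?php
// Assumption: the string arrives entity-encoded, as described in the question.
$encoded = "&lt;img src=images/imagename.gif alt='descriptive text here'&gt;";

$doc = new DOMDocument();
$doc->loadHTML(html_entity_decode($encoded));

// Take the first <img> and reduce its src to the bare file name.
$img = $doc->getElementsByTagName('img')->item(0);
$parts = [
    basename($img->getAttribute('src')),
    $img->getAttribute('alt'),
];
print_r($parts);
```

basename() simply drops everything up to the last slash, so it also copes with deeper paths.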

Related

Regex to match placeholders that contain HTML within them

I have placeholders that users can insert into a WYSIWYG editor (which contains HTML code). Sometimes when they paste from apps like Word etc it injects HTML within them.
Eg: It pastes %<span>firstname</span>% instead of %firstname%.
Here is an example of my regex code:
$html = '
<p>%firstname%</p>
<p>%<span>firstname</span>%</p>
<p>%<span class="blah">firstname</span>%</p>
<p>%<span><span>firstname</span></span>%</p>
<p>%<span><span><span>firstname</span></span></span>%</p>
<p>%<span class="blah"><span>firstname</span></span>%</p>
<div>other random <strong>HTML</strong> that needs to be preserved.</div>
';
preg_match_all(
'/\%(?![0-9])((?:<[^<]+?>)?[a-zA-z0-9_-]+(?:[\s]?<[^<]+?>)?)\%/U',
$html,
$matches
);
echo '<pre>';
print_r($matches);
echo '</pre>';
Which outputs the following:
Array
(
[0] => Array
(
[0] => %firstname%
[1] => %firstname%
[2] => %firstname%
)
[1] => Array
(
[0] => firstname
[1] => firstname
[2] => firstname
)
)
As soon as there is more than one span inside the placeholder it doesn't work. I'm not quite sure what to adjust in my regex.
/\%(?![0-9])((?:<[^<]+?>)?[a-zA-z0-9_-]+(?:[\s]?<[^<]+?>)?)\%/U
How would I achieve this?
Try this regex. It allows any number of opening tags before the name and any number of closing tags after it, and captures only the name:
/\%(?![0-9])(?:<[^<]+?>)*([a-zA-Z0-9_-]+)(?:[\s]?<\/[^<]+?>)*\%/U
You could use a parser and the textContent property if it is a WYSIWYG editor anyway:
<?php
$html = '
<p>%firstname%</p>
<p>%<span>firstname</span>%</p>
<p>%<span class="blah">firstname</span>%</p>
<p>%<span><span>firstname</span></span>%</p>
<p>%<span><span><span>firstname</span></span></span>%</p>
<p>%<span class="blah"><span>firstname</span></span>%</p>
<div>A cool div with %firstname%</div>
<span>And a very neat span with %firstname%</span>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
# query only root elements here
$containers = $xpath->query("/*");
foreach ($containers as $container) {
echo $container->textContent . "\n";
}
?>
This outputs %firstname% a couple of times, see a demo on ideone.com.
Do you really need a regex for this? You could have simply used strip_tags() here.
Try this:
echo strip_tags($html);
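One caveat: strip_tags($html) on the whole string also removes the markup that should be preserved. A possible middle ground, sketched below, is to strip tags only inside each %...% placeholder with preg_replace_callback(). The pattern is an adaptation of the ones above and the shortened sample HTML is my own, so treat it as a starting point rather than a tested solution for every paste.

```php
<?php
// Shortened sample input (assumption: placeholders look like the ones in the question).
$html = '<p>%<span class="blah"><span>firstname</span></span>%</p>'
      . '<div>other random <strong>HTML</strong> that needs to be preserved.</div>';

// %, any opening tags, a name that does not start with a digit, any closing tags, %
$pattern = '/%((?:<[^<>]+>)*(?![0-9])[A-Za-z0-9_-]+(?:\s?<\/[^<>]+>)*)%/';

$clean = preg_replace_callback($pattern, function (array $m): string {
    // Strip tags only from the placeholder body; the rest of the HTML is untouched.
    return '%' . trim(strip_tags($m[1])) . '%';
}, $html);

echo $clean;
```

Everything outside the placeholders, such as the strong tag in the div, passes through unchanged.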

Extracting links from a piece of text in PHP except ignoring image links

I have this piece of text, and I want to extract the links from it. Some links will be wrapped in <a> tags and some will just sit there in plain format. But I also have images, and I don't want their links.
How would I extract the links from this piece of text while ignoring image links? So basically both a link inside an <a> tag and a bare google.com should be extracted.
string(441) "<p class="fr-tag">Please visit https://www.google.co.uk/?gfe_rd=cr&ei=9P2DVaW2BMWo8wfK74HYCg and this link should be filtered and this http://d.pr/i/1i2Xu <img class="fr-fin fr-tag" alt="Image title" src="https://cft-forum.s3-us-west-2.amazonaws.com/uploads%2F1434714755338-Screen+Shot+2015-06-19+at+12.52.28.png" width="300"></p>"
I have tried the following but its incomplete:
$dom = new DOMDocument();
$dom->loadHTML($html);
$tags = $dom->getElementsByTagName('a');
foreach ($tags as $tag) {
$hrefs[] = $tag->getAttribute('href');
}
Using just that one string to test, the following works for me:
$str = '<p class="fr-tag">Please visit https://www.google.co.uk/?gfe_rd=cr&ei=9P2DVaW2BMWo8wfK74HYCg and this link should be filtered and this http://d.pr/i/1i2Xu <img class="fr-fin fr-tag" alt="Image title" src="https://cft-forum.s3-us-west-2.amazonaws.com/uploads%2F1434714755338-Screen+Shot+2015-06-19+at+12.52.28.png" width="300"></p>';
preg_match('~a href="(.*?)"~', $str, $strArr);
Matching on a href="..." in the preg_match() call fills the array $strArr with two values: the full match and the captured URL, both pointing at the Google link.
Array
(
[0] => a href="https://www.google.co.uk/?gfe_rd=cr&ei=9P2DVaW2BMWo8wfK74HYCg"
[1] => https://www.google.co.uk/?gfe_rd=cr&ei=9P2DVaW2BMWo8wfK74HYCg
)
I would try something like this.
Find and remove images tags:
$content = preg_replace("/<img[^>]+\>/i", "(image) ", $content);
Find and collect URLs.
preg_match_all('#\bhttps?://[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))#', $content, $match);
Output Urls:
print_r($match);
Good luck!
I played around with this a lot more and have an answer that may better suit what you are trying to do, with a bit of "future proofing":
$str = '<p class="fr-tag">Please visit www.google.co.uk/?gfe_rd=cr&ei=9P2DVaW2BMWo8wfK74HYCg and this link should be filtered and this http://d.pr/i/1i2Xu <img class="fr-fin fr-tag" alt="Image title" src="https://cft-forum.s3-us-west-2.amazonaws.com/uploads%2F1434714755338-Screen+Shot+2015-06-19+at+12.52.28.png" width="300"></p>';
$str = str_replace('&nbsp;',' ',$str);
$strArr = explode(' ',$str);
$len = count($strArr);
for($i = 0; $i < $len; $i++){
    if(stristr($strArr[$i],'http') || stristr($strArr[$i],"www")){
        $matches[] = $strArr[$i];
    }
}
echo "<pre>";
print_r($matches);
echo "</pre>";
I went back and analyzed your string and noticed that if you translate the &nbsp; entities to spaces, you can then explode the string into an array, step through it, and add any element that contains http or www to the $matches array to be processed later. The output is pretty clean and easy to work with, and you also get rid of most of the HTML markup this way.
Something to note is that this probably isn't the best way to do this. I haven't tested with any other strings but the one you offered so there's optimization that can be done.
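For comparison, here is a rough sketch of a parser-based variant: take href values from a tags, then scan only the text nodes for bare URLs. Image URLs live in the src attribute, so they never show up in either pass. The sample HTML is made up for illustration (the anchor tag and the example.com image URL are assumptions), and the URL pattern is deliberately simple.

```php
<?php
// Hypothetical sample: one tagged link, one bare link, one image.
$html = '<p>Please visit <a href="https://www.google.co.uk/">Google</a> and this http://d.pr/i/1i2Xu '
      . '<img alt="Image title" src="https://example.com/uploads/shot.png" width="300"></p>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings quiet
$doc->loadHTML($html);
libxml_use_internal_errors(false);

$links = [];

// 1) Links that are wrapped in <a> tags.
foreach ($doc->getElementsByTagName('a') as $a) {
    $links[] = $a->getAttribute('href');
}

// 2) Bare URLs in text nodes that are not inside an <a> element.
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//text()[not(ancestor::a)]') as $text) {
    if (preg_match_all('#\bhttps?://[^\s<>"]+#', $text->nodeValue, $m)) {
        $links = array_merge($links, $m[0]);
    }
}

$links = array_values(array_unique($links));
print_r($links);
```

Because attributes are never text nodes, there is no need to remove the img tags first.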

What's wrong with my PHP regex?

I'm trying to pull a specific link from a feed where all of the content is on one line and there are multiple links present. The one I want has the content "[link]" in the A tag. Here's my example:
test1 test2 [link] test3test4
... could be more links before and/or after
How do I isolate just the href with the content "[link]"?
This regex goes to the correct end of the block I want, but starts at the first link:
(?<=href\=\").*?(?=\[link\])
Any help would be greatly appreciated! Thanks.
Try this updated regex:
(?<=href\=\")[^<]*?(?=\">\[link\])
See demo.
The problem is that the dot matches too many characters and in order to get the right 'href' you need to just restrict the regex to [^<]*?.
Alternatively :)
This code :
$string = 'test1 test2 <a href="http://www.amazingpage.com/">[link]</a> test3test4';
$regex = '/href="([^"]*)">\[link\]/i';
$result = preg_match($regex, $string, $matches);
var_dump($matches);
Will return :
array(2) {
[0] =>
string(41) "href="http://www.amazingpage.com/">[link]"
[1] =>
string(27) "http://www.amazingpage.com/"
}
You can avoid using regular expression and use DOM to do this.
$doc = new DOMDocument();
$doc->loadHTML('
test1
test2
[link]
test3
test4
');
foreach ($doc->getElementsByTagName('a') as $link) {
if ($link->nodeValue == '[link]') {
echo $link->getAttribute('href');
}
}
With DOMDocument and XPath:
$dom = new DOMDocument();
$dom->loadHTML($yourHTML);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//a[. = "[link]"]/@href') as $node) {
    echo $node->nodeValue;
}
or if you are looking for only one result:
$dom = new DOMDocument();
$dom->loadHTML($yourHTML);
$xpath = new DOMXPath($dom);
$nodeList = $xpath->query('//a[. = "[link]"][1]/@href');
if ($nodeList->length)
    echo $nodeList->item(0)->nodeValue;
xpath query details:
//a # 'a' tag everywhere in the DOM tree
[. = "[link]"] # (condition) which has "[link]" as value
/@href # "href" attribute
The reason your regex pattern doesn't work:
The regex engine walks the string from left to right and tries to succeed at each position in turn. So even with a non-greedy quantifier, you always get the leftmost match: the lazy .*? starts at the first href and simply keeps expanding until it reaches [link].
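To see the leftmost-match behaviour in action, here is a small sketch that runs both kinds of pattern on a made-up feed line. The first link (example.com) is purely hypothetical; only the amazingpage.com URL appears in the answers above.

```php
<?php
// Hypothetical one-line feed with two links; only the second one is "[link]".
$feed = '<a href="http://example.com/1">test1</a> '
      . '<a href="http://www.amazingpage.com/">[link]</a>';

// Lazy dot: matching starts right after the FIRST href=" and expands until [link].
preg_match('/(?<=href=").*?(?=">\[link\])/', $feed, $lazy);

// Restricted class: [^"]* cannot cross a quote, so only the right href can match.
preg_match('/(?<=href=")[^"]*(?=">\[link\])/', $feed, $restricted);

echo $lazy[0], "\n";       // begins with http://example.com/1 and runs on into the second tag
echo $restricted[0], "\n"; // just the URL of the [link] anchor
```

The lazy quantifier only controls where the match ends, never where it starts.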

PHP preg_match_all regex to extract only number in string

I can't seem to figure out the proper regular expression for extracting just specific numbers from a string. I have an HTML string that has various img tags in it. There are a bunch of img tags in the HTML that I want to extract a portion of the value from. They follow this format:
<img src="http://domain.com/images/59.jpg" class="something" />
<img src="http://domain.com/images/549.jpg" class="something" />
<img src="http://domain.com/images/1249.jpg" class="something" />
<img src="http://domain.com/images/6.jpg" class="something" />
So, varying lengths of numbers before what 'usually' is a .jpg (it may be a .gif, .png, or something else too). I want to only extract the number from that string.
The 2nd part of this is that I want to use that number to look up an entry in a database and grab the alt/title tag for that specific id of image. Lastly, I want to add that returned database value into the string and throw it back into the HTML string.
Any thoughts on how to proceed with it would be great...
Thus far, I've tried:
$pattern = '/img src="http://domain.com/images/[0-9]+\/.jpg';
preg_match_all($pattern, $body, $matches);
var_dump($matches);
I think this is the best approach:
1. Use an HTML parser to extract the image tags
2. Use a regular expression (or perhaps string manipulation) to extract the ID
3. Query for the data
4. Use the HTML parser to insert the returned data
Here is an example. There are improvements I can think of, such as using string manipulation instead of a regex.
$html = '<img src="http://domain.com/images/59.jpg" class="something" />
<img src="http://domain.com/images/549.jpg" class="something" />
<img src="http://domain.com/images/1249.jpg" class="something" />
<img src="http://domain.com/images/6.jpg" class="something" />';
$doc = new DOMDocument;
$doc->loadHtml( $html);
foreach( $doc->getElementsByTagName('img') as $img)
{
$src = $img->getAttribute('src');
preg_match( '#/images/([0-9]+)\.#i', $src, $matches);
$id = $matches[1];
echo 'Fetching info for image ID ' . $id . "\n";
// Query stuff here
$result = 'Got this from the DB';
$img->setAttribute( 'title', $result);
$img->setAttribute( 'alt', $result);
}
$newHTML = $doc->saveHtml();
Using regular expressions, you can get the number really easily. The third argument for preg_match_all is a by-reference array that will be populated with the matches that were found.
preg_match_all('/<img src="http:\/\/domain\.com\/images\/(\d+)\.[a-zA-Z]+"/', $html, $matches);
print_r($matches);
After the call, $matches[0] holds the full matches and $matches[1] holds the captured numbers.
Consider using preg_replace_callback.
Use this regex: (images/([0-9]+)[^"]+")
Then, as the callback argument, use an anonymous function. Result:
$output = preg_replace_callback(
"(images/([0-9]+)[^\"]+\")",
function($m) {
// $m[1] is the number.
$t = getTitleFromDatabase($m[1]); // do whatever you have to do to get the title
return $m[0]." title=\"".$t."\"";
},
$input
);
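Here is a self-contained sketch of that callback idea you can run without a database. The $titles array is a hypothetical stand-in for getTitleFromDatabase(), and the titles themselves are invented. It also escapes the title, since the value ends up inside an HTML attribute.

```php
<?php
// Hypothetical stand-in for the database lookup.
$titles = [59 => 'Sunset', 549 => 'Tom & Jerry'];

$input = '<img src="http://domain.com/images/59.jpg" class="something" />' . "\n"
       . '<img src="http://domain.com/images/549.jpg" class="something" />';

$output = preg_replace_callback(
    '#images/([0-9]+)[^"]+"#',
    function (array $m) use ($titles): string {
        // $m[1] is the number; fall back to an empty title if the ID is unknown.
        $title = $titles[(int) $m[1]] ?? '';
        // Escape the value because it is written into an attribute.
        return $m[0] . ' title="' . htmlspecialchars($title, ENT_QUOTES) . '"';
    },
    $input
);

echo $output;
```

The title attribute is appended right after the closing quote of src, so the rest of each tag stays as it was.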
use preg_match_all:
preg_match_all('#<img.*?/(\d+)\.#', $str, $m);
print_r($m);
output:
Array
(
[0] => Array
(
[0] => <img src="http://domain.com/images/59.
[1] => <img src="http://domain.com/images/549.
[2] => <img src="http://domain.com/images/1249.
[3] => <img src="http://domain.com/images/6.
)
[1] => Array
(
[0] => 59
[1] => 549
[2] => 1249
[3] => 6
)
)
This regex should match the number parts:
\/images\/(?P<digits>[0-9]+)\.[a-z]+
Your $matches['digits'] should have all of the digits you want as an array.
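A quick sketch of how that named group could be used with preg_match_all(). The # delimiters and the i flag are my additions so the slashes need no escaping and upper-case extensions still match.

```php
<?php
$html = '<img src="http://domain.com/images/59.jpg" class="something" />
<img src="http://domain.com/images/549.jpg" class="something" />
<img src="http://domain.com/images/1249.jpg" class="something" />
<img src="http://domain.com/images/6.jpg" class="something" />';

// The named group is available under both $matches['digits'] and $matches[1].
preg_match_all('#/images/(?P<digits>[0-9]+)\.[a-z]+#i', $html, $matches);

// Captures are strings; cast them if integer IDs are needed for the lookup.
$ids = array_map('intval', $matches['digits']);
print_r($ids);
```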
Regular expressions alone are on losing ground when it comes to parsing messy HTML. DOMDocument's HTML handling copes well with tag soup, XPath selects your image srcs, and a simple sscanf extracts the number:
$ids = array();
$doc = new DOMDocument();
$doc->loadHTML($html);
foreach(simplexml_import_dom($doc)->xpath('//img/@src[contains(., "/images/")]') as $src) {
if (sscanf($src, '%*[^0-9]%d', $number)) {
$ids[] = $number;
}
}
Because that only gives you an array, why not encapsulate it?
$html = '<img src="http://domain.com/images/59.jpg" class="something" />
<img src="http://domain.com/images/549.jpg" class="something" />
<img src="http://domain.com/images/1249.jpg" class="something" />
<img src="http://domain.com/images/6.jpg" class="something" />';
$imageNumbers = new ImageNumbers($html);
var_dump((array) $imageNumbers);
Which gives you:
array(4) {
[0]=>
int(59)
[1]=>
int(549)
[2]=>
int(1249)
[3]=>
int(6)
}
This is the function above, nicely wrapped into an ArrayObject:
class ImageNumbers extends ArrayObject
{
public function __construct($html) {
parent::__construct($this->extractFromHTML($html));
}
private function extractFromHTML($html) {
$numbers = array();
$doc = new DOMDocument();
$preserve = libxml_use_internal_errors(TRUE);
$doc->loadHTML($html);
foreach(simplexml_import_dom($doc)->xpath('//img/@src[contains(., "/images/")]') as $src) {
if (sscanf($src, '%*[^0-9]%d', $number)) {
$numbers[] = $number;
}
}
libxml_use_internal_errors($preserve);
return $numbers;
}
}
If your HTML is so malformed that even DOMDocument::loadHTML() can't handle it, you only need to deal with that internally in the ImageNumbers class.
$matches = array();
preg_match_all('/[[:digit:]]+/', $htmlString, $matches);
Then loop through the matches array both to reconstruct the HTML and to do your lookup in the database.

Using preg_replace_callback() to extract all images from a string of HTML

Tricky preg_replace_callback function here - I am admittedly not great at PCRE expressions.
I am trying to extract all img src values from a string of HTML, save the img src values to an array, and additionally replace the img src path with a local path (not a remote path). I.e. I might have, surrounded by a lot of other HTML:
img src='http://www.mysite.com/folder/subfolder/images/myimage.png'
And I would want to extract myimage.png to an array, and additionally change the src to:
src='images/myimage.png'
Can that be done?
Thanks
Does it need to use regular expressions? Handling HTML is normally easier with DOM functions:
<?php
$domd = new DOMDocument();
libxml_use_internal_errors(true);
$domd->loadHTML(file_get_contents("http://stackoverflow.com"));
libxml_use_internal_errors(false);
$items = $domd->getElementsByTagName("img");
$data = array();
foreach($items as $item) {
$data[] = array(
"src" => $item->getAttribute("src"),
"alt" => $item->getAttribute("alt"),
"title" => $item->getAttribute("title"),
);
}
print_r($data);
Do you need regex for this? Not necessary. Are regex the most readable solution? Probably not - at least unless you are fluent in regex. Are regex more efficient when scanning large amounts of data? Absolutely, the regex are compiled and cached upon first appearance. Do regex win the "least lines of code" trophy?
$string = <<<EOS
<html>
<body>
blahblah<br>
<img src='http://www.mysite.com/folder/subfolder/images/myimage.png'>blah<br>
blah<img src='http://www.mysite.com/folder/subfolder/images/another.png' />blah<br>
</body>
</html>
EOS;
preg_match_all("%<img .*?src=['\"](.*?)['\"]%s", $string, $matches);
$images = array_map(function ($element) { return preg_replace("%^.*/(.*)$%", 'images/$1', $element); }, $matches[1]);
print_r($images);
Two lines of code, that's hard to undercut in PHP. It results in the following $images array:
Array
(
[0] => images/myimage.png
[1] => images/another.png
)
Please note that this won't work with PHP versions prior to 5.3 unless you replace the anonymous function with a proper one.
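Since the question asks specifically about preg_replace_callback(), here is a rough sketch that does both jobs in one pass: it collects the file names and rewrites each src to a local images/ path. The variable names ($files, $rewritten) are mine, and the pattern assumes the src value is always quoted.

```php
<?php
$html = "<img src='http://www.mysite.com/folder/subfolder/images/myimage.png'>blah<br>\n"
      . "<img src='http://www.mysite.com/folder/subfolder/images/another.png' />";

$files = [];
$rewritten = preg_replace_callback(
    // 1: everything up to src=   2: the opening quote   3: the URL
    '%(<img\b[^>]*?\bsrc=)([\'"])(.*?)\2%is',
    function (array $m) use (&$files): string {
        // Keep only the file name from the URL path.
        $name = basename(parse_url($m[3], PHP_URL_PATH));
        $files[] = $name;
        // Rebuild the attribute with the same quote style and a local path.
        return $m[1] . $m[2] . 'images/' . $name . $m[2];
    },
    $html
);

print_r($files);
echo $rewritten;
```

Capturing $files by reference lets the callback fill the array while the replacement is being built.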
