Extract img src from string with preg_match_all - php

I've been trying to use preg_match_all for 30 minutes but it looks like I can't do it.
Basically I have a $var which contains a string of HTML code. For example:
<br>iihfuhuf
<img title="Image: http://www.jlnv2.local/temp/temp513caca536fcd.jpeg"
src="http://www.jlnv2.local/temp/temp513caca536fcd.jpeg">
<img src="http://www.jlnv2.local/temp/temp513caca73b8da.jpeg"><br>
I want to get the src attribute values of img tags that contain /temp/temp[a-z0-9]{13}\.jpeg in their src value.
This is what I have so far:
preg_match_all('!(<img.*src=".*/temp/temp[a-z0-9]{13}\.jpeg"(.*alt=".*")?>)!', $content, $matches);

<img[^>]*src="([^"]*/temp/temp[a-z0-9]{13}\.jpeg)"
<img[^>]* matches the start of an img tag and any attributes before src
src="([^"]*)" captures the src value as group 1
/temp/temp[a-z0-9]{13}\.jpeg restricts the match to the src values you want
For quick RegEx tests use some online tool like http://regexpal.com/
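For completeness, here is a minimal sketch of that pattern dropped into preg_match_all. The ~ delimiter and the variable names are my own choice; the sample string is the one from the question.

```php
<?php
// Sample HTML from the question, including the line break inside the first tag.
$content = '<br>iihfuhuf
<img title="Image: http://www.jlnv2.local/temp/temp513caca536fcd.jpeg"
src="http://www.jlnv2.local/temp/temp513caca536fcd.jpeg">
<img src="http://www.jlnv2.local/temp/temp513caca73b8da.jpeg"><br>';

// [^>]* also matches newlines, so attributes spread over several lines are fine.
$pattern = '~<img[^>]*src="([^"]*/temp/temp[a-z0-9]{13}\.jpeg)"~';

preg_match_all($pattern, $content, $matches);

// Group 1 holds the bare src values.
print_r($matches[1]);
```

$matches[0] holds the whole matched text and $matches[1] only the captured URLs.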

All you need to do is add another group to your regular expression. You have to surround everything you want to extract from the match with parentheses:
preg_match_all('!(<img.*src="(.*/temp/temp[a-z0-9]{13}\.jpeg)"(.*alt=".*")?>)!', $content, $matches);
You can see that working here. You can find the URLs in $matches[2].
But just for having said it: Regular expressions are no reasonable approach to extract anything from HTML. You would be better off using DOMDocument, XPath or something along that line.

Try this:
preg_match_all('/src="([^"]+temp[a-z0-9]{13}\.jpeg)"/', $content, $matches);
var_dump($matches);

<?php
$text = '<br>iihfuhuf<img title="Image: http://www.jlnv2.local/temp/temp513caca536fcd.jpeg" src="http://www.jlnv2.local/temp/temp513caca536fcd.jpeg"><img src="http://www.jlnv2.local/temp/temp513caca73b8da.jpeg"><br>';
$pattern = '#src="([^"]+/temp/temp[a-z0-9]{13}\.jpeg)"#';
preg_match_all($pattern, $text, $out);
echo '<pre>';
print_r($out);
?>
Array
(
[0] => Array
(
[0] => src="http://www.jlnv2.local/temp/temp513caca536fcd.jpeg"
[1] => src="http://www.jlnv2.local/temp/temp513caca73b8da.jpeg"
)
[1] => Array
(
[0] => http://www.jlnv2.local/temp/temp513caca536fcd.jpeg
[1] => http://www.jlnv2.local/temp/temp513caca73b8da.jpeg
)
)

Here is a DOMDocument/DOMXPath based example of how to do it. This is arguably the only right way to do it, because unless you are really good at regular expressions there will most likely always be edge cases that will break your logic.
$doc = new DOMDocument;
$doc->loadHTML($content);
// Create the XPath object after the HTML has been loaded
$xpath = new DOMXPath($doc);
$candidates = $xpath->query("//img[contains(@src, '/temp/temp')]");
$result = array();
foreach ($candidates as $image) {
$src = $image->getAttribute('src');
if (preg_match('/temp[0-9a-z]{13}\.jpeg$/', $src, $matches)) {
$result[] = $src;
}
}
print_r($result);

$text = '<br>iihfuhuf<img title="Image: http://www.jlnv2.local/temp/temp513caca536fcd.jpeg" src="http://www.jlnv2.local/temp/temp513caca536fcd.jpeg"><img src="http://www.jlnv2.local/temp/temp513caca73b8da.jpeg"><br>';
$pattern = '#src="([^"]+/temp/temp[a-z0-9]{13}\.jpeg)"#';
preg_match( '#src="([^"]+)"#' , $text, $match );
$src = array_pop($match);
echo $src;

Related

(PHP) Replace string of array elements using regex

I have an array
Array
(
[0] => "http://example1.com"
[1] => "http://example2.com"
[2] => "http://example3.com"
...
)
And I want to replace the http with https of each elements using RegEx. I tried:
$Regex = "/http/";
$str_rpl = '${1}s';
...
foreach ($url_array as $key => $value) {
$value = preg_replace($Regex, $str_rpl, $value);
}
print_r($url_array);
But the result array is still the same. Any thought?
You actually print an array without changing it. Why do you need regex for this?
Edited with Casimir et Hippolyte's hint:
This is a solution using regex:
$url_array = array
(
0 => "http://example1.com",
1 => "http://example2.com",
2 => "http://example3.com",
);
$url_array = preg_replace("/^http:/i", "https:", $url_array);
print_r($url_array);
Without regex:
$url_array = array
(
0 => "http://example1.com",
1 => "http://example2.com",
2 => "http://example3.com",
);
$url_array = str_replace("http://", "https://", $url_array);
print_r($url_array);
First of all, you are not modifying the array values at all. In your example, you are operating on the copies of array values. To actually modify array elements:
use reference mark
foreach($foo as $key => &$value) {
$value = 'new value';
}
or use for instead of foreach loop
for($i = 0; $i < count($foo); $i++) {
$foo[$i] = 'new value';
}
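Applied to the URL array from the question, the by-reference variant could look like the sketch below. The unset() after the loop is my addition: it breaks the reference so that a later use of $value cannot accidentally overwrite the last element.

```php
<?php
$url_array = ['http://example1.com', 'http://example2.com', 'http://example3.com'];

// &$value makes the loop variable an alias of the array element, not a copy.
foreach ($url_array as &$value) {
    $value = preg_replace('/^http:/i', 'https:', $value);
}
unset($value); // drop the lingering reference to the last element

print_r($url_array);
```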
Going back to your question: you can also solve the problem without regex. Whenever you can, it is better to avoid regex (fewer problems, simpler debugging and testing).
$tmp = array_map(static function(string $value) {
return str_replace('http://', 'https://', $value);
}, $url_array);
print_r($tmp);
EDIT:
As Casimir pointed out, since str_replace accepts an array as its third argument, you can just do:
$tmp = str_replace('http://', 'https://', $url_array);
This expression might also work:
^http\K(?=:)
to which we can add more boundaries, for instance to validate the URLs if necessary:
^http\K(?=:\/\/[a-z0-9_-]+\.[a-z0-9_-]+)
Test
$re = '/^http\K(?=:\/\/[a-z0-9_-]+\.[a-z0-9_-]+)/si';
$str = ' http://example1.com ';
$subst = 's';
echo preg_replace($re, $subst, trim($str));
Output
https://example1.com
The expression is explained on the top right panel of regex101.com, if you wish to explore/simplify/modify it, and in this link, you can watch how it would match against some sample inputs, if you like.
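Since preg_replace also accepts an array as its subject, the shorter \K expression can be applied to the whole list in one call. This is just a sketch; the mixed sample array is made up to show which entries are left alone.

```php
<?php
// Hypothetical input: only the first entry starts with plain "http:".
$url_array = ['http://example1.com', 'https://example2.com', 'ftp://example3.com'];

// ^http   must start with "http"
// \K      discard what was matched so far, so only an empty match is replaced
// (?=:)   the next character has to be ":", which rules out "https:"
$result = preg_replace('/^http\K(?=:)/i', 's', $url_array);

print_r($result);
```

Entries that already use https, or a different scheme, come back unchanged.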

Extract image URLs with a specific pattern with PHP preg_match_all

I am trying to list all image URLs of a string with PHP preg_match_all.
The image patterns are as such:
1. //c1.site.com/5/2421/38306891313_ce7844189b_n.jpg
2. \/\/c2.site.com\/2421\/24574382628_9a2af39e82_h.jpg
I am able to extract up to the following result:
m/5/2421/38306891313_ce7844189b_n.jpg // 'm' seems to be the end of .com?
However I am not sure how to include the subdomain, domain and .com:
preg_match_all('/([-a-z0-9_\/:.])\/([-a-z0-9_\/:.])\/([-a-z0-9_\/:.]+\.(jpg))/i', $data, $matches);
foreach($matches[0] as $image){
echo $image.'<br/>';
}
preg_match_all('/([-a-z0-9_\/:.])\\/([-a-z0-9_\/:.])\\/([-a-z0-9_\/:.]+\.(jpg))/i', $data, $matches);
foreach($matches[0] as $image){
echo $image.'<br/>';
}
You could use parse_url instead:
<?php
$string = "//c1.site.com/5/2421/38306891313_ce7844189b_n.jpg";
print_r(parse_url($string));
?>
Which yields
Array
(
[host] => c1.site.com
[path] => /5/2421/38306891313_ce7844189b_n.jpg
)
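If the goal is to pull every such URL out of a larger string first, a pattern along these lines may work. It is only a sketch: the sample $data string is made up, and I am assuming that hosts consist of letters, digits, dots and hyphens and that the escaped form always uses \/ for each slash.

```php
<?php
// Hypothetical input containing both URL styles from the question.
$data = 'a //c1.site.com/5/2421/38306891313_ce7844189b_n.jpg b '
      . '\/\/c2.site.com\/2421\/24574382628_9a2af39e82_h.jpg c';

// In a single-quoted PHP string, \\\\? becomes the regex \\? (an optional
// literal backslash), so every slash may appear as "/" or as "\/".
$pattern = '~(?:\\\\?/){2}[-a-z0-9.]+(?:\\\\?/[-a-z0-9_.]+)+\.jpg~i';

preg_match_all($pattern, $data, $matches);

// Turn the escaped "\/" back into a plain "/".
$images = str_replace('\/', '/', $matches[0]);

print_r($images);
```

Each entry of $images can then be passed to parse_url if you need the host and path separately.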

How to find more than one variable in a string using preg_match?

I have the string below in a PHP variable, where the values to be found are highlighted.
$var = '<div class="CK mag10">OKT: **VARVALUE1**<span class="OK1 OK2">|</span>MOK: **VARVALUE2**<span class="OK1 OK2">|</span>ISIN: **VARVALUE3**<span class="OK1 OK2">|</span>SOCCER: **VARVALUE4**</div>';
I have written this code:
$found_matches = preg_match('/\<div class=\"CK mag10\">OKT: ([0-9A-Za-z]+)\<span class=\"OK1 OK2\"\>|\<\/span\>MOK: ([0-9A-Za-z]+)\<span class=\"OK1 OK2\"\>|\<\/span>ISIN: ([0-9A-Za-z]+)\<span class=\"OK1 OK2\"\>|\<\/span>SOCCER: ([0-9A-Za-z]+)\<\/div\>/i', $var, $matches);
but it gives me only one value, not all of them.
Is there any way to get all the values stacked in the single array $matches?
Maybe this function is useful for you:
preg_match_all("/(?<=startTag)[\w]+(?=endTag)/", $input_lines, $output_array);
startTag = replace this with the fixed string that comes right before the word you want to extract.
endTag = replace this with the fixed string that comes right after the word you want to extract.
Sample - preg_match or preg_match_all
Output
array(
0 => array(
0 => VARVALUE1
1 => VARVALUE2
2 => VARVALUE3
3 => VARVALUE4
)
)
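As for why the pattern in the question yields only one value: a likely cause is the unescaped | characters, which the regex engine reads as alternation, so the pattern becomes a choice between several shorter patterns. A sketch with the pipe escaped follows; the sample values AAA1 to DDD4, the ~ delimiter and the $sep helper variable are mine.

```php
<?php
// Same structure as the string in the question, with made-up values.
$var = '<div class="CK mag10">OKT: AAA1<span class="OK1 OK2">|</span>'
     . 'MOK: BBB2<span class="OK1 OK2">|</span>'
     . 'ISIN: CCC3<span class="OK1 OK2">|</span>'
     . 'SOCCER: DDD4</div>';

// The separator between fields; \| matches a literal pipe instead of acting as "or".
$sep = '<span class="OK1 OK2">\|</span>';

$pattern = '~<div class="CK mag10">OKT: ([0-9A-Za-z]+)' . $sep
         . 'MOK: ([0-9A-Za-z]+)' . $sep
         . 'ISIN: ([0-9A-Za-z]+)' . $sep
         . 'SOCCER: ([0-9A-Za-z]+)</div>~i';

$values = [];
if (preg_match($pattern, $var, $matches)) {
    // Index 0 is the whole match; the four captured values follow.
    $values = array_slice($matches, 1);
}

print_r($values);
```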
Here is a way to parse the HTML with DOM and obtain the results in a safe way. Sample code:
<?php
$html = <<<HTML
<div class="CK mag10">OKT: VARVALUE1<span class="OK1 OK2">|</span>MOK: VARVALUE2<span class="OK1 OK2">|</span>ISIN: VARVALUE3<span class="OK1 OK2">|</span>SOCCER: VARVALUE4</div>
HTML;
$arr = array();
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED|LIBXML_HTML_NODEFDTD);
$xp = new DOMXPath($dom);
$links = $xp->query('//div[@class="CK mag10"]');
foreach ($links as $link) {
$chks = explode('|', $link->nodeValue);
foreach ($chks as $chk) {
if (preg_match('/\s*[A-Z]+:\s+(.*)/', $chk, $matches)) {
array_push($arr, $matches[1]);
}
}
}
print_r($arr);

How to parse html tags multiple times? PHP

String I'm trying to parse.
<b>Genre:</b> Action, Adventure, Casual, Early Access, Indie, RPG<br>
What I'm trying to achieve (without all the other tags etc):
Action
Adventure
Casual
Early Access
Indie
RPG
Here's what I've tried
function getTagInfo($content,$start,$end){
$r = explode($start, $content);
if (isset($r[1])){
$r = explode($end, $r[1]);
return $r[0];
}
return '0';
}
getTagInfo($html, '/?snr=1_5_9__408">', '</a>');
and that only gives me one genre, I can't think of an algorithm to be able to parse the rest also, so how would I be able to parse the other lines?
You can use a regular expression here:
<a.*?>(.*?)</a>
This pattern captures the contents of every <a></a> element.
Try this PHP code:
preg_match_all('/<a.*?>(.*?)<\/a>/', $htmlString, $matches);
foreach ($matches[1] as $match) {
echo $match . " <br /> ";
}
This will output:
Action
Adventure
Casual
Early
Access
Indie
RPG
You can use this code from another stackoverflow thread.
PHP/regex: How to get the string value of HTML tag?
<?php
function getTextBetweenTags($string, $tagname) {
$pattern = "/<$tagname ?.*>(.*)<\/$tagname>/";
preg_match($pattern, $string, $matches);
return $matches[1];
}
$str = '<textformat leading="2"><p align="left"><font size="10">get me</font></p></textformat>';
$txt = getTextBetweenTags($str, "font");
echo $txt;
?>
You can use preg_match_all:
$regex = '/<a.*?>(.*?)<\/a>/is';
preg_match_all($regex, $html, $matches);
$matches[1] will then be an array of the contents between the anchor tags and you could iterate over it like this:
foreach ($matches[1] as $match)
{
echo $match .'<br>';
}
It would probably be better to use an actual HTML parser, as HTML is not a regular language.
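A sketch of the parser route with DOMDocument is below. The sample markup is a guess at what the page looks like (each genre wrapped in a link), since the question only shows the text with the tags stripped; the href values are invented.

```php
<?php
// Hypothetical markup: each genre is assumed to be the text of an <a> element.
$html = '<b>Genre:</b> <a href="/genre/Action/?snr=1">Action</a>, '
      . '<a href="/genre/Early%20Access/?snr=1">Early Access</a>, '
      . '<a href="/genre/RPG/?snr=1">RPG</a><br>';

// Collect parser warnings internally instead of printing them for fragment input.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$genres = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    // textContent is the link text without any nested tags.
    $genres[] = trim($a->textContent);
}

print_r($genres);
```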
You may try something like this (DEMO):
function getTagInfo($html)
{
if( preg_match_all('/<a href=\"(.*?)\">/i', $html, $matches)) {
$result = array();
foreach($matches[1] as $href) {
$array = explode('/', $href);
$arr = $array[count($array) - 2];
$result[] = urldecode($arr);
}
return $result;
}
return false;
}
// Get an array
print_r(getTagInfo($html));
Output:
Array (
[0] => Action
[1] => Adventure
[2] => Casual
[3] => Early Access
[4] => Indie
[5] => RPG
)
I would probably do this with REGEX also, but since there are already 4 posts with REGEX answers, I'll throw another idea out there. This may be overly simple, but you can use strip_tags to remove any HTML tags.
$string = '<b>Genre:</b> Action, Adventure, Casual, Early Access, Indie, RPG<br>';
print strip_tags($string);
This will return the following:
Genre: Action, Adventure, Casual, Early Access, Indie, RPG
Anyway, it's probably not how I'd go about doing it, but it's a one-liner that is really easy to implement.
I reckon you can also turn it into the array you're looking for by combining the preceding with some REGEX like this:
$string_array = preg_split('/,\s*/', preg_replace('/Genre:\s+/i', '', strip_tags($string)));
print_r($string_array);
That will give you the following:
Array
(
[0] => Action
[1] => Adventure
[2] => Casual
[3] => Early Access
[4] => Indie
[5] => RPG
)
Ha, sorry ... ended up throwing REGEX into the answer anyway. But it's still a one-liner. :)

File_Get_Contents an Entire Array

I want to call file_get_contents on every single link in my array, so that I can then apply preg_match to get the first 20 characters of the first p tag found in each page.
My code is below:
$links = array(0 => "http://en.wikipedia.org/wiki/The_Big_Bang_Theory", 1=> "http://en.wikipedia.org/wiki/Fantastic_Four");
print_r($links);
$link = implode(", " , $links);
$html = file_get_contents($link);
preg_match('%(<p[^>]*>.*?</p>)%i', $html, $re);
$res = get_custom_excerpt($re[1]);
echo $res;
You can only pass one URL at a time to file_get_contents. Combining the links will not work; instead you need to loop over the links and fetch the content of each one.
$links = array(0 => "http://en.wikipedia.org/wiki/The_Big_Bang_Theory", 1=> "http://en.wikipedia.org/wiki/Fantastic_Four");
print_r($links);
foreach($links as $link){
$html = file_get_contents($link);
preg_match('%(<p[^>]*>.*?</p>)%i', $html, $re);
$res = get_custom_excerpt($re[1]);
echo $res;
}
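The loop above assumes every download succeeds and every page contains a p tag. A slightly more defensive sketch follows. Note that get_custom_excerpt is not shown in the question, so I replace it here with strip_tags plus substr, and the function name firstParagraphExcerpts and the injectable $fetch callable are my own inventions to keep the example testable without network access.

```php
<?php
/**
 * Returns [url => excerpt of the first paragraph] for each link.
 * $fetch is any callable that takes a URL and returns HTML or false.
 */
function firstParagraphExcerpts(array $links, callable $fetch, int $length = 20): array
{
    $excerpts = [];
    foreach ($links as $link) {
        $html = $fetch($link);
        if ($html === false) {
            continue; // download failed, skip this link
        }
        // The s flag lets .*? span line breaks inside the paragraph.
        if (preg_match('%<p[^>]*>(.*?)</p>%is', $html, $m)) {
            $excerpts[$link] = substr(strip_tags($m[1]), 0, $length);
        }
    }
    return $excerpts;
}

// Stand-in for a real download so the sketch runs offline.
$fake = fn(string $url) => '<p class="x">Hello <b>world</b>, this is a long paragraph.</p><p>Second</p>';

$excerpts = firstParagraphExcerpts(['http://example.com/a'], $fake);
print_r($excerpts);
```

For real pages you would pass 'file_get_contents' as the second argument. substr counts bytes, so for non-ASCII text mb_substr may be the better choice.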
Why not use their API?
You can still use file_get_contents to retrieve the, well, contents, but you can decide on a format more suitable to your needs.
Their API is documented quite well, see: http://www.mediawiki.org/wiki/API:Main_page
URLs will transform into something
