I'm trying to pre replace to get dynamic url,
Code is:
<span>
<img src="/List/Detail/c0954c57-57ca-4f32-841d-de2b61a5087c/5358455" />
</span>
I need to extract: only /List/Detail/c0954c57-57ca-4f32-841d-de2b61a5087c/5358455
Ive tried: \/Listing\/AdDetail\/
Try with this:
preg_match("/<img src=\"(.*)\"/", $input_line, $output_array);
Use this regex to get the link between "yourlink"
(?<=")(.*)(?=")
Note: It works in PHP but not in Javascript. Lookbehind is not supported in Javascript!
Note 2: This answer is only specified for your SHORT information. I would suggest you to update your Question with more Details to get an better Code Sample!
public static Set<String> getImgStr(String htmlStr) {
Set<String> pics = new HashSet<>();
String img = "";
Pattern p_image;
Matcher m_image;
String regEx_img = "<img.*src\\s*=\\s*(.*?)[^>]*?>";
p_image = Pattern.compile
(regEx_img, Pattern.CASE_INSENSITIVE);
m_image = p_image.matcher(htmlStr);
while (m_image.find()) {
img = m_image.group();
Matcher m = Pattern.compile("src\\s*=\\s*\"?(.*?)(\"|>|\\s+)").matcher(img);
while (m.find()) {
pics.add(m.group(1));
}
}
return pics;
}
Related
I am trying to display a website to a user, having downloaded it using php.
This is the script I am using:
<?php
$url = 'http://stackoverflow.com/pagecalledjohn.php';
//Download page
$site = file_get_contents($url);
//Fix relative URLs
$site = str_replace('src="','src="' . $url,$site);
$site = str_replace('url(','url(' . $url,$site);
//Display to user
echo $site;
?>
So far this script works a treat except for a few major problems with the str_replace function. The problem comes with relative urls. If we use an image on our made up pagecalledjohn.php of a cat (Something like this: ). It is a png and as I see it it can be placed on the page using 6 different urls:
1. src="//www.stackoverflow.com/cat.png"
2. src="http://www.stackoverflow.com/cat.png"
3. src="https://www.stackoverflow.com/cat.png"
4. src="somedirectory/cat.png"
4 is not applicable in this case but added anyway!
5. src="/cat.png"
6. src="cat.png"
Is there a way, using php, I can search for src=" and replace it with the url (filename removed) of the page being downloaded, but without sticking url in there if it is options 1,2 or 3 and change procedure slightly for 4,5 and 6?
Rather than trying to change every path reference in the source code, why don't you simply inject a <base> tag in your header to specifically indicate the base URL upon which all relative URL's should be calculated?
https://developer.mozilla.org/en-US/docs/Web/HTML/Element/base
This can be achieved using your DOM manipulation tool of choice. The example below would show how to do this using DOMDocument and related classes.
$target_domain = 'http://stackoverflow.com/';
$url = $target_domain . 'pagecalledjohn.php';
//Download page
$site = file_get_contents($url);
$dom = DOMDocument::loadHTML($site);
if($dom instanceof DOMDocument === false) {
// something went wrong in loading HTML to DOM Document
// provide error messaging and exit
}
// find <head> tag
$head_tag_list = $dom->getElementsByTagName('head');
// there should only be one <head> tag
if($head_tag_list->length !== 1) {
throw new Exception('Wow! The HTML is malformed without single head tag.');
}
$head_tag = $head_tag_list->item(0);
// find first child of head tag to later use in insertion
$head_has_children = $head_tag->hasChildNodes();
if($head_has_children) {
$head_tag_first_child = $head_tag->firstChild;
}
// create new <base> tag
$base_element = $dom->createElement('base');
$base_element->setAttribute('href', $target_domain);
// insert new base tag as first child to head tag
if($head_has_children) {
$base_node = $head_tag->insertBefore($base_element, $head_tag_first_child);
} else {
$base_node = $head_tag->appendChild($base_element);
}
echo $dom->saveHTML();
At the very minimum, it you truly want to modify all path references in the source code, I would HIGHLY recommend doing so with DOM manipulation tools (DOMDOcument, DOMXPath, etc.) rather than regex. I think you will find it a much more stable solution.
I don't know if I get your question completely right, if you want to deal with all text-sequences enclosed in src=" and ", the following pattern could make it:
~(\ssrc=")([^"]+)(")~
It has three capturing groups of which the second one contains the data you're interested in. The first and last are useful to change the whole match.
Now you can replace all instances with a callback function that is changing the places. I've created a simple string with all the 6 cases you've got:
$site = <<<BUFFER
1. src="//www.stackoverflow.com/cat.png"
2. src="http://www.stackoverflow.com/cat.png"
3. src="https://www.stackoverflow.com/cat.png"
4. src="somedirectory/cat.png"
5. src="/cat.png"
6. src="cat.png"
BUFFER;
Let's ignore for a moment that there are no surrounding HTML tags, you're not parsing HTML anyway I'm sure as you haven't asked for a HTML parser but for a regular expression. In the following example, the match in the middle (the URL) will be enclosed so that it's clear it matched:
So now to replace each of the links let's start lightly by just highlighting them in the string.
$pattern = '~(\ssrc=")([^"]+)(")~';
echo preg_replace_callback($pattern, function ($matches) {
return $matches[1] . ">>>" . $matches[2] . "<<<" . $matches[3];
}, $site);
The output for the example given then is:
1. src=">>>//www.stackoverflow.com/cat.png<<<"
2. src=">>>http://www.stackoverflow.com/cat.png<<<"
3. src=">>>https://www.stackoverflow.com/cat.png<<<"
4. src=">>>somedirectory/cat.png<<<"
5. src=">>>/cat.png<<<"
6. src=">>>cat.png<<<"
As the way of replacing the string is to be changed, it can be extracted, so it is easier to change:
$callback = function($method) {
return function ($matches) use ($method) {
return $matches[1] . $method($matches[2]) . $matches[3];
};
};
This function creates the replace callback based on a method of replacing you pass as parameter.
Such a replacement function could be:
$highlight = function($string) {
return ">>>$string<<<";
};
And it's called like the following:
$pattern = '~(\ssrc=")([^"]+)(")~';
echo preg_replace_callback($pattern, $callback($highlight), $site);
The output remains the same, this was just to illustrate how the extraction worked:
1. src=">>>//www.stackoverflow.com/cat.png<<<"
2. src=">>>http://www.stackoverflow.com/cat.png<<<"
3. src=">>>https://www.stackoverflow.com/cat.png<<<"
4. src=">>>somedirectory/cat.png<<<"
5. src=">>>/cat.png<<<"
6. src=">>>cat.png<<<"
The benefit of this is that for the replacement function, you only need to deal with the URL match as single string, not with regular expression matches array for the different groups.
Now to your second half of your question: How to replace this with the specific URL handling like removing the filename. This can be done by parsing the URL itself and remove the filename (basename) from the path component. Thanks to the extraction, you can put this into a simple function:
$removeFilename = function ($url) {
$url = new Net_URL2($url);
$base = basename($path = $url->getPath());
$url->setPath(substr($path, 0, -strlen($base)));
return $url;
};
This code makes use of Pear's Net_URL2 URL component (also available via Packagist and Github, your OS packages might have it, too). It can parse and modify URLs easily, so is nice to have for the job.
So now the replacement done with the new URL filename replacement function:
$pattern = '~(\ssrc=")([^"]+)(")~';
echo preg_replace_callback($pattern, $callback($removeFilename), $site);
And the result then is:
1. src="//www.stackoverflow.com/"
2. src="http://www.stackoverflow.com/"
3. src="https://www.stackoverflow.com/"
4. src="somedirectory/"
5. src="/"
6. src=""
Please note that this is exemplary. It shows how you can to it with regular expressions. You can however to it as well with a HTML parser. Let's make this an actual HTML fragment:
1. <img src="//www.stackoverflow.com/cat.png"/>
2. <img src="http://www.stackoverflow.com/cat.png"/>
3. <img src="https://www.stackoverflow.com/cat.png"/>
4. <img src="somedirectory/cat.png"/>
5. <img src="/cat.png"/>
6. <img src="cat.png"/>
And then process all <img> "src" attributes with the created replacement filter function:
$doc = new DOMDocument();
$saved = libxml_use_internal_errors(true);
$doc->loadHTML($site, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_use_internal_errors($saved);
$srcs = (new DOMXPath($doc))->query('//img/#hsrc') ?: [];
foreach ($srcs as $src) {
$src->nodeValue = $removeFilename($src->nodeValue);
}
echo $doc->saveHTML();
The result then again is:
1. <img src="//www.stackoverflow.com/cat.png">
2. <img src="http://www.stackoverflow.com/cat.png">
3. <img src="https://www.stackoverflow.com/cat.png">
4. <img src="somedirectory/cat.png">
5. <img src="/cat.png">
6. <img src="cat.png">
Just a different way of parsing has been used - the replacement still is the same. Just to offer two different ways that are also the same in part.
I suggest doing it in more steps.
In order to not complicate the solution, let's assume that any src value is always an image (it could as well be something else, e.g. a script).
Also, let's assume that there are no spaces, between equals sign and quotes (this can be fixed easily if there are). Finally, let's assume that the file name does not contain any escaped quotes (if it did, regexp would be more complicated).
So you'd use the following regexp to find all image references:
src="([^"]*)". (Also, this does not cover the case, where src is enclosed into single quotes. But it is easy to create a similar regexp for that.)
However, the processing logic could be done with preg_replace_callback function, instead of str_replace. You can provide a callback to this function, where each url can be processed, based on its contents.
So you could do something like this (not tested!):
$site = preg_replace_callback(
'src="([^"]*)"',
function ($src) {
$url = $src[1];
$ret = "";
if (preg_match("^//", $url)) {
// case 1.
$ret = "src='" . $url . '"';
}
else if (preg_match("^https?://", $url)) {
// case 2. and 3.
$ret = "src='" . $url . '"';
}
else {
// case 4., 5., 6.
$ret = "src='http://your.site.com.com/" . $url . '"';
}
return $ret;
},
$site
);
I'm trying to replace a title tag from |title|Page title| to <title>Page Title</title>, using this regular expression. But being a complete amateur, it's not gone to well..
'^|title|^[a-zA-Z0-9_]{1,}|$' => '<title>$1</title>'
I would love to know how to fix it, and more importantly, what I did wrong and why it was wrong.
You almost got it:
You should escape the | characters as they have special meaning in a
regex and you are using it as a plain character.
You should add the space character to your search group
$string = '|title|Page title|';
$pattern = '/\|title\|([a-zA-Z0-9_ ]{1,})\|/';
$replacement = '<title>$1</title>';
echo preg_replace($pattern, $replacement, $string); //echoes <title>Page title</title>
See working demo
OP posted some code in comments which is wrong, try this version:
$regular_expressions = array( array( '/\|title\|([a-zA-Z0-9_ ]{1,})\|/' , '<title>$1</title>' ));
foreach($regular_expressions as $regexp){
$data = preg_replace($regexp[0], $regexp[1], $data);
}
Heres a little function I came up with a while back to essentially scrape the titles of a page when users submitted links through my service. What this function does is will get the contents of a provided URL. Seek a title tag, if found, get whats between the title tag and dump it's result. With a little tweaking I am sure you can use a replace method for whatever your doing, and make it work for your needs. So this is more of a starting point rather than an answer but overall I hope it helps to some extent.
$url = 'http://www.chrishacia.com';
function get_page_title($url){
if( !($data = file_get_contents($url)) ) return false;
if( preg_match("#<title>(.+)<\/title>#iU", $data, $t)) {
return trim($t[1]);
} else {
return false;
}
}
var_dump(get_page_title($url));
<?php
$s = "|title|Page title|";
$s = preg_replace('/^\|title\|([^\|]+)\|/', "<title>$1</title>", $s);
echo $s;
?>
I have a PHP application that reads in a bit of HTML. In this HTML there may be an img tag. What I want to do is strip the directory structure from the src of the image tag e.g.
<img src="dir1/dir2/dir3/image1.jpg>
to
<img src="image1.jpg">
Anyone have any pointers?
Thanks,
Mark
As a suggestion, rather than using regex, you may be better off using something like the SimpleXML class to traverse the HTML, that way you'd be able to find the img tags and their src attribute then change it easily. Rather than having to try and parse a whole document with regex. After you've done that you'd be able to just explode the string using the "/" delimiter and use the last value of the exploded array as the src attribute.
PHP.net's SimpleXML Manual: http://php.net/manual/en/book.simplexml.php
This is a tutorial how to change all links in a HTMl document: Scraping Links From HTML.
With a slight modification of the example, this could do it:
<?php
require('FluentDOM/FluentDOM.php');
$html = '<img src="dir1/dir2/dir3/image1.jpg">';
$fd = FluentDOM($html, 'html')->find('//img[#src]')->each(
function ($node) use ($url) {
$item = FluentDOM($node);
$item->attr('href', basename($item->attr('src')));
}
);
$fd->contentType = 'xml';
header('Content-type: text/xml');
echo $fd;
?>
If you want to try this with regexp this could work:
$subject = "dir1/dir2/dir3/image1.jpg";
$pattern = '/^.*\//';
$result = preg_replace($pattern, '', $subject);
as I cant find exact answer to my question I decided to ask for help posting my question here. So, I have a page content which I get with file_get_contents and want to preg_match this url:
http://sample.com/Last/LastSearch.ashx?q=cnt=1&day=5&keyword=sample&location=15&view=v
from
Last
Please help me.
Why not use the DOM? That's what it's for...
If you insist on a regex, try (in PHP)
if (preg_match('/<a href="javascript:LastURL\(\'([^\'])*\'/', $subject, $regs)) {
$result = $regs[1];
} else {
$result = "";
}
or (in JavaScript)
var myregexp = /<a href="javascript:LastURL\('([^'])*'/;
var match = myregexp.exec(subject);
if (match != null) {
result = match[1];
} else {
result = "";
}
As long as the url is always going to start with http:// then you could use the following expression within your preg_match:
(((f|ht){1}tp://)[-a-zA-Z0-9#:%_\+.~#?&//=]+)
I'm attempting to run preg_match to extract the SRC attribute from the first IMG tag in an article (in this case, stored in $row->introtext).
preg_match('/\< *[img][^\>]*[src] *= *[\"\']{0,1}([^\"\']*)/i', $row->introtext, $matches);
Instead of getting something like
images/stories/otakuzoku1.jpg
from
<img src="images/stories/otakuzoku1.jpg" border="0" alt="Inside Otakuzoku's store" />
I get just
0
The regex should be right, but I can't tell why it appears to be matching the border attribute and not the src attribute.
Alternatively, if you've had the patience to read this far without skipping straight to the reply field and typing 'use a HTML/XML parser', can a good tutorial for one be recommended as I'm having trouble finding one at all that's applicable to PHP 4.
PHP 4.4.7
Your expression is incorrect. Try:
preg_match('/< *img[^>]*src *= *["\']?([^"\']*)/i', $row->introtext, $matches);
Note the removal of brackets around img and src and some other cleanups.
Here's a way to do it with built-in functions (php >= 4):
$parser = xml_parser_create();
xml_parse_into_struct($parser, $html, $values);
foreach ($values as $key => $val) {
if ($val['tag'] == 'IMG') {
$first_src = $val['attributes']['SRC'];
break;
}
}
echo $first_src; // images/stories/otakuzoku1.jpg
If you need to use preg_match() itself, try this:
preg_match('/(?<!_)src=([\'"])?(.*?)\\1/',$content, $matches);
Try:
include ("htmlparser.inc"); // from: http://php-html.sourceforge.net/
$html = 'bla <img src="images/stories/otakuzoku1.jpg" border="0" alt="Inside Otakuzoku\'s store" /> noise <img src="das" /> foo';
$parser = new HtmlParser($html);
while($parser->parse()) {
if($parser->iNodeName == 'img') {
echo $parser->iNodeAttributes['src'];
break;
}
}
which will produce:
images/stories/otakuzoku1.jpg
It should work with PHP 4.x.
The regex I used was much simpler. My code assumes that the string being passed to it contains exactly one img tag with no other markup:
$pattern = '/src="([^"]*)"/';
See my answer here for more info: How to extract img src, title and alt from html using php?
This task should be executed by a dom parser because regex is dom-ignorant.
Code: (Demo)
$row = (object)['introtext' => '<div>test</div><img src="source1"><p>text</p><img src="source2"><br>'];
$dom = new DOMDocument();
$dom->loadHTML($row->introtext);
echo $dom->getElementsByTagName('img')->item(0)->getAttribute('src');
Output:
source1
This says:
Parse the whole html string
Isolate all of the img tags
Isolate the first img tag
Isolate its src attribute value
Clean, appropriate, easy to read and manage.