searching link with php regular expression - php

I have been programming in C and C#, using a third-party regular expression library to identify link patterns. Yesterday I was asked to use PHP instead. I am not familiar with PHP regular expressions; I tried, but did not get the expected result. I have to extract and replace the link in an image src of the form:
<img src="/a/b/c/d/binary/capture.php?id=main:slave:demo.jpg"/>
I only want the path in the src. The quotation marks could be double or single, and the id can vary from case to case (here it is main:slave:demo.jpg).
I tried the following code:
$searchfor = '/src="(.*?)binary\/capture.php?id=(.+?)"/';
$matches = array();
while ( preg_match($searchfor, $stringtoreplace, $matches) == 1 ) {
    // here, if matches are found, replace the source text and search again
    $stringtoreplace = str_replace($matches, 'whatever', $stringtoreplace);
}
But it doesn't work. Is there anything I missed, or any mistake in the code above?
More specifically, let's say I have an image tag whose src is
<img src="ANY_THING/binary/capture.php?id=main:slave:demo.jpg"/>
Here ANY_THING could be anything, and "/binary/capture.php?id=" is fixed for all cases. The string after "id=" follows the pattern "main:slave:demo.jpg": the strings before the colons change from case to case, and the name of the JPEG varies too. I would expect it to be replaced with
<img src="/main/slave/demo.jpg"/>
Since I only have the right to modify the PHP script at a specific, limited time, I want to debug my code before making any modification. Thanks.

First of all, as you may know, regex shouldn't be used to manipulate HTML.
However, try:
$stringtoreplace = '<img src="/a/b/c/d/binary/capture.php?id=main:slave:demo.jpg"/>';
$new_str = preg_replace_callback(
    // The regex to match
    '/<img(.*?)src="([^"]+)"(.*?)>/i',
    function($matches) { // callback
        parse_str(parse_url($matches[2], PHP_URL_QUERY), $queries); // convert query strings to array
        $matches[2] = '/'.str_replace(':', '/', $queries['id']); // replace the url
        return '<img'.$matches[1].'src="'.$matches[2].'"'.$matches[3].'>'; // return the replacement
    },
    $stringtoreplace // str to replace
);
var_dump($new_str);
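For reference, the pattern in the question never matches because the ? and the . in capture.php?id= are not escaped: an unescaped ? makes the preceding p optional instead of matching a literal question mark. Below is a minimal sketch of a direct replacement that escapes both and accepts single or double quotes. The function name rewrite_capture_src is made up for illustration, and the usual caveat about running regex over HTML still applies.

```php
<?php
// Hypothetical helper; assumes the fixed "/binary/capture.php?id=" segment from the question.
function rewrite_capture_src(string $html): string
{
    // (["\']) captures the opening quote, \1 requires the same quote at the end.
    $pattern = '/src=(["\'])[^"\']*?\/binary\/capture\.php\?id=([^"\']+)\1/i';

    return preg_replace_callback($pattern, function (array $m): string {
        // main:slave:demo.jpg becomes /main/slave/demo.jpg
        return 'src=' . $m[1] . '/' . str_replace(':', '/', $m[2]) . $m[1];
    }, $html);
}

echo rewrite_capture_src('<img src="/a/b/c/d/binary/capture.php?id=main:slave:demo.jpg"/>'), "\n";
// prints <img src="/main/slave/demo.jpg"/>
```

Because the whole job is done in one preg_replace_callback call, no while loop is needed to search again after each replacement.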

Related

extracting only image src, not other 'src' tags in html with php

I've been able to use preg_match on getting the src of any image tags, but I only really need the src of images with class 'wp-post-image' in this case. However, this code is returning nothing for me
$pattern = '<img(?:[^>]+src="(.+?)"[^>]+(?:id|class)="image"|[^>]+(?:id|class)="wp-post-image"[^>]+src="(.+?)")
';
preg_match($pattern,$results[$k]['description'], $matches);
$results[$k]['image'] = $matches[0];
print_r($results[$k]['image']);
The old version returns all image matches which includes 4 that have the class I'm looking for so maybe my syntax is just wrong?
old version that returned all images:
$pattern = '%<img.*?src=["\'](.*?)["\'].*?/>%i';
preg_match($pattern,$results[$k]['description'], $matches);
$src = $matches[0];
//print_r($src);
Asking to parse HTML with regex on SO will get you flamed. Not without reason, but flamed nonetheless.
If you insist on using regex (which, if for nothing else, is good practice), I suggest using a regex sandbox to test out patterns on sample text. One I use is https://regex101.com/ .
The old version (which you say worked) is looking for either single or double quotes around the src attribute. The new version is only looking for double quotes, which is possibly why it's failing.
Rather than trying to write a more complicated regex, it may be easier to use your old regex -- which grabs all the image links -- along with an expanded capture, and then look through the captured links to sort out the ones you need:
$pattern = '%(<img.*?src=["\'].*?["\'].*?/>)%i';
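A minimal sketch of that two-step approach: capture every whole img tag with the expanded pattern, then keep only the tags that carry the wanted class and pull the src out of each. The function name src_of_images_with_class and the sample markup are assumptions for illustration, and the class check is deliberately simple.

```php
<?php
// Hypothetical helper: returns the src of every <img .../> tag carrying the given class.
function src_of_images_with_class(string $html, string $class): array
{
    // The expanded capture: grab each whole tag, with either quote style around src.
    preg_match_all('%(<img.*?src=["\'].*?["\'].*?/>)%i', $html, $tags);

    // The class name must stand alone inside the class attribute (not be part of a longer name).
    $classPattern = '/class=["\'][^"\']*(?<![\w-])' . preg_quote($class, '/') . '(?![\w-])[^"\']*["\']/i';

    $sources = [];
    foreach ($tags[1] as $tag) {
        if (preg_match($classPattern, $tag) && preg_match('/src=["\']([^"\']*)["\']/i', $tag, $m)) {
            $sources[] = $m[1];
        }
    }
    return $sources;
}

$html = '<img class="wp-post-image" src="a.jpg" /><img src=\'b.png\' class="other" />';
print_r(src_of_images_with_class($html, 'wp-post-image'));
```

With the sample markup above, only a.jpg is returned, since the second tag has a different class.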

php: preg_match and preg_replace

I'm unsure of how to do the following...
I need to search through a string and match all instances of a forward slash and certain letters. This is for word modifications that users have the ability to input and I want them to be able to modify individual words.
Here is an example string
Hello, isn't the weather just absolutely beautiful today!?
What I'd like the user to be able to do is something like this
Hello, isn't the /bo weather just /it beautiful today!?
take note of the /bo and /it
What I'd like is a preg_match and/or preg_replace statement that finds the instances of /bo and /it and converts them into HTML tags, such as the bold and italics tags (I can't type them here or they get converted into actual HTML), wrapped around the word immediately following the code. In this example it would wind up being:
Hello, isn't the <b>weather</b> just <i>beautiful</i> today!?
Any ideas how I could do this with a regex?
Once the conversions are done i'll do the standard sanitizing before inserting the data into the database along with prepared statements.
$string = "Hello, isn't the /bo weather just /it beautiful /bo today!?";
var_dump(preg_replace (array('/\/bo\s(\w+)/', '/\/it\s(\w+)/'), array('<b>$1</b>', '<i>$1</i>'), $string));
"Hello, isn't the <b>weather</b> just <i>beautiful</i> <b>today</b>!?"
You can use preg_replace_callback for this purpose.
It calls a callback function for every match that occurs.
Inside the callback, you can perform the replacement according to your conditions (bold for bo, italics for it, heading for he, etc.).
Something like this:
$str = "Hello, isn't the /bo weather just /it beautiful today!?";
$regex = "/\/(.*?)\s+(.+?)\b/";
function callback($matches){
    $type = $matches[1];
    $text = $matches[2];
    switch($type){
        case "bo":
            return "<b>".$text."</b>";
        case "it":
            return "<i>".$text."</i>";
        default:
            return $text;
    }
}
$resp = preg_replace_callback(
    $regex,
    "callback",
    $str
);
var_dump($resp);
/*
OUTPUT-
Hello, isn't the <b>weather</b> just <i>beautiful</i> today!?
*/
This example could be extended further by checking for additional types and rejecting invalid ones.
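One way such an extension might look is a lookup table from code to tag name, so that new codes are added in one place and unknown codes are left untouched. This is only a sketch: the function name apply_word_codes and the contents of the map are assumptions, not part of the answer above.

```php
<?php
// Hypothetical helper; the code-to-tag map is an assumption and can be extended.
function apply_word_codes(string $text): string
{
    $tags = ['bo' => 'b', 'it' => 'i'];

    // A slash, a lowercase code, whitespace, then the single word to wrap.
    return preg_replace_callback('/\/([a-z]+)\s+(\w+)/', function (array $m) use ($tags): string {
        if (!isset($tags[$m[1]])) {
            return $m[0]; // unknown code: leave the text untouched
        }
        $tag = $tags[$m[1]];
        return "<$tag>{$m[2]}</$tag>";
    }, $text);
}

echo apply_word_codes("Hello, isn't the /bo weather just /it beautiful today!?"), "\n";
// prints Hello, isn't the <b>weather</b> just <i>beautiful</i> today!?
```

Returning the whole match for an unknown code means ordinary text such as "and/or something" passes through unchanged.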
The regexp
/\/(bo|it)\s+([\S]+)(?=\b)/g
and replacement string
<$1>$2</$1>
would almost do it:
Hello, isn't the <bo>weather</bo> just <it>beautiful</it> today!?
But the tags are not quite right yet ... They need to be single letters. :-(
Try here: https://regex101.com/r/oB9gT0/1
Edit 2 - a bit late, but now it works:
$str=preg_replace('/\/([bi])((?<=b)o|(?<=i)t)\s+([\w]+)/','<$1>$3</$1>',$str);
will now deliver the correct result:
Hello, isn't the <b>weather</b> just <i>beautiful</i> today!?
see here: https://regex101.com/r/oB9gT0/3

Regular Expression for urls for images and links

EDIT: I'm not parsing html like the 5 billion other questions that have been posted. This is raw unformatted text that I want to convert into some HTML.
I'm working on a post processing. I need to convert Urls with image endings (jpe?g|png|gif) into image tags, and all other Urls into href links. I have my image replacement correct, however I'm stuck keeping the link replacement from trying to overwrite one another.
I need help writing an expression that either looks for URLs not already wrapped in the tags from the image replacement, or looks for URLs that do not end in .jpe?g, .png or .gif.
public function smartConvertPost($post) {
    /**
     * Match image based urls
     */
    $pattern = '!http://([a-z0-9\-\.\/\_]+\.(?:jpe?g|png|gif))!Ui';
    $replace='<p><img src="http://$1"></p>';
    $postImages = preg_replace($pattern,$replace,$post);
    /**
     * Match url based
     */
    $pattern='/http://([a-z0-9\-\.\/\_]+(?:\S|$))/i';
    $replace='$1';
    $postUrl = preg_replace($pattern,$replace, $postImages);
    return $postUrl;
}
Please note I am not talking about matching tags or HTML, but about matching a plain string like the following and converting it to HTML.
If this was an example post with a Url to a page like http://www.some-website.com/some-page/anything.html and I also put a url to an image http://www.some-website.com/someimage.jpg you would need to regex the two to be a hyperlink and an image.
Thanks,
Brad Christie's preg_replace_callback() recommendation is a good one. Here is one possible implementation:
function smartConvertPost($post)
{   // Disclaimer: This "URL plucking" regex is far from ideal.
    $pattern = '!http://[a-z0-9\-._~\!$&\'()*+,;=:/?#[\]@%]+!i';
    $replace = '_handle_URL_callback';
    return preg_replace_callback($pattern, $replace, $post);
}
function _handle_URL_callback($matches)
{   // preg_replace_callback() is passed one parameter: $matches.
    if (preg_match('/\.(?:jpe?g|png|gif)(?:$|[?#])/i', $matches[0]))
    {   // This is an image if path ends in .GIF, .PNG, .JPG or .JPEG.
        return '<p><img src="'. $matches[0] .'"></p>';
    }   // Otherwise handle as NOT an image.
    return '<a href="'. $matches[0] .'">'. $matches[0] .'</a>';
}
Note that the regex used to pluck out a URL is not ideal. To do it right is tricky. See the following resources:
The Problem With URLs by Jeff Atwood.
An Improved Liberal, Accurate Regex Pattern for Matching URLs by John Gruber.
URL Linkification (HTTP/FTP). by yours truly.
Edit: Added ability to recognize image URLs having a query or fragment.
Since it's the 215247th post on that kind of topic, let's say it again: HTML is too complicated to parse with regex. Use a parser.
See this. Regular expression for parsing links from a webpage?
PS: no offense =).
Edit:
I personally often use Symfony, and there's a really great parser for what you need: http://fabien.potencier.org/article/42/parsing-xml-documents-with-css-selectors
You can get all images using simple css expression on your html. Give it a try.
What about using a marker?
public function smartConvertPost($post) {
    $MY_MARKER = "<MYMARKER>"; // Define the marker here
    $quotedMarker = preg_quote($MY_MARKER, '!'); // escaped for use inside the pattern
    /**
     * Match image based urls
     */
    $pattern = '!http://([a-z0-9\-\.\/\_]+\.(?:jpe?g|png|gif))!Ui';
    $replace = '<p><img src="'.$MY_MARKER.'http://$1'.$MY_MARKER.'"></p>'; // Use it here...
    $postImages = preg_replace($pattern, $replace, $post);
    /**
     * Match url based
     */
    $pattern = '!(?<!'.$quotedMarker.')http://([a-z0-9\-\.\/\_]+(?:\S|$))(?!'.$quotedMarker.')!i'; //...here
    $replace = '<a href="http://$1">http://$1</a>';
    $postUrl = preg_replace($pattern, $replace, $postImages);
    /**
     * Remove all markers
     */
    $postUrl = str_replace($MY_MARKER, '', $postUrl);
    return $postUrl;
}
Try to choose a marker that has no chance of appearing in the post.
HTH

PHP: finding, replacing, shortening, and prettifying user links with <a> tags, ellipses, and link icons

When a user enters a URL, e.g. http://www.google.com, I would like to be able to parse that text using PHP, find any links, and replace them with <a> tags that include the original URL as an HREF.
In other words, http://www.google.com will become
<a href="http://www.google.com">http://www.google.com</a>
I'd like to be able to do this for all URLs of these forms (with .com interchangeable with any TLD):
http://www.google.com
www.google.com
google.com
docs.google.com
What's the most performant way to do this? I could try writing some really fancy regex, but I doubt that's the best method available to me.
For bonus points, I'd also like to prepend http:// to any URL lacking it, and strip the display text itself down to something of the form http://www.google.com/reallyLongL... and display an external link icon afterwards.
Trying to find links in the format domain.com is going to be a pain in the butt. It would require keeping track of all TLDs and using them in the search.if you didn't, the end of the last sentence I typed and the beginning of this sentence would be a link to http://search.if. Even if you did, .in is a valid TLD and a common word.
I'd recommend telling your users they have to begin links with www. or http:// then write a simple regex to capture them and add the links.
www.google.com
This is not a URL, it's a hostname. It's generally not a good idea to start marking up bare hostnames in arbitrary text, because in the general case any word or sequence of dot-separated words is a perfectly valid hostname. That means you end up with horrible hacks like looking for leading www. (and you'll get questions like “why can I link to www.stackoverflow.com but not stackoverflow.com?”) or trailing TLDs (which gets more and more impractical as more new TLDs are introduced; “why can I link to ncm.com but not ncm.museum?”), and you'll often mark up things that aren't supposed to be links.
I could try writing some really fancy regex
Well I can't see how you'd do it without regex.
The trick is coping with markup. If you can have <, & and " characters in the input, you mustn't let them into HTML output. If your input is plain text, you can do that by calling htmlspecialchars() before applying a simple replacement on a pattern like that in nico's answer.
(If the input already contains markup, you've got problems, and you'd probably need an HTML parser to determine which bits are markup so that you avoid adding more markup inside them. Similarly, if you're doing more processing after this, inserting more tags, those steps may have the same difficulty. In ‘bbcode’-like languages this often leads to bugs and security problems.)
Another problem is trailing punctuation. It's common for people to put a full stop, comma, close bracket, exclamation mark etc after a link, which aren't supposed to be part of the link but which are actually valid characters. It's useful to strip these off and not put them in the link. But then you break Wiki links that end in ), so maybe you want to not treat ) as a trailing character if there's a ( in the link, or something like that. This sort of thing can't be done in a simple regex replace, but you can in a replacement callback function.
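A rough sketch of the trailing-punctuation handling described above, for plain-text input. It differs slightly from escaping the whole input up front: it splits the text around the URLs and escapes each piece separately, so the punctuation check never sees entity text such as &quot;. The function name linkify, the punctuation list and the simple URL pattern are all assumptions for illustration.

```php
<?php
// Hypothetical helper name; a sketch, not a complete URL matcher.
function linkify(string $plainText): string
{
    // With one capture group, odd-numbered pieces are the URLs, even-numbered ones the text between them.
    $parts = preg_split('~(\bhttps?://[^\s<>"]+)~i', $plainText, -1, PREG_SPLIT_DELIM_CAPTURE);
    $out = '';
    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            $out .= htmlspecialchars($part, ENT_QUOTES, 'UTF-8');
            continue;
        }
        $url = $part;
        $trail = '';
        // Peel trailing punctuation off the URL, one character at a time.
        while (strlen($url) > 1 && strpos('.,;:!?)', substr($url, -1)) !== false) {
            $last = substr($url, -1);
            // Keep a closing bracket that balances an opening one inside the URL.
            if ($last === ')' && substr_count($url, '(') >= substr_count($url, ')')) {
                break;
            }
            $trail = $last . $trail;
            $url = substr($url, 0, -1);
        }
        $safe = htmlspecialchars($url, ENT_QUOTES, 'UTF-8');
        $out .= '<a href="' . $safe . '">' . $safe . '</a>' . htmlspecialchars($trail, ENT_QUOTES, 'UTF-8');
    }
    return $out;
}

echo linkify('See http://example.com/a_(b), ok.'), "\n";
// prints See <a href="http://example.com/a_(b)">http://example.com/a_(b)</a>, ok.
```

The bracket rule is the "something like that" heuristic from the paragraph above: a closing bracket is treated as part of the link only while the URL contains at least as many opening brackets.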
HTML Purifier has a built-in linkify function to save you all the headaches.
Its other features are also simply too useful to pass up if you're dealing with any kind of user input that you also have to display.
Not so fancy regexps that should work
/\b(https?:\/\/[^\s+\"\<\>]+)/ig
/\b(www.[^\s+\"\<\>]+)/ig
Note that the last two would be impossible to do correctly as you cannot distinguish google.com from something like this.Where I finish one sentence and don't put a space after the full stop.
As for shortening the URLs, having your URL in $url:
if (strlen($url) > 20) // Or whatever length you like
{
$shortURL = substr($url, 0, 20)."…";
}
else
{
$shortURL = $url;
}
echo '<a href="'.$url.'" >'.$shortURL.'</a>';
From http://www.exorithm.com/algorithm/view/markup_urls
function markup_urls ($text)
{
    // split the text into words
    $words = preg_split('/([\s\n\r]+)/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
    $text = "";
    // iterate through the words
    foreach($words as $word) {
        // chopword = the portion of the word that will be replaced
        $chopword = $word;
        $chopword = preg_replace('/^[^A-Za-z0-9]*/', '', $chopword);
        if ($chopword <> '') {
            // linkword = the text that will replace chopword in the word
            $linkword='';
            // does it start with http://abc. ?
            if (preg_match('/^(http:\/\/)[a-zA-Z0-9_]{2,}.*/', $chopword)) {
                $chopword = preg_replace('/[^A-Za-z0-9\/]*$/', '', $chopword);
                $linkword = '<a href="'.$chopword.'">'.$chopword.'</a>';
            // does it equal abc.def.ghi ?
            } else if (preg_match('/^[a-zA-Z]{2,}\.([a-zA-Z0-9_]+\.)+[a-zA-Z]{2,}(\/.*)?/', $chopword)) {
                $chopword = preg_replace('/[^A-Za-z0-9\/]*$/', '', $chopword);
                $linkword = '<a href="http://'.$chopword.'">'.$chopword.'</a>';
            // does it start with abc@def.ghi ?
            } else if (preg_match('/^[a-zA-Z0-9_\.]+\@([a-zA-Z0-9_]{2,}\.)+[a-zA-Z]{2,}.*/', $chopword)) {
                $chopword = preg_replace('/[^A-Za-z0-9]*$/', '', $chopword);
                $linkword = '<a href="mailto:'.$chopword.'">'.$chopword.'</a>';
            }
            // replace chopword with linkword in word (if linkword was set)
            if ($linkword <> '') {
                $word = str_replace($chopword, $linkword, $word);
            }
        }
        // append the word
        $text = $text.$word;
    }
    return $text;
}
I got this working exactly the way I want here:
<?php
$input = <<<EOF
http://www.example.com/
http://example.com
www.example.com
http://iamanextremely.com/long/link/so/I/will/be/trimmed/down/a/bit/so/i/dont/mess
/up/text/wrapping.html
EOF;
function trimlong($match)
{
    $url = $match[0];
    $display = $url;
    if ( strlen($display) > 30 ) {
        $display = substr($display,0,30)."...";
    }
    return '<a href="'.$url.'">'.$display.'</a> <img src="http://static.goalscdn.com/img/external-link.gif" height="10" width="11" />';
}
// trimlong is a plain function here, so pass its name rather than array($this, 'trimlong')
$output = preg_replace_callback('#(http://|www\\.)[^\\s<]+[^\\s<,.]#i',
    'trimlong', $input);
echo $output;

regex to get current page or directory name?

I am trying to get the page or last directory name from a url
For example, if the URL is http://www.example.com/dir/ I want it to return dir, or if the passed URL is http://www.example.com/page.php I want it to return page. Notice I do not want the trailing slash or file extension.
I tried this:
$regex = "/.*\.(com|gov|org|net|mil|edu)/([a-z_\-]+).*/i";
$name = strtolower(preg_replace($regex,"$2",$url));
I ran this regex in PHP and it returned nothing. (however I tested the same regex in ActionScript and it worked!)
So what am I doing wrong here, how do I get what I want?
Thanks!!!
Don't use / as the regex delimiter if it also contains slashes. Try this:
$regex = "#^.*\.(com|gov|org|net|mil|edu)/([a-z_\-]+).*$#i";
You may try to escape the "/" in the middle. Unescaped, it simply closes your regex. So this may work:
$regex = "/.*\.(com|gov|org|net|mil|edu)\/([a-z_\-]+).*/i";
You may also make the regex somewhat more general, but that's another problem.
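A quick way to check the delimiter fix against both URLs from the question is sketched below. The wrapper function name_from_url is made up for illustration; the pattern is the # delimited one suggested above.

```php
<?php
// Hypothetical wrapper around the corrected pattern, for testing only.
function name_from_url(string $url): string
{
    // '#' is the delimiter, so the '/' inside the pattern needs no escaping.
    $regex = '#^.*\.(com|gov|org|net|mil|edu)/([a-z_\-]+).*$#i';
    return strtolower(preg_replace($regex, '$2', $url));
}

foreach (['http://www.example.com/dir/', 'http://www.example.com/page.php'] as $url) {
    echo name_from_url($url), "\n";
}
// prints "dir", then "page"
```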
You can use this
$parts = explode('/', $url); // array_pop() takes its argument by reference, so it needs a variable
$last = array_pop($parts);
Then apply a simple regex to remove any file extension
Assuming you want to match the entire address after the domain portion:
$regex = "%://[^/]+/([^?#]+)%i";
The above assumes a URL of the format extension://domainpart/everythingelse.
Then again, it seems that the problem here isn't that your RegEx isn't powerful enough, just mistyped (closing delimiter in the middle of the string). I'll leave this up for posterity, but I strongly recommend you check out PHP's parse_url() method.
This should adequately deliver:
substr($s = basename($_SERVER['REQUEST_URI']), 0, strrpos($s,'.') ?: strlen($s))
But this is better:
preg_replace('/[#\.\?].*/','',basename($path));
Although, your example is short, so I cannot tell if you want to preserve the entire path or just the last element of it. The preceding example will only preserve the last piece, but this should save the whole path while being generic enough to work with just about anything that can be thrown at you:
preg_replace('~(?:/$|[#\.\?].*)~','',substr(parse_url($path, PHP_URL_PATH),1));
As much as I personally love using regular expressions, more 'crude' (for want of a better word) string functions might be a good alternative for you. The snippet below uses sscanf to parse the path part of the URL for the first bunch of letters.
$url = "http://www.example.com/page.php";
$path = parse_url($url, PHP_URL_PATH);
sscanf($path, '/%[a-z]', $part);
// $part = "page";
This expression:
(?<=^[^:]+://[^.]+(?:\.[^.]+)*/)[^/]*(?=\.[^.]+$|/$)
Gives the following results:
http://www.example.com/dir/ dir
http://www.example.com/foo/dir/ dir
http://www.example.com/page.php page
http://www.example.com/foo/page.php page
Apologies in advance if this is not valid PHP regex - I tested it using RegexBuddy.
Save yourself the regular expression and make PHP's other functions feel more loved.
$url = "http://www.example.com/page.php";
$filename = pathinfo(parse_url($url, PHP_URL_PATH), PATHINFO_FILENAME);
Warning: for PHP 5.2 and up.
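A small sketch of how the parse_url() and pathinfo() combination could cover both cases from the question, the directory with a trailing slash as well as the page with an extension. The function name last_path_name is hypothetical.

```php
<?php
// Hypothetical helper: last directory or page name of a URL, without slash or extension.
function last_path_name(string $url): string
{
    // parse_url() returns null when the URL has no path, hence the cast.
    $path = (string) parse_url($url, PHP_URL_PATH);

    // Drop the trailing slash so "/dir/" is treated like a file called "dir".
    return strtolower(pathinfo(rtrim($path, '/'), PATHINFO_FILENAME));
}

echo last_path_name('http://www.example.com/dir/'), "\n";     // dir
echo last_path_name('http://www.example.com/page.php'), "\n"; // page
```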
