PHP regex modification - php

I'm using an old Joomla! plugin (I know, first mistake). It does some URL replacement through regex. Here is the code:
$row->text = preg_replace_callback('#href=("|\')(https?://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?)("|\')#', 'replace_links', $row->text);
The problem is that it breaks with URLs that have a hyphen in them. Any help on how I can modify it to accept hyphens would be great.
It could also be the replace_links function that breaks:
function replace_links($matches) {
$match = $matches[0];
$array = array('href=',"'", '"');
$match = str_replace($array, '',$match);
if (strpos($match, JURI::root())) {
return $matches[0];
} else {
$plugin =& JPluginHelper::getPlugin('content', 'linkdisclaimer');
$pluginParams = new JParameter( $plugin->params );
$id = $pluginParams->get('disclaimerPage');
$match = "href=\"javascript:linkDisclaimer('".rawurlencode($match)."', '".$id."');\"";
return $match;
}
}

I tried this in a regex tester and it doesn't match URLs with a - in the path, so I'm guessing it's the regex. The host part, ([-\w\.]+), already accepts hyphens, but the path class [\w/_\.] does not. Try adding a - to that class, placed first so it is always read as a literal hyphen (written as [\w-/_\.], some PCRE versions reject it as an invalid range): href=("|\')(https?://([-\w\.]+)+(:\d+)?(/([-\w/_\.]*(\?\S+)?)?)?)("|\'). This should allow - in the path segment after the domain. The full replacement would be:
$row->text = preg_replace_callback('#href=("|\')(https?://([-\w\.]+)+(:\d+)?(/([-\w/_\.]*(\?\S+)?)?)?)("|\')#', 'replace_links', $row->text);
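A quick way to check a candidate pattern without touching the plugin is to run it against a sample link. The sketch below uses a variant of the pattern with the hyphen placed first in the path character class; the sample markup and its URL are made up for illustration.

```php
<?php
// Variant of the plugin's pattern with "-" placed first in the path
// character class, where it is always treated as a literal hyphen.
$pattern = '#href=("|\')(https?://([-\w\.]+)+(:\d+)?(/([-\w/_\.]*(\?\S+)?)?)?)("|\')#';

// Hypothetical sample markup with hyphens in both the host and the path.
$html = '<a href="http://my-site.example.com/some-page/file-name.html">x</a>';

$matched = preg_match($pattern, $html, $m);

// Group 2 holds the full URL between the quotes.
echo $matched ? $m[2] : 'no match', "\n";
```

If the hyphenated URL is printed, the regex is fine and any remaining breakage is more likely in replace_links.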

Related

PHP - find and replace a string between two variables when the length can't be determined

I'm trying to create a simple PHP find and replace system by looking at all of the images in the HTML and add a simple bit of code at the start and end of the image source. The image source has something like this:
<img src="img/image-file.jpg">
and it should become this:
<img src="{{media url="wysiwyg/image-file.jpg"}}">
The Find
="img/image-file1.jpg"
="img/file-2.png"
="img/image3.jpg"
Replace With
="{{media url="wysiwyg/image-file1.jpg"}}"
="{{media url="wysiwyg/file-2.png"}}"
="{{media url="wysiwyg/image3.jpg"}}"
The solution is most likely simple, yet everything I have found in my research only works with one fixed string, not a variety of unpredictable strings.
Current Progress
$oldMessage = "img/";
$deletedFormat = '{{media url="wysiwyg/';
$str = file_get_contents('Content Slots/Compilied Code.html');
$str = str_replace("$oldMessage", "$deletedFormat",$str);
The bit I'm stuck on is finding the " at the end of the source, so that I can add the end of the required code: "}}"
I don't like to build regular expressions to parse HTML, but it seems that in this case, a regular expression will help you:
$reg = '/=["\']img\/([^"\']*)["\']/';
$src = ['="img/image-file1.jpg"', '="img/file-2.png"', '="img/image3.jpg"'];
foreach ($src as $s) {
$str = preg_replace($reg, '="{{media url="wysiwyg/$1"}}"', $s);
echo "$str\n";
}
Here you have an example on Ideone.
To make it work with your content:
$content = file_get_contents('Content Slots/Compilied Code.html');
$reg = '/=["\']img\/([^"\']*)["\']/';
$final = preg_replace($reg, '="{{media url="wysiwyg/$1"}}"', $content);
Here you have an example on Ideone.
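To see what the pattern touches and what it leaves alone, it can be run on a small in-memory string first. This is only a sketch: the sample markup is made up, and the replacement string includes the outer quotes requested in the question.

```php
<?php
// Hypothetical sample markup standing in for the real file contents.
$content = '<img src="img/image-file1.jpg"> '
         . "<img src='img/file-2.png'> "
         . '<img src="other/keep.png">';

// Match ="img/..." or ='img/...' and capture the file name.
$reg = '/=["\']img\/([^"\']*)["\']/';

// Wrap the captured file name in the media directive, outer quotes included.
$final = preg_replace($reg, '="{{media url="wysiwyg/$1"}}"', $content);

echo $final, "\n";
```

Sources that do not start with img/ are not matched, so the third image is left unchanged.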
In my opinion, what you are doing is not the best way to do this. I would use an abstract template instead.
<?php
$content = file_get_contents('Content Slots/Compilied Code.html');
preg_match_all('/=\"img\/(.*?)\"/', $content, $matches);
$finds = $matches[1];
$abstract = '="{{media url="wysiwyg/{filename}"}}"';
$concretes = [];
foreach ($finds as $find) {
$concretes[] = str_replace("{filename}", $find, $abstract);
}
// $concretes will now hold all matches, properly formed...
Edit:
To return full html use this:
<?php
$content = file_get_contents('Content Slots/Compilied Code.html');
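// Use the lazy (.*?) so each match stops at the first closing quote
preg_match_all('/=\"img\/(.*?)\"/', $content, $matches);
$finds = $matches[1];
$abstract = '="{{media url="wysiwyg/{filename}"}}"';
foreach ($finds as $find) {
// Replace one occurrence per iteration, in the order the matches were found
$content = preg_replace('/=\"img\/(.*?)\"/', str_replace("{filename}", $find, $abstract), $content, 1);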
}
echo $content;

Greek URL conversion and trimming unwanted numbers and symbols

This problem is a little complicated, since I'm a newbie to PHP encoding.
My site uses utf-8 encoding.
After a lot of tests, i found some solution. I use this kind of code:
function chr_conv($str)
{
$a = array('%CE%B2', '%CE%B3', '%CE%B4', '%CE%B5' /* etc. */); // URL-encoded patterns
$b = array('a', 'b', 'c', 'd' /* etc. */); // replacement characters
return str_replace($a, $b, $str);
}
function replace_old($str)
{
$a1 = array('index.php', '/http://' /* etc. */); // strings to remove
$a2 = array('', '' /* etc. */); // replacement strings
return str_replace($a1, $a2, $str);
}
function sanitize($url)
{
$url= replace_old(replace_old($url));
$url = strtolower($url);
$url = preg_replace('/[0-9]/', '', $url);
$url = preg_replace('/[?]/', '', $url);
$url = substr($url,1);
return $url;
}
function wbz404_process404()
{
$options = wbz404_getOptions();
$urlRequest = $_SERVER['REQUEST_URI'];
$url = chr_conv($urlRequest);
$requestedURL = replace_old(replace_old($url));
$requestedURL .= wbz404_SortQuery($urlParts);
//Get URL data if it's already in our database
$redirect = wbz404_loadRedirectData($requestedURL);
echo sanitize($requestedURL);
echo "</br>";
echo $requestedURL;
echo "</br>";
}
When incoming url is:
/content.php?147-%CE%A8%CE%AC%CF%81%CE%B9-%CE%BC%CE%B5-%CF%80%CF%81%CE%AC%CF%83%CE%B1%28%CE%A7%CE%BF%CF%8D%CE%BC%CF%80%CE%BB%CE%B9%CE%BA%29
I get:
/content.php?147-psari-me-prasa-choumplik
I want only:
/psari-me-prasa-choumplik
without the content.php?147- before URL.
BUT the most important problem is that I get an ENDLESS LOOP instead of the correct URL.
What am I doing wrong?
Keep in mind that an .htaccess solution won't work, since I have a lighttpd server, not Apache.
If you need
I am assuming it's not always ?147- that you need to skip, but always everything up to the first hyphen. In that case, add the following before the echo:
$requestedURL = substr($requestedURL, strpos($requestedURL, '-') + 1);
This searches for the position of the first hyphen (strpos; strrpos would find the last one), adds one so that the hyphen itself is skipped, and cuts $requestedURL from there to the end of the string.
If the prefix is always /content.php?147-, replace strpos($requestedURL, '-') + 1 with the number 17.
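As a minimal sketch of the same idea, the cut can be wrapped in a small helper that also handles a URL without any hyphen and restores the leading slash the question asks for. The function name strip_prefix_to_first_hyphen is made up for this example and is not part of the plugin.

```php
<?php
// Hypothetical helper: drop everything up to and including the first hyphen
// and put a leading slash back in front of the remainder.
function strip_prefix_to_first_hyphen($requestedURL)
{
    $pos = strpos($requestedURL, '-');
    if ($pos === false) {
        // No hyphen at all: leave the URL untouched.
        return $requestedURL;
    }
    return '/' . substr($requestedURL, $pos + 1);
}

$result = strip_prefix_to_first_hyphen('/content.php?147-psari-me-prasa-choumplik');
echo $result, "\n"; // prints "/psari-me-prasa-choumplik"
```

The explicit === false check matters because strpos returns 0, not false, when the hyphen is the very first character.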

how can I get a list of all files and urls on a webpage

I'm trying to get a list of all files and URLs on a webpage, something like the list given on http://tools.pingdom.com when you type in a URL. I'm trying to do this in PHP using cURL or wget. Does anyone have a suggestion about how I can get this kind of file/path list?
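$url = "http://wwww.xyz.com";
$data = file_get_contents($url);
// Keep only the <a> tags, then split the text at each closing </a>
$data = strip_tags($data, "<a>");
$d = preg_split("/<\/a>/", $data);
$urls = array();
foreach ($d as $k => $string) {
if (strpos($string, "<a href=") !== FALSE) {
// Drop everything up to and including the opening href="
$string = preg_replace("/.*<a\s+href=\"/sm", "", $string);
// Drop everything from the closing quote onwards
$string = preg_replace("/\".*/s", "", $string);
$urls[] = $string;
}
}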
edit:
or you can use this function:
function getAllUrls($string)
{
$regex = '/https?\:\/\/[^\" ]+/i';
preg_match_all($regex, $string, $matches);
return ($matches[0]);
}
$url_array = getAllUrls($string);
print_r($url_array);
Once you have the document in a string use regex to find all the URLs.
Match URLs with regex
Use regex with PHP
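As an alternative to regex, PHP's bundled DOM extension can parse the page and read the attributes directly, which also picks up relative paths to images, scripts and stylesheets. This is a minimal sketch, assuming the DOM extension is enabled; the function name collect_links and the sample markup are made up for illustration.

```php
<?php
// Hypothetical helper: collect link and file references from an HTML string.
function collect_links($html)
{
    $doc = new DOMDocument();
    // Real-world markup is often malformed; keep libxml warnings quiet.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Tag name => attribute that carries the URL or file path.
    $sources = array('a' => 'href', 'img' => 'src', 'script' => 'src', 'link' => 'href');

    $found = array();
    foreach ($sources as $tag => $attr) {
        foreach ($doc->getElementsByTagName($tag) as $node) {
            $value = $node->getAttribute($attr);
            if ($value !== '') {
                $found[] = $value;
            }
        }
    }
    return $found;
}

// Made-up sample page; in practice $html would come from file_get_contents() or cURL.
$html = '<html><head><link href="style.css" rel="stylesheet"><script src="app.js"></script></head>'
      . '<body><a href="http://example.com/page">p</a><img src="img/logo.png"></body></html>';

$links = collect_links($html);
print_r($links);
```

Relative paths are returned as written in the page, so they still need to be resolved against the page URL before they can be fetched.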

Preg-replace - replace all URLs except a domain and its subdomains

I have a Glype proxy and I don't want it to parse external URLs. All URLs on the page are automatically converted to: http://proxy.com/browse.php?u=[URL HERE]. Example: if I visit The Pirate Bay through my proxy, I don't want the following URLs to be parsed:
ByteLove.com (Not to: http://proxy.com/browse.php?u=http://bytelove.com&b=0)
BayFiles.com (Not to: http://proxy.com/browse.php?u=http://bayfiles.com&b=0)
BayIMG.com (Not to: http://proxy.com/browse.php?u=http://bayimg.com&b=0)
PasteBay.com (Not to: http://proxy.com/browse.php?u=http://pastebay.com&b=0)
Ipredator.com (Not to: http://proxy.com/browse.php?u=https://ipredator.se&b=0)
etc.
Of course I want to keep the internal URLs, so:
thepiratebay.se/browse (To: http://proxy.com/browse.php?u=http://thepiratebay.se/browse&b=0)
thepiratebay.se/top (To: http://proxy.com/browse.php?u=http://thepiratebay.se/top&b=0)
thepiratebay.se/recent (To: http://proxy.com/browse.php?u=http://thepiratebay.se/recent&b=0)
etc.
Is there a preg_replace to replace all URLs except thepiratebay.se and its subdomains (as in the example)? Another function is also welcome (such as DOMDocument, QueryPath, substr or strpos; not str_replace, because then I would have to define all URLs).
I've found something, but I'm not familiar with preg_replace:
$exclude = '.thepiratebay.se';
$pattern = '(https?\:\/\/.*?\..*?)(?=\s|$)';
$message= preg_replace("~(($exclude)?($pattern))~i", '$2$5$6', $message);
I guess you would need to provide a whitelist that says which domains should be proxied:
$whitelist = array();
$whitelist[] = "internal1.se";
$whitelist[] = "internal2.no";
$whitelist[] = "internal3.com";
// and so on...
$string = 'External link 1<br>';
$string .= 'Internal link 1<br>';
$string .= 'Internal link 2<br>';
$string .= 'External link 2<br>';
//Assuming the URL always is inside '' or "" you can use this pattern:
$pattern = '#(https?://proxy\.org/browse\.php\?u=(https?[^&|\"|\']*)(&?[^&|\"|\']*))#i';
$string = preg_replace_callback($pattern, "my_callback", $string);
//I had only PHP 5.2 on my server, so I decided to use a callback function.
function my_callback($match) {
global $whitelist;
// set return bypass proxy URL
$returnstring = urldecode($match[2]);
foreach ($whitelist as $white) {
// check if URL matches whitelist
if (stripos($match[2], $white) > 0) {
$returnstring = $match[0];
break;
}
}
return $returnstring;
}
echo "NEW STRING[:\n" . $string . "\n]\n";
You can use preg_replace_callback() to execute a callback function for every match. In that function you can decide whether the matched string should be converted or not.
<?php
$string = 'http://foobar.com/baz and http://example.org/bumm';
$pattern = '#(https?\:\/\/.*?\..*?)(?=\s|$)#i';
$string = preg_replace_callback($pattern, function($match) {
if (stripos($match[0], 'example.org/') !== false) {
// exclude all URLs containing example.org
return $match[0];
} else {
return 'http://proxy.com/?u=' . urlencode($match[0]);
}
}, $string);
echo $string, "\n";
(Example is using PHP 5.3 closure notation)
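A substring test such as stripos can also match a domain that merely contains the excluded name. One way to limit the check to a domain and its subdomains is to compare the parsed host instead. This is a sketch under that assumption; the function name is_internal_host is made up and is not part of Glype.

```php
<?php
// Hypothetical helper: true if $url points at $domain or one of its subdomains.
function is_internal_host($url, $domain)
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        // No host part (relative or malformed URL).
        return false;
    }
    $host = strtolower($host);
    $domain = strtolower($domain);
    if ($host === $domain) {
        return true;
    }
    // A subdomain must end with ".domain", not just contain the name.
    $suffix = '.' . $domain;
    return substr($host, -strlen($suffix)) === $suffix;
}

var_dump(is_internal_host('http://thepiratebay.se/browse', 'thepiratebay.se'));
var_dump(is_internal_host('http://static.thepiratebay.se/img.png', 'thepiratebay.se'));
var_dump(is_internal_host('http://bayimg.com', 'thepiratebay.se'));
var_dump(is_internal_host('http://notthepiratebay.se/', 'thepiratebay.se'));
```

Inside the preg_replace_callback callback, the result of this check would decide whether to return the proxied URL or the original one.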

Codeigniter and preg_replace

I use Codeigniter to create a multilingual website and everything works fine, but when I try to use the "alternative languages helper" by Luis I've got a problem. This helper uses a regular expression to replace the current language with the new one:
$new_uri = preg_replace('/^'.$actual_lang.'/', $lang, $uri);
The problem is that I have a URL like this: http://www.example.com/en/language/english/ and I want to replace only the first "en" without changing the word "english". I tried to use the limit for preg_replace:
$new_uri = preg_replace('/^'.$actual_lang.'/', $lang, $uri, 1);
but this doesn't work for me. Any ideas?
You could do something like this:
$regex = '#^'.preg_quote($actual_lang, '#').'(?=/|$)#';
$new_uri = preg_replace($regex, $lang, $uri);
The lookahead at the end, (?=/|$), basically means "only match if the next character is a forward slash or the end of the string"...
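The effect of the lookahead is easiest to see on two sample URIs, one that starts with the language segment and one that merely starts with the same letters. The values of $actual_lang and $lang and the sample URIs below are assumptions chosen for illustration.

```php
<?php
// Assumed values for illustration.
$actual_lang = 'en';
$lang = 'fr';

// Match the language code only when it is a complete first segment.
$regex = '#^' . preg_quote($actual_lang, '#') . '(?=/|$)#';

$first = preg_replace($regex, $lang, 'en/language/english/');
$second = preg_replace($regex, $lang, 'english/en/');

echo $first, "\n";  // prints "fr/language/english/"
echo $second, "\n"; // prints "english/en/" (unchanged)
```

Without the lookahead, the second URI would have its leading "en" replaced as well.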
Edit:
If the code you always want to replace is at the beginning of the path, you could always do:
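if (stripos($url, $actual_lang) !== false) {
if (strpos($url, '://') !== false) {
$path = parse_url($url, PHP_URL_PATH);
} else {
$path = $url;
}
// Remember a leading slash so it can be restored after the split
$prefix = (substr($path, 0, 1) === '/') ? '/' : '';
list($language, $rest) = array_pad(explode('/', ltrim($path, '/'), 2), 2, '');
if ($language == $actual_lang) {
$url = str_replace($path, $prefix . $lang . '/' . $rest, $url);
}
}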
It's a bit more code, but it should be fairly robust. You could always build a class to do this for you (by parsing, replacing and then rebuilding the URL)...
If you know what the beginning of the URL will always be, use it in the regex!
$domain = "http://www.example.com/";
$regex = '#(?<=^' . preg_quote($domain, '#') . ')' . preg_quote($actual_lang, '#') . '\b#';
$new_uri = preg_replace($regex, $lang, $uri);
In the case of your example, the regular expression would become #(?<=^http\://www\.example\.com/)en\b#, which matches en only if it follows the specified beginning of the URL ((?<=...) is a positive lookbehind) and is followed by a word boundary (so english wouldn't match).
