Preg_replace for url and links - php

Right now I'm using
$content = preg_replace('#(https?://([-\w\.]+)+(:\d+)?((/[\w/_\.%\-+~]*)?(\?\S+)?)?)#', '$1', $content);
to replace URLs with links, but it doesn't work with some symbols such as # and many others.
Also, if the content already appears as a link like this
http://www.abc.com/
then preg_replace should skip it; otherwise it duplicates the link and produces a wrong result.

The text helper class from Kohana has a function for this that would probably be a good starting point: https://github.com/kohana/core/blob/3.2/master/classes/kohana/text.php#L362

Why not just look for anything starting with http:// or https:// up until any whitespace character?
https?://[^\s]+
That is obviously pretty forgiving; the only problem is that you might get some false positives, such as trailing punctuation being swallowed into the URL.
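A minimal sketch of that forgiving approach, assuming plain-text input. The function name linkify is hypothetical, and the pattern additionally stops at <, > and " so it does not run into markup. It trims common trailing punctuation to reduce false positives, but it does not skip URLs that are already wrapped in an anchor tag.

```php
<?php
// Hypothetical helper: wrap bare http(s) URLs in anchor tags.
function linkify(string $text): string
{
    return preg_replace_callback(
        '~https?://[^\s<>"]+~i',
        function (array $m): string {
            // Sentence punctuation at the end is most likely not part of the URL.
            $url  = rtrim($m[0], '.,;:!?)');
            $tail = substr($m[0], strlen($url));
            $safe = htmlspecialchars($url, ENT_QUOTES);
            return '<a href="' . $safe . '">' . $safe . '</a>' . $tail;
        },
        $text
    );
}

echo linkify('see http://a.com/x#y.'), "\n";
```

Because everything up to the next whitespace is accepted, symbols such as # stay inside the link.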

Related

Regex to update URLs after a content migration

I recently moved some old content to a new site and updated some URL structures. I need to do a find-replace on the entire database to update some old links. This would be easy if I knew regex, but I don't, so I'm hoping this is easy for the SO gurus.
Note: This is PHP regex.
Find:
https://api.floodmagazine.com/{number}/{string}/
Result:
https://api.floodmagazine.com/789/foo-bar/
https://api.floodmagazine.com/12345/foo-bar-1/
Replace with:
https://floodmagazine.com/$1/$2/
Result:
https://floodmagazine.com/789/foo-bar/
https://floodmagazine.com/12345/foo-bar-1/
It's not as easy as just doing a search for the sub-domain (api.floodmagazine.com) because there are URLs in the DB that need that sub-domain to remain (images for example). So the /{number}/{string}/ part is an important way to find only the URLs that need to be changed.
I just need the regex part, I'm using WP Migrate for the database updating part.
Thanks for the help!
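https:\/\/api\.floodmagazine\.com\/([0-9]+)\/([A-Za-z0-9._+-]+)\/?
That should work. The dots in the host name are escaped so they match only literal dots, and the range is written A-Za-z because A-z also matches a few punctuation characters. On regex101 you have to escape / so I kept that here. That may not be true in your tooling.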
You can omit the last ? if you don’t want the trailing slash to be optional.
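As a rough check of that pattern in PHP, here is a small sketch. The function name migrate_url and the image URL in the usage line are hypothetical. It uses ~ as the delimiter so the slashes need no escaping.

```php
<?php
// Hypothetical helper: rewrite only api URLs of the form /{number}/{string}/.
function migrate_url(string $text): string
{
    $pattern     = '~https://api\.floodmagazine\.com/([0-9]+)/([A-Za-z0-9._+-]+)/?~';
    $replacement = 'https://floodmagazine.com/$1/$2/';
    return preg_replace($pattern, $replacement, $text);
}

echo migrate_url('https://api.floodmagazine.com/789/foo-bar/'), "\n";
// A made-up image path: the first segment is not a number, so it is left alone.
echo migrate_url('https://api.floodmagazine.com/wp-content/uploads/a.jpg'), "\n";
```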
This should grab all the URLs you describe:
(https://api\.floodmagazine\.com)(\/)[0-9]*(\/)[A-Za-z0-9-]*(\/)
To avoid URL errors due to WordPress inconsistencies, you can use this PHP code generated with regex101:
$re = '/https?:\/\/([^\/]+)\/([^\/]+)\/([^\/]+)\/?/m';
$str = 'https://api.floodmagazine.com/789/foo-bar/';
$subst = 'https://floodmagazine.com/$2/$3/';
$result = preg_replace($re, $subst, $str);
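This regex captures the domain, the ID and the post name. It also handles special cases such as non-HTTPS URLs and special characters, and returns the expected result for your example. Note that it matches any domain, so restrict the first group if other URLs must stay unchanged.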

Replace spaces in all URLs with %20 using Regex

I have a large block of HTML that contains multiple URLs with spaces in them. How do I use regex to replace any space that occurs in a URL with '%20'? The good thing is that all of the URLs end with '.pdf'.
Looking for something I could run in BBedit/Text Wrangler, or even PHP.
Example: http://www.site-name.com/dir/file name here.pdf
Need to return: http://www.site-name.com/dir/file%20name%20here.pdf
Instead of regex, you could use rawurlencode in PHP on the file-name part of each URL; it encodes a space as %20, whereas urlencode produces + instead. This is similar to encodeURI in JavaScript.
I was faced with exactly the same problem. I solved it with this:
$text = preg_replace("/http(.*) (.*)\.pdf/U", "http$1%20$2.pdf", $text);
This looks for a space between http and pdf and then replaces the space with %20.
If your URLs have multiple spaces, then simply run the code over and over until all the spaces are gone:
while (preg_match("/http(.*) (.*)\.pdf/U", $text))
{
    // Each pass replaces one space per matched URL.
    $text = preg_replace("/http(.*) (.*)\.pdf/U", "http$1%20$2.pdf", $text);
}
However, I've found this will overwrite text if there are two or more URLs on the same line. I haven't found a solution for this yet.
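One possible way around the same-line problem is to match each URL as a whole and replace the spaces inside that match only. The sketch below assumes every URL starts with http:// or https://, ends at the first following .pdf, and contains no quotes or angle brackets. The function name encode_pdf_url_spaces is hypothetical.

```php
<?php
// Hypothetical helper: encode spaces inside each http(s) URL ending in .pdf.
function encode_pdf_url_spaces(string $text): string
{
    return preg_replace_callback(
        // Lazy match from the scheme to the first ".pdf" that follows it.
        '~https?://[^"\'<>]*?\.pdf~i',
        function (array $m): string {
            return str_replace(' ', '%20', $m[0]);
        },
        $text
    );
}

echo encode_pdf_url_spaces('http://www.site-name.com/dir/file name here.pdf'), "\n";
```

Each URL is handled in a single pass, so no loop is needed. A URL that does not end in .pdf but is followed later by one that does would still be merged into one match, so this remains a heuristic.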

regular expression for replacing all links but css and js

I want to download a site and replace all links on that site with an internal link.
That's easy:
$page=file_get_contents($url);
$local=$_SERVER['HTTP_HOST'].$_SERVER['PHP_SELF'];
$page=preg_replace('/href="(.+?)"/','href="http://'.$local.'?href=\\1"',$page);
But I want to exclude all CSS files and JS files from the replacement, so I tried:
$regex='/href="(.+?(?!(\.js|\.css)))"/';
$page=preg_replace($regex,'href="http://'.$local.'?href=\\1"',$page);
but that didn't work. What am I doing wrong? I thought ?! is a negative lookahead.
To answer your regex question, you need a lookbehind there, and you should limit the match with a character class:
$regex = '/href="([^"]+(?<!\.js|\.css))"/';
The charclass first matches the whole link content, then asserts that this didn't end in .js or .css.
You might want to augment the whole match with <a\s[^>]*? even, so it really just finds anything that looks like a link.
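A short sketch of the lookbehind version in action; the value of $local and the sample markup are made up for illustration. The original lookahead fails because it inspects the text after the current position, which is the closing quote, so it always succeeds. The lookbehind instead inspects the characters just consumed.

```php
<?php
$local = 'localhost/index.php'; // hypothetical stand-in for HTTP_HOST . PHP_SELF
$regex = '/href="([^"]+(?<!\.js|\.css))"/';

$page = '<a href="a.html">a</a> <link href="b.css"> <a href="c.js">c</a>';

// Only href values that do not end in .js or .css are rewritten.
$result = preg_replace($regex, 'href="http://' . $local . '?href=$1"', $page);
echo $result, "\n";
```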
Another option would be using domdocument or querypath for such tasks, which is usually tedious and more code, but simpler to add programmatic conditions to:
htmlqp->find("a") FOREACH $a->attr("href", "http:/...".$a->attr("href"))
// would need a real foreach and an if and stuff..
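For comparison, a rough DOMDocument version of the same idea might look like this. It assumes the dom extension is available; the function name rewrite_links is hypothetical, and the rawurlencode call is an addition that the original regex version does not make.

```php
<?php
// Hypothetical helper: point every <a href> at a local proxy URL, skipping .js/.css targets.
function rewrite_links(string $html, string $local): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // real-world HTML often triggers parser warnings
    $doc->loadHTML($html);
    libxml_clear_errors();

    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if ($href === '' || preg_match('~\.(?:js|css)$~i', $href)) {
            continue; // leave script and stylesheet targets untouched
        }
        // Encode the original target so it survives as a query parameter.
        $a->setAttribute('href', 'http://' . $local . '?href=' . rawurlencode($href));
    }
    return $doc->saveHTML();
}

echo rewrite_links('<p><a href="page.html">x</a> <a href="style.css">y</a></p>', 'localhost/index.php');
```

Selecting only a elements also leaves link tags alone, which is where stylesheets are usually referenced.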

Replace anchor text with PHP (and regular expression)

I have a string that contains a lot of links and I would like to adjust them before they are printed to screen:
I have something like the following:
replace_this
and would like to end up with something like this
replace this
Normally I would just use something like:
echo str_replace("_"," ",$url);
In this case I can't do that, as the URL contains underscores, so it breaks my links. The thought was that I could use a regular expression to get around this.
Any ideas?
Here's the regex: <a(.+?)>.+?<\/a>.
What I'm doing is preserving the important dynamic stuff within the anchor tag and replacing the anchor text with the following call:
preg_replace('/<a(.+?)>.+?<\/a>/i',"<a$1>REPLACE</a>",$url);
This will cover most cases, but I suggest you review to make sure that nothing unexpected was missed or changed.
$pattern = "/_(?=[^>]*<)/";
$url = preg_replace($pattern, " ", $url); // replace underscores that are outside of tags with a space
You can use this regular expression
(>(.*?)<\s*/)
along with preg_replace_callback.
EDIT:
$replaced_text = preg_replace_callback('~(>(.*?)<\s*/)~', 'uscore_replace', $text);
function uscore_replace($matches){
    return str_replace('_', ' ', $matches[1]); // $matches[1] is the outer group, which here spans the whole match
}
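Putting the callback idea together, a sketch that keeps the opening tag and its href intact and touches only the anchor text could look like this. The function name humanize_anchor_text and the sample link are hypothetical.

```php
<?php
// Hypothetical helper: replace underscores with spaces in anchor text only.
function humanize_anchor_text(string $html): string
{
    return preg_replace_callback(
        '~(<a\b[^>]*>)(.*?)(</a>)~is',
        function (array $m): string {
            // $m[1] is the opening tag (unchanged), $m[2] the visible text.
            return $m[1] . str_replace('_', ' ', $m[2]) . $m[3];
        },
        $html
    );
}

echo humanize_anchor_text('<a href="/my_page">replace_this</a>'), "\n";
```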

regex to get current page or directory name?

I am trying to get the page or last directory name from a url
For example, if the URL is http://www.example.com/dir/ I want it to return dir, and if the passed URL is http://www.example.com/page.php I want it to return page. Notice I do not want the trailing slash or file extension.
I tried this:
$regex = "/.*\.(com|gov|org|net|mil|edu)/([a-z_\-]+).*/i";
$name = strtolower(preg_replace($regex,"$2",$url));
I ran this regex in PHP and it returned nothing. (however I tested the same regex in ActionScript and it worked!)
So what am I doing wrong here, how do I get what I want?
Thanks!!!
Don't use / as the regex delimiter if it also contains slashes. Try this:
$regex = "#^.*\.(com|gov|org|net|mil|edu)/([a-z_\-]+).*$#i";
You may try to escape the "/" in the middle. Unescaped, it simply closes your regex. So this may work:
$regex = "/.*\.(com|gov|org|net|mil|edu)\/([a-z_\-]+).*/i";
You may also make the regex somewhat more general, but that's another problem.
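To see the delimiter fix working, here is a small sketch that wraps the # version of the pattern in a hypothetical function named page_name. If the URL does not match, preg_replace returns it unchanged, so the function then returns the lowercased URL.

```php
<?php
// Hypothetical helper built around the '#'-delimited pattern.
function page_name(string $url): string
{
    // With '#' as delimiter the '/' inside the pattern needs no escaping.
    $regex = '#^.*\.(com|gov|org|net|mil|edu)/([a-z_\-]+).*$#i';
    return strtolower(preg_replace($regex, '$2', $url));
}

echo page_name('http://www.example.com/dir/'), "\n";
echo page_name('http://www.example.com/page.php'), "\n";
```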
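You can use this
$parts = explode('/', rtrim($url, '/')); // rtrim drops the trailing slash so the last part is not empty
$last = array_pop($parts); // array_pop needs a variable, not a function result
Then apply a simple regex to remove any file extension.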
Assuming you want to match the entire address after the domain portion:
$regex = "%://[^/]+/([^?#]+)%i";
The above assumes a URL of the format extension://domainpart/everythingelse.
Then again, it seems that the problem here isn't that your RegEx isn't powerful enough, just mistyped (closing delimiter in the middle of the string). I'll leave this up for posterity, but I strongly recommend you check out PHP's parse_url() method.
This should adequately deliver:
substr($s = basename($_SERVER['REQUEST_URI']), 0, strrpos($s,'.') ?: strlen($s))
But this is better:
preg_replace('/[#\.\?].*/','',basename($path));
Although, your example is short, so I cannot tell if you want to preserve the entire path or just the last element of it. The preceding example will only preserve the last piece, but this should save the whole path while being generic enough to work with just about anything that can be thrown at you:
preg_replace('~(?:/$|[#\.\?].*)~','',substr(parse_url($path, PHP_URL_PATH),1));
As much as I personally love using regular expressions, more 'crude' (for want of a better word) string functions might be a good alternative for you. The snippet below uses sscanf to parse the path part of the URL for the first bunch of letters.
$url = "http://www.example.com/page.php";
$path = parse_url($url, PHP_URL_PATH);
sscanf($path, '/%[a-z]', $part);
// $part = "page";
This expression:
(?<=^[^:]+://[^.]+(?:\.[^.]+)*/)[^/]*(?=\.[^.]+$|/$)
Gives the following results:
http://www.example.com/dir/ dir
http://www.example.com/foo/dir/ dir
http://www.example.com/page.php page
http://www.example.com/foo/page.php page
Apologies in advance if this is not valid PHP regex - I tested it using RegexBuddy.
Save yourself the regular expression and make PHP's other functions feel more loved.
$url = "http://www.example.com/page.php";
$filename = pathinfo(parse_url($url, PHP_URL_PATH), PATHINFO_FILENAME);
Warning: PATHINFO_FILENAME requires PHP 5.2 or later.
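Wrapped into a function, the parse_url plus pathinfo approach can be checked against the URLs from the question. The function name last_segment_name is hypothetical, and the guard for a missing path is an addition.

```php
<?php
// Hypothetical helper: last path segment without trailing slash or extension.
function last_segment_name(string $url): string
{
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return ''; // no path component, or the URL could not be parsed
    }
    return pathinfo($path, PATHINFO_FILENAME);
}

echo last_segment_name('http://www.example.com/dir/'), "\n";
echo last_segment_name('http://www.example.com/foo/page.php'), "\n";
```

pathinfo ignores the trailing slash of a directory path, so both the directory and the file case go through the same call.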
