Regex to update URLs after a content migration


I recently moved some old content to a new site and changed some URL structures. I need to do a find-and-replace across the entire database to update the old links. This would be easy if I knew regex, but I don't, so I'm hoping it's easy for the SO gurus.
Note: This is PHP regex.
Find:
https://api.floodmagazine.com/{number}/{string}/
Result:
https://api.floodmagazine.com/789/foo-bar/
https://api.floodmagazine.com/12345/foo-bar-1/
Replace with:
https://floodmagazine.com/$1/$2/
Result:
https://floodmagazine.com/789/foo-bar/
https://floodmagazine.com/12345/foo-bar-1/
It's not as easy as searching for the subdomain (api.floodmagazine.com), because some URLs in the DB need to keep that subdomain (images, for example). So the /{number}/{string}/ part is an important way to find only the URLs that need to change.
I just need the regex part, I'm using WP Migrate for the database updating part.
Thanks for the help!

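https:\/\/api\.floodmagazine\.com\/([0-9]+)\/([A-Za-z0-9._+-]+)\/?
That should work. The dots in the host are escaped so they match literally, and the class uses A-Za-z because A-z also covers punctuation such as [ and ^. On regex101 you have to escape /, so I kept that here; that may not be necessary in your tooling.
You can omit the last ? if you don't want the trailing slash to be optional.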
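As a rough illustration of how that pattern behaves with preg_replace(), here is a minimal sketch. The function name migrateUrls and the image path in the sample are assumptions made up for the example, and # is used as the delimiter so the slashes need no escaping.

```php
<?php
// Hypothetical helper: rewrites only URLs shaped like /{number}/{string}/.
function migrateUrls(string $text): string
{
    // '#' as delimiter, so "/" does not have to be escaped.
    $pattern = '#https://api\.floodmagazine\.com/([0-9]+)/([A-Za-z0-9._+-]+)/?#';
    return preg_replace($pattern, 'https://floodmagazine.com/$1/$2/', $text);
}

// Sample data: the image path is an invented example of a URL that must stay.
$sample = 'Post: https://api.floodmagazine.com/789/foo-bar/ '
        . 'Image: https://api.floodmagazine.com/wp-content/uploads/cover.jpg';

echo migrateUrls($sample), PHP_EOL;
```

The image URL is left alone because its first path segment is not purely numeric, which is the property the question relies on.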

This should grab all the URLs you describe:
https://api\.floodmagazine\.com/([0-9]+)/([A-Za-z0-9-]+)/

To avoid URL errors due to WordPress inconsistencies, you can use this PHP code, generated with regex101:
$re = '/https?:\/\/([^\/]+)\/([^\/]+)\/([^\/]+)\/?/m';
$str = 'https://api.floodmagazine.com/789/foo-bar/';
$subst = 'https://floodmagazine.com/$2/$3/';
$result = preg_replace($re, $subst, $str);
This regex captures the domain, the ID, and the post name, so it also handles special cases such as non-HTTPS URLs and special characters, and it returns the result expected in your example. Note that it matches any host with two path segments, so restrict the first group to api\.floodmagazine\.com if other URLs must stay untouched.

Related

Regex not preceded by href="

So I am adding [embed][/embed] around YouTube links in a WordPress environment, because if you use fields other than the normal content editor for content input in the backend, WordPress won't do this automatically (even if you apply the the_content filter).
So, I found this regex, which works perfectly for my application:
$firstalinea = preg_replace('/\s*[a-zA-Z\/\/:\.]*youtu(be.com\/watch\?v=|.be\/)([a-zA-Z0-9\-_]+)([a-zA-Z0-9\/\*\-\_\?\&\;\%\=\.]*)/i', '[embed]https://www.youtube.com/watch?v=$2[/embed]', $firstalinea);
Except for one thing: if someone places a regular link to a YouTube video instead of wanting to embed it, the URL inside the link is replaced as well, and the link no longer works.
Link
So, how to make the regex NOT work, if preceded by href=" ?
Thanks!

Solved it:
$re = '/(?<!href=\")(http:\/\/|https:\/\/)(?:www\.)?youtu(be.com\/watch\?v=|.be\/)([a-zA-Z0-9\-_]+)([a-zA-Z0-9\/\*\-\_\?\&\;\%\=\.]*)/i';
$firstalinea = preg_replace($re, '[embed]https://www.youtube.com/watch?v=$3[/embed]', $firstalinea);
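To see the negative lookbehind at work, here is a small sketch along the same lines. The function name embedYoutube and the video ID are invented for the example, and the groups are simplified to a single capture, so the replacement uses $1 rather than $3.

```php
<?php
// Hypothetical wrapper around a simplified version of the pattern above.
function embedYoutube(string $html): string
{
    // (?<!href=") rejects any URL that directly follows href=".
    $re = '/(?<!href=")https?:\/\/(?:www\.)?youtu(?:be\.com\/watch\?v=|\.be\/)'
        . '([a-zA-Z0-9\-_]+)[a-zA-Z0-9\/*\-_?&;%=.]*/i';
    return preg_replace($re, '[embed]https://www.youtube.com/watch?v=$1[/embed]', $html);
}

// The ID "abc123XYZ_-" is made up for illustration.
$html = 'Watch https://youtu.be/abc123XYZ_- or '
      . '<a href="https://www.youtube.com/watch?v=abc123XYZ_-">Link</a>';

echo embedYoutube($html), PHP_EOL;
```

Only the bare URL is wrapped in [embed] tags; the one inside the href attribute is skipped because the lookbehind fails at that position.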

stripping altered URLs from strings with preg_replace()

As the title says, but the regex I am using has some glitches. I'm not too good with regex, as you can see.
I am trying to remove any web URLs that a user adds to a string.
However, as users are "crafty", they alter the URL slightly so that it does not trigger my removal code. My regex below therefore also matches slightly modified URLs (which is why I am not using a conventional URL regex). I know it will always be possible to trick my removal code, but I would like to make it as hard as possible.
The problem I am having is that if a user writes a sentence ending in a full stop but does not put a space after it, the regex below matches that too. I would like to limit this as much as possible.
For example, all of the below match:
this.matches (I don't want this to match).
mysite.co.xx (I want this to match).
http:// www.mysite.co.xx (I want this to match)
I am trying to limit the characters after the last "." to between 2 and 4, but am struggling to work out how to do this.
The code below is what i am using.
define('REG_URL', '#((https?://|https?://\s)?([-\w]+\.[-\w\.]+)+\w(:\d+)?(/([-\w/_\.]*(\?\S+)?)?)*)#');
public function stripURLs($string){
    try {
        $replacement = "[** website removed **]";
        $string = preg_replace(REG_URL, $replacement, $string);
        return $string;
    }
    catch (Exception $e){
        error_log('checksubmitted.class.php MLE_Check.stripURls - Exception caught: '.$e->getMessage());
        return false;
    }
}
If anyone could point me in the right direction for how to do what I want, I would be very grateful.
If anyone knows of any similar questions on here (I can't find any) or any other site that offers advice on removing "crafty" URLs, I would again be grateful if this could be pointed out to me.

This is my personal preference for validating urls:
_^(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,})))(?::\d{2,5})?(?:/[^\s]*)?$_iuS
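The question itself asks how to limit the part after the last dot to 2 to 4 characters. One possible sketch, separate from the validation pattern above, is to require a final label of 2 to 4 letters followed by a word boundary. The constant and function names here are invented for the example, and short words after an unspaced full stop (such as "end.Next") would still match, so this narrows the problem rather than solving it.

```php
<?php
// Sketch only: the last label must be 2 to 4 letters, then a word boundary.
const CRAFTY_URL = '#(?:https?://\s?)?(?:[-\w]+\.)+[a-z]{2,4}\b(?::\d+)?(?:/[-\w/.]*(?:\?\S+)?)?#i';

// Hypothetical stand-in for the stripURLs() method in the question.
function stripUrlsSketch(string $text): string
{
    return preg_replace(CRAFTY_URL, '[** website removed **]', $text);
}

echo stripUrlsSketch('this.matches (i dont want this to match).'), PHP_EOL;
echo stripUrlsSketch('Visit mysite.co.xx now'), PHP_EOL;
echo stripUrlsSketch('see http:// www.mysite.co.xx'), PHP_EOL;
```

"this.matches" is left alone because "matches" is longer than four letters and no shorter prefix ends on a word boundary.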

Trying to stop regex at a tag

I know there are other posts with a similar name but I've looked through them and they haven't helped me resolve this.
I'm trying to get my head around regex and preg_match. I am going through a body of text and each time a link exists I want it to be extracted. I'm currently using the following:
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
which works fine until it finds one that has <br after it. Then I get the url plus the <br which means it doesn't work correctly. How can I have it so that it stops at the < without including it?
Also, I have been looking everywhere for a clear explanation of using regex and I'm still confused by it. Has anyone any good guides on it for future reference?

\S* is too broad. In particular, I could inject into your code with a URL like:
http://hax.hax/"><script>alert('HAAAAAAAX!');</script>
You should only allow characters that are allowed in URLs:
[-A-Za-z0-9._~:/?#[\]@!$&'()*+,;=]*
Some of these characters are only allowed in specific places (such as ?), so if you want better validation you will need something more clever.

Instead of \S exclude the open tag char from the class:
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/[^<]*)?/";
You might even want to be more restrictive by only allowing characters valid in URLs:
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/[a-zA-Z_\-\.%\?&]*)?/";
(or some more characters)
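To check the idea against the <br case from the question, a short sketch with preg_match_all() follows. The sample text is invented, and the path class here is [^\s<] rather than [^<], an assumption made so that a match also stops at whitespace.

```php
<?php
// Same structure as the patterns above; the path stops at whitespace or "<".
$reg_exUrl = '/(?:https?|ftps?):\/\/[a-zA-Z0-9\-.]+\.[a-zA-Z]{2,3}(?:\/[^\s<]*)?/';

// Invented sample text with a URL immediately followed by a <br /> tag.
$text = 'First http://example.com/page.php<br />second https://example.org/a?b=1 done';

preg_match_all($reg_exUrl, $text, $matches);
print_r($matches[0]);
```

Both URLs come back without the tag or the following word attached.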

You could use the pattern presented here:
http://regex101.com/r/zV1uI7
At the bottom of that page it is explained step by step.

Slugs for SEO using PHP - Appending name to end of URL

Something I have noticed on the StackOverflow website:
If you visit the URL of a question on StackOverflow.com:
"https://stackoverflow.com/questions/10721603"
The website adds the name of the question to the end of the URL, so it turns into:
"https://stackoverflow.com/questions/10721603/grid-background-image-using-imagebrush"
This is great, I understand that this makes the URL more meaningful and is probably good as a technique for SEO.
What I wanted to Achieve after seeing this Implementation on StackOverflow
I wish to implement the same thing with my website. I am happy using a header() 301 redirect in order to achieve this, but I am attempting to come up with a tight script that will do the trick.
My Code so Far
Please see it working by clicking here
// Set the title of the page article (This could be from the database). Trimming any spaces either side
$original_name = trim(' How to get file creation & modification date/times in Python with-dash?');
// Replace any characters that are not A-Za-z0-9 or a dash with a space
$replace_strange_characters = preg_replace('/[^\da-z-]/i', " ", $original_name);
// Replace any spaces (or multiple spaces) with a single dash to make it URL friendly
$replace_spaces = preg_replace("/([ ]{1,})/", "-", $replace_strange_characters);
// Remove any leading or trailing dashes, and any runs of two or more dashes
$removed_dashes = preg_replace("/^([\-]{0,})|([\-]{2,})|([\-]{0,})$/", "", $replace_spaces);
// Show the finished name on the screen
print_r($removed_dashes);
The Problem
I have created this code and it works fine by the looks of things: it makes the string URL friendly and readable to the human eye. However, I would like to see if it is possible to simplify it or tighten it up a bit, as I feel my code is probably overcomplicated.
It is not so much that I want it put onto one line, because I could do that by nesting the functions into one another, but I feel that there might be an overall simpler way of achieving it - I am looking for ideas.
In summary, the code achieves the following:
Removes any "strange" characters and replaces them with a space
Replaces any spaces with a dash to make it URL friendly
Returns a string without any spaces, with words separated with dashes and has no trailing spaces or dashes
String is readable (doesn't contain percent signs and + symbols, unlike simply using urlencode())
Thanks for your help!
Potential Solutions
I found out whilst writing this that I am looking for what is known as a URL 'slug', and that slugs are indeed useful for SEO.
I found this library on Google code which appears to work well in the first instance.
There is also a notable question on this on SO which can be found here, which has other examples.

I tried to play with preg like you did. However it gets more and more complicated when you start looking at foreign languages.
What I ended up doing was simply trimming the title, and using urlencode
$url_slug = urlencode($title);
Also I had to add those:
$title = str_replace('/','',$title); //Apache doesn't like this character even encoded
$title = str_replace('\\','',$title); //Apache doesn't like this character even encoded
There are also 3rd party libraries such as: http://cubiq.org/the-perfect-php-clean-url-generator

Indeed, you can do that:
$original_name = ' How to get file creation & modification date/times in Python with-dash?';
$result = preg_replace('~[^a-z0-9]++~i', '-', $original_name);
$result = trim($result, '-');
To deal with other alphabets you can use this pattern instead:
~\P{Xan}++~u
or
~[^\pL\pN]++~u
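Wrapped into a function, the Unicode-aware variant might look like the sketch below. The name slugify is an assumption, as is the fallback to an empty string when preg_replace() returns null (which it does for invalid UTF-8 input).

```php
<?php
// Hypothetical helper built on the ~[^\pL\pN]++~u pattern from above.
function slugify(string $title): string
{
    // Collapse every run of non-letter, non-digit characters into one dash.
    $slug = preg_replace('~[^\pL\pN]++~u', '-', $title) ?? '';
    // Drop the dashes left over from leading or trailing punctuation.
    return trim($slug, '-');
}

echo slugify(' How to get file creation & modification date/times in Python with-dash?'), PHP_EOL;
echo slugify('Crème brûlée à la carte'), PHP_EOL;
```

Accented letters survive because \pL matches any Unicode letter; case is left unchanged here, so lowercasing would be a separate step if the slug should be all lowercase.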

Extract URL from string

I'm trying to find a reliable solution to extract a url from a string of characters. I have a site where users answer questions and in the source box, where they enter their source of information, I allow them to enter a url. I want to extract that url and make it a hyperlink. Similar to how Yahoo Answers does it.
Does anyone know a reliable solution that can do this?
All the solutions I have found work for some URLs but not for others.
Thanks

John Gruber has spent a fair amount of time perfecting the "one regex to rule them all" for link detection. Using preg_replace() as mentioned in the other answers, using the following regex should be one of the most accurate, if not the most accurate, method for detecting a link:
(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
If you only wanted to match HTTP/HTTPS:
(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))

$string = preg_replace('/https?:\/\/[^\s"<>]+/', '<a href="$0">$0</a>', $string);
It only matches http/https, but that's really the only protocol you want to turn into a link. If you want others, you can change it like this:
$string = preg_replace('/(https?|ssh|ftp):\/\/[^\s"]+/', '<a href="$0">$0</a>', $string);

There are a lot of edge cases with URLs: a URL could contain brackets, or lack a protocol, and so on. That's why a regex alone is not enough.
I created a PHP library that deals with lots of edge cases: Url highlight.
You can extract URLs from a string or highlight them directly.
Example:
<?php
use VStelmakh\UrlHighlight\UrlHighlight;
$urlHighlight = new UrlHighlight();
// Extract urls
$urlHighlight->getUrls("This is example http://example.com.");
// return: ['http://example.com']
// Make urls as hyperlinks
$urlHighlight->highlightUrls('Hello, http://example.com.');
// return: 'Hello, <a href="http://example.com">http://example.com</a>.'
For more details see readme. For covered url cases see test.

Yahoo! Answers does a fairly good job of link identification when the link is written properly and separate from other text, but it isn't very good at separating trailing punctuation. For example, "The links are http://example.com/somepage.php, http://example.com/somepage2.php, and http://example.com/somepage3.php." will include commas on the first two and a period on the third.
But if that is acceptable, then patterns like this should do it:
\<http:[^ ]+\>
It looks like Stack Overflow's parser is better. Is it open source?
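The trailing-punctuation problem described above can be reduced by requiring the last character of a match not to be punctuation. The following is only a sketch under that assumption; the exact set of excluded closing characters is a guess, and a URL that genuinely ends in one of them would be cut short.

```php
<?php
// The sentence from the example above.
$text = 'The links are http://example.com/somepage.php, '
      . 'http://example.com/somepage2.php, and http://example.com/somepage3.php.';

// Greedy body, then one final character that is not whitespace,
// a quote, an angle bracket, or common trailing punctuation.
$pattern = '~https?://[^\s"<>]+[^\s"<>.,;:!?)]~';

preg_match_all($pattern, $text, $matches);
print_r($matches[0]);
```

The regex engine backtracks off the comma or period, so each match ends at the "p" of ".php".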

This code worked for me.
function makeLink($string){
    /*** make sure there is an http:// on all URLs ***/
    $string = preg_replace('/([^\w\/])(www\.[a-z0-9\-]+\.[a-z0-9\-]+)/i', '$1http://$2', $string);
    /*** make all URLs links (the hyphen is escaped so PCRE2 does not read \w-? as a range) ***/
    $string = preg_replace('/([\w]+:\/\/[\w\-?&;#~=\.\/\@]+[\w\/])/i', '<a target="_blank" href="$1">$1</a>', $string);
    /*** make all emails hot links ***/
    $string = preg_replace('/([\w\-?&;#~=\.\/]+\@(\[?)[a-zA-Z0-9\-\.]+\.([a-zA-Z]{2,3}|[0-9]{1,3})(\]?))/i', '<a href="mailto:$1">$1</a>', $string);
    return $string;
}

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.
