Replace text in a url using PHP


I have links like these:
https://dog.example.com/randomgenerated45443444444444
https://turtle.example.com/randomgenerated45443
https://mice.example.com/randomgenerated452
https://monkey.example.com/randomgenerated43232323
https://leopard.example.com/randomgenerated22222222222222222
I was wondering if it is possible to detect the word between https:// and .example.com/, which is the random animal name, and replace it with "thumbnail". The lengths of the animal names and of the randomgenerated parts always vary.

You can use lookaheads to match everything up to the first dot and replace it. The match includes the https:// prefix, so the protocol is added back afterwards:
$string = 'https://leopard.example.com/randomgenerated22222222222222222';
$pattern = '/(?=.*\/\/)(.*?)(?=\.)/';
$replacement = 'thumbnail';
$foo = preg_replace($pattern, $replacement, $string);
$protocol = 'https://';
echo $protocol . $foo;
This prints:
https://thumbnail.example.com/randomgenerated22222222222222222
Explanation of the regex:
Positive Lookahead (?=.*\/\/)
Assert that the Regex below matches
.* matches any character (except for line terminators)
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\/ matches the character / literally (case sensitive)
\/ matches the character / literally (case sensitive)
1st Capturing Group (.*?)
.*? matches any character (except for line terminators)
*? Quantifier — Matches between zero and unlimited times, as few times as possible, expanding as needed (lazy)
Positive Lookahead (?=\.)
Assert that the Regex below matches
\. matches the character . literally (case sensitive)

Assuming that https:// and example.com never change, then this is the simplest regex you can use for the purpose:
https://(.+)\.example\.com
Whatever (.+) captures is the word you are trying to extract.
Edit on 2016.10.27:
While the / character has no special meaning in Regular Expressions, it will likely need to be escaped (\/) if you are also using it as your expression delimiter. So the above will look like:
https:\/\/(.+)\.example\.com
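As a minimal sketch of how this pattern could be used with preg_replace: the function name replaceSubdomain is hypothetical, and ~ is used as the delimiter (an assumption, so that / does not need escaping).

```php
<?php
// Hypothetical helper: swaps the subdomain of an example.com URL for "thumbnail".
// Uses the pattern from the answer above with "~" as the delimiter.
function replaceSubdomain(string $url): string
{
    return preg_replace(
        '~https://(.+)\.example\.com~',
        'https://thumbnail.example.com',
        $url
    );
}

echo replaceSubdomain('https://dog.example.com/randomgenerated45443444444444'), "\n";
```

Because the whole prefix is matched and rewritten in one step, the protocol does not have to be re-added by hand.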

Related

Laravel validate url name and protocol

I need to validate a URL. I need to allow only the main URL of a site, for example:
http://example.com
https://example.com
I need to reject URLs like these on my site:
http://example.com/page/blahblahblah
https://example.com/other/bloa
I use regex:
'url' => ['required', 'url', 'regex:/((http:|https:)\/\/)[^\/]+/']
When a user submits a URL, they can still enter http://example.com/page/blahblahblah. Why? My regex is not working: validation passes.

You can use the following pattern to ensure a URL does not contain subdirectories:
^(?:\S+:\/\/)?[^\/]+\/?$
Explanation:
^ asserts position at start of the string
Non-capturing group (?:\S+://)?
? Quantifier — Matches between zero and one times, as many times as possible, giving back as needed (greedy)
\S+ matches any non-whitespace character (equal to [^\r\n\t\f\v ])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
: matches the character : literally (case sensitive)
/ matches the character / literally (case sensitive)
/ matches the character / literally (case sensitive)
Match a single character not present in the list below [^/]+
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
/ matches the character / literally (case sensitive)
/? matches the character / literally (case sensitive)
? Quantifier — Matches between zero and one times, as many times as possible, giving back as needed (greedy)
$ asserts position at the end of the string, or before the line terminator right at the end of the string (if any)
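The behaviour of this pattern can be checked outside Laravel with plain preg_match. This is only a sketch: the function name isRootUrl is hypothetical, and ~ is used as the delimiter so the slashes need no escaping.

```php
<?php
// Hypothetical helper: true when the URL has no path beyond an optional trailing slash.
// Same pattern as above, written with "~" as the delimiter.
function isRootUrl(string $url): bool
{
    return preg_match('~^(?:\S+://)?[^/]+/?$~', $url) === 1;
}

var_dump(isRootUrl('http://example.com'));                   // bool(true)
var_dump(isRootUrl('https://example.com/'));                 // bool(true)
var_dump(isRootUrl('http://example.com/page/blahblahblah')); // bool(false)
```

The anchors ^ and $ are what make the difference from the regex in the question: without them, the pattern only has to match some part of the input.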

You could write a custom validator that combines filter_var and parse_url.
Something like the following will do the job:
<?php
$url = "http://example.com/page/blahblahblah";
// Reject anything that is not a syntactically valid URL.
if (!filter_var($url, FILTER_VALIDATE_URL)) {
    return false;
}
// Split the URL into its components and keep only the scheme and host.
$parts = parse_url($url);
echo "{$parts['scheme']}://{$parts['host']}"; // prints http://example.com
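The snippet above reduces a URL to its scheme and host. If the goal is instead to reject URLs that carry a path, the same two functions can be turned into a check. The following is a sketch under assumptions: the function name isSiteRootUrl is hypothetical, and treating query strings and fragments as invalid is a choice not stated in the question.

```php
<?php
// Hypothetical validator: accepts only http(s) URLs without a path, query, or fragment.
function isSiteRootUrl(string $url): bool
{
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return false;
    }
    if (!in_array(strtolower($parts['scheme']), ['http', 'https'], true)) {
        return false;
    }
    // An absent path or a single "/" both count as the site root.
    $path = $parts['path'] ?? '';
    return ($path === '' || $path === '/')
        && !isset($parts['query'])
        && !isset($parts['fragment']);
}

var_dump(isSiteRootUrl('https://example.com'));            // bool(true)
var_dump(isSiteRootUrl('https://example.com/other/bloa')); // bool(false)
```

In Laravel, a function like this could be called from a custom rule or a closure rule.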

How to use preg_replace to remove excessive single spaces

We are extracting text from PDF files, and a high proportion of the results contain malformed text. Specifically, spaces are inserted between the characters of a word, e.g. SEATTLE is returned as S E A T T L E.
Is there a regular expression for preg_replace that can remove the spaces from a run of n single-character "words"? Specifically, remove the spaces from any run of more than 3 single alpha characters separated by spaces?
I've googled this for a while, but can't even imagine how to construct the expression. As expressed in a comment, I don't want ALL spaces removed, only those in a run of more than 3 single alpha characters, e.g. Welcome to the Greater S E A T T L E area should become Welcome to the Greater SEATTLE area. The result is to be used in full-text searching, so case sensitivity is not a concern.

You may use a simple approach with a preg_replace_callback. Match '~\b[A-Za-z](?: [A-Za-z]){2,}\b~' and str_replace spaces in the anonymous function:
$regex = '~\b[A-Za-z](?: [A-Za-z]){2,}\b~';
$result = preg_replace_callback($regex, function($m) {
return str_replace(" ", "", $m[0]);
}, $s);
To only match sequences of uppercase letters, remove a-z from the pattern:
$regex = '~\b[A-Z](?: [A-Z]){2,}\b~';
Also, there may be soft/hard spaces, tabs, or other kinds of whitespace. In that case, use \h together with the u modifier:
$regex = '~\b[A-Za-z](?:\h[A-Za-z]){2,}\b~u';
Finally, to match any Unicode letter, use \p{L} (to only match uppercase ones, \p{Lu}) instead of [a-zA-Z]:
$regex = '~\b\p{L}(?:\h\p{L}){2,}\b~u';
NOTE: It will most probably fail to work in some cases, e.g. when there are one-letter words. You will have to handle those cases separately/manually. Anyway, there is no safe regex-only way to fix OCR issues.
Pattern details
\b - a word boundary
[A-Za-z] - a single letter
(?: [A-Za-z]){2,} - 2 or more occurrences of a space (\h matches any kind of horizontal whitespace) followed by a single letter
\b - a word boundary
When using the u modifier, \h becomes Unicode-aware.
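Put together as a runnable sketch, using the sentence from the question as input (the function name collapseSpacedLetters is hypothetical):

```php
<?php
// Hypothetical helper: removes the spaces inside runs of 3+ space-separated single letters.
function collapseSpacedLetters(string $s): string
{
    $regex = '~\b[A-Za-z](?: [A-Za-z]){2,}\b~';
    return preg_replace_callback($regex, function ($m) {
        // $m[0] is the whole run, e.g. "S E A T T L E"
        return str_replace(' ', '', $m[0]);
    }, $s);
}

echo collapseSpacedLetters('Welcome to the Greater S E A T T L E area'), "\n";
// prints: Welcome to the Greater SEATTLE area
```

The trailing \b is what stops the run from swallowing the first letter of the following word ("area").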

You could do this in one go:
(?i:(?<!\S)([a-z]) +((?1))|\G(?!\A) +((?1))\b)
Explanation:
(?i: # Start of non-capturing group with case-insensitive modifier on
(?<!\S) # Negative lookbehind to ensure there is no leading non-whitespace character
([a-z]) + # Capture one letter and at least one space
((?1)) # Capture one letter in 2nd capturing group
| # Or
\G(?!\A) + # Start match from where previous match ends
# with matching spaces
((?1))\b # Match a letter at word boundary
) # End of non-capturing group
PHP code:
$str = preg_replace('~(?i:(?<!\S)([a-z]) +((?1))|\G(?!\A) +((?1))\b)~', '$1$2$3', $str);

You may use this pure regex approach with lookarounds and \G:
$re = '~\b(?:(?=(?:\pL\h+){3}\pL\b)|(?<!^)\G)(\pL)\h+(?=\pL\b)~';
$repl = preg_replace($re, '$1', $str);
RegEx Details:
\b: Match word boundary
(?:: Start non-capture group
(?=(?:\pL\h+){3}\pL\b): Lookahead to assert we have 3+ single letters separated by 1+ spaces
|: OR
(?<!^)\G: \G asserts position at the end of the previous match. (?<!^) ensures we don't match start of the string for the first match
): End non-capture group
(\pL): Match a single letter and capture it
\h+: Followed by 1+ horizontal whitespace
(?=\pL\b): Assert that we only have a single letter ahead
In the replacement we use $1, which is the captured letter to the left of the whitespace, so only the whitespace is dropped.
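A short sketch that wraps this pattern in a function and runs it on the sentence from the question (the function name joinSpacedLetters is hypothetical):

```php
<?php
// Hypothetical helper around the \G-based pattern: runs of 4+ single letters are joined.
function joinSpacedLetters(string $str): string
{
    $re = '~\b(?:(?=(?:\pL\h+){3}\pL\b)|(?<!^)\G)(\pL)\h+(?=\pL\b)~';
    return preg_replace($re, '$1', $str);
}

echo joinSpacedLetters('Welcome to the Greater S E A T T L E area'), "\n";
// prints: Welcome to the Greater SEATTLE area
```

Each match removes the whitespace after one letter; \G lets the last few letters of a run keep matching even though fewer than four letters remain ahead of them.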

Regular expression to fix my long string that only partially repeats format

I have this string that I want to clean up using PHP and regex:
Name/__text,Password/__text,Profile/__text,Locale/__text,UserType/__text,Passwor
dUpdateDate/__text,Columns/0/Name/__text,Columns/0/Label/__text,Columns/0/Order/
__text,Columns/1/Name/__text,Columns/1/Label/__text,Columns/1/Order/__text,Colum
ns/2/Name/__text,Columns/2/Label/__text,Columns/2/Order/__text,Columns/3/Name/__
text,Columns/3/Label/__text,Columns/3/Order/__text,Columns/4/Name/__text,Columns
/4/Label/__text,Columns/4/Order/__text,Columns/5/Name/__text,Columns/5/Label/__t
ext,Columns/5/Order/__text,Columns/6/Name/__text,Columns/6/Label/__text,Columns/
6/Order/__text,Columns/7/Name/__text,Columns/7/Label/__text,Columns/7/Order/__te
xt,Columns/8/Name/__text,Columns/8/Label/__text,Columns/8/Order/__text,Columns/9
/Name/__text,Columns/9/Label/__text,Columns/9/Order/__text,Columns/10/Name/__tex
t,Columns/10/Label/__text,Columns/10/Order/__text,Columns/11/Name/__text,Columns
/11/Label/__text,Columns/11/Order/__text,Columns/12/Name/__text,Columns/12/Label
/__text,Columns/12/Order/__text,Columns/13/Name/__text,Columns/13/Label/__text,C
olumns/13/Order/__text,MailAddress/__text,Description/__text,Columns/14/Name/__t
ext,Columns/14/Label/__text,Columns/14/Order/__text,Columns/15/Name/__text,Colum
ns/15/Label/__text,Columns/15/Order/__text
I want it to be Password,Profile,Locale,UserType,PasswordUpdateDate,Name,Label,Order...
I'm removing the /text or /__text after the word, but only some items have a prefix like Columns/0/ before the word that also needs to be removed.
I tried the regular expression below in a regex tester, but it misses the first few items, which don't have the Columns/2/ kind of prefix. I can't use a regex that requires a / before the word preceding /__text, because that / is optional, as for the first Name. Any ideas how to do this? It's tough to search for this pattern or for information on how to create it. Any help would be great!
[A-Za-z\/0-9]+\/([A-Za-z]+)\/[__text]

It is probably easier to just match what you want and then join the matches with commas. Match a word (\w+) followed by /__text:
preg_match_all('#(\w+)/__text#', $string, $matches);
$result = implode(',', $matches[1]);
Instead of (\w+) you could also use a character class such as ([A-Za-z0-9]+) and add any other characters you need to it, in case the names could be First_Name, First-Name, Firstname0, etc.
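Here is the same idea as a self-contained sketch on a shortened sample of the input. The function name extractFieldNames and the sample string are illustrative assumptions, and the real string is assumed to contain no line breaks inside names.

```php
<?php
// Hypothetical helper: collects every word that directly precedes "/__text".
function extractFieldNames(string $string): string
{
    preg_match_all('#(\w+)/__text#', $string, $matches);
    // $matches[1] holds the contents of the first capturing group for each match.
    return implode(',', $matches[1]);
}

$sample = 'Name/__text,Password/__text,Columns/0/Name/__text,Columns/0/Label/__text';
echo extractFieldNames($sample), "\n";
// prints: Name,Password,Name,Label
```

Matching only the wanted part avoids having to describe the optional Columns/N/ prefix at all.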

Regex:
(\w+)\/__text(?:(,)(?:Columns\/\d+\/)*)*
Explanation:
/(\w+)\/__text(?:(,)(?:Columns\/\d+\/)*)*/g
1st Capturing Group (\w+)
\w+ matches any word character (equal to [a-zA-Z0-9_])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
\/ matches the character / literally (case sensitive)
__text matches the characters __text literally (case sensitive)
Non-capturing group (?:(,)(?:Columns\/\d+\/)*)*
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
2nd Capturing Group (,)
, matches the character , literally (case sensitive)
Non-capturing group (?:Columns\/\d+\/)*
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
Columns matches the characters Columns literally (case sensitive)
\/ matches the character / literally (case sensitive)
\d+ matches a digit (equal to [0-9])
\/ matches the character / literally (case sensitive)
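The answer does not state a replacement string. One plausible way to use the pattern, assuming a replacement of $1$2 (the captured word plus the comma when one was matched), is sketched below; the function name cleanFieldList and the sample string are hypothetical.

```php
<?php
// Hypothetical helper: keeps the word and the comma, drops "/__text" and any "Columns/N/" prefix
// that follows the comma. The '$1$2' replacement is an assumption, not given in the answer.
function cleanFieldList(string $string): string
{
    $regex = '/(\w+)\/__text(?:(,)(?:Columns\/\d+\/)*)*/';
    return preg_replace($regex, '$1$2', $string);
}

$sample = 'Name/__text,Password/__text,Columns/0/Name/__text,Columns/0/Label/__text';
echo cleanFieldList($sample), "\n";
// prints: Name,Password,Name,Label
```

Note that a Columns/N/ prefix at the very start of the string would not be removed by this pattern, since it only consumes prefixes that follow a comma.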

Regular Expression (preg_match)

This is the code that is not working:
<?php
$matchWith = " http://videosite.com/ID123 ";
preg_match_all('/\S\/videosite\.com\/(\w+)\S/i', $matchWith, $matches);
foreach($matches[1] as $value)
{
print 'Hyperlink';
}
?>
What I want is for the link not to be displayed if there is whitespace before or after it.
So in this case it should display nothing, but it still displays the link.

Your pattern still matches because it is not anchored: the first \S matches the / of http:/, which is not a space, and the final \S matches the 3, so the group captures ID12. You can try:
preg_match_all('/^\S*\/videosite\.com\/(\w+)\S*$/i', $matchWith, $matches);

So, you don't want it to match if there is whitespace. Something like this should work (untested):
preg_match_all('/^\S+?videosite\.com\/(\w+)\S+?$/i', $matchWith, $matches);

You can try this. It works:
if (preg_match('%^\S*?/videosite\.com/(\w+)(?!\S+)$%i', $subject, $regs)) {
#$result = $regs[0];
}
But I am positive that after I post this, you will update your question :)
Explanation:
"
^ # Assert position at the beginning of the string
\S # Match a single character that is a “non-whitespace character”
*? # Between zero and unlimited times, as few times as possible, expanding as needed (lazy)
\/ # Match the character “/” literally
videosite # Match the characters “videosite” literally
\. # Match the character “.” literally
com # Match the characters “com” literally
\/ # Match the character “/” literally
( # Match the regular expression below and capture its match into backreference number 1
\w # Match a single character that is a “word character” (letters, digits, etc.)
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
(?! # Assert that it is impossible to match the regex below starting at this position (negative lookahead)
\S # Match a single character that is a “non-whitespace character”
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
\$ # Assert position at the end of the string (or before the line break at the end of the string, if any)
"

It would probably be simpler to use this regex:
'/^http:\/\/videosite\.com\/(\w+)$/i'
I believe you are referring to the white space before http, and the white space after the directory. So, you should use the ^ character to indicate that the string must start with http, and use the $ character at the end to indicate that the string must end with a word character.
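A minimal sketch of this anchored pattern in use, with and without surrounding whitespace (the function name videoIdIfClean is hypothetical):

```php
<?php
// Hypothetical helper: returns the video ID only when the whole string is the URL,
// with no leading or trailing whitespace; otherwise returns null.
function videoIdIfClean(string $input): ?string
{
    if (preg_match('/^http:\/\/videosite\.com\/(\w+)$/i', $input, $m) === 1) {
        return $m[1];
    }
    return null;
}

var_dump(videoIdIfClean(' http://videosite.com/ID123 ')); // NULL
var_dump(videoIdIfClean('http://videosite.com/ID123'));   // string(5) "ID123"
```

One caveat: in PCRE, $ also matches just before a single trailing newline, so a stricter check could use \z in place of $.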

What does this Regex pattern mean: '/&\w;/'

Can someone explain what this function
preg_replace('/&\w;/', '', $buf)
does? I have looked at various tutorials and found that it replaces whatever matches the pattern /&\w;/ with the empty string ''. But I can't understand the pattern /&\w;/. What does it represent?
Similarly in
preg_match_all("/(\b[\w+]+\b)/", $buf, $words)
I can't understand what the string "/(\b[\w+]+\b)/" represents.
Please help. Thanks in advance :)

The explanation of your first expression is simple, it is:
& # Match the character “&” literally
\w # Match a single character that is a “word character” (letters, digits, and underscores)
; # Match the character “;” literally
The second one is:
( # Match the regular expression below and capture its match into backreference number 1
\b # Assert position at a word boundary
[\w+] # Match a single character present in the list below
# A word character (letters, digits, and underscores)
# The character “+”
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
\b # Assert position at a word boundary
)
The preg_replace function makes use of regular expressions. Regular expressions allow you to find patterns in text in a really powerful way.
To be able to use functions like preg_replace or preg_match I recommend you to take a look first at how regular expressions work.
You can gather a lot of info on this site http://www.regular-expressions.info/
And you can use software tools to help you understand the regex (like RegexBuddy)

In regular expressions, \w stands for any "word" character. That is: a-z, A-Z, 0-9 and underscore. \b stands for "word boundary", that is the beginning and end of a word (a series of word characters).
So, /&\w;/ is a regular expression that matches the & sign, followed by exactly one word character, followed by a ;. For example, &a; would match, and preg_replace would replace it with an empty string, but &foobar; would not match because \w has no quantifier; matching that would need \w+.
In the same manner, /(\b[\w+]+\b)/ matches a word boundary, followed by one or more characters that are either word characters or a literal + (inside the brackets, + is not a quantifier), followed by another word boundary. Each match is captured by the parentheses, so this regular expression essentially returns the words in a string as an array.
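Both patterns can be tried on small inputs to see this behaviour. The sample strings and the function names stripShortEntities and extractWords are made up for illustration.

```php
<?php
// Hypothetical wrappers around the two calls from the question.
function stripShortEntities(string $buf): string
{
    // Removes "&", one word character, ";" (e.g. "&a;"), but not "&amp;".
    return preg_replace('/&\w;/', '', $buf);
}

function extractWords(string $buf): array
{
    preg_match_all("/(\b[\w+]+\b)/", $buf, $words);
    return $words[1]; // contents of the capturing group for each match
}

var_dump(stripShortEntities('x &a; y &amp; z')); // "x  y &amp; z"
var_dump(extractWords('foo bar+baz'));           // ["foo", "bar+baz"]
```

The second result shows that a + between word characters is kept as part of the "word", because + is a member of the character class.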
