Extract specific part of URL from string

I need to extract only part of a URL with PHP, but I am struggling to set the point where the extraction should stop. I used a regex to extract the entire URL from a longer string like this:
$regex = '/\b(https?|ftp|file):\/\/[-A-Z0-9+&##\/%?=~_|$!:,.;]*[A-Z0-9+&##\/%=~_|$]/i';
preg_match_all($regex, $href, $matches);
The result is the following string:
http://www.cambridgeenglish.org/test-your-english/&amp;sa=U&amp;ei=a4rbU8agB-zY0QWS_IGYDw&amp;ved=0CFEQFjAL&amp;usg=AFQjCNGU4FMUPB2ZuVM45OoqQ39rJbfveg
Now I want to extract only this bit: http://www.cambridgeenglish.org/test-your-english/. I basically need to get rid of everything from &amp onwards.
Does anyone have an idea how to achieve this? Do I need to run another regex, or can I add it to the initial one?

I would suggest you abandon regex and let PHP's own parse_url function do this for you:
http://php.net/manual/en/function.parse-url.php
$parsed = parse_url($url);
// parse_url() returns the host under the key 'host'
$my_url = $parsed['scheme'] . '://' . $parsed['host'] . $parsed['path'];
To get the substring of the path up to the &amp, try:
$parsed = parse_url($url);
$my_url = $parsed['scheme'] . '://' . $parsed['host'] . substr($parsed['path'], 0, strpos($parsed['path'], '&amp'));
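If the only goal is to cut the string at the first &amp, a plain string function may be enough. The following is a minimal sketch, not part of the original answer; the helper name stripFromMarker is hypothetical, and it assumes the marker appears literally in the input. When the marker is missing, the input is returned unchanged.

```php
<?php
// Hypothetical helper: returns everything before the first occurrence of $marker.
function stripFromMarker(string $url, string $marker = '&amp'): string
{
    // With the third argument set to true, strstr() returns the part before the needle,
    // or false when the needle is not found.
    $head = strstr($url, $marker, true);
    return $head === false ? $url : $head;
}

$url = 'http://www.cambridgeenglish.org/test-your-english/&amp;sa=U&amp;ei=abc';
echo stripFromMarker($url), "\n"; // http://www.cambridgeenglish.org/test-your-english/
```

Unlike the strpos() variant, this does not return an empty string when the marker is absent.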

The regex below gets rid of everything from the string &amp onwards. Your PHP code would be:
<?php
echo preg_replace('~&amp.*$~', '', 'http://www.cambridgeenglish.org/test-your-english/&amp;sa=U&amp;ei=a4rbU8agB-zY0QWS_IGYDw&amp;ved=0CFEQFjAL&amp;usg=AFQjCNGU4FMUPB2ZuVM45OoqQ39rJbfveg');
// => http://www.cambridgeenglish.org/test-your-english/
?>
Explanation:
&amp Matches the string &amp.
.* Matches any character zero or more times.
$ End of the line.

Related

Remove characters from beginning and end string

I want to output only MYID from the URL. What I did so far:
$url = "https://whatever.expamle.com/display/MYID?out=1234567890?Browser=0?OS=1";
echo substr($url, 0, strpos($url, "?out="));
output: https://whatever.expamle.com/display/MYID
$url = preg_replace('#^https?://whatever.expamle.com/display/#', '', $url);
echo $url;
output: MYID?out=1234567890?Browser=0?OS=1
How can I combine this? Thanks.

For a more general solution, we can use regex with preg_match_all:
$url = "https://whatever.expamle.com/display/MYID?out=1234567890?Browser=0?OS=1";
preg_match_all("/\/([^\/]+?)\?/", $url, $matches);
print_r($matches[1][0]); // MYID

When the string is always a Uniform Resource Locator (URL), like you present it in your question,
given the following string:
$url = "https://whatever.expamle.com/display/MYID?out=1234567890?Browser=0?OS=1";
you can benefit from parsing it first:
$parts = parse_url($url);
and then making use of the fact that MYID is the last path component:
$str = preg_replace(
'~^.*/(?=[^/]*$)~' /* everything but the last path component */,
'',
$parts['path']
);
echo $str, "\n"; # MYID
and then depending on your needs, you can combine with any of the other parts, for example just the last path component with the query string:
echo "$str?$parts[query]", "\n"; # MYID?out=1234567890?Browser=0?OS=1
The point is: if the string already represents structured data, use a dedicated parser to divide it into smaller pieces. It is then easier to arrive at the results you're looking for.
If you're on Linux/Unix, it is even easier and works without a regular expression, as the basename() function then returns the path's last component (on Windows it can behave differently, because backslashes are treated as separators as well):
echo basename(parse_url($url, PHP_URL_PATH)),
'?',
parse_url($url, PHP_URL_QUERY),
"\n"
;
https://php.net/parse_url
https://php.net/preg_replace
https://www.php.net/manual/en/regexp.reference.assertions.php

Using preg_replace on url variables

I have some very long URL variables. Here is one example.
http://localhost/index.php?image=XYZ_1555025022.jpg&mppdf=yes&pdfname=Printer&deskew=yes&autocrop=yes&print=no&mode=color&printscalewidth100=&printscaleheight100=&rand=56039
Ultimately it would be nice if I could find a way to use preg_replace to change a single variable, even one in the middle of the string; for instance, to change print=no to print=yes in the string above.
I will, however, settle for a preg_replace pattern that allows me to delete ?image=XYZ_1555025022.jpg. As this is a variable, the file name could be anything. It will always have "?image" at the start and end with "&".
I think one of the problems I have run into is that preg_match seems to have issues with strings that contain "=".
I am completely lost here, and all those characters make my head spin. Maybe someone can give some guidance, please?

Here's a demo of how you can do some of the things you want using explode, parse_str and http_build_query:
$url = 'http://localhost/index.php?image=XYZ_1555025022.jpg&mppdf=yes&pdfname=Printer&deskew=yes&autocrop=yes&print=no&mode=color&printscalewidth100=&printscaleheight100=&rand=56039';
// split on first ?
list($path, $query_string) = explode('?', $url, 2);
// parse the query string
parse_str($query_string, $params);
// delete image param
unset($params['image']);
// change the print param
$params['print'] = 'yes';
// rebuild the query
$query_string = http_build_query($params);
// reassemble the URL
$url = $path . '?' . $query_string;
echo $url;
Output:
http://localhost/index.php?mppdf=yes&pdfname=Printer&deskew=yes&autocrop=yes&print=yes&mode=color&printscalewidth100=&printscaleheight100=&rand=56039

You can use str_replace() or preg_replace() to get the job done, but parse_url() with parse_str() gives you more control, because you can modify any parameter by its array key. Finally, use http_build_query() to build the final URL after the modification.
<?php
$url = 'http://localhost/index.php?image=XYZ_1555025022.jpg&mppdf=yes&pdfname=Printer&deskew=yes&autocrop=yes&print=no&mode=color&printscalewidth100=&printscaleheight100=&rand=56039';
$parts = parse_url($url);
parse_str($parts['query'], $query);
echo "BEFORE".PHP_EOL;
print_r($query);
$query['print'] = 'yes';
echo "AFTER".PHP_EOL;
print_r($query);
?>
DEMO: https://3v4l.org/npGij
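The snippet above stops after printing the arrays, so here is a rough sketch of the final rebuild step with http_build_query(). The helper name withQueryChanges is hypothetical, and the sketch assumes an absolute URL with a scheme and a host; port, user info and fragment are ignored for brevity.

```php
<?php
// Hypothetical helper: returns $url with some query parameters removed and others set.
function withQueryChanges(string $url, array $set = [], array $remove = []): string
{
    $parts = parse_url($url);
    // Turn the query string into an associative array (empty if there is no query).
    parse_str($parts['query'] ?? '', $query);

    foreach ($remove as $name) {
        unset($query[$name]);
    }
    foreach ($set as $name => $value) {
        $query[$name] = $value;
    }

    $base = $parts['scheme'] . '://' . $parts['host'] . ($parts['path'] ?? '');
    return $query === [] ? $base : $base . '?' . http_build_query($query);
}

$url = 'http://localhost/index.php?image=XYZ_1555025022.jpg&print=no&mode=color';
echo withQueryChanges($url, ['print' => 'yes'], ['image']), "\n";
// http://localhost/index.php?print=yes&mode=color
```

Note that http_build_query() re-encodes the values, so the rebuilt query can differ in encoding from the original even when the parameters are the same.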

Parsing the last substring of the url

I want to parse the string after the last "/" .
For example:
http://127.0.0.1/~dtm/index.php/en/parts/engine
Parse the "engine" .
I tried to do it with a regexp, but as I'm new to regexps I'm stuck close to the solution.
Also, this pattern seems easily breakable (/engine/ will break it). I need to make it a bit more stable somehow.
$pattern = ' \/(.+^[^\/]?) ' ;
/ Match the / char
.+ Match any char one or more times
^[^\/]? Exclude the / char

You don't need a regex, don't make it complicated just use this:
<?php
$url = "http://127.0.0.1/~dtm/index.php/en/parts/engine";
echo basename($url);
?>
Output:
engine

I recommend using faster functions instead of preg_match for this,
e.g. basename()
$url = "http://127.0.0.1/~dtm/index.php/en/parts/engine";
echo basename($url);
or explode()
$parts = explode('/',$url);
echo array_pop($parts);

You can also use parse_url(), explode() and array_pop() together to achieve your goal.
<?php
$url = 'http://127.0.0.1/~dtm/index.php/en/parts/engine';
$parsed = parse_url($url);
$path = $parsed['path'];
// array_pop() takes its argument by reference, so it needs a variable
$segments = explode('/', $path);
echo array_pop($segments);
?>

Is this something?
$url = "http://127.0.0.1/~dtm/index.php/en/parts/engine";
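// end() takes its argument by reference, so it needs a variable
$segments = explode('/', $url);
$ending = end($segments);
echo $ending;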
Output:
engine
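The question also worries that a trailing slash (/engine/) breaks the extraction. A possible sketch that tolerates it is shown below; the function name lastPathSegment is hypothetical. It works on the path component only, so a query string does not leak into the result.

```php
<?php
// Hypothetical helper: returns the last non-empty path segment of a URL.
function lastPathSegment(string $url): string
{
    // parse_url() returns null when the URL has no path; the cast turns that into ''.
    $path = (string) parse_url($url, PHP_URL_PATH);
    // Drop trailing slashes so "/parts/engine/" behaves like "/parts/engine".
    $segments = explode('/', rtrim($path, '/'));
    return end($segments);
}

echo lastPathSegment('http://127.0.0.1/~dtm/index.php/en/parts/engine'), "\n";  // engine
echo lastPathSegment('http://127.0.0.1/~dtm/index.php/en/parts/engine/'), "\n"; // engine
```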

PHP preg_match between text and the first occurrence of -

I'm trying to grab the 12345 out of the following URL using preg_match.
$url = "http://www.somesite.com/directory/12345-this-is-the-rest-of-the-url.html";
$beg = "http://www.somesite.com/directory/";
$close = "\-";
preg_match("($beg(.*)$close)", $url, $matches);
I have tried multiple combinations of . * ? \b
Does anyone know how to extract 12345 out of the URL with preg_match?

Two things, first off, you need preg_quote and you also need delimiters. Using your construction method:
$url = "http://www.somesite.com/directory/12345-this-is-the-rest-of-the-url.html";
$beg = preg_quote("http://www.somesite.com/directory/", '/');
$close = preg_quote("-", '/');
preg_match("/($beg(.*?)$close)/", $url, $matches);
But I would write the pattern slightly differently:
preg_match('/directory\/(\d+)-/i', $url, $match);
It only matches the directory part, is far more readable, and ensures that you only get digits back (no other characters).
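As a small sketch of how the shorter pattern might be used in practice, the function below checks the return value of preg_match before reading the capture group. The function name extractLeadingId is hypothetical, and the pattern is a variation that anchors on the last path segment rather than on the literal word "directory".

```php
<?php
// Hypothetical helper: returns the numeric ID at the start of the last path segment,
// or null when the URL does not have that shape.
function extractLeadingId(string $url): ?int
{
    // A slash, then digits, then a dash, then no further slashes up to the end.
    if (preg_match('~/(\d+)-[^/]*$~', $url, $m) === 1) {
        return (int) $m[1];
    }
    return null;
}

$url = "http://www.somesite.com/directory/12345-this-is-the-rest-of-the-url.html";
var_dump(extractLeadingId($url)); // int(12345)
```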

This doesn't use preg_match, but it achieves the same thing and would likely execute faster:
$url = "http://www.somesite.com/directory/12345-this-is-the-rest-of-the-url.html";
$url_segments = explode("/", $url);
$last_segment = array_pop($url_segments);
list($id) = explode("-", $last_segment);
echo $id; // Prints 12345

Too slow, I am ^^.
Well, if you are not stuck on preg_match, here is a fast and readable alternative:
$num = (int)substr($url, strlen($beg));
(Looking at your code, I guessed that the number you are looking for is a numeric ID, as is typical for URLs like that, and will not be "12abc" or anything else.)

PHP - strip URL to get tag name

I need to strip a URL using PHP to add a class to a link if it matches.
The URL would look like this:
http://domain.com/tag/tagname/
How can I strip the URL so I'm only left with "tagname"?
So basically it takes out the final "/" and the start "http://domain.com/tag/"

For your URL
http://domain.com/tag/tagname/
The PHP function to get "tagname" is called basename():
echo basename('http://domain.com/tag/tagname/'); # tagname

Combine substr with some position finding after you take the last character off the string: remove the trailing '/' first, then call substr with the index of the last '/' in your URL.
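That description could be sketched roughly as follows. The variable names are only illustrative, and the fallback for a string without any '/' is an assumption added here.

```php
<?php
$url = 'http://domain.com/tag/tagname/';

// Take the trailing slash off first.
$trimmed = rtrim($url, '/');

// Find the last remaining slash and keep everything after it.
$pos = strrpos($trimmed, '/');
$tagname = $pos === false ? $trimmed : substr($trimmed, $pos + 1);

echo $tagname, "\n"; // tagname
```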

As an alternative to the substring based answers, you could also use a regular expression, using preg_split to split the string:
<?php
$ptn = "/\//";
$str = "http://domain.com/tag/tagname/";
$result = preg_split($ptn, $str);
$tagname = $result[count($result)-2];
echo($tagname);
?>
(The reason for the -2 is that, because of the trailing /, the final element of the array is an empty string.)
As an alternative to that, you could also use preg_match_all:
<?php
$ptn = "/[a-z]+/";
$str = "http://domain.com/tag/tagname/";
preg_match_all($ptn, $str, $matches);
// $matches[0] holds all full-pattern matches; take the last one
$tagname = $matches[0][count($matches[0]) - 1];
echo($tagname);
?>

Many thanks to all, this code works for me:
$ptn = "/\//";
$str = "http://domain.com/tag/tagname/";
$result = preg_split($ptn, $str);
$tagname = $result[count($result)-2];
echo($tagname);
