Slugs for SEO using PHP - Appending name to end of URL - php

Something I have noticed on the StackOverflow website:
If you visit the URL of a question on StackOverflow.com:
"https://stackoverflow.com/questions/10721603"
The website adds the name of the question to the end of the URL, so it turns into:
"https://stackoverflow.com/questions/10721603/grid-background-image-using-imagebrush"
This is great, I understand that this makes the URL more meaningful and is probably good as a technique for SEO.
What I wanted to Achieve after seeing this Implementation on StackOverflow
I wish to implement the same thing with my website. I am happy using a header() 301 redirect in order to achieve this, but I am attempting to come up with a tight script that will do the trick.
My Code so Far
Please see it working by clicking here
// Set the title of the page article (This could be from the database). Trimming any spaces either side
$original_name = trim(' How to get file creation & modification date/times in Python with-dash?');
// Replace any characters that are not A-Za-z0-9 or a dash with a space
$replace_strange_characters = preg_replace('/[^\da-z-]/i', " ", $original_name);
// Replace any spaces (or multiple spaces) with a single dash to make it URL friendly
$replace_spaces = preg_replace("/([ ]{1,})/", "-", $replace_strange_characters);
// Remove any trailing slashes
$removed_dashes = preg_replace("/^([\-]{0,})|([\-]{2,})|([\-]{0,})$/", "", $replace_spaces);
// Show the finished name on the screen
print_r($removed_dashes);
The Problem
I have created this code and it works fine by the looks of things, it makes the string URL friendly and readable to the human eye. However, it I would like to see if it is possible to simplify or "tightened it up" a bit... as I feel my code is probably over complicated.
It is not so much that I want it put onto one line, because I could do that by nesting the functions into one another, but I feel that there might be an overall simpler way of achieving it - I am looking for ideas.
In summary, the code achieves the following:
Removes any "strange" characters and replaces them with a space
Replaces any spaces with a dash to make it URL friendly
Returns a string without any spaces, with words separated with dashes and has no trailing spaces or dashes
String is readable (Doesn't contain percentage signs and + symbols like simply using urlencode()
Thanks for your help!
Potential Solutions
I found out whilst writing this that article, that I am looking for what is known as a URL 'slug' and they are indeed useful for SEO.
I found this library on Google code which appears to work well in the first instance.
There is also a notable question on this on SO which can be found here, which has other examples.

I tried to play with preg like you did. However it gets more and more complicated when you start looking at foreign languages.
What I ended up doing was simply trimming the title, and using urlencode
$url_slug = urlencode($title);
Also I had to add those:
$title = str_replace('/','',$title); //Apache doesn't like this character even encoded
$title = str_replace('\\','',$title); //Apache doesn't like this character even encoded
There are also 3rd party libraries such as: http://cubiq.org/the-perfect-php-clean-url-generator

Indeed, you can do that:
$original_name = ' How to get file creation & modification date/times in Python with-dash?';
$result = preg_replace('~[^a-z0-9]++~i', '-', $original_name);
$result = trim($result, '-');
To deal with other alphabets you can use this pattern instead:
~\P{Xan}++~u
or
~[^\pL\pN]++~u

Related

laravel. Replace %20 in url

I have simple problem, I have to replace %20 and other crap from URL. At the moment it looks like this http://exmaple/profile/about/Eddies%20Plumbing. As you can see it's profile link.
Yes I could add str_replace values before every hyperlink, but I have like 10 of them and I think it's bad practice. Maybe there is better solution? What solution would you use? Thanks.
That is not crap, that is a valid unicode representation of a space character. And it's encoded because it's one of the characters that are deemed unsafe by RFC1738:
All unsafe characters must always be encoded within a URL. For
example, the character "#" must be encoded within URLs even in
systems that do not normally deal with fragment or anchor
identifiers, so that if the URL is copied into another system that
does use them, it will not be necessary to change the URL encoding.
So in order to have pretty URLs, you should avoid using reserved and unsafe characters which need encoding to be valid as part of a URL:
Reserved characters: $ & + , / : ; = ? #
Unsafe characters: Blank/empty space and < > # % { } | \ ^ ~ [ ] `
Instead replace spaces with dashes, which serve the same purpose visually while being a safe character, for example look at the Stack Overflow URL for this question. The URL below looks just fine and readable without spaces in it:
http://exmaple/profile/about/eddies-plumbing
You can use Laravel's str_slug helper function to do the hard work for your:
str_slug('Eddies Plumbing', '-'); // returns eddies-plumbing
The str_slug does more that replace spaces with dashes, it replaces multiple spaces with a single dash and also strips all non-alphanumeric characters, so there's no reliable way to decode it.
That being said, I wouldn't use that approach in the first place. There are two main ways I generally use to identify a database entry:
1. Via an ID
The route path definition would look like this in your case:
/profiles/about/{id}/{slug?} // real path "/profiles/about/1/eddies-plumbing"
The code used to identify the user would look like this User::find($id) (the slug parameter is not needed, it's just there to make the URL more readable, that's why I used the ? to make it optional).
2. Via a slug
The route path definition would look like this in your case:
/profiles/about/{slug} // real path "/profiles/about/eddies-plumbing"
In this case I always store the slug as a column in the users table because it's a property relevant to that user. So the retrieval process is very easy User::where('slug', $slug). Of course using str_slug to generate a valid slug when saving the user to the database. I usually like this approach better because it has the added benefit of allowing the slug to be whatever you want (not really needing to be generated from the user name). This can also allow users to choose their custom URL, and can also help with search engine optimisation.
The links are urlencoded. Use urldecode($profileLink); to decode them.
I am parsing the url tha i got in this way ->
$replacingTitle = str_replace('-',' ',$title);
<a href="example.com/category/{{ str_slug($article->title) }}/" />
In your view ...
{{$comm->title}
and in controller using parsing your url as
public function showBySlug($slug) {
$title = str_replace('-',' ',$slug);
$post = Community::where('title','=',$title)->first();
return view('show')->with(array(
'post' => $post,
));
}

Trying to stop regex at a tag

I know there are other posts with a similar name but I've looked through them and they haven't helped me resolve this.
I'm trying to get my head around regex and preg_match. I am going through a body of text and each time a link exists I want it to be extracted. I'm currently using the following:
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
which works fine until it finds one that has <br after it. Then I get the url plus the <br which means it doesn't work correctly. How can I have it so that it stops at the < without including it?
Also, I have been looking everywhere for a clear explanation of using regex and I'm still confused by it. Has anyone any good guides on it for future reference?
\S* is too broad. In particular, I could inject into your code with a URL like:
http://hax.hax/"><script>alert('HAAAAAAAX!');</script>
You should only allow characters that are allowed in URLs:
[-A-Za-z0-9._~:/?#[]#!$&'()*+,;=]*
Some of these characters are only allowed in specific places (such as ?) so if you want better validation you will need more cleverness
Instead of \S exclude the open tag char from the class:
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/[^<]*)?/";
You might even want to be more restrictive by only allowing characters valid in URLs:
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/[a-zA-Z_\-\.%\?&]*)?/";
(or some more characters)
You could use this one as presented on the:
http://regex101.com/r/zV1uI7
On the bottom of the site you got it explained step by step.

converting url sperators with slash

I have a category named like this:
$name = 'Construction / Real Estate';
Those are two different categories, and I am displaying results from database
for each of them. But I before that I have to send a user to url just for that category.
Here is the problem, if I did something like this.
echo "<a href='site.com/category/{$name}'> $name </a>";
The URL will become
site.com/cateogry/Construction%20/%20Real%20Estate
I am trying to remove the %20 and make them / So, I did str_replace('%20', '/', $name);
But that will become something like this:
site.com/cateogry/Construction///Real/Estate
^ ^ and ^ those are the problems.
Since it is one word, I want it to appear as Construction/RealEstate only.
I could do this by using at-least 10 lines of codes, but I was hoping if there is a regex, and simple php way to fix it.
You have a string for human consumption, and based on that string you want to create a URL.
To avoid any characters messing up your HTML, or get abuses as XSS attack, you need to escape the human readable string in the context of HTML using htmlspecialchars():
$name = 'Construction / Real Estate';
echo "<h1>".htmlspecialchars($name)."</h1>;
If that name should go into a URL, it must also be escaped:
$url = "site.com/category/".rawurlencode($name);
If any URL should go into HTML, it must be escaped for HTML:
echo "<a href='".htmlspecialchars($url)."'>";
Now the problem with slashes in URLs is that they are most likely not accepted as a regular character even if they are escaped in the URL. And any space character also does not fit into a URL nicely, although they work.
And then there is that black magic of search engine optimization.
For whatever reason, you should convert your category string before you inject it as part of the URL. Do that BEFORE you encode it.
As a general rule, lowercase characters are better, spaces should be dashes instead, and the slash probably should be a dash too:
$urlname = strtr(mb_strtolower($name), array(" " => "-", "/" => "-"));
And then again:
$url = "site.com/category/".rawurlencode($urlname);
echo "<a href='".htmlspecialchars($url)."'>";
In fact, using htmlspecialchars() is not really enough. The escaping of output that goes into an HTML attribute differs from output as the elements content. If you have a look at the escaper class from Zend Framework 2, you realize that the whole thing of escaping a HTML attribute value is a lot more complicated
No, there is nothing you can do to make it easier. The only chance is to use a function that does everything you need to make things easier for you, but you still need to apply the correct escaping everywhere.
You can use a simple solution like this:
$s = "site.com/cateogry/Construction%20/%20Real%20Estate";
$s = str_replace('%20', '', $s);
echo $s; // site.com/cateogry/Construction/RealEstate
Perhaps, you want to use urldecode() and remove the whitespace afterwards?

extracting one or more urls from a string in php

I'm trying to extract one or more urls from a plain text string in php. Here's some examples
"mydomain.com has hit the headlines again"
extract " http://www.mydomain.com"
"this is 1 domain.com and this is anotherdomain.co.uk but sometimes http://thirddomain.net"
extract "http://www.domain.com" , "http://www.anotherdomain.co.uk" , "http://www.thirddomain.net"
There are two special cases I need - I'm thinking regex, but dont fully understand them
1) all symbols like '(' or ')' and spaces (excluding hyphens) need to be removed
2) the word dot needs to be replaced with the symbol . , so dot com would be .com
p.s I'm aware of PHP validation/regex for URL but cant work out how I would use this to achieve the end goal.
Thanks
In this case it will be hard to get 100% correct results.
Depending on the input you may try to force matching just most popular first level domains (add more to it):
(?:https?://)?[a-zA-Z0-9\-\.]+\.(?:com|org|net|biz|edu|uk|ly|gov)\b
You may need to remove the word boundary (\b) to get different results.
You can test it here:
http://bit.ly/dlrgzQ
EDIT: about your cases
1) remove from what?
2) this could be done in php like:
$result = preg_replace('/\s+dot\s+(?=(com|org|net|biz|edu|and_ect))/', '.', $input);
But I have few important notes:
This Regex are more like guidance, not actual production code
Working with this kind of loose rules on text is wacky for the least - and adding more special cases will make it even more looney. Consider this - even stackoverflow doesn't do that:
http://example.org
but not!
example.org
It would be easier if you'd said what are you trying to achieve? Because if you want to process some kind of text that goes somewhere on the WWW later, then it is very bad idea! You should not do this by your own (as you said - you don't understand Regex!), as this would be just can of XSS worms. Better think about some kind of Markdown language or BBCore or else.
Also get interested in: http://htmlpurifier.org/

PHP regex for filtering out urls from specific domains for use in a vBulletin plug-in

I'm trying to put together a plug-in for vBulletin to filter out links to filesharing sites. But, as I'm sure you often hear, I'm a newb to php let alone regexes.
Basically, I'm trying to put together a regex and use a preg_replace to find any urls that are from these domains and replace the entire link with a message that they aren't allowed. I'd want it to find the link whether it's hyperlinked, posted as plain text, or enclosed in [CODE] bb tags.
As for regex, I would need it to find URLS with the following, I think:
Starts with http or an anchor tag. I believe that the URLS in [CODE] tags could be processed the same as the plain text URLS and it's fine if the replacement ends up inside the [CODE] tag afterward.
Could contain any number of any characters before the domain/word
Has the domain somewhere in the middle
Could contain any number of any characters after the domain
Ends with a number of extentions such as (html|htm|rar|zip|001) or in a closing anchor tag.
I have a feeling that it's numbers 2 and 4 that are tripping me up (if not much more). I found a similar question on here and tried to pick apart the code a bit (even though I didn't really understand it). I now have this which I thought might work, but it doesn't:
<?php
$filterthese = array('domain1', 'domain2', 'domain3');
$replacement = 'LINKS HAVE BEEN FILTERED MESSAGE';
$regex = array('!^http+([a-z0-9-]+\.)*$filterthese+([a-z0-9-]+\.)*(html|htm|rar|zip|001)$!',
'!^<a+([a-z0-9-]+\.)*$filterthese+([a-z0-9-]+\.)*</a>$!');
$this->post['message'] = preg_replace($regex, $replacement, $this->post['message']);
?>
I have a feeling that I'm way off base here, and I admit that I don't fully understand php let alone regexes. I'm open to any suggestions on how to do this better, how to just make it work, or links to RTM (though I've read up a bit and I'm going to continue).
Thanks.
You can use parse_url on the URLs and look into the hashmap it returns. That allows you to filter for domains or even finer-grained control.
I think you can avoid the overhead of this in using the filter_var built-in function.
You may use this feature since PHP 5.2.0.
$good_url = filter_var( filter_var( $raw_url, FILTER_SANITIZE_URL), FILTER_VALIDATE_URL);
Hmm, my first guess: You put $filterthese directly inside a single-quoted string. That single quotes don't allow for variable substitution. Also, the $filterthese is an array, that should first be joined:
var $filterthese = implode("|", $filterthese);
Maybe I'm way off, because I don't know anything about vBulletin plugins and their embedded magic, but that points seem worth a check to me.
Edit: OK, on re-checking your provided source, I think the regexp line should read like this:
$regex = '!(?#
possible "a" tag [start]: )(<a[^>]+href=["\']?)?(?#
offending link: )https?://(?#
possible subdomains: )(([a-z0-9-]+\.)*\.)?(?#
domains to block: )('.implode("|", $filterthese).')(?#
possible path: )(/[^ "\'>]*)?(?#
possible "a" tag [end]: )(["\']?[^>]*>)?!';

Categories