This question already has answers here:
PHP validation/regex for URL
(21 answers)
Closed 8 years ago.
I know there are already questions about validating links, but I'm very bad with regex and I don't know how to validate that a user's input (in HTML) matches one of these URLs:
http://www.domain.com/?p=123456abcde
or
http://www.domain.com/doc/123456abcde
I guess it's like this
/^(http://)(www)((\.[A-Z0-9][A-Z0-9_-]*).com/?p=((\.[A-Z0-9][A-Z0-9_-]*)
I need a regex for the two URLs. Thanks.
This might not be a job for regexes, but for existing tools in your language of choice. Regexes are not a magic wand you wave at every problem that happens to involve strings. You probably want to use existing code that has already been written, tested, and debugged.
In PHP, use the parse_url function.
Perl: URI module.
Ruby: URI module.
.NET: 'Uri' class
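As a rough sketch of the parse_url route in PHP: split the URL into parts and check each part separately. The function name isExpectedUrl, the fixed host www.domain.com, and the letters-and-digits rule for the ID are assumptions taken from the two example URLs in the question, not a general-purpose validator.

```php
<?php
// Hypothetical helper: accepts only the two URL shapes from the question.
function isExpectedUrl(string $url): bool
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return false;
    }
    // Assumption: only plain http on www.domain.com is allowed.
    if ($parts['scheme'] !== 'http' || $parts['host'] !== 'www.domain.com') {
        return false;
    }
    $path  = $parts['path'] ?? '/';
    $query = $parts['query'] ?? '';

    // Shape 1: http://www.domain.com/?p=<id>
    if ($path === '/' && preg_match('/\Ap=[A-Za-z0-9]+\z/', $query) === 1) {
        return true;
    }
    // Shape 2: http://www.domain.com/doc/<id>
    return $query === '' && preg_match('#\A/doc/[A-Za-z0-9]+\z#', $path) === 1;
}
```

The regexes that remain are short because parse_url has already separated the scheme, host, path, and query.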
This will match both your strings.
(http:\/\/)?(www\.)?([A-Z0-9a-z][A-Z0-9a-z_-]*)\.com\/(\?p=)?([A-Z0-9a-z][\/A-Za-z0-9_-]*)
I highly recommend using a regex checker. You can find one for (almost) every OS, and there are online ones such as http://regexpal.com/ or http://www.quanetic.com/Regex.
This will match any valid domain with the format you specified.
http(s)?:\/\/(www\.)?[a-zA-Z0-9-\.]+\.[a-z]{2,6}\/(\?p=|doc\/)[a-z0-9]+
Replace [a-z]{2,6} with com if you only want .com domains.
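In PHP this pattern could be wrapped up as in the sketch below. The constant URL_PATTERN and the function matchesLinkFormat are made-up names. The \A and \z anchors are an addition so that the whole input has to match rather than just a substring, and the ~ delimiter avoids escaping every slash.

```php
<?php
// The pattern from the answer, anchored at both ends (the anchors are an addition).
const URL_PATTERN = '~\Ahttps?://(www\.)?[a-zA-Z0-9.-]+\.[a-z]{2,6}/(\?p=|doc/)[a-z0-9]+\z~';

// Hypothetical helper: true when $url has one of the two accepted shapes.
function matchesLinkFormat(string $url): bool
{
    return preg_match(URL_PATTERN, $url) === 1;
}
```

Note that [a-z0-9]+ only accepts lowercase IDs; add the i modifier after the closing delimiter if uppercase should pass too.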
This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 9 years ago.
I'm doing a regular expression for a linking system, and the syntax looks like this:
Login
This tells the system that this link should be converted to JS or an HTML destination depending on the user's browser capabilities.
Right, so I have all the back-end stuff working fine, but I noticed a strange problem with the regular expression that I'm using to catch these types of links. When a dynamic link (href=":) stands by itself (i.e. not next to another link), it works fine; however, if a dynamic link like
<a href=":myLink">
comes after a standard link like
<a href="myLink">
then the dynamic link doesn't get altered, like it should.
Here is a codepad link to some sample code that demonstrates the bug.
http://codepad.org/ZKdm2NkS
Notice the <a href=":first"> link does not get modified but the <a href=":second"> link does.
I'm not very good with regexps, so I'm sure there's a better way of handling this than using (.*) everywhere you turn. I'm open to better ideas and opinions.
Since the only thing you are replacing is the ":myLink" portion, you don't really need to match the rest. Try this:
$html = preg_replace('/href=":([\w]+)"/', 'href="processedLink-$1"', $html);
This matches only word (\w) characters (letters, digits, underscores) after the colon.
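A quick way to check this against the situation in the question is to run it over a string where a standard link comes before the dynamic ones. The sample markup below is invented for illustration; it is not the code from the codepad link.

```php
<?php
// Hypothetical sample markup: a standard link followed by two dynamic links.
$html = '<a href="plain">A</a> <a href=":first">B</a> <a href=":second">C</a>';

// The pattern consumes only the attribute itself, so a tag earlier on the
// same line cannot swallow a later match the way a greedy (.*) can.
$result = preg_replace('/href=":(\w+)"/', 'href="processedLink-$1"', $html);

echo $result, PHP_EOL;
```

Both dynamic links get rewritten and the standard link is left alone.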
This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
How to add http:// if it’s not exists in the URL?
Say I want to match a URL that may have either http://, https:// or neither in it. When I replace it, I want to have https:// at the front if it was there, but if it was http:// or nothing I want to have http:// at the beginning.
I can't work out how to do this with a preg_match expression or with a plain search-and-replace PHP function.
You can use preg_replace_callback, and write a function to do that.
Something like this should work if you want a regex solution.
preg_replace('|^(?:http(s)?://)?(.+)$|', 'http\\1://\\2', $url);
Though I would probably use parse_url and put it back together.
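For the preg_replace_callback suggestion, a minimal sketch might look like the following. The function name withScheme is made up, and the case-insensitive scheme match is an assumption that goes beyond what the question asks for.

```php
<?php
// Hypothetical helper: keeps https:// if present, otherwise forces http://.
function withScheme(string $url): string
{
    return preg_replace_callback(
        '~^(?:(https?)://)?(.+)$~i',
        function (array $m): string {
            // $m[1] is the scheme when one was present, otherwise an empty string.
            $scheme = strtolower($m[1]) === 'https' ? 'https' : 'http';
            return $scheme . '://' . $m[2];
        },
        $url
    );
}

echo withScheme('example.com/page'), PHP_EOL;
```

The callback version is longer than the one-line preg_replace, but it leaves room for extra rules later, such as trimming whitespace or rejecting other schemes.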
This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
Regular expression for URL validation (in JavaScript)
So I've seen many similar questions and answers but can't find a solution that fits my specific needs.
I'm terrible at regexes and am struggling to write a simple regex for the following URL validation:
domain.com
domain.com/folder
subdomain.domain.com
subdomain.domain.com/folder
Validating an optional http:// or http://www. prefix as well would be super helpful. Thanks!
As near as I can get would be:
/[a-z]+:\/\/(([a-z0-9][a-z0-9-]+\.)*[a-z][a-z]+|(0x[0-9A-F]+)|[0-9.]+)\/.*/
Note that your question hasn't limited URLs to a set of protocols, TLDs or character sets.
Something like skype://18005551212 or gopher://localhost is a valid URL. Heck, depending on what you're using to browse, the following might all be valid ways to get to the same server (though not quite the same virtualhost):
https://stackoverflow.com/
http://64.34.119.12/
http://1076000524/
http://0x4022770C/
They all work for me in Firefox.
If you want further restrictions, determine WHAT they are. Are you willing to sacrifice valid protocols? Are you really only interested in one or two protocols?
A more specific question will get you a more specific answer.
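To show what narrowing the restrictions can look like, here is a sketch in PHP that takes the four shapes listed in the question at face value. Everything in it is an assumption layered on top of the question: only http:// is accepted as a scheme, hostnames are dot-separated labels ending in a TLD of letters only, and at most one folder is allowed. The name looksLikeDomainUrl is made up.

```php
<?php
// Hypothetical validator for: domain.com, domain.com/folder,
// subdomain.domain.com, subdomain.domain.com/folder, with optional http://.
function looksLikeDomainUrl(string $input): bool
{
    $pattern = '~\A(?:http://)?'                       // optional scheme
        . '(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+'     // one or more labels, each followed by a dot
        . '[a-z]{2,}'                                  // TLD: letters only
        . '(?:/[\w-]+)?/?\z~i';                        // optional single folder, optional trailing slash
    return preg_match($pattern, $input) === 1;
}
```

With these rules, www. needs no special handling because it is just another label. Anything like gopher://localhost or a bare IP address is rejected, which is exactly the kind of trade-off mentioned above.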
This question already exists:
Closed 11 years ago.
Possible Duplicate:
How to parse HTML with PHP?
I want to write a PHP program that counts all the hyperlinks on a website that the user enters.
How can I do this? Is there a library or something that I can use to parse the HTML and analyze its hyperlinks?
Thanks for your help.
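Like this:
<?php
// Fetch the page; "someurl" is a placeholder for the address the user enters.
$site = file_get_contents("someurl");
if ($site === false) {
    exit("Could not fetch the page.");
}
// Counts only tags written exactly as <a href=...>; anchors with another attribute first are missed.
$links = substr_count($site, "<a href=");
print "There are {$links} links on that page.";
?>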
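Counting the literal string <a href= misses anchors whose first attribute is something else (for example <a class="x" href="...">). A sketch using PHP's DOM extension, which parses the markup instead of searching it as text, might look like this. The function name countHyperlinks is made up, and the sketch assumes the dom extension is enabled and that the HTML string is not empty.

```php
<?php
// Hypothetical helper: counts <a> elements that carry an href attribute.
function countHyperlinks(string $html): int
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $count = 0;
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        if ($anchor->hasAttribute('href')) {
            $count++;
        }
    }
    return $count;
}
```

Named anchors without an href (such as <a name="top">) are skipped, since they are not hyperlinks.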
Well, we can't give you a definitive answer, only pointers. I once built a search engine in PHP, and the principle is the same:
First of all, write your script as a console script; a web script is not really appropriate, but that's a matter of taste.
You need to understand how to work with sockets in PHP and make requests; look at the PHP network functions at: http://www.php.net/manual/ref.network.php
You will need to get familiar with HTTP requests: learn how to make your own GET/POST requests and split the headers from the returned content.
The last part is easy with regexps: just preg_match the content for "#()*#i" (that expression might be wrong, I didn't test it at all).
Loop over the list of found hrefs, compare them to the hrefs already visited (remember to account for varying GET parameters), and then repeat the process to load all the pages of a site.
It IS hard work... good luck.
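The loop described in the last step can be sketched roughly as below. This is an illustration, not the search engine mentioned above: the function crawl, its $fetch callable (something that returns the HTML for a URL, or null on failure), and the $maxPages cap are all made up. The href extraction is deliberately simplistic and assumes double-quoted attribute values; relative URLs are not resolved.

```php
<?php
// Breadth-first crawl sketch. $fetch is a hypothetical callable that returns
// the HTML for a URL (or null on failure); in practice it might wrap cURL.
function crawl(string $startUrl, callable $fetch, int $maxPages = 50): array
{
    $visited = [];
    $queue = [$startUrl];

    while ($queue !== [] && count($visited) < $maxPages) {
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue; // already processed
        }
        $visited[$url] = true;

        $html = $fetch($url);
        if ($html === null) {
            continue;
        }
        // Simplistic href extraction; assumes double-quoted attribute values.
        preg_match_all('/<a\s[^>]*href="([^"]*)"/i', $html, $matches);
        foreach ($matches[1] as $href) {
            if (!isset($visited[$href])) {
                $queue[] = $href;
            }
        }
    }
    return array_keys($visited);
}
```

Passing the fetcher in as a callable keeps the loop independent of how pages are downloaded, so it can be tried against an in-memory array of pages before pointing it at a real site.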
You may have to use cURL to fetch the contents of the web page. Store that in a variable, then parse it for hyperlinks. You might need a regular expression for that.
This question already has answers here:
Closed 12 years ago.
Possible Duplicate:
What's the shebang (#!) in Facebook and new Twitter URLs for?
It usually comes straight after the domain name.
I see it all the time, like in Twitter and Facebook urls.
Is it some special sort of routing?
# is the fragment separator. Everything before it is handled by the server, and everything after it is handled by the client, usually in JavaScript (although it will advance the page to an anchor with the same name as the fragment).
What comes after the # is the hash of the location; the ! that follows is used by search engines to help index Ajax content. After that can be anything, but it is usually written to look like a path (hence the /).
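To see the split in code, PHP's parse_url can pull the fragment out of a URL string. The URL below is a made-up example, and treating a leading ! as a client-side route is only the convention described above.

```php
<?php
// Hypothetical hash-bang URL for illustration.
$url = 'http://example.com/#!/some/path';

// Everything after the # is the fragment; the server never receives it.
$fragment = parse_url($url, PHP_URL_FRAGMENT);

// A leading "!" marks the hash-bang convention; the rest is the client-side route.
$isHashBang  = is_string($fragment) && strncmp($fragment, '!', 1) === 0;
$clientRoute = $isHashBang ? substr($fragment, 1) : null;

echo $clientRoute, PHP_EOL;
```

This only works on a URL you already hold as a string. On the server side the fragment is not part of the incoming request, which is why the routing has to happen in JavaScript.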