Replacing all links with regular expression [duplicate] - php

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
PHP Linkify Links In Content
I've got a little stuck finding text links and wrapping them in A tags.
I'm using / [\w]*\.[a-z]{2,}/i so far. It works fine for links like stackoverflow.com, but it misses www. or anything else before the domain.
To recap, I'm trying to find all links and wrap them in A tags. None of the text contains the protocol part (http(s)://) or a port part, which makes it a tad harder.

Can't find a good duplicate now, so try something simple like repeating the prefix:
/\b(\w[\w-]+\.)+[a-z]{2,}\b/i
I wouldn't use this; too many false positives. But you haven't really limited the scope. Alternatives include e.g. a fixed list of TLDs to make it a bit more specific.
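A minimal sketch of the fixed-TLD-list alternative, in PHP. The function name linkify_bare_domains, the three-entry TLD list, and the choice to prefix http:// in the href are all assumptions made for illustration; the list in particular is far from complete.

```php
<?php
// Hypothetical helper: wrap bare host names (no scheme, no port) in <a> tags.
// The default TLD list is an illustrative assumption, not a complete list.
function linkify_bare_domains(string $text, array $tlds = ['com', 'org', 'net']): string
{
    // Build an alternation such as "com|org|net" from the allowed TLDs.
    $tldPattern = implode('|', array_map('preg_quote', $tlds));

    // One or more dot-terminated labels, followed by an allowed TLD.
    $pattern = '/\b(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+(?:' . $tldPattern . ')\b/i';

    return preg_replace_callback($pattern, function (array $m): string {
        $host = htmlspecialchars($m[0], ENT_QUOTES);
        // The source text has no protocol, so http:// is assumed here.
        return '<a href="http://' . $host . '">' . $host . '</a>';
    }, $text);
}
```

Restricting the last label to known TLDs avoids matches on things like file.txt, at the cost of maintaining the list. It will still linkify the domain part of an e-mail address, so the scope may need tightening further.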

$text = preg_replace('#((?:http(?:s)?://)?(?:www)?([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?)#', '<a href="$1">$1</a>', $text);

Related

Validate a link for a specific link php [duplicate]

This question already has answers here:
PHP validation/regex for URL
(21 answers)
Closed 8 years ago.
I know there are already questions about validating links, but I'm very bad with regex, and I don't know how to check that user input (in HTML) matches one of these URLs:
http://www.domain.com/?p=123456abcde
or
http://www.domain.com/doc/123456abcde
I guess it's like this
/^(http://)(www)((\.[A-Z0-9][A-Z0-9_-]*).com/?p=((\.[A-Z0-9][A-Z0-9_-]*)
I need the regex for the two URLs. Thanks
This might not be a job for regexes, but for existing tools in your language of choice. Regexes are not a magic wand you wave at every problem that happens to involve strings. You probably want to use existing code that has already been written, tested, and debugged.
In PHP, use the parse_url function.
Perl: URI module.
Ruby: URI module.
.NET: 'Uri' class
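For the PHP case, a rough sketch of how parse_url could be used for the two URL shapes in the question. The function name is_expected_url is hypothetical, and the assumption that the identifier is purely alphanumeric comes from the sample URLs rather than from any stated rule.

```php
<?php
// Hypothetical validator for the two URL shapes shown in the question.
function is_expected_url(string $url): bool
{
    $parts = parse_url($url);
    if ($parts === false
        || ($parts['scheme'] ?? '') !== 'http'
        || ($parts['host'] ?? '') !== 'www.domain.com') {
        return false;
    }

    $path = $parts['path'] ?? '/';

    // Shape 1: http://www.domain.com/?p=<id>
    if ($path === '/' && isset($parts['query'])) {
        parse_str($parts['query'], $query);
        return isset($query['p'])
            && is_string($query['p'])
            && preg_match('/^[a-z0-9]+$/i', $query['p']) === 1;
    }

    // Shape 2: http://www.domain.com/doc/<id>
    return !isset($parts['query'])
        && preg_match('#^/doc/[a-z0-9]+$#i', $path) === 1;
}
```

The regexes that remain only check the small identifier part; splitting the URL into scheme, host, path and query is left to parse_url.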
This will match both your strings.
(http:\/\/)?(www\.)?([A-Z0-9a-z][A-Z0-9a-z_-]*)\.com\/(\?p=)?([A-Z0-9a-z][\/A-Za-z0-9_-]*)
I highly recommend using a regex checker. You can find one for (almost) every OS, and there are online ones such as http://regexpal.com/ or http://www.quanetic.com/Regex.
This will match any valid domain with the format you specified.
http(s)?:\/\/(www\.)?[a-zA-Z0-9-\.]+\.[a-z]{2,6}\/(\?p=|doc\/)[a-z0-9]+
Replace [a-z]{2,6} with com if you only want .com domains.

Weird behavior in regular expression replacement [duplicate]

This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 9 years ago.
I'm doing a regular expression for a linking system, and the syntax looks like this:
Login
This tells the system that this link should be converted to JS or an HTML destination depending on the user's browser capabilities.
Right, so I have all the back-end stuff working fine, but I noticed a strange problem with the regular expression that I'm using to catch these types of links. When a dynamic link (href=":) stands by itself (i.e. not next to another object), it works fine; however, if a dynamic link like
<a href=":myLink">
comes after a standard link like
<a href="myLink">
then the dynamic link doesn't get altered, like it should.
Here is a codepad link to some sample code that demonstrates the bug.
http://codepad.org/ZKdm2NkS
Notice the <a href=":first"> link does not get modified but the <a href=":second"> link does.
I'm not very good with regexps so I'm sure there's a better way of handling things rather than just using a (.*) everywhere you turn, but like I said, I'm open to better ideas and opinions.
Since the only thing you are replacing is the ":myLink" portion, you don't really need to match the rest. Try this:
$html = preg_replace('/href=":([\w]+)"/', 'href="processedLink-$1"', $html);
This matches only word (\w) characters (letters, digits, underscores).
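As a quick check of that pattern, here is a small sketch with a standard link followed by two dynamic ones. The sample markup is made up for illustration (the original codepad sample is not reproduced here); the processedLink- prefix is the one used above.

```php
<?php
// Made-up sample: one standard link followed by two dynamic (":"-prefixed) links.
$html = '<a href="myLink">Plain</a> <a href=":first">One</a> <a href=":second">Two</a>';

// Match only the href=":name" attribute, not the rest of the tag,
// so a preceding standard link cannot swallow the dynamic one.
$result = preg_replace('/href=":(\w+)"/', 'href="processedLink-$1"', $html);

echo $result, PHP_EOL;
// <a href="myLink">Plain</a> <a href="processedLink-first">One</a> <a href="processedLink-second">Two</a>
```

Because the pattern contains no (.*), there is nothing greedy that could span from one tag into the next.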

Javascript and PHP Regular Expression for url validation [duplicate]

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
Regular expression for URL validation (in JavaScript)
So I've seen many similar questions and answers but can't find a solution that fits my specific needs.
I'm terrible at regexes and am struggling to get a simple regex for the following URL validation.
domain.com
domain.com/folder
subdomain.domain.com
subdomain.domain.com/folder
Validating an optional http:// and http://www. as well would be super helpful. Thanks!
As near as I can get would be:
/[a-z]+:\/\/(([a-z0-9][a-z0-9-]+\.)*[a-z][a-z]+|(0x[0-9A-F]+)|[0-9.]+)\/.*/
Note that your question hasn't limited URLs to a set of protocols, TLDs or character sets.
Something like skype://18005551212 or gopher://localhost is a valid URL. Heck, depending on what you're using to browse, the following might all be valid ways to get to the same server (though not quite the same virtualhost):
https://stackoverflow.com/
http://64.34.119.12/
http://1076000524/
http://0x4022770C/
They all work for me in Firefox.
If you want further restrictions, determine WHAT they are. Are you willing to sacrifice valid protocols? Are you really only interested in one or two protocols?
A more specific question will get you a more specific answer.
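If the restrictions turn out to be "http or https only, host names only", a narrower check could look like the sketch below (PHP; a similar pattern should carry over to JavaScript). The function name looks_like_site_url and the exact rules are assumptions based on the four shapes listed in the question, not a general URL validator.

```php
<?php
// Hypothetical check for the shapes listed in the question:
// optional http:// or https://, one or more labels, a TLD, an optional path.
function looks_like_site_url(string $url): bool
{
    $pattern = '~^(?:https?://)?'                      // optional protocol
             . '(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+' // labels such as "www." or "subdomain."
             . '[a-z]{2,}'                              // top-level domain
             . '(?:/\S*)?$~i';                          // optional path

    return preg_match($pattern, $url) === 1;
}
```

This deliberately sacrifices the other protocols and the numeric host forms mentioned above, which is exactly the kind of restriction that has to be decided up front.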

Count Hyperlinks of a Website [duplicate]

This question already exists:
Closed 11 years ago.
Possible Duplicate:
How to parse HTML with PHP?
I want to write a PHP program that counts all the hyperlinks on a website that the user enters.
How can I do this? Is there a library or something I can use to parse the HTML and analyze the hyperlinks?
Thanks for your help
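Like this:
<?php
// "someurl" is a placeholder for the address the user entered.
$site = file_get_contents("someurl");
// Naive count: misses anchors whose first attribute is not href.
$links = substr_count($site, "<a href=");
print "There are {$links} links on that page.";
?>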
Well, we won't be able to give you a definitive answer, only pointers. I once built a search engine in PHP, so the principle is the same:
First of all, you need to write your script as a console script. A web script is not really appropriate, but that's a matter of taste.
You need to understand how to work with sockets in PHP and make requests. Look at the PHP network functions at: http://www.php.net/manual/ref.network.php
You will need to get versed in the world of HTTP requests: learn how to make your own GET/POST requests and split the headers from the returned content.
The last part will be easy with a regexp: just preg_match the content for "#()*#i" (that expression might be wrong; I didn't test it at all).
Loop over the list of found hrefs, compare them to the hrefs already visited (remember to take wildcard GET params into account), and then repeat the process to load all the pages of the site.
It IS HARD WORK... good luck
You may have to use cURL to fetch the contents of the web page. Store that in a variable, then parse it for hyperlinks. You might need a regular expression for that.
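As an alternative to a regular expression for the parsing step, here is a sketch that counts links with PHP's DOM extension. The function name count_links is hypothetical, and the HTML is passed in as a string; fetching it (with file_get_contents or cURL, as suggested above) is left out so the example stays self-contained.

```php
<?php
// Hypothetical helper: count <a> elements that carry an href attribute.
// Requires the DOM extension.
function count_links(string $html): int
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $count = 0;
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        // Named anchors without href are not hyperlinks, so skip them.
        if ($anchor->hasAttribute('href')) {
            $count++;
        }
    }
    return $count;
}
```

Unlike counting the substring "<a href=", this also finds anchors whose href is not the first attribute.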

Reliably parsing HTML elements using RegEx [duplicate]

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
Best methods to parse HTML with PHP
I'm trying to parse a webpage using RegEx, and I'm having some trouble making it work in a reliable manner.
Say I wanted to parse the code that creates a div element, and I want to extract everything between <div> and </div>. Now, this code could just be <div></div>, but it could also very well be something like:
<div class="thisIsMyDivClass"><p>This text is inside the div</p></div>
How can I make sure that no matter how many characters that are in between the greater-than/less-than signs of the initial div tag and the corresponding last div tag, I'll always only get the content in between them? If I specify that the number of characters following < can be anything from one to ten thousand, I will always be extracting the > after ten thousand characters, and thus (most likely, unless there is a lot of code or text in between) retrieve a bunch of code in between that I don't need.
This is my code so far (not reliable for the aforementioned reason):
/<.{1,10000}>/
Regular expressions describe so-called regular languages, or Type 3 in the Chomsky hierarchy. HTML, on the other hand, is a context-free language, which is Type 2 in the Chomsky hierarchy. So there is no way to reliably parse HTML with regular expressions in general. Use an HTML parser instead. For PHP you can find some suggestions in this question: How do you parse and process HTML/XML in PHP?
You will need a lexical analyser and grammar checker to parse HTML correctly. The main purpose of RegEx is searching strings for patterns.
I would suggest using something like DOM. I am building a large-scale site and using DOM like crazy on it. It works well, and with a little work it can be extremely powerful.
http://php.net/manual/en/book.dom.php
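A minimal sketch of what that could look like for the div example in the question, using the DOM extension. The function name div_contents is an assumption made for illustration.

```php
<?php
// Hypothetical helper: return the inner HTML of every <div> in a document.
// Requires the DOM extension.
function div_contents(string $html): array
{
    $doc = new DOMDocument();

    // Suppress warnings about imperfect markup while parsing.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $results = [];
    foreach ($doc->getElementsByTagName('div') as $div) {
        // Serialize each child node to rebuild the content between <div> and </div>.
        $inner = '';
        foreach ($div->childNodes as $child) {
            $inner .= $doc->saveHTML($child);
        }
        $results[] = $inner;
    }
    return $results;
}

print_r(div_contents('<div class="thisIsMyDivClass"><p>This text is inside the div</p></div>'));
```

The parser tracks matching open and close tags itself, so the length of the content no longer matters. Note that nested divs are returned as separate entries in addition to appearing inside their parent's content.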
