PHP: Regexp to change urls

I'm looking for a nice regexp that could change my string from:
text text website.tld text text anotherwebsite.tld/longeraddress text http://maybeanotheradress.tld/file.ext
into BBCode:
text text [url=website.tld]LINK[/url] text text [url=anotherwebsite.tld/longeraddress]LINK[/url] text [url=http://maybeanotheradress.tld/file.ext]LINK[/url]
Could you please advise?

Even though I voted to close this as a duplicate, here is a general suggestion: divide and conquer.
In your input string, none of the "URLs" contain spaces. So you can divide the string into the parts that do not contain spaces:
$chunks = explode(' ', $str);
Since each part is now potentially a link, you can write your own function that decides whether it is one:
/**
 * @return bool
 */
function is_text_link($str)
{
    # do whatever you need to do here to tell whether something is
    # a link in your domain or not.
    # for example, taking the links you have in your question:
    $links = array(
        'website.tld',
        'anotherwebsite.tld/longeraddress',
        'http://maybeanotheradress.tld/file.ext'
    );
    return in_array($str, $links);
}
The in_array is just an example, you might be looking for regular expression based pattern matching instead. You can edit it later to fit your needs, I leave this as an exercise.
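As a rough illustration of the regular-expression route, the sketch below treats a chunk as a link when it optionally starts with http:// or https://, followed by a dotted host name and an optional path. The name is_text_link_regex and the pattern itself are assumptions made for this example, not part of the answer above; a chunk such as file.ext would also pass, so real-world input needs a stricter rule.

```php
<?php
/**
 * Hypothetical regex-based variant of is_text_link().
 * Assumption: a "link" is an optional http(s):// scheme, a host with at
 * least one dot and an alphabetic TLD, and an optional path without spaces.
 *
 * @return bool
 */
function is_text_link_regex($str)
{
    $pattern = '~^(?:https?://)?(?:[a-z0-9-]+\.)+[a-z]{2,}(?:/\S*)?$~i';
    return preg_match($pattern, $str) === 1;
}

var_dump(is_text_link_regex('website.tld'));                            // bool(true)
var_dump(is_text_link_regex('http://maybeanotheradress.tld/file.ext')); // bool(true)
var_dump(is_text_link_regex('text'));                                   // bool(false)
```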
Now that you can tell what is a link and what is not, the only problem left is how to create BBCode out of a link. That is a fairly simple string operation:
if (is_text_link($chunk))
{
    $chunk = sprintf('[url=%s]LINK[/url]', $chunk);
}
So technically, all problems have been solved and this just needs to be put together:
function bbcode_links($str)
{
    $chunks = explode(' ', $str);
    foreach ($chunks as &$chunk)
    {
        if (is_text_link($chunk))
        {
            $chunk = sprintf('[url=%s]LINK[/url]', $chunk);
        }
    }
    unset($chunk); // remove the reference left over from the loop
    return implode(' ', $chunks);
}
This already runs with the example string from your question:
$str = 'text text website.tld text text anotherwebsite.tld/longeraddress text http://maybeanotheradress.tld/file.ext';
echo bbcode_links($str);
Output:
text text [url=website.tld]LINK[/url] text text [url=anotherwebsite.tld/longeraddress]LINK[/url] text [url=http://maybeanotheradress.tld/file.ext]LINK[/url]
You then only need to tweak your is_text_link function to fulfill your needs. Have fun!

Related

Get the first sentence from string using php

I have some content stored in a variable and it looks like:
$content = "This is a test content and the content of the url is http://www.test.com. The is a second sentence.";
Now my code is
$pos = strpos($content, '.');
$firstsentence = substr($content, 0, $pos);
The above code doesn't work as the string already contains a url having dots.
How can I get the first sentence considering the fact that a string contains a hyperlink?

Please share other scenarios of the text. This works fine for your example:
$sentences = 'This is a test content and the content of the url is http://www.test.com. The is a second sentence.';
// pull the URL out so its dots are not mistaken for the end of the sentence
preg_match('/(http|https):(.*?)com/', $sentences, $match);
$sentences = preg_replace('/(http|https):(.*?)com/', '', $sentences);
// the first remaining dot now ends the first sentence
$pos = strpos($sentences, '.');
$firstsentence = substr($sentences, 0, $pos) . $match[0] . '.';
// This is a test content and the content of the url is http://www.test.com.

In general, I think you're going to also have to look for <sentence-end-punct>"<whitespace>, "<sentence-end-punct><whitespace>, and <sentence-end-punct><whitespace> (where <whitespace> includes the end of a line). Is this very general English text, not especially under your control, or is the grammar very limited? For non-English text, there can be additional rules, such as putting spaces between punctuation and quotes.
Addendum: What are you trying to accomplish here? Do you really need to pull the text apart into individual sentences, or are you just trying to create a "teaser"? In the latter case, just cut off the text at a complete word before some number of characters, and add an ellipsis (...).
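The teaser idea can be sketched as follows. The helper name make_teaser and the default length of 40 are assumptions for illustration, and the byte-based string functions used here would need to be swapped for their mb_* counterparts on multibyte text.

```php
<?php
/**
 * Hypothetical helper: cut $text at the last complete word that fits
 * within $maxLength bytes and append an ellipsis.
 */
function make_teaser($text, $maxLength = 40)
{
    if (strlen($text) <= $maxLength) {
        return $text; // short enough, nothing to cut
    }
    // Look one character past the limit so that a word ending exactly
    // at the limit is still kept whole.
    $cut = substr($text, 0, $maxLength + 1);
    $lastSpace = strrpos($cut, ' ');
    if ($lastSpace === false) {
        // no word boundary found: fall back to a hard cut
        return substr($text, 0, $maxLength) . '...';
    }
    return rtrim(substr($cut, 0, $lastSpace)) . '...';
}

$content = 'This is a test content and the content of the url is http://www.test.com. The is a second sentence.';
echo make_teaser($content, 40); // This is a test content and the content...
```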

How to improve my algorithm?/searching and replacing words in a formatted text/

I have a source of html, and an array of keywords. I'm trying to find all words which begin with any keyword in the keywords array and wrap it in a link tag.
For example, the keyword array has two values: [ABC, DEF]. It should match ABCDEF, DEFAD, etc. and wrap each word with hyperlink markup.
Here is the code I've got so far:
$_keys = array('ABC', 'DEF');
$text = 'Some ABCDD <strong>HTML</strong> text. DEF';
function search_and_replace(($key,$text)
{
$words = preg_split('/\s+/', trim($text)); //to seprate words in $_text
for($words as $word)
{
if(strpos($word,$key) !== false)
{
if($word.startswith($key))
{
str_replace($word,'<a href="#">'.$word.'</a>',$_text);
}
}
}
return text;
}
for($_keys as $_key)
{
$text = search_and_replace($key,$text);
}
My questions:
Would this algorithm work?
How would I modify this to work with UTF-8?
How can I recognize hyperlinks in the html and ignore them (don't want to put a hyperlink in a hyperlink).
Is this algorithm safe?

Is the algorithm "true"? (I'm reading that as "accurate".)
No, it is not. According to its documentation, str_replace returns
"a string or an array with all occurrences of search in subject
replaced with the given replace value."
So the string you're matching is not the only one being replaced. Using your example, if you ran this function against your data set, you'd end up wrapping each occurrence of ABC in multiple tags (just run your code to see it, but you'll have to fix the syntax errors first).
Work with UTF-8 alphabets?
Not sure, but as written, I don't think so. See Preg_Replace and UTF8. The PREG functions should be multibyte safe when the u modifier is used.
I want to ignore all words in each a tag for the search operation
That's awfully hard. You'll have to avoid <a ...>word</a>, which starts to make a big mess fast. Regex matching HTML reliably is a fool's errand.
Probably the best approach would be to interpret the webpage as XML or HTML. Have you considered doing this in JavaScript? Why do it on the server side? The advantage of JS is twofold: one, it runs on the client side, so you're offloading and distributing the work, and two, since the DOM is already parsed, you can find all text nodes and replace them fairly easily. In fact, I was helping a friend working on a Chrome extension to do almost exactly what you're describing; you could modify it to do what you're looking for easily.
A better alternative method?
Definitely. What you're showing here is one of the worse methods of doing this. I'd push for you to use preg_replace (another answer has a good start for the regex you'd want, matching word breaks rather than whitespace), but since you want to avoid changing some elements, I'm thinking now that doing this in JS client-side is far better.

In order to maximize your performance you should look into the Trie (also called a retrieval tree) data structure (http://en.wikipedia.org/wiki/Trie). If I were you, I would first build a Trie containing the words in the HTML page. At this step you could also check whether a word is inside an <a> tag and, if it is, not add it to the Trie. You can easily do that with a regex match.
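A minimal sketch of that idea, using nested PHP arrays as the trie. All function names and the '__end' marker are assumptions for illustration, and the sketch does not skip words inside <a> tags (strip_tags keeps their text), so that filtering step would still have to be added.

```php
<?php
// Insert a word into an array-based trie, one byte per level.
function trie_insert(array &$trie, $word)
{
    $node = &$trie;
    foreach (str_split($word) as $char) {
        if (!isset($node[$char])) {
            $node[$char] = array();
        }
        $node = &$node[$char];
    }
    $node['__end'] = true; // multi-character key, cannot clash with a single byte
}

// Collect every complete word stored below $node.
function trie_collect(array $node, $prefix, array &$out)
{
    foreach ($node as $char => $child) {
        if ($char === '__end') {
            $out[] = $prefix;
        } else {
            trie_collect($child, $prefix . $char, $out);
        }
    }
}

// Return all stored words that begin with $prefix.
function trie_words_with_prefix(array $trie, $prefix)
{
    $node = $trie;
    foreach (str_split($prefix) as $char) {
        if (!isset($node[$char])) {
            return array();
        }
        $node = $node[$char];
    }
    $out = array();
    trie_collect($node, $prefix, $out);
    return $out;
}

$html  = 'Some ABCDD <strong>HTML</strong> text. DEF DEFAD';
$words = preg_split('/\W+/', strip_tags($html), -1, PREG_SPLIT_NO_EMPTY);
$trie  = array();
foreach ($words as $word) {
    trie_insert($trie, $word);
}
print_r(trie_words_with_prefix($trie, 'DEF')); // DEF and DEFAD
```

Looking up a keyword then costs time proportional to the keyword length plus the number of matching words, independent of how many other words the page contains.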

How about a regex?
// $word is the keyword, $text the text to search
preg_match_all('/\b' . preg_quote($word, '/') . '\w*\b/u', $text, $matches);
foreach ($matches[0] as $each) {
    print($each);
}
(Sorry, my PHP is a bit rusty)

For a simple task like this PHP regular expressions will serve well. The idea is to find all hyperlinks ( and optionally some other HTML elements ) and replace them with unique tokens. After that we are free to seek and replace desired keywords, and in the end we will restore the removed HTML elements back.
$_keys = array( 'ABC', 'DEF', 'ABČ' );
$text =
'Some <a href="#" >ABC</a> ABCDđD <strong>ABCDEF</strong> text. DEF
<p class="test">
PHP is <em>the</em> most ABCwidely used
langČuage ABC for ABČogr ammDEFing on the webABC DEFABC.
</p>';
// array for holding html items replaced with tokens
$tokens = array();
$id = 0;
// we will replace all links and strong elements (a|strong)
$text = preg_replace_callback( '/<(a|strong)[^>]*>.*?<\/\1\s*>/s',
function( $matches ) use ( &$tokens, &$id )
{
// store matches into the tokens array
$tokens[ '#'.++$id.'#' ] = $matches[0];
// replace matches with the unique id
return '#'.$id.'#';
},
$text
);
echo htmlentities( $text );
/* - outputs: Some #1# ABCDđD #2# text. DEF <p class="test"> PHP is <em>the</em> most ABCwidely used langČuage ABC for ABČogr ammDEFing on the webABC DEFABC. </p>
- note the #1# #2# tokens
*/
// wrap the words that start with items in the $_keys array ( with the u (PCRE_UTF8) modifier );
// the href "#" is only a placeholder link target
$text = preg_replace( '/\b('. implode( '|', $_keys ) . ')\w*\b/u', '<a href="#">$0</a>', $text );
// replace the tokens with values
$text = str_replace( array_keys($tokens), array_values($tokens), $text );
echo $text;

Find links in string with PHP. Differ from normal and youtube links

I have a string that contains links. I want my PHP code to do different things with the links, depending on the URL.
Answer:
function fixLinks($text)
{
    $links = array();
    $text = strip_tags($text);
    $pattern = '!(https?://[^\s]+)!';
    if (preg_match_all($pattern, $text, $matches)) {
        list(, $links) = ($matches);
    }
    $i = 0;
    $links2 = array();
    foreach($links AS $link) {
        if(strpos($link,'youtube.com') !== false) {
            $search = "!(http://.*youtube\.com.*v=)?([a-zA-Z0-9_-]{11})(&.*)?!";
            $youtube = 'http://www.youtube.com/watch?v=\\2';
            $link2 = preg_replace($search, $youtube, $link);
        } else {
            $link2 = preg_replace('#(https?://([-\w\.]+)+(:\d+)?(/([\-\w/_\.]*(\?\S+)?)?)?)#', '<u>$1</u>', $link);
        }
        $links2[$i] = $link2;
        $i++;
    }
    $text = str_replace($links, $links2, $text);
    $text = nl2br($text);
    return $text;
}

First of all, ditch eregi. It has been deprecated since PHP 5.3 and was removed in PHP 7.
Then, doing this in just one pass is maybe a stretch too far. I think you'll be better off splitting this into three phases.
Phase 1 runs a regex search over your input, finding everything that looks like a link, and storing it in a list.
Phase 2 iterates over the list, checking whether a link goes to youtube (parse_url is tremendously useful for this), and putting a suitable replacement into a second list.
Phase 3: you now have two lists, one containing the original matches, one containing the desired replacements. Run str_replace over your original text, providing the match list for the search parameter and the replacement list for the replacements.
There are several advantages to this approach:
The regular expression for extracting links can be kept relatively simple, since it doesn't have to take special hostnames into account
It is easier to debug; you can dump the search and replace arrays prior to phase 3, and see if they contain what you expect
Because you perform all replacements in one go, you avoid problems with overlapping matches or replacing a piece of already-replaced text (after all, the replaced text still contains a URL, and you don't want to replace that again)
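The three phases could look roughly like the sketch below. The function name and the [youtube]/[url] replacement markup are placeholders invented for this example. It also swaps in strtr for str_replace in phase 3: strtr with an array works in a single pass and does not rescan text it has already replaced.

```php
<?php
// Hypothetical helper; the replacement markup is only a placeholder.
function replace_links_three_phase($text)
{
    // Phase 1: collect everything that looks like a link.
    preg_match_all('~https?://\S+~', $text, $m);
    $links = $m[0];

    // Phase 2: build one replacement per link, keyed by the original match.
    $replacements = array();
    foreach ($links as $link) {
        $host = (string) parse_url($link, PHP_URL_HOST);
        $isYoutube = preg_match('~(^|\.)youtube\.com$~i', $host) === 1;
        $replacements[$link] = $isYoutube
            ? '[youtube]' . $link . '[/youtube]'
            : '[url]' . $link . '[/url]';
    }
    if (!$replacements) {
        return $text; // nothing to replace
    }

    // Phase 3: apply all replacements in one go.
    return strtr($text, $replacements);
}

echo replace_links_three_phase(
    'see http://www.youtube.com/watch?v=dQw4w9WgXcQ and http://example.com/page'
);
```

Dumping $replacements just before phase 3 shows the search strings and their replacements side by side, which is the debugging advantage mentioned above.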

tdammers' answer is good, but another option is to use preg_replace_callback. If you go with that, then the process changes a little:
Create a regular expression to match all links, same as his Phase 1
In the callback, search for the YouTube video id. This will require running a second preg_match, which is (in my opinion) the biggest problem with this technique.
Return the replacement string, based on whether or not it's YouTube.
The code would look something like this:
function replaceem($matches) {
    $url = $matches[0];
    preg_match('~youtube\.com.*v=([\w\-]{11})~', $url, $matches);
    return isset($matches[0]) ?
        '<a href="youtube.php?id='.$matches[1].'" class="fancy">'.
        'http://www.youtube.com/watch?v='.$matches[1].'</a>' :
        '<a href="'.$url.'" title="Åben link" alt="Åben link" '.
        'target="_blank">'.$url.'</a>';
}
$text = preg_replace_callback('~(?:f|ht)tps?://[^\s]+~', 'replaceem', $text);

how to make a string lowercase without changing url

I'm using mb_strtolower to make a string lowercase, but sometimes the text contains URLs with upper-case characters. When I use mb_strtolower, the URLs of course change too and stop working.
How can I convert the string to lowercase without changing the URLs?

Since you have not posted your string, this can be only generally answered.
Whenever you use a function on a string to make it lower-case, the whole string will be made lower-case. String functions are aware of strings only, they are not aware of the contents written within these strings specifically.
In your scenario you do not want to lowercase the whole string I assume. You want to lowercase only parts of that string, other parts, the URLs, should not be changed in their case.
To do so, you must first parse your string into these two different parts, let's call them text and URLs. Then you need to apply the lowercase function only on the parts of type text. After that you need to combine all parts together again in their original order.
If the content of the string is semantically simple, you can split the string at spaces. Then you can check whether each part begins with http:// or https:// (is_not_url()) and, if it does not, perform the lowercase operation:
$text = 'your content http://link.me/now! might differ';
$fragments = explode(' ', $text);
foreach($fragments as &$fragment) {
    if (is_not_url($fragment)) {
        $fragment = strtolower($fragment); // or mb_strtolower
    }
}
unset($fragment); // remove reference
$lowercase = implode(' ', $fragments);
For this code to work, you need to define the is_not_url() function. Additionally, the original text must be simple enough that rudimentary parsing based on the space separator works.
Hopefully this example helps you get along with coding and understanding your problem.
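One possible definition of is_not_url(), as a sketch. It assumes, as described above, that a fragment counts as a URL only when it starts with http:// or https://; the mixed-case sample text is made up to show the effect.

```php
<?php
/**
 * Hypothetical implementation of is_not_url(): anything that does not
 * start with http:// or https:// is treated as plain text.
 */
function is_not_url($fragment)
{
    return preg_match('~^https?://~i', $fragment) !== 1;
}

$text = 'Your CONTENT http://Link.Me/Now! Might DIFFER';
$fragments = explode(' ', $text);
foreach ($fragments as &$fragment) {
    if (is_not_url($fragment)) {
        $fragment = strtolower($fragment); // or mb_strtolower
    }
}
unset($fragment); // remove reference
echo implode(' ', $fragments); // your content http://Link.Me/Now! might differ
```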

Here you go, iterative, but as fine as possible.
function strtolower_sensitive ( $input ) {
    // match a URL up to (but not including) the delimiter that follows it
    $regexp = '#(?:https?|ftp)://\S*?\.\S*?(?=\s|[;)\]\[{},"\':<]|$|\.\s)#i';
    $hist = array();
    // swap every URL for a lowercase placeholder that strtolower() leaves intact
    $input = preg_replace_callback($regexp, function ( $m ) use ( &$hist ) {
        $n = 'sxxx' . count($hist) . 'x';
        $hist[$n] = $m[0];
        return $n;
    }, $input);
    $input = strtolower($input); // or mb_strtolower
    // put the original URLs back in place of the placeholders
    return str_replace(array_keys($hist), array_values($hist), $input);
}
$input is your string, and the return value is your answer. =)

Preg Replace - replace second occurrence of a match

I am relatively new to PHP and hope someone can help me with a replace regex, or maybe a match-and-replace; I am not exactly sure which.
I want to automatically bold the second occurrence of a match, then make the 4th occurrence italic, and then underline the 7th occurrence.
This is basically for SEO purposes in content.
I have done some replacements before and was thinking this should do the trick:
preg_replace( pattern, replacement, subject [, limit ])
I already know the word I want to use in 'pattern'; it is a word that is already defined, like [word].
'replacement' - This is a variable I am getting from a MySQL db.
'subject' - The subject is text from a db.
Let's say I have this content; it explains more or less what I want to do:
This is an example of the text that I want to replace. In this text I want to make the second occurrence of the word example bold. Then I want to skip the next time example occurs in the text, and make the 4th time the word example appears italic. Then I want to skip the 5th time the word example appears in the text, as well as the 6th time, and lastly I want to underline the 7th time example appears in the text. In this example I have used a hyperlink as the underline example, as I do not see an underline function in the text editor. The word example may appear more times in the text, but my only requirement is to underline once, make bold once and make italic once. I may later decide to put quotes around the word "example" as well, but it is not yet a priority.
It is also important that the code does not throw an error if there are not at least 7 occurrences of the word.
How would I do this? Any ideas would be appreciated.

You could use preg_split to split the text at the matches, apply the modifications, and then put everything back together (a limit of 8 pieces leaves room for 7 captured matches):
$parts = preg_split('/(example)/', $str, 8, PREG_SPLIT_DELIM_CAPTURE);
if (isset($parts[3])) $parts[3] = '<b>'.$parts[3].'</b>';
if (isset($parts[7])) $parts[7] = '<i>'.$parts[7].'</i>';
if (isset($parts[13])) $parts[13] = '<u>'.$parts[13].'</u>';
$str = implode('', $parts);
The index formula for the i-th match is index = i · 2 - 1.
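To see where the formula comes from, here is a small sketch on a made-up sample string: with PREG_SPLIT_DELIM_CAPTURE the result alternates between surrounding text (even indexes) and captured matches (odd indexes).

```php
<?php
$str = 'example one, example two, example three';
// limit -1 means "no limit": split at every match
$parts = preg_split('/(example)/', $str, -1, PREG_SPLIT_DELIM_CAPTURE);

// $parts is now:
// [0] => ''          text before the 1st match
// [1] => 'example'   1st match
// [2] => ' one, '
// [3] => 'example'   2nd match
// [4] => ' two, '
// [5] => 'example'   3rd match
// [6] => ' three'
for ($i = 1; $i <= 3; $i++) {
    $index = $i * 2 - 1;
    echo "match $i is at index $index: {$parts[$index]}\n";
}
```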

The regular expression itself cannot count, and the preg_ functions provide little help, so you need a workaround. If you are actually searching for just a literal word, you might want to use string functions instead. Otherwise try:
// just counting: only rewrite when there are at least 7 matches
if (preg_match_all($pattern, $subject, $matches) >= 7) {
    $cb_num = 0;
    $subject = preg_replace_callback($pattern, "cb_ibu", $subject);
}
function cb_ibu($match) {
    global $cb_num;
    $match = $match[0];
    switch (++$cb_num) {
        case 2: return "<b>$match</b>";
        case 4: return "<i>$match</i>";
        case 7: return "<u>$match</u>";
        default: return $match;
    }
}
The trick is to have a callback which does the accounting. And there it's quite easy to add any rules.
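The same accounting can also be done without a global variable. The sketch below is a variation, not the code above: it keeps the counter in a closure and takes the rules as a map from occurrence number to tag name. The function name format_nth_matches and the $rules parameter are made up for this example.

```php
<?php
/**
 * Hypothetical variant: wrap the n-th match of $pattern in the tag given
 * by $rules[n]; all other matches are returned unchanged.
 */
function format_nth_matches($pattern, $subject, array $rules)
{
    $count = 0;
    return preg_replace_callback($pattern, function ($m) use (&$count, $rules) {
        $count++;
        if (!isset($rules[$count])) {
            return $m[0];
        }
        return sprintf('<%1$s>%2$s</%1$s>', $rules[$count], $m[0]);
    }, $subject);
}

echo format_nth_matches(
    '/example/',
    'example example example example',
    array(2 => 'b', 4 => 'i', 7 => 'u')
);
// example <b>example</b> example <i>example</i>
```

Because a rule is applied only when its occurrence actually exists, a text with fewer than 7 matches is simply left without the underline and no error is raised.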

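That's an interesting question. Note that str_replace has no limit parameter (its fourth argument only reports the number of replacements made), so my implementation locates the n-th occurrence with strpos instead:
function replace_exact($word, $tag, $string, $n) {
    // walk forward to the offset of the $n-th occurrence of $word
    $offset = -1;
    for ($i = 0; $i < $n; $i++) {
        $offset = strpos($string, $word, $offset + 1);
        if ($offset === false) return $string; // fewer than $n occurrences
    }
    $tagged = '<'.$tag.'>'.$word.'</'.$tag.'>';
    return substr_replace($string, $tagged, $offset, strlen($word));
}
Use it like this:
$source_text = replace_exact('example', 'b', $source_text, 2);
$source_text = replace_exact('example', 'i', $source_text, 4);
echo $source_text;
I don't know how fast this is, but plain string functions like these are usually faster than preg_replace.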
