Regular expression to replace markup with iframe


I'm building a simple function to embed videos in Wordpress. I want to read the post content and replace [youku: xxAAAJFSK] with an iframe: <iframe src="http://player.youku.com/embed/xxAAAJFSK"></iframe>
I'm guessing I should use a regular expression to do the replacement but can't seem to find the correct one... I tried:
$pattern = '/youku\.com\/([^\/]*)/i';
if (preg_match($pattern, $content, $matches)){
$id_video = $matches[1];
return "<iframe src='http://player.youku.com/embed/" . $id_video . "></iframe>";
}
This just breaks my site though..
Extra points if you manage to let me set the width and height using something like [youku: xxAAAJFSK width:400 height:400]

Are you fixed to that syntax? If not, you'd be best looking at the Wordpress Shortcode API and following their style. That would take a lot of the hard work out of it for you as the system would handle the argument parsing. For example:
// [youku vid="xxAAAJFSK" width="400" height="400"]
function youku_func( $atts ) {
return "<iframe src='http://player.youku.com/embed/" . $atts['vid'] . "' width='" . $atts['width'] . "' height='" . $atts['height'] . "'></iframe>";
}
add_shortcode( 'youku', 'youku_func' );
You would probably want to expand this to include default values for width and height or remove them if they're not given as arguments.
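As a rough sketch of the defaults mentioned above: the version below fills in missing attributes before building the iframe. The default sizes (480 by 400) and the constant name YOUKU_DEFAULTS are assumptions made for illustration, not values from Youku or WordPress. It merges defaults with plain PHP so it can run outside WordPress; inside WordPress, shortcode_atts() does the same job.

```php
<?php
// Hypothetical defaults; pick whatever suits your theme.
const YOUKU_DEFAULTS = ['vid' => '', 'width' => '480', 'height' => '400'];

function youku_func($atts): string
{
    // WordPress passes an empty string when the shortcode has no attributes,
    // so cast to array and keep only the keys we know about.
    $atts = array_merge(
        YOUKU_DEFAULTS,
        array_intersect_key((array) $atts, YOUKU_DEFAULTS)
    );

    // Without a video id there is nothing to embed.
    if ($atts['vid'] === '') {
        return '';
    }

    return sprintf(
        '<iframe src="http://player.youku.com/embed/%s" width="%d" height="%d"></iframe>',
        htmlspecialchars($atts['vid'], ENT_QUOTES),
        (int) $atts['width'],
        (int) $atts['height']
    );
}

// Register the shortcode only when running inside WordPress.
if (function_exists('add_shortcode')) {
    add_shortcode('youku', 'youku_func');
}
```

With this in place, [youku vid="xxAAAJFSK"] falls back to the default size, while [youku vid="xxAAAJFSK" width="400" height="400"] overrides it.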

This is actually very easy to do ...
\[: Match [
\s* : Match a whitespace 0 or more times
youku : Match youku
\s* : Match a whitespace 0 or more times
: : Match :
\s* : Match a whitespace 0 or more times
([^]]*) : Match anything except ] 0 or more times and group it
\] : Match ]
You may even use the i modifier for case-insensitive matching.
Regex: \[\s*youku\s*:\s*([^]]*)\]
Replace: <iframe src="http://player.youku.com/embed/$1"></iframe>
PHP code: $output = preg_replace('#\[\s*youku\s*:\s*([^]]*)\]#i', '<iframe src="http://player.youku.com/embed/$1"></iframe>', $input);
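For the optional width and height from the question, one possible extension is sketched below. It is only an illustration: the function name youku_embed is made up, and it assumes video ids consist of letters, digits, "_", "=" and "-", and that sizes are plain integers.

```php
<?php
function youku_embed(string $content): string
{
    // Group 1: the video id. Group 2: zero or more "width:N" / "height:N" options.
    $pattern = '#\[\s*youku\s*:\s*([\w=-]+)((?:\s+(?:width|height)\s*:\s*\d+)*)\s*\]#i';

    return preg_replace_callback($pattern, function (array $m): string {
        $attrs = '';
        // Turn each "name:value" option into an HTML attribute.
        if (preg_match_all('#(width|height)\s*:\s*(\d+)#i', $m[2], $opts, PREG_SET_ORDER)) {
            foreach ($opts as $opt) {
                $attrs .= ' ' . strtolower($opt[1]) . '="' . $opt[2] . '"';
            }
        }
        return '<iframe src="http://player.youku.com/embed/' . $m[1] . '"' . $attrs . '></iframe>';
    }, $content);
}

echo youku_embed('Intro [youku: xxAAAJFSK width:400 height:300] outro'), "\n";
```

A tag without options, such as [youku: xxAAAJFSK], still produces the plain iframe from the answer above.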

Unless you're doing this for educational purposes, don't reinvent the wheel.
There are a lot of youku-enabled Wordpress plugins already.
Edit: If you want to roll your own, I'd suggest looking at one of the existing working plugins and tailoring their implementation to suit your needs.

Related

preg match text between tags excluding same tag in between

Well, I know there are several similar questions, but I could not find any that covers this specific case.
I took some code and tweaked it to my needs, but now I've found a bug in it that I can't fix.
Code:
$tag = 'namespace';
$match = Tags::get($f, $tag);
var_dump($match);
static function get( $xml, $tag) { // http://stackoverflow.com/questions/3404433/get-content-within-a-html-tag-using-7-processing
// bug case string(56) "<namespaces>
// <namespace key="-2">Media</namespace>"
$tag_ini = "<{$tag}[^\>]*?>"; $tag_end = "<\\/{$tag}>";
$tag_regex = '/' . $tag_ini . '(.*?)' . $tag_end . '/si';
preg_match_all($tag_regex,
$xml,
$matches,
PREG_OFFSET_CAPTURE);
return $matches;
}
As you can see, there is a bug if the tag is nested:
<namespaces> <namespace key="-2">Media</namespace>
When it should return 'Media', or even the outer '<namespaces>' and then the inside ones.
I tried to add "<{$tag}[^\>|^\r\n ]*?>", ^\s+, changing the * to *?, and other few things that in best case turned to recognize only the bugged case.
Also tried "<{$tag}[^{$tag}]*?>" which gives blank, I suppose it nullifies itself.
I'm new to regex; as far as I can tell, the fix just needs to stop the match from opening a new tag of the same type.
Or, for my use case, I could even accept a hack that rejects a match whose inner text contains a line break.
Can anyone get the right syntax for this?
You can check an extract of the text here: http://pastebin.com/f2naN2S3
After the proposed change: $tag_ini = "<{$tag}\\b[^>]*>"; $tag_end = "<\\/{$tag}>"; it does work for the example case, but not for this one:
<namespace key="0" />
<namespace key="1">Talk</namespace>
As it results in:
<namespace key="1">Talk"
It's because numbers and " and letters are considered inside word boundary. How could I address that?

The main problem is that you did not use a word boundary after the opening tag and thus, namespace in the pattern could also match namespaces tag, and many others.
The subsequent issue is that the <${tag}\b[^>]*>(.*?)<\/${tag}> pattern would overfire if a self-closing namespace tag is followed by a "normal" paired open/close namespace tag. So, you need to either use a negative lookbehind (?<!\/) before the >, or use a (?![^>]*\/>) negative lookahead after \b.
So, you can use
$tag_ini = "<{$tag}\\b[^>]*(?<!\\/)>"; $tag_end = "<\\/{$tag}>";
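A small check of that pattern against the two problem cases from the question (the similarly named outer tag and the self-closing tag) might look like the sketch below. The function name get_tag_contents and the preg_quote() call are additions for illustration and are not part of the original code.

```php
<?php
function get_tag_contents(string $xml, string $tag): array
{
    // Escape the tag name in case it contains regex metacharacters.
    $name = preg_quote($tag, '/');
    // \b stops "namespace" from matching "namespaces";
    // (?<!\/) rejects a self-closing tag such as <namespace key="0" />.
    $tag_ini = "<{$name}\\b[^>]*(?<!\\/)>";
    $tag_end = "<\\/{$name}>";

    preg_match_all('/' . $tag_ini . '(.*?)' . $tag_end . '/si', $xml, $matches);
    return $matches[1]; // inner text of each matched tag
}

$xml = '<namespaces>
<namespace key="-2">Media</namespace>
<namespace key="0" />
<namespace key="1">Talk</namespace>
</namespaces>';

print_r(get_tag_contents($xml, 'namespace')); // Media, Talk
```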

This is probably not the ideal answer, but I was messing with a regex generator:
<?php
# URL that generated this code:
# http://txt2re.com/index-php.php3?s=%3Cnamespace%3E%3Cnamespace%20key=%22-2%22%3EMedia%3C/namespace%3E&12&11
$txt='arstarstarstarstarstarst<namespace key="-2">Media</namespace>arstarstarstarstarst';
$re1='.*?'; # Non-greedy match on filler
$re2='(?:[a-z][a-z]+)'; # Uninteresting: word
$re3='.*?'; # Non-greedy match on filler
$re4='(?:[a-z][a-z]+)'; # Uninteresting: word
$re5='.*?'; # Non-greedy match on filler
$re6='(?:[a-z][a-z]+)'; # Uninteresting: word
$re7='.*?'; # Non-greedy match on filler
$re8='((?:[a-z][a-z]+))'; # Word 1
if ($c=preg_match_all ("/".$re1.$re2.$re3.$re4.$re5.$re6.$re7.$re8."/is", $txt, $matches))
{
$word1=$matches[1][0];
print "($word1) \n";
}
#-----
# Paste the code into a new php file. Then in Unix:
# $ php x.php
#-----
?>

This line is what I needed
$tag_ini = "<{$tag}\\b[^>|^\\/>]*>"; $tag_end = "<\\/{$tag}>";
Thank you very much, #Alison and #Wictor, for your help and directions.

Struggling to use a regex function to find a link in a string

I am trying to extract a string from another string using PHP.
At the moment I'm using:
<?php
$testVal = $node->field_link[0]['view'];
$testVal = preg_replace("#((http|https|ftp)://(\S*?\.\S*?))(\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)#ie", "'$3$4'", $testVal);
print "testVal = ";
print $testVal;
?>
This seems to be printing my entire string at the moment.
What I want to do is extract a web address, if there is one, and save it in a variable called testVal.
I am a novice, so please explain what I am doing wrong. I have also looked at other questions and used the regex from one of them.
For #bos
Input:
<iframe width="560" height="315" src="http://www.youtube.com/embed/CLXt3yh2g0s" frameborder="0" allowfullscreen></iframe>
Desired Output
http://www.youtube.com/embed/CLXt3yh2g0s

Well, you say you want to populate $testVal with the extracted web address, but you're using preg_replace instead of preg_match. You use preg_replace when you wish to replace occurrences, and you use preg_match (or preg_match_all) when you want to find occurrences.
If you want to replace URLs with links (<a> tags) like in your example, use something like this:
<?php
$testVal = preg_replace(
'/((?:https?:\/\/|ftp:\/\/|irc:\/\/)[^\s<>()"]+?(?:\([^\s<>()"]*?\)[^\s<>()"]*?)*)((?:\s|<|>|"|\.||\]|!|\?|,|,|")*(?:[\s<>()"]|$))/',
'<a target="_blank" rel="nofollow" href="$1">$1</a>$2',
$testVal
);
If you want to instead simply locate a URL from a string, try (using your regex now instead of mine above):
<?php
$testVal = $node->field_link[0]['view'];
if(!preg_match("#((http|https|ftp)://(\S*?\.\S*?))(\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)#i", $testVal, $matches)) {
    echo "Not found!";
} else {
    echo "URL: " . $matches[1];
}
When you use preg_match, the (optional) third parameter is filled with the results of the search. $matches[0] would contain the string that matched the entire pattern, $matches[1] would contain the first capture group, $matches[2] the second, and so on.
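As a compact illustration of the preg_match approach applied to the iframe input from the question, here is a sketch with a simpler pattern than the one above. The pattern and the function name extract_first_url are assumptions for this example: a URL is taken to end at the first whitespace, quote, or angle bracket.

```php
<?php
function extract_first_url(string $text): ?string
{
    // Match a scheme followed by everything up to whitespace, a quote, or < >.
    if (preg_match('#\b(?:https?|ftp)://[^\s"\'<>]+#i', $text, $matches)) {
        return $matches[0]; // the whole match is the URL
    }
    return null; // no URL found
}

$html = '<iframe width="560" height="315" src="http://www.youtube.com/embed/CLXt3yh2g0s" frameborder="0" allowfullscreen></iframe>';
$testVal = extract_first_url($html);
echo $testVal ?? 'Not found!', "\n";
```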

Regex to determine text 'http://...' but not in iframes, embeds...etc

This regex is used to replace text links with a clickable anchor tag.
#(?<!href="|">)((?:https?|ftp|nntp)://[^\s<>()]+)#i
My problem is, I don't want it to change links that are in things like <iframe src="http://... or <embed src="http://...
I tried checking for a whitespace character before it by adding \s, but that didn't work.
Or - it appears they're first checking that an href=" doesn't already exist (?) - maybe I can check for the other things too?
Any thoughts or explanations on how I would do this are greatly appreciated. Mainly, I just need the regex; I can implement it in CakePHP myself.
The actual code comes from CakePHP's Text->autoLink():
function autoLinkUrls($text, $htmlOptions = array()) {
$options = var_export($htmlOptions, true);
$text = preg_replace_callback('#(?<!href="|">)((?:https?|ftp|nntp)://[^\s<>()]+)#i', create_function('$matches',
'$Html = new HtmlHelper(); $Html->tags = $Html->loadConfig(); return $Html->link($matches[0], $matches[0],' . $options . ');'), $text);
return preg_replace_callback('#(?<!href="|">)(?<!http://|https://|ftp://|nntp://)(www\.[^\n\%\ <]+[^<\n\%\,\.\ <])(?<!\))#i',
create_function('$matches', '$Html = new HtmlHelper(); $Html->tags = $Html->loadConfig(); return $Html->link($matches[0], "http://" . $matches[0],' . $options . ');'), $text);
}

You can expand the lookbehind at the beginning of those regexes to check for src=" as well as href=", like this:
(?<!href="|src="|">)
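To see the extended lookbehind at work outside CakePHP, here is a stand-alone sketch. The function auto_link_urls is a hypothetical stand-in for the helper's link building, not CakePHP's actual implementation.

```php
<?php
function auto_link_urls(string $text): string
{
    // Skip URLs that directly follow href=", src=", or "> (already inside markup).
    $pattern = '#(?<!href="|src="|">)((?:https?|ftp|nntp)://[^\s<>()]+)#i';

    return preg_replace_callback($pattern, function (array $m): string {
        $url = htmlspecialchars($m[1], ENT_QUOTES);
        return '<a href="' . $url . '">' . $url . '</a>';
    }, $text);
}

$text = 'See http://example.com/page and <iframe src="http://example.com/embed"></iframe>';
echo auto_link_urls($text), "\n";
```

The plain-text URL becomes an anchor, while the one inside the iframe's src attribute is left untouched.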

Preg_replace_all links to markdown format

We're converting to Markdown. Before, we used an in-house system in which the image link goes in curly brackets and any data that accompanies it (e.g. alt text) goes in a following pair of square brackets.
For example {IMAGE LINK}[OPTIONAL ALT WITH OTHER DATA]
Now that we are moving to Markdown (our data is stored as Markdown in the database), I need to convert everything.
How can I turn all instances of {LINK}[OPTIONAL DATA] (the square brackets are not required, so some are just {}) into the Markdown equivalent?
Basically,
{http://www.youtube.com/image.gif}[this
is optional alt] INTO
![alt](http://www.youtube.com/Image.gif)
I have the following so far, but how do I deal with the optional [ALT DATA] tag?
if (preg_match_all('/\[(.*?)\]/i', $string, $matches, PREG_SET_ORDER))
{
}

To deal with the optional alt attribute you should use preg_replace_callback. This allows you to test for the existence of the alt attr and add it if necessary.
$str = '
This is an image with an alt attribute {http://www.youtube.com/image.gif}[this is optional alt]
This is an image without one {http://www.youtube.com/image.gif}
';
echo preg_replace_callback(
    // [^}\s]+ keeps the link inside its braces; the [alt] part is optional
    '~\{(http://[^}\s]+)\}(?:\[(.*?)\])?~',
    function($m){
        if ( isset( $m[2] ) ) {
            return sprintf( '![%s](%s)', $m[2], $m[1] );
        }
        // no alt given: emit an image with empty alt text
        return sprintf( '![](%s)', $m[1] );
    },
    $str
);

The simple case would be
{(.*?)}\[(.*?)\] <-- search pattern
![\2](\1) <-- replace pattern (group 1 is the link, group 2 is the alt text)
but this breaks on links that contain escaped characters (\{, \}, \[, \]). Handling those would involve a lookaround that you'll have to hope someone else writes up for you. However, if this is just image URLs, you shouldn't have too many (if any) instances of that happening.

I would use preg_replace_callback for that purpose. There it's easier to probe for the optional alt tag and/or construct a replacement.
$source = preg_replace_callback('#
\{ (http://[^}\s]+) \}
(?:
\[ ([^\]{}\n]+) \]
)?
#x',
"cb_img_markdown",
$source);
function cb_img_markdown($m) {
    $link = $m[1];
    // $m[2] is absent when the optional [alt] part did not match
    $alt = isset($m[2]) ? $m[2] : '';
    if (!strlen($alt)) {
        $alt = "image " . basename($link);
    }
    return "![$alt]($link)";
}
You could also make the link match stricter to avoid false positives. Here I just made it depend on http:// being present, but you could append e.g. (?:png|jpe?g|gif) to ensure it only matches image urls.
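A sketch of that stricter variant follows. The function name images_to_markdown and the exact extension list are assumptions for illustration; non-image links in braces are left as they are.

```php
<?php
function images_to_markdown(string $source): string
{
    // Only links ending in a known image extension are converted.
    $pattern = '~\{(https?://[^}\s]+\.(?:png|jpe?g|gif))\}(?:\[([^\]{}\n]+)\])?~i';

    return preg_replace_callback($pattern, function (array $m): string {
        $link = $m[1];
        // Fall back to a generated alt text when none was supplied.
        $alt = $m[2] ?? ('image ' . basename($link));
        return "![$alt]($link)";
    }, $source);
}

echo images_to_markdown('{http://www.youtube.com/image.gif}[this is optional alt]'), "\n";
echo images_to_markdown('{http://www.youtube.com/image.gif}'), "\n";
echo images_to_markdown('{http://example.com/page.html}'), "\n"; // unchanged
```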

Parsing tags with regular expressions in PHP gets hectic quickly.
I would suggest using PHP Simple HTML DOM Parser instead.
It makes parsing any kind of tag easy, and you can also filter by attributes.

php preg_match_all html dates with slashes error

I'm trying to preg_match_all a date with slashes in it that sits between two HTML tags; however, it's returning null.
Here is the HTML:
<td width='40%' align='right'class='SmallDimmedText'>Last Login: 11/14/2009</td>
Here is my preg_match_all() code
preg_match_all('/<td width=\'40%\' align=\'right\' class=\'SmallDimmedText\'>Last([a-zA-Z0-9\s\.\-\',]*)<\/td>/', $h, $table_content, PREG_PATTERN_ORDER);
where $h is the html above.
what am i doing wrong?
thanks in advance

It (from a quick glance) is because you are trying to match:
Last Login: 11/14/2009
With this regex:
Last([a-zA-Z0-9\s\.\-\',]*)
The regex doesn't contain the required characters : and /, which are included in the text string. Changing the required part of the regex to the following (the / is escaped because / is also your delimiter):
Last([a-zA-Z0-9\s\.\-\',:\/]*)
Gives a match
Would it be better to simply use a DOM parser, and then perform the regex on the result of the DOM lookup? It makes for nicer regex...
EDIT
The other issue is that your HTML is:
...40%' align='right'class='SmallDimmedText'>...
Where there is no space between align='right' and class='SmallDimmedText'
However your regex for that section is:
...40%\' align=\'right\' class=\'SmallDimmedText\'>...
Where it is indicated there is a space.
Use a DOM parser. It will save you more headaches caused by subtle bugs than you can count.
Just to give you an idea of how simple it is to parse using Simple HTML DOM:
$html = str_get_html(...);
$elems = $html->find('.SmallDimmedText');
// find() returns an array of matching elements
if ( count($elems) != 1 ){
    throw new Exception('Too many/few elements found');
}
$text = $elems[0]->plaintext;
//parsing here is only an example, but you have removed all
//the html so that any regex used is really simple.
$date = substr($text, strlen('Last Login: '));
$unixTime = strtotime($date);

I see at least two problems :
in your HTML string, there is no space between 'right' and class=, and there is one space there in your regex
you must add at least these 3 characters to the list of matched characters, between the [] :
':' (there is one between "Login" and the date),
' ' (there are spaces between "Last" and "Login", and between ":" and the date),
and '/' (between the date parts)
With this code, it seems to work better :
$h = "<td width='40%' align='right'class='SmallDimmedText'>Last Login: 11/14/2009</td>";
if (preg_match_all("#<td width='40%' align='right'class='SmallDimmedText'>Last([a-zA-Z0-9\s\.\-',: /]*)<\/td>#",
$h, $table_content, PREG_PATTERN_ORDER)) {
var_dump($table_content);
}
I get this output :
array
0 =>
array
0 => string '<td width='40%' align='right'class='SmallDimmedText'>Last Login: 11/14/2009</td>' (length=80)
1 =>
array
0 => string ' Login: 11/14/2009' (length=18)
Note I have also used :
# as a regex delimiter, to avoid having to escape slashes
" as a string delimiter, to avoid having to escape single quotes

My first suggestion would be to minimize the amount of text you have in the preg_match_all; why not just match between a ">" and a "<"? Second, I'd end up writing the regex like this (with # as the delimiter so the slashes need no escaping), not sure if it helps:
#>.*[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}<#
That will look for the end of one tag, then any characters, then a date, then the beginning of another tag.

I agree with Yacoby.
At the very least, remove all references to the HTML-specific parts and simply make the regex
preg_match_all('#Last Login: ([\d/]+)#', ...
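Putting that idea together, a minimal sketch could look like the code below. The function name extract_last_login is hypothetical, and the code assumes the date is always written month/day/four-digit year, as in the sample HTML.

```php
<?php
function extract_last_login(string $html): ?DateTimeImmutable
{
    // Ignore the surrounding HTML and capture only the date after the label.
    if (!preg_match('#Last Login:\s*(\d{1,2}/\d{1,2}/\d{4})#', $html, $m)) {
        return null;
    }
    // "!" resets the time fields so only the date part is set.
    $date = DateTimeImmutable::createFromFormat('!m/d/Y', $m[1]);
    return $date ?: null;
}

$h = "<td width='40%' align='right'class='SmallDimmedText'>Last Login: 11/14/2009</td>";
$date = extract_last_login($h);
echo $date ? $date->format('Y-m-d') : 'not found', "\n";
```

Because the pattern no longer mentions the td attributes, the missing space between align and class no longer matters.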
