We're converting to Markdown. Before, we used an in-house system in which the image link goes in curly braces and all data that belongs to it (e.g. the alt text) goes in a following pair of square brackets.
For example {IMAGE LINK}[OPTIONAL ALT WITH OTHER DATA]
Now that we are moving to Markdown (our data is stored as Markdown in the database), I need to convert everything.
How can I turn all instances of {LINK}[OPTIONAL DATA] (the square brackets are not required, so some are just {LINK}) into the Markdown equivalent?
Basically,
{http://www.youtube.com/image.gif}[this
is optional alt] INTO
![alt](http://www.youtube.com/image.gif)
I have the following so far, but how do I deal with the optional [ALT DATA] tag?
if (preg_match_all('/\[(.*?)\]/i', $string, $matches, PREG_SET_ORDER))
{
}
To deal with the optional alt attribute you should use preg_replace_callback. This allows you to test for the existence of the alt attr and add it if necessary.
$str = '
This is an image {http://www.youtube.com/image.gif}[this is optional alt]
This is an image with an alt attribute {http://www.youtube.com/image.gif}
';
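echo preg_replace_callback(
    '~\{(http://[^\s}]+)\}(?:\[(.*?)\])?~',
    function ($m) {
        // $m[2] is only set when the optional [alt] part matched
        if ( isset( $m[2] ) ) {
            return sprintf( '![%s](%s)', $m[2], $m[1] );
        }
        // no alt text: still emit Markdown image syntax, with an empty alt
        return sprintf( '![](%s)', $m[1] );
    },
    $str
);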
The simple case would be
{(.*?)}\[(.*?)\] <-- search pattern
![\2](\1) <-- replace pattern (\1 is the link, \2 is the alt text)
but this breaks on links that contain escaped bracket characters (\{, \}, \[, \]). Handling those would involve a lookaround that you'll have to hope someone else writes up for you. However, if this is just image URLs, you shouldn't have too many (if any) instances of that happening.
I would use preg_replace_callback for that purpose. There it's easier to probe for the optional alt tag and/or construct a replacement.
$source = preg_replace_callback('#
\{ (http://[^}\s]+) \}
(?:
\[ ([^\]{}\n]+) \]
)?
#x',
"cb_img_markdown",
$source);
function cb_img_markdown($m) {
    $link = $m[1];
    // $m[2] is missing when the optional [alt] part did not match
    $alt = isset($m[2]) ? $m[2] : '';
    if (!strlen($alt)) {
        $alt = "image " . basename($link);
    }
    return "![$alt]($link)";
}
You could also make the link match stricter to avoid false positives. Here I just made it depend on http:// being present, but you could append e.g. (?:png|jpe?g|gif) to ensure it only matches image urls.
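As a rough sketch of that stricter variant (the helper name convert_image_tags is made up for illustration, and the extension list is only the one suggested above), the link group can be required to end in an image extension. This version also accepts https and lets the alt text span a line break, as in the question's example; both are assumptions on my part rather than anything the original pattern did.

```php
<?php
// Hypothetical helper name; the pattern is a stricter variant of the one above.
function convert_image_tags($source) {
    return preg_replace_callback(
        '~
            \{ ( https?://[^}\s]+ \. (?:png|jpe?g|gif) ) \}   # link must end in an image extension
            (?: \[ ([^\]{}]+) \] )?                            # optional alt text, may span lines
        ~xi',
        function ($m) {
            // Group 2 is absent from $m when no alt part follows the link.
            $alt = isset($m[2]) ? trim(preg_replace('/\s+/', ' ', $m[2])) : '';
            return "![$alt]($m[1])";
        },
        $source
    );
}

echo convert_image_tags("{http://www.youtube.com/image.gif}[this\nis optional alt]");
```

Anything in braces that does not look like an image URL is left untouched, which makes it easier to review the leftovers by hand after the conversion.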
Parsing tags by hand in PHP is hectic.
I would suggest using the PHP Simple HTML DOM Parser instead.
It makes parsing any kind of tag very easy, and you can also easily filter by attributes.
Related
I would like to process my user input to allow only certain html tags, and replace the other ones by their html entities, as well as replace non-tag-characters. For example, if I only wanted to allow the <b> and the <a> tag, then
allow_only("This is <b>bold</b> and this is <i>italic</i>.
Moreover 2<3 and <a href='google.com'>this is a link</a>.","<b><a>");
should produce
This is <b>bold</b> and this is &lt;i&gt;italic&lt;/i&gt;.
Moreover 2&lt;3 and <a href='google.com'>this is a link</a>.
How can I do this in PHP? I am aware of strip_tags() that can remove the unwanted tags completely, and I'm aware of htmlspecialchars() which can replace all tags by their html entities, but none where only specific tags get replaced. How can this be done in PHP?
And if there is no 'common' way to do this, how should I in general go on processing user input that can have valid regular html, but can also have < signs and potentially dangerous html code?
Apply htmlspecialchars and then turn the encoded tags back into regular tags for a given array of allowed tags:
function allow_only($str, $allowed){
    $str = htmlspecialchars($str);
    foreach( $allowed as $a ){
        // after htmlspecialchars() the tag delimiters are &lt; and &gt;
        $str = str_replace("&lt;".$a."&gt;", "<".$a.">", $str);
        $str = str_replace("&lt;/".$a."&gt;", "</".$a.">", $str);
    }
    return $str;
}
echo allow_only("This is <b>bold</b> and this is <i>italic</i>.", array("b"));
That works for simple tags, returning "This is <b>bold</b> and this is &lt;i&gt;italic&lt;/i&gt;."
As it was pointed out, that doesn't work for tags with attributes, but this does:
function fix_attributes($match){
    // TODO: study $match[2] in depth and avoid banned attributes
    // eg: those that begin with on, or href that begins with javascript:
    // to avoid some potential hacks
    return "<".$match[1].str_replace('&quot;','"',$match[2]).">";
}
function allow_only($str, $allowed){
    $str = htmlspecialchars($str);
    foreach( $allowed as $a ){
        // after htmlspecialchars() the tag delimiters are &lt; and &gt;
        $str = preg_replace_callback("/&lt;(".$a.")([\s\/\.\w=&;:#]*?)&gt;/", 'fix_attributes', $str);
        $str = str_replace("&lt;/".$a."&gt;", "</".$a.">", $str);
    }
    return $str;
}
echo allow_only('This is <b>bold</b> and this is <i>italic</i>.', array("b","a"));
That handles more complex tags with certain attributes; only the characters listed between [] are allowed to appear in attributes. Unfortunately &quot; must be allowed within attributes or it won't work, and with it all other entities are allowed too; however, only &quot; in attributes will be decoded.
As was suggested, a much better (safer, cleaner) way to solve problems like this is to use a library such as HTML Purifier: http://htmlpurifier.org/demo.php
I have the following code to turn URLs in a message into HTML links:
$message = preg_replace("#(http|https|ftp|ftps)://([.]?[&;%=a-zA-Z0-9_/?-])*#",
"<a href=\"\\0\">\\0</a>", $message);
$message = preg_replace("#(^| |\n)(www([.]?[&;%=a-zA-Z0-9_/?-])*)#",
"\\1<a href=\"http://\\2\">\\2</a>", $message);
It works very well with almost all links, except in the following cases:
1) http://example.com/mediathek#/video/1976914/zoom:-World-Wide
The problem here is the # and the : within the link: only part of the link is transformed.
2) If someone just writes "www" in a message
Example: www
So the question is whether there is any way to fix these two cases in the code above.
Since you want to include the hash (#) to the regex, you need to change the delimiters to characters that are not included in your regex, e.g. !. So, your regex should look like this:
$message = preg_replace("!(http|https|ftp|ftps)://([.]?[&;%#:=a-zA-Z0-9_/?-])*!",
"<a href=\"\\0\">\\0</a>", $message);
Does this help?
Though, if you would like to follow the specification (RFC 1738) more closely, you might want to exclude %, which is only allowed in URLs as the start of a percent-encoded escape. There are also some more allowed characters which you didn't include:
$
_
. (dot)
+
!
*
'
(
)
If you include these characters, ! appears inside the pattern, so you should then delimit your regex with % instead.
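A rough sketch of that suggestion follows. The function name linkify_urls is made up for illustration, and the anchor markup in the replacement is an assumption about what the link should look like. With % as the delimiter, the characters #, : and ! can sit unescaped in the character class; % itself is left out of the class, as proposed above.

```php
<?php
// Hypothetical helper; "%" delimits the pattern so "#", ":" and "!" need no escaping.
function linkify_urls($message) {
    return preg_replace(
        // "\$" keeps PHP from treating "$" as the start of a variable in this string.
        "%(?:https?|ftps?)://(?:[.]?[&;#:=a-zA-Z0-9_/?\$+!*'()-])*%",
        '<a href="\0">\0</a>',   // \0 is the whole matched URL
        $message
    );
}

echo linkify_urls('see http://example.com/mediathek#/video/1976914/zoom:-World-Wide now');
```

One trade-off to be aware of: because ), ! and ' are now valid URL characters, punctuation that directly follows a URL in running text ends up inside the link.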
Couple minor tweaks. Add \# and : to the first regex, then change the * to + in the second regex:
$message = preg_replace("#(http|https|ftp|ftps)://([.]?[&;%=a-zA-Z0-9_/?\#:-])*#",
"<a href=\"\\0\">\\0</a>", $message);
$message = preg_replace("#(^| |\n)(www([.]?[&;%=a-zA-Z0-9_/?-])+)#",
"\\1<a href=\"http://\\2\">\\2</a>", $message);
In my opinion, it is futile to tackle this problem with a regex alone. A good alternative is to find what could be a URL via regex (something that begins with a protocol: http, ftp, mail... or with www) and then test it with FILTER_VALIDATE_URL. Keep in mind that this filter is not foolproof, as the PHP manual says:
"Note that the function will only find ASCII URLs to be valid; internationalized domain names (containing non-ASCII characters) will fail."
Example of code (not tested):
$message = preg_replace_callback(
'~(?(DEFINE)
(?<prot> (?>ht|f) tps?+ :// ) # you can add protocols here
)
(?>
<a\b (?> [^<]++ | < (?!/a>) )++ </a> # avoid links inside "a" tags
|
<[^>]++> # and tags attributes.
) (*SKIP)(?!) # makes fail the subpattern.
| # OR
\b(?>(\g<prot>)|www\.)(\S++) # something that begins with
# "http://" or "www."
~xi',
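function ($match) {
    // Group 1 is the (?<prot>) definition itself and is never filled;
    // $match[2] holds the protocol and is empty when the match began with "www."
    $url = (empty($match[2]) ? 'http://' : '') . $match[0];
    // validate the complete URL, scheme included
    if (filter_var($url, FILTER_VALIDATE_URL)) {
        return '<a href="away?to=' . $url . '" target="_blank">'
            . $url . '</a>';
    }
    return $match[0];
},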
$message);
I'm building a simple function to embed videos in Wordpress. I want to read the post content and replace [youku: xxAAAJFSK] with an iframe: <iframe src="http://player.youku.com/embed/xxAAAJFSK"></iframe>
I'm guessing I should use a regular expression to do the replacement but can't seem to find the correct one... I tried:
$pattern = '/youku\.com\/([^\/]*)/i';
if (preg_match($pattern, $content, $matches)){
$id_video = $matches[1];
return "<iframe src='http://player.youku.com/embed/" . $id_video . "></iframe>";
}
This just breaks my site though..
Extra points if you manage to let me set the width and height using something like [youku: xxAAAJFSK width:400 height:400]
Are you fixed to that syntax? If not, you'd be best looking at the Wordpress Shortcode API and following their style. That would take a lot of the hard work out of it for you as the system would handle the argument parsing. For example:
// [youku vid="xxAAAJFSK" width="400" height="400"]
function youku_func( $atts ) {
    return "<iframe src='http://player.youku.com/embed/" . $atts['vid'] . "' width='" . $atts['width'] . "' height='" . $atts['height'] . "'></iframe>";
}
add_shortcode( 'youku', 'youku_func' );
You would probably want to expand this to include default values for width and height or remove them if they're not given as arguments.
This is actually very easy to do ...
\[: Match [
\s* : Match a whitespace 0 or more times
youku : Match youku
\s* : Match a whitespace 0 or more times
: : Match :
\s* : Match a whitespace 0 or more times
([^]]*) : Match anything except ] 0 or more times and group it
\] : Match ]
You may even use the i modifier for case-insensitive matching.
Regex: \[\s*youku\s*:\s*([^]]*)\]
Replace: <iframe src="http://player.youku.com/embed/$1"></iframe>
PHP code: $output = preg_replace('#\[\s*youku\s*:\s*([^]]*)\]#i', '<iframe src="http://player.youku.com/embed/$1"></iframe>', $input);
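For the optional width and height from the question, one possible extension is to capture whatever follows the id and pick the numbers out in a callback. This is only a sketch: the helper name embed_youku and the 480x400 defaults are arbitrary assumptions, not anything youku or Wordpress prescribes.

```php
<?php
// Hypothetical helper; the 480x400 defaults are arbitrary assumptions.
function embed_youku($content) {
    return preg_replace_callback(
        // group 1: the video id (no spaces), group 2: the rest up to the closing bracket
        '#\[\s*youku\s*:\s*([^\]\s]+)([^\]]*)\]#i',
        function ($m) {
            $width  = preg_match('/\bwidth\s*:\s*(\d+)/i',  $m[2], $w) ? $w[1] : 480;
            $height = preg_match('/\bheight\s*:\s*(\d+)/i', $m[2], $h) ? $h[1] : 400;
            return sprintf(
                '<iframe src="http://player.youku.com/embed/%s" width="%d" height="%d"></iframe>',
                htmlspecialchars($m[1], ENT_QUOTES),   // keep the id from breaking out of the attribute
                $width,
                $height
            );
        },
        $content
    );
}

echo embed_youku('[youku: xxAAAJFSK width:400 height:400]');
```

Tags without width or height, such as [youku: xxAAAJFSK], fall back to the defaults.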
Unless you're doing this for educational purposes, don't reinvent the wheel.
There are a lot of youku-enabled Wordpress plugins already.
Edit: If you want to roll your own, I'd suggest looking at one of the existing working plugins and tailoring their implementation to suit your needs.
What is the easiest way of applying highlighting of some text excluding text within OCCASIONAL tags "<...>"?
CLARIFICATION: I want the existing tags PRESERVED!
$t =
preg_replace(
"/(markdown)/",
"<strong>$1</strong>",
"This is essentially plain text apart from a few html tags generated with some
simplified markdown rules: <a href=markdown.html>[see here]</a>");
Which should display as:
"This is essentially plain text apart from a few html tags generated with some simplified markdown rules: see here"
... BUT NOT MESS UP the text inside the anchor tag (i.e. <a href=markdown.html> ).
I've heard the arguments of not parsing html with regular expressions, but here we're talking essentially about plain text except for minimal parsing of some markdown code.
Actually, this seems to work ok:
<?php
$item="markdown";
$t="This is essentially plain text apart from a few html tags generated
with some simplified markdown rules: <a href=markdown.html>[see here]</a>";
//_____1. apply emphasis_____
$t = preg_replace("|($item)|","<strong>$1</strong>",$t);
// "This is essentially plain text apart from a few html tags generated
// with some simplified <strong>markdown</strong> rules: <a href=
// <strong>markdown</strong>.html>[see here]</a>"
//_____2. remove emphasis if WITHIN opening and closing tag____
$t = preg_replace("|(<[^>]+?)(<strong>($item)</strong>)([^<]+?>)|","$1$3$4",$t);
// this preserves the text before ($1), after ($4)
// and inside <strong>..</strong> ($2), but without the tags ($3)
// "This is essentially plain text apart from a few html tags generated
// with some simplified <strong>markdown</strong> rules: <a href=markdown.html>
// [see here]</a>"
?>
A string like $item="odd|string" would cause some problems, but I won't be using that kind of string anyway... (probably needs htmlentities(...) or the like...)
You could split the string into tag/no-tag parts using preg_split:
$parts = preg_split('/(<(?:[^"\'>]|"[^"<]*"|\'[^\'<]*\')*>)/', $str, -1, PREG_SPLIT_DELIM_CAPTURE);
Then you can iterate over the parts, skipping every odd-indexed part (i.e. the tag parts), and apply your replacement to the remaining text parts:
for ($i=0, $n=count($parts); $i<$n; $i+=2) {
$parts[$i] = preg_replace("/(markdown)/", "<strong>$1</strong>", $parts[$i]);
}
At the end put everything back together with implode:
$str = implode('', $parts);
But note that this is really not the best solution. You would be better off using a proper HTML parser like PHP’s DOM library. See for example these related questions:
Highlight keywords in a paragraph
Regex / DOMDocument - match and replace text not in a link
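A minimal sketch of the DOM route, assuming the input is an HTML fragment like the one in the question. The helper name highlight_word and the wrapper id hl-root are made up for illustration. The idea is to touch only text nodes, so tag names and attribute values such as href=markdown.html can never be rewritten.

```php
<?php
// Hypothetical helper: wraps every occurrence of $word in <strong>,
// but only inside text nodes, so tags and attributes are never touched.
function highlight_word($html, $word) {
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate sloppy markup
    // The XML prolog hints UTF-8; the wrapper div gives one root to read back from.
    $dom->loadHTML('<?xml encoding="UTF-8"><div id="hl-root">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $root  = $xpath->query('//div[@id="hl-root"]')->item(0);
    $regex = '/(' . preg_quote($word, '/') . ')/';

    // Copy the node list first, because the loop modifies the tree.
    foreach (iterator_to_array($xpath->query('.//text()', $root)) as $text) {
        $parts = preg_split($regex, $text->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue;                    // no occurrence in this text node
        }
        $fragment = $dom->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {          // odd entries are the captured matches
                $strong = $dom->createElement('strong');
                $strong->appendChild($dom->createTextNode($part));
                $fragment->appendChild($strong);
            } elseif ($part !== '') {
                $fragment->appendChild($dom->createTextNode($part));
            }
        }
        $text->parentNode->replaceChild($fragment, $text);
    }

    // Serialize the children of the wrapper, i.e. the original fragment.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo highlight_word('simplified markdown rules: <a href="markdown.html">[see here]</a>', 'markdown');
```

Note that the parser normalizes the markup on the way out (for instance, unquoted attribute values come back quoted), so the result is equivalent HTML rather than a byte-for-byte copy of the input.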
First replace the word wherever it follows a tag, prepending a dummy tag so that the text is guaranteed to start after one:
$show=preg_replace("|(>[^<]*)(markdown)|i",'$1<strong>$2</strong>',"<null>$t");
Then delete the dummy tag:
$show=preg_replace("|<null>|",'',$show);
You could split your string into an array at every '<' or '>' using preg_split(), then loop through that array and replace only in the entries that lie outside a tag. Afterwards you combine your array into a string using implode().
This regex should strip all HTML opening and closing tags: /(<.*?>)+/
You can use it with preg_replace like this:
$test = "Hello <strong>World!</strong>";
$regex = "/(<.*?>)+/";
$result = preg_replace($regex,"",$test);
Actually, this is not very efficient, but it worked for me:
$your_string = '...';
$search = 'markdown';
$left = '<strong>';
$right = '</strong>';
$left_Q = preg_quote($left, '#');
$right_Q = preg_quote($right, '#');
$search_Q = preg_quote($search, '#');
while(preg_match('#(>|^)[^<]*(?<!'.$left_Q.')'.$search_Q.'(?!'.$right_Q.')[^>]*(<|$)#isU', $your_string))
$your_string = preg_replace('#(^[^<]*|>[^<]*)(?<!'.$left_Q.')('.$search_Q.')(?!'.$right_Q.')([^>]*<|[^>]*$)#isU', '${1}'.$left.'${2}'.$right.'${3}', $your_string);
echo $your_string;
I have an html (sample.html) like this:
<html>
<head>
</head>
<body>
<div id="content">
<!--content-->
<p>some content</p>
<!--content-->
</div>
</body>
</html>
How do I get the content between the two HTML comments '<!--content-->' using PHP? I want to get it, do some processing and put it back, so I have to both get and put. Is that possible?
esafwan - you could use a regular expression to extract the content of the div (with a certain id).
I've done this for image tags before, so the same rules apply. I'll look out the code and update the message in a bit.
[update] Try this:
<?php
function get_tag( $attr, $value, $xml ) {
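    // pass the delimiter so that "/" in the input is escaped too
    $attr = preg_quote($attr, '/');
    $value = preg_quote($value, '/');
    $tag_regex = '/<div[^>]*'.$attr.'="'.$value.'">(.*?)<\\/div>/si';
    if (preg_match($tag_regex, $xml, $matches)) {
        return $matches[1];
    }
    return null; // no matching div found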
}
$yourentirehtml = file_get_contents("test.html");
$extract = get_tag('id', 'content', $yourentirehtml);
echo $extract;
?>
or more simply:
preg_match("/<div[^>]*id=\"content\">(.*?)<\\/div>/si", $text, $match);
$content = $match[1];
jim
If this is a simple replacement that does not involve parsing of the actual HTML document, you may use a regular expression or even just str_replace for this. But generally, it is not advisable to use regex for HTML, because HTML is not regular and coming up with reliable patterns can quickly become a nightmare.
The right way to parse HTML in PHP is to use a parsing library that actually knows how to make sense of HTML documents. Your best native bet would be DOM, but PHP has a number of other native XML extensions you can use, and there are also a number of third-party libraries like phpQuery, Zend_Dom, QueryPath and FluentDom.
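For the comment markers in the question, a DOM-based sketch could look like the following. The helper name content_between_comments is made up for illustration, and the code assumes that both marker comments are siblings inside the same parent element and that the marker text contains no double quotes.

```php
<?php
// Hypothetical helper: returns the markup between the first two <!--$marker--> comments.
// Assumes both comments share a parent and $marker contains no double quotes.
function content_between_comments($html, $marker) {
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate sloppy markup
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath    = new DOMXPath($dom);
    $comments = $xpath->query('//comment()[normalize-space(.) = "' . $marker . '"]');
    if ($comments->length < 2) {
        return null;                    // markers not found
    }

    // Walk the siblings from the first marker up to the second one.
    $end = $comments->item(1);
    $out = '';
    for ($node = $comments->item(0)->nextSibling;
         $node !== null && !$node->isSameNode($end);
         $node = $node->nextSibling) {
        $out .= $dom->saveHTML($node);
    }
    return $out;
}

$html = "<html><body><div id=\"content\">\n<!--content-->\n<p>some content</p>\n<!--content-->\n</div></body></html>";
echo trim(content_between_comments($html, 'content'));
```

To put processed content back, the same sibling nodes can be removed or replaced through the usual DOM methods (for example replaceChild) before saving the document again.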
If you use the search function, you will see that this topic has been covered extensively and you should have no problems finding examples that show how to solve your question.
<?php
$content=file_get_contents("sample.html");
$comment=explode("<!--content-->",$content);
$comment=explode("<!--content-->",$comment[1]);
var_dump(strip_tags($comment[0]));
?>
Check this, it will work for you.
The problem is with nested divs.
I found a solution here:
<?php // File: MatchAllDivMain.php
// Read html file to be processed into $data variable
$data = file_get_contents('test.html');
// Commented regex to extract contents from <div class="main">contents</div>
// where "contents" may contain nested <div>s.
// Regex uses PCRE's recursive (?1) sub expression syntax to recurs group 1
$pattern_long = '{ # recursive regex to capture contents of "main" DIV
<div\s+class="main"\s*> # match the "main" class DIV opening tag
( # capture "main" DIV contents into $1
(?: # non-cap group for nesting * quantifier
(?: (?!<div[^>]*>|</div>). )++ # possessively match all non-DIV tag chars
| # or
<div[^>]*>(?1)</div> # recursively match nested <div>xyz</div>
)* # loop however deep as necessary
) # end group 1 capture
</div> # match the "main" class DIV closing tag
}six'; // single-line (dot matches all), ignore case and free spacing modes ON
// short version of same regex
$pattern_short = '{<div\s+class="main"\s*>((?:(?:(?!<div[^>]*>|</div>).)++|<div[^>]*>(?1)</div>)*)</div>}si';
$matchcount = preg_match_all($pattern_long, $data, $matches);
// $matchcount = preg_match_all($pattern_short, $data, $matches);
echo("<pre>\n");
if ($matchcount > 0) {
echo("$matchcount matches found.\n");
// print_r($matches);
for($i = 0; $i < $matchcount; $i++) {
echo("\nMatch #" . ($i + 1) . ":\n");
echo($matches[1][$i]); // print 1st capture group for match number i
}
} else {
echo('No matches');
}
echo("\n</pre>");
?>
Have a look here for a code example that means you can load a HTML document into SimpleXML http://blog.charlvn.com/2009/03/html-in-php-simplexml.html
You can then treat it as a normal SimpleXML object.
EDIT: This will only work if you want the content in a tag (e.g. between <div> and </div>)