CakePHP: How can I make excerpt safer? - php

I use $this->Text->excerpt() from TextHelper to excerpt my post description, but I've realized that it isn't safe: it may break my page layout. For example:
<p>Advanced SystemCare 7 PRO provides automated and all-in-one PC care service with Malware Removal</p>
<p>It also creates...
The excerpt may end without the closing </p> tag and break my layout.
If you have any solution, please help me. Thanks.

The way I got around it was to strip the tags beforehand.
$searchDisplayText = strip_tags($modelItemDetails[$model->alias][$fieldName]);
$searchDisplayTextTruncated = String::excerpt($searchDisplayText, $options['keyword'], $settings['excerptLength'] , '...');
You would need to do this anyway, since you probably don't want images or other unsafe items in your results; you only want text.
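The same strip-then-truncate idea can be sketched in plain PHP, without the CakePHP helper. This is only an illustration: the function name plain_excerpt, its parameters, and the sample markup are hypothetical, and the cut is byte-based, so it assumes single-byte text.

```php
<?php
// Hypothetical helper: strip markup first, then truncate the plain text.
function plain_excerpt($html, $maxLength = 100, $ending = '...')
{
    // Put a space before every tag so words in adjacent blocks do not fuse,
    // then drop all tags so a cut can never leave an element unclosed.
    $text = strip_tags(str_replace('<', ' <', $html));
    // Collapse the whitespace left behind by the removed tags.
    $text = trim(preg_replace('/\s+/', ' ', $text));
    if (strlen($text) <= $maxLength) {
        return $text;
    }
    // Byte-based cut; assumes single-byte text.
    return rtrim(substr($text, 0, $maxLength)) . $ending;
}

// Sample markup made up for this example.
$html = '<p>First paragraph of a post.</p><p>Second paragraph with <b>bold</b> text.</p>';
echo plain_excerpt($html, 40), "\n";
```

Because the tags are gone before the cut is made, the result can be placed inside any wrapper element without leaving a tag open.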

Related

Changing/deleting html from file_get_contents

I'm currently using this code:
$blog= file_get_contents("http://powback.tumblr.com/post/" . $post);
echo $blog;
And it works. But Tumblr has added a script that activates each time you enter a password field. So my question is:
Can I remove certain parts of what file_get_contents returns? Or just remove everything above the <html> tag? Could I possibly kill a whole div so it won't load at all? And if so, how?
Edit:
I managed to do it the simple way, by skipping the first 766 characters. The script now works as intended!
$blog = file_get_contents("http://powback.tumblr.com/post/" . $post, false, NULL, 766);
After file_get_contents returns, you have in your hands a string. You can do anything you want to it, including cutting out parts of it.
There are two ways to actually do the cutting:
Using string functions like str_replace, preg_replace and others; the exact recipe depends on what you need to do. This approach is kind of frowned upon because you are working at the wrong level of abstraction, but in some cases it has an unmatched performance to time spent ratio.
Parsing the HTML into a DOM tree, modifying it appropriately (this time working at the appropriate level of abstraction), and then turning it back into a string and echoing it. This can be more convenient to work with if your requirements are not dead simple, and it is easier to maintain, but it typically requires more code to be written.
If you want to do something that's most naturally expressed in HTML document terms ("cutting out this <div>"), then don't be tempted by the string functions; go with the second approach.
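A minimal sketch of the second approach, using PHP's bundled DOM extension. The function name remove_element_by_id and the sample markup are hypothetical, and $id is assumed to be a trusted literal that contains no double quotes.

```php
<?php
// Hypothetical helper: parse the HTML, drop every element with the given id,
// and serialise the document back to a string.
function remove_element_by_id($html, $id)
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely strict; collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // $id is assumed to be trusted and free of double quotes.
    foreach ($xpath->query('//*[@id="' . $id . '"]') as $node) {
        $node->parentNode->removeChild($node);
    }
    return $doc->saveHTML();
}

// Sample markup made up for this example.
$page = '<html><body><div id="ad">unwanted</div><p>keep</p></body></html>';
echo remove_element_by_id($page, 'ad');
```

In the question's setting, $page would be the string returned by file_get_contents.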
At that point, $blog is just a string, so you can use normal PHP functions to alter it. Look into these 2:
http://php.net/manual/en/function.str-replace.php
http://us2.php.net/manual/en/function.preg-replace.php
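As a rough illustration of the string-function route, the sketch below removes script blocks and anything before the <html> tag. The function names are hypothetical, and the pattern assumes well-formed script tags; it is not a general HTML parser.

```php
<?php
// Hypothetical helper: delete every <script>...</script> block.
function strip_script_blocks($html)
{
    // s: dot also matches newlines, i: case-insensitive.
    // The non-greedy .*? removes each block separately.
    return preg_replace('#<script\b[^>]*>.*?</script>#is', '', $html);
}

// Hypothetical helper: keep only what starts at the <html> tag.
function content_from_html_tag($html)
{
    $pos = stripos($html, '<html');
    return $pos === false ? $html : substr($html, $pos);
}

// Sample markup made up for this example.
$blog = "<script>track();</script>\n<html><body><script src=\"x.js\"></script><p>Post</p></body></html>";
echo strip_script_blocks(content_from_html_tag($blog));
```

Searching for the <html> tag avoids depending on a fixed character offset, which would break as soon as the page's preamble changes length.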
You can parse your output using Simple HTML DOM Parser and display only the contents that you really want to display.

Parsing Wiki API content

I have this wiki page from the API http://fr.wikipedia.org/w/api.php?action=query&titles=%C9rythropo%EF%E9tine&prop=revisions&rvprop=content&format=xmlfm
from which I would like to retrieve the main content, starting from:
L''''érythropoïétine''' ('''EPO''') est une [[hormone]] ......etc
As a start, I tried to use preg_replace to remove everything from the top, starting at "{{Chimiebox...", down to the closing "}}", using this:
preg_replace( '/^{{(.*)}}$/sim', '', $value[0]['*'] );
But it doesn't quite work. Does anyone know of a good way to determine the start of the content? Thanks for any advice.
Well, as far as I know, most projects use the Wikipedia parser directly, e.g. the Wikipedia Offline Client Project at my university. Since you seem to be using PHP, this may be the easiest way for you.
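If pulling in a full parser is more than is needed, one alternative is to skip the leading template by counting braces, since templates can nest and a single regex has trouble finding the matching "}}". This is only a sketch: the function name skip_leading_template and the sample text are hypothetical, and it assumes the page begins with one "{{...}}" block and does not handle triple-brace parameters.

```php
<?php
// Hypothetical helper: drop one leading {{...}} template, honouring nesting.
function skip_leading_template($wikitext)
{
    $text = ltrim($wikitext);
    if (substr($text, 0, 2) !== '{{') {
        return $text; // nothing to skip
    }
    $depth = 0;
    $length = strlen($text);
    for ($i = 0; $i < $length - 1; $i++) {
        $pair = substr($text, $i, 2);
        if ($pair === '{{') {
            $depth++;
            $i++; // step over the second brace
        } elseif ($pair === '}}') {
            $depth--;
            $i++;
            if ($depth === 0) {
                // Everything after the matching close is the content.
                return ltrim(substr($text, $i + 1));
            }
        }
    }
    return $text; // unbalanced braces: leave the input untouched
}

// Sample wikitext made up for this example.
echo skip_leading_template("{{Box|a={{b}}}}\nHello"), "\n";
```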

Regex to alter img attributes in Wordpress Filter

I have a custom theme I've developed for a photographer client and need to implement lazy-loading of the images so that the blog loads faster, as it is horribly slow due to the number of images he currently has, even when only showing five posts. To do this I'm using the JAIL jQuery plugin, but I need to be able to modify the image tags for it to work properly. Basically, I have to replace the src attribute with a placeholder and set a data-href attribute to the source URL. I cannot seem to find a solution that works properly inside a WordPress filter; I'm basically filtering the the_content() hook in the posts. Does anyone know how I could accomplish this?
The standard Stack Overflow cliché for these questions is that you should use a DOM parser. That is actually correct, but not quite feasible (for performance reasons) for output manipulation.
To accomplish what you want you could try:
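$html = preg_replace_callback(
    // Group 1: the tag up to and including "src"; group 2: the image URL.
    '#(<img\s[^>]*src)="([^"]+)"#',
    "callback_img", $html);
Then define a callback like this:
function callback_img($match) {
    // Skip the full match; keep the tag prefix and the original URL.
    list(, $img, $src) = $match;
    // Point src at the placeholder and move the real URL to data-href.
    return "$img=\"placeholder\" data-href=\"$src\" ";
}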
Note that this regex is only workable if all your image links follow this scheme consistently (they all should be using double quotes for example).
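A slightly more tolerant variant, offered as a sketch rather than a tested WordPress solution: it accepts single or double quotes and uses a closure (PHP 5.3+). The function name lazyload_images and the placeholder file name are hypothetical.

```php
<?php
// Hypothetical filter function: swap src for a placeholder and keep the
// real URL in data-href. Accepts src='...' as well as src="...".
function lazyload_images($html, $placeholder = 'placeholder.gif')
{
    return preg_replace_callback(
        // Group 1: tag start and earlier attributes; 2: quote; 3: URL.
        // The \s before src keeps attributes such as data-src untouched.
        '#(<img\b[^>]*?)\ssrc=(["\'])(.*?)\2#i',
        function ($m) use ($placeholder) {
            return $m[1] . ' src="' . $placeholder . '" data-href="' . $m[3] . '"';
        },
        $html
    );
}

// In WordPress this would presumably be registered with something like
// add_filter('the_content', 'lazyload_images'); it is left out here so the
// sketch runs on its own.
echo lazyload_images('<p><img class="x" src=\'photo.jpg\' alt=""></p>'), "\n";
```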

WMD markdown editor - HTML to Markdown conversion

I am using wmd markdown editor on a project and had a question:
When I post the form containing the markdown text area, it (as expected) posts HTML to the server. However, say server-side validation fails and I need to send the user back to edit their entry: is there any way to refill the textarea with just the markdown and not the HTML? As I have it set up, the server only has access to the post data (which is in the form of HTML), so I can't think of a way to do this. Any ideas? Preferably a non-JavaScript-based solution.
Update: I found an html to markdown converter called markdownify. I guess this might be the best solution for displaying the markdown back to the user...any better alternatives are welcome!
Update 2: I found this post on SO and I guess there is an option to send the data to the server as markdown instead of html. Are there any downsides to simply storing the data as markdown in the database? What about displaying it back to the user (outside of an editor)? Maybe it would be best to post both versions (html AND markdown) to the server...
SOLVED: I can simply use php markdown to convert the markdown to html serverside.
I would suggest that you simply send and store the text as Markdown. This seems to be what you have settled on already. IMO, storing the text as Markdown will be better because you can safely strip all HTML tags out without worrying about loss of formatting. This makes your code safer, because it will be harder to mount an XSS attack (it may still be possible; I am only saying that this part will be safer).
One thing to consider is that WMD appears to have certain different edge cases from certain server-side Markdown implementations. I've definitely seen some quirks in the previews here that have shown up differently after submission (I believe one such case was attempting to escape a backtick surrounded by backticks). By sending the converted preview over the wire, you can ensure that the preview is accurate.
I'm not saying that should make your decision, but it's something to consider.
Try out Pandoc. It's a little more comprehensive and reliable than Markdownify.
The HTML you are seeing is just a preview, so it's not a good idea to store that in the database, as you will run into issues when you try to edit. It's also not a good idea to store both versions (Markdown and HTML), as the HTML is just an interpretation and you will have the same problems of editing and keeping both versions in sync.
So the best idea is to store the markdown in the db and then convert it server side before displaying.
You can use PHP Markdown for this purpose. However, its output is not a 100% match for what you are seeing on the JavaScript side and may need some tweaking.
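With the raw Markdown stored, refilling the editor after a failed validation only requires escaping it for the textarea. A minimal sketch, where the function name render_edit_field and the field name are hypothetical:

```php
<?php
// Hypothetical helper: put the stored raw Markdown back into the edit form.
// Escaping keeps characters such as < and & from being read as markup.
function render_edit_field($name, $markdown)
{
    return '<textarea name="' . htmlspecialchars($name, ENT_QUOTES, 'UTF-8') . '">'
        . htmlspecialchars($markdown, ENT_QUOTES, 'UTF-8')
        . '</textarea>';
}

// The browser decodes the entities, so the user sees the original Markdown.
echo render_edit_field('body', "**hi** <b>"), "\n";
```

The conversion to HTML then happens only when the text is displayed outside the editor.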
The version that the Stack Exchange network is using is a C# implementation, and there should be a Python implementation that you downloaded with the version of WMD you have.
The one thing I tweaked was the way new lines were rendered: I changed markdown.php to convert some new lines into <br>, starting from line 626 in the version I have:
var $span_gamut = array(
    #
    # These are all the transformations that occur *within* block-level
    # tags like paragraphs, headers, and list items.
    #
    # Process character escapes, code spans, and inline HTML
    # in one shot.
    "parseSpan"           => -30,
    # Process anchor and image tags. Images must come first,
    # because ![foo][f] looks like an anchor.
    "doImages"            =>  10,
    "doAnchors"           =>  20,
    # Make links out of things like `<http://example.com/>`
    # Must come after doAnchors, because you can use < and >
    # delimiters in inline links like [this](<url>).
    "doAutoLinks"         =>  30,
    "encodeAmpsAndAngles" =>  40,
    "doItalicsAndBold"    =>  50,
    "doHardBreaks"        =>  60,
    "doNewLines"          =>  70,
);
function runSpanGamut($text) {
    #
    # Run span gamut transformations.
    #
    foreach ($this->span_gamut as $method => $priority) {
        $text = $this->$method($text);
    }
    return $text;
}
function doNewLines($text) {
    return nl2br($text);
}

"Safe" markdown processor for PHP?

Is there a PHP implementation of markdown suitable for using in public comments?
Basically it should only allow a subset of the markdown syntax (bold, italic, links, block-quotes, code-blocks and lists), and strip out all inline HTML (or possibly escape it?)
I guess one option is to use the normal markdown parser, and run the output through an HTML sanitiser, but is there a better way of doing this..?
We're using PHP Markdown Extra for the rest of the site, so we'd already have to use a secondary parser (the non-"Extra" version, since things like footnote support are unnecessary). It also seems nicer to parse only the *bold* text and have everything else escaped to &lt;a href="etc"&gt;, than to generate <b>bold</b> text and try to strip the bits we don't want.
Also, on a related note, we're using the WMD control for the "main" site, but for comments, what other options are there? WMD's JavaScript preview is nice, but it would need the same "neutering" as the PHP Markdown processor (it can't display images and so on, otherwise someone will submit and their working markdown will "break").
Currently my plan is to use the PHP Markdown -> HTML sanitiser method, and edit WMD to remove the image/heading syntax from showdown.js, but it seems like this has been done countless times before.
Basically:
Is there a "safe" markdown implementation in PHP?
Is there a HTML/javascript markdown editor which could have the same options easily disabled?
Update: I ended up simply running the markdown() output through HTML Purifier.
This way the Markdown rendering is separate from output sanitisation, which is much simpler (two mostly unmodified code bases), more secure (you're not trying to do both rendering and sanitisation at once), and more flexible (you can have multiple sanitisation levels, say a more lax configuration for trusted content and a much more stringent one for public comments).
PHP Markdown has a sanitizer option, but it doesn't appear to be advertised anywhere. Take a look at the top of the Markdown_Parser class in markdown.php (starts on line 191 in version 1.0.1m). We're interested in lines 209-211:
# Change to `true` to disallow markup or entities.
var $no_markup = false;
var $no_entities = false;
If you change those to true, markup and entities, respectively, should be escaped rather than inserted verbatim. There doesn't appear to be any built-in way to change those (e.g., via the constructor), but you can always add one:
function do_markdown($text, $safe=false) {
    $parser = new Markdown_Parser;
    if ($safe) {
        $parser->no_markup = true;
        $parser->no_entities = true;
    }
    return $parser->transform($text);
}
Note that the above function creates a new parser on every run rather than caching it like the provided Markdown function (lines 43-56) does, so it might be a bit on the slow side.
JavaScript Markdown Editor Hypothesis:
Use a JavaScript-driven Markdown Editor, e.g., based on showdown
Remove all icons and visual clues from the Toolbar for unwanted items
Set up a JavaScript filter to clean-up unwanted markup on submission
Test and harden all JavaScript changes and filters locally on your computer
Mirror those filters in the PHP submission script, to catch same on the server-side.
Remove all references to unwanted items from Help/Tutorials
I've created a Markdown editor in JavaScript, but it has enhanced features. That took a big chunk of time and SVN revisions. But I don't think it would be that tough to alter a Markdown editor to limit the HTML allowed.
How about running htmlspecialchars on the user-entered input before processing it through Markdown? It should escape anything dangerous, but leave everything that Markdown understands.
I'm trying to think of a case where this wouldn't work but can't think of anything off hand.
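A quick way to see what this does to typical input is to run htmlspecialchars on a sample containing both Markdown and HTML. The sample string is made up for illustration; note that the ">" that starts a blockquote is escaped along with the tags, so blockquotes are one case worth checking against whichever Markdown parser is in use.

```php
<?php
// Made-up sample input: a blockquote marker, emphasis, and a script tag.
$input = "> quoted line\n\n*bold* <script>alert(1)</script>";

// Escape before handing the text to a Markdown processor.
$escaped = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

// The script tag is neutralised and *bold* is untouched, but the leading
// ">" has become "&gt;" as well.
echo $escaped, "\n";
```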
