How might I truncate HTML with JS (preferred) or PHP?

I am trying to use JS (preferred) or PHP to access APIs like StackOverflow, Tumblr & Forrst to get my latest posts to display on my blog. So I will need a way to truncate the HTML returned, so that it fits into a "widget"-sized space.
How might I do it with JS or PHP? It should:
not create invalid HTML when truncating
not truncate words (leaving half a word, for example)
I am also considering stripping out code blocks or images that otherwise may not fit well, but this is secondary.

When you truncate markup, you have to be careful not to break it: in the case of HTML, every opening tag must still have its closing tag afterwards. That applies if you want to keep the tags, of course. Doing this by hand takes a fair amount of regex-heavy code, and I doubt JavaScript is the best place for it; in my view PHP on the server would be the faster and safer way.
On the other hand, if you would rather get rid of all the tags, first use PHP's strip_tags() function [you can pass <img> as the second parameter to keep IMG tags], like:
$clean = strip_tags($incoming, "<img>");
Then truncate the result, making sure you do not cut through a remaining tag and lose its closing ">" character. Again, regex will do the job: conditionals and lookahead/lookbehind assertions are enough to achieve that.
Once you're done with tags, make sure you are not damaging multi-byte characters: truncating byte by byte can corrupt them by splitting their bytes apart. To avoid this, use PHP's mb_substr() function. While truncating, you may also want the remaining HTML tags not to count as characters: using regex, you can temporarily replace them with placeholders and, once truncation is done, put the original tags back in.
So, "simply" put: it requires a good command of PHP and some coding, which is hard to post here, I am afraid.
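Since the question prefers JS, here is a rough sketch of the idea described above: tags are copied through without counting towards the limit, text is cut on a word boundary, and whatever is still open gets closed at the end. The name truncateHtml and the simple tag scanner are my own assumptions, not a library API; it expects reasonably well-formed input with no "<" or ">" inside attribute values, and a real HTML parser would be more robust.

```javascript
// Elements that never take a closing tag, so they are not tracked as "open".
const VOID_TAGS = new Set(['area', 'base', 'br', 'col', 'embed', 'hr', 'img',
  'input', 'link', 'meta', 'source', 'track', 'wbr']);

// Hypothetical helper: keep at most `limit` visible characters of `html`.
function truncateHtml(html, limit) {
  // Split the input into tags and runs of text.
  const tokens = html.match(/<[^>]+>|[^<]+/g) || [];
  const open = [];  // names of the elements that are currently open
  let out = '';
  let used = 0;     // visible characters emitted so far

  for (const token of tokens) {
    if (token[0] === '<') {
      const m = /^<(\/?)([a-zA-Z][\w-]*)/.exec(token);
      if (m) {
        const name = m[2].toLowerCase();
        if (m[1]) {
          // Closing tag: pop it if it matches the innermost open element.
          if (open[open.length - 1] === name) open.pop();
        } else if (!VOID_TAGS.has(name) && !token.endsWith('/>')) {
          open.push(name);
        }
      }
      out += token;  // tags are copied through and never counted
      continue;
    }
    // Work on code points so surrogate pairs (emoji etc.) are never split.
    const chars = Array.from(token);
    const room = limit - used;
    if (chars.length <= room) {
      out += token;
      used += chars.length;
      if (used === limit) break;
      continue;
    }
    // This text run overflows: keep what fits, then drop a half-cut word.
    let piece = chars.slice(0, room).join('');
    if (!/\s/.test(chars[room])) piece = piece.replace(/\S*$/, '');
    out += piece.replace(/\s+$/, '');
    break;
  }
  // Close whatever is still open, innermost element first.
  while (open.length > 0) out += '</' + open.pop() + '>';
  return out;
}
```

For example, truncateHtml('<p>Hello <b>wonderful</b> world</p>', 17) drops the half-cut word "world" and still closes the paragraph. Entities such as &amp; are counted by their source length here, but because they contain no whitespace they are dropped whole rather than split.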

Depending on your needs, you may not actually need to do any truncating at all. Instead, you might be able to style the container that you put the HTML in and set overflow: hidden; to prevent it taking up more space than you want.
This way, you know that you won't be cutting a word in half (as the browsers will "wrap" it nicely) and you know that you won't be accidentally breaking the HTML code, as it will all still be there.
As I said, depending on your specific needs, and the specific HTML that you are getting back, this may or may not be an option. But I think it's worth at least considering.

Related

How to force line-breaks on &nbsp;?

Sometimes text on my pages looks very strange. A real example:
trained&nbsp;professionals&nbsp;and&nbsp;paraprofessionals&nbsp;coming&nbsp;together
...while the parent div is quite narrow, so the text just sticks out of it.
And it looks quite strange, because &nbsp; actually represents a space.
So, I wonder if it's possible to make the browser treat these characters as actual spaces and break the line where necessary, without actually replacing them?
EDIT
Why is blind replacing a problem?
Because &nbsp; may be needed sometimes.
Consider the following example:
Ranks:<br>
&nbsp;&nbsp;Marshall<br>
&nbsp;&nbsp;Lieutenant<br>
&nbsp;&nbsp;Sergeant
If I just used preg_replace on them, it would look different in the end.
(I would also consider suggestions for replacing them smartly (on the PHP platform), if you can think of an algorithm that wouldn't affect formatting.)
By definition, &nbsp; is a non-breaking space. Its very meaning is that it must not be broken across line endings. If this is not what you intend, then I suggest fixing the HTML instead of trying to force the browser into non-standard behaviour.
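If "fixing the HTML" has to happen after the fact, one possible heuristic is to replace only an isolated &nbsp; that sits directly between two word characters, and to leave leading ones and runs of them (which are usually there for indentation) alone. This is only a sketch: the function name softenNbsp is made up, and it is written in JavaScript rather than PHP, although the same pattern should carry over to preg_replace.

```javascript
// Hypothetical helper: turn a lone &nbsp; between two word characters into a
// normal space. Runs of &nbsp; and leading &nbsp; are left untouched, because
// the character before them (";", ">" or a line break) is not a word character.
function softenNbsp(html) {
  return html.replace(/(\w)&nbsp;(?=\w)/g, '$1 ');
}
```

Punctuation next to the entity (for example "word,&nbsp;word") is deliberately not matched here; widen the character classes if your content needs it.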

php character limits (trim an html paragraph)

We have our own blog system and the post data is stored as raw HTML, so when it's fetched from the DB we can just echo it and it's formatted completely; no need for BB codes in our situation. Our issue now is that our blog posts are sometimes too long and need to be trimmed.
The problem is that our data contains HTML, mostly <font>, <span>, <p>, <b>, and other styling tags. I made a PHP function that trims the characters, but it doesn't take the HTML tags into account. When the function trims a post it should not cut into tags, because that messes up the whole page, and it needs to be able to close any HTML tags left open by the trim. Is there a function out there that can do this, or one I could start from and build on?
There's a good example here of truncating text while preserving HTML tags.
There is strip_tags which gets rid of all HTML tags but other than that there isn't much.
This is not an easy thing, by the way: you have to actually parse the HTML to find out which tags are left open. That's the most robust approach, anyway. Also, don't use a regular expression.
The right solution is to not store display information in your database layer.
Failing that, you could use CSS overflow properties: print the whole post, and then have the display layer handle sizing it to fit. This mitigates the problem of having formatting information in your database by putting the resizing (a display issue, not a content issue) into the display layer as well.
Failing that, you could parse the HTML and "round up" or "round down" to the nearest tag boundary, then insert the tag-close characters necessary to finish the block you were in.
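The "round down" variant could look roughly like the sketch below, shown in JavaScript for brevity. The function name is hypothetical, and the tag scanner is a simplification that assumes well-formed markup without ">" inside attribute values; a real HTML parser is the safer choice, as noted above.

```javascript
// Elements that never take a closing tag.
const VOID_ELEMENTS = new Set(['area', 'base', 'br', 'col', 'embed', 'hr',
  'img', 'input', 'link', 'meta', 'source', 'track', 'wbr']);

// Hypothetical helper: cut `html` at the last tag boundary at or before
// `offset` (an index into the raw string), then close the open elements.
function roundDownToTagBoundary(html, offset) {
  if (offset >= html.length) return html;
  const open = [];
  let cut = 0;
  for (const m of html.matchAll(/<(\/?)([a-zA-Z][\w-]*)[^>]*>/g)) {
    const start = m.index;
    const end = start + m[0].length;
    if (start > offset) break;                  // tag lies beyond the offset
    if (end > offset) { cut = start; break; }   // offset falls inside this tag
    cut = end;                                  // whole tag fits: cut after it
    const name = m[2].toLowerCase();
    if (m[1]) {
      if (open[open.length - 1] === name) open.pop();
    } else if (!VOID_ELEMENTS.has(name) && !m[0].endsWith('/>')) {
      open.push(name);
    }
  }
  // Append closing tags for everything still open, innermost first.
  const closers = open.reverse().map((name) => '</' + name + '>').join('');
  return html.slice(0, cut) + closers;
}
```

Because the cut always lands on a tag boundary, a word is never split, but the result can be noticeably shorter than the requested offset when the post contains long runs of plain text.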
Another option is to iframe the content.
I know this isn't the best way to do it programmatically, but have you considered manually specifying where the cut should be? Adding some kind of marker to the post and cutting at it would allow you to control where the cut happens, regardless of the number of characters before it. For example, you could always put the marker below the first paragraph.
Admittedly, you lose the ability to just have it happen automatically, but I bring it up in case that doesn't matter as much to you.

Is it possible to write a regex which checks if a string (javascript & php code) is minified?

Is it possible to write a regular expression which checks if a string (some code) is minified?
Many PHP/JS obfuscators remove white space chars (among other things).
So, the final minified code sometimes looks like this:
PHP:
$a=array();if(is_array($a)){echo'ok';}
JS:
a=[];if(typeof(a)=='object'&&(a instanceof Array)){alert('ok')}
In both cases there are no space chars before and after "{", "}", ";", etc. There are also some other patterns which can help. I am not expecting a high-accuracy regex; I just need one which checks whether at least 100 chars of a string look like minified code.
Thanks in advance.
PURPOSES: web malware scanner
I think a minifier will strip all newline characters, although there might possibly be one at the end of the file still if the minified code was pasted back in a text editor. Something like this will probably be fairly accurate:
/^[^\n\r]+(\r\n?|\n)?$/
That just tests that there are no newline characters in the whole thing except for possibly one at the end. So no guarantees, but I think it will work well on any longish block of code.
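In JavaScript the check can be applied as below; the wrapper name isSingleLine is just for illustration. Note that, as written, the pattern also rejects an empty string.

```javascript
// The pattern from above: anything without line breaks, optionally followed
// by exactly one trailing newline (\n, \r or \r\n).
const SINGLE_LINE = /^[^\n\r]+(\r\n?|\n)?$/;

// Hypothetical wrapper around the regex test.
function isSingleLine(code) {
  return SINGLE_LINE.test(code);
}
```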
The short answer is "no", regex cannot do this.
Your best bet will probably be to do a statistical analysis of the source files, and compare against some known heuristics. For instance, by comparing the variable names against those often found in minimized code. A minimized file probably has a lot of one-character variable names, for instance... and won't have two-character variable names until all the one-character variable names are exhausted... etc.
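As a very rough illustration of the variable-name heuristic, the sketch below measures what share of identifier-like tokens are a single character long. The function name and the small keyword list are my own assumptions; it does not skip string literals or comments, so treat the number as a hint and not as a verdict.

```javascript
// A few JS/PHP keywords that should not count as names (far from complete).
const KEYWORDS = new Set(['var', 'let', 'const', 'function', 'return', 'if',
  'else', 'for', 'while', 'do', 'new', 'typeof', 'instanceof', 'echo',
  'array', 'in', 'of']);

// Hypothetical helper: share of identifier-like tokens that are one character
// long. A leading "$" (PHP variables) is ignored when measuring the length.
function singleCharShare(code) {
  const tokens = code.match(/\$?[A-Za-z_]\w*/g) || [];
  const names = tokens
    .map((token) => token.replace(/^\$/, ''))
    .filter((name) => !KEYWORDS.has(name));
  if (names.length === 0) return 0;
  const short = names.filter((name) => name.length === 1).length;
  return short / names.length;
}
```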
Another option would be simply to run the source file through a minimizer, and see if the output is sufficiently different from the input. If not, it was probably already minimized.
But I have to agree with sg3s's final sentence: If you can explain why you need this, we can probably provide more useful answers to your actual needs.
No. The syntax and the intent of the code don't change when it is minified, and some people who are very familiar with PHP and/or JS write simple functions on one line without any whitespace at all (me :s).
What you could do is count all the whitespace characters in a string, though this would also be unreliable, since some constructs simply need whitespace, like x instanceof y. Also, not all minified code is crammed into a single line (see jQuery UI), so you can't really count on that either.
Maybe you can explain why you need to know this and we can try and find an alternative?
You can't tell whether it was minified or just written like that by hand (this probably only applies to smaller scripts). But you can check whether it contains unnecessary whitespace.
Take a look at an open-source obfuscator/minifier and see what rules it uses to remove whitespace. Validating whether those rules were applied should work; if the regex gets too complex, a simple parser might be needed.
Just make sure that string literals like a="if ( b )" are excluded.
Run it through a parser for that particular language (even a prettifier might work fine) and modify it to count the number of unused characters. Use the percentage of unused chars vs. number of chars in documents as a test for minification. I don't think you can do this accurately with regex, although counting whitespace vs. document content might be okay.
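The cheaper "whitespace vs. document content" count could be sketched like this. Quoted strings are blanked out first so that literals such as a="if ( b )" don't skew the result. The function names and the 5% threshold are assumptions picked for illustration, and comments, regex literals and template strings are not handled.

```javascript
// Hypothetical helper: fraction of characters that are whitespace, measured
// after single- and double-quoted string literals have been emptied.
function whitespaceShare(code) {
  const stripped = code.replace(/"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*'/g, '""');
  if (stripped.length === 0) return 0;
  const spaces = (stripped.match(/\s/g) || []).length;
  return spaces / stripped.length;
}

// The threshold is an arbitrary starting point, not a measured value.
function looksMinified(code, threshold = 0.05) {
  return whitespaceShare(code) < threshold;
}
```

A low threshold instead of zero leaves room for the whitespace that even minified code needs, such as the spaces around instanceof.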

how to check if a php file is obfuscated?

Is there any way we can check if a PHP file has been obfuscated, using PHP? I was thinking regex maybe (for instance, an ionCube-encoded file contains a very long alphabetic string, etc.).
One idea is to check for whitespace. The first thing that an obfuscator will do is to remove extra whitespace. Another thing you can look for is the number of characters per line, as obfuscators will put all the code into few (one?) lines.
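The characters-per-line idea is easy to measure. A minimal sketch follows, in JavaScript for illustration (the same logic is a few lines of PHP); the function name and the shape of the returned object are made up. What counts as a suspiciously long line is left to the caller.

```javascript
// Hypothetical helper: basic line-length statistics for a source file.
// Empty lines are ignored so that blank padding does not lower the average.
function lineStats(source) {
  const lengths = source
    .split(/\r\n|\r|\n/)
    .filter((line) => line.length > 0)
    .map((line) => line.length);
  if (lengths.length === 0) return { lines: 0, longest: 0, average: 0 };
  const total = lengths.reduce((sum, n) => sum + n, 0);
  const longest = lengths.reduce((max, n) => Math.max(max, n), 0);
  return { lines: lengths.length, longest, average: total / lengths.length };
}
```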
Often, obfuscators initialize very large arrays to translate variables into less meaningful names (e.g. see obfuscator article).
One technique may be to search for these super-large arrays close to the top of the class/file. You may be able to hook Xdebug up to look for these. The whole thing of course depends on the obfuscation technique used. Check the obfuscator's source code; there may be patterns it uses that you can search for.
I think you can use token_get_all() to parse the file, then compute some statistics. For example, check the number of function calls (in case the obfuscator uses an eval() string and nothing else) and calculate the average function name length: for obfuscated code it will usually be about 3-5 chars, for normal PHP code it should be much bigger. You can also use a dictionary lookup for function/variable names, check for comments, etc. I think that if you know all the obfuscator formats you want to detect, it will be easy.
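To give an idea of the kind of statistics meant here, the sketch below approximates two of them with plain regexes instead of token_get_all(): the average length of declared function names and the number of eval( calls. That is a swapped-in technique, written in JavaScript for illustration; a real tokenizer will not be fooled by strings or comments the way this is. The function name phpNameStats is hypothetical.

```javascript
// Hypothetical helper: rough statistics over PHP source text.
function phpNameStats(source) {
  // Named function declarations only; anonymous functions do not match.
  const names = [];
  for (const m of source.matchAll(/\bfunction\s+&?\s*([A-Za-z_]\w*)/g)) {
    names.push(m[1]);
  }
  const evalCalls = (source.match(/\beval\s*\(/g) || []).length;
  const total = names.reduce((sum, name) => sum + name.length, 0);
  return {
    functions: names.length,
    averageNameLength: names.length > 0 ? total / names.length : 0,
    evalCalls,
  };
}
```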

Cleaning an HTML string saving some tags and attributes

After I implemented my sanitize functions (according to the requested specifics), my boss decided to change the accepted input. Now he wants to keep some specific tags and their attributes. I suggested implementing a BBCode-like language, which is safer IMHO, but he doesn't want to because it would be too much work.
This time I would like to keep it simple so I will not kill him the next time he asks me to change this again. And I know he will.
Is it enough to first use strip_tags with the parameter for tags to preserve, and then htmlentities?
strip_tags does not necessarily result in safe content. strip_tags followed by htmlentities would be safe, in that anything HTML-encoded is safe, but it doesn't make any sense.
Either the user is inputting plain text, in which case it should be output using htmlspecialchars (in preference to htmlentities), or they're inputting HTML markup, in which case you need to parse it properly, fixing broken markup and removing elements/attributes that aren't in a safe whitelist.
If that's what you want, use an existing library to do it (e.g. HTML Purifier), because it's not a trivial task, and if you get it wrong you've given yourself XSS security holes.
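For the plain-text case, escaping on output is all that is needed. Purely as an illustration of what that escaping does, here is a small JavaScript sketch covering the same five characters; in PHP you would simply call htmlspecialchars. The function name escapeHtml is made up, and the exact entity used for the single quote may differ from PHP's output.

```javascript
// Characters that are significant in HTML text and attribute values.
const HTML_ESCAPES = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&#039;',
};

// Hypothetical helper: escape text so it is displayed literally and is never
// interpreted as markup.
function escapeHtml(text) {
  return String(text).replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch]);
}
```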
You can keep specific tags using strip_tags with this syntax: strip_tags($text, '<p><a>');
That snippet would strip all tags except p and a. Attributes are kept for tags you have allowed (p and a in the above example).
However, this doesn't mean that the attributes are safe. Does he want specific attributes, or does he want to keep all of them on allowed tags? In the first case, you would need to parse each tag, remove every attribute that is not on the desired list, and sanitize the values of the rest. To keep all attributes on allowed tags, you still need to sanitize them. I would recommend running htmlentities on the attribute values to sanitize them (for display, I would assume).
