PHP character limits (trim an HTML paragraph) - php

We have our own blog system and the post data is stored as raw HTML, so when it's fetched from the db we can just echo it and it's fully formatted; no need for BB codes in our situation. Our issue now is that our blog posts are sometimes too long and need to be trimmed.
The problem is that our data contains HTML, mostly <font>, <span>, <p>, <b>, and other styling tags. I made a PHP function that trims the characters, but it doesn't take the HTML tags into account. If the trim function trims a post, it should not cut through tags, because that messes up the whole page. The function needs to be able to close any HTML tags that are left open by the trim. Is there a function out there that can do this, or one I could start from and build on?

There's a good example here of truncating text while preserving HTML tags.

There is strip_tags which gets rid of all HTML tags but other than that there isn't much.
This is not an easy thing by the way, you have to actually parse the HTML to find out which tags are left open - that's the most robust approach anyway. Also, don't use a regular expression.

The right solution is to not store display information in your database layer.
Failing that, you could use CSS overflow properties: print the whole post, and then have the display layer handle sizing it to fit. This mitigates the problem of having formatting information in your database by putting the resizing (a display issue, not a content issue) into the display layer as well.
Failing that, you could parse the HTML and "round up" or "round down" to the nearest tag boundary, then insert the tag-close characters necessary to finish the block you were in.
Another option is to iframe the content.
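The "parse the HTML and close the open tags" option can be sketched roughly as below. This is a minimal sketch, not a drop-in solution: truncateHtml() and truncateNode() are hypothetical names, the input is assumed to be a UTF-8 fragment, and the dom and mbstring extensions are assumed to be available. Letting DOMDocument serialize the trimmed tree is what closes the tags.

```php
<?php
// Hypothetical helper: keep at most $limit characters of text, leaving tags balanced.
function truncateHtml(string $html, int $limit): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate sloppy markup
    // The XML prolog hints UTF-8 to the parser; the <div> gives the fragment one root.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $root = $doc->getElementsByTagName('div')->item(0);
    $remaining = $limit;
    truncateNode($root, $remaining);

    // Serialize only the wrapper's children, so the wrapper itself is not output.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

// Walk the tree in document order, spending the character budget on text nodes.
function truncateNode(DOMNode $node, int &$remaining): void
{
    // Copy the child list first: removing nodes from a live list skips items.
    foreach (iterator_to_array($node->childNodes, false) as $child) {
        if ($remaining <= 0) {
            $node->removeChild($child);           // budget spent: drop the rest
        } elseif ($child instanceof DOMText) {
            $length = mb_strlen($child->data, 'UTF-8');
            if ($length > $remaining) {
                $child->data = mb_substr($child->data, 0, $remaining, 'UTF-8');
                $remaining = 0;
            } else {
                $remaining -= $length;
            }
        } else {
            truncateNode($child, $remaining);     // descend into elements
        }
    }
}

echo truncateHtml('<p>Hello <b>big world</b> and more</p><p>Second</p>', 9);
```

This counts characters of text only, so tags never eat into the limit. It cuts mid-word, though; backing up to the previous space would be a further step.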

I know this isn't the best way to do it programmatically, but have you considered manually specifying where the cut should be? Adding a marker of some kind to the post and cutting at that marker would let you control where the cut happens, regardless of the number of characters before it. For example, you could always put the marker below the first paragraph.
Admittedly, you lose the ability to just have it happen automatically, but I bring it up in case that doesn't matter as much to you.

Related

PHP: Regex replace while ignoring content between html tags

I'm looking for a regular expressions string that can find a word or regex string NOT between html tags.
Say I want to replace (alpha|beta) in: the first two letters in the greek alphabet are alpha and <b>beta</b>
I only want it to replace alpha, because beta is between <> tags. So ignore (<(.*?)>(.*?)<\/(.*?)>)
:)
I didn't test the logic used in this page - http://www.phpro.org/examples/Get-Text-Between-Tags.html But I can confirm the logical point made at the top of the page in big bold letters that says you shouldn't do what you're trying to do with regex.
Html is not uniform and edge cases will always bite you in the rear if you use regular expressions to handle the content of those tags in any real world situation. So unless your markup is extremely simplistic, uniform, 100% accurate, only contains html (not css, javascript or garbage) then your best bet is a dom parser library.
And really, many DOM parser libraries have problems too, but you'll be miles ahead of the regex counterparts. The best way to get the text content of tags is to render the HTML in a browser and access the innerText property of the given DOM node (or have a human copy and paste the contents out manually) - but that isn't always an option :D
It's maybe the 'wrong' way, but it works: when I need to do something similar, I first do a preg_replace_callback to find what I don't want to match and encode it with something like base64.
Then I can happily run an ordinary preg_replace on the result, knowing that it has no chance of matching the strings I want to ignore. Then unscramble using the same pattern in preg_replace_callback, this time sending the matches to be base64 decoded.
I often do this when automatically adding keyword or glossary links or tooltips to a text - I scramble the HTML tags themselves so that I don't try to create a link or a tooltip within the title of an anchor tag or somewhere equally ridiculous, for example.
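A rough sketch of that mask / replace / unmask idea, using the question's alpha/beta example. replaceOutsideTags() is a hypothetical name, and the masking pattern is my own assumption: it only handles simple, non-nested elements plus stray tags. Also note that base64 output is made of letters and digits, so a word pattern could in principle match inside a token by accident.

```php
<?php
// Hypothetical helper: run a regex replace only on text outside HTML elements.
function replaceOutsideTags(string $pattern, string $replacement, string $html): string
{
    // Step 1: hide simple elements (<b>...</b>) and leftover lone tags behind
    // base64 tokens delimited by the control characters \x01 and \x02.
    $masked = preg_replace_callback(
        '~<(\w+)[^>]*>.*?</\1>|<[^>]+>~s',
        fn (array $m): string => "\x01" . base64_encode($m[0]) . "\x02",
        $html
    );

    // Step 2: an ordinary replace; the hidden markup is no longer visible to it.
    $replaced = preg_replace($pattern, $replacement, $masked);

    // Step 3: decode the tokens back into the original markup.
    return preg_replace_callback(
        '~\x01([A-Za-z0-9+/=]+)\x02~',
        fn (array $m): string => base64_decode($m[1]),
        $replaced
    );
}

echo replaceOutsideTags(
    '/\b(alpha|beta)\b/',
    '[$1]',
    'the first two letters are alpha and <b>beta</b>'
);
```

Only the bare "alpha" gets wrapped in brackets; the "beta" inside <b> is left alone because it was encoded while the replace ran.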

How might I truncate HTML with JS (preferred) or PHP?

I am trying to use JS (preferred) or PHP to access APIs like StackOverflow, Tumblr & Forrst to get my latest posts to display in my blog. So I will need a way to truncate the HTML returned, so that it fits into a "widget" sized space.
How might I do it with JS or PHP? It should
not truncate creating invalid HTML
not truncate words (leaving half a word for example)
I am also considering stripping out code blocks or images that otherwise may not fit well. But this is secondary
Well, my guess is that when you truncate a piece of markup you have to be careful not to break it [in the case of HTML, make sure all opening and closing tags remain intact], assuming you want to keep those code blocks. This will take a good piece of code, heavily loaded with regex, and I doubt JavaScript would be a good way to achieve it - PHP would be a much faster and safer way...
On the other hand, if you are considering getting rid of all code blocks, first use PHP's strip_tags() function [you can pass "<img>" as its second parameter to keep IMG tags], like:
$clean = strip_tags( $incoming, "<img>" );
And then truncate your text, making sure you are not cutting through the closing ">" characters of the remaining tags. Again, regex will do the job: use conditionals, lookaheads and lookbehinds to achieve that goal.
Once you're done with tags, it's time to make sure you are not damaging your multi-byte characters: truncating by bytes can corrupt a multi-byte character by splitting its bytes apart. To avoid this, use PHP's mb_substr() function. While truncating, you might also want the remaining HTML tags not to count as characters - using regex, you can temporarily replace them with placeholders and, once truncation is done, put the original values back.
So, "simply" put: It requires good command of PHP and some coding, which is hard to post here, I am afraid.
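For the simpler "strip everything, then truncate" route, a minimal sketch could look like this. truncatePlainText() is a hypothetical name; it strips all tags (rather than keeping <img>), assumes UTF-8 input and the mbstring extension, and backs up to the last space so a word is not cut in half.

```php
<?php
// Hypothetical helper: plain-text excerpt of at most $limit characters, cut on a word boundary.
function truncatePlainText(string $html, int $limit, string $ellipsis = '...'): string
{
    $text = trim(strip_tags($html));
    if (mb_strlen($text, 'UTF-8') <= $limit) {
        return $text; // short enough already
    }

    // Look one character past the limit: if that character is a space,
    // the last word inside the limit is complete and can be kept.
    $window = mb_substr($text, 0, $limit + 1, 'UTF-8');
    $lastSpace = mb_strrpos($window, ' ', 0, 'UTF-8');

    $cut = $lastSpace === false
        ? mb_substr($text, 0, $limit, 'UTF-8')      // one long word: hard cut
        : mb_substr($window, 0, $lastSpace, 'UTF-8'); // drop the partial word

    return rtrim($cut) . $ellipsis;
}

echo truncatePlainText('<p>Hello <b>wonderful</b> world</p>', 10);
```

Because mb_substr() counts characters rather than bytes, the cut cannot land in the middle of a multi-byte character.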
Depending on your needs, you may not actually need to do any truncating at all. Instead, you might be able to style the container that you put the HTML in and set overflow: hidden; to prevent it taking up more space than you want.
This way, you know that you won't be cutting a word in half (as the browsers will "wrap" it nicely) and you know that you won't be accidentally breaking the HTML code, as it will all still be there.
As I said, depending on your specific needs, and the specific HTML that you are getting back, this may or may not be an option. But I think it's worth at least considering.

How to get some elements from html source and convert them to readable text?

I have a page which displays "HeLLo 54292" in ASCII art, using + characters inside <table> tags to produce block letters. I'm generating this with PHP. You can check out page's html source code, and see how the ASCII art is constructed.
I want to convert the ASCII-art letters to actual text, so I could parse that HTML source and would end up with the string "HeLLo 54292". How would I accomplish this?
Step 1: Write an HTML rendering engine in PHP. It will parse the HTML, lay out the page and render it to an image.
Step 2: Write an optical character recognition library in PHP. It will take an image as input, and identify letters in that image by their shapes.
Step 3: Combine those programs and you can convert your tables back to text.
Estimated time for full solution: 1-2 years.
I believe you could package this as a task on Mechanical Turk. This exactly fits the profile of solving problems which are presented via browser rendering.
https://www.mturk.com/mturk/welcome
The latency would be pretty good, probably just a little bit faster than Stack Overflow.
Actually, ok, if you hook it up to SO.. No seriously, those of you reading this, would you rather get three pennies, or 10 rep points? Mmmmm?
Wow, I'm gonna go with impossible. Why would you need to convert it to text? Do you have a program generating text in such a format? If so, what's stopping you from getting the original variable??
Deconstruct the HTML by using the same patterns you used to produce it.
You used PHP to create that HTML from a string. Reverse the process to convert the HTML back into a string. You have the source code, it should be easy.
Do a reverse replace of each string representing a pixel and recreate the pattern. Then compare that pattern to the one you generated from each character to find the sequence.
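To illustrate the "reverse the process" idea, here is a toy sketch. Everything in it is an assumption for illustration: the GLYPHS table, the fixed 3x3 glyph size, and the render()/decode() names are hypothetical, and it works on rows of plain text rather than on the <table> markup from the question. The point is only that the same table used for generating can be inverted for decoding.

```php
<?php
// Hypothetical glyph table: each character is three rows of three cells.
const GLYPHS = [
    'H' => ['+ +', '+++', '+ +'],
    'I' => ['+++', ' + ', '+++'],
    'L' => ['+  ', '+  ', '+++'],
];

// Forward direction: text to rows of ASCII art.
function render(string $text): array
{
    $rows = ['', '', ''];
    foreach (str_split($text) as $ch) {
        foreach (GLYPHS[$ch] as $i => $line) {
            $rows[$i] .= $line;
        }
    }
    return $rows;
}

// Reverse direction: slice the rows into 3-column blocks and look each one up.
function decode(array $rows): string
{
    $lookup = [];
    foreach (GLYPHS as $ch => $lines) {
        $lookup[implode("\n", $lines)] = $ch; // pattern => character
    }

    $text = '';
    for ($x = 0; $x < strlen($rows[0]); $x += 3) {
        $key = implode("\n", array_map(fn (string $r): string => substr($r, $x, 3), $rows));
        $text .= $lookup[$key] ?? '?';        // unknown pattern
    }
    return $text;
}

echo decode(render('HILL'));
```

With the real page you would first extract the cell contents from the table rows, then apply the same kind of lookup.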
I voted to close this as not a real question. But, on the off chance that this is somehow a real question, I'll try to provide a real answer.
What I would suggest, assuming that the characters are not always the same and your goal here is to convert any ASCII art text to a string representation, would be to render the page to an image and try to use some sort of [OCR program](http://en.wikipedia.org/wiki/Optical_character_recognition) to attempt to recognize the characters and determine what the original text was.
Of course if the ASCII art always uses the same characters, you could parse this using RegExes or other string manipulation.

Cleaning an HTML string saving some tags and attributes

After I implemented my sanitize functions (according to the requested specifications), my boss decided to change the accepted input. Now he wants to keep some specific tags and their attributes. I suggested implementing a BBCode-like language, which is safer imho, but he doesn't want to because it would be too much work.
This time I would like to keep it simple so I will not kill him the next time he asks me to change again this thing. And I know he will.
Is it enough to use first the strip_tags with the tag parameter to preserve and then htmlentities?
strip_tags does not necessarily result in safe content. strip_tags followed by htmlentities would be safe, in that anything HTML-encoded is safe, but it doesn't make any sense.
Either the user is inputting plain text, in which case it should be output using htmlspecialchars (in preference to htmlentities), or they're inputting HTML markup, in which case you need to parse it properly, fixing broken markup and removing elements/attributes that aren't in a safe whitelist.
If that's what you want, use an existing library to do it (e.g. HTML Purifier), because it's not a trivial task, and if you get it wrong you've given yourself XSS security holes.
You can keep specific tags using strip_tags with this syntax: strip_tags($text, '<p><a>');
That snippet would strip all tags except p and a. Attributes are kept for tags you have allowed (p and a in the above example).
However, this doesn't mean that the attributes are safe. Does he want specific attributes, or does he want to keep all of them on allowed tags? For the first case, you would need to parse each tag, remove the attributes that are not wanted, and sanitize the values of the rest. To keep all attributes on allowed tags, you still need to sanitize them. I would recommend running htmlentities on the attribute values to sanitize them (for display, I would assume).
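A small sketch of why strip_tags alone is not enough, plus one way to drop attributes that are not on a whitelist. keepAttributes() and its $allowed map are hypothetical, the input is assumed to be UTF-8, and this is not a substitute for a real sanitizer: for instance, it does not check for javascript: URLs in href.

```php
<?php
// Hypothetical helper: remove every attribute not listed for its tag in $allowed.
function keepAttributes(string $html, array $allowed): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate sloppy markup
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $root = $doc->getElementsByTagName('div')->item(0);
    foreach ($root->getElementsByTagName('*') as $element) {
        $keep = $allowed[$element->nodeName] ?? [];
        // Copy the attribute list before removing entries from it.
        foreach (iterator_to_array($element->attributes, false) as $attribute) {
            if (!in_array($attribute->nodeName, $keep, true)) {
                $element->removeAttributeNode($attribute);
            }
        }
    }

    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$dirty = '<p onclick="steal()">Hi <a href="/x" style="color:red">link</a></p><script>alert(1)</script>';

// strip_tags removes the <script> tags but keeps onclick, style, and the script's text.
$kept = strip_tags($dirty, '<p><a>');
echo $kept, "\n";

// Second pass: only href on <a> survives.
echo keepAttributes($kept, ['a' => ['href']]), "\n";
```

The first echo still contains the onclick handler, which is the point made above: allowed tags keep all of their attributes.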

I'm acquiring a VARCHAR variable through php, and want to show it using HTML/CSS. How do I auto format it so that it isn't one long sentence

I basically want to automatically add line breaks to a VARCHAR variable I acquire from a MySQL database through PHP, so that it shows correctly. At the moment, when I show it through HTML, all it gives me is one long sentence.
I imagine this can be done with CSS, but overflow: scroll just makes it scroll to the right and left. Use of javascript or JQuery is also accepted.
If it really is a long string with no spaces or line breaks, then the word-wrap CSS property can be set to break-word. There are some examples on the MDC page. But, like others have said, if this is a normal sentence with spaces in it, then a browser will wrap it automatically, provided it's not in a pre block. If it's plain text with line breaks that you want preserved, then you can either convert the \n to <br> like elusive suggested, or just place it in a pre block to preserve its text formatting.
By any chance, are the spaces in this long string non-breaking spaces? If so, just replace them with normal spaces.
If not, please show us some of the HTML that is generated and the code that generates it.
I think PHP's nl2br function is what you are looking for. It inserts an HTML line break (<br />) before each newline (\n) in the string. Alternatively, you could take a look at the white-space CSS property.
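A minimal sketch of the nl2br approach, assuming the database value is plain text (not HTML) in UTF-8. formatForDisplay() is a hypothetical name. Escaping first means any markup-looking characters in the text are shown literally, and only the inserted <br /> tags are real HTML.

```php
<?php
// Hypothetical helper: escape plain text, then add <br /> before each newline.
function formatForDisplay(string $text): string
{
    return nl2br(htmlspecialchars($text, ENT_QUOTES, 'UTF-8'));
}

$fromDb = "First line\nSecond <b>line</b>";
echo formatForDisplay($fromDb);
```

The newline itself stays in the output after the <br />, which keeps the generated source readable.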
