Finding a point in a string that is not inside BBCodes - php

I have a string which contains the text of an article. This is sprinkled with BBCodes (between square brackets). I need to be able to grab the first say, 200 characters of an article without cutting it off in the middle of a bbcode. So I need an index where it is safe to cut it off. This will give me the article summary.
The summary must be minimum 200 characters but can be longer to 'escape' out of a bbcode. (this length value will actually be a parameter to a function).
It must not give me a point inside a stand alone bbcode (see the pipe) like so: [lis|t].
It must not give me a point between a start and end bbcode like so: [url="http://www.google.com"]Go To Goo|gle[/url].
It must not give me a point inside either the start or end bbcode or in-between them, in the above example.
It should give me the "safe" index which is after 200 and is not cutting off any BBCodes.
Hope this makes sense. I have been struggling with this for a while. My regex skills are only moderate. Thanks for any help!

First off, I would suggest considering what you will do with a post that is entirely wrapped in BBcodes, as is often true in the case of a font tag. In other words, a solution to the problem as stated will easily lead to 'summaries' containing the entire article. It may be more valuable to identify which tags are still open and append the necessary BBcodes to close them. Of course in cases of a link, it will require additional work to ensure you don't break it.

Well, the obvious easy answer is to present your "summary" without any bbcode-driven markup at all (regex below taken from here)
$summary = substr( preg_replace( '|[[\/\!]*?[^\[\]]*?]|si', '', $article ), 0, 200 );
However, doing the job you explicitly describe is going to require more than just a regex. A lexer/parser would do the trick, but that's a moderately complicated topic. I'll see if I can come up with something.
EDIT
Here's a pretty crude version of a lexer, but for this example it works. It converts an input string into BBCode tokens.
<?php
class SimpleBBCodeLexer
{
    const TOKEN_TEXT = 'TEXT';
    const TOKEN_OPEN_TAG = 'OPEN_TAG';
    const TOKEN_CLOSE_TAG = 'CLOSE_TAG';
    protected $tokens = array();
    // Text read so far that has not been emitted as a token yet.
    // Kept per instance so that two lexers never share a buffer.
    protected $tokenFragment = '';
    protected $patterns = array(
        self::TOKEN_OPEN_TAG => "/\\[[a-z].*?\\]/"
        , self::TOKEN_CLOSE_TAG => "/\\[\\/[a-z].*?\\]/"
    );
    public function __construct( $input )
    {
        for ( $i = 0, $l = strlen( $input ); $i < $l; $i++ )
        {
            // Square brackets, because $input{$i} no longer works in PHP 8
            $this->processChar( $input[$i] );
        }
        // A null character flushes whatever is left in the buffer
        $this->processChar();
    }
    protected function processChar( $char=null )
    {
        $this->tokenFragment = $this->processTokenFragment( $this->tokenFragment );
        if ( is_null( $char ) )
        {
            if ( $this->tokenFragment !== '' )
            {
                $this->addToken( $this->tokenFragment );
                $this->tokenFragment = '';
            }
        } else {
            $this->tokenFragment .= $char;
        }
    }
    protected function processTokenFragment( $tokenFragment )
    {
        foreach ( $this->patterns as $type => $pattern )
        {
            if ( preg_match( $pattern, $tokenFragment, $matches ) )
            {
                // Anything in front of the tag is plain text
                if ( $matches[0] != $tokenFragment )
                {
                    $this->addToken( substr( $tokenFragment, 0, -( strlen( $matches[0] ) ) ) );
                }
                $this->addToken( $matches[0], $type );
                return '';
            }
        }
        return $tokenFragment;
    }
    protected function addToken( $token, $type=self::TOKEN_TEXT )
    {
        $this->tokens[] = array( $type => $token );
    }
    public function getTokens()
    {
        return $this->tokens;
    }
}
$l = new SimpleBBCodeLexer( 'some [b]sample[/b] bbcode that [i] should [url="http://www.google.com"]support[/url] what [/i] you need.' );
echo '<pre>';
print_r( $l->getTokens() );
echo '</pre>';
The next step would be to create a parser that loops over these tokens and takes action as it encounters each type. Maybe I'll have time to make it later...
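As a rough idea of what that token-walking step could look like, here is a minimal sketch. It does not use the lexer class above; it splits the text with preg_split instead so that it stays self-contained. The function name safe_cut_index() and the $standalone list of unpaired tag names are assumptions made for illustration, and offsets are byte offsets, so multibyte text would need extra care.

```php
<?php
// Sketch only: safe_cut_index() and $standalone are hypothetical names.
function safe_cut_index(string $text, int $minLen = 200, array $standalone = ['*', 'hr', 'br']): int
{
    // Split into text runs and [tags]; every piece carries its byte offset.
    $pieces = preg_split(
        '/(\[[^\[\]]*\])/',
        $text,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_OFFSET_CAPTURE | PREG_SPLIT_NO_EMPTY
    );

    $depth = 0; // number of paired tags currently open
    foreach ($pieces as [$piece, $offset]) {
        $end = $offset + strlen($piece);
        if (preg_match('/^\[(\/?)([^\s=\]]*)[^\[\]]*\]$/', $piece, $m)) {
            if ($m[1] === '/') {
                $depth = max(0, $depth - 1);   // closing tag
            } elseif (!in_array(strtolower($m[2]), $standalone, true)) {
                $depth++;                      // opening tag of a pair
            }
            if ($depth === 0 && $end >= $minLen) {
                return $end;                   // cut right after the tag
            }
        } elseif ($depth === 0 && $end >= $minLen) {
            return max($offset, $minLen);      // cut inside plain text
        }
    }
    return strlen($text); // no safe point found: keep everything
}
```

A summary would then be substr($article, 0, safe_cut_index($article, 200)). Note that a post wrapped entirely in one tag still yields the whole post, as pointed out in the first answer.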

This does not sound like a job for (only) regex.
"Plain programming" logic is a better option:
grab a character other than a '[', increase a counter;
if you encounter an opening tag, keep advancing until you reach the closing tag (don't increase the counter!);
stop grabbing text when your counter has reached 200.
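The three steps above can be sketched as a plain character loop. This is only one possible reading of them: the name cut_index_by_scanning() is made up for the example, a tag counts as "opening" only if a matching closing tag is found later, and nested tags of the same name are not handled.

```php
<?php
// Sketch only: counts characters outside tags and skips whole [tag]...[/tag] spans.
function cut_index_by_scanning(string $str, int $minLen = 200): int
{
    $len = strlen($str);
    $count = 0; // characters counted so far (tags and tagged text excluded)
    $i = 0;     // current byte offset
    while ($i < $len && $count < $minLen) {
        if ($str[$i] !== '[') {
            $count++;
            $i++;
            continue;
        }
        $close = strpos($str, ']', $i);
        if ($close === false) {
            // Stray '[' without a ']': treat it as ordinary text
            $count++;
            $i++;
            continue;
        }
        $tag = substr($str, $i + 1, $close - $i - 1);
        $i = $close + 1; // jump past the tag itself without counting it
        if (preg_match('/^([a-z*]+)/i', $tag, $m)) {
            // If this tag has a closing partner, jump past that as well
            $end = stripos($str, '[/' . $m[1] . ']', $i);
            if ($end !== false) {
                $i = $end + strlen($m[1]) + 3;
            }
        }
    }
    return $i;
}
```

Because text between an opening and a closing tag is skipped rather than counted, a heavily tagged article produces a longer summary than the nominal limit.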

Here is a start. I don't have access to PHP at the moment, so you might need some tweaking to get it to run. Also, this will not ensure that tags are closed (i.e. the string could have [url] without [/url]). Also, if a string is invalid (i.e. not all square brackets are matched) it might not return what you want.
function getIndex($str, $minLen = 200)
{
    //on short input, return the whole string
    if(strlen($str) <= $minLen)
        return strlen($str);
    //get first minLen characters
    $substr = substr($str, 0, $minLen);
    //does it have a '[' that is not closed?
    if(preg_match('/\[[^\]]*$/', $substr))
    {
        //find the next ']', if there is one
        $pos = strpos($str, ']', $minLen);
        //now, make the substr go all the way to that ']'
        if($pos !== false)
            $substr = substr($str, 0, $pos+1);
    }
    //now, it may be better to return $substr, but you specifically
    //asked for the index, which is the length of this substring.
    return strlen($substr);
}

I wrote this function, which should do just what you want. It counts n characters (not counting those inside tags) and then closes any tags that still need to be closed. An example of its use is included in the code. The code is in Python, but it should be really easy to port to other languages, such as PHP.
def limit(input, length):
    """Splits a text after (length) characters, preserving bbcode"""
    stack = []
    counter = 0
    output = ""
    tag = ""
    insideTag = 0  # 0 = Outside tag, 1 = Opening tag, 2 = Closing tag, 3 = Opening tag, parameters section
    for i in input:
        if counter >= length:  # If we have reached the max length (add " and i == ' '") to not make it split in a word
            break
        elif i == '[':  # If we have reached a tag
            insideTag = 1
        elif i == '/' and insideTag == 1:  # A slash right inside an opening tag makes it a closing tag
            insideTag = 2
        elif i == '=' and insideTag >= 1:  # If we have reached the parameters of a tag
            insideTag = 3
        elif i == ']':  # If we have reached the closing of a tag
            if insideTag == 2:  # If we are in a closing tag
                if stack:  # Ignore a stray closing tag instead of crashing
                    stack.pop()  # Pop the last tag, we closed it
            elif insideTag >= 1:  # If we are in a tag, parameters or not
                stack.append(tag)  # Add current tag to the tag-stack
            insideTag = 0
            tag = ""
        elif insideTag == 0:  # If we are not in a tag
            counter += 1
        elif insideTag <= 2:  # If we are in a tag and not among the parameters
            tag += i
        output += i
    while len(stack) > 0:
        output += '[/' + stack.pop() + ']'  # Add the remaining tags
    return output
cutText = limit('[font]This should be easy:[img]yippee.png[/img][i][u][url="http://www.stackoverflow.com"]Check out this site[/url][/u]Should be cut here somewhere [/i][/font]', 60)
print(cutText)
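Since the question is about PHP, here is a rough port of the same idea. It is a sketch rather than a line-by-line translation: the name limit_bbcode() is made up, the state numbers follow the Python version, and a stray closing tag is simply ignored.

```php
<?php
// Sketch only: PHP port of the Python limit() idea above.
function limit_bbcode(string $input, int $length): string
{
    $stack = [];   // names of tags that are open and not yet closed
    $counter = 0;  // characters counted outside tags
    $output = '';
    $tag = '';     // name of the tag currently being read
    $state = 0;    // 0 outside tag, 1 opening tag, 2 closing tag, 3 parameters

    for ($p = 0, $n = strlen($input); $p < $n; $p++) {
        if ($counter >= $length) {
            break;
        }
        $ch = $input[$p];
        if ($state === 0) {
            if ($ch === '[') {
                $state = 1;
                $tag = '';
            } else {
                $counter++;
            }
        } elseif ($ch === ']') {
            if ($state === 2) {
                array_pop($stack);   // closing tag: the last open tag is done
            } else {
                $stack[] = $tag;     // opening tag: remember it
            }
            $state = 0;
        } elseif ($state === 1 && $ch === '/' && $tag === '') {
            $state = 2;
        } elseif ($state === 1 && $ch === '=') {
            $state = 3;
        } elseif ($state === 1) {
            $tag .= $ch;
        }
        $output .= $ch;
    }
    // Close whatever is still open, innermost tag first
    while ($stack) {
        $output .= '[/' . array_pop($stack) . ']';
    }
    return $output;
}

echo limit_bbcode('[b]hello world[/b]', 5), "\n"; // prints [b]hello[/b]
```

Like the Python version, it treats every opening tag as one that needs closing, so standalone tags would have to be special-cased.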

Related

How to get the string position of the middle-most element within HTML content?

I am working with news articles in HTML format that come from a WYSIWYG editor, and I need to find the middle of an article in a visual/HTML sense, meaning an empty place in between two root elements. It is as if you wanted to split the article into two pages, with an equal number of paragraphs on each when possible.
All root elements seem to come out as paragraphs, which was easy enough to count, a simple
$p_count = substr_count($article_text, '<p');
returns the total number of opening paragraph tags, and then I can look for the strpos of the ($p_count/2)-th occurrence of a paragraph.
But the problem is embedded tweets, which contain paragraphs that sometimes appear under blockquote > p and other times under center > blockquote > p.
So I turned to DOMDocument. This little snippet gives me the index of the middle root element (even if the elements are divs and not paragraphs, which is cool):
$dom = new DOMDocument();
$dom->loadHTML($article_text);
$body = $dom->getElementsByTagName('body');
$rootNodes = $body->item(0)->childNodes;
$empty_nodes = 0;
foreach($rootNodes as $node) {
    if($node->nodeType === XML_TEXT_NODE && strlen(trim($node->nodeValue)) === 0) {
        $empty_nodes++;
    }
}
$total_elements = $rootNodes->length - $empty_nodes;
$middle_element = floor($total_elements / 2);
But how do I now find the string offset of this middle element within my original HTML source, so that I can point to this middle place within the article text string? Especially considering that DOMDocument converts the HTML I gave it into a full HTML page (with a doctype, a head and all that), so its output HTML is bigger than my original HTML article source.
OK, I solved it.
What I did was match all HTML tags in the article, using the PREG_OFFSET_CAPTURE flag of preg_match_all, which records the character offset at which each match was found. Then I looped through all of them sequentially and tracked the current depth: an opening tag counts as depth +1 and a closing tag as -1 (minding the self-closing tags). Every time the depth gets back to zero after a closing tag, I count that as one more root element closed. If I end up at depth 0 at the end, I assume I counted correctly.
Now I can take the number of root elements that I counted, divide by 2 to get the middle-ish one (+-1 for odd numbers), and look at the offset of the element at that index as reported by preg_match_all previously.
The complete code, in case anyone needs to do the same thing, is below.
It might be sped up if the is_self_closing() function were written using a regex followed by a check with in_array($self_closing_tags) instead of a foreach loop, but in my case it didn't make enough of a difference for me to bother.
// Declared at the top level: a function declared inside another function
// causes a "cannot redeclare" fatal error on the second call of the outer one.
function is_self_closing(string $input, array $self_closing_tags): bool {
    foreach($self_closing_tags as $tag) {
        if(substr($input, 1, strlen($tag)) === $tag) {
            return true;
        }
    }
    return false;
}
function calculate_middle_of_article(string $text, bool $debug=false): ?int {
    $self_closing_tags = [
        '!--',
        'area',
        'base',
        'br',
        'col',
        'embed',
        'hr',
        'img',
        'input',
        'link',
        'meta',
        'param',
        'source',
        'track',
        'wbr',
        'command',
        'keygen',
        'menuitem',
    ];
    // Matches one tag, allowing '>' inside quoted attribute values
    $regex = '/<("[^"]*"|\'[^\']*\'|[^\'">])*>/';
    preg_match_all($regex, $text, $matches, PREG_OFFSET_CAPTURE);
    $debug && print count($matches[0]) . " tags found \n";
    $root_elements = [];
    $depth = 0;
    foreach($matches[0] as $match) {
        if(!is_self_closing($match[0], $self_closing_tags)) {
            $depth+= (substr($match[0], 1, 1) === '/') ? -1 : 1;
        }
        $debug && print "level {$depth} after tag: " . htmlentities($match[0]) . "\n";
        if($depth === 0) {
            $root_elements[]= $match;
        }
    }
    $ok = ($depth === 0);
    $debug && print ($ok ? 'ok' : 'not ok') . "\n";
    // has to end at depth zero to confirm counting is correct;
    // also bail out if no root element was found at all
    if(!$ok || count($root_elements) === 0) {
        return null;
    }
    $debug && print count($root_elements) . " root elements\n";
    $element_index_at_middle = intdiv(count($root_elements), 2);
    $half_char = $root_elements[$element_index_at_middle][1];
    $debug && print "which makes the half the {$half_char}th character at the {$element_index_at_middle}th element\n";
    return $half_char;
}
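The regex plus in_array() variant of is_self_closing() mentioned above might look roughly like this. It is a sketch under one assumption that changes behaviour slightly: it compares the whole tag name instead of a prefix, so a colgroup tag is no longer mistaken for col. The name is_self_closing_fast() is made up for the example.

```php
<?php
// Sketch only: exact-name lookup instead of the prefix loop.
function is_self_closing_fast(string $tag, array $self_closing_tags): bool
{
    // Capture the tag name: "!--" for comments, otherwise letters and digits
    if (!preg_match('/^<(!--|[a-zA-Z][a-zA-Z0-9]*)/', $tag, $m)) {
        return false; // closing tags such as </p> end up here
    }
    return in_array(strtolower($m[1]), $self_closing_tags, true);
}

$tags = ['!--', 'br', 'col', 'hr', 'img'];
var_dump(is_self_closing_fast('<br />', $tags));     // bool(true)
var_dump(is_self_closing_fast('<colgroup>', $tags)); // bool(false)
```

With the prefix loop, an opening colgroup tag is skipped while its closing tag is still counted, which leaves the depth unbalanced; the exact match avoids that.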

PHP: Get specific content of a website

I want to get specific content of a website into an array.
I have approx. 20 sites to fetch content from, which I then output in other ways I like. Only the port changes (not 27015, but 27016 and so on).
This is just one: SOURCE-URL of Content
For now, I use this PHP code to fetch the game icon "cs.png", but the icon name varies in length, so it isn't the best way, is it? :-/
$srvip = '148.251.78.214';
$srvlist = array('27015');
foreach ($srvlist as $srvport) {
$source = file_get_contents('http://www.gametracker.com/server_info/'.$srvip.':'.$srvport.'/');
$content = array(
"icon" => substr($source, strpos($source, 'game_icons64')+13, 6),
);
echo $content[icon];
}
Thanks for helping, it has been a while since my last PHP work :P
You just need to look for the first " that comes after the game_icons64 and read up to there.
$srvip = '148.251.78.214';
$srvlist = array('27015');
foreach ($srvlist as $srvport) {
    $source = file_get_contents('http://www.gametracker.com/server_info/'.$srvip.':'.$srvport.'/');
    // find the position right after game_icons64/
    $first_occurance = strpos($source, 'game_icons64')+13;
    // find the first occurrence of " after game_icons64, where the src ends for the img
    $second_occurance = strpos($source, '"', $first_occurance);
    $content = array(
        // take a substring starting at the end of game_icons64/ and ending just before the src attribute ends
        "icon" => substr($source, $first_occurance, $second_occurance-$first_occurance),
    );
    echo $content['icon'];
}
Also, you had an error because you used [icon] and not ['icon']
Edit to match the second request involving multiple strings
$srvip = '148.251.78.214';
$srvlist = array('27015');
$content_strings = array( );
// the first 2 items are the string you are looking for in your first occurrence and how many chars to skip from that position
// the third is what should be the first char after the string you are looking for, so the first char that will not be copied
// the last item is how you want your array / program to register the string you are reading
$content_strings[] = array('game_icons64', 13, '"', 'icon');
// to add more items to your search, just copy paste the line above and change whatever you need from it
foreach ($srvlist as $srvport) {
    $source = file_get_contents('http://www.gametracker.com/server_info/'.$srvip.':'.$srvport.'/');
    $content = array();
    foreach($content_strings as $k=>$v)
    {
        $first_occurance = strpos($source, $v[0])+$v[1];
        $second_occurance = strpos($source, $v[2], $first_occurance);
        $content[$v[3]] = substr($source, $first_occurance, $second_occurance-$first_occurance);
    }
    print_r($content);
}
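Neither snippet above checks whether strpos() actually found anything, so a page without the marker silently yields garbage. A small helper along these lines could cover that; extract_between() is a made-up name, and the HTML fragment in the usage lines is an invented stand-in for the real page, not its actual markup.

```php
<?php
// Sketch only: returns the text between $marker and the next $terminator,
// or null when either of them is missing.
function extract_between(string $source, string $marker, string $terminator): ?string
{
    $start = strpos($source, $marker);
    if ($start === false) {
        return null;
    }
    $start += strlen($marker); // first character after the marker
    $end = strpos($source, $terminator, $start);
    if ($end === false) {
        return null;
    }
    return substr($source, $start, $end - $start);
}

// Invented fragment for illustration only
$html = '<img src="/images/game_icons64/cs.png" alt="">';
echo extract_between($html, 'game_icons64/', '"'), "\n"; // prints cs.png
```

Passing the marker with its trailing slash replaces the hard-coded +13 offset.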

Split a long string not using space

If I have sentences like this:
$msg = "hello how are you?are you fine?thanks..";
and I wish to separate it into 3 parts (or whatever number).
So I'm doing this:
$msglen = strlen($msg);
$separate = (int) ($msglen / 3);
$a = 0;
for($i=0;$i<3;$i++)
{
    // keep the part length in $separate and store each part separately
    $part = substr($msg,$a,$separate);
    $a = $a + $separate;
}
So the output should be..
hello how are
[a space here->] you?are you [<-a space here]
fine?thanks..
So is it possible to split in the middle of a word instead of leaving a space at the start or end of a separated part?
For example, "thank you" -> "than" and "k you" instead of "thank" and " you".
I'm writing a conversion function, and a space at the start or end of a part will affect the conversion. The space is needed for the conversion, so I can't ignore or delete it.
Thanks.
I take it you can't use trim because the message formed by the joined up strings must be unchanged?
That could get complicated. You could make something that tests for a space after the split, and if a space is detected, makes the split one character earlier. Fairly easy, but what if you have two spaces together? Or a single lettered word? You can of course recursively test this way, but then you may end up with split strings of lengths that are very different from each other.
You need to properly define the constraints you want this to function within.
Please state exactly what you want to do - do you want each section to be equal? Is the splitting in between words of a higher priority than this, so that the lengths do not matter much?
EDIT:
Then, if you aren't worried about the length, you could do something like this (starting with Erik's code and then adjusting the lengths by moving characters across the split points):
$msg = "hello how are you?are you fine?thanks..";
$parts = split_without_spaces ($msg, 3);
function split_without_spaces ($msg, $parts) {
    $parts = str_split(trim($msg), (int) ceil(strlen($msg)/$parts));
    /* Used trim above to make sure that there are no spaces at the start
       and end of the message, we can't do anything about those spaces */
    // Looping to (count($parts) - 1) because the last part will not need manipulation
    for ($i = 0; $i < (count($parts) - 1) ; $i++ ) {
        $k = $i + 1;
        // Checking the last character of the split part and the first of the next part for a space
        if (substr($parts[$i], -1) == ' ' || $parts[$k][0] == ' ') {
            // If we move characters from the first part to the next:
            $num1 = 1;
            $len1 = strlen($parts[$i]);
            // Searching for the last two consecutive non-space characters
            while ($parts[$i][$len1 - $num1] == ' ' || $parts[$i][$len1 - $num1 - 1] == ' ') {
                $num1++;
                if ($len1 - $num1 - 2 < 0) return false;
            }
            // If we move characters from the next part to the first:
            $num2 = 1;
            $len2 = strlen($parts[$k]);
            // Searching for the first two consecutive non-space characters
            while ($parts[$k][$num2 - 1] == ' ' || $parts[$k][$num2] == ' ') {
                $num2++;
                if ($num2 >= $len2 - 1) return false;
            }
            // Compare to see what we can do to move the lowest no of characters
            if ($num1 > $num2) {
                $parts[$i] .= substr($parts[$k], 0, $num2);
                $parts[$k] = substr($parts[$k], -1 * ($len2 - $num2));
            }
            else {
                $parts[$k] = substr($parts[$i], -1 * ($num1)) . $parts[$k];
                $parts[$i] = substr($parts[$i], 0, $len1 - $num1);
            }
        }
    }
    return ($parts);
}
This takes care of multiple spaces and single-letter words. However, if they exist, the lengths of the parts may be very uneven. It could get messed up in extreme cases: if you have a string made up mainly of spaces, it could return one part as empty, or return false if it can't manage the split at all. Please test it thoroughly.
EDIT2:
By the way, it'd be far better for you to change your approach in some way :) I seriously doubt you'd actually have to use a function like this in practice. Well.. I hope you do actually have a solid reason to, it was somewhat fun coming up with it.
If you simply want to eliminate leading and trailing spaces, consider trim to be used on each result of your split.
If you want to split the string into exact thirds it is not known where the cut will be, maybe in a word, maybe between words.
Your code can be simplified to:
$msg = "hello how are you?are you fine?thanks..";
$parts = str_split($msg, ceil(strlen($msg)/3));
Note that ceil() is needed, otherwise you might get 4 elements out because of rounding.
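To make the rounding point concrete, here is a small sketch with a 40-character string, which does not divide evenly by 3. The helper name split_into_parts() is made up for the example.

```php
<?php
// Sketch only: split $msg into at most $parts chunks of near-equal length.
function split_into_parts(string $msg, int $parts): array
{
    // Round the chunk length up so that no short extra chunk is left over
    $chunkLen = max(1, (int) ceil(strlen($msg) / $parts));
    return str_split($msg, $chunkLen);
}

$msg = str_repeat('x', 40);
$floorChunks = str_split($msg, intdiv(strlen($msg), 3)); // chunk length 13
$ceilChunks = split_into_parts($msg, 3);                 // chunk length 14
echo count($floorChunks), ' vs ', count($ceilChunks), "\n"; // prints 4 vs 3
```

Rounding up guarantees no more than the requested number of parts; for very short strings it can produce fewer.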
You're probably looking for str_split, chunk_split or wordwrap.

PHP Find Previous String Position

Is there a way that I can search a variable starting from a given position and find the start position of a string that is in the variable backwards from the given start position.
So for example if I initially do $getstart = strpos($contents, 'position', 0);
I then want to do $getprevpos = prevstrpos($contents, 'previous token', $getstart);
Obviously there is no such function as prevstrpos but I hope you get what I mean.
Example text area (terrible example, I know):
Here is an example where I want to find the previous token once I have found the start position of a text string.
You can use strrpos( substr($contents, 0, $getstart), 'previous token')
Is there something wrong with strrpos()? If 'offset' is negative: "Negative values will stop searching at the specified point prior to the end of the string."
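Wrapped into a function, the substr() plus strrpos() suggestion could look like the sketch below. The name prev_str_pos() is made up; like strrpos(), it returns false when the token does not occur before the start position.

```php
<?php
// Sketch only: position of the last $token that lies entirely before $start.
function prev_str_pos(string $contents, string $token, int $start)
{
    // Cut the haystack off at $start, then search it from the right
    return strrpos(substr($contents, 0, $start), $token);
}

$contents = 'alpha token beta token gamma position delta token';
$getstart = strpos($contents, 'position');
var_dump(prev_str_pos($contents, 'token', $getstart)); // int(17)
```

Because the haystack is cut at $getstart, the occurrence of the token after 'position' is never considered.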
You can try this. I think it should work for all cases, but you should probably test it a bit. There might be a bug here and there, but you get the idea: reverse everything and do a strpos on the reversed string.
function prevstrpos( $contents, $token, $start )
{
    $revToken = strrev( $token );
    $revContents = strrev( $contents );
    // Position in the reversed string that corresponds to $start
    $revStart = max( 0, strlen( $contents ) - $start );
    $revFoundPos = strpos( $revContents, $revToken, $revStart );
    // strpos() returns false, not -1, when nothing is found
    if ( $revFoundPos === false )
    {
        return -1;
    }
    // Map the end of the reversed match back to the start of the token
    return strlen( $contents ) - $revFoundPos - strlen( $token );
}

Split a large string into an array, but the split point cannot break a tag

I wrote a script that sends chunks of text off to Google to translate, but sometimes the text (which is HTML source code) ends up being split in the middle of an HTML tag, and Google returns the code incorrectly.
I already know how to split the string into an array, but is there a better way to do this while ensuring the output string does not exceed 5000 characters and does not split on a tag?
UPDATE: Thanks to the answer below, this is the code I ended up using in my project, and it works great
function handleTextHtmlSplit($text, $maxSize) {
    //our collection array
    $niceHtml[] = '';
    // Splits on tags, but also includes each tag as an item in the result
    $pieces = preg_split('/(<[^>]*>)/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
    //the current position of the index
    $currentPiece = 0;
    //start assembling a group until it gets to max size
    foreach ($pieces as $piece) {
        //make sure string length of this piece will not exceed max size when inserted
        if (strlen($niceHtml[$currentPiece] . $piece) > $maxSize) {
            //advance current piece
            //will put overflow into next group
            $currentPiece += 1;
            //create empty string as value for next piece in the index
            $niceHtml[$currentPiece] = '';
        }
        //insert piece into our master array
        $niceHtml[$currentPiece] .= $piece;
    }
    //return array of nicely handled html
    return $niceHtml;
}
Note: haven't had a chance to test this (so there may be a minor bug or two), but it should give you an idea:
function get_groups_of_5000_or_less($input_string) {
    // Splits on tags, but also includes each tag as an item in the result;
    // empty pieces (e.g. between two adjacent tags) are dropped
    $pieces = preg_split('/(<[^>]*>)/', $input_string,
        -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
    $groups[] = '';
    $current_group = 0;
    // Compare against null so that a piece such as '0' does not end the loop early
    while (($cur_piece = array_shift($pieces)) !== null) {
        $piecelen = strlen($cur_piece);
        if(strlen($groups[$current_group]) + $piecelen > 5000) {
            // Adding the next piece whole would go over the limit,
            // figure out what to do.
            if($cur_piece[0] == '<') {
                // Tag goes over the limit, just put it into a new group
                $groups[++$current_group] = $cur_piece;
            } else {
                // Non-tag goes over the limit, split it and put the
                // remainder back on the list of un-grabbed pieces
                $grab_amount = 5000 - strlen($groups[$current_group]);
                $groups[$current_group] .= substr($cur_piece, 0, $grab_amount);
                $groups[++$current_group] = '';
                array_unshift($pieces, substr($cur_piece, $grab_amount));
            }
        } else {
            // Adding this piece doesn't go over the limit, so just add it
            $groups[$current_group] .= $cur_piece;
        }
    }
    return $groups;
}
Also note that this can split in the middle of regular words - if you don't want that, then modify the part that begins with // Non-tag goes over the limit to choose a better value for $grab_amount. I didn't bother coding that in since this is just supposed to be an example of how to get around splitting tags, not a drop-in solution.
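One way to pick a friendlier $grab_amount is to back up to the last space that still fits in the remaining room. The helper below is only a sketch of that idea; word_safe_grab_amount() is a made-up name, it looks for plain spaces only, and it falls back to a hard cut when the text contains no space in range.

```php
<?php
// Sketch only: how many bytes of $piece to take without cutting a word,
// given that $room bytes are still free in the current group.
function word_safe_grab_amount(string $piece, int $room): int
{
    if (strlen($piece) <= $room) {
        return strlen($piece); // the whole piece fits
    }
    // Last space within the part that fits; cut just after it
    $pos = strrpos(substr($piece, 0, $room), ' ');
    return $pos === false ? $room : $pos + 1;
}

echo word_safe_grab_amount('hello world foo', 8), "\n"; // prints 6
```

In the function above, this would replace the plain subtraction, for example $grab_amount = word_safe_grab_amount($cur_piece, 5000 - strlen($groups[$current_group])).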
Why not strip the HTML tags from the string before sending it to Google? PHP has a strip_tags() function that can do this for you.
preg_split with a good regex would do it for you.
