Is there a way to search a string backwards from a given position and find the start position of a substring that occurs before that position?
For example, if I initially do $getstart = strpos($contents, 'position', 0);
I then want to do $getprevpos = prevstrpos($contents, 'previous token', $getstart);
Obviously there is no such function as prevstrpos, but I hope you get what I mean.
Example text area (terrible example, I know):
Here is an example where I want to find the previous token once I have found the start position of a text string.
You can use strrpos(substr($contents, 0, $getstart), 'previous token').
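A minimal sketch of that approach, in case it helps. The helper name find_previous and the sample text are made up for illustration; strpos, strrpos and substr are the only real PHP functions involved. Cutting the string at $getstart first means strrpos can only find occurrences that lie entirely before that position.

```php
<?php
// Hypothetical helper: last occurrence of $token that lies entirely before $start.
// Returns the start offset of that occurrence, or false if there is none.
function find_previous(string $contents, string $token, int $start)
{
    return strrpos(substr($contents, 0, $start), $token);
}

// Made-up sample text with two occurrences of the token before the marker.
$contents = 'previous token, some text, previous token, then the position marker';
$getstart = strpos($contents, 'position');
$getprevpos = find_previous($contents, 'previous token', $getstart);
var_dump($getprevpos); // int(27), the second "previous token"
```

Compare the result with === false rather than a loose check, since a match at offset 0 is also falsy.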
Is there something wrong with strrpos()? If 'offset' is negative: "Negative values will stop searching at the specified point prior to the end of the string."
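You can try this. I think it should work for all cases, but you should probably test it a bit. There might be a bug here and there, but you get the idea: reverse everything and do a strpos on the reversed string.
function prevstrpos( $contents, $token, $start )
{
    $revToken = strrev($token);
    $revContents = strrev($contents);
    // offset $start in the original string maps to (length - $start) in the reversed string
    $revStart = strlen($contents) - $start;
    $revFoundPos = strpos( $revContents, $revToken, $revStart );
    // strpos returns false (not -1) when nothing is found
    if( $revFoundPos !== false )
    {
        // map the reversed match back to the start of the token in the original string
        $foundPos = strlen($contents) - $revFoundPos - strlen($token);
    }
    else
    {
        $foundPos = -1;
    }
    return $foundPos;
}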
I am working with news articles in HTML format that come from a WYSIWYG editor, and I need to find the middle of an article in a visual/HTML sense, meaning an empty place in between two root elements. It is as if you wanted to split the article into two pages with an equal number of paragraphs on each, when possible.
All root elements seem to come out as paragraphs, which were easy enough to count with a simple
$p_count = substr_count($article_text, '<p');
This returns the total number of opening paragraph tags, and then I can look for the strpos of the ($p_count/2)-th occurrence of a paragraph.
But the problem is embedded tweets, which contain paragraphs that sometimes appear under blockquote > p and other times under center > blockquote > p.
So I turned to DOMDocument. This little snippet gives me the index of the middle root element (even if the elements are divs and not paragraphs, which is cool):
$dom = new DOMDocument();
$dom->loadHTML($article_text);
$body = $dom->getElementsByTagName('body');
$rootNodes = $body->item(0)->childNodes;
$empty_nodes = 0;
foreach($rootNodes as $node) {
if($node->nodeType === XML_TEXT_NODE && strlen(trim($node->nodeValue)) === 0) {
$empty_nodes++;
}
}
$total_elements = $rootNodes->length - $empty_nodes;
$middle_element = floor($total_elements / 2);
But how do I now find the string offset of this middle element within my original HTML source, so that I can point to this middle place within the article text string? Especially considering that DOMDocument converts the HTML I gave it into a full HTML page (with a doctype, a head and all that), so its output HTML is bigger than my original article source.
OK, I solved it.
What I did was match all HTML tags in the article using the PREG_OFFSET_CAPTURE flag of preg_match_all, which records the string offset at which each match was found. Then I looped through the matches sequentially and tracked the current depth: an opening tag counts as +1 and a closing tag as -1 (skipping self-closing tags). Every time the depth returns to zero, I count that as one more root element closed. If I end up at depth 0 after the last tag, I assume I counted correctly.
Now I can take the number of root elements that I counted, divide it by 2 to get the middle-ish one (±1 for odd numbers), and look up the offset of the element at that index, as reported by preg_match_all.
The complete code, in case anyone needs to do the same thing, is below.
It might be faster if is_self_closing() extracted the tag name with a regex and then checked it with in_array() against $self_closing_tags instead of using a foreach loop, but in my case it didn't make enough of a difference for me to bother.
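// Defined at top level: a named function declared inside another function would be
// redeclared (a fatal error) the second time calculate_middle_of_article() is called.
function is_self_closing(string $input, array $self_closing_tags): bool {
    foreach($self_closing_tags as $tag) {
        if(substr($input, 1, strlen($tag)) === $tag) {
            // require the tag name to end here, so that 'col' does not match '<colgroup>'
            $next = substr($input, 1 + strlen($tag), 1);
            if($tag === '!--' || !ctype_alnum($next)) {
                return true;
            }
        }
    }
    return false;
}
function calculate_middle_of_article(string $text, bool $debug=false): ?int {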
$self_closing_tags = [
'!--',
'area',
'base',
'br',
'col',
'embed',
'hr',
'img',
'input',
'link',
'meta',
'param',
'source',
'track',
'wbr',
'command',
'keygen',
'menuitem',
];
$regex = '/<("[^"]*"|\'[^\']*\'|[^\'">])*>/';
preg_match_all($regex, $text, $matches, PREG_OFFSET_CAPTURE);
$debug && print count($matches[0]) . " tags found \n";
$root_elements = [];
$depth = 0;
foreach($matches[0] as $match) {
if(!is_self_closing($match[0], $self_closing_tags)) {
$depth+= (substr($match[0], 1, 1) === '/') ? -1 : 1;
}
$debug && print "level {$depth} after tag: " . htmlentities($match[0]) . "\n";
if($depth === 0) {
$root_elements[]= $match;
}
}
$ok = ($depth === 0);
$debug && print ($ok ? 'ok' : 'not ok') . "\n";
// has to end at depth zero to confirm counting is correct
if(!$ok) {
return null;
}
$debug && print count($root_elements) . " root elements\n";
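$element_index_at_middle = (int) floor(count($root_elements)/2);
// no root elements (e.g. text without any tags): there is no middle to report
if(!isset($root_elements[$element_index_at_middle])) {
return null;
}
$half_char = $root_elements[$element_index_at_middle][1];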
$debug && print "which makes the half the {$half_char}th character at the {$element_index_at_middle}th element\n";
return $half_char;
}
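The speed-up mentioned before the code (a regex plus in_array instead of the foreach loop) could be sketched roughly like this. The function name is_self_closing_tag is made up here so that it does not clash with is_self_closing() above, and lowercasing the tag name is an extra assumption that the original code does not make.

```php
<?php
// Hypothetical variant: extract the tag name once, then look it up in the list.
function is_self_closing_tag(string $input, array $self_closing_tags): bool
{
    // Capture either the comment opener '!--' or a plain tag name right after '<'.
    if (!preg_match('/^<(!--|[a-zA-Z][a-zA-Z0-9]*)/', $input, $m)) {
        return false; // closing tags such as '</p>' end up here
    }
    return in_array(strtolower($m[1]), $self_closing_tags, true);
}

$tags = ['!--', 'br', 'col', 'img'];
var_dump(is_self_closing_tag('<br>', $tags));           // bool(true)
var_dump(is_self_closing_tag('<colgroup>', $tags));     // bool(false)
var_dump(is_self_closing_tag('<!-- comment -->', $tags)); // bool(true)
```

Because the whole tag name is compared, a list entry such as 'col' no longer matches '<colgroup>'.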
I'm working with indexing some news sites. A kind of news clipping.
I'm an amateur and curious. I'm not a programmer so the question may seem silly to anyone in the business. But if anyone can help, thank you.
The paging of the sites I was parsing was practically the same everywhere, and I used this scheme:
$url = $url . '/page/' . $s;
$next_url = $s + 1;
$prev_url = $s - 1;
if ($prev_url <= 0) {
$prev_url = 1;
}
The format was basically this:
http://example.com/politics/page/2
But yesterday I came across something different and I do not know how to paginate it. I get this link format through preg_match_all:
http://www.example.com/browse-Politics-National-texts-1-date.html
This is the paging part:
-1-
This part is variable:
Politics-National-texts
Any guidance?
If what you are asking for is parsing the url for the pagination and variable parts, you can use preg_match with the following regexp:
if (preg_match('/^http:\/\/www\.example\.com\/browse-([-a-zA-Z]+)-(\d+)-date\.html$/', $url, $matches)) {
var_export($matches);
}
Then you will get the result:
array (
0 => 'http://www.example.com/browse-Politics-National-texts-1-date.html',
1 => 'Politics-National-texts',
2 => '1',
)
The keys in $matches will be:
0: The entire match
1: The first matched group (the variable)
2: The second matched group (the pagination)
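Once the two groups are captured, the next and previous page links can be rebuilt from them. This is only a sketch: the helper name page_url is made up, and it assumes every page of the site follows the same browse-...-N-date.html pattern as the one example URL in the question.

```php
<?php
// Hypothetical helper: rebuild a listing URL for a given section and page number.
function page_url(string $section, int $page): string
{
    // Never go below page 1, mirroring the $prev_url check in the question.
    return 'http://www.example.com/browse-' . $section . '-' . max(1, $page) . '-date.html';
}

$url = 'http://www.example.com/browse-Politics-National-texts-1-date.html';
if (preg_match('/^http:\/\/www\.example\.com\/browse-([-a-zA-Z]+)-(\d+)-date\.html$/', $url, $matches)) {
    $section  = $matches[1];        // 'Politics-National-texts'
    $page     = (int) $matches[2];  // 1
    $next_url = page_url($section, $page + 1);
    $prev_url = page_url($section, $page - 1);
    echo $next_url, "\n", $prev_url, "\n";
}
```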
<?php
$url = 'http://www.example.com/browse-Politics-National-texts-1-date.html';
$url_basename = basename($url); // extract `browse-Politics-National-texts-1-date.html`
$url_exploded = explode('-',$url_basename); // make an array delimited by `-`
array_pop($url_exploded);
$url_page_number = array_pop($url_exploded); // get the 2nd element from back
?>
Result:
$url_page_number = 1
PS. Could make it shorter, but it's for educational purposes :-)
I want to get specific content of a website into an array.
I have approximately 20 sites to fetch content from and output in other ways I like. Only the port changes (instead of 27015 it is then 27016 or so...).
This is just one: SOURCE-URL of Content
For now, I use this PHP code to fetch the game icon "cs.png", but the icon filename varies in length, so this isn't the best way, is it? :-/
$srvip = '148.251.78.214';
$srvlist = array('27015');
foreach ($srvlist as $srvport) {
$source = file_get_contents('http://www.gametracker.com/server_info/'.$srvip.':'.$srvport.'/');
$content = array(
"icon" => substr($source, strpos($source, 'game_icons64')+13, 6),
);
echo $content[icon];
}
Thanks for helping; it has been a while since I last worked with PHP :P
You just need to look for the first " that comes after the game_icons64 and read up to there.
$srvip = '148.251.78.214';
$srvlist = array('27015');
foreach ($srvlist as $srvport) {
$source = file_get_contents('http://www.gametracker.com/server_info/'.$srvip.':'.$srvport.'/');
// find the position right after game_icons64/
$first_occurance = strpos($source, 'game_icons64')+13;
// find the first occurance of " after game_icons64, where the src ends for the img
$second_occurance = strpos($source, '"', $first_occurance);
$content = array(
// take a substring starting at the end of game_icons64/ and ending just before the src attribute ends
"icon" => substr($source, $first_occurance, $second_occurance-$first_occurance),
);
echo $content['icon'];
}
Also, you had an error because you used [icon] and not ['icon']
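One thing the snippet above does not handle: strpos returns false when the marker is not on the page, and false + 13 silently becomes 13. A defensive version of the same idea might look like the sketch below. The helper name extract_between and the sample HTML are made up for illustration and are not taken from the real page.

```php
<?php
// Hypothetical helper: return the text that starts $skip characters after the
// beginning of $marker and runs up to (not including) the next $terminator.
// Returns null if the marker or the terminator cannot be found.
function extract_between(string $source, string $marker, int $skip, string $terminator): ?string
{
    $marker_pos = strpos($source, $marker);
    if ($marker_pos === false) {
        return null;
    }
    $start = $marker_pos + $skip;
    if ($start > strlen($source)) {
        return null; // skipping would run past the end of the string
    }
    $end = strpos($source, $terminator, $start);
    if ($end === false) {
        return null;
    }
    return substr($source, $start, $end - $start);
}

// Made-up fragment standing in for the fetched page.
$source = '<img src="/images/game_icons64/cs.png" alt="CS">';
var_dump(extract_between($source, 'game_icons64', 13, '"')); // string(6) "cs.png"
```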
Edit to match the second request involving multiple strings
$srvip = '148.251.78.214';
$srvlist = array('27015');
$content_strings = array( );
// the first 2 items are the string you are looking for in your first occurrence and how many chars to skip from that position
// the third is what should be the first char after the string you are looking for, so the first char that will not be copied
// the last item is how you want your array / program to register the string you are reading
$content_strings[] = array('game_icons64', 13, '"', 'icon');
// to add more items to your search, just copy paste the line above and change whatever you need from it
foreach ($srvlist as $srvport) {
$source = file_get_contents('http://www.gametracker.com/server_info/'.$srvip.':'.$srvport.'/');
$content = array();
foreach($content_strings as $k=>$v)
{
$first_occurance = strpos($source, $v[0])+$v[1];
$second_occurance = strpos($source, $v[2], $first_occurance);
$content[$v[3]] = substr($source, $first_occurance, $second_occurance-$first_occurance);
}
print_r($content);
}
What would be an elegant way of doing this?
I have this -> "MC0001" This is the input. It always begins with "MC"
The output I'd be aiming with this input is "MC0002".
So I've created a function that's supposed to return "1" after removing "MC000". I'm going to convert this into an integer later on so I could generate "MC0002" which could go up to "MC9999". To do that, I figured I'd need to loop through the string and count the zeros and so on but I think I'd be making a mess that way.
Anybody has a better idea?
This should do the trick:
<?php
$string = 'MC0001';
// extract the part succeeding 'MC':
$number_part = substr($string, 2);
// count the digits for later:
$number_digits = strlen($number_part);
// turn it into a number:
$number = (int) $number_part;
// make the next sequence:
$next = 'MC' . str_pad($number + 1, $number_digits, '0', STR_PAD_LEFT);
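The same steps wrapped into a reusable function, as a rough sketch. The name next_code and the $prefix parameter are made up for illustration.

```php
<?php
// Hypothetical helper: increment the numeric part of a code such as 'MC0001',
// keeping the prefix and the zero padding.
function next_code(string $code, string $prefix = 'MC'): string
{
    $digits = substr($code, strlen($prefix)); // '0001'
    $next   = (int) $digits + 1;              // 2
    return $prefix . str_pad((string) $next, strlen($digits), '0', STR_PAD_LEFT);
}

echo next_code('MC0001'), "\n"; // MC0002
echo next_code('MC0999'), "\n"; // MC1000
```

Note that str_pad never truncates, so next_code('MC9999') gives 'MC10000'. If the code must stay at four digits, that case needs its own check.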
Using filter_var might be the best solution.
echo filter_var("MC0001", FILTER_SANITIZE_NUMBER_INT)."\n";
echo filter_var("MC9999", FILTER_SANITIZE_NUMBER_INT);
will give you
0001
9999
These can be cast to int or just used as they are, as PHP will auto-convert anyway if you use them as numbers.
Just use ltrim to remove any leading characters: http://php.net/manual/en/function.trim.php
$str = ltrim($str, 'MC0');
$num = intval($str);
<?php
// original number to integer
sscanf( $your_string, 'MC%d', $your_number );
// pad the incremented number back into a string later on
$next = sprintf( 'MC%04u', $your_number + 1 );
Not sure if there is a better way of parsing a string as an integer when there are leading zeros.
I'd suggest doing the following:
1. Loop through the string ( beginning at location 2 since you don't need the MC part )
2. If you find a digit that's bigger than 0, stop and get the substring from your current location to the end of the string. Cast to integer, return value.
You can remove the "MC" part by doing a substring operation on the string.
$a = "MC0001";
$a = substr($a, 2); //Lengths of "MC"
$number = intval($a); //1
return intval(str_replace('MC', '', $input), 10);
I have a string which contains the text of an article. This is sprinkled with BBCodes (between square brackets). I need to be able to grab the first say, 200 characters of an article without cutting it off in the middle of a bbcode. So I need an index where it is safe to cut it off. This will give me the article summary.
The summary must be minimum 200 characters but can be longer to 'escape' out of a bbcode. (this length value will actually be a parameter to a function).
It must not give me a point inside a stand alone bbcode (see the pipe) like so: [lis|t].
It must not give me a point between a start and end bbcode like so: [url="http://www.google.com"]Go To Goo|gle[/url].
It must not give me a point inside either the start or end bbcode or in-between them, in the above example.
It should give me the "safe" index which is after 200 and is not cutting off any BBCodes.
Hope this makes sense. I have been struggling with this for a while. My regex skills are only moderate. Thanks for any help!
First off, I would suggest considering what you will do with a post that is entirely wrapped in BBcodes, as is often true in the case of a font tag. In other words, a solution to the problem as stated will easily lead to 'summaries' containing the entire article. It may be more valuable to identify which tags are still open and append the necessary BBcodes to close them. Of course in cases of a link, it will require additional work to ensure you don't break it.
Well, the obvious easy answer is to present your "summary" without any bbcode-driven markup at all (regex below taken from here)
$summary = substr( preg_replace( '|[[\/\!]*?[^\[\]]*?]|si', '', $article ), 0, 200 );
However, to do the job you explicitly describe is going to require more than just a regex. A lexer/parser would do the trick, but that's a moderately complicated topic. I'll see if I can come up with something.
EDIT
Here's a pretty ghetto version of a lexer, but for this example it works. This converts an input string into bbcode tokens.
<?php
class SimpleBBCodeLexer
{
protected
$tokens = array()
, $patterns = array(
self::TOKEN_OPEN_TAG => "/\\[[a-z].*?\\]/"
, self::TOKEN_CLOSE_TAG => "/\\[\\/[a-z].*?\\]/"
);
const TOKEN_TEXT = 'TEXT';
const TOKEN_OPEN_TAG = 'OPEN_TAG';
const TOKEN_CLOSE_TAG = 'CLOSE_TAG';
public function __construct( $input )
{
for ( $i = 0, $l = strlen( $input ); $i < $l; $i++ )
{
$this->processChar( $input[$i] ); // square brackets: the {$i} string offset syntax was removed in PHP 8
}
$this->processChar();
}
protected function processChar( $char=null )
{
static $tokenFragment = '';
$tokenFragment = $this->processTokenFragment( $tokenFragment );
if ( is_null( $char ) )
{
$this->addToken( $tokenFragment );
} else {
$tokenFragment .= $char;
}
}
protected function processTokenFragment( $tokenFragment )
{
foreach ( $this->patterns as $type => $pattern )
{
if ( preg_match( $pattern, $tokenFragment, $matches ) )
{
if ( $matches[0] != $tokenFragment )
{
$this->addToken( substr( $tokenFragment, 0, -( strlen( $matches[0] ) ) ) );
}
$this->addToken( $matches[0], $type );
return '';
}
}
return $tokenFragment;
}
protected function addToken( $token, $type=self::TOKEN_TEXT )
{
$this->tokens[] = array( $type => $token );
}
public function getTokens()
{
return $this->tokens;
}
}
$l = new SimpleBBCodeLexer( 'some [b]sample[/b] bbcode that [i] should [url="http://www.google.com"]support[/url] what [/i] you need.' );
echo '<pre>';
print_r( $l->getTokens() );
echo '</pre>';
The next step would be to create a parser that loops over these tokens and takes action as it encounters each type. Maybe I'll have time to make it later...
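As a rough idea of what such a parser could look like, here is a sketch that walks a token list and builds the summary. The function name summarize_tokens is made up, the token list is written by hand in the shape that getTokens() returns (one-element arrays keyed by token type), and it assumes every opening tag has a matching closing tag, which does not hold for standalone tags.

```php
<?php
// Hypothetical consumer for tokens shaped like the lexer output:
// a list of one-element arrays, each mapping a token type to its text.
function summarize_tokens(array $tokens, int $minLen): string
{
    $summary = '';
    $textLen = 0; // visible characters collected so far
    $depth   = 0; // opening tags not yet closed (assumes all tags are paired)
    foreach ($tokens as $token) {
        $type  = array_key_first($token);
        $value = $token[$type];
        if ($type === 'TEXT') {
            if ($depth === 0 && $textLen + strlen($value) >= $minLen) {
                // outside all tags: safe to cut in the middle of this text
                return $summary . substr($value, 0, $minLen - $textLen);
            }
            $textLen += strlen($value);
            $summary .= $value;
            continue;
        }
        $summary .= $value;
        $depth = max(0, $depth + ($type === 'OPEN_TAG' ? 1 : -1));
        if ($depth === 0 && $textLen >= $minLen) {
            return $summary; // long enough, and the last open tag just closed
        }
    }
    return $summary;
}

// Hand-written tokens for illustration.
$tokens = [
    ['TEXT' => 'some '],
    ['OPEN_TAG' => '[b]'],
    ['TEXT' => 'sample'],
    ['CLOSE_TAG' => '[/b]'],
    ['TEXT' => ' bbcode that is long'],
];
echo summarize_tokens($tokens, 8), "\n"; // some [b]sample[/b]
echo summarize_tokens($tokens, 3), "\n"; // som
```

The summary runs past the minimum length only when the minimum falls inside a tag pair, which is the behaviour the question asks for.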
This does not sound like a job for (only) regex.
"Plain programming" logic is a better option:
grab a character other than a '[', increase a counter;
if you encounter an opening tag, keep advancing until you reach the closing tag (don't increase the counter!);
stop grabbing text when your counter has reached 200.
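The steps above can be sketched in PHP roughly as follows. The function name safe_cut_index is made up. On top of the three steps it also keeps a depth counter so the cut does not land between an opening and a closing tag, and it assumes every tag that does not start with "[/" opens a pair, so standalone tags would need separate handling.

```php
<?php
// Hypothetical sketch: return an index at which $text can be cut after at
// least $minLen visible characters, without splitting a tag or a tag pair.
function safe_cut_index(string $text, int $minLen = 200): int
{
    $count = 0; // characters seen outside of any [tag]
    $depth = 0; // opening tags not yet closed
    $len   = strlen($text);
    $i     = 0;
    while ($i < $len) {
        if ($text[$i] === '[') {
            $close = strpos($text, ']', $i);
            if ($close === false) {
                break; // unmatched '[': give up and return the full length
            }
            // "[/" starts a closing tag; anything else is assumed to open one
            if (substr($text, $i + 1, 1) === '/') {
                $depth = max(0, $depth - 1);
            } else {
                $depth++;
            }
            $i = $close + 1; // jump past the whole tag without counting it
        } else {
            $count++;
            $i++;
        }
        if ($count >= $minLen && $depth === 0) {
            return $i;
        }
    }
    return $len;
}

$text = 'ab [b]cd[/b] ef';
echo substr($text, 0, safe_cut_index($text, 4)), "\n"; // ab [b]cd[/b]
```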
Here is a start. I don't have access to PHP at the moment, so you might need some tweaking to get it to run. Also, this will not ensure that tags are closed (i.e. the string could have [url] without [/url]). Also, if a string is invalid (i.e. not all square brackets are matched) it might not return what you want.
function getIndex($str, $minLen = 200)
{
    //on short input, return the whole string
    if(strlen($str) <= $minLen)
        return strlen($str);
    //get first minLen characters
    $substr = substr($str, 0, $minLen);
    //does it have a '[' that is not closed?
    if(preg_match('/\[[^\]]*$/', $substr))
    {
        //find the next ']', if there is one
        $pos = strpos($str, ']', $minLen);
        //now, make the substr go all the way to that ']'
        if($pos !== false)
            $substr = substr($str, 0, $pos+1);
    }
    //now, it may be better to return $subStr, but you specifically
    //asked for the index, which is the length of this substring.
    return strlen($substr);
}
I wrote this function, which should do just what you want. It counts n characters (excluding those in tags) and then closes any tags that need to be closed. An example of its use is included in the code. The code is in Python, but it should be really easy to port to other languages, such as PHP.
def limit(input, length):
    """Splits a text after (length) characters, preserving bbcode"""
    stack = []
    counter = 0
    output = ""
    tag = ""
    insideTag = 0 # 0 = Outside tag, 1 = Opening tag, 2 = Closing tag, 3 = Opening tag, parameters section
    for i in input:
        if counter >= length: # If we have reached the max length (add " and i == ' '") to not make it split in a word
            break
        elif i == '[': # If we have reached a tag
            insideTag = 1
        elif i == '/': # If we reach a slash...
            if insideTag == 1: # And we are in an opening tag
                insideTag = 2
        elif i == '=': # If we have reached the parameters
            if insideTag >= 1: # If we actually are in a tag
                insideTag = 3
        elif i == ']': # If we have reached the closing of a tag
            if insideTag == 2: # If we are in a closing tag
                stack.pop() # Pop the last tag, we closed it
            elif insideTag >= 1:# If we are in a tag, parameters or not
                stack.append(tag) # Add current tag to the tag-stack
            if insideTag >= 0: # If are in some type of tag
                insideTag = 0
                tag = ""
        elif insideTag == 0: # If we are not in a tag
            counter += 1
        elif insideTag <= 2: # If we are in a tag and not among the parameters
            tag += i
        output += i
    while len(stack) > 0:
        output += '[/'+stack.pop()+']' # Add the remaining tags
    return output
cutText = limit('[font]This should be easy:[img]yippee.png[/img][i][u][url="http://www.stackoverflow.com"]Check out this site[/url][/u]Should be cut here somewhere [/i][/font]', 60)
print(cutText)