Regex replace characters in data

Regex replace characters in data - php

I am trying to clean some junked up data of special characters (allowing a few) but some still get through. I found a regex snippet earlier but does not remove some characters, like asterisks.
$clean_body = $raw_text;
$clean_title = preg_replace("/[^!&\/A-Za-z0-9_ ]/","", $clean_body);
$clean_title = substr($clean_title, 0, 64);
$clean_body = nl2br($clean_body);
if ($nid) {
$node = node_load($nid);
unset($node->field_category);
} else {
$node = new stdClass();
$node->type = 'article';
node_object_prepare($node);
}
$split_title = str_split($clean_title);
foreach ($split_title as $key => $character) {
if ($key > 15) {
if ($character == ' ' && !preg_match("/[^!&\/,.-]/", $split_title[$key - 1])) {
$node->title = html_entity_decode(substr(strip_tags($clean_title), 0, $key - 1)) . '...';
}
}
}
The first part attempts to clean out anything in the raw text that isn't normal punctuation or alpha numeric. Then, I split the title into an array and look for a space. What I want to do is create a title that is at least 15 characters long, and truncates on a space (leaving whole words intact) without stopping on a punctuation character. This is the part I am having trouble with.
Some titles still come out as ***************** or ** HOW TO MAKE $$$$$$ BLOGGING **, when the first title should not even have *'s, and the section should be HOW TO MAKE..., for example.

What about "/[^!&\/\w\s]/ui" ?
Works fine on my machine

Your problem (or, one of them anyhow) is this logic:
if ($key > 15) {
if ($character == ' ' && !preg_match("/[^!&\/,.-]/", $split_title[$key - 1])) {
$node->title = html_entity_decode(substr(strip_tags($clean_title), 0, $key - 1)) . '...';
}
}
You're only setting $node->title if these conditions match when iterating the characters in the $split_title array.
What happens when they don't match? $node->title doesn't get set (or overwritten? You didn't give much context, so I can't tell).
Using this as a test:
$clean_body = '** HOW TO MAKE $$$$$$ BLOGGING **';
You can see that these conditions do not match, so $node->title does not get set (or overwritten).

Related

What is a workaround for colour coding in wordwrap()

So Minecraft uses section signs (§) for colour coding so for example, light green is §a (a is the color code id for green). An important note to remember is that these are VISUALLY ignored in-game. I'm using wordwrap() to make text look centred however these section signs get in the way of that because they're visually not there yet still considered as characters by the function itself.
Here's my attempt: if you take a look, I tried to count the number of occurrences the section sign was found and multiplied it by two for the colour code character. I later then realized that this is inefficient because this affects the entire line of code and not just a specific bit. This basically means that this would make the length of other colour coded lines look odd since they have more or less colour coding in them. I also tried a rather dumb alternative where I'd use constants but I quickly realized that wasn't going to work. Let me know if anything is unclear. Thanks in advance.
$line = "§r§7This is the §eAuction House§7! In the §eAuction House§7, you can sell and purchase items from other Luriders who have auctioned their items. The §eAuction House §7is a great way to make some cash by simply selling items that other players might be interested in buying."
public static function itemLineOptimizer(string $line, int $width = 40)
{
$width += substr_count($line, '§') * 2;
return wordwrap($line, $width, "\n");
}
Console Output:
string(281) "§r§7This is the §eAuction House§7! In the §eAuction
House§7, you can sell and purchase items from other
Luriders who have auctioned their items. The §eAuction
House §7is a great way to make some cash by simply
selling items that other players might be interested in
buying."
In-Game Output:
In-Game Output

No where near as efficient as IMSoP's approach, but it is an alternative method I wanted to share. So what I did was I replaced section signs, removed them, wordwrapped, then added them back to their correct places. A bit complicated at first look but it's quite simple. Every line has its details commented.
function itemLineOptimizer(string $line, int $width = 40)
{
$line = str_replace("§", "&", $line); // Since section signs aren't just one-byte, we're going to make our lives easier and replace them with another one-byte symbol, I went with "&"
$colourCoding = array(); // Straightforward
$split = str_split($line); // Splitting the line into an array per character
foreach ($split as $key => $char){ // for every character has a $key (position) and the character itself: $char
if($char === "&") { // Check if it's a section sign / symbol chosen
array_push($colourCoding, [$key, $split[$key + 1]]); // add to $colourCoding an element which includes an array consisting of the position of the sign and the colour which the character at the position after
unset($split[$key]); // remove sign
unset($split[$key + 1]); // remove colour
}
}
// Now we've removed all colour coding from the line and saved it in $colourCoding
$bland = wordwrap(implode("", $split), $width, "\n"); // $bland is the now colourless wordwrapped line
foreach ($colourCoding as $array){ // Lastly we add the section signs back in their positions
$key = $array[0]; // position
$colour = $array[1]; // colour
$lineBreak = substr_count($bland, "§"); // Check for section signs already inside this line: they interfere with future loops since the correct position is different
$bland = substr_replace($bland, "§".$colour, $key + $lineBreak, 0); // Adding the colour coding back back to its correct position
}
return $bland; // Straightforward
}
$line = "§r§7This is the §eAuction House§7! In the §eAuction House§7, you can sell and purchase items from other Luriders who have auctioned their items. The §eAuction House §7is a great way to make some cash by simply selling items that other players might be interested in buying.";
var_dump(wordwrap($line, 40), itemLineOptimizer($line, 40));

One way to approach this which I though might be interesting is to take the internal implementation of wordwrap, and adapt it to our needs.
So I found the definition in the source, and in particular the special-case algorithm for handling a single-character line-break character which is all we need here, and saves us understanding all the other modes.
It works by copying the string, and then walking through it character by character, tracking when it last saw a space, and when it last saw or inserted a newline character. It then over-writes spaces with newline characters in place, without having to touch the rest of the string.
I first translated that literally into PHP (mostly a case of adding $ in front of each variable, and removing some special type handling macros), giving this:
function my_word_wrap($text, $linelength)
{
$newtext = $text;
$breakchar = "\n";
$laststart = $lastspace = 0;
$string_length = strlen($text);
for ($current = 0; $current < $string_length; $current++) {
if ( $text[$current] == $breakchar ) {
$laststart = $lastspace = $current + 1;
}
elseif ( $text[$current] == ' ' ) {
if ($current - $laststart >= $linelength) {
$newtext[$current] = $breakchar;
$laststart = $current + 1;
}
$lastspace = $current;
}
elseif ($current - $laststart >= $linelength && $laststart != $lastspace) {
$newtext[$lastspace] = $breakchar;
$laststart = $lastspace + 1;
}
}
return $newtext;
}
Two of those if statements include this condition which tracks how many characters we've seen since the last line break: $current - $laststart >= $linelength. What we could do is subtract from that the number of invisible characters we've seen, so they don't contribute to the "width" of lines: $current - $laststart - $invisibles >= $linelength.
Next, we need to detect section signs. My immediate guess was to use $text[$current] == '§', but that doesn't work because we're working in byte offsets, and § is not a single byte. Assuming UTF-8, it's specifically the pair of bytes which in hexadecimal are C2 A7, so we need to test the current and next character for that pair: $text[$current] == "\xC2" && $text[$current+1] == "\xA7".
Now we can detect the invisible characters, we can increment our $invisibles counter. Since § is two bytes, and the following character is also invisible, we want to increment the counter by three, and also move the $current pointer an extra two steps:
elseif ( $text[$current] == "\xC2" && $text[$current+1] == "\xA7" ) {
$invisibles += 3;
$current += 2;
}
Finally, we need to reset the $invisibles counter whenever we insert a newline, or see an existing one - in other words, everywhere we reset $laststart.
So, the final result looks like this:
function special_word_wrap($text, $linelength)
{
$newtext = $text;
$breakchar = "\n";
$laststart = $lastspace = $invisibles = 0;
$string_length = strlen($text);
for ($current = 0; $current < $string_length; $current++) {
if ( $text[$current] == $breakchar ) {
$laststart = $lastspace = $current + 1;
$invisibles = 0;
}
elseif ( $text[$current] == ' ' ) {
if ($current - $laststart - $invisibles >= $linelength) {
$newtext[$current] = $breakchar;
$laststart = $current + 1;
$invisibles = 0;
}
$lastspace = $current;
}
elseif ( $text[$current] == "\xC2" && $text[$current+1] == "\xA7" ) {
$invisibles += 3;
$current += 2;
}
elseif ($current - $laststart - $invisibles >= $linelength && $laststart != $lastspace) {
$newtext[$lastspace] = $breakchar;
$laststart = $lastspace + 1;
$invisibles = 0;
}
}
return $newtext;
}
Here's a live demo of it in action with your sample input.
Not the most elegant, and probably not the most efficient way to do it, but I enjoyed the exercise, even if it's not what you were hoping for. :)

reorder / rewrap bbcodes

I'm trying to reorder the BBCodes but I failed
so
[̶b̶]̶[̶i̶]̶[̶u̶]̶f̶o̶o̶[̶/̶b̶]̶[̶/̶u̶]̶[̶/̶i̶]̶ ̶-̶ ̶w̶r̶o̶n̶g̶ ̶o̶r̶d̶e̶r̶ ̶ ̶
I̶ ̶w̶a̶n̶t̶ ̶i̶t̶ ̶t̶o̶ ̶b̶e̶:̶ ̶
̶[̶b̶]̶[̶i̶]̶[̶u̶]̶f̶o̶o̶[̶/̶u̶]̶[̶/̶i̶]̶[̶/̶b̶]̶ ̶-̶ ̶r̶i̶g̶h̶t̶ ̶o̶r̶d̶e̶r̶
PIC:
I tried with
<?php
$string = '[b][i][u]foo[/b][/u][/i]';
$search = array('/\[b](.+?)\[\/b]/is', '/\[i](.+?)\[\/i]/is', '/\[u](.+?)\[\/u]/is');
$replace = array('[b]$1[/b]', '[i]$1[/i]', '[u]$1[/u]');
echo preg_replace($search, $replace, $string);
?>
OUTPUT: [b][i][u]foo[/b][/u][/i]
any suggestions ? thanks!

phew, spent awhile thinking of the logic to do this. (feel free to put it in a function)
this only works for the scenario given. Like other users have commented it's impossible. You shouldn't be doing this. Or even on server side. I'd use a client side parser just to throw a syntax error.
supports [b]a[i]b[u]foo[/b]baa[/u]too[/i]
and bbcode with custom values [url=test][i][u]foo[/url][/u][/i]
Will break with
[b] bold [/b][u] underline[/u]
And [b] bold [u][/b] underline[/u]
//input string to be reorganized
$string = '[url=test][i][u]foo[/url][/u][/i]';
echo $string . "<br />";
//search for all opentags (including ones with values
$tagsearch = "/\[([A-Za-z]+)[A-Za-z=._%?&:\/-]*\]/";
preg_match_all($tagsearch, $string, $tags);
//search for all close tags to store them for later
$closetagsearch = "/(\[\/([A-Za-z]+)\])/is";
preg_match_all($closetagsearch, $string, $closetags);
//flip the open tags for reverse parsing (index one is just letters)
$tags[1] = array_reverse($tags[1]);
//create temp var to store new ordered string
$temp = "";
//this is the last known position in the original string after a match
$last = 0;
//iterate through each char of the input string
for ($i = 0, $len = strlen($string); $i < $len; $i++) {
//if we run out of tags to replace/find stop looping
if (empty($tags[1]) || empty($closetags[1]))
continue;
//this is the part of the string that has no matches
$good = substr($string, $last, $i - $last);
//next closing tag to search for
$next = $closetags[1][0];
//how many chars ahead to compare against
$scope = substr($string, $i, strlen($next));
//if we have a match
if ($scope === "$next") {
//add to the temp variable with a modified
//version of an open tag letter to become a close tag
$temp .= $good . substr_replace("[" . $tags[1][0] . "]", "/", 1, 0);
//remove the first key/value in both arrays
array_shift($tags[1]);
array_shift($closetags[1]);
//update the last known unmatched char
$last += strlen($good . $scope);
}
}
echo $temp;
Please also note: it might be the users intention to nest the tags out of order :X

MySQL / PHP, but more of a MATH Question (Shortening Script)

For my latest project I need to shorten the URLs which I then put in a mysql database.
I now ran against a problem, because I don't know how to solve this. Basically, the shortened strings should look like this (I want to include lowercase letters, uppercase letters and numbers)
a
b
...
z
0
...
9
A
...
Z
aa
ab
ac
...
ba
So, 1. URl --> a. Stored in mysql.
Next time, a new url gets stored to --> b because a is already in the mysql database.
And that is it. But I don't have any idea. Could someone of you please help me out?
Edit: Formattted & Further explanation.
It is kinda like the imgur.com URL shortening service. It should continue like this until infinity (which is not needed, I think...)

You can use the following function (code adapted from my personal framework):
function Base($input, $output, $number = 1, $charset = 'abcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ')
{
if (strlen($charset) >= 2)
{
$input = max(2, min(intval($input), strlen($charset)));
$output = max(2, min(intval($output), strlen($charset)));
$number = ltrim(preg_replace('~[^' . preg_quote(substr($charset, 0, max($input, $output)), '~') . ']+~', '', $number), $charset[0]);
if (strlen($number) > 0)
{
if ($input != 10)
{
$result = 0;
foreach (str_split(strrev($number)) as $key => $value)
{
$result += pow($input, $key) * intval(strpos($charset, $value));
}
$number = $result;
}
if ($output != 10)
{
$result = $charset[$number % $output];
while (($number = intval($number / $output)) > 0)
{
$result = $charset[$number % $output] . $result;
}
$number = $result;
}
return $number;
}
return $charset[0];
}
return false;
}
Basically you just need to grab the newly generated auto-incremented ID (this also makes sure you don't generate any collisions) from your table and pass it to this function like this:
$short_id = Base(10, 62, $auto_increment_id);
Note that the first and second arguments define the input and output bases, respectively.
Also, I've reordered the charset from the "default" 0-9a-zA-Z to comply with your examples.
You can also just use base_convert() if you can live without the mixed alphabet case (base 36).

How to remove text between "matching" parentheses?

When I read the alt(technically title)-text of this XKCD comic, I became curious whether every articles in Wikipedia eventually points to Philosophy article. So I began to make a web application that displays what articles it's "pointing" using PHP.
(PS: don't worry about traffic - because I'll use it privately and will not send too much requests to Wikipedia server)
To do this, I have to remove texts between parentheses and italics, and get the first link. Other things can be achieved using PHP Simple HTML DOM Parser, but remove texts between parentheses is the problem..
If there's no parentheses in parentheses, then I could use this RegEx:\([^\)]+\), however, like the article about German language, there's some articles have overlapped parentheses(for example: German (Deutsch [ˈdɔʏtʃ] ( listen)) is..), and above RegEx can't handle these cases, since [^\)]*\) finds first closing parentheses, not matching closing parentheses. (Actually above case doesn't become a problem since there's no text between two closing parentheses, but it becomes a big problem when there's a link between two closing parentheses.)
One dirty solution I can think is this:
$s="content of a wikipedia article";$depth=0;$s2="";
for($i=0;$i<strlen($s);$i++){
$c=substr($s,$i,1);
if($c=='(')$depth++;
if($c==')'){if($depth>0)$depth--;continue;}
if($depth==0) $s2.=$c;
}
$s=$s2;
However, I don't like this solution since it cuts down a string into single characters and that looks like unnecessary...
Is there other ways to remove text in a pair of(matching) parentheses?
For example, I want to make this text:
blah(asdf(foo)bar(lol)asdf)blah
into this:
blahblah
but not like this:
blahbarasdf)blah
Edit : from a comment of Emil Vikström's answer, I realized that above approach(remove texts between parentheses) may remove a link containing parentheses. However, I still want the answer of above problem, since I met similar problem before and I want to know the answer...
So my question is still: how to remove texts between matching parentheses?

You can check out recursive patterns, which should be able to solve the problem.
When I read the comic I didn't have the willpower to get my head around recursive patterns, so I simplified it to find a link and only then check if it's in parenthesis. Here's my solution:
//Fetch links
$matches = array();
preg_match_all('!<a [^>]*href="/wiki/([^:"#]+)["#].*>!Umsi', $text, $matches);
$links = $matches[1];
//Find first link not within parenthesis
$found = false;
foreach($links as $l) {
if(preg_match('!\([^)]+/wiki/'.preg_quote($l).'.+\)!Umsi', $text)) {
continue;
}else{
$found = true;
break;
}
}
Here's my entire script: http://lajm.eu/emil/dump/filosofi.phps

Great! I am seeing someone with a problem which I experienced while cleaning up Wikipedia plain text content. Here is how you use it.
cleanBraces("blah(asdf(foo)bar(lol)asdf)blah", "(", ")")
will return
blahblah
You can pass any type of braces. Like [ and ] or { and }
Here goes my source code.
function cleanBraces($source, $oB, $eB) {
$finalText = "";
if (preg_match("/\\$oB.*\\$eB/", $source) > 0) {
while (preg_match("/\\$oB.*\\$eB/", $source) > 0) {
$brace = getBracesPos($source, $oB, $eB);
$finalText .= substr($source, 0, $brace[0]);
$source = substr($source, $brace[1] + 1, strlen($source) - $brace[1]);
}
$finalText .= $source;
} else {
$finalText = $source;
}
return $finalText;
}
function getBracesPos($source, $oB, $eB) {
if (preg_match("/\\$oB.*\\$eB/", $source) > 0) {
$open = 0;
$length = strlen($source);
for ($i = 0; $i < $length; $i++) {
$currentChar = substr($source, $i, 1);
if ($currentChar == $oB) {
$open++;
if ($open == 1) { // First open brace
$firstOpenBrace = $i;
}
} else if ($currentChar == $eB) {
$open--;
if ($open == 0) { //time to wrap the roots
$lastCloseBrace = $i;
return array($firstOpenBrace, $lastCloseBrace);
}
}
} //for
} //if
}

Performance of tokenizing CSS in PHP

This is a noob question from someone who hasn't written a parser/lexer ever before.
I'm writing a tokenizer/parser for CSS in PHP (please don't repeat with 'OMG, why in PHP?'). The syntax is written down by the W3C neatly here (CSS2.1) and here (CSS3, draft).
It's a list of 21 possible tokens, that all (but two) cannot be represented as static strings.
My current approach is to loop through an array containing the 21 patterns over and over again, do an if (preg_match()) and reduce the source string match by match. In principle this works really good. However, for a 1000 lines CSS string this takes something between 2 and 8 seconds, which is too much for my project.
Now I'm banging my head how other parsers tokenize and parse CSS in fractions of seconds. OK, C is always faster than PHP, but nonetheless, are there any obvious D'Oh! s that I fell into?
I made some optimizations, like checking for '#', '#' or '"' as the first char of the remaining string and applying only the relevant regexp then, but this hadn't brought any great performance boosts.
My code (snippet) so far:
$TOKENS = array(
'IDENT' => '...regexp...',
'ATKEYWORD' => '#...regexp...',
'String' => '"...regexp..."|\'...regexp...\'',
//...
);
$string = '...CSS source string...';
$stream = array();
// we reduce $string token by token
while ($string != '') {
$string = ltrim($string, " \t\r\n\f"); // unconsumed whitespace at the
// start is insignificant but doing a trim reduces exec time by 25%
$matches = array();
// loop through all possible tokens
foreach ($TOKENS as $t => $p) {
// The '&' is used as delimiter, because it isn't used anywhere in
// the token regexps
if (preg_match('&^'.$p.'&Su', $string, $matches)) {
$stream[] = array($t, $matches[0]);
$string = substr($string, strlen($matches[0]));
// Yay! We found one that matches!
continue 2;
}
}
// if we come here, we have a syntax error and handle it somehow
}
// result: an array $stream consisting of arrays with
// 0 => type of token
// 1 => token content

Use a lexer generator.

The first thing I would do would be to get rid of the preg_match(). Basic string functions such as strpos() are much faster, but I don't think you even need that. It looks like you are looking for a specific token at the front of a string with preg_match(), and then simply taking the front length of that string as a substring. You could easily accomplish this with a simple substr() instead, like this:
foreach ($TOKENS as $t => $p)
{
$front = substr($string,0,strlen($p));
$len = strlen($p); //this could be pre-stored in $TOKENS
if ($front == $p) {
$stream[] = array($t, $string);
$string = substr($string, $len);
// Yay! We found one that matches!
continue 2;
}
}
You could further optimize that by pre-calculating the length of all your tokens and storing them in the $TOKENS array, so that you don't have to call strlen() all the time. If you sorted $TOKENS into groups by length, you could reduce the number of substr() calls further as well, as you could take a substr($string) of the current string being analyzed just once for each token length, and run through all the tokens of that length before moving on to the next group of tokens.

the (probably) faster (but less memory friendly) approach would be to tokenize the whole stream at once, using one big regexp with alternatives for each token, like
preg_match_all('/
(...string...)
|
(#ident)
|
(#ident)
...etc
/x', $stream, $tokens);
foreach($tokens as $token)...parse

Don't use regexp, scan character by character.
$tokens = array();
$string = "...code...";
$length = strlen($string);
$i = 0;
while ($i < $length) {
$buf = '';
$char = $string[$i];
if ($char <= ord('Z') && $char >= ord('A') || $char >= ord('a') && $char <= ord('z') || $char == ord('_') || $char == ord('-')) {
while ($char <= ord('Z') && $char >= ord('A') || $char >= ord('a') && $char <= ord('z') || $char == ord('_') || $char == ord('-')) {
// identifier
$buf .= $char;
$char = $string[$i]; $i ++;
}
$tokens[] = array('IDENT', $buf);
} else if (......) {
// ......
}
}
However, that makes the code unmaintainable, therefore, a parser generator is better.

It's an old post but still contributing my 2 cents on this.
one thing that seriously slows down the original code in the question is the following line :
$string = substr($string, strlen($matches[0]));
instead of working on the entire string, take just a part of it (say 50 chars) which are enough for all the possible regexes. then, apply the same line of code on it. when this string shrinks below a preset length, load some more data to it.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

Regex replace characters in data - php

What about "/[^!&\/\w\s]/ui" ? Works fine on my machine

Related

What is a workaround for colour coding in wordwrap()

reorder / rewrap bbcodes

MySQL / PHP, but more of a MATH Question (Shortening Script)

How to remove text between "matching" parentheses?

Performance of tokenizing CSS in PHP

Categories

Resources