Regular Expression for a specific ID in a bracket


I have little confidence when it comes to regular expressions. Writing this in PHP code.
I need to be able to filter out strings that follow this format, where the numbers can be 4~6 digits (numeric only):
$input = "This is my string with a weird ID added cause I'm a weirdo! (id:11223)";
I could simply remove the last word by finding the last position of a space via strrpos(); (it appears none of them have a trailing space from JSON feed), then use substr(); to cut it. But I think the more elegant way would be a substring. The intended output would be:
$output = trim(preg_replace('[regex]', '', $input));
// $output = "This is my string with a weird ID added cause I'm a weirdo!"
So this regex should match with the brackets, and the id: portion, and any contiguous numbers, such as:
(id:33585)
(id:1282)
(id:9845672)
Intending to use the preg_replace() function to remove these from a data feed. Don't ask me why they decided to include an ID in the description string... It blows my mind too why it's not a separate column in the JSON feed altogether.

Try using the pattern \(id:\d+\):
$input = "Text goes here (id:11223) and also here (id:33585) blah blah";
echo $input . "\n";
$output = preg_replace("/\(id:\d+\)/", "", $input);
echo $output;
This prints:
Text goes here (id:11223) and also here (id:33585) blah blah
Text goes here  and also here  blah blah
There is an edge case here, which you can see in the (possibly unwanted) extra whitespace left behind after the replacement. We could try to get sophisticated and remove that too, but you should state what your expected output is.
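One way to deal with the leftover whitespace is to let the pattern consume any whitespace directly in front of the ID as well. This is only a sketch: the function name stripIds is made up for illustration, and it assumes the ID always has the exact form (id:digits).

```php
<?php
// Hypothetical helper: remove every "(id:NNN)" token together with the
// whitespace directly in front of it, then trim the ends of the result.
function stripIds(string $text): string
{
    // \s*      optional whitespace before the token
    // \(id:    literal "(id:"
    // \d+      one or more digits
    // \)       literal ")"
    return trim(preg_replace('/\s*\(id:\d+\)/', '', $text));
}

echo stripIds("This is my string with a weird ID added cause I'm a weirdo! (id:11223)"), "\n";
echo stripIds("Text goes here (id:11223) and also here (id:33585) blah blah"), "\n";
```

If the IDs should be limited to a fixed number of digits, \d+ could be swapped for something like \d{4,6}.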

Related

Regex Get the previous sentence based on period or new line, before the occurrence of a word

I'm trying to make a regex statement that can get the previous sentence before the occurrence of "[bbcode]" but is flexible enough to work in different scenarios.
For example, the previous sentence may be defined as following a period. However, it may simply be on a new line. I cannot use ^$ to define start or end of line as this may not always be the case.
Whole test string:
Example 1:
Blah blah blah. THIS SENTENCE SHOULD BE SELECTED [bbcode]
Example 2:
THIS SENTENCE SHOULD BE SELECTED [bbcode]
Example 3:
A trick sentence. And another. THIS SENTENCE SHOULD BE SELECTED
[bbcode]
Expected matches:
All three instances of THIS SENTENCE SHOULD BE SELECTED should be matched.
This is the regex I tried:
'/(?:\.)(.+)(\[bbcode\])/gUs'
This fails when sentence is on a new line as in Example 2.
Link to Regex Interrupter using my Regex
I have tried negative lookbehinds to no avail. The strings "THIS SENTENCE SHOULD BE SELECTED" should get picked up in all three examples.
Picking up surrounding spaces is ok because I can trim it later.
Challenges:
The entire supplied code must be tested as one string. This is how the data will be supplied and will likely contain many random spaces, new lines etc which the regex must consider.
It is likely impossible to prepare / sanitize the string first, as the string will likely be very poorly formatted without proper punctuation. Contracting the string could cause unintended run-on sentences.

This can be achieved with basic PHP functions. Something like this:
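function extractSentence($string)
{
    // Everything before the first [bbcode] tag
    $before = substr($string, 0, strpos($string, '[bbcode]'));
    // Keep what follows the last period, then strip newlines, spaces and periods
    return trim(substr($before, strrpos($before, '.')), "\n .");
}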
The advantage is that it is easy to understand, doesn't take much time to develop and can more easily be changed if that need arises.
See: PHP Fiddle
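Since the question says the whole input arrives as one string with several [bbcode] tags, the same idea can be extended to every tag. The following is a rough sketch, not a drop-in solution: the function name extractSentences is invented here, and it assumes a sentence starts after the last period or newline in front of each tag.

```php
<?php
// Hypothetical helper: return the sentence in front of every occurrence of $tag.
function extractSentences(string $text, string $tag = '[bbcode]'): array
{
    $chunks = explode($tag, $text);
    array_pop($chunks); // the text after the last tag does not precede any tag

    $sentences = [];
    foreach ($chunks as $chunk) {
        // Drop the whitespace or newline that sits right before the tag
        $chunk = rtrim($chunk);
        // Split on periods and newlines; the last piece is the candidate sentence
        $parts = preg_split('/[.\n]/', $chunk);
        $sentence = trim(end($parts));
        if ($sentence !== '') {
            $sentences[] = $sentence;
        }
    }
    return $sentences;
}

$text = "Blah blah blah. THIS SENTENCE SHOULD BE SELECTED [bbcode]\n"
      . "THIS SENTENCE SHOULD BE SELECTED [bbcode]\n"
      . "A trick sentence. And another. THIS SENTENCE SHOULD BE SELECTED\n"
      . "[bbcode]";
print_r(extractSentences($text));
```

Note that a sentence ending in a period directly before the tag yields an empty piece and is skipped in this sketch.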

Match and release an optional space ( *\K) then
Lazily match one or more non-dot characters ([^.]+?) then
Lookahead for one or more whitespace characters followed by the bbcode tag ((?=\s+\[bbcode]))
Make the pattern case-insensitive if the bbcode might be uppercase (i)
Code: (Demo)
$tests = [
    'Blah blah blah. THIS SENTENCE SHOULD BE SELECTED [bbcode] text',
    'THIS SENTENCE SHOULD BE SELECTED [bbcode] text',
    'A trick sentence. And another. THIS SENTENCE SHOULD BE SELECTED
[bbcode]] text'
];
foreach ($tests as $test) {
    var_export(preg_match('/ *\K[^.]+?(?=\s+\[bbcode])/i', $test, $m) ? $m[0] : 'no match');
    echo "\n---\n";
}

Insert a string into a string at a specific point using PHP

I'm wondering if there is a way to insert a string into another string at a specific point (in this case, near the end)? For example:
$string1 = "item one, item two, item three, item four";
$string2 = "AND";
//do something fancy here
echo $string1;
OUTPUT:
item one, item two, item three, AND item four
I need some help with the fancy string work part. Basically inserting the word after the last ", " if possible.

You can use preg_replace for this, and I find it to be terser than other string manipulation methods and also more easily adaptable if your use case should change:
$string1 = "item one, item two, item three, item four";
$string2 = "AND";
$pattern = "/,(?!.*,)/";
$string1 = preg_replace($pattern, ", $string2", $string1);
echo $string1;
Where you pass preg_replace a regex pattern, the replacement string, and the original string. Instead of modifying the original string, preg_replace returns a new string, so you will set $string1 equal to the output of preg_replace.
The pattern: You can use any delimiter to signal the beginning and end of the expression. Typically I see / used*, so the expression will be "/pattern/", where the pattern consists of the comma and a negative lookahead (?!) to find the comma that isn't followed by another comma. It isn't necessary to explicitly declare $pattern. You can just use the expression directly in the preg_replace arguments, but sometimes it can be just a little easier (especially for complex patterns) to separate the pattern declaration from its use.
The replacement: preg_replace is going to replace the entire match, so you need to prepend your replacement text with the comma (that's getting replaced) and a space. Since variables wrapped in double quotes are evaluated in strings, you put $string2 inside the quotes**.
The target: you just put your original string here.
* I prefer to use ~ as my delimiter, since / starts to get cumbersome when you deal with urls, but you can use anything.
Here is a cheat sheet for regex patterns, but there are plenty of these floating around. Just google regex cheat sheet if you need one.
https://www.rexegg.com/regex-quickstart.html
Also, you can find plenty of online regex testers. Here is one that includes a quick reference and also lets you switch regex engines (there are a few, and some can be just a little bit different than others):
https://regex101.com/
** I prefer to also wrap the variable in curly braces to make it more explicit that I am inserting the value, but it's optional. That would look like ", {$string2}"

Lots of ways to do this - but since you specifically stated "after the last ," then strrpos seemed appropriate:
// insert this line where you indicate 'do something fancy here'
$string1 = substr_replace($string1, " ".$string2, strrpos($string1,",")+1, 0);
Find the right-most comma and insert the $string2 (with a space prepended) one position after it. The last parameter indicates the length of the substring to replace so a 0 means "do not replace anything but only insert."
Note the extra space added to $string2. Obviously you could modify how $string2 is initialized to include the space and eliminate this part.
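One thing the one-liner does not cover is a string without any comma, where strrpos returns false. A small wrapper could guard against that. This is just a sketch, and the function name insertAfterLastComma is made up here:

```php
<?php
// Hypothetical helper: insert $word (with a leading space) right after the last comma.
function insertAfterLastComma(string $list, string $word): string
{
    $pos = strrpos($list, ',');
    if ($pos === false) {
        // No comma found: return the input unchanged
        return $list;
    }
    // Length 0 means "insert only, replace nothing"
    return substr_replace($list, ' ' . $word, $pos + 1, 0);
}

echo insertAfterLastComma("item one, item two, item three, item four", "AND"), "\n";
```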

You can use a pattern to match the last comma in the string using .*, and then use \K to forget what is matched so far.
In the replacement, use " AND" (note the leading space):
$string1 = "item one, item two, item three, item four";
$string2 = " AND";
$string1 = preg_replace("/.*,\K/", $string2, $string1);
echo $string1;
Output
item one, item two, item three, AND item four
See a PHP demo.

Getting titles out of string

I'm really stuck with this one program...
I'm learning how to program and I'm starting with PHP right now.
I need to get titles out of articles.
I already asked this question, and I managed to get the first title of the text in many ways. For example, if the text was:
Hello
I'm learning how
to write this code.
then I got the "Hello" part, for example, like this:
<?php
$string = "Hello
I'm learning how
to write this code.";
$str=strstr($string,"\n",true);
echo $str . "<br />";
?>
However, there can be a lot of titles in the article, each separated by blank lines above and below, and I cannot manage to get all of these titles.
Here's what I tried:
<?php
$string="
Good text

Good text is good but I have no idea
how to code this.

Another title

I need to get you,
but don't know how.";
$finda="\n";
$get = substr($string, strpos($string, $finda), -1);
$getFinal=strstr($get, $finda, true);
echo $getFinal;
?>
But this doesn't work because there are "\n" after every line. How to identify only those blank lines? I tried to find them:
$getRow = explode("\n", $string);
foreach($getRow as $row){
if(strlen($row) <= 1){
but I don't know what to do next.
Do you have any ideas? Can you help?
Thank you in advance:)

You can use a regular expression like this:
<?php
$string="
Good text

Good text is good but I have no idea
how to code this.

Another title

I need to get you,
but don't know how.";
preg_match_all('/^\n(.+?)\n\n/m', $string, $matches);
var_dump($matches[1]);
?>
Outputs:
array(2) {
[0] =>
string(9) "Good text"
[1] =>
string(13) "Another title"
}
Explanation of the regular expression
Regular expressions are a compact way to describe constraints for a string. Either to check that it verifies a given pattern or to capture some of its parts. In this case, we want to capture some parts of the string (titles).
'/^\n(.+?)\n\n/m' is the regular expression used to solve your problem. The actual expression is between the slashes, while the trailing m is a modifier. It enables multi-line mode, in which ^ matches at the beginning of every line rather than only at the beginning of the string.
We are left with ^\n(.+?)\n\n which can be read from left to right.
^ indicates the beginning of a line and \n represents the "new line" character. Coupled (^\n), they represent an empty line.
Parentheses indicate what we want to capture. In this case, the title, which can be any number of any characters. The . represents any character except a newline and the + indicates that we want one or more occurrences of it (the * can be used instead to also allow zero occurrences). The ? makes the quantifier lazy: it stops at the first point where the remaining part of the regular expression can match, instead of capturing as much as possible.
Then, the two \n represent the end of the title line and the end of the empty line following it.
As we used preg_match_all instead of preg_match, every occurrence of the pattern will be matched instead of the first one only.
Regular expressions are really powerful and I invite you to learn them further.

While iterating over the lines, you could have a variable that stores what you are currently doing. What I mean is that you could have 3 states: processing_text, expecting_title, got_title.
Each time you find that $row == "" (meaning there was an empty line, only containing a \n), you set your variable to expecting_title. If the var==expecting_title, you store/echo the next row you encounter and set the variable to got_title. This way, when you encounter the next empty line, you won't set the variable to expecting_title, but to processing_text.
Some pseudocode to get you started:
foreach ($getRow as $row)
    if (state == expecting_title)
        processTitle($row)
        state=got_title
    if ($row == "")
        if (state == processing_text)
            state=expecting_title
        else
            state=processing_text
Or, you can always use regex, as the other answer mentioned, but that's another story.
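The pseudocode above could be turned into runnable PHP roughly as follows. This is a sketch under a few assumptions: the function name findTitles is invented, lines are split on "\n", and several blank lines in a row are treated as one (a small deviation from the pseudocode).

```php
<?php
// Hypothetical helper: collect every line that directly follows a blank line
// which itself came after body text (or the start of the string).
function findTitles(string $text): array
{
    $titles = [];
    $state = 'processing_text';

    foreach (explode("\n", $text) as $row) {
        $row = rtrim($row, "\r"); // tolerate Windows line endings

        if ($row === '') {
            if ($state === 'processing_text') {
                $state = 'expecting_title';   // blank line after body text
            } elseif ($state === 'got_title') {
                $state = 'processing_text';   // blank line that closes a title
            }
            continue; // while expecting a title, extra blank lines are ignored
        }

        if ($state === 'expecting_title') {
            $titles[] = $row;
            $state = 'got_title';
        }
    }
    return $titles;
}

$string = "\nGood text\n\nGood text is good but I have no idea\nhow to code this.\n\nAnother title\n\nI need to get you,\nbut don't know how.";
print_r(findTitles($string));
```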

Automatically convert keywords to links in php

I am trying to convert specific keywords in a text, which are stored in an array, into links.
Example text:
$text='This text contains many keywords, but also formated keywords.'
So now I want to convert the word keywords to the #keywords.
I used the very simple preg_replace function
preg_replace('/keywords/i',' keywords ',$text);
but obviously it also converts the occurrences that are already formatted as links, so I get messy HTML like:
$text='This text contains many keywords, but also formated keywords" title="keywords">keywords</a>.'
Expected result:
$text='This text contains many keywords, but also formated keywords.'
Any suggestions?
THX
EDIT
We are one step from the perfect function, but still not working well in this case:
$text='This text contains many keywords, but also formated
keywords.'
In this case it replaces also the word keywords in the href, so we again get the messy code like
keywords.com/keywords" title="keywords">keywords</a>

I'm not great with regular expressions, but maybe this one will work:
/[^#>"]keywords/i
What I think it will do is ignore any instances of #keywords, >keywords, and "keywords and find the rest.
EDIT:
After testing it out, it looks like that replaces the space before the word as well, and doesn't work if keywords is the beginning of the string. It also didn't preserve original capitalization. I have tested this one, and it works perfectly for me:
$string = "Keywords and keywords, plus some more keywords with the original keywords.";
$string = preg_replace("/(?<![#>\"])keywords/i", "$0", $string);
echo $string;
The first three are replaced, preserving the original capitalization, and the last one is left untouched. This one uses a negative lookbehind and backreferences.
EDIT 2:
OP edited question. With the new example provided, the following regex will work:
$string = 'This text contains many keywords, but also formated keywords.';
$string = preg_replace("/(?<![#>\".\/])keywords/i", "$0", $string);
echo $string;
// outputs: This text contains many keywords, but also formated keywords.
This will replace all instances of keywords that are not preceded by #, >, ", ., or /.
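To make the lookbehind idea concrete, here is a small self-contained sketch. The link markup (an anchor with href="#keyword") and the function name linkKeyword are assumptions made for illustration, since the exact HTML is not shown in the thread.

```php
<?php
// Hypothetical helper: wrap $keyword in a link unless it is directly
// preceded by one of # > " . / (which suggests it already sits inside a link).
function linkKeyword(string $text, string $keyword): string
{
    $pattern = '/(?<![#>".\/])' . preg_quote($keyword, '/') . '/i';
    // $0 is the whole match, so the original capitalization is preserved
    return preg_replace($pattern, '<a href="#$0">$0</a>', $text);
}

$string = 'Keywords and keywords, plus some more keywords with the original <a href="#keywords">keywords</a>.';
echo linkKeyword($string, 'keywords'), "\n";
```

The lookbehind only inspects a single preceding character, so it is a heuristic rather than real HTML parsing.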

Here is the problem:
The keyword could be inside the href, the title, or the text of the link, and anywhere in there (like if the keyword was sanity and you already had href="insanity"). Or even worse, you could have a non-keyword link that happens to contain a keyword, something like:
Click here to find more keywords and such!
In the above example, even though it fits every other possible criteria (it's got spaces before and after being the easiest one to test for), it still would result in a link within a link, which I think breaks the internet.
Because of this, you need to use lookaheads and lookbehinds to check if the keyword is wrapped in a link. But there is one catch: lookbehinds have to have a defined pattern (meaning no wild cards).
I thought I'd be the hero and show you the easy fix for your issue, which would be something to the effect of:
'/(?<!\<a.?>)[list|of|keywords](?!\<\/a>)/'
Except you can't do that because the lookbehind in this case has that wildcard. Without it, you end up with a super greedy expression.
So my proposed alternative is to use regex to find all link elements, then str_replace to swap them out with placeholders, and then restore the original links in place of the placeholders at the end.
Here's how I did it:
$text='This text contains many keywords, but also formated keywords.';
$keywords = array('text', 'formatted', 'keywords');
// Build an alternation group such as (?:text|formatted|keywords).
// Square brackets would create a character class, not a list of words.
$keyword_list_pattern = '(?:' . implode('|', $keywords) . ')';
// First, get all link elements that contain a keyword (lazy quantifiers keep each match short)
preg_match_all('/<a.*?' . $keyword_list_pattern . '.*?<\/a>/', $text, $links);
$links = array_unique($links[0]); // Cleaning up array for next step.
// Second, swap out all matches with a placeholder, and build restore array:
$restore_links = array();
foreach($links as $count => $link) {
    $link_key = "xxx_{$count}_xxx";
    $restore_links[$link_key] = $link;
    $text = str_replace($link, $link_key, $text);
}
// Third, we build a nice replacement array for the keywords:
$keyword_links = array();
foreach($keywords as $keyword) {
    $keyword_links[$keyword] = "<a href='#$keyword'>$keyword</a>";
}
// Merge the restore links to the bottom of the keyword links for one mass replacement:
$keyword_links = array_merge($keyword_links, $restore_links);
$text = str_replace(array_keys($keyword_links), $keyword_links, $text);
echo $text;
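As a possible alternative to placeholders, PCRE's backtracking verbs (*SKIP)(*FAIL) can be used to match whole link elements and throw them away, so only keywords outside links are replaced. This is a different technique from the one in the answer above, shown here only as a sketch; the function name and the link markup are assumptions.

```php
<?php
// Hypothetical helper: link keywords that are not already inside an <a> element.
function linkKeywordsOutsideAnchors(string $text, array $keywords): string
{
    if ($keywords === []) {
        return $text; // nothing to link
    }
    $alternation = implode('|', array_map(
        fn(string $k): string => preg_quote($k, '~'),
        $keywords
    ));
    // Left branch: match a complete <a ...>...</a> element, then force a failure
    // and skip past it. Right branch: a keyword as a whole word.
    $pattern = '~<a\b[^>]*>.*?</a>(*SKIP)(*FAIL)|\b(?:' . $alternation . ')\b~is';
    return preg_replace($pattern, '<a href="#$0">$0</a>', $text);
}

$text = 'This text contains many keywords, but also formatted <a href="#keywords" title="keywords">keywords</a>.';
echo linkKeywordsOutsideAnchors($text, ['text', 'keywords']), "\n";
```

This still treats HTML as plain text, so nested or malformed markup may need a real parser such as DOMDocument.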

You can change your RegEx so that it only targets keywords with a space in front, since the formatted keywords do not contain a space. Here is an example.
$text = preg_replace('/ keywords/i',' keywords',$text);

regex to find all text after delimited string

I have some content that contains a token string in the form
$string_text = '[widget_abc]This is some text. This is some text, etc...';
And I want to pull all the text after the first ']' character
So the returned value I'm looking for in this example is:
This is some text. This is some text, etc...

preg_match("/^.+?\](.+)$/is" , $string_text, $match);
echo trim($match[1]);
Edit
As per author's request - added explanation:
preg_match(param1, param2, param3) is a function that allows you to match a single case scenario of a regular expression that you're looking for
param1 = "/^.+?\](.+)$/is"
"//" - the two slashes are the delimiters that enclose your regular expression in param1
i - the modifier at the end makes the match case insensitive (it doesn't care if your letters are 'a' or 'A')
s - lets . also match newline characters, so the match can span multiple lines
^ - start the check from the beginning of the string
$ - go all the way to the end of the string
. - represents any character
.+ - one or more characters of anything
.+? - one or more characters of anything, but as few as possible, until you reach the next part of the pattern
.+?\] - one or more characters of anything until you reach the first ] (there is a backslash before ] because ] has a special meaning in regular expressions - look it up)
(.+)$ - capture everything after ] and store it as a separate element in the array defined in param3
param2 = the string that you created.
param3 = the array ($match) that receives the results; $match[1] holds the captured text.
I tried to simplify the explanations, I might be off, but I think I'm right for the most part.

The regex (?<=]).* will solve this problem if you can guarantee that there are no other square brackets on the line. In PHP the code will be:
if (preg_match('/(?<=\]).*/', $input, $group)) {
$match = $group[0];
}
This will transform [widget_abc]This is some text. This is some text, etc... into This is some text. This is some text, etc.... It matches everything that follows the ].

$output = preg_replace('/^[^\]]*\]/', '', $string_text);

Is there any particular reason why a regex is wanted here?
echo substr(strstr($string_text, ']'), 1);

A regex is definitely overkill for this instance.
Here is a nice one-liner :
list(, $result) = explode(']', $inputText, 2);
It does the job and is way less expensive than using regular expressions.
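None of the non-regex snippets above say what should happen when the string contains no ] at all. A minimal sketch that makes this case explicit could look like the following; the function name textAfterFirstBracket and the choice to return null are assumptions for illustration.

```php
<?php
// Hypothetical helper: return everything after the first "]",
// or null when the string contains no "]" at all.
function textAfterFirstBracket(string $text): ?string
{
    $pos = strpos($text, ']');
    if ($pos === false) {
        return null;
    }
    return substr($text, $pos + 1);
}

$string_text = '[widget_abc]This is some text. This is some text, etc...';
echo textAfterFirstBracket($string_text), "\n";
```

Because only the first ] is used as the split point, any later brackets stay in the returned text.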
