Different results between preg_replace & preg_match_all - php

I have a forum that supports hashtags. I'm using the following line to convert all hashtags into links. I'm using the (^|\(|\s|>) pattern to avoid picking up named anchors in URLs.
$str=preg_replace("/(^|\(|\s|>)(#(\w+))/","$1$2",$str);
I'm using this line to pick up hashtags and store them in a separate field when the user posts their message. It picks up all hashtags EXCEPT those at the start of a new line.
preg_match_all("/(^|\(|\s|>)(#(\w+))/",$Content,$Matches);
Using the m & s modifiers doesn't make any difference. What am I doing wrong in the second instance?
Edit: the input text could be plain text or HTML. Example of problem input:
#startoftextreplacesandmatches #afterwhitespacereplacesandmatches <b>#insidehtmltagreplacesandmatches</b> :)
#startofnewlinereplacesbutdoesnotmatch :(

Your replace operation has a problem which you have evidently not yet come across - it will allow unescaped HTML special characters through. The reason I know this is because your regex allows hashtags to be prefixed with >, which is a special character.
For that reason, I recommend you use this code to do the replacement, which will double up as the code for extracting the tags to be inserted into the database:
$hashtags = array();
$expr = '/(?:(?:(^|[(>\s])#(\w+))|(?P<notag>.+?))/';
$str = preg_replace_callback($expr, function($matches) use (&$hashtags) {
if (!empty($matches['notag'])) {
// This takes care of HTML special characters outside hashtags
return htmlspecialchars($matches['notag']);
} else {
// Handle hashtags
$hashtags[] = $matches[2];
return htmlspecialchars($matches[1]).'#'.htmlspecialchars($matches[2]).'';
}
}, $str);
After the above code has been run, $str will contain the modified string, properly escaped for direct output, and $hashtags will be populated with all the tags matched.
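As a baseline for the extraction step on its own, here is a minimal sketch that runs the question's pattern through preg_match_all on a sample resembling the problem input. It assumes the line break is a plain "\n"; the sample string and variable names are made up for illustration. With that input the tag at the start of the second line is found, because \s matches the newline that precedes it. This does not explain the behaviour reported in the question, it only gives a known-good case to compare real input against.

```php
<?php
// Sample input modelled on the question; the line break is assumed to be "\n".
$content = "#first #second <b>#third</b> :)\n#fourth :(";

// Same pattern as in the question: group 3 holds the tag name without the '#'.
preg_match_all('/(^|\(|\s|>)(#(\w+))/', $content, $matches);

$tags = $matches[3];
print_r($tags);
```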

Related

Regex expression to exclude words surrounded by special characters

I’ve been having issues with finding a solution to a regex conundrum I’m having.
Recently, I worked on a project where we needed to replace a list of words in a given text with a list of anchor tags.
For example, given a string
This is a test string
I may want to replace the word “test” with
<a target="_blank" href="https://website.com/string-random">test</a>.
The resulting string should look like this
This is a <a target="_blank" href="https://website.com/string-random">test</a> string
The replacement of the words is done in a loop
foreach ($documents as $document)
foreach ($links as $link)
replace keywords
What ends up happening in some scenarios is some of the urls in the anchor tags contain words that could potentially be replaced
For example, given this list of words to replace
[
{
'keyword': 'test',
'link': 'https://website.com/string-random'
},
{
'keyword': 'string',
'link': 'https://random.com/string'
}
]
After all the replacements are done, the sample string I gave above would look like this
This is a <a target="_blank" href="https://website.com/<a target="_blank" href="https://random.com/string">string</a>-random">test</a> <a target="_blank" href="https://random.com/string">string</a>
Instead of
This is a <a target="_blank" href="https://website.com/string-random">test</a> <a target="_blank" href="https://random.com/string">string</a>
Currently, I am looking for a regular expression that would not match on any words surrounded by special characters as I think this would solve my problem.
Also very open to any other ideas on how to tackle this problem
This is not just about the previous replacements: any word that occurs within tag attributes / names / values is an issue.
In other words, you want to replace only strings that are followed by some chars where the next < occurs before the next > (strings between tags, not within tags).
Hence try this one :
(string-to-match)(?=[^>]*?<)
(replace string-to-match, obviously)
The other block is a lookahead : it ensures that you can read any char but >, as many times as needed, then a <
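A minimal sketch of that lookahead applied to the example from the question. The function name linkKeywordOutsideTags is hypothetical, and the pattern is a variation rather than the exact one above: it uses [^<>]* and also accepts the end of the string ((?:<|$)), on the assumption that plain text with no tag after the keyword should be linked too. It uses preg_replace_callback so that characters in the URL are never interpreted as replacement references.

```php
<?php
// Hypothetical helper: wrap $keyword in a link only where it sits in text,
// not inside a tag (for example inside an href attribute).
function linkKeywordOutsideTags(string $html, string $keyword, string $url): string
{
    // The lookahead succeeds only if the next angle bracket is '<' (or there is none),
    // which means the keyword is not inside an unfinished tag.
    $pattern = '~\b' . preg_quote($keyword, '~') . '\b(?=[^<>]*(?:<|$))~i';
    $href = htmlspecialchars($url, ENT_QUOTES);

    return preg_replace_callback($pattern, function (array $m) use ($href): string {
        return '<a target="_blank" href="' . $href . '">' . $m[0] . '</a>';
    }, $html);
}

$text = 'This is a test string';
$links = [
    ['keyword' => 'test', 'link' => 'https://website.com/string-random'],
    ['keyword' => 'string', 'link' => 'https://random.com/string'],
];
foreach ($links as $link) {
    $text = linkKeywordOutsideTags($text, $link['keyword'], $link['link']);
}
echo $text, PHP_EOL;
```

Note that this only protects tag names and attributes. A keyword that appears in the visible text of an existing link would still be wrapped, producing a nested link.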
Try :
foreach ($wordlist as $word){
$document = preg_replace("~(?! )($word[keyword])(?! )~i", "<a href='$word[link]'>$1</a>", $document);
}
I found a pattern that works pretty well for me here
$pattern = '/(?<!(>|\/|-))\b' . preg_quote($stringToReplace, '/') . '\b(?!(<|\/|-))/i';

Replacing all of the characters in between two html tags with whitespace in PHP

I'm trying to write some code to replace every character between two html tags (<del> tag to be specific) with a whitespace, but I'm struggling with the PHP string methods.
An example of how the code should operate: if the initial string is "blahblah<del>blahblah</del>blahblah", ideally the result would be "blahblah        blahblah" with 8 whitespaces (for the 8 character length of the piece between the <del> tags) in between the two blahblah's
I'm writing this code for a pretty basic server, so I don't have access to any extra libraries/services like Node,etc.
Any help would be appreciated!
Edit: The text between the <del> tags is variable in length. It's not always 8 characters long.
You can use preg_replace_callback to pull the content between the <del> and </del>. You then can use that return to count the number of characters inside and replace them in the original string. Something like this should do it:
$string = 'blahblah<del>blahblah</del>blahblah';
echo preg_replace_callback('~<del>(.*?)</del>~', function($match) {
$replace = str_repeat(' ', mb_strlen($match[1]));
return $replace;
}, $string);
https://3v4l.org/ht22A

Make sure string is a valid CSS ID name

I have a bunch of database records (without auto_increment IDs or anything else like that) rendered as a list, and it came to pass that I need to differentiate each of them with a unique id.
I could just add a running counter into the loop and be done with it, but unfortunately this ID needs to be cross-referenceable throughout the site, however the list is ordered or filtered.
Therefore I got an idea to include the record title as a part of the id (with a prefix so it doesn't clash with layout elements).
How could I transform a string into an id name in a foolproof way so that it never contains characters that would break the HTML or not work as valid CSS selectors?
For example;
Title ==> prefix_title
TPS Report 2010 ==> prefix_tps_report_2010
Mike's "Proposal" ==> prefix_mikes_proposal
#53: Míguèl ==> prefix_53_miguel
The titles are always diverse enough to avoid conflicts, and will never contain any non-western characters.
Thanks!
I needed a similar solution for Deeplink Anchors.
Found this useful:
function seoUrl($string) {
//Lower case everything
$string = strtolower($string);
//Make alphanumeric (removes all other characters)
$string = preg_replace("/[^a-z0-9_\s-]/", "", $string);
//Clean up multiple dashes or whitespaces
$string = preg_replace("/[\s-]+/", " ", $string);
//Convert whitespaces and underscore to dash
$string = preg_replace("/[\s_]/", "-", $string);
return $string;
}
Credit: https://stackoverflow.com/a/11330527/3010027
PS: WordPress users, have a look here: https://codex.wordpress.org/Function_Reference/sanitize_title
You don't need to use the HTML id attribute at all. You can use HTML5 data-* attribute to store user defined values. Here is an example:
<ul class="my-awesome-list">
<!-- PHP code, begin a for/while loop on rows -->
<li data-rowtitle="<?php echo htmlspecialchars($row["title"], ENT_QUOTES); ?>">
<?php echo htmlspecialchars($row["title"]); ?>
</li>
<!-- PHP loop end -->
</ul>
Then, in your jQuery code, you can access the data-* values with the $.fn.data method
$(".my-awesome-list li").each(function(){
var realTitle = $(this).data('rowtitle');
});
Looking at the W3 specs, ids and classes can contain:
only the characters [a-zA-Z0-9] (...) plus the hyphen (-) and the
underscore (_); they cannot start with a digit, or a hyphen followed by a digit
Note that some other characters are accepted (that I omitted for simplicity). So I use this:
// We must be careful not to replace into an invalid string,
// thus adding 'a' in some cases and doing a second replace.
string.replace(/(^-\d-|^\d|^-\d|^--)/,'a$1').replace(/[\W]/g, '-');
Note that you may end up with identical strings that were originally different: you would have to rely on a more advanced replace function if you have issues with this.
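For the prefix_ format shown in the question, a rough PHP sketch could look like the following. The function name toCssId and the ACCENT_MAP table are hypothetical and not taken from any of the answers; the table is deliberately small and assumes the titles only contain the western accented letters listed in it. Because the prefix starts with a letter, the result cannot start with a digit or a hyphen.

```php
<?php
// Assumption: a hand-written, incomplete map of accented letters to ASCII.
const ACCENT_MAP = [
    'à' => 'a', 'á' => 'a', 'â' => 'a', 'ä' => 'a', 'ã' => 'a', 'å' => 'a',
    'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e',
    'ì' => 'i', 'í' => 'i', 'î' => 'i', 'ï' => 'i',
    'ò' => 'o', 'ó' => 'o', 'ô' => 'o', 'ö' => 'o', 'õ' => 'o',
    'ù' => 'u', 'ú' => 'u', 'û' => 'u', 'ü' => 'u',
    'ñ' => 'n', 'ç' => 'c',
    'À' => 'a', 'É' => 'e', 'Í' => 'i', 'Ó' => 'o', 'Ú' => 'u', 'Ñ' => 'n', 'Ç' => 'c',
];

// Hypothetical helper: turn a record title into an id such as prefix_tps_report_2010.
function toCssId(string $title, string $prefix = 'prefix_'): string
{
    // Replace the accented letters we know about, then lower-case the rest.
    $ascii = strtolower(strtr($title, ACCENT_MAP));
    // Drop quotes and apostrophes so "Mike's" becomes "mikes" rather than "mike_s".
    $ascii = preg_replace('/[\'"]/', '', $ascii);
    // Collapse every remaining run of other characters into a single underscore.
    $ascii = preg_replace('/[^a-z0-9]+/', '_', $ascii);

    return $prefix . trim($ascii, '_');
}

foreach (['Title', 'TPS Report 2010', 'Mike\'s "Proposal"', '#53: Míguèl'] as $title) {
    echo $title, ' ==> ', toCssId($title), PHP_EOL;
}
```

As with the other approaches, two different titles can collapse to the same id, so this relies on the titles being diverse enough.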

Automatically convert keywords to links in php

I am trying to convert specific keywords in text, which are stored in array, to the links.
Example text:
$text='This text contains many keywords, but also formated keywords.'
So now I want to convert the word keywords to the #keywords.
I used the very simple preg_replace function
preg_replace('/keywords/i',' keywords ',$text);
but obviously it converts to link also the string already formatted as a link, so I get a messy html like:
$text='This text contains many keywords, but also formated keywords" title="keywords">keywords</a>.'
Expected result:
$text='This text contains many keywords, but also formated keywords.'
Any suggestions?
THX
EDIT
We are one step from the perfect function, but still not working well in this case:
$text='This text contains many keywords, but also formated
keywords.'
In this case it replaces also the word keywords in the href, so we again get the messy code like
keywords.com/keywords" title="keywords">keywords</a>
I'm not great with regular expressions, but maybe this one will work:
/[^#>"]keywords/i
What I think it will do is ignore any instances of #keywords, >keywords, and "keywords and find the rest.
EDIT:
After testing it out, it looks like that replaces the space before the word as well, and doesn't work if keywords is the beginning of the string. It also didn't preserve original capitalization. I have tested this one, and it works perfectly for me:
$string = "Keywords and keywords, plus some more keywords with the original keywords.";
$string = preg_replace("/(?<![#>\"])keywords/i", "$0", $string);
echo $string;
The first three are replaced, preserving the original capitalization, and the last one is left untouched. This one uses a negative lookbehind and backreferences.
EDIT 2:
OP edited question. With the new example provided, the following regex will work:
$string = 'This text contains many keywords, but also formated keywords.';
$string = preg_replace("/(?<![#>\".\/])keywords/i", "$0", $string);
echo $string;
// outputs: This text contains many keywords, but also formated keywords.
This will replace all instances of keywords that are not preceded by #, >, ", ., or /.
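A self-contained sketch of the same lookbehind with a visible replacement. The link format (an anchor pointing at #keywords) and the sample URL are assumptions made for illustration, since the original link markup is not shown here.

```php
<?php
// Sample text with one plain keyword and one already inside a link (URL is made up).
$string = 'This text contains many keywords, but also formated '
    . '<a href="http://keywords.com/keywords" title="keywords">keywords</a>.';

// Skip any match directly preceded by #, >, ", . or /
// ($0 is the matched text, so the original capitalization is preserved).
$result = preg_replace('/(?<![#>".\/])keywords/i', '<a href="#$0">$0</a>', $string);

echo $result, PHP_EOL;
```

Only the first occurrence is wrapped; the four occurrences inside the existing link are each preceded by one of the excluded characters.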
Here is the problem:
The keyword could be inside the href, the title, or the text of the link, and anywhere in there (like if the keyword was sanity and you already had href="insanity"). Or even worse, you could have a non-keyword link that happens to contain a keyword, something like:
Click here to find more keywords and such!
In the above example, even though it fits every other possible criteria (it's got spaces before and after being the easiest one to test for), it still would result in a link within a link, which I think breaks the internet.
Because of this, you need to use lookaheads and lookbehinds to check if the keyword is wrapped in a link. But there is one catch: in PCRE, which PHP's preg functions use, a lookbehind has to match a bounded length, so open-ended quantifiers such as * and + are not allowed inside it.
I thought I'd be the hero and show you the easy fix for your issue, which would be something to the effect of:
'/(?<!\<a.?>)[list|of|keywords](?!\<\/a>)/'
Except you can't do that because the lookbehind in this case has that wildcard. Without it, you end up with a super greedy expression.
So my proposed alternative is to use regex to find all link elements, then str_replace to swap them out with a placeholder, and then restore the original links from the placeholders at the end.
Here's how I did it:
$text='This text contains many keywords, but also formated keywords.';
$keywords = array('text', 'formatted', 'keywords');
// Build an alternation such as (?:text|formatted|keywords), escaping each keyword for use in a regex
$keyword_list_pattern = '(?:' . implode('|', array_map(function ($k) { return preg_quote($k, '/'); }, $keywords)) . ')';
// First, get all link elements that contain a keyword; the (?!<\/a>) checks keep each match within a single link
preg_match_all('/<a\b(?:(?!<\/a>).)*?' . $keyword_list_pattern . '(?:(?!<\/a>).)*<\/a>/is', $text, $links);
$links = array_unique($links[0]); // Cleaning up array for next step.
// Second, swap out all matches with a placeholder, and build restore array:
$restore_links = array();
foreach($links as $count => $link) {
$link_key = "xxx_{$count}_xxx";
$restore_links[$link_key] = $link;
$text = str_replace($link, $link_key, $text);
}
// Third, we build a nice replacement array for the keywords:
$keyword_links = array();
foreach($keywords as $keyword) {
$keyword_links[$keyword] = "<a href='#$keyword'>$keyword</a>";
}
// Merge the restore links to the bottom of the keyword links for one mass replacement:
$keyword_links = array_merge($keyword_links, $restore_links);
$text = str_replace(array_keys($keyword_links), $keyword_links, $text);
echo $text;
You can change your RegEx so that it only targets keywords with a space in front, since the formatted keywords do not contain a space. Here is an example.
$text = preg_replace('/ keywords/i',' keywords',$text);

Locating specific string and capturing data following it

I built a site a long time ago and now I want to place the data into a database without copying and pasting the 400+ pages that it has grown to so that I can make the site database driven.
My site has meta tags like this (each page different):
<meta name="clan_name" content="Dark Mage" />
So what I'm doing is using cURL to place the entire HTML page in a variable as a string. I can also do it with fopen etc..., but I don't think it matters.
I need to sift through the string to find 'Dark Mage' and store it in a variable (so I can put it into SQL)
Any ideas on the best way to find Dark Mage to store in a variable? I was trying to use substr and then just subtracting the number of characters from the e in clan_name, but that was a bust.
Just parse the page using the PHP DOM functions, specifically loadHTML(). You can then walk the tree or use xpath to find the nodes you are looking for.
<?php
$doc = new DOMDocument;
$doc->loadHTML($html);
$meta = $doc->getElementsByTagName('meta');
foreach ($meta as $data) {
// Read attributes from the current element ($data), not from the node list ($meta)
$name = $data->getAttribute('name');
if ($name == 'clan_name') {
$content = $data->getAttribute('content');
// TODO handle content for clan_name
}
}
?>
EDIT If you want to remove certain tags (such as <script>) before you load your HTML string into memory, try using the strip_tags() function. Something like this will keep only the meta tags:
<?php
$html = strip_tags($html, '<meta>');
?>
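The XPath route mentioned in this answer could be sketched as follows. The sample $html string is made up for illustration, and the libxml_use_internal_errors call is an addition that assumes real pages may contain markup the parser would otherwise warn about.

```php
<?php
// Made-up sample page containing the meta tag from the question.
$html = '<html><head><meta name="clan_name" content="Dark Mage" /></head><body></body></html>';

// Collect parser warnings internally instead of emitting them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// string(...) returns the attribute value, or an empty string when no node matches.
$xpath = new DOMXPath($doc);
$clanName = $xpath->evaluate('string(//meta[@name="clan_name"]/@content)');

echo $clanName, PHP_EOL;
```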
Use a regular expression like the following, with PHP's preg_match():
/<meta name="clan_name" content="([^"]+)"/
If you're not familiar with regular expressions, read on.
The forward-slashes at the beginning and end delimit the regular expression. The stuff inside the delimiters is pretty straightforward except toward the end.
The square-brackets delimit a character class, and the caret at the beginning of the character-class is a negation-operator; taken together, then, this character class:
[^"]
means "match any character that is not a double-quote".
The + is a quantifier which requires that the preceding item occur at least once, and matches as many of the preceding item as appear adjacent to the first. So this:
[^"]+
means "match one or more characters that are not double-quotes".
Finally, the parentheses cause the regular-expression engine to store anything between them in a subpattern. So this:
([^"]+)
means "match one or more characters that are not double-quotes and store them as a matched subpattern".
In PHP, preg_match() stores matches in an array that you pass by reference. The full pattern is stored in the first element of the array, the first sub-pattern in the second element, and so forth if there are additional sub-patterns.
So, assuming your HTML page is in the variable "$page", the following code:
$matches = array();
$found = preg_match('/<meta name="clan_name" content="([^"]+)"/', $page, $matches);
if ($found) {
$clan_name = $matches[1];
}
Should get you what you want.
Use preg_match. A possible regular expression pattern is /clan_name.+content="([^"]+)"/
