How to properly replace strings when you have repeated substrings? - php

I want to add hyperlinks to URLs in a text, but the URLs can appear in different formats, and some of them are substrings of others. Let me explain with an example:
Here I have one insidelinkhttp://google.com But I can have more formats like the followings: https://google.com google.com
Right now I have the following links extracted from the above example: ["http://google.com", "https://google.com", "google.com"] and I want to replace those matches with the following array: ['http://google.com', 'https://google.com', 'google.com']
If I iterate over the array and replace each element in turn, the result is wrong: once the hyperlink for "http://google.com" has been added, its "google.com" substring gets replaced again with another hyperlink when "google.com" is processed.
Does anyone have an idea how to solve this problem?
Thanks

Based on your sample string, I have defined three patterns for URL matching and replaced the matches as you require. You can add more patterns to the $regEX variable.
// string
$str = "Here I have one insidelinkhttp://google.com But I can have more formats like the followings: https://google.com google.com";
/**
* Replace with the match pattern
*/
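function urls_matches($url1)
{
    if (isset($url1[0])) {
        // wrap the whole match in an anchor tag that points at the matched URL
        return '<a href="' . $url1[0] . '">' . $url1[0] . '</a>';
    }
    return '';
}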
// regular expression for multiple patterns
$regEX = "/(http:\/\/[a-zA-Z0-9]+\.+[A-Za-z]{2,6}+)|(https:\/\/[a-zA-Z0-9]+\.+[A-Za-z]{2,6}+)|([a-zA-Z0-9]+\.+[A-Za-z]{2,6}+)/";
// replacing string based on defined patterns
$replacedString = preg_replace_callback(
$regEX,
"urls_matches",
$str
);
// print the replaced string
echo $replacedString;

You could first search for the matches and replace them with placeholder strings.
e.g.: STRINGA, STRINGB, STRINGC
Then loop over the array and replace each placeholder with its link, so that item 0 replaces STRINGA.
Just make sure no placeholder name is a prefix of another, like STRING1 and STRING10.
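A minimal sketch of this placeholder approach, assuming the links have already been extracted into an array. The function name linkify_with_placeholders and the {{URLn}} token format are made up for illustration. The tokens are delimited so that none is a prefix of another, and the longest URLs are processed first so that "google.com" cannot match inside "http://google.com".

```php
<?php
// Hypothetical helper (not from the answer): two-pass replacement via placeholders.
function linkify_with_placeholders(string $text, array $urls): string
{
    $urls = array_values(array_unique($urls));
    // Longest first, so a short URL never matches inside a longer one.
    usort($urls, function (string $a, string $b): int {
        return strlen($b) <=> strlen($a);
    });

    $links = [];
    foreach ($urls as $i => $url) {
        // Pass 1: swap each URL for a delimited token such as {{URL0}}.
        $token = '{{URL' . $i . '}}';
        $text = str_replace($url, $token, $text);
        $links[$token] = '<a href="' . $url . '">' . $url . '</a>';
    }
    // Pass 2: strtr() replaces every token once and never rescans its own output.
    return strtr($text, $links);
}

$str = "Here I have one insidelinkhttp://google.com But I can have more formats like the followings: https://google.com google.com";
echo linkify_with_placeholders($str, ["http://google.com", "https://google.com", "google.com"]);
```

The sketch assumes the text does not already contain anything that looks like one of the tokens.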

Related

Create a function to find a specific word in the title

I have the following title formation on my website:
It's no use going back to yesterday, because at that time I was... Lewis Carroll
Always is: The phrase… (author).
I want to delete everything after the ellipsis (…), leaving only the sentence as the title. I thought of creating a function in php that would take the parts of the titles, throw them in an array and then I would work each part, identifying the only pattern I have in the title, which is the ellipsis… and then delete everything. But when I do that, in the X space of my array, it returns the following:
was...
In position 8 of the array comes the word and the ellipsis and I don't know how to find a pattern to delete the author of the title, my pattern was the ellipsis. Any idea?
<?php
$a = get_the_title(155571);
$search = '... ';
if(preg_match("/{$search}/i", $a)) {
echo 'true';
}
?>
I tried with the code above and found the ellipsis, but I needed to bring it into an array to delete the part I need. I tried something like this:
<?php
define('WP_USE_THEMES', false);
require('./wp-blog-header.php');
global $wpdb;
$title_array = explode(' ', get_the_title(155571));
$search = '... ';
if (array_key_exists("/{$search}/i",$title_array)) {
echo "true";
}
?>
I started doing it this way, but it doesn't work, any ideas?
Thanks,
If you use a regex, you need to escape the search string, as preg_quote() does, because a dot has a special meaning in a pattern.
But in your simple case, I would not use a regex and just search for the three dots from the end of the string.
Note: If the ellipsis is added by the browser, there is no way to detect it in PHP.
$title = 'The phrase... (author).';
echo getPlainTitle($title);
function getPlainTitle(string $title) {
$rpos = strrpos($title, '...');
return ($rpos === false) ? $title : substr($title, 0, $rpos);
}
will output
The phrase
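For the regex route mentioned at the start of this answer, here is a small sketch of what escaping with preg_quote() changes. The sample title is taken from the question; the second test string is invented to show the false positive that the unescaped pattern produces.

```php
<?php
$search = '... ';
// preg_quote() escapes regex metacharacters; the second argument also escapes the delimiter.
$pattern = '/' . preg_quote($search, '/') . '/';

$title = "It's no use going back to yesterday, because at that time I was... Lewis Carroll";
var_dump(preg_match($pattern, $title));      // 1: the literal "... " is found
var_dump(preg_match($pattern, 'abcd efg'));  // 0: there are no literal dots here
var_dump(preg_match('/... /', 'abcd efg'));  // 1: unescaped dots match any three characters
```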
First of all, since you're working with regular expressions, you need to remember that . has a special meaning there: it means "any character". So /... / just means "any three characters followed by a space", which isn't what you want. To match a literal . you need to escape it as \.
Secondly, rather than searching or splitting, you could achieve what you want by replacing part of the string. For instance, you could find everything after the ellipsis, and replace it with an empty string. To do that you want a pattern of "dot dot dot followed by anything", where "anything" is spelled .*, so \.\.\..*
$title = preg_replace('/\.\.\..*/', '', $title);
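The titles in the question appear both with three dots and with the single "…" character, so a variant that accepts either form may be useful. This is only a sketch: the helper name strip_after_ellipsis is invented, and it assumes the title is valid UTF-8.

```php
<?php
// Hypothetical helper: cut the title at the first ellipsis, in either form.
function strip_after_ellipsis(string $title): string
{
    // \.{3} is three literal dots, \x{2026} is the "…" character.
    // u = treat pattern and subject as UTF-8, s = let the dot match newlines too.
    $result = preg_replace('/(?:\.{3}|\x{2026}).*/su', '', $title);
    // preg_replace() returns null on error (e.g. invalid UTF-8); fall back to the input.
    return rtrim($result ?? $title);
}

echo strip_after_ellipsis("The phrase\u{2026} (author)."), "\n";
echo strip_after_ellipsis('The phrase... (author).'), "\n";
```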

php preg_replace pattern - replace text between commas

I have a string of words in an array, and I am using preg_replace to make each word into a link. Currently my code works, and each word is transformed into a link.
Here is my code:
$keywords = "shoes,hats,blue curtains,red curtains,tables,kitchen tables";
$template = '%1$s';
$newkeys = preg_replace("/(?!(?:[^<]+>|[^>]+<\/a>))\b([a-z]+)\b/is", sprintf($template, "\\1"), $keywords);
Now, the only problem arises when I want 2 or 3 words to be a single link. For example, I have a keyword "blue curtains". The script creates separate links for the words "blue" and "curtains". The keywords are separated by commas, and I would like preg_replace to replace only the text between the commas.
I've tried playing around with the pattern, but I just can't figure out what the pattern would be.
Just to clarify, currently the output looks as follows:
shoes,hats,blue curtains,red curtains,tables,kitchen tables
While I want to achieve the following output:
shoes,hats,blue curtains,red curtains,tables,kitchen tables
A small change to the preg_replace pattern will do the job:
$keywords = "shoes,hats,blue curtains,red curtains,tables,kitchen tables";
$template = '%1$s';
$newkeys = preg_replace("/(?!(?:[^<]+>|[^>]+<\/a>))\b([a-z ' ']+)\b/is", sprintf($template, "\\1"), $keywords);
OR
$newkeys = preg_replace("/(?!(?:[^<]+>|[^>]+<\/a>))\b([a-z' ']+)\b/is", sprintf($template, "\\1"), $keywords);
echo $newkeys;
Output:- http://prntscr.com/77tkyb
Note: I just added a white-space character to the character class in your preg_replace pattern. I hope that is clear.
Matching white-space along with the words was missing from the pattern, and that is the only thing I added.
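Another way to get one link per comma-separated keyword is to match runs of non-comma characters. The sketch below assumes keywords never contain commas themselves, and it drops the lookahead from the original pattern. The href="#" in the template is a placeholder, not the original link target.

```php
<?php
$keywords = "shoes,hats,blue curtains,red curtains,tables,kitchen tables";
// Hypothetical template: the real link target is not known, so "#" stands in for it.
$template = '<a href="#">%1$s</a>';
// [^,]+ matches everything between two commas; $0 is the whole match.
$newkeys = preg_replace('/[^,]+/', sprintf($template, '$0'), $keywords);
echo $newkeys;
```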

How to improve my algorithm? (searching and replacing words in formatted text)

I have a source of html, and an array of keywords. I'm trying to find all words which begin with any keyword in the keywords array and wrap it in a link tag.
For example, the keyword array has two values: [ABC, DEF]. It should match ABCDEF, DEFAD, etc. and wrap each word with hyperlink markup.
Here is the code I've got so far:
$_keys = array('ABC', 'DEF');
$text = 'Some ABCDD <strong>HTML</strong> text. DEF';
function search_and_replace(($key,$text)
{
$words = preg_split('/\s+/', trim($text)); //to seprate words in $_text
for($words as $word)
{
if(strpos($word,$key) !== false)
{
if($word.startswith($key))
{
str_replace($word,''.$word.',$_text);
}
}
}
return text;
}
for($_keys as $_key)
{
$text = search_and_replace($key,$text);
}
My questions:
Would this algorithm work?
How would I modify this to work with UTF-8?
How can I recognize hyperlinks in the html and ignore them (don't want to put a hyperlink in a hyperlink).
Is this algorithm safe?
is the algorithm "true"? ( I'm reading "accurate")
No, it is not, since str_replace returns
a string or an array with all occurrences of search in subject
replaced with the given replace value.
The word you matched is not the only thing being replaced: every occurrence of it in the text is. Using your example, if you ran this function against your data set, you'd end up wrapping each occurrence of ABC in multiple tags (just run your code to see it, but you'll have to fix the syntax errors first).
work with UTF-8 Alphabets?
Not sure, but as written, I don't think so. See Preg_Replace and UTF8. PREG functions should be multibyte safe.
I want to igonre all words in each a tag for search operetion
That's awfully hard. You'll have to avoid <a ...>word</a>, which starts to make a big mess fast. Regex matching HTML reliably is a fool's errand.
Probably the best approach would be to interpret the webpage as XML or HTML. Have you considered doing this in JavaScript? Why do it on the server side? The advantage of JS is twofold: one, it runs on the client side, so you're offloading / distributing the work, and two, since the DOM is already interpreted, you can find all text nodes and replace them fairly easily. In fact, I was helping a friend working on a Chrome extension to do almost exactly what you're describing; you could modify it to do what you're looking for easily.
a better alternative method?
Definitely. What you're showing here is one of the worse methods of doing this. I'd push for you to use preg_replace (another answer has a good start for the regex you'd want, matching word breaks rather than whitespace), but since you want to avoid changing some elements, I'm thinking now that doing this in JS client-side is far better.
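If the work has to stay on the server, the "interpret the page as HTML" idea can also be sketched in PHP with the DOM extension. Everything below is an illustration under assumptions: the helper name link_keywords, the kw-root wrapper id and the href="#" target are invented, and the input is assumed to be a UTF-8 HTML fragment.

```php
<?php
// Hypothetical helper: wrap words that start with a keyword, skipping text inside <a>.
function link_keywords(string $html, array $keys): string
{
    $quoted  = array_map(function (string $k): string { return preg_quote($k, '/'); }, $keys);
    $pattern = '/(\b(?:' . implode('|', $quoted) . ')\w*)/u';

    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate imperfect markup
    // The XML declaration is a common trick to make libxml read the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8"><div id="kw-root">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $root  = $xpath->query('//div[@id="kw-root"]')->item(0);
    // Every text node that is not inside a link.
    $textNodes = iterator_to_array($xpath->query('.//text()[not(ancestor::a)]', $root));

    foreach ($textNodes as $node) {
        // With the capturing group, odd-indexed parts are the matched words.
        $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if ($parts === false || count($parts) < 2) {
            continue; // no keyword in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {
                $a = $doc->createElement('a');
                $a->setAttribute('href', '#'); // placeholder target
                $a->appendChild($doc->createTextNode($part));
                $fragment->appendChild($a);
            } elseif ($part !== '') {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $node->parentNode->replaceChild($fragment, $node);
    }

    // Serialize only the children of the wrapper element.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo link_keywords('Some ABCDD <strong>HTML</strong> <a href="x">ABC</a> text. DEF', ['ABC', 'DEF']);
```

Because the replacement happens on text nodes, existing links are left alone without any regex having to understand the markup.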
To maximize performance, you should look into the trie (retrieval tree) data structure (http://en.wikipedia.org/wiki/Trie). If I were you, I would first build a trie containing the words in the HTML page. At this step you could also check whether a word is inside an <a> tag and, if it is, not add it to the trie. You can easily do that with a regex match.
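One way to read the trie suggestion is sketched below. Note that it is a swapped-in variant: the trie is built from the keywords rather than from the words of the page, so that each word can be tested for a keyword prefix in a single walk. The function names and the '$end' marker key are invented for this illustration.

```php
<?php
// Build a nested-array trie from the keywords (one array level per character).
function trie_build(array $keys): array
{
    $root = [];
    foreach ($keys as $key) {
        $node = &$root;
        // '//u' splits the string into UTF-8 characters rather than bytes.
        foreach (preg_split('//u', $key, -1, PREG_SPLIT_NO_EMPTY) as $ch) {
            if (!isset($node[$ch])) {
                $node[$ch] = [];
            }
            $node = &$node[$ch];
        }
        $node['$end'] = true; // marks the end of a complete keyword
        unset($node);         // drop the reference before the next keyword
    }
    return $root;
}

// True if $word begins with any keyword stored in the trie.
function trie_starts_with_keyword(array $trie, string $word): bool
{
    $node = $trie;
    foreach (preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY) as $ch) {
        if (isset($node['$end'])) {
            return true;  // a whole keyword has been consumed
        }
        if (!isset($node[$ch])) {
            return false; // no keyword continues with this character
        }
        $node = $node[$ch];
    }
    return isset($node['$end']);
}

$trie = trie_build(['ABC', 'DEF']);
var_dump(trie_starts_with_keyword($trie, 'ABCDEF')); // true
var_dump(trie_starts_with_keyword($trie, 'XABC'));   // false
```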
How about regex?
// $word is one keyword, $text is the text to search
preg_match_all('/\b' . preg_quote($word, '/') . '\w*\b/', $text, $matches);
foreach ($matches[0] as $match) {
print($match);
}
(Sorry, my PHP is a bit rusty)
For a simple task like this PHP regular expressions will serve well. The idea is to find all hyperlinks ( and optionally some other HTML elements ) and replace them with unique tokens. After that we are free to seek and replace desired keywords, and in the end we will restore the removed HTML elements back.
$_keys = array( 'ABC', 'DEF', 'ABČ' );
$text =
'Some <a href="#" >ABC</a> ABCDđD <strong>ABCDEF</strong> text. DEF
<p class="test">
PHP is <em>the</em> most ABCwidely used
langČuage ABC for ABČogr ammDEFing on the webABC DEFABC.
</p>';
// array for holding html items replaced with tokens
$tokens = array();
$id = 0;
// we will replace all links and strong elements (a|strong)
$text = preg_replace_callback( '/<(a|strong)[^>]*>.*?<\/\1\s*>/s',
function( $matches ) use ( &$tokens, &$id )
{
// store matches into the tokens array
$tokens[ '#'.++$id.'#' ] = $matches[0];
// replace matches with the unique id
return '#'.$id.'#';
},
$text
);
echo htmlentities( $text );
/* - outputs: Some #1# ABCDđD #2# text. DEF <p class="test"> PHP is <em>the</em> most ABCwidely used langČuage ABC for ABČogr ammDEFing on the webABC DEFABC. </p>
- note the #1# and #2# tokens
*/
// wrap the words that starts with items in $_keys array ( with u(PCRE_UTF8) modifier )
$text = preg_replace( '/\b('. implode( '|', $_keys ) . ')\w*\b/u', '$0', $text );
// replace the tokens with values
$text = str_replace( array_keys($tokens), array_values($tokens), $text );
echo $text;
Info about UTF-8 strings in PHP regex:

RegEx or Similar - Grab string preceding matched value

Here's the deal: I am handling an OCR text document and grabbing UPC information from it with a regex. That part I've figured out. Then I query a database, and if I have no record of that UPC I need to go back to the text document and get the description of the product.
The format on the receipt is:
NAME OF ITEM 123456789012
OTHER NAME 987654321098
NAME 567890123456
So, when I go back the second time to find the name of the item I am at a complete loss. I know how to get to the line where the UPC is, but how can I use something like regex to get the name that precedes the UPC? Or some other method. I was thinking of somehow storing the entire line and then parsing it with PHP, but not sure how to get the line either.
Using PHP.
Get all of the names of the items indexed by their UPCs with a regex and preg_match_all():
$str = 'NAME OF ITEM 123456789012
OTHER NAME 987654321098
NAME 567890123456';
preg_match_all( '/^(.*?)\s+(\d+)/m', $str, $matches);
$items = array();
foreach( $matches[2] as $k => $upc) {
if( !isset( $items[$upc])) {
$items[$upc] = array( 'name' => $matches[1][$k], 'count' => 0);
}
$items[$upc]['count']++;
}
This forms $items so it looks like:
Array (
[123456789012] => Array ( [name] => NAME OF ITEM [count] => 1 )
[987654321098] => Array ( [name] => OTHER NAME [count] => 1 )
[567890123456] => Array ( [name] => NAME [count] => 1 )
)
Now you can look up any item name you want in O(1) time:
echo $items['987654321098']['name']; // OTHER NAME
You can find the string preceding a value you know with the following regex:
$receipt = "NAME OF ITEM 123456789012\n" .
"OTHER NAME 987654321098\n" .
"NAME 567890123456";
$upc = '987654321098';
if (preg_match("/^(.*?) *{$upc}/m", $receipt, $matches)) {
$name = $matches[1];
var_dump($name);
}
The /m flag on the regex makes ^ match at the start of every line of multi-line input.
The ? in (.*?) makes that part non-greedy, so it doesn't grab the spaces in front of the UPC.
It would be simpler if you grabbed both the name and the number at the same time during the initial pass. Then, when you check the database to see if the number is present, you already have the name if you need to use it. Consider:
// delimiters are required, and the m flag lets ^ and $ match on every line
preg_match_all('/^([A-Za-z ]+) (\d+)$/m', $document, $matches, PREG_SET_ORDER);
foreach ($matches as $match) {
$name = $match[1];
$number = $match[2];
if (!order_number_in_database($number)) {
save_new_order($number, $name);
}
}
You can use a lookahead assertion to match the string preceding the UPC.
http://php.net/manual/en/regexp.reference.assertions.php
Something like this: ^.*?(?=\s+123456789012) with the m modifier (the names contain spaces, so \S* would not match them), substituting the UPC of the item you want to find.
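A short sketch of the lookahead idea in PHP, using the receipt layout from the question. The helper name name_before_upc is invented. It uses .*? rather than \S* because the item names contain spaces, and it relies on the m modifier so that ^ matches at the start of each line.

```php
<?php
// Hypothetical helper: return the text that precedes a known UPC on its line.
function name_before_upc(string $receipt, string $upc): ?string
{
    // (?=...) checks that whitespace and the UPC follow, without consuming them.
    $pattern = '/^.*?(?=\s+' . preg_quote($upc, '/') . '\b)/m';
    return preg_match($pattern, $receipt, $m) === 1 ? $m[0] : null;
}

$receipt = "NAME OF ITEM 123456789012\n" .
           "OTHER NAME 987654321098\n" .
           "NAME 567890123456";

var_dump(name_before_upc($receipt, '987654321098')); // string(10) "OTHER NAME"
var_dump(name_before_upc($receipt, '000000000000')); // NULL
```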
I'm lazy, so I would just use one regex that gets both parts in one shot using matching groups. Then, I would call it every time and put each capture group into name and upc variables. For cases in which you need the name, just reference it.
Use this type of regex:
/([a-zA-Z ]+)\s*(\d*)/
Then you will have the name in the $1 matching group and the UPC in the $2 matching group. Sorry, it's been a while since I've used PHP, so I can't give you an exact code snippet.
Note: the suggested regex assumes you'll only have letters or spaces in your "names" if that's not the case, you'll have to expand the character class.

how to make a string lowercase without changing url

I'm using mb_strtolower to make a string lowercase, but sometimes the text contains URLs with upper-case characters. When I use mb_strtolower, the URLs of course change too and stop working.
How can I convert a string to lowercase without changing the URLs?
Since you have not posted your string, this can be only generally answered.
Whenever you use a function on a string to make it lower-case, the whole string will be made lower-case. String functions are aware of strings only, they are not aware of the contents written within these strings specifically.
In your scenario you do not want to lowercase the whole string I assume. You want to lowercase only parts of that string, other parts, the URLs, should not be changed in their case.
To do so, you must first parse your string into these two different parts, let's call them text and URLs. Then you need to apply the lowercase function only on the parts of type text. After that you need to combine all parts together again in their original order.
If the content of the string is semantically simple, you can split the string at spaces. Then you can check each part, if it begins with http:// or https:// (is_url()?) and if not, perform the lowercase operation:
$text = 'your content http://link.me/now! might differ';
$fragments = explode(' ', $text);
foreach ($fragments as &$fragment) {
    if (is_not_url($fragment)) {
        $fragment = strtolower($fragment); // or mb_strtolower
    }
}
unset($fragment); // remove reference
$lowercase = implode(' ', $fragments);
For this code to work, you need to define the is_not_url() function. Additionally, the original text must be simple enough that splitting it on the space separator is an adequate way to parse it.
Hopefully this example helps you get along with the coding and with understanding your problem.
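One possible definition of is_not_url(), shown together with the loop from this answer. It is only a sketch: it assumes that a URL is any fragment starting with http:// or https://, as suggested above, and the mixed-case sample text is invented.

```php
<?php
// Hypothetical implementation of the is_not_url() helper that the loop relies on.
function is_not_url(string $fragment): bool
{
    // Treat anything that starts with http:// or https:// as a URL.
    return preg_match('#^https?://#i', $fragment) !== 1;
}

$text = 'Your CONTENT http://Link.me/Now! MIGHT differ';
$fragments = explode(' ', $text);
foreach ($fragments as &$fragment) {
    if (is_not_url($fragment)) {
        $fragment = mb_strtolower($fragment, 'UTF-8');
    }
}
unset($fragment); // remove reference
$lowercase = implode(' ', $fragments);
echo $lowercase; // your content http://Link.me/Now! might differ
```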
Here you go, iterative, but as fine as possible.
function strtolower_sensitive ( $input ) {
    // a URL, up to (but not including) whitespace, common punctuation or the end of the string
    $regexp = "#(?:https?|ftp)://\S*?\.\S*?(?=\s|;|\)|\]|\[|\{|\}|,|\"|'|:|<|$|\.\s)#i";
    $hist = array();
    // swap every URL for a lowercase placeholder that strtolower() leaves untouched
    $input = preg_replace_callback($regexp, function ( $m ) use ( &$hist ) {
        $token = "sxxx" . count($hist) . "x";
        $hist[$token] = $m[0];
        return $token;
    }, $input);
    // lowercase the rest, then put the original URLs back
    $input = strtolower($input);
    return $hist ? strtr($input, $hist) : $input;
}
$input is your string; the function returns the lowercased result. =)
