PHP - Replacing emoticon with meaning - php

I am analysing informal, chat-style messages for sentiment and other information. I need every emoticon replaced with its meaning so that the messages are easier for the system to parse.
At the moment I have the following code:
$str = "Am I :) or :( today?";
$emoticons = array(
':)' => 'happy',
':]' => 'happy',
':(' => 'sad',
':[' => 'sad',
);
$str = str_replace(array_keys($emoticons), array_values($emoticons), $str);
This does a direct string replacement, so it does not take into account whether the emoticon is surrounded by other characters.
How can I use regex and preg_replace to determine whether it is actually an emoticon and not part of a string?
Also, how can I extend my array so that one element, happy for example, can cover both :) and :]?

For maintainability and readability, I would change your emoticons array to:
$emoticons = array(
'happy' => array( ':)', ':]'),
'sad' => array( ':(', ':[')
);
Then, you can form a look-up table just like you originally had, like this:
$emoticon_lookup = array();
foreach( $emoticons as $name => $values) {
foreach( $values as $emoticon) {
$emoticon_lookup[ $emoticon ] = $name;
}
}
Now, you can dynamically form a regex from the emoticon lookup array. Note that this regex requires a non-word boundary (\B) on both sides of the emoticon; change that to suit your needs.
$escaped_emoticons = array_map( 'preg_quote', array_keys( $emoticon_lookup), array_fill( 0, count( $emoticon_lookup), '/'));
$regex = '/\B(' . implode( '|', $escaped_emoticons) . ')\B/';
And then use preg_replace_callback() with a custom callback to implement the replacement:
$str = preg_replace_callback( $regex, function( $match) use( $emoticon_lookup) {
return $emoticon_lookup[ $match[1] ];
}, $str);
This outputs:
Am I happy or sad today?
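If \B is too permissive for your data (it still matches an emoticon that sits next to other punctuation, as in "(:)"), one alternative is to require whitespace or the string boundary on both sides. The following is a minimal sketch under that assumption; the helper name replace_emoticons is hypothetical and not part of the answer above.

```php
<?php
// Sketch: same lookup idea, but an emoticon only counts when it is
// delimited by whitespace or the string boundaries (an assumption).
$emoticons = array(
    'happy' => array(':)', ':]'),
    'sad'   => array(':(', ':['),
);

// Flatten to emoticon => meaning.
$emoticon_lookup = array();
foreach ($emoticons as $name => $values) {
    foreach ($values as $emoticon) {
        $emoticon_lookup[$emoticon] = $name;
    }
}

// Hypothetical helper name, not from the original answer.
function replace_emoticons($str, array $lookup)
{
    $escaped = array();
    foreach (array_keys($lookup) as $emoticon) {
        $escaped[] = preg_quote($emoticon, '/');
    }
    // (?<!\S) = not preceded by a non-space character,
    // (?!\S)  = not followed by a non-space character.
    $regex = '/(?<!\S)(' . implode('|', $escaped) . ')(?!\S)/';

    return preg_replace_callback($regex, function ($match) use ($lookup) {
        return $lookup[$match[1]];
    }, $str);
}

echo replace_emoticons("Am I :) or :( today?", $emoticon_lookup), "\n";
```

With this variant, "(:)" is left untouched, whereas the \B version would replace the emoticon inside it. Which behaviour is right depends on how your chat messages are written.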


explode string on multiple words

There is a string like this:
$string = 'connector:rtp-monthly direction:outbound message:error writing data: xxxx yyyy zzzz date:2015-11-02 10:20:30';
This string comes from user input, so the fields will never be in a guaranteed order. It's the value of an input field, which I need to split to build a DB query.
Now I would like to split the string based on words given in an array(), which acts as a mapper containing the words I need to find in the string. It looks like this:
$mapper = array(
'connector' => array('type' => 'string'),
'direction' => array('type' => 'string'),
'message' => array('type' => 'string'),
'date' => array('type' => 'date'),
);
Only the keys of the $mapper will be relevant. I've tried with foreach and explode like:
$parts = explode(':', $string);
But the problem is that there can be colons elsewhere in the string, and I must not split on those. I only want to split where a colon directly follows a mapper key. The mapper keys in this case are:
connector // in this case split if "connector:" is found
direction // until "direction:" is found
message // until "message:" is found
date // until "date:" is found
Remember also that the user input can vary. The string will always change, and the string and the mapper array() will never be in the same order. So I'm not sure whether explode is the right way to go or whether I should use a regex, and if so, how to do it.
The desired result should be an array looking something like this:
$desired_result = array(
'connector' => 'rtp-monthly',
'direction' => 'outbound',
'message' => 'error writing data: xxxx yyyy zzzz',
'date' => '2015-11-02 10:20:30',
);
Help is much appreciated.
The trickier part of this is matching the original string. You can do it with a regex and a positive lookahead assertion:
$pattern = "/(connector|direction|message|date):(.+?)(?= connector:| direction:| message:| date:|$)/";
$subject = 'connector:rtp-monthly direction:outbound message:error writing data: xxxx yyyy zzzz date:2015-11-02 10:20:30';
preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER );
$returnArray = array();
foreach($matches as $item)
{
$returnArray[$item[1]] = $item[2];
}
In this Regex /(connector|direction|message|date):(.+?)(?= connector:| direction:| message:| date:|$)/, you're matching:
(connector|direction|message|date) - find a keyword and capture it;
: - followed by a colon;
(.+?) - followed by one or more characters, matched non-greedily, and capture them;
(?= connector:| direction:| message:| date:|$) - up until the next keyword or the end of the string, using a positive lookahead assertion, which checks for the keyword without consuming or capturing it.
The result is:
Array
(
[connector] => rtp-monthly
[direction] => outbound
[message] => error writing data: xxxx yyyy zzzz
[date] => 2015-11-02 10:20:30
)
I didn't use the mapper array just to make the example clear, but you could use implode to put the keywords together.
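The implode idea mentioned above can be sketched as follows. This is a minimal illustration, assuming the mapper keys may need escaping (hence preg_quote) and that every field is separated from the next by a single space; the function name parse_fields is hypothetical.

```php
<?php
// Hypothetical helper: builds the pattern from the mapper keys instead of
// hard-coding them, then collects keyword => value pairs.
function parse_fields($subject, array $mapper)
{
    $keys = array();
    foreach (array_keys($mapper) as $key) {
        $keys[] = preg_quote($key, '/');
    }
    $alternation = implode('|', $keys);

    // keyword, colon, lazy value, then look ahead for " keyword:" or the end.
    $pattern = '/(' . $alternation . '):(.+?)(?= (?:' . $alternation . '):|$)/';

    preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER);

    $result = array();
    foreach ($matches as $item) {
        $result[$item[1]] = $item[2];
    }
    return $result;
}

$mapper = array(
    'connector' => array('type' => 'string'),
    'direction' => array('type' => 'string'),
    'message'   => array('type' => 'string'),
    'date'      => array('type' => 'date'),
);
$subject = 'connector:rtp-monthly direction:outbound message:error writing data: xxxx yyyy zzzz date:2015-11-02 10:20:30';

print_r(parse_fields($subject, $mapper));
```

Because the pattern is derived from array_keys($mapper), adding a field to the mapper is enough to make the parser recognise it. Note that a keyword can still match at the end of a longer word (for example "update:" contains "date:"), so you may want to add a stricter boundary if your input allows that.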
Our aim is to build one array from the values of two arrays extracted from the string. Two arrays are necessary because there are two string delimiters to consider.
Try this:
$parts = array();
$large_parts = explode(" ", $string);
for($i=0; $i<count($large_parts); $i++){
$small_parts = explode(":", $large_parts[$i]);
$parts[$small_parts[0]] = $small_parts[1];
}
$parts should now contain the desired array
Hope you get sorted out.
Here you are. The regex is there to "catch" the key (any sequence of characters, excluding blank space and ":"). Starting from there, I use "explode" to "recursively" split the string. Tested, and it works well.
$string = 'connector:rtp-monthly direction:outbound message:error writing data date:2015-11-02';
$element = "(.*?):";
preg_match_all( "/([^\s:]*?):/", $string, $matches);
$result = array();
$keys = array();
$values = array();
$counter = 0;
foreach( $matches[0] as $id => $match ) {
$exploded = explode( $matches[ 0 ][ $id ], $string );
$keys[ $counter ] = $matches[ 1 ][ $id ];
if( $counter > 0 ) {
$values[ $counter - 1 ] = $exploded[ 0 ];
}
$string = $exploded[ 1 ];
$counter++;
}
$values[] = $string;
$result = array();
foreach( $keys as $id => $key ) {
$result[ $key ] = $values[ $id ];
}
print_r( $result );
You could use a combination of a regular expression and explode(). Consider the following code:
$str = "connector:rtp-monthly direction:outbound message:error writing data date:2015-11-02";
$regex = "/([^:\s]+):(\S+)/i";
// first group: match any character except ':' and whitespaces
// delimiter: ':'
// second group: match any character which is not a whitespace
// will not match writing and data
preg_match_all($regex, $str, $matches);
$mapper = array();
foreach ($matches[0] as $match) {
list($key, $value) = explode(':', $match);
$mapper[$key][] = $value;
}
Additionally, you might want to think of a better way to store the strings in the first place (JSON? XML?).
Using preg_split() to explode() by multiple delimiters in PHP
Just a quick note here. To explode() a string using multiple delimiters in PHP, you have to use a regular expression. Use the pipe character to separate your delimiters.
$string = 'connector:rtp-monthly direction:outbound message:error writing data: xxxx yyyy zzzz date:2015-11-02 10:20:30';
$chunks = preg_split('/(connector|direction|message)/',$string,-1, PREG_SPLIT_NO_EMPTY);
// Print_r to check response output.
echo '<pre>';
print_r($chunks);
echo '</pre>';
PREG_SPLIT_NO_EMPTY – To return only non-empty pieces.
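To get from the split pieces to a key => value array, preg_split() can also return the delimiters it matched when PREG_SPLIT_DELIM_CAPTURE is set. The sketch below assumes that the string starts with a key, that each key is preceded by a space or the start of the string, and that no value is empty; the function name split_fields is hypothetical.

```php
<?php
// Hypothetical helper: split on "key:" delimiters and keep the keys.
function split_fields($string, array $keys)
{
    $quoted = array();
    foreach ($keys as $key) {
        $quoted[] = preg_quote($key, '/');
    }
    // A delimiter is a known key at the start of the string or after a
    // space, followed by a colon. Only the key itself is captured.
    $pattern = '/(?:^| )(' . implode('|', $quoted) . '):/';

    $chunks = preg_split(
        $pattern,
        $string,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );

    // The chunks now alternate: key, value, key, value, ...
    $result = array();
    foreach (array_chunk($chunks, 2) as $pair) {
        if (count($pair) === 2) {
            $result[$pair[0]] = $pair[1];
        }
    }
    return $result;
}

$string = 'connector:rtp-monthly direction:outbound message:error writing data: xxxx yyyy zzzz date:2015-11-02 10:20:30';
print_r(split_fields($string, array('connector', 'direction', 'message', 'date')));
```

If a value can be empty, the key/value alternation breaks, because PREG_SPLIT_NO_EMPTY drops the empty piece. In that case the preg_match_all() approach is the safer choice.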

How to explode different section from a textfile into an array using php (and no regex)?

This question is almost duplicate to How to transform structured textfiles into PHP multidimensional array but I have posted it again since I was unable to understand the regular expression based solutions that were given. It seems better to try and solve this using just PHP so that I may actually learn from it (regex is too hard to understand at this point).
Assume the following text file:
HD Alcoa Earnings Soar; Outlook Stays Upbeat
BY By James R. Hagerty and Matthew Day
PD 12 July 2011
LP
Alcoa Inc.'s profit more than doubled in the second quarter.
The giant aluminum producer managed to meet analysts' forecasts.
However, profits wereless than expected
TD
Licence this article via our website:
http://example.com
I read this text file with PHP and need a robust way to put the file contents into an array, like this:
array(
[HD] => Alcoa Earnings Soar; Outlook Stays Upbeat,
[BY] => By James R. Hagerty and Matthew Day,
[PD] => 12 July 2011,
[LP] => Alcoa Inc.'s profit...than expected,
[TD] => Licence this article via our website: http://example.com
)
The words HD BY PD LP TD are keys to identify a new section in the file. In the array, all newlines may be stripped from the values. Ideally I would be able to do this without regular expressions. I believe exploding on all keys could be one way of doing it, but it would be very dirty:
$fields = array('HD', 'BY', 'PD', 'LP', 'TD');
$parts = explode("\nHD ", $text);
$HD = $parts[0];
Does anybody have a cleaner idea for looping through the text, perhaps only once, and dividing it up into the array given above?
This is another, even shorter approach without using regular expressions.
/**
* @param array array of stopwords eq: array('HD', 'BY', ...)
* @param string Text to search in
* @param string End Of Line symbol
* @return array [ stopword => string, ... ]
*/
function extract_parts(array $parts, $str, $eol=PHP_EOL) {
$ret=array_fill_keys($parts, '');
$current=null;
foreach(explode($eol, $str) AS $line) {
$substr = substr($line, 0, 2);
if (isset($ret[$substr])) {
$current = $substr;
$line = trim(substr($line, 2));
}
if ($current) $ret[$current] .= $line;
}
return $ret;
}
$ret = extract_parts(array('HD', 'BY', 'PD', 'LP', 'TD'), $str);
var_dump($ret);
Why not use regular expressions?
The PHP documentation, particularly for the preg_* functions, recommends not using regular expressions unless they are really needed, so I wondered which of the examples in the answers to this question has the best performance.
The result surprised me:
Answer 1 by: hek2mgl 2.698 seconds (regexp)
Answer 2 by: Emo Mosley 2.38 seconds
Answer 3 by: anubhava 3.131 seconds (regexp)
Answer 4 by: jgb 1.448 seconds
I would have expected that the regexp variants would be the fastest.
So avoiding regular expressions is not a bad idea here. In other words, regular expressions are not always the best solution; you have to decide case by case.
You may repeat the measurement with this script.
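A measurement of this kind can be sketched with a small timing helper. This is not the benchmark script referred to above, only a minimal illustration; the helper name time_calls, the sample text and the two candidate callbacks are assumptions made for the example.

```php
<?php
// Hypothetical timing helper: calls $callback with $args $iterations times
// and returns the elapsed wall-clock time in seconds.
function time_calls($callback, array $args, $iterations)
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        call_user_func_array($callback, $args);
    }
    return microtime(true) - $start;
}

// Made-up sample input, shaped like the file in the question.
$sample = "HD Title\nBY Author\nLP\nFirst line.\nSecond line.";

// Two stand-in candidates; swap in the functions you want to compare.
$candidates = array(
    'explode' => function ($str) {
        return explode("\n", $str);
    },
    'preg_split' => function ($str) {
        return preg_split('/\n/', $str);
    },
);

foreach ($candidates as $label => $callback) {
    $seconds = time_calls($callback, array($sample), 10000);
    printf("%-12s %.4f seconds\n", $label, $seconds);
}
```

Absolute numbers depend on the machine and PHP version, so only compare results taken in the same run, and make sure every candidate produces the same output format before comparing.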
Edit
Here is a short, more optimized example using a regexp pattern. Still not as fast as my example above but faster than the other regexp based examples.
The Output format may be optimized (whitespaces / line breaks).
function extract_parts_regexp($str) {
$a=array();
preg_match_all('/(?<k>[A-Z]{2})(?<v>.*?)(?=\n[A-Z]{2}|$)/Ds', $str, $a);
return array_combine($a['k'], $a['v']);
}
A plea on behalf of SIMPLIFIED, FAST & READABLE regex code!
(From Pr0no in comments) Do you think you could simplify the regex or have a tip on how to start with a php solution? Yes, Pr0no, I believe I can simplify the regex.
I'd like to make the case that regex is by far the best tool for the job and that it doesn't have to be frightening & unreadable expressions as we've seen earlier. I have broken this function down into understandable parts.
I've avoided complex regex features like capture groups and wildcard expressions and focused on trying to produce something simple that you'll feel comfortable coming back to in 3 months time.
My proposed function (commented)
function headerSplit($input) {
// First, let's put our headers (any two consecutive uppercase characters at the start of a line) in an array
preg_match_all(
"/^[A-Z]{2}/m", /* Find 2 uppercase letters at start of a line */
$input, /* In the '$input' string */
$matches /* And store them in a $matches array */
);
// Next, let's split our string into an array, breaking on those headers
$split = preg_split(
"/^[A-Z]{2}/m", /* Find 2 uppercase letters at start of a line */
$input, /* In the '$input' string */
-1, /* No maximum limit of matches (-1 means no limit) */
PREG_SPLIT_NO_EMPTY /* Don't give us an empty first element */
);
// Finally, put our values into a new associative array
$result = array();
foreach($matches[0] as $key => $value) {
$result[$value] = str_replace(
"\r\n", /* Search for a new line character */
" ", /* And replace with a space */
trim($split[$key]) /* After trimming the string */
);
}
return $result;
}
And the output (note: you may need to replace \r\n with \n in str_replace function depending on your operating system):
array(5) {
["HD"]=> string(41) "Alcoa Earnings Soar; Outlook Stays Upbeat"
["BY"]=> string(35) "By James R. Hagerty and Matthew Day"
["PD"]=> string(12) "12 July 2011"
["LP"]=> string(172) "Alcoa Inc.'s profit more than doubled in the second quarter. The giant aluminum producer managed to meet analysts' forecasts. However, profits wereless than expected"
["TD"]=> string(59) "Licence this article via our website: http://example.com"
}
Removing the Comments for a Cleaner Function
Condensed version of this function. It's exactly the same as above but with the comments removed:
function headerSplit($input) {
preg_match_all("/^[A-Z]{2}/m",$input,$matches);
$split = preg_split("/^[A-Z]{2}/m",$input,-1,PREG_SPLIT_NO_EMPTY);
$result = array();
foreach($matches[0] as $key => $value) $result[$value] = str_replace("\r\n"," ",trim($split[$key]));
return $result;
}
Theoretically it shouldn't matter which one you use in your live code as parsing comments has little performance impact, so use the one you're more comfortable with.
Breakdown of the Regular Expression Used Here
There is only one expression in the function (albeit, used twice), let's break it down for simplicity:
"/^[A-Z]{2}/m"
/ - This is a delimiter, representing the start of the pattern.
^ - This means 'Match at the beginning of the text'.
[A-Z] - This means match any uppercase character.
{2} - This means match exactly two of the previous character (so exactly two uppercase characters).
/ - This is the second delimiter, meaning the pattern is over.
m - This is 'multi-line mode', telling regex to treat each line as a new string.
This tiny expression is powerful enough to match HD but not HDM at the start of a line, and not HD (for example in Full HD) in the middle of a line. You will not easily achieve this with non-regex options.
If you want two or more (instead of exactly 2) consecutive uppercase characters to signify a new section, use /^[A-Z]{2,}/m.
Using a list of pre-defined headers
Having read your last question, and your comment under @jgb's post, it looks like you want to use a pre-defined list of headers. You can do that by replacing our regex with "/^(HD|BY|WC|PD|SN|SC|PG|LA|CY|LP|TD|CO|IN|NS|RE|IPC|PUB|AN)/m" -- the | is treated as an 'or' in regular expressions.
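A variant that builds the pattern from an array of headers could look like the sketch below. The function name headerSplitWithList is hypothetical. Two details differ from the function above and are assumptions of this sketch: a \b is appended so that a header only matches as a whole word, and the first split piece is dropped with array_shift() instead of relying on PREG_SPLIT_NO_EMPTY, so text before the first header cannot shift the values.

```php
<?php
// Hypothetical variant of headerSplit() that takes the allowed headers.
function headerSplitWithList($input, array $headers)
{
    $quoted = array();
    foreach ($headers as $header) {
        $quoted[] = preg_quote($header, '/');
    }
    // One of the listed headers at the start of a line, as a whole word.
    $pattern = '/^(?:' . implode('|', $quoted) . ')\b/m';

    preg_match_all($pattern, $input, $matches);

    // Piece 0 is whatever precedes the first header; discard it so that
    // $split[$i] is the text that follows $matches[0][$i].
    $split = preg_split($pattern, $input);
    array_shift($split);

    $result = array();
    foreach ($matches[0] as $i => $header) {
        // Collapse line breaks and runs of whitespace into single spaces.
        $result[$header] = trim(preg_replace('/\s+/', ' ', $split[$i]));
    }
    return $result;
}

$input = "HD Alcoa Earnings Soar; Outlook Stays Upbeat\n"
       . "BY By James R. Hagerty and Matthew Day\n"
       . "PD 12 July 2011\n"
       . "LP\n"
       . "Alcoa Inc.'s profit more than doubled in the second quarter.\n"
       . "TD\n"
       . "Licence this article via our website:\n"
       . "http://example.com";

print_r(headerSplitWithList($input, array('HD', 'BY', 'PD', 'LP', 'TD')));
```

Collapsing whitespace with preg_replace('/\s+/', ' ', ...) also sidesteps the \r\n versus \n question mentioned earlier.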
Benchmarking - Readable Doesn't Mean Slow
Somehow benchmarking has become part of the conversation, and even though I think it's missing the point which is to provide you with a readable & maintainable solution, I rewrote JGB's benchmark to show you a few things.
Here are my results, showing that this regex-based code is the fastest option here (these results based on 5,000 iterations):
SWEETIE BELLE'S SOLUTION (2 UPPERCASE IS A HEADER): 0.054 seconds
SWEETIE BELLE'S SOLUTION (2+ UPPERCASE IS A HEADER): 0.057 seconds
MATEWKA'S SOLUTION (MODIFIED, 2 UPPERCASE IS A HEADER): 0.069 seconds
BABA'S SOLUTION (2 UPPERCASE IS A HEADER): 0.075 seconds
SWEETIE BELLE'S SOLUTION (USES DEFINED LIST OF HEADERS): 0.086 seconds
JGB'S SOLUTION (USES DEFINED LIST OF HEADERS, MODIFIED): 0.107 seconds
And the benchmarks for solutions with incorrectly formatted output:
MATEWKA'S SOLUTION: 0.056 seconds
JGB'S SOLUTION: 0.061 seconds
HEK2MGL'S SOLUTION: 0.106 seconds
ANUBHAVA'S SOLUTION: 0.167 seconds
The reason I offered a modified version of JGB's function is because his original function doesn't remove newlines before adding paragraphs to the output array. Small string operations make a huge difference in performance and must be benchmarked equally to get a fair estimation of performance.
Also, with jgb's function, if you pass in the full list of headers, you will get a bunch of null values in your arrays as it doesn't appear to check if the key is present before assigning it. This would cause another performance hit if you wanted to loop over these values later as you'd have to check empty first.
Here is a simple solution without regex
$data = explode("\n", $str);
$output = array();
$key = null;
foreach($data as $text) {
$newKey = substr($text, 0, 2);
if (ctype_upper($newKey)) {
$key = $newKey;
$text = substr($text, 2);
}
$text = trim($text);
isset($output[$key]) ? $output[$key] .= $text : $output[$key] = $text;
}
print_r($output);
Output
Array
(
[HD] => Alcoa Earnings Soar; Outlook Stays Upbeat
[BY] => By James R. Hagerty and Matthew Day
[PD] => 12 July 2011
[LP] => Alcoa Inc.'s profit more than doubled in the second quarter.The giant aluminum producer managed to meet analysts' forecasts.However, profits wereless than expected
[TD] => Licence this article via our website:http://example.com
)
Note
You might also want to do the following:
Check for duplicate data
Make sure only HD|BY|PD|LP|TD are used
Remove $text = trim($text) so that the new lines would be preserved in the text
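The second note (accepting only known keys) can be sketched without regex as follows. This is an illustration rather than the answer's own code: the function name parse_sections is hypothetical, a header is assumed to be a listed key followed by a space or the end of the line, continuation lines are joined with a single space, and a repeated key simply appends to the existing entry.

```php
<?php
// Hypothetical helper: line-by-line parser restricted to a list of keys.
function parse_sections($str, array $allowed)
{
    $output = array();
    $key = null;

    foreach (explode("\n", $str) as $line) {
        $line = rtrim($line, "\r");               // tolerate \r\n input
        $candidate = substr($line, 0, 2);
        $rest = (string) substr($line, 2);

        // Only a listed key followed by a space or end of line is a header,
        // so a line such as "HDTV ..." stays ordinary content.
        if (in_array($candidate, $allowed, true)
            && ($rest === '' || $rest[0] === ' ')) {
            $key = $candidate;
            $line = $rest;
        }

        $line = trim($line);
        if ($key === null || $line === '') {
            continue;                              // nothing to store yet
        }
        $output[$key] = isset($output[$key])
            ? $output[$key] . ' ' . $line
            : $line;
    }
    return $output;
}

$str = "HD Alcoa Earnings Soar; Outlook Stays Upbeat\n"
     . "TD\n"
     . "Licence this article via our website:\n"
     . "http://example.com";

print_r(parse_sections($str, array('HD', 'BY', 'PD', 'LP', 'TD')));
```

Joining with a space avoids the run-together sentences visible in the output above ("quarter.The giant").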
If it's just one record per file, here you go:
$record = array();
foreach(file('input.txt') as $line) {
if(preg_match('~^(HD|BY|PD|LP|TD) ?(.*)?$~', $line, $matches)) {
$currentKey = $matches[1];
$record[$currentKey] = $matches[2];
} else {
$record[$currentKey] .= str_replace("\n", ' ', $line);
}
}
The code iterates over each line of input and checks whether the line starts with an identifier. If so, currentKey is set to this identifier. All following content unless a new identifier was found will be added to this key in the array after new lines have been removed.
var_dump($record);
Output:
array(5) {
'HD' =>
string(42) "Alcoa Earnings Soar; Outlook Stays Upbeat "
'BY' =>
string(36) "By James R. Hagerty and Matthew Day "
'PD' =>
string(12) "12 July 2011"
'LP' =>
string(169) " Alcoa Inc.'s profit more than doubled in the second quarter. The giant aluminum producer managed to meet analysts' forecasts. However, profits wereless than expected "
'TD' =>
string(58) "Licence this article via our website: http://example.com "
}
Note: If there are multiple records per file, you can refine the parser to return a multidimensional array:
$records = array();
$record = null;
foreach(file('input.txt') as $line) {
if(preg_match('~^(HD|BY|PD|LP|TD) ?(.*)?$~', $line, $matches)) {
$currentKey = $matches[1];
// start a new record if `HD` was found.
if($currentKey === 'HD') {
if(is_array($record)) {
$records []= $record;
}
$record = array();
}
$record[$currentKey] = $matches[2];
} else {
$record[$currentKey] .= str_replace("\n", ' ', $line);
}
}
// store the last record, which no following `HD` line flushes
if(is_array($record)) {
$records []= $record;
}
However the data format itself looks fragile to me. What if LP looks like this:
LP dfks ldsfjksdjlf
lkdsjflk dsfjksld..
HD defsdf sdf sd....
You see, there is a HD in the data of LP in my example. In order to keep data parseable you'll have to avoid such situations.
UPDATE :
Given the posted example input file and code, I've altered my answer. I've added the OP's provided "parts" that define the section codes and made the function able to handle codes of two or more characters. Below is a non-regex procedural function that should produce the desired results:
# Builds an array of coded sections from the lines of a text file.
# INPUT:
# parts = (array) section codes that start a new section
# lines = (array) lines of the text file, e.g. as returned by file()
# RETURNS: (assoc array)
# null is returned if no data was found
# otherwise an associative array of the field sections is returned
function getSections($parts, $lines) {
$sections = array();
$code = "";
$str = "";
# examine each line to build section array
for($i=0; $i<sizeof($lines); $i++) {
$line = trim($lines[$i]);
# check for special field codes
$words = explode(' ', $line, 2);
$left = $words[0];
#echo "DEBUG: left[$left]\n";
if(in_array($left, $parts)) {
# field code detected; first, finish previous section, if exists
if($code) {
# store the previous section
$sections[$code] = trim($str);
}
# begin to process new section
$code = $left;
$str = trim(substr($line, strlen($code)));
} else if($code && $line) {
# keep a running string of section content
$str .= " ".$line;
}
} # for i
# check for no data
if(!$code)
return(null);
# store the last section and return results
$sections[$code] = trim($str);
return($sections);
} # getSections()
$parts = array('HD', 'BY', 'WC', 'PD', 'SN', 'SC', 'PG', 'LA', 'CY', 'LP', 'TD', 'CO', 'IN', 'NS', 'RE', 'IPC', 'PUB', 'AN');
$datafile = $argv[1]; # NOTE: I happen to be testing this from command-line
# load file as array of lines
$lines = file($datafile);
if($lines === false)
die("ERROR: unable to open file ".$datafile."\n");
$data = getSections($parts, $lines);
echo "Results from ".$datafile.":\n";
if($data)
print_r($data);
else
echo "ERROR: no data detected in ".$datafile."\n";
Results:
Array
(
[HD] => Alcoa Earnings Soar; Outlook Stays Upbeat
[BY] => By James R. Hagerty and Matthew Day
[PD] => 12 July 2011
[LP] => Alcoa Inc.'s profit more than doubled in the second quarter. The giant aluminum producer managed to meet analysts' forecasts. However, profits wereless than expected
[TD] => Licence this article via our website: http://example.com
)
This is one problem where I think using regex shouldn't be an issue, considering the rules for parsing the input data. Consider code like this:
$s = file_get_contents('input'); // read input file into a string
$match = array(); // will hold final output
if (preg_match_all('~(^|[A-Z]{2})\s(.*?)(?=[A-Z]{2}\s|$)~s', $s, $arr)) {
for ( $i = 0; $i < count($arr[1]); $i++ )
$match[ trim($arr[1][$i]) ] = str_replace( "\n", "", $arr[2][$i] );
}
print_r($match);
As you can see, the code becomes very compact because of the way preg_match_all is used to match data from the input file.
OUTPUT:
Array
(
[HD] => Alcoa Earnings Soar; Outlook Stays Upbeat
[BY] => By James R. Hagerty and Matthew Day
[PD] => 12 July 2011
[LP] => Alcoa Inc.'s profit more than doubled in the second quarter.The giant aluminum producer managed to meet analysts' forecasts.However, profits wereless than expected
[TD] => Licence this article via our website:http://example.com
)
Don't loop at all. How about this (assuming one record per file)?
$inrec = file_get_contents('input');
$inrec = str_replace( "\n'", "'", str_replace( array( 'HD ', 'BY ', 'PD ', 'LP', 'TD' ), array( "'HD' => '", "','BY' => '", "','PD' => '", "','LP' => '", "','TD' => '" ), str_replace( "'", "\\'", $inrec ) ) )."'";
eval( '$record = array('.$inrec.');' );
var_export($record);
results:
array (
'HD' => 'Alcoa Earnings Soar; Outlook Stays Upbeat ',
'BY' => 'By James R. Hagerty and Matthew Day ',
'PD' => '12 July 2011',
'LP' => '
Alcoa Inc.\'s profit more than doubled in the second quarter.
The giant aluminum producer managed to meet analysts\' forecasts.
However, profits wereless than expected
',
'TD' => '
Licence this article via our website:
http://example.com',
)
If there can be more than one record per file, try something like:
$inrecs = explode( 'HD ', file_get_contents('input') );
$records = array();
foreach ( $inrecs as $inrec ) {
$inrec = str_replace( "\n'", "'", str_replace( array( 'HD ', 'BY ', 'PD ', 'LP', 'TD' ), array( "'HD' => '", "','BY' => '", "','PD' => '", "','LP' => '", "','TD' => '" ), str_replace( "'", "\\'", 'HD ' . $inrec ) ) )."'";
eval( '$records[] = array('.$inrec.');' );
}
var_export($records);
Edit
Here's a version with the $inrec functions split out so it can be more easily understood - and with a couple of tweaks: strips new-lines, trims leading and trailing spaces, and addresses backslash concern in EVAL in case the data is from an untrusted source.
$inrec = file_get_contents('input');
$inrec = str_replace( '\\', '\\\\', $inrec ); // Precede all backslashes with backslashes
$inrec = str_replace( "'", "\\'", $inrec ); // Precede all single quotes with backslashes
$inrec = str_replace( PHP_EOL, " ", $inrec ); // Replace all new lines with spaces
$inrec = str_replace( array( 'HD ', 'BY ', 'PD ', 'LP ', 'TD ' ), array( "'HD' => trim('", "'),'BY' => trim('", "'),'PD' => trim('", "'),'LP' => trim('", "'),'TD' => trim('" ), $inrec )."')";
eval( '$record = array('.$inrec.');' );
var_export($record);
Results:
array (
'HD' => 'Alcoa Earnings Soar; Outlook Stays Upbeat',
'BY' => 'By James R. Hagerty and Matthew Day',
'PD' => '12 July 2011',
'LP' => 'Alcoa Inc.\'s profit more than doubled in the second quarter. The giant aluminum producer managed to meet analysts\' forecasts. However, profits wereless than expected',
'TD' => 'Licence this article via our website: http://example.com',
)
Update
It dawned on me that in a multi-record scenario, building $repl outside of the record loop would perform even better. Here's the 2 byte keyword version:
$inrecs = file_get_contents('input');
$inrecs = str_replace( PHP_EOL, " ", $inrecs );
$keys = array( 'HD', 'BY', 'PD', 'LP', 'TD' );
$split = chr(255);
$repl = explode( ',', $split . implode( ','.$split, $keys ) );
$inrecs = explode( 'HD ', $inrecs );
array_shift( $inrecs );
$records = array();
foreach( $inrecs as $inrec ) $records[] = parseRecord( $keys, $repl, 'HD '.$inrec );
function parseRecord( $keys, $repl, $rec ) {
$split = chr(255);
$lines = explode( $split, str_replace( $keys, $repl, $rec ) );
array_shift( $lines );
$out = array();
foreach ( $lines as $line ) $out[ substr( $line, 0, 2 ) ] = trim( substr( $line, 3 ) );
return $out;
}
Benchmark (thanks @jgb):
Answer 1 by: hek2mgl 6.783 seconds (regexp)
Answer 2 by: Emo Mosley 4.738 seconds
Answer 3 by: anubhava 6.299 seconds (regexp)
Answer 4 by: jgb 2.47 seconds
Answer 5 by: gwc 3.589 seconds (eval)
Answer 6 by: gwc 1.871 seconds
Here's another answer for multiple input records (assuming each record begins with 'HD ') and supporting 2-byte, 2- or 3-byte, or variable-length keywords.
$inrecs = file_get_contents('input');
$inrecs = str_replace( PHP_EOL, " ", $inrecs );
$keys = array( 'HD', 'BY', 'PD', 'LP', 'TD' );
$inrecs = explode( 'HD ', $inrecs );
array_shift( $inrecs );
$records = array();
foreach( $inrecs as $inrec ) $records[] = parseRecord( $keys, 'HD '.$inrec );
Parse record with 2 byte keywords:
function parseRecord( $keys, $rec ) {
$split = chr(255);
$repl = explode( ',', $split . implode( ','.$split, $keys ) );
$lines = explode( $split, str_replace( $keys, $repl, $rec ) );
array_shift( $lines );
$out = array();
foreach ( $lines as $line ) $out[ substr( $line, 0, 2 ) ] = trim( substr( $line, 3 ) );
return $out;
}
Parse record with 2 or 3 byte keywords (assumes space or PHP_EOL between key and content):
function parseRecord( $keys, $rec ) {
$split = chr(255);
$repl = explode( ',', $split . implode( ','.$split, $keys ) );
$lines = explode( $split, str_replace( $keys, $repl, $rec ) );
array_shift( $lines );
$out = array();
foreach ( $lines as $line ) $out[ trim( substr( $line, 0, 3 ) ) ] = trim( substr( $line, 3 ) );
return $out;
}
Parse record with variable length keywords (assumes space or PHP_EOL between key and content):
function parseRecord( $keys, $rec ) {
$split = chr(255);
$repl = explode( ',', $split . implode( ','.$split, $keys ) );
$lines = explode( $split, str_replace( $keys, $repl, $rec ) );
array_shift( $lines );
$out = array();
foreach ( $lines as $line ) {
$keylen = strpos( $line.' ', ' ' );
$out[ trim( substr( $line, 0, $keylen ) ) ] = trim( substr( $line, $keylen+1 ) );
}
return $out;
}
Expectation is that each parseRecord function above would perform a little worse than its predecessor.
Results:
Array
(
[0] => Array
(
[HD] => Alcoa Earnings Soar; Outlook Stays Upbeat
[BY] => By James R. Hagerty and Matthew Day
[PD] => 12 July 2011
[LP] => Alcoa Inc.'s profit more than doubled in the second quarter. The giant aluminum producer managed to meet analysts' forecasts. However, profits wereless than expected
[TD] => Licence this article via our website: http://example.com
)
)
I prepared my own solution which came out slightly faster than jgb's answer. Here's the code:
function answer_5(array $parts, $str) {
$result = array_fill_keys($parts, '');
$poss = $result;
foreach($poss as $key => &$val) {
$val = strpos($str, "\n" . $key) + 2;
}
arsort($poss);
foreach($poss as $key => $pos) {
$result[$key] = trim(substr($str, $pos+1));
$str = substr($str, 0, $pos-1);
}
return str_replace("\n", "", $result);
}
And here's comparison of the performance:
Answer 1 by: hek2mgl 2.791 seconds (regexp)
Answer 2 by: Emo Mosley 2.553 seconds
Answer 3 by: anubhava 3.087 seconds (regexp)
Answer 4 by: jgb 1.53 seconds
Answer 5 by: matewka 1.403 seconds
Testing environment was the same as jgb's (100000 iterations - script borrowed from here).
Enjoy and please leave comments.

Replacing Placeholder Variables in a String

Just finished making this function. Basically, it is supposed to look through a string and find any placeholder variables, which are placed between two curly brackets {}. It grabs the value between the curly brackets and uses it to look through an array, where it should match a key. Then it replaces the curly-bracket variable in the string with the array value for the matching key.
It has a few problems though. First, when I var_dump($matches), it puts the results in an array inside an array, so I have to use two foreach() loops just to reach the correct data.
I also feel like it's heavy. I've been looking over it trying to make it better, but I'm somewhat stumped. Any optimizations I missed?
function dynStr($str,$vars) {
preg_match_all("/\{[A-Z0-9_]+\}+/", $str, $matches);
foreach($matches as $match_group) {
foreach($match_group as $match) {
$match = str_replace("}", "", $match);
$match = str_replace("{", "", $match);
$match = strtolower($match);
$allowed = array_keys($vars);
$match_up = strtoupper($match);
$str = (in_array($match, $allowed)) ? str_replace("{".$match_up."}", $vars[$match], $str) : str_replace("{".$match_up."}", '', $str);
}
}
return $str;
}
$variables = array("first_name"=>"John","last_name"=>"Smith","status"=>"won");
$string = 'Dear {FIRST_NAME} {LAST_NAME}, we wanted to tell you that you {STATUS} the competition.';
echo dynStr($string,$variables);
//Would output: 'Dear John Smith, we wanted to tell you that you won the competition.'
I think for such a simple task you don't need to use RegEx:
$variables = array("first_name"=>"John","last_name"=>"Smith","status"=>"won");
$string = 'Dear {FIRST_NAME} {LAST_NAME}, we wanted to tell you that you {STATUS} the competition.';
foreach($variables as $key => $value){
$string = str_replace('{'.strtoupper($key).'}', $value, $string);
}
echo $string; // Dear John Smith, we wanted to tell you that you won the competition.
I hope I'm not too late to join the party — here is how I would do it:
function template_substitution($template, $data)
{
$placeholders = array_map(function ($placeholder) {
return strtoupper("{{$placeholder}}");
}, array_keys($data));
return strtr($template, array_combine($placeholders, $data));
}
$variables = array(
'first_name' => 'John',
'last_name' => 'Smith',
'status' => 'won',
);
$string = 'Dear {FIRST_NAME} {LAST_NAME}, we wanted to tell you that you have {STATUS} the competition.';
echo template_substitution($string, $variables);
And, if by any chance you could make your $variables keys to match your placeholders exactly, the solution becomes ridiculously simple:
$variables = array(
'{FIRST_NAME}' => 'John',
'{LAST_NAME}' => 'Smith',
'{STATUS}' => 'won',
);
$string = 'Dear {FIRST_NAME} {LAST_NAME}, we wanted to tell you that you have {STATUS} the competition.';
echo strtr($string, $variables);
(See strtr() in PHP manual.)
Taking into account the nature of the PHP language, I believe that this approach should yield the best performance of all those listed in this thread.
EDIT: After revisiting this answer 7 years later, I noticed a potentially dangerous oversight on my side, which was also pointed out by another user. Be sure to give them a pat on the back in the form of an upvote!
If you are interested in what this answer looked like before this edit, check out the revision history
I think you can greatly simplify your code, with this (unless I'm misinterpreting some of the requirements):
$allowed = array("first_name"=>"John","last_name"=>"Smith","status"=>"won");
$resultString = preg_replace_callback(
// the pattern, no need to escape curly brackets
// uses a group (the parentheses) that will be captured in $matches[ 1 ]
'/{([A-Z0-9_]+)}/',
// the callback, uses $allowed array of possible variables
function( $matches ) use ( $allowed )
{
$key = strtolower( $matches[ 1 ] );
// return the complete match (captured in $matches[ 0 ]) if no allowed value is found
return array_key_exists( $key, $allowed ) ? $allowed[ $key ] : $matches[ 0 ];
},
// the input string
$yourString
);
PS.: if you want to remove placeholders that are not allowed from the input string, replace
return array_key_exists( $key, $allowed ) ? $allowed[ $key ] : $matches[ 0 ];
with
return array_key_exists( $key, $allowed ) ? $allowed[ $key ] : '';
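The two variants above can be folded into one function with a switch. This is only a sketch: the name fillPlaceholders() and the $stripUnknown parameter are made up for illustration and are not part of the answer.

```php
<?php
// Hypothetical helper: fillPlaceholders() and $stripUnknown are illustrative names.
function fillPlaceholders(string $template, array $allowed, bool $stripUnknown = false): string
{
    return preg_replace_callback(
        '/{([A-Z0-9_]+)}/',
        function (array $matches) use ($allowed, $stripUnknown) {
            // Placeholders are upper case, array keys are lower case.
            $key = strtolower($matches[1]);
            if (array_key_exists($key, $allowed)) {
                return $allowed[$key];
            }
            // Unknown placeholder: either drop it or leave it as written.
            return $stripUnknown ? '' : $matches[0];
        },
        $template
    );
}

$allowed  = array('first_name' => 'John', 'status' => 'won');
$template = 'Dear {FIRST_NAME}, you {STATUS}. Ref: {UNKNOWN}';

$kept     = fillPlaceholders($template, $allowed);       // unknown placeholder kept
$stripped = fillPlaceholders($template, $allowed, true); // unknown placeholder removed
echo $kept, "\n", $stripped, "\n";
```

Because the callback runs once per match on the original string, values that happen to contain something like {STATUS} are not substituted a second time.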
Just a heads up for future people who land on this page: All the answers (including the accepted answer) using foreach loops and/or the str_replace method are susceptible to replacing good ol' Johnny {STATUS}'s name with Johnny won.
Decent Dabbler's preg_replace_callback approach and U-D13's second option (but not the first) are the only ones currently posted I see that aren't vulnerable to this, but since I don't have enough reputation to add a comment I'll just write up a whole different answer I guess.
If your replacement values contain user-input, a safer solution is to use the strtr function instead of str_replace to avoid re-replacing any placeholders that may show up in your values.
$string = 'Dear {FIRST_NAME} {LAST_NAME}, we wanted to tell you that you {STATUS} the competition.';
$variables = array(
"first_name"=>"John",
// Note the value here
"last_name"=>"{STATUS}",
"status"=>"won"
);
// bonus one-liner for transforming the placeholders
// but it's ugly enough I broke it up into multiple lines anyway :)
$replacement = array_combine(
array_map(function($k) { return '{'.strtoupper($k).'}'; }, array_keys($variables)),
array_values($variables)
);
echo strtr($string, $replacement);
Outputs: Dear John {STATUS}, we wanted to tell you that you won the competition.
Whereas str_replace outputs: Dear John won, we wanted to tell you that you won the competition.
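To reproduce the difference side by side, here is a small sketch that runs both functions on the same map. The variable names $viaStrReplace and $viaStrtr are just for illustration, and the template is shortened.

```php
<?php
$string = 'Dear {FIRST_NAME} {LAST_NAME}, you {STATUS} the competition.';
$replacement = array(
    '{FIRST_NAME}' => 'John',
    '{LAST_NAME}'  => '{STATUS}', // a value that looks like a placeholder
    '{STATUS}'     => 'won',
);

// str_replace() applies the pairs one after another, so the {STATUS}
// inserted by the second pair is replaced again by the third pair.
$viaStrReplace = str_replace(array_keys($replacement), array_values($replacement), $string);

// strtr() scans the input once and never rescans text it has inserted.
$viaStrtr = strtr($string, $replacement);

echo $viaStrReplace, "\n"; // Dear John won, you won the competition.
echo $viaStrtr, "\n";      // Dear John {STATUS}, you won the competition.
```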
This is the function that I use:
function searchAndReplace($search, $replace){
preg_match_all("/\{(.+?)\}/", $search, $matches);
if (isset($matches[1]) && count($matches[1]) > 0){
foreach ($matches[1] as $key => $value) {
if (array_key_exists($value, $replace)){
$search = preg_replace("/\{$value\}/", $replace[$value], $search);
}
}
}
return $search;
}
$array = array(
'FIRST_NAME' => 'John',
'LAST_NAME' => 'Smith',
'STATUS' => 'won'
);
$paragraph = 'Dear {FIRST_NAME} {LAST_NAME}, we wanted to tell you that you {STATUS} the competition.';
echo searchAndReplace($paragraph, $array);
// outputs: Dear John Smith, we wanted to tell you that you won the competition.
Just pass it some text to search for, and an array with the replacements in.
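One thing to watch with this function, as far as I can tell: preg_replace() treats sequences such as $1 or \1 in the replacement string as backreferences, so a value like "$10" may not come out literally. A possible single-pass variant using preg_replace_callback() is sketched below; searchAndReplaceOnce() is a hypothetical name, not something from the answer.

```php
<?php
// Hypothetical variant of searchAndReplace(); the name is illustrative.
function searchAndReplaceOnce(string $search, array $replace): string
{
    return preg_replace_callback(
        '/\{(.+?)\}/',
        function (array $m) use ($replace) {
            // The callback's return value is inserted as-is, with no
            // backreference parsing. Unknown placeholders are left untouched.
            return array_key_exists($m[1], $replace) ? $replace[$m[1]] : $m[0];
        },
        $search
    );
}

$text = 'The prize is {PRIZE}.';
$data = array('PRIZE' => '$10');

$safe = searchAndReplaceOnce($text, $data);
echo $safe, "\n"; // The prize is $10.

// For comparison: here "$10" is parsed as a reference to capture group 10.
$viaPregReplace = preg_replace('/\{PRIZE\}/', $data['PRIZE'], $text);
echo $viaPregReplace, "\n";
```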
/**
replace placeholders with object
**/
$user = new stdClass();
$user->first_name = 'Nick';
$user->last_name = 'Trom';
$message = 'This is a {{first_name}} of a user. The user\'s {{first_name}} is replaced as well as the user\'s {{last_name}}.';
preg_match_all('/{{([0-9A-Za-z_]+)}}/', $message, $matches);
foreach($matches[1] as $match)
{
if(isset($user->$match))
$rep = $user->$match;
else
$rep = '';
$message = str_replace('{{'.$match.'}}', $rep, $message);
}
echo $message;

Swear filter case sensitive

I have a little problem with my function:
function swear_filter($string){
$search = array(
'bad-word',
);
$replace = array(
'****',
);
return preg_replace($search , $replace, $string);
}
It should transform "bad-word" to "****", but the problem is case sensitivity:
e.g. if the user types "baD-word" it doesn't work.
The values in your $search array are not regular expressions.
First, fix that:
$search = array(
'/bad-word/',
);
Then, you can apply the i flag for case-insensitivity:
$search = array(
'/bad-word/i',
);
You don't need the g flag to match globally (i.e. more than once each) because preg_replace will handle that for you.
However, you could probably do with using the word boundary metacharacter \b to avoid matching your "bad-word" string inside another word. This may have consequences on how you form your list of "bad words".
$search = array(
'/\bbad-word\b/i',
);
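To see what the word boundary buys you, here is a small comparison. The word "darn" and the sample sentence are made up for the example.

```php
<?php
$text = 'Darn it, I was darning socks.';

// Without \b the pattern also matches inside a longer word.
$loose = preg_replace('/darn/i', '****', $text);

// With \b on both sides only the standalone word is replaced.
$strict = preg_replace('/\bdarn\b/i', '****', $text);

echo $loose, "\n";  // **** it, I was ****ing socks.
echo $strict, "\n"; // **** it, I was darning socks.
```

Note that \b sits between a word character and a non-word character, so an entry that starts or ends with punctuation will behave differently from a plain word.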
If you don't want to pollute $search with these implementation details, then you can do the same thing a bit more programmatically:
$search = array_map(
function ($str) {
return '/\b' . preg_quote($str, '/') . '\b/i';
},
$search
);
(This uses an anonymous function; create_function() was deprecated in PHP 7.2 and removed in PHP 8.0.)
Update Full code:
function swear_filter($string){
$search = array(
'bad-word',
);
$replace = array(
'****',
);
// regex-ise input
$search = array_map(
// anonymous function; create_function() was removed in PHP 8.0
function ($str) {
return '/\b' . preg_quote($str, '/') . '\b/i';
},
$search
);
return preg_replace($search, $replace, $string);
}
I think you mean
'/bad-word/i',
Do you even need to use regex?
function swear_filter($string){
$search = array(
'bad-word',
);
if (in_array(strtolower($string), $search)) {
return '****';
}
return $string;
}
This makes the following assumptions:
1) $string contains characters acceptable in the current locale
2) all contents of the $search array are lowercase
edit: 3) the entire string consists of a bad word
I suppose this would only work if the string was split and evaluated on a per word basis.
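A rough sketch of that per-word idea, assuming words are separated by whitespace. The function name swear_filter_words() and the $badWords parameter are invented for this example.

```php
<?php
// Hypothetical helper; swear_filter_words() is an illustrative name.
function swear_filter_words(string $string, array $badWords): string
{
    // Split on whitespace but keep the separators so spacing is preserved.
    $parts = preg_split('/(\s+)/', $string, -1, PREG_SPLIT_DELIM_CAPTURE);
    foreach ($parts as $i => $part) {
        // Assumes every entry in $badWords is lowercase.
        if (in_array(strtolower($part), $badWords, true)) {
            $parts[$i] = '****';
        }
    }
    return implode('', $parts);
}

$filtered = swear_filter_words('That baD-word again', array('bad-word'));
echo $filtered, "\n"; // That **** again
```

A word with punctuation attached, such as "bad-word!", would slip through this version, which is one reason the regex approach with \b may be the better fit.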

How to replace certain character with a defined character for it in a string using php?

My question is: can I replace a certain character like # with # in a string?
I have all the characters to check and their replacements in an array, like this:
$string_check = array(
"#" => "#",
// ... and so on (the list is too big)
);
So how can I do this? Please help me out. I only have 20 days of experience with PHP.
You can feed your translation table right into strtr():
$table = array(
'#' => '...',
);
$result = strtr($source, $table);
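A small runnable sketch of the array form of strtr(). The replacement values here are placeholders made up for the example, since the real table is not shown in the question.

```php
<?php
// Made-up replacement values; substitute your own table.
$table = array(
    '#'  => '[hash]',
    '##' => '[double-hash]',
);

$source = 'a ## b # c';

// strtr() tries the longest matching key first and does not
// re-scan text it has already replaced.
$result = strtr($source, $table);

echo $result, "\n"; // a [double-hash] b [hash] c
```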
str_replace does exactly that and it also accepts arrays as replacement maps:
$string_check = array(
"#" => "#"
);
$result = str_replace (array_keys($string_check), array_values($string_check), $original);
$search = array('hello','foo');
$replace = array('world','bar');
$text = 'hello foo';
$result = str_replace($search,$replace,$text);
// $result will be 'world bar'
But in your case it looks like some kind of encoding; have you tried htmlspecialchars()?
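If the goal really is HTML escaping (an assumption, since the question does not say so), a minimal example of htmlspecialchars() looks like this. The sample string is made up.

```php
<?php
$raw = '<b>Tom & "Jerry"</b>';

// Convert the characters that are special in HTML into entities.
$encoded = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');
echo $encoded, "\n"; // &lt;b&gt;Tom &amp; &quot;Jerry&quot;&lt;/b&gt;

// htmlspecialchars_decode() reverses the conversion.
$decoded = htmlspecialchars_decode($encoded, ENT_QUOTES);
echo $decoded, "\n"; // <b>Tom & "Jerry"</b>
```

Note that htmlspecialchars() only handles &, <, > and quotes; a character such as # is left alone, so a custom table with strtr() is still needed for anything beyond those.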
