I have scientific names in the following format:
S. daemon
A. cacatuoides
B. splendens
Etc, etc.
I'm having difficulty with the "." character.
This code works for full species names (i.e. Satanoperca daemon):
foreach ($species as $term) {
$term_norm = preg_replace('/\s+/', ' ', strtoupper(trim($term)));
$pattern[] = preg_replace('/ /', '\\s+', preg_quote($term_norm));
$urls[$term_norm] = '/dev/species/' . str_replace(" ", "-", rawurlencode($term));
$rels[$term_norm] = $urls[$term_norm] . '?preview=true';
$title[$term_norm] = $term;
But I can't get it to work for the aforementioned examples:
$genus_species = explode(" ", $term);
$genus = $genus_species[0];
$species = $genus_species[1];
$initial = substr($genus, 0, 1);
$shortened = $initial . '. ' . $species;
$term_norm = preg_replace('/\s+/', ' ', strtoupper(trim($shortened)));
$pattern[] = preg_replace('/ /', '\\s+', preg_quote($term_norm));
$urls[$term_norm] = '/dev/species/' . rawurlencode($term);
$rels[$term_norm] = $urls[$term_norm] . '?preview=true';
$title[$term_norm] = $term;
If I use this code, nearly all of my source, i.e. every word/character, gets wrapped in a link. If I comment the code out, the full-name linking works perfectly and no such problem occurs.
A little more info...
$pattern is echoing out as: /\b(SATANOPERCA\s+DAEMON|S(\.)\s+DAEMON)\b/i
The input is a list of species names, such as the ones I previously mentioned. The source is a species profile, which often refers to other species.
What I'd like the code to do is replace any mention of these species names with a link to that species profile.
Thanks in advance,
While looking into your issue I ran over the way you initially build the regular expression. I thought, why not simplify it? Here is what I've come up with:
foreach ($terms as $term) {
list($genus, $species) = explode(' ', $term);
$pattern = sprintf('~\b((?:%s[.]|%s) %s)~i', $genus[0], $genus, $species);
Which gives the following
~\b((?:S[.]|Satanoperca) daemon)~i
I'm making use of list here in combination with explode, which is often less code and therefore more readable.
To build the regular expression I use sprintf, which often makes it easier to formulate complex strings that need substitution. It allows the usage of a mask.
Finally, $genus[0] is the first character (strictly speaking, the first byte) of $genus. You might need to replace it if you're using a multibyte character set. Just saying.
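If multibyte genus names are a concern, a minimal sketch of one way around it is shown here. It assumes the terms are UTF-8 and always contain a single space; the helper name buildSpeciesPattern is hypothetical. It takes the first character with a u-modifier regex instead of $genus[0] and also runs each part through preg_quote. (mb_substr would do the same job where the mbstring extension is installed.)

```php
<?php
// Hypothetical helper: builds the pattern for one "Genus species" term.
// Assumes UTF-8 input with exactly one space between genus and species.
function buildSpeciesPattern($term)
{
    list($genus, $species) = explode(' ', $term, 2);

    // First character (not first byte) of the genus.
    $initial = preg_match('~^.~us', $genus, $m) ? $m[0] : '';

    return sprintf(
        '~\b((?:%s\.|%s) %s)~iu',
        preg_quote($initial, '~'),
        preg_quote($genus, '~'),
        preg_quote($species, '~')
    );
}

echo buildSpeciesPattern('Satanoperca daemon'), "\n";
// ~\b((?:S\.|Satanoperca) daemon)~iu
```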
The pattern itself is streamlined as well:
~\b((?:S[.]|Satanoperca) daemon)~i
The first subgroup is non-capturing (?:) and offers both variants: the short one with . or the long genus. It is followed by the space and finally the species. I also use [.] to express the dot in there, but \. works as well:
~\b((?:S\.|Satanoperca) daemon)~i
What's left is the replacement procedure. I opted for a callback function here. As the link only needs to be built once per term, I add that at the top of the foreach. Again I'm using sprintf to format it:
foreach ($terms as $term) {
$termSlug = strtolower(strtr($term, array(' ' => '-')));
$termHref = sprintf('/dev/species/%s', rawurlencode($termSlug));
list($genus, $species) = explode(' ', $term);
$pattern = sprintf('~\b((?:%s\.|%s) %s)~i', $genus[0], $genus, $species);
What's left is the callback function that replaces every match with the link:
$string = preg_replace_callback($pattern, function($match) use($term, $termHref)
{
return sprintf('<a href="%s" title="%s">%s</a>', $termHref
, htmlspecialchars($term), htmlspecialchars($match[1]));
}, $string);
And that's it. The full example:
$string = <<<STR
S. daemon
Satanoperca daemon
A. cacatuoides
B. splendens
STR;
$terms = array(
'Satanoperca daemon',
);
foreach ($terms as $term) {
$termSlug = strtolower(strtr($term, array(' ' => '-')));
$termHref = sprintf('/dev/species/%s', rawurlencode($termSlug));
list($genus, $species) = explode(' ', $term);
$pattern = sprintf('~\b((?:%s\.|%s) %s)~i', $genus[0], $genus, $species);
echo $pattern, "\n";
$string = preg_replace_callback($pattern, function($match) use($term, $termHref)
{
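return sprintf('<a href="%s" title="%s">%s</a>', $termHref
, htmlspecialchars($term), htmlspecialchars($match[1]));
}, $string);
}
echo $string;
And its output:
~\b((?:S\.|Satanoperca) daemon)~i
<a href="/dev/species/satanoperca-daemon" title="Satanoperca daemon">S. daemon</a>
<a href="/dev/species/satanoperca-daemon" title="Satanoperca daemon">Satanoperca daemon</a>
A. cacatuoides
B. splendens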
I hope this is helpful, even though it's completely new code everywhere.
Validate Terms:
// validate terms
$valid = '/^\w+ \w+$/';
foreach ($terms as $index => $term) {
if ($result = preg_match($valid, $term))
continue;
printf("Invalid Term: (%d) %s\n", $index, $term);
}
Do you want to include the . as well, like this?
$term_norm = preg_replace('/[\s\.]+/', ' ', strtoupper(trim($shortened)));
Related
I am trying to remove certain strings from an array of strings
$replace = array(
're: ', 're:', 're',
'fw: ', 'fw:', 'fw',
'[Ticket ID: #'.$ticket["ticketnumber"].'] ',
);
$available_subjects = array($ticket["subject"], $update["subject"]);
I tried using these loops
This replaced words like "You're" because of the "re"
foreach($replace as $r) {
$available_subjects = str_ireplace($r, '', $available_subjects);
}
And the same with this one
foreach($replace as $r) {
$available_subjects = preg_replace('/\b'.$r.'\b/i', '', $available_subjects);
}
So I want to match the whole word, and not part of words
First, I would replace Ticket IDs statically like you did:
$ticket_prefix = '[Ticket ID: #' . $ticket["ticketnumber"] . '] ';
$available_subjects = str_ireplace($ticket_prefix, '', $available_subjects);
Then, I would use a regular expression to replace re and fw:
$available_subjects = preg_replace('#\b(fw|re):?\s*\b#i', '', $available_subjects);
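One caveat with \b: the apostrophe in "You're" counts as a word boundary, so a \b-based pattern can still match the "re" after it. A minimal alternative sketch, assuming the markers only ever need removing from the start of the subject and are always followed by a colon, anchors the pattern with ^ instead. The sample subjects are made up for illustration.

```php
<?php
// Sample subjects are made up for illustration.
$available_subjects = array(
    'Re: FW: Printer is broken',
    "You're welcome",
    'fw:re:Reminder',
);

// Strip one or more leading "re:" / "fw:" markers (colon required).
// Anchoring at the start keeps words inside the subject intact.
$available_subjects = preg_replace('/^\s*(?:(?:re|fw)\s*:\s*)+/i', '', $available_subjects);

print_r($available_subjects);
// Printer is broken / You're welcome / Reminder
```

Unlike the pattern above, this leaves a bare "re" or "fw" without a colon alone, which may or may not be what you want.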
I have a complete MySQL query in a string. Inside the WHERE clause of that query I have defined some variable names that I should replace with the values passed from the interface, but if the user has not filled in a filter, I should replace the whole relevant sub-statement inside the query with 1.
Suppose I have an HTML select element in a form named provinces_province_code, and the same string inside the query.
I have developed the code blocks below to remove the extra spaces from the query and get everything after the WHERE. Here $input_key is provinces_province_code, and $input_val is the value of the select element (0, 01, 02, ..., etc.). If $input_val is 0, I should replace users.province_code = provinces_province_code with 1.
$normalized_query = trim(preg_replace('/\s+/', ' ', $main_query));
if (preg_match('/= ' . $input_key . ' /', $normalized_query)
or preg_match('/=' . $input_key . '/', $normalized_query)
or preg_match('/= \'' . $input_key . '\'/', $normalized_query)
or preg_match('/=\'' . $input_key . '\'/', $normalized_query)
or preg_match('/= "' . $input_key . '"/', $normalized_query)
or preg_match('/="' . $input_key . '"/', $normalized_query)) {
if($input_val == 0){
$main_query = preg_replace('/ where /', ' WHERE ', $main_query);
$main_query = preg_replace('/ Where /', ' WHERE ', $main_query);
$separatedByWhere = explode("WHERE", $main_query);
$last_where_clause = end($separatedByWhere);
Here I have removed all backticks and single/double quotation marks before and after provinces_province_code.
$string = preg_replace('/\`'.$prm_val.'\`/', $prm_val, $last_where_clause);
$string = preg_replace('/\''.$prm_val.'\'/', $prm_val, $string);
$string = preg_replace('/\"'.$prm_val.'\"/', $prm_val, $string);
$string = preg_replace('/\"'.$prm_key.'\"/', $prm_val, $string);
$string = preg_replace('/\"'.$prm_key.'\"/', $prm_val, $string);
$string = preg_replace('/\"'.$prm_key.'\"/', $prm_val, $string);
Now I want to define the start and end of a string and call another function to get the substring that occurs between them. In my case the end of the string is always the name of the HTML element that I use as a filter (provinces_province_code), but the start of the string is unknown. What I need to do is get the first word that occurs before = $end_string and use it as $start_string.
$string is:
users.district_code IS NOT NULL AND users.district_code <> '' AND users.province_code = srs.province_code AND users.province_code = provinces_province_code
GROUP BY users.district_code
$start_string = '';
$end_string = $prm_val;
$statment2beReplaced = self::get_string_between($string, $start_string, $end_string);
$statment2beReplaced = preg_replace('/ and /', ' AND ',$statment2beReplaced);
if(preg_match('/ AND /',$statment2beReplaced)){
$searchString = $statment2beReplaced.''.$prm_val;
$statment2beReplaced = preg_replace('/'.$prm_key.' =/',1,$statment2beReplaced);
$string = preg_replace('/'.$searchString.'/',$statment2beReplaced,$string);
}
else {
$string = preg_replace('/' . $prm_key . $statment2beReplaced . $end_string . '/', 1, $string);
}
$separatedByWhere[key($separatedByWhere)] = $string;
$main_query = implode('WHERE', $separatedByWhere);
}
Now I have checked whether provinces_province_code exists inside the above string.
1- If it exists, I want to get the first word that occurs before = provinces_province_code, which in the above string is users.province_code.
The question now is: how can I get that?
thanks for any help.
Sorry, I am not very good at preg_match yet, so I would look at exploding the string into an array, something like:
<?php
$string = "users.district_code IS NOT NULL AND users.district_code <> '' AND users.province_code = srs.province_code AND users.province_code = provinces_province_code GROUP BY users.district_code";
$subs = explode(" ", $string);
foreach ($subs as $item) {
echo "<li>$item</li>";
}
?>
Then, depending on the delimiter you use and how your string is formatted, you can search the array and find the preceding word. In the above example the preceding word is separated by " = ", so it will always be two elements before the placeholder (the = counts as a word as well). Or you could strip the = and it will be the element directly before. This assumes each word is preceded and followed by a space.
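For comparison, here is a minimal preg_match sketch of the same lookup. It assumes the column name contains no spaces and that the placeholder may optionally be wrapped in quotes or backticks; the variable names follow the question.

```php
<?php
$string = "users.district_code IS NOT NULL AND users.district_code <> '' "
        . "AND users.province_code = srs.province_code "
        . "AND users.province_code = provinces_province_code GROUP BY users.district_code";
$end_string = 'provinces_province_code';

// Capture the run of non-space characters right before "= <placeholder>".
$pattern = '/(\S+)\s*=\s*[\'"`]?' . preg_quote($end_string, '/') . '[\'"`]?/';

$start_string = null;
if (preg_match($pattern, $string, $m)) {
    $start_string = $m[1];
    echo $start_string, "\n"; // users.province_code
}

// The same pattern can replace the whole condition with 1.
$replaced = preg_replace($pattern, '1', $string);
echo $replaced, "\n";
```

The condition comparing against srs.province_code is left untouched because the pattern requires the placeholder name on the right-hand side.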
So, for my university graduation thesis I chose to build a web app that extracts the main idea from an article (a summarization app). It's built in PHP.
But I have reached a situation to which I see no possible solutions, maybe you guys can give me an idea or a solution to the problem.
So basically the app relies on extractive algorithms, what I do:
Firstly, I "sanitize" the text, which means I remove all stop words, stem the words, and remove any abbreviations or initials containing a '.' that could prevent the text from being split into sentences correctly.
After that I break the text into sentences by exploding the text by . token and I get all sentences in an array.
Now comes the process in which I "give" the sentences a rating, basically this is how I spot the most relevant sentence in the article, the one that has the highest rating is usually the one that contains the article's main idea.
But my problem starts now, the sentences that I have rated are the ones on which I applied all the 'sanitization' and are not in their original form. I want to take the highest rated sentence and based on that I want to extract the original sentence from the text to which this rated sentence matches. I have tried matching it with regex but it doesn't always work. I need a 100% working method that extracts the original sentence from the article based on the highest rated sentence.
I have no idea how to achieve this, since the rated sentence misses words from it.
I hope you understand my point. Thank you.
EDIT:
This is the function that I now use to match the original sentence in the article, but it doesn't always work:
private function get_original_sentence($s, $t)
{
$s = preg_replace("/[^A-Za-z0-9 ]/", '', $s);
$s = trim($s);
$arr = explode(" ",$s);
$f_word = $arr[0];
$l_word = $arr[count($arr)-1];
preg_match('~(?<=\.)([a-zA-Z ]*)'.$f_word.'(.*?)'.$l_word.'([a-zA-Z ]*)(?=\.)~i', $t, $matches);
if(empty($matches))
{
preg_match('~(?<=\.)([^\.]*)'.$f_word.'(.*?)'.$l_word.'([^\.]*)(?=\.)~i', $t, $matches);
}
return !empty($matches[0]) ? $matches[0] : false;
}
The $s parameter is the rated sentence after the summarization and $t is the full original article.
EDIT 2: The abbreviation removal function, which practically sanitizes the whole text, not just abbreviations.
static private function _remove_abbrev($subject)
{
$domains = '\.ro|\.com|\.edu|\.org|\.gov';
foreach(self::$abrv as $abrv)
{
$not.= strtolower(str_replace('.', '\.', $abrv)).'|';
$not.= strtolower(trim(str_replace('.', '\.', $arbv))).'|';
}
$not = substr($not, 0, -1);
//$subject = preg_replace('~(\".*?\")~u', '', strtolower($subject));//replaces " " from text.
$subject = preg_replace('~(?<=\.|^)(?![^\.]{60,})[^\.&]*\&[^\.]*\.?~u', '', strtolower($subject));
$subject = preg_replace('~\b\s?[\dA-za-z\-\.]+('.$domains.')~u', '', strtolower($subject));
$subject = preg_replace('~\s*\(.*?\)\s*~u', '', strtolower($subject));
$subject = preg_replace('~\b('.$not.')~u', '', strtolower($subject));
$subject = preg_replace('~(?<=[^a-z])[A-Za-z]{1,5}+\.[\s\,]*(?=[a-z]|[0-9])~u', '', strtolower($subject));
$subject = preg_replace('~(?<=[\s\,\.\:])([A-Za-z]*(\.)){2,}+(.)(?=.*)~u', '', strtolower($subject));
$subject = preg_replace('~(\d)+\.(\d)*(\s)~u', '', strtolower($subject));
return $subject;
}
This is the abbreviation array collection:
static public $abrv = array(
' alin.', ' art.', ' A.N.P', ' A.V.A.B', ' A.V.A.S.', ' B.N.R', ' c.', ' C.A.S', ' C.civ.', ' C.com.', ' C.fam.', ' C.pen.', ' C.pr.civ.', ' C.pr.pen', ' C.N.C.D', ' C.N.V.M', ' C.N.S.A.S', ' C.S.M', ' C.S.J', ' D.G.F.P', ' D.G.P.M.B', ' D.N.A', ' D.S.V', 'Ed.', ' etc.', ' H.G.', ' I.G.P.F', ' I.G.P.R', ' I.N.M.L.', ' I.P.J', ' I.C.C.J', ' lit.', ' M.Ap.N.', ' art.', ' M.J.', ' M.Of.', ' nr.', ' O.G.', ' O.U.G.', ' p.', ' P.N.A.', ' par.', ' pct.', ' R.A.A.P.P.S.', ' subl. ns.', ' S.C.', ' S.A.', ' S.P.P.', ' S.R.I.', ' S.R.L.', 'U.N.B.R.', ' urm.', ' str.', ' sec.', ' pag.', ' a.c.', ' dv.', ' dvs.', ' prof.', ' conf.', ' dr.', ' drd.', ' mrd.', ' s.a.m.d'
);
How about this approach:
You extract all the matches with preg_match_all first into an array with numerical indexes, $substitutions.
Then you replace them with a unique marker, using the fifth parameter of preg_replace, $count, whose value points into the $substitutions array.
A rough code sketch:
$count = 0;
$substitutions = array();
foreach ($patterns as $pattern) {
$matches = array();
preg_match_all($pattern, $subject, $matches);
$subject = preg_replace($pattern, '__'.$count.'__', $subject, -1, $count);
foreach ($matches[???] as $match) {
$substitutions[] = $match;
}
}
I'm not sure if I messed up the syntax for passing $count by reference (e.g. &$ in the documentation).
I think the crux of this approach is to extract the right value from the $matches array. There are some options, how the matches are extracted. Maybe another approach could be not to use $count from preg_replace but from the according sub-array of $matches
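A minimal runnable variant of the marker idea is sketched here. Note that it swaps in preg_replace_callback rather than preg_replace's $count, so that every match gets its own index; the sample text and the two patterns are made up for illustration.

```php
<?php
// Illustrative input and patterns; neither comes from the question.
$subject  = 'See art. 5 and visit example.com today.';
$patterns = array('~\bart\.~', '~\b[\w-]+\.com\b~');

$substitutions = array();
foreach ($patterns as $pattern) {
    $subject = preg_replace_callback($pattern, function ($m) use (&$substitutions) {
        $substitutions[] = $m[0];                          // remember the original text
        return '__' . (count($substitutions) - 1) . '__';  // marker carries its index
    }, $subject);
}

echo $subject, "\n"; // See __0__ 5 and visit __1__ today.

// Restore the originals from the markers.
$restored = preg_replace_callback('~__(\d+)__~', function ($m) use ($substitutions) {
    return $substitutions[(int) $m[1]];
}, $subject);

echo $restored, "\n"; // See art. 5 and visit example.com today.
```

The marker format __N__ is an arbitrary choice; it only has to be something that cannot occur in the article and that later patterns will not match.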
The _remove_abbrev function doesn't seem to work very well. It removes words like "art" at the end of sentences but doesn't remove abbreviations like "C.A.S." (because it has already removed "c."). It also has at least one typo ($arbv) and doesn't define $not before concatenating to it.
Nevertheless, how about instead of removing the abbreviations, URLs, and so on, you replace them with space characters? That way, when you split the text into sentences, they would still have the same length as in the original text so you could store the position the sentences start and end at. If necessary, you could convert multiple spaces to a single space at this point but you'd still know where they came from in the original text.
You just need a callback function to achieve this:
$f = function($m){ return str_repeat(" ", strlen($m[0])); };
$subject = preg_replace_callback('~(?<=\.|^)(?![^\.]{60,})[^\.&]*\&[^\.]*\.?~u', $f, strtolower($subject));
$subject = preg_replace_callback('~\b\s?[\dA-za-z\-\.]+('.$domains.')~u', $f, $subject);
$subject = preg_replace_callback('~\s*\(.*?\)\s*~u', $f, $subject);
$subject = preg_replace_callback('~\b('.$not.')~u', $f, $subject);
$subject = preg_replace_callback('~(?<=[^a-z])[A-Za-z]{1,5}+\.[\s\,]*(?=[a-z]|[0-9])~u', $f, $subject);
$subject = preg_replace_callback('~(?<=[\s\,\.\:])([A-Za-z]*(\.)){2,}+(.)(?=.*)~u', $f, $subject);
$subject = preg_replace_callback('~(\d)+\.(\d)*(\s)~u', $f, $subject);
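To show how the preserved length pays off, here is a minimal sketch that splits the masked text into sentences, records each byte offset, and cuts the matching slice out of the original. The sample article is made up, and only one of the masking patterns is applied to keep it short.

```php
<?php
// Made-up sample article.
$original = 'The law (art. 5) applies. It was passed in 2010.';

// Same callback as above: replace each match with spaces of equal byte length.
$f = function ($m) { return str_repeat(' ', strlen($m[0])); };
$masked = preg_replace_callback('~\s*\(.*?\)\s*~u', $f, strtolower($original));

// Split on "." and keep the byte offset of every piece.
$parts = preg_split('~\.~', $masked, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_OFFSET_CAPTURE);

$sentences = array();
foreach ($parts as $part) {
    list($text, $offset) = $part;
    $sentences[] = array(
        // Collapsed version, ready for rating.
        'rated'    => trim(preg_replace('~\s+~', ' ', $text)),
        // Same byte range taken from the untouched article.
        'original' => trim(substr($original, $offset, strlen($text))),
    );
}

print_r($sentences);
```

Whichever entry gets the highest rating, its 'original' value is the sentence as it appeared in the article, so no fuzzy matching is needed afterwards.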
I want my output to look like this when I search for a keyword such as
"programming":
php <b>programming</b> language
How can I do this in PHP/MySQL?
Any ideas?
Just perform a str_replace on the returned text.
$search = 'programming';
// $dbContent = the response from the database
$dbContent = str_replace( $search , '<b>'.$search.'</b>' , $dbContent );
echo $dbContent;
Any instance of "programming", even if as part of a larger word, will be wrapped in <b> tags.
For instances where more than one word is used:
$search = 'programming something another';
// $dbContent = the response from the database
$search = explode( ' ' , $search );
function wrapTag($inVal){
return '<b>'.$inVal.'</b>';
}
$replace = array_map( 'wrapTag' , $search );
$dbContent = str_replace( $search , $replace , $dbContent );
echo $dbContent;
This will split the $search into an array at the spaces, and then wrap each match in the <b> tags.
You could use <b> or <strong> tags (see What's the difference between <b> and <strong>, <i> and <em>? for a discussion about them).
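str_replace matches case-sensitively and also inside larger words. If that is not wanted, a minimal sketch along these lines may help. It assumes whole-word, case-insensitive matching is the goal; the helper name highlightTerms is hypothetical.

```php
<?php
// Hypothetical helper: wraps whole-word, case-insensitive matches in <b>.
function highlightTerms($text, $search)
{
    $words = preg_split('~\s+~', trim($search), -1, PREG_SPLIT_NO_EMPTY);
    if (!$words) {
        return $text;
    }
    // Escape each word so user input cannot break the pattern.
    $quoted = array_map(function ($w) { return preg_quote($w, '~'); }, $words);

    // $0 keeps the casing found in the text, not the casing the user typed.
    return preg_replace('~\b(?:' . implode('|', $quoted) . ')\b~i', '<b>$0</b>', $text);
}

echo highlightTerms('PHP Programming language for programmers', 'programming php'), "\n";
// <b>PHP</b> <b>Programming</b> language for programmers
```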
$search = isset($_GET['q']) ? $_GET['q'] : '';
$trimmed = trim($search);
function highlight($req_field, $trimmed) //$req_field is the field of your table
{
preg_match_all('~\w+~', $trimmed, $m);
if(empty($m[0]))
return $req_field;
$re = '~\\b(' . implode('|', $m[0]) . ')\\b~';
return preg_replace($re, '<b>$0</b>', $req_field);
}
print highlight($req_field, $trimmed);
In this way, you can make the searched keywords bold. It's quite easy and works well.
The response is actually a bit more complicated than that. In the common search results use case, there are other factors to consider:
you should take into account uppercase and lowercase (Programming, PROGRAMMING, programming etc);
if your content string is very long, you wouldn't want to return the whole text, but just the searched query and a few words before and after it, for context;
This guy figured it out:
//$h = text
//$n = keywords to find separated by space
//$w = words near keywords to keep
function truncatePreserveWords ($h,$n,$w=5,$tag='b') {
$n = explode(" ",trim(strip_tags($n))); //needles words
$b = explode(" ",trim(strip_tags($h))); //haystack words
$c = array(); //array of words to keep/remove
for ($j=0;$j<count($b);$j++) $c[$j]=false;
for ($i=0;$i<count($b);$i++)
for ($k=0;$k<count($n);$k++)
if (stristr($b[$i],$n[$k])) {
$b[$i]=preg_replace("/".preg_quote($n[$k], "/")."/i","<$tag>\\0</$tag>",$b[$i]);
for ( $j= max( $i-$w , 0 ) ;$j<min( $i+$w, count($b)); $j++) $c[$j]=true;
}
$o = ""; // reassembly words to keep
for ($j=0;$j<count($b);$j++) if ($c[$j]) $o.=" ".$b[$j]; else $o.=".";
return preg_replace("/\.{3,}/i","...",$o);
}
Works like a charm!
I've asked the same question here before, but now I need to highlight the keyword in one way and the other parts of the word in another. An example will be helpful, I think:
$str = "i like programming very much";
$key = "gram";
I need to get something like pro gram ming:
$str = "i like <span class='first'>pro</span><span class='second'>gram</span><span class='first'>ming</span>";
why preg_replace("/([^\s]*?)(".preg_quote($key).")([^\s]*)/iu","<span class="first">$0</span><span class="second">$1</span><span class="first">$2</span>",$str);
doesn't work?
Thanks
There are some errors: the first group is $1, not $0, and so on. You also embedded unescaped double quotes (") inside a double-quoted string.
So, instead of:
preg_replace("/([^\s]*?)(".preg_quote($key).")([^\s]*)/iu","<span class="first">$0</span><span class="second">$1</span><span class="first">$2</span>",$str);
You have to do:
preg_replace("/([^\s]*?)(".preg_quote($key).")([^\s]*)/iu","<span class='first'>$1</span><span class='second'>$2</span><span class='first'>$3</span>",$str);
You're in PHP, so consider something like:
$words = explode(' ', "quick brown fox");
$out = array_map('process', $words);
return join(' ', $out);
and define
function process($word) {
$p = strpos($word, 'gram');
if ($p === FALSE) return $word;
/// take apart pieces with substr and friends, reformat in HTML, and return them
$left = substr($word, 0, $p);
$match = substr($word, $p, strlen('gram'));
$right = substr($word, $p+strlen('gram'));
return some_function_of($left, $match, $right);
}
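Filled in, the idea could look like this minimal sketch. The helper name highlightInsideWords is hypothetical, the span classes follow the question, and the substr arithmetic assumes single-byte text.

```php
<?php
// Hypothetical helper: splits each word around $key and wraps the pieces in spans.
// Uses byte-based string functions, so it assumes single-byte text.
function highlightInsideWords($str, $key)
{
    $words = explode(' ', $str);
    $out = array_map(function ($word) use ($key) {
        $p = stripos($word, $key); // case-insensitive search
        if ($p === false) {
            return $word;          // no match: leave the word as it is
        }
        $left  = substr($word, 0, $p);
        $match = substr($word, $p, strlen($key));
        $right = substr($word, $p + strlen($key));

        return "<span class='first'>" . htmlspecialchars($left) . "</span>"
             . "<span class='second'>" . htmlspecialchars($match) . "</span>"
             . "<span class='first'>" . htmlspecialchars($right) . "</span>";
    }, $words);

    return implode(' ', $out);
}

echo highlightInsideWords('i like programming very much', 'gram'), "\n";
```

Only the first occurrence of the key inside each word is highlighted; words without the key are passed through unchanged.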
A little more effort, but it does what you want.