I am experimenting with finding similar text between a string and an online article. I am using similar_text() in PHP, which reports the percentage by which two strings match, but I am trying to figure out how to echo the text that similar_text() actually found to be similar. Is there any way to do this?
Here is a sample of what I am trying to do:
$similarText = similar_text($articleContent, $wordArr[$wordNum][1], $p);
//if(strpos($articleContent, $wordArr[$wordNum][1] ) !== false)
if($p > .25) // note: $p is a percentage from 0 to 100, so this means "more than 0.25%"
{
$test = (strlen($wordArr[$wordNum][1]) - similar_text($articleContent, $wordArr[$wordNum][1])); // number of characters that did not match
echo $test."<br/>";
echo "Percent: $p%"."<br/>";
echo "MATCH NAME<br/>";
print_r($wordArr[$wordNum]);
echo "<br/><br/>";
}
similar_text() gives me the percentage that matched, but I want to see how it is working and show which text matched which. Something like:
echo $matcher." matches ".$matchee
Consider adding an example to get a better answer.
<?php
similar_text($string1, $string2, $p);
echo "Percent: $p%";
?>
If you need to see how many characters did not match:
<?=(strlen($string2) - similar_text($string1,$string2));?>
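As far as I know, similar_text() has no option to return the text it matched. A rough sketch of how you could list the matched pieces yourself is below. It assumes similar_text() works by taking the first longest common substring and then recursing on the text to its left and right; similar_text_parts() is a hypothetical helper name, not a built-in, and it is only meant for short strings because it is slow.

```php
<?php
// Hypothetical helper: returns the common substrings that a
// "longest common substring, then recurse left and right" strategy finds.
function similar_text_parts(string $a, string $b): array
{
    $lenA = strlen($a);
    $lenB = strlen($b);
    $max = 0;
    $posA = 0;
    $posB = 0;

    // Find the first longest run of identical characters.
    for ($i = 0; $i < $lenA; $i++) {
        for ($j = 0; $j < $lenB; $j++) {
            $k = 0;
            while ($i + $k < $lenA && $j + $k < $lenB && $a[$i + $k] === $b[$j + $k]) {
                $k++;
            }
            if ($k > $max) {
                $max = $k;
                $posA = $i;
                $posB = $j;
            }
        }
    }

    if ($max === 0) {
        return []; // nothing in common
    }

    // Recurse on what is left of the match, then on what is right of it.
    $left = similar_text_parts((string) substr($a, 0, $posA), (string) substr($b, 0, $posB));
    $right = similar_text_parts((string) substr($a, $posA + $max), (string) substr($b, $posB + $max));

    return array_merge($left, [substr($a, $posA, $max)], $right);
}

// Usage: print each matched piece.
foreach (similar_text_parts('World', 'Word') as $part) {
    echo $part . " matches\n";
}
```

For 'World' and 'Word' the pieces are 'Wor' and 'd', whose lengths add up to the 4 that similar_text() reports for the same pair.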
I need to find each keyword in a long string, highlight it in bold, and keep only five words before and after it. The screenshot shows the PHP script I created.
Link to the image that shows what I have and what I need:
http://dawid969.webd.pl/cut.jpg
The code I have (PHP) highlights each keyword, but I can't cut the string down to five words before and five words after each highlighted word. There is an awkward case: when highlighted words are close to each other, the string must not be cut between them. It should only be cut when the gap between two highlighted words is greater than 10 words.
Does anyone have an idea how I can do that last part, cutting the string?
<?php
$sentence = "But I must explain to you how all this mistaken idea of denouncing pleasure and praising pain was born and I will give you a complete account of the system, and expound the actual teachings of the great explorer of the truth, the master-builder of human happiness. No one rejects, dislikes, or avoids pleasure itself, because it is pleasure, but because those who do not know how to pursue pleasure rationally encounter consequences that are extremely painful. Nor again";
$arrayOfWords = array('explain', 'complete', 'pleasure');
echo "<b>Sentence:</b> ".$sentence;
echo '<br><br>';
echo "<b>Words to find (in array):</b> ";
print_r($arrayOfWords);
echo "<br><br>";
$look = explode(' ',$sentence);
foreach($look as $find){
for($i=0;$i<count($arrayOfWords);$i++){
$keyword = $arrayOfWords[$i];
if(stripos($find, $keyword) !== false) {
if(!isset($highlight)){
$highlight[] = $find;
} else {
if(!in_array($find,$highlight)){
$highlight[] = $find;
}
}
}
}
}
if(isset($highlight)){
foreach($highlight as $key => $replace){
$sentence = str_replace($replace,'<b>'.$replace.'</b>',$sentence);
}
}
echo "<b>Sentence formatted I have:</b> ".$sentence;
echo '<br><br>';
echo "<b>Sentence formated I need: </b> But I must <b>explain</b> to you how all this mistaken idea of denouncing <b>pleasure</b> and praising pain was born and I will give you a <b>complete</b> account of the system, and expound... ...one rejects, dislikes, or avoids <b>pleasure</b> itself, because it is pleasure,... ...not know how to pursue <b>pleasure</b> rationally encounter consequences that are...";
?>
My regex pattern may take a little "staring at", but basically it matches up to 5 "words" (using the term loosely) on either side of each found keyword.
The sentence is first divided into an array of substrings that either DO or DO NOT contain keywords. Call var_export($chunks); to see what I mean.
Then each element is conditionally processed. If the element:
contains a keyword, the emboldening action is taken.
is exactly one space, the element is left alone.
does not contain a keyword and is the first or last element, it becomes "...".
does not contain a keyword and occurs mid-sentence, it becomes "... ...".
Code: (Demo)
$sentence = "But I must explain to you how all this mistaken idea of denouncing pleasure and praising pain was born and I will give you a complete account of the system, and expound the actual teachings of the great explorer of the truth, the master-builder of human happiness. No one rejects, dislikes, or avoids pleasure itself, because it is pleasure, but because those who do not know how to pursue pleasure rationally encounter consequences that are extremely painful. Nor again";
$arrayOfWords=['explain','complete','pleasure'];
$pattern_core=implode('|',array_map(function($v){return preg_quote($v,'/');},$arrayOfWords)); // escape regex-impacting characters and pipe together
// split the sentence on keywords with upto 5 "words" padding, retain the delimiters
$chunks=preg_split("/((?:\S+\s+){0,5}\S*(?:".$pattern_core.")\S*(?:\s+\S+){0,5})/iu",$sentence,-1,PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_NO_EMPTY);
$last_chunk=sizeof($chunks)-1;
foreach($chunks as $i=>&$chunk){ // make $chunk modifiable with & symbol
$chunk=preg_replace("/{$pattern_core}/iu",'<b>$0</b>',$chunk,-1,$count);
if(!$count && $chunk!=' '){ // if contains no keyword and not a single space...
if($i==0 || $i==$last_chunk){ // single set of dots at beginning and end of sentence
$chunk='...';
}else{
$chunk='... ...'; // double set of dots in the middle of sentence
}
}
}
unset($chunk); // break the reference left over from the foreach
echo implode($chunks);
Output:
But I must <b>explain</b> to you how all this mistaken idea of denouncing <b>pleasure</b> and praising pain was born... ...I will give you a <b>complete</b> account of the system, and... ...one rejects, dislikes, or avoids <b>pleasure</b> itself, because it is <b>pleasure</b>,... ...not know how to pursue <b>pleasure</b> rationally encounter consequences that are...
<?php
$sentence = "But I must explain to you how all this mistaken idea of denouncing pleasure and praising pain was born and I will give you a complete account of the system, and expound the actual teachings of the great explorer of the truth, the master-builder of human happiness. No one rejects, dislikes, or avoids pleasure itself, because it is pleasure, but because those who do not know how to pursue pleasure rationally encounter consequences that are extremely painful. Nor again";
$arrayOfWords = array('explain', 'complete', 'pleasure');
echo "<b>Sentence:</b> ".$sentence;
echo '<br><br>';
echo "<b>Words to find (in array):</b> ";
print_r($arrayOfWords);
echo "<br><br>";
//replace this part
$look = explode(' ',$sentence);
$last_checked =0;
$i=0;
$highlight=false;
for(;$i<count($look);$i++){
foreach ($arrayOfWords as $keyword){
$find=$look[$i];
if(stripos($find, $keyword) !== false) {
$highlight =true;
if($i-$last_checked>10){
$j = ($last_checked ==0)?0: $last_checked+ 5;
$dots=true;
for(;$j<$i -5;$j++) {
if($dots){
$look[$j]= "...";
$dots=false;
}else
$look[$j]= "";
}
}
$last_checked =$i;
}
}
}
$sentence=implode(" ",$look);
if($highlight){ // $highlight is always set above, so test its value instead of isset()
foreach($arrayOfWords as $key => $replace){
$sentence = str_replace($replace,'<b>'.$replace.'</b>',$sentence);
}
}
echo "<b>Sentence formatted I have:</b> ".$sentence;
echo '<br><br>';
echo "<b>Sentence formated I need: </b> But I must <b>explain</b> to you how all this mistaken idea of denouncing <b>pleasure</b> and praising pain was born and I will give you a <b>complete</b> account of the system, and expound... ...one rejects, dislikes, or avoids <b>pleasure</b> itself, because it is pleasure,... ...not know how to pursue <b>pleasure</b> rationally encounter consequences that are...";
?>
I hope this gives you the idea; I didn't focus on the exact numbers.
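If you do want the exact numbers, here is a minimal sketch of the same word-index idea: mark every word within $pad words of a keyword, then print the marked runs and put dots where words were dropped. The function name excerpt_around_keywords() and the $pad parameter are made up for this example, and the dot placement is my reading of the "needed" output in the question.

```php
<?php
// Hypothetical helper: keep $pad words on each side of every keyword hit.
function excerpt_around_keywords(string $sentence, array $keywords, int $pad = 5): string
{
    $words = preg_split('/\s+/', trim($sentence));
    $last = count($words) - 1;
    $keep = []; // word index => true for every word that survives

    foreach ($words as $i => $word) {
        foreach ($keywords as $keyword) {
            if (stripos($word, $keyword) !== false) {
                // Mark the window around the hit; overlapping windows merge by themselves.
                for ($j = max(0, $i - $pad); $j <= min($last, $i + $pad); $j++) {
                    $keep[$j] = true;
                }
                // Embolden the keyword inside the word (case-insensitive).
                $words[$i] = preg_replace('/' . preg_quote($keyword, '/') . '/i', '<b>$0</b>', $word);
                break;
            }
        }
    }

    // Collect consecutive kept words into runs.
    $runs = [];
    $current = [];
    foreach ($words as $i => $word) {
        if (isset($keep[$i])) {
            $current[] = $word;
        } elseif ($current) {
            $runs[] = implode(' ', $current);
            $current = [];
        }
    }
    if ($current) {
        $runs[] = implode(' ', $current);
    }
    if (!$runs) {
        return ''; // no keyword found
    }

    $out = implode('... ...', $runs);
    if (!isset($keep[0])) {
        $out = '...' . $out; // words were cut at the start
    }
    if (!isset($keep[$last])) {
        $out .= '...'; // words were cut at the end
    }
    return $out;
}

echo excerpt_around_keywords('key a b c d e f key', ['key'], 1);
```

Because windows are stored per word index, two keywords that sit close together end up in one run and nothing is cut between them, which is the "more than 10 words apart" rule from the question when $pad is 5.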
[PHP] I have a variable holding a very large string (a page's source code). I want to echo only the strings I am interested in (dozens of them, which I need to extract for a project). They sit inside the quotation marks of the href attribute of the a tag,
but I only want to capture the values that start with the letter n (news):
[<a href="/news7044449/exclusive_news_sunday_"]
<a href="/n[ews7044449/exclusive_news_sunday_]"
That is, I think the match has to be anchored on: [a href="/n]
How can I make the echo discard all the other text in the variable and show only those values?
Note that there are other href attributes whose values start with other letters, such as the letter 'p': href="/profiles... (these do not interest me).
$string = '</div><span class="news-hd-mark">HD</span></div><p>exclusive_news_sunday_</p><p class="metadata"><span class="bg">Czech AV<span class="mobile-hide"> - 5.4M Views</span>
- <span class="duration">7 min</span></span></p></div><script>xv.thumbs.preparenews(7044449);</script>
<div id="news_31720715" class="thumb-block "><div class="thumb-inside"><div class="thumb"><a href="/news31720715/my_sister_running_every_single_morning"><img src="https://static-hw.xnewss.com/img/lightbox/lightbox-blank.gif"';
I imagine something like this:
$removes_everything_except_values_from_the_href_tag_starting_with_the_letter_n = ('/something regex expresion I think /' or preg_match, substring?);
echo $string = str_replace($removes_everything_except_values_from_the_href_tag_starting_with_the_letter_n,'',$string);
expected output: /news7044449/exclusive_news_sunday_
NOTE: the source does not have to be a variable; the values could just as well be extracted from a .txt file.
thanks.
I believe this will help.
<?php
$source = file_get_contents("code.html");
preg_match_all("/<a href=\"(\/n(?:.+?))\"[^>]*>/", $source, $results);
var_export( end($results) );
Step by Step Regex:
Regex Demo
Regex Debugger
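To get just the links out of the $results array from Valdeir's answer, loop over the captured group ($results[1]) rather than over $results itself, which is an array of arrays:
foreach ($results[1] as $r) {
echo $r . "\n";
// alt: to display them with an HTML break tag after each one
// echo $r."<br>\n";
}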
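If the page source is messy, an HTML parser can be more forgiving than a regex. This is only a sketch of that alternative, not what the answers above use: it assumes the dom extension is available, and hrefs_starting_with() is a name I made up for the example.

```php
<?php
// Hypothetical helper: collect href values of <a> tags that start with $prefix.
function hrefs_starting_with(string $html, string $prefix): array
{
    // Real-world markup is rarely valid; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        // Keep only values beginning with the wanted prefix, e.g. "/n".
        if (strncmp($href, $prefix, strlen($prefix)) === 0) {
            $hrefs[] = $href;
        }
    }
    return $hrefs;
}

$html = '<div><a href="/news7044449/exclusive_news_sunday_">x</a><a href="/profiles/abc">y</a></div>';
foreach (hrefs_starting_with($html, '/n') as $href) {
    echo $href . "\n";
}
```

To read from a file instead of a variable, pass file_get_contents() of that file as the first argument.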
I'm trying to use a regex to find and replace all URLs in a forum system. This works but it also selects anything that is within bbcode. This shouldn't be happening.
My code is as follows:
<?php
function make_links_clickable($text){
return preg_replace('!(([^=](f|ht)tp(s)?://)[-a-zA-Zа-яА-Я()0-9#:%_+.~#?&;//=]+)!i', '$1', $text);
}
//$text = "https://www.mcgamerzone.com<br>http://www.mcgamerzone.com/help/support<br>Just text<br>http://www.google.com/<br><b>More text</b>";
$text = "#Theareak We know this and [b][url=https://www.mcgamerzone.com/news/67/False-positive-proxy-bans-and-bot-attacks]here[/url] [/b]is an explanation, we are trying to fix this asap! https://www.mcgamerzone.com/news/67/False-positive-proxy-bans-and-bot-attacks aaa";
echo "<b>Unparsed text:</b><br>";
echo $text;
echo "<br><br>";
echo "<b>Parsed text:</b><br>";
echo make_links_clickable($text);
?>
All URLs that occur in bbcode directly follow a = character, so I don't want anything that is preceded by = to be selected.
I basically have that working, but my pattern also selects one extra character in front of the URL that should be selected.
I'm not very familiar with regex. The final output of my code is this:
<b>Unparsed text:</b><br>
#Theareak We know this and [b][url=https://www.mcgamerzone.com/news/67/False-positive-proxy-bans-and-bot-attacks]here[/url] [/b]is an explanation, we are trying to fix this asap! https://www.mcgamerzone.com/news/67/False-positive-proxy-bans-and-bot-attacks aaa<br>
<br>
<b>Parsed text:</b><br>
#Theareak We know this and [b][url=https://www.mcgamerzone.com/news/67/False-positive-proxy-bans-and-bot-attacks]here[/url] [/b]is an explanation, we are trying to fix this asap! https://www.mcgamerzone.com/news/67/False-positive-proxy-bans-and-bot-attacks aaa
You can match and skip [url=...] like this:
\[url=[^\]]*](*SKIP)(?!)|(((f|ht)tps?://)[-a-zA-Zа-яёЁА-Я()0-9#:%_+.\~#?&;/=]+)
See regex demo
That way, you will only match the URLs outside the [url=...] tag.
IDEONE demo:
function make_links_clickable($text){
return preg_replace('~\[url=[^\]]*](*SKIP)(?!)|(((f|ht)tps?://)[-a-zA-Zа-яёЁА-Я()0-9#:%_+.\~#?&;/=]+)~iu', '$1', $text);
}
$text = "#Theareak We know this and [b][url=https://www.mcgamerzone.com/news/67/False-positive-proxy-bans-and-bot-attacks]here[/url] [/b]is an explanation, we are trying to fix this asap! https://www.mcgamerzone.com/news/67/False-positive-proxy-bans-and-bot-attacks aaa";
echo "<b>Parsed text:</b><br>";
echo make_links_clickable($text);
You can use a negative lookbehind (?<!=) instead of your negated class. It asserts that the URL about to be matched is not preceded by a = character, and unlike [^=] it does not consume a character, so nothing extra ends up in the match.
Example
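A small sketch of the lookbehind idea follows. The anchor markup in the replacement and the simplified character class are my assumptions (the code in the question only shows '$1' as the replacement), and link_urls_outside_bbcode() is a made-up name.

```php
<?php
// Hypothetical helper: link bare URLs, but skip any URL that directly follows "=".
function link_urls_outside_bbcode(string $text): string
{
    // (?<!=)  the URL must not be preceded by "=" (as in [url=...])
    // \b      start at a word boundary
    // the URL runs until whitespace, a square bracket, an angle bracket or a quote
    $pattern = '~(?<!=)\b((?:f|ht)tps?://[^\s\[\]<>"]+)~i';

    // Assumed replacement: wrap the URL in an anchor tag.
    return preg_replace($pattern, '<a href="$1">$1</a>', $text);
}

$text = '[url=https://a.example/x]here[/url] see https://b.example/y now';
echo link_urls_outside_bbcode($text);
```

Only the second URL is wrapped; the one inside [url=...] is left for the bbcode parser.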
Sorry to add another "Regex explanation" question to the internet, but I must know the reason for this. I have run this regex through RegexBuddy and Regex101.com with no luck.
I came across the following regex ("%4d%[^\\n]") while debugging a time parsing function. Every now and then I would receive an 'invalid date' error but only during the months of January and June. I mocked up some code to recreate exactly what was happening but I can't figure out why removing the one slash fixes it.
<?php
$format = '%Y/%b/%d';
$random_date_strings = array(
'2015/Jan/03',
'1985/Feb/13',
'2001/Mar/25',
'1948/Apr/02',
'1948/May/19',
'2020/Jun/22',
'1867/Jul/09',
'1901/Aug/11',
'1945/Sep/21',
'2000/Oct/31',
'2009/Nov/24',
'2015/Dec/02'
);
$year = null;
$rest_of_string = null;
echo 'Bad Regex:';
echo '<br/><br/>';
foreach ($random_date_strings as $date_string) {
sscanf($date_string, "%4d%[^\\n]", $year, $rest_of_string);
print_data($date_string, $year, $rest_of_string);
}
echo 'Good Regex:';
echo '<br/><br/>';
foreach ($random_date_strings as $date_string) {
sscanf($date_string, "%4d%[^\n]", $year, $rest_of_string);
print_data($date_string, $year, $rest_of_string);
}
function print_data($d, $y, $r) {
echo 'Date string: ' . $d;
echo '<br/>';
echo 'Year: ' . $y;
echo '<br/>';
echo 'Rest of string: ' . $r;
echo '<br/>';
}
?>
Feel free to run this locally but the only two outputs I'm concerned about are the months of June and January. "%4d%[^\\n]" will truncate $rest_of_string to /Ju and /Ja while "%4d%[^\n]" displays the rest of the string as expected (/Jan/03 & /Jun/22).
Here's my interpretation of the faulty regex:
%4d% - Get four digits.
[^\\n] - Look for those digits in between the beginning of the string and a new line.
Can anyone please correct my explanation and/or tell me why removing the slash gives me the result I expect?
I don't care for the HOW...I need the WHY.
As @LucasTrzesniewski pointed out, that is sscanf() syntax; it has nothing to do with regex. The format is explained on the sprintf() manual page.
In your pattern "%4d%[^\\n]", the two \\ translate to a single backslash character. So the correct interpretation of the "faulty" pattern is:
%4d - Get four digits.
%[^\\n] - Look for all characters that are not a backslash or the letter "n"
That's why it matches everything up until the "n" in "Jan" and "Jun".
The correct pattern is "%4d%[^\n]", where the \n translates to a newline character, and its interpretation is:
%4d - Get four digits.
%[^\n] - Look for all characters that are not a new line
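To see the difference for yourself, here is a short check. The variable names are only for illustration; the values in the comments are the ones reported in the question.

```php
<?php
// In a double-quoted PHP string, "\\n" is two characters (a backslash and the
// letter n), while "\n" is a single newline character.
$twoChars = strlen("\\n"); // 2
$oneChar = strlen("\n");   // 1

$date = '2015/Jan/03';

// Excluded set is { backslash, "n" }: reading stops at the "n" in "Jan".
sscanf($date, "%4d%[^\\n]", $yearBad, $restBad);

// Excluded set is { newline }: reading continues to the end of the string.
sscanf($date, "%4d%[^\n]", $yearGood, $restGood);

echo $restBad . "\n";  // /Ja
echo $restGood . "\n"; // /Jan/03
```

Months without an "n" in their abbreviation never hit the excluded letter, which is why only January and June broke.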
I'm using the code below to highlight one word in text loaded with file_get_contents() and to jump to it via an anchor.
$file='
IAR6=1002
SHF6=1
REF6=0002
TY7=2
DATE7=20130820182357
STAT_N7=1002
SEQ7=0002110000001
STA7=000005
TY8=2
DATE8=20130820182429
STAT_N8=1002
SH8=1
OP8=S123
SEQ8=0002120000081
';
$Seq = '0002110000001'; // keep it a string: an unquoted number with a leading 0 is read as octal
$text = preg_replace("/\b($Seq)\b/i", '<span class="highlight"><a name="here">\1</a></span>', $file);
For now this highlights only: 0002110000001
I would like to highlight the whole block that has the same index number.
Example:
looking for 0002110000001,
highlight only the part of the text where the index number is 7:
TY7=2
DATE7=20130820182357
STAT_N7=1002
SEQ7=0002110000001
STA7=000005
Any help will be appreciated.
EDIT:
I will try to be more specific.
The file contains many blocks, and each block always starts with TYx (x is an automatically assigned number).
I have the SEQ number to search for, in this example 0002110000001.
The preg_replace() call above finds 0002110000001 and highlights it.
What I need is to highlight everything between TY7 and TY8 instead of only 0002110000001.
I hope this is clear enough despite my bad English.
Thanks.
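You can use explode() and stripos() in PHP to print the lines that contain the index number. Note that this is a plain substring test, so it would also match any other line whose value happens to contain a 7: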
<?php
$file='
IAR6=1002
SHF6=1
REF6=0002
TY7=2
DATE7=20130820182357
STAT_N7=1002
SEQ7=0002110000001
STA7=000005
TY8=2
DATE8=20130820182429
STAT_N8=1002
SH8=1
OP8=S123
SEQ8=0002120000081
';
//$Seq = "0002110000001";
$Seq = "7";
$new_arr=explode(PHP_EOL,$file);
foreach($new_arr as $k=>$v)
{
if(stripos($v,$Seq)!==false)
{
echo "$v\n";
}
}
OUTPUT :
TY7=2
DATE7=20130820182357
STAT_N7=1002
SEQ7=0002110000001
STA7=000005
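If you only know the SEQ value, a possible sketch is to look up the index number first and then wrap the whole block from TYn up to the next TY line. It assumes every block starts with a TYn= line, as described in the question; highlight_block() is a made-up name and the markup is copied from the question's own preg_replace().

```php
<?php
// Hypothetical helper: highlight the TYn block that contains SEQn=$seq.
function highlight_block(string $file, string $seq): string
{
    // Step 1: find the index number n from the line "SEQn=<seq>".
    if (!preg_match('/^SEQ(\d+)=' . preg_quote($seq, '/') . '\s*$/m', $file, $m)) {
        return $file; // SEQ value not found, leave the text unchanged
    }
    $n = $m[1];

    // Step 2: match from "TYn=" up to (not including) the next "TY<number>="
    // line, or to the end of the text if it is the last block.
    $pattern = '/^TY' . $n . '=.*?(?=^TY\d+=|\z)/ms';

    return preg_replace($pattern, '<span class="highlight"><a name="here">$0</a></span>', $file, 1);
}

$file = "TY7=2\nSEQ7=0002110000001\nSTA7=000005\nTY8=2\nSEQ8=0002120000081\n";
echo highlight_block($file, '0002110000001');
```

Pass the SEQ value as a string so the leading zeros are kept.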