I am using regular expressions with preg_replace() to find and replace a sentence in a piece of text. The $search_string contains plain text, HTML tags, and &nbsp; entities. The problem is that the &nbsp; entities are only sometimes converted to white space at run time, which makes it difficult to find and replace with str_replace(). So I'm trying to build a pattern that is equivalent to the search string and matches anything like it, whether or not it contains the &nbsp; entities.
For example:
$search_string = 'Two years in, the company has expanded to 35 cities, five of which are outside the U.S. Plus, in April, ClassPass acquired its main competitor, Fitmob.';
$pattern = $search_string (BUT IGNORE THE &nbsp; entities in the subject)
$subject = "text text text text text". $search_string . "text text text text text";
Following the question "A regular expression to exclude a word/string", I've tried:
$pattern = '`^/(?!\&nbsp;)'.$search_string.'`';
$output = preg_replace($pattern, $replacement_string,$subject);
The end result should be that if the $subject contains a string that is like my $search_string but without the &nbsp; entities, it still matches and is replaced with $replacement_string.
EDIT:
The actual values:
$subject = file_get_contents("http://venturebeat.com/2015/11/10/sources-classpass-raises-30-million-from-google-ventures-and-others/");
$search_string = "Two years in, the company has expanded to 35 cities, five of which are outside the U.S. Plus, in April, ClassPass acquired its main competitor, Fitmob.";
$replacement_string = "<span class='smth'>Two years in, the company has expanded to 35 cities, five of which are outside the U.S. Plus, in April, ClassPass acquired its main competitor, Fitmob.</span>";
Not a very efficient way of doing it, but it should work out for you. Note that the pattern needs delimiters, and the s modifier lets . match across line breaks:
$output = preg_replace('/Two.*?years.*?in.*?the.*?company.*?has.*?expanded.*?to.*?35.*?cities.*?five.*?of.*?which.*?are.*?outside.*?the.*?U\.S\..*?Plus.*?in.*?April.*?ClassPass.*?acquired.*?its.*?main.*?competitor.*?Fitmob\./s', '<span class=\'smth\'>$0</span>', $subject);
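Rather than typing the pattern by hand, it could be generated from the search string. The following is only a minimal sketch: build_flexible_pattern() is a hypothetical helper name, the shortened $subject is made-up sample data, and it assumes the only difference between search string and subject is whitespace versus &nbsp; entities.

```php
<?php
// Hypothetical helper (not part of the original answer): turns a plain
// search string into a pattern in which every gap between words may be
// any run of whitespace and/or &nbsp; entities.
function build_flexible_pattern(string $search): string
{
    // Split the search string on whitespace and &nbsp; entities.
    $words = preg_split('/(?:\s|&nbsp;)+/', trim($search), -1, PREG_SPLIT_NO_EMPTY);
    // Escape regex metacharacters (and the "/" delimiter) in every word.
    $words = array_map(function ($w) { return preg_quote($w, '/'); }, $words);
    // Allow whitespace or &nbsp; between consecutive words.
    return '/' . implode('(?:\s|&nbsp;)+', $words) . '/';
}

// Made-up sample data for illustration only.
$search_string = 'Two years in, the company has expanded to 35 cities';
$subject = 'text text Two years in,&nbsp;the company has expanded to&nbsp;35 cities text';

$pattern = build_flexible_pattern($search_string);
$output = preg_replace($pattern, '<span class="smth">$0</span>', $subject);
echo $output;
```

Because the words themselves are matched literally, this is stricter than joining everything with .*? and should be less likely to swallow unrelated text between two words.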
Existing code I have reads a txt file, first converts http elements to links, then converts hashtag elements to links, and then prints the result. At the end of every text is a dash followed by a time and date (DOY and YEAR with a leading zero, for a reason). The text echoes on the page as (ex.)
Blah blah blah and blah blah - 3:47:32 310 017
or time/date variations of
14:09:47 23 017
7:38:83 9 017
so there is no fixed number of characters.
$text = file_get_contents("temp.txt");
$link = preg_replace('#(https?://([-\w\.]+)+(:\d+)?(/([-\w/_\.]*(\?\S+)?)?)?)#', '$1', $text);
$hash = preg_replace('/(?<!\S)#([0-9a-zA-Z]+)/m', '#$1', $link);
echo $hash;
As an admitted novice, I have not been able to work out the syntax of the preg_replace() calls above well enough to apply the same approach to the time/date at the end. I have made several attempts, but none has produced results that show I am even going in the right direction.
My thought process is that the positions to look for are, in order:
0?colon00colon00space0??space000
You could use this regular expression:
$link = preg_replace('#\b(\d\d?:\d\d:\d\d [1-9]\d{0,2} 0\d\d)\b#', '$1', $text);
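To see what that pattern captures, here is a small sketch that runs it against the sample endings from the question. The variable names ($samples, $found) are only for illustration, and it uses preg_match() to inspect the capture instead of replacing anything.

```php
<?php
// Pattern from the answer above: h:mm:ss or hh:mm:ss, a space, a day of
// year without leading zero (1-3 digits), a space, a 3-digit year that
// starts with 0.
$pattern = '#\b(\d\d?:\d\d:\d\d [1-9]\d{0,2} 0\d\d)\b#';

// Sample lines taken from the question.
$samples = [
    'Blah blah blah and blah blah - 3:47:32 310 017',
    'Blah blah blah and blah blah - 14:09:47 23 017',
    'Blah blah blah and blah blah - 7:38:83 9 017',
];

$found = [];
foreach ($samples as $line) {
    if (preg_match($pattern, $line, $m)) {
        $found[] = $m[1]; // the captured time/date stamp
    }
}
print_r($found);
```

In the replacement string of preg_replace(), the same capture is available as $1.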
How would you write a regular expression pattern that matches a string even if it is 90% accurate?
For example:
$search_string = "Two years in, the <a href='site.com'>company</a> has expanded to 35 cities, five of which are outside the U.S. "
$subject = "Two years in,the company has expanded to 35 cities, five of which are outside the U.S."
The end result is that the $search_string matches the $subject and returns true even though they are not 100% the same.
You can have some optional parts on the regex pattern. For example:
$search_string = "A tiny little bear";
$regex = "/A ([a-zA-Z]+ )?little bear/";
The ? character makes the group before it optional, and [a-zA-Z]+ matches one or more letters inside it. The trailing space sits inside the group so that "A little bear" also matches.
Thus, using preg_match(), you can validate less strictly than with a 100% exact comparison.
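A quick way to check the idea is to run it against a few candidates. This is only a sketch under my own assumptions: the pattern gets / delimiters and ^...$ anchors, and the space is placed inside the optional group so the word can be absent entirely.

```php
<?php
// Optional single word between "A" and "little bear". The space belongs
// to the optional group, so "A little bear" is accepted as well.
$regex = '/^A (?:[a-zA-Z]+ )?little bear$/';

$candidates = [
    'A tiny little bear',      // optional word present
    'A little bear',           // optional word absent
    'A very tiny little bear', // two extra words: not covered by one group
];

$results = [];
foreach ($candidates as $candidate) {
    $results[$candidate] = preg_match($regex, $candidate) === 1;
}
var_export($results);
```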
In case anyone comes around looking for a way to do it with similar_text():
$search_string = "Two years in, the <a href='site.com'>company</a> has expanded to 35 cities, five of which are outside the U.S. ";
$subject = "Two years in,the company has expanded to 35 cities, five of which are outside the U.S.";
similar_text ($search_string,$subject,$sim);
echo 'text is: ' .round($sim). '% similar';
result:
text is: 85% similar
You can use the result to decide what value counts as a match in your particular circumstances, like so:
similar_text($search_string,$subject,$sim);
if($sim >=85){
echo 'MATCH';
}
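If the differences are mostly markup, it may help to strip the tags before comparing, so that the percentage reflects the visible text only. The sketch below assumes exactly that; normalize_text() is a hypothetical helper, and the 85 threshold is simply carried over from above and would need tuning.

```php
<?php
$search_string = "Two years in, the <a href='site.com'>company</a> has expanded to 35 cities, five of which are outside the U.S. ";
$subject = "Two years in,the company has expanded to 35 cities, five of which are outside the U.S.";

// Hypothetical helper: remove HTML tags, collapse runs of whitespace,
// and trim both ends.
function normalize_text(string $s): string
{
    return trim(preg_replace('/\s+/', ' ', strip_tags($s)));
}

// similar_text() writes the similarity percentage into its third argument.
similar_text(normalize_text($search_string), normalize_text($subject), $percent);

$is_match = $percent >= 85; // threshold is an assumption; tune it for your data
echo $is_match ? 'MATCH' : 'NO MATCH';
```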
Just for grins, I tried this out using Perl.
All the warnings about using regex to parse HTML apply (you should not use this on HTML).
This splits the search string on HTML tags, entities, or whitespace. After that, the parts are joined with .*? under the modifiers (?is).
This is not a true partial-match substring regex, because it requires all the parts to exist. It does, however, overcome the distance or content between them.
Possibly, with a little algorithm work, it could be tweaked so that parts are optional, in the form of clustering.
use strict;
use warnings;
my $search_string = "Two years in, the <a href='site.com'>company</a> has expanded to 35 cities, five of which are outside the U.S. ";
my $subject = "Two years in,the company has expanded to 35 cities, five of which are outside the U.S.";
## Trim leading/trailing whitespace from $search_string
$search_string =~ s/^\s+|\s+$//g;
## Split the $search_string on html tags or entities or whitespaces ..
my @SearchParts = split m~
\s+|
(?i)[&%](?:[a-z]+|(?:\#(?:[0-9]+|x[0-9a
-f]+)));|<(?:script(?:\s+(?:"[\S\s]*?"|'
[\S\s]*?'|[^>]*?)+)?\s*>[\S\s]*?</script
\s*|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:(?:
(?:"[\S\s]*?")|(?:'[\S\s]*?'))|(?:[^>]*?
))+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE
[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:-
-[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTI
TY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>
~x, $search_string;
## Escape the metacharacters from SearchParts
@SearchParts = grep { $_ = quotemeta } @SearchParts;
## Join the SearchParts into a regex
my $rx = '(?si)(?:' . ( join '.*?', @SearchParts ) . ')';
## Try to match SearchParts in the $subject
if ( $subject =~ /$rx/ )
{
print "Match in subject:\n'$&' \n";
}
Output:
Match in subject:
'Two years in,the company has expanded to 35 cities, five of which are outside the U.S.'
edit:
As a side note, each element of @SearchParts could be split once more (on each character) and joined with .*?.
This would get into the realm of a true partial match, though not quite, because each character is still required to match. The order is maintained, but each character would have to be made optional.
Usually, without capture groups, there is no way to tell the percentage of letters actually matched.
If you use Perl, however, it's fairly easy to count inside a regex code construct (?{{..}}), where a counter can be incremented.
I guess at that point it becomes non-portable. Better to use C++.
I have a general text file with some writing in it; the content varies widely. I also have a word list, and I want to count the occurrences in the text file of each word that is on the word list.
For example my word list can be comprised of this:
good
bad
cupid
banana
apple
Then I want to compare each of these individual words with my html/text file which may be like this:
Sometimes I travel to the good places that are good, and never the bad places that are bad. For example I want to visit the heavens and meet a cupid eating an apple. Perhaps I will see mythological creatures eating other fruits like apples, bananas, and other good fruits.
I want my output to show how many times each listed word occurs. I can do this with a for-loop, but I want to avoid that because my real word list is about 10000 words long and it would take forever.
So in this case my output should be (I think) 9, since it counts the total occurrences of words on that list.
Also, if there is a way to display which words were matched and how many times each occurred, that would be even more helpful.
$words = '...'; // fill these with your stuff
$text = '...';
// split the word list
$expr = explode("\n", $words);
// sanitize the words contained
$expr = array_map(function ($v) { return preg_quote(trim($v), '/'); }, $expr);
// prepare words for use in regex
$expr = '/('. implode('|', $expr) . ')/i'; // remove "i" for case sensitivity
// match prepared words against provided text
preg_match_all($expr, $text, $matches);
// calculate result
echo 'Count: ' . count($matches[1]);
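To also show which words matched and how often, the matches can be tallied with array_count_values(). This is a sketch using the sample list and text from the question; like the code above it matches substrings, so "apples" counts toward "apple". The variable names are my own.

```php
<?php
$words = "good\nbad\ncupid\nbanana\napple";
$text = 'Sometimes I travel to the good places that are good, and never the bad places that are bad. For example I want to visit the heavens and meet a cupid eating an apple. Perhaps I will see mythological creatures eating other fruits like apples, bananas, and other good fruits.';

// Split the word list, trim each entry, and drop empty lines.
$list = array_filter(array_map('trim', explode("\n", $words)), 'strlen');

// Escape each word for use inside a /.../ pattern and build one alternation.
$alternatives = array_map(function ($w) { return preg_quote($w, '/'); }, $list);
preg_match_all('/' . implode('|', $alternatives) . '/i', $text, $matches);

// Tally the matches per word, ignoring case.
$counts = array_count_values(array_map('strtolower', $matches[0]));

print_r($counts);
echo 'Total: ' . array_sum($counts);
```

If only whole words should count, wrapping the alternation as \b(?:...)\b would be one option, at the cost of no longer matching plurals such as "bananas".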
Try preg_match_all
$text = 'Sometimes I travel to the good places that are good, and never the bad places that are bad. For example I want to visit the heavens and meet a cupid eating an apple. Perhaps I will see mythological creatures eating other fruits like apples, bananas, and other good fruits.';
preg_match_all('/good|bad|cupid|banana|apple/', $text, $matches);
// ECHO OUT NUMBER OF MATCHES
echo sizeof($matches[0]); // 9
// TO SHOW ALL MATCHES
print_r( $matches );
I'm trying to match the last "word" of a string that contains numbers. Basically, I want to match any characters after the final (space) of the string if any of them are a number.
In my particular case and as an example I'm trying to extract season/episode from a TV show RSS feed. If I have: Show Name 12x4 or Show Name 2013-04-12 I wish to match 12x4 and 2013-04-12.
I started with a last-word regex ([^ ]*$), but some shows don't have an episode or date (Show Name), so I want to make my regex more specific.
Try with this regex:
/[\w\-]+$/
Sometimes regex is just not enough, or it gets more complicated... Try splitting the string first:
$str = 'Show Name 12x4';
$parts = explode(' ', $str);
$lastWord = array_pop($parts); // array_pop() needs a variable, not a function result
if (preg_match('/(?=\d)/', $lastWord)) {
...
}
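The digit requirement can also be folded into a single pattern. The following is a sketch, not a tested solution for every feed: extract_episode_token() is a hypothetical name, and it assumes words are separated by plain spaces as in the examples.

```php
<?php
// Hypothetical helper: returns the last space-separated word of $title
// if that word contains at least one digit, otherwise null.
function extract_episode_token(string $title): ?string
{
    // (?:^| )   the word starts at the beginning or after a space
    // [^ ]*\d   any non-space characters, then at least one digit
    // [^ ]*$    the rest of the word, up to the end of the string
    if (preg_match('/(?:^| )([^ ]*\d[^ ]*)$/', $title, $m)) {
        return $m[1];
    }
    return null;
}

var_dump(extract_episode_token('Show Name 12x4'));
var_dump(extract_episode_token('Show Name 2013-04-12'));
var_dump(extract_episode_token('Show Name'));
```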
Currently I am developing a web application that fetches the Twitter stream, and I am trying to build natural language processing on my own.
Since my data is from Twitter (limited to 140 characters), many words are shortened or, as in this case, written without spaces.
For example:
"Hi, my name is Bob. I m 19yo and 170cm tall"
Should be tokenized to:
- hi
- my
- name
- bob
- i
- 19
- yo
- 170
- cm
- tall
Notice that 19 and yo in 19yo have no space between them. I use it mostly for extracting numbers with their units.
Simply put, what I need is a way to 'explode' each token that has a number in it into chunks of digits or letters, without a delimiter.
'123abc' will be ['123', 'abc']
'abc123' will be ['abc', '123']
'abc123xyz' will be ['abc', '123', 'xyz']
and so on.
What is the best way to achieve it in PHP?
I found something close to it, but it's C# and specifically for day/month splitting: "How do I split a string in C# based on letters and numbers".
You can use preg_split
$string = "Hi, my name is Bob. I m 19yo and 170cm tall";
$parts = preg_split("/(,?\s+)|((?<=[a-z])(?=\d))|((?<=\d)(?=[a-z]))/i", $string);
var_dump ($parts);
When matching against the digit-letter boundary, the regular expression match must be zero-width. The characters themselves must not be included in the match. For this the zero-width lookarounds are useful.
http://codepad.org/i4Y6r6VS
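Applied to single tokens, the same lookaround idea can be tried on the three examples from the question. This is a minimal sketch; split_alnum() is a hypothetical helper name, and it only handles ASCII letters and digits.

```php
<?php
// Hypothetical helper: splits a token at every letter/digit boundary.
// Both alternatives are zero-width, so no characters are consumed.
function split_alnum(string $token): array
{
    return preg_split(
        '/(?<=[a-z])(?=\d)|(?<=\d)(?=[a-z])/i',
        $token,
        -1,
        PREG_SPLIT_NO_EMPTY
    );
}

print_r(split_alnum('123abc'));
print_r(split_alnum('abc123'));
print_r(split_alnum('abc123xyz'));
```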
How about this:
You extract the numbers from the string using regexps and store them in an array, then replace the numbers in the string with some kind of special placeholder that 'holds' their position. After parsing the string, which now consists only of your placeholders and normal characters, you feed the numbers from the array back into their reserved places.
Just an idea, but IMHO it might work for you.
EDIT:
Try running this short code; hopefully you will see my point in the output. (This code doesn't work on codepad, I don't know why.)
<?php
$str = "Hi, my name is Bob. I m 19yo and 170cm tall";
preg_match_all("#\d+#", $str, $matches);
$str = preg_replace("!\d+!", "#SPEC#", $str);
print_r($matches[0]);
print $str;