Converting PHP function to Python 3 - php

Trying to convert a PHP function to Python, i am newbie in case of python, tthats what i tried
Python ->
def stopWords(text, stopwords):
stopwords = map(to_lower(x),stopwords)
pattern = '/[0-9\W]/'
text = re.sub(pattern, ',', text)
text_array = text.partition(',');
text_array = map(to_lower(x), text_array);
keywords = []
for term in text_array:
if(term in stopwords):
keywords.append(term)
return filter(None, keywords)
stopwords = open('stop_words.txt','r').read()
text = "All words in the English language can be classified as one of the eight different parts of speech."
print(stopWords(text, stopwords))
PHP ->
function stopWords($text, $stopwords)
{
// Remove line breaks and spaces from stopwords
$stopwords = array_map(
function ($x)
{
return trim(strtolower($x));
}
, $stopwords);
// Replace all non-word chars with comma
$pattern = '/[0-9\W]/';
$text = preg_replace($pattern, ',', $text);
// Create an array from $text
$text_array = explode(",", $text);
// remove whitespace and lowercase words in $text
$text_array = array_map(
function ($x)
{
return trim(strtolower($x));
}
, $text_array);
foreach($text_array as $term)
{
if (!in_array($term, $stopwords))
{
$keywords[] = $term;
}
};
return array_filter($keywords);
}
$stopwords = file('stop_words.txt');
$stopwords = file('stop_words.txt');
$text = "All words in the English language can be classified as one of the eight different parts of speech.";
print_r(stopWords($text, $stopwords));
I am getting the error in python on cmd:
IndentationError: unindent does not match any outer indentation level
Plz figure out what i am doing wrong, and "file" alternative in python

The for should be indented, as you write it, it seems to be out of the function. Moreover, the last return is not aligned both with the for or the function.
A correct indentation should look like this:
def stopWords(text, stopwords):
stopwords = map(to_lower(x),stopwords)
pattern = '/[0-9\W]/'
text = re.sub(pattern, ',', text)
text_array = text.partition(',');
text_array = map(to_lower(x), text_array);
keywords = []
for term in text_array:
if(term in stopwords):
keywords.append(term)
return filter(None, keywords)

Related

Using php to extract keyword pairs for SEO

I'm currently investigating some new ideas for long tail SEO. I have a site where people can create their own blogs, which brings pretty good long tail traffic already. I'm already displaying the article title inside the article's title tags.
However, often the title does not match well for keywords in the content, and I'm interested in maybe adding some keywords into the title that php has actually determined would be best.
I've tried using a script which I made to work out what the most common words are on a page. This works ok but the problem with this is it comes up with pretty useless words.
It's occurred to me that what would be useful is to make a php script that would extract frequently occurring pairs (or sets of 3) words and then put them in an array ordered by how often they occur.
My problem: how to parse text in a more dynamic way to look for recurring pairs or triplets of words. How would I go about this?
function extractCommonWords($string, $keywords){
$stopWords = array('i','a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und','the','www');
$string = preg_replace('/\s\s+/i', '', $string); // replace whitespace
$string = trim($string); // trim the string
$string = preg_replace('/[^a-zA-Z0-9 -]/', '', $string); // only take alphanumerical characters, but keep the spaces and dashes too…
$string = strtolower($string); // make it lowercase
preg_match_all('/\b.*?\b/i', $string, $matchWords);
$matchWords = $matchWords[0];
foreach ( $matchWords as $key=>$item ) {
if ( $item == '' || in_array(strtolower($item), $stopWords) || strlen($item) <= 3 ) {
unset($matchWords[$key]);
}
}
$wordCountArr = array();
if ( is_array($matchWords) ) {
foreach ( $matchWords as $key => $val ) {
$val = strtolower($val);
if ( isset($wordCountArr[$val]) ) {
$wordCountArr[$val]++;
} else {
$wordCountArr[$val] = 1;
}
}
}
arsort($wordCountArr);
$wordCountArr = array_slice($wordCountArr, 0, $keywords);
return $wordCountArr;
}
For the sake of including some code - here's another primitive adaptation that returns multi-word keywords of a given length and occurrences - rather than strip all common words it only filters those that are at the start and end of a keyword. It still returns some nonsense but that is unavoidable really.
function getLongTailKeywords($str, $len = 3, $min = 2){ $keywords = array();
$common = array('i','a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und','the','www');
$str = preg_replace('/[^a-z0-9\s-]+/', '', strtolower(strip_tags($str)));
$str = preg_split('/\s+-\s+|\s+/', $str, -1, PREG_SPLIT_NO_EMPTY);
while(0<$len--) for($i=0;$i<count($str)-$len;$i++){
$word = array_slice($str, $i, $len+1);
if(in_array($word[0], $common)||in_array(end($word), $common)) continue;
$word = implode(' ', $word);
if(!isset($keywords[$len][$word])) $keywords[$len][$word] = 0;
$keywords[$len][$word]++;
}
$return = array();
foreach($keywords as &$keyword){
$keyword = array_filter($keyword, function($v) use($min){ return !!($v>$min); });
arsort($keyword);
$return = array_merge($return, $keyword);
}
return $return;
}
run code *on random BBC News article
The problem with just ignoring common words, grammar and punctuation though is that they still carry meaning within a sentence. If you remove them you are at best changing the meaning or at worst generating unintelligible phrases. Even the idea of extracting "keywords" itself is flawed because words can have different meanings - when you remove them from a sentence you take them out of context.
It's not my area but there are complex studies into natural languages and there is no easy solution - though the general theory goes like this: A computer cannot decipher the meaning of a single piece of text, it has to rely on cross referencing a semantically tagged corpus of related material (which is a huge overhead).

mb_eregi_replace multiple matches get them

$string = 'test check one two test3';
$result = mb_eregi_replace ( 'test|test2|test3' , '<$1>' ,$string ,'i');
echo $result;
This should deliver: <test> check one two <test3>
Is it possible to get, that test and test3 was found, without using another match function ?
You can use preg_replace_callback instead:
$string = 'test check one two test3';
$matches = array();
$result = preg_replace_callback('/test|test2|test3/i' , function($match) use ($matches) {
$matches[] = $match;
return '<'.$match[0].'>';
}, $string);
echo $result;
Here preg_replace_callback will call the passed callback function for each match of the pattern (note that its syntax differs from POSIX). In this case the callback function is an anonymous function that adds the match to the $matches array and returns the substitution string that the matches are to be replaced by.
Another approach would be to use preg_split to split the string at the matched delimiters while also capturing the delimiters:
$parts = preg_split('/test|test2|test3/i', $string, null, PREG_SPLIT_DELIM_CAPTURE);
The result is an array of alternating non-matching and matching parts.
As far as I know, eregi is deprecated.
You could do something like this:
<?php
$str = 'test check one two test3';
$to_match = array("test", "test2", "test3");
$rep = array();
foreach($to_match as $val){
$rep[$val] = "<$val>";
}
echo strtr($str, $rep);
?>
This too allows you to easily add more strings to replace.
Hi following function used to found the any word from string
<?php
function searchword($string, $words)
{
$matchFound = count($words);// use tha no of word you want to search
$tempMatch = 0;
foreach ( $words as $word )
{
preg_match('/'.$word.'/',$string,$matches);
//print_r($matches);
if(!empty($matches))
{
$tempMatch++;
}
}
if($tempMatch==$matchFound)
{
return "found";
}
else
{
return "notFound";
}
}
$string = "test check one two test3";
/*** an array of words to highlight ***/
$words = array('test', 'test3');
$string = searchword($string, $words);
echo $string;
?>
If your string is utf-8, you could use preg_replace instead
$string = 'test check one two test3';
$result = preg_replace('/(test3)|(test2)|(test)/ui' , '<$1>' ,$string);
echo $result;
Oviously with this kind of data to match the result will be suboptimal
<test> check one two <test>3
You'll need a longer approach than a direct search and replace with regular expressions (surely if your patterns are prefixes of other patterns)
To begin with, the code you want to enhance does not seem to comply with its initial purpose (not at least in my computer). You can try something like this:
$string = 'test check one two test3';
$result = mb_eregi_replace('(test|test2|test3)', '<\1>', $string);
echo $result;
I've removed the i flag (which of course makes little sense here). Still, you'd still need to make the expression greedy.
As for the original question, here's a little proof of concept:
function replace($match){
$GLOBALS['matches'][] = $match;
return "<$match>";
}
$string = 'test check one two test3';
$matches = array();
$result = mb_eregi_replace('(test|test2|test3)', 'replace(\'\1\')', $string, 'e');
var_dump($result, $matches);
Please note this code is horrible and potentially insecure. I'd honestly go with the preg_replace_callback() solution proposed by Gumbo.

How to bold keyword in php mysql search?

I want want my output like this when I search a keyword like
"programming"
php programming language
How to do this in php mysql?
Any idea?
Just perform a str_replace on the returned text.
$search = 'programming';
// $dbContent = the response from the database
$dbContent = str_replace( $search , '<b>'.$search.'</b>' , $dbContent );
echo $dbContent;
Any instance of "programming", even if as part of a larger word, will be wrapped in <b> tags.
For instances where more than one word are used
$search = 'programming something another';
// $dbContent = the response from the database
$search = explode( ' ' , $search );
function wrapTag($inVal){
return '<b>'.$inVal.'</b>';
}
$replace = array_map( 'wrapTag' , $search );
$dbContent = str_replace( $search , $replace , $dbContent );
echo $dbContent;
This will split the $search into an array at the spaces, and then wrap each match in the <b> tags.
You could use <b> or <strong> tags (See What's the difference between <b> and <strong>, <i> and <em>? for a dicussion about them).
$search = #$_GET['q'];
$trimmed = trim($search);
function highlight($req_field, $trimmed) //$req_field is the field of your table
{
preg_match_all('~\w+~', $trimmed, $m);
if(!$m)
return $req_field;
$re = '~\\b(' . implode('|', $m[0]) . ')\\b~';
return preg_replace($re, '<b>$0</b>', $req_field);
}
print highlight($req_field, $trimmed);
In this way, you can bolden the searched keywords. Its quite easy and works well.
The response is actually a bit more complicated than that. In the common search results use case, there are other factors to consider:
you should take into account uppercase and lowercase (Programming, PROGRAMMING, programming etc);
if your content string is very long, you wouldn't want to return the whole text, but just the searched query and a few words before and after it, for context;
This guy figured it out:
//$h = text
//$n = keywords to find separated by space
//$w = words near keywords to keep
function truncatePreserveWords ($h,$n,$w=5,$tag='b') {
$n = explode(" ",trim(strip_tags($n))); //needles words
$b = explode(" ",trim(strip_tags($h))); //haystack words
$c = array(); //array of words to keep/remove
for ($j=0;$j<count($b);$j++) $c[$j]=false;
for ($i=0;$i<count($b);$i++)
for ($k=0;$k<count($n);$k++)
if (stristr($b[$i],$n[$k])) {
$b[$i]=preg_replace("/".$n[$k]."/i","<$tag>\\0</$tag>",$b[$i]);
for ( $j= max( $i-$w , 0 ) ;$j<min( $i+$w, count($b)); $j++) $c[$j]=true;
}
$o = ""; // reassembly words to keep
for ($j=0;$j<count($b);$j++) if ($c[$j]) $o.=" ".$b[$j]; else $o.=".";
return preg_replace("/\.{3,}/i","...",$o);
}
Works like a charm!

Create acronym from a string containing only words

I'm looking for a way that I can extract the first letter of each word from an input field and place it into a variable.
Example: if the input field is "Stack-Overflow Questions Tags Users" then the output for the variable should be something like "SOQTU"
$s = 'Stack-Overflow Questions Tags Users';
echo preg_replace('/\b(\w)|./', '$1', $s);
the same as codaddict's but shorter
For unicode support, add the u modifier to regex: preg_replace('...../u',
Something like:
$s = 'Stack-Overflow Questions Tags Users';
if(preg_match_all('/\b(\w)/',strtoupper($s),$m)) {
$v = implode('',$m[1]); // $v is now SOQTU
}
I'm using the regex \b(\w) to match the word-char immediately following the word boundary.
EDIT:
To ensure all your Acronym char are uppercase, you can use strtoupper as shown.
Just to be completely different:
$input = 'Stack-Overflow Questions Tags Users';
$acronym = implode('',array_diff_assoc(str_split(ucwords($input)),str_split(strtolower($input))));
echo $acronym;
$initialism = preg_replace('/\b(\w)\w*\W*/', '\1', $string);
If they are separated by only space and not other things. This is how you can do it:
function acronym($longname)
{
$letters=array();
$words=explode(' ', $longname);
foreach($words as $word)
{
$word = (substr($word, 0, 1));
array_push($letters, $word);
}
$shortname = strtoupper(implode($letters));
return $shortname;
}
Regular expression matching as codaddict says above, or str_word_count() with 1 as the second parameter, which returns an array of found words. See the examples in the manual. Then you can get the first letter of each word any way you like, including substr($word, 0, 1)
The str_word_count() function might do what you are looking for:
$words = str_word_count ('Stack-Overflow Questions Tags Users', 1);
$result = "";
for ($i = 0; $i < count($words); ++$i)
$result .= $words[$i][0];
function initialism($str, $as_space = array('-'))
{
$str = str_replace($as_space, ' ', trim($str));
$ret = '';
foreach (explode(' ', $str) as $word) {
$ret .= strtoupper($word[0]);
}
return $ret;
}
$phrase = 'Stack-Overflow Questions IT Tags Users Meta Example';
echo initialism($phrase);
// SOQITTUME
$s = "Stack-Overflow Questions IT Tags Users Meta Example";
$sArr = explode(' ', ucwords(strtolower($s)));
$sAcr = "";
foreach ($sArr as $key) {
$firstAlphabet = substr($key, 0,1);
$sAcr = $sAcr.$firstAlphabet ;
}
using answer from #codaddict.
i also thought in a case where you have an abbreviated word as the word to be abbreviated e.g DPR and not Development Petroleum Resources, so such word will be on D as the abbreviated version which doesn't make much sense.
function AbbrWords($str,$amt){
$pst = substr($str,0,$amt);
$length = strlen($str);
if($length > $amt){
return $pst;
}else{
return $pst;
}
}
function AbbrSent($str,$amt){
if(preg_match_all('/\b(\w)/',strtoupper($str),$m)) {
$v = implode('',$m[1]); // $v is now SOQTU
if(strlen($v) < 2){
if(strlen($str) < 5){
return $str;
}else{
return AbbrWords($str,$amt);
}
}else{
return AbbrWords($v,$amt);
}
}
}
As an alternative to #user187291's preg_replace() pattern, here is the same functionality without needing a reference in the replacement string.
It works by matching the first occurring word characters, then forgetting it with \K, then it will match zero or more word characters, then it will match zero or more non-word characters. This will consume all of the unwanted characters and only leave the first occurring word characters. This is ideal because there is no need to implode an array of matches. The u modifier ensures that accented/multibyte characters are treated as whole characters by the regex engine.
Code: (Demo)
$tests = [
'Stack-Overflow Questions Tags Users',
'Stack Overflow Close Vote Reviewers',
'Jean-Claude Vandàmme'
];
var_export(
preg_replace('/\w\K\w*\W*/u', '', $tests)
);
Output:
array (
0 => 'SOQTU',
1 => 'SOCVR',
2 => 'JCV',
)

Can I use regex for this?

Is this possible with regex?
I have a file, and if a '#' is found in the file, the text after the '#' with the '#' is to be replaced with the file with the same name as after the '#'.
File1: "this text is found in file1"
File2: "this file will contain text from file1: #file1".
File2 after regex: "this file will contain text from file1: this text is found in file1".
I wish to do this with php and I've heard that the preg function is better than the ereg, but whatever works is fine with me =)
Thanks a lot!
EDIT:
It has to be programmed, so that it looks through file2 without knowing which files to concatenate before it has gone through all occurrences of a # :)
PHP's native functions str_pos and str_replace are better to use when you're searching through larger files or strings. ;)
First of all the grammar of your templating is not a very good one becuase the parser may not exactly sure when will the file name ends.
My suggestion would be that you change to the one that can better detect the boundry like {#:filename}.
Anyhow, the code I give below follows your question.
<?php
// RegEx Utility functions -------------------------------------------------------------------------
function ReplaceAll($RegEx, $Processor, $Text) {
// Make sure the processor can be called
if(!is_callable($Processor))
throw new Exception("\"$Processor\" is not a callable.");
// Do the Match
preg_match_all($RegEx, $Text, $Matches, PREG_OFFSET_CAPTURE + PREG_SET_ORDER);
// Do the replacment
$NewText = "";
$MatchCount = count($Matches);
$PrevOffset = 0;
for($i = 0; $i < $MatchCount; $i++) {
// Get each match and the full match information
$EachMatch = $Matches[$i];
$FullMatch = is_array($EachMatch) ? $EachMatch[0] : $EachMatch;
// Full match is each match if no grouping is used in the regex
// Full match is the first element of each match if grouping is used in the regex.
$MatchOffset = $FullMatch[1];
$MatchText = $FullMatch[0];
$MatchTextLength = strlen($MatchText);
$NextOffset = $MatchOffset + $MatchTextLength;
// Append the non-match and the replace of the match
$NewText .= substr($Text, $PrevOffset, $MatchOffset - $PrevOffset);
$NewText .= $Processor($EachMatch);
// The next prev-offset
$PrevOffset = $NextOffset;
}
// Append the rest of the text
$NewText .= substr($Text, $PrevOffset);
return $NewText;
}
function GetGroupMatchText($Match, $Index) {
if(!is_array($Match))
return $Match[0];
$Match = $Match[$Index];
return $Match[0];
}
// Replacing by file content -----------------------------------------------------------------------
$RegEx_FileNameInText = "/#([a-zA-Z0-9]+)/"; // Group #1 is the file name
$ReplaceFunction_ByFileName = "ReplaceByFileContent";
function ReplaceByFileContent($Match) {
$FileName = GetGroupMatchText($Match, 1); // Group # is the gile name
// $FileContent = get_file_content($FileName); // Get the content of the file
$FileContent = "{# content of: $FileName}"; // Dummy content for testing
return $FileContent; // Returns the replacement
}
// Main --------------------------------------------------------------------------------------------
$Text = " === #file1 ~ #file2 === ";
echo ReplaceAll($RegEx_FileNameInText, $ReplaceFunction_ByFileName, $Text);
This will returns === {# content of: file1} ~ {# content of: file2} ===.
The program will replace all the regex match with the replacement returned from the result of the given function name.
In this case, the callback function is ReplaceByFileContent in which the file name is extract from the group #1 in the regex.
I believe my code is self documented but if you have any question, you can ask me.
Hope I helps.
Much cleaner:
<?php
$content = file_get_content('content.txt');
$m = array();
preg_match_all('`#([^\s]*)(\s|\Z)`ism', $content, $m, PREG_SET_ORDER);
foreach($m as $match){
$innerContent = file_get_contents($match[1]);
$content = str_replace('#'.$match[1], $innerContent, $content);
}
// done!
?>
regex tested with: http://www.spaweditor.com/scripts/regex/index.php

Categories