How to Remove Hidden Characters in PHP - php

I have the following piece of code, which reads text files from a directory. I remove the words in a stopword list from each file, but when I then print the remaining words along with their positions, an extra blank entry appears wherever a stopword used to be in the document.
For example, take a file which reads like this:
Department of Computer Science // A document
After removing the stopword 'of', looping through the document produces the following output:
Department(0) (1) Computer(2) Science(3) //output
But the blank entry should not be there.
Here is the code:
<?php
$directory = "archive/";
$dir = opendir($directory);
while (($file = readdir($dir)) !== false) {
$filename = $directory . $file;
$type = filetype($filename);
if ($type == 'file') {
$contents = file_get_contents($filename);
$texts = preg_replace('/\s+/', ' ', $contents);
$texts = preg_replace('/[^A-Za-z0-9\-\n ]/', '', $texts);
$text = explode(" ", $texts);
$text = array_map('strtolower', $text);
$stopwords = array("a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "has", "he", "in", "it","i","is", "its", "of", "on", "that", "the", "to","was", "were", "will", "with", "or", " ");
$text = (array_diff($text,$stopwords));
echo "<br><br>";
$total_count = count($text);
$b = -1;
foreach ($text as $a=>$v)
{
$b++;
echo $text[$b]. "(" .$b. ")" ." ";
}
}
}
closedir($dir);
?>

I'm not 100% sure what the final output with the string positions is for, so I'm assuming you are only printing them for reference. This test code, which uses a regex with preg_replace, seems to work well.
header('Content-Type: text/plain; charset=utf-8');
// Set test content array.
$contents_array = array();
$contents_array[] = "Department of Computer Science // A document";
$contents_array[] = "Department of Economics // A document";
// Set the stopwords.
$stopwords = array("a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "has", "he", "in", "it","i","is", "its", "of", "on", "that", "the", "to","was", "were", "will", "with", "or");
// Set a regex based on the stopwords.
$regex = '/(' . implode('\b|', $stopwords) . '\b)/i';
foreach ($contents_array as $contents) {
// Remove the stopwords.
$contents = preg_replace($regex, '', $contents);
// Clear out the extra whitespace; anything 2 spaces or more in a row.
$contents = preg_replace('/\s{2,}/', ' ', $contents);
// Echo contents.
echo $contents . "\n";
}
The output is cleaned up & formatted like this:
Department Computer Science // document
Department Economics // document
So to integrate it into your code, you should do this. Note how I moved $stopwords & $regex outside of the while loop since it makes no sense to reset those values on each while loop iteration. Set it once outside of the loop & let the stuff in the loop just be focused on what you need there in the loop:
<?php
$directory = "archive/";
$dir = opendir($directory);
// Set the stopwords.
$stopwords = array("a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "has", "he", "in", "it","i","is", "its", "of", "on", "that", "the", "to","was", "were", "will", "with", "or");
// Set a regex based on the stopwords.
$regex = '/(' . implode('\b|', $stopwords) . '\b)/i';
while (($file = readdir($dir)) !== false) {
$filename = $directory . $file;
$type = filetype($filename);
if ($type == 'file') {
// Get the contents of the filename.
$contents = file_get_contents($filename);
// Remove the stopwords.
$contents = preg_replace($regex, '', $contents);
// Clear out the extra whitespace; anything 2 spaces or more in a row.
$contents = preg_replace('/\s{2,}/', ' ', $contents);
// Echo contents.
echo $contents;
}
}
closedir($dir);
?>

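The pattern in the answer by Giacomo1968 only anchors the end of each stopword, so a stopword can also match the tail of a longer word (for example the "a" at the end of "data"). Adding \b after the pipe | operator still leaves the first stopword without a leading boundary, so put a \b word boundary on both sides of the whole group instead:
$regex = '/\b(' . implode('|', $stopwords) . ')\b/i';
With that change only whole words are removed.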
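The blank entries in the question's output can also be explained by array_diff() keeping the original array keys, while the loop reads $text[$b] with a counter that assumes consecutive keys. A minimal sketch of that idea follows, assuming the goal is to number only the words that remain; the function name indexWithoutStopwords is made up for illustration and the stopword list is shortened.

```php
<?php
// Hypothetical helper: returns the remaining words, renumbered from 0.
function indexWithoutStopwords(string $contents, array $stopwords): array
{
    // Keep only letters, digits, hyphens and whitespace, as the question's code does.
    $clean = preg_replace('/[^A-Za-z0-9\-\s]/', '', strtolower($contents));
    // PREG_SPLIT_NO_EMPTY drops the empty tokens that leading, trailing
    // or repeated whitespace would otherwise produce.
    $words = preg_split('/\s+/', $clean, -1, PREG_SPLIT_NO_EMPTY);
    // array_diff() preserves the original keys; array_values() renumbers them.
    return array_values(array_diff($words, $stopwords));
}

$stopwords = ['a', 'an', 'of', 'the'];
foreach (indexWithoutStopwords('Department of Computer Science', $stopwords) as $pos => $word) {
    echo $word . '(' . $pos . ') ';
}
// prints: department(0) computer(1) science(2)
```

If the original positions should be kept instead (0, 2, 3), drop array_values() and print the key that foreach provides rather than a separate counter.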

Related

Read fails with spaces

I am very new to PHP and want to learn. I am trying to make a top-list for my server but I have a problem. My file is built like this:
"Name" "Kills"
"^0user1^7" "2"
"user2" "2"
"user3" "6"
"user with spaces" "91"
But if I want to read this with PHP it fails because the user has spaces.
That's the method I use to read the file:
$lines = file('top.txt');
foreach ($lines as $line) {
$parts = explode(' ', $line);
echo isset($parts[0]) ? $parts[0] : 'N/A' ;
}
Maybe someone knows a better method, because this doesn't work very well :D.
You need REGEX :-)
<?php
$lines = array(
'"^0user1^7" "2"',
'"user2" "2"',
'"user3" "6"',
'"user with spaces" "91"',
);
$regex = '#"(?<user>[a-zA-Z0-9\^\s]+)"\s"(?<num>\d+)"#';
foreach ($lines as $line) {
preg_match($regex, $line, $matches);
echo 'user = '.$matches['user'].', num = '.$matches['num']."\n";
}
In the regex, we have # delimiters, then look for stuff between quotes. Using (?<name>PATTERN) gives you a named capture group. The first group (user) matches letters, digits, ^ and whitespace; the second (num) matches digits only.
See here to understand how the regex is matching!
https://regex101.com/r/023LlL/1/
See it here in action https://3v4l.org/qDVuf
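As an alternative to the regex, a sketch along these lines treats each row as space-delimited, double-quoted fields and lets str_getcsv() do the splitting. The helper name parseTopList is made up for illustration, and the sketch assumes exactly one space between the two quoted fields and that the first row is the header shown in the question.

```php
<?php
// Hypothetical helper: turns the quoted, space-separated rows into name => kills pairs.
function parseTopList(array $lines): array
{
    $scores = [];
    foreach ($lines as $line) {
        $line = trim($line);
        if ($line === '') {
            continue; // skip blank lines
        }
        // Space is the delimiter and the double quote the enclosure,
        // so spaces inside the quotes stay part of the field.
        $fields = str_getcsv($line, ' ', '"', '\\');
        if (count($fields) !== 2 || !ctype_digit($fields[1])) {
            continue; // skips the "Name" "Kills" header and malformed rows
        }
        $scores[$fields[0]] = (int) $fields[1];
    }
    return $scores;
}

$lines = [
    '"Name" "Kills"',
    '"^0user1^7" "2"',
    '"user with spaces" "91"',
];
foreach (parseTopList($lines) as $name => $kills) {
    echo $name . ': ' . $kills . "\n";
}
```

For the real file, the $lines array could come from file('top.txt', FILE_IGNORE_NEW_LINES).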
For your process this might help
$content = file_get_contents('top.txt'); // read the whole file as one string
$lines = explode(PHP_EOL, $content); // this will split file content line by line
foreach ($lines as $key=>$value_line ) {
echo str_replace(" ","",$value_line);
}
As I commented above, below is a simple example with JSON.
Assuming, you have stored records in JSON format:
$json = '{
"user1": "12",
"sad sad":"23"
}';
$decoded = json_decode($json);
foreach($decoded as $key => $value){
echo 'Key: ' . $key . ' And value is ' . $value;
}
And here is the demo link: https://3v4l.org/ih1P7

Match variable value with text file row wise

I want to match variable value with text file rows, for example
$brands = 'Applica';
and text file content like -
'applica' = 'Applica','Black and Decker','George Foreman'
'black and decker' = 'Black and Decker','Applica'
'amana' = 'Amana','Whirlpool','Roper','Maytag','Kenmore','Kitchenaid','Jennair'
'bosch' = 'Bosch','Thermador'
There are four rows in the text file.
The first word of each row is a brand, and the values after the equals sign are the brands it is compatible with.
For example, applica is compatible with 'Applica', 'Black and Decker' and 'George Foreman'.
I want to match the variable $brands against the first word of each row (here applica) and, if it matches, store the values after the equals sign ('Applica','Black and Decker','George Foreman') in a new variable.
Please provide some guidance.
Thanks.
Update -
<?php
$brands = "brands.txt";
$contents = file_get_contents($brands);
$brandsfields = explode(',', $contents);
$csvbrand = 'applica';
foreach($brandsfields as $brand) {
$newname = substr($brand,1,-1);
echo $newname . "\t";
}
?>
This should work
$matches = explode("\n", "'applica' = 'Applica','Black and Decker','George Foreman'\n'black and decker' = 'Black and Decker','Applica'\n'amana' = 'Amana','Whirlpool','Roper','Maytag','Kenmore','Kitchenaid','Jennair'\n'bosch' = 'Bosch','Thermador'");
$brand = "applica";
$equalValues = [];
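foreach ($matches as $value) {
// Split each row once into the key (left of "=") and the value list (right of it).
list($rawKey, $valuesMatch) = explode('=', $value, 2);
$keyMatch = str_replace("'", "", trim($rawKey));
// Capture everything between each pair of single quotes. A separate variable
// is used so that the $matches array being iterated is not overwritten.
preg_match_all("/'(.*?)'/s", $valuesMatch, $found);
if ($brand == $keyMatch) {
$equalValues = $found[1];
}
}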
var_dump($equalValues);
If $brand is equal to applica, $equalValues should be equal to:
array(3) {
[0]=>
string(7) "Applica"
[1]=>
string(16) "Black and Decker"
[2]=>
string(14) "George Foreman"
}
preg_match_all("/'" . $csvbrand ."' = (.*)/", $contents, $output_array);
$names = explode(",", str_replace("'", "", $output_array[1][0]));
var_dump($names); // results in ->
//Applica
//Black and Decker
//George Foreman
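If several brands need to be looked up, it may be simpler to parse the file once into an associative array. The following is only a sketch under the assumption that every row has the form 'key' = 'value','value',... as in the question; the function name parseBrandMap is hypothetical.

```php
<?php
// Hypothetical helper: builds [brand => [compatible brands...]] from the file contents.
function parseBrandMap(string $contents): array
{
    $map = [];
    // \R matches any line ending (\n, \r\n or \r).
    foreach (preg_split('/\R/', $contents, -1, PREG_SPLIT_NO_EMPTY) as $row) {
        $parts = explode('=', $row, 2);
        if (count($parts) !== 2) {
            continue; // not a "key = values" row
        }
        // Strip whitespace and the surrounding single quotes from the key.
        $key = strtolower(trim($parts[0], " \t'"));
        // Comma is the delimiter and the single quote the enclosure.
        $map[$key] = str_getcsv(trim($parts[1]), ',', "'", '\\');
    }
    return $map;
}

$contents = "'applica' = 'Applica','Black and Decker','George Foreman'\n"
          . "'bosch' = 'Bosch','Thermador'";
$map = parseBrandMap($contents);
$brands = 'Applica';
// Look the brand up case-insensitively; an unknown brand gives an empty list.
$compatible = $map[strtolower($brands)] ?? [];
print_r($compatible);
```

In the question's setup, $contents would come from file_get_contents('brands.txt').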

PHP, remove all lines from a big string containing a specific word

$file = file_get_contents("http://www.bigsite.com");
How could I go about removing all lines from the string $file that contain the word "hello"?
$file = file_get_contents("http://www.bigsite.com");
$lines = explode("\n", $file);
$exclude = array();
foreach ($lines as $line) {
if (strpos($line, 'hello') !== FALSE) {
continue;
}
$exclude[] = $line;
}
echo implode("\n", $exclude);
$file = file_get_contents("http://www.example.com");
// remove single word hello
echo preg_replace('/(hello)/im', '', $file);
// remove multiple words hello, foo, bar, foobar
echo preg_replace('/(hello|foo|bar|foobar)/im', '', $file);
EDIT Removing the Lines
// read the file's lines into an array, without the trailing newlines
$lines = file('http://example.com/', FILE_IGNORE_NEW_LINES);
// match single word hello
$pattern = '/(hello)/im';
// or match multiple words hello, foo, bar, foobar
$pattern = '/(hello|foo|bar|foobar)/im';
$rows = array();
foreach ($lines as $key => $value) {
if (!preg_match($pattern, $value)) {
// lines not containing hello
$rows[] = $value;
}
}
// now create the paragraph again
echo implode("\n", $rows);
Here you go:
$file = file('http://www.bigsite.com', FILE_IGNORE_NEW_LINES);
foreach( $file as $key=>$line ) {
if( false !== strpos($line, 'hello') ) {
unset($file[$key]);
}
}
$file = implode("\n", $file);
$file = file_get_contents("http://www.bigsite.com");
echo trim(preg_replace('/((^|\n).*hello.*(\n|$))/', "\n", $file));
The pattern covers four cases:
the first line has hello
a middle line has hello
the last line has hello
the only line has hello
If the file uses \r\n line endings (carriage return and newline, as on Windows), you need to modify the pattern accordingly. trim() removes any leading or trailing newlines left behind.
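The line-by-line approaches above can also be written with preg_grep() and its PREG_GREP_INVERT flag, which keeps only the lines that do not match. This is a sketch rather than a drop-in replacement: the function name removeLinesContaining is made up, the match is case-insensitive, and the result always uses \n as the line separator.

```php
<?php
// Hypothetical helper: drops every line that contains $word (case-insensitive).
function removeLinesContaining(string $text, string $word): string
{
    // \R splits on \n, \r\n and \r alike, so Windows files need no special handling.
    $lines = preg_split('/\R/', $text);
    // preg_quote() escapes regex metacharacters in $word; "/" is the delimiter.
    $pattern = '/' . preg_quote($word, '/') . '/i';
    // PREG_GREP_INVERT returns the lines that do NOT match the pattern.
    $kept = preg_grep($pattern, $lines, PREG_GREP_INVERT);
    return implode("\n", $kept);
}

$text = "first line\nhello there\nmiddle line\r\nsay Hello again\nlast line";
echo removeLinesContaining($text, 'hello'), "\n";
```

Because the text is split into lines first, consecutive matching lines and a match on the first or last line need no special cases.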

Uppercase the first character of each word in a string except 'and', 'to', etc

How can I upper-case the first character of each word in a string, except for a couple of words that I don't want to transform, like and, to, etc.?
For instance, I want this - ucwords('art and design') to output the string below,
'Art and Design'
Is there something like strip_tags($text, '<p><a>'), where I could pass a list of words such as and that should be left untouched?
Or should I use something else? Please advise!
Thanks.
None of these are really UTF8 friendly, so here's one that works flawlessly (so far)
function titleCase($string, $delimiters = array(" ", "-", ".", "'", "O'", "Mc"), $exceptions = array("and", "to", "of", "das", "dos", "I", "II", "III", "IV", "V", "VI"))
{
/*
* Exceptions in lower case are words you don't want converted
* Exceptions all in upper case are any words you don't want converted to title case
* but should be converted to upper case, e.g.:
* king henry viii or king henry Viii should be King Henry VIII
*/
$string = mb_convert_case($string, MB_CASE_TITLE, "UTF-8");
foreach ($delimiters as $dlnr => $delimiter) {
$words = explode($delimiter, $string);
$newwords = array();
foreach ($words as $wordnr => $word) {
if (in_array(mb_strtoupper($word, "UTF-8"), $exceptions)) {
// check exceptions list for any words that should be in upper case
$word = mb_strtoupper($word, "UTF-8");
} elseif (in_array(mb_strtolower($word, "UTF-8"), $exceptions)) {
// check exceptions list for any words that should be in lower case
$word = mb_strtolower($word, "UTF-8");
} elseif (!in_array($word, $exceptions)) {
// convert to uppercase (non-utf8 only)
$word = ucfirst($word);
}
array_push($newwords, $word);
}
$string = join($delimiter, $newwords);
}//foreach
return $string;
}
Usage:
$s = 'SÃO JOÃO DOS SANTOS';
$v = titleCase($s); // 'São João dos Santos'
Since we all love regexps, here is an alternative that also works with punctuation (unlike the explode(" ",...) solution):
$newString = preg_replace_callback("/[a-zA-Z]+/",'ucfirst_some',$string);
function ucfirst_some($match)
{
$exclude = array('and','not');
if ( in_array(strtolower($match[0]),$exclude) ) return $match[0];
return ucfirst($match[0]);
}
edit added strtolower(), or "Not" would remain "Not".
How about this ?
$string = str_replace(' And ', ' and ', ucwords($string));
You will have to use ucfirst and loop through every word, checking e.g. an array of exceptions for each one.
Something like the following:
$exclude = array('and', 'not');
$words = explode(' ', $string);
foreach($words as $key => $word) {
if(in_array($word, $exclude)) {
continue;
}
$words[$key] = ucfirst($word);
}
$newString = implode(' ', $words);
I know it is a few years after the question, but I was looking for a way of ensuring proper English in the titles of a CMS I am programming, and I wrote a lightweight function from the ideas on this page, so I thought I would share it:
function makeTitle($title){
$str = ucwords($title);
$exclude = 'a,an,the,for,and,nor,but,or,yet,so,such,as,at,around,by,after,along,for,from,of,on,to,with,without';
$excluded = explode(",",$exclude);
// Match whole words only, so that "A" does not also lower-case the "A" in "Apple".
foreach($excluded as $noCap){$str = preg_replace('/\b'.ucfirst($noCap).'\b/', $noCap, $str);}
return ucfirst($str);
}
The excluded list was found at:
http://www.superheronation.com/2011/08/16/words-that-should-not-be-capitalized-in-titles/
USAGE: makeTitle($title);
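The ideas in the answers above (a regex callback, an exception list, and the mb_* functions for UTF-8) can be combined in one short function. The following is a sketch only: the name titleCaseExcept and the default exception list are assumptions, it needs the mbstring extension and UTF-8 input, and it lower-cases the rest of each word, so acronyms such as "PHP" would become "Php".

```php
<?php
// Hypothetical helper; requires the mbstring extension and UTF-8 input.
function titleCaseExcept(string $text, array $exceptions = ['and', 'to', 'of']): string
{
    // A word is a letter followed by letters, combining marks or apostrophes.
    $result = preg_replace_callback(
        "/\\p{L}[\\p{L}\\p{M}']*/u",
        function (array $m) use ($exceptions): string {
            $lower = mb_strtolower($m[0], 'UTF-8');
            if (in_array($lower, $exceptions, true)) {
                return $lower; // leave exception words in lower case
            }
            // Upper-case only the first character of the word.
            return mb_strtoupper(mb_substr($lower, 0, 1, 'UTF-8'), 'UTF-8')
                 . mb_substr($lower, 1, null, 'UTF-8');
        },
        $text
    );
    return $result ?? $text; // preg_replace_callback() returns null on invalid UTF-8
}

echo titleCaseExcept('art and design'), "\n";          // Art and Design
echo titleCaseExcept("don't look back to it"), "\n";   // Don't Look Back to It
```

An exception word at the very start of the string stays in lower case here; wrapping the result in a first-character upper-casing step would cover that case if it matters.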

How to do something if sentence include one of the words in this array?

I want to do something if my sentence includes one of the words in this array. How can I do that?
$sentence = "I dont give a badwordtwo";
$values = array("badwordone","badwordtwo","badwordthree","badwordfour");
Thanks...
If you want to censor an array of words in some string you can use str_ireplace:
$var = "This is my phrase.";
$var = str_ireplace( array("this", "phrase"), array("****", "*****"), $var);
edit: as chacha102 notes, you only need to use the second array to vary the number of stars,
$var = str_ireplace( array("this", "phrase"), "", $var);
is equally valid. I should also note that if you use a second array, its length should match the first array's, since the replacements correspond by index; any search term left without a counterpart is replaced with an empty string.
I had a similar question a while back. This answer should fit you perfectly.
Is this efficient coding for anti-spam?
<?PHP
$banned = array('bad','words','like','these');
$looksLikeSpam = false;
foreach($banned as $naughty){
if (strpos($string,$naughty) !== false){
$looksLikeSpam=true;
}
}
if ($looksLikeSpam){
echo "You're GROSS! Just... ew!";
die();
}
?>
one way
$sentence = "I dont give a badwordtwo";
$values = array("badwordone","badwordtwo","badwordthree","badwordfour");
$s = explode(" ",$sentence);
foreach ($s as $a=>$b){
if (in_array($b, $values)) {
echo "Got $b";
}
}
output
$ php test.php
Got badwordtwo
OR
$sentence = "I dont give a badwordtwo";
$values = array("badwordone","badwordtwo","badwordthree","badwordfour");
$s = explode(" ",$sentence);
var_dump(array_intersect($s, $values));
output
$ php test.php
array(1) {
[4]=>
string(10) "badwordtwo"
}
Don't you just love php.net?
Example #1 Basic str_replace() examples
<?php
// Provides: <body text='black'>
$bodytag = str_replace("%body%", "black", "<body text='%body%'>");
// Provides: Hll Wrld f PHP
$vowels = array("a", "e", "i", "o", "u", "A", "E", "I", "O", "U");
$onlyconsonants = str_replace($vowels, "", "Hello World of PHP");
// Provides: You should eat pizza, beer, and ice cream every day
$phrase = "You should eat fruits, vegetables, and fiber every day.";
$healthy = array("fruits", "vegetables", "fiber");
$yummy = array("pizza", "beer", "ice cream");
$newphrase = str_replace($healthy, $yummy, $phrase);
// Provides: 2
$str = str_replace("ll", "", "good golly miss molly!", $count);
echo $count;
?>
This is a snippet from Kohana 3. I've always found it to be a useful function. It also allows you to censor partial words (or not).
public static function censor($str, $badwords, $replacement = '#', $replace_partial_words = TRUE)
{
foreach ((array) $badwords as $key => $badword)
{
$badwords[$key] = str_replace('\*', '\S*?', preg_quote((string) $badword));
}
$regex = '('.implode('|', $badwords).')';
if ($replace_partial_words === FALSE)
{
$regex = '(?<=\b|\s|^)'.$regex.'(?=\b|\s|$)';
}
$regex = '!'.$regex.'!ui';
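if (strlen($replacement) == 1)
{
// The /e modifier was removed in PHP 7, so build the replacement in a callback instead.
return preg_replace_callback($regex, function ($m) use ($replacement) {
return str_repeat($replacement, strlen($m[1]));
}, $str);
}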
return preg_replace($regex, $replacement, $str);
}
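For the original question (run some code only when the sentence contains one of the listed words), the explode() and array_intersect() answers can be made tolerant of punctuation and letter case. This is a sketch under the assumption that whole-word, ASCII-only matching is wanted and that the banned words are stored in lower case; the function name findBannedWords is hypothetical.

```php
<?php
// Hypothetical helper: returns the banned words that occur in $sentence as whole words.
function findBannedWords(string $sentence, array $banned): array
{
    // Split on runs of non-word characters so "badwordtwo!" still yields "badwordtwo".
    $tokens = preg_split('/\W+/', strtolower($sentence), -1, PREG_SPLIT_NO_EMPTY);
    // array_intersect() keeps the keys of $banned; array_values() renumbers the result.
    return array_values(array_intersect($banned, $tokens));
}

$sentence = "I dont give a BadWordTwo!";
$values = array("badwordone", "badwordtwo", "badwordthree", "badwordfour");

$found = findBannedWords($sentence, $values);
if ($found) {
    // "Do something" goes here; this sketch just reports the hits.
    echo "Got " . implode(", ", $found) . "\n";
}
```

Unlike the strpos() loop, this does not flag a banned word that only appears inside a longer word.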
