PHP - Separate words into different categories? - php

Let's say I have a text document that cannot be changed in any way and needs to be left as is.
Example of what the text document is likely formatted to be:
1. What is soup commonly paired with?
2.
3.
4. Alcohol
5. Water
6. Bread
7. Vegtables
8.
9.
10.
Note:
The numbers are not included, but they are used to represent the small spaces in between the words that are always there.
There is not always a question mark with the question
Note 2:
The question may be on 2 lines sometimes and may look like this below
0. What is soup
1. commonly paired with?
2.
3.
4. Alcohol
5. Water
6. Bread
7. Vegtables
8.
9.
10.
Other:
So how exactly do I separate them, for example into an array?
So like $questions[] and $answers[]
The main problem is that I have nothing to link the questions and answers to:
I can't guess the exact line they are on
And the question doesn't always have a question mark
So there is nothing I can really link it to?

Assuming you have already read the text from the document into a variable $text, you can separate the question and answers by splitting on the first blank line in the text.
$qAndAs = preg_split('/\n\s*\n/', $text, 2, PREG_SPLIT_NO_EMPTY);
The split pattern is a line break (\n), zero or more whitespace characters (\s*), and another line break.
That should give you a two-element array, where [0] is the question and [1] is the answers.
If it doesn't, then something went wrong.
if (count($qAndAs) !== 2) {
// The text from the document didn't fit the expected pattern.
// Decide how to handle that. Maybe throw an exception.
}
After you have separated them, you can remove any new lines from the question
$question = str_replace(["\r", "\n"], ' ', trim($qAndAs[0]));
and split your answers into another array.
$answers = preg_split('/\s*\n\s*/', $qAndAs[1], -1, PREG_SPLIT_NO_EMPTY);
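Putting the steps above together, a minimal sketch could look like the following. The function name splitQuestionAndAnswers and the normalization of line endings are my own additions for illustration, not part of the answer above.

```php
<?php
// Hypothetical helper: splits one question block from its answer block.
function splitQuestionAndAnswers(string $text): array
{
    // Normalize Windows (\r\n) and old Mac (\r) line endings to \n
    $text = trim(str_replace(["\r\n", "\r"], "\n", $text));

    // Split on the first run of blank lines only (limit 2)
    $parts = preg_split('/\n\s*\n/', $text, 2, PREG_SPLIT_NO_EMPTY);
    if (count($parts) !== 2) {
        throw new InvalidArgumentException('Expected a question block and an answer block.');
    }

    // Join a multi-line question into one line
    $question = preg_replace('/\s*\n\s*/', ' ', trim($parts[0]));
    // One answer per non-empty line
    $answers = preg_split('/\s*\n\s*/', trim($parts[1]), -1, PREG_SPLIT_NO_EMPTY);

    return ['question' => $question, 'answers' => $answers];
}

$result = splitQuestionAndAnswers("What is soup\ncommonly paired with?\n\n\nAlcohol\nWater\nBread\n\n\n");
print_r($result);
```

Throwing an exception is only one way to handle text that does not fit the pattern; returning null would work just as well.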

Both solutions accept multiple questions / answers in a single file.
Solution 1 (similar to the other one in this thread):
$questions = array();
$answers = array();
//Split text into questions and answers blocks (2 line breaks or more from each other)
$text = preg_split('/\n{2,}/', $text, -1, PREG_SPLIT_NO_EMPTY);
foreach ($text as $key => $value)
{
//0, 2, 4, ... are questions, 1, 3, 5, ... are answers
if ($key % 2)
{
$answers[] = explode("\n", $value);
}
else
{
$questions[] = str_replace("\n", ' ', $value); //join a multiline question with spaces
}
}
Solution 2 (ugly line by line reading from a file):
//Open the file
$f = fopen("test.txt","r");
//Initialize arrays of all questions and all answers
$all_questions = array();
$all_answers = array();
$is_question = true;
$last = ''; //contains a previous line
//Iterate over lines
while (true)
{
//Get line
$line = fgets($f);
//Check if end of file
$end = ($line === false);
//Trim current line
$line = trim($line);
if ($line != '')
{
//If the previous line was empty, then reset current question and answers
if ($last == '')
{
$question = array();
$answers = array();
}
//Add line of question or answer
if ($is_question)
{
$question[] = $line;
}
else
{
$answers[] = $line;
}
}
else
{
//If the previous line wasn't empty (this also covers reaching the end of file right after a text line), then save question / answers, and toggle $is_question
if ($last != '')
{
if ($is_question)
{
$all_questions[] = implode(' ', $question); //implode to merge multiline question
$is_question = false;
}
else
{
$all_answers[] = $answers;
$is_question = true;
}
}
}
//Break if end of file
if ($end)
{
break;
}
$last = $line;
}
fclose($f);

Related

Check if any of array element is less than 2 characters length within a loop in PHP [duplicate]

This question already has answers here:
The 3 different equals
(5 answers)
Closed 10 months ago.
I'm currently creating a search engine for a database and realized that numbers or short words would mess up the search.
The engine works by splitting the sentence into words, and each word is put into an array called $searches.
An example of a search that creates a mess would look something like:
database info 3
Each word will be looked for everywhere and shown if it matches either search 1 and 2, 1 and 3, or 2 and 3. The problem is that searching for "3" creates a mess, because there are so many things containing just a single character.
So I wonder how I can check the length of all content in an array, and loop this correctly.
This is the code I have so far (Not fully):
$search = $_GET['q']; //Search string
$count = 0; //Counter
$searches = explode(' ', $search); //Creates the array
for ($i = 0; $i < count($searches); $i++) { //Loop for each element in array
if (strlen($searches[$i]) <= 2) { //if it finds an element less than 2 characters
$keyword = $searches[$i]; //remember key word
foreach (glob('database/*/*/*.txt') as $path) { //Look through database
$title = basename($path, ".txt").PHP_EOL; //Get file instead of path
for ($q=0; $q < count($searches); $q++) { //Another loop in case keyword comes last (which it does)
if (($searches[$q] != $keyword) && (strripos($title,$keyword) != false) && (strripos($title,$searches[$q]) != false)) { //check if if keyword is not equal to itself while searching and tries to find a file with a combination of keyword and $searches[$q].
echo $title . '<br>'; //Gives output
$count += 1;
}
}
}
}
}
if($count == 0) {
echo 'NOTHING'; //Nothing
}
It seems like it won't output any files that include both words; basically it is not showing any files. The double loop just makes checking the array length complicated. Any clue how to get this to work?
Well, the first bit you have to take care of is:
$search = $_GET['q'];
Unless you are 100% sure about the security of your input, you should sanitize it.
Then I would check:
if (strlen($searches[$i]) <= 2) {
As written, the body only runs when the word is at most 2 characters long.
The reason I couldn't get this to work was that this line was written incorrectly:
if (($searches[$q] != $keyword) && (strripos($title,$keyword) != false) && (strripos($title,$searches[$q]) != false)) {
I forgot to use !== instead of !=. strripos() returns 0 when the match is at the very start of the title, and 0 != false evaluates to false, so those matches were dropped. Fixing the line like this works:
if (($searches[$q] != $keyword) && (strripos($title,$keyword) !== false) && (strripos($title,$searches[$q]) !== false)) {
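To see the difference between != and !== in isolation, here is a small sketch. The helper name containsWord is made up for this example.

```php
<?php
// Hypothetical helper: true if $needle occurs anywhere in $haystack (case-insensitive).
function containsWord(string $haystack, string $needle): bool
{
    // strripos() returns an int offset (possibly 0) or false, so compare strictly
    return strripos($haystack, $needle) !== false;
}

$title = 'database info 3';
$pos = strripos($title, 'database');       // match at offset 0

var_dump($pos);                             // int(0)
var_dump($pos != false);                    // bool(false): 0 is loosely equal to false
var_dump($pos !== false);                   // bool(true)
var_dump(containsWord($title, 'missing'));  // bool(false)
```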

Filtering repeated phone numbers in PHP

The output has repeated results/rows - something is wrong with the filtering.
While cycling to read the SQL results the output has to meet these conditions:
Condition 1: The message contains no emails: /[a-z0-9_\-\+]+@[a-z0-9\-]+\.([a-z]{2,3})(?:\.[a-z]{2})?/i
Condition 2: The message must contain a Portuguese mobile number, 9 digits,
starting with 91, 92, 93 or 96, and it can have space between each group of 3 numbers: /(9[1236][0-9]) ?([0-9]{3}) ?([0-9]{3})/
910 000 000 and 920123123 are valid matches.
Condition 3: The preg_match result (phone number) must not repeat: if someone has 2 or 3 posts with the same phone number, only the first one should be output.
If the 9-digit phone number has already appeared in this cycle, output the first, skip the rest, and go back to the while cycle.
$array = array();
while ($result = $query->fetch_array()) {
$temp = array();
if (preg_match('/[a-z0-9_\-\+]+@[a-z0-9\-]+\.([a-z]{2,3})(?:\.[a-z]{2})?/i', $result['message']) == false) {
if (preg_match('/(9[1236][0-9]) ?([0-9]{3}) ?([0-9]{3})/', $result['message'], $temp)) {
preg_replace(" ", "", $temp);
foreach($temp as $value) {
if ((array_search($value, $array) == false) && (strlen($value) == 9)) {
array_push($array, $value);
/*HTML OUTPUTS HERE*/
}
}
}
}
}
$array = array();
while ($result = $query->fetch_array()) {
$temp = array();
if (preg_match('/[a-z0-9_\-\+]+@[a-z0-9\-]+\.([a-z]{2,3})(?:\.[a-z]{2})?/i', $result['message']) == false) {
if (preg_match('/(9[1236][0-9]) ?([0-9]{3}) ?([0-9]{3})/', $result['message'], $temp)) {
$phone = str_replace(' ', '', $temp[0]);
if (array_search($phone, $array) === false) {
$array[] = $phone;
}
}
}
}
Some notes:
str_replace is faster than preg_replace for simple, non-regular expression replacements.
array_search can return 0 when the value is in the first element of the array, so you need to use the === operator.
Lastly, there's no need for the last foreach loop as the full phone number is the first element in $temp.
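As an alternative to array_search, the seen numbers can be kept as array keys and checked with isset, which avoids the 0-versus-false pitfall altogether. This is only a sketch: the function name firstMessagePerPhone is made up, it works on a plain array of message strings instead of a query result, and the email filter is left out for brevity.

```php
<?php
// Hypothetical helper: keeps only the first message for each phone number.
function firstMessagePerPhone(array $messages): array
{
    $seen = [];    // phone numbers already output, stored as array keys
    $result = [];

    foreach ($messages as $message) {
        if (!preg_match('/(9[1236][0-9]) ?([0-9]{3}) ?([0-9]{3})/', $message, $m)) {
            continue; // no Portuguese mobile number in this message
        }
        // Join the three captured groups, which drops the optional spaces
        $phone = $m[1] . $m[2] . $m[3];
        if (!isset($seen[$phone])) {
            $seen[$phone] = true;
            $result[] = ['phone' => $phone, 'message' => $message];
        }
    }

    return $result;
}

print_r(firstMessagePerPhone([
    'Call me on 910 000 000',
    'Still selling, 910000000',
    'Other seller: 920123123',
]));
```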
In your query, I think it's best to use DISTINCT to remove repeated results, although it only removes rows that are identical in every selected column.
e.g.:
SELECT DISTINCT * FROM table_name WHERE 1;

Cut text on pieces by 2 sentences PHP [duplicate]

This question already has answers here:
php - explode string at . but ignore decimal eg 2.9
(2 answers)
Closed 9 years ago.
I have a long string of text. I want to store it in an array by 2 sentences per element. I think it should be done by exploding the text around dot+space; however, there are elements like 'Mr.' which I don't know how to exclude from the explode function.
I also don't know how to adjust it to explode the text by 2 sentences, not by 1.
maybe something like:
$min_sentence_length = 100;
$ignore_words = array('mr.','ms.');
$text = "some texing alsie urj skdkd. and siks ekka lls. lorem ipsum some.";
$parts = explode(" ", $text);
$sentences = array();
$cur_sentence = "";
foreach($parts as $part) {
// Append the word first so the sentence keeps its final word
$cur_sentence .= $part . " ";
// Close the element once it is long enough and the word ends a sentence
if (strlen($cur_sentence) > $min_sentence_length &&
substr($part,-1) == "." && !in_array(strtolower($part), $ignore_words)) {
$sentences[] = $cur_sentence;
$cur_sentence = "";
}
}
if (strlen($cur_sentence) > 0)
$sentences[] = $cur_sentence;
The comments on your question link to answers that use preg_split() instead of explode() to describe more precisely how and when to split the input. That might work for you. Another approach would be to split your input on every occurrence of ". " into a temporary array, then loop through that array, piecing it back together however you like, e.g.:
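//strip the final period so that every part can get ". " appended again
$tempArray = explode('. ', rtrim($input, '. '));
$outputArray = array();
$outputElement = '';
$sentenceCount = 0;
foreach($tempArray as $part){
$outputElement .= $part . '. ';
//put other exceptions here, not just "Mr" and "Ms"
if (!preg_match('/\b(Mr|Ms)$/', $part)){
$sentenceCount++;
}
if ($sentenceCount == 2){
$outputArray[] = $outputElement;
$outputElement = '';
$sentenceCount = 0;
}
}
//keep a trailing single sentence, if any
if ($outputElement != ''){
$outputArray[] = $outputElement;
}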
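For comparison, the preg_split() route mentioned above might look roughly like this. It is a sketch under assumptions: the function name splitIntoSentenceGroups is invented, only "Mr." and "Ms." are treated as abbreviations, and a sentence is taken to end at ".", "!" or "?" followed by whitespace.

```php
<?php
// Hypothetical helper: groups the sentences of $text into chunks of $perElement.
function splitIntoSentenceGroups(string $text, int $perElement = 2): array
{
    // Split on whitespace that follows . ! or ?, unless it follows "Mr." or "Ms."
    $pattern = '/(?<!\bMr\.)(?<!\bMs\.)(?<=[.!?])\s+/';
    $sentences = preg_split($pattern, trim($text), -1, PREG_SPLIT_NO_EMPTY);

    // Re-join every $perElement sentences into one array element
    return array_map(
        function (array $chunk) {
            return implode(' ', $chunk);
        },
        array_chunk($sentences, $perElement)
    );
}

print_r(splitIntoSentenceGroups('Mr. Smith went home. He slept. Ms. Jones called. It rained. Done.'));
```

More abbreviations can be covered by adding further negative lookbehinds, as long as each one has a fixed length.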

Navigating thru txt file with php, searching and displaying specific content

I have made a simple form with text fields; when I click the submit button it writes all text field values into a .txt file. Here is an example of the .txt file content:
-----------------------------
How much is 1+1
3
4
5
1
-----------------------------
The first and last lines (----) are there just to separate the data. The first line after the ---- is the question, the line just before the bottom separator (1) is the true answer, and all the values between the question and the true answer are false answers.
What I want to do now is echo out the question, false answers and true answer separately:
echo $question;
print_r ($false_answers); //because it will be an array
echo $true_answer;
I think the solution is strpos, but I don't know how to use it the way I want. Can I do something like this?
Select the 1st line (question) after the 1st separator
Select the 1st line (true answer) before the 2nd separator
Select all values in between the question and the true answer
Note that I'm only showing one example; the .txt file has a lot of these questions separated with -------.
Are my thoughts about using strpos to solve this correct? Any suggestions?
Edit:
Found some function:
$lines = file_get_contents('quiz.txt');
$start = "-----------------------------";
$end = "-----------------------------";
$pattern = sprintf('/%s(.+?)%s/ims',preg_quote($start, '/'), preg_quote($end, '/'));
if (preg_match($pattern, $lines, $matches)) {
list(, $match) = $matches;
echo $match;
}
I think this might work, not sure yet.
You may try this:
$file = fopen("test.txt","r");
$response = array();
while (($line = fgets($file)) !== false) {
$response[] = rtrim($line, "\r\n"); //drop the trailing line break
}
fclose($file);
This way you will get a response array like:
Array(
[0]=>'--------------',
[1]=>'How much is 1+1',
[2]=>'3',
[3]=>'4',
[4]=>'5',
[5]=>'1',
[6]=>'--------------'
)
You could try something like this:
$lines = file_get_contents('quiz.txt');
$lines = str_replace("\r\n", "\n", $lines); //Normalize Windows line endings
$newline = "\n";
$delimiter = "-----------------------------";
$question_blocks = explode($delimiter, $lines);
$questions = array();
foreach ($question_blocks as $qb) {
$qb = trim($qb);
if ($qb === '') continue; //Skip the empty pieces around the separators
$items = explode ($newline, $qb);
$q = array();
$q['question'] = array_shift($items); //First item is the question
$q['true_answer'] = array_pop($items); //Last item is the true answer
$q['false_answers'] = $items; //Rest of items are false answers.
$questions[] = $q;
}
print_r($questions);

how to find out if csv file fields are tab delimited or comma delimited

How can I find out whether the fields of a CSV file are tab delimited or comma delimited? I need PHP validation for this. Can anyone please help? Thanks in advance.
It's too late to answer this question but hope it will help someone.
Here's a simple function that will return a delimiter of a file.
function getFileDelimiter($file, $checkLines = 2){
$file = new SplFileObject($file);
// Double quotes so that "\t" is a real tab character
$delimiters = array(
',',
"\t",
';',
'|',
':'
);
$results = array();
$i = 0;
while($file->valid() && $i < $checkLines){
$line = $file->fgets();
foreach ($delimiters as $delimiter){
$regExp = '/['.$delimiter.']/';
$fields = preg_split($regExp, $line);
// Count the lines on which this delimiter splits the text
if(count($fields) > 1){
if(!empty($results[$delimiter])){
$results[$delimiter]++;
} else {
$results[$delimiter] = 1;
}
}
}
$i++;
}
// No candidate delimiter was found on any checked line
if(empty($results)){
return null;
}
$results = array_keys($results, max($results));
return $results[0];
}
Use this function as shown below:
$delimiter = getFileDelimiter('abc.csv'); //Check 2 lines to determine the delimiter
$delimiter = getFileDelimiter('abc.csv', 5); //Check 5 lines to determine the delimiter
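P.S. I have used preg_split() instead of explode() because explode('\t', $value) with single quotes looks for a literal backslash followed by t rather than a tab; explode("\t", $value) with double quotes works as well.
UPDATE: Thanks to @RichardEB for pointing out a bug in the code. I have updated this now.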
Here's what I do.
Parse the first 5 lines of a CSV file
Count the number of delimiters [tabs, commas and semicolons] in each line
Compare the number of delimiters in each line. If you have a properly formatted CSV, then one of the delimiter counts will match in each row.
This will not work 100% of the time, but it is a decent starting point. At minimum, it will reduce the number of possible delimiters (making it easier for your users to select the correct delimiter).
/* Rearrange this array to change the search priority of delimiters */
$delimiters = array('tab' => "\t",
'comma' => ",",
'semicolon' => ";"
);
$handle = file( $file ); # Grabs the CSV file, loads into array
$line = array(); # Stores the count of delimiters in each row
$valid_delimiter = array(); # Stores Valid Delimiters
# Count the number of Delimiters in Each Row
for ( $i = 1; $i < 6; $i++ ){
foreach ( $delimiters as $key => $value ){
$line[$key][$i] = count( explode( $value, $handle[$i] ) ) - 1;
}
}
# Compare the Count of Delimiters in Each line
foreach ( $line as $delimiter => $count ){
# Check that the first two values are not 0
if ( $count[1] > 0 and $count[2] > 0 ){
$match = true;
$prev_value = '';
foreach ( $count as $value ){
if ( $prev_value != '' )
$match = ( $prev_value == $value and $match == true ) ? true : false;
$prev_value = $value;
}
} else {
$match = false;
}
if ( $match == true ) $valid_delimiter[] = $delimiter;
}//foreach
# Set Default delimiter to comma
$delimiter = ( !empty( $valid_delimiter ) ) ? $valid_delimiter[0] : "comma";
/* !!!! This is good enough for my needs since I have the priority set to "tab"
!!!! but you will want to have the user select from the delimiters in $valid_delimiter
!!!! if multiple delimiter counts match
*/
# The Delimiter for the CSV
echo $delimiters[$delimiter];
There is no 100% reliable way to determine this. What you can do is:
If you have a method to validate the fields you read, try to read a few fields using either separator and validate them against your method. If it breaks, use the other one.
Count the occurrences of tabs and commas in the file. Usually one is significantly higher than the other.
Last but not least: ask the user, and allow them to override your guesses.
I'm just counting the occurrences of the different delimiters in the CSV file, the one with the most should probably be the correct delimiter:
//The delimiters array to look through
$delimiters = array(
'semicolon' => ";",
'tab' => "\t",
'comma' => ",",
);
//Load the csv file into a string
$csv = file_get_contents($file);
$res = array();
foreach ($delimiters as $key => $delim) {
$res[$key] = substr_count($csv, $delim);
}
//reverse sort the values, so the first element is the most frequent delimiter
arsort($res);
reset($res);
$first_key = key($res);
return $delimiters[$first_key];
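Wrapped into a function that takes the CSV text directly, the same counting idea could be sketched as follows. The function name guessDelimiterByCount and its default candidate list are assumptions for this example, and array_key_first() needs PHP 7.3 or later.

```php
<?php
// Hypothetical helper: returns the candidate that occurs most often in $csv.
function guessDelimiterByCount(string $csv, array $candidates = [',', "\t", ';']): string
{
    $counts = [];
    foreach ($candidates as $delim) {
        $counts[$delim] = substr_count($csv, $delim);
    }

    // Sort by count, highest first, keeping the delimiter as the key
    arsort($counts);

    return (string) array_key_first($counts);
}

var_dump(guessDelimiterByCount("name;city;age\nAnn;Lisbon;31\n")); // string(1) ";"
```

Like the snippet above, this can be fooled by text fields that contain many commas, so it is best treated as a first guess.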
In my situation users supply CSV files which are then entered into an SQL database. They may save an Excel spreadsheet as a comma-delimited or tab-delimited file. A program converting the spreadsheet to SQL needs to automatically identify whether fields are tab separated or comma separated.
Many Excel CSV exports have field headings as the first line. The heading text is unlikely to contain commas except as a delimiter. For my situation I counted the commas and tabs in the first line and used whichever count was greater to determine whether the file is comma or tab delimited.
Thanks for all your inputs, I made mine using your tricks : preg_split, fgetcsv, loop, etc.
But I implemented something that was surprisingly not here, the use of fgets instead of reading the whole file, way better if the file is heavy!
Here's the code :
ini_set("auto_detect_line_endings", true); // deprecated since PHP 8.1; only needed for old Mac (\r) line endings
function guessCsvDelimiter($filePath, $limitLines = 5) {
if (!is_readable($filePath) || !is_file($filePath)) {
return false;
}
$delimiters = array(
'tab' => "\t",
'comma' => ",",
'semicolon' => ";"
);
$fp = fopen($filePath, 'r');
$lineResults = array(
'tab' => array(),
'comma' => array(),
'semicolon' => array()
);
$lineIndex = 0;
// Read line by line so that a heavy file is never loaded as a whole
while ($lineIndex < $limitLines && ($line = fgets($fp)) !== false) {
$line = rtrim($line, "\r\n");
if ($line === '') continue;
foreach ($delimiters as $key=>$delimiter) {
// Parse the same line once per candidate delimiter
$lineResults[$key][] = count(str_getcsv($line, $delimiter, '"', '\\')) - 1;
}
$lineIndex++;
}
fclose($fp);
if ($lineIndex === 0) {
return $delimiters['comma']; // nothing to inspect: fall back to comma
}
// Calculating average
foreach ($lineResults as $key=>$entry) {
$lineResults[$key] = array_sum($entry)/count($entry);
}
arsort($lineResults);
$keys = array_keys($lineResults);
// On a tie between the two best candidates, fall back to comma
return ($lineResults[$keys[0]] != $lineResults[$keys[1]]) ? $delimiters[$keys[0]] : $delimiters['comma'];
}
I used @Jay Bhatt's solution for finding out a CSV file's delimiter, but it didn't work for me, so I applied a few fixes and added comments to make the process more understandable.
See my version of @Jay Bhatt's function:
function decide_csv_delimiter($file, $checkLines = 10) {
// use php's built in file parser class for validating the csv or txt file
$file = new SplFileObject($file);
// array of predefined delimiters. Add any more delimiters if you wish
$delimiters = array(',', '\t', ';', '|', ':');
// store all the occurences of each delimiter in an associative array
$number_of_delimiter_occurences = array();
$results = array();
$i = 0; // using 'i' for counting the number of actual row parsed
while ($file->valid() && $i <= $checkLines) {
$line = $file->fgets();
foreach ($delimiters as $idx => $delimiter){
$regExp = '/['.$delimiter.']/';
$fields = preg_split($regExp, $line);
// construct the array with all the keys as the delimiters
// and the values as the number of delimiter occurences, summed over every parsed line
if (!isset($number_of_delimiter_occurences[$delimiter])) {
$number_of_delimiter_occurences[$delimiter] = 0;
}
$number_of_delimiter_occurences[$delimiter] += count($fields) - 1;
}
$i++;
}
// get key of the largest value from the array (comparing only the array values)
// in our case, the array keys are the delimiters
$results = array_keys($number_of_delimiter_occurences, max($number_of_delimiter_occurences));
// in case the delimiter happens to be a 'tab' character ('\t'), return it in double quotes
// otherwise when using as delimiter it will give an error,
// because it is not recognised as a special character for 'tab' key,
// it shows up like a simple string composed of '\' and 't' characters, which is not accepted when parsing csv files
return $results[0] == '\t' ? "\t" : $results[0];
}
I personally use this function for helping automatically parse a file with PHPExcel, and it works beautifully and fast.
I recommend parsing at least 10 lines, for the results to be more accurate. I personally use it with 100 lines, and it is working fast, no delays or lags. The more lines you parse, the more accurate the result gets.
NOTE: This is just a modified version of @Jay Bhatt's solution to the question. All credit goes to @Jay Bhatt.
When I output a TSV file I write the tabs using \t, the same way one would write a line break with \n, so I guess a method could be as follows:
<?php
// Placeholder file name: load the source with file_get_contents() or however you wish
$mysource = file_get_contents('your_file.txt');
if(strpos($mysource, "\t") !== false){
//We have a tab separator
}else{
// it might be CSV
}
?>
I guess this may not be the right approach, because you could have tabs and commas in the actual content as well. It's just an idea. Using regular expressions may be better, although I am not too clued up on that.
you can simply use the fgetcsv(); PHP native function in this way:
function getCsvDelimeter($file)
{
$delimiter = null;
if (($handle = fopen($file, "r")) !== FALSE) {
$delimiters = array(',', ';', '|', ':'); //Put all that need check
foreach ($delimiters AS $item) {
//Go back to the first line so every candidate is tested on the same row
rewind($handle);
$row = fgetcsv($handle, 0, $item, '"', '\\');
//fgetcsv() returns a single-element array if the delimiter is not found
if (is_array($row) && count($row) > 1) {
$delimiter = $item;
break;
}
}
fclose($handle);
}
return $delimiter;
}
Aside from the trivial answer that CSV files are always comma-separated (it's in the name), I don't think you can come up with any hard rules. Both TSV and CSV files are sufficiently loosely specified that you can come up with files that would be acceptable as either.
A\tB,C
1,2\t3
(Assuming \t == TAB)
How would you decide whether this is TSV or CSV?
You can also use fgetcsv (http://php.net/manual/en/function.fgetcsv.php), passing it a delimiter parameter. Note that fgetcsv returns false only on error or at end of file; when the delimiter is wrong, the whole line comes back as a single field, so check the number of fields instead.
Sample to check whether the delimiter is ';':
if (($data = fgetcsv($your_csv_handler, 1000, ';')) !== false && count($data) > 1) { $csv_delimiter = ';'; }
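How about something simple? Note that getCsvControl() reports the delimiter the SplFileObject is currently configured to use (a comma by default) rather than inspecting the file's contents.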
function findDelimiter($filePath, $limitLines = 5){
$file = new SplFileObject($filePath);
$delims = $file->getCsvControl();
return $delims[0];
}
This is my solution.
It works if you know how many columns you expect.
In the end, the separator character is stored in $actual_separation_character.
$separator_1=",";
$separator_2=";";
$separator_3="\t";
$separator_4=":";
$separator_5="|";
$separator_1_number=0;
$separator_2_number=0;
$separator_3_number=0;
$separator_4_number=0;
$separator_5_number=0;
/* YOU NEED TO CHANGE THIS VARIABLE */
// Expected number of separator characters (3 columns ==> 2 separator characters per row)
$expected_separation_character_number=2;
$file = fopen("upload/filename.csv","r");
while(! feof($file)) //read file rows
{
$row= fgets($file);
$row_1_replace=str_replace($separator_1,"",$row);
$row_1_length=strlen($row)-strlen($row_1_replace);
if(($row_1_length==$expected_separation_character_number)or($expected_separation_character_number==0)){
$separator_1_number=$separator_1_number+$row_1_length;
}
$row_2_replace=str_replace($separator_2,"",$row);
$row_2_length=strlen($row)-strlen($row_2_replace);
if(($row_2_length==$expected_separation_character_number)or($expected_separation_character_number==0)){
$separator_2_number=$separator_2_number+$row_2_length;
}
$row_3_replace=str_replace($separator_3,"",$row);
$row_3_length=strlen($row)-strlen($row_3_replace);
if(($row_3_length==$expected_separation_character_number)or($expected_separation_character_number==0)){
$separator_3_number=$separator_3_number+$row_3_length;
}
$row_4_replace=str_replace($separator_4,"",$row);
$row_4_length=strlen($row)-strlen($row_4_replace);
if(($row_4_length==$expected_separation_character_number)or($expected_separation_character_number==0)){
$separator_4_number=$separator_4_number+$row_4_length;
}
$row_5_replace=str_replace($separator_5,"",$row);
$row_5_length=strlen($row)-strlen($row_5_replace);
if(($row_5_length==$expected_separation_character_number)or($expected_separation_character_number==0)){
$separator_5_number=$separator_5_number+$row_5_length;
}
} // while(! feof($file)) END
fclose($file);
/* THE FILE ACTUAL SEPARATOR (delimiter) CHARACTER */
/* $actual_separation_character */
if ($separator_1_number==max($separator_1_number,$separator_2_number,$separator_3_number,$separator_4_number,$separator_5_number)){$actual_separation_character=$separator_1;}
else if ($separator_2_number==max($separator_1_number,$separator_2_number,$separator_3_number,$separator_4_number,$separator_5_number)){$actual_separation_character=$separator_2;}
else if ($separator_3_number==max($separator_1_number,$separator_2_number,$separator_3_number,$separator_4_number,$separator_5_number)){$actual_separation_character=$separator_3;}
else if ($separator_4_number==max($separator_1_number,$separator_2_number,$separator_3_number,$separator_4_number,$separator_5_number)){$actual_separation_character=$separator_4;}
else if ($separator_5_number==max($separator_1_number,$separator_2_number,$separator_3_number,$separator_4_number,$separator_5_number)){$actual_separation_character=$separator_5;}
else {$actual_separation_character=";";}
/*
if the number of columns more than what you expect, do something ...
*/
if ($expected_separation_character_number>0){
if ($separator_1_number==0 and $separator_2_number==0 and $separator_3_number==0 and $separator_4_number==0 and $separator_5_number==0){/* do something ! more columns than expected ! */}
}
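The same idea can be written more compactly with a loop over the candidates. This is a sketch rather than a drop-in replacement: the function name detectSeparator is made up, it takes the rows as an array of strings instead of reading the file itself, and it returns null when no candidate ever matches the expected count, where the code above settles on one of the candidates anyway.

```php
<?php
// Hypothetical helper: picks the separator whose per-row count matches $expectedPerRow most often.
function detectSeparator(array $rows, int $expectedPerRow, array $candidates = [',', ';', "\t", ':', '|']): ?string
{
    $totals = array_fill_keys($candidates, 0);

    foreach ($rows as $row) {
        foreach ($candidates as $sep) {
            $n = substr_count($row, $sep);
            // Only count rows with exactly the expected number of separators
            // (an expected count of 0 means "accept any count")
            if ($expectedPerRow === 0 || $n === $expectedPerRow) {
                $totals[$sep] += $n;
            }
        }
    }

    arsort($totals);
    $best = array_key_first($totals);

    return $totals[$best] > 0 ? (string) $best : null;
}

// 3 columns, so 2 separators are expected per row
var_dump(detectSeparator(["a;b;c", "1;2;3", "x,y;z;w"], 2)); // string(1) ";"
```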
If you have a very large file, for example several GB, use head to copy the first few lines into a temporary file, then open the temporary file in vi:
head test.txt > te1
vi te1
The easiest way to answer this is to open the file in a plain text editor, such as TextMate.
