Regular Expression to match unlimited number of options - php

I want to be able to parse file paths like this one:
/var/www/index.(htm|html|php|shtml)
into an ordered array:
array("htm", "html", "php", "shtml")
and then produce a list of alternatives:
/var/www/index.htm
/var/www/index.html
/var/www/index.php
/var/www/index.shtml
Right now, I have a preg_match statement that can split two alternatives:
preg_match_all ("/\(([^)]*)\|([^)]*)\)/", $path_resource, $matches);
Could somebody give me a pointer how to extend this to accept an unlimited number of alternatives (at least two)? Just regarding the regular expression, the rest I can deal with.
The rule is:
The list needs to start with a ( and close with a )
There must be one | in the list (i.e. at least two alternatives)
Any other occurrence(s) of ( or ) are to remain untouched.
Update: I need to be able to also deal with multiple bracket pairs such as:
/var/(www|www2)/index.(htm|html|php|shtml)
sorry I didn't say that straight away.
Update 2: If you're looking to do what I'm trying to do in the filesystem, note that glob() already provides this functionality out of the box. There is no need to implement a custom solution. See @Gordon's answer below for details.

I think you're looking for:
/\(([^|]+)(\|([^|]+))+\)/
Basically, put the separator '|' into a repeating group.
Also, your words should be made up of 'not pipes' instead of 'not parens', per your third requirement.
Also, prefer + to * for this problem. + means 'at least one'; * means 'zero or more'.
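To make the repeating group concrete, here is a minimal sketch. The helper name expand_alternatives() and the exact pattern are my own assumptions, not something from the question: each option is restricted to characters other than "(", ")" and "|", so a group without a pipe such as "(ignore)" is left alone, and the function recurses to handle several bracket pairs.

```php
<?php
// Hypothetical helper: expands every "(a|b|...)" group that holds at least one "|".
function expand_alternatives($path)
{
    // One "(", a first option, then one or more "|option" repeats, then ")".
    // Options may not contain "(", ")" or "|", so "(ignore)" is left untouched.
    $pattern = '/\(([^()|]+(?:\|[^()|]+)+)\)/';
    if (!preg_match($pattern, $path, $m, PREG_OFFSET_CAPTURE)) {
        return array($path); // nothing left to expand
    }
    $start  = $m[0][1];          // offset of the opening "("
    $length = strlen($m[0][0]);  // length of the whole "(...)" group
    $result = array();
    foreach (explode('|', $m[1][0]) as $option) {
        $expanded = substr($path, 0, $start) . $option . substr($path, $start + $length);
        // Recurse so that later groups in the same path are expanded too.
        $result = array_merge($result, expand_alternatives($expanded));
    }
    return $result;
}

print_r(expand_alternatives('/var/(www|www2)/index.(htm|html)'));
```

The options come back in the order they were written, which matches the "ordered array" requirement in the question.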

Not exactly what you are asking, but what's wrong with just taking what you have to get the list (ignoring the |s), putting it into a variable and then calling explode() on the |s? That would give you an array of however many items there were (including 1 if there wasn't a | present).

Non-regex solution :)
<?php
$test = '/var/www/index.(htm|html|php|shtml)';
/**
* @param string $str "/var/www/index.(htm|html|php|shtml)"
* @return array "/var/www/index.htm", "/var/www/index.php", etc
*/
function expand_bracket_pair($str)
{
// Only get the very last "(" and ignore all others.
$bracketStartPos = strrpos($str, '(');
$bracketEndPos = strrpos($str, ')');
// Split on "|".
$exts = substr($str, $bracketStartPos, $bracketEndPos - $bracketStartPos);
$exts = trim($exts, '()|');
$exts = explode('|', $exts);
// List all possible file names.
$names = array();
$prefix = substr($str, 0, $bracketStartPos);
$affix = substr($str, $bracketEndPos + 1);
foreach ($exts as $ext)
{
$names[] = "{$prefix}{$ext}{$affix}";
}
return $names;
}
function expand_filenames($input)
{
$nbBrackets = substr_count($input, '(');
// Start with the last pair.
$sets = expand_bracket_pair($input);
// Now work backwards and recurse for each generated filename set.
for ($i = 0; $i < $nbBrackets; $i++)
{
foreach ($sets as $k => $set)
{
$sets = array_merge(
$sets,
expand_bracket_pair($set)
);
}
}
// Clean up.
foreach ($sets as $k => $set)
{
if (false !== strpos($set, '('))
{
unset($sets[$k]);
}
}
$sets = array_unique($sets);
sort($sets);
return $sets;
}
var_dump(expand_filenames('/(a|b)/var/(www|www2)/index.(htm|html|php|shtml)'));

Maybe I'm still not getting the question, but my assumption is you are running through the filesystem until you hit one of the files, in which case you could do
$files = glob("$path/index.{htm,html,php,shtml}", GLOB_BRACE);
The resulting array will contain any file matching your extensions in $path or none. If you need to include files by a specific extension order, you can foreach over the array with an ordered list of extensions, e.g.
foreach(array('htm','html','php','shtml') as $ext) {
foreach($files as $file) {
if(pathinfo($file, PATHINFO_EXTENSION) === $ext) {
// do something
}
}
}
Edit: and yes, you can have multiple curly braces in glob.

The answer is given, but it's a funny puzzle and I just couldn't resist
function expand_filenames2($str) {
$r = array($str);
$n = 0;
while(preg_match('~(.*?) \( ( \w+ \| [\w|]+ ) \) (.*) ~x', $r[$n++], $m)) {
foreach(explode('|', $m[2]) as $e)
$r[] = $m[1] . $e . $m[3];
}
return array_slice($r, $n - 1);
}
print_r(expand_filenames2('/(a|b)/var/(ignore)/(www|www2)/index.(htm|html|php|shtml)!'));
maybe this explains a bit why we like regexps that much ;)

Related

Efficient way to check if any of the prefixes stored in comma separated list is the prefix of a word

I have a comma separated list of prefixes stored in a variable
$prefixes = "fa,go,urg";
and a word stored in another variable
$word = "good";
Now I want to know an efficient way to check whether any of the prefixes stored in $prefixes is a prefix of $word.
My intention is
If any of the prefixes stored in $prefixes is the prefix of the word stored in $word return TRUE.
If none of the prefixes stored in $prefixes is the prefix of the word stored in $word return FALSE.
Note: the comma-separated list of prefixes is provided by the user via a text box.
One thing that can be done is to have the prefixes within an array, and then check if $word is present within the array $preArr using in_array
in_array
(PHP 4, PHP 5, PHP 7)
in_array — Checks if a value exists in an array
$prefixes = "fa,go,urg";
$preArr = explode(',', $prefixes); // Convert to array
$word = "good";
if (in_array($word, $preArr)) {
echo "Success!";
} else {
echo "Failure!";
}
The substr function can achieve the desired result: use it to take the beginning of the word and compare that with each prefix.
From the PHP Manual:
substr — Return part of a string
Description
string substr ( string $string , int $start [, int $length ] )
Returns the portion of string specified by the start and length parameters.
Try this:
$prefixes = "fa,go,urg";
$word = "good";
$arr = explode(',', $prefixes); // Convert to array
$found = false;
foreach ($arr as $prefix) {
    // Compare the beginning of $word with the current prefix
    if (substr($word, 0, strlen($prefix)) === $prefix) {
        $found = true;
        break;
    }
}
Your problem can be solved in a few ways. The most straightforward is a simple check: iterate across the prefixes and compare each one against the first N characters of $word, where N = strlen($prefixesInArray[$i]).
function inPrefixArr($prefixes, $word) {
    $prefixesInArray = explode(',', $prefixes);
    for ($i = 0; $i < count($prefixesInArray); $i++) {
        // a prefix longer than the word can never match
        if (strlen($prefixesInArray[$i]) <= strlen($word)) {
            if ($prefixesInArray[$i] == substr($word, 0, strlen($prefixesInArray[$i]))) {
                return true;
            }
        }
    }
    return false;
}
This checks whether any of the prefixes is a prefix of the given word in O(mn) time, where n is the number of prefixes and m is the maximum length of a prefix in the list. For a one-off check you are unlikely to do much better, and it needs no extra space beyond the exploded array.
As it seems you weren't asking for a theoretical/CS question, there are other interesting ways to implement this in other data structures which can yield better runtimes if you do this repeatedly.
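The simple loop described above can also be written with strncmp(), which compares only the first N bytes of two strings. This is a minimal sketch; the helper name has_any_prefix() and the choice to trim whitespace and skip empty entries are my own assumptions about how user-typed input should be treated. On PHP 8.0 and later, str_starts_with() would do the same comparison.

```php
<?php
// Hypothetical helper: true if any comma-separated prefix is a prefix of $word.
function has_any_prefix($prefixes, $word)
{
    foreach (explode(',', $prefixes) as $prefix) {
        $prefix = trim($prefix);   // tolerate "fa, go, urg" typed with spaces
        if ($prefix === '') {
            continue;              // skip empty entries such as "fa,,go"
        }
        // strncmp() compares only the first strlen($prefix) bytes of both strings.
        if (strncmp($word, $prefix, strlen($prefix)) === 0) {
            return true;
        }
    }
    return false;
}

var_dump(has_any_prefix('fa,go,urg', 'good'));
```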

Split an array with a regular expression

I'm wondering if it is possible to truncate an array by using a regular expression.
In particular I have an array like this one:
$array = array("AaBa","AaBb","AaBc","AaCa","AaCb","AaCc","AaDa"...);
I have this string:
$str = "AC";
I'd like the slice of $array from the start to the last occurrence of a string matching /A.C./ (in the sample, "AaCc" at index 5):
$result = array("AaBa","AaBb","AaBc","AaCa","AaCb","AaCc");
How can I do this? I thought I might use array_slice, but I don't know how to use a RegEx with it.
Here's my bid
function split_by_contents($ary, $pattern){
if (!is_array($ary)) return FALSE; // brief error checking
// keep track of the last position we had a match, and the current
// position we're searching
$last = -1; $c = 0;
// iterate over the array
foreach ($ary as $k => $v){
// check for a pattern match
if (preg_match($pattern, $v)){
// we found a match, record it
$last = $c;
}
// increment place holder
$c++;
}
// if we found a match, return up until the last match
// if we didn't find one, return what was passed in
return $last != -1 ? array_slice($ary, 0, $last + 1) : $ary;
}
Update
My original answer has a $limit argument that served no purpose. I did originally have a different direction I was going to go with the solution, but decided to keep it simple. However, below is the version that implements that $limit. So...
function split_by_contents($ary, $pattern, $limit = 0){
// really simple error checking
if (!is_array($ary)) return FALSE;
// track the location of the last match, the index of the
// element we're on, and how many matches we have found
$last = -1; $c = 0; $matches = 0;
// iterate over all items (use foreach to keep key integrity)
foreach ($ary as $k => $v){
// test for a pattern match
if (preg_match($pattern, $v)){
// record the last position of a match
$last = $c;
// if there is a specified limit, capture up until
// $limit number of matches, then exit the loop
// and return what we have
if ($limit > 0 && ++$matches == $limit){
break;
}
}
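// increment position counter
$c++;
}
// if we found a match, return up until the last match
// if we didn't find one, return what was passed in
return $last != -1 ? array_slice($ary, 0, $last + 1) : $ary;
}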
I think the easiest way might be with a foreach loop, then using a regex against each value - happy to be proven wrong though!
One alternative could be to implode the array first...
$array = array("AaBa","AaBb","AaBc","AaCa","AaCb","AaCc","AaDa"...);
$string = implode('~~',$array);
//Some regex to split the string up as you want, guessing something like
// '!~~A.C.~~!' will match just what you want?
$result = explode('~~',$string);
If you'd like a hand with the regex I can do, just not 100% on exactly what you're asking - the "A*C*"-->"AaCc" bit I'm not too sure on?
Assuming incremental numeric indices starting from 0
$array = array("AaBa","AaBb","AaBc","AaCa","AaCb","AaCc","AaDa");
$str = "AC";
$regexpSearch = '/^'.implode('.',str_split($str)).'.$/';
$slicedArray = array_slice($array,
0,
array_pop(array_keys(array_filter($array,
function($entry) use ($regexpSearch) {
return preg_match($regexpSearch,$entry);
}
)
)
)+1
);
var_dump($slicedArray);
This requires PHP >= 5.3.0 and will give a
Strict standards: Only variables should be passed by reference
notice. And if no match is found, it will still return the first element.
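Both caveats can be avoided by collecting the matching keys into a variable first. The following is only a sketch: slice_to_last_match() is a hypothetical name, it assumes 0-based sequential numeric keys like the answer above, and returning an empty array when nothing matches is my assumption about the wanted behaviour.

```php
<?php
// Hypothetical helper: slice $array from the start up to the last entry matching $regex.
function slice_to_last_match(array $array, $regex)
{
    // preg_grep() keeps the original keys of the matching entries.
    $matchingKeys = array_keys(preg_grep($regex, $array));
    if (!$matchingKeys) {
        return array();            // no match at all (assumed behaviour)
    }
    $lastKey = end($matchingKeys); // end() receives a real variable, so no notice
    return array_slice($array, 0, $lastKey + 1);
}

$array = array("AaBa", "AaBb", "AaBc", "AaCa", "AaCb", "AaCc", "AaDa");
print_r(slice_to_last_match($array, '/^A.C.$/'));
```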

Read a .info file with PHP

I've created a .info file similar to how you would in drupal.
#Comment
Template Name = Valley
styles[] = styles/styles.css, styles/media.css
scripts[] = js/script.js
I want to use PHP to get each variable and its value. For example, I'd like to put the Template Name value into a PHP variable called Template Name and put the styles[] values in an array if there is more than one.
I'd also need to avoid picking up comments, which are marked by a hash # before the text.
It seems a lot to ask, but I'm really not sure how to go about doing this. If someone has a solution I'd be very grateful; if someone could just point me in the right direction, that would be just as helpful.
Thanks in advance!
If you can adjust your info file slightly, you can use a built-in PHP function:
http://php.net/manual/en/function.parse-ini-file.php
#Comment
TemplateName = Valley
styles[] = "styles/styles.css"
styles[] = "styles/media.css"
scripts[] = "js/script.js"
which will result in an array
If all you're after is something "similar" you could take a look at the parse_ini_file() function.
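As a rough illustration of the INI route, here is a sketch using parse_ini_string(), the string counterpart of parse_ini_file(), so that it runs without a file on disk. The comment line uses ";" rather than "#" because, as far as I know, newer PHP versions no longer treat "#" as a comment marker in INI input; the variable names are just for illustration.

```php
<?php
// The adjusted .info contents inline, so the sketch needs no file on disk.
$ini = <<<'INI'
; Comment
TemplateName = Valley
styles[] = "styles/styles.css"
styles[] = "styles/media.css"
scripts[] = "js/script.js"
INI;

$info = parse_ini_string($ini);

$templateName = $info['TemplateName']; // the string "Valley"
$styles       = $info['styles'];       // array holding the two stylesheet paths
$scripts      = $info['scripts'];      // array holding the one script path

print_r($info);
```

With a real file, parse_ini_file($filename) should return the same structure.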
Drupal was a good hint:
function drupal_parse_info_file($filename) {
$info = array();
$constants = get_defined_constants();
if (!file_exists($filename)) {
return $info;
}
$data = file_get_contents($filename);
if (preg_match_all('
#^\s* # Start at the beginning of a line, ignoring leading whitespace
((?:
[^=;\[\]]| # Key names cannot contain equal signs, semi-colons or square brackets,
\[[^\[\]]*\] # unless they are balanced and not nested
)+?)
\s*=\s* # Key/value pairs are separated by equal signs (ignoring white-space)
(?:
("(?:[^"]|(?<=\\\\)")*")| # Double-quoted string, which may contain slash-escaped quotes/slashes
(\'(?:[^\']|(?<=\\\\)\')*\')| # Single-quoted string, which may contain slash-escaped quotes/slashes
([^\r\n]*?) # Non-quoted string
)\s*$ # Stop at the next end of a line, ignoring trailing whitespace
#msx', $data, $matches, PREG_SET_ORDER)) {
foreach ($matches as $match) {
// Fetch the key and value string
$i = 0;
foreach (array('key', 'value1', 'value2', 'value3') as $var) {
$$var = isset($match[++$i]) ? $match[$i] : '';
}
$value = stripslashes(substr($value1, 1, -1)) . stripslashes(substr($value2, 1, -1)) . $value3;
// Parse array syntax
$keys = preg_split('/\]?\[/', rtrim($key, ']'));
$last = array_pop($keys);
$parent = &$info;
// Create nested arrays
foreach ($keys as $key) {
if ($key == '') {
$key = count($parent);
}
if (!isset($parent[$key]) || !is_array($parent[$key])) {
$parent[$key] = array();
}
$parent = &$parent[$key];
}
// Handle PHP constants.
if (isset($constants[$value])) {
$value = $constants[$value];
}
// Insert actual value
if ($last == '') {
$last = count($parent);
}
$parent[$last] = $value;
}
}
return $info;
}
Source, this function is part of the drupal code-base, drupal's license applies, used for documentation purposes here only.

PHP: Check string for certain words

How can I check if data submitted from a form or querystring has certain words in it?
I'm trying to look for words containing admin, drop, create etc in form [Post] data and querystring data so I can accept or reject it.
I'm converting from ASP to PHP. I used to do this using an array in ASP (keep all illegal words in a string and use ubound to check the whole string for those words), but is there a better (efficient) way to do this in PHP?
Eg: A string like this would be rejected: "The administrator dropped a blah blah" because it has admin and drop in it.
I intend using this to check usernames when creating accounts and for other things too.
Thanks
You could use stripos()
int stripos ( string $haystack , string $needle [, int $offset = 0 ] )
You could have a function like:
function checkBadWords($str, $badwords) {
foreach ($badwords as $word) {
if (stripos(" $str ", " $word ") !== false) {
return false;
}
}
return true;
}
And to use it:
if (!checkBadWords('something admin', array('admin'))) {
// ...
}
strpos() will let you search for a substring within a larger string. It's quick and works well. It returns false if the string's not found, and a number (which could be zero, so you need to use === to check) if it finds the string.
stripos() is a case-insensitive version of the same.
I'm trying to look for words containing admin, drop, create etc in form [Post] data and querystring data so I can accept or reject it.
I suspect that you are trying to filter the string so it's suitable for including in something like a database query. If this is the case, this is probably not a good way to go about it, and you'd actually need to escape the string using mysql_real_escape_string() or equivalent.
$badwords = array("admin", "drop",);
foreach (str_word_count($string, 1) as $word) {
foreach ($badwords as $bw) {
if (strpos($word, $bw) === 0) {
//contains word $word that starts with bad word $bw
}
}
}
For JGB146, here is a performance comparison with regular expressions:
<?php
function has_bad_words($badwords, $string) {
foreach (str_word_count($string, 1) as $word) {
foreach ($badwords as $bw) {
if (stripos($word, $bw) === 0) {
return true;
}
}
return false;
}
}
function has_bad_words2($badwords, $string) {
$regex = array_map(function ($w) {
return "(?:\\b". preg_quote($w, "/") . ")"; }, $badwords);
$regex = "/" . implode("|", $regex) . "/";
return preg_match($regex, $string) != 0;
}
$badwords = array("abc", "def", "ghi", "jkl", "mnop");
$string = "The quick brown fox jumps over the lazy dog";
$start = microtime(true);
for ($i = 0; $i < 10000; $i++) {
has_bad_words($badwords, $string);
}
echo "elapsed: ". (microtime(true) - $start);
$start = microtime(true);
for ($i = 0; $i < 10000; $i++) {
has_bad_words2($badwords, $string);
}
echo "elapsed: ". (microtime(true) - $start);
Example output:
elapsed: 0.076514959335327
elapsed: 0.29999899864197
So regular expressions are much slower.
You could use regular expression like this:
preg_match("~(admin)|(drop)|(another token)|(yet another)~",$subject);
Building the pattern string from an array:
$pattern = implode(")|(", $banned_words);
$pattern = "~(".$pattern.")~";
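If the banned words can contain regex metacharacters, each one should go through preg_quote() before being joined. A minimal sketch of that idea follows; contains_banned_word() is a hypothetical name, and making the match case-insensitive is my assumption about what the asker wants.

```php
<?php
// Hypothetical helper: true if $subject contains any of the banned words.
function contains_banned_word($subject, array $bannedWords)
{
    if (!$bannedWords) {
        return false; // an empty alternation would match every string
    }
    // Escape each word so characters like "." or "~" are matched literally.
    $quoted = array();
    foreach ($bannedWords as $word) {
        $quoted[] = preg_quote($word, '~');
    }
    // (?:...) groups without capturing; the "i" flag makes the match case-insensitive.
    $pattern = '~(?:' . implode('|', $quoted) . ')~i';
    return preg_match($pattern, $subject) === 1;
}

var_dump(contains_banned_word('The administrator dropped a table', array('admin', 'drop')));
```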
function check($string, $array) {
foreach($array as $item) {
if( preg_match("/($item)/", $string) )
return true;
}
return false;
}
You can certainly do a loop, as others have suggested. But I think you can get closer to the behavior you're looking for with an operation that directly uses arrays, plus it allows execution via a single if statement.
Originally, I was thinking you could do this with a simple preg_match() call (hence the downvote), however preg_match does not support arrays. Instead, you can do a replacement via preg_replace to have all rejected strings replaced with nothing, and then check to see if the string is changed. This is simple and avoids requiring a loop iteration for each rejected string.
$rejectedStrs = array("/admin/", "/drop/", "/create/");
if($input == preg_replace($rejectedStrs, "", $input)) {
//do stuff
} else {
//reject
}
Note also that you can provide case-insensitive searches by using the i flag on the regex patterns, changing the array of patterns to $rejectedStrs = array("/admin/i", "/drop/i", "/create/i");
On Efficiency
There has been some debate about the efficiency of doing it this way vs the accepted nested loop method. I ran some tests and found the preg_replace method executed around twice as fast as the nested loop. Here is the code and output of those tests:
$input = "You can certainly do a loop, as others have suggested. But I think you can get closer to the behavior you're looking for with an operation that directly uses arrays, plus it allows execution via a single if statement. You can certainly do a loop, as others have suggested. But I think you can get closer to the behavior you're looking for with an operation that directly uses arrays, plus it allows execution via a single if statement.";
$input = "Short string with no matches";
$input2 = "Longer string with a lot more words but still no matches. Longer string with a lot more words but still no matches. Longer string with a lot more words but still no matches. Longer string with a lot more words but still no matches. Longer string with a lot more words but still no matches. Longer string with a lot more words but still no matches. Longer string with a lot more words but still no matches. ";
$input3 = "Short string which loop will match quickly";
$input4 = "Longer string that will eventually be matches but first has a lot of words, followed by more words and then more words, followed by more words and then more words, followed by more words and then more words, followed by more words and then more words, followed by more words and then more words, followed by more words and then more words, followed by more words and then more words, followed by more words and then more words and then finally the word create near the end";
$start1 = microtime(true);
$rejectedStrs = array("/loop/", "/operation/", "/create/");
$p_matches = 0;
for ($i = 0; $i < 10000; $i++) {
if (preg_check($rejectedStrs, $input)) $p_matches++;
if (preg_check($rejectedStrs, $input2)) $p_matches++;
if (preg_check($rejectedStrs, $input3)) $p_matches++;
if (preg_check($rejectedStrs, $input4)) $p_matches++;
}
$start2 = microtime(true);
$rejectedStrs = array("loop", "operation", "create");
$l_matches = 0;
for ($i = 0; $i < 10000; $i++) {
if (loop_check($rejectedStrs, $input)) $l_matches++;
if (loop_check($rejectedStrs, $input2)) $l_matches++;
if (loop_check($rejectedStrs, $input3)) $l_matches++;
if (loop_check($rejectedStrs, $input4)) $l_matches++;
}
$end = microtime(true);
echo "preg_match: ".$start1." ".$start2."= ".($start2-$start1)."\nloop_match: ".$start2." ".$end."=".($end-$start2);
function preg_check($rejectedStrs, $input) {
if($input == preg_replace($rejectedStrs, "", $input))
return true;
return false;
}
function loop_check($badwords, $string) {
foreach (str_word_count($string, 1) as $word) {
foreach ($badwords as $bw) {
if (stripos($word, $bw) === 0) {
return true;
}
}
return false;
}
}
Output:
preg_match: 1281908071.4032 1281908071.9947= 0.5915060043335
loop_match: 1281908071.9947 1281908073.006=1.0112948417664
This is actually pretty simple, use substr_count.
An example for you would be:
if (substr_count($variable_to_search, "drop"))
{
echo "error";
}
And to make things even simpler, put your keywords (ie. "drop", "create", "alter") in an array and use foreach to check them. That way you cover all your words. An example
foreach ($keywordArray as $keyword)
{
if (substr_count($variable_to_search, $keyword))
{
echo "error"; //or do whatever you want to do when you find something you don't like
}
}

how to find out if csv file fields are tab delimited or comma delimited

how to find out if csv file fields are tab delimited or comma delimited. I need php validation for this. Can anyone plz help. Thanks in advance.
It's too late to answer this question but hope it will help someone.
Here's a simple function that will return a delimiter of a file.
function getFileDelimiter($file, $checkLines = 2){
$file = new SplFileObject($file);
$delimiters = array(
',',
'\t',
';',
'|',
':'
);
$results = array();
$i = 0;
while($file->valid() && $i <= $checkLines){
$line = $file->fgets();
foreach ($delimiters as $delimiter){
$regExp = '/['.$delimiter.']/';
$fields = preg_split($regExp, $line);
if(count($fields) > 1){
if(!empty($results[$delimiter])){
$results[$delimiter]++;
} else {
$results[$delimiter] = 1;
}
}
}
$i++;
}
$results = array_keys($results, max($results));
return $results[0];
}
Use this function as shown below:
$delimiter = getFileDelimiter('abc.csv'); //Check 2 lines to determine the delimiter
$delimiter = getFileDelimiter('abc.csv', 5); //Check 5 lines to determine the delimiter
P.S I have used preg_split() instead of explode() because explode('\t', $value) won't give proper results.
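The reason explode('\t', ...) fails is PHP's quoting rules: escape sequences such as \t are only interpreted inside double quotes. A short sketch with a made-up sample line shows the difference:

```php
<?php
$line = "a\tb\tc";              // three fields separated by real tab characters

var_dump(strlen('\t'));         // int(2): a backslash followed by the letter t
var_dump(strlen("\t"));         // int(1): a single tab character

$wrong = explode('\t', $line);  // searches for the two-character string, finds nothing
$right = explode("\t", $line);  // splits on the tab character

var_dump(count($wrong));        // int(1)
var_dump(count($right));        // int(3)
```

So explode("\t", $value) with double quotes works as well; preg_split() copes with '\t' only because the regex engine itself understands the \t escape.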
UPDATE: Thanks to @RichardEB for pointing out a bug in the code. I have updated this now.
Here's what I do.
Parse the first 5 lines of a CSV file
Count the number of delimiters [commas, tabs, semicolons and colons] in each line
Compare the number of delimiters in each line. If you have a properly formatted CSV, then one of the delimiter counts will match in each row.
This will not work 100% of the time, but it is a decent starting point. At minimum, it will reduce the number of possible delimiters (making it easier for your users to select the correct delimiter).
/* Rearrange this array to change the search priority of delimiters */
$delimiters = array('tab' => "\t",
'comma' => ",",
'semicolon' => ";"
);
$handle = file( $file ); # Grabs the CSV file, loads into array
$line = array(); # Stores the count of delimiters in each row
$valid_delimiter = array(); # Stores Valid Delimiters
# Count the number of Delimiters in Each Row
for ( $i = 1; $i < 6; $i++ ){
foreach ( $delimiters as $key => $value ){
$line[$key][$i] = count( explode( $value, $handle[$i] ) ) - 1;
}
}
# Compare the Count of Delimiters in Each line
foreach ( $line as $delimiter => $count ){
# Check that the first two values are not 0
if ( $count[1] > 0 and $count[2] > 0 ){
$match = true;
$prev_value = '';
foreach ( $count as $value ){
if ( $prev_value != '' )
$match = ( $prev_value == $value and $match == true ) ? true : false;
$prev_value = $value;
}
} else {
$match = false;
}
if ( $match == true ) $valid_delimiter[] = $delimiter;
}//foreach
# Set Default delimiter to comma
$delimiter = !empty($valid_delimiter) ? $valid_delimiter[0] : "comma";
/* !!!! This is good enough for my needs since I have the priority set to "tab"
!!!! but you will want to have the user select from the delimiters in $valid_delimiter
!!!! if multiple delimiter counts match
*/
# The Delimiter for the CSV
echo $delimiters[$delimiter];
There is no 100% reliable way to determine this. What you can do is:
If you have a method to validate the fields you read, try to read a few fields using either separator and validate against your method. If it breaks, use the other one.
Count the occurrences of tabs and commas in the file. Usually one is significantly higher than the other.
Last but not least: ask the user, and allow them to override your guesses.
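One way to combine the first two ideas is to accept a candidate only if it splits every sampled line into the same number of fields. This is a sketch under my own assumptions: guess_delimiter() is a hypothetical name, the candidate list and its order are arbitrary, and a null result is meant to signal "ask the user".

```php
<?php
// Hypothetical helper: guess the delimiter from a few sample lines.
// Returns the first candidate that splits every line into the same number (> 1)
// of fields, or null if no candidate does.
function guess_delimiter(array $lines, array $candidates = array(",", "\t", ";"))
{
    foreach ($candidates as $delimiter) {
        $counts = array();
        foreach ($lines as $line) {
            // str_getcsv() respects quoted fields, so a delimiter inside quotes is not split.
            $counts[] = count(str_getcsv($line, $delimiter, '"', '\\'));
        }
        // Consistent across all lines, and actually splitting something.
        if ($counts && min($counts) > 1 && min($counts) === max($counts)) {
            return $delimiter;
        }
    }
    return null;
}

$sample = array("id;name;note", "1;Ann;likes tea, coffee", "2;Bob;none");
var_dump(guess_delimiter($sample));
```

In the sample, the comma appears in only one line, so it is rejected, while the semicolon yields three fields on every line.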
I'm just counting the occurrences of the different delimiters in the CSV file, the one with the most should probably be the correct delimiter:
//The delimiters array to look through
$delimiters = array(
'semicolon' => ";",
'tab' => "\t",
'comma' => ",",
);
//Load the csv file into a string
$csv = file_get_contents($file);
foreach ($delimiters as $key => $delim) {
$res[$key] = substr_count($csv, $delim);
}
//reverse sort the values, so the [0] element has the most occured delimiter
arsort($res);
reset($res);
$first_key = key($res);
return $delimiters[$first_key];
In my situation users supply CSV files which are then entered into an SQL database. They may save an Excel spreadsheet as a comma- or tab-delimited file. A program converting the spreadsheet to SQL needs to automatically identify whether fields are tab- or comma-separated.
Many Excel CSV exports have field headings as the first line. The heading text is unlikely to contain commas except as a delimiter. For my situation I counted the commas and tabs in the first line and used whichever count was greater to determine whether the file is comma- or tab-delimited.
Thanks for all your inputs, I made mine using your tricks : preg_split, fgetcsv, loop, etc.
But I implemented something that was surprisingly not here, the use of fgets instead of reading the whole file, way better if the file is heavy!
Here's the code :
ini_set("auto_detect_line_endings", true);
function guessCsvDelimiter($filePath, $limitLines = 5) {
if (!is_readable($filePath) || !is_file($filePath)) {
return false;
}
$delimiters = array(
'tab' => "\t",
'comma' => ",",
'semicolon' => ";"
);
$fp = fopen($filePath, 'r', false);
$lineResults = array(
'tab' => array(),
'comma' => array(),
'semicolon' => array()
);
$lineIndex = 0;
while (($line = fgets($fp)) !== false) {
foreach ($delimiters as $key=>$delimiter) {
// Parse the line we just read; calling fgetcsv() here would consume further lines
$lineResults[$key][$lineIndex] = count(str_getcsv($line, $delimiter)) - 1;
}
$lineIndex++;
if ($lineIndex > $limitLines) break;
}
fclose($fp);
// Calculating average
foreach ($lineResults as $key=>$entry) {
$lineResults[$key] = array_sum($entry)/count($entry);
}
arsort($lineResults);
reset($lineResults);
// Compare the two highest averages; on a tie, fall back to comma
$averages = array_values($lineResults);
return ($averages[0] != $averages[1]) ? $delimiters[key($lineResults)] : $delimiters['comma'];
}
I used @Jay Bhatt's solution for finding out a CSV file's delimiter, but it didn't work for me, so I applied a few fixes and added comments to make the process more understandable.
See my version of @Jay Bhatt's function:
function decide_csv_delimiter($file, $checkLines = 10) {
// use php's built in file parser class for validating the csv or txt file
$file = new SplFileObject($file);
// array of predefined delimiters. Add any more delimiters if you wish
$delimiters = array(',', '\t', ';', '|', ':');
// store all the occurences of each delimiter in an associative array
$number_of_delimiter_occurences = array();
$results = array();
$i = 0; // using 'i' for counting the number of actual row parsed
while ($file->valid() && $i <= $checkLines) {
$line = $file->fgets();
foreach ($delimiters as $idx => $delimiter){
$regExp = '/['.$delimiter.']/';
$fields = preg_split($regExp, $line);
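// accumulate the number of delimiter occurrences across all parsed lines
// (keys are the delimiters, values are the running totals)
$number_of_delimiter_occurences[$delimiter] = (isset($number_of_delimiter_occurences[$delimiter]) ? $number_of_delimiter_occurences[$delimiter] : 0) + count($fields) - 1;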
}
$i++;
}
// get key of the largest value from the array (comapring only the array values)
// in our case, the array keys are the delimiters
$results = array_keys($number_of_delimiter_occurences, max($number_of_delimiter_occurences));
// in case the delimiter happens to be a 'tab' character ('\t'), return it in double quotes
// otherwise when using as delimiter it will give an error,
// because it is not recognised as a special character for 'tab' key,
// it shows up like a simple string composed of '\' and 't' characters, which is not accepted when parsing csv files
return $results[0] == '\t' ? "\t" : $results[0];
}
I personally use this function for helping automatically parse a file with PHPExcel, and it works beautifully and fast.
I recommend parsing at least 10 lines, for the results to be more accurate. I personally use it with 100 lines, and it is working fast, no delays or lags. The more lines you parse, the more accurate the result gets.
NOTE: This is just a modified version of @Jay Bhatt's solution to the question. All credit goes to @Jay Bhatt.
When I output a TSV file I write the tabs using \t, the same way one would write a line break as \n. So I guess a method could be as follows:
<?php
$mysource = file_get_contents($file); // or however you wish to get the source
if (strpos($mysource, "\t") !== false) {
//We have a tab separator
}else{
// it might be CSV
}
?>
I Guess this may not be the right manner, because you could have tabs and commas in the actual content as well. It's just an idea. Using regular expressions may be better, although I am not too clued up on that.
You can simply use the native PHP function fgetcsv() in this way:
function getCsvDelimeter($file)
{
    if (($handle = fopen($file, "r")) !== FALSE) {
        $delimiters = array(',', ';', '|', ':'); //Put all that need check
        foreach ($delimiters AS $item) {
            rewind($handle); // test every delimiter against the first line
            //fgetcsv() returns a single-element array if the delimiter is not found
            $fields = fgetcsv($handle, 0, $item, '"');
            if (is_array($fields) && count($fields) > 1) {
                $delimiter = $item;
                break;
            }
        }
        fclose($handle);
    }
    return (isset($delimiter) ? $delimiter : null);
}
Aside from the trivial answer that CSV files are always comma-separated (it's in the name), I don't think you can come up with any hard rules. Both TSV and CSV files are sufficiently loosely specified that you can come up with files that would be acceptable as either.
A\tB,C
1,2\t3
(Assuming \t == TAB)
How would you decide whether this is TSV or CSV?
You can also use fgetcsv (http://php.net/manual/en/function.fgetcsv.php), passing it a delimiter parameter. If the delimiter does not occur in the line, the function returns an array with a single element, so check the number of fields rather than only testing for false (false signals end of file or a read error).
Sample to check if the delimiter is ';':
if (($data = fgetcsv($your_csv_handler, 1000, ';')) !== false && count($data) > 1) { $csv_delimiter = ';'; }
How about something simple?
function findDelimiter($filePath, $limitLines = 5){
$file = new SplFileObject($filePath);
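// Note: getCsvControl() returns the delimiter currently set on the object (',' by default); it does not inspect the file contents
$delims = $file->getCsvControl();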
return $delims[0];
}
This is my solution.
It works if you know how many columns you expect.
In the end, the separator character is stored in $actual_separation_character
$separator_1=",";
$separator_2=";";
$separator_3="\t";
$separator_4=":";
$separator_5="|";
$separator_1_number=0;
$separator_2_number=0;
$separator_3_number=0;
$separator_4_number=0;
$separator_5_number=0;
/* YOU NEED TO CHANGE THIS VARIABLE */
// Expected number of separator characters per row (3 columns ==> 2 separator characters / row)
$expected_separation_character_number=2;
$file = fopen("upload/filename.csv","r");
while(! feof($file)) //read file rows
{
$row= fgets($file);
$row_1_replace=str_replace($separator_1,"",$row);
$row_1_length=strlen($row)-strlen($row_1_replace);
if(($row_1_length==$expected_separation_character_number)or($expected_separation_character_number==0)){
$separator_1_number=$separator_1_number+$row_1_length;
}
$row_2_replace=str_replace($separator_2,"",$row);
$row_2_length=strlen($row)-strlen($row_2_replace);
if(($row_2_length==$expected_separation_character_number)or($expected_separation_character_number==0)){
$separator_2_number=$separator_2_number+$row_2_length;
}
$row_3_replace=str_replace($separator_3,"",$row);
$row_3_length=strlen($row)-strlen($row_3_replace);
if(($row_3_length==$expected_separation_character_number)or($expected_separation_character_number==0)){
$separator_3_number=$separator_3_number+$row_3_length;
}
$row_4_replace=str_replace($separator_4,"",$row);
$row_4_length=strlen($row)-strlen($row_4_replace);
if(($row_4_length==$expected_separation_character_number)or($expected_separation_character_number==0)){
$separator_4_number=$separator_4_number+$row_4_length;
}
$row_5_replace=str_replace($separator_5,"",$row);
$row_5_length=strlen($row)-strlen($row_5_replace);
if(($row_5_length==$expected_separation_character_number)or($expected_separation_character_number==0)){
$separator_5_number=$separator_5_number+$row_5_length;
}
} // while(! feof($file)) END
fclose($file);
/* THE FILE ACTUAL SEPARATOR (delimiter) CHARACTER */
/* $actual_separation_character */
if ($separator_1_number==max($separator_1_number,$separator_2_number,$separator_3_number,$separator_4_number,$separator_5_number)){$actual_separation_character=$separator_1;}
else if ($separator_2_number==max($separator_1_number,$separator_2_number,$separator_3_number,$separator_4_number,$separator_5_number)){$actual_separation_character=$separator_2;}
else if ($separator_3_number==max($separator_1_number,$separator_2_number,$separator_3_number,$separator_4_number,$separator_5_number)){$actual_separation_character=$separator_3;}
else if ($separator_4_number==max($separator_1_number,$separator_2_number,$separator_3_number,$separator_4_number,$separator_5_number)){$actual_separation_character=$separator_4;}
else if ($separator_5_number==max($separator_1_number,$separator_2_number,$separator_3_number,$separator_4_number,$separator_5_number)){$actual_separation_character=$separator_5;}
else {$actual_separation_character=";";}
/*
if the number of columns more than what you expect, do something ...
*/
if ($expected_separation_character_number>0){
if ($separator_1_number==0 and $separator_2_number==0 and $separator_3_number==0 and $separator_4_number==0 and $separator_5_number==0){/* do something ! more columns than expected ! */}
}
If you have a very large file (e.g. several GB), use head to put the first few lines into a temporary file, then open the temporary file in vi:
head test.txt > te1
vi te1
The easiest way to answer this is to open the file in a plain text editor such as TextMate.
