File with irregular spaces and tabs split/explode columnwise


I have a very old file with thousands of lines (probably written by hand) that I'm trying to load into a relational database, but the lines don't follow a consistent format that would let me split them into columns. For example, the lines in the file look like:
blah blahsdfas laslkdlasdj aksdjla
sldks slslsl lsdlksldj lsdjlfslk
Looking at it, I can tell there are four fields. I first tried awk, but it didn't print the columns as expected because the gap between columns is neither a tab nor a fixed number of spaces.
Do you think it's possible to extract the fields? If so, can someone help with a PHP snippet?

Using preg_split(), you can break the line up using one or more whitespace characters as the delimiters:
$lines = file('filename', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
foreach ($lines as $line)
{
    $pieces = preg_split('/\s+/', $line);
    // do something with pieces
}

It looks like preg_split('/\s{2,}/', $line) would split this apart; that pattern splits on two or more whitespace characters.
If this has been maintained by hand, you may have to do some manual cleanup (e.g., maybe someone typed two spaces but didn't intend to start the next column). At only thousands of lines, manual cleanup is thankfully only tedious, not impossible.
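
To compare the two patterns from the answers above, here is a minimal sketch. The helper name splitColumns() and its $minGap parameter are made up for illustration, and the PREG_SPLIT_NO_EMPTY flag is an addition that drops empty pieces caused by leading or trailing whitespace.

```php
<?php
// Hypothetical helper: split one line into columns on runs of whitespace.
// $minGap = 1 behaves like /\s+/; $minGap = 2 behaves like /\s{2,}/,
// which keeps single spaces inside a field.
function splitColumns(string $line, int $minGap = 1): array
{
    $pattern = '/\s{' . max(1, $minGap) . ',}/';
    // PREG_SPLIT_NO_EMPTY drops empty pieces instead of returning ''.
    return preg_split($pattern, trim($line), -1, PREG_SPLIT_NO_EMPTY);
}

// Any run of spaces or tabs separates the four fields.
print_r(splitColumns("blah   blahsdfas \t laslkdlasdj      aksdjla"));

// With a minimum gap of two, "New York" stays together as one field.
print_r(splitColumns("New York   NY   10001", 2));
```

Which gap to use depends on the data: if any field can contain a single space, the two-or-more variant is probably the safer starting point, with the manual cleanup mentioned above for the stragglers.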

Related

HTML textarea alignment

Suppose I have textarea filled with following text
employee/company/salary
john/microsoft/12.000
michael/citrusdata/15.000
How can I align each column vertically so I get following text:
employee__________company__________salary
john______________microsoft__________12.000
michael___________citrusdata__________15.000
In this example I used underscores to represent whitespace. I thought about writing a simple function, like nl2br(), that replaces '/' with one or more tab characters, but that wouldn't give a consistent result. I guess I need to read the text line by line and, taking the length of every word into account, replace '/' with enough whitespace, but I have no idea how to code it. Is there another way?

I assume you will output the textarea content outside the textarea itself; otherwise you will need a JavaScript alternative. My answer uses PHP :)
You can use the sprintf() function, which supports left or right padding.
First split your content into an array of lines:
$lines = explode("\n", $content);
Watch out for a possible empty last entry (if your content ends with a \n).
Then:
foreach ($lines as $line) {
    $items = explode("/", $line);
    echo sprintf("%-15s%-15s%-15s", $items[0], $items[1], $items[2]) . "<br/>";
}
"%-15s" left-justifies the value in a field 15 characters wide, padding it with spaces on the right.
This works on the console (with "\n" in place of "<br/>"). In a web page, consecutive spaces collapse, so wrap the output in <pre> tags to keep the alignment.
This is only a sample, so you still have to add error handling (for lines with only one / for example).

You should specify a width for each column, such as 50 characters, or any width you like. Let's say $COLUMN_WIDTH = 100;
Find the length of the column value (a string), then subtract it from the fixed width:
$COUNT_SPACES_TO_INSERT = $COLUMN_WIDTH - strlen($COLUMN_STR);
Then insert $COUNT_SPACES_TO_INSERT spaces after the value, and that will solve your issue.
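
Instead of hard-coding a width, the width of each column can be derived from its longest value, which is what the question was hinting at. The following is only a sketch: alignColumns() and its parameters are invented names, and it assumes single-byte text (strlen() and str_pad() count bytes, so multibyte content would need mb_strlen() and manual padding).

```php
<?php
// Hypothetical helper: align separator-delimited rows into padded columns.
function alignColumns(string $content, string $separator = '/', int $gap = 2): string
{
    $rows = [];
    $widths = [];
    // First pass: split every line and remember the widest cell per column.
    foreach (preg_split('/\R/', trim($content)) as $line) {
        $cells = explode($separator, $line);
        foreach ($cells as $i => $cell) {
            $widths[$i] = max($widths[$i] ?? 0, strlen($cell));
        }
        $rows[] = $cells;
    }
    // Second pass: pad each cell on the right to its column width plus a gap.
    $out = [];
    foreach ($rows as $cells) {
        $padded = '';
        foreach ($cells as $i => $cell) {
            $padded .= str_pad($cell, $widths[$i] + $gap);
        }
        $out[] = rtrim($padded); // no trailing spaces after the last column
    }
    return implode("\n", $out);
}

echo alignColumns("employee/company/salary\njohn/microsoft/12.000\nmichael/citrusdata/15.000"), "\n";
```

The result lines up in a monospaced context such as a textarea, a <pre> block or the console.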

Remove lines with specific pattern at the beginning of them

I have a text file of approximately 25,000 lines. About 525kb.
Some lines have random text at the beginning.
Some have long strings of semicolons.
Some others only have three semi-colons and then a space and optionally more text on the same line. These are the lines I want to remove.
Here is a sample....
;;; Updated Time 20120706122706
;;; Generic DEveloper Output
;;; Some Random Comments
;;; I got some more...
;;; Yet another uneeded line
;;; Thanks for using StackOverflow <http://stackoverflow.com>, or...
;;; Not.
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; Banana Production
[Data_Release_Version]
Version=12586
Released=20120706122706
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; Baseline Properties
[BaseLineProperties]
Comment=BaselineProperties
----- and so on.
Once it gets to the first line with 4 or more ; characters, I need the rest of the file unchanged, as there are no more ";;; " lines after that point.
I'm trying to find something fast, rather than reading every line and writing it back out if it doesn't match ";;; ".
File is ASCII (possibly UTF-8) text type file.
Any ideas?
Thank you for your time, assistance and knowledge.

What I would suggest is to read the file into a string with file_get_contents(), explode() that string at every newline character, and then, in a foreach loop, use preg_match() to check whether the line begins with three semicolons and a space. If it doesn't, put it in another array named $output. After the loop, implode() $output with a newline character and use file_put_contents() to write it to another file. Hope this helps :-)
code:
<?php
$string = file_get_contents($filename);
$array = explode("\n", $string);
$output = array();
foreach ($array as $arr) {
    // keep every line that does not start with ";;; "
    if (!preg_match('/^;;;\s/', $arr)) {
        $output[] = $arr;
    }
}
$out = implode("\n", $output);
file_put_contents($path, $out);
?>

It depends. I would load the file into a string, then explode() it on newlines so it's in an array, then run a foreach that skips any line where strpos($line, ';;; ') === 0. Use the strict comparison so that a false "not found" result isn't mistaken for position 0; a continue moves on to the next line.
Another option is to parse and skip as you read, or even to use fseek() and the like. Which is fastest depends on a lot of different factors.
You can implode() later on to add the newlines back in, and then write out a file and/or use line breaks, depending on where the output is supposed to go.

I think you gave the answer yourself:
Write a script that reads the input file line by line in a (while) loop. It writes a line to the output file if either of two conditions holds: 1. a flag ("done") is TRUE, or 2. the line does NOT start with ";;; " (three semicolons followed by a space). This removes the lines starting with three semicolons. Once you come across a line with four or more semicolons, you set the flag to TRUE, so the remaining lines will be copied without being examined.
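
The flag approach described above could look roughly like this. It is a sketch, not a benchmarked solution: stripHeaderComments() is a made-up name, and it assumes the "done" marker is any line starting with four semicolons.

```php
<?php
// Hypothetical helper: copy $inPath to $outPath, dropping the leading
// ";;; " lines. Returns the number of lines removed.
function stripHeaderComments(string $inPath, string $outPath): int
{
    $in = fopen($inPath, 'r');
    $out = fopen($outPath, 'w');
    if ($in === false || $out === false) {
        throw new RuntimeException('Could not open input or output file');
    }

    $done = false;   // becomes true at the first ";;;;" line
    $removed = 0;
    while (($line = fgets($in)) !== false) {
        if (!$done) {
            if (strncmp($line, ';;; ', 4) === 0) {
                $removed++;
                continue;            // skip the comment line
            }
            if (strncmp($line, ';;;;', 4) === 0) {
                $done = true;        // stop examining from here on
            }
        }
        fwrite($out, $line);
    }

    fclose($in);
    fclose($out);
    return $removed;
}
```

Because it streams line by line, memory use stays flat regardless of file size, and after the flag is set each remaining line costs only one boolean check.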

Counting effective line count in sources by PHP

I am writing a PHP program which reads a file with file_get_contents and then attempts to count the effective lines in that source file. It must not count empty lines or lines containing only comments. Sample file:
<?php
/**
* blah blah
*/
class Test {
// testfunc
function testfunc(){
return;
}
}
The number of lines in such a file should be 5. Here is what I've got so far:
$f=file_get_contents($this->file);
$f=preg_replace('|/\*.*?\*/|s','',$f);
$f=preg_replace('/^\s*$/','',$f); // <-- does not work
$f=preg_replace("/\n\n*/s","\n",$f);
$count=count(explode("\n",$f));
But for some reason it does not eliminate the whitespace-only lines. Is there a better way to get this done?
The following code does the job, since I don't care much about the spaces, but I still wonder why my original line labeled "does not work" is not removing spaces from empty lines. Is there some extra character at the end? The file format is Unix.
$f=preg_replace('/ */','',$f); // removes all spaces properly.

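Change /^[\s\t]*$/ to /^\s*$/m and that should fix it.
The \s class already includes tabs, so there is no need to add \t. The m modifier makes ^ and $ match at the start and end of every line rather than only at the start and end of the whole string, which is why the original pattern did not match the blank lines inside the file. (The s modifier only changes what . matches, so it is not needed here.)
Also, it might be better to change /\n\n/s to /[\r\n]{2,}/.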
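
A small experiment makes the effect of the m modifier visible. The sample string is invented for the demonstration; the second pattern is not from the answer above, just one way to count lines that contain at least one non-whitespace character.

```php
<?php
$source = "class Test {\n   \n\t\n    return;\n}\n";

// Without the m modifier, ^ and $ anchor to the whole string only,
// so nothing matches and the text comes back unchanged.
$unchanged = preg_replace('/^\s*$/', '', $source);
var_dump($unchanged === $source);

// With m, ^ and $ match at every line boundary. Count the lines that
// contain at least one non-whitespace character.
$nonBlank = preg_match_all('/^.*\S.*$/m', $source);
echo $nonBlank, "\n";
```

Leaving out the s modifier matters in the second pattern: since . does not match newlines there, each match stays within a single line.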

I would just use trim() and then test each line.
$total = 0;
foreach ($lines as $line) {
    if (strlen(trim($line)) > 0) {
        $total++;
    }
}
Then, you're set up to test for other conditions as well, such as comment lines and what not. I suspect that this will be faster than doing a find/replace on a potentially large document, but you should test it either way, and choose the fastest method.
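
Extending the trim() idea to comments might look like the sketch below. countEffectiveLines() is a hypothetical name. It assumes that a line holding only the PHP opening or closing tag is not counted, which is inferred from the expected result of 5 in the question, and it does not handle comment markers inside string literals or trailing comments after code.

```php
<?php
// Hypothetical helper: count lines that are neither blank nor comment-only.
function countEffectiveLines(string $source): int
{
    // Drop /* ... */ blocks first (same pattern as in the question).
    $source = preg_replace('|/\*.*?\*/|s', '', $source);

    $count = 0;
    foreach (preg_split('/\R/', $source) as $line) {
        $line = trim($line);
        // Skip blank lines and lines holding only a PHP open/close tag.
        if ($line === '' || $line === '<?php' || $line === '?>') {
            continue;
        }
        // Skip single-line comments.
        if (strncmp($line, '//', 2) === 0 || $line[0] === '#') {
            continue;
        }
        $count++;
    }
    return $count;
}
```

For anything stricter than a rough metric, a tokenizer-based count (for example with token_get_all()) would likely be more reliable than regular expressions.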

What is the proper New Line Character in Outlook Contact Export?

I have a CSV parser, that takes Outlook 2010 Contact Export .CSV file, and produces an array of values.
I break each row on the new line symbol, and each column on the comma. It works fine, until someone puts a new line inside a field (typically Address). This new line, which I assume is "\n" or "\r\n", explodes the row where it shouldn't, and the whole file becomes messed up from there on.
In my case, it happens when Business Street is written in two lines:
123 Apple Dr. Unit A
My code:
$file = file_get_contents("outlook.csv");
$rows = explode("\r\n",$file);
foreach($rows as $row)
{
$columns = explode(",",$row);
// Further manipulation here.
}
I have tried both "\n" and "\r\n", same result.
I figured I could calculate the number of columns in the first row (keys), and then find a way to not allow a new line until this many columns have been parsed, but it feels shady.
Is there another character for the new line that I can try, that would not be inside the data fields themselves?

The most common way of handling newlines in CSV files is to "quote" fields which contain significant characters such as newlines or commas. It may be worth looking into whether your CSV generator does this.
I recommend using PHP's fgetcsv() function, which is intended for this purpose. As you've discovered, splitting strings on commas works only in the most trivial cases.
In cases where that doesn't work, a more sophisticated, reportedly RFC 4180-compliant parser is available here.

I also recommend fgetcsv().
fgetcsv() will also take care of commas inside strings (between quotes).
Interesting parsing tutorial
+1 to the previous answer ;)
PS: fgetcsv() is a bit slower than opening the file and exploding the contents, but in my opinion it's worth it.
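
Here is a minimal sketch of how fgetcsv() copes with a quoted field that spans two lines. The sample data and the parseCsv() helper are invented, and it assumes the export quotes such fields, which is worth checking in the actual Outlook file.

```php
<?php
// Hypothetical helper: parse CSV text into rows, honouring quoted fields.
function parseCsv(string $csv): array
{
    // Use an in-memory stream so the example needs no file on disk.
    $fh = fopen('php://temp', 'r+');
    fwrite($fh, $csv);
    rewind($fh);

    $rows = [];
    // Length 0 means "no line length limit"; the delimiter, enclosure and
    // escape characters are passed explicitly.
    while (($row = fgetcsv($fh, 0, ',', '"', '\\')) !== false) {
        if ($row === [null]) {
            continue; // fgetcsv() reports a blank line as [null]
        }
        $rows[] = $row;
    }
    fclose($fh);
    return $rows;
}

$csv = "Name,Business Street,City\r\n"
     . "\"Doe, John\",\"123 Apple Dr.\r\nUnit A\",Springfield\r\n";

foreach (parseCsv($csv) as $row) {
    var_dump($row);
}
```

The second row comes back as three fields: the comma inside the name and the line break inside the street both stay within their quoted fields.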

file_put_contents, file_append and line breaks

I'm writing a PHP script that adds numbers into a text file. I want to have one number on every line, like this:
1
5
8
12
If I use file_put_contents($filename, $commentnumber, FILE_APPEND), the result looks like:
15812
If I add a line break like file_put_contents($filename, $commentnumber . "\n", FILE_APPEND), spaces are added after each number and one empty line at the end (underscore represents spaces):
1_
5_
8_
12_
_
_
How do I get that function to add the numbers the way I want, without spaces?

Did you try the PHP_EOL constant?
file_put_contents($filename, $commentnumber . PHP_EOL, FILE_APPEND)
--- Added ---
I just realized that my file editor does the same thing, but don't worry: it's just a ghost character that the editor places there to signal that there is a newline.
You could try this.
A file with an EOL after the last number looks like:
1_
2_
3_
EOF
but a file without that last character looks like
1_
2_
3
EOF
where _ means a space character.
You could try to parse the file contents using PHP to see what's inside:
$lines = explode(PHP_EOL, file_get_contents($file));
foreach ($lines as $line) {
    var_dump($line);
}
...tricky

Paul's answer has the correct approach, but it contains a mistake.
What you need is the following:
file_put_contents($filename, trim($commentnumber).PHP_EOL, FILE_APPEND);
The PHP_EOL constant makes sure the right line ending is used on Mac, Windows and Unix systems.
The trim() function removes any newline or whitespace on both sides of the string.
Converting to an integer would be a huge mistake because
1. you might end up with zero, especially because of whitespace or special characters (wherever they come from...)
2. IDs don't necessarily need to be integers

Oh guys! Just use
\r\n
instead of \n

There is nothing in the code you provided that would generate those spaces, unless $commentnumber already contains the space to begin with. If that is the case, simply use trim($commentnumber) instead.
There is also nothing in your code that would explain empty lines at the bottom of the file, unless $commentnumber can be an empty string. If that is the case, and you want it to output the number 0 instead, use intval($commentnumber).
Of course, you need only one of those two. If you want to preserve string-like content, use trim(); if you always want integers, use intval(), which already trims it automatically.
It is also possible that you accidentally wrote " \n" instead of "\n" in your actual code, but in the code you posted here it is correct.
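
Putting the trim() advice together, a small sketch of appending one value per line and reading it back could look like this. appendNumber() is a made-up helper, the temp file stands in for the real $filename, and "\n" is used rather than PHP_EOL only to keep the result identical on every platform.

```php
<?php
// Hypothetical helper: append one value per line without stray whitespace.
function appendNumber(string $filename, string $value): void
{
    $value = trim($value);      // strip spaces and newlines on both sides
    if ($value === '') {
        return;                 // an empty value would create a blank line
    }
    file_put_contents($filename, $value . "\n", FILE_APPEND | LOCK_EX);
}

$file = tempnam(sys_get_temp_dir(), 'num');
foreach (['1', ' 5 ', "8\n", '', '12'] as $n) {
    appendNumber($file, $n);
}

// FILE_IGNORE_NEW_LINES strips the line ending from each element.
var_dump(file($file, FILE_IGNORE_NEW_LINES));
unlink($file);
```

Inspecting the result with var_dump() rather than an editor or a browser helps tell real trailing characters apart from display artifacts.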

annoyingregistration, what you have there is absolutely fine.
On Unix-like systems PHP_EOL and "\n" are exactly the same (on Windows PHP_EOL is "\r\n").
There's nothing wrong with the code you provided, so it must be the value of $commentnumber that has a space at the end of it. As stated, run your $commentnumber through the trim() function, and append the newline afterwards so that trim() does not strip it:
file_put_contents($filename, trim($commentnumber) . "\n", FILE_APPEND);
Good luck.

After reading your code and responses, I have come up with a theory...
Since I can't see anything wrong with your code, how did you open and read the file? Did you actually open it in a text editor? Did you use a PHP script to do it? If so, open the file with a text editor and check that there are actually spaces at the end of each line. If there are, well, ignore the rest of this answer. If not, just read on.
For instance, if you use something like this:
<?php
$lines = file($filename);
if ($lines === false) // Error reading
    die();
foreach ($lines as $line)
    echo $line."<br />";
Then you would always get whitespace at the end of each line because of the way file() works: by default it keeps the newline character at the end of every element. Make sure each $line does not have whitespace, such as a newline character, at the end.
Since HTML handles all whitespaces - spaces, tabs, newlines etc. - as spaces, if there is a whitespace at the end of $line, then those would appear as spaces in the HTML output.
Solution: use rtrim($line) to remove whitespace at the end of the lines. The following code:
<?php
$lines = file($filename);
if ($lines === false) // Error reading
    die();
foreach ($lines as $line)
    echo rtrim($line)."<br />";
wouldn't have the same problem as the first example, and all spaces at the end of the lines would be gone.

It's because each time you write to the file, the file is being finished; file_put_contents inserts an extra line break at the end.
