Scraping a plain text file with no HTML?


I have the following data in a plain text file:
1. Value
Location : Value
Owner: Value
Architect: Value
2. Value
Location : Value
Owner: Value
Architect: Value
... up to 200+ ...
The number and the word Value change for each segment.
Now I need to insert this data into a MySQL database.
Do you have a suggestion on how I can traverse and scrape it, so that I get the text beside the number as well as the values of "Location", "Owner", and "Architect"?
This seems hard to do with a DOM scraping class, since there are no HTML tags present.

If the data is consistently structured, you can use fscanf to scan it from the file.
/* Notice the newlines at the end! */
$format = <<<FORMAT
%d. %s
Location : %s
Owner: %s
Architect: %s
FORMAT;
$file = fopen('file.txt', 'r');
while ($data = fscanf($file, $format)) {
list($number, $title, $location, $owner, $architect) = $data;
// Insert the data to database here
}
fclose($file);
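More about fscanf in the PHP documentation. Note that, according to the manual, each call to fscanf reads a single line from the file, and %s stops at the first whitespace, so a multi-line format like this one may need to be split into one call per line.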

If every block has the same structure, you could do this with the file() function: http://nl.php.net/manual/en/function.file.php
$data = file('path/to/file.txt');
With this every row is an item in the array, and you could loop through it.
for ($i = 0; $i<count($data); $i+=5){
$valuerow = $data[$i];
$locationrow = $data[$i+1];
$ownerrow = $data[$i+2];
$architectrow = $data[$i+3];
// strip the data you don't want here, and insert it into the database.
}
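A minimal sketch of the same idea that also splits each line into its parts. It assumes every record is exactly four lines with no blank separator (pass 5 as the second argument if a blank line follows each record); the function name parseRecords and the array keys are hypothetical, not from the post above.

```php
<?php
// Hypothetical helper: groups lines into fixed-size records and splits each line.
function parseRecords(array $lines, int $linesPerRecord = 4): array
{
    $records = array();
    foreach (array_chunk($lines, $linesPerRecord) as $chunk) {
        if (count($chunk) < 4) {
            continue; // incomplete trailing record
        }
        // "12. Title" -> number before the first dot, title after it.
        list($number, $title) = array_pad(array_map('trim', explode('.', $chunk[0], 2)), 2, '');
        $record = array('number' => (int) $number, 'title' => $title);
        // "Location : Value" -> key before the first colon, value after it.
        foreach (array_slice($chunk, 1, 3) as $line) {
            list($key, $value) = array_pad(array_map('trim', explode(':', $line, 2)), 2, '');
            $record[strtolower($key)] = $value;
        }
        $records[] = $record;
    }
    return $records;
}

$lines = array("1. Town Hall", "Location : Utrecht", "Owner: City", "Architect: J. Smith");
print_r(parseRecords($lines));
```

With real data you would pass file('path/to/file.txt', FILE_IGNORE_NEW_LINES) instead of the sample array.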

That will work with a very simple stateful line-oriented parser. For every line, you accumulate parsed data into an array(). When something tells you that you're on a new record, you dump what you have parsed and start again.
Line-oriented parsers have a great property: they require little memory and, most importantly, constant memory. They can process gigabytes of data without breaking a sweat. I manage a bunch of production servers, and there's nothing worse than scripts that slurp whole files into memory (and then stuff arrays with parsed content, which requires more than twice the original file size in memory).
This works and is mostly unbreakable:
<?php
$in_name = 'in.txt';
$in = fopen($in_name, 'r') or die("cannot open $in_name");
function dump_record($r) {
    print_r($r);
}
$current = array();
while (($line = fgets($in)) !== false) {
    /* Skip empty lines (any amount of whitespace counts as 'empty') */
    if (preg_match('/^\s*$/', $line)) continue;
    /* Search for '123. <value>' stanzas */
    if (preg_match('/^(\d+)\.\s+(.*?)\s*$/', $line, $start)) {
        /* If we already parsed a record, this is the time to dump it */
        if (!empty($current)) dump_record($current);
        /* Let's start the new record, keeping the text beside the number */
        $current = array( 'id' => $start[1], 'title' => $start[2] );
    }
    else if (preg_match('/^(.*?)\s*:\s+(.*?)\s*$/', $line, $keyval)) {
        /* Otherwise parse a plain 'key: value' stanza */
        $current[ $keyval[1] ] = $keyval[2];
    }
    else {
        error_log("parsing error: '$line'");
    }
}
/* Don't forget to dump the last parsed record, a situation
 * we only detect at EOF (end of file) */
if (!empty($current)) dump_record($current);
fclose($in);
?>
Obviously you'll need something suited to your taste in the dump_record function, such as printing a correctly formatted INSERT SQL statement.
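As one possible shape for dump_record, here is a sketch that maps a parsed record to the parameters of a prepared INSERT. The table name projects, the column names, and the extra $pdo argument (an open PDO connection) are assumptions for illustration; they are not part of the parser above.

```php
<?php
// Table and column names below are placeholders, not from the original post.
const INSERT_SQL = 'INSERT INTO projects (id, location, owner, architect)
                    VALUES (:id, :location, :owner, :architect)';

// Hypothetical helper: turns a parsed record into named statement parameters.
function recordToParams(array $r): array
{
    // Normalise keys such as "Location " or "OWNER" to "location", "owner".
    $norm = array();
    foreach ($r as $key => $value) {
        $norm[strtolower(trim($key))] = trim($value);
    }
    $params = array();
    foreach (array('id', 'location', 'owner', 'architect') as $field) {
        // Missing fields are stored as NULL rather than raising a notice.
        $params[':' . $field] = isset($norm[$field]) ? $norm[$field] : null;
    }
    return $params;
}

// Assumes $pdo is an open PDO connection to the target database.
function dump_record(PDO $pdo, array $r)
{
    $pdo->prepare(INSERT_SQL)->execute(recordToParams($r));
}
```

Binding the values through a prepared statement avoids having to escape them by hand.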

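This will give you what you want, assuming $txt holds the file contents and the records are separated by blank lines:
// Assumption: $pdo is an open PDO connection (mysql_query() was removed in PHP 7).
// Change the table and column names in the query to match your schema.
$stmt = $pdo->prepare("INSERT INTO my_table (id, location, owner, architect) VALUES (?, ?, ?, ?)");
$array = explode("\n\n", $txt);
foreach ($array as $key => $value) {
    $id_pattern = '#' . ($key + 1) . '\. (.*?)\n#';
    $location_pattern = '#Location : (.*?)\n#';
    $owner_pattern = '#Owner: (.*?)\n#';
    // Greedy here: a lazy group with nothing after it would match an empty string.
    $architect_pattern = '#Architect: (.*)#';
    if (!preg_match($id_pattern, $value, $id)
        || !preg_match($location_pattern, $value, $location)
        || !preg_match($owner_pattern, $value, $owner)
        || !preg_match($architect_pattern, $value, $architect)) {
        continue; // skip blocks that do not have the expected shape
    }
    // As in the original query, the text beside the number goes into the id column.
    $stmt->execute(array($id[1], $location[1], $owner[1], $architect[1]));
}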

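I agree with Topener's solution; here's an example if each block is 4 lines plus a blank line:
$data = file('path/to/file.txt', FILE_IGNORE_NEW_LINES);
$id = 0;
$parsedData = array();
foreach ($data as $n => $row) {
    if (($n % 5) == 0) {
        $id = (int) $row; // "12. Value" casts to 12
    } elseif (strpos($row, ':') !== false) {
        // Split "Location : Value" on the first colon and trim both parts.
        list($key, $value) = array_map('trim', explode(':', $row, 2));
        $parsedData[$id][$key] = $value;
    }
}
The resulting structure is convenient to use, for MySQL or anything else. Splitting on the first colon also removes the colon from the key.
Good luck!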

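preg_match_all('/(\d+)\.(.*?)\sLocation\s*:\s*(.*?)\sOwner\s*:\s*(.*?)\sArchitect\s*:\s*([^\r\n]*)/i', $txt, $m);
$matched = array();
// $m[1] holds the numbers, $m[2] the titles, $m[3] to $m[5] the three fields;
// all of them are indexed by match position $k, not by the record number $v.
foreach ($m[1] as $k => $v) {
    $matched[$v] = array(
        "title" => trim($m[2][$k]),
        "location" => trim($m[3][$k]),
        "owner" => trim($m[4][$k]),
        "architect" => trim($m[5][$k])
    );
}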

Related

How do I replace 1 value within a row in a CSV file using php?

So this is my very simple and basic account system (Just a school project), I would like the users to be able to change their password. But I am unsure on how to just replace the Password value within a row keeping all the other values the same.
CSV File:
ID,Username,Email,DateJoined,Password,UserScore,profilePics
1,Test,Test#Test.com,03/12/2014,Test,10,uploads/profilePics/Test.jpg
2,Alfie,Alfie#test.com,05/12/2014,1234,136,uploads/profilePics/Alfie.png
PHP:
("cNewPassword" = confirm new password)
<?php
session_start();
if(empty($_POST['oldPassword']) || empty($_POST['newPassword']) || empty($_POST['cNewPassword'])) {
die("ERROR|Please fill out all of the fields.");
} else {
if($_POST['newPassword'] == $_POST['cNewPassword']) {
if ($_POST['oldPassword'] == $_SESSION['password']) {
$file = "Database/Users.csv";
$fh = fopen($file, "w");
while(! feof($file)) {
$rows = fgetcsv($file);
if ($rows[4] == $_POST['oldPassword'] && $rows[1] == $_SESSION['username']) {
//Replace line here
echo("SUCCESS|Password changed!");
}
}
fclose($file);
}
die("ERROR|Your current password is not correct!");
}
die("ERROR|New passwords do not match!");
}
?>

You'll have to open the file in read mode, open a temporary one in write mode, write the modified data there, and then delete/rename the files. I'd suggest setting up a real DB and working with that, but if you're going for the CSV, the code should look more or less like this:
$input = fopen('Database/Users.csv', 'r'); //open for reading
$output = fopen('Database/temporary.csv', 'w'); //open for writing
while( false !== ( $data = fgetcsv($input) ) ){ //read each line as an array
//modify data here
if ($data[4] == $_POST['oldPassword'] && $data[1] == $_SESSION['username']) {
//Replace line here
$data[4] = $_POST['newPassword'];
echo("SUCCESS|Password changed!");
}
//write modified data to new file
fputcsv( $output, $data);
}
//close both files
fclose( $input );
fclose( $output );
//clean up
unlink('Database/Users.csv'); // Delete the obsolete DB
rename('Database/temporary.csv', 'Database/Users.csv'); //Rename temporary to new
Hope it helps.

My suggestion is a little pair of functions that turn your database data into an array, which you can modify and then convert back to its original form:
With this set of functions, you simply have to specify how the rows and the values within a row are separated.
function dbToArray($db, $row_separator = "\n", $data_separator = ",") {
// Let's separate each row of data.
$separate = explode($row_separator, $db);
// First line is always the table column names:
$table_columns = array();
$table_rows = array();
foreach ($separate as $key => $row) {
// Now let's get each column data out.
$data = explode($data_separator, $row);
// I always assume the first $row of data contains the column names.
if ($key == 0)
$table_columns = $data;
else if ($key > 0 && count($table_columns) == count($data)) // Let's just make sure column amount matches.
$table_rows[] = array_combine($table_columns, $data);
}
// Return an array with columns and rows; each row's data is bound to its equivalent column name.
return array(
'columns' => $table_columns,
'rows' => $table_rows,
);
}
function arrayToDb(array $db, $row_separator = "\n", $data_separator = ",") {
// A valid db array must contain a columns and rows keys.
if (isset($db['columns']) && isset($db['rows'])) {
// Let's now make sure each contains an array. (Checking this might be overkill.)
$db['columns'] = (array) $db['columns'];
$db['rows'] = (array) $db['rows'];
// Now let's rewrite the db string imploding columns:
$returnDB = implode($data_separator, $db['columns']).$row_separator;
foreach ($db['rows'] as $row) {
// And imploding each row data.
$returnDB .= implode($data_separator, $row).$row_separator;
}
// Return the data.
return $returnDB;
}
// Faaaaaaaaaaaail !
return FALSE;
}
Note that I tried these with your db example, and they work even when applied to their own results, e.g. dbToArray(arrayToDb(dbToArray(...))) multiple times.
Hope that helps. If I can be clearer, don't hesitate to ask. :)
Cheers

You need a 3-step process to do this (3 loops, which could be optimized to 1 or 2):
1. Load the relevant data into memory
2. Update the desired data
3. Save the data to the file
Good luck! :)
PS. Also, your passwords should never be stored in clear text, whether in memory (session) or on disk (CSV); use a hashing function!
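A minimal sketch of the hashing advice, using PHP's built-in password_hash and password_verify (available since PHP 5.5). The sample password is made up; in the CSV you would store $hash in the Password column instead of the plain text.

```php
<?php
// Hash once, when the password is set or changed.
$hash = password_hash('s3cret', PASSWORD_DEFAULT);

// Verify on login or before a password change, against the stored hash.
var_dump(password_verify('s3cret', $hash)); // bool(true)
var_dump(password_verify('wrong', $hash));  // bool(false)
```

The hash includes its own salt and algorithm identifier, so no extra column is needed.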

Increment number in text file

I want to record downloads in a text file
Someone comes to my site and downloads something, it will add a new row to the text file if it hasn't already or increment the current one.
I have tried
$filename = 'a.txt';
$lines = file($filename);
$linea = array();
foreach ($lines as $line)
{
$linea[] = explode("|",$line);
}
$linea[0][1] ++;
$a = $linea[0][0] . "|" . $linea[0][1];
file_put_contents($filename, $a);
but it always increments it by more than 1
The text file format is
name|download_count

You're doing your incrementing outside of the foreach loop, and only accessing the [0]th element, so nothing is changing anywhere else.
This should probably look something like:
$filename = 'a.txt';
// Drop the trailing newlines so that the count is a clean numeric string.
$lines = file($filename, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
// $k = key, $v = value
foreach ($lines as $k => $v) {
    $exploded = explode("|", $v);
    // Does this match the site name you're trying to increment?
    if ($exploded[0] == "some_name_up_to_you") {
        $exploded[1] = (int) $exploded[1] + 1;
        // To make changes to the source array,
        // it must be referenced using the key.
        // (If you just change $v, the source won't be updated.)
        $lines[$k] = implode("|", $exploded);
    }
}
// Write, putting the newlines back.
file_put_contents($filename, implode("\n", $lines) . "\n");
You should probably be using a database for this, though. Check out PDO and MySQL and you'll be on your way to awesomeness.
EDIT
To do what you mentioned in your comments, you can set a boolean flag, and trigger it as you walk through the array. This may warrant a break, too, if you're only looking for one thing:
...
$found = false;
foreach ($lines as $k=>$v) {
$exploded = explode("|", $v);
if ($exploded[0] == "some_name_up_to_you") {
$found = true;
$exploded[1]++;
$lines[$k] = implode("|", $exploded);
break; // ???
}
}
if (!$found) {
$lines[] = "THE_NEW_SITE|1";
}
...

On the one hand you are using a foreach loop; on the other hand you write only the first line back to your file after storing it in $a, so it is unclear to me what you actually have in your .txt file.
Try the code below; I hope it solves your problem. It assumes the file holds a single name|download_count line.
$filename = 'a.txt';
// get file contents and split it...
$data = explode('|',file_get_contents($filename));
// increment the counting number...
$data[1]++;
// join the contents...
$data = implode('|',$data);
file_put_contents($filename, $data);

Instead of creating your own structure inside a text file, why not just use PHP arrays to keep track? You should also apply proper locking to prevent race conditions:
function recordDownload($download, $counter = 'default')
{
// open lock file and acquire exclusive lock
if (false === ($f = fopen("$counter.lock", "c"))) {
return;
}
flock($f, LOCK_EX);
// read counter data
if (file_exists("$counter.stats")) {
$stats = include "$counter.stats";
} else {
$stats = array();
}
if (isset($stats[$download])) {
$stats[$download]++;
} else {
$stats[$download] = 1;
}
// write back counter data
file_put_contents("$counter.stats", '<?php return ' . var_export($stats, true) . ';');
// release exclusive lock
fclose($f);
}
recordDownload('product1'); // will save in default.stats
recordDownload('product2', 'special'); // will save in special.stats

Personally, I suggest using a JSON blob as the content of the text file. You can then read the file into PHP, decode it (json_decode), manipulate the data, and save it again.
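A rough sketch of the JSON approach, combined with flock so that concurrent downloads do not overwrite each other. The function name incrementDownload and the file layout ({"name": count}) are assumptions for illustration.

```php
<?php
// Hypothetical helper: increments $name in a JSON file of the form {"name": count}.
function incrementDownload(string $file, string $name): int
{
    $fh = fopen($file, 'c+');           // create if missing, do not truncate
    if ($fh === false) {
        throw new RuntimeException("Cannot open $file");
    }
    flock($fh, LOCK_EX);                // serialise concurrent updates
    $counts = json_decode(stream_get_contents($fh), true);
    if (!is_array($counts)) {
        $counts = array();              // empty or unreadable file: start fresh
    }
    $counts[$name] = isset($counts[$name]) ? $counts[$name] + 1 : 1;
    ftruncate($fh, 0);                  // replace the old contents
    rewind($fh);
    fwrite($fh, json_encode($counts));
    fflush($fh);
    flock($fh, LOCK_UN);
    fclose($fh);
    return $counts[$name];
}

$file = sys_get_temp_dir() . '/downloads.json'; // example path only
incrementDownload($file, 'product1');
```

The function returns the new count, which is handy for logging or display.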

CSV encoding php

Here's the problem: I need to post a .csv file from one server to another.
I do this by reading the contents of the .csv file and sending them with cURL as POST data.
This works without problems.
But when I try to parse the data and store it in a database table, the trouble begins.
I have all the variables in an array; if I print this array, it displays correctly.
But if I echo a value from that array, I get all kinds of weird characters.
My best guess is that it has something to do with the encoding of the CSV file, but I wouldn't have a clue how to fix that.
Here's the function I use to parse the CSV data:
public function parseCsv($data)
{
$quote = '"';
$newline = "\n";
$seperator = ';';
$dbQuote = $quote . $quote;
// Clean up file
$data = trim($data);
$data = str_replace("\r\n", $newline, $data);
$data = str_replace($dbQuote,'"', $data);
$data = str_replace(',",', ',,', $data);
$data .= $seperator;
$inquotes = false;
$startPoint = $row = $cellNo = 0;
for($i=0; $i<strlen($data); $i++) {
$char = $data[$i];
if ($char == $quote) {
if ($inquotes) $inquotes = false;
else $inquotes = true;
}
if (($char == $seperator or $char == $newline) and !$inquotes) {
$cell = substr($data,$startPoint,$i-$startPoint);
$cell = str_replace($quote,'',$cell);
$cell = str_replace('"',$quote,$cell);
$result[$row][$this->csvMap[$cellNo]] = $this->_parseValue($cellNo, $cell);
++$cellNo;
$startPoint = $i + 1;
if ($char == $newline) {
$cellNo = 0;
++$row;
}
}
}
return $result;
}
Any help is appreciated!
EDIT:
OK, so after some more trial and error I found out that it's just the very first value of the first row that has some extra characters. If I echo that value, everything I output after it gets messed up.
So I tried to change the encoding. Now if I echo the value it's all good, but I have a new problem: it's a string, but I need an int:
echo $val; // output: 7655, but messes up everything output after it
$val = mb_convert_encoding($val, "UTF-8");
echo $val; // output: 7655
echo intval($val); // output: 0
EDIT:
expected output:
7655Array ( [kenmerk] => ÿþ7655 [status] => 205 [status_date] => 1991-12-30 [dob] => 1936-09-04 ) succes
messed up output
7655牁慲੹ਨ††歛湥敭歲⁝㸽＠㟾㘀㔀㔀਀††獛慴畴嵳㴠‾㈀㤀㔀਀††獛慴畴彳慤整⁝㸽 201ⴱ㄀㈀ⴀ30 †嬠潤嵢㴠‾㄀㤀㘀㘀-08〭㐀਀਩畳捣獥
I first echo the element 'kenmerk'; after that I print the array.
As you can see, the element 'kenmerk' in the array has some extra characters.
Converting the data to UTF-8 like so:
$data = mb_convert_encoding($data, "UTF-8");
eliminates the problem with the messed-up output and removes the 'ÿþ' (an incorrectly interpreted BOM?), but I still can't convert the values to an int.
EDIT:
OK, I sort of found a solution,
but as I have no idea why it works, I'd appreciate any info:
var_dump((int) $val); // output: 0
var_dump((int) strip_tags($val)); // output: 7655

You need to remove ÿþ from ÿþ7655. intval() and an (int) cast ($val = (int)$val;) return 0 when the string does not start with a number, and they stop at the first non-numeric character, so 765ÿþ5, for example, returns 765.
Regarding your first problem, I would also recommend reading this answer: PHP messing with HTML Charset Encoding
I hope it gives you more clarity about what you are struggling with.
I would also make your stripping process more robust, so that it matches, e.g., 7655 instead of ÿþ7655.
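The bytes shown as ÿþ are FF FE, which is the byte order mark of UTF-16LE. Assuming the CSV really is UTF-16LE (an inference from that mark, not something the question confirms) and the mbstring extension is loaded, a sketch of the conversion could look like this; the function name utf16leToUtf8 is hypothetical.

```php
<?php
// Hypothetical helper: strips a UTF-16LE BOM and converts the data to UTF-8.
function utf16leToUtf8(string $data): string
{
    // Remove the byte order mark (bytes FF FE) if present.
    if (substr($data, 0, 2) === "\xFF\xFE") {
        $data = substr($data, 2);
    }
    // Name the source encoding explicitly instead of letting mbstring guess.
    return mb_convert_encoding($data, 'UTF-8', 'UTF-16LE');
}

// "7655" as UTF-16LE with a BOM: every character is followed by a NUL byte.
$raw = "\xFF\xFE" . "7\x006\x005\x005\x00";
$val = utf16leToUtf8($raw);
var_dump((int) $val); // int(7655)
```

This would also explain why strip_tags appeared to help: it removes NUL bytes, which is what the second byte of each UTF-16LE digit is.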

what's the code meaning?

$file = fopen("test.txt","r");
while($line = fgets($file)) {
$line = trim($line);
list($model,$price) = preg_split('/\s+/',$line);
if(empty($price)) {
$price = 0;
}
$sql = "UPDATE products
SET products_price=$price
WHERE products_model='$model'";
// run the sql query.
}
fclose($file);
The txt file looks like this:
model price
LB2117 19.49
LB2381 25.99
1. What is the meaning of list($model,$price) = preg_split('/\s+/',$line);
I know preg_split is like explode, but I don't know what the parameters in the line above mean.
2. How do I skip the first record?

It takes the results of preg_split and assigns them to the variables $model and $price. You're looking at a parsing algorithm. Sorry if this is not enough; I have a hard time understanding the question as it is written.
Also, if I read this correctly, there is no need to skip line 1 unless you have an item with the model defined as "model" in the database.
But if you wanted to for some reason, you could add a counter:
$i = 0;
while($line = fgets($file)) {
if($i > 0)
{
$line = trim($line);
list($model,$price) = preg_split('/\s+/',$line);
if(empty($price)) {
$price = 0;
}
$sql = "UPDATE products
SET products_price=$price
WHERE products_model='$model'";
// run the sql query.
}
$i++;
}

That is a language construct that allows you to assign to multiple variables at once. You can think of it as array unpacking (preg_split returns an array). So, when you do:
<?php
list($a, $b) = explode(".","a.b");
echo $a . "\n";
echo $b . "\n";
You will get:
a
b
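Having fewer variables in list than elements in the array is fine: the excess array elements are ignored. Having too few elements in the array, however, raises an undefined offset notice (a warning in PHP 8) and leaves the extra variables set to null. For example:
list($a) = explode(".","a.b"); // ok
list($a,$b,$c) = explode(".","a.b"); // notice/warning: $c has no matching element and becomes null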
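One way to avoid that notice for lines without a price is to pad the split result to the expected length first. This is a sketch with a hypothetical helper name, splitLine, applied to the sample rows from the question.

```php
<?php
// Hypothetical helper: splits "MODEL PRICE" on whitespace and always returns two elements.
function splitLine(string $line): array
{
    // The limit of 2 keeps anything after the first gap in the second part;
    // array_pad fills in null when the line has no price at all.
    return array_pad(preg_split('/\s+/', trim($line), 2), 2, null);
}

list($model, $price) = splitLine("LB2117 19.49\n");
var_dump($model, $price); // string(6) "LB2117" and string(5) "19.49"

list($model, $price) = splitLine("LB2381");
var_dump($price); // NULL
```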

I don't know if this is what you meant by skipping the first record, but...
$file = fopen("test.txt", "r"); // open the file for reading
$first = true;
while ($line = fgets($file)) { // read the file line by line
    if ($first === true) { $first = false; } // skip the first record
    else {
        $line = trim($line); // remove the whitespace before and after the text, not in the middle
        // split on runs of whitespace (\s matches spaces, tabs and line breaks)
        // and assign the two parts to $model and $price
        list($model, $price) = preg_split('/\s+/', $line);
        if (empty($price)) { // if $price is empty, default it to 0
            $price = 0;
        }
        // SQL update; comments must stay outside the query string
        $sql = "UPDATE products
        SET products_price=$price
        WHERE products_model='$model'";
        // run the sql query.
    }
}
fclose($file); // close the file

Processing multiple lines of data from a textarea; more efficient way?

In my app I have a textarea, which my users are meant to enter data in the format:
Forename, Surname, YYYY-MM-DD, Company
Forename, Surname, YYYY-MM-DD, Company
on each line. My intention is to then loop through each row, exploding at the comma and trimming any white space.
I then need to pass the exploded array in to an associative array. I'm doing this manually at the moment, on the assumption that the user has entered the data in the correct order and format; which does work, but does rely on the user not messing things up.
What would you suggest as being a better way of doing this? I think the way I'm checking each index to see if it's empty or not seems rather clunky, as well as error prone.
Any suggestions or things to consider?
/************************************
* sample data from textarea:
* Name, Surname, 1980-02-22, Company
* Foo, Bar, 1970-05-12, Baz
************************************/
$data = preg_split('/\r\n|\n/', $_POST['data'],
-1, PREG_SPLIT_NO_EMPTY);
$item = array();
// loop through the data
foreach($data as $row) :
// trim and explode each line in to an array
$item[] = array_map('trim', explode(',', $row));
endforeach;
$k=0;
foreach($item as $user) :
$processed_data[$k]['first_name'] = !empty($user[0]) ? $user[0] : NULL;
$processed_data[$k]['last_name'] = !empty($user[1]) ? $user[1] : NULL;
if(!empty($user[2])) :
$dob = strtotime($user[2]);
if($dob) {
$processed_data[$k]['dob'] = $user[2];
} else {
$processed_data[$k]['dob'] = NULL;
}
else:
$processed_data[$k]['dob'] = NULL;
endif;
$processed_data[$k]['company'] = !empty($user[3]) ? $user[3] : NULL;
$k++;
endforeach;
// print_r($processed_data);

You are from the old school :)
As you say above, you expect the user to enter the data correctly in the textarea. If your app currently works as part of a robust system, don't touch it; otherwise, you should consider adding separate parameters to your POST request (one for each field you want to explode).
You can solve the problem you have now like this:
// The algorithm below assumes the user sends the data correctly
// Forename, Surname, YYYY-MM-DD, Company
$names = array('first_name', 'last_name', 'dob', 'company_name');
$lines = explode("\n", $_POST['data']);
$result = array();
foreach ($lines as $line)
{
$exploded_line = explode(",", $line);
$row = array();
foreach ($exploded_line as $key=>$item) { $row[$names[$key]]= trim($item); }
$result[]=$row;
}
// Now in result there is an array like this
// result[0][first_name]
// result[0][last_name]
// result[0][dob]
// result[0][company_name]
// result[1][first_name]
// [ ... ]

You can encapsulate the parsing and the validation in classes of their own. Additionally, you could do the same for the data structure holding the tabular data.
class TableParser
{
private $string;
public function __construct($string)
{
$this->string = (string) $string;
}
public function parse()
{
$buffer = $this->string;
$rows = explode("\n", $buffer);
$rows = array_map('trim', $rows);
return $this->parseRows($rows);
}
private function parseRows(array $rows)
{
foreach($rows as &$row)
{
$row = $this->parseRow($row);
}
return $rows;
}
private function parseRow($row)
{
$keys = array('forename', 'surname', 'date', 'company');
$keyCount = count($keys);
$row = explode(',', $row, $keyCount);
if (count($row) != $keyCount)
{
throw new InvalidArgumentException('A row must have 4 columns.');
}
$row = array_map('trim', $row);
$row = array_combine($keys, $row);
return $row;
}
}
This parser is still quite rough. You can improve it over time, e.g. by providing better error handling that reports which line failed. Such a component is then easier to integrate into your normal application flow, because you can return that information to the user so that she can correct the input.
Additionally, you can move the validation into a second class: do only the exploding and trimming in the parser, and check the column count, assign the array keys, and validate the date format and value in the second class, to keep the concerns separate.
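As an example of what the date check in such a validator might look like, here is a small sketch. It assumes the YYYY-MM-DD format from the question; the function name isValidDate is hypothetical.

```php
<?php
// Hypothetical helper: true only for real calendar dates written as YYYY-MM-DD.
function isValidDate(string $value): bool
{
    // The leading "!" resets all fields not present in the format to zero.
    $date = DateTime::createFromFormat('!Y-m-d', $value);
    // Formatting back catches overflowing dates such as 1980-02-30,
    // which createFromFormat would silently roll over into March.
    return $date !== false && $date->format('Y-m-d') === $value;
}

var_dump(isValidDate('1980-02-22')); // bool(true)
var_dump(isValidDate('1980-02-30')); // bool(false)
var_dump(isValidDate('22/02/1980')); // bool(false)
```

Unlike strtotime, this rejects input that is merely parseable but not in the expected format.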

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.
