Preg-Match-All - Synonym File


I am writing a php script that will parse through a file, (synonyms.dat), and coordinate a list of synonyms with their parent word, for about 150k words.
Example from file:
1|2
(adj)|one|i|ane|cardinal
(noun)|one|I|ace|single|unity|digit|figure
1-dodecanol|1
(noun)|lauryl alcohol|alcohol
1-hitter|1
(noun)|one-hitter|baseball|baseball game|ball
10|2
(adj)|ten|x|cardinal
(noun)|ten|X|tenner|decade|large integer
100|2
(adj)|hundred|a hundred|one hundred|c|cardinal
(noun)|hundred|C|century|one C|centred|large integer
1000|2
(adj)|thousand|a thousand|one thousand|m|k|cardinal
(noun)|thousand|one thousand|M|K|chiliad|G|grand|thou|yard|large integer
**10000|1
(noun)|ten thousand|myriad|large**
In the example above I want to link ten thousand, myriad, large to the word 10000.
I have tried various methods of reading the .dat file into memory using file_get_contents, then exploding the contents at \n and using various array search techniques to find the 'parent' word and its synonyms. However, this is extremely slow, and more often than not it crashes my web server.
I believe what I need to do is use preg_match_all to split the string, and then just iterate over the matches, inserting into my database where appropriate.
$contents = file_get_contents($page);
preg_match_all("/([^\s]+)\|[0-9].*/",$contents,$out, PREG_SET_ORDER);
This matches each head-word line:
1|2
1-dodecanol|1
1-hitter|1
But I don't know how to link the fields in between each match, i.e. the synonyms themselves.
This script is intended to be run once, to get all the information into my database appropriately. For those interested, I have a table 'synonym_index' which holds a unique id for each word, as well as the word itself. Then there is another table, 'synonym_listing', which contains a 'word_id' column and a 'synonym_id' column, where each column is a foreign key to synonym_index. There can be multiple synonym_ids for each word_id.
Your help is greatly appreciated!

You can use explode() to split each line into fields. (Or, depending on the precise format of the input, fgetcsv() might be a better choice.)
Illustrative example, which will almost certainly need adjustment for your specific use case and data format:
$infile = fopen('synonyms.dat', 'r');
// Loop on the result of fgets() rather than feof(), so a failed read ends the loop
while (($line = fgets($infile)) !== false) {
    $line = rtrim($line, "\r\n");
    if ($line === '') {
        continue;
    }
    // Line follows the format HEAD_WORD|NUMBER_OF_SYNONYM_LINES
    list($headWord, $n) = explode('|', $line);
    $n = (int) $n;
    $synonyms = array();
    // For each synonym line...
    while ($n-- > 0 && ($line = fgets($infile)) !== false) {
        $fields = explode('|', rtrim($line, "\r\n"));
        // Strip the surrounding parentheses: "(noun)" becomes "noun"
        $partOfSpeech = substr(array_shift($fields), 1, -1);
        $synonyms[$partOfSpeech] = $fields;
    }
    // Now here, when $headWord is '**10000', $synonyms should be array(
    //     'noun' => array('ten thousand', 'myriad', 'large**')
    // )
}
fclose($infile);
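Since the goal is to fill a word table and a word-to-synonym table, it may help to flatten the file into individual (head word, part of speech, synonym) triples that can be inserted one at a time. The following is only a sketch under the assumption that the file follows the format shown in the question; the function name readSynonymTriples is hypothetical, and the actual database inserts are left out.

```php
<?php
// Hypothetical generator: yields [headWord, partOfSpeech, synonym] triples
// from an open file handle, reading one line at a time.
function readSynonymTriples($handle): Generator
{
    while (($line = fgets($handle)) !== false) {
        $line = rtrim($line, "\r\n");
        if ($line === '') {
            continue;
        }
        $parts = explode('|', $line);
        // A head line is assumed to look like HEAD_WORD|NUMBER_OF_SYNONYM_LINES
        if (count($parts) !== 2 || !ctype_digit($parts[1])) {
            continue;
        }
        [$headWord, $n] = $parts;
        // Consume the announced number of synonym lines
        for ($i = (int) $n; $i > 0 && ($line = fgets($handle)) !== false; $i--) {
            $fields = explode('|', rtrim($line, "\r\n"));
            // "(noun)" becomes "noun"
            $partOfSpeech = trim(array_shift($fields), '()');
            foreach ($fields as $synonym) {
                yield [$headWord, $partOfSpeech, $synonym];
            }
        }
    }
}
```

Because the generator holds only one line in memory at a time, the loop that consumes it (for example `foreach (readSynonymTriples($infile) as [$word, $pos, $synonym])`) can run the inserts without loading the whole file.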

Wow, for this type of functionality you have databases with tables and indices.
PHP is to serve a request/response, not to read a big file into memory. I advise you to put the data in a database. That will be much faster - and it is made for it.

Related

Context index generation for meilisearch

I've been using all sorts of hacks to generate file indexes from SMB shares, and that works fine for basic file path plus metadata indexing.
The next step I want to implement is an algorithm combining some Unix-like utilities and PHP to index specific context from within files.
The first step in this context generation is something like this:
while read p; do egrep -rH '^;|\(|^\(|\)$' "$p"; done <textual.txt > text_context_search.txt
This regex is specific to my purpose of indexing the contents of programs: it extracts lines that are whole comments or contain comments from CNC program files.
The resulting output is something like:
file_path:regex_hit
Obviously most programs have more than one comment, so there is too much redundancy; beyond the repetition itself, an exhaustive context index is about a gigabyte in size.
I am now working on a script that would compact the redundancy in this pattern:
file_path_1:regex_hit_1
file_path_1:regex_hit_2
file_path_1:regex_hit_3
...
would become:
file_path_1:regex_hit_1,regex_hit_2,regex_hit_3
If I manage to do this efficiently, all is well.
The question is whether I'm going about this the right way. Maybe I should be using different tools to generate such a context index in the first place?
EDIT
After further copying and pasting from Stack Overflow and thinking about it, I glued together a solution, using code that is not mine, that almost entirely solves the issue described above.
<?php
// https://stackoverflow.com/questions/26238299/merging-csv-lines-where-column-value-is-the-same
$rows = array_map('str_getcsv', file('text_context_search2.1.txt'));
//echo '<pre>';
print_r($rows);
//echo '</pre>';
// Array for output
$concatenated = array();
// Key to organize over
$sortKey = '0';
// Key to concatenate
$concatenateKey = '1';
// Separator string
$separator = ' ';
foreach ($rows as $row) {
    // Guard against invalid rows
    if (!isset($row[$sortKey]) || !isset($row[$concatenateKey])) {
        continue;
    }
    // Current identifier
    $identifier = $row[$sortKey];
    if (!isset($concatenated[$identifier])) {
        // If no matching row has been found yet, create a new item in the
        // concatenated output array
        $concatenated[$identifier] = $row;
    } else {
        // An array has already been set, append the concatenate value
        $concatenated[$identifier][$concatenateKey] .= $separator . $row[$concatenateKey];
    }
}
// Do something useful with the output
//var_dump($concatenated);
//echo json_encode($concatenated)."\n";
$fp = fopen('exemplar.csv', 'w');
foreach ($concatenated as $fields) {
    fputcsv($fp, $fields);
}
fclose($fp);
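Since the index is around a gigabyte, a streaming variant may be worth considering instead of loading every row into an array. The sketch below assumes that all lines for one file are adjacent in the input (which is how a recursive grep emits them) and that the file path itself contains no ':'. The function name compactHits and the comma glue are hypothetical choices, not part of the original script.

```php
<?php
// Hypothetical helper: collapse consecutive "path:hit" lines read from $in
// into one "path:hit_1,hit_2,..." line per path, written to $out.
function compactHits($in, $out, string $glue = ','): void
{
    $current = null; // path of the group being collected
    $hits = [];      // hits collected for that path
    while (($line = fgets($in)) !== false) {
        $line = rtrim($line, "\r\n");
        $pos = strpos($line, ':');
        if ($pos === false) {
            continue; // not a path:hit line
        }
        $path = substr($line, 0, $pos);
        $hit = substr($line, $pos + 1);
        if ($path !== $current) {
            // A new path starts: flush the previous group first
            if ($current !== null) {
                fwrite($out, $current . ':' . implode($glue, $hits) . "\n");
            }
            $current = $path;
            $hits = [];
        }
        $hits[] = $hit;
    }
    // Flush the last group
    if ($current !== null) {
        fwrite($out, $current . ':' . implode($glue, $hits) . "\n");
    }
}
```

Only the hits of the current file are held in memory, so the size of the whole index should not matter. If comments can themselves contain commas, a different glue string would be needed.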

How to delete first 11 lines in a file using PHP?

I have a CSV file in which I want the first 11 lines to be removed. The file looks something like:
"MacroTrends Data Download"
"GOOGL - Historical Price and Volume Data"
"Historical prices are adjusted for both splits and dividends"
"Disclaimer and Terms of Use: Historical stock data is provided 'as is' and solely for informational purposes, not for trading purposes or advice."
"MacroTrends LLC expressly disclaims the accuracy, adequacy, or completeness of any data and shall not be liable for any errors, omissions or other defects in, "
"delays or interruptions in such data, or for any actions taken in reliance thereon. Neither MacroTrends LLC nor any of our information providers will be liable"
"for any damages relating to your use of the data provided."
date,open,high,low,close,volume
2004-08-19,50.1598,52.1911,48.1286,50.3228,44659000
2004-08-20,50.6614,54.7089,50.4056,54.3227,22834300
2004-08-23,55.5515,56.9157,54.6938,54.8694,18256100
2004-08-24,55.7922,55.9728,51.9454,52.5974,15247300
2004-08-25,52.5422,54.1672,52.1008,53.1641,9188600
I want only the stock data and nothing else, so I wish to remove the first 11 lines. Also, there will be several text files for different tickers, so str_replace doesn't seem to be a viable option. The function I've been using to fetch the CSV file and write the required contents to a text file is:
function getCSVFile($url, $outputFile)
{
    $content = file_get_contents($url);
    $content = str_replace("date,open,high,low,close,volume", "", $content);
    $content = trim($content);
    file_put_contents($outputFile, $content);
}
I want a general solution which can remove the first 11 lines from the CSV file and put the remaining contents to a text file. How do I do this?

None of the other examples here will work for large files, because they read the whole file into memory. If you care about efficiency and a low memory footprint, process the file line by line instead:
function saveStrippedCsvFile($inputFile, $outputFile, $lineCountToRemove)
{
    $inputHandle = fopen($inputFile, 'r');
    $outputHandle = fopen($outputFile, 'w');
    // make sure you handle errors as well
    // files may be unreadable, unwritable etc…
    $counter = 0;
    while (($line = fgets($inputHandle)) !== false) {
        if ($counter < $lineCountToRemove) {
            // skip this line
            ++$counter;
            continue;
        }
        // fgets() keeps the line ending, so write the line unchanged
        fwrite($outputHandle, $line);
    }
    fclose($inputHandle);
    fclose($outputHandle);
}

I have a CSV file in which I want the first 11 lines to be removed.
I always prefer to use explode to do that.
$string = file_get_contents($file);
$lines = explode("\n", $string); // double quotes, so that \n is a real line break
for ($i = 0; $i < 11; $i++) { // first key = 0, so keys 0 to 10 = 11 lines
    unset($lines[$i]);
}
This removes those lines, and with implode you can create a new 'file' out of the rest:
$new = implode("\n", $lines);
$new will contain the new file contents.
I didn't test it, but I'm pretty sure that this will work.
Be careful! To quote @emix's comment:
This will fail spectacularly if the file content exceeds available PHP memory.
Make sure that the file isn't too huge.

Use file() to read the file as an array and simply slice off the first 11 lines:
$content = file($url, FILE_IGNORE_NEW_LINES);
$newContent = array_slice($content, 11);
file_put_contents($outputFile, implode(PHP_EOL, $newContent));
But answer these questions:
Why is there additional content in this CSV?
How will you know how many lines to cut off? What if there are more than 11 lines to cut?
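One way to avoid hard-coding the number of lines is to skip everything up to the column header row instead. The sketch below assumes the header line is known in advance (in the question it is date,open,high,low,close,volume) and appears exactly once; the function name copyAfterHeader and its parameters are hypothetical.

```php
<?php
// Hypothetical helper: copy every line that follows the first line equal to
// $headerLine from $in to $out. Returns the number of lines written,
// or -1 if the header line was never found.
function copyAfterHeader($in, $out, string $headerLine, bool $keepHeader = false): int
{
    $found = false;
    $written = 0;
    while (($line = fgets($in)) !== false) {
        if (!$found) {
            // Still in the preamble: look for the header row
            if (rtrim($line, "\r\n") === $headerLine) {
                $found = true;
                if ($keepHeader) {
                    fwrite($out, $line);
                }
            }
            continue;
        }
        // fgets() keeps the line ending, so the line is written unchanged
        fwrite($out, $line);
        $written++;
    }
    return $found ? $written : -1;
}
```

This reads one line at a time, so it should also cope with large downloads, and it keeps working if the length of the disclaimer text changes.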

How to exclude the first line from a text file using php

My text file is sample.txt. I want to exclude the first row from the text file and store the other rows in a MySQL database.
ID Name EMail
1 Siva xyz@gmail.com
2 vinoth xxx@gmail.com
3 ashwin yyy@gmail.com
Now I want to read this data from the text file, except the first row (ID, Name, EMail), and store it in the MySQL database, because I have already created fields in the database with the same names.
I have tried
$handle = @fopen($filename, "r"); //read line one by one
while (!feof($handle)) // Loop till end of file.
{
    $buffer = fgets($handle, 4096); // Read a line.
}
print_r($buffer); // It shows all the text.
Please let me know how to do this?
Thanks.
Regards,
Siva R

It's easier if you use file(), since it will return all rows as an array instead:
// Get all rows in an array (and tell file() not to include the trailing new lines)
$rows = file($filename, FILE_IGNORE_NEW_LINES);
// Remove the first element (first row) from the array
array_shift($rows);
// Now do what you want with the rest
foreach ($rows as $lineNumber => $row) {
    // do something cool with the row data
}
If you want to get it all as a string again, without the first row, just implode it with a new line as glue:
// The rows no longer contain line breaks (FILE_IGNORE_NEW_LINES), so glue them back together with "\n"
$content = implode("\n", $rows);
Note: As @Don'tPanic pointed out in his comment, using file() is simple and easy but not advisable if the original file is large, since it will read the whole thing into memory as an array (and arrays take more memory than strings). He also correctly recommended the FILE_IGNORE_NEW_LINES flag, just so you know :-)

You can just call fgets once before your while loop to get the header row out of the way.
$firstline = fgets($handle, 4096);
while (!feof($handle)) // Loop till end of file.
{ ...
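Put together, the idea of discarding the header with one fgets call before the loop could look like the sketch below. It assumes the fields are separated by runs of spaces or tabs, as in the sample file; the function name readRowsWithoutHeader is hypothetical, and the database insert is left to the caller.

```php
<?php
// Hypothetical helper: read whitespace-separated rows from an open handle,
// skipping the first (header) line. Returns an array of field arrays.
function readRowsWithoutHeader($handle): array
{
    fgets($handle); // read and discard the header row
    $rows = [];
    while (($line = fgets($handle)) !== false) {
        $line = trim($line);
        if ($line === '') {
            continue; // ignore blank lines
        }
        // Assumption: no field contains whitespace itself
        $rows[] = preg_split('/\s+/', $line);
    }
    return $rows;
}
```

Each returned row is an array such as [id, name, email], which can then be bound to a prepared INSERT statement.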

How to format I/O data from script

I was using a script to exclude a list of words from another list of keywords. I would like to change the format of the output. (I found the script on this website and I have made some modification.)
Example:
Phrase from outcome: my word
I would like to add quotes: "my word"
I was thinking that I should write the outcome to new-file.txt and then rewrite it, but I do not understand how to capture the result. Please give me some tips. It's my first script :)
Here is the code:
<?php
$myfile = fopen("newfile1.txt", "w") or die("Unable to open file!");
// Open a file to write the changes - test
$file = file_get_contents("test-action-write-a-doc-small.txt");
// In small.txt there are words that will be excluded from the big list
$searchstrings = file_get_contents("test-action-write-a-doc-full.txt");
// From this list the script is excluding the words that are in small.txt
$breakstrings = explode(',',$searchstrings);
foreach ($breakstrings as $values){
if(!strpos($file, $values)) {
echo $values." = Not found;\n";
}
else {
echo $values." = Found; \n";
}
}
echo "<h1>Outcome:</h1>";
foreach ($breakstrings as $values){
if(!strpos($file, $values)) {
echo $values."\n";
}
}
fwrite($myfile, $values); // write the result in newfile1.txt - test
// a loop is missing?
fclose($myfile); // close newfile1.txt - test
?>
There is also a small bug in the script. It works, but I have to put a line break at the start of test-action-write-a-doc-full.txt and test-action-write-a-doc-small.txt, before the list of words; otherwise it does not find the first word.
Example:
In test-action-write-a-doc-small.txt words:
pick, lol, file, cool,
In test-action-write-a-doc-full.txt words:
pick, bad, computer, lol, break, file.
Outcome:
Pick = Not found -- here is the mistake.
It happens if I do not put a break for the first line in .txt
lol = Found
file = Found
Thanks in advance for any help! :)

You can collect the accepted words in an array, and then glue all those array elements into one text, which you then write to the file. Like this:
echo "<h1>Outcome:</h1>";
// Build an array with accepted words
$keepWords = array();
foreach ($breakstrings as $values) {
    // remove white space surrounding word
    $values = trim($values);
    // compare with false, and skip empty strings
    if ($values !== "" && false === strpos($file, $values)) {
        // Add word to end of array, you can add quotes if you want
        $keepWords[] = '"' . $values . '"';
    }
}
// Glue all words together with commas
$keepText = implode(",", $keepWords);
// Write that to file
fwrite($myfile, $keepText);
Note that you should not write !strpos(..) but false === strpos(..), as explained in the docs: strpos returns 0 when the match is at the very start of the string, and !0 evaluates to true, so a word found at position 0 is reported as not found.
Note also that this method of searching in $file may give unexpected results. For instance, if you have "misery" in your $file string, then the word "is" (if separated by commas in the original file) will be rejected, as it is found inside $file. You might want to review this.
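To avoid the substring problem, the exclusion list can be split into whole words first and then compared exactly. The following is a sketch under the assumption that both files are plain comma-separated word lists, as in the question; the function name excludeWords is hypothetical.

```php
<?php
// Hypothetical helper: return the words from $fullList that do not appear
// as whole comma-separated entries in $smallList.
function excludeWords(string $fullList, string $smallList): array
{
    // Split a comma-separated list into trimmed, non-empty words
    $split = function (string $list): array {
        $words = array_map('trim', explode(',', $list));
        return array_values(array_filter($words, function ($w) {
            return $w !== '';
        }));
    };
    // Flip the exclusion list so each lookup is a cheap isset() on a key
    $blocked = array_flip($split($smallList));
    $kept = [];
    foreach ($split($fullList) as $word) {
        if (!isset($blocked[$word])) {
            $kept[] = $word;
        }
    }
    return $kept;
}
```

With this approach "is" is kept even when the exclusion list contains "misery", because only complete words are compared. The comparison is case-sensitive; lowercasing both lists first would be one way to relax that.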
Concerning the second problem
The fact that it does not work without first adding a line-break in your file leads me to think it is related to the Byte-Order Mark (BOM) that appears in the beginning of many UTF-8 encoded files. The problem and possible solutions are discussed here and elsewhere.
If indeed it is this problem, there are two solutions I would propose:
Use your text editor to save the file as UTF-8 without BOM. For instance, Notepad++ offers this option in its Encoding menu.
Or, add this to your code:
function removeBOM($str = "") {
    if (substr($str, 0, 3) === pack("CCC", 0xef, 0xbb, 0xbf)) {
        $str = substr($str, 3);
    }
    return $str;
}
and then wrap all your file_get_contents calls with that function, like this:
$file = removeBOM(file_get_contents("test-action-write-a-doc-small.txt"));
// In small.txt there are words that will be excluded from the big list
$searchstrings = removeBOM(file_get_contents("test-action-write-a-doc-full.txt"));
// From this list the script is excluding the words that are in small.txt
This will strip these funny bytes from the start of the string taken from the file.

how to insert value in a particular location in csv file using php

Is it possible to write at a particular location in a CSV file using PHP?
I don't want to append data at the end of the CSV file. But I want to add data at the end of a row already having values in the CSV.
thanks in advance

No, it's not possible to insert new data in the middle of a file, because of how filesystems work.
Only appending at the end is possible.
So the only solution is to create another file, write the beginning part of the source to it, append the new value, and then append the rest of the source file. Finally, rename the resulting file to the original name.
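The copy-and-rename approach described above can be sketched as follows. This is only an illustration under a few assumptions: the file is a regular comma-separated CSV, rows are addressed by a 0-based index, and a temporary file next to the original is acceptable. The function name appendToCsvRow and the '.tmp' suffix are hypothetical.

```php
<?php
// Hypothetical helper: append $value as a new last field of row $rowIndex
// (0-based) by streaming the CSV into a temporary file and renaming that
// file over the original.
function appendToCsvRow(string $path, int $rowIndex, string $value): bool
{
    $tmpPath = $path . '.tmp';
    $in = fopen($path, 'r');
    $out = fopen($tmpPath, 'w');
    if ($in === false || $out === false) {
        return false;
    }
    $i = 0;
    // Length 0 means "no line length limit"; separator, enclosure and
    // escape character are passed explicitly
    while (($fields = fgetcsv($in, 0, ',', '"', '\\')) !== false) {
        if ($i === $rowIndex) {
            $fields[] = $value; // the new value goes at the end of this row
        }
        fputcsv($out, $fields, ',', '"', '\\');
        $i++;
    }
    fclose($in);
    fclose($out);
    // Replace the original file with the rewritten copy
    return rename($tmpPath, $path);
}
```

Only one row is held in memory at a time, so this should scale to large files. Note that fputcsv re-quotes fields as it sees fit, so the rewritten file may differ from the original in quoting even for untouched rows.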

There you go. Complete working code:
<?php
// A helper function to insert a value at any position in an array.
function array_insert($array, $pos, $val)
{
    array_splice($array, $pos, 0, array($val));
    return $array;
}
// What and where you want to insert
$DataToInsert = '11,Shamit,Male';
$PositionToInsert = 3;
// Full path and name of the CSV file
$FileName = 'data.csv';
// Read the file as an array of lines, without the trailing line breaks
// (otherwise implode() below would double them).
$arrLines = file($FileName, FILE_IGNORE_NEW_LINES);
// Insert data into this array.
$Result = array_insert($arrLines, $PositionToInsert, $DataToInsert);
// Convert the result array to a string.
$ResultStr = implode("\n", $Result) . "\n";
// Write to the file.
file_put_contents($FileName, $ResultStr);
?>

Technically Col. Shrapnel's answer is absolutely right.
Your problem is that you don't want to deal with all these file operations just to change some data. I agree with you. But you're looking for the solution at the wrong level. Move the problem to a higher level: create a model that represents an entity in your CSV database, modify the model's state, and call its save() method. That method should be responsible for writing the model's state in CSV format.
Still, you can use a CSV library that abstracts low level operations for you. For instance, parsecsv-for-php allows you to target a specific cell:
$csv = new parseCSV();
$csv->sort_by = 'id';
$csv->parse('data.csv');
# "4" is the value of the "id" column of the CSV row
$csv->data[4]['firstname'] = 'John';
$csv->save();
