Context index generation for meilisearch

Context index generation for meilisearch - php

I've been using all sorts of hacks to generate file indexes out of SMB shares. And it's all cool with basic filepath plus metadata indexing.
The next step I want to implement is an algorithm combining some unix-like utilities and php, to index specific context from within files.
Now the first step in this context generation is something like this
while read p; do egrep -rH '^;|\(|^\(|\)$' "$p"; done <textual.txt > text_context_search.txt
This is specific regexing for my purpose for indexing contents of programs, this extracts lines that are whole comments or contains comments out of CNC program files.
resulting output is something like
file_path:regex_hit
now obviously most programs has more than one comment, so theres too much redundancy not only in repetition, but an exhaustive context index is about a gigabyte in size
I am now working towards script that would compact redudancy in such pattern
file_path_1:regex_hit_1
file_path_1:regex_hit_2
file_path_1:regex_hit_3
...
would become:
file_path_1:regex_hit1,regex_hit_2,regex_hit3
and if I succeed to do this in efficient manner its all ok.
The problem here is whether I'm doing this in a proper way. Maybe I should be using different tools to generate such context index in the first place ?
EDIT
After further copying and pasting from stack overflow and thinking about it I glued up solution using not my code, that nearly entirely solves my previously mentioned issue.
<?php
// https://stackoverflow.com/questions/26238299/merging-csv-lines-where-column-value-is-the-same
$rows = array_map('str_getcsv', file('text_context_search2.1.txt'));
//echo '<pre>';
print_r($csv);
//echo '</pre>';
// Array for output
$concatenated = array();
// Key to organize over
$sortKey = '0';
// Key to concatenate
$concatenateKey = '1';
// Separator string
$separator = ' ';
foreach($rows as $row) {
// Guard against invalid rows
if (!isset($row[$sortKey]) || !isset($row[$concatenateKey])) {
continue;
}
// Current identifier
$identifier = $row[$sortKey];
if (!isset($concatenated[$identifier])) {
// If no matching row has been found yet, create a new item in the
// concatenated output array
$concatenated[$identifier] = $row;
} else {
// An array has already been set, append the concatenate value
$concatenated[$identifier][$concatenateKey] .= $separator . $row[$concatenateKey];
}
}
// Do something useful with the output
//var_dump($concatenated);
//echo json_encode($concatenated)."\n";
$fp = fopen('exemplar.csv', 'w');
foreach ($concatenated as $fields) {
fputcsv($fp, $fields);
}
fclose($fp);

Related

PHP If-Else does not work for comparing filecontents

I am trying to make a PHP application which searches through the files of your current directory and looks for a file in every subdirectory called email.txt, then it gets the contents of the file and compares the contents from email.txt with the given query and echoes all the matching directories with the given query. But it does not work and it looks like the problem is in the if-else part of the script at the end because it doesn't give any output.
<?php
// pulling query from link
$query = $_GET["q"];
echo($query);
echo("<br>");
// listing all files in doc directory
$files = scandir(".");
// searching trough array for unwanted files
$downloader = array_search("downloader.php", $files);
$viewer = array_search("viewer.php", $files);
$search = array_search("search.php", $files);
$editor = array_search("editor.php", $files);
$index = array_search("index.php", $files);
$error_log = array_search("error_log", $files);
$images = array_search("images", $files);
$parsedown = array_search("Parsedown.php", $files);
// deleting unwanted files from array
unset($files[$downloader]);
unset($files[$viewer]);
unset($files[$search]);
unset($files[$editor]);
unset($files[$index]);
unset($files[$error_log]);
unset($files[$images]);
unset($files[$parsedown]);
// counting folders
$folderamount = count($files);
// defining loop variables
$loopnum = 0;
// loop
while ($loopnum <= $folderamount + 10) {
$loopnum = $loopnum + 1;
// gets the emails from every folder
$dirname = $files[$loopnum];
$email = file_get_contents("$dirname/email.txt");
//checks if the email matches
if ($stremail == $query) {
echo($dirname);
}
}
//print_r($files);
//echo("<br><br>");
?>
Can someone explain / fix this for me? I literally have no clue what it is and I debugged soo much already. It would be heavily gracious and appreciated.
Kind regards,
Bluppie05

There's a few problems with this code that would be preventing you from getting the correct output.
The main reason you don't get any output from the if test is the condition is (presumably) using the wrong variable name.
// variable with the file data is called $email
$email = file_get_contents("$dirname/email.txt");
// test is checking $stremail which is never given a value
if ($stremail == $query) {
echo($dirname);
}
There is also an issue with your scandir() and unset() combination. As you've discovered scandir() basically gives you everything that a dir or ls would on the command line. Using unset() to remove specific files is problematic because you have to maintain a hardcoded list of files. However, unset() also leaves holes in your array, the count changes but the original indices do not. This may be why you are using $folderamount + 10 in your loop. Take a look at this Stack Overflow question for more discussion of the problem.
Rebase array keys after unsetting elements
I recommend you read the PHP manual page on the glob() function as it will greatly simplify getting the contents of a directory. In particular take a look at the GLOB_ONLYDIR flag.
https://www.php.net/manual/en/function.glob.php
Lastly, don't increment your loop counter at the beginning of the loop when using the counter to read elements from an array. Take a look at the PHP manual page for foreach loops for a neater way to iterate over an array.
https://www.php.net/manual/en/control-structures.foreach.php

PHP Array sorting within WHILE loop

I have a huge issue, I cant find any way to sort array entries. My code:
<?php
error_reporting(0);
$lines=array();
$fp=fopen('file.txt, 'r');
$i=0;
while (!feof($fp))
{
$line=fgets($fp);
$line=trim($line);
$lines[]=$line;
$oneline = explode("|", $line);
if($i>30){
$fz=fopen('users.txt', 'r');
while (!feof($fz))
{
$linez=fgets($fz);
$linez=trim($linez);
$lineza[]=$linez;
$onematch = explode(",", $linez);
if (strpos($oneline[1], $onematch[1])){
echo $onematch[0],$oneline[4],'<br>';
}
else{
}
rewind($onematch);
}
}
$i++;
}
fclose($fp);
?>
The thing is, I want to sort items that are being echo'ed by $oneline[4]. I tried several other posts from stackoverflow - But was not been able to find a solution.

The anser to your question is that in order to sort $oneline[4], which seems to contain a string value, you need to apply the following steps:
split the string into an array ($oneline[4] = explode(',',
$oneline[4]))
sort the resulting array (sort($oneline[4]))
combine the array into a string ($oneline[4] = implode(',',
$oneline[4]))
As I got the impression variable naming is low on the list of priorities I'm re-using the $oneline[4] variable. Mostly to clarify which part of the code I am referring to.
That being said, there are other improvements you should be making, if you want to be on speaking terms with your future self (in case you need to work on this code in a couple of months)
Choose a single coding style and stick to it, the original code looked like it was copy/pasted from at least 4 different sources (mostly inconsistent quote-marks and curly braces)
Try to limit repeating costly operations, such as opening files whenever you can (to be fair, the agents.data could contain 31 lines and the users.txt would be opened only once resulting in me looking like a fool)
I have updated your code sample to try to show what I mean by the points above.
<?php
error_reporting(0);
$lines = array();
$users = false;
$fp = fopen('http://20.19.202.221/exports/agents.data', 'r');
while ($fp && !feof($fp)) {
$line = trim(fgets($fp));
$lines[] = $line;
$oneline = explode('|', $line);
// if we have $users (starts as false, is turned into an array
// inside this if-block) or if we have collected 30 or more
// lines (this condition is only checked while $users = false)
if ($users || count($lines) > 30) {
// your code sample implies the users.txt to be small enough
// to process several times consider using some form of
// caching like this
if (!$users) {
// always initialize what you intend to use
$users = [];
$fz = fopen('users.txt', 'r');
while ($fz && !feof($fz)) {
$users[] = explode(',', trim(fgets($fz)));
}
// always close whatever you open.
fclose($fz);
}
// walk through $users, which contains the exploded contents
// of each line in users.txt
foreach ($users as $onematch) {
if (strpos($oneline[1], $onematch[1])) {
// now, the actual question: how to sort $oneline[4]
// as the requested example was not available at the
// time of writing, I assume
// it to be a string like: 'b,d,c,a'
// first, explode it into an array
$oneline[4] = explode(',', $oneline[4]);
// now sort it using the sort function of your liking
sort($oneline[4]);
// and implode the sorted array back into a string
$oneline[4] = implode(',', $oneline[4]);
echo $onematch[0], $oneline[4], '<br>';
}
}
}
}
fclose($fp);
I hope this doesn't offend you too much, just trying to help and not just providing the solution to the question at hand.

How to format I/O data from script

I was using a script to exclude a list of words from another list of keywords. I would like to change the format of the output. (I found the script on this website and I have made some modification.)
Example:
Phrase from outcome: my word
I would like to add quotes: "my word"
I was thinking that I should put the outcome in new-file.txt and after to rewrite it, but I do not understand how to capture the result. Please, kindly give me some tips. It's my first script :)
Here is the code:
<?php
$myfile = fopen("newfile1.txt", "w") or die("Unable to open file!");
// Open a file to write the changes - test
$file = file_get_contents("test-action-write-a-doc-small.txt");
// In small.txt there are words that will be excluded from the big list
$searchstrings = file_get_contents("test-action-write-a-doc-full.txt");
// From this list the script is excluding the words that are in small.txt
$breakstrings = explode(',',$searchstrings);
foreach ($breakstrings as $values){
if(!strpos($file, $values)) {
echo $values." = Not found;\n";
}
else {
echo $values." = Found; \n";
}
}
echo "<h1>Outcome:</h1>";
foreach ($breakstrings as $values){
if(!strpos($file, $values)) {
echo $values."\n";
}
}
fwrite($myfile, $values); // write the result in newfile1.txt - test
// a loop is missing?
fclose($myfile); // close newfile1.txt - test
?>
There is also a little mistake in the script. It works fine however before entering the list of words in test-action-write-a-doc-full.txt and in test-action-write-a-doc-small.txt I have to put a break for the first line otherwise it does not find the first word.
Example:
In test-action-write-a-doc-small.txt words:
pick, lol, file, cool,
In test-action-write-a-doc-full.txt wwords:
pick, bad, computer, lol, break, file.
Outcome:
Pick = Not found -- here is the mistake.
It happens if I do not put a break for the first line in .txt
lol = Found
file = Found
Thanks in advance for any help! :)

You can collect the accepted words in an array, and then glue all those array elements into one text, which you then write to the file. Like this:
echo "<h1>Outcome:</h1>";
// Build an array with accepted words
$keepWords = array();
foreach ($breakstrings as $values){
// remove white space surrounding word
$values = trim($values);
// compare with false, and skip empty strings
if ($values !== "" and false === strpos($file, $values)) {
// Add word to end of array, you can add quotes if you want
$keepWords[] = '"' . $values . '"';
}
}
// Glue all words together with commas
$keepText = implode(",", $keepWords);
// Write that to file
fwrite($myfile, $keepText);
Note that you should not write !strpos(..) but false === strpos(..) as explained in the docs.
Note also that this method of searching in $file will maybe give unexpected results. For instance, if you have "misery" in your $file string then the word "is" (if separated by commas in the original file) will be refused, as it is found in $file. You might want to review this.
Concerning the second problem
The fact that it does not work without first adding a line-break in your file leads me to think it is related to the Byte-Order Mark (BOM) that appears in the beginning of many UTF-8 encoded files. The problem and possible solutions are discussed here and elsewhere.
If indeed it is this problem, there are two solutions I would propose:
Use your text editor to save the file as UTF-8, but without BOM. For instance, notepad++ has this possibility in the encoding menu.
Or, add this to your code:
function removeBOM($str = "") {
if (substr($str, 0,3) == pack("CCC",0xef,0xbb,0xbf)) {
$str = substr($str, 3);
}
return $str;
}
and then wrap all your file_get_contents calls with that function, like this:
$file = removeBOM(file_get_contents("test-action-write-a-doc-small.txt"));
// In small.txt there are words that will be excluded from the big list
$searchstrings = removeBOM(file_get_contents("test-action-write-a-doc-full.txt"));
// From this list the script is excluding the words that are in small.txt
This will strip these funny bytes from the start of the string taken from the file.

Handle Large File with PHP

I have a file with the size of around 10 GB or more. The file contains only numbers ranging from 1 to 10 on each line and nothing else. Now the task is to read the data[numbers] from the file and then sort the numbers in ascending or descending order and create a new file with the sorted numbers.
Can anyone of you please help me with the answer?

I'm assuming this is somekind of homework and goal for this is to sort more data than you can hold in your RAM?
Since you only have numbers 1-10, this is not that complicated task. Just open your input file and count how many occourances of every specific number you have. After that you can construct simple loop and write values into another file. Following example is pretty self explainatory.
$inFile = '/path/to/input/file';
$outFile = '/path/to/output/file';
$input = fopen($inFile, 'r');
if ($input === false) {
throw new Exception('Unable to open: ' . $inFile);
}
//$map will be array with size of 10, filled with 0-s
$map = array_fill(1, 10, 0);
//Read file line by line and count how many of each specific number you have
while (!feof($input)) {
$int = (int) fgets($input);
$map[$int]++;
}
fclose($input);
$output = fopen($outFile, 'w');
if ($output === false) {
throw new Exception('Unable to open: ' . $outFile);
}
/*
* Reverse array if you need to change direction between
* ascending and descending order
*/
//$map = array_reverse($map);
//Write values into your output file
foreach ($map AS $number => $count) {
$string = ((string) $number) . PHP_EOL;
for ($i = 0; $i < $count; $i++) {
fwrite($output, $string);
}
}
fclose($output);
Taking into account the fact, that you are dealing with huge files, you should also check script execution time limit for your PHP environment, following example will take VERY long for 10GB+ sized files, but since I didn't see any limitations concerning execution time and performance in your question, I'm assuming it is OK.

I had a similar issue before. Trying to manipulate such a large file ended up being huge drain on resources and it couldn't cope. The easiest solution I ended up with was to try and import it into a MySQL database using a fast data dump function called LOAD DATA INFILE
http://dev.mysql.com/doc/refman/5.1/en/load-data.html
Once it's in you should be able to manipulate the data.
Alternatively, you could just read the file line by line while outputting the result into another file line by line with the sorted numbers. Not too sure how well this would work though.
Have you had any previous attempts at it or are you just after a possible method of doing it?

If that's all you don't need PHP (if you have a Linux maschine at hand):
sort -n file > file_sorted-asc
sort -nr file > file_sorted-desc
Edit: OK, here's your solution in PHP (if you have a Linux maschine at hand):
<?php
// Sort ascending
`sort -n file > file_sorted-asc`;
// Sort descending
`sort -nr file > file_sorted-desc`;
?>
:)

CSV file generation error

I'm working on a project for a client - a wordpress plugin that creates and maintains a database of organization members. I'll note that this plugin creates a new table within the wordpress database (instead of dealing with the data as custom_post_type meta data). I've made a lot of modifications to much of the plugin, but I'm having an issue with a feature (that I've left unchanged).
One half of this feature does a csv import and insert, and that works great. The other half of this sequence is a feature to download the contents of this table as a csv. This part works fine on my local system, but fails when running from the server. I've poured over each portion of this script and everything seems to make sense. I'm, frankly, at a loss as to why it's failing.
The php file that contains the logic is simply linked to. The file:
<?php
// initiate wordpress
include('../../../wp-blog-header.php');
// phpinfo();
function fputcsv4($fh, $arr) {
$csv = "";
while (list($key, $val) = each($arr)) {
$val = str_replace('"', '""', $val);
$csv .= '"'.$val.'",';
}
$csv = substr($csv, 0, -1);
$csv .= "\n";
if (!#fwrite($fh, $csv))
return FALSE;
}
//get member info and column data
$table_name = $wpdb->prefix . "member_db";
$year = date ('Y');
$members = $wpdb->get_results("SELECT * FROM ".$table_name, ARRAY_A);
$columns = $wpdb->get_results("SHOW COLUMNS FROM ".$table_name, ARRAY_A);
// echo 'SQL: '.$sql.', RESULT: '.$result.'<br>';
//output headers
header("Content-type: application/octet-stream");
header("Content-Disposition: attachment; filename=\"members.csv\"");
//open output stream
$output = fopen("php://output",'w');
//output column headings
$data[0] = "ID";
$i = 1;
foreach ($columns as $column){
//DIAG: echo '<pre>'; print_r($column); echo '</pre>';
$field_name = '';
$words = explode("_", $column['Field']);
foreach ($words as $word) $field_name .= $word.' ';
if ( $column['Field'] != 'id' && $column['Field'] != 'date_updated' ) {
$data[$i] = ucwords($field_name);
$i++;
}
}
$data[$i] = "Date Updated";
fputcsv4($output, $data);
//output data
foreach ($members as $member){
// echo '<pre>'; print_r($member); echo '</pre>';
$data[0] = $member['id'];
$i = 1;
foreach ($columns as $column){
//DIAG: echo '<pre>'; print_r($column); echo '</pre>';
if ( $column['Field'] != 'id' && $column['Field'] != 'date_updated' ) {
$data[$i] = $member[$column['Field']];
$i++;
}
}
$data[$i] = $member['date_updated'];
//echo '<pre>'; print_r($data); echo '</pre>';
fputcsv4($output, $data);
}
fclose($output);
?>
So, obviously, a routine wherein a query is run, $output is established with fopen, each row is then formatted as comma delimited and fwrited, and finally the file is fclosed where it gets pushed to a local system.
The error that I'm getting (from the server) is
Error 6 (net::ERR_FILE_NOT_FOUND): The file or directory could not be found.
But it clearly is getting found, its just failing. If I enable phpinfo() (PHP Version 5.2.17) at the top of the file, I definitely get a response - notably Cannot modify header information (I'm pretty sure because phpinfo() has already generated a header). All the expected data does get printed to the bottom of the page (after all the phpinfo diagnostics), however, so that much at least is working correctly.
I am guessing there is something preventing the fopen, fwrite, or fclose functions from working properly (a server setting?), but I don't have enough experience with this to identify exactly what the problem is.
I'll note again that this works exactly as expected in my test environment (localhost/XAMPP, netbeans).
Any thoughts would be most appreciated.
update
Ok - spent some more time with this today. I've tried each of the suggested fixes, including #Rudu's writeCSVLine fix and #Fernando Costa's file_put_contents() recommendation. The fact is, they all work locally. Either just echoing or the fopen,fwrite,fclose routine, doesn't matter, works great.
What does seem to be a problem is the inclusion of the wp-blog-header.php at the start of the file and then the additional header() calls. (The path is definitely correct on the server, btw.)
If I comment out the include, I get a csv file downloaded with some errors planted in it (because $wpdb doesn't exist. And if comment out the headers, I get all my data printed to the page.
So... any ideas what could be going on here?
Some obvious conflict of the wordpress environment and the proper creation of a file.
Learning a lot, but no closer to an answer... Thinking I may need to just avoid the wordpress stuff and do a manual sql query.

Ok so I'm wondering why you've taken this approach. Nothing wrong with php://output but all it does is allow you to write to the output buffer the same way as print and echo... if you're having trouble with it, just use print or echo :) Any optimizations you could have got from using fwrite on the stream then gets lost by you string-building the $csv variable and then writing that in one go to the output stream (Not that optimizations are particularly necessary). All that in mind my solution (in keeping with your original design) would be this:
function escapeCSVcell($val) {
return str_replace('"','""',$val);
//What about new lines in values? Perhaps not relevant to your
// data but they'll mess up your output ;)
}
function writeCSVLine($arr) {
$first=true;
foreach ($arr as $v) {
if (!$first) {echo ",";}
$first=false;
echo "\"".escapeCSVcell($v)."\"";
}
echo "\n"; // May want to use \r\n depending on consuming script
}
Now use writeCSVLine in place of fputcsv4.

Ran into this same issue. Stumbled upon this thread which does the same thing but hooks into the 'plugins_loaded' action and exports the CSV then. https://wordpress.stackexchange.com/questions/3480/how-can-i-force-a-file-download-in-the-wordpress-backend
Exporting the CSV early eliminates the risk of the headers already being modified before you get to them.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

Context index generation for meilisearch - php

Related

PHP If-Else does not work for comparing filecontents

PHP Array sorting within WHILE loop

How to format I/O data from script

Handle Large File with PHP

CSV file generation error

Categories

Resources