Using Explode or Preg_Split to split filenames in a string - php

In my PHP script I pull from a database field a list of file names. The names in the field are separated by commas and can be various lengths containing various characters and / or spaces. The string could look something like this:
"fileone.wav, file two with spaces.mp3, another file but this one has commas, which is, of course, the problem.mp3, another_one.mp3"
I am using this to explode them into an array ($attachments contains the string from the db field):
$filenames = explode(", ", $attachments);
My dilemma is that sometimes the file names contain commas, so explode() fails: it splits at every comma and breaks those file names into separate array elements.
I'm wondering if maybe preg_split would be a better way to match and split filenames. I'm very inexperienced with regex but conceptually I imagine I'd split the names by matching the ".", the three characters that follow, whatever they are and the comma.
Is this a good way to do this? And how would I write that expression?

If your filenames can have commas in them (and have no escape character) it's impossible to decide how to split the filenames properly.
Maybe you have a file named one.mp3,two.mp3. Whoever decided to store the filenames like this made a terrible mistake. There are so many serializers available that there is no excuse not to use one. Even something like (un)serialize($attachments) is sufficient.
You can do simple detection, such as finding an extension (a . followed by something) and then splitting at the first comma after it. You don't need a regular expression for that; just walk the string.
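As a rough illustration of walking the string, the sketch below treats a comma as a separator only when the text collected so far ends in a dot followed by letters or digits. The function name split_on_extension is made up for this example, and the rule still misfires on a name such as one.mp3,two.mp3.

```php
<?php
// Hypothetical helper: split a comma-separated list of file names,
// accepting a comma as a separator only right after something that
// looks like an extension (a dot followed by letters/digits).
function split_on_extension(string $list): array
{
    $names  = [];
    $start  = 0; // where the current file name begins
    $offset = 0; // where to look for the next comma

    while (($comma = strpos($list, ',', $offset)) !== false) {
        $candidate = substr($list, $start, $comma - $start);
        $dot = strrpos($candidate, '.');

        // Only split if the candidate ends in ".<alphanumerics>".
        if ($dot !== false && ctype_alnum((string) substr($candidate, $dot + 1))) {
            $names[] = trim($candidate);
            $start = $comma + 1;
        }
        $offset = $comma + 1;
    }

    // Whatever is left after the last accepted comma is the final name.
    $last = trim((string) substr($list, $start));
    if ($last !== '') {
        $names[] = $last;
    }
    return $names;
}

$attachments = 'fileone.wav, file two with spaces.mp3, another file but this one has commas, which is, of course, the problem.mp3, another_one.mp3';
$filenames = split_on_extension($attachments);
```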

The data format as you have it is fundamentally flawed, as you've discovered.
Ideally, you need to fix the data. If you want to stick with the basic format you have (ie comma separated), you should make sure that it is saved in a valid CSV format -- ie with quotes around the values that contain commas, so your string would look like this:
fileone.wav, file two with spaces.mp3, "another file but this one has commas, which is, of course, the problem.mp3", another_one.mp3
With the data in this format, you could use PHP's built-in CSV handling function str_getcsv() to read the data instead of explode(). Problem solved.
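A minimal sketch of reading such a value with str_getcsv(). It assumes the stored string has no space between a comma and an opening quote, and it trims each value anyway; the helper name parse_attachments is hypothetical.

```php
<?php
// Hypothetical helper: parse a CSV-formatted list of file names.
// Delimiter, enclosure and escape are passed explicitly rather than
// relying on the defaults.
function parse_attachments(string $csv): array
{
    $fields = str_getcsv($csv, ',', '"', '\\');
    // Trim stray whitespace around unquoted values.
    return array_map('trim', $fields);
}

// Assumed stored value: names containing commas are wrapped in double quotes.
$attachments = 'fileone.wav,file two with spaces.mp3,"another file, with commas.mp3",another_one.mp3';
$filenames = parse_attachments($attachments);
```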
If you're happy to try other formats, you could also reformat the data into JSON or some other serialised format, which would also make things easier to manage.
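For the JSON route, a sketch along these lines would do. The two function names are invented for illustration, and it assumes the database column is large enough to hold the encoded string.

```php
<?php
// Hypothetical helpers: store the list as a JSON array instead of a
// hand-rolled comma-separated string.
function encode_filenames(array $names): string
{
    // array_values() guarantees a JSON list rather than an object.
    return json_encode(array_values($names));
}

function decode_filenames(string $json): array
{
    $decoded = json_decode($json, true);
    // Fall back to an empty list if the stored value is not valid JSON.
    return is_array($decoded) ? $decoded : [];
}

$names  = ['fileone.wav', 'a file, with commas.mp3', 'another_one.mp3'];
$stored = encode_filenames($names);   // value to save in the db field
$loaded = decode_filenames($stored);  // value read back later
```

Commas inside a name are no longer a problem because each name is a quoted JSON string.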
The most technically correct answer remains to normalise the database so that the filenames have their own table and each one is in a separate record, but this may be overkill and/or too much upheaval for your purposes.
So yes, ideally you should fix the data, because it is in a very very badly designed format.
However if you really can't fix the data, then you will have to resort to some clever regex trickery to split the files.
Assuming all files end in ".mp3", it's relatively simple; you could do something like this:
preg_split('/\.mp3(?:,\s*|$)/', $data, -1, PREG_SPLIT_NO_EMPTY)
...which will give you the filenames without the .mp3 extension. If they're all mp3, then it's easy enough to add it back on again.
If your file names are mixed file types, then it gets more complex; you'd need to use regex look-aheads to find the extensions but without removing them.
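One way to keep the extensions is a look-behind assertion: split on a comma only when it is directly preceded by a dot and three alphanumeric characters. This is a sketch under the question's own assumption that every extension is exactly three characters long; the function name is hypothetical.

```php
<?php
// Hypothetical helper: split on commas that directly follow a
// three-character extension. The look-behind (?<=...) checks the
// extension without consuming it, so it stays in the result.
function split_after_extension(string $data): array
{
    return preg_split('/(?<=\.[A-Za-z0-9]{3}),\s*/', $data, -1, PREG_SPLIT_NO_EMPTY);
}

$data = 'fileone.wav, file two with spaces.mp3, another file but this one has commas, which is, of course, the problem.mp3, another_one.mp3';
$filenames = split_after_extension($data);
```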
Your problem with all of this, however, is that it would be possible for a filename to contain .mp3, somewhere in the middle of the name. Not likely of course, but possible, especially if you allow your users to upload their own file names.

Related

Best way to parse a text document

I'm trying to parse a plain text document in PHP but have no idea how to do it correctly.
I want to separate each word, assign them an ID and save the result in JSON format.
Sample text:
"Hello, how are you (today)"
This is what I'm doing at the moment:
$document_array = explode(' ', $document_text);
json_encode($document_array);
The resulting JSON is
[["Hello,"],["how"],["are"],["you"],["(today)"]]
How do I ensure that spaces are kept in-place and that symbols are not included along with the words...
[["Hello"],[", "],["how"],[" "],["are"],[" "],["you"],[" ("],["today"],[")"]]
I’m sure some sort of regex is required... but have no idea what kind of pattern to apply to deal with all cases... Any suggestions guys?
This is actually a really complex problem, and one that's subject to a fair amount of academic research. It sounds so simple (just split on whitespace! with maybe a few rules for punctuation...) but you quickly run into issues. Is "didn't" one word or two? What about hyphenated words? Some might be one word, some might be two. What about multiple successive punctuation characters? Possessives versus quotes? etc etc. Even determining the end of a sentence is non-trivial. (It's just a full stop right?!)
This problem is one of tokenisation and a topic that search engines take very seriously. To be honest you should really look at finding a tokeniser in your language of choice.
Maybe this?
array_filter(preg_split('/\b/', $document_text))
array_filter removes the empty values at the first and/or last index of the resulting array, which appear if your string starts or ends at a word boundary (\b, see: http://php.net/manual/en/regexp.reference.escape.php). Note that array_filter without a callback also drops any token equal to "0".
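A variant of the same idea, sketched with the PREG_SPLIT_NO_EMPTY flag so that only empty pieces are discarded. The function name tokenize is made up, and without the /u modifier \b only recognises ASCII word characters, so this is not a real tokeniser.

```php
<?php
// Hypothetical helper: split text at word boundaries, keeping the
// whitespace and punctuation between words as their own tokens.
function tokenize(string $text): array
{
    return preg_split('/\b/', $text, -1, PREG_SPLIT_NO_EMPTY);
}

$tokens = tokenize('Hello, how are you (today)');
// The array keys (0, 1, 2, ...) can serve as token IDs.
$json = json_encode($tokens);
```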

Detect random strings

I am building a script in PHP to detect whether file names make sense or are completely random. I'm using regular expressions.
A valid filename = sample-image-25.jpg
A random filename = 46347sdga467234626.jpg
I want to check whether the filename makes sense; if not, I want to alert the user to fix the filename before continuing.
Any help?
I'm not really sure that's possible because I'm not sure it's possible to define "random" in a way the computer will understand sufficiently well.
"umiarkowany" looks random, but it's a perfectly valid word I pulled off the Polish Wikipedia page for South Korea.
My advice is to think more deeply about why this design detail is important, and look for a more feasible solution to the underlying problem.
That would take far too much work. You would have to build a huge array of the most-used words (like a dictionary) and check whether most of the words in the file name (maybe separated by - or _) are in it, and it would still be riddled with bugs.
Basically you will need:
explode()
implode()
array_search() or in_array()
Take the string and look for a separator ("glue") such as "_" or "-" with preg_match(); if there is one, explode the string into an array and compare that array against the dictionary array.
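The dictionary approach could be sketched roughly as follows. The function name, the tiny word list and the 50% threshold are all assumptions for illustration; preg_split() is used here in place of the preg_match() plus explode() pair so that "-", "_" and spaces are handled in one step.

```php
<?php
// Hypothetical helper: guess whether a file name is made of known words.
function looks_meaningful(string $filename, array $dictionary, float $threshold = 0.5): bool
{
    // Drop the extension, then split the base name on -, _ or whitespace.
    $base  = pathinfo($filename, PATHINFO_FILENAME);
    $parts = preg_split('/[-_\s]+/', strtolower($base), -1, PREG_SPLIT_NO_EMPTY);

    // Ignore purely numeric parts such as "25".
    $words = array_filter($parts, function ($part) {
        return !ctype_digit($part);
    });
    if (count($words) === 0) {
        return false;
    }

    // Count how many of the remaining parts are dictionary words.
    $known = 0;
    foreach ($words as $word) {
        if (in_array($word, $dictionary, true)) {
            $known++;
        }
    }
    return $known / count($words) >= $threshold;
}

// Assumed, deliberately tiny dictionary.
$dictionary = ['sample', 'image', 'photo', 'file'];
$ok  = looks_meaningful('sample-image-25.jpg', $dictionary);
$bad = looks_meaningful('46347sdga467234626.jpg', $dictionary);
```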
Or, since almost every word alternates vowels and consonants, you could write a large script that checks whether most of the words in the file name look "non-random". But the problem stays the same: why do you need this? Look for a more flexible solution.
Notice:
Consider that even a simple-and-friendly-file.png could be the result of a string generator.
Good luck with that.

Importing txt files with Unknown Delimiter

I want to import really clean .txt files into MySQL with PHP. I've read that this is easy if you know the delimiter, but I don't.
In my case, the .txt files look like tables - ie: they're still structured like tables, not like a standard, jumbled CSV file.
Does this mean I don't have a delimited file? If so, any advice on how I might approach importing?
Sometimes the delimiter is the column number, rather than a character.
I.e. each data column begins in a specified physical character column. Each column is a fixed width, and parsing is as simple as splitting the string on those character width boundaries, and trimming whitespace if needed.
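Splitting on fixed column widths might look like the sketch below. The widths (10, 5, 10) and the sample row are invented for illustration; the real widths would have to be read off the actual file.

```php
<?php
// Hypothetical helper: cut one line into fields of the given widths
// and trim the padding from each field.
function parse_fixed_width(string $line, array $widths): array
{
    $fields = [];
    $pos = 0;
    foreach ($widths as $width) {
        $fields[] = trim((string) substr($line, $pos, $width));
        $pos += $width;
    }
    return $fields;
}

// Build a sample row padded to the assumed column widths.
$line = str_pad('Alice', 10) . str_pad('30', 5) . str_pad('London', 10);
$row  = parse_fixed_width($line, [10, 5, 10]);
```

Each parsed row can then be passed to a prepared INSERT statement.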
Sorry about that. An example would obviously help.
Here's an idea of what it looks like - ie: it already looks like a table.
https://gist.github.com/9753ad04b0fab256e452

How to show a comma within a comma separated file

I'm generating a CSV file using PHP. Some columns contain a paragraph with commas, and when I open the file, every comma counts as a new column. Is it possible to escape these commas somehow?
Depends, what, your, CSV, reader, is, "but, quoting, should, work"
Many CSV readers will allow commas within a single column by surrounding the column with double quotes. In that case, double quotes can be represented by double double quotes:
column 1,"column 2, with comma","column 3 with ""quote chars"", and comma"
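In PHP, fputcsv() applies this quoting for you. The sketch below writes one row to an in-memory stream and returns it as a string; the helper name csv_line is hypothetical, and the delimiter, enclosure and escape arguments are spelled out rather than left to their defaults.

```php
<?php
// Hypothetical helper: format one row as a CSV line, letting
// fputcsv() add quotes and double any embedded quote characters.
function csv_line(array $fields): string
{
    $handle = fopen('php://memory', 'w+');
    fputcsv($handle, $fields, ',', '"', '\\');
    rewind($handle);
    $line = stream_get_contents($handle);
    fclose($handle);
    return rtrim($line, "\n");
}

$row  = ['column 1', 'column 2, with comma', 'column 3 with "quote chars", and comma'];
$line = csv_line($row);
```

When writing a real file, open it with fopen() and call fputcsv() once per row instead of going through memory.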
That's the BIG problem with using a , in a CSV file. I would recommend using a different separator like | (it's less likely to appear in text) or a different, more robust file format like XML for generating your file.
You're using a comma because it's a delimiter. That is, the comma has special meaning no matter where it's used. By that very definition, it becomes hard to treat it as context sensitive. It can be done though, considering symbols like '\n'.
You can try a new delimiter, such as ,\n, though that might not be an option for you.
Looks to me that the best solution would be to use a different persistence mechanism. Things will get sticky otherwise.

Concatenate RTF files in PHP (REGEX)

I've got a script that takes a user uploaded RTF document and merges in some person data into the letter (name, address, etc), and does this for multiple people. I merge the letter contents, then combine that with the next merge letter contents, for all people records.
Effectively I'm appending a single RTF document to itself once for each person record I need to merge the letter for. However, I need to first remove the closing RTF markup and opening of the RTF markup of each merge or else the RTF won't render correctly. This sounds like a job for regular expressions.
Essentially I need a regex that will remove the entire string:
}\n\page ANYTHING \par
Example, this regex would match this:
crap
}
\page{\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\fswiss\fcharset0 Arial;}}
{\*\generator Msftedit 5.41.15.1515;}\viewkind4\uc1\pard\f0\fs20 September 30, 2008\par
more crap
So I could make it just:
crap
\page
more crap
Is RegEx the best approach here?
UPDATE: Why do I have to use RTF?
I want to enable the user to upload a form letter that the system will then use to create the merged letters. Since RTF is plain text, I can do this pretty easily in code. I know, RTF is a disaster of a spec, but I don't know any other good alternative.
I would question the use of RTF in this case. It's not entirely clear to me what you're trying to do overall, so I can't necessarily suggest anything better, but if you can try to explain your project more broadly, maybe I can help.
If this is really the way you want to go though, this regex gave me the correct output given your input:
$output = preg_replace("/}\s?\n\\\\page.*?\\\\par\s?\n/ms", "\\page\n", $input);
To this I can say ick ick ick. Nevertheless, rcar's kludge probably will work, barring some weird edge-case where RTF doesn't actually end in that form, or the document-wide styles include important information that utterly messes up the formatting, or any other of the many failure modes.
