Best way to parse a text document

I'm trying to parse a plain text document in PHP but have no idea how to do it correctly.
I want to separate each word, assign them an ID and save the result in JSON format.
Sample text:
"Hello, how are you (today)"
This is what I'm doing at the moment:
$document_array = explode(' ', $document_text);
json_encode($document_array);
The resulting JSON is
[["Hello,"],["how"],["are"],["you"],["(today)"]]
How do I ensure that spaces are kept in-place and that symbols are not included along with the words...
[["Hello"],[", "],["how"],[" "],["are"],[" "],["you"],[" ("],["today"],[")"]]
I’m sure some sort of regex is required... but have no idea what kind of pattern to apply to deal with all cases... Any suggestions guys?

This is actually a really complex problem, and one that's subject to a fair amount of academic research. It sounds so simple (just split on whitespace, with maybe a few rules for punctuation...) but you quickly run into issues. Is "didn't" one word or two? What about hyphenated words? Some might be one word, some might be two. What about multiple successive punctuation characters? Possessives versus quotes? And so on. Even determining the end of a sentence is non-trivial. (It's just a full stop, right?!)
This problem is one of tokenisation and a topic that search engines take very seriously. To be honest you should really look at finding a tokeniser in your language of choice.

Maybe this?
array_filter(preg_split('/\b/', $document_text))
array_filter() removes the empty values at the first and/or last index of the resulting array, which appear if your string starts or ends with a word boundary (\b, see http://php.net/manual/en/regexp.reference.escape.php).
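
One caveat with the array_filter() version: it keeps the original keys, so json_encode() may emit an object instead of a list, and it also drops a token that is exactly "0". A minimal sketch of an alternative, assuming the array index is good enough as the word ID, is to let preg_split() skip the empty pieces itself. The function name tokenise() is made up for illustration, and non-ASCII text would probably need the u modifier on the pattern.

```php
<?php
// Hypothetical helper: split text into words and the separators between them.
function tokenise(string $text): array
{
    // Split at every word boundary. PREG_SPLIT_NO_EMPTY drops the empty
    // pieces at the start/end, and the keys stay sequential from 0.
    return preg_split('/\b/', $text, -1, PREG_SPLIT_NO_EMPTY);
}

$tokens = tokenise('Hello, how are you (today)');

// Sequential keys mean json_encode() produces a JSON list, and each
// token's position in that list can serve as its ID.
echo json_encode($tokens), PHP_EOL;
```

This keeps spaces and punctuation as their own tokens, which is the shape asked for in the question, but it does nothing about the harder cases raised in the other answer (contractions, hyphenated words and so on).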

Related

What's the best approach to find words from a set of words in a string?

I must detect the presence of some words (even polyrematic, like in "bag of words") in a user-submitted string.
I need to find the exact word, not part of it, so the strstr/strpos/stripos family is not an option for me.
My current approach (PHP/PCRE regex) is the following:
\b(first word|second word|many other words)\b
Is there any other better approach? Am I missing something important?
There are about 1,500 words.
Any help is appreciated

A regular expression the way you're demonstrating will work. It may be challenging to maintain if the list of words grows long or changes.
The method you're using will work in the event that you need to look for phrases with spaces and the list doesn't grow much.
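
To keep the maintenance burden down, the pattern can be generated from a plain list instead of being written by hand. This is only a sketch: buildWordPattern() and findWords() are hypothetical names, and sorting longest-first is an assumption made so that a phrase such as "bag of words" wins over the shorter "bag".

```php
<?php
// Hypothetical helper: turn a word list into one \b(...)\b alternation.
function buildWordPattern(array $words): string
{
    // Longest entries first, so multi-word phrases are tried before
    // the shorter words they contain.
    usort($words, fn($a, $b) => strlen($b) <=> strlen($a));

    // Escape regex metacharacters (and the "/" delimiter) in every entry.
    $escaped = array_map(fn($w) => preg_quote($w, '/'), $words);

    return '/\b(?:' . implode('|', $escaped) . ')\b/iu';
}

// Hypothetical helper: return every listed word or phrase found in $text.
function findWords(string $text, array $words): array
{
    preg_match_all(buildWordPattern($words), $text, $matches);
    return $matches[0];
}

print_r(findWords('A bag of words is not a bag.', ['bag', 'bag of words']));
```

The word list can then live in a file or database table, and the regex never has to be edited directly.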
If there are no spaces in the words you're looking for, you could split the input string on space characters (\s+, see https://www.php.net/manual/en/function.preg-split.php ), then check to see if any of those words are in a Set (https://www.php.net/manual/en/class.ds-set.php) made up of the words you're looking for. This will be a bit more code, but less regex maintenance, so ymmv based on your application.
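
A rough sketch of that split-and-look-up idea follows. It swaps Ds\Set for an ordinary array used as a lookup table (array_flip() plus isset()), since the Ds extension is not always installed. findKnownWords() is a made-up name, and the punctuation stripped by trim() is just a guess at what user input might contain.

```php
<?php
// Hypothetical helper: return the listed single words that occur in $input.
function findKnownWords(string $input, array $wordList): array
{
    // Flip the list so every word becomes a key; isset() on a key is a
    // hash lookup rather than a scan of the whole list.
    $lookup = array_flip(array_map('strtolower', $wordList));

    // Split on runs of whitespace, ignoring empty pieces.
    $tokens = preg_split('/\s+/', strtolower($input), -1, PREG_SPLIT_NO_EMPTY);

    $found = [];
    foreach ($tokens as $token) {
        // Strip punctuation stuck to the word, e.g. "pear." -> "pear".
        $token = trim($token, ".,;:!?\"'()");
        if (isset($lookup[$token])) {
            $found[$token] = true; // keyed by word, so duplicates collapse
        }
    }
    return array_keys($found);
}

print_r(findKnownWords('I like Apples, and a pear.', ['apple', 'pear']));
```

Only exact words match here: "Apples" is not reported for "apple", which is the whole-word behaviour the question asks for.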
If the words in the set contain spaces, consider using a trie instead. Wiktor Stribiżew suggests: https://github.com/sters/php-regexp-trie

Sorting user input

I'm attempting what I thought would be a simple exercise, but unless I’m missing a trick, it seems anything but simple.
I'm attempting to clean up user input from a form before saving it. The particular problem I have is with hyphenated town names. For example, take Bourton-on-the-Water. Assume the user has Caps Lock on, or puts spaces next to the hyphens, or makes any other mistake that might come to mind. How do I, within reason, turn it into what it’s meant to be?

You can use trim() to remove whitespace (or other characters) from the beginning and end of a string. You can also use explode() to break strings into parts by a specified character and then recreate your string as you like.
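
As a sketch of how trim() and explode() might be combined for the Bourton-on-the-Water case: the function name normaliseTownName() and the list of words kept in lowercase are assumptions, and this only covers stray spaces and wrong capitalisation, not missing hyphens or misspellings.

```php
<?php
// Hypothetical helper: tidy spacing and capitalisation in a hyphenated town name.
function normaliseTownName(string $raw): string
{
    // Words left in lowercase when they are not the first part of the
    // name. This list is an assumption, not an authoritative rule.
    $minor = ['on', 'the', 'upon', 'under', 'in', 'le', 'by', 'of'];

    // Remove spaces around hyphens, then collapse remaining runs of spaces.
    $name = preg_replace('/\s*-\s*/', '-', trim($raw));
    $name = preg_replace('/\s+/', ' ', $name);

    $parts = explode('-', strtolower($name));
    foreach ($parts as $i => $part) {
        if ($i === 0 || !in_array($part, $minor, true)) {
            $parts[$i] = ucwords($part);
        }
    }
    return implode('-', $parts);
}

echo normaliseTownName('  BOURTON - on -THE- water '), PHP_EOL;
```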

I think the only way you can really accomplish this is by improving the way the user inputs their data.
For example use a postcode lookup system that enters an address based on what they type.
Or have an autocomplete from a predefined list of towns (similar to how Facebook shows towns).
To consider every possible permutation of Bourton On The Water / Bourton-On-The-Water etc... is pretty much impossible.

String Parsing in PHP

For a small project of my own, I'm writing a parser that parses event logs from a certain application. Normally I'd have little issue with handling such a thing, but the problem is that strings from these logs do not always have the same parameters. For example, one such string could be:
DD/MM HH:MM:SS.MSEC TYPE_OF_EVENT SOURCE, SOURCE_FLAGS, TARGET, TARGET_FLAGS, PARAM1
On another occasion the string could carry a whole series of parameters, up to 27 of them, while another has 16. Reading through the documentation, there is some logic in the parameters; for example, the 17th parameter will always hold an integer. While that is good, unfortunately the 17th parameter might be the 7th thing in the string. The only things that are really constant in every string are the timestamp and the first 6 parameters.
How would I go about parsing strings like these? I'm sorry if my question is a tad unclear; I find it difficult to word my problem.

Ok, followup for my comment up at the top.
If the log's format is "constant" based on the TYPE_OF_EVENT field, you'll just have to do some simple pre-parsing, after which the rest should follow easily.
read a line
extract the universally common fields: timestamp, type of event, source/target
based on type_of_event, do further analysis
switch (event type) {
case 'a': parse out 'a' event parameters
case 'b': parse out 'b' event parameters
default: log unknown event type for future analysis
}
and so on.
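
The outline above could be sketched in PHP roughly like this. It assumes the layout from the sample line in the question (date, time and event type separated by single spaces, then comma-separated parameters with no commas inside a value). The event names EVENT_A and EVENT_B and their fields are invented placeholders.

```php
<?php
// Hypothetical helper: parse one log line into an associative array.
function parseEvent(string $line): array
{
    // Step 1: peel off the fields every line shares.
    [$date, $time, $type, $rest] = array_pad(explode(' ', trim($line), 4), 4, '');
    $params = array_map('trim', explode(',', $rest));

    $event = [
        'date'         => $date,
        'time'         => $time,
        'type'         => $type,
        'source'       => $params[0] ?? null,
        'source_flags' => $params[1] ?? null,
        'target'       => $params[2] ?? null,
        'target_flags' => $params[3] ?? null,
    ];
    $extra = array_slice($params, 4);

    // Step 2: interpret the remaining parameters per event type.
    switch ($type) {
        case 'EVENT_A': // placeholder: first extra parameter is an integer
            $event['amount'] = isset($extra[0]) ? (int) $extra[0] : null;
            break;
        case 'EVENT_B': // placeholder: first extra parameter is free text
            $event['message'] = $extra[0] ?? null;
            break;
        default:        // unknown type: keep the raw values for later analysis
            $event['unknown'] = $extra;
    }
    return $event;
}

print_r(parseEvent('10/12 20:15:03.123 EVENT_A Alice, 0x511, Bob, 0x10a48, 42'));
```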

I would use a different logging solution, or find a way to modify it so that you have empty place holders, item,,item3,,,item6 etc.
Just my opinion without knowing too much about this app: it doesn't sound too good. I usually judge apps by factors like this. If there is no good reason for the log file to be non-standardized, then what do you think the rest of the code looks like? :)

That's not an input that can be "parsed" as such, because there are no fixed keywords to look out for. But regular expressions seem sufficient to extract and split up the contents.
http://regular-expressions.info/ has a good introduction, and https://stackoverflow.com/questions/89718/is-there-anything-like-regexbuddy-in-the-open-source-world lists a few cool tools that help in designing regular expressions.
In your case you would need \d+ to match the numbers, use the delimiters literally, and you can probably get away with .*? separated by the , comma delimiters to find the individual parts. Maybe:
preg_match('#(\d+/\d+) (\d+:\d+:\d+\.\d+) (\w+) (.*?),(.*),(.*),...#', $line, $matches);
If there is a variable length of attributes, then you should prefer two regexps (though it can be done in one). First get the .* remainder of each line, then split it afterwards.
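
A minimal sketch of the two-regexp variant, under the same assumptions about the line layout as the sample in the question: the first pattern captures the fixed prefix and the remainder, the second splits the remainder on commas. splitLogLine() is a hypothetical name.

```php
<?php
// Hypothetical helper: fixed prefix via one regex, parameters via a second.
function splitLogLine(string $line): ?array
{
    // First pass: date, time, event type, then everything else.
    $pattern = '#^(\d+/\d+) (\d+:\d+:\d+\.\d+) (\w+) (.*)$#';
    if (!preg_match($pattern, trim($line), $m)) {
        return null; // line does not look like a log entry
    }

    return [
        'date'   => $m[1],
        'time'   => $m[2],
        'type'   => $m[3],
        // Second pass: split the remainder on commas, eating the
        // whitespace around each one.
        'params' => preg_split('/\s*,\s*/', $m[4]),
    ];
}

$entry = splitLogLine('10/12 20:15:03.123 EVENT_A Alice, 0x511, Bob, 0x10a48, 42');
print_r($entry);
```

Because the parameters end up in a numerically indexed array, a check such as isset($entry['params'][16]) shows whether a given line carries the 17th parameter at all.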

How about splitting the string by the ", " separator and putting everything in an array. That way you'll have a numeric index to check if a parameter exists or not.

Detect random strings

I am building a script to detect whether filenames make sense or are completely random, using PHP and regular expressions.
A valid filename = sample-image-25.jpg
A random filename = 46347sdga467234626.jpg
I want to check if the filename makes sense or not, if not, I want to alert the user to fix the filename before continuing.
Any help?

I'm not really sure that's possible because I'm not sure it's possible to define "random" in a way the computer will understand sufficiently well.
"umiarkowany" looks random, but it's a perfectly valid word I pulled off the Polish Wikipedia page for South Korea.
My advice is to think more deeply about why this design detail is important, and look for a more feasible solution to the underlying problem.

That would take far too much work. You would have to build a huge array of the most-used words (like a dictionary) and check whether most of the words in the filename (maybe separated by - or _) are in it, and it would still have huge bugs.
Basically you will need:
explode()
implode()
array_search() or in_array()
Take the string and look for a glue character like "_" or "-" with preg_match(); if there is one, explode the string into an array and compare that array with the dictionary array.
Or, since almost every word alternates vowels and consonants, you could write a huge script that checks whether most of the words in the filename count as "not randomly generated". But the problem remains the same: why do you need that? Look for a more flexible solution.
Notice:
Consider that even a simple-and-friendly-file.png could be the result of a string generator.
Good luck with that.
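
For what it's worth, the dictionary check described above might look something like the sketch below. Everything here is an assumption for illustration: the function name looksMeaningful(), the tiny word list, and the "at least half the words are known" threshold. It uses preg_split() in place of explode() so that "-" and "_" can be handled in one call, and it inherits all the weaknesses already mentioned.

```php
<?php
// Hypothetical heuristic: does most of the filename consist of known words?
function looksMeaningful(string $filename, array $dictionary): bool
{
    // Work on the name without its extension, in lowercase.
    $base = strtolower(pathinfo($filename, PATHINFO_FILENAME));

    // Break the name apart on the usual glue characters.
    $parts = preg_split('/[-_\s]+/', $base, -1, PREG_SPLIT_NO_EMPTY);

    // Ignore purely numeric parts, such as the "25" in sample-image-25.
    $words = array_filter($parts, fn($p) => !ctype_digit($p));
    if (count($words) === 0) {
        return false;
    }

    $known = array_filter($words, fn($w) => in_array($w, $dictionary, true));

    // Arbitrary threshold: at least half of the words must be known.
    return count($known) / count($words) >= 0.5;
}

$dictionary = ['sample', 'image', 'photo']; // stand-in for a real word list

var_dump(looksMeaningful('sample-image-25.jpg', $dictionary));
var_dump(looksMeaningful('46347sdga467234626.jpg', $dictionary));
```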

Concatenate RTF files in PHP (REGEX)

I've got a script that takes a user uploaded RTF document and merges in some person data into the letter (name, address, etc), and does this for multiple people. I merge the letter contents, then combine that with the next merge letter contents, for all people records.
Effectively I'm combining a single RTF document with itself once for every person record I need to merge into the letter. However, I first need to remove the closing RTF markup and the opening RTF markup of each merged letter, or else the RTF won't render correctly. This sounds like a job for regular expressions.
Essentially I need a regex that will remove the entire string:
}\n\page ANYTHING \par
Example, this regex would match this:
crap
}
\page{\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\fswiss\fcharset0 Arial;}}
{\*\generator Msftedit 5.41.15.1515;}\viewkind4\uc1\pard\f0\fs20 September 30, 2008\par
more crap
So I could make it just:
crap
\page
more crap
Is RegEx the best approach here?
UPDATE: Why do I have to use RTF?
I want to enable the user to upload a form letter that the system will then use to create the merged letters. Since RTF is plain text, I can do this pretty easily in code. I know, RTF is a disaster of a spec, but I don't know any other good alternative.

I would question the use of RTF in this case. It's not entirely clear to me what you're trying to do overall, so I can't necessarily suggest anything better, but if you can try to explain your project more broadly, maybe I can help.
If this is really the way you want to go though, this regex gave me the correct output given your input:
$output = preg_replace("/}\s?\n\\\\page.*?\\\\par\s?\n/ms", "\\page\n", $input);
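
For reference, here is the same replacement wrapped in a function and run against input shaped like the example in the question. joinRtfLetters() is a hypothetical name and the sample string is shortened. Note that everything up to the first \par of each following letter is discarded, which matches the output asked for but would lose text if a letter's header is laid out differently.

```php
<?php
// Hypothetical wrapper around the replacement above.
function joinRtfLetters(string $merged): string
{
    // Match a closing "}", the following "\page", and the next letter's
    // RTF header up to and including its first "\par" at the end of a
    // line; keep only the "\page" break.
    return preg_replace('/}\s?\n\\\\page.*?\\\\par\s?\n/s', "\\page\n", $merged);
}

$input = "crap\n}\n"
    . "\\page{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0\\fswiss Arial;}}\n"
    . "{\\*\\generator Msftedit;}\\viewkind4\\uc1\\pard\\f0\\fs20 September 30, 2008\\par\n"
    . "more crap";

echo joinRtfLetters($input), PHP_EOL;
```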

To this I can say ick ick ick. Nevertheless, rcar's kludge will probably work, barring some weird edge case where the RTF doesn't actually end in that form, or the document-wide styles include important information that utterly messes up the formatting, or any other of the many failure modes.
