Concatenate RTF files in PHP (REGEX) - php

I've got a script that takes a user-uploaded RTF document and merges some person data into the letter (name, address, etc.), and does this for multiple people. I merge the letter contents, then combine that with the next merged letter's contents, for all people records.
Effectively I'm appending a single RTF document to itself once for every person record I need to merge the letter for. However, I first need to remove the closing RTF markup and the opening RTF markup of each merge, or else the RTF won't render correctly. This sounds like a job for regular expressions.
Essentially I need a regex that will remove the entire string:
}\n\page ANYTHING \par
Example, this regex would match this:
crap
}
\page{\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\fswiss\fcharset0 Arial;}}
{\*\generator Msftedit 5.41.15.1515;}\viewkind4\uc1\pard\f0\fs20 September 30, 2008\par
more crap
So I could make it just:
crap
\page
more crap
Is RegEx the best approach here?
UPDATE: Why do I have to use RTF?
I want to enable the user to upload a form letter that the system will then use to create the merged letters. Since RTF is plain text, I can do this pretty easily in code. I know, RTF is a disaster of a spec, but I don't know any other good alternative.

I would question the use of RTF in this case. It's not entirely clear to me what you're trying to do overall, so I can't necessarily suggest anything better, but if you can try to explain your project more broadly, maybe I can help.
If this is really the way you want to go though, this regex gave me the correct output given your input:
$output = preg_replace("/}\s?\n\\\\page.*?\\\\par\s?\n/ms", "\\page\n", $input);
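To see what that pattern does, here is a minimal sketch that runs it against the sample from the question. It assumes the merged text uses plain LF line endings; the nowdoc and the variable names are only there for illustration.

```php
<?php
// The asker's sample: the tail of one merged letter followed by the head of the next.
// A nowdoc keeps every backslash literal, so the RTF control words survive as-is.
$input = <<<'RTF'
crap
}
\page{\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\fswiss\fcharset0 Arial;}}
{\*\generator Msftedit 5.41.15.1515;}\viewkind4\uc1\pard\f0\fs20 September 30, 2008\par
more crap
RTF;

// Same expression as above, written single-quoted: '\\\\' reaches PCRE as '\\',
// which matches one literal backslash. The 's' flag lets '.' span newlines.
$pattern = '/}\s?\n\\\\page.*?\\\\par\s?\n/s';

// Replace the closing brace, the next header and everything up to the first
// "\par" + newline with a bare "\page" line.
$output = preg_replace($pattern, "\\page\n", $input);

echo $output, PHP_EOL;
```

Note that the match runs through the first \par that is followed by a newline, so the opening paragraph of the second letter (the date line in the sample) is removed along with the header. That is what the question asked for, but it is worth checking against a real template.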

To this I can say ick ick ick. Nevertheless, rcar's kludge probably will work, barring some weird edge case where the RTF doesn't actually end in that form, or the document-wide styles include important information that utterly messes up the formatting, or any other of the many failure modes.

Related

Compare Text Files as Template

I have certain files that are uploaded to my website, and I would like to compare them to known templates of files for a potential match.
For example-sake, I could have a text file with the following "template".
template.txt
Hi, my name is (\w+), and I like (\w+).
Then, say someone uploads the following file.
bio.txt
Hi, my name is James, and I like cats.
This would match the template, and thus trigger a rule associated with that match.
Obviously, this is a very watered down example, but should get the point across. The real regex would be more expressive to match something specific like any URL, Bitcoin address, etc. (I already have expressions for all of that).
I have tried regular expressions, as this is how I do other, more complicated matching with uploaded files, but this really becomes cumbersome when it comes to challenges such as newlines and escaping tons of characters. For example, I've had 20+ line files as the template; converting all the newlines to \n and escaping all parentheses, brackets, etc. is crazy, not to mention this feels like super overkill. Regular expressions don't seem to like multi-line text in the pattern part, from what I can tell.
Is there a better way to match this "template" file against the provided file? Any search I do for "comparing text files" mostly gives me line-by-line comparisons, such as for showing differences in lines of code. I need an exact match against the expected patterns; if the template does not match, then the overall result should of course be no match.
I currently do have something like this setup with the regular expressions, and I basically just throw them both through preg_match(), but is there a better, more statistical way I should be using for this problem?
My website is programmed in PHP, so I would need to implement it as such. Afraid I don't get access to much on the shell of a shared (cheap) host.

How to modify a specific character in an existing XFA PDF?

I'm stuck on a crazy project that has me looking for a strange solution. I've got an XFA PDF document generated by an outside party. There are several checkmark characters '✓' in the PDFs that I need to simply change to 'X'. The reason for this is beyond my control. I'm just looking for a way to change the ✓'s into X's. Can anyone point me in the right direction? Is it possible?
Currently we use PHP and TCPDF for creating "our" server PDFs, but this particular PDF is generated outside of my control by a third party that doesn't want to alter their way of doing things. To make things worse, I don't know how many checkmarks there are or where they may appear. It's just one very specific character that needs changing. Does anyone know a way of hacking the document to change the character?
Character 2713
http://www.fileformat.info/info/unicode/char/2713/index.htm
Yes, I think you can. To my (rather limited) knowledge of the PDF format, you can only reliably search and replace strings of one character in length, since they are created by placing strings of variable length at specific co-ordinates, in an arbitrary order. The string 'hello' could therefore be one string of five letters, or five strings of one letter each or some combination thereof, all placed in the correct position (and in whatever order the print driver decided upon).
I'm afraid I don't know of any libraries that will do this, but I'd be surprised if they don't exist. You'll need to read PDF objects in, do the replacement, and write them out to a new file. I'd start off researching around the answers to this question.
Edit: this looks like it might be useful.

Best way to parse a text document

I'm trying to parse a plain text document in PHP but have no idea how to do it correctly.
I want to separate each word, assign them an ID and save the result in JSON format.
Sample text:
"Hello, how are you (today)"
This is what I'm doing at the moment:
$document_array = explode(' ', $document_text);
json_encode($document_array);
The resulting JSON is
[["Hello,"],["how"],["are"],["you"],["(today)"]]
How do I ensure that spaces are kept in-place and that symbols are not included along with the words...
[["Hello"],[", "],["how"],[" "],["are"],[" "],["you"],[" ("],["today"],[")"]]
I'm sure some sort of regex is required... but have no idea what kind of pattern to apply to deal with all cases... Any suggestions guys?
This is actually a really complex problem, and one that's subject to a fair amount of academic research. It sounds so simple (just split on whitespace! with maybe a few rules for punctuation...) but you quickly run into issues. Is "didn't" one word or two? What about hyphenated words? Some might be one word, some might be two. What about multiple successive punctuation characters? Possessives versus quotes? etc etc. Even determining the end of a sentence is non-trivial. (It's just a full stop right?!)
This problem is one of tokenisation and a topic that search engines take very seriously. To be honest you should really look at finding a tokeniser in your language of choice.
Maybe this:?
array_filter(preg_split('/\b/', $document_text))
The array_filter call removes the empty values at the first and/or last index of the resulting array, which appear if your string starts or ends with a word boundary (for \b see: http://php.net/manual/en/regexp.reference.escape.php).
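A small variation on the word-boundary split, sketched here as one possible approach: array_filter keeps the original keys, so once index 0 is dropped json_encode emits an object instead of a list, and it also throws away a token that is exactly "0". Passing PREG_SPLIT_NO_EMPTY to preg_split avoids both. The variable names are just for illustration, and this is still nowhere near a real tokeniser.

```php
<?php
$documentText = 'Hello, how are you (today)';

// Split at every word boundary. PREG_SPLIT_NO_EMPTY discards the empty
// pieces at the edges and leaves the keys sequential (0, 1, 2, ...),
// so json_encode() produces a JSON array rather than an object.
$tokens = preg_split('/\b/', $documentText, -1, PREG_SPLIT_NO_EMPTY);

$json = json_encode($tokens);

echo $json, PHP_EOL;
```

Words and the punctuation/whitespace runs between them come out as alternating tokens, so joining the array back together reproduces the original text.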

PHP: Regex replace while ignoring content between html tags

I'm looking for a regular expressions string that can find a word or regex string NOT between html tags.
Say I want to replace (alpha|beta) in: the first two letters in the greek alphabet are alpha and <b>beta</b>
I only want it to replace alpha, because beta is between <> tags. So ignore (<(.*?)>(.*?)<\/(.*?)>)
:)
I didn't test the logic used in this page - http://www.phpro.org/examples/Get-Text-Between-Tags.html But I can confirm the logical point made at the top of the page in big bold letters that says you shouldn't do what you're trying to do with regex.
Html is not uniform and edge cases will always bite you in the rear if you use regular expressions to handle the content of those tags in any real world situation. So unless your markup is extremely simplistic, uniform, 100% accurate, only contains html (not css, javascript or garbage) then your best bet is a dom parser library.
And really, many DOM parser libraries have problems too, but you'll be miles ahead of the regex counterparts. The best way to get the text content of tags is to render the HTML in a browser and access the innerText property of the given DOM node (or have a human copy and paste the contents out manually) - but that isn't always an option :D
It's maybe the 'wrong' way, but it works: when I need to do something similar, I first do a preg_replace_callback to find what I don't want to match and encode it with something like base64.
Then I can happily run an ordinary preg_replace on the result, knowing that it has no chance of matching the strings I want to ignore. Then unscramble using the same pattern in preg_replace_callback, this time sending the matches to be base64 decoded.
I often do this when automatically adding keyword or glossary links or tooltips to a text - I scramble the HTML tags themselves so that I don't try to create a link or a tooltip within the title of an anchor tag or somewhere equally ridiculous, for example.
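A rough sketch of that mask / replace / unmask idea, applied to the alpha/beta example from the question. The function name, the control-character markers and the masking pattern are all assumptions made for illustration, not anything the answer specified. Two caveats: base64 output is itself alphanumeric, so a very short search pattern could in principle match inside a token, and the masking regex does not cope with nested elements of the same name.

```php
<?php
// Hypothetical helper: run a preg_replace on $html while leaving tags,
// and text wrapped in simple paired elements, untouched.
function replaceOutsideTags(string $pattern, string $replacement, string $html): string
{
    // 1. Hide whole elements (<b>...</b>) and any remaining lone tags behind
    //    base64 tokens, wrapped in control characters that should not occur
    //    in ordinary text.
    $masked = preg_replace_callback(
        '~<(\w+)\b[^>]*>.*?</\1>|<[^>]+>~s',
        function (array $m): string {
            return "\x01" . base64_encode($m[0]) . "\x02";
        },
        $html
    );

    // 2. Ordinary replacement on the masked text.
    $replaced = preg_replace($pattern, $replacement, $masked);

    // 3. Decode each token back into the original markup.
    return preg_replace_callback(
        '~\x01([A-Za-z0-9+/=]*)\x02~',
        function (array $m): string {
            return base64_decode($m[1]);
        },
        $replaced
    );
}

$text = 'the first two letters in the greek alphabet are alpha and <b>beta</b>';
$result = replaceOutsideTags('/\b(alpha|beta)\b/', '[$1]', $text);

echo $result, PHP_EOL;
```

If the pattern clashing with the base64 text is a concern, numbered placeholders stored in an array would do the same job without that risk.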

How to get some elements from html source and convert them to readable text?

I have a page which displays "HeLLo 54292" in ASCII art, using + characters inside <table> tags to produce block letters. I'm generating this with PHP. You can check out page's html source code, and see how the ASCII art is constructed.
I want to convert the ASCII-art letters to actual text, so I could parse that HTML source and would end up with the string "HeLLo 54292". How would I accomplish this?
Step 1: Write an HTML rendering engine in PHP. It will parse the HTML, lay out the page and render it to an image.
Step 2: Write an optical character recognition library in PHP. It will take an image as input, and identify letters in that image by their shapes.
Step 3: Combine those programs and you can convert your tables back to text.
Estimated time for full solution: 1-2 years.
I believe you could package this as a task on Mechanical Turk. This exactly fits the profile of solving problems which are presented via browser rendering.
https://www.mturk.com/mturk/welcome
The latency would be pretty good, probably just a little bit faster than Stack Overflow.
Actually, ok, if you hook it up to SO.. No seriously, those of you reading this, would you rather get three pennies, or 10 rep points? Mmmmm?
Wow, I'm gonna go with impossible. Why would you need to convert it to text? Do you have a program generating text in such a format? If so, what's stopping you from getting the original variable??
Deconstruct the HTML by using the same patterns you used to produce it.
You used PHP to create that HTML from a string. Reverse the process to convert the HTML back into a string. You have the source code, it should be easy.
Do a reverse replace of each string representing a pixel and recreate the pattern. Then compare that pattern to the one you generated from each character to find the sequence.
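As a toy illustration of reversing the generator, here is a sketch that works on plain rows of + characters rather than on the real table markup. The three-row font, the cell width and the function names are all made up; the actual page would use whatever glyph table the original script has, and the table cells would first need to be turned back into rows like these.

```php
<?php
// Hypothetical 3x3 font. The real glyphs would come from the generator's own table.
const GLYPHS = [
    'H' => ['+ +', '+++', '+ +'],
    'I' => ['+++', ' + ', '+++'],
    'L' => ['+  ', '+  ', '+++'],
    'O' => ['+++', '+ +', '+++'],
];

// Forward direction: build the three rows of art, one blank column between letters.
function renderArt(string $text): array
{
    $rows = ['', '', ''];
    foreach (str_split($text) as $ch) {
        foreach (GLYPHS[$ch] as $r => $line) {
            $rows[$r] .= $line . ' ';
        }
    }
    return $rows;
}

// Reverse direction: cut the rows into 3-column cells and look each cell up
// in the same glyph table, keyed by its pattern.
function decodeArt(array $rows): string
{
    $lookup = [];
    foreach (GLYPHS as $ch => $glyph) {
        $lookup[implode("\n", $glyph)] = $ch;
    }

    $text = '';
    $width = strlen($rows[0]);
    for ($x = 0; $x + 3 <= $width; $x += 4) {
        $cell = [];
        foreach ($rows as $line) {
            $cell[] = substr($line, $x, 3);
        }
        // Unknown patterns decode to '?' instead of failing.
        $text .= $lookup[implode("\n", $cell)] ?? '?';
    }
    return $text;
}

$art = renderArt('HILO');
echo implode(PHP_EOL, $art), PHP_EOL;
echo decodeArt($art), PHP_EOL;
```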
I voted to close this as not a real question. But, on the off chance that this is somehow a real question, I'll try to provide a real answer.
What I would suggest, assuming that the characters are not always the same and your goal here is to convert any ASCII art text to a string representation, would be to render the page to an image and try to use some sort of OCR program (http://en.wikipedia.org/wiki/Optical_character_recognition) to attempt to recognize the characters and determine what the original text was.
Of course if the ASCII art always uses the same characters, you could parse this using RegExes or other string manipulation.
