I'm stuck on a crazy project that has me looking for a strange solution. I've got a XFA PDF document generated by an outside party. There's are several checkmark characters '✓' on the PDF's that I need to simply change to 'X'. The reason for this is beyond my control. I'm just looking for a way to change the ✓'s into X's. Can anyone point me in the right direction? Is it possible?
Currently we use PHP and TCPDF for creating "our" server PDF's, but this particular PDF is generated outside of my control by a third party that doesn't want to alter their way of doing things. To make things worse, I don't know how many or where the checkmarks may exist. It's just one very specific character that is in need of changing. Does any know a way of hacking the document to change the character?
Character 2713
http://www.fileformat.info/info/unicode/char/2713/index.htm
Yes, I think you can. To my (rather limited) knowledge of the PDF format, you can only reliably search and replace strings of one character in length, since they are created by placing strings of variable length at specific co-ordinates, in an arbitrary order. The string 'hello' could therefore be one string of five letters, or five strings of one letter each or some combination thereof, all placed in the correct position (and in whatever order the print driver decided upon).
I'm afraid I don't know of any libraries that will do this, but I'd be surprised if they don't exist. You'll need to read PDF objects in, do the replacement, and write them out to a new file. I'd start off researching around the answers to this question.
Edit: this looks like it might be useful.
Related
we get book titles from different sources (library systems) (with possibly different encoding, but mostly utf8). These strings are shown in the web and via export to Endnote and RefWorks. RefWorks (windows Quotation system) does not accept any other encoding than ANSI.
In the RIS/Refworks export, activating the line
$smarty = iconv("UTF-8", "Windows-1252", $smarty);
Example string
Diphosphen-komplexes (CO) 5CrPhPPPhCr(CO) 5
does suddenly cut off everything after the first subscript char (the rectangles). These chars are also not correctly printed in HTML but this output is okay because nothing is cut off. In UTF-8 export file encoding nothing is cut off, too. Despite that, the Windows software can't read UTF-8.
The simplest solution would be to convert any subscript number to a regular number. Everything would work quite well then. But I could not find any simple solution to this. Working with hex codes is the only thing I could imagine. This solutions is also preferred for use in our Solr index.
Anybody knows a better solutions?
The example string contains Private Use code points such as U+E5F8. By definition, no standard assigns any meaning to them; their use is purely by private agreements. It is thus impossible to convert them to anything, or to do anything with them, without knowing or inferring the private agreements involved. Some systems use Private Use code points to represent some symbols that are assigned to those points in some special font. Knowing what that font is and inspecting it may thus help to find out the agreement.
The conversion would need to be coded separately, in an ad hoc manner, since there is an an hoc agreement involved.
“ANSI”, which here means windows-1252, does not contain any subscript characters. In the context of a chemical formula, replacing subscript digits by normal digits does not change the meaning, and the formula is understandable, though it looks unprofessional.
When converting to HTML format (or other rich text format), you can use normal digits wrapped in elements that cause subscript rendering (or otherwise style them). HTML has the sub element for this, but its implementations differ between browsers and tend to be a poor quality, so a better approach is to generate <span class=sub>...</span> and use CSS to set the vertical position and font size.
I have an unknown number of files with garbage data interspersed and I want to remove said garbage data dynamically, perhaps using regex.
It'll usually look something like this in an HTML file in a browser:
this is the beginning of the file, ��
In the file, it'll appear as like this:
this is the beginning of the file, xE2xA0
I tried using a regex editor to remove it, but it was to no avail, it cannot find it at all. How can I remove this garbage data? Again, some of the files have all kinds of HTML markup.
Thank you for any help.
Those appear because something is wrong with a character set on your site.
For example, your files are stored in Unicode, but your Content-Type is set as text/html; charset=ISO-8859-1. The problem could also be how text is stored in your database, or with how text is represented internally by your programming language.
Rather than try to strip them out, it is better to get the character set correct. This is generally a frustrating process because there are so many points where the problem could have been introduced.
You don't say what technologies you use. Generally you can search for how to solve character set issues with specific technologies such as "character set problems mysql" to find solutions.
I recommend using command line tools like file to examine what character set a text file is stored in and iconv to convert text files from one character set to another.
There are two possibilities. The first, unlikely, one is that you are getting 0xe2 0xa0 ... because there are Braille patterns in the document.
As for the second possibility, 0xa0 is NBSP. 0xe2 makes me think of ISO-8859-5.
Is there any chance someone copied & pasted stuff from a Russian version of some software package?
Also, you can get & use iconv on Windows.
is there any way we can check if a php file has been obfuscated, using php? I was thinking regex maybe (for instance ioncube's encoded file contains a very long alphabet string, etc.
One idea is to check for whitespace. The first thing that an obfuscator will do is to remove extra whitespace. Another thing you can look for is the number of characters per line, as obfuscators will put all the code into few (one?) lines.
Often, obsfuscators initialize very large arrays to translate variables into less meaningful names (eg. see obsfucator article
One technique may be to search for these super-large arrays, close to the top of the class/file etc. You may be able to hook xdebug up to examine/look for these. The whole thing of course depends on the obsfuscation technique used. Check the source code, there may be patterns they've used that you can search on.
I think you can use token_get_all() to parse the file - then compute some statistics. For example check for number of function calls(in calse obfuscator uses some eval() string and nothing else) and calculate average function length - for obfuscators it will usually be about 3-5 chars, for normal PHP code it should be much bigger. You can also use dictionary lookup for function/variable names, check for comments etc. I think if you know all obfuscator formats that you want to detect - it will be easy.
im' trying to make a file based article system and i want the folders to be my categories and files inside it to be the actual articles. But sometimes i need some special characters in my folder/file name(\/:*?",and actually i'm interested just in double quotes and question mark). Is there a way to do the trick...something like & is in html or something like this. thanks
Short answer: Your operating system could support such file names, but it doesn't seem to.
There isn't a simple way for you to do something easy like & for this. You could store the real filename, or make a conversion table such that something like _____questionmark_____ converted into that symbol or something silly like that, but then you run into problems with that particular string.
Fundamentally though, you should store the title separately from the file itself. A database would be an appropriate location.
At a deeper level, if you're asking a question like this, I think it's safe to say that allowing users to specify filenames on your system is likely to be a large security risk.
There are only few special characters which aren't allowed in file names. so keep your own insanitized sequence for those characters. For instance, replace all '?' with '#quest' before creating files and so. Do the reverse when you read them, aint this good? Insanitized means some combination of characters that we don't type usually like '#quest'.
I would recommend using a .htaccess file to pass the "filename" as an argument to your PHP script. Subsequently have the script look up the article in a database lookup table that points to the article file.
I've got a script that takes a user uploaded RTF document and merges in some person data into the letter (name, address, etc), and does this for multiple people. I merge the letter contents, then combine that with the next merge letter contents, for all people records.
Affectively I'm combining a single RTF document into itself for as many people records to which I need to merge the letter. However, I need to first remove the closing RTF markup and opening of the RTF markup of each merge or else the RTF won't render correctly. This sounds like a job for regular expressions.
Essentially I need a regex that will remove the entire string:
}\n\page ANYTHING \par
Example, this regex would match this:
crap
}
\page{\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\fswiss\fcharset0 Arial;}}
{\*\generator Msftedit 5.41.15.1515;}\viewkind4\uc1\pard\f0\fs20 September 30, 2008\par
more crap
So I could make it just:
crap
\page
more crap
Is RegEx the best approach here?
UPDATE: Why do I have to use RTF?
I want to enable the user to upload a form letter that the system will then use to create the merged letters. Since RTF is plain text, I can do this pretty easily in code. I know, RTF is a disaster of a spec, but I don't know any other good alternative.
I would question the use of RTF in this case. It's not entirely clear to me what you're trying to do overall, so I can't necessarily suggest anything better, but if you can try to explain your project more broadly, maybe I can help.
If this is really the way you want to go though, this regex gave me the correct output given your input:
$output = preg_replace("/}\s?\n\\\\page.*?\\\\par\s?\n/ms", "\\page\n", $input);
To this I can say ick ick ick. Nevertheless, rcar's cludge probably will work, barring some weird edge-case where RTF doesn't actually end in that form, or the document-wide styles include important information that utterly messes up the formatting, or any other of the many failure modes.