I don't know much about files and file-related security. I have a LOT of data in XML files which I am planning to parse and put into a database. I get these XML files from third parties, and I will be receiving a minimum of around 1000 files per day, so I will write a script to parse them and enter them into our database. I have several questions regarding this.
1. I know how to parse a single file, and I can extend the logic to multiple files in a single loop. But is there a better way to do the same? How can I use multithreaded programming to parse many files simultaneously? There will be a script which, given a file, parses it and outputs to the database. How can I use this script to parse in multiple threads/parallel processes?
2. The files, as I said, come from a third-party site, so how can I be sure that there are no security loopholes? I don't know much about file security, but what are the MINIMUM common basic security checks I need to take? (The way SQL injection and XSS are the VERY basic checks in web programming.)
3. Again security related: how do I ensure that the incoming XML file is actually XML? I can check the extension, but is there a possibility of injecting scripts and making them run when I parse these files? And what steps should I take while parsing individual files?
You want to validate the XML. This does two things:
Make sure it is "well-formed" - a valid XML document
Make sure it is "valid" - follows a schema, dtd or other definition - it has the elements and you expect to parse.
In PHP 5 the DOM methods for validating XML documents are:
$dom->validate('articles.dtd');
$dom->relaxNGValidate('articles.rng');
$dom->schemaValidate('articles.xsd');
Of course you need an XSD (XML Schema) or DTD (Document Type Definition) to validate against.
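Putting it together, a minimal sketch (assuming a local file articles.xml and the articles.xsd schema from above; both file names are placeholders):

$dom = new DOMDocument();

// load() returns false if the document is not well-formed XML
if (!$dom->load('articles.xml')) {
    die('Not well-formed XML');
}

// schemaValidate() returns false if the document does not follow the schema
if (!$dom->schemaValidate('articles.xsd')) {
    die('Well-formed, but not valid against the schema');
}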
I can't speak to point 1, but it sounds fairly simple - each file can be parsed completely independently.
Points 2 and 3 are effectively about the contents of the file. Simply put, you can check that it's valid XML by parsing it and asking the parser to validate as it goes, and that's all you need to do. If you're expecting it to follow a particular DTD, you can validate it against that. (There are multiple levels of validation, depending on what your data is.)
XML files are just data, in and of themselves. While XML does have "processing instructions", they're not instructions in quite the same way as direct bits of script to be executed, and there should be no harm in just parsing the file. Two potential things a malicious file could do:
Try to launch a denial-of-service attack by referring to a huge external DTD, which will make the parser use large amounts of bandwidth. You can probably disable external DTD resolution if you want to guard against this (see the sketch after this list).
Try to take up significant resources just by being very large. You could always limit the maximum file size your script will handle.
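To guard against the external-DTD problem in the first point, a hedged sketch (exact behaviour depends on your PHP/libxml version; in PHP 8+ external entity loading is already disabled by default):

// Stop libxml from fetching external entities/DTDs (PHP < 8.0)
libxml_disable_entity_loader(true);

$dom = new DOMDocument();
// LIBXML_NONET additionally forbids any network access during loading
$dom->load('incoming.xml', LIBXML_NONET);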
I have to modify a bunch of XML files to make them compliant with a given XSD. I know how to read and write XML, and I already know how to validate a generic XML document against a given XSD. However, since the XSD is quite complex, I'm looking for a solution to save me the burden of checking every single node.
Alternatively, even a simple converter that produces an empty XML skeleton to be filled in a second pass would be appreciated.
I've heard about XSL, but it looks like it only works with XSL stylesheets.
Thanks in advance.
I have to modify a bunch of XML files to make them compliant with a given XSD. I know how to read and write XML, and I already know how to validate a generic XML document against a given XSD.
All quite typical.
However, since the XSD is quite complex, I'm looking for a solution to save me the burden of checking every single node.
This part appears to reflect a misunderstanding of validation.
The validating parser itself takes on the burden of checking every single node, leaving you with the substantially smaller task of addressing validation issues that it reports to you via diagnostic messages.
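A rough sketch of reading those diagnostic messages in PHP (input.xml and schema.xsd are placeholder names):

// Collect libxml diagnostics instead of emitting PHP warnings
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->load('input.xml');

if (!$dom->schemaValidate('schema.xsd')) {
    // The parser has already checked every node; you only read its report
    foreach (libxml_get_errors() as $error) {
        printf("Line %d: %s\n", $error->line, trim($error->message));
    }
    libxml_clear_errors();
}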
Alternatively, even a simple converter that produces an empty XML skeleton to be filled in a second pass would be appreciated.
There are tools that can instantiate an XSD with a starter XML document that's valid against the XSD. Such tools can be helpful in creating a new XML document that conforms to an XSD, not in validating existing XML documents.
I've heard about XSL, but it looks like it only works with XSL stylesheets.
XSLT would help if you wanted to transform one XML document into another via a mapping you specify as templates. Starting with XSLT 2.0, there's support for obtaining type information from XSDs. However, none of this is designed to help with correcting validation errors in an automated manner.
Is there an easy way to do this without parsing the entire resource pointed to by the URL and finding all the different content types (images, JavaScript files, etc.) linked to inside it?
Just some quick thoughts for you.
1. You should be aware that caching, and the differences in how browsers obey and disobey caching directives, can lead to different resource requests being generated for the same page by different browsers at different times. This might be worth considering.
2. If the purpose of your project is simply to measure this metric and you have control over the website in question, you can pass every resource through a PHP proxy which counts the requests (a sketch follows this list). You can follow this pattern for SSI, scripts, styles, fonts, anything.
3. If point 2 is not possible due to the nature of your website but you have access, then how about parsing the HTTP log? I would imagine this will be simple compared with trying to parse an HTML/PHP file, but it could be very slow.
4. If you don't have access to the website source or HTTP logs, then I doubt you could do this with any real accuracy; there's a huge amount of work involved, but you could use cURL to fetch the initial HTML and then parse it as per the instructions by DaveRandom.
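As promised in point 2, a minimal sketch of such a counting proxy (the file names and log format are made up for illustration; no access control or caching headers shown):

<?php
// counter.php?file=style.css - log the request, then serve the asset
$file = basename($_GET['file']); // basename() as a crude traversal guard
file_put_contents('requests.log', date('c') . ' ' . $file . "\n", FILE_APPEND);

header('Content-Type: ' . mime_content_type($file));
readfile($file);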
I hope something in this is helpful for you.
EDIT
This is easily possible using PhantomJS, which is a lot closer to the right tool for the job than PHP.
Original Answer (slightly modified)
To do this effectively would take so much work I doubt it's worth the bother.
The way I see it, you would have to use something like DOMDocument::loadHTML() to parse an HTML document, and look for all the src= and href= attributes and parse them. Sounds relatively simple, I know, but there are several thousand potential tripping points. Here are a few off the top of my head (a sketch of the basic parsing step follows the list):
Firstly, you will have to check that the initial requested resource actually is an HTML document. This might be as simple as looking at the Content-Type: header of the response, but if the server doesn't behave correctly in this respect, you could get the wrong answer.
You would have to check for duplicated resources (like repeated images) that may not be specified in the same manner. For example, if the document you are reading from example.com is at /dir1/dir2/doc.html and it uses an image /dir1/dir3/img.gif, in some places this might be referred to as /dir1/dir3/img.gif, in others as http://www.example.com/dir1/dir3/img.gif, and in others as ../dir3/img.gif. You would have to recognise that this is one resource and would only result in one request.
You would have to watch out for browser specific stuff (like <!--[if IE]) and decide whether you wanted to include resources included in these blocks in the total count. This would also present a new problem with using the XML parser, since <!--[if IE] blocks are technically valid SGML comments and would be ignored.
You would have to parse any CSS docs and look for resources that are included with CSS declarations (like background-image:, for example). These resources would also have to be checked against the src/hrefs in the initial document for duplication.
Here is the really difficult one - you would have to look for resources dynamically added to the document on load via Javascript. For example, one of the ways you can use Google AdWords is with a neat little bit of JS that dynamically adds a new <script> element to the document, in order to get the actual script from Google. In order to do this, you would have to effectively evaluate and execute the Javascript on the page to see if it generates any new requests.
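For what it's worth, a sketch of just the basic parsing step mentioned above (assuming the page markup has already been fetched into $html, e.g. with cURL; it deliberately ignores every tripping point in this list):

$dom = new DOMDocument();
// Suppress warnings - real-world HTML is rarely valid
@$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
$urls = array();

// Gather every src and href attribute in the document
foreach ($xpath->query('//*[@src] | //*[@href]') as $node) {
    $urls[] = $node->getAttribute($node->hasAttribute('src') ? 'src' : 'href');
}

// Naive de-duplication only - no URL normalisation
$urls = array_unique($urls);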
So you see, this would not be easy. I suspect it may actually be easier to get the source of a browser and modify it. If you want to try to come up with a PHP-based solution that produces an accurate answer, be my guest (you might even be able to sell something as complicated as that), but honestly, ask yourself this - do I really have that much time on my hands?
Lately I've been playing with the Microsoft COM object class for PHP to manipulate Word files. So far so good: I've been able to make it work and do some file conversions, such as saving an entire DOC as a PDF on the server.
Now I'm facing a problem: since I'll be converting and manipulating the given Word file a lot at runtime, I thought it would be much better if I could save every single -page- separately and work on them one by one instead of reprocessing the whole document each time.
I have been reading all the MSDN material about the COM Document class, and I have the feeling that I can't save just one page of the document unless I do some sort of magic using the Range method, but apparently there's -no way- to know the 'current end position' of each page. Any ideas?
tl;dr I'm trying to save single pages inside a Word document using a 'word.application' COM object through a PHP script, but I can't find examples of the Document.Range method.
Francesco, I'll have to warn you. #SLaks is correct in that you really cannot use Word Automation on a server. No, really. We're serious.
There are two reasons:
First, Word is an incredibly complex piece of software designed to be used by an interactive user. It was not programmed or tested to be used under a server environment, and does not work correctly when running under a non-interactive account (the way services do). Sooner or later it will crash or freeze. I've seen it. I'm not talking necessarily about bugs. There are things that Word will do that require a full user account; or where Word expects somebody will be clicking on message boxes. There is no escaping it.
Second, even if you manage to make it do what you want, it turns out that the Office license expressly forbids you from running Word that way.
Now, exclusively from the point of view of Automation:
Word doesn't really manipulate 'pages'. 'Pages' are just an incidental side-effect of whichever printer is currently selected. Take the same file to a different computer with a different printer and/or driver, and the pagination can change. On large documents it will change.
Yes, most of the time the page breaks don't move (a lot), particularly if you have a document that is a bunch of not-quite-a-full-page forms, but I'm not trying to be fastidious: The point is, the Word document object model won't help you a lot to manipulate 'pages' because they are not a first-class citizen but incidental formatting.
I guess that your best bet would be to use section breaks between the pages, instead of letting the pages autoflow; that way you have something for the object model to grab onto.
You can use the ActiveDocument.Sections collection to locate your... ahem... 'pages' (really, section objects), then use the Range method (to extract the Range object) and the ExportAsFixedFormat method to export that range to a PDF.
If you want a Word document instead, I don't think the object model allows you to save a piece of the document as a separate document. However you can easily copy-and-paste the range to a new document and save that instead.
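A hedged PHP/COM sketch of that approach (assumes Word 2007+, where a Range exposes ExportAsFixedFormat; 17 is the value of the wdExportFormatPDF constant, and the paths are placeholders):

// CAUTION: per the warning above, don't run this on a production server
$word = new COM('word.application');
$word->Visible = false;

$doc = $word->Documents->Open('C:\\docs\\input.docx');

$i = 1;
foreach ($doc->Sections as $section) {
    // Export each section ('page') range as its own PDF
    $section->Range->ExportAsFixedFormat("C:\\docs\\page{$i}.pdf", 17);
    $i++;
}

$doc->Close(false);
$word->Quit();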
I have written some code in VB.NET that splits a passed Word document into individual pages. It then goes on to save the pages as JPG images, so I would think this is what you want.
I am happy to share the code with you if you've not accomplished the task yet.
I was going to use XML for communicating some data between my server and the client, but then I saw a few articles saying that using XML on every occasion may not be the best idea. The reason given was that XML would increase the size of my messages, which is quite true, especially for me since most of my messages are very short.
Is it a good idea to send several information segments separated by a newline? (The maximum number of different types of data that one message may have is 3 or 4.) Or what are the alternative methods I should look into?
I have different types of messages. For example, one message may contain a username and password, and the next may have current location and speed. I'll be using an Apache server and PHP.
Serializing data in an XML format can certainly have the negative side effect of bloating it a little (the angle bracket tax), but the incredible extensibility of XML greatly outweighs that consequence, IMO. Also, you can serialize XML in a binary format which greatly cuts down on size, and in most cases the additional bloat would be negligible.
Separating your information segments by newlines could be problematic if your information segments might ever need to include newlines.
JSON is a much lighter weight alternative to XML, and lots of software that supports XML often supports JSON as an alternative. It's pretty easy to use. Since your messages are short, it sounds like they would benefit from using JSON over XML.
http://json.org/
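For instance, one of those short messages in PHP (field names made up for illustration):

// Sending side: encode the message
$message = json_encode(array('username' => 'alice', 'password' => 's3cret'));
// -> {"username":"alice","password":"s3cret"}

// Receiving side: decode into an associative array
$data = json_decode($message, true);
echo $data['username']; // alice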
I need to parse a pretty big XML file in PHP (around 300 MB). How can I do it most efficiently?
In particular, I need to locate specific tags and extract their content into a flat TXT file, nothing more.
You can read and parse XML in chunks with an old-school SAX-based parsing approach using PHP's xml parser functions.
Using this approach, there's no real limit to the size of documents you can parse, as you simply read and parse a buffer-full at a time. The parser will fire events to indicate it has found tags, data etc.
There's a simple example in the manual which shows how to pick up start and end of tags. For your purposes you might also want to use xml_set_character_data_handler so that you pick up on the text between tags also.
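A minimal sketch of that approach (<item> is a placeholder for whatever tag you actually need; note that the parser uppercases element names by default due to case folding):

$inItem = false;

$parser = xml_parser_create();
xml_set_element_handler(
    $parser,
    function ($p, $name, $attrs) use (&$inItem) { $inItem = ($name === 'ITEM'); },
    function ($p, $name) use (&$inItem) { $inItem = false; }
);
// Collect character data only while inside the target tag
xml_set_character_data_handler($parser, function ($p, $data) use (&$inItem) {
    if ($inItem) {
        file_put_contents('output.txt', $data, FILE_APPEND);
    }
});

// Feed the file to the parser one small buffer at a time
$fh = fopen('big.xml', 'r');
while (!feof($fh)) {
    xml_parse($parser, fread($fh, 8192), feof($fh));
}
fclose($fh);
xml_parser_free($parser);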
The most efficient way to do that is to create a static XSLT stylesheet and apply it to your XML using XSLTProcessor. The method names are a bit misleading. Even though you want to output plain text, you should use either transformToXML() if you need it as a string variable, or transformToURI() if you want to write a file.
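A rough sketch (extract.xsl is a placeholder stylesheet using <xsl:output method="text"/>; note that this route still loads the whole document into memory as a DOM):

$xml = new DOMDocument();
$xml->load('big.xml');

$xsl = new DOMDocument();
$xsl->load('extract.xsl');

$proc = new XSLTProcessor();
$proc->importStylesheet($xsl);

// Despite the name, this returns the plain-text result as a string
file_put_contents('output.txt', $proc->transformToXML($xml));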
If it's a one-off or occasional job, I'd use XMLStarlet. But if you really want to do it on the PHP side, then I'd recommend pre-parsing it into smaller chunks and then processing those. If you load it via DOM as one big chunk, it will take a lot of memory. Also, use a CLI PHP script to speed things up.
This is what SAX was designed for. SAX has a low memory footprint, reading in a small buffer of data and firing events when it encounters elements, character data, etc.
It is not always obvious how to use SAX (it wasn't to me the first time I used it), but in essence you have to maintain your own state and view of where you are within the document structure. Generally you will end up with variables describing what section of the document you are in, e.g. inFoo, inBar, etc., which you set when you encounter particular start/end elements.
There is a short description and example of a SAX parser here.
Depending on your memory requirements, you can either load it up and parse it with XSLT (the memory-consuming route), or you can create a forward-only cursor and walk the tree yourself, printing the values you're looking for (the memory-efficient route).
Pull parsing is the way to go. This way it's memory-efficient AND easy to process. I have been processing files that are as large as 50 MB or more.
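In PHP, pull parsing means XMLReader. A minimal sketch (again, <item> stands in for whichever tag you need to extract):

$reader = new XMLReader();
$reader->open('big.xml');

// Walk forward through the document one node at a time
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'item') {
        // readString() returns the text content of the current element
        file_put_contents('output.txt', $reader->readString() . "\n", FILE_APPEND);
    }
}
$reader->close();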