Load an invalid XML in PHP DOM - php

I have an input XML file that is not well formed (i.e. it contains a bare '&' instead of '&amp;').
When I try to load this XML using PHP DOM with $doc->load("file.xml"), it throws an error and stops parsing.
Is there any way to load this malformed XML? And no, I can't edit the source XML file.
I did try using $doc->loadHTML(), but it throws errors all over the place.
I wanted to know if there is a proper way to do this (such as loading the file contents and changing them with a regex or something similar).

Try setting $doc->validateOnParse = false; before loading your XML via $doc->loadHTML(...).

First, check that it's the & that's causing the error and not something else.
One way or another, you'll have to modify the XML to get it parsed. loadHTML takes its input from a string, so can't you just replace the invalid characters with the correct ones before loading?
If your installation supports the PHP Tidy extension (http://php.net/manual/en/book.tidy.php) you could try to clean it up with that, though in my experience it's far from foolproof.
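To check what the parser is actually complaining about, libxml's error collection can be switched on before loading. The following is only a minimal sketch: the helper name collect_xml_errors and the sample string are hypothetical, and the string stands in for the real file contents.

```php
<?php
// Hypothetical helper: returns libxml's error messages for an XML string.
function collect_xml_errors(string $xml): array
{
    // Collect errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $doc = new DOMDocument();
    $doc->loadXML($xml);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf(
            'line %d, column %d: %s',
            $error->line,
            $error->column,
            trim($error->message)
        );
    }

    // Restore the previous error-handling mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $messages;
}

// Sample input with a bare ampersand (an assumption, not the asker's data).
foreach (collect_xml_errors('<root><name>Tom & Jerry</name></root>') as $message) {
    echo $message, PHP_EOL;
}
```

If the only messages reported point at the ampersands, a targeted search and replace is probably enough; anything else suggests the file has other problems too.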

If you are sure that's the only thing making it fail to parse, you could load the file into a string with file_get_contents(), search and replace through the string to change the bare &'s into &amp;'s, then pass that string to SimpleXML: $xml = simplexml_load_string($cleaned_string);
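The search-and-replace step could look roughly like the sketch below. The function name escape_bare_ampersands and the sample string are made up for illustration; in practice the string would come from file_get_contents('file.xml'). The pattern is an assumption about what counts as "already escaped": it leaves named, decimal, and hexadecimal references alone and escapes every other ampersand.

```php
<?php
// Hypothetical helper: escape ampersands that do not already start a reference.
function escape_bare_ampersands(string $xml): string
{
    // Negative lookahead skips &name;, &#123; and &#x1F; style references.
    return preg_replace(
        '/&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)/',
        '&amp;',
        $xml
    );
}

// Sample data standing in for the contents of the broken file.
$raw = '<root><name>Tom & Jerry</name><note>already &amp; escaped</note></root>';

$doc = new DOMDocument();
$doc->loadXML(escape_bare_ampersands($raw));

echo $doc->getElementsByTagName('name')->item(0)->textContent, PHP_EOL;
```

Two caveats: ampersands inside CDATA sections would be escaped as well, and named references that XML does not define (such as &nbsp;) are left untouched and will still be rejected by the parser.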

Related

process unconform xml in php without simplexml [duplicate]

I'm having some trouble parsing malformed XML in PHP. In particular, I'm querying a third-party web service that returns data in an XML format without encoding the XML entities in the actual data. For example, one of the elements contains an ASCII heart, '<3', without the quotes, which the XML parser sees as an opening tag. It should be '&lt;3'.
Right now I'm simply passing the XML string into a SimpleXMLElement which, predictably, fails on these instances. I've done some looking around and it seems like PHP Tidy package might be able to help me, but the amount of configuration you can do is overwhelming :(
Thus, I'm just wondering if anyone else has had a problem like this and, if so, how they were able to solve it.
Thanks!
Try tidy.repairString:
php > $tidy = new tidy();
php > $repaired = $tidy->repairString("<foo>I <3 Philadelphia</foo>", array("input-xml"=>1));
php > print($repaired);
<foo>I &lt;3 Philadelphia</foo>
php > $el = new SimpleXMLElement($repaired);
Read the content as a string.
htmlspecialchars(preg_replace('/[\x00-\x08\x0b-\x0c\x0e-\x1f]/','',$string))
Load the transformed string in SimpleXMLElement
It worked for me so far.

Extract XML from .prt file using PHP but file becomes unreadable when opened with PHP

I have a .prt (CAD Design File) that I need to extract some XML from using PHP. When I view this file directly in the browser, I can see the XML along with some unreadable areas. However, when I open it using PHP to get the XML I need from it, the file becomes mostly unreadable and the XML is nowhere to be found; the file looks as if it were encrypted.
This is an example of what the .prt file looks like when opened directly in the browser: File in Browser
This is an example of what the file looks like when opened using PHP: Using PHP
This is how I am trying to open the file with PHP:
$handle = fopen("thePart.prt", "rb");
$contents = trim(stream_get_contents($handle));
fclose($handle);
//echo out contents to see what happens
echo $contents;
If I could get this file to open without doing what it is doing, I can get the XML out of it myself. How do I fix the issue that I am having? Thank you very much in advance.
Real Answer
Turns out that there was no problem at all with the code. The browser was just interpreting the XML tags as HTML and so the data was not displayed (PHP by default sets a content type of text/html). When viewing the source code, the XML was plain and visible. The XML can also be seen without viewing the source by setting the content type of the php file:
header('Content-Type: text/plain');
This way, the browser will just display the XML as it is, without attempting to parse it as HTML first.
Initial Answer
Just a guess here, but it might be that you're opening the file in binary mode (the "rb" in your first line of code). Try opening it as a plain text file (use "r" instead of "rb").
More likely, it's an encoding issue where PHP is trying to decode a UTF-8 file as ASCII, for instance. Since you are opening a binary file (CAD Design File is binary with a little XML, I'm assuming), PHP might be getting confused while trying to detect the encoding of the file. I would need a copy of the file to know for sure.
Try comparing the result of mb_detect_encoding:
mb_detect_encoding($contents)
and the actual encoding of the XML data within the .prt file. If they are different, that's how you know that PHP is using the wrong encoding. In that case, use mb_convert_encoding to convert from PHP's detected encoding to that of the XML data.

PHP Simple HTML DOM Parser - File not being read

I've written a script to process HTML files from URLs. However, due to a 30-second script runtime restriction with my cheap host provider, I've had to alter the script to store the HTML as txt files and run it from a local WAMP server.
I am trying to load each file, extract what I need, then move on to the next file.
URLs as source: file_get_html was doing the job perfectly (I could ->find the required elements).
Txt file as source: file_get_html is returning a blank object.
Based on some advice in the below post I changed file_get_html for file_get_contents which created an array with a single large string containing the contents of the text file.
First, make sure that file_get_contents can get data. If it can, file_get_html will be able to load data to simplehtml Dom
If file_get_contents returns a string, which it does, how would I "load data to simplehtml Dom?"
File not getting read using file_get_html
I then tried to convert the string into an object str_get_html, however, this didn't work either.
include('simple_html_dom.php');
$html = file_get_html('file.txt');
var_dump($html);
Returns: object(simple_html_dom)[1] but with no other contents or arrays.
include('simple_html_dom.php');
$html = file_get_contents('file.txt');
var_dump($html);
Returns: string < ! DOCTYPE html PUBLIC.....
Questions:
Can anyone give me any advice? What's the best way to load a text file containing HTML markup into an object so that I can use the find method on its contents? I want to avoid loading the file into an array of strings and using regex to process the contents.
Are there any considerations I need to make if using a local WAMP server?
(Answered by the OP in a question. Converted to a community wiki answer. See Question with no answers, but issue solved in the comments (or extended in chat) )
The OP wrote:
I managed to solve this myself. I was sure I'd already tried to extract the HTML from a string, doh!
include('simple_html_dom.php');
$html = file_get_contents('file.txt');
$html = str_get_html($html);
var_dump($html);
Returns object(simple_html_dom)[1] including all expected arrays etc.
Instead of trying to create the HTML object directly from the source file using file_get_html, I extracted the file contents with file_get_contents and then converted the string to an HTML object using str_get_html. That lets me use the Simple HTML DOM methods, e.g. find, on elements within the object:
$html->find('a');
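For comparison, the same "read the file into a string, then parse the string" approach also works with PHP's built-in DOM extension instead of Simple HTML DOM. This is a different library from the one used above, and the markup below is made-up sample data standing in for file_get_contents('file.txt'); treat it as a rough sketch only.

```php
<?php
// Sample markup standing in for the contents of file.txt (an assumption).
$markup = '<!DOCTYPE html><html><body>'
        . '<a href="/one">One</a><a href="/two">Two</a>'
        . '</body></html>';

// Real-world HTML is rarely clean, so keep libxml warnings quiet while parsing.
$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($markup);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Rough equivalent of $html->find('a'): collect every link target.
$hrefs = [];
foreach ($doc->getElementsByTagName('a') as $link) {
    $hrefs[] = $link->getAttribute('href');
}

print_r($hrefs);
```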

getting xml data as is(with tags, attributes and values) with php

Hello, I'm struggling with a problem. I have a URL that returns XML data.
When I use file_get_contents($url) or fopen($url,'r'), it gives me only the values.
Consider the xml:
<tag1 attrName="something">
<tag2>some Value</tag2>
<tag2>some Other Value</tag2>
...
...
</tag1>
What I get: some Value, some Other Value
But I need to get the whole XML (with tags, attributes, and their values) and parse it my own way, because there's a restriction that I'm not allowed to use PHP 5.x practices. I mean I can't use any parser. It shouldn't be so hard to get XML data as is, should it?
what i get: some Value, some Other Value
Nope - my suspicion is that that is what you see in your browser, because it is swallowing all <tags>.
The XML source code will be there after a file_get_contents() operation.
You are using file_get_contents() which states
This function is similar to file(), except that file_get_contents()
returns the file in a string, starting at the specified offset up to
maxlen bytes. On failure, file_get_contents() will return FALSE.
Press Ctrl+U to view the page source in any of the major browsers (except IE, where it's F12 in IE9). I am sure that your XML will be there; your browser just won't display the tags.
The other, longer (but better) way to display an XML file from your PHP file is to send the content type text/xml, as follows:
<?php
header("Content-Type: text/xml");//SHOULD come before any output
// dynamically generate and output your xml here
?>
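If changing the content type is not an option, another way to make the markup visible inside an HTML page is to escape it before echoing. This is not from the answer above, just a small sketch with a made-up sample string in place of the file_get_contents($url) result:

```php
<?php
// Sample XML standing in for the string returned by file_get_contents($url).
$xml = '<tag1 attrName="something"><tag2>some Value</tag2></tag1>';

// Escape <, >, & and quotes so the browser shows the tags instead of parsing them.
$escaped = htmlspecialchars($xml, ENT_QUOTES, 'UTF-8');

echo '<pre>', $escaped, '</pre>', PHP_EOL;
```

The string in $xml is unchanged and still contains the real tags; only the copy sent to the browser is escaped.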

Why do I see `á` instead of a space when writing to screen (encoding problem)?

I am completely lost with encoding issues, I have no idea what's going on, what the problem is exactly and how to fix it.
Basically I'm just trying to read an HTML file from a Zip file, parse it then output pieces to XML. Now something funky is happening with the text I get out of the parser.
When parsing the HTML, instead of a space I get á only if I write to the screen. If I keep it in a variable and write to a file it looks fine in the file. However even though it looks right in the XML something is wrong with it, my PHP parser can't parse that XML nor does IE seem to like it.
I had to first mb_convert_encoding($xmlcontent, "ASCII"); so I could get that XML to parse in PHP.
Any idea what my problem is?
extract HTML from a .tar.gz file using Perl
my $tar = Archive::Tar->new;
$tar->read("myfile.tar.gz");
$tar->extract_file('index.html', 'output.html');
load HTML, this is where it starts to get funky, I get output like Numberáofásourceálines
my $tree = HTML::TreeBuilder->new;
$tree->parse_file('output.html') or die $!;
$tree->elementify;
write to XML
my $output = new IO::File(">output.xml");
my $writer = new XML::Writer(OUTPUT => $output, DATA_MODE => 1,DATA_INDENT => 2);
If it looks correct when you write it to a file and wrong when you write it to the terminal, it sounds like your terminal is expecting the wrong encoding. Check your terminal settings.
Also, see Jon Rockway's answer to "Why does modern Perl avoid UTF-8 by default?". With encodings, you have to convert your input to the correct encoding and convert your output to the correct encoding. Everything that looks at the data needs to know which encoding you're using.
I think I just fixed it by processing this on the html before parsing it, thanks for all the great pointers!
s/\&nbsp\;/ /g;
