XML parsing _ dealing with an exception ... basic PHP problem


I have written a perfectly working XML parser in PHP, using:
$xml = simplexml_load_file($newfile);
Here $newfile points to an XML file that has over 20,000 lines.
The problem is that the system producing the file (an Android device) started generating tags called
<none>
whenever no value exists.
However, there is no closing </none> tag, and <none> occurs multiple times!
It seems I can either:
a) Instruct Android not to do this. I tried, but it can't be controlled; the OS deals with it.
b) Create a PHP workaround to prevent this error.
I need help. How do I achieve this?
Thanks

If there is no end tag, the XML is not valid. If the <none> tag is the only issue, you can try to ignore it in your parser, or remove the <none> tags with PHP "find and replace" functions such as preg_replace before handing the data to the parser.
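A minimal sketch of the preg_replace approach, assuming the stray tag is always spelled none and never carries attributes. The helper name strip_none_tags and the sample document are made up for illustration; for a real file you would read it with file_get_contents($newfile) first and then use simplexml_load_string instead of simplexml_load_file.

```php
<?php
// Hypothetical helper: deletes stray <none> tags so the rest of the document can be parsed.
function strip_none_tags(string $xml): string
{
    // Matches <none>, < none >, <none/> and </none>, case-insensitively.
    return preg_replace('~<\s*/?\s*none\s*/?\s*>~i', '', $xml);
}

// Made-up sample input with an unclosed <none> tag.
$raw = '<items><item><name>A</name><value><none></value></item></items>';

$xml = simplexml_load_string(strip_none_tags($raw));
if ($xml === false) {
    die("Still not well-formed after removing <none> tags\n");
}

echo $xml->item->name, "\n"; // prints "A"
```

After the cleanup the affected element is simply empty, so code that reads it gets an empty string instead of a parse error.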

Related

process unconform xml in php without simplexml [duplicate]

I'm having some trouble parsing malformed XML in PHP. In particular I'm querying a third party webservice that returns data in an XML format without encoding the XML entities in actual data. For example one of the elements contains an ASCII heart, '<3', without the quotes, which the XML parser sees as an opening tag. It should be '&lt;3'.
Right now I'm simply passing the XML string into a SimpleXMLElement which, predictably, fails on these instances. I've done some looking around and it seems like PHP Tidy package might be able to help me, but the amount of configuration you can do is overwhelming :(
Thus, I'm just wondering if anyone else has had a problem like this and, if so, how they were able to solve it.
Thanks!

Try tidy.repairString:
php > $tidy = new tidy();
php > $repaired = $tidy->repairString("<foo>I <3 Philadelphia</foo>", array("input-xml"=>1));
php > print($repaired);
<foo>I &lt;3 Philadelphia</foo>
php > $el = new SimpleXMLElement($repaired);

Read the content as a string.
htmlspecialchars(preg_replace('/[\x00-\x08\x0b-\x0c\x0e-\x1f]/','',$string))
Load the transformed string into SimpleXMLElement.
It has worked for me so far.
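As a rough sketch of the control-character step, here is one way to wrap it so that parse failures are reported instead of raising warnings. The function name load_xml_leniently is hypothetical. Note that this sketch leaves out htmlspecialchars on purpose: applied to a whole document it would escape the tags as well, so it is assumed here that escaping is only applied to individual text values where needed.

```php
<?php
// Hypothetical helper: strips control characters, then tries to parse the result.
function load_xml_leniently(string $raw): ?SimpleXMLElement
{
    // Remove control characters that XML 1.0 forbids (tab, LF and CR are kept).
    $clean = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $raw);

    // Collect libxml errors instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $xml = simplexml_load_string($clean);
    foreach (libxml_get_errors() as $error) {
        error_log(sprintf('XML error on line %d: %s', $error->line, trim($error->message)));
    }
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $xml === false ? null : $xml;
}

$ok  = load_xml_leniently("<foo>a\x01b</foo>");          // control character removed
$bad = load_xml_leniently('<foo>I <3 Philadelphia</foo>'); // still malformed
```

The second call shows the limit of this approach: an unescaped < in the data is not a control character, so something like the tidy repair is still needed for that case.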

HTML minifier puts every tag on a new line

I am using an HTML minifier, which can be found here: HTML minify
The strange thing to me is that every tag is placed on a new line. Is this common behavior, or am I doing something wrong? The output looks something like this:
Does anyone know how I can fix this so that it just creates one line of code, or does this way of minifying have some advantages?

Checked the code?
// use newlines before 1st attribute in open tags (to limit line lengths)
$this->_html = preg_replace('/(<[a-z\\-]+)\\s+([^>]+>)/i', "$1\n$2", $this->_html);
Long lines can be a bad thing: browsers might fill buffers or simply drop content at the end of the line. It looks like the Minify script has this hard-coded, with no option to change it. So if you really want everything on one line, just customise your copy to skip that replacement. Open source win.
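To see what that replacement does in isolation, here is a small sketch that applies the same pattern to a made-up snippet. The sample HTML and the variable names are assumptions for illustration only; the pattern and replacement string are the ones quoted from the minifier.

```php
<?php
// Made-up input: one tag with attributes, one without.
$html = '<a href="/x" class="y">link</a><p>text</p>';

// Same pattern as in the minifier: the whitespace before the first
// attribute of an opening tag is turned into a newline.
$withBreaks = preg_replace('/(<[a-z\\-]+)\\s+([^>]+>)/i', "$1\n$2", $html);

echo $withBreaks, "\n";
```

Only opening tags that carry attributes get a line break; the <p> tag in the sample is left alone. Removing or commenting out that one preg_replace call in your own copy of the minifier should therefore keep such tags on a single line.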

Why do I see `á` instead of a space when writing to screen (encoding problem)?

I am completely lost with encoding issues. I have no idea what's going on, what exactly the problem is, or how to fix it.
Basically I'm just trying to read an HTML file from a Zip file, parse it, then output pieces to XML. Now something funky is happening with the text I get out of the parser.
When parsing the HTML, I get á instead of a space, but only if I write to the screen. If I keep it in a variable and write to a file, it looks fine in the file. However, even though it looks right in the XML, something is wrong with it: my PHP parser can't parse that XML, nor does IE seem to like it.
I first had to run mb_convert_encoding($xmlcontent, "ASCII"); so I could get that XML to parse in PHP.
Any idea what my problem is?
extract HTML from a .tar.gz file using Perl
my $tar = Archive::Tar->new;
$tar->read("myfile.tar.gz");
$tar->extract_file('index.html', 'output.html');
load HTML, this is where it starts to get funky, I get output like Numberáofásourceálines
my $tree = HTML::TreeBuilder->new;
$tree->parse_file('output.html') or die $!;
$tree->elementify;
write to XML
my $output = new IO::File(">output.xml");
my $writer = new XML::Writer(OUTPUT => $output, DATA_MODE => 1,DATA_INDENT => 2);

If it looks correct when you write it to a file and wrong when you write it to the terminal, it sounds like your terminal is expecting the wrong encoding. Check your terminal settings.
Also, see Jon Rockway's answer to "Why does modern Perl avoid UTF-8 by default?". With encodings, you have to convert your input to the correct encoding and convert your output to the correct encoding. Everything that looks at the data needs to know which encoding you're using.

I think I just fixed it by processing this on the html before parsing it, thanks for all the great pointers!
s/\&nbsp\;/ /g;
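Since the question also mentions parsing the result in PHP, here is a rough PHP counterpart of that substitution. It rests on an assumption about the cause: &nbsp; decodes to a no-break space (U+00A0), which is the single byte 0xA0 in Latin-1, and that byte is displayed as á by terminals using DOS code pages 437 or 850. The sample string and variable names are made up.

```php
<?php
// Made-up sample resembling the text from the question.
$html = 'Number&nbsp;of&nbsp;source&nbsp;lines';

// Decoding the entity yields a no-break space, not a plain space.
$decoded = html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8');
echo bin2hex(html_entity_decode('&nbsp;', ENT_QUOTES | ENT_HTML5, 'UTF-8')), "\n"; // c2a0

// Replace both the entity and an already-decoded no-break space (UTF-8 bytes C2 A0)
// with an ordinary space before handing the text to a parser.
$plain = str_replace(['&nbsp;', "\xC2\xA0"], ' ', $html);
echo $plain, "\n"; // Number of source lines
```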

PHP Simple HTML DOM Parser denies to handle [invalid] HTML - first trial fails

G'day dear community, hello all!
I am trying to select either a class or an id using PHP Simple HTML DOM Parser, with absolutely no luck. Perhaps I have to study the man pages again and again.
The DOM technique somewhat goes over my head.
My example is very simple and seems to comply with the examples given in the manual (simplehtmldom.sourceforge AT net/manual.htm), but it just won't work, and it's driving me up the wall. Other example scripts shipped with Simple HTML DOM work fine.
See the example: http://www.aktive-buergerschaft.de/buergerstiftungsfinder
This is the easiest example I have found. The question is: how do I parse it?
Should I do it with Perl? The example HTML page is invalid HTML.
I do not know whether the Simple HTML DOM Parser is able to handle badly malformed HTML (probably not), and then I am lost.
It is pretty hard to believe, but you can get the content with file_get_contents. Afterwards, though, you still have to do the parsing job, and there I have some missing parts!
Finally, if I cannot get it to run, I can try out some Perl parsers, e.g. HTML::TreeBuilder::XPath.

1: Check whether file_get_contents is working.
2: If not, use curl, fopen, or telnet to read the data.
Simple HTML DOM filters out all the noise and can process malformed tags as well.
The problem might be with how you retrieve the data.
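The two steps above could be sketched like this. The function name fetch_html and the timeout value are assumptions, and the cURL fallback is only used when the curl extension is available.

```php
<?php
// Hypothetical helper: tries file_get_contents first, then falls back to cURL.
function fetch_html(string $url): ?string
{
    // Step 1: the simple way. @ suppresses the warning so failure can be checked by value.
    $body = @file_get_contents($url);
    if ($body !== false && $body !== '') {
        return $body;
    }

    // Step 2: fall back to cURL if the extension is loaded.
    if (!function_exists('curl_init')) {
        return null;
    }
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);          // assumed timeout in seconds
    $body = curl_exec($ch);

    return is_string($body) && $body !== '' ? $body : null;
}
```

If fetch_html returns null, the problem is in retrieving the page rather than in the parser. If it returns a string, that string is what you would pass on to the HTML parser.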

Using PHP PCRE to fetch div content

I'm trying to fetch data from a div (based on its id) using PHP's PCRE. The goal is to fetch the div's contents based on its id, using recursion / depth to get everything inside it. The main problem is getting the other divs inside the "main div", because the regex stops at the first </div> it finds after the initial <div id="test">.
I've tried many different approaches, and none of them worked. The best solution, in my opinion, is to use the R parameter (recursion), but I never got it to work properly.
Any ideas?
Thanks in advance :D

You'd be much better off using some form of DOM parser - regex really isn't suited to this problem. If all you want is basic HTML dom parsing, something like simplehtmldom would be right up your alley. It's trivial to install (just include a single PHP file) and trivial to use (2-3 lines will do what you need).
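include('simple_html_dom.php'); // the library is distributed as a single file with this name
$dom = str_get_html($bunchofhtmlcode);
$testdiv = $dom->find('div#test', 0); // 0 for the first occurrence; null if nothing matches
$testdiv_contents = $testdiv ? $testdiv->innertext : ''; // includes any nested divs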
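If you would rather not add a library, roughly the same thing can be sketched with PHP's built-in DOM extension. This is an alternative to the answer above, not what it proposes; the helper name div_inner_html and the sample markup are made up, and the id is assumed not to contain double quotes.

```php
<?php
// Hypothetical helper: returns the inner HTML of the first <div> with the given id.
function div_inner_html(string $html, string $id): ?string
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid; keep libxml warnings out of the output.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Assumes $id contains no double quotes.
    $xpath = new DOMXPath($doc);
    $node = $xpath->query('//div[@id="' . $id . '"]')->item(0);
    if ($node === null) {
        return null;
    }

    // Serialize every child node, nested divs included.
    $inner = '';
    foreach ($node->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    return $inner;
}

$html = '<div id="test"><div class="a">one</div><div>two</div></div><div>outside</div>';
echo div_inner_html($html, 'test'), "\n";
```

Because the parser tracks nesting itself, the inner divs come back intact and the sibling div after the closing tag is not included, which is exactly the part that is hard to get right with a regex.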
