Keeping file offsets while parsing HTML with the DOM?


I want to modify <img src=""> attributes in not-too-malformed HTML (WordPress posts). I know I can take the simple way and use regexes, but I'm afraid people in blue furry suits will come haunt me in my sleep.
If I use the DOM parser to read the HTML and modify the <img> tags, I'm afraid I can't reconstruct the post exactly as it was (with only my modification), because the DOM parser will probably do too much cleanup and maybe remove essential data. A SAX parser can probably not handle invalid XML, so this will also not work.
So, is there a middle way, where I can use a DOM parser, but one that knows where each element started, so I can do string replacements or something similar from there? I know some nodes in the DOM tree will not exist in the source document (<b>Some <i>bizarre</b> formatting</i> will probably trigger this), but does this mean it is always impossible? I see there is a DOMNode::getLineNo() function added in PHP 5.3, but I'm using 5.2.x.

If PHP's DOM writes "too clean" results, you could try the string-based SimpleHTMLDOM and see whether it's more lenient.
However, with formatting as bizarre as you show, I would never entirely trust the parser to do it "right". But try it out, maybe it just skips such stuff.
The DOM library's DOMNode class has a getLineNo() method. I don't entirely see how this works though, seeing as it doesn't provide an offset to go with it. Not sure whether that'll help your use case.
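For what it's worth, here is a minimal sketch of what getLineNo() gives you. It assumes PHP 5.3 or later (so it would not run on 5.2.x), and the sample HTML string is made up for illustration. It only reports a line number per <img>, so you would still have to search within that line of the original string yourself.

```php
<?php
// Hypothetical sample post; any HTML string would do.
$html = "<p>Intro</p>\n<p><img src=\"a.png\" alt=\"A\"></p>\n<div>\n<img src=\"b.png\">\n</div>";

libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$found = array();
foreach ($doc->getElementsByTagName('img') as $img) {
    $found[] = array(
        'src'  => $img->getAttribute('src'),
        'line' => $img->getLineNo(),   // line number only, no column or byte offset
    );
}

foreach ($found as $item) {
    echo $item['src'], ' reported on line ', $item['line'], "\n";
}
```

With a line number in hand, a narrow string replacement on just that line of the source is one way to leave the rest of the post untouched.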

Related

preg_replace vs DOMDocument replaceChild

I was wondering which of the methods mentioned in the title is more efficient for replacing content in an HTML page.
I have this custom tag in my page: <includes module='footer'/> which will be replaced with some content.
Now there are some downsides to using DOMDocument->getElementsByTagName('includes')->item(0)->parentNode->replaceChild: for instance, when I forget to add the slash in the tag, like so <includes module='footer'>, the whole site crashes.
Regex allows exceptions like these, as long as the input matches the rule. It would even allow me to replace any string, like {includes:footer}.
Now back to my actual question. Are there any downsides to using regex for this purpose, like performance issues...?
More here: Append child/element in head using XML Manipulation
cheers

I wouldn't be too worried about performance here; I would consider them "comparable". Benchmarks would need to be run to truly determine this, as it would depend on the size of the document and how the regular expression is written.
Instead, I would be concerned about accuracy. In general DOMDocument will be much better at parsing XML since it was built to read and understand the language. However, it does fail on <includes module='footer'> because it is an un-closed tag (expecting: </includes>).
Most common HTML/XML formatting issues can be fixed with PHP's Tidy class. I would check this out, since you should receive much more "expected results" compared to if you used regex to parse. If you used a regular expression, there could technically be attributes before/after the module, elements within the includes element, unexpected characters like <includes module='foo>bar'>, etc.
In the end, if your XML is in a "controlled" environment (i.e. you know what can and can't happen, you know what possible characters module will contain, you know that it will always be a self-closing element containing no children, etc.) then by all means use a regular expression. Just know it is looking for a very specific set of rules. However, if you expect this to work with "anything you throw at it", please use a DOM parser (after Tidy'ing to avoid the exceptions), regardless of performance (although I bet it will be very comparable in many instances).
Also, final note: if you plan to find/replace/manipulate many nodes in a document, you will see a large performance increase by going with a DOM parser. A DOM parser will take a document and parse it once. Then you just traverse the data it already has loaded into its class. This is compared to using regular expressions, where each individual one will be run across the whole document looking for a set of matches.
If you want me to get more specific in any area (i.e. give a Tidy example, or work on a benchmark), let me know.
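To make the DOM route concrete, here is a rough sketch under a few assumptions: the page is well-formed XML (so loadXML accepts it), the tag is self-closed, and the $modules array with its footer markup is invented for the example. It is not a drop-in for the asker's site, just an illustration of replaceChild with a document fragment.

```php
<?php
// Hypothetical page and module content, for illustration only.
$page = "<html><body><p>Hi</p><includes module='footer'/></body></html>";
$modules = array('footer' => '<footer>(c) Example</footer>');

$doc = new DOMDocument();
$doc->loadXML($page);   // fails on unclosed tags, which is the strictness discussed above

// getElementsByTagName returns a live list, so copy it before modifying the tree.
$nodes = iterator_to_array($doc->getElementsByTagName('includes'));
foreach ($nodes as $node) {
    $name = $node->getAttribute('module');
    if (!isset($modules[$name])) {
        continue;   // leave unknown modules in place
    }
    $fragment = $doc->createDocumentFragment();
    $fragment->appendXML($modules[$name]);
    $node->parentNode->replaceChild($fragment, $node);
}

$out = $doc->saveXML($doc->documentElement);
echo $out, "\n";
```

The document is parsed once and every <includes> element is handled in the same pass, which is the point made above about manipulating many nodes.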

So I did some naive performance testing using microtime(true), and it turns out preg_replace is the faster option. While DOM replaceChild needed between 2.0 and 3.5 ms, preg_replace needed between 0.5 and 1.2 ms! But I guess that's only in my case.
This is what my HTML looks like:
<!DOCTYPE html>
<html>
<head>
{includes:title}
{includes:style}
</head>
<body>
{includes:body}
{includes:footer}
...
a lot more here
...
</body>
</html>
This is the regex I used: /{([ ]*)includes:([ ]*)$key([^}]*)}/i
As I said, I'm not fully proficient with regex, but this did the job. I guess if you optimize it, it would run even faster.
For the replaceChild method I used a custom tag like this: <includes module='body'/>
Again, this was tested on my local server, so I still need to test how it behaves on my online server...
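One possible optimization, sketched here with a made-up template and $modules array: instead of running one preg_replace per $key, a single preg_replace_callback can handle every {includes:...} placeholder in one pass over the document. The pattern below is a stricter variant of the regex above (it only allows whitespace around the name), so treat it as an assumption rather than an exact equivalent.

```php
<?php
// Hypothetical template and module content.
$template = "<head>{includes:title}</head><body>{ includes: body }{includes:unknown}</body>";
$modules = array(
    'title' => '<title>Demo</title>',
    'body'  => '<p>Hello</p>',
);

$result = preg_replace_callback(
    '/\{\s*includes:\s*([a-z0-9_-]+)\s*\}/i',
    function ($m) use ($modules) {
        $key = strtolower($m[1]);
        // Unknown placeholders are left exactly as they were.
        return isset($modules[$key]) ? $modules[$key] : $m[0];
    },
    $template
);

echo $result, "\n";
```

This keeps the regex approach but scans the document only once, no matter how many modules there are.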

Using php to get all translatable text from a website/html-page

I'm trying to set up a translation tool to translate websites. What I want to do is import html-code and get all translatable texts from that site.
One idea would be to use strip_tags, but it would ignore strings that could be translated such as alt-texts, title-texts and probably others that I don't have on my mind yet. Is there a clean way to do this?

In this case you need to parse the HTML and extract the text yourself. As you probably already know, parsing HTML with regular expressions is A Bad Idea (tm). So the only right solution is to parse the DOM of the document. For this step you are free to use any tools, including the standard DOMDocument class.
If you are looking for libraries or scripts to help, I would suggest looking at html2text, which can be used commercially. As far as I can see, it doesn't support attributes for <img> tags, but that's very easy to fix (use the <a> tag as an example).
If you are looking for automated text extraction, then you should definitely look at something like Boilerpipe.

I would personally use the DomCrawler component from Symfony2, which is a nice wrapper around PHP's DOM functions, and start from there.
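As a starting point with plain DOMDocument, something along these lines might work. It is only a sketch: the sample HTML is invented, and the choice of alt and title as the translatable attributes is an assumption taken from the question (there are likely others, such as placeholder or value).

```php
<?php
// Hypothetical input page.
$html = '<html><body><h1>Welcome</h1>'
      . '<p>Read <a href="/x" title="More info">this</a>.</p>'
      . '<img src="a.png" alt="A cat"><script>var x = 1;</script></body></html>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Text nodes outside script/style, plus alt and title attribute values.
$query = '//text()[not(ancestor::script) and not(ancestor::style)] | //@alt | //@title';

$strings = array();
foreach ($xpath->query($query) as $node) {
    $text = trim($node->nodeValue);
    if ($text !== '') {
        $strings[] = $text;
    }
}

print_r($strings);
```

Unlike strip_tags, this picks up the attribute texts and skips script content, and the XPath expression is the one place to extend when more translatable attributes turn up.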

Postfix every link in a HTML file with a tracking code

I want to postfix every link in an HTML file with a Google Analytics tracking code. The entire HTML is contained in the $content variable. Is it possible to add this tracking code to all links, except mailto?

No, you can't do it - at least not reliably so. HTML is very contextual, which means you need a real parser to pull this off. Regular expressions may cover many cases, but you'll end up with both false positives (your regex matching on something that is not really a link) and false negatives (real links being missed). See the link in my Pony comment for a more thorough... uhm... "explanation".
If you really have to go through the final HTML and post-process it, your best bet is to find a proper HTML parser (in a pinch, DOMDocument might do: IIRC, it can parse both XML and HTML), walk through the DOM tree and replace links as appropriate, then render the tree back into a string.
Ideally though, you have an HTML-aware template system in place (e.g. XSLT), in which case you can probably intercept the DOM tree earlier in the process, which means you can skip the additional parsing and rendering steps and go right to the DOM tree.
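The parse, walk, render approach could look roughly like this with DOMDocument. Treat it as a sketch: addTracking and the utm_source value are names made up for the example, the input is assumed to be an ASCII HTML fragment, and the output goes through DOMDocument's serializer, so it may not be byte-identical to the input.

```php
<?php
// Append a tracking query string to every link except mailto: and in-page anchors.
// $tracking is a hypothetical, already URL-encoded query fragment.
function addTracking($content, $tracking)
{
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    // Wrap the fragment so there is a single known container to read back from.
    $doc->loadHTML('<div>' . $content . '</div>');
    libxml_clear_errors();

    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if ($href === '' || $href[0] === '#' || stripos($href, 'mailto:') === 0) {
            continue;
        }
        // Keep any #fragment at the very end of the URL.
        $fragment = '';
        $hashPos = strpos($href, '#');
        if ($hashPos !== false) {
            $fragment = substr($href, $hashPos);
            $href = substr($href, 0, $hashPos);
        }
        $separator = (strpos($href, '?') === false) ? '?' : '&';
        $a->setAttribute('href', $href . $separator . $tracking . $fragment);
    }

    // Render only the wrapper's children, not the html/body that loadHTML adds.
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$content = '<p><a href="page.html">Page</a> <a href="mailto:me@example.com">Mail</a></p>';
$result = addTracking($content, 'utm_source=newsletter');
echo $result, "\n";
```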

PHP Dom document html is faster or preg_match_all function is faster?

I'm wondering which one is faster at processing.
Is DOMDocument or preg_match_all (with the curl functions) faster at parsing an HTML page? And will the DOMDocument functions leave a trace on the other server like curl does? For example, with curl we set a user agent to identify who is accessing the page, but with DOMDocument there is nothing.

Does it matter which is faster if one gives you incorrect results?
Matching with regular expressions to get a single bit of data out of the document will be faster than parsing an entire HTML document. But regular expressions cannot parse HTML correctly in all cases.
See http://htmlparsing.com/regexes.html, which I have started to address this common question. (And for the rest of you reading this, I can use help. The source is on github, and I need examples for many different languages.)

Regular expressions will likely be faster, but they are also likely the worse choice. Unless you have benchmarked and profiled your application and found nothing else to optimize, you should look into a proper existing parser.
While Regular Expressions can be used to match HTML, it takes a thorough effort to come up with a reliable parser. PHP offers a bunch of native extensions to work with XML (and HTML) reliably. There is also a number of third party libraries. See my answer to
Best Methods to parse HTML
As for sending a custom user agent, this is possible with DOM too. You have to create a custom stream context and attach it with the underlying libxml functions. You can supply any of the available HTTP Stream context options this way. See my answer to
DOMDocument::validate() problem
for an example how to supply a custom UserAgent.
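In short, the idea looks something like the sketch below. The user agent string and the fetchDocument helper are made up for illustration, and the actual fetch is left commented out because it needs network access.

```php
<?php
// 'MyCrawler/1.0' is a made-up user agent string.
$context = stream_context_create(array(
    'http' => array(
        'user_agent' => 'MyCrawler/1.0',
        'timeout'    => 10,
    ),
));

// The next libxml document load (DOMDocument::load, loadHTMLFile, ...) uses this context.
libxml_set_streams_context($context);

// Hypothetical helper: returns a DOMDocument, or null if loading failed.
function fetchDocument($url)
{
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $ok = $doc->loadHTMLFile($url);
    libxml_clear_errors();
    return $ok ? $doc : null;
}

// $doc = fetchDocument('http://example.com/');  // not executed here: needs network access
```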

DOM functions don't have anything to do with HTML fetching.
However, there are load functions that can be used to fetch HTTP resources directly.
They will show the same behaviour as file_get_contents without context params.
As to the other part of your question: preg functions are faster. However, they are not intended for that use and you will probably regret using them for this purpose very soon.
If you are parsing HTML with regular expressions, you are either completely insanely nuts awesome, or just don't get the concept of HTML.

What's the difference between the different XML parsing libraries in PHP5?

The original question is below, but I changed the title because I think it will be easier to find for others with the same doubt. In the end, an XHTML document is an XML document.
It's a beginner question, but I would like to know which do you think is the best library for parsing XHTML documents in PHP5?
I have generated the XHTML from HTML files (which were created using Word :S) with Tidy, and now I need to replace some elements from them (like the and element, replace some attributes in tags).
I haven't used XML very much, there seem to be many options for parsing in PHP (SimpleXML, DOM, etc.) and I don't know if all of them can do what I need, and which is the easiest one to use.
Sorry for my English, I'm from Argentina. Thanks!
A bit more information: I have a lot of HTML pages, done in Word 97. I used Tidy to clean them and turn them into XHTML Strict, so now they are all XML compatible. I want to use an XML parser to find some elements and replace them (the logic by which I do this doesn't matter). For example, I want all of the pages to use the same CSS stylesheet and class attributes, for a unified appearance. They are all static pages which contain legal documents, nothing strange there. Which of the extensions should I use? Is SimpleXML enough? Should I learn DOM in spite of it being more difficult?

You could use SimpleXML, which is included in a default PHP install. This extension offers easy object-oriented access to XML structures.
There's also DOM XML. A "downside" to this extension is that it is a bit harder to use and that it is not included by default.

Just to clear up the confusion here. PHP has a number of XML libraries, because php4 didn't have very good options in that direction. From PHP5, you have the choice between SimpleXml, DOM and the sax-based expat parser. The latter also existed in php4. php4 also had a DOM extension, which is not the same as php5's.
DOM and SimpleXml are alternatives to the same problem domain; they load the document into memory and let you access it as a tree structure. DOM is a rather bulky API, but it's also very consistent and it's implemented in many languages, meaning that you can re-use your knowledge across languages (in JavaScript, for example). SimpleXml may be easier initially.
The SAX parser is a different beast. It treats an xml document as a stream of tags. This is useful if you are dealing with very large documents, since you don't need to hold it all in memory.
For your usage, I would probably use the DOM api.

DOM is a standard, language-independent API for hierarchical data such as XML, which has been standardized by the W3C. It is a rich API with much functionality. It is object based, in that each node is an object.
DOM is good when you not only want to read or write, but also want to do a lot of manipulation of nodes in an existing document, such as inserting nodes between others, changing the structure, etc.
SimpleXML is a PHP-specific API which is also object-based but is intended to be a lot less verbose than the DOM: simple tasks such as finding the value of a node or finding its child elements take a lot less code. Its API is not as rich as DOM's, but it still includes features such as XPath lookups, and a basic ability to work with multiple-namespace documents. And, importantly, it still preserves all features of your document such as XML CDATA sections and comments, even though it doesn't include functions to manipulate them.
SimpleXML is very good for read-only use: if all you want to do is read the XML document and convert it to another form, then it'll save you a lot of code. It's also fairly good when you want to generate a document, or do basic manipulations such as adding or changing child elements or attributes, but it can become complicated (but not impossible) to do a lot of manipulation of existing documents. It's not easy, for example, to add a child element in between two others; addChild only inserts after other elements. SimpleXML also cannot do XSLT transformations. It doesn't have things like 'getElementsByTagName' or 'getElementById', but if you know XPath you can still do that kind of thing with SimpleXML.
The SimpleXMLElement object is somewhat 'magical'. The properties it exposes if you var_dump/print_r/var_export don't correspond to its complete internal representation. It exposes some of its child elements as if they were properties which can be accessed with the -> operator, but still preserves the full document internally, and you can do things like access a child element whose name is a reserved word with the [] operator as if it was an associative array.
You don't have to fully commit to one or the other, because PHP implements the functions:
simplexml_import_dom(DOMNode)
dom_import_simplexml(SimpleXMLElement)
This is helpful if you are using SimpleXML and need to work with code that expects a DOM node or vice versa.
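As a small sketch of mixing the two (the sample document, the 'legal' class name and common.css are all made up to mirror the question): SimpleXML does the easy attribute changes, and dom_import_simplexml is used for the one step SimpleXML makes awkward, inserting an element before an existing sibling.

```php
<?php
// Hypothetical Tidy-cleaned page.
$xml = '<html><head><title>Doc</title></head>'
     . '<body><p class="MsoNormal">Text</p><p>More</p></body></html>';

$sx = simplexml_load_string($xml);

// SimpleXML: set the same class on every paragraph (adds the attribute if missing).
foreach ($sx->xpath('//p') as $p) {
    $p['class'] = 'legal';
}

// DOM: insert a stylesheet link as the first child of <head>.
$head = dom_import_simplexml($sx->head);
$link = $head->ownerDocument->createElement('link');
$link->setAttribute('rel', 'stylesheet');
$link->setAttribute('href', 'common.css');
$head->insertBefore($link, $head->firstChild);

// Both views share one underlying document, so asXML() sees the DOM change too.
$out = $sx->asXML();
echo $out;
```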
PHP also offers a third XML library:
XML Parser (an implementation of SAX, a language-independent interface, but not referred to by that name in the manual) is a much lower level library, which serves quite a different purpose. It doesn't build objects for you. It basically just makes it easier to write your own XML parser, because it does the job of advancing to the next token, and finding out the type of token, such as what tag name is and whether it's an opening or closing tag, for you. Then you have to write callbacks that should be run each time a token is encountered. All tasks such as representing the document as objects/arrays in a tree, manipulating the document, etc will need to be implemented separately, because all you can do with the XML parser is write a low level parser.
The XML Parser functions are still quite helpful if you have specific memory or speed requirements. With them, it is possible to write a parser that can parse a very long XML document without holding all of its contents in memory at once. Also, if you are not interested in all of the data, and don't need or want it to be put into a tree or set of PHP objects, then it can be quicker. This is the case, for example, if you want to scan through an XHTML document and find all the links, and you don't care about structure.
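A minimal sketch of that link-scanning case with the XML Parser functions, assuming a well-formed XHTML string (the sample markup is invented). Case folding is switched off so that tag and attribute names arrive as written instead of upper-cased.

```php
<?php
// Hypothetical well-formed XHTML input.
$xhtml = '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
       . '<p><a href="/one">One</a> and <a href="/two">Two</a></p>'
       . '</body></html>';

$links = array();

$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    // Called for every opening tag; collect href values from <a> elements.
    function ($parser, $name, $attrs) use (&$links) {
        if ($name === 'a' && isset($attrs['href'])) {
            $links[] = $attrs['href'];
        }
    },
    // Closing tags are not interesting here.
    function ($parser, $name) {
    }
);

// For large files, feed the document in chunks and pass true only with the last one.
xml_parse($parser, $xhtml, true);

print_r($links);
```

No tree is built at any point; the callbacks see each tag once and the only state kept is the $links array.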

I prefer SimpleXMLElement as it's pretty easy to use to loop through elements.
Edit: It says no version info available, but it's available in PHP5, at least 5.2.5 but probably earlier.
It's really personal choice though, there's plenty of XML extensions.
Bear in mind many XML parsers will balk if you have invalid markup - XHTML should be XML but not always!

It's been a long time (2 years or more) since I worked with XML parsing in PHP, but I always had good, usable results from the XML_Parser Pear package. Having said that, I have had minimal exposure to PHP5, so I don't really know if there are better, inbuilt alternatives these days.

I did a little bit of XML parsing in PHP5 last year and decided to use a combination of SimpleXML.
DOM is a bit more useful if you want to create a new XML tree or add to an existing one; it's slightly more flexible.

It really depends on what you're trying to accomplish.
For pulling rather large amounts of data, i.e. many records of, say, product information from a store website, I'd probably use Expat, since it's supposedly a bit faster...
Personally, I've never had XML files large enough to create a noticeable performance boost.
At those quantities you might as well be using SQL.
I recommend using SimpleXML.
It's pretty intuitive, easy to use/write.
Also, works great with XPath.
Never really got to use DOM much, but if you're using the XML Parser for something as large as you're describing you might want to use it, since it's a bit more functional than SimpleXML.
You can read about all three at W3Schools:
http://www.w3schools.com/php/php_xml_parser_expat.asp
http://www.w3schools.com/php/php_xml_simplexml.asp
http://www.w3schools.com/php/php_xml_dom.asp
