How to "read" an HTML document in PHP? - php

I have been stuck on this problem for quite a long time. Unfortunately I could not find a solution on my own, so I am posting my question here.
I am writing a small PHP script that creates a PDF file from a dynamically generated HTML file.
Now I want to "parse" the HTML file and perform an action depending on which tag comes next in the HTML.
E.g.
<div><p>Test</p></div>
My script should recognize:
First tag is a div: do function for div
Second tag is a p: do function for p
I don't know what I should search for. Regular expressions? An HTML parser?
Thanks for a hint!

Try an XML parser. In PHP, SimpleXML is probably what you are looking for.
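A minimal sketch of that idea, assuming the generated HTML is well-formed XML (SimpleXML rejects markup that is not). The handler names and return values are hypothetical placeholders for the real PDF-writing functions:

```php
<?php
// Hypothetical handlers, one per tag name; real ones would write to the PDF.
$handlers = [
    'div' => function (SimpleXMLElement $el): string {
        return 'handle div';
    },
    'p' => function (SimpleXMLElement $el): string {
        return 'handle p: ' . trim((string) $el);
    },
];

/**
 * Visit $node and its descendants in document order, calling the handler
 * registered for each tag name and collecting the results in $log.
 */
function walkElements(SimpleXMLElement $node, array $handlers, array &$log): void
{
    $name = $node->getName();
    if (isset($handlers[$name])) {
        $log[] = $handlers[$name]($node);
    }
    foreach ($node->children() as $child) {
        walkElements($child, $handlers, $log);
    }
}

$root = simplexml_load_string('<div><p>Test</p></div>');
$log = [];
if ($root !== false) {
    walkElements($root, $handlers, $log);
}
print_r($log);
```

For HTML that is not well-formed, DOMDocument::loadHTML() may be a more forgiving starting point than SimpleXML.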

I've used phpQuery several times. It's a nice solution, although it's quite big and seems to be no longer maintained (last commit more than 10 months ago).

What you need to do is read the HTML file into a PHP variable/object
http://www.php-mysql-tutorial.com/wikis/php-tutorial/read-html-files-using-php.aspx
And then use regular expressions to parse the HTML tags and attributes
http://www.codeproject.com/Articles/297056/Most-Important-Regular-Expression-for-parsing-HTML

Related

Word XML to HTML (with styling)

I have a Word template (MS Word 2010) that I inject variables into using PHPWord, and I would like to convert it into a PDF.
My plan is to convert the Word document into XML (which I have done), then turn that XML into styled HTML.
So far I have managed to replace the XML elements that represent line breaks and paragraphs, but I am wondering whether there is code somewhere that will convert the other XML elements into styled HTML. I know it is unlikely to be perfect, but something close would be good.
Your best bet is to use XSLT. There are some good tutorials on the web. This page gives the code for doing this in PHP.
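As a rough sketch of what the XSLT route can look like in PHP, assuming the xsl extension is installed: the tiny document and the three templates below are illustrative only (paragraphs and bold runs), not a complete Word-to-HTML stylesheet.

```php
<?php
// Assumption: a tiny WordprocessingML fragment stands in for the real document XML.
$wordXml = <<<'XML'
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p><w:r><w:rPr><w:b/></w:rPr><w:t>Hello</w:t></w:r><w:r><w:t> world</w:t></w:r></w:p>
  </w:body>
</w:document>
XML;

// Illustrative stylesheet: w:p becomes <p>, bold runs become <strong>.
$xslt = <<<'XSL'
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    exclude-result-prefixes="w">
  <xsl:output method="html" indent="no"/>
  <xsl:template match="text()"/>
  <xsl:template match="w:p"><p><xsl:apply-templates select="w:r"/></p></xsl:template>
  <xsl:template match="w:r[w:rPr/w:b]"><strong><xsl:value-of select="w:t"/></strong></xsl:template>
  <xsl:template match="w:r"><xsl:value-of select="w:t"/></xsl:template>
</xsl:stylesheet>
XSL;

$source = new DOMDocument();
$source->loadXML($wordXml);

$style = new DOMDocument();
$style->loadXML($xslt);

$processor = new XSLTProcessor();      // provided by PHP's xsl extension
$processor->importStylesheet($style);
$html = $processor->transformToXml($source);
echo $html;
```

Each further Word element (tables, lists, fonts) would need its own template, which is where most of the work lies.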

Parse HTML and replace content in DIV

I want to know how I can find the DIV tag in an HTML page. I want to replace the links inside that DIV with different links. I am not sure exactly what code I need.
First, note that PHP does nothing on the client side, but you probably know that already.
You should use file_get_contents to read the web page into a string (or whatever an HTML parsing library provides for this).
There is already a question that explains how to parse HTML in various ways: Robust and Mature HTML Parser for PHP
If that doesn't fit your needs, try searching Google for "php html parsing"; I found some libraries that way.
For example, this library lets you find all tags: http://simplehtmldom.sourceforge.net/
Note that this is not a great approach. I suggest you change your HTML page into a PHP page and insert some code in place of the A tags. This will make everything easier.
One last thing: if the HTML page is static (it doesn't change), you can simply count lines: output the contents from line X to line Y, insert your customized A tags, and then output from line J to the end of the file.
Good luck anyway.
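For the parser-based route, here is a minimal sketch using PHP's built-in DOM extension rather than the library linked above. The div id "links", the sample markup, and the replacement map are all assumptions for illustration:

```php
<?php
// Assumption: in practice $html would come from file_get_contents() on the page.
$html = '<html><body><div id="links"><a href="http://old.example/a">A</a></div>'
      . '<p><a href="http://old.example/b">B</a></p></body></html>';

// Hypothetical mapping from old link targets to their replacements.
$replacements = ['http://old.example/a' => 'http://new.example/a'];

libxml_use_internal_errors(true);   // tolerate sloppy real-world markup
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Only anchors inside the div with id="links" are touched.
foreach ($xpath->query('//div[@id="links"]//a[@href]') as $a) {
    $href = $a->getAttribute('href');
    if (isset($replacements[$href])) {
        $a->setAttribute('href', $replacements[$href]);
    }
}

$result = $doc->saveHTML();
echo $result;
```

The link outside the div is left alone, which is the point of scoping the query to the div first.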

Parse HTML without xpath

I'm trying to create a simple tool to parse HTML files.
Specifically, I need it to get all the name attributes out of all the div tags.
My HTML string varies and I don't have any control over it, so if I try to use XPath I tend to get errors because the HTML is not always valid.
Any ideas?
Thanks,
There is also a great class called PHP Simple HTML DOM Parser at http://simplehtmldom.sourceforge.net/
It works fine with invalid HTML, but it needs a lot of memory to parse long HTML files.
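If pulling in a library is not an option, PHP's built-in DOMDocument may also cope with this, since loadHTML() tries to recover from invalid markup. A minimal sketch without XPath (the function name divNames and the sample input are made up for illustration):

```php
<?php
/**
 * Return the name attribute of every <div> that has one.
 */
function divNames(string $html): array
{
    $previous = libxml_use_internal_errors(true); // collect parse warnings silently
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $names = [];
    foreach ($doc->getElementsByTagName('div') as $div) {
        if ($div->hasAttribute('name')) {
            $names[] = $div->getAttribute('name');
        }
    }
    return $names;
}

// Deliberately malformed input: unclosed <p>, missing closing </div> tags.
$names = divNames('<div name="first"><p>text<div name="second">more</div><div>no name');
print_r($names);
```

How the parser repairs badly broken markup is up to libxml, so results on very messy pages are worth checking by hand.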

Read external HTML page and then find data within

I'm playing around with an idea, and I'm stuck at this one part. I want to read an external HTML page and then extract the data held within two <dd> tags. I've been using file_get_contents with good results, but I'm at a loss as to how to accomplish that last part. The two tags I want to extract the value from are always enclosed within a particular <div>, was wondering if that might help?
In my mind it reads the entire html file into a string, then dumps all the data up until this one particular <div>, and dumps all the data after the closing </div>. Is that possible? I think this needs regex syntax which I've never used yet. So any tips, links, or examples would be great! I can provide more info as necessary.
Maybe this could help:
http://simplehtmldom.sourceforge.net/
You are overcomplicating this. Simply load the page content and then search it with a suitable regex (preg_match()). This will do fine:
preg_match('~<tag id="foobar">(?P<content>.*?)</endtag>~is', $input, $matches);
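Applied to the question, the pattern above might be used roughly like this. The id "details" and the sample markup are assumptions, and the approach breaks as soon as the wrapper div contains another div, because the lazy match stops at the first closing tag:

```php
<?php
// Assumption: the wrapper div has id="details"; in practice $input would come
// from file_get_contents().
$input = '<div id="other"><dd>skip</dd></div>'
       . '<div id="details"><dl><dt>A</dt><dd>first</dd><dt>B</dt><dd>second</dd></dl></div>';

$values = [];
// Step 1: capture everything between the opening tag and the first closing </div>.
if (preg_match('~<div id="details">(?P<content>.*?)</div>~is', $input, $matches)) {
    // Step 2: collect the contents of each <dd> inside that fragment.
    preg_match_all('~<dd[^>]*>(.*?)</dd>~is', $matches['content'], $dd);
    $values = array_map('trim', $dd[1]);
}
print_r($values);
```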
If you use HTQL COM to query the page, the query is: <dd>1:tx

preg_match_all Problems

I'm trying to match against a string of HTML code that contains calls to a JavaScript function, and I want to extract the parameters passed to it.
There are several of these function calls in the string containing the HTML code.
changeImg('location','size');
Let's say that I want to grab the location within the single quotes. How would I go about doing this? There is more than one instance in the string.
Thanks in advance.
This is a fairly common question on SO and the answer is always the same: regular expressions are a poor tool for parsing HTML. Use an XML or HTML parser. That's what they're for. Take a look at Parse HTML With PHP And DOM for an example and Parsing Html The Cthulhu Way for a bit of background.
Parsing JavaScript is even harder, as it can appear inside <script> tags and in attributes, so at the very least you'd need to get every <script> tag and parse its contents, as well as every element and parse its event handlers (onclick, etc.).
I'm reminded of this quote:
"Some people, when confronted with a problem, think "I know, I’ll use regular expressions." Now they have two problems." -- Jamie Zawinski
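The two-step approach described in this answer could be sketched as follows: let a DOM parser locate the JavaScript, then use a narrow regex only on those snippets. The sample markup is hypothetical, only onclick handlers are checked, and the pattern assumes the first argument is a plain single-quoted string with no escaped quotes:

```php
<?php
// Hypothetical page fragment with the call in an event handler and in a script.
$html = '<html><body>'
      . '<a href="#" onclick="changeImg(\'img/a.png\',\'large\');">A</a>'
      . '<script>changeImg(\'img/b.png\',\'small\');</script>'
      . '</body></html>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Gather the JavaScript: script bodies first, then onclick attributes.
$snippets = [];
foreach ($doc->getElementsByTagName('script') as $script) {
    $snippets[] = $script->textContent;
}
foreach ($doc->getElementsByTagName('*') as $el) {
    if ($el->hasAttribute('onclick')) {
        $snippets[] = $el->getAttribute('onclick');
    }
}

// Within the JavaScript only, capture the first single-quoted argument.
$locations = [];
foreach ($snippets as $js) {
    if (preg_match_all("~changeImg\(\s*'([^']*)'~", $js, $m)) {
        $locations = array_merge($locations, $m[1]);
    }
}
print_r($locations);
```

This still does not parse the JavaScript itself, so calls built from variables or string concatenation would be missed.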
