I'm trying to create a simple tool to parse HTML files.
Specifically, I need it to get the name attribute of every div tag.
The HTML varies and I have no control over it, so when I try to use XPath I tend to get errors because the HTML is not always well formed.
Any ideas?
Thanks,
There is also a great class called PHP Simple HTML DOM Parser at http://simplehtmldom.sourceforge.net/
It works fine with invalid HTML, but needs a lot of memory to parse long HTML files.
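If an external library is not an option, PHP's bundled DOMDocument can also cope with sloppy markup once libxml warnings are collected internally instead of being emitted. The following is only a minimal sketch, assuming the libxml/DOM extension is enabled; the function name divNames and the sample markup are made up for illustration.

```php
<?php
// Sketch: collect the name attribute of every <div>, tolerating broken markup.
// divNames() is a hypothetical helper, not part of any library.
function divNames(string $html): array
{
    // Keep libxml parse warnings internal so malformed HTML does not emit them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $names = [];
    foreach ($doc->getElementsByTagName('div') as $div) {
        // Skip divs that have no name attribute at all.
        if ($div->hasAttribute('name')) {
            $names[] = $div->getAttribute('name');
        }
    }
    return $names;
}

// Deliberately sloppy sample input: unclosed tags and an unquoted attribute.
$html = '<div name="header"><p>Hi<div name=body>text</div><div>no name';
print_r(divNames($html));
```

Because no XPath expression is evaluated here, the only thing that can go wrong is the parse itself, and libxml's HTML parser recovers from most tag-soup input.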
There are 365 elements like this:
<sometag class="day" date="yyyy-mm-dd" count="some Int"></sometag>
I have to parse the date and count with PHP, Beautiful Soup, or any parsing library that can be used with PHP, and build a JSON string.
How can I parse them? Every tag has the same class name, "day".
I will wait for more answers with broader information. Thank you.
Agree with #Scuzzy. SimpleXML would be the easiest and cleanest way for this very specific example. The problem with SimpleXML is that I have found it to be very slow when parsing larger documents.
There's also this library for parsing HTML, which I think you will find very useful if you're screen scraping with PHP, which it looks like you are doing.
http://www.schrenk.com/nostarch/webbots/DSP_download.php
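For the date/count question above, a rough sketch with PHP's built-in DOMDocument and DOMXPath might look like the following. It uses DOMDocument rather than SimpleXML so that the input does not have to be well-formed XML. The function name daysToJson and the sample dates are assumptions for illustration; only the class, date and count attributes come from the question.

```php
<?php
// Sketch: turn every element with class="day" into a {date, count} JSON entry.
// daysToJson() is a hypothetical helper name.
function daysToJson(string $html): string
{
    // <sometag> is not a known HTML tag, so libxml warns; keep that quiet.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $rows = [];
    // Match any element, whatever its tag name, whose class is exactly "day".
    foreach ($xpath->query('//*[@class="day"]') as $node) {
        $rows[] = [
            'date'  => $node->getAttribute('date'),
            'count' => (int) $node->getAttribute('count'),
        ];
    }
    return json_encode($rows);
}

// Two made-up entries standing in for the 365 real ones.
$html = '<sometag class="day" date="2015-01-01" count="3"></sometag>'
      . '<sometag class="day" date="2015-01-02" count="0"></sometag>';
echo daysToJson($html), "\n";
```

The XPath test is an exact match on the class attribute; elements carrying several classes would need a contains()-style expression instead.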
There are a bunch of HTML text extraction tools out there. Mostly for Java or Python. The one I come across most often is boilerpipe. There are a few APIs here and there, and some seem to work pretty well. Does anyone know of anything in PHP that does this?
You could try phpQuery:
http://code.google.com/p/phpquery/
DOMDocument is a class available in PHP, if you have libxml support, that can parse HTML documents and let you iterate over them or issue XPath queries to find specific nodes in the DOM tree. This is the ideal method.
Or, if the text is simple enough and uniform, you can use preg_match() to extract text from the data using regular expressions.
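As a rough illustration of the DOMDocument route, the sketch below gathers the text nodes under body while skipping script and style content. It does no boilerplate detection of the kind boilerpipe performs; it only strips markup. The function name visibleText and the sample page are assumptions.

```php
<?php
// Sketch: extract readable text from HTML with DOMDocument + XPath.
// visibleText() is a hypothetical helper; it does not rank or filter content.
function visibleText(string $html): string
{
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // All text nodes under <body> that are not inside <script> or <style>.
    $query = '//body//text()[not(ancestor::script) and not(ancestor::style)]';

    $parts = [];
    foreach ($xpath->query($query) as $textNode) {
        $text = trim($textNode->nodeValue);
        if ($text !== '') {
            $parts[] = $text;
        }
    }
    // Join the pieces with single spaces.
    return implode(' ', $parts);
}

$html = '<html><body><h1>Title</h1><script>var x = 1;</script>'
      . '<p>First <b>bold</b> paragraph.</p></body></html>';
echo visibleText($html), "\n";
```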
I want to know how I can find the DIV tag in an HTML page, because I want to replace the links inside that DIV with different links. I do not understand exactly what code I need.
First, note that PHP won't do anything client side, but you probably know that already.
You should use file_get_contents to read the web page as a string (or whatever your HTML parsing library provides).
There is already a question that explains how to parse HTML: Robust and Mature HTML Parser for PHP
If it doesn't fit your needs, try searching Google for "php html parsing"; I found some libraries that way.
For example, this library lets you find all tags: http://simplehtmldom.sourceforge.net/
Note that this is not a great approach. I suggest you change your HTML page into a PHP page and insert some code in place of the A tags, which will make everything easier.
Lastly, if the HTML page is static (it doesn't change), you can simply count lines: read the contents from line X to line Y, insert your customized A tags, and then read from line J to the end of the file.
Good luck anyway.
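For reference, here is a minimal sketch of the parse-and-replace idea using PHP's built-in DOMDocument instead of an external library. It assumes the target DIV can be identified by an id attribute, which the question does not state; the function name replaceLinksInDiv, the id "links" and the example URLs are all made up. In practice the HTML string would come from file_get_contents.

```php
<?php
// Sketch: rewrite the href of every link inside one <div>, leaving others alone.
// replaceLinksInDiv() is a hypothetical helper. $divId is pasted into the XPath
// expression, so it is assumed to be a trusted value without quote characters.
function replaceLinksInDiv(string $html, string $divId, callable $rewrite): string
{
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Only <a href> elements that sit somewhere below the chosen div.
    $query = sprintf('//div[@id="%s"]//a[@href]', $divId);
    foreach ($xpath->query($query) as $a) {
        $a->setAttribute('href', $rewrite($a->getAttribute('href')));
    }
    // Serialise the whole (modified) document back to an HTML string.
    return $doc->saveHTML();
}

$html = '<div id="links"><a href="/old">A</a></div><a href="/keep">B</a>';
$out = replaceLinksInDiv($html, 'links', function ($href) {
    return 'https://example.org' . $href;
});
echo $out;
```

Note that saveHTML returns a complete document, so a fragment passed in comes back wrapped in html and body tags.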
I have been facing a problem for quite a long time. Unfortunately I was not able to find the solution on my own, so I am posting my question here.
I am writing a little PHP script that creates a PDF file from a dynamically created HTML file.
Now I want to "parse" the HTML file and perform an action depending on which tag comes next in the HTML.
E.g.
<div><p>Test</p></div>
My script should recognize:
First tag is a div: call the function for div
Second tag is a p: call the function for p
I don't know what I should search for. Regular expressions? An HTML parser?
Thanks for a hint!
Try an XML parser. In PHP, SimpleXML is probably what you are looking for.
I've used phpQuery several times. It's a nice solution, although it's quite big and seems to be no longer supported (last commit more than 10 months ago).
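The per-tag dispatch described in the question can be sketched as a recursive walk over a parsed tree. The example below uses DOMDocument rather than SimpleXML so that input which is not well-formed XML still parses; the handler table, the function name walkTags and the strings the handlers return are all illustrative assumptions.

```php
<?php
// Sketch: visit elements in document order and call a handler per tag name.
// walkTags() and the $handlers table are hypothetical, for illustration only.
function walkTags(DOMNode $node, array $handlers, array &$log): void
{
    foreach ($node->childNodes as $child) {
        // Ignore text nodes, comments and so on; only elements are dispatched.
        if (!($child instanceof DOMElement)) {
            continue;
        }
        $tag = $child->nodeName;
        if (isset($handlers[$tag])) {
            $log[] = $handlers[$tag]($child);
        }
        // Descend so nested tags are handled after their parent.
        walkTags($child, $handlers, $log);
    }
}

$handlers = [
    'div' => function (DOMElement $el) { return 'div: start block'; },
    'p'   => function (DOMElement $el) { return 'p: ' . trim($el->textContent); },
];

$doc = new DOMDocument();
$previous = libxml_use_internal_errors(true);
$doc->loadHTML('<div><p>Test</p></div>');
libxml_clear_errors();
libxml_use_internal_errors($previous);

$log = [];
// loadHTML wraps fragments in <html><body>, so start the walk at <body>.
walkTags($doc->getElementsByTagName('body')->item(0), $handlers, $log);
print_r($log);
```

In a PDF generator each handler would emit the drawing commands for its tag instead of returning a string.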
What you need to do is read the HTML file into a PHP variable/object
http://www.php-mysql-tutorial.com/wikis/php-tutorial/read-html-files-using-php.aspx
And then use RegEx to parse the HTML Tags and Attributes
http://www.codeproject.com/Articles/297056/Most-Important-Regular-Expression-for-parsing-HTML
I'm trying to put an HTML embed code for a Flash video into an RSS feed, which will then be parsed by a parser (Magpie) on my other site. How should I encode the embed code on one side, and then decode it on the other, so I can insert clean HTML into the DB on the receiving server?
Since RSS is XML, you might want to check out CDATA, which I believe is valid in the various RSS specs.
<summary><![CDATA[Data Here]]></summary>
Here's the w3schools entry on it: http://www.w3schools.com/XML/xml_cdata.asp
htmlencode/htmldecode should do the trick.
I've been using htmlentities/html_entity_decode, but for some reason it doesn't work with the parser. In a normal test it works, but the parser always returns HTML code without the < > " characters.
RSS is XML. It has very specific rules for encoding HTML. If you're generating it, I'd recommend using an xml library to write the node containing HTML, to be sure you get the encoding right.
HTMLencode will only perform the escaping necessary for embedding data within HTML; XML rules are stricter.
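A minimal sketch of letting an XML library do the encoding, here PHP's DOMDocument with a CDATA section around the embed code. The function names buildItem and readDescription, the element layout and the example URL are assumptions; the reading side uses DOMDocument only as a stand-in for whatever parser the receiving site runs.

```php
<?php
// Sketch: write HTML into an RSS <item> via an XML library, then read it back.
// buildItem() and readDescription() are hypothetical helpers.
function buildItem(string $title, string $embedHtml): string
{
    $doc = new DOMDocument('1.0', 'UTF-8');
    $item = $doc->appendChild($doc->createElement('item'));

    // Plain text node: the library escapes & and < for us.
    $item->appendChild($doc->createElement('title'))
         ->appendChild($doc->createTextNode($title));

    // CDATA section: the HTML is stored verbatim, no entity escaping needed.
    // Assumes $embedHtml does not itself contain the sequence "]]>".
    $item->appendChild($doc->createElement('description'))
         ->appendChild($doc->createCDATASection($embedHtml));

    return $doc->saveXML($item);
}

function readDescription(string $itemXml): string
{
    $doc = new DOMDocument();
    $doc->loadXML($itemXml);
    // textContent returns the CDATA payload unchanged.
    return $doc->getElementsByTagName('description')->item(0)->textContent;
}

$embed = '<embed src="http://example.com/video.swf" width="400" height="300">';
$xml = buildItem('Demo & test', $embed);
echo $xml, "\n";
echo readDescription($xml), "\n";
```

The round trip returns the original string, so no separate decode step is needed before storing it.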
Instead of writing your own RSS XML feed, consider using the Django syndication framework from django.contrib.syndication:
https://docs.djangoproject.com/en/dev/ref/contrib/syndication/
It also supports enclosures, which is the RSS way for embedding images or video.
For custom tags, there is also a low-level API that lets you change the XML:
https://docs.djangoproject.com/en/dev/ref/contrib/syndication/#the-low-level-framework