Extract data from HTML in PHP or Python

I need to extract this data and display a simple graph from it.
Something like Equity Share Capital -> array (30.36, 17.17, 15.22, ... etc) would help.
<html:tr>
<html:td>Equity Share Capital</html:td>
<html:td class="numericalColumn">30.36</html:td>
<html:td class="numericalColumn">17.17</html:td>
<html:td class="numericalColumn">15.22</html:td>
<html:td class="numericalColumn">9.82</html:td>
<html:td class="numericalColumn">9.82</html:td>
</html:tr>
How do I go about this task in PHP or Python?

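A good place to start looking would be the Python module BeautifulSoup, which parses the markup into a tree that you can search.
Assuming you've loaded the data into a variable called raw:
# BeautifulSoup 3 on Python 2; with bs4 the import is "from bs4 import BeautifulSoup"
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(raw)
for x in soup.findAll("html:td"):
    # The comparison is case-sensitive, so match the label exactly as it appears in the markup
    if x.string == "Equity Share Capital":
        # The value cells are the ones in the same row that carry a class attribute
        VALS = [y.string for y in x.parent.findAll() if y.has_key("class")]
        print VALS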
This gives:
[u'30.36', u'17.17', u'15.22', u'9.82', u'9.82']
Note that this is a list of unicode strings, so convert them to whatever type you need (float, for example) before processing.
There are many ways to do this with BeautifulSoup. The nice thing I've found, however, is that the quick hack is often good enough (TM) to get the job done!
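If installing BeautifulSoup is not an option, the same row can be pulled out with Python 3's standard library. This is only a minimal sketch, not the approach from the answer above: the class name RowParser is made up for illustration, and it assumes every cell after the label holds a plain number.

```python
from html.parser import HTMLParser


class RowParser(HTMLParser):
    """Collect each <html:tr> row as {label: [float, float, ...]}."""

    def __init__(self):
        super().__init__()
        self.rows = {}
        self._cells = None      # cell texts of the current row, None outside a row
        self._in_cell = False   # True while inside an <html:td>

    def handle_starttag(self, tag, attrs):
        if tag == "html:tr":
            self._cells = []
        elif tag == "html:td" and self._cells is not None:
            self._cells.append("")
            self._in_cell = True

    def handle_data(self, data):
        if self._in_cell:
            self._cells[-1] += data

    def handle_endtag(self, tag):
        if tag == "html:td":
            self._in_cell = False
        elif tag == "html:tr" and self._cells:
            # First cell is the label, the rest are assumed to be numbers
            label, *values = (cell.strip() for cell in self._cells)
            self.rows[label] = [float(v) for v in values]
            self._cells = None


raw = """<html:tr>
<html:td>Equity Share Capital</html:td>
<html:td class="numericalColumn">30.36</html:td>
<html:td class="numericalColumn">17.17</html:td>
<html:td class="numericalColumn">15.22</html:td>
<html:td class="numericalColumn">9.82</html:td>
<html:td class="numericalColumn">9.82</html:td>
</html:tr>"""

parser = RowParser()
parser.feed(raw)
print(parser.rows["Equity Share Capital"])
```

The resulting dict maps each row label to a list of floats, which is the shape most plotting libraries expect.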

BeautifulSoup

Don't forget lxml in Python. It also works well to extract data. It's harder to install but faster. http://pypi.python.org/pypi/lxml/2.2.8

Related

Parsing Python list with PHP

Is there any way to parse a Python list in PHP?
I have data coming from Python stored in MySQL, something like this:
[{u'hello': u'world'}]
I need to use it in a PHP script. The data is almost valid JSON; the only differences are the leading u prefixes and the single quotes.
So I can replace every u' with ' and then every ' with " to turn it into JSON.
But when I replace everything, any ' inside an actual value is replaced by " as well, which breaks the JSON.
I have tried a lot of things, but none of them produced JSON that parses, hence my question: is there any way to parse Python-generated list/JSON-like data in PHP? I don't mind using a third-party library; I just want to get the data parsed.
Thank you

If you have access to Python, you can convert the data to JSON from the command line.
Here's an example.
$ echo "{u'key': u'value'}" |\
python -c "import sys, json, ast; print(json.dumps(ast.literal_eval(sys.stdin.read())))"
{"key": "value"}
Here's a more readable version of the Python one-liner:
import sys, json, ast
data = ast.literal_eval(sys.stdin.read())
print(json.dumps(data))
By using ast.literal_eval instead of regular eval we can evaluate the python dictionary literal and not worry about potential code execution vulnerabilities.
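To see why this is safer than the quote-replacement approach from the question, here is a small sketch with a value that contains an apostrophe. The function name python_literal_to_json and the sample string are made up for illustration.

```python
import ast
import json


def python_literal_to_json(text):
    """Turn the repr() of a Python list/dict into a JSON string."""
    # literal_eval only accepts literals (strings, numbers, lists, dicts, ...),
    # so arbitrary code in the input is rejected instead of executed.
    data = ast.literal_eval(text)
    return json.dumps(data)


# The apostrophe inside the value is what breaks a naive ' -> " replacement.
sample = "[{u'hello': u\"it's a world\"}]"
print(python_literal_to_json(sample))
```

The JSON string this returns can be handed to json_decode on the PHP side.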

how to get html text differences like svn?

I am using the PEAR Text_Diff class to compare text. It works correctly for plain text, but when I try to compare text that contains HTML tags, it gives the wrong result.
Is there any way to compare two HTML blocks so that the result preserves the HTML and shows the differences the way svn does?

In my experience, these two are fantastic:
https://github.com/cygri/htmldiff (Python)
http://www.w3.org/People/Bos/#jpegxmp (C, must be compiled)
Yes, neither program is written in plain PHP. You just run them from PHP:
// Python script:
$html_diff = shell_exec ('python /path/to/htmldiff version1.html version2.html');
// C program:
$html_diff = shell_exec ('/path/to/htmldiff --start-delete="<span class=\'delete\'>" --end-delete="</span>" --start-insert="<span class=\'insert\'>" --end-insert="</span>" version1.html version2.html');
Since they aren't written in PHP, you can enjoy the incredibly high speed :)
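If you just want to see the general idea before installing anything, here is a rough sketch using difflib from Python's standard library. This is not how either htmldiff tool above works internally; the function name html_diff, the tokenizing regex, and the delete/insert class names are assumptions chosen to match the example above.

```python
import difflib
import re

# A token is a whole tag, a run of non-space text, or a run of whitespace.
TOKEN = re.compile(r"<[^>]+>|[^<\s]+|\s+")


def html_diff(old, new):
    """Return `new` merged with `old`, with changes wrapped in <span> markers."""
    a = TOKEN.findall(old)
    b = TOKEN.findall(new)
    out = []
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            out.extend(a[i1:i2])
        if op in ("delete", "replace"):
            out.append('<span class="delete">' + "".join(a[i1:i2]) + "</span>")
        if op in ("insert", "replace"):
            out.append('<span class="insert">' + "".join(b[j1:j2]) + "</span>")
    return "".join(out)


print(html_diff("<p>Hello world</p>", "<p>Hello there</p>"))
```

Because tags are compared as single tokens, a changed word never splits a tag in half. When the changed tokens themselves include tags, though, the wrapping spans can end up badly nested, which is one reason the dedicated tools are the better choice for real documents.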

I'm not sure which OS you're on, but I always use Meld on Ubuntu for diffing non-versioned files. It doesn't have any problems diffing HTML code (or anything else, afaik):
http://meldmerge.org/

You can do this right in JavaScript itself. Check out google-diff-match-patch.
Diff demo here.

Dealing with XML in PHP

I'm currently working on a project that has me dealing with XML a lot. I have to take an XML response, decrypt each text node, and then do various tasks with the data. The problem I'm having is processing each text node of the response. Originally I was using the XMLToArray library, and that worked fine: I would convert the XML into an array, then loop through the array and decrypt the values. However, some of the XML responses I'm dealing with have repeated tags, and the XMLToArray library only returns the last value for them.
Is there a good way to take an XML response, process all the text nodes, and easily put the values into an array that has a similar structure to the response?
Thanks in advance.

I would use SimpleXML.
Here's a small example of using it. It loads and parses XML from http://www.w3schools.com/xml/plant_catalog.xml and then outputs values of "COMMON" and "PRICE" tags of each "PLANT" tag.
$xml = simplexml_load_file('http://www.w3schools.com/xml/plant_catalog.xml');
foreach ( $xml->PLANT as $plantNode ) {
    echo $plantNode->COMMON, ' - ', $plantNode->PRICE, "\n";
}
If you have any problems with adapting it to your needs, just give an example of your XML so that we can help with it.

All those XML-to-array libraries are a relic of the days when PHP 4 forced you to write your own XML parser almost from scratch. Recent PHP versions come with a good set of XML libraries that do the hard work. I particularly recommend SimpleXML (for small files) and XMLReader (for large files). If you still find them complicated, you can try phpQuery.
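On the repeated-tag problem from the question: the usual fix is to store every child under its tag name in a list, so a second occurrence is appended instead of overwriting the first. Here is a minimal sketch of that idea in Python rather than PHP (the same recursion can be written over SimpleXML children). The function name element_to_dict and the sample XML are made up, and str.upper is only a stand-in for the real decryption step.

```python
import xml.etree.ElementTree as ET


def element_to_dict(elem, transform=lambda text: text):
    """Mirror an element as nested dicts, applying `transform` to leaf text."""
    children = list(elem)
    if not children:
        # Leaf node: this is where each text value gets processed
        return transform((elem.text or "").strip())
    result = {}
    for child in children:
        # setdefault keeps every occurrence of a repeated tag in a list
        result.setdefault(child.tag, []).append(element_to_dict(child, transform))
    return result


xml_text = "<order><item>abc</item><item>def</item><note>ghi</note></order>"
root = ET.fromstring(xml_text)
data = element_to_dict(root, transform=str.upper)  # str.upper stands in for decryption
print(data)
```

Every tag maps to a list, even when it occurs once, so the consuming code never has to check whether it got a single value or several.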

You might want to give SimpleXML a try. Plus, it comes with PHP by default, so you don't need to install anything.

Check out SimpleXML, it may offer a bit more for what you are looking for.

PHP equivalent to Perl format function

Is there an equivalent to Perl's format function in PHP? I have a client with an old-ass Okidata dot-matrix printer, and I need a good way to format receipts and bills for this arcane beast.
I remember easily doing this in Perl with something like:
format BILLFORMAT =
Name: @>>>>>>>>>>>>>>>>>>>>>> Age: @###
$name, $age
.
write;
Any ideas would be much appreciated, banging my head on the wall with this one. O.o
UPDATE: I cannot install Perl in this environment, otherwise I would simply use Perl's format function directly.

You could use printf to do something similar.
http://www.php.net/manual/en/function.printf.php
printf("Name: %21s Age: %3d\n", $name, $age);
Fields are right-aligned by default. If you want the name left-aligned, just add a - to its specifier:
printf("Name: %-21s Age: %3d\n", $name, $age);
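The width and alignment flags behave the same way in any printf-style formatter, so you can try them out quickly elsewhere. Here is a small sketch using Python's % operator; the function name receipt_line and the sample values are made up for illustration.

```python
def receipt_line(name, age):
    # %-21s pads the name to 21 columns, left-aligned; %3d right-aligns the
    # integer in 3 columns. Longer values are padded, never truncated.
    return "Name: %-21s Age: %3d" % (name, age)


print(receipt_line("Alice", 42))
print(receipt_line("Bartholomew", 7))
```

Both lines come out the same length, which is what keeps the columns lined up on a fixed-pitch printer. A name longer than 21 characters would push the rest of the line to the right, so truncate it first if that can happen.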

If you don't mind using a Perl process to control the printer, you could serialize the data in PHP and pass it to a Perl script.
I've had great luck using PHP::Serialization to handle data serialization and sharing between Perl and PHP. You could also use YAML or JSON for this task.

Sounds like a perfect situation to use heredoc.

Template extraction in python/php

Are there existing template extract libraries in either python or php? Perl has Template::Extract, but I haven't been able to find a similar implementation in either python or php.
The only thing close in python that I could find is TemplateMaker (http://code.google.com/p/templatemaker/), but that's not really a template extraction library.

After digging around some more I found a solution to exactly what I was looking for. filippo posted a list of python solutions for screen scraping in this post: Options for HTML scraping? among which is a package called scrapemark ( http://arshaw.com/scrapemark/ ).
Hope this helps anyone else who is looking for the same solution.

TemplateMaker does seem to do what you need, at least according to its documentation. Instead of receiving a template as input, it infers ("learns") it from a few documents. It then has an extract method to pull the data out of other documents that were created with the same template.
The example shows:
# Now that we have a template, let's extract some data.
>>> t.extract('<b>red and green</b>')
('red', 'green')
>>> t.extract('<b>django and stephane</b>')
('django', 'stephane')
# The extract() method is very literal. It doesn't magically trim
# whitespace, nor does it have any knowledge of markup languages such as
# HTML.
>>> t.extract('<b> spacy and <u>underlined</u></b>')
(' spacy ', '<u>underlined</u>')
# The extract() method will raise the NoMatch exception if the data
# doesn't match the template. In this example, the data doesn't have the
# leading and trailing "<b>" tags.
>>> t.extract('this and that')
Traceback (most recent call last):
...
So, to achieve the task you require, I think you should:
1. Give it a few documents rendered from your template. It will have no trouble inferring the template from them.
2. Use the inferred template to extract data from new documents.
Come to think of it, it's even more useful than Perl's Template::Extract, as it doesn't expect you to provide a clean template; it learns it on its own from sample text.
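If you do already have the template and want the opposite direction, closer to what Template::Extract does, the core idea can be sketched with the re module: turn each placeholder into a named group and match the document against the result. This is not TemplateMaker's API or Template::Extract's actual implementation; the {{ name }} placeholder syntax and the function names compile_template and extract are assumptions made for this sketch.

```python
import re

PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def compile_template(template):
    """Build a regex with one named group per {{ name }} placeholder."""
    # re.split keeps the captured names, so literal text sits at even
    # indexes and placeholder names at odd indexes.
    parts = PLACEHOLDER.split(template)
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pattern += re.escape(part)
        else:
            pattern += "(?P<%s>.*?)" % part
    return re.compile("^" + pattern + "$", re.DOTALL)


def extract(template, document):
    """Return {placeholder: value}, or None if the document doesn't match."""
    match = compile_template(template).match(document)
    return match.groupdict() if match else None


template = "<b>{{ first }} and {{ second }}</b>"
print(extract(template, "<b>red and green</b>"))
print(extract(template, "this and that"))
```

Like TemplateMaker's extract, this is very literal: whitespace is not trimmed and markup inside a value is returned as-is. Using the same placeholder name twice is not supported, because a regex cannot define a named group more than once.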

Here is an interesting discussion from Adrian, the author of TemplateMaker: http://www.holovaty.com/writing/templatemaker/
It seems to be a lot like what I would call a wrapper induction library.
If you're looking for something more configurable (and less geared towards scraping), take a look at lxml.html and BeautifulSoup, also for Python.
