How do I extract data from a web page with regexes?

I am writing a cURL script to collect information about some sex offenders. I have developed the script so that it picks up links like the one given below:
http://criminaljustice.state.ny.us/cgi/internet/nsor/... (snipped URL)
Now, when we follow this link, I want to get the information under all the fields on this page, like Offender Id:, last name, etc., into my own variables. I am very weak in regex, which is why I am here. Or is there another way?
Can anybody help me in doing that?

phpQuery is very nice for screen-scraping in PHP. It lets you access the DOM using the same methods jQuery has.

You don't want regexes (see Can you provide some examples of why it is hard to parse XML and HTML with a regex?); look for an HTML parser for PHP instead. See this answer to Can you provide an example of parsing HTML with your favorite parser?
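As a rough sketch of the parser-based approach, the snippet below uses PHP's built-in DOMDocument and DOMXPath. The real page's markup is not shown in the question, so the sample HTML (label and value in adjacent table cells) and the field names are assumptions made purely for illustration.

```php
<?php
// Hypothetical markup: the real page layout is not shown in the question,
// so this sample assumes label/value pairs sit in adjacent table cells.
$html = <<<HTML
<table>
  <tr><td>Offender Id:</td><td>&nbsp;12345</td></tr>
  <tr><td>Last Name:</td><td>&nbsp;DOE</td></tr>
</table>
HTML;

libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

$fields = [];
foreach ($xpath->query('//tr[count(td) = 2]') as $row) {
    $cells = $row->getElementsByTagName('td');
    $label = rtrim(trim($cells->item(0)->textContent), ':');
    // &nbsp; is decoded to U+00A0, which plain trim() does not remove,
    // so strip it together with ordinary whitespace.
    $value = preg_replace('/^[\s\x{A0}]+|[\s\x{A0}]+$/u', '', $cells->item(1)->textContent);
    $fields[$label] = $value;
}
print_r($fields);
```

Each label ends up as an array key, so the values can be copied into your own variables by name. The XPath expression would have to be adapted to whatever structure the real page uses.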

I tend to agree with the previous poster about RegEx not being the right tool for the job. If you just want a quick and dirty expression, here goes:
Offender Id:.*
.*&nbsp;[0-9]*
NOTE:
You must include the newline in this expression. Also note that this is very fragile, as it will break if the source that you are parsing changes much at all.
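For illustration only, here is how that expression might be applied in PHP. The sample string is an assumption about how the label and the number are laid out; a capture group and an explicit line-break token have been added so the digits can be read back.

```php
<?php
// Hypothetical snippet: assumes the label and the number sit on consecutive lines.
$html = "<td>Offender Id:</td>\n<td>&nbsp;12345</td>";

// \R matches the line break explicitly, because "." does not match newlines
// unless the s modifier is set. The parentheses capture the digits.
if (preg_match('/Offender Id:.*\R.*&nbsp;([0-9]+)/', $html, $m)) {
    $offenderId = $m[1];
    echo $offenderId, PHP_EOL;
}
```

If the site puts the value on the same line, or adds a blank line in between, the pattern stops matching, which is the fragility mentioned above.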

Related

How do you process invalid HTML in PHP?

I've seen this question, which is very nice and informative. However, it doesn't deal with a rather common scenario.
Say I need to scrape a multitude of websites (or even pages in the same domain), but the author of that website didn't care enough for his code, and has some seriously malformed code "that kinda works". I need to take information from that website.
How do I do it in this case? Ideally without going í͞ń̡͢͡s̶̢̛á̢̕͘ń̵͢҉e̶̸̢̛.
Is it possible? Do I have to revert to RegExp?

You need a DOM parser. PHP has one. And then there are some alternatives (and more... just google for them). You can even run the "garbled HTML" through HTML Purifier if you want.
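A minimal sketch of feeding broken markup to PHP's built-in DOM parser follows. The sample string is made up; it has an unquoted attribute, unclosed paragraphs, and no closing body tag, and the underlying libxml parser is expected to repair it while loading.

```php
<?php
// Deliberately broken sample: unquoted attribute, unclosed <p> tags, no </body>.
$html = '<html><body><div class=price><p>First<p>Second</div>';

libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$texts = [];
foreach ($doc->getElementsByTagName('p') as $p) {
    $texts[] = trim($p->textContent);
}
print_r($texts);
```

How the parser repairs a given page depends on how it is broken, so it is worth inspecting the resulting tree for the pages you actually scrape.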

I don't know how you are scraping the site, but working with RegExp will allow you to add many conditions to the scraping code. This may take time, depending on the number of footprints and your RegExp skills.
You may also use Tidy on the site HTML, but this will lead to strange results as well IMO.

Does it have to be PHP? Python has a wonderful library called Beautiful Soup ("You didn't write that awful page. You're just trying to get some data out of it"). From my experience I'd recommend it so much that I'd say if you have the option, write a quick Python script to parse your nodes into a clean file that your PHP can pick up.
(I know PHP is in the title and this doesn't directly answer your question. Apologies if you don't have the option of (or dislike) Python; I just wanted to present a good alternative.)

Parsing HTML with PHP to get data for several articles of the same kind

I'm working on a web-site which parses coupon sites and lists those coupons. There are some sites which provide their listings as an XML file - no problem with those. But there are also some sites which do not provide XML. I'm thinking of parsing their sites and get the coupon information from the site content - grabbing that data from HTML with PHP. As an example, you can see the following site:
http://www.biglion.ru/moscow/
I'm working with PHP. So, my question is - is there a relatively easy way to parse HTML and get the data for each coupon listed on that site just like I get while parsing XML?
Thanks for the help.

You can always use a DOM parser, but scraping content from sites is unreliable at best.
If their layout changes ever so slightly, your app could fail. Oh, and in most cases it's also against the site's terms of service to do so.
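To make the DOM-parser suggestion concrete, here is a small sketch using PHP's built-in DOMDocument and DOMXPath. The markup, the "coupon" class name, and the URLs are all hypothetical; the real site's structure would have to be inspected first.

```php
<?php
// Hypothetical markup: one block per coupon, with a heading link inside.
$html = <<<HTML
<div class="coupon"><h2 class="title"><a href="/deal/1">Spa day</a></h2></div>
<div class="coupon"><h2 class="title"><a href="/deal/2">Pizza for two</a></h2></div>
HTML;

libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

$coupons = [];
// The concat() trick matches a class token even when the attribute lists several classes.
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " coupon ")]//a[@href]';
foreach ($xpath->query($query) as $link) {
    $coupons[] = [
        'url'   => $link->getAttribute('href'),
        'title' => trim($link->textContent),
    ];
}
print_r($coupons);
```

The result is one array per coupon, much like the records you would get from iterating over an XML feed.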

While using a DOM parser might seem a good idea, I usually prefer good old regular expressions for scraping. It's much less work, and if the site changes its layout you're screwed anyway, whatever your approach is. But with a smart enough regex, your code should be immune to changes that do not directly impact the part you're interested in.
One thing to remember is to include some class names in regex when they're provided, but to assume anything can be between the info you need. E.g.
preg_match_all('#class="actionsItemHeadding".*?<a[^>]*href="([^"]*)"[^>]*>(.*?)</a>#s', file_get_contents('http://www.biglion.ru/moscow/'), $matches, PREG_SET_ORDER);
print_r($matches);

The most reliable method is the PHP DOM Parser, if you prefer working with PHP.
Here is an example of parsing only the <a> elements.
// Include the library
include('simple_html_dom.php');
// Retrieve the DOM from a given URL
$html = file_get_html('http://mypage.com/');
// Find all "A" tags and print their HREFs
foreach ($html->find('a') as $e) {
    echo $e->href . '<br>';
}
I am providing some more information about parsing the other html elements too.
I hope that will be useful to you.

How to know if the website being scraped has changed?

I'm using PHP to scrape a website and collect some data. It's all done without using regex. I'm using php's explode() method to find particular HTML tags instead.
It is possible that if the structure of the website changes (CSS, HTML), then wrong data may be collected by the scraper. So the question is - how do I know if the HTML structure has changed? How to identify this before storing any data to my database to avoid wrong data being stored.

I think you don't have any clean solutions if you are scraping a page where content changes.
I have developed several Python scrapers and I know how frustrating it can be when a site makes even a subtle change to its layout.
You could try a solution a la mechanize (don't know the php counterpart) and if you are lucky you could isolate the content you need to extract (links?).
Another possible approach would be to code some constraints and check them before storing anything in the database.
For example, if you are scraping URLs, you will need to verify that what the scraper has parsed is formally a valid URL; the same goes for an integer ID, or whatever else you scrape that can be recognized as valid.
If you are scraping plain text, it will be more difficult to check.
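A minimal sketch of such constraint checks, using PHP's filter_var(). The record layout (an 'id' and a 'url' key) and the function name are assumptions for illustration, not part of any particular scraper.

```php
<?php
// Hypothetical record, as a scraper might produce it.
$record = ['id' => '12345', 'url' => 'http://example.com/item/12345'];

function isValidRecord(array $record): bool
{
    // filter_var() returns false when the value fails validation.
    $idOk = filter_var(
        $record['id'] ?? null,
        FILTER_VALIDATE_INT,
        ['options' => ['min_range' => 1]]
    ) !== false;
    $urlOk = filter_var($record['url'] ?? null, FILTER_VALIDATE_URL) !== false;

    return $idOk && $urlOk;
}

var_dump(isValidRecord($record));
var_dump(isValidRecord(['id' => 'Login', 'url' => '/broken']));
```

A record that fails the check can be logged and skipped instead of stored, which also serves as an early warning that the page layout may have changed.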

It depends on the site, but you could count the page elements in the scraped page (div tags, class and style attributes, and so on) and then compare these totals against those of later scrapes to detect whether the page structure has changed.
A similar process could be used for the CSS file: the names of each class or id could be extracted using a simple regex, stored, and checked as needed. If this list has new additions, then the page structure has almost certainly changed somewhere on the site being scraped.
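The element-counting idea could be sketched roughly as follows, using PHP's built-in DOMDocument. The function name and the sample pages are made up for illustration.

```php
<?php
// Count how often each tag name occurs in a page.
function tagCounts(string $html): array
{
    libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $counts = [];
    foreach ($doc->getElementsByTagName('*') as $el) {
        $counts[$el->nodeName] = ($counts[$el->nodeName] ?? 0) + 1;
    }
    ksort($counts);   // stable key order so two results can be compared directly
    return $counts;
}

// Hypothetical snapshots of the same page taken at different times.
$before     = '<html><body><div><p>Price: 10</p></div></body></html>';
$sameLayout = '<html><body><div><p>Price: 12</p></div></body></html>';
$newLayout  = '<html><body><ul><li>Price: 12</li></ul></body></html>';

var_dump(tagCounts($before) === tagCounts($sameLayout)); // only the content changed
var_dump(tagCounts($before) === tagCounts($newLayout));  // the tags changed
```

On real pages the counts vary with the number of list items or comments, so in practice you would probably compare only the tags around the data you extract, or allow some tolerance.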

Speaking out of my ass here, but it's possible you might want to look at some Document Object Model PHP methods.
http://php.net/manual/en/book.dom.php
If my very, very limited understanding of DOM is correct, a change in HTML site structure would change the Document Object Model, but a simple content change within a fixed structure wouldn't. So, if you could capture the DOM state, and then compare it at each scrape, couldn't you in theory determine that such a change has been made?
(By the way, the way I did this when I was trying to get an email notification when the bar exam results were posted on a particular page was just compare file_get_contents() values. Surprisingly, worked flawlessly: No false positives, and emailed me as soon as the site posted the content.)

If you want to know changes with respect to structure, I think the best way is to store the DOM structure of your first page and then compare it with new one.
There are a lot of ways you can do it:
SAX parser
DOM parser, etc.
I have a small blog which will give some pointers to what I mean
http://let-them-c.blogspot.com/2009/04/xml-as-objects-in-oops.html
Or you can use http://en.wikipedia.org/wiki/Simple_API_for_XML or a DOM utility parser.

First, in some cases you may want to compare hashes of the original to the new html. MD5 and SHA1 are two popular hashes. This may or may not be valid in all circumstances but is something you should be familiar with. This will tell you if something has changed - content, tags, or anything.
To understand if the structure has changed you would need to capture a histogram of the tag occurrences and then compare those. If you care about tags being out of order then you would have to capture a tree of the tags and do a comparison to see if the tags occur in the same order. This is going to be very specific to what you want to achieve.
PHP Simple HTML DOM Parser is a tool which will help you parse the HTML.
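The two checks described above (a hash for "anything changed" and a tag tree for "the structure changed") could be sketched like this with PHP's built-in DOM classes. The function names and the sample pages are assumptions for illustration.

```php
<?php
// Build a string such as "html(body(div(p)))" that records tag nesting and order only.
function structureSignature(DOMNode $node): string
{
    $children = [];
    foreach ($node->childNodes as $child) {
        if ($child->nodeType === XML_ELEMENT_NODE) {
            $children[] = structureSignature($child);
        }
    }
    $name = $node->nodeName;
    return $children ? $name . '(' . implode(',', $children) . ')' : $name;
}

function pageSignature(string $html): string
{
    libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    return structureSignature($doc->documentElement);
}

// Hypothetical snapshots: same structure, different content.
$old = '<html><body><div><p>Price: 10</p></div></body></html>';
$new = '<html><body><div><p>Price: 12</p></div></body></html>';

var_dump(md5($old) === md5($new));                     // did anything change at all?
var_dump(pageSignature($old) === pageSignature($new)); // did the tag tree stay the same?
```

Storing the signature (or its hash) next to each scrape makes it cheap to check the next run before writing anything to the database.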

Explode() is not an HTML parser, but you want to know about changes in the HTML structure. That's going to be tricky. Try using an HTML parser. Nothing else will be able to do this properly.

Text Parser with PHP, like Instapaper

I'm trying to write a text parser with PHP, like Instapaper did. What I want to do is; get a webpage and parse it in text-only mode.
It's simple to get the webpage with cURL and strip the HTML tags. But every webpage has some common areas, like a header, navigation, sidebar, footer, banners, etc. I only want to get the article in text mode and exclude all the other parts. It's also simple to exclude those parts if I know the "id" or "class" info. But I'm trying to automate this process and apply it to any page, like Instapaper does.
I get all the content between the <body> tags, but I don't know how to exclude the header, sidebar, or footer and get only the main article body. I have to develop some logic to get only the main article part.
It's not important for me to find the exact code. It would also be useful to understand how to exclude unnecessary parts as I can try to write my own code with PHP. It would also be useful if there any examples in other languages.
Thanks for helping.

You might try looking at the algorithms behind this bookmarklet, Readability. It has a decent success rate for extracting content from among all the web page rubbish.
A friend of mine made it, which is why I'm recommending it: I know it works, and I'm aware of the many techniques he's using to parse the data. You could apply these techniques to what you're asking.
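As a very rough illustration of the kind of heuristic such tools build on, the sketch below scores each container by how much paragraph text it holds and keeps the winner. This is not Readability's actual algorithm, only a simplified assumption of one of its ideas, and the sample page is made up.

```php
<?php
// Hypothetical page with navigation, an article, and a footer.
$html = <<<HTML
<html><body>
<div id="nav"><a href="/">Home</a> <a href="/about">About</a></div>
<div id="content">
  <p>The first paragraph of the article is reasonably long and full of text.</p>
  <p>The second paragraph continues the story with even more words in it.</p>
</div>
<div id="footer"><p>Copyright notice</p></div>
</body></html>
HTML;

libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Score each container by the amount of paragraph text it directly holds.
$scores = [];
$nodes  = [];
foreach ($doc->getElementsByTagName('p') as $p) {
    $parent = $p->parentNode;
    $key    = $parent->getNodePath();
    $scores[$key] = ($scores[$key] ?? 0) + strlen(trim($p->textContent));
    $nodes[$key]  = $parent;
}
arsort($scores);                                 // highest score first
$best = $nodes[array_key_first($scores)];

// Join the paragraphs of the winning container into plain text.
$paragraphs = [];
foreach ($best->getElementsByTagName('p') as $p) {
    $paragraphs[] = trim($p->textContent);
}
$article = implode("\n\n", $paragraphs);
echo $article, PHP_EOL;
```

Real extractors add many refinements, such as penalizing link-heavy blocks and looking at class names, so treat this only as a starting point.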

You can take a look at the source of Goose; it already does a lot of this, like Instapaper's text extraction.
https://github.com/jiminoc/goose/wiki

Have a look at the ExtractContent code from Shuyo Nakatani.
See original Ruby source http://rubyforge.org/projects/extractcontent/ or a port of it to Perl http://metacpan.org/pod/HTML::ExtractContent

You really should consider using a HTML parser for this. Gather similar pages and compare the DOM trees to find the differing nodes.
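One way to read that suggestion, sketched with PHP's built-in DOM classes: collect the text found at each node path in two pages built from the same template, and keep only the paths whose text differs. The template, the function name, and the page contents are hypothetical.

```php
<?php
// Map the path of each text-bearing element to the text it contains.
function textByPath(string $html): array
{
    libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $map   = [];
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//text()[normalize-space()]') as $text) {
        $path = $text->parentNode->getNodePath();
        $map[$path] = trim(($map[$path] ?? '') . ' ' . trim($text->nodeValue));
    }
    return $map;
}

// Two hypothetical pages that share navigation and footer.
$template = '<html><body><div id="nav">Home | About</div>'
          . '<div id="main"><p>%s</p></div>'
          . '<div id="footer">Copyright</div></body></html>';
$pageA = sprintf($template, 'First article body.');
$pageB = sprintf($template, 'Second article body.');

// Entries whose path or text does not appear identically in the other page.
$unique = array_diff_assoc(textByPath($pageA), textByPath($pageB));
print_r(array_values($unique));
```

Text that is identical at the same path in both pages (navigation, footer) drops out, leaving the part that is specific to each article. Real pages rarely line up this neatly, so more than two samples and some fuzzier matching would likely be needed.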

This article provides a comparison of different approaches. The Java library boilerpipe was rated highly. At the boilerpipe site you can find the author's scientific paper, which compares it to other algorithms.
Not all algorithms suit all purposes. The biggest application of such tools is just to get the raw text to index for a search engine, the idea being that you don't want search results to be messed up by adverts. Such extractions can be destructive, meaning that they won't give you "the best reading area", which is what people want from Instapaper or Readability.

PHP regex to get contents of a specific span element

I need some help ... I'm a bit (read total) n00b when it comes to regular expressions, and need some help writing one to find a specific piece of text contained within a specific HTML tag from PHP.
The source string looks like this:
<span lang="en">English Content</span><span lang="fr">French content</span> ... etc ...
I'd like to extract just the text of the element for a specific language.
Can anyone help?

There are plenty of HTML parsers available for PHP. I suggest you check out one of those, (for example: PHP Simple HTML DOM Parser).
Shooting yourself in the foot with trying to read HTML with regex is a lot easier than you think, and a lot harder to avoid than you wish (especially when you don't know regex thoroughly, and your input is not guaranteed to be 100% clean HTML).
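For comparison, a parser-based sketch using PHP's built-in DOMDocument and DOMXPath rather than a third-party library. The helper name textForLanguage is made up for this example.

```php
<?php
$html = '<span lang="en">English Content</span><span lang="fr">French content</span>';

libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Return the text of the first span with the given lang attribute, or null.
function textForLanguage(DOMDocument $doc, string $lang): ?string
{
    $xpath = new DOMXPath($doc);
    // $lang is interpolated into the query, so it should come from a trusted list.
    $node = $xpath->query('//span[@lang="' . $lang . '"]')->item(0);
    return $node ? $node->textContent : null;
}

echo textForLanguage($doc, 'en'), PHP_EOL;
```

Because the lookup is by attribute, the order of the spans and any extra attributes on them do not matter.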

(Bad, not working) example which shows why you should not use regex for parsing html.
/<span lang="en">(.*)<\/span>/
Will output:
English Content</span><span lang="fr">French content
More stuff to read:
Parsing: Beyond Regex
For-the-2,295,485th-time-DO-NOT-PARSE-HTML-WITH-REGULAR-EXPRESSIONS
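The failure in the example above comes from the greedy quantifier. The sketch below contrasts it with a lazy one; note that the lazy version only fixes this particular input and still breaks on nested spans, reordered attributes, or different quoting.

```php
<?php
$html = '<span lang="en">English Content</span><span lang="fr">French content</span>';

// Greedy: (.*) runs to the last </span> in the string.
preg_match('/<span lang="en">(.*)<\/span>/', $html, $greedy);

// Lazy: (.*?) stops at the first </span>.
preg_match('/<span lang="en">(.*?)<\/span>/', $html, $lazy);

echo $greedy[1], PHP_EOL;
echo $lazy[1], PHP_EOL;
```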

There's this most awesome class that lets you do SQL-like queries on HTML pages. It might be worth a look:
HTML SQL
I've used it a bunch and I love it.
Hope that helps...
