Read external HTML page and then find data within - php

I'm playing around with an idea, and I'm stuck at this one part. I want to read an external HTML page and then extract the data held within two <dd> tags. I've been using file_get_contents with good results, but I'm at a loss as to how to accomplish that last part. The two tags I want to extract the value from are always enclosed within a particular <div>, was wondering if that might help?
In my mind it reads the entire html file into a string, then dumps all the data up until this one particular <div>, and dumps all the data after the closing </div>. Is that possible? I think this needs regex syntax which I've never used yet. So any tips, links, or examples would be great! I can provide more info as necessary.

Maybe this could help:
http://simplehtmldom.sourceforge.net/

You are complicating way too much. Simply load the page content and then search for the proper regex (preg_match()). This will do fine
preg_match('~<tag id="foobar">(?P<content>.*?)</endtag>~is', $input, $matches);

If you use HTQL COM to query the page, the query is: <dd>1:tx

Related

Find a match, then grab some html before and after it

In php, I am scraping some html from one of my other external sites. I'm performing the scrape and getting all of the page html in a php string. I need to find the first .png file type in this string. I then need to just grab the html from this point to find the beginning http before it AND grab the html after it just before the following characters begin "\u002522". Any ideas?
So:
<html><head><title>Hello</title></head><body><p>Here's a nice image</p><img src="http://www.exampleurl.com/image.png?id=35435646&v=5647\\u002522"/></body></html>
Would turn into:
http://www.exampleurl.com/image.png?id=35435646&v=5647
I've looked everywhere for combining all these things at the same time, but to no luck :(
I have used this before and it worked great for me. How to extract img src, title and alt from html using php?
Then just clean up the URL and split on //.
Let me know if I need to be more specific.

Fetching images faster from a web page

I'm looking for a plugin or a simple code that fetches images from a link FASTER. I have been using http://simplehtmldom.sourceforge.net/ to extract first 3 images from a given link.
simplehtmldom is quite slow and many users on my site are reporting it as an issue.
Correct me if I'm wrong , I believe this plugin is taking lot of time to fetch complete html code from the url I pass and then it searches for img tags.
Someone please suggest me a technique to improvise the speed of fetching html code or an alternate plugin that i can try ?
What I'm thinking is something like fetching html code until it finds first three img tags and then kill the code fetching process ? So that things get faster.
I'm not sure if it's possible with php although, I'm trying hard to design that using jquery.
Thanks for your help !
Cross-site scripting rules will prevent you from doing something like this in jQuery/JS (unless you control all the domains that you'll be grabbing content from). What you're doing is not going to be super fast in any case, but try writing your own using file_get_content() paired with DOMDocument... the DOMDocument getElementsByTagName method may be faster than simplehtmldom's find() method.
You could also try a regex approach. It won't be as fool-proof as a true DOM parser, but it will probably be faster... Something like:
$html = file_get_contents($url);
preg_match_all("/<img[^']*?src=\"([^']*?)\"[^']*?>/", $html, $arr, PREG_PATTERN_ORDER);
If you want to avoid reading whole large files, you can also skip the file_get_contents() call and sub in a fopen(); while(feof()) loop and just check for images after each line is read from the remote server. If you take this approach, however, make sure you're regexing the WHOLE buffered string, not just the most recent line, as you could easily have the code for an image broke across several lines.
Keep in mind that real-life variability in HTML will make regex an imperfect solution at best, but if speed is a major concern it might be your best option.

How to "read" a HTML document in PHP?

I'm facing a problem for a quite long time. Unfortunately I was not able to find the solution by my own, so I have to post my question here.
I am writting a little php script that creates a PDF file from a dynamically created HTML file.
Now I want to "parse" the html file and do a action in addiction to which tag is next in HTML.
E.g.
<div><p>Test</p></div>
My script should recognize:
First tag is a div: do function for div
Second tag is a p: do function for p
I don't know for what I should search. Regular expressions? HTML parser?
Thanks for a hint!
Try an XML parser. In PHP the SimpleXML is probably what you are looking for.
I've used several times phpQuery. That's a nice solution, although it's quite big and seems that is no longer supported (last commit > 10 months).
What you need to do is read the HTML file into a PHP variable/object
http://www.php-mysql-tutorial.com/wikis/php-tutorial/read-html-files-using-php.aspx
And then use RegEx to parse the HTML Tags and Attributes
http://www.codeproject.com/Articles/297056/Most-Important-Regular-Expression-for-parsing-HTML

edit html page and redisplay PHP

So I have been using a method to retrieve images from a website but I thought it may be easier to simply show the page without some details I don't want displayed. The website in paticular know we are doing this so there shouldn't be any legal complications. So would it be possible to open the html page within PHP, search for a specific that would be the same in each page, remove it and then redisplay the page within the browser with its new edits?
You can use the Tidy or HTML Purifier libraries to clean up and navigate the document tree, find the elements you are looking for, and remove them. I can't find comprehensive docs for Tidy, but the examples on php.net should be enough to help you get started.
Yes this is possible, you'd need to use file_get_contents("http://url"); to load the page into a string, then preg_replace with a regex to clean the string.

How to know if the website being scraped has changed?

I'm using PHP to scrape a website and collect some data. It's all done without using regex. I'm using php's explode() method to find particular HTML tags instead.
It is possible that if the structure of the website changes (CSS, HTML), then wrong data may be collected by the scraper. So the question is - how do I know if the HTML structure has changed? How to identify this before storing any data to my database to avoid wrong data being stored.
I think you don't have any clean solutions if you are scraping a page where content changes.
I have developed several python scrapers and I know how can be frustrating when site just makes a subtle change on its layout.
You could try a solution a la mechanize (don't know the php counterpart) and if you are lucky you could isolate the content you need to extract (links?).
Another possibile approach would be to code some constraints and check them before store to db.
For example, if you are scraping Urls, you will need to verify that what scraper has parsed is formally a valid Url; same for integer ID or whatever you want to scrape that can be recognized as valid.
If you are scraping plain text, it will be more difficult to check.
Depends on the site but you could count the number of page elements in the scraped page like div, class & style tags then by comparing these totals against those of later scrapes detect if the page structure has been changed.
A similiar process could be used for the CSS file where the names of each each class or id could be extracted using simple regex, stored and checked as needed. If this list has new additions then the page structure has almost certainly changed somewhere on the site being scraped.
Speaking out of my ass here, but its possible you might want to look at some Document Object Model PHP methods.
http://php.net/manual/en/book.dom.php
If my very, very limited understanding of DOM is correct, a change in HTML site structure would change the Document Object Model, but a simple content change within a fixed structure wouldn't. So, if you could capture the DOM state, and then compare it at each scrape, couldn't you in theory determine that such a change has been made?
(By the way, the way I did this when I was trying to get an email notification when the bar exam results were posted on a particular page was just compare file_get_contents() values. Surprisingly, worked flawlessly: No false positives, and emailed me as soon as the site posted the content.)
If you want to know changes with respect to structure, I think the best way is to store the DOM structure of your first page and then compare it with new one.
There are lot of way you can do it:-
SaxParser
DOmParser etc
I have a small blog which will give some pointers to what I mean
http://let-them-c.blogspot.com/2009/04/xml-as-objects-in-oops.html
or you can use http://en.wikipedia.org/wiki/Simple_API_for_XML or DOm Utility parser.
First, in some cases you may want to compare hashes of the original to the new html. MD5 and SHA1 are two popular hashes. This may or may not be valid in all circumstances but is something you should be familiar with. This will tell you if something has changed - content, tags, or anything.
To understand if the structure has changed you would need to capture a histogram of the tag occurrences and then compare those. If you care about tags being out of order then you would have to capture a tree of the tags and do a comparison to see if the tags occur in the same order. This is going to be very specific to what you want to achieve.
PHP Simple HTML DOM Parser is a tool which will help you parse the HTML.
Explode() is not an HTML parser, but you want to know about changes in the HTML structure. That's going to be tricky. Try using an HTML parser. Nothing else will be able to do this properly.

Categories