I was wondering if there's a way to use PHP (or any other server-side, or even client-side if possible, language) to obtain certain pieces of information from a different website (NOT a local file like an include 'nav.php').
What I mean is... say I have a blog at www.blog.com and another website at www.mysite.com
Is there a way to gather ALL of the h2 links from www.blog.com and put them in a div in www.mysite.com?
Also, is there a way I could grab the entire information inside a DIV (with an ID, of course) from blog.com and insert it in mysite.com?
Thanks,
Amit
First of all, if you want to retrieve content from a blog, check whether the blog platform (i.e. Blogger, WordPress) has an API, thanks to which you won't have to reinvent the wheel. Usually, good APIs come with good documentation (meaning that probably only 5% of all APIs are good APIs), and that documentation should come with code examples for popular languages such as PHP, JavaScript, Java, etc. Once again, if it is to retrieve content from a blog, there should be tons of frameworks out there for you.
Check out the PHP Simple HTML DOM library
Can be as easy as:
// Create DOM from URL or file
$html = file_get_html('http://www.otherwebsite.com/');
// Find all h2 headings and print the URL of each link inside them
foreach($html->find('h2 a') as $element)
    echo $element->href . '<br>';
This can be done by opening the remote website as a file, then taking the HTML and using the DOM parser to manipulate it.
$site_html = file_get_contents('http://www.example.com/');
$document = new DOMDocument();
$document->loadHTML($site_html);
$all_of_the_h2_tags = $document->getElementsByTagName('h2');
Read more about PHP's DOM functions for what to do from here, such as grabbing other tags, creating new HTML out of bits and pieces of the DOM, and displaying that on your own site.
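For example, to grab everything inside a DIV with a given ID (the second half of the original question), a minimal sketch could look like this; the URL and the 'some-div' ID are placeholders:
// Grab one DIV (by ID) out of the remote page; 'some-div' is a placeholder ID
$site_html = file_get_contents('http://www.blog.com/');
libxml_use_internal_errors(true); // real-world HTML often triggers parser warnings
$document = new DOMDocument();
$document->loadHTML($site_html);
$div = $document->getElementById('some-div');
if ($div !== null) {
    // saveHTML() with a node argument returns the markup of just that node
    echo $document->saveHTML($div);
}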
Your first step would be to use cURL to make a request to the other site and bring down the HTML from the page you want to access. Then comes the part of parsing the HTML to find the content you're looking for. You could use a bunch of regular expressions and probably get the job done, but the Stack Overflow crowd might frown at you. You could also take the resulting HTML and use the DOMDocument object with loadHTML() to parse the HTML and pull out the content you want.
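A minimal sketch of that cURL request could look like this (the URL is only a placeholder):
// Fetch the remote page with cURL
$ch = curl_init('http://www.blog.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the HTML instead of echoing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
$html = curl_exec($ch);
curl_close($ch);
// $html can now be handed to DOMDocument::loadHTML() as shown above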
Also, if you control both sites, you can set up a special page on the first site (www.blog.com) with exactly the information you need, properly formatted either in HTML you can output directly, or XML that you can manipulate more easily from www.mysite.com.
I'm looking for a way to make a small preview of another page from a URL given by the user in PHP.
I'd like to retrieve only the title of the page, an image (like the logo of the website) and a bit of text or a description if it's available. Is there any simple way to do this without any external libraries/classes? Thanks
So far I've tried using the DOMDocument class, loading the HTML and displaying it on the screen, but I don't think that's the proper way to do it
I recommend you consider simple_html_dom for this. It will make it very easy.
Here is a working example of how to pull the title, and first image.
<?php
require 'simple_html_dom.php';
$html = file_get_html('http://www.google.com/');
$title = $html->find('title', 0);
$image = $html->find('img', 0);
echo $title->plaintext."<br>\n";
echo $image->src;
?>
Here is a second example that will do the same without an external library. I should note that using regex on HTML is NOT a good idea.
<?php
$data = file_get_contents('http://www.google.com/');
preg_match('/<title>([^<]+)<\/title>/i', $data, $matches);
$title = $matches[1];
preg_match('/<img[^>]*src=[\'"]([^\'"]+)[\'"][^>]*>/i', $data, $matches);
$img = $matches[1];
echo $title."<br>\n";
echo $img;
?>
You may use any of these libraries. As you know, each one has its pros & cons, so you can consult the notes about each one or take the time to try them on your own:
Guzzle: An independent HTTP client, so there is no need to depend on cURL, SOAP, or REST (see the sketch after this list).
Goutte: Built on Guzzle & some of Symfony components by Symfony developer.
hQuery: A fast scraper with caching capabilities; high performance when scraping large docs.
Requests: Famous for its user-friendly usage.
Buzz: A lightweight client, ideal for beginners.
ReactPHP: Async scraper, with comprehensive tutorials & examples.
You'd better check them all and use each one where it fits best.
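As a rough idea, fetching a page with Guzzle (assuming it was installed via Composer as guzzlehttp/guzzle) could look something like this sketch:
<?php
// Fetch a page with Guzzle; assumes "composer require guzzlehttp/guzzle" was run
require 'vendor/autoload.php';
$client = new GuzzleHttp\Client();
$response = $client->get('http://www.example.com/');
$html = (string) $response->getBody(); // the raw HTML, ready for whatever parser you prefer
?>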
This question is fairly old but still ranks very highly on Google Search results for web scraping tools in PHP. Web scraping in PHP has advanced considerably in the intervening years since the question was asked. I actively maintain the Ultimate Web Scraper Toolkit, which hasn't been mentioned yet but predates many of the other tools listed here except for Simple HTML DOM.
The toolkit includes TagFilter, which I actually prefer over other parsing options because it uses a state engine to process HTML with a continuous streaming tokenizer for precise data extraction.
To answer the original question, "Is there any simple way to do this without any external libraries/classes?": the answer is no. HTML is rather complex and there's nothing built into PHP that's particularly suitable for the task. You really need a reusable library to parse generic HTML correctly and consistently. Plus you'll find plenty of uses for such a library.
Also, a really good web scraper toolkit will have three major, highly-polished components/capabilities:
Data retrieval. This is making an HTTP(S) request to a server and pulling down data. A good web scraping library will also allow for large binary data blobs to be written directly to disk as they come down off the network instead of loading the whole thing into RAM. The ability to do dynamic form extraction and submission is also very handy. A really good library will let you fine-tune every aspect of each request to each server as well as look at the raw data it sent and received on the wire. Some web servers are extremely picky about input, so being able to accurately replicate a browser is handy.
Data extraction. This is finding pieces of content inside retrieved HTML and pulling it out, usually to store it into a database for future lookups. A good web scraping library will also be able to correctly parse any semi-valid HTML thrown at it, including Microsoft Word HTML and ASP.NET output where odd things show up like a single HTML tag that spans several lines. The ability to easily extract all the data from poorly designed, complex, classless tags like ASP.NET HTML table elements that some overpaid government employees made is also very nice to have (i.e. the extraction tool has more than just a DOM or CSS3-style selection engine available). Also, in your case, the ability to early-terminate both the data retrieval and data extraction after reading in 50KB or as soon as you find what you are looking for is a plus, which could be useful if someone submits a URL to a 500MB file.
Data manipulation. This is the inverse of #2. A really good library will be able to modify the input HTML document several times without negatively impacting performance. When would you want to do this? Sanitizing user-submitted HTML, transforming content for a newsletter or sending other email, downloading content for offline viewing, or preparing content for transport to another service that's finicky about input (e.g. sending to Apple News or Amazon Alexa). The ability to create a custom HTML-style template language is also a nice bonus.
Obviously, Ultimate Web Scraper Toolkit does all of the above... and more.
I also like my toolkit because it comes with a WebSocket client class, which makes scraping WebSocket content easier. I've had to do that a couple of times.
It was also relatively simple to turn the clients on their heads and make WebServer and WebSocketServer classes. You know you've got a good library when you can turn the client into a server... but then I went and made PHP App Server with those classes. I think it's becoming a monster!
You can use SimpleHtmlDom for this, and then look for the title and img tags or whatever else you need.
I like the Dom Crawler library. Very easy to use, has lots of options like:
use Symfony\Component\DomCrawler\Crawler;
$crawler = new Crawler($html); // $html holds the page markup you fetched
$crawler = $crawler
    ->filter('body > p')
    ->reduce(function (Crawler $node, $i) {
        // filters every other node
        return ($i % 2) == 0;
    });
I'm working on a website which parses coupon sites and lists those coupons. There are some sites which provide their listings as an XML file - no problem with those. But there are also some sites which do not provide XML. I'm thinking of parsing their sites and getting the coupon information from the site content - grabbing that data from the HTML with PHP. As an example, you can see the following site:
http://www.biglion.ru/moscow/
I'm working with PHP. So, my question is - is there a relatively easy way to parse HTML and get the data for each coupon listed on that site just like I get while parsing XML?
Thanks for the help.
You can always use a DOM parser, but scraping content from sites is unreliable at best.
If their layout changes ever so slightly, your app could fail. Oh, and in most cases it's also against most sites' terms of service to do so.
While using a DOM parser might seem a good idea, I usually prefer good old regular expressions for scraping. It's much less work, and if the site changes its layout you're screwed anyway, whatever your approach is. But if you use a smart enough regex, your code should be immune to changes that do not directly impact the part you're interested in.
One thing to remember is to include some class names in the regex when they're provided, but to assume anything can appear between the pieces of info you need. E.g.:
preg_match_all('#class="actionsItemHeadding".*?<a[^>]*href="([^"]*)"[^>]*>(.*?)</a>#s', file_get_contents('http://www.biglion.ru/moscow/'), $matches, PREG_SET_ORDER);
print_r($matches);
The most reliable method is the PHP DOM parser if you prefer working with PHP.
Here is an example of parsing only the anchor (a) elements.
// Include the library
include('simple_html_dom.php');
// Retrieve the DOM from a given URL
$html = file_get_html('http://mypage.com/');
// Find all "A" tags and print their HREFs
foreach($html->find('a') as $e)
echo $e->href . '<br>';
The same approach can be used for parsing the other HTML elements too.
I hope that will be useful to you.
I want to dynamically upload content from a soccer live-score website to my database.
I also want to do this daily, from a single page on that website (the soccer matches for that day).
If you can help me only with the connection and retrieval of data from that webpage, I will manage the rest.
website: http://soccerstand.com/
language: php/java - mysql
Thank you!
You can use PHP's file() function to get the data. You just pass it a URL and it returns the content as an array of lines. You can also use file_get_contents() to get the content as one big string.
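For example, a minimal sketch using those two built-in functions (the URL is the one from the question):
// Get the page as an array of lines, or as one big string
$lines = file('http://soccerstand.com/');
$page  = file_get_contents('http://soccerstand.com/');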
Ethical questions about scraping other site's data aside:
With PHP you can do an "open" call on a website as long as you're set up correctly. See this page for more details on that and examples: http://www.php.net/manual/en/wrappers.http.php
From there you have the content of the web page, and it's a matter of breaking it up. Off the top of my head, I'd use regular expressions or an HTML parser to break apart the HTML, then loop through the elements and feed the data into your database calls to save it.
There are a lot of resources for parsing HTML on the web and it's simply a matter of choosing the one that will work best for you.
Keep in mind you'll need to monitor the site for changes, because if they change elements, or their classes/ids you might need to change your parsing structure as well.
Using cURL you can get the content of the page, and then using regex you can extract what you want.
There is an easy way: http://www.jonasjohn.de/lab/htmlsql.htm
I'm using PHP to scrape a website and collect some data. It's all done without using regex. I'm using PHP's explode() function to find particular HTML tags instead.
It is possible that if the structure of the website changes (CSS, HTML), then wrong data may be collected by the scraper. So the question is - how do I know if the HTML structure has changed? How can I identify this before storing any data in my database, to avoid storing wrong data?
I don't think you have any clean solution if you are scraping a page whose content can change.
I have developed several Python scrapers and I know how frustrating it can be when a site makes just a subtle change to its layout.
You could try a solution à la mechanize (I don't know the PHP counterpart), and if you are lucky you could isolate the content you need to extract (links?).
Another possible approach would be to code some constraints and check them before storing to the DB.
For example, if you are scraping URLs, you will need to verify that what the scraper has parsed is formally a valid URL; the same goes for an integer ID or anything else you scrape that can be recognized as valid.
If you are scraping plain text, it will be more difficult to check.
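As an illustration of the constraint idea, a minimal sketch for a URL check (the scraped value here is hypothetical):
// Validate a scraped value before storing it; $scrapedUrl is a hypothetical example value
$scrapedUrl = 'http://www.example.com/item/123';
if (filter_var($scrapedUrl, FILTER_VALIDATE_URL) === false) {
    // The structure probably changed: what we extracted is not a URL, so don't store it
    error_log('Scraper returned an invalid URL; skipping insert.');
}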
It depends on the site, but you could count the number of page elements in the scraped page, like div, class and style tags, and then by comparing these totals against those of later scrapes detect whether the page structure has been changed.
A similar process could be used for the CSS file, where the names of each class or id could be extracted using a simple regex, stored, and checked as needed. If this list has new additions then the page structure has almost certainly changed somewhere on the site being scraped.
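A minimal sketch of the element-counting idea, assuming $html already holds the scraped page:
// Count a few structural tags and compare the totals against an earlier scrape
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$counts = array();
foreach (array('div', 'table', 'span', 'style') as $tag) {
    $counts[$tag] = $doc->getElementsByTagName($tag)->length;
}
// Store $counts, and on the next scrape compare the new totals against the stored ones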
Speaking out of my ass here, but it's possible you might want to look at some Document Object Model PHP methods.
http://php.net/manual/en/book.dom.php
If my very, very limited understanding of DOM is correct, a change in HTML site structure would change the Document Object Model, but a simple content change within a fixed structure wouldn't. So, if you could capture the DOM state, and then compare it at each scrape, couldn't you in theory determine that such a change has been made?
(By the way, the way I did this when I was trying to get an email notification when the bar exam results were posted on a particular page was just to compare file_get_contents() values. Surprisingly, it worked flawlessly: no false positives, and it emailed me as soon as the site posted the content.)
If you want to know about changes with respect to structure, I think the best way is to store the DOM structure of your first page and then compare it with the new one.
There are a lot of ways you can do it:
SAX parser
DOM parser, etc.
I have a small blog which will give some pointers to what I mean
http://let-them-c.blogspot.com/2009/04/xml-as-objects-in-oops.html
or you can use SAX (http://en.wikipedia.org/wiki/Simple_API_for_XML) or a DOM utility parser.
First, in some cases you may want to compare hashes of the original and the new HTML. MD5 and SHA1 are two popular hashes. This may or may not be valid in all circumstances, but it's something you should be familiar with. This will tell you if something has changed: content, tags, or anything.
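A quick sketch of that hash comparison, assuming $previousHash was stored from an earlier scrape and the URL is a placeholder:
// Compare a hash of the freshly fetched HTML against one stored earlier
$html = file_get_contents('http://www.example.com/');
$currentHash = sha1($html);
if ($currentHash !== $previousHash) { // $previousHash is assumed to come from your database
    // Something changed: content, tags, or anything else on the page
}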
To understand if the structure has changed you would need to capture a histogram of the tag occurrences and then compare those. If you care about tags being out of order then you would have to capture a tree of the tags and do a comparison to see if the tags occur in the same order. This is going to be very specific to what you want to achieve.
PHP Simple HTML DOM Parser is a tool which will help you parse the HTML.
Explode() is not an HTML parser, but you want to know about changes in the HTML structure. That's going to be tricky. Try using an HTML parser. Nothing else will be able to do this properly.
I want to extract specific data from a website, across its pages...
I don't want to get all the contents of a specific page but only some portion (maybe the data inside a table or a content_div), and I want to do it repeatedly across all the pages of the website.
How can I do that?
Use cURL to retrieve the content and XPath to select the individual elements.
Be aware of copyright though.
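A minimal sketch of that cURL + XPath combination; the URL is a placeholder and the content_div ID is taken from the question:
// Fetch the page with cURL, then use XPath to select only the part you want
$ch = curl_init('http://www.example.com/page.html');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
// Grab every table cell inside the div with id="content_div"
foreach ($xpath->query('//div[@id="content_div"]//td') as $cell) {
    echo $cell->textContent . "\n";
}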
"extracting content from other websites" is called screen scraping or web scraping.
Simple HTML DOM parser is the easiest way (that I know of) to do it.
You need a PHP crawler. The key is to use string manipulation functions such as strstr(), strpos() and substr().
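A rough sketch of that string-function approach; the URL and the start/end markers are hypothetical:
// Cut a chunk out of the page with strpos()/substr(); the markers are made up
$html  = file_get_contents('http://www.example.com/');
$start = strpos($html, '<div id="content">'); // hypothetical start marker
$end   = strpos($html, '</div>', $start);     // first closing tag after it
$chunk = substr($html, $start, $end - $start);
echo strip_tags($chunk); // note: this naive approach breaks if the div contains nested divs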
There are ways to do this. Just for fun I created a Windows app that went through my account on a well-known social network, looked into the correct places and logged the information into an XML file. This information would then be imported elsewhere. However, this sort of application can be used for motives I don't agree with, so I never uploaded it.
I would recommend using RSS feeds to extract content.
I think you need to implement something like a spider. You can make an XMLHTTP request to get the content and then parse it.