Scrape using html dom parser

Scrape using html dom parser - php

Is it the right way to scrape other websites' content into my website using simple_html_dom? If it is wrong, please suggest a method for displaying news on my website.

simple_html_dom is a third-party library rather than a PHP extension. If you are looking for something bundled with PHP itself, use DOMDocument.
Basically, by scraping you are taking the site's content. If you do that with the site team's consent, it is fine; otherwise it may breach their terms and conditions or copyright, so check their T&C first. Sites also often have mechanisms to block scraping.
It is better to ask the site team for the content; they may be able to provide the data in a much better and simpler way, such as an API, an RSS feed, or direct database access.

Related

how to get a script tag value with php [duplicate]

I'm looking for a way to make a small preview of another page from a URL given by the user in PHP.
I'd like to retrieve only the title of the page, an image (like the logo of the website) and a bit of text or a description if it's available. Is there any simple way to do this without any external libraries/classes? Thanks
So far I've tried using the DOMDocument class, loading the HTML and displaying it on the screen, but I don't think that's the proper way to do it.

I recommend you consider simple_html_dom for this. It will make it very easy.
Here is a working example of how to pull the title, and first image.
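<?php
require 'simple_html_dom.php';
// file_get_html() returns false if the page cannot be loaded.
$html = file_get_html('http://www.google.com/');
if ($html === false) {
    exit("Could not load the page\n");
}
// find($selector, 0) returns the first match, or null if there is none.
$title = $html->find('title', 0);
$image = $html->find('img', 0);
echo ($title ? $title->plaintext : '(no title)')."<br>\n";
echo $image ? $image->src : '(no image)';
?>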
Here is a second example that will do the same without an external library. I should note that using regex on HTML is NOT a good idea.
<?php
$data = file_get_contents('http://www.google.com/');
// Allow attributes on <title>, and fall back to an empty string when nothing matches.
$title = preg_match('/<title[^>]*>([^<]+)<\/title>/i', $data, $matches) ? $matches[1] : '';
$img = preg_match('/<img[^>]*src=[\'"]([^\'"]+)[\'"][^>]*>/i', $data, $matches) ? $matches[1] : '';
echo $title."<br>\n";
echo $img;
?>
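For comparison, here is a rough sketch of the same preview built with PHP's bundled DOM extension instead of regex. The helper name extract_preview() and the sample markup are made up for illustration; a real page would first be fetched with file_get_contents() or cURL, and the image found this way is simply the first <img>, which may or may not be the logo.

```php
<?php
// Minimal sketch: build a page preview with PHP's bundled DOM extension.
// extract_preview() is a hypothetical helper name, not part of any library.
function extract_preview(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Each query may match nothing, in which case item(0) is null.
    $titleNode = $xpath->query('//title')->item(0);
    $descNode  = $xpath->query('//meta[@name="description"]/@content')->item(0);
    $imgNode   = $xpath->query('//img/@src')->item(0);

    return [
        'title'       => $titleNode ? trim($titleNode->textContent) : null,
        'description' => $descNode ? trim($descNode->nodeValue) : null,
        'image'       => $imgNode ? $imgNode->nodeValue : null,
    ];
}

// Inline sample markup so the sketch runs without network access.
$sample = '<html><head><title> Example Page </title>'
        . '<meta name="description" content="A short summary."></head>'
        . '<body><img src="/logo.png" alt="Logo"><p>Hello</p></body></html>';

$preview = extract_preview($sample);
echo $preview['title'], "\n";
```

Missing pieces come back as null, so the caller can decide what to show when a page has no description or image.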

You may use any of these libraries. Each one has pros and cons, so consult the notes about each one or take the time to try it yourself:
Guzzle: A standalone HTTP client that can work through cURL or PHP streams, so you do not have to call cURL directly.
Goutte: Built on Guzzle and some Symfony components by the creator of Symfony.
hQuery: A fast scraper with caching capabilities and high performance on large documents.
Requests: Known for its user-friendly API.
Buzz: A lightweight client, ideal for beginners.
ReactPHP: Asynchronous, with comprehensive tutorials and examples.
It is worth checking them all and using each one where it fits best.

This question is fairly old but still ranks very highly on Google Search results for web scraping tools in PHP. Web scraping in PHP has advanced considerably in the intervening years since the question was asked. I actively maintain the Ultimate Web Scraper Toolkit, which hasn't been mentioned yet but predates many of the other tools listed here except for Simple HTML DOM.
The toolkit includes TagFilter, which I actually prefer over other parsing options because it uses a state engine to process HTML with a continuous streaming tokenizer for precise data extraction.
To answer the original question of, "Is there any simple way to do this without any external libraries/classes?" The answer is no. HTML is rather complex and there's nothing built into PHP that's particularly suitable for the task. You really need a reusable library to parse generic HTML correctly and consistently. Plus you'll find plenty of uses for such a library.
Also, a really good web scraper toolkit will have three major, highly-polished components/capabilities:
1. Data retrieval. This is making a HTTP(S) request to a server and pulling down data. A good web scraping library will also allow for large binary data blobs to be written directly to disk as they come down off the network instead of loading the whole thing into RAM. The ability to do dynamic form extraction and submission is also very handy. A really good library will let you fine-tune every aspect of each request to each server as well as look at the raw data it sent and received on the wire. Some web servers are extremely picky about input, so being able to accurately replicate a browser is handy.
2. Data extraction. This is finding pieces of content inside retrieved HTML and pulling it out, usually to store it into a database for future lookups. A good web scraping library will also be able to correctly parse any semi-valid HTML thrown at it, including Microsoft Word HTML and ASP.NET output where odd things show up like a single HTML tag that spans several lines. The ability to easily extract all the data from poorly designed, complex, classless tags like ASP.NET HTML table elements that some overpaid government employees made is also very nice to have (i.e. the extraction tool has more than just a DOM or CSS3-style selection engine available). Also, in your case, the ability to early-terminate both the data retrieval and data extraction after reading in 50KB or as soon as you find what you are looking for is a plus, which could be useful if someone submits a URL to a 500MB file.
3. Data manipulation. This is the inverse of #2. A really good library will be able to modify the input HTML document several times without negatively impacting performance. When would you want to do this? Sanitizing user-submitted HTML, transforming content for a newsletter or sending other email, downloading content for offline viewing, or preparing content for transport to another service that's finicky about input (e.g. sending to Apple News or Amazon Alexa). The ability to create a custom HTML-style template language is also a nice bonus.
Obviously, Ultimate Web Scraper Toolkit does all of the above...and more:
I also like my toolkit because it comes with a WebSocket client class, which makes scraping WebSocket content easier. I've had to do that a couple of times.
It was also relatively simple to turn the clients on their heads and make WebServer and WebSocketServer classes. You know you've got a good library when you can turn the client into a server....but then I went and made PHP App Server with those classes. I think it's becoming a monster!

You can use SimpleHtmlDom for this, and then look for the title and img tags or whatever else you need.

I like the Dom Crawler library. Very easy to use, has lots of options like:
$crawler = $crawler
    ->filter('body > p')
    ->reduce(function (Crawler $node, $i) {
        // filters every other node
        return ($i % 2) == 0;
    });

Scripting a webcrawler to fill and send forms on remote sites

Now before you get out the torches and rail against spammers, I'll explain my intent here. I have written a series of scripts which scrape a certain website for contact information. These contacts are highly focused and are likely in a position where they are in need of a specific service I offer. The messages I plan on sending to them are one-offs and are written to be very helpful and respectful.
Now having said that, I'm having a hard time finding information on how to write a PHP bot that can enter a website, access a form, and send it. Everything I find is about stopping "spambots", unsurprisingly. I'm not worried about duping recaptchas or anything like that. If they have measures like that in place, I'm fine skipping them.

This question is too broad, so I have to give you a broad answer too...
First you need to download the page. You can use cURL (or file_get_contents might suffice).
Then you need to parse it with an HTML parser. You can use DOMDocument, which comes bundled with PHP, but you'll probably choke since DOMDocument is not very forgiving about pages with HTML syntax errors (or HTML5, for that matter).
Then you need to traverse the DOM and look for the form itself, extract the URL and the method, and make a request.
You can then use cURL to send a submit request to that URL.
However, this will fail for dynamic pages (for instance, Angular and other JavaScript-heavy pages). For those you are probably better off using a headless browser like PhantomJS.
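The parsing and traversal steps above can be sketched roughly as follows, using DOMDocument with libxml warnings silenced. The helper name describe_form() and the sample form are invented for illustration, and the sketch only looks at the first form and at input and textarea fields.

```php
<?php
// Sketch: locate the first <form>, then collect its target URL, method and named fields.
// describe_form() is a hypothetical helper name.
function describe_form(string $html): ?array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $form = $doc->getElementsByTagName('form')->item(0);
    if ($form === null) {
        return null; // no form on this page
    }

    $fields = [];
    $xpath = new DOMXPath($doc);
    // Only fields with a name attribute are submitted by browsers.
    foreach ($xpath->query('.//input[@name] | .//textarea[@name]', $form) as $field) {
        $value = $field->tagName === 'textarea'
            ? $field->textContent
            : $field->getAttribute('value');
        $fields[$field->getAttribute('name')] = $value;
    }

    return [
        'action' => $form->getAttribute('action'),
        'method' => strtoupper($form->getAttribute('method') ?: 'GET'),
        'fields' => $fields,
    ];
}

// Inline sample so the sketch runs without network access.
$page = '<html><body><form action="/contact" method="post">'
      . '<input type="hidden" name="token" value="abc123">'
      . '<input type="text" name="email" value="">'
      . '<textarea name="message"></textarea>'
      . '<input type="submit" value="Send">'
      . '</form></body></html>';

$form = describe_form($page);
print_r($form);
```

From here you would fill in $form['fields'], resolve the action against the page URL if it is relative, and send the result with cURL, for example with http_build_query($form['fields']) as the POST body. Hidden fields such as the token above usually have to be sent back unchanged.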

Using PHP to retrieve information from a different site

I was wondering if there's a way to use PHP (or any other server-side or even client-side [if possible] language) to obtain certain pieces of information from a different website (NOT a local file, as with include 'nav.php').
What I mean is that... Say I have a blog at www.blog.com and I have another website at www.mysite.com.
Is there a way to gather ALL of the h2 links from www.blog.com and put them in a div in www.mysite.com?
Also, is there a way I could grab all of the content inside a DIV (with an ID, of course) from blog.com and insert it in mysite.com?
Thanks,
Amit

First of all, if you want to retrieve content from a blog, check whether the blog platform (e.g. Blogger, WordPress) has an API, so that you don't have to reinvent the wheel. Good APIs usually come with good documentation (meaning that probably 5% of all APIs are good APIs), and that documentation should include code examples for popular languages such as PHP, JavaScript, Java, etc. Once again, if the goal is to retrieve content from a blog, there should be plenty of frameworks out there to help you.

Check out the PHP Simple HTML DOM library
Can be as easy as:
// Create DOM from URL or file
$html = file_get_html('http://www.otherwebsite.com/');
// Find all h2 headings and print their text
foreach($html->find('h2') as $element)
    echo $element->plaintext . "\n";

This can be done by opening the remote website as a file, then taking the HTML and using the DOM parser to manipulate it.
$site_html = file_get_contents('http://www.example.com/');
$document = new DOMDocument();
$document->loadHTML($site_html);
$all_of_the_h2_tags = $document->getElementsByTagName('h2');
Read more about PHP's DOM functions for what to do from here, such as grabbing other tags, creating new HTML out of bits and pieces of the DOM, and displaying that on your own site.
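As one possible continuation of the snippet above, here is a sketch that collects the links inside h2 tags and pulls out a single div by its ID using DOMXPath. The inline HTML stands in for the downloaded page, and the id "sidebar" is just an example value.

```php
<?php
// Inline sample standing in for file_get_contents('http://www.example.com/').
$site_html = '<html><body>'
    . '<h2><a href="/post-1">First post</a></h2>'
    . '<h2><a href="/post-2">Second post</a></h2>'
    . '<div id="sidebar"><p>About this blog</p></div>'
    . '</body></html>';

$document = new DOMDocument();
// Silence warnings about imperfect markup while parsing.
$previous = libxml_use_internal_errors(true);
$document->loadHTML($site_html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($document);

// Collect link text and href for every <a> inside an <h2>.
$links = [];
foreach ($xpath->query('//h2//a[@href]') as $a) {
    $links[] = [
        'text' => trim($a->textContent),
        'href' => $a->getAttribute('href'),
    ];
}

// Grab one <div> by its id and serialize it back to an HTML string.
$div = $xpath->query('//div[@id="sidebar"]')->item(0);
$div_html = $div ? $document->saveHTML($div) : '';

echo $div_html, "\n";
```

$links can then be rendered into a div on your own site, and $div_html can be echoed as-is, ideally after sanitizing it since it comes from a remote page.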

Your first step would be to use cURL to make a request to the other site and bring down the HTML of the page you want to access. Then comes the part of parsing the HTML to find all the content you're looking for. You could use a bunch of regular expressions and probably get the job done, but the Stack Overflow crew might frown at you. You could also take the resulting HTML and use the DOMDocument object and its loadHTML method to parse the HTML and pull out the content you want.
Also, if you control both sites, you can set up a special page on the first site (www.blog.com) with exactly the information you need, properly formatted either in HTML you can output directly, or XML that you can manipulate more easily from www.mysite.com.

How to extract content from other websites automatically?

I want to extract specific data from a website's pages.
I don't want to get all the contents of a specific page; I only need some portion (maybe only the data inside a table or content_div), and I want to do this repeatedly across all the pages of the website.
How can I do that?

Use cURL to retrieve the content and XPath to select the individual elements.
Be aware of copyright though.
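A minimal sketch of the XPath part, assuming the data sits in a table inside an element with id content_div (the name used in the question). extract_table_rows() is a hypothetical helper, and the $pages array stands in for HTML you would fetch page by page with cURL.

```php
<?php
// extract_table_rows() is a hypothetical helper: it returns the cell text of every
// data row of any table inside the element with the given id.
// Note: $containerId is placed into the XPath as-is, so it must not contain quotes.
function extract_table_rows(string $html, string $containerId): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $rows = [];
    foreach ($xpath->query("//*[@id='$containerId']//table//tr") as $tr) {
        $cells = [];
        foreach ($xpath->query('./td', $tr) as $td) {
            $cells[] = trim($td->textContent);
        }
        if ($cells !== []) { // skip header rows that only contain <th>
            $rows[] = $cells;
        }
    }
    return $rows;
}

// Stand-ins for pages downloaded with cURL, one string per page.
$pages = [
    '<div id="content_div"><table><tr><th>Name</th><th>Price</th></tr>'
        . '<tr><td>Apple</td><td>1.20</td></tr></table></div>',
    '<div id="content_div"><table><tr><th>Name</th><th>Price</th></tr>'
        . '<tr><td>Pear</td><td>0.90</td></tr></table></div>',
];

$all_rows = [];
foreach ($pages as $page_html) {
    $all_rows = array_merge($all_rows, extract_table_rows($page_html, 'content_div'));
}
print_r($all_rows);
```

The same function is applied to every page, so repeating the extraction across a site is only a matter of looping over its URLs.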

"extracting content from other websites" is called screen scraping or web scraping.
Simple HTML DOM Parser is the easiest way I know of doing it.

You need a PHP crawler. The key is to use string manipulation functions such as strstr, strpos and substr.
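For illustration, a small sketch of that string-function approach. text_between() is a made-up helper built only on strpos() and substr(); it works when the markers around the data are stable, and breaks as soon as the page markup changes.

```php
<?php
// text_between() is a hypothetical helper built only on strpos() and substr().
// It returns the text between the first $start marker and the next $end marker.
function text_between(string $haystack, string $start, string $end): ?string
{
    $from = strpos($haystack, $start);
    if ($from === false) {
        return null; // start marker not found
    }
    $from += strlen($start);

    $to = strpos($haystack, $end, $from);
    if ($to === false) {
        return null; // end marker not found after the start marker
    }
    return substr($haystack, $from, $to - $from);
}

// Inline sample standing in for a downloaded page.
$page = '<div id="price">Price: <b>19.99</b> EUR</div>';
$price = text_between($page, '<b>', '</b>');
echo $price, "\n";
```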

There are ways to do this. Just for fun I created a Windows app that went through my account on a well-known social network, looked in the correct places and logged the information into an XML file. This information would then be imported elsewhere. However, this sort of application can be used for motives I don't agree with, so I never uploaded it.
I would recommend using RSS feeds to extract content.
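If the site offers a feed, reading it can be as short as the sketch below, which uses PHP's SimpleXML extension. The feed content is an inline RSS 2.0 sample invented for this example; in practice you would pass the downloaded feed body to simplexml_load_string().

```php
<?php
// Inline RSS 2.0 sample standing in for a downloaded feed.
$rss = <<<XML
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item><title>First headline</title><link>https://example.com/1</link></item>
    <item><title>Second headline</title><link>https://example.com/2</link></item>
  </channel>
</rss>
XML;

$feed = simplexml_load_string($rss);
if ($feed === false) {
    exit("Could not parse the feed\n");
}

// Copy the fields of interest out of each <item> as plain strings.
$items = [];
foreach ($feed->channel->item as $item) {
    $items[] = [
        'title' => (string) $item->title,
        'link'  => (string) $item->link,
    ];
}

foreach ($items as $entry) {
    echo $entry['title'], ' -> ', $entry['link'], "\n";
}
```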

I think you need to implement something like a spider. You can make an XMLHTTP request to get the content and then parse it.

screen scraping technique using php

How do I screen scrape a particular website? I need to log in to a website and then scrape the inner information.
How could this be done?
Please guide me.
Duplicate: How to implement a web scraper in PHP?

Zend_Http_Client and Zend_Dom_Query

You want to look at the curl functions - they will let you get a page from another website. You can use cookies or HTTP authentication to log in first then get the page you want, depending on the site you're logging in to.
Once you have the page, you're probably best off using regular expressions to scrape the data you want.
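A rough sketch of the cookie-based login flow with the curl functions follows. fetch_after_login() is a hypothetical helper, the URLs and field names in the usage comment are placeholders, and many sites also require a hidden token from the login form, which this sketch does not handle.

```php
<?php
// Hypothetical helper: log in with a POST, keep the session cookie in memory,
// then fetch a protected page on the same cURL handle.
function fetch_after_login(string $loginUrl, array $credentials, string $targetUrl): string
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow the redirect most logins issue
        CURLOPT_COOKIEFILE     => '',   // empty string enables the in-memory cookie engine
    ]);

    // Step 1: submit the login form.
    curl_setopt_array($ch, [
        CURLOPT_URL        => $loginUrl,
        CURLOPT_POST       => true,
        CURLOPT_POSTFIELDS => http_build_query($credentials),
    ]);
    if (curl_exec($ch) === false) {
        throw new RuntimeException('Login request failed: ' . curl_error($ch));
    }

    // Step 2: request the protected page; cookies from step 1 are sent automatically.
    curl_setopt_array($ch, [
        CURLOPT_URL     => $targetUrl,
        CURLOPT_HTTPGET => true, // switch the handle back from POST to GET
    ]);
    $html = curl_exec($ch);
    if ($html === false) {
        throw new RuntimeException('Page request failed: ' . curl_error($ch));
    }
    return $html;
}

// Usage (placeholder URLs and field names, not a real site):
// $html = fetch_after_login(
//     'https://example.com/login',
//     ['username' => 'me', 'password' => 'secret'],
//     'https://example.com/account'
// );
```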

You should look at curl.

You might also want to take a look at BeautifulSoup which is a Python library which is supposed to be very good at making bad HTML parseable. It is aimed at things like screen scraping.
How easy it would be to call from PHP I don't know though.

You could also check out http://php.net/dom

cURL, and once you're in, use the QueryPath PHP library (querypath.org).
You can access DOM elements just like in jQuery, via CSS selectors, and there's method chaining.
Way better than just using PHP's native XML functions.
It also works as a Drupal extension, but I suppose you could use it in any PHP project.
