Scripting a webcrawler to fill and send forms on remote sites - php

Now before you get out the torches and rail against spammers, I'll explain my intent here. I have written a series of scripts which scrape a certain website for contact information. These contacts are highly focused and are likely in a position where they are in need of a specific service I offer. The messages I plan on sending to them are one-offs and are written to be very helpful and respectful.
Now having said that, I'm having a hard time finding information on how to write a PHP bot that can enter a website, access a form, and send it. Everything I find is about stopping "spambots", unsurprisingly. I'm not worried about duping recaptchas or anything like that. If they have measures like that in place, I'm fine skipping them.

This question is too broad, so I have to give you a broad answer too...
First you need to download the page. You can use cURL (or file_get_contents might suffice).
Then you need to parse it with an HTML parser. You can use DOMDocument, which comes bundled with PHP, but it will probably choke, since DOMDocument is not very forgiving about pages with HTML syntax errors (or HTML5, for that matter).
Then you need to traverse the DOM, find the form itself, and extract its action URL and method.
You can then use cURL to send a submit request to that URL.
However, this will fail for dynamic pages (for instance, Angular and other JavaScript-heavy pages). You'd probably be better off using a headless browser like PhantomJS.
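A minimal sketch of those steps, assuming the target page contains a single plain HTML form (the URL and the extra field name below are hypothetical, and a real script would need to resolve relative action URLs and handle missing forms):
<?php
// 1. Download the page (warnings suppressed because real-world HTML is rarely valid).
$pageUrl = 'https://example.com/contact';            // hypothetical target URL
$doc = new DOMDocument();
@$doc->loadHTML(file_get_contents($pageUrl));

// 2. Locate the form and work out where and how it submits.
$form   = $doc->getElementsByTagName('form')->item(0);   // assumes the form exists
$action = $form->getAttribute('action') ?: $pageUrl;     // a relative action would need resolving against $pageUrl
$method = strtoupper($form->getAttribute('method') ?: 'GET');

// 3. Collect the existing input fields (hidden fields, tokens, etc.) and add your own values.
$fields = [];
foreach ($form->getElementsByTagName('input') as $input) {
    if ($input->getAttribute('name') !== '') {
        $fields[$input->getAttribute('name')] = $input->getAttribute('value');
    }
}
$fields['message'] = 'Hello...';                      // hypothetical field name

// 4. Submit the form with cURL.
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
if ($method === 'POST') {
    curl_setopt($ch, CURLOPT_URL, $action);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($fields));
} else {
    curl_setopt($ch, CURLOPT_URL, $action . '?' . http_build_query($fields));
}
$response = curl_exec($ch);
curl_close($ch);
?>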

how to get a script tag value with php

I'm looking for a way to make a small preview of another page from a URL given by the user in PHP.
I'd like to retrieve only the title of the page, an image (like the logo of the website) and a bit of text or a description if it's available. Is there any simple way to do this without any external libraries/classes? Thanks
So far I've tried using the DOMDocument class, loading the HTML and displaying it on the screen, but I don't think that's the proper way to do it.
I recommend you consider simple_html_dom for this. It will make it very easy.
Here is a working example of how to pull the title and first image.
<?php
require 'simple_html_dom.php';

// Load the remote page into a simple_html_dom object.
$html = file_get_html('http://www.google.com/');

// Grab the first <title> and the first <img> on the page.
$title = $html->find('title', 0);
$image = $html->find('img', 0);

echo $title->plaintext."<br>\n";
echo $image->src;
?>
Here is a second example that will do the same without an external library. I should note that using regex on HTML is NOT a good idea.
<?php
$data = file_get_contents('http://www.google.com/');

// Grab whatever is between <title> and </title>.
preg_match('/<title>([^<]+)<\/title>/i', $data, $matches);
$title = $matches[1];

// Grab the src attribute of the first <img> tag.
preg_match('/<img[^>]*src=[\'"]([^\'"]+)[\'"][^>]*>/i', $data, $matches);
$img = $matches[1];

echo $title."<br>\n";
echo $img;
?>
You may use any of these libraries. Each one has its pros & cons, so consult the notes about each one or take the time to try them out on your own:
Guzzle: An independent HTTP client, so there's no need to depend on cURL, SOAP, or REST.
Goutte: Built on Guzzle and several Symfony components by a Symfony developer.
hQuery: A fast scraper with caching capabilities; high performance when scraping large documents.
Requests: Known for its user-friendly usage.
Buzz: A lightweight client, ideal for beginners.
ReactPHP: An asynchronous scraper, with comprehensive tutorials & examples.
You'd do best to check them all and use each one where it fits best.
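For instance, here is a rough sketch of fetching a page with Guzzle and handing the markup to DOMDocument (this assumes Guzzle is installed via Composer; the URL is just a placeholder):
<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;

// Fetch the page over HTTP with Guzzle.
$client   = new Client(['timeout' => 10]);
$response = $client->request('GET', 'https://example.com/');   // placeholder URL
$html     = (string) $response->getBody();

// Hand the markup to DOMDocument for parsing.
$doc = new DOMDocument();
@$doc->loadHTML($html);
$titleNode = $doc->getElementsByTagName('title')->item(0);
echo $titleNode ? $titleNode->textContent : 'no title found';
?>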
This question is fairly old but still ranks very highly on Google Search results for web scraping tools in PHP. Web scraping in PHP has advanced considerably in the intervening years since the question was asked. I actively maintain the Ultimate Web Scraper Toolkit, which hasn't been mentioned yet but predates many of the other tools listed here except for Simple HTML DOM.
The toolkit includes TagFilter, which I actually prefer over other parsing options because it uses a state engine to process HTML with a continuous streaming tokenizer for precise data extraction.
To answer the original question, "Is there any simple way to do this without any external libraries/classes?" The answer is no. HTML is rather complex and there's nothing built into PHP that's particularly suitable for the task. You really need a reusable library to parse generic HTML correctly and consistently. Plus you'll find plenty of uses for such a library.
Also, a really good web scraper toolkit will have three major, highly-polished components/capabilities:
Data retrieval. This is making an HTTP(S) request to a server and pulling down data. A good web scraping library will also allow for large binary data blobs to be written directly to disk as they come down off the network instead of loading the whole thing into RAM. The ability to do dynamic form extraction and submission is also very handy. A really good library will let you fine-tune every aspect of each request to each server as well as look at the raw data it sent and received on the wire. Some web servers are extremely picky about input, so being able to accurately replicate a browser is handy.
Data extraction. This is finding pieces of content inside retrieved HTML and pulling it out, usually to store it into a database for future lookups. A good web scraping library will also be able to correctly parse any semi-valid HTML thrown at it, including Microsoft Word HTML and ASP.NET output where odd things show up like a single HTML tag that spans several lines. The ability to easily extract all the data from poorly designed, complex, classless tags like ASP.NET HTML table elements that some overpaid government employees made is also very nice to have (i.e. the extraction tool has more than just a DOM or CSS3-style selection engine available). Also, in your case, the ability to early-terminate both the data retrieval and data extraction after reading in 50KB or as soon as you find what you are looking for is a plus, which could be useful if someone submits a URL to a 500MB file.
Data manipulation. This is the inverse of #2. A really good library will be able to modify the input HTML document several times without negatively impacting performance. When would you want to do this? Sanitizing user-submitted HTML, transforming content for a newsletter or sending other email, downloading content for offline viewing, or preparing content for transport to another service that's finicky about input (e.g. sending to Apple News or Amazon Alexa). The ability to create a custom HTML-style template language is also a nice bonus.
Obviously, Ultimate Web Scraper Toolkit does all of the above...and more.
I also like my toolkit because it comes with a WebSocket client class, which makes scraping WebSocket content easier. I've had to do that a couple of times.
It was also relatively simple to turn the clients on their heads and make WebServer and WebSocketServer classes. You know you've got a good library when you can turn the client into a server...but then I went and made PHP App Server with those classes. I think it's becoming a monster!
You can use Simple HTML DOM for this, and then look for the title and img tags or whatever else you need to do.
I like the DomCrawler library. It's very easy to use and has lots of options, like:
$crawler = $crawler
    ->filter('body > p')
    ->reduce(function (Crawler $node, $i) {
        // filters every other node
        return ($i % 2) == 0;
    });
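For the original question (title, first image, and a description), a quick sketch with DomCrawler might look like this (assuming symfony/dom-crawler and symfony/css-selector are installed via Composer; the URL is a placeholder):
<?php
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$html    = file_get_contents('https://example.com/');   // placeholder URL
$crawler = new Crawler($html);

// Title, first image and meta description, each guarded in case the node is missing.
$titleNode = $crawler->filter('title');
$imgNode   = $crawler->filter('img');
$descNode  = $crawler->filterXPath('//meta[@name="description"]');

$title       = $titleNode->count() ? $titleNode->text() : '';
$image       = $imgNode->count() ? $imgNode->first()->attr('src') : '';
$description = $descNode->count() ? $descNode->attr('content') : '';

echo $title."<br>\n".$image."<br>\n".$description;
?>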

cURL PHP - load a page fully

I am currently trying to load an HTML page via cURL. I can retrieve the HTML content, but part of it is loaded later via scripting (an AJAX POST). I cannot retrieve that part of the HTML (it's a table).
Is it possible to load a page entirely?
Thank you for your answers
No, you cannot do this.
cURL does nothing more than download a file from a URL -- it doesn't care whether it's HTML, JavaScript, an image, a spreadsheet, or any other arbitrary data; it just downloads. It doesn't run anything or parse anything or display anything, it just downloads.
You are asking for something more than that. You need to download, parse the result as HTML, then run some JavaScript that downloads something else, then run more JavaScript that parses that result into more HTML and inserts it into the original HTML.
What you're basically looking for is a full-blown web browser, not cURL.
Since your goal involves "running some JavaScript code", it should be fairly clear that it is not achievable without having a JavaScript interpreter available. This means that it is obviously not going to work inside of a PHP program (*). You're going to need to move beyond PHP. You're going to need a browser.
The solution I'd suggest is to use a very specialised browser called PhantomJS. This is actually a full Webkit browser, but without a user interface. It's specifically designed for automated testing of websites and other similar tasks. Your requirement fits it pretty well: write a script to get PhantomJS to open your URL, wait for the table to finish rendering, and grab the finished HTML code.
You'll need to install PhantomJS on your server, and then use a library like this one to control it from your PHP code.
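If you'd rather not pull in a wrapper library, a bare-bones alternative is to shell out to the phantomjs binary yourself. This sketch assumes PhantomJS is installed and that render.js (not shown here) is a small PhantomJS script that opens the URL passed to it, waits for the page to finish rendering, and prints the final HTML to stdout:
<?php
$url = 'https://example.com/page-with-ajax-table';   // hypothetical URL

// render.js is an assumed PhantomJS script: it loads the URL, waits for the
// AJAX-built table to appear, then writes page.content to stdout.
$html = shell_exec('phantomjs render.js ' . escapeshellarg($url));

// $html now contains the DOM as it looks after the JavaScript has run.
echo $html;
?>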
I hope that helps.
(*) yes, I'm aware of the PHP extension that provides a JS interpreter inside of PHP, and it would provide a way to solve the problem, but it's experimental, unfinished, would be still difficult to implement as a solution, and I don't think it's a particularly good idea anyway, so let's not consider it for the purposes of this answer.
No, the only way you can do that is to make a separate cURL request to the AJAX endpoint and put the two results together afterwards.
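A rough sketch of that two-request approach (the AJAX endpoint URL, POST fields, and placeholder element are all hypothetical; you'd find the real ones in your browser's network tab):
<?php
// First request: the static HTML page.
$ch = curl_init('https://example.com/page');                 // hypothetical page URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$pageHtml = curl_exec($ch);
curl_close($ch);

// Second request: the AJAX endpoint that returns the table.
$ch = curl_init('https://example.com/ajax/table');           // hypothetical endpoint
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(['id' => 42]));   // hypothetical POST fields
$tableHtml = curl_exec($ch);
curl_close($ch);

// Put the two results together however your application needs, e.g.:
$fullHtml = str_replace('<div id="table-placeholder"></div>', $tableHtml, $pageHtml);
?>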

Methods of injecting text

I recently created an image that automatically changes depending on the time thanks to a PHP script. I'm now thinking about doing something but I'm not sure if it's possible.
I do have restrictions. I need this to work on a forum board, so it means I have to have all scripting on a different server. I would Google how to do this but I'm not sure what to search for, hence the broad title. If someone could possibly tell me if it's possible and show a small example to get me on the right track, that'd be appreciated.
What I need to do is print text out onto the page. As I stated above, all the scripting needs to be on a different server, as the forum doesn't allow PHP, only basic HTML (similar to here). This means I can't use include 'file.php';.
IMHO you have two options:
Use an HTML iframe element to embed the external content (just give it the external URL and the browser will handle the rest)
Make an AJAX request from JavaScript and inject the result into the DOM tree of the board.
Now it's your decision which suits you better.
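Either way, the script on your own server only has to print the text; a minimal sketch (the forum origin in the CORS header is a placeholder and is only needed for the AJAX option):
<?php
// text.php on your own server, embedded via an iframe or fetched via AJAX.
header('Access-Control-Allow-Origin: https://forum.example.com');   // placeholder origin, needed for the AJAX option only
header('Content-Type: text/html; charset=utf-8');

// Same idea as the time-based image: the output changes with the time of day.
$hour = (int) date('G');
echo $hour < 12 ? 'Good morning!' : 'Good evening!';
?>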

Would it be better to parse HTML on the server with PHP or on the end user side with JavaScript?

I need to write a script that takes a link and parses the HTML of the linked page to pull in the title and a few other pieces of data like potentially a short description much like when you link to something on Facebook.
It will be called when a user adds a link to the site, so could see a decent number of hits when the client launches the site.
I am curious whether I should do this on the server side with PHP or on the end-user side with JavaScript. I have been writing the logic for figuring out which areas of the markup are filled with potential content, and it made me wonder if the load would be too much if I continue in PHP.
The client has just the one decent web server, and I worry that parsing/analyzing HTML pages may be too much load, whereas we could do it in JavaScript and farm it out to the user adding the link.
Any advice or thoughts on the matter would be awesome. Thank you.
Edit: This data is not going straight into the database; it is used to help the user by auto-filling the description of their link, which still goes through my regular vetting before being stored in the DB.
Well, this is an easy one, because performing this from the client-side purely with JavaScript just plain isn't an option at all due to the same origin policy.
Parsing HTML isn't that heavy of a task, you should be fine doing it in PHP.
I would offload this to the end user via JavaScript; with a listener you could then bind it back to the server. The reasons why are simple:
This is a helper for the front end, not the backend (values aren't stored or manipulated on the backend directly).
The load is better spread around than localized on your server; also, you'll probably give a better user experience here if the end user is only pulling 1 URL vs. the server pulling thousands.
Processing in the front end also mitigates the possibility of malicious code being executed directly on your server.
If you're thinking about having the client actually go and fetch some random site, parse it for you in JavaScript, grab the title, description and other data and then submit that in your form for you, your form's submit time is going to be held hostage to your user's network connection speed for fetching that page and whatever overhead (likely minuscule) for parsing the data. If you do that server-side using cURL, the hit will be in parsing the document for what you need. The best speed solution would probably be to let the person enter the URL, get it back in PHP, have PHP hand it off to a Perl script (which has some wicked fast DOM parsers) and get the required data back from the Perl script. From personal experience, the Perl scripts outperform cURL all day long, and cURL generally outperforms JavaScript AJAX gets by a wide margin, just by nature of being on a bigger pipe than a home user.
You can do both....
1) PHP:
check out HTML DOM Parser, which could be helpful
or use PHP cURL and then parse with DOMDocument (see the sketch after this list)
2) JavaScript:
you don't have to bother your server (pro)
parsing content with jQuery is easy (pro)
you need to handle the cross-domain policy (con)
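For the PHP route, a rough sketch of pulling the title and meta description with cURL and DOMXPath (the URL stands in for whatever link the user submits):
<?php
// Fetch the submitted page with cURL.
$ch = curl_init('https://example.com/article');   // placeholder for the user-submitted URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
curl_close($ch);

// Parse it and pull out the title and meta description.
$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$titleNode = $xpath->query('//title')->item(0);
$descNode  = $xpath->query('//meta[@name="description"]/@content')->item(0);

$title       = $titleNode ? trim($titleNode->textContent) : '';
$description = $descNode ? trim($descNode->nodeValue) : '';
?>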

XHTML instead of PHP?

I want to develop a site that will allow me to publish information to users, and give them an opportunity to subscribe to a mailing list so they can be updated each time I make a change to the site.
*Add new information, etc.
I also would like for the users to be able to add comments about reviews posted, and give me suggestions... things that will encourage user interaction.
I understand that this is possible with PHP...
But I do not know PHP, and to learn and test it I apparently need a domain to begin with... etc.
Is it possible that I use XHTML/HTML to get the same results?
--
I know I can use a mailto link, but that would also leave my email open to spam... Any suggestions?
And I do apologize if this question has been posted before, I did some research and found no such thing.
All helpful responses are appreciated.
XHTML and HTML are essentially the same thing; it's just that XHTML is based on an XML standard (that's where the X comes from), and is therefore a bit stricter.
HTML/XHTML is generally used for the structure of your webpage, whereas PHP is a server-side language, meaning it works behind the scenes.
You could use HTML alone, but it'd be hideously complex to make, so I'd say you'd be better off biting the bullet and making a start on your first PHP app :) Don't worry, it's very easy to get your head around. You do not need a domain to get started with the development; simply install WAMP (for Windows) or MAMP (if you're an Apple freak like me). These programs act as self-contained mini servers, very useful for development!
Then I'd suggest trying it all out using HTML for starters, just so you get used to the WAMP/MAMP server, before heading over to http://devzone.zend.com/article/627 for a brilliant set of tutorials on PHP!
EDIT: Another poster mentioned WordPress; it's a great platform too! But I always favour learning the basics, so that in the event of something going wrong, or not working the way you want it to, you'll know what to do, or at least have an idea. Therefore I'd stick with your own PHP solution as a starter, then progress to WordPress when you feel comfortable.
I hope this helps :)
(X)HTML is the markup language that's interpreted by the browser to display your web pages.
PHP is a language, used on the server, that can:
Generate that HTML markup
Act as a 'glue' with other systems, such as a database, for data persistence.
(X)HTML by itself is not dynamic: it's only used to display data.
And PHP by itself doesn't display much information: it generates it.
So, basically, you'll need to use both (X)HTML and PHP:
PHP for everything that's dynamic
like interaction with a database, a form, ...
HTML (possibly generated by the PHP code) to display the data.
No, you will need some kind of server-side scripting language to be able to interrogate a database, print out comments and send the generated HTML to the browser.
If you don't know how to use PHP, how about using an open-source solution like WordPress? It is a blogging platform but offers all the things you listed.
I would suggest using WordPress because:
It is easy to learn, the documentation is excellent
There are thousands of free plugins to add functionality to your site
There is a plugin, Contact Form 7, that will allow your users to send you email while doing a good job of curbing spam
There is a built-in RSS feed to push notices out to your users when your site is updated
WordPress can be installed on shared hosting, virtual private hosts, and almost any machine with the LAMP stack
If you are new to creating websites, WordPress has free themes which are a good starting place
Finally, to answer your question, XHTML and PHP do different things. XHTML is like the idea of a picture. You can see it, it has shapes, outlines, sometimes words, etc. Whereas PHP is like film, where viewers can see something, but there is something in the background that is updating and moving.
HTML is just a markup language used by the browser to format data to display to users.
Most hosting solutions provide form mailer scripts that just take an HTML form and email the fields to a specified email address which you can configure.
They also provide mailing list functionality.
So, maybe check for a (PHP) hosting solution that provides this functionality, and you won't need to write any PHP until you require more complex, custom functionality.
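If you do eventually write that PHP yourself, a form mailer is about as small as scripts get. A minimal sketch (the recipient address and field names are placeholders, and real code should validate input more carefully):
<?php
// mailer.php - receives the HTML form POST and emails the fields.
$name    = trim($_POST['name'] ?? '');
$email   = trim($_POST['email'] ?? '');
$comment = trim($_POST['comment'] ?? '');

if ($name !== '' && $comment !== '' && filter_var($email, FILTER_VALIDATE_EMAIL)) {
    $body = "From: $name <$email>\n\n$comment";
    mail('you@example.com', 'New site comment', $body);   // placeholder recipient
    echo 'Thanks for your comment!';
} else {
    echo 'Please fill in all fields with a valid email address.';
}
?>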
