I'm trying to get data from a website:
http://site2.aesa.pb.gov.br/aesa/monitoramentoPluviometria.do?metodo=listarMesesChuvasMensais (translated via Google Translate)
In Delphi I know how to work with XML-based web services (SOAP, WSDL), but this site does not provide that kind of service.
However, I have no knowledge of languages like PHP and HTML, or of web languages in general.
My question is: is there any way to get data from that site with the knowledge I do (not) have? Is there a tool for this in Delphi? What are the usual first steps to learn how to do this?
Input (January and 2014):
http://site2.aesa.pb.gov.br/aesa/monitoramentoPluviometria.do?metodo=listarMesesChuvasMensais
Output (a generic method URL):
http://site2.aesa.pb.gov.br/aesa/monitoramentoPluviometria.do
First you need to use an HTTP client such as Internet Direct (Indy) or Synapse to get the web page as text.
Then you can use either an HTML parser library or plain string routines to extract the table data.
How to perform HTTP requests is shown in many articles and Stack Overflow questions.
I'm looking for a way to make a small preview of another page from a URL given by the user in PHP.
I'd like to retrieve only the title of the page, an image (like the logo of the website) and a bit of text or a description if it's available. Is there any simple way to do this without any external libraries/classes? Thanks
So far I've tried using the DOMDocument class, loading the HTML and displaying it on the screen, but I don't think that's the proper way to do it.
I recommend you consider simple_html_dom for this. It will make it very easy.
Here is a working example of how to pull the title, and first image.
<?php
require 'simple_html_dom.php';

// Download and parse the page
$html = file_get_html('http://www.google.com/');

// Grab the first <title> and the first <img> on the page
$title = $html->find('title', 0);
$image = $html->find('img', 0);

echo $title->plaintext."<br>\n";
echo $image->src;
?>
Here is a second example that will do the same without an external library. I should note that using regex on HTML is NOT a good idea.
<?php
$data = file_get_contents('http://www.google.com/');

// Pull the contents of the <title> tag
preg_match('/<title>([^<]+)<\/title>/i', $data, $matches);
$title = $matches[1];

// Pull the src of the first <img> tag
preg_match('/<img[^>]*src=[\'"]([^\'"]+)[\'"][^>]*>/i', $data, $matches);
$img = $matches[1];

echo $title."<br>\n";
echo $img;
?>
You may use any of these libraries. Each one has its pros and cons, so consult the notes on each one or take the time to try them yourself:
Guzzle: An independent HTTP client, so there's no need to depend on cURL, SOAP or REST.
Goutte: Built on Guzzle and several Symfony components by a Symfony developer (a minimal sketch follows below).
hQuery: A fast scraper with caching capabilities and high performance when scraping large documents.
Requests: Known for its user-friendly API.
Buzz: A lightweight client, ideal for beginners.
ReactPHP: An asynchronous scraper, with comprehensive tutorials and examples.
It's worth checking them all out and using each one where it fits best.
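For a sense of how little code these take, here is a minimal Goutte sketch (assuming it was installed via Composer; the URL and selectors are placeholders). Goutte pulls in Guzzle for the HTTP work and hands back a Symfony DomCrawler for the extraction:
<?php
require 'vendor/autoload.php';

use Goutte\Client;

// Fetch a page and get back a Symfony DomCrawler instance
$client  = new Client();
$crawler = $client->request('GET', 'https://example.com/');

// Pull the page title and every link's text with CSS selectors
echo $crawler->filter('title')->text() . "\n";
$crawler->filter('a')->each(function ($node) {
    echo $node->text() . "\n";
});
?>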
This question is fairly old but still ranks very highly on Google Search results for web scraping tools in PHP. Web scraping in PHP has advanced considerably in the intervening years since the question was asked. I actively maintain the Ultimate Web Scraper Toolkit, which hasn't been mentioned yet but predates many of the other tools listed here except for Simple HTML DOM.
The toolkit includes TagFilter, which I actually prefer over other parsing options because it uses a state engine to process HTML with a continuous streaming tokenizer for precise data extraction.
To answer the original question, "Is there any simple way to do this without any external libraries/classes?" The answer is no. HTML is rather complex and there's nothing built into PHP that's particularly suitable for the task. You really need a reusable library to parse generic HTML correctly and consistently. Plus you'll find plenty of uses for such a library.
Also, a really good web scraper toolkit will have three major, highly-polished components/capabilities:
Data retrieval. This is making an HTTP(S) request to a server and pulling down data. A good web scraping library will also allow for large binary data blobs to be written directly to disk as they come down off the network instead of loading the whole thing into RAM. The ability to do dynamic form extraction and submission is also very handy. A really good library will let you fine-tune every aspect of each request to each server as well as look at the raw data it sent and received on the wire. Some web servers are extremely picky about input, so being able to accurately replicate a browser is handy.
Data extraction. This is finding pieces of content inside retrieved HTML and pulling it out, usually to store it into a database for future lookups. A good web scraping library will also be able to correctly parse any semi-valid HTML thrown at it, including Microsoft Word HTML and ASP.NET output where odd things show up like a single HTML tag that spans several lines. The ability to easily extract all the data from poorly designed, complex, classless tags like ASP.NET HTML table elements that some overpaid government employees made is also very nice to have (i.e. the extraction tool has more than just a DOM or CSS3-style selection engine available). Also, in your case, the ability to early-terminate both the data retrieval and data extraction after reading in 50KB or as soon as you find what you are looking for is a plus, which could be useful if someone submits a URL to a 500MB file.
Data manipulation. This is the inverse of #2. A really good library will be able to modify the input HTML document several times without negatively impacting performance. When would you want to do this? Sanitizing user-submitted HTML, transforming content for a newsletter or sending other email, downloading content for offline viewing, or preparing content for transport to another service that's finicky about input (e.g. sending to Apple News or Amazon Alexa). The ability to create a custom HTML-style template language is also a nice bonus.
Obviously, Ultimate Web Scraper Toolkit does all of the above... and more.
I also like my toolkit because it comes with a WebSocket client class, which makes scraping WebSocket content easier. I've had to do that a couple of times.
It was also relatively simple to turn the clients on their heads and make WebServer and WebSocketServer classes. You know you've got a good library when you can turn the client into a server....but then I went and made PHP App Server with those classes. I think it's becoming a monster!
You can use SimpleHtmlDom for this, and then look for the title and img tags or whatever else you need.
I like the DomCrawler library. It's very easy to use and has lots of options, like:
use Symfony\Component\DomCrawler\Crawler;

// $html holds the page's HTML, fetched earlier with your HTTP client of choice
$crawler = new Crawler($html);

$crawler = $crawler
    ->filter('body > p')
    ->reduce(function (Crawler $node, $i) {
        // keeps every other node
        return ($i % 2) == 0;
    });
Now before you get out the torches and rail against spammers, I'll explain my intent here. I have written a series of scripts which scrape a certain website for contact information. These contacts are highly focused and are likely in a position where they are in need of a specific service I offer. The messages I plan on sending to them are one-offs and are written to be very helpful and respectful.
Now having said that, I'm having a hard time finding information on how to write a PHP bot that can enter a website, access a form, and submit it. Everything I find is about stopping "spambots", unsurprisingly. I'm not worried about duping reCAPTCHAs or anything like that. If they have measures like that in place, I'm fine skipping them.
This question is too broad, so I have to give you a broad answer too...
First you need to download the page. You can use cURL (or file_get_contents might suffice).
Then you need to parse it with an HTML parser. You can use DOMDocument, which comes bundled with PHP, but it will probably choke, since DOMDocument is not very forgiving about pages with HTML syntax errors (or HTML5, for that matter).
Then you need to traverse the DOM, look for the form itself, and extract its URL and method.
You can then use cURL to send a submit request to that URL.
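A rough sketch of those steps, with placeholder URL and field names (a real page would need more care, e.g. resolving a relative form action and handling select/textarea fields):
<?php
// 1. Download the page (placeholder URL)
$html = file_get_contents('http://example.com/contact');

// 2. Parse it, silencing warnings about sloppy markup
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);

// 3. Find the form and read where/how it submits
$form   = $doc->getElementsByTagName('form')->item(0);
$action = $form->getAttribute('action');                  // may be relative on real pages
$method = strtoupper($form->getAttribute('method')) ?: 'GET';

// Collect the existing input fields, then fill in our own values
$fields = array();
foreach ($form->getElementsByTagName('input') as $input) {
    $fields[$input->getAttribute('name')] = $input->getAttribute('value');
}
$fields['message'] = 'Hello';                              // hypothetical field name

// 4. Submit with cURL (assuming a POST form)
$ch = curl_init($action);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
if ($method === 'POST') {
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($fields));
}
$response = curl_exec($ch);
curl_close($ch);
?>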
However, this will fail for dynamic pages (for instance, Angular and other heavily JavaScripted pages). You'd probably be better off using a headless browser like PhantomJS.
I'm trying to work through the whole PhoneGap thing to get a native app up and running. I am completely fine with creating HTML5 markup for the actual app; what I need help with is pulling in dynamic content from a website. In particular, there is some content on our website that also needs to be in the app. We use a program called Expression Engine that handles all of our content. The content I would need to pull over would be:
Sermon Videos
Sermon Series
Locations
Plain text content
The majority of the app will be local, but there are some dynamic needs, as you can see. I've read a couple of things that say JSON is the way to go, but it looks pretty complicated, as I'm not quite familiar with AJAX. Is this the only way, or are there other options or resources anyone can point me to that might help? I'm not even sure that method would work for our website. I appreciate any help you can provide.
They are correct. What you need to look into is AJAX/JSON and how to present your data to your app using these technologies.
Expression Engine would actually be quite a good choice for this, as its template system is quite flexible. There are even add-on modules for delivering your content as JSON if you want to go that route.
A quick Google search led me to: http://samcroft.co.uk/2011/updated-loading-data-in-phonegap-using-jquery-1-5/
It's a bit more than you need since you will have your content in an existing CMS instead of creating a new database to store the data, but the concepts will hold true and I am sure you will be able to use it to find more tutorials that suit you better.
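If it helps to see how small the server side of this can be, here is a minimal, hypothetical PHP endpoint returning JSON (the file name and fields are made up; your real data would come out of Expression Engine rather than being hard-coded):
<?php
// Hypothetical endpoint, e.g. /api/sermons.php, that the PhoneGap app fetches via AJAX
header('Content-Type: application/json');

$sermons = array(
    array('title' => 'Example Sermon', 'series' => 'Example Series',
          'video' => 'https://example.com/videos/1.mp4'),
);

echo json_encode($sermons);
?>
On the app side, a jQuery $.getJSON() call against that URL is enough to pull the data in.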
Trying to learn some more PHP. Here is what I'm after.
Essentially, I would like to search a website and return data to my own website.
Add a few keywords to a form.
Use those keywords to query a website such as monster.com for results that match the keywords entered.
Grab that data and return it to my own website.
How hard is something like this? I acknowledge the above outline is oversimplified but any tips you can offer are much appreciated.
If you're querying a site that has an API designated for this kind of functionality, you're on easy street. Just call the API's appropriate search function and you're all set.
If the site you're querying doesn't have an API, you still might be able to search the site with an HTTP GET using the right parameters. Then you just need to scrape through the file for the search results with your script and a few regex functions.
Here's a little tutorial on screen scraping with PHP. Hopefully that will be of some help to you. The trouble with this is that in general if the site hasn't made it easy to access their data, they might not want you to do this.
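As a rough illustration of the GET-plus-regex approach (the site URL, query parameter, and markup below are made up; a real site will differ and may forbid this in its terms of use):
<?php
// Build the search URL from the user's keywords (hypothetical site and parameter)
$keywords = urlencode('php developer');
$html = file_get_contents('https://jobsite.example.com/search?q=' . $keywords);

// Pull the text of each result link; the class name is an assumption about the markup
preg_match_all('/<a[^>]*class="result-title"[^>]*>([^<]+)<\/a>/i', $html, $matches);

foreach ($matches[1] as $title) {
    echo $title . "\n";
}
?>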
Enter Yahoo Query Language (YQL). It's a service that lets you use things like XPath to get data from websites and put it into an easy-to-use XML or JSON format. The language is structured similarly to SQL (hence the name).
I've used it to build RSS feeds for sites that didn't have them, and it was pretty easy to learn.
http://developer.yahoo.com/yql/
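For reference, a YQL call from PHP looked roughly like this at the time (the target URL and XPath are placeholders; Yahoo has since retired the service, so treat this purely as an illustration):
<?php
// YQL: SQL-like query that scrapes a page's <title> via XPath (placeholders used)
$yql = 'select * from html where url="http://example.com/" and xpath="//title"';

$endpoint = 'https://query.yahooapis.com/v1/public/yql?q=' . urlencode($yql)
          . '&format=json';

$result = json_decode(file_get_contents($endpoint), true);
print_r($result['query']['results']);
?>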
Ok, I'm using the term "Progressive Enhancement" kind of loosely here, but basically I have a Flash-based website that supports deep linking and loads content dynamically. What I'd like to do is provide alternate content (text) both for visitors who don't have Flash and for search engine bots. So a user with Flash would navigate to:
http://www.samplesite.com/#specific_page
and they would see a Flash site that navigates to "specific_page". Those without Flash would see "specific_page" rendered as text in the alternative content section.
Basically, I would use PHP/MySQL to create a backend to handle all of this, since the SWF also uses dynamic data. The question is: does something that does this already exist?
There's an inherent problem with what you're trying to achieve.
The URL hash (or anchor) is client-side only; that token is not sent to the server. This means the only way (that I know of) to load the content you need for example.com/#some_page is to use AJAX, which can read the hash and then request the page-specific data from the server.
Done? No, because this will kill search engine bots. A possible solution is to have example.com/some_page serve the same content (in fact, that could easily be a REST service you've already made to return the AJAX- or Flash-requested content), and to provide a sitemap.xml that indexes those URIs to help out the search engines.
I know of no existing framework that does specifically these tasks, although it certainly seems like one could be made without too much trouble.
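A bare-bones sketch of that idea in PHP (the script name, slugs, and content are all hypothetical): one script answers both the crawlable example.com/some_page-style requests and the data requests the Flash/AJAX layer makes.
<?php
// page.php?slug=specific_page              -> plain HTML fallback for bots / no-Flash visitors
// page.php?slug=specific_page&format=json  -> raw data for the Flash/AJAX front end
$slug = isset($_GET['slug']) ? $_GET['slug'] : 'home';

// In a real site this would come from MySQL; hard-coded here for illustration
$pages = array(
    'home'          => array('title' => 'Home',          'body' => 'Welcome text...'),
    'specific_page' => array('title' => 'Specific Page', 'body' => 'Alternate text content...'),
);
$page = isset($pages[$slug]) ? $pages[$slug] : $pages['home'];

if (isset($_GET['format']) && $_GET['format'] === 'json') {
    header('Content-Type: application/json');
    echo json_encode($page);                                    // consumed by Flash/AJAX
} else {
    echo '<h1>' . htmlspecialchars($page['title']) . '</h1>';   // crawlable fallback
    echo '<p>'  . htmlspecialchars($page['body'])  . '</p>';
}
?>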
Well, according to OSFlash (the open source Flash people), both CakePHP and PHPWCMS can do what you need, although from a first glance at their sites' feature lists it is not entirely obvious.
Let us know if they do work!
If you're using SWFAddress with Flash/Flex, you can read in the URL, split it into an array, and do as you wish:
SWFAddress.addEventListener( SWFAddressEvent.CHANGE, onChange );

private function onChange( e : SWFAddressEvent ) : void
{
    var ar : Array = SWFAddress.getValue().split( '/' );
    trace( 'Array : ', ar );
}
For your non-Flash stuff, if you're using CodeIgniter you'd be able to pull the URL and convert it into an array as well.
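For example, something along these lines inside a CodeIgniter controller (the class name and content-loading details are hypothetical) mirrors on the server side what the SWFAddress snippet above does in Flash:
<?php
class Page extends CI_Controller
{
    public function index()
    {
        // e.g. /specific_page/section -> array(1 => 'specific_page', 2 => 'section')
        $segments = $this->uri->segment_array();

        // Load and render the matching content for non-Flash visitors here
    }
}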
Another alternative is to use FAUST. What you can do with FAUST is have PHP render the HTML as valid markup; FAUST will then pull the HTML and pass it to Flash via FlashVars as XML. This method makes search engines really, really happy (see http://www.bartoncreek.com).
So to answer your question there are tools out there that will help you achieve your goals.