I am trying to get data from a site and be able to manipulate it to display it on my own site.
The site contains a table with ticks and is updated every few hours.
Here's an example: http://www.astynomia.gr/traffic-athens.php
This data is there for everyone to use, and I will credit the source on my own site just to be sure.
I've read something about PHP's cURL, but I have no idea if this is the way to go.
Any pointers/tutorials, or code anyone could provide so I can start somewhere would be very helpful.
Also, any pointers on how I can be notified as soon as the site is updated?
If you want to crawl the page, use something like Simple HTML DOM Parser for PHP. That'll serve your purpose.
First, your web host/localhost should have the php_curl extension enabled.
To start with, you should read a bit here. If you want to jump in directly, there is a simple function in this question: Why I can't get website content using CURL. You just have to change the values of the variables $url and $timeout.
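As a rough sketch of such a function (the defaults here are mine; adjust $url and $timeout to your needs):

<?php
// Minimal cURL fetch helper (a sketch, not production code).
function get_page($url, $timeout = 30)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of echoing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);    // give up after $timeout seconds
    $html = curl_exec($ch);
    if ($html === false) {
        error_log('cURL error: ' . curl_error($ch));
    }
    curl_close($ch);
    return $html;
}

$html = get_page('http://www.astynomia.gr/traffic-athens.php');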
Lastly, to get the updated data every 2 hours, you will have to run the script as a cronjob. Please refer to this post:
PHP - good cronjob/crontab/cron tutorial or book
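For example, a crontab entry like this (the paths are placeholders) would run your fetch script at minute 0 of every second hour:

# m h dom mon dow command
0 */2 * * * /usr/bin/php /path/to/fetch-traffic.php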
For a system I am developing, I need to programmatically go to a specific page, fill out one field in the form (I know the id and name of the input element), submit it, and store the results.
I have seen a few different Perl, Python, and Java classes that do this. However, I would like to do this using PHP and haven't found anything as of yet.
I do have permission to do this from the site I am getting the information from as well.
Any help is appreciated.
Take a look at David Walsh's simple explanation.
http://davidwalsh.name/curl-post
You can easily store the response (in this example, $result) in your database or logfile.
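In outline, the POST looks like this (the URL and field name below are placeholders; use the form's real action URL and the input's name attribute):

<?php
// Submit one form field via cURL POST and capture the response (a sketch).
$ch = curl_init('http://example.com/form-page.php');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array(
    'field_name' => 'value', // the input's name attribute, not its id
)));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$result = curl_exec($ch);
curl_close($ch);
// $result now holds the page returned after submission; store it wherever you like.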
Usually PHP crawlers/scrapers use CURL - http://php.net/manual/en/book.curl.php.
It allows you to make a request from the server where PHP runs and get a response from the website that you need to crawl. It returns the response as plain text, and parsing it is up to you. You can check what the form submits when you fill it in manually, and then do the same thing via cURL.
You also may try phpcrawl (http://phpcrawl.cuab.de), seems to fit all your needs.
(See "addPostData()"-method)
Let me just start by saying I know almost nothing about PHP, but I think it may prove to be the best way to do what I'm trying to do. I'd like to grab the value of a variable from an external page so that I can then process it for the creation of graphs and statistics on my page. An example page that I'm trying to get the variable from (it requires a Facebook account) is http://superherocity.klicknation.com/game/pages/battle_replay.php?battle=857337182
The variable name is fvars and it contains data about what the 2 players used for attacks, how much damage they did, etc. Ultimately what I'd like is to provide a page with a form where a player can go and plug in their replay link (like above) and get a nice neat detailed breakdown of the battle.
At the very least, if someone could explain to me how to just echo out the value of fvars after a form submission with the replay URL as input, it would help out immensely! I've tried looking at some PHP references and other posts here but have so far been lost. :(
Thank you for any help or guidance.
One way you could approach it is to use Selenium. You would need to set up the Selenium server and a browser, and then write a Selenium script to fetch the page for you. The key point here is that Selenium can run a Firefox client with JavaScript, Facebook logins, etc., everything you have in your ordinary Firefox, programmatically.
I run Selenium in a Linux environment and control it through PHP CLI scripts. I run the Java selenium-server-standalone along with framebuffered X and Firefox. The PHPUnit test library already has a Selenium extension, though obviously you wouldn't need it for testing here.
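As a very rough sketch of that setup (this assumes the PHPUnit Selenium extension is installed and a selenium-server is already running; the Facebook login step is omitted):

<?php
require_once 'PHPUnit/Extensions/SeleniumTestCase.php';

class ReplayFetcher extends PHPUnit_Extensions_SeleniumTestCase
{
    protected function setUp()
    {
        $this->setBrowser('*firefox');
        $this->setBrowserUrl('http://superherocity.klicknation.com/');
    }

    public function testFetchReplay()
    {
        // A real script would have to handle the Facebook login first.
        $this->open('/game/pages/battle_replay.php?battle=857337182');
        $source = $this->getHtmlSource(); // the HTML after JavaScript has run
        file_put_contents('/tmp/replay.html', $source);
    }
}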
You can get the contents of any webpage like so:
$homepage = file_get_contents('http://www.example.com/');
echo $homepage;
And then just use regex or basic searching to find the variable you need in $homepage. The problem is that you need to be logged in via Facebook. I know of no current way to do this dynamically with PHP.
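For the regex part, assuming fvars appears in the source as something like fvars = ...; (I'm guessing at the exact format; adjust the pattern to the real markup), a rough sketch:

<?php
$homepage = file_get_contents('http://superherocity.klicknation.com/game/pages/battle_replay.php?battle=857337182');

// Guessing the page contains something like:  fvars = "...";
if (preg_match('/fvars\s*=\s*(.+?);/s', $homepage, $matches)) {
    echo $matches[1]; // the raw value of fvars
} else {
    echo 'fvars not found (probably because the Facebook login is missing)';
}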
Mike
Edit: found an SO question that addresses this exact issue - Scraping from a website that requires a login?
Good day everyone!
I am trying to append a script to a remote page (not mine; it is a form page) that would hide some of its content (certain elements in particular) before showing it. I am using cURL, but the only thing I could do is retrieve its HTML code.
Is there any way of doing what I want to happen?
I'm assuming that the user asks your server for content, and your server needs to fetch that content on another server and process it before sending it back to the user.
Query the other page using cURL, then run your script on that HTML to remove the pieces that you don't want to keep (I hope for your sake that they are reasonably easy to find and eliminate), and finally output the resulting HTML to the user.
To remove some part of the HTML, you could preg_replace() it using regular expressions.
Googling for an online regex tester might be of some help if you have no experience with regular expressions.
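As a trivial illustration (the div id here is invented; match whatever actually wraps the content you want to hide, and note that regexes get fragile on nested markup):

<?php
// Strip a hypothetical <div id="ads">...</div> block from the fetched HTML.
$html = preg_replace('#<div id="ads">.*?</div>#s', '', $html);
echo $html;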
I'm looking for a way to extract some information from this site via PHP:
http://www.mycitydeal.co.uk/deals/london
There is a counter where the time left is displayed, but the information is generated by JavaScript. Since I'm really a JavaScript rookie, I don't really know how to get at the information.
Normally I would extract the information with "preg_match" and some regular expressions. Can someone help me to extract the information (Hrs., Min., Sec.) ?
Jennifer
Extracting the count-down time is not going to be easy, because it is fetched and set purely using JavaScript, which pure PHP cannot execute. You would have to decode the JavaScript code and see what calls it makes to fetch the initial times.
That is not an easy process, and could be changed by the site owners in no time.
Also, doing that, you would be in clear breach of their T&C:
For the avoidance of doubt, scraping of the Website (and hacking of the Website) is not allowed.
I hate to say "no", but in this situation PHP is not the right tool for this. JavaScript requires a browser to run (in this case), and on top of that you probably have a jQuery lib.
The only thing PHP could do is invoke a browser that contains some JavaScript (e.g., GreaseMonkey) that could try to scrape the page for the info. But this is really a job for embedded JavaScript.
As the others have said, you usually can't access JavaScript stuff from PHP. However, JavaScript has to get its data from somewhere, and that is where to start.
I found this in the source code:
<input type="hidden" id="currentTimeLeft" value="3749960"/>
That's the number of milliseconds until whatever it is.
However, this was only present in Firefox, not when fetching the page with wget. I found out it's the cookies that matter, so you'd have to request the page once, store the cookies, and then access it a second time.
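With cURL, that two-step dance looks roughly like this (the cookie file path is arbitrary):

<?php
$url = 'http://www.mycitydeal.co.uk/deals/london';
$jar = '/tmp/mycitydeal.cookies';

// First request: just to receive and store the cookies.
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_COOKIEJAR, $jar);  // write cookies here
curl_exec($ch);
curl_close($ch);

// Second request: send the stored cookies back.
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_COOKIEFILE, $jar); // read cookies from here
$html = curl_exec($ch);
curl_close($ch);

// Now the hidden input should be present.
if (preg_match('/id="currentTimeLeft" value="(\d+)"/', $html, $matches)) {
    echo $matches[1]; // milliseconds left
}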
Please forgive what is most likely a stupid question. I've successfully managed to follow the simplehtmldom examples and get data that I want off one webpage.
I want to be able to set the function to go through all HTML pages in a directory and extract the data. I've googled and googled, but now I'm confused: in my ignorance I thought I could (in some way) use PHP to build an array of the filenames in the directory, but I'm struggling with this.
Also, it seems that a lot of the examples I've seen use cURL. Please can someone tell me how it should be done? There are a significant number of files. I've tried concatenating them, but this only works through an HTML editor; using cat doesn't work.
You probably want to use glob('some/directory/*.html'); (manual page) to get a list of all the files as an array. Then iterate over that and use the DOM stuff for each filename.
You only need cURL if you're pulling the HTML from another web server; since these files are stored on your own web server, glob() is what you want.
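Roughly (this assumes simple_html_dom.php is in your include path and that your existing ->find() calls go where the comment is; the 'td' selector is just an example):

<?php
include 'simple_html_dom.php';

foreach (glob('some/directory/*.html') as $filename) {
    $html = file_get_html($filename); // simplehtmldom loads local files too
    // ... run the same ->find() calls you already use on one page ...
    foreach ($html->find('td') as $cell) {
        echo $cell->plaintext, "\n";
    }
    $html->clear(); // free memory before the next file
}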
Assuming the parser you talk about is working OK, you should build a simple www-spider: look at all the links in a webpage and build a list of "links-to-scan", then scan each of those pages...
You should take care of circular references, though.
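A skeleton of that idea (a deliberately naive sketch; a real spider would also need URL normalisation, a depth limit, and politeness delays):

<?php
// Naive spider with a visited list to avoid circular references.
$queue   = array('http://example.com/');
$visited = array();

while ($url = array_shift($queue)) {
    if (isset($visited[$url])) {
        continue; // already scanned; this is what breaks the cycles
    }
    $visited[$url] = true;

    $html = @file_get_contents($url);
    if ($html === false) {
        continue;
    }

    // ... scrape $html here with your parser ...

    // Collect links to scan next (very crude href extraction).
    if (preg_match_all('/href="(http[^"]+)"/i', $html, $matches)) {
        foreach ($matches[1] as $link) {
            $queue[] = $link;
        }
    }
}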