For academic reasons I need to scrape a North Korean dictionary (I have already informed myself about the copyright issues), which should actually be quite simple: the site is served by a PHP script that just uses ascending numbers in the URL for each dictionary entry. The first entry is at:
uriminzokkiri.com/uri_foreign/dic/index.php?page=1
and the last entry is located at:
uriminzokkiri.com/uri_foreign/dic/index.php?page=313372
So basically I'd assume the easiest way to do this is to write a simple shell script that increments the entry number in a loop and, since the connection is not good, checks whether each page was downloaded successfully and keeps retrying until it was (also trivial).
But then I tried to download a page containing an entry to test this, and it failed. The site uses session cookies, so I first saved the corresponding cookie to a file using the "-c" option and then invoked curl with the "-v" (verbose) and "-b" (read cookies from file) options, which produced the following output:
curl output
These are the request and response headers as being shown by Firebug:
Request/Response headers
I also tried to pass all of these request headers using the "-H" option, but that didn't work either.
Someone has started writing a Python-based scraper for this dictionary, but if this could be done with a simple bash script, that looks a bit like overkill to me.
Does anyone know why the approach I tried so far doesn't work and how this could be achieved?
Many thanks in advance and kind regards
You could add some more HTTP headers, such as:
Origin: which is the domain of the original site you are scraping.
User-Agent: which describes your client configuration; you can copy a typical value from the internet (or from your own browser).
Otherwise, you can copy a bash curl command from your browser's code inspection (developer tools) and then convert it to PHP code; there are online tools that do all of this automatically.
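For illustration, a rough PHP sketch of what that converted code could look like, combined with the page loop and retry idea from the question; the header values, cookie path and output file names below are placeholders, not taken from the real site:

<?php
// Hypothetical sketch: loop over the dictionary pages with PHP cURL, sending a
// browser-like User-Agent plus Origin/Referer headers and reusing a cookie file.
// The header values, cookie path and output file names are placeholders.
$cookieFile = '/tmp/uriminzokkiri_cookies.txt';

for ($page = 1; $page <= 313372; $page++) {
    $url = 'http://uriminzokkiri.com/uri_foreign/dic/index.php?page=' . $page;

    do {
        $ch = curl_init($url);
        curl_setopt_array($ch, array(
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_COOKIEFILE     => $cookieFile,   // read cookies (like curl -b)
            CURLOPT_COOKIEJAR      => $cookieFile,   // write cookies (like curl -c)
            CURLOPT_TIMEOUT        => 30,
            CURLOPT_HTTPHEADER     => array(
                'User-Agent: Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/60.0',
                'Origin: http://uriminzokkiri.com',
                'Referer: http://uriminzokkiri.com/uri_foreign/dic/index.php',
            ),
        ));
        $html = curl_exec($ch);
        $ok   = ($html !== false && curl_getinfo($ch, CURLINFO_HTTP_CODE) == 200);
        curl_close($ch);
    } while (!$ok);   // keep retrying until the page downloads successfully

    file_put_contents(sprintf('entry_%06d.html', $page), $html);
}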
Related
I have a GAE PHP script that accepts a POSTed message consisting of $_POST['version_name'], $_POST['version_comments'] and $_FILES['userfile']['tmp_name'][0].
It runs a file_get_contents against $_FILES['userfile']['tmp_name'][0] and stores the binary away in a CloudSQL DB.
This is the end point for a PHP-driven form, so users can upload new versions (with names / comments) through a friendly GUI from their browser. It works fine.
Now I want to be able to use the same handler as the end point for a Python script. I've written this:
import requests

r = requests.post('http://handler_url_here/',
                  data={'version_name': "foo", 'version_comments': "bar"},
                  files={'userfile': open('version_archive.tar.gz', 'rb')})
version_archive.tar.gz is a non-empty file, but file_get_contents($_FILES['userfile']['tmp_name'][0]) is returning null. Uploading files is a bit tricky with GAE, so I'd prefer to not change the listener - is there some way I can make Python send its payload in the same format the listener is expecting?
$_POST['version_name'] and $_POST['version_comments'] are working as expected.
I'd start by looking at the middle-man, which in this case is the HTTP request. Keep in mind, your Python script isn't posting directly to PHP; it's making an HTTP POST request, which is then getting interpreted by PHP into the $_POST variables and whatnot.
Figure out a way to "capture" or "dump" the HTTP request that Python is sending so you can inspect its contents. (You can find a number of free tools that help you do this in various ways. Reading the HTTP request should be pretty self-explanatory if you're familiar with working with $_GET and $_POST variables in PHP.) Then send a supposedly identical request from PHP, capture the HTTP request, and determine how and why they're different.
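One simple way to do that dump on the PHP side is sketched below; the log path is a placeholder, and the snippet is only meant as a temporary debugging aid in (a copy of) your handler:

<?php
// Hypothetical debugging stub: put this at the top of (a copy of) the handler
// to log exactly how PHP parsed the incoming request, then compare the entry
// written for a browser upload with the one written for the Python upload.
file_put_contents('/tmp/upload_debug.log', print_r(array(
    'headers' => function_exists('getallheaders') ? getallheaders() : array(),
    'POST'    => $_POST,
    'FILES'   => $_FILES,   // shows whether 'userfile' arrived, and with what structure
), true), FILE_APPEND);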
Good luck!
I have this problem when I'm trying to use wget to retrieve the OUTPUT of a specific PHP script, but it looks like this site generates two files with the same name.
The first one is smaller and the second one, in the sequence, is the correct one. The problem is that every time I try the wget command, I end up with the smaller output file, which does not contain the desired info :(
Is there a way to download the correct file using wget, perhaps by adding some sort of identifier to the link, to make sure I'm downloading the right one?
Here is the command I've been trying:
$ wget http://www.fernsehen.to/index.php
If you run this and use Fiddler or Wireshark to capture the traffic, you'll end up with two requests to "http://www.fernsehen.to/index.php", and I need the bigger file of the two.
P.S. To manually get the desired output file, you can open http://www.fernsehen.to/index.php in Firefox or Chrome and view the source.
Thank you in advance!
What you want is not really practical. When you visit that page, the server first generates a small file with a load of JavaScript that detects browser features and sends them back to the server in a stateful manner, in order to produce the exact code required for your browser, probably mainly things like supported video codecs. They probably also do some session fingerprinting for DRM purposes, to stop people from doing exactly what you're trying to do.
wget cannot emulate this behaviour because it is not a full browser: it cannot execute all that JavaScript, and even if it could, it would not supply the browser-like data properly. You'd have to write an extensive piece of custom code that exactly mimics everything the in-between page does to achieve the intended effect. Possible, but not easy, and most certainly not with a basic general-purpose tool like wget.
I'm fetching pages with cURL in PHP. Everything works fine, but some parts of the page are calculated with JavaScript a fraction of a second after the page has loaded. cURL sends the page's source back to my PHP script before those JavaScript calculations are done, which results in wrong end results. The calculations on the site are fetched via AJAX, so I can't reproduce them in an easy way. I also have no access to the target page's code, so I can't tweak that page to fit my (cURL) fetching needs.
Is there any way I can tell cURL to wait until all dynamic traffic is finished? It might be tricky, because some scripts keep sending data back to another domain, which might result in long hangs. But at least I could then test whether I get the correct results back.
My Developer toolbar in Safari indicates the page is done in about 1.57s. Maybe I can tell cURL statically to wait for 2 seconds too?
I wonder what the possibilities are :)
cURL does not execute any JavaScript or download any files referenced in the document. So cURL is not the solution for your problem.
You'll have to use a browser on the server side, tell it to load the page, wait for X seconds and then ask it to give you the HTML.
Look at: http://phantomjs.org/ (you'll need to use Node.js; I'm not aware of any PHP solutions).
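If you want to drive it from PHP anyway, a possible bridge is to shell out to PhantomJS; in the sketch below, dump.js is assumed to be your own small PhantomJS script that loads the URL, waits for the JavaScript to finish and prints page.content to stdout:

<?php
// Hypothetical glue code: dump.js is assumed to be your own PhantomJS script
// that opens the given URL, waits for the JavaScript to finish and writes the
// rendered HTML (page.content) to stdout.
$url  = 'http://example.com/page-with-ajax';
$html = shell_exec('phantomjs dump.js ' . escapeshellarg($url));

if ($html === null) {
    die('PhantomJS produced no output');
}
// ... now parse $html just as you would a cURL response ...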
With Peter's advice and some research, I have found a solution. It's late, but I hope someone finds it helpful.
All you need to do is request the AJAX call directly. First, load the page that you want in Chrome, go to the Network tab and filter by XHR.
Now find the AJAX call that you want, and check its response to verify it.
Right-click on the name of the AJAX call and select Copy -> "Copy as cURL (bash)".
Go to https://reqbin.com/curl, paste the cURL command and click Run. Check the response content.
If it's what you want then move to the next step.
Still in the reqbin window, click Generate code and choose the language you want it translated into, and you will get the desired code. Now integrate it into your own code however you want (a rough sketch of what the generated PHP can look like follows the tips below).
Some tips: if a test run on your own server returns a 400 error or nothing at all, set POSTFIELDS to empty. If it returns 301 Moved Permanently, check whether your URL should be https or not.
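For PHP, the generated code is roughly of this shape (only a sketch; the endpoint, cookie and headers below are placeholders for whatever your copied request actually contains):

<?php
// Hypothetical example of calling the AJAX endpoint directly with PHP cURL.
// Replace the URL, headers and POST body with the values from your copied request.
$ch = curl_init('https://example.com/ajax/endpoint.php');
curl_setopt_array($ch, array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => '',              // per the tip above: empty if you get a 400
    CURLOPT_HTTPHEADER     => array(
        'X-Requested-With: XMLHttpRequest',
        'Cookie: PHPSESSID=placeholder',
    ),
));
$response = curl_exec($ch);
curl_close($ch);

echo $response;   // should match what you saw in the Network tab / on reqbin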
I don't know a lot about the page you are retrieving or the calculations you want to include, but one option could be to cURL straight to the URL serving those AJAX requests. Use something like Firebug to inspect the AJAX calls being made on your target page, and you can figure out the URL and any parameters passed. If you do need the full web page, maybe you can cURL both the web page and the AJAX URL and combine the two in your PHP code, but then it starts to get messy.
There is one quite tricky way to achieve this using PHP. If you really want it to work in PHP, you could potentially use a Codeception setup in conjunction with Selenium and use the Chrome WebDriver in headless mode.
Here are the general steps to get it working.
Make sure you have Codeception in your PHP project:
https://codeception.com
Download the Chrome WebDriver:
https://chromedriver.chromium.org/downloads
Download Selenium:
https://www.seleniumhq.org/download/
Configure everything accordingly, following the documentation of the Codeception framework.
Write a Codeception test in which you can use expressions like $I->wait(5) to wait 5 seconds, or $I->waitForJs('js expression here') to wait for a JS script to complete on the page (a minimal sketch follows these steps).
Run the test written in the previous step using the command php vendor/bin/codecept run path/to/test
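A minimal sketch of such a test (class name, URL and output file are placeholders; it assumes an acceptance suite configured with the WebDriver module):

<?php
// Hypothetical Codeception acceptance test: load a JS-heavy page, wait for the
// scripts to finish, then save the fully rendered HTML.
class FetchRenderedPageCest
{
    public function fetchPage(AcceptanceTester $I)
    {
        $I->amOnUrl('https://example.com/page-with-ajax');   // placeholder URL
        $I->wait(5);                            // or $I->waitForJs(...) as mentioned above
        $html = $I->grabPageSource();           // HTML after the JavaScript has run
        file_put_contents(codecept_output_dir('rendered.html'), $html);
    }
}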
Let me just start by saying I know almost nothing about PHP, but I think it may prove to be the best way to do what I'm trying to do. I'd like to grab the value of a variable from an external page so that I can then process it for the creation of graphs and statistics on my page. An example page that I'm trying to get the variable from (requires a Facebook account) is http://superherocity.klicknation.com/game/pages/battle_replay.php?battle=857337182
The variable name is fvars and it contains data about what the 2 players used for attacks, how much damage they did, etc. Ultimately what I'd like is to provide a page with a form where a player can go and plug in their replay link (like above) and get a nice neat detailed breakdown of the battle.
At the very least, if someone could explain to me how to just echo out the value of fvars after a form submission with the replay URL as input, it would help out immensely! I've tried looking at some PHP references and other posts here, but have so far been lost. :(
Thank you for any help or guidance.
One way you could approach it is to use Selenium. You would need to set up the Selenium server and a browser and then write a Selenium script to fetch the page for you. The key point here is that Selenium can programmatically drive a Firefox client with JavaScript, Facebook logins and so on: everything you have in your ordinary Firefox.
I run Selenium in a Linux environment and control it through PHP CLI scripts. I run the Java selenium-server-standalone along with a framebuffered X server and Firefox. The PHPUnit test library already has a Selenium extension, though obviously you wouldn't be using it for testing here.
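For what it's worth, a rough sketch of what a PHP-driven fetch can look like with the phpunit-selenium (Selenium2TestCase) extension; the class name and paths are mine, the Facebook login steps are left out, and a Selenium server on localhost:4444 is assumed:

<?php
// Rough sketch using the phpunit-selenium Selenium2TestCase API; assumes a
// Selenium server on localhost:4444 and a Firefox profile that is already
// logged in to Facebook (the login steps themselves are omitted here).
class ReplayFetchTest extends PHPUnit_Extensions_Selenium2TestCase
{
    protected function setUp()
    {
        $this->setBrowser('firefox');
        $this->setBrowserUrl('http://superherocity.klicknation.com/');
    }

    public function testGrabReplaySource()
    {
        $this->url('/game/pages/battle_replay.php?battle=857337182');
        // the rendered source should now contain the fvars assignment
        file_put_contents('/tmp/replay.html', $this->source());
    }
}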
You can get the contents of any webpage like so:
$homepage = file_get_contents('http://www.example.com/');
echo $homepage;
And then just use regex or basic searching to find the variable you need in $homepage. The problem is that you need to be logged in via Facebook. I know of no current way to do this dynamically with PHP.
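For the extraction step itself, once you somehow have the HTML in $homepage, something along these lines might work; the pattern is only a guess at how the page embeds fvars and will probably need adjusting:

<?php
// Hypothetical extraction: look for an assignment like  fvars = {...};  in the
// fetched HTML and print whatever it contains. Adjust the pattern to match the
// actual markup of the replay page.
if (preg_match('/fvars\s*=\s*(\{.*?\})\s*;/s', $homepage, $matches)) {
    echo $matches[1];
} else {
    echo 'fvars not found';
}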
Mike
Edit: found an SO question that addresses this exact issue - Scraping from a website that requires a login?
I am trying to get data from a site and be able to manipulate it to display it on my own site.
The site contains a table with ticks and is updated every few hours.
Here's an example: http://www.astynomia.gr/traffic-athens.php
This data is there for everyone to use, and I will mention them on my own site just to be sure.
I've read something about php's cURL but I have no idea if this is the way to go.
Any pointers/tutorials, or code anyone could provide so I can start somewhere would be very helpful.
Also any pointers on how I can get informed as soon as the site is updated?
If you want to crawl the page, use something like Simple HTML DOM Parser for PHP. That'll serve your purpose.
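A small sketch of what that could look like with Simple HTML DOM (the 'table tr' selector is a guess; inspect the page to find the right one):

<?php
// Hypothetical example using Simple HTML DOM Parser to read the traffic table.
include 'simple_html_dom.php';

$html = file_get_html('http://www.astynomia.gr/traffic-athens.php');

// Iterate over the table rows; 'table tr' is a placeholder selector and may
// need to be narrowed down to the specific table on that page.
foreach ($html->find('table tr') as $row) {
    echo trim($row->plaintext), "\n";
}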
First, your web host/localhost should have the php_curl extension enabled.
To start with, you should read a bit here. If you want to jump in directly, there is a simple function in Why I can't get website content using CURL; you just have to change the values of the variables $url and $timeout.
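That function is essentially a plain cURL GET wrapper; a sketch of the general shape (not the exact code from the linked answer) looks like this:

<?php
// Rough equivalent of the linked helper: fetch a URL with cURL and a timeout.
// This is only the general shape, not the exact code from that answer.
function get_data($url, $timeout = 30)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}

$page = get_data('http://www.astynomia.gr/traffic-athens.php');
// ... parse $page here; run this script from cron to refresh it every 2 hours ...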
Lastly, to get the updated data every 2 hours, you will have to run the script as a cron job. Please refer to this post:
PHP - good cronjob/crontab/cron tutorial or book