PHP Magento Screen Scraping

I am trying to scrape a supplier's Magento site in an effort to save some time, because there are around 2,000 products I need to gather info for. I'm totally OK with writing a screen scraper for pretty much anything, but I've encountered a major problem. I'm using file_get_contents to gather the HTML of the product page.
The problem is:
You need to be logged in to view the product page. It's a standard Magento login, so how can I get round this in my screen scraper? I don't require a full script, just advice on a method.

Using stream_context_create() you can specify headers to be sent when calling file_get_contents().
What I'd suggest is: open your browser and log in to the site, open up Firebug (or your favorite cookie viewer), grab the cookies, and send them with your request.
Edit: Here's an example from PHP.net:
<?php
// Create a stream context that sends a custom header and cookie
$opts = array(
    'http' => array(
        'method' => "GET",
        'header' => "Accept-language: en\r\n" .
                    "Cookie: foo=bar\r\n"
    )
);
$context = stream_context_create($opts);

// Open the file using the HTTP headers set above
$file = file_get_contents('http://www.example.com/', false, $context);
?>
Edit (2): This is out of the scope of your question, but if you are wondering how to scrape the website afterwards, you could look into DOMDocument::loadHTML(). This will essentially give you the functions you need (i.e. XPath queries, getElementsByTagName, getElementById) to scrape what you need.
If you want to scrape something simple, you can also use a regex with preg_match_all().
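For instance, here's a minimal DOM parsing sketch, continuing from the $file fetched above; the "product-name" class is hypothetical, so inspect the real markup for the selectors you actually need:
<?php
// Minimal parsing sketch, continuing from the $file fetched above.
// The "product-name" class is hypothetical - inspect the real markup
// for the selectors you actually need.
$doc = new DOMDocument();
libxml_use_internal_errors(true);   // real-world HTML is rarely valid
$doc->loadHTML($file);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
foreach ($xpath->query('//h1[@class="product-name"]') as $node) {
    echo trim($node->textContent), "\n";
}
?>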

If you're familiar with cURL, this should be relatively simple to do in a day or so. I've created some similar apps that log in to banks to retrieve data, which of course also requires authentication.
Below is a link with an example of how to use CURL with cookies for authentication purposes:
http://coderscult.com/php/php-curl/2008/05/20/php-curl-cookies-example/
If you can grab the output of the page, you can parse out your results with a regex. Alternatively, you can use a class like Snoopy to do this work for you:
http://sourceforge.net/projects/snoopy/
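For completeness, here's a rough sketch of the cURL approach with a cookie jar. The loginPost URL and the login[username]/login[password] field names are the usual Magento 1 ones, and supplier.example.com is made up, so verify everything against the actual login form; newer Magento versions also expect a form_key token in the POST.
<?php
// Rough sketch: log in with cURL and keep the session in a cookie jar,
// then request a protected product page. The loginPost URL and field
// names are the usual Magento 1 ones; the host and product URL are
// placeholders - verify against the real site before relying on this.
$jar = tempnam(sys_get_temp_dir(), 'cookies');

$ch = curl_init('http://supplier.example.com/customer/account/loginPost/');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array(
    'login' => array('username' => 'you@example.com', 'password' => 'secret'),
)));
curl_setopt($ch, CURLOPT_COOKIEJAR, $jar);    // save the session cookie
curl_setopt($ch, CURLOPT_COOKIEFILE, $jar);   // ...and send it back
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_exec($ch);

// Same handle, same cookie jar: now fetch a product page while logged in
curl_setopt($ch, CURLOPT_URL, 'http://supplier.example.com/some-product.html');
curl_setopt($ch, CURLOPT_HTTPGET, true);
$html = curl_exec($ch);
curl_close($ch);
?>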

Related

file_get_contents vs. dom->loadHTMLFile

I've been making a PHP crawler that needs to get all links from a site and follow those links (instead of clicking them manually or doing client-side JS).
I have read these:
How do I make a simple crawler in PHP?
How do you parse and process HTML/XML in PHP?
and more, and I decided to follow the first one.
So far it has been working, but I am baffled by the difference between using file_get_contents and dom->loadHTMLFile. Can you please enlighten me about these two approaches and their implications, with pros and cons or a simple comparison scenario?
Effectively these methods do the same thing. However, with file_get_contents() you need to store the result, at least temporarily, in a string variable before passing it to DOMDocument::loadHTML(). This leads to higher memory usage in your application.
Some sites may require you to set special header values, or to use an HTTP method other than GET. If you need this, you have to specify a so-called stream context. You can do this for both of the above methods using stream_context_create():
Example:
$opts = array(
    'http' => array(
        'method' => "GET",
        'header' => "Accept-language: en\r\n" .
                    "Cookie: foo=bar\r\n"
    )
);
$ctx = stream_context_create($opts);
You can set this context using both of the above ways, but they differ in how to achieve this:
// With file_get_contents ...
$html = file_get_contents($url, false, $ctx);

// With DOM
libxml_set_streams_context($ctx);
$doc = new DOMDocument();
$doc->loadHTMLFile($url);
It remains to be said that the cURL extension gives you even more control over the HTTP transfer, which might be necessary in some special cases.
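For comparison, here's a minimal cURL sketch of the same GET request as above; the timeout option hints at the extra control (redirects, proxies, SSL options) cURL offers over plain streams:
<?php
// Minimal cURL equivalent of the stream-context example above: the same
// GET request with a custom header and cookie, plus a timeout.
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Accept-language: en'));
curl_setopt($ch, CURLOPT_COOKIE, 'foo=bar');
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
curl_close($ch);

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
?>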

Stuck getting price info from a web page with multiple currencies

page: http://www.nastygal.com/accessories/minnie-bow-clutch
code: $html = file_get_contents('http://www.nastygal.com/accessories/minnie-bow-clutch');
The $html always contains the USD price of the product even when I change the currency on the upper right of the page. How do I capture the html that has the CAD price when I change the currency of the page to CAD?
It looks like currency preferences are being saved in a cookie named: CURRENCYPREFERENCE
Since it's not your browser making the connection to retrieve that view, you're likely not sending any cookie data along with your request.
I believe example #4 here will get you what you need:
http://php.net/manual/en/function.file-get-contents.php
It seems as though the country and currency selection are stored in cookies.
I'm assuming you're going to have to pass those values along with your file_get_contents() call. See: PHP - Send cookie with file_get_contents
EDIT #1
To follow up on my comment, I just tested this:
// Create a stream context that sends the currency-preference cookie
$opts = array(
    'http' => array(
        'method' => "GET",
        'header' => "Accept-language: en\r\n" .
                    "Cookie: CURRENCYPREFERENCE=cad\r\n"
    )
);
$context = stream_context_create($opts);

// Open the file using the HTTP headers set above
$file = file_get_contents('http://www.nastygal.com/accessories/minnie-bow-clutch', false, $context);
print_r($file);
And was able to get back the page with the CAD price.
EDIT #2:
In response to your second comment: those were important details. What does your bookmarklet do with the scraped contents? Are you saving a copy of the bookmarked product page on your own website? Regardless, you're going to have to modify your bookmarklet to check the user's cookies before submitting the request that runs file_get_contents().
I was able to access my cookies from nastygal.com using the following simple bookmarklet example. Note: nastygal.com uses jQuery and the jQuery UI cookie plugin. If you're looking for a more generic solution, you should not rely on these scripts being there:
javascript:(function(){ console.log($.cookie('CURRENCYPREFERENCE')); }());
Output in the JS console:
cad

cURL alternatives to get a POST answer on a webpage

I would like to get the resulting web page of a specific form submit. This form uses POST, so my current goal is to be able to send POST data to a URL and get the HTML content of the result into a variable.
My problem is that I cannot use cURL (it's not enabled), which is why I'm asking whether another solution is possible.
Thanks in advance
See this, using fsockopen:
http://www.jonasjohn.de/snippets/php/post-request.htm
fsockopen() is in the PHP standard library, so every PHP version from 4 onward has it :)
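A condensed sketch of that snippet's idea, building the POST request by hand over fsockopen(); the host, path, and field names are placeholders:
<?php
// Sketch: build a raw POST request over fsockopen() (no cURL needed).
// Host, path and field names are placeholders - adapt to your form.
$host = 'www.example.com';
$path = '/form-handler.php';
$data = http_build_query(array('field1' => 'value1'));

$fp = fsockopen($host, 80, $errno, $errstr, 30);
if (!$fp) {
    die("$errstr ($errno)");
}

fwrite($fp, "POST $path HTTP/1.1\r\n");
fwrite($fp, "Host: $host\r\n");
fwrite($fp, "Content-Type: application/x-www-form-urlencoded\r\n");
fwrite($fp, "Content-Length: " . strlen($data) . "\r\n");
fwrite($fp, "Connection: close\r\n\r\n");
fwrite($fp, $data);

// Read back the raw response (headers + body, split on first blank line)
$response = '';
while (!feof($fp)) {
    $response .= fgets($fp, 1024);
}
fclose($fp);
?>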
Try file_get_contents() with a stream context:
$opts = array(
    'http' => array(
        'method'  => "POST",
        // the Content-type header is needed for form-encoded POST data
        'header'  => "Content-type: application/x-www-form-urlencoded\r\n",
        'content' => http_build_query(array('status' => $message)),
    )
);
$context = stream_context_create($opts);

// Open the file using the HTTP headers set above
$file = file_get_contents('http://www.example.com/', false, $context);

How can I do an HTTP POST to a URL called by an iframe

I want to make an HTTP POST to an outside URL using PHP. By outside URL I mean the URL is not hosted on my servers. The URL is called in an iframe. I need to know if this is technically possible.
I tried doing this using cURL, but cURL creates its own session with the remote server, while I want to use the session which the browser has already created.
Please let me know your thoughts on this.
<?php
// PHP code to make the HTTP POST goes here
?>
<iframe src="outside url to be posted" height="100" width="100"></iframe>
The outside URL is Google Calendar, so when I call it, if the user is already logged into Google, their calendar should display, and I need to make an HTTP POST to the calendar to save a calendar event.
I hope this makes it clearer what I am trying to achieve.
Update - Current Answer
After the update to your question, here's a different answer that I think addresses your issue more closely.
I think the question you are asking involves doing things with a user's credentials on another site. This is dancing dangerously close to Cross-site Request Forgery.
If you only do the POSTing when the user requests that you do it, it's a little better (I guess) but still inadvisable.
Why don't you use the Google Calendar API to do what you need?
Previous Answer
You need to tell cURL to use a particular session. Because PHP is managing the session, you'll also need to tell PHP to stop writing to the session while cURL uses it.
Try this:
$strCookie = 'PHPSESSID=' . $_COOKIE['PHPSESSID'] . '; path=/';
session_write_close();

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_COOKIE, $strCookie);
$response = curl_exec($ch);
curl_close($ch);
$_COOKIE['PHPSESSID'] will be the identifier for your PHP session, and $url will be the URL you've pulled out of the iframe.
This is taken virtually verbatim from this blog post. It was one of the first links on Google, so I didn't do a lot of extra digging.
I've done a bit of messing with cURL and PHP sessions, so this looks right based on what I remember.
Edit:
By the way, you should reference this SO question for the method to do POSTs with cURL. I assume you at least have some idea of how to do this, but there it is in case you need a refresher.
Also (in case it's not clear already), you can run as many
curl_setopt($handle, (CURL OPTION), (CURL VALUE));
lines as you need to configure cURL the way you need it: POST values, session settings, etc.
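For example, a bare-bones POST setup might look like this (the field names here are placeholders):
// Bare-bones POST setup - the field names are placeholders:
curl_setopt($handle, CURLOPT_POST, true);
curl_setopt($handle, CURLOPT_POSTFIELDS, http_build_query(array(
    'title' => 'My event',
    'date'  => '2012-01-01',
)));
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);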
Good luck!
It's JavaScript, not PHP.
<form id="post_form" method="post" action="http://outside-url-to-post-to/" target="post_frame">
    <input type="hidden" name="field1" value="value1">
    <!-- ... other fields ... -->
</form>
<script type="text/javascript">
    document.getElementById("post_form").submit();
</script>
<iframe name="post_frame" height="100" width="100"></iframe>
Right off the file_get_contents() man page:
<?php
// Create a stream context that carries the POST body
$opts = array(
    'http' => array(
        'method'  => "POST",
        'header'  => "Accept-language: en\r\n" .
                     "Content-type: application/x-www-form-urlencoded\r\n",
        // the POST content goes in 'content', not in a cookie
        'content' => http_build_query(array('field1' => 'value1')),
    )
);
$context = stream_context_create($opts);

// Open the file using the HTTP headers set above
$file = file_get_contents('http://www.example.com/', false, $context);
?>
<div><?= $file ?></div>
Not really an iframe, but the same idea.

How to load a URL and only get back the last 20k of it

I have been made aware of the Accept-Ranges header.
I have a URL that I am calling that always returns a 2MB file. I don't need that much and only need the last 20-50k of it.
I am not sure how to go about using it. Would I need to use cURL? I am currently using file_get_contents().
Would someone be able to provide me with an example / tutorial?
Thanks.
EDIT: If this isn't possible, then what is this post on about? Here ...
EDIT: Ulrika! I'm not insane.
This is possible using the Range header, provided the server supports it. See the HTTP 1.1 spec. You would want to send a header in the following format in your request:
Range: bytes=-50000
This would give you the last 50,000 bytes. Adjust to whatever you need.
You can specify this header in file_get_contents using a context. For example:
// Create a stream context that asks for only the last 50,000 bytes
$opts = array(
    'http' => array(
        'method' => "GET",
        'header' => "Range: bytes=-50000\r\n"
    )
);
$context = stream_context_create($opts);

// Open the file using the HTTP headers set above
$file = file_get_contents('http://www.example.com/', false, $context);
If you were to file_get_contents() the URL and dump the result to a pass-through 'cache' file on disk, you could then use the Unix/Linux tail -c to grab back only the last 20kb or so. This doesn't mitigate the actual transfer, but it gets that 20kb into the application.
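A sketch of that cache-file idea in plain PHP, using fseek() instead of shelling out to tail; the cache path and URL are placeholders:
<?php
// Download once, then read back only the last 20kb. fseek() with
// SEEK_END stands in for the tail -c step. Path and URL are placeholders.
$cache = '/tmp/remote-file.cache';
file_put_contents($cache, file_get_contents('http://www.example.com/bigfile'));

$fp = fopen($cache, 'rb');
fseek($fp, -20480, SEEK_END);   // seek fails harmlessly on short files
$tail = stream_get_contents($fp);
fclose($fp);
?>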
This is indeed possible - see this question for an example of the HTTP headers sent and received
You can't do that. You're going to have to load the entire file (which is sent in its entirety, sequentially, by the source server) and just discard most of it.
What you're asking is like "I'm tuning to this radio station on my car stereo and I only want to hear the last 5 minutes of the show, without having to wait for the rest to complete or change channels".
