$html = file_get_html('http://www.livelifedrive.com/');
echo $html->plaintext;
I've no problem scraping other websites but this particular one returns gibberish.
Is it encrypted or something?
Actually, the gibberish you see is gzipped content.
When I fetch the content with hurl.it, for instance, here are the headers returned by the server:
GET http://www.livelifedrive.com/malaysia/ (the URL http://www.livelifedrive.com/ resolves to http://www.livelifedrive.com/malaysia/)
Connection: keep-alive
Content-Encoding: gzip <--- The content is gzipped
Content-Length: 18202
Content-Type: text/html; charset=UTF-8
Date: Tue, 31 Dec 2013 10:35:42 GMT
P3p: CP="NOI ADM DEV PSAi COM NAV OUR OTRo STP IND DEM"
Server: nginx/1.4.2
Vary: Accept-Encoding,User-Agent
X-Powered-By: PHP/5.2.17
So once you have scraped the content, unzip it. Here is sample code that falls back to a manual implementation when gzdecode() is unavailable (it was only added in PHP 5.4):
if ( ! function_exists('gzdecode'))
{
    /**
     * Decode gzip-encoded data
     *
     * http://php.net/manual/en/function.gzdecode.php
     *
     * Alternative: http://digitalpbk.com/php/file_get_contents-garbled-gzip-encoding-website-scraping
     *
     * @param  string $data gzencoded data
     * @return string inflated data
     */
    function gzdecode($data)
    {
        // strip the 10-byte gzip header and 8-byte footer, then inflate
        return gzinflate(substr($data, 10, -8));
    }
}
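For example, a minimal usage sketch with the simple_html_dom helpers from the question (checking the gzip magic bytes first is an assumption about how this server responds):
$raw = file_get_contents('http://www.livelifedrive.com/');
if (substr($raw, 0, 2) === "\x1f\x8b") {
    $raw = gzdecode($raw); // inflate before parsing
}
$html = str_get_html($raw); // simple_html_dom's string loader
echo $html->plaintext;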
References:
http://www.php.net/manual/en/function.gzdecode.php#106397
http://digitalpbk.com/php/file_get_contents-garbled-gzip-encoding-website-scraping
There's nothing really like site encryption: if the content can reach your browser and is HTML, it can be scraped.
It's probably because the site uses a lot of JavaScript and Flash, which cannot be scraped by an HTML parser. Even Google is only just beginning to make inroads into accurately scraping Flash and JavaScript.
To scrape a site in its full browser glory, try Selenium; see the sketch after the links below.
Links:
https://code.google.com/p/php-webdriver-bindings/
https://groups.google.com/forum/#!topic/selenium-users/Rj6BYEkz9Q0
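As a rough sketch, using the facebook/php-webdriver package (a different project than the bindings linked above) and assuming a Selenium server is listening on localhost:4444:
require 'vendor/autoload.php';
use Facebook\WebDriver\Remote\RemoteWebDriver;
use Facebook\WebDriver\Remote\DesiredCapabilities;
// drive a real browser so JavaScript runs before we read the page
$driver = RemoteWebDriver::create('http://localhost:4444/wd/hub', DesiredCapabilities::firefox());
$driver->get('http://www.livelifedrive.com/');
echo $driver->getPageSource(); // the DOM after JavaScript has executed
$driver->quit();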
A neat tip for finding out what you can scrape with an HTML parser: try disabling JavaScript and Flash in your browser and loading the website. The content you can still view is easily scrapable; for the rest, you have to be a little more clever in your methods.
Maybe the files on their servers aren't saved as UTF-8?
I've tried your function on several sites, and sometimes it works (on servers I know actually save their files as UTF-8, not just servers that claim the content is UTF-8) and other times it gives gibberish.
Try testing it yourself on your local machine, parsing files saved as UTF-8 and other encodings, and see what comes up...
$html->plaintext;
This will give you only the text, but if you need the HTML then use
$html->innertext;
For more information you can refer to http://simplehtmldom.sourceforge.net/manual.htm
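A quick illustration (the URL is a placeholder):
$html = file_get_html('http://www.example.com/');
echo $html->plaintext; // text only, tags stripped
echo $html->innertext; // the markup itself, tags included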
Related
I would like to know if it's possible to download a specific file (JSON) from GitHub into a directory, without downloading all the files via a zip.
I have this
$json = #file_get_contents($this->GetGithubRepo() . '/' . $module_name . '/contents/' . $this->ModuleInfosJson . '?ref=master', true, $this->context );
This line reads the JSON; I would like to write that JSON to a directory.
The objective is to create a cache and read from the cache before reading from GitHub.
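Something like this rough sketch is what I'm after ($cacheDir is a made-up name; the rest comes from my existing code):
$cacheFile = $cacheDir . '/' . $module_name . '.json';
if (is_file($cacheFile)) {
    // read from the cache first
    $json = file_get_contents($cacheFile);
} else {
    // fall back to GitHub and write the cache for next time
    $json = @file_get_contents($this->GetGithubRepo() . '/' . $module_name . '/contents/' . $this->ModuleInfosJson . '?ref=master', true, $this->context);
    if ($json !== false) {
        file_put_contents($cacheFile, $json);
    }
}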
Thank you
GitHub is not "in love" with this behavior, but I have an entire framework that runs on the same paradigm. I do, however, use the zip. You can hit the raw content following this pattern:
https://raw.githubusercontent.com/YOURHANDLE/THE_REPO/THE_BRANCH/FILE/PATH/ETC
Look for the "raw" option on a particular file when browsing.
Here is a config file from one of my repos, in a format similar to what you want:
https://raw.githubusercontent.com/datamafia/ShopifyETL/master/config.cfg
should currently return
[all]
SHOPIFY_KEY=YOUR-SHOPIFY-API-KEY
SHOPIFY_PASSWORD=YOUR-SHOPIFY-API-PW
SHOPIFY_STORE=YOUR-SHOPIFY-STORE-DOES-NOT-EXIST
SHOPIFY_BASE_URL=SHOPIFY_STORE.myshopify.com-or-custom-FQDN
Pay attention to document type and encoding; these could trip you up. Your JSON may not be served as JSON according to the header (it likely won't be).
One final problem, beyond encoding, is private repos: once a repo is private, a big can of work lands on your plate to authenticate and see the data.
The response headers, mildly changed:
Date: Thu, 13 Apr 2017 00:02:48 GMT
Via: 1.1 varnish
Cache-Control: max-age=300
Etag: "154ec087bc75e501a18e72d4e14a6f17bc2f706b"
Connection: keep-alive
X-Served-By: cache-dfw1840-DFW
X-Cache: HIT
x-cache-hits: 1
X-Timer: S1492012345.3876515,VS0,VE0
Vary: Authorization,Accept-Encoding
Access-Control-Allow-Origin: *
X-Fastly-Request-ID: XYZABCXYZABCXYZABCXYZABCXYZABC
Expires: Thu, 13 Apr 2017 00:07:48 GMT
Source-Age: 35
The document type is "plain" (text), so some casting and checking will be important. There are tools in PHP to handle the incoming data and use it as JSON, as in the sketch below. Good luck.
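A minimal sketch along those lines, assuming a public repo and a made-up file name:
$url = 'https://raw.githubusercontent.com/YOURHANDLE/THE_REPO/master/module.json';
$raw = file_get_contents($url); // served as text/plain, not application/json
$data = json_decode($raw, true);
if ($data === null && json_last_error() !== JSON_ERROR_NONE) {
    // the fetch failed or the file was not valid JSON
    die('Could not decode: ' . json_last_error_msg());
}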
The following 'code' is sometimes (randomly) printed on a webpage after a refresh.
>HTTP/1.1 200 OK
>Date: Fri, 18 Mar 2016 09:05:03 GMT
>Server: Apache
>X-Powered-By: PHP/5.3.6-pl0-gentoo
>X-Frame-Options: DENY
>X-XSS-Protection: 1; mode=block
>X-Content-Type-Options: nosniff
>Expires: Thu, 19 Nov 1981 08:52:00 GMT
>Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
>Pragma: no-cache
>Keep-Alive: timeout=15, max=86
>Connection: Keep-Alive
>Transfer-Encoding: chunked
>Content-Type: text/html
>5
(The last number, 5 in this case, is random; the rest is constant.)
This is what I tried to solve this annoying 'bug':
Removing HTML <head> contents
Removing HTML <body> contents
Removing AJAX (XHR) calls
Updating Smarty (engine that parses the templates)
PHP trim() around output to prevent unnecessary whitespace before or after the <!DOCTYPE> and <html> tags
Killing almost all PHP code (too much to explain here, but since I stripped it down completely I am 99% sure it is not the server-side (PHP) code)
Looking for PHP functions that are able to print these headers (greps for headers_list, getallheaders, apache_request_headers, etc.)
Tried multiple pages, same results, no matter its contents.
My customer sees the same results in the Microsoft Edge browser.
Updated other components, like browser detection
Added PHP ob_start();
Validated HTML
Made sure to clean Javascript console errors (now clean)
Gave WireShark for Windows a go, to look at which headers are received, but this was too difficult for me. (Should I retry?)
This problem sounds a lot like mine, but it didn't help fix mine: https://bugzilla.mozilla.org/show_bug.cgi?id=229710
Checked other Stack Overflow questions. Could not find a matching question/solution.
More, which I forgot :)
Notes:
The site is served over HTTPS with a valid certificate.
Here is the site link: https://www.10voordeleraar.nl
Attached screenshot links below.
The funny thing is, this only happens in Microsoft Edge, and only sometimes. It behaves properly in all other browsers, as do my other sites.
Regards,
Laird
Screenshots:
Printed HTTP headers example on site top
Printed HTTP headers example in DOM inspect
I'm trying to submit a (Java servlet) form using cURL in PHP, but it seems like there is a problem with the parameters. I can't really understand why it's happening, since I'm testing cURL with a parameter string identical to the one used by the browser.
After some research in various forums I wasn't able to find a solution to my particular problem.
This is the POSTFIELDS string generated by the browser (and working):
submissionType=pd&__multiselect_PostCodeList=&selectedPostCode=01&selectedPostCode=02&selectedPostCode=03&__multiselect_selectedPostCodes=
and I'm using an identical string (for testing) in the PHP script, but I'm getting an HTML page as the answer saying "Missing parameters in search query".
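For reference, this is roughly how I build the request in PHP (the servlet URL is a placeholder):
$post = 'submissionType=pd&__multiselect_PostCodeList=&selectedPostCode=01&selectedPostCode=02&selectedPostCode=03&__multiselect_selectedPostCodes=';
$ch = curl_init('http://example.com/servlet/search');
curl_setopt($ch, CURLOPT_POST, true);
// a raw string (not an array), so the repeated selectedPostCode keys survive
curl_setopt($ch, CURLOPT_POSTFIELDS, $post);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);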
I believe that the form
__multiselect_PostCodeList=
&selectedPostCode=01
&selectedPostCode=02
&selectedPostCode=03
&__multiselect_selectedPostCodes=
is quite weird (I've never seen this before) and I'm wondering whether it could be the reason why the POST is not working from cURL.
The form seems to be successfully submitted, since I'm getting these response headers:
HTTP/1.1 200 OK
Date: Wed, 07 Aug 2013 08:02:56 GMT
Content-length: 1791
Content-type: text/html;charset=UTF-8
X-Powered-By: Servlet/2.4 JSP/2.0
Vary: Accept-Encoding
Content-Encoding: gzip
Connection: Keep-Alive
Note: I tried submitting the same form from Lynx and I got the same result ("Missing parameters in search query"). So it seems like it only works from browsers like Mozilla or Chrome.
Any help would be really appreciated; I don't have any more ideas at this point.
Thanks!
Oscar
I've been stuck on this problem for a while and I'm pretty sure it must be something quite simple that hopefully someone out there can shed some light on.
So, I'm currently using jQuery UI's Autocomplete plugin to reference an external PHP file which gets information from a database (in an array) and sends it to JSON output.
From my PHP file (search.php) when I do this:
echo json_encode($items);
My output (when looking at the search.php file) is this:
["Example 1","Example 2","Example 3","Example 4","Example 5"]
Which is valid JSON according to jsonlint.com
The problem is that when I use jQuery UI's Autocomplete script to reference the external search.php file, Chrome just gives me the following error:
GET http://www.example.com/search.php?term=my+search+term 404 (Not Found)
I have tried inputting the JSON straight into the 'source:' option in my jQuery, and this works fine, but it will not read the JSON from the external PHP file.
Please can someone help?
Here's my code:
HTML
<p class="my-input">
<label for="input">Enter your input</label>
<textarea id="input" name="input"
class="validate[required]"
placeholder="Enter your input here.">
</textarea>
</p>
jQuery
$(function() {
$( "#input" ).autocomplete({
source: "http://www.example.com/search.php",
minLength: 2
});
});
PHP
header("Content-type: application/json");
// no term passed - just exit early with no response
if (empty($_GET['term'])) exit ;
$q = strtolower($_GET["term"]);
// remove slashes if they were magically added
if (get_magic_quotes_gpc()) $q = stripslashes($q);
include '../../../my-include.php';
global $globalvariable;
$items = array();
// Get info from WordPress Database and put into array
$items = $wpdb->get_col("SELECT column FROM $wpdb->comments WHERE comment_approved = '1' ORDER BY column ASC");
// echo out the items array in JSON format to be read by my jQuery Autocomplete plugin
echo json_encode($items);
Result
In browser, when information is typed into #input
GET http://www.example.com/search.php?term=Example+1 404 (Not Found)
Update: the real PHP URL is here: http://www.qwota.co.uk/wp/wp-content/themes/qwota/list-comments.php?term=Your
Please help!
UPDATE: ANSWER
The answer to my problem was pointed out by Majid Fouladpour.
The problem wasn't with my code, but rather with trying to use WordPress's $wpdb global variable: (as far as I understand it) it includes its own headers, and anything outside its usual layout will result in a 404 error, even if the file is actually there.
I'm currently getting around the problem by writing my own MySQL queries and not using WordPress's global variables / headers, roughly as sketched below.
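(Credentials and the table prefix here are placeholders:)
$db = new mysqli('localhost', 'user', 'pass', 'wordpress');
$stmt = $db->prepare("SELECT comment_content FROM wp_comments WHERE comment_approved = '1' ORDER BY comment_content ASC");
$stmt->execute();
$result = $stmt->get_result(); // requires the mysqlnd driver
$items = array();
while ($row = $result->fetch_row()) {
    $items[] = $row[0];
}
header('Content-type: application/json');
echo json_encode($items);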
PS. Majid, I'll come back and give you a 'helpful tick' once StackOverflow lets me! (I'm still a n00b.)
Are you sure the path source: "http://www.example.com/search.php" is correct?
You have to make sure that the target URL exists. If you are really using http://www.example.com/search.php then, well, it simply does not exist, so that is why it does not work.
Update
Since you have a real URL that's working (I tested it!), here are a few steps you can take:
Make sure there's no typo. If there's one, fix it.
Make sure you can open that URL from your browser. If you cannot, then you might be having network access problems (firewall, proxy, server permission issues, etc.)
Try pointing it at another known URL, just to make sure. The 404 error really is a "not found" error; it cannot be anything else.
I think the include is the issue. As Majid pointed out... use the below include instead.
include("../../../wp-load.php");
Good luck!
Your Apache server is sending the wrong headers. Here is a request and response pair:
Request
GET /wp/wp-content/themes/qwota/list-comments.php?term=this HTTP/1.1
Host: www.qwota.co.uk
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 115
Connection: keep-alive
Cookie: __utma=142729525.1341149814.1305551961.1305551961.1305551961.1; __utmb=142729525.3.10.1305551961; __utmc=142729525; __utmz=142729525.1305551961.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)
Response headers
HTTP/1.1 404 Not Found
Date: Mon, 16 May 2011 13:28:31 GMT
Server: Apache
X-Powered-By: PHP/5.2.14
X-Pingback: http://www.qwota.co.uk/wp/xmlrpc.php
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Cache-Control: no-cache, must-revalidate, max-age=0
Pragma: no-cache
Last-Modified: Mon, 16 May 2011 13:28:31 GMT
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html; charset=UTF-8
Response body
["Bake 'em away... toys.","Content precedes design. Design in the absence of content is not design, it\u2019s decoration.","Hanging on in quiet desperation is the English way.","I'm a reasonable man, get off my case.","Look at me, Damien! It's all for you!","Never get out of the boat... absolutely god damn right.","That gum you like is going to come back in style.","The secret to creativity is knowing how to hide your sources.","Things could be different... but they're not.","Your eyes... they turn me."]
So, even though you receive a response body back from the server, the headers say HTTP/1.1 404 Not Found. Someone may be able to investigate this and provide a potential reason and solution.
So I just now learned of the X-Robots-Tag which can be set as part of a server response header. Now that I have learned about this particular field, I am wondering if there are any other specific fields I should be setting when I output a webpage via PHP? I did see this list of responses, but what should I be manually setting? What do you like to set manually?
Restated, in addition to...
header('X-Robots-Tag: noindex, nofollow, noarchive, nosnippet', true);
...what else should I be setting?
Thanks in advance!
You don't necessarily need to set any of them manually, and I don't send any unless absolutely necessary: most response headers are the web server's job, not the application's (give or take Location & situational cache-related headers).
As for the "X-*" headers, the X implies they aren't "official," so browsers may or may not interpret them to mean anything - like, you can add an arbitrary "X-My-App-Version" header to a public project to get a rough idea of where people are using it, but it's just extra info unless the requester knows what to do with it.
I think most X-headers are more commonly delivered via HTML as meta tags already. For example, <meta name="robots" content="noindex, nofollow, (etc)" />, which does the same as X-Robots-Tag. That's arguably better handled with the meta tag version anyway, since it won't trip over output buffering as header() can do, and it will be naturally cached since it's part of the page.
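For comparison, the two equivalent forms (header() has to run before any output is sent):
<?php
// HTTP header variant: must come before any byte of output
header('X-Robots-Tag: noindex, nofollow', true);
?>
<!-- meta tag variant: sits in <head> and is cached with the page -->
<meta name="robots" content="noindex, nofollow">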
These are the headers from Stack Overflow (this very page), so the answer is: probably none.
You don't want your site indexed (noindex)?
HTTP/1.1 200 OK
Cache-Control: public, max-age=60
Content-Type: text/html; charset=utf-8
Content-Encoding: gzip
Expires: Tue, 28 Sep 2010 01:23:00 GMT
Last-Modified: Tue, 28 Sep 2010 01:22:00 GMT
Vary: *
Set-Cookie: usr=t=&s=; domain=.stackoverflow.com; expires=Mon, 28-Mar-2011 01:22:00 GMT; path=/; HttpOnly
Date: Tue, 28 Sep 2010 01:21:59 GMT
Content-Length: 6929
This header comes in handy for me; characters are displayed correctly even if the meta tag is missing.
Content-Type: text/html; charset=utf-8
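Sent from PHP, for instance:
// must run before any output reaches the client
header('Content-Type: text/html; charset=utf-8');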