This is my first question, so I hope to get good guidance.
I'm trying to grab the content of a webpage using file_get_contents().
In most cases it works fine, but there is one thing that is driving me crazy.
I'm splitting a long link into three parts and putting it back together with the code below. The link is a pagination link and the "3" indicates the page, so with this particular link I want to see page 3.
$combinedlink = $firstpart."3".$secondpart."3".$thirdpart."1445256372";
$input = file_get_contents($combinedlink);
When I now echo $input, it shows me page 1 instead of page 3. When I echo $combinedlink and follow it, it takes me to the correct page. Now the shocking part: when I copy the output of echo $combinedlink; and insert it like this:
$input = file_get_contents("http://www.ReallyLongLink.de/EvenMoreStuff");
it works fine and gives me page 3. So the variable appears to contain exactly the right thing, yet it only works when I hard-code the link. var_dump also shows me string(178) followed by the string in quotation marks.
The website you are trying to crawl might be using some other means of pagination besides the URL, such as a cookie or session. That might explain why the link works in your browser but not in your script.
To track cookies sent by the website, you may want to try using a library, such as Guzzle, to fetch the pages.
UPDATED
$input = file_get_contents(html_entity_decode($combinedlink));
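The update gives no explanation, so the following is only a guess at why it might help. If the parts of the link were taken from HTML source, they may contain entities such as &amp; in place of a plain ampersand. A browser decodes these when it renders the echo output, so the link looks right and works when followed, but file_get_contents() receives the raw text. That would also fit var_dump reporting a length larger than the visible string. A minimal sketch in Python with a made-up URL (not the asker's link); html.unescape plays the role that html_entity_decode() plays in PHP:

```python
import html

# Hypothetical URL as it might look when copied out of HTML source
raw = "http://www.example.com/list?page=3&amp;sort=asc"

# Decode HTML entities so the separator is a plain "&" again
decoded = html.unescape(raw)

print(len(raw), raw)
print(len(decoded), decoded)
```

The two strings look identical once a browser has rendered them, but they differ in length and only the decoded one separates the query parameters the way the server expects.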
Related
I want to do the following: when a user accesses www.mysite.com, I want the server to dynamically fetch the content of another site (let's say www.othersite.com) and generate the same HTML output for www.mysite.com.
So when the user goes to www.mysite.com, he will see exactly what he would see on www.othersite.com. This needs to work for www.mysite.com/myfolder, www.mysite/myotherfolder and so on as well.
I know I could use a redirect in .htaccess to do that, but for study purposes I want to do it using only PHP.
Is there a way?
You can fetch the target site's HTML code, for example with file_get_contents, and then just echo it out:
<?php
$htmlContent = file_get_contents('http://www.example.com');
echo $htmlContent;
But this won't fix the links in it. For example, if the page contains a relative link labeled "Click" and you click it on your mirrored site, it will point to a script that does not exist on your server.
You could rewrite all links with preg_replace so that they point to your own script, passing the original target as a query parameter.
I want to load pages from PeoplePerHour.com into Python to run some data analysis, but it keeps getting data from a page I didn't ask for. I think it must go to the main page and then somehow refresh into the page I asked for.
For example:
I want to pull the prices from all users at http://www.peopleperhour.com/freelance/data+analyst, and the data spans over multiple pages.
Say I want to request page 2, http://www.peopleperhour.com/freelance/data+analyst#page=2. If I go there in a browser, it works fine and pulls up page 2, although I think it loads page 1 first and then "refreshes" into page 2. If I access it in Python, it loads the HTML from the first page and never sees page 2.
Here's my code:
import requests
from pattern import web
import re
import pandas as pd

def list_of_prices(url):
    # Fetch the page and parse it into a DOM
    html = requests.get(url).text
    dom = web.DOM(html)
    prices = []
    # One price tag per freelancer entry
    for person in dom('.freelancer-list-item .medium.price-tag'):
        currency = person('sup')
        amount = person('span')
        prices.append([currency[0].content if currency else 'na', amount[0].content if amount else 'na'])
    return prices

list_of_prices('http://www.peopleperhour.com/freelance/data+analyst#page=2')
No matter what, this returns the prices from page 1.
What is going on that I'm just not seeing?
If I understand correctly, you want to iterate through the pages. If that's the case, I believe the problem is with your URL.
Here's the URL you gave:
http://www.peopleperhour.com/freelance/data+analyst#page=2
The problem is that "page" is not a bookmark on that page. The part after the # is a fragment: it tells the browser to scroll to a bookmark (anchor) called "page=2" within the same page, and it is never sent to the server.
Here's the URL for the Next button in that site:
http://www.peopleperhour.com/freelance/data+analyst?sort=most-relevant&page=2
You'll see it says "&page=2", which means something else. In their code, "page" is a variable passed via the URL, with a value of 2. The "&" separates the variables when there is more than one. Your URL is also missing the "?" symbol: to pass variables via the URL, you have to put a "?" after the path, followed by the name=value pairs.
So, easy fix, change your url to this:
http://www.peopleperhour.com/freelance/data+analyst?page=2
That's in comparison to your old url:
http://www.peopleperhour.com/freelance/data+analyst#page=2
As a quick test, copy and paste the corrected URL into your web browser. You will see that it is now on page 2.
Getting dynamic content (content generated by client-side code) is always very tricky. There is no easy solution to this, but if you really want to dig into it, I recommend PyV8, a JavaScript engine in Python.
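The difference between the two URLs can be checked with the standard library. This is a small sketch, not the asker's code; page_url is a hypothetical helper name, and it assumes the site accepts a bare page parameter as in the corrected URL above:

```python
from urllib.parse import urlsplit, urlencode

def page_url(base, page):
    """Build a URL that passes the page number in the query string."""
    return base + "?" + urlencode({"page": page})

base = "http://www.peopleperhour.com/freelance/data+analyst"
old = urlsplit(base + "#page=2")
new = urlsplit(page_url(base, 2))

# The fragment (after "#") stays in the client; only path and query are sent.
print(repr(old.query), repr(old.fragment))  # '' 'page=2'
print(repr(new.query), repr(new.fragment))  # 'page=2' ''
```

With the old URL the server receives no page information at all, which is why it answers with page 1. To walk through several pages, the helper can be called in a loop, for example page_url(base, n) for n from 1 upward.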
Error in pattern when using pattern3 in python 3.6
What is the alternative for running the same code under Python 3.6? Because of this error I had to install pattern3, since pattern is not supported on Python 3.6.
Thanks!
I have a PHP website that I send users to via a Dynamic URL like this:
http://mwebsitehere.com/?gw=1
The page I send them to works great with the code I am using to do certain things when the dynamic content is set in the URL. But whenever they click on a link on the page (and the links are ALWAYS changing), the dynamic content in the URL is completely gone. For instance:
Let's say they are on the homepage, which looks like this: http://mwebsitehere.com/?gw=1, and then they click on a link that looks like this: http://mwebsitehere.com/new-page/. Notice that ?gw=1 is completely gone from the URL.
Is there a way to keep the dynamic part on every page's links if the URL has dynamic content?
For example, if it said ?gw=2, could all the links they click on somehow keep ?gw=2 on every page? And if it said ?gw=1, do the same thing?
Any help would be appreciated! Let me know if I need to explain my question better. Thanks!
I am also using wordpress, just in case you know anything wordpress specific! Thx!
The only reason to have GET variables such as ?gw=2 in the URL is that they are needed for that page. If you want them on all pages,
have your scripts check whether the value exists in the $_GET array or the $_COOKIE array. If it is in $_GET but not in $_COOKIE, set it in a cookie. That way your script will still see it on later pages by checking the cookies.
There is no sense in cluttering the URL with variables that don't always need to be shown.
If you want the exact same variable passed to every page, why not use
$_SESSION['gw'];
or
$_COOKIE['gw'];
to store "gw".
Otherwise you would have to pass it on via each link as follows
For example on page http://mwebsitehere.com/?gw=1
Link
There are a few ways you can do this.
You can use $_SERVER['QUERY_STRING'] and append it to every single link on your page, so that your links always repeat the query string of the current request.
Alternatively, store the data in a session; then you can carry it from one page to another. Take a look at the PHP manual.
Good luck!
I'm using the "Snoopy" class to pick up HTML for parsing.
The problem is that for one of the pages I need the address it redirects to: I'm using the site's search, and if it finds a perfect match it redirects automatically.
Here is my Snoopy call:
if($snoopy->fetch("http://www.rottentomatoes.com/search/?search=$pagelink&sitesearch=rt")){
    $printable = $snoopy->results;
}
If the search is exact it will place me on a page like this...
http://www.rottentomatoes.com/m/captain-america/
I need the link above.
Any help would be great,
Thanks!
From poking around in the code a little, it seems like you should be able to check the variable $snoopy->lastredirectaddr, which should be set if you got redirected (if not, it should be a blank string).
For one project, I need to get the HTML source of a Facebook page from a PHP application.
I have tried a lot of methods such as cURL, file_get_contents, changing my ini_set settings, etc., but Facebook never lets me get the resulting HTML.
Can anyone help?
For example, this page:
ini_set('user_agent', $_SERVER['HTTP_USER_AGENT']);
$data = file_get_contents("http://apps.facebook.com/is_cool/?cafe_action=album&view=scroll",0);
Print strip_tags($data,"");
Thanks a lot.
Damien
Comment 1 :
- I need to create two applications and parse the HTML of one to pass some information to the other. I don't want to duplicate or take Facebook's code. I just want to do a "view source" (as in IE or Firefox) and save it to a file without asking my users. When a user is logged in to my first application, I just want to use his credentials to get the other content.
The reason you're having problems is that the majority of the Facebook homepage content is loaded via AJAX. The data is not part of the HTML that the server initially returns.
You should think of a different way to accomplish your goals. If you tell us a little more about what you're trying to do, we can probably help you find an alternate method.