Scraping a specified part of URL content - PHP

I am using Beautiful Soup and pyquery (in Python) and pQuery (in PHP) for scraping (parsing and fetching the parts of a URL's HTML that I want). The problem is that the number of URLs I want to fetch is very large, and all of these methods first download the entire page before I can scrape it with my selectors. I only need some part of each page, for example a single specified class, but I still have to download the whole page, which wastes bandwidth.
I want to know whether there is any way (my knowledge tells me there is not, but maybe someone has an idea or a trick) or any tool that fetches only the specified part of a page instead of the whole thing.
More details:
Suppose I want to get my answer's title on this page; the URL is https://stackoverflow.com/posts/34892845
I just want the text of question-hyperlink. I want to get the title without downloading the whole page's data (I don't want to fetch the whole page, to save bandwidth in a bulk operation).
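One partial workaround is to stream the response and abort the transfer as soon as the fragment arrives; the tail of each page is never downloaded, though the server still starts sending the full document. (HTTP Range requests would be cleaner, but servers rarely honor them for dynamically generated HTML.) A rough sketch with PHP's cURL, reusing the URL and class above:

    <?php
    // Stream the page and abort once the "question-hyperlink" anchor is buffered.
    $buffer = '';
    $ch = curl_init('https://stackoverflow.com/posts/34892845');
    curl_setopt($ch, CURLOPT_WRITEFUNCTION, function ($ch, $chunk) use (&$buffer) {
        $buffer .= $chunk;
        $start = strpos($buffer, 'question-hyperlink');
        if ($start !== false && strpos($buffer, '</a>', $start) !== false) {
            return 0; // returning fewer bytes than received makes cURL abort
        }
        return strlen($chunk);
    });
    curl_exec($ch); // returns false once we abort; $buffer holds the page head
    curl_close($ch);
    // Extract the title text from $buffer with your usual selector/regex.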

Related

PHP redirect/disguise URLs

I have a page which receives a parameter and uses it to load its content.
Let's say the URL is www.myweb.com/category.php?name=shampoo. What I want is for the URL to display as www.myweb.com/shampoo, and when a user types www.myweb.com/shampoo it should actually load www.myweb.com/category.php?name=shampoo while the URL keeps displaying www.myweb.com/shampoo.
Same thing with, for example, www.myweb.com/category.php?name=soap displaying as www.myweb.com/soap.
Is this possible with PHP? I have done some research but have not been able to find anything. Thank you very much.
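A common approach here is URL rewriting: the web server maps the pretty URL back onto category.php while the address bar is left untouched. With Apache this is a mod_rewrite rule in .htaccess such as RewriteRule ^([a-z]+)/?$ category.php?name=$1 [L]. A rough sketch of the PHP side (the rule above and the slug handling below are illustrative, not a complete solution):

    <?php
    // Sketch: the server routes /shampoo here; recover the slug from the path.
    $slug = trim(parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH), '/'); // "shampoo"
    if ($slug !== '' && !isset($_GET['name'])) {
        $_GET['name'] = $slug; // behave exactly like category.php?name=shampoo
    }
    // ... the existing category.php logic using $_GET['name'] runs from here ...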

In Python, how can I request specific data from a dynamically loaded website?

I want to load pages from PeoplePerHour.com into Python to run some data analysis, but I keep getting data from a page I didn't ask for. I think it must go to the main page and then somehow refresh into the page I ask for.
For example:
I want to pull the prices from all users at http://www.peopleperhour.com/freelance/data+analyst, and the data spans over multiple pages.
Say I want to request page 2, http://www.peopleperhour.com/freelance/data+analyst#page=2. If I go there in a browser, it works fine and pulls up page 2 (though I think it pulls up page 1 first and then "refreshes" into page 2). If I access it in Python, it loads the HTML from the first page and never sees page 2.
Here's my code:
import requests
from pattern import web

def list_of_prices(url):
    html = requests.get(url).text
    dom = web.DOM(html)
    prices = []  # renamed from "list" to avoid shadowing the built-in
    for person in dom('.freelancer-list-item .medium.price-tag'):
        currency = person('sup')
        amount = person('span')
        prices.append([currency[0].content if currency else 'na',
                       amount[0].content if amount else 'na'])
    return prices

list_of_prices('http://www.peopleperhour.com/freelance/data+analyst#page=2')
No matter what, this returns the prices from page 1.
What is going on that I'm just not seeing?
If I understand correctly, you want to iterate through the pages. If that's the case, I believe the problem is with your URL.
Here's the URL you gave:
http://www.peopleperhour.com/freelance/data+analyst#page=2
The problem is, "page" is not a bookmark on that page. When you use the #page=2, it tells the browser to go down to the same page for a bookmark called "page=2".
Here's the URL for the Next button in that site:
http://www.peopleperhour.com/freelance/data+analyst?sort=most-relevant&page=2
You'll see it says "&page=2", which means something different: in their code, "page" is a variable passed via the URL with a value of 2. If you're passing variables via the URL, you have to put a "?" followed by the name=value pairs; "&" separates any additional pairs. Your URL is missing the "?".
So, easy fix, change your url to this:
http://www.peopleperhour.com/freelance/data+analyst?page=2
That's in comparison to your old url:
http://www.peopleperhour.com/freelance/data+analyst#page=2
As a quick test, copy/paste the corrected URL into your web browser; you will see that it is now on page 2.
Getting dynamic content (content generated by client-side code) is always very tricky. There is no easy solution to this, but if you really want to dig into it, I recommend PyV8, a Python wrapper around the V8 JavaScript engine.
Error in pattern when using pattern3 in Python 3.6
What is the alternative for running the same code under a Python 3.6 environment? Because of this I have to install pattern3, since pattern is not supported on Python 3.6.
Thanks!

A PHP site that creates a page based on a user's click

I'm not even 100% sure how to ask this question, but I will try my best...
So, take YouTube. You've got this:
URL/watch?v=Video_URL_Here
While on this video, you decide to click a video in the related list on the right side.
When you do, the page refreshes and instantly jumps to that video.
I have the basic concept down:
> Create a variable.
$var;
> User: *Clicks First Video*
$var = Video_One; // Pulls from mySQL-DB
> Open a new page (ex: URL/watch?v=Video_ONE)
PHP: >Creates a whole new page for the video.<
> User: *clicks new video*
$var = Video_Two;
> Open a new page (ex: URL/watch?v=Video_TWO)
PHP: >Working more magic.<
However, I'm having a hard time actually doing this.
Could anyone point me in the right direction or explain how it works?
It would be much appreciated.
The way YouTube works is by using $_GET variables. That's what the ?v= is: it takes in the v variable and checks the database for a video with that video ID. They create the new page by fetching each of the values corresponding to the ID that was passed in the URL, then putting that data into each of the page's sections.
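A minimal sketch of that flow (the table and column names here are hypothetical, not YouTube's actual schema):

    <?php
    // watch.php sketch: /watch?v=abc123 looks up the video and renders its page.
    $pdo  = new PDO('mysql:host=localhost;dbname=mysite', 'user', 'pass');
    $id   = isset($_GET['v']) ? $_GET['v'] : '';
    $stmt = $pdo->prepare('SELECT title, url, description FROM videos WHERE id = ?');
    $stmt->execute(array($id));
    $video = $stmt->fetch(PDO::FETCH_ASSOC);
    if (!$video) { http_response_code(404); exit('Video not found'); }
    ?>
    <h1><?= htmlspecialchars($video['title']) ?></h1>
    <video src="<?= htmlspecialchars($video['url']) ?>" controls></video>
    <p><?= htmlspecialchars($video['description']) ?></p>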
Let me answer with a very general and oversimplified example.
Actually, rather than having a single "$var" that changes every time you click on a video (as in your example), it is more that the page already knows where to go for each link (or click); that is, every video already has a link associated with it, with the corresponding URL.
All of this is done BEFORE the page loads. (There are ways to do it afterwards, but that is another matter.)
Just to give a quick example (it may not be exactly how YouTube works; it is just an example):
Let's say you store each video's name, description, rating, etc. in a database table.
e.g.
video1name, url1, description1, etc1
video2name, url2, description2, etc2
video3name, url3, description3, etc3
Also assume each video already has related videos stored somewhere (the videos that would show on the right side); imagine they are in the same table, each video having its own "related videos" associated with it.
So, when putting the page together via PHP (in this case), what the code does is read the data from the database, so it knows what it will "paint". At that point it stores that data in variables and, using those variables, it is ready to build the page.
Imagine you say: "I need 5 videos here; those videos are this one, this other one, etc."
So PHP will read those 5 videos' info from the database and, knowing their data, it already "knows" what the specific URL for each video will be.
It only has to build links for each video, each having its specific URL.
e.g.
[some html]
...
<a href="myvid1url" > ...</a>
<a href="myvid2url" > ...</a>
<a href="myvid2url" > ...</a>
...
[the rest of html]
The only thing PHP is doing is creating HTML dynamically based on that data; once it finishes, it sends it to the browser, which only has to "paint" plain HTML, already filled in with the particular URLs, names, etc. for each part.
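In code, the idea boils down to something like this rough sketch (the array stands in for the database rows described above):

    <?php
    // $videos would really come from the database query described earlier.
    $videos = array(
        array('id' => 'vid1', 'name' => 'video1name'),
        array('id' => 'vid2', 'name' => 'video2name'),
        array('id' => 'vid3', 'name' => 'video3name'),
    );
    foreach ($videos as $v) {
        // PHP builds the <a> tags; the browser just "paints" the finished HTML.
        echo '<a href="/watch?v=' . urlencode($v['id']) . '">'
           . htmlspecialchars($v['name']) . "</a>\n";
    }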
This is a VERY generalized example, but I hope you get the idea.
The most important part is to understand that, most of the time, pages are already "built" before being displayed; once loaded, they already "know" what to do when you click somewhere, etc.
Of course, you can add interactive functionality using JavaScript, Ajax, etc., and that MAY change the already-loaded page, but that is another concept.
I think you should first tell us what your experience with programming is, or whether you have only made plain simple HTML pages, so we can give you better advice.
Have fun!
You could use jQuery and have the second video load in a frame, iframe, div, table, new window, etc. (depending on the data source, of course).
External sources (depending)
jQuery loading external page (cross domain) into Div element
Local content sources
Load HTML page dynamically into div with jQuery
For external data loading you could get creative and run a cURL request to save the data locally, parse it for what you need, and then serve that locally.
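A rough sketch of that cURL idea (the URL and cache path are placeholders):

    <?php
    // Fetch the external page once, cache it locally, then parse/serve the copy.
    $url   = 'http://example.com/external-page.html'; // placeholder source
    $cache = __DIR__ . '/cache/external-page.html';   // placeholder local copy
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $html = curl_exec($ch);
    curl_close($ch);
    if ($html !== false) {
        file_put_contents($cache, $html); // parse this local copy for what you need
    }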

Create a content-only page for a mobile news feed in MODX

I've been tasked with providing the backend for a news feed that will be used by our company's apps. The feed will pull articles from our current website, which is built with MODX (Evolution). So far, I've designed the feed to send JSON containing the needed information through a specified URL. It's currently in the following format (using Ditto placeholders):
{
    "title": "[+longtitle+]",
    "description": "[+description+]",
    "link": "[(site_url)][~[+id+]~]"
},
Here's my issue: the link I'm providing through the JSON (in the link tag) opens the full desktop version of the page. Our current site is not responsive and was not originally designed to handle mobile devices. We would like to open a small, clean page showing ONLY the ['content'] of that particular article. I'm looking for a way to link to a page showing only this content: no header, no footer, nothing.
I know that I could create a new page to handle all of this, but it needs to be dynamic. New articles are created regularly, and I'd like to avoid having to add another page for every article, while also keeping it simple for the writing team to integrate this feature.
One of my ideas so far is:
Pass a GET parameter in the URL "link" in the JSON, something like www.mysite.com/article1?contentOnly=true. Then, in my article, detect this parameter in PHP and handle it accordingly (a sketch of this is below). I would need this snippet in each article, so it may cause issues down the road if our staff writers forget to add it.
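Something like this minimal sketch (assuming a MODX Evolution snippet with access to $modx; the parameter name comes from my example above):

    <?php
    // Bail out early and print only the article body when ?contentOnly=true.
    if (isset($_GET['contentOnly']) && $_GET['contentOnly'] === 'true') {
        echo $modx->documentObject['content']; // Evolution's current-resource fields
        exit;
    }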
I haven't worked with MODX long, so I'm assuming there's a better way to handle this. Any ideas would be greatly appreciated. Please let me know if I need to provide more information.
I am not 100% sure how you have done this, but here's my tip.
Don't use the resource itself to output the JSON. Doing this based on a GET parameter would require the entire site to be uncached. Instead, use a single resource for the feed and supply the id/permalink there.
For example: mysite.com/feed?id=1, mysite.com/feed?latest, or something like that.
Done this way, you can have an empty template with just the JSON-producing snippet in it. That snippet has to be uncached, of course, but the rest of the site can be cached as normal.
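A minimal sketch of such a feed snippet (the getDocument() call and field names are assumptions about the Evolution API):

    <?php
    // Single uncached feed resource, e.g. mysite.com/feed?id=1
    $id  = isset($_GET['id']) ? (int) $_GET['id'] : 0;
    $doc = $modx->getDocument($id, 'longtitle,description,content');
    header('Content-Type: application/json');
    echo json_encode(array(
        'title'       => isset($doc['longtitle']) ? $doc['longtitle'] : '',
        'description' => isset($doc['description']) ? $doc['description'] : '',
        'content'     => isset($doc['content']) ? $doc['content'] : '',
    ));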

How to store crawled data from webpages

I want to build an educational search engine in my web app, so I decided to crawl about 10 websites using PHP from my web page and store the data in my database for later searching. How do I retrieve this data and store it in my database?
You can grab them with the file_get_contents() function. So you'd have:
$homepage = file_get_contents('http://www.example.com/homepage');
This function returns the page as a string.
Hope this helps. Cheers
Building a crawler, I would first make the list of URLs to get, and then fetch them.
A. Make the list
Define a starting URL to crawl.
Add this URL to the job list (the list of URLs to crawl).
Define the max depth.
Parse the first page, find all the href attributes, and extract the links.
For each link: if it's from the same domain or relative, add it to the job list.
Remove the current URL from the job list.
Restart with the next URL in the job list if it is non-empty.
For this you could use this class, which makes parsing HTML really easy:
https://simplehtmldom.sourceforge.io/
B. Get content
Loop over the job list you built and get the content; file_get_contents() will do this for you:
https://www.php.net/file-get-contents
This is basically just a starting point: in step A, you should keep a list of already-parsed URLs so each one is checked only once. Query strings can also be something to watch, to avoid scanning the same page multiple times under different query strings. A minimal sketch of the whole loop follows below.
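Putting A and B together, a minimal sketch might look like this (the seed URL, page limit, and the naive relative-link handling are assumptions):

    <?php
    // Breadth-first crawl of one domain using file_get_contents + DOMDocument.
    $jobs     = array('http://www.example.com/'); // seed URL (placeholder)
    $visited  = array();                          // already-parsed URLs, checked once
    $host     = parse_url($jobs[0], PHP_URL_HOST);
    $maxPages = 50;                               // crude size limit

    while ($jobs && count($visited) < $maxPages) {
        $url = array_shift($jobs);                // remove current URL from job list
        if (isset($visited[$url])) {
            continue;
        }
        $html = @file_get_contents($url);         // step B: get the content
        if ($html === false) {
            continue;
        }
        $visited[$url] = true;                    // store $html in your database here

        $dom = new DOMDocument();
        @$dom->loadHTML($html);                   // suppress warnings on sloppy HTML
        foreach ($dom->getElementsByTagName('a') as $a) {
            $href = $a->getAttribute('href');
            // Naive absolute-URL resolution; a real crawler needs proper handling.
            $abs = (strpos($href, 'http') === 0)
                ? $href
                : 'http://' . $host . '/' . ltrim($href, '/');
            if (parse_url($abs, PHP_URL_HOST) === $host && !isset($visited[$abs])) {
                $jobs[] = $abs;                   // same domain or relative: queue it
            }
        }
    }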
