iTunes App Store Web Scraping - php

I'm looking to have a user enter an app ID on a website, save the information about that app (in a MySQL database), and then display that information on the website.
Would anyone mind sharing the code/process that would be used to do this, or pointing me toward tutorials where I can learn how to do it?
If you could help me out at all I will be very grateful. Thanks.

You CANNOT do this via screen scraping. Read Apple's Terms of Use ("Your Use of the Site"):
"You may not use any “deep-link”, “page-scrape”, “robot”, “spider” or other automatic device, program, algorithm or methodology, or any similar or equivalent manual process, to access, acquire, copy or monitor any portion of the Site or any Content, or in any way reproduce or circumvent the navigational structure or presentation of the Site or any Content, to obtain or attempt to obtain any materials, documents or information through any means not purposely made available through the Site. Apple reserves the right to bar any such activity."
What you need to do is investigate Apple's Partner Program, which includes a program for developers. I believe it would grant you access to an API where you would be able to directly query for and receive the information you want (such as app descriptions) to display on your site, and perhaps even earn a commission on sales you generate when people purchase something from Apple via links on your site.
Odds are that the other sites you see displaying such info from Apple's store have an affiliate/partner arrangement with Apple. (And if not, it's just a matter of time until they get blocked from the site, find cease-and-desist letters in the mail from Apple's lawyers, get sued, or some combination of the three.)

You should really look at these first.
https://stackoverflow.com/questions/822380/how-legal-is-screen-scraping
https://stackoverflow.com/questions/396778/legalities-of-screen-scraping
Knowing Apple, they'll probably sue you. They have sued for less; in other words, who haven't they sued?

If you want to save the information to your database, then you might want to look at the program AppFetcher at http://www.altraware.com.
It uses the XML that Apple provides, so there should be no legality issues.
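For what it's worth, Apple also exposes a public iTunes lookup endpoint that returns app metadata as JSON for a given app ID. Here is a minimal sketch of that approach; the field names shown are from a typical response, but check the endpoint and Apple's current terms before relying on it:
<?php
// Minimal sketch: fetch app metadata from Apple's iTunes lookup endpoint.
// Verify the endpoint and Apple's terms before building on this.
$appId = (int) $_GET['appid']; // the ID the user entered
$json  = file_get_contents('https://itunes.apple.com/lookup?id=' . $appId);
$data  = json_decode($json, true);

if (!empty($data['results'])) {
    $app = $data['results'][0];
    echo $app['trackName'] . ' by ' . $app['artistName'];
    // ...then save $app['description'] etc. to your MySQL table,
    // ideally with PDO prepared statements.
}
?>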

Is it secure to send a php value in a link

I have a dynamic page that should take its data from a database. The approach I thought of was to put this PHP code at the top of the page:
<?php $pid = $_GET["pid"]; ?>
Then later in the file it connects to the database and shows the correct content according to the page ID ($pid). On the home page, I want to add links that display the correct pages. For example, the data for the "Advertise" page is saved in the database in the row where the pid is 100, so I added the link to the "Advertise" page on the homepage like this:
<li><a href="page.php?pid=100">Advertise</a></li>
So my question is: anyone can see the value that's sent in the link and can play around by changing the pid. Is there an easy way to mask this value, or a safer method to send it to page.php?
The general concept you're looking for is Access Control. You have a resource (in this case, a page and its content), and you want to control who can access it (users, groups, etc), and probably how they can access it as well (for example, read-only, read-and-write, write-but-only-on-the-first-Monday-of-the-month, etc).
Defining the problem
The first thing you need to decide is which resources you need access control for, and which you don't. It sounds to me like some of these pages are supposed to be "public access" (thus they are listed on some kind of index page), while others are supposed to be restricted in some way.
Secondly, you need to come up with an access policy - this can be informally described for a small project, but larger projects usually have some structured system for defining this policy. For each resource, your policy should answer questions like:
Do you have some kind of user account system, and you only want account holders (or certain types of account holders) to access it? Or, are you going to send links to email addresses, and want to limit access to just those people who have the link?
What kind of access should each user have? Read-only? Should they be able to change the content as well (if your system supports that)?
Are there any other types of restrictions on a user's access? Group membership? Do they need to pay before they get access? Are they only allowed access at specific times?
Implementing your policy
Once you've answered these questions, you can start to think about implementation. As it stands, I think you are mixing up access control with identification. Your pid identifies a page (page 100, for example), but it doesn't do anything to limit access. If your pages are identified with a predictable numbering scheme, anyone can easily modify the number in the request (this is true for both GET requests, such as when you type a URL into an address bar, and POST requests, such as when you submit a form).
To securely control access there needs to be a key, usually a string that is very difficult to guess, which is required before access is granted. In very simple systems, it is perfectly fine for this key to be directly inserted in the URL, provided you can still keep the key secret from unauthorized users. This is exactly how Google Drive's "get a link to share" feature works. More complex systems will use either a server-side session or an API key to control access - but in the end, it's still a secret, difficult-to-guess string that the client (user or user's browser) sends to the server along with their request for the resource.
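As a rough illustration of the "secret link" variant (the table and column names here are invented, not taken from your schema), you could store a long random token next to each page and require it in the URL:
<?php
// Sketch only: pages(pid, access_key, content) is a hypothetical schema,
// and $pdo is assumed to be an existing PDO connection to your database.

// When creating a page, generate and store a hard-to-guess key once:
$accessKey = bin2hex(random_bytes(32));
// INSERT INTO pages (pid, access_key, content) VALUES (..., $accessKey, ...)

// page.php: grant access only when the supplied key matches the stored one.
$stmt = $pdo->prepare('SELECT content FROM pages WHERE pid = ? AND access_key = ?');
$stmt->execute(array((int) $_GET['pid'], isset($_GET['key']) ? $_GET['key'] : ''));
$page = $stmt->fetch();

if ($page === false) {
    http_response_code(403); // wrong or missing key: no access
    exit('Access denied');
}
echo $page['content'];
?>
The link would then look like page.php?pid=100&key=3f9a... — the pid still identifies the page, while the key is what actually controls access.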
You can think of identification like your street address, which uniquely identifies your house but is not, and is not meant to be, secret. Access control is the key to your house: only you and the people you've given a key to can actually get inside. If your lock is high quality, it will be difficult to pick.
Bringing it together
Writing code is easy, designing software is hard. Before you can determine the solution best for you, you need to think ahead about the ramifications of what you decide. For example, do you anticipate needing to "change the keys" to these pages in the future? If so, you'll have to give your authorized users (the ones that are still supposed to have access) the new key when that happens. A user-account system decouples page access control from page identification, so you can remove one user's access without affecting everyone else.
On the other hand, you also need to think about the nature of your audience. Maybe your users don't want to have to make accounts? This is something that is going to be very specific to your audience.
I get the sense that you're still fairly new to web development, and that you're learning on your own. The hardest part of learning on one's own is "learning what to learn" - Stack Overflow is too specific, and textbooks are too general. So, I'm going to leave you with a short glossary of concepts that seem most relevant to your current problem:
Access control. This is the name of the general problem that you're trying to solve with this question.
Secrecy vs obscurity. When it comes to security, secrecy == good, obscurity == bad.
Web content management system. You've probably heard of Wordpress, but there are tons of others. I'm not sure what your system is supposed to do, but a content management system might solve these problems for you.
Reinventing the wheel. Good in the classroom, bad in the real world.
How does HTTP work. Short but to the point. A lot of questions I see on SO stem from a fundamental misunderstanding of how websites actually work. A website isn't so much a single piece of software, as a conversation between two players - the client (e.g. the user and their browser), and the server. The client can only say something to the server via a request, and the server can only say something to the client via a response. Usually, this conversation consists of the client asking for some resource (an HTML web page, a Javascript file, etc), to which the server responds. The server can either say "here you go, I got it for you", or respond with some kind of error ("I can't find it", "you're not allowed to see that", "I'm too busy right now", "I'm not working properly right now", etc).
PHP The Right Way. Something I wish I had found when I first started learning web development and PHP, not seven years later ;-)
It is always safer to use $_POST when you can, but if you have to put something in the query string, it is safer to use a hash or GUID rather than something that is so obviously an auto-incremented value; that makes the IDs harder to guess. There are other ways values can be passed between pages ($_SESSIONs, cookies, etc.), but it really comes down to what you want to achieve.
Sending it to PHP is not an issue; that part is fine.
What PHP does with it afterwards... that's where you secure it.
The first thing I'd do is make sure it's an integer. Note that $_GET values always arrive as strings, so is_int() would never match here; validate the string and cast instead:
$pid = (isset($_GET['pid']) && ctype_digit($_GET['pid'])) ? (int) $_GET['pid'] : 1; // 1 is the default pid, change this to whatever you want.
Now that you know you're dealing with an integer, use $pid from then on and you should be good to go.

Get information about the Visitor (Javascript / PHP)

I am trying to create a website that can track/analyze visitors and gather as much information as possible about them. I have already found services that do this; they were not free, but they provided every piece of information about the user visiting the website.
Is there a free solution that is as effective as the paid ones out there? Also, should I use client-side or server-side scripting? Which is more reliable?
The main reason I'm trying to gather this information is that I'd like to know more about my website's visitors and eliminate fraudsters or cheaters by analyzing their information. Or is there another good solution for that?
I'm sorry if this is a weird question; I'm quite new to this field, which is why I'd like to know more about it.
EDIT: I have already tried Google Analytics, but it is not really good in my case.
If you would like to geolocate your visitors and discover their physical location based on their IP, MaxMind has a great service with a free option.
http://www.maxmind.com/en/geolocation_landing
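As an illustration, here is a minimal sketch using MaxMind's GeoIP2 PHP library (installed via Composer as geoip2/geoip2) against a downloaded GeoLite2 database; the database path below is a placeholder:
<?php
require 'vendor/autoload.php';

use GeoIp2\Database\Reader;

// Placeholder path: point this at your downloaded GeoLite2 database file.
$reader = new Reader('/path/to/GeoLite2-City.mmdb');
$record = $reader->city($_SERVER['REMOTE_ADDR']);

echo $record->country->name;       // e.g. "United States"
echo $record->city->name;          // city, when the database can resolve it
echo $record->location->latitude;  // approximate coordinates
?>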

How to extract/fetch and show data ( from other ecommerce site ) to my website using PHP

I have seen many eCommerce portals that show lists of products from other, bigger eCommerce websites from around the world.
The fetching itself is not a big problem, I think - it can be done with file_get_contents or cURL in PHP - but the questions are:
Do they provide some API to allow others to fetch their data/product info?
Do we need their permission to fetch data from their sites?
Is there a more elegant, purpose-built way to fetch data to show on our site (instead of cURL & file_get_contents)?
Some websites provide an API to access their data. Some cost money, some may be free. In any case, yes, you need permission.
But you can always scrape their sites without permission.
Here are some general guidelines on the subject.
You should check whether they have a robots.txt file denying permission to spider some areas of the site (a naive check is sketched after this list).
Although there are copyright issues with reproducing content, search engines publish excerpts of site content all the time. Therefore to some extent, reproducing content is legally permissible.
APIs are sometimes available, but search engines scrape sites all the time without any sort of permission (except perhaps for the robots.txt files).
Respect the site owner's wishes concerning their bandwidth. Poorly written robot code can wastefully tie up server resources.
If you can get permission, all the better.
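Here is the deliberately naive robots.txt check mentioned above, just to illustrate the idea; a real crawler should use a proper parser that understands User-agent groups, wildcards, and Allow rules:
<?php
// Naive illustration only: ignores User-agent grouping, wildcards, and Allow rules.
function is_path_disallowed($site, $path)
{
    $robots = @file_get_contents(rtrim($site, '/') . '/robots.txt');
    if ($robots === false) {
        return false; // no robots.txt reachable: nothing explicitly disallowed
    }
    foreach (preg_split('/\R/', $robots) as $line) {
        if (preg_match('/^Disallow:\s*(\S+)/i', $line, $m) && strpos($path, $m[1]) === 0) {
            return true;
        }
    }
    return false;
}

var_dump(is_path_disallowed('https://www.example.com', '/search'));
?>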
I use cURL and the DOMDocument class. I don't know what else you would want in terms of elegance.
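A rough sketch of that combination (the URL and the XPath expression are invented for illustration):
<?php
// Fetch the page with cURL...
$ch = curl_init('https://www.example.com/products'); // made-up URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$html = curl_exec($ch);
curl_close($ch);

// ...then parse it with DOMDocument and query it with XPath.
$doc = new DOMDocument();
libxml_use_internal_errors(true); // real-world HTML is rarely well-formed
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
foreach ($xpath->query('//div[@class="product"]/h2') as $node) { // hypothetical markup
    echo trim($node->textContent) . "\n";
}
?>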
Write a crawler to get all the data you want from those websites.
Use the APIs if provided, though they usually cost a fair amount.
Create your own APIs using third-party software.

Legal script that scrapes and indexes?

I want to create a website that scrapes certain websites (specified by me) to collect data and pricing and then offer that data as search results on my own site. So basically like a search engine, but for specific sites, indexed in a specific way. I can write this myself, but would like to know:
Is it legal? Can I grab for example, all the items off ebay, put it in a search engine and allow users to search ebay using my site?
What if I make money off this?
Are there any popular PHP scripts that already do this?
The legal aspect has been covered. I found a way around this (well, I got permission from the people creating the content)... so the only real question is: what can I use to crawl the content, especially keeping in mind that each site will have different rules I will have to set up? It must also be clever enough not to spider the same content twice.
Is it legal?
Yes. And no. Probably.
There isn't one set of laws covering the entire planet, and SO isn't really for legal advice, you need to find a lawyer in your jurisdiction.
My own thoughts are that you would probably be okay in most jurisdictions as long as you use only the information. So, no eBay logos, no representations that you may be associated with them and so on.
But I am not a lawyer (though I deal a lot with the US sub-species as part of my work), certainly not your lawyer, and this advice (which isn't legal advice) is worth every cent you paid for it, which is ZERO!
What if I make money off this?
Good for you :-) Make mega-bucks. But see above point.
Are there any popular PHP scripts that already do this?
That's the bit I can't answer. My experience with PHP ranges somewhere between zero and nothing.
The legality is a bit shady in this area. You should look for the presence of a robots.txt ( http://www.robotstxt.org/robotstxt.html ) file to first determine if the website welcomes web spiders.
Also, there is a very good PHP search script called Sphider ( http://www.sphider.eu/ ) that you should have a look at.
EDIT:
I can't see many websites having an issue with you taking snippets of their site and then linking users to the page the content came from. However, if you plan on just taking all their content and displaying it on your own website in order to make a profit, I can only assume many sites would have an issue, as they are the ones who should be profiting from the content.
1) Is it legal? Can I grab for example, all the items off ebay, put it in a search engine and allow users to search ebay using my site?
This is technically feasible; you could build a PHP script that does this quite easily. I would say that it is borderline illegal, however, because by scraping content from somebody else's site you would be using their intellectual property, their data, without permission.
2) What if I make money off this?
Then the original owners of the data are very likely to come after you, issue a cease-and-desist notice, and then sue you. An organization as large as eBay could do this without blinking.
3) Are there any popular PHP scripts that already do this?
Because of the questionable legal nature of your question, I highly doubt there are any scripts that already do this.
The correct technique for getting data from eBay and other large data providers is to use their APIs, or application programming interfaces. These are special protocols designed for programs to communicate with each other. This has the benefit of being significantly more efficient than page scraping, while also being a known, legal way to get data from a provider.
More information about the eBay-specific API can be found here: http://developer.ebay.com/common/api/
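As a hedged sketch of what that looks like in practice, the Finding API's findItemsByKeywords call can be queried over plain HTTP. 'YOUR-APP-ID' is a placeholder for your developer key, and the response shape shown (nested single-element arrays) should be verified against eBay's current documentation:
<?php
// Sketch of a Finding API call; 'YOUR-APP-ID' is a placeholder.
$params = http_build_query(array(
    'OPERATION-NAME'       => 'findItemsByKeywords',
    'SERVICE-VERSION'      => '1.0.0',
    'SECURITY-APPNAME'     => 'YOUR-APP-ID',
    'RESPONSE-DATA-FORMAT' => 'JSON',
    'keywords'             => 'vintage camera',
));
$json = file_get_contents('https://svcs.ebay.com/services/search/FindingService/v1?' . $params);
$data = json_decode($json, true);

// The nested indexing below mirrors the Finding API's JSON style; verify it
// against the docs before depending on it.
$items = $data['findItemsByKeywordsResponse'][0]['searchResult'][0]['item'];
foreach ($items as $item) {
    echo $item['title'][0] . ' - '
       . $item['sellingStatus'][0]['currentPrice'][0]['__value__'] . "\n";
}
?>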

Logging/tracking in PHP: Scribe, Chukwa, log4php?

This is probably a pretty high-level question that requires a lot of explaining, but I'm in need of a lot of explaining.
Basically I'm developing a PHP application that requires a lot of logging and tracking. Tracking clicks, interactions, performance, etc. etc. Anything under the sun. Facebook's Scribe and Yahoo's Chukwa are both great implementations of this. I know little about log4php.
What I want is a high-level overview of how this kind of logging works, specifically in conjunction with a PHP application. You can stop at the point where the log gets processed; I already know that I want to use Hadoop/Hive for processing and storage.
I'd also like some fairly low-level looks at what happens within the application itself. For example, how does one take the behavior of a click and send that to the logger? I'd appreciate any reading that can help get me started, as well.
You can buy/get the tools to do this for you or build in-house.
buy/get:
1 - Tag your pages with Google/Yahoo analytics - this will track pageviews, page-flow performance, SEO ranking for keywords, etc.
2 - For tracking and logging user behavior - including clicks, interactions, and performance - I found nothing better than ClickTale - http://www.clicktale.com/default_e.aspx - it video-records user sessions and puts these "log files" on a server.
in-house:
1 - Creating hidden fields in your forms that submit to a logging database also works. You assign unique IDs to forms and keep track of their actions during submits (see the sketch below).
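A sketch of that hidden-field idea; the table, field, and page names are invented for illustration, and $pdo is assumed to be an existing PDO connection:
<?php
// Render: tag each form with a unique instance ID.
$formId = bin2hex(random_bytes(8));
echo '<form method="post" action="signup.php">';
echo '<input type="hidden" name="form_instance" value="' . $formId . '">';
// ... visible fields ...
echo '</form>';

// On submit: record which form instance produced this POST.
$stmt = $pdo->prepare('INSERT INTO form_log (form_instance, submitted_at) VALUES (?, NOW())');
$stmt->execute(array(isset($_POST['form_instance']) ? $_POST['form_instance'] : ''));
?>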
I'm sure there's lots more, but these are the basics. They are not PHP-specific, though.
HTH
EDIT #1 :
This may be beyond the scope of your question, but tracking doesn't necessarily mean the data stays in-house. An example would be adding a "like it" or "digg it" button to articles or pages. This will "log" popularity for you; you can go to Facebook or digg.com to see your site's progress, and it'll also help with SEO. Basically, it's a tracking system, and it's easy to use: there are PHP snippets out there that you can copy and paste into your code. If you have WordPress, there is a plugin - just look for "digg" or "like it" in the plugin search section.
Going back to Google Analytics: if you want to go beyond tracking clicks, go ahead and set up goals/funnels. It'll track user behavior and answer questions such as "What were my most valuable keywords?", "Where are all my users dropping off?", "What is the bounce rate for each page?", and "What are the top 3 entry points to my site, and from what traffic medium?". These are the questions SEO/SEM managers are most concerned about, and it's definitely good to track and understand them.
ClickTale starts where Google Analytics ends. GA will describe user behavior at the page level, but not at the field level. ClickTale, which has heat maps, will answer questions like "I know this page has a high bounce rate, but why? Which field is a problem field for my customers?", "In what area of the page do users spend most of their time?", and "How do I prove to the graphics guys that a particular section needs to be redesigned?".
EDIT #2
For high-traffic sites, you will need to scale your logging DB, which really helps when it comes to reporting. What I suggest is a three-tier database reporting structure: tier 1 = last 7 days, tier 2 = last 6 months, tier 3 = everything. You can modify these boundaries to suit the business. The point is that data moves from one tier to the next, keeping fresh data readily available so you can generate reports ASAP; a single huge DB just doesn't scale.
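A sketch of the rollover between tiers, run from a nightly cron job; the table and column names are invented, and $pdo is assumed to be an existing PDO connection:
<?php
// Move click rows older than 7 days from the hot tier to the warm tier.
$pdo->beginTransaction();
$pdo->exec("INSERT INTO clicks_tier2
            SELECT * FROM clicks_tier1
            WHERE logged_at < NOW() - INTERVAL 7 DAY");
$pdo->exec("DELETE FROM clicks_tier1
            WHERE logged_at < NOW() - INTERVAL 7 DAY");
$pdo->commit();
?>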
You can monitor user clicks by logging the path the user takes, referrer --> new URI, assuming both are verbose and descriptive enough. For example, if a user clicks on one of his friends, you should log the URIs:
Referrer: /users/41251
Target: /users/66257
storing them properly for easy querying and reporting. Here, a direct click like that implies the target appeared on the referrer's page, so it is a friend. If you have more complicated scenarios, be sure to describe them with distinct URIs, e.g. /users/suggestion/14152 for a suggested connection.
Add timestamps to that and you have a very rough estimate of how long users stayed on each page, although users tend to lose focus, switch tabs/applications, come back later, etc. Google Analytics, for one, does this well.
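A minimal sketch of logging those referrer/target pairs with a timestamp; the click_log schema is invented for illustration, and $pdo is assumed to be an existing PDO connection:
<?php
// Invented schema: click_log(referrer, target, logged_at).
function log_click(PDO $db, $referrer, $target)
{
    $stmt = $db->prepare('INSERT INTO click_log (referrer, target, logged_at)
                          VALUES (?, ?, NOW())');
    $stmt->execute(array($referrer, $target));
}

log_click($pdo,
          isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '',
          $_SERVER['REQUEST_URI']);
?>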
For a summary of where users click most on your site using heatmaps, I like the free (GPL) ClickHeat.
Check out Splunk
On the frontend where you're doing the logging from, here is some sample PHP code that you might find useful:
http://www.alphadevx.com/a/85-Logging-Messages-to-Scribe-from-PHP
In terms of the architecture, you have a lot of flexibility with Scribe. I would recommend having a local Scribe instance running on each application node and having your application log to localhost. These local Scribe instances can in turn be configured to forward to a central Scribe server when it is not too busy; otherwise they will queue up messages locally. You then consume your logs on the central server, where they are aggregated by category.
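On the application side, logging to that local instance through the Thrift PHP bindings looks roughly like this; the include paths (and, depending on your Thrift version, the class names) vary by install, so treat them as placeholders:
<?php
// Placeholders: adjust paths to wherever your Thrift/Scribe PHP bindings live.
require_once '/path/to/thrift/Thrift.php';
require_once '/path/to/thrift/transport/TSocket.php';
require_once '/path/to/thrift/transport/TFramedTransport.php';
require_once '/path/to/thrift/protocol/TBinaryProtocol.php';
require_once '/path/to/scribe/scribe.php'; // generated Scribe client

$socket    = new TSocket('localhost', 1463); // 1463 is Scribe's default port
$transport = new TFramedTransport($socket);
$protocol  = new TBinaryProtocol($transport);
$client    = new scribeClient($protocol);

$transport->open();
$entry = new LogEntry(array('category' => 'clicks',
                            'message'  => json_encode(array('uri' => $_SERVER['REQUEST_URI']))));
$client->Log(array($entry));
$transport->close();
?>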
I'm a big fan of Scribe, and I think it's well designed insofar as it has a very small memory and processor footprint and is quite easy to configure (although murder to install, due to the dependencies!). It just lacks documentation.
