Scrape page on download site to extract specific URLs - php

On a download site, I want to scrape all the URLs for the mirror sites. I am using PHP.
For example, on this page:
http://drivers.softpedia.com/progDownload/Gigabyte-GA-P55A-UD3-rev-10-Intel-SATA-RAID-Preinstall-Driver-9501037-Download-99091.html
I want to extract the following URLs:
http://drivers.softpedia.com/dyn-postdownload.php?p=99091&t=0&i=1
http://drivers.softpedia.com/dyn-postdownload.php?p=99091&t=0&i=2

Try with:
(http:\/\/drivers\.softpedia\.com\/dyn-postdownload\.php\?p=\d+&t=\d+&i=\d+)
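A minimal sketch of how that pattern could be applied with preg_match_all, assuming the page is fetched with file_get_contents (the delimiters are changed to # so the slashes don't need escaping):

<?php
// Fetch the download page and pull out every dyn-postdownload.php mirror link.
// Note: if the page source encodes ampersands as &amp;, adjust the pattern accordingly.
$html = file_get_contents('http://drivers.softpedia.com/progDownload/Gigabyte-GA-P55A-UD3-rev-10-Intel-SATA-RAID-Preinstall-Driver-9501037-Download-99091.html');

preg_match_all(
    '#(http://drivers\.softpedia\.com/dyn-postdownload\.php\?p=\d+&t=\d+&i=\d+)#',
    $html,
    $matches
);

print_r($matches[1]); // all mirror URLs found in the page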

It is unclear where you got the "t" and "i" parameters, since the source URL only contains the id (p). The pattern below should do for retrieving that last group of digits.
%(\d+)\.html$%
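If the t and i values are fixed as in the example (t=0, i=1 or 2 — that is an assumption), the mirror URLs could be rebuilt from that id alone; a small sketch:

<?php
// Extract the numeric id from the source URL and rebuild the mirror links.
$url = 'http://drivers.softpedia.com/progDownload/Gigabyte-GA-P55A-UD3-rev-10-Intel-SATA-RAID-Preinstall-Driver-9501037-Download-99091.html';

if (preg_match('%(\d+)\.html$%', $url, $m)) {
    $id = $m[1]; // "99091"
    $mirrors = array(
        "http://drivers.softpedia.com/dyn-postdownload.php?p=$id&t=0&i=1",
        "http://drivers.softpedia.com/dyn-postdownload.php?p=$id&t=0&i=2",
    );
    print_r($mirrors);
}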

Related

How to extract Google Search results redirect URLs using PHP

I need a way for PHP to extract the FIRST result's URL from a Google search results page, but I need the redirect URL.
The search query:
https://www.google.com/search?site=&source=hp&q=what%27s+new+in+windows+phone
I need this link:
https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=0CC0QFjAA&url=http%3A%2F%2Fwww.windowsphone.com%2Fen-us%2Fhow-to%2Fwp8%2Fbasics%2Fwhats-new-in-windows-phone&ei=KL03VeH-Icvfavz-gcAD&usg=AFQjCNEYZMSiSVQ-TKnKgbNNT5CY1o_1kw
Not the actual link:
http://www.windowsphone.com/en-us/how-to/wp8/basics/whats-new-in-windows-phone
Can anyone suggest anything? I would think you need to use cURL to fetch the page and then use DOM to parse the HTML to give you the redirect URL. I just don't know how to do this exactly. Any help would be appreciated.
I have researched quite a bit but have only found questions on how to get the normal URL.
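A rough sketch of that cURL + DOM idea. Google's markup changes often and scraping results may be blocked, so the only assumption here is that the first organic result is still an anchor whose href starts with /url?:

<?php
// Fetch the results page with cURL.
$ch = curl_init('https://www.google.com/search?q=' . urlencode("what's new in windows phone"));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0'); // Google serves different HTML to unknown clients
$html = curl_exec($ch);
curl_close($ch);

// Parse the HTML and grab the first anchor whose href starts with "/url?".
$dom = new DOMDocument();
libxml_use_internal_errors(true); // real-world HTML is rarely valid, silence the warnings
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$node  = $xpath->query('//a[starts-with(@href, "/url?")]')->item(0);
if ($node) {
    echo 'https://www.google.com' . $node->getAttribute('href'); // the redirect URL
}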

How to get image url from tweet text in php?

I'm currently displaying pics in my app from Twitter's default image service, since they get included in the JSON response, but I'd like to try to get images from yfrog, twitpic, lockerz or similar providers.
I'm using the REST API, so I was thinking about adding filter:links to the search query, extracting the URL from the tweet, and checking whether the link is an image. I'm not sure exactly how to get the URL, since I assume it will require some regular expressions. Also, most of the tweet URLs are shortened versions that redirect to the actual photo somewhere, so I believe this could be a problem. It would also be nice if I could verify that the URL contains any of the image providers mentioned above (kind of like a first filter before checking whether the URL is an image).
Could someone point me in the right direction? Thanks in advance!
For detecting the links, just google for a regex to match a URL, like these:
http://snipplr.com/view/2371/ or http://www.catswhocode.com/blog/15-php-regular-expressions-for-web-developers
and cycle through the matches array:
http://php.net/manual/en/function.preg-match.php
This one should solve the short links problem (assuming you have curl installed): follow redirects with curl in php
Use this here to check if the link is an image:
http://php.net/manual/en/function.get-headers.php (parse "Content-Type" for "image")
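A minimal sketch tying those steps together (the URL regex is deliberately simple and the provider filter is only illustrative):

<?php
$tweet = 'Check this out http://t.co/abc123 and http://twitpic.com/xyz'; // example text

// 1. Pull every URL out of the tweet text.
preg_match_all('#https?://\S+#i', $tweet, $m);

foreach ($m[0] as $shortUrl) {
    // 2. Follow redirects to resolve t.co / bit.ly style short links.
    $ch = curl_init($shortUrl);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_NOBODY, true); // a HEAD request is enough here
    curl_exec($ch);
    $finalUrl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
    curl_close($ch);

    // 3. First filter: only keep known image hosts.
    if (!preg_match('#(yfrog|twitpic|lockerz)\.#i', $finalUrl)) {
        continue;
    }

    // 4. Check the Content-Type header to confirm the link is an image.
    $headers = get_headers($finalUrl, 1);
    $type = isset($headers['Content-Type']) ? $headers['Content-Type'] : '';
    if (is_array($type)) {
        $type = end($type); // redirects can produce several Content-Type entries
    }
    if (stripos($type, 'image') !== false) {
        echo "Image found: $finalUrl\n";
    }
}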
I hope this helps.

Google API | URL to site title

For example to get the favicon of a site I can use
http://www.google.com/s2/favicons?domain=
and fill in the domain. Google returns the favicon.
I would also like to pull the title.
I know that I could parse the title from the HTML on the server side... or
I could use JavaScript's document.title on the client side.
But I don't want to have to download the whole site.
I used the favicon example because it is a good example of how you can get data about a site from the web without having to do any "heavy lifting".
There must be something similar for the title. Essentially I want to match a URL to its title.
You can make use of the Google Custom Search API to get the title of a website. Just search for "info:siteurl" and grab the title of the first result. I don't know exactly what you want to do, but it allows for 100 requests a day.
See details of the API here:
http://code.google.com/apis/customsearch/v1/reference.html
This post has a very nice piece of code which fetches the URL, description and keywords...
Getting title and meta tags from external website
You do have to download the whole page's source, but it's only one page, and the PHP DOMDocument class is very efficient.
You don't have to load the whole page to get a favicon because it's a separate file, but titles are stored inside the page source.
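A small sketch of that DOMDocument approach, assuming file_get_contents is allowed to fetch remote URLs (allow_url_fopen):

<?php
// Download the page source once and read the <title> element out of it.
function getPageTitle($url) {
    $html = @file_get_contents($url);
    if ($html === false) {
        return null;
    }

    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $dom->loadHTML($html);
    libxml_clear_errors();

    $titles = $dom->getElementsByTagName('title');
    return $titles->length ? trim($titles->item(0)->textContent) : null;
}

echo getPageTitle('http://www.example.com');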
http://forums.digitalpoint.com/showthread.php?t=605681
I think you are looking for something like this
With Google Search API
Create an API key here: https://developers.google.com/custom-search/v1/overview#api_key
Create a "Programmable Search Engine" from here: https://programmablesearchengine.google.com/ You can restrict it to a specific domain in these settings if desired.
Run a GET request with this URL: https://www.googleapis.com/customsearch/v1?key=${searchAPIKey}&cx=${searchID}&q=${url}
searchAPIKey comes from step 1
searchID comes from step 2
url is the search text; searching for a URL will usually put that result first. However, newer or hidden links won't show up in these results.
In the JSON response, you can get the title of the first result with items[0].title
JavaScript Fetch Example with Async/Await
const searchAPIKey = '' // from step 1
const searchID = ''     // from step 2
const url = ''          // the URL / search text to look up

async function getTitle() {
  const response = await fetch(`https://www.googleapis.com/customsearch/v1?key=${searchAPIKey}&cx=${searchID}&q=${encodeURIComponent(url)}`);
  const data = await response.json();
  console.log('title:', data.items[0].title);
}

getTitle();

How to make google search dynamic pages of my site

I am planning an informational site in PHP with MySQL.
I have read about Google sitemaps and Webmaster Tools.
What I did not understand is whether Google will be able to index the dynamic pages of my site using any of these tools.
For example, if I have URLs like www.domain.com/articles.php?articleid=103
Obviously this page will always have the same title and meta information, but the content will change according to articleid. So how will Google know about the article on the page in order to display it in search results?
Is there some way that I can get Google rankings for these pages?
A URL is a URL, Google doesn't give up when it sees a question mark in one (although excessive parameters may get ignored, but you only have one). All you need is a link to a page.
You could alternatively make the URL SEO-friendly with mod_rewrite, e.g. www.domain.com/articles/103:
RewriteRule ^articles/(.*)$ articles.php?articleid=$1 [L]
I do suggest you give each individual page relevant meta tags (no more than 80 characters) and don't place the article content within a table tag, as Google's placement algorithm is strict; random unrelated links will also harm the ranking.
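A rough sketch of how articles.php could emit a per-article title and description, assuming an articles table with title and summary columns (the table and column names are only illustrative):

<?php
// articles.php -- give every articleid its own title and meta description.
$pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8', 'user', 'pass');

$stmt = $pdo->prepare('SELECT title, summary FROM articles WHERE id = ?');
$stmt->execute(array((int) $_GET['articleid']));
$article = $stmt->fetch(PDO::FETCH_ASSOC);
?>
<head>
    <title><?php echo htmlspecialchars($article['title']); ?></title>
    <meta name="description"
          content="<?php echo htmlspecialchars(mb_substr($article['summary'], 0, 80)); ?>">
</head>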
You have to link to the page for Google to notice it. And the more links you have the higher up in Google's result list your page will get. A smart thing to do is to find a page where you can link to all of your pages. This way Google will find them and give them a higher ranking than if you only link to them once.

How to store crawled data from webpages

I want to build an educational search engine for my web app, so I decided to crawl about 10 websites with PHP and store the data in my database for later searching. How do I retrieve this data and store it in my database?
You can grab them with the file_get_contents() function. So you'd have:
$homepage = file_get_contents('http://www.example.com/homepage');
This function returns the page into a string.
Hope this helps. Cheers
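For the storage half of the question, one simple option is a PDO insert into a table of fetched pages (the pages table and its columns are just an assumption for the example):

<?php
$pdo = new PDO('mysql:host=localhost;dbname=crawler;charset=utf8', 'user', 'pass');

$url = 'http://www.example.com/homepage';
$homepage = file_get_contents($url); // the page source as a string

// Store the URL and its HTML so it can be indexed/searched later.
$stmt = $pdo->prepare('INSERT INTO pages (url, html, fetched_at) VALUES (?, ?, NOW())');
$stmt->execute(array($url, $homepage));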
To build a crawler, I would first make the list of URLs to fetch and then fetch them.
A. Make the list
Define a start URL to crawl
Add this URL to the list of URLs to crawl (the job list)
Define the max depth
Parse the first page, find all the href attributes, and get the links.
For each link: if it's from the same domain or relative, add it to the job list.
Remove the current URL from the job list.
Restart with the next URL in the job list, if it is not empty.
For this you could use this class, which makes parsing HTML really easy:
https://simplehtmldom.sourceforge.io/
B. Get content
Loop over the list you made and get the content; file_get_contents will do this for you:
https://www.php.net/file-get-contents
This is basically just a starting point. In step A, you should keep a list of already-parsed URLs so each one is checked only once. Query strings are also something to watch out for, to avoid scanning the same page multiple times under different query strings. A rough sketch of both steps follows below.
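Here is that sketch of steps A and B combined, using the built-in DOMDocument instead of simplehtmldom, with a page limit standing in for the max depth (all names here are illustrative):

<?php
$startUrl = 'http://www.example.com/';
$host     = parse_url($startUrl, PHP_URL_HOST);
$jobList  = array($startUrl); // URLs still to fetch
$visited  = array();          // already parsed URLs, so each is checked only once
$maxPages = 50;               // crude stand-in for a max depth

while ($jobList && count($visited) < $maxPages) {
    $url = array_shift($jobList); // take (and remove) the current URL from the job list
    if (isset($visited[$url])) {
        continue;
    }
    $visited[$url] = true;

    $html = @file_get_contents($url); // step B: get the content
    if ($html === false) {
        continue;
    }

    // ... store or index $html here ...

    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    foreach ($dom->getElementsByTagName('a') as $a) {
        $link = $a->getAttribute('href');
        // Very naive handling of relative links; a real crawler needs proper URL resolution.
        if ($link !== '' && $link[0] === '/') {
            $link = 'http://' . $host . $link;
        }
        // Same-domain only, skip query strings, skip anything already seen.
        if (parse_url($link, PHP_URL_HOST) === $host
            && parse_url($link, PHP_URL_QUERY) === null
            && !isset($visited[$link])) {
            $jobList[] = $link;
        }
    }
}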
