I want to build an educational search engine for my web app, so I decided to crawl about 10 websites with PHP and store their content in my database for later searching. How do I retrieve these pages and store them in my database?
You can grab a page with the file_get_contents() function. So you'd have:
$homepage = file_get_contents('http://www.example.com/homepage');
This function returns the page as a string.
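In practice you may also want to check the return value, since file_get_contents returns false on failure. A minimal sketch (the wrapper function name is my own):

```php
<?php
// Minimal sketch: wrap file_get_contents with basic error handling.
// Returns the page body as a string, or null on failure.
function fetch_page(string $url): ?string
{
    $body = @file_get_contents($url);   // @ suppresses the PHP warning
    return $body === false ? null : $body;
}

// Example (requires allow_url_fopen to be enabled for http:// URLs):
// $homepage = fetch_page('http://www.example.com/homepage');
```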
Hope this helps. Cheers
To build a crawler, I would first make the list of URLs to fetch, then fetch them.
A. Make the list
Define a starting list of URLs to crawl
Add these URLs to the job list (the list of URLs still to crawl)
Define a maximum depth
Parse the first page: find all the <a> tags and get each href link
For each link: if it is on the same domain, or relative, add it to the job list
Remove the current URL from the job list
Repeat with the next URL in the job list while it is non-empty
For this you could use this class, which makes parsing HTML really easy:
https://simplehtmldom.sourceforge.io/
B. Get content
Loop over the list you built and fetch each page's content. file_get_contents will do this for you:
https://www.php.net/file-get-contents
This is just a starting point. In step A, you should keep a list of already-parsed URLs so each one is checked only once. Query strings are also something to watch for, to avoid scanning the same page multiple times under different query strings.
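The parsing part of step A can also be sketched with PHP's built-in DOMDocument instead of the simplehtmldom class mentioned above. The function name and the (deliberately naive) relative-URL handling below are my own:

```php
<?php
// Sketch of step A's parsing: pull all same-domain links out of a page.
// Relative-URL resolution here is naive (resolves against the site root).
function extract_links(string $html, string $baseUrl): array
{
    $host  = parse_url($baseUrl, PHP_URL_HOST);
    $links = [];

    $dom = new DOMDocument();
    @$dom->loadHTML($html);   // @ silences warnings on sloppy HTML

    foreach ($dom->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if ($href === '' || $href[0] === '#') {
            continue;          // skip empty links and bare fragments
        }
        if (parse_url($href, PHP_URL_HOST) === null) {
            // Relative link: resolve against the site root (naive).
            $href = 'http://' . $host . '/' . ltrim($href, '/');
        }
        if (parse_url($href, PHP_URL_HOST) === $host) {
            $links[] = $href;  // same domain: candidate for the job list
        }
    }
    return array_values(array_unique($links));
}

// The crawl loop (step B) then repeatedly shifts a URL off the job list,
// fetches it with file_get_contents, calls extract_links, and appends any
// unseen links back onto the list until it is empty or max depth is hit.
```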
Related
I am using Beautiful Soup and pyquery (in Python) and pQuery (in PHP) for scraping (parsing and fetching the parts of a page's HTML that I want). But I have one problem with them: the number of URLs I want to fetch is very large, and all of these methods first load the entire page before I can scrape it with selectors. I only need some part of those pages, for example a single specified class, yet I must download the whole page, which costs me a lot of bandwidth.
I want to know whether there is any way (my knowledge tells me there is not, but maybe someone has an idea or trick) or tool that fetches only the specified part of a page instead of the whole thing?
More details:
Suppose I want to get the title on this page; the URL is https://stackoverflow.com/posts/34892845,
and I just want the text of question-hyperlink. I want to get the title without fetching the whole page's data (to save bandwidth in a bulk operation).
I'm making a PHP crawler to explore an e-shop called alza.cz. I want links to all products in that e-shop. I'm on the address http://www.alza.cz/notebooky/18842920.htm, but this displays only the first 21 items. To get all items I must go to http://www.alza.cz/notebooky/18842920.htm#f&pg=1/10000.
The crawler uses file_get_contents to get the HTML of the page, which is then parsed using DOM. The problem is that file_get_contents seems to ignore the part after the # (it returns only the first 21 items instead of all of them). Any ideas?
file_get_contents ignores the #xxxxx part of the URL (the fragment identifier) and does not include it in the requested URL. The fragment is something a user agent uses on the client side. Most likely, the website has some JavaScript that uses AJAX to load a new page of results.
You could see if the page obeys the Google AJAX Crawling Specification, though based on your example, it doesn't look like it. If you see "hash bang" fragment identifiers like #!foo=bar, that's a good sign.
So, you'll need to observe the AJAX requests in Firebug or similar and replicate the same requests yourself.
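Once you have found the real endpoint in the network panel, you can request it directly. A sketch, where the endpoint URL and query parameter below are purely hypothetical; you must discover the real ones for the site you are crawling:

```php
<?php
// Hypothetical sketch: request the discovered AJAX endpoint directly.
// The endpoint URL and the "pg" parameter below are made up.
$page = 2;
$endpoint = 'http://www.example.com/ajax/products?pg=' . $page;

// Some endpoints check for an XMLHttpRequest header, which you can mimic:
$context = stream_context_create([
    'http' => ['header' => "X-Requested-With: XMLHttpRequest\r\n"],
]);

// $html = file_get_contents($endpoint, false, $context);
// $html would then contain just the fragment the site injects via AJAX,
// ready to be parsed with DOMDocument as usual.
```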
I am currently working on an eCommerce-style project that uses a search engine to browse 7,000+ entries stored in a database. Every one of these search results contains a link to a full description page. I have been looking into creating clean/slug URLs for this; my goal is that if a user clicks on a search result entry, the browser navigates to a new page using the slug URL.
www.mydomain.com/category/brown-fox-statue-23432323
I have a system in place to convert a string/sentence into URL form. However, it is not clear to me what the next steps are once these URLs are created. What is the general plan for implementing this system? Do the URLs need to be stored in a database? Am I supposed to use POST or GET data from the search result page to create content at these full description URLs?
I appreciate any suggestions!
Many thanks in advance!
Each product has a unique URL associated with it in the database.
When you perform a search, you just return the correct unique URL.
That way you only ever work out what the URL should be once, when the product is first added, and that URL will always relate to that one product. This is the stage at which you use your system to create the URL.
Maybe you can enlighten us as to whether you are using a framework? Some frameworks (like Zend) have ini/xml files for routing. But you will still need to store the URLs, or at least the article slugs, in a database.
Storing the entry URLs in the database after they have been "searched" is necessary because you want slugs to stay the same for each entry. This allows for better caching/SEO, which will improve your site's usability.
Hope that helps!
Edit: Saw your question about pulling up individual articles. You will have to start by setting up a relation between your entries and URLs in your database. Create a url table with url_id and url, then place url_id on the entry table. Whenever someone visits a URL, search the url table for the current URL, recall the url_id, and then pull the entry. At that point it's just styling the page to make it look the way you want.
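That relation can be sketched with PDO (SQLite is used here only so the sketch runs anywhere; the entry columns are illustrative):

```php
<?php
// Sketch of the url <-> entry relation: a url table (url_id, url) and
// an entry table carrying url_id, as described above.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$db->exec('CREATE TABLE url   (url_id INTEGER PRIMARY KEY, url TEXT UNIQUE)');
$db->exec('CREATE TABLE entry (entry_id INTEGER PRIMARY KEY, title TEXT,
                               url_id INTEGER REFERENCES url(url_id))');

// Insert a product and its slug URL.
$db->exec("INSERT INTO url (url) VALUES ('/category/brown-fox-statue-23432323')");
$db->exec("INSERT INTO entry (title, url_id) VALUES ('Brown Fox Statue', 1)");

// On a request, look up the current URL and pull the matching entry.
$stmt = $db->prepare(
    'SELECT e.title FROM entry e JOIN url u ON u.url_id = e.url_id WHERE u.url = ?'
);
$stmt->execute(['/category/brown-fox-statue-23432323']);
$title = $stmt->fetchColumn();
```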
A common approach is to have a bijective (reversible) function that can convert a "regular" URL into a user-friendly URL:
E.g.:
www.mydomain.com/category/brown-fox-statue-23432323
<=>
www.mydomain.com/index.php?category=brown-fox-statue-23432323
Then you need not keep a record of this mapping (convention over configuration).
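A sketch of such a reversible mapping: keep the numeric id at the end of the slug, so the id can be recovered without any lookup table (the function names are my own):

```php
<?php
// Build a pretty slug that still carries the id, and recover the id
// from it: a bijective mapping, so no mapping table is needed.
function make_slug(string $title, int $id): string
{
    $slug = strtolower(trim(preg_replace('/[^a-z0-9]+/i', '-', $title), '-'));
    return $slug . '-' . $id;
}

function slug_to_id(string $slug): int
{
    // Everything after the last hyphen is the id.
    return (int) substr($slug, strrpos($slug, '-') + 1);
}
```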
Search StackOverflow for "User Friendly URL Rewriting" for information on how to achieve this automatically with Apache. This question is a good starting point.
I need to figure out how (if it is possible) to populate an html/php page with the following information:
I have the URL of a page and a set of keywords. I'd like to check every week what position that URL has in Google's search results when a search is performed for the set of keywords associated with it.
Say, if it is on the second page of Google it would have position 18, etc. (counting from the first result on the first page).
I then have an html/php page with a table structure that has a column with URLs and another column with the keywords associated with those URLs. There should then be two more columns containing the position in Google's search results and the date that position was checked (so these two columns should be populated by the script that checks the position).
I'm going to be honest: I have no idea how to achieve this, nor do I know whether it is possible. Please suggest ideas, code snippets, or maybe some services that do this kind of thing.
To scrape Google's result pages, have a look here.
But note that Google's former SOAP API no longer exists. Because of this, I doubt that it is legal to scrape Google's pages. See this Google blog page and Google's Terms of Use.
Google writes this:
Automated searching is strictly prohibited, as is permanently storing any search results. Please refer to the Terms of Use for more detail.
For example, to get the favicon of a site I can use
http://www.google.com/s2/favicons?domain=
and fill in the domain. Google returns the favicon.
I would also like to pull the title.
I know that I could parse the title from the HTML on the server side, or
I could use JavaScript's document.title on the client side.
But I don't want to have to download the whole page.
I used the favicon example because it is a good example of how data about a site can be available on the web without having to do any "heavy lifting".
There must be something similar for the title. Essentially, I want to match a URL to its title.
You can make use of the Google Custom Search API to get the title of a website. Just search for "info:siteurl" and grab the title of the first result. I don't know exactly what you want to do, but it allows 100 requests a day.
See details of the API here:
http://code.google.com/apis/customsearch/v1/reference.html
This post has a very nice piece of code that fetches the URL, description, and keywords:
Getting title and meta tags from external website
You do have to download the whole page's source, but it's only one page, and using the PHP DOMDocument class is very efficient.
You don't have to load the whole page to get a favicon, because it's a separate file, but titles are stored inside the page source.
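A minimal sketch of that approach with DOMDocument (the helper name is my own):

```php
<?php
// Pull the <title> out of a page's HTML with PHP's built-in DOMDocument.
function get_title(string $html): ?string
{
    $dom = new DOMDocument();
    @$dom->loadHTML($html);  // @ silences warnings on sloppy markup
    $titles = $dom->getElementsByTagName('title');
    return $titles->length > 0 ? trim($titles->item(0)->textContent) : null;
}

// Example usage (downloads the whole page source first):
// $title = get_title(file_get_contents('http://www.example.com/'));
```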
I think you are looking for something like this:
http://forums.digitalpoint.com/showthread.php?t=605681
With Google Search API
Create an API key here: https://developers.google.com/custom-search/v1/overview#api_key
Create a "Programmable Search Engine" from here: https://programmablesearchengine.google.com/ (you can restrict it to a specific domain in its settings if desired).
Run a GET request with this URL: https://www.googleapis.com/customsearch/v1?key=${searchAPIKey}&cx=${searchID}&q=${url}
searchAPIKey comes from step 1
searchID comes from step 2
url is the search text; putting a URL there will usually put that result first. However, newer or hidden links won't show up in these results.
In the JSON response, you can get the title of the first result with items[0].title
JavaScript Fetch Example with Async/Await
const searchAPIKey = ''
const searchID = ''
const url = ''

async function getTitle() {
  const response = await fetch(`https://www.googleapis.com/customsearch/v1?key=${searchAPIKey}&cx=${searchID}&q=${url}`)
  const data = await response.json()
  console.log('title:', data.items[0].title)
}

getTitle()