PHP web-crawler [duplicate]

This question already has answers here:
Closed 11 years ago.
I'm looking for a PHP web-crawler to gather all the links for a large site and tell me if the links are broken.
So far I've tried modifying an example from here myself. I've also tried grabbing phpDig, but the site is down. Any suggestions on how I should proceed would be great.
EDIT
The problem isn't grabbing the links; the issue is the scale. I'm not sure if the script I modified is sufficient to grab what could possibly be thousands of URLs, as when I tried setting the search depth to 4, the crawler timed out through the browser. Someone else mentioned killing processes so as not to overload the server; could someone please elaborate on that issue?

Not a ready-to-use solution, but Simple HTML DOM parser is one of my favourite DOM parsers.
It lets you use CSS selectors to find nodes in the document, so you can easily find <a href="">'s.
With those hyperlinks you can build your own crawler and check whether the pages are still available.
You can find it here.
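To make that concrete, here is a minimal sketch of such a crawler-and-checker, assuming simple_html_dom.php has been downloaded next to the script; the start URL and the use of get_headers() for the status check are illustrative choices, not the library's prescribed method:

```php
<?php
// Minimal sketch of a link-gathering crawler that reports broken links,
// assuming simple_html_dom.php sits next to this script.
require_once 'simple_html_dom.php';

function check_links($url) {
    $html = file_get_html($url);          // fetch and parse the page
    if (!$html) {
        return;
    }
    foreach ($html->find('a') as $a) {    // CSS-style selector for anchors
        $href = $a->href;
        if (strpos($href, 'http') !== 0) {
            continue;                     // this sketch skips relative links
        }
        $headers = @get_headers($href);   // cheap reachability/status check
        echo $href, ' => ', $headers ? $headers[0] : 'no response', "\n";
    }
    $html->clear();                       // free the parser's memory
}

check_links('http://example.com/');
```

Regarding the timeout mentioned in the question's edit: for thousands of URLs, run a script like this from the CLI with set_time_limit(0) rather than through a browser, and keep a visited set so the same page isn't fetched twice.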


SimpleHtmlDom load items that load on scroll down (Advice) [duplicate]

This question already has answers here:
Loading content as the user scrolls down
(2 answers)
Closed 4 years ago.
Is it possible to get data from items that are loaded on scroll with the Simple HTML DOM parser?
My code is done and it works perfectly, so I don't need any help with that; I'm just asking for advice on whether this is even possible to accomplish. I know the DOM parser loads whatever it sees on the first load of the page, but is it possible to load more?
Example:
The page that I am loading has 10 items on it, but when you scroll down it loads 10 more. Or is that not possible?
simple-html-dom doesn't do this by itself; you need to study how the JavaScript on the page fetches new items and re-implement that in PHP. The "Network" tab of Chrome's Developer Tools is a great help in doing this: rather than studying the JavaScript itself, you can just study the requests it creates when you scroll. I usually find that to be a much easier approach.
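As a hedged illustration of that approach, the sketch below pretends the page's JavaScript fetches each batch from a paginated URL; the endpoint, the page parameter, and the div.item selector are all made-up assumptions you would replace with whatever the Network tab shows:

```php
<?php
// Hypothetical sketch: instead of scrolling, call the AJAX endpoint the
// page's JavaScript hits when you scroll. The URL, the "page" parameter,
// and the selector are assumptions - find the real ones in the Network
// tab (the site might use ?offset=, POST bodies, or JSON instead).
require_once 'simple_html_dom.php';

$allItems = [];
for ($page = 1; $page <= 5; $page++) {
    $html = file_get_html("http://example.com/items?page=$page");
    if (!$html) {
        break;                                    // stop when a batch fails
    }
    foreach ($html->find('div.item') as $item) {  // selector is an assumption
        $allItems[] = trim($item->plaintext);
    }
    $html->clear();
}
print_r($allItems);
```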

PHP copy website table [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
HTML Scraping in Php
I'm far from being a web developer expert, so sorry in advance if I'm missing something basic:
I need to copy a table into a MySQL database using PHP; the table resides on a website which I don't own, but I have permission to copy and publish it.
Manually, when I view this website in my browser, I need to click on a link from the main website URL (I can't reach the final destination page directly since its link changes all the time; however, the main page link is static, and the link to click is also static).
An example of the kind of content I need to copy from (just an example, this is not the real content):
http://www.flightstats.com/go/FlightStatus/flightStatusByAirport.do?airportCode=JFK&airportQueryType=0
Most people are going to ask what you have tried. Since you mentioned that you don't have much development experience, here are some tips on how to go about it; I have to put this as an answer so it is easier to read.
What you're going to need to do is scraping.
Using PHP, you'd use the following functions at the very least:
file_get_contents() - this function will read the data at the URL.
preg_match_all() - regular expressions will let you extract the data you are looking for, though some/many people will say that you should go through the DOM instead.
The data returned by preg_match_all() can be stored in your MySQL table. Though, because the data changes so frequently, you might be better off just scraping that section and storing the entire table as a cache (though I have to say I have no idea what you are trying to do on your site, so I could well be wrong).
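Here is a rough sketch of those steps combined, with the regex, database credentials, table name, and column names all invented for illustration; a DOM-based parse would be more robust than a regex for anything long-lived:

```php
<?php
// Hedged sketch of the approach above: fetch the page, pull out table
// rows with a regular expression, and insert them into MySQL. The
// regex, credentials, and schema are assumptions - adapt them to the
// actual page. Regexes on HTML are fragile by design; prefer
// DOMDocument for anything beyond a quick one-off.

$html = file_get_contents('http://www.flightstats.com/go/FlightStatus/flightStatusByAirport.do?airportCode=JFK&airportQueryType=0');

// Grab the first two cell contents of each table row.
preg_match_all('#<tr[^>]*>\s*<td[^>]*>(.*?)</td>\s*<td[^>]*>(.*?)</td>#si',
               $html, $rows, PREG_SET_ORDER);

$pdo  = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO flights (flight, status) VALUES (?, ?)');

foreach ($rows as $row) {
    $stmt->execute([strip_tags($row[1]), strip_tags($row[2])]);
}
```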

Create, Insert, Update or Delete XML using PHP [duplicate]

This question already has answers here:
Add, update and edit an XML file with PHP
(3 answers)
Closed 6 years ago.
So, I completed my uni work using a little help from you fantastic programmers out there and a few all-nighters, going as far as I could with manipulating XML data through JavaScript. Now I'm done for the summer, and my dear old mother has asked me to create her a basic site for her maths tutoring service with info and prices etc... I was thinking, as she doesn't need much, I would go in for using XML again, but this time without being restricted from using PHP to create new elements/nodes, update, or delete.
I was going to create her a basic booking system with a little admin panel for editing entries etc... As the information doesn't really need to be too secure, the use of XML seems to be alright for the purpose.
My question is: does anyone know of any clean, basic functions that can be used to this end with XML using PHP? In terms of functions, I mean things like Create/Insert, Edit/Update, Delete, etc.
Any help, or even a site that has a decent tutorial, would be great, as I've gone through YouTube and there isn't anything decent or clean and simple.
Thanks in advance!
Well, you can use the DOM methods to create/edit/delete nodes in an XML file.
http://us2.php.net/manual/en/book.dom.php
Just curious: why do you want to play with XML? Maybe the easier choice would be a database as simple as SQLite: http://php.net/sqlite.
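For example, here is a minimal sketch of all three operations with DOM, where bookings.xml and the <booking>/<student> element names are invented for illustration:

```php
<?php
// Minimal sketch of create/update/delete on an XML file using DOM.
// bookings.xml is assumed to already exist with a root element; the
// element names and ids are made up for this example.

$doc = new DOMDocument('1.0', 'UTF-8');
$doc->preserveWhiteSpace = false;
$doc->formatOutput = true;
$doc->load('bookings.xml');

// Create/Insert: append a new <booking> node.
$booking = $doc->createElement('booking');
$booking->setAttribute('id', '42');
$booking->appendChild($doc->createElement('student', 'Jane'));
$doc->documentElement->appendChild($booking);

$xpath = new DOMXPath($doc);

// Edit/Update: change the student name on booking 17.
foreach ($xpath->query('//booking[@id="17"]/student') as $node) {
    $node->nodeValue = 'Janet';
}

// Delete: remove booking 3 entirely.
foreach ($xpath->query('//booking[@id="3"]') as $node) {
    $node->parentNode->removeChild($node);
}

$doc->save('bookings.xml');
```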

What is the best way to create a sitemap.xml? [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
How to create a sitemap using PHP & MySQL
So I'm kinda stuck here. I have a website with a pretty big database that constantly changes. Now I want to help the search engines by supplying a sitemap.xml file. Normally I would use a web service to do this, but that's not really possible in this case.
To be honest, I have no clue where to start. How would I go about doing this? Sorry if this is too basic a question, but Google couldn't help me.
Edit: Some more info. The DB currently holds 1k pages; I want to go up to around 10k. I use MySQL to echo these from my database, and then .htaccess to rewrite the URLs (PHP gets the ID, etc.).
You would need to install a crawler to do it the way a web service does. The easier way is to write a PHP script and generate the sitemap XML file yourself.
Write a query to get the links from your database, then iterate over the results to create the sitemap.
See this post for an example PHP script: How to create a sitemap using PHP & MySQL
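For instance, here is a minimal sketch of such a script; the pages table, the /page/{id} URL scheme, and the database credentials are assumptions standing in for your actual schema and rewrite rules:

```php
<?php
// Sketch of a sitemap generator. Assumes a `pages` table with an `id`
// column and URLs of the form /page/{id} (matching the .htaccess
// rewrites mentioned in the question); adapt both to your schema.

$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');

$xml  = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
$xml .= '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";

foreach ($pdo->query('SELECT id FROM pages') as $row) {
    $xml .= "  <url><loc>http://example.com/page/{$row['id']}</loc></url>\n";
}

$xml .= '</urlset>';
file_put_contents('sitemap.xml', $xml);   // regenerate on a cron schedule
```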

How to programmatically turn any webpage into an RSS feed? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 7 years ago.
There are many websites and blogs which provide RSS feeds, but on the other hand there are also many which do not. I want to turn that type of web page into an RSS feed.
I found some solutions through Google, like Feed43, Page2rss, Dapper, etc., but I want an open-source project which can perform this task, or any tutorial explaining how to do it.
Please give me suggestions, and if you can explain, you are most welcome.
My preferred language is PHP.
There's nothing magic about RSS. I suggest you read this tutorial to understand how to build an RSS feed from scratch:
http://www.xul.fr/en-xml-rss.html
Then use your PHP skills to build one from your content. A generic HTML-to-RSS scraper can be found online by searching for "html to rss converter" or similar, but most of these are hosted solutions, and the RSS feeds they produce aren't that great. A good RSS feed requires understanding the content you're syndicating, not just the raw HTML, IMHO.
In general, there is not going to be any "one size fits all" solution to something like this. You'll have to examine the HTML structure of the blog you want to build an RSS feed from, parse out the content you are interested in, and stick it into an RSS feed.
Here are some PHP tools to help get you started:
Parsing HTML:
DOMDocument (swiss-army-knife of HTML/XML parsing)
SimpleXML (easy to use, but requires valid XML)
Tidy (can be used to clean up bad HTML)
Understanding RSS Feeds:
http://en.wikipedia.org/wiki/RSS
To construct the feeds with PHP, you can once again use DOMDocument or SimpleXML. Another option, depending on the format of the HTML you want to convert into RSS, is to create an XSLT stylesheet to transform it.
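As a hedged example of the DOMDocument route, the sketch below scrapes the headings of a page and emits a minimal RSS 2.0 feed; the source URL and the //h2/a XPath are guesses about the page's markup that you would adapt per site:

```php
<?php
// Sketch: parse a page's linked headings with DOMDocument, then build
// a minimal RSS 2.0 feed from them. URL and XPath are assumptions.

libxml_use_internal_errors(true);              // tolerate real-world HTML
$html = new DOMDocument();
$html->loadHTMLFile('http://example.com/blog');
$xpath = new DOMXPath($html);

$rss = new DOMDocument('1.0', 'UTF-8');
$rss->formatOutput = true;

// Helper: create an element with properly escaped text content.
function el(DOMDocument $doc, $name, $text) {
    $e = $doc->createElement($name);
    $e->appendChild($doc->createTextNode($text));
    return $e;
}

$root = $rss->createElement('rss');
$root->setAttribute('version', '2.0');
$channel = $rss->createElement('channel');
$channel->appendChild(el($rss, 'title', 'Example feed'));
$channel->appendChild(el($rss, 'link', 'http://example.com/blog'));
$channel->appendChild(el($rss, 'description', 'Scraped headlines'));

foreach ($xpath->query('//h2/a') as $a) {      // markup-specific guess
    $item = $rss->createElement('item');
    $item->appendChild(el($rss, 'title', $a->textContent));
    $item->appendChild(el($rss, 'link', $a->getAttribute('href')));
    $channel->appendChild($item);
}

$root->appendChild($channel);
$rss->appendChild($root);
header('Content-Type: application/rss+xml');
echo $rss->saveXML();
```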
There is no simple or concrete answer to this question, but I will get you started.
First, you need to build a crawler of sorts. Typically, you are going to want this to be multi-threaded and run in the background on your server. This might be as simple as forking PHP processes on the server, but you might find a more efficient way, depending on how much traffic you expect.
Now, probably the best place to start would be to read the DOM. See http://php.net/manual/en/class.domdocument.php. Look for headings and try to associate them with the paragraphs below them. Beware, though, that a great many sites out there (and likely most of the ones that don't already have a feed) don't structure their markup in an organized way. But it is a place to start.
There are plenty of element attributes you can use too, such as alt text. Also, in time you may find a lot of sites using a particular template that you can write code to handle directly.
You should also have something to read existing feeds. If a site has a feed, there's no sense in generating one for it, right? Use SimplePie to get started; there are alternatives if you don't like it. http://simplepie.org/
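A minimal sketch of that first check with SimplePie (assuming it is installed, e.g. via Composer; the feed URL is an example):

```php
<?php
// Read an existing feed with SimplePie, so you only generate feeds for
// sites that lack one. Assumes a Composer install of simplepie.
require 'vendor/autoload.php';

$feed = new SimplePie();
$feed->set_feed_url('http://example.com/feed');
$feed->init();

foreach ($feed->get_items() as $item) {
    echo $item->get_title(), ' - ', $item->get_permalink(), "\n";
}
```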
Once you have parsed the page, you'll want a database backend to track it, its changes, and so on.
From there, you need something to generate the feed. There are plenty of OOP classes for doing this. Often I just write my own, but that is up to you.
If you build sites with the Symphony CMS, then yes, it's very easy. See this snippet of a tutorial: Learn here
