I'd like to know how to build a site crawler in PHP that detects each page of a website and generates an entry for it in an XML file. I've seen plenty of websites doing this, so I'm curious how to do it from scratch, and whether there is any script or tutorial that teaches it.
Don't use regex. The proper way to parse HTML is with a DOMDocument object.
Load the first page into a DOMDocument object.
Use XPath statements to gather all of the anchor tag hrefs found in that page.
Use those values to find more pages to load, and start over at step one with each of them; a rough sketch follows the manual link below.
http://www.php.net/manual/en/class.domdocument.php
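A minimal sketch of those three steps, assuming allow_url_fopen is enabled and using a made-up start URL:

    <?php
    // Minimal sketch: load one page and collect every anchor href via XPath.
    libxml_use_internal_errors(true);            // real-world HTML is rarely well formed

    $url  = 'http://www.example.com/';           // hypothetical start page
    $html = file_get_contents($url);             // requires allow_url_fopen

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    $xpath = new DOMXPath($doc);
    $links = array();
    foreach ($xpath->query('//a/@href') as $href) {
        $links[] = $href->value;                 // queue these for the next pass
    }

    print_r($links);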
Here is the algorithm
Step 1 -> Get the site's address, and verify that it is in the correct format and ends with a page (www.xyz.com/page.html), not a bare domain (www.xyz.com/).
Step 2 -> Get the contents of that file and, using a regular expression, try to extract the list of pages it links to.
Step 3 -> Store them in the DB for future use, and repeat step 2 on those files too (see the sketch below).
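A rough sketch of steps 1 and 2; the URL is a placeholder and the regex is only illustrative (a parser, as suggested above, is more robust):

    <?php
    // Steps 1 and 2 only; storing into the DB (step 3) is left out.
    $url = 'http://www.xyz.com/page.html';        // hypothetical input

    // Step 1: correct format, and it must end with a page rather than "/".
    $parts = parse_url($url);
    if ($parts === false || empty($parts['host']) || empty($parts['path'])
            || substr($parts['path'], -1) === '/') {
        die('Address must point to a page, e.g. www.xyz.com/page.html');
    }

    // Step 2: fetch the contents and list linked pages with a (fragile) regex.
    $html = file_get_contents($url);
    preg_match_all('/<a\s[^>]*href=["\']([^"\']+)["\']/i', $html, $matches);
    $pages = array_unique($matches[1]);           // candidates for the DB and the next round

    print_r($pages);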
Related
I want to get data from a website and capture part of its content. Example: capture the menu box of the website.
See this example
You might want to use a DOM parser of sorts (for example) to read the entire structure of the website.
Be aware that this is a very heavy process!
With a DOM parser you import the entire website structure, find the part you need, and read the contents from it.
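For instance, with PHP's built-in DOM extension you could read just one section; the element id here is an assumption made for illustration:

    <?php
    libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML(file_get_contents('http://www.example.com/'));   // hypothetical URL

    // Find the part you need, e.g. a menu box with a known id, and read its contents.
    $xpath = new DOMXPath($doc);
    $menu  = $xpath->query('//div[@id="menu"]')->item(0);
    if ($menu !== null) {
        echo $doc->saveHTML($menu);    // the HTML of just that box
    }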
So I have been using a method to retrieve images from a website, but I thought it may be easier to simply show the page without some details I don't want displayed. The website in particular knows we are doing this, so there shouldn't be any legal complications. So would it be possible to open the HTML page within PHP, search for a specific element that would be the same on each page, remove it, and then redisplay the page in the browser with its new edits?
You can use the Tidy or HTML Purifier libraries to clean up and navigate the document tree, find the elements you are looking for, and remove them. I can't find comprehensive docs for Tidy, but the examples on php.net should be enough to help you get started.
Yes, this is possible: you'd need to use file_get_contents("http://url"); to load the page into a string, then preg_replace with a regex to clean the string.
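A rough sketch of that approach; the URL and the div being stripped are assumptions, and a regex like this only copes with simple, non-nested markup:

    <?php
    $html = file_get_contents('http://www.example.com/page.html');   // hypothetical page

    // Remove a block that is the same on every page, e.g. <div class="ads">...</div>.
    // This only works reliably if the div contains no nested <div> tags.
    $clean = preg_replace('/<div class="ads">.*?<\/div>/s', '', $html);

    echo $clean;   // redisplay the edited page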
I want to dynamically load content from a live soccer score website into my database.
I also want to do this daily, from a single page on that website (the soccer matches for that day).
If you can help me only with the connection and retrieval of data from that webpage, I will manage the rest.
website: http://soccerstand.com/
language: php/java - mysql
Thank you!
You can use php's file function to get the data. You just pass it a URL and it returns the content as an array of lines from the file. You can also use file_get_contents to get the content as one big string.
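For example (allow_url_fopen must be enabled for remote URLs):

    <?php
    // As an array of lines...
    $lines = file('http://soccerstand.com/');

    // ...or as one big string.
    $html = file_get_contents('http://soccerstand.com/');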
Ethical questions about scraping other site's data aside:
With PHP you can do an "open" call on a website as long as you're set up correctly. See this page for more details on that and examples: http://www.php.net/manual/en/wrappers.http.php
From there you have the content of the web page and it's a matter of breaking it up. Off the top of my head, I'd use regular expressions or an HTML parser to break apart the HTML, then loop through the child elements and feed the data into your database calls to save it (a rough sketch follows below).
There are a lot of resources for parsing HTML on the web and it's simply a matter of choosing the one that will work best for you.
Keep in mind you'll need to monitor the site for changes, because if they change elements, or their classes/ids you might need to change your parsing structure as well.
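A hedged sketch of that flow; the XPath selector, table name, columns and credentials are all made-up assumptions:

    <?php
    libxml_use_internal_errors(true);

    // Fetch the page through the http:// wrapper.
    $html = file_get_contents('http://soccerstand.com/');

    // Break the HTML apart with a parser rather than regex.
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    // Loop over the elements that hold the scores (the selector is a guess).
    $pdo  = new PDO('mysql:host=localhost;dbname=scores', 'user', 'pass');
    $stmt = $pdo->prepare('INSERT INTO matches (summary) VALUES (?)');

    foreach ($xpath->query('//tr[contains(@class, "match")]') as $row) {
        $stmt->execute(array(trim($row->textContent)));
    }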
Using curl you can get the content of the page, then using regex you can pull out what you want.
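A minimal curl fetch plus a regex match, purely as a sketch (the pattern is just an example):

    <?php
    $ch = curl_init('http://soccerstand.com/');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body instead of printing it
    $html = curl_exec($ch);
    curl_close($ch);

    // Pull out whatever you are after, e.g. the page title.
    if (preg_match('/<title>(.*?)<\/title>/s', $html, $m)) {
        echo $m[1];
    }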
There is an easy way: http://www.jonasjohn.de/lab/htmlsql.htm
I want to extract specific data from the website's pages...
I don't want to get all the contents of a specific page; I only need some portion (maybe just the data inside a table or a content div), and I want to do it repeatedly across all the pages of the website.
How can I do that?
Use curl to retrieve the content and XPath to select the individual elements; a hedged sketch follows below.
Be aware of copyright though.
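A sketch combining the two; the div id is an assumption standing in for whatever container actually holds your data:

    <?php
    // Retrieve the page with curl.
    $ch = curl_init('http://www.example.com/somepage.html');   // hypothetical page
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($ch);
    curl_close($ch);

    // Select only the portion you need with XPath.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    foreach ($xpath->query('//div[@id="content_div"]//table//tr') as $row) {
        echo trim($row->textContent), "\n";   // repeat this for every page you crawl
    }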
"extracting content from other websites" is called screen scraping or web scraping.
simple html dom parser is the easiest way(I know) of doing it.
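A short example, assuming you have downloaded simple_html_dom.php from the simplehtmldom project and that the selector matches the page you are scraping:

    <?php
    include 'simple_html_dom.php';                     // from the simplehtmldom project

    $html = file_get_html('http://www.example.com/');  // hypothetical URL

    // Grab the first element matching a CSS-like selector.
    $block = $html->find('div#content', 0);
    if ($block) {
        echo $block->plaintext;                        // or ->innertext for the raw HTML
    }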
You need a PHP crawler. The key is to use string manipulation functions such as strstr, strpos and substr.
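A rough illustration of that string-function approach; the markers are hypothetical and would need to match the real page:

    <?php
    $html = file_get_contents('http://www.example.com/');    // hypothetical page

    // Cut out everything between two known markers using strpos/substr.
    $start = strpos($html, '<table class="scores">');         // assumed opening marker
    if ($start !== false) {
        $end = strpos($html, '</table>', $start);             // matching closing tag
        if ($end !== false) {
            echo substr($html, $start, $end - $start + strlen('</table>'));
        }
    }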
There are ways to do this. Just for fun I created a Windows app that went through my account on a well-known social network, looked in the correct places, and logged the information into an XML file. This information would then be imported elsewhere. However, this sort of application can be used for motives I don't agree with, so I never uploaded it.
I would recommend using RSS feeds to extract content.
I think you need to implement something like a spider. You can make an XMLHTTP request, get the content, and then parse it.
Please forgive what is most likely a stupid question. I've successfully managed to follow the simplehtmldom examples and get data that I want off one webpage.
I want to be able to set the function to go through all HTML pages in a directory and extract the data. I've googled and googled, but now I'm confused, as in my ignorance I had thought I could (in some way) use PHP to form an array of the filenames in the directory, but I'm struggling with this.
Also, it seems that a lot of the examples I've seen use curl. Please can someone tell me how it should be done. There are a significant number of files. I've tried concatenating them, but this only works through an HTML editor; using cat doesn't work.
You probably want to use glob('some/directory/*.html'); (manual page) to get a list of all the files as an array. Then iterate over that and use the DOM stuff for each filename.
You only need curl if you're pulling the HTML from another web server; if these are stored on your web server, you want glob(). For example:
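Something along these lines, assuming the files live in a pages/ directory and that you keep using simplehtmldom for the extraction itself:

    <?php
    include 'simple_html_dom.php';

    // Build an array of every HTML file in the directory.
    foreach (glob('pages/*.html') as $filename) {
        $html = file_get_html($filename);         // simplehtmldom works on local files too

        // ...run the same extraction you already have for a single page...
        $item = $html->find('div#content', 0);    // the selector is an assumption
        if ($item) {
            echo $filename, ': ', $item->plaintext, "\n";
        }

        $html->clear();                           // free memory between files
    }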
Assuming the parser you talk about is working OK, you should build a simple www-spider. Look at all the links in a webpage and build a list of "links to scan", then scan each of those pages...
You should take care of circular references though; a small sketch is below.
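A very small sketch of such a spider; the start URL is a placeholder and it only follows absolute links, so treat it as an outline rather than a finished crawler:

    <?php
    libxml_use_internal_errors(true);

    $queue   = array('http://www.example.com/');   // hypothetical start page
    $visited = array();                            // guards against circular references

    while ($url = array_shift($queue)) {
        if (isset($visited[$url])) {
            continue;                              // already scanned this page
        }
        $visited[$url] = true;

        $html = @file_get_contents($url);
        if ($html === false) {
            continue;
        }

        $doc = new DOMDocument();
        $doc->loadHTML($html);

        // ...scrape whatever you need from $doc here...

        // Collect links to scan next (absolute http(s) links only, for simplicity;
        // a real spider would also resolve relative URLs and stay on one host).
        $xpath = new DOMXPath($doc);
        foreach ($xpath->query('//a/@href') as $href) {
            if (preg_match('#^https?://#', $href->value)) {
                $queue[] = $href->value;
            }
        }
    }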