I want to get data from a website and capture part of its content, for example the box menu of the site.
You might want to use a DOM parser of some sort to read the entire structure of the website.
Be aware this is a very heavy process!
With a DOM parser you import the entire website structure, find the part you need, and read its contents from there.
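A minimal sketch of that approach with PHP's built-in DOMDocument and DOMXPath; the URL and the id "menu" are only placeholders for whatever box you want to capture:

$html = file_get_contents('http://www.example.com/');

$doc = new DOMDocument();
libxml_use_internal_errors(true); // real-world HTML is rarely valid, so silence parser warnings
$doc->loadHTML($html);
libxml_clear_errors();

// Grab just the element that holds the menu box and print that fragment
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//div[@id="menu"]') as $menu) {
    echo $doc->saveHTML($menu);
}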
Related
I'd like to know how to build a site crawler, in PHP, that detects each page of a website and generates an entry in an XML file. I've seen plenty of websites doing this, so I'm curious how to do it from scratch, or whether there is any script or tutorial that teaches it.
Don't use regex. The proper way to parse HTML is with a DOMDocument object.
Load the first page into a DOMDocument object.
Use XPath statements to gather all of the anchor tag hrefs found on that page.
Use those values to find more pages to load, and start over at step one again (a rough sketch follows the link below).
http://www.php.net/manual/en/class.domdocument.php
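A rough sketch of those three steps, assuming every link points to a page on the same site (relative URLs are not resolved here, and a real crawler would also respect robots.txt and add delays):

function crawl($startUrl, $maxPages = 50) {
    $queue   = array($startUrl);
    $visited = array();

    while ($queue && count($visited) < $maxPages) {
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue;
        }
        $visited[$url] = true;

        $html = @file_get_contents($url);
        if ($html === false) {
            continue;
        }

        // Step 1: load the page into a DOMDocument
        $doc = new DOMDocument();
        libxml_use_internal_errors(true);
        $doc->loadHTML($html);
        libxml_clear_errors();

        // Step 2: gather every anchor href with XPath and queue it
        $xpath = new DOMXPath($doc);
        foreach ($xpath->query('//a/@href') as $href) {
            $link = $href->nodeValue;
            if (!isset($visited[$link])) {
                $queue[] = $link;
            }
        }
    }
    // Step 3: the pages found, e.g. to write out as sitemap entries
    return array_keys($visited);
}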
Here is the algorithm
Step 1 -> Get a site's address and verify that it is in the correct format and ends with a page (www.xyz.com/page.html), not just a bare domain (www.xyz.com/).
Step 2 -> Get the contents of that file and, using regular expressions, try to extract the list of pages it links to (sketched after these steps).
Step 3 -> Store them in the DB for future use, and repeat step 2 on those files too.
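A sketch of step 2 with a regular expression (as the answer above notes, a DOM parser is usually more robust than regex; the URL is a placeholder):

$html = file_get_contents('http://www.example.com/page.html');

if (preg_match_all('/href=["\']([^"\']+)["\']/i', $html, $matches)) {
    $pages = $matches[1]; // candidate pages to store in the DB and crawl next
    print_r($pages);
}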
I want to dynamically upload content from a live soccer score website to my database.
I also want to do this daily, from a single page on that website (the soccer matches for that day).
If you can help me only with the connection and retrieval of data from that webpage, I will manage the rest.
website: http://soccerstand.com/
language: PHP/Java - MySQL
Thank you !
You can use PHP's file() function to get the data. You just pass it a URL and it returns the content as an array of lines from the file. You can also use file_get_contents() to get the content as one big string.
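A quick sketch of both calls (they need allow_url_fopen enabled; the URL is the one from the question):

$lines = file('http://soccerstand.com/');              // content as an array of lines
$page  = file_get_contents('http://soccerstand.com/'); // content as one big string

echo $lines[0];     // first line of the response
echo strlen($page); // size of the whole document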
Ethical questions about scraping other site's data aside:
With PHP you can do an "open" call on a website as long as you're set up correctly. See this page for more details and examples: http://www.php.net/manual/en/wrappers.http.php
From there you have the content of the web page and it's a matter of breaking it up. Off the top of my head, I'd use regular expressions or an HTML parser to break apart the HTML, then loop through the child elements and feed the parsed data into your database calls to save it (a sketch follows below).
There are a lot of resources for parsing HTML on the web and it's simply a matter of choosing the one that will work best for you.
Keep in mind you'll need to monitor the site for changes: if they change elements, or their classes/ids, you might need to change your parsing structure as well.
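A sketch of that flow, assuming the scores sit in table rows; the table name, column, credentials and selector are placeholders you would replace after inspecting the actual page:

$handle = fopen('http://soccerstand.com/', 'r'); // needs allow_url_fopen
$html = stream_get_contents($handle);
fclose($handle);

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$pdo  = new PDO('mysql:host=localhost;dbname=scores', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO matches (description) VALUES (?)');

// Loop over whatever elements actually hold the scores on the target page
foreach ($doc->getElementsByTagName('tr') as $row) {
    $stmt->execute(array(trim($row->textContent)));
}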
Using cURL you can get the content of the page, then using a regex you can pull out what you want.
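For example, a minimal cURL fetch followed by a regex; the pattern just grabs the page title as a stand-in for whatever you actually want:

$ch = curl_init('http://soccerstand.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);

if (preg_match('/<title>(.*?)<\/title>/s', $html, $m)) {
    echo $m[1];
}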
There is an easy way: http://www.jonasjohn.de/lab/htmlsql.htm
I was wondering if there's a way to use PHP (or any other server-side, or even client-side, language if possible) to obtain certain pieces of information from a different website (NOT a local file, like include 'nav.php').
What I mean is that...Say I have a blog at www.blog.com and I have another website at www.mysite.com
Is there a way to gather ALL of the h2 links from www.blog.com and put them in a div in www.mysite.com?
Also, is there a way I could grab the entire information inside a DIV (with an ID of-course) from blog.com and insert it in mysite.com?
Thanks,
Amit
First of all, if you want to retrieve content from a blog, check whether the blog platform (e.g. Blogger, WordPress) has an API, so that you don't have to reinvent the wheel. Good APIs usually come with good documentation (meaning that probably 5% of all APIs are good APIs), and that documentation should include code examples for popular languages such as PHP, JavaScript, Java, etc. Once again, if it is content from a blog you are after, there should be tons of frameworks already there for you.
Check out the PHP Simple HTML DOM library
Can be as easy as:
// Requires simple_html_dom.php from the library linked above
// Create DOM from URL or file
$html = file_get_html('http://www.otherwebsite.com/');
// Find all h2 headings
foreach($html->find('h2') as $element)
    echo $element->plaintext;
This can be done by opening the remote website as a file, then taking the HTML and using the DOM parser to manipulate it.
$site_html = file_get_contents('http://www.example.com/');
$document = new DOMDocument();
$document->loadHTML($site_html);
$all_of_the_h2_tags = $document->getElementsByTagName('h2');
Read more about PHP's DOM functions for what to do from here, such as grabbing other tags, creating new HTML out of bits and pieces of the DOM, and displaying that on your own site.
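For instance, a small continuation of the snippet above (reusing its variable names) that turns the gathered h2 tags into a list you can echo inside a div on your own site:

echo '<div id="blog-headlines"><ul>';
foreach ($all_of_the_h2_tags as $h2) {
    echo '<li>' . htmlspecialchars($h2->textContent) . '</li>';
}
echo '</ul></div>';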
Your first step would be to use cURL to do a request on the other site and bring down the HTML from the page you want to access. Then comes the part of parsing the HTML to find the content you're looking for. You could use a bunch of regular expressions, and you could probably get the job done, but the Stack Overflow crew might frown at you. You could also take the resulting HTML and use the DOMDocument object, with loadHTML, to parse the HTML and pull out the content you want.
Also, if you control both sites, you can set up a special page on the first site (www.blog.com) with exactly the information you need, properly formatted either in HTML you can output directly, or XML that you can manipulate more easily from www.mysite.com.
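A sketch of such a special page, say a hypothetical headlines.php on www.blog.com; get_latest_headlines() is a placeholder for however the blog looks up its own posts:

header('Content-Type: application/xml; charset=utf-8');

echo '<?xml version="1.0"?>';
echo '<headlines>';
foreach (get_latest_headlines() as $title) { // placeholder for the blog's own lookup
    echo '<headline>' . htmlspecialchars($title) . '</headline>';
}
echo '</headlines>';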
I have a bunch of big txt files (game walkthroughs) that I need translating from English to French. My first instinct was to host them on a server and use a PHP script to automate the translation process by doing a file_get_contents() and some URL manipulation to get the translated text. Something like:
http://translate.google.com/translate?hl=fr&sl=en&u=http://mysite.com/faq.txt
I found it poses two problems: 1) there are frames; 2) the frame src values are relative (i.e. src="/translate_c?...."), so nothing loads.
Is there any way to fetch pages translated via Google in PHP (without using their AJAX API as it's really not suitable here)?
Use cURL to get the resulting page and then parse it.
Instead of using the regular translate URL which has frames, use the src of the frame:
http://translate.googleusercontent.com/translate_c?hl=<TARGET LANGUAGE>&sl=<SOURCE LANGUAGE>&tl=<TARGET LANGUAGE>&u=http://<URL TO TRANSLATE>&rurl=translate.google.com&twu=1&usg=ALkJrhhxPIf2COh7LOgXGl4jZdEBNutZAg
For example to translate the page http://chaimchaikin.za.net/ from English to Afrikaans:
http://translate.googleusercontent.com/translate_c?hl=en&sl=en&tl=af&u=http://chaimchaikin.za.net/&rurl=translate.google.com&twu=1&usg=ALkJrhhxPIf2COh7LOgXGl4jZdEBNutZAg
This will open up only a "frameless" page of the translation.
You may want to examine and experiment a bit to find the codes for the language you need.
Also bear in mind that Google may add scripts to the translation (for example to show original text on hover).
EDIT: It appears, on examining the code, that there is a lot of JavaScript mixed into the translation. You may need to find a way to strip it out.
EDIT: Further examination shows that the end bit "usg=ALkJr..." seems to change every time. Maybe first run a request on the regular Google Translate page (e.g. http://translate.google.com/translate?hl=fr&sl=en&u=http://mysite.com/faq.txt), then find and parse the "usg=..." part and use it for your next request on the "frameless" page (http://translate.googleusercontent.com/translate_c?...).
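A sketch of that two-step idea; it is fragile because Google can change this markup at any time, and the URL being translated is just the example from the question:

$target = 'http://mysite.com/faq.txt';
$first  = file_get_contents(
    'http://translate.google.com/translate?hl=fr&sl=en&u=' . urlencode($target)
);

// Pull the changing usg token out of the regular translate page
if (preg_match('/usg=([A-Za-z0-9_-]+)/', $first, $m)) {
    $frameless = 'http://translate.googleusercontent.com/translate_c'
               . '?hl=fr&sl=en&tl=fr&u=' . urlencode($target)
               . '&rurl=translate.google.com&twu=1&usg=' . $m[1];
    $translated = file_get_contents($frameless); // the "frameless" translated page
}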
I want to extract a specific data from the website from its pages...
I don't want to get all the contents of a specific page, but only some portion (maybe the data inside a table or a content div), and I want to do it repeatedly across all the pages of the website.
How can I do that?
Use cURL to retrieve the content and XPath to select the individual elements (see the sketch below).
Be aware of copyright though.
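A sketch of that combination; the URL and the id content_div are placeholders for the portion you want:

$ch = curl_init('http://www.example.com/page.html');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// Select just the div (or table) you are interested in
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//div[@id="content_div"]') as $div) {
    echo $doc->saveHTML($div);
}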
"extracting content from other websites" is called screen scraping or web scraping.
Simple HTML DOM Parser is the easiest way of doing it that I know of.
You need a PHP crawler. The key is to use string manipulation functions such as strstr, strpos and substr.
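A sketch of that string-based approach; the start and end markers are placeholders, and note that this naive version breaks if the target div contains nested divs:

$html  = file_get_contents('http://www.example.com/');
$start = strpos($html, '<div id="content_div">');

if ($start !== false) {
    $end = strpos($html, '</div>', $start);
    if ($end !== false) {
        $portion = substr($html, $start, $end - $start + strlen('</div>'));
        echo $portion;
    }
}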
There are ways to do this. Just for fun I created a Windows app that went through my account on a well-known social network, looked in the correct places and logged the information into an XML file. This information would then be imported elsewhere. However, this sort of application can be used for motives I don't agree with, so I never uploaded it.
I would recommend using RSS feeds to extract content.
I think you need to implement something like a spider. You can make an XMLHTTP request, get the content and then parse it.