So I'm looking for ideas on how to best replicate the functionality seen on digg. Essentially, you submit the URL of a page of interest; Digg then crawls the DOM to find all of the IMG tags (likely only selecting a few that are above a certain height/width), creates thumbnails from them, and asks you which one you would like to represent your submission.
While there's a lot going on there, I'm mainly interested in the best method to retrieve the images from the submitted page.
While you could try to fully parse the web page, HTML can be such a mess that you're better off with something close but imperfect:
Extract everything that looks like an image tag reference.
Try to fetch the URL
Check if you got an image back
Just looking for and capturing the content of src="..." gets you most of the way; add some basic handling for relative vs. absolute image references and you're done.
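A rough sketch of those steps might look like this (the regex, the relative-URL handling and the HEAD request are all assumptions about one reasonable way to do it, not a hardened implementation):

<?php
// Rough sketch: pull out candidate image URLs and verify each one actually
// returns an image. $pageUrl and $html are assumed inputs (the submitted URL
// and its fetched markup).
function findCandidateImages(string $pageUrl, string $html): array
{
    preg_match_all('/<img[^>]+src=["\']([^"\']+)["\']/i', $html, $matches);

    $images = [];
    foreach (array_unique($matches[1]) as $src) {
        // Crude relative-to-absolute handling; a real implementation would
        // also honour <base href> and ../ paths.
        if (!preg_match('#^https?://#i', $src)) {
            $src = rtrim($pageUrl, '/') . '/' . ltrim($src, '/');
        }

        // HEAD request to confirm we actually get an image back.
        $ch = curl_init($src);
        curl_setopt_array($ch, [
            CURLOPT_NOBODY         => true,
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_TIMEOUT        => 5,
        ]);
        curl_exec($ch);
        $type = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
        curl_close($ch);

        if ($type && strpos($type, 'image/') === 0) {
            $images[] = $src;
        }
    }
    return $images;
}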
Obviously anytime you fetch a web asset on demand from a third party you need to take care you aren't being abused.
I suggest cURL + regexp.
You can also use the PHP Simple HTML DOM Parser, which will help you find all the image tags.
I have a PHP application in which we allow every user to have a "public page" which shows their linked video. We have an input textbox where they can specify the embedded video's HTML code. The problem we're running into is that if we take that input and display it on the page as-is, all sorts of scripts can be inserted, leading to a very insecure system.
We want to allow embed code from all sites, but since each one is structured differently, it becomes difficult to keep tabs on them all.
What are the approaches folks have taken to tackle this scenario? Are there third-party scripts that do this for you?
Consider using some sort of pseudo-template which takes advantage of oEmbed. oEmbed is a safe way to link to a video (as the content authority, you're not allowing direct embed, but rather references to embeddable content).
For example, you might write a parser that searches for something like:
[embed]http://oembed.link/goes/here[/embed]
You could then use one of the many PHP oEmbed libraries to request the resource from the provided link and replace the pseudo-embed code with the real embed code.
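A rough sketch of that idea follows; the noembed.com endpoint is just one example aggregator, and a real oEmbed library would handle provider discovery and caching for you:

<?php
// Replace [embed]...[/embed] pseudo-tags with the provider's real embed HTML.
function expandPseudoEmbeds(string $text): string
{
    return preg_replace_callback(
        '#\[embed\](https?://[^\[\]\s]+)\[/embed\]#i',
        function (array $m) {
            $api  = 'https://noembed.com/embed?url=' . urlencode($m[1]);
            $json = @file_get_contents($api);
            $data = $json ? json_decode($json, true) : null;

            // Only ever output the provider's html field, never the raw user input.
            return isset($data['html']) ? $data['html'] : '';
        },
        $text
    );
}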
Hope this helps.
I would have the users input the URL to the video. From there you can insert the proper code yourself. It's easier for them and safer for you.
If you encounter an unknown URL, just log it, and add the code needed to support it.
The best approach would be to have a whitelist of tags that are allowed and remove everything else. It would also be necessary to filter the attributes of those tags to remove the "onsomething" event-handler attributes.
In order to do proper parsing, you need to use an XML parser. XMLReader and XMLWriter work nicely for this. You read the data with XMLReader; if the tag is in the whitelist, you write it to the XMLWriter. At the end of the process, you have your filtered data in the XMLWriter.
A code example of this would be a script along the lines of the one sketched below. Its whitelist contains the tags test and video. If you give it the following input:
<z><test attr="test"></test><img />random text<video onclick="evilJavascript"><test></test></video></z>
It will output this:
<div><test attr="test"></test>random text<video><test></test></video></div>
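The original script isn't reproduced here, but a minimal sketch of the same idea could look like this; the tag names come from the example above, and the div wrapper and the "on*" attribute check are assumptions about how such a filter would work:

<?php
// Minimal sketch of an XMLReader/XMLWriter whitelist filter.
function filterMarkup(string $input, array $whitelist = ['test', 'video']): string
{
    $reader = new XMLReader();
    $reader->XML($input);

    $writer = new XMLWriter();
    $writer->openMemory();

    while ($reader->read()) {
        switch ($reader->nodeType) {
            case XMLReader::ELEMENT:
                $isEmpty = $reader->isEmptyElement; // must be read before the attributes
                $name    = $reader->name;

                if ($reader->depth === 0) {
                    // Replace whatever the root element is with a div wrapper.
                    $writer->startElement('div');
                } elseif (in_array($name, $whitelist, true)) {
                    $writer->startElement($name);
                    while ($reader->moveToNextAttribute()) {
                        // Drop inline event handlers such as onclick.
                        if (stripos($reader->name, 'on') !== 0) {
                            $writer->writeAttribute($reader->name, $reader->value);
                        }
                    }
                    if ($isEmpty) {
                        $writer->endElement();
                    }
                }
                break;

            case XMLReader::END_ELEMENT:
                if ($reader->depth === 0 || in_array($reader->name, $whitelist, true)) {
                    $writer->fullEndElement();
                }
                break;

            case XMLReader::TEXT:
            case XMLReader::CDATA:
                $writer->text($reader->value);
                break;
        }
    }
    return $writer->outputMemory();
}

echo filterMarkup('<z><test attr="test"></test><img />random text<video onclick="evilJavascript"><test></test></video></z>');
// <div><test attr="test"></test>random text<video><test></test></video></div>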
So I have been using a method to retrieve images from a website, but I thought it may be easier to simply show the page without some details I don't want displayed. The website in particular knows we are doing this, so there shouldn't be any legal complications. So would it be possible to open the HTML page within PHP, search for a specific element that would be the same on each page, remove it, and then redisplay the page in the browser with the edits?
You can use the Tidy or HTML Purifier libraries to clean up and navigate the document tree, find the elements you are looking for, and remove them. I can't find comprehensive docs for Tidy, but the examples on php.net should be enough to help you get started.
Yes, this is possible: use file_get_contents("http://url"); to load the page into a string, then preg_replace() with a regex to clean the string.
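For example (the div id "ads" is just a placeholder, and regex on HTML is fragile, so the DOM-based approach above is generally safer):

<?php
// Quick-and-dirty removal of one known block from a fetched page.
// Note: the non-greedy pattern only works when the div has no nested divs.
$page = file_get_contents('http://www.example.com/page.html');
$page = preg_replace('#<div id="ads">.*?</div>#s', '', $page);
echo $page;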
I was wondering if there's a way to use PHP (or any other server-side, or even client-side if possible, language) to obtain certain pieces of information from a different website (NOT a local file like include 'nav.php').
What I mean is: say I have a blog at www.blog.com and another website at www.mysite.com.
Is there a way to gather ALL of the h2 links from www.blog.com and put them in a div in www.mysite.com?
Also, is there a way I could grab all of the information inside a DIV (with an ID, of course) from blog.com and insert it into mysite.com?
Thanks,
Amit
First of all, if you want to retrieve content from a blog, check whether the blog platform (e.g., Blogger, WordPress) has an API, so you won't have to reinvent the wheel. Usually, good APIs come with good documentation (meaning that probably 5% of all APIs are good APIs), and that documentation should include code examples for popular languages such as PHP, JavaScript, Java, etc. Once again, if it is to retrieve content from a blog, there should be tons of frameworks out there for you.
Check out the PHP Simple HTML DOM library
Can be as easy as:
// Create DOM from URL or file
$html = file_get_html('http://www.otherwebsite.com/');
// Find all h2 links
foreach($html->find('h2 a') as $element)
    echo $element->href;
This can be done by opening the remote website as a file, then taking the HTML and using the DOM parser to manipulate it.
$site_html = file_get_contents('http://www.example.com/');
$document = new DOMDocument();
$document->loadHTML($site_html);
$all_of_the_h2_tags = $document->getElementsByTagName('h2');
Read more about PHP's DOM functions for what to do from here, such as grabbing other tags, creating new HTML out of bits and pieces of the DOM, and displaying that on your own site.
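For the second part of the question (grabbing a div by its ID), a small sketch along the same lines might be the following; the id "content" is a placeholder:

<?php
// Grab the contents of one div, identified by its id, from a remote page.
$html = file_get_contents('http://www.blog.com/');

$document = new DOMDocument();
libxml_use_internal_errors(true);   // real-world HTML is rarely valid XML
$document->loadHTML($html);

$div = $document->getElementById('content');
if ($div !== null) {
    // saveHTML() with a node argument outputs just that subtree.
    echo $document->saveHTML($div);
}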
Your first step would be to use cURL to make a request to the other site and bring down the HTML from the page you want to access. Then comes the part of parsing the HTML to find the content you're looking for. You could use a bunch of regular expressions and probably get the job done, but the Stack Overflow crowd might frown at you. You could also take the resulting HTML and use the DOMDocument object with loadHTML() to parse it and pull out the content you want.
Also, if you control both sites, you can set up a special page on the first site (www.blog.com) with exactly the information you need, properly formatted either in HTML you can output directly, or XML that you can manipulate more easily from www.mysite.com.
I'm using PHP to scrape a website and collect some data. It's all done without using regex. I'm using PHP's explode() function to find particular HTML tags instead.
It is possible that if the structure of the website changes (CSS, HTML), the scraper may collect wrong data. So the question is: how do I know if the HTML structure has changed? How can I identify this before storing any data to my database, so that wrong data isn't stored?
I don't think there are any clean solutions when you are scraping a page whose content changes.
I have developed several Python scrapers, and I know how frustrating it can be when a site makes even a subtle change to its layout.
You could try a solution à la mechanize (I don't know the PHP counterpart), and if you are lucky you could isolate the content you need to extract (links?).
Another possible approach would be to code some constraints and check them before storing to the database.
For example, if you are scraping URLs, you will need to verify that what the scraper parsed is formally a valid URL; the same goes for integer IDs or anything else you scrape that can be validated.
If you are scraping plain text, it will be more difficult to check.
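As a sketch of that kind of constraint check (the variable, URL check and log message here are placeholders):

<?php
// Refuse to store a scraped value unless it really is a URL.
// $scraped stands in for whatever your scraper pulled out of the page.
function looksLikeUrl(string $value): bool
{
    return filter_var($value, FILTER_VALIDATE_URL) !== false;
}

$scraped = '/definitely/not/a/full/url';

if (!looksLikeUrl($scraped)) {
    // The page structure probably changed; log it and skip the DB insert.
    error_log('Scraper returned a non-URL value: ' . $scraped);
}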
It depends on the site, but you could count the number of page elements in the scraped page (div, class and style tags, for example) and then, by comparing these totals against those of later scrapes, detect whether the page structure has changed.
A similar process could be used for the CSS file, where the name of each class or id could be extracted using a simple regex, stored, and checked as needed. If this list has new additions, then the page structure has almost certainly changed somewhere on the site being scraped.
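A rough sketch of the element-counting idea (the tag list and the stored totals are placeholders):

<?php
// Tally a few structural tags and compare against the totals from the previous scrape.
function tagCounts(string $html, array $tags = ['div', 'table', 'span']): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);

    $counts = [];
    foreach ($tags as $tag) {
        $counts[$tag] = $doc->getElementsByTagName($tag)->length;
    }
    return $counts;
}

$previous = ['div' => 120, 'table' => 3, 'span' => 45];  // stored from the last run
$current  = tagCounts(file_get_contents('http://www.example.com/'));

if ($current !== $previous) {
    // The page structure has probably changed; flag it before storing data.
}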
Speaking out of my ass here, but it's possible you might want to look at some Document Object Model PHP methods.
http://php.net/manual/en/book.dom.php
If my very, very limited understanding of DOM is correct, a change in HTML site structure would change the Document Object Model, but a simple content change within a fixed structure wouldn't. So, if you could capture the DOM state, and then compare it at each scrape, couldn't you in theory determine that such a change has been made?
(By the way, when I was trying to get an email notification when the bar exam results were posted on a particular page, I did this by simply comparing file_get_contents() values. Surprisingly, it worked flawlessly: no false positives, and it emailed me as soon as the site posted the content.)
If you want to know about changes with respect to structure, I think the best way is to store the DOM structure of your first page and then compare it with the new one.
There are a lot of ways you can do it:
a SAX parser
a DOM parser, etc.
I have a small blog post which gives some pointers to what I mean:
http://let-them-c.blogspot.com/2009/04/xml-as-objects-in-oops.html
Or you can use SAX (http://en.wikipedia.org/wiki/Simple_API_for_XML) or a DOM utility parser.
First, in some cases you may want to compare hashes of the original to the new html. MD5 and SHA1 are two popular hashes. This may or may not be valid in all circumstances but is something you should be familiar with. This will tell you if something has changed - content, tags, or anything.
To understand if the structure has changed you would need to capture a histogram of the tag occurrences and then compare those. If you care about tags being out of order then you would have to capture a tree of the tags and do a comparison to see if the tags occur in the same order. This is going to be very specific to what you want to achieve.
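A minimal sketch of the hashing part (how and where you store the previous hash is up to you):

<?php
// Simplest possible change check: hash the raw HTML and compare with the
// hash stored from the previous scrape. Any change at all (content or
// structure) will produce a different hash.
$html = file_get_contents('http://www.example.com/');
$hash = sha1($html);

$storedHash = '...'; // previously saved value, e.g. from your database
if ($hash !== $storedHash) {
    // Something on the page changed; fall back to the histogram/tree
    // comparison described above to decide whether it was structural.
}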
PHP Simple HTML DOM Parser is a tool which will help you parse the HTML.
Explode() is not an HTML parser, but you want to know about changes in the HTML structure. That's going to be tricky. Try using an HTML parser. Nothing else will be able to do this properly.
I want to extract specific data from a website, across its pages...
I don't want to get all the contents of a specific page; I only need some portion (maybe just the data inside a table or a content_div), and I want to do this repeatedly across all the pages of the website.
How can I do that?
Use cURL to retrieve the content and XPath to select the individual elements.
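A small sketch of that combination; the XPath query targets a div with id "content_div", borrowed from the question, so adjust it to the table or div you actually need:

<?php
// Fetch the page with cURL, then query it with DOMXPath.
$ch = curl_init('http://www.example.com/page.html');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
foreach ($xpath->query('//div[@id="content_div"]//td') as $cell) {
    echo trim($cell->textContent), "\n";
}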
Be aware of copyright though.
"extracting content from other websites" is called screen scraping or web scraping.
Simple HTML DOM Parser is the easiest way (that I know of) to do it.
You need a PHP crawler. The key is to use string manipulation functions such as strstr, strpos and substr.
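A quick sketch of that string-function approach (the markers are placeholders, and this is fragile compared to a real parser):

<?php
// Pull out whatever sits between two known markers using strpos/substr.
function between(string $html, string $start, string $end): ?string
{
    $from = strpos($html, $start);
    if ($from === false) {
        return null;
    }
    $from += strlen($start);
    $to = strpos($html, $end, $from);

    return $to === false ? null : substr($html, $from, $to - $from);
}

$html  = file_get_contents('http://www.example.com/');
$chunk = between($html, '<div id="content_div">', '</div>'); // markers are placeholders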
There are ways to do this. Just for fun, I created a Windows app that went through my account on a well-known social network, looked in the correct places, and logged the information to an XML file. This information would then be imported elsewhere. However, this sort of application can be used for motives I don't agree with, so I never uploaded it.
I would recommend using RSS feeds to extract content.
I think you need to implement something like a spider. You can make an XMLHTTP request, get the content, and then parse it.