I'm trying to get up to speed on HTML/CSS/PHP development and was wondering how I should validate my code when it contains content I can't control, like an RSS feed?
For example, my home page is a .php doc that contains HTML and PHP code. I use the PHP to create a simple RSS reader (using SimpleXML) to grab some feeds from another blog and display them on my Web page.
Now, as much as possible, I'd like to try to write valid HTML. So I'm assuming the way to do this is to view the page in the browser (I'm using NetBeans, so I click "Preview page"), copy the source (using View Source), and stick that in W3C's validator. When I do that, I get all sorts of validation errors (like "cannot generate system identifier for general entity" and "general entity "blogId" not defined and no default entity") coming from the RSS feed.
Am I following the right process for this? Should I just ignore all the errors that are flagged in the RSS feed?
Thanks.
In this case, where you are dealing with an untrusted or uncontrolled feed, you have limited options for staying safe.
Two that come to mind are:
use something like strip_tags() to take all of the formatting out of the RSS feed content.
use a library like HTMLPurifier to validate and sanitize the content before outputting.
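As a rough sketch of those two options (the $description string here is a made-up stand-in for a feed item's HTML content; the HTMLPurifier calls are only sketched in comments):

```php
<?php
// Option 1: strip_tags(). Note that it removes tags but keeps their text
// content, so the body of a <script> element survives as plain text.
$description = '<p>Hello <script>alert("xss")</script><b>world</b></p>';

// Strip everything...
$plain = strip_tags($description);
// ...or keep a small allow-list of harmless tags:
$basic = strip_tags($description, '<p><b><i><a>');

// Option 2: HTMLPurifier follows roughly this shape:
//   $purifier = new HTMLPurifier(HTMLPurifier_Config::createDefault());
//   $clean    = $purifier->purify($description);
```

HTMLPurifier does real parsing and validation, so it is the safer choice when you want to keep some of the feed's markup rather than flattening everything to text.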
For performance, you should cache the output-ready content, FYI.
--
Regarding Caching
There are many ways to do this... If you are using a framework, chances are it already has a way to do it. Zend_Cache is a class provided by the Zend framework, for example.
If you have access to memcached, then that is super easy. But if you don't then there are a lot of other ways.
The general concept is to prepare the output, and then store it, ready to be outputted many times. That way, you do not incur the overhead of fetching and preparing the output if it is simply going to be the same every time.
Consider this code, which will only fetch and format the RSS feed every 5 minutes... All the other requests are a fast readfile() command.
# When called, will prepare the cache
function GenCache1()
{
//Get RSS feed
//Parse it
//Purify it
//Format your output into $output
file_put_contents('/tmp/cache1', $output);
}
# Check to see if the file is available
if(! file_exists('/tmp/cache1'))
{
GenCache1();
}
else
{
# If the file is older than 5 minutes (300 seconds), then regenerate it
$a = stat('/tmp/cache1');
if($a['mtime'] + 300 < time())
GenCache1();
}
# Now, simply use this code to output
readfile('/tmp/cache1');
I generally use HTML Tidy to clean up the data from outside the system.
RSS should always be XML compliant. So I suggest you use XHTML for your website. Since XHTML is also XML compliant you should not have any errors when validating an XHTML page with RSS.
EDIT:
Of course, this only holds if the content you're getting is actually valid XML...
Related
I'm looking for a way to make a small preview of another page from a URL given by the user in PHP.
I'd like to retrieve only the title of the page, an image (like the logo of the website) and a bit of text or a description if it's available. Is there any simple way to do this without any external libraries/classes? Thanks
So far I've tried using the DOMDocument class, loading the HTML and displaying it on the screen, but I don't think that's the proper way to do it.
I recommend you consider simple_html_dom for this. It will make it very easy.
Here is a working example of how to pull the title and first image.
<?php
require 'simple_html_dom.php';
$html = file_get_html('http://www.google.com/');
$title = $html->find('title', 0);
$image = $html->find('img', 0);
echo $title->plaintext."<br>\n";
echo $image->src;
?>
Here is a second example that will do the same without an external library. I should note that using regex on HTML is NOT a good idea.
<?php
$data = file_get_contents('http://www.google.com/');
preg_match('/<title>([^<]+)<\/title>/i', $data, $matches);
$title = $matches[1];
preg_match('/<img[^>]*src=[\'"]([^\'"]+)[\'"][^>]*>/i', $data, $matches);
$img = $matches[1];
echo $title."<br>\n";
echo $img;
?>
You may use any of these libraries. As you know, each one has pros & cons, so you may consult the notes about each one or take time and try them on your own:
Guzzle: An Independent HTTP client, so no need to depend on cURL, SOAP or REST.
Goutte: Built on Guzzle & some of Symfony components by Symfony developer.
hQuery: A fast scraper with caching capabilities. high performance on scraping large docs.
Requests: Famous for its user-friendly usage.
Buzz: A lightweight client, ideal for beginners.
ReactPHP: Async scraper, with comprehensive tutorials & examples.
You'd better check them all and use each one where it fits best.
This question is fairly old but still ranks very highly on Google Search results for web scraping tools in PHP. Web scraping in PHP has advanced considerably in the intervening years since the question was asked. I actively maintain the Ultimate Web Scraper Toolkit, which hasn't been mentioned yet but predates many of the other tools listed here except for Simple HTML DOM.
The toolkit includes TagFilter, which I actually prefer over other parsing options because it uses a state engine to process HTML with a continuous streaming tokenizer for precise data extraction.
To answer the original question of, "Is there any simple way to do this without any external libraries/classes?" The answer is no. HTML is rather complex and there's nothing built into PHP that's particularly suitable for the task. You really need a reusable library to parse generic HTML correctly and consistently. Plus you'll find plenty of uses for such a library.
Also, a really good web scraper toolkit will have three major, highly-polished components/capabilities:
Data retrieval. This is making a HTTP(S) request to a server and pulling down data. A good web scraping library will also allow for large binary data blobs to be written directly to disk as they come down off the network instead of loading the whole thing into RAM. The ability to do dynamic form extraction and submission is also very handy. A really good library will let you fine-tune every aspect of each request to each server as well as look at the raw data it sent and received on the wire. Some web servers are extremely picky about input, so being able to accurately replicate a browser is handy.
Data extraction. This is finding pieces of content inside retrieved HTML and pulling it out, usually to store it into a database for future lookups. A good web scraping library will also be able to correctly parse any semi-valid HTML thrown at it, including Microsoft Word HTML and ASP.NET output where odd things show up like a single HTML tag that spans several lines. The ability to easily extract all the data from poorly designed, complex, classless tags like ASP.NET HTML table elements that some overpaid government employees made is also very nice to have (i.e. the extraction tool has more than just a DOM or CSS3-style selection engine available). Also, in your case, the ability to early-terminate both the data retrieval and data extraction after reading in 50KB or as soon as you find what you are looking for is a plus, which could be useful if someone submits a URL to a 500MB file.
Data manipulation. This is the inverse of #2. A really good library will be able to modify the input HTML document several times without negatively impacting performance. When would you want to do this? Sanitizing user-submitted HTML, transforming content for a newsletter or sending other email, downloading content for offline viewing, or preparing content for transport to another service that's finicky about input (e.g. sending to Apple News or Amazon Alexa). The ability to create a custom HTML-style template language is also a nice bonus.
Obviously, Ultimate Web Scraper Toolkit does all of the above...and more:
I also like my toolkit because it comes with a WebSocket client class, which makes scraping WebSocket content easier. I've had to do that a couple of times.
It was also relatively simple to turn the clients on their heads and make WebServer and WebSocketServer classes. You know you've got a good library when you can turn the client into a server....but then I went and made PHP App Server with those classes. I think it's becoming a monster!
You can use SimpleHtmlDom for this, and then look for the title and img tags, or whatever else you need.
I like the DomCrawler library. It is very easy to use and has lots of options, like:
$crawler = $crawler
->filter('body > p')
->reduce(function (Crawler $node, $i) {
// filters every other node
return ($i % 2) == 0;
});
I'm using an XML data feed to get information using SimpleXML and then generating a page using that data.
For this I'm getting the XML feed using
$xml = simplexml_load_file
Am I right in thinking that to parse the XML data the server has to download it all before it can work with it?
Obviously this is no problem with a 2 KB file, but some files are nearing 100 KB, so on every page load that has to be downloaded before PHP can start generating the page.
On some of the pages we're only looking for a single attribute of the XML, so parsing the whole document seems unnecessary. Normally I would look into caching the feed, but these feeds relate to live markets that change frequently, so that's not ideal, as I always want up-to-the-minute data.
Is there a better way to make more efficient calls of the xml feed ?
One of the first tactics to optimize XML parsing is to parse on the fly - meaning, don't wait until the entire document arrives; start parsing immediately once you have something to parse.
This is much more efficient, since the bottleneck is often network connection and not CPU, so if we can find our answer without waiting for all network info, we've optimized quite a bit.
You should google the terms XML push parser or XML pull parser.
In the article Pull parsing XML in PHP - Create memory-efficient stream processing, you can find a tutorial that shows some code on how to do it in PHP using the XMLReader library that is bundled with PHP 5.
Here's a quote from this page which says basically what I just did in nicer words:
PHP 5 introduced XMLReader, a new class for reading Extensible Markup Language (XML). Unlike SimpleXML or the Document Object Model (DOM), XMLReader operates in streaming mode. That is, it reads the document from start to finish. You can begin to work with the content at the beginning before you see the content at the end. This makes it very fast, very efficient, and very parsimonious with memory. The larger the documents you need to process, the more important this is.
Parsing in streaming mode is a bit different from procedural parsing. Keep in mind that all the data isn't already there. What you usually have to do is supply event handlers that implement some sort of state-machine. If you see tag A, do this, if you see tag B, do that.
Regarding the difference between push parsing and pull parsing take a look at this article. Long story short, both are stream-based parsers. You will probably need a push parser since you want to parse whenever data arrives over the network from your XML feed.
Push parsing in PHP can also be done with xml_parse() (libexpat with a libxml compatibility layer). You can see a code example on the xml_parse PHP manual page.
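A minimal XMLReader sketch of the streaming idea above, using an inline string in place of a remote feed (the feed shape here is made up):

```php
<?php
// Stream through a feed and stop as soon as the single value we need is
// found. For a remote feed you would call $reader->open($url) instead of
// $reader->XML().
$xml = '<feed><item id="1"><price>10</price></item>'
     . '<item id="2"><price>25</price></item></feed>';

$reader = new XMLReader();
$reader->XML($xml);

$firstPrice = null;
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'price') {
        $firstPrice = $reader->readString();
        break; // early termination: the rest of the document is never parsed
    }
}
$reader->close();
echo $firstPrice; // "10"
```

Because the reader only ever holds the current node, memory use stays flat no matter how large the feed grows, and the break means you pay nothing for the part of the document after your answer.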
I have a PHP application in which we allow every user to have a "public page" which shows their linked video. We have an input textbox where they can specify the embedded video's HTML code. The problem we're running into is that if we take that input and display it on the page as-is, all sorts of scripts can be inserted here, leading to a very insecure system.
We want to allow embed code from all sites, but since they differ in how they're structured, it becomes difficult to keep tabs on how each one is structured.
What are the approaches folks have taken to tackle this scenario? Are there third-party scripts that do this for you?
Consider using some sort of pseudo-template which takes advantage of oEmbed. oEmbed is a safe way to link to a video (as the content authority, you're not allowing direct embed, but rather references to embeddable content).
For example, you might write a parser that searches for something like:
[embed]http://oembed.link/goes/here[/embed]
You could then use one of the many PHP oEmbed libraries to request the resource from the provided link and replace the pseudo-embed code with the real embed code.
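A sketch of what that parser step could look like; the oEmbed fetch itself is stubbed out as a callable here, and all names are illustrative:

```php
<?php
// Replace [embed]URL[/embed] markers with embed HTML obtained from a
// fetcher callback (in real use, a wrapper around your oEmbed library).
function replace_embeds($text, callable $fetch_oembed_html) {
    return preg_replace_callback(
        '#\[embed\](.+?)\[/embed\]#i',
        function ($m) use ($fetch_oembed_html) {
            return $fetch_oembed_html($m[1]); // provider-supplied safe HTML
        },
        $text
    );
}

// Usage with a stub fetcher standing in for a real oEmbed lookup:
$out = replace_embeds(
    'Watch this: [embed]http://example.com/video/1[/embed]',
    function ($url) {
        return '<iframe src="' . htmlspecialchars($url) . '"></iframe>';
    }
);
```

The key property is that the user only ever supplies a URL inside the markers; the HTML that reaches the page is always generated by you or the oEmbed provider, never pasted in by the user.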
Hope this helps.
I would have the users input the URL to the video. From there you can insert the proper code yourself. It's easier for them and safer for you.
If you encounter an unknown URL, just log it, and add the code needed to support it.
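A rough sketch of that URL-to-embed mapping; the provider patterns and iframe markup below are illustrative, not exhaustive:

```php
<?php
// Map a user-supplied video URL to embed code you control. Unknown URLs
// are logged so support for them can be added later.
function embed_from_url($url) {
    if (preg_match('#youtube\.com/watch\?v=([\w-]+)#', $url, $m)) {
        return '<iframe src="https://www.youtube.com/embed/' . $m[1] . '"></iframe>';
    }
    if (preg_match('#vimeo\.com/(\d+)#', $url, $m)) {
        return '<iframe src="https://player.vimeo.com/video/' . $m[1] . '"></iframe>';
    }
    error_log('Unsupported video URL: ' . $url);
    return null;
}
```

Because only the video ID is extracted from user input, nothing the user types can reach the page as markup.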
The best approach would be to have a white list of tags that are allowed and remove everything else. It would also be necessary to filter all the attributes of those tags to remove the "onsomething" attributes.
In order to do proper parsing, you need to use an XML parser. XMLReader and XMLWriter would work nicely for that. You read the data with XMLReader, and if the tag is in the white list, you write it with XMLWriter. At the end of the process, you have your parsed data in the XMLWriter.
A code example of this would be this script. It has the tags test and video in its white list. If you give it the following input:
<z><test attr="test"></test><img />random text<video onclick="evilJavascript"><test></test></video></z>
It will output this :
<div><test attr="test"></test>random text<video><test></test></video></div>
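A hedged sketch of such a white-list filter (the function name and allow-lists are illustrative, and real input would need error handling for malformed markup):

```php
<?php
// Stream the input through XMLReader; copy only allow-listed tags and
// attributes into XMLWriter. Disallowed tags are dropped but their
// children and text content are kept.
function filter_html($input, array $allowedTags, array $allowedAttrs)
{
    $reader = new XMLReader();
    $reader->XML($input);

    $writer = new XMLWriter();
    $writer->openMemory();
    $writer->startElement('div'); // neutral wrapper, as in the output above

    while ($reader->read()) {
        switch ($reader->nodeType) {
            case XMLReader::ELEMENT:
                if (in_array($reader->name, $allowedTags, true)) {
                    $isEmpty = $reader->isEmptyElement;
                    $writer->startElement($reader->name);
                    while ($reader->moveToNextAttribute()) {
                        if (in_array($reader->name, $allowedAttrs, true)) {
                            $writer->writeAttribute($reader->name, $reader->value);
                        }
                    }
                    $reader->moveToElement();
                    if ($isEmpty) {
                        $writer->endElement();
                    }
                }
                break;
            case XMLReader::END_ELEMENT:
                if (in_array($reader->name, $allowedTags, true)) {
                    $writer->endElement();
                }
                break;
            case XMLReader::TEXT:
                $writer->text($reader->value); // text is kept and re-escaped
                break;
        }
    }

    $writer->endElement(); // </div>
    return $writer->outputMemory();
}

$input = '<z><test attr="test"></test><img />random text'
       . '<video onclick="evilJavascript"><test></test></video></z>';
echo filter_html($input, array('test', 'video'), array('attr'));
```

Note that XMLWriter collapses empty elements to self-closing form (`<test/>` rather than ``), which is equivalent in XML.
<test>
$input = '<z><test attr="test"></test><img />random text<video onclick="evilJavascript"><test></test></video></z>';
assert(filter_html($input, array('test', 'video'), array('attr')) === '<div><test attr="test"/>random text<video><test/></video></div>');
</test>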
I was wondering if there's a way to use PHP (or any other server-side or even client-side [if possible] language) to obtain certain pieces of information from a different website (NOT a local file like include 'nav.php').
What I mean is that...Say I have a blog at www.blog.com and I have another website at www.mysite.com
Is there a way to gather ALL of the h2 links from www.blog.com and put them in a div in www.mysite.com?
Also, is there a way I could grab the entire information inside a DIV (with an ID of-course) from blog.com and insert it in mysite.com?
Thanks,
Amit
First of all, if you want to retrieve content from a blog, check whether the blog platform (e.g., Blogger, WordPress) has an API, thanks to which you won't have to reinvent the wheel. Usually, good APIs come with good documentation (meaning that probably only 5% of all APIs are good APIs), and that documentation should come with code examples for popular languages such as PHP, JavaScript, Java, etc. Once again, if it is to retrieve content from a blog, there should be tons of frameworks that are there for you.
Check out the PHP Simple HTML DOM library
Can be as easy as:
// Create DOM from URL or file
$html = file_get_html('http://www.otherwebsite.com/');
// Find all h2 elements
foreach($html->find('h2') as $element)
echo $element->plaintext;
This can be done by opening the remote website as a file, then taking the HTML and using the DOM parser to manipulate it.
$site_html = file_get_contents('http://www.example.com/');
$document = new DOMDocument();
$document->loadHTML($site_html);
$all_of_the_h2_tags = $document->getElementsByTagName('h2');
Read more about PHP's DOM functions for what to do from here, such as grabbing other tags, creating new HTML out of bits and pieces of the DOM, and displaying that on your own site.
Your first step would be to use cURL to make a request to the other site and bring down the HTML from the page you want to access. Then comes the part of parsing the HTML to find the content you're looking for. You could use a bunch of regular expressions and probably get the job done, but the Stack Overflow crew might frown at you. You could also take the resulting HTML, use the DOMDocument object, and loadHTML() to parse the HTML and pull out the content you want.
Also, if you control both sites, you can set up a special page on the first site (www.blog.com) with exactly the information you need, properly formatted either in HTML you can output directly, or XML that you can manipulate more easily from www.mysite.com.
I have been making HTML/PHP/MySQL database apps for quite a while now. I have avoided using XML/XSLT in any application since I just pull the data out and format it within my PHP script, and display it.
Assuming I am not wanting my data to be portable to other people's applications (via XML), is there any reason to implement an XML/XSLT based web app or is it a matter of preference?
Thanks.
I use XML/XSLT as a template engine.
Throughout my script, I gather my data as nodes and put them in an XML object. When I need to display data, I feed this XML object to an XSLT stylesheet and display the result.
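A minimal sketch of that flow (requires PHP's xsl extension; the document shape and stylesheet here are made up for illustration):

```php
<?php
// Build a small XML data document, then render it through XSLT.
$data = new DOMDocument();
$data->loadXML('<page><title>Hello</title></page>');

$xslSrc = <<<'XSL'
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>
  <xsl:template match="/page">
    <h1><xsl:value-of select="title"/></h1>
  </xsl:template>
</xsl:stylesheet>
XSL;

$xsl = new DOMDocument();
$xsl->loadXML($xslSrc);

$proc = new XSLTProcessor();
$proc->importStylesheet($xsl);
echo $proc->transformToXML($data);
```

The appeal of this setup is separation: the script only builds data nodes, and the stylesheet alone decides how they are presented, so the same XML can be rendered as XHTML, RSS, or anything else by swapping stylesheets.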
It is a matter of preference.
XML/XSLT are useful when transforming XML to multiple other XML formats (rss, xhtml etc...), so if you don't need this kind of functionality, don't go with it.
They also add a cost in complexity and processing power. Again, if you don't need it, don't use them.