I've seen this question, which is very nice and informative. However, it doesn't deal with a rather common scenario.
Say I need to scrape a multitude of websites (or even pages in the same domain), but the author of the website didn't take much care with the code, and the markup is seriously malformed yet "kinda works". I need to extract information from that website.
How do I do it in this case? Ideally without going í͞ń̡͢͡s̶̢̛á̢̕͘ń̵͢҉e̶̸̢̛.
Is it possible? Do I have to resort to RegExp?
You need a DOM parser. PHP has one, and there are several alternatives (and more... just google for them). You can even run the garbled HTML through HTML Purifier if you want.
I don't know how you are scraping the site, but working with RegExp will let you add many conditions to the scraping code. This may take time, depending on the number of footprints and your RegExp skills.
You may also run Tidy on the site's HTML, but IMO this can lead to strange results as well.
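To make the DOM-parser route concrete, here is a minimal sketch (the URL and the XPath query are placeholders, and the Tidy step is optional and needs the tidy extension). DOMDocument is quite forgiving of malformed markup once you silence libxml's warnings.

// Fetch the messy page (placeholder URL).
$html = file_get_contents('https://example.com/messy-page.html');

// Optional: let Tidy repair the worst of the markup first.
if (function_exists('tidy_repair_string')) {
    $html = tidy_repair_string($html, ['output-xhtml' => true], 'utf8');
}

// DOMDocument copes with broken HTML; just keep libxml's warnings quiet.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Query the repaired tree instead of fighting the raw HTML with RegExp.
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//h1 | //h2') as $heading) { // placeholder query
    echo trim($heading->textContent), PHP_EOL;
}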
Does it have to be PHP? Python has a wonderful library called Beautiful Soup ("You didn't write that awful page. You're just trying to get some data out of it"). In my experience I'd recommend it strongly enough to say that, if you have the option, you should write a quick Python script to parse your nodes into a clean file that your PHP can pick up.
(I know PHP is in the title and this doesn't directly answer your question. Apologies if you don't have the option of (or dislike) Python; I just wanted to present a good alternative.)
I'm trying to secure HTML coming from external sources, for display on my own web control panel (to load in my browser, read, and delete).
strip_tags is completely unsafe and useless.
I went through a ton of trouble to make my own DOMDocument-based HTML securer, removing unsafe elements and attributes. Then I got linked to this nightmare of a webpage: https://owasp.org/www-community/xss-filter-evasion-cheatsheet
That document convinced me that not only is my "clever" HTML securer not enough -- there are far more things that can be done to inject malicious code into HTML than I ever could imagine. That list of things gives me the creeps for real. What a cold shower.
Anyway, looking for a (non-Google-infested) HTML securer for PHP, I found this: http://htmlpurifier.org/
While it seems OK at first glance, some signs point toward sloppiness, which is the last thing you want in a security context:
1. On http://htmlpurifier.org/download , it claims that this is the official repository: https://repo.or.cz/w/htmlpurifier.git . But that page was last updated on 2018-02-23, with the label "Whoops, forgot to edit WHATSNEW".
2. The same page as in #1 calls the GitHub link the "Regular old mirror", but that repository has current (2020) updates... So is that actually the one in use? Huh? https://github.com/ezyang/htmlpurifier/tree/master
3. At https://github.com/ezyang/htmlpurifier/blob/v4.13.0/NEWS , it says: "Further improvements to PHP 6.4 support". There has never been a PHP 6.4...
My perception of that project is that it's run by very sloppy and careless people. Can people who make so many mistakes and take so little care to keep their website correct really be trusted to write secure code to purify HTML?
I wish I had never been linked to that page with exploits. I was proud of my own code, and I spent a lot of time on it even though it's not many lines.
This really makes me wonder what everyone else is using (not made by Google). strip_tags is obviously a complete "no-no", but so is my DOMDocument code. For example, it checks whether any href begins with "javascript:" (case-insensitively), but the nightmare page shows that you can inject "invisible" tabs, such as "ja vascript:", or encoded characters and more, to break my check and allow the "javascript:" href after all. And there are numerous other tricks which would simply be impossible for me to sit and address in my own code.
Is there really no real_strip_tags or something built into PHP for this crucial and common task?
HTML Purifier is a pretty good, established and tested library, although I understand why the lack of clarity as to which repository is the right one really isn't very inspiring. :) It's not as actively worked on as it was in the past, but that's not a bad thing in this case, because it has a whitelist approach. New and exciting HTML that might break your page just isn't known to the whitelist and is stripped out; if you want HTML Purifier to know about these tags and attributes, you need to teach it about how they work before they become a threat.
That said, your DOMDocument-based code needn't be the wrong approach, but if you do it properly you'll probably end up rebuilding HTML Purifier, which essentially parses the HTML, applies a standards-aware whitelist to the tags, attributes and their values, and then reassembles the HTML.
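For reference, basic usage of HTML Purifier is only a few lines. This is a minimal sketch: the include path depends on how you installed it, and the allowed-tags list is just an illustration of the whitelist idea, not a recommendation.

require_once 'library/HTMLPurifier.auto.php'; // or Composer's vendor/autoload.php

$config = HTMLPurifier_Config::createDefault();
// Whitelist approach: only these tags/attributes survive (example list).
$config->set('HTML.Allowed', 'p,b,i,em,strong,a[href],ul,ol,li,br');
// Only allow harmless URI schemes, so javascript: links never get through.
$config->set('URI.AllowedSchemes', ['http' => true, 'https' => true, 'mailto' => true]);

$purifier = new HTMLPurifier($config);
$clean = $purifier->purify($untrustedHtml);

The point is that anything not on the whitelist, including the whitespace and entity-encoding tricks from the XSS cheat sheet, gets normalised and stripped rather than pattern-matched against a blacklist.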
(Side note: since this is more a question of best practices, you might get better answers on the Software Engineering Stack Exchange site rather than Stack Overflow.)
I want to separate out the header/footer/sidebar/carousel of the homepage of any website.
For example, if I enter google.com, alibaba.com or flipkart.com,
I can retrieve that homepage via PHP cURL (some of them are encoded, and those we can't handle).
But the question is: how do I identify those parts? Each platform uses a different programming language.
Is there any API available on the market, free or paid? Is it possible?
Here is what I have tried:
$url = "https://www.google.com";
$homepage = file_get_contents($url);

// Silence libxml warnings about the malformed markup most sites ship.
libxml_use_internal_errors(true);

$doc = new DOMDocument;
$doc->loadHTML($homepage);

echo "<pre>";
print_r($doc); // prints very little for a DOMDocument; query it with DOMXPath instead
exit;
This example is in PHP, but I'm open to a solution in any language (Java/.NET).
The main question: is it possible or not?
So there would be a REST API like this that returns a JSON response.
POST api/getWebsiteData
Params : <Website URL>
Sample Response
{
    "header": <html goes here>,
    "menu": <html goes here>,
    "footer": <html goes here>,
    .....
    ....
}
I agree that we will not get a 100% solution for this, because the source of some websites is encoded.
The short answer is no, this is not possible.
The longer answer is that you could build something that might satisfy your needs, but I can pretty much guarantee that it won't work on the bulk of the web without lots, and lots, and lots of tweaking. And I mean lots. Like so much work that you become Google.
A web page is really composed of two things, HTML and the DOM. HTML is what you get from functions like file_get_contents, and when the browser interprets it, it is converted into the DOM. Further, once JavaScript gets involved, it can also modify the DOM as it pleases. Some web pages have a pretty one-to-one mapping between the initial HTML and the DOM, but others ship very minimal HTML and rely on JS to create and manipulate the DOM.
Next, there's CSS and the CSSOM, the latter of which is what JS has access to, similar to HTML's DOM. In CSS you can say "put the header at the bottom and the footer at the top". How many people do that? Probably zero, that's just a far-fetched example, but there are many smaller, more nuanced ones. Some people believe there should only be one header on a site, while others say that headers hold headings. For instance, you can have (and I have seen) headers inside of a footer. (I'm not saying whether I agree or disagree with that, either.) Also, the web is full of HTML with CSS classes like:
<div class="a">...</div>
<div class="b">...</div>
Which one of those is the header and which is the footer? Or, which is the sidebar? Is one possibly a menu? Even better, go to the ReactJS official site and inspect their DOM and you'll see code like this:
<div class="css-1vcfx3l"><h3 class="css-1xm4gxl"></h3><div>
Do those classes make sense to you?
So if you are going down this path, you are going to have to figure out where you are going to start. Do you just want to parse HTML and ignore JS/CSS/DOM/CSSOM? If so, that's generally called screen scraping (or at least it was when I did it a decade ago).
If you want to get more complex, most browsers can be run in "headless mode" and then interacted with. For instance, there's Chromium in headless mode, but I'd really recommend using an abstraction over that such as Symfony's Panther if you are in PHP or Puppeteer if you are in server-side JS. (I know there are dozens of alternatives, anyone reading this, feel free to throw them in the comments.)
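As a sketch of that more complex route in PHP, Symfony Panther drives a real headless Chrome. The URL and selectors below are placeholders, and you need the symfony/panther package plus a chromedriver binary available; treat this as an illustration of the approach, not a drop-in solution.

use Symfony\Component\Panther\Client;

require __DIR__ . '/vendor/autoload.php';

// Boots a headless Chrome instance (chromedriver must be installed).
$client = Client::createChromeClient();

// Loads the page and runs its JavaScript, so the DOM you query is the one
// the browser actually built, not just the raw HTML from the server.
$crawler = $client->request('GET', 'https://example.com');

// Placeholder selectors: grab whatever the site exposes as header/footer.
$header = $crawler->filter('header')->count() ? $crawler->filter('header')->html() : null;
$footer = $crawler->filter('footer')->count() ? $crawler->filter('footer')->html() : null;

$client->quit();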
Regardless of simple or complex, you're going to want to write your own rules. A semi-modern site written in the past couple of years has a good chance of having root or near-root <HEADER>, <MAIN> and <FOOTER> tags. If you find those, your general rules are probably going to be simpler. You stand a good chance of also finding <ASIDE> and other semantic HTML5 tags in there, too.
If you don't find those, you might still be able to look at near-root tags for <div class="header"> and similar. You will possibly need to handle alternative versions of "header", especially across languages (human, not computer, so English, Spanish, etc.).
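A rough sketch of those rules in PHP with DOMDocument/DOMXPath might look like this; the class keywords in the fallback are pure guesses and would need per-site (and per-language) tuning.

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML(file_get_contents('https://example.com')); // placeholder URL
$xpath = new DOMXPath($doc);

$parts = [];

// Rule 1: prefer semantic HTML5 tags if the site uses them.
foreach (['header', 'main', 'footer', 'aside', 'nav'] as $tag) {
    $node = $xpath->query("//{$tag}")->item(0);
    if ($node !== null) {
        $parts[$tag] = $doc->saveHTML($node);
    }
}

// Rule 2: fall back to divs whose class name looks like a region.
foreach (['header', 'footer', 'sidebar', 'menu'] as $name) {
    if (!isset($parts[$name])) {
        $node = $xpath->query("//div[contains(@class, '{$name}')]")->item(0);
        if ($node !== null) {
            $parts[$name] = $doc->saveHTML($node);
        }
    }
}

echo json_encode(array_keys($parts), JSON_PRETTY_PRINT);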
Using these rules, I think you could generally build something that would parse a good number of sites on the web.
A big word of caution, however: home pages tend to be weird one-offs, because they tend to hold a subset of everything else on the site but no actual content of their own. In that regard, you'll usually still find a header and footer, but inside, almost everything feels like a sidebar or similar.
As for carousels? That one is honestly really tough. Carousels are built with JS, so if you only look at HTML you might only find a <UL> with a bunch of images. Actually, as I write this, I think I would target a <UL> with images and assume that it is a carousel. There will definitely be false positives, but it is a pretty common pattern. Not everyone is a <UL> fan, however, so it might be just a regular <DIV>.
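Continuing the sketch above, that <UL>-with-images guess could look something like this (again only a heuristic, so expect false positives):

// Any <ul> that mostly contains images is a reasonable carousel candidate.
foreach ($xpath->query('//ul[.//img]') as $list) {
    $images = $xpath->query('.//img', $list)->length;
    $items  = $xpath->query('./li', $list)->length;
    if ($items > 1 && $images >= $items) {
        $parts['carousel'] = $doc->saveHTML($list);
        break; // take the first plausible match
    }
}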
I say all this because I have built these in the past, but for very specific sites and for very specific reasons. Writing a general parser that could work everywhere is, as I said at the beginning, a lot of work.
This is a tough one and, unless you're Google, I doubt it will be possible to make a solution that works on more than a couple of websites.
First off, let's start by taking a couple of websites and looking at what they send to the client.
The HTML of a Wikipedia article looks something like this:
<h2><span class="mw-headline" id="History">History</span></h2>
<h3><span class="mw-headline" id="Development">Development</span></h3>
<div class="thumb tright"><div class="thumbinner" style="width:172px;"><img alt="Photograph of Tim Berners-Lee in April 2009" src="//upload.wikimedia.org/wikipedia/commons/thumb/c/c8/Tim_Berners-Lee_April_2009.jpg/170px-Tim_Berners-Lee_April_2009.jpg" decoding="async" width="170" height="234" class="thumbimage" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/c/c8/Tim_Berners-Lee_April_2009.jpg/255px-Tim_Berners-Lee_April_2009.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/c/c8/Tim_Berners-Lee_April_2009.jpg/340px-Tim_Berners-Lee_April_2009.jpg 2x" data-file-width="1195" data-file-height="1648" /> <div class="thumbcaption"><div class="magnify"></div>Tim Berners-Lee in April 2009</div></div></div>
<p>In 1980, physicist Tim Berners-Lee, a contractor at CERN, proposed and prototyped ENQUIRE, a system for CERN researchers to use and share documents. In 1989, Berners-Lee wrote a memo proposing an Internet-based hypertext system.<sup id="cite_ref-3" class="reference">[3]</sup> Berners-Lee specified HTML and wrote the browser and server software in late 1990. That year, Berners-Lee and CERN data systems engineer Robert Cailliau collaborated on a joint request for funding, but the project was not formally adopted by CERN. In his personal notes<sup id="cite_ref-4" class="reference">[4]</sup> from 1990 he listed<sup id="cite_ref-5" class="reference">[5]</sup> "some of the many areas in which hypertext is used" and put an encyclopedia first.
and this would be easy enough to parse with a PHP / Python / Java program and separate into different parts.
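For instance, assuming the HTML above has already been loaded into a DOMDocument ($doc), the section titles and the article text could be pulled out with a couple of XPath queries:

$xpath = new DOMXPath($doc);

// Wikipedia marks section titles with the mw-headline class.
foreach ($xpath->query("//span[contains(@class, 'mw-headline')]") as $headline) {
    echo $headline->textContent, PHP_EOL; // History, Development, ...
}

// The article body itself is in plain <p> elements.
foreach ($xpath->query('//p') as $paragraph) {
    echo trim($paragraph->textContent), PHP_EOL;
}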
Now let's look at the Google support pages. The source is basically 2000 lines of JavaScript and that's it.
Parsing this would be possible, but much harder, since you need to actually render the page and execute the JavaScript before the <header>, <div> and <p> tags appear in the DOM.
I believe it would be possible to create an API that scans websites like Wikipedia or Stack Overflow, since they generate the HTML on the server side and only require the client to render it and apply CSS styles to it.
If a website is based on a technology like React.js, you'll see that the entirety of the page is simply JavaScript, and nothing can be processed until that has been executed and rendered.
Would it be possible to render it and parse it afterwards? Probably yes, but an API that can do this for any given website is so much work that you're probably better off training an AI to read web pages and point the parts out for you.
I have a huge list of URLs from a client which I need to run through so I can get content from the pages. This content is in different tags within each page.
I am looking to create an automated service to do this which I can leave running until it completes.
I want the automated process to load each page and get the content from particular HTML tags, then process some of this content to ensure the HTML is correct.
If possible I want to generate one XML or JSON file, but I can settle for an XML or JSON file per page.
What is the best way to do this, preferably something I can run on a Mac or a Linux server?
The list of URLs points to an external site.
Is there something I can already use, or an example somewhere, that will help me?
Thanks
This is a perfect application for BeautifulSoup, IMHO. Here is a tutorial on a similar process. It is certainly a head start.
Scrapy is an excellent framework for spidering and scraping.
I think you'll find it involves a little more learning overhead than the Requests + Beautiful Soup or lxml tutorial mentioned by tim-cook in his answer. However, if you're writing a lot of scraping / parsing logic, it should guide you toward a pretty well-factored (readable, maintainable) codebase.
So, if it's a one-off run I'd go with Beautiful Soup + Requests. If it will be reused, extended and maintained over time, then Scrapy would be my pick.
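If you'd rather stay in PHP than use the Python tools suggested above, a minimal sketch of the same workflow (fetch each URL, pull out the tags you care about, write one combined JSON file) could look like the following; the URL list, the XPath expression and the output filename are all placeholders.

libxml_use_internal_errors(true);

$urls = ['https://example.com/page1', 'https://example.com/page2']; // your client's list
$results = [];

foreach ($urls as $url) {
    $html = @file_get_contents($url);
    if ($html === false) {
        continue; // skip pages that fail to load
    }

    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    // Placeholder query: collect the text of the tags you actually need.
    $content = [];
    foreach ($xpath->query('//h1 | //article//p') as $node) {
        $content[] = trim($node->textContent);
    }

    $results[$url] = $content;
}

// One JSON file for all pages.
file_put_contents('scraped.json', json_encode($results, JSON_PRETTY_PRINT));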
Guys/gals, I have made a website, but now I want to encode the script so that no one can copy it.
I'm using PHP, JavaScript and HTML on each page of my website. So how do I encrypt each and every page?
Thank You.
PHP
No need to encrypt - no one will ever see it (unless your site has security problems).
JavaScript
You can pack it, but packing can be reversed.
HTML
You can remove all whitespace. This is problematic with pre and white-space: pre.
It is also very easy to export the formatted DOM structure that is the end result of your serialised mess.
The Most Important Part
Obfuscate to make pages load faster - not to stop people from stealing your code/markup. If your code is really worth stealing (which I doubt, no offence), then people will get it.
Neither HTML nor JavaScript can be encrypted, or the browsers would not be able to interpret them and your visitors would not be able to view your site. Full stop. Compression tools may boost performance a little, but they will not really help against copyright infringement.
Your PHP programs generate HTML; your visitors will always be able to see your HTML, but if your server is configured properly, no one should ever see your PHP.
Just get comfortable with the idea that putting something on the web is to open it to the world.
Cost of attempting to stop duplication of the stuff you've already decided to make publicly available: $your hourly rate x hours == ??
Cost of no longer worrying about something that doesn't actually cost you anything: zero. Winner.
(And to head off another question you're inevitably going to ask at some point in future - Don't attempt to disable right-clicks. It just annoys everyone and doesn't achieve anything.)
Try using Javascript Obfuscator for your JavaScript files.
It will not hide your script, but it protects JavaScript code from being stolen and shrinks its size.
If you do a Google search for "html encryption" you'll get a lot of hits, such as:
http://www.developingwebs.net/tools/htmlencrypter.php
The question I have is why you would want to do this. You're going to take a performance hit, and for what gain?
You can also do the same for JavaScript, but unless your HTML or JavaScript contains organisationally sensitive data then... And if it does, then perhaps that's not the best place for it.
Actually, one way to do it is to use XML + XSLT; it's extremely difficult for a lay person to figure out what is going on, and even more difficult for them to get at your source code.
Search Google for ionCube.
http://www.ioncube.com/html_encoder.php
This converts the HTML into gibberish, which makes stealing your HTML difficult.
Nobody's HTML code is worth stealing anyway; this is only for self-satisfaction.
The most I have ever been able to do to protect my code is to disable right-clicking with this line of code:
<body oncontextmenu="return false">
But that doesn't mean they can't right-click on another page, open Inspect Element, and come back to your page to look at the code. For the most part it only stops them from viewing the source directly.
A little late, by 10 years, but I've found a website that encrypts HTML. However, it doesn't work with PHP; it does work with JS. Evrsoft is what I've used for my website, and it's the only HTML encryption I've found so far. If you've got PHP in your code, only encrypt the HTML in the page and leave the PHP raw; nobody can see the PHP anyway. It's a free service.
I'm trying to write a text parser with PHP, like Instapaper did. What I want to do is; get a webpage and parse it in text-only mode.
It's simple to get the webpage with cURL and strip the HTML tags. But every webpage has some common areas, like the header, navigation, sidebar, footer, banners, etc. I only want to get the article in text mode and exclude all the other parts. It's also simple to exclude those parts if I know their "id" or "class". But I'm trying to automate this process and apply it to any page, like Instapaper.
I can get all the content of the page, but I don't know how to exclude the header, sidebar or footer and get only the main article body. I have to develop some logic to extract only the main article part.
It's not important for me to find the exact code. It would also be useful just to understand how to exclude the unnecessary parts, as I can try to write my own code with PHP. It would also be useful if there are any examples in other languages.
Thanks for helping.
You might try looking at the algorithms behind this bookmarklet, Readability. It has a decent success rate for extracting the content from all the rubbish on a web page.
A friend of mine made it, which is why I'm recommending it: I know it works, and I'm aware of the many techniques he's using to parse the data. You could apply these techniques to what you're asking.
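To give a flavour of the kind of heuristic such tools rely on (this is not Readability's actual algorithm, just a simplified sketch): give each paragraph's parent element credit for the amount of non-link text it contains, then take the highest-scoring parent as the main article container. Navigation, sidebars and footers are link-heavy, so they score poorly.

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html); // $html: the page you already fetched with cURL
$xpath = new DOMXPath($doc);

// Score each <p>'s parent by its amount of non-link text.
$scores = new SplObjectStorage();
foreach ($xpath->query('//p') as $p) {
    $text = trim($p->textContent);
    $length = mb_strlen($text);
    if ($length < 25) {
        continue; // ignore tiny fragments
    }

    $linkLength = 0;
    foreach ($xpath->query('.//a', $p) as $a) {
        $linkLength += mb_strlen(trim($a->textContent));
    }

    $parent = $p->parentNode;
    $previous = $scores->contains($parent) ? $scores[$parent] : 0;
    $scores[$parent] = $previous + ($length - $linkLength);
}

// The best-scoring parent is our guess at the article body.
$best = null;
$bestScore = 0;
foreach ($scores as $node) {
    if ($scores[$node] > $bestScore) {
        $bestScore = $scores[$node];
        $best = $node;
    }
}

echo $best !== null ? trim($best->textContent) : '';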
You can take a look at the source of Goose; it already does a lot of this, like Instapaper-style text extraction:
https://github.com/jiminoc/goose/wiki
Have a look at the ExtractContent code from Shuyo Nakatani.
See the original Ruby source at http://rubyforge.org/projects/extractcontent/ or a Perl port of it at http://metacpan.org/pod/HTML::ExtractContent
You really should consider using an HTML parser for this. Gather similar pages and compare the DOM trees to find the differing nodes.
This article provides a comparison of different approaches. The Java library boilerpipe was rated highly, and on the boilerpipe site you'll find the author's scientific paper comparing it to other algorithms.
Not all algorithms suit all purposes. The biggest application of such tools is simply getting the raw text to index for a search engine, the idea being that you don't want search results to be messed up by adverts. Such extraction can be destructive, meaning it won't give you "the best reading area", which is what people want from Instapaper or Readability.