scraping a page

scraping a page - php

What would be best practice in scraping a horrible mess of a distributor's inventory page (using js to document.write a <td>, then using plaintext html to close it)? No divs/tds/anything is labelled with any id or classes, etc.
Should I just straight up preg_match(?_all) the thing or is there some xpath magic I can do?
There is no api, no feeds, no xml, nothing clean at all.
Edit: what I'm basically thinking of at the moment is something like http://pastebin.com/raw.php?i=EuMfRVD5 - is that my best bet, or is there any other way?

Your example is not enough of an example. But since you seemingly don't need the highlighting meta info anyway, the JS-obfuscation could be undone with a bit of:
$html = preg_replace('#<script\b[^>]*>\s*document\.write\(\s*"((?:[^"\\\\]|\\\\.)*)"\s*\)\s*;?\s*</script>|<script\b.*?</script>#si', '$1', $html);
Maybe that's already good enough to pipe it through one of the DOM libraries afterwards.
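A minimal sketch of that pipeline, assuming each offending script consists of a single document.write("...") call with a double-quoted string literal. The sample markup is made up (the pastebin example is not reproduced here), unwrapDocumentWrite() is a hypothetical helper name, and stripslashes() is only a rough stand-in for real JavaScript unescaping:

```php
<?php
// Hypothetical helper: replace <script>document.write("...")</script>
// with the markup that the script would have written.
function unwrapDocumentWrite(string $html): string
{
    $pattern = '#<script\b[^>]*>\s*document\.write\(\s*"((?:[^"\\\\]|\\\\.)*)"\s*\)\s*;?\s*</script>#si';

    $result = preg_replace_callback(
        $pattern,
        // Rough approximation of JS unescaping: turns \" into " and \\ into \.
        static fn (array $m): string => stripslashes($m[1]),
        $html
    );

    // preg_replace_callback() returns null on a PCRE error; keep the input then.
    return $result ?? $html;
}

// Made-up sample in the style described in the question.
$html = '<table><tr><script type="text/javascript">document.write("<td class=\"qty\">");</script>42</td></tr></table>';
$clean = unwrapDocumentWrite($html);

// The cleaned markup can now be handed to a DOM parser.
$doc = new DOMDocument();
$previous = libxml_use_internal_errors(true); // tolerate sloppy markup
$doc->loadHTML($clean);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$firstCell = $doc->getElementsByTagName('td')->item(0);
echo $firstCell->textContent, "\n";
```

Scripts that do anything more than one plain document.write() call are left untouched by this sketch and would need separate handling.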

In general you should always use http://www.php.net/DOM to parse a page. Regex is horrible, and usually downright impossible, to use for parsing HTML, because that's not what it was built for.
However, if the page uses a lot of JavaScript to output content, you are out of luck either way. The best you can do to get a complete picture is to load the page in a browser and parse what is rendered. That can be automated, though it's a pain to set up.
Then again, given that JS outputs so much of the page, maybe regex really would be the best route. First and foremost, it depends on what the actual content is and what you are trying to get from the page.
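Where the markup can be loaded into DOMDocument at all, unlabelled cells can still be addressed by position with XPath. The following is only a sketch under assumptions: tableRows() is a hypothetical helper, the sample HTML is invented, and the idea that "the second table on the page holds the inventory" is made up for illustration.

```php
<?php
// Hypothetical helper: return the <td> texts of every row in the Nth table.
function tableRows(string $html, int $tableNumber = 1): array
{
    if (trim($html) === '') {
        return [];
    }

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate sloppy markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $rows = [];

    // No ids or classes needed: pick the table by its position in the document.
    foreach ($xpath->query("(//table)[{$tableNumber}]//tr") as $tr) {
        $cells = [];
        foreach ($xpath->query('./td', $tr) as $td) {
            $cells[] = trim($td->textContent);
        }
        if ($cells !== []) { // skip header-only rows
            $rows[] = $cells;
        }
    }

    return $rows;
}

// Invented sample: a layout table followed by the data table.
$html = '<table><tr><td>nav</td></tr></table>'
      . '<table><tr><th>SKU</th><th>Qty</th></tr>'
      . '<tr><td> A-1 </td><td>5</td></tr><tr><td>B-2</td><td>0</td></tr></table>';

$rows = tableRows($html, 2);
```

Positional queries break as soon as the distributor rearranges the page, so it is worth failing loudly when the expected number of columns is not found.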

Related

Is "HTML Purifier" really trustworthy? Is there a better way to secure an untrusted/unsafe HTML code string in PHP?

I'm trying to secure HTML coming from external sources, for display on my own web control panel (to load in my browser, read, and delete).
strip_tags is completely unsafe and useless.
I went through a ton of trouble to make my own DOMDocument-based HTML securer, removing unsafe elements and attributes. Then I got linked to this nightmare of a webpage: https://owasp.org/www-community/xss-filter-evasion-cheatsheet
That document convinced me that not only is my "clever" HTML securer not enough -- there are far more things that can be done to inject malicious code into HTML than I ever could imagine. That list of things gives me the creeps for real. What a cold shower.
Anyway, looking for a (non-Google-infested) HTML securer for PHP, I found this: http://htmlpurifier.org/
While it seems OK at first glance, some signs point toward sloppiness which is the last thing you want in a security context:
1. On http://htmlpurifier.org/download , it claims that this is the official repository: https://repo.or.cz/w/htmlpurifier.git But that page was last updated on "2018-02-23", with the label "Whoops, forgot to edit WHATSNEW".
2. The same page as in #1 calls the Github link the "Regular old mirror", but that repository has current (2020) updates... So is that actually the one used? Huh? https://github.com/ezyang/htmlpurifier/tree/master
3. At https://github.com/ezyang/htmlpurifier/blob/v4.13.0/NEWS , it says: "Further improvements to PHP 6.4 support". There never existed a PHP 6.4...
My perception of that project is that it's run by very sloppy and careless people. Can people who make so many mistakes and take so little care to keep their website correct really be trusted to write secure code to purify HTML?
I wish I had never been linked to that page with exploits. I was proud of my own code, and I spent a lot of time on it even though it's not many lines.
This really makes me wonder what everyone else is using (not made by Google). strip_tags is obviously a complete "no-no", but so is my DOMDocument code. For example, it checks if any href begins with (case insensitively) "javascript:", but the nightmare page shows that you can inject "invisible" tabs such as "ja vascript:" and add encoded characters and everything to break my code and allow the "javascript:" href after all. And numerous other things which would simply be impossible for me to sit and address in my own code.
Is there really no real_strip_tags or something built into PHP for this crucial and common task?

HTML Purifier is a pretty good, established and tested library, although I understand why the lack of clarity as to which repository is the right one really isn't very inspiring. :) It's not as actively worked on as it was in the past, but that's not a bad thing in this case, because it has a whitelist approach. New and exciting HTML that might break your page just isn't known to the whitelist and is stripped out; if you want HTML Purifier to know about these tags and attributes, you need to teach it about how they work before they become a threat.
That said, your DOMDocument-based code needn't be the wrong approach, but if you do it properly you'll probably end up with HTML Purifier again, which essentially parses the HTML, applies a standards-aware whitelist to the tags, attributes and their values, and then reassembles the HTML.
(Sidenote, since this is more of a question of best practises, you might get better answers on the Software Engineering Stack Exchange site rather than Stackoverflow.)
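To illustrate why a whitelist of values holds up better than a check for "javascript:", here is a small sketch of a scheme check. It is not how HTML Purifier does it and not a sanitizer in itself; hasAllowedScheme() and its default list are hypothetical, and the value is assumed to have been entity-decoded already by the HTML parser (as DOMElement::getAttribute() returns it).

```php
<?php
// Hypothetical helper; not part of HTML Purifier or of PHP itself.
// Assumes $url was already entity-decoded by the HTML parser.
function hasAllowedScheme(string $url, array $allowed = ['http', 'https', 'mailto']): bool
{
    // Browsers ignore tabs and line breaks inside a URL and trim leading
    // control characters, so remove them before looking at the scheme.
    $url = preg_replace('/[\x00-\x20]+/', '', $url);
    if ($url === null) {
        return false; // PCRE error: refuse rather than guess
    }

    // Position of the first character that could end a scheme or start a path.
    $pos = strcspn($url, ':/?#');
    if ($pos === strlen($url) || $url[$pos] !== ':') {
        return true; // no scheme at all: a relative URL
    }

    // Anything not explicitly listed is rejected, whatever it is called.
    return in_array(strtolower(substr($url, 0, $pos)), $allowed, true);
}
```

Because unknown schemes are rejected by default, the tab trick and mixed-case spellings fail the same way as any other unlisted scheme, without the code having to know about them.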

Fetching images faster from a web page

I'm looking for a plugin or some simple code that fetches images from a link faster. I have been using http://simplehtmldom.sourceforge.net/ to extract the first 3 images from a given link.
simplehtmldom is quite slow, and many users on my site are reporting it as an issue.
Correct me if I'm wrong, but I believe this plugin spends a lot of time fetching the complete HTML from the URL I pass before it searches for img tags.
Can someone suggest a technique to improve the speed of fetching the HTML, or an alternative plugin that I can try?
What I'm thinking of is fetching the HTML only until the first three img tags are found and then killing the fetch, so that things get faster.
I'm not sure if that's possible with PHP, although I'm trying hard to design it using jQuery.
Thanks for your help !

The browser's same-origin policy will prevent you from doing something like this in jQuery/JS (unless you control all the domains that you'll be grabbing content from). What you're doing is not going to be super fast in any case, but try writing your own using file_get_contents() paired with DOMDocument; the DOMDocument getElementsByTagName method may be faster than simplehtmldom's find() method.
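A minimal sketch of the DOMDocument variant, assuming the HTML has already been fetched into a string. firstImageSources() is a hypothetical helper name, and whether it actually beats simplehtmldom on a given page would need measuring.

```php
<?php
// Hypothetical helper: return the src of the first $limit <img> elements.
function firstImageSources(string $html, int $limit = 3): array
{
    if (trim($html) === '') {
        return [];
    }

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $sources = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        if ($src !== '') {
            $sources[] = $src;
        }
        if (count($sources) >= $limit) {
            break; // stop walking the document once we have enough
        }
    }

    return $sources;
}
```

Usage would look like firstImageSources(file_get_contents($url)), where $url is whatever link the user submitted.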
You could also try a regex approach. It won't be as fool-proof as a true DOM parser, but it will probably be faster... Something like:
$html = file_get_contents($url);
preg_match_all('/<img[^>]*?src="([^"]*)"[^>]*>/i', $html, $arr, PREG_PATTERN_ORDER);
If you want to avoid reading whole large files, you can also skip the file_get_contents() call and sub in an fopen() with a while (!feof()) loop, checking for images after each line is read from the remote server. If you take this approach, however, make sure you're regexing the WHOLE buffered string, not just the most recent line, as the code for an image could easily be broken across several lines.
Keep in mind that real-life variability in HTML will make regex an imperfect solution at best, but if speed is a major concern it might be your best option.
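The buffered-read idea could be sketched roughly as follows. firstImageSourcesFromStream() is a hypothetical helper, it reads fixed-size chunks rather than lines, and its pattern also accepts single-quoted src values, which is an addition of this sketch.

```php
<?php
// Hypothetical helper: read a stream chunk by chunk and stop as soon as
// $limit image sources have been seen.
function firstImageSourcesFromStream($stream, int $limit = 3, int $chunkSize = 8192): array
{
    $buffer = '';
    $sources = [];

    while (!feof($stream) && count($sources) < $limit) {
        $chunk = fread($stream, $chunkSize);
        if ($chunk === false) {
            break; // read error: return what we have
        }
        $buffer .= $chunk;

        // Match against the WHOLE buffer so that a tag split across two
        // chunks (or lines) is still found once its end has arrived.
        preg_match_all(
            '/<img\b[^>]*?\bsrc\s*=\s*["\']([^"\']*)["\'][^>]*>/i',
            $buffer,
            $matches
        );
        $sources = $matches[1];
    }

    return array_slice($sources, 0, $limit);
}
```

For a remote page the stream would come from something like fopen($url, 'r'), which requires allow_url_fopen to be enabled; remember to fclose() it afterwards so the rest of the download is abandoned.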

How to properly format text retrieved from a website?

I'm building an application for a company that, unfortunately, has a very poorly designed website. Most of the HTML tags are wrongly and sometimes randomly placed, there is excessive use of no-break-spaces, p tags are randomly assigned, they don't follow any rule and so on...
I'm retrieving data from their website by using a crawler and then feeding the resulting strings to my application through my own web service. The problem is that once it is displayed in an Android TextView, the text is formatted all wrong: spread out, uneven and very disorderly.
Also worth mentioning: for various reasons, I cannot suggest to the company that they modify their website...
I've tried
String text = Html.fromHtml(myString).toString();
and other variations, I've even tried formatting it manually but it's been a pain.
My question is:
Is there an easy, elegant way to re-format all this text, either with PHP on my web-service or with Java, directly in my Android application?
Thanks to anyone who will take the time to answer...

You can use Tidy with PHP to clean up the code if you're keeping it in place. Otherwise stripping the HTML would probably make working with it a lot easier.

I would say: no, there is no easy, elegant way. HTML combines data and visual representation; they are inherently linked. To understand the data you must look at the tags. Tags like <h1> and <a> carry meaning.
If the HTML is structured enough to break it down into meaningful blocks (header, body and unrelated/unimportant stuff), then you could apply restyling principles to those. A simple solution is to just strip all the tags, get only the text nodes and stitch them together. If the HTML is exceptionally poorly formatted you might get sentences that are out of order, but if the HTML isn't too contrived I expect this approach to work.
To give you an indication of the complexity involved: you could have <span>s that have styling applied to them, for instance display: block. This changes the way the span is displayed, from inline to block, so it behaves more like a <div> would. This means that each <span> will likely be on its own line; it will seem to force a line break. Detecting these situations isn't impossible, but it is quite complex. Who knows what happens when you've got list elements, tables or even floating elements; they might be completely out of order.
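The "strip the tags and stitch the text nodes together" approach could look roughly like this on the PHP side. extractText() is a hypothetical helper; collapsing runs of whitespace and no-break spaces into single spaces is a choice of this sketch, and the XML declaration prefix is a commonly used trick to make libxml read the input as UTF-8.

```php
<?php
// Hypothetical helper: collect the text nodes of an HTML fragment in
// document order and join them with single spaces.
function extractText(string $html): string
{
    if (trim($html) === '') {
        return '';
    }

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // the markup is known to be messy
    // Prefix hints to libxml that the input is UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $parts = [];

    // Every text node except those inside <script> or <style>.
    $query = '//text()[not(ancestor::script) and not(ancestor::style)]';
    foreach ($xpath->query($query) as $node) {
        // Collapse whitespace and no-break spaces (U+00A0) into one space.
        $text = trim(preg_replace('/[\s\x{00A0}]+/u', ' ', $node->nodeValue) ?? '');
        if ($text !== '') {
            $parts[] = $text;
        }
    }

    return implode(' ', $parts);
}
```

This deliberately ignores block-level structure, so paragraphs run together; emitting a line break after block elements such as <p> or <div> would be the next refinement.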

Probably not the most elegant solution, but I managed to get the best results by stripping some tags according to what I needed with php (that was really easy to do) and then displaying the retrieved strings into formatted WebViews.
As I said, probably not the most elegant solution but in this case it worked best for me.

Remove tag and content in between using REGEX/PHP

I've seen this question asked a few times on stackoverflow, with no resoundingly wonderful answer.
The answer always seems to be "don't use regex," without any examples of a better alternative.
For my purposes this will not be done for validation, but after the fact stripping.
I need to strip out all script tags including any content that may be between them.
Any suggestions on the best REGEX way to do this?
EDIT: PREEMPTIVE RESPONSE: I can't use HTML Purifier nor the DOMXPath feature of PHP.

The reason regex for HTML is considered evil is that it can (usually) be broken easily, forcing you to repeatedly rethink your pattern. If for instance you're matching
<script>.+</script>
It could be broken easily with
<script type="text/javascript">
If you use
<script.+/script>
It can also be easily broken with
< script>...
There's no end for this. If you can't use any of the methods you've stated, you could try strip_tags, but it takes a whitelist as a parameter, not a blacklist, meaning you'll need to manually allow every single tag you want to allow.
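If all else fails, you could resort to regex. What I came up with is this (use it with the s and i modifiers so that it spans lines and ignores case; the lazy .*? stops it at the first closing tag instead of the last):
<\s*script.*?/script>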
But I bet someone around here could probably come and break that too.
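For completeness, a regex-only sketch in PHP, given that the asker ruled out HTML Purifier and DOMXPath. stripScripts() is a hypothetical helper and the pattern is slightly stricter than the one above. It still fails on an unclosed <script> tag or on a script whose own code contains the text "</script>" inside a string.

```php
<?php
// Hypothetical helper: remove <script> elements and everything between the tags.
function stripScripts(string $html): string
{
    // i  = case-insensitive, so <SCRIPT> is caught too
    // s  = let "." match newlines, so multi-line scripts are caught
    // .*? = lazy, so matching stops at the FIRST closing tag
    $pattern = '#<\s*script\b[^>]*>.*?<\s*/\s*script\s*>#is';

    $result = preg_replace($pattern, '', $html);
    if ($result === null) {
        // Fail closed: never hand back unfiltered markup after a PCRE error.
        throw new RuntimeException('Could not strip script tags');
    }

    return $result;
}
```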

Regex (or better suggestion) on html with correct nesting

I've had a look and there don't seem to be any old questions that directly address this. I also haven't found a clear solution anywhere else.
I need a way to match a tag, open to close, and return everything enclosed by the tag. The regexes I've tried have problems when tags are nested. For example, the regex <tag\b[^>]*>(.*?)</tag> will cause trouble with <tag>Some text <tag>that is nested</tag> in tags</tag>. It will match <tag>Some text <tag>that is nested</tag>.
I'm looking for a solution to this, ideally an efficient one. I've seen solutions that involve matching on start and end tags separately and keeping track of their index in the content to work out which tags go together, but that seems wildly inefficient to me (if it's the only possible way then c'est la vie).
The solution must be PHP only as this is the language I have to work with. I'm parsing html snippets (think body sections from a wordpress blog and you're not too far off). If there is a better than regex solution, I'm all ears!
UPDATE:
Just to make it clear, I'm aware regexes are a poor solution but I have to do it somehow which is why the title specifically mentions better solutions.
FURTHER UPDATE:
I'm parsing snippets. Solutions should take this into account. If the parser only works on a full document or is going to add <head> etc... when I get the html back out, it's not an acceptable solution.

As always, you simply cannot parse HTML with regex because it is not a regular language. You either need to write a real HTML parser, or use a real HTML parser (that someone's already written). For reasons that should be obvious, I recommend the latter option.
Relevant questions
Robust and Mature HTML Parser for PHP
How do you parse and process HTML/XML in PHP?

Why not just use DOMDocument::loadHTML? It uses libxml under the hood which is fast and robust.
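A sketch of how that could work for snippets, under a few assumptions: outermostContents() is a hypothetical helper, and it returns only the inner markup of each outermost match, so the <html> and <body> wrappers that libxml adds internally never reach the caller. The XML declaration prefix is a commonly used trick to make libxml read the input as UTF-8.

```php
<?php
// Hypothetical helper: return the inner HTML of every outermost <$tag> element.
function outermostContents(string $snippet, string $tag): array
{
    if (!preg_match('/^[a-z][a-z0-9]*$/i', $tag)) {
        throw new InvalidArgumentException('Unexpected tag name');
    }
    if (trim($snippet) === '') {
        return [];
    }
    $tag = strtolower($tag); // libxml lower-cases HTML element names

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // unknown tags raise warnings
    $doc->loadHTML('<?xml encoding="UTF-8">' . $snippet);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $contents = [];

    // "Outermost" means: no ancestor with the same element name.
    foreach ($xpath->query("//{$tag}[not(ancestor::{$tag})]") as $node) {
        $inner = '';
        foreach ($node->childNodes as $child) {
            $inner .= $doc->saveHTML($child); // serialise each child, nested tags included
        }
        $contents[] = $inner;
    }

    return $contents;
}

// The example from the question.
$result = outermostContents('<tag>Some text <tag>that is nested</tag> in tags</tag>', 'tag');
```

Because the parser has already built the tree, nesting is resolved for free; no index bookkeeping over start and end tags is needed.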
