I have a huge list of URLs from a client which I need to run through so I can get content from the pages. This content is in different tags within each page.
I am looking to create an automated service to do this which I can leave running until it completes.
I want the automated process to load each page, get the content from particular HTML tags, and then process some of this content to ensure the HTML is correct.
If possible I want to generate one XML or JSON file, but I can settle for an XML or JSON file per page.
What is the best way to do this, preferably something I can run off a Mac or a Linux server?
The URLs in the list point to an external site.
Is there something I can already use, or an example somewhere, which will help me?
Thanks
This is a perfect application for BeautifulSoup, IMHO. Here is a tutorial on a similar process. It is certainly a head start.
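To make that concrete, here is a minimal sketch of the whole job using Requests + Beautiful Soup. The URLs, tag names, and CSS class below are placeholders; substitute the client's list and the real markup:

```python
# Fetch each URL, pull content from a couple of (hypothetical) tags,
# and write everything out as a single JSON file.
# Requires: pip install requests beautifulsoup4
import json

import requests
from bs4 import BeautifulSoup

urls = [
    "http://example.com/page1",  # placeholder URLs; use the client's list
    "http://example.com/page2",
]

results = []
for url in urls:
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    results.append({
        "url": url,
        # The tag/class choices below are assumptions; adjust to the real markup.
        "title": soup.title.string if soup.title else None,
        "content": [div.get_text(strip=True)
                    for div in soup.find_all("div", class_="content")],
    })

with open("scraped.json", "w") as f:
    json.dump(results, f, indent=2)
```

A script like this runs fine from cron on a Mac or a Linux server, so you can leave it going until the list is done.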
Scrapy is an excellent framework for spidering and scraping.
I think you'll find it involves a little more learning overhead than the Requests + Beautiful Soup or lxml approach mentioned by tim-cook in his answer. However, if you're writing a lot of scraping/parsing logic it should guide you toward a pretty well-factored (readable, maintainable) codebase.
So, if it's a one-off run I'd go with Beautiful Soup + Requests. If it'll be re-used, extended and maintained over time then Scrapy would be my pick.
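For comparison, the same job sketched as a Scrapy spider (the selectors are again assumptions, not the real markup):

```python
# Save as pages_spider.py and run with:
#   scrapy runspider pages_spider.py -o pages.json
import scrapy

class PagesSpider(scrapy.Spider):
    name = "pages"
    start_urls = [
        "http://example.com/page1",  # placeholder URLs; use the client's list
        "http://example.com/page2",
    ]

    def parse(self, response):
        yield {
            "url": response.url,
            # Hypothetical selectors; adjust to the real markup.
            "title": response.css("title::text").get(),
            "content": response.css("div.content ::text").getall(),
        }
```

Scrapy handles throttling, retries, and the JSON export for you, which is where the extra learning overhead starts to pay off.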
I previously had someone build a website for me. It was 90% finished but then ill health got in the way.
I have all the files and I am now asking people to "put the website back together for me". The general consensus is that it's very messy, it's not clear what was done, some of the protocols are now out of date, etc., and it would just be better to start from scratch. I have heard this from multiple people.
So now when I am asking a new guy to build it from scratch, he is asking me for the HTML files. I couldn't see any, so I contacted the previous developer and he said:
There are no HTML files, it all runs through the index.php file and
extracts pages, data etc. from the database.
I told this to the new developer, but he is saying:
But website is not possible without HTML. Ask him provide index HTML.
Pure HTML without php code.
I'm confused, because I saw the website up and running, so it seems it is possible without HTML?
I'm trying to figure out where the misunderstanding is happening.
Thanks.
What your previous developer is saying is that your site was dynamic and all requests were flowing through your index.php file, which in turn does some backend logic to produce HTML data for the browser to interpret. If you ask your previous developer to zip up the root of your old site, your new developer should be able to take it from there.
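To illustrate the idea (in Python rather than the PHP your site actually uses, so treat this as a concept sketch with a made-up schema): a single entry point renders every page out of a database, which is why there are no .html files anywhere on disk.

```python
# Conceptual front-controller sketch -- your real site does this in index.php.
# Assumes a hypothetical SQLite database with a pages(slug, title, body) table.
import sqlite3
from http.server import BaseHTTPRequestHandler, HTTPServer

class FrontController(BaseHTTPRequestHandler):
    def do_GET(self):
        slug = self.path.strip("/") or "home"
        conn = sqlite3.connect("site.db")
        row = conn.execute(
            "SELECT title, body FROM pages WHERE slug = ?", (slug,)
        ).fetchone()
        conn.close()
        if row is None:
            self.send_error(404)
            return
        # The HTML the browser sees is built on the fly, per request.
        html = f"<html><head><title>{row[0]}</title></head><body>{row[1]}</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(html.encode("utf-8"))

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), FrontController).serve_forever()
```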
Can a website exist without HTML?
Without a .html file? Yes. Using only .php, .css and .js is possible.
Without using Hyper Text Mark-up Language? No. There are no other mark-up languages for browsers, AFAIK. So we're stuck with this.
The old dev used PHP for efficiency. The contents live in your database and are fetched using PHP to show up in the browser.
The new dev probably only knows HTML and has no clue about PHP. Or he probably doesn't want to bother reading through the PHP code to reverse engineer how your site works.
Suggestion: get a different dev. A smarter one. You'll probably have to pay more, but a less capable dev ends up costing you more in the long run.
I am currently trying to load an HTML page via cURL. I can retrieve the HTML content, but part of it is loaded later via scripting (an AJAX POST). I cannot retrieve that part of the HTML (it's a table).
Is it possible to load a page entirely?
Thank you for your answers
No, you cannot do this.
CURL does nothing more than download a file from a URL -- it doesn't care whether it's HTML, Javascript, an image, a spreadsheet, or any other arbitrary data; it just downloads. It doesn't run anything or parse anything or display anything; it just downloads.
You are asking for something more than that. You need to download, parse the result as HTML, then run some Javascript that downloads something else, then run more Javascript that parses that result into more HTML and inserts it into the original HTML.
What you're basically looking for is a full-blown web browser, not CURL.
Since your goal involves "running some Javascript code", it should be fairly clear that it is not achievable without having a Javascript interpreter available. This means that it is obviously not going to work inside of a PHP program (*). You're going to need to move beyond PHP. You're going to need a browser.
The solution I'd suggest is to use a very specialised browser called PhantomJS. This is actually a full Webkit browser, but without a user interface. It's specifically designed for automated testing of websites and other similar tasks. Your requirement fits it pretty well: write a script to get PhantomJS to open your URL, wait for the table to finish rendering, and grab the finished HTML code.
You'll need to install PhantomJS on your server, and then use a library like this one to control it from your PHP code.
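If PHP turns out not to be a hard requirement, the same approach can be driven from Python through Selenium's PhantomJS driver (deprecated in later Selenium releases, but it shows the shape of the solution; the URL is a placeholder):

```python
# Let PhantomJS execute the page's JavaScript, then read the rendered DOM.
# Requires PhantomJS on the PATH and: pip install selenium (a pre-4.x version)
import time

from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get("http://example.com/page-with-ajax-table")  # placeholder URL
time.sleep(5)  # crude wait for the AJAX table; a real script should poll for the element
html = driver.page_source  # the HTML *after* JavaScript has run
driver.quit()
print(html)
```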
I hope that helps.
(*) Yes, I'm aware of the PHP extension that provides a JS interpreter inside of PHP, and it would provide a way to solve the problem, but it's experimental, unfinished, and would still be difficult to implement as a solution, and I don't think it's a particularly good idea anyway, so let's not consider it for the purposes of this answer.
No. The only way you can do that is to make a separate cURL request to the AJAX endpoint and put the two results together afterwards.
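Sketched in Python rather than PHP cURL, and with a made-up endpoint and payload (check your browser's network tab for the real ones), that second-request approach looks like this:

```python
# Replay the AJAX POST yourself instead of executing any JavaScript.
# Requires: pip install requests
import requests

page_html = requests.get("http://example.com/page").text  # the static HTML

table_html = requests.post(
    "http://example.com/ajax/table",  # hypothetical AJAX endpoint
    data={"page": 1},                 # hypothetical POST payload
).text

# Combine the two however your use case requires, e.g. insert table_html
# where the empty table sits in page_html.
```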
I've seen this question, which is very nice and informative. However, it doesn't deal with a rather common scenario.
Say I need to scrape a multitude of websites (or even pages in the same domain), but the author of that website didn't care enough about his code, and has some seriously malformed code "that kinda works". I need to take information from that website.
How do I do it in this case? Ideally without going í͞ń̡͢͡s̶̢̛á̢̕͘ń̵͢҉e̶̸̢̛.
Is it possible? Do I have to revert to RegExp?
You need a DOM parser. PHP has one. And then there are some alternatives (and more... just Google for them). You can even run the "garbled HTML" through HTML Purifier if you want.
I don't know how you are scraping the site, but working with RegExp will allow you to add many conditions to the scraping code. This may take time, depending on the number of footprints and your RegExp skills.
You may also use Tidy on the site's HTML, but this will lead to strange results as well, IMO.
Does it have to be PHP? Python has a wonderful library called Beautiful Soup ("You didn't write that awful page. You're just trying to get some data out of it"). In my experience I'd recommend it so strongly that I'd say, if you have the option, write a quick Python script to parse your nodes into a clean file that your PHP can pick up.
(I know PHP is in the title and this doesn't directly answer your question. Apologies if you don't have the option of (or dislike) Python; I just wanted to present a good alternative.)
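As a taste of why Beautiful Soup fits this question, here's a minimal sketch with deliberately broken markup (the tag and class names are made up):

```python
# Beautiful Soup tolerates unquoted attributes and unclosed tags that would
# make a strict XML parser give up. For the most browser-like handling of
# truly mangled pages, install html5lib and pass "html5lib" instead.
# Requires: pip install beautifulsoup4 (optionally: pip install html5lib)
from bs4 import BeautifulSoup

broken = '<div class=entry><h2>Some title<p>Some text that "kinda works"'
soup = BeautifulSoup(broken, "html.parser")  # parses without complaint

entry = soup.find("div", class_="entry")  # attribute lookups still work
print(entry.get_text(strip=True))
```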
I think the title asks the question: I usually use PHP for parsing / web scraping, but I have a really hard time scraping JavaScript; in most cases I can't do it.
For example: parsing a div that only appears once some JavaScript has executed.
I have read that Ruby has a parser library that can handle JavaScript, so the question is: which language is best for writing a scraper that will effectively scrape JavaScript-generated content? Is there a library for PHP, like the Ruby one, for parsing JavaScript-generated content?
There are a handful of strategies for this. Depending on your needs, consider programmatically instantiating a browser instance that you can hook into and read the page from.
The idea is, let the browser do the work, as the page is made for a browser and not your bot. You can then tap in and scrape away using a browser plugin that feeds data to your primary application running things.
This may be way overkill for what you need though. I'll leave it up to you to decide.
You should look at some GUI-less/headless browsers. There are some written for Java. I didn't find one for PHP.
Look at :
HTMLUnit
Golf
You can try using something like Selenium, which allows you to automate browser tasks.
On the other hand, you can go into details on what happens when the js code is executed. For example, if the js code is requesting something from the server by POSTing some data, you could emulate that in the regular fashion.
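A sketch of the Selenium route in Python (the element id is made up; any real browser driver such as geckodriver works):

```python
# Let a real browser execute the JavaScript, then scrape the rendered result.
# Requires: pip install selenium (plus geckodriver/chromedriver on the PATH)
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
try:
    driver.get("http://example.com")  # placeholder URL
    # Wait up to 10 seconds for the JS-generated div to appear.
    div = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "generated-content"))  # hypothetical id
    )
    print(div.text)
finally:
    driver.quit()
```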
You should look at PhantomJS and CasperJS (headless browsers).
In the Ruby world, the gem for driving PhantomJS would be poltergeist.
There is another article about some of the options you have in Ruby here too (however, they are not all JS-capable).
A friend has asked me for help with her website design. Although I know a fair amount about the basics behind HTML, XML, PHP, ASP.Net, JavaScript, etc., I'm not really comfortable sitting down and coding from scratch. All of the work I do is in Java, C++, and so on.
My friend would like to add a vertically scrolling marquee to her site - no problem, there is code for that all over the internet. Here is the tricky part - she would like the text to be dynamically pulled from another website. This isn't like a simple text file, either - it's a list of names from a specific blog post, so there would be a lot of text processing involved to wade through all of the other markup, and extract the relevant info.
The way I see it, here are her options -
1) Write some kind of Perl script or some such that is set to run daily. This script will visit the blog and extract the necessary info. It will then update the HTML file's marquee text with the new info.
2) Some sort of active page written in ASP or PHP that will dynamically build the marquee (and the rest of the site) each time the site is visited, basically doing the work of the perl script each time. This seems like it has the potential to be somewhat slow.
Per my understanding, those are her only options. Am I correct? There is no simple way to do this in JavaScript that I am just missing? I know you can reference an image to be dynamically pulled with the marquee, but this isn't that simple...
Thanks.
EDIT: I guess where I was going with my question was this: Unless I implement this statically, this is going to be fairly involved, right? I believe it is over my head. This is why I would like to simply copy/paste the text list into the html document. It would need to be updated every time the blog does, but that only appears to happen every few months, so that's not a large chore. I realize this is a lazy solution, but this is from someone very inexperienced in web development.
For reference, this is the SPECIFIC blog post which the text will come from, and my friend would ONLY like to display that list of names that begins when you scroll several paragraphs down.
http://truthnottasers.blogspot.com/2008/04/what-follows-are-names-where-known.html
It depends what the list of names looks like, i.e. how much intelligence is needed to parse it. But this could be something that could fairly easily be pulled, parsed and displayed using Ajax, for example with jQuery.
All the blogs I have ever seen have an RSS feed. Why not just grab the feed?... Google provides JavaScript that does exactly this.
Google Ajax Feed API
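If the widget doesn't have to run in the browser, the feed can also be pulled server-side; here's a minimal sketch in Python with the feedparser library (the feed URL follows Blogger's usual pattern, so verify it against the actual blog):

```python
# Pull the blog's feed and list its entries.
# Requires: pip install feedparser
import feedparser

# Blogger's standard feed path; check this against the real blog.
feed = feedparser.parse("http://truthnottasers.blogspot.com/feeds/posts/default")

for entry in feed.entries:
    print(entry.title, entry.link)
```

Note that a feed usually only carries recent posts, so an old post like this one may need to be fetched by its direct URL instead.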
The RSS suggestion sounds good. If you can't get it in the RSS you could screen scrape the content.
If you could do it with JavaScript, I think it would suffer the same resource issues as your once-a-day Perl script and the every-load ASP/PHP methods, since it would still have to fetch the web content by making a call to the web site.
Another option is to use ASP.NET and enable caching, so that when other visitors come to the site, instead of generating the page all over again it serves up the cached page. You can set this to cache for 24 hours or so. I'm sure other server languages have similar features. Basically this would be the same as your once-a-day Perl method, but it keeps everything within a web framework.
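The "fetch once, reuse for 24 hours" idea is easy to sketch outside any particular framework too; here it is in Python with a made-up fetch function:

```python
# Cache the expensive scrape/parse result and refresh at most once a day.
import time

_cache = {"html": None, "fetched_at": 0.0}
CACHE_TTL = 24 * 60 * 60  # seconds

def get_marquee_html(fetch):
    """Return cached HTML, calling fetch() at most once per CACHE_TTL.

    `fetch` is whatever callable does the real scraping and parsing.
    """
    now = time.time()
    if _cache["html"] is None or now - _cache["fetched_at"] > CACHE_TTL:
        _cache["html"] = fetch()
        _cache["fetched_at"] = now
    return _cache["html"]
```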
Another hacky solution would be to use an iframe, positioned with JavaScript so that it only shows the content you want to show. Of course, you'll have no control over the formatting (background, fonts) of the iframe, and if the content gets bigger or changes position you'll have problems.