I have a list of many web page URLs, and all of them contain videos. These videos are embedded via simple HTML tags, and I can extract those tags with some regex techniques.
Now the problem is that the majority of the pages use JavaScript to embed these elements, and since they come from different websites, they don't follow any specific pattern.
The only thing I can do now is to make my "PHP execute the JavaScript", and I'm stuck on this task.
I want this extraction to be done via a PHP script. I've tried jParser and jTokenizer, but I can't get them to work in this case.
Any help would be appreciated.
Thanks.
The only thing I can do now is to make my "PHP execute the JavaScript", and I'm stuck on this task.
Oh goodness, don't do that. The effort you'll spend (and the security problems you'll open up) will be a far bigger time sink than writing site-specific extraction code for each site you're scraping (which I'm blindly assuming is what you're doing).
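If it helps, here's a rough, purely illustrative sketch of what that per-site approach could look like in PHP. The host names and the patterns are made up; the point is just one small rule per site rather than a JavaScript engine:

```php
<?php
// Hypothetical sketch: one small extractor per site, keyed by host name.
// The patterns below are invented examples; each real site needs its own
// rule based on how its player markup or inline JavaScript looks.
$extractors = array(
    'example-videos.com' => '~videoId\s*=\s*"([^"]+)"~',    // id assigned in inline JS
    'another-site.net'   => '~<iframe[^>]+src="([^"]+)"~i', // plain iframe embed
);

function extract_video($url, array $extractors) {
    $host = parse_url($url, PHP_URL_HOST);
    $html = @file_get_contents($url);           // or use cURL for more control
    if ($html === false || !isset($extractors[$host])) {
        return null;                             // no rule for this site (yet)
    }
    if (preg_match($extractors[$host], $html, $m)) {
        return $m[1];                            // the captured id or embed URL
    }
    return null;
}

var_dump(extract_video('http://example-videos.com/watch/123', $extractors));
```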
Related
I have a huge list of URLs from a client which I need to run through so I can get content from the pages. This content is in different tags within each page.
I am looking to create an automated service to do this which I can leave running until it completes.
I want the automated process to load each page and get the content from particular HTML tags, then process some of this content to ensure the HTML is correct.
If possible I want to generate one XML or JSON file, but I can settle for an XML or JSON file per page.
What is the best way to do this, preferably something I can run on a Mac or a Linux server?
The list of URLs points to an external site.
Is there something I can already use, or an example somewhere, which will help me?
Thanks
This is a perfect application for BeautifulSoup, IMHO. Here is a tutorial on a similar process; it should certainly give you a head start.
Scrapy is an excellent framework for spidering and scraping.
I think you'll find it involves a little more learning overhead compared to the Requests + Beautiful Soup or lxml tutorial mentioned by tim-cook in his answer. However, if you're writing a lot of scraping / parsing logic, it should guide you toward a pretty well-factored (readable, maintainable) codebase.
So, if it's a one-off run I'd go with Beautiful Soup + Requests. If it'll be reused, extended and maintained over time, then Scrapy would be my pick.
I've seen this question, which is very nice and informative. However, it doesn't deal with a rather common scenario.
Say I need to scrape a multitude of websites (or even pages in the same domain), but the author of a given website didn't care enough about his code, and it contains some seriously malformed HTML "that kinda works". I still need to take information from that website.
How do I do it in this case? Ideally without going í͞ń̡͢͡s̶̢̛á̢̕͘ń̵͢҉e̶̸̢̛.
Is it possible? Do I have to resort to regular expressions?
You need a DOM parser. PHP has one built in, and then there are some alternatives (and more; just google for them). You can even run the "garbled HTML" through HTML Purifier first if you want.
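For PHP's built-in parser, something along these lines might do. This is a minimal sketch; the URL and the XPath query are placeholders, and the libxml calls just keep it from complaining about the bad markup:

```php
<?php
// Minimal sketch: parse malformed HTML with PHP's built-in DOM extension.
$html = file_get_contents('http://example.com/badly-formed-page.html');

libxml_use_internal_errors(true);      // don't spray warnings for broken markup
$doc = new DOMDocument();
$doc->loadHTML($html);                 // tolerant HTML parser, not strict XML
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// The query is a placeholder; adjust it to whatever you actually need.
foreach ($xpath->query('//div[@class="content"]//a') as $link) {
    echo $link->getAttribute('href'), "\n";
}
```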
I don't know how you are scraping the site, but working with regular expressions will allow you to add many conditions to the scraping code. This may take time, depending on the number of footprints and your regex skills.
You may also use Tidy on the site's HTML, but this can lead to strange results as well, IMO.
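If you do want to try Tidy first, a minimal sketch might look like this (it requires PHP's tidy extension, and the options shown are only examples):

```php
<?php
// Sketch: repair the markup with Tidy before parsing it.
$dirty = file_get_contents('http://example.com/messy-page.html');
$clean = tidy_repair_string($dirty, array(
    'output-xhtml' => true,   // emit well-formed XHTML
    'wrap'         => 0,      // don't re-wrap long lines
), 'utf8');

// $clean can now be fed to DOMDocument, SimpleXML, etc.
```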
Does it have to be PHP? Python has a wonderful library called Beautiful Soup ("You didn't write that awful page. You're just trying to get some data out of it"). In my experience I'd recommend it so highly that I'd say, if you have the option, write a quick Python script to parse your nodes into a clean file that your PHP can pick up.
(I know PHP is in the title and this doesn't directly answer your question. Apologies if you don't have the option of (or dislike) Python; I just wanted to present a good alternative.)
I'm making a search engine that (in theory) analyzes online encyclopedias to find answers to a user's question, submitted through a form. However, I want to know if I'm wasting my time with PHP. If I am, what language would be better suited to this task? If I'm not, what function in PHP would allow me to do this? Thanks!
PHP works as well as anything else. If you want to read data off another web page, you'll probably want to use cURL, which is built into PHP.
All of the requisite pieces are there: PHP does fine with processing text and HTML. If you already know PHP, it's best to stick with what you know.
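For what it's worth, a bare-bones cURL fetch in PHP might look like the sketch below (the URL is a placeholder):

```php
<?php
// Minimal sketch of fetching a page with cURL.
$ch = curl_init('http://example.com/encyclopedia/article');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
curl_setopt($ch, CURLOPT_TIMEOUT, 10);           // give up after 10 seconds
$html = curl_exec($ch);
curl_close($ch);

if ($html !== false) {
    // hand $html to your text/HTML processing code
}
```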
This is easy enough to do with PHP. If the sites you are getting the data from are valid XHTML, it will be extremely easy to process the pages and extract the data using the SimpleXML extension.
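Assuming the pages really are well-formed XHTML, a sketch along these lines might work. The namespace registration is needed because XHTML uses a default namespace, and the XPath query is only an example:

```php
<?php
// Sketch: extract data from a valid XHTML page with SimpleXML.
$xhtml = file_get_contents('http://example.com/valid-page.xhtml');
$xml = simplexml_load_string($xhtml);

// XHTML documents use a default namespace, so register a prefix for XPath.
$xml->registerXPathNamespace('x', 'http://www.w3.org/1999/xhtml');

// Placeholder query: grab the text of every h2 on the page.
foreach ($xml->xpath('//x:h2') as $heading) {
    echo trim((string) $heading), "\n";
}
```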
A friend has asked me for help with her website design. Although I know a fair amount about the basics behind HTML, XML, PHP, ASP.NET, JavaScript, etc., I'm not really comfortable sitting down and coding from scratch. All of the work I do is in Java, C++, and so on.
My friend would like to add a vertically scrolling marquee to her site - no problem, there is code for that all over the internet. Here is the tricky part: she would like the text to be dynamically pulled from another website. It isn't a simple text file, either - it's a list of names from a specific blog post, so there would be a lot of text processing involved to wade through all the other markup and extract the relevant info.
The way I see it, here are her options:
1) Write some kind of Perl script (or similar) that is set to run daily. This script would visit the blog, extract the necessary info, and then update the HTML file's marquee text with the new info.
2) Some sort of active page written in ASP or PHP that dynamically builds the marquee (and the rest of the site) each time the site is visited, basically doing the work of the Perl script on every request. This seems like it has the potential to be somewhat slow.
Per my understanding, those are her only options. Am I correct? Is there no simple way to do this in JavaScript that I am just missing? I know you can reference an image to be dynamically pulled within the marquee, but this isn't that simple...
Thanks.
EDIT: I guess where I was going with my question was this: unless I implement this statically, it is going to be fairly involved, right? I believe it is over my head. That is why I would like to simply copy/paste the text list into the HTML document. It would need to be updated every time the blog is, but that only appears to happen every few months, so it's not a large chore. I realize this is a lazy solution, but it comes from someone very inexperienced in web development.
For reference, this is the SPECIFIC blog post the text would come from, and my friend would ONLY like to display the list of names that begins several paragraphs down:
http://truthnottasers.blogspot.com/2008/04/what-follows-are-names-where-known.html
It depends what the list of names looks like, i.e. how much intelligence is needed to parse it. But this could fairly easily be pulled, parsed and displayed using Ajax, for example in its jQuery flavour.
All the blogs I have ever seen have an RSS feed. Why not just grab the feed? Google provides JavaScript that does just that:
Google AJAX Feed API
The RSS suggestion sounds good. If you can't get what you need from the feed, you could screen scrape the content.
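For the screen-scraping route, a rough PHP sketch might look like the following. The post-body class and the cache location are assumptions; the blog's actual markup would dictate the real query:

```php
<?php
// Rough sketch: scrape the list from the blog post and cache it for a day,
// so the page doesn't hit the blog on every visit. Paths/queries are examples.
$cache = __DIR__ . '/names-cache.txt';

if (!file_exists($cache) || filemtime($cache) < time() - 86400) {
    $html = file_get_contents('http://truthnottasers.blogspot.com/2008/04/what-follows-are-names-where-known.html');

    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);

    $xpath = new DOMXPath($doc);
    // Placeholder query: the post's real markup determines what to select here.
    $names = array();
    foreach ($xpath->query('//div[@class="post-body"]//text()') as $node) {
        $line = trim($node->nodeValue);
        if ($line !== '') {
            $names[] = $line;
        }
    }
    file_put_contents($cache, implode("\n", $names));
}

// Dump the cached list into the marquee.
echo nl2br(htmlspecialchars(file_get_contents($cache)));
```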
If you could do it with JavaScript, I think it would suffer the same resource issues as your once-a-day Perl script and the every-load ASP/PHP methods, since it would still have to fetch the web content by making a call to the other website.
Another option is to use ASP.NET and enable caching, so that when other visitors come to the site, instead of building the page all over again it serves up the cached page. You can set this to cache for 24 hours or so. I'm sure other server languages have similar features. Basically this would be the same as your once-a-day Perl method, but kept within a web framework.
Another hacky solution would be to use an iframe and clip the content with JavaScript so that it only shows the part you want. Of course you'll have no control over the formatting (background, fonts) of the iframe, and if the content gets bigger or changes position you'll have problems.
Please forgive what is most likely a stupid question. I've successfully managed to follow the simplehtmldom examples and get the data that I want off one web page.
I want to be able to set the function to go through all the HTML pages in a directory and extract the data. I've googled and googled, but now I'm confused, as in my ignorant state I thought I could (in some way) use PHP to form an array of the filenames in the directory, but I'm struggling with this.
Also it seems that a lot of the examples I've seen use cURL. Please can someone tell me how it should be done? There are a significant number of files. I've tried concatenating them, but this only works through an HTML editor - using cat doesn't work.
You probably want to use glob('some/directory/*.html'); (manual page) to get a list of all the files as an array. Then iterate over that array and use the DOM stuff on each filename.
You only need cURL if you're pulling the HTML from another web server; if the files are stored on your own web server, you want glob().
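Put together, a minimal sketch might look like this. The directory path and the XPath query are placeholders, and you could equally use simplehtmldom in place of DOMDocument:

```php
<?php
// Sketch: walk every .html file in a local directory and parse each one.
libxml_use_internal_errors(true);      // tolerate imperfect markup quietly

foreach (glob('/path/to/pages/*.html') as $file) {
    $doc = new DOMDocument();
    $doc->loadHTMLFile($file);

    $xpath = new DOMXPath($doc);
    // Placeholder query: pull whatever tags hold your data.
    foreach ($xpath->query('//table[@id="data"]//td') as $cell) {
        echo $file, ': ', trim($cell->nodeValue), "\n";
    }
}
```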
Assuming the parser you're talking about is working OK, you should build a simple web spider: look at all the links in a web page, build a list of "links to scan", and then scan each of those pages...
You should take care of circular references, though.
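A tiny sketch of that idea, with a "seen" list to guard against circular references (the starting URL is a placeholder, and resolving relative links is glossed over):

```php
<?php
// Sketch of a tiny crawler: a queue of pages to visit plus a "seen" set,
// which is what prevents circular references from looping forever.
$queue = array('http://example.com/start-page.html');
$seen  = array();

while ($url = array_shift($queue)) {
    if (isset($seen[$url])) {
        continue;                      // already scanned: break the cycle
    }
    $seen[$url] = true;

    $html = @file_get_contents($url);
    if ($html === false) {
        continue;                      // fetch failed, move on
    }

    // ... hand $html to your existing parser here ...

    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    foreach ($doc->getElementsByTagName('a') as $a) {
        $link = $a->getAttribute('href');
        if ($link !== '' && !isset($seen[$link])) {
            $queue[] = $link;          // real code should resolve relative URLs
        }
    }
}
```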