There are many websites and blogs which provide RSS feeds, but there are also many which do not. I want to turn that kind of web page into an RSS feed.
I found some solutions through Google, like Feed43, Page2rss, and Dapper, but I want an open source project which can perform this task, or a tutorial explaining how to do it.
Please give me suggestions, and if you can explain your suggestion, you are most welcome.
My preferred language is PHP.
There's nothing magic about RSS. I suggest you read this tutorial to understand how to build an RSS feed from scratch:
http://www.xul.fr/en-xml-rss.html
Then use your PHP skills to build one from your content. Generic HTML-to-RSS scrapers can be found online by searching for "html to rss converter" and the like, but most of these are hosted solutions, and the RSS feeds they produce aren't that great. A good RSS feed requires understanding the content that you're syndicating, not just the raw HTML. IMHO.
In general there is not going to be any "one size fits all" solution to something like this. You'll have to examine the HTML structure of the blog you want to build an RSS feed from, parse out the content you are interested in, and stick it into an RSS feed.
Here are some PHP tools to help get you started:
Parsing HTML:
DOMDocument (swiss-army-knife of HTML/XML parsing)
SimpleXML (easy to use, but requires valid XML)
Tidy (can be used to clean up bad HTML)
Understanding RSS Feeds:
http://en.wikipedia.org/wiki/RSS
To construct the feed itself with PHP, you can once again use DOMDocument or SimpleXML. Alternatively, depending on the format of the HTML you want to convert into RSS, you may be able to create an XSLT stylesheet to transform it.
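For illustration, here is a minimal sketch of building an RSS 2.0 feed with DOMDocument. The $items array, the feed title and the URLs are placeholders standing in for whatever you scraped; this is not a complete syndication solution.

```php
<?php
// Minimal RSS 2.0 feed built with DOMDocument.
// $items is a stand-in for content scraped or queried elsewhere.
$items = [
    ['title' => 'First post',  'link' => 'https://example.com/post-1', 'description' => 'Summary of the first post.'],
    ['title' => 'Second post', 'link' => 'https://example.com/post-2', 'description' => 'Summary of the second post.'],
];

$doc = new DOMDocument('1.0', 'UTF-8');
$doc->formatOutput = true;

$rss = $doc->createElement('rss');
$rss->setAttribute('version', '2.0');
$doc->appendChild($rss);

$channel = $doc->createElement('channel');
$rss->appendChild($channel);

$channel->appendChild($doc->createElement('title', 'Example Feed'));
$channel->appendChild($doc->createElement('link', 'https://example.com/'));
$channel->appendChild($doc->createElement('description', 'Feed generated from scraped content'));

foreach ($items as $item) {
    $node = $doc->createElement('item');
    foreach ($item as $tag => $value) {
        // createTextNode + appendChild escapes entities reliably,
        // unlike passing the value straight to createElement
        $child = $doc->createElement($tag);
        $child->appendChild($doc->createTextNode($value));
        $node->appendChild($child);
    }
    $channel->appendChild($node);
}

header('Content-Type: application/rss+xml; charset=UTF-8');
echo $doc->saveXML();
```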
There is no simple or concrete answer to this question, but I will get you started.
First, you need to build a crawler of sorts. Typically, you are going to want this to be multi-threaded and run in the background on your server. This might be as simple as forking PHP processes on the server, but you might find a more efficient way, depending on how much traffic you expect.
Now, probably the best way to start would be to read the DOM. See http://php.net/manual/en/class.domdocument.php Look for headings and try to associate them with the paragraphs below them. Beware, though, that probably less than half the sites out there (and likely far fewer among those that don't already have a feed) structure their pages in an organized way. But it is a place to start.
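As a rough illustration of that heading-based approach, here is a sketch using DOMDocument and DOMXPath. The structural assumptions (headings followed by sibling paragraphs) are exactly the ones that won't hold everywhere, as noted above; $html is assumed to come from your crawler.

```php
<?php
// Sketch: pair each heading with the paragraphs that follow it.
libxml_use_internal_errors(true); // tolerate real-world malformed HTML

$doc = new DOMDocument();
$doc->loadHTML($html); // $html = page source fetched by the crawler
$xpath = new DOMXPath($doc);

$entries = [];
foreach ($xpath->query('//h1|//h2|//h3') as $heading) {
    $body = '';
    // Walk forward through siblings until the next heading
    for ($node = $heading->nextSibling; $node !== null; $node = $node->nextSibling) {
        if ($node instanceof DOMElement && preg_match('/^h[1-6]$/i', $node->tagName)) {
            break;
        }
        if ($node instanceof DOMElement && strtolower($node->tagName) === 'p') {
            $body .= trim($node->textContent) . "\n";
        }
    }
    $entries[] = ['title' => trim($heading->textContent), 'body' => trim($body)];
}
```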
There are plenty of element attributes you can use too, such as alt text. Also, in time you may find a lot of sites using a particular template that you can write code to handle directly.
You should also have something to read existing feeds. If a site has a feed, there's no sense in generating one for it, right? Use SimplePie to get started; there are alternatives if you don't like it. http://simplepie.org/
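Reading an existing feed with SimplePie looks roughly like this; the feed URL is a placeholder, and the sketch assumes a classic (non-namespaced) SimplePie install:

```php
<?php
// Minimal SimplePie usage: fetch a feed and list its items.
require_once 'autoloader.php'; // or vendor/autoload.php with Composer

$feed = new SimplePie();
$feed->set_feed_url('https://example.com/feed.xml'); // placeholder URL
$feed->init();

foreach ($feed->get_items() as $item) {
    echo $item->get_title(), ' - ', $item->get_permalink(), PHP_EOL;
}
```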
Once you have parsed the page, you'll want a database backend to track it, detect changes, and so on.
From there, you need something to generate the feed. There are plenty of OOP classes for doing this. Oftentimes I just write my own, but that is up to you.
If you build sites with the Symphony CMS, then yes, it's very easy. See this snippet of a tutorial: Learn here
I need to crawl a website and detect how many ads are on a page. I've crawled it using PHPCrawl and saved the content to a DB. How can I detect whether a web page has ads above the fold?
Well, simply put: you can't, really. At least not in a simple way. There are many things to consider here, and all of them depend heavily on the web page you are crawling, the device used, etc. I'll try to explain some of the main issues you would need to work around.
Dynamic Content
The first big problem is that you only have the HTML structure, which by itself gives no direct visual information. It would, if this were 1990 or so, but modern websites use CSS and JS to enhance their page's core structure. What you see in your browser is not just HTML rendered as-is; it's subject to CSS styling and even JS-injected code fragments, which can alter the page significantly. For example: any page that uses a so-called AJAX loader will show up in the crawler as a very simple HTML block, while the real page is loaded (from JS) AFTER that block is rendered.
Viewport
What you described as "above the fold" is an arbitrary term that can't be defined globally. A smartphone has a very different viewport than a desktop PC. Also, most modern websites use very different structures for mobile, tablet and desktop devices. But let's say you just want to do it for desktop devices. You could define an average viewport based on the most-used screen resolutions (which you can find on the internet). We will define it as 1366x768 for now (based on a quick Google search). However, you still only have a PHP script and an HTML string, which brings us to the next problem.
Rendering
What you see in your browser is actually the result of a complex system that incorporates the HTML and all of its linked resources to render a visual representation of the code you have crawled. Besides the core structure of the HTML string you got, any linked resource can (and will) change how the content looks. Resources can even add more content based on a variety of conditions. So what you would need to get the actual visual information is a so-called "headless browser". Only that can give you valid information about what is actually visible inside the desired viewport. If you want to dig into that topic, a good starting point would be a tool like "PhantomJS". However, don't assume that this is an easy task. You still only have bits of data, with no context whatsoever.
Context, or "What is an ad?"
Let's assume you have tackled all these problems and made a script that can actually interpret all the things you got from your crawler. You still need to know "What is an ad?", and that's a huge problem. Of course, for you as a human it's easy to distinguish between what is part of the website and what is an ad. But translating that into code is more of an AI task than a basic script. For example: the ads could be (and most of the time are) loaded into a predefined container after the actual page load. These containers in turn may only have a cryptic ID that distinguishes them from the rest of the (actually valid) page content. If you are lucky, one has a class with a string like "advertisement", but you can't take that as given. Ads are the target of all sorts of ad blockers, so they have a long history of trying to disguise themselves as well as possible. You will have a REALLY hard time figuring out what is an ad and what is valid page content.
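Just to make the limitation concrete, here is the naive class/id-matching heuristic in PHP. The pattern list is invented for the example, and as argued above, real ads actively evade exactly this kind of matching, so treat it as a starting point, not a solution:

```php
<?php
// Naive heuristic: flag elements whose id, class or src looks ad-like.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html); // $html = page source from your crawler
$xpath = new DOMXPath($doc);

// Invented pattern list; real-world ads often defeat it on purpose
$pattern = '/\b(ad|ads|advert|banner|sponsor|doubleclick|adsbygoogle)\b/i';
$candidates = [];

foreach ($xpath->query('//div|//iframe|//ins|//section') as $el) {
    $haystack = $el->getAttribute('id') . ' '
              . $el->getAttribute('class') . ' '
              . $el->getAttribute('src');
    if (preg_match($pattern, $haystack)) {
        $candidates[] = $el->getNodePath();
    }
}

printf("Found %d ad-like elements\n", count($candidates));
```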
So, while I've only touched on some of the problems you are going to run into, I want to say that it's not impossible. But you have to understand that you are at the most basic entry point, and if you want to make a system that actually works, you'll have to spend a LOT of time on fine-tuning, and maybe even do research in the AI field.
And to come back to your question: there is no simple answer to "how to detect if a page has ads", because it is far more complex than you might think. Hope this helps.
I've seen this question, which is very nice and informative. However, it doesn't deal with a rather common scenario.
Say I need to scrape a multitude of websites (or even pages in the same domain), but the author of a given website didn't care enough about their code, and it has some seriously malformed markup "that kinda works". I need to take information from that website.
How do I do it in this case? Ideally without going í͞ń̡͢͡s̶̢̛á̢̕͘ń̵͢҉e̶̸̢̛.
Is it possible? Or do I have to resort to RegExp?
You need a DOM parser. PHP has one. And then there are some alternatives (and more... just Google for them). You can even run the "garbled HTML" through HTML Purifier first if you want.
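In practice, PHP's built-in DOMDocument copes with broken markup on its own if you silence libxml's warnings first. A minimal sketch (the URL is a placeholder):

```php
<?php
// Load malformed HTML: collect libxml errors instead of spewing warnings.
$html = file_get_contents('https://example.com/ugly-page.html'); // placeholder

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);   // parses despite unclosed tags, bad nesting, etc.
libxml_clear_errors();

// Query the repaired tree as usual
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//a[@href]') as $link) {
    echo $link->getAttribute('href'), PHP_EOL;
}
```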
I don't know how you are scraping the site, but working with RegExp will allow you to add many conditions to the scraping code. This may take time, depending on the number of footprints and your RegExp skills.
You may also use Tidy on the site's HTML, but this will lead to strange results as well, IMO.
Does it have to be PHP? Python has a wonderful library called Beautiful Soup ("You didn't write that awful page. You're just trying to get some data out of it"). From my experience I'd recommend it so much that I'd say: if you have the option, write a quick Python script to parse your nodes into a clean file that your PHP can pick up.
(I know PHP is in the title and this doesn't directly answer your question. Apologies if you don't have the option of (or dislike) Python; I just wanted to present a good alternative.)
I heard XML can be used as a database. Can anybody give me a simple tip, or a link to a tutorial, on how to store some information in such a database? What is the best use of XML in PHP when it comes to data?
I'm going to throw my hat in, simply because I am working on a personal project that does in fact use XML as its storage mechanism. Notice I didn't call it a database. It's not a database, at least not in the way most people would define one. As expressed in an article I read recently, XML is not data-oriented, it's document-oriented.
In my case, I'm building a simple OO PHP/XML resume site for my girlfriend. I am using an XML file to store the content. I chose this mainly because it's small, lightweight, interchangeable, and easy to read. Initially, I thought I could just provide the XML to her, and she could fill in the blanks. XML is straightforward enough to allow a layman to do that.
As I continued, I realized that it wasn't very difficult to throw in an admin type interface where she could simply enter values in a form and update the resume that way.
Since the site is not really a web site, but a web document, XML works well here and nicely separates content.
Of course, I could have used JSON as well, and I may in fact adapt things to handle either JSON or XML, but I decided to use XML initially simply out of familiarity, and because (this is arguable) I assumed it would be easier for a layman to parse when entering content.
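The read/update cycle for such an XML content file is short with SimpleXML. The file name and element names (resume, job, title, employer) are invented for this sketch:

```php
<?php
// Read, extend and persist an XML content file with SimpleXML.
$resume = simplexml_load_file('resume.xml'); // hypothetical file

// Read: list every job title
foreach ($resume->job as $job) {
    echo $job->title, PHP_EOL;
}

// Update: roughly what a small admin-form handler would do
$job = $resume->addChild('job');
$job->addChild('title', 'Junior Developer');
$job->addChild('employer', 'Example Corp');

// Write the modified document back to disk
$resume->asXML('resume.xml');
```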
XML is not supposed to be used as a database but as a way to transport data in an application-agnostic way. For example, say you have many RSS feeds in Google Reader and you want to add them to Thunderbird. You would export them from Google Reader in XML format, and then import that XML file into Thunderbird. Both applications know how to read from and write to the XML and how to use the information (the RSS feeds) in it.
If you want to store information in a useful way that, for example, lets you organize and search through it, you will need a full-fledged database. Some good ones are MySQL and PostgreSQL. Both of those work well with PHP and have extensive tutorials to begin with, all easily accessible via any search engine.
You can answer this question yourself after reading this very entertaining article by one of Stack Overflow's founders:
Back to Basics by Joel Spolsky
Check out some of the responses I got to my question "Is there a simple, flat, XML-based query-able data storage solution?" on the Programmers.StackExchange Site.
It's a mixed bag. SimpleXML is great with PHP, but there is a lot of FUD when it comes to XML query languages and implementations.
To add to what Fanis said, if you want something lightweight, then I strongly recommend MongoDB or SQLite.
I'm trying to write a text parser in PHP, like Instapaper did. What I want to do is get a webpage and parse it in text-only mode.
It's simple to get the webpage with cURL and strip the HTML tags. But every webpage has some common areas, like the header, navigation, sidebar, footer, banners, etc. I only want to get the article in text mode and exclude all the other parts. It's also simple to exclude those parts if I know the "id" or "class" info, but I'm trying to automate this process and apply it to any page, like Instapaper does.
I can get all the content between the body tags, but I don't know how to exclude the header, sidebar or footer and get only the main article body. I have to develop a logic to get only the main article part.
It's not important for me to find the exact code. It would be enough to understand how to exclude the unnecessary parts, as I can try to write my own PHP code. Examples in other languages would also be useful.
Thanks for helping.
You might try looking at the algorithms behind the readability bookmarklet. It's got a decent success rate for extracting content from among all the web page rubbish.
A friend of mine made it; that's why I'm recommending it, since I know it works and I'm aware of the many techniques he's using to parse the data. You could apply these techniques to what you're asking.
You can take a look at the source of Goose; it already does a lot of this, like Instapaper-style text extraction:
https://github.com/jiminoc/goose/wiki
Have a look at the ExtractContent code from Shuyo Nakatani.
See original Ruby source http://rubyforge.org/projects/extractcontent/ or a port of it to Perl http://metacpan.org/pod/HTML::ExtractContent
You really should consider using an HTML parser for this. Gather similar pages and compare the DOM trees to find the differing nodes.
This article provides a comparison of different approaches. The Java library boilerpipe was rated highly, and on the boilerpipe site you can find the author's scientific paper, which compares it to other algorithms.
Not all algorithms suit all purposes. The biggest application of such tools is just getting the raw text to index for a search engine, the idea being that you don't want search results to be messed up by adverts. Such extraction can be destructive, meaning that it won't give you "the best reading area", which is what people want with Instapaper or readability. A crude version of the underlying text-density idea is sketched below.
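For a feel of how these extractors work, here is a deliberately crude text-density heuristic in PHP: score block elements by how much text they contain versus how much of that text sits in links, and keep the winner. The threshold and scoring formula are arbitrary inventions; real tools like boilerpipe use far more signals:

```php
<?php
// Crude content extraction: prefer long, link-poor blocks.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html); // $html = full page source
$xpath = new DOMXPath($doc);

$best = null;
$bestScore = 0.0;

foreach ($xpath->query('//article|//div|//section|//td') as $el) {
    $text = trim($el->textContent);
    $textLen = mb_strlen($text);
    if ($textLen < 200) {
        continue; // too short to be the main article
    }

    // Total length of text that sits inside links within this block
    $linkLen = 0;
    foreach ($xpath->query('.//a', $el) as $a) {
        $linkLen += mb_strlen(trim($a->textContent));
    }

    // Lots of text, little of it in links => likely article body
    $score = $textLen * (1 - $linkLen / max(1, $textLen));
    if ($score > $bestScore) {
        $bestScore = $score;
        $best = $el;
    }
}

echo $best !== null ? trim($best->textContent) : 'No candidate found';
```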
Hi, I know about several PDF generators for PHP (FPDF, dompdf, etc.).
What I want to know about is a parser.
For reasons beyond my control, certain information I need is only available in a table inside a PDF, and I need to extract that table and convert it to an array.
Any suggestions?
I've written one before (for similar needs), and I can say this: have fun. It's quite a complex task. The PDF specification is large and unwieldy. There are several methods of storing text inside of it. And the kicker is that each PDF generator differs in how it works. So while something like tFPDF or dompdf creates REALLY easy-to-read PDFs (from a machine standpoint), Acrobat makes some really hellish documents.
The reason is how it writes the text. Most DOM-based renderers (that I've used) write the entire line as one string, and position it once (which is really easy to read). Acrobat tries to be more efficient (and it is) by writing only one or maybe a few characters at a time and positioning them independently. While this REALLY simplifies rendering, it makes reading MUCH more difficult.
The upside here is that the PDF format in itself is really simple. You have "objects" that follow a regular syntax, and you can link them together to generate the content. The specification does a good job of describing the file format. But real-world reading is going to take a bit of brain power...
Some helpful pieces of advice that I had to learn the hard way if you're going to write it yourself:
Adobe likes to re-map fonts. So character 65 will likely not be A... You need to find a map object and deduce what it's doing based upon what characters are in there. And it is efficient since if a character doesn't appear in the document for that font, it doesn't include it (which makes life difficult if you try to programmatically edit a PDF)...
Write it as abstractly as possible. Write classes for each object type, and each native type (strings, numbers, etc.), and let those classes do the parsing for you. There will be a fair bit of repetition in there, but you'll save yourself in the end when you realize that you need to tweak something for only one specific type...
Write for a specific version or two of the PDF spec, and enforce it. Check the version number, and if it's higher than you expect, bail... And don't try to "make it work". If you want to support newer versions, break out the specification and upgrade the parser from there. Don't try to trial and error your way up (it's not fun)...
Good luck with compressed streams. I've found that typically you can't trust the length arguments to verify what you are uncompressing. Sometimes (for some generators) it works well... Others it's off by one or more bytes. I just attempt to deflate it if the filter matches, and then force the length...
When testing lengths, don't use strlen. Use mb_strlen($string, '8bit') since it will compensate for different character sets (and allow potentially invalid characters in other charsets).
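To see why that matters, compare the two on a raw byte stream; on legacy setups with mbstring.func_overload enabled, plain strlen() can silently count characters instead of bytes:

```php
<?php
// '8bit' forces a raw byte count regardless of mbstring settings.
$stream = "\xC3\xA9";                // two bytes (UTF-8 "é")

echo strlen($stream), PHP_EOL;            // usually 2, but overloadable on old setups
echo mb_strlen($stream, '8bit'), PHP_EOL; // always 2: raw bytes
echo mb_strlen($stream, 'UTF-8'), PHP_EOL;// 1: characters, wrong for stream lengths
```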
Otherwise, best of luck...
I use PDFBox for that (http://pdfbox.apache.org/). This software is Java-based and platform-independent. It works fast and reliably. You can use it via exec or shell_exec, or via a PHP/Java bridge (http://php-java-bridge.sourceforge.net/).
Have you already looked at xPDF? There is a program in there called pdftotext that will do the conversion. You can call it from PHP and then read in the text version of the PDF. You will need the ability to run exec() or system() from PHP, so this may not work on all hosted solutions, though.
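A minimal sketch of that exec-based route; the path is a placeholder, and the "-" argument makes pdftotext write to stdout:

```php
<?php
// Convert a PDF to plain text by shelling out to pdftotext.
$pdf = '/path/to/report.pdf'; // placeholder path
$text = shell_exec('pdftotext ' . escapeshellarg($pdf) . ' -');

if ($text === null) {
    die('pdftotext failed or is not installed');
}

// Back in plain-text land: e.g. split into lines to hunt for table rows
$lines = preg_split('/\R/', trim($text));
```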
Also, there are some examples on the PHP site that will convert PDF to text, although the output is pretty rough. You may want to try some of those examples as well. On that PHP page, search for luc at phpt dot org.
Zend_Pdf is part of the Zend Framework. Their manual states:
The Zend_Pdf component is a PDF (Portable Document Format) manipulation engine. It can load, create, modify and save documents. Thus it can help any PHP application dynamically create PDF documents by modifying existing documents or generating new ones from scratch.
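A small sketch of what that looks like in practice (the paths are placeholders; note that Zend_Pdf is geared toward creating and modifying documents, so extracting a table's text is not something it does for you out of the box):

```php
<?php
// Minimal Zend_Pdf (Zend Framework 1) usage: load, inspect, modify, save.
require_once 'Zend/Pdf.php';

$pdf = Zend_Pdf::load('/path/to/document.pdf'); // placeholder path
echo 'Pages: ', count($pdf->pages), PHP_EOL;

// Tweak document metadata and save a copy
$pdf->properties['Title'] = 'Annotated copy';
$pdf->save('/path/to/document-copy.pdf');
```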
Have a look at Ghostscript or iTextSharp; there are various cross-platform versions of both.
It may not actually be a table inside the PDF, as the PDF format loses that sort of information...
This is a PHP PDF parser, which exists in two flavours:
Free version can parse PDFs up to format PDF 1.5
Commercial add-on can parse any PDF format (up to current 1.9)