I'm trying to set up a translation tool to translate websites. What I want to do is import html-code and get all translatable texts from that site.
One idea would be to use strip_tags, but it would ignore translatable strings such as alt texts, title texts, and probably others that I haven't thought of yet. Is there a clean way to do this?
In this case you need to parse the HTML and extract the text yourself. As you probably already know, parsing HTML with regular expressions is A Bad Idea (tm). So the only reliable solution is to parse the document's DOM. For this step you are free to use any tool, including the standard DOMDocument class.
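As a rough illustration of the DOM approach, here is a minimal sketch that collects text nodes plus the values of alt and title attributes. The function name extractTranslatableStrings and the attribute list are assumptions made up for this example, and encoding handling is left out.

```php
<?php
// Hypothetical helper (not from any library): returns text nodes first,
// then the values of the requested attributes.
function extractTranslatableStrings(string $html, array $attributes = ['alt', 'title']): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $strings = [];

    // Text nodes in document order, skipping the contents of <script> and <style>.
    foreach ($xpath->query('//text()[not(ancestor::script) and not(ancestor::style)]') as $node) {
        $text = trim($node->nodeValue);
        if ($text !== '') {
            $strings[] = $text;
        }
    }

    // Attribute values that strip_tags would have thrown away.
    foreach ($attributes as $name) {
        foreach ($xpath->query('//@' . $name) as $attr) {
            $value = trim($attr->nodeValue);
            if ($value !== '') {
                $strings[] = $value;
            }
        }
    }

    return $strings;
}

$html = '<p title="Greeting">Hello <b>world</b></p>'
      . '<img src="cat.png" alt="A cat"><script>var x = 1;</script>';
print_r(extractTranslatableStrings($html));
```

Other translatable attributes (placeholder, for instance) could be added through the second argument.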
If you are looking for libraries or scripts to help, I would suggest looking at html2text, which can be used commercially. As far as I can see, it doesn't support attributes of <img> tags, but that is very easy to fix (use its handling of the <a> tag as an example).
If you are looking for automated text extraction, then you should definitely look at something like Boilerpipe.
I would personally use the DomCrawler component from Symfony2, which is a nice wrapper around the PHP DOM functions, and start from there.
Related
Which is better in terms of performance and usability, file_get_contents or DOMDocument::loadHTMLFile?
As they both get the HTML content, when is it best to use one or the other?
The same question applies to file_put_contents and DOMDocument::saveHTMLFile.
I was a newbie back when I posted this question. Here is my conclusion at the time of writing:
They do completely different things. DOMDocument::loadHTMLFile is part of the DOMDocument class, which is very powerful for manipulating HTML-formatted content, while file_get_contents fetches file or HTTP content as a string that you can manipulate as you wish. If the target content is HTML and needs heavy changes, I would personally use the DOMDocument class. Otherwise, I would just manipulate it as a string.
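To make the difference concrete, here is a small sketch that applies the same edit both ways. The sample page and the temporary file are assumptions for illustration only.

```php
<?php
// Write a small sample page to a temporary file so the sketch is self-contained.
$path = tempnam(sys_get_temp_dir(), 'page');
file_put_contents($path, '<html><body><h1>Old title</h1><p>Body text</p></body></html>');

// Light change: read the file as a string and edit it as a string.
$asString = str_replace('Old title', 'New title', file_get_contents($path));

// Heavier change: load the same file into a DOM tree and edit nodes.
$doc = new DOMDocument();
$doc->loadHTMLFile($path);
$heading = $doc->getElementsByTagName('h1')->item(0);
$heading->nodeValue = 'New title';
// Structural edits such as appending an element are where the DOM pays off.
$heading->parentNode->appendChild($doc->createElement('p', 'Appended paragraph'));
$asDom = $doc->saveHTML();

unlink($path);
echo $asString, "\n", $asDom;
```

Note that the DOM version re-serializes the whole document, so its output may differ from the input in whitespace and may gain a doctype, while the string version leaves everything else untouched.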
There are a bunch of HTML text extraction tools out there, mostly for Java or Python. The one I come across most often is boilerpipe. There are a few APIs here and there, and some seem to work pretty well. Does anyone know of anything in PHP that does this?
You could try phpQuery:
http://code.google.com/p/phpquery/
DOMDocument is a class available in PHP (if you have libxml support) that can parse HTML documents and let you iterate over them or issue XPath queries to find specific nodes in the DOM tree. This is the ideal method.
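A minimal sketch of the XPath route follows. The page markup and the id "content" are assumptions invented for this example; a real page will need its own query.

```php
<?php
// Sample page: a navigation block we want to ignore and a content block we want to keep.
$html = '<html><body><div id="nav"><a href="/">Home</a></div>'
      . '<div id="content"><p>First paragraph.</p><p>Second <em>paragraph</em>.</p></div>'
      . '</body></html>';

$doc = new DOMDocument();
// Suppress warnings that libxml raises for imperfect markup.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// XPath queries run through a DOMXPath object bound to the document.
$xpath = new DOMXPath($doc);
$paragraphs = [];
foreach ($xpath->query('//div[@id="content"]//p') as $p) {
    // textContent flattens nested inline tags such as <em> into plain text.
    $paragraphs[] = trim($p->textContent);
}
print_r($paragraphs);
```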
Or, if the text is simple enough and uniform, you can use preg_match() to extract text from the data using Regular Expressions.
I want to implement a commenting system for my website. I looked around and found CKEditor to be the best WYSIWYG editor available. I tried its bbcode output and it works perfectly. However, if I use bbcode output, I need a reliable parser to convert the bbcode to HTML when showing the comments to users. If I use HTML output, I may need something to prevent XSS in the comments. Which way do you suggest for a simple commenting system? I have already integrated CKEditor into my system and prefer a very lightweight and simple approach without much bloat (like PEAR). Also, StackOverflow seems pretty awesome. Is it possible to use something similar in PHP?
I should use a reliable parser to parse the bbcode to HTML.
PHP has a PECL BBCode extension.
Also, StackOverflow seems pretty awesome. Is it possible to use something similar for my php?
SO uses Markdown. A Markdown parser for PHP is also available.
I'd like to automatically pretty-print (indentation, mostly) the HTML output that my PHP scripts generate. I've been messing with Tidy, but have found that in its efforts to validate and clean my code, Tidy is changing way too much. I know Tidy's intentions are good but I'm really just looking for an HTML beautifier. Is there a simpler library out there that can run in PHP and just do the pretty-printing? Or, is there a way to configure Tidy to skip all the validation stuff and just beautify?
The behaviour that you've observed when using Tidy is a result of its underlying use of the DOM API. Instead of manipulating the provided source code, the DOM API reconstructs the whole source, making fixes along the way.
I've written Dindent, which is a library that uses regular expressions. It does not do anything beyond adding indentation and removing whitespace. However, I advise against using this implementation beyond development purposes.
I've never used Tidy but it seems pretty customizable.
Here's the quick reference of configuration options: http://tidy.sourceforge.net/docs/quickref.html
But really, with tools like Firebug, I've never seen the need to Tidy HTML output.
Since you do not want it to validate, for whatever reason, I will not suggest htmlpurifier ; ). Why not just use an IDE to get everything indented nicely, for example with Alt-Shift-F in NetBeans?
Facing the same problem, I currently use a combination of two commands:
cat template-home.php | js-beautify --type html | prettier --parser php
js-beautify formats the HTML bits and prettier formats the PHP code.
I need some help ... I'm a bit (read total) n00b when it comes to regular expressions, and need some help writing one to find a specific piece of text contained within a specific HTML tag from PHP.
The source string looks like this:
<span lang="en">English Content</span><span lang="fr">French content</span> ... etc ...
I'd like to extract just the text of the element for a specific language.
Can anyone help?
There are plenty of HTML parsers available for PHP. I suggest you check out one of those, (for example: PHP Simple HTML DOM Parser).
Shooting yourself in the foot with trying to read HTML with regex is a lot easier than you think, and a lot harder to avoid than you wish (especially when you don't know regex thoroughly, and your input is not guaranteed to be 100% clean HTML).
(Bad, not working) example which shows why you should not use regex for parsing HTML:
/<span lang="en">(.*)<\/span>/
Applied to the source string above, the greedy .* will capture:
English Content</span><span lang="fr">French content
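For contrast, here is a sketch of the same extraction done with DOMDocument and XPath instead of a regex. The function name textForLanguage is hypothetical, and the language code is assumed to contain no quote characters.

```php
<?php
// Hypothetical helper: returns the text of the first <span> with the given
// lang attribute, or null when no such element exists.
function textForLanguage(string $html, string $lang): ?string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Plain concatenation is only safe because $lang is assumed to be a simple code like "en".
    $nodes = $xpath->query('//span[@lang="' . $lang . '"]');

    return $nodes->length > 0 ? $nodes->item(0)->textContent : null;
}

$source = '<span lang="en">English Content</span><span lang="fr">French content</span>';
echo textForLanguage($source, 'fr'), "\n";
```

Because the parser knows where each element ends, the match cannot run over into the next span the way the greedy pattern does.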
More stuff to read:
Parsing: Beyond Regex
For-the-2,295,485th-time-DO-NOT-PARSE-HTML-WITH-REGULAR-EXPRESSIONS
There's this most awesome class that lets you do SQL-like queries on HTML pages. It might be worth a look:
HTML SQL
I've used it a bunch and I love it.
Hope that helps...