For a translation program, I am trying to extract text that is about 95% accurate from an HTML file in order to translate the sentences and links.
For example:
<div>Overflow <span>Texts <b>go</b> here</span></div>
Should give me 2 results to translate:
Overflow
Texts <b>go</b> here
Any suggestions or commercial packages available for this problem?
I'm not exactly sure what you're asking, but look at simplehtmldom. Specifically the "Extract Contents from HTML" tab under quick start on that front page (can't link directly, sigh). With that you can extract the text of a website without all those pesky tags.
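For example, a minimal sketch using Simple HTML DOM (the include path and selectors below are only illustrative and tied to the markup in the question; real pages will need more general selectors):

include 'simple_html_dom.php';

$html = str_get_html('<div>Overflow <span>Texts <b>go</b> here</span></div>');

// Whole fragment with every tag stripped:
echo $html->find('div', 0)->plaintext . "\n";   // Overflow Texts go here

// Or keep the inline markup inside the span, closer to the two-segment split you want:
echo $html->find('span', 0)->innertext . "\n";  // Texts <b>go</b> here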
In PHP, I am scraping some HTML from one of my other external sites. I'm performing the scrape and getting all of the page HTML in a PHP string. I need to find the first .png file in this string, then grab the HTML from that point back to the nearest preceding "http", and forward to just before the characters "\u002522" begin. Any ideas?
So:
<html><head><title>Hello</title></head><body><p>Here's a nice image</p><img src="http://www.exampleurl.com/image.png?id=35435646&v=5647\\u002522"/></body></html>
Would turn into:
http://www.exampleurl.com/image.png?id=35435646&v=5647
I've looked everywhere for a way to combine all of these things, but with no luck :(
I have used this before and it worked great for me: How to extract img src, title and alt from html using php?
Then just clean up the URL and split on //.
Let me know if I need to be more specific.
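If the linked answer is more than you need, here is a minimal string-based sketch; it assumes the scraped page is already in $html and that the first ".png" is the one you want:

$pngPos  = strpos($html, '.png');
$start   = strrpos(substr($html, 0, $pngPos), 'http');    // nearest "http" before ".png"
$markPos = strpos($html, '\u002522', $pngPos);             // the "\u002522" marker that follows the URL
$url     = rtrim(substr($html, $start, $markPos - $start), '\\');

echo $url;   // http://www.exampleurl.com/image.png?id=35435646&v=5647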
I am working on an applet that allows the user to input a URL to a news article or other webpage (in Japanese) and view the contents of that page within an iFrame in my page. The idea is that once the content is loaded into the page, the user can highlight words using their cursor, which stores the selected text in an array (for translating/adding to a personal dictionary of terms) and surrounds the text in a red box (div) according to a stylesheet defined on my domain. To do this, I use cURL to retrieve the HTML of the external page and dump it into the source of the iFrame.
However, I keep running into major formatting problems with the retrieved HTML. The big problem is preserving style sheets, and to fix this, I've used DOMDocument to add <base> tags to the <head> section of the retrieved HTML. This works for some pages/URLs, but there are still lots of style problems with the output HTML for many others. For example, div layers crash into each other, alignments are off, and backgrounds are missing. This is made a bit more problematic because I need to embed the output HTML in a new <div> in order to make the onClick JavaScript function for passing text selections in the embedded content work, which means the resulting source ends up looking like this:
<div onclick="parent.selectionFunction()" id="studyContentn">
<!-- HTML of output from cURL, including doctype declarations and <html>,<head> tags -->
</div>
It seems like a lot of the formatting issues I keep running into are largely arbitrary. I've tried using PHP Tidy to clean the output HTML, but that also works for some pages and not others. I've got a slight suspicion it may have to do with CDATA declarations being parsed oddly by DOMDocument, but I am not certain.
Is there a way I can guarantee that HTML output from cURL will be rendered correctly and faithfully in all instances? Or is there perhaps a better way of going about doing this? I've tried a bunch of different ways of approaching this issue, and each gets closer to a solution but brings its own new problems as well.
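For reference, the kind of <base>-tag injection I described looks roughly like this (a simplified sketch, not the exact code; $remoteHtml holds the cURL output and $remoteUrl is the page's original address):

$doc = new DOMDocument();
libxml_use_internal_errors(true);          // real-world pages rarely validate
$doc->loadHTML($remoteHtml);
libxml_clear_errors();

$base = $doc->createElement('base');
$base->setAttribute('href', $remoteUrl);   // lets relative stylesheet/image URLs resolve

$head = $doc->getElementsByTagName('head')->item(0);
if ($head !== null) {
    $head->insertBefore($base, $head->firstChild);
}

$fixedHtml = $doc->saveHTML();             // this is what gets dumped into the iframe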
Thanks -- let me know if I can clarify anything.
If I understand correctly, you are trying to pull the HTML of a complete web page and display it under your domain, in your HTML. This is always going to be tricky: a lot of JavaScript will break, relative URLs will be wrong, and, as you mentioned, styles as well. You're probably also changing the dimensions the page is displayed in. These can all be worked around, but you're going to be fighting an uphill battle with each new site, or whenever a current site changes its design.
I'd probably take a different approach to the problem. You might want to write a browser plugin as the interface to the external web site instead. Then your applet can sit on top of the functional and (hopefully) tested site, and you can focus on what you need to do for your applet rather than a never-ending list of fiddly HTML issues.
I am trying to do a similar thing. It is very difficult to preserve the formatting, and the JS scripts in the page complicate things further. I finally gave up on the idea of completely reproducing the original format and settled on a workaround:
1. Select only the headers, links, lists, and paragraphs you are interested in.
2. Prepend your own site's domain path to the links.
3. Wrap the headers, links, etc. in your own classes.
4. Display the result.
In your case you also want to select text and store it, which is another topic. What I did was parse the HTML at two levels, which makes the selection easy. Keep in mind that IE and Firefox/Chrome need to be handled separately. A rough sketch of the extract-and-rewrap steps is below.
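This sketch assumes the fetched page is already in $remoteHtml; the class names, the XPath list, and the $siteBase value are only placeholders:

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($remoteHtml);               // page fetched with cURL
libxml_clear_errors();

$xpath    = new DOMXPath($doc);
$siteBase = 'http://www.example.com';      // the original site's domain

// Step 2: make relative links absolute; step 3: tag them with your own class.
foreach ($xpath->query('//a[@href]') as $a) {
    $href = $a->getAttribute('href');
    if ($href !== '' && $href[0] === '/') {
        $a->setAttribute('href', $siteBase . $href);
    }
    $a->setAttribute('class', 'study-link');
}

// Steps 1 and 4: keep only the elements you care about and display them.
$out = '';
foreach ($xpath->query('//h1|//h2|//h3|//p|//ul|//ol') as $node) {
    $node->setAttribute('class', 'study-block');
    $out .= $doc->saveHTML($node);
}

echo '<div class="studyContent">' . $out . '</div>';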
I am writing documentation about code I have written, mainly in PHP, though I will write about other languages too. I am wondering what the easiest way is to display code within a Word document. I could just paste a screenshot of a Notepad++ window, but I would like an easy way to include code in Microsoft Word without having to take a new screenshot every time I make a change. I am looking for something that lets me edit the code within Word (without it being functional, obviously) and that gives some sort of syntax highlighting so that, as in Notepad++, it is more readable.
You don't need to print screen in Notepad++. You can export/copy the text as RTF and preserve syntax highlighting and formatting.
I'm not on my PC at the moment, but the option is either under the TextFX menu, or the Plugins menu.
Works very nicely.
Edit:
In that menu, press 'Copy RTF to Clipboard', and you can paste into Word.
Instead of using Word to document your code, you could check out a document markup language called LaTeX.
It allows for easy documentation of code (and math) and is therefore a really good tool for creating scientific reports.
http://www.latex-project.org/
Here is a basic tutorial on how it works:
http://www.youtube.com/watch?v=SoDv0qhyysQ
(This YouTube video explains the basics.)
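For example, a minimal LaTeX sketch using the listings package to embed PHP with syntax highlighting (the styling choices are just a starting point):

\documentclass{article}
\usepackage{listings}
\usepackage{xcolor}

\lstset{
  language=PHP,
  basicstyle=\ttfamily\small,
  keywordstyle=\color{blue},
  commentstyle=\color{gray},
  numbers=left
}

\begin{document}

\begin{lstlisting}
<?php
// Code pasted here is typeset, not executed.
function greet($name) {
    return "Hello, " . $name;
}
\end{lstlisting}

\end{document}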
I have a bunch of big txt files (game walkthroughs) that I need translating from English to French. My first instinct was to host them on a server and use a PHP script to automate the translation process by doing a file_get_contents() and some URL manipulation to get the translated text. Something like:
http://translate.google.com/translate?hl=fr&sl=en&u=http://mysite.com/faq.txt
I found it poses two problems: 1) there are frames, and 2) the frame src values are relative (i.e. src="/translate_c?...."), so nothing loads.
Is there any way to fetch pages translated via Google in PHP (without using their AJAX API as it's really not suitable here)?
Use cURL to get the resulting page and then parse it.
Instead of using the regular translate URL which has frames, use the src of the frame:
http://translate.googleusercontent.com/translate_c?hl=<TARGET LANGUAGE>&sl=<SOURCE LANGUAGE>&tl=<TARGET LANGUAGE>&u=http://<URL TO TRANSLATE>&rurl=translate.google.com&twu=1&usg=ALkJrhhxPIf2COh7LOgXGl4jZdEBNutZAg
For example to translate the page http://chaimchaikin.za.net/ from English to Afrikaans:
http://translate.googleusercontent.com/translate_c?hl=en&sl=en&tl=af&u=http://chaimchaikin.za.net/&rurl=translate.google.com&twu=1&usg=ALkJrhhxPIf2COh7LOgXGl4jZdEBNutZAg
This will open up only a "frameless" page of the translation.
You may want to experiment to find the codes for the languages you need.
Also bear in mind that Google may add scripts to the translation (for example, to show the original text on hover).
EDIT: On examining the code, it appears there is a lot of JavaScript mixed in with the translation. You may need to find a way to strip it out.
EDIT: Further examination shows that the trailing "usg=ALkJr..." parameter seems to change every time. Maybe first run a request against the regular Google Translate page (e.g. http://translate.google.com/translate?hl=fr&sl=en&u=http://mysite.com/faq.txt), then find and parse the "usg=.." part and use it for your next request to the "frameless" page (http://translate.googleusercontent.com/translate_c?...).
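A rough sketch of that two-step approach in PHP (Google's parameters change often, so treat the URL format and the usg extraction as assumptions):

function fetchPage($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}

// 1. Request the regular (framed) translate page.
$framed = fetchPage('http://translate.google.com/translate?hl=fr&sl=en&u=http://mysite.com/faq.txt');

// 2. Pull the usg token out of the frame src, then request the frameless page.
if (preg_match('~usg=([A-Za-z0-9_-]+)~', $framed, $m)) {
    $frameless = fetchPage(
        'http://translate.googleusercontent.com/translate_c?hl=fr&sl=en&tl=fr'
        . '&u=http://mysite.com/faq.txt&rurl=translate.google.com&twu=1&usg=' . $m[1]
    );
    // $frameless now holds the translated HTML, ready to parse/clean.
}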
Limiting the allowed tags in a WYSIWYG editor by removing toolbar buttons is completely insufficient, because I can still copy a page's content and paste it into the editor, and... it will accept it!
I want some way to limit the allowed tags. For example, I love the traditional way Stack Overflow limits them with *** and quotes, but I don't want that style; I still like the WYSIWYG's real-time editing.
Note: if that's impossible, how can I do it in the PHP script that receives the HTML?
Thanks
You could use regular expressions to process the tags you would like to remove. FCKeditor does something very similar to your requirement when it pastes clipboard contents as unformatted text. As it is open source, I think you can get some inspiration from it, or even use it if it fits your requirements.
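(If you do the filtering server-side, as the question's note suggests, PHP's built-in strip_tags() with a whitelist is a simpler starting point than hand-rolled regexes; the tag list below is just an example.)

// Keep only a small set of formatting tags.
$allowed = '<p><b><i><em><strong><a><ul><ol><li><blockquote>';
$clean   = strip_tags($_POST['content'], $allowed);
// Note: strip_tags() keeps attributes (e.g. onclick) on allowed tags,
// so pair it with something like HTML Purifier (next answer) for untrusted input.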
Use HTML Purifier to filter the submitted content. I wouldn't recommend allowing submitted content without a tool like that.
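For example (a minimal sketch; the install path and the allowed-tag list are assumptions, not HTML Purifier defaults):

require_once 'vendor/autoload.php';        // e.g. after "composer require ezyang/htmlpurifier"

$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.Allowed', 'p,b,i,em,strong,a[href],blockquote,ul,ol,li');

$purifier = new HTMLPurifier($config);
$clean    = $purifier->purify($_POST['content']);   // anything not whitelisted is stripped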