PHP - extract text from HTML, translate and put it back


I'm using an API to translate my blog, but it sometimes mangles my HTML, which leaves me with extra work to fix everything.
What I'm now trying to do is extract the text content from the HTML, translate it, and put it back where it was.
I first tried to do this with preg_replace, replacing every tag with a placeholder like ##a_number## and then reverting to the original tag once the text had been translated. Unfortunately it's very difficult to manage because every tag has to be replaced by a unique value.
I then tried "simple html dom", which can be found here:
http://simplehtmldom.sourceforge.net/manual.htm
$html = str_get_html($content);
$str = $html;
$ret = $html->find('div');
foreach ($ret as $key=>$value)
{
echo $value;
}
This way I get all the text, but there is still some HTML in each value (div inside div), and I don't know how to put the translated text back into the original object. The structure of this object is so complex that displaying it crashes my browser.
I'm running out of options, and there are probably more straightforward ways of doing this. What I'd like is an object or array containing all the HTML on one side and all the text on the other. I would loop through the text to get it translated and then merge everything back, to avoid breaking the HTML.
Do you see better options to achieve this?
thanks
Laurent

For example, I have the following HTML, where all the words are lowercase:
<div>
<h2>page not found!</h2>
<p>go to home page or use the search.</p>
</div>
My task is to convert text to capitalized words. To solve it, I fetch all text nodes and convert them using the ucwords function (of course, you should use your translation function instead of it).
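libxml_use_internal_errors(true); // keep libxml parse warnings out of the output
$dom = new DOMDocument();
// The two flags stop libxml from adding <html>/<body> wrappers and a doctype
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
// Visit every text node; the tags themselves are never touched
foreach ($xpath->query('//text()') as $text) {
    // Skip whitespace-only nodes (the strict comparison also keeps a lone "0")
    if (trim($text->nodeValue) !== '') {
        $text->nodeValue = ucwords($text->nodeValue);
    }
}
echo $dom->saveHTML();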
The above outputs the following:
<div>
<h2>Page Not Found!</h2>
<p>Go To Home Page Or Use The Search.</p>
</div>
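The question asks for the text on one side and the HTML on the other, so here is a minimal sketch of the same text-node idea done in two passes: collect the text nodes into an array, translate the whole array, then write the results back. translate_batch() is a made-up stand-in for whatever translation API you call (here it just uppercases), and the sample markup is only an illustration. Non-ASCII input may also need an explicit encoding hint for loadHTML, which this sketch ignores.

```php
<?php
// Hypothetical stand-in for a real translation API call: takes an array of
// strings and returns the translated strings in the same order.
function translate_batch(array $texts): array
{
    return array_map('strtoupper', $texts);
}

$html = '<div><h2>page not found!</h2><p>go to <a href="/">home page</a> or use the search.</p></div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// Collect the non-blank text nodes; the markup stays inside $dom.
$nodes = [];
$texts = [];
foreach ((new DOMXPath($dom))->query('//text()[normalize-space(.) != ""]') as $node) {
    $nodes[] = $node;
    $texts[] = $node->nodeValue;
}

// Translate everything in one call, then write each result back in place.
foreach (translate_batch($texts) as $i => $translated) {
    $nodes[$i]->nodeValue = $translated;
}

$result = trim($dom->saveHTML());
echo $result, "\n";
```

Because only text nodes are rewritten, the link and its href attribute come back untouched around the translated text.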

<div>
<p>
This text is for translating<br>
Next line of text
</p>
</div>
What if you explode the HTML string into an array, splitting on "<"? For the HTML above, this results in the following array:
Array
(
[0] =>
[1] => div>
[2] => p>
This text is for translating
[3] => br>
Next line of text
[4] => /p>
[5] => /div>
)
Then split every array item on ">". The first part is the tag. The second part, if there is one, is content to translate.
When the translating is done, you reverse the process by gluing the array items back together.
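A rough sketch of that split-and-glue idea, assuming the text and attribute values never contain a literal "<" or ">" (if they might, a DOM parser is the safer route). translate_text() is a hypothetical placeholder for the real translation call.

```php
<?php
// Hypothetical stand-in for the translation call.
function translate_text(string $text): string
{
    return strtoupper($text);
}

$html = "<div><p>This text is for translating<br>Next line of text</p></div>";

$out = '';
foreach (explode('<', $html) as $i => $piece) {
    if ($i === 0) {
        // Anything before the first "<" is plain text.
        $out .= trim($piece) === '' ? $piece : translate_text($piece);
        continue;
    }
    // Split "tag>text" into the tag and whatever text follows it.
    $parts = explode('>', $piece, 2);
    if (count($parts) < 2) {
        // Stray "<" with no closing ">": put it back unchanged.
        $out .= '<' . $piece;
        continue;
    }
    $out .= '<' . $parts[0] . '>';
    // Whitespace-only text is copied as-is so the layout survives.
    $out .= trim($parts[1]) === '' ? $parts[1] : translate_text($parts[1]);
}
echo $out, "\n";
```

The tags are copied through verbatim and only the text between them goes to the translator.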

Related

Remove all except inside tag

How can I remove everything from a page except the text inside <p> tags?
Page:
This is text.
<div class="text">This is text in 'div' tag</div>
<p>This is text in 'p' tag</p>
Expected result:
This is text in 'p' tag
Greetings.

Basically, you'll have to parse the markup. PHP comes with a good parser in the form of the DOMDocument class, so that's really quite easy:
$dom = new DOMDocument;
$dom->loadHTML($htmlString);
Next, get all p tags:
$paragraphs = $dom->getElementsByTagName('p');
This method returns a DOMNodeList object, which implements the Traversable interface, so you can loop over it with foreach; each item is a DOMNode instance (DOMElement in this case):
$first = $paragraphs->item(0);//or $paragraphs[0] even
foreach ($paragraphs as $p) {
echo $p->textContent;//echo the inner text
}
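If you only want the paragraph elements that do not contain child elements, check for descendant elements rather than calling hasChildNodes(), which also counts text nodes and is therefore true for any paragraph that contains text:
foreach ($paragraphs as $p) {
    // '*' matches every descendant element, so length 0 means text only
    if ($p->getElementsByTagName('*')->length === 0) {
        echo $p->textContent; // or $p->nodeValue
    }
}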
A closely related answer with some more links/info: How to split an HTML string into chunks in PHP?

You can easily do this with the native php strip_tags function like so:
strip_tags("<p>This is text in 'p' tag</p>");
This returns "This is text in 'p' tag", as expected. NOTE: on its own this is only useful once you have already isolated the markup you care about, because strip_tags removes the tags but keeps the text inside them. To drop other elements together with their text (e.g. the div tag), you still need to cut them out first, for example with a little bit of dirty RegExp. The function takes one required argument, the string to strip tags from, and an optional second argument listing the allowed tags, which will not be stripped. For more information, see the strip_tags page in the PHP manual. I hope you got the idea :)
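To make the note above concrete, here is a small sketch run against the page from the question. It shows that strip_tags keeps the text of every element, with or without the allowed-tags argument, so it cannot isolate the paragraph by itself.

```php
<?php
$page = "This is text.\n<div class=\"text\">This is text in 'div' tag</div>\n<p>This is text in 'p' tag</p>";

// No second argument: every tag is removed, but all of the text stays.
$allText = strip_tags($page);

// Second argument: the listed tags are kept, all other tags are removed.
$keepP = strip_tags($page, '<p>');

echo $allText, "\n---\n", $keepP, "\n";
```

In both results the loose text and the div's text are still present, which is why the DOM-based answer is the better fit for this question.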

simple_html_dom find all elements that ONLY contain certain text

I have:
<span>something or other</span>
<b>blarg</b>
<b>blarg and stuff</b>
<span>blarg</span>
<em>wakka wakka</em>
<em>wakka blarg</em>
<em>blarg</em>
and I just want to get the elements that ONLY contain "blarg" and no other text, so:
<b>blarg</b>
<span>blarg</span>
<em>blarg</em>
The important issue here is that I'm trying to check if blarg exists within one element alone on the page or not. I've had some general luck with regex but I'd rather do it with simple_html_dom so that I can look at child and sibling elements as well.
Does anyone know what is the simplest way to do this with simple_html_dom?

One way to do it is to go through every element and test whether its text is exactly 'blarg'.
Here's a working example:
$text = '<span>something or other</span>
<b>blarg</b>
<b>blarg and stuff</b>
<span>blarg</span>
<em>wakka wakka</em>
<em>wakka blarg</em>
<em>blarg</em>';
echo "<div>Original Text: <xmp>$text</xmp></div>";
$html = str_get_html($text);
// Find all elements
$tags = $html->find('*');
foreach ($tags as $key => $tag) {
// If the trimmed text of the tag is exactly 'blarg'
if (strcmp(trim($tag->plaintext),'blarg') == 0) {
echo "<div> 'blarg' found in \$tags[$key]: <xmp>".$tag->outertext."</xmp></div>";
}
}
I don't know what you want to do with the results, but this may be a start :)
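If you are not tied to simple_html_dom, the same exact-text test can be sketched with PHP's built-in DOM extension and an XPath expression. This is a different technique from the answer above, not a simple_html_dom feature. One caveat: a wrapper element whose only text is "blarg" (for example a div around a single b) would match as well.

```php
<?php
$text = '<span>something or other</span>
<b>blarg</b>
<b>blarg and stuff</b>
<span>blarg</span>
<em>wakka wakka</em>
<em>wakka blarg</em>
<em>blarg</em>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($text);
$xpath = new DOMXPath($dom);

// normalize-space(.) is the element's full text with outer whitespace trimmed,
// so only elements whose entire text is "blarg" are selected.
$matches = [];
foreach ($xpath->query('//*[normalize-space(.) = "blarg"]') as $el) {
    $matches[] = $dom->saveHTML($el);
}
print_r($matches);
```

Each match is a regular DOMElement, so parentNode, previousSibling and nextSibling are available for looking at the surrounding elements.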

Extract Image SRC from string using preg_match_all

I have a string of data that is set as $content, an example of this data is as follows
This is some sample data which is going to contain an image in the format <img src="http://www.randomdomain.com/randomfolder/randomimagename.jpg">. It will also contain lots of other text and maybe another image or two.
I am trying to grab just the <img src="http://www.randomdomain.com/randomfolder/randomimagename.jpg"> and save it as another string for example $extracted_image
I have this so far....
if( preg_match_all( '/<img[^>]+src\s*=\s*["\']?([^"\' ]+)[^>]*>/', $content, $extracted_image ) ) {
$new_content .= 'NEW CONTENT IS '.$extracted_image.'';
All it is returning is...
NEW CONTENT IS Array
I realise my attempt is probably completely wrong, but can someone tell me where I am going wrong?

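Your first problem is that preg_match_all (http://php.net/manual/en/function.preg-match-all.php) places an array of arrays into its third argument, so you should be outputting individual items from it. $extracted_image[0] holds all the full <img> tags and $extracted_image[1] holds the captured src values, so try $extracted_image[0][0] to start.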

You need to use a different function, if you only want one result:
preg_match() returns the first and only the first match.
preg_match_all() returns an array with all the matches.
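A short sketch of what each function puts into its third argument, using the pattern from the question. The sample string and the second image file name are made up for the illustration.

```php
<?php
$content = 'Text <img src="http://www.randomdomain.com/randomfolder/randomimagename.jpg"> and '
         . '<img src="http://www.randomdomain.com/randomfolder/second.jpg"> end.';
$pattern = '/<img[^>]+src\s*=\s*["\']?([^"\' ]+)[^>]*>/';

// preg_match_all: $all[0] lists every full match, $all[1] every captured src.
preg_match_all($pattern, $content, $all);

// preg_match: stops at the first match; $first[0] is the tag, $first[1] the src.
preg_match($pattern, $content, $first);

echo 'First tag: ', $all[0][0], "\n";
echo 'All srcs:  ', implode(', ', $all[1]), "\n";
echo 'First src: ', $first[1], "\n";
```

Concatenating $all itself into a string is what produced "NEW CONTENT IS Array" in the question.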

Using regex to parse valid html is ill-advised. Because there can be unexpected attributes before the src attribute, because non-img tags can trick the regular expression into false-positive matching, and because attribute values can be quoted with single or double quotes, you should use a dom parser. It is clean, reliable, and easy to read.
Code:
$string = <<<HTML
This is some sample data which is going to contain an image
in the format <img src="http://www.randomdomain.com/randomfolder/randomimagename.jpg">.
It will also contain lots of other text and maybe another image or two
like this: <img alt='another image' src='http://www.example.com/randomfolder/randomimagename.jpg'>
HTML;
$srcs = [];
$dom=new DOMDocument;
$dom->loadHTML($string);
foreach ($dom->getElementsByTagName('img') as $img) {
$srcs[] = $img->getAttribute('src');
}
var_export($srcs);
Output:
array (
0 => 'http://www.randomdomain.com/randomfolder/randomimagename.jpg',
1 => 'http://www.example.com/randomfolder/randomimagename.jpg',
)

PHP text to array and with key

I don't know RegExp well, and I have not succeeded in splitting a string into an array.
I have a string like:
<h5>some text in header</h5>
some other content, that belongs to header <p> or <a> or <img> inside.. not important...
<h5>Second text header</h5>
So what I am trying to do is split the string into an array where the KEY is the text from a header and the CONTENT is everything that follows it up to the next header, like:
array("some text in header" => "some other content, that belongs to header...", ...)

I would suggest looking at the PHP DOM extension: http://php.net/manual/en/book.dom.php. You can read or create a DOM from a document.

I've used this one and enjoyed it:
http://simplehtmldom.sourceforge.net/
You could do it with a regex as well, something like this:
/<h5>(.*?)<\/h5>(.*?)<h5>/s
But this only finds the first occurrence; you'll have to cut the string to get the next one.
Any way you cut it, I don't see a one-liner for you. Sorry.
Here's a crummy explode-based version:
$res = [];
foreach (explode("<h5>", $html) as $chunk) {
    // Skip anything before the first <h5> (it has no closing </h5>)
    if (strpos($chunk, "</h5>") === false) {
        continue;
    }
    list($key, $val) = explode("</h5>", $chunk, 2);
    $res[$key] = trim($val);
}

Don't parse HTML with preg_match.
Use PHP's DOMDocument class instead.
Example:
<?php
$html= "<h5>some text in header</h5>
some other content, that belongs to header <p> or <a> or <img> inside.. not important...
<h5>Second text header</h5>";
// a new DOM object
$dom = new DOMDocument('1.0', 'utf-8');
// ask the parser to discard whitespace-only nodes (set before loading; it cannot change a document that is already parsed)
$dom->preserveWhiteSpace = false;
// load the HTML into the object
$dom->loadHTML($html);
$hFive = $dom->getElementsByTagName('h5');
echo $hFive->item(0)->nodeValue; // you can get the other h5 elements by changing the index
?>
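The example above only prints the first header, so here is a hedged sketch of how the header => content array from the question could be built with DOMDocument. It assumes the h5 elements are top-level siblings of their content; the wrapping div and the sample string are additions for the illustration.

```php
<?php
$html = '<h5>some text in header</h5>
some other content with <a href="#">a link</a> inside
<h5>Second text header</h5>
more content';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
// Wrap in a div so every top-level node shares one known parent.
$dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$result  = [];
$current = null;
foreach ($dom->documentElement->childNodes as $node) {
    if ($node->nodeName === 'h5') {
        // A new header starts a new array entry.
        $current = trim($node->textContent);
        $result[$current] = '';
    } elseif ($current !== null) {
        // Everything up to the next header belongs to the current entry.
        $result[$current] .= $dom->saveHTML($node);
    }
}
$result = array_map('trim', $result);
print_r($result);
```

Content that appears before the first header is skipped, and two headers with the same text would overwrite each other, so adjust if that matters for your data.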

HTML DOM: How to get elements without losing children?

I'm trying to perform a preg_replace on the text in an HTML string. I want to avoid replacing the text within tags, so I'm loading the string as a DOM element and grabbing the text within each node. For example, I have this list:
<ul>
<li>Boxes 1-3: 1925 - 1928 <em>(A-Ma)</em></li>
<li>Boxes 4-6: 1928 <em>(Mb-Z)</em> - 1930 <em>(A-Wi)</em></li>
<li>Boxes 7-9: 1930 <em>(Wo-Z)</em>- 1932 <em>(A-Fl)</em></li>
</ul>
I want to be able to highlight the character "1", or the letter "i", without disturbing the links or list item tag. So I grab each list item and get its value to perform the replace on:
$invfile = [string of the unordered list above]
$invcontents = new DOMDocument;
$invcontents->loadHTML($invfile);
$inv_listitems = $invcontents->getElementsByTagName('li');
foreach ($inv_listitems as $f) {
$f->nodeValue = preg_replace($to_highlight, "<span class=\"highlight\">$0</span>", $f->nodeValue);
}
echo html_entity_decode($invcontents->saveHTML());
The problem is that when I grab the node values, the child nodes inside the list item are lost. If I print out the original string as-is, the <a>, <em>, etc. tags are all there. But when I run the script, it prints out without the links or any formatting tags. For example, if my $to_highlight pattern matches the string "Boxes", the list becomes:
<ul>
<li><span class="highlight">Boxes</span> 1-3: 1925 - 1928 (A-Ma)</li>
<li><span class="highlight">Boxes</span> 4-6: 1928 (Mb-Z) - 1930 (A-Wi)</li>
<li><span class="highlight">Boxes</span> 7-9: 1930 (Wo-Z)- 1932 (A-Fl)</li>
</ul>
How can I get the text without losing the tags inside?

The problem here is that you're operating on the entire element: reading nodeValue on an <li> returns the combined text of all its descendants, and assigning to it replaces every child node with a single text node, which is why the inner tags disappear.
If the structure above is always the same you can do something like
$new_html = preg_replace("##", "", $f->firstChild->nodeValue);
In reality, the best way to go about it is to unset the node value and create an entirely new element and append it.
(Consider this pseudocode)
$inv_listitems = $invcontents->getElementsByTagName('li');
foreach ($inv_listitems as $f) {
$span = $invcontents->createElement("span");
$span->setAttribute("class", "highlight");
$span->nodeValue = $f->firstChild->nodeValue; // DOMElement has no item(); firstChild is the leading text node
$f->appendChild($span);
}
echo $invcontents->saveHTML();
You'll have to do some matching in there, as well as unsetting the nodeValue of $f, but hopefully this makes it a little clearer.
Also, don't set HTML in nodeValue directly, because it will be escaped as if htmlentities() had been run on it and will show up as literal text. That is why I create a new element above. If you absolutely have to insert an HTML string, create a DOMDocumentFragment object and use its appendXML() method.

You're better off operating only on the text nodes:
$x = new DOMXPath($invcontents);
foreach ($x->query('//li/text()') as $textnode) {
    // replace the text node with a list of plain text nodes and your highlighting span
}
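One possible way to fill in that loop, as a sketch rather than the definitive answer: split each text node around the matches and swap it for a fragment of plain text nodes and highlight spans. It assumes a literal search term held in a hypothetical $term variable instead of the question's $to_highlight regex, and it uses //li//text() so that text inside the em elements is covered too.

```php
<?php
$invfile = '<ul><li>Boxes 1-3: 1925 - 1928 <em>(A-Ma)</em></li></ul>';
$term    = 'Boxes'; // hypothetical literal search term

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($invfile, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath   = new DOMXPath($dom);
$pattern = '/(' . preg_quote($term, '/') . ')/';

// Copy to an array first: the loop changes the tree it is walking.
foreach (iterator_to_array($xpath->query('//li//text()')) as $textnode) {
    // With DELIM_CAPTURE the matches sit at the odd indexes of $parts.
    $parts = preg_split($pattern, $textnode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
    if (count($parts) < 2) {
        continue; // nothing to highlight in this text node
    }
    $fragment = $dom->createDocumentFragment();
    foreach ($parts as $i => $part) {
        if ($part === '') {
            continue;
        }
        if ($i % 2 === 1) {
            $span = $dom->createElement('span');
            $span->setAttribute('class', 'highlight');
            $span->appendChild($dom->createTextNode($part));
            $fragment->appendChild($span);
        } else {
            $fragment->appendChild($dom->createTextNode($part));
        }
    }
    $textnode->parentNode->replaceChild($fragment, $textnode);
}
$result = trim($dom->saveHTML());
echo $result, "\n";
```

Since real elements are created instead of an HTML string, there is no need for html_entity_decode() on the output, and the em tags stay where they were.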

I always use XPath for this kind of task. It gives you more flexibility.
This example handles the following XML:
<mainlevel>
<toplevel>
<detaillevel key=...>
<xmlvalue1></xmlvalue1>
<xmlvalue2></xmlvalue2>
<sublevel key=...>
<xmlsubvalue1></xmlsubvalue1>
<xmlsubvalue2></xmlsubvalue2>
</sublevel>
</detaillevel>
</toplevel>
</mainlevel>
To parse this:
$xpath = new DOMXPath($xmlDoc);
$mainNodes = $xpath->query("/mainlevel/toplevel/detaillevel");
foreach ($mainNodes as $subNode) {
    $parameter1 = $subNode->getAttribute('key');
    $parameter2 = $subNode->getElementsByTagName("xmlvalue1")->item(0)->nodeValue;
    $parameter3 = $subNode->getElementsByTagName("xmlvalue2")->item(0)->nodeValue;
    foreach ($subNode->getElementsByTagName("sublevel") as $detailNode) {
        // separate variable names so the outer values are not overwritten
        $subParameter1 = $detailNode->getAttribute('key');
        // xmlsubvalue1 and xmlsubvalue2 are child elements, not attributes
        $subParameter2 = $detailNode->getElementsByTagName("xmlsubvalue1")->item(0)->nodeValue;
        $subParameter3 = $detailNode->getElementsByTagName("xmlsubvalue2")->item(0)->nodeValue;
    }
}
