Localizing an HTML document (in hindsight) - PHP

I am building a web application in PHP, which I have decided (far along in the process) to make available in different languages.
My question is this:
I do not want to wade through all the HTML code in the template files to look for the "words" that I need to replace with dynamically generated lang variables.
Is there a tool that can highlight the "words" used in the HTML to make my task easier,
so that when I scroll down the HTML doc, I can easily see where the language "words" are?
Normally when I create an app, I add comments as I code, like below:
<label><!--lang-->Full Name</label>
<input type="submit" value="<!--lang-->Save Changes" name="submit">
so that when I am done, I can run through and easily identify the bits I need to add dynamic variables to. Unfortunately, I am almost through with the app (lots of HTML template files) and I have not done so this time.
I use a template engine (tinybutstrong) so my HTML is pretty clean (i.e. with no PHP in it)

You can do this, relatively easily even, using DOMDocument to parse the markup, DOMXPath to query for all the comment nodes, and then access each node's parent, extract the nodeValue and list those values as "strings to translate":
$dom = new DOMDocument;
$dom->loadHTMLFile($file); // or loadHTML() if you're working with an HTML string
$xpath = new DOMXPath($dom); // get XPath
$comments = $xpath->query('//comment()'); // get all comment nodes
// this array will contain all to-translate texts
$toTranslate = array();
foreach ($comments as $comment)
{
    if (trim($comment->nodeValue) == 'lang')
    { // trim to ignore surrounding spaces; use stristr(...) !== false if you need case-insensitive matching
        $parent = $comment->parentNode; // get parent node
        $toTranslate[] = $parent->textContent; // get parent node's text content
    }
}
var_dump($toTranslate);
Note that this can't handle comments used in tag attributes. Using this simple script, you will be able to extract those strings that need to be translated in the "regular" markup. After that, you can write a script that looks for <!--lang--> in tag attributes... I'll have a look if there isn't a way to do this using XPath, too. For now, this should help you to get started, though.
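As a rough sketch of that attribute case: assuming the marker always sits at the very start of the attribute value (as in the submit button from the question), an XPath query over attribute nodes can pick those values up as well. The sample markup and the $attrStrings name below are made up for illustration.

```php
<?php
// Hypothetical sample markup, modelled on the snippet in the question.
$html = '<form><label><!--lang-->Full Name</label>'
      . '<input type="submit" value="<!--lang-->Save Changes" name="submit"></form>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();

$xpath  = new DOMXPath($dom);
$marker = '<!--lang-->';

// Select every attribute whose value starts with the marker.
$attrStrings = array();
foreach ($xpath->query('//@*[starts-with(., "' . $marker . '")]') as $attr) {
    // Strip the marker so only the translatable text remains.
    $attrStrings[] = substr($attr->value, strlen($marker));
}
var_dump($attrStrings);
```

This only covers markers at the start of a value; a marker in the middle of an attribute would need contains() and a different way of cutting the string.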
If you have no comments other than <!--lang--> in your markup, then you could simply use an XPath expression that selects the parents of those comment nodes directly:
$commentsAndInput = $xpath->query('(//input|//option)[@value]|//comment()/..');
foreach ($commentsAndInput as $node)
{
    if ($node->tagName !== 'input' && $node->tagName !== 'option')
    { // get the textContent of the node
        $toTranslate[] = $node->textContent;
    }
    else
    { // get value attribute's value:
        $toTranslate[] = $node->getAttributeNode('value')->value;
    }
}
The xpath expression explained:
//: tells xpath to search for nodes that match the rest of the criteria anywhere in the DOM
input: literal tag name: //input looks for input tags anywhere in the DOM tree
[@value]: the mentioned tag only matches if it has a value attribute
|: OR. //a|//input[@type="button"] matches links OR buttons
//option[@value]: same as above: options with value attributes are matched
(//input|//option): groups both expressions, the [@value] applies to all matches in this selection
//comment(): selects comments anywhere in the dom
/..: selects the parent of the current node, so //comment()/.. matches the parent, containing the selected comment node.
Keep working at the XPath expression to get all of the content you need to translate

Related

PHP regex, tag which is not in another tag

I need a regex that matches the content of a <cherry> tag which is not inside another tag. Unfortunately, I can't use the PHP DOM parser because the content of the tag sometimes includes very special characters.
This is an example of the incoming input:
<cherry>test</cherry>
<banana>
<cherry>test</cherry>
some text
</banana>
This is my current regex, but it also matches the <cherry> tag inside the <banana> tag:
(<cherry>)(.*?)(<\/cherry>)
How can I exclude the occurrence in other tags?
I have already tried a lot...
Why don't you make use of the DOMDocument class rather than a regex? Simply load your DOM and then use getElementsByTagName to get your tags. This way you can exclude any other tags which you don't want and only get those that you do.
Example
<?php
$xml = <<< XML
<?xml version="1.0" encoding="utf-8"?>
<books>
<book>Patterns of Enterprise Application Architecture</book>
<book>Design Patterns: Elements of Reusable Software Design</book>
<book>Clean Code</book>
</books>
XML;
$dom = new DOMDocument;
$dom->loadXML($xml);
$books = $dom->getElementsByTagName('book');
foreach ($books as $book) {
echo $book->nodeValue, PHP_EOL;
}
?>
Reading Material
DOMDocument
This assumes that you just need the contents of the math tags at top level and nothing else, and that you can't get them so far because the math tags contain invalid XML, so any XML parser gives up (as mentioned in the question and comments).
The clean approach would probably be to use some fault-tolerant XML parser (or a fault-tolerant mode), or to tidy up the input beforehand. However, these approaches might all "corrupt" the content.
The hacky and possibly dirty approach would be the following. It might very well have other issues, especially if the remaining XML is also invalid or your math tags are nested (this will make the XML parser fail in step 2):
1. Replace any <math>.*</math> (ungreedy) with a placeholder via preg_replace_callback or something similar. The placeholder should be unique; uniqid might help, but a simple counter is probably enough.
2. Parse the document with a common XML parser (wrapping it in some root tag as necessary).
3. Fetch all child nodes of the root node (or all root nodes) and see which ones were generated in step 1.
for example:
<math>some invalid xml</math>
<sometag>
<math>more invalid xml</math>
some text
</sometag>
replace with
$replacements = [];
$newcontent = preg_replace_callback(
    '/'.preg_quote('<math>','/').'(.*)'.preg_quote('</math>','/').'/siU',
    // capture $replacements by reference, otherwise the outer array stays empty
    function($hit) use (&$replacements) {
        $id = uniqid();
        $replacements[$id] = $hit[1];
        return '<math id="'.$id.'" />';
    },
    $originalcontent);
which will turn your content into:
<math id="1stuniqid" />
<sometag>
<math id="2nduniqid" />
some text
</sometag>
Now use the XML parser of your choice, select all root-level elements and look for /math/@id (my XPath is possibly just wrong, adjust as needed). The result should contain all the uniqids, which you can look up in your replacement array.
edit: some preg_quote problems fixed and used more standard delimiters.
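Here is a minimal end-to-end sketch of the three steps, under a few assumptions of mine: the rest of the document is well-formed, math tags are not nested, a simple counter is used for the ids, and the wrapper element is called root. The sample input and all variable names are made up for illustration.

```php
<?php
// Hypothetical input: the <math> bodies are deliberately not well-formed XML.
$originalcontent = "<math>a < b & c</math>\n"
    . "<sometag>\n<math>x << y</math>\nsome text\n</sometag>";

// Step 1: swap every <math>...</math> for an empty placeholder element.
$replacements = [];
$counter = 0;
$newcontent = preg_replace_callback(
    '/<math>(.*?)<\/math>/s',
    function ($hit) use (&$replacements, &$counter) {
        $id = 'm' . (++$counter);      // a simple counter instead of uniqid()
        $replacements[$id] = $hit[1];  // remember the raw content
        return '<math id="' . $id . '" />';
    },
    $originalcontent
);

// Step 2: wrap the fragment in an artificial root element and parse it.
$dom = new DOMDocument;
$dom->loadXML('<root>' . $newcontent . '</root>');

// Step 3: only placeholders that are direct children of the root count.
$xpath = new DOMXPath($dom);
$topLevel = [];
foreach ($xpath->query('/root/math/@id') as $attr) {
    $topLevel[] = $replacements[$attr->value];
}
var_dump($topLevel);
```

With this input, only the first math block ends up in $topLevel; the one inside <sometag> is replaced as well but filtered out by the XPath.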

Regular Expression - get tables from html string in PHP

I am trying to wrap all tables inside my content in a special div container to make them usable on mobile.
I can't wrap the tables before they are saved in the database of the custom CMS. I managed to get at the content before it's printed on the page, and I need to preg_replace all the tables there.
I do this, to get all tables:
preg_match_all('/(<table[^>]*>(?:.|\n)*<\/table>)/', $aFile['sContent'], $aMatches);
The problem is to get the inner part (?:.|\n)* to match everything that is inside the tags, without matching the ending tag. Right now the expression matches everything, even the ending tag of the table...
Is there a way to exclude the match for the ending tag?
You need to perform a non greedy match: /(<table[^>]*>(?:.|\n)*?<\/table>)/. Note the question mark: ?.
However, I would use a DOM parser for that:
$doc = new DOMDocument();
$doc->loadHTML($html);
$tables = $doc->getElementsByTagName('table');
foreach($tables as $table) {
$content = $doc->saveHTML($table);
}
While a DOM parser is already more convenient for extracting data from HTML documents, it is definitely the better solution if you are attempting to modify the HTML (as you said you are).
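For the actual goal from the question (wrapping each table in a div), a DOM-based version could look roughly like this. It is only a sketch: the sample content and the class name table-responsive are placeholders of mine, not something from the question.

```php
<?php
// Hypothetical content; in the question this would be $aFile['sContent'].
$html = '<p>Intro</p><table><tr><td>1</td></tr></table>'
      . '<p>Mid</p><table><tr><td>2</td></tr></table>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Copy the node list into an array before changing the tree.
$tables = iterator_to_array($doc->getElementsByTagName('table'));
foreach ($tables as $table) {
    $wrapper = $doc->createElement('div');
    $wrapper->setAttribute('class', 'table-responsive'); // placeholder class name
    $table->parentNode->replaceChild($wrapper, $table);  // put the div where the table was
    $wrapper->appendChild($table);                       // and move the table into it
}

// loadHTML() adds <html> and <body>; emit only the body's children again.
$body = $doc->getElementsByTagName('body')->item(0);
$wrapped = '';
foreach ($body->childNodes as $child) {
    $wrapped .= $doc->saveHTML($child);
}
echo $wrapped, PHP_EOL;
```

Nested tables would each get their own wrapper with this loop; if that is not wanted, the selection would have to be restricted to outermost tables.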
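You could use a lookahead if you don't want the end tag to be part of the match (keep the quantifier lazy, otherwise the match still runs to the last closing tag):
preg_match_all('/(<table[^>]*>(?:.|\n)*?(?=<\/table>))/', $aFile['sContent'], $aMatches);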

DOM xpath to find #text nodes and wrap in paragraph tag

I would like to find all root-level #text nodes (or those with div parents) which should be wrapped inside a <p> tag. In the following text there should be three (or even just two) final root <p> tags.
<div>
This text should be wrapped in a p tag.
</div>
This also should be wrapped.
<b>And</b> this.
The idea is to format the text nicer so that text blocks are grouped into paragraphs for HTML display. However, the following xpath I have been working out seems to fail to select the text nodes.
<?php
$html = '<div>
This text should be wrapped in a p tag.
</div>
This also should be wrapped.
<b>And</b> this.';
libxml_use_internal_errors(TRUE);
$dom = new DOMDocument;
$dom->loadHTML($html); // loadHTML() is an instance method; calling it statically fails on PHP 8
$xp = new DOMXPath($dom);
$xpath = '//text()[not(parent::p) and normalize-space()]';
foreach($xp->query($xpath) as $node) {
$element = $dom->createElement('p');
$node->parentNode->replaceChild($element, $node);
$element->appendChild($node);
}
print $dom->saveHTML();
OK, so let me rephrase my comment as an answer. If you want to match all text nodes, you should simply remove the //div part from your XPath expression. So it becomes:
//text()[not(parent::p) and normalize-space()]
Your scenario has many edge cases, and the word "should" adds to them. I assume you want the classic "a double line break starts a new paragraph" behaviour, but this time within a parent <div> (or other block elements) as well.
I would let the HTML parser do most of the work, but I would still work with text search and replace (next to XPath). So what follows is a bit hackish, but I think pretty stable:
First of all I would select all text-nodes that are of top-level or child of the said div.
(.|./div)/text()
This xpath is relative to an anchor element which is the <body> tag as it represents the root-tag of your HTML fragment when loaded into DOMDocument.
If child of a div then I would insert the starting paragraph at the very beginning.
Then in any case I would insert a break-mark (here in form of a comment) at each occurrence of the sequence that starts a new paragraph (that should be "\n\n" because of whitespace normalization, I might be wrong and if it doesn't apply, you would need to do the whitespace-normalization upfront to have this working transparently).
/* @var $result DOMText[] */
$result = $xp->query('(.|./div)/text()', $anchor);
foreach ($result as $i => $node)
{
    if ($node->parentNode->tagName == 'div')
    {
        $insertBreakMarkBefore($node, true);
    }
    while (FALSE !== $pos = strpos($node->data, $paragraphSequence))
    {
        $node = $node->splitText($pos + $paragraphSequenceLength);
        $insertBreakMarkBefore($node);
    }
}
These inserted break marks are only there to be replaced with an HTML <p> tag. An HTML parser will turn those into proper <p>...</p> pairs, so I can spare myself writing that algorithm (even though it might be interesting). This basically works like I once outlined in some other answer, but I can't find the link any longer:
After modifying the DOM tree, get the inner HTML of the <body> again.
Replace the inserted marks with "<p>" (here I add a class as well to make this visible).
Load the HTML fragment into the parser again to re-create the DOM with the proper <p>...</p> pairs.
Obtain the HTML again from the DOMDocument parser, which is now final.
These outlined steps in code (skipping some of the function definitions for a moment):
$needle = sprintf('%1$s<!--%2$s-->%1$s', $paragraphSequence, $paragraphComment);
$replace = sprintf("\n<p class=\"%s\">\n", $paragraphComment);
$html = strtr($innerHTML($anchor), array($needle . $needle => $replace, $needle => $replace));
echo "HTML afterwards:\n", $innerHTML($loadHTMLFragment($html));
As this shows, double sequences are replaced with a single one. The one at the end probably needs to be deleted as well (if applicable; you could also trim whitespace here).
The final HTML output:
<div>
<p class="break">
This text should be wrapped in a p tag.
</p>
</div>
<p class="break">
This also should be wrapped.
</p>
<p class="break">
<b>And</b> this.</p>
Some more post-processing for nicer output formatting can be useful, too. I think it is worth doing, as it will help you tweak the algorithm. (Full demo: whitespace normalization probably does not apply there, so use with care.)
You can do it with pure JavaScript if you wish:
var content = document.evaluate(
'//text()',
document,
null,
XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
null );
for (var i = 0; i < content.snapshotLength; i++) {
    console.log(content.snapshotItem(i).textContent);
}
I know it is not xpath but check this out:
PHP Simple HTML DOM Parser
http://simplehtmldom.sourceforge.net/
Features
An HTML DOM parser written in PHP5+ that lets you manipulate HTML in a very easy way!
Supports invalid HTML.
Find tags on an HTML page with selectors just like jQuery.
Extract contents from HTML in a single line.

How to get text between div tags that contain class, style etc attributes before id attribute. I need to use regular expression

Hi I'm using this regular expression for getting the text inside test
<div id = "test">text</div>
$regex = "#\<div id=\"test\"\>(.+?)\<\/div\>#s";
But if the scenario change for e.g.
<div class="testing" style="color:red" .... more attributes and id="test">text</div>
or
<div class="testing" ...some attributes... id="test".... some attributes....>text</div>
or
<div id="test" .........any number of attributes>text</div>
then the above regex will not be able to extract the text between the div tags. In the first case, more attributes are placed in front of the id attribute (i.e. id is the last attribute), and the above regex doesn't work. In the second case the id attribute sits between other attributes, and in the third case it is the first attribute of the div tag.
Can I have a regex that matches all three cases above, so as to extract the text between the div tags by specifying the ID only? I have to use regex only :( .
Please Help
Thank you....
I would strongly recommend an HTML parser to save yourself from the never-ending grief of trying to write a regular expression to parse HTML/XML.
I suggest you obtain that DOM element via xpath, the xpath expression for that element is:
//div[@class="testing"]
All this can be done with the PHP DOMDocument extension or, alternatively, with the SimpleXML extension. Both ship with PHP in 99.9% of installations, just like the regular expression extension. Some rough example code (demo):
$dom = new DOMDocument;
@$dom->loadHTML($html); // @ suppresses warnings about imperfect markup
echo simplexml_import_dom($dom)->xpath('//div[@class="testing"]')[0];
XPath is a specialized language for querying elements and data from XML documents, whereas regular expressions are a language for simpler strings.
Edit: Same for ID: http://codepad.viper-7.com/h1FlO0
//div[@id="test"]
I guess you understand quite quickly how these simple xpath expressions work.
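In case the demo link is not reachable, here is roughly what the id lookup can look like with DOMXPath instead of SimpleXML. The sample markup is taken from the first scenario in the question with the elided attributes left out; the variable names are mine.

```php
<?php
// Sample input: id="test" is the last attribute, as in the first scenario.
$html = '<div class="testing" style="color:red" id="test">text</div>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//div[@id="test"]');

// textContent is the text between the tags, wherever id sits in the tag.
$text = $nodes->length > 0 ? $nodes->item(0)->textContent : null;
echo $text, PHP_EOL;
```

The position of the id attribute inside the tag does not matter here, which is the part that is awkward to express in a regex.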
Here's the answer with DOM (kind of crudish but works)
$aPieceOfHTML = '<div class="testing" id="test" style="color:red">This is my text blabla</div>';
$doc = new DOMDocument();
$doc->loadHTML($aPieceOfHTML);
$div = $doc->getElementsByTagName("div");
$mytext = $div->item(0)->nodeValue;
echo $mytext;
Here's the Cthulhu way:
$regex = '/(?<=id\=\"test\"\>).*(?=\<\/div\>)/';
DISCLAIMER
By no means do I guarantee this will work in every case (far from it). In fact, this will fail if:
id="test" is not the last tag attribute
if there is a space (or anything) between id="test" and the closing >.
If the div tag is not properly closed </div>
If the tags are written in uppercase
If tag attributes are written in uppercase
I don't know... this will probably fail in more cases
I could try to write a more complex regex but I don't think I could come up with something much better than this. Besides, it kind of seems a waste of time when you have other tools built in PHP that can parse HTML so much better.
I don't know if you still need this, but the regex below works for all of the given scenarios in your question.
(!?(<.*?>)|[^<]+)\s*
https://regex101.com/r/DAObw0/1
The matching group can be accessed with:
const [_, group1, group2] = myRegex.exec(input)

parse content for appearance of a keyword inside h1, h2 and h3 heading tags

Given a block of content, I'm looking to create a function in PHP to check for the existence of a keyword or keyword phrase inside an h1-h3 header tags...
For example, if the keyword was "Blue Violin" and the block of text was...
You don't see many blue violins. Most violins have a natural finish.
<h1>If you see a blue violin, its really a rarity</h1>
I'd like my function to return:
The keyword phrase does appear in an h1 tag
The keyword phrase does not appear in an h2 tag
The keyword phrase does not appear in an h3 tag
You can use DOM and the following XPath for this:
/html/body//h1[contains(.,'Blue Violin')]
This would match all h1 elements inside the body element containing the phrase "Blue Violin" either directly or in a subnode. If it should only occur in the direct TextNode, change the . to text(). The results are returned in a DOMNodeList.
Since you only want to know if the phrase appears, you can use the following code:
$dom = new DOMDocument;
$dom->load('NewFile.xml');
$xPath = new DOMXPath($dom);
echo $xPath->evaluate('count(/html/body//h1[contains(.,"Blue Violin")])');
which will return the number of nodes matching this XPath. If your markup is not valid XHTML, you will not be able to use load or loadXML; use loadHTML or loadHTMLFile instead. In addition, the XPath will execute faster if you give it a direct path to the nodes. If you only have one h1, h2 and h3 anyway, substitute the //h1 with a direct path.
Note that contains is case-sensitive, so the above will not match anything due to the Mixed Case used in the search phrase. Unfortunately, DOM (or rather the underlying libxml) only supports XPath 1.0. I am not sure if there is an XPath function to do a case-insensitive search, but as of PHP 5.3, you can also use PHP inside an XPath, e.g.
$dom = new DOMDocument;
$dom->load('NewFile.xml');
$xpath = new DOMXPath($dom);
$xpath->registerNamespace("php", "http://php.net/xpath");
$xpath->registerPHPFunctions();
echo $xpath->evaluate('count(/html/body//h1[contains(php:functionString("strtolower", .),"blue violin")])');
so in case you need to match Mixed Case phrases or words, you can lowercase all text in the searched nodes before checking it with contains or use any other PHP function you may find useful here.
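Putting this together for h1 to h3, a sketch of the function asked for might look like the following. Instead of registering PHP functions it lowercases with the XPath 1.0 translate() function, which is a swapped-in technique and only folds ASCII letters. The function name keywordInHeadings is made up, and the keyword is assumed not to contain double quotes.

```php
<?php
// Hypothetical helper: reports for h1, h2 and h3 whether the keyword occurs.
function keywordInHeadings($html, $keyword)
{
    libxml_use_internal_errors(true); // plain text around the heading is fine
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($dom);

    $upper  = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';
    $lower  = 'abcdefghijklmnopqrstuvwxyz';
    $needle = strtolower($keyword); // assumption: no double quotes in the keyword

    $found = array();
    foreach (array('h1', 'h2', 'h3') as $tag) {
        // translate() maps upper-case ASCII letters to lower case before contains().
        $query = sprintf(
            'count(//%s[contains(translate(., "%s", "%s"), "%s")])',
            $tag, $upper, $lower, $needle
        );
        $found[$tag] = $xpath->evaluate($query) > 0;
    }
    return $found;
}

$content = "You don't see many blue violins. Most violins have a natural finish.\n"
         . "<h1>If you see a blue violin, its really a rarity</h1>";

$result = keywordInHeadings($content, 'Blue Violin');
foreach ($result as $tag => $hit) {
    printf("The keyword phrase %s appear in an %s tag\n", $hit ? 'does' : 'does not', $tag);
}
```

For non-ASCII keywords, the registerPHPFunctions() route shown above with a multibyte-aware function is probably the safer choice.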
Instead of including PHP functions into the class, you can, as well, simply convert the Xpath PHP object into a regular PHP array and then search directly using regular string searching functions from PHP: http://fsockopen.com/php-programming/your-final-stop-for-php-xpath-case-insensitive
