PHP: parsing HTML, getting PRE text and saving it to a file

I'm parsing an HTML file, getting the contents of a pre tag, and then saving them to a text file.
However, when I open the text file in Sublime or other text editors, the formatting is gone.
My question: how can I save the text in its original state inside the txt file?
The contents of the pre are below:
x4 x4
|---------------------|-|-------------------|--------------------|
|---------------------|-|-------------------|--------------------|
|----------2-0-0------|-|-------------------|--------------------|
|----------------1-0-0|-|-------------------|--------------------|
|3-0-1-3-0------------|0|1-3-1-3-1-3-1-0----|1-3-1-3-1-3-1-0---0-|
x4 x4
|------------------------|-------------|-------------------|
|------------------------|-------------|-------------------|
|------------------------|-------------|-------------------|
|------------------------|-------------|0--0033------------|
|1-3-1-3-1-3-1-0--0000--0|1-3-1-3-1-3-1|--------333~-335-0-|
x4 x4
|------------------------|---------------------|-|-------------|
|------------------------|---------------------|-|-------------|
|------------------------|----------2-0-0------|-|-------------|
|------------------------|----------------1-0-0|-|-------------|
|0--0000--0-1-3-1-3-1-3-1|3-0-1-3-0------------|0|1-3-1-3-1-3-1|
My code:
<?php
// example of how to use basic selector to retrieve HTML contents
include('simple_html_dom.php');
// get DOM from URL or file
$html = file_get_html('http://metaltabs.com/tab/10464/index.html');
foreach($html->find('title') as $e)
echo $e->innertext . '<br>';
$my_file = fopen("textfile.txt", "w") or die("Unable to open file!");
foreach($html->find('pre') as $e)
echo nl2br($e->innertext) . '<br>';
$txt = $e->innertext;
fwrite($my_file, $txt);
fclose($my_file);
?>

The problems with your parsing results are:
Line breaks are not preserved;
HTML entities are preserved.
To resolve the line break issue, you have to use ->load() instead of file_get_html:
$html = new simple_html_dom();
$data = file_get_contents( 'http://metaltabs.com/tab/10464/index.html' );
$html->load( $data , True, False );
/* The second argument (optional) is "lowercase";
   the third argument (optional) is "strip \r\n". */
To resolve the entities issue, you can use the PHP function html_entity_decode:
$txt = html_entity_decode( $e->innertext );
The result is something like this:
Tuning E A D G B E
|------------------------------------------------------------|
|------------------------------------------------------------|
|------------------------------------------------------------|
|------------------------------------------------------------|
|-------<7-8>----------<10-11>---------<7-8>---7--10--8--11--|x9
|-0000-----------0000------------0000----------0-------------|
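As a small illustration of the entity-decoding step above, here is a minimal sketch. The sample string and the temp-file path are hypothetical stand-ins for the real `innertext` value and output file.

```php
<?php
// Hypothetical sample: text as it might come out of a <pre> block,
// with "<" and ">" still encoded as HTML entities.
$raw = "|-------&lt;7-8&gt;----------&lt;10-11&gt;---|x9\n|-0000-----------0000------------|";

// Decode the entities back into plain characters.
$txt = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// Write the decoded text; the "\n" line breaks are written unchanged.
$path = tempnam(sys_get_temp_dir(), 'tab');
file_put_contents($path, $txt);

echo file_get_contents($path), "\n";
```

Passing the flags and charset explicitly avoids depending on the defaults of the PHP version in use.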

I tried this code, and when opened in Sublime Text the text file preserves the same formatting as on your website:
$html = file_get_contents("http://metaltabs.com/tab/4086/index.html");
$dom = new domDocument('1.0', 'utf-8');
// load the html into the object
$dom->loadHTML($html);
//preserve white space
$dom->preserveWhiteSpace = true;
$pre= $dom->getElementsByTagName('pre');
$file = fopen('text.txt', 'w');
fwrite($file, $pre->item(0)->nodeValue);
fclose($file);
This assumes that there is only one pre tag in your page; otherwise you have to loop through the $pre variable.
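If the page contains several pre tags, the loop could look like the sketch below. The inline HTML string and the output file name are assumptions made for the example, not part of the original page.

```php
<?php
// Hypothetical inline HTML standing in for the downloaded page.
$html = "<html><body><pre>a\n b</pre><p>x</p><pre>c\n  d</pre></body></html>";

$dom = new DOMDocument();
// Collect parser warnings internally instead of printing them;
// real-world HTML often triggers them.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$blocks = [];
foreach ($dom->getElementsByTagName('pre') as $pre) {
    // nodeValue is the text content, with whitespace kept as-is.
    $blocks[] = $pre->nodeValue;
}

// Separate the blocks with a blank line and write them in one call.
$out = implode("\n\n", $blocks);
file_put_contents(sys_get_temp_dir() . '/all_pre.txt', $out);
```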

Related

save() in php dom saves numbers at the end of the file

The code below successfully saves the child div, but it also saves some numbers at the end of the file. I think they are the number of bytes written; how do I get rid of the numbers it saves?
$file = '../userfolders/'.$email.'/'.$ongrassdb.'/'.$pagenameselected.'.php';
$doc = new DOMDocument();
$doc->load($file);
$ele = $doc->createElement('div', $textcon);
$ele ->setAttribute('id', $divname);
$ele ->setAttribute('style', 'background: '.$divbgcolor.'; color :'.$divfontcolor.' ;display : table-cell;');
$element = $doc->getElementsByTagName('div')->item(0);
$element->appendChild($ele);
$doc->appendChild($element);
$myfile = fopen($file, "a+") or die('Unable to open file!');
$html = $doc->save($file);
fwrite($myfile,$html);
fclose($myfile);
I don't want to use saveHTML or saveHTMLFile because they create multiple instances of the divs and add html tags.
$doc->load($file);
...
$myfile = fopen($file, "a+") or die('Unable to open file!');
$html = $doc->save($file);
fwrite($myfile,$html);
fclose($myfile);
The $doc->save() method saves the DOM tree to the file and returns the number of bytes it wrote. This number is stored in $html and is then appended to the same file by fwrite().
Just remove the fopen(), fwrite() and fclose() calls.
I removed the last two lines and it solved the issue
fwrite($myfile,$html);
fclose($myfile);
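The return value of save() can be checked with a minimal sketch like the following. The inline XML and the temp file are hypothetical and only serve the demonstration.

```php
<?php
$doc = new DOMDocument();
$doc->loadXML('<root><div id="a">text</div></root>');

// Hypothetical throwaway target file.
$path = tempnam(sys_get_temp_dir(), 'dom');

// save() writes the file itself and returns the byte count (false on failure),
// so no fopen()/fwrite()/fclose() is needed afterwards.
$bytes = $doc->save($path);

$written = file_get_contents($path);
var_dump(is_int($bytes));
var_dump($bytes === strlen($written));
```

Writing `$bytes` back to the file with fwrite() would append that number as text, which matches the stray digits described in the question.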

how to find line number for DOM elements in php?

I want to check whether an <img> tag has alt="" text or not, and I also need to find the line number of that img tag in the DOM. At the moment I have the following code, but I am stuck on finding the line number.
For example:
$doc = new DOMDocument();
$doc->loadHTMLFile('http://www.google.com');
$htmlElement = $doc->getElementsByTagName('html');
$tags = $doc->getElementsByTagName('img');
echo $tags->item(0)->getLineNo();
foreach ($tags as $image) {
// Get sizes of elements via width and height attributes
$alt = $image->getAttribute('alt');
if($alt == ""){
$src = $image->getAttribute('src');
echo "No alt text ";
echo '<img src="http://google.com/'.$src.'" alt=""/>'. '<br>';
}
else{
$src = $image->getAttribute('src');
echo '<img src="http://google.com/'.$src.'" alt=""/>'. '<br>';
}
}
With the above code I currently get the images, with the text "No alt text" beside each image that lacks one, but I also want to get the line number on which that img tag appears.
For example, here the line number is 57:
56. <div class="work_item">
57. <p class="pich"><img src="images/works/1.jpg" alt=""></p>
58. </div>
Use DOMNode::getLineNo(), e.g. $line = $image->getLineNo().
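A minimal sketch of that suggestion, assuming a small inline page instead of a downloaded one (the markup and file names in it are hypothetical). The line numbers come from the parser, so they may not match what an editor shows for badly formed markup.

```php
<?php
// Hypothetical inline page standing in for the downloaded HTML.
$html = "<html>\n<body>\n<p><img src=\"a.jpg\" alt=\"\"></p>\n"
      . "<p><img src=\"b.jpg\" alt=\"logo\"></p>\n</body>\n</html>";

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();

$missing = [];
foreach ($doc->getElementsByTagName('img') as $image) {
    // getAttribute() returns "" both for alt="" and for a missing alt attribute.
    if (trim($image->getAttribute('alt')) === '') {
        $missing[] = [
            'line' => $image->getLineNo(),
            'src'  => $image->getAttribute('src'),
        ];
    }
}

foreach ($missing as $m) {
    echo "No alt text: {$m['src']} (line {$m['line']})\n";
}
```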
HTML has no real concept of line numbers, since they are just whitespace.
With that in mind, you might be able to count how many newlines there are in all the text nodes preceding the target node. You might be able to do this with DOMXPath:
$xpath = new DOMXPath($doc);
$node = /* your target node */;
$textnodes = $xpath->query("./preceding::*[contains(text(),'\n')]",$node);
$line = 1;
foreach($textnodes as $textnode) $line += substr_count($textnode->textContent,"\n");
// $line is now the line number of the node.
Please note that I have not tested this, nor have I ever used axes in xpath.
I think I have figured out what I was trying to achieve, but I am not sure it is the right way. It does the job. Please leave comments or any other ideas on how I can improve it.
If you go to the following site and type in any URL, it will produce a report of the accessibility issues in that web page. It is an accessibility checker tool.
http://valet.webthing.com/page/
All I am trying to do is achieve that kind of layout. The code below produces the DOM of the supplied URL and finds any image tag that does not have alternative text.
<html>
<body>
<?php
$dom = new domDocument;
// load the html into the object
$dom->loadHTMLFile($yourURLAddress); // placeholder: set $yourURLAddress to the URL to check
// keep white space
$dom->preserveWhiteSpace = true;
// nicely format output
$dom->formatOutput = true;
$new = htmlspecialchars($dom->saveHTML(), ENT_QUOTES);
$lines = preg_split('/\r\n|\r|\n/', $new); //split the string on new lines
echo "<pre>";
//find 'alt=""' and print the line number and html tag
foreach ($lines as $lineNumber => $line) {
if (strpos($line, htmlspecialchars('alt=""')) !== false) {
echo "\r\n" . $lineNumber . ". " . $line;
}
}
echo "\n\n\nBelow is the whole DOM\n\n\n";
//print out the whole DOM including line numbers
foreach ($lines as $lineNumber => $line) {
echo "\r\n" . $lineNumber . ". " . $line;
}
echo "</pre>";
?>
</body>
</html>
I would like to thank everyone who helped, especially "chwagssd" and Mike Johnson.

Write to a file using PHP

Basically, what I want to do is open an XML file with PHP and edit it. I can do this using the fopen() function.
My issue is that I want to insert text in the middle of the document. Let's say the XML file has 10 lines and I want to add something before the last line (10), so that it now has 11 lines. Is this possible? Thanks.
Depending on how large that file is, you might do:
$lines = array();
$fp = fopen('file.xml','r');
while (!feof($fp))
$lines[] = trim(fgets($fp));
fclose($fp);
array_splice($lines, 9, 0, array('newline1','newline2',...));
$new_content = implode("\n", $lines);
Still, you'll need to revalidate XML-syntax afterwards...
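For the specific case in the question (inserting before the last line), a complete round trip could be sketched as follows. The temp file and its three-line content are hypothetical, and the approach reads the whole file into memory, so it suits small files only.

```php
<?php
// Hypothetical file with three lines; the new line goes before the last one.
$path = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents($path, "<root>\n  <a/>\n</root>\n");

// file() returns one array entry per line; the flag drops the line endings.
$lines = file($path, FILE_IGNORE_NEW_LINES);

// Offset -1 with length 0 inserts before the last entry without removing anything.
array_splice($lines, -1, 0, ['  <b/>']);

// Write the lines back, restoring the trailing newline.
file_put_contents($path, implode("\n", $lines) . "\n");

echo file_get_contents($path);
```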
If you want to be able to modify a file from the middle, use the c+ open mode:
$fp = fopen('test.txt', 'c+');
for ($i=0;$i<5;$i++) {
fgets($fp);
}
fwrite($fp, "foo\n");
fclose($fp);
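The above writes "foo" right after the fifth line, without having to read the rest of the file. Note that the write overwrites the bytes already at that position; it does not shift the following content down.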
However, if you are modifying a XML document, it's probably better to use a DOM parser:
$dom = new DOMDocument;
$dom->load('myfile.xml');
$linenum = 5;
$newNode = $dom->createElement('hello', 'world');
$element = $dom->firstChild->firstChild; // skips the root node
while ($element) {
if ($element->getLineNo() == $linenum) {
$element->parentNode->insertBefore($newNode, $element);
break;
}
$element = $element->nextSibling;
}
echo $dom->saveXML();
Of course, the above code depends on the actual XML document structure. But, the $element->getLineNo() is the key here.

How can I remove empty paragraphs from an HTML file using simple_html_dom.php?

I want to remove empty paragraphs from an HTML document using simple_html_dom.php. I know how to do it using the DOMDocument class, but, because the HTML files I work with are prepared in MS Word, the DOMDocument's loadHTMLFile() function gives this exception "Namespaces are not defined".
This is the code I use with the DOMDocument object for HTML files not prepared in MS Word:
<?php
/* Using the DOMDocument class */
/* Create a new DOMDocument object. */
$html = new DOMDocument("1.0", "UTF-8");
/* Load HTML code from an HTML file into the DOMDocument. */
$html->loadHTMLFile("HTML File With Empty Paragraphs.html");
/* Assign all the <p> elements into the $pars DOMNodeList object. */
$pars = $html->getElementsByTagName("p");
echo "The initial number of paragraphs is " . $pars->length . ".<br />";
/* The trim() function is used to remove leading and trailing spaces as well as
* newline characters. */
for ($i = 0; $i < $pars->length; $i++){
if (trim($pars->item($i)->textContent) == ""){
$pars->item($i)->parentNode->removeChild($pars->item($i));
$i--;
}
}
echo "The final number of paragraphs is " . $pars->length . ".<br />";
// Write the HTML code back into an HTML file.
$html->saveHTMLFile("HTML File WithOut Empty Paragraphs.html");
?>
This is the code I use with the simple_html_dom.php module for HTML files prepared in MS Word:
<?php
/* Using simple_html_dom.php */
include("simple_html_dom.php");
$html = file_get_html("HTML File With Empty Paragraphs.html");
$pars = $html->find("p");
for ($i = 0; $i < count($pars); $i++) {
if (trim($pars[$i]->plaintext) == "") {
unset($pars[$i]);
$i--;
}
}
$html->save("HTML File without Empty Paragraphs.html");
?>
It is almost the same, except that the $pars variable is a DOMNodeList when using DOMDocument and an array when using simple_html_dom.php. But this code does not work. First it runs for two minutes and then reports these errors: "Undefined offset: 1" and "Trying to get property of nonobject" for this line: "if (trim($pars[$i]->plaintext) == "") {".
Does anyone know how I can fix this?
Thank you.
I also asked on php devnetwork.
Looking at the documentation for Simple HTML DOM Parser, I think this should do the trick:
include('simple_html_dom.php');
$html = file_get_html('HTML File With Empty Paragraphs.html');
$pars = $html->find('p');
foreach($pars as $par)
{
if(trim($par->plaintext) == '')
{
// To remove an element, set its outertext to an empty string
$par->outertext = '';
}
}
$html->save('HTML File without Empty Paragraphs.html');
I did a quick test and this works for me:
include('simple_html_dom.php');
$html = str_get_html('<html><body><h1>Test</h1><p></p><p>Test</p></body></html>');
$pars = $html->find("p");
foreach($pars as $par)
{
if(trim($par->plaintext) == '')
{
$par->outertext = '';
}
}
echo $html;
// Output: <html><body><h1>Test</h1><p>Test</p></body></html>
Empty paragraphs look like <p [attributes]> [spaces or newlines] </p> (case-insensitive). You can use preg_replace (or str_replace) to remove them.
The following will only work if an empty paragraph is <p></p>:
$oldHtml = file_get_contents('File With Empty Paragraphs.html');
$newHtml = str_replace('<p></p>', '', $oldHtml);
// and write the new HTML to the file
$fh = fopen('File Without Empty Paragraphs.html', 'w');
fwrite($fh, $newHtml);
fclose($fh);
This will also work on paragraphs with attributes, like <p class="msoNormal"> </p>:
$oldHtml = file_get_contents('File With Empty Paragraphs.html');
$newHtml = preg_replace('#<p[^>]*>\s*</p>#i', '', $oldHtml);
// and write the new HTML to the file
$fh = fopen('File Without Empty Paragraphs.html', 'w');
fwrite($fh, $newHtml);
fclose($fh);
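A quick way to check the pattern above is to run it on an inline string, as in this sketch. The sample markup is hypothetical. The second pattern is an assumed variation, not part of the answer: it also treats `&nbsp;` as empty and uses `\b` so that tags such as `<pre>` cannot match.

```php
<?php
// Hypothetical sample with an attribute-bearing, an upper-case and a normal paragraph.
$oldHtml = "<h1>Test</h1><p class=\"msoNormal\"> </p><p>Test</p><P>\n</P>";

// Pattern from the answer: a <p ...> tag, optional whitespace, then </p>.
$newHtml = preg_replace('#<p[^>]*>\s*</p>#i', '', $oldHtml);
echo $newHtml, "\n";

// Assumed variation: also remove paragraphs that hold only &nbsp; entities.
$withNbsp = "<p>&nbsp;</p><pre>x</pre>";
$cleaned  = preg_replace('#<p\b[^>]*>(?:\s|&nbsp;)*</p>#i', '', $withNbsp);
echo $cleaned, "\n";
```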

extract text from tag

Hi, I've got these lines here. I am trying to extract the first paragraph found in the file, but this either returns no results or returns results that are not even in <p> tags, which is odd.
$file = $_SERVER['DOCUMENT_ROOT'].$_SERVER['REQUEST_URI'];
$hd = fopen($file,'r');
$cn = fread($hd, filesize($file));
fclose($hd);
$cnc = preg_replace('/<p>(.+?)<\/p>/','$1',$cn);
Try this:
$html = file_get_contents("http://localhost/foo.php");
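// non-greedy match with the s modifier, so only the first paragraph is captured, even across line breaks
preg_match('/<p>(.*?)<\/p>/s', $html, $match);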
echo($match[1]);
I would use DOM parsing for that:
// SimpleHtmlDom example
// Create DOM from URL or file
$html = file_get_html('http://localhost/blah.php');
// Find all paragraphs
foreach($html->find('p') as $element)
echo $element->innerText() . '<br>';
It would allow you to more reliably replace some of the markup:
$html->find('p', 0)->innertext = 'foo';
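The same idea works without a third-party library. The following is a minimal sketch using PHP's built-in DOMDocument, with a hypothetical inline page in place of the real file.

```php
<?php
// Hypothetical inline page standing in for the real file contents.
$html = "<html><body><h1>Title</h1>\n"
      . "<p class=\"intro\">First <b>paragraph</b>.</p>\n"
      . "<p>Second.</p></body></html>";

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();

// item(0) is the first <p> in document order, or null if there is none.
$first = $doc->getElementsByTagName('p')->item(0);
$text  = $first !== null ? trim($first->textContent) : '';

echo $text, "\n";
```

Unlike the `<p>` regular expression, this also finds paragraphs that carry attributes or contain nested tags.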
