Using Simple HTML DOM Parser to create more than one file - php

I am using PHP Simple HTML DOM Parser to get the table elements of an HTML page and then create a file for each element.
This is my code:
<?php
include_once('simple_html_dom.php');
$html = file_get_html('test.html');
foreach($html->find('table[id=backgroundTable]') as $element);
$element = $html->save();
$html->save('result.html');
The problem I have at the moment is that it stores all the tables in this single result.html file.
What I need is for the exported results to be result1.html, result2.html, and so on. How can I achieve this?
Thank you very much in advance

You may try something like this, so that $i is incremented on each iteration of the loop:
$html = file_get_html('test.html');
$i = 1;
foreach ($html->find('table#backgroundTable') as $element) {
    str_get_html($element)->save('result' . $i . '.html');
    $i++;
}
So the results will be saved in result1.html, result2.html and so on.
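If str_get_html() feels indirect here, an alternative sketch (assuming the standard simple_html_dom API, where each node exposes its markup through the outertext property) writes each table's HTML straight to its own file:
<?php
include_once('simple_html_dom.php');
$html = file_get_html('test.html');
$i = 1;
foreach ($html->find('table#backgroundTable') as $element) {
    // outertext holds the element's full HTML, including the <table> tags themselves
    file_put_contents('result' . $i . '.html', $element->outertext);
    $i++;
}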

Related

PHP DOMDocument messing up encoding

I'm using a class to load HTML from a page and get the HTML table content. The page is in windows-1250, so I'm using iconv to convert it to UTF-8.
All this is done in one class, which I'm calling like this: $tableHtml = suplUpdater::getTableHtml(someParams,...);. When I echo that variable directly, everything looks nice. However, I want to parse the table rows with PHP DOMDocument to save them to the database. The code looks like this:
$tableData = suplUpdater::getTableHtml(1400450400);
//echo($tableData);
$document = new DOMDocument();
$document->loadHTML($tableData);
$rows = $document->getElementsByTagName('tr');
$rows->item(0)->parentNode->removeChild($rows->item(0)); //first row is just a header
$output = array();
foreach ($rows as $row) {
    $currentOutput = array();
    foreach ($row->childNodes as $cell) {
        if ($cell->nodeType == 1) {
            $currentOutput[] = $cell->nodeValue;
        }
    }
    $output[] = $currentOutput;
}
When I do var_dump($output);, I get an array, but it has messed-up encoding. Where could the problem be? If needed, I can provide the source table data.
EDIT:
When I copy the table HTML to a txt file encoded in UTF-8 and load it with file_get_contents('tableHtml.txt'), I get the same result.
EDIT:
I have uploaded sample data here: http://anagmate.moxo.cz/data.txt
EDIT:
A screenshot of the echo and var_dump output is here: http://anagmate.moxo.cz/supl.png
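A common cause of this symptom (an assumption on my part, not something confirmed in the question) is that DOMDocument::loadHTML assumes ISO-8859-1 when the markup carries no encoding declaration. A minimal sketch of the usual workaround is to give the parser an explicit UTF-8 hint before loading:
$tableData = suplUpdater::getTableHtml(1400450400);
$document = new DOMDocument();
// Prepend an encoding hint so the parser treats the markup as UTF-8
// rather than falling back to ISO-8859-1.
$document->loadHTML('<?xml encoding="utf-8"?>' . $tableData);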

Using php to get parent element of link with URL

I'm trying to implement a "find and replace" system for broken links. The problem is that for some links there are no replacements, so I need to comment out certain li elements. You can see my code for this below (I'm starting with an HTML form).
<?php
$brokenlink = $_POST['brokenlink'];
$newlink = $_POST['newlink'];
$brokenlink = '"' . $brokenlink . '"';
$newlink = '"' . $newlink . '"';
$di = new RecursiveDirectoryIterator('hugedirectory');
foreach (new RecursiveIteratorIterator($di) as $filename => $file) {
    // echo $filename . ' - ' . $file->getSize() . ' bytes <br/>';
    $filetoedit = file_get_contents($file);
    if (strpos($filetoedit, $brokenlink) !== false) {
        echo $brokenlink . " found in " . $filename . "<br/>";
        $filetoedit = str_replace($brokenlink, $newlink, $filetoedit);
        file_put_contents($filename, $filetoedit);
    }
}
?>
What I want to accomplish is this: given a URL, I want to be able to find its li parent. For instance, if the user inputs http://www.espn.com in the HTML form, I want PHP to find this element on my server:
<li><a href="http://www.espn.com">Sports</a></li>
And replace it with this:
<!-- <li><a href="http://www.espn.com">Sports</a></li> -->
Is this possible? Thanks.
I would try using this to parse the DOM.
http://simplehtmldom.sourceforge.net/
You can set a class on all the ones you want to comment out. Then use this tool to find those classes and comment them all out at once.
Why not use a regexp to find and replace the links? It would also avoid the potentially expensive looping over links.
Here's a regex for matching URLs:
http://daringfireball.net/2010/07/improved_regex_for_matching_urls
Then preg_replace the broken link with the new one, or with a commented-out version of the broken link.
Alternatively, you can just run grep on the directory via shell_exec; that way you don't have to open, read, and parse the files yourself.
Also take a look at this: match url pattern in php using regular expression
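As a rough illustration of the regex route (my own sketch; it assumes the broken link sits inside a plain <li><a href="..."> element and uses a deliberately simple pattern, not the Daring Fireball one):
$brokenlink = 'http://www.espn.com'; // hypothetical form input
$filetoedit = file_get_contents($filename);
// Wrap any <li> whose anchor points at the broken URL in an HTML comment.
$pattern = '#<li>\s*<a\s+href="' . preg_quote($brokenlink, '#') . '"[^>]*>.*?</a>\s*</li>#is';
$filetoedit = preg_replace($pattern, '<!-- $0 -->', $filetoedit);
file_put_contents($filename, $filetoedit);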
I suggest you construct a DOMDocument from the file content and use XPath to search for the broken link node.
$dom = new DOMDocument();
@$dom->loadHTML($filetoedit);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//li/a[@href="' . $brokenlink . '"]');
for ($i = 0; $i < $nodes->length; $i++) {
    $node = $nodes->item($i);
    // Do whatever you want here
}
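One way to fill in the loop body (my sketch, not part of the original answer) is to replace each matching <li> parent with an HTML comment node and write the file back out:
foreach ($nodes as $node) {
    $li = $node->parentNode; // the <li> that wraps the broken <a>
    // Serialize the <li>, wrap it in a comment node, and swap it into the tree.
    $comment = $dom->createComment($dom->saveHTML($li));
    $li->parentNode->replaceChild($comment, $li);
}
// Note: loadHTML may add a doctype and html/body wrappers around fragments.
file_put_contents($filename, $dom->saveHTML());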

Get anchor tags from multiple HTML files

I am not sure if this is even possible, but I am trying to extract all the anchor tag links from a few HTML files on my website. I have currently written a PHP script that scans a few directories and subdirectories and builds an array of HTML file paths. Here is that code:
$di = new RecursiveDirectoryIterator('Migration');
$migrate = array();
foreach (new RecursiveIteratorIterator($di) as $filename => $file) {
    if (preg_match('/\.html?/i', $filename)) {
        $migrate[] = $filename;
    }
}
This method successfully produces the HTML file paths that I need, e.g.:
Migration/administration/billing/Billing.htm
Migration/administration/billing/_notes/Billing.htm.mno
Migration/administration/new business/_notes/New Business.htm.mno
Migration/administration/new business/New Business.htm
Migration/account/nycds/_notes/NYCDS Index.htm.mno
Migration/account/nycds/NYCDS Index.htm
There are more links, but this gives you an idea. The next part is where I am stuck. I was thinking that I would need a for loop to go through each array element, open the file, extract the links, then store those links somewhere. I am just not sure how I would go about this process. I tried to Google this question but never seemed to get results that matched what I was looking to do. Here is the simplified for loop that I have:
var obj = <?php echo json_encode($migrate); ?>;
for (var i = 0; i < obj.length; i++) {
    // alert(obj[i]);
}
The above code is in JavaScript. From what I am reading, it seems that I shouldn't be using JavaScript but should maybe continue using PHP. I am confused about what my next steps should be. If someone can point me in the right direction, I would really appreciate it. Thank you so much for your time.
Use DOMDocument::getElementsByTagName to retrieve all <a> tags
http://www.php.net/manual/en/domdocument.getelementsbytagname.php
Example:
$doc = new DOMDocument();
$doc->loadHTMLFile("filename.html");
$anchors = $doc->getElementsByTagName('a'); // retrieve all anchor tags
foreach ($anchors as $a) { // loop over the anchors
    echo $a->nodeValue;
}
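To tie this back to the $migrate array built above (my own sketch, not part of the answer), you could stay in PHP rather than JavaScript and collect every href per file:
$links = array();
foreach ($migrate as $filename) {
    $doc = new DOMDocument();
    @$doc->loadHTMLFile($filename); // @ silences warnings from sloppy real-world HTML
    foreach ($doc->getElementsByTagName('a') as $a) {
        $links[$filename][] = $a->getAttribute('href');
    }
}
print_r($links);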

In PHP, how can I get an XML attribute based on a variable?

I'm retrieving files like so (from the Internet Archive):
<files>
<file name="Checkmate-theHumanTouch.gif" source="derivative">
<format>Animated GIF</format>
<original>Checkmate-theHumanTouch.mp4</original>
<md5>72ec7fcf240969921e58eabfb3b9d9df</md5>
<mtime>1274063536</mtime>
<size>377534</size>
<crc32>b2df3fc1</crc32>
<sha1>211a61068db844c44e79a9f71aa9f9d13ff68f1f</sha1>
</file>
<file name="CheckmateTheHumanTouch1961.thumbs/Checkmate-theHumanTouch_000001.jpg" source="derivative">
<format>Thumbnail</format>
<original>Checkmate-theHumanTouch.mp4</original>
<md5>6f6b3f8a779ff09f24ee4cd15d4bacd6</md5>
<mtime>1274063133</mtime>
<size>1169</size>
<crc32>657dc153</crc32>
<sha1>2242516f2dd9fe15c24b86d67f734e5236b05901</sha1>
</file>
</files>
They can have any number of <file>s, and I'm solely looking for the ones that are thumbnails. When I find them, I want to increase a counter. When I've gone through the whole file, I want to find the middle Thumbnail and return the name attribute.
Here's what I've got so far:
//pop previously retrieved XML file into a variable
$elem = new SimpleXMLElement($xml_file);
//establish variable
$i = 0;
// Look through each parent element in the file
foreach ($elem as $file) {
    if ($file->format == "Thumbnail") { $i++; }
}
//find the middle thumbnail.
$chosenThumb = ceil(($i/2)-1);
//Gloriously announce the name of the chosen thumbnail.
echo($elem->file[$chosenThumb]['name']);
The final echo doesn't work; it doesn't seem to like having a variable choose the XML element. It works fine when I hardcode the index. Can you guess that I'm new to handling XML files?
Edit:
Francis Avila's answer below sorted me right out:
$sxe = simplexml_load_file($url);
$thumbs = $sxe->xpath('/files/file[format="Thumbnail"]');
$n_thumbs = count($thumbs);
$middlethumb = $thumbs[(int) ($n_thumbs/2)];
$happy_string = (string) $middlethumb['name'];
echo $happy_string;
Use XPath.
$sxe = simplexml_load_file($url);
$thumbs = $sxe->xpath('/files/file[format="Thumbnail"]');
$n_thumbs = count($thumbs);
$middlethumb = $thumbs[(int) ($n_thumbs/2)];
$middlethumbname = (string) $middlethumb['name'];
You can also accomplish this with a single XPath expression if you don't need the total count:
$thumbs = $sxe->xpath('/files/file[format="Thumbnail"][position() = floor(last() div 2) + 1]/@name');
$middlethumbname = (count($thumbs)) ? $thumbs[0]['name'] : '';
A limitation of SimpleXML's xpath method is that it can only return nodes and not simple types. This is why you need to use $thumbs[0]['name']. If you use DOMXPath::evaluate(), you can do this instead:
$doc = new DOMDocument();
$doc->load($url);
$xp = new DOMXPath($doc);
$middlethumbname = $xp->evaluate('string(/files/file[format="Thumbnail"][position() = floor(last() div 2) + 1]/@name)');
$elem->file[$chosenThumb] will give the $chosenThumb'th element from the main file[] list, not the filtered (for Thumbnail) list, right?
foreach ($elem as $file) {
    if ($file->format == "Thumbnail") {
        $i++;
        //add this item to a new array ($filteredFiles)
    }
}
$chosenThumb = ceil(($i/2)-1);
//echo($elem->file[$chosenThumb]['name']);
echo($filteredFiles[$chosenThumb]['name']);
Some problems:
The middle thumbnail is incorrectly calculated. You'll have to keep a separate array for those thumbs and get the middle one using count.
file might need to be {'file'}; I'm not sure how PHP sees this.
You don't have a default thumbnail.
The code you should use is this:
$files = new SimpleXMLElement($xml_file);
$thumbs = array();
foreach ($files as $file) {
    if ($file->format == "Thumbnail") {
        $thumbs[] = $file;
    }
}
$chosenThumb = ceil((count($thumbs)/2)-1);
echo (count($thumbs) === 0) ? 'default-thumbnail.png' : $thumbs[$chosenThumb]['name'];
/edit: but I recommend that guy's solution, to use XPath. Way easier.

Can't get the dom node value extracted

I have code that connects to another site, grabs that data, and returns the string to a variable. I'm wondering why this isn't working, however.
<?php
$file = $DOCUMENT_ROOT . "http://www.sc2brasd.net";
$doc = new DOMDocument();
@$doc->loadHTMLFile($file);
$elements = $doc->getElementsByTagName('h1');
for ($i = 1; $i <= 7; $i++) {
    echo trim($elements->item($i)->nodeValue);
}
?>
There are seven h1 tags that I would like to grab, but they won't echo out. An example of the string would be "Here is the test string I am trying to pull out".
This will not work because the path doesn't exist. It points to a file on your server:
$file = $DOCUMENT_ROOT . "http://www.sc2brasd.net";
I'm not sure if loadHTMLFile() can handle URLs at all. You may need to get the document with file_get_contents() and load it with DOMDocument::loadHTML.
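A minimal sketch of that suggestion (my own; it also corrects the off-by-one in the original loop, since DOMNodeList items are zero-indexed):
$html = file_get_contents('http://www.sc2brasd.net');
$doc = new DOMDocument();
@$doc->loadHTML($html); // @ suppresses warnings about malformed markup
$elements = $doc->getElementsByTagName('h1');
for ($i = 0; $i < $elements->length; $i++) {
    echo trim($elements->item($i)->nodeValue) . "\n";
}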
