Concatenate HTML tables with PHP DOMDocument - php

I have a whole bunch of large HTML documents with tables of data inside, and I'm looking to write a script which can process an HTML file, isolate the <table> tags and their contents, then concatenate all the rows within those tables into one large data table.
Then loop through the rows and columns of the new large table.
After some research I've started trying out PHP's DOMDocument class to parse the HTML but I just wanted to know, is that the best way to do something like this?
This is what I've got so far...
$dom = new DOMDocument();
$dom->preserveWhiteSpace = FALSE;
$dom->loadHTMLFile('exrate.html');
$tables = $dom->getElementsByTagName('table');
How do I chop out everything other than the tables and their contents?
Then I'd actually like to remove the first table since it's a table of contents. Then loop through all the table rows and build them into one large table.
Anyone got any hints on how to do this?
I've been digging through the docs for DOMDocument on php.net but I'm finding the syntax pretty baffling!
Cheers, B
EDIT: Here is a sample of an HTML file with the data tables I'd like to join http://thenetzone.co.uk/exrates/exrate.html

Ok got it sorted with phpQuery and lots of trial and error.
So it takes a whole bunch of tables and moves the contents into the first one, removes the empty tables.
Then loops through each table row and extracts the text from specific columns, in this case the 2nd and 3rd td of each row.
require('phpQuery/phpQuery.php');
$doc = phpQuery::newDocumentFileHTML('exrates_code.html');
pq('table:first')->remove();// REMOVE FIRST TABLE, JUST A CONTENTS TABLE SO NOT INTERESTED
pq('tr:has(th)')->remove();// REMOVE TABLE ROWS THAT ARE HEADERS
pq('table:not(:first) tr')->appendTo('table:first');// MOVE CONTENTS OF OTHER TABLES TO FIRST
pq('table:empty')->remove();// REMOVE EMPTY TABLES
pq('br')->remove();
$rows = pq('table tr');
foreach ($rows as $row) {
    $currency = pq($row)->find('td:eq(1)')->text(); // 2nd td of the row
    $value = pq($row)->find('td:eq(2)')->text(); // 3rd td of the row
}
Hope this helps someone out!
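For comparison, here is a rough sketch of the same idea using only DOMDocument, as the question originally asked. It is not the code from the answer above: the sample markup and the $rates variable are made up for illustration, and it assumes the tables are not nested inside each other.

```php
<?php
// Hypothetical sample input; a real file would be read with loadHTMLFile().
$html = '<table><tr><td>Contents</td></tr></table>'
      . '<table><tr><th>Code</th><th>Currency</th><th>Rate</th></tr>'
      . '<tr><td>AU</td><td>Dollar</td><td>1.81</td></tr></table>'
      . '<table><tr><td>CA</td><td>Dollar</td><td>1.75</td></tr></table>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // real-world HTML is rarely valid
$dom->loadHTML($html);
libxml_clear_errors();

$rates = [];
$isFirstTable = true;
foreach ($dom->getElementsByTagName('table') as $table) {
    if ($isFirstTable) {
        // skip the first table (the table of contents)
        $isFirstTable = false;
        continue;
    }
    foreach ($table->getElementsByTagName('tr') as $row) {
        $cells = $row->getElementsByTagName('td');
        if ($cells->length < 3) {
            continue; // header rows contain th cells, not td
        }
        // collect the 2nd and 3rd column of every data row
        $rates[] = [
            trim($cells->item(1)->textContent),
            trim($cells->item(2)->textContent),
        ];
    }
}
print_r($rates);
```

Collecting the rows into a plain array avoids having to move nodes between tables at all; if one big table is really needed, it can be rebuilt from $rates afterwards.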

Related

How to count number of lines from HTML out in PHP

How can I count the number of <br> tags in HTML with PHP? I have a website which shows all of the bans on my game server, and I want to count the number of bans that have been issued. The only way I can see to do this is to count the number of <br> tags, since the HTML output is all on one line. At the moment I have:
$content = file_get_contents('**WEBSITE OF BAN LIST HERE**');
The HTML output looks like this but much much longer:
hillel123 banned on 13/March/2014 with reason : None<br>xmrbrhoom banned on 13/March/2014 with reason : None by [name of banner]<br>InfinityJoris banned on 13/March/2014 with reason : None by [Name of banner]<br>
Thanks
You could figure out how many bans there are by counting the occurrences of <br>? As long as the html is in that form..
echo substr_count($content, '<br>'); // number of <br> tags, i.e. one per ban
You can count tags using this code:
$dom = new DOMDocument;
$dom->loadHTML($HTML);
$allElements = $dom->getElementsByTagName('*');
echo $allElements->length;
Instead of (*) you can put any tags you want and you'll get number of those tags.
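As a small illustration of that last point, the sketch below counts only the br elements in a shortened copy of the sample output from the question. The $banCount name is made up here, and the libxml calls are only there to silence warnings about the incomplete markup.

```php
<?php
// Shortened copy of the sample output from the question.
$content = 'hillel123 banned on 13/March/2014 with reason : None<br>'
         . 'xmrbrhoom banned on 13/March/2014 with reason : None<br>'
         . 'InfinityJoris banned on 13/March/2014 with reason : None<br>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // the fragment has no html/body wrapper
$dom->loadHTML($content);
libxml_clear_errors();

// count only the <br> elements instead of every element ('*')
$banCount = $dom->getElementsByTagName('br')->length;
echo $banCount;
```

Unlike a plain substring count, this also catches variants such as <br/> or <BR>, because the parser normalises them to the same element.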
The easiest way I can think of is to build an array and then count it.
Since you have it all in one string, you can easily do that :)
<?php
$content = file_get_contents('**WEBSITE OF BAN LIST HERE**');
// the list ends with a trailing <br>, so drop empty pieces before counting
$bans = array_filter(array_map('trim', explode('<br>', $content)), 'strlen');
echo count($bans);
?>
I think you want to use file instead of file_get_contents. Then use the count() function to get the resulting array's length.

how to randomly insert html inside of html string without causing invalid html

I've written a content generator tool for a project I'm working on, to help me batch-import fake content into text fields of a database. It just helps make the site look populated.
I'm using an external class called lorem-php-sum to actually generate the strings that I am inserting. It's incredibly simple really: it just returns paragraphs of text wrapped in <p> tags (a random number of them each time), and I then insert these strings into my chosen table within a big loop.
Now the thing is, I want to slightly advance what content is being randomly generated and to add some html list tags, horizontal line tags and other stuff. I want my new html elements to be placed randomly within the paragraphs that I get returned from this paragraph generator class.
The problem is that whilst I can easily insert list tags into my big paragraph string at some random point, I fear sometimes it may insert my new html tags within the existing markup in a way that will break the html.
Does anyone have a trick for inserting HTML into another string according to some rules? I imagine that the PHP DOMDocument class could help with this, but I'm not sure how.
You'd need to incorporate some kind of state machine in your generator.
You can think of something like this:
Step1: Choose which element to render: a textnode, a paragraph, a list node.
When you pick a textnode you randomly generate some text and return to Step 1.
When you pick a paragraph you emit <p> and generate some text, emit </p> and return to Step 1.
In the case of a list node you can only make list elements <li>, so pick a random number of elements and fill them with same rules from Step 1.
--
You can also allow nesting. In <li> you can add <strong> and <em>, similar for <p>.
You can make it as crazy as you want I guess :)
Tweak a bit with the coefficients to get good results. Try to make a generator that produces random, but predictable output, total length might be a good thing to control on.
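The steps above could be sketched roughly as follows. All function names and the word list are made up for illustration (the word list stands in for the lorem ipsum generator), and seeding with mt_srand() is just one way to get repeatable output.

```php
<?php
// Hypothetical stand-in for the lorem ipsum class: returns a few random words.
function randomText(): string
{
    $words = ['lorem', 'ipsum', 'dolor', 'sit', 'amet'];
    $out = [];
    for ($i = 0, $n = mt_rand(3, 6); $i < $n; $i++) {
        $out[] = $words[mt_rand(0, count($words) - 1)];
    }
    return implode(' ', $out);
}

// Step 1: pick which kind of block to emit. Every branch opens and closes
// its own tags, so the result can never contain broken nesting.
function randomBlock(): string
{
    switch (mt_rand(0, 2)) {
        case 0:
            return '<p>' . randomText() . '</p>';
        case 1:
            return '<hr>';
        default:
            // a list may only contain <li> children
            $items = '';
            for ($i = 0, $n = mt_rand(2, 4); $i < $n; $i++) {
                $items .= '<li>' . randomText() . '</li>';
            }
            return '<ul>' . $items . '</ul>';
    }
}

// Controls the total length by fixing the number of blocks.
function generateHtml(int $blocks): string
{
    $html = '';
    for ($i = 0; $i < $blocks; $i++) {
        $html .= randomBlock() . "\n";
    }
    return $html;
}

mt_srand(42); // fixed seed: random-looking but repeatable
$generated = generateHtml(5);
echo $generated;
```

Changing the range passed to mt_rand() in randomBlock() (or repeating a case) is the simplest way to tweak how often each kind of block appears.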
You could hierarchically loop through multidimensional arrays. No cell without a row, no row without a table, as such no li without a ul.
$tags = array(
    "<table>%s</table>\n",
    array("  <tr>%s</tr>\n",
        array("    <td>%s</td>\n")),
    "<ul>%s</ul>\n",
    array("  <li>%s</li>\n") // continue with more tags
);
$tags_simple = array("%s", "<strong>%s</strong>",
    "<i>%s</i>", "<p>%s</p>\n", "%s<br />\n"
); // etc.; "%s" alone means no tag, add more if you like
Pick a random entry from $tags, loop through its nested levels, sprintf the random sentences into them and add random simple tags. It's a standalone possibility.
So I managed to work this out with other code samples and using DOMDocument.
I ended up making a function that explodes the string via paragraph tags and returns it as an array containing each paragraph as a separate item.
function splitTextByPara($string, $split_on = "p") {
    // Add alternative tags to split on with syntax: |//ul|//br
    $dom = new DOMDocument();
    $dom->loadHTML($string);
    $domx = new DOMXPath($dom);
    $entries = $domx->evaluate("//" . $split_on);
    $result = array();
    foreach ($entries as $entry) {
        $result[] = $entry->ownerDocument->saveHTML($entry);
    }
    // loadHTML() treats the input as ISO-8859-1 by default, so UTF-8 text
    // typically comes back double-encoded; utf8_decode() undoes that
    // (note: utf8_decode() is deprecated as of PHP 8.2)
    $result = array_map("utf8_decode", $result);
    return $result;
}
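As an alternative to splitting the string, a new element can be inserted as a sibling of a randomly chosen paragraph directly in the DOM, which cannot break the existing markup. This is only a sketch under assumptions: the wrapping div, the sample paragraphs and the list items are made up, and it is not the code the poster ended up with.

```php
<?php
// Hypothetical generated content; the wrapping div is only there so the
// result can be read back out easily.
$html = '<div><p>One</p><p>Two</p><p>Three</p></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// pick a random paragraph to insert after
$paragraphs = $dom->getElementsByTagName('p');
$target = $paragraphs->item(mt_rand(0, $paragraphs->length - 1));

// build the new element as real nodes instead of a string
$list = $dom->createElement('ul');
$list->appendChild($dom->createElement('li', 'First item'));
$list->appendChild($dom->createElement('li', 'Second item'));

// "insert after": insert before the next sibling; when there is no next
// sibling, insertBefore() with null appends at the end of the parent
$target->parentNode->insertBefore($list, $target->nextSibling);

// serialise the children of the wrapper back into one string
$out = '';
foreach ($dom->getElementsByTagName('div')->item(0)->childNodes as $child) {
    $out .= $dom->saveHTML($child);
}
echo $out;
```

Because the list is always added between paragraphs, never inside one, the output stays well-formed wherever the random position lands.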

PHP> Extracting html data from an html file?

What I've been trying to do recently is to extract listing information from a given HTML file.
For example, I have an HTML page that has a list of many companies, with their phone number, address, etc.
Each company is in its own table, and every table starts like this: <table border="0">
I tried to use PHP to get all of the information and use it later, for example to put it in a txt file or import it into a database.
I assume that the way to achieve my goal is by using regex, which is one of the things that I really have problems with in PHP.
I would appreciate it if you guys could help me here.
(I only need to know what to look for, or at least something that could help me a little, not complete code or anything like that.)
Thanks in advance!!
I recommend taking a look at the PHP DOMDocument and parsing the file using an actual HTML parser, not regex.
There are some very straightforward ways of getting tables, such as the getElementsByTagName method.
<?php
// placeholder: put the HTML code here
$htmlCode = '<table border="0"><tr><td>...</td></tr></table>';
// create a new HTML parser
// http://php.net/manual/en/class.domdocument.php
$dom = new DOMDocument();
// Load the HTML into the parser
// http://www.php.net/manual/en/domdocument.loadhtml.php
$dom->loadHTML($htmlCode);
// Locate all the tables within the document
// http://www.php.net/manual/en/domdocument.getelementsbytagname.php
$tables = $dom->getElementsByTagName('table');
// iterate over all the tables
foreach ($tables as $table)
{
    // you can now work with $table and find children within, check for
    // specific classes applied--look for anything that would flag this
    // as the type of table you'd like to parse and work with--then begin
    // grabbing information from within it and treating it as a DOMElement
    // http://www.php.net/manual/en/class.domelement.php
}
If you're familiar with jQuery (and even if you're not, as its commands are simple enough), I recommend this PHP counterpart: http://code.google.com/p/phpquery/
If your HTML is valid XML, as in XHTML, then you could parse it using SimpleXML.
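To make the parser route a little more concrete, here is a minimal sketch that pulls the cell text out of every table with border="0", as described in the question. The company names, phone numbers and the $companies variable are invented sample data, and DOMXPath is used here as one possible way to select the tables.

```php
<?php
// Hypothetical sample markup; the real page would be loaded from a file.
$htmlCode = '<table border="0"><tr><td>Acme Ltd</td><td>555-0100</td></tr></table>'
          . '<table border="0"><tr><td>Globex</td><td>555-0199</td></tr></table>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // tolerate imperfect real-world HTML
$dom->loadHTML($htmlCode);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$companies = [];
// every company lives in its own <table border="0">
foreach ($xpath->query('//table[@border="0"]') as $table) {
    $fields = [];
    // './/td' is relative to the current table, not the whole document
    foreach ($xpath->query('.//td', $table) as $cell) {
        $fields[] = trim($cell->textContent);
    }
    $companies[] = $fields;
}

// one comma-separated line per company, e.g. for a txt file
foreach ($companies as $fields) {
    echo implode(',', $fields), "\n";
}
```

From here each $fields array could equally be written to a file or passed to a database insert.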

Why does my XML data get all mixed up?

I have a problem with getting content from an XML file into a MySQL database.
This is the code:
$objDOM = new DOMDocument('1.0', 'UTF-8');
$objDOM->load("something.xml");
$IAutnr = $objDOM->getElementsByTagName("Data");
Now, in a for loop:
for ($i = $t; $i <= $max; $i++) {
    $some = $objDOM->getElementsByTagName("some");
    $something = $some->item($i)->nodeValue;
    $some2 = $objDOM->getElementsByTagName("some2");
    $something2 = $some2->item($i)->nodeValue;
    // now put $something and $something2 into the database
}
Now, what happens is that everything works perfectly fine until one of the elements (some, some2, ...) does not exist within the "Data" tag. When that happens, it takes the element from the next "Data" tag instead, and this mixes up all my data, so I end up with data in my database that doesn't belong there.
I already tried for several hours to change the XML manually by putting the missing tags in, but with thousands of data records that is not feasible.
So I need to add something to my code so that if the tag doesn't exist, it is simply skipped instead of being taken from the next "Data" tag.
I actually don't even understand why it does that. Why does it just jump into the next "Data" tag?
Thank you very much for your help!
I'm only guessing here about the content of your XML structure, but I imagine it looks something like
...
<Data>
<some>a</some>
<some2>b</some2>
</Data>
<Data>
<some>c</some>
<some2>d</some2>
</Data>
...
If this is the case, you should be looping over the collection of Data elements in $IAutnr, eg
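for ($i = 0, $limit = min($IAutnr->length, $max); $i < $limit; $i++) {
    $data = $IAutnr->item($i);
    // search only inside this Data element, not the whole document
    $some = $data->getElementsByTagName('some');
    // item(0) is null when the element is missing, so check the length first
    $something = $some->length ? $some->item(0)->nodeValue : null;
    $some2 = $data->getElementsByTagName('some2');
    $something2 = $some2->length ? $some2->item(0)->nodeValue : null;
    // insert
}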
Unless you need some of the more advanced features of the DOM library, I'd recommend using SimpleXML.
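A rough SimpleXML version of the same loop might look like the sketch below. The root element name and the sample values are assumptions, since the real XML structure was not shown; the second Data element deliberately lacks a some child to show that nothing is pulled in from a neighbouring record.

```php
<?php
// Guessed structure; the real file would be read with simplexml_load_file().
$xml = '<root>'
     . '<Data><some>a</some><some2>b</some2></Data>'
     . '<Data><some2>d</some2></Data>'
     . '</root>';

$root = simplexml_load_string($xml);

$rows = [];
foreach ($root->Data as $data) {
    // each lookup is scoped to the current Data element;
    // a missing child becomes null instead of borrowing from the next record
    $rows[] = [
        'some'  => isset($data->some) ? (string) $data->some : null,
        'some2' => isset($data->some2) ? (string) $data->some2 : null,
    ];
}
print_r($rows);
```

Each entry of $rows then maps directly onto one database insert.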
It does that because you're asking it to extract elements with tag name "some" and "some2" from the entire XML structure, so that's what it does -- it doesn't only look into the branch you intend it to, because you never tell it to do that. One way to fix it is to look at $some->item($i)->parentNode (and maybe at that node's parent, and so on) in order to properly identify the parent that $something and $something2 belong to. Of course, there's no guarantee that $something and $something2 belong to the same parent, unless your XML is somehow guaranteed to present either none or both within the same branch. I know the explanation's a bit hairy, but that's the best way I could put it into words.

PHP: Formatting irregular CSV files into HTML tables

My client receives a set of CSV text files periodically, where the elements in each row follow a consistent order and format, but the commas that separate them are inconsistent. Sometimes one comma will separate two elements and other times it will be two or four commas, etc ...
The PHP application I am writing attempts to do the following things:
PSEUDO-CODE:
1. Upload csv.txt file from client's local directory.
2. Create new HTML table.
3. Insert the first three fields FROM csv.txt into HTML table row.
4. Iterate STEP 2 while the FIRST field equals the First field below it.
5. If they do not equal, CLOSE HTML table.
6. Check to see if FIRST field is NOT NULL, IF TRUE, GOTO step 2, Else close HTML table.
I have no trouble with steps 1 and 2. Step 3 is where it gets tricky, since the fields in the csv.txt files are not always separated by the same number of commas. They are, however, always in the same relative order and format. I am also having issues with step 4: I don't know how to check whether the first field in a row matches the first field in the row below it. Step 5 should be relatively simple. For step 6, I need to find an equivalent of a "GOTO" function in PHP.
Please let me know if any part of the question is unclear. I appreciate your help.
Thank you in advance!
If you want to group the rows by their first element you can try something like:
1. Read the next row via fgetcsv().
2. Filter out empty elements (a,,b,c -> a,b,c).
3. If the row still contains fields (i.e. is not empty), append the row to "its" group.
That's not exactly what you've described but it may be what you want ;-)
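<?php
$fp = fopen('test.csv', 'rb') or die('!fopen');
$groups = array();
// fgetcsv() returns false at end of file, so test its result instead of feof()
while (($row = fgetcsv($fp)) !== false) {
    // drop empty fields (a,,b,c -> a,b,c) and reindex so $row[0] always exists
    $row = array_values(array_filter($row, function ($field) {
        return $field !== null && $field !== '';
    }));
    if (!empty($row)) {
        // PHP creates the group's array on first use
        $groups[$row[0]][] = $row;
    }
}
fclose($fp);
foreach ($groups as $g) {
    echo '
<table>';
    foreach ($g as $row) {
        echo '
<tr>
<td>', join('</td><td>', array_map('htmlentities', $row)), '</td>
</tr>
';
    }
    echo '</table>';
}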
Why not simply start by going through and replacing any run of multiple commas with a single comma? E.g.:
abc,def,,ghi,,,,jkl
becomes:
abc,def,ghi,jkl
and then just continue normally.
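One possible way to do that replacement is a single preg_replace() call per line, sketched below. The variable names are made up, and the plain explode() assumes no field contains a quoted comma.

```php
<?php
$line = 'abc,def,,ghi,,,,jkl';

// collapse every run of two or more commas into a single comma
$clean = preg_replace('/,{2,}/', ',', $line);

// split into fields; assumes no field contains a quoted comma
$fields = explode(',', $clean);

echo $clean, "\n";
print_r($fields);
```

Note that this only works when a run of commas never stands for a genuinely empty column; otherwise the empty field is lost.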
If you mean that there are different numbers of commas on each line, then as far as I can see it is actually impossible to do what you want to do by looking at the commas alone. For example:
ab,c,d,ef // could group columns a-f in that way, but
a,bc,de,f // could also group columns a-f
... and you would have no way of knowing which was the proper arrangement, unless you're given some other instructions or the type of data is identifiable by regular expression as someone else said.
If on the other hand you just mean that sometimes there are blanks, but there are still the same number of columns, like this:
a,b,,d,e,f
a,,c,d,e,f
... then you can still form the table correctly. I would recommend using explode(',', $line) in that case and then doing your processing on the elements of the exploded array without worrying about what is inside them.
