How can I count the number of tags in HTML with PHP? I have a website that shows all of the bans on my game server, and I want to count how many bans have been issued. The only way I can see to do this is to count the number of tags, since the HTML output is all on one line. At the moment I have:
$content = file_get_contents('**WEBSITE OF BAN LIST HERE**');
The HTML output looks like this but much much longer:
hillel123 banned on 13/March/2014 with reason : None<br>xmrbrhoom banned on 13/March/2014 with reason : None by [name of banner]<br>InfinityJoris banned on 13/March/2014 with reason : None by [Name of banner]<br>
Thanks
You could work out how many bans there are by counting the occurrences of <br>, as long as the HTML stays in that form:
echo substr_count($content, '<br>'); // number of <br> tags, one per ban
You can count tags using this code:
$dom = new DOMDocument;
$dom->loadHTML($HTML);
$allElements = $dom->getElementsByTagName('*');
echo $allElements->length;
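Replace '*' with any tag name (for example 'br') and you'll get the number of those tags. Note that loadHTML() wraps a fragment in <html> and <body> elements, so '*' counts those as well.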
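To make that concrete for the ban list, here is a minimal sketch that counts only the <br> tags. The sample string is copied from the question; countTags() is a hypothetical helper name, not something built into PHP.

```php
<?php
// Sample taken from the ban list format shown in the question.
$html = 'hillel123 banned on 13/March/2014 with reason : None<br>'
      . 'xmrbrhoom banned on 13/March/2014 with reason : None by [name of banner]<br>'
      . 'InfinityJoris banned on 13/March/2014 with reason : None by [Name of banner]<br>';

// countTags() is a hypothetical helper, not part of PHP.
function countTags(string $html, string $tagName): int
{
    $dom = new DOMDocument();
    // Collect libxml warnings internally instead of printing them;
    // real-world HTML is rarely perfectly formed.
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    return $dom->getElementsByTagName($tagName)->length;
}

echo countTags($html, 'br'), "\n";
```

Counting a specific tag avoids picking up the wrapper elements that the parser adds around a fragment.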
The easiest way I can think of is to split the string into an array and then count it.
Since you have all of that in one string, you can easily do that :)
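<?php
$content = file_get_contents('**WEBSITE OF BAN LIST HERE**');
// The output ends with <br>, so explode() leaves an empty last element; drop empty pieces before counting.
$bans = array_filter(array_map('trim', explode('<br>', $content)), 'strlen');
echo count($bans);
?>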
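I think you want to use file() instead of file_get_contents(), then use count() on the resulting array to get its length. Note that file() splits on line breaks, so this only works if each ban is on its own line in the source.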
I am trying to retrieve one specific element from HTML code using QueryPath. It occurs twice, but I only want the first one.
The search does work, but it returns two elements.
I tried adding a pseudo-class selector to my search, but that didn't work.
This is the HTML-element that occurs twice in the code:
<span class="aui-suffix"> of 5 </span>
And this is how I am searching for it:
$arrURL = "URL...";
$html = htmlqp( $arrURL );
$pageAsString = $html->find('span.aui-suffix');
echo $pageAsString->text();
The output is "of 5 of 5 ", which is both elements printed right after each other.
How can I modify my search to get me only "of 5 "?
Try:
$pageAsString = $html->find('span.aui-suffix:eq(0)');
I've written a content generator tool for a project I'm working on, to help me batch-import fake content into text fields of a database. It just helps make the site look populated.
I'm using an external class called lorem-php-sum to generate the strings that I insert. It's incredibly simple: it produces paragraphs of text wrapped in <p> tags (a random number of them each time), and I then insert these strings into my chosen table within a big loop.
Now I want to extend the randomly generated content by adding HTML list tags, horizontal rule tags and other elements. I want my new HTML elements to be placed randomly within the paragraphs returned by this paragraph generator class.
The problem is that while I can easily insert list tags into my big paragraph string at some random point, I fear it may sometimes insert the new tags inside the existing markup in a way that breaks the HTML.
Does anyone have a trick for inserting HTML into another string according to some rules? I imagine PHP's DOMDocument class could help with this, but I'm not sure how.
You'd need to incorporate some kind of state machine in your generator.
You can think of something like this:
Step 1: Choose which element to render: a text node, a paragraph, or a list node.
When you pick a text node, you randomly generate some text and return to Step 1.
When you pick a paragraph, you emit <p>, generate some text, emit </p> and return to Step 1.
In the case of a list node you can only create <li> elements, so pick a random number of items and fill each one using the same rules from Step 1.
--
You can also allow nesting. In <li> you can add <strong> and <em>, similar for <p>.
You can make it as crazy as you want I guess :)
Tweak the coefficients a bit to get good results. Try to make a generator that produces random but predictable output; total length might be a good thing to control.
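A rough sketch of that idea is below. It is only one way to do it: every function name (randomText, renderInline, renderBlock, generateContent), the word list and the probabilities are made up for illustration, and a horizontal rule is used as the third block type because the question asked for one. Each block opens and closes its own tags, so nothing is ever inserted into the middle of existing markup.

```php
<?php
// All names here are hypothetical; the word pool and weights are arbitrary.

function randomText(int $words): string
{
    $pool = ['lorem', 'ipsum', 'dolor', 'sit', 'amet', 'consectetur', 'adipiscing', 'elit'];
    $out = [];
    for ($i = 0; $i < $words; $i++) {
        $out[] = $pool[mt_rand(0, count($pool) - 1)];
    }
    return implode(' ', $out);
}

// Inline content: plain text, sometimes wrapped in <strong> or <em>.
function renderInline(): string
{
    $text = randomText(mt_rand(3, 8));
    switch (mt_rand(0, 3)) {
        case 0:
            return '<strong>' . $text . '</strong>';
        case 1:
            return '<em>' . $text . '</em>';
        default:
            return $text;
    }
}

// "Step 1": choose which block to emit, then return to the caller's loop.
function renderBlock(): string
{
    switch (mt_rand(0, 2)) {
        case 0: // paragraph
            return '<p>' . renderInline() . "</p>\n";
        case 1: // list: only <li> children are allowed inside <ul>
            $items = '';
            $count = mt_rand(2, 5);
            for ($i = 0; $i < $count; $i++) {
                $items .= '<li>' . renderInline() . "</li>\n";
            }
            return "<ul>\n" . $items . "</ul>\n";
        default: // horizontal rule
            return "<hr>\n";
    }
}

function generateContent(int $blocks, ?int $seed = null): string
{
    if ($seed !== null) {
        mt_srand($seed); // same seed, same output: random but repeatable
    }
    $html = '';
    for ($i = 0; $i < $blocks; $i++) {
        $html .= renderBlock();
    }
    return $html;
}

echo generateContent(5, 42);
```

Controlling the number of blocks (and the seed) is a simple stand-in for the "total length" control mentioned above.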
You could loop hierarchically through multidimensional arrays. No cell without a row, no row without a table, and likewise no li without a ul.
$tags = array("<table>%s</table>\n",
    array(" <tr>%s</tr>\n",
        array(" <td>%s</td>\n")),
    "<ul>%s</ul>\n",
    array(" <li>%s</li>\n") // continue with more tags
);
$tags_simple = array("%s", "<strong>%s</strong>",
    "<i>%s</i>", "<p>%s</p>\n", "%s<br />\n"
); // etc.; "%s" means no tag, add more if you like
Pick a random entry from $tags, loop through its nested levels, sprintf the random sentences into them and add random simple tags. It's one standalone possibility.
So I managed to work this out with other code samples and using domDocument.
I ended up making a function that explodes the string via paragraph tags and returns it as an array containing each paragraph as a separate item.
function splitTextByPara($string, $split_on = "p") {
    // Add alternative tags to split on with syntax: |//ul|//br
    $dom = new DOMDocument();
    $dom->loadHTML($string);
    $domx = new DOMXPath($dom);
    $entries = $domx->evaluate("//" . $split_on);
    $result = array();
    foreach ($entries as $entry) {
        $result[] = $entry->ownerDocument->saveHTML($entry);
    }
    // loadHTML() assumes ISO-8859-1 by default, so UTF-8 input comes out
    // double-encoded; utf8_decode() converts it back (deprecated as of PHP 8.2)
    $result = array_map("utf8_decode", $result);
    return $result;
}
I am trying to extract the HTML content from inside a website. I want only the content inside the <body> tags.
// $validlink is a link with an .htm extension; the source is rather large,
// it contains 24,000 lines of HTML code
$thehtml = file_get_contents($validlink);
$thehtml = preg_match("/<body.*?>(.*?)<\/body>/is", $thehtml);
What else can I do? $thehtml is empty. I am trying to insert this into a WordPress post, but $thehtml is empty for some odd reason. Is there possibly a timeout issue or something?
It can't be a timeout issue: I noticed that even if I output just file_get_contents($validlink), for some reason BODY is not found.
Another possible solution would be to get just the content between the first div and the last div found in the document.
Get the string positions of the opening and closing tags using strpos(), then pass those positions to substr() to extract the text between them.
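Something along these lines, as a minimal sketch. extractBody() is a hypothetical helper name; it uses the case-insensitive variants stripos() and strripos() so that <BODY> is matched too, and it assumes the first "<body" in the document really is the body tag.

```php
<?php
// extractBody() is a hypothetical helper, not part of PHP.
// Returns the text between the end of the opening <body ...> tag and the
// last </body>, or null when either one cannot be found.
function extractBody(string $html): ?string
{
    $open = stripos($html, '<body');
    if ($open === false) {
        return null;
    }

    // Skip past the attributes to the '>' that closes the opening tag.
    $start = strpos($html, '>', $open);
    if ($start === false) {
        return null;
    }
    $start++;

    $end = strripos($html, '</body>');
    if ($end === false || $end < $start) {
        return null;
    }

    return substr($html, $start, $end - $start);
}

var_dump(extractBody('<html><BODY class="x"><p>Hi</p></body></html>'));
```

Checking each position against false matters here, because 0 is a valid position and would otherwise be mistaken for "not found".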
$thehtml = file_get_contents($validlink);
$thehtml = preg_match("/<body.*?>(.*?)<\/body>/is", $thehtml,$matches);
$thehtml = $matches[0];
Here is the correct code:
$thehtml = file_get_contents($validlink);
preg_match('/<body.*?>(.*?)<\/body>/is', $thehtml, $matches);
$thehtml = $matches[1];
But I suggest you use a DOM parser instead.
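For reference, a DOM-based version might look roughly like this. bodyInnerHtml() is a hypothetical helper name, and the sketch assumes plain ASCII or correctly declared encoding, since loadHTML() falls back to ISO-8859-1 when the document does not declare a charset.

```php
<?php
// bodyInnerHtml() is a hypothetical helper, not part of PHP.
// Returns the serialized children of <body>, or null if there is no body.
function bodyInnerHtml(string $html): ?string
{
    $dom = new DOMDocument();
    // Keep libxml warnings about sloppy markup out of the output.
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return null;
    }

    // saveHTML($node) serializes a single node; concatenating the children
    // gives the body's contents without the <body> tag itself.
    $inner = '';
    foreach ($body->childNodes as $child) {
        $inner .= $dom->saveHTML($child);
    }
    return $inner;
}

echo bodyInnerHtml('<html><body><p>Hi</p><div>x</div></body></html>');
```

Unlike the regex, this does not depend on how the <body> tag is written, and the parser repairs unclosed tags on the way in.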
I have a series of select boxes that I'd like to get data from, essentially turning them into an array. What's the most efficient way to do this? Right now I'm thinking....
$html = file_get_contents('http://www.domain.com');
preg_match_all("/name\=\'subscription\[division_id\]\' style\=\'width: 170px;\'>(.+?)<\/select>/is", $html, $matches);
Then I was thinking of running other code to put the option tags into an array, but this seems like it might be unnecessarily intensive.
If you are scraping for whatever reason, you could probably parse the page's html with php's DOMXPath commands. I can't write out all the code, but you can get started with:
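$dom = new DOMDocument();
@$dom->loadHTML($html); // $html from file_get_contents() in the question; @ hides warnings from malformed markup
$xpath = new DOMXPath($dom);
$select_values = $xpath->evaluate("/html/body//option");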
Then you run everything through a loop getting the contents of the options. Anyway, with something like this you can avoid all the nonsense with regex.
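Filled out a little, the loop could look like the sketch below. selectOptions() is a hypothetical helper name, and the sample markup (option values and labels) is made up; only the select's name attribute comes from the question.

```php
<?php
// selectOptions() is a hypothetical helper, not part of PHP.
// Returns [option value => option label] for the named <select>.
function selectOptions(string $html, string $selectName): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // ignore warnings from sloppy markup
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $options = [];
    // NOTE: $selectName is interpolated directly into the expression,
    // so this sketch assumes it contains no double quotes.
    $query = '//select[@name="' . $selectName . '"]//option';
    foreach ($xpath->query($query) as $option) {
        $options[$option->getAttribute('value')] = trim($option->textContent);
    }
    return $options;
}

// Made-up sample markup for illustration.
$html = '<form><select name="subscription[division_id]" style="width: 170px;">'
      . '<option value="1">North</option><option value="2">South</option>'
      . '</select></form>';

print_r(selectOptions($html, 'subscription[division_id]'));
```

Matching on the name attribute alone means the style attribute can change without breaking the scraper, which the regex version would not survive.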
I have a whole bunch of large HTML documents with tables of data inside, and I'm looking to write a script which can process an HTML file, isolate the <table> tags and their contents, then concatenate all the rows within those tables into one large data table.
Then loop through the rows and columns of the new large table.
After some research I've started trying out PHP's DOMDocument class to parse the HTML but I just wanted to know, is that the best way to do something like this?
This is what I've got so far...
$dom = new DOMDocument();
$dom->preserveWhiteSpace = FALSE;
#$dom->loadHTMLFile('exrate.html');
$tables = $dom->getElementsByTagName('table');
How do I chop out everything other than the tables and their contents?
Then I'd actually like to remove the first table since it's a table of contents. Then loop through all the table rows and build them into one large table.
Anyone got any hints on how to do this?
I've been digging through the docs for DOMDocument on php.net but I'm finding the syntax pretty baffling!
Cheers, B
EDIT: Here is a sample of an HTML file with the data tables I'd like to join http://thenetzone.co.uk/exrates/exrate.html
Ok got it sorted with phpQuery and lots of trial and error.
So it takes a whole bunch of tables and moves the contents into the first one, removes the empty tables.
Then loops through each table row and extracts the text from specific columns, in this case the 2nd and 3rd td of each row.
require('phpQuery/phpQuery.php');
$doc = phpQuery::newDocumentFileHTML('exrates_code.html');
pq('table:first')->remove();// REMOVE FIRST TABLE, JUST A CONTENTS TABLE SO NOT INTERESTED
pq('tr:has(th)')->remove();// REMOVE TABLE ROWS THAT ARE HEADERS
pq('table:not(:first) tr')->appendTo('table:first');// MOVE CONTENTS OF OTHER TABLES TO FIRST
pq('table:empty')->remove();// REMOVE EMPTY TABLES
pq('br')->remove();
$rows = pq('table tr');
foreach ($rows as $row) {
    $currency = pq($row)->find('td:eq(1)')->text();
    $value = pq($row)->find('td:eq(2)')->text();
}
Hope this helps someone out!
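For anyone who would rather stay with the built-in DOMDocument from the question, roughly the same steps can be sketched as follows. collectRows() is a hypothetical helper name and the sample tables are made up; the sketch skips the first table, ignores header rows, and pulls the 2nd and 3rd cell of each remaining row into one array instead of physically merging the tables.

```php
<?php
// collectRows() is a hypothetical helper, not part of PHP.
// Returns a list of [second cell, third cell] pairs from every table
// except the first one.
function collectRows(string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // ignore warnings from sloppy markup
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $tables = $dom->getElementsByTagName('table');
    $rows = [];

    // Start at 1 to skip the first table (the table of contents).
    for ($i = 1; $i < $tables->length; $i++) {
        // tr[not(th)] leaves out header rows.
        foreach ($xpath->query('.//tr[not(th)]', $tables->item($i)) as $tr) {
            $cells = $xpath->query('td', $tr);
            if ($cells->length < 3) {
                continue; // not a data row
            }
            $rows[] = [
                trim($cells->item(1)->textContent),
                trim($cells->item(2)->textContent),
            ];
        }
    }
    return $rows;
}

// Made-up sample data for illustration.
$html = '<table><tr><td>Contents</td></tr></table>'
      . '<table><tr><th>Country</th><th>Currency</th><th>Rate</th></tr>'
      . '<tr><td>UK</td><td>GBP</td><td>1.00</td></tr></table>'
      . '<table><tr><td>US</td><td>USD</td><td>1.60</td></tr></table>';

print_r(collectRows($html));
```

This assumes the tables are not nested; nested tables would be visited more than once.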