Using Simple HTML DOM to Scrape? - php

Simple HTML DOM is basically a PHP file you include in your pages, which gives you simple web scraping. It's good for the most part, but I can't figure out the manual as I'm not much of a coder. Are there any sites or guides out there that explain it more simply? (The one at php.net is a bit too complicated for me at the moment.) Is there a better place to ask this kind of question?
The site for it is at: http://simplehtmldom.sourceforge.net/manual.htm
I can scrape stuff that has specific classes like <tr class="group">, but not for stuff that's in between. For example.. This is what I currently use...
$url = 'http://www.test.com';
$html = file_get_html($url);
foreach ($html->find('tr[class=group]') as $result)
{
    $first = $result->find('td[class=category1]', 0);
    $second = $result->find('td[class=category2]', 0);
    echo $first . $second;
}
But here is the kind of code I'm trying to scrape.
<table>
<tr class="Group">
<td>
<dl class="Summary">
<dt>Heading 1</dt>
<dd>Cat</dd>
<dd>Bacon</dd>
<dt>Heading 2</dt>
<dd>Narwhal</dd>
<dd>Ice Soap</dd>
</dl>
</td>
</tr>
</table>
I'm trying to extract the content of each <dt> and put it to a variable. Then I'm trying to extract the content of each <dd> and put it to a variable, but nothing I tried works. Here's the best I could find, but it gives me back only the first heading repeatedly rather than going to the second.
foreach($html->find('tr[class=Summary]') as $result2)
{
echo $result2->find('dt',0)->innertext;
}
Thanks to anyone who can help. Sorry if this is not clear or that it's so long. Ideally I'd like to be able to understand these DOM commands more as I'd like to figure this out myself rather than someone here just do it (but I'd appreciate either).
TL;DR: I am trying to understand how to use the commands listed in the manual (url above). The 'manual' isn't easy enough. How do you go about learning this stuff?

I think $result2->find('dt',0) gives you back element 0, which is the first match. If you omit the index, find() returns an array of all matches instead. Note also that in your markup the Summary class is on the dl, not on the tr, so the selector should be dl[class=Summary]. Something like this:
foreach ($html->find('dl[class=Summary]') as $result2)
{
    foreach ($result2->find('dt') as $node)
    {
        echo $node->innertext;
    }
}
You don't strictly need the outer foreach loop, since there's only one dl in your document. You could even leave it out altogether and find every dt in the document, but for tools like this I think it's good to be both flexible and strict: you are prepared for multiple lists, but you don't accidentally parse dts from elsewhere in the document.
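To get each heading together with the items under it, one option is to walk the dt and dd elements in document order. The following is only a minimal sketch: it swaps in PHP's built-in DOMDocument and DOMXPath instead of Simple HTML DOM, the function name groupDefinitions is made up for illustration, and the markup is the sample from the question.

```php
<?php
// Sketch: group each <dd> under the <dt> that precedes it.
// Uses PHP's built-in DOM extension, not Simple HTML DOM.
function groupDefinitions(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate imperfect real-world markup
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $groups = [];
    $current = null;
    // Visit the <dt>/<dd> children of every <dl class="Summary"> in document order.
    foreach ($xpath->query('//dl[@class="Summary"]/*[self::dt or self::dd]') as $node) {
        $text = trim($node->textContent);
        if ($node->nodeName === 'dt') {
            $current = $text;          // a new heading starts a new group
            $groups[$current] = [];
        } elseif ($current !== null) {
            $groups[$current][] = $text; // a <dd> belongs to the latest heading
        }
    }
    return $groups;
}

$html = '<table><tr class="Group"><td><dl class="Summary">'
      . '<dt>Heading 1</dt><dd>Cat</dd><dd>Bacon</dd>'
      . '<dt>Heading 2</dt><dd>Narwhal</dd><dd>Ice Soap</dd>'
      . '</dl></td></tr></table>';

$groups = groupDefinitions($html);
foreach ($groups as $heading => $items) {
    echo $heading . ': ' . implode(', ', $items) . "\n";
}
```

The result is an array keyed by heading, so each heading and its items end up in variables you can loop over.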

Related

Adding a class to all English text in HTML?

The requirement is to add an englishText class around all English words on a page. The problem is similar to this, but the JavaScript solutions won't work for me. I require a PHP example to solve this problem. For example, if you have this:
<p>Hello, 你好</p>
<div>It is me, 你好</div>
<strong>你好, how are you</strong>
Afterwards I need to end with:
<p><span class="englishText">Hello</span>, 你好</p>
<div><span class="englishText">It is me</span>, 你好</div>
<strong>你好, <span class="englishText">how are you</span></strong>
There are more complicated cases, such as:
<strong>你好, TEXT?</strong>
<div>It is me, 你好</div>
This should become:
<strong>你好, <span class="englishText">TEXT?</span></strong>
<div><span class="englishText">It is me</span>, 你好</div>
But I think I can sort out these edge cases once I know how to actually iterate over the document correctly.
I can't use javascript to solve this because:
This needs to work on browsers that don't support javascript
I would prefer to have the classes in place on page load so there isn't any delay in rendering the text in the correct font.
I figured the best way to iterate over the document would be using PHP Simple HTML DOM Parser.
But the problem is that if I try this:
foreach ($html->find('div') as $element)
{
// make changes here
}
My concern is that the following case will cause chaos:
<div>
Hello , 你好
<div>Hello, 你好</div>
</div>
As you can see, it's going to go into the first div and then if I process that node, I will be processing the node within that too.
Any ideas how to get around this and only select the nodes for processing once?
UPDATE
I realise now that what I effectively need is a recursive way to iterate over HTML elements with the ability to change them as I iterate over them.
You should travel through siblings; that way you won't get into trouble with such cases.
Something like this:
<?php
foreach ($html->find('div') as $element)
{
    // next_sibling() returns a single element (or null), so step through them one at a time
    for ($sibling = $element->next_sibling(); $sibling; $sibling = $sibling->next_sibling()) {
        echo $sibling->plaintext . "\n";
    }
}
?>
Or a much easier way, in my opinion:
1. Change every <*> to "\n"."<*>" with preg_replace().
2. Make an array of lines with $lines = explode("\n", $html_string);
3. Strip the tags from each line:
foreach ($lines as $line) {
    $text = strip_tags($line);
    echo $text;
}
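The "visit each node only once" concern from the question can also be handled by iterating over text nodes rather than elements, since a text node has no children to revisit. The sketch below uses PHP's built-in DOM extension rather than Simple HTML DOM. The helper name wrapEnglish and the regular expression (runs of ASCII letters separated by whitespace) are assumptions for illustration; the edge cases from the question, such as trailing punctuation, are not handled.

```php
<?php
// Sketch: visit every text node exactly once and wrap runs of Latin letters.
// Uses PHP's built-in DOM extension; wrapEnglish() is a hypothetical helper.
function wrapEnglish(string $html): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    // The XML prolog is a common workaround so loadHTML() reads the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Copy the result into an array first, because the loop changes the tree.
    $textNodes = iterator_to_array($xpath->query('//body//text()'));
    foreach ($textNodes as $node) {
        // With PREG_SPLIT_DELIM_CAPTURE, odd-numbered parts are the matched English runs.
        $parts = preg_split('/([A-Za-z]+(?:\s+[A-Za-z]+)*)/u', $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if ($parts === false || count($parts) < 2) {
            continue; // no English in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($part === '') {
                continue;
            }
            if ($i % 2 === 1) {
                $span = $doc->createElement('span');
                $span->setAttribute('class', 'englishText');
                $span->appendChild($doc->createTextNode($part));
                $fragment->appendChild($span);
            } else {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $node->parentNode->replaceChild($fragment, $node);
    }

    // Serialize only the children of <body>, dropping the wrapper the parser added.
    $out = '';
    foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$result = wrapEnglish('<div>Hello, 123 <div>World</div></div>');
echo $result;
```

Because the nested div's text is its own text node, the outer and inner text are each processed once, which avoids the double-processing problem described in the question.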

Alternative to concatenating $return string to build HTML table [closed]

I have a program in which a function returns a rather large HTML table using a series of dot equal sign concatenation assignments like this:
function callme($foo) {
    $return = '';
    $return .= '<table>';
    foreach ($foo as $key => $bar) {
        $return .= '<tr><td style="css">bla bla bla'.$bar.'</td>';
        $return .= '<td>more stuff, modal buttons</td></tr>';
        $return .= 'etc etc etc...';
    }
    $return .= '</table>';
    return $return;
}
I'd like to find a less error-prone, less WET way of handling this, and I'm not sure whether I should build an array that I add each item to and then implode/join, or whether using an object would make sense.
I've googled around a bit without a whole lot of luck and would love some input from the SO community.
EDIT:
Not sure if there's any way of making this post On-Topic as I can see now that there's really no way to select a correct solution, so since it's too late to delete it (and maybe post to the php mailing list as would have been more appropriate), I'll at least try and make it more useful.
I spent 30 minutes debugging a script in which a table was being generated by concatenating the HTML, text and variable components using the "dot equals" method, and where a bug was causing every text element to be rendered as an anchor (<a>some text</a>). After removing all the JavaScript, which I expected was somehow generating the anchor tags in the DOM elements, I tracked down the bug in the following line:
$return .= '<br/><div id="visitMBO" class="btn visitMBO" style="display:none"><a href="'.$eventLinkURL.'" target="_blank">Manage on MindBody Site<a/></div>';
Error is at the end of the line: <a/>. Whoops.
It's a fairly long table including lots of CSS classes and ids used by JavaScript, as well as many variables, so there are tons of single quotes, double quotes and dots all over the place. Now I want to be able to render different variations of it based on user input, and having either multiple versions of it separated by test clauses or tests scattered throughout it will probably be even worse.
Asking about an HTML table generator class would have been more appropriate, but also off-topic. Options there include the fairly light-weight: htmlgen, the more robust Cellbrush and DynaWeb's Table Class which is also fairly light-weight as far as I can tell.
As one answer below states, .= is a reasonable approach if the table's not too too complex, and if I had merely kept each line to a shorter length the problem would have been more visible. As it was, the anchor's start tag was practically falling off the edge of the page.
Noted in answers below (with links) other options for separating logic, markup and data would be using a template engine or using php's DOM class (tutorial). Heredoc would also clean up some of the redundancy and multiple quotation marks a bit.
If you want a classful way to do this, check out DomDocument http://www.php.net/manual/en/class.domdocument.php
You can use it to construct HTML in a very structured way. However, given this example I don't think there is necessarily a right way to solve the issue. You should find a style that you are comfortable with and that makes sense to you and your team; that's the best way to help avoid buggy code.
For me, I usually just write code using .= exactly as you are, so long as the code is fairly small. It's a very easy method to read. If I can glance at a piece of code and understand it without thinking, it's generally OK code.
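As a rough illustration of the DOMDocument route, the sketch below builds the table as nodes, so a stray closing tag like the one in the question cannot occur and cell text is escaped on output. The function name buildTable and the sample rows are hypothetical.

```php
<?php
// Sketch: build the table as DOM nodes so tags are always balanced and text is escaped.
// buildTable() and its $rows parameter are hypothetical names.
function buildTable(array $rows): string
{
    $doc = new DOMDocument();
    $table = $doc->appendChild($doc->createElement('table'));
    foreach ($rows as $row) {
        $tr = $table->appendChild($doc->createElement('tr'));
        foreach ($row as $cell) {
            $td = $tr->appendChild($doc->createElement('td'));
            // Text nodes are escaped (<, > and &) when the document is serialized.
            $td->appendChild($doc->createTextNode((string) $cell));
        }
    }
    return $doc->saveHTML($table);
}

$html = buildTable([['Fast 7', 'awesome', '7.6'], ['A & B', '<b>bold</b>', '2.1']]);
echo $html;
```

This is more verbose than .= for a small table, which is the trade-off the answer above describes.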
One option would be to use a templating engine rather than putting your HTML code directly in your PHP function. An example is mustache.
This will allow you to construct your HTML in a separate file and pass data to it.
Your template file might look something like this:
<table>
{{# foo }}
<tr>
<td>{{ bar }}</td>
</tr>
{{/ foo }}
</table>
And you would then get your HTML string from it like this:
$m = new Mustache_Engine(array(
'loader' => new Mustache_Loader_FilesystemLoader(dirname(__FILE__) . '/templates'),
));
$html = $m->render('table_template', array('foo' => $foo));
There are many ways to do things with strings in PHP. It's worth noting that if you have a large contiguous block of HTML that you want to concatenate with a string, one helpful way to do so without quotes is with Heredoc. For example:
function callme(&$rows) {
$return = '<table>' . PHP_EOL;
foreach ($rows as $row){
// Start a heredoc-delimited string
$return .= <<<EOD
<tr>
<td>Movie name: "$row[0]"</td>
<td>Review: $row[1]</td>
<td>IMDB rating: $row[2]</td>
<td>Summary: "$row[0]" was $row[1] with a rating of $row[2]</td>
</tr>
EOD;
// End a heredoc-delimited string (must be on its own line)
}
return $return . PHP_EOL . '</table>';
}
$arr = array(array("Fast 7", "awesome", "7.6"), array("Glitter", "horrible", "2.1"));
echo callme($arr);
Result:
<table>
<tr>
<td>Movie name: "Fast 7"</td>
<td>Review: awesome</td>
<td>IMDB rating: 7.6</td>
<td>Summary: "Fast 7" was awesome with a rating of 7.6</td>
</tr> <tr>
<td>Movie name: "Glitter"</td>
<td>Review: horrible</td>
<td>IMDB rating: 2.1</td>
<td>Summary: "Glitter" was horrible with a rating of 2.1</td>
</tr>
</table>

PHP Function vs Echo

Lately I find myself working with a ton of tabular data. I am more than comfortable writing the raw HTML, but I'm wondering if there's an easier way (or a library worth implementing) that reduces the time spent writing HTML tags such as
<tr><td></td></tr>
I have created my own custom function, but I think it's ultimately not helping and could potentially be slowing down my script. My project is small, so maybe it can cope with that. An example of the raw approach:
echo '<tr class="test_class">
<td>' . $content . '</td>
<td>' . $second_content . '</td>
</tr>';
Here is an example with my current function:
tr("test_class");
td(); echo $content; escape(td);
td(); echo $second_content; escape(td);
escape(tr);
Looking forward to hearing peoples thoughts.
There are multiple ways of doing this:
1. Write your own HTML helper library containing classes that generate HTML elements from their data source. For instance, you could call them like this:
<?php
HtmlHelper::Table("someArrayOfValues", "idOfTable", "styleOfTable");
?>
This is a good reusable solution if you implement the idea properly. I was playing with this myself a few days ago; it really is simple.
2. If you find 1. difficult, you can break the idea down. Don't go as deep as you've shown, though; generate whole rows instead:
<?php
foreach ($myArray as $key => $value)
{
    echo HtmlHelper::Row(...);
}
?>
3. Find a library that provides this functionality. I can't help you on this one, I'm afraid; I like to have control over the generated markup.
Hope you get the idea.
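A helper along those lines might look like the following sketch. The class name matches the calls above, but the method signatures, the escaping with htmlspecialchars() and the sample data are assumptions, not the answerer's actual implementation.

```php
<?php
// Sketch of a tiny helper class; the method signatures here are hypothetical.
class HtmlHelper
{
    // Render one <tr> from a list of cell values, escaping each value.
    public static function row(array $cells, string $tag = 'td'): string
    {
        $html = '';
        foreach ($cells as $cell) {
            $html .= "<$tag>" . htmlspecialchars((string) $cell) . "</$tag>";
        }
        return "<tr>$html</tr>";
    }

    // Render a whole table from an array of rows, with optional id and class.
    public static function table(array $rows, string $id = '', string $class = ''): string
    {
        $attrs = '';
        if ($id !== '') {
            $attrs .= ' id="' . htmlspecialchars($id) . '"';
        }
        if ($class !== '') {
            $attrs .= ' class="' . htmlspecialchars($class) . '"';
        }
        $body = implode("\n", array_map([self::class, 'row'], $rows));
        return "<table$attrs>\n$body\n</table>";
    }
}

$table = HtmlHelper::table([['Joe', 'New York', '555999']], 'people', 'test_class');
echo $table;
```

Keeping row() public covers both the whole-table call and the row-at-a-time loop.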
You can use PHP's alternative templating syntax, like this. The <?= echo tag needs the short_open_tag setting turned on only before PHP 5.4; from 5.4 onwards it is always available:
<table>
<?php foreach($myList as $key => $value): ?>
<tr>
<td><?= $value["key1"] ?></td>
<td><?= $value["key2"] ?></td>
...
</tr>
<?php endforeach; ?>
</table>
That might make your job easier, in terms of writing tabular data.
The biggest strength of PHP for web development is how much it can do with few calls, and in this case in particular, that it can echo content without working around language constraints. So unless the case really warrants it, writing the HTML directly with echoes will be the simplest solution and the one that takes the most advantage of PHP, and simplicity is always a good thing.
That being said, if you have a lot of complex table generation, the code would be more readable if you use a library like http://pear.php.net/package/HTML_Table/. Additionally, if you want to do something like serialize an object into a table display, then creating a serializer made for that purpose would be the solution most in line with the functionality.
In the code above, a transparent utility function certainly wouldn't hurt. But rather than the direction you're going, if you consistently have the same number of columns you could use an array that is joined with the table cell separation markup (a function that produces one row at a time).
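The "array joined with the cell separator" idea could be sketched like this. The function name tableRow and the optional class parameter are hypothetical, and escaping the values is an addition that the answer does not mention.

```php
<?php
// Sketch: one row at a time, cells joined with the cell-separator markup.
// tableRow() is a hypothetical name; values are escaped before joining.
function tableRow(array $cells, string $class = ''): string
{
    $open = $class === '' ? '<tr>' : '<tr class="' . htmlspecialchars($class) . '">';
    $escaped = array_map('htmlspecialchars', $cells);
    return $open . '<td>' . implode('</td><td>', $escaped) . '</td></tr>';
}

$row = tableRow(['first', 'second'], 'test_class');
echo $row;
```

Note that an empty array still yields a single empty cell with this approach.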
It is more comfortable to use a template engine. Try Twig or Smarty.

How to make a small php link "spider" and extract data?

I want to spider a simple white website that has lots of HTML links, each of which represents
a phone number, name and address. From each page I want to extract exactly the 3 fields
that are between the 3 TDs, such as:
<div id="idTabResults2" align="center">
<TABLE border='1'>
<tr><th>Name</th><th>Adress</th><th>Phone number</th></tr>
<TR>
<TD>Joe</TD><TD>New York</TD><TD>555999</TD></TR>
</TABLE>
</div>
So in the example above I would get "Joe", "New York" and 555999.
I'm using PHP, and later MySQL to insert every result into my DB.
Can someone point me in the right direction on how to go about this?
Maybe a faster (and simpler) way than PeeHaa's solution:
Retrieve the page using file_get_contents()
Parse it with Simple DOM Parser
For instance:
<?php
require("simple_html_dom.php");
$data = file_get_contents(YOUR_PAGE_HERE);
$html = str_get_html($data);
$tds = $html->find('td');
foreach ($tds as $td) {
// Do something
}
?>
You can retrieve the page content using cURL.
Once you have the content you can parse it with PHP's DOM.
Do not attempt to try and parse it using regex. God will kill a kitten just for that.
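For the DOM route, a minimal sketch might look like the following. It parses a string, so the fetching step (cURL or file_get_contents()) is left out; the function name extractRows is hypothetical, and the markup is the sample from the question.

```php
<?php
// Sketch: pull the text of every <td> row out of the results table with PHP's DOM.
// extractRows() is a hypothetical helper; the div id comes from the question's markup.
function extractRows(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate imperfect real-world markup
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $rows = [];
    // Only rows that contain <td> cells; the header row has only <th> and is skipped.
    foreach ($xpath->query('//div[@id="idTabResults2"]//tr[td]') as $tr) {
        $cells = [];
        foreach ($xpath->query('td', $tr) as $td) {
            $cells[] = trim($td->textContent);
        }
        $rows[] = $cells;
    }
    return $rows;
}

// In a real spider $page would come from cURL or file_get_contents().
$page = <<<HTML
<div id="idTabResults2" align="center">
<TABLE border='1'>
<tr><th>Name</th><th>Adress</th><th>Phone number</th></tr>
<TR>
<TD>Joe</TD><TD>New York</TD><TD>555999</TD></TR>
</TABLE>
</div>
HTML;

$rows = extractRows($page);
print_r($rows);
```

Each inner array holds the name, address and phone number for one row, ready to be passed to a prepared INSERT statement.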

Using regex in php to add a cell in a row

As usual I have trouble writing a good regex.
I am trying to make a plugin for Joomla to add a button to the optional print, email and PDF buttons produced by the core on the right of article titles. If I succeed I will distribute it under the GPL. None of the examples I found seem to work and I would like to create a php-only solution.
The idea is to use the unique pattern of the Joomla output for article titles and buttons for one or more regex. One regex would find the right table by looking for a table with class "contentpaneopen" (of which there are several in a page) and containing a cell with class "contentheading". A second regex could check if in that table there is a cell with class "buttonheading". The number of these cells could be from zero to three but I could use this check if the first regex returns more than one match. With this, I would like to replace the table by the same table but with an extra cell holding the button I want to add. I could do that by taking off the last row and table closing tags and inserting my button cell before adding those closing tags again.
The normal Joomla output looks like this:
<table class="contentpaneopen">
<tbody>
<tr>
<td width="100%" class="contentheading">
<a class="contentpagetitle" href="url">Title Here</a>
</td>
<td width="100%" align="right" class="buttonheading">
<a rel="nofollow" onclick="etc" title="PDF" href="url"><img alt="PDF" src="/templates/neutral/images/pdf_button.png"/></a>
</td>
<td width="100%" align="right" class="buttonheading">
<a rel="nofollow" onclick="etc" title="Print" href="url"><img alt="Print" src="/templates/neutral/images/printButton.png" ></a>
</td>
</tr>
</tbody>
</table>
The code would very roughly be something like this:
$subject = $article;
$pattern1 = '[regex1]'; //<table class="contentpaneopen">etc</table>
preg_match($pattern1, $subject, $match);
$pattern2 = '[regex2]'; //</tr></tbody></table>
$replacement = '[mybutton]';
echo preg_replace($pattern2, $replacement, $match[0]);
Without a good regex there is little point doing the rest of the code, so I hope someone can help with that!
This is a common question on SO and the answer is always the same: regular expressions are a poor choice for parsing or processing HTML or XML. There are many ways they can break down. PHP comes with at least three built-in HTML parsers that will be far more robust.
Take a look at Parse HTML With PHP And DOM and use something like:
$html = new DOMDocument;
$html->preserveWhiteSpace = false; // set before loading; it has no effect afterwards
$html->loadHTML($source);
$tables = $html->getElementsByTagName('table');
foreach ($tables as $table) {
    if ($table->getAttribute('class') == 'contentpaneopen') {
        // replace it with something else
    }
}
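To make the "replace it with something else" step more concrete, here is a hedged sketch that appends an extra button cell to the title row instead of rebuilding the table. The function name addButtonCell and the button markup are hypothetical; the class names come from the Joomla output shown in the question.

```php
<?php
// Sketch: append a button cell to the title row with PHP's DOM instead of a regex.
// addButtonCell() and the button markup are hypothetical; class names are from the question.
function addButtonCell(string $source, string $label, string $href): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($source);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Rows holding a "contentheading" cell inside a "contentpaneopen" table.
    $rows = $xpath->query('//table[@class="contentpaneopen"]//tr[td[@class="contentheading"]]');
    foreach ($rows as $tr) {
        $td = $doc->createElement('td');
        $td->setAttribute('align', 'right');
        $td->setAttribute('class', 'buttonheading');
        $a = $doc->createElement('a');
        $a->setAttribute('href', $href);
        $a->appendChild($doc->createTextNode($label));
        $td->appendChild($a);
        $tr->appendChild($td); // becomes the last cell of the row
    }
    return $doc->saveHTML();
}

$article = '<table class="contentpaneopen"><tbody><tr>'
         . '<td class="contentheading"><a class="contentpagetitle" href="url">Title Here</a></td>'
         . '</tr></tbody></table>';
$out = addButtonCell($article, 'My button', 'url');
echo $out;
```

saveHTML() without an argument returns a full document including the html and body wrappers that loadHTML() adds, so a plugin working on an article fragment would need to strip those or serialize only the nodes it wants.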
Is there a reason that you need to use regex for this? DOM parsing would be much more straightforward.
Since a plugin in the scenario you describe is called every time a page loads, a regex approach is faster than a DOM call, which is why a lot of people use it. Joomla's documentation also explains why a regex is better than a DOM approach in this scenario.
The problem with your solution is that it's tied to Joomla's default template. I don't remember whether every template uses the same class="contentheading" structure. If you plan to release such an extension under the GPL, you should be careful about that.
What you're trying to do looks to me like a template override, explained in more detail here. It is a much simpler solution. For example, this is the PHP that creates your article's title:
<div class="componentheading<?php echo $this->params->get('pageclass_sfx')?>">
<h2><?php echo $this->escape($this->params->get('page_title')); ?></h2>
</div>
You just need to override the com_content article template and echo the HTML for the PDF buttons after the ->get('page_title') call. If you don't want to echo the HTML, you can create a module or a component, import it in the template, and after ->get('page_title') call the methods in your component that output the HTML.
This component could have various checkboxes "show pdf (yes/no)" and other interesting actions.
