Using regex in PHP to add a cell in a row

As usual I have trouble writing a good regex.
I am trying to make a plugin for Joomla to add a button to the optional print, email and PDF buttons produced by the core on the right of article titles. If I succeed I will distribute it under the GPL. None of the examples I found seem to work and I would like to create a php-only solution.
The idea is to use the unique pattern of the Joomla output for article titles and buttons for one or more regex. One regex would find the right table by looking for a table with class "contentpaneopen" (of which there are several in a page) and containing a cell with class "contentheading". A second regex could check if in that table there is a cell with class "buttonheading". The number of these cells could be from zero to three but I could use this check if the first regex returns more than one match. With this, I would like to replace the table by the same table but with an extra cell holding the button I want to add. I could do that by taking off the last row and table closing tags and inserting my button cell before adding those closing tags again.
The normal Joomla output looks like this:
<table class="contentpaneopen">
<tbody>
<tr>
<td width="100%" class="contentheading">
<a class="contentpagetitle" href="url">Title Here</a>
</td>
<td width="100%" align="right" class="buttonheading">
<a rel="nofollow" onclick="etc" title="PDF" href="url"><img alt="PDF" src="/templates/neutral/images/pdf_button.png"/></a>
</td>
<td width="100%" align="right" class="buttonheading">
<a rel="nofollow" onclick="etc" title="Print" href="url"><img alt="Print" src="/templates/neutral/images/printButton.png" ></a>
</td>
</tr>
</tbody>
</table>
The code would very roughly be something like this:
$subject = $article;
$pattern1 = '[regex1]'; // matches <table class="contentpaneopen">etc</table>
preg_match($pattern1, $subject, $match);
$pattern2 = '[regex2]'; // matches </tr></tbody></table>
$replacement = '[mybutton]'; // the button cell followed by the closing tags again
echo preg_replace($pattern2, $replacement, $match[0]);
Without a good regex there is little point doing the rest of the code, so I hope someone can help with that!

This is a common question on SO and the answer is always the same: regular expressions are a poor choice for parsing or processing HTML or XML. There are many ways they can break down. PHP comes with at least three built-in HTML parsers that will be far more robust.
Take a look at Parse HTML With PHP And DOM and use something like:
$html = new DOMDocument;
// preserveWhiteSpace is only consulted while loading, so set it first
$html->preserveWhiteSpace = false;
$html->loadHTML($source);
$tables = $html->getElementsByTagName('table');
foreach ($tables as $table) {
    if ($table->getAttribute('class') == 'contentpaneopen') {
        // replace it with something else
    }
}
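To make the DOM suggestion concrete for the table in the question, here is a minimal sketch using DOMDocument together with DOMXPath. The function name addButtonCell, the wrapper id article-root and the plain text link used as the button are assumptions made up for illustration; only the class names come from the Joomla output shown in the question.

```php
<?php
// Hypothetical helper: appends one extra "buttonheading" cell to the title row
// of every table.contentpaneopen that contains a td.contentheading.
function addButtonCell(string $articleHtml, string $buttonLabel, string $buttonUrl): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate imperfect template markup
    // The meta tag tells the parser the fragment is UTF-8; the div is a wrapper
    // that makes it easy to get the fragment back out afterwards.
    $doc->loadHTML(
        '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head>'
        . '<body><div id="article-root">' . $articleHtml . '</div></body></html>'
    );
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Only rows that hold the title cell, inside the right kind of table
    $rows = $xpath->query('//table[@class="contentpaneopen"]//tr[td[@class="contentheading"]]');

    foreach ($rows as $row) {
        $link = $doc->createElement('a');
        $link->setAttribute('href', $buttonUrl);
        $link->appendChild($doc->createTextNode($buttonLabel));

        $cell = $doc->createElement('td');
        $cell->setAttribute('align', 'right');
        $cell->setAttribute('class', 'buttonheading');
        $cell->appendChild($link);

        $row->appendChild($cell); // the new cell becomes the last cell in the row
    }

    // Serialize only the children of the wrapper div
    $root = $xpath->query('//div[@id="article-root"]')->item(0);
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Because the row is selected by its contentheading cell, tables with class contentpaneopen that do not hold a title are left alone, and it does not matter whether zero or three buttonheading cells are already present.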

Is there a reason that you need to use regex for this? DOM parsing would be much more straightforward.

Since a plugin in the scenario you provided is called every time you load a page, a regex approach is faster than a DOM call, which is why a lot of people use it. Joomla's documentation also explains why a regex is a better fit for this scenario than a DOM approach.
The problem with your solution is that it's tied to Joomla's default template. I don't remember whether every template uses the same class="contentheading" structure. If you plan to release such an extension under the GPL, you should be careful about that.
What you're trying to do looks to me like a template override, explained in more detail here. It is a much simpler solution. For example, this is the PHP that creates your article's title:
<div class="componentheading<?php echo $this->params->get('pageclass_sfx')?>">
<h2><?php echo $this->escape($this->params->get('page_title')); ?></h2>
</div>
You just need to override the com_content article template and echo the HTML for the PDF buttons after the >get('page_title') call. If you don't want to echo the HTML directly, you can create a module or a component, import it in the template, and call the methods in your component that output the HTML after the >get('page_title') call.
This component could have various checkboxes "show pdf (yes/no)" and other interesting actions.

Related

Adding a class to all English text in HTML?

The requirement is to add an englishText class around all English words on a page. The problem is similar to this, but the JavaScript solutions won't work for me. I require a PHP example to solve this problem. For example, if you have this:
<p>Hello, 你好</p>
<div>It is me, 你好</div>
<strong>你好, how are you</strong>
Afterwards I need to end with:
<p><span class="englishText">Hello</span>, 你好</p>
<div><span class="englishText">It is me</span>, 你好</div>
<strong>你好, <span class="englishText">how are you</span></strong>
There are more complicated cases, such as:
<strong>你好, TEXT?</strong>
<div>It is me, 你好</div>
This should become:
<strong>你好, <span class="englishText">TEXT?</span></strong>
<div><span class="englishText">It is me</span>, 你好</div>
But I think I can sort out these edge cases once I know how to actually iterate over the document correctly.
I can't use javascript to solve this because:
This needs to work on browsers that don't support javascript
I would prefer to have the classes in place on page load so there isn't any delay in rendering the text in the correct font.
I figured the best way to iterate over the document would be using PHP Simple HTML DOM Parser.
But the problem is that if I try this:
foreach ($html->find('div') as $element)
{
// make changes here
}
My concern is that the following case will cause chaos:
<div>
Hello , 你好
<div>Hello, 你好</div>
</div>
As you can see, it's going to go into the first div and then if I process that node, I will be processing the node within that too.
Any ideas how to get around this and only select the nodes for processing once?
UPDATE
I realise now that what I effectively need is a recursive way to iterate over HTML elements with the ability to change them as I iterate over them.
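You should traverse the siblings instead; that way you won't get into trouble with such cases.
Something like this:
<?php
foreach ($html->find('div') as $element)
{
    // next_sibling() returns a single node (or null), so follow the chain
    for ($sibling = $element->next_sibling(); $sibling !== null; $sibling = $sibling->next_sibling()) {
        echo $sibling->plaintext . "\n";
    }
}
?>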
Or, a much easier way in my opinion:
1. Change every <*> to "\n"."<*>" with preg_replace().
2. Make an array of lines with $lines = explode("\n", $html_string);
3. Strip the tags from each line:
foreach ($lines as $line) {
    $text = strip_tags($line);
    echo $text;
}
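Another way to visit each piece of text exactly once, even with nested divs, is to iterate over text nodes rather than elements. The sketch below uses PHP's built-in DOM extension instead of Simple HTML DOM. The function name wrapEnglishText, the wrapper id wrap-root and the rule that "English" means runs of ASCII letters separated by single spaces are all assumptions for illustration; punctuation cases such as "TEXT?" are deliberately left out.

```php
<?php
// Hypothetical helper: wraps runs of ASCII words in <span class="englishText">.
function wrapEnglishText(string $fragment): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    // The meta tag makes the parser read the fragment as UTF-8.
    $doc->loadHTML(
        '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head>'
        . '<body><div id="wrap-root">' . $fragment . '</div></body></html>'
    );
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Every text node appears exactly once here, however deeply it is nested.
    $textNodes = iterator_to_array($xpath->query('//div[@id="wrap-root"]//text()'));

    foreach ($textNodes as $textNode) {
        // With DELIM_CAPTURE, odd indexes hold the matched English runs.
        $parts = preg_split(
            '/([A-Za-z]+(?: [A-Za-z]+)*)/u',
            $textNode->nodeValue,
            -1,
            PREG_SPLIT_DELIM_CAPTURE
        );
        if (count($parts) < 3) {
            continue; // no English run in this text node
        }

        $replacement = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($part === '') {
                continue;
            }
            if ($i % 2 === 1) {
                $span = $doc->createElement('span');
                $span->setAttribute('class', 'englishText');
                $span->appendChild($doc->createTextNode($part));
                $replacement->appendChild($span);
            } else {
                $replacement->appendChild($doc->createTextNode($part));
            }
        }
        $textNode->parentNode->replaceChild($replacement, $textNode);
    }

    // Return only what is inside the wrapper div
    $root = $xpath->query('//div[@id="wrap-root"]')->item(0);
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

The list of text nodes is collected before anything is modified, so the spans created along the way are never processed a second time. Text inside script or style elements would be wrapped as well, so a real page would need to skip those.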

Php_simple_html_dom on a table

I would like to extract data from a website, whose code is written like this:
...
<tr>
<td class="something1"><a class="whatever" href="#">NAME</a> </td>
<td class="something2">DATA</td>
<td class="something3">NUMERIC DATA</td>
</tr>
...
In particular, I have my NAME list in my MySQL database, and if my NAME is equal to a NAME on this website, I want to print the corresponding NUMERIC DATA on my website.
I know I can do something with php_simple_html_dom, but I can't get it to work. Can you please help me?
Thanks!
So you want to read NAME first and, if it is relevant, read the rest? You can fetch a website's HTML as explained here: How do I get the HTML code of a web page in PHP?
$html = file_get_contents('http://pathToTheWebsite.com/thePage');
Now let's parse $html with a regex. (You can use that library too; its documentation tells you how.)
preg_match('~<td class="something1"><a class="whatever" href="#">(?<name>[^<]+)</a>\s*</td>~', $html, $matches);
Now $matches['name'] will contain the NAME. You can do the same for the rest, and maybe clean up that regex a little; this was just an example.
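As an alternative to the regex, the same lookup can be sketched with PHP's built-in DOMDocument and DOMXPath (not php_simple_html_dom). The function name extractNumericDataByName is made up for illustration, and the class names something1 and something3 are taken from the placeholder markup in the question, so they would need to be replaced with the real ones.

```php
<?php
// Hypothetical helper: returns an array of NAME => NUMERIC DATA, one entry per row.
function extractNumericDataByName(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // scraped pages are rarely valid HTML
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $result = [];

    // Visit only rows that contain a name cell
    foreach ($xpath->query('//tr[td[@class="something1"]]') as $row) {
        // string(...) returns the text of the first matching node, relative to $row
        $name = trim($xpath->evaluate('string(td[@class="something1"]/a)', $row));
        $value = trim($xpath->evaluate('string(td[@class="something3"])', $row));
        if ($name !== '') {
            $result[$name] = $value;
        }
    }
    return $result;
}
```

With the result in hand, each NAME from the MySQL list can be checked with isset() against the returned array and the matching value printed.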

Using Simple HTML DOM to Scrape?

Simple HTML DOM is basically a PHP file you include in your pages that lets you do simple web scraping. It's good for the most part, but I can't figure out the manual as I'm not much of a coder. Are there any sites or guides out there that offer easier help for this? (The one at php.net is a bit too complicated for me at the moment.) Is there a better place to ask this kind of question?
The site for it is at: http://simplehtmldom.sourceforge.net/manual.htm
I can scrape stuff that has specific classes like <tr class="group">, but not for stuff that's in between. For example.. This is what I currently use...
$url = 'http://www.test.com';
$html = file_get_html($url);
foreach ($html->find('tr[class=group]') as $result)
{
    $first = $result->find('td[class=category1]', 0);
    $second = $result->find('td[class=category2]', 0);
    echo $first . $second;
}
But here is the kind of code I'm trying to scrape.
<table>
<tr class="Group">
<td>
<dl class="Summary">
<dt>Heading 1</dt>
<dd>Cat</dd>
<dd>Bacon</dd>
<dt>Heading 2</dt>
<dd>Narwhal</dd>
<dd>Ice Soap</dd>
</dl>
</td>
</tr>
</table>
I'm trying to extract the content of each <dt> and put it to a variable. Then I'm trying to extract the content of each <dd> and put it to a variable, but nothing I tried works. Here's the best I could find, but it gives me back only the first heading repeatedly rather than going to the second.
foreach($html->find('tr[class=Summary]') as $result2)
{
echo $result2->find('dt',0)->innertext;
}
Thanks to anyone who can help. Sorry if this is not clear or that it's so long. Ideally I'd like to be able to understand these DOM commands more as I'd like to figure this out myself rather than someone here just do it (but I'd appreciate either).
TL;DR: I am trying to understand how to use the commands listed in the manual (url above). The 'manual' isn't easy enough. How do you go about learning this stuff?
I think $result2->find('dt',0) gives you back element 0, which is the first. If you omit the index, you should get an array (or node list) of all matches instead. Also note that in the markup you posted the Summary class is on the dl, not on a tr. Something like this:
foreach ($html->find('dl[class=Summary]') as $result2)
{
    foreach ($result2->find('dt') as $node)
    {
        echo $node->innertext;
    }
}
You don't strictly need the outer foreach loop, since there's only one dl in your document. You could even leave it out altogether and find every dt in the document, but for tools like this I think it's good to be both flexible and strict: you are prepared for multiple lists, but you don't accidentally parse dts from anywhere else in the document.
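To get each heading together with the dd items that follow it, the children of the dl can be walked in order. The following is a rough sketch using PHP's built-in DOMDocument rather than Simple HTML DOM; the function name groupDefinitions is an assumption, and the Summary class comes from the markup in the question.

```php
<?php
// Hypothetical helper: returns an array of heading => list of dd texts.
function groupDefinitions(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $groups = [];

    foreach ($xpath->query('//dl[@class="Summary"]') as $dl) {
        $current = null; // heading the following dd elements belong to
        foreach ($dl->childNodes as $child) {
            if ($child->nodeName === 'dt') {
                $current = trim($child->textContent);
                $groups[$current] = [];
            } elseif ($child->nodeName === 'dd' && $current !== null) {
                $groups[$current][] = trim($child->textContent);
            }
            // whitespace text nodes between the tags are ignored
        }
    }
    return $groups;
}
```

For the table in the question this groups Cat and Bacon under Heading 1, and Narwhal and Ice Soap under Heading 2.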

How to make a small php link "spider" and extract data?

I want to spider a simple white website that has lots of HTML links, each of which represents
a phone number, name and address. From each page I want to extract exactly the 3 fields
that are between the 3 TDs, such as:
<div id="idTabResults2" align="center">
<TABLE border='1'>
<tr><th>Name</th><th>Adress</th><th>Phone number</th></tr>
<TR>
<TD>Joe</TD><TD>New York</TD><TD>555999</TD></TR>
</TABLE>
</div>
So in the example above I would get "Joe", "New York" and 555999.
I'm using PHP, and MySQL later to insert every result into my DB.
Can someone point me in the right direction on how to go about this?
Maybe a faster (and simpler) way than PeeHaa's solution:
Retrieve the page using file_get_contents()
Parse it with Simple DOM Parser
For instance:
<?php
require("simple_html_dom.php");
$data = file_get_contents(YOUR_PAGE_HERE);
$html = str_get_html($data);
$tds = $html->find('td');
foreach ($tds as $td) {
    // Do something
}
?>
You can retrieve the page content using cURL.
Once you have the content you can parse it with PHP's DOM.
Do not attempt to try and parse it using regex. God will kill a kitten just for that.
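As a rough illustration of the "parse it with PHP's DOM" step, the sketch below pulls the three fields out of each data row of the table shown in the question. The function name extractContactRows and the array keys name, address and phone are assumptions; the id idTabResults2 comes from the question's markup.

```php
<?php
// Hypothetical helper: returns one array per data row of the results table.
function extractContactRows(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // ignore warnings from sloppy markup
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $rows = [];

    // tr[td] skips the header row, which only contains <th> cells.
    // The parser lower-cases tag names, so <TR> and <TD> match as tr and td.
    foreach ($xpath->query('//div[@id="idTabResults2"]//tr[td]') as $tr) {
        $cells = [];
        foreach ($xpath->query('td', $tr) as $td) {
            $cells[] = trim($td->textContent);
        }
        if (count($cells) === 3) {
            $rows[] = [
                'name'    => $cells[0],
                'address' => $cells[1],
                'phone'   => $cells[2],
            ];
        }
    }
    return $rows;
}
```

Each returned row can then be passed to a prepared INSERT statement for the database.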

Use PHP to extract simple numeric data from website and display as HTML

I have no clue at all.
How do I extract the numeric % data on the right from the link below and display them on my website without updating daily myself? Can a simple PHP + HTML solve my problem?
http://www.mrrebates.com/merchants/all_merchants.asp
Meanwhile, how do I automatically hyperlink the extracted numeric % and display it as a link for that retailer? for example,
1 Stop Florists------------------------- 8% (this 8% should be displayed as hyperlink for that retailer, unfortunately I am too new to have more than 1 hyperlink)
at the same time integrating my referral id (shown below) on to that 8% hyperlink
mrrebates.com?refid=420149
You can use curl to download the page, then use regular expressions to parse it up and print it out in whatever form you want. Here's some PHP code to do it:
<?php
// Download the page to a temporary file, then read it back in
system("curl -v http://www.mrrebates.com/merchants/all_merchants.asp > /tmp/x.txt");
$data = file_get_contents("/tmp/x.txt");
// Capture 1: link target, 2: store name, 3: rebate percentage
preg_match_all('/<td><a href="([^"]*)".*?<b>([^<]*)<\/b>.*?<td class="r">([^<]*)<\/td>/',
    $data, $matches, PREG_SET_ORDER);
foreach ($matches as $match) {
    $site_name = $match[2];
    $url = "http://www.mrrebates.com/{$match[1]}";
    $percent = $match[3];
    print "<a href='$url'>$site_name</a> ";
    print "<a href='$url'>$percent</a> <br/>";
}
That'll print out a list of links every time you refresh the page. I have no idea how referral codes work on that site, but I imagine it'll be pretty easy to tack it onto the $url variable.
One caveat here is that every time you refresh your page, it's going to have to load the other site first and parse it so it'll be slow. You could separate out the system("curl...") call into a separate file and only do that once an hour or so if you want to make it go faster. Good luck.
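The "only do that once an hour" idea can be sketched as a small file cache. The function name fetchWithCache and its parameters are made up for illustration, and the download itself is passed in as a callback so that any method (curl, file_get_contents, and so on) can be used.

```php
<?php
// Hypothetical helper: returns the cached copy while it is fresh enough,
// otherwise calls $fetch, stores the result and returns it.
function fetchWithCache(string $cacheFile, int $maxAgeSeconds, callable $fetch): string
{
    clearstatcache(true, $cacheFile); // make sure filemtime() is not stale

    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $maxAgeSeconds) {
        return file_get_contents($cacheFile);
    }

    $content = $fetch();
    if (!is_string($content)) {
        throw new RuntimeException('Download failed, nothing was cached');
    }

    file_put_contents($cacheFile, $content, LOCK_EX);
    return $content;
}

// Example call (assumed cache path, one hour lifetime):
// $data = fetchWithCache('/tmp/all_merchants.html', 3600, function () {
//     return file_get_contents('http://www.mrrebates.com/merchants/all_merchants.asp');
// });
```

With this in place the remote site is only contacted when the cached copy is older than the given number of seconds, so most page loads just read a local file.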
Parsing XHTML is best left to a DOM parser. However, this type of scrape operation is messy business anyway. I will propose another solution and let you piece it together.
View the source of your HTML and find out the beginning and end of your table. Looks like you want this:
<table border="0" width="95%" cellpadding="3" cellspacing="0" style="border: 1px dotted #808080;">
<tr>
<td bgcolor="#FFCC00"><b>Store Name</b></td>
<td width="75" align="center" bgcolor="#FFCC00"><b>Coupons</b></td>
<td width="75" align="right" bgcolor="#FFCC00"><b>Rebate</b></td>
</tr>
And then look for the next occurrence of </table>.
Now, your content is in rows... look for <tr and </tr>.
I'll let you figure out how to break it down from there.
Now, to actually do all of this work, there are lots of functions that can help you. Start with strpos.
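A minimal sketch of that strpos approach could look like the following. The helper names textBetween and tableRows are assumptions for illustration; the idea is simply to cut out the text between two markers and then slice it into rows.

```php
<?php
// Hypothetical helper: returns the text between the first $start marker and
// the next $end marker, or null when either marker is missing.
function textBetween(string $haystack, string $start, string $end): ?string
{
    $from = strpos($haystack, $start);
    if ($from === false) {
        return null;
    }
    $from += strlen($start); // skip past the start marker itself

    $to = strpos($haystack, $end, $from);
    if ($to === false) {
        return null;
    }
    return substr($haystack, $from, $to - $from);
}

// Hypothetical helper: returns every <tr ...>...</tr> chunk found in $tableHtml.
function tableRows(string $tableHtml): array
{
    $rows = [];
    $offset = 0;
    $close = '</tr>';

    while (($start = strpos($tableHtml, '<tr', $offset)) !== false) {
        $end = strpos($tableHtml, $close, $start);
        if ($end === false) {
            break; // unterminated row, stop here
        }
        $end += strlen($close);
        $rows[] = substr($tableHtml, $start, $end - $start);
        $offset = $end; // continue searching after this row
    }
    return $rows;
}
```

The table's opening tag from the page source would be passed as the start marker and </table> as the end marker; each row can then be broken down further in the same way or cleaned up with strip_tags().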
This is probably better done with JavaScript (or at least I have usually tackled problems like this on the client side), particularly with the jQuery library.
You want to load the data on that page with something like
$.get("http://www.mrrebates.com/merchants/all_merchants.asp");
and parse the returned data to get the info you need (this should be simple enough that jQuery will do, though there are fuller DOM parsers). I'm not sure what you're familiar with so far, but it would probably be a lot to describe here. I see the % info is in a td with class "r".
Do you have just one referral ID, or one for each vendor? That will obviously matter.
