I would like to extract data from a website, whose code is written like this:
...
<tr>
<td class="something1"><a class="whatever" href="#">NAME</a> </td>
<td class="something2">DATA</td>
<td class="something3">NUMERIC DATA</td>
</tr>
...
In particular, I have a list of NAMEs in my MySQL database, and if one of my NAMEs is equal to a NAME on this website, I want to print the corresponding NUMERIC DATA on my website.
I know I can do something with php_simple_html_dom but I cannot really achieve this action. Can you please help me?
Thanks!
So you want to read NAME first and, if it is relevant, read the rest? You can read a website's DOM as explained here: How do I get the HTML code of a web page in PHP?
$html = file_get_contents('http://pathToTheWebsite.com/thePage');
Now let's parse the $html with some regex. (You can use that library too; the documentation tells you how to do it!)
preg_match('~<td class="something1"><a class="whatever" href="#">(?<name>[^<]+)</a>\s*</td>~', $html, $matches);
Now $matches['name'] will contain the NAME. Note the ~ delimiter: the pattern itself contains both / and #, so neither of those can delimit it. You can do the same for the rest and maybe clean up that regex a little; this was just an example.
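If you would rather not depend on the exact spacing of the markup, the same lookup can be sketched with PHP's built-in DOM extension. This is only a sketch under assumptions: the sample rows, the values in them and the $namesFromDb array (standing in for whatever your MySQL query returns) are made up for illustration; only the class names come from the question.

```php
<?php
// Sample markup in the shape shown in the question (the values are invented).
$html = '<table>
<tr>
<td class="something1"><a class="whatever" href="#">Alice</a> </td>
<td class="something2">foo</td>
<td class="something3">42</td>
</tr>
<tr>
<td class="something1"><a class="whatever" href="#">Bob</a> </td>
<td class="something2">bar</td>
<td class="something3">7</td>
</tr>
</table>';

// Hypothetical list of names, standing in for the rows read from MySQL.
$namesFromDb = ['Bob', 'Carol'];

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep warnings about sloppy HTML quiet
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$found = [];
foreach ($xpath->query('//tr[td[@class="something1"]]') as $row) {
    // string(...) returns the text content of the first matching node.
    $name  = trim($xpath->evaluate('string(td[@class="something1"]/a)', $row));
    $value = trim($xpath->evaluate('string(td[@class="something3"])', $row));
    if (in_array($name, $namesFromDb, true)) {
        $found[$name] = $value;
    }
}

foreach ($found as $name => $value) {
    echo htmlspecialchars($name), ': ', htmlspecialchars($value), "\n";
}
```

In a real script $html would come from file_get_contents() as above, and $namesFromDb from your own query.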
How would I go about getting a certain string from a webpage that has been scraped?
I am using SimpleBrowser in PHP to download a webpage into a variable.
The resultant webpage at a certain part has the following:
<tr>
<td class="label" width="350">POD Receiver Name: </td>
<td class="field" align="left">
<b>KRISTY</b>
</td>
</tr>
I want to get the value KRISTY into a variable, but I'm not really sure how.
I have no real experience with regex, so I wouldn't know where to start.
Any help appreciated!
To pull one specific part out from a known location, I'd use xpath. Try a tutorial such as http://ditio.net/2008/12/01/php-xpath-tutorial-advanced-xml-part-1/
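As a rough sketch of what that could look like with DOMDocument and DOMXPath: the snippet below wraps the row from the question in a table element, which is an assumption about the surrounding page, and picks the cell by its label text rather than by position.

```php
<?php
// The row from the question, wrapped in <table> (assumed context).
$v = '<table><tr>
<td class="label" width="350">POD Receiver Name: </td>
<td class="field" align="left">
<b>KRISTY</b>
</td>
</tr></table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep warnings about sloppy HTML quiet
$doc->loadHTML($v);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Find the label cell, step to the next cell, and read its <b> element.
$name = trim($xpath->evaluate(
    'string(//td[@class="label"][contains(., "POD Receiver Name")]/following-sibling::td[1]/b)'
));

echo $name, "\n";
```

Anchoring on the label text means the expression keeps working if other rows are added before this one.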
I am not sure why you are storing a page in a variable. But if you have a page stored as a string in a variable, you can use a regular expression to extract a string from it. For this particular example you can use something like this:
$v = '<tr>
<td class="label" width="350">POD Receiver Name: </td>
<td class="field" align="left">
<b>KRISTY</b>
</td>
</tr>';
preg_match('/\<b\>(.*?)\<\/b\>/', $v, $matches);
$result = $matches[1];
This particular regular expression gets everything between the bold tags.
If the structure can be depended on, give SimpleXML a shot:
$xml = simplexml_load_string(html_entity_decode($v));
$name = strval($xml->td[1]->b);//KRISTY
http://php.net/manual/en/function.simplexml-load-string.php
http://www.php.net/manual/en/class.simplexmlelement.php
I'm trying to match and select a bunch of columns of a table but can't get it working. Here's a simplified table:
<table>
<tr>
<td>Foo</td>
<td>Bar</td>
<td>Arb</td>
<td>...</td>
</tr>
<tr>
<td>Foo</td>
<td>Rab</td>
</tr>
</table>
So I want to get the TDs which contain Bar and Arb and the others, but not Foo, and nothing from the 2nd TR block. Does someone know if this is possible with an XPath expression?
Note: There's nothing static in there. The only way to get the correct cols is to match the first TDs content.
Don't know if my answer could help you out, but an adaptable XPath could look like:
//table/tr[1]/td[x] | //table/tr[1]/td[y]
where x and y depend on which columns you'd like to target. The | operator computes the union of two node-sets.
I'd also suggest installing the XPath Checker Firefox add-on ( https://addons.mozilla.org/en-US/firefox/addon/xpath-checker/ ), which lets you play around with XPath syntax in order to target a particular DOM element of a website.
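To try such an expression from PHP, something along these lines should do. It is a sketch with x = 2 and y = 3 chosen for illustration, and it uses //table//tr instead of //table/tr so that it also works when a tbody element sits between the table and its rows. The second query is an assumed variant that takes every cell after the first one in the first row.

```php
<?php
$html = '<table>
<tr><td>Foo</td><td>Bar</td><td>Arb</td><td>...</td></tr>
<tr><td>Foo</td><td>Rab</td></tr>
</table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep warnings about sloppy HTML quiet
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Union of two columns of the first row (x = 2, y = 3).
$cells = [];
foreach ($xpath->query('//table//tr[1]/td[2] | //table//tr[1]/td[3]') as $td) {
    $cells[] = $td->textContent;
}

// Variant: every cell of the first row except the first one.
$rest = [];
foreach ($xpath->query('//table//tr[1]/td[position() > 1]') as $td) {
    $rest[] = $td->textContent;
}

print_r($cells);
print_r($rest);
```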
I want to spider a simple white website that has lots of HTML links, each representing
a name, address and phone number. From each page I want to extract exactly the 3 fields
that sit in the 3 TDs, such as:
<div id="idTabResults2" align="center">
<TABLE border='1'>
<tr><th>Name</th><th>Adress</th><th>Phone number</th></tr>
<TR>
<TD>Joe</TD><TD>New York</TD><TD>555999</TD></TR>
</TABLE>
</div>
So in the example above I would get "Joe", "New York" and 555999.
I'm using PHP, and later MySQL to insert every result into my DB.
Can someone point me in the right direction on how to go about this?
Maybe a faster (and simpler) way than PeeHaa's solution:
Retrieve the page using file_get_contents()
Parse it with Simple DOM Parser
For instance:
<?php
require("simple_html_dom.php");
$data = file_get_contents(YOUR_PAGE_HERE);
$html = str_get_html($data);
$tds = $html->find('td');
foreach ($tds as $td) {
// Do something
}
?>
You can retrieve the page content using cURL.
Once you have the content you can parse it with PHP's DOM.
Do not attempt to try and parse it using regex. God will kill a kitten just for that.
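A minimal sketch of the DOM part, using the markup from the question as a stand-in for the downloaded page. In practice $page would be whatever cURL returns; the array keys name, address and phone are just illustrative names.

```php
<?php
// Stand-in for the page content fetched with cURL.
$page = '<div id="idTabResults2" align="center">
<TABLE border="1">
<tr><th>Name</th><th>Adress</th><th>Phone number</th></tr>
<TR>
<TD>Joe</TD><TD>New York</TD><TD>555999</TD></TR>
</TABLE>
</div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep warnings about sloppy HTML quiet
$doc->loadHTML($page);              // the HTML parser lowercases tag names
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$records = [];

// Only rows that contain <td> cells, which skips the header row of <th> cells.
foreach ($xpath->query('//div[@id="idTabResults2"]//tr[td]') as $row) {
    $cells = [];
    foreach ($xpath->query('td', $row) as $td) {
        $cells[] = trim($td->textContent);
    }
    if (count($cells) === 3) {
        $records[] = [
            'name'    => $cells[0],
            'address' => $cells[1],
            'phone'   => $cells[2],
        ];
    }
}

print_r($records);
```

Each entry of $records can then be passed to a prepared INSERT statement.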
I have no clue at all.
How do I extract the numeric % data on the right from the link below and display them on my website without updating daily myself? Can a simple PHP + HTML solve my problem?
http://www.mrrebates.com/merchants/all_merchants.asp
Meanwhile, how do I automatically hyperlink the extracted numeric % and display it as a link for that retailer? for example,
1 Stop Florists------------------------- 8% (this 8% should be displayed as hyperlink for that retailer, unfortunately I am too new to have more than 1 hyperlink)
at the same time integrating my referral id (shown below) on to that 8% hyperlink
mrrebates.com?refid=420149
You can use curl to download the page, then use regular expressions to parse it up and print it out in whatever form you want. Here's some PHP code to do it:
<?php
system("curl -v http://www.mrrebates.com/merchants/all_merchants.asp > /tmp/x.txt");
$data = file_get_contents("/tmp/x.txt");
preg_match_all('/<td><a href="([^"]*)".*?<b>([^<]*)<\/b>.*?<td class="r">([^<]*)<\/td>/',
$data, $matches, PREG_SET_ORDER);
foreach ($matches as $match) {
$site_name = $match[2];
$url = "http://www.mrrebates.com/{$match[1]}";
$percent = $match[3];
print "<a href='$url'>$site_name</a> ";
print "<a href='$url'>$percent</a> <br/>";
}
That'll print out a list of links every time you refresh the page. I have no idea how referral codes work on that site, but I imagine it'll be pretty easy to tack it onto the $url variable.
One caveat here is that every time you refresh your page, it's going to have to load the other site first and parse it so it'll be slow. You could separate out the system("curl...") call into a separate file and only do that once an hour or so if you want to make it go faster. Good luck.
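One way to do the "only once an hour" part without a separate script is a small file cache. The helper below is a hypothetical sketch, not something from the code above: cached_fetch() and the cache path are made-up names, and the download itself is passed in as a callback so any fetching method can be plugged in.

```php
<?php
/**
 * Return the cached content if the cache file is younger than $maxAge
 * seconds; otherwise call $fetch, store its result and return it.
 * (Hypothetical helper, named here for illustration only.)
 */
function cached_fetch(string $cacheFile, int $maxAge, callable $fetch): string
{
    clearstatcache(true, $cacheFile);   // make sure filemtime() is fresh
    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $maxAge) {
        return file_get_contents($cacheFile);
    }
    $data = $fetch();
    file_put_contents($cacheFile, $data, LOCK_EX);
    return $data;
}

// Usage sketch: the cache path is an assumption, the URL is the one from the question.
// $data = cached_fetch('/tmp/all_merchants.html', 3600, function () {
//     return file_get_contents('http://www.mrrebates.com/merchants/all_merchants.asp');
// });
```

With a 3600 second limit the remote site is hit at most once an hour, and every other page view is served from the local file.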
Parsing XHTML is best left to a DOM parser. However, this type of scrape operation is messy business anyway. I will propose another solution and let you piece it together.
View the source of your HTML and find out the beginning and end of your table. Looks like you want this:
<table border="0" width="95%" cellpadding="3" cellspacing="0" style="border: 1px dotted #808080;">
<tr>
<td bgcolor="#FFCC00"><b>Store Name</b></td>
<td width="75" align="center" bgcolor="#FFCC00"><b>Coupons</b></td>
<td width="75" align="right" bgcolor="#FFCC00"><b>Rebate</b></td>
</tr>
And then look for the next occurrence of </table>.
Now, your content is in rows... look for <tr and </tr>.
I'll let you figure it out how to break it down from there.
Now, to actually do all of this work... there are lots of functions that can help you. Start with strpos.
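To give an idea of how the pieces fit together, here is a rough sketch using only strpos, substr and strip_tags. The function name extract_rows(), its $tableMarker parameter and the sample HTML are made up for illustration; on the real page you would pass the exact opening tag of the table you identified. Note that strpos is case-sensitive, so this assumes lowercase tags.

```php
<?php
/**
 * Cut the first table that starts with $tableMarker out of $html and
 * return its rows as arrays of cell texts. (Hypothetical helper.)
 */
function extract_rows(string $html, string $tableMarker = '<table'): array
{
    $start = strpos($html, $tableMarker);
    if ($start === false) {
        return [];
    }
    $end = strpos($html, '</table>', $start);
    if ($end === false) {
        return [];
    }
    $table = substr($html, $start, $end - $start);

    $rows = [];
    $offset = 0;
    // Walk from one <tr ...> to the matching </tr>.
    while (($rowStart = strpos($table, '<tr', $offset)) !== false) {
        $rowEnd = strpos($table, '</tr>', $rowStart);
        if ($rowEnd === false) {
            break;
        }
        $row = substr($table, $rowStart, $rowEnd - $rowStart);

        // Split the row at each closing </td> and drop the remaining tags.
        $cells = [];
        foreach (explode('</td>', $row) as $chunk) {
            $text = trim(strip_tags($chunk));
            if ($text !== '') {
                $cells[] = $text;
            }
        }
        $rows[] = $cells;
        $offset = $rowEnd + strlen('</tr>');
    }
    return $rows;
}

// Invented sample in the shape of the table above.
$html = '<p>intro</p>
<table border="0" width="95%">
<tr><td><b>Store Name</b></td><td><b>Rebate</b></td></tr>
<tr><td>1 Stop Florists</td><td class="r">8%</td></tr>
</table>
<p>footer</p>';

$rows = extract_rows($html);
print_r($rows);
```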
This is probably better done with JavaScript (or at least I have usually tackled problems like this on the client side), particularly with the jQuery library.
You want to load the data on that page with something like
$.get("http://www.mrrebates.com/merchants/all_merchants.asp");
and parse the returned data to get the info you need (this should be simple enough that jQuery will do, though there are fuller DOM parsers). Keep in mind that the browser's same-origin policy will block this request unless the other site allows cross-origin requests. I'm not sure what you're familiar with so far, but it would probably be a lot to describe here. I see the % info is in a td with class "r".
Do you have just one referral ID or one for each vendor? That will obviously matter.
As usual I have trouble writing a good regex.
I am trying to make a plugin for Joomla to add a button to the optional print, email and PDF buttons produced by the core on the right of article titles. If I succeed I will distribute it under the GPL. None of the examples I found seem to work and I would like to create a php-only solution.
The idea is to use the unique pattern of the Joomla output for article titles and buttons for one or more regex. One regex would find the right table by looking for a table with class "contentpaneopen" (of which there are several in a page) and containing a cell with class "contentheading". A second regex could check if in that table there is a cell with class "buttonheading". The number of these cells could be from zero to three but I could use this check if the first regex returns more than one match. With this, I would like to replace the table by the same table but with an extra cell holding the button I want to add. I could do that by taking off the last row and table closing tags and inserting my button cell before adding those closing tags again.
The normal Joomla output looks like this:
<table class="contentpaneopen">
<tbody>
<tr>
<td width="100%" class="contentheading">
<a class="contentpagetitle" href="url">Title Here</a>
</td>
<td width="100%" align="right" class="buttonheading">
<a rel="nofollow" onclick="etc" title="PDF" href="url"><img alt="PDF" src="/templates/neutral/images/pdf_button.png"/></a>
</td>
<td width="100%" align="right" class="buttonheading">
<a rel="nofollow" onclick="etc" title="Print" href="url"><img alt="Print" src="/templates/neutral/images/printButton.png" ></a>
</td>
</tr>
</tbody>
</table>
The code would very roughly be something like this:
$subject = $article;
$pattern1 = '[regex1]'; // matches <table class="contentpaneopen">etc</table>
preg_match($pattern1, $subject, $match);
$pattern2 = '[regex2]'; // matches </tr></tbody></table>
$replacement = '[mybutton]'; // the button cell followed by the closing tags
echo preg_replace($pattern2, $replacement, $match[0]);
Without a good regex there is little point doing the rest of the code, so I hope someone can help with that!
This is a common question on SO and the answer is always the same: regular expressions are a poor choice for parsing or processing HTML or XML. There are many ways they can break down. PHP comes with at least three built-in HTML parsers that will be far more robust.
Take a look at Parse HTML With PHP And DOM and use something like:
$html = new DOMDocument;
$html->preserveWhiteSpace = false; // set before loading; it has no effect on an already parsed document
$html->loadHTML($source);
$tables = $html->getElementsByTagName('table');
foreach ($tables as $table) {
if ($table->getAttribute('class') == 'contentpaneopen') {
// replace it with something else
}
}
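To make the "replace it with something else" step a bit more concrete, here is a sketch that appends one extra button cell to the title row. The markup is a shortened copy of the Joomla output from the question; the link text "My button" and its href are placeholders for whatever the plugin should actually output.

```php
<?php
// Shortened version of the Joomla output shown in the question.
$source = '<table class="contentpaneopen"><tbody><tr>
<td width="100%" class="contentheading"><a class="contentpagetitle" href="url">Title Here</a></td>
<td width="100%" align="right" class="buttonheading"><a title="PDF" href="url">PDF</a></td>
</tr></tbody></table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep warnings about sloppy HTML quiet
$doc->loadHTML($source);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Rows that hold the article title inside a "contentpaneopen" table.
$rows = $xpath->query('//table[@class="contentpaneopen"]//tr[td[@class="contentheading"]]');

foreach ($rows as $row) {
    $cell = $doc->createElement('td');
    $cell->setAttribute('class', 'buttonheading');
    $cell->setAttribute('align', 'right');

    $link = $doc->createElement('a', 'My button');   // placeholder button
    $link->setAttribute('href', '#');

    $cell->appendChild($link);
    $row->appendChild($cell);   // becomes the last cell of the row
}

// Serialize only the modified table(s), not the html/body wrapper.
$out = '';
foreach ($xpath->query('//table[@class="contentpaneopen"]') as $table) {
    $out .= $doc->saveHTML($table);
}
echo $out;
```

Because the cell is appended to whatever row contains the title, it does not matter whether zero, one or three buttonheading cells are already present.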
Is there a reason that you need to use regex for this? DOM parsing would be much more straightforward.
Since a plugin in the scenario you described is called every time you load a page, a regex approach is faster than a DOM call, which is why a lot of people use it. Joomla's documentation also explains why a regex is a better fit than a DOM approach in this scenario.
The problem with your solution is that it's tied with Joomla's default template. I don't remember if it uses the same class="contentheading" structure in all templates. If you plan to GPL such an extension, you should be careful about that.
What you're trying to do looks to me like a template override, explained in more detail here. It is a much simpler solution. For example, this is the PHP that creates your article titles:
<div class="componentheading<?php echo $this->params->get('pageclass_sfx')?>">
<h2><?php echo $this->escape($this->params->get('page_title')); ?></h2>
</div>
You just need to override the com_content article template and echo the HTML for the PDF buttons after the ->get('page_title') call. If you don't want to echo the HTML, you can create a module or a component, import it in the template and, after the ->get('page_title') call, call the methods in your component that output the HTML.
This component could have various checkboxes "show pdf (yes/no)" and other interesting actions.