Get a string from a large html variable [duplicate] - php

Possible Duplicate:
How to parse and process HTML with PHP?
How would I go about getting a certain string from a webpage that has been scraped?
I am using SimpleBrowser in PHP to download a webpage into a variable.
The resultant webpage at a certain part has the following:
<tr>
<td class="label" width="350">POD Receiver Name: </td>
<td class="field" align="left">
<b>KRISTY</b>
</td>
</tr>
I want to get the value KRISTY into a variable, but I'm not really sure how.
I have no real experience with regex, so I wouldn't know where to start.
Any help appreciated!

To pull one specific part out from a known location, I'd use xpath. Try a tutorial such as http://ditio.net/2008/12/01/php-xpath-tutorial-advanced-xml-part-1/
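As a rough illustration of the XPath approach, here is a minimal sketch using PHP's built-in DOMDocument and DOMXPath classes. It assumes the scraped page is already in a string ($html holds a stand-in fragment here) and that the label cell always contains the text "POD Receiver Name"; both are assumptions based on the snippet in the question.

```php
<?php
// Stand-in for the scraped page; in practice $html comes from the download step.
$html = '<table><tr>
<td class="label" width="350">POD Receiver Name: </td>
<td class="field" align="left">
<b>KRISTY</b>
</td>
</tr></table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // scraped HTML is rarely valid; collect warnings quietly
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Locate the label cell by its text, then read the <b> inside the next cell.
$query = '//td[@class="label"][contains(., "POD Receiver Name")]/following-sibling::td[1]/b';
$nodes = $xpath->query($query);

$name = $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;
echo $name, "\n";
```

Anchoring on the label text rather than on the position of the row should keep the query working if other rows are added to the table.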

I am not sure why you are storing a page in a variable, but if you have a page stored as a string, you can use a regular expression to extract a substring from it. For this particular example, you can use something like this:
$v = '<tr>
<td class="label" width="350">POD Receiver Name: </td>
<td class="field" align="left">
<b>KRISTY</b>
</td>
</tr>';
preg_match('/\<b\>(.*?)\<\/b\>/', $v, $matches);
$result = $matches[1];
This particular regular expression gets everything between the bold tags.

If the structure can be depended on, give SimpleXML a shot:
$xml = simplexml_load_string(html_entity_decode($v));
$name = strval($xml->td[1]->b);//KRISTY
http://php.net/manual/en/function.simplexml-load-string.php
http://www.php.net/manual/en/class.simplexmlelement.php

Related

Splitting a Value from a String and Appending it to the end of that Value in PHP

I don't know if this is possible as I'm unable to find how I can do this and I'm very new to PHP but here's an overview of my issue:
I have a script which reads a CSV file. One of the columns contains cells which contain HTML tables. At varying positions within all of the tables there exists a table row which contains <td>Retail</td> and then the price such as <td>$300</td> for example. An example is below which I have formatted so that it's easier for you to read, but this is returned as a continuous string from the CSV file normally:
<table>
<tr>
<td>Designer</td>
<td>Hermes</td>
</tr>
<tr>
<td>Size inch</td>
<td>5.9 x 4.3 x 2.4</td>
</tr>
<tr>
<td>Material</td>
<td>Cotton</td>
</tr>
<tr>
<td>Retail</td>
<td>$300.00</td>
</tr>
<tr>
<td>Made in</td>
<td>France</td>
</tr>
</table>
These tables are then required to have the CAD [Canadian Dollars] retail price added to them. Example below of the desired end result:
<table>
<tr>
<td>Designer</td>
<td>Hermes</td>
</tr>
<tr>
<td>Size inch</td>
<td>5.9 x 4.3 x 2.4</td>
</tr>
<tr>
<td>Material</td>
<td>Cotton</td>
</tr>
<tr>
<td>Retail USD</td>
<td>$300.00</td>
</tr>
<tr>
<td>Retail CAD</td>
<td>$410.00</td>
</tr>
<tr>
<td>Made in</td>
<td>France</td>
</tr>
</table>
I have looked at using substr() but it looks as though you need to specify the length of characters that will be ignored from the start of the string which isn't possible for me here as the data varies.
So therefore my question is whether it's at all possible to specifically split the price out from the string and then append it back in after the </tr> so that the result is as above. If you could point me in the right direction of the functions that I would need to use to achieve this then I would really appreciate it. Please bear in mind I am already using str_replace() to rename Retail to Retail USD and I already have a variable created ready to convert USD price to a CAD price which uses a finance API.
Thank you in advance for any insight you can offer me here.
Quoting the question: "I have looked at using substr() but it looks as though you need to specify the length of characters that will be ignored from the start of the string which isn't possible for me here as the data varies."
So use stripos() to find the start of the string you want to replace.
However, the more I dig into this, the messier it becomes. It would be better to edit the CSV generator rather than trying to mutate your CSV. In an ideal world, it would also be better if your CSV contained only data and not HTML.
Apologies, the following became a large and probably unwieldy answer.
To do it anyway, isolate this CSV column in a variable $csvData, then work with it directly:
$csvData = "<table data from your question>";
$csvData = str_replace("</td>","*!*</td>",$csvData);
//remove all the HTML junk
$csvDataClean = strip_tags($csvData);
// Form an array.
$csvDataArray = explode("*!*",$csvDataClean);
// trim contents of the array.
$csvDataArray = array_map('trim', $csvDataArray);
// remove empty array values.
$csvDataArray = array_filter($csvDataArray);
// build new contents array.
$csvArray = array();
foreach($csvDataArray as $key=>$value){
    if($key%2 == 0){
        // even index: it's a content header.
        $value = str_replace(" ","_",$value);
        $lastHeader = preg_replace("/[^a-z0-9-_]/i","",$value);
    }
    else {
        // odd index: it's a value.
        $csvArray[$lastHeader] = $value;
    }
}
//tidy up.
unset($key,$value,$lastHeader,$csvDataArray,$csvDataClean);
print_r($csvArray);
This will now output an array of headers and values from your HTML table. You can then easily reference values from this array and recompile them into an HTML table as necessary.
Using phpsandbox I can output:
Array
(
[Designer] => Hermes
[Size_inch] => 5.9 x 4.3 x 2.4
[Material] => Cotton
[Retail] => $300.00
[Made_in] => France
)
So you can then take $csvArray['Retail'] and process this value to get the other currency values, and add them to this array. Then you can run this array through another process to rebuild a table to save into the CSV (although this isn't recommended; it's better to save the array as a CSV itself, but I don't know your requirements).
So:
//whatever system you currently use to get conversion.
$csvArray['Retail_USD'] = $csvArray['Retail']; // the original price is already in USD
$csvArray['Retail_CAD'] = convert_currency($csvArray['Retail']);
unset($csvArray['Retail']); // replaced by the two rows above
And now rebuild the HTML table:
$csvOutput = "";
foreach($csvArray as $key=>$value){
$csvOutput .= "<tr><td>".str_replace("_"," ",$key)."</td><td>".$value."</td></tr>\n";
}
unset($key,$value);
$csvOutput = "<table>".$csvOutput."</table>";
print_r($csvOutput);
You can also manually delete and readd the Made_in array key if you want to maintain this as the final array value:
//whatever system you currently use to get conversion.
$csvArray['Retail_USD'] = $csvArray['Retail']; // the original price is already in USD
$csvArray['Retail_CAD'] = convert_currency($csvArray['Retail']);
unset($csvArray['Retail']); // replaced by the two rows above
....
$value = $csvArray['Made_in'];
unset($csvArray['Made_in']);
$csvArray['Made_in'] = $value;
This is a hacky but quick way of keeping the "made in" column after the new Retail columns added above.
What you pasted here is a html table, not csv.
Anyway, there are several ways to manipulate strings. str_replace() is one of the most basic ones, so you got that already. In your case, you're probably best off using regular expressions. It's like str_replace but much more powerful. There are plenty of tutorials out there.
If you want to do a lot and more complex manipulation of html or xml data, you may want to have a look at XSLT.
I had to deal with a similar scenario once. What I would do is:
1. First build your desired output block in a variable $output_block, i.e.:
<td>Retail USD</td><td>$300.00</td></tr><tr><td>Retail CAD</td><td>$410.00</td>
Note: you don't need the first opening tr tag or the last closing one, because you already have those in your original output.
2. Find the position of <td>Retail</td> (use strpos).
3. Save the substring before that position in a variable, i.e. $first_part.
4. Find the position of <td>Made in</td>.
5. Save the rest of the string, starting from the </tr> that closes the Retail row just before that position, in a variable: $last_part.
6. Your final output: $final_output = $first_part . $output_block . $last_part;
easy cake... ;)
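The steps above can be sketched roughly as follows. The function name add_cad_row is hypothetical, and the CAD price is passed in as a ready-made string because the conversion itself is outside this sketch. One deliberate difference from the steps: instead of searching for <td>Made in</td>, it looks for the first </tr> after the Retail cell, so it does not assume which row comes next.

```php
<?php
// Hypothetical helper: renames the "Retail" label to "Retail USD" and inserts
// a "Retail CAD" row directly after the Retail row.
function add_cad_row(string $table, string $cadPrice): string
{
    $label = '<td>Retail</td>';
    $start = strpos($table, $label);
    if ($start === false) {
        return $table; // no Retail row: leave the table untouched
    }

    // The Retail row ends at the first </tr> after the label.
    $rowEnd = strpos($table, '</tr>', $start);
    if ($rowEnd === false) {
        return $table;
    }
    $rowEnd += strlen('</tr>');

    $afterLabel = $start + strlen($label);
    $firstPart  = substr($table, 0, $start);                          // everything before the label
    $restOfRow  = substr($table, $afterLabel, $rowEnd - $afterLabel); // price cell and closing </tr>
    $lastPart   = substr($table, $rowEnd);                            // everything after the Retail row

    $cadRow = '<tr><td>Retail CAD</td><td>' . $cadPrice . '</td></tr>';

    return $firstPart . '<td>Retail USD</td>' . $restOfRow . $cadRow . $lastPart;
}
```

Because the label is renamed inside the same function, a separate str_replace() call for "Retail USD" would no longer be needed if this approach were used.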

Php_simple_html_dom on a table

I would like to extract data from a website, whose code is written like this:
...
<tr>
<td class="something1"><a class="whatever" href="#">NAME</a> </td>
<td class="something2">DATA</td>
<td class="something3">NUMERIC DATA</td>
</tr>
...
In particular, I have my NAME list from my MySQL database, and if my NAME is equal to NAME on this website, I want to print the corresponding NUMERIC DATA on my website.
I know I can do something with php_simple_html_dom but I cannot really achieve this action. Can you please help me?
Thanks!
So you want to read NAME first, and if it is relevant, read the rest? You can read a website's DOM as explained here: How do I get the HTML code of a web page in PHP?
$html = file_get_contents('http://pathToTheWebsite.com/thePage');
Now let's parse $html with some regex. (You can use that library too; the documentation tells you how to do it.)
preg_match('~<td class="something1"><a class="whatever" href="#">(?<name>[^<]+)</a> </td>~', $html, $matches);
Now $matches['name'] will contain the NAME. Note the ~ delimiter, which avoids clashing with the / in </a>. You can do the same for the rest, and maybe clean up that regex a little; this was just an example.
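As an alternative to the regex, here is a minimal sketch using PHP's built-in DOM extension (not php_simple_html_dom). The function name find_numeric_data is hypothetical, and the class names something1 and something3 are taken from the placeholder markup in the question, so they would need to be replaced with the real ones.

```php
<?php
// Hypothetical helper: returns the "something3" cell text for the row whose
// link text equals $name, or null if no row matches.
function find_numeric_data(string $html, string $name): ?string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate invalid real-world HTML
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//tr[td[@class="something1"]]') as $row) {
        $nameNode = $xpath->query('td[@class="something1"]/a', $row)->item(0);
        $dataNode = $xpath->query('td[@class="something3"]', $row)->item(0);
        if ($nameNode === null || $dataNode === null) {
            continue; // row does not have the expected shape
        }
        if (trim($nameNode->textContent) === $name) {
            return trim($dataNode->textContent);
        }
    }
    return null;
}
```

Each NAME fetched from the MySQL table could then be passed to this function in a loop, printing the result whenever it is not null.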

Sending special characters in HTML form

I have an input field (which is filled automatically) with the format name <myemail#host.com>. I gave the form enctype="application/x-www-form-urlencoded", but when I retrieve it in PHP, it shows only the name. Please help me retrieve the email too.
My HTML form:
<form action="{$path_site}{$index_file}" method="POST" enctype="application/x-www-form-urlencoded">
<table>
<tr>
<td>Your Name</td>
<td><input type="text" name="sender_name" size="37" /></td>
</tr>
<tr>
<td>To</td>
<td><input type="text" name="reciever_name" size="37" id="inputString" onkeyup="lookup(this.value)" onblur="fill()" /></td>
</tr>
</table>
</form>
And PHP code:
echo $msg_sender_name = $info[reciever_name];
Extracting the information from comments, where you say:
if the text is like "myName<myEmail#email.com>", info['reciever_name'] displays only "myName"
I would say that your problem is related to the displaying the results, and is not related to the form.
You probably display the received string as HTML, where the characters "<" and ">" are special.
Instead of
echo $info['reciever_name'];
you should use the htmlspecialchars function:
echo htmlspecialchars($info['reciever_name'], ENT_QUOTES);
This is one of the most common bugs in PHP (and in many other languages).
You should escape all the text you are displaying, especially when it comes from untrusted sources - and every value provided by the user is untrusted.
If you fail to escape the output, you put your users' security at risk - you may want to read about cross-site scripting on Wikipedia.
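To make the effect concrete, here is a small sketch of what escaping does to the value from the question. The variable $received is a stand-in for $info['reciever_name'], and the explicit 'UTF-8' charset argument is an assumption about the page encoding.

```php
<?php
$received = 'myName<myEmail#email.com>'; // stand-in for $info['reciever_name']

// Echoed as-is, a browser would treat <myEmail#email.com> as an unknown tag
// and show only "myName". Escaping turns the angle brackets into entities.
$escaped = htmlspecialchars($received, ENT_QUOTES, 'UTF-8');

echo $escaped, "\n"; // myName&lt;myEmail#email.com&gt;
```

The browser renders those entities as literal < and > characters, so the full string becomes visible again.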
The following PHP code
echo $msg_sender_name = $info[reciever_name];
seems to be missing a couple of quotes. Try this instead:
echo $msg_sender_name = $info['reciever_name'];

Parsing information between known variables

I don't mean to be a bother, and I know this has been asked a thousand times before, but I'm just not understanding the concept. I was wondering if somebody could walk me through it. Here is what I'm trying to do:
I have a set of information inside an HTML file. The file is uploaded to the server and I need to parse information out of the file inside of set parameters (demo code to follow). I have been reading on parsing for over a week and understand some of it, but I'm just not grasping the concept. I guess I just need somebody to do one on this demo for me to understand, and if you could, break down the search variables please. Here's the demo:
<hr>
<a id="Operating_System"></a>
<table WIDTH="100%" BORDER="0" CELLSPACING="0" ALIGN="CENTER">
<CAPTION ALIGN="TOP"><FONT size="5">Operating System</FONT></CAPTION>
<tr><td>Top</td></tr>
<TR ALIGN="LEFT" BGCOLOR="#00FF00">
<TH>Property</TH>
<TH>Value</TH>
</TR>
<TR BGCOLOR="#F0F0F0">
<TD>Name</TD>
<TD>Windows 7 Professional x64 Service Pack 1</TD>
</TR>
<TR>
<TD>Features</TD>
<TD>Terminal Services in Remote Admin Mode, 64 Bit Edition, Media Center Edition, Multiprocessor Free</TD>
</TR>
<TR BGCOLOR="#F0F0F0">
<TD>Up Time</TD>
<TD>5 Days 22 Hours 4 Minutes 26 seconds</TD>
</TR>
<!-- Operating System Duration: 1.853 seconds -->
</table>
<hr>
<a id="Installed_Updates"></a>
<table WIDTH="100%" BORDER="0" CELLSPACING="0" ALIGN="CENTER">
<CAPTION ALIGN="TOP"><FONT size="5">Installed Updates</FONT></CAPTION>
And here is what I'm trying to accomplish. On this demo, I need the information parsed, but only certain information to come back. There is a lot more information here, but I only need about 30 things in total from each document. First I need to search from Operating_System to Installed_Updates; this will give me the first area I need to gather information from (there are other groups too, so I'll make one for each group of information). Then I need to make the search more specific, such as from <TR> to </TR>, which will give me the actual information set I need. After that, just grab the first 'name' and 'value' to store in a database.
Again, I know it's out there, but I'm just not getting the whole concept of simple expressions. After I do it a few times on an actual document, I'll get the hang of it, I think.
Thank you all so much for the help, i really appreciate it.
This only works for fixed HTML with little variations. But if you just want a simple example, here is one:
preg_match('#<TD>Up Time</TD>.*?<TD>([\w ]+)</TD>#is', $html, $match);
print $match[1]; // the text captured by ([\w ]+)
See also https://stackoverflow.com/questions/89718/is-there-anything-like-regexbuddy-in-the-open-source-world for some tools. And http://regular-expressions.info/ to learn the syntax.
But as said, if you want to extract a lot of values, there are easier options.
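One such option, sketched here under some assumptions, is PHP's built-in DOM extension. The function name read_section is hypothetical; it assumes each group is introduced by an <a id="..."></a> anchor immediately followed by its table, as in the demo, and that data rows have exactly two TD cells.

```php
<?php
// Hypothetical helper: collects Property => Value pairs from the first table
// that follows <a id="$sectionId"></a>.
function read_section(string $html, string $sectionId): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // old-style markup triggers parser warnings
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Only rows with exactly two <td> cells: skips the "Top" row and the <th> header row.
    $rows = $xpath->query(
        '//a[@id="' . $sectionId . '"]/following-sibling::table[1]//tr[count(td) = 2]'
    );

    $pairs = [];
    foreach ($rows as $row) {
        $cells = $xpath->query('td', $row);
        $pairs[trim($cells->item(0)->textContent)] = trim($cells->item(1)->textContent);
    }
    return $pairs;
}
```

Calling read_section($html, 'Operating_System') would then give an array keyed by property name, which can be written to the database one pair at a time; the other groups only need a different id.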

Use PHP to extract simple numeric data from website and display as HTML

I have no clue at all.
How do I extract the numeric % data on the right from the link below and display them on my website without updating daily myself? Can a simple PHP + HTML solve my problem?
http://www.mrrebates.com/merchants/all_merchants.asp
Meanwhile, how do I automatically hyperlink the extracted numeric % and display it as a link for that retailer? for example,
1 Stop Florists------------------------- 8% (this 8% should be displayed as hyperlink for that retailer, unfortunately I am too new to have more than 1 hyperlink)
at the same time integrating my referral id (shown below) on to that 8% hyperlink
mrrebates.com?refid=420149
You can use curl to download the page, then use regular expressions to parse it up and print it out in whatever form you want. Here's some PHP code to do it:
<?php
system("curl -v http://www.mrrebates.com/merchants/all_merchants.asp > /tmp/x.txt");
$data = file_get_contents("/tmp/x.txt");
preg_match_all('/<td><a href="([^"]*)".*?<b>([^<]*)<\/b>.*?<td class="r">([^<]*)<\/td>/',
$data, $matches, PREG_SET_ORDER);
foreach ($matches as $match) {
$site_name = $match[2];
$url = "http://www.mrrebates.com/{$match[1]}";
$percent = $match[3];
print "<a href='$url'>$site_name</a> ";
print "<a href='$url'>$percent</a> <br/>";
}
That'll print out a list of links every time you refresh the page. I have no idea how referral codes work on that site, but I imagine it'll be pretty easy to tack it onto the $url variable.
One caveat here is that every time you refresh your page, it's going to have to load the other site first and parse it so it'll be slow. You could separate out the system("curl...") call into a separate file and only do that once an hour or so if you want to make it go faster. Good luck.
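The once-an-hour idea could look roughly like this. The function name fetch_cached and the cache file path are hypothetical; the download itself is passed in as a callable so the caching logic stays independent of how the page is fetched.

```php
<?php
// Hypothetical helper: returns the cached copy if it is younger than $maxAge
// seconds; otherwise calls $fetch, stores the result, and returns it.
function fetch_cached(string $cacheFile, int $maxAge, callable $fetch): string
{
    clearstatcache(true, $cacheFile); // make sure the file's mtime is read fresh
    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $maxAge) {
        return file_get_contents($cacheFile);
    }
    $data = $fetch();
    file_put_contents($cacheFile, $data);
    return $data;
}

// Usage (commented out because it performs a network request):
// $data = fetch_cached('/tmp/all_merchants.html', 3600, function () {
//     return file_get_contents('http://www.mrrebates.com/merchants/all_merchants.asp');
// });
```

With this in place, the preg_match_all() step would run on $data as before, but the remote site would be contacted at most once per hour.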
Parsing XHTML is best left to a DOM parser. However, this type of scrape operation is messy business anyway. I will propose another solution and let you piece it together.
View the source of your HTML and find out the beginning and end of your table. Looks like you want this:
<table border="0" width="95%" cellpadding="3" cellspacing="0" style="border: 1px dotted #808080;">
<tr>
<td bgcolor="#FFCC00"><b>Store Name</b></td>
<td width="75" align="center" bgcolor="#FFCC00"><b>Coupons</b></td>
<td width="75" align="right" bgcolor="#FFCC00"><b>Rebate</b></td>
</tr>
And then look for the next occurrence of </table>.
Now, your content is in rows... look for <tr and </tr>.
I'll let you figure it out how to break it down from there.
Now, to actually do all of this work, there are lots of functions that can help you. Start with strpos.
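A minimal sketch of that strpos-based breakdown follows. Both function names (between and table_rows) are hypothetical, and the start marker you pass in would be whatever opening table tag you found in the page source.

```php
<?php
// Hypothetical helper: returns the text between $startMarker and the next
// $endMarker, or null if either marker is missing.
function between(string $haystack, string $startMarker, string $endMarker): ?string
{
    $start = strpos($haystack, $startMarker);
    if ($start === false) {
        return null;
    }
    $start += strlen($startMarker);
    $end = strpos($haystack, $endMarker, $start);
    if ($end === false) {
        return null;
    }
    return substr($haystack, $start, $end - $start);
}

// Hypothetical helper: splits a table body into the inner HTML of each row.
function table_rows(string $tableBody): array
{
    $rows = [];
    $offset = 0;
    while (($open = strpos($tableBody, '<tr', $offset)) !== false) {
        $bodyStart = strpos($tableBody, '>', $open);   // end of the opening <tr ...> tag
        $close = strpos($tableBody, '</tr>', $open);
        if ($bodyStart === false || $close === false) {
            break; // unterminated row: stop rather than guess
        }
        $rows[] = trim(substr($tableBody, $bodyStart + 1, $close - $bodyStart - 1));
        $offset = $close + strlen('</tr>');
    }
    return $rows;
}
```

Each returned row could be broken down into cells the same way, by searching for <td and </td> instead.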
This is probably better done with JavaScript (or at least I have usually tackled problems like this on the client side), particularly with the jQuery library.
You want to load the data on that page with something like
$.get("http://www.mrrebates.com/merchants/all_merchants.asp");
and parse the returned data to get the info you need (this should be simple enough that jQuery will do, though there are fuller DOM parsers). I'm not sure what you're familiar with so far, but it would probably be a lot to describe here. I see the % info is in a td with class "r".
Do you have just one referral ID or one for each vendor? That will obviously matter.
