PHP: Scrape all numbers from brackets "(123)"

I have some HTML code whose structure is always the same, but I don't know how to extract all the numbers from the brackets.
Example code:
<table align="left" border="0" cellpadding="0" cellspacing="1">
<tbody><tr>
<td style="padding-right:0.5em;padding-bottom:1px;white-space:nowrap;font-size:10px;" align="left">
5 Sterne:
</td>
<td style="min-width:60; background-color: #eeeecc" class="tiny" title="73%" align="left" width="60"><div style="background-color:#FFCC66; height:13px; width:73%;"></div></td>
<td style="font-family:Verdana,Arial,Helvetica,Sans-serif;;font-size:10px;" align="right"> (96)</td>
</tr>
<tr>
<td style="padding-right:0.5em;padding-bottom:1px;white-space:nowrap;font-size:10px;" align="left">
4 Sterne:
</td>
<td style="min-width:60; background-color: #eeeecc" class="tiny" title="11%" align="left" width="60"><div style="background-color:#FFCC66; height:13px; width:11%;"></div></td>
<td style="font-family:Verdana,Arial,Helvetica,Sans-serif;;font-size:10px;" align="right"> (15)</td>
</tr>
<tr>
<td style="padding-right:0.5em;padding-bottom:1px;white-space:nowrap;font-size:10px;" align="left">
3 Sterne:
</td>
<td style="min-width:60; background-color: #eeeecc" class="tiny" title="7%" align="left" width="60"><div style="background-color:#FFCC66; height:13px; width:7%;"></div></td>
<td style="font-family:Verdana,Arial,Helvetica,Sans-serif;;font-size:10px;" align="right"> (10)</td>
</tr>
<tr>
<td style="padding-right:0.5em;padding-bottom:1px;white-space:nowrap;font-size:10px;" align="left">
2 Sterne:
</td>
<td style="min-width:60; background-color: #eeeecc" class="tiny" title="3%" align="left" width="60"><div style="background-color:#FFCC66; height:13px; width:3%;"></div></td>
<td style="font-family:Verdana,Arial,Helvetica,Sans-serif;;font-size:10px;" align="right"> (4)</td>
</tr>
<tr>
<td style="padding-right:0.5em;padding-bottom:1px;white-space:nowrap;font-size:10px;" align="left">
1 Stern<span style="color:#FFFFFF">e</span>:
</td>
<td style="min-width:60; background-color: #eeeecc" class="tiny" title="4%" align="left" width="60"><div style="background-color:#FFCC66; height:13px; width:4%;"></div></td>
<td style="font-family:Verdana,Arial,Helvetica,Sans-serif;;font-size:10px;" align="right"> (6)</td>
</tr>
<tr><td> </td><td><div style="width:60px;"> </div></td><td> </td></tr>
</tbody></table>
In this case I need these numbers: 96, 15, 10, 4 and 6.
Please give me a tip on which function is suitable for this.

You can use a DOM parser such as the DOMDocument class to parse the HTML document. Since the structure is always the same, you can traverse the DOM with an XPath expression and grab the text from the third <td> node of each row. Once you have the node value, a simple preg_replace() that strips every non-digit character leaves just the number:
$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//table/tbody/tr/td[3]/text()') as $node) {
$number = preg_replace('~\D~', '', $node->nodeValue);
echo $number . '<br/>';
}
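For anyone who wants to try the DOM approach without the original page, here is a minimal, self-contained sketch. The sample markup is a trimmed-down stand-in for the table in the question (an assumption, not the real page), and the `//tr/td[3]` path is my own variation so that it does not depend on whether a <tbody> is present.

```php
<?php
// Hypothetical sample: a shortened version of the rating table from the question.
$html = <<<HTML
<table>
<tr><td>5 Sterne:</td><td title="73%"></td><td align="right"> (96)</td></tr>
<tr><td>4 Sterne:</td><td title="11%"></td><td align="right"> (15)</td></tr>
<tr><td>1 Stern:</td><td title="4%"></td><td align="right"> (6)</td></tr>
<tr><td> </td><td> </td><td> </td></tr>
</table>
HTML;

libxml_use_internal_errors(true);   // collect markup warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$counts = [];
// Third cell of every row, regardless of <tbody>.
foreach ($xpath->query('//tr/td[3]') as $cell) {
    // Accept only cells that look like "(123)"; this skips the empty spacer row.
    if (preg_match('~\((\d+)\)~', $cell->textContent, $m)) {
        $counts[] = (int) $m[1];
    }
}

print_r($counts);
```

Matching the brackets explicitly, instead of stripping non-digits, keeps empty cells from producing empty strings in the result.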

preg_match_all('~\((\d+)\)~',$content,$numbers);
print_r($numbers); // example to print results
output:
Array
(
[0] => Array
(
[0] => (96)
[1] => (15)
[2] => (10)
[3] => (4)
[4] => (6)
)
[1] => Array
(
[0] => 96
[1] => 15
[2] => 10
[3] => 4
[4] => 6
)
)

Related

How to get the href attribute from an HTML page

I have this html code:
<tbody>
<tr class="">
<td align="right" csk="1">1</td>
<td align="left" ><img src="http://static.spref.com/olympics/images/flags/AFG.png" alt="AFG" title="Afghanistan" height=15 width=22> Afghanistan</td>
<td align="right" >1936</td>
<td align="right" >2016</td>
<td align="right" >103</td>
<td align="right" >7</td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" >2</td>
<td align="right" >2</td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" ></td>
</tr>
I'd like to collect all the href attributes in an array.
I'm trying to use this PHP code:
<?php
include_once ('/share/Multimedia/simple_html_dom.php');
$url = 'https://www.sports-reference.com/olympics/countries/';
$tagname_tbody = 'tbody';
$tagname_tr = 'td align="left"';
$olympiad = array();
$html = file_get_html($url,true);
foreach($html->find($tagname_tr) as $tag) {
$olympiad[] = trim($tag->innertext);
}
However, if I print the $olympiad array I get something like:
Array
(
[0] => 1
[1] => <img src="http://static.spref.com/olympics/images/flags/AFG.png" alt="AFG" title="Afghanistan" height=15 width=22> Afghanistan
[2] => 1936
[3] => 2016
[4] => 103
[5] => 7
[6] =>
[7] =>
[8] => 2
[9] => 2
[10] =>
Why does this happen? I'd also like to get the link text that goes with each href (in this case Afghanistan), possibly in another array.
I'm not a PHP expert, so I'm asking for your help.
You can load the HTML file like this. This is an example that you can adapt:
<?php
include_once ('/share/Multimedia/simple_html_dom.php');
$url = 'https://www.sports-reference.com/olympics/countries/';
$tagname_tbody = 'tbody';
$tagname_tr = 'td align="left"';
$olympiad = array();
$html = file_get_html($url,true);
$doc = new DOMDocument();
$doc->loadHTML( $html);
// example 1:
$elements = $doc->getElementsByTagName('*');
// example 2:
$elements = $doc->getElementsByTagName('html');
// example 3:
$elements = $doc->getElementsByTagName('body');
// example 4:
$elements = $doc->getElementsByTagName('table');
// example 5:
$elements = $doc->getElementsByTagName('div');
I hope it helps.
If you want to find all href attributes, I think you can append an a to the selector in $tagname_tr = 'td align="left"';
Then you can loop over the result and get the href and the innertext of each match.
As an example, the values are stored in two arrays and the HTML is loaded as a string:
include_once ('/share/Multimedia/simple_html_dom.php');
$source = <<<SOURCE
<tbody>
<tr class="">
<td align="right" csk="1">1</td>
<td align="left" ><img src="http://static.spref.com/olympics/images/flags/AFG.png" alt="AFG" title="Afghanistan" height=15 width=22> <a href="/olympics/countries/AFG/">Afghanistan</a></td>
<td align="right" >1936</td>
<td align="right" >2016</td>
<td align="right" >103</td>
<td align="right" >7</td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" >2</td>
<td align="right" >2</td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" ></td>
<td align="right" ></td>
</tr>
SOURCE;
$url = 'https://www.sports-reference.com/olympics/countries/';
$tagname_tbody = 'tbody';
$tagname_tr = 'td align="left" a';
$olympiad = array();
$elementText = array();
//$html = file_get_html($url,true);
$html = str_get_html($source);
foreach($html->find($tagname_tr) as $tag) {
$olympiad[] = $tag->href;
$elementText[] = $tag->innertext;
}
echo "<pre>";
print_r($olympiad);
print_r($elementText);
Will result in:
Array
(
[0] => /olympics/countries/AFG/
)
Array
(
[0] => Afghanistan
)
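If you would rather not depend on simple_html_dom, the same two arrays can be built with PHP's bundled DOM extension. This is only a sketch: the sample markup assumes the country name is wrapped in a link, which is what the output above implies, and the variable names are mine.

```php
<?php
// Assumed markup: the country cell contains a link around the name.
$source = <<<HTML
<table><tbody>
<tr>
<td align="right">1</td>
<td align="left"><img src="flags/AFG.png" alt="AFG"> <a href="/olympics/countries/AFG/">Afghanistan</a></td>
<td align="right">1936</td>
</tr>
</tbody></table>
HTML;

libxml_use_internal_errors(true);   // keep warnings about loose markup quiet
$doc = new DOMDocument();
$doc->loadHTML($source);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$links = [];   // href values
$names = [];   // link texts
// Every <a> with an href inside a left-aligned cell.
foreach ($xpath->query('//td[@align="left"]//a[@href]') as $a) {
    $links[] = $a->getAttribute('href');
    $names[] = trim($a->textContent);
}

print_r($links);
print_r($names);
```

For the live page you would replace the heredoc with the downloaded HTML, for example from file_get_contents($url).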

Splitting HTML without cutting the tags

I've been trying to split a PHP string in an arbitrary number of characters per split. However, I'm looking for a way to do so without breaking HTML tags. Here is an example:
$string = 'Section 1:
<table width = "528" border="0" cellpadding="0" cellspacing="0">
<tr> <td width="20"> </td> <td width="15" valign="top"> • </td> <td valign="top"> Element 1 </td></tr>
<tr> <td width="20"> </td> <td width="15" valign="top"> • </td> <td valign="top"> Element 2 </td></tr>
<tr> <td width="20"> </td> <td width="15" valign="top"> • </td> <td valign="top"> Element 3 </td></tr>
<tr> <td width="20"> </td> <td width="15" valign="top">• </td> <td valign="top"> Element 4 </td></tr>
<tr> <td width="20"> </td> <td width="15" valign="top"> • </td> <td valign="top"> Element 5 </td></tr>
<tr> <td width="20"> </td> <td width="15" valign="top"> • </td> <td valign="top"> Element 6 </td></tr>
</table>
Section 2:
<table width = "528" border="0" cellpadding="0" cellspacing="0">
<tr> <td width="20"> </td> <td width="15" valign="top"> • </td> <td valign="top"> Element 7 </td></tr>
<tr> <td width="20"> </td> <td width="15" valign="top"> • </td> <td valign="top"> Element 8 </td></tr>
<tr> <td width="20"> </td> <td width="15" valign="top"> • </td> <td valign="top"> Element 9 </td></tr>
<tr> <td width="20"> </td> <td width="15" valign="top"> • </td> <td valign="top"> Element 10 </td></tr>
<tr> <td width="20"> </td> <td width="15" valign="top"> • </td> <td valign="top"> Element 11 </td></tr>
<tr> <td width="20"> </td> <td width="15" valign="top"> • </td> <td valign="top"> Element 12 </td></tr>
<tr> <td width="20"> </td> <td width="15" valign="top"> • </td> <td valign="top"> Element 13 </td></tr>
</table>';
$charAmount = 450;
$textSplit = array();
while ($string){
array_push($textSplit, substr($string, 0, $charAmount));
$string = substr($string, $charAmount);
}
var_dump($textSplit);
In this case, two tags are broken. I'd like any tag that would be cut at the end of a chunk to move to the next chunk instead, but I have no idea how to do this.
I'm not a PHP person, but I can help with the logic. Just before splitting, scan backwards from the split index and check which of these two characters comes first: < or >
If < is encountered first, you are splitting in the wrong place (inside a tag), so skip that position.
If > is encountered first, go ahead with the split.
I did this successfully in jQuery some time ago.
As for splitting an HTML string, I have no ideas at the moment, but for cutting a string to a character limit you could refer to the solution at https://github.com/dhngoc/php-cut-html-string.
This resource may give you more ideas.
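The backward-scan idea from the first answer could look roughly like this in PHP. It is a sketch under a few assumptions: splitOutsideTags() is a name I made up, lengths are counted in bytes (so a multi-byte character such as • can still be cut), and a literal > inside an attribute value would confuse it.

```php
<?php
/**
 * Split $html into chunks of at most $limit bytes without cutting through a tag.
 * Hypothetical helper; a tag longer than $limit is kept whole and exceeds the limit.
 */
function splitOutsideTags(string $html, int $limit): array
{
    if ($limit < 1) {
        throw new InvalidArgumentException('limit must be at least 1');
    }
    $chunks = [];
    while ($html !== '') {
        if (strlen($html) <= $limit) {
            $chunks[] = $html;
            break;
        }
        $cut = $limit;
        $head = substr($html, 0, $cut);
        $lastOpen = strrpos($head, '<');
        $lastClose = strrpos($head, '>');
        // A '<' with no '>' after it means the cut would land inside a tag.
        if ($lastOpen !== false && ($lastClose === false || $lastClose < $lastOpen)) {
            if ($lastOpen > 0) {
                $cut = $lastOpen;            // move the cut back to just before the tag
            } else {
                $end = strpos($html, '>');   // oversized tag at the start: keep it whole
                $cut = ($end === false) ? strlen($html) : $end + 1;
            }
        }
        $chunks[] = substr($html, 0, $cut);
        $html = substr($html, $cut);
    }
    return $chunks;
}

$input = '<p>Hello</p><table width="528"><tr><td>x</td></tr></table>';
$parts = splitOutsideTags($input, 20);
var_dump($parts);
```

Note that this only keeps individual tags intact. It does not keep a whole <table> or <tr> together in one chunk; that would need a real parser.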

Convert HTML table to CSV via PHP

I am trying to pull each td element from the html table below and import each element into its own cell in a CSV file.
Here are the two html tables:
<table width="100%" border="0" cellspacing="1" cellpadding="0" bgcolor="#006699">
<tr align="center" class="tableRow1Font">
<td width="7%">WAITLIST</td>
<td width="5%">91630</td>
<td width="11%">
ACCY 2001
</td>
<td width="5%">10</td>
<td width="16%">Intro Financial Accounting</td>
<td width="6%">3.00</td>
<td width="8%"> Zou, Y</td>
<td width="8%"><A HREF="http://www.gwu.edu/~map/building.cfm?BLDG=DUQUES" target="_blank"
>DUQUES</a> 251</td>
<td width="13%">TR<br>09:35AM - 10:50AM</td>
<td width="14%">
01/13/14 - 04/28/14
</td>
<td width="7%">
</td>
</tr>
</table>
<table width="100%" border="0" cellspacing="1" cellpadding="0" bgcolor="#006699">
<tr align="center" class="tableRow2Font">
<td width="7%">WAITLIST</td>
<td width="5%">90003</td>
<td width="11%">
ACCY 2001
</td>
<td width="5%">11</td>
<td width="16%">Intro Financial Accounting</td>
<td width="6%">3.00</td>
<td width="8%"> Zou, Y</td>
<td width="8%"><A HREF="http://www.gwu.edu/~map/building.cfm?BLDG=DUQUES" target="_blank"
>DUQUES</a> 254</td>
<td width="13%">TR<br>11:10AM - 12:25PM</td>
<td width="14%">
1/13/14 - 04/28/14
</td>
<td width="7%">
</td>
</tr>
</table>
I have written code that goes through the tables and pulls the td elements:
foreach($html->find('tr[align=center] td') as $e)
$str .= strip_tags($e->innertext) . ', ';
echo $str;
So how can I extract these elements into a CSV file? In Excel I want it to look like this with each td element in its own cell, starting a new row for each html table:
WAITLIST 91630 ACCY 2001 10 Intro Financial Accounting 3.00 Zou, Y DUQUES 251 TR
WAITLIST 90003 ACCY 2001 11 Intro Financial Accounting 3.00 Zou, Y DUQUES 251 TR
A library exists for this. Go to http://phpexcel.codeplex.com/, download the zip file, and try the code in the example file 17html.php. I hope this helps.
CSV stands for comma-separated values. So, as you echo out the data (after running it through your function to strip the <td> tags), put a comma between each piece of data (cell) and a newline where you want the next row to start. A cell that itself contains a comma, such as Zou, Y, has to be wrapped in double quotes so that it stays in one cell.
So, to use your example above, it should look like this:
WAITLIST,91630,ACCY 2001,10,Intro Financial Accounting,3.00,"Zou, Y",DUQUES 251,TR
WAITLIST,90003,ACCY 2001,11,Intro Financial Accounting,3.00,"Zou, Y",DUQUES 254,TR
Keep in mind that when you echo this, you shouldn't output any other HTML tags or anything else.
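Rather than gluing the commas in by hand, PHP's built-in fputcsv() takes care of the quoting. A minimal sketch, assuming you have already collected one array per table row; the $rows values below are copied from the question's HTML, and the in-memory stream is used only to keep the example self-contained.

```php
<?php
// One array per HTML table; in practice these would come from the scraper loop.
$rows = [
    ['WAITLIST', '91630', 'ACCY 2001', '10', 'Intro Financial Accounting', '3.00', 'Zou, Y', 'DUQUES 251', 'TR 09:35AM - 10:50AM'],
    ['WAITLIST', '90003', 'ACCY 2001', '11', 'Intro Financial Accounting', '3.00', 'Zou, Y', 'DUQUES 254', 'TR 11:10AM - 12:25PM'],
];

$out = fopen('php://memory', 'w+');   // swap for fopen('courses.csv', 'w') to write a file
foreach ($rows as $row) {
    // fputcsv() adds the separators and quotes any field that needs it.
    fputcsv($out, $row, ',', '"', '\\');
}
rewind($out);
$csv = stream_get_contents($out);
fclose($out);

echo $csv;
```

The file name courses.csv in the comment is just a placeholder. Excel should then open each value in its own cell, with "Zou, Y" kept together.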

Parsing info from a table without headers, using PHP, DOM and cUrl

I need to parse data from a table that I scrape from a different website using PHP.
The table looks like this:
<table id="IWGRD" border="1" cellpadding="0" cellspacing="0" width="409" bordercolor="#FFFFFF" bordercolorlight="#FFFFFF" bordercolordark="#FFFFFF" class="IWGRDCSS" style="width:409;height:10;z-index:100;font-style:normal;font-size:10pt;text-decoration:none;">
<tbody>
<tr>
<td valign="middle" align="left" nowrap="" bgcolor="#A0A0A0">
<font style="font-size:10pt;"><b> Dag </b></font>
</td>
<td valign="middle" align="left" nowrap="" bgcolor="#A0A0A0">
<font style="font-size:10pt;"><b> Datum </b></font>
</td>
<td valign="middle" align="left" nowrap="" bgcolor="#A0A0A0">
<font style="font-size:10pt;"><b> Lesuur </b></font>
</td>
<td valign="middle" align="left" nowrap="" bgcolor="#A0A0A0">
<font style="font-size:10pt;"><b> Lokaal </b></font>
</td>
<td valign="middle" align="left" nowrap="" bgcolor="#A0A0A0">
<font style="font-size:10pt;"><b> Docent(en) </b></font>
</td>
<td valign="middle" align="left" nowrap="" bgcolor="#A0A0A0">
<font style="font-size:10pt;"><b> Vak </b></font>
</td>
<td valign="middle" align="left" nowrap="" bgcolor="#A0A0A0">
<font style="font-size:10pt;"><b> Groep(en) </b></font>
</td>
<td valign="middle" align="left" nowrap="" bgcolor="#A0A0A0">
<font style="font-size:10pt;"><b> Toelichting </b></font>
</td>
</tr>
<tr>
<td valign="middle" align="left" nowrap="">
<font style="font-size:10pt;"> Di </font>
</td>
<td valign="middle" align="left" nowrap="">
<font style="font-size:10pt;"> 12-11-2013 </font>
</td>
<td valign="middle" align="left" nowrap="">
<font style="font-size:10pt;"> 5 - 6 </font>
</td>
<td valign="middle" align="left" nowrap="">
<font style="font-size:10pt;"> B2.33 </font>
</td>
<td valign="middle" align="left" nowrap="">
<font style="font-size:10pt;"> LKH02 </font>
</td>
<td valign="middle" align="left" nowrap="">
<font style="font-size:10pt;"> SWSP14SLB1V13_SWSP15PRA1V13 </font>
</td>
<td valign="middle" align="left" nowrap="">
<font style="font-size:10pt;"> MAV1SP10 </font>
</td>
<td valign="middle" align="left" nowrap="">
<font style="font-size:10pt;"> SLB major 1 / praktijkleren </font>
</td>
</tr>
This table is generated by JavaScript.
The first tr holds the td elements that contain the headers, while the remaining rows hold the data that I need to parse.
I've been struggling with this for a while. I found an answer on this website that helped a little, but it reads the table using the ids of the td and th elements, while my table doesn't have ids on its rows or cells.
I'm using cURL to fetch the table HTML from another website and load it into the DOM like this:
<?php
include_once('/simple_dom/simple_html_dom.php');
//step1
$cSession = curl_init();
//step2
$tmpfname = dirname(__FILE__).'/cookie.txt';
curl_setopt($cSession, CURLOPT_COOKIEJAR, $tmpfname);
curl_setopt($cSession, CURLOPT_COOKIEFILE, $tmpfname);
curl_setopt($cSession,CURLOPT_URL,"http://anonymusurlbecauseofprivacyreasons?somegetters");
curl_setopt($cSession,CURLOPT_RETURNTRANSFER,true);
curl_setopt($cSession, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($cSession,CURLOPT_HEADER, false);
curl_setopt ($cSession, CURLOPT_COOKIESESSION, TRUE);
curl_setopt($cSession, CURLOPT_CAINFO, dirname(__FILE__)."/cacert.pem");
curl_setopt($cSession,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
$result=curl_exec($cSession);
if ($result === FALSE) {
echo "cURL Error: " . curl_error($cSession);
}
curl_close($cSession);
// create empty document
$dom = new DomDocument;
@$dom->loadHtml($result);
$xpath = new DomXPath($dom);
So far, so good.
But now comes the part of the code that I can't get working.
To read out the data I copied and edited the code from this thread (How to parse this table and extract data from it?), but I can't get it to work.
// collect data
foreach ($xpath->query('//table[@id="IWGRD"]/tr') as $node) {
$rowData = array();
foreach ($xpath->query('td', $node) as $cell) {
$rowcleaned = str_replace("\xc2\xa0","", $cell->textContent);
$rowData[] = $rowcleaned;
}
}
print_r($rowData);
Which gives me the following output:
Array ( [0] => [1] => [2] => 7 - 8 [3] => S0.20 [4] => SPHdeBruin [5] => SWSP17KBOOV13 [6] => MAV1SP09,MAV1SP10 [7] => Bewegingsagogiek )
This is the correct output for the last row, but I need all the rows.
So the output I need is every row except the header row at the top.
Something like:
array[1] = ([0] => Mon [1] => 11-11-2013 [2] => 7 - 8 [3] => S0.20 [4] => SPHdeBruin [5] => SWSP17KBOOV13 [6] => MAV1SP09,MAV1SP10 [7] => Bewegingsagogiek)
Array[2] = ([0] => Mon [1] => 11-11-2013 [2] => 8 - 9 [3] => S0.20 [4] => name [5] => SWSP17KBOOV13 [6] => MAV1SP09,MAV1SP10 [7] => randomresult)
That way I can use the info and put it in variables to pass on to an app.
Does anyone know how to do this? I've been working on this for hours because I have no experience with cURL or the DOM whatsoever.
Any help is much appreciated! :)
It seems like you're not collecting every row as you go along...
$tableData = array();
foreach ($xpath->query('//table[@id="IWGRD"]/tr') as $node) {
$rowData = array();
foreach ($xpath->query('td', $node) as $cell) {
$rowcleaned = str_replace("\xc2\xa0","", $cell->textContent);
$rowData[] = $rowcleaned;
}
$tableData[] = $rowData;
}
print_r($tableData);
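Since the question also asks to leave out the header row, one possible extension is to read the first row as column names and key every later row by them. This is a sketch, not part of the answer above: the three-column sample is made up in the shape of the question's table, $cellsOf is a hypothetical helper, and `//tr` is used so the query works with or without <tbody>.

```php
<?php
// Hypothetical sample in the same shape as the question's table: headers in the first <tr>.
$result = <<<HTML
<table id="IWGRD">
<tr><td><b>&nbsp;Dag&nbsp;</b></td><td><b>&nbsp;Datum&nbsp;</b></td><td><b>&nbsp;Lokaal&nbsp;</b></td></tr>
<tr><td>&nbsp;Di&nbsp;</td><td>&nbsp;12-11-2013&nbsp;</td><td>&nbsp;B2.33&nbsp;</td></tr>
<tr><td>&nbsp;Wo&nbsp;</td><td>&nbsp;13-11-2013&nbsp;</td><td>&nbsp;S0.20&nbsp;</td></tr>
</table>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($result);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Cell texts of one row, with non-breaking spaces and padding removed.
$cellsOf = function (DOMNode $row) use ($xpath): array {
    $cells = [];
    foreach ($xpath->query('td', $row) as $cell) {
        $cells[] = trim(str_replace("\xc2\xa0", ' ', $cell->textContent));
    }
    return $cells;
};

$rows = $xpath->query('//table[@id="IWGRD"]//tr');
$headers = $cellsOf($rows->item(0));          // first row holds the column names
$tableData = [];
for ($i = 1; $i < $rows->length; $i++) {      // start at 1 to skip the header row
    // array_combine() expects each row to have as many cells as the header row.
    $tableData[] = array_combine($headers, $cellsOf($rows->item($i)));
}

print_r($tableData);
```

Each entry can then be read by name, for example $tableData[0]['Datum'].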

Warning: DOMXPath::query(): Invalid expression

I'm an XPath newbie. I want to loop through the result of a cURL query and print each element of the only table on the page.
I've used the XPath plugin for Firefox to obtain my expression, and my table is structured as follows:
<table>
<tr class="listItemOneBg">
<td valign="top">
SMITH
</td>
<td valign="top">
WILLIAM C C
</td>
<td valign="top">
Male
</td>
<td valign="top">
</td>
<td valign="top">
</td>
<td valign="top">
</td>
<td valign="top">
</td>
<td valign="top">
BLACKWOOD
</td>
<td valign="top">
61
</td>
<td valign="top">
1924
</td>
<td valign="top">
<a target="_blank" href='XXX'>
order</a>
</td>
</tr>
<tr class="listItemTwoBg">
<td valign="top">
SMITH
</td>
<td valign="top">
WILLIAM C PAGE-
</td>
<td valign="top">
Male
</td>
<td valign="top">
</td>
<td valign="top">
</td>
<td valign="top">
</td>
<td valign="top">
</td>
<td valign="top">
SWAN
</td>
<td valign="top">
9
</td>
<td valign="top">
1914
</td>
<td valign="top">
<a target="_blank" href='XXY'>
order</a>
</td>
</tr>
Here's the code I've tried so far. I get the message "Warning: Invalid argument supplied for foreach()". What am I doing wrong?
$page = curl_exec($ch);
curl_close($ch);
// Create new PHP DOM document
$dom = new DOMDocument;
// Load html from curl request into document model
@$dom->loadHTML($page);
$xpath = new DOMXPath($dom);
$tableRows = $xpath->query("id('divResults')/table/tbody/tr");
foreach ($tableRows as $row) {
// fetch all 'tds' inside this 'tr'
$td = $xpath->query('td', $row);
echo $td->item(1)->textContent;
}
Assuming the table you're after is actually in a <div id="divResults">...
$tableRows = $xpath->query('//div[@id="divResults"]/table/tbody/tr');
foreach ($tableRows as $row) {
$cells = $row->getElementsByTagName('td');
}
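A slightly fuller sketch of the same idea, with a guard for the case that caused the foreach warning: DOMXPath::query() returns false when the expression is malformed. The sample markup is made up, and the `//tr` step is my assumption, chosen because the Firefox plugin shows a <tbody> that the browser inserted while, as far as I know, DOMDocument does not add one.

```php
<?php
// Made-up sample: a results div with a table that has no <tbody> in the raw HTML.
$page = '<div id="divResults"><table>'
      . '<tr><td>SMITH</td><td>WILLIAM C C</td></tr>'
      . '<tr><td>SMITH</td><td>WILLIAM C PAGE-</td></tr>'
      . '</table></div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($page);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// '//tr' matches the rows whether or not a <tbody> sits in between.
$tableRows = $xpath->query('//div[@id="divResults"]//tr');
if ($tableRows === false) {
    // A malformed expression yields false, which foreach cannot iterate.
    throw new RuntimeException('The XPath expression could not be evaluated');
}

$givenNames = [];
foreach ($tableRows as $row) {
    $cells = $row->getElementsByTagName('td');
    if ($cells->length > 1) {
        $givenNames[] = trim($cells->item(1)->textContent);   // second column
    }
}

print_r($givenNames);
```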
That's a non-standard XPath expression. It cannot work in DOMXPath. (Downvoters, the expression has been edited since the question was posted. Cheers!)
This is where you learn XPath:
Microsoft XPath Syntax
Microsoft XPath by Example
PS: It's where I learnt it.
