Beginner PHP scraping help - getting img src? - php

I am currently trying to increase my knowledge of PHP and I have set myself the task of scraping a website and turning the data I retrieve into a JSON format.
Here is an example row of the data I am trying to parse:
<tr>
<td class="first">
<img id="ctl00_Content_ctl00_rptInfo_ctl01_Image1" alt="Active" src="../../images/t1.jpg" style="border-width:0px;" />
</td>
<td >
Copenhagen
</td>
<td>
Sas
</td>
<td>
SK537
</td>
<td>
02 Apr 10:20
</td>
<td class="last">
Delayed 11:30
</td>
</tr>
And here is my PHP code so far:
$raw = file_get_contents($url);
$newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");
$content = str_replace($newlines, "", html_entity_decode($raw));
$start = strpos($content,'<table width="100%" cellspacing="0" cellpadding="0" border="0" summary="Departure times detail information"');
$end = strpos($content,'</table>',$start) + 8;
$table = substr($content,$start,$end-$start);
preg_match_all("|<tr(.*)</tr>|U",$table,$rows);
foreach ($rows[0] as $row){
    if ((strpos($row,'<th')===false)){
        preg_match_all("|<td(.*)</td>|U",$row,$cells);
        $url_src = strip_tags($cells[0][0]);
        $airport = strip_tags($cells[0][1]);
        $airline = strip_tags($cells[0][2]);
        $flightnum = strip_tags($cells[0][3]);
        $schedule = strip_tags($cells[0][4]);
        $status = strip_tags($cells[0][5]);
        echo "{$url_src} - {$airport} - {$airline} - {$flightnum} - {$schedule} - {$status}<br>\n";
    }
}
I can currently get nearly all values correctly except I cannot seem to get anything for the cell that contains this:
<td class="first">
<img id="ctl00_Content_ctl00_rptInfo_ctl01_Image1" alt="Active" src="../../images/t1.jpg" style="border-width:0px;" />
</td>
Can anyone help me work out how to get the img string? I would be happy just to get the entire string within the <td></td>, like this:
<img id="ctl00_Content_ctl00_rptInfo_ctl01_Image1" alt="Active" src="../../images/t1.jpg" style="border-width:0px;" />
But if it's possible to parse out just the src string, that would be very helpful.

Your <img> tag is not opening at all, that's why your regular expression won't parse it.
Try:
<td class="first">
<img id="ctl00_Content_ctl00_rptInfo_ctl01_Image1" alt="Active" src="../../images/t1.jpg" style="border-width:0px;" />
</td>
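For what it's worth, the code in the question runs strip_tags() on every cell, and strip_tags() removes the <img> element itself, so the first cell ends up empty. Here is a minimal sketch that reads the src attribute with PHP's built-in DOM extension instead. It assumes the row markup from the question, and the array keys and the $flights variable are just illustrative names:

```php
<?php
// Sample row copied from the question; in practice this would be the fetched page.
$html = <<<HTML
<table><tr>
<td class="first">
<img id="ctl00_Content_ctl00_rptInfo_ctl01_Image1" alt="Active" src="../../images/t1.jpg" style="border-width:0px;" />
</td>
<td>Copenhagen</td>
<td>Sas</td>
<td>SK537</td>
<td>02 Apr 10:20</td>
<td class="last">Delayed 11:30</td>
</tr></table>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // scraped HTML is rarely valid; collect warnings quietly
$doc->loadHTML($html);
libxml_clear_errors();

$flights = [];
foreach ($doc->getElementsByTagName('tr') as $row) {
    $cells = $row->getElementsByTagName('td');
    if ($cells->length < 6) {
        continue; // header rows use <th>, so they have too few <td> cells
    }
    // Read the attribute from the <img> node instead of stripping tags.
    $img = $cells->item(0)->getElementsByTagName('img')->item(0);
    $flights[] = [
        'image'     => $img ? $img->getAttribute('src') : null,
        'airport'   => trim($cells->item(1)->textContent),
        'airline'   => trim($cells->item(2)->textContent),
        'flight'    => trim($cells->item(3)->textContent),
        'scheduled' => trim($cells->item(4)->textContent),
        'status'    => trim($cells->item(5)->textContent),
    ];
}

echo json_encode($flights, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES), "\n";
```

Since the end goal is JSON, collecting the rows into an array and calling json_encode() once at the end is probably simpler than echoing each row.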

Related

Simple HTML DOM: accessing html elements within results

I'm trying to get a better understanding of PHP Simple HTML DOM and am kinda stuck on the following.
I am trying to retrieve information from one of my user pages by using the following code:
$dom = file_get_html('http://127.0.0.1/comments/top-commenters/');
foreach($dom->find('tr[id*=commenter]') as $result) {
print_r($result->innertext);
}
Which produces the following for each commenter profile ($result->innertext):
<td class="Position"># 3 </td>
<td class="img" align="center">
<a href="/images/users/814ocnqlN6.jpg">
<img src="/images/users/814ocnqlN6.jpg" info="Image" border="0"/></a>
<a uid="814ocnqlN6"></td>
<td> <b>User 3.</b>
<div class="tiny">Most recent comments</div>
</td>
<td class="NumCredits"> 471 </td>
<td class="NumComments"> 5.439 </td>
<td class="PercUpVotes"> 93% </td>
Now, suppose I would like to access, within each result (same foreach loop), for example:
<td class="Position"># 3 </td>
And
<td class="NumComments"> 5.439 </td>
What would be the best way to accomplish this?
Try:
$dom = file_get_html('http://127.0.0.1/comments/top-commenters/');
foreach($dom->find('tr[id*=commenter]') as $result) {
    // find() is called on the row, so it only searches inside this <tr>
    print_r($result->find('td.Position'));
    print_r($result->find('td.NumComments'));
}
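If you would rather not depend on Simple HTML DOM, the same "search inside each result" idea can be sketched with PHP's built-in DOMXPath, which accepts a context node as its second argument. This is a swapped-in technique, not the library from the question. The id value "commenter-3" is an assumption (the question only shows that the id contains "commenter"), and $results is an illustrative name:

```php
<?php
// Hypothetical sample: the id "commenter-3" is assumed; the cells come from the question.
$html = <<<HTML
<table>
<tr id="commenter-3">
<td class="Position"># 3 </td>
<td class="NumCredits"> 471 </td>
<td class="NumComments"> 5.439 </td>
<td class="PercUpVotes"> 93% </td>
</tr>
</table>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

$results = [];
// contains(@id, "commenter") mirrors the tr[id*=commenter] selector.
foreach ($xpath->query('//tr[contains(@id, "commenter")]') as $row) {
    // Passing $row as the context node keeps each query inside that row.
    $position = $xpath->query('td[@class="Position"]', $row)->item(0);
    $comments = $xpath->query('td[@class="NumComments"]', $row)->item(0);
    $results[] = [
        'position' => $position ? trim($position->textContent) : null,
        'comments' => $comments ? trim($comments->textContent) : null,
    ];
}

print_r($results);
```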

Select and copy a content from a website

I'm trying to extract part of my website to get some content information. The content I'm trying to put into a variable looks like this:
<table class="tabelaHistorico">
<tr>
<td bgcolor=#ccccc></td>
<td bgcolor=#ccccc>2014</td>
</tr>
<tr>
<td>
Jan
</td>
<td>
9719,46
</td>
</tr>
<tr>
<td>
Fev
</td>
<td>
9421,65
</td>
</tr>
</table>
I tried to do:
$content = file_get_contents("www.website.com");
$pos = strpos($content,"table" , 0);
echo $pos;
printf($pos);
$rest = substr($content, $pos, 5);
echo $rest;
You'll need a proper HTML parser. Luckily, PHP has one built-in.
http://www.php.net/manual/en/domdocument.loadhtml.php
preg_match('!<table class="tabelaHistorico">.+?</table>!s', $content, $match);
echo $match[0];
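To make the "proper HTML parser" suggestion concrete, here is a rough sketch using DOMDocument and DOMXPath on the table from the question. It assumes the first row is a header with an empty first cell and that the values use a decimal comma; $values is an illustrative name:

```php
<?php
// Table markup taken from the question; in practice $content would come from file_get_contents().
$content = <<<HTML
<table class="tabelaHistorico">
<tr><td bgcolor=#ccccc></td><td bgcolor=#ccccc>2014</td></tr>
<tr><td>Jan</td><td>9719,46</td></tr>
<tr><td>Fev</td><td>9421,65</td></tr>
</table>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($content);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

$values = [];
foreach ($xpath->query('//table[@class="tabelaHistorico"]//tr') as $row) {
    $cells = $row->getElementsByTagName('td');
    if ($cells->length < 2) {
        continue;
    }
    $label = trim($cells->item(0)->textContent);
    $value = trim($cells->item(1)->textContent);
    if ($label === '') {
        continue; // header row: its first cell is empty
    }
    // Swap the decimal comma for a dot before casting to float.
    $values[$label] = (float) str_replace(',', '.', $value);
}

print_r($values);
```

Note that file_get_contents() needs a full URL including the scheme (for example http://), otherwise PHP looks for a local file with that name.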

extracting a link in a html table php using simple_html_dom

I'm trying to extract a specific link from a table, but nothing is displayed. It's the 3rd link in the td. I thought this would work, but it doesn't.
Here is the code:
<?php
$site = 'site';
$html = file_get_html($site);
foreach($html->find('td a', 3) as $element)
echo $element->href;
?>
Here is the HTML
<tr class="evenrow team-600-359">
<td>
Aug 17
</td>
<td>
FT
</td>
<td align="right">
Arsenal
</td>
<td align="center">
1-3
</td>
<td>Aston Villa</td>
<td style="text-align:right;">60,003</td>
</td>
<td>
Premier League
</td>
</tr>
You have invalid HTML, which may be the cause: the td containing the 60,003 value is closed twice.
Just use the native DOMDocument:
$str = <<<STR
<tr class="evenrow team-600-359">
<td>
Aug 17
</td>
<td>
FT
</td>
<td align="right">
Arsenal
</td>
<td align="center">
1-3
</td>
<td>Aston Villa</td>
<td style="text-align:right;">60,003</td>
</td>
<td>
Premier League
</td>
</tr>
STR;
$dom = new DOMDocument();
@$dom->loadHTML($str); // "@" silences the warning about the stray </td>
$elements = $dom->getElementsByTagName('td');
echo '<pre>' . print_r($dom->saveXML($elements->item(2)), true) . '</pre>';
OUTPUT
<td align="right">
Arsenal
</td>
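As far as I can tell, there is also an indexing problem in the question's code: when find() is given an index it returns a single element rather than an array, and indexes start at 0, so index 3 is the fourth match. The sketch below fetches the third link with DOMDocument. The links were lost from the HTML posted in the question, so every href here is made up purely for illustration:

```php
<?php
// Hypothetical markup: these <a> tags and hrefs are invented for the example.
$html = <<<HTML
<table><tr class="evenrow team-600-359">
<td>Aug 17</td>
<td><a href="/report/1">FT</a></td>
<td align="right"><a href="/team/arsenal">Arsenal</a></td>
<td align="center"><a href="/match/1">1-3</a></td>
<td><a href="/team/aston-villa">Aston Villa</a></td>
</tr></table>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$links = $doc->getElementsByTagName('a');
// item() is zero-based, so the third link is item(2); it returns null when out of range.
$third = $links->item(2);
$href = $third ? $third->getAttribute('href') : null;

echo $href, "\n";
```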

How To Format This Scraped Content

I'm grabbing the content of every td with class="job" in this table using this:
$table01 = $salary->find('table.table01');
$rows = $table01[0]->find('td.job');
Then I'm using this to output it, which works but obviously only outputs plain text. I need to do some more with it:
foreach($table01[0]->find('td.job') as $element) {
$jobs .= $element->plaintext . '<br />';
}
Ultimately I would like the output in this format. Notice that the a href uses the job name, with spaces and / replaced by a -.
<tr>
<td class="small"> Graphic Artist / Designer
$23,755 – $55,335 </td>
</tr>
<tr>
<td class="small"> Sales Associate<br />
$15,577 – $56,290 </td>
</tr>
<tr>
<td class="small"> Film / Video Editor<br />
$24,184 – $94,493 </td>
</tr>
Here's the table I'm scraping:
<table cellpadding="0" cellspacing="0" border="0" class="table01">
<tr>
<td class="head">Test</td>
<td class="job">
Graphic Artist / Designer<br/>
$23,755 – $55,335
</td>
</tr>
<tr>
<td class="head">Test</td>
<td class="job">
Sales Associate<br/>
$15,577 – $56,290
</td>
</tr>
<tr>
<td class="head">Test</td>
<td class="job">
Film / Video Editor<br/>
$24,184 – $94,493
</td>
</tr>
</table>
It may be better to use regular expressions:
<?php
$html=file_get_contents('1.html');
$jobs='';
if(preg_match_all("/<tr>.*?<td.*?>.*?<\/td>.*?<td\sclass=\"job\">.*?<a.+?href=\"(.+?)\".+?>(.*?)<\/a>(.*?)<\/td>.*?<\/tr>/ims", $html, $res))
{
    foreach($res[1] as $i=>$uri)
    {
        $uri=strtolower(urldecode($uri));
        $uri=preg_replace("/_\/_/",'-',$uri);
        $uri=preg_replace("/_/",'-',$uri);
        $jobs.='<tr><td class="small"> '.$res[2][$i].''.$res[3][$i].'</td></tr>'."\n";
    }
}
echo $jobs;
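A DOM-based alternative is also possible. This is only a sketch under a few assumptions: each td.job holds the job title, a <br/>, then the salary range; the link target is just the slug itself (the real URL pattern isn't shown in the question); and runs of spaces and slashes collapse into a single dash:

```php
<?php
// Two rows from the question's table; nowdoc syntax keeps the "$" signs literal.
$html = <<<'HTML'
<table cellpadding="0" cellspacing="0" border="0" class="table01">
<tr><td class="head">Test</td><td class="job">Graphic Artist / Designer<br/>$23,755 – $55,335</td></tr>
<tr><td class="head">Test</td><td class="job">Sales Associate<br/>$15,577 – $56,290</td></tr>
</table>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);
// The meta tag declares UTF-8 so the en dash in the salary range is decoded correctly.
$doc->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

$jobs = '';
foreach ($xpath->query('//table[@class="table01"]//td[@class="job"]') as $cell) {
    // Collect the non-empty text nodes: job title first, salary range second.
    $parts = [];
    foreach ($cell->childNodes as $node) {
        if ($node->nodeType === XML_TEXT_NODE && trim($node->textContent) !== '') {
            $parts[] = trim($node->textContent);
        }
    }
    if (count($parts) < 2) {
        continue;
    }
    [$title, $salary] = $parts;
    // Replace each run of whitespace and/or slashes with one dash.
    $slug = preg_replace('~[\s/]+~', '-', $title);
    $jobs .= '<tr><td class="small"><a href="' . htmlspecialchars($slug) . '">'
        . htmlspecialchars($title) . '</a><br />' . htmlspecialchars($salary) . "</td></tr>\n";
}

echo $jobs;
```

If you want every single space and slash turned into its own dash, drop the + from the pattern.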

PHP REGEX: Find a dom node based on innerHTML

I'm well aware that PHP DOM can solve half of my problem, but I need a way (not necessarily regex) to find a certain DOM element based on a given innerHTML.
Say, for example, I have this markup:
<tr>
<td class="ranking_rank" style="vertical-align:middle;">48697</td>
<td class="ranking_ign" style="vertical-align:middle;">kanineh</td>
<td class="ranking_img" style="vertical-align:middle;">
<img src="http://avatar.maplesea.com/Character/NKGEHGDLFNINKPMFLDCNNOHKHKBOHBKLGCBLABFLABHAGBPAEMDEFABJBLKJIHJAANGEKFJGELEPKMCNLKPCINEJDGAJFLKG.gif" onerror="this.src='/images/ranking/noimage.jpg'"/>
</td>
<td class="ranking_lvl" style="vertical-align:middle;">122</td>
<td class="ranking_world" style="vertical-align:middle;">
<img src="/images/ranking/Bootes.gif" onMouseover="ddrivetip('Bootes','white', 70)" onMouseout="hidetip()">
</td>
<td class="ranking_job" style="vertical-align:middle;">
<img src="/images/ranking/Warrior.gif" onMouseover="ddrivetip('Warrior','white', 70)" onMouseout="hidetip()">
</td>
<td class="ranking_fame" style="vertical-align:middle;">449</td>
</tr>
<tr>
<td class="ranking_rank" style="vertical-align:middle;">48698</td>
<td class="ranking_ign" style="vertical-align:middle;">WannaLogic</td>
<td class="ranking_img" style="vertical-align:middle;">
<img src="http://avatar.maplesea.com/Character/DOMELFGEGCGDBFCOLADBDOJLHADCIBNKEGKGINPNBEKPDDKOEEGBLMDLBGBDHGCNPGLAECAMLGKEMDKJGPODIDKCOJCMNNKN.gif" onerror="this.src='/images/ranking/noimage.jpg'"/>
</td>
<td class="ranking_lvl" style="vertical-align:middle;">122</td>
<td class="ranking_world" style="vertical-align:middle;">
<img src="/images/ranking/Aquila.gif" onMouseover="ddrivetip('Aquila','white', 70)" onMouseout="hidetip()">
</td>
<td class="ranking_job" style="vertical-align:middle;">
<img src="/images/ranking/Magician.gif" onMouseover="ddrivetip('Magician','white', 70)" onMouseout="hidetip()">
</td>
<td class="ranking_fame" style="vertical-align:middle;">56</td>
</tr>
I need to get hold of the whole row node containing the td that has WannaLogic in it. Once I have that table row, I can easily traverse its nodes using PHP DOM. I'm a sucker for regular expressions, so I'd really appreciate it if you could shed some light on this.
Using regex on a DOM tree is a no-no and bound to fail when faced with malformed XML/HTML. Try this:
$xpath = new DOMXPath($doc);
$query = "//*[.='WannaLogic']";
$entries = $xpath->query($query);
foreach ($entries as $entry) {
    // $entry is the matching <td>; $entry->parentNode is its enclosing <tr>
}
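Since the question asks for the whole row, the XPath can also select the <tr> directly by putting the cell test in a predicate. A small sketch, using a trimmed-down version of the question's markup and assuming the name contains no quote characters (it is interpolated straight into the query):

```php
<?php
// Reduced version of the markup from the question (image cells omitted).
$html = <<<HTML
<table>
<tr>
<td class="ranking_rank">48697</td>
<td class="ranking_ign">kanineh</td>
<td class="ranking_lvl">122</td>
</tr>
<tr>
<td class="ranking_rank">48698</td>
<td class="ranking_ign">WannaLogic</td>
<td class="ranking_lvl">122</td>
</tr>
</table>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

$name = 'WannaLogic'; // assumed to contain no quotes
// Select the <tr> itself, filtered by the text of its ranking_ign cell.
$row = $xpath->query("//tr[td[@class='ranking_ign'][normalize-space(.)='$name']]")->item(0);

// With the row in hand, other cells can be read relative to it.
$rank = $row ? trim($xpath->query("td[@class='ranking_rank']", $row)->item(0)->textContent) : null;

echo $rank, "\n";
```

normalize-space() is used so that stray whitespace around the name doesn't break the match.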
