Scraping html tables with different number of rows - php

I'm trying to pull data from this site, http://www.citizencorps.fema.gov/cc/CertIndex.do?reportsForState&cert=&state=IN, using PHP. Can anyone please tell me why my code below isn't working? Ideally, I want to pull the name, point of contact, phone number, email, and brief description (if one exists), then convert that data into a CSV file.
<?php
require_once "support/simple_html_dom.php";
$url = "http://www.citizencorps.fema.gov/cc/CertIndex.do?reportsForState&cert=&state=IN";
$html = file_get_html($url);
foreach($html->find('tr') as $row) {
$name = $row->find('td', 0)->plaintext;
$poc = $row->find('td', 1)->plaintext;
$phone = $row->find('td', 2)->plaintext;
$email = $row->find('td', 3)->plaintext;
if(count($row->find('td', 4)->plaintext) > 0) {
$desc = find('td', 4)->plaintext;
}
print_r($name.'<br/>'. $poc.'<br/>'.$phone.'<br/>'.$email.'<br/>'.$desc);
}
?>
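A few things in the code above look suspicious: count() is applied to a string rather than an array, the assignment to $desc calls find() as a bare function instead of $row->find(), and $desc keeps its value from the previous row when the fifth cell is missing. Header rows with no td cells would also make $row->find('td', 0) come back empty. Below is a minimal sketch of one way to handle rows with a varying number of cells. It uses PHP's built-in DOMDocument instead of simple_html_dom, and the sample markup, the extractRows() and rowsToCsv() helpers, and the five-column layout are assumptions for illustration, not the real structure of the site.

```php
<?php
// Sample markup standing in for the real page (assumed structure).
$sampleHtml = <<<HTML
<table>
  <tr><th>Name</th><th>Point of contact</th><th>Phone</th><th>Email</th></tr>
  <tr><td>Team A</td><td>Jane Doe</td><td>555-0100</td><td>jane@example.org</td><td>Serves the county</td></tr>
  <tr><td>Team B</td><td>John Roe</td><td>555-0101</td><td>john@example.org</td></tr>
</table>
HTML;

// Hypothetical helper: return one array per data row, padded to $columns cells.
function extractRows(string $html, int $columns = 5): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings quietly.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $rows = [];
    foreach ($doc->getElementsByTagName('tr') as $tr) {
        $cells = [];
        foreach ($tr->getElementsByTagName('td') as $td) {
            $cells[] = trim($td->textContent);
        }
        if ($cells === []) {
            continue; // header or spacer row: no <td> cells at all
        }
        // Pad short rows so a missing description becomes an empty string.
        $rows[] = array_slice(array_pad($cells, $columns, ''), 0, $columns);
    }
    return $rows;
}

// Hypothetical helper: turn the rows into CSV text using fputcsv().
function rowsToCsv(array $rows): string
{
    $handle = fopen('php://memory', 'w+');
    foreach ($rows as $row) {
        fputcsv($handle, $row, ',', '"', '\\');
    }
    rewind($handle);
    $csv = stream_get_contents($handle);
    fclose($handle);
    return $csv;
}

$rows = extractRows($sampleHtml);
echo rowsToCsv($rows);
```

To write a file instead of a string, the same fputcsv() loop can target a handle opened with fopen() on a real path.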

Related

XML only stores 79 entries? - PHP webscraper

I have been playing around with a simple PHP web scraper I've built for a small project of mine. The scraper runs through job posts on a website and stores all relevant information in a nested array, which I then save to an XML file. However, the problem is that whenever I run the code it only stores the first 79 job posts, and I can't seem to find the cause (I know there are more job posts with the class I'm searching for).
If anyone can point me in the right direction or has tried something similar themselves, it would be nice to get a solution :)
I'm running the server locally via MAMP. I don't know if that could be the problem?
include('simple_html_dom.php');
$Pages = array();
$JobOffers = array();
$html = file_get_html("https://www.jobindex.dk/jobsoegning?q=studiejob");
$NumPage = $html->find('li.page-item');
foreach ($NumPage as $page){
$res = preg_replace("/[^0-9]/", "", $page->plaintext);
$PageNumber = $res.trim();
$PageNumToInt = (int)$PageNumber;
array_push($Pages, $PageNumToInt);
}
$HighestValue = max($Pages);
for($i = 8; $i <= $HighestValue; $i++){
$Newhtml = file_get_html("https://www.jobindex.dk/jobsoegning?page=".$i."&q=studiejob");
$items = $Newhtml->find('div.PaidJob');
foreach ($items as $job){
$RareTitle = $job->find("a", 0)->plaintext;
$CommonTitle = $job->find("a", 1)->plaintext;
$Virksomhed = $job->find("a", 2)->plaintext;
$LinkHref = $job->find("a", 1)->href;
$DisP1 = $job->find("p", 1)->plaintext;
$DisP2 = $job->find("p", 2)->plaintext;
$Dis = $DisP1 . " " . $DisP2;
$date = date("d/m/Y");
$prefix = "JoIn";
echo $RareTitle;
echo $CommonTitle;
echo $Virksomhed;
echo $LinkHref;
echo $Dis;
echo $date;
echo $prefix;
$SingleJob = array($CommonTitle, $RareTitle, $Virksomhed, $Dis, $LinkHref, $date, $prefix);
array_push($JobOffers,$SingleJob);
}}
This code saves the job offers to a local XML file:
function SaveJobs($JobInfo){
if(file_exists("./xml/JobOffers.xml")){
$i = 1;
foreach ($JobInfo as $jobs){
$xml = new DOMDocument("1.0", "utf-8");
$xml->load("./xml/JobOffers.xml");
// Creating textnode with line break
$textNode = $xml->createTextNode("\n");
// root Element
$root = $xml->getElementsByTagName("job")->item(0);
$root->appendChild($textNode);
// Create Singlejob Element
$SingleJob = $xml->createElement("Jobitem");
//ID Attribute
$DomAtt1 = $xml->createAttribute('ID');
$DomAtt1->value = $i.$jobs[6];
$SingleJob->appendChild($DomAtt1);
//Date Attribute
$DomAtt2 = $xml->createAttribute('Date');
$DomAtt2->value = $jobs[5];
$SingleJob->appendChild($DomAtt2);
// Creating Elements
$TitleElement = $xml->createElement("Title", $jobs[0]);
$SecTitle = $xml->createElement("SecTitle", $jobs[1]);
$Firm = $xml->createElement("Firm", $jobs[2]);
$dis = $xml->createElement("Description", $jobs[3]);
$Linkhref = $xml->createElement("Linkhref", $jobs[4]);
// Append data to SingleJob Element
$SingleJob->appendChild($TitleElement);
$SingleJob->appendChild($SecTitle);
$SingleJob->appendChild($Firm);
$SingleJob->appendChild($dis);
$SingleJob->appendChild($Linkhref);
// Append Singlejob to root and save the changes
$root->appendChild($SingleJob);
$xml->save("./xml/JobOffers.xml");
$i++;
}
}}
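One thing worth checking, offered as a guess rather than a confirmed diagnosis: the value passed as the second argument of DOMDocument::createElement() is not escaped, so a title or link containing "&" raises a warning and the text is lost. The function above also reloads and rewrites the file once per job. The sketch below builds the whole document in memory in a single pass and adds text through createTextNode(), which does escape special characters. The sample data and the buildJobsXml() helper are hypothetical, and unlike the original it creates a fresh job root instead of appending to an existing file.

```php
<?php
// Hypothetical sample data in the same order as $SingleJob above:
// title, secondary title, firm, description, link, date, prefix.
$jobOffers = [
    ['Student job', 'Assistant', 'Acme & Sons', 'Help with R&D', 'https://example.com/?a=1&b=2', '01/01/2020', 'JoIn'],
    ['Barista', 'Cafe', 'Bean <3', 'Make coffee', 'https://example.com/2', '01/01/2020', 'JoIn'],
];

// Hypothetical helper: build the XML for all jobs in one pass.
function buildJobsXml(array $jobOffers): string
{
    $xml = new DOMDocument('1.0', 'utf-8');
    $xml->formatOutput = true;
    $root = $xml->appendChild($xml->createElement('job'));

    $fields = ['Title', 'SecTitle', 'Firm', 'Description', 'Linkhref'];
    foreach ($jobOffers as $index => $job) {
        $item = $xml->createElement('Jobitem');
        $item->setAttribute('ID', ($index + 1) . $job[6]);
        $item->setAttribute('Date', $job[5]);
        foreach ($fields as $position => $tag) {
            $element = $xml->createElement($tag);
            // createTextNode() escapes &, < and > in the text.
            $element->appendChild($xml->createTextNode((string) $job[$position]));
            $item->appendChild($element);
        }
        $root->appendChild($item);
    }
    return $xml->saveXML();
}

$xmlString = buildJobsXml($jobOffers);
echo $xmlString;
```

The returned string can be written with file_put_contents(), so the file is touched once per run rather than once per job.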

HTML Pagination Parsing with PHP Simple HTML DOM Parser

I am trying to parse a movie website that uses pagination. I want to parse all movie items on page 1 and, when that is done, have the parser continue to the next page. I wrote a parser that runs, but it does not parse all movie items on the page and does not continue to the next page. I want to detect when one item has been parsed and move on to the next item, then detect when all movie items on a page have been parsed and move on to the next page. When I run the parser, I expect it to display the movie title, year, etc. one by one and then continue to the next page. Currently it displays only one movie item from page 1 and then stops. Here's my code and an example:
Parsing Example: http://minerbitco.in/parse/parse.php
<?php
include_once 'simple_html_dom.php';
$page = (!isset($_GET['page'])) ? 1 : $_GET['page'];
echo '<br> Parsing Page #'.$page.'<br><br>';
$html = file_get_html('https://srulad.com/movies/type/movie#page-'.$page);
$obj = $html->find('div.movie_item');
$datas = [];
if($obj){
foreach ($obj as $key => $data) {
$movie_url = 'https://srulad.com/'.$data->find('div.poster a', 0)->href;
$html2 = file_get_html($movie_url);
$item['url'] = $movie_url;
$item['year'] = $html2->find('#movie_content > div', 0)->children(2)->find('div', 0)->children(0)->children(1)->plaintext;
$item['genre'] = $html2->find('#movie_content > div', 0)->children(1)->find('span', 0)->plaintext;
$item['description'] = $html2->find('#movie_content > div', 0)->children(1)->find('div.plot', 0)->plaintext;
$item['imdb_rating'] = $html2->find('#movie_content > div', 0)->children(2)->find('div', 0)->children(1)->children(1)->find('span', 0)->plaintext;
$item['englishtitle'] = $html2->find('#movie_content > div', 0)->children(1)->find('h2.newmt', 0)->plaintext;
$item['geotitle'] = $html2->find('#movie_content > div', 0)->children(1)->find('h3.newmt', 0)->plaintext;
$item['poster'] = $html2->find('#movie_content > div', 0)->children(0)->find('img', 0)->src;
$url = $item['url'];
$year = $item['year'];
$desc = $item['description'];
$rating = $item['imdb_rating'];
$poster = $item['poster'];
$engtitle = $item['englishtitle'];
$geotitle = $item['geotitle'];
$genre = $item['genre'];
}}
if ($data === end($obj)) {
echo '<META http-equiv="refresh" content="10;URL=#page-'.($page+1).'">';
}
else {
echo "dasrulebulia.";
}
echo 'URL: '.$url.'<br>';
echo 'პოსტერის URL: '.$poster.'<br>';
echo 'სათაური ინგლისურად: '.$engtitle.'<br>';
echo 'სათაური ქართულად: '.$geotitle.'<br>';
echo 'წელი:'.$year.'<br>';
echo 'ჟანრი:'.$genre.'<br>';
echo 'აღწერა:'.$desc.'<br>';
echo 'რეიტინგი:'.$rating.'<br>';
?>
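Two details of the code above may explain the behaviour. The echo statements sit after the foreach loop has closed, so only the values from the last iteration are printed. Also, the "#page-N" part of the URL is a fragment, which is not sent to the server, so each request is likely to return the same document. A minimal sketch of the loop structure follows. The crawlPages() helper, the callable passed to it, and the fake data are all hypothetical stand-ins for the real scraping code, and how the site actually selects a page on the server side is not known from the question.

```php
<?php
/**
 * Walk pages 1..$maxPages, calling $fetchItems for each page number.
 * $fetchItems is a hypothetical callable that returns an array of items
 * for a page, or an empty array when the page has no results.
 */
function crawlPages(callable $fetchItems, int $maxPages): array
{
    $all = [];
    for ($page = 1; $page <= $maxPages; $page++) {
        $items = $fetchItems($page);
        if ($items === []) {
            break; // no more results, stop paging
        }
        foreach ($items as $item) {
            // Output happens inside the loop, once per item.
            echo "Page {$page}: {$item['title']} ({$item['year']})\n";
            $all[] = $item;
        }
    }
    return $all;
}

// Stand-in for the real scraping code: two pages of fake data.
$fakeSite = [
    1 => [['title' => 'Movie A', 'year' => 2001], ['title' => 'Movie B', 'year' => 2002]],
    2 => [['title' => 'Movie C', 'year' => 2003]],
];

$movies = crawlPages(function (int $page) use ($fakeSite): array {
    return $fakeSite[$page] ?? [];
}, 10);
```

In a real run the callable would fetch and parse one page, which keeps the paging logic separate from the selectors.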
You can try the parser I have written:
https://github.com/sachinsinghshekhawat/simple-html-dom-parser-php

PHP-MySQL inserts "Array" into a table in a foreach loop with Simple HTML Dom library

I have a piece of code similar to below:
include 'simplehtmldom/simple_html_dom.php';
...
...
foreach ($files as $file){
$results= array();
if(substr($file->getAttribute('href'),0,strlen($lookfor))==$lookfor){
$URLs= $file->getAttribute('href');
echo $URLs ."<br>";
$html = file_get_html($URLs);
foreach($html->find('div.postDisplay') as $post) {
$item['date'] = $post->find('p.id.post-date', 0)->plaintext;
$item['location'] = $post->find('p.id.post-location', 0)->plaintext;
$title = $item['title'] = $post->find('h1.id.post-title', 0)->plaintext;
$item['post'] = $post->find('div.post', 0)->plaintext;
$results[] = $item;
}
print_r($results) ."</br>";
...
...
...
$my_id ="1";
$photos = "1";
$insert_query = mysqli_query($db_connect, "INSERT INTO jackson.data (
my_id, photos, results) VALUES (
'$my_id', '$photos', '$results')");
The code echoes the $results values in the browser perfectly fine; however, when I insert the data into the database, the results field only stores "Array" as its value. Is there something I'm missing? And how can I insert the HTML-formatted $results values that are echoed in my browser rather than the plain text?
You are using print_r, which outputs the array with its indexes, and that's why the browser displays the result correctly. In your insert query, however, you are using the variable $results directly, and that fails because it contains an array. Try something like this:
Change your table structure to
jackson.data (my_id, photos, title,date,location,post)
and put the insert statement into the foreach loop and insert the values accordingly.
Example
foreach($html->find('div.postDisplay') as $post) {
$item['date'] = $post->find('p.id.post-date', 0)->plaintext;
$item['location'] = $post->find('p.id.post-location', 0)->plaintext;
$title = $item['title'] = $post->find('h1.id.post-title', 0)->plaintext;
$item['post'] = $post->find('div.post', 0)->plaintext;
$query=mysqli_query($db_connect,"INSERT INTO jackson.data (
my_id, photos, title,date,location,post) VALUES (
'$my_id', '$photos', '{$item['title']}', '{$item['date']}',.....)");
}
For html formatting:
Do something like this:
echo "<html><body>";
foreach($html->find('div.postDisplay') as $post) {
$item['date'] = $post->find('p.id.post-date', 0)->plaintext;
$item['location'] = $post->find('p.id.post-location', 0)->plaintext;
$title = $item['title'] = $post->find('h1.id.post-title', 0)->plaintext;
$item['post'] = $post->find('div.post', 0)->plaintext;
$query=mysqli_query($db_connect,"INSERT INTO jackson.data (
my_id, photos, title,date,location,post) VALUES (
'$my_id', '$photos', '{$item['title']}', '{$item['date']}',.....)");
echo "<div class=\"my_post\"><h1>".$item['title']."</h1>"."<br />Published:". $item['date']."<br />".$item['location']."<br /><br />".$item['post']."</div>";
}
echo "</body></html>";
In your css you can have something like this:
.my_post
{
margin:0 auto; /* centers the contents */
font-weight:bold;
font-family:fontname; /* replace fontname with the font you want */
font-size:16px;
color:brown;
padding-top:15px; /* adjusts the gap between two posts */
}
You can also use
"<pre>".print_r($results,true)."</pre>"
to build a string to store in the database, so that it displays as HTML similar to what the browser shows.
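If the whole array really has to live in a single column, another option (not one suggested in the answers above) is to serialize it explicitly, for example as JSON. The sketch below uses made-up sample data shaped like the $results array from the question and only shows the round trip through json_encode() and json_decode(); the database call itself is left out.

```php
<?php
// Sample data shaped like the $results array built in the question.
$results = [
    ['date' => '2014-05-01', 'location' => 'Jackson', 'title' => 'First post', 'post' => 'Hello "world"'],
    ['date' => '2014-05-02', 'location' => 'Madison', 'title' => 'Second post', 'post' => "It's here"],
];

// Interpolating an array into a string yields the literal text "Array",
// which is what ended up in the table. Serialize it explicitly instead.
$serialized = json_encode($results, JSON_UNESCAPED_UNICODE);

// $serialized is an ordinary string and can be stored in a single column
// (ideally bound through a prepared statement, not interpolated into SQL).
echo $serialized, "\n";

// Reading it back gives the original structure.
$restored = json_decode($serialized, true);
echo $restored[1]['title'], "\n";
```

One row per post, as in the answer above, is usually easier to query; the JSON route mainly suits data that is only ever read back as a whole.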

Collect web data using Simple HTML Dom from multiple pages

I used the below code and successfully collected the data from a specific page as follows:
include 'simplehtmldom/simple_html_dom.php';
$html = file_get_html('http://test.com/file/1209i0329/');
// Find all article blocks
foreach($html->find('div.Content') as $file) {
$item['date'] = $file->find('id.article-date', 0)->plaintext;
$item['location'] = $file->find('id.article-location', 0)->plaintext;
$item['price'] = $file->find('div.article', 0)->plaintext;
$files[] = $item;
}
print_r($files);
The code works well for http://test.com/file/1209i0329.php, but my goal is to collect data from all pages on this domain whose URL starts with http://test.com/file/ (for example, http://test.com/file/1209i0329/, http://test.com/file/120dnkj329/, etc.). Is there a way to do this using simple_html_dom?
I don't know where the files you want to search are located (on the same domain or elsewhere), so you may need to loop over an array containing the URLs you want to search.
Consider this example:
include 'simplehtmldom/simple_html_dom.php';
// most likely this process will take some time
$files = array();
$urls = array(
'http://test.com/file/1209i0329/',
'http://test.com/file/120dnkj329/',
'http://en.wikipedia.org/wiki/',
);
foreach($urls as $url) {
$html = file_get_html($url);
// Find all article blocks
foreach($html->find('div.Content') as $file) {
$item['date'] = $file->find('id.article-date', 0)->plaintext;
$item['location'] = $file->find('id.article-location', 0)->plaintext;
$item['price'] = $file->find('div.article', 0)->plaintext;
$files[] = $item;
}
}
print_r($files);
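If the URLs are not known in advance, the $urls array could be built from a page that links to them, keeping only links that start with the wanted prefix. The sketch below assumes such an index page exists, which the question does not confirm. The sample markup and the collectUrls() helper are hypothetical, and it uses the built-in DOMDocument rather than simple_html_dom.

```php
<?php
// Stand-in for an index page that links to the files (assumed to exist).
$indexHtml = <<<HTML
<ul>
  <li><a href="http://test.com/file/1209i0329/">one</a></li>
  <li><a href="http://test.com/file/120dnkj329/">two</a></li>
  <li><a href="http://test.com/about/">about</a></li>
  <li><a href="http://test.com/file/1209i0329/">duplicate</a></li>
</ul>
HTML;

// Hypothetical helper: return the unique hrefs that start with $prefix.
function collectUrls(string $html, string $prefix): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $urls = [];
    foreach ($doc->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href');
        if (strpos($href, $prefix) === 0) {
            $urls[$href] = true; // array keys de-duplicate the links
        }
    }
    return array_keys($urls);
}

$urls = collectUrls($indexHtml, 'http://test.com/file/');
print_r($urls);
```

The resulting array can be fed straight into the foreach ($urls as $url) loop shown above.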

XML/php foreach loop not looping all data

It seems my foreach is looping the correct amount of times. However it's only populating the variables with content from the first loop.
I tried it two ways. First, here is the URL of the XML feed,
http://wowfeeds.wipeitau.com/GuildActivity.php?location=EU&rn=shadowsong&gn=antheas&output=XML&callback=? so you can see the structure.
The PHP code is:
function GetAchievements(){
$achurl = "http://wowfeeds.wipeitau.com/GuildActivity.php?location=EU&rn=shadowsong&gn=antheas&limit=100&output=XML&callback=?";
$achxml = new SimpleXMLElement($achurl);
// Achievements
foreach ($achxml->ACTIVITYLIST->ACTIVITYITEM as $ach) {
$name = $ach['NAME'];
echo $name;
//$Achievments = "<p><img src='$achimg' /> <span class='red'>$achname</span> $achtext <span class='red'>$achobj</span></p>";
//echo $Achievments;
}
}
This seems to just return a blank.
However, if I change the line $name = $ach['NAME']; so that it reads from $achxml instead, like this:
function GetAchievements(){
$achurl = "http://wowfeeds.wipeitau.com/GuildActivity.php?location=EU&rn=shadowsong&gn=antheas&limit=100&output=XML&callback=?";
$achxml = new SimpleXMLElement($achurl);
// Achievements
foreach ($achxml->ACTIVITYLIST->ACTIVITYITEM as $ach) {
$name = $achxml->ACTIVITYLIST->ACTIVITYITEM->NAME;
echo $name;
//$Achievments = "<p><img src='$achimg' /> <span class='red'>$achname</span> $achtext <span class='red'>$achobj</span></p>";
//echo $Achievments;
}
}
Then it simply repeats the first entry the same number of times as entries.
EG.
Name.
Name.
Name.
This has been driving me mad for 2 hours now. Please help :(
A SimpleXML object is not an array, and NAME is a child element rather than an attribute, so read it from the loop variable with the -> syntax, like this:
$url = 'http://wowfeeds.wipeitau.com/GuildActivity.php?'.
'location=EU&rn=shadowsong&gn=antheas&output=XML&callback=?';
$achxml = simplexml_load_file($url);
foreach ($achxml->ACTIVITYLIST->ACTIVITYITEM as $ach)
{
$name = (string) $ach->NAME;
echo $name, "\n";
}
output :
Ichex
Azraelka
Brechnor
Rougwar
Bromious
Ziini
Ryoden
Ashlynne
Snappidagg
Flökræ
Flökræ
Sevenfold
Ashlynne
Bonewing
Goldstroke
Flökræ
Worgin
Bromious
Renevatio
Ziini
Flökræ
Flökræ
Strollomiona
Thorban
Ichex
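The difference between the two attempts in the question comes down to how SimpleXML addresses attributes and child elements. The sketch below illustrates it on a small inline document; the GUILD root, the TYPE attribute, and the values are made up for the example and are not taken from the real feed.

```php
<?php
// Small sample with a similar shape to the feed (structure partly assumed).
$sample = <<<XML
<GUILD>
  <ACTIVITYLIST>
    <ACTIVITYITEM TYPE="achievement"><NAME>Ichex</NAME></ACTIVITYITEM>
    <ACTIVITYITEM TYPE="loot"><NAME>Ziini</NAME></ACTIVITYITEM>
  </ACTIVITYLIST>
</GUILD>
XML;

$xml = simplexml_load_string($sample);

$names = [];
$types = [];
foreach ($xml->ACTIVITYLIST->ACTIVITYITEM as $item) {
    $names[] = (string) $item->NAME;   // child element: property syntax
    $types[] = (string) $item['TYPE']; // attribute: array syntax
}

// Going through $xml instead of the loop variable always reads the first item.
$first = (string) $xml->ACTIVITYLIST->ACTIVITYITEM->NAME;

print_r($names);
print_r($types);
echo $first, "\n";
```

Casting to string matters when the values are stored or compared, because SimpleXML otherwise hands back SimpleXMLElement objects.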
