I have been playing around with a simple PHP web scraper I built for a small project of mine. The scraper runs through job posts on a website and stores all relevant information in a nested array, which I then save to an XML file. However, whenever I run the code it only stores the first 79 job posts, and I can't find the cause (I know there are more job posts with the class I'm searching for).
If anyone can point me in the right direction, or has tried something similar, I would appreciate a solution :)
I'm running the server locally via MAMP. I don't know whether that could be the problem.
include('simple_html_dom.php');
$Pages = array();
$JobOffers = array();
$html = file_get_html("https://www.jobindex.dk/jobsoegning?q=studiejob");
$NumPage = $html->find('li.page-item');
foreach ($NumPage as $page){
$res = preg_replace("/[^0-9]/", "", $page->plaintext);
$PageNumber = trim($res);
$PageNumToInt = (int)$PageNumber;
array_push($Pages, $PageNumToInt);
}
$HighestValue = max($Pages);
for($i = 8; $i <= $HighestValue; $i++){
$Newhtml = file_get_html("https://www.jobindex.dk/jobsoegning?page=".$i."&q=studiejob");
$items = $Newhtml->find('div.PaidJob');
foreach ($items as $job){
$RareTitle = $job->find("a", 0)->plaintext;
$CommonTitle = $job->find("a", 1)->plaintext;
$Virksomhed = $job->find("a", 2)->plaintext;
$LinkHref = $job->find("a", 1)->href;
$DisP1 = $job->find("p", 1)->plaintext;
$DisP2 = $job->find("p", 2)->plaintext;
$Dis = $DisP1 . " " . $DisP2;
$date = date("d/m/Y");
$prefix = "JoIn";
echo $RareTitle;
echo $CommonTitle;
echo $Virksomhed;
echo $LinkHref;
echo $Dis;
echo $date;
echo $prefix;
$SingleJob = array($CommonTitle, $RareTitle, $Virksomhed, $Dis, $LinkHref, $date, $prefix);
array_push($JobOffers,$SingleJob);
}}
This is the code for saving the job offers to a local XML file:
function SaveJobs($JobInfo){
if(file_exists("./xml/JobOffers.xml")){
$i = 1;
foreach ($JobInfo as $jobs){
$xml = new DOMDocument("1.0", "utf-8");
$xml->load("./xml/JobOffers.xml");
// Creating textnode with line break
$textNode = $xml->createTextNode("\n");
// root Element
$root = $xml->getElementsByTagName("job")->item(0);
$root->appendChild($textNode);
// Create Singlejob Element
$SingleJob = $xml->createElement("Jobitem");
//ID Attribute
$DomAtt1 = $xml->createAttribute('ID');
$DomAtt1->value = $i.$jobs[6];
$SingleJob->appendChild($DomAtt1);
//Date Attribute
$DomAtt2 = $xml->createAttribute('Date');
$DomAtt2->value = $jobs[5];
$SingleJob->appendChild($DomAtt2);
// Creating Elements
$TitleElement = $xml->createElement("Title", $jobs[0]);
$SecTitle = $xml->createElement("SecTitle", $jobs[1]);
$Firm = $xml->createElement("Firm", $jobs[2]);
$dis = $xml->createElement("Description", $jobs[3]);
$Linkhref = $xml->createElement("Linkhref", $jobs[4]);
// Append data to SingleJob Element
$SingleJob->appendChild($TitleElement);
$SingleJob->appendChild($SecTitle);
$SingleJob->appendChild($Firm);
$SingleJob->appendChild($dis);
$SingleJob->appendChild($Linkhref);
// Append Singlejob to root and save the changes
$root->appendChild($SingleJob);
$xml->save("./xml/JobOffers.xml");
$i++;
}
}}
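One thing worth checking when some posts go missing: DOMDocument::createElement() does not escape its second argument, so a value containing a bare & triggers a warning and the text is not stored as intended. Whether that is what stops the scraper at 79 posts is only a guess. The sketch below is a minimal variant that loads and saves the document once and adds the text through createTextNode(), which does escape. The function name appendJobs() and the sample job are hypothetical; the element names follow the code above.

```php
<?php
// Sketch only: appendJobs() and the sample data are hypothetical.
function appendJobs(DOMDocument $xml, DOMElement $root, array $jobs): void
{
    $fields = ['Title', 'SecTitle', 'Firm', 'Description', 'Linkhref'];
    $i = 1;
    foreach ($jobs as $job) {
        $item = $xml->createElement('Jobitem');
        $item->setAttribute('ID', $i . $job[6]);
        $item->setAttribute('Date', $job[5]);
        foreach ($fields as $index => $name) {
            $element = $xml->createElement($name);
            // createTextNode() escapes &, < and > in the value.
            $element->appendChild($xml->createTextNode($job[$index]));
            $item->appendChild($element);
        }
        $root->appendChild($item);
        $i++;
    }
}

$xml = new DOMDocument('1.0', 'utf-8');
$xml->formatOutput = true;
$root = $xml->appendChild($xml->createElement('job'));

appendJobs($xml, $root, [
    ['Studiejob', 'Sales & Marketing', 'Firma A/S', 'Text',
     'https://example.com/?a=1&b=2', '01/01/2020', 'JoIn'],
]);

// In the real script this would be $xml->save("./xml/JobOffers.xml").
echo $xml->saveXML();
```

Saving once after the loop also avoids re-reading and re-writing the file for every single job.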
I'm currently trying to scrape a site for football matches, and I need to find out how to filter for divs with a specific class name. Here is the code I already have. Thanks.
include('simple_html_dom.php');
$day = 1; //temporär
$html = file_get_html('https://sport.sky.de/bundesliga-spielplan-ergebnisse-'.$day);
$list = $html -> find('div[class="sdc-site-fixres__match-cell sdc-site-fixres__match-cell--score"]', 0);
$list_array = $list -> find('div');
for($i = 0; $i < sizeof($list_array); $i++){
echo $list_array[$i]->plaintext;
echo "<br>";
}
You can use XPath. PHP supports it through the DOMXPath class.
$day = 1; //temporär
$html = file_get_contents('https://sport.sky.de/bundesliga-spielplan-ergebnisse-'.$day);
libxml_use_internal_errors(true); // real-world HTML rarely parses without warnings
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$query = $xpath->query('//div[@class="sdc-site-fixres__match-cell sdc-site-fixres__match-cell--score"]/div/span[2]');
foreach ($query as $item) {
/** @var DOMElement $item */
echo $item->nodeValue;
echo PHP_EOL;
}
Alternatively, you can use the Symfony components built for this purpose, such as DomCrawler or CssSelector.
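To try the XPath approach without hitting the live site, here is a minimal sketch against inline markup. The HTML snippet is hypothetical and only borrows the class names from the question. Instead of comparing the whole class attribute, it matches a single class token, which keeps working if the site adds or reorders classes.

```php
<?php
// Hypothetical markup standing in for the real page.
$html = '<html><body>'
    . '<div class="sdc-site-fixres__match-cell sdc-site-fixres__match-cell--score">'
    . '<div><span>Home</span><span>2</span></div></div>'
    . '<div class="sdc-site-fixres__match-cell">'
    . '<div><span>Away</span><span>9</span></div></div>'
    . '</body></html>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Match one class token regardless of the other classes on the element.
$cell = '//div[contains(concat(" ", normalize-space(@class), " "), '
      . '" sdc-site-fixres__match-cell--score ")]';

$scores = [];
foreach ($xpath->query($cell . '/div/span[2]') as $span) {
    $scores[] = trim($span->nodeValue);
}

print_r($scores);
```

Only the first div carries the score class, so the second one is skipped.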
Hello, I want to parse an HTML table and assign its values to PHP variables so that I can insert them into a MySQL database.
I have tried some HTML parsing methods such as dom_html_parsing, but as a beginner I am getting very confused. I would be glad if someone could give me a hint so that I can write the code.
The parsing code I used is:
include('simple_html_dom.php');
$dom = str_get_html($result);
$table = array();
$html = str_get_html($result);
foreach($html->find('tr') as $row) {
$time = $row->find('td',-1)->plaintext;
$title = $row->find('td',0)->plaintext;
$title0 = $row->find('td',1)->plaintext;
$title1 = $row->find('td',2)->plaintext;
$title2 = $row->find('td',3)->plaintext;
$title3 = $row->find('td',4)->plaintext;
$title4 = $row->find('td',5)->plaintext;
$title5 = $row->find('td',6)->plaintext;
$table[$title][$title0][$title1][$title2][$title3][$title4][$title3] = true;
}
echo '<pre>';
print_r($table);
echo '</pre>';
The arrays print fine, but I do not know how to insert those values into a MySQL database. I want to assign the values to variables first so that I can insert them, in the format shown above. The name, fathername and htno appear only once in the HTML table, but I need them repeated for each row.
Please help me.
$table_data = array();
$dom = new DOMDocument();
$dom->loadHTML($html_string);
$rows = $dom->getElementsByTagName('tr');
for ($i = 0; $i < $rows->length; $i++) {
$cells = $rows->item($i)->getElementsByTagName('td');
for ($j = 0; $j < $cells->length; $j++) {
$table_data[$i][$j] = $cells->item($j)->textContent;
}
}
//print_r($table_data);
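Building on the loop above, here is a rough sketch of repeating the once-only values on every row and inserting the result with a prepared statement. The table layout is an assumption (first row holds htno, name and fathername; later rows hold subject and marks), as are the column names. SQLite in memory is used only to keep the sketch self-contained; for MySQL the DSN and credentials passed to PDO would change.

```php
<?php
// Assumed layout: row 1 = htno, name, fathername; later rows = subject, marks.
$htmlString = '<table>'
    . '<tr><td>12345</td><td>Ravi</td><td>Kumar</td></tr>'
    . '<tr><td>Maths</td><td>78</td></tr>'
    . '<tr><td>Physics</td><td>64</td></tr>'
    . '</table>';

$dom = new DOMDocument();
$dom->loadHTML($htmlString);

$records = [];
$student = null;
foreach ($dom->getElementsByTagName('tr') as $row) {
    $cells = [];
    foreach ($row->getElementsByTagName('td') as $cell) {
        $cells[] = trim($cell->textContent);
    }
    if ($student === null) {
        $student = $cells; // remember the values that appear only once
        continue;
    }
    // Prefix every data row with the remembered values.
    $records[] = array_merge($student, $cells);
}

// Stand-in database (assumption); with MySQL use a mysql: DSN plus user and password.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE results (htno TEXT, name TEXT, fathername TEXT, subject TEXT, marks INTEGER)');

$stmt = $pdo->prepare(
    'INSERT INTO results (htno, name, fathername, subject, marks) VALUES (?, ?, ?, ?, ?)'
);
foreach ($records as $record) {
    $stmt->execute($record);
}

echo $pdo->query('SELECT COUNT(*) FROM results')->fetchColumn(), PHP_EOL;
```

The prepared statement keeps the scraped text out of the SQL string, so quotes in a name cannot break the query.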
You can use phpQuery. It's similar to jQuery, but for PHP.
I'm still working on this catalogue for a client, which loads images from a remote site via PHP and the Simple DOM Parser.
// Code excerpt from http://internetvolk.de/fileadmin/template/res/scrape.php, this is just one case of a select
$subcat = $_GET['subcat'];
$url = "http://pinesite.com/meubelen/index.php?".$subcat."&lang=de";
$html = file_get_html(html_entity_decode($url));
$iframe = $html->find('iframe',0);
$url2 = $iframe->src;
$html->clear();
unset($html);
$fullurl = "http://pinesite.com/meubelen/".$url2;
$html2 = file_get_html(html_entity_decode($fullurl));
$pagecount = 1;
$titles = $html2->find('.tekst');
$images = $html2->find('.plaatje');
$output = array(); // must be an array; appending with [] to a string fails in PHP 7.1+
$i=0;
foreach ($images as $image) {
$item['title'] = $titles[$i]->find('p',0)->plaintext;
$imagePath = $image->find('img',0)->src;
$item['thumb'] = resize("http://pinesite.com".str_replace('thumb_','',$imagePath),array("w"=>225, "h"=>162));
$item['image'] = 'http://pinesite.com'.str_replace('thumb_','',$imagePath);
$fullurl2 = "http://pinesite.com/meubelen/prog/showpic.php?src=".str_replace('thumb_','',$imagePath)."&taal=de";
$html3 = file_get_html($fullurl2);
$item['size'] = str_replace(' ','',$html3->find('td',1)->plaintext);
unset($html3);
$output[] = $item;
$i++;
}
if (count($html2->find('center')) > 1) {
// ok, multi-page here, let's find out how many there are
$pagecount = count($html2->find('center',0)->find('a'))-1;
for ($i=1;$i<$pagecount; $i++) {
$startID = $i*20;
$newurl = html_entity_decode($fullurl."&beginrec=".$startID);
$html3 = file_get_html($newurl);
$titles = $html3->find('.tekst');
$images = $html3->find('.plaatje');
$a=0;
foreach ($images as $image) {
$item['title'] = $titles[$a]->find('p',0)->plaintext;
$item['image'] = 'http://pinesite.com'.str_replace('thumb_','',$image->find('img',0)->src);
$item['thumb'] = resize($item['image'],array("w"=>225, "h"=>150));
$output[] = $item;
$a++;
}
$html3->clear();
unset ($html3);
}
}
echo json_encode($output);
So what it should do (and does for some categories) is output the images, the titles and the thumbnails from this page: http://pinesite.com
This works, for example, if you pass it a "?function=images&subcat=antiek", but not if you pass it a "?function=images&subcat=stoelen". I don't even think it's a problem with the remote page, so there has to be an error in my code.
Ehm..trying to state the obvious maybe but 'stoele'?
As it turns out, my code was completely fine. A missing space in the HTML of the remote site kept the Simple HTML DOM parser from recognizing the iframe I was looking for. I fixed it on my end by running str_replace on the fetched HTML first to replace the faulty markup.
I know it's a dirty solution, but it works :)
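For anyone hitting the same thing, the workaround looks roughly like this. The broken markup is a made-up example (a missing space between two attributes); the actual faulty snippet on the remote site is not shown in this thread.

```php
<?php
// Hypothetical broken markup: no space between the name and src attributes.
$raw = '<p>Katalog</p><iframe name="content"src="prog/list.php?cat=stoelen"></iframe>';

// Repair the known faulty snippet before handing the markup to the parser.
$fixed = str_replace('"src=', '" src=', $raw);

echo $fixed, PHP_EOL;

// With Simple HTML DOM, the repaired string would then be parsed with
// str_get_html($fixed) instead of calling file_get_html() on the URL.
```

This means fetching the page yourself (for example with file_get_contents()) so that the string can be patched before parsing.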
I have an RSS feed created by Yahoo Pipes, and I need to get a random post from it. How can I do this in PHP?
Read the feed using an XML parser and put the items in an array. Then use array_rand to pick a random item from the array.
<?php
function load_xml_feed($feed)
{
global $RanVal;
$i= 1;
$FeedXml = simplexml_load_file($feed);
foreach ($FeedXml->channel->item as $topic) {
$title[$i] = (string)$topic->title;
$link[$i] = (string)$topic->link;
$description[$i] = (string)$topic->description;
$i++;
}
$randtopic = rand(1, $i - 1); // valid indexes run from 1 to $i - 1
$link = trim($link[$randtopic]);
$title = trim($title[$randtopic]);
$description = trim($description[$randtopic]);
$RanVal = array($title,$link,$description);
return $RanVal;
}
$rss = "http://www.sabaharabi.com/rss/rss.xml";
load_xml_feed($rss);
$link = $RanVal[1];
$title = $RanVal[0];
$description = $RanVal[2];
echo "<h1>".$title."</h1><h2>".$link."</h2><p>".$description."</p>";
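The array_rand suggestion can also be sketched without the global variable. The inline feed below is a hypothetical stand-in for the real URL (with a URL you would call simplexml_load_file() instead), and randomFeedItem() is a name made up for this example.

```php
<?php
// Inline feed standing in for the real RSS URL (hypothetical content).
$feedXml = <<<'XML'
<rss version="2.0"><channel>
<item><title>First</title><link>http://example.com/1</link><description>One</description></item>
<item><title>Second</title><link>http://example.com/2</link><description>Two</description></item>
<item><title>Third</title><link>http://example.com/3</link><description>Three</description></item>
</channel></rss>
XML;

// Collect all items, then let array_rand() pick one valid key.
function randomFeedItem(SimpleXMLElement $feed): array
{
    $items = [];
    foreach ($feed->channel->item as $topic) {
        $items[] = [
            'title'       => trim((string) $topic->title),
            'link'        => trim((string) $topic->link),
            'description' => trim((string) $topic->description),
        ];
    }
    if ($items === []) {
        throw new RuntimeException('Feed has no items');
    }
    return $items[array_rand($items)];
}

$post = randomFeedItem(simplexml_load_string($feedXml));
echo $post['title'], ' - ', $post['link'], PHP_EOL;
```

Because array_rand() only returns keys that exist, there is no off-by-one risk as with a hand-computed range.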
In two earlier questions I asked how to place multiple markers from an XML file created by Lightroom, whose coordinates had to be converted from degrees, minutes, seconds to decimal degrees.
I managed that part, but then...
The answers to the previous questions were very informative, but with my limited programming skills (this is my first project) I just cannot solve it.
The problem is that I want to show multiple markers.
The complete code:
<?php
require('GoogleMapAPI.class.php');
$objDOM = new DOMDocument("1.0", 'utf-8');
$objDOM->preserveWhiteSpace = false;
$objDOM->load("googlepoints.xml"); //make sure path is correct
$photo = $objDOM->getElementsByTagName("photo");
foreach ($photo as $value) {
$album = $value->getElementsByTagName("album");
$albu = $album->item(0)->nodeValue;
$description = $value->getElementsByTagName("description");
$descriptio = $description->item(0)->nodeValue;
$title = $value->getElementsByTagName("title");
$titl = $title->item(0)->nodeValue;
$link = $value->getElementsByTagName("link");
$lin = $link->item(0)->nodeValue;
$guid = $value->getElementsByTagName("guid");
$gui = $guid->item(0)->nodeValue;
$gps = $value->getElementsByTagName("gps");
$gp = $gps->item(0)->nodeValue;
$Deglon = str_replace("'", "/", $gp);
$Deglon = str_replace("°", "/", $Deglon);
$Deglon = str_replace("", "/", $Deglon);
$str = $Deglon;
$arr1 = str_split($str, 11);
$date = $arr1[0]; // Delimiters may be slash, dot, or hyphen
list ($latdeg, $latmin, $latsec, $latrichting) = split ('[°/".-]', $date);
$Lat = $latdeg + (($latmin + ($latsec/60))/60);
$latdir = $latrichting.$Lat;
If (preg_match("/N /", $latdir)) {$Latcoorl = str_replace(" N ", "+",$latdir);}
else {$Latcoorl = str_replace ("S ", "-",$latdir);}
//$Latcoord=$Latcoorl.",";
$date1 = $arr1[1]; // Delimiters may be slash, dot, or hyphen
list ($londeg, $lonmin, $lonsec, $lonrichting) = split ('[°/".-]', $date1);
$Lon = $londeg + (($lonmin + ($lonsec/60))/60);
$londir = $lonrichting.$Lon;
If (preg_match("/W /", $londir)) {$Loncoorl = str_replace("W ", "+",$londir);}
else {$Loncoorl = str_replace ("E", "-",$londir);}
$Lonarr = array($Loncoorl);
foreach ($Lonarr as &$LonArray);
$Latarr = array($Latcoorl);
foreach ($Latarr as &$LatArray);
$titarr = array($titl);
foreach ($titarr as &$titArray);
$guarr = array($gui);
foreach ($guarr as &$guaArray);
$albuarr = array($albu);
foreach ($albuarr as &$albuArray);
print_r ($LonArray);
print_r ($LatArray);
print_r ($guaArray);
print_r ($albuArray);
$map = new GoogleMapAPI('map');
// setup database for geocode caching
// $map->setDSN('mysql://USER:PASS@localhost/GEOCODES');
// enter YOUR Google Map Key
$map->setAPIKey('ABQIAAAAiA4e9c1IW0MDrtoPQRaLgRQmsvD_kVovrOh_CkQEnehxpBb-yhQq1LkA4BJtjWw7lWmjfYU8twZvPA');
$map->addMarkerByCoords($LonArray,$LatArray,$albuArray,$guaArray);
}
?>
The problem is that "$map->addMarkerByCoords($LonArray,$LatArray,$albuArray,$guaArray);" only shows the last values from the four arrays.
Therefore only one marker is created.
The output (print_r) of, for example, $guaArray is IMG_3308IMG_3309IMG_3310IMG_3311IMG_3312 (the file names of five photographs).
The function addMarkerByCoords from 'GoogleMapAPI.class.php' looks like this:
function addMarkerByCoords($lon,$lat,$title = '',$html = '',$tooltip = '') {
$_marker['lon'] = $lon;
$_marker['lat'] = $lat;
$_marker['html'] = (is_array($html) || strlen($html) > 0) ? $html : $title;
$_marker['title'] = $title;
$_marker['tooltip'] = $tooltip;
$this->_markers[] = $_marker;
$this->adjustCenterCoords($_marker['lon'],$_marker['lat']);
// return index of marker
return count($this->_markers) - 1;
}
I hope someone can help me.
You must create the instance of the Google map above the foreach,
like this:
$map = new GoogleMapAPI('map');
// setup database for geocode caching
// $map->setDSN('mysql://USER:PASS@localhost/GEOCODES');
// enter YOUR Google Map Key
$map->setAPIKey('ABQIAAAAiA4e9c1IW0MDrtoPQRaLgRQmsvD_kVovrOh_CkQEnehxpBb-yhQq1LkA4BJtjWw7lWmjfYU8twZvPA');
foreach ()
{
}
Right now you are creating a new map on every pass through the loop, so the final map only holds the last coordinate.
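The idea can be sketched as follows. MarkerMap is a hypothetical stand-in for GoogleMapAPI, reduced to the marker list and mirroring only the addMarkerByCoords() signature quoted in the question; the photo data is made up and assumed to be in decimal degrees already.

```php
<?php
// Hypothetical stand-in for GoogleMapAPI, reduced to the marker list.
class MarkerMap
{
    public $markers = [];

    public function addMarkerByCoords($lon, $lat, $title = '', $html = '')
    {
        $this->markers[] = ['lon' => $lon, 'lat' => $lat, 'title' => $title, 'html' => $html];
        return count($this->markers) - 1; // index of the new marker
    }
}

// Made-up photo data, already converted to decimal degrees.
$photos = [
    ['lon' => 4.8952, 'lat' => 52.3702, 'album' => 'Holiday', 'guid' => 'IMG_3308'],
    ['lon' => 4.4777, 'lat' => 51.9244, 'album' => 'Holiday', 'guid' => 'IMG_3309'],
    ['lon' => 5.1214, 'lat' => 52.0907, 'album' => 'Holiday', 'guid' => 'IMG_3310'],
];

$map = new MarkerMap(); // created once, before the loop

foreach ($photos as $photo) {
    // Each pass adds one marker to the same map object.
    $map->addMarkerByCoords($photo['lon'], $photo['lat'], $photo['album'], $photo['guid']);
}

echo count($map->markers), PHP_EOL;
```

With the real class, the API key would also be set once before the loop, and the map rendered once after it.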
Your foreach loops aren't accomplishing anything useful:
$Lonarr = array($Loncoorl);
foreach ($Lonarr as &$LonArray);
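$LonArray is just a reference to the single element of the $Lonarr array, so it is overwritten on every pass through the outer loop. The concatenated print_r output most likely comes from printing one value per iteration, not from the foreach building one big string.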
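A small sketch of the difference, using made-up file names: the empty by-reference foreach only leaves a reference to the current value behind, while appending with [] is what actually builds a list across iterations.

```php
<?php
$values = ['IMG_3308', 'IMG_3309', 'IMG_3310'];

$collected = [];
foreach ($values as $value) {
    // Same pattern as in the question: a one-element array and an empty foreach.
    $single = array($value);
    foreach ($single as &$ref);

    // Appending is what keeps every value, not just the current one.
    $collected[] = $value;
}

$lastSeen = $ref; // only the value from the final iteration survives
unset($ref);      // drop the leftover reference

echo $lastSeen, PHP_EOL;
echo count($collected), PHP_EOL;
```

So the single-element arrays and empty loops can be dropped, and each value either passed straight to addMarkerByCoords() inside the loop or appended to a list.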