I'm still working on this catalogue for a client, which loads images from a remote site via PHP and the Simple DOM Parser.
// Code excerpt from http://internetvolk.de/fileadmin/template/res/scrape.php, this is just one case of a select
$subcat = $_GET['subcat'];
$url = "http://pinesite.com/meubelen/index.php?".$subcat."&lang=de";
$html = file_get_html(html_entity_decode($url));
$iframe = $html->find('iframe',0);
$url2 = $iframe->src;
$html->clear();
unset($html);
$fullurl = "http://pinesite.com/meubelen/".$url2;
$html2 = file_get_html(html_entity_decode($fullurl));
$pagecount = 1;
$titles = $html2->find('.tekst');
$images = $html2->find('.plaatje');
$output='';
$i=0;
foreach ($images as $image) {
$item['title'] = $titles[$i]->find('p',0)->plaintext;
$imagePath = $image->find('img',0)->src;
$item['thumb'] = resize("http://pinesite.com".str_replace('thumb_','',$imagePath),array("w"=>225, "h"=>162));
$item['image'] = 'http://pinesite.com'.str_replace('thumb_','',$imagePath);
$fullurl2 = "http://pinesite.com/meubelen/prog/showpic.php?src=".str_replace('thumb_','',$imagePath)."&taal=de";
$html3 = file_get_html($fullurl2);
$item['size'] = str_replace(' ','',$html3->find('td',1)->plaintext);
unset($html3);
$output[] = $item;
$i++;
}
if (count($html2->find('center')) > 1) {
// ok, multi-page here, let's find out how many there are
$pagecount = count($html2->find('center',0)->find('a'))-1;
for ($i=1;$i<$pagecount; $i++) {
$startID = $i*20;
$newurl = html_entity_decode($fullurl."&beginrec=".$startID);
$html3 = file_get_html($newurl);
$titles = $html3->find('.tekst');
$images = $html3->find('.plaatje');
$a=0;
foreach ($images as $image) {
$item['title'] = $titles[$a]->find('p',0)->plaintext;
$item['image'] = 'http://pinesite.com'.str_replace('thumb_','',$image->find('img',0)->src);
$item['thumb'] = resize($item['image'],array("w"=>225, "h"=>150));
$output[] = $item;
$a++;
}
$html3->clear();
unset ($html3);
}
}
echo json_encode($output);
So what it should do (and does with some categories): Output the images, the titles and the the thumbnails from this page: http://pinesite.com
This works, for example, if you pass it a "?function=images&subcat=antiek", but not if you pass it a "?function=images&subcat=stoelen". I don't even think it's a problem with the remote page, so there has to be an error in my code.
Ehm..trying to state the obvious maybe but 'stoele'?
As it turns out, my code was completely fine, it was a missing space in the HTML of the remote site that got the Simple PHP DOM Parser to not recognize the iframe I was looking for. I fixed it on my end by running a str_replace on the code first to replace the faulty code.
I know it's a dirty solution, but it works :)
Related
I have been playing around with a simple php webscraper I've built for a small project of mine. The scraper is running through jobposts on a website and storing all relevant information in an nested array, which I then store in an xml-file. However, the problem is that whenever i run the code it only store the first 79 jobposts and i can't seem to find the problem (I know there are more jobposts with the class I'm searching for).
If anyone can point me in the right direction or have tried something similar themselves, it whould be nice to get a solution :)
I'm running the server locally via. MAMP. Don't know if that could be the problem?
include('simple_html_dom.php');
$Pages = array();
$JobOffers = array();
$html = file_get_html("https://www.jobindex.dk/jobsoegning?q=studiejob");
$NumPage = $html->find('li.page-item');
foreach ($NumPage as $page){
$res = preg_replace("/[^0-9]/", "", $page->plaintext);
$PageNumber = $res.trim();
$PageNumToInt = (int)$PageNumber;
array_push($Pages, $PageNumToInt);
}
$HighestValue = max($Pages);
for($i = 8; $i <= $HighestValue; $i++){
$Newhtml = file_get_html("https://www.jobindex.dk/jobsoegning?page=".$i."&q=studiejob");
$items = $Newhtml->find('div.PaidJob');
foreach ($items as $job){
$RareTitle = $job->find("a", 0)->plaintext;
$CommonTitle = $job->find("a", 1)->plaintext;
$Virksomhed = $job->find("a", 2)->plaintext;
$LinkHref = $job->find("a", 1)->href;
$DisP1 = $job->find("p", 1)->plaintext;
$DisP2 = $job->find("p", 2)->plaintext;
$Dis = $DisP1 . " " . $DisP2;
$date = date("d/m/Y");
$prefix = "JoIn";
echo $RareTitle;
echo $CommonTitle;
echo $Virksomhed;
echo $LinkHref;
echo $Dis;
echo $date;
echo $prefix;
$SingleJob = array($CommonTitle, $RareTitle, $Virksomhed, $Dis, $LinkHref, $date, $prefix);
array_push($JobOffers,$SingleJob);
}}
This code is for saving the job offers in local xml file:
function SaveJobs($JobInfo){
if(file_exists("./xml/JobOffers.xml")){
$i = 1;
foreach ($JobInfo as $jobs){
$xml = new DOMDocument("1.0", "utf-8");
$xml->load("./xml/JobOffers.xml");
// Creating textnode with line break
$textNode = $xml->createTextNode("\n");
// root Element
$root = $xml->getElementsByTagName("job")->item(0);
$root->appendChild($textNode);
// Create Singlejob Element
$SingleJob = $xml->createElement("Jobitem");
//ID Attribute
$DomAtt1 = $xml->createAttribute('ID');
$DomAtt1->value = $i.$jobs[6];
$SingleJob->appendChild($DomAtt1);
//Date Attribute
$DomAtt2 = $xml->createAttribute('Date');
$DomAtt2->value = $jobs[5];
$SingleJob->appendChild($DomAtt2);
// Creating Elements
$TitleElement = $xml->createElement("Title", $jobs[0]);
$SecTitle = $xml->createElement("SecTitle", $jobs[1]);
$Firm = $xml->createElement("Firm", $jobs[2]);
$dis = $xml->createElement("Description", $jobs[3]);
$Linkhref = $xml->createElement("Linkhref", $jobs[4]);
// Append data to SingleJob Element
$SingleJob->appendChild($TitleElement);
$SingleJob->appendChild($SecTitle);
$SingleJob->appendChild($Firm);
$SingleJob->appendChild($dis);
$SingleJob->appendChild($Linkhref);
// Append Singlejob to root and save the changes
$root->appendChild($SingleJob);
$xml->save("./xml/JobOffers.xml");
$i++;
}
}}
I'm trying to write a script that list all the image URL from an specific URL. I used foreach in order to scan several pages, but I think os not working well.
This is my code:
<?php
include('simple_html_dom.php');
$array = array($page, $page2);
$page = "https://www.dllo.dev";
$page2 = "https://www.dllo2.dev";
$html = new simple_html_dom();
$html->load_file($array);
$images = array();
foreach($html->find('img') as $element) {
$images[] = $element->src;
}
reset($images);
echo "URL $array:<br /><br />";
foreach ($images as $out) {
$url = "$base$out";
echo "$url, ";
}
It's partially working, but only with the first URL ($page)... Any idea?
:D
I'm trying to scrape some information from a web page with the Simple HTML Dom Parser. Some issues are causing elements with a tag to cause an off set in my counters.
The tag looks like:
// <div id="result-title-2" class="offerList-item-description-title">
<script type="text/javascript">
document.write(getContents('##UD9Jj\>2?E:4 9:;23'));
</script>Romantic hijab
</div>
I either need to be able to get the contents or make my programme skip it.
This is how I am currently grabbing all of my Elements:
foreach($html->find('.offerList-item') as $element)
{
$count++;
foreach($element->find('.offerList-item-image img') as $image)
{
//$images[] = '<img src="'.$image->src.'">'.'<br>';//$img->src;
$images[] = $image->src;//$img->src;
}
foreach($element->find('.offerList-item-description-title') as $title)
{
$titles[] = $title->innertext;
}
//foreach($element->find('.priceRange-from') as $price) {
foreach($element->find('.priceRange-from')as $price){
$pound = $price->find('text',1);
$number = $price->find('text',2);
$prices[] = $pound.' '.$number;
}
foreach($element->find('.offerList-itemWrapper') as $compare) //Get store links
{
$links[] = $idealo.$compare->getAttribute('href');
}
}
Download html page to local disk after delete script tags
$url2 = "https://www.idealo.co.uk/mscat.html?q=Dashboard+Cleaner";
//Code to get the file...
$data2 = file_get_contents($url2);
//save as?
$filename = "test.html";
//save the file...
$fh = fopen($filename,"w");
fwrite($fh,$data2);
fclose($fh);
After this codes.
Try scrape and finally delete with this
$target = array('<script type="text/javascript">', '</script>');
$convert = array('<!--<script type="text/javascript">', '</script>-->');
$result = str_replace($target, $convert, $title);
I need to find links in a part of some html code and replace all the links with two different absolute or base domains followed by the link on the page...
I have found a lot of ideas and tried a lot different solutions.. Luck aint on my side on this one.. Please help me out!!
Thank you!!
This is my code:
<?php
$url = "http://www.oxfordreference.com/views/SEARCH_RESULTS.html?&q=android";
$raw = file_get_contents($url);
$newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");
$content = str_replace($newlines, "", html_entity_decode($raw));
$start = strpos($content,'<table class="short_results_summary_table">');
$end = strpos($content,'</table>',$start) + 8;
$table = substr($content,$start,$end-$start);
echo "{$table}";
$dom = new DOMDocument();
$dom->loadHTML($table);
$dom->strictErrorChecking = FALSE;
// Get all the links
$links = $dom->getElementsByTagName("a");
foreach($links as $link) {
$href = $link->getAttribute("href");
echo "{$href}";
if (strpos("http://oxfordreference.com", $href) == -1) {
if (strpos("/views/", $href) == -1) {
$ref = "http://oxfordreference.com/views/"+$href;
}
else
$ref = "http://oxfordreference.com"+$href;
$link->setAttribute("href", $ref);
echo "{$link->getAttribute("href")}";
}
}
$table12 = $dom->saveHTML;
preg_match_all("|<tr(.*)</tr>|U",$table12,$rows);
echo "{$rows[0]}";
foreach ($rows[0] as $row){
if ((strpos($row,'<th')===false)){
preg_match_all("|<td(.*)</td>|U",$row,$cells);
echo "{$cells}";
}
}
?>
When i run this code i get htmlParseEntityRef: expecting ';' warning for the line where i load the html
var links = document.getElementsByTagName("a"); will get you all the links.
And this will loop through them:
for(var i = 0; i < links.length; i++)
{
links[i].href = "newURLHERE";
}
You should use jQuery - it is excellent for link replacement. Rather than explaining it here. Please look at this answer.
How to change the href for a hyperlink using jQuery
I recommend scrappedcola's answer, but if you dont want to do it on client side you can use regex to replace:
ob_start();
//your HTML
//end of the page
$body=ob_get_clean();
preg_replace("/<a[^>]*href=(\"[^\"]*\")/", "NewURL", $body);
echo $body;
You can use referencing (\$1) or callback version to modify output as you like.
I'm building a PHP program that basically grabs only image links from my twitter feed and displays them on a page, I have 3 components that I have set up that all work fine on their own.
The first component is the twitter oauth component which grabs the tweet text and creates an array, this works fine by itself.
The second is a function that processes the tweets and only returns tweets that contain image links, this as well works fine.
The program breaks down during the third section when the links are processed and an image is displayed, I had no issues running this on its own and from my attempts to trouble shoot it appears that it breaks down at the $images(); array, as that array is empty.
I'm sure I've made a silly mistake but I've been trying to find this for over a day now and can't seem to fix it. Any help would be great! Thanks guys!
code:
<?php
if ($result['socialorigin']== "twitter"){
$twitterObj = new EpiTwitter($consumer_key, $consumer_secret);
$token = $twitterObj->getAccessToken();
$twitterObj->setToken($result['oauthtoken'], $result['oauthsecret']);
$tweets = $twitterObj->get('/statuses/home_timeline.json',array('count'=>'200'));
$all_tweets = array();
$hosts = "lockerz|yfrog|twitpic|tumblr|mypict|ow.ly|instagr";
foreach($tweets as $tweet) {
$twtext = $tweet->text;
if(preg_match("~http://($hosts)~", $twtext)){
preg_match_all("#(^|[\n ])([\w]+?://[\w]+[^ \"\n\r\t<]*)#ise", $twtext, $matches, PREG_PATTERN_ORDER);
foreach($matches[0] as $key2 => $link){
array_push($all_tweets,"$link");
}
}
}
function height_compare($a1, $b1)
{
if ($a1 == $b1) {
return 0;
}
return ($a1 > $b1) ? -1 : 1;
}
foreach($all_tweets as $alltweet => $tlink){
$doc = new DOMDocument();
// Okay this is HTML is kind of screwy
// So we're going to supress errors
#$doc->loadHTMLFile($tlink);
// Get all images
$images_list = $doc->getElementsByTagName('img');
$images = array();
foreach($images_list as $image) {
// Get the src attribute
$image_source = $image->getAttribute('src');
if (substr($image_source,0,7)=="http://"){
$image_size_info = getimagesize($image_source);
$images[$image_source] = $image_size_info[1];
}
}
// Do a numeric sort on the height
uasort($images, "height_compare");
$tallest_image = array_slice($images, 0,1);
$mainimg = key($tallest_image);
echo "<img src='$mainimg' />";
}
print_r($all_tweets);
print_r($images);
}
Change the for loop where you fetch the actual images to move the images array OUTSIDE the for loop. This will prevent the loop from clearing it each time through.
$images = array();
foreach($all_tweets as $alltweet => $tlink){
$doc = new DOMDocument();
// Okay this is HTML is kind of screwy
// So we're going to supress errors
#$doc->loadHTMLFile($tlink);
// Get all images
$images_list = $doc->getElementsByTagName('img');
foreach($images_list as $image) {
// Get the src attribute
$image_source = $image->getAttribute('src');
if (substr($image_source,0,7)=="http://"){
$image_size_info = getimagesize($image_source);
$images[$image_source] = $image_size_info[1];
}
}
// Do a numeric sort on the height
uasort($images, "height_compare");
$tallest_image = array_slice($images, 0,1);
$mainimg = key($tallest_image);
echo "<img src='$mainimg' />";
}