Removing certain things from being scraped

I am scraping this website with the code below. Some of the pages it returns have "Mixtape" in the title. How can I make the scraper skip those and crawl only the pages that display normally?
$html = file_get_html('http://beatshype.com/mp3download/');
foreach($html->find('.entry-title a') as $element)
{
print '<br><br>';
echo $url = ''.$element->href;
$html2 = file_get_html($url);
print '<br>';
$image = $html2->find('meta[property=og:image]',0);
print $image = $image->content;
print '<br>';
$title = $html2->find('.single-title',0);
print $title = $title->plaintext;
print '<br>';
$str = explode ("/", $url);
$date = $html2->find('.single-content a',2);
print $date = $date->href;
}
Screenshot (not reproduced here): the top result is good, the bottom result is bad.

Check whether the title contains 'mixtape' and, if it does, continue to the next item in the loop:
if(stripos($title->plaintext, 'mixtape') !== false) {
continue;
}
Put that check just before the line that overwrites $title with $title->plaintext; if you put it after, pass $title itself as the haystack argument.
Spelled out in full:
$html = file_get_html('http://beatshype.com/mp3download/');
foreach($html->find('.entry-title a') as $element)
{
// Read the URL first; file_get_html() needs it.
$url = $element->href;
$html2 = file_get_html($url);
$title = $html2->find('.single-title',0);
// Skip any page whose title contains "mixtape" (case-insensitive).
if(stripos($title->plaintext, 'mixtape') !== false) continue;
$title = $title->plaintext;
print '<br><br>';
echo $url;
print '<br>';
$image = $html2->find('meta[property=og:image]',0);
print $image->content;
print '<br>';
print $title.'<br>';
$date = $html2->find('.single-content a',2);
print $date->href;
}

First,
print $image = $image->content;
looks superfluous: it both assigns $image->content to $image and prints it.
Rather than fetching and printing each field one after another, fetch the title first, then decide whether to fetch the remaining fields and print the record.
$html = file_get_html('http://beatshype.com/mp3download/');
foreach($html->find('.entry-title a') as $element)
{
$url = ''.$element->href;
$html2 = file_get_html($url);
$title = $html2->find('.single-title',0);
if (stripos($title->plaintext, 'mixtape') === false) { // case-insensitive, so "Mixtape" is caught too
$image = $html2->find('meta[property=og:image]',0);
$date = $html2->find('.single-content a',2);
print '<br><br>';
echo $url;
print '<br>';
print $image->content;
print '<br>';
print $title->plaintext;
print '<br>';
print $date->href;
}
}
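
The title check itself can be tried without hitting the site. This is a minimal sketch, assuming a hypothetical helper `is_mixtape()` and made-up sample titles; neither is part of the original code or of simple_html_dom.

```php
<?php
// Hypothetical helper: true when the title mentions "mixtape" in any letter case.
function is_mixtape(string $title): bool
{
    // stripos() is case-insensitive. Compare against false strictly,
    // because a match at the very start of the string returns int 0.
    return stripos($title, 'mixtape') !== false;
}

// Made-up sample titles standing in for $title->plaintext.
$titles = ['Mixtape: Summer Vol. 2', 'Artist - Single Track', 'DJ X MIXTAPE 2015'];

foreach ($titles as $title) {
    if (is_mixtape($title)) {
        continue; // skip mixtape pages
    }
    echo $title, "\n";
}
```

Only the second sample title is printed. The first one shows why the strict `!== false` comparison matters: the match is at position 0, which a loose comparison would treat as "not found".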

Related

Loop through list of URLs and save content to text file

I am trying to loop through a list of URLs and save content from a div tag to a text file.
<?php
$file = 'content.txt';
$i = 406;
for($i; $i <= 1410; $i++) {
$url = 'http://example.com/chapter/chapter-'.$i;
$content = file_get_contents($url);
$start_tag = explode( '<div class="textdiv">' , $content );
$end_tag = explode("</div>" , $start_tag[1] );
$result_text = $end_tag[0];
echo $result_text;
$result = file_put_contents($file, $result_text);
}
?>
The first problem is that the page contains several div tags with that class. I want the content of every one of them, but the current code outputs only the first occurrence.
[EDIT]
Thanks to The Alpha for pointing me in the right direction. This worked for me:
<?php
include_once('simple_html_dom.php');
$i = 399;
$file = 'content.txt';
for($i; $i < 1400; $i++){
$url = 'http://example.com/chapter/chapter-'.$i;
$html = file_get_html($url);
foreach ($html->find('div.textdiv') as $div) {
echo $div . '<br />';
$result = file_put_contents($file, $div, FILE_APPEND); // append; the default mode overwrites the file
}
echo '<hr><br /><h1>Chapter '. $i .'</h1><br /><hr>';
}
?>
One issue remains: the script takes a very long time to run.
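
When writing inside a loop like this, it helps to remember that `file_put_contents()` replaces the target file by default and only adds to it when `FILE_APPEND` is passed. A minimal sketch of the append behaviour, using made-up strings and a temporary file in place of the scraped divs and content.txt:

```php
<?php
// Temporary file standing in for content.txt.
$file = tempnam(sys_get_temp_dir(), 'chap');

// Made-up chunks standing in for the scraped div contents.
$chunks = ['first div', 'second div', 'third div'];

foreach ($chunks as $chunk) {
    // FILE_APPEND adds to the end of the file instead of replacing it.
    file_put_contents($file, $chunk . "\n", FILE_APPEND);
}

// Read the file back to confirm that every chunk was kept.
$lines = file($file, FILE_IGNORE_NEW_LINES);
echo count($lines), " lines written\n";
unlink($file);
```

Without the flag, the file would hold only the last chunk after the loop finishes.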

PHP variables to excel spreadsheet

I have the following code, which scrapes text from multiple pages and displays it.
How can I write each of those variables to an Excel spreadsheet located on the server, with each link on a separate row?
<?php
include_once 'simple_html_dom.php';
$urls = array(
'http://lmvz.anofm.ro:8080/lmv/detalii.jsp?UNIQUEJVID=50/01/1150001435/1&judet=50',
'http://lmvz.anofm.ro:8080/lmv/detalii.jsp?UNIQUEJVID=50/05/1140001657/1&judet=50',
'http://lmvz.anofm.ro:8080/lmv/detalii.jsp?UNIQUEJVID=50/05/1140001657/2&judet=50',
'http://lmvz.anofm.ro:8080/lmv/detalii.jsp?UNIQUEJVID=50/01/1150001435/1&judet=50',
'http://lmvz.anofm.ro:8080/lmv/detalii.jsp?UNIQUEJVID=50/05/1140001657/1&judet=50',
'http://lmvz.anofm.ro:8080/lmv/detalii.jsp?UNIQUEJVID=50/05/1140001657/2&judet=50',
'http://lmvz.anofm.ro:8080/lmv/detalii.jsp?UNIQUEJVID=50/01/1150001435/1&judet=50',
'http://lmvz.anofm.ro:8080/lmv/detalii.jsp?UNIQUEJVID=50/05/1140001657/1&judet=50',
'http://lmvz.anofm.ro:8080/lmv/detalii.jsp?UNIQUEJVID=50/05/1140001657/2&judet=50',
);
function scraping($url) {
// DOM
$html = file_get_html($url);
// articol
if ($html && is_object($html) && isset($html->nodes)) {
foreach ($html->find('/html/body/table') as $article) {
//titlu
$item['titlu'] = trim($article->find('/tbody/tr[1]/td/div', 0)->plaintext);
// tabel
$item['tr2'] = trim($article->find('/tbody/tr[2]/td[2]', 0)->plaintext);
$item['tr3'] = trim($article->find('/tbody/tr[3]/td[2]', 0)->plaintext);
$item['tr4'] = trim($article->find('/tbody/tr[4]/td[2]', 0)->plaintext);
$item['tr5'] = trim($article->find('/tbody/tr[5]/td[2]', 0)->plaintext);
$item['tr6'] = trim($article->find('/tbody/tr[6]/td[2]', 0)->plaintext);
$item['tr7'] = trim($article->find('/tbody/tr[7]/td[2]', 0)->plaintext);
$item['tr8'] = trim($article->find('/tbody/tr[8]/td[2]', 0)->plaintext);
$item['tr9'] = trim($article->find('/tbody/tr[9]/td[2]', 0)->plaintext);
$item['tr10'] = trim($article->find('/tbody/tr[10]/td[2]', 0)->plaintext);
$item['tr11'] = trim($article->find('/tbody/tr[11]/td[2]', 0)->plaintext);
$item['tr12'] = trim($article->find('/tbody/tr[12]/td/div', 0)->plaintext);
$ret[] = $item;
}
// memorie
$html->clear();
unset($html);
return $ret;
}
}
echo '<pre>';
foreach ($urls as $url) {
$ret = scraping($url);
foreach ($ret as $v) {
echo $v['titlu'] . '<br>';
echo $v['tr2'] . '<br>';
echo $v['tr3'] . '<br>';
echo $v['tr4'] . '<br>';
echo $v['tr5'] . '<br>';
echo $v['tr6'] . '<br>';
echo $v['tr7'] . '<br>';
echo $v['tr8'] . '<br>';
echo $v['tr9'] . '<br>';
echo $v['tr10'] . '<br>';
echo $v['tr11'] . '<br>';
echo $v['tr12'] . '<br>';
echo '<br>';
echo '<br>';
}
}
?>
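
One common route, offered here as an assumption rather than something from the original thread, is to write a CSV file, which Excel opens directly. This is a minimal sketch with made-up rows in the shape `scraping()` returns, trimmed to three columns, and a temporary file standing in for the real output path:

```php
<?php
// Made-up rows in the same shape scraping() returns: one array per page.
$rows = [
    ['titlu' => 'Job A', 'tr2' => 'Company A', 'tr3' => 'City A'],
    ['titlu' => 'Job B', 'tr2' => 'Company B', 'tr3' => 'City B'],
];

// Hypothetical output path; a temp file keeps the sketch self-contained.
$path = tempnam(sys_get_temp_dir(), 'jobs');

$out = fopen($path, 'w');
// Header row taken from the array keys of the first record.
fputcsv($out, array_keys($rows[0]), ',', '"', '\\');
foreach ($rows as $row) {
    // One spreadsheet row per scraped link.
    fputcsv($out, $row, ',', '"', '\\');
}
fclose($out);

echo file_get_contents($path);
```

In the real script, each `$v` from the inner `foreach` would be passed to `fputcsv()` in place of the `echo` lines. A true .xlsx file would need a spreadsheet library, which is outside the scope of this sketch.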

Array does not remove all elements

$channels = array('imaqtpies','imsoff','zzero71tv', 'kaptenen', 'onlySinged', 'nightblue3') ;
$nr = 0;
$callAPI = implode(",",$channels);
$online = 'online.png';
$offline = 'offline.png';
$json = file_get_contents('https://api.twitch.tv/kraken/streams?channel=' . $callAPI);
$dataArray = json_decode($json, true);
foreach($dataArray['streams'] as $mydata){
echo $mydata['channel']['name'] . ' is online';
echo '<br /><hr />';
unset($channels[$nr]);
$nr++;
}
$newChannels = array_values($channels);
foreach($newChannels as $channel) {
echo $channel . ' is offline';
echo '<br /><hr />';
}
Not all the names are echoed in the "offline" part and some names are being echoed twice (both in online and offline).

$mydata['channel']['name'] and $nr are not aligned. You are unsetting the first N entries of $channels, where N is the number of online streams, but nothing guarantees that Twitch returns the streams in the order you listed your channels.
You will want something like:
$online_channels = array();
foreach($dataArray['streams'] as $stream){
$online_channels[] = $stream["channel"]["name"];
}
$offline_channels = array_diff($channels, $online_channels);
Then print $online_channels and $offline_channels.
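
The whole flow can be checked offline. This is a minimal sketch in which a made-up `$dataArray` stands in for the decoded API response, with two channels online in a different order from `$channels`:

```php
<?php
$channels = ['imaqtpies', 'imsoff', 'zzero71tv', 'kaptenen', 'onlySinged', 'nightblue3'];

// Made-up decoded API response: two channels online, in an order
// that differs from $channels.
$dataArray = ['streams' => [
    ['channel' => ['name' => 'nightblue3']],
    ['channel' => ['name' => 'imsoff']],
]];

$online_channels = [];
foreach ($dataArray['streams'] as $stream) {
    $online_channels[] = $stream['channel']['name'];
}

// array_diff() compares values, so the order of the API response is irrelevant.
// It preserves the original keys; array_values() reindexes from 0.
$offline_channels = array_values(array_diff($channels, $online_channels));

foreach ($online_channels as $name) {
    echo $name, " is online\n";
}
foreach ($offline_channels as $name) {
    echo $name, " is offline\n";
}
```

`array_diff()` compares values as case-sensitive strings, so a name such as 'onlySinged' only counts as online if the response uses exactly the same casing.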

Limiting the number of feed items displayed

Below is roughly what I am using to display items from a feed. It works fine, but the feed has many items and I want to display only the first 5. How can this be done?
<?php
$theurl = 'http://www.theurl.com/feed.xml';
$xml = simplexml_load_file($theurl);
$result = $xml->xpath("/items/item");
foreach ($result as $item) {
$date = $item->date;
$title = $item->title;
echo 'The title is '. $title.' and the date is '. $date .'';
} ?>

foreach ($result as $i => $item) {
if ($i == 5) {
break;
}
echo 'The title is '.$item->title.' and the date is '. $item->date;
}

A for loop also works here:
for ($i = 0; $i < min(5, count($result)); $i++) {
echo 'The title is '.$result[$i]->title.' and the date is '. $result[$i]->date;
}
Capping the bound with count($result) avoids reading past the end of the array when the feed has fewer than five items. Any speed difference from foreach is unlikely to matter for five items.

Just do it as part of the XPath query:
<?php
$theurl = 'http://www.theurl.com/feed.xml';
$xml = simplexml_load_file($theurl);
$result = $xml->xpath('/items/item[position() <= 5]');
foreach ($result as $item) {
$date = $item->date;
$title = $item->title;
echo 'The title is '. $title.' and the date is '. $date . '';
}
?>
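
The XPath approach can be tried without a remote feed. This is a minimal sketch, assuming a made-up seven-item feed built in a string and loaded with `simplexml_load_string()` instead of `simplexml_load_file()`:

```php
<?php
// Made-up feed with seven items, in the /items/item shape used above.
$xmlString = '<items>';
for ($n = 1; $n <= 7; $n++) {
    $xmlString .= "<item><title>Title $n</title><date>2015-01-0$n</date></item>";
}
$xmlString .= '</items>';

$xml = simplexml_load_string($xmlString);

// position() <= 5 keeps only the first five <item> children of <items>.
$result = $xml->xpath('/items/item[position() <= 5]');

foreach ($result as $item) {
    echo 'The title is ', $item->title, ' and the date is ', $item->date, "\n";
}
```

Filtering in the query means the loop body stays unchanged and no counter is needed.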

How to iterate this in a foreach construct

This code only handles the first record in the array, $form[items][0][description]. How can I iterate so that it also echoes the following ones, i.e.
$form[items][1][description];
$form[items][2][description];
$form[items][3][description];
and so on?
$array = $form[items][0][description];
function get_line($array, $line) {
preg_match('/' . preg_quote($line) . ': ([^\n]+)/', $array['#value'], $match);
return $match[1];
}
$anchortext = get_line($array, 'Anchor Text');
$url = get_line($array, 'URL');
echo '' . $anchortext . '';
?>

This should do the trick
foreach ($form['items'] as $item) {
echo $item['description'] . "<br>";
}
I could help you more if I saw the body of your get_line function, but here's the gist of it
foreach ($form['items'] as $item) {
$anchor_text = get_line($item['description'], 'Anchor Text');
$url = get_line($item['description'], 'URL');
echo "{$anchor_text}";
}

You can use a for loop to iterate over this array.
for($i=0; $i< count($form['items']); $i++)
{
$anchortext = get_line($form['items'][$i]['description'], 'Anchor Text');
$url = get_line($form['items'][$i]['description'], 'URL');
echo '' . $anchortext . '';
}
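
Both loops can be checked against sample data. This is a minimal sketch, assuming made-up `$form` contents in the `['items'][N]['description']['#value']` shape the question implies; `get_line()` follows the question's version, with an added null return for a missing label:

```php
<?php
// get_line() as in the question, returning null when the label is missing.
function get_line(array $array, string $line): ?string
{
    $pattern = '/' . preg_quote($line, '/') . ': ([^\n]+)/';
    if (preg_match($pattern, $array['#value'], $match)) {
        return $match[1];
    }
    return null;
}

// Made-up data: each description holds "Label: value" lines in '#value'.
$form = ['items' => [
    ['description' => ['#value' => "Anchor Text: First link\nURL: http://example.com/1"]],
    ['description' => ['#value' => "Anchor Text: Second link\nURL: http://example.com/2"]],
]];

foreach ($form['items'] as $item) {
    $anchortext = get_line($item['description'], 'Anchor Text');
    $url = get_line($item['description'], 'URL');
    echo $anchortext, ' -> ', $url, "\n";
}
```

The foreach form avoids index bookkeeping. The for form is only needed when the numeric index itself is used inside the loop.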
