I am currently scraping this website with the code below. Some of the pages it returns have "Mixtape" in the title, and I would like the crawler to skip those and only process the pages that display normally. How can I do that?
$html = file_get_html('http://beatshype.com/mp3download/');
foreach($html->find('.entry-title a') as $element)
{
print '<br><br>';
echo $url = ''.$element->href;
$html2 = file_get_html($url);
print '<br>';
$image = $html2->find('meta[property=og:image]',0);
print $image = $image->content;
print '<br>';
$title = $html2->find('.single-title',0);
print $title = $title->plaintext;
print '<br>';
$str = explode ("/", $url);
$date = $html2->find('.single-content a',2);
print $date = $date->href;
}
[Screenshot: the top result is good, the bottom result is bad.]
Check whether the title contains 'mixtape' and, if it does, skip to the next item in the loop:
if(stripos($title->plaintext, 'mixtape') !== false) {
continue;
}
Put that code just before the line that assigns $title->plaintext to $title, or simply pass $title as the haystack argument.
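As a minimal sketch of the same check outside the scraper, the snippet below filters a hard-coded list of titles. The sample titles and the helper name `is_mixtape` are made up for illustration; only `stripos` and `continue` come from the answer above.

```php
<?php
// Hypothetical sample titles standing in for the scraped .single-title text.
$titles = array(
    'Artist A - Some Song',
    'DJ B - Summer Mixtape Vol. 1',
    'Artist C - Another Song',
    'MIXTAPE: Best of the Year',
);

// Hypothetical helper: true when the title contains "mixtape" in any letter case.
function is_mixtape($title)
{
    // stripos() returns an int position or false, so compare strictly;
    // a match at position 0 would otherwise be mistaken for "not found".
    return stripos($title, 'mixtape') !== false;
}

$kept = array();
foreach ($titles as $title) {
    if (is_mixtape($title)) {
        continue; // skip this entry and move on to the next one
    }
    $kept[] = $title;
}

print_r($kept);
```

The strict `!== false` comparison matters here because the last sample title starts with the search term, so `stripos` returns 0 for it.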
Spelled out in full:
$html = file_get_html('http://beatshype.com/mp3download/');
foreach($html->find('.entry-title a') as $element)
{
    // Resolve the URL first; the detail page cannot be fetched without it
    $url = ''.$element->href;
    $html2 = file_get_html($url);
    $title = $html2->find('.single-title',0);
    // Skip pages whose title contains "mixtape" (case-insensitive)
    if(stripos($title, 'mixtape') !== false) continue;
    $title = $title->plaintext;
    print '<br><br>';
    echo $url;
    print '<br>';
    $image = $html2->find('meta[property=og:image]',0);
    print $image = $image->content;
    print $title.'<br>';
    $str = explode ("/", $url);
    $date = $html2->find('.single-content a',2);
    print $date = $date->href;
}
First
print $image = $image->content;
looks superfluous.
It both sets $image = $image->content and prints it.
But instead of grabbing and printing each line one after another, grab the title, then decide if you want to fetch the other lines and print the record.
$html = file_get_html('http://beatshype.com/mp3download/');
foreach($html->find('.entry-title a') as $element)
{
$url = ''.$element->href;
$html2 = file_get_html($url);
$title = $html2->find('.single-title',0);
if (stripos($title->plaintext, 'mixtape') === false) { // case-insensitive, so "Mixtape" is caught too
$image = $html2->find('meta[property=og:image]',0);
$date = $html2->find('.single-content a',2);
print '<br><br>';
echo $url;
print '<br>';
print $image->content;
print '<br>';
print $title->plaintext;
print '<br>';
print $date->href;
}
}
Related
I am trying to loop through a list of URLs and save content from a div tag to a text file.
<?php
$file = 'content.txt';
$i = 406;
for($i; $i <= 1410; $i++) {
$url = 'http://example.com/chapter/chapter-'.$i;
$content = file_get_contents($url);
$start_tag = explode( '<div class="textdiv">' , $content );
$end_tag = explode("</div>" , $start_tag[1] );
$result_text = $end_tag[0];
echo $result_text;
$result = file_put_contents($file, $result_text);
}
?>
The first problem is that the page contains multiple div tags with that class. I want to get every one of them, but the current code only outputs the first occurrence.
[EDIT]
Thanks to The Alpha for pointing me in the right direction. This worked for me:
<?php
include_once('simple_html_dom.php');
$i = 399;
$file = 'content.txt';
for($i; $i < 1400; $i++){
$url = 'http://example.com/chapter/chapter-'.$i;
$html = file_get_html($url);
foreach ($html->find('div.textdiv') as $div) {
echo $div . '<br />';
$result = file_put_contents($file, $div, FILE_APPEND); // append so earlier divs are not overwritten
}
echo '<hr><br /><h1>Chapter '. $i .'</h1><br /><hr>';
}
?>
One issue is that the script takes a very long time to run.
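For comparison, here is a rough sketch of collecting every matching div with PHP's built-in DOM extension instead of simple_html_dom. The sample markup, the helper name `extract_textdivs`, and the output path are assumptions for illustration. It does not address download time, which most likely dominates the run time of a loop over many URLs.

```php
<?php
// Made-up markup standing in for one downloaded chapter page.
$content = '<html><body>'
    . '<div class="textdiv">First paragraph.</div>'
    . '<div class="other">Navigation</div>'
    . '<div class="textdiv">Second paragraph.</div>'
    . '</body></html>';

// Hypothetical helper: text of every <div> whose class attribute is exactly "textdiv".
function extract_textdivs($htmlString)
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect libxml warnings silently.
    libxml_use_internal_errors(true);
    $doc->loadHTML($htmlString);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $texts = array();
    foreach ($xpath->query('//div[@class="textdiv"]') as $div) {
        $texts[] = trim($div->textContent);
    }
    return $texts;
}

$texts = extract_textdivs($content);

// Append rather than overwrite, so text from earlier pages is kept.
$file = sys_get_temp_dir() . '/content_sketch.txt'; // hypothetical output path
file_put_contents($file, implode("\n", $texts) . "\n", FILE_APPEND | LOCK_EX);

print_r($texts);
```

Writing once per page with `FILE_APPEND`, rather than once per div, keeps the number of file operations down.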
I have the following code, which scrapes text from multiple pages and displays it.
How can I take each of those variables and write them to an Excel spreadsheet located on the server, with each link on a separate row?
The current code:
<?php
include_once 'simple_html_dom.php';
$urls = array(
'http://lmvz.anofm.ro:8080/lmv/detalii.jsp?UNIQUEJVID=50/01/1150001435/1&judet=50',
'http://lmvz.anofm.ro:8080/lmv/detalii.jsp?UNIQUEJVID=50/05/1140001657/1&judet=50',
'http://lmvz.anofm.ro:8080/lmv/detalii.jsp?UNIQUEJVID=50/05/1140001657/2&judet=50',
'http://lmvz.anofm.ro:8080/lmv/detalii.jsp?UNIQUEJVID=50/01/1150001435/1&judet=50',
'http://lmvz.anofm.ro:8080/lmv/detalii.jsp?UNIQUEJVID=50/05/1140001657/1&judet=50',
'http://lmvz.anofm.ro:8080/lmv/detalii.jsp?UNIQUEJVID=50/05/1140001657/2&judet=50',
'http://lmvz.anofm.ro:8080/lmv/detalii.jsp?UNIQUEJVID=50/01/1150001435/1&judet=50',
'http://lmvz.anofm.ro:8080/lmv/detalii.jsp?UNIQUEJVID=50/05/1140001657/1&judet=50',
'http://lmvz.anofm.ro:8080/lmv/detalii.jsp?UNIQUEJVID=50/05/1140001657/2&judet=50',
);
function scraping($url) {
    $ret = array(); // always return an array, even when the page fails to load
    // DOM
    $html = file_get_html($url);
    // article
    if ($html && is_object($html) && isset($html->nodes)) {
        foreach ($html->find('/html/body/table') as $article) {
            $item = array();
            // title
            $item['titlu'] = trim($article->find('/tbody/tr[1]/td/div', 0)->plaintext);
            // table
            $item['tr2'] = trim($article->find('/tbody/tr[2]/td[2]', 0)->plaintext);
            $item['tr3'] = trim($article->find('/tbody/tr[3]/td[2]', 0)->plaintext);
            $item['tr4'] = trim($article->find('/tbody/tr[4]/td[2]', 0)->plaintext);
            $item['tr5'] = trim($article->find('/tbody/tr[5]/td[2]', 0)->plaintext);
            $item['tr6'] = trim($article->find('/tbody/tr[6]/td[2]', 0)->plaintext);
            $item['tr7'] = trim($article->find('/tbody/tr[7]/td[2]', 0)->plaintext);
            $item['tr8'] = trim($article->find('/tbody/tr[8]/td[2]', 0)->plaintext);
            $item['tr9'] = trim($article->find('/tbody/tr[9]/td[2]', 0)->plaintext);
            $item['tr10'] = trim($article->find('/tbody/tr[10]/td[2]', 0)->plaintext);
            $item['tr11'] = trim($article->find('/tbody/tr[11]/td[2]', 0)->plaintext);
            $item['tr12'] = trim($article->find('/tbody/tr[12]/td/div', 0)->plaintext);
            $ret[] = $item;
        }
        // free memory
        $html->clear();
        unset($html);
    }
    return $ret;
}
echo '<pre>';
foreach ($urls as $url) {
$ret = scraping($url);
foreach ($ret as $v) {
echo $v['titlu'] . '<br>';
echo $v['tr2'] . '<br>';
echo $v['tr3'] . '<br>';
echo $v['tr4'] . '<br>';
echo $v['tr5'] . '<br>';
echo $v['tr6'] . '<br>';
echo $v['tr7'] . '<br>';
echo $v['tr8'] . '<br>';
echo $v['tr9'] . '<br>';
echo $v['tr10'] . '<br>';
echo $v['tr11'] . '<br>';
echo $v['tr12'] . '<br>';
echo '<br>';
echo '<br>';
}
}
?>
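One low-dependency option, assuming a CSV file that Excel can open is acceptable in place of a native spreadsheet, is PHP's `fputcsv`. The sketch below uses made-up rows shaped like the `$item` arrays above; the helper name `write_rows_to_csv` and the output path are hypothetical.

```php
<?php
// Made-up rows shaped like the $item arrays built by scraping().
$rows = array(
    array('titlu' => 'Job one', 'tr2' => 'Employer A', 'tr3' => 'City X'),
    array('titlu' => 'Job two', 'tr2' => 'Employer B', 'tr3' => 'City, Y'),
);

// Hypothetical helper: writes a header line plus one line per row.
// Returns the number of data rows written.
function write_rows_to_csv($path, array $rows)
{
    $handle = fopen($path, 'w');
    if ($handle === false) {
        return 0;
    }
    $written = 0;
    if (count($rows) > 0) {
        // Column names are taken from the keys of the first row.
        fputcsv($handle, array_keys($rows[0]), ',', '"', '\\');
        foreach ($rows as $row) {
            // fputcsv quotes fields that contain commas or quotes.
            fputcsv($handle, array_values($row), ',', '"', '\\');
            $written++;
        }
    }
    fclose($handle);
    return $written;
}

$path = sys_get_temp_dir() . '/jobs_sketch.csv'; // hypothetical output path
$count = write_rows_to_csv($path, $rows);
echo $count . " rows written to " . $path . "\n";
```

In the scraper, the rows returned by each `scraping($url)` call could be merged into one array and passed to a function like this after the loop, giving one line per link.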
$channels = array('imaqtpies','imsoff','zzero71tv', 'kaptenen', 'onlySinged', 'nightblue3') ;
$nr = 0;
$callAPI = implode(",",$channels);
$online = 'online.png';
$offline = 'offline.png';
$json = file_get_contents('https://api.twitch.tv/kraken/streams?channel=' . $callAPI);
$dataArray = json_decode($json, true);
foreach($dataArray['streams'] as $mydata){
echo $mydata['channel']['name'] . ' is online';
echo '<br /><hr />';
unset($channels[$nr]);
$nr++;
}
$newChannels = array_values($channels);
foreach($newChannels as $channel) {
echo $channel . ' is offline';
echo '<br /><hr />';
}
Not all the names are echoed in the "offline" part and some names are being echoed twice (both in online and offline).
$mydata['channel']['name'] and $nr are not aligned. You are unsetting the first x channels, but there is no reason Twitch would return the streams in the order in which you defined your channels.
You will want something like:
$online_channels = array();
foreach($dataArray['streams'] as $stream){
$online_channels[] = $stream["channel"]["name"];
}
$offline_channels = array_diff($channels, $online_channels);
Then print $online_channels and $offline_channels.
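Put together, the approach might look like the sketch below. The decoded response is a made-up stand-in for the real API result. The case-insensitive comparison via `array_udiff` and `strcasecmp` is an addition, on the assumption that names in the response may differ in letter case from the configured list (for example 'onlySinged'); plain `array_diff` compares case-sensitively.

```php
<?php
$channels = array('imaqtpies', 'imsoff', 'zzero71tv', 'kaptenen', 'onlySinged', 'nightblue3');

// Made-up decoded response; the real one comes from json_decode($json, true).
$dataArray = array(
    'streams' => array(
        array('channel' => array('name' => 'nightblue3')),
        array('channel' => array('name' => 'onlysinged')),
    ),
);

// Collect the names of every stream the API reported as live.
$online_channels = array();
foreach ($dataArray['streams'] as $stream) {
    $online_channels[] = $stream['channel']['name'];
}

// Everything configured but not reported online is offline.
// strcasecmp makes the comparison ignore letter case.
$offline_channels = array_values(
    array_udiff($channels, $online_channels, 'strcasecmp')
);

foreach ($online_channels as $name) {
    echo $name . ' is online<br /><hr />';
}
foreach ($offline_channels as $name) {
    echo $name . ' is offline<br /><hr />';
}
```

Because the offline list is derived from the names rather than from array positions, the order in which the API returns streams no longer matters.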
Below is roughly what I am using to display items from a feed. It works fine, but the feed has many items and I only want to display the first 5. How can this be done?
<?php
$theurl = 'http://www.theurl.com/feed.xml';
$xml = simplexml_load_file($theurl);
$result = $xml->xpath("/items/item");
foreach ($result as $item) {
$date = $item->date;
$title = $item->title;
echo 'The title is '. $title.' and the date is '. $date .'';
} ?>
foreach ($result as $i => $item) {
if ($i == 5) {
break;
}
echo 'The title is '.$item->title.' and the date is '. $item->date;
}
A for loop is another option:
for ($i = 0; $i < min(5, count($result)); $i++) {
echo 'The title is '.$result[$i]->title.' and the date is '. $result[$i]->date;
}
The min() guard stops the loop early when the feed has fewer than five items. Any speed difference from foreach is unlikely to matter for five iterations.
Just do it as part of the XPath query:
<?php
$theurl = 'http://www.theurl.com/feed.xml';
$xml = simplexml_load_file($theurl);
$result = $xml->xpath('/items/item[position() <= 5]');
foreach ($result as $item) {
$date = $item->date;
$title = $item->title;
echo 'The title is '. $title.' and the date is '. $date . '';
}
?>
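Since `xpath()` returns a plain PHP array, another way to cap the list is `array_slice`. The sketch below builds a small made-up feed inline so it runs without a network request; the element names follow the question, and everything else is an assumption.

```php
<?php
// Inline sample feed (made up) in place of simplexml_load_file($theurl).
$xmlString = '<items>';
for ($n = 1; $n <= 8; $n++) {
    $xmlString .= "<item><title>Title $n</title><date>2015-01-0$n</date></item>";
}
$xmlString .= '</items>';

$xml = simplexml_load_string($xmlString);
$result = $xml->xpath('/items/item');

// Keep at most the first five entries; shorter feeds are returned whole.
$firstFive = array_slice($result, 0, 5);

$titles = array();
foreach ($firstFive as $item) {
    // Cast to string to get the element text rather than the SimpleXMLElement.
    $titles[] = (string) $item->title;
    echo 'The title is ' . $item->title . ' and the date is ' . $item->date . "\n";
}
```

Unlike an index-based loop, `array_slice` needs no separate check for feeds with fewer than five items.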
This only covers the first record in the array, $form['items'][0]['description']. How can I iterate so that it also echoes the succeeding ones, i.e.
$form['items'][1]['description'];
$form['items'][2]['description'];
$form['items'][3]['description'];
and so on?
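<?php
// Array keys must be quoted strings; bare words are treated as constants
$array = $form['items'][0]['description'];
function get_line($array, $line) {
    // Pass the delimiter to preg_quote so a "/" in $line is escaped too
    preg_match('/' . preg_quote($line, '/') . ': ([^\n]+)/', $array['#value'], $match);
    // Return null instead of raising a notice when the line is missing
    return isset($match[1]) ? $match[1] : null;
}
$anchortext = get_line($array, 'Anchor Text');
$url = get_line($array, 'URL');
echo '' . $anchortext . '';
?>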
This should do the trick
foreach ($form['items'] as $item) {
echo $item['description'] . "<br>";
}
I could help you more if I saw the body of your get_line function, but here's the gist of it
foreach ($form['items'] as $item) {
$anchor_text = get_line($item['description'], 'Anchor Text');
$url = get_line($item['description'], 'URL');
echo "{$anchor_text}";
}
You can use a for loop to iterate over this array.
for($i=0; $i< count($form['items']); $i++)
{
$anchortext = get_line($form['items'][$i]['description'], 'Anchor Text');
$url = get_line($form['items'][$i]['description'], 'URL');
echo '' . $anchortext . '';
}
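To try the loop end to end, the sketch below runs it against made-up form data shaped like `$form['items'][n]['description']`. The sample values, the `$links` array, and the null return for a missing line are assumptions added for illustration.

```php
<?php
// Made-up form data shaped like the question's $form array.
$form = array('items' => array(
    array('description' => array(
        '#value' => "Anchor Text: Example One\nURL: http://example.com/one",
    )),
    array('description' => array(
        '#value' => "Anchor Text: Example Two\nURL: http://example.com/two",
    )),
    array('description' => array('#value' => 'No fields here')),
));

// Same idea as the question's get_line(), returning null when nothing matches.
function get_line($array, $line)
{
    $pattern = '/' . preg_quote($line, '/') . ': ([^\n]+)/';
    if (preg_match($pattern, $array['#value'], $match)) {
        return $match[1];
    }
    return null;
}

$links = array();
foreach ($form['items'] as $item) {
    $anchortext = get_line($item['description'], 'Anchor Text');
    $url = get_line($item['description'], 'URL');
    if ($anchortext === null || $url === null) {
        continue; // skip items that lack either field
    }
    $links[] = array('text' => $anchortext, 'url' => $url);
}

print_r($links);
```

Collecting the pairs into an array first keeps the extraction separate from whatever markup is eventually echoed.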