php get bing rss feed titles into one var - php

I am trying to use the code below to fetch a Bing RSS news feed, collect all the titles into an array, and then implode them so that I have a single string of words from which I can build a word cloud with another piece of code. So far it fetches the feed, and uncommenting print_r($doc); displays the SimpleXML object. However, my foreach loop that should collect the titles into the array doesn't seem to work and I can't see where the error is. Thanks in advance.
$ch = curl_init("http://api.bing.com/rss.aspx?Source=News&Market=en-GB&Version=2.0&Query=web+design+uk");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HEADER, 0);
$data = curl_exec($ch);
curl_close($ch);
$doc = new SimpleXMLElement($data);
//print_r($doc);
$vals = array();
foreach ($doc->entry as $entry) {
$vals[] = (string) $entry->title;
}
//join content nodes together for the word cloud
$vals = implode(' ', $vals);
echo($vals);

The titles are at rss/channel/item/title and not where your code looks for them, at rss/entry/title. Other than that, your way of getting the values is fine.
$rss = simplexml_load_file('http://api.bing.com/rss.aspx?Source=News&Market=en-GB&Version=2.0&Query=web+design+uk');
$titles = array();
foreach ($rss->channel->item as $item) {
$titles[] = (string) $item->title;
}
//join content nodes together for the word cloud
$words = implode(' ', $titles);
echo $words;
A quick alternative, using XPath to get the titles, is:
$rss = simplexml_load_file('http://api.bing.com/rss.aspx?Source=News&Market=en-GB&Version=2.0&Query=web+design+uk');
$words = implode(' ', $rss->xpath('channel/item/title'));
echo $words;
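The same extraction can be tried without hitting the network. Below is a minimal sketch, assuming a small hand-written feed; the rssTitles helper and the $sampleRss string are hypothetical names used for illustration, not part of the Bing response:

```php
<?php
// Hypothetical helper: returns every rss/channel/item/title as a plain string.
function rssTitles(string $xml): array
{
    // Collect libxml errors instead of emitting warnings on bad input.
    $previous = libxml_use_internal_errors(true);
    $rss = simplexml_load_string($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($rss === false) {
        return []; // input was not well-formed XML
    }

    $titles = [];
    foreach ($rss->channel->item as $item) {
        $titles[] = (string) $item->title;
    }
    return $titles;
}

// Made-up sample feed, standing in for the real response.
$sampleRss = <<<'XML'
<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item><title>Web design trends</title></item>
    <item><title>UK agencies hiring</title></item>
  </channel>
</rss>
XML;

// Join the titles into one string for the word cloud.
$words = implode(' ', rssTitles($sampleRss));
echo $words, PHP_EOL;
```

Returning an empty array on a parse failure keeps the later implode() call safe when the feed is unreachable or malformed.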

Related

Add space between textContent data scraped from website using PHP DOM

I am trying to add a comma and whitespace to some data I am scraping from a website. The data is scraped successfully, but the values run together, and the comma and space I am trying to add only appear after the last item. Here is the code I currently have:
$html = curl_exec($ch);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$finder = new DomXPath($dom);
$class_ops = 'ipc-inline-list ';
$class_opp = 'ipc-inline ';
$node = $finder->query("//div[@class='$class_ops']//ul[@class='$class_opp']");
foreach ($node as $index => $t) {
if ($index == 3) {
$la = $t->textContent.", ";
}
}
echo $la;
Current Result
DoyleBrainDavid,
Expected Result
Doyle, Brain, David
I am using this code
$c1 = curl_init('https://stackoverflow.com/');
curl_setopt($c1, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($c1);
if (curl_error($c1))
die(curl_error($c1));
// Get the status code
$status = curl_getinfo($c1, CURLINFO_HTTP_CODE);
curl_close($c1);
preg_match_all('/<span(.*?)<\/span>/s', $html, $matches1);
foreach($matches1[0] as $k=>$v){
$enc = mb_detect_encoding($v);
$v = mb_convert_encoding($v,$enc, "UTF-8");
$match1[$k] = strip_tags ($v);
//$match1[$k] = preg_replace('/^[^A-Za-z0-9]+/', '', $match1[$k]);
}
var_dump($match1);
In your case you can replace the pattern like this:
preg_match_all('/<div class="ipc-inline-list">(.*?)<\/div>/s', $html, $matches1);
This returns an array of matches.
I hope this helps.
You want each li, not the ul as one block. Try:
$node = $finder->query("//div[@class='$class_ops']//ul[@class='$class_opp']/li");
Demo: https://3v4l.org/Mvfud
If that doesn't work the actual HTML content should be added to the question.
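To illustrate the per-li approach end to end, here is a rough sketch. The markup is invented (the real page was not posted), and contains() is swapped in for the exact class match so that trailing spaces or extra classes do not matter:

```php
<?php
// Made-up markup standing in for the scraped page.
$html = '<div class="ipc-inline-list"><ul class="ipc-inline">'
      . '<li>Doyle</li><li>Brain</li><li>David</li></ul></div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // silence warnings about imperfect HTML
$dom->loadHTML($html);
libxml_clear_errors();

$finder = new DOMXPath($dom);
// contains() tolerates extra classes or trailing spaces in the attribute.
$items = $finder->query(
    "//div[contains(@class, 'ipc-inline-list')]//ul[contains(@class, 'ipc-inline')]/li"
);

// Collect one entry per <li>, then let implode() place the separators.
$names = [];
foreach ($items as $li) {
    $names[] = trim($li->textContent);
}

$la = implode(', ', $names);
echo $la, PHP_EOL; // Doyle, Brain, David
```

Collecting the values first and joining them with implode() puts the separator only between items, so there is no trailing comma to strip.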

Get contents from 2 urls by file_get_contents

How can I get contents from 2 urls by file_get_contents(); at the same time?
$url1 ="https://site1.com";
$url2 ="https://site2.com";
$urls = file_get_contents($url1 + $url2);
echo $urls;
You can't, but you can get the first and then the second and append the contents to the first:
$urls = file_get_contents($url1) . file_get_contents($url2);
Or:
$urls = file_get_contents($url1);
$urls .= file_get_contents($url2);
If you have many URLs then create an array and loop them:
$urls = ["https://site1.com", "https://site2.com"];
$result = '';
foreach($urls as $url) {
$result .= file_get_contents($url);
}
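Note that file_get_contents() returns false when a request fails, and the loop above would then silently append nothing. A possible variation that keeps track of failures is sketched below; fetchAll is a hypothetical helper name, not a built-in:

```php
<?php
// Hypothetical helper: concatenates the contents of every readable source
// and records the ones that could not be read.
function fetchAll(array $sources): array
{
    $result = '';
    $failed = [];
    foreach ($sources as $source) {
        // @ hides the warning; the false return value is checked instead.
        $content = @file_get_contents($source);
        if ($content === false) {
            $failed[] = $source;
            continue;
        }
        $result .= $content;
    }
    return ['content' => $result, 'failed' => $failed];
}
```

Calling fetchAll($urls) with the array above would return the combined text under 'content' and any unreachable URLs under 'failed'. The requests still run one after another, not simultaneously.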

Extract content of specific div preserving only certain elements

I need to extract only the textual part of a webpage, preserving all and only the <p>, <h2>, <h3>, <h4> and <blockquote> elements.
Now, using DOMXPath and $div = $xpath->query('//div[@class="story-inner"]'); gives lots of unwanted page elements such as pictures, ad blocks and other custom markup inside the text div.
On the other hand using the following code:
$items = $doc->getElementsByTagName('p');
for ($i = 0; $i < $items->length; $i++) {
echo $items->item($i)->nodeValue . "<p>";
}
gives a nice, clean result very close to what I want, but with the <h2>, <h3>, <h4> and <blockquote> elements missing.
Is there a DOM way of either (1) selecting only the desired page elements and extracting a clean result, or (2) efficiently cleaning up the output obtained from $div = $xpath->query('//div[@class="story-inner"]');?
You could use an "or" condition inside your XPath query in this case. Chain the tag names together with it to select only the elements you want.
$url = "http://www.example.com/russian/international/2015/02/150218_ukraine_debaltseve_fighting";
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$html = curl_exec($curl);
curl_close($curl);
$doc = new DOMDocument();
$html = mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8");
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$tags = array('p', 'h2');
$children_needed = implode(' or ', array_map(function($tag){ return sprintf('name()="%s"', $tag); }, $tags));
$query = "//div[@class='story-body__inner']//*[$children_needed]";
$div_children = $xpath->query($query);
if($div_children->length > 0) {
foreach($div_children as $child) {
echo $doc->saveHTML($child);
}
}
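The same idea extends to the full tag list from the question. The sketch below is only an illustration: the markup is invented, and the self:: axis is used in place of name() comparisons as an equivalent way to test the element name:

```php
<?php
// Made-up markup standing in for the article page.
$html = '<div class="story-inner"><h2>Heading</h2><p>First.</p>'
      . '<img src="ad.png"><div class="ad">Ad</div>'
      . '<blockquote>Quote</blockquote><h3>Sub</h3></div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // silence warnings about imperfect HTML
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$tags = ['p', 'h2', 'h3', 'h4', 'blockquote'];
// Builds: self::p or self::h2 or self::h3 or self::h4 or self::blockquote
$condition = implode(' or ', array_map(
    function ($tag) { return 'self::' . $tag; },
    $tags
));
$nodes = $xpath->query("//div[@class='story-inner']//*[$condition]");

// Keep the tag name and text of each match, in document order.
$kept = [];
foreach ($nodes as $node) {
    $kept[] = $node->nodeName . ': ' . trim($node->textContent);
}
print_r($kept);
```

The image and the ad block are skipped because they match none of the listed names, while the wanted elements come back in document order.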
If I understood your question correctly, is this what you are asking for?
$output1=preg_match('/^.*<tagName>(.*)<\/tagName>/', $value,$match1);
Match on the tag names and use preg_match to get the data between them.

creating multidimensional array with two arrays

I am indexing web pages. The code scans a given web page for its links and its title. The links and the title are stored in two different arrays. I would like to create a multidimensional array that holds each link followed by that link's title. I have the code; I just don't know how to put it together.
require_once('simplehtmldom_1_5/simple_html_dom.php');
require_once('url_to_absolute/url_to_absolute.php');
//links
$links = Array();
$URL = 'http://www.youtube.com'; // change it for urls to grab
// grabs the urls from URL
$file = file_get_html($URL);
foreach ($file->find('a') as $theelement) {
$links[] = url_to_absolute($URL, $theelement->href);
}
print_r($links);
//titles
$titles = Array();
$str = file_get_contents($URL);
$titles[] = preg_match_all( "/\<title\>(.*)\<\/title\>/", $str, $title );
print_r($title[1]);
You should be able to do this, assuming there are as many titles as there are links, so that each link and its title share the same array key.
$newArray = array();
foreach ($links as $key=>$val)
{
$newArray[$key]['link'] = $val;
$newArray[$key]['title'] = $title[1][$key]; // preg_match_all() stores the matched titles in $title[1]
}
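If the two arrays can differ in length, the loop above will hit missing keys. One way to guard against that is sketched here with invented sample data (the $links and $titles values are placeholders, not real scrape results):

```php
<?php
// Hypothetical sample data standing in for the scraped links and titles.
$links  = ['http://example.com/a', 'http://example.com/b', 'http://example.com/c'];
$titles = ['Page A', 'Page B'];

$pairs = [];
foreach ($links as $key => $link) {
    $pairs[] = [
        'link'  => $link,
        // A missing title becomes null instead of raising an undefined-key notice.
        'title' => isset($titles[$key]) ? $titles[$key] : null,
    ];
}
print_r($pairs);
```

Every link is kept, and a null title marks the entries for which no matching title was found.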
It is not clear what you want.
Anyway, here is how I would rewrite your code in a more organized way:
require_once('simplehtmldom_1_5/simple_html_dom.php');
require_once('url_to_absolute/url_to_absolute.php');
$info = array();
$urls = array(
'http://www.youtube.com',
'http://www.google.com.br'
);
foreach ($urls as $url)
{
$str = file_get_contents($url);
$html = str_get_html($str);
$title = strval($html->find('title', 0)->plaintext); // index 0 returns the first match instead of an array
$links = array();
foreach($html->find('a') as $anchor)
{
$links[] = url_to_absolute($url, strval($anchor->href));
}
$links = array_unique($links);
$info[$url] = array(
'title' => $title,
'links' => $links
);
}
print_r($info);

Parsing XML in PHP DOM via cURL - can't get nodeValue if it is url address or date

I have a strange problem parsing an XML document in PHP that is loaded via cURL. I cannot get the nodeValue of nodes containing a URL (I'm trying to add a simple RSS reader to my CMS). The strange thing is that it works for every node except those containing URLs and dates (<link> and <pubDate>).
Here is the code (I know it is a clumsy solution, but I'm fairly new to working with the DOM and parsing XML documents).
function file_get_contents_curl($url) {
$ch = curl_init(); // initialize curl handle
curl_setopt($ch, CURLOPT_URL, $url); // set url to post to
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return into a variable
curl_setopt($ch, CURLOPT_TIMEOUT, 4); // times out after 4s
$result = curl_exec($ch); // run the whole process
return $result;
}
function vypis($adresa) {
$html = file_get_contents_curl($adresa);
$doc = new DOMDocument();
@$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');
$desc = $doc->getElementsByTagName('description');
$ctg = $doc->getElementsByTagName('category');
$pd = $doc->getElementsByTagName('pubDate');
$ab = $doc->getElementsByTagName('link');
$aut = $doc->getElementsByTagName('author');
for ($i = 1; $i < $desc->length; $i++) {
$dsc = $desc->item($i);
$titles = $nodes->item($i);
$categorys = $ctg->item($i);
$pubDates = $pd->item($i);
$links = $ab->item($i);
$autors = $aut->item($i);
$description = $dsc->nodeValue;
$title = $titles->nodeValue;
$category = $categorys->nodeValue;
$pubDate = $pubDates->nodeValue;
$link = $links->nodeValue;
$autor = $autors->nodeValue;
echo 'Title:' . $title . '<br/>';
echo 'Description:' . $description . '<br/>';
echo 'Category:' . $category . '<br/>';
echo 'Datum ' . gmdate("D, d M Y H:i:s",
strtotime($pubDate)) . " GMT" . '<br/>';
echo "Autor: $autor" . '<br/>';
echo 'Link: ' . $link . '<br/><br/>';
}
}
Can you please help me with this?
To read RSS you should use loadXML rather than loadHTML. One reason your links don't show is that the <link> element in HTML is empty, so the HTML parser discards its contents. See also here: http://www.w3.org/TR/html401/struct/links.html#h-12.3
Also, I find it easier to just iterate over the <item> tags and then iterate over their children nodes. Like so:
$d = new DOMDocument;
// don't show xml warnings
libxml_use_internal_errors(true);
$d->loadXML($xml_contents);
// clear xml warnings buffer
libxml_clear_errors();
$items = array();
// iterate all item tags
foreach ($d->getElementsByTagName('item') as $item) {
$item_attributes = array();
// iterate over children
foreach ($item->childNodes as $child) {
$item_attributes[$child->nodeName] = $child->nodeValue;
}
$items[] = $item_attributes;
}
var_dump($items);
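One detail worth knowing: when the feed is indented, childNodes also contains whitespace #text nodes, which end up in the array under the key "#text". A small variation that keeps only element nodes is sketched below, using a made-up feed in place of the downloaded one:

```php
<?php
// Made-up feed standing in for the downloaded RSS document.
$xml_contents = '<rss version="2.0"><channel>'
    . '<item>' . "\n"
    . '  <title>Hello</title>' . "\n"
    . '  <link>http://example.com/hello</link>' . "\n"
    . '  <pubDate>Mon, 02 Jan 2012 10:00:00 GMT</pubDate>' . "\n"
    . '</item></channel></rss>';

$d = new DOMDocument;
$d->loadXML($xml_contents);

$items = [];
foreach ($d->getElementsByTagName('item') as $item) {
    $fields = [];
    foreach ($item->childNodes as $child) {
        // Whitespace between tags shows up as #text nodes; keep elements only.
        if ($child->nodeType !== XML_ELEMENT_NODE) {
            continue;
        }
        $fields[$child->nodeName] = trim($child->nodeValue);
    }
    $items[] = $fields;
}
print_r($items);
```

With loadXML the <link> and <pubDate> values are read like any other element, which is exactly what loadHTML was getting wrong.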
