How do you capture certain data from description field in RSS feed?

How do you capture certain data from description field in RSS feed? - php

I have an rss feed that I am reading into. I need to retrieve certain data from the field in this feed.
This is the example feed data :
<content:encoded><![CDATA[
<b>When:</b><br />
Weekly Event - Every Thursday: 1:30 PM to 3:30 PM (CT)<br /><br />
<b>Where:</b><br />
100 West Street<BR>2nd floor<BR>Gainesville<BR>
<br>.....
How do I pull out the data for When: and Where: respectively? I attempted to use regex but I am unsure if I am not accessing the data correctly or if my regex expression is wrong. I'm not set on using regex.
This is my code:
foreach ($x->channel->item as $event) {
$eventCounter++;
$rowColor = ($eventCounter % 2 == 0) ? '#FFFFFF' : '#F1F1F1';
$content = $event->children('http://purl.org/rss/1.0/modules/content/');
$contents = $content->encoded;
echo '<tr style="background-color:' . $rowColor . '">';
echo '<td>';
//echo "<a id=buttonRed href='$event->link' title='$event->title' target='_blank'>" . $event->title . "</a>";
echo "" . $event->title . "";
echo '</td>';
echo '<td>';
$re = '%when\:\s*</b>\s*(.|\s)<br \/><br \/>$/i';
if (preg_match($re, $contents, $matches)) {
$date = $matches;
}
echo $date;
echo '</td>';
echo '<td>';
$re = '/^When\:<\/b>()$/';
if (preg_match($re, $contents, $matches)) {
$location = $matches;
}
echo $location;
echo '</td>';
echo '<td>';
echo "<a id=buttonRed href='$event->link' title='$event->title' target='_blank'>Click Here To Register</a>";
echo '</td>';
echo '</tr>';
}
The two $res are just my attempt to get the data out using different regex expressions. Let me know where I am going wrong. Thanks

The following should sort of get you there. (I wrote this from the top of my head and it does not exactly following your XML syntax. But you get the idea.)
<?php
$str = "<root><b>When:</b> whenwhen <b>Where:</b> wherewhere</root>";
$doc = new DOMDocument();
$doc->loadXML($str);
$when = $where = "";
$target = null;
foreach ($doc->documentElement->childNodes as $node) {
if ($node->tagName == "b") {
if (++$i == 1) {
$target = &$when;
} else {
$target = &$where;
}
}
if ($target !== null && $node->nodeType === XML_TEXT_NODE) {
$target .= $node->nodeValue;
}
}
var_dump($when, $where);

I had a problem like this and I ended up using YQL. Take a good look at the page-scraping code given there, especially the select command. Then go the the console and put in your own select statement, specifying the feed url and the xpath to the nodes you're wanting. Select JSON format. Then go down to the bottom of the page, get the REST query url, and use it in a jquery jsonp request. MAGIC!

please, don't extract data from XML-documents via regex.
The long answer is e.g. here: https://stackoverflow.com/a/335446/313145
The short answer is: it is not easier to use regex and will break often.

Related

Scrape only 1 result from google search via PHP

Recently i've been having issues with the code of not parsing only 1 result from the google search url. Instead it parses 10 results which, I am not sure if it is changable.
<?php
include('simple_html_dom.php');
$html = file_get_html('https://www.google.com/search?q=raspberry&oq=raspberry&aqs=chrome.0.69i59j0l5.1144j0j7&sourceid=chrome&ie=UTF-8');
$linkObjs = $html->find('div[class=jfp3ef] a');
foreach ($linkObjs as $linkObj) {
$title = trim($linkObj->plaintext);
$link = trim($linkObj->href);
//if it is not a direct link but url reference found inside it, then extract
if (!preg_match('/^https?/', $link) && preg_match('/q=(.+)&sa=/U', $link, $matches) && preg_match('/^https?/', $matches[1])) {
$link = $matches[1];
} else if (!preg_match('/^https?/', $link)) { // skip if it is not a valid link
continue;
}
echo '<p>Title:' . $title . '<br />';
echo 'Link: ' . $link . '</p>';
}
?>
All, I need is 1 result being scraped off a google search. Thats all. The code i've included is the web scraper written in PHP.

Scraping with Simple HTML DOM Parser but it stops suddenly

I'm trying to scrape the following page: http://mangafox.me/manga/
I wanted the script to click on each of those links and scrape the details of each manga and for the most part my code does exactly that. It works, but for some reason the page just stops loading midway (it doesn't even go through the # list).
There is no error message so I don't know what I'm looking for. I would appreciate some advice on what I'm doing wrong.
Code:
<?php
include('simple_html_dom.php');
set_time_limit(0);
//ini_set('max_execution_time', 300);
//Creates an instance of the simple_html_dom class
$html = new simple_html_dom();
//Loads the page from the URL entered
$html->load_file('http://mangafox.me/manga');
//Finds an element and if there is more than 1 instance the variable becomes an array
$manga_urls = $html->find('.manga_list a');
//Function which retrieves information needed to populate the DB from indiviual manga pages.
function getmanga($value, $url){
$pagehtml = new simple_html_dom();
$pagehtml->load_file($url);
if ($value == 'desc') {
$description = $pagehtml->find('p.summary');
foreach($description as $d){
//return $d->plaintext;
return $desc = $d->plaintext;
}
unset($description);
} else if ($value == 'status') {
$status = $pagehtml->find('div[class=data] span');
foreach ($status as $s) {
$status = explode(",", $s->plaintext);
return $status[0];
}
unset($status);
} else if ($value == 'genre') {
$genre = $pagehtml->find('//*[#id="title"]/table/tbody/tr[2]/td[4]');
foreach ($genre as $g) {
return $g->plaintext;
}
unset($genre);
} else if ($value == 'author') {
$author = $pagehtml->find('//*[#id="title"]/table/tbody/tr[2]/td[2]');
foreach ($author as $a) {
return $a->plaintext;
}
unset($author);
} else if ($value == 'release') {
$release = $pagehtml->find('//*[#id="title"]/table/tbody/tr[2]/td[1]');
foreach ($release as $r) {
return $r->plaintext;
}
unset($release);
} else if ($value == 'image') {
$image = $pagehtml->find('.cover img');
foreach ($image as $i) {
return $i->src;
}
unset($image);
}
$pagehtml->clear();
unset($pagehtml);
}
foreach($manga_urls as $url) {
$href = $url->href;
if (strpos($href, 'http') !== false){
echo 'Title: ' . $url->plaintext . '<br />';
echo 'Link: ' . $href . '<br />';
echo 'Description: ' . getmanga('desc', $href) . '<br />';
echo 'Status: ' . getmanga('status',$href) . '<br />';
echo 'Genre: ' . getmanga('genre', $href) . '<br />';
echo 'Author: ' . getmanga('author', $href) . '<br />';
echo 'Release: ' . getmanga('release', $href) . '<br />';
echo 'Image Link: ' . getmanga('image', $href) . '<br />';
echo '<br /><br />';
}
}
$html->clear();
unset($html);
?>

So, it was not a 'just do this' fix, but I did it ;)
Beside the fact is was importing the sub pages way too much, it also had a huge simple_html_dom to iterate through. It has like 13307 items, and simple_html_dom is not made for speed or efficiency. It allocated much space for things you didn't need in this case. That is why I replaced the main simple_html_dom with a regular expression.
I think it still takes ages to load fully, and you are better of using a other language, but this is a working result :-)
https://gist.github.com/dralletje/ee996ffe4c957cdccd01

I have faced the same issue, when the loop with 20k iterations stopped without any error message. So posting the solution so it might help someone.
The issue seems to be of performance as stated before. So I decided to use curl instead of simple html dom. The function bellow returns content of website:
function getContent($url){
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$result = curl_exec($ch);
curl_close($ch);
if($result){
return $result;
}else{
return "";
}
}
Now to traverse the DOM, I am still using simple html dom, but the code is changed as:
$content = getContent($url);
if($content){
// Create a DOM object
$doc = new simple_html_dom();
// Load HTML from a string
$doc->load($content);
}else{
continue;
}
And at the end of each loop close and unset variable as:
$doc->clear();
unset($doc);

Echo selective data (rows) from multidimensional array in PHP

I have an array coming from a .csv file. These are coming from a real estate program. In the second column I have the words For Sale and in the third column the words For Rent that are indicated only on the rows that are concerned. Otherwise the cell is empty. I want to display a list rows only For Sale for example. Of course then if the user clicks on a link in one of the rows, the appropriate page will be displayed.
I can't seem to target the text in the column, and I can't permit that the words For Sale be used throughout the entire array because they could appear in another column (description for example).
I have tried this, but to no avail.
/* The array is $arrCSV */
foreach($arrCSV as $book) {
if($book[1] === For Sale) {
echo '<div>';
}
echo '<div>';
echo $book[0]. '<br>';
echo $book[1]. '<br>';
echo $book[2]. '<br>';
echo $book[3]. '<br>';
echo $book[6]. '<br><br><br>';
echo '</div>';
}
I also tried this:
foreach($arrCSV as $key => $book) {
if($book['1'] == 'For Sale') {
echo '<div>';
}
echo '<div>';
echo $book[0]. '<br>';
echo $book[1]. '<br>';
echo $book[2]. '<br>';
echo $book[3]. '<br>';
echo $book[6]. '<br><br><br>';
echo '</div>';
}

Do you mean:
foreach($arrCSV as $key => $book) {
if($book['1'] == 'For Sale') {
echo '<div></div>';
}else{
echo '<div>';
echo $book[0]. '<br>';
echo $book[1]. '<br>';
echo $book[2]. '<br>';
echo $book[3]. '<br>';
echo $book[6]. '<br><br><br>';
echo '</div>';
}
}

I found a solution here:
PHP: Taking Array (CSV) And Intelligently Returning Information
$searchCity = 'For Sale'; //or whatever you are looking for
$file = file_get_contents('annonces.csv');
$results = array();
$lines = explode("\n",$file);
//use any line delim as the 1st param,
//im deciding on \n but idk how your file is encoded
foreach($lines as $line){
//split the line
$col = explode(";",$line);
//and you know city is the 3rd element
if(trim($col[1]) == $searchCity){
$results[] = $col;
}
}
And then:
foreach($results as $Rockband)
{
echo "<tr>";
foreach($Rockband as $item)
{
echo "<td>$item</td>";
}
echo "</tr>";
}
As you can see I'm learning. It turns out that by formulating a question, a series of similar posts are displayed, which is much quicker and easier than using google. Thanks.

Regular Expression (preg_match_all)

I have a private website where I share videos (and some other stuff).
What I have achieved is that with preg_match_all() it automatically finds the link and it paste the video with the HTML code to my website.
Here an example:
<?php
$matchwith = "http://videosite.com/id1 http://videosite.com/id2 http://videosite.com/id3";
preg_match_all('/videosite\.com\/(\w+)/i', $matchwith, $matches);
foreach($matches[1] as $value)
{
print 'Hyperlink';
}
?>
This works. I know this could may could be done easier, but it has to be this way.
But I do not know how this with a two part movie. Here an example:
$matchWith = "http://videosite.com/id1_movie1 http://videosite.com/id2_movie1"
"http://videosite.com/id3_movie2 http://videosite.com/id4_movie2";
Everything after http://videosite.com/(...) is unique.
What I want is if you write Part 1 and Part 2 (or whatever) before the link, that it automatically detects it as Part 1 and Part 2 of this video.
$matchwith could contain different movies.

So I believe this is what you need:
<?php
$matchWith = "Movie 1 http://videosite.com/id1" . PHP_EOL .
"Movie 1 http://videosite.com/id2" . PHP_EOL .
"Movie 2 http://videosite.com/id3";
$arrLinks = array();
preg_match_all('%(.*)\shttp://videosite\.com/(\w+)\r{0,1}%', $matchWith, $result, PREG_SET_ORDER);
for ($matchi = 0; $matchi < count($result); $matchi++) {
$arrLinks[$result[$matchi][1]][] = $result[$matchi][2];
}
foreach ($arrLinks as $movieName => $arrMovieIds) {
print '<div>' . $movieName . '</div>';
foreach ($arrMovieIds as $movieId) {
print 'Hyperlink<br/>';
}
}
?>

$matchwith = "Part 1 http://videosite.com/id1-1 Part2 http://videosite.com/id1-2";
preg_match_all('/videosite\.com\/(\w+-\d+)/i', $matchwith, $matches);
foreach($matches[1] as $value)
{
print 'Hyperlink';
}

Need some help with XML parsing

The XML feed is located at: http://xml.betclick.com/odds_fr.xml
I need a php loop to echo the name of the match, the hour, and the bets options and the odds links.
The function will select and display ONLY the matchs of the day with streaming="1" and the bets type "Ftb_Mr3".
I'm new to xpath and simplexml.
Thanks in advance.
So far I have:
<?php
$xml_str = file_get_contents("http://xml.betclick.com/odds_fr.xml");
$xml = simplexml_load_string($xml_str);
// need xpath magic
$xml->xpath();
// display
?>

Xpath is pretty simple once you get the hang of it
you basically want to get every match tag with a certain attribute
//match[#streaming=1]
will work pefectly, it gets every match tag from underneath the parent tag with the attribute streaming equal to 1
And i just realised you also want matches with a bets type of "Ftb_Mr3"
//match[#streaming=1]/bets/bet[#code="Ftb_Mr3"]
This will return the bet node though, we want the match, which we know is the grandparent
//match[#streaming=1]/bets/bet[#code="Ftb_Mr3"]/../..
the two dots work like they do in file paths, and gets the match.
now to work this into your sample just change the final bit to
// need xpath magic
$nodes = $xml->xpath('//match[#streaming=1]/bets/bet[#code="Ftb_Mr3"]/../..');
foreach($nodes as $node) {
echo $node['name'].'<br/>';
}
to print all the match names.

I don't know how to work xpath really, but if you want to 'loop it', this should get you started:
<?php
$xml = simplexml_load_file("odds_fr.xml");
foreach ($xml->children() as $child)
{
foreach ($child->children() as $child2)
{
foreach ($child2->children() as $child3)
{
foreach($child3->attributes() as $a => $b)
{
echo $a,'="',$b,"\"</br>";
}
}
}
}
?>
That gets you to the 'match' tag which has the 'streaming' attribute. I don't really know what 'matches of the day' are, either, but...
It's basically right out of the w3c reference:
http://www.w3schools.com/PHP/php_ref_simplexml.asp

I am using this on a project. Scraping Beclic odds with:
<?php
$match_csv = fopen('matches.csv', 'w');
$bet_csv = fopen('bets.csv', 'w');
$xml = simplexml_load_file('http://xml.cdn.betclic.com/odds_en.xml');
$bookmaker = 'Betclick';
foreach ($xml as $sport) {
$sport_name = $sport->attributes()->name;
foreach ($sport as $event) {
$event_name = $event->attributes()->name;
foreach ($event as $match) {
$match_name = $match->attributes()->name;
$match_id = $match->attributes()->id;
$match_start_date_str = str_replace('T', ' ', $match->attributes()->start_date);
$match_start_date = strtotime($match_start_date_str);
if (!empty($match->attributes()->live_id)) {
$match_is_live = 1;
} else {
$match_is_live = 0;
}
if ($match->attributes()->streaming == 1) {
$match_is_running = 1;
} else {
$match_is_running = 0;
}
$match_row = $match_id . ',' . $bookmaker . ',' . $sport_name . ',' . $event_name . ',' . $match_name . ',' . $match_start_date . ',' . $match_is_live . ',' . $match_is_running;
fputcsv($match_csv, explode(',', $match_row));
foreach ($match as $bets) {
foreach ($bets as $bet) {
$bet_name = $bet->attributes()->name;
foreach ($bet as $choice) {
// team numbers are surrounded by %, we strip them
$choice_name = str_replace('%', '', $choice->attributes()->name);
// get the float value of odss
$odd = (float)$choice->attributes()->odd;
// concat the row to be put to csv file
$bet_row = $match_id . ',' . $bet_name . ',' . $choice_name . ',' . $odd;
fputcsv($bet_csv, explode(',', $bet_row));
}
}
}
}
}
}
fclose($match_csv);
fclose($bet_csv);
?>
Then loading the csv files into mysql. Running it once a minute, works great so far.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

How do you capture certain data from description field in RSS feed? - php

please, don't extract data from XML-documents via regex. The long answer is e.g. here: https://stackoverflow.com/a/335446/313145 The short answer is: it is not easier to use regex and will break often.

Related

Scrape only 1 result from google search via PHP

Scraping with Simple HTML DOM Parser but it stops suddenly

Echo selective data (rows) from multidimensional array in PHP

Regular Expression (preg_match_all)

Need some help with XML parsing

Categories

Resources