Parse all items with PHP Simple HTML DOM Parser

I'm trying to parse a DOM with Simple HTML DOM Parser in PHP. The parsed content is a list of movies, and I want to get every genre each movie has, but when I run my code I only get the last genre, not all of them. My code looks like this:
if ($obj) {
foreach($obj as $key => $data) {
$item['url'] = 'http://geo.saitebi.ge/movie/' . $page;
$item['poster'] = 'http://geo.saitebi.ge/web/ka/img/movies/' . $page . '/240x340/' . $page . '.jpg';
$item['geotitle'] = $data->find('div.movie-item-title', 0)->plaintext;
$item['englishtitle'] = $data->find('div.movie-item-title-en', 0)->plaintext;
$item['year'] = $data->find('div.movie-item-year', 0)->plaintext;
foreach($data->find('a.movie-genre-item') as $genre) {
$item['genres'] = $genre->plaintext . ', ';
}
$item['description'] = $data->find('div.movie-desctiption-more', 0)->plaintext;
$item['imdb_rating'] = $data->find('a.imdb_vote', 0)->plaintext;
$item['imdb_id'] = trim(substr($data->find('a.imdb_vote',0)->href, strrpos($data->find('a.imdb_vote',0)->href, '/') + 1));
}
}
As you can see, I'm collecting the content into an array. Inside it I run another foreach loop to get all the genre items, but it only keeps the last one. What is wrong with my code?

You are just overwriting the last set of data each time. You need to set it to blank and then append it using .= each time, like...
$item['genres'] = '';
foreach($data->find('a.movie-genre-item') as $genre) {
$item['genres'] .= $genre->plaintext . ', ';
}

The code saves each $genre->plaintext into $item under the same key, 'genres', so every iteration replaces the previous value:
foreach($data->find('a.movie-genre-item') as $genre) {
$item['genres'] = $genre->plaintext . ', ';
}
Instead, $item['genres'] could be an array, I mean:
$item['genres'] = array();
foreach($data->find('a.movie-genre-item') as $genre) {
array_push($item['genres'], $genre->plaintext);
}

Here is the genre code you need to update:
// Here is your genre code
$movie_genre='';
foreach($data->find('a.movie-genre-item') as $genre) {
$movie_genre .= $genre->plaintext . ',';
}
// Use rtrim to remove the trailing comma from the genre list
$item['genres'] = rtrim($movie_genre,',');
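
As a hedged alternative to appending and trimming, the genre names can be collected into an array and joined once. This is a minimal sketch: $genreTexts is a hypothetical stand-in for the plaintext values the parser would return, so the snippet runs without the library.

```php
<?php
// Hypothetical stand-in for the plaintext of each a.movie-genre-item node.
$genreTexts = [' Action ', 'Drama', ' Comedy'];

// Collect one cleaned-up name per genre instead of overwriting a string.
$genres = [];
foreach ($genreTexts as $text) {
    $genres[] = trim($text);
}

// implode() puts the separator only between items, so there is no
// trailing comma to strip afterwards.
$item = ['genres' => implode(', ', $genres)];

echo $item['genres'], PHP_EOL;
```

Keeping the array around is also handy if the genres later need to be stored or filtered individually.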

Related

User variables of a function, in another function

I have a Laravel app with the following PHP code:
public function handle()
{
$post_item->category_id = $source->category_id;
$post_item->featured = 0;
$post_item->type = Posts::TYPE_SOURCE;
$post_item->render_type = $item['render_type'];
$post_item->source_id = $source->id;
$post_item->description = is_array($item['description'])?'':$item['description'];
$post_item->featured_image = $item['featured_image'];
$post_item->video_embed_code = $item['video_embed_code'];
$post_item->dont_show_author_publisher = $source->dont_show_author_publisher;
$post_item->show_post_source = $source->show_post_source;
$post_item->show_author_box = $source->dont_show_author_publisher == 1 ? 0 : 1;
$post_item->show_author_socials = $source->dont_show_author_publisher == 1 ? 0 : 1;
$post_item->rating_box = 0;
$post_item->created_at = $item['pubDate'];
$post_item->views = 1;
$post_item->save();
$this->createTags($item['categories'], $post_item->id);
// This is where I want to add my echo
}
public function createTags($tags, $post_id)
{
$post_tags = PostTags::where('post_id', $post_id)->get();
foreach ($post_tags as $post_tag) {
Tags::where('id', $post_tag->tag_id)->delete();
}
PostTags::where('post_id', $post_id)->delete();
foreach ($tags as $tag) {
$old_tag=Tags::where('title',$tag)->first();
if(isset($old_tag))
{
$pt = new PostTags();
$pt->post_id = $post_id;
$pt->tag_id = $old_tag->id;
$pt->save();
}
else {
$new_tag = new Tags();
$new_tag->title = $tag;
$new_tag->slug = Str::slug($tag);
$new_tag->save();
$pt = new PostTags();
$pt->post_id = $post_id;
$pt->tag_id = $new_tag->id;
$pt->save();
}
}
}
I'm trying to echo the title along with the tags right after the commented place, but it fails to produce the correct output. I was wondering whether my approach is correct or my workaround is completely wrong.
I'm using this code in the commented part:
$tweet = $post_item->title . ' tags: ' . $post_item->tags;
After doing some tests, I realized that if I use
var_dump($tag);
right after
foreach ($tags as $tag)
in my createTags function, all tags are output correctly.
I'm wondering if I can store all the $tags from inside the createTags function's foreach in a globally accessible variable that could then be echoed in the initial handle function.

Assuming the post item model has a "tags" relation that returns tag models with a title attribute, you could try this:
$tweet = $post_item->title . ' tags: ' . $post_item->tags->pluck('title')->implode(', ');
Also, if you would just like to echo the tags at the commented place, try this:
echo implode(',', $item['categories']);

If $post_item->tags is an array, try:
$tweet = $post_item->title . ' tags: ' . implode(', ', $post_item->tags);
If it is a collection, try:
$tweet = $post_item->title . ' tags: ' . $post_item->tags->join(', ');
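
Rather than a global variable, the tag-handling function can simply return the titles it processed. The sketch below is framework-free and hypothetical: createTags() here only cleans the input instead of saving models, and $postTitle and $categories stand in for $post_item->title and $item['categories'].

```php
<?php
// Hypothetical, framework-free stand-in for createTags(): it returns the
// cleaned tag titles instead of only saving them.
function createTags(array $tags): array
{
    $titles = [];
    foreach ($tags as $tag) {
        $title = trim((string) $tag);
        if ($title !== '') {
            $titles[] = $title; // the real method would also save the tag here
        }
    }
    return $titles;
}

// Hypothetical stand-ins for $post_item->title and $item['categories'].
$postTitle = 'Example post';
$categories = ['php', ' laravel ', ''];

// The caller keeps the returned list and builds the output from it.
$tagTitles = createTags($categories);
$tweet = $postTitle . ' tags: ' . implode(', ', $tagTitles);

echo $tweet, PHP_EOL;
```

Returning the list keeps the data flow explicit, which tends to be easier to follow than sharing state through a global.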

find a element in html and explode it for stock

I want to retrieve an HTML element in a page.
<h2 id="resultCount" class="resultCount">
<span>
Showing 1 - 12 of 40,923 Results
</span>
</h2>
I need the total number of results for a test in my PHP code.
For now, I take everything between the h2 tags and explode it on spaces.
Then I explode the number again on the comma and concatenate the parts, so that the count is converted to European format. Once that is done, I test the number of results.
define("MAX_RESULT_ALL_PAGES", 1200);
$queryUrl = AMAZON_TOTAL_BOOKS_COUNT.$searchMonthUrlParam.$searchYearUrlParam.$searchTypeUrlParam.urlencode($keyword)."&page=".$pageNum;
$htmlResultCountPage = file_get_html($queryUrl);
$htmlResultCount = $htmlResultCountPage->find("h2[id=resultCount]");
$resultCountArray = explode(" ", $htmlResultCount[0]);
$explodeCount = explode(',', $resultCountArray[5]);
$europeFormatCount = '';
foreach ($explodeCount as $val) {
$europeFormatCount .= $val;
}
if ($europeFormatCount > MAX_RESULT_ALL_PAGES) {
$queryUrl = AMAZON_SEARCH_URL.$searchMonthUrlParam.$searchYearUrlParam.$searchTypeUrlParam.urlencode($keyword)."&page=".$pageNum;
}
At the moment the total number of results is not retrieved correctly, and the condition is never met even when it should be.
Does anyone have a solution to this problem, or another way to do it?

I would simply fetch the page as a string (not html) and use a regular expression to get the total number of results. The code would look something like this:
define('MAX_RESULT_ALL_PAGES', 1200);
$queryUrl = AMAZON_TOTAL_BOOKS_COUNT . $searchMonthUrlParam . $searchYearUrlParam . $searchTypeUrlParam . urlencode($keyword) . '&page=' . $pageNum;
$queryResult = file_get_contents($queryUrl);
if (preg_match('/of\s+([0-9,]+)\s+Results/', $queryResult, $matches)) {
$totalResults = (int) str_replace(',', '', $matches[1]);
} else {
throw new \RuntimeException('Total number of results not found');
}
if ($totalResults > MAX_RESULT_ALL_PAGES) {
$queryUrl = AMAZON_SEARCH_URL . $searchMonthUrlParam . $searchYearUrlParam . $searchTypeUrlParam . urlencode($keyword) . '&page=' . $pageNum;
// ...
}
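
The regular-expression approach can be tried in isolation on the markup from the question. This is a minimal sketch: extractTotalResults() is a hypothetical helper name, and the sample string replaces the real HTTP fetch.

```php
<?php
// Sample markup copied from the question; a real page may differ.
$html = <<<'HTML'
<h2 id="resultCount" class="resultCount">
<span>
Showing 1 - 12 of 40,923 Results
</span>
</h2>
HTML;

// Hypothetical helper: returns the total as an int, or null when absent.
function extractTotalResults(string $html): ?int
{
    if (preg_match('/of\s+([0-9,]+)\s+Results/', $html, $matches) !== 1) {
        return null;
    }
    // Drop the thousands separators before casting.
    return (int) str_replace(',', '', $matches[1]);
}

$totalResults = extractTotalResults($html);
var_dump($totalResults);
```

Returning null instead of throwing is only a design choice for the sketch; the caller still has to decide what a missing count means.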

A regex would do it:
...
preg_match("/of ([0-9,]+) Results/", $htmlResultCount[0], $matches);
$europeFormatCount = intval(str_replace(",", "", $matches[1]));
...

Please try this code.
define("MAX_RESULT_ALL_PAGES", 1200);
// new dom object
$dom = new DOMDocument();
// HTML string
$queryUrl = AMAZON_TOTAL_BOOKS_COUNT.$searchMonthUrlParam.$searchYearUrlParam.$searchTypeUrlParam.urlencode($keyword)."&page=".$pageNum;
$html_string = file_get_contents($queryUrl);
// discard whitespace-only text nodes (set before loading)
$dom->preserveWhiteSpace = false;
// load the HTML; @ suppresses warnings from malformed markup
@$dom->loadHTML($html_string);
//Get all h2 tags
$nodes = $dom->getElementsByTagName('h2');
// Store total result count
$totalCount = 0;
// loop over all h2 tags and add up the result counts
foreach ($nodes as $node) {
if ($node->hasAttributes()) {
foreach ($node->attributes as $attribute) {
if ($attribute->name === 'class' && $attribute->value == 'resultCount') {
$inner_html = str_replace(',', '', trim($node->nodeValue));
$inner_html_array = explode(' ', $inner_html);
// The sixth word is the total; add it to the running count
$totalCount += (int) $inner_html_array[5];
}
}
}
}
// If the result count is greater than 1200, do this
if ($totalCount > MAX_RESULT_ALL_PAGES) {
$queryUrl = AMAZON_SEARCH_URL.$searchMonthUrlParam.$searchYearUrlParam.$searchTypeUrlParam.urlencode($keyword)."&page=".$pageNum;
}
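
A shorter variant of the DOM approach, sketched under the assumption that the h2 keeps its id of resultCount: DOMXPath selects the node directly, so no loop over every h2 and its attributes is needed. The sample string stands in for the fetched page.

```php
<?php
// Sample page built from the markup in the question.
$html = '<html><body><h2 id="resultCount" class="resultCount"><span>'
      . ' Showing 1 - 12 of 40,923 Results </span></h2></body></html>';

$dom = new DOMDocument();
// Real-world pages are rarely valid HTML, so collect libxml warnings quietly.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//h2[@id="resultCount"]');

$totalCount = null;
if ($nodes->length > 0) {
    // Take the number that follows "of" and strip its thousands separators.
    $text = trim($nodes->item(0)->textContent);
    if (preg_match('/of\s+([0-9,]+)/', $text, $m) === 1) {
        $totalCount = (int) str_replace(',', '', $m[1]);
    }
}

var_dump($totalCount);
```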

Give this a try:
$match =array();
preg_match('/(?<=of\s)(?:\d{1,3}+(?:,\d{3})*)(?=\sResults)/', $htmlResultCount[0]->plaintext, $match);
$europeFormatCount = str_replace(',','',$match[0]);
The regex reads the number between "of " and " Results"; it matches numbers that use ',' as a thousands separator.

How to introduce white space in following?

The following snippet of PHP code creates $desc all right, but I would like it to insert two (2) blank spaces between every dpItemFeaturesList item it finds as it iterates.
I can't work out what to add, or where, to do this.
function get_description($asin){
$url = 'http://www.amazon.com/gp/aw/d/' . $asin . '?d=f&pd=1';
$data = request_data($url);
$desc = '';
if ($data) {
$dom = new DOMDocument();
@$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
if (preg_match('#dpItemFeaturesList#',$data)){
$k = $xpath->query('//ul[@class="dpItemFeaturesList"]');
foreach ($k as $c => $tot) {
$desc .= $tot->nodeValue;
}
}
}
return $desc;
}

Looking at the code you have shared here, and at the data you are processing (a sample of which I have pasted here), you actually want to collect the text within the <li> child elements of the <ul class="dpItemFeaturesList"> node.
In your original code snippet your XPath is as follows:
'//ul[@class="dpItemFeaturesList"]'
This selects only the <ul> element and not its child elements. Consequently, when you read $tot->nodeValue it concatenates all the text within all its child nodes without spaces (ah ha, the real reason why you want spaces in the first place).
To fix this we should do two things:
1. Select the <li> nodes within the appropriate node. Change the XPath to //ul[@class="dpItemFeaturesList"]/li.
2. In the foreach loop, concatenate 2 non-breakable spaces (because this is HTML) to the $desc variable.
Here $c is the array index.
function get_description($asin){
$url = 'http://www.amazon.com/gp/aw/d/' . $asin . '?d=f&pd=1';
$data = request_data($url);
$desc = '';
if ($data) {
$dom = new DOMDocument();
@$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
if (preg_match('#dpItemFeaturesList#',$data)){
$k = $xpath->query('//ul[@class="dpItemFeaturesList"]/li');
foreach ($k as $c => $tot) {
if ($c > 0) {
$desc .= "&nbsp;&nbsp;";
}
$desc .= $tot->nodeValue;
}
}
}
return $desc;
}
We check for $c > 0 so that no spaces are added before the first node; the separator then appears only between items.
P.S.: Unrelated to your original question. The code for which you shared a link has an undefined variable $timestamp in $date = date("format", $timestamp); on line 116.
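
The idea of selecting the <li> nodes and separating them can also be sketched with an array and implode(), which avoids the index check altogether. The sample markup is hypothetical, and plain spaces are used here; swap in '&nbsp;&nbsp;' if the result is printed as HTML.

```php
<?php
// Hypothetical sample of the markup described above.
$data = '<html><body><ul class="dpItemFeaturesList">'
      . '<li>Hardcover</li><li>320 pages</li><li>English</li>'
      . '</ul></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // tolerate imperfect HTML
$dom->loadHTML($data);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// One array entry per <li>, with surrounding whitespace removed.
$features = [];
foreach ($xpath->query('//ul[@class="dpItemFeaturesList"]/li') as $li) {
    $features[] = trim($li->nodeValue);
}

// Two plain spaces between items, none before the first or after the last.
$desc = implode('  ', $features);

echo $desc, PHP_EOL;
```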

Since you're appending everything to desc, try something like
$desc .= $tot->nodeValue;
$desc .= "<br />";

Try this:
$desc .= $tot->nodeValue.'  ';
and call trim($desc) after the loop to remove the two spaces at the end.
Alternatively, build an array:
$desc = array();
//....
$desc[] = $tot->nodeValue;
and return implode('  ', $desc).

If you need the spaces between items, add them in front of each iteration except the first:
$k = $xpath->query('//ul[@class="dpItemFeaturesList"]');
foreach ($k as $c => $tot) {
$c && $desc .= '  '; # all but first
$desc .= $tot->nodeValue;
}
This expression saves you an if, but it works the same way. It is partly a matter of taste, so an if does the job as well:
$k = $xpath->query('//ul[@class="dpItemFeaturesList"]');
foreach ($k as $c => $tot) {
if($c) $desc .= '  '; # all but first
$desc .= $tot->nodeValue;
}
This works because every integer except zero evaluates to true in PHP.
See the demo.

Need help with foreach and XML

I have the following output (via link) which displays the var_dump of some XML I'm generating:
http://bit.ly/aoA3qY
At the very bottom of the page you will see some output, generated by this code:
foreach ($xml->feed as $entry) {
$title = $entry->title;
$title2 = $entry->entry->title;
}
echo $title;
echo $title2;
For some reason $title2 is only output once, even though there are multiple entries.
I'm using $xml = simplexml_load_string($data); to create the XML.

You re-assign a value to $title and $title2 in each iteration of the foreach loop. After the loop has finished, only the last assigned value is accessible.
Possible alternatives:
// print/use the values within the loop-body
foreach ($xml->feed as $entry) {
$title = $entry->title;
$title2 = $entry->entry->title;
echo $title, ' ', $title2, "\n";
}
// append the values in each iteration to a string
$title = $title2 = '';
foreach ($xml->feed as $entry) {
$title .= $entry->title . ' ';
$title2 .= $entry->entry->title . ' ';
}
echo $title, ' ', $title2, "\n";
// append the values in each iteration to an array
$title = $title2 = array();
foreach ($xml->feed as $entry) {
$title[] = $entry->title;
$title2[] = $entry->entry->title;
}
var_dump($title, $title2);
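
A self-contained sketch of the array variant follows. The feed is hypothetical, since the structure behind the question's var_dump is not shown; it also assumes each feed holds several entry elements, in which case $entry->entry->title reaches only the first one and an inner loop is needed.

```php
<?php
// Hypothetical feed; the structure in the question's var_dump is not shown.
$data = <<<'XML'
<root>
  <feed>
    <title>Feed A</title>
    <entry><title>First</title></entry>
    <entry><title>Second</title></entry>
  </feed>
</root>
XML;

$xml = simplexml_load_string($data);

$entryTitles = [];
foreach ($xml->feed as $feed) {
    // $feed->entry->title would only reach the first <entry>;
    // looping over $feed->entry visits all of them.
    foreach ($feed->entry as $entry) {
        $entryTitles[] = (string) $entry->title; // cast the element to a string
    }
}

print_r($entryTitles);
```

The string cast matters: without it the array holds SimpleXMLElement objects rather than plain text.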

Need some help with XML parsing

The XML feed is located at: http://xml.betclick.com/odds_fr.xml
I need a PHP loop to echo the name of the match, the hour, the bet options and the odds links.
The function should select and display ONLY the matches of the day with streaming="1" and the bet type "Ftb_Mr3".
I'm new to xpath and simplexml.
Thanks in advance.
So far I have:
<?php
$xml_str = file_get_contents("http://xml.betclick.com/odds_fr.xml");
$xml = simplexml_load_string($xml_str);
// need xpath magic
$xml->xpath();
// display
?>

XPath is pretty simple once you get the hang of it.
You basically want to get every match tag with a certain attribute:
//match[@streaming=1]
This works perfectly: it selects every match element anywhere in the document whose streaming attribute equals 1.
I just realised you also want matches with a bet type of "Ftb_Mr3":
//match[@streaming=1]/bets/bet[@code="Ftb_Mr3"]
This returns the bet node, though. We want the match, which we know is the grandparent:
//match[@streaming=1]/bets/bet[@code="Ftb_Mr3"]/../..
The two dots work like they do in file paths, and get the match.
Now, to work this into your sample, just change the final bit to
// need xpath magic
$nodes = $xml->xpath('//match[@streaming=1]/bets/bet[@code="Ftb_Mr3"]/../..');
foreach($nodes as $node) {
echo $node['name'].'<br/>';
}
to print all the match names.
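
The same selection can be tried offline on a small sample. In this sketch the XML excerpt is hypothetical (element and attribute names are assumed from the expressions above and may not match the live feed), and a nested predicate is used as an alternative to climbing back up with "/../..".

```php
<?php
// Hypothetical excerpt; element and attribute names are assumed from the
// XPath expressions above and may not match the live feed.
$xmlStr = <<<'XML'
<sports>
  <sport name="Football">
    <event name="Ligue 1">
      <match name="Lyon - Paris" streaming="1">
        <bets><bet code="Ftb_Mr3" name="Match Result"/></bets>
      </match>
      <match name="Nice - Lille" streaming="0">
        <bets><bet code="Ftb_Mr3" name="Match Result"/></bets>
      </match>
      <match name="Metz - Nancy" streaming="1">
        <bets><bet code="Ftb_Htf" name="Half Time"/></bets>
      </match>
    </event>
  </sport>
</sports>
XML;

$xml = simplexml_load_string($xmlStr);

// The nested predicate keeps <match> as the result node, so there is no
// need to step back up from <bet> to its grandparent.
$matches = $xml->xpath('//match[@streaming="1"][bets/bet[@code="Ftb_Mr3"]]');

$names = [];
foreach ($matches as $match) {
    $names[] = (string) $match['name'];
}

print_r($names);
```

Filtering on "matches of the day" would need an extra condition on the date attribute, which is left out here because its format is not shown in the question.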

I don't know how to work xpath really, but if you want to 'loop it', this should get you started:
<?php
$xml = simplexml_load_file("odds_fr.xml");
foreach ($xml->children() as $child)
{
foreach ($child->children() as $child2)
{
foreach ($child2->children() as $child3)
{
foreach($child3->attributes() as $a => $b)
{
echo $a,'="',$b,"\"<br/>";
}
}
}
}
?>
That gets you to the 'match' tag, which has the 'streaming' attribute. I don't really know what 'matches of the day' are, either, but...
It's basically right out of the W3Schools reference:
http://www.w3schools.com/PHP/php_ref_simplexml.asp

I am using this on a project, scraping Betclic odds with:
<?php
$match_csv = fopen('matches.csv', 'w');
$bet_csv = fopen('bets.csv', 'w');
$xml = simplexml_load_file('http://xml.cdn.betclic.com/odds_en.xml');
$bookmaker = 'Betclick';
foreach ($xml as $sport) {
$sport_name = $sport->attributes()->name;
foreach ($sport as $event) {
$event_name = $event->attributes()->name;
foreach ($event as $match) {
$match_name = $match->attributes()->name;
$match_id = $match->attributes()->id;
$match_start_date_str = str_replace('T', ' ', $match->attributes()->start_date);
$match_start_date = strtotime($match_start_date_str);
if (!empty($match->attributes()->live_id)) {
$match_is_live = 1;
} else {
$match_is_live = 0;
}
if ($match->attributes()->streaming == 1) {
$match_is_running = 1;
} else {
$match_is_running = 0;
}
// pass the fields as an array so values containing commas stay in one column
$match_row = array((string) $match_id, $bookmaker, (string) $sport_name, (string) $event_name, (string) $match_name, $match_start_date, $match_is_live, $match_is_running);
fputcsv($match_csv, $match_row);
foreach ($match as $bets) {
foreach ($bets as $bet) {
$bet_name = $bet->attributes()->name;
foreach ($bet as $choice) {
// team numbers are surrounded by %, we strip them
$choice_name = str_replace('%', '', $choice->attributes()->name);
// get the float value of the odds
$odd = (float)$choice->attributes()->odd;
// build the row as an array so fputcsv can quote fields that contain commas
$bet_row = array((string) $match_id, (string) $bet_name, $choice_name, $odd);
fputcsv($bet_csv, $bet_row);
}
}
}
}
}
}
fclose($match_csv);
fclose($bet_csv);
?>
I then load the CSV files into MySQL. Running it once a minute works great so far.
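
A minimal sketch of how fputcsv() treats a field that itself contains a comma when the row is passed as an array. The row values are hypothetical, and an in-memory stream replaces the CSV file so nothing is written to disk.

```php
<?php
// Write to an in-memory stream so the sketch leaves no files behind.
$csv = fopen('php://memory', 'w+');

// Hypothetical row; note the comma inside the match name.
$row = [42, 'Betclic', 'Football', 'Lyon, OL - Paris', 1];

// Passing an array lets fputcsv() quote fields that contain the delimiter.
fputcsv($csv, $row, ',', '"', '\\');

// Read the written line back to inspect it.
rewind($csv);
$line = stream_get_contents($csv);
fclose($csv);

echo $line;
```

Because the match name is quoted in the output, a CSV reader such as str_getcsv() recovers five fields rather than six.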
