find a element in html and explode it for stock - php

I want to retrieve an HTML element in a page.
<h2 id="resultCount" class="resultCount">
<span>
Showing 1 - 12 of 40,923 Results
</span>
</h2>
I have to get the total number of results for the test in my php.
For now, I get all that is between the h2 tags and I explode the first time with space.
Then I explode again with the comma to concatenate able to convert numbers results in European format. Once everything's done, I test my number results.
define("MAX_RESULT_ALL_PAGES", 1200);
$queryUrl = AMAZON_TOTAL_BOOKS_COUNT.$searchMonthUrlParam.$searchYearUrlParam.$searchTypeUrlParam.urlencode($keyword)."&page=".$pageNum;
$htmlResultCountPage = file_get_html($queryUrl);
$htmlResultCount = $htmlResultCountPage->find("h2[id=resultCount]");
$resultCountArray = explode(" ", $htmlResultCount[0]);
$explodeCount = explode(',', $resultCountArray[5]);
$europeFormatCount = '';
foreach ($explodeCount as $val) {
$europeFormatCount .= $val;
}
if ($europeFormatCount > MAX_RESULT_ALL_PAGES) {*/
$queryUrl = AMAZON_SEARCH_URL.$searchMonthUrlParam.$searchYearUrlParam.$searchTypeUrlParam.urlencode($keyword)."&page=".$pageNum;
}
At the moment the total number of results is not well recovered and the condition does not happen even when it should.
Someone would have a solution to this problem or any other way?

I would simply fetch the page as a string (not html) and use a regular expression to get the total number of results. The code would look something like this:
define('MAX_RESULT_ALL_PAGES', 1200);
$queryUrl = AMAZON_TOTAL_BOOKS_COUNT . $searchMonthUrlParam . $searchYearUrlParam . $searchTypeUrlParam . urlencode($keyword) . '&page=' . $pageNum;
$queryResult = file_get_contents($queryUrl);
if (preg_match('/of\s+([0-9,]+)\s+Results/', $queryResult, $matches)) {
$totalResults = (int) str_replace(',', '', $matches[1]);
} else {
throw new \RuntimeException('Total number of results not found');
}
if ($totalResults > MAX_RESULT_ALL_PAGES) {
$queryUrl = AMAZON_SEARCH_URL . $searchMonthUrlParam . $searchYearUrlParam . $searchTypeUrlParam . urlencode($keyword) . '&page=' . $pageNum;
// ...
}

A regex would do it:
...
preg_match("/of ([0-9,]+) Results/", $htmlResultCount[0], $matches);
$europeFormatCount = intval(str_replace(",", "", $matches[1]));
...

Please try this code.
define("MAX_RESULT_ALL_PAGES", 1200);
// new dom object
$dom = new DOMDocument();
// HTML string
$queryUrl = AMAZON_TOTAL_BOOKS_COUNT.$searchMonthUrlParam.$searchYearUrlParam.$searchTypeUrlParam.urlencode($keyword)."&page=".$pageNum;
$html_string = file_get_contents($queryUrl);
//load the html
$html = $dom->loadHTML($html_string);
//discard white space
$dom->preserveWhiteSpace = TRUE;
//Get all h2 tags
$nodes = $dom->getElementsByTagName('h2');
// Store total result count
$totalCount = 0;
// loop over the all h2 tags and print result
foreach ($nodes as $node) {
if ($node->hasAttributes()) {
foreach ($node->attributes as $attribute) {
if ($attribute->name === 'class' && $attribute->value == 'resultCount') {
$inner_html = str_replace(',', '', trim($node->nodeValue));
$inner_html_array = explode(' ', $inner_html);
// Print result to the terminal
$totalCount += $inner_html_array[5];
}
}
}
}
// If result count grater than 1200, do this
if ($totalCount > MAX_RESULT_ALL_PAGES) {
$queryUrl = AMAZON_SEARCH_URL.$searchMonthUrlParam.$searchYearUrlParam.$searchTypeUrlParam.urlencode($keyword)."&page=".$pageNum;
}

Give this a try:
$match =array();
preg_match('/(?<=of\s)(?:\d{1,3}+(?:,\d{3})*)(?=\sResults)/', $htmlResultCount, $match);
$europeFormatCount = str_replace(',','',$match[0]);
The RegEx reads the number between "of " and " Results", it matches numbers with ',' seperator.

Related

Check every URL in string to remove links of certain sites

I want to remove URLs of certain sites within a string
I used this:
<?php
$URLContent = '<p>Google</p><p>AnotherSite</p>';
$LinksToRemove = array('google.com', 'yahoo.com', 'msn.com');
$LinksToCheck = in_array('google.com' , $LinksToRemove);
if (strpos($URLContent, $LinksToCheck) !== 0) {
$URLContent = preg_replace('#<a.*?>([^>]*)</a>#i', '$1', $URLContent);
}
echo $URLContent;
?>
In this example, I want to remove URLs of google.com, yahoo.com and msn.com websites only if any of them found in string $URLContent, but keep any other links.
The result of the previous code is:
<p>Google</p><p>AnotherSite</p>
but I want it to be:
<p>Google</p><p>AnotherSite</p>
One solution would be to explode your $URLContent and compare for each value in $LinksToCheck.
It could be like this :
<?php
$URLContent = '<p>Google</p><p>AnotherSite</p>';
$urlList = explode('</p>', $URLContent);
$LinksToRemove = array('google.com', 'yahoo.com', 'msn.com');
$urlFormat = [];
foreach ($urlList as $url) {
foreach ($LinksToRemove as $link) {
if (str_contains($url, $link)) {
$url = '<p>' . ucfirst(str_replace('.com', '', $link)) . '</p>';
break;
}
}
$urlFormat[] = $url;
}
$result = implode('', $urlFormat);

Parse all items with PHP Simple HTML DOM Parser

I'm trying to parse DOM with Simple HTML DOM Parser in PHP. Parsed content are movies, so I want to get every genre there is, but when i run my code I only get the last genre, not all. My code looks like this:
if ($obj) {
foreach($obj as $key => $data) {
$item['url'] = 'http://geo.saitebi.ge/movie/' . $page;
$item['poster'] = 'http://geo.saitebi.ge/web/ka/img/movies/' . $page . '/240x340/' . $page . '.jpg';
$item['geotitle'] = $data->find('div.movie-item-title', 0)->plaintext;
$item['englishtitle'] = $data->find('div.movie-item-title-en', 0)->plaintext;
$item['year'] = $data->find('div.movie-item-year', 0)->plaintext;
foreach($data->find('a.movie-genre-item') as $genre) {
$item['genres'] = $genre->plaintext . ', ';
}
$item['description'] = $data->find('div.movie-desctiption-more', 0)->plaintext;
$item['imdb_rating'] = $data->find('a.imdb_vote', 0)->plaintext;
$item['imdb_id'] = trim(substr($data->find('a.imdb_vote',0)->href, strrpos($data->find('a.imdb_vote',0)->href, '/') + 1));
}
}
As you see I'm getting content as array. Then inside it I run another foreach loop to get all genre items but it only gets last genre item. What is wrong in my code?
You are just overwriting the last set of data each time. You need to set it to blank and then append it using .= each time, like...
$item['genres'] = '';
foreach($data->find('a.movie-genre-item') as $genre) {
$item['genres'] .= $genre->plaintext . ', ';
}
The code are saving the same '$genre->plaintext' in $item with key 'genres'
so the same key replace the values every loop.
foreach($data->find('a.movie-genre-item') as $genre) {
$item['genres'] = $genre->plaintext . ', ';
}
May $item['genres] can be an associative array.. i mean:
foreach($data->find('a.movie-genre-item') as $genre) {
if ( !isset($item[$genre]) ) {
$item[$genre] = array();
}
array_push($item[$genre],$genre->plaintext);
}
Here is genre code which you have to update
// Here is you genre code
$movie_genre='';
foreach($data->find('a.movie-genre-item') as $genre) {
$movie_genre .= $genre->plaintext . ',';
}
// Here you can use rtrim for removing last comma from genre
$item['genres'] = rtrim($movie_genre,',');

How to introduce white space in following?

The following snippet of PHP code creates $desc alright, but I like it to introduce two (2) blank spaces between every dpItemFeatureList found as it goes through its iteration.
I can't seem to garner exactly what or where to add a snippet to do this?
function get_description($asin){
$url = 'http://www.amazon.com/gp/aw/d/' . $asin . '?d=f&pd=1';
$data = request_data($url);
$desc = '';
if ($data) {
$dom = new DOMDocument();
#$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
if (preg_match('#dpItemFeaturesList#',$data)){
$k = $xpath->query('//ul[#class="dpItemFeaturesList"]');
foreach ($k as $c => $tot) {
$desc .= $tot->nodeValue;
}
}
}
return $desc;
Looking at the code you have shared here and consequently having a look at the data that you are processing (a sample of which I have pasted here) you actually want to collect the text within the <li> child elements of the <ul class="dpItemFeaturesList"> node.
In your original code snippet your XPath is as follows:
'//ul[#class="dpItemFeaturesList"]'
This will only select the <ul> element and not the child elements. Consequently when you try to do a $tot->nodeValue it will concatenate all the text within all it's child nodes without spaces (ah ha, the real reason why you want spaces in the first place).
To fix this we should do two things:
Select the <li> nodes within the appropriate node. Change the XPath to //ul[#class="dpItemFeaturesList"]/li.
In the foreach loop concatenate 2 non-breakable spaces (because this is HTML) to the $desc variable.
Here $c is the array index.
function get_description($asin){
$url = 'http://www.amazon.com/gp/aw/d/' . $asin . '?d=f&pd=1';
$data = request_data($url);
$desc = '';
if ($data) {
$dom = new DOMDocument();
#$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
if (preg_match('#dpItemFeaturesList#',$data)){
$k = $xpath->query('//ul[#class="dpItemFeaturesList"]/li');
foreach ($k as $c => $tot) {
if ($c > 0) {
$desc .= " ";
}
$desc .= $tot->nodeValue;
}
}
}
return $desc;
}
We check for $c > 0 so that you will not get extra spaces after the last node in the loop.
P.S.: Unrelated to your original question. The code for which you shared a link has an undefined variable $timestamp in $date = date("format", $timestamp); on line 116.
Since you're appending everything to desc, try something like
$desc .= $tot->nodeValue;
$desc .= "<br />"
try that:
$desc .= $tot->nodeValue.' ';
and trim($desc) after the loop to avoid two spaces at the end.
or, alternatively create an array:
$desc = array();
//....
$desc[] = $tot->nodeValue;
and return implode(' ', $desc)
If you need that between each one, you need to add in front on each iteration but the first:
$k = $xpath->query('//ul[#class="dpItemFeaturesList"]');
foreach ($k as $c => $tot) {
$c && $desc .= ' '; # all but first
$desc .= $tot->nodeValue;
}
This is an expression which saves you an if but it works similar. Maybe a bit of taste so sure, an if can do it as well:
$k = $xpath->query('//ul[#class="dpItemFeaturesList"]');
foreach ($k as $c => $tot) {
if($c) $desc .= ' '; # all but first
$desc .= $tot->nodeValue;
}
This works because every integer number but zero is true in PHP.
See the demo.

Strip tag with class in PHP

So I need to strip the span tags of class tip.
So that would be <span class="tip"> and the corresponding </span>, and everything inside it...
I suspect a regular expression is needed but I terribly suck at this.
Laugh...
<?php
$string = 'April 15, 2003';
$pattern = '/(\w+) (\d+), (\d+)/i';
$replacement = '${1}1,$3';
echo preg_replace($pattern, $replacement, $string);
?>
Gives no error... But
<?php
$str = preg_replace('<span class="tip">.+</span>', "", '<span class="rss-title"></span><span class="rss-link">linkylink</span><span class="rss-id"></span><span class="rss-content"></span><span class=\"rss-newpost\"></span>');
echo $str;
?>
Gives me the error:
Warning: preg_replace() [function.preg-replace]: Unknown modifier '.' in <A FILE> on line 4
previously, the error was at the ); in the 2nd line, but now.... >.>
This is the "proper" method (adapted from this answer).
Input:
<?php
$str = '<div>lol wut <span class="tip">remove!</span><span>don\'t remove!</span></div>';
?>
Code:
<?php
function recurse(&$doc, &$parent) {
if (!$parent->hasChildNodes())
return;
for ($i = 0; $i < $parent->childNodes->length; ) {
$elm = $parent->childNodes->item($i);
if ($elm->nodeName == "span") {
$class = $elm->attributes->getNamedItem("class")->nodeValue;
if (!is_null($class) && $class == "tip") {
$parent->removeChild($elm);
continue;
}
}
recurse($doc, $elm);
$i++;
}
}
// Load in the DOM (remembering that XML requires one root node)
$doc = new DOMDocument();
$doc->loadXML("<document>" . $str . "</document>");
// Iterate the DOM
recurse($doc, $doc->documentElement);
// Output the result
foreach ($doc->childNodes->item(0)->childNodes as $node) {
echo $doc->saveXML($node);
}
?>
Output:
<div>lol wut <span>don't remove!</span></div>
A simple regular expression like:
<span class="tip">.+</span>
Wont work, the issue being that if another span was opened and closed inside the tip span, your regex will terminate with its ending, rather than the tip one. DOM Based tools like the one linked in the comments will really provide a more reliable answer.
As per my comment below, you need to add pattern delimiters when working with regular expressions in PHP.
<?php
$str = preg_replace('\<span class="tip">.+</span>\', "", '<span class="rss-title"></span><span class="rss-link">linkylink</span><span class="rss-id"></span><span class="rss-content"></span><span class=\"rss-newpost\"></span>');
echo $str;
?>
may be moderately more successful. Please take a look at the documentation page for the function in question.
Now without regexp, and without heavy XML parsing:
$html = ' ... <span class="tip"> hello <span id="x"> man </span> </span> ... ';
$tag = '<span class="tip">';
$tag_close = '</span>';
$tag_familly = '<span';
$tag_len = strlen($tag);
$p1 = -1;
$p2 = 0;
while ( ($p2!==false) && (($p1=strpos($html, $tag, $p1+1))!==false) ) {
// the tag is found, now we will search for its corresponding closing tag
$level = 1;
$p2 = $p1;
$continue = true;
while ($continue) {
$p2 = strpos($html, $tag_close, $p2+1);
if ($p2===false) {
// error in the html contents, the analysis cannot continue
echo "ERROR in html contents";
$continue = false;
$p2 = false; // will stop the loop
} else {
$level = $level -1;
$x = substr($html, $p1+$tag_len, $p2-$p1-$tag_len);
$n = substr_count($x, $tag_familly);
if ($level+$n<=0) $continue = false;
}
}
if ($p2!==false) {
// delete the couple of tags, the farest first
$html = substr_replace($html, '', $p2, strlen($tag_close));
$html = substr_replace($html, '', $p1, $tag_len);
}
}

Need some help with XML parsing

The XML feed is located at: http://xml.betclick.com/odds_fr.xml
I need a php loop to echo the name of the match, the hour, and the bets options and the odds links.
The function will select and display ONLY the matchs of the day with streaming="1" and the bets type "Ftb_Mr3".
I'm new to xpath and simplexml.
Thanks in advance.
So far I have:
<?php
$xml_str = file_get_contents("http://xml.betclick.com/odds_fr.xml");
$xml = simplexml_load_string($xml_str);
// need xpath magic
$xml->xpath();
// display
?>
Xpath is pretty simple once you get the hang of it
you basically want to get every match tag with a certain attribute
//match[#streaming=1]
will work pefectly, it gets every match tag from underneath the parent tag with the attribute streaming equal to 1
And i just realised you also want matches with a bets type of "Ftb_Mr3"
//match[#streaming=1]/bets/bet[#code="Ftb_Mr3"]
This will return the bet node though, we want the match, which we know is the grandparent
//match[#streaming=1]/bets/bet[#code="Ftb_Mr3"]/../..
the two dots work like they do in file paths, and gets the match.
now to work this into your sample just change the final bit to
// need xpath magic
$nodes = $xml->xpath('//match[#streaming=1]/bets/bet[#code="Ftb_Mr3"]/../..');
foreach($nodes as $node) {
echo $node['name'].'<br/>';
}
to print all the match names.
I don't know how to work xpath really, but if you want to 'loop it', this should get you started:
<?php
$xml = simplexml_load_file("odds_fr.xml");
foreach ($xml->children() as $child)
{
foreach ($child->children() as $child2)
{
foreach ($child2->children() as $child3)
{
foreach($child3->attributes() as $a => $b)
{
echo $a,'="',$b,"\"</br>";
}
}
}
}
?>
That gets you to the 'match' tag which has the 'streaming' attribute. I don't really know what 'matches of the day' are, either, but...
It's basically right out of the w3c reference:
http://www.w3schools.com/PHP/php_ref_simplexml.asp
I am using this on a project. Scraping Beclic odds with:
<?php
$match_csv = fopen('matches.csv', 'w');
$bet_csv = fopen('bets.csv', 'w');
$xml = simplexml_load_file('http://xml.cdn.betclic.com/odds_en.xml');
$bookmaker = 'Betclick';
foreach ($xml as $sport) {
$sport_name = $sport->attributes()->name;
foreach ($sport as $event) {
$event_name = $event->attributes()->name;
foreach ($event as $match) {
$match_name = $match->attributes()->name;
$match_id = $match->attributes()->id;
$match_start_date_str = str_replace('T', ' ', $match->attributes()->start_date);
$match_start_date = strtotime($match_start_date_str);
if (!empty($match->attributes()->live_id)) {
$match_is_live = 1;
} else {
$match_is_live = 0;
}
if ($match->attributes()->streaming == 1) {
$match_is_running = 1;
} else {
$match_is_running = 0;
}
$match_row = $match_id . ',' . $bookmaker . ',' . $sport_name . ',' . $event_name . ',' . $match_name . ',' . $match_start_date . ',' . $match_is_live . ',' . $match_is_running;
fputcsv($match_csv, explode(',', $match_row));
foreach ($match as $bets) {
foreach ($bets as $bet) {
$bet_name = $bet->attributes()->name;
foreach ($bet as $choice) {
// team numbers are surrounded by %, we strip them
$choice_name = str_replace('%', '', $choice->attributes()->name);
// get the float value of odss
$odd = (float)$choice->attributes()->odd;
// concat the row to be put to csv file
$bet_row = $match_id . ',' . $bet_name . ',' . $choice_name . ',' . $odd;
fputcsv($bet_csv, explode(',', $bet_row));
}
}
}
}
}
}
fclose($match_csv);
fclose($bet_csv);
?>
Then loading the csv files into mysql. Running it once a minute, works great so far.

Categories