I have tried to create a PHP script to extract the price of items from an eCommerce website. I created a variable where I type in the URL of the item, and the code should fetch the price of the item and then display it.
Unfortunately I have tried more than 20 times, but I am still not getting the result. I went to my professor and he said he is really busy and will try to find the solution in 3 days. I don't want to wait 3 days.
Can anyone please help me?
I have been trying to fetch the price of this item
You must try something yourself before coming to Stack Overflow. I hope you won't make this mistake again ;)
Well, enough of my advice. Here is the code I wrote using cURL in PHP. It gets you the amount 40490.
<?php
$ch = curl_init('http://www.flipkart.com/lg-g2-16-gb/p/itmdzuhncfhj9zwt?pid=MOBDZUHGWZ3HMCMF&ref=c35ae3ed-99d5-49d8-ae45-b0d4de3afe41');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$strx = strip_tags(curl_exec($ch));
curl_close($ch);
// The price sits between "Rs. " and " Inclusive" in the tag-stripped page text
$str_key = "Rs. ";
$end_key = " Inclusive";
$strt = strpos($strx, $str_key) + strlen($str_key);
$end  = strpos($strx, $end_key);
echo intval(substr($strx, $strt, $end - $strt)); // outputs 40490 (price of the product)
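The extraction step is just "find the text between two markers"; the same idea in a language-neutral Python sketch (the sample string is a made-up stand-in for the stripped page text):

```python
def price_between(text, start_key="Rs. ", end_key=" Inclusive"):
    # Locate the substring between the two markers and parse it as an integer
    start = text.find(start_key)
    if start == -1:
        return None
    start += len(start_key)
    end = text.find(end_key, start)
    if end == -1:
        return None
    return int(text[start:end].replace(",", ""))

print(price_between("LG G2 Rs. 40490 Inclusive of all taxes"))  # 40490
```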
public function scrapeProductPrice($remote_page_content, $log) {
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($remote_page_content);
    $xpath = new DOMXPath($dom);
    // Walk every table row and look for the one containing the product name
    $my_xpath_query = "//table//tr";
    $result_rows = $xpath->query($my_xpath_query);
    foreach ($result_rows as $key => $value) {
        $lookUp = strstr($value->nodeValue, PRODUCT_NAME) ? str_split($value->nodeValue, strlen(PRODUCT_NAME)) : 0;
        if ($lookUp) {
            return $lookUp[1];
        }
    }
}
Note: $remote_page_content is the fetched HTML of the page (e.g. the result of a cURL request to the product URL), and PRODUCT_NAME is a constant holding the product's name.
I am working on retrieving a table's content (everything under <tbody>) from a URL into my page.
It can be everything under <table>, but with <thead>...</thead> removed.
I have searched many references in this forum but have not been able to get the result I want.
The HTML structure is as per the image (the actual code is too lengthy to paste here):
[1]: https://i.stack.imgur.com/SgwM1.png
Appreciate it if you can show me the light
Orz
My sample code:
$url = 'https://xxxxxx.com/tracking/SUA000085003';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$cl = curl_exec($ch);
libxml_use_internal_errors(true); // keep malformed-HTML warnings quiet
$dom = new DOMDocument();
$dom->loadHTML($cl);
$rows = $dom->getElementsByTagName("tr");
foreach ($rows as $row) {
    $cells = $row->getElementsByTagName('td');
    foreach ($cells as $cell) {
        print $cell->nodeValue; // print the cell's content
        echo "<BR>";
    }
}
The result I got is:
https://xxxxxx.com/tracking/SUA000085003
15 May 202101:35:33
the goods left the warehouse in guangzhou
15 May 202101:35:33
arrived at sorting facility
14 May 202123:35:33
express operation is complete
The URL in the result comes from under <table><thead>...</thead>.
I would like to remove this text entirely, or only show the text after the last /; SUA000085003 is the expected output for this case.
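Showing only the text after the last / is plain string handling; a minimal sketch in Python (the PHP equivalent would be basename(), or strrpos() plus substr()):

```python
def last_path_segment(url):
    # Drop any trailing slash, then keep what follows the final "/"
    return url.rstrip("/").rsplit("/", 1)[-1]

print(last_path_segment("https://xxxxxx.com/tracking/SUA000085003"))  # SUA000085003
```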
As a part of an assignment I am trying to pull some statistics from the Riot API (JSON data for League of Legends). So far I have managed to find summoner id (user id) based on summoner name, and I have filtered out the id's of said summoner's previous (20) games. However now I can't figure out how to get the right values from the JSON data. So this is when I'll show you my code I guess:
$matchIDs is an array of 20 integers (game IDs)
for ($i = 1; $i <= 1; $i++)
{
    $this_match_data = get_match($matchIDs[$i], $server, $api);
    $processed_data = json_decode($this_match_data, true);
    var_dump($processed_data);
}
As shown above my for loop is set to one, as I'm just focusing on figuring out one before continuing with all 20. The above example is how I got the match IDs and the summoner IDs. I'll add those codes here for comparison:
for ($i = 0; $i <= 19; $i++)
{
    $temp = $data['matches'][$i]['matchId'];
    $matchIDs[$i] = json_decode($temp, true);
}
$data is the variable I get when I pull all the info from the JSON page, it's the same method I use to get $this_match_data in the first code block.
function match_list($summoner_id, $server, $api)
{
    // (the original also built $summoner_enc/$summoner_lower from an undefined
    // $summoner variable; those lines were never used, so they are dropped here)
    $curl = curl_init('https://'.$server.'.api.pvp.net/api/lol/'.$server.'/v2.2/matchlist/by-summoner/'.$summoner_id.'?api_key='.$api);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    $result = curl_exec($curl);
    curl_close($curl);
    return $result;
}
Now to the root of the problem. This is where I put the data I get from the site, so you can see what I am working with. Using the following code I can get the first value in that file, the match ID.
echo $processed_data['matchId'];
But I can't seem to pin down any other information in this .json file. I've tried keys like ['region'] instead of ['matchId'] with no luck, as well as numeric indexes like $processed_data[0], but nothing happens. This is how I got the right info in the first examples, and I am really lost here.
Ok, so I think I've figured it out myself. By adding this to the code I can print out the JSON in a much more human-friendly way, and that should make it much easier to handle the data.
echo "<pre>";
var_dump($processed_data);
echo "</pre>";
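The underlying point is that decoded match JSON is nested: most fields live inside sub-objects and arrays, so a top-level key such as ['region'] only works if it actually sits at the top level. A small Python sketch of stepping through nesting one level at a time (the structure here is an invented miniature, not the real Riot schema):

```python
import json

# Invented miniature of a nested match payload (not the real Riot schema)
raw = '{"matchId": 123, "region": "EUW", "participants": [{"stats": {"kills": 7}}]}'
data = json.loads(raw)

print(data["matchId"])                            # top-level key -> 123
print(data["participants"][0]["stats"]["kills"])  # nested access, step by step -> 7
```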
Hi, I'm attempting to crawl Google search results, just for my own learning, but also to see whether I can speed up getting access to direct URLs (I'm aware of their API, but I just thought I'd try this for now).
It was working fine, but it seems to have stopped; it's simply returning nothing now. I'm unsure if it's something I did, but I can say that I had this in a for loop to let the start parameter increase, and I'm wondering whether that may have caused problems.
Is it possible Google can block an IP from crawling?
Thanks..
$url = "https://www.google.ie/search?q=adrian+de+cleir&start=1&ie=utf-8&oe=utf-8&rls=org.mozilla:en-US:official&client=firefox-a&channel=fflb&gws_rd=cr&ei=D730U7KgGfDT7AbNpoBY#channel=fflb&q=adrian+de+cleir&rls=org.mozilla:en-US:official";
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$html = curl_exec($ch);
curl_close($ch);
# Create a DOM parser object
$dom = new DOMDocument();
# Parse the HTML from Google.
# The @ before the method call suppresses any warnings that
# loadHTML might throw because of invalid HTML in the page
# (a leading # would comment the call out entirely).
@$dom->loadHTML($html);
# Iterate over all the <h3> tags
foreach ($dom->getElementsByTagName('h3') as $link) {
    $actual_link = $link->getElementsByTagName('a');
    foreach ($actual_link as $single_link) {
        # Show the <a href>
        echo '<pre>';
        print_r($single_link->getAttribute('href'));
        echo '</pre>';
    }
}
Given below is the program I have written in Python, though it is not fully complete. Right now it only gets the first page and prints all the href links found in the result.
We can use sets to remove the redundant links from the result set.
import requests
from bs4 import BeautifulSoup

def search_spider(max_pages, search_string):
    page = 0
    search_string = search_string.replace(' ', '+')
    while page <= max_pages:
        url = 'https://www.google.com/search?num=10000&q=' + search_string + '#q=' + search_string + '&start=' + str(page)
        print("URL to search - " + url)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        for link in soup.findAll("a", {"class": ""}):
            href = link.get('href')
            input_string = slice_string(href)
            print(input_string)
        page += 10

def slice_string(input_string):
    # Strip the literal "/url?q=" prefix (lstrip would strip any of those
    # characters, not the prefix), then cut at the first '&'
    if input_string.startswith("/url?q="):
        input_string = input_string[len("/url?q="):]
    index_c = input_string.find('&')
    if index_c != -1:
        input_string = input_string[:index_c]
    return input_string

search_spider(1, "bangalore cabs")
This program will search for bangalore cabs in google.
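As an aside, the prefix handling in slice_string is fragile: lstrip("/url?q=") strips any of those characters, not the literal prefix. The standard library's URL parser does the same job more robustly; a sketch assuming hrefs of the /url?q=<target>&... shape:

```python
from urllib.parse import urlparse, parse_qs

def extract_target(href):
    # Google result hrefs look like "/url?q=<target>&sa=..."; pull out the q parameter,
    # falling back to the raw href when there is no q parameter
    params = parse_qs(urlparse(href).query)
    return params.get("q", [href])[0]

print(extract_target("/url?q=https://example.com/cabs&sa=U&ved=abc"))  # https://example.com/cabs
```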
Thanks,
Karan
You can check whether Google has blocked you with the following simple curl command:
curl -sSLA Mozilla "http://www.google.com/search?q=linux" | html2text -width 80
You may need to install html2text to convert the HTML into plain text.
Normally you should use the Custom Search API provided by Google to avoid such limitations; it lets you retrieve search results more easily and in different formats (such as XML or JSON).
I am writing a PHP cron job script that will run once a week.
The main purpose of this script is to get details of all TED talks that are available on the TED web site (as an example, to make this question more understandable).
The script takes around 70 minutes to run and goes over 2000 web pages.
My questions are:
1) Is there a better / faster way to get each web page? I'm using the function:
file_get_contents_curl($url)
2) Is it good practice to hold all the talks in an array (which can get pretty big)?
3) Is there a better way in general to get, for example, all TED talk details from a web site? What is the best way to "crawl" the TED website to get all the talks?
**I've checked the option of using RSS feeds, but they are missing some details I need.
Thanks
<?php
define("START_ID", 1);
define("STOP_TED_QUERY", 20);
define("VALID_PAGE", "TED | Talks");

/**
 * this script will run as a cron job and will go over all pages
 * on TED http://www.ted.com/talks/view/id/
 * from id 1 till there are no more pages
 */

/**
 * function to get a file using curl (fast)
 * @param $url - url whose content we want to get
 * @return the data of the file
 * @author XXXXX
 */
function file_get_contents_curl($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}

//will hold all talks in an array
$tedTalks = array();
//id to start the query from
$id = START_ID;
//will indicate when to stop the query because we reached the last IDs on the TED website
$endOFQuery = 0;
//get the time
$time_start = microtime(true);

//start the query on the TED website
//if we query 20 pages in a row that do not exist, we stop and assume there are no more
while ($endOFQuery < STOP_TED_QUERY) {
    //get the page of the talk
    $html = file_get_contents_curl("http://www.ted.com/talks/view/id/$id");
    //parsing begins here (@ suppresses warnings from invalid HTML):
    $doc = new DOMDocument();
    @$doc->loadHTML($html);
    $nodes = $doc->getElementsByTagName('title');
    //get and display what you need:
    $title = $nodes->item(0)->nodeValue;
    //check if this is a valid page
    if (!strcmp($title, VALID_PAGE)) {
        //this is a removed TED talk or the end of the query, so raise a flag (enough of these in a row and we stop)
        $endOFQuery++;
    } else {
        //this is a valid TED talk; get its details
        //reset the flag for end of query
        $endOFQuery = 0;
        //get meta tags
        $metas = $doc->getElementsByTagName('meta');
        //get the tag we need (keywords)
        for ($i = 0; $i < $metas->length; $i++) {
            $meta = $metas->item($i);
            if ($meta->getAttribute('name') == 'keywords')
                $keywords = $meta->getAttribute('content');
        }
        //create a new talk object and populate it
        $talk = new Talk();
        //set its TED id from the TED web site
        $talk->setID($id);
        //parse the name (the title has un-needed chars at the end)
        $talk->setName(substr($title, 0, strpos($title, '|')));
        //split the string of tags into an array
        $keywords = explode(",", $keywords);
        //remove un-needed items from it
        $keywords = array_diff($keywords, array("TED", "Talks"));
        //add the filtered tags to the talk
        $talk->setTags($keywords);
        //add to the total talks array
        $tedTalks[] = $talk;
    }
    //move to the next TED talk ID to query
    $id++;
} //end of the while

$time_end = microtime(true);
$execution_time = ($time_end - $time_start);
echo "this took (sec) : " . $execution_time;
?>
I've published a web crawler PHP example on GitHub, if someone is looking for how it works:
https://github.com/Nimrod007/TED-talks-details-from-TED.com-and-youtube
I've also published a freemium API on Mashape implementing this script: https://market.mashape.com/bestapi/ted
Enjoy!
I'm developing a Twitter app and have a problem I cannot resolve. Could you help me, please?
The app is for a promotion for a brand. We need to count every tweet using a hashtag and give the author of tweet #50000 a prize. How can we take that data from the Twitter API and identify tweet #50000? Thanks for your help!
We use PHP and MySQL.
I would start by looking into Phirehose, which will allow you to obtain the tweets. You can also use the Ruby Twitter gem, which is fairly well documented and seems easy to use if you are comfortable with Ruby.
This is PHP source code to get the tweet count for a Twitter hashtag (#):
<?php
global $total, $hashtag;
//$hashtag = '#supportvisitbogor2011';
$hashtag = '#australialovesjustin';
$total = 0;

function getTweets($hash_tag, $page) {
    global $total, $hashtag;
    $url = 'http://search.twitter.com/search.json?q=' . urlencode($hash_tag) . '&';
    $url .= 'page=' . $page;
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    $json = curl_exec($ch);
    curl_close($ch);
    //echo "<pre>";
    //$json_decode = json_decode($json);
    //print_r($json_decode->results);
    $json_decode = json_decode($json);
    $total += count($json_decode->results);
    //if there is a next page of results, recurse with its page number
    if ($json_decode->next_page) {
        $temp = explode("&", $json_decode->next_page);
        $p = explode("=", $temp[0]);
        getTweets($hashtag, $p[1]);
    }
}

getTweets($hashtag, 1);
echo $total;
?>
Thanks..
I was looking this up last night. You can request the URL, i.e. http://search.twitter.com/search.json?q=%23hashtag
(Here's the docs page: http://dev.twitter.com/doc/get/search)
And in, say, a 5-minute cron script, keep a record of the last tweet ID you got and pass it to the search URL's since_id parameter, while also keeping a count of how many tweets you have counted, optionally storing each tweet in a table for reference... that's my 2 cents.
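The cron idea above boils down to: fetch the tweets newer than the last seen ID, add them to a running count, and remember the newest ID for the next run. A sketch of just that bookkeeping, in Python for neutrality (the tweet dicts and the state shape are invented for illustration; the real fetch call is not shown):

```python
def process_batch(tweets, state):
    # `tweets` is a list of dicts with an "id" key (one polled batch);
    # `state` carries {"count": ..., "since_id": ...} between cron runs.
    for tweet in tweets:
        state["count"] += 1
        if tweet["id"] > state["since_id"]:
            state["since_id"] = tweet["id"]  # pass this as since_id next run
        if state["count"] == 50000:
            print("Winning tweet:", tweet["id"])
    return state

state = {"count": 0, "since_id": 0}
state = process_batch([{"id": 101}, {"id": 102}], state)
print(state)  # {'count': 2, 'since_id': 102}
```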