I am trying to retrieve the content of a table (everything under <tbody>) from a URL and show it on my page.
It could also be everything under <table>, as long as <thead>...</thead> is removed.
I have searched many references on this forum but have not been able to get the result I want.
The HTML structure is shown in the image (the actual code is too long to paste here):
[1]: https://i.stack.imgur.com/SgwM1.png
I would appreciate it if you could point me in the right direction.
My sample code:
$url = 'https://xxxxxx.com/tracking/SUA000085003';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$cl = curl_exec($ch);
curl_close($ch);
// Suppress warnings caused by imperfect real-world HTML
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($cl);
$rows = $dom->getElementsByTagName("tr");
foreach ($rows as $row) {
    $cells = $row->getElementsByTagName('td');
    foreach ($cells as $cell) {
        print $cell->nodeValue; // print the cell's text content
        echo "<BR>";
    }
}
The result I got is:
https://xxxxxx.com/tracking/SUA000085003
15 May 202101:35:33
the goods left the warehouse in guangzhou
15 May 202101:35:33
arrived at sorting facility
14 May 202123:35:33
express operation is complete
The URL in the result comes from <table><thead>...</thead>.
I would like to remove this text entirely, or show only the text after the last /, which is SUA000085003 in this example.
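One way to approach this is to query only the cells under <tbody> with DOMXPath, so the header row is never visited. The sketch below is not tested against the real page: the sample markup is an assumption that stands in for the fetched HTML, and it assumes the URL sits in a <td> inside <thead>, as the output above suggests.

```php
<?php
// Sample markup standing in for the fetched page (an assumption, not the real site).
$html = <<<HTML
<table>
  <thead><tr><td>https://xxxxxx.com/tracking/SUA000085003</td></tr></thead>
  <tbody>
    <tr><td>15 May 2021</td><td>the goods left the warehouse in guangzhou</td></tr>
    <tr><td>14 May 2021</td><td>express operation is complete</td></tr>
  </tbody>
</table>
HTML;

libxml_use_internal_errors(true);   // keep warnings about sloppy markup quiet
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Only cells that live under <tbody>; the <thead> row is never visited.
$cells = [];
foreach ($xpath->query('//table/tbody/tr/td') as $cell) {
    $cells[] = trim($cell->nodeValue);
}

// Alternative: keep the header, but show only the part after the last "/".
$header = $xpath->query('//table/thead//td')->item(0);
$trackingId = $header ? basename(trim($header->nodeValue)) : null;

echo implode("<br>\n", $cells), "<br>\n", $trackingId, "\n";
```

If the real page has no explicit <tbody>, the query would need to be adjusted, for example to //table//tr[not(ancestor::thead)]/td.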
Related
There is this website
http://www.oxybet.com/france-vs-iceland/e/5209778/
What I want is to scrape not the full table but only PARTS of it.
For example, I want to display only the rows that include sportingbet, stoiximan and mybet, and I don't need all the columns, only the 1, x and 2 columns. The numbers shown in red must also be scraped as they are, with the red box, or at least be displayed with an asterisk next to them. Can this be done, or do I need to scrape the whole table into a database first and then query the database?
What I have now is this code, borrowed from a similar question on this forum:
<?php
require('simple_html_dom.php');
$html = file_get_html('http://www.oxybet.com/france-vs-iceland/e/5209778/');
$table = $html->find('table', 0);
$rowData = array();
foreach ($table->find('tr') as $row) {
    // initialize array to store the cell data from each row
    $flight = array();
    foreach ($row->find('td') as $cell) {
        // push the cell's text to the array
        $flight[] = $cell->plaintext;
    }
    $rowData[] = $flight;
}
echo '<table>';
foreach ($rowData as $row => $tr) {
    echo '<tr>';
    foreach ($tr as $td) {
        echo '<td>' . $td . '</td>';
    }
    echo '</tr>';
}
echo '</table>';
?>
which returns the full table. What I mainly want is to detect the numbers shown in the red box (in the 1, x and 2 columns) and display an asterisk next to them in my scrape. Secondly, I want to know whether it is possible to scrape specific columns and rows rather than everything. Do I need to use XPath?
Could someone please point me in the right direction? I have spent hours on this, and the manual doesn't explain much: http://simplehtmldom.sourceforge.net/manual.htm
Link is dead. However, you can do this with XPath and reference the cells that you want by their colour and position, among many other ways.
This snippet will give you the general gist; taken from a project I'm working on atm:
function __construct($URL)
{
// make new DOM for nodes
$this->dom = new DOMDocument();
// set error level
libxml_use_internal_errors(true);
// Grab and set HTML Source
$this->HTMLSource = file_get_contents($URL);
// Load HTML into the dom
$this->dom->loadHTML($this->HTMLSource);
// Make xPath queryable
$this->xpath = new DOMXPath($this->dom);
}
function xPathQuery($query){
return $this->xpath->query($query);
}
Then simply pass a query to your DOMXPath, like //tr[1]
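To make that a bit more concrete, here is a rough sketch of filtering rows and columns and flagging highlighted cells with plain DOMXPath. Since the original page is gone, the markup is invented: the bookmaker name in the first cell, the 1 / x / 2 odds in cells 2 to 4, and a class named "red" on highlighted cells are all assumptions to adapt to the real table.

```php
<?php
// Hypothetical markup: the class name "red" and the column order are assumptions.
$html = <<<HTML
<table>
  <tr><td>sportingbet</td><td>1.50</td><td class="red">3.90</td><td>6.00</td><td>extra</td></tr>
  <tr><td>otherbook</td><td>1.55</td><td>3.80</td><td>6.50</td><td>extra</td></tr>
  <tr><td>mybet</td><td class="red">1.60</td><td>3.70</td><td>5.50</td><td>extra</td></tr>
</table>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$wanted = ['sportingbet', 'stoiximan', 'mybet'];
$rows = [];
foreach ($xpath->query('//table//tr') as $tr) {
    // The first cell is assumed to hold the bookmaker name.
    $name = trim($xpath->evaluate('string(td[1])', $tr));
    if (!in_array($name, $wanted, true)) {
        continue;
    }
    $odds = [];
    // Cells 2 to 4 are assumed to be the 1, x and 2 columns.
    foreach ($xpath->query('td[position() >= 2 and position() <= 4]', $tr) as $td) {
        $isRed = strpos(' ' . $td->getAttribute('class') . ' ', ' red ') !== false;
        $odds[] = trim($td->nodeValue) . ($isRed ? '*' : '');
    }
    $rows[$name] = $odds;
}
print_r($rows);
```

No database is needed for this; the filtering happens while walking the rows. If the red box is applied through an inline style rather than a class, the check on the class attribute would have to look at the style attribute instead.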
I have tried to write PHP code to extract the price of an item from an eCommerce website. I created a variable that holds the URL of the item; the code should fetch the item's price and then display it.
Unfortunately, I have tried more than 20 times and I am still not getting the result. I went to my professor, and he said he is really busy and will try to find a solution in 3 days. I don't want to wait 3 days.
Can anyone please help me?
I have been trying to fetch the price of this item
You must try something before coming to Stack Overflow. I hope you won't make this mistake again ;)
Well, enough of my advice. Here is some code I wrote using cURL in PHP. It gets you the amount 40490.
<?php
$ch = curl_init('http://www.flipkart.com/lg-g2-16-gb/p/itmdzuhncfhj9zwt?pid=MOBDZUHGWZ3HMCMF&ref=c35ae3ed-99d5-49d8-ae45-b0d4de3afe41');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$strx=strip_tags(curl_exec($ch));
$str_key="Rs. ";
$end_key=" Inclusive";
$strt=strpos($strx,$str_key);
$end=strpos($strx,$end_key);
echo intval(substr($strx,$strt+strlen($str_key),9));//outputs 40490 (price of the prod)
public function scrapeProductPrice($remote_page_content,$log){
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($remote_page_content);
$xpath = new DOMXPath($dom);
$my_xpath_query = "//table//tr";
$result_rows = $xpath->query($my_xpath_query);
foreach($result_rows as $key => $value) {
$lookUp = strstr($value->nodeValue, PRODUCT_NAME) ? str_split($value->nodeValue, strlen(PRODUCT_NAME)) : 0;
if($lookUp){
return $lookUp[1];
}
}
}
Note: $remote_page_content must hold the fetched HTML of the page (not the URL itself), because loadHTML() parses markup.
I am writing a PHP cron job script that will run once a week.
The main purpose of this script is to get details of all the TED talks available on the TED website (used as an example to make this question easier to follow).
The script takes around 70 minutes to run and goes over 2000 web pages.
My questions are:
1) Is there a better/faster way to get each web page? I'm using the function:
file_get_contents_curl($url)
2) Is it good practice to hold all the talks in an array (which can get pretty big)?
3) Is there a better way in general to get, for example, all TED talk details from a website? What is the best way to "crawl" the TED website to get all the talks?
**I've checked the option of using RSS feeds, but they are missing some details I need.
Thanks
<?php
define("START_ID", 1);
define("STOP_TED_QUERY",20);
define ("VALID_PAGE","TED | Talks");
/**
* this script will run as a cron job and will go over all pages
* on TED http://www.ted.com/talks/view/id/
* from id 1 till there are no more pages
*/
/**
* function get a file using curl (fast)
* @param $url - url which we want to get its content
* @return the data of the file
* @author XXXXX
*/
function file_get_contents_curl($url)
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
//will hold all talks in array
$tedTalks = array();
//id to start the query from
$id=START_ID;
//will indicate when to stop the query because we reached the end of the ids on the TED website
$endOFQuery=0;
//get the time
$time_start = microtime(true);
//start the query on TED website
//if we query 20 pages in a row that do not exist, we stop the queries and assume there are no more
while ($endOFQuery < STOP_TED_QUERY){
//get the page of the talk
$html = file_get_contents_curl("http://www.ted.com/talks/view/id/$id");
//parsing begins here:
$doc = new DOMDocument();
@$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');
//get and display what you need:
$title = $nodes->item(0)->nodeValue;
//check if this a valid page
if (! strcmp ($title , VALID_PAGE ))
//this is a removed TED talk or the end of the query, so raise a flag (if we get enough of these in a row we will stop)
$endOFQuery++;
else {
//this is a valid TED talk get its details
//reset the flag for end of query
$endOFQuery = 0;
//get meta tags
$metas = $doc->getElementsByTagName('meta');
//get the tag we need (keywords)
for ($i = 0; $i < $metas->length; $i++)
{
$meta = $metas->item($i);
if($meta->getAttribute('name') == 'keywords')
$keywords = $meta->getAttribute('content');
}
//create new talk object and populate it
$talk = new Talk();
//set its ted id from ted web site
$talk->setID($id);
//parse the name (name has un-needed char's in the end)
$talk->setName( substr($title, 0, strpos( $title, '|')) );
//parse the String of tags to array
$keywords = explode(",", $keywords);
//remove un-needed items from it
$keywords=array_diff($keywords, array("TED","Talks"));
//add the filters tags to the talk
$talk->setTags($keywords);
//add to the total talks array
$tedTalks[]=$talk;
}
//move to the next ted talk ID to query
$id++;
} //end of the while
$time_end = microtime(true);
$execution_time = ($time_end - $time_start);
echo "this took (sec) : ".$execution_time;
?>
I have put a PHP web crawler example on github.com,
in case someone is looking for how it works:
https://github.com/Nimrod007/TED-talks-details-from-TED.com-and-youtube
I've also published a freemium API on Mashape implementing this script: https://market.mashape.com/bestapi/ted
Enjoy!
Wasn't sure what to call this, so I will quickly elaborate.
I have a screen scraper I am trying to build, using the YQL console. The query provides the user with a choice of XML or JSON. I am targeting the YQL>data>html aspect of the console, and chose XML as my output format.
My YQL Query:
SELECT * FROM html WHERE url="http://google.com"
This will provide you with a readout of the Google.com document tree in XML. Too much output to paste into this post, so just click the link.
My problem comes with traversing the XML tree with PHP to properly display the output from this request. I don't know how to write a foreach statement (or any other statement) that walks the XML output, collects the document tree, and re-displays it for my own needs.
My PHP:
$searchUrl = "google.com";
if(isset($_REQUEST['searchUrl'])) {
$searchUrl = $_REQUEST['searchUrl'];
}
$query = "select * from html where url=\"http://".$searchUrl."\"";
$url = "http://query.yahooapis.com/v1/public/yql";
// Get Subcategory Article Data
$parameterData = "q=".urlencode($query);
$parameterData .= "&diagnostics=true";
// setup CURL
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $parameterData);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_TIMEOUT, 20);
// send
$response = trim(urldecode(curl_exec($ch)));
// parse response
$xmlObjects = @simplexml_load_string($response);
foreach ($xmlObjects->diagnostics as $diagnostics) {
echo "<a href=".$diagnostics->url." target='_blank'>".$diagnostics->url."</a>";
}
foreach ($xmlObjects->results as $result) {
// here is where I would go echo $result->body or something along those lines
}
I suppose I am a bit stumped at this point due to my lack of knowledge to know where to turn next to navigate an XML tree with this type of format. After query>results>body in the XML I am unsure where to turn to collect the remaining objects, and output it into my document in a pre tag or something of that nature.
I would like to provide an input field for users to enter their own domain, and my PHP will submit the query, iterate over the response, and return the Document tree to the user for HTML viewing and debugging.
I am familiar with PHP and XML in the context of iterating a large number of parent elements with the same internal structure like an RSS feed or something of that nature. In this case I am dealing with a dynamic XML tree, with one large response object, and a fluctuating internal structure.
The following code will display the result body as an HTML page:
<?php
// ... the code you posted in the question
// !without the diagnostics output!
// read comments of the answer to know why
?>
<html>
<head>
</head>
<?php
foreach ($xmlObjects->results as $result) {
// asXml() will return the content of body as xml string
echo $result->body->asXml();
break;
}
?>
</html>
Note that as you won't get the <head> element of the page via YQL the output will in most cases look messy.
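If the goal is to inspect the document tree rather than render it, a recursive walk over the SimpleXMLElement nodes copes with an arbitrary, fluctuating structure. This is only a sketch: describeTree() is a made-up helper name, and the inline XML is an assumed stand-in for the query > results > body part of the YQL response.

```php
<?php
// Stand-in for the YQL response (assumed shape: query > results > body).
$response = '<query><results><body><div id="main"><p>Hello</p><a href="/x">link</a></div></body></results></query>';
$xml = simplexml_load_string($response);

// Hypothetical helper: returns one indented line per element in the tree.
function describeTree(SimpleXMLElement $node, int $depth = 0): array
{
    $line = str_repeat('  ', $depth) . $node->getName();
    foreach ($node->attributes() as $name => $value) {
        $line .= " {$name}=\"{$value}\"";
    }
    // Casting to string yields the element's own text, not its descendants' text.
    $text = trim((string) $node);
    if ($text !== '') {
        $line .= ': ' . $text;
    }
    $lines = [$line];
    foreach ($node->children() as $child) {
        $lines = array_merge($lines, describeTree($child, $depth + 1));
    }
    return $lines;
}

$lines = describeTree($xml->results->body);
echo '<pre>', htmlspecialchars(implode("\n", $lines)), '</pre>';
```

Because the function calls itself for every child, it does not need to know the depth or shape of the tree in advance.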
Is it bad practice or will it be slower if I use curl within a foreach loop?
I'm planning on having an autocomplete input field, and the query in the input would be sent to an API call.
I'm getting an id from a certain link (ie: http://api.linke1.com/names)
foreach ($json as $j) {
$id = $j->id; //from http://api.linke1.com/names
$url = "https://api.site/{$id}/photos";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
$output = curl_exec($ch);
curl_close($ch);
$jsonDecode = json_decode($output);
$results = $jsonDecode->results;
foreach($results as $result)
{
$photoURL= $result->photo->url; //from https://api.site/{$id}/photos
}
}
So every time I type in a name, it will go into the foreach, look up an id from http://api.linke1.com/names, and then look up the photo URL from the other link. I want to build up an array, so that eventually I'll have a list of data to output showing information such as name, photo, etc.
Will this slow down dramatically, given that the foreach loop runs for each letter typed into the input field? Is there an easier way?
Thanks!
Initialize the cURL handle and set the options that don't change before the loop, and close it afterwards.
That will speed things up a little.
You can also use curl_multi_*, which can fetch several URLs in parallel.
http://se2.php.net/manual/en/ref.curl.php
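As a rough illustration of both suggestions, the sketch below shows one function that reuses a single handle for sequential requests and one that fetches in parallel with curl_multi_*. The function names fetchSequential() and fetchParallel() are made up for this example, and error handling is left out for brevity.

```php
<?php
// Sequential: one handle, constant options set once, only the URL changes per request.
function fetchSequential(array $urls): array
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $out = [];
    foreach ($urls as $key => $url) {
        curl_setopt($ch, CURLOPT_URL, $url);
        $out[$key] = curl_exec($ch);
    }
    curl_close($ch);
    return $out;
}

// Parallel: one handle per URL, all driven together by a multi handle.
function fetchParallel(array $urls): array
{
    $mh = curl_multi_init();
    $handles = [];
    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($mh, $ch);
        $handles[$key] = $ch;
    }

    // Run the transfers until none are still active.
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running) {
            curl_multi_select($mh); // wait for activity instead of busy-looping
        }
    } while ($running && $status === CURLM_OK);

    $out = [];
    foreach ($handles as $key => $ch) {
        $out[$key] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $out;
}
```

For the autocomplete case, the photo URLs built from each id could be passed to fetchParallel() as an array keyed by id, so the total wait is closer to the slowest single request than to the sum of all of them.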