Playing Simple PHP Dom Parser In Advance - php

I want to do something more advanced with PHP Simple HTML DOM Parser but can't work out how.
What I want to do: suppose we are getting some URLs like this:
foreach($result->find('a') as $element)
$element->href
Now I want to get the content of each linked page as plain text. Note that I am going to fetch the full content of those pages.
Suppose the script has a variable $sn = 'linkin park song'; is there any way to fetch only the lines from those pages that contain the keywords in $sn?
If that is unclear, here is an example.
Suppose $sn = 'Google 2012 Earning Report'; PHP Simple HTML DOM Parser will get some URL from $element->href, such as:
http://www.dailytech.com/Google+Q2+2012+Earnings+Rise+11+Percent+Deals+with+Falling+Ad+PricesMotorola+Issues/article25221.htm
It will then fetch the plain-text lines from this URL that match $sn:
Google's Q2 2012 earning are looking good despite certain issues ...
Google's latest Earning Report is result of its ability to better target...
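A minimal sketch of the keyword filtering described above, assuming the page text has already been reduced to plain text. The helper name matching_sentences and the rule "at least two keywords per sentence" are assumptions made for illustration, not part of Simple HTML DOM; the sample text is inline so the sketch runs without fetching anything.

```php
<?php
// Hypothetical helper: return the sentences of $text that contain at least
// $minHits distinct keywords from $query (case-insensitive substring match).
function matching_sentences(string $text, string $query, int $minHits = 2): array
{
    // Split the query into unique lowercase keywords.
    $keywords = array_unique(
        preg_split('/\s+/', strtolower(trim($query)), -1, PREG_SPLIT_NO_EMPTY)
    );
    // Naive sentence split: break after ., ! or ? followed by whitespace.
    $sentences = preg_split('/(?<=[.!?])\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);

    $matches = [];
    foreach ($sentences as $sentence) {
        $hits = 0;
        foreach ($keywords as $keyword) {
            if (stripos($sentence, $keyword) !== false) {
                $hits++;
            }
        }
        if ($hits >= $minHits) {
            $matches[] = $sentence;
        }
    }
    return $matches;
}

// Inline sample instead of a live page. With Simple HTML DOM the text would
// come from something like file_get_html($element->href)->plaintext.
$sn = 'Google 2012 Earning Report';
$text = "Google's Q2 2012 earning are looking good despite certain issues. "
      . "The weather was mild this week. "
      . "Google's latest Earning Report is result of its ability to better target ads.";

$lines = matching_sentences($text, $sn);
print_r($lines);
```

Substring matching is crude (it would also count "earnings" as a hit for "earning"), so the threshold and the sentence splitter are the parts most likely to need tuning.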

Related

Getting Specific Data in PHP

How can I get specific data using PHP? In this case I want the text that runs from <span class="s"> up to the first <b> HTML tag. Assume the HTML source is:
Once there was a king <span class="s"> May 3 2009 <b> ABC Some Text </b> Some photo or video</span> but they have...
So I want to get that filtered data in a variable, like $fdata = "May 3 2009";, because "May 3 2009" is the text between <span class="s"> and the first <b> tag.
I will use it with PHP Simple HTML DOM Parser. Any idea or example of how to filter that text and get it into a variable would be a great help. If you find a similar question here, this is not a duplicate: it is more specific.
Use Simple HTML DOM
http://simplehtmldom.sourceforge.net/
Or http://php.net/manual/en/domdocument.loadhtml.php
Or you can use any other library.
If you're using Simple HTML DOM, you'd grab the elements you're targeting like this:
$ret = $html->find('span[class=s]');
This is just a basic sample, but it should get you going in the right direction.
If you need to find a very specific instance, you can use something such as:
$ret = $html->find("#div1", 0)->children(1)->children(1)->children(2)->id;
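For the exact case in the question, here is a sketch using PHP's built-in DOM extension rather than Simple HTML DOM. It assumes the span starts with text, so its first text node is everything before the first child tag; the variable names are only illustrative.

```php
<?php
$source = 'Once there was a king <span class="s"> May 3 2009 <b> ABC Some Text </b> '
        . 'Some photo or video</span> but they have...';

$doc = new DOMDocument();
// Collect libxml warnings quietly; fragments and sloppy markup trigger them.
libxml_use_internal_errors(true);
$doc->loadHTML($source);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// First text node directly inside the span. This assumes the span begins
// with text, so that node ends where the first child element (<b>) starts.
$nodes = $xpath->query('//span[@class="s"]/text()[1]');
$fdata = $nodes->length > 0 ? trim($nodes->item(0)->nodeValue) : null;

echo $fdata, "\n";
```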

How to pull specific content from HTML using PHP? [duplicate]

Possible duplicate: How to parse and process HTML with PHP?
How do I go about pulling specific content from a given live online HTML page?
For example: http://www.gumtree.com/p/for-sale/ovation-semi-acoustic-guitar/93991967
I want to retrieve the text description, the path to the main image and the price only. So basically, I want to retrieve content which is inside specific divs with maybe specific IDs or classes inside a html page.
Pseudocode:
$page = load_html_contents('http://www.gumtr..');
$price = getPrice($page);
$description = getDescription($page);
$title = getTitle($page);
Please note I do not intend to steal any content from gumtree, or anywhere else for that matter, I am just providing an example.
First of all, what you want to do is called web scraping.
Basically, you load the HTML content into one variable and then use regular expressions to search for specific ids, etc.
Search for "web scraping".
HERE is a basic tutorial
THIS book should be useful too.
Something like this would be a good starting point if you wanted tabular output:
$raw = file_get_contents($url) or die('could not select');
$newlines = array("\t", "\n", "\r", "\x20\x20", "\0", "\x0B", "<br/>");
$content = str_replace($newlines, "", html_entity_decode($raw));
// Cut out the part of the page between the two markers
$start = strpos($content, '<some id> ');
$end = strpos($content, '</ending id>');
$table = substr($content, $start, $end - $start);
preg_match_all("|<tr(.*)</tr>|U", $table, $rows);
foreach ($rows[0] as $row) {
    // Skip header rows
    if (strpos($row, '<th') === false) {
        // array to vars
        preg_match_all("|<td(.*)</td>|U", $row, $cells);
        $var1 = strip_tags($cells[0][0]);
        $var2 = strip_tags($cells[0][1]);
        // ...and so on for the remaining cells
    }
}
The tutorial Easy web scraping with PHP recommended by robotrobert is a good place to start; I have made several comments on it. For better performance use cURL, which among other things handles HTTP headers, SSL, cookies, proxies, etc. Cookies in particular are something you must pay attention to.
I just found HTML Parsing and Screen Scraping with the Simple HTML DOM Library. It is more advanced, and it simplifies and speeds up page parsing by using a DOM parser instead of regular expressions, which are hard to master and resource-hungry. I recommend this last one 100%.
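To make the pseudocode from the question concrete, here is a rough sketch with PHP's built-in DOMDocument and DOMXPath. The markup, ids and class names below are invented for illustration and will not match the real Gumtree page; load_xpath and first_value are hypothetical helper names.

```php
<?php
// Hypothetical markup: these ids and classes are made up for illustration.
$page = <<<'HTML'
<html><body>
  <h1 id="ad-title">Ovation semi-acoustic guitar</h1>
  <span class="price">GBP 150</span>
  <img id="main-image" src="/images/guitar.jpg" alt="">
  <div id="description">Good condition, comes with a case.</div>
</body></html>
HTML;

// Parse an HTML string and return an XPath object for querying it.
function load_xpath(string $html): DOMXPath
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // real-world pages are rarely valid
    $doc->loadHTML($html);
    libxml_clear_errors();
    return new DOMXPath($doc);
}

// Return the trimmed value of the first node matching $query, or null.
function first_value(DOMXPath $xpath, string $query): ?string
{
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->nodeValue);
}

$xpath       = load_xpath($page);
$title       = first_value($xpath, '//h1[@id="ad-title"]');
$price       = first_value($xpath, '//span[@class="price"]');
$image       = first_value($xpath, '//img[@id="main-image"]/@src');
$description = first_value($xpath, '//div[@id="description"]');

echo "$title | $price | $image | $description\n";
```

For a live page, the string in $page would come from file_get_contents or cURL instead, and the queries would have to be adapted to whatever the browser inspector shows for that page.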

Get title from link, PHP Simple HTML DOM Parser

I'm trying to get the title from a link that I have already identified; the only part missing is retrieving the title from that link.
To find the right link I use this code:
$html->find('a[href=http://mylink.se]');
But I also want the title from this link. How would I do that?
Assuming you're using PHP Simple HTML DOM Parser, which is not clear from your question, you could do:
$link = $html->find('a[href=http://mylink.se]', 0); // As the OP pointed out in comments, you need to select the first element
$title = $link->title;
http://simplehtmldom.sourceforge.net/manual.htm
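If Simple HTML DOM is not available, roughly the same lookup can be done with PHP's built-in DOM extension. This is only a sketch on an inline sample; the title text and the second link are made up.

```php
<?php
$html = '<p><a href="http://mylink.se" title="My link title">Visit</a> '
      . '<a href="http://other.example">Other</a></p>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// Walk all anchors and read the title attribute of the first one whose
// href matches exactly.
$title = null;
foreach ($doc->getElementsByTagName('a') as $a) {
    if ($a->getAttribute('href') === 'http://mylink.se') {
        $title = $a->getAttribute('title');
        break;
    }
}

echo $title, "\n";
```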

How do I screen scrape a website and get data within div?

How can I screen scrape a website using cURL and show the data within a specific div?
Download the page using cURL (There are a lot of examples in the documentation). Then use a DOM Parser, for example Simple HTML DOM or PHPs DOM to extract the value from the div element.
After downloading with cURL use XPath to select the div and extract the content.
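The two steps could look roughly like this. fetch_page and div_text are hypothetical helper names, the div id is an assumption, and the demo runs on an inline string so that nothing is actually downloaded.

```php
<?php
// Download a page with cURL; returns the body, or null on failure.
function fetch_page(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_TIMEOUT        => 10,   // seconds
    ]);
    $body = curl_exec($ch);
    return is_string($body) ? $body : null;
}

// Return the text of the first <div> with the given id, or null.
// Assumes $id contains no double quotes.
function div_text(string $html, string $id): ?string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query(sprintf('//div[@id="%s"]', $id));
    return $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;
}

// Inline sample so the sketch runs offline. For a live page:
// $html = fetch_page('http://www.example.com/');
$html = '<html><body><div id="intro">Hi</div>'
      . '<div id="content">Hello <b>world</b></div></body></html>';

$text = div_text($html, 'content');
echo $text, "\n";
```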
A possible alternative.
# We will store the web page in a string variable.
var string page
# Read the page into the string variable.
cat "http://www.abczyx.com/path/to/page.ext" > $page
# Output the portion in the third (3rd) instance of "<div...</div>"
stex -r -c "^<div&</div\>^3" $page
This code is in biterscripting. I am using 3 as a sample to extract the third div. If you want to extract the div that contains, say, the string "ABC", use this command syntax:
stex -r -c "^<div&ABC&</div\>^" $page
Take a look at this script http://www.biterscripting.com/helppages/SS_ExtractTable.html . It shows how to extract an element (div, table, frame, etc.) when the elements are nested.
Fetch the website content using a cURL GET request. There's a code sample on the curl_exec manual page.
Use a regular expression to search for the data you need. There's a code sample on the preg_match manual page, but you'll need to read up on regular expressions to build the pattern you need. As Yacoby mentioned, and which I hadn't thought of, a better idea may be to examine the DOM of the HTML page using PHP's SimpleXML or DOM parser.
Output the information you've found with the regex/parser in the HTML of your page (within the required div).

Extract data from website via PHP

I am trying to create a simple alert app for some friends.
Basically I want to be able to extract the "price" and "stock availability" data from a webpage like the following two:
http://www.sparkfun.com/commerce/product_info.php?products_id=5
http://www.sparkfun.com/commerce/product_info.php?products_id=9279
I have built the e-mail and SMS alert part, but now I want to get the quantity and price out of the webpages (those two or any others) so that I can compare the price and quantity available and alert us to place an order if a product falls within some thresholds.
I have tried some regexes (found in tutorials, but I am way too much of a n00b for this) and haven't managed to get this working. Any good tips or examples?
$content = file_get_contents('http://www.sparkfun.com/commerce/product_info.php?products_id=9279');
preg_match('#<tr><th>(.*)</th> <td><b>price</b></td></tr>#', $content, $match);
$price = $match[1];
preg_match('#<input type="hidden" name="quantity_on_hand" value="(.*?)">#', $content, $match);
$in_stock = $match[1];
echo "Price: $price - Availability: $in_stock\n";
It's called screen scraping, in case you need to google for it.
I would suggest that you use a dom parser and xpath expressions instead. Feed the HTML through HtmlTidy first, to ensure that it's valid markup.
For example:
$html = file_get_contents("http://www.example.com");
$html = tidy_repair_string($html);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
// Now query the document:
foreach ($xpath->query('//table[@class="pricing"]//th') as $node) {
    echo $node->nodeValue, "\n";
}
Whatever you do: don't use regular expressions to parse HTML, or bad things will happen. Use a parser instead.
First, this question asks for too much detail. Second, extracting data from a website might not be legitimate. However, I have some hints:
Use Firebug or the Chrome/Safari Inspector to explore the HTML content and find the pattern around the information you are interested in.
Test your regex to see whether it matches. You may need to do this many times (multi-pass parsing/extraction).
Write a client with cURL or, even simpler, use file_get_contents (note that some hosts disable loading URLs with file_get_contents).
Personally, I would rather use Tidy to convert the page to valid XHTML and then use XPath to extract the data, instead of regex. Why? Because HTML is not a regular language, and XPath is very flexible. You can also learn XSLT to transform the result.
Good luck!
You are probably best off loading the HTML code into a DOM parser like this one and searching for the "pricing" table. However, any kind of scraping you do can break whenever they change their page layout, and is probably illegal without their consent.
The best way, though, would be to talk to the people who run the site, and see whether they have alternative, more reliable forms of data delivery (Web services, RSS, or database exports come to mind).
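As a sketch of the threshold check the question is after: the sample markup below is invented (only the quantity_on_hand field name comes from the regex answer above), and should_order with its two limits is a hypothetical rule, not anything the site provides.

```php
<?php
// Hypothetical decision rule: order when the price is at or below a ceiling
// and enough units are in stock.
function should_order(float $price, int $stock, float $maxPrice, int $minStock): bool
{
    return $price <= $maxPrice && $stock >= $minStock;
}

// Invented sample markup standing in for a downloaded product page.
$html = <<<'HTML'
<html><body>
  <table class="pricing"><tr><th>price</th><td>$24.95</td></tr></table>
  <form><input type="hidden" name="quantity_on_hand" value="42"></form>
</body></html>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// Read the first pricing cell and strip everything except digits and the dot.
$priceText = $xpath->query('//table[@class="pricing"]//td')->item(0)->nodeValue;
$price = (float) preg_replace('/[^0-9.]/', '', $priceText);

// Read the stock level from the hidden form field.
$stockAttr = $xpath->query('//input[@name="quantity_on_hand"]/@value')->item(0);
$stock = (int) $stockAttr->nodeValue;

if (should_order($price, $stock, 30.00, 10)) {
    echo "Order now: price $price, $stock in stock\n";
}
```

On a real page the two item(0) calls should be checked for null first, since either query can come back empty when the layout changes.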
The simplest method to extract data from a website: I worked out that all the data I need is inside <h3> tags, so I wrote this.
<?php
include('simple_html_dom.php');
// Create DOM from URL; put your target URL in $page
$page = 'http://facebook4free.com/category/facebookstatus/amazing-facebook-status/';
$html = new simple_html_dom();
// Load the web page into $html for further operations
$html->load_file($page);
// Collect all matching elements
$links = array();
// find('h3') fetches the content of <h3> tags only. Change the selector as required.
foreach ($html->find('h3') as $element)
{
    $links[] = $element;
}
reset($links);
// $out holds each HTML element you searched for within that web page
foreach ($links as $out)
{
    echo $out;
}
?>
