How does Facebook's link feature tear down a page? (PHP)

So when a user pastes a link into a Facebook status, Facebook fires off a call to get the details of that page.
What I'm wondering is whether anyone has any similar functions to tear apart a page?
Having thought about it, getting the title is just a matter of matching a regular expression.
It then usually gets an array of images, which is also fairly easy to do with a regular expression, maybe filtering out images that are too small.
I'm a little baffled by how it figures out which bit of text is relevant, though. Any ideas?

Perhaps looking at an article extractor like Goose might help?

It's worth mentioning that since the introduction of Open Graph support, Facebook saves a great deal of time and server load when parsing (scraping) pages that use the protocol.
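For reference, the Open Graph data is just a handful of <meta> tags in the page's <head>. A minimal example (the property names are the standard ones; the values here are made up):
<meta property="og:title" content="Ocean's Eleven" />
<meta property="og:type" content="video.movie" />
<meta property="og:url" content="http://www.rottentomatoes.com/m/oceans_eleven/" />
<meta property="og:image" content="http://example.com/oceans_eleven_poster.jpg" />
<meta property="og:description" content="A short synopsis shown in the link preview." />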
Check out the PHP implementation for more info, and here's a small example using one of the libraries (OpenGraphNode in PHP):
include "OpenGraphNode.php";
# Fetch and parse a URL
#
$page = "http://www.rottentomatoes.com/m/oceans_eleven/";
$node = new OpenGraphNode($page);
# Retrieve the title
#
print $node->title . "\n"; # like this
print $node->title() . "\n"; # or with parentheses
# And obviously the above works for other Open Graph Protocol
# properties like "image", "description", etc. For properties
# that contain a hyphen, you'll need to use underscore instead:
#
print $node->street_address . "\n";
# OpenGraphNode uses PHP5's Iterator feature, so you can
# loop through it like an array.
#
foreach ($node as $key => $value) {
print "$key => $value\n";
}

Regular expressions are bad for parsing HTML because of its nested structure; you will want to use the DOMDocument class.
http://www.php.net/manual/en/class.domdocument.php
This will parse the page source into a DOM tree. You should be able to figure out how to get the relevant details using XPath queries fairly easily.
You may also want to take a look at the PHP function get_meta_tags().
http://www.php.net/manual/en/function.get-meta-tags.php
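For instance, here's a minimal sketch combining both suggestions (the URL is a placeholder):
<?php
// Parse the fetched page into a DOM tree and query it with XPath
$html = file_get_contents('http://example.com/article');
$doc = new DOMDocument();
@$doc->loadHTML($html); // @ suppresses warnings on malformed HTML
$xpath = new DOMXPath($doc);
$title = $xpath->query('//title')->item(0)->nodeValue;

// get_meta_tags() fetches the page and returns its meta tags as an array
$meta = get_meta_tags('http://example.com/article');
$description = isset($meta['description']) ? $meta['description'] : '';

echo $title, "\n", $description, "\n";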

Related

Using PHP to extract specific data from websites

I am new to PHP and I'm looking to extract data like inventory quantity and sizes from different websites. I'm a bit confused about how I would go about doing this. Would DOMDocument be the way to go?
Not sure if that is the best method for this.
I was attempting it from lines 164-174 on here.
Any help is greatly appreciated!
EDIT - this is my updated code. I don't really think it's the most efficient way to do things, though.
<html>
<body>
<?php
$url = 'https://kithnyc.com/collections/adidas/products/kith-x-adidas-consortium-response-trail-boost?variant=35276776455';
$html = file_get_contents($url);

//preg_match('~itemprop="image"\scontent="(\w+.\w+.\w+.\w+.\w+.\w+)~', $html, $image);
//$image = $image[1];
preg_match('~,"title":"(\w+.\w+.\w+.\w+.\w+.\w+)~', $html, $title);
$title = $title[1];
preg_match_all('~{"id":(\d+)~', $html, $id);
$id = $id[1];
preg_match_all('~","public_title":"(\d+..)~', $html, $size);
$size = $size[1];
preg_match_all('~inventory_quantity":(\d+)~', $html, $quantity);
$quantity = $quantity[1];

function plain_url_to_link($url) {
    return preg_replace(
        '%(https?|ftp)://([-A-Z0-9./_*?&;=#]+)%i',
        '<a target="_blank" rel="nofollow" href="$0">$0</a>', $url);
}

echo "$title<br />";
echo "<br />";
//echo $image;
echo plain_url_to_link($url);
echo "<br />";
echo "<br />";

$j = 2;
for ($i = 0; $i < 18; $i++) {
    print "Size: $size[$i] --- Quantity: $quantity[$i] --- ID: $id[$j]";
    $j++;
    echo "<br />";
}
echo "<br />";
//print_r($quantity);
?>
</body>
</html>
As a general rule of thumb, you must avoid parsing HTML/XML content with regular expressions. Here's why:
Entire HTML parsing is not possible with regular expressions, since it depends on matching the opening and the closing tag which is not possible with regexps.
Regular expressions can only match regular languages but HTML is a context-free language. The only thing you can do with regexps on HTML is heuristics but that will not work on every condition. It should be possible to present a HTML file that will be matched wrongly by any regular expression.
— https://stackoverflow.com/a/590789/65732
Use a DOM parser instead which is specifically designed for the purpose of parsing HTML/XML documents. Here's an example:
# Installing Symfony's dom parser using Composer
composer require symfony/dom-crawler symfony/css-selector
<?php
require 'vendor/autoload.php';
use Symfony\Component\DomCrawler\Crawler;
$html = file_get_contents('https://kithnyc.com/collections/footwear/products/kith-x-adidas-consortium-response-trail-boost?variant=35276776455');
$crawler = new Crawler($html);
$price = $crawler->filter('.product-header-title[itemprop="price"]')->text();
// UPDATE: this does not work, as the page updates the button text
// later with javascript. Read on for another solution.
$in_stock = $crawler->filter('#AddToCartText')->text();
if ($in_stock == 'Sold Out') {
    $in_stock = 0; // or `false`, if you will
}
echo "Price: $price - Availability: $in_stock";
// Outputs:
// Price: $220.00 - Availability: Buy Now
// We'll fix "Availability" later...
Using such parsers, you have the ability to extract elements using XPath as well.
But if you want to parse the JavaScript code included in that page, you'd better use a browser emulator like Selenium. Then you have programmatic access to all the globally available JavaScript vars/functions in that page.
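For illustration only (untested against this page), driving a real browser with Facebook's PHP WebDriver, which is covered in the list below, might look roughly like this, assuming a Selenium server listening on localhost:4444:
<?php
require 'vendor/autoload.php';

use Facebook\WebDriver\Remote\RemoteWebDriver;
use Facebook\WebDriver\Remote\DesiredCapabilities;
use Facebook\WebDriver\WebDriverBy;

// Connect to a locally running Selenium server and load the page
$driver = RemoteWebDriver::create('http://localhost:4444/wd/hub', DesiredCapabilities::chrome());
$driver->get('https://kithnyc.com/collections/footwear/products/kith-x-adidas-consortium-response-trail-boost');

// The text read here is the *rendered* value, after javascript has run
$in_stock = $driver->findElement(WebDriverBy::id('AddToCartText'))->getText();

// You can also evaluate javascript directly in the page's context
$title = $driver->executeScript('return document.title;');

$driver->quit();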
Update
Getting the price
So you were getting this error running the above code:
PHP Fatal error:
Uncaught Symfony\Component\CssSelector\Exception\SyntaxErrorException: Expected identifier, but found.
That's because the target page uses an invalid class name for the price element (.-price), and Symfony's CSS selector component cannot parse it, hence the exception. Here's the element:
<span id="ProductPrice" class="product-header-title -price" itemprop="price" content="220">$220.00</span>
To work around it, let's use the itemprop attribute instead. Here's a selector that can match it:
.product-header-title[itemprop="price"]
I updated the above code accordingly to reflect it. I tested it and it's working for the price part.
Getting the stock status
Now that I've actually tested the code, I see that the stock status of products is set later using javascript. It's not there when you fetch the page using file_get_contents(). You can see it for yourself: refresh the page, the button appears as Buy Now, then a second later it changes to Sold Out.
But fortunately, the quantity of the product variant is buried deep somewhere in the page. Here's a pretty printed copy of the huge object Shopify uses to render the product pages.
So now the problem is parsing JavaScript code with PHP. There are a few general approaches to tackle the problem.
Feel free to skip these approaches as they are not specific to your problem; jump straight to number 6 if you just want a solution to your question.
1. The most reliable and common approach to scraping data from such sites (ones that heavily rely on JavaScript) is to use a browser emulator like Selenium, which is able to execute JavaScript code. Have a look at Facebook's PHP WebDriver package, which is the most sophisticated PHP binding for Selenium WebDriver available. It provides you with an API to remotely control web browsers and execute JavaScript against them. Also see Behat's Mink, which comes with various drivers for both headless browsers and full-fledged browser controllers. The drivers include Goutte, BrowserKit, Selenium1/2, Zombie.js, Sahi and WUnit.
2. See V8Js, the PHP extension, which embeds the V8 JavaScript engine into PHP. It allows you to evaluate JavaScript code right from your PHP script. It's a little bit overkill to install a PHP extension if you're not heavily using the feature, but if you want to extract the relevant script using the DOM parser:
   $script = $crawler->filterXPath('//head/following-sibling::script[2]')->text();
3. Use HtmlUnit to parse the page and then feed the final HTML to PHP. You're going to need a small Java wrapper. Right, overkill for your case.
4. Extract the JavaScript code and parse it using a JS parser/tokenizer library like hiltonjanfield/js4php5 or squizlabs/PHP_CodeSniffer, which has a JS tokenizer.
5. In case the application is making ajax calls to manipulate the DOM, you might be able to re-dispatch those requests and parse the responses for your own application's sake. An example is the ajax call this page makes to cart.js to retrieve data related to the cart items. But that's not the case for reading the product variant quantity here.
6. You may recall that I said it's a bad idea to use regular expressions to parse entire HTML/XML documents, but it's OK to use them partially, to extract strings from an HTML/XML document when other approaches are even harder. Read the SO answer I quoted at the top of this post if you have any confusion about when to use them. This approach matches the inventory_quantity of the product variant by running a simple regex against the whole page source (or, for better performance, only against the relevant script tag):
<?php
require 'vendor/autoload.php';
use Symfony\Component\DomCrawler\Crawler;
$html = file_get_contents('https://kithnyc.com/collections/footwear/products/kith-x-adidas-consortium-response-trail-boost?variant=35276776455');
$crawler = new Crawler($html);
$price = trim($crawler->filter('.product-header-title[itemprop="price"]')->text());
preg_match('/35276776455,.+?inventory_quantity":(\d+)/', $html, $in_stock);
$in_stock = $in_stock[1];
echo "Price: $price - Availability: $in_stock";
// Outputs:
// Price: $220.00 - Availability: 0
This regex needs a variant ID (35276776455 in this case) to work, as the quantity comes per product variant. You can extract it from the URL's query string: ?variant=35276776455.
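If you want to do that extraction in code as well, here's a small sketch using PHP's parse_url() and parse_str() built-ins:
<?php
// Pull the "variant" parameter out of the product URL's query string
$url = 'https://kithnyc.com/collections/footwear/products/kith-x-adidas-consortium-response-trail-boost?variant=35276776455';
parse_str(parse_url($url, PHP_URL_QUERY), $params);
$variant_id = $params['variant']; // "35276776455"

// ...and interpolate it into the regex used above
$pattern = '/' . preg_quote($variant_id, '/') . ',.+?inventory_quantity":(\d+)/';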
Now that we're done with the stock status, and we've done it with regex, you might want to do the same with the price and drop the DOM parser dependency:
<?php
$html = file_get_contents('https://kithnyc.com/collections/footwear/products/kith-x-adidas-consortium-response-trail-boost?variant=35276776455');
// You need to check if it's matched before assigning
// $price[1]. Anyway, this is just an example.
preg_match('/itemprop="price".+?>\s*\$(.+?)\s*<\/span>/s', $html, $price);
$price = $price[1];
preg_match('/35276776455,.+?inventory_quantity":(\d+)/', $html, $in_stock);
$in_stock = $in_stock[1];
echo "Price: $price - Availability: $in_stock";
// Outputs:
// Price: $220.00 - Availability: 0
Conclusion
Even though I still believe it's a bad idea to parse HTML/XML documents with regex, I must admit that the available DOM parsers are not able to parse embedded JavaScript code (and probably never will be), which is your case. We can partially utilize regular expressions to extract strings from HTML/XML: the parts which are not parsable using DOM parsers. So, all in all:
Use DOM parsers to parse/scrape the HTML code that initially exists in the page.
Intercept ajax calls that may include information you want; re-issue them in a separate HTTP request to get the data.
Use browser emulators for parsing/scraping JS-heavy sites that populate their pages using ajax calls and the like.
Partially use regex to extract what is not extractable using DOM parsers.
If you just want these two fields, you're fine going with regex. Otherwise, consider the other approaches.

Performance issue / too many calls to string manipulation functions

This question is about optimizing a part of a program that I include in many projects as a common tool.
This 'template parser' is designed to take a kind of text pattern containing HTML code (or anything else) with several specific tags, and to replace those with developer-supplied values when rendered.
The few classes involved do a great job and work as expected; when needed, they make it possible to isolate design elements and easily adapt/replace design blocks.
The patterns I use look like this (nothing exceptional, I admit):
<table class="{class}" id="{id}">
<block_row>
<tr>
<block_cell>
<td>{content}</td>
</block_cell>
</tr>
</block_row>
</table>
(The example code below consists of adapted extracts.)
The parsing does things like this:
// Variables are sorted by position in the pattern string.
// Positions are read once and cached to avoid multiple
// calls to strpos or str_replace.
foreach ($this->aVars as $oVar) {
    $sString = substr($sString, 0, $oVar->start) .
               $oVar->value .
               substr($sString, $oVar->end);
}

// Once the pattern is loaded, blocks look like --¤(<block_name>)¤--
foreach ($this->aBlocks as $sName => $oBlock) {
    $sBlockData = $oBlock->parse();
    $sString = str_replace('--¤(' . $sName . ')¤--', $sBlockData, $sString);
}
Using the class instance, I call methods like addBlock() or setVar() to fill my pattern with data.
This system has several disadvantages, among them the multiple objects in memory (one for each block instance) and the many calls to string manipulation functions during the parsing process (preg_replace in the past, now just a bunch of substr and friends).
The program I'm working on makes heavy use of these templates, and they are just about to show their limits.
My question is the following (no need for code, just ideas or a lead to follow):
Should I accept that I've overused this tool and try to arrange things so that I don't need to call these templates so much (for instance by improving caching, or by using only simple view scripts)?
Or do you know a technical solution for feeding a structure with data that would not be the mad resource consumer I wrote? While writing this I'm thinking about XSLT: would it be suitable, and if so, could it improve performance?
Thanks in advance for your advice.
Use the Xdebug extension to profile your code and find out exactly which parts of the code are taking the most time.
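For example, a minimal sketch of enabling the profiler from the command line (these settings are for Xdebug 2; in Xdebug 3 the equivalent is xdebug.mode=profile, and the script name here is a placeholder):
php -d xdebug.profiler_enable=1 \
    -d xdebug.profiler_output_dir=/tmp \
    render_templates_benchmark.php
This writes a cachegrind.out.* file to /tmp, which you can open in a viewer such as KCachegrind or Webgrind to see exactly which calls dominate the run time.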

Alternative to php preg_match to pull data from an external website?

I want to extract the content of a specific div in an external webpage; the div looks like this:
<dt>Win rate</dt><dd><div>50%</div></dd>
My target is the "50%". I'm currently using this PHP code to extract the content:
function getvalue($parameter, $content) {
    preg_match($parameter, $content, $match);
    return $match[1];
}
$parameter = '#<dt>Win rate</dt><dd><div>(.*)</div></dd>#';
$content = file_get_contents('https://somewebpage.com');
Everything works fine; the problem is that this method takes too much time, especially if I have to use it several times with different $content values.
I would like to know if there's a better (faster, simpler, etc.) way to accomplish the same thing. Thanks!
You may use DOMDocument::loadHTML and navigate your way to the given node.
$content = file_get_contents('https://somewebpage.com');
$doc = new DOMDocument();
$doc->loadHTML($content);
Now, to get to the desired node, you may use the method DOMDocument::getElementsByTagName, e.g.
$dds = $doc->getElementsByTagName('dd');
foreach ($dds as $dd) {
    // process each <dd> element here: extract its inner <div> and that div's content
    $div = $dd->getElementsByTagName('div')->item(0);
    if ($div !== null) {
        echo $div->nodeValue, "\n"; // e.g. "50%"
    }
}
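An alternative sketch using DOMXPath targets the value directly; the query assumes the exact <dt>Win rate</dt><dd><div>...</div></dd> markup from the question:
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//dt[.="Win rate"]/following-sibling::dd[1]/div');
if ($nodes->length > 0) {
    echo $nodes->item(0)->nodeValue; // "50%"
}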
Edit: I see the point #pebbl has made about DOMDocument being slower. Indeed it is; however, parsing HTML with preg_match is asking for trouble. In that case, I'd also recommend looking at an event-driven SAX XML parser. It is much more lightweight, faster, and less memory-intensive, as it does not build a tree. You may take a look at XML_HTMLSax for such a parser.
There are basically three main things you can do to improve the speed of your code:
Off load the external page load to another time (i.e. use cron)
On a Linux-based server I would know what to suggest, but seeing as you use Windows I'm not sure what the equivalent would be. Cron on Linux allows you to fire off scripts at scheduled times, in the background, without a browser involved. Basically I would recommend that you create a script whose sole purpose is to go and fetch the website pages at a particular time offset (depending on how frequently you need to update your data) and then write those webpages to files on your local system.
$listOfSites = array(
    'http://www.something.com/page.htm',
    'http://www.something-else.co.uk/index.php',
);
$dirToContainSites = getcwd() . '/sites';
foreach ($listOfSites as $site) {
    $content = file_get_contents($site);
    /// i've just simply converted the URL into a filename here; there are
    /// better ways of handling this, but this at least keeps things simple.
    /// the following just converts any non-letter or non-number into an
    /// underscore... so, http___www_something_com_page_htm
    $file_name = preg_replace('/[^a-z0-9]/i', '_', $site);
    file_put_contents($dirToContainSites . '/' . $file_name, $content);
}
Once you've created this script, you then need to set the server up to execute it as regularly as you need. Then you can modify your front-end script that displays the stats to read from local files, this would give a significant speed increase.
You can find out how to read files from a directory here:
http://uk.php.net/manual/en/function.dir.php
Or the simpler method (though prone to possible problems) is just to walk your array of sites again, convert the URLs to file names using the preg_replace above, and then check for the file's existence in the folder.
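In code, that simpler method might look something like this (a sketch reusing the variables from the script above):
foreach ($listOfSites as $site) {
    $file_name = preg_replace('/[^a-z0-9]/i', '_', $site);
    $path = $dirToContainSites . '/' . $file_name;
    if (file_exists($path)) {
        $content = file_get_contents($path);
        /// ... parse $content as before ...
    }
}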
Cache the result of calculating your statistics
It's quite likely, this being a stats page, that you'll want to visit it fairly frequently (not as frequently as a public page, but still). If the same page is visited more often than the cron-based script is executed, then there is no reason to do all the calculation again. So basically, all you have to do to cache your output is something similar to the following:
$cachedVersion = getcwd() . '/cached/stats.html';
/// check to see if there is a cached version of this page
if (file_exists($cachedVersion)) {
    /// if so, load it and echo it to the browser
    echo file_get_contents($cachedVersion);
}
else {
    /// start output buffering so we can catch what we send to the browser
    ob_start();
    /// DO YOUR STATS CALCULATION HERE AND ECHO IT TO THE BROWSER LIKE NORMAL
    /// end output buffering and grab the contents so we now have a string
    /// of the page we've just generated
    $content = ob_get_contents();
    ob_end_clean();
    /// write the content to the cached file for next time
    file_put_contents($cachedVersion, $content);
    echo $content;
}
Once you start caching things, you need to be aware of when you should delete or clear your cache; otherwise your stats output will never change. With regard to this situation, the best time to clear your cache is at the point you go and fetch the external web pages again, so you should add these lines to the bottom of your "cron" script.
$cachedVersion = getcwd() . '/cached/stats.html';
unlink( $cachedVersion ); /// will delete the file
There are other speed improvements you could make to the caching system (you could even record the modified times of the external webpages and only reload when they have been updated), but I've tried to keep things easy to explain.
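As one example of such an improvement, here is a sketch of time-based expiry using filemtime(), so a stale cache refreshes itself without the cron script having to delete it:
$cachedVersion = getcwd() . '/cached/stats.html';
$maxAge = 3600; /// seconds; tune this to your cron interval
if (file_exists($cachedVersion) && (time() - filemtime($cachedVersion)) < $maxAge) {
    /// the cache is still fresh, so serve it
    echo file_get_contents($cachedVersion);
    exit;
}
/// ...otherwise regenerate the page and rewrite the cache as shown above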
Don't use an HTML parser for this situation
Scanning an HTML file for one particular unique value does not require the use of a full-blown or even lightweight HTML parser. Using RegExp incorrectly seems to be one of those things that lots of start-up programmers fall into, and it's a question that is always asked. This has led to lots of automatic knee-jerk reactions from more experienced coders, who automatically adhere to the following logic:
if ( $askedAboutUsingRegExpForHTML ) {
    $automatically->orderTheSillyPersonToUse( $HTMLParser );
} else {
    $soundAdvice = $think->about( $theSituation );
    print $soundAdvice;
}
HTML parsers should be used when the target within the markup is not so unique, or when your pattern relies on such flimsy rules that it'll break the second an extra tag or character occurs. They should be used to make your code more reliable, not to speed things up. Even parsers that do not build a tree of all the elements will still be using some form of string searching or regular expression notation, so unless the library code you are using has been compiled in an extremely optimised manner, it will not beat well-coded strpos/preg_match logic.
Considering I have not seen the HTML you are hoping to parse, I could be way off, but from what I've seen of your snippet it should be quite easy to find the value using a combination of strpos and preg_match. Obviously, if your HTML is more complex and might have random multiple occurrences of <dt>Win rate</dt><dd><div>50%</div></dd>, it will cause problems; but even so, an HTML parser would have the same problem.
$offset = 0;
/// loop through the occurrences of 'Win rate'
while ( ($p = stripos($html, 'win rate', $offset)) !== FALSE ) {
    /// grab out a snippet of the surrounding HTML to speed up the RegExp
    $snippet = substr($html, $p, 50);
    /// I've extended your RegExp to try and account for 'white space' that could
    /// occur around the elements. The following won't take into account any random
    /// attributes that may appear, so if you find some pages aren't working - echo
    /// out the $snippet var using something like "echo '<xmp>'.$snippet.'</xmp>';"
    /// and that should show you what is appearing that is breaking the RegExp.
    if ( preg_match('#^win\s+rate\s*</dt>\s*<dd>\s*<div>\s*([0-9]+%)\s*<#i', $snippet, $regs) ) {
        /// once you are here your % value will be in $regs[1];
        break; /// exit the while loop as we have found our 'Win rate'
    }
    /// move our offset past this occurrence for the next loop
    $offset = $p + 1;
}
Gotchas to be aware of
If you are new to PHP, as you state in a comment above, then the above may seem rather complicated - which it is. What you are trying to do is quite complex, especially if you want to do it optimally and fast. However, if you follow through the code I've given and research any bits that you aren't sure of / haven't heard of (php.net is your friend), it should give you a better understanding of a good way to achieve what you are doing.
Guessing ahead however, here are some of the problems you might face with the above:
File permission errors - in order to be able to read and write files to and from the local operating system, you will need to have the correct permissions to do so. If you find you cannot write files to a particular directory, it might be that the host you are using won't allow you to do so. If this is the case, you can either contact them to ask about how to get write permission to a folder, or, if that isn't possible, you can easily change the code above to use a database instead.
I can't see my content - when using output buffering, all the echo and print commands do not get sent to the browser; they instead get saved up in memory. PHP should automatically output all the stored content when the script exits, but if you use a command like ob_end_clean() this actually wipes the 'buffer', so all the content is erased. This can lead to confusing situations when you know you are echoing something... but it just isn't appearing.
(Mini disclaimer: I've typed all the above manually, so you may find there are PHP errors. If so, and they are baffling, just write them back here and StackOverflow can help you out.)
Instead of trying not to use preg_match, why not just trim your document's contents down in size? For example, you could dump everything before <body> and everything after </body>; then preg_match will already be searching less content.
Also, you could try running each of these fetch-and-parse processes as a pseudo separate thread, so that they aren't happening one at a time.
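On that last point, PHP's curl_multi functions let you fire off all the requests in parallel rather than one after another. A minimal sketch (reusing the placeholder URLs from earlier):
<?php
$urls = array(
    'http://www.something.com/page.htm',
    'http://www.something-else.co.uk/index.php',
);
$mh = curl_multi_init();
$handles = array();
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}
// Run all transfers until every one of them has finished
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh);
} while ($running > 0);
foreach ($handles as $url => $ch) {
    $content = curl_multi_getcontent($ch);
    // ... run your preg_match against $content here ...
    curl_multi_remove_handle($mh, $ch);
}
curl_multi_close($mh);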

Mediawiki tag extension - chained tags do not get processed

I'm trying to develop a simple tag extension for MediaWiki. So far I'm basically outputting the input as it comes. The problem arises when there are nested tags. For instance, in this example:
function efSampleParserInit( Parser &$parser ) {
    $parser->setHook( 'sample', 'efSampleRender' );
    return true;
}

function efSampleRender( $input, array $args, Parser $parser, PPFrame $frame ) {
    return "hello ->" . $input . "<- hello";
}
If I write this in an article:
This is the text <sample type="1">hello my <sample type="2">brother</sample> John</sample>
Only the first sample tag gets processed; the second one isn't. I guess I should work with the $parser object I receive so that I return parsed input, but I don't know how to do it.
Furthermore, MediaWiki's reference documentation is pretty much nonexistent. It would be great to have something like a Doxygen reference or similar.
Use $parser->recursiveTagParse(), as shown at Manual:Tag_extensions#How do I render wikitext in my extension?.
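Applied to the question's code, the render callback becomes something like this (a minimal sketch of that documented approach):
function efSampleRender( $input, array $args, Parser $parser, PPFrame $frame ) {
    // Recursively parse the tag's contents so that nested <sample>
    // tags (and any other wikitext) get processed as well
    $output = $parser->recursiveTagParse( $input, $frame );
    return "hello ->" . $output . "<- hello";
}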
It is kind of a clunky interface, and not very well documented. The underlying reason why such a seemingly natural thing to do is so tricky to accomplish is that it sort of goes against the original design intent of tag extensions — they were originally conceived as low-level filters that take in raw unparsed text and spit out HTML, completely bypassing normal parsing. So, for example, if you wanted to include some content written in Markdown (such as a StackOverflow post) on a wiki page, the idea was that you could install a suitable extension and then write
<markdown>
**Look,** here's some Markdown text!
</markdown>
on the page, and the MediaWiki parser would leave everything between the <markdown> tags alone and just hand it over to the extension for parsing.
Of course, it turned out that most people who wrote MediaWiki tag extensions didn't really want to replace the parser, but just to apply some tweaks to it. But the way the tag extension interface was set up, the only way to do that was to call the parser recursively. I've sometimes thought it would be nice to add a new parser extension type to MediaWiki, something that looked like tag extensions but didn't interrupt normal parsing in such a drastic manner. Alas, my motivation and copious free time haven't so far been sufficient to actually do something about it.

Extract data from website via PHP

I am trying to create a simple alert app for some friends.
Basically I want to be able to extract data ("price" and "stock availability") from a webpage like the following two:
http://www.sparkfun.com/commerce/product_info.php?products_id=5
http://www.sparkfun.com/commerce/product_info.php?products_id=9279
I have made the alert-via-email-and-SMS part, but now I want to be able to get the quantity and price out of the webpages (those two or any other ones) so that I can compare price and available quantity, and alert us to place an order if a product falls between certain thresholds.
I have tried some regex (found in some tutorials), but I am way too much of a n00b for this and haven't managed to get it working. Any good tips or examples?
$content = file_get_contents('http://www.sparkfun.com/commerce/product_info.php?products_id=9279');
preg_match('#<tr><th>(.*)</th> <td><b>price</b></td></tr>#', $content, $match);
$price = $match[1];
preg_match('#<input type="hidden" name="quantity_on_hand" value="(.*?)">#', $content, $match);
$in_stock = $match[1];
echo "Price: $price - Availability: $in_stock\n";
It's called screen scraping, in case you need to google for it.
I would suggest that you use a DOM parser and XPath expressions instead. Feed the HTML through HTML Tidy first to ensure the markup is valid.
For example:
$html = file_get_contents("http://www.example.com");
$html = tidy_repair_string($html);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
// Now query the document:
foreach ($xpath->query('//table[@class="pricing"]//th') as $node) {
    echo $node->nodeValue, "\n";
}
Whatever you do: don't use regular expressions to parse HTML, or bad things will happen. Use a parser instead.
First, this question is asking for a lot of detail. Second, extracting data from a website might not be legitimate. However, I have some hints:
Use Firebug or the Chrome/Safari inspector to explore the HTML content and find the patterns around the interesting information.
Test your regexes to see if they match. You may need to do this in several passes (multi-pass parsing/extraction).
Write a client via cURL or, even simpler, use file_get_contents (note that some hosts disable loading URLs with file_get_contents).
Personally, I'd rather use Tidy to convert the page to valid XHTML and then use XPath to extract data, instead of regexes. Why? Because HTML is not a regular language, and XPath is very flexible. You can also learn XSLT to transform the result.
Good luck!
You are probably best off loading the HTML code into a DOM parser like this one and searching for the "pricing" table. However, any kind of scraping you do can break whenever the site changes its page layout, and it is probably illegal without the site's consent.
The best way, though, would be to talk to the people who run the site, and see whether they have alternative, more reliable forms of data delivery (Web services, RSS, or database exports come to mind).
Here is perhaps the simplest method to extract data from a website. I analysed my target page and found that all the data I wanted was contained within <h3> tags only, so I prepared this:
<?php
include('simple_html_dom.php');

// Create DOM from URL; put your target web URL in $page
$page = 'http://facebook4free.com/category/facebookstatus/amazing-facebook-status/';
$html = new simple_html_dom();

// Load the webpage into $html for further operations
$html->load_file($page);

// Find all matching elements
$links = array();

// Within find() I have specified 'h3', so it will fetch content
// from <h3> tags only. Change this as per your requirements.
foreach ($html->find('h3') as $element) {
    $links[] = $element;
}

reset($links);

// $out will hold each matched HTML element's content from that web page
foreach ($links as $out) {
    echo $out;
}
?>
