$url = 'http://www.test.com/';
$dom = new DOMDocument;
@$dom->loadHTMLFile($url);
$links = $dom->getElementsByTagName('a');
foreach ($links as $link) {
I am currently using the above script to capture links on a page. However, I found that there are always duplicate links: the page has a linked picture, followed by a text link that points to the same URL. Is there an easy way to capture just the text link, not the image link?
As I was saying, I might take the approach of cleaning up the dupes in my result set. Not sure on what you are scraping but what if the link is only used with an image?
You could even count the occurrences.
$url = 'http://www.test.com/';
$dom = new DOMDocument;
@$dom->loadHTMLFile($url);
$links = $dom->getElementsByTagName('a');
$distinctLinks = [];
foreach ($links as $link) {
    // A DOMElement cannot be used as an array key, so count by the href string.
    $href = $link->getAttribute('href');
    $distinctLinks[$href] = ($distinctLinks[$href] ?? 0) + 1;
}
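To address the original question more directly, one option is to filter in the XPath expression itself and skip anchors that wrap an image. This is only a sketch: the sample markup and the `$textLinks` name are made up for illustration, and it assumes the text link never contains an `<img>`.

```php
<?php
// Hypothetical sample markup: an image link followed by a text link to the same URL.
$html = <<<HTML
<html><body>
<a href="/story-1"><img src="thumb1.jpg" alt=""></a>
<a href="/story-1">Story one</a>
<a href="/story-2"><img src="thumb2.jpg" alt=""></a>
<a href="/story-2">Story two</a>
</body></html>
HTML;

$dom = new DOMDocument;
// Suppress warnings from imperfect real-world HTML.
@$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
$textLinks = [];
// Select anchors that have an href and no <img> anywhere inside them.
foreach ($xpath->query('//a[@href][not(.//img)]') as $a) {
    $textLinks[] = $a->getAttribute('href');
}
// Drop any remaining duplicates and reindex the array.
$textLinks = array_values(array_unique($textLinks));

print_r($textLinks);
```

If a URL is only ever linked through an image, this filter drops it, so counting occurrences as shown above may still be the safer approach.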
Here is the code snippet being used:
$urlContent = file_get_contents('http://www.techeblog.com/');
$dom = new DOMDocument();
@$dom->loadHTML($urlContent);
$domPath=new DOMXpath($dom);
$linkList = $domPath->evaluate("/html/body/a/img");
foreach ($linkList as $link)
{
echo $link->getAttribute("src")."<br />";
}
I need to extract all the links whose child node is an image tag.
Your XPath expression will only return image tags that are inside links that are direct children of the body tag. If you want all link tags that contain images anywhere in the document, use the expression //a[img]
That being said, you may want to be more specific about which images you pull. This expression limits the results to links containing images inside the blog entries: //div[@class="entry"]//a[img]
Here is a great XPath cheat sheet.
<?php
$urlContent = file_get_contents('http://www.techeblog.com/');
$dom = new DOMDocument();
@$dom->loadHTML($urlContent);
$domPath=new DOMXpath($dom);
$linkList = $domPath->evaluate('//div[@class="entry"]//a[img]');
foreach ($linkList as $link)
{
echo $link->getAttribute("href").PHP_EOL;
}
Also, your echo is looking for an attribute called src, which will not be present on the links; read href instead.
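To make the difference between the expressions concrete, here is a small comparison run against inline markup. The markup and the `hrefsFor()` helper are invented for this illustration and are not taken from the site in the question.

```php
<?php
// Hypothetical markup: one image link directly under body, one nested in an entry.
$html = <<<HTML
<html><body>
<a href="/top"><img src="top.png" alt=""></a>
<div class="entry"><p><a href="/post"><img src="post.png" alt=""></a></p></div>
<div class="entry"><a href="/text-only">Read more</a></div>
</body></html>
HTML;

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Illustrative helper: collect the href of every element an expression matches.
function hrefsFor(DOMXPath $xpath, string $expression): array
{
    $hrefs = [];
    foreach ($xpath->query($expression) as $node) {
        $hrefs[] = $node->getAttribute('href');
    }
    return $hrefs;
}

// Only anchors that are direct children of <body>.
$direct  = hrefsFor($xpath, '/html/body/a[img]');
// Anchors with an <img> child anywhere in the document.
$any     = hrefsFor($xpath, '//a[img]');
// Anchors with an <img> child inside a div whose class is exactly "entry".
$entries = hrefsFor($xpath, '//div[@class="entry"]//a[img]');

print_r(['direct' => $direct, 'any' => $any, 'entries' => $entries]);
```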
I have this keyword: yt-lookup-title.
I want the next 17 letters after this in a variable. So I would have:
"<a href="/watch?v=HnlC81tWoY8"
How can I get this from every line that contains this keyword?
If you want to get the href content, you can rely on DOMDocument.
If I'm not mistaken, all the links (<a>) have this class yt-uix-tile-link. So you can do the following:
$dom = new DOMDocument;
// $html is a string containing the html of the page you're parsing
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$links = array ();
$nodes = $xpath->query('//a[@class="yt-uix-tile-link"]/@href');
foreach ($nodes as $node) {
$links [] = $node->nodeValue;
}
var_dump ($links);
Hope that helps
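One caveat: an exact comparison on the class attribute only matches when the attribute is precisely that string. If the anchors carry several classes, a common XPath 1.0 idiom is to pad the attribute with spaces and use contains(). The sketch below assumes that situation; the sample markup, the extra class names, and the second video id are made up.

```php
<?php
// Hypothetical markup: the target class appears alone, with another class,
// and as a substring of an unrelated class name.
$html = <<<HTML
<html><body>
<a class="yt-uix-tile-link spf-link" href="/watch?v=HnlC81tWoY8">First</a>
<a class="yt-uix-tile-link" href="/watch?v=abc">Second</a>
<a class="other-yt-uix-tile-link-like" href="/ignored">Third</a>
</body></html>
HTML;

$dom = new DOMDocument;
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Match "yt-uix-tile-link" as a whole token within the class attribute.
$expression = '//a[contains(concat(" ", normalize-space(@class), " "), " yt-uix-tile-link ")]/@href';

$links = [];
foreach ($xpath->query($expression) as $attr) {
    $links[] = $attr->nodeValue;
}

var_dump($links);
```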
I need to extract all the links to news articles from the NY Times RSS feed to a MySQL database periodically. How do I go about doing this? Can I use some regular expression (in PHP) to match the links? Or is there some other alternative way? Thanks in advance.
UPDATE 2 I tested the code below and had to modify the
$links = $dom->getElementsByTagName('a');
and change it to:
$links = $dom->getElementsByTagName('link');
It successfully outputted the links. Good Luck
UPDATE Looks like there is a complete answer here: How do you parse and process HTML/XML in PHP.
I developed a solution so that I could recurse all the links in my website. I've removed the code which verified the domain was the same with each recursion (since the question didn't ask for this), but you can easily add one back in if you need it.
Using PHP's DOMDocument, you can parse an HTML or XML document to read links. It is better than using regex. Try something like this:
<?php
//300 seconds = 5 minutes - or however long you need so php won't time out
ini_set('max_execution_time', 300);
// using a global to store the links in case there is recursion, it makes it easy.
// You could of course pass the array by reference for cleaner code.
$alinks = array();
// set the link to whatever you are reading
$link = "http://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml";
// do the search
linksearch($link, $alinks);
// show results
var_dump($alinks);
function linksearch($url, & $alinks) {
// use $queue if you want this fn to be recursive
$queue = array();
echo "<br>Searching: $url";
$href = array();
//Load the HTML page
$html = file_get_contents($url);
//Create a new DOM document
$dom = new DOMDocument;
//Parse the HTML. The @ is used to suppress any parsing errors
//that will be thrown if the $html string isn't valid XHTML.
@$dom->loadHTML($html);
//Get all links. You could also use any other tag name here,
//like 'img' or 'table', to extract other tags.
$links = $dom->getElementsByTagName('link');
//Iterate over the extracted links and display their URLs
foreach ($links as $link){
//Extract and show the "href" attribute.
$href[] = $link->getAttribute('href');
}
foreach (array_unique($href) as $link) {
// add to list of links found
$queue[] = $link;
}
// remove duplicates
$queue = array_unique($queue);
// get links that haven't yet been processed
$queue = array_diff($queue, $alinks);
// update array passed by reference with new links found
$alinks = array_merge($alinks, $queue);
if (count($queue) > 0) {
foreach ($queue as $link) {
// recursive search - uncomment out if you use this
// remember to check that the domain is the same as the one starting from
// linksearch($link, $alinks);
}
}
}
DOM+Xpath allows you to fetch nodes using expressions.
RSS Item Links
To fetch the RSS link elements (the link for each item):
$xml = file_get_contents($url);
$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXPath($document);
$expression = '//channel/item/link';
foreach ($xpath->evaluate($expression) as $link) {
var_dump($link->textContent);
}
Atom Links
The atom:link elements have different semantics: they belong to the Atom namespace and describe relations. NYT uses the standout relation to mark featured stories. To fetch the Atom links, you need to register a prefix for the namespace. Attributes are nodes too, so you can fetch them directly:
$xml = file_get_contents($url);
$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXPath($document);
$xpath->registerNamespace('a', 'http://www.w3.org/2005/Atom');
$expression = '//channel/item/a:link[@rel="standout"]/@href';
foreach ($xpath->evaluate($expression) as $link) {
var_dump($link->value);
}
There are other relations as well, such as prev and next.
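For trying the expressions without hitting the network, the following sketch runs both of them against a tiny inline feed. The feed content and the example.com URLs are invented; only the namespace URI and the expressions come from the snippets above.

```php
<?php
// Hypothetical feed, trimmed to the parts the expressions touch.
$xml = <<<XML
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First</title>
      <link>http://example.com/first</link>
      <atom:link rel="standout" href="http://example.com/first?featured"/>
    </item>
    <item>
      <title>Second</title>
      <link>http://example.com/second</link>
    </item>
  </channel>
</rss>
XML;

$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXPath($document);
// The prefix registered here ("a") does not have to match the one used in
// the feed ("atom"); only the namespace URI has to be the same.
$xpath->registerNamespace('a', 'http://www.w3.org/2005/Atom');

// Plain RSS links: one per item.
$itemLinks = [];
foreach ($xpath->evaluate('//channel/item/link') as $link) {
    $itemLinks[] = $link->textContent;
}

// Atom links marked with the standout relation.
$standout = [];
foreach ($xpath->evaluate('//channel/item/a:link[@rel="standout"]/@href') as $attr) {
    $standout[] = $attr->value;
}

var_dump($itemLinks, $standout);
```

The unprefixed `link` step only matches elements without a namespace, so the Atom links do not leak into the first result.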
HTML Links (a elements)
The description elements contain HTML fragments. To extract the links from them you have to load the HTML into a separate DOM document.
$xml = file_get_contents($url);
$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXPath($document);
$xpath->registerNamespace('a', 'http://www.w3.org/2005/Atom');
$expression = '//channel/item/description';
foreach ($xpath->evaluate($expression) as $description) {
$fragment = new DOMDocument();
$fragment->loadHtml($description->textContent);
$fragmentXpath = new DOMXpath($fragment);
foreach ($fragmentXpath->evaluate('//a[@href]/@href') as $link) {
var_dump($link->value);
}
}
Okay, I am using (PHP) file_get_contents to read some websites. Each of these sites has only one Facebook link. After I fetch the entire page, I would like to find the complete Facebook URL.
So in some part there will be:
<a href="http://facebook.com/username" >
I want to get http://facebook.com/username, that is, everything from the first (") to the last ("). The username is variable (it could be username.somethingelse), and there could be other attributes before or after "href".
Just in case I am not being very clear:
<a href="http://facebook.com/username" > //I want http://facebook.com/username
<a href="http://www.facebook.com/username" > //I want http://www.facebook.com/username
<a class="value" href="http://facebook.com/username. some" attr="value" > //I want http://facebook.com/username. some
Or any of the examples above could use single quotes:
<a href='http://facebook.com/username' > //I want http://facebook.com/username
Thanks to all
Don't use regex on HTML. It's a shotgun that'll blow off your leg at some point. Use DOM instead:
$dom = new DOMDocument;
$dom->loadHTML(...);
$xp = new DOMXPath($dom);
$a_tags = $xp->query("//a");
foreach($a_tags as $a) {
echo $a->getAttribute('href');
}
I would suggest using DOMDocument for this very purpose rather than using regex. Here is a quick code sample for your case:
$dom = new DOMDocument();
$dom->loadHTML($content);
// To hold all your links...
$links = array();
$hrefTags = $dom->getElementsByTagName("a");
foreach ($hrefTags as $hrefTag)
$links[] = $hrefTag->getAttribute("href");
print_r($links); // dump all links
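Both answers return every link on the page. To narrow that down to the single Facebook URL the question asks for, one possible approach is to check each href's host with parse_url(). This is a sketch under assumptions: the sample markup is made up, and it assumes the first anchor whose host is facebook.com (or a subdomain of it) is the one wanted.

```php
<?php
// Hypothetical page content; note the single-quoted href and the extra attributes.
$content = <<<HTML
<html><body>
<a href="http://example.com/about">About</a>
<a class="value" href='http://www.facebook.com/username.somethingelse' attr="value">Facebook</a>
</body></html>
HTML;

$dom = new DOMDocument();
@$dom->loadHTML($content);

$facebookUrl = null;
foreach ($dom->getElementsByTagName('a') as $a) {
    $href = $a->getAttribute('href');
    // parse_url() returns null when the URL has no host component.
    $host = parse_url($href, PHP_URL_HOST);
    // Accept facebook.com and any subdomain such as www.facebook.com.
    if (is_string($host) && preg_match('/(^|\.)facebook\.com$/i', $host)) {
        $facebookUrl = $href;
        break;
    }
}

echo $facebookUrl, PHP_EOL;
```

The parser handles the quoting style and attribute order, so the check only has to look at the decoded href value.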
I have a page scraped with curl and am looking to grab all of the links with a certain id. As far as I can tell, the best way to do this is with DOM and XPath. The code below grabs a large number of the URLs, but it cuts many of them off and grabs text that is not a URL.
$curl_scraped_page is the page scraped with curl.
$dom = new DOMDocument();
#$dom->loadHTML($curl_scraped_page);
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
Am I on the right track? Do I just need to mess with the "/html/body//a" xpath syntax or do I need to add more to capture the id element?
You can also do it this way, and you'll get only the a tags that have both an id and an href:
$doc = new DOMDocument();
$doc->loadHTML($curl_scraped_page);
$xpath = new DOMXPath($doc);
$hrefs = $xpath->query('//a[@href][@id]');
$dom = new DOMDocument();
$dom->loadHTML($curl_scraped_page);
$links = $dom->getElementsByTagName('a');
$processed_links = array();
foreach ($links as $link)
{
if ($link->hasAttribute('id') && $link->hasAttribute('href'))
{
$processed_links[$link->getAttribute('id')] = $link->getAttribute('href');
}
}
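If only one particular id is wanted, the id can go into the XPath predicate, and selecting the href attribute node directly avoids reading the anchor's text by accident (which may be why the original attempt returned text that is not a URL, though that is a guess). In this sketch the markup and the id value `foo` are hypothetical, and the id is assumed to contain no double quotes.

```php
<?php
// Hypothetical scraped page.
$curl_scraped_page = <<<HTML
<html><body>
<a id="foo" href="/first">First</a>
<a id="bar" href="/second">Second</a>
<a href="/third">Third</a>
</body></html>
HTML;

$dom = new DOMDocument();
@$dom->loadHTML($curl_scraped_page);
$xpath = new DOMXPath($dom);

// Hypothetical id to look for; assumed not to contain a double quote.
$wantedId = 'foo';

$hrefs = [];
// Select the href attribute nodes of anchors with the wanted id.
foreach ($xpath->query('//a[@id="' . $wantedId . '"]/@href') as $attr) {
    $hrefs[] = $attr->value;
}

print_r($hrefs);
```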
This is the solution regarding your question.
http://simplehtmldom.sourceforge.net/
include('simple_html_dom.php');
$html = file_get_html('http://www.google.com/');
foreach($html->find('#www-core-css') as $e) echo $e->outertext . '<br>';
I think the easiest way is to combine the following two classes to pull information from another website:
Pull info from any HTML tag, contents or tag attribute: http://simplehtmldom.sourceforge.net/
Easy to handle curl, supports POST requests: https://github.com/php-curl-class/php-curl-class
Example:
include('path/to/curl.php');
include('path/to/simple_html_dom.php');
$url = 'http://www.example.com';
$curl = new Curl;
$html = str_get_html($curl->get($url)); //full HTML of website
$linksWithSpecificID = $html->find('a[id=foo]'); //returns array of elements
See the Simple HTML DOM Parser manual at the first link above for details on manipulating the HTML data.