get title tag value using DOMDocument - php

I want to get the value of the <title> tag for every page of my website. The script should run only against my own domain, collect all the page links on the site, and get the title of each linked page.
This is my code:
$html = file_get_contents('http://xxxxxxxxx.com');
//Create a new DOM document
$dom = new DOMDocument;
//Parse the HTML. The @ is used to suppress any parsing errors
//that will be thrown if the $html string isn't valid XHTML.
@$dom->loadHTML($html);
//Get all links. You could also use any other tag name here,
//like 'img' or 'table', to extract other tags.
$links = $dom->getElementsByTagName('a');
//Iterate over the extracted links and display their URLs
foreach ($links as $link){
//Extract and show the "href" attribute.
echo $link->nodeValue;
echo $link->getAttribute('href'), '<br>';
}
What I get is z1.html and z2....
My z1.html has a title named z3. I want to get z1.html and z3, not z2. Can anyone help me?

Adding a bit to hitesh's answer: this version checks that each element has attributes and that the desired attribute exists. It also checks that looking up the 'title' elements actually returns at least one item before trying to grab the first one ($a_html_title->item(0)).
I also added a curl option to follow redirects (I needed it for my hardcoded test against google.com).
foreach ($links as $link) {
//Extract and show the "href" attribute.
if ($link->hasAttributes()){
if ($link->hasAttribute('href')){
$href = $link->getAttribute('href');
$href = 'http://google.com'; // hardcoding just for testing
echo $link->nodeValue;
echo "<br/>".'MY ANCHOR LINK : - ' . $href . "---TITLE--->";
$a_html = my_curl_function($href);
$a_doc = new DOMDocument();
@$a_doc->loadHTML($a_html);
$a_html_title = $a_doc->getElementsByTagName('title');
//get and display what you need:
if ($a_html_title->length){
$a_html_title = $a_html_title->item(0)->nodeValue;
echo $a_html_title;
echo '<br/>';
}
}
}
}
function my_curl_function($url) {
$curl_handle = curl_init();
curl_setopt($curl_handle, CURLOPT_URL, $url);
curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 2);
curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl_handle, CURLOPT_USERAGENT, 'name');
curl_setopt($curl_handle, CURLOPT_FOLLOWLOCATION, TRUE); // added this
$html = curl_exec($curl_handle);
curl_close($curl_handle);
return $html;
}
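The title lookup in the loop above can also be pulled into a small helper so the empty-download and missing-title cases are handled in one place. This is only a sketch: extract_title() is a hypothetical name, it uses libxml_use_internal_errors() instead of the @ operator to keep parse warnings quiet, and the sample HTML simply reuses the z1/z2/z3 placeholders from the question.

```php
<?php
// Sketch: return the trimmed <title> text of an HTML string, or null if there is none.
// extract_title() is a hypothetical helper name, not part of the answers above.
function extract_title(string $html): ?string
{
    if (trim($html) === '') {
        return null; // e.g. the download failed and produced an empty body
    }

    // Collect libxml parse warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $titles = $doc->getElementsByTagName('title');
    if ($titles->length === 0) {
        return null; // page has no <title> element
    }
    return trim($titles->item(0)->textContent);
}

// Inline sample standing in for a downloaded page.
$sample = '<html><head><title> z3 </title></head><body><a href="z1.html">z2</a></body></html>';
echo extract_title($sample), PHP_EOL;
```

In the loop, extract_title((string) my_curl_function($href)) would then replace the inline DOMDocument code; the cast covers the case where curl_exec() returns false.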

You need to write your own custom function and call it in the appropriate places. If you need several tags from the pages behind each anchor tag, just create another custom function.
The code below will help you get started:
$html = my_curl_function('http://www.anchorartspace.org/');
$doc = new DOMDocument();
@$doc->loadHTML($html);
$mytag = $doc->getElementsByTagName('title');
//get and display what you need:
$title = $mytag->item(0)->nodeValue;
$links = $doc->getElementsByTagName('a');
//Iterate over the extracted links and display their URLs
foreach ($links as $link) {
//Extract and show the "href" attribute.
echo $link->nodeValue;
echo "<br/>".'MY ANCHOR LINK : - ' . $link->getAttribute('href') . "---TITLE--->";
$a_html = my_curl_function($link->getAttribute('href'));
$a_doc = new DOMDocument();
@$a_doc->loadHTML($a_html);
$a_html_title = $a_doc->getElementsByTagName('title');
//get and display what you need:
$a_html_title = $a_html_title->item(0)->nodeValue;
echo $a_html_title;
echo '<br/>';
}
echo "Title: $title" . '<br/><br/>';
function my_curl_function($url) {
$curl_handle = curl_init();
curl_setopt($curl_handle, CURLOPT_URL, $url);
curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 2);
curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl_handle, CURLOPT_USERAGENT, 'name');
$html = curl_exec($curl_handle);
curl_close($curl_handle);
return $html;
}
Let me know if you need any more help.
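One thing the code above does not handle is relative links such as z1.html from the question: curl needs an absolute URL, and the question also asks to stay on one domain. A minimal sketch of both steps follows. resolve_href() and same_host() are hypothetical helper names, example.com is a placeholder, and the resolver is deliberately simple (it does not normalise ../ segments or honour a <base> tag).

```php
<?php
// Sketch: turn an href found on page $base into an absolute URL,
// or return null for links that should be skipped.
function resolve_href(string $base, string $href): ?string
{
    $href = trim($href);
    // Skip empty links, in-page anchors, and non-HTTP schemes.
    if ($href === '' || $href[0] === '#' || preg_match('#^(mailto|javascript|tel):#i', $href)) {
        return null;
    }
    if (preg_match('#^https?://#i', $href)) {
        return $href; // already absolute
    }

    $parts = parse_url($base);
    if (!isset($parts['scheme'], $parts['host'])) {
        return null; // base URL is not usable
    }
    if (strpos($href, '//') === 0) {
        return $parts['scheme'] . ':' . $href; // protocol-relative link
    }

    $origin = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $origin .= ':' . $parts['port'];
    }
    if ($href[0] === '/') {
        return $origin . $href; // root-relative link
    }

    // Relative to the directory part of the base path.
    $path = $parts['path'] ?? '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $href;
}

// True when both URLs point at the same host (case-insensitive).
function same_host(string $base, string $url): bool
{
    $a = parse_url($base, PHP_URL_HOST);
    $b = parse_url($url, PHP_URL_HOST);
    return is_string($a) && is_string($b) && strcasecmp($a, $b) === 0;
}

$base = 'http://example.com/dir/index.html';
foreach (['z1.html', '/about.html', '#top', 'https://other.org/a'] as $href) {
    $url = resolve_href($base, $href);
    if ($url !== null && same_host($base, $url)) {
        echo $url, PHP_EOL; // only same-site, fetchable URLs reach this point
    }
}
```

Inside the foreach over $links, the resolved URL would be what gets passed to my_curl_function() instead of the raw href.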

Related

Regular expression to extract the content inside the script tag in php

I am trying to extract the download URLs from a web page.
The code I tried is below:
function getbinaryurl ($url)
{
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_FRESH_CONNECT, true);
$value1 = curl_exec($curl);
curl_close($curl);
$start = preg_quote('<script type="text/x-component">', '/');
$end = preg_quote('</script>', '/');
$rx = preg_match("/$start(.*?)$end/", $value1, $matches);
var_dump($matches);
}
$url = "https://www.sourcetreeapp.com/download-archives";
getbinaryurl($url);
This way I am getting the tag info, not the content inside the script tag. How do I get the content inside?
expected result is:
https://product-downloads.atlassian.com/software/sourcetree/ga/Sourcetree_4.0.1_234.zip,
https://product-downloads.atlassian.com/software/sourcetree/windows/ga/SourceTreeSetup-3.3.6.exe,
https://product-downloads.atlassian.com/software/sourcetree/windows/ga/SourcetreeEnterpriseSetup_3.3.6.msi
I am very new to writing regular expressions. Can anyone help me, please?
Instead of using regex, using DOMDocument and XPath gives you more control over the elements you select.
Although XPath can be difficult (same as regex), it can look more intuitive to some. The code uses //script[@type="text/x-component"][contains(text(), "macURL")], which broken down is
//script = any script node
[@type="text/x-component"] = which has an attribute called type with
the specific value
[contains(text(), "macURL")] = whose text contains the string macURL
The query() method returns a list of matches, so loop over them. The content is JSON, so decode it and output the values...
function getbinaryurl ($url)
{
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_FRESH_CONNECT, true);
$value1 = curl_exec($curl);
curl_close($curl);
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($value1);
libxml_use_internal_errors(false);
$xp = new DOMXPath($doc);
$srcs = $xp->query('//script[@type="text/x-component"][contains(text(), "macURL")]');
foreach ( $srcs as $src ) {
$content = json_decode( $src->textContent, true);
echo $content['params']['macURL'] . PHP_EOL;
echo $content['params']['windowsURL'] . PHP_EOL;
echo $content['params']['enterpriseURL'] . PHP_EOL;
}
}
$url = "https://www.sourcetreeapp.com/download-archives";
getbinaryurl($url);
which outputs
https://product-downloads.atlassian.com/software/sourcetree/ga/Sourcetree_4.0.1_234.zip
https://product-downloads.atlassian.com/software/sourcetree/windows/ga/SourceTreeSetup-3.3.8.exe
https://product-downloads.atlassian.com/software/sourcetree/windows/ga/SourcetreeEnterpriseSetup_3.3.8.msi
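To try the XPath expression without hitting the network, the same steps can be run against an inline string. The sketch below assumes the JSON shape used in the answer (a params object holding the URL keys); the example.com URLs are made up for illustration, and it skips scripts whose content is not valid JSON or lacks a key rather than assuming every key is present.

```php
<?php
// Inline stand-in for the downloaded page; the URLs are placeholders.
$html = '<html><body>'
      . '<script type="text/javascript">var x = 1;</script>'
      . '<script type="text/x-component">{"params":{"macURL":"https://example.com/app.zip",'
      . '"windowsURL":"https://example.com/app.exe"}}</script>'
      . '</body></html>';

// Parse quietly: real-world HTML usually triggers libxml warnings.
$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xp = new DOMXPath($doc);
$urls = [];
$query = '//script[@type="text/x-component"][contains(text(), "macURL")]';
foreach ($xp->query($query) as $script) {
    $data = json_decode($script->textContent, true);
    if (!is_array($data) || !isset($data['params'])) {
        continue; // not JSON, or not the shape assumed here
    }
    // Copy only the keys that are actually present.
    foreach (['macURL', 'windowsURL', 'enterpriseURL'] as $key) {
        if (isset($data['params'][$key])) {
            $urls[$key] = $data['params'][$key];
        }
    }
}
print_r($urls);
```

The isset() checks matter because a page may omit one of the keys, in which case the direct array access in the answer would raise an undefined-index notice or warning.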

Extract data from HTML tag

I have the following code, which tries to extract the value of the content attribute from an HTML page. It does not give the result I expect; instead it gives only a blank page.
Can anyone help me find where the issue is?
$url= "https://fr-ca.wordpress.org";
$html = file_get_contents($url);
# Create a DOM parser object
$dom = new DOMDocument();
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('meta') as $key ) {
echo "<pre>";
$tab[] = $key->getAttribute('content');
}
$reg= '<meta name="generator" content="(.*?)"/>';
if (preg_match_all($reg, $html, $ar)) {
print_r($ar);
}
Page source has :
<meta name="generator" content="WP 4.5"/>
try this:
$html = '<meta name="generator" content="WP 4.5"/>';
preg_match_all('/content="(.*)"/i', $html, $matches);
if (isset($matches[1])) {
print_r($matches[1]);
}
Here is a regex that will look for a meta tag and get the content attribute contents. It has some wild cards that will account for other variables such as different names, or extra spaces, etc.
$html = '<meta name="generator" content="WP 4.5"/>';
preg_match_all( '#<meta.*?content=[\'"](.*?)[\'"]\s*/>#i', $html, $results );
print_r( $results[1] ); // contains array of captures.
if( $results[1] ) {
// code here...
}
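Since the question already loads the page into DOMDocument, the attribute can also be read without a regex by selecting the meta tag by its name. This is a sketch against an inline string rather than the live page; the description meta tag is added only to show that other meta tags are ignored.

```php
<?php
// Inline stand-in for the downloaded page source.
$html = '<html><head>'
      . '<meta name="description" content="Some site"/>'
      . '<meta name="generator" content="WP 4.5"/>'
      . '</head><body></body></html>';

// Parse quietly so markup warnings do not end up in the output.
$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($doc);
$generator = null;
// Select only <meta> elements whose name attribute is "generator".
foreach ($xpath->query('//meta[@name="generator"]') as $meta) {
    $generator = $meta->getAttribute('content');
    break; // the first match is enough
}
echo $generator ?? 'no generator meta tag found', PHP_EOL;
```

Unlike the regex versions, this does not depend on attribute order, quote style, or whether the tag is self-closed.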
Please use it like this:
$html = file_get_contents( $url);
libxml_use_internal_errors( true);
$doc = new DOMDocument;
$doc->loadHTML( $html);
$xpath = new DOMXpath( $doc);
// A name attribute on a <div>???
$nodes = $xpath->query( '//div[@name="changeable_text"]')->item( 0);
echo $nodes->textContent;
OR
// Use Curl ...
function getHTML($url,$timeout)
{
$ch = curl_init($url); // initialize curl with given url
curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER["HTTP_USER_AGENT"]); // set useragent
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // write the response to a variable
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects if any
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout); // max. seconds to execute
curl_setopt($ch, CURLOPT_FAILONERROR, 1); // stop when it encounters an error
return @curl_exec($ch);
}
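// find() comes from the Simple HTML DOM library, so parse the downloaded string with it first
include('simple_html_dom.php');
$html=str_get_html(getHTML("http://www.website.com",10));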
// Find all images on webpage
foreach($html->find("img") as $element)
echo $element->src . '<br>';
// Find all links on webpage
foreach($html->find("a") as $element)
echo $element->href . '<br>';

PHP Simple HTML Dom parser returns 0

I use PHP Simple HTML Dom parser to get some elements of a page. Unfortunately, I get as a result 0 or 1... I would like to get the innerHTML instead.
Here is a photo of the dom:
And here is my code:
include('simple_html_dom.php');
// We take the url we want to scrape
$URL = 'https://www.legifrance.gouv.fr/affichTexte.do?cidTexte=JORFTEXT000033011065&dateTexte=20160821';
// Curl init
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $URL);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$result = curl_exec ($ch);
curl_close($ch);
// We get the html
$html = new simple_html_dom();
$html->load($result);
// Find all article blocks
foreach($html->find('div.data') as $article) {
$item['title'] = $article->find('.titreSection', 0) ->plaintext;
$resultat[] = "<p>" + $item['title']."</p></br>";
}
include 'vue_scrap.php';
?>
Here is the code of my view:
foreach ($resultat as $result){
echo $result;
}
Thank you for your help.
In fact, I just made a mistake on this line, where + (numeric addition in PHP) is used instead of . (string concatenation):
$resultat[] = "<p>" + $item['title']."</p></br>";
The correct version is:
$resultat[] = "<p>".$item['title']."</p></br>";
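As a small self-contained illustration of the corrected line, the sketch below builds the same kind of list with the . operator. The titles are made-up sample data, and the htmlspecialchars() call is an addition of mine (not in the original code) so that scraped text containing & or < does not break the surrounding markup.

```php
<?php
// Made-up sample titles standing in for the scraped ->plaintext values.
$titles = ['Article 1', 'Article 2 & annex'];

$resultat = [];
foreach ($titles as $title) {
    // "." concatenates strings; "+" would attempt numeric addition instead.
    $resultat[] = '<p>' . htmlspecialchars($title) . '</p>';
}

echo implode("\n", $resultat), PHP_EOL;
```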

PHP DOM parser breaks the page and can't load page content

I have created a PHP parser that must extract the price from a span tag, but when I echo $html to see how the page loads, it shows a broken page: only the header and footer load, not the content. The content seems to be loaded externally by JavaScript. How can I load the HTML page with DOM so that the JavaScript also runs? I need the whole content to load so that I can get the divs and spans. This is my code:
<?php
require_once('simple_html_dom.php');
$url = 'http://oldnavy.gap.com/browse/product.do?cid=99570&vid=1&pid=714649002';
$dom = new domDocument('1.0', 'UTF-8');
$html = file_get_html($url);
echo $html;
if(is_object($html)){
foreach ( $html->find('span#priceText') as $data){
$raw_price = $data->innertext;
echo $raw_price;
}
}
?>
Alternative approach
The link you are actually looking for (in its minimal form) is this: http://oldnavy.gap.com/browse/productData.do?pid=714649
Now load that using curl, set a value for the unknownShopperId cookie, explode the response into an array, and get the price you need:
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_VERBOSE, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_URL, "http://oldnavy.gap.com/browse/productData.do?pid=714649");
curl_setopt($ch, CURLOPT_HTTPHEADER, array("Cookie: unknownShopperId=E853DA3B2607DDAA5F2FE13CE8D32ACF"));
$result = curl_exec($ch);
$explode = explode(',', $result);
echo 'Original price: ' . $explode[92] . '<br/>' .
'New price: ' . $explode[93] . '<br/>' .
'Both prices: ' . $explode[13];
The result will be: '$14.94'
From now on, if you need another price you must know the item's pid.

regex to print url from any webpage with specific word in url

I am using the code below to extract URLs from a web page and it works fine, but I want to filter the result. It displays all the URLs on the page, but I want only those URLs that contain the word "super".
$regex='|<a.*?href="(.*?)"|';
preg_match_all($regex,$result,$parts);
$links=$parts[1];
foreach($links as $link){
echo $link."<br>";
}
So it should echo only URLs where the word super is present.
For example, it should ignore the URL
http://xyz.com/abc.html
but it should echo
http://abc.superpower.com/hddll.html
as it contains the required word super.
Require the word inside the captured href, and use [^"]* so the match cannot run past the closing quote into a later link:
$regex = '|<a[^>]*?href="([^"]*super[^"]*)"|is';
However, to parse and scrape HTML it is better to use PHP's DOM parser.
Update: Here is code using DOM parser:
$request_url ='1900girls.blogspot.in/';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $request_url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$result = curl_exec($ch);
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($result); // loads your html
$xpath = new DOMXPath($doc);
$needle = 'blog';
$nodelist = $xpath->query("//a[contains(@href, '" . $needle . "')]");
for($i=0; $i < $nodelist->length; $i++) {
$node = $nodelist->item($i);
echo $node->getAttribute('href') . "\n";
}
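A variation on the DOM approach is to do the substring test in PHP instead of inside the XPath string. That avoids building a query by concatenation (a needle containing a quote would break it) and allows a case-insensitive match, which XPath 1.0's contains() does not offer. The sketch below runs on an inline string using the two URLs from the question; links_containing() is a hypothetical helper name.

```php
<?php
// Sketch: return every href in $html that contains $needle (case-insensitive).
function links_containing(string $html, string $needle): array
{
    if (trim($html) === '') {
        return []; // nothing to parse, e.g. a failed download
    }

    // Parse quietly so markup warnings are not printed.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if (stripos($href, $needle) !== false) {
            $hrefs[] = $href;
        }
    }
    return $hrefs;
}

// Inline sample using the two URLs from the question.
$page = '<html><body>'
      . '<a href="http://xyz.com/abc.html">one</a>'
      . '<a href="http://abc.superpower.com/hddll.html">two</a>'
      . '</body></html>';

foreach (links_containing($page, 'super') as $link) {
    echo $link, PHP_EOL;
}
```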
