I have the following code and am trying to extract the value of the content attribute from an HTML page, but it does not give the result I expect; it only produces a blank page.
Can anyone help me find where the issue is?
$url= "https://fr-ca.wordpress.org";
$html = file_get_contents($url);
# Create a DOM parser object
$dom = new DOMDocument();
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('meta') as $key ) {
echo "<pre>";
$tab[] = $key->getAttribute('content');
}
$reg= '<meta name="generator" content="(.*?)"/>';
if (preg_match_all($reg, $html, $ar)) {
print_r($ar);
}
Page source has :
<meta name="generator" content="WP 4.5"/>
Try this:
$html = '<meta name="generator" content="WP 4.5"/>';
preg_match_all('/content="(.*)"/i', $html, $matches);
if (isset($matches[1])) {
print_r($matches[1]);
}
Here is a regex that will look for a meta tag and get the content attribute contents. It has some wild cards that will account for other variables such as different names, or extra spaces, etc.
$html = '<meta name="generator" content="WP 4.5"/>';
preg_match_all( '#<meta.*?content=[\'"](.*?)[\'"]\s*/>#i', $html, $results );
print_r( $results[1] ); // contains array of captures.
if( $results[1] ) {
// code here...
}
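As an alternative to the regexes above, the same lookup can be done with DOMDocument, which does not depend on attribute order or spacing. This is only a sketch: metaContent() is a made-up helper name, and the sample markup is a trimmed stand-in for the real page.

```php
<?php
// Hypothetical helper: returns the content attribute of <meta name="$name">,
// or null when no such tag exists.
function metaContent(string $html, string $name): ?string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
    $dom->loadHTML($html);
    libxml_clear_errors();

    foreach ($dom->getElementsByTagName('meta') as $meta) {
        // Compare the name attribute case-insensitively.
        if (strcasecmp($meta->getAttribute('name'), $name) === 0) {
            return $meta->getAttribute('content');
        }
    }
    return null;
}

// Sample markup (an assumption, standing in for the fetched page).
$sample = '<html><head><meta name="generator" content="WP 4.5"/></head><body></body></html>';
echo metaContent($sample, 'generator'), PHP_EOL;   // prints "WP 4.5"
```

For a live page, pass the string returned by file_get_contents($url) instead of $sample.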
Please use it like this:
$html = file_get_contents( $url);
libxml_use_internal_errors( true);
$doc = new DOMDocument;
$doc->loadHTML( $html);
$xpath = new DOMXpath( $doc);
// A name attribute on a <div>???
$nodes = $xpath->query( '//div[@name="changeable_text"]')->item( 0);
echo $nodes->textContent;
OR
// Use Curl ...
function getHTML($url,$timeout)
{
$ch = curl_init($url); // initialize curl with given url
curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER["HTTP_USER_AGENT"]); // set useragent
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // write the response to a variable
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects if any
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout); // max. seconds to execute
curl_setopt($ch, CURLOPT_FAILONERROR, 1); // stop when it encounters an error
return @curl_exec($ch);
}
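// find() below is the Simple HTML DOM API, so parse the string first (requires simple_html_dom.php)
$html = str_get_html(getHTML("http://www.website.com",10));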
// Find all images on webpage
foreach($html->find("img") as $element)
echo $element->src . '<br>';
// Find all links on webpage
foreach($html->find("a") as $element)
echo $element->href . '<br>';
I have been searching for a solution to this problem for a long time without success.
I managed to extract the mp4 URL, but that link redirects to another URL, which can be seen in the Location response header. I don't know how to get this final URL.
[Image: response header]
<?php
function tidy_html($input_string) {
$config = array('output-html' => true,'indent' => true,'wrap'=> 800);
// Detect whether Tidy is configured
if( function_exists('tidy_get_release') ) {
$tidy = new tidy;
$tidy->parseString($input_string, $config, 'raw');
$tidy->cleanRepair();
$cleaned_html = tidy_get_output($tidy);
}
else {
# Tidy not configured for this Server
$cleaned_html = $input_string;
}
return $cleaned_html;
}
function getFromPage($webAddress,$path){
$source = file_get_contents($webAddress); //download the page
$clean_source = tidy_html($source);
$doc = new DOMDocument;
// suppress errors
libxml_use_internal_errors(true);
// load the html source from a string
$doc->loadHTML($clean_source);
$xpath = new DOMXPath($doc);
$data="";
$nodelist = $xpath->query($path);
$node_counts = $nodelist->length; // count how many nodes returned
if ($node_counts) { // it will be true if the count is more than 0
foreach ($nodelist as $element) {
$data= $data.$element->nodeValue . "\n";
}
}
return $data;
}
$vidID = 4145616; //videoid : https://video.sibnet.ru/shell.php?videoid=4145616
$link1 = getFromPage("https://video.sibnet.ru/shell.php?videoid=".$vidID,"/html/body/script[21]/text()"); // Use XPath
$json = urldecode($link1);
$link2 = strstr($json, "player.src");
$url = substr($link2, 0, strpos($link2, ","));
$url =str_replace('"',"",$url);
$url = substr($url , 18);
//header('Location: https://video.sibnet.ru'.$url);
echo ('https://video.sibnet.ru'.$url)
?>
<?php
$url='https://video.sibnet.ru'.$url;
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$a = curl_exec($ch);
$url = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL); // This is what you need, it will return you the last effective URL
$realUrl = $url; //here you go
?>
SOURCE: https://stackoverflow.com/a/17473000/14885297
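If you want to read the Location header yourself rather than rely on CURLINFO_EFFECTIVE_URL, the raw header text that curl returns when CURLOPT_HEADER is set can be scanned for it. A minimal sketch; lastLocation() and the sample header block are made up for illustration:

```php
<?php
// Hypothetical helper: returns the value of the last Location header found
// in a raw HTTP header block, or null if there is none.
function lastLocation(string $rawHeaders): ?string
{
    // /m lets ^ and $ match at line boundaries; /i ignores header-name case.
    if (preg_match_all('/^Location:\s*(\S+)\s*$/mi', $rawHeaders, $m)) {
        return end($m[1]);   // with several redirects, the last one wins
    }
    return null;
}

// Sample header block (an assumption, not taken from the real site).
$raw = "HTTP/1.1 302 Found\r\n"
     . "Location: https://cdn.example.com/v/1.mp4\r\n"
     . "Content-Length: 0\r\n\r\n";

echo lastLocation($raw), PHP_EOL;   // prints "https://cdn.example.com/v/1.mp4"
```

With the curl code above, $a holds headers followed by the body, so passing $a to this helper should work as long as the body itself does not contain a line starting with "Location:".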
I tried to extract the download URLs from the webpage.
The code I tried is below:
function getbinaryurl ($url)
{
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_FRESH_CONNECT, true);
$value1 = curl_exec($curl);
curl_close($curl);
$start = preg_quote('<script type="text/x-component">', '/');
$end = preg_quote('</script>', '/');
$rx = preg_match("/$start(.*?)$end/", $value1, $matches);
var_dump($matches);
}
$url = "https://www.sourcetreeapp.com/download-archives";
getbinaryurl($url);
This way I get the tag information, not the content inside the script tag. How do I get the content inside?
The expected result is:
https://product-downloads.atlassian.com/software/sourcetree/ga/Sourcetree_4.0.1_234.zip,
https://product-downloads.atlassian.com/software/sourcetree/windows/ga/SourceTreeSetup-3.3.6.exe,
https://product-downloads.atlassian.com/software/sourcetree/windows/ga/SourcetreeEnterpriseSetup_3.3.6.msi
I am very new to writing regular expressions. Can anyone help me, please?
Instead of using regex, using DOMDocument and XPath gives you more control over the elements you select.
Although XPath can be difficult (as can regex), some find it more intuitive. The code uses //script[@type="text/x-component"][contains(text(), "macURL")] which, broken down, is:
//script = any script node
[@type="text/x-component"] = which has an attribute called type with the specific value
[contains(text(), "macURL")] = whose text contains the string macURL
The query() method returns a list of matches, so loop over them. The content is JSON, so decode it and output the values:
function getbinaryurl ($url)
{
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_FRESH_CONNECT, true);
$value1 = curl_exec($curl);
curl_close($curl);
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($value1);
libxml_use_internal_errors(false);
$xp = new DOMXPath($doc);
$srcs = $xp->query('//script[@type="text/x-component"][contains(text(), "macURL")]');
foreach ( $srcs as $src ) {
$content = json_decode( $src->textContent, true);
echo $content['params']['macURL'] . PHP_EOL;
echo $content['params']['windowsURL'] . PHP_EOL;
echo $content['params']['enterpriseURL'] . PHP_EOL;
}
}
$url = "https://www.sourcetreeapp.com/download-archives";
getbinaryurl($url);
which outputs
https://product-downloads.atlassian.com/software/sourcetree/ga/Sourcetree_4.0.1_234.zip
https://product-downloads.atlassian.com/software/sourcetree/windows/ga/SourceTreeSetup-3.3.8.exe
https://product-downloads.atlassian.com/software/sourcetree/windows/ga/SourcetreeEnterpriseSetup_3.3.8.msi
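To try the XPath expression without a network request, here is a sketch that runs it against an inline sample. The markup and the example.com URLs are invented; only the shape (a script of type text/x-component holding JSON with a params object) follows the answer above.

```php
<?php
// Invented sample page: two x-component scripts, only one mentions macURL.
$html = <<<'HTML'
<html><body>
<script type="text/x-component">{"type":"other","params":{}}</script>
<script type="text/x-component">{"params":{"macURL":"https://example.com/app.zip","windowsURL":"https://example.com/app.exe"}}</script>
</body></html>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings quiet
$doc->loadHTML($html);
libxml_clear_errors();

$xp = new DOMXPath($doc);
$urls = [];
$query = '//script[@type="text/x-component"][contains(text(), "macURL")]';
foreach ($xp->query($query) as $script) {
    // The script body is JSON; decode it into an associative array.
    $data = json_decode($script->textContent, true);
    if (is_array($data) && isset($data['params'])) {
        $urls = $data['params'];
    }
}
print_r($urls);
```

Checking the json_decode() result before indexing into it avoids notices when a matching script turns out not to contain valid JSON.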
I want to get meta tags from an external URL. Unfortunately, the meta tags on that website are placed after the closing </html> tag.
I used get_meta_tags($url), but it didn't work.
Here is the source of the external URL; my description meta tag is at the very end.
<html><meta http-equiv="content-type" content="text/html; charset=UTF-8">
<head><title>Tools</title>
</head>
<body><h2>Sitemap Notification Received</h2>
<br>
Your Sitemap has been successfully added to our list of Sitemaps to crawl. If this is the first time you are notifying Google about this Sitemap, please add it via http://www.google.com/webmasters/tools/ so you can track its status. Please note that we do not add all submitted URLs to our index, and we cannot make any predictions or guarantees about when or if they will appear.</body></html>
<meta name='description' content='200'>
This function should help you:
function file_get_contents_curl($url)
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$html = file_get_contents_curl("http://example.com/");
//parsing begins here:
$doc = new DOMDocument();
@$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');
//get and display what you need:
$title = $nodes->item(0)->nodeValue;
$metas = $doc->getElementsByTagName('meta');
for ($i = 0; $i < $metas->length; $i++)
{
$meta = $metas->item($i);
if($meta->getAttribute('name') == 'description')
$description = $meta->getAttribute('content');
if($meta->getAttribute('name') == 'keywords')
$keywords = $meta->getAttribute('content');
}
echo "Title: $title". '<br/><br/>';
echo "Description: $description". '<br/><br/>';
echo "Keywords: $keywords";
Or, more simply, with this:
<?php
// Assuming the above tags are at www.example.com
$tags = get_meta_tags('http://www.example.com/');
// Notice how the keys are all lowercase now, and
// how . was replaced by _ in the key.
echo $tags['author']; // name
echo $tags['keywords']; // php documentation
echo $tags['description']; // a php manual
echo $tags['geo_position']; // 49.33;-86.59
?>
I want to get the value of the <title> tag for every page of my website. I am trying to run the script only on my own domain, collect all the page links on the site, and get the title of each page.
This is my code:
$html = file_get_contents('http://xxxxxxxxx.com');
//Create a new DOM document
$dom = new DOMDocument;
//Parse the HTML. The @ is used to suppress any parsing errors
//that will be thrown if the $html string isn't valid XHTML.
@$dom->loadHTML($html);
//Get all links. You could also use any other tag name here,
//like 'img' or 'table', to extract other tags.
$links = $dom->getElementsByTagName('a');
//Iterate over the extracted links and display their URLs
foreach ($links as $link){
//Extract and show the "href" attribute.
echo $link->nodeValue;
echo $link->getAttribute('href'), '<br>';
}
What I get is z1.html and z2.
My z1.html has a title named z3. I want to get z1.html and z3, not z2. Can anyone help me?
Adding a bit to hitesh's answer: this checks that each element has attributes and that the desired attribute exists. It also checks that getting the 'title' elements actually returns at least one item before trying to grab the first one ($a_html_title->item(0)).
I also added a curl option to follow redirects (I needed it for my hardcoded test against google.com).
foreach ($links as $link) {
//Extract and show the "href" attribute.
if ($link->hasAttributes()){
if ($link->hasAttribute('href')){
$href = $link->getAttribute('href');
$href = 'http://google.com'; // hardcoding just for testing
echo $link->nodeValue;
echo "<br/>".'MY ANCHOR LINK : - ' . $href . "---TITLE--->";
$a_html = my_curl_function($href);
$a_doc = new DOMDocument();
@$a_doc->loadHTML($a_html);
$a_html_title = $a_doc->getElementsByTagName('title');
//get and display what you need:
if ($a_html_title->length){
$a_html_title = $a_html_title->item(0)->nodeValue;
echo $a_html_title;
echo '<br/>';
}
}
}
}
function my_curl_function($url) {
$curl_handle = curl_init();
curl_setopt($curl_handle, CURLOPT_URL, $url);
curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 2);
curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl_handle, CURLOPT_USERAGENT, 'name');
curl_setopt($curl_handle, CURLOPT_FOLLOWLOCATION, TRUE); // added this
$html = curl_exec($curl_handle);
curl_close($curl_handle);
return $html;
}
You need to write your own function and call it in the appropriate places. If you need to get several tags from the pages that the anchor tags link to, you just need to create another function.
The code below will help you get started:
$html = my_curl_function('http://www.anchorartspace.org/');
$doc = new DOMDocument();
@$doc->loadHTML($html);
$mytag = $doc->getElementsByTagName('title');
//get and display what you need:
$title = $mytag->item(0)->nodeValue;
$links = $doc->getElementsByTagName('a');
//Iterate over the extracted links and display their URLs
foreach ($links as $link) {
//Extract and show the "href" attribute.
echo $link->nodeValue;
echo "<br/>".'MY ANCHOR LINK : - ' . $link->getAttribute('href') . "---TITLE--->";
$a_html = my_curl_function($link->getAttribute('href'));
$a_doc = new DOMDocument();
@$a_doc->loadHTML($a_html);
$a_html_title = $a_doc->getElementsByTagName('title');
//get and display what you need:
$a_html_title = $a_html_title->item(0)->nodeValue;
echo $a_html_title;
echo '<br/>';
}
echo "Title: $title" . '<br/><br/>';
function my_curl_function($url) {
$curl_handle = curl_init();
curl_setopt($curl_handle, CURLOPT_URL, $url);
curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 2);
curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl_handle, CURLOPT_USERAGENT, 'name');
$html = curl_exec($curl_handle);
curl_close($curl_handle);
return $html;
}
Let me know if you need any more help.
preg has always been a tool I like, but I can't figure out for the life of me whether what I want to do is possible, let alone how to do it.
What I want is for preg_match to return a div's innerHTML. The problem is that the div I'm trying to read has more divs inside it, and my pattern keeps closing on the first closing tag it finds.
Here is my actual code:
$scrape_address = "http://isohunt.com/torrent_details/133831593/98e034bd6382e0f4ecaa9fe2b5eac01614edc3c6?tab=summary";
$ch = curl_init($scrape_address);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, '1');
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_ENCODING, "");
$data = curl_exec($ch);
preg_match('% <div id="torrent_details">(.*)</div> %six', $data, $match);
print_r($match);
This has been updated thanks to TomcatExodus's help.
Live at :: http://megatorrentz.com/beta/details.php?hash=98e034bd6382e0f4ecaa9fe2b5eac01614edc3c6
<?php
$scrape_address = "http://isohunt.com/torrent_details/133831593/98e034bd6382e0f4ecaa9fe2b5eac01614edc3c6?tab=summary";
$ch = curl_init($scrape_address);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, '1');
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_ENCODING, "");
$data = curl_exec($ch);
$domd = new DOMDocument();
libxml_use_internal_errors(true);
$domd->loadHTML($data);
libxml_use_internal_errors(false);
$div = $domd->getElementById("torrent_details");
if ($div) {
$dom2 = new DOMDocument();
$dom2->appendChild($dom2->importNode($div, true));
echo $dom2->saveHTML();
} else {
echo "Has no element with the given ID\n";
}
Using regular expressions often leads to problems when parsing markup documents.
Here is an XPath version, which is independent of the source layout. The only thing you need is a div with that id.
<?php
$dom = new DOMDocument();
@$dom->loadHTMLFile($url);
$xp = new DOMXPath($dom);
$result = $xp->query("//*[@id = 'torrent_details']");
$div=$result->item(0);
if($result->length){
$out =new DOMDocument();
$out->appendChild($out->importNode($div, true));
echo $out->saveHTML();
}else{
echo "No such id";
}
?>
And this is the fix for Maerlyn's solution. It didn't work because getElementById() needs a DTD that declares which attribute is the ID. After all, you could build a document that uses "apple" as the identifying attribute, so something has to say that "id" really is the ID attribute for this tag.
<?php
$domd = new DOMDocument();
$domd->validateOnParse = true;
@$domd->loadHTML($data);
//this doesn't work as the DTD is not specified
//or the specified id attribute is not the attribute called "id"
//$div = $domd->getElementById("torrent_details");
/*
* workaround found here: https://fosswiki.liip.ch/display/BLOG/GetElementById+Pitfalls
* set the "id" attribute as the real id
*/
$elements = $domd->getElementsByTagName('div');
if (!is_null($elements)) {
foreach ($elements as $element) {
//try-catch needed because of elements with no id
try{
$element->setIdAttribute('id', true);
}catch(Exception $e){}
}
}
//now it works
$div = $domd->getElementById("torrent_details");
//Print its content or error
if ($div) {
$dom2 = new DOMDocument();
$dom2->appendChild($dom2->importNode($div, true));
echo $dom2->saveHTML();
} else {
echo "Has no element with the given ID\n";
}
?>
Both of the solutions work for me.
You can do this:
/<div id="torrent_details"[^>]*>(.*)<\/div>/i
Which would give you the largest possible innerHTML.
You cannot. I will not link to the famous question, because I dislike the pointless drivel on top. But still regular expressions are unfit to match nested structures.
You can use some trickery, but this is neither reliable, nor necessarily fast:
preg_match_all('#<div id="1">((<div>.*?</div>|.)*?)</div>#ims'
Your regex had a problem due to the /x flag not matching the opening div. And you used a wrong assertion notation.
preg_match_all('% <div \s+ id="torrent_details">(?<innerHtml>.*)</div> %six', $html, $match);
echo $match['innerHtml'];
That one will work, but you should only need preg_match, not preg_match_all: if the page is well written, there should be only one instance of id="torrent_details" on it.
I'm retracting my answer. This will not work properly. Use DOM for navigating the document.
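To illustrate why the regex route breaks on nested divs and what the DOM route returns instead, here is a small sketch on made-up markup (the id matches the question; everything else is invented):

```php
<?php
// Invented sample: the target div contains a nested div.
$html = '<div id="torrent_details"><div>inner</div><p>tail</p></div><div>other</div>';

// A lazy match stops at the first </div>, so the content is cut short.
preg_match('%<div id="torrent_details">(.*?)</div>%s', $html, $m);
$regexResult = $m[1];   // '<div>inner'

// DOM approach: locate the element by id with XPath, then serialise its children.
$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings quiet
$doc->loadHTML($html);
libxml_clear_errors();

$xp  = new DOMXPath($doc);
$div = $xp->query('//*[@id="torrent_details"]')->item(0);

$inner = '';
foreach ($div->childNodes as $child) {
    $inner .= $doc->saveHTML($child);   // saveHTML() accepts a node argument
}

echo $regexResult, PHP_EOL;
echo $inner, PHP_EOL;
```

A greedy (.*) has the opposite problem: it would run on to the last </div> in the document and swallow the unrelated "other" div as well.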
Haha, I did it with a bit of tampering. Thanks for the DOMDocument idea; I just had to use SimpleXML:
$ch = curl_init($scrape_address);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, '1');
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_ENCODING, "");
$data = curl_exec($ch);
$doc = new DOMDocument();
libxml_use_internal_errors(false);
$doc->strictErrorChecking = FALSE;
libxml_use_internal_errors(true);
$doc->loadHTML($data);
$xml = simplexml_import_dom($doc);
print_r($xml->body->table->tr->td->table[2]->tr->td[0]->span[0]->div);