I created a program in PHP using cURL that can fetch the data of any site and display it in the browser. The other part of the program saves that data to a file using file handling; after saving, it should find all the HTTP links within the body tag of the saved file. My code displays every site I fetch in the browser, but I cannot find all the HTTP links.
Kindly help me out with this problem.
PHP Code:
<!DOCTYPE html>
<html>
<head>
<title>Display links using Curl</title>
</head>
<body>
<?php
$GetData = curl_init();
$url = "http://www.ucertify.com/";
curl_setopt($GetData, CURLOPT_URL, $url);
curl_setopt($GetData, CURLOPT_RETURNTRANSFER, 1);
$data = curl_exec($GetData);
curl_close($GetData);

$file = fopen("content.txt", "w");
fputs($file, $data);
fclose($file);
echo $data;

function links() {
    // read the saved copy instead of fetching the page a second time
    $file_content = file_get_contents("content.txt");
    $dom_obj = new DOMDocument();
    @$dom_obj->loadHTML($file_content); // @ suppresses warnings on invalid markup; a leading # would comment the call out entirely
    $xpath = new DOMXPath($dom_obj);
    $links_href = $xpath->evaluate("/html/body//a");
    for ($i = 0; $i < $links_href->length; $i++) {
        $href = $links_href->item($i);
        $url = $href->getAttribute("href");
        // skip fragment-only and javascript: pseudo-links
        if (strstr($url, "#") || strstr($url, "javascript:void(0)") || $url == "javascript:;" || $url == "javascript:") {
            continue;
        }
        echo "<div>" . $url . "</div>";
    }
}
links(); // links() echoes its output, so there is nothing to echo here
?>
</body>
</html>
You can use a regex like this:
preg_match("/<body[^>]*>(.*?)<\/body>/is", $file_data, $body_content);
preg_match_all("/\b(?:(?:https?|ftp):\/\/|www\.)[-a-z0-9+&@#\/%?=~_|!:,.;]*[-a-z0-9+&@#\/%=~_|]/i", $body_content[1], $matches);
foreach ($matches[0] as $d) {
    echo $d . "<br>";
}
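If you would rather not run a regex over HTML, here is a minimal DOM-based sketch of the same idea. It reads the content.txt file your script already writes (the file name comes from your code; the rest is standard DOMDocument/DOMXPath usage):
$saved = file_get_contents("content.txt");
$dom = new DOMDocument();
libxml_use_internal_errors(true); // silence warnings on real-world markup
$dom->loadHTML($saved);
$xpath = new DOMXPath($dom);
foreach ($xpath->query("/html/body//a[@href]") as $a) {
    $href = $a->getAttribute("href");
    // keep only absolute http/https links
    if (preg_match('#^https?://#i', $href)) {
        echo "<div>" . htmlspecialchars($href) . "</div>";
    }
}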
Related
I want the HTML code from the URL.
Actually, I want the following things from the data at one URL:
1. blog title
2. blog image
3. blog posted date
4. blog description or actual blog text
I tried the code below, but with no success.
<?php
$c = curl_init('http://54.174.50.242/blog/');
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
//curl_setopt(... other options you want...)
$html = curl_exec($c);
if (curl_error($c)) {
    die(curl_error($c));
}
// Get the status code
$status = curl_getinfo($c, CURLINFO_HTTP_CODE);
curl_close($c);
echo "Status: " . $status;
die;
?>
Please help me out with getting the necessary data from the URL (http://54.174.50.242/blog/).
Thanks in advance.
You are halfway there. Your cURL request is working, and the $html variable contains the blog page's source code. Now you need to extract the data you need from the HTML string. One way to do that is with the DOMDocument class.
Here is something you could start with:
$c = curl_init('http://54.174.50.242/blog/');
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($c);
$dom = new DOMDocument;
// disable errors on invalid html
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$list = $dom->getElementsByTagName('title');
$title = $list->length ? $list->item(0)->textContent : '';
// and so on ...
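For the specific fields you listed (title, image, date, text), XPath is usually more convenient than chained DOM calls. Here is a rough sketch; the selectors are guesses, since I cannot see the blog's markup, so inspect the page source and adjust them:
$xpath = new DOMXPath($dom);
// every selector below is an assumption about the blog's markup
foreach ($xpath->query("//article") as $post) {
    $title = $xpath->evaluate("string(.//h2)", $post);
    $image = $xpath->evaluate("string(.//img/@src)", $post);
    $date  = $xpath->evaluate("string(.//*[contains(@class,'date')])", $post);
    $text  = $xpath->evaluate("string(.//p)", $post);
    echo trim($title) . " | " . trim($date) . "\n";
}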
You can also simplify that by using the loadHTMLFile method of the DOMDocument class; that way you don't have to worry about all the cURL boilerplate:
$dom = new DOMDocument;
// disable errors on invalid html
libxml_use_internal_errors(true);
$dom->loadHTMLFile('http://54.174.50.242/blog/');
$list = $dom->getElementsByTagName('title');
$title = $list->length ? $list->item(0)->textContent : '';
echo $title;
// and so on ...
You could also use Simple HTML DOM Parser and extract the HTML like this:
// requires the Simple HTML DOM library (simple_html_dom.php)
$html = @file_get_html($url);
foreach ($html->find('article') as $element) {
    $title = $element->find('h2', 0)->plaintext;
    // ...
}
I am also using this; I hope it works for you.
I am trying to parse the text content from a given URL. Here is the code:
<?php
$url = 'http://stackoverflow.com/questions/12097352/how-can-i-parse-dynamic-content-from-a-web-page';
$content = file_get_contents($url);
echo $content; // This outputs everything on the page, images and all
$text = escapeshellarg(strip_tags($content));
echo "<br>";
echo $text; // This still shows source code, not only the text content of the page
?>
I want to get only the text shown on the page, not the page source code. Any ideas? I have already googled, but only the method above turns up everywhere.
You can use DOMDocument and DOMNode:
$doc = new DOMDocument();
libxml_use_internal_errors(true); // avoid warnings on real-world markup
$doc->loadHTMLFile($url);
$xpath = new DOMXPath($doc);
foreach ($xpath->query("//script") as $script) {
    $script->parentNode->removeChild($script);
}
$textContent = $doc->textContent; // inherited from DOMNode
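Note that <style> blocks leak into textContent the same way <script> blocks do. A small extension of the snippet above (my addition, not part of the original answer) strips both and collapses the leftover whitespace:
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTMLFile($url);
$xpath = new DOMXPath($doc);
foreach ($xpath->query("//script|//style") as $node) {
    $node->parentNode->removeChild($node);
}
// collapse the runs of whitespace the removed nodes leave behind
$textContent = trim(preg_replace('/\s+/', ' ', $doc->textContent));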
Instead of using XPath, you can also do:
$doc = new DOMDocument();
libxml_use_internal_errors(true); // same caveat about invalid markup
$doc->loadHTMLFile($url); // Load the HTML
// getElementsByTagName returns a live list, so copy it before removing nodes
foreach (iterator_to_array($doc->getElementsByTagName('script')) as $script) {
    $script->parentNode->removeChild($script); // remove script and content
                                               // so it will not appear in the text
}
$textContent = $doc->textContent; // inherited from DOMNode, gets the text
$content = strip_tags(file_get_contents($url));
This will remove the HTML tags coming from the page (note that strip_tags has to wrap the fetched contents, not the URL).
To remove HTML tags, use:
$text = strip_tags($text);
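strip_tags() also accepts a second argument listing tags to keep, which helps when you want readable text but still need basic structure:
// keep <p> and <a> so paragraphs and links survive
$text = strip_tags($text, '<p><a>');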
A simple cURL call will solve the issue. [TESTED]
<?php
$ch = curl_init("http://stackoverflow.com/questions/12097352/how-can-i-parse-dynamic-content-from-a-web-page");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // follow redirects (sorry, forgot to add this at first)
echo strip_tags(curl_exec($ch));
curl_close($ch);
?>
<?php
$file = 'http://www.google.com';
$doc = new DOMDocument();
@$doc->loadHTML(file_get_contents($file)); // @ suppresses invalid-markup warnings
$element = $doc->getElementsByTagName('span');
echo $doc->getElementsByTagName('span')->item(2)->nodeValue;
if (0 != $element->length)
{
    $content = trim($element->item(2)->nodeValue);
    if (empty($content))
    {
        $content = trim($element->item(2)->textContent);
    }
    echo $content . "\n";
}
?>
I'm trying to get the inner content of a span tag from google.com's home page. This code should output the first span tag, but it is not outputting any results.
This is not an error... the first span on http://www.google.com is empty, and I am not sure what else you expected:
<span class=gbtcb></span> <---------------- item(0)
<span class=gbtb2></span> <---------------- item(1)
<span class=gbts>Search</span> <----------- item(2)
Try
$element = $doc->getElementsByTagName('span')->item(2);
var_dump($element->nodeValue);
Output
Search
First, bear in mind that HTML is not necessarily valid XML.
That aside, check that you're actually getting some contents to parse; you need to have allow_url_fopen enabled in order to use file_get_contents() with URLs.
In general, avoid using the error suppression operator (@), because it will almost certainly come back to bite you some time (and this time might well be that time); there is a discussion of this elsewhere on SO.
So, as a first step, switch to something like the following and let me know whether you're getting any contents at all.
// stop using @ to suppress errors
$contents = file_get_contents($file);
// check that you're getting something to parse
echo $contents;
Try this and tell us what the output is
<?php
echo ini_get('allow_url_fopen');
?>
Try using cURL to get the data and then load it into a DOMDocument:
<?php
$url = "http://www.google.com";
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$data = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument();
@$dom->loadHTML($data); // The @ is necessary to suppress warnings from invalid markup
$element = $dom->getElementsByTagName('span');
echo $element->item(2)->nodeValue;
if (0 != $element->length)
{
    $content = trim($element->item(2)->nodeValue);
    if (empty($content))
    {
        $content = trim($element->item(2)->textContent);
    }
    echo $content . "\n";
}
?>
I am trying to parse an XML file from a web service. I am using a JavaScript loadXMLString function to parse the XML into HTML. With a local file it was working fine when I inserted the XML code into a variable, but to get the XML from an external link I have used a PHP function here, like this:
<?php
$request = "http://www.somewebsite.com/feeds/get-cities.php?vendor_key=xxx";
$response = file_get_contents($request);
$xmlstring = htmlspecialchars($response, ENT_QUOTES);
?>
<script language="javascript">
function loadXMLString(txt)
{
if (window.DOMParser)
{
parser=new DOMParser();
xmlDoc=parser.parseFromString(txt,"text/xml");
}
else // Internet Explorer
{
xmlDoc=new ActiveXObject("Microsoft.XMLDOM");
xmlDoc.async=false;
xmlDoc.loadXML(txt);
}
return xmlDoc;
}//function loadXMLString ends
text = <?php $xmlstring;?>
xmlDoc=loadXMLString(text);
document.write("<table border='1'>");
var x=xmlDoc.getElementsByTagName("city");
for (i=0;i<x.length;i++)
{
document.write("<tr style='background:#dddddd;'><td>");
document.write(x[i].getElementsByTagName("name")[0].childNodes[0].nodeValue);
document.write("</td><td>");
document.write(x[i].getElementsByTagName("country")[0].childNodes[0].nodeValue);
document.write("</td></tr>");
}
document.write("</table>");
</script>
In the above code I am trying to insert the XML code from the PHP variable $xmlstring into the JavaScript variable text, but it displays nothing. If I put the XML code inside the script like below, it works perfectly:
text="<cities>";
text=text+"<city>";
text=text+"<name>bulga</name>";
text=text+"<country>Giada De Laurentiis</country>";
text=text+"<city_id>2005</city_id>";
text=text+"</city>";
text=text+"</cities>";
Does anybody know how I can parse it? Or, if somebody has a better solution, please suggest that as well.
Try changing the following line in your code to
text = <?php echo $xmlstring;?>
so that it actually echoes your variable's value.
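Also, because the value lands in JavaScript unquoted, json_encode() is a safer way to hand a PHP string to a script; it adds the quotes and escapes the content for you (a suggestion on top of the original answer, not part of it):
// pass the raw response; json_encode() quotes and escapes it for JavaScript
text = <?php echo json_encode($response); ?>;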
With the help of GBD I have written the following code, and it starts displaying the city list. But when I try this with different XML code it does not work; maybe someone has a better solution for this.
<?php
function curl_get_file_contents($URL)
{
$c = curl_init();
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($c, CURLOPT_URL, $URL);
$contents = curl_exec($c);
curl_close($c);
if ($contents) return $contents;
else return FALSE;
}
$xmlString = curl_get_file_contents("http://www.somesite.com/feeds/get-cities.php?vendor_key=xxx");
?>
<script language="javascript">
function loadXMLString(text)
{
if (window.DOMParser)
{
parser=new DOMParser();
xmlDoc=parser.parseFromString(text,"text/xml");
}
else // Internet Explorer
{
xmlDoc=new ActiveXObject("Microsoft.XMLDOM");
xmlDoc.async=false;
xmlDoc.loadXML(text);
}
return xmlDoc;
}
var text = "<?php echo substr_replace($xmlString,"",0,39);?>"; // strips the first 39 characters (the XML declaration)
xmlDoc=loadXMLString(text);
document.write("<table border='1'>");
var x=xmlDoc.getElementsByTagName("city");
for (i=0;i<x.length;i++)
{
document.write("<tr style='background:#dddddd;'><td>");
document.write(x[i].getElementsByTagName("name")[0].childNodes[0].nodeValue);
document.write("</td><td>");
document.write(x[i].getElementsByTagName("country")[0].childNodes[0].nodeValue);
document.write("</td></tr>");
}
document.write("</table>");
</script>
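As an aside, cutting exactly 39 characters off the response is fragile, because the XML declaration's length changes with its attributes. A regexp-based strip (my suggestion, not part of the code above) survives that:
// remove an optional XML declaration, however long it happens to be
$xmlString = preg_replace('/^<\?xml[^>]*\?>\s*/', '', $xmlString);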
I've been picking up bits and pieces of code; you can see roughly what I'm trying to do. Obviously this doesn't work and is utterly wrong:
<?php
$dom= new DOMDocument();
$dom->loadHTMLFile('http://example.com/');
$data = $dom->getElementById("profile_section_container");
$html = $data->saveHTML();
echo $html;
?>
Using a CURL call, I am able to retrieve the document URL source:
function curl_get_file_contents($URL)
{
$c = curl_init();
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($c, CURLOPT_URL, $URL);
$contents = curl_exec($c);
curl_close($c);
if ($contents) return $contents;
else return FALSE;
}
$f = curl_get_file_contents('http://example.com/');
echo $f;
So how can I use this now to instantiate a DOMDocument object in PHP and extract a node using getElementById?
This is the code you will need to avoid any malformed HTML errors:
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTMLFile('http://example.com/');
$data = $dom->getElementById("banner");
echo $data->nodeValue . "\n";
To dump whole HTML source you can call:
echo $dom->saveHTML();
<?php
$f = curl_get_file_contents('http://example.com/');
$dom = new DOMDocument();
@$dom->loadHTML($f); // @ suppresses warnings from malformed HTML
$data = $dom->getElementById("profile_section_container");
$html = $dom->saveHTML($data);
echo $html;
?>
It would help if you provided the example html.
I'm not sure, but I remember that once, when I wanted to use this, I was unable to load some external URL as a file because the php.ini directive allow_url_fopen was set to off...
So check your php.ini, or try to open the URL to see if you can read it as a file:
<?php
$f = file_get_contents('http://example.com/');
var_dump($f); // just to see the content
?>
Regards;
mimiz
Try this:
$dom = new DOMDocument();
$dom->loadHTMLFile('http://example.com/');
$data = $dom->getElementById("profile_section_container"); // getElementById returns a single node, not a list
$html = $dom->saveHTML($data); // pass the node to saveHTML to dump just that element
echo $html;
I think that now you can use DOMDocument::loadHTML.
Maybe you should test whether the doctype exists (with a regexp) and then add it if necessary, to be sure it is declared...
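A rough sketch of that idea, with the regexp being my guess at what is meant:
// prepend a doctype when the fetched markup lacks one
if (!preg_match('/^\s*<!DOCTYPE/i', $html)) {
    $html = "<!DOCTYPE html>\n" . $html;
}
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);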
Regards
Mimiz