Sorry for the long code, I'm really losing it.
This code is supposed to receive a list of URLs through POST, from a textarea with a line break between each URL. The script should download each URL, go through the HTML and grab some links, then follow those links, pull out some data and echo it out.
For some reason it looks as if getDetails() is only running once, since I'm getting only one set of results.
I have checked multiple times that the foreach loop handles each URL separately, and that part is working.
Can anyone spot the problem?
require_once('simple_html_dom.php');

function getDetails($html) {
    $dom = new simple_html_dom;
    $dom->load($html);
    $title = $dom->find('h1', 0)->find('a', 0);
    foreach ($dom->find('span[style="color:#333333"]') as $element) {
        $address = $element->innertext;
    }
    $address = str_replace("<br>", " ", $address);
    $address = str_replace(",", " ", $address);
    $title->innertext = str_replace(",", " ", $title->innertext);
    if ($address == "") {
        $exp = explode("<strong><strong>", $html);
        $exp2 = explode("</strong>", $exp[1]);
        $address = $exp2[0];
    }
    echo $title->innertext . "," . $address . "<br>";
}
function getHtml($Url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $Url);
    curl_setopt($ch, CURLOPT_REFERER, "http://www.google.com/");
    curl_setopt($ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0");
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $output = curl_exec($ch);
    curl_close($ch);
    return $output;
}
function getdd($u) {
    $html = getHtml($u);
    $dom = new simple_html_dom;
    $dom->load($html);
    $durls = array(); // make sure an array is always returned, even when no links match
    foreach ($dom->find('a') as $element) {
        if (strstr($element->href, "display_one.asp")) {
            $durls[] = $element->href;
        }
    }
    return $durls;
}
if (isset($_POST['url'])) {
    $urls = explode("\n", $_POST['url']);
    foreach ($urls as $u) {
        $durls2 = getdd($u);
        $durls2 = array_unique($durls2);
        foreach ($durls2 as $durl) {
            $d = getHtml("http://www.example.co.il/" . $durl);
            getDetails($d);
        }
    }
}
You're only assigning the last element in the loop, it looks like. You'll need to concatenate. Something like $address .= $element->innertext; inside the loop (note the .= instead of =).
edit: unless I'm mistaken about what it's supposed to be doing. I think I may have been focusing on the wrong part of the code.
When you use DOMDocument on HTML, you load it with $dom->loadHTMLFile() or $dom->loadHTML(). You should also call libxml_use_internal_errors(true) beforehand so that it will not choke on improperly formatted HTML.
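For example, a minimal sketch of that loading pattern (reusing the $html string returned by your getHtml(); the link loop is just for illustration):

libxml_use_internal_errors(true); // keep malformed-HTML warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
// then query it like any other DOM, e.g. list every link on the page
foreach ($dom->getElementsByTagName('a') as $a) {
    echo $a->getAttribute('href') . "<br>";
}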
I tried to extract the download URL from the webpage.
The code I tried is below:
function getbinaryurl($url)
{
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_FRESH_CONNECT, true);
    $value1 = curl_exec($curl);
    curl_close($curl);

    $start = preg_quote('<script type="text/x-component">', '/');
    $end = preg_quote('</script>', '/');
    $rx = preg_match("/$start(.*?)$end/", $value1, $matches);
    var_dump($matches);
}
$url = "https://www.sourcetreeapp.com/download-archives";
getbinaryurl($url);
This way I am getting the tag info but not the content inside the script tag. How do I get the info inside?
The expected result is:
https://product-downloads.atlassian.com/software/sourcetree/ga/Sourcetree_4.0.1_234.zip,
https://product-downloads.atlassian.com/software/sourcetree/windows/ga/SourceTreeSetup-3.3.6.exe,
https://product-downloads.atlassian.com/software/sourcetree/windows/ga/SourcetreeEnterpriseSetup_3.3.6.msi
I am very new to writing regular expressions. Can anyone help me, please?
Instead of using regex, using DOMDocument and XPath allows you to have more control of the elements you select.
Although XPath can be difficult (same as regex), this can look more intuitive to some. The code uses //script[@type="text/x-component"][contains(text(), "macURL")] which, broken down, is
//script = any script node
[@type="text/x-component"] = which has an attribute called type with the specific value
[contains(text(), "macURL")] = whose text contains the string macURL
The query() method returns a list of matches, so loop over them. The content is JSON, so decode it and output the values...
function getbinaryurl($url)
{
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_FRESH_CONNECT, true);
    $value1 = curl_exec($curl);
    curl_close($curl);

    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($value1);
    libxml_use_internal_errors(false);

    $xp = new DOMXPath($doc);
    $srcs = $xp->query('//script[@type="text/x-component"][contains(text(), "macURL")]');
    foreach ($srcs as $src) {
        $content = json_decode($src->textContent, true);
        echo $content['params']['macURL'] . PHP_EOL;
        echo $content['params']['windowsURL'] . PHP_EOL;
        echo $content['params']['enterpriseURL'] . PHP_EOL;
    }
}
$url = "https://www.sourcetreeapp.com/download-archives";
getbinaryurl($url);
which outputs
https://product-downloads.atlassian.com/software/sourcetree/ga/Sourcetree_4.0.1_234.zip
https://product-downloads.atlassian.com/software/sourcetree/windows/ga/SourceTreeSetup-3.3.8.exe
https://product-downloads.atlassian.com/software/sourcetree/windows/ga/SourcetreeEnterpriseSetup_3.3.8.msi
I would like to retrieve the broken links of a given website.
I have this code, but it doesn't work.
Can you help me?
// function to check a url
function check_url($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $data = curl_exec($ch);
    $headers = curl_getinfo($ch);
    curl_close($ch);
    return $headers['http_code'];
}
if (check_url("https://www.amazon.com/") == 200) {
    echo "<br> The link is validated <br>";
} else {
    echo "<br>broken links<br>";
}
// this function reads the whole code of a website and retrieves the hyperlink tags
function getLinks() {
    $html = file_get_contents('https://www.amazon.com/');
    $dom = new domDocument;
    @$dom->loadHTML($html);
    $dom->preserveWhiteSpace = false;
    $links = $dom->getElementsByTagName('a');
    foreach ($links as $link) {
        $file = $link->getAttribute('href') . "<br>";
        $lien = "https://www.amazon.com/" . $file;
        echo $lien;
        echo linkexistence($lien);
    }
}
echo getLinks();
// The target is to search for broken links in a website and warn about the existence of those links
// check if a link exists and display the result for each
function linkexistence($url) {
    // get the headers for the url
    $test = get_headers($url, 1);
    $message = "";
    // use the preg_match function
    if (preg_match("#HTTP/1.1 200i#", $test[0])) {
        $message = "Valid";
    } elseif (preg_match("#HTTP/1.1 404i#", $test[0])) {
        $message = "Non-existent page ! (404)";
    } elseif (preg_match("#HTTP/1.1 301i#", $test[0])) {
        $message = "The page has been moved";
    } elseif (preg_match("#HTTP/1.1 403i#", $test[0])) {
        $message = "Access to the page refused! (403)";
    } else {
        $message = "Invalid links";
    }
    return $message;
}
The mask in your preg_match function is wrong; currently your mask is
#HTTP/1.1 200i#
but you have to use the following mask instead:
#HTTP/1.1 200#i
So move the "i" after the closing "#" in all of your preg_match calls.
The "i" modifier means the match is case-insensitive.
I'm creating a little web app to help me manage and analyze the content of my websites, and cURL is my favorite new toy. I've figured out how to extract info about all sorts of elements, how to find all elements with a certain class, etc., but I am stuck on two problems (see below). I hope there is some nifty XPath answer, but if I have to resort to regular expressions I guess that's ok. I'm not so great with regex, though, so if you think that's the way to go, I'd appreciate examples...
Pretty standard starting point:
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL, $target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
if (!$html) {
    $info .= "<br />cURL error number:" . curl_errno($ch);
    $info .= "<br />cURL error:" . curl_error($ch);
    return $info;
}
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
and extraction of info, for example:
// iframes
$iframes = $xpath->evaluate("/html/body//iframe");
$info .= '<h3>iframes ('.$iframes->length.'):</h3>';
for ($i = 0; $i < $iframes->length; $i++) {
    // get iframe attributes
    $iframe = $iframes->item($i);
    $framesrc = $iframe->getAttribute("src");
    $framewidth = $iframe->getAttribute("width");
    $frameheight = $iframe->getAttribute("height");
    $framealt = $iframe->getAttribute("alt");
    $frameclass = $iframe->getAttribute("class");
    $info .= $framesrc.' ('.$framewidth.'x'.$frameheight.'; class="'.$frameclass.'")'.'<br />';
}
Questions/Problems:
How to extract HTML comments?
I can't figure out how to identify the comments – are they considered nodes, or something else entirely?
How to get the entire content of a div, including child nodes? So if the div contains an image and a couple of hrefs, it would find those and hand it all back to me as a block of HTML.
Comment nodes should be easy to find in XPath with the comment() test, analogous to the text() test:
$comments = $xpath->query('//comment()'); // or another path, as you prefer
They are standard nodes: here is the manual entry for the DOMComment class.
To your other question, it's a bit trickier. The simplest way is to use saveXML() with its optional $node argument:
$html = $dom->saveXML($el); // $el should be the element you want to get the HTML for
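Putting the two together, a quick sketch (the div class here is made up for illustration; $dom and $xpath are from the setup above):

// grab a specific div and dump its full markup, child nodes included
foreach ($xpath->query('//div[@class="content"]') as $div) { // hypothetical class name
    echo $dom->saveXML($div); // outer HTML of the div, images and links and all
}
// and list the comment nodes
foreach ($xpath->query('//comment()') as $comment) {
    echo htmlspecialchars($comment->nodeValue) . '<br />';
}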
For the HTML comments a fast method is:
function getComments($html) {
    $comments = array();
    if (preg_match_all('#<\!--(.*?)-->#is', $html, $rcomments)) {
        // index 1 holds the captured comment bodies, index 0 the full matches
        foreach ($rcomments[1] as $c) {
            $comments[] = $c;
        }
        return $comments;
    } else {
        // no comments matched
        return null;
    }
}
This regex also helps:
\s*<!--[\s\S]+?-->
You can try it out in an online regex tester.
For comments you're looking for a recursive regex. For instance, to get rid of HTML comments:
preg_replace('/<!--(?(?=<!--)(?R)|.)*?-->/s', '', $yourHTML);
To find them:
preg_match_all('/(<!--(?(?=<!--)(?R)|.)*?-->)/s', $yourHTML, $comments);
I am developing my custom theme and I noticed that some of the links show me plain-text HTML instead of the normal webpage. I tracked down the issue and found out that this happens when I include one custom PHP file in my functions.php. I found that code in a tutorial on how to create social share buttons. If I comment the include out, everything works like a charm. I tried to investigate the file but I couldn't find anything wrong with it. Could you please have a look at what might be wrong?
<?php
function get_likes($url) {
    $json_string = file_get_contents('https://api.facebook.com/method/links.getStats?urls=' . $url . '&format=json');
    $json = json_decode($json_string, true);
    if (isset($json[0]['total_count'])) {
        return intval($json[0]['total_count']);
    } else {
        return 0;
    }
}
function get_tweets($url) {
    $json_string = file_get_contents('http://urls.api.twitter.com/1/urls/count.json?url=' . $url);
    $json = json_decode($json_string, true);
    if (isset($json['count'])) {
        return intval($json['count']);
    } else {
        return 0;
    }
}
function get_plusones($url) {
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, "https://clients6.google.com/rpc");
    curl_setopt($curl, CURLOPT_POST, 1);
    curl_setopt($curl, CURLOPT_POSTFIELDS, '[{"method":"pos.plusones.get","id":"p","params":{"nolog":true,"id":"' . $url . '","source":"widget","userId":"@viewer","groupId":"@self"},"jsonrpc":"2.0","key":"p","apiVersion":"v1"}]');
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_HTTPHEADER, array('Content-type: application/json'));
    $curl_results = curl_exec($curl);
    curl_close($curl);
    $json = json_decode($curl_results, true);
    if (isset($json[0]['result']['metadata']['globalCounts']['count'])) {
        return intval($json[0]['result']['metadata']['globalCounts']['count']);
    } else {
        return 0;
    }
}
function get_stumble($url) {
    $json_string = file_get_contents('http://www.stumbleupon.com/services/1.01/badge.getinfo?url=' . $url);
    $json = json_decode($json_string, true);
    if (isset($json['result']['views'])) {
        return intval($json['result']['views']);
    } else {
        return 0;
    }
}
if (isset($_GET["thisurl"])) {
    $thisUrl = $_GET["thisurl"];
    $firstpart = substr("$thisUrl", 0, 22);
    // Change http://medialoot.com to your own domain!
    if ($firstpart == 'http://mdbootstrap.com') {
        $data = "{";
        $data .= '"facebook": ' . json_encode(get_likes($thisUrl)) . ", ";
        $data .= '"twitter": ' . json_encode(get_tweets($thisUrl)) . ", ";
        $data .= '"gplus": ' . json_encode(get_plusones($thisUrl)) . ", ";
        $data .= '"stumble": ' . json_encode(get_stumble($thisUrl)) . "}";
    } else {
        // throw error
        $data = 'ERROR - you are trying to use this script for something outside of the allowed domain';
    }
} else {
    $data = '';
}
header('Content-Type: application/json');
echo $data;
?>
You are echoing the contents of $data, which I guess is also what you are seeing, if I understood that correctly.
If the code is included in your functions.php like this, it probably gets executed as soon as the functions.php file is loaded, which might be too late or too early.
To be able to control when the code executes, you should have a look into WordPress Hooks and hook your code into this mechanism.
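A rough sketch of that idea (the action name get_share_counts and the handler name are made up for illustration; add_action(), the wp_ajax_* hooks and wp_send_json() are standard WordPress APIs):

// in functions.php: register an AJAX handler instead of running the code on include
function mytheme_get_share_counts() {
    $thisUrl = isset($_GET['thisurl']) ? $_GET['thisurl'] : '';
    // ... reuse get_likes(), get_tweets(), etc. from the included file ...
    wp_send_json(array(
        'facebook' => get_likes($thisUrl),
        'twitter'  => get_tweets($thisUrl),
    )); // sends the Content-Type header, echoes the JSON and exits
}
add_action('wp_ajax_get_share_counts', 'mytheme_get_share_counts');        // logged-in users
add_action('wp_ajax_nopriv_get_share_counts', 'mytheme_get_share_counts'); // visitors
// the front end then requests admin-ajax.php?action=get_share_counts&thisurl=...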
If you can tell me more about what exactly you are trying to do, I might be able to give a more detailed answer.
Just as a sidenote: take care not to cross over into plugin territory with your theme. As soon as you are trying to do anything more than modify the look of something, it doesn't belong in functions.php anymore but in a separate plugin.
I wrote a program in PHP to find and print all the links present on a web page. It also goes into any links it finds and does the same. My problem is that on some sites (like YouTube) it won't print the links, or follow them.
Here is my main code:
function echo_urls($site_address) {
    if (check_valid_url($site_address)) {
        $site = new site();
        $site->address = $site_address;
        $site->full_address = "$site_address";
        $site->depth = 0;
        $queue = new queue();
        $queue->push($site);
        array_push($queue->seen, $site->address);
        $depth = 0;
        while (($site = $queue->get_first())) {
            $depth++;
            echo $site->depth . " : " . $site->full_address . "<br>";
            $queue = push_links($site->address, $queue, $depth);
        }
    }
}
function push_links($site_address, $queue, $depth) {
    if ($depth < 4) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $site_address);
        curl_setopt($ch, CURLOPT_TIMEOUT, 30); // timeout after 30 seconds
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        $result = curl_exec($ch);
        curl_close($ch);
        if ($result) {
            preg_match_all('/<a\s[^>]*href=([\"\']??)([^\" >]*?)\\1[^>]*>(.*)<\/a>/siU', $result, $list);
            $list = $list[0];
            foreach ($list as $item) {
                if (!(empty($item))) {
                    if ($result = get_all_string_between($item, "href=\"", "\"")) {
                        if ((array_search($result[0], $queue->seen)) == false) {
                            $site = new site();
                            $site->address = $result[0];
                            $site->full_address = $item;
                            $site->depth = $depth;
                            $queue->push($site);
                            array_push($queue->seen, $site->address);
                        }
                    }
                }
            }
        }
    }
    return $queue;
}
It's hard to tell by looking at a couple of functions, but my guess is:
YouTube is blocking you
This part, if ($depth < 4) {, may be stopping push_links from doing anything because the condition could be evaluating to FALSE
Also, don't use regex for this. Use something like the DOMDocument class
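A minimal sketch of the DOMDocument approach for the link-extraction part, reusing $result from push_links (the rest of the queue logic stays as it is):

$dom = new DOMDocument();
libxml_use_internal_errors(true); // tolerate the malformed HTML most sites serve
$dom->loadHTML($result);
libxml_clear_errors();
foreach ($dom->getElementsByTagName('a') as $a) {
    $href = $a->getAttribute('href');
    if ($href !== '' && array_search($href, $queue->seen) === false) {
        // build the site object and push it onto $queue here, as in the original loop
    }
}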
I usually use phpQuery for crawling sites. It's very simple:
http://code.google.com/p/phpquery/
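If I remember the phpQuery API correctly (treat this as a sketch rather than a tested snippet), grabbing all the links looks roughly like this:

require_once 'phpQuery/phpQuery.php'; // path depends on where you unpack the library
// load markup you already fetched (e.g. with cURL) into a phpQuery document
phpQuery::newDocument($result);
foreach (pq('a') as $a) {               // jQuery-style selector
    echo pq($a)->attr('href') . "<br>"; // wrap the DOMElement to read its attribute
}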