I am making a class to open a webpage and store the href values of all outbound links on the page. For some reason it works for the first 3 links, then goes weird. Below is my code:
class Crawler {
var $url;
function construct($url) {
$this->url = 'http://'.$url;
$this->crawl();
}
function crawl() {
$str = file_get_contents($this->url);
$start = 0;
for($i=0; $i<10; $i++) {
$beg = strpos($str, '<a href="http://',$start)+16;
$end = strpos($str,'"',$beg);
$diff = $end - $beg;
$links[$i] = substr($str,$beg, $diff);
$start = $start + $beg;
}
print_r($links);
}
}
$crawler = new Crawler;
$crawler->construct('www.yahoo.com');
Ignore the for loop for the time being; I know it only returns the first 10 links and won't process the whole document. But if you run this code, the first 3 values are fine and all the others are UBLIC.
Can anyone help? Thanks
Instead of:
$start = $start + $beg;
try:
$start = $beg;
That's likely why you are only seeing the first three matches.
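Also, you need to check that strpos() did not return FALSE, and you must do so before adding 16; otherwise FALSE + 16 evaluates to 16 and the check can never succeed:
for($i=0; $i<10; $i++) {
$beg = strpos($str, '<a href="http://',$start);
if ($beg === FALSE)
break;
$beg += 16;
//...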
Note, however, that you really should be using DOMDocument to find all tags in a document with a given tag name (a here). In particular, because this is HTML that might not be valid XHTML, you should consider using the loadHTML method.
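A minimal sketch of the DOMDocument approach, working on an HTML string rather than a live page. The function name extractOutboundLinks and the sample markup are made up for illustration, and "outbound" is approximated here as "any href starting with http:// or https://", which is an assumption rather than something the question specifies.

```php
<?php
// Hypothetical helper (not from the original post): returns every absolute
// http(s) href found in <a> tags of the given HTML string.
function extractOutboundLinks(string $html): array
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid, so collect parser warnings
    // internally instead of letting them be printed.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        // Assumption: treat absolute http(s) URLs as outbound links.
        if (preg_match('#^https?://#i', $href)) {
            $links[] = $href;
        }
    }
    return $links;
}

// Sample markup, invented for this example.
$sample = '<html><body><p>'
    . '<a href="http://example.com/a">A</a> '
    . '<a href="/local">B</a> '
    . '<a class="x" href="https://example.org/">C</a>'
    . '</p></body></html>';

print_r(extractOutboundLinks($sample));
```

Unlike the strpos() version, this also finds anchors whose href is not the first attribute, as in the third link of the sample.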
I think you have a problem in your logic:
You use $start to mark the place where to start looking for the href, but the resulting $beg is still an index into the complete string. So when you update $start by adding $beg, the values get too high. You should try $start = $beg + 1 instead of $start = $start + $beg.
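For reference, here is one way the string-search loop could look with the offset handled as an absolute position. This is only a sketch: the function name findHttpLinks and the test string are invented, and it keeps the question's behaviour of returning the URL without its leading http://.

```php
<?php
// Hypothetical rewrite of the question's loop; names are illustrative.
function findHttpLinks(string $str, int $limit = 10): array
{
    $needle = '<a href="http://';
    $links = [];
    $start = 0; // absolute offset into $str where the next search begins

    while (count($links) < $limit) {
        $pos = strpos($str, $needle, $start);
        if ($pos === false) {
            break; // no further links
        }
        $beg = $pos + strlen($needle); // first character after http://
        $end = strpos($str, '"', $beg);
        if ($end === false) {
            break; // unterminated attribute, stop rather than guess
        }
        $links[] = substr($str, $beg, $end - $beg);
        $start = $end + 1; // continue after this link, not at $start + $beg
    }
    return $links;
}

// Test input made up for this example.
$html = '<a href="http://a.example/">1</a> text <a href="http://b.example/x">2</a>';
print_r(findHttpLinks($html));
```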
I have a script that gets certain data from a page based on the content of a span ID.
However, there are 200+ pages of results to trawl through, and each page only displays 127 results.
The script does get the data for the 127 elements on the first page, but it won't then open the next page and continue getting data.
It just stops after the initial 127.
Any help would be great.
$end = 200;
$start = 1;
$stop = $start + 10;
$html = file_get_contents('http://example.com/res/'.$start);
$doc = new DOMDocument();
@$doc->loadHTML($html);
echo $stop;
$i = 0;
foreach($doc->getElementsByTagName('span') as $element ) { //Loops through all available span elements
if (!empty($element->attributes->getNamedItem('id')->value)) { // Discards irrelevant span elements based on their `ID`. A similar sorting is achieved with `empty()` as the target `span` doesn't have any associated `ID`.
echo "Record : ".$i.' '. $element->attributes->getNamedItem('id')->value."\n";
$i++;
$end = $start;
}
}
if($i == 127) {
$i = 0;
do {
$next = $start++;
$page = $next;
$html = file_get_contents('http://example.com/res/'.$page);
$doc = new DOMDocument();
@$doc->loadHTML($html);
foreach($doc->getElementsByTagName('span') as $element )
{
if (!empty($element->attributes->getNamedItem('id')->value))
{
echo "Record : ".$i.' '. $element->attributes->getNamedItem('id')->value."\n";
$i++;
$end = $start;
}
}
} while ($page != $stop);
//echo $i.' Records';
}
As stated in comments, since your last record displayed in the first loop is 127, the line after that echo increments $i from 127 to 128.
foreach($doc->getElementsByTagName('span') as $element ) { //Loops through all available span elements
if (!empty($element->attributes->getNamedItem('id')->value)) { // Discards irrelevant span elements based on their `ID`. A similar sorting is achieved with `empty()` as the target `span` doesn't have any associated `ID`.
echo "Record : ".$i.' '. $element->attributes->getNamedItem('id')->value."\n";
$i++; //At last iteration, $i = 128
$end = $start;
}
}
And then, if($i == 127) will be false.
I suggest changing the condition to if($i >= 127).
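An alternative that avoids depending on the exact count of 127 is to loop over the page numbers directly and stop when a page yields nothing. The sketch below is only an illustration: spanIds, collectAllPages and the fake page data are invented names, and the page fetcher is passed in as a callable so the example runs without network access.

```php
<?php
// Hypothetical helper: ids of all <span> elements that have a non-empty id.
function spanIds(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $ids = [];
    foreach ($doc->getElementsByTagName('span') as $span) {
        $id = $span->getAttribute('id');
        if ($id !== '') {
            $ids[] = $id;
        }
    }
    return $ids;
}

// Hypothetical driver: walks pages $firstPage..$lastPage and stops early
// when a page cannot be fetched or contains no matching spans.
function collectAllPages(callable $fetchPage, int $firstPage, int $lastPage): array
{
    $all = [];
    for ($page = $firstPage; $page <= $lastPage; $page++) {
        $html = $fetchPage($page);
        if ($html === false || $html === '') {
            break;
        }
        $ids = spanIds($html);
        if ($ids === []) {
            break;
        }
        foreach ($ids as $id) {
            $all[] = $id;
        }
    }
    return $all;
}

// Stand-in for the real site, invented for this example.
$fakePages = [
    1 => '<span id="a">x</span><span>no id</span><span id="b">y</span>',
    2 => '<span id="c">z</span>',
];
$fetch = function (int $page) use ($fakePages) {
    return $fakePages[$page] ?? '';
};

$records = collectAllPages($fetch, 1, 5);
foreach ($records as $i => $id) {
    echo "Record : $i $id\n";
}
```

For the real site, the callable would wrap file_get_contents('http://example.com/res/'.$page) as in the question.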
I'm having problems trying to "print" a PHP function to convert an IP address range to CIDR format, here is the function posted by IP2Location.com :
https://www.ip2location.com/tutorials/how-to-convert-ip-address-range-into-cidr
function iprange2cidr($ipStart, $ipEnd){
if (is_string($ipStart) || is_string($ipEnd)){
$start = ip2long($ipStart);
$end = ip2long($ipEnd);
}
else{
$start = $ipStart;
$end = $ipEnd;
}
$result = array();
while($end >= $start){
$maxSize = 32;
while ($maxSize > 0){
$mask = hexdec(iMask($maxSize - 1));
$maskBase = $start & $mask;
if($maskBase != $start) break;
$maxSize--;
}
$x = log($end - $start + 1)/log(2);
$maxDiff = floor(32 - floor($x));
if($maxSize < $maxDiff){
$maxSize = $maxDiff;
}
$ip = long2ip($start);
array_push($result, "$ip/$maxSize");
$start += pow(2, (32-$maxSize));
}
return $result;
}
function iMask($s){
return base_convert((pow(2, 32) - pow(2, (32-$s))), 10, 16);
}
(note: corrected 'echo' to 'return' result)
I've tried all of the suggested ways of "feeding" the $ipStart and $ipEnd values to the function, and also to "echo" or "print" the resulting array, but all I get is the word "Array".
For example, after the function is defined, I try:
$ipStart = '8.8.8.8';
$ipEnd = '8.8.8.254';
echo iprange2cidr($ipStart, $ipEnd);
... I apologise for the novice question, I'm a PHP newbie. I'm just not sure how to use the function. Any guidance on what I'm doing wrong would be appreciated! My server uses PHP 7.1. Thank you.
Let's return $result instead.
function iprange2cidr($ipStart, $ipEnd){
....
return $result;
}
Then let's convert it to a string before we echo it:
$ipStart = '8.8.8.8';
$ipEnd = '8.8.8.254';
$range = iprange2cidr($ipStart, $ipEnd);
echo implode("\n",$range);
You can use print_r() on the returned array to get human-readable output.
See the print_r() documentation for more info.
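The underlying issue is that echo on an array prints only the word "Array", so the array has to be turned into a string first. The short sketch below contrasts implode() and print_r(); the CIDR strings in $range are placeholder values for illustration, not the actual result of iprange2cidr() for the addresses in the question.

```php
<?php
// Placeholder data, NOT the real output of iprange2cidr().
$range = ['192.0.2.0/25', '192.0.2.128/26'];

// implode() joins the elements into one string, one CIDR block per line.
$asLines = implode("\n", $range);
echo $asLines, "\n";

// print_r() with true as second argument returns the dump as a string
// instead of printing it, which is handy for logging or wrapping in <pre>.
$asDump = print_r($range, true);
echo $asDump;
```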
The proper way to use the function is to return the value, like
function iprange2cidr($ipStart, $ipEnd){
....
return $result;}
and then call the function like
$returnedValue = iprange2cidr($ipStart, $ipEnd);
echo"<pre>";print_r($returnedValue);echo"</pre>";
I'm trying to repeat the function only if the value is ",":
This is my code for getting the coordinates from an address, but sometimes it returns only ",", so I want it to retry up to 10 times until it gets the full coordinates.
$coordinates1 = getCoordinates($placeadress);
$i == 0;
while (($coordinates1 == ',') && ($i <= 10)) {
$coordinates1 = getCoordinates($placeadress);
$i++;
}
The function code is this:
function getCoordinates($address) {
$address = str_replace(" ", "+", $address); // replace all the white space with "+" sign to match with google search pattern
$address = str_replace("-", "+", $address); // replace all the "-" with "+" sign to match with google search pattern
$url = "http://maps.google.com/maps/api/geocode/json?address=$address";
$response = file_get_contents($url);
$json = json_decode($response,TRUE); //generate array object from the response from the web
return ($json['results'][0]['geometry']['location']['lat'].",".$json['results'][0]['geometry']['location']['lng']);
}
Maybe try this:
$coordinates1 = getCoordinates($placeadress);
$i = 0;
while ($i <= 10) {
$coordinates1 = getCoordinates($placeadress);
if($coordinates1==',')
$i++;
else
break;
}
It breaks out of the loop as soon as the coordinates value is not just a comma, and you are good to go. If it is a comma, the counter is incremented and the while loop goes on to the next attempt.
You could just reduce all that code to a while with an empty block:
<?php
$i = 0;
while (
($coords = getCoordinates($placeadress))
&& $coords == ','
&& $i++ < 10
);
There's still perhaps no guarantee of the 'right' value being returned even after calling the function 10 times.
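To make the "may still fail after 10 tries" case explicit, the retry can be wrapped in a small helper that returns null when every attempt fails. This is a sketch under assumptions: retryUntilValid is an invented name, and the fake fetcher below stands in for getCoordinates() so the example runs without calling the geocoding service.

```php
<?php
// Hypothetical helper: calls $fetch up to $maxAttempts times and returns the
// first result that is not the bare "," placeholder, or null if none is.
function retryUntilValid(callable $fetch, int $maxAttempts = 10): ?string
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $value = $fetch();
        if ($value !== ',') {
            return $value;
        }
    }
    return null; // every attempt returned ","
}

// Stand-in for getCoordinates(): fails twice, then succeeds.
// The coordinate string is made-up sample data.
$calls = 0;
$fakeGetCoordinates = function () use (&$calls) {
    $calls++;
    return $calls < 3 ? ',' : '48.8584,2.2945';
};

$coordinates = retryUntilValid($fakeGetCoordinates);
var_dump($coordinates);
```

With the real function, the callable would simply wrap getCoordinates($placeadress), and the caller can check for null to handle the failure case.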
I'm using the simpleHTMLDom parser. It works very well with a URL like http://someWebSite.com/page/1. Suppose I want to parse page 1 through page 20 (for a website that uses pagination).
I (naively) tried this:
for($page = 1; $page <= 20; $page++){
$getHTML = file_get_html('http://website.com/page/'.$page);
}
It doesn't work (it only keeps the last page and parses that).
Any help, please?
for($page = 1; $page <= 20; $page++){
$getHTML = file_get_html('http://website.com/page/'.$page);
// <-- Do your stuff here
}
or
$getHTML = array();
for($page = 1; $page <= 20; $page++){
$getHTML[] = file_get_html('http://website.com/page/'.$page);
}
foreach($getHTML as $html){
// Do stuff with $html
}
You need to do something with the HTML before you fetch the next page, or store each page and then do something with it afterwards.
Is there an easier/better way to get every second hour than this?
if(date("H")=='00'){$chart_updates = '|02|04|06|08|10|12|14|16|18|20|22|00';}
if(date("H")=='01'){$chart_updates = '|03|05|07|09|11|13|15|17|19|19|23|01';}
if(date("H")=='02'){$chart_updates = '|04|06|08|10|12|14|16|18|20|21|00|02';}
if(date("H")=='03'){$chart_updates = '|05|07|09|11|13|15|17|19|21|23|01|03';}
if(date("H")=='04'){$chart_updates = '|06|08|10|12|14|16|18|20|22|00|02|04';}
if(date("H")=='05'){$chart_updates = '|07|09|11|13|15|17|19|21|23|01|03|05';}
if(date("H")=='06'){$chart_updates = '|08|10|12|14|16|18|20|22|00|02|04|06';}
if(date("H")=='07'){$chart_updates = '|09|11|13|15|17|19|21|23|01|03|05|07';}
if(date("H")=='08'){$chart_updates = '|10|12|14|16|18|20|22|00|02|04|06|08';}
if(date("H")=='09'){$chart_updates = '|11|13|15|17|19|21|23|01|03|05|07|09';}
if(date("H")=='10'){$chart_updates = '|12|14|16|18|20|22|00|02|04|06|08|10';}
if(date("H")=='11'){$chart_updates = '|13|15|17|19|21|23|01|03|05|07|09|11';}
if(date("H")=='12'){$chart_updates = '|14|16|18|20|22|00|02|04|06|08|10|12';}
if(date("H")=='13'){$chart_updates = '|15|07|19|21|23|01|03|05|07|09|11|13';}
if(date("H")=='14'){$chart_updates = '|16|08|20|22|00|02|04|06|08|10|12|14';}
if(date("H")=='15'){$chart_updates = '|17|09|21|23|01|03|05|07|09|11|13|15';}
if(date("H")=='16'){$chart_updates = '|18|20|22|00|02|04|06|08|10|12|16|16';}
if(date("H")=='17'){$chart_updates = '|19|21|23|01|03|05|07|09|11|13|15|17';}
if(date("H")=='18'){$chart_updates = '|20|22|00|02|04|06|08|10|12|14|16|18';}
if(date("H")=='19'){$chart_updates = '|21|23|01|03|05|07|09|11|13|15|17|19';}
if(date("H")=='20'){$chart_updates = '|22|00|02|04|06|08|10|12|14|16|18|20';}
if(date("H")=='21'){$chart_updates = '|23|01|03|05|07|09|11|13|15|17|19|21';}
if(date("H")=='22'){$chart_updates = '|00|02|04|06|08|10|12|14|16|18|20|22';}
if(date("H")=='23'){$chart_updates = '|01|03|05|07|09|11|13|15|17|19|21|23';}
I need this for google charts and wanted to check if this way is stupid.
1) take the current hour
2) mod2 (there are only two different sets of numbers, odd and even)
3) build array of hours
4) sort array by value
5) split array where the original hour was, and recombine.
$h = date("H");
$line = '';
for($i=0; $i<24; $i++) // hours run from 0 to 23
{
if($i % 2 == $h % 2)
$line .= '|' . ($i < 10 ? '0'.$i : $i);
}
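The five steps listed above can be sketched as follows, including the split-and-recombine step. This is only one possible reading of those steps: the function name chartUpdatesFor is invented, the hour is passed in as a parameter (assumed to be 0 to 23) so the result can be checked, and range() already yields the hours in ascending order, which makes a separate sort unnecessary.

```php
<?php
// Hypothetical implementation of the listed steps; $hour must be 0..23.
function chartUpdatesFor(int $hour): string
{
    // Steps 1-4: all hours with the same parity as $hour, in ascending order.
    $hours = range($hour % 2, 23, 2);

    // Step 5: split right after the current hour and put the later hours
    // first, so the list ends with the current hour.
    $index = array_search($hour, $hours, true);
    $rotated = array_merge(
        array_slice($hours, $index + 1),
        array_slice($hours, 0, $index + 1)
    );

    // Zero-pad each hour to two digits and join with "|".
    $labels = array_map(function ($h) {
        return sprintf('%02d', $h);
    }, $rotated);

    return '|' . implode('|', $labels);
}

// date('G') is the current hour without a leading zero.
echo chartUpdatesFor((int) date('G')), "\n";
```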
One way is to create an array with keys:
$theHour['00'] = '|02|04|06|08|10|12|14|16|18|20|22|00';
Then you can call it like this:
$chart_updates = $theHour[date("H")];
There is probably also a better way to generate this, but since you already typed it out, it's there. It would just be a pain if you ever wanted to make a change.
Nice code :)
There's actually a much easier way to do this in PHP:
$chars = array();
$start = date("H")+2;
for( $i = 0; $i < 12; $i++){
$chars[] = str_pad( ($start+2*$i)%24, 2, '0', STR_PAD_LEFT);
}
$chart_updates = '|' . implode( '|', $chars);
function helper_add($h,$plus=0){
if($h+$plus > 23){
return $h+$plus-24;
}
return $h+$plus;
}
function helper_10($in){
return $in < 10 ? '0'.$in : $in;
}
function getchartupdates(){
$now = date('G');
$res = array();
for($i=2; $i<=24; $i+=2) // the next 12 two-hour steps, ending on the current hour
$res[] = helper_10(helper_add($now,$i));
return '|'.implode('|',$res);
}
used this to test it !