php scraping with simple Dom model


include('simple_html_dom.php');
function curl_set($url){
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$result = curl_exec($ch);
return $result;
}
$curl_scraped_page = curl_set('http://www.belmontwine.com/site-map.html');
$html = new simple_html_dom();
$html->load($curl_scraped_page, true, false);
$i = 0;
$ab = array();
$files = array();
foreach($html->find('td[class=site-map]') as $td) {
foreach($td->find('li a') as $a) {
if($i<=2){
$ab = 'http://www.belmontwine.com'.$a->href;
$html = file_get_html($ab);
foreach($html->find('td[class=pageheader]') as $file) {
$files[] = $file->innertext;
}
}
else{
//exit();
}
$i++;
}
$html->clear();
}
print_r($files);
Above is my code; I need help scraping a site with PHP.
The $ab variable contains the URLs that are scraped from the site map. I want to scrape data from each of those URLs, but I don't know what's wrong with the script.
The desired output is the data from the URLs passed in $ab,
but it is not returning anything; it just seems to loop continuously.
I need help with it.

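Check your counter logic first. Your $i++ sits after the if/else, so $i goes up once per link and is never reset to 0; after the first three links the loops keep running over every remaining link without fetching anything. I don't know why you want to limit the finds to 3 or less, but if the limit is meant per table cell you need to reset $i to 0 for each cell, which you are not doing at all. Also note that $html = file_get_html($ab) inside the loop overwrites the $html object holding the site map, so $html->clear() ends up clearing the wrong document; use a separate variable for the inner pages.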
EDIT:
I don't use the simple_html_dom class, so I don't know it very well, and I don't know what you want to do with each link found, so I can't do the work for you. I came up with this sample PHP script that grabs all the links from your site-map page. It builds an array with the link title as the key and the href path as the value. The last foreach loop just prints the array for now, but you could use that loop to process each path found.
include('simple_html_dom.php');
$files = array();
$html = file_get_html('http://www.belmontwine.com/site-map.html');
foreach($html->find('td[class=site-map]') as $td)
{
foreach($td->find('li a') as $a)
{
if($a->plaintext != '')
{
$files["$a->plaintext"] = "http://www.belmontwine.com/$a->href";
}
}
}
// To print $files array or to process each link found
foreach($files as $title => $path)
{
echo('Title: ' . $title . ' - Path: ' . $path . '<br>' . PHP_EOL);
}
Also, not every link found is an HTML file; at least one is a PDF, so be sure to test for that in your code.
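One way to do that test, as a minimal sketch: look at the file extension in the URL path before fetching the link. The helper name looksLikeHtmlPage, the list of accepted extensions, and the example.com URL are assumptions for illustration; none of them come from simple_html_dom.

```php
<?php
// Hypothetical helper: decide from the URL path alone whether a link is worth parsing as HTML.
function looksLikeHtmlPage($url)
{
    $path = parse_url($url, PHP_URL_PATH);
    if ($path === null || $path === false) {
        return true; // no path component (e.g. a bare domain): treat it as a page
    }
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    // Assumed whitelist; the empty extension covers paths such as /wines/
    return in_array($ext, array('', 'html', 'htm', 'php'), true);
}

$links = array(
    'http://www.belmontwine.com/site-map.html',
    'http://www.example.com/list.pdf', // made-up PDF link
);
foreach ($links as $link) {
    echo $link . ' => ' . (looksLikeHtmlPage($link) ? 'parse' : 'skip') . PHP_EOL;
}
```

A check on the extension is only a heuristic; a server can return a PDF from a URL without any extension, so inspecting the Content-Type response header would be the stricter test.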

Related

how can I get the image names that are inside a text file with php?

Like always, I am having issues doing this. So, I have been able to extract images from folders using php, but since it was taking too long to do that, I decided to just extract the names from a text file and then do a while loop and echo the list of results. For example I have a file called, "cats.txt" and inside the data looks like this.
Kitten1.jpg
Kitten2.jpg
Kitten3.jpg
I could easily use an sql or json table to do this, but I am not allowed. So, I only have this.
Now, I have tried doing this, but I get an error.
--- PHP CODE ---
$data = file ("path/to/the/file/cats.txt");
for ($i = 0; $i < count ($data); i++){
$images = '<img src="/kittens/t/$i">';
}
$CatLife = '
Images of our kittens:<br>
'.$images.'
';
I would really appreciate the help. I am not sure of what the error is, since the page doesn't tell me anything. It just doesn't load.

You can try something like this:
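$fp = fopen('path/to/the/file/cats.txt', 'r');
$images = '';
while(($row = fgets($fp)) !== false) {
    // fgets() keeps the trailing newline, so trim it and skip blank lines
    $row = trim($row);
    if($row === '') continue;
    $images .= '<img src="/kittens/t/'.$row.'">';
}
fclose($fp);
$CatLife = "Images of our kittens:<br>$images";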

Maybe you should use the glob php function instead of parsing your txt file.
foreach (glob("path/to/the/file/Kitten*.jpg") as $filename) {
echo "$filename\n";
}

Get used to foreach(); it is much easier. Also note that variables like $i in your img tag are not interpolated inside single quotes. I collect the tags in an array and then implode it:
$data = file("path/to/the/file/cats.txt", FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$images = array();
foreach($data as $file) {
    $images[] = '<img src="/kittens/t/'.$file.'">';
}
$CatLife = '
Images of our kittens:<br>
'.implode($images).'
';
You could also start with $images = ''; and use $images .= '<img src="/kittens/t/'.$file.'">'; to concatenate, so there is no need to implode.

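$data = file("path/to/the/file/cats.txt", FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$images = '';
for ($i = 0; $i < count($data); $i++) {
    // $data[$i] is the file name on line $i; concatenate it, because single quotes do not interpolate
    $images .= '<img src="/kittens/t/'.$data[$i].'">';
}
$CatLife = 'Images of our kittens:<br>'.$images;
echo $CatLife;
Try this: it appends each image tag to a string and echoes the result to the page.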

php file_get_contents from different URL if first one not available

I have the following code to read an XML file which works well when the URL is available:
$url = 'http://www1.blahblah.com'."param1"."param2";
$xml = file_get_contents($url);
$obj = SimpleXML_Load_String($xml);
How can I change the above code to cycle through a number of different URLs if the first one is unavailable for any reason? I have a list of 4 URLs, all containing the same file, but I'm unsure how to go about it.

Replace your code with something like this:
//instead of simple variable use an array with links
$urls = [ 'http://www1.blahblah.com'."param1"."param2",
'http://www1.anotherblahblah.com'."param1"."param2",
'http://www1.andanotherblahblah.com'."param1"."param2",
'http://www1.andthelastblahblah.com'."param1"."param2"];
//for all your links try to get a content
foreach ($urls as $url) {
$xml = file_get_contents($url);
//do your things if content was read without failure and break the loop
if ($xml !== false) {
$obj = SimpleXML_Load_String($xml);
break;
}
}
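The loop above still emits a PHP warning for every URL that cannot be read, and a dead host can block for the default socket timeout. A slightly more defensive variant, as a sketch only: wrap the download in a function, suppress the warning, and give each attempt a timeout. The name fetchFirstAvailable, the 5 second timeout, the mirror URLs, and the optional $fetcher parameter (there so the logic can be tried without a network) are all assumptions, not part of the answer above.

```php
<?php
// Hypothetical helper: return the body of the first URL that can be read, or false if none can.
function fetchFirstAvailable(array $urls, $fetcher = null)
{
    if ($fetcher === null) {
        // Default fetcher: file_get_contents with an assumed 5 second timeout per attempt
        $context = stream_context_create(array('http' => array('timeout' => 5)));
        $fetcher = function ($url) use ($context) {
            return @file_get_contents($url, false, $context); // @ hides the warning on failure
        };
    }
    foreach ($urls as $url) {
        $body = $fetcher($url);
        // An empty body is treated like a failure so the next URL gets a chance
        if ($body !== false && $body !== '') {
            return $body;
        }
    }
    return false;
}

// Usage with a fake fetcher so the example runs offline
$fake = function ($url) {
    return $url === 'http://mirror2.example/feed.xml' ? '<root><item>ok</item></root>' : false;
};
$xml = fetchFirstAvailable(
    array('http://mirror1.example/feed.xml', 'http://mirror2.example/feed.xml'),
    $fake
);
if ($xml !== false) {
    $obj = simplexml_load_string($xml);
    echo $obj->item . PHP_EOL;
}
```

In real use the second argument would be left out, and the result should still be checked against false before it is handed to simplexml_load_string.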

How to read a csv file with php code inside?

I searched Google but found nothing that fits my problem, or maybe I searched with the wrong words.
In many threads I read, the Smarty template engine was the solution, but I don't want to use Smarty because it's too big for this little project.
My problem:
I have a CSV file that contains only HTML and PHP code. It's a simple HTML template document; I use the PHP code for generating dynamic image links, for example.
I want to read in this file (that works), but how can I handle the PHP code inside it? The PHP code shows up as-is instead of being executed. All variables I use in the CSV file are defined and correct.
Short version:
How do I handle, print or echo PHP code in a CSV file?
Thanks a lot,
and sorry for my bad English.

Formatting your comment above you have the following code:
$userdatei = fopen("selltemplate/template.txt","r");
while(!feof($userdatei)) {
$zeile = fgets($userdatei);
echo $zeile;
}
fclose($userdatei);
// so i read in the csv file and the content of csv file one line:
// src="<?php echo $bild1; ?>" ></a>
This is assuming $bild1 is defined somewhere else, but try using these functions in your while loop to parse and output your html/php:
$userdatei = fopen("selltemplate/template.txt","r");
while(!feof($userdatei)) {
$zeile = fgets($userdatei);
outputResults($zeile);
}
fclose($userdatei);
//-- $delims holds the opening and closing delimiters to look for in $string (defaults to <?php and ?>)
function parseString($string, $delims = array()) {
$result = array();
//-- init delimiter vars
if (empty($delims)) {
$delims = array('<?php', '?>');
}
$start = $delims[0];
$end = $delims[1];
//-- where our delimiters start/end
$php_start = strpos($string, $start);
//-- no PHP block on this line: return the line unchanged as 'pre'
if ($php_start === false || strpos($string, $end) === false) {
return array('pre' => $string, 'post' => '', 'code' => '');
}
$php_end = strpos($string, $end) + strlen($end);
//-- where our php CODE starts/ends
$php_code_start = $php_start + strlen($start);
$php_code_end = strpos($string, $end);
//-- the non-php content before/after the php delimiters
$pre = substr($string, 0, $php_start);
$post = substr($string, $php_end);
$code_end = $php_code_end - $php_code_start;
$code = substr($string, $php_code_start, $code_end);
$result['pre'] = $pre;
$result['post'] = $post;
$result['code'] = $code;
return $result;
}
function outputResults($string) {
$result = parseString($string);
print $result['pre'];
eval($result['code']);
print $result['post'];
}

Having PHP code inside a CSV file that should be parsed and probably executed using eval sounds pretty dangerous to me.
If I get you right, you just want to have dynamic parameters in your CSV file, right? If that's the case and you don't want to bring an entire templating language (like Mustache, Twig or Smarty) into your application, you could do a simple search and replace.
$string = "<img alt='{{myImageAlt}}' src='{{myImage}}' />";
$parameters = [
'myImageAlt' => 'company logo',
'myImage' => 'assets/images/logo.png'
];
foreach( $parameters as $key => $value )
{
$string = str_replace( '{{'.$key.'}}', $value, $string );
}
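The same idea applied to a whole template, as a rough sketch: build the {{key}} map once and let strtr do all replacements in a single pass. The function name renderTemplate and the sample values are made up for illustration; in the asker's case the template string would come from file_get_contents("selltemplate/template.txt"), with {{bild1}} written in the file instead of the PHP echo.

```php
<?php
// Hypothetical helper: replace {{key}} placeholders in $template with HTML-escaped values.
function renderTemplate($template, array $parameters)
{
    $map = array();
    foreach ($parameters as $key => $value) {
        // Escape values so they are safe inside HTML text and attributes
        $map['{{' . $key . '}}'] = htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
    }
    // strtr with an array replaces every key in one pass over the string
    return strtr($template, $map);
}

$template = '<a href="{{link}}"><img src="{{bild1}}"></a>';
echo renderTemplate($template, array(
    'link'  => 'item.php?id=1&lang=en', // sample value
    'bild1' => 'images/pic1.jpg',       // sample value
)) . PHP_EOL;
```

Because nothing in the template is ever executed, a stray or malicious snippet in the file stays inert text, which is the main advantage over the eval approach.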

PHP / Curl To Randomize Text Between Two Tags

Below is my working code that pulls a text file from a remote location and inserts into the html body of a page a specific line. The code works just fine as it is now. I want to do an addition to the code however and have it randomize the line that it gets. Here is what I'm wanting to do.
The text file that is being pulled will have a varied amount of lines. Only one line is chosen via the echo $lines[0]; which tells which line to get. The line will be formatted like this..
<p>This is a line of text domain 1. This is a line of text.</p><p>This is a line of text domain 2. This is a line of text.</p><p>This is a line of text domain 3. This is a line of text.</p>
All of that would be one line and pulled into the html of the page. The above example would display 3 paragraphs of text with links in the order above.
What I would like to do is have that line of text randomize between the <p>..</p> So for instance if I put the below code on Site A the output would be in order of domain 1 then domain 2 and then domain 3. If I put the code on Site B I would like it to be domain 3 and then domain 1 and then domain 2. To display them in random order, not the exact order for each time I put the code on a site.
I don't know if there would need to be some sort of cache on the site I have the code on to remember which random order to display in. That is what I want. I do not want a random order on each page load.
I hope this makes sense. If not please tell me so I can try and explain it better. Here is my working code as of now. Can anyone help me get this working? Thank you very much for your help.
<?php
function url_get_contents ($url) {
if (function_exists('curl_exec')){
$conn = curl_init($url);
curl_setopt($conn, CURLOPT_SSL_VERIFYPEER, true);
curl_setopt($conn, CURLOPT_FRESH_CONNECT, true);
curl_setopt($conn, CURLOPT_RETURNTRANSFER, 1);
$url_get_contents_data = (curl_exec($conn));
curl_close($conn);
}elseif(function_exists('file_get_contents')){
$url_get_contents_data = file_get_contents($url);
}elseif(function_exists('fopen') && function_exists('stream_get_contents')){
$handle = fopen ($url, "r");
$url_get_contents_data = stream_get_contents($handle);
}else{
$url_get_contents_data = false;
}
return $url_get_contents_data;
}
?>
<?php
$data = url_get_contents("http://mydomain.com/mytextfile.txt");
if($data){
$lines = explode("\n", $data);
echo $lines[0];
}
?>

Try This
$str = '<p>This is a line of text domain 1. This is a line of text.</p><p>This is a line of text domain 2. This is a line of text.</p><p>This is a line of text domain 3. This is a line of text.</p>';
preg_match_all('%(<p[^>]*>.*?</p>)%i', $str, $match);
$count = 0;
$used = array();
while ($count < 3) {
$index = rand(0, 2);
if (!isset($used[$index])) {
$used[$index] = 1;
echo $match[0][$index];
$count++;
}
}
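The question asks for an order that differs from site to site but stays the same on every page load, which the loop above does not give. One possible approach, sketched under assumptions: derive a seed from the site's host name and run a seeded Fisher-Yates shuffle over the matched paragraphs. stableShuffle is a hypothetical helper, and using $_SERVER['HTTP_HOST'] as the per-site key is an assumption.

```php
<?php
// Hypothetical helper: shuffle $items in an order fully determined by $key.
function stableShuffle(array $items, $key)
{
    $items = array_values($items);
    mt_srand(crc32($key)); // same key => same sequence of mt_rand() values
    for ($i = count($items) - 1; $i > 0; $i--) {
        $j = mt_rand(0, $i); // Fisher-Yates: swap position $i with a random position <= $i
        $tmp = $items[$i];
        $items[$i] = $items[$j];
        $items[$j] = $tmp;
    }
    mt_srand(); // re-seed randomly so later mt_rand() calls are not predictable
    return $items;
}

$str = '<p>domain 1</p><p>domain 2</p><p>domain 3</p>';
preg_match_all('%<p[^>]*>.*?</p>%i', $str, $match);
// Assumed per-site key; falls back to a fixed name when run from the command line
$host = isset($_SERVER['HTTP_HOST']) ? $_SERVER['HTTP_HOST'] : 'site-a.example';
echo implode('', stableShuffle($match[0], $host)) . PHP_EOL;
```

No cache file is needed because the order is recomputed from the host name each time. The order for a given key can change after a PHP upgrade that alters mt_rand, so storing the chosen order once would be the safer choice if it must never move.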

I think I understand what you are asking, but if not, please let me know and I will adjust.
Basically, what I'm doing here is counting the number of lines in the array that you exploded and then using that count, minus one, as the maximum index to randomize against. Once I have a random number, I just access that element of the array. So if I generate the number 5, it will grab $lines[5], which is the sixth line because the array is zero-indexed. Note that the same line can be picked more than once.
$lines = explode("\n", $data);
$line_count = count($lines) - 1;
for ($i = 0; $i < 3; $i++) {
print "<p>".$lines[get_random_line($line_count)]."</p>";
}
function get_random_line($line_count) {
    // mt_rand() is seeded automatically, so no manual mt_srand() call is needed
    return mt_rand(0, $line_count);
}

Without modifying your code too much and without getting into storing values in databases, using flat file storage you can do something like the following:
Create a file called "count.txt" and place it in the same location as your php file.
<?php
function url_get_contents ($url) {
if (function_exists('curl_exec')){
$conn = curl_init($url);
curl_setopt($conn, CURLOPT_SSL_VERIFYPEER, true);
curl_setopt($conn, CURLOPT_FRESH_CONNECT, true);
curl_setopt($conn, CURLOPT_RETURNTRANSFER, 1);
$url_get_contents_data = (curl_exec($conn));
curl_close($conn);
}elseif(function_exists('file_get_contents')){
$url_get_contents_data = file_get_contents($url);
}elseif(function_exists('fopen') && function_exists('stream_get_contents')){
$handle = fopen ($url, "r");
$url_get_contents_data = stream_get_contents($handle);
}else{
$url_get_contents_data = false;
}
return $url_get_contents_data;
}
$data = url_get_contents("http://mydomain.com/mytextfile.txt");
$fp=fopen('count.txt','r');//Open count.txt for reading
$stored=fread($fp,4);//Get the last displayed index (4=no. bytes to read)
$count=($stored!==false && $stored!=='') ? (int)$stored+1 : 0;//Increment it, or start at 0 if the file is empty
fclose($fp); //Close file
if($data){
$lines=explode("\n",$data);
if($count>=count($lines)){$count=0;}//Reset $count once it runs past the last line
echo $lines[$count];
$fp=fopen('count.txt','w'); //Another fopen to truncate the file simply
fwrite($fp,$count); //Store $count just displayed
fclose($fp); //Close file
}
?>

Sounds like you're really looking for a way to have unique content, or maybe to give the appearance of updated content, on your HTML page. This has been extremely useful for me, and I'm sure many others will like it as well, even though it is a bit different from what you are trying to do.
This will grab nested Spintax from a text file, spin the content, and display it in your page. Your page will need to be .php; there is a way to make this work on an HTML page, which is what I use it for.
Spintax example: {cat|Dog|Mouse}. It works for sentence spins, spinning/rotating images, spinning HTML code, etc. There are many things that you can do with this.
<?php
function spin($s){
preg_match('#\{(.+?)\}#is',$s,$m);
if(empty($m)) return $s;
$t = $m[1];
if(strpos($t,'{')!==false){
$t = substr($t, strrpos($t,'{') + 1);
}
$parts = explode("|", $t);
$s = preg_replace("+\{".preg_quote($t)."\}+is",
$parts[array_rand($parts)], $s, 1);
return spin($s);
}
$file = "http://www.yourwebsite/Data.txt";
$f = fopen($file, "r");
while ( $line = fgets($f, 1000) ) {
echo spin($line);
}
?>

Simple Site Stat script not gathering data from file, I have an almost exact script that works

I made a script a while ago that wrote to a file. I did the same thing here, only adding a part that reads the file and writes it again. What I am trying to achieve is quite simple, but the problem is eluding me: I am trying to make my script write to a file that holds the following information:
views:{viewcount}
date-last-visited:{MM/DD/YYYY}
last-ip:{IP-Adress}
Now I have done a bit of research, and tried several methods to reading the data, none have returned anything. My current code is as follows.
<?php
$filemade = 0;
if(!file_exists("stats")){
if(!mkdir("stats")){
exit();
}
$filemade = 1;
}
echo $filemade;
$hwrite = fopen("stats/statistics.txt", 'w');
$icount = 0;
if(filemade == 0){
$data0 = file_get_contents("stats/statistics.txt");
$data2 = explode("\n", $data0);
$data1 = $data_1[0];
$ccount = explode(":", data1);
$icount = $ccount[1] + 1;
echo "<br>icount:".$icount."<br>";
echo "data1:".$data1."<br>";
echo "ccount:".$ccount."<br>";
echo "ccount[0]:".$ccount1[0]."<br>";
echo "ccount[1]:".$ccount1[1]."<br>";
}
$date = getdate();
$ip=#$REMOTE_ADDR;
fwrite($hwrite, "views:" . $icount . "\nlast-viewed:" . $date[5] . "/" . $date[3] . $date[2] . "/" . $date[6] . "\nlast-ip:" . $ip);
fclose($hwrite);
?>
the result is always:
views:1
last-viewed://
last-ip:
the views never go up, the date never works, and the IP address never shows.
I have looked at many sources before finally deciding to ask, I figured I'd get more relevant information this way.
Looking forward to some replies. PHP is my newest language, and so I don't know much.
What I have tried.
I have tried:
$handle_read = fopen("stats/statistics.txt", "r");//make a new file handle in read mode
$data = fgets($handle_read);//get first line
$data_array = explode(":", $data);//split first line by ":"
$current_count = $data_array[1];//get second item, the value
and
$handle_read = fopen("stats/statistics.txt", "r");//make a new file handle in read mode
$pre_data = fread($handle_read, filesize($handle_read));//read all the file data
$pre_data_array = explode("\n", $pre_data);//split the file by lines
$data = pre_data_array[0];//get first line
$data_array = explode(":", $data);//split first line by ":"
$current_count = $data_array[1];//get second item, the value
I have also tried split instead of explode, but I was told split is deprecated and explode is up-to-date.
Any help would be great, thank you for your time.

Try the following:
<?php
if(!file_exists("stats")){
if(!mkdir("stats")) die("Could not create folder");
}
// file() returns an array of file contents or false
$data = file("stats/statistics.txt", FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
if(!$data){
if(!touch("stats/statistics.txt")) die("Could not create file");
// Default Values
$data = array("views:0", "date-last-visited:01/01/2000", "last-ip:0.0.0.0");
}
// Update the data
foreach($data as $key => $val){
// Limit explode to 2 chunks because the value itself may contain
// colons, as IPv6 addresses do (e.g. ::1)
$line = explode(':', $val, 2);
switch($key){
case 0:
$line[1]++;
break;
case 1:
$line[1] = date('m/d/Y');
break;
case 2:
$line[1] = $_SERVER['REMOTE_ADDR'];
break;
}
$data[$key] = implode(':', $line);
echo $data[$key]. "<br />";
}
// Write the data back into the file
if(!file_put_contents("stats/statistics.txt", implode(PHP_EOL, $data))) die("Could not write file");
?>
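If two visitors hit the page at the same moment, a read followed by a separate write can lose an update. A minimal sketch of one way around that, assuming a hypothetical helper bumpViews that only handles the views line: keep the file open and hold an exclusive flock for the whole read-modify-write. The demo path in the temp directory is an assumption so the snippet does not touch a real stats file.

```php
<?php
// Hypothetical helper: increment the "views:" line of a stats file under an exclusive lock.
function bumpViews($file)
{
    $fp = fopen($file, 'c+'); // create if missing, open read/write, do not truncate
    if ($fp === false) {
        return false;
    }
    flock($fp, LOCK_EX); // block until no other request is updating the file
    $lines = preg_split('/\R/', stream_get_contents($fp), -1, PREG_SPLIT_NO_EMPTY);
    $views = 0;
    if (isset($lines[0]) && strpos($lines[0], 'views:') === 0) {
        $views = (int) substr($lines[0], strlen('views:'));
    }
    $views++;
    $lines[0] = 'views:' . $views;
    // Rewrite the file in place while still holding the lock
    ftruncate($fp, 0);
    rewind($fp);
    fwrite($fp, implode(PHP_EOL, $lines));
    fflush($fp);
    flock($fp, LOCK_UN);
    fclose($fp);
    return $views;
}

$file = sys_get_temp_dir() . '/statistics-demo.txt'; // demo path, not the real stats file
echo 'views is now ' . bumpViews($file) . PHP_EOL;
```

The date and IP lines could be updated inside the same locked section in the same way as the switch statement above does.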
