Get HTML source code of page with PHP

If I have the html file:
<!doctype html>
<html>
<head></head>
<body>
<!-- Begin -->
Important Information
<!-- End -->
</body>
</html>
How can I use PHP to get the string "Important Information" from the file?

If you already have the parsing sorted, just use file_get_contents(). You can pass it a URL and it will return the content found at the URL, in this case, the html. Or if you have the file locally, you pass it the file path.
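A minimal sketch of that call, assuming a local file. The temporary file and its contents are stand-ins invented for the example; a URL can be passed the same way when allow_url_fopen is enabled.

```php
<?php
// Hypothetical stand-in for the page: write some sample HTML to a temp file.
$path = tempnam(sys_get_temp_dir(), 'page');
file_put_contents($path, "<body>\n<!-- Begin -->\nImportant Information\n<!-- End -->\n</body>\n");

// file_get_contents() returns the whole file as a string, or false on failure.
$html = file_get_contents($path);
if ($html === false) {
    exit("Could not read $path");
}
echo $html;
unlink($path);
```

Checking the return value with === false matters, because an empty file is a valid result and should not be treated as a failure.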

In this simple example, you can open the file and call fgets() until you find a line containing <!-- Begin -->, then save the lines that follow until you reach <!-- End -->.
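The fgets() approach could look roughly like this. It is only a sketch: the function name extractBetweenMarkers() and the temporary demo file are made up for illustration, and it assumes each marker comment sits on its own line.

```php
<?php
// Collect the lines found between a line containing $begin and a line containing $end.
function extractBetweenMarkers(string $path, string $begin = '<!-- Begin -->', string $end = '<!-- End -->'): string
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    $collecting = false;
    $lines = [];
    while (($line = fgets($handle)) !== false) {
        if ($collecting && strpos($line, $end) !== false) {
            break;                              // reached the end marker
        } elseif ($collecting) {
            $lines[] = rtrim($line, "\r\n");    // keep the line without its EOL
        } elseif (strpos($line, $begin) !== false) {
            $collecting = true;                 // start saving from the next line
        }
    }
    fclose($handle);
    return implode("\n", $lines);
}

// Demo with a hypothetical temp file holding the HTML from the question.
$demoPath = tempnam(sys_get_temp_dir(), 'html');
file_put_contents($demoPath, "<body>\n<!-- Begin -->\nImportant Information\n<!-- End -->\n</body>\n");
$found = extractBetweenMarkers($demoPath);
echo $found;
unlink($demoPath);
```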
If your HTML is in a variable you can just do:
<?php
$begin = strpos($var, '<!-- Begin -->') + strlen('<!-- Begin -->'); // Could hardcode 14 here (the length of the 'needle')
$end = strpos($var, '<!-- End -->');
$text = substr($var, $begin, ($end - $begin));
echo $text;
?>
This prints the text between the two comments, here "Important Information" with its surrounding line breaks.

You can fetch the HTML like this:
// file_get_html() comes from a third-party library (e.g. PHP Simple HTML DOM Parser)
// Create DOM from URL or file
$html = file_get_html('http://www.example.com/');
For other DOM operations, see the documentation for PHP's built-in DOM extension:
http://de.php.net/manual/en/book.dom.php

Related

Writing to a new line in PHP

I read the thread "Writing a new line to file in PHP" and pasted its exact code into my NetBeans IDE, but it didn't work. The pasted code (with some minor changes) was:
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title></title>
</head>
<body>
<?php
$i = 0;
$file = fopen('ids.txt', 'w');
$gemList=array(1,2,3,4,5);
foreach ($gemList as $gem)
{
fwrite($file, $gem."\n");
$i++;
}
fclose($file);
?>
</body>
</html>
I also tried writing new lines to a file with different code. My code goes like this:
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title></title>
</head>
<body>
<?php
$fh = fopen("testfile.txt", 'w') or die("Failed to create file");
$text = <<<_END
Line 1
Line 2
Line 3
_END;
fwrite($fh, $text) or die("Could not write to file");
fclose($fh);
echo "File 'testfile.txt' written successfully";
?>
</body>
</html>
but the result in the text file is
12345
and what I expect is
1
2
3
4
5
I greatly appreciate any help. Also my Netbeans version is 8.2 and my running OS is Windows 10.

Your current code already gives the correct result; check the page source to confirm.
But remember that you are displaying it in an HTML page, so if you want a new line there, use the <br> tag.
foreach ($gemList as $gem)
{
fwrite($file, $gem."<br>");
$i++;
}
fclose($file);
HTML does not take new lines \n into consideration, unless you specifically set the CSS property white-space:pre;

Ibu's answer is correct if you are displaying on a webpage.
For your fwrite() call, make sure the text viewer you are using understands \n as the EOL character. In other words, if you are on Windows and will only work with the resulting file(s) on Windows, use \r\n (carriage return followed by line feed) as your EOL sequence.
Or leave it as-is and use a text editor that supports "Unix-style line endings"; Notepad++ does.
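For illustration, here is the loop from the question writing Windows-style line endings. This is a sketch that assumes the file will only be opened on Windows; the temp-directory path is made up for the example.

```php
<?php
$gemList = [1, 2, 3, 4, 5];

// Hypothetical output location so the example does not depend on the working directory.
$path = sys_get_temp_dir() . '/ids_crlf.txt';

$file = fopen($path, 'w');
foreach ($gemList as $gem) {
    // "\r\n" is carriage return + line feed, the sequence older Windows editors expect.
    fwrite($file, $gem . "\r\n");
}
fclose($file);

// Read the file back to inspect exactly what was written.
$written = file_get_contents($path);
unlink($path);
```

If the file should instead follow the convention of whatever system runs the script, the PHP_EOL constant can replace the hard-coded "\r\n".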

If you are viewing the content of your file in the browser (e.g. echoed by PHP), you need to use PHP's nl2br() function to convert newlines to HTML <br /> tags:
<div>
<?= nl2br(file_get_contents("testfile.txt")); ?>
</div>
Alternatively, enclose the file content within a <div> whose CSS white-space property is set to "pre":
<div style="white-space: pre">
<?= file_get_contents("testfile.txt"); ?>
</div>
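As a small sketch of what nl2br() actually produces: the sample string is made up, and the htmlspecialchars() call is an extra precaution not used in the snippets above, so that any markup stored in the file is shown as text rather than interpreted.

```php
<?php
// Made-up file content containing newlines and a stray tag.
$text = "Line 1\nLine 2 <b>\nLine 3";

// Escape HTML first, then insert a <br /> before every newline.
$safe = nl2br(htmlspecialchars($text));
echo $safe;
```

Note that nl2br() keeps the original newline after each inserted <br />, so the page source stays readable too.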

Is it possible to change original html text in php?

I am trying to make a "manner friendly" website. We use different declensions depending on gender and other factors. For example:
You did = robili
It did = robilo
She did = robila
Linguistically, this is a very simplified (and unlucky) example! I would like to change the HTML text in a PHP file where appropriate. For example:
<?php
something
?>
html text of the page and somewhere is the word "robil"
<div>we tried to robil^i|o|a^</div>
<?php something ?>
Now I would like to find all occurrences of tokens of the form ^characters|characters|characters^ and replace each one with one of its internal values according to "gender".
This is easy in JavaScript on the client side, but the visitor would see all this weird "tokenizing" before JavaScript replaces it.
I do not know an elegant solution for doing this on the server.
Or do you have a better idea?
Thanks for any advice.

You can add these scripts before and after the HTML:
<?php
// start output buffering
ob_start();
?>
<html>
<body>
html text of the page and somewhere is the word "robil"
<div>we tried to robil^i|o|a^, but also vital^si|sa|ste^, borko^mal|mala|malo^ </div>
</body>
</html>
<?php
$use = 1; // indicate which declination to use (0,1 or 2)
// get buffered html
$html = ob_get_contents();
ob_end_clean();
// match anything between '^' that's not a control chr or '^', min 3 and max 20 chrs.
if (preg_match_all('/\^[^[:cntrl:]\^]{3,20}\^/',$html,$matches))
{
// replace all
foreach (array_unique($matches[0]) as $match)
{
$choices = explode('|',trim($match,'^'));
$html = str_replace($match,$choices[$use],$html);
}
}
echo $html;
This returns:
html text of the page and somewhere is the word "robil" we tried to
robilo, but also vitalsa, borkomala
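The same replacement can also be written as a single pass with preg_replace_callback(). This is a sketch rather than a drop-in replacement: the helper name applyVariant() is invented here, and the pattern assumes every token holds at least two choices separated by |.

```php
<?php
// Replace every ^a|b|c^ token with the choice at index $use.
function applyVariant(string $html, int $use): string
{
    $result = preg_replace_callback(
        '/\^([^\^|]+(?:\|[^\^|]+)+)\^/',
        function (array $m) use ($use) {
            $choices = explode('|', $m[1]);
            // Keep the token unchanged if it has no choice at that index.
            return $choices[$use] ?? $m[0];
        },
        $html
    );
    // preg_replace_callback() returns null on error; fall back to the input.
    return $result ?? $html;
}

echo applyVariant('we tried to robil^i|o|a^, but also vital^si|sa|ste^', 1);
```

Combined with the output buffering shown above, the call would be echo applyVariant(ob_get_clean(), $use);.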

PHP: php and .html file separation

I'm currently working on separating HTML and PHP code. Here is the code that currently works for me.
code.php
<?php
$data['#text#'] = 'A';
$html = file_get_contents('test.html');
echo $html = str_replace(array_keys($data),array_values($data),$html);
?>
test.html
<html>
<head>
<title>TEST HTML</title>
</head>
<body>
<h1>#text#</h1>
</body>
</html>
OUTPUT: A
It searches for #text# and replaces it with the array value A, and it works for me.
Now I'm working on code that searches for "id" attributes in the HTML file. When it finds the "id" in the ".html" file, it should put the array value between > and </.
EX: <div id="test"> **array_values here** </div>
test.php
<?php
$data['id="test"'] = 'A';
$html = file_get_contents('test.html');
foreach ($data as $search => $value)
{
if (strpos($html, $search) !== false)
{
echo 'FOUND';
echo $value;
}
}
?>
test.html
<html>
<head>
<title>TEST</title>
</head>
<body>
<div id="test" ></div>
</body>
</html>
My problem is that I don't know how to put the array values between every ></ found in the .html file.
Desired OUTPUT: <div id="test" >A</div>

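function callbackInsert($matches)
{
global $data;
// Leave the element untouched if its id has no entry in $data
if (!isset($data[$matches[3]])) return $matches[0];
return $matches[1].$matches[3].$matches[4].$data[$matches[3]].$matches[6];
}
$data['test'] = 'A';
$html = file_get_contents('test.html');
// A single call handles every id in $data; the result must be assigned back to $html
$html = preg_replace_callback('#(<([a-zA-Z]+)[^>]*id=")(.*?)("[^>]*>)([^<]*?)(</\\2>)#ism', 'callbackInsert', $html);
echo $html;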
Warning: the code is not tested and could be improved, particularly the use of the global keyword and the handling of what is allowed between > and </.
Regular expression explanation:
(<([a-zA-Z]+) - the start of any HTML tag, up to and including the last letter of the tag name
[^>]* - anything that is inside a tag <>
id=")(.*?)(" - the id attribute and its value
[^>]* - anything that is inside a tag <>
>) - the end of the opening tag
([^<]*?) - anything that is not a tag, tested by opening a tag <
(</\\2>) - the closing tag matching the 2nd bracket, i.e. the same tag name as the opening tag
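As an alternative to the regular expression, the same insertion can be sketched with PHP's DOM extension. This swaps in a different technique, and it makes a few assumptions: the dom extension is installed, the ids in $data contain no double quotes, and the HTML is inlined here instead of being read from test.html.

```php
<?php
$data = ['test' => 'A'];

// Inlined copy of test.html so the example is self-contained.
$html = '<html><head><title>TEST</title></head><body><div id="test" ></div></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

foreach ($data as $id => $value) {
    // Find every element carrying this id and append the value as a text node.
    foreach ($xpath->query('//*[@id="' . $id . '"]') as $node) {
        $node->appendChild($doc->createTextNode($value));
    }
}

$result = $doc->saveHTML();
echo $result;
```

Because the value is added as a text node, any markup inside it is escaped on output; existing children of the element are kept and the value is appended after them.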

Use view (.phtml) files to generate content dynamically. This is native to PHP (no third-party library required).
See this answer: What is phtml, and when should I use a .phtml extension rather than .php?
and this:
https://stackoverflow.com/questions/62617/whats-the-best-way-to-separate-php-code-and-html
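A rough sketch of the view idea follows. The render() helper and the one-line template are invented for the example, and the template is written to a temp file only to keep the snippet self-contained; in a real project it would be a separate .phtml file.

```php
<?php
// Render a template file with the given variables and return the output as a string.
function render(string $templatePath, array $vars): string
{
    extract($vars, EXTR_SKIP);   // expose ['text' => 'A'] as $text inside the template
    ob_start();
    include $templatePath;       // the template mixes HTML with short PHP echo tags
    return ob_get_clean();
}

// Hypothetical view file; normally this would live on disk as e.g. test.phtml.
$template = tempnam(sys_get_temp_dir(), 'view');
file_put_contents($template, '<h1><?= htmlspecialchars($text) ?></h1>');

$page = render($template, ['text' => 'A']);
echo $page;
unlink($template);
```

The PHP file keeps the logic and the data, while the view contains only markup and simple echo statements.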

how to display an array content inside CKeditor?

I need to display the content of a doc file inside CKEditor.
I read the content of the doc file and pass it into an array line by line:
$rs = fopen("text.doc", "r");
while ($line = fgets($rs, 1024)) {
$this->data[] = $line . "<BR>";
}
then I create an instance of CKeditor:
include_once("ckeditor/ckeditor.php");
$CKeditor = new CKeditor();
$CKeditor->basePath = '/ckeditor/';
foreach ($this->data as $value) {
//what should I write here
}
$CKeditor->editor('editor1');
CKEditor now works and appears on my webpage, but without any content.
What should I write inside the foreach to pass the array content into the editor?
Please help =(

.doc files are a binary format (and .docx files are zipped XML), so they cannot be read line by line like this. Consider using PHPWord to get access to the contents inside.
EDIT: Upon further investigation, it looks like PHPWord can only write and not read.
PHP tools are very deficient in this area. Your best bet is to use something like DocVert to convert the file on the command line. Then you could load the converted document inside CKEditor.
EDIT: after OP's comment:
let's consider it's a txt file ... I need the Ckeditor method
Load your decoded HTML content into a Textarea, and give this textarea an HTML ID or class:
$textarea_content = htmlspecialchars_decode(file_get_contents('text.doc'));
Then, in your HTML, call the CKEditor inside a JavaScript tag to replace the textarea with the editor:
<html>
<head>
<!-- include CKEditor in a <script> tag first -->
<script type="text/javascript">
window.onload = function()
{
CKEDITOR.replace( 'editor1' );
};
</script>
</head>
<body>
<textarea id="editor1" name="editor1"><?php echo $textarea_content ?></textarea>
</body>
</html>
The documentation page has a lot more details.

Screen scraping with cURL and Regex

Consider a document in the following format:
<!DOCTYPE html>
<html>
<head>
<title></title>
<body>
<div class="blog_post_item first">
<?php // some child elements ?>
</div><!-- end blog_post_item -->
</body>
</html>
I am loading a document like this from one domain to another with PHP cURL. I would like to trim my cURL result to only include div.blog_post_item.first and its children. I know the structure of the other page, yet I can't edit it. I imagine I can use preg_match to find the opening and closing tags; they will always look the same, including that ending comment.
I have searched for examples/tutorials of screen scraping with cURL/XPath/XSLT/whatever, and it's mostly a cyclical rattling off of names of HTML parsing libraries. For that reason, please provide a simple working example. Please do not simply explain that parsing HTML with regex is a potential security vulnerability. Please do not just list libraries and specifications that I should read further into.
I have some simple PHP cURL code:
$ch = curl_init("http://a.web.page.com");
curl_setopt($ch, CURLOPT_HEADER, 0);
$output = curl_exec($ch);
curl_close($ch);
Of course, now $output contains the entire source. How will I get just the contents of that element?

That's quite easy if you are sure the begin and end is ALWAYS the same. All you have to do is search for the beginning and end and match everything between that. I think a lot of people will be pissed at me for using regex to find a bit of HTML but it'll do the job!
// cURL
$ch = curl_init("http://a.web.page.com");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$output = curl_exec($ch);
curl_close($ch);
if(empty($output)) exit('Couldn\'t download the page');
// finding your data
$pattern = '/<div class="blog_post_item first">(.*?)<\/div><!-- end blog_post_item -->/';
preg_match_all($pattern, $output, $matches);
var_dump($matches); // all matches
Because I don't know which website you're trying to crawl I'm not sure if this works or not.
After searching for quite a while (26 minutes to be exact) I have found why it didn't work. The dot (.) doesn't match newlines. Because HTML is full of new lines, it couldn't match the contents. Using a slightly dirty hack I managed to get it matching anyway (even though you already picked an answer).
// cURL
$ch = curl_init('http://blogg.oscarclothilde.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$output = curl_exec($ch);
curl_close($ch);
if(empty($output)) exit('Couldn\'t download the page');
// finding your data
$pattern = '/<div class="blog_post_item first">(([^.]|.)*?)<\/div><!-- end blog_post_item -->/';
preg_match_all($pattern, $output, $matches);
var_dump($matches[1][0]); // contents of the first match
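For what it's worth, PCRE also has the s modifier, which makes the dot match newlines and avoids the ([^.]|.) workaround. A minimal sketch, with a made-up $output string standing in for the cURL result:

```php
<?php
// Stand-in for the downloaded page; the real $output would come from curl_exec().
$output = "<body>\n<div class=\"blog_post_item first\">\n<p>Hello</p>\n</div><!-- end blog_post_item -->\n</body>";

// The trailing "s" lets "." match newlines, so the lazy group can span several lines.
$pattern = '/<div class="blog_post_item first">(.*?)<\/div><!-- end blog_post_item -->/s';

$inner = null;
if (preg_match($pattern, $output, $match)) {
    $inner = trim($match[1]); // contents of the div without surrounding whitespace
    echo $inner;
}
```

With a single capturing group, the contents are in $match[1] directly.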

If you are sure about the following structure:
<div class="blog_post_item first">
WHATEVER
</div><!-- end blog_post_item -->
AND you are sure the ending-code doesn't appear in WHATEVER, then you can simply grab it.
(Please note that I replaced your original PHP with WHATEVER. cURL will only fetch the HTML, and it will contain content, not PHP.)
You don't need a regex. You can also do it simply by searching for the wanted strings, like in my example below.
$curlResponse = '
<!DOCTYPE html>
<html>
<head>
<title></title>
<body>
<div class="blog_post_item first">
<?php // some child elements ?>
</div><!-- end blog_post_item -->
</body>
</html>';
$startStr = '<div class="blog_post_item first">';
$endStr = '</div><!-- end blog_post_item -->';
$startStrPos = strpos($curlResponse, $startStr)+strlen($startStr);
$endStrPos = strpos($curlResponse, $endStr);
$wanted = substr($curlResponse, $startStrPos, $endStrPos-$startStrPos );
echo htmlentities($wanted);

This piece of code should work (PHP >= 5.3.6 with the dom extension):
$s = <<<EOM
<!DOCTYPE html>
<html>
<head>
<title></title>
<body>
<div class="blog_post_item first">
<?php // some child elements ?>
</div><!-- end blog_post_item -->
</body>
</html>
EOM;
$d = new DOMDocument;
$d->loadHTML($s);
$x = new DOMXPath($d);
foreach ($x->query('//div[contains(@class, "blog_post_item") and contains(@class, "first")]') as $el) {
echo $d->saveHTML($el);
}
