How to convert <textarea> output to clean html and write to file? - php

My script writes the content of a <textarea> to a text file, but the markup is saved with HTML entities:
&lt;!DOCTYPE html&gt;
&lt;html &lt;?php language_attributes(); ?&gt;&gt;
&lt;head&gt; etc
Is there any way I can convert the output to clean HTML so it looks like this:
<!DOCTYPE html>
<html <?php language_attributes(); ?>>
<head>
etc :)
$file = 'wp.txt';
$regex = '/<textarea name="example" id="newcontent">(.*?)<\/textarea>/s';
if (preg_match($regex, $page, $list)) {
    echo $list[0];
} else {
    print "Error";
}
file_put_contents($file, $list, FILE_APPEND | LOCK_EX);
Thanks!

html_entity_decode
http://php.net/manual/en/function.html-entity-decode.php
That should do the trick.
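As a minimal sketch of what the function does, the snippet below decodes a made-up, entity-encoded string (the sample text is an assumption for illustration, not the asker's data). The ENT_QUOTES flag also converts quote entities, which the default flags may leave alone depending on the PHP version.

```php
<?php
// Hypothetical entity-encoded input, similar to what a textarea dump may contain
$encoded = '&lt;!DOCTYPE html&gt; &lt;p class=&quot;intro&quot;&gt;Tom &amp; Jerry&lt;/p&gt;';

// Convert the entities back into literal characters
$decoded = html_entity_decode($encoded, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $decoded, PHP_EOL;
```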

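Use the html_entity_decode function:
$file = 'wp.txt';
$regex = '/<textarea name="example" id="newcontent">(.*?)<\/textarea>/s';
if (preg_match($regex, $page, $list)) {
    // $list[1] is the captured textarea content, without the surrounding tags
    $html = html_entity_decode($list[1], ENT_QUOTES | ENT_HTML5, 'UTF-8');
    echo $html;
    // Write the decoded string (not the whole $list array) to the file
    file_put_contents($file, $html, FILE_APPEND | LOCK_EX);
} else {
    print "Error";
}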

You'll need the html_entity_decode function.

Related

Reading and encoding html

I am trying to read and display the content of the title (contained in a h1 tag) from many HTML files. These files are all in the same folder.
This is what the html files look like :
<!DOCTYPE html PUBLIC '-//W3C//DTD HTML 4.01//EN'>
<html>
<head>
<title>A title</title>
<style type='text/css'>
... Styles here ...
</style>
</head>
<body>
<h1>Être aidant</h1>
<p>En général, les aidants doivent équilibrer...</p>
... more tags ...
</body>
I have tried to display the content from the H1 tag with this PHP script :
<?php
foreach (glob("test/*.html") as $file) {
$file_handle = fopen($file, "r");
$doc = new DOMDocument();
$doc->loadHTMLfile($file);
$title = $doc->getElementsByTagName('h1');
if ( $title && 0<$title->length ) {
$title = $title->item(0);
$content = $doc->savehtml($title);
echo $content;
}
fclose($file_handle);
}
?>
But the output contains wrong characters. For the example file, the output is:
Ãtre aidant
How can I achieve this output?
Être aidant
You should state a charset in the <head> of your HTML document.
<meta charset="utf-8">
You need to use UTF-8 encoding.
Change echo $content to echo utf8_encode($content);
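A sketch of an alternative that leaves the HTML files untouched: when the markup declares no charset, DOMDocument typically falls back to ISO-8859-1, so a common workaround is to prepend an XML encoding declaration before parsing. The markup below is a shortened, assumed stand-in for one of the files, and the sketch assumes the files themselves are saved as UTF-8. Note also that utf8_encode() is deprecated as of PHP 8.2.

```php
<?php
// Assumed sample document: UTF-8 bytes, but no charset declared in <head>
$html = "<!DOCTYPE html PUBLIC '-//W3C//DTD HTML 4.01//EN'>"
      . "<html><head><title>A title</title></head>"
      . "<body><h1>Être aidant</h1></body></html>";

libxml_use_internal_errors(true);   // keep parser warnings out of the output

$doc = new DOMDocument();
// The prepended declaration tells the parser to treat the input as UTF-8
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);
libxml_clear_errors();

$headings = $doc->getElementsByTagName('h1');
$title = $headings->length > 0 ? $headings->item(0)->textContent : null;

echo $title, PHP_EOL;
```

For files on disk, the same idea would apply by reading each file with file_get_contents() and passing the string to loadHTML() instead of calling loadHTMLFile().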

Simple_Html_Dom how to parse chinese character

Would like to try crawling data from taobao site.
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title></title>
</head>
<body>
<?php
include_once('simple_html_dom.php');
$target_url = "http://item.taobao.com/item.htm?spm=a2106.m893.1000384.54.61Q4Fp&id=37676614376&_u=fm86qe4d813&scm=1029.newlist-0.1.50006843&ppath=&sku=&ug=#detail";
$html = new simple_html_dom();
$html->load_file($target_url);
foreach ($html->find('h3[class=tb-main-title]') as $post) {
echo html_entity_decode($post, ENT_QUOTES, "ISO-8859-1") . "<br />";
}
?>
</body>
</html>
But it displays the product title like this:
2014��ЬŮʿ�������¿��ϸ��ƽ���ļ��¿����ϴ���ƽ����Ь��
To avoid that, you need to use the iconv function. Consider this example:
include 'simple_html_dom.php';
$target_url = "http://item.taobao.com/item.htm?spm=a2106.m893.1000384.54.61Q4Fp&id=37676614376&_u=fm86qe4d813&scm=1029.newlist-0.1.50006843&ppath=&sku=&ug=#detail";
$contents = file_get_contents($target_url);
$html = str_get_html($contents);
foreach($html->find('h3[class=tb-main-title]') as $post) {
$text = $post->innertext;
$text = iconv('gb2312', 'utf-8', $text);
echo $text;
// 2014拖鞋女士人字拖新款豹纹细带平底夏季新款凉拖大码平底拖鞋潮
}
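To see the conversion in isolation, here is a small round-trip sketch. It assumes the iconv extension is available and that the source bytes really are GB2312; the sample string is made up for illustration and is first converted to GB2312 only to simulate what the remote page would deliver.

```php
<?php
// Made-up sample text, written here as UTF-8
$utf8 = '拖鞋';

// Simulate the bytes a GB2312-encoded page would send (two bytes per character)
$gb2312 = iconv('UTF-8', 'GB2312', $utf8);

// The step from the answer: convert GB2312 bytes to UTF-8 before output
$decoded = iconv('GB2312', 'UTF-8', $gb2312);

echo $decoded, PHP_EOL;
```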

How to match and add a class name using preg_replace?

I'm trying to match the class attribute of <html> tag and to add a class name using preg_replace().
Here is what I tried so far:
$content = '<!DOCTYPE html><html lang="en" class="dummy"><head></head><body></body></html>';
$pattern = '/< *html[^>]*class *= *["\']?([^"\']*)/i';
if(preg_match($pattern, $content, $matches)){
$content = preg_replace($pattern, '<html class="$1 my-custom-class">', $content);
}
echo htmlentities($content);
But, I got only this returned:
<!DOCTYPE html><html class="dummy my-custom-class">"><head></head><body></body></html>
The lang="en" attribute is dropped and a stray "> is left behind after the tag. Please help me.
Please try this code; it works perfectly well :)
<?php
$content = '<!DOCTYPE html><html lang="en" class="dummy"><head></head><body></body></html>';
// Group 1: the tag up to the class value, group 2: the value, group 3: the closing quote
$pattern = '/(<html\b[^>]*\bclass=")([^"]*)(")/i';
$callback_fn = 'process';
$content = preg_replace_callback($pattern, $callback_fn, $content);
function process($matches) {
    // Re-emit the tag unchanged apart from the appended class name
    return $matches[1] . $matches[2] . ' my-custom-class' . $matches[3];
}
echo htmlentities($content);
?>
For the regex approach, remove the " *" after "<" in the pattern.
Use this pattern:
/<html[^>]*class *= *["\']?([^"\']*)/i
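Changing the pattern alone may not be enough, because the replacement string in the question rebuilds the tag from scratch: everything matched before the class value (including lang="en") is discarded, and the original closing "> is left behind. A minimal sketch that captures the surrounding text and puts it back, assuming the class attribute uses double quotes:

```php
<?php
$content = '<!DOCTYPE html><html lang="en" class="dummy"><head></head><body></body></html>';

// Group 1: "<html ... class=\"", group 2: existing class value, group 3: closing quote
$pattern = '/(<html\b[^>]*\bclass\s*=\s*")([^"]*)(")/i';

// Re-emit all three groups and add the new class name after the existing value
$result = preg_replace($pattern, '$1$2 my-custom-class$3', $content, 1);

echo htmlentities($result), PHP_EOL;
```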
I suggest using a DOM parser to parse the HTML:
<?php
libxml_use_internal_errors(true);
$html = "<!DOCTYPE html><html lang='en' class='dummy'><head></head><body></body></html>";
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('html') as $node) {
    // Append to whatever classes are already set instead of hard-coding them
    $node->setAttribute('class', trim($node->getAttribute('class') . ' my-custom-class'));
}
$html = $dom->saveHTML();
echo $html;
OUTPUT:
<!DOCTYPE html>
<html lang="en" class="dummy my-custom-class"><head></head><body></body></html>

I want to store tweets that I find after a search, in a text file

Does anyone know how I could save the tweets from a search to a .txt file?
My index file is the following:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Twitter</title>
</head>
<body>
<?php
require('twitter.class.php');
$twitter = new twitter_class();
echo $twitter->getTweets('tomato', 15);
?>
</body>
</html>
I'm new to all this so I would appreciate any help.
Here is the code to save the tweets:
<?php
require('twitter.class.php');
$twitter = new twitter_class();
$tweets = $twitter->getTweets('tomato', 15);
// FILE_APPEND adds to the end of the file and creates it if it does not exist yet
file_put_contents('tweets.txt', $tweets, FILE_APPEND | LOCK_EX);
?>
This appends the tweets instead of erasing the existing data. If you don't want to append, just do this:
<?php
require('twitter.class.php');
$twitter = new twitter_class();
$tweets = $twitter->getTweets('tomato', 15);
file_put_contents('tweets.txt', $tweets);
?>
fwrite() is your friend. In the loop where you echo the tweets, write them to the text file instead of echoing them.
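A rough sketch of that loop follows. The $tweets array is a hypothetical stand-in, since the return type of getTweets() in twitter.class.php is not shown here, and a temporary file is used in place of tweets.txt.

```php
<?php
// Hypothetical tweets; in the real script these would come from the Twitter class
$tweets = ['first tweet about tomato', 'second tweet about tomato'];

// Temporary file standing in for tweets.txt
$path = tempnam(sys_get_temp_dir(), 'tweets');

// Mode 'a' appends, so earlier content is kept
$fp = fopen($path, 'a');
if ($fp === false) {
    throw new RuntimeException("Cannot open $path for writing");
}
foreach ($tweets as $tweet) {
    fwrite($fp, $tweet . PHP_EOL);   // one tweet per line
}
fclose($fp);

$saved = file_get_contents($path);
echo $saved;
```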
Have you tried the function file_put_contents?
You could do:
<?php
$file = 'tweets.txt';
$tweets = $twitter->getTweets('tomato', 15);
// Write the contents back to the file
file_put_contents($file, $tweets);
?>
You can create a file and write to it in PHP using fwrite. Here is a simple example:
$fp = fopen('data.txt', 'w');
fwrite($fp, $twitter->getTweets('tomato', 15));
fclose($fp);

Make relative links into absolute ones

I am requesting the source code of a website like this:
<? $txt = file_get_contents('http://stats.pingdom.com/qmwwuwoz2b71/522741');
echo $txt; ?>
But I would like to replace the relative links with absolute ones! Basically,
<img src="/images/legend_15s.png"/> and <img src='/images/legend_15s.png'/>
should be replaced by
<img src="http://domain.com/images/legend_15s.png"/>
and
<img src='http://domain.com/images/legend_15s.png'/>
respectively. How can I do this?
This can be achieved with the following:
<?php
$input = file_get_contents('http://stats.pingdom.com/qmwwuwoz2b71/522741');
$domain = 'http://stats.pingdom.com/';
$rep['/href="(?!https?:\/\/)(?!data:)(?!#)/'] = 'href="'.$domain;
$rep['/src="(?!https?:\/\/)(?!data:)(?!#)/'] = 'src="'.$domain;
$rep['/@import[\n+\s+]"\//'] = '@import "'.$domain;
$rep['/@import[\n+\s+]"\./'] = '@import "'.$domain;
$output = preg_replace(
array_keys($rep),
array_values($rep),
$input
);
echo $output;
?>
Which will output links as follows:
/something
will become,
http://stats.pingdom.com//something
And
../something
will become,
http://stats.pingdom.com/../something
But it will not edit "data:image/png;" or anchor tags.
I'm pretty sure the regular expressions can be improved though.
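One possible improvement, sketched under assumptions rather than offered as a drop-in replacement, is to let a DOM parser find the attributes instead of a regex. The function name absolutizeUrls and the sample markup are made up for illustration, and only root-relative URLs (those starting with a single "/") are rewritten. Note that saveHTML() re-serializes the whole document, so the original quoting style is not preserved.

```php
<?php
// Hypothetical helper: prefix root-relative href/src values with the given origin
function absolutizeUrls(string $html, string $origin): string
{
    libxml_use_internal_errors(true);   // tolerate imperfect real-world markup
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    foreach (['href', 'src'] as $attr) {
        foreach ($xpath->query('//*[@' . $attr . ']') as $node) {
            $url = $node->getAttribute($attr);
            // Rewrite "/path" but leave "//host/path", "http://...", "#anchor" and "data:" alone
            if (strpos($url, '/') === 0 && strpos($url, '//') !== 0) {
                $node->setAttribute($attr, rtrim($origin, '/') . $url);
            }
        }
    }
    return $dom->saveHTML();
}

// Made-up sample markup
$html = '<html><body><img src="/images/legend_15s.png">'
      . '<a href="http://example.org/x">x</a><a href="#top">top</a></body></html>';

$result = absolutizeUrls($html, 'http://stats.pingdom.com/');
echo $result;
```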
This code replaces only the links and images:
<? $txt = file_get_contents('http://stats.pingdom.com/qmwwuwoz2b71/522741');
$txt = str_replace(array('href="', 'src="'), array('href="http://stats.pingdom.com/', 'src="http://stats.pingdom.com/'), $txt);
echo $txt; ?>
I have tested it and it's working :)
UPDATED
Here it is done with a regular expression, which works better:
<?php $txt = file_get_contents('http://stats.pingdom.com/qmwwuwoz2b71/522741');
$domain = "http://stats.pingdom.com";
// Match href/src with either quote style, skipping absolute URLs, data: URIs and #anchors
$txt = preg_replace('~(href|src)=(["\'])(?!https?://|//|data:|#)/?~i', '$1=$2' . $domain . '/', $txt);
echo $txt; ?>
Done :D
You don't need PHP for this; you only need the HTML <base> tag. Put your PHP code in the HTML body, as follows.
Example:
<!doctype html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Document</title>
<base href="http://yourdomain.com/">
</head>
<body>
<? $txt = file_get_contents('http://stats.pingdom.com/qmwwuwoz2b71/522741');
echo $txt; ?>
</body>
</html>
All relative URLs in the page will then be resolved against that base URL.
