Convert CSS font shorthand to long hand - php

Given a font CSS string such as this:
font:italic bold 12px/30px Georgia, serif;
or
font:12px verdana;
I want to convert it to its long hand format i.e:
font-style: italic; font-weight: bold;
Here is my miserable attempt: http://pastebin.com/e3KdMvGT
But of course it doesn't work for the second example since its expecting things to be in order, how can I improve it?

Here is a function which should do the work. The problem lies in the font-style, font-variant and font-weight properties and the value "normal" as you can read in the css specs ([[ <'font-style'> || <'font-variant'> || <'font-weight'> ]? <'font-size'> [ / <'line-height'> ]? <'font-family'> ] | caption | icon | menu | message-box | small-caption | status-bar | inherit).
$testStrings = array('12px/14px sans-serif',
'80% sans-serif',
'x-large/110% "New Century Schoolbook", serif',
'x-large/110% "New Century Schoolbook"',
'bold italic large Palatino, serif ',
'normal small-caps 120%/120% fantasy',
'italic bold 12px/30px Georgia, serif',
'12px verdana');
foreach($testStrings as $font){
echo "Test ($font)\n<pre>
";
$details = extractFull($font);
print_r($details);
echo "
</pre>";
}
function extractFull($fontString){
// Split $fontString. The only area where quotes should be found is around font-families. Which are at the end.
$parts = preg_split('`("|\')`', $fontString, 2, PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_NO_EMPTY);
$chunks = preg_split('` `', $parts[0], NULL, PREG_SPLIT_NO_EMPTY);
if(isset($parts[1])){
$chunks[] = $parts[1] . $parts[2];
}
$details = array();
$next = -1;
// Manage font-style / font-variant / font-weight properties
$possibilities = array();
$fontStyle = array('italic', 'oblique');
$fontVariant = array('small-caps');
$fontWeight = array('bold', 'bolder', 'lighter', '100', '200', '300', '400', '500', '600', '700', '800', '900');
// First pass on chunks 0 to 2 to see what each can be
for($i = 0; $i < 3; $i++){
$possibilities[$i] = array();
if(!isset($chunks[$i])){
if($next == -1){
$next = $i;
}
continue;
}
if(in_array($chunks[$i], $fontStyle)){
$possibilities[$i] = 'font-style';
}
else if(in_array($chunks[$i], $fontVariant)){
$possibilities[$i] = 'font-variant';
}
else if(in_array($chunks[$i], $fontWeight)){
$possibilities[$i] = 'font-weight';
}
else if($chunks[$i] == 'normal'){
$possibilities['normal'] = 1;
}
else if($next == -1){
// Used to know where other properties will start at
$next = $i;
}
}
// Second pass to determine what real properties are defined
for($i = 0; $i < 3; $i++){
if(!empty($possibilities[$i])){
$details[$possibilities[$i]] = $chunks[$i];
}
}
// Third pass to set the properties which have to be set as "normal"
if(!empty($possibilities['normal'])){
$properties = array('font-style', 'font-variant', 'font-weight');
foreach($properties as $property){
if(!isset($details[$property])){
$details[$property] = 'normal';
}
}
}
if(!isset($chunks[$next])){
return $details;
}
// Extract font-size and line height
if(strpos($chunks[$next], '/')){
$size = explode('/', $chunks[$next]);
$details['font-size'] = $size[0];
$details['line-height'] = $size[1];
$details['font-family'] = $chunks[$next+1];
}
else if(preg_match('`^-?[0-9]+`', $chunks[$next]) ||
in_array($chunks[$next], array('xx-small', 'x-small', 'small', 'medium', 'large', 'x-large', 'xx-large', 'larger', 'smaller'))){
$details['font-size'] = $chunks[$next];
$details['font-family'] = $chunks[$next+1];
}
else{
$details['font-family'] = $chunks[$next];
}
return $details;
}

here my attempt
function rewriteFont($short) {
$short = str_replace("font:","",$short);
$values = explode(" ", $short);
$length = count($values);
$familyLength = 0;
for($i = 0; $i < $length; $i++) {
$currentValue = $values[$i];
if($currentValue == 'italic' || $currentValue == 'oblique' ) {
echo "font-style:{$currentValue};";
}
else if($currentValue == 'small-caps') {
echo "font-variant:{$currentValue};";
}
else if ($currentValue == 'bold' || $currentValue == 'bolder' || $currentValue == 'lighter' || is_numeric($currentValue) ) {
echo "font-weight:{$currentValue};";
}
else if (strpos($currentValue, "px") || strpos($currentValue, "em") || strpos($currentValue, "ex")|| strpos($currentValue, "pt")
|| strpos($currentValue, "%") || $currentValue == "xx-small" || $currentValue == "x-small" || $currentValue == "small"
|| $currentValue == "medium" || $currentValue == "large" || $currentValue == "x-large" || $currentValue == "xx-large") {
$sizeLineHeight = explode("/", $currentValue);
echo "font-size:{$sizeLineHeight[0]};";
if(isset($sizeLineHeight[1])) {
echo "line-height:{$sizeLineHeight[1]};";
}
}
else {
if($familyLength == 0) {
echo "font-family:{$currentValue} ";
}
else {
echo $currentValue;
}
$familyLength++;
}
}
}

Following the specs, I'd do the following code. It works with all examples from the specs. You can see it here : http://codepad.viper-7.com/ABhc22
function font($s){
$fontStyle = array('normal', 'italic', 'oblique', 'inherit');
$fontVariant = array('normal', 'small-caps','inherit');
$fontSize = array('xx-small', 'x-small', 'small', 'medium', 'large', 'x-large', 'xx-large');
$fontWeight = array('normal', 'bold', 'bolder', 'lighter', '100', '200', '300', '400', '500', '600', '700', '800', '900', 'inherit');
$s = str_replace('font: ', '', $s);
$array = explode(' ', $s);
$count = count($array);
$sizeFound = false;
$final = array();
$final['font-style'] = 'normal';
$final['font-variant'] = 'normal';
$final['font-weight'] = 'normal';
for($i = 0; $i < $count; $i++){
$cur = $array[$i];
if($sizeFound){
$final['font-family'] = isset($final['font-family']) ? $final['font-family'].$cur : $cur;
}else if(strripos($cur,'%') !== false || strripos($cur,'px') !== false || in_array($cur, $fontSize)){
$sizeFound = true;
$size = explode('/', $cur);
if(count($size) > 1){
$final['font-size'] = $size[0];
$final['line-height'] = $size[1];
}else
$final['font-size'] = $size[0];
}else{
if(in_array($cur, $fontStyle))
$final['font-style'] = $cur;
if(in_array($cur, $fontVariant))
$final['font-variant'] = $cur;
if(in_array($cur, $fontWeight))
$final['font-weight'] = $cur;
}
}
$ret = '';
foreach($final as $prop => $val)
$ret .= $prop.': '.$val.';';
return $ret;
}
I guess it can be optimised. :)

You could look at CSSTidy, whose code is open-source and try to reverse engineer it.

Related

get attachments from nested imap parts

I'm using the below function to get attachments from an email (index $m) on an existing imap connection ($imap)
function parse_parts(&$structure, &$attachments, &$imap, $m, $par){
if(isset($structure->parts) && count($structure->parts) ) {
for($i = 0; $i < count($structure->parts); $i++) {
$attachments[] = array(
'is_attachment' => false,
'filename' => '',
'name' => '',
'attachment' => ''
);
$lstk= array_keys($attachments); $j=end($lstk); // use this instead of $i to avoid overlapping indices when there's sub-parts and recursion
if($structure->parts[$i]->ifdparameters) {
foreach($structure->parts[$i]->dparameters as $object) {
if(strtolower($object->attribute) == 'filename') {
$attachments[$j]['is_attachment'] = true;
$attachments[$j]['filename'] = $object->value;
}
}
}
if($structure->parts[$i]->ifparameters) {
foreach($structure->parts[$i]->parameters as $object) {
if(strtolower($object->attribute) == 'name') {
$attachments[$j]['is_attachment'] = true;
$attachments[$j]['filename'] = imap_utf8($object->value);
}
}
}
if($attachments[$j]['is_attachment']) {
if($par==''){
$attachments[$j]['attachment'] = imap_fetchbody($imap, $m, $i+1);
}else{
$attachments[$j]['attachment'] = imap_fetchbody($imap, $m, $par . $i+1);
}
if($structure->parts[$i]->encoding == 3) { // 3 = BASE64
$attachments[$j]['attachment'] = base64_decode($attachments[$j]['attachment']);
}
elseif($structure->parts[$i]->encoding == 4) { // 4 = QUOTED-PRINTABLE
$attachments[$j]['attachment'] = quoted_printable_decode($attachments[$j]['attachment']);
}else{echo "encod:". $structure->parts[$i]->encoding . PHP_EOL; }
}
if(isset($structure->parts[$i]->parts) && count($structure->parts[$i]->parts)){ //if the part contains its own parts, recurse
$prnt = $i;
$prnt = $prnt . '.';
parse_parts($structure->parts[$i], $attachments, $imap, $m, $prnt );
}
}
}
}
the later function which calls this looks like:
$structure = imap_fetchstructure($imap, $m);
$attachments = array();
parse_parts($structure, $attachments, $imap, $m, '');
the standard function works fine, but when it has to recurse (i.e. there are nested parts), the resulting file is not readable, though it does appear to be the correct size. Any idea what i'm doing wrong?
i've tried changing $prnt to
$prnt = $i+1 .'.';
but then, it just outputs an empty file.
i noticed that other people had similar issues here:
imap - get attached file
but their solutions don't appear to work for this email/code.
thank you to IVO GELOV for finding my mistake. final working function:
function parse_parts(&$structure, &$attachments, &$imap, $m, $par){
if(isset($structure->parts) && count($structure->parts) ) {
for($i = 0; $i < count($structure->parts); $i++) {
$attachments[] = array(
'is_attachment' => false,
'filename' => '',
'name' => '',
'attachment' => ''
);
$lstk= array_keys($attachments); $j=end($lstk); // use this instead of $i to avoid overlapping indices when there's sub-parts and recursion
if($structure->parts[$i]->ifdparameters) {
foreach($structure->parts[$i]->dparameters as $object) {
if(strtolower($object->attribute) == 'filename') {
$attachments[$j]['is_attachment'] = true;
$attachments[$j]['filename'] = $object->value;
}
}
}
if($structure->parts[$i]->ifparameters) {
foreach($structure->parts[$i]->parameters as $object) {
if(strtolower($object->attribute) == 'name') {
$attachments[$j]['is_attachment'] = true;
$attachments[$j]['filename'] = imap_utf8($object->value);
}
}
}
if($attachments[$j]['is_attachment']) {
if($par==''){
$attachments[$j]['attachment'] = imap_fetchbody($imap, $m, ($i+1));
}else{
$attachments[$j]['attachment'] = imap_fetchbody($imap, $m, $par . ($i+1));
}
if($structure->parts[$i]->encoding == 3) { // 3 = BASE64
$attachments[$j]['attachment'] = base64_decode($attachments[$j]['attachment']);
}
elseif($structure->parts[$i]->encoding == 4) { // 4 = QUOTED-PRINTABLE
$attachments[$j]['attachment'] = quoted_printable_decode($attachments[$j]['attachment']);
}else{echo "encod:". $structure->parts[$i]->encoding . PHP_EOL; }
}
if(isset($structure->parts[$i]->parts) && count($structure->parts[$i]->parts)){ //if the part contains its own parts, recurse
$prnt = $i+1;
$prnt = $prnt . '.';
parse_parts($structure->parts[$i], $attachments, $imap, $m, $prnt );
}
}
}
}
attachment may be inline or external file too.
so we have to get both type of attachments.
$structure = imap_fetchstructure($this->imap, $id);
$fileName = $attachments = array();
/* if any attachments found... */
if (isset($structure->parts) && count($structure->parts)) {
$n = 0;
for($i = 0; $i < count($structure->parts); $i++){
if(isset($structure->parts[$i]->parts) && !empty($structure->parts[$i]->parts) && count($structure->parts[$i]->parts)>0){
for($y = 0; $y < count($structure->parts[$i]->parts); $y++)
{
$attachments[$n] = array('is_attachment' => false,'filename' => '','name' => '','attachment' => '','size' => '','cid' => 0);
if($structure->parts[$i]->parts[$y]->ifdparameters==1) {
foreach($structure->parts[$i]->parts[$y]->dparameters as $object){
if(strtolower($object->attribute) == 'filename'){
$attachments[$n]['is_attachment'] = true;
$fileName[0] = $object->value;
$filename = $this->filterFileName($fileName);
$attachments[$n]['filename'] = $filename;
$attachments[$n]['name'] = $filename;
}
}
}
if($structure->parts[$i]->parts[$y]->ifparameters){
foreach($structure->parts[$i]->parts[$y]->parameters as $object){
if(strtolower($object->attribute) == 'name'){
$attachments[$n]['is_attachment'] = true;
$fileName[0] = $object->value;
$filename = $this->filterFileName($fileName);
$attachments[$n]['name'] = $filename;
$attachments[$n]['filename'] = $filename;
}
}
}
if($attachments[$n]['is_attachment']){
$attachments[$n]['attachment'] = imap_fetchbody($this->imap, $id, ($i+1).'.'.($y+1));
if($structure->parts[$i]->parts[$y]->encoding == 3){ /* 4 = QUOTED-PRINTABLE encoding */
$attachments[$n]['attachment'] = base64_decode($attachments[$n]['attachment']);
}
elseif($structure->parts[$i]->parts[$y]->encoding == 4){ /* 3 = BASE64 encoding */
$attachments[$n]['attachment'] = quoted_printable_decode($attachments[$n]['attachment']);
}
$attachments[$n]['size'] = $structure->parts[$i]->parts[$y]->bytes;
$attachments[$n]['cid'] = ($structure->parts[$i]->parts[$y]->ifid) ? str_replace(array('<', '>'), '', $structure->parts[$i]->parts[$y]->id) : 0;
}
$email['attachments'][] = $attachments[$n];$n++;
}
}else{
$attachments[$n] = array('is_attachment' => false,'filename' => '','name' => '','attachment' => '','size' => '','cid' => 0);
if($structure->parts[$i]->ifdparameters==1){
foreach($structure->parts[$i]->dparameters as $object){
if(strtolower($object->attribute) == 'filename'){
$attachments[$n]['is_attachment'] = true;
$fileName[0] = $object->value;
$filename = $this->filterFileName($fileName);
$attachments[$n]['filename'] = $filename;
$attachments[$n]['name'] = $filename;
}
}
}
if($structure->parts[$i]->ifparameters){
foreach($structure->parts[$i]->parameters as $object){
if(strtolower($object->attribute) == 'name'){
$attachments[$n]['is_attachment'] = true;
$fileName[0] = $object->value;
$filename = $this->filterFileName($fileName);
$attachments[$n]['name'] = $filename;
$attachments[$n]['filename'] = $filename;
}
}
}
if($attachments[$n]['is_attachment']){
$attachments[$n]['attachment'] = imap_fetchbody($this->imap, $id, $i+1);
if($structure->parts[$i]->encoding == 3){/* 4 = QUOTED-PRINTABLE encoding */
$attachments[$n]['attachment'] = base64_decode($attachments[$n]['attachment']);
}elseif($structure->parts[$i]->encoding == 4){ /* 3 = BASE64 encoding */
$attachments[$n]['attachment'] = quoted_printable_decode($attachments[$n]['attachment']);
}
$attachments[$n]['size'] = $structure->parts[$i]->bytes;
$attachments[$n]['cid'] = ($structure->parts[$i]->ifid) ? str_replace(array('<', '>'), '', $structure->parts[$i]->id) : 0;
}
$email['attachments'][] = $attachments[$n];$n++;
}
}
}
public function filterFileName($fileName) {
$newstr = preg_replace('/[^a-zA-Z0-9\-\.\']/', '_', $fileName[0]);
$fileName = str_replace("'", '', $newstr);
return $fileName;
}
after that you have to store this attachments

Is there a way to process PDF files as text files in PHP?

My company receives coupons codes from another company and I need to process them to use them in my company's website. The problem is that they send me a PDF file with the codes, and they say their system can only export them in PDF. Strange.
I've tried several times to let them know that I need those coupons in plain text separated by something (a.k.a CSV) and they doesn't take any notice of that.
My website is written in PHP, and uses MySQL. I would like to offer a way to upload that coupons file and add those to the database. It would be easy considering it is a CSV file but it's not.
Is there any way I can programatically -not manually- process PDF files as text files? or, any workaround for this situation?
I have found pdf2text to have worked well for quite some time.
Will not work with image (TIFF) PDF. will work with text searchable PDF.
include('/home/user/php/class.pdf2text.php');
$p2t = new PDF2Text();
$p2t ->setFilename($pdf);
$p2t ->decodePDF();
$data = $p2t ->output();
$pos = strpos($data,$search);
if (pos){...}
Source:
<?php
class PDF2Text {
// Some settings
var $multibyte = 4; // Use setUnicode(TRUE|FALSE)
var $convertquotes = ENT_QUOTES; // ENT_COMPAT (double-quotes), ENT_QUOTES (Both), ENT_NOQUOTES (None)
var $showprogress = true; // TRUE if you have problems with time-out
// Variables
var $filename = '';
var $decodedtext = '';
function setFilename($filename) {
// Reset
$this->decodedtext = '';
$this->filename = $filename;
}
function output($echo = false) {
if($echo) echo $this->decodedtext;
else return $this->decodedtext;
}
function setUnicode($input) {
// 4 for unicode. But 2 should work in most cases just fine
if($input == true) $this->multibyte = 4;
else $this->multibyte = 2;
}
function decodePDF() {
// Read the data from pdf file
$infile = #file_get_contents($this->filename, FILE_BINARY);
if (empty($infile))
return "";
// Get all text data.
$transformations = array();
$texts = array();
// Get the list of all objects.
preg_match_all("#obj[\n|\r](.*)endobj[\n|\r]#ismU", $infile . "endobj\r", $objects);
$objects = #$objects[1];
// Select objects with streams.
for ($i = 0; $i < count($objects); $i++) {
$currentObject = $objects[$i];
// Prevent time-out
#set_time_limit ();
if($this->showprogress) {
// echo ". ";
flush(); ob_flush();
}
// Check if an object includes data stream.
if (preg_match("#stream[\n|\r](.*)endstream[\n|\r]#ismU", $currentObject . "endstream\r", $stream )) {
$stream = ltrim($stream[1]);
// Check object parameters and look for text data.
$options = $this->getObjectOptions($currentObject);
if (!(empty($options["Length1"]) && empty($options["Type"]) && empty($options["Subtype"])) )
// if ( $options["Image"] && $options["Subtype"] )
// if (!(empty($options["Length1"]) && empty($options["Subtype"])) )
continue;
// Hack, length doesnt always seem to be correct
unset($options["Length"]);
// So, we have text data. Decode it.
$data = $this->getDecodedStream($stream, $options);
if (strlen($data)) {
if (preg_match_all("#BT[\n|\r](.*)ET[\n|\r]#ismU", $data . "ET\r", $textContainers)) {
$textContainers = #$textContainers[1];
$this->getDirtyTexts($texts, $textContainers);
} else
$this->getCharTransformations($transformations, $data);
}
}
}
// Analyze text blocks taking into account character transformations and return results.
$this->decodedtext = $this->getTextUsingTransformations($texts, $transformations);
}
function decodeAsciiHex($input) {
$output = "";
$isOdd = true;
$isComment = false;
for($i = 0, $codeHigh = -1; $i < strlen($input) && $input[$i] != '>'; $i++) {
$c = $input[$i];
if($isComment) {
if ($c == '\r' || $c == '\n')
$isComment = false;
continue;
}
switch($c) {
case '\0': case '\t': case '\r': case '\f': case '\n': case ' ': break;
case '%':
$isComment = true;
break;
default:
$code = hexdec($c);
if($code === 0 && $c != '0')
return "";
if($isOdd)
$codeHigh = $code;
else
$output .= chr($codeHigh * 16 + $code);
$isOdd = !$isOdd;
break;
}
}
if($input[$i] != '>')
return "";
if($isOdd)
$output .= chr($codeHigh * 16);
return $output;
}
function decodeAscii85($input) {
$output = "";
$isComment = false;
$ords = array();
for($i = 0, $state = 0; $i < strlen($input) && $input[$i] != '~'; $i++) {
$c = $input[$i];
if($isComment) {
if ($c == '\r' || $c == '\n')
$isComment = false;
continue;
}
if ($c == '\0' || $c == '\t' || $c == '\r' || $c == '\f' || $c == '\n' || $c == ' ')
continue;
if ($c == '%') {
$isComment = true;
continue;
}
if ($c == 'z' && $state === 0) {
$output .= str_repeat(chr(0), 4);
continue;
}
if ($c < '!' || $c > 'u')
return "";
$code = ord($input[$i]) & 0xff;
$ords[$state++] = $code - ord('!');
if ($state == 5) {
$state = 0;
for ($sum = 0, $j = 0; $j < 5; $j++)
$sum = $sum * 85 + $ords[$j];
for ($j = 3; $j >= 0; $j--)
$output .= chr($sum >> ($j * 8));
}
}
if ($state === 1)
return "";
elseif ($state > 1) {
for ($i = 0, $sum = 0; $i < $state; $i++)
$sum += ($ords[$i] + ($i == $state - 1)) * pow(85, 4 - $i);
for ($i = 0; $i < $state - 1; $i++) {
try {
if(false == ($o = chr($sum >> ((3 - $i) * 8)))) {
throw new Exception('Error');
}
$output .= $o;
} catch (Exception $e) { /*Dont do anything*/ }
}
}
return $output;
}
function decodeFlate($data) {
return #gzuncompress($data);
}
function getObjectOptions($object) {
$options = array();
if (preg_match("#<<(.*)>>#ismU", $object, $options)) {
$options = explode("/", $options[1]);
#array_shift($options);
$o = array();
for ($j = 0; $j < #count($options); $j++) {
$options[$j] = preg_replace("#\s+#", " ", trim($options[$j]));
if (strpos($options[$j], " ") !== false) {
$parts = explode(" ", $options[$j]);
$o[$parts[0]] = $parts[1];
} else
$o[$options[$j]] = true;
}
$options = $o;
unset($o);
}
return $options;
}
function getDecodedStream($stream, $options) {
$data = "";
if (empty($options["Filter"]))
$data = $stream;
else {
$length = !empty($options["Length"]) ? $options["Length"] : strlen($stream);
$_stream = substr($stream, 0, $length);
foreach ($options as $key => $value) {
if ($key == "ASCIIHexDecode")
$_stream = $this->decodeAsciiHex($_stream);
elseif ($key == "ASCII85Decode")
$_stream = $this->decodeAscii85($_stream);
elseif ($key == "FlateDecode")
$_stream = $this->decodeFlate($_stream);
elseif ($key == "Crypt") { // TO DO
}
}
$data = $_stream;
}
return $data;
}
function getDirtyTexts(&$texts, $textContainers) {
for ($j = 0; $j < count($textContainers); $j++) {
if (preg_match_all("#\[(.*)\]\s*TJ[\n|\r]#ismU", $textContainers[$j], $parts))
$texts = array_merge($texts, array(#implode('', $parts[1])));
elseif (preg_match_all("#T[d|w|m|f]\s*(\(.*\))\s*Tj[\n|\r]#ismU", $textContainers[$j], $parts))
$texts = array_merge($texts, array(#implode('', $parts[1])));
elseif (preg_match_all("#T[d|w|m|f]\s*(\[.*\])\s*Tj[\n|\r]#ismU", $textContainers[$j], $parts))
$texts = array_merge($texts, array(#implode('', $parts[1])));
}
}
function getCharTransformations(&$transformations, $stream) {
preg_match_all("#([0-9]+)\s+beginbfchar(.*)endbfchar#ismU", $stream, $chars, PREG_SET_ORDER);
preg_match_all("#([0-9]+)\s+beginbfrange(.*)endbfrange#ismU", $stream, $ranges, PREG_SET_ORDER);
for ($j = 0; $j < count($chars); $j++) {
$count = $chars[$j][1];
$current = explode("\n", trim($chars[$j][2]));
for ($k = 0; $k < $count && $k < count($current); $k++) {
if (preg_match("#<([0-9a-f]{2,4})>\s+<([0-9a-f]{4,512})>#is", trim($current[$k]), $map))
$transformations[str_pad($map[1], 4, "0")] = $map[2];
}
}
for ($j = 0; $j < count($ranges); $j++) {
$count = $ranges[$j][1];
$current = explode("\n", trim($ranges[$j][2]));
for ($k = 0; $k < $count && $k < count($current); $k++) {
if (preg_match("#<([0-9a-f]{4})>\s+<([0-9a-f]{4})>\s+<([0-9a-f]{4})>#is", trim($current[$k]), $map)) {
$from = hexdec($map[1]);
$to = hexdec($map[2]);
$_from = hexdec($map[3]);
for ($m = $from, $n = 0; $m <= $to; $m++, $n++)
$transformations[sprintf("%04X", $m)] = sprintf("%04X", $_from + $n);
} elseif (preg_match("#<([0-9a-f]{4})>\s+<([0-9a-f]{4})>\s+\[(.*)\]#ismU", trim($current[$k]), $map)) {
$from = hexdec($map[1]);
$to = hexdec($map[2]);
$parts = preg_split("#\s+#", trim($map[3]));
for ($m = $from, $n = 0; $m <= $to && $n < count($parts); $m++, $n++)
$transformations[sprintf("%04X", $m)] = sprintf("%04X", hexdec($parts[$n]));
}
}
}
}
function getTextUsingTransformations($texts, $transformations) {
$document = "";
for ($i = 0; $i < count($texts); $i++) {
$isHex = false;
$isPlain = false;
$hex = "";
$plain = "";
for ($j = 0; $j < strlen($texts[$i]); $j++) {
$c = $texts[$i][$j];
switch($c) {
case "<":
$hex = "";
$isHex = true;
$isPlain = false;
break;
case ">":
$hexs = str_split($hex, $this->multibyte); // 2 or 4 (UTF8 or ISO)
for ($k = 0; $k < count($hexs); $k++) {
$chex = str_pad($hexs[$k], 4, "0"); // Add tailing zero
if (isset($transformations[$chex]))
$chex = $transformations[$chex];
$document .= html_entity_decode("&#x".$chex.";");
}
$isHex = false;
break;
case "(":
$plain = "";
$isPlain = true;
$isHex = false;
break;
case ")":
$document .= $plain;
$isPlain = false;
break;
case "\\":
$c2 = $texts[$i][$j + 1];
if (in_array($c2, array("\\", "(", ")"))) $plain .= $c2;
elseif ($c2 == "n") $plain .= '\n';
elseif ($c2 == "r") $plain .= '\r';
elseif ($c2 == "t") $plain .= '\t';
elseif ($c2 == "b") $plain .= '\b';
elseif ($c2 == "f") $plain .= '\f';
elseif ($c2 >= '0' && $c2 <= '9') {
$oct = preg_replace("#[^0-9]#", "", substr($texts[$i], $j + 1, 3));
$j += strlen($oct) - 1;
$plain .= html_entity_decode("&#".octdec($oct).";", $this->convertquotes);
}
$j++;
break;
default:
if ($isHex)
$hex .= $c;
elseif ($isPlain)
$plain .= $c;
break;
}
}
$document .= "\n";
}
return $document;
}
}
?>

How To Delete The Top 100 Rows From a CSV File With PHP

I have a php script running on a regular basis that processes the top 100 rows of a CSV file. When it is done I want it to delete the processed rows from the CSV file.
I have tried the below code, but it does not delete anything. I am not sure the best way to state the condition in PHP and am not sure what to put for the $id. I set the number of rows at 5 for testing purposes. Likely have somy syntax wrong.
Any suggestions? Tips?
function delete_line($id)
{
if($id)
{
$file_handle = fopen("file.csv", "w+");
$myCsv = array();
while (!feof($file_handle) )
{
$line_of_text = fgetcsv($file_handle, 1024);
if ($id != $line_of_text[0])
{
fputcsv($file_handle, $line_of_text);
}
}
fclose($file_handle);
}
}
$in = fopen( 'file.csv', 'r');
$out = fopen( 'file.csv', 'w');
// Check whether they opened
while( $row = fgetcsv( $in, $cnt)){
fputcsv( $out, $row);
}
fclose( $in); fclose( $out);
Whole script is here...
<?php
require_once('wp-config.php');
$siteurl = get_site_url();
//print_r($_FILES);
//if($_FILES['file']['tmp_name']) {
//$upload = ABSPATH . 'file.csv';
//move_uploaded_file($_FILES['file']['tmp_name'], $upload);
//}
function clearer($str) {
//$str = iconv("UTF-8", "UTF-8//IGNORE", $str);
$str = utf8_encode($str);
$str = str_replace("’", "'", $str);
$str = str_replace("–", "-", $str);
return htmlspecialchars($str);
}
//file read
if(file_exists("file.csv")) $csv_lines = file("file.csv");
if(is_array($csv_lines)) {
$cnt = 5;
for($i = 0; $i < $cnt; $i++) {
$line = $csv_lines[$i];
$line = trim($line);
$first_char = true;
$col_num = 0;
$length = strlen($line);
for($b = 0; $b < $length; $b++) {
if($skip_char != true) {
$process = true;
if($first_char == true) {
if($line[$b] == '"') {
$terminator = '",';
$process = false;
}else
$terminator = ',';
$first_char = false;
}
if($line[$b] == '"'){
$next_char = $line[$b + 1];
if($next_char == '"')
$skip_char = true;
elseif($next_char == ',') {
if($terminator == '",') {
$first_char = true;
$process = false;
$skip_char = true;
}
}
}
if($process == true){
if($line[$b] == ',') {
if($terminator == ',') {
$first_char = true;
$process = false;
}
}
}
if($process == true)
$column .= $line[$b];
if($b == ($length - 1)) {
$first_char = true;
}
if($first_char == true) {
$values[$i][$col_num] = $column;
$column = '';
$col_num++;
}
}
else
$skip_char = false;
}
}
$values = array_values($values);
//print_r($values);
/*************************************************/
if(is_array($values)) {
//file.csv read
for($i = 0; $i < count($values); $i++) {
unset($post);
//check duplicate
//$wpdb->show_errors();
$wpdb->query("SELECT `ID` FROM `" . $wpdb->prefix . "posts`
WHERE `post_title` = '".clearer($values[$i][0])."' AND `post_status` = 'publish'");
//echo $wpdb->num_rows;
if($values[$i][0] != "Name" && $values[$i][0] != "" && $wpdb->num_rows == 0) {
$post['name'] = clearer($values[$i][0]);
$post['Address'] = clearer($values[$i][1]);
$post['City'] = clearer($values[$i][2]);
$post['Categories'] = $values[$i][3];
$post['Tags'] = $values[$i][4];
$post['Top_image'] = $values[$i][5];
$post['Body_text'] = clearer($values[$i][6]);
//details
for($k = 7; $k <= 56; $k++) {
$values[$i][$k] != '' ? $post['details'] .= "<em>".clearer($values[$i][$k])."</em>\r\n" : '';
}
//cats
$categoryes = explode(";", $post['Categories']);
foreach($categoryes AS $category_name) {
$term = term_exists($category_name, 'category');
if (is_array($term)) {
//category exist
$cats[] = $term['term_id'];
}else{
//add category
wp_insert_term( $category_name, 'category' );
$term = term_exists($category_name, 'category');
$cats[] = $term['term_id'];
}
}
//top image
if($post['Top_image'] != "") {
$im_name = md5($post['Top_image']).'.jpg';
$im = #imagecreatefromjpeg($post['Top_image']);
if ($im) {
imagejpeg($im, ABSPATH.'images/'.$im_name);
$post['topimage'] = '<img class="alignnone size-full" src="'.$siteurl.'/images/'.$im_name.'" alt="" />';
}
}
//bottom images
for($k = 57; $k <= 76; $k++) {
if($values[$i][$k] != '') {
$im_name = md5($values[$i][$k]).'.jpg';
$im = #imagecreatefromjpeg($values[$i][$k]);
if ($im) {
imagejpeg($im, ABSPATH.'images/'.$im_name);
$post['images'] .= '<img class="alignnone size-full" src="'.$siteurl.'/images/'.$im_name.'" alt="" />';
}
}
}
$post = array_map( 'stripslashes_deep', $post );
//print_r($post);
//post created
$my_post = array (
'post_title' => $post['name'],
'post_content' => '
<em>Address: '.$post['Address'].'</em>
'.$post['topimage'].'
'.$post['Body_text'].'
<!--more-->
'.$post['details'].'
'.$post['images'].'
',
'post_status' => 'publish',
'post_author' => 1,
'post_category' => $cats
);
unset($cats);
//add post
//echo "ID:" .
$postid = wp_insert_post($my_post); //post ID
//tags
wp_set_post_tags( $postid, str_replace(';',',',$post['Tags']), true ); //tags
echo $post['name']. ' - added. ';
//google coords
$address = preg_replace("!\((.*?)\)!si", " ", $post['Address']).', '.$post['City'];
$json = json_decode(file_get_contents('http://hicon.by/temp/googlegeo.php?address='.urlencode($address)));
//print_r($json);
if($json->status == "OK") {
//нашло адрес
$google['status'] = $json->status;
$params = $json->results[0]->address_components;
if(is_array($params)) {
foreach($params AS $id => $p) {
if($p->types[0] == 'locality') $google['locality_name'] = $p->short_name;
if($p->types[0] == 'administrative_area_level_2') $google['sub_admin_code'] = $p->short_name;
if($p->types[0] == 'administrative_area_level_1') $google['admin_code'] = $p->short_name;
if($p->types[0] == 'country') $google['country_code'] = $p->short_name;
if($p->types[0] == 'postal_code') $google['postal_code'] = $p->short_name;
}
}
$google['address'] = $json->results[0]->formatted_address;
$google['location']['lat'] = $json->results[0]->geometry->location->lat;
$google['location']['lng'] = $json->results[0]->geometry->location->lng;
//print_r($params);
//print_r($google);
//insert into DB
$insert_code = $wpdb->insert( $wpdb->prefix . 'geo_mashup_locations',
array( 'lat' => $google['location']['lat'], 'lng' => $google['location']['lng'], 'address' => $google['address'],
'saved_name' => $post['name'], 'postal_code' => $google['postal_code'],
'country_code' => $google['country_code'], 'admin_code' => $google['admin_code'],
'sub_admin_code' => $google['sub_admin_code'], 'locality_name' => $google['locality_name'] ),
array( '%s', '%s', '%s', '%s', '%s', '%s', '%s', '%s', '%s' )
);
if($insert_code) {
$google_code_id = $wpdb->insert_id;
$geo_date = date( 'Y-m-d H:i:s' );
$wpdb->insert(
$wpdb->prefix . 'geo_mashup_location_relationships',
array( 'object_name' => 'post', 'object_id' => $postid, 'location_id' => $google_code_id, 'geo_date' => $geo_date ),
array( '%s', '%s', '%s', '%s' )
);
}else{
//can't insert data
}
echo ' address added.<br />';
}else{
//echo $json->status;
}
}
} //$values end (for)
}
}else{
//not found file.csv
echo 'not found file.csv';
}
function delete_line($id)
{
if($id)
{
$file_handle = fopen("file.csv", "w+");
$myCsv = array();
while (!feof($file_handle) )
{
$line_of_text = fgetcsv($file_handle, 1024);
if ($id != $line_of_text[0])
{
fputcsv($file_handle, $line_of_text);
}
}
fclose($file_handle);
}
}
$in = fopen( 'file.csv', 'r');
$out = fopen( 'file.csv', 'w');
// Check whether they opened
while( $row = fgetcsv( $in, $cnt)){
fputcsv( $out, $row);
}
fclose( $in); fclose( $out);
?>
Update
I tried one of the answer codes and got a server error. Code below, thoughts?
<?php
require_once('wp-config.php');
$siteurl = get_site_url();
//print_r($_FILES);
//if($_FILES['file']['tmp_name']) {
//$upload = ABSPATH . 'file.csv';
//move_uploaded_file($_FILES['file']['tmp_name'], $upload);
//}
function clearer($str) {
//$str = iconv("UTF-8", "UTF-8//IGNORE", $str);
$str = utf8_encode($str);
$str = str_replace("’", "'", $str);
$str = str_replace("–", "-", $str);
return htmlspecialchars($str);
}
//file read
if(file_exists("file.csv")) $csv_lines = file("file.csv");
if(is_array($csv_lines)) {
$cnt = 5;
for($i = 0; $i < $cnt; $i++) {
$line = $csv_lines[$i];
$line = trim($line);
$first_char = true;
$col_num = 0;
$length = strlen($line);
for($b = 0; $b < $length; $b++) {
if($skip_char != true) {
$process = true;
if($first_char == true) {
if($line[$b] == '"') {
$terminator = '",';
$process = false;
}else
$terminator = ',';
$first_char = false;
}
if($line[$b] == '"'){
$next_char = $line[$b + 1];
if($next_char == '"')
$skip_char = true;
elseif($next_char == ',') {
if($terminator == '",') {
$first_char = true;
$process = false;
$skip_char = true;
}
}
}
if($process == true){
if($line[$b] == ',') {
if($terminator == ',') {
$first_char = true;
$process = false;
}
}
}
if($process == true)
$column .= $line[$b];
if($b == ($length - 1)) {
$first_char = true;
}
if($first_char == true) {
$values[$i][$col_num] = $column;
$column = '';
$col_num++;
}
}
else
$skip_char = false;
}
}
$values = array_values($values);
//print_r($values);
/*************************************************/
if(is_array($values)) {
//file.csv read
for($i = 0; $i < count($values); $i++) {
unset($post);
//check duplicate
//$wpdb->show_errors();
$wpdb->query("SELECT `ID` FROM `" . $wpdb->prefix . "posts`
WHERE `post_title` = '".clearer($values[$i][0])."' AND `post_status` = 'publish'");
//echo $wpdb->num_rows;
if($values[$i][0] != "Name" && $values[$i][0] != "" && $wpdb->num_rows == 0) {
$post['name'] = clearer($values[$i][0]);
$post['Address'] = clearer($values[$i][1]);
$post['City'] = clearer($values[$i][2]);
$post['Categories'] = $values[$i][3];
$post['Tags'] = $values[$i][4];
$post['Top_image'] = $values[$i][5];
$post['Body_text'] = clearer($values[$i][6]);
//details
for($k = 7; $k <= 56; $k++) {
$values[$i][$k] != '' ? $post['details'] .= "<em>".clearer($values[$i][$k])."</em>\r\n" : '';
}
//cats
$categoryes = explode(";", $post['Categories']);
foreach($categoryes AS $category_name) {
$term = term_exists($category_name, 'category');
if (is_array($term)) {
//category exist
$cats[] = $term['term_id'];
}else{
//add category
wp_insert_term( $category_name, 'category' );
$term = term_exists($category_name, 'category');
$cats[] = $term['term_id'];
}
}
//top image
if($post['Top_image'] != "") {
$im_name = md5($post['Top_image']).'.jpg';
$im = #imagecreatefromjpeg($post['Top_image']);
if ($im) {
imagejpeg($im, ABSPATH.'images/'.$im_name);
$post['topimage'] = '<img class="alignnone size-full" src="'.$siteurl.'/images/'.$im_name.'" alt="" />';
}
}
//bottom images
for($k = 57; $k <= 76; $k++) {
if($values[$i][$k] != '') {
$im_name = md5($values[$i][$k]).'.jpg';
$im = #imagecreatefromjpeg($values[$i][$k]);
if ($im) {
imagejpeg($im, ABSPATH.'images/'.$im_name);
$post['images'] .= '<img class="alignnone size-full" src="'.$siteurl.'/images/'.$im_name.'" alt="" />';
}
}
}
$post = array_map( 'stripslashes_deep', $post );
//print_r($post);
//post created
$my_post = array (
'post_title' => $post['name'],
'post_content' => '
<em>Address: '.$post['Address'].'</em>
'.$post['topimage'].'
'.$post['Body_text'].'
<!--more-->
'.$post['details'].'
'.$post['images'].'
',
'post_status' => 'publish',
'post_author' => 1,
'post_category' => $cats
);
unset($cats);
//add post
//echo "ID:" .
$postid = wp_insert_post($my_post); //post ID
//tags
wp_set_post_tags( $postid, str_replace(';',',',$post['Tags']), true ); //tags
echo $post['name']. ' - added. ';
//google coords
$address = preg_replace("!\((.*?)\)!si", " ", $post['Address']).', '.$post['City'];
$json = json_decode(file_get_contents('http://hicon.by/temp/googlegeo.php?address='.urlencode($address)));
//print_r($json);
if($json->status == "OK") {
//нашло адрес
$google['status'] = $json->status;
$params = $json->results[0]->address_components;
if(is_array($params)) {
foreach($params AS $id => $p) {
if($p->types[0] == 'locality') $google['locality_name'] = $p->short_name;
if($p->types[0] == 'administrative_area_level_2') $google['sub_admin_code'] = $p->short_name;
if($p->types[0] == 'administrative_area_level_1') $google['admin_code'] = $p->short_name;
if($p->types[0] == 'country') $google['country_code'] = $p->short_name;
if($p->types[0] == 'postal_code') $google['postal_code'] = $p->short_name;
}
}
$google['address'] = $json->results[0]->formatted_address;
$google['location']['lat'] = $json->results[0]->geometry->location->lat;
$google['location']['lng'] = $json->results[0]->geometry->location->lng;
//print_r($params);
//print_r($google);
//insert into DB
$insert_code = $wpdb->insert( $wpdb->prefix . 'geo_mashup_locations',
array( 'lat' => $google['location']['lat'], 'lng' => $google['location']['lng'], 'address' => $google['address'],
'saved_name' => $post['name'], 'postal_code' => $google['postal_code'],
'country_code' => $google['country_code'], 'admin_code' => $google['admin_code'],
'sub_admin_code' => $google['sub_admin_code'], 'locality_name' => $google['locality_name'] ),
array( '%s', '%s', '%s', '%s', '%s', '%s', '%s', '%s', '%s' )
);
if($insert_code) {
$google_code_id = $wpdb->insert_id;
$geo_date = date( 'Y-m-d H:i:s' );
$wpdb->insert(
$wpdb->prefix . 'geo_mashup_location_relationships',
array( 'object_name' => 'post', 'object_id' => $postid, 'location_id' => $google_code_id, 'geo_date' => $geo_date ),
array( '%s', '%s', '%s', '%s' )
);
}else{
//can't insert data
}
echo ' address added.<br />';
}else{
//echo $json->status;
}
}
} //$values end (for)
}
}else{
//not found file.csv
echo 'not found file.csv';
}
function csv_delete_rows($filename='file.csv', $startrow=1, $endrow=5, $inner=false) {
$status = 0;
//check if file exists
if (file_exists($filename)) {
//end execution for invalid startrow or endrow
if ($startrow < 0 || $endrow < 0 || $startrow > 0 && $endrow > 0 && $startrow > $endrow) {
die('Invalid startrow or endrow value');
}
$updatedcsv = array();
$count = 0;
//open file to read contents
$fp = fopen($filename, "r");
//loop to read through csv contents
while ($csvcontents = fgetcsv($fp)) {
$count++;
if ($startrow > 0 && $endrow > 0) {
//delete rows inside startrow and endrow
if ($inner) {
$status = 1;
if ($count >= $startrow && $count <= $endrow)
continue;
array_push($updatedcsv, implode(',', $csvcontents));
}
//delete rows outside startrow and endrow
else {
$status = 2;
if ($count < $startrow || $count > $endrow)
continue;
array_push($updatedcsv, implode(',', $csvcontents));
}
}
else if ($startrow == 0 && $endrow > 0) {
$status = 3;
if ($count <= $endrow)
continue;
array_push($updatedcsv, implode(',', $csvcontents));
}
else if ($endrow == 0 && $startrow > 0) {
$status = 4;
if ($count >= $startrow)
continue;
array_push($updatedcsv, implode(',', $csvcontents));
}
else if ($startrow == 0 && $endrow == 0) {
$status = 5;
} else {
$status = 6;
}
}//end while
if ($status < 5) {
$finalcsvfile = implode("\n", $updatedcsv);
fclose($fp);
$fp = fopen($filename, "w");
fwrite($fp, $finalcsvfile);
}
fclose($fp);
return $status;
} else {
die('File does not exist');
}
}
?>
<html>
<body>
<form enctype="multipart/form-data" method="post">
CSV: <input name="file" type="file" />
<input type="submit" value="Send File" />
</form>
</body>
</html>
Here is php code for that:
$input = explode("\n", file_get_contents("file.csv"));
foreach ($input as $line) {
// process all lines.
}
// This function removes first 100 elements.
// More info:
// http://php.net/manual/en/function.array-slice.php
$output = array_slice($input, 100);
file_put_contents("out.csv", implode("\n", $output));
Note, if csv file contains header you have to remove first element from array $input.
I used this script in the passed written by Joby Joseph:
function csv_delete_rows($filename=NULL, $startrow=0, $endrow=0, $inner=true) {
$status = 0;
//check if file exists
if (file_exists($filename)) {
//end execution for invalid startrow or endrow
if ($startrow < 0 || $endrow < 0 || $startrow > 0 && $endrow > 0 && $startrow > $endrow) {
die('Invalid startrow or endrow value');
}
$updatedcsv = array();
$count = 0;
//open file to read contents
$fp = fopen($filename, "r");
//loop to read through csv contents
while ($csvcontents = fgetcsv($fp)) {
$count++;
if ($startrow > 0 && $endrow > 0) {
//delete rows inside startrow and endrow
if ($inner) {
$status = 1;
if ($count >= $startrow && $count <= $endrow)
continue;
array_push($updatedcsv, implode(',', $csvcontents));
}
//delete rows outside startrow and endrow
else {
$status = 2;
if ($count < $startrow || $count > $endrow)
continue;
array_push($updatedcsv, implode(',', $csvcontents));
}
}
else if ($startrow == 0 && $endrow > 0) {
$status = 3;
if ($count <= $endrow)
continue;
array_push($updatedcsv, implode(',', $csvcontents));
}
else if ($endrow == 0 && $startrow > 0) {
$status = 4;
if ($count >= $startrow)
continue;
array_push($updatedcsv, implode(',', $csvcontents));
}
else if ($startrow == 0 && $endrow == 0) {
$status = 5;
} else {
$status = 6;
}
}//end while
if ($status < 5) {
$finalcsvfile = implode("\n", $updatedcsv);
fclose($fp);
$fp = fopen($filename, "w");
fwrite($fp, $finalcsvfile);
}
fclose($fp);
return $status;
} else {
die('File does not exist');
}
}
The function accepts 4 parameters:
filename (string): Path to csv file. Eg: myfile.csv
startRow (int): First row in delete area
endRow (int): Last row in delete area
inner (boolean): decide whether rows deleted are from inner area or outer area
Now Let us consider various cases. I have a csv file with me named ‘test.csv’. Here is the screenshot of the same.
Example 1:
$status = csv_delete_rows('test.csv', 3, 5, true);
will delete the red part of:
Example 2:
$status = csv_delete_rows('test.csv', 3, 5, false);
will delete the red part of:
Example 3:
Like in your situation, if you want to delete the first 100 rows, use this:
$status = csv_delete_rows('test.csv', 0, 100);

How can I extract text from PDF files using PHP?

I know there are a lot of PDF extraction methods/techniques, but I'm after a reliable text extractor for PDFs in PHP. All I want is to extract words, but not numbers and no special characters.
Any ideas of solid techniques to achieve this?
The Zend Framework provides Zend_Pdf, a php class that will load and parse pdf documents.
Here is a script that shows how to extract the text from a loaded Zend_Pdf object.
If you use laravel, you can try my library...
<?php
namespace App\Helpers;
/*
SYNTAX:
include('class.pdf2text.php');
$a = new PDF2Text();
$a->setFilename('test.pdf');
$a->decodePDF();
echo $a->output();
*/
class PDF2Text {
// Some settings
var $multibyte = 4; // Use setUnicode(TRUE|FALSE)
var $convertquotes = ENT_QUOTES; // ENT_COMPAT (double-quotes), ENT_QUOTES (Both), ENT_NOQUOTES (None)
var $showprogress = false; // TRUE if you have problems with time-out
// Variables
var $filename = '';
var $decodedtext = '';
function setFilename($filename) {
$this->decodedtext = '';
$this->filename = $filename;
}
function output($echo = false) {
if($echo) echo $this->decodedtext;
else return $this->decodedtext;
}
function setUnicode($input) {
// 4 for unicode. But 2 should work in most cases just fine
if($input == true) $this->multibyte = 4;
else $this->multibyte = 2;
}
function decodePDF() {
// Read the data from pdf file
$infile = #file_get_contents($this->filename, FILE_BINARY);
if (empty($infile))
return "";
// Get all text data.
$transformations = array();
$texts = array();
// Get the list of all objects.
preg_match_all("#obj[\n|\r](.*)endobj[\n|\r]#ismU", $infile . "endobj\r", $objects);
$objects = #$objects[1];
// Select objects with streams.
for ($i = 0; $i < count($objects); $i++) {
$currentObject = $objects[$i];
// Prevent time-out
#set_time_limit ();
if($this->showprogress) {
flush(); ob_flush();
}
// Check if an object includes data stream.
if (preg_match("#stream[\n|\r](.*)endstream[\n|\r]#ismU", $currentObject . "endstream\r", $stream )) {
$stream = ltrim($stream[1]);
// Check object parameters and look for text data.
$options = $this->getObjectOptions($currentObject);
if (!(empty($options["Length1"]) && empty($options["Type"]) && empty($options["Subtype"])) )
continue;
unset($options["Length"]);
// So, we have text data. Decode it.
$data = $this->getDecodedStream($stream, $options);
if (strlen($data)) {
if (preg_match_all("#BT[\n|\r](.*)ET[\n|\r]#ismU", $data . "ET\r", $textContainers)) {
$textContainers = #$textContainers[1];
$this->getDirtyTexts($texts, $textContainers);
} else
$this->getCharTransformations($transformations, $data);
}
}
}
// Analyze text blocks taking into account character transformations and return results.
$this->decodedtext = $this->getTextUsingTransformations($texts, $transformations);
}
function decodeAsciiHex($input) {
$output = "";
$isOdd = true;
$isComment = false;
for($i = 0, $codeHigh = -1; $i < strlen($input) && $input[$i] != '>'; $i++) {
$c = $input[$i];
if($isComment) {
if ($c == '\r' || $c == '\n')
$isComment = false;
continue;
}
switch($c) {
case '\0': case '\t': case '\r': case '\f': case '\n': case ' ': break;
case '%':
$isComment = true;
break;
default:
$code = hexdec($c);
if($code === 0 && $c != '0')
return "";
if($isOdd)
$codeHigh = $code;
else
$output .= chr($codeHigh * 16 + $code);
$isOdd = !$isOdd;
break;
}
}
if($input[$i] != '>')
return "";
if($isOdd)
$output .= chr($codeHigh * 16);
return $output;
}
function decodeAscii85($input) {
$output = "";
$isComment = false;
$ords = array();
for($i = 0, $state = 0; $i < strlen($input) && $input[$i] != '~'; $i++) {
$c = $input[$i];
if($isComment) {
if ($c == '\r' || $c == '\n')
$isComment = false;
continue;
}
if ($c == '\0' || $c == '\t' || $c == '\r' || $c == '\f' || $c == '\n' || $c == ' ')
continue;
if ($c == '%') {
$isComment = true;
continue;
}
if ($c == 'z' && $state === 0) {
$output .= str_repeat(chr(0), 4);
continue;
}
if ($c < '!' || $c > 'u')
return "";
$code = ord($input[$i]) & 0xff;
$ords[$state++] = $code - ord('!');
if ($state == 5) {
$state = 0;
for ($sum = 0, $j = 0; $j < 5; $j++)
$sum = $sum * 85 + $ords[$j];
for ($j = 3; $j >= 0; $j--)
$output .= chr($sum >> ($j * 8));
}
}
if ($state === 1)
return "";
elseif ($state > 1) {
for ($i = 0, $sum = 0; $i < $state; $i++)
$sum += ($ords[$i] + ($i == $state - 1)) * pow(85, 4 - $i);
for ($i = 0; $i < $state - 1; $i++) {
try {
if(false == ($o = chr($sum >> ((3 - $i) * 8)))) {
throw new Exception('Error');
}
$output .= $o;
} catch (Exception $e) { /*Dont do anything*/ }
}
}
return $output;
}
function decodeFlate($data) {
return #gzuncompress($data);
}
function getObjectOptions($object) {
$options = array();
if (preg_match("#<<(.*)>>#ismU", $object, $options)) {
$options = explode("/", $options[1]);
#array_shift($options);
$o = array();
for ($j = 0; $j < #count($options); $j++) {
$options[$j] = preg_replace("#\s+#", " ", trim($options[$j]));
if (strpos($options[$j], " ") !== false) {
$parts = explode(" ", $options[$j]);
$o[$parts[0]] = $parts[1];
} else
$o[$options[$j]] = true;
}
$options = $o;
unset($o);
}
return $options;
}
function getDecodedStream($stream, $options) {
$data = "";
if (empty($options["Filter"]))
$data = $stream;
else {
$length = !empty($options["Length"]) ? $options["Length"] : strlen($stream);
$_stream = substr($stream, 0, $length);
foreach ($options as $key => $value) {
if ($key == "ASCIIHexDecode")
$_stream = $this->decodeAsciiHex($_stream);
elseif ($key == "ASCII85Decode")
$_stream = $this->decodeAscii85($_stream);
elseif ($key == "FlateDecode")
$_stream = $this->decodeFlate($_stream);
elseif ($key == "Crypt") { // TO DO
}
}
$data = $_stream;
}
return $data;
}
function getDirtyTexts(&$texts, $textContainers) {
for ($j = 0; $j < count($textContainers); $j++) {
if (preg_match_all("#\[(.*)\]\s*TJ[\n|\r]#ismU", $textContainers[$j], $parts))
$texts = array_merge($texts, array(#implode('', $parts[1])));
elseif (preg_match_all("#T[d|w|m|f]\s*(\(.*\))\s*Tj[\n|\r]#ismU", $textContainers[$j], $parts))
$texts = array_merge($texts, array(#implode('', $parts[1])));
elseif (preg_match_all("#T[d|w|m|f]\s*(\[.*\])\s*Tj[\n|\r]#ismU", $textContainers[$j], $parts))
$texts = array_merge($texts, array(#implode('', $parts[1])));
}
}
function getCharTransformations(&$transformations, $stream) {
preg_match_all("#([0-9]+)\s+beginbfchar(.*)endbfchar#ismU", $stream, $chars, PREG_SET_ORDER);
preg_match_all("#([0-9]+)\s+beginbfrange(.*)endbfrange#ismU", $stream, $ranges, PREG_SET_ORDER);
for ($j = 0; $j < count($chars); $j++) {
$count = $chars[$j][1];
$current = explode("\n", trim($chars[$j][2]));
for ($k = 0; $k < $count && $k < count($current); $k++) {
if (preg_match("#<([0-9a-f]{2,4})>\s+<([0-9a-f]{4,512})>#is", trim($current[$k]), $map))
$transformations[str_pad($map[1], 4, "0")] = $map[2];
}
}
for ($j = 0; $j < count($ranges); $j++) {
$count = $ranges[$j][1];
$current = explode("\n", trim($ranges[$j][2]));
for ($k = 0; $k < $count && $k < count($current); $k++) {
if (preg_match("#<([0-9a-f]{4})>\s+<([0-9a-f]{4})>\s+<([0-9a-f]{4})>#is", trim($current[$k]), $map)) {
$from = hexdec($map[1]);
$to = hexdec($map[2]);
$_from = hexdec($map[3]);
for ($m = $from, $n = 0; $m <= $to; $m++, $n++)
$transformations[sprintf("%04X", $m)] = sprintf("%04X", $_from + $n);
} elseif (preg_match("#<([0-9a-f]{4})>\s+<([0-9a-f]{4})>\s+\[(.*)\]#ismU", trim($current[$k]), $map)) {
$from = hexdec($map[1]);
$to = hexdec($map[2]);
$parts = preg_split("#\s+#", trim($map[3]));
for ($m = $from, $n = 0; $m <= $to && $n < count($parts); $m++, $n++)
$transformations[sprintf("%04X", $m)] = sprintf("%04X", hexdec($parts[$n]));
}
}
}
}
function getTextUsingTransformations($texts, $transformations) {
$document = "";
for ($i = 0; $i < count($texts); $i++) {
$isHex = false;
$isPlain = false;
$hex = "";
$plain = "";
for ($j = 0; $j < strlen($texts[$i]); $j++) {
$c = $texts[$i][$j];
switch($c) {
case "<":
$hex = "";
$isHex = true;
$isPlain = false;
break;
case ">":
$hexs = str_split($hex, $this->multibyte); // 2 or 4 (UTF8 or ISO)
for ($k = 0; $k < count($hexs); $k++) {
$chex = str_pad($hexs[$k], 4, "0"); // Add tailing zero
if (isset($transformations[$chex]))
$chex = $transformations[$chex];
$document .= html_entity_decode("&#x".$chex.";");
}
$isHex = false;
break;
case "(":
$plain = "";
$isPlain = true;
$isHex = false;
break;
case ")":
$document .= $plain;
$isPlain = false;
break;
case "\\":
$c2 = $texts[$i][$j + 1];
if (in_array($c2, array("\\", "(", ")"))) $plain .= $c2;
elseif ($c2 == "n") $plain .= '\n';
elseif ($c2 == "r") $plain .= '\r';
elseif ($c2 == "t") $plain .= '\t';
elseif ($c2 == "b") $plain .= '\b';
elseif ($c2 == "f") $plain .= '\f';
elseif ($c2 >= '0' && $c2 <= '9') {
$oct = preg_replace("#[^0-9]#", "", substr($texts[$i], $j + 1, 3));
$j += strlen($oct) - 1;
$plain .= html_entity_decode("&#".octdec($oct).";", $this->convertquotes);
}
$j++;
break;
default:
if ($isHex)
$hex .= $c;
elseif ($isPlain)
$plain .= $c;
break;
}
}
$document .= "\n";
//$document .= "<br>";
}
return $document;
}
}

Cutting text without destroying html tags

Is there a way to do this without writing my own function?
For example:
$text = 'Test <span><a>something</a> something else</span>.';
$text = cutText($text, 2, null, 20, true);
//result: Test <span><a>something</a></span>
I need to make this function indestructible
My problem is similar to
This thread
but I need a better solution. I would like to keep nested tags untouched.
So far my algorithm is:
function cutText($content, $max_words, $max_chars, $max_word_len, $html = false) {
$len = strlen($content);
$res = '';
$word_count = 0;
$word_started = false;
$current_word = '';
$current_word_len = 0;
if ($max_chars == null) {
$max_chars = $len;
}
$inHtml = false;
$openedTags = array();
for ($i = 0; $i<$max_chars;$i++) {
if ($content[$i] == '<' && $html) {
$inHtml = true;
}
if ($inHtml) {
$max_chars++;
}
if ($html && !$inHtml) {
if ($content[$i] != ' ' && !$word_started) {
$word_started = true;
$word_count++;
}
$current_word .= $content[$i];
$current_word_len++;
if ($current_word_len == $max_word_len) {
$current_word .= '- ';
}
if (($content[$i] == ' ') && $word_started) {
$word_started = false;
$res .= $current_word;
$current_word = '';
$current_word_len = 0;
if ($word_count == $max_words) {
return $res;
}
}
}
if ($content[$i] == '<' && $html) {
$inHtml = true;
}
}
return $res;
}
But of course it won't work. I thought about remembering opened tags and closing them if they were not closed but maybe there is a better way?
This works perfectly for me:
function trimContent ($str, $trimAtIndex) {
$beginTags = array();
$endTags = array();
for($i = 0; $i < strlen($str); $i++) {
if( $str[$i] == '<' )
$beginTags[] = $i;
else if($str[$i] == '>')
$endTags[] = $i;
}
foreach($beginTags as $k=>$index) {
// Trying to trim in between tags. Trim after the last tag
if( ( $trimAtIndex >= $index ) && ($trimAtIndex <= $endTags[$k]) ) {
$trimAtIndex = $endTags[$k];
}
}
return substr($str, 0, $trimAtIndex);
}
Try something like this
function cutText($inputText, $start, $length) {
$temp = $inputText;
$res = array();
while (strpos($temp, '>')) {
$ts = strpos($temp, '<');
$te = strpos($temp, '>');
if ($ts > 0) $res[] = substr($temp, 0, $ts);
$res[] = substr($temp, $ts, $te - $ts + 1);
$temp = substr($temp, $te + 1, strlen($temp) - $te);
}
if ($temp != '') $res[] = $temp;
$pointer = 0;
$end = $start + $length - 1;
foreach ($res as &$part) {
if (substr($part, 0, 1) != '<') {
$l = strlen($part);
$p1 = $pointer;
$p2 = $pointer + $l - 1;
$partx = "";
if ($start <= $p1 && $end >= $p2) $partx = "";
else {
if ($start > $p1 && $start <= $p2) $partx .= substr($part, 0, $start-$pointer);
if ($end >= $p1 && $end < $p2) $partx .= substr($part, $end-$pointer+1, $l-$end+$pointer);
if ($partx == "") $partx = $part;
}
$part = $partx;
$pointer += $l;
}
}
return join('', $res);
}
Parameters:
$inputText - input text
$start - position of first character
$length - how menu characters we want to remove
Example #1 - Removing first 3 characters
$text = 'Test <span><a>something</a> something else</span>.';
$text = cutText($text, 0, 3);
var_dump($text);
Output (removed "Tes")
string(47) "t <span><a>something</a> something else</span>."
Removing first 10 characters
$text = cutText($text, 0, 10);
Output (removed "Test somet")
string(40) "<span><a>hing</a> something else</span>."
Example 2 - Removing inner characters - "es" from "Test "
$text = cutText($text, 1, 2);
Output
string(48) "Tt <span><a>something</a> something else</span>."
Removing "thing something el"
$text = cutText($text, 9, 18);
Output
string(32) "Test <span><a>some</a>se</span>."
Hope this helps.
Well, maybe this is not the best solution but it's everything I can do at the moment.
Ok I solved this thing.
I divided this in 2 parts.
First cutting text without destroying html:
function cutHtml($content, $max_words, $max_chars, $max_word_len) {
$len = strlen($content);
$res = '';
$word_count = 0;
$word_started = false;
$current_word = '';
$current_word_len = 0;
if ($max_chars == null) {
$max_chars = $len;
}
$inHtml = false;
$openedTags = array();
$i = 0;
while ($i < $max_chars) {
//skip any html tags
if ($content[$i] == '<') {
$inHtml = true;
while (true) {
$res .= $content[$i];
$i++;
while($content[$i] == ' ') { $res .= $content[$i]; $i++; }
//skip any values
if ($content[$i] == "'") {
$res .= $content[$i];
$i++;
while(!($content[$i] == "'" && $content[$i-1] != "\\")) {
$res .= $content[$i];
$i++;
}
}
//skip any values
if ($content[$i] == '"') {
$res .= $content[$i];
$i++;
while(!($content[$i] == '"' && $content[$i-1] != "\\")) {
$res .= $content[$i];
$i++;
}
}
if ($content[$i] == '>') { $res .= $content[$i]; $i++; break;}
}
$inHtml = false;
}
if (!$inHtml) {
while($content[$i] == ' ') { $res .= $content[$i]; $letter_count++; $i++; } //skip spaces
$word_started = false;
$current_word = '';
$current_word_len = 0;
while (!in_array($content[$i], array(' ', '<', '.', ','))) {
if (!$word_started) {
$word_started = true;
$word_count++;
}
$current_word .= $content[$i];
$current_word_len++;
if ($current_word_len == $max_word_len) {
$current_word .= '-';
$current_word_len = 0;
}
$i++;
}
if ($letter_count > $max_chars) {
return $res;
}
if ($word_count < $max_words) {
$res .= $current_word;
$letter_count += strlen($current_word);
}
if ($word_count == $max_words) {
$res .= $current_word;
$letter_count += strlen($current_word);
return $res;
}
}
}
return $res;
}
And next thing is closing unclosed tags:
function cleanTags(&$html) {
$count = strlen($html);
$i = -1;
$openedTags = array();
while(true) {
$i++;
if ($i >= $count) break;
if ($html[$i] == '<') {
$tag = '';
$closeTag = '';
$reading = false;
//reading whole tag
while($html[$i] != '>') {
$i++;
while($html[$i] == ' ') $i++; //skip any spaces (need to be idiot proof)
if (!$reading && $html[$i] == '/') { //closing tag
$i++;
while($html[$i] == ' ') $i++; //skip any spaces
$closeTag = '';
while($html[$i] != ' ' && $html[$i] != '>') { //start reading first actuall string
$reading = true;
$html[$i] = strtolower($html[$i]); //tags to lowercase
$closeTag .= $html[$i];
$i++;
}
$c = count($openedTags);
if ($c > 0 && $openedTags[$c-1] == $closeTag) array_pop($openedTags);
}
if (!$reading) //read only tag
while($html[$i] != ' ' && $html[$i] != '>') { //start reading first actuall string
$reading = true;
$html[$i] = strtolower($html[$i]); //tags to lowercase
$tag .= $html[$i];
$i++;
}
//skip any values
if ($html[$i] == "'") {
$i++;
while(!($html[$i] == "'" && $html[$i-1] != "\\")) {
$i++;
}
}
//skip any values
if ($html[$i] == '"') {
$i++;
while(!($html[$i] == '"' && $html[$i-1] != "\\")) {
$i++;
}
}
if ($reading && $html[$i] == '/') { //self closed tag
$tag = '';
break;
}
}
if (!empty($tag)) $openedTags[] = $tag;
}
}
while (count($openedTags) > 0) {
$tag = array_pop($openedTags);
$html .= "</$tag>";
}
}
It's not idiot proof but tinymce will clear this thing out so further cleaning is not necessary.
It may be a little long but i don't think it will eat a lot of resources and it should be faster than regex.

Categories