Limiting XML/HTML string length

So I am trying to parse an XML file and display the first 150 words of each article with a READ MORE link. It doesn't parse exactly 150 words, though. I am also not sure how to keep it from including IMG tag code, etc. The code is below:
// Script displays 3 most recent blog posts from blog.pinchit.com (blog.pinchit.com/api/read)
// The entries on homepage show the first 150 words of description and "READ MORE" link
// PART 1 - PARSING
// if it was a JSON file
// $string=file_get_contents("http://blog.pinchit.com/api/read");
// $json_a=json_decode($string,true);
// var_export($json_a);
// XML Parsing
$file = "http://blog.pinchit.com/api/read";
$posts_to_display = 3;
$posts = array();
// get all the file nodes
if(!$xml=simplexml_load_file($file)){
trigger_error('Error reading XML file',E_USER_ERROR);
}
// counter for posts member array
$counter = 0;
// Accessing elements within an XML document that contain characters not permitted under PHP's naming convention
// (e.g. the hyphen) can be accomplished by encapsulating the element name within braces and the apostrophe.
foreach($xml->posts->post as $post){
//post's title
$posts[$counter]['title'] = $post->{'regular-title'};
// post's full body
$posts[$counter]['body'] = $post->{'regular-body'};
// post's body's first 150 words
//for some reason, I am not sure if it's exactly 150
$posts[$counter]['preview'] = substr($posts[$counter]['body'], 0, 150);
//strip all the html tags so it doesn't mess up the page
$posts[$counter]['preview'] = strip_tags($posts[$counter]['preview']);
//post's id
$posts[$counter]['id'] = $post->attributes()->id;
$posts_to_display--;
$counter++;
//exit the for loop after we parse out all the articles that we want
if ($posts_to_display == 0 ) break;
}
// Displays all of the posts
foreach($posts as $post){
echo "<b>" . $post['title'] . "</b>";
echo "<br/>";
echo $post['preview'];
echo " <a href='http://blog.pinchit.com/post/" . $post['id'] . "'>Read More</a>";
echo "<br/><br/>";
}
Here is how the results look now:
Editor's Pick: Club Sportiva
Nothing makes you feel as totally free and in control as a day behind the wheel of a sleek, sophisticated, sexy sports car. It’s no surprise Read More
Pinchy Drinks & Rocks: The Hotel Utah Saloon
Hotel Utah Read More
Monday Menu: Spicy Grapefruit, Paprika, Creamsicles
Feeling summery and savory today, and we have to admit it took a lot to resist the urge to make this an all appetizers, all desserts, or all drinks Read More

The HTML tags are counting against your character total. Strip the tags out first, then take your preview sample:
$preview = strip_tags($posts[$counter]['body']);
$posts[$counter]['preview'] = substr($preview, 0, 150).'...';
Also, one usually adds an ellipsis ("...") to the end of truncated text to indicate that it continues.
Note that this has the potential disadvantage of removing tags you DO want, like <p> and <br>. If you want to preserve those, you can pass them as the second argument for strip_tags:
$preview = strip_tags($posts[$counter]['body'], '<br><p>');
$posts[$counter]['preview'] = substr($preview, 0, 150).'...';
BUT, be forewarned that XML-style tags might throw this off (<br />). If you're dealing with XML/HTML mixed, you might need to elevate your tag filtering using something like htmLawed, but the concept remains the same - get rid of the HTML before you truncate.
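Since the question asks for 150 words while substr() counts characters, here is a minimal sketch of a word-based preview. The function name preview_words and its parameters are made up for illustration; it assumes the body is UTF-8 and that a plain-text preview is acceptable.

```php
<?php
// Hypothetical helper: build a plain-text preview limited to a number of words.
function preview_words(string $html, int $limit = 150): string
{
    // Remove markup first so tags do not count toward the limit,
    // then decode entities so "&amp;" is treated as one character of text.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Split on any run of whitespace and drop empty pieces.
    $words = preg_split('/\s+/u', trim($text), -1, PREG_SPLIT_NO_EMPTY);

    if (count($words) <= $limit) {
        return implode(' ', $words);
    }

    // Keep the first $limit words and mark the cut with an ellipsis.
    return implode(' ', array_slice($words, 0, $limit)) . '...';
}
```

Because the entities are decoded, the result should be passed through htmlspecialchars() before it is echoed into a page.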

Looking at the tag <regular-body> it seems to contain HTML. Therefore I would recommend trying to parse that into a DOMDocument ( http://www.php.net/manual/en/domdocument.loadhtml.php ). You then would be able to loop through all the items and ignore certain tags (ex. ignore <img> but keep <p>). After that, you can then render out what you want and truncate it to 150 characters.
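A rough sketch of that DOMDocument approach might look like the following. The helper name text_without_tags and the default list of skipped tags are assumptions for illustration, and the input is assumed to be a UTF-8 HTML fragment.

```php
<?php
// Hypothetical helper: extract the text of an HTML fragment, ignoring chosen tags.
function text_without_tags(string $html, array $skip = ['img', 'script', 'style']): string
{
    $previous = libxml_use_internal_errors(true); // tolerate sloppy markup
    $doc = new DOMDocument();
    // The XML encoding hint makes loadHTML() read the fragment as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($skip as $tag) {
        // Copy the live node list first, because removing nodes changes it.
        foreach (iterator_to_array($doc->getElementsByTagName($tag)) as $node) {
            $node->parentNode->removeChild($node);
        }
    }

    // Collapse the whitespace left behind where markup was removed.
    return trim(preg_replace('/\s+/u', ' ', $doc->documentElement->textContent));
}
```

The returned text can then be truncated with substr() or a word-based cut.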

Related

Formatting string to display as list using php

I have the contents of a textarea being stored in a PHP string after it is submitted by the user. I am hoping to be able to tweak the formatting of the contents of that string, such that it will be displayable as a list when it is echoed. In other words, I would need to insert UL and /UL at the beginning and end, respectively, and LI and /LI at the beginning and end of each line.
Before I mess with my code, I was wondering if anyone knows whether this is even possible. Are carriage returns sent via textarea submit? Any help/comments would be much appreciated.
[EDIT]
I have defined some variables to give myself all the necessary HTML stuff. The 'repertoire' variable is the original string containing text sent from user input.
$repertoire = ($_POST['repertoire']);
$list_start = '<UL>';
$list_end = '</UL>';
$list_start_line = '<LI>';
$list_end_line = '</LI>';
The following is an example of what would be submitted by the user, and therefore, what would constitute the original $repertoire string:
Luciano Berio - Circles
Mike Svoboda - Piangero la sorte mia
Nicholas von Ritter-Zahony - New Piece
Stefano Gervasoni - Due Poesie Francesi di Rilke
So we would at least need the following:
$repertoire_formatted = $list_start . $repertoire . $list_end;
...but I don't know how to substitute <LI> for line breaks; also, I cannot know in advance the length of the string or of each line.

You can use a regex to select every line and wrap it in <li></li>:
$html = preg_replace("/([^\n]+)/", "<li>$1</li>", $repertoire);
$html = "<ul>\n$html</ul>";
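Browsers do submit textarea line breaks, normally as \r\n, so a pattern that only excludes \n can leave a stray \r inside each item. A small sketch that splits on any line-break style, skips blank lines, and escapes the user's text is shown below; the function name lines_to_list is made up for illustration.

```php
<?php
// Hypothetical helper: turn textarea input into an HTML unordered list.
function lines_to_list(string $input): string
{
    // \R matches \r\n, \n, or \r, whichever the browser sent.
    $lines = preg_split('/\R/', $input);

    $items = '';
    foreach ($lines as $line) {
        $line = trim($line);
        if ($line === '') {
            continue; // ignore blank lines
        }
        // Escape user input so it cannot inject markup into the page.
        $items .= '<li>' . htmlspecialchars($line, ENT_QUOTES, 'UTF-8') . "</li>\n";
    }

    return "<ul>\n" . $items . "</ul>";
}
```

For example, echo lines_to_list($_POST['repertoire']); would print one list item per submitted line.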

PHP Parsing An QBO File Using strpos and substr

I need PHP code that will parse specific values from a QuickBooks Online (QBO) file, also known as the OFX/QFX file format (http://en.wikipedia.org/wiki/QFX_%28file_format%29).
A section of my sample QBO file that can be used for testing is the following:
OFXHEADER:100
DATA:OFXSGML
VERSION:102
SECURITY:NONE
ENCODING:USASCII
CHARSET:1252
COMPRESSION:NONE
OLDFILEUID:NONE
NEWFILEUID:NONE
<OFX><SIGNONMSGSRSV1><SONRS><STATUS><CODE>0<SEVERITY>INFO</STATUS><DTSERVER>20150518082838<LANGUAGE>ENG<FI><ORG>Bank Name<FID>1234</FI><INTU.BID>1234<INTU.USERID>123456789012</SONRS></SIGNONMSGSRSV1>
<BANKMSGSRSV1><STMTTRNRS><TRNUID>0<STATUS><CODE>0<SEVERITY>INFO</STATUS><STMTRS><CURDEF>USD<BANKACCTFROM><BANKID>123456789<ACCTID>12345678901<ACCTTYPE>CHECKING</BANKACCTFROM><BANKTRANLIST><DTSTART>20140204235959<DTEND>20150512235959
<STMTTRN><TRNTYPE>DIRECTDEBIT<DTPOSTED>20140204235959<TRNAMT>-000000056.32<FITID>2014000000000000000000000000000000000000000000000000000<NAME>ELECT PWR<MEMO>WEB</STMTTRN>
</BANKTRANLIST><LEDGERBAL><BALAMT>123.45<DTASOF>20150515235959</LEDGERBAL><AVAILBAL><BALAMT>123.45<DTASOF>20150515235959</AVAILBAL></STMTRS>
</STMTTRNRS></BANKMSGSRSV1></OFX>
I am having trouble getting values from a QBO file into an array in PHP. I've looked into various utilities such as QBO2CSV (http://www.propersoft.net/qbo2csv/home) and FixOFX (https://github.com/wesabe/fixofx), but I would like to use just PHP code to do this if at all possible. QBO2CSV seemed to almost work if I use a command line to convert the QBO to a CSV and then parse the CSV, but if I could do this in PHP alone I could cut out a few steps.
I also have trouble cleaning up the QBO and then using SimpleXMLElement, as QBO files are very non-standard XML and I have been unable to clean one up enough for SimpleXMLElement to accept it. The best example I have found for this is at: http://www.ibm.com/developerworks/library/x-ofxv2/
...and it almost works. It is the closest code solution I have found, but it still isn't producing the results. That solution also tries to use SimpleXMLElement after cleaning up the QBO, and it too has difficulty cleaning the QBO up enough to be accepted by SimpleXMLElement.
Parts of my attempted solution are below, but I am having difficulty with the XML brackets.
My code:
// READ CONTENTS OF FILE TO STRING
$cont = file_get_contents('C:\xampp\htdocs\testparse\test.qbo');
// STRIP OUT HEADER
$bline = strpos($cont,"<OFX>");
$head = substr($cont,0,$bline-2);
$ofx = substr($cont,$bline-1);
// 3. Examine tags that might be improperly terminated
$ofxx = $ofx;
// NUMBER OF TAGS
$numtags = substr_count($ofxx, '<');
// INIT LOOP
$tagloop = 1;
// PARSE THROUGH TAGS
while ($tagloop <= $numtags){
$tagloop++;
$pos = strpos($ofxx,'<');
$pos2 = strpos($ofxx,'>');
$ele = substr($ofxx,$pos+1,$pos2-$pos-1);
// FIND TAGS AND MAKE SURE THEY ARE IN HTML FORMAT WITH BRACKETS:
$tagstart = "<";
$tagend = ">";
$omittag = $tagstart . $ele . $tagend;
//FIND END OF TAG
$pos3 = strpos($omittag,'>');
$pos4 = $pos3+1;
//STRIP TAG OF ANY REMAINING CHARS AFTER THE ">" CHAR
//FOR SOME REASON OCCASIONALLY THE STRING WOULD BE LONGER THAN INTENDED, SO THIS MAKES SURE IT IS CUT OFF AFTER ">"
$omittag2 = substr($omittag, 0, $pos4);
//REMOVE TAG FROM MAIN STRING
$ofxx = preg_replace($omittag2, '', $ofxx, 1);
// TROUBLES OCCUR HERE...I CAN'T SEEM TO BE ABLE TO GET RID OF EMPTY <> and > CHARS...NOT SURE WHY THEY ARE HERE SINCE THE ABOVE SHOULD HAVE REMOVED ALL OF THE TAG BUT SOMETIMES "<>" OR ">" REMAIN
// WHAT I AM THEN TRYING TO DO IS TO GRAB THIS TAG'S NAME AND THEN EVENTUALLY STORE IT IN AN ARRAY ALONG WITH ITS DATA. SINCE QBOs DO NOT HAVE TERMINATING TAGS THEY EITHER NEED CONVERTING TO SELF TERMINATING TAGS OR JUST TRY AND GRAB A TAG VALUE, AND THE DATA THAT FOLLOWS IT AS THE DATA FOR THAT TAG
//FIND START OF NEXT TAG
$pos5 = strpos($ofxx,'<');
//USE THE START OF POS5 OF THAT TAG TO GRAB DATA FOR THE CURRENT TAG IF POS5 GREATER THAN ZERO
if ($pos5 > 0) {
$tagdata = substr($ofxx, 0, $pos5);
}
// 5. Deal with special characters
$ofxx = str_replace('&','&amp;',$ofxy);
} // END LOOP
I think my biggest issue is dealing with the "<" and ">" characters. I am having trouble removing them as I am going through the string and parsing out the values.
Once I am seeing the correct values echoed, I will then start to add them to an array to then add to a MySQL database.
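One likely source of the leftover "<>" is the preg_replace() call: PHP treats the first character of a pattern as its delimiter and accepts bracket pairs such as < and >, so a tag like <CODE> is read as the pattern CODE and only the name is removed. A minimal sketch that avoids removing tags one at a time and instead collects each leaf tag together with the text that follows it is shown below. The function name ofx_leaf_values is made up for illustration, and it assumes tag names consist of uppercase letters, digits, and dots, as in the sample above.

```php
<?php
// Hypothetical helper: collect "<TAG>value" pairs from the SGML body of an OFX/QBO string.
function ofx_leaf_values(string $contents): array
{
    // Skip the "KEY:VALUE" header lines that precede the <OFX> root tag.
    $start = strpos($contents, '<OFX>');
    if ($start === false) {
        return [];
    }
    $body = substr($contents, $start);

    // A leaf is an opening tag followed directly by text. Container tags are
    // followed by another "<" (or a line break), so they do not match.
    preg_match_all('/<([A-Z0-9.]+)>([^<\r\n]+)/', $body, $matches, PREG_SET_ORDER);

    $values = [];
    foreach ($matches as $m) {
        $values[] = ['tag' => $m[1], 'value' => trim($m[2])];
    }
    return $values;
}
```

The result is a flat list in document order; grouping the pairs into transactions (for example, starting a new record at each STMTTRN) would be a further step before inserting into MySQL.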

PHP Looping Through Replacing Tags

I'm trying to do custom tags for links, colour and bullet points on a website so [l]...[/l] gets replaced by the link inside and [li]...[/li] gets replaced by a bullet point list.
I've got it half working, but there's a problem with the link descriptions. Here's the code:
// Takes in a paragraph, replaces all square-bracket tags with HTML tags. Calls the getBetweenTags() method to get the text between the square tags
function replaceTags($text)
{
$tags = array("[l]", "[/l]", "[list]", "[/list]", "[li]", "[/li]");
$html = array("<a style='text-decoration:underline;' class='common_link' href='", "'>" . getBetweenTags("[l]", "[/l]", $text) . "</a>", "<ul>", "</ul>", "<li>", "</li>");
return str_replace($tags, $html, $text);
}
// Takes in the start and end tag along with the paragraph, returns the text between the two tags.
function getBetweenTags($tag1, $tag2, $text)
{
$startsAt = strpos($text, $tag1) + strlen($tag1);
$endsAt = strpos($text, $tag2, $startsAt);
return substr($text, $startsAt, $endsAt - $startsAt);
}
The problem I'm having is when I have three links:
[l]http://www.example1.com[/l]
[l]http://www.example2.com[/l]
[l]http://www.example3.com[/l]
The links get replaced as:
http://www.example1.com
http://www.example1.com
http://www.example1.com
They are all hyperlinked correctly i.e. 1,2,3 but the text bit is the same for all links.
You can see it in action here at the bottom of the page with the three random links. How can I change the code to make the proper URL descriptions appear under each link, so each link is properly hyperlinked to the corresponding page with the corresponding text showing that URL?

str_replace does all the grunt work for you. The problem is that:
getBetweenTags("[l]", "[/l]", $text)
doesn't change. It will match 3 times but it just resolves to "http://www.example1.com" because that's the first link on the page.
You can't really do a static replacement, you need to keep at least a pointer to where you are in the input text.
My advice would be to write a simple tokenizer/parser. It's actually not that hard. The tokenizer can be really simple: find all [ and ] and derive tags. Then your parser will try to make sense of the tokens. Your token stream can look like:
array(
array("string", "foo "),
array("tag", "l"),
array("string", "http://example"),
array("endtag", "l"),
array("string", " bar")
);
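A tokenizer producing a stream in that shape could be sketched as follows. The function name tokenize_tags is made up for illustration, and it assumes tag names contain letters only.

```php
<?php
// Hypothetical tokenizer: split text into "string", "tag" and "endtag" tokens.
function tokenize_tags(string $text): array
{
    // Split on [name] and [/name], keeping the delimiters in the result.
    $parts = preg_split(
        '~(\[/?[a-z]+\])~i',
        $text,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );

    $tokens = [];
    foreach ($parts as $part) {
        if (preg_match('~^\[(/?)([a-z]+)\]$~i', $part, $m)) {
            // A leading slash marks a closing tag.
            $tokens[] = [$m[1] === '/' ? 'endtag' : 'tag', strtolower($m[2])];
        } else {
            $tokens[] = ['string', $part];
        }
    }
    return $tokens;
}
```

A parser can then walk the tokens in order and emit the HTML for each tag, using the string token between a tag and its endtag as that link's own text.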

Personally, I would use preg_match_all instead:
$str='
[l]http://www.example1.com[/l]
[l]http://www.example2.com[/l]
[l]http://www.example3.com[/l]
';
preg_match_all('/\[(l|li|list)\](.+?)(\[\/\1\])/is',$str,$m);
if(isset($m[0][0])){
for($x=0;$x<count($m[0]);$x++){
$str=str_replace($m[0][$x],$m[2][$x],$str);
}
}
print_r($str);
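To get an actual hyperlink for each match rather than the bare URL, a callback-based replacement is one option. This is only a sketch: the function name render_links is made up, and the class name is borrowed from the question's code.

```php
<?php
// Hypothetical helper: replace every [l]url[/l] with a link whose text is its own URL.
function render_links(string $text): string
{
    $result = preg_replace_callback(
        '~\[l\](.+?)\[/l\]~is',
        function (array $m): string {
            // Each callback call sees only its own match, so the link text
            // always belongs to the link being replaced.
            $url = htmlspecialchars(trim($m[1]), ENT_QUOTES, 'UTF-8');
            return "<a class='common_link' href='{$url}'>{$url}</a>";
        },
        $text
    );

    // preg_replace_callback() returns null on a regex error; fall back to the input.
    return $result ?? $text;
}
```

The remaining tags ([list], [li]) have fixed replacements, so they can still go through str_replace() afterwards.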

Remove HTML Entity if Incomplete

I have an issue where I have displayed up to 400 characters of a string that is pulled from the database, however, this string is required to contain HTML Entities.
By chance, the client has created the string to have the 400th character to sit right in the middle of a closing P tag, thus killing the tag, resulting in other errors for code after it.
I would prefer this closing P tag to be removed entirely as I have a "...read more" link attached to the end which would look cleaner if attached to the existing paragraph.
What would be the best approach for this to cover all HTML Entity issues? Is there a PHP function that will automatically close off/remove any erroneous HTML tags? I don't need a coded answer, just a direction will help greatly.
Thanks.

Here's a simple way you can do it with DOMDocument. It's not perfect, but it may be of interest:
<?php
function html_tidy($src){
libxml_use_internal_errors(true);
$x = new DOMDocument;
$x->loadHTML('<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />'.$src);
$x->formatOutput = true;
$ret = preg_replace('~<(?:!DOCTYPE|/?(?:html|body|head))[^>]*>\s*~i', '', $x->saveHTML());
return trim(str_replace('<meta http-equiv="Content-Type" content="text/html;charset=utf-8">','',$ret));
}
$brokenHTML[] = "<p><span>This is some broken html</spa";
$brokenHTML[] = "<poken html</spa";
$brokenHTML[] = "<p><span>This is some broken html</spa</p>";
/*
<p><span>This is some broken html</span></p>
<poken html></poken>
<p><span>This is some broken html</span></p>
*/
foreach($brokenHTML as $test){
echo html_tidy($test);
}
?>
Though take note of Mike 'Pomax' Kamermans's comment.

Why not cut the extraction back to its last space? Whether the final word is complete or not, you remove it, so the text never ends in the middle of a word. Here is an example of what the code would look like:
while ($row = $req->fetch(PDO::FETCH_OBJ)) {
//extract 400 first characters from the content you need to show
$extraction = substr($row->text, 0, 400);
// find the last space in this extraction
$last_space = strrpos($extraction, ' ');
//take content from the first character to the last space and add (...)
echo substr($extraction, 0, $last_space) . ' ...';
}

Just remove the last broken tag and then call strip_tags:
$str = "<p>this is how we do</p";
$str = substr($str, 0, strrpos($str, "<"));
$str = strip_tags($str);
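Note that cutting at the last "<" unconditionally also discards any text that follows a complete tag. A small variant that only cuts when the final tag is actually unterminated might look like this; the function name strip_partial_tag is made up for illustration.

```php
<?php
// Hypothetical helper: drop a trailing unterminated tag, then strip the remaining tags.
function strip_partial_tag(string $html): string
{
    $lastOpen  = strrpos($html, '<');
    $lastClose = strrpos($html, '>');

    // The string ends inside a tag only if the last "<" has no ">" after it.
    if ($lastOpen !== false && ($lastClose === false || $lastClose < $lastOpen)) {
        $html = substr($html, 0, $lastOpen);
    }

    return strip_tags($html);
}
```

This keeps the plain-text approach of the snippet above, so any formatting tags are still lost.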

HTML DOM: How to get elements without losing children?

I'm trying to perform a preg_replace on the text in an HTML string. I want to avoid replacing the text within tags, so I'm loading the string as a DOM element and grabbing the text within each node. For example, I have this list:
<ul>
<li>Boxes 1-3: 1925 - 1928 <em>(A-Ma)</em></li>
<li>Boxes 4-6: 1928 <em>(Mb-Z)</em> - 1930 <em>(A-Wi)</em></li>
<li>Boxes 7-9: 1930 <em>(Wo-Z)</em>- 1932 <em>(A-Fl)</em></li>
</ul>
I want to be able to highlight the character "1", or the letter "i", without disturbing the links or list item tag. So I grab each list item and get its value to perform the replace on:
$invfile = [string of the unordered list above]
$invcontents = new DOMDocument;
$invcontents->loadHTML($invfile);
$inv_listitems = $invcontents->getElementsByTagName('li');
foreach ($inv_listitems as $f) {
$f->nodeValue = preg_replace($to_highlight, "<span class=\"highlight\">$0</span>", $f->nodeValue);
}
echo html_entity_decode($invcontents->saveHTML());
The problem is, when I grab the node values, the child nodes inside the list item are lost. If I print out the original string as-is, the <a>, <em>, etc. tags are all there. But when I run the script, it prints out without the links or any formatting tags. For example, if my $to_highlight is the string "Boxes", the list becomes:
<ul>
<li><span class="highlight">Boxes</span> 1-3: 1925 - 1928 (A-Ma)</li>
<li><span class="highlight">Boxes</span> 4-6: 1928 (Mb-Z) - 1930 (A-Wi)</li>
<li><span class="highlight">Boxes</span> 7-9: 1930 (Wo-Z)- 1932 (A-Fl)</li>
</ul>
How can I get the text without losing the tags inside?

The problem here is that you're operating on the entire element. Boxes is part of the nodeValue of an anchor tag.
If the structure above is always the same you can do something like
$new_html = preg_replace("##", "", $f->item(0)->nodeValue);
In reality, the best way to go about it is to unset the anchor's node value and create an entirely new element and append it.
(Consider this pseudocode)
$inv_listitems = $invcontents->getElementsByTagName('li');
foreach ($inv_listitems as $f) {
$span = $invcontents->createElement("span");
$span->setAttribute("class", "highlight");
$span->nodeValue = $f->item(0)->nodeValue;
$f->appendChild($span);
}
echo $invcontents->saveHTML();
You'll have to do some matching in there, as well as unsetting the nodeValue of $f but hopefully this makes it a little more clear.
Also, don't set HTML in nodeValue directly, because it will run htmlentities() against all of the HTML you set. That is why I create a new element above. If you absolutely have to set HTML in nodeValue, then you should create a DocumentFragment object.

You're better off operating only on the text nodes:
$x = new DOMXPath($invcontents);
foreach ($x->query('//li/text()') as $textnode) {
//replace text node with list of plain text nodes & your highlighting span.
}
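Filling in that loop, a sketch of the text-node approach could look like the following. The function name highlight_text_nodes is made up; it takes a literal search term rather than a full pattern, wraps the fragment in a temporary div, and visits every text node under it (not only direct children of li) so that text inside em is covered too.

```php
<?php
// Hypothetical helper: wrap a literal term in <span class="highlight">, in text nodes only.
function highlight_text_nodes(string $html, string $term): string
{
    if ($term === '') {
        return $html;
    }

    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML('<?xml encoding="UTF-8"><div id="wrap">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $wrap = $xpath->query('//div[@id="wrap"]')->item(0);
    $pattern = '~(' . preg_quote($term, '~') . ')~u';

    // Snapshot the text nodes first, because the loop rewrites the tree.
    foreach (iterator_to_array($xpath->query('.//text()', $wrap)) as $textNode) {
        // With one capture group, odd-indexed parts are the matches.
        $parts = preg_split($pattern, $textNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // no match in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {
                $span = $doc->createElement('span');
                $span->setAttribute('class', 'highlight');
                $span->appendChild($doc->createTextNode($part));
                $fragment->appendChild($span);
            } elseif ($part !== '') {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $textNode->parentNode->replaceChild($fragment, $textNode);
    }

    // Serialize only the children of the temporary wrapper.
    $out = '';
    foreach ($wrap->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Because only text nodes are touched, tag names and attribute values that happen to contain the term are left alone, and child elements such as em stay in place.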

I always use XPath for this kind of task. It gives you more flexibility.
This example handles:
<mainlevel>
<toplevel>
<detaillevel key=...>
<xmlvalue1></xmlvalue1>
<xmlvalue2></xmlvalue2>
<sublevel key=...>
<xmlsubvalue1></xmlsubvalue1>
<xmlsubvalue2></xmlsubvalue2>
</sublevel>
</detaillevel>
</toplevel>
</mainlevel>
To parse this:
$xpath = new DOMXPath($xmlDoc);
$mainNodes = $xpath->query("/mainlevel/toplevel/detaillevel");
foreach( $mainNodes as $subNode ) {
$parameter1=$subNode->getAttribute('key');
$parameter2=$subNode->getElementsByTagName("xmlvalue1")->item(0)->nodeValue;
$parameter3=$subNode->getElementsByTagName("xmlvalue2")->item(0)->nodeValue;
foreach ($subNode->getElementsByTagName("sublevel") as $detailNode) {
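// the sublevel values are child elements, not attributes
$subKey=$detailNode->getAttribute('key');
$subValue1=$detailNode->getElementsByTagName("xmlsubvalue1")->item(0)->nodeValue;
$subValue2=$detailNode->getElementsByTagName("xmlsubvalue2")->item(0)->nodeValue;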
}
}
