Converting a possibly nested HTML UL to CSV (4 fields) using PHP preg_replace, I ran into a snag. The following line handles nested lists, which go unchanged (except for removed newlines) into one of the fields created from the topmost UL:
$idx_string = preg_replace("|(<li>.*?)\n+(<ul>)\n+(.*?</li></ul></li>)|si","$1$2$3", $idx_string);
Now, on some large lists without nested lists (I checked that there is no <ul> in the string at this point of the conversion), this fails with a backtrack limit error (PREG_BACKTRACK_LIMIT_ERROR). So while I know how to get around it, I can't figure out how matching nothing could trigger the backtrack limit at all. According to what I've found, preg_replace returns either the new string or the unchanged old string (besides NULL on error). So how does backtracking get in here?
The list items look like this:
<li>Algeria - Italy.</li>
<li>Go sailing<br>
Anglesey / Wight / Guernsey / Jersey</li>
<li>d'Anjou et du Saumurois, Carte des Gouvernements<br>
Check out the old places!</li>
The CSV looks like this:
|9848.php|Algeria - Italy.|
Go sailing|11434.php|Anglesey - Anglesey / Wight / Guernsey / Jersey|
|11367.php|d'Anjou et du Saumurois, Carte des Gouvernements|Check out the old places!
So in effect all tags are stripped and the remainder split into 4 fields. The odd nested list is stuffed into the third field as is, that is with <ul> & <li> tags, only newlines stripped.
This is some old PHP 4 code utilized as fallback mechanism. DOMDocument might be the better general approach, but I don't want to invest much time in this and the format of the list is pretty strict & simple.
Summing up
Looking at the code again with Jerry's comments in mind, it becomes obvious that the first group (<li>.*?) has PCRE starting at the first <li> right at the top of the file and chewing through the whole file in search of a <ul>, all within one backtracking run. Failing to find a match is exactly the expensive case: the engine has to try every possibility before it can give up.
Enclosing the statement in an if (stripos($idx_string, '<ul') !== false) { ... } block reduces the chance of triggering the error (the explicit !== false matters, because stripos returns 0 for a match at the very start of the string). So does raising pcre.backtrack_limit to 1000000, which is the default as of PHP 5.3.7 anyway but had not been updated here for one reason or another. So much for wrapping up, for the record.
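For the record, here is a minimal sketch of the guarded call. It assumes PHP 7 or later (the original is PHP 4 era code), and the function name join_nested_lists() is made up for illustration; only the pattern comes from the question. It skips the regex when there is no <ul at all and reports a PCRE failure instead of silently carrying on with NULL.

```php
<?php
// Hypothetical wrapper around the preg_replace() call from the question.
function join_nested_lists(string $idx_string): string
{
    // No nested list anywhere: skip the regex (and its backtracking) entirely.
    if (stripos($idx_string, '<ul') === false) {
        return $idx_string;
    }

    $result = preg_replace(
        "|(<li>.*?)\n+(<ul>)\n+(.*?</li></ul></li>)|si",
        '$1$2$3',
        $idx_string
    );

    if ($result === null) {
        // preg_replace() returns NULL on error; preg_last_error() tells which one.
        $code = preg_last_error();
        $why  = ($code === PREG_BACKTRACK_LIMIT_ERROR)
            ? 'pcre.backtrack_limit exceeded'
            : "PCRE error code $code";
        throw new RuntimeException($why);
    }

    return $result;
}
```

If the limit really has to be raised at runtime, ini_set('pcre.backtrack_limit', '1000000') before the call is the usual way; whether that is enough depends on the size of the list.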
Related
I need to be able to generate debugging statements for my code. For example, here is some code I have:
$this->R->radius_ft = $this->TC->diameter / 24;
$this->R->TBETA2_rad = $this->D->beta2 / $rad; //Outer angle
$this->R->TBETA1_rad = $this->R->inner_beta1 / $rad; //Inner angle
I need to be able to display the results of computations so that they can be read by a human.
So far I have been doing this (example showing first line from above only):
$this->R->radius_ft = $this->TC->diameter / 24;
if (self::DEBUG)
print("radius_ft({$this->R->radius_ft}) = diameter({$this->TC->diameter}) / 24");
The above prints something like radius_ft(1.4583) = diameter(35) / 24. A few of those lines look like equations and are nicely traceable when I want to verify things on paper, or when I want to expose the intermediate work of the computations to someone else.
The problem is that it is a pain to construct those debugging statements. I craft them by hand, and usually it is not a problem, but in my current example, there are hundreds of lines of code where this needs to be done. Much pain.
I was wondering if there are facilities in PHP that will allow me to make print-outs of statements showing what each line of code does. Or methods to semi-automate creating the debug lines for me.
So far I have discovered one method that cuts down on some of the work: use the macro facilities of a text editor.
Paste a line of code into TextPad (or a similar editor that supports macros). Record a macro and use the Search, Mark and Copy facilities to carefully navigate between the special symbols of the variable, such as $, >, and symbols that are not alphanumeric or $, >, etc., while copying, extracting and pasting parts of the variable to craft the particular statement.
Exact steps may differ for one's needs. My macro operates on one variable like $this->R->radius_ft with the cursor at the start and ends up with something like radius_ft({$this->R->radius_ft}), with the cursor a few characters after the end, sometimes coinciding with the next variable to process.
Perhaps the same could be done with regular expressions, but I like the way the macro does it: I can process a variable, go to the next one and just repeat the macro with a hotkey combination. This takes out the most tedious chunk of work for me.
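For comparison, the regular-expression route could look roughly like this. It is only a sketch, assuming PHP 7 or later; make_debug_line() is a made-up name. It takes one line of source code as a string and returns the matching debug statement, turning every variable or property chain $a->b->c into c({$a->b->c}). It strips a trailing // comment naively, so a line with // inside a string literal would need extra care.

```php
<?php
// Hypothetical generator: source line in, debug print statement out.
function make_debug_line(string $code): string
{
    // Drop a trailing "// comment", then the closing semicolon.
    $expr = (string) preg_replace('~\s*//.*$~', '', trim($code));
    $expr = rtrim($expr, '; ');

    // Rewrite each variable / property chain as name({chain}).
    $body = preg_replace_callback(
        '/\$\w+(?:->\w+)*/',
        function (array $m): string {
            $parts = explode('->', $m[0]);
            $name  = ltrim(end($parts), '$'); // last segment, without any "$"
            return $name . '({' . $m[0] . '})';
        },
        $expr
    );

    return 'if (self::DEBUG) print("' . $body . '");';
}

// Usage: print the generated statement and paste it below the original line.
echo make_debug_line('$this->R->radius_ft = $this->TC->diameter / 24;'), "\n";
```

Run over a whole file line by line, this would produce the debug lines in bulk; they still have to be reviewed, since the sketch does not understand function calls or array indexes.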
Alternatively - hand the person the code and let them figure it out. Teach them how to read code.
I've got a script which generates text. I need to strip all repeated blocks of text. The string is in XML format, so I can use the opening and closing tags to determine where the blocks are. I've been using substr_replace to remove the unnecessary text. However, this only works if I know how many times said text is going to be present in the string. Example:
<container>
<string1>This is the first string.</string1>
<string2>This is the second string.</string2>
<stuff>This is the important stuff.</stuff>
</container>
That container might appear once, twice, six times, seven times, whatever. The point is, it must appear only once in the string variable. Right now this is what I'm doing.
// Locate the first container block.
$where_begin = strpos($wsman_output, '<container');
$where_end = strpos($wsman_output, '</container>');
// Include the closing tag itself in the span to remove.
$end_length = strlen('</container>');
$attack = $where_end - $where_begin;
$attack = $attack + $end_length;
$wsman_output = substr_replace($wsman_output, "", $where_begin, $attack);
I do that once for each time the container occurs. However, I just found out that the number of occurrences is not always going to be the same, which really messes things up.
Any ideas?
In the end I decided to use the method suggested here.
I pulled each block of string I wanted from the variable, then combined them back together in the required order.
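A different route, not the one I ended up using, would be a regex callback that keeps the first block and drops every later one, however many there are. This is a rough sketch assuming PHP 7 or later; keep_first_block() is a made-up name, and it assumes containers are never nested inside each other.

```php
<?php
// Hypothetical helper: keep only the first <tag>...</tag> block in $xml.
function keep_first_block(string $xml, string $tag = 'container'): string
{
    $t = preg_quote($tag, '~');
    // Opening tag (attributes allowed), lazily up to the nearest closing tag,
    // plus trailing whitespace so no blank lines are left behind.
    $pattern = '~<' . $t . '\b[^>]*>.*?</' . $t . '>\s*~s';

    $seen = false;
    $result = preg_replace_callback(
        $pattern,
        function (array $m) use (&$seen): string {
            if ($seen) {
                return '';   // drop every repeat
            }
            $seen = true;
            return $m[0];    // keep the first occurrence untouched
        },
        $xml
    );

    return $result ?? $xml;  // NULL means a PCRE error; fall back to the input
}
```

Called as $wsman_output = keep_first_block($wsman_output); it does not need to know the number of repeats in advance.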
UPDATE AT THE BOTTOM
Maybe somebody could help with this... I've been struggling with it for days and I'm blocked :/
For a content-cleaner solution I'm working on, I'm trying to convert some plain-text numbered lists, like:
1 Foo
1.1 Foo 1
1.2 Foo 2
2 Bar
2.1 Bar 1
2.2 Bar 2
2.2.1 Bar 2.1
2.2.2 Bar 2.2
2.3 Bar 3
3 Z Another root item
... into correct nested html lists ...
<ul>
<li>Foo
<ul>
<li>Foo 1</li>
<li>Foo 2</li>
</ul>
</li>
<li>Bar
<ul>
<li>Bar 1</li>
<li>Bar 2
<ul>
<li>Bar 2.1</li>
<li>Bar 2.2</li>
</ul>
</li>
<li>Bar 3</li>
</ul>
</li>
<li>Another root item</li>
</ul>
Some things that may help:
No need for the result to be correctly indented, just surrounded by the correct HTML tags
No need to locate the list inside another text; you can assume I already have only the list
No need for great performance; regexp, iteration... whatever works is fine
No need for a specific language; a solution in PHP, Python, JavaScript, pseudocode... is fine
You can assume " " (space) is the only separator after the "1.2.3 " list number
You can assume the lines are already in the correct order; no need to sort them at all
UPDATE TL;DR (Not homework, but real-world usage)
Sorry for looking so "homework not done", my fault. English is not my first language and I tried to be concise, maybe too concise.
What I'm trying to do is make it easier for my workmates to format text from unknown sources into correct HTML.
So far I have managed the following (you can see the full screenshot at http://twitpic.com/907aw5/, as I can't attach images, this being my first question and having no reputation):
I get the original text and run strip_tags on it to delete any incorrect HTML it may contain
I insert it into a textarea
I integrated a JavaScript editor (CodeMirror, http://codemirror.net) with the HTML mode
I added an editing toolbar with the most common tags we use, as my workmates don't know a word of HTML
As part of the cleaning options, I set two hotkeys that make a ul / ol out of the selected text (splitting on the \n characters)
When the user saves, I run HTMLTidy on it to make it as clean as possible (indent, delete proprietary tags, etc.)
Just to finish: as you can see in the above screenshot, I have a lot of texts with the 1.2.3 "organization", and it would be a great help to be able to get nested lists out of this kind of text.
UPDATE (The specific needs)
Now the explanation of why I listed so many assumptions:
No need for the result to be correctly indented, just surrounded by the correct HTML tags (because after this, when the user hits the Save button, I run HTMLTidy on it, so it gets indented)
No need to locate the list inside another text; you can assume I already have only the list (because I run the code over the text the user selected in the editor, so I can assume they selected the correct list)
No need for great performance; regexp, iteration... whatever works is fine (as it is driven by a human, point-click, point-click, I don't mind whether it takes 0.0001 seconds per use or 0.1)
No need for a specific language; PHP, Python, JavaScript, pseudocode... is fine (I intend to use it in JavaScript/jQuery, but what I need is just the logic, as I'm blocked... I can translate it if the solution is in another language)
You can assume " " (space) is the only separator after the "1.2.3 " list number (as that covers 99% of my texts)
You can assume the lines are already in the correct order; no need to sort them at all (as you can see in the screenshot, that text is entered by humans, and I assume they entered it in the correct order)
Sorry again for not being clear enough. This is my first question on Stack Overflow, and I didn't realize it would look like homework, my fault.
Just for funsies, I went ahead and wrote a solution to your problem using PHP:
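function helper_func($m)
{
    static $r = 0; // current nesting depth, kept between calls
    $o = '';
    // depth of this line = number of digit groups in its "1.2.3" prefix
    $l = preg_match_all("#\d+#", $m[1], $n);
    // shallower than before: close the lists that are no longer open
    while ($l < $r)
    {
        $r--;
        $o .= '</li></ul>';
    }
    if ($l == $r) return $l == 0 ? $o.$m[0] : $o.'</li><li>'.$m[0];
    else $o = $m[0];
    // deeper than before: open new lists in front of the item
    while ($l > $r)
    {
        $r++;
        $o = '<ul><li>'.$o;
    }
    return $o;
}
echo preg_replace_callback("#^([0-9.]*).*$#m", "helper_func", $input);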
However, in deference to this being homework, I included a deliberate error: for it to come out correctly, you need to make a single small change to $input before passing it in... Have fun :)
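For anyone who would rather see the logic spelled out than solve the puzzle, here is a separate sketch using a plain depth counter instead of a regex callback. It is a different approach from the function above, not a fix of it. It assumes PHP 7 or later, that each line starts with a dotted number followed by a space, that the lines are in order, and that the depth never increases by more than one level at a time. The name outline_to_html() is made up. Labels are inserted as they are, on the assumption that tags were already stripped earlier.

```php
<?php
// Hypothetical converter: "1.2.3 Label" lines in, nested <ul> markup out.
function outline_to_html(string $text): string
{
    $html  = '';
    $depth = 0; // number of <ul> levels currently open

    foreach (preg_split('/\R/', trim($text)) as $line) {
        // "2.2.1 Bar 2.1" gives number "2.2.1" and label "Bar 2.1"
        if (!preg_match('/^\s*(\d+(?:\.\d+)*)\s+(.*)$/', $line, $m)) {
            continue; // ignore lines without a leading number
        }
        $level = substr_count($m[1], '.') + 1;

        if ($level > $depth) {
            // deeper: open a new list inside the still-open parent item
            while ($depth < $level) {
                $html .= '<ul>';
                $depth++;
            }
        } else {
            // same level or shallower: close the previous item,
            // then every list that is deeper than the new level
            $html .= '</li>';
            while ($depth > $level) {
                $html .= '</ul></li>';
                $depth--;
            }
        }
        $html .= '<li>' . $m[2];
    }

    // close whatever is still open at the end of the input
    while ($depth > 0) {
        $html .= '</li></ul>';
        $depth--;
    }
    return $html;
}
```

The same loop translates almost line for line to JavaScript, since it only needs a counter and string concatenation.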
Hello, I have two PHP files. One of them builds a report; the second contains the language text. When it prints, it keeps giving me the � special character everywhere, even though I am not using any special characters in my code. Why is that, and how can I get rid of them?
I am running Apache 2.2, php 5, Ubuntu 8.04.
FILE 1
<?php
function glossary() {
return <<<HTML
<h1>Arteries</h1>
<p><strong>Arteries</strong> are blood vessels that carry blood <strong>away from
the heart</strong>. All arteries, with the exception of the pulmonary and umbilical
arteries, carry oxygenated blood. The circulatory system is extremely important for
sustaining life. Its proper functioning is responsible for the delivery of oxygen
and nutrients to all cells, as well as the removal of carbon dioxide and waste products,
maintenance of optimum pH, and the mobility of the elements, proteins and cells of
the immune system. In developed countries, the two leading causes of death, myocardial
infarction and stroke each may directly result from an arterial system that has been
slowly and progressively compromised by years of deterioration.</p>
HTML;
}
?>
FILE 2:
<?php
require_once("language.php");
echo glossary();
?>
This is the printout when I execute file 2.
Glossary
Arteries
Arteries�are�blood vessels�that carry blood�away from the�heart. All arteries, with the exception of the�pulmonary�and�umbilical arteries, carry oxygenated blood. The�circulatory system�is extremely important for sustaining�life. Its proper functioning is responsible for the delivery of�oxygen and�nutrients�to all cells, as well as the removal of�carbon dioxide�and waste products, maintenance of optimum�pH, and the mobility of the elements, proteins and cells of the�immune system. In�developed countries, the two leading causes of�death, myocardial infarction�and�stroke�each may directly result from an arterial system that has been slowly and progressively compromised by years of deterioration.
Autoimmunity
Autoimmunity�is the failure of an organism to recognize its own constituent parts as�self, which allows an immune response against its own cells and tissues. Any disease that results from such an aberrant immune response is termed an�autoimmune disease.�
Basal cell carcinoma
Basal cell carcinoma�is the most common type of�skin cancer. It rarely�metastasizes�or kills, but it is still considered�malignant because it can cause significant destruction and disfigurement�by invading surrounding tissues. Statistically, approximately 3 out of 10 Caucasians develop a basal cell cancer within their lifetime. In 80 percent of all cases, basal cell cancers are found on the head and neck.�There appears to be an increase in the incidence of basal cell cancer of the trunk in recent years.
Try deleting and re-entering the spaces which show up as "�".
I suspect those you re-enter will be fine. The document likely contains alternate Unicode space characters which appear normally in your editor, but are unrecognized by the PHP code running in the default character set for your server.
Did this document originally come from MS Word or some other word processor?
You need to make sure that your editor encoding is set to something sensible such as UTF-8. You should also make sure that your output is declared as UTF-8 (or whatever encoding is relevant to you). This can be done using the meta tag <meta content="text/html; charset=UTF-8" http-equiv="Content-Type"/> and by sending the PHP header header('Content-type: text/html; charset=UTF-8'); before your output begins.
Have you checked your encoding? Make sure that you use the same encoding in the editor, Apache, PHP and the browser you are using.
Hope this helps.
In small cases like this I use notepad as a lazy man's sanitizer of charsets. Paste the text into notepad. Copy and paste it back into your document. Spaces will now be spaces.
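The same clean-up can be scripted. This is only a sketch, and it rests on two guesses: that the offending characters are non-breaking spaces (U+00A0, common in text pasted from word processors or web pages), and that any input which is not valid UTF-8 is Windows-1252. It assumes PHP 7 or later with the mbstring extension; normalize_spaces() is a made-up name.

```php
<?php
// Hypothetical helper: turn non-breaking spaces into ordinary spaces.
function normalize_spaces(string $text): string
{
    // Assumption: anything that is not valid UTF-8 is Windows-1252.
    if (!mb_check_encoding($text, 'UTF-8')) {
        $text = mb_convert_encoding($text, 'UTF-8', 'Windows-1252');
    }
    // "\u{00A0}" is the UTF-8 encoded non-breaking space (bytes C2 A0).
    return str_replace("\u{00A0}", ' ', $text);
}
```

The result is always UTF-8, so it should be served with the matching charset header mentioned above.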
I know this has been asked (Python Markdown nl2br extension, etc) but none of those answers is doing it for me.
I would like to render markdown so that linebreaks occurring within a <p> element will be rendered as <br>. Example: if they type
Here is line one.
And line two.
New paragraph.
should render as
<p>Here is line one.<br>And line two.</p>
<p>New paragraph.</p>
I know that if you want that, you should type two spaces at the end of the line you want to <br>. I am trying to make it so my users don't have to do that, but rather, enter text as though they were using a typewriter (for those who know what that is). One hard return, new line; two hard returns, new paragraph.
I've been working with https://parsedown.org/ and have also experimented with https://commonmark.thephpleague.com; also the Python markdown module with nl2br extension (tried their example verbatim, did not work for me). Whatever I do, I end up with either too many or not enough linebreaks, depending.
I have tried what I thought would be clever and elegant: style my markdown's <p> with white-space: "pre" (also tried pre-line). That works, unless the user has done it "right" with two spaces, in which case you get the unwanted double <br> effect.
Also tried nl2br($markdown) with likewise unreliable results.
I want non-technical users to be able to use some basic formatting as easily as possible, and markdown seems just the thing, but for this detail. I don't want to write a CMS just to work around this. For example, I've thought of adding a boolean markdown property on the entity and letting them choose, yadda yadda... don't wanna go there. I've thought of doing some string-replacement or regexp magic, either at database-write time or just before rendering. But again, hoping to avoid getting too complicated. (To make it a little more challenging, I will also have to import a few thousand legacy records that are non-markdown, and potentially deal with issues around old ones versus new.)
Maybe I'm overlooking a simple, sane way out. Any thoughts as to the best strategy?
Update: by popular demand, code examples of what does not work. It's a Zend MVC application that involves Doctrine entities I call MOTD and MOTW (Message Of The Day and Message Of The Week, respectively); these have a string property called content. Generically I think of these entities as Notes and they implement a NoteInterface. When I retrieve these from the database (via a NotesService class that internally uses a custom Doctrine repository class), it's time to render the content as markdown before the controller assigns it to the view:
// from NotesService.php
use Parsedown;
// stuff omitted...
/**
 * gets MOT(D|W) by date
 *
 * @param DateTime $date
 * @param string $type
 * @param boolean $render_markdown
 * @return NoteInterface|null
 */
public function getNoteByDate(DateTime $date, string $type, bool $render_markdown = true) : ?NoteInterface
{
    $entity = $this->getRepository()->findByDate($date, $type);
    if ($entity && $render_markdown) {
        $content = $entity->getContent();
        $entity->setContent($this->parsedown($content));
    }
    return $entity;
}
The point of the boolean $render_markdown is for when we want raw markdown, i.e., when it's going to populate a textarea element of a form.
And the parsedown() method, quite simply:
public function parsedown(string $content) : string
{
    if (! $this->parseDown) {
        $this->parseDown = new Parsedown();
    }
    // nope...
    // return nl2br($this->parseDown->text($content));
    return $this->parseDown->text($content);
}
Inside a viewscript, I just go, e.g.,
if ($this->notes['motd']):
    // echo nl2br($this->notes['motd']->getContent());
    echo $this->notes['motd']->getContent();
else:
    ?><p class="font-italic no-note">no MOTD for this date</p><?php
endif;
Now, if in the editing form they input this as content:
here is a line
and here is another
now, new paragraph.
and then we save it in the database, when you select it back out and run it through $parsedown->text($content), you get this HTML:
<p>here is a line
and here is another</p>
<p>now, new paragraph.</p>
Please note, the example input above does not have any space characters preceding the linebreaks. When you do type two spaces before the linebreaks, yeah, it works great. But I don't think my users want to think about that. So using nl2br() helps, except when it results in too many consecutive <br>s in the HTML.
My latest thinking is, use a CSS solution and an input filter that strips <space><space> at the end of lines. When it works, I'll add the story to my memoir. :-)
There may be some more desirable way to achieve this, but in the end I decided to
(1) filter the input (at create|update time) with a regexp pattern substitution that removes trailing runs of two or more consecutive space characters from lines. I happen to be using ZendFramework's Zend\Filter\PregReplace, but it's a de facto wrapper for preg_replace('/( {2,})(\R)/m', '$2', $content).
(2) Use CSS to make newlines act like <br> when I display these entities, e.g.,
#motd .card-body p { white-space: pre-line }
Seems to be working for me.
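Without the Zend filter, step (1) comes down to something like the sketch below, assuming PHP 7 or later; the function name is made up. It removes two or more spaces that sit directly before a line break and leaves the line break itself alone, so the CSS rule in step (2) is the only thing that produces visible breaks.

```php
<?php
// Hypothetical input filter: strip Markdown's "two trailing spaces" hard-break marker.
function strip_trailing_double_spaces(string $markdown): string
{
    // \R matches any line-break sequence (\n, \r\n, ...); keep it, drop the spaces.
    $clean = preg_replace('/ {2,}(\R)/', '$1', $markdown);
    return $clean ?? $markdown; // NULL signals a PCRE error; keep the input as is
}
```

If I remember right, Parsedown also has a setBreaksEnabled(true) switch that turns single newlines into <br>, which might make both steps unnecessary; I have not tested that here.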