XML maximum limit? - php

I'm using ExpressionEngine to generate a very large XML template. The generated XML is probably in the neighborhood of 1800-2000 lines. I've started to see some funky behavior: when I add a new project, my oldest project no longer shows up in the XML. It is almost as if there is some kind of limit being reached, and pushing anything in after that limit forces the oldest item out. There are no errors on the page and the XML closes properly. Has anyone ever come across something like this?

I believe the channel:entries tag (or weblog:entries on EE1) has a default limit of 1000 entries unless specified otherwise. Try adding limit="5000" to your entries tag.
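For example, something like this (the channel name here is just a placeholder):
{exp:channel:entries channel="projects" limit="5000"}
    {title}
{/exp:channel:entries}
On EE1 the same limit parameter should go on the {exp:weblog:entries} tag instead.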

Related

Is there a way to store multiple records rather than using multiple rows in MySQL?

I would like to make full use of MySQL for a (web) application I have developed for a chiropractor.
So far I have been storing a single row per year for what are called progress notes. The table structure looks something like this: (progress_note_id, patient_id, date (Y-0-0), progress_note). When the client wishes to append to the current year's progress notes, he simply clicks at the top of a textarea (HTML, using the TinyMCE JavaScript library) to add a new entry date along with the shorthand notes, which go at the beginning of the column (progress_note). So far it's been working OK. With 900+ clients (est.), there could potentially be 1300+ progress notes for each year since the beginning of the application (2018).
Now the client wishes to be able to see previous progress notes (history) without being able to modify them, while still being able to write new ones. The solution I have come up with is to use XML inside the textarea and use PHP to separate the new notes from the old ones.
My problem, however, is that if I have to convert my entire table from yearly to daily rows, it could take a lot of time and energy to split the combined notes into single rows (est. 10x), which could end up being 13,000+ rows. I realize that whatever method I choose is going to be a lot of work. Another way around this, I found, might be to store XML in a MySQL column so that one field can hold multiple records; if I wish to append to it, all I would need is PHP to parse the XML and add a new child node at the beginning. Each progress note is 255-500 chars, and even in the worst case, where a patient is seen 52 times a year (once every week), there shouldn't be too large an overhead.
Is this the correct way to solve this problem? I do wish to stick with a MySQL DB, and I realize that MySQL is not intended for XML. For clarification, what I hope to accomplish is the same thing I currently do with the progress notes, but with XML, ordered newest -> oldest.
<xml_result>
    <progress_note>
        <date>2020-08-16</date>
        <content></content>
    </progress_note>
</xml_result>
Thank you for your time and for any suggestions.
Firstly, 13,000+ rows is not a problem for MySQL. In most cases, for a web application, MySQL can handle 10M+ records in a single instance with good performance.
Secondly, you can use either XML or JSON format in a text field and handle the decoding in your application.
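For instance, a minimal sketch of the prepend-a-note idea with DOMDocument (variable names like $progress_note_xml and $new_note_text are placeholders, and error handling is omitted):
$doc = new DOMDocument();
$doc->loadXML($progress_note_xml);            // the value read from the TEXT column

// build the new <progress_note> node
// note: createElement() does not escape special characters; use createTextNode() for real input
$note = $doc->createElement('progress_note');
$note->appendChild($doc->createElement('date', date('Y-m-d')));
$note->appendChild($doc->createElement('content', $new_note_text));

// prepend it so the newest note comes first
$root = $doc->documentElement;                // <xml_result>
$root->insertBefore($note, $root->firstChild);

$updated_xml = $doc->saveXML();               // write this back to the row
The JSON equivalent would simply be json_decode(), array_unshift(), json_encode().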

ExpressionEngine stopped parsing some tags after adding some content

I am modifying an ExpressionEngine channel entry. There are a number of fields for this channel in a group, and two of them are grids. In the template, I have loops that go through the rows in each grid and output various fields. This is all working as expected.
I recently wanted to add some content, and when I did, my existing loops were no longer parsed, just output as
{grid_name0}
<div>stuff</div>
{/grid_name0}
and any {grid_name0:fieldname} tags were replaced with an empty string/nothing. Other tags continued to work correctly, including tags and a loop in the new content I added.
After debugging a bit, I discovered that there is a specific amount of text that will break it; I can add a comment to the template like
<!-- 01234567890123456789012345678901234567890123456789 -->
and once it reaches a certain length it will break the page. This seems like a memory limit being reached before ExpressionEngine processes all the content. I checked various areas for memory limits and they seem to be set to reasonable values (and this is a pretty simple/light site), but I suspect there is a limit set somewhere that I am not aware of.
This is on ExpressionEngine 2.7.2 with PHP 5.2.17. I can provide additional details from phpinfo if desired, but the PHP memory_limit is set to 512M (which, frankly, seems like overkill).
Possibly related: one of the things I was adding is an "advanced" conditional with an {if:else}. If it is present, but with less text than is otherwise required to break the page, the page breaks in exactly the same way.
Thanks in advance for any help!
Huge thanks to EllisLab support: they pointed me to increasing pcre.backtrack_limit beyond the default of 100000, and it now processes all the tags. In retrospect I should have checked that sooner, but I run into issues with it so infrequently that it never crossed my mind.
Edit system/expressionengine/config/config.php and add this just after the opening line:
ini_set('pcre.backtrack_limit', 100000000); // raise the limit well above the default of 100000

Running preg_replace on html code taking too long

At the risk of getting redirected to this answer (yes, I read it and spent the last 5 minutes laughing out loud at it), allow me to explain this issue, which is just one in a list of many.
My employer asked me to review a site written in PHP, using Smarty for templates and MySQL as the DBMS. It's currently running very slowly, taking up to 2 minutes (with an entirely white screen throughout, no less) to load completely.
Profiling the code with Xdebug, I found a single preg_replace call that takes around 30 seconds to complete; it goes through all the HTML code and replaces each URL found with its SEO-friendly version. The moment it completes, it outputs all of the code to the browser. (As I said before, that's not the only issue; the code is rather old, and it shows, but I'll focus on it for this question.)
Digging further into the code, I found that it currently runs through 1702 patterns, each with its corresponding replacement (matches and replacements in equally-sized arrays), which would certainly account for the time it takes.
Code goes like this:
//This is just a call to a MySQL query which gets the relevant SEO-friendly URLs:
$seourls_data = $oSeoShared->getSeourls();

$url_masks = array();
$seourls = array();
foreach ($seourls_data as $seourl_data)
{
    if ($seourl_data["url"])
    {
        $url_masks[] = "/([\"'\>\s]{1})".$site.str_replace("/", "\/", $seourl_data["url"])."([\#|\"'\s]{1})/";
        $seourls[] = "$1".MAINSITE_URL.$seourl_data["seourl"]."$2";
    }
}

//After filling both $url_masks and $seourls arrays, then the HTML is parsed:
$html_seo = preg_replace($url_masks, $seourls, $html);
//After it completes, $html_seo is simply echo'ed to the browser.
Now, I know the obvious answer to the problem is: don't parse HTML with a regexp. But then, how do I solve this particular issue? My first attempt would probably be:
Load the (hopefully well-formed) HTML into a DOMDocument and then get the href attribute of each a tag, like so.
Go through each node, replacing the URL found with its appropriate match (which would probably mean using the previous regexps anyway, but on a much-reduced-size string)
???
Profit?
but I think it's most likely not the right way to solve the issue.
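For what it's worth, here is a rough, untested sketch of that first attempt. It assumes I can build a plain lookup array of old URL => SEO URL from the same query results, which I haven't verified:
// hypothetical: build a direct lookup table instead of 1702 regexes
$map = array();
foreach ($seourls_data as $seourl_data) {
    if ($seourl_data["url"]) {
        $map[$site.$seourl_data["url"]] = MAINSITE_URL.$seourl_data["seourl"];
    }
}

// load the (hopefully well-formed) HTML and rewrite only the <a> hrefs
$dom = new DOMDocument();
libxml_use_internal_errors(true);   // real-world HTML is rarely perfectly valid
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $a) {
    $href = $a->getAttribute('href');
    if (isset($map[$href])) {
        $a->setAttribute('href', $map[$href]);
    }
}
$html_seo = $dom->saveHTML();       // note: DOMDocument may normalize the markup slightly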
Any ideas or suggestions?
Thanks.
As your goal is to be SEO-friendly, using a canonical tag in the target pages would tell the search engines to use your SEO-friendly URLs, so you don't need to replace them in your code...
Oops, that's really tough; bad strategy from the beginning, but anyway, that's not your fault.
I have a few suggestions:
1 - Create a caching layer with Smarty, so only the first HTML is generated in 2 minutes and every subsequent request is served from a static resource.
2 - Don't put off what should have been done earlier: fix the system. Create a database migration that stores the SEO URLs in a good format, or generates them from titles or whatever. On my system I generate SEO links in this format:
www.whatever.com/jobs/722/drupal-php-developer
where 722 is the ID I parse out of the URL to fetch the right page content, and drupal-php-developer is the title of the post.
3 - (Which is not a suggestion.) Tell your client that the project is not well engineered (if you truly believe so) and needs a restructuring to boost performance.

Fastest Way to display a data node + all its attributes in PHP?

I'm using PHP to take XML files and convert them into single-line, tab-delimited plain text with set columns (i.e. it ignores certain tags the database does not need, and certain tags will be empty). The problem I ran into is that it took 13 minutes to go through 56k (+ change) files, which I think is ridiculously slow. (The average folder has upwards of a million XML files.) I'll probably cron-job it overnight anyway, but it is completely untestable at a reasonable pace while I'm at work for things like missing files, corrupt files and such.
Here's hoping someone can help me make the thing faster. The XML files themselves are not too big (<1k lines) and I don't need every single data tag, just some. Here's my data node method:
function dataNode ($entries) {
    $out = "";
    foreach ($entries as $e) {
        $out .= $e->nodeValue."[ATTRIBS]";
        foreach ($e->attributes as $name => $node)
            $out .= $name."=".$node->nodeValue;
    }
    return $out;
}
where $entries is a DOMNodeList generated from XPath queries for the nodes I need. So the question is: what is the fastest way to get to a target data node or nodes (if I have 10 keyword nodes from my XPath query then I need all of them printed from that function) and output the node value and all its attributes?
I read here that iterating through a DOMNodeList isn't constant time, but I can't really use the solution given, because a sibling of the node I want might be one I don't need, or might need a different format function called before I write it to file, and I really don't want to run every node through a gigantic switch statement on each iteration trying to format the data.
Edit: I'm an idiot, I had my write function inside my processing loop, so on every iteration it had to reopen the file I was writing to. Thanks for both of your help. I'm trying to learn XSLT right now, as it seems very useful.
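In other words, the fix was essentially to open the output file once, outside the loop, roughly like this (the file name and XPath are just examples):
$fh = fopen('output.tsv', 'w');                 // open once, before the loop
foreach ($files as $file) {                     // $files is the list of XML files to convert
    $doc = new DOMDocument();
    $doc->load($file);
    $xpath = new DOMXPath($doc);
    // dataNode() is the function shown above
    fwrite($fh, dataNode($xpath->query('//keyword'))."\n");
}
fclose($fh);                                    // close once, after the loop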
A comment would be a little short, so I'll write it as an answer:
It's hard to say where exactly your setup can benefit from optimizing. Perhaps it's possible to join multiple of your many XML files together before loading.
From the information you give in your question, I would assume that it's the disk operations that are taking the time rather than the XML parsing. I have found DOMDocument and XPath quite fast even on large files. An XML file of up to 60 MB takes about 4-6 secs to load; a file of 2 MB only a fraction of that.
Having many small files (< 1k) means a lot of work on the disk, opening/closing files. Additionally, I have no clue how you iterate over the directories/files; sometimes this can be sped up dramatically as well, especially as you say that you have millions of files.
So perhaps concatenating/merging files is an option for you; it can be done quite safely and would reduce the time it takes to test your converter.
If you encounter missing or corrupt files, you should create a log and catch these errors, so you can let the job run through and check for errors later.
Additionally, if possible, you can try to make your workflow resumable, e.g. if an error occurs, the current state is saved and next time you can continue from that state.
The suggestion in a comment above to run an XSLT on the files first is a good idea as well. Adding a new layer in the middle to transpose the data can reduce the overall problem dramatically, because it reduces complexity.
This workflow on XML files has helped me so far:
Preprocess the file (plain text filters, optional)
Parse the XML. That's loading into DOMDocument, XPath iteration, etc.
My parser sends out events with the parsed data as it is found.
The parser throws a specific exception if data is encountered that is not in the expected format. That makes it possible to catch errors in the parser itself.
All other errors are converted to exceptions as well.
Exceptions can be caught and operations finished, e.g. go on to the next file, etc.
Logger, resumer and exporter (file export) can hook onto the events. Sort of like the visitor pattern.
I've built such a system to process larger XML files whose formats change. It's flexible enough to deal with changes (e.g. replace the parser with a new version while keeping logging and exporting). The event system really pushed it for me.
Instead of a gigantic switch statement, I normally use a $state variable for the parser's state while iterating over a DOMNodeList. $state can be handy for resuming operations later: restore the state, go to the last known position, then continue.
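Very roughly, and with invented node names and a hypothetical formatKeyword() helper, that looks like:
$out   = '';
$state = 'default';
foreach ($entries as $node) {          // $entries is the DOMNodeList
    if ($node->nodeName === 'keywords') {
        $state = 'keywords';           // remember the context for the nodes that follow
        continue;
    }
    // one small handler per state instead of one big switch over every node type
    if ($state === 'keywords') {
        $out .= formatKeyword($node);  // hypothetical formatting helper
    } else {
        $out .= $node->nodeValue;
    }
}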

lastBuildDate in dynamically generated RSS

The RSS feed is being generated on demand.
As far as I can see, I have 2 options for lastBuildDate: the current time or pubDate.
Which one would you choose and why?
According to the RSS 2.0 spec, lastBuildDate is the last time the content of the channel changed. (I'm not entirely satisfied with this definition, because what if the feed's metadata changes? I think the common convention is to update lastBuildDate in that case, too.)
The channel-wide pubDate is supposed to be used for the original publication date of the items in the feed. It is never a good value to use for lastBuildDate, because pubDate is meant to stay unchanged even if an item gets updated.
Using the current time is the easy way out, but it's not perfect (clients may start unnecessary operations because of the changed lastBuildDate).
The best way would be to actually know / find out when the feed's content last changed, and output that.
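In PHP that could be as simple as the following sketch (it assumes each item carries some kind of 'updated_at' timestamp, which is an assumption about your data):
// $items is the array of feed items; 'updated_at' is a hypothetical Unix timestamp per item
$lastChanged = max(array_column($items, 'updated_at'));

// DATE_RSS formats the timestamp as the RFC 822 date string RSS expects
echo '  <lastBuildDate>'.date(DATE_RSS, $lastChanged)."</lastBuildDate>\n";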
Related question
The item with the newest pubDate should become the lastBuildDate.
[EDIT]: If you are also using a separate pubDate for the whole feed, then lastBuildDate should be the current time, because you are building the feed at the current time, on demand :).
[EDIT 2]: As lastBuildDate is optional and you're including a pubDate for the whole feed anyway, why not remove it from your feed output?
