Running preg_replace on html code taking too long

At the risk of getting redirected to this answer (yes, I read it and spent the last 5 minutes laughing out loud at it), allow me to explain this issue, which is just one in a list of many.
My employer asked me to review a site written in PHP, using Smarty for templates and MySQL as the DBMS. It's currently running very slowly, taking up to 2 minutes (with an entirely white screen through it all, no less) to load completely.
Profiling the code with xdebug, I found a single preg_replace call that takes around 30 seconds to complete. It goes through all the HTML code and replaces each URL it finds with its SEO-friendly version. The moment it completes, it outputs all of the code to the browser. (As I said before, that's not the only issue; the code is rather old, and it shows, but I'll focus on this one for this question.)
Digging further into the code, I found that it currently goes through 1702 patterns, each with its own replacement (patterns and replacements are held in two equally sized arrays), which would certainly account for the time it takes.
Code goes like this:
// This is just a call to a MySQL query which gets the relevant SEO-friendly URLs:
$seourls_data = $oSeoShared->getSeourls();
$url_masks = array();
$seourls = array();
foreach ($seourls_data as $seourl_data)
{
    if ($seourl_data["url"])
    {
        $url_masks[] = "/([\"'\>\s]{1})".$site.str_replace("/", "\/", $seourl_data["url"])."([\#|\"'\s]{1})/";
        $seourls[] = "$1".MAINSITE_URL.$seourl_data["seourl"]."$2";
    }
}
// After filling both $url_masks and $seourls arrays, the HTML is parsed:
$html_seo = preg_replace($url_masks, $seourls, $html);
// After it completes, $html_seo is simply echo'ed to the browser.
Now, I know the obvious answer to the problem is: don't parse HTML with a regexp. But then, how to solve this particular issue? My first attempt would probably be:
1. Load the (hopefully well-formed) HTML into a DOMDocument, and then get the href attribute of each a tag.
2. Go through each node, replacing the URL found with its appropriate match (which would probably mean using the previous regexps anyway, but on a much smaller string).
3. ???
4. Profit?
but I think it's most likely not the right way to solve the issue.
Any ideas or suggestions?
Thanks.
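The DOMDocument idea from the question can be sketched as follows. This is a minimal sketch, not a drop-in fix: rewriteLinks() and $seoMap are hypothetical names, the URLs are made up, and it assumes the 1702 URL pairs can be loaded into one array keyed by the original URL. Each href then costs one array lookup instead of 1702 regex passes over the whole document.

```php
<?php
// Hypothetical helper: rewrite <a href> values using an original-URL => SEO-URL map.
function rewriteLinks(string $html, array $seoMap): string
{
    $doc = new DOMDocument();
    // Real-world markup is rarely perfect; collect libxml warnings instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    foreach ($doc->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href');
        // Keep any "#fragment" aside so the lookup sees only the URL itself.
        $parts = explode('#', $href, 2);
        if (isset($seoMap[$parts[0]])) {
            $fragment = isset($parts[1]) ? '#' . $parts[1] : '';
            $link->setAttribute('href', $seoMap[$parts[0]] . $fragment);
        }
    }
    return $doc->saveHTML();
}

// Made-up URLs, for illustration only.
$seoMap = ['http://example.com/page.php?id=7' => 'http://example.com/blue-widgets'];
$html   = '<p><a href="http://example.com/page.php?id=7#top">Widgets</a></p>';
$result = rewriteLinks($html, $seoMap);
echo $result;
```

Two caveats: loadHTML() wraps fragments in html/body tags and guesses the encoding unless the markup declares one, and this sketch only touches href attributes, whereas the original patterns also match URLs that appear after ">" or whitespace.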

As your goal is to be SEO-friendly, using a canonical tag in the target pages would tell the search engines to use your SEO-friendly URLs, so you wouldn't need to replace them in your code.

Oops, that's really tough. It was a bad strategy from the beginning; anyway, that's not your fault.
I have 2 suggestions:
1 - Cache the output with Smarty, so the first request still takes 2 minutes to generate the HTML, but every later request just gets it from a static resource.
2 - Don't leave for later what should have been done earlier: fix the system. Create a database migration that stores the SEO URLs in a good format, or generate them from titles or whatever. On my system I generate SEO links in this format:
www.whatever.com/jobs/722/drupal-php-developer
where I use 722 as the ID, parsed from the URL to get the right page content, and (drupal-php-developer) is the title of the post or whatever.
3 - (which is not a suggestion) Tell your client that the project is not well engineered (if you truly believe so) and needs restructuring to boost performance.
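The /jobs/722/drupal-php-developer scheme from suggestion 2 can be sketched like this. slugify() and jobIdFromPath() are hypothetical helpers written for illustration, and the sketch assumes plain ASCII titles; only the numeric ID is used to find the page, while the slug is there for readers and search engines.

```php
<?php
// Hypothetical helper: turn a post title into a URL slug.
function slugify(string $title): string
{
    $slug = strtolower($title);
    // Collapse every run of characters other than a-z and 0-9 into a single hyphen.
    $slug = preg_replace('/[^a-z0-9]+/', '-', $slug);
    return trim($slug, '-');
}

// Hypothetical helper: pull the numeric ID out of a /jobs/{id}/{slug} path.
function jobIdFromPath(string $path): ?int
{
    if (preg_match('#^/jobs/(\d+)(?:/[a-z0-9-]*)?/?$#', $path, $m)) {
        return (int) $m[1];
    }
    return null; // not a job URL
}

$path = '/jobs/722/' . slugify('Drupal PHP Developer');
echo $path, "\n";
echo jobIdFromPath($path), "\n";
```

Because the lookup depends only on the ID, links are generated once when the page is rendered and nothing has to be rewritten in the finished HTML.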

PHP snippet always shows in the top of page in wordpress

I'm currently using a database search in WordPress, where I show the results on the results page via snippets.
The problem is that the results always get displayed at the top of the page. I do understand PHP executes before the HTML is rendered, but is there a way to get the results displayed where I want?
Code:
Wordpress: [xyz-ips snippet="industry"]
Php:
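$com = isset($_GET['company']) ? $_GET['company'] : '';
// Escape LIKE wildcards and bind the value instead of interpolating raw user input.
$like = $wpdb->esc_like($com) . '%';
$sql2 = $wpdb->prepare("SELECT count(1) i FROM data WHERE sector = (SELECT sector FROM data WHERE company LIKE %s)", $like);
$rows2 = $wpdb->get_results($sql2);
echo ' '.$rows2[0]->i.' ';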

Long story short, you need to find where in WordPress the string of HTML is being output.
Then, you move it to where you want the HTML to be output.
The general idea, as you said, is that PHP executes (echo $html) before your browser reads the HTML sent by PHP. Your PHP script will run in order, top to bottom, so by moving sections of PHP that output HTML (or anything) to a different part of your script, you change its location on the page.
It's hard to give further input without understanding your situation in detail. I would highly recommend browsing the documentation for your "WordPress search" to see if there is a best practice for accomplishing what you want.
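The idea of moving where the output happens can be illustrated in plain PHP. This is a minimal sketch, assuming the snippet's result can be captured as a string and placed into the page later instead of being echoed immediately; captureOutput() is a hypothetical helper, not part of WordPress or of the snippet plugin.

```php
<?php
// Hypothetical helper: run code that echoes and hand back what it printed as a string.
function captureOutput(callable $render): string
{
    ob_start();             // start buffering instead of sending output to the browser
    $render();
    return ob_get_clean();  // return the buffered text and stop buffering
}

// Stand-in for the snippet, which echoes the count straight away.
$fragment = captureOutput(function () {
    echo ' 42 ';
});

// The captured text can now be placed wherever it belongs in the page.
$page = '<h1>Results</h1><p>Matches:' . $fragment . '</p>';
echo $page;
```

Anything echoed while the buffer is open ends up in $fragment rather than at the top of the page, so its position is decided by where $fragment is used.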

Thanks to the intelligent community here, which is more eager to downvote than to help a fellow mate!
After much searching I discovered that the PHP version caused this. I downgraded (from the current 5.6) to 5.5 and that solved the issue.
This might help someone who is searching for a similar answer.

how to show short version of posts

I'm writing a blog and I want to show short versions of posts on the main page. I assume native PHP string functions aren't appropriate here, since posts can be large and it would take a long time to substr all posts in a loop. So, what is the common strategy here? I hope the question is clear and specific.
I don't want to shorten posts on client side with JS, that's not an option.

The solution I use is to add another field to the posts table in the database, where I put a short version of the post: its beginning cut off, or something like that.
It's faster and better. You don't have to worry about length because you control it, there is no problem with any HTML tags used in the content, and you can have slightly different text on the main page.

I can think of two options. The first one involves you writing excerpts for your blog posts manually. Doing this, you don't have to worry about PHP at all.
If you do want to go ahead and automatically generate excerpts, I would set an upper character limit and then cut at the end of the sentence nearest the chosen limit. This approach may or may not produce good results depending on how your post is written.
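The automatic approach could look roughly like this. It is a minimal sketch under a few assumptions: makeExcerpt() is a hypothetical helper, the text is valid UTF-8, and "nearest" is simplified to the last sentence end that still fits inside the limit, with a fallback to the last whole word.

```php
<?php
// Hypothetical helper: cut a post down to at most $limit characters, ending on a sentence if possible.
function makeExcerpt(string $text, int $limit = 200): string
{
    $plain = trim(strip_tags($text));
    // Take at most $limit characters; the /u flag keeps multi-byte characters intact.
    preg_match('/^.{0,' . $limit . '}/su', $plain, $m);
    $cut = $m[0];
    if ($cut === $plain) {
        return $plain; // already short enough
    }
    // Prefer ending at the last complete sentence inside the cut.
    if (preg_match('/^.*[.!?](?=\s|$)/su', $cut, $s)) {
        return $s[0];
    }
    // Otherwise fall back to the last whole word and mark the truncation.
    $space = strrpos($cut, ' ');
    return ($space === false ? $cut : substr($cut, 0, $space)) . '...';
}

echo makeExcerpt('First sentence. Second sentence is longer. Third.', 30), "\n";
echo makeExcerpt('one two three four', 9), "\n";
```

Running this once when a post is saved and storing the result in its own column combines both answers: the excerpt is generated automatically, and the main page never has to process full posts.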

Get contents of a changing DIV with static ID

I am trying to make myself a homepage, for my personal use only, and what I want to do is display different information from different websites that changes a few times a day, e.g. news, weather and such. I want to have my favorite information always in sight without the need to visit many pages. As many websites don't load within an iframe, which was the first thing I tried, I figured PHP might be able to help me.
So what I need to do is get the contents of a DIV and place it within my page with PHP.
The DIV on the source page is generated on the server but it always has the same ID.
example:
<div id="nowbox">
<a href="http://www.seznam.cz/jsTitleExecute?id=91&h=19331020">
<img width="135" height="77" src="http://seznam.cz/favicons/title//009/91-JrAEVc.jpg" alt="" /></a>
<div class="cont"> <ul> <li>
<strong>Sledujte dnes od 20.00 koncert Tata Bojs</strong>
<p>Nenechte si ujít tradiční benefiční koncert kapely Tata Bojs. Sledujte představení na Seznam.cz</p> </li> </ul>
</div>
</div>
so the ID of the DIV is "nowbox" and I need to copy all that is within it and put it in my page.
So far I was only able to use this
$contents = file_get_contents("http://seznam.cz");
and view all contents of the page but I have no idea how to strip everything and leave only the needed DIV.
I am not very experienced in PHP so I would be very grateful for any help, the easier to understand the better.
EDIT:
THX for answers. Basically I just wanted to get the code I posted as example to a variable so I could ECHO it somewhere on my page. The problem is that the code changes as does the rest of the website and only some things remain the same i.e. the DIV ID.
Definitely NOT the most elegant solution (even I know that but as the website is for my purposes only it shouldn't matter) but one that I successfully managed to get to work is that I got the whole page with:
$contents = file_get_contents("http://seznam.cz");
and then counted the number of chars to a specific unique position in the code with STRPOS plus/minus a static number of characters that I could count manually. Then I split the string into ARRAYs and discard the parts I don't need to get the beginning of the code in the beginning of a string and then use the same method to cut the string after the code ended.

If you want to do this server side, I suggest you use phpQuery:
require('phpQuery/phpQuery.php');
$doc = phpQuery::newDocumentFileXHTML('http://seznam.cz');
$html = pq('#nowbox')->htmlOuter();

I don't fully understand what you want to achieve and why, but you can do this on either the server or the client. First, the client-side way:
What you're asking for is to extract part of the DOM. Using JavaScript and jQuery on the client side you can do this very quickly by calling the .load() method on a container element, for example $("#target").load("/mypage #nowbox") (where #target stands for whatever element should receive the content).
This can be achieved on the server side as well with PHP, using any DOM manipulation library: either the bundled one (DOMDocument) or one of the easier-to-use libraries such as simplehtmldom (which leaks a bit of memory).
So there you have it: options for both client and server implementations. Pick whichever suits your needs best.
Please note that no CSS rules will be applied with either method, as the source page's CSS won't be loaded in your DOM.
Good luck!
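For the bundled DOMDocument route, a minimal sketch might look like this. extractById() is a hypothetical helper, and the inline $page string stands in for the result of file_get_contents() so that the example runs without network access.

```php
<?php
// Hypothetical helper: return the outer HTML of the element with the given id, or null.
function extractById(string $html, string $id): ?string
{
    $doc = new DOMDocument();
    // Collect libxml warnings about sloppy markup instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // The id is placed straight into the XPath expression, so it must not contain quotes.
    $nodes = $xpath->query('//*[@id="' . $id . '"]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return $doc->saveHTML($nodes->item(0));
}

// A stand-in for file_get_contents("http://seznam.cz"), so the sketch runs offline.
$page = '<html><body><div id="menu">Menu</div>'
      . '<div id="nowbox"><strong>Headline</strong><p>Teaser</p></div></body></html>';
echo extractById($page, 'nowbox');
```

Unlike counting characters with strpos, this keeps working when the surrounding markup changes, as long as the id stays the same. Pages that do not declare their encoding may need extra handling, since loadHTML() otherwise guesses it.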

Need to clean database of spam

So a couple things. First, being sick, I can't seem to focus right to get this figured out like I should, and secondly, it's flat out got me stumped on how to deal with this.
So I have a client who has an old site built on old code. There were some extreme vulnerabilities in the code that allowed for injections and attacks - which happened. Since I've come onto the project, I have tightened things up considerably and haven't really had issues. But I just found something that appears to be a lingering issue from previous hacks.
So in the database they have a field called 'copy' which is intended to store content of an article in. Ok fine, not the best of names, but it's there. So here's the issue. Since the hack, there are some 52k rows that have the word "viagra" in them. So when I look closer at the copy field and the code in a view source, this is what I find:
for the little kids in the neighborhood.<div style="display: none;">
Basically the opened and closed div tags that have a style set as seen above. So it doesn't visually render on the page but when you view the source or... "search engine spiders" come by, they see it. I couldn't figure out for the life of me why the .php files that got uploaded into the article_image directory were being indexed in Webmaster Tools - til tonight. Now I know why.
So here's what I need. Because each of the 52k affected rows in the database has what's given in the example (the <div style...> part), and it always appears after the content that was there originally, I need something that I can add to a loop to clean the crap out of the copy field. I could go the str_replace route, but that would take too long and there is no guarantee that I would get all the stuff.
So - any suggestions?

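Try this (it assumes "copy" is the table and "content" is the column holding the article text; adjust both names to your schema):
-- LOCATE() is 1-based, so subtract 1 to drop the "<" of the injected tag as well.
UPDATE `copy` SET `content` =
    SUBSTR(`content` FROM 1 FOR LOCATE('<div style="display: none;">', `content`) - 1)
WHERE `content` LIKE '%<div style="display: none;">%';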
Since you have indicated that these injections are always the last thing in an article, this will wipe them out pretty well. I strongly suggest taking a backup copy first, though!
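If you would rather run the cleanup from a PHP loop, or preview the result before touching the database, the same cut can be made per row. This is a minimal sketch, assuming the injected block always starts with exactly this marker and is the last thing in the field; stripInjectedTail() is a hypothetical helper.

```php
<?php
// Hypothetical helper: drop the injected marker and everything after it.
function stripInjectedTail(string $copy, string $marker = '<div style="display: none;">'): string
{
    $pos = strpos($copy, $marker);
    // No marker means the row is clean; return it unchanged.
    return $pos === false ? $copy : substr($copy, 0, $pos);
}

$dirty = 'for the little kids in the neighborhood.<div style="display: none;">spam links</div>';
echo stripInjectedTail($dirty);
```

Comparing the cleaned value with the original for a handful of rows before writing anything back is a cheap way to confirm that the marker really matches what was injected.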

Zend_Search_Lucene query parsing problem

Here's the setup, I have a Lucene Index and it works well with the 2,000 documents I have indexed. I have been using Luke (Lucene Index Toolbox, v.0.9.2) to debug queries, and am using ZF 1.9.
The layout for my Lucene Index is as follows:
I = Indexed
T = Tokenized
S = Stored
Fields:
author - ITS
category - ITS
publication - ITS
publicationdate - IS
summary - ITS
title - ITS
Basically I have a form that is searchable by the above fields, letting you mix and match any of the above information, and the input is parsed into a Zend Lucene query. That is not the problem. The problem is that when I start combining terms, the "optimize" method that fires within find() causes the query to just disappear.
Here is an example search I am running right now:
Form Version:
Title: test title
Publication: publication name
Lucene Query Parse:
+(title:test title) +(publication:publication name)
Now if I take this query string, slap it into Luke, and hit "Search", it returns the results just fine. When I use the query find() method, it bombs out. So I did a little research into how it functions and found a problem (I believe).
First off, here are the actual lines of code that do the searching:
$searchQuery = "+(title:test title) +(publication:publication name)";
$hits = new ArrayObject($this->index->find($searchQuery));
It's a simplified version of the actual code, but that's what it generates.
Now here's what I've noticed after some debugging: the "optimize" method just destroys the query itself. I created the following code:
// find() parses the string internally; parse it here to get a query object to inspect.
$query = Zend_Search_Lucene_Search_QueryParser::parse($searchQuery);
$rewrite = $query->rewrite($this->index);
$optimize = $query->rewrite($this->index)->optimize($this->index);
echo "======<br/>";
echo "Original: ".$searchQuery."<br/>";
echo "Rewrite: ".$rewrite."<br/>";
echo "Optimized + Rewrite: ".$optimize."<br/>";
echo "======<br/>";
Which outputs the following text:
======
Original: +(title:test title) +(publication:publication name)
Rewrite: +(title:test title) +(publication:publication name)
Optimized + Rewrite:
======
Notice how the third output is completely empty. It appears that running rewrite and optimize on the query causes the query string to just empty itself.
Does anyone have any idea why the optimize method seems to be removing my query altogether? Am I missing a filter or some sort of interface that might need to be parsed? All of the queries work perfectly when I paste them into Luke and run them against the index by hand, but something silly is going on with the way Zend is parsing the query to do the search.
Any help is appreciated.
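One thing that may be worth ruling out: in standard Lucene query syntax a field prefix applies only to the term directly after it, so title:test title looks for "test" in the title field and for "title" in the default field. Whether that is what makes optimize() drop the query here is not established in this thread. The sketch below only shows how a form could emit quoted phrases instead; buildQuery() is a hypothetical helper, and a phrase query matches the words in order, which is stricter than matching them individually.

```php
<?php
// Hypothetical helper: build a Lucene query string where every field value is a quoted phrase.
function buildQuery(array $fields): string
{
    $clauses = [];
    foreach ($fields as $name => $value) {
        $value = trim($value);
        if ($value === '') {
            continue; // skip form fields left empty
        }
        // Escape backslashes and double quotes so the phrase stays intact.
        $clauses[] = '+' . $name . ':"' . addcslashes($value, '"\\') . '"';
    }
    return implode(' ', $clauses);
}

echo buildQuery(['title' => 'test title', 'publication' => 'publication name']);
```

The resulting string can be pasted into Luke and passed to find() to check whether both tools agree on it.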

I will be quite frank: Zend_Search_Lucene (ZSL) is buggy and has not been maintained for a long time now.
It is also conceptually wrong. Let me explain why:
Search engines are there to reply quickly to search queries. The problem with ZSL is that it is implemented in pure PHP, which means that on every query all the index files are read and loaded again. It can't be fast.
There is nothing wrong with Lucene itself. There is even a very good alternative named Solr, which is based on Lucene: it is a search server implemented in Java which can index your documents and reply to all your Lucene queries. Because Solr runs as a server, you don't suffer the poor performance of reloading all the Lucene files again and again.
This is somewhat different from what you asked, but I waited two years for my ZSL bugs to be solved, and they finally are now that I use Solr :)
