I would be most grateful if a regex master among you would be kind enough to help me.
I'd like to make a PHP function that converts HTML tags/elements, as follows:
I want to convert
<span class="heading1">Any generic text, or other html elements such as <p> tags</p> in here</span>
To
<h1 class="heading1">Any text, or other html elements such as <p> tags</p> in here</h1>
...So basically I want to convert the span headings to proper h1 tags (this is for the purpose of better SEO) but there could be other normal span tags that I want to preserve.
Any ideas? Thanks in advance.
Well, as the commenters above pointed out, it's probably not a good idea. However, since this case is extremely simple, the regex would be pretty easy if you want to live on the edge:
$htmlFile = preg_replace('/<(\/?)span\b/', '<${1}h1', $htmlFile);
This will replace all span tags with h1 tags. Note that if there is any deviation from the format, it will break. Hence the warnings against this method. I would only recommend it if you are working with a small number of relatively small HTML files, so you can check them for errors.
EDIT: Yeah, if you only want to replace ones with class="heading1" I'm not touching it. That would require more mucking about with the regex than it would probably take to just fix all the files manually.
EDIT 2: Okay, I'm a little bored and curious, so I'm going to see if I can come up with a regex that would replace all class="heading1" spans and their corresponding closing tags with h1's:
preg_replace('/<span class="heading1">(.*(.*<span.*>.*<\/span>.*)*.*)<\/span>/', '<h1 class="heading1">${1}</h1>', $htmlFile);
If my calculations are correct, this should ignore any matching sets of span tags inside the heading1 span tags.
You're still probably better off using a DOM parser though.
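A DOM-based version of the same conversion might look like the sketch below. It assumes PHP's bundled DOM extension is available and treats the markup as an HTML fragment; the function name spanHeadingsToH1 and the wrapper id are made up for illustration.

```php
<?php
// Hypothetical helper, not from the answer above: renames every
// <span class="heading1"> to <h1 class="heading1"> using DOMDocument.
function spanHeadingsToH1(string $html): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // collect parser warnings silently
    // The wrapper div (made-up id) lets us pull the fragment back out later;
    // the XML prolog is a common trick to make loadHTML() read UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8"><div id="h1-wrap">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Match "heading1" as a whole class token, even among several classes.
    $query = '//span[contains(concat(" ", normalize-space(@class), " "), " heading1 ")]';
    foreach (iterator_to_array($xpath->query($query)) as $span) {
        $h1 = $doc->createElement('h1');
        foreach ($span->attributes as $attr) {
            $h1->setAttribute($attr->nodeName, $attr->nodeValue);
        }
        // Move the children across, so nested tags and plain spans survive.
        while ($span->firstChild) {
            $h1->appendChild($span->firstChild);
        }
        $span->parentNode->replaceChild($h1, $span);
    }

    // Serialize only what sits inside the wrapper.
    $out = '';
    $wrap = $xpath->query('//div[@id="h1-wrap"]')->item(0);
    foreach ($wrap->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Spans without the heading1 class are left alone. The serializer may insert line breaks between block elements, so the output is not guaranteed to match the input byte for byte.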
I am working on an automation using PHP.
I have a newsletter that gets sent every week, and instead of editing it in an annoying CMS, I decided to automate it: a form collects the variables and posts them to an HTML template.
For some reason one of the fields echoes this: <p class="MsoNormal"><span lang="EN-GB"> before the text. I suspect it's because the text contains characters that aren't supported, but I'm not sure. (If so, is there an easy way to convert them to supported ones?)
What is the reason for this?
How do I resolve this problem?
EDIT:
OK, so the problem is that the text I copy comes from a Word document, and that's what produces these irrelevant tags.
I want tags, I want tags, etc. I just don't want the tags mentioned.
I hope this is clearer now.
Use strip_tags() to clean your strings
$string = strip_tags($string, '<a><b><br><div><em><i><li><table><td><tr><span><sub><sup><strong><u><ul>');
and so on
PHP manual: http://php.net/manual/en/function.strip-tags.php
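As a small illustration of how the allow-list behaves, here is a sketch with a made-up input string modeled on the Word markup from the question. Note that strip_tags() removes tags, not attributes, so any attributes on the tags you allow are kept.

```php
<?php
// Illustrative input: Word-style markup like the one in the question.
$pasted = '<p class="MsoNormal"><span lang="EN-GB">Hello <b>world</b></span></p>';

// Only tags named in the allow-list survive; other tags are removed,
// but the text inside the removed tags stays.
$clean = strip_tags($pasted, '<a><b><br><em><i><strong><u>');

echo $clean; // Hello <b>world</b>
```

Leaving <p> and <span> out of the allow-list is what gets rid of the Word wrappers here; add them back only if you need them elsewhere in the text.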
I am working on a simple PHP/MySQL website and presenting the data for the following fields for each entry in the database (through a loop):
Title
Organisation
DetailedInfo
The 'DetailedInfo' field in the database can hold up to 5000 characters. While displaying on the webpage I am only using the first 250 characters.
The problem is as follows. If an entry has a formatting tag (italic/bold) starting, say at character 240, and the formatting tag is not closed by the 250th character then the problem starts. For all subsequent entries the Title, Organisation and DetailedInfo are displayed with the tag (so all the subsequent text are either italic, or, bold).
I am using CSS style for Title, Organisation and DetailedInfo but it seems that the CSS is not able to get rid of the formatting tag from the data.
Any help will be appreciated.
Cheers,
Tim
If you're only displaying a small portion of the DetailedInfo field, I'd guess formatting it isn't that important. Use strip_tags() to get rid of the formatting tags before you display it.
CSS cannot fix broken HTML. You'll need to strip it back to plain text and re-code (or just leave it out).
I wouldn’t fix that with CSS (and I don’t think it’s possible). You’re outputting invalid HTML, which is going to cause problems, especially if anyone ever looks at the page in IE 8 or earlier.
It could also be worse than an unclosed tag. What if the excerpt ends with </i?
I’d either implement some crazy logic to close any unclosed HTML tags in the 250-character excerpt, or strip all HTML tags from the excerpt. I’m guessing the latter would be easier.
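The "strip all HTML tags" route could be sketched roughly as follows. The function name plainExcerpt is hypothetical, the mbstring extension is assumed to be installed, and the text is assumed to be UTF-8. Tags are removed before truncating, so a cut can never land in the middle of a tag such as </i.

```php
<?php
// Hypothetical helper: plain-text excerpt that is safe to echo into a page.
function plainExcerpt(string $html, int $limit = 250): string
{
    // Drop all tags first, then decode entities so "&amp;" counts as one character.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');
    // Collapse runs of whitespace left behind by removed markup.
    $text = trim((string) preg_replace('/\s+/u', ' ', $text));
    if (mb_strlen($text, 'UTF-8') > $limit) {
        $text = mb_substr($text, 0, $limit, 'UTF-8') . '...';
    }
    // Re-escape, because the result is plain text going back into HTML.
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

echo plainExcerpt('<i>Hello <b>world', 8); // Hello wo...
```

Because no tags are left in the excerpt, an unclosed <i> or <b> in one entry cannot leak into the entries that follow it.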
I have HTML data saved in a database, and I want to display a shortened version of it.
I tried using the mb_strstr function like this:
$str = mb_strstr($this->htmlData, "</p>",true);
echo $str."</p>";
It echoes the first paragraph of the HTML. The problem is that the HTML is entered in an admin panel, and sometimes the first paragraph does not have enough text.
I also don't want to cut at a fixed character position with substr, because character 200, say, may fall inside an HTML tag, which produces invalid HTML output.
So I'd like to learn the best practice for this kind of problem.
Thank you.
Add a custom tag or code to your WYSIWYG editor (example: <separator> or ...). You can use it to separate the introduction from the rest of the article. That helps you avoid the mess of HTML tags being left unclosed in the introduction. It also lets the author decide manually which part of the text makes a good introduction.
Another wise thing to do would be to add a separate field in the database for the introduction. Yes, it costs more storage, but it gives the author the option to write a juicy introduction that gets more people to open the full article.
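A minimal sketch of the marker idea, assuming the editor inserts the literal string <separator> at the chosen spot. The constant and function names are made up for illustration; nothing in PHP or HTML treats this marker specially.

```php
<?php
// Hypothetical marker inserted by the editor, taken from the example above.
const INTRO_MARKER = '<separator>';

// Returns everything before the first marker, or the whole text if the
// author did not place one.
function introOf(string $html): string
{
    $parts = explode(INTRO_MARKER, $html, 2);
    return $parts[0];
}

echo introOf('<p>Intro</p><separator><p>Rest</p>'); // <p>Intro</p>
```

Since the author places the marker between complete elements, the introduction stays valid HTML without any tag balancing. Remember to remove the marker before printing the full article.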
I have a PHP program that, at some point, needs to analyze a large amount of HTML+JavaScript text to extract information.
The parsing needs to happen in two parts:
Separate all "HTML groups" to parse
Parse each HTML group to get the needed information.
In the 1st parse it needs to find:
<div id="myHome"
And start capturing after that tag. Then stop capturing before
<span id="nReaders"
And capture the number that comes after this tag and stop.
In the 2nd parse use the capture nº 1 (0 has the whole thing and 2 has the number) from the parse made before and then find
.
I already have code to do that and it works. Is there a way to improve this, make it easier for the machine to parse?
preg_match_all('%<div id="myHome"[^>]*>(.*?)<span id="nReaders"[^>]*>([0-9]+)<%msi', $data, $results, PREG_SET_ORDER);
foreach ($results as $result) {
    // $result[1] holds the captured group, $result[2] the number
    preg_match_all('%<div class="myplacement".*?[.]php[?]((?:next|before))=([0-9]+).*?<tbody.*?<td[^>]*>.*?[0-9]+"%msi', $result[1], $mydata, PREG_SET_ORDER);
    // takes care of the data and finishes the program
}
Note: I need this for a freeware program so it must be as general as possible and, if possible, not use php extensions
ADD:
I omitted some parts here because I didn't expect answers like those.
There is also a need to parse text inside one of the tags in the document. It may be the 6th, 7th or 8th tag, but I know it comes after a certain tag. The parser I've checked (thx profitphp) does work to find the script tag. What now?
There is more than one tag with the same class, and I want them all. But I only want those that also have one of a list of classes.
Where can I find instructions and demos and limitations of DOM parsers (like the one in http://simplehtmldom.sourceforge.net/)? I need something that will work on, at least, a big amount of free servers.
Another thing. How do I parse this part:
"php?=([0-9]+)"
with those HTML parsers?
If you're concerned about efficiency (and indeed accuracy), don't attempt to parse HTML using regex.
You should use a parser, such as PHP's DOM.
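As a rough sketch of what that could look like for the markup described in the question: the code below uses only the bundled DOM extension. It assumes that the nReaders span and the myplacement div sit inside the myHome div and that the next/before values appear in link hrefs; neither is confirmed by the question. The function name extractGroups is made up.

```php
<?php
// Hypothetical helper: one result per <div id="myHome"> group.
function extractGroups(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate sloppy real-world HTML
    $doc->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);

    $groups = [];
    foreach ($xpath->query('//div[@id="myHome"]') as $home) {
        // Assumption: the reader count lives inside the group.
        $span = $xpath->query('.//span[@id="nReaders"]', $home)->item(0);

        $links = [];
        $q = './/div[contains(concat(" ", normalize-space(@class), " "), " myplacement ")]//a[@href]';
        foreach ($xpath->query($q, $home) as $a) {
            // Read "something.php?next=12" with URL functions, not a regex.
            $query = (string) parse_url($a->getAttribute('href'), PHP_URL_QUERY);
            parse_str($query, $params);
            foreach (['next', 'before'] as $key) {
                if (isset($params[$key]) && is_string($params[$key]) && ctype_digit($params[$key])) {
                    $links[] = [$key, (int) $params[$key]];
                }
            }
        }

        $groups[] = [
            'readers' => $span ? (int) trim($span->textContent) : null,
            'links'   => $links,
        ];
    }
    return $groups;
}
```

The class test matches myplacement as one class among several, which also covers the "same class plus one of a list of classes" case if you extend the XPath predicate with further conditions.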
As noted above, regex is not a good fit for this. You'll be better off using something like this:
Robust and Mature HTML Parser for PHP
Efficiency doesn't matter if your results are incorrect. Parsing HTML with regexes will lead to incorrect results down the road. Use a parser.
I found a way to create efficient searches.
If you want to search for "A huge string in a whole text" you can do it this way:
(?:(?:[^A]*A)+? huge string in a whole text)
It always works. It only creates a backtracking point at each 'A' character rather than at every single character, so it is efficient in both memory and processing time. If there are two options, it also works without a problem:
(?:(?:[^AB]*AB)+?(?: huge string in a whole text|e the huge string in a whole text))
Up until now it has never failed.
I am using preg_replace to add a link to keywords if they are found within a long HTML string. I don't want to add a link if the keyword is found within h1 tags or strong tags.
The below regex nearly works and basically says (I think): If the keyword is not immediately wrapped by either a h1 tag or a strong tag then replace with the keyword that was matched, as a bolded link to google.
$result = preg_replace('%(?!<h1>)(?!<strong>)\b(bobs widgets)\b(?!<\/strong>)(?!<\/h1>)%i','<strong>$1</strong>', $result, -1);
(the reason I don't want to match if in strong tags is because I am recursing through a lot of keywords so don't want to link an already linked keyword on subsequent passes)
the above works fine and won't match:
<h1>bobs widgets</h1>
It will however match the keyword in the following text, because the h1 tag isn't immediately either side of the keyword:
<h1>Here are bobs widgets for sale</h1>
I need to make the spaces either side optional and have tried adding \s* but that doesn't get me anywhere. I'd be very grateful for a push in the right direction here.
Regular expressions are the wrong tool for this job. This has been discussed many times on Stack Overflow (such as the most famous thread on the site).
What you need is an HTML parser, such as the Simple HTML DOM Parser. Do yourself a favour and use something like this from the start. Imagine what's going to happen when you run into an <h1> where someone has added an attribute, or perhaps someone has improperly closed the tags, so you have a mixed up order on a </strong> and a </h1>. Getting things like that to work with a regular expression is not worth the trouble, and sometimes isn't even possible.
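For comparison, here is a sketch of the same keyword linking done on the DOM, using PHP's bundled DOM extension instead of the Simple HTML DOM Parser. The function name linkKeyword, the wrapper id and the URL parameter are hypothetical. It only touches text nodes that have no h1, strong or a ancestor, so repeated passes do not link a keyword twice.

```php
<?php
// Hypothetical helper: wraps $keyword in <strong><a href="$url">...</a></strong>
// except inside <h1>, <strong> or <a>.
function linkKeyword(string $html, string $keyword, string $url): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    // Wrapper div with a made-up id; the XML prolog makes loadHTML() read UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8"><div id="kw-wrap">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $wrap = $xpath->query('//div[@id="kw-wrap"]')->item(0);
    $texts = $xpath->query('.//text()[not(ancestor::h1 or ancestor::strong or ancestor::a)]', $wrap);
    $pattern = '/\b(' . preg_quote($keyword, '/') . ')\b/i';

    foreach (iterator_to_array($texts) as $text) {
        // Odd-numbered parts are the matched keyword, even ones the text around it.
        $parts = preg_split($pattern, $text->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // keyword not present in this text node
        }
        $frag = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {
                $a = $doc->createElement('a');
                $a->setAttribute('href', $url);
                $a->appendChild($doc->createTextNode($part));
                $strong = $doc->createElement('strong');
                $strong->appendChild($a);
                $frag->appendChild($strong);
            } elseif ($part !== '') {
                $frag->appendChild($doc->createTextNode($part));
            }
        }
        $text->parentNode->replaceChild($frag, $text);
    }

    $out = '';
    foreach ($wrap->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Because the check is on ancestors rather than on the characters next to the keyword, <h1>Here are bobs widgets for sale</h1> is skipped no matter how much text surrounds the keyword or which attributes the h1 carries.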
... just remember that eventually this approach will lead to sadness, and you'll need to start looking for a better approach. One way is to use 'tidy' to fix up your html into parseable xml, and then php offers a few xml manipulation APIs to work with the data.
Here's an answer anyway.
You can add some wildcards instead of the word boundaries. Something like this should do the trick:
([^<>]*)(bobs widgets)([^<>]*)
Then, add some more replacement markers to keep the remainder of your text in the output:
'$1<strong>$2</strong>$3'
Now hit save and hide behind the sofa ;)