I need to do some cleanup on strings that look like this:
$author_name = '<a href="http://en.wikipedia.org/wiki/Robert_Jones_Burdette>Robert Jones Burdette </a>';
Notice the href tag doesn't have closing quotes - I'm using the DOMParser on a large table of these to extract the text, and it borks on this.
I would like to look at the string in $author_name;
IF the first > does NOT have a " before it, replace it with "> to close the tag correctly. If it is okay, just skip and do the next step. Be sure not to replace the second > at all.
Using php regex, I haven't been able to find a working solution - I could chop up the whole thing and check its parts, but that would be slow and I think there must be a regex that can do what I want.
TIA
What you can do is, find the first closing tag, with or without the double-quote ("), and replace it with (">):
$author_name = preg_replace('/(.+?)"?>(.+?)/', '$1">$2', $author_name);
http://www.barattalo.it/html-fixer/
Download that, then include it in your php.
The rest is quite easy:
$dirty_html = ".....bad html here......";
$a = new HtmlFixer();
$clean_html = $a->getFixedHtml($dirty_html);
It's common for people to want to use regular expressions, but you must remember that HTML is not regular.
Related
I have a set of page soure code that has elements such as
<p class='style1'>, <p class=3DMsoNormal>, <span style=3D'font-size:12.0pt'>, <p class=3DMsoNormal>
but i want am trying to replace the all single quotes with double quotes in the all of the source code and those that dont have quotes to receive double quotes such as
<p class=3DMsoNormal and remove the text '3D' from all those that have it.
Below are a series of functions i tried that didn't work. Can someone help me find a solution to this? Thanks
<?php
// test files holds the source code
$html_part = file_get_contents('testRegex.html');
$cSeq = "/(.*)='(.*)'/"; //code sequence
$nSeq = "/(.*)="."(.*)"."/"; //new sequence
preg_match_all($cSeq, $html_part, $matches);
preg_replace($cSeq, $nSeq, $html_part);
echo $html_part;
?>
I'm not sure this regex's are the way to go.
Maybe consider using a parser to read in the file, and write it back out / prettify it.
I've used Beautiful Soup in the past.
preg_replace("/(.*)?='(.*)?'/","\\1=\"\\2\"",$str)
you need to use backreference http://www.regular-expressions.info/brackets.html
You might want to take a look at quoted_printable_decode() instead of removing the '3D' manually.
I have a string that contains a lot of links and I would like to adjust them before they are printed to screen:
I have something like the following:
replace_this
and would like to end up with something like this
replace this
Normally I would just use something like:
echo str_replace("_"," ",$url);
In in this case I can't do that as the URL contains underscores so it breaks my links, the thought was that I could use regular expression to get around this.
Any ideas?
Here's the regex: <a(.+?)>.+?<\/a>.
What I'm doing is preserving the important dynamic stuff within the anchor tag, and and replacing it with the following function:
preg_replace('/<a(.+?)>.+?<\/a>/i',"<a$1>REPLACE</a>",$url);
This will cover most cases, but I suggest you review to make sure that nothing unexpected was missed or changed.
pattern = "/_(?=[^>]*<)/";
preg_replace($pattern,"",$url);
You can use this regular expression
(>(.*)<\s*/)
along with preg_replace_callback .
EDIT :
$replaced_text = preg_replace_callback('~(>(.*)<\s*/)~g','uscore_replace', $text);
function uscore_replace($matches){
return str_replace('_','',$matches[1]); //try this with 1 as index if it fails try 0, I am not entirely sure
}
First off, I don't know much (quite nothing) about PHP. I'm more familiar with CSS.
I'm making use of Ben Ward script Tumblr2Wordpress (here's the script on GitHub) to export my Tumblr blog in XML (so I can import it in my Wordpress blog). This script reads tumblr's API, queries elements, do a bit of formatting and export the whole thing in HTML.
I need to customize it just a bit to fit my needs. For example in the following function I need the blockquote to become a specific class of blockquote:
function _doBlockQuotes_callback($matches) {
$bq = $matches[1];
# trim one level of quoting - trim whitespace-only lines
$bq = preg_replace('/^[ ]*>[ ]?|^[ ]+$/m', '', $bq);
$bq = $this->runBlockGamut($bq); # recurse
$bq = preg_replace('/^/m', " ", $bq);
# These leading spaces cause problem with <pre> content,
# so we need to fix that:
$bq = preg_replace_callback('{(\s*<pre>.+?</pre>)}sx', array(&$this, '_doBlockQuotes_callback2'), $bq);
return "\n". $this->hashBlock("<blockquote>\n$bq\n</blockquote>")."\n\n";
}
At first, I thought it will be as simple as adding the class I need inside the blockquote HTML tag, like so <blockquote class="big"> But it breaks the code.
Is there a way I could add this HTML attribute as is in the PHP script? Or do I need to define the output of this <blockquote>somewhere else?
Thanks in advance for any tips!
P.
Your guess was correct, but you need to escape the quotes with backslashes:
return "\n". $this->hashBlock("<blockquote class=\"big\">\n$bq\n</blockquote>")."\n\n";
Otherwise, PHP assumes that your string ends at the class=" quote.
You can escape double quotes ".
"<blockquote class=\"big\">"
How ever, if you're going to use single quotes '. It's unnecessary.
'<blockquote class="big">'
You need to escape the quote marks
<blockquote class=\"big\">
I'm working in Wordpress and need to be able to remove images and empty paragraphs. So far, I've found out how to remove images without a problem. But, I then need to remove empty paragraph tags. I'm using PHP preg_replace to handle the regex functions.
So, as an example, I have the string:
<p style="text-align:center;"><img src="http://www.blah.com/image.jpg" alt="Blah Image" /></p><p>Some text</p>
I run this regex on it:
/<img.*?(>)/
And I end up with this string:
<p style="text-align:center;"></p><p>Some text</p>
I then need to be able to remove the empty paragraph. I tried this, but it removes all paragraphs and the contents of the paragraphs:
/<p[^>]*><\/p[^>]*>/
Any help/suggestions is greatly appreciated!
The correct regex is no regex. Use an HTML/DOM Parser instead. They're simple to use. Regex is for regular languages (which HTML is not).
/<p[^>]*><\/p[^>]*>/ (the regex you gave) should work fine. If it's giving you trouble you could try double-escaping the / like this: /<p[^>]*><\\/p[^>]*>/
PHP is funny about quoting and escape characters. For example "\n" is not equal to '\n'. The first is a line break, the second is a literal backslash followed by an 'n'. The PHP manual entry on string literals is probably worth a quick look.
I need to preg_match for
src="http:// "
where the blank space following // is the rest of the url ending with the ". My adapted doesn't seem to work:
preg_match('#src="(http://[^"]+)#', $data, $match);
And I am also struggling to get text that starts with > and ends with EITHER a full stop . or an exclamation mark ! or a question mark ? I have no idea how to do this one. An example of the text I want to preg_match for is:
blahblahblah>Hello world this is what I want.
I'm hoping a kind preg_match guru can tell me the answer and save me hours of headscratching.
Thanks for reading.
As for the URL:
preg_match('#src="(.*?)"#', $data, $match);
and for the second case, use />(.*?)(\.|!|\?)/
(.*?)" will match any character greedily up until the time it sees the end double quote
It seems that you want to parse a document or string which follows a HTML, DOM, XML or something similiar structure.
Use XPath, and parse to the Tag and let it return the src Attribute, this will save much trouble and you can forget about regular expressions.
Example: CLICK ME