I have an html string
$html_string = '<div style="font-family:comic sans ms,cursive;">
<div style="font-size:200%;">Some Text </div></div>';
I have tried
$dom = new DOMDocument;
$dom->loadHTML($html_string);
$divs = $dom->getElementsByTagName('div');
for($i=0;$i<$divs->length;$i++) {
$attrib = $divs->item($i)->getAttribute("style");
echo $attrib;
echo '<br />';
}
it gives the following output
font-family:comic sans ms,cursive
font-size:200%;
I need
font-family
font-size
How can I get only these keys not the values they have?
you can use regexps to do that. Something like this:
$style = 'font-family:comic sans ms,cursive;font-size:15em';
preg_match_all('/(?<names>[a-z\-]+):(?<params>[^;]+)[; ]*/', $style, $matches);
var_dump($matches['names']);
var_dump($matches['params']);
result:
array
0 => string 'font-family' (length=11)
1 => string 'font-size' (length=9)
array
0 => string 'comic sans ms,cursive' (length=21)
1 => string '15em' (length=4)
This even works with more than one CSS property.
Use a CSS parser!
All the answers using explode and regular expressions are inherently wrong. It is CSS source code you're trying to analyze, and simple text manipulation will never do that correctly. E.g. background-image:url('http://my.server.com/page?a=1;b=2'); list-style-image:url('http://my2.server.com/page/a=1;b=2') is perfectly valid and contains the two properties background-image and list-style-image, yet most text processing will fail on it because there are semicolons and 4 colons in the middle of the text (poor solutions would mistake either for a sign of 4 properties).
Generally, never try fiddling with text-manipulation tools on source code; not for CSS, not for HTML, not for any other source code. Languages are by design more complicated than that. This is what parsers are meant to handle, and it is the same reason why they are BIG, or at least more complicated than strpos()...
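To make the point concrete, here is a minimal sketch comparing a naive split with a scanner that at least tracks quotes and parentheses. The helper names split_declarations and property_names are made up for this illustration, and the scanner is a toy, not a CSS parser: it ignores comments and other parts of the grammar, so a real parser library is still the safer choice.

```php
<?php
// Hypothetical helper: splits an inline style into declarations, treating ';'
// as a separator only when it is outside quotes and outside parentheses.
function split_declarations(string $style): array
{
    $declarations = [];
    $current = '';
    $quote = null;  // quote character we are currently inside, or null
    $depth = 0;     // nesting depth of parentheses, e.g. url(...)
    $length = strlen($style);
    for ($i = 0; $i < $length; $i++) {
        $char = $style[$i];
        if ($quote !== null) {
            if ($char === '\\' && $i + 1 < $length) {
                $current .= $char . $style[++$i]; // keep escaped character as-is
                continue;
            }
            if ($char === $quote) {
                $quote = null;
            }
        } elseif ($char === '"' || $char === "'") {
            $quote = $char;
        } elseif ($char === '(') {
            $depth++;
        } elseif ($char === ')' && $depth > 0) {
            $depth--;
        } elseif ($char === ';' && $depth === 0) {
            if (trim($current) !== '') {
                $declarations[] = trim($current);
            }
            $current = '';
            continue;
        }
        $current .= $char;
    }
    if (trim($current) !== '') {
        $declarations[] = trim($current);
    }
    return $declarations;
}

// Hypothetical helper: the property name is everything before the first ':'.
function property_names(string $style): array
{
    $names = [];
    foreach (split_declarations($style) as $declaration) {
        $names[] = trim(explode(':', $declaration, 2)[0]);
    }
    return $names;
}

$style = "background-image:url('http://my.server.com/page?a=1;b=2'); "
       . "list-style-image:url('http://my2.server.com/page/a=1;b=2')";

$naiveCount = count(explode(';', $style)); // naive split yields 4 pieces
$names = property_names($style);
print_r($names);
```

The naive explode sees four pieces where there are only two declarations; the scanner returns the two property names because it skips the semicolons inside the quoted URLs.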
Use explode (splitting on ':') on your current output, then take the first element of the array that explode returns.
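A minimal sketch of the explode idea, assuming the style values never contain ';' or a leading ':' of their own (which, for example, url(...) values can break). The function name style_property_names is invented for this example.

```php
<?php
// Hypothetical helper: returns the property names of an inline style string.
function style_property_names(string $style): array
{
    $names = [];
    foreach (explode(';', $style) as $declaration) {
        // Limit 2: split on the first ':' only, so colons in the value are left alone.
        $parts = explode(':', $declaration, 2);
        if (count($parts) === 2 && trim($parts[0]) !== '') {
            $names[] = trim($parts[0]);
        }
    }
    return $names;
}

$names = style_property_names('font-family:comic sans ms,cursive;font-size:200%;');
print_r($names);
```

Applied to each style attribute from the DOM loop in the question, this yields font-family and font-size.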
Tricky preg_replace_callback function here - I am admittedly not great at PCRE expressions.
I am trying to extract all img src values from a string of HTML, save the img src values to an array, and additionally replace the img src path with a local path (not a remote path). I.e., I might have, surrounded by a lot of other HTML:
img src='http://www.mysite.com/folder/subfolder/images/myimage.png'
And I would want to extract myimage.png to an array, and additionally change the src to:
src='images/myimage.png'
Can that be done?
Thanks
Does it need to use regular expressions? Handling HTML is normally easier with DOM functions:
<?php
$domd = new DOMDocument();
libxml_use_internal_errors(true);
$domd->loadHTML(file_get_contents("http://stackoverflow.com"));
libxml_use_internal_errors(false);
$items = $domd->getElementsByTagName("img");
$data = array();
foreach($items as $item) {
$data[] = array(
"src" => $item->getAttribute("src"),
"alt" => $item->getAttribute("alt"),
"title" => $item->getAttribute("title"),
);
}
print_r($data);
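The snippet above only collects the attributes. A possible extension that also rewrites each src to a local images/ path could look like the sketch below; the sample markup is taken from the question, and the variable names are illustrative only.

```php
<?php
$html = "<html><body>blahblah<br>"
      . "<img src='http://www.mysite.com/folder/subfolder/images/myimage.png'>blah<br>"
      . "blah<img src='http://www.mysite.com/folder/subfolder/images/another.png' />blah<br>"
      . "</body></html>";

$doc = new DOMDocument();
libxml_use_internal_errors(true);  // do not emit warnings for sloppy markup
$doc->loadHTML($html);
libxml_clear_errors();

$fileNames = [];
foreach ($doc->getElementsByTagName('img') as $img) {
    // Take the path part of the URL and keep only the file name.
    $path = parse_url($img->getAttribute('src'), PHP_URL_PATH);
    if (!is_string($path) || $path === '') {
        continue;
    }
    $fileName = basename($path);
    $fileNames[] = $fileName;
    $img->setAttribute('src', 'images/' . $fileName);
}

$rewritten = $doc->saveHTML(); // full document with the rewritten src values
print_r($fileNames);
```

Note that saveHTML() returns a complete document (doctype, html and body included), so a fragment has to be extracted again if only the fragment is needed.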
Do you need regex for this? Not necessarily. Is regex the most readable solution? Probably not, at least unless you are fluent in regex. Is regex more efficient when scanning large amounts of data? Often, since PCRE compiles a pattern on first use and caches it. Does regex win the "least lines of code" trophy?
$string = <<<EOS
<html>
<body>
blahblah<br>
<img src='http://www.mysite.com/folder/subfolder/images/myimage.png'>blah<br>
blah<img src='http://www.mysite.com/folder/subfolder/images/another.png' />blah<br>
</body>
</html>
EOS;
preg_match_all("%<img .*?src=['\"](.*?)['\"]%s", $string, $matches);
$images = array_map(function ($element) { return preg_replace("%^.*/(.*)$%", 'images/$1', $element); }, $matches[1]);
print_r($images);
Two lines of code, that's hard to undercut in PHP. It results in the following $images array:
Array
(
[0] => images/myimage.png
[1] => images/another.png
)
Please note that this won't work with PHP versions prior to 5.3 unless you replace the anonymous function with a proper one.
I have a PHP variable that holds some HTML. I want to be able to split the variable into two pieces, and I want the split to take place where a second bold tag (<strong> or <b>) is found. Essentially, if I have content that looks like this,
My content
This is my content. Some more bold content, that would spilt into another variable.
is this at all possible?
Something like this would basically work:
preg_split('/(<strong>|<b>)/', $html1, 3, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
Given your test string of:
$html1 = '<strong>My content</strong>This is my content.<b>Some more bold</b>content';
you'd end up with
Array (
[0] => <strong>
[1] => My content</strong>This is my content.
[2] => <b>
[3] => Some more bold</b>content
)
Now, if your sample string did NOT start with strong/b:
$html2 = 'like the first, but <strong>My content</strong>This is my content.<b>Some more bold</b>content, has some initial none-tag content';
Array (
[0] => like the first, but
[1] => <strong>
[2] => My content</strong>This is my content.
[3] => <b>
[4] => Some more bold</b>content, has some initial none-tag content
)
Then a simple test of whether element #0 is a tag or text tells you where your "second tag and onwards" text starts (element #3 in the first case, element #4 in the second).
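One way to sketch that test is to count the tag elements in the split result instead of looking only at element #0, which gives the same answer for both sample strings. The function name split_at_second_bold is invented here, and the limit is set to -1 so the result does not depend on how many bold tags follow.

```php
<?php
// Hypothetical helper: returns [before, from-second-bold-tag-onwards].
function split_at_second_bold(string $html): array
{
    $parts = preg_split(
        '/(<strong>|<b>)/',
        $html,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );
    $tagsSeen = 0;
    foreach ($parts as $index => $part) {
        if ($part === '<strong>' || $part === '<b>') {
            $tagsSeen++;
            if ($tagsSeen === 2) {
                return [
                    implode('', array_slice($parts, 0, $index)),
                    implode('', array_slice($parts, $index)),
                ];
            }
        }
    }
    return [$html, '']; // fewer than two bold tags: nothing to split off
}

$html1 = '<strong>My content</strong>This is my content.<b>Some more bold</b>content';
$result = split_at_second_bold($html1);
print_r($result);
```

Like the preg_split call it builds on, this only recognizes the bare tags <strong> and <b>, not tags with attributes.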
It is possible with 'positive lookbehind' in regular expressions. E.g., (?<=a)b matches the b (and only the b) in cab, but does not match bed or debt.
In your case, (?<=(\<strong|\<b)).*(\<strong|\<b) should do the trick. Use this regex in a preg_split() call and make sure to set PREG_SPLIT_DELIM_CAPTURE if you want those tags <b> or <strong> to be included.
If you truly really need to split the string, the regular expression approach might work. There are many fragilities about parsing HTML, though.
If you just want to know the second node that has either a strong or b tag, using a DOM is so much easier. Not only is the code very obvious, all the parsing bits are taken care of for you.
<?php
$testHtml = '<p><strong>My content</strong><br>
This is my content. <strong>Some more bold</strong> content, that would spilt into another variable.</p>
<p><b>This should not be found</b></p>';
$htmlDocument = new DOMDocument;
if ($htmlDocument->loadHTML($testHtml) === false) {
// crash and burn
die();
}
$xPath = new DOMXPath($htmlDocument);
$boldNodes = $xPath->query('//strong | //b');
$secondNodeIndex = 1;
if ($boldNodes->item($secondNodeIndex) !== null) {
$secondNode = $boldNodes->item($secondNodeIndex);
var_dump($secondNode->nodeValue);
} else {
// crash and burn
}
I need some help with regex:
I have some HTML output and I need to wrap every registered trademark symbol (®) in <sup></sup>.
I cannot insert the <sup> tag in title and alt attributes, and obviously I don't need to wrap symbols that are already superscripted.
The following regex matches text that is not part of a HTML tag:
(?<=^|>)[^><]+?(?=<|$)
An example of what I'm looking for:
$original = '<div>asd® asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>';
The filtered string should output:
<div>asd<sup>®</sup> asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>
thanks a lot for your time!!!
Well, here is a simple way, if you agree to following limitation:
Those regs that are already processed have the </sup> following right after the ®
echo preg_replace('#®(?!\s*</sup>|[^<]*>)#','<sup>®</sup>', $s);
The logic behind it is:
we replace only those ® which are not followed by </sup> and...
which are not followed by a > symbol without an opening < symbol before it
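Applied to the sample string from the question, the replacement can be checked like this (a small sketch; the variable names are illustrative):

```php
<?php
$s = '<div>asd® asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>';

// Skip ® when it is directly followed by </sup>, or when the next angle
// bracket is a closing '>' (meaning we are inside a tag, e.g. in an attribute).
$filtered = preg_replace('#®(?!\s*</sup>|[^<]*>)#', '<sup>®</sup>', $s);

echo $filtered, "\n";
```

Only the first ® is wrapped; the one already inside <sup> and the one inside the alt attribute are left alone.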
I would really use an HTML parser in place of regular expressions, since HTML is not regular and will present more edge cases than you can dream of (ignoring your contextual limitations that you've identified above).
You don't say what technology you're using. If you post that up, someone can undoubtedly recommend the appropriate parser.
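Assuming PHP, as the other answers do, a parser-based version might look like the sketch below. It uses the bundled DOM extension, wraps the fragment in a page that declares UTF-8 (an assumption about the input encoding), and only touches text nodes, so attribute values such as alt and title are never modified.

```php
<?php
$original = '<div>asd® asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>';

// Declare the encoding explicitly so libxml does not have to guess it.
$page = '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
      . '</head><body>' . $original . '</body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);  // tolerate sloppy markup
$doc->loadHTML($page);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Text nodes that contain ® and are not already inside a <sup>.
$textNodes = $xpath->query('//body//text()[contains(., "®") and not(ancestor::sup)]');

foreach ($textNodes as $textNode) {
    $fragment = $doc->createDocumentFragment();
    foreach (explode('®', $textNode->nodeValue) as $i => $piece) {
        if ($i > 0) {
            // Every boundary between two pieces was a ® in the original text.
            $fragment->appendChild($doc->createElement('sup', '®'));
        }
        if ($piece !== '') {
            $fragment->appendChild($doc->createTextNode($piece));
        }
    }
    $textNode->parentNode->replaceChild($fragment, $textNode);
}

// Serialize only the children of <body>, i.e. the original fragment.
$body = $doc->getElementsByTagName('body')->item(0);
$result = '';
foreach ($body->childNodes as $child) {
    $result .= $doc->saveHTML($child);
}
echo $result, "\n";
```

The serialized markup may differ cosmetically from the input (for instance, the self-closing slash on img is dropped), which is a normal side effect of round-tripping HTML through a parser.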
Regex is not enough for what you want. First you must write code to identify whether content is the value of an attribute or a text node of an element. Then you must go through all that content and use some replace method. I am not sure what it is in PHP, but in JavaScript it would look something like:
content[i].replace(/\®/g, "<sup>®</sup>");
I agree with Brian that regular expressions are not a good way to parse HTML, but if you must use regular expressions, you could try splitting the string into tokens and then running your regexp on each token.
I'm using preg_split to split the string on HTML tags, as well as on the phrase <sup>®</sup> -- this will leave text that's either not an already superscript ® or a tag as tokens. Then for each token, ® can be replaced with <sup>®</sup>:
$regex = '/(<sup>®<\/sup>|<.*?>)/i';
$original = '<div>asd® asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>';
// we need to capture the tags so that the string can be rebuilt
$tokens = preg_split($regex, $original, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
/* $tokens => Array
(
[0] => <div>
[1] => asd® asdasd. asd
[2] => <sup>®</sup>
[3] => asd
[4] => <img alt="qwe®qwe" />
[5] => </div>
)
*/
foreach ($tokens as &$token)
{
if ($token[0] == "<") continue; // Skip tokens that are tags
$token = str_replace('®', '<sup>®</sup>', $token);
}
unset($token); // break the reference left over from the loop
$tokens = join("", $tokens); // reassemble the string
// $tokens => "<div>asd<sup>®</sup> asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>"
Note that this is a naive approach, and if the output isn't formatted as expected it might not parse like you'd like (again, regular expression is not good for HTML parsing ;) )
I'm trying to make a regex for taking some data out of a table.
the code i've got now is:
<table>
<tr>
<td>quote1</td>
<td>have you trying it off and on again ?</td>
</tr>
<tr>
<td>quote65</td>
<td>You wouldn't steal a helmet of a policeman</td>
</tr>
</table>
This I want to replace by:
quote1:have you trying it off and on again ?
quote65:You wouldn't steal a helmet of a policeman
the code that I already have written is this:
%<td>((?s).*?)</td>%
But now I'm stuck.
If you really want to use regexes (might be OK if you are really really sure your string will always be formatted like that), what about something like this, in your case :
$str = <<<A
<table>
<tr>
<td>quote1</td>
<td>have you trying it off and on again ?</td>
</tr>
<tr>
<td>quote65</td>
<td>You wouldn't steal a helmet of a policeman</td>
</tr>
</table>
A;
$matches = array();
preg_match_all('#<tr>\s+?<td>(.*?)</td>\s+?<td>(.*?)</td>\s+?</tr>#', $str, $matches);
var_dump($matches);
A few words about the regex :
<tr>
then any number of spaces
then <td>
then what you want to capture
then </td>
and the same again
and finally, </tr>
And I use :
? in the regex to match in non-greedy mode
preg_match_all to get all the matches
You then get the results you want in $matches[1] and $matches[2] (not $matches[0]) ; here's the output of the var_dump I used (I've removed entry 0, to make it shorter) :
array
0 =>
...
1 =>
array
0 => string 'quote1' (length=6)
1 => string 'quote65' (length=7)
2 =>
array
0 => string 'have you trying it off and on again ?' (length=37)
1 => string 'You wouldn't steal a helmet of a policeman' (length=42)
You then just need to manipulate this array, with some strings concatenation or the like ; for instance, like this :
$num = count($matches[1]);
for ($i=0 ; $i<$num ; $i++) {
echo $matches[1][$i] . ':' . $matches[2][$i] . '<br />';
}
And you get :
quote1:have you trying it off and on again ?
quote65:You wouldn't steal a helmet of a policeman
Note : you should add some sanity checks (e.g. preg_match_all must not return false, the number of matches must be at least 1, ...)
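Those checks could be sketched roughly as follows. The function name extract_quotes and the choice of throwing RuntimeException are assumptions made for this example; preg_match_all returns the number of full matches, or false on error.

```php
<?php
// Hypothetical helper: returns "key:value" lines, or throws when the
// pattern fails or matches nothing.
function extract_quotes(string $html): array
{
    $pattern = '#<tr>\s+?<td>(.*?)</td>\s+?<td>(.*?)</td>\s+?</tr>#';
    $count = preg_match_all($pattern, $html, $matches);
    if ($count === false) {
        throw new RuntimeException('Regex error, code ' . preg_last_error());
    }
    if ($count === 0) {
        throw new RuntimeException('No table rows matched');
    }
    $lines = [];
    for ($i = 0; $i < $count; $i++) {
        $lines[] = $matches[1][$i] . ':' . $matches[2][$i];
    }
    return $lines;
}

$str = <<<A
<table>
<tr>
<td>quote1</td>
<td>have you trying it off and on again ?</td>
</tr>
<tr>
<td>quote65</td>
<td>You wouldn't steal a helmet of a policeman</td>
</tr>
</table>
A;

$lines = extract_quotes($str);
echo implode("\n", $lines), "\n";
```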
As a side note : using regex to parse HTML is generally not a really good idea ; if you can use a real parser, it should be way safer...
Tim's regex probably works, but you may want to consider using the DOM functionality of PHP instead of regex, as it may be more reliable in dealing with minor changes in the markup.
See the loadHTML method
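A minimal sketch of what that could look like with DOMDocument, using the table from the question; the variable names are illustrative:

```php
<?php
$html = <<<HTML
<table>
<tr>
<td>quote1</td>
<td>have you trying it off and on again ?</td>
</tr>
<tr>
<td>quote65</td>
<td>You wouldn't steal a helmet of a policeman</td>
</tr>
</table>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);  // keep parse warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();

$lines = [];
foreach ($doc->getElementsByTagName('tr') as $row) {
    $cells = $row->getElementsByTagName('td');
    if ($cells->length >= 2) {
        // First cell is the key, second cell is the text.
        $lines[] = trim($cells->item(0)->textContent) . ':' . trim($cells->item(1)->textContent);
    }
}
echo implode("\n", $lines), "\n";
```

Unlike the regex, this keeps working if the cells gain attributes or the whitespace between tags changes.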
As usual, extracting text from HTML and other non-regular languages should be done with a parser - regexes can cause problems here. But if you're certain of your data's structure, you could use
%<td>((?s).*?)</td>\s*<td>((?s).*?)</td>%
to find the two pieces of text. \1:\2 would then be the replacement.
If the text cannot span more than one line, you'd be safer dropping the (?s) bits...
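In PHP the replacement could be sketched with preg_replace as below. Since the pattern only covers the two cells, the surrounding table and tr tags remain after the replacement; stripping them with strip_tags is an extra step assumed here, not part of the regex above.

```php
<?php
$str = <<<A
<table>
<tr>
<td>quote1</td>
<td>have you trying it off and on again ?</td>
</tr>
<tr>
<td>quote65</td>
<td>You wouldn't steal a helmet of a policeman</td>
</tr>
</table>
A;

// Join each pair of cells as "first:second".
$replaced = preg_replace('%<td>((?s).*?)</td>\s*<td>((?s).*?)</td>%', '\1:\2', $str);

// Remove the remaining <table>/<tr> tags and drop the blank lines they leave behind.
$lines = array_values(array_filter(
    array_map('trim', explode("\n", strip_tags($replaced))),
    'strlen'
));
print_r($lines);
```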
Extract the content of each <td>:
preg_match_all("%<td[^>]*>((?s).*?)</td>%", $response, $matches);
var_dump($matches);
Don't use regex, use a HTML parser. Such as the PHP Simple HTML DOM Parser
I'm trying to preg_match_all a date with slashes in it sitting between two HTML tags; however, it's returning null.
Here is the HTML:
<td width='40%' align='right'class='SmallDimmedText'>Last Login: 11/14/2009</td>
Here is my preg_match_all() code
preg_match_all('/<td width=\'40%\' align=\'right\' class=\'SmallDimmedText\'>Last([a-zA-Z0-9\s\.\-\',]*)<\/td>/', $h, $table_content, PREG_PATTERN_ORDER);
where $h is the html above.
what am i doing wrong?
thanks in advance
It (from a quick glance) is because you are trying to match:
Last Login: 11/14/2009
With this regex:
Last([a-zA-Z0-9\s\.\-\',]*)
The regex doesn't contain the required characters of : and / which are included in the text string. Changing the required part of the regex to:
Last([a-zA-Z0-9\s\.\-\',:/]*)
Gives a match
Would it be better to simply use a DOM parser, and then perform the regex on the result of the DOM lookup? It makes for nicer regex...
EDIT
The other issue is that your HTML is:
...40%' align='right'class='SmallDimmedText'>...
Where there is no space between align='right' and class='SmallDimmedText'
However your regex for that section is:
...40%\' align=\'right\' class=\'SmallDimmedText\'>...
Where it is indicated there is a space.
Use a DOM parser. It will save you more headaches caused by subtle bugs than you can count.
Just to give you an idea on how simple it is to parse using Simple HTML DOM.
$html = str_get_html(...);
$elems = $html->find('.SmallDimmedText'); // find() returns an array of elements
if ( count($elems) != 1 ){
throw new Exception('Too many/few elements found');
}
$text = $elems[0]->plaintext;
//parsing here is only an example, but you have removed all
//the html so that any regex used is really simple.
$date = substr($text, strlen('Last Login: '));
$unixTime = strtotime($date);
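If installing a third-party library is not an option, roughly the same idea can be sketched with PHP's bundled DOM extension. The cell from the question is wrapped in a table here so that the fragment is valid on its own; that wrapper and the variable names are assumptions of this example.

```php
<?php
$h = "<table><tr><td width='40%' align='right'class='SmallDimmedText'>"
   . "Last Login: 11/14/2009</td></tr></table>";

$doc = new DOMDocument();
libxml_use_internal_errors(true);  // the sloppy attribute spacing may trigger parse warnings
$doc->loadHTML($h);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Match td elements whose class list contains SmallDimmedText.
$cells = $xpath->query(
    "//td[contains(concat(' ', normalize-space(@class), ' '), ' SmallDimmedText ')]"
);

$date = null;
if ($cells->length === 1
    && preg_match('#\d{1,2}/\d{1,2}/\d{4}#', $cells->item(0)->textContent, $m)) {
    $date = $m[0]; // the regex only has to deal with plain text now
}
var_dump($date);
```

Because the parser deals with the attributes, the missing space between align and class no longer matters to the regex.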
I see at least two problems :
in your HTML string, there is no space between 'right' and class=, and there is one space there in your regex
you must add at least these 3 characters to the list of matched characters, between the [] :
':' (there is one between "Login" and the date),
' ' (there are spaces between "Last" and "Login", and between ":" and the date),
and '/' (between the date parts)
With this code, it seems to work better :
$h = "<td width='40%' align='right'class='SmallDimmedText'>Last Login: 11/14/2009</td>";
if (preg_match_all("#<td width='40%' align='right'class='SmallDimmedText'>Last([a-zA-Z0-9\s\.\-',: /]*)<\/td>#",
$h, $table_content, PREG_PATTERN_ORDER)) {
var_dump($table_content);
}
I get this output :
array
0 =>
array
0 => string '<td width='40%' align='right'class='SmallDimmedText'>Last Login: 11/14/2009</td>' (length=80)
1 =>
array
0 => string ' Login: 11/14/2009' (length=18)
Note I have also used :
# as a regex delimiter, to avoid having to escape slashes
" as a string delimiter, to avoid having to escape single quotes
My first suggestion would be to minimize the amount of text you have in the preg_match_all, why not just do between a ">" and a "<"? Second, I'd end up writing the regex like this, not sure if it helps:
#>.*[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}<#
That will look for the end of one tag, then any character, then a date, then the beginning of another tag.
I agree with Yacoby.
At the very least, remove everything that is HTML-specific and simply make the regex
preg_match_all('#Last Login: ([\d/]+)#', ...
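A small sketch of that simplified approach, with a slightly stricter date pattern and a validation step added as assumptions of this example:

```php
<?php
$h = "<td width='40%' align='right'class='SmallDimmedText'>Last Login: 11/14/2009</td>";

$lastLogin = null;
if (preg_match('#Last Login: (\d{1,2}/\d{1,2}/\d{4})#', $h, $m)) {
    // The leading "!" resets all fields not in the format (e.g. time of day) to zero.
    $parsed = DateTime::createFromFormat('!m/d/Y', $m[1]);
    if ($parsed !== false) {
        $lastLogin = $parsed->format('Y-m-d');
    }
}
var_dump($lastLogin);
```

Using # as the delimiter means the slashes in the date need no escaping.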