preg_replace regex, split string to array - php

I have a string from which I need to split some values into an array. What would be the best approach?
String can look like this:
<span class="17">118</span><span style="display: inline">.</span><span style="display:none"></span>
or
125<span class="17">25</span>354
The rules are:
The string can start with a number, followed by a span or a div
The string can start with a span or a div
The string can end with a number
The string can end with a /span or a /div
The divs/spans can have a style/class
What I need is to separate the string so that I get the individual elements, such as:
0 => 123
1 => <span class="potato">123</span>
2 => <span style="color: black">123</span>
I have tried some custom regex, but regex is not my strong suit:
$pattern = "/<div.(.*?)<\/div>|<span.(.*?)<\/span>/";
// I know it won't detect a number value prior to the div; that's also an issue, even if it worked
I cannot use simple_html_dom; it has to be done with regex.
Splitting the string between every >< might work, but ">(.*?)<" inserts after the < for some reason.

You might get better performance if you just load this string into the DOM and then parse it manually, programming your logic like this:
var el = document.createElement( 'div' );
el.innerHTML = '125<span class="17">25</span>354';
// test your first element (125) index=0 (you can make for loop)
if(el.childNodes[0].nodeType == 3) alert('this is number first, validate it');
else if(el.childNodes[0].nodeType == 1) alert('this is span or div, test it');
// you can test for div or span with el.childNodes[0].nodeName
// store first element to your array
// then continue, test el.childNodes[next one, index=1 (span)...]
// then continue, test el.childNodes[next one, index=2 (354)...]
Since you already know what you are looking for, it can be as simple as that.
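Since the question is about PHP, the same idea can be sketched server-side with the built-in DOMDocument class. This is only a sketch: the wrapper div and the variable names are made up for illustration, and it assumes the input is a reasonably well-formed fragment.

```php
<?php
// Hypothetical input taken from the question.
$input = '125<span class="17">25</span>354';

$dom = new DOMDocument();
// Wrap the fragment so that all pieces share one parent element.
$dom->loadHTML('<div>' . $input . '</div>');
$wrapper = $dom->getElementsByTagName('div')->item(0);

$parts = [];
foreach ($wrapper->childNodes as $node) {
    if ($node->nodeType === XML_TEXT_NODE) {
        // Plain text such as "125": validate it as a number here if needed.
        $parts[] = $node->nodeValue;
    } elseif ($node->nodeType === XML_ELEMENT_NODE) {
        // A span or div: $node->nodeName tells which one it is.
        $parts[] = $dom->saveHTML($node);
    }
}

print_r($parts);
```

Each entry of $parts is then either a bare text run or the serialized markup of one child element, in document order.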

Try /(<(span|div)[^>]*>)*([^<]*)(<\/(span|div)>)*/
The regex says something like: 'there can be a span or div or nothing, then there has to be something, then a /span or /div or nothing', and that whole statement can match zero or many times.
Here is an example:
<?php
$pattern = "/(<(span|div)[^>]*>)*([^<]*)(<\/(span|div)>)*/";
$txt = '<span class="17">118</span><span style="display: inline">.</span><span style="display:none"></span>';
preg_match_all($pattern, $txt,$foo);
print_r($foo[0]);
$txt = '125<span class="17">25</span>354';
preg_match_all($pattern, $txt,$foo);
print_r($foo[0]);
?>
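Because every part of the pattern above is optional, it can also match the empty string, so the result may contain empty entries. A possible alternative, offered as a sketch rather than a definitive fix, is to match either one complete span/div element or one run of plain text. It assumes that elements with the same tag name are never nested inside each other; the variable names are illustrative.

```php
<?php
// Either a complete span/div element (same tag name on both ends)
// or a run of plain text that contains no "<".
$pattern = '~<(span|div)\b[^>]*>.*?</\1>|[^<]+~s';

$samples = [
    '<span class="17">118</span><span style="display: inline">.</span><span style="display:none"></span>',
    '125<span class="17">25</span>354',
];

$results = [];
foreach ($samples as $txt) {
    preg_match_all($pattern, $txt, $m);
    $results[] = $m[0]; // $m[0] holds the full matches, in order
    print_r($m[0]);
}
```

For the second sample this yields the three pieces 125, the span element, and 354.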

Related

HTML output based in input number in PHP

I have values like so:
0.00000500
0.00003491
0.00086583
1.45304093
etc
I would like to run these through a PHP function so they become:
<span class="text-muted">0.00000</span>500
<span class="text-muted">0.0000</span>3491
<span class="text-muted">0.000</span>86583
1.45304093
What I have now is:
$input_number = str_replace('0', '<span class="text-muted">0</span>', $input_number);
$input_number = str_replace('.', '<span class="text-muted">.</span>', $input_number);
This is a bit 'aggressive' as it would replace every character instead of using the <span> once, but I guess that's OK, even if I have say 1000 numbers on a page. But the biggest problem I have is that my code would also 'mute' the last 2 digits in 0.00000500 which I don't want.

First of all:
$input_number = str_replace('0.', '', $input_number);
This replaces 0. with an empty string.
Secondly, use preg_replace():
$newNumber = preg_replace('/^0?/','<span class="text-muted">0</span>',$input_number);
Here /^0?/ matches at most one 0 at the very start of the string and replaces it with the span; you can also replace it with an empty string or anything else you want.
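To get exactly the output listed in the question, one option is a single preg_replace() call that wraps the leading "0." together with the zeros that directly follow it. The helper name muteLeadingZeros is made up for this sketch, and it assumes the numbers arrive as strings in the format shown.

```php
<?php
// Hypothetical helper: wrap the leading "0." and any zeros right after it
// in one muted span; numbers that do not start with "0." are left alone.
function muteLeadingZeros(string $number): string
{
    // $0 in the replacement stands for the whole matched prefix.
    return preg_replace('/^0\.0*/', '<span class="text-muted">$0</span>', $number);
}

foreach (['0.00000500', '0.00003491', '0.00086583', '1.45304093'] as $n) {
    echo muteLeadingZeros($n), "\n";
}
```

Because the pattern is anchored with ^, the trailing zeros in 0.00000500 are never touched.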

how to do echo from a string, only from values that are between a specific stretch[href tag] of the string?

[PHP] I have a variable that stores a string (the source code of a very large page). I want to echo only the interesting substrings (dozens of them, which I need to extract for a project); they sit inside the quotation marks of the href attribute of <a> tags,
but I only want to capture the values that start with the letter n (news):
[<a href="/news7044449/exclusive_news_sunday_"]
<a href="/n[ews7044449/exclusive_news_sunday_]"
So I think the match will have to be anchored on: [a href="/n]
How can I make the echo drop all the other text in the variable and show only those values?
Note that there are other href attributes whose values start with other letters, such as 'p': href="/profiles... (these do not interest me).
$string = '</div><span class="news-hd-mark">HD</span></div><p>exclusive_news_sunday_</p><p class="metadata"><span class="bg">Czech AV<span class="mobile-hide"> - 5.4M Views</span>
- <span class="duration">7 min</span></span></p></div><script>xv.thumbs.preparenews(7044449);</script>
<div id="news_31720715" class="thumb-block "><div class="thumb-inside"><div class="thumb"><a href="/news31720715/my_sister_running_every_single_morning"><img src="https://static-hw.xnewss.com/img/lightbox/lightbox-blank.gif"';
I imagine something like this:
$removes_everything_except_values_from_the_href_tag_starting_with_the_letter_n = ('/something regex expresion I think /' or preg_match, substring?);
echo $string = str_replace($removes_everything_except_values_from_the_href_tag_starting_with_the_letter_n,'',$string);
expected output: /news7044449/exclusive_news_sunday_
NOTE: the source does not have to be a variable; the text to extract from can also come from a .txt file.
Thanks.

I believe this will help.
<?php
$source = file_get_contents("code.html");
preg_match_all("/<a href=\"(\/n(?:.+?))\"[^>]*>/", $source, $results);
var_export( end($results) );

To get just the links out of the $results array from Valdeir's answer, loop over the first capture group:
foreach ($results[1] as $r) {
// display each link with an HTML break tag after it
echo $r."<br>\n";
}
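The extraction can be tried without a file by running it on an inline string. The sketch below uses a shortened, made-up stand-in for the page source and a slightly stricter pattern in which [^"]* cannot run past the closing quote of the attribute; it assumes href is the first attribute of the tag, as in the sample from the question.

```php
<?php
// Shortened, hypothetical stand-in for the page source from the question.
$source = '<a href="/profiles/someone">x</a>'
        . '<a href="/news7044449/exclusive_news_sunday_"><img src="a.gif"></a>'
        . '<a href="/news31720715/my_sister_running_every_single_morning">y</a>';

// Capture only href values that start with "/n".
preg_match_all('~<a\s+href="(/n[^"]*)"~', $source, $results);

// $results[1] holds the captured href values, one per match.
foreach ($results[1] as $link) {
    echo $link, "\n";
}
```

The /profiles link is skipped because its value does not start with /n.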

preg_replace : getting a html tag inside an other html tag from BBCode

So I'm trying to make a php function to get HTML tags from a BBCode-style form. The fact is, I was able to get tags pretty easily with preg_replace. But I have some troubles when I have a bbcode inside the same bbcode...
Like this :
[blue]My [black]house is [blue]very[/blue] beautiful[/black] today[/blue]
So, when I "parse" it, I always end up with leftover bbcode for the blue tags. Something like:
My house is [blue]very[/blue] beautiful today
Everything is colored except for the blue-tag inside the black-tag inside the first blue-tag.
How the hell can I do that ?
For more information, here is what I tried:
Regex: "/\[blue\](.*)\[\/blue\]/si" or "/\[blue\](.*)\[\/blue\]/i"
Getting : "My house is [blue]very[/blue] beautiful today"
Regex : "/\[blue\](.*?)\[\/blue\]/si" or "/\[blue\](.*)\[\/blue\]/Ui"
Getting : "My house is [blue]very beautiful today[/blue]"
Do I have to loop the preg_replace? Isn't there a way to do it, regex-style, without looping?
Thanks for your help. :)

It is right that you should not reinvent the wheel on products and rather choose well-tested plugins. However, if you are experimenting or working on pet projects, by all means, go ahead and experiment with things, have fun and obtain important knowledge in the process.
With that said, you may try the following regex. I'll break it down for you below.
(\[(.*?)\])(.*)(\[/\2\])
Philosophy
While parsing markup like this, what you are actually seeking is to match tags with their pairs.
So, a clean approach would be to run a loop, capturing the outermost tag pair each time and replacing it.
With the regex above, the capture groups give you the following info on the first pass:
Opening tag (complete) [blue]
Opening tag (tag name) blue
Content between opening and closing tag My [black]house is [blue]very[/blue] beautiful[/black] today
Closing tag [/blue]
So, you can use $2 to determine the tag you are processing, and replace it with
<tag>$3</tag>
// or even
<$2>$3</$2>
Which will give you;
// in first iteration
<tag>My [black]house is [blue]very[/blue] beautiful[/black] today</tag>
// in second iteration
<tag>My <tag2>house is [blue]very[/blue] beautiful</tag2> today</tag>
// in third iteration
<tag>My <tag2>house is <tag3>very</tag3> beautiful</tag2> today</tag>
Code
$text = "[blue]My [black]house is [blue]very[/blue] beautiful[/black] today[/blue]";
function convert($input)
{
    $control = $input;
    while (true) {
        // replace the outermost tag pair found in this pass
        $input = preg_replace('~(\[(.*?)\])(.*)(\[/\2\])~s', '<$2>$3</$2>', $input);
        // stop once a pass no longer changes anything
        if ($control == $input) {
            break;
        }
        $control = $input;
    }
    return $input;
}
echo convert($text);
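A variation on the same loop, sketched here under a few assumptions, works from the inside out and only accepts a fixed list of tag names, so arbitrary bracketed text is never turned into an HTML tag. The function name bbcodeToHtml, the two colour names, and the span markup are choices made for this example only.

```php
<?php
// Hypothetical helper: convert [blue]/[black] pairs into coloured spans.
function bbcodeToHtml(string $text): string
{
    // Match a pair whose content contains no further opening or closing
    // colour tag, i.e. the innermost pairs first.
    $pattern = '~\[(blue|black)\]((?:(?!\[/?(?:blue|black)\]).)*)\[/\1\]~s';

    do {
        // $count receives the number of replacements made in this pass.
        $text = preg_replace($pattern, '<span style="color:$1">$2</span>', $text, -1, $count);
    } while ($count > 0);

    return $text;
}

echo bbcodeToHtml('[blue]My [black]house is [blue]very[/blue] beautiful[/black] today[/blue]');
```

Each pass resolves one nesting level, so the loop runs once per level plus one final pass that finds nothing left to replace.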

As others mentioned, don't try to reinvent the wheel.
However, you could use a recursive approach:
<?php
$text = "[blue]My [black]house is [blue]very[/blue] beautiful[/black] today[/blue]";
$regex = '~(\[ ( (?>[^\[\]]+) | (?R) )* \])~x';
$replacements = array( "blue" => "<bleu>",
"black" => "<noir>",
"/blue" => "</bleu>",
"/black" => "</noir>");
$text = preg_replace_callback($regex,
function($match) use ($replacements) {
return $replacements[$match[2]];
},
$text);
echo $text;
# <bleu>My <noir>house is <bleu>very</bleu> beautiful</noir> today</bleu>
?>
Here, every colour tag is replaced by its French (just made it up) counterpart, see a demo on ideone.com. To learn more about recursive patterns, have a look at the PHP documentation on the subject.

get information from a website with php, Recursively traverse each HTML node

I have done a lot of research but haven't found an answer. I am trying to get some information from a webpage which has the following HTML structure:
<div id="xxx" class="some1">
<h1>This is the time</h1>
<div class="ti12">
<div class="sss"></div>
<div class="sss">
<span class="hhh">
<div class="sded">
City:
<span class="sh">CCC</span>
</div>
</span>
</div>
</div>
.
.
.
<div class="pp12"></div>
</div>
Now, what I am trying to do is fetch the name of the city, and similarly other information in the same way.
I have to find this information in the code above:
$arr=array('City', 'Name', 'Address', 'DOB');
If an item exists, fetch its value; otherwise leave it blank.
I hope I am clear.
This is the code I tried:
<?php
include "simple_html_dom.php";
$html = new simple_html_dom();
$listItem = array('City', 'Name', 'Address', 'DOB');
$html->load_file('simp.html');
$found=array();
foreach($listItem as $item){
$ret = $html->find('div[id=xxx] div',0);
iterateParentNode($ret, $item);
}
function iterateParentNode($ret1, $item1){
for ($node=0;$node < count($ret1->children());$node++){
$child=$ret1->children($node);
echo count($ret1->children())."<br/>";
if(count($ret1->children())==1 && strpos($child, '<span class="sh"')!==false ){
$found[$item1]=$ret1->find('span[class=sh]',0)->plaintext;
return true;
}else{
goThroughChildNode($child, $item1);
}
}
}
function goThroughChildNode($child1, $item2){
echo $child1."ITEM:".$item2;
if(strpos($child1, $item2)!==false){
iterateParentNode($child1, $item2);
}else{
return false ;
}
return true;
}
foreach ($found as $structure=>$data){
echo $structure."=>".$data."<br />";
}
?>
I know my PHP approach is not good, so please suggest a better approach, taking my PHP code into account.

One alternative to manual traversal is querying for the data instead. In DOMDocument this is commonly done with XPath, a language dedicated to exactly that job.
The library you use does not support XPath; however, PHP supports it out of the box. PHP also supports DOMDocument out of the box, so I think I can safely suggest that as an alternative.
So in your case you are first looking for the div with the ID:
//div[@id="xxx"]
and then inside a div in there somewhere:
//div
and then you want another element in there, of no specific name (children):
//*
but those need to match a specific pattern: Here, containing a span with a class attribute having "sh", it must be the first span in there and before the span there must be some text:
[
span[@class="sh"]
and span = span[@class="sh"]
and span/preceding-sibling::text()
]
and of that child you want the first text node child:
/text()[1]
So just to see this at a glance:
//div[@id="xxx"]
//div
//*[
span[@class="sh"]
and span = span[@class="sh"]
and span/preceding-sibling::text()
]
/text()[1]
This will give you the named string like "City:" and so on. The next sibling (span) then will contain the value.
All you've got to do is wrap that into code (here I load a string, but you can also load an HTML file with loadHTMLFile(); check the DOMDocument link above for all the gory details):
$dom = new DOMDocument();
$dom->loadHTML($string);
$xp = new DOMXPath($dom);
foreach ($xp->query('
//div[@id="xxx"]
//div
//*[
span[@class="sh"]
and span = span[@class="sh"]
and span/preceding-sibling::text()
]
/text()[1]
'
) as $node
) {
$name = trim($node->nodeValue);
$value = trim($node->nextSibling->nodeValue);
printf("%s %s\n", $name, $value);
}
The output with your example HTML:
City: CCC
I hope this can motivate you to look into DOMDocument and helps you to explore the power of XPath.
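To connect this back to the list of wanted fields in the question, the lookup can fill an array keyed by label and leave missing entries blank. This is a minimal sketch: the sample markup is a simplified stand-in for the page (the Name entry is invented for illustration), and it assumes every value sits in a span with class "sh" directly preceded by its "Label:" text.

```php
<?php
// Simplified, partly hypothetical sample of the page structure.
$html = <<<'HTML'
<div id="xxx" class="some1">
  <h1>This is the time</h1>
  <div class="ti12">
    <div class="sss"><div class="sded">City: <span class="sh">CCC</span></div></div>
    <div class="sss"><div class="sded">Name: <span class="sh">NNN</span></div></div>
  </div>
</div>
HTML;

$wanted = ['City', 'Name', 'Address', 'DOB'];
$found  = array_fill_keys($wanted, ''); // blank by default

$dom = new DOMDocument();
$dom->loadHTML($html);
$xp = new DOMXPath($dom);

foreach ($xp->query('//div[@id="xxx"]//span[@class="sh"]') as $span) {
    // The label is the text node right before the span, e.g. "City: ".
    $label = rtrim(trim($xp->evaluate('string(preceding-sibling::text()[1])', $span)), ':');
    if (array_key_exists($label, $found)) {
        $found[$label] = trim($span->nodeValue);
    }
}

print_r($found);
```

Labels that are not on the page, such as Address and DOB here, simply keep their empty default.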

It would probably be simplest to do this with a regex. Of course, it will break if the HTML structure changes.
// ereg() was removed in PHP 7; preg_match() with the s modifier lets . match newlines
if (preg_match('~<div.*?h1>(.*?)</h1>.*?City:.*?>(.*?)<~s', $input, $regs)) {
$title = $regs[1];
$city = $regs[2];
} else {
$title = "";
$city = "";
}
/*
Match 1 of 1
Matched text: <div id="xxx" class="some1">
<h1>This is the time</h1>
<div class="ti12">
<div class="sss"></div>
<div class="sss">
<span class="hhh">
<div class="sded">
City:
<span class="sh">CCC<
Match offset: 0
Match length: 282
Group 1: This is the time
Group 1 offset: 42
Group 1 length: 16
Group 2: CCC
Group 2 offset: 278
Group 2 length: 3
*/
// <div.*?h1>(.*?)</h1>.*?City:.*?>(.*?)<
//
// Match the characters "<div" literally «<div»
// Match any single character «.*?»
// Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
// Match the characters "h1>" literally «h1>»
// Match the regular expression below and capture its match into backreference number 1 «(.*?)»
// Match any single character «.*?»
// Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
// Match the characters "</h1>" literally «</h1>»
// Match any single character «.*?»
// Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
// Match the characters "City:" literally «City:»
// Match any single character «.*?»
// Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
// Match the character ">" literally «>»
// Match the regular expression below and capture its match into backreference number 2 «(.*?)»
// Match any single character «.*?»
// Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
// Match the character "<" literally «<»

It took me a while to get it right, but this code traverses the entire DOM with Simple HTML Dom. Hope someone can use it.
<?php
$html = new simple_html_dom();
$html->load('<html><body>'.$text.'</body></html>');
if(method_exists($html,"childNodes")){
if($html->find('html')) {
//IF NOT OK, THROW ERROR
}}
$e=$html->find('body',0);
$p=$e->childNodes(0);
if(!$p){
//BODY HAS NO CHILDNODES< THROW ERROR
}
$loop=true;
//SAFEGUARD, PREVENTS INDEFINITE LOOPS
$i=$j=0;
$i_max=500;
$j_max=500;
while($loop==true){
//SAFEGUARD, PREVENTS INDEFINITE LOOPS
$i++;if($i>$i_max){$loop=false;break;}
//TEST IF NODE HAS CHILDREN
$p=$e->childNodes(0);
//NO CHILDREN
if(!$p){
//DO SOMETHING WITH NODE
clean_dom($e->outertext);
//TEST IF NODE HAS SIBLING
$p=$e->next_sibling();
if(!$p){
//NO SIBLING
//TEST THE PARENT, LOOP TILL WE FIND A SIBLING
$j=0;$sib_loop=true;
while($sib_loop==true){
//SAFEGUARD, PREVENTS INDEFINITE LOOPS
$j++;if($j>$j_max){$sib_loop=false;break;}
//TEST IF THERE IS A PARENT
$e=$e->parent();
//NO PARENT, WE'VE REACHED THE TOP AGAIN
if(!$e){
echo'***THE END***';
$sib_loop=$loop=false;break;}
//ELSE, TEST IF PARENT HAS SIBLING
$p=$e->next_sibling();
//THERE IS A SIBLING, GO THERE
if($p){
//DO SOMETHING WITH THIS NODE
clean_dom($e->outertext);
$e=$e->next_sibling();
$sib_loop=false;break;
}
else{
$ret=clean_dom($e->outertext,$all);
$e->outertext=$ret;
}
}
}
else{
//GOTO SIBLING
$e=$e->next_sibling();
}
}
else{
//THERE IS A CHILD
$e=$e->childNodes(0);
}
}
$text=$html->save();
$html->clear();
unset($html);
function clean_dom($e){
//DO SOMETHING HERE
}

php preg_match_all html dates with slashes error

I'm trying to preg_match_all a date with slashes in it, sitting between two HTML tags; however, it's returning null.
Here is the HTML:
<td width='40%' align='right'class='SmallDimmedText'>Last Login: 11/14/2009</td>
Here is my preg_match_all() code:
preg_match_all('/<td width=\'40%\' align=\'right\' class=\'SmallDimmedText\'>Last([a-zA-Z0-9\s\.\-\',]*)<\/td>/', $h, $table_content, PREG_PATTERN_ORDER);
where $h is the HTML above.
What am I doing wrong?
Thanks in advance.

It (from a quick glance) is because you are trying to match:
Last Login: 11/14/2009
With this regex:
Last([a-zA-Z0-9\s\.\-\',]*)
The regex doesn't contain the required characters of : and / which are included in the text string. Changing the required part of the regex to:
Last([a-zA-Z0-9\s\.\-\',:/]*)
Gives a match
Would it be better to simply use a DOM parser, and then perform the regex on the result of the DOM lookup? It makes for a nicer regex...
EDIT
The other issue is that your HTML is:
...40%' align='right'class='SmallDimmedText'>...
Where there is no space between align='right' and class='SmallDimmedText'
However your regex for that section is:
...40%\' align=\'right\' class=\'SmallDimmedText\'>...
Where it is indicated there is a space.
Use a DOM parser. It will save you more headaches caused by subtle bugs than you can count.
Just to give you an idea of how simple it is to parse using Simple HTML DOM:
$html = str_get_html(...);
// find() without an index returns an array of matching elements
$elems = $html->find('.SmallDimmedText');
if ( count($elems) != 1 ){
throw new Exception('Too many/few elements found');
}
$text = $elems[0]->plaintext;
//parsing here is only an example, but you have removed all
//the html so that any regex used is really simple.
$date = substr($text, strlen('Last Login: '));
$unixTime = strtotime($date);

I see at least two problems :
in your HTML string, there is no space between 'right' and class=, and there is one space there in your regex
you must add at least these 3 characters to the list of matched characters, between the [] :
':' (there is one between "Login" and the date),
' ' (there are spaces between "Last" and "Login", and between ":" and the date),
and '/' (between the date parts)
With this code, it seems to work better :
$h = "<td width='40%' align='right'class='SmallDimmedText'>Last Login: 11/14/2009</td>";
if (preg_match_all("#<td width='40%' align='right'class='SmallDimmedText'>Last([a-zA-Z0-9\s\.\-',: /]*)<\/td>#",
$h, $table_content, PREG_PATTERN_ORDER)) {
var_dump($table_content);
}
I get this output :
array
0 =>
array
0 => string '<td width='40%' align='right'class='SmallDimmedText'>Last Login: 11/14/2009</td>' (length=80)
1 =>
array
0 => string ' Login: 11/14/2009' (length=18)
Note I have also used :
# as a regex delimiter, to avoid having to escape slashes
" as a string delimiter, to avoid having to escape single quotes

My first suggestion would be to minimize the amount of text you have in the preg_match_all: why not just match between a ">" and a "<"? Second, I'd end up writing the regex like this (with # as the delimiter so the slashes need no escaping); not sure if it helps:
#>.*[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}<#
That will look for the end of one tag, then any character, then a date, then the beginning of another tag.

I agree with Yacoby.
At the very least, remove all references to the HTML specifics and simply make the regex
preg_match_all('#Last Login: ([\d+/?]+)#', ...
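Put together, a stripped-down version might look like the sketch below. It matches only the label and the date, then turns the captured text into a DateTime object. The stricter digit pattern and the month/day/year order are assumptions based on the single sample value 11/14/2009.

```php
<?php
$h = "<td width='40%' align='right'class='SmallDimmedText'>Last Login: 11/14/2009</td>";

$login = null;
// # as the delimiter, so the slashes in the date need no escaping.
if (preg_match('#Last Login:\s*(\d{1,2}/\d{1,2}/\d{4})#', $h, $m)) {
    // "!" resets all fields that are not in the format, so the time is 00:00:00.
    // Assumes month/day/year order.
    $login = DateTime::createFromFormat('!m/d/Y', $m[1]);
}

echo $login ? $login->format('Y-m-d') : 'no date found', "\n";
```

Since the pattern no longer mentions the td attributes, the missing space between align and class stops mattering.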
