PHP Custom Markup Language Parser

I am making a website, and I would like to make a custom Markup type language in PHP. I want the tags to be surrounded with [ and ]. Now, I was thinking about this, like anyone would, and I could do something like this:
function formatMarkup($markup = ''){
$markup = str_replace('[color=blue]', '<span style="color: blue">', $markup);
return $markup;
}
Even though that might work, it would be more programmatically correct to do something like explode(), but starting at every [ and ending at every ]. It would be great if I could find out how. Thank you for your time and effort.
EDIT:
I have decided to use preg_split(). It seems nice, and all, but I cannot get the regex. Here is my code.
EDIT #2:
I have got most of the regex done, but there are unneeded extra keys in the array. How would I fix them? Here is my new code.

I have made my Markup language. I used
$split = preg_split("/(\[|\])/", $markup);
to get the individual "tags" and used
foreach($split as $k => $v){
if(strlen($v) < 1){
continue; // skip the empty strings preg_split() leaves between adjacent delimiters
}
// ... checks for each tag or text chunk go here ...
}
to iterate through each of them and skip any empty values. After that, I do all of my checks, parse the code blocks together, and rebuild the text line by line.
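A minimal sketch of the same idea, assuming the goal is to keep each [tag] as its own token. preg_split() with PREG_SPLIT_DELIM_CAPTURE keeps the bracketed pieces, and PREG_SPLIT_NO_EMPTY drops the empty entries, which may also be a way around the unneeded extra keys mentioned in the question. The tag table and the decision to escape unknown tags are assumptions made for illustration, not part of the original code.

```php
<?php
function formatMarkup(string $markup): string
{
    // Hypothetical tag table: markup tag => HTML replacement.
    $tagMap = [
        'color=blue' => '<span style="color: blue">',
        '/color'     => '</span>',
    ];

    // Capture "[...]" groups as their own tokens and drop empty strings.
    $tokens = preg_split(
        '/(\[[^\[\]]*\])/',
        $markup,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );

    $html = '';
    foreach ($tokens as $token) {
        // A token is a tag when it starts with "[" and ends with "]".
        $isTag = strlen($token) >= 2 && $token[0] === '[' && substr($token, -1) === ']';
        $tag = $isTag ? substr($token, 1, -1) : null;

        if ($tag !== null && isset($tagMap[$tag])) {
            $html .= $tagMap[$tag];            // known tag: emit its HTML
        } else {
            $html .= htmlspecialchars($token); // plain text or unknown tag: escape it
        }
    }
    return $html;
}

echo formatMarkup('Hello [color=blue]world[/color]!'), "\n";
```

Unknown tags are passed through as escaped text here; a real parser might prefer to reject them or track open and close tags so they always balance.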

Related

PHP preg_replace inside for loop

I'm currently trying out this PHP preg_replace function and I've run into a small problem. I want to replace all the tags with a div with an ID, unique for every div, so I thought I would add it into a for loop. But in some strange way, it only does the first line and gives it an ID of 49, which is the last ID they can get. Here's my code:
$res = mysqli_query($mysqli, "SELECT * FROM song WHERE id = 1");
$row = mysqli_fetch_assoc($res);
mysqli_set_charset("utf8");
$lyric = $row['lyric'];
$lyricHTML = nl2br($lyric);
$lines_arr = preg_split('[<br />]',$lyricHTML);
$lines = count($lines_arr);
for($i = 0; $i < $lines; $i++) {
$string = preg_replace(']<br />]', '</h4><h4 id="no'.$i.'">', $lyricHTML, 1);
echo $i;
}
echo '<h4>';
echo $string;
echo '</h4>';
How it works is that I have a large amount of text in my database, and when I add it into the lyric variable, it's just plain text. But when I nl2br it, it gets a <br /> after every line, which I use here. I get the number of lines by using the little "lines_arr" method as you can see, and then basically iterate in a for loop.
The only problem is that it only outputs on the first line and gives that an ID of 49. When I move it outside the for loop and remove the limit, it works and all lines get an <h4> around them, but then I don't get the unique ID I need.
This is some text I pulled out from the database
Mama called about the paper turns out they wrote about me
Now my broken heart´s the only thing that's broke about me
So many people should have seen what we got going on
I only wanna put my heart and my life in songs
Writing about the pain I felt with my daddy gone
About the emptiness I felt when I sat alone
About the happiness I feel when I sing it loud
He should have heard the noise we made with the happy crowd
Did my Gran Daddy know he taught me what a poem was
How you can use a sentence or just a simple pause
What will I say when my kids ask me who my daddy was
I thought about it for a while and I'm at a loss
Knowing that I´m gonna live my whole life without him
I found out a lot of things I never knew about him
All I know is that I´ll never really be alone
Cause we gotta lot of love and a happy home
And my goal is to give every line an <h4 id="no1">TEXT</h4> for example, and the number after no, like no1 or no4 should be incremented every iteration, that's why I chose a for-loop.
Looks like you need to escape your regexp
preg_replace('/<br \/>/', ...);
Really though, this is a classic XY Problem. Instead of asking us how to fix your solution, you should ask us how to solve your problem.
Show us some example text in the database and then show us how you would like it to be formatted. It's very likely there's a better way.
I would use array_walk for this.
$lines = preg_split("/[\r\n]+/", $row['lyric']);
array_walk($lines, function(&$line, $idx) {
$line = sprintf('<h4 id="no%d">%s</h4>', $idx+1, $line);
});
echo implode("\n", $lines);
Output
<h4 id="no1">Mama called about the paper turns out they wrote about me</h4>
<h4 id="no2">Now my broken heart's the only thing that's broke about me</h4>
<h4 id="no3">So many people should have seen what we got going on</h4>
...
<h4 id="no16">Cause we gotta lot of love and a happy home</h4>
Explanation of solution
nl2br doesn't really help us here. It inserts <br /> before each newline, but then we'd just end up splitting the string on the br. We might as well split on the newlines to start with. I'm going to use /[\r\n]+/ because it splits on any run of \r and \n characters, so it handles \r, \n, and \r\n line endings.
$lines = preg_split("/[\r\n]+/", $row['lyric']);
Now we have an array of strings, each containing one line of lyrics. But we want to wrap each string in an <h4 id="noX">...</h4> where X is the number of the line.
Ordinarily we would use array_map for this, but the array_map callback does not receive an index argument. Instead we will use array_walk which does receive the index.
One more note about this line: the callback parameter is &$line. This allows us to alter the contents of $line and have it "saved" in our original $lines array. (See Example #1 in the PHP docs to compare the difference.)
array_walk($lines, function(&$line, $idx) {
Here's where the h4 comes in. I use sprintf for formatting HTML strings because I think they are more readable. And it allows you to control how the arguments are output without adding a bunch of view logic in the "template".
Here's the world's tiniest template: '<h4 id="no%d">%s</h4>'. It has two inputs, %d and %s. The first will be output as a number (our line number), and the second will be output as a string (our lyrics).
$line = sprintf('<h4 id="no%d">%s</h4>', $idx+1, $line);
Close the array_walk callback function
});
Now $lines is an array of our newly-formatted lyrics. Let's output the lyrics by separating each line with a \n.
echo implode("\n", $lines);
Done!
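As a variation on the walkthrough above, array_map can still be used if the keys are passed in as a second array, since array_map accepts several arrays and hands one element of each to the callback. The sketch below also escapes each line with htmlspecialchars, which the original answer does not do. The sample text stands in for $row['lyric'] and is an assumption for illustration.

```php
<?php
// Sample lyric text standing in for $row['lyric'] (made up for this sketch).
$lyric = "First line\r\nSecond <line>\nThird line";

// Split on any run of \r and \n, dropping empty entries.
$lines = preg_split("/[\r\n]+/", $lyric, -1, PREG_SPLIT_NO_EMPTY);

// Passing the keys alongside the values gives the callback the index.
$wrapped = array_map(
    function (int $idx, string $line): string {
        return sprintf('<h4 id="no%d">%s</h4>', $idx + 1, htmlspecialchars($line));
    },
    array_keys($lines),
    $lines
);

echo implode("\n", $wrapped), "\n";
```

Unlike array_walk, this leaves $lines untouched and returns a new array, which can be handy if the raw lines are needed again later.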
If each line of your text in the DB is on its own line, why not just explode it on the \n character?
Always try to find a solution without the preg set of functions first, because they are heavier than plain string functions.
I would go like this:
$lyric = $row['lyric'];
$lyrics = explode("\n", $lyric);
$lyricsHtml = [];
$i=0;
foreach($lyrics as $val){
$i++;
$lyricsHtml[] = '<h4 id="no'.$i.'">'.$val.'</h4>';
}
$lyricsHtml = implode("\n",$lyricsHtml);
Another way, with preg_replace_callback:
$id = 0;
$lyric = preg_replace_callback('~(^)|$~m',
function ($m) use (&$id) {
return (isset($m[1])) ? '<h4 id="no' . ++$id . '">' : '</h4>'; },
$lyric);

XPath in PHP: Get all text nodes, except navigation

I’m writing a custom parser/data extractor for some pretty shitty HTML.
Changing the HTML is out of the question.
I will spare you the details of the hoops I’ve had to jump through but I’ve now come pretty close to my original goal. I’m using a combination of DOMDocument getElementByName, regular expression replace (I know, I know...), and XPath queries.
I need to get all the text out of the body of the document. I would like for the navigation to remain a separate entity, at least in the abstract. Here’s what I’m doing now:
$contentnodes = $xpath->query("//body//*[not(self::a)]/text()|//body//ul/li/a");
foreach ($contentnodes as $contentnode) {
$type = $contentnode->nodeName;
$content = $contentnode->nodeValue;
$output[] = array( $type, $content);
}
This works, except that of course it treats all of the links on the page differently, and I only want it to do that to the navigation.
What XPath syntax can I use so that, in the first part of that query, before the |, I tell it to get all the text nodes of body’s children except ul > li > a.
Please note that I cannot rely on the presence of p tags or h1 tags or anything sensible like that to make educated guesses about content.
Thanks
Update: @hr_117’s answer below works. I’ve also found that you can use multiple not statements like so:
//body//text()[not(parent::a/parent::li/parent::ul)][not(parent::h1)]
You may try something like this:
//body//text()[not(parent::a/parent::li/parent::ul)]|//body//ul/li/a
//body//*[not(self::a/parent::li/parent::ul)]/text()[normalize-space()]|//body//ul/li/a
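A small sketch of how the first query above might be exercised from PHP. The HTML is made up for illustration, and the loop mirrors the one in the question; it is not the asker's real page, so the results on messy real-world markup may differ.

```php
<?php
// Made-up HTML standing in for the real page (an assumption for illustration).
$html = '<html><body>'
      . '<ul><li><a href="/home">Home</a></li><li><a href="/about">About</a></li></ul>'
      . '<h1>Title</h1>'
      . '<div>Some text with <a href="/x">an inline link</a>.</div>'
      . '</body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Text nodes everywhere except inside ul > li > a, plus the navigation links themselves.
$query = '//body//text()[not(parent::a/parent::li/parent::ul)]|//body//ul/li/a';

$output = [];
foreach ($xpath->query($query) as $node) {
    $output[] = [$node->nodeName, $node->nodeValue];
}
print_r($output);
```

Navigation links come back as a elements, while the inline link inside the div is returned as an ordinary #text node, which is the separation the question asks for.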

How to do this string replacement in PHP

I have a string and within that string are some links of the format
Text
I want to replace that entire section with a different piece of markup
The problem is that while I can get the overall structure of the markup to be replaced, and also the URL, it's not so easy for me to get the "Text". If I knew the entire link then I might do something like:
str_replace( $each_link , $my_new_markup , $the_original_string );
and iterate through each link, but I can't, because I can't know exactly what $each_link is going to be.
Is there any way to look for something like this? I am thinking it must have something to do with REGEX but I am totally hopeless at it, and I don't even know if that's the right place to start.
[WILDCARD of some kind]
You could look at a class like this, Simple HTML DOM Parser that you can use to cycle through elements searching for a specific inner html or other attribute and then change it.
Code looking something like this
foreach($html->find('a') as $element) {
if ($element->innertext == $needle) {
$element->innertext = $my_new_markup;
}
}
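If the links all share one simple, predictable shape, a regex with capture groups can act as the wildcard the question asks about. The sketch below assumes links of the form <a href="URL">Text</a>; that format, the sample string, and the replacement markup are all assumptions for illustration. For anything messier, a DOM parser as suggested above is the safer route.

```php
<?php
// Hypothetical input (the real link format is an assumption).
$original = 'See <a href="http://example.com/a">First</a> and '
          . '<a href="http://example.com/b">Second</a>.';

$result = preg_replace_callback(
    '~<a\s+href="([^"]*)"[^>]*>(.*?)</a>~is',
    function (array $m): string {
        // $m[1] is the URL, $m[2] is the link text (the "wildcard" part).
        return sprintf('<span class="link" data-url="%s">%s</span>', $m[1], $m[2]);
    },
    $original
);

echo $result, "\n";
```

The (.*?) group is non-greedy, so each match stops at the first closing </a> rather than swallowing everything up to the last one.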

Is regex the right tool to find a line of HTML?

I have a PHP script that pulls some content off of a server, but the problem is that the line on which the content is changes every day, so I can't just pull a specific line. However, the content is contained within a div that has a unique id. Is it possible (and is it the best way) for regex to search for this unique id and then pass the line of which it's on back to my script?
Example:
HTML file:
<html><head><title>Example</title></head>
<body>
<div id="Alpha"> Blah blah blah </div>
<div id="Beta"> Blah Blah Blah </div>
</body>
</html>
So let's say that I'm looking for the line with an opening div tag with an id of alpha. The code should return 3, because on the third line is the div with the id of alpha.
At the risk of providing more up-votes for Jeff who has already crossed the mountains of madness... see here
The argument rages back and forth, but... if it's a simple one-off or little-used script you are writing then sure, use regex; if it's more complex and needs to be reliable with little future tweaking then I'd suggest using an HTML parser. HTML is a nasty, often non-regular beast to tame. Use the right tool for the job... maybe in your case it's regex, or maybe it's a full-blown parser.
Generally, NO. But if you are sure that the div will always be on one line and that there is no other div inside it, you can use it without problem. Something like /<div id=\"mydivid\">(.*?)<\/div>/ or something similar.
Otherwise, DOMDocument would be a more sane way.
EDIT See from your HTML example. My answer would be "YES". RegEx is a very good tool for this.
I assume that you have the HTML as one continuous text rather than as separate lines (which would be slightly different). I also assume that you want the line number more than the line content.
Here is some rough PHP code to extract it (just to give some idea).
$HTML =
"<html><head><title>Example</title></head>
<body>
<div id=\"Alpha\"> Blah blah blah </div>
<div id=\"Beta\"> Blah Blah Blah </div>
</body>
</html>";
$ID = "Alpha";
function GetLineOfDIV($HTML, $ID) {
$RegEx_Alpha = '/\n(<div id="'.preg_quote($ID, '/').'">.*?<\/div>)\n/m';
$Found = preg_match($RegEx_Alpha, $HTML, $Match, PREG_OFFSET_CAPTURE);
if ($Found !== 1)
return -1; // no match (or a regex error), so there is no line to report
$Match = $Match[1]; // Only the one in '(...)'
//$MatchStr = $Match[0]; Since you do not want it, so we comment it out.
$MatchOffset = $Match[1];
$StartLines = preg_split("/\n/", $HTML, -1, PREG_SPLIT_OFFSET_CAPTURE);
foreach($StartLines as $I => $StartLine) {
$LineOffset = $StartLine[1];
if ($MatchOffset <= $LineOffset)
return $I + 1;
}
return count($StartLines);
}
echo GetLineOfDIV($HTML, $ID);
I hope this gives you some idea.
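A shorter variant of the same idea, offered as a sketch rather than a drop-in replacement: take the byte offset of the match from PREG_OFFSET_CAPTURE and count the newlines that come before it. The function name lineOfDivById is hypothetical, and the pattern assumes the id is written with double quotes, as in the example HTML.

```php
<?php
function lineOfDivById(string $html, string $id): int
{
    // Match an opening div tag that carries the given id attribute.
    $pattern = '/<div\b[^>]*\bid="' . preg_quote($id, '/') . '"/';
    if (preg_match($pattern, $html, $m, PREG_OFFSET_CAPTURE) !== 1) {
        return -1; // not found
    }
    $offset = $m[0][1]; // byte offset where the match starts

    // Line number = number of newlines before the match, plus one.
    return substr_count(substr($html, 0, $offset), "\n") + 1;
}

$html = "<html><head><title>Example</title></head>\n"
      . "<body>\n"
      . "<div id=\"Alpha\"> Blah blah blah </div>\n"
      . "<div id=\"Beta\"> Blah Blah Blah </div>\n"
      . "</body>\n"
      . "</html>";

echo lineOfDivById($html, 'Alpha'), "\n"; // prints 3
```

Because only the opening tag is matched, this does not care whether the div spans several lines or contains nested divs.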
According to Jeff Atwood, you should never parse HTML using regex.
Since the line number is important to you here and not the actual contents of the div, I'd be inclined not to use regex at all. I'd probably explode() the string into an array and loop through that array looking for your marker. Like so:
<?php
$myContent = "[your string of html here]";
$myArray = explode("\n", $myContent);
$arraylen = count($myArray); // So you don't waste time counting the array at every loop
$lineNo = 0;
for($i = 0; $i < $arraylen; $i++)
{
$pos = strpos($myArray[$i], 'id="Alpha"');
if($pos !== false)
{
$lineNo = $i+1;
break;
}
}
?>
Disclaimer: I haven't got a php installation readily available to test this so some debugging may be required.
Hope this helps as I think it's probably just going to be a waste of time for you to implement a parsing engine just to do something so simple - especially if it's a one-off.
Edit: if the content is important to you at this stage too then you can use this in combination with the other answers which provide an adequate regex for the job.
Edit #2: Oh what the hey... here's my two cents:
"/<div.*?id=\"Alpha\".*?>.*?(<div.*<\/div>)*.*?<\/div>/s"
The (<div.*<\/div>) tells the regex engine that it may find nested div tags and to just incorporate them into the match if it finds them rather than just stopping at the first </div>. However this only solves the problem if there is only one level of nesting. If there's more, then regex is not for you sorry :(.
The /s modifier makes . match line breaks too, so you don't have to dirty up your expressions with [\S\s] everywhere.
Again, sorry, I've no environment to test this in at the moment so you may need to debug.
Cheers
Iain
The fact that a unique id is involved, sounds promising, but since it will be a DIV, and not necessarily a single line of HTML, it will be difficult to construct a regular expression, and the usual objections to parsing HTML with regexes apply.
Not recommended.
Instead of RegEx, use a parser that is made especially to handle (messy) HTML. This will make your application less brittle in case the HTML changes slightly, and you don't have to hand-craft custom RegEx each time you want to pull out a new piece of data.
See this Stack Overflow page: Mature HTML Parsers for PHP
@OP since your requirement is that easy, you can just use string methods
$f = fopen("file","r");
if($f){
$i = 0; // line counter
while( !feof($f) ){
$i+=1;
$line = fgets($f,4096);
if (stripos($line,'<div id="Alpha">')!==FALSE){
print "line number: $i\n";
}
}
fclose($f);
}

Extract form fields using RegEx

I'm looking for a way to get all the form inputs and respective values from a page given a specific URL and form name.
function GetForm($url, $name)
{
return array
(
'field_name_1' => 'value_1',
'field_name_2' => 'value_2',
'select_field_name' => array('option_1', 'option_2', 'option_3'),
);
}
GetForm('http://www.google.com/', 'f');
Can anyone provide me with the necessary regular expressions to accomplish this?
EDIT: I understand that querying the DOM would be far more reliable, however what I'm looking for is a website agnostic solution that allows me to get all the fields of a given form. I don't believe this is possible with DOM without knowing the document nodes first, am I wrong?
I don't need a bullet proof solution, just something that works on standard web pages, for the FORM tag I've come up with the following RegEx;
'~<form.*?name=[\'"]?' . $name . '[\'"]?.*?>(.+?)</form>~is'
I believe that doing something similar for input fields won't be difficult, what I find most challenging is the RegEx for the select and option fields.
Using regex to parse HTML is probably not the best way to go.
You might take a look at DOMDocument::loadHTML, which will allow you to work with an HTML document using DOM methods (and XPath queries, for instance, if you know those).
You might also want to take a look at Zend_Dom and Zend_Dom_Query, btw, which are quite nice if you can use some parts of Zend Framework in your application.
They are used to fetch data from HTML pages when doing functional testing with Zend_Test, for instance -- and work quite well ;-)
It may seem harder in the first place... But, considering the mess some HTML pages are, it is probably a much wiser idea...
EDIT after the comment and the edit of the OP
Here are a couple of thoughts, to begin with something "simple", an input tag :
it can spread across several lines
it can have many attributes
considering only name and value are of interest to you, you have to deal with the fact that those two can be in any possible order
attributes can have double-quotes, single-quotes, or even nothing around their values
tags / attributes can be both lower-case or upper-case
tags don't always have to be closed
Well, some of those points are not valid HTML ; but they still work in the most common web browsers, so they have to be taken into account...
Only with those points, I wouldn't like to be the one writing the regex ^^
But I suppose there might be other difficulties I didn't think about.
On the other side, you have DOM and xpath... To get the value of an input name="q" (example is this page), it's a matter of something like this :
$url = 'http://www.google.fr/search?q=test&ie=utf-8&oe=utf-8&aq=t&rls=com.ubuntu:en-US:unofficial&client=firefox-a';
$html = file_get_contents($url);
$dom = new DOMDocument();
if (@$dom->loadHTML($html)) {
// yep, not necessarily valid-html...
$xpath = new DOMXpath($dom);
$nodeList = $xpath->query('//input[@name="q"]');
if ($nodeList->length > 0) {
for ($i=0 ; $i<$nodeList->length ; $i++) {
$node = $nodeList->item($i);
var_dump($node->getAttribute('value'));
}
}
} else {
// too bad...
}
What matters here ? The XPath query, and only that... And is there anything static/constant in it ?
Well, I say I want all <input> that have a name attribute that is equal to "q".
And it just works : I'm getting this result :
string 'test' (length=4)
string 'test' (length=4)
(I checked : there are two input name="q" on the page ^^ )
Do I know the structure of the page ? Absolutely not ;-)
I just know I/you/we want input tags named q ;-)
And that's what we get ;-)
EDIT 2 : and a bit fun with select and options :
Well, just for fun, here's what I came up for select and option :
$url = 'http://www.google.fr/language_tools?hl=fr';
$html = file_get_contents($url);
$dom = new DOMDocument();
if (@$dom->loadHTML($html)) {
// yep, not necessarily valid-html...
$xpath = new DOMXpath($dom);
$nodeListSelects = $xpath->query('//select');
if ($nodeListSelects->length > 0) {
for ($i=0 ; $i<$nodeListSelects->length ; $i++) {
$nodeSelect = $nodeListSelects->item($i);
$name = $nodeSelect->getAttribute('name');
$nodeListOptions = $xpath->query('option[@selected="selected"]', $nodeSelect); // We want options that are inside the current select
if ($nodeListOptions->length > 0) {
for ($j=0 ; $j<$nodeListOptions->length ; $j++) {
$nodeOption = $nodeListOptions->item($j);
$value = $nodeOption->getAttribute('value');
var_dump("name='$name' => value='$value'");
}
}
}
}
} else {
// too bad...
}
And I get as an output :
string 'name='sl' => value='fr'' (length=23)
string 'name='tl' => value='en'' (length=23)
string 'name='sl' => value='en'' (length=23)
string 'name='tl' => value='fr'' (length=23)
string 'name='sl' => value='en'' (length=23)
string 'name='tl' => value='fr'' (length=23)
Which is what I expected.
Some explanations ?
Well, first of all, I get all the select tags of the page, and keep their name in memory.
Then, for each one of those, I get the selected option tags that are its descendants (there's always only one, btw).
And here, I have the value.
A bit more complicated than the previous example... But still much easier than regex, I believe... Took me maybe 10 minutes, not more... And I still won't have the courage (madness ?) to start thinking about some kind of mutant regex that would be able to do that :-D
Oh, and, as a sidenote : I still have no idea what the structure of the HTML document looks like : I have not even taken a single look at its source ^^
I hope this helps a bit more...
Who knows, maybe I'll convince you regex are not a good idea when it comes to parsing HTML... maybe ? ;-)
Still : have fun !
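Pulling the two examples above together, here is a rough sketch of what the GetForm function from the question might look like with DOM and XPath. It works on an HTML string (fetching it with file_get_contents($url), as above, is left out), the name getFormFields and the sample markup are hypothetical, and it assumes $name contains no double quotes since it is dropped straight into the XPath query.

```php
<?php
// Sketch: collect the fields of the form named $name from an HTML string.
function getFormFields(string $html, string $name): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate invalid HTML quietly
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $forms = $xpath->query(sprintf('//form[@name="%s"]', $name));
    if ($forms->length === 0) {
        return []; // no form with that name
    }
    $form = $forms->item(0);

    $fields = [];

    // Inputs: name => value, whatever the attribute order or quoting.
    foreach ($xpath->query('.//input[@name]', $form) as $input) {
        $fields[$input->getAttribute('name')] = $input->getAttribute('value');
    }

    // Selects: name => list of option values.
    foreach ($xpath->query('.//select[@name]', $form) as $select) {
        $options = [];
        foreach ($xpath->query('.//option', $select) as $option) {
            $options[] = $option->getAttribute('value');
        }
        $fields[$select->getAttribute('name')] = $options;
    }

    return $fields;
}

// Made-up form markup for illustration.
$html = '<form name="f"><input type="text" name="q" value="test">'
      . '<select name="lang"><option value="en">English</option>'
      . '<option value="fr" selected>French</option></select></form>';

print_r(getFormFields($html, 'f'));
```

The only page-specific knowledge here is the form name; textarea elements, unnamed fields and options without a value attribute are not handled and would need extra queries.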
