Parsing data with REGEX (PHP)

I want to parse data in between brackets. Below is the code where you can see what I'm doing. I would also like to avoid using XML.
$query = '
[page:1]
<html>
all the html
</html>
[/page:1]
[page:2]
<html>
all the html
</html>
[/page:2]
';
I want to write a loop that uses regex to find all instances of [page:x] (two in the example above). Then, with a GET parameter, we can specify the page we want.
if(isset($_GET['page'])) {
$page = $_GET['page'];
$regex = '/\[page:(.*?)\]/'; // delimiters added and brackets escaped so the pattern compiles
echo preg_match($regex, $query);
}
Any thoughts?

This should find all the matching blocks at once:
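preg_match_all('/\[page:([0-9]+)\](.+?)\[\/page:\1\]/s', $query, $matches);
Note that the backreference inside a pattern is written \1 (not $1), the s modifier lets . match the newlines between the tags, and the subject is $query, the string that holds the pages.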
I strongly doubt regex is the most suitable solution for what you're trying to accomplish though.
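As a rough sketch of how the pattern above could feed the page selection described in the question: the helper name extractPages and the fallback to page 1 are assumptions made for illustration, not part of the original answer.

```php
<?php
// Sample input in the same shape as the question's $query.
$query = '
[page:1]
<html>
all the html for page 1
</html>
[/page:1]
[page:2]
<html>
all the html for page 2
</html>
[/page:2]
';

// Hypothetical helper: returns an array of page number => page content.
function extractPages(string $source): array
{
    $pages = [];
    // \1 requires the closing tag to carry the same number as the opening tag;
    // the s modifier lets . match across line breaks.
    $pattern = '/\[page:([0-9]+)\](.+?)\[\/page:\1\]/s';
    if (preg_match_all($pattern, $source, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            $pages[(int) $m[1]] = trim($m[2]);
        }
    }
    return $pages;
}

$pages = extractPages($query);

// Assumption: default to page 1 when no ?page= parameter is given.
$requested = isset($_GET['page']) ? (int) $_GET['page'] : 1;

echo count($pages), " pages found\n";
echo $pages[$requested] ?? 'No such page';
```

Keying the array by page number means the lookup is a plain array access, and a request for a page that does not exist falls through to the fallback message.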

Related

preg_replace with wildcards?

I have HTML markup bearing the form
<div id='abcd1234A'><p id='wxyz1234A'>Hello</p></div>
which I need to replace to bear the form
<div id='abcd1234AN'><p id='wxyz1234AN'>Hello</p></div>
where N may be 1, 2, ...
The best I have been able to do is as follows
function cloneIt($a,$b)
{
return substr_replace($a,$b,-1);
}
$ndx = "1'";
$str = "<div id='abcd1234A'><p id='wxyz1234A'>Hello</p></div>";
preg_match_all("/id='[a-z]{4}[0-9]{4}A'/",$str,$matches);
$matches = $matches[0];
$reps = array_merge($matches);
$ndxs = array_fill(0,count($reps),$ndx);
$reps = array_map("cloneIt",$reps,$ndxs);
$str = str_replace($matches,$reps,$str);
echo htmlspecialchars($str);
which works just fine. However, my REGEX skills are not much to write home about so I suspect that there is probably a better way to do this. I'd be most obliged to anyone who might be able to suggest a neater/quicker way of accomplishing the same result.

You can optimize your regex like this:
/id='[a-z]{4}\d{4}A'/
Sample code
preg_match_all("/id='[a-z]{4}\\d{4}A'/",$str,$matches);
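If the goal is only to append a suffix to each matching id, a single preg_replace call may be enough. The following is a minimal sketch; the function name appendIdSuffix is hypothetical, and it assumes the suffix contains no $ or \ characters (those have special meaning in a replacement string).

```php
<?php
// Hypothetical helper: appends $suffix to every id of the form xxxx0000A.
function appendIdSuffix(string $html, string $suffix): string
{
    // Group 1 captures everything up to (not including) the closing quote.
    // ${1} is used instead of $1 so that a numeric suffix such as "1"
    // is not read as part of the group number ($11).
    $result = preg_replace(
        "/(id='[a-z]{4}\\d{4}A)'/",
        '${1}' . $suffix . "'",
        $html
    );
    // preg_replace returns null on error; fall back to the input in that case.
    return $result ?? $html;
}

$str = "<div id='abcd1234A'><p id='wxyz1234A'>Hello</p></div>";
echo appendIdSuffix($str, '1');
// prints: <div id='abcd1234A1'><p id='wxyz1234A1'>Hello</p></div>
```

This replaces the preg_match_all, array_fill, array_map and str_replace steps from the question with one pass over the string.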
However, an alternative would be to use an HTML parser. Here I'll use simple html dom:
// Load the HTML from URL or file
$html = file_get_html('http://www.mysite.com/');
// You can also load $html from string: $html = str_get_html($my_string);
$ndx = '1'; // suffix to append (without the trailing quote used in the question)
// Find div with id attribute (repeat with 'p[id]' for the paragraphs)
foreach($html->find('div[id]') as $div) {
// $div->id holds only the attribute value, e.g. abcd1234A
if (preg_match('/^[a-z]{4}\d{4}A$/', $div->id)) {
$div->id = $div->id . $ndx; // "." concatenates strings; "+" would add numerically
}
}
echo $html->save();
Did you notice how elegant, concise and clear the code becomes with an HTML parser?
References
Simple Html Dom Documentation

How can i have counter for php preg_match?

function getContent($xml,$tag,$id="") {
if ($id=="") {
$tag_regex = '/<'.$tag.'[^>]*>(.*?)<\/'.$tag.'>/si';
} else {
$tag_regex = '/<'.$tag.'[^>]*id=[\'"]'.$id.'[\'"]>(.*?)<\/'.$tag.'>/si';
}
preg_match($tag_regex,$xml,$matches);
return $matches[1];
}
$omg = file_get_contents("Generated/index.php");
$extract = getContent($omg,"div","lolz2");
echo $extract;
For example, I have something like this, and the HTML contains something like this:
<div id="lolz">qwg1eqwe</div>
<div id="lolz1"><div id='lolz2'>qwdqw2cq</div>asd3qwe</div>
If we search for id lolz we get the correct answer, but if we search for lolz1 we stop at the first </div>, which belongs to the inner <div id='lolz2'>. Is it possible to keep something like a counter for preg_match that tracks how many <div>s I have passed before I reach the matching </div>?

HTML isn't a regular language, so building something like that would be overkill and is the job of an HTML parser. Please see: RegEx match open tags except XHTML self-contained tags.
The reason your code was failing however was because you were using both single and double quotes in your input but your regex didn't account for it. This works for me:
function getContent($xml,$tag,$id="") {
if ($id=="") {
$tag_regex = '/<'.$tag.'[^>]*>(.*?)<\/'.$tag.'>/si';
} else {
$tag_regex = '/<'.$tag.'[^>]*id=[\\\'"]'.$id.'[\\\'"]>(.*?)<\/'.$tag.'>/si';
}
preg_match($tag_regex,$xml,$matches);
return $matches[1];
}
$omg = '<div id="lolz">qwg1eqwe</div>
<div id="lolz1"><div id="lolz2">qwdqw2cq</div>asd3qwe</div>';
$extract = getContent($omg,"div","lolz2");
var_dump($extract);
As long as you don't have nested elements this code will work and you won't need to use a DOM parser, though you really should for anything more complicated that might be nested (e.g. you don't have control over the input).
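For the nested case (searching for lolz1), a DOM-based approach avoids counting tags by hand. Below is a minimal sketch using PHP's bundled DOM extension; the function name getInnerHtmlById is hypothetical, and it assumes the id value contains no double quotes, since it is embedded directly in the XPath expression.

```php
<?php
// Hypothetical helper: returns the inner HTML of the element with the
// given id, or null if no such element exists. Requires ext-dom.
function getInnerHtmlById(string $html, string $id): ?string
{
    $doc = new DOMDocument();

    // Suppress warnings about imperfect markup while parsing.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Assumption: $id contains no double quotes.
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//*[@id="' . $id . '"]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    // Serialize each child so nested markup is kept intact.
    $inner = '';
    foreach ($nodes->item(0)->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    return $inner;
}

$html = '<div id="lolz">qwg1eqwe</div>
<div id="lolz1"><div id=\'lolz2\'>qwdqw2cq</div>asd3qwe</div>';

var_dump(getInnerHtmlById($html, 'lolz1'));
```

Because the parser builds a tree, the content of lolz1 includes the whole inner lolz2 element plus the trailing text, regardless of how deeply elements are nested or which quote style the attributes use.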

Using preg_replace_callback to identify and manipulate latex code

I have latex + html code somewhere in the following form:
...some text1.... \[latex-code1\]....some text2....\[latex-code2\]....etc
Firstly I want to obtain the latex codes in an array codes[] to be able to send them to a server for rendering, so that
code[0]=latex-code1, code[1]=latex-code2, etc
Secondly, I want to modify this text so that it looks like:
...some text1.... <img src="root/1.png">....some text2....<img src="root/2.png">....etc
i.e, the i-th latex code fragment is replaced by the link to the i-th rendered image.
I have been trying to do this with preg_replace_callback and preg_match_all but being new to PHP haven't been able to make it work. Please advise.

If you're looking for codez:
$html = '...some text1.... \[latex-code1\]....some text2....\[latex-code2\]....etc';
$codes = array();
$count = 0;
$replace = function($matches) use (&$codes, &$count) {
list(, $codes[]) = $matches;
return sprintf('<img src="root/%d.png">', ++$count);
};
$changed = preg_replace_callback('~\\\\\\[(.+?)\\\\\\]~', $replace, $html);
echo "Original: $html\n";
echo "Changed : $changed\n\nLatex Codes: ", print_r($codes, 1), "Count: ", $count;
I don't know which part gave you problems. If it's the regex pattern: the characters in your markers need heavy escaping, once for PHP and once for PCRE, which is why there are so many backslashes.
Another tricky part is the callback function, because it needs to collect the codes as well as keep a counter. In the example this is done with an anonymous function that takes variable aliases / references in its use clause. This makes the variables $codes and $count available inside the callback.
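To make the two escaping layers visible, this small sketch prints the pattern as the regex engine receives it and compares it with a nowdoc version, which needs no PHP-level escaping. The variable names $quoted and $nowdoc are chosen for illustration only.

```php
<?php
// Single-quoted string: every \\ in the source becomes one backslash,
// so six backslashes in the source turn into three in the pattern.
$quoted = '~\\\\\\[(.+?)\\\\\\]~';

// Nowdoc: no PHP escaping at all, the text is taken literally.
$nowdoc = <<<'RE'
~\\\[(.+?)\\\]~
RE;

echo $quoted, "\n"; // prints ~\\\[(.+?)\\\]~
var_dump($quoted === $nowdoc); // bool(true)

// In the pattern, \\ matches one literal backslash and \[ matches a literal [.
$html = '...some text1.... \[latex-code1\]....some text2....\[latex-code2\]....etc';
preg_match_all($nowdoc, $html, $matches);
print_r($matches[1]); // latex-code1 and latex-code2
```

Writing the pattern as a nowdoc leaves only the PCRE layer of escaping to reason about.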

How can I parse a very simple Table using PHP

Good day dear community!
I need to build a function which parses the content of a very simple Table
(with some labels and values); see the URL below. I have used various ways to parse HTML sources, but this one is a bit tricky! See the target I want to parse; it has some invalid markup:
The target: http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=644.0013008534253&SchulAdresseMapDO=194190
Well i tried it with this one
<?php
require_once('config.php'); // call config.php for db connection
$filename = "url.txt"; // Include the txt file which have urls
$each_line = file($filename);
foreach($each_line as $line_num => $line)
{
$line = trim($line);
$content = file_get_contents($line);
//echo ($content)."<br>";
$pattern = '/<td>(.*?)<\/td>/si';
preg_match_all($pattern,$content,$matches);
foreach ($matches[1] as $match) {
$match = strip_tags($match);
$match = trim($match);
//var_dump($match);
$sql = mysqli_query("insert into tablename(contents) values ('$match')");
//echo $match;
}
}
?>
Well, see the regex in line 7-11: it does not match!
Conclusion: I have to rework the parser part of this script. I need to parse in a different way, since the parser code does not match what I am aiming for, which is to get back the contents of the table.
Can anybody help me get a better regex, or a better way to parse this site?
Any and all help will be greatly appreciated.
regards
zero

You could tear the table apart using
preg_split('/<td width="73%"> /', $str, -1); (note; i did not bother escaping characters)
You'll want to drop the first entry. Now you can use stripos and substr to cut away everything after the .
This is a basic setup! You will have to fine-tune it quite a bit, but I hope this gives you an idea of what would be my approach.

Regex does not always give a perfect result. Using an HTML parser is a good idea. There are many HTML parsers, as described in Gordon's answer.
I have used Simple HTML DOM Parser in past and it worked for me.
For Example:
// Create DOM from URL or file
$html = file_get_html('http://www.example.com/');
// Find all <td> in <table> which class=hello
$es = $html->find('table.hello td');
// Find all td tags with attribute align=center in table tags
$es = $html->find('table td[align=center]');
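The same idea can be sketched with PHP's bundled DOM extension, without a third-party library. The table below is invented sample data (not the markup of the target page), and the two-column label/value layout is an assumption based on the question's description.

```php
<?php
// Invented sample: a two-column table of labels and values.
$content = '<table>
<tr><td width="27%">Name</td><td width="73%"> Example School</td></tr>
<tr><td width="27%">City</td><td width="73%"> Example City</td></tr>
</table>';

$doc = new DOMDocument();

// Collect parser warnings instead of printing them; scraped pages
// often contain markup the parser complains about.
$previous = libxml_use_internal_errors(true);
$doc->loadHTML($content);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$rows = [];
foreach ($doc->getElementsByTagName('tr') as $tr) {
    $cells = [];
    foreach ($tr->getElementsByTagName('td') as $td) {
        // textContent drops any nested tags, like strip_tags() in the question.
        $cells[] = trim($td->textContent);
    }
    // Assumption: first cell is the label, second cell is the value.
    if (count($cells) >= 2) {
        $rows[$cells[0]] = $cells[1];
    }
}

print_r($rows);
```

Each label/value pair ends up in one array entry, which can then be written to the database with a prepared statement rather than by interpolating the values into the SQL string.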

Simple PHP Screen Scraping Function

I'm experimenting with autoblogging (i.e., RSS-driven blog posting) using WordPress, and all that's missing is a component to automatically fill in the content of the post with the content that the RSS's URL links to (RSS is irrelevant to the solution).
Using standard PHP 5, how could I create a function called fetchHTML([URL]) that returns the HTML content of a webpage that's found between the <body>...</body> tags?
Please let me know if there are any prerequisite "includes".
Thanks.

Okay, here's a DOM parser code example as requested.
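<?php
function fetchHTML( $url )
{
$content = file_get_contents($url);
if ($content === false) { return ''; } // download failed
$html = new DOMDocument();
libxml_use_internal_errors(true); // real-world pages rarely validate cleanly
$html->loadHTML($content); // parse the downloaded markup into the document
$body = $html->getElementsByTagName('body')->item(0);
if ($body === null) { return ''; } // no <body> element found
$inner = '';
// serialize each child of <body> so the markup is kept, not just the text
foreach ($body->childNodes as $child) { $inner .= $html->saveHTML($child); }
return $inner;
}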

Assuming that it will always be <body> and not <BODY> or <body style="width:100%"> or anything except <body> and </body>, and with the caveat that you shouldn't use regex to parse HTML, even though I'm about to, here ya go:
<?php
function fetchHTML( $url )
{
$content = file_get_contents( $url );
// [\s\S] matches any character, including newlines
if ( $content !== false && preg_match( '/<body>([\s\S]+)<\/body>/', $content, $match ) ) {
return $match[1];
}
return ''; // download failed or no plain <body>...</body> pair found
} // fetchHTML
?>
If you echo fetchHTML([some url]);, you'll get the html between the body tags.
Please note original caveats.

I think you're better off using a class like SimpleDom -> http://sourceforge.net/projects/simplehtmldom/ to extract the data, as you then don't need to write such complicated regular expressions.
