preg_replace with wildcards? - php

I have HTML markup bearing the form
<div id='abcd1234A'><p id='wxyz1234A'>Hello</p></div>
which I need to replace to bear the form
<div id='abcd1234AN'><p id='wxyz1234AN'>Hello</p></div>
where N may be 1,2.. .
The best I have been able to do is as follows
function cloneIt($a,$b)
{
return substr_replace($a,$b,-1);
}
$ndx = "1'";
$str = "<div id='abcd1234A'><p id='wxyz1234A'>Hello</p></div>";
preg_match_all("/id='[a-z]{4}[0-9]{4}A'/",$str,$matches);
$matches = $matches[0];
$reps = array_merge($matches);
$ndxs = array_fill(0,count($reps),$ndx);
$reps = array_map("cloneIt",$reps,$ndxs);
$str = str_replace($matches,$reps,$str);
echo htmlspecialchars($str);
which works just fine. However, my REGEX skills are not much to write home about so I suspect that there is probably a better way to do this. I'd be most obliged to anyone who might be able to suggest a neater/quicker way of accomplishing the same result.

You can optimize your regex like this:
/id='[a-z]{4}\d{4}A'/
Sample code
preg_match_all("/id='[a-z]{4}\\d{4}A'/",$str,$matches);
However an alternative would consist in using en HTML parser. Here I'll use simple html dom:
// Load the HTML from URL or file
$html = file_get_html('http://www.mysite.com/');
// You can also load $html from string: $html = str_get_html($my_string);
// Find div with id attribute
foreach($html->find('div[id]') as $div) {
if (preg_match("/id='([a-z]{4}\\d{4})A'/" , $div->id, $matches)) {
$div->id = $matches[1] + $ndx;
}
}
echo $html->save();
Did you notice how elegant, concise and clear the code becomes with an html parser ?
References
Simple Html Dom Documentation

Related

Get a string between 2 strings/tags in php

I'm tring to get a string between 2 strings with preg_match
The string is something like this, this is just an example
<source src='http://website.com/384238/dsjfjsd.jpg' type='image/jpg' data-res='43543' lang='English'/>
I want the link, the "data-res=" is the one that varies so:
I'm doing something like this:
preg_match("<source src='(.*)' type='image/jpg' data-res='43543",$input,$output);
I also tried this way
$output = trim(cut_str($input, '<source src='', ' type='image/jpg' data-res='43543'));
I think the problem is not knowing how do I represent the spaces or special chars, I also wanted an advice for whats the best function to solve this
While you can do this with a regular expression. I would encourage you to use DOMDocument.
From there it would be simple to grab all source tags using getElementByTagName():
$dom = new DOMDocument;
$dom->loadHTML($html);
$source_tags = $dom->getElementsByTagName('source');
foreach ($source_tags as $source_tag) {
echo 'Link: ' . $source_tag->attributes->getNamedItem('src')->nodeValue;
}
This question might also help if you are interested in source tags with the data-res attribute.
Here is a code you could try:
// The Regular Expression filter
$reg_exSRC = "/(src)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
// The text you want to filter for urls
$text = "<source src='http://website.com/384238/dsjfjsd.jpg' type='image/jpg' data-res='43543' lang='English'/>";
// apply expression to the text
preg_match($reg_exSRC, $text, $url);
echo $url[0];
Why not parsing it like this ? It's faster then REGEX and easier to use.
$dom = new DOMDocument;
$dom->loadHTML('<source src="http://website.com/384238/dsjfjsd.jpg" type="image/jpg" data-res="43543" lang="English" />');
// We read it
$dataSource = $dom->getElementsByTagName('source');
// We loop on it
$dataRes = FALSE;
foreach($dataSource as $data){
# We read the wanted field
if(($dataAttr = $data->attributes->getNamedItem('data-res')->nodeValue) == "43543"){
# We assign it
$dataRes&= $dataAttr;
# Done - We end the loop here
break;
}
}
# We found it ?
if($dataRes !== FALSE){
# Yes
var_dump($dataRes);
} else {
# No
exit('Failed');
}
Warning: I didn't not test this code but it should work.

How can i have counter for php preg_match?

function getContent($xml,$tag,$id="") {
if ($id=="") {
$tag_regex = '/<'.$tag.'[^>]*>(.*?)<\/'.$tag.'>/si';
} else {
$tag_regex = '/<'.$tag.'[^>]*id=[\'"]'.$id.'[\'"]>(.*?)<\/'.$tag.'>/si';
}
preg_match($tag_regex,$xml,$matches);
return $matches[1];
}
$omg = file_get_contents("Generated/index.php");
$extract = getContent($omg,"div","lolz2");
echo $extract;
For example i have something like this. And html have something like this inside:
<div id="lolz">qwg1eqwe</div>
<div id="lolz1"><div id='lolz2'>qwdqw2cq</div>asd3qwe</div>
If we search for id lolz we get the correct answer, but if we search for lolz1 we stop at first </div> that's inner <div id="lolz2">. It's possible to keep something like counter for preg_match that's will keep how many <div>'s i pass till i find </div>?
HTML isn't a regular language, so building something like that would be overkill and is the job of an HTML parser. Please see: RegEx match open tags except XHTML self-contained tags.
The reason your code was failing however was because you were using both single and double quotes in your input but your regex didn't account for it. This works for me:
function getContent($xml,$tag,$id="") {
if ($id=="") {
$tag_regex = '/<'.$tag.'[^>]*>(.*?)<\/'.$tag.'>/si';
} else {
$tag_regex = '/<'.$tag.'[^>]*id=[\\\'"]'.$id.'[\\\'"]>(.*?)<\/'.$tag.'>/si';;
}
preg_match($tag_regex,$xml,$matches);
return $matches[1];
}
$omg = '<div id="lolz">qwg1eqwe</div>
<div id="lolz1"><div id="lolz2">qwdqw2cq</div>asd3qwe</div>';
$extract = getContent($omg,"div","lolz2");
var_dump($extract);
As long as you don't have nested elements this code will work and you won't need to use a DOM parser, though you really should for anything more complicated that might be nested (e.g. you don't have control over the input).

Replace an element with Dom Document PHP

I load a html page with PHP Dom Document :
$doc = new DOMDocument();
#$doc->loadHTMLFile($url);
I search in my page all "a" elements, and if they realize my condition i need to replace for example My link is beautiful by just My link is beautiful
Here my loop :
$liens = $div->getElementsByTagName('a');
foreach($liens as $lien){
if($lien->hasAttribute('href')){
if (preg_match("/metz2/i", $lien->getAttribute('href'))) {
//HERE I NEED TO REPLACE </a>
}
$cpt++;
}
}
Do you have any ideas ? Suggestions ? Thanks :)
Every time i need to manage DOM with PHP, i use a framework called PHP Simple HTLM DOM parser. (Link here)
It's very easy to use, something like this might work for you:
// Create DOM from URL or file
$html = file_get_html('http://www.page.com/');
// Find all links
foreach($html->find('a') as $element) {
//Do your custom logic here if you need it, for example this extracts the inner contents of the a-tag, and puts it freely.
$inner = $element->innertext;
$element->outertext($inner);
}
//To echo modified html again:
echo $html;
Could be done with preg_replace as well:
$sText = 'Stackoverflow';
$sText = preg_replace( '/<a.*>(.*)<\/a>/', '$1', $sText );
echo $sText;

Is using PHP's explode() for HTML scraping considered a bad practice?

I have been coding for a while now but just can't seem to get my head around regular expressions.
This brings me to my question which is the following: is it bad practice to use PHP's explode for breaking up a string of html code to select bits of text? I need to scrape a page for various bits of information and due to my horrific regex knowledge (In a full software engineering degree I had to write maybe one....) I decided upon using explode().
I have provided my code below so someone more seasoned than me can tell me if it's essential that I use regex for this or not!
public function split_between($start, $end, $blob)
{
$strip = explode($start,$blob);
$strip2 = explode($end,$strip[1]);
return $strip2[0];
}
public function get_abstract($pubmed_id)
{
$scrapehtml = file_get_contents("http://www.ncbi.nlm.nih.gov/m/pubmed/".$pubmed_id);
$data['title'] = $this->split_between('<h2>','</h2>',$scrapehtml);
$data['authors'] = $this->split_between('<div class="auth">','</div>',$scrapehtml);
$data['journal'] = $this->split_between('<p class="j">','</p>',$scrapehtml);
$data['aff'] = $this->split_between('<p class="aff">','</p>',$scrapehtml);
$data['abstract'] = str_replace('<p class="no_t_m">','',str_replace('</p>','',$this->split_between('<h3 class="no_b_m">Abstract','</div>',$scrapehtml)));
$strip = explode('<div class="ids">', $scrapehtml);
$strip2 = explode('</div>', $strip[1]);
$ids[] = $strip2[0];
$id_test = strpos($strip[2],"PMCID");
if (isset($strip[2]) && $id_test !== false)
{
$step = explode('</div>', $strip[2]);
$ids[] = $step[0];
}
$id_count = 0;
foreach ($ids as &$value) {
$value = str_replace("<h3>", "", $value);
$data['ids'][$id_count]['id'] = str_replace("</h3>", "", str_replace('<span>','',str_replace('</span>','',$value)));
$id_count++;
}
$jsonAbstract = json_encode($data);
echo $this->indent($jsonAbstract);
}
I highly recommend you try out the PHP Simple HTML DOM Parser library. It handles invalid HTML and has been designed to solve the same problem you're working on.
A simple example from the documentation is as follows:
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
// Find all images
foreach($html->find('img') as $element)
echo $element->src . '<br>';
// Find all links
foreach($html->find('a') as $element)
echo $element->href . '<br>';
It's not essential to use regular expressions for anything, although it'll be useful to get comfortable with them and know when to use them.
It looks like your scraping PubMed, which I'm guessing has fairly static mark-up in terms of mark-up. If what you have works and performs as you hope I can't see any reason to switch over to using regular expressions, they're not necessarily going to be any quicker in this example.
Learn regular expressions and try to use a language that has libraries for this kind of task like perl or python. It will save you a lot of time.
At first they might seem daunting but they are really easy for most of the tasks.
Try reading this: http://perldoc.perl.org/perlre.html

How can I parse a very simple Table using PHP

Good day dear community!
I need to build a function which parses the content of a very simple Table
(with some labels and values) see the url below. I have used various ways to parse html sources. But this one is is a bit tricky! See the target i want to parse - it has some invaild markup:
The target: http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=644.0013008534253&SchulAdresseMapDO=194190
Well i tried it with this one
<?php
require_once('config.php'); // call config.php for db connection
$filename = "url.txt"; // Include the txt file which have urls
$each_line = file($filename);
foreach($each_line as $line_num => $line)
{
$line = trim($line);
$content = file_get_contents($line);
//echo ($content)."<br>";
$pattern = '/<td>(.*?)<\/td>/si';
preg_match_all($pattern,$content,$matches);
foreach ($matches[1] as $match) {
$match = strip_tags($match);
$match = trim($match);
//var_dump($match);
$sql = mysqli_query("insert into tablename(contents) values ('$match')");
//echo $match;
}
}
?>
Well - see the regex in line 7-11: it does not match!
Conclusio: i have to rework the parser-part of this script. I need to parse someway different - since the parsercode does not match exactly what is aimed. It is aimed to get back the results of the table.
Can anybody help me here to get a better regex - or a better way to parse this site ...
Any and all help will be greatly apprecaited.
regards
zero
You could use tear the table apart using
preg_split('/<td width="73%"> /', $str, -1); (note; i did not bother escaping characters)
You'll want to drop the first entry. Now you can use stripos and substr to cut away everything after the .
This is a basic setup! You will have to fine-tune it quite a bit, but I hope this gives you an idea of what would be my approach.
Regex does not always provide perfect result. Using any HTML parser is a good idea. There are many HTML parsers as described in Gordon's Answer.
I have used Simple HTML DOM Parser in past and it worked for me.
For Example:
// Create DOM from URL or file
$html = file_get_html('http://www.example.com/');
// Find all <td> in <table> which class=hello
$es = $html->find('table.hello td');
// Find all td tags with attribite align=center in table tags
$es = $html->find('table td[align=center]');

Categories