Regex in PHP: find everything inside a div

I'm trying to find everything inside a div using a regular expression. I'm aware that there is probably a smarter way to do this, but I've chosen regexp.
Currently my pattern looks like this:
$gallery_pattern = '/<div class="gallery">([\s\S]*)<\/div>/';
And it does the trick, somewhat.
The problem appears when I have two divs one after the other, like this:
<div class="gallery">text to extract here</div>
<div class="gallery">text to extract from here as well</div>
I want to extract the text from both divs, but when testing I don't get the text of each div as a result. Instead I get:
"text to extract here </div>
<div class="gallery">text to extract from here as well"
So to sum up: the match skips the first closing </div> and continues on to the next one.
Note that the text inside the div can contain <, / and line breaks.
Does anyone have a simple solution to this problem? I'm still a regexp novice.

You shouldn't be using regex to parse HTML when there's a convenient DOM library:
$str = '
<div class="gallery">text to extract here</div>
<div class="gallery">text to extract from here as well</div>
';
$doc = new DOMDocument();
$doc->loadHTML($str);
$divs = $doc->getElementsByTagName('div');
if ( $divs->length ) {
    foreach ( $divs as $div ) {
        echo $div->nodeValue . '<br>';
    }
}
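The loop in this answer visits every div in the document. If only the gallery divs are wanted, an XPath query can filter on the class attribute. The following is a minimal sketch, not part of the original answer: the helper name galleryTexts is hypothetical, and it assumes the class attribute is exactly "gallery" with no additional classes.

```php
<?php
// Hypothetical helper: returns the text of every <div class="gallery">.
function galleryTexts(string $html): array
{
    $doc = new DOMDocument();
    // Collect parse warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $texts = [];
    // Assumption: the attribute value is exactly "gallery" (no extra classes).
    foreach ($xpath->query('//div[@class="gallery"]') as $div) {
        $texts[] = $div->textContent;
    }
    return $texts;
}

$str = '
<div class="gallery">text to extract here</div>
<div class="other">ignored</div>
<div class="gallery">text to extract from here as well</div>
';
print_r(galleryTexts($str));
```

Because the parser tracks nesting itself, this keeps working when a gallery div contains other divs, which is where a regex approach tends to break down.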

What about something like this:
$str = <<<HTML
<div class="gallery">text to extract here</div>
<div class="gallery">text to extract from here as well</div>
HTML;
$matches = array();
preg_match_all('#<div[^>]*>(.*?)</div>#s', $str, $matches);
var_dump($matches[1]);
Note the '?' in the regex, which makes the quantifier non-greedy (lazy).
This will get you:
array
0 => string 'text to extract here' (length=20)
1 => string 'text to extract from here as well' (length=33)
This should work fine, as long as you don't have nested divs. If you do... well, are you really sure you want to use regular expressions to parse HTML, which is not very regular itself?
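To see why the question's pattern swallowed both divs, it may help to run the greedy and lazy forms side by side. This is a small sketch using a shortened input string of my own (the texts "first" and "second" are placeholders, not from the question):

```php
<?php
$str = '<div class="gallery">first</div>' . "\n" . '<div class="gallery">second</div>';

// Greedy: [\s\S]* runs to the LAST </div>, so both divs end up in one match.
preg_match_all('/<div class="gallery">([\s\S]*)<\/div>/', $str, $greedy);

// Lazy: [\s\S]*? stops at the FIRST </div> after each opening tag.
preg_match_all('/<div class="gallery">([\s\S]*?)<\/div>/', $str, $lazy);

echo count($greedy[1]), "\n"; // 1
echo count($lazy[1]), "\n";   // 2
print_r($lazy[1]);
```

The only difference between the two patterns is the trailing '?', so the original pattern from the question can be kept as-is apart from that one character.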

A possible answer to this problem can be found at http://simplehtmldom.sourceforge.net/
That class helped me solve a similar problem quickly.

Related

Get data from div classes

How can I get data from div classes?
Example: A <div class=ab>1 2</div>b <div class=ab>3 4</div>c.
I want: 1 2, 3 4, etc. (the text between <div class=ab> and </div>).
Second example: http://www.imdb.com/title/tt29747">
I want: tt29747 (the text between http://www.imdb.com/title/ and ">).
With strstr everything works, except that I only get the first result. I tried some solutions found here without success; regular expressions are beyond me. Thank you!
Try parsing the HTML using DOMDocument() instead of regex.
However, here is the regex to parse assuming there will be no nested div:
$html= 'Example: Lorem <div class=ab>1 2</div>ipsum <div class=ab>3 4</div>dolor.';
preg_match_all('|<div class=ab>([^<]*)</div>|i', $html, $m);
print_r($m[1]);
And for parsing the title id:
$html = 'http://www.imdb.com/title/tt29747">';
preg_match('|imdb\.com/title/(tt\d+)|i', $html, $m);
print_r($m[1]);

how do I use preg_replace_callback on unknown values?

I got some great help today with starting to understand preg_replace_callback with known values. But now I want to tackle unknown values.
$string = '<p id="keepthis"> text</p><div id="foo">text</div><div id="bar">more text</div><a id="red"href="page6.php">Page 6</a><a id="green"href="page7.php">Page 7</a>';
With that as my string, how would I go about using preg_replace_callback to remove all ids from the div and a tags while keeping the id on the p tag?
So from my string
<p id="keepthis"> text</p>
<div id="foo">text</div>
<div id="bar">more text</div>
<a id="red"href="page6.php">Page 6</a>
<a id="green"href="page7.php">Page 7</a>
to
<p id="keepthis"> text</p>
<div>text</div>
<div>more text</div>
<a href="page6.php">Page 6</a>
<a href="page7.php">Page 7</a>
There's no need of a callback.
$string = preg_replace('/(?<=<div|<a)( *id="[^"]+")/', ' ', $string);
However, if you do want to use preg_replace_callback:
echo preg_replace_callback(
    '/(?<=<div|<a)( *id="[^"]+")/',
    function ($match) {
        return " ";
    },
    $string
);
For your example, the following should work:
$result = preg_replace('/(<(a|div)[^>]*\s+)id="[^"]*"\s*/', '\1', $string);
In general, though, you should avoid parsing HTML with regular expressions and use a proper parser instead (for example, load the HTML into a DOMDocument and use the removeAttribute method). That way you can handle variations in markup and malformed HTML much better.
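The DOMDocument route with removeAttribute could look roughly like the sketch below. The helper name stripIds is hypothetical, and the sample string is the one from the question with a space added before each href so the markup is well-formed; treat it as an illustration under those assumptions rather than a drop-in solution.

```php
<?php
// Hypothetical helper: removes the id attribute from the given tag names only.
function stripIds(string $html, array $tags): string
{
    $doc = new DOMDocument();
    // Collect parse warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($tags as $tag) {
        foreach ($doc->getElementsByTagName($tag) as $element) {
            $element->removeAttribute('id');
        }
    }

    // loadHTML wraps the fragment in <html><body>; emit only the body's children.
    $out = '';
    $body = $doc->getElementsByTagName('body')->item(0);
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$string = '<p id="keepthis"> text</p><div id="foo">text</div><div id="bar">more text</div>'
        . '<a id="red" href="page6.php">Page 6</a><a id="green" href="page7.php">Page 7</a>';

echo stripIds($string, ['div', 'a']), "\n";
```

Listing the tag names explicitly keeps the p tag untouched without any negative matching, and other attributes such as href are left alone.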

HTML Regex to Extract Data

I have a simple question for regex gurus. And yes... I did try several different variations of the regex before posting here. Forgive my regex ignorance. This is targeting PHP.
I have the following HTML:
<div>
<h4>
some text blah
</h4>
I need this text<br />I need this text too.<br />
</div>
<div>
<h4>
some text blah
</h4>
I need this text<br />I need this text too.<br />
</div>
<div>
<h4>
some text blah
</h4>
I need this text<br />I need this text too.<br />
</div>
What I tried that seemed most likely to work:
preg_match_all('/<div><h4><a href=".*">.*<\/a><\/h4>(.*)<br \/>(.*)<br \/>/', $haystack, $result);
The above returns nothing.
So then I tried this and I got the first group to match, but I have not been able to get the second.
preg_match_all('/<div><h4><a href=".*">.*<\/a><\/h4>(.*)<br \/>/', $haystack, $result);
Thank you!
Regex is great. But, some things are best tackled with a parser. Markup is one such example.
Instead of using regex, I'd use an HTML parser, like http://simplehtmldom.sourceforge.net/
However, if you insist on using regex for this specific case, you can use this pattern:
if (preg_match('%</h4>(\\r?\\n)\\s+(.*?)(<br />)(.*?)(<br />)%', $subject, $regs)) {
$first_text_string = $regs[2];
$second_text_string = $regs[4];
} else {
//pattern not found
}
I highly recommend using DOM and XPath for this.
$doc = new DOMDocument;
@$doc->loadHTML($html); // @ suppresses warnings about imperfect markup
$xp = new DOMXPath($doc);
// The <br /> elements split the text, so each wanted line is its own text node.
foreach ($xp->query('//div/text()') as $n) {
    $text = trim($n->nodeValue);
    if ($text !== '') {
        echo $text . "\n";
    }
}
But if you still decide to take the regex route, this will work for you:
preg_match_all('#</h4>\s*([^<]+)<br />([^<]+)#', $str, $matches);
This will do what you want given the exact input you provided. If you need something more generic please let me know.
(.*)<br\s*\/>(.*)<br\s*\/>
See here for a live demo http://www.phpliveregex.com/p/1i3

Extract <span> tag contents in PHP

Say I have a WordPress post, and certain words are wrapped in span tags.
For example:
<p>John went to the <span>bakery</span> today,
and after picking up his favourite muffin
he made his way across to the <span>park</span>
and spent a couple hours on the <span>swings</span>
with his friends.</p>
Is there a way, using PHP, to dynamically output them (the words in the span tags) as an ordered list in my template file?
Like so:
<h3>What John Did Today</h3>
<ol>
<li>bakery</li>
<li>park</li>
<li>swings</li>
</ol>
If someone could point me in the right direction on how to do something like this, it would be much appreciated. Thank you.
$str = '<p>John went to the <span>bakery</span> today, and after picking up his favourite muffin he made his way across to the <span>park</span> and spent a couple hours on the <span>swings</span> with his friends.</p>';
$d = new DomDocument;
$d->loadHTML($str);
$xpath = new DOMXPath($d);
echo "<h3>What John Did Today</h3>\n";
echo "<ol>\n";
foreach ($xpath->query('//span') as $span)
echo "<li>".$span->nodeValue."</li>\n";
echo "</ol>\n";
A simple possibility is to use regular expressions; take a look at the preg_match function.
Parse the DOM :
http://simplehtmldom.sourceforge.net/
I'm not a regex whiz, but this should do the job of replacing <span> tags with <li> tags (the lazy (.*?) keeps each match inside a single span):
$str = preg_replace('/<span>(.*?)<\/span>/is', '<li>$1</li>', $str);
I know this doesn't directly answer your question, but it should help you with this at some point.
EDIT: a full working regex solution that collects all the span contents into an array and converts them to list items at the same time:
// input string:
$str = '<span>Walk</span> blah <span>Drive</span> blah blee blah <span>Eat</span>';
// get array of span matches
preg_match_all('/<span>(.*?)<\/span>/is', $str, $matches, PREG_SET_ORDER);
// build the list items from the captured contents
$spanArray = array();
foreach ($matches as $val) {
    $spanArray[] = '<li>' . $val[1] . '</li>';
}
If you then print_r($spanArray); you should get something that looks like this:
Array
(
[0] => <li>Walk</li>
[1] => <li>Drive</li>
[2] => <li>Eat</li>
)

How can I find the rest of a word from a string within it in PHP?

Let's say I have a page I want to scrape for words with "ice" in them, how can I do this easily? I see a lot of scrapers breaking things down into source code, but I don't need this. I just need something that searches through the plain text on the webpage.
Edit: I basically need something to search for .jpeg and find the entire file name. (it is in plain text on the website, not hidden in a tag)
Anything that matches the following is a word with ice in it:
/(\w*)ice(\w*)/i
(Do note that \w matches 0-9 and _ too. If that matters, /[a-z]*ice[a-z]*/i restricts the match to letters.)
UPDATE
To match file names (must not contain whitespace):
/\S+\.jpeg/i
Example:
<?php
$str = 'Picture of me: 238484534.jpeg and someone else img-of-someone.jpeg here';
$cnt = preg_match_all('/\S+\.jpeg/i', $str, $matches);
print_r($matches);
1. Do you want to search the text inside the HTML tags too, such as attribute values and names?
2. Or only the visible part of the web page?
For #1: the solutions are simple and already given in the other answers.
For #2:
Use PHP's DOMDocument class, then extract and search only the text content.
Documentation here:
http://php.net/manual/en/class.domdocument.php
See this for an example:
PHP DOMDocument stripping HTML tags
Some regex use will be needed for this. Below I use PCRE http://www.php.net/manual/en/ref.pcre.php and the function preg_match_all http://www.php.net/manual/en/function.preg-match-all.php
<?php
$html = <<<EOF
<html>
<head>
<title>Test</title>
</head>
<body>List of files:
<ul>
<li>test1.jpeg</li>
<li>test2.jpeg</li>
</ul>
</body>
</html>
EOF;
$matches = array();
$count = preg_match_all('/[0-9a-zA-Z_-]+\.jpeg/', $html, $matches);
// $matches[0] holds every full match of the pattern
foreach ($matches[0] as $filename) {
    print "Filename: {$filename}\n";
}
?>
Try this:
preg_match_all('/\w*ice\w*/', 'abc icecream lice', $matches);
print_r($matches);
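Since the question asks about the plain text of a page rather than its source, one option is to strip the markup first and run the same kind of pattern on what remains. A minimal sketch, assuming the HTML is already in a string; the helper name wordsContaining and the sample markup are hypothetical:

```php
<?php
// Hypothetical helper: finds whole words containing $needle in the visible text.
function wordsContaining(string $html, string $needle): array
{
    // strip_tags() drops the markup, leaving only text for the regex to scan.
    $text = strip_tags($html);
    // preg_quote() escapes any regex metacharacters in the search term.
    $pattern = '/\w*' . preg_quote($needle, '/') . '\w*/i';
    preg_match_all($pattern, $text, $matches);
    return $matches[0];
}

$html = '<p class="nice">Icecream costs twice the price of <b>rice</b>.</p>';
print_r(wordsContaining($html, 'ice'));
```

Here the attribute value "nice" is not reported, because it lives inside a tag and is removed before the search runs.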
