Extract <span> tag contents in PHP - php

Say I have a WordPress post, and certain words are wrapped in span tags.
For example:
<p>John went to the <span>bakery</span> today,
and after picking up his favourite muffin
he made his way across to the <span>park</span>
and spent a couple hours on the <span>swings</span>
with his friends.</p>
Is there then a way, using PHP, to dynamically output them (the words in the span tags) as an ordered list in my template file?
Like so:
<h3>What John Did Today</h3>
<ol>
<li>bakery</li>
<li>park</li>
<li>swings</li>
</ol>
If someone could point me in the right direction on how to do something like this, it would be much appreciated. Thank you.

$str = '<p>John went to the <span>bakery</span> today, and after picking up his favourite muffin he made his way across to the <span>park</span> and spent a couple hours on the <span>swings</span> with his friends.</p>';
// Parse the markup and select every <span> element with XPath.
$d = new DOMDocument;
$d->loadHTML($str);
$xpath = new DOMXPath($d);
echo "<h3>What John Did Today</h3>\n";
echo "<ol>\n";
foreach ($xpath->query('//span') as $span) {
    echo "<li>".$span->nodeValue."</li>\n";
}
echo "</ol>\n";
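The same DOM approach can be wrapped in a reusable function for a template file. This is only a sketch: the name span_list() is made up for illustration, and it assumes the post HTML is available as a string (in WordPress that could plausibly come from get_the_content(), which is not shown here). The text is escaped with htmlspecialchars() before being echoed back into HTML.

```php
<?php
// Hypothetical helper: builds an <ol> from the text of every <span> in $html.
function span_list(string $html): string
{
    $doc = new DOMDocument();
    // Collect parser warnings quietly instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $items = '';
    foreach ($doc->getElementsByTagName('span') as $span) {
        $items .= '<li>' . htmlspecialchars($span->textContent) . "</li>\n";
    }
    // Return an empty string when the post contains no <span> at all.
    return $items === '' ? '' : "<ol>\n" . $items . "</ol>\n";
}

echo span_list('<p>John went to the <span>bakery</span> and the <span>park</span>.</p>');
```

Returning a string rather than echoing inside the function keeps the heading and surrounding markup under the template's control.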

A simple possibility is to use regular expressions; take a look at the preg_match_all function, which returns every match rather than only the first.

Parse the DOM :
http://simplehtmldom.sourceforge.net/

I'm not a regex whiz, but this should do the job of replacing <span> tags with <li> tags:
$str = preg_replace("/<span>([^<]*)<\/span>/i", "<li>$1</li>", $str);
I know this doesn't directly answer your question, but it should help you with this at some point.
EDIT: a full working regex solution for getting all your span tags into an array and converting them to list items at the same time:
// input string:
$str = '<span>Walk</span> blah <span>Drive</span> blah blee blah <span>Eat</span>';
// get array of span matches
preg_match_all("/(<span>)(.*?)(<\/span>)/i", $str, $matches, PREG_SET_ORDER);
// build array using the exact matches
$spanArray = array();
foreach ($matches as $val) {
    $spanArray[] = preg_replace("/<span>([^<]*)<\/span>/i", "<li>$1</li>", $val[0]);
}
If you then print_r($spanArray); you should get something that looks like this:
Array
(
[0] => <li>Walk</li>
[1] => <li>Drive</li>
[2] => <li>Eat</li>
)

Related

preg_replace : getting a html tag inside an other html tag from BBCode

So I'm trying to write a PHP function that produces HTML tags from a BBCode-style form. I was able to convert tags pretty easily with preg_replace, but I run into trouble when a BBCode tag is nested inside the same BBCode tag...
Like this :
[blue]My [black]house is [blue]very[/blue] beautiful[/black] today[/blue]
So, when I "parse" it, I always have leftover BBCode for the blue tags. Something like:
My house is [blue]very[/blue] beautiful today
Everything is colored except for the blue-tag inside the black-tag inside the first blue-tag.
How the hell can I do that ?
For more information, here is what I tried:
Regex: "/\[blue\](.*)\[\/blue\]/si" or "/\[blue\](.*)\[\/blue\]/i"
Getting : "My house is [blue]very[/blue] beautiful today"
Regex : "/\[blue\](.*?)\[\/blue\]/si" or "/\[blue\](.*)\[\/blue\]/Ui"
Getting : "My house is [blue]very beautiful today[/blue]"
Do I have to loop the preg_replace ? Isn't there a way to do it, regex-style, without looping the thing ?
Thx for your concern. :)
It is true that you should not reinvent the wheel in products and should rather choose well-tested plugins. However, if you are experimenting or working on pet projects, by all means go ahead and experiment with things, have fun and gain important knowledge in the process.
With that said, you may try the following regex. I'll break it down below.
(\[(.*?)\])(.*)(\[/\2\])
Philosophy
While parsing markup like this, what you are actually trying to do is match tags with their pairs.
So, a clean approach would be to run a loop, capturing the outermost tag pair each time and replacing it.
With the regex above, the capture groups give you the following info on the first pass:
Opening tag (complete) [blue]
Opening tag (tag name) blue
Content between opening and closing tag My [black]house is [blue]very[/blue] beautiful[/black] today
Closing tag [/blue]
So, you can use $2 to determine the tag you are processing, and replace it with
<tag>$3</tag>
// or even
<$2>$3</$2>
Which will give you;
// in first iteration
<tag>My [black]house is [blue]very[/blue] beautiful[/black] today</tag>
// in second iteration
<tag>My <tag2>house is [blue]very[/blue] beautiful</tag2> today</tag>
// in third iteration
<tag>My <tag2>house is <tag3>very</tag3> beautiful</tag2> today</tag>
Code
$text = "[blue]My [black]house is [blue]very[/blue] beautiful[/black] today[/blue]";
function convert($input)
{
    $control = $input;
    while (true) {
        $input = preg_replace('~(\[(.*?)\])(.*)(\[/\2\])~s', '<$2>$3</$2>', $input);
        if ($control == $input) {
            break;
        }
        $control = $input;
    }
    return $input;
}
echo convert($text);
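Because the greedy content group always runs to the last closing tag of the same name, two sibling pairs such as [blue]a[/blue] x [blue]b[/blue] are treated as one outer pair by the loop above. A possible variation, sketched here under the assumption that tag names consist of word characters only, converts the innermost pairs first. The function name convert_innermost_first() is made up for illustration, and tag names are still copied into the output unchecked, so a whitelist would be needed for untrusted input.

```php
<?php
// Hypothetical helper: converts [tag]...[/tag] pairs to <tag>...</tag>,
// innermost pairs first, so nested and sibling tags are both handled.
function convert_innermost_first(string $input): string
{
    // The content group may not contain any other [tag] or [/tag] token,
    // so only pairs without nested BBCode match on a given pass.
    $pattern = '~\[(\w+)\]((?:(?!\[/?\w+\]).)*)\[/\1\]~s';
    do {
        // $count receives the number of replacements made on this pass.
        $input = preg_replace($pattern, '<$1>$2</$1>', $input, -1, $count);
    } while ($count > 0);
    return $input;
}

echo convert_innermost_first('[blue]a[/blue] and [blue]b [black]c[/black][/blue]'), "\n";
```

The loop stops as soon as a pass makes no replacement, so unmatched tags are simply left in place.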
As others mentioned, don't try to reinvent the wheel.
However, you could use a recursive approach:
<?php
$text = "[blue]My [black]house is [blue]very[/blue] beautiful[/black] today[/blue]";
$regex = '~(\[ ( (?>[^\[\]]+) | (?R) )* \])~x';
$replacements = array(
    "blue"   => "<bleu>",
    "black"  => "<noir>",
    "/blue"  => "</bleu>",
    "/black" => "</noir>"
);
$text = preg_replace_callback(
    $regex,
    function($match) use ($replacements) {
        return $replacements[$match[2]];
    },
    $text
);
echo $text;
# <bleu>My <noir>house is <bleu>very</bleu> beautiful</noir> today</bleu>
?>
Here, every colour tag is replaced by its French (just made it up) counterpart, see a demo on ideone.com. To learn more about recursive patterns, have a look at the PHP documentation on the subject.
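If the set of tags is known in advance and the input is assumed to be balanced, the same token-by-token idea also works without a recursive pattern, since every opening and closing tag can be translated on its own. The sketch below is one way to do that; the function name bbcode_colours(), the colour map and the choice of <span style="color:..."> as output are assumptions for illustration, not something taken from the answers above.

```php
<?php
// Hypothetical mapping from BBCode colour names to CSS colour values.
$colours = ['blue' => 'blue', 'black' => 'black'];

// Replaces each known [name] / [/name] token independently of its partner.
function bbcode_colours(string $text, array $colours): string
{
    $names = implode('|', array_map('preg_quote', array_keys($colours)));
    return preg_replace_callback(
        '~\[(/?)(' . $names . ')\]~i',
        function (array $m) use ($colours): string {
            if ($m[1] === '/') {
                return '</span>';            // closing tag
            }
            return '<span style="color:' . $colours[strtolower($m[2])] . '">';
        },
        $text
    );
}

echo bbcode_colours('[blue]My [black]house[/black] today[/blue]', $colours), "\n";
```

Unknown tags are left untouched because only the whitelisted names take part in the pattern. Unbalanced input produces unbalanced HTML, so this does not validate the markup.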

php, strpos extract digit from string

I have a huge HTML document to scan. Until now I have been using preg_match_all to extract the desired parts from it. The problem from the start was that it consumed a lot of CPU time, so we finally decided to use some other method for extraction. I read in some articles that preg_match can be compared in performance with strpos; they claim that strpos beats the regex scanner by up to 20 times in efficiency. I thought I would try this method, but I don't really know how to get started.
Let's say I have this HTML string:
<li id="ncc-nba-16451" class="che10">23 - Star</li>
<li id="ncd-bbt-5674" class="che10">54 - Moon</li>
<li id="ertw-cxda-c6543" class="che10">34,780 - Sun</li>
I want to extract only the number from each id and only the text (letters) from the content of the a tags, so I do this preg_match_all scan:
'/<li.*?id=".*?([\d]+)".*?<a.*?>.*?([\w]+)<\/a>/s'
here you can see the result: LINK
Now, if I wanted to replace my method with strpos, what would the approach look like? I understand that strpos returns the index where the match starts. But how can I use it to:
get all possible matches, not just one
extract numbers or text from the desired place in the string
Thank you for all the help and tips ;)
Using DOM
$html = '
<html>
<head></head>
<body>
<li id="ncc-nba-16451" class="che10">23 - Star</li>
<li id="ncd-bbt-5674" class="che10">54 - Moon</li>
<li id="ertw-cxda-c6543" class="che10">34,780 - Sun</li>
</body>
</html>';
$dom_document = new DOMDocument();
$dom_document->loadHTML($html);
$rootElement = $dom_document->documentElement;
$getId = $rootElement->getElementsByTagName('li');
$res = [];
foreach ($getId as $tag) {
    $data = explode('-', $tag->getAttribute('id'));
    $res['li_id'][] = end($data);
}
$getNode = $rootElement->getElementsByTagName('a');
foreach ($getNode as $tag) {
    $res['a_node'][] = $tag->parentNode->textContent;
}
print_r($res);
Output :
Array
(
[li_id] => Array
(
[0] => 16451
[1] => 5674
[2] => c6543
)
[a_node] => Array
(
[0] => 23 - Star
[1] => 54 - Moon
[2] => 34,780 - Sun
)
)
This regex finds a match in 24 steps using 0 backtracks
(?:id="[^\d]*(\d*))[^<]*(?:<a href="[^>]*>[^a-z]*([a-z]*))
The regex you posted requires 134 steps. Maybe you will notice a difference? Note that regex engines can optimize a pattern so that it minimizes backtracking. I used the RegexBuddy debugger to arrive at these numbers.
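For the strpos approach the question asks about, the usual trick for getting every match is to pass the position of the previous hit as the third (offset) argument and loop until strpos returns false. The following is a minimal sketch of that idea for the id numbers only; the function name ids_with_strpos() is made up, and it assumes every id attribute is written exactly as id="..." with double quotes. No claim is made here about how its speed compares with a regex.

```php
<?php
// Hypothetical helper: collects the trailing digits of every id="..."
// attribute using only string functions, without regular expressions.
function ids_with_strpos(string $html): array
{
    $needle = 'id="';
    $result = [];
    $offset = 0;
    // Each pass resumes the search just after the previous attribute.
    while (($start = strpos($html, $needle, $offset)) !== false) {
        $start += strlen($needle);
        $end = strpos($html, '"', $start);
        if ($end === false) {
            break; // unterminated attribute, stop scanning
        }
        $id = substr($html, $start, $end - $start);
        // Length of the run of digits at the end of the id value.
        $digits = strspn(strrev($id), '0123456789');
        if ($digits > 0) {
            $result[] = substr($id, -$digits);
        }
        $offset = $end + 1;
    }
    return $result;
}

print_r(ids_with_strpos('<li id="ncc-nba-16451" class="che10">23 - Star</li>'));
```

Note that the needle would also hit attributes whose name merely ends in id=" (for example data-id), so real markup may need a stricter check.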

How do I find scrape information between 2 tags?

I am trying to scrape information with PHP that has their data like so:
<br>1998 - <a href="http://example.com/movie/id/2345">A Night at the Roxburry<a/>
I need to get the year that is between the <br> and the <a> tag. I have gotten the title of the movie by using PHP Simple DOM HTML parser. This was the code that I used to parse the title
foreach ($dom->getElementsByTagName('a') as $link) {
    $title = $link->getAttribute('href');
}
I tried using:
$string = '<br>1998 - <a href="http://example.com/movie/id/2345">A Night at the Roxburry<a/>';
$year = preg_match_all('/<br>(.*)<a>', $string);
But it's not finding the year that is in between the <br> and the <a> tag. Does anyone know what I could possibly do to find the year?
Try this:
<?php
$subject = '<br>1998 - <a href="http://site.com/movie/id/2345">A Night at the Roxburry<a/>';
// Capture the four digits that follow <br>; the year ends up in $matches[1].
$pattern = '/<br>([0-9]{4})/';
preg_match($pattern, $subject, $matches);
print_r($matches);
?>
Note that you can change the pattern if the year is shown in some other format. If you want to see everything between the two tags, you can use $pattern = '/<br>.*<a/'; or any other pattern appropriate for you.
The expression you are using, $year = preg_match_all('/<br>(.*)<a>', $string);, looks for text between <br> and <a>, but your example does not contain <a> anywhere. Try looking for text between <br> and <a like this:
$count = preg_match_all('/<br>([^<]*)<a/', $string, $matches); // captured text is in $matches[1]
Note that I also changed . to [^<] to make sure it stops at the next tag; otherwise it would match strings like this:
<br>foo<br><br>1998 - <a href="http://site.com/movie/id/2345">A Night at the Roxburry<a
because they start with <br> and end with <a, but this is probably not what you need, and your year would look like this:
foo<br><br>1998 - <a href="http://site.com/movie/id/2345">A Night at the Roxburry
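Putting the two answers together, a single pattern can capture the year and the link text in one pass. This is a sketch that assumes every entry has the shape shown in the question (a four-digit year, a hyphen, then the link); the array keys year and title are names chosen for illustration.

```php
<?php
$string = '<br>1998 - <a href="http://example.com/movie/id/2345">A Night at the Roxburry<a/>';

// Group 1: four-digit year after <br>; group 2: link text up to the next tag.
$pattern = '~<br>\s*(\d{4})\s*-\s*<a\b[^>]*>([^<]*)~i';

$movies = [];
// PREG_SET_ORDER yields one array per entry, holding both groups together.
if (preg_match_all($pattern, $string, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        $movies[] = ['year' => (int) $m[1], 'title' => trim($m[2])];
    }
}
print_r($movies);
```

Using [^>]* and [^<]* instead of .* keeps each group inside its own tag, for the same reason given above.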

How can I find the rest of a word from a string within it in PHP?

Let's say I have a page I want to scrape for words with "ice" in them, how can I do this easily? I see a lot of scrapers breaking things down into source code, but I don't need this. I just need something that searches through the plain text on the webpage.
Edit: I basically need something to search for .jpeg and find the entire file name. (it is in plain text on the website, not hidden in a tag)
Anything that matches the following is a word with ice in it:
/(\w*)ice(\w*)/i
(Do note that \w matches 0-9 and _ too. Restricting the match to letters might give better results: /\b[a-z]*ice[a-z]*\b/i)
UPDATE
To match file names (must not contain whitespace):
/\S+\.jpeg/i
Example:
<?php
$str = 'Picture of me: 238484534.jpeg and someone else img-of-someone.jpeg here';
$cnt = preg_match_all('/\S+\.jpeg/i', $str, $matches);
print_r($matches);
1. Do you want to read the words inside the HTML tags too, like attributes and names?
2. Or only the visible part of the webpage?
For #1: the solutions are simple and already there, as mentioned in other answers.
For #2:
Use the PHP DOMDocument class, and extract and search in the text content only.
Documentation here:
http://php.net/manual/en/class.domdocument.php
see this for example:
PHP DOMDocument stripping HTML tags
Some regex use will be needed for this. Below I use PCRE http://www.php.net/manual/en/ref.pcre.php and the function preg_match_all http://www.php.net/manual/en/function.preg-match-all.php
<?php
$html = <<<EOF
<html>
<head>
<title>Test</title>
</head>
<body>List of files:
<ul>
<li>test1.jpeg</li>
<li>test2.jpeg</li>
</ul>
</body>
</html>
EOF;
$matches = array();
// preg_match_all() stores the full-pattern matches in $matches[0].
$count = preg_match_all('/[0-9a-zA-Z_-]+\.jpeg/', $html, $matches);
if ($count > 0) {
    foreach ($matches[0] as $filename) {
        print "Filename: {$filename}\n";
    }
}
?>
try this:
preg_match_all('/\w*ice\w*/', 'abc icecream lice', $matches);
print_r($matches);
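To search only the text of the page and ignore attribute values, the DOM suggestion and the \S+\.jpeg pattern from the answers above can be combined. The sketch below walks the text nodes under <body>; the function name visible_jpeg_names() is made up for illustration. Text inside <script> or <style> elements placed in the body would still be searched, so "visible" is only an approximation.

```php
<?php
// Hypothetical helper: returns .jpeg file names found in text nodes of <body>.
function visible_jpeg_names(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parser warnings quietly
    $doc->loadHTML($html);
    libxml_clear_errors();

    $names = [];
    $xpath = new DOMXPath($doc);
    // Text nodes only: attribute values such as src="..." are never visited.
    foreach ($xpath->query('//body//text()') as $node) {
        if (preg_match_all('/\S+\.jpeg/i', $node->nodeValue, $m)) {
            $names = array_merge($names, $m[0]);
        }
    }
    return $names;
}

print_r(visible_jpeg_names('<html><body><ul><li>test1.jpeg</li><li>test2.jpeg</li></ul></body></html>'));
```

Matching per text node also avoids gluing together names from adjacent elements, which can happen when all tags are stripped from the whole document first.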

regex php: find everything in div

I'm trying to find everything inside a div using regexp. I'm aware that there probably is a smarter way to do this, but I've chosen regexp.
So currently my regexp pattern looks like this:
$gallery_pattern = '/<div class="gallery">([\s\S]*)<\/div>/';
And it does the trick - somewhat.
The problem is if I have two divs after each other, like this:
<div class="gallery">text to extract here</div>
<div class="gallery">text to extract from here as well</div>
I want to extract the information from both divs, but my problem, when testing, is that I'm not getting the text in between as a result, but instead:
"text to extract here </div>
<div class="gallery">text to extract from here as well"
So, to sum up: it skips the first end of the div and continues on to the next.
The text inside the div can contain <, / and line breaks, just so you know!
Does anyone have a simple solution to this problem? I'm still a regexp novice.
You shouldn't be using regex to parse HTML when there's a convenient DOM library:
$str = '
<div class="gallery">text to extract here</div>
<div class="gallery">text to extract from here as well</div>
';
$doc = new DOMDocument();
$doc->loadHTML($str);
$divs = $doc->getElementsByTagName('div');
// DOMNodeList exposes its size through the length property.
if ( $divs->length ) {
    foreach ( $divs as $div ) {
        echo $div->nodeValue . '<br>';
    }
}
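The loop above returns every div on the page. To limit the DOM approach to the gallery divs, and to keep it working when other divs are nested inside them, an XPath query on the class attribute is one option. This is a sketch; the function name gallery_texts() is made up for illustration, and it returns the text of each div with inner tags removed.

```php
<?php
// Hypothetical helper: returns the text of every <div> whose class list
// contains "gallery", including text of any elements nested inside it.
function gallery_texts(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parser warnings quietly
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Standard XPath 1.0 idiom for "class attribute contains this word".
    $query = '//div[contains(concat(" ", normalize-space(@class), " "), " gallery ")]';

    $texts = [];
    foreach ($xpath->query($query) as $div) {
        $texts[] = trim($div->textContent);
    }
    return $texts;
}

print_r(gallery_texts('<div class="gallery">text to extract here</div>'));
```

If the inner markup is wanted rather than plain text, passing each child node of the div to $doc->saveHTML() should return it as HTML.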
What about something like this :
$str = <<<HTML
<div class="gallery">text to extract here</div>
<div class="gallery">text to extract from here as well</div>
HTML;
$matches = array();
preg_match_all('#<div[^>]*>(.*?)</div>#s', $str, $matches);
var_dump($matches[1]);
Note the '?' in the regex, so it is "not greedy".
Which will get you :
array
0 => string 'text to extract here' (length=20)
1 => string 'text to extract from here as well' (length=33)
This should work fine... if you don't have nested divs. If you do... well... actually: are you really sure you want to use regular expressions to parse HTML, which is not that regular itself?
A possible answer to this problem can be found at http://simplehtmldom.sourceforge.net/
That class helped me solve a similar problem quickly.
