Using simple_html_dom to parse a ul - php

I would like to get the inner text of each span in this ul.
<ul class="alternatingList">
<li><strong>Last Played</strong><span id="ctl00_mainContent_lastPlayedLabel">04.29.2011</span></li>
<li class="alt"><strong>Armory Completion</strong><span id="ctl00_mainContent_armorCompletionLabel">52%</span></li>
<li><strong>Daily Challenges</strong><span id="ctl00_mainContent_dailyChallengesLabel">127</span></li>
<li class="alt"><strong>Weekly Challenges</strong><span id="ctl00_mainContent_weeklyChallengesLabel">4</span></li>
<li><strong>Matchmaking MP Kills</strong><span id="ctl00_mainContent_matchmakingKillsLabel">11,280 (1.18)</span></li>
<li class="alt"><strong>Matchmaking MP Medals</strong><span id="ctl00_mainContent_medalsLabel">15,383</span></li>
<li><strong>Covenant Killed</strong><span id="ctl00_mainContent_covenantKilledLabel">10,395</span></li>
<li class="alt"><strong>Player Since</strong><span id="ctl00_mainContent_playerSinceLabel">09.13.2010</span></li>
<li class="gamesPlayed"><strong>Games Played</strong><span id="ctl00_mainContent_gamesPlayedLabel">975</span></li>
</ul>
I have this right now but I want to do it without writing the same code over for each span.
//pull last played text
$last_played = '';
$last_played_el = $html->find(".alternatingList");
if (preg_match('|<span[^>]+>(.*)</span>|U', $last_played_el[0], $matches)) {
$last_played = $matches[1];
}
echo $last_played;

You're already using a parser, so why use a regex? You can access each span's inner text quite easily:
$html = new simple_html_dom();
// Load from a string, where $raw is your UL or the page it's on
$html->load($raw);
foreach ($html->find('.alternatingList span') as $found) {
echo $found->innertext . "\n";
}
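If you also want to know which value belongs to which label, here is a minimal sketch of the same idea. It assumes PHP's built-in DOM extension (DOMDocument and DOMXPath) rather than simple_html_dom, so it runs without the library; $raw and $stats are illustrative names, and the list is shortened to three items.

```php
<?php
// Sketch only: uses the built-in DOM extension instead of simple_html_dom.
$raw = '<ul class="alternatingList">'
     . '<li><strong>Last Played</strong><span id="ctl00_mainContent_lastPlayedLabel">04.29.2011</span></li>'
     . '<li class="alt"><strong>Armory Completion</strong><span id="ctl00_mainContent_armorCompletionLabel">52%</span></li>'
     . '<li class="gamesPlayed"><strong>Games Played</strong><span id="ctl00_mainContent_gamesPlayedLabel">975</span></li>'
     . '</ul>';

$dom = new DOMDocument();
$dom->loadHTML($raw);
$xpath = new DOMXPath($dom);

// Map each <strong> label to the text of the <span> in the same <li>.
$stats = array();
foreach ($xpath->query('//ul[@class="alternatingList"]/li') as $li) {
    $label = $xpath->query('strong', $li)->item(0);
    $value = $xpath->query('span', $li)->item(0);
    if ($label !== null && $value !== null) {
        $stats[trim($label->textContent)] = trim($value->textContent);
    }
}
print_r($stats);
```

One loop handles every row, so nothing has to be repeated per span.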

Have a look at preg_match_all. It fills an array with every match instead of stopping at the first one.
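A small sketch of that, reusing the pattern from the question on a plain string. $listHtml and $values are made-up names, and the list is cut down to two items:

```php
<?php
// Two items from the question's list, as a plain string.
$listHtml = '<ul class="alternatingList">'
          . '<li><strong>Daily Challenges</strong><span id="ctl00_mainContent_dailyChallengesLabel">127</span></li>'
          . '<li class="alt"><strong>Weekly Challenges</strong><span id="ctl00_mainContent_weeklyChallengesLabel">4</span></li>'
          . '</ul>';

// The U modifier makes .* lazy, so each match stops at the nearest </span>.
preg_match_all('|<span[^>]*>(.*)</span>|U', $listHtml, $matches);

// $matches[1] holds the first capture group of every match.
$values = $matches[1];
print_r($values);
```

This is fine for flat markup like this list, but it will break on nested spans, which is where the parser approach is safer.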

How to delete a specific tag using regex

I'm trying to delete a ul tag and nested tags inside that ul.
<ul class="related">
<li>Related article</li>
<li>Related article 2</li>
</ul>
I have already deleted the nested li elements inside the ul using this (I'm using PHP, and the content is pulled from a db as $content):
$content = $rq['content']; //here is the <ul class="related">... code
$content1 = preg_replace('~<ul[^>]*>\K(?:(?!<ul).)*(?=</ul>)~Uis', '', $content); //it works here
So far I get the following string in $content1:
<ul class="related"></ul>
So how do I delete this remaining piece of code using regex? I tried a similar pattern but did not get the result I wanted.
$finalcontent = preg_replace('~<ul[^>]*>\K.*(?=</ul>)~Uis', '', $content1);
The following may suit your purpose:
$content1 = '<p>Foo</p><ul class="related"></ul><p>Bar</p>';
$finalcontent = preg_replace('~<ul[^>]*>.*</ul>~Uis', '', $content1);
echo $finalcontent;
The preg_replace call should remove all occurrences of <ul...>...</ul> from $content1. For the given example content, it returns:
<p>Foo</p><p>Bar</p>
If you want the replacement to be more specific, e.g., in order to only remove occurrences of <ul class="related">...</ul> but not other types of <ul>...</ul>, you can make the regex more specific. For example:
$content1 = '<p>Foo</p><ul class="related"></ul><p>Bar</p><ul><li>Do not delete this one</li></ul>';
$finalcontent = preg_replace('~<ul class="related">.*</ul>~Uis', '', $content1);
echo $finalcontent;
For the given example, this would return:
<p>Foo</p><p>Bar</p><ul><li>Do not delete this one</li></ul>
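The literal pattern above only matches when class="related" is the sole attribute, written exactly that way. As a hedged sketch, a slightly more tolerant variant could allow other attributes around it; the id="x" attribute and the variable names here are invented for illustration:

```php
<?php
// Hypothetical input: the target ul carries an extra attribute.
$content = '<p>Foo</p><ul id="x" class="related"><li>Related article</li></ul>'
         . '<p>Bar</p><ul><li>Keep</li></ul>';

// [^>]* before and after the class attribute lets other attributes appear.
// U makes the quantifiers lazy, s lets . match newlines, i ignores case.
$pattern = '~<ul\b[^>]*\bclass="related"[^>]*>.*</ul>~Uis';

$cleaned = preg_replace($pattern, '', $content);
echo $cleaned;
```

Like any regex over HTML, this stops at the first </ul> it meets, so it is not suitable when the related list itself contains nested ul elements.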

PHP DOM parser: get li plaintext

I have HTML like this:
<li>
TEXT <---- GET THIS TEXT
<ul>
<li>a</li>
<li>aa</li>
</ul>
</li>
I want to get "TEXT" from the li element, but when I try to get the li element I get the text of all its nested elements as well.
This is my code:
$html = str_get_html('<li>TEXT<ul><li>a</li><li>aa</li></ul></li>');
echo $html->find('li', 0)->plaintext;
Output:
TEXTaaa
But I need to get only TEXT, and I can't add an id or anything else to the markup.
Each part before/after a node is a textnode, so you just need to get the first childnode:
$foo->firstChild->textContent;
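Note that this is DOMDocument syntax. Simple HTML DOM is a separate library with its own node API (for example first_child()), so the line above needs adapting there.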
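For reference, a minimal sketch of that idea with PHP's built-in DOMDocument, not simple_html_dom. The outer ul wrapper is added here only to make the fragment valid HTML, and the variable names are illustrative:

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<ul><li>TEXT<ul><li>a</li><li>aa</li></ul></li></ul>');

// The first li in document order is the outer one.
$li = $dom->getElementsByTagName('li')->item(0);

// Its first child is the text node that precedes the nested ul.
$first = $li->firstChild;
$text = ($first !== null && $first->nodeType === XML_TEXT_NODE)
    ? trim($first->textContent)
    : '';
echo $text;
```

The nodeType check guards against the case where the li starts with an element instead of text.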
I solved it! What you needed was to grab the first textnode:
<?php
require_once 'simple_html_dom.php';
$html = str_get_html('<li>TEXT<ul><li>a</li><li>aa</li></ul></li>');
echo $html->find('li text', 0)->plaintext;
?>
OK, another example:
$html = str_get_html('<li>TEXTb<ul><li>a</li><li>aa</li></ul></li>');
echo $html->find('li', 0)->first_child()->plaintext;
Now I get "b". How do I get "TEXT" in this situation?
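One possible way to handle that case is to collect only the li's own text nodes and skip every child element. This sketch uses DOMDocument rather than simple_html_dom, and the markup is an assumption: it guesses that the "b" sits in an inline element such as <b> right after the text.

```php
<?php
// Assumed markup: an inline <b> follows the text we want.
$dom = new DOMDocument();
$dom->loadHTML('<ul><li>TEXT<b>b</b><ul><li>a</li><li>aa</li></ul></li></ul>');
$li = $dom->getElementsByTagName('li')->item(0);

// Keep direct text-node children only; child elements are ignored.
$ownText = '';
foreach ($li->childNodes as $child) {
    if ($child->nodeType === XML_TEXT_NODE) {
        $ownText .= $child->textContent;
    }
}
$ownText = trim($ownText);
echo $ownText;
```

Because it filters by node type and not by position, it does not matter which elements come before or after the text.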

Remove specific li element from list when parsing page with simple_html_dom

I am pulling a page with simple_html_dom, and on that page there is a ul whose li elements I need to pull. The problem is that these are video tags mixed in with other elements that I don't need.
Here is an example of original page source:
<ul id="video-tags">
<li>Uploader: </li>
<li class="profile_name">Sarasubmit.</li>
<li><em>Tagged: </em></li>
<li>makeup, </li>
<li>cosmetic, </li>
<li>liner, </li>
<li>fresh, </li>
<li>girls, </li>
<li>fashion, </li>
<li>more <strong>tags</strong>.</li>
</ul>
After pulling the page, I tried this to get the tags:
$get_tags = $video_page_url->find('ul[id="video-tags"]', 0);
$post_tags_arr = array();
foreach($get_tags->find('a') as $tag) {
$post_tags_arr[] = $tag->plaintext;
}
$post_tags = implode(', ', $post_tags_arr);
This way I get all the a elements inside the li elements and output their text. But since the profile name and "more tags" are also links, I get those two as well, so I end up with this:
sarasubmit, makeup, cosmetic, liner, fresh, girls, fashion, tags
Is there a way that I can just strip out tags and remove other elements so I end up like this:
makeup, cosmetic, liner, fresh, girls, fashion,
Edit: Just to mention, the username is not constant; it changes depending on who uploaded the video. Also, some videos have no tags at all, and others have more or fewer tags, so everything is dynamic.
You may try something like this:
foreach($get_tags->find('li[!class] a') as $tag) {
if($tag->plaintext != 'tags') $post_tags_arr[] = $tag->plaintext;
}
Instead of this:
foreach($get_tags->find('a') as $tag) {
$post_tags_arr[] = $tag->plaintext;
}
Update: I've tested:
$htmlStr = '<ul id="video-tags">
<li>Uploader: </li>
<li class="profile_name">Sarasubmit.</li>
<li><em>Tagged: </em></li>
<li>makeup, </li>
<li>cosmetic, </li>
<li>liner, </li>
<li>fresh, </li>
<li>girls, </li>
<li>fashion, </li>
<li>more <strong>tags</strong>.</li>
</ul>';
$html = str_get_html($htmlStr);
foreach($html->find('li[!class] a') as $tag) {
if($tag->plaintext != 'tags') $post_tags_arr[] = $tag->plaintext;
}
print_r($post_tags_arr);
Output:
Array
(
[0] => makeup
[1] => cosmetic
[2] => liner
[3] => fresh
[4] => girls
[5] => fashion
)
So, try this:
$html = file_get_html($url);
foreach($html->find('li[!class] a') as $tag) {
if($tag->plaintext != 'tags') $post_tags_arr[] = $tag->plaintext;
}
Check the manual.
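The same filter can also be sketched with PHP's built-in DOMXPath, in case simple_html_dom is not available. The a elements and their href="#" values below are assumptions based on the question's description of the links, and the tag list is shortened:

```php
<?php
// Assumed markup: each tag, the profile name, and "tags" are links.
$htmlStr = '<ul id="video-tags">'
         . '<li>Uploader: </li>'
         . '<li class="profile_name"><a href="#">Sarasubmit</a>.</li>'
         . '<li><em>Tagged: </em></li>'
         . '<li><a href="#">makeup</a>, </li>'
         . '<li><a href="#">cosmetic</a>, </li>'
         . '<li>more <a href="#"><strong>tags</strong></a>.</li>'
         . '</ul>';

$dom = new DOMDocument();
$dom->loadHTML($htmlStr);
$xpath = new DOMXPath($dom);

// li[not(@class)] skips the profile_name item; the "tags" link is
// filtered by its text, as in the answer above.
$tags = array();
foreach ($xpath->query('//ul[@id="video-tags"]/li[not(@class)]/a') as $a) {
    $name = trim($a->textContent);
    if ($name !== 'tags') {
        $tags[] = $name;
    }
}
echo implode(', ', $tags);
```

If a video has no tags, $tags simply stays empty, so the dynamic case needs no extra handling.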

Get the inner text using curl concept in php

This is the HTML from the website. I want to grab the text
1,000 Places To See Before You Die
<ul class="listings">
<li>
<a href="http://watchseries.eu/serie/1,000_places_to_see_before_you_die" title="1,000 Places To See Before You Die">
1,000 Places To See Before You Die
<span class="epnum">2009</span>
</a>
</li>
I used code like this:
foreach($html->find('ul.listings li a') as $e)
echo $e->innertext. '<br/>';
The output I am getting looks like this:
999: Whats Your Emergency<span class="epnum">2012</span>
It includes the span. Please help me with this.
Why not use DOMDocument and get the title attribute?
$string = '<ul class="listings">
<li>
<a href="http://watchseries.eu/serie/1,000_places_to_see_before_you_die" title="1,000 Places To See Before You Die">
1,000 Places To See Before You Die
<span class="epnum">2009</span>
</a>
</li>';
$dom = new DOMDocument;
$dom->loadHTML($string);
$xpath = new DOMXPath($dom);
$text = $xpath->query('//ul[@class="listings"]/li/a/@title')->item(0)->nodeValue;
echo $text;
or
$text = explode("\n", trim($xpath->query('//ul[@class="listings"]/li/a')->item(0)->nodeValue));
echo $text[0];
There are two ways I can think of to solve this. One is to grab the title attribute from the anchor tag. Of course, not every site sets a title attribute on its anchors, and its value may differ from the link text. The other is to take the innertext and replace every child of the anchor tag with an empty string.
So, either do this
$e->title;
or this
$text = $e->innertext;
foreach ($e->children() as $child)
{
$text = str_replace($child, '', $text);
}
Though, it might be a good idea to use DOMDocument instead for this.
You can use strip_tags() to drop the markup (the span's text, i.e. the year, stays in the result):
echo trim(strip_tags($e->innertext));
Or try to use preg_replace() to remove unwanted tag and its content
echo preg_replace('/<span[^>]*>([\s\S]*?)<\/span[^>]*>/', '', $e->innertext);
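To make the difference between those two calls concrete, here is a small sketch on a plain string standing in for $e->innertext ($inner and the result names are illustrative):

```php
<?php
// Stand-in for $e->innertext from the question.
$inner = '1,000 Places To See Before You Die <span class="epnum">2009</span>';

// strip_tags() removes the tags but keeps the text inside them.
$stripped = trim(strip_tags($inner));

// The regex removes the span together with its content.
$withoutSpan = trim(preg_replace('/<span[^>]*>[\s\S]*?<\/span[^>]*>/', '', $inner));

echo $stripped . "\n";
echo $withoutSpan . "\n";
```

So strip_tags() alone still leaves the year, while the preg_replace() version yields only the title.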
Use plaintext instead.
echo $e->plaintext;
The year will still be present, but you can trim it off with a regexp.
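A rough sketch of that trimming step, assuming the year is always four digits at the very end of the string ($plain is a stand-in for $e->plaintext):

```php
<?php
// Stand-in for $e->plaintext: title followed by the year.
$plain = '1,000 Places To See Before You Die 2009';

// Remove optional whitespace, four digits, and trailing whitespace at the end.
$title = preg_replace('/\s*\d{4}\s*$/', '', $plain);
echo $title;
```

This would also eat the last four digits of a title that itself ends in a number, so it only suits listings where the year is always present.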
Example from the documentation:
$html = str_get_html("<div>foo <b>bar</b></div>");
$e = $html->find("div", 0);
echo $e->tag; // Returns: "div"
echo $e->outertext; // Returns: "<div>foo <b>bar</b></div>"
echo $e->innertext; // Returns: "foo <b>bar</b>"
echo $e->plaintext; // Returns: "foo bar"
First of all, check your HTML. Right now it looks like this:
$string = '<ul class="listings">
<li>
<a href="http://watchseries.eu/serie/1,000_places_to_see_before_you_die" title="1,000 Places To See Before You Die">
1,000 Places To See Before You Die
<span class="epnum">2009</span>
</a>
</li>';
There is no closing tag for the ul; perhaps you missed it. With it added:
$string = '<ul class="listings">
<li>
<a href="http://watchseries.eu/serie/1,000_places_to_see_before_you_die" title="1,000 Places To See Before You Die">
1,000 Places To See Before You Die
<span class="epnum">2009</span>
</a>
</li>
</ul>';
Then try it like this:
$xml = simplexml_load_string($string);
echo $xml->li->a['title'];

Replace UL tags with specific class

I'm trying to replace all ul tags with a level0 class, something like this:
<ul>
<li>Test
<ul class="level0">
...
</ul>
</li>
</ul>
would be processed to
<ul>
<li>Test</li>
</ul>
I tried
$_menu = preg_replace('/<ul class="level0">(.*)<\/ul>/iU', "", $_menu);
but it's not working, help?
Thanks.
Yehia
I am sure this is a duplicate, but anyway, here is how to do it with DOM
$dom = new DOMDocument; // init new DOMDocument
$dom->loadHTML($html); // load HTML into it
$xpath = new DOMXPath($dom); // create a new XPath
$nodes = $xpath->query('//ul[@class="level0"]'); // Find all UL with class
foreach($nodes as $node) { // Iterate over found elements
$node->parentNode->removeChild($node); // Remove UL Element
}
echo $dom->saveHTML(); // output cleaned HTML
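Try /isU instead of /iU: the s modifier lets . match newlines. (Dropping U, as in /mis, makes .* greedy, so the match would run to the last </ul>.)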
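A quick sketch of why the modifier matters, using the markup from the question written with explicit "\n" line breaks ($menu, $unchanged and $cleaned are illustrative names):

```php
<?php
// The question's markup, with explicit newlines.
$menu = "<ul>\n<li>Test\n<ul class=\"level0\">\n...\n</ul>\n</li>\n</ul>";

// Without s, . cannot cross a newline, so nothing matches.
$unchanged = preg_replace('/<ul class="level0">(.*)<\/ul>/iU', '', $menu);

// With s, the lazy .* can span the lines up to the first </ul>.
$cleaned = preg_replace('/<ul class="level0">(.*)<\/ul>/isU', '', $menu);

var_dump($unchanged === $menu);
echo $cleaned;
```

The lazy match still stops at the first </ul>, so a level0 list with its own nested ul would need the DOM approach instead.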
Your code works fine, except that $_menu contains characters your pattern does not account for, even though it looks fine visually. The string also contains tabs, line breaks, and spaces, which the regex isn't looking for. You can resolve this as follows
(for example):
$_menu='<ul>
<li>Test
<ul class="level0">
...
</ul>
</li>
</ul>
';
$breaks = array("\r\n", "\n", "\r", "\t", "\0", "\x0B"); // line breaks, tabs and other control characters
$_menu=str_replace($breaks,"",$_menu);
$_menu = preg_replace('/<ul class="level0">(.*)<\/ul>/iU', "", $_menu);
Try:
$str ='<ul>
<li>Test
<ul class="level0">
tsts
</ul>
</li>
</ul>
';
//echo '<pre>';
$str = preg_replace(array("/(\s\/\/.*\\n)/","/(\\t|\\r|\\n)/",'/<!--(.*)-->/Uis','/>\\s+</'),array("","","",'><'),$str);
echo preg_replace('/<ul class="level0">(.*)<\/li>/',"</li>",trim($str));
