Get the inner text using curl concept in php - php

This is the HTML from the website. I want to grab:
1,000 Places To See Before You Die
<ul class="listings">
<li>
<a href="http://watchseries.eu/serie/1,000_places_to_see_before_you_die" title="1,000 Places To See Before You Die">
1,000 Places To See Before You Die
<span class="epnum">2009</span>
</a>
</li>
I used code like this:
foreach($html->find('ul.listings li a') as $e)
echo $e->innertext. '<br/>';
The output I am getting looks like this:
999: Whats Your Emergency<span class="epnum">2012</span>
It includes the span. Please help me get only the title.

Why not use DOMDocument and read the title attribute?
$string = '<ul class="listings">
<li>
<a href="http://watchseries.eu/serie/1,000_places_to_see_before_you_die" title="1,000 Places To See Before You Die">
1,000 Places To See Before You Die
<span class="epnum">2009</span>
</a>
</li>';
$dom = new DOMDocument;
$dom->loadHTML($string);
$xpath = new DOMXPath($dom);
$text = $xpath->query('//ul[@class="listings"]/li/a/@title')->item(0)->nodeValue;
echo $text;
or
$text = explode("\n", trim($xpath->query('//ul[@class="listings"]/li/a')->item(0)->nodeValue));
echo $text[0];

There are two ways I can think of to solve this. One is to grab the title attribute from the anchor tag. Of course, not every site sets a title attribute on its anchors, and its value may differ from the link text. The other is to take innertext and then remove the markup of every child of the anchor tag from it.
So, either do this
$e->title;
or this
$text = $e->innertext;
foreach ($e->children() as $child)
{
$text = str_replace($child->outertext, '', $text);
}
Though it might be a good idea to use DOMDocument for this instead.
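For what it's worth, here is a minimal sketch of that DOMDocument route. It assumes the markup from the question; the helper name ownTextOfLinks() is made up for illustration. It keeps only the text nodes that sit directly inside each anchor, so the span with the year is skipped.

```php
<?php
// Hypothetical helper: returns each anchor's own text, ignoring child elements.
function ownTextOfLinks(string $markup): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate imperfect real-world HTML
    $dom->loadHTML($markup);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $titles = [];
    foreach ($xpath->query('//ul[@class="listings"]/li/a') as $a) {
        $text = '';
        foreach ($a->childNodes as $child) {
            // Keep direct text nodes only; <span class="epnum"> is an element and is skipped.
            if ($child->nodeType === XML_TEXT_NODE) {
                $text .= $child->nodeValue;
            }
        }
        $titles[] = trim($text);
    }
    return $titles;
}

$markup = '<ul class="listings"><li>
<a href="http://watchseries.eu/serie/1,000_places_to_see_before_you_die" title="1,000 Places To See Before You Die">
1,000 Places To See Before You Die
<span class="epnum">2009</span>
</a></li></ul>';

echo implode("\n", ownTextOfLinks($markup)), "\n";
```

Unlike the title attribute, this does not depend on the site filling in title for every link.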

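You can use strip_tags() for that, although it removes only the tags themselves, so the year inside the span stays in the output:
echo trim(strip_tags($e->innertext));
Or use preg_replace() to remove the unwanted tag together with its content: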
echo preg_replace('/<span[^>]*>([\s\S]*?)<\/span[^>]*>/', '', $e->innertext);

Use plaintext instead.
echo $e->plaintext;
The year will still be present, but you can trim it off with a regular expression.
Example from the documentation:
$html = str_get_html("<div>foo <b>bar</b></div>");
$e = $html->find("div", 0);
echo $e->tag; // Returns: "div"
echo $e->outertext; // Returns: "<div>foo <b>bar</b></div>"
echo $e->innertext; // Returns: "foo <b>bar</b>"
echo $e->plaintext; // Returns: "foo bar"
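A rough sketch of the "trim off the year" step, assuming the year is always a four-digit number at the very end of the plaintext. The function name stripTrailingYear() is invented for this example, and a title that itself ends in four digits would be cut as well.

```php
<?php
// Hypothetical helper: removes a trailing four-digit year from scraped link text.
function stripTrailingYear(string $plaintext): string
{
    // Collapse newlines and repeated spaces into single spaces first.
    $normalized = trim(preg_replace('/\s+/', ' ', $plaintext));
    // Then drop an optional run of whitespace plus four digits at the end.
    return preg_replace('/\s*\d{4}$/', '', $normalized);
}

echo stripTrailingYear("1,000 Places To See Before You Die\n2009"), "\n";
echo stripTrailingYear('999: Whats Your Emergency2012'), "\n";
```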

First of all, check your HTML. Right now it looks like this:
$string = '<ul class="listings">
<li>
<a href="http://watchseries.eu/serie/1,000_places_to_see_before_you_die" title="1,000 Places To See Before You Die">
1,000 Places To See Before You Die
<span class="epnum">2009</span>
</a>
</li>';
There is no closing tag for the ul; perhaps you missed it.
$string = '<ul class="listings">
<li>
<a href="http://watchseries.eu/serie/1,000_places_to_see_before_you_die" title="1,000 Places To See Before You Die">
1,000 Places To See Before You Die
<span class="epnum">2009</span>
</a>
</li>
</ul>';
Then try this:
$xml = simplexml_load_string($string);
echo $xml->li->a['title'];

Related

Simple HTML DOM - Skip certain element

I want to ignore the contents of the <a> which is inside <h3> element and only get the text of the <h3>.
<h3>
144.000 TL
<a class="emlak-endeksi-link trackClick trackId_emlak-endeksi-link" id="emlakEndeksiLink">
Emlak Endeksi</a>
</h3>
Example: I only want to get 144.000 TL and ignore the link text (Emlak Endeksi).
foreach ($html1->find('div.classifiedInfo h3') as $price) {
$ilanlar['price'] = $price->plaintext;
}
I'm not very familiar with Simple HTML DOM, but selecting the text node as described at http://simplehtmldom.sourceforge.net/manual.htm#frag_find_textcomment should help:
$ilanlar['price'] = $price->find('text', 0)->plaintext;
Maybe removing the <a> tag helps:
$str = <<<str
<h3>
144.000 TL
<a class="emlak-endeksi-link trackClick trackId_emlak-endeksi-link" id="emlakEndeksiLink">
Emlak Endeksi</a>
</h3>
str;
$html = str_get_html($str);
// Find first <h3>
$h3 = $html->find('h3', 0);
// Find first <a> inside the <h3>, or use $h3->find('a') to find all of them
$a = $h3->find('a', 0);
// Remove <a> tag
$a->outertext = '';
// Output: "144.000 TL"
print trim($h3->innertext);
You can do it with a regular expression.
preg_match_all('/<h3>([^\n]*\n+)+<a([^\n]*\n+)+<\/h3>/', $content, $output);
echo trim($output[1][0]);
https://regex101.com/r/qM5Nlk/1
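If you'd rather avoid both the library and the regex, the built-in DOM extension can read the h3's own text node directly. This is only a sketch against the markup from the question; h3OwnText() is a made-up name.

```php
<?php
// Hypothetical helper: returns the first text node of the first <h3>, whitespace-normalized.
function h3OwnText(string $markup): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // real pages rarely parse without warnings
    $dom->loadHTML($markup);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // text()[1] is the first text child of the <h3>; the <a> element is not a text node.
    return $xpath->evaluate('normalize-space(//h3/text()[1])');
}

$markup = '<h3>
144.000 TL
<a class="emlak-endeksi-link trackClick trackId_emlak-endeksi-link" id="emlakEndeksiLink">
Emlak Endeksi</a>
</h3>';

echo h3OwnText($markup), "\n";
```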

How to webscrape HTML using PHP dom inside links

I have a problem regarding HTML webscraping.
<div class="mbs fwb">
<a href="/groups/291064327770896/" data-hovercard="/ajax/hovercard/group.php?id=291064327770896" aria-owns="js_0" aria-haspopup="true" aria-describedby="js_1" id="js_2">
NCR Business Startups </a>
</div>
<div class="mbs fwb" >
<a href="/groups/Analystamit/" data-hovercard="/ajax/hovercard/group.php?id=158649140871478" aria-owns="js_0" aria-haspopup="true" aria-describedby="js_1" id="js_2">
Risk Professionals </a>
</div>
I need to scrape the data-hovercard attribute of each anchor tag.
Below is the code I used:
include('simple_html_dom.php');
$html = file_get_html('http://sampleurl.com/taki.html');
foreach($html->find('div[class="mbs fwb"]') as $desc11)
foreach($desc11->find('a') as $desc12)
echo $desc12->data-hovercard . '<br>';
It is not working. The result I am getting:
0
0
I want a result like this:
/ajax/hovercard/group.php?id=291064327770896
/ajax/hovercard/group.php?id=158649140871478
Use a regular expression with a pattern like /data-hovercard="([^"]*)"/i. PHP has no g flag; use preg_match_all() to find every match.
The first capture group of each match will contain the value of that attribute. You might need to remove newlines from your source text, just for good housekeeping.
Hope this helps.
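A small sketch of that regex approach in PHP, assuming the attribute is always written with double quotes as in the question. The function name hovercardValues() is just for illustration.

```php
<?php
// Hypothetical helper: collects every data-hovercard value found in the HTML string.
function hovercardValues(string $html): array
{
    // $matches[1] holds the first capture group of every match.
    preg_match_all('/data-hovercard="([^"]*)"/i', $html, $matches);
    return $matches[1];
}

$html = '<div class="mbs fwb">
<a href="/groups/291064327770896/" data-hovercard="/ajax/hovercard/group.php?id=291064327770896" id="js_2">
NCR Business Startups </a>
</div>
<div class="mbs fwb" >
<a href="/groups/Analystamit/" data-hovercard="/ajax/hovercard/group.php?id=158649140871478" id="js_2">
Risk Professionals </a>
</div>';

foreach (hovercardValues($html) as $value) {
    echo $value, "\n";
}
```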
You can do this using the built-in SimpleXMLElement class and an XPath query:
$xml = new SimpleXMLElement('http://foo.bar/baz.html', null, true);
$anchors = $xml->xpath('//div[@class="mbs fwb"]/a');
foreach ($anchors as $a) {
echo $a['data-hovercard'], PHP_EOL;
}
Output, assuming baz.html is well-formed XML containing the divs from the question (SimpleXMLElement cannot parse arbitrary HTML):
/ajax/hovercard/group.php?id=291064327770896
/ajax/hovercard/group.php?id=158649140871478
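As for the 0 in the question: PHP most likely reads $desc12->data-hovercard as the property $desc12->data minus something called hovercard, so the attribute is never looked up. With Simple HTML DOM, writing the name as $desc12->{'data-hovercard'} should avoid that. Below is a sketch with the built-in DOMDocument instead, which also copes with HTML that is not well-formed XML; hovercards() is a hypothetical helper name.

```php
<?php
// Hypothetical helper: reads data-hovercard from every <a> inside div.mbs.fwb.
function hovercards(string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // e.g. the duplicated id="js_2" triggers warnings
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $values = [];
    foreach ($xpath->query('//div[@class="mbs fwb"]/a') as $a) {
        // Hyphenated attribute names are fine when passed as a string.
        $values[] = $a->getAttribute('data-hovercard');
    }
    return $values;
}

$html = '<div class="mbs fwb">
<a href="/groups/291064327770896/" data-hovercard="/ajax/hovercard/group.php?id=291064327770896" id="js_2">
NCR Business Startups </a>
</div>
<div class="mbs fwb" >
<a href="/groups/Analystamit/" data-hovercard="/ajax/hovercard/group.php?id=158649140871478" id="js_2">
Risk Professionals </a>
</div>';

echo implode("\n", hovercards($html)), "\n";
```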

How to select 2nd element with same tag using dom xpath?

I have layout like this:
<div class="fly">
<img src="a.png" class="badge">
<img class="aye" data-original="b.png" width="130" height="253" />
<div class="to">
<h4>Fly To The Moon</h4>
<div class="clearfix">
<div class="the">
<h4>**Wow**</h4>
</div>
<div class="moon">
<h4>**Great**</h4>
</div>
</div>
</div>
</div>
First I run an XPath query:
$a = $xpath->query("//div[@class='fly']"); //to get all elements in class fly
foreach ($a as $p) {
$t = $p->getElementsByTagName('img');
echo ($t->item(0)->getAttributes('data-original'));
}
When I run the code, it produces no result. After tracing it, I found that <img class="badge"> is processed first. How can I get the data-original value from <img class="aye"> and also get the values "Wow" and "Great" from the <h4> tags?
Thank you.
Alternatively, you could run further XPath queries relative to each matched div, building on your current code.
To get the attribute, use ->getAttribute():
$dom = new DOMDocument();
$dom->loadHTML($markup);
$xpath = new DOMXpath($dom);
$parent_div = $xpath->query("//div[@class='fly']"); //to get all elements in class fly
foreach($parent_div as $div) {
$aye = $xpath->query('./img[@class="aye"]', $div)->item(0)->getAttribute('data-original');
echo $aye . '<br/>'; // get the data-original
$others = $xpath->query('./div[@class="to"]/div[@class="clearfix"]', $div)->item(0);
foreach($xpath->query('./div/h4', $others) as $node) {
echo $node->nodeValue . '<br/>'; // echo the two h4 values
}
echo '<hr/>';
}
Thank you for your code!
I tried the code but it failed, and I don't know why. So I changed your code a bit, and now it works:
$dom = new DOMDocument();
$dom->loadHTML($markup);
$xpath = new DOMXpath($dom);
$parent_div = $xpath->query("//div[@class='fly']"); //to get all elements in class fly
foreach($parent_div as $div) {
$aye = $xpath->query('descendant::img[@class="aye"]', $div)->item(0)->getAttribute('data-original');
echo $aye . '<br/>'; // get the data-original
$others = $xpath->query('descendant::div[@class="to"]/div[@class="clearfix"]', $div)->item(0);
foreach($xpath->query('.//div/h4', $others) as $node) {
echo $node->nodeValue . '<br/>'; // echo the two h4 values
}
echo '<hr/>';
}
I don't know what the difference between ./ and descendant:: is, but my code works fine using descendant::.
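On the ./ versus descendant:: question: ./img only looks at direct children of the context node, while descendant::img (or the shorthand .//img) looks at any depth. The sketch below assumes, purely for illustration, that the real page wraps the image in an extra div (the class name "wrap" is invented), which would explain why only the descendant version matched.

```php
<?php
// Assumed markup: the <img> is nested one level deeper than in the question.
$markup = '<div class="fly"><div class="wrap">'
        . '<img class="aye" data-original="b.png">'
        . '</div></div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($markup);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$fly = $xpath->query("//div[@class='fly']")->item(0);

// Child axis: the img is a grandchild of div.fly, so nothing matches.
$direct = $xpath->query('./img[@class="aye"]', $fly)->length;

// Descendant axis: matches at any depth below div.fly.
$anyDepth = $xpath->query('descendant::img[@class="aye"]', $fly)->length;

echo "direct children: $direct, any depth: $anyDepth\n";
```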
Given the following markup:
<div class="fly">
<img src="a.png" class="badge">
<img class="aye" data-original="b.png" width="130" height="253" />
<div class="to">
<h4>Fly To The Moon</h4>
<div class="clearfix">
<div class="the">
<h4>**Wow**</h4>
</div>
<div class="moon">
<h4>**Great**</h4>
</div>
</div>
</div>
</div>
you asked:
how can I get data-original value from <img class="aye">and also get the value "Wow" and "Great" from <h4> tag?
With XPath you can obtain the values as strings directly:
string(//div[@class='fly']/img/@data-original)
This is the string value of the first data-original attribute of an img tag within all divs with class="fly".
string((//div[@class='fly']//h4[not(following-sibling::*//h4)])[1])
string((//div[@class='fly']//h4[not(following-sibling::*//h4)])[2])
These are the string values of the first and second <h4> tags that are not followed on their own level by an element containing another <h4>, within all divs with class="fly". The parentheses make [1] and [2] count over the whole result set rather than per parent element.
The leading //div[@class='fly'] looks cumbersome right now, but with iteration it is no longer needed, because the XPath is then relative to each div:
//div[@class='fly']
string(./img/@data-original)
string((.//h4[not(following-sibling::*//h4)])[1])
string((.//h4[not(following-sibling::*//h4)])[2])
To use XPath string(...) expressions in PHP, you must use DOMXPath::evaluate() instead of DOMXPath::query(). It then looks like this:
$aye = $xpath->evaluate("string(//div[@class='fly']/img/@data-original)");
$h4_1 = $xpath->evaluate("string((//div[@class='fly']//h4[not(following-sibling::*//h4)])[1])");
$h4_2 = $xpath->evaluate("string((//div[@class='fly']//h4[not(following-sibling::*//h4)])[2])");
A full example with iteration and output:
// all <div> tags with class="fly"
$divs = $xpath->evaluate("//div[@class='fly']");
foreach ($divs as $div) {
// the first data-original attribute of an <img> inside $div
echo $xpath->evaluate("string(./img/@data-original)", $div), "<br/>\n";
// all <h4> tags anywhere inside the $div
$h4s = $xpath->evaluate('.//h4[not(following-sibling::*//h4)]', $div);
foreach ($h4s as $h4) {
echo $h4->nodeValue, "<br/>\n";
}
}
As the example shows, you can use evaluate() for node lists as well. The values of the <h4> tags are no longer obtained with string() here, because I assume there could be more than just two.
Online demo including special string output (just as an example):
echo <<<HTML
{$xpath->evaluate("string(//div[@class='fly']/img/@data-original)")}<br/>
{$xpath->evaluate("string((//div[@class='fly']//h4[not(following-sibling::*//h4)])[1])")}<br/>
{$xpath->evaluate("string((//div[@class='fly']//h4[not(following-sibling::*//h4)])[2])")}<br/>
<hr/>
HTML;
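One thing to watch with positional predicates: in //h4[...][2] the position is counted among the h4 children of each parent, whereas (//h4[...])[2] counts over the whole result. With the layout from the question, "Wow" and "Great" sit in different parent divs, so the two forms give different results. A small sketch to check this, using the question's markup with the asterisks left out:

```php
<?php
$markup = '<div class="fly">
<img src="a.png" class="badge">
<img class="aye" data-original="b.png" width="130" height="253" />
<div class="to">
<h4>Fly To The Moon</h4>
<div class="clearfix">
<div class="the"><h4>Wow</h4></div>
<div class="moon"><h4>Great</h4></div>
</div>
</div>
</div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($markup);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$h4 = "//div[@class='fly']//h4[not(following-sibling::*//h4)]";

// Position counted per parent: no parent has a second matching <h4>.
$perParent = $xpath->evaluate("string({$h4}[2])");

// Position counted over the whole node set: the second match overall.
$overall = $xpath->evaluate("string(({$h4})[2])");

var_dump($perParent, $overall);
```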

PHP - extract data from a web page HTML

I need to extract the words FIESTA ERASMUS and /event/83318 from the following HTML code:
<div id="tab-soiree" class=""><div class="soireeagenda cat_1">
<img src="http://www.parisbouge.com/img/fly/resize/100/83318.jpg" alt="fiesta erasmus" class="fly">
<ul>
<li class="nom"><h2>FIESTA ERASMUS </h2></li>
<li class="genre" style="margin-bottom:4px;">
soirée étudiante </li>
<li class="lieu">Duplex</li> <li class="musique">house, electro, r&b chic, latino, disco</li>
<li class="pass-label">pass</li> </ul>
<img src="/img/salles/resize/50/10.jpg" alt="duplex" class="flysalle">
<hr class="clearleft">
</div>
I tested something like this:
$PATTERN = "/\<div id="tab-soiree".*(.*)/"
preg_match($PATTERN, $html, $matches);
but it doesn't work.
You don't parse HTML with Regular Expressions. Instead, use the built-in DOM parsing tools within PHP itself: http://php.net/manual/en/book.dom.php
Assuming your HTML is accessible from a variable named $html:
$doc = new DOMDocument();
$doc->loadHTML( $html );
$item = $doc->getElementsByTagName("li")->item(0);
$link = $item->getElementsByTagName("a")->item(0);
echo $link->attributes->getNamedItem('href')->nodeValue;
echo $link->textContent;
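To make the DOM route concrete for the event name, here is a sketch that targets the h2 inside li.nom with DOMXPath. It only uses the part of the markup shown in the question, so it does not cover the /event/83318 link, which is not in that snippet; eventName() is a hypothetical helper.

```php
<?php
// Hypothetical helper: returns the event name from the #tab-soiree block.
function eventName(string $html): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // the unescaped "&" in "r&b" triggers a warning
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // normalize-space() also removes the trailing space before </h2>.
    return $xpath->evaluate('normalize-space(//div[@id="tab-soiree"]//li[@class="nom"]/h2)');
}

$html = '<div id="tab-soiree" class=""><div class="soireeagenda cat_1">
<img src="http://www.parisbouge.com/img/fly/resize/100/83318.jpg" alt="fiesta erasmus" class="fly">
<ul>
<li class="nom"><h2>FIESTA ERASMUS </h2></li>
<li class="lieu">Duplex</li> <li class="musique">house, electro, r&b chic, latino, disco</li>
</ul>
</div></div>';

echo eventName($html), "\n";
```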
I suggest the following pattern:
$PATTERN = '%<h2>(.*?)[\s]+</h2>%i';
preg_match($PATTERN, $html, $matches);
The (.*?) part is a non-greedy pattern, which means the match won't run all the way to the end of the supplied string but will stop at the first place where the rest of the pattern can match, in this case the whitespace before </h2>.
You may also want to pre-process the HTML before applying the regex, e.g. remove all line breaks in order to get rid of the [\s]+ part.

Using simple_html_dom to parse a ul

I would like to get the inner text of each span in this ul.
<ul class="alternatingList">
<li><strong>Last Played</strong><span id="ctl00_mainContent_lastPlayedLabel">04.29.2011</span></li>
<li class="alt"><strong>Armory Completion</strong><span id="ctl00_mainContent_armorCompletionLabel">52%</span></li>
<li><strong>Daily Challenges</strong><span id="ctl00_mainContent_dailyChallengesLabel">127</span></li>
<li class="alt"><strong>Weekly Challenges</strong><span id="ctl00_mainContent_weeklyChallengesLabel">4</span></li>
<li><strong>Matchmaking MP Kills</strong><span id="ctl00_mainContent_matchmakingKillsLabel">11,280 (1.18)</span></li>
<li class="alt"><strong>Matchmaking MP Medals</strong><span id="ctl00_mainContent_medalsLabel">15,383</span></li>
<li><strong>Covenant Killed</strong><span id="ctl00_mainContent_covenantKilledLabel">10,395</span></li>
<li class="alt"><strong>Player Since</strong><span id="ctl00_mainContent_playerSinceLabel">09.13.2010</span></li>
<li class="gamesPlayed"><strong>Games Played</strong><span id="ctl00_mainContent_gamesPlayedLabel">975</span></li>
</ul>
I have this right now but I want to do it without writing the same code over for each span.
//pull last played text
$last_played = '';
$last_played_el = $html->find(".alternatingList");
if (preg_match('|<span[^>]+>(.*)</span>|U', $last_played_el[0], $matches)) {
$last_played = $matches[1];
}
echo $last_played;
You're already using a parser, so why would you want to use a regex? You can access each span's inner text quite easily:
$html = new simple_html_dom();
// Load from a string, where $raw is your UL or the page its on
$html->load($raw);
foreach($html->find('.alternatingList span')as $found) {
echo $found->innertext . "\n";
}
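If you also want to know which value belongs to which label, one option is to pair each strong with its span. This sketch uses the built-in DOMDocument rather than simple_html_dom, and parseStats() is a name made up for the example.

```php
<?php
// Hypothetical helper: maps each <strong> label to the text of the <span> next to it.
function parseStats(string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $stats = [];
    foreach ($xpath->query('//ul[@class="alternatingList"]/li') as $li) {
        // Both expressions are evaluated relative to the current <li>.
        $label = $xpath->evaluate('string(strong)', $li);
        $value = $xpath->evaluate('string(span)', $li);
        $stats[$label] = $value;
    }
    return $stats;
}

$html = '<ul class="alternatingList">
<li><strong>Last Played</strong><span id="ctl00_mainContent_lastPlayedLabel">04.29.2011</span></li>
<li class="alt"><strong>Armory Completion</strong><span id="ctl00_mainContent_armorCompletionLabel">52%</span></li>
<li class="gamesPlayed"><strong>Games Played</strong><span id="ctl00_mainContent_gamesPlayedLabel">975</span></li>
</ul>';

foreach (parseStats($html) as $label => $value) {
    echo $label, ': ', $value, "\n";
}
```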
Have a look at preg_match_all(). It will give you an array of all matches.
