Retrieve data contained in a certain span class - php

Using file_get_contents, I open an Internet URL and get the contents of the webpage.
Inside the HTML there are many span tags with the same class:
<span class="always-the-same-class">always dynamic text</span>
Now I want to get an array containing all the "dynamic text" contained in any of these tags. It is not necessary to eliminate duplicate entries (I need them).
Is this possible? How could I do it?

If I understood correctly, this has to be PHP, since it runs on the server and not in the browser. So I'd do something like
$html = file_get_contents(HTML_URL);
// Non-greedy capture of the text inside every span with exactly this class attribute
$a = preg_match_all('/<span class="always-the-same-class">(.*?)<\/span>/', $html, $b);
echo $a;
print_r($b[1]);
$a holds the hit count, $b[1] the hits.
I tested this against
<html>
.. blah ..
<body>
.. blah ..
<span class="always-the-same-class">always dynamic text A</span>
<span class="always-the-same-class">always dynamic text B</span>
<span class="always-the-same-class">always dynamic text C</span>
.. blah ..
</body>
</html>
and output was
3
Array
(
[0] => always dynamic text A
[1] => always dynamic text B
[2] => always dynamic text C
)

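jQuery (only if you are working in the browser; .text() alone would concatenate all matches into one string, so map each element to its own text):
var spanTexts = $('.always-the-same-class').map(function () { return $(this).text(); }).get();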

You can parse this content using the DOMDocument class that PHP provides. Once you have loaded the content into the DOM document, you can get the span tags with
$content->getElementsByTagName('span');
You can then filter the results by their class attribute and read the text content of each match.
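The DOMDocument approach described above could look roughly like the following. This is a minimal sketch: the inline sample markup stands in for the page fetched with file_get_contents, and the exact-match comparison on the class attribute is an assumption (it would miss spans that carry additional classes).

```php
<?php
// Sample markup standing in for the page fetched with file_get_contents().
$html = <<<HTML
<html><body>
<span class="always-the-same-class">always dynamic text A</span>
<span class="other">ignore me</span>
<span class="always-the-same-class">always dynamic text B</span>
<span class="always-the-same-class">always dynamic text A</span>
</body></html>
HTML;

$dom = new DOMDocument();
// Real-world HTML is rarely valid; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$texts = [];
foreach ($dom->getElementsByTagName('span') as $span) {
    // Keep only spans whose class attribute matches exactly; duplicates are kept.
    if ($span->getAttribute('class') === 'always-the-same-class') {
        $texts[] = $span->textContent;
    }
}

print_r($texts);
```

Unlike the regular expression, this keeps working if the span gains extra attributes or its text spans several lines.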

Related

Removing portion from scraped array

I am currently scraping a website and trying to remove a portion of the content that I don't want included in the array.
This is the code I have so far:
$content['article'] = $html2->find('.hentry-content',0);
$content['article'] = $content['article']->plaintext;
This returns everything within the .hentry-content class on the website I am gathering content from.
Now the content that gets returned looks like this.
array (
[article] => This is some example filler content please no actual meaning behind random bridge for bridge random you dog tomorrow http://example.com/our-random-mp3.com
)
The end of this output usually includes a link to a random MP3. Is there any way I can pull just the content portion without the MP3 link being included?
If the link is inside an <a> tag, this should work (run it on the element before replacing it with its plaintext):
foreach($content['article']->find('a') as $item) {
$item->outertext = '';
}
echo $content['article']->plaintext;
If the returned text only contains the one link to the random MP3 file, you could filter it out with:
$url_pattern = '/\bhttps?:\/\/\S+/i';
$content['article'] = preg_replace($url_pattern, '', $content['article']->plaintext);
This removes every http(s) URL from the text. Note that the URL pattern from http://code.tutsplus.com/tutorials/8-regular-expressions-you-should-know--net-6149 is anchored with ^ and $, so it only matches when the whole string is a URL and would leave a link at the end of a paragraph untouched.
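To check how URL stripping behaves on a plain string without running the scraper, here is a minimal self-contained sketch. The sample text and the simple unanchored pattern are assumptions for illustration; the pattern is deliberately loose and treats anything from http:// or https:// up to the next whitespace as a URL.

```php
<?php
// Hypothetical sample standing in for $content['article']->plaintext.
$plaintext = 'This is some example filler content http://example.com/our-random-mp3.com';

// Unanchored pattern: matches http(s) URLs anywhere in the text, up to the next whitespace.
$urlPattern = '~https?://\S+~i';

// Remove the URL, then trim the space that was left in front of it.
$clean = trim(preg_replace($urlPattern, '', $plaintext));

echo $clean;
```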

Xpath query to get text in span with <H2> tags

I want to get all the text inside the <span class="general2">, including the <h2> tags.
I have the following HTML content:
<span class="general2" itemprop="articleBody"> I WANT THIS TEXT I WANT THIS TEXTI WANT THIS TEXT<br />
<h2>I WANT THIS TEXT AND ALSO PRESERVE THE TAG</h2><br />
I WANT THIS TEXT</span>
I tried the query
//span[contains(@class,'general2')]
but it gives me all the text as plain text. I want something like
//span[contains(@class,'general2')]/*[text() or local-name()='h3']
As you want quite distinct nodes, it is probably best to use the union operator | to join them together. You can first get all the text nodes that are children of the <span/>, then the text nodes of any <a/>, and last but not least the <h2/> element. This should work:
//span[contains(@class,'general2')]/text() | //span[contains(@class,'general2')]/h2 | //span[contains(@class,'general2')]/a/text()
Using XPath 3.0 this can be written more elegantly, as it allows arbitrary expressions as steps:
//span[contains(@class,'general2')]/(text() | h2 | a/text())
That is the task of your host programming language. XPath's job is only to select the relevant element; you then need a way to get the inner HTML markup of the selected element in PHP. Maybe something like this (I'm not a PHP guy in any way):
$span = $xpath->query("//span[contains(@class,'general2')]");
echo $dom->saveXML($span->item(0));
PHP references for the above snippet: Get inner HTML of parent element with php and xpath, How to get innerHTML of DOMNode?
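For completeness, a minimal runnable sketch of that idea, assuming the markup from the question is loaded with DOMDocument. PHP's DOMXPath implements XPath 1.0, so the parenthesized 3.0 form will not run there. The loop over childNodes is one common way to get the inner HTML, not the only one.

```php
<?php
$html = <<<HTML
<span class="general2" itemprop="articleBody"> I WANT THIS TEXT<br />
<h2>I WANT THIS TEXT AND ALSO PRESERVE THE TAG</h2><br />
I WANT THIS TEXT</span>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // tolerate imperfect markup
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$span = $xpath->query("//span[contains(@class,'general2')]")->item(0);

// Serialize each child node so the span's own tag is left out
// but nested tags such as <h2> are preserved.
$innerHtml = '';
foreach ($span->childNodes as $child) {
    $innerHtml .= $dom->saveHTML($child);
}

echo $innerHtml;
```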

HTML doesn't get rendered, why does it happen?

On a PHP+MySQL project, there's a string of text coming from a MySQL table that contains HTML tags, but those tags never get rendered by Google Chrome or any other browser I have tried so far:
You can see that the HTML (p, strong) aren't getting interpreted by the browser.
So the result is:
EDIT: HTML/PHP
<div class="winery_description">
<?php echo $this->winery['description']; ?>
</div>
$this->winery being the array result of the SQL Select.
EDIT 2: I'm the dumbest man in the world, the source contains entities. So the new question is: How do I force entities to be interpreted?
Real source:
Any suggestions? Thanks!
You are probably using innerText or textContent to set the content of your div, which just replaces the child nodes of the div with a single text node.
Use innerHTML instead, to have the browser parse the HTML and create the appropriate DOM nodes.
The answer provided by @Paulpro is correct.
Also note that if you are using jQuery, be sure to use the .html() method instead of the .text() method:
$('#your_element').html('<h1>This works!</h1>');
$('#another_element').text('<h2>Wrong; you will see the <h2> in the output');
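Since the second edit of the question says the stored text contains entities and the output is produced with PHP's echo, decoding on the server is another option. The sketch below uses html_entity_decode. The sample string is a hypothetical stand-in for $this->winery['description'], and this is only appropriate when the stored HTML is trusted.

```php
<?php
// Hypothetical value standing in for $this->winery['description'] as stored in MySQL.
$stored = '&lt;p&gt;A &lt;strong&gt;family-run&lt;/strong&gt; winery.&lt;/p&gt;';

// Turn the entities back into real markup before echoing it into the page.
$markup = html_entity_decode($stored, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $markup;
```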

Getting Specific Data in PHP

What is the way to get specific data using PHP? In this case I want to get the text between <span class="s"> and the first <b> HTML tag. Assume the HTML source code is:
Once there was a king <span class="s"> May 3 2009 <b> ABC Some Text </b> Some photo or video</span> but they have...
So here I want to get the filtered data in a variable like $fdata = "May 3 2009"; because "May 3 2009" sits between <span class="s"> and the first <b> HTML tag.
I will use it with Simple PHP HTML DOM parsing. Any idea or example of how to filter that text and get it into a variable would be a great help. If you find a similar question here, it's not a duplicate; this one is more specific.
Use Simple HTML DOM
http://simplehtmldom.sourceforge.net/
Or http://php.net/manual/en/domdocument.loadhtml.php
Or you can use any other library also.
If you're using Simple HTML DOM Parser, you'd grab the elements you're targeting like this:
$ret = $html->find('span[class=s]');
This is just a basic sample, but it should get you going in the right direction.
If you need to find a very specific instance, you can use something such as:
$ret = $html->find("#div1", 0)->children(1)->children(1)->children(2)->id;
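With the built-in DOMDocument mentioned above, the text before the first <b> can be selected directly. This is a minimal sketch, assuming the sample line from the question as input; the XPath expression and the trimming are illustrative choices, not part of the original answers.

```php
<?php
$html = 'Once there was a king <span class="s"> May 3 2009 <b> ABC Some Text </b> Some photo or video</span> but they have...';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // tolerate the missing html/body wrapper
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Text nodes that are siblings of the span's first <b> and come before it.
$nodes = $xpath->query('//span[@class="s"]/b[1]/preceding-sibling::text()');

$fdata = '';
foreach ($nodes as $node) {
    $fdata .= $node->nodeValue;
}
$fdata = trim($fdata);

echo $fdata;
```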

Code in a RSS feed

I am using a feed creator (specifically, Kohana's feed::create()), except some of my text might be like this in the description element
See code below
<?php echo 'example'; ?>
The feed creator uses the SimpleXML library. Whenever the data is returned (using $xml->asXml()), the HTML angle brackets inside the description element are converted to HTML entities.
This lets the tags be parsed correctly, which is useful for p tags and the like. However, in this case the PHP code won't show up (being surrounded by angle brackets).
My question is: how can I show stuff like this in an RSS feed? How can I display > when it itself is parsed back as <? Does that make sense?
Here is an example of what is being outputted:
<description>&lt;p&gt;some content&lt;/p&gt;
&lt;p&gt;WITH some code&lt;/p&gt;&lt;p&gt;&lt;?php
//test me out!
?&gt;&lt;/p&gt;
</description>
(note that is not an error above - the entities are all converted)
What I'd like it to display (in a RSS reader) is
some content
WITH some code
<?php
//test me out! ?>
You want the code to actually display in the feed as code, not execute, right? If so, you need to escape it the same way you would if you wanted it to display in HTML, i.e.:
htmlspecialchars( "<?php echo 'example'; ?>" )
That will result in your feed looking even more garbled than it already does, because the PHP will be double-encoded, once for the RSS XML and again for the HTML contained in the RSS XML.
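The double encoding can be seen in a small experiment. This is a sketch only: it uses a bare SimpleXMLElement rather than Kotana-specific code paths such as feed::create(), and the <item> wrapper and <pre> tag are assumptions for illustration. The raw XML looks garbled, but parsing it gives back the HTML, and rendering that HTML shows the code.

```php
<?php
$code = "<?php echo 'example'; ?>";

// First layer: escape the code so it displays as text inside the HTML description.
$html = '<p>See code below</p><pre>' . htmlspecialchars($code, ENT_QUOTES) . '</pre>';

// Second layer: SimpleXML escapes the whole HTML string again when it is assigned.
$item = new SimpleXMLElement('<item/>');
$item->description = $html;
$xml = $item->asXML();

echo $xml, "\n";

// An XML parser (the feed reader) undoes the second layer and sees the HTML again.
$parsed = new SimpleXMLElement($xml);
$roundTripped = (string) $parsed->description;

// Rendering that HTML (approximated here by stripping tags and decoding entities)
// undoes the first layer and shows the code as text.
$displayed = html_entity_decode(strip_tags($roundTripped), ENT_QUOTES);

echo $displayed, "\n";
```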
All RSS tags contain strings so can't you just do your PHP manipulation prior to setting the tag?
So instead of saying:
$xml->description = 'Description <?php echo $var; ?>';
you should be doing:
$xml->description = 'Description ' . $var;
What is the reason that you want to pass PHP code into your RSS feed? I'm guessing that a lot of feed readers would not execute it anyways.
