Only show certain ID with PHP web scrape? - php

I'm working on a personal project that fetches my local weather station's school/business closings and displays the results on my personal site. Since the site doesn't offer an RSS feed (sadly), I was thinking of using a PHP scrape to get the contents of the page, but I only want to show the element with a certain ID. Is this possible?
My PHP code is:
<?php
$url = 'http://website.com';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$output = curl_exec($ch);
curl_close($ch);
echo $output;
?>
I was thinking of using preg_match, but I'm not sure of the syntax or if that's even the right command. The ID element I want to show is #LeftColumnContent_closings_dg.

Here's an example using DOMDocument. It pulls the text from the first <h1> element with the id="test" ...
$html = '
<html>
<body>
<h1 id="test">test element text</h1>
<h1>test two</h1>
</body>
</html>
';
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$res = $xpath->query('//h1[@id="test"]');
if ($res->item(0) !== NULL) {
$test = $res->item(0)->nodeValue;
}
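Applied to the question's case, the same idea might look like the sketch below. The helper name extractElementById and the sample markup are assumptions made for illustration; in practice $output would be the string returned by curl_exec(). It returns the element's outer HTML rather than only its text, since the closings are probably a table.

```php
<?php
// Sketch: return the outer HTML of the element with the given id, or null.
// Assumes $id contains no double quotes (it is placed inside an XPath string).
function extractElementById(string $html, string $id): ?string
{
    $dom = new DOMDocument();
    // Real-world pages are rarely valid HTML; keep parser warnings quiet.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query('//*[@id="' . $id . '"]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return $dom->saveHTML($nodes->item(0));
}

// Stand-in for the $output string returned by curl_exec() in the question.
$output = '<html><body><div id="other">skip me</div>'
        . '<table id="LeftColumnContent_closings_dg"><tr><td>Example School: closed</td></tr></table>'
        . '</body></html>';

$closings = extractElementById($output, 'LeftColumnContent_closings_dg');
echo $closings ?? 'Element not found';
```

Returning null for a missing element lets the calling page show a fallback message if the remote site changes its markup.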

A library I've used with great success for this sort of thing is PHPQuery: http://code.google.com/p/phpquery/ .
You basically get your website into a string (like you have above), then do:
phpQuery::newDocument($output);
$titleElement = pq('title');
$title = $titleElement->html();
For instance - that would get the contents of the title element. The benefit is that all the methods are named after the jQuery ones, making it pretty easy to learn if you already know jQuery.

Related

Grabbing content of external site CSS class. (steam store)

I have been playing around with this code for a while but can't get it to work properly.
My goal is to display, or maybe even create a table with IDs of, the data grabbed from the Steam store for my own website and game library. The class I'm after is 'game_area_description'.
This is a study project of mine.
So I tried to get the description using the following code.
@section('selectedGame');
<?php
$url = 'https://store.steampowered.com/app/'.$game->appID."/";
header("Access-Control-Allow-Origin: ${url}");
$dom = new DOMDocument();
@$dom->loadHTMLFile($url);
$xpath = new DOMXpath($dom);
$elements = $xpath->query('//div[@class="game_area_description"]/a');
$link = $dom->saveHTML($elements->item(0));
echo $link;
?>
@endsection;
I am using Laravel, by the way.
In some other cases I can get another piece of the website.
$url = 'https://store.steampowered.com/app/'.$game->appID."/";
$content = file_get_contents($url);
$first_step = explode( '<div class="game_description_snippet">' , $content );
$second_step = explode("</div>" , $first_step[1] );
echo "<p>${second_step[0]}</p>";
This just takes the excerpt from the web page, which works in some cases.
The biggest issue, other than not being able to get all the information, is that I sometimes get an error saying $first_step[1] is not valid.
It may be some CORS issue.
You see, the web page loads an age check in some cases, like "Batman Arkham Knight": the user needs to either log in or verify their age first.
That keeps me from using the second block of code.
But the first gives me all kinds of errors, as the screenshot shows.
Does anyone know of a way to grab the part of the page
where the description of the game is?
The answer to my question was in the comments.
Apparently Steam has some undocumented APIs.
Here is the code (with Bootstrap CSS)
that I used and am going to implement in my migration tables and seeder:
@section('selectedGame');
<div class="container border">
<!-- Content here -->
<?php
$url = "http://store.steampowered.com/api/appdetails?appids=".$game->appID;
$jsondata = file_get_contents($url);
$parsed = json_decode($jsondata,true);
$gameID = $game->appID;
$gameDescr = $parsed[$gameID]['data']['about_the_game'];
echo $gameDescr;
?>
</div>
@endsection;
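Since the endpoint is undocumented, it may be worth guarding the lookup so a missing key does not raise an error. A minimal sketch follows; the helper name extractAboutTheGame and the sample JSON are made up for illustration and only mirror the structure used in the code above.

```php
<?php
// Sketch: pull 'about_the_game' out of an appdetails-style JSON response.
// Returns null when the JSON is invalid or the expected keys are absent.
function extractAboutTheGame(string $json, string $appId): ?string
{
    $parsed = json_decode($json, true);
    if (!is_array($parsed) || !isset($parsed[$appId]['data']['about_the_game'])) {
        return null;
    }
    return $parsed[$appId]['data']['about_the_game'];
}

// Made-up sample shaped like the structure used in the answer above.
$sample = '{"12345": {"data": {"about_the_game": "<p>An example description.</p>"}}}';

echo extractAboutTheGame($sample, '12345') ?? 'No description available';
```

In the Blade view, the $sample string would be replaced by the result of file_get_contents($url).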

How to get data or value from any div in php

I have created a PHP page that uses many divs with different IDs,
and I want to get the data or value from one of them.
Here is one div with its ID;
I want to get the data or value from this div.
<div id="tablename">tablename</div>
I have used this, but it's not working.
$doc = new DomDocument();
$thediv = $doc->getElementById('tablename');
echo $thediv->textContent;
So please tell me how I can get this value from my div.
You need to pass the whole content of your page to the class; otherwise it can't select anything, since it thinks the document is empty:
$content = '<div id="tablename">tablename</div>';
$doc = new DomDocument();
$doc->loadHTML($content); // That's the addition
$thediv = $doc->getElementById('tablename');
echo $thediv->textContent;
More info:
loadHTML(): Load the HTML from a string.
loadHTMLFile(): Load the HTML from a file.
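getElementById() returns null when no element has the requested id, so reading textContent directly can fail. A small sketch of a guarded lookup, where the second div and the id 'doesnotexist' are made up for illustration:

```php
<?php
$content = '<div id="tablename">tablename</div><div id="other">something else</div>';

$doc = new DOMDocument();
$doc->loadHTML($content);

// getElementById() returns null when no element has that id.
$thediv = $doc->getElementById('tablename');
$value = $thediv !== null ? trim($thediv->textContent) : null;

// A lookup for an id that is not in the document.
$missing = $doc->getElementById('doesnotexist');

echo $value ?? 'not found';
```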
Download and include PHP Simple HTML DOM Parser from https://sourceforge.net/projects/simplehtmldom/files/ and
try this:
include 'simple_html_dom.php';
$html = file_get_html("http://www.facebook.com");
$displaybody = $html->find('div[id=blueBarDOMInspector]', 0)->plaintext;
echo $displaybody ;exit;

Php get a value from url using a class

Here is the div code, which appears on different domains; I want to display the total on my homepage. I tried to use file_get_html, but it displays all of the div's content, whereas I want to save the number within <dd></dd> in a variable for each page, add them up, and display the sum on my page.
Here is the div code:
<div class="stats">
<dl class="statscount">
<dt>total:</dt>
<dd>5,299</dd>
</dl>
20000
</div>
And here is my current code:
<?php
include 'simple_html_dom.php';
$html = file_get_html('http://www.targetdomain.com');
$result = $html->find('dl[class=statscount]', 0); //Output: THESE
$result = str_replace(",", "", $result);
echo $result;
?>
But there is a small problem: I don't need to fetch all the data in the class, just the data in the <dd></dd> tag within it. Can you please tell me how to achieve this? Basically, I want to fetch the number within <dd>5,299</dd>, add up the numbers from the different pages, and display the total on my website. Thanks.
I would use XPath for this. That way you won't need simple_html_dom, because DOM and XPath are part of the PHP 5 core:
$html = <<<EOF
<div class="stats">
<dl class="statscount">
<dt>total posts:</dt>
<dd>5,299</dd>
</dl>
20000
</div>
EOF;
$doc = new DOMDocument();
$doc->loadHTML($html);
$selector = new DOMXPath($doc);
$value = $selector
->query('//dl[@class="statscount"]/dd/text()')
->item(0)
->nodeValue;
var_dump($value); // Output: string(5) "5,299"
Maybe a regex:
preg_match('/<dd>([^<]*)<\/dd>/', $htmlcode, $matches);
$result = $matches[1] ?? null; // text between <dd> and </dd>
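To get the combined total the question asks for, each extracted value needs its thousands separator removed before it is added up. A sketch under stated assumptions: the helper extractCount and the two sample pages are invented for illustration, and in practice each string would be the HTML fetched from one of the domains.

```php
<?php
// Sketch: read the first <dd> inside dl.statscount and return it as an integer.
function extractCount(string $html): ?int
{
    $doc = new DOMDocument();
    // Keep parser warnings for imperfect markup out of the page output.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//dl[@class="statscount"]/dd');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    // "5,299" -> 5299
    return (int) str_replace(',', '', trim($nodes->item(0)->textContent));
}

// Stand-ins for the HTML fetched from each domain.
$pages = [
    '<div class="stats"><dl class="statscount"><dt>total:</dt><dd>5,299</dd></dl>20000</div>',
    '<div class="stats"><dl class="statscount"><dt>total:</dt><dd>1,001</dd></dl>300</div>',
];

$total = 0;
foreach ($pages as $page) {
    $total += extractCount($page) ?? 0; // pages without the element count as 0
}
echo number_format($total); // prints 6,300
```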

Screen scraping with cURL and Regex

Consider a document in the following format:
<!DOCTYPE html>
<html>
<head>
<title></title>
<body>
<div class="blog_post_item first">
<?php // some child elements ?>
</div><!-- end blog_post_item -->
</body>
</html>
I am loading a document like this from one domain to another with PHP cURL. I would like to trim my cURL result to only include div.blog_post_item.first and its children. I know the structure of the other page, yet I can't edit it. I imagine I can use preg_match to find the opening and closing tags; they will always look the same, including that ending comment.
I have searched for examples/tutorials of screen scraping with cURL/XPath/XSLT/whatever, and it's mostly a cyclical rattling off of names of HTML parsing libraries. For that reason, please provide a simple working example. Please do not simply explain that parsing HTML with regex is a potential security vulnerability. Please do not just list libraries and specifications that I should read further into.
I have some simple PHP cURL code:
$ch = curl_init("http://a.web.page.com");
curl_setopt($ch, CURLOPT_HEADER, 0);
$output = curl_exec($ch);
curl_close($ch);
Of course, now $output contains the entire source. How will I get just the contents of that element?
That's quite easy if you are sure the beginning and end are ALWAYS the same. All you have to do is search for the beginning and end and match everything between them. I think a lot of people will be pissed at me for using regex to find a bit of HTML, but it'll do the job!
// cURL
$ch = curl_init("http://a.web.page.com");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$output = curl_exec($ch);
curl_close($ch);
if(empty($output)) exit('Couldn\'t download the page');
// finding your data
$pattern = '/<div class="blog_post_item first">(.*?)<\/div><!-- end blog_post_item -->/';
preg_match_all($pattern, $output, $matches);
var_dump($matches); // all matches
Because I don't know which website you're trying to crawl I'm not sure if this works or not.
After searching for quite a while (26 minutes to be exact) I have found why it didn't work. The dot (.) doesn't match newlines. Because HTML is full of new lines, it couldn't match the contents. Using a slightly dirty hack I managed to get it matching anyway (even though you already picked an answer).
// cURL
$ch = curl_init('http://blogg.oscarclothilde.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$output = curl_exec($ch);
curl_close($ch);
if(empty($output)) exit('Couldn\'t download the page');
// finding your data
$pattern = '/<div class="blog_post_item first">(([^.]|.)*?)<\/div><!-- end blog_post_item -->/';
preg_match_all($pattern, $output, $matches);
var_dump($matches[1][0]); // contents of the first match
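A less hacky way to let the dot match newlines is the "s" pattern modifier (PCRE_DOTALL). A minimal sketch, using a made-up sample string in place of the downloaded page:

```php
<?php
// Stand-in for the $output string returned by curl_exec().
$output = "<body>\n<div class=\"blog_post_item first\">\n  <p>Hello</p>\n</div><!-- end blog_post_item -->\n</body>";

// The trailing "s" (PCRE_DOTALL) lets "." match newlines as well.
$pattern = '/<div class="blog_post_item first">(.*?)<\/div><!-- end blog_post_item -->/s';

$inner = null;
if (preg_match($pattern, $output, $matches)) {
    $inner = trim($matches[1]);
    echo $inner; // prints <p>Hello</p>
}
```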
If you are sure about the following structure:
<div class="blog_post_item first">
WHATEVER
</div><!-- end blog_post_item -->
AND you are sure the ending-code doesn't appear in WHATEVER, then you can simply grab it.
(Please note that I replaced your original PHP with WHATEVER. cURL will only fetch the HTML, and it will contain content, not PHP.)
You don't need a regex. You can also do it simply by searching for the wanted strings, like in my example below.
$curlResponse = '
<!DOCTYPE html>
<html>
<head>
<title></title>
<body>
<div class="blog_post_item first">
<?php // some child elements ?>
</div><!-- end blog_post_item -->
</body>
</html>';
$startStr = '<div class="blog_post_item first">';
$endStr = '</div><!-- end blog_post_item -->';
$startStrPos = strpos($curlResponse, $startStr)+strlen($startStr);
$endStrPos = strpos($curlResponse, $endStr);
$wanted = substr($curlResponse, $startStrPos, $endStrPos-$startStrPos );
echo htmlentities($wanted);
This piece of code should work (PHP >= 5.3.6 with the DOM extension):
$s = <<<EOM
<!DOCTYPE html>
<html>
<head>
<title></title>
<body>
<div class="blog_post_item first">
<?php // some child elements ?>
</div><!-- end blog_post_item -->
</body>
</html>
EOM;
$d = new DOMDocument;
$d->loadHTML($s);
$x = new DOMXPath($d);
foreach ($x->query('//div[contains(@class, "blog_post_item") and contains(@class, "first")]') as $el) {
echo $d->saveHTML($el);
}

Add a span tag to Title without javascript

UPDATE:
Yes, I am using PHP in my pages.
Hello friends, I was thinking... Is there a way to add a <span> tag to the title without using JavaScript?
Maybe using regex or PHP or some other method; I don't really know.
Let me explain.
My HTML is like this:
<h3 class="title">The Title Goes Here</h3>
What I want is to automatically add a span tag, so that the final HTML looks like this:
<h3 class="title"><span>The </span>Title Goes Here</h3>
I want to wrap only the first word of the title in a <span> tag.
I know this can easily be done using JavaScript, but I am looking for a non-JavaScript solution.
Please help!
You can do this with DOMDocument in PHP if you don't want to do it with the javascript DOM:
$html = '<h3 class="title">The Title Goes Here</h3>';
$doc = new DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);
foreach($xp->query('//h3[@class="title"]') as $parent) {
$title = $parent->nodeValue;
list($first, $rest) = explode(' ', $title, 2);
$span = new DOMElement('span', $first. ' ');
$parent->nodeValue = $rest;
$parent->insertBefore($span, $parent->firstChild);
}
foreach($doc->getElementsByTagName('body')->item(0)->childNodes as $node)
{
echo $doc->saveHTML($node);
}
My answer is that this cannot be done. You can't manipulate a page in the browser without JavaScript. This can only be achieved by editing the page on the server manually, or by dynamically generating it using PHP logic, or an equivalent solution, of which there are many.
If you are doing this for a corporate solution that is only used on a single corporate standard browser, you could look into building a plugin for the browser.
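If the title is available as plain text while the page is being generated in PHP, the span can be added with simple string handling instead of DOM parsing. A sketch, where the helper name wrapFirstWord is made up for illustration and the title is assumed to contain no markup of its own:

```php
<?php
// Sketch: wrap the first word of a plain-text title in <span>.
function wrapFirstWord(string $title): string
{
    $parts = explode(' ', trim($title), 2);
    $first = htmlspecialchars($parts[0]);
    if (count($parts) === 1) {
        // Single-word title: nothing follows the span.
        return '<span>' . $first . '</span>';
    }
    return '<span>' . $first . ' </span>' . htmlspecialchars($parts[1]);
}

$heading = '<h3 class="title">' . wrapFirstWord('The Title Goes Here') . '</h3>';
echo $heading; // prints <h3 class="title"><span>The </span>Title Goes Here</h3>
```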
