What I would like to do: get the text headline from the top post on http://reddit.com/r/worldnews and output it to a webpage of mine that will only have that text on it.
In the end, I would like to grab the text from that webpage I made, using AppleScript and cURL, and output it.
I am making a script so that when I click the button it will tell me the top post.
Edit: If you can think of any way, I would also like to do the same thing for Facebook notifications.
Edit: I have PHP grabbing the site and outputting it here: http://colejohnsoncreative.com/personal/ai/worldnews.php This is the code I am using:
<?php
// Get a file into an array. In this example we'll go through HTTP to get
// the HTML source of a URL.
$lines = file('http://www.reddit.com/r/worldnews');
// Loop through our array, show HTML source as HTML source; and line numbers too.
foreach ($lines as $line_num => $line) {
echo "Line #<b>{$line_num}</b> : " . htmlspecialchars($line) . "<br />\n";
}
// Another example, let's get a web page into a string. See also file_get_contents().
$html = implode('', file('http://www.example.com/'));
// Using the optional flags parameter since PHP 5
$trimmed = file('somefile.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
?>
So I get all of the site's code to output, but all I need for the project is
<a class="title " href="http://www.dailymail.co.uk/news/article-2219477/Cannabis-factory-couple-gave-400-000-drug-dealing-fortune-poor-Kenyans-jailed-years.html" >British couple who spent most of the money they made from canabis growing on paying for life changing operations and schooling for people in a poor Kenyan village gets sent to prison for 3 years.</a>
and I need to throw everything else away. How can I do that?
If you're in a shell, you can wget the page.
From PHP you could file_get_contents the page.
From Java you could get it with URLConnection.
Once you have it, use whatever language you want to look through the text of the page for what you want, and do whatever you like with it.
You're going to have to do some parsing, so match the pattern you want. The simplest approach is something like strpos to get the positions of the elements around what you want, or use a regex.
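For example, a rough sketch of that idea in PHP (the class="title " selector is an assumption based on the snippet pasted above, and it will break whenever reddit changes its markup):
$html = file_get_contents('http://www.reddit.com/r/worldnews');
// grab the first anchor whose class is "title " - on a default listing the
// first match is the top post
if ($html !== false && preg_match('#<a class="title "[^>]*>(.*?)</a>#s', $html, $m)) {
    echo strip_tags($m[1]);   // just the headline text
}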
Do they have an RSS feed? If so, you should use that.
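If you go the RSS route, a minimal sketch might look like this (the /.rss feed URL and the RSS 2.0 channel/item layout are assumptions; if reddit serves an Atom feed instead, read $feed->entry[0]->title):
$feed = simplexml_load_file('http://www.reddit.com/r/worldnews/.rss');
if ($feed !== false) {
    // RSS 2.0 layout assumed: <rss><channel><item><title>...</title>
    echo (string) $feed->channel->item[0]->title;
}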
Strange things... it's hard to phrase this as a question. I have an entire website made in PHP and JavaScript. The contents are processed in many ways, accessing MySQL and files. One way is just to include a PHP file that builds the HTML string. To include it right in the structure of the website, I did a simple output buffer:
ob_start();
include_once($url);
$output = ob_get_contents();
ob_end_clean();
echo_cont($output);
Where echo_cont simply stores the contents to print later, in the right place. But a "simple" page that reads some photo files and builds an album is coming out corrupted. Parts of the HTML are missing, and there are strange changes like this:
class=" button2" when it should be class="button2", so the element becomes
unformatted
"http www.mywebsite.com.br folder" when it is supposed to be
"http://www.mywebsite.com.br/folder"...
Other pages are being included correctly.
I began to use output buffering on this site this year. I don't know if it could be a problem of this kind or something else, but it is not easy to look for clues, and it is not easy to run the page outside the site because it depends on several libraries; it's kind of complex. It seems to me like text being encoded and badly decoded later. What do you think?
EDIT: the echo_cont function:
$htmlConteudo = '';
function echo_cont($html){
    global $htmlConteudo;
    $htmlConteudo .= $html;
}
I decided to answer my own question with the ideas of contributors, because the problem is not about the PHP feature but about the way I was investigating it, and the same thing can happen to you reading my answer.
The issue is: the image displayed in the browser is an interpretation of the data sent, as is the information shown in the developer window. It is not the original data; it is an attempt to build an XML/HTML document from the data. In this case, the original data needs to be seen before the browser's interpretation, and that can be done with this simple function:
function strTag($xmlstr){
    $str = str_replace('<', '&lt;', $xmlstr);
    $str = str_replace('>', '&gt;', $str);
    $str = str_replace(' ', '&nbsp;', $str);
    return nl2br($str);
}
Then the data is captured:
ob_start();
include("www_pc/conteudo_imagens.php");
$output = ob_get_contents();
ob_clean();
echo(strTag($output));
Now it is time to get close to the screen and examine all the details. In my case, there were some tags like this:
<div style="float:left;width:80px;height:120px;margin:0px 5px 5px 0px;>
You can see the quote missing at the end of the style declaration (that's what happens when you code late at night). So the browser tries to rebuild the XML and makes its own interpretation, which confuses the analysis when you are trying to find the error. I tested in Safari and Firefox, so it is not a browser failure, but a limitation of the browser's "AI". You have to see the original code; AI is only in movies!
I have seen on most online newspaper websites that when I click on a headline link, e.g. "two thieves caught red handed", it normally opens a URL like this: www.example.co.uk/news/two-thieves-caught-red-handed.
How do I deal with this URL in PHP so that I can pick out only the last part, e.g. two-thieves-caught-red-handed? After that I want to work with this string.
I know how to deal with GET parameters like "www.example.co.uk/news/headline=two thieves caught red handed".
But I do not want to do it that way. Could you show me another way?
You can use a combination of the explode and end functions for that, for example:
<?php
$url = "www.example.co.uk/news/two-thieves-caught-red-handed";
$url = explode('/', $url);
$end = end($url);
echo "$end";
?>
The code will output
two-thieves-caught-red-handed
You have several options in PHP to get the current URL. For a detailed overview, look here.
One would be to use $_SERVER['REQUEST_URI'] and then use a string manipulation function to extract the parts you need.
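A minimal sketch of that approach (assuming a URL like the one in the question):
// e.g. REQUEST_URI is "/news/two-thieves-caught-red-handed"
$path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH); // drop any query string
$slug = basename($path);                                  // "two-thieves-caught-red-handed"
echo $slug;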
Maybe this thread will help you too.
I want to extract the content of a specific div on an external webpage; the div looks like this:
<dt>Win rate</dt><dd><div>50%</div></dd>
My target is the "50%". I'm actually using this php code to extract the content:
function getvalue($parameter,$content){
preg_match($parameter, $content, $match);
return $match[1];
};
$parameter = '#<dt>Win rate</dt><dd><div>(.*)</div></dd>#';
$content = file_get_contents('https://somewebpage.com');
Everything works fine; the problem is that this method takes too much time, especially if I have to use it several times with different $content.
I would like to know if there's a better (faster, simpler, etc.) way to accomplish the same thing. Thanks!
You may use DOMDocument::loadHTML and navigate your way to the given node.
$content = file_get_contents('https://somewebpage.com');
$doc = new DOMDocument();
$doc->loadHTML($content);
Now to get to the desired node, you may use the method DOMDocument::getElementsByTagName, e.g.
$dds = $doc->getElementsByTagName('dd');
foreach($dds as $dd) {
// process each <dd> element here, extract inner div and its inner html...
}
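For instance, here is a minimal sketch under the assumption that the markup really is <dt>Win rate</dt><dd><div>50%</div></dd> as shown in the question; it walks the <dt> elements and reads the <dd> that follows the matching one:
libxml_use_internal_errors(true);        // real pages are rarely valid markup, silence warnings
$doc = new DOMDocument();
$doc->loadHTML($content);

$winRate = null;
foreach ($doc->getElementsByTagName('dt') as $dt) {
    if (trim($dt->textContent) === 'Win rate') {
        $dd = $dt->nextSibling;
        while ($dd !== null && $dd->nodeType !== XML_ELEMENT_NODE) {
            $dd = $dd->nextSibling;      // skip whitespace text nodes, if any
        }
        if ($dd !== null) {
            $winRate = trim($dd->textContent); // "50%"
        }
        break;
    }
}
echo $winRate;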
Edit: I see the point #pebbl has made about DOMDocument being slower. Indeed it is; however, parsing HTML with preg_match is asking for trouble. In that case, I'd also recommend looking at an event-driven SAX XML parser. It is much more lightweight, faster, and less memory-intensive, as it does not build a tree. You may take a look at XML_HTMLSax for such a parser.
There are basically three main things you can do to improve the speed of your code:
Offload the external page load to another time (i.e. use cron)
On a Linux-based server I would know what to suggest, but seeing as you use Windows I'm not sure what the equivalent would be; cron on Linux lets you fire off scripts at scheduled times, in the background, so not through a browser. Basically I would recommend that you create a script whose sole purpose is to go and fetch the website pages at a particular time offset (depending on how frequently you need to update your data) and then write those webpages to files on your local system.
$listOfSites = array(
    'http://www.something.com/page.htm',
    'http://www.something-else.co.uk/index.php',
);
$dirToContainSites = getcwd() . '/sites';
foreach ( $listOfSites as $site ) {
    $content = file_get_contents( $site );
    /// i've just simply converted the URL into a filename here, there are
    /// better ways of handling this, but this at least keeps things simple.
    /// the following just converts any non letter or non number into an
    /// underscore... so, http___www_something_com_page_htm
    $file_name = preg_replace('/[^a-z0-9]/i','_', $site);
    file_put_contents( $dirToContainSites . '/' . $file_name, $content );
}
Once you've created this script, you then need to set the server up to execute it as regularly as you need. Then you can modify your front-end script that displays the stats to read from local files, this would give a significant speed increase.
You can find out how to read files from a directory here:
http://uk.php.net/manual/en/function.dir.php
Or the simpler method (but prone to possible problems) is just to step through your array of sites again, convert the URLs to file names using the preg_replace above, and then check whether the file exists in the folder.
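A small sketch of that simpler approach (reusing the hypothetical $listOfSites array and filename scheme from the script above):
$dirToContainSites = getcwd() . '/sites';
foreach ( $listOfSites as $site ) {
    $file_name = preg_replace('/[^a-z0-9]/i', '_', $site);
    $path = $dirToContainSites . '/' . $file_name;
    if ( file_exists($path) ) {
        $html = file_get_contents($path);  // now parse $html instead of hitting the live site
    }
}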
Cache the result of calculating your statistics
It's quite likely, this being a stats page, that you'll want to visit it quite frequently (not as frequently as a public page, but still). If the same page is visited more often than the cron-based script is executed, then there is no reason to do all the calculation again. So basically, all you have to do to cache your output is something similar to the following:
$cachedVersion = getcwd() . '/cached/stats.html';
/// check to see if there is a cached version of this page
if ( file_exists($cachedVersion) ) {
    /// if so, load it and echo it to the browser
    echo file_get_contents($cachedVersion);
}
else {
    /// start output buffering so we can catch what we send to the browser
    ob_start();
    /// DO YOUR STATS CALCULATION HERE AND ECHO IT TO THE BROWSER LIKE NORMAL
    /// end output buffering and grab the contents so we now have a string
    /// of the page we've just generated
    $content = ob_get_contents(); ob_end_clean();
    /// write the content to the cached file for next time
    file_put_contents($cachedVersion, $content);
    echo $content;
}
Once you start caching things you need to be aware of when you should delete or clear your cache - otherwise your stats output will never change. With regard to this situation, the best time to clear your cache is at the point you go and fetch the external web pages again, so you should add this line to the bottom of your "cron" script.
$cachedVersion = getcwd() . '/cached/stats.html';
unlink( $cachedVersion ); /// will delete the file
There are other speed improvements you could make to the caching system (you could even record the modified times of the external webpages and load only when they have been updated) but I've tried to keep things easy to explain.
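As one example of such an improvement, here is a rough sketch of time-based expiry (purely an assumption on my part, not something the code above does): regenerate the cache once it is older than a chosen age.
$cachedVersion = getcwd() . '/cached/stats.html';
$maxAge = 15 * 60; /// seconds - pick whatever matches your cron schedule
if ( file_exists($cachedVersion) && (time() - filemtime($cachedVersion)) < $maxAge ) {
    echo file_get_contents($cachedVersion);   /// still fresh, serve the cached copy
}
else {
    /// ...regenerate the stats and rewrite the cache exactly as shown above...
}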
Don't use an HTML parser for this situation
Scanning an HTML file for one particular unique value does not require a full-blown or even lightweight HTML parser. Using RegExp incorrectly seems to be one of those things that lots of start-up programmers fall into, and it is a question that is always being asked. This has led to lots of automatic knee-jerk reactions from more experienced coders who automatically adhere to the following logic:
if ( $askedAboutUsingRegExpForHTML ) {
$automatically->orderTheSillyPersonToUse( $HTMLParser );
} else {
$soundAdvice = $think->about( $theSituation );
print $soundAdvice;
}
HTML parsers should be used when the target within the markup is not so unique, or your pattern to match relies on such flimsy rules that it'll break the second an extra tag or character occurs. They should be used to make your code more reliable, not to speed things up. Even parsers that do not build a tree of all the elements will still be using some form of string searching or regular expression notation, so unless the library code you are using has been compiled in an extremely optimised manner, it will not beat well-coded strpos/preg_match logic.
Considering I have not seen the HTML you are hoping to parse, I could be way off, but from what I've seen of your snippet it should be quite easy to find the value using a combination of strpos and preg_match. Obviously if your HTML is more complex and might have random multiple occurrences of <dt>Win rate</dt><dd><div>50%</div></dd> it will cause problems - but even so - an HTML parser would still have the same problem.
$offset = 0;
/// loop through the occurrences of 'Win rate'
while ( ($p = stripos($html, 'win rate', $offset)) !== FALSE ) {
    /// grab out a snippet of the surrounding HTML to speed up the RegExp
    $snippet = substr($html, $p, 50);
    /// I've extended your RegExp to try and account for 'white space' that could
    /// occur around the elements. The following won't take into account any random
    /// attributes that may appear, so if you find some pages aren't working - echo
    /// out the $snippet var using something like "echo '<xmp>'.$snippet.'</xmp>';"
    /// and that should show you what is appearing that is breaking the RegExp.
    if ( preg_match('#^win\s+rate\s*</dt>\s*<dd>\s*<div>\s*([0-9]+%)\s*<#i', $snippet, $regs) ) {
        /// once you are here your % value will be in $regs[1];
        break; /// exit the while loop as we have found our 'Win rate'
    }
    /// move the offset past this occurrence for the next loop
    $offset = $p + 1;
}
Gotchas to be aware of
If you are new to PHP, as you state in a comment above, then the above may seem rather complicated - which it is. What you are trying to do is quite complex, especially if you want to do it optimally and fast. However, if you follow through the code I've given and research any bits that you aren't sure of / haven't heard of (php.net is your friend), it should give you a better understanding of a good way to achieve what you are doing.
Guessing ahead however, here are some of the problems you might face with the above:
File permission errors - in order to be able to read and write files to and from the local operating system you will need the correct permissions to do so. If you find you cannot write files to a particular directory, it might be that the host you are using won't allow you to do so. If this is the case you can either contact them to ask how to get write permission to a folder, or if that isn't possible you can easily change the code above to use a database instead.
I can't see my content - when using output buffering, all the echo and print commands do not get sent to the browser; they instead get saved up in memory. PHP should automatically output all the stored content when the script exits, but if you use a command like ob_end_clean() this actually wipes the 'buffer', so all the content is erased. This can lead to confusing situations when you know you are echoing something... but it just isn't appearing.
(Mini Disclaimer :) I've typed all the above manually so you may find there are PHP errors, if so, and they are baffling, just write them back here and StackOverflow can help you out)
Instead of trying not to use preg_match, why not just trim your document contents down in size? For example, you could dump everything before <body and everything after </body>; then preg_match will already be searching less content.
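A rough sketch of that trimming idea (it assumes the page actually contains <body ...> and </body> tags):
$start = stripos($content, '<body');
$end   = stripos($content, '</body>');
if ($start !== false && $end !== false) {
    $content = substr($content, $start, $end - $start); // keep only the body
}
preg_match($parameter, $content, $match);  // same pattern as before, smaller haystack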
Also, you could try to run each of these processes as a pseudo-separate thread, so that they aren't happening one at a time.
I have code in my CMS that prints content: <?php print $content ?>
I would like to output the actual PHP and HTML code behind $content, ideally in the browser. What I mean here is not the rendered result in the browser, but the actual code behind it. Is it possible at all?
EDIT: Just to explain further: I need to print the source code of $content. Basically this variable produces some HTML and PHP content. I would like to see the code it produces, change it, and replace $content with my custom code. Ideally the source code should be printed in the browser; is there any PHP function that does this?
First off, install the Devel module; it has a wonderful function called dpm() which will print the contents of any variable to the Drupal messages area.
Then you need to go into your theme's template.php file and implement hook_preprocess_page():
function mytheme_preprocess_page(&$vars) {
dpm($vars['content']);
}
That will print out the $content array before it's rendered into a string. In the same preprocess function you can also change $vars['content'] as you see fit, and the changes will be reflected in $content in page.tpl.php.
Hope that helps
What do you mean by 'the code'? I think what you want to do is not possible; unless you make some kind of quine, it's not possible to output the actual PHP code of a PHP file when you run it.
If $content is something like:
$content = 3 + 4 + 5;
echo $content; will output 12, yes? But I'm taking it you want to output 3 + 4 + 5 or something along those lines. The thing is, PHP (although it doesn't feel like it) is compiled. In this trivial example, 3 + 4 + 5 is stored exactly nowhere in your compiled program; it is stored as 12 (since it's static). More complex lines of code will be stored as pointers, values etc., all in nicely obfuscated machine code. Getting back to the 3 + 4 + 5 requires reading the input file and outputting the relevant line, which is difficult (think about what happens if you add or remove some lines, or how your running program knows where in the source file it is, or even whether it's in the right source file).
tl;dr: this is not possible.
Well, if you just want to see the HTML source for $content, you should simply use htmlspecialchars:
echo htmlspecialchars($content);
http://php.net/htmlspecialchars
or http://php.net/htmlentities
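A small usage sketch: wrap the escaped markup in a <pre> block so the browser shows it as readable source rather than rendering it.
echo '<pre>' . htmlspecialchars($content) . '</pre>';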
What I want to do:
I have a div with an id. Whenever ">" occurs I want to replace it with ">>". I also want to prefix the div with "You are here: ".
Example:
<div id="bbp-breadcrumb">Home > About > Contact</div>
Context:
My div contains breadcrumb links for bbPress, but I'm trying to match its format to a site-wide breadcrumb plugin that I'm using for WordPress. The div is called as a function in PHP and output as HTML.
My question:
Do I use PHP or JavaScript to replace the symbols, and how do I go about getting the contents of the div in the first place?
Find the code that's generating the >, and either set the appropriate option (breadcrumb_separator or so) or modify the PHP code to change the separator.
Modifying supposedly static text with JavaScript is not only a maintenance nightmare, extremely brittle, and might lead to a strange rendering (as users see your site being modified if their system is slow), but will also not work in browsers without (or with disabled) JavaScript support.
You could use CSS to add the you are here text:
#bbp-breadcrumb:before {
content: "You are here: ";
}
Browser support:
http://www.quirksmode.org/css/beforeafter_content.html
You could change the > to >> with javascript:
var htmlElement = document.getElementById('bbp-breadcrumb');
htmlElement.innerHTML = htmlElement.innerHTML.split('&gt;').join('&gt;&gt;').split('>').join('>>');
I don't recommend altering content like this; it's really hacky. You'd be better off changing the output rendering of the breadcrumb plugin if possible. Within WordPress this should be doable.
You can use a regex to match the breadcrumb content, make the changes on it, and put it back in the context.
Check if this helps you:
$the_existing_html = 'something before<div id="bbp-breadcrumb">Home > About > Contact</div>something after'; // let's say this is your current html.. just added some context
echo $the_existing_html, '<hr />'; // output.. so that you can see the difference at the end
$pattern ='|<div(.*)bbp-breadcrumb(.*)>(.*)<\/div>|sU'; // find some text that is in a div that has "bbp-breadcrumb" somewhere in its attributes list
$all = preg_match_all($pattern, $the_existing_html, $matches); // match that pattern
$current_bc = $matches[3][0]; // get the text inside that div
$new_bc = 'You are here: ' . str_replace('>', '>>', $current_bc); // replace each > with the same thing repeated twice
$the_final_html = str_replace($current_bc, $new_bc, $the_existing_html); // replace the initial breadcrumb with the new one
echo $the_final_html; // output to see where we got