Trying to scrape the entire content of a div

I have a project I'm working on, and I'd like to add a really small list of nearby places using Facebook's Places in an iframe sourced from touch.facebook.com. I can easily just use touch.facebook.com/#/places_friends.php, but that also loads the header and the other navigation bars (messages, events, etc.), and I just want the content.
I'm pretty sure, from looking at the touch.facebook.com/#/places_friends.php source, that all I need to load is the div "content". I'm extremely new to PHP, and I believe what I'm trying to do is called web scraping.
To keep things simple here on Stack Overflow and avoid worrying about authentication for now, I want to load the login page to see if I can at least get the scraper to work. Once I have working scraping code, I'm pretty sure I can handle the rest. It has to load everything inside the div. I've seen this done before, so I know it is possible. The result should look exactly like what you see when you try to log in at touch.facebook.com, but without the blue Facebook logo at the top.
So here's the login page: I'm trying to load the div which contains the login text boxes and the actual login button. If it's done correctly, we should see just those, with no blue Facebook header bar above them.
I've tried:
<?php
$page = file_get_contents('http://touch.facebook.com/login.php');
$doc = new DOMDocument();
$doc->loadHTML($page);
$divs = $doc->getElementsByTagName('div');
foreach($divs as $div) {
    if ($div->getAttribute('id') === 'login_form') {
        echo $div->nodeValue;
    }
}
?>
All that does is load a blank page.
I've also tried using http://simplehtmldom.sourceforge.net/
and I modified its basic selector example to:
<?php
include('../simple_html_dom.php');
$html = file_get_html('http://touch.facebook.com/login.php');
foreach($html->find('div#login_form') as $e)
echo $e->nodeValue;
?>
I've also tried:
<?php
$stream = "http://touch.facebook.com/login.php";
$cnt = simplexml_load_file($stream);
$result = $cnt->xpath("/html/body/div[@id=login_form]");
for($i = 0; $i < $i < count($result); $i++){
    echo $result[$i];
}
?>
That did not work either.

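$stream = "http://touch.facebook.com";
$cnt = simplexml_load_file($stream);
if ($cnt === false) {
    die("The page could not be parsed as XML");
}
// Select the div whose id attribute is "content"
$result = $cnt->xpath("/html/body/div[@id='content']");
for ($i = 0; $i < count($result); $i++){
    // asXML() returns the element with its markup; echoing the element alone prints only its text
    echo $result[$i]->asXML();
}
An earlier version of this snippet had a syntax error in the for loop condition ($i < $i < count($result)). It is removed here, so the code can be copied, pasted and run.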

I'm assuming that you can't use the Facebook API. If you can, I strongly suggest you use it, because it will save you from the whole scraping business.
For scraping text, XPath is usually the most convenient tool. If the HTML returned by touch.facebook.com is XHTML Transitional, which it should be, then you can use XPath. A sample would look like this:
$stream = "http://touch.facebook.com";
$cnt = simplexml_load_file($stream);
// Select the div whose id attribute is "content"
$result = $cnt->xpath("/html/body/div[@id='content']");
for ($i = 0; $i < count($result); $i++){
    echo $result[$i];
}
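
As a complement to the XPath sample above, here is a minimal sketch that swaps SimpleXML for DOMDocument with DOMXPath, since loadHTML() tolerates markup that is not well-formed XML. The helper name extractElementById and the sample markup are made up for illustration, and the sketch assumes the id contains no double quotes. It is a sketch under those assumptions, not a drop-in scraper for Facebook.

```php
<?php
// Hypothetical helper: returns the HTML of the first element with the given id,
// or null when no such element exists. Assumes $id contains no double quotes.
function extractElementById(string $html, string $id): ?string
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//*[@id="' . $id . '"]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    // saveHTML($node) serializes the node including its markup;
    // nodeValue would return only the text inside it.
    return $doc->saveHTML($nodes->item(0));
}

// Stand-in markup; a real page would come from file_get_contents() or cURL.
$sample = '<html><body><div id="header">Logo</div>'
        . '<div id="login_form"><input type="text" name="email"><button>Log in</button></div>'
        . '</body></html>';

echo extractElementById($sample, 'login_form');
```

The key difference from the attempts in the question is saveHTML() on the matched node: nodeValue only yields text, which is empty for a div that holds nothing but input fields.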

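Check your comparison operators.
=== compares strictly (value and type), while == compares loosely after type conversion. getAttribute() returns a string, so both give the same result for this check; the loose form would be:
if ($div->getAttribute('id') == 'login_form')
{
}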

Scraping isn't always the best way to capture data from elsewhere. I would suggest using Facebook's API to retrieve the values you need. Scraping will break any time Facebook decides to change their markup.
http://developers.facebook.com/docs/api
http://github.com/facebook/php-sdk/

Related

:not CSS Selector Implementation Issue

I am crawling links from a website (this one), but the structure of the website creates unwanted additional output. Basically, the <a> tags contain the name of an article plus additional information (images and the sources of those images). I would like to get rid of the additional information. I found the :not selector for this, but I guess I am implementing it wrong, because every combination I have tried gives me no output at all.
Here is the output.
Here is the code I need to alter:
$posts = $html->find('ul[class=river] a[data-omni-click=inherit] :not[figure]');
(I have also tried figure:not and a couple of other combinations)
Does anyone know where I went wrong, and what I have to do to exclude the <figure> tag?
Here is my full code, not sure if that helps:
<div class='rcorners1'>
<?php
include_once('simple_html_dom.php');
$target_url = "http://www.theatlantic.com/most-popular/";
$html = new simple_html_dom();
$html->load_file($target_url);
$posts = $html->find('ul[class=river] a[data-omni-click=inherit] :not[figure]');
$limit = 10;
$limit = count($posts) < $limit ? count($posts) : $limit;
for($i=0; $i < $limit; $i++){
$post = $posts[$i];
$post->href = 'http://www.theatlantic.com'.$post->href;
echo strip_tags($post, '<p><a>'); //echo ($post);
}
?>
</div>
</div>

TeeChart for PHP : Differences between image render and JavaScript Export

I am trying to use the right axis of TeeChart for PHP. I'm aware that a valid series needs to be linked to both vertical axes. I ran a simple test with the custom axes demo from the Steema site: I copied and pasted the demo and tried to export it to JavaScript instead of rendering it.
I used this code to export to JavaScript:
echo $tChart1->getChart()->getExport()->getImage()->getJavaScript()->Render()->toString();
Here is a snapshot of the two renders side by side (sorry to put it in a link; this forum doesn't allow me to post pictures yet...)
Is there a way to get the right axis to show with the export?
EDIT:
Here is the code to test on your side :
<?php
//Includes
include "../../../sources/TChart.php";
$chart1 = new TChart(600,450);
$chart1->getChart()->getHeader()->setText("Custom Axes Demo");
$chart1->getAspect()->setView3D(false);
$line1 = new Line($chart1->getChart());
$line2 = new Line($chart1->getChart());
$line1->setColor(Color::RED());
$line2->setColor(Color::GREEN());
$chart1->addSeries($line1);
$chart1->addSeries($line2);
// Speed optimization
$chart1->getChart()->setAutoRepaint(false);
for($t = 0; $t <= 10; ++$t) {
$line1->addXY($t, (10 + $t), Color::RED());
if($t > 1) {
$line2->addXY($t, $t, Color::GREEN());
}
}
$chart1->getAxes()->getLeft()->setStartPosition(0);
$chart1->getAxes()->getLeft()->setEndPosition(50);
$chart1->getAxes()->getLeft()->getAxisPen()->color = Color::RED();
$chart1->getAxes()->getLeft()->getTitle()->getFont()->setColor(Color::RED());
$chart1->getAxes()->getLeft()->getTitle()->getFont()->setBold(true);
$chart1->getAxes()->getLeft()->getTitle()->setText("1st Left Axis");
$chart1->getAxes()->getTop()->getLabels()->setAngle(45);
$chart1->getAxes()->getTop()->getTitle()->getFont()->setColor(Color::YELLOW());
$chart1->getAxes()->getTop()->getTitle()->getFont()->setBold(true);
$chart1->getAxes()->getBottom()->getLabels()->setAngle(0);
$chart1->getAxes()->getRight()->getLabels()->setAngle(45);
$chart1->getAxes()->getBottom()->getTitle()->getFont()->setColor(new Color(255,25,25));
$chart1->getAxes()->getBottom()->getTitle()->getFont()->setBold(true);
$chart1->getAxes()->getRight()->getTitle()->getFont()->setColor(Color::BLUE());
$chart1->getAxes()->getRight()->getTitle()->getFont()->setBold(true);
$chart1->getAxes()->getRight()->getTitle()->setText("OtherSide Axis");
$chart1->getAxes()->getRight()->getLabels()->getFont()->setColor(Color::BLUE());
$chart1->getAxes()->getRight()->getAxisPen()->setColor(Color::BLUE());
$chart1->getAxes()->getTop()->getTitle()->setText("Top Axis");
$chart1->getAxes()->getBottom()->getTitle()->setText("Bottom Axis");
$line1->setHorizontalAxis(HorizontalAxis::$BOTH);
$line1->setVerticalAxis(VerticalAxis::$BOTH);
$axis1 = new Axis(false, false, $chart1->getChart());
$chart1->getAxes()->getCustom()->add($axis1);
$line2->setCustomVertAxis($axis1);
$axis1->setStartPosition(50);
$axis1->setEndPosition(100);
$axis1->getTitle()->getFont()->setColor(Color::GREEN());
$axis1->getTitle()->getFont()->setBold(true);
$axis1->getTitle()->setText("Extra Axis");
$axis1->getTitle()->setAngle(90);
$axis1->setRelativePosition(20);
$axis1->getAxisPen()->setColor(Color::GREEN());
$axis1->getGrid()->setVisible(false);
// Export to JavaScript (the chart variable defined above is $chart1, not $tChart1)
echo $chart1->getChart()->getExport()->getImage()->getJavaScript()->Render()->toString();
?>

I've modified the end of your test page to show both the HTML5 and the PHP charts on the same page:
echo $chart1->getChart()->getExport()->getImage()->getJavaScript()->Render()->toString();
$chart1->render("chart1.png");
$rand=rand();
print '<img src="chart1.png?rand='.$rand.'">';
Then, I've modified the TeeChart PHP sources to also export the custom axes and the axis assignments.
It now looks like this:
Please send a mail to "info@steema.com" and we'll send you the modified unit (JavaScriptExport.php).

Php auto go to the next page and scrape

I'm new to PHP and I'm trying to code a tool that scrapes Amazon product titles.
Right now I can scrape the first page, but I need the tool to go to the next page and repeat the same scraping task until there are no pages left.
Here is the code:
<?php
$file_string = file_get_contents('http://www.amazon.com/s/ref=lp_3737671_pg_1?rh=n%3A1055398%2Cn%3A%211063498%2Cn%3A3206324011%2Cn%3A3737671&page=1&ie=UTF8&qid=1361609819');
preg_match_all('/<span class="lrg bold">(.*)<\/span>/i', $file_string, $links);
for($i = 0; $i < count($links[1]); $i++) {
echo $links[1][$i] . '<br>';
}
?>
Any help is appreciated...

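To get the HTML of all pages in one variable, something like this would do the trick, provided the matches in $links[1] are the URLs of the pages to fetch (the pattern from the question captures the span contents, so it would need adjusting):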
<?php
$html = '';
$file_string = file_get_contents('http://www.amazon.com/s/ref=lp_3737671_pg_1?rh=n%3A1055398%2Cn%3A%211063498%2Cn%3A3206324011%2Cn%3A3737671&page=1&ie=UTF8&qid=1361609819');
preg_match_all('/<span class="lrg bold">(.*)<\/span>/i', $file_string, $links);
for($i = 0; $i < count($links[1]); $i++) {
$html .= file_get_contents($links[1][$i]);
}
echo "all pages combined:\n".$html;
?>
However, more than likely your server will time out, run out of memory, or something else will go wrong. To scrape HTML content you'd be better off creating a URL list first, then scraping the URLs one at a time. You could do this via an HTML page that calls the scraper via AJAX.
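
One way to sketch the "one page at a time" idea is a loop that requests page 1, 2, 3 and so on, and stops as soon as a page yields no titles. The function name scrapeAllPages, the $fetch callable and the canned pages below are hypothetical and only there to keep the example self-contained. A real fetcher would build the URL by changing the page= parameter and call file_get_contents(). The assumption that an empty result means "no more pages" may not hold for every site.

```php
<?php
// Hypothetical helper: collects titles page by page until a page has no matches.
// $fetch is any callable that takes a page number and returns that page's HTML,
// or false when the request fails.
function scrapeAllPages(callable $fetch, int $maxPages = 50): array
{
    $titles = [];
    for ($page = 1; $page <= $maxPages; $page++) {
        $html = $fetch($page);
        if ($html === false) {
            break; // request failed, stop here
        }
        // Non-greedy (.*?) so that each span is captured on its own.
        preg_match_all('/<span class="lrg bold">(.*?)<\/span>/i', $html, $matches);
        if (count($matches[1]) === 0) {
            break; // no titles found: assume we ran past the last page
        }
        $titles = array_merge($titles, $matches[1]);
    }
    return $titles;
}

// Stand-in fetcher with canned pages instead of real HTTP requests.
$pages = [
    1 => '<span class="lrg bold">Book A</span><span class="lrg bold">Book B</span>',
    2 => '<span class="lrg bold">Book C</span>',
];
$fetch = function (int $page) use ($pages) {
    return $pages[$page] ?? '';
};

print_r(scrapeAllPages($fetch));
```

The $maxPages cap is a safety net so that a page which always returns matches cannot keep the loop running forever.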

Relating PHP to source files

Sorry if the title is a bit vague; I couldn't think of a better way to phrase it.
I'm attempting to make a paging system for a website, where you start on page one, then click page two and a different set of images appears. Each page has 12 images, all of them thumbnails. You click on a thumbnail and Lightbox brings up the high-res shot.
My current problem is that I cannot link the PHP script to the images correctly. To me it looks correct, but it doesn't work, so clearly it isn't.
Info:
Thumbnails are named "thumb1.jpg" to "thumb24.jpg"; full images are named "img1.jpg" to "img24.jpg".
<?php
$imgs = array(12, 12, );
if(!empty($_GET["page"]))
{
$currPage = $_GET["page"];
}
else
{
$currPage = 1;
}
for($i = 1; $i<$imgs[$currPage-1]+1;$i++)
{
echo "<a href='albums/norfolk weekender 2012/img'.$imgs[$currPage][$i].'.jpg' rel='lightbox[group]'><img src='albums/norfolk weekender 2012/thumb'.$imgs[$currPage][$i].'.jpg'/></a>";
}
?>
Anyway, I'm unsure why it doesn't work, and any help will be much appreciated.
Ta.
John.

'.$imgs[$currPage][$i].'
It looks like you should be using " instead of ' around this embedded variable, both times you reference it in the code, since your echo string is delimited by ":
"albums/norfolk weekender 2012/img".$imgs[$currPage][$i].".jpg"
Either way, the array structure you have doesn't look like it will work: $imgs holds two integers, so $imgs[$currPage][$i] does not refer to an image number.
Have you considered something like this (careful, it's rough), with $pageNo representing $_GET["page"]?
for ( $i = ($pageNo - 1) * 12 + 1; $i <= ($pageNo * 12); $i++ )
{
echo "<a href='albums/norfolk weekender 2012/img".$i.".jpg' rel='lightbox[group]'><img src='albums/norfolk weekender 2012/thumb".$i.".jpg'/></a>";
}
If presentation (i.e. checking that an image exists before displaying it) is a major concern, you could use file_exists( filename ).
By creating an array like this...
$imgs = array(12, 12, );
...you are simply creating an array containing two elements, each with the value 12 (PHP ignores the trailing comma, so there is no blank third element). I think where you went wrong is that you tried to declare the size in the array "constructor"; PHP's array() takes the elements themselves, not a size.
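
The page arithmetic from the loop above can be wrapped in a small function that also clamps the page number, so that a missing or out-of-range ?page= value still shows a valid page. This is a rough sketch: the function name imagesForPage and its default values (12 per page, 24 images in total) are assumptions taken from the numbers in the question.

```php
<?php
// Hypothetical helper: returns the image numbers shown on a given page.
function imagesForPage(int $page, int $perPage = 12, int $total = 24): array
{
    if ($total <= 0 || $perPage <= 0) {
        return [];
    }
    $lastPage = (int) ceil($total / $perPage);
    $page = min(max($page, 1), $lastPage); // clamp to 1..$lastPage
    $first = ($page - 1) * $perPage + 1;
    $last = min($page * $perPage, $total);
    return range($first, $last);
}

// Fall back to page 1 when no page parameter is given.
$page = isset($_GET['page']) ? (int) $_GET['page'] : 1;
$dir = 'albums/norfolk weekender 2012';
foreach (imagesForPage($page) as $n) {
    echo "<a href='$dir/img$n.jpg' rel='lightbox[group]'><img src='$dir/thumb$n.jpg'/></a>\n";
}
```

Casting $_GET['page'] to int also keeps arbitrary user input out of the generated file names.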

A fast way (or alternate way) to dynamically load 40000 links in an image map?

I'm returning a GD image which is generated from user information in a database. On the page where this image is viewed, I have the following area map for the image, generated by the same sort of query, to create a link to each user's profile. However, there are potentially 40000 users in the database. What I have IS working, but as you can imagine it takes a long while to load.
<map id="pixel" name="pixel">
<?
$map_x_1 = 0;
$map_y_1 = 0;
$map_x_2 = 5;
$map_y_2 = 5;
$block_num = 1;
while ($map_y_2 <= 1000) {
while ($map_x_2 <= 1000) {
$actual_x_cood = $map_x_1+1;
$actual_y_cood = $map_y_1+1;
$grid_search = mysql_query("SELECT *
FROM project
WHERE project_x_cood = '$actual_x_cood' AND project_y_cood = '$actual_y_cood'") or die(mysql_error());
$block_exists = mysql_num_rows($grid_search);
if ($block_exists == 1) {
echo("<area shape=\"rect\" coords=\"$map_x_1, $map_y_1, $map_x_2, $map_y_2\" href=\"/block/$block_num/\" alt=\"\" title=\"$block_num\" />\n");
} else {
echo("<area shape=\"rect\" coords=\"$map_x_1, $map_y_1, $map_x_2, $map_y_2\" href=\"/block/$block_num/\" alt=\"\" title=\"$block_num\" />\n");
}
$map_x_1 = $map_x_1 + 5;
$map_x_2 = $map_x_2 + 5;
$block_num = $block_num+1;
}
$map_y_1 = $map_y_1 + 5;
$map_y_2 = $map_y_2 + 5;
$map_x_1 = 0;
$map_x_2 = 5;
}
?>
</map>
I was thinking about just throwing in a quick jquery load screen over the top in a div and then hiding it once the page has fully loaded so it looks nicer. But I'm not really too happy with the idea of it since I would just like to load it faster.
So is there a quicker way to do it, maybe PHP? JS? Thanks!

You should consider using an input element of type image. It retrieves the x-y coordinates of the click as built-in functionality, and can be used from JavaScript or as part of a form submission.
After receiving the x-y coordinates, you can use a quadtree or another algorithm for quick spatial searching in your dataset.
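
Because the grid in the question is regular (5 by 5 pixel cells on a 1000 by 1000 image, numbered row by row starting at 1), plain integer division is enough to turn a click position into a block number, with no spatial index needed. A minimal sketch, assuming that layout; the function name blockFromClick and its parameters are made up for illustration.

```php
<?php
// Hypothetical helper: maps a click position to the block number produced by
// the nested loops in the question (cells numbered row by row, starting at 1).
// Returns null when the click falls outside the image.
function blockFromClick(int $x, int $y, int $cell = 5, int $size = 1000): ?int
{
    if ($x < 0 || $y < 0 || $x >= $size || $y >= $size) {
        return null;
    }
    $perRow = intdiv($size, $cell); // 200 cells per row for the defaults
    return intdiv($y, $cell) * $perRow + intdiv($x, $cell) + 1;
}

echo blockFromClick(7, 0), "\n";     // second cell of the first row
echo blockFromClick(0, 5), "\n";     // first cell of the second row
echo blockFromClick(999, 999), "\n"; // last cell of the grid
```

With an image input named, say, pixel, PHP exposes the click position as pixel_x and pixel_y in the submitted data, so the page needs one calculation and at most one database lookup per click instead of 40000 area elements.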

You should capture the click coordinates on the image (easy with jQuery) and pass them to the server, which then works out which user was clicked.
I did something similar with a rating bar that had 100 values (1-100%), but it was done in Prototype, so the code won't help you much.
Small hint: I had to subtract the left position of the container from the absolute click position.
With plain PHP and forms it's not as flexible, but far easier: you can just specify an input of type image, and the coordinates will be passed as POST variables.
something like
will suffice
