I'm running a lottery syndicate and want to automate our system to check the lottery numbers (UK National Lottery).
The URL I am fetching is: https://www.national-lottery.co.uk/player/p/results/lotto.ftl
and I am using
<?php
$html = file_get_contents("https://www.national-lottery.co.uk/player/p/results/lotto.ftl");
?>
I would like to be able to grab the area of the page that shows the drawn numbers.
The problem is, there is a lot of content on that page and I don't know the first step I would take to break it all down.
Does anyone know a way to do this in PHP or jQuery?
Thanks
What about an existing RSS feed? http://www.alllotto.co.uk/rss/latest.rss
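If the feed route suits you, parsing it in PHP is straightforward with SimpleXML. A minimal sketch, assuming the feed uses standard RSS 2.0 items (I have not inspected it):
<?php
// Sketch: read the latest draws from the RSS feed (standard RSS 2.0 assumed)
$feed = simplexml_load_file('http://www.alllotto.co.uk/rss/latest.rss');
if ($feed !== false) {
    foreach ($feed->channel->item as $item) {
        // The drawn numbers are expected in the title and/or description
        echo $item->title . ': ' . $item->description . "\n";
    }
}
?>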
I would take a look at the PHP Simple HTML DOM Parser. It simplifies scraping and does what you're asking.
Using this, finding LI elements is as easy as this:
include 'simple_html_dom.php'; // the parser is a single include

$html = file_get_html('https://www.national-lottery.co.uk/player/p/results/lotto.ftl');
foreach ($html->find('li') as $element) {
    echo $element . '<br>';
}
I'm trying to do a fun little project where I take headlines from, for example, a news site, scrape/mirror them onto another site using PHP, and then have the data displayed on the new site actually be clickable links back to the original site. If that's a bit confusing, let me show an example.
http://www.wilsonschlamme.com/test.php
Right there, I'm using PHP to scrape all the data from the Antrim Review (a local Michigan news site) contained in a <span> with a particular class.
I chose that span class because that's where their headlines are located. I'm just using Antrim for testing purposes; I have no affiliation with them.
What I'm wondering, and what I don't know how to do, is how to make the headlines re-displayed on my test site into clickable links. In other words, I want to retain the <a href> of each headline so it links to the full article. On the Antrim website those headlines are clickable links to full pages; when mirrored on my test site there are clearly no links, because nothing is grabbing that data.
Does anyone know how this could be done, or have any thoughts? I would really appreciate it; this is a fun project, I'm just lacking the knowledge to complete it.
Oh, and I know the Pokemon references in the code below are silly. It's because I'm working with code originally from a tutorial somewhere:
<?php
$html = file_get_contents('http://www.antrimreview.net/'); // get the html returned from the url
$pokemon_doc = new DOMDocument();
libxml_use_internal_errors(TRUE); // disable libxml errors
if (!empty($html)) { // if any html is actually returned
    $pokemon_doc->loadHTML($html);
    libxml_clear_errors(); // remove errors for yucky html
    $pokemon_xpath = new DOMXPath($pokemon_doc);
    // get all the spans that have a class attribute
    $pokemon_row = $pokemon_xpath->query('//span[@class]');
    if ($pokemon_row->length > 0) {
        foreach ($pokemon_row as $row) {
            echo $row->nodeValue . "<br/>";
        }
    }
}
?>
I actually found it simpler to just use a CNN RSS feed, for example, using surfing-waves to generate the code. Thanks for the suggestions anyway.
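For anyone who still wants the scraping route: the links can be kept by selecting the anchors themselves rather than just the spans. A minimal sketch, assuming each headline <span class="..."> sits inside an <a> tag; the XPath is an assumption, not tested against the real markup:
<?php
// Sketch only: assumes each classed headline span is wrapped in an <a>
$html = file_get_contents('http://www.antrimreview.net/');
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);
// Find the nearest enclosing <a> of each classed span (adjust to the real markup)
foreach ($xpath->query('//span[@class]/ancestor::a[1]') as $a) {
    $href = $a->getAttribute('href');
    // Relative hrefs would need the site's base URL prepended here
    echo '<a href="' . htmlspecialchars($href) . '">'
       . htmlspecialchars($a->nodeValue) . '</a><br/>';
}
?>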
So my school has a very annoying way of viewing my timetable.
You have to click through five links to get to it.
This is the link for my class (it updates weekly without the link changing):
https://webuntis.a12.nl/WebUntis/?school=roc%20a12#Timetable?type=1&departmentId=0&id=2147
I want to display the content from that page on my website, but with my own stylesheet.
I don't mean this:
<?php
$homepage = file_get_contents('http://www.example.com/');
echo $homepage;
?>
or an iframe....
I think this can be done better using jQuery and AJAX. You can have jQuery load the target page, use selectors to strip out what you need, and then attach it to your document tree. You should then be able to style it any way you like.
I would recommend using the cURL library: http://www.php.net/manual/en/curl.examples.php
But you will have to extract the part of the page you want to display, because you will get the whole HTML document.
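A minimal fetch might look like this (a sketch; the options beyond CURLOPT_RETURNTRANSFER are suggestions, not requirements):
<?php
// Sketch: fetch the timetable page with cURL
$ch = curl_init('https://webuntis.a12.nl/WebUntis/?school=roc%20a12');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow any redirects
$html = curl_exec($ch);
curl_close($ch);
// $html now holds the whole document; the fragment you want still has to be extracted
?>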
You'd probably read the whole page into a string variable (using file_get_contents as you mentioned, for example) and then parse the content. Here you have some possibilities:
Regular expressions
Walking the DOM tree (e.g. using PHP's DOMDocument classes)
After that, you'd most likely replace all the style="..." or class="..." information with your own.
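Putting those steps together, a sketch using DOMDocument, assuming the page source is in $html (e.g. from the cURL fetch above); the id "timetable" is hypothetical, so inspect the real page for the element that actually wraps the schedule:
<?php
// Sketch: pull one fragment out of the fetched page and strip its styling
$doc = new DOMDocument();
libxml_use_internal_errors(true); // the page may not be valid markup
$doc->loadHTML($html);
$fragment = $doc->getElementById('timetable'); // hypothetical id
if ($fragment !== null) {
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('.//*[@style or @class]', $fragment) as $el) {
        $el->removeAttribute('style'); // drop inline styles
        $el->removeAttribute('class'); // drop classes so your own CSS applies
    }
    echo $doc->saveHTML($fragment);
}
?>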
First, I know that I can get the HTML of a webpage with:
file_get_contents($url);
What I am trying to do is get a specific link element in the page (found in the head).
e.g:
<link type="text/plain" rel="service" href="/service.txt" /> (the element could close with just >)
My question is: How can I get that specific element with the "rel" attribute equal to "service" so I can get the href?
My second question is: Should I also get the "base" element? Does it apply to the "link" element? I am trying to follow the standard.
Also, the HTML might have errors. I don't have control over how my users code their stuff.
Using PHP's DOMDocument, this should do it (untested):
$doc = new DOMDocument();
libxml_use_internal_errors(true); // tolerate the malformed HTML you mentioned
$doc->loadHTML(file_get_contents($url)); // $url is the page to inspect
$head = $doc->getElementsByTagName('head')->item(0);
$links = $head->getElementsByTagName("link");
foreach ($links as $l) {
    if ($l->getAttribute("rel") == "service") {
        echo $l->getAttribute("href");
    }
}
You should get the Base element, but know how it works and its scope.
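For instance, a rough sketch of taking <base> into account, reusing $doc and $l from the snippet above ($url stands for the page's own address; the join is naive):
// Rough sketch: naive resolution of the found href against <base>, if present
$baseTags = $doc->getElementsByTagName('base');
$base = ($baseTags->length > 0) ? $baseTags->item(0)->getAttribute('href') : $url;
$href = $l->getAttribute('href');
if (!preg_match('#^https?://#i', $href)) {
    // crude join; full RFC 3986 resolution also handles '../', '?' and '#'
    $href = rtrim($base, '/') . '/' . ltrim($href, '/');
}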
In truth, when I have to screen-scrape, I use phpquery. This is an older PHP port of jQuery, and while that may sound like something of a dumb concept, it is awesome for document traversal, and it doesn't require well-formed XHTML.
http://code.google.com/p/phpquery/
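A tiny example of what phpquery usage looks like for your rel="service" case; this is from memory, so treat the exact calls and the include path as assumptions:
require_once 'phpQuery/phpQuery.php'; // adjust to wherever you unpacked it
// Load the document, then traverse it with jQuery-style selectors
phpQuery::newDocument(file_get_contents($url));
foreach (pq('head link[rel=service]') as $link) {
    echo pq($link)->attr('href'); // pq() wraps the plain DOM node again
}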
I'm working with Selenium under Java for web application testing. It provides very nice features for document traversal using CSS selectors.
Have a look at How to use Selenium with PHP.
But this setup might be too complex for your needs if you only want to extract this one link.
I've always wondered how to do something like this. I am not the owner/admin/webmaster of the site (http://poolga.com/), but the information I wish to obtain is publicly available. This page (http://poolga.com/artists) is a directory of all of the artists that have contributed to the site. However, each link on this page goes to another page, which contains this anchor tag with the link to the artist's actual website:
<a id="author-url" class="helv" target="_blank" href="http://aaaghr.com/">http://aaaghr.com/</a>
I hate having to command-click the links in the directory and then click the link to the artist's website. I would love a way to have a batch of 10 of the artist website links open as tabs in the browser, just for temporary viewing. But even getting these hrefs into some sort of array would be a feat in itself. Any idea, direction, or Google searches in any programming language would be great! Would this even be referred to as "crawling"? Thanks for reading!
UPDATE
I used Simple HTML DOM on my local PHP MAMP server with this script; it took a little while to run!
include 'simple_html_dom.php';

$artistPages = array();
foreach (file_get_html('http://poolga.com/artists')->find('div#artists ol li a') as $element) {
    $artistPages[] = $element->href;
}
foreach ($artistPages as $page) {
    foreach (file_get_html($page)->find('a#author-url') as $element) {
        echo $element->href . '<br>';
    }
}
My favourite PHP library for navigating the DOM is Simple HTML DOM.
set_time_limit(0);
$poolga = file_get_html('http://poolga.com/artists');
$inRefs = $poolga->find('div#artists ol li a');
$links = array();
foreach ($inRefs as $ref) {
$site = file_get_html($ref->href);
$links[] = $site->find('a#author-url', 0)->href;
}
print_r($links);
The code, I think, is pretty self-explanatory.
Edit: I had a spelling mistake. The script will take a really, really long time to finish, seeing as there are so many links; that's why I used set_time_limit(). Go do other stuff and let it run.
Use some function to loop through the artist subpages (using jQuery as an example):
$("#artists li").each();
(each entry is under a <li> inside the <div id="artists">)
Then you will have to read each page and search for the element <div id="artistSites"> or the <h2 id="author">:
$("#author a").attr("href");
The implementation details will depend on how different each page is. I only looked at two, so it may be a little more complicated than this.
I would like to make a simple but non-trivial manipulation of DOM elements with PHP, but I am lost.
Assume a page like Wikipedia where you have paragraphs and titles (<p>, <h2>). They are siblings. I would like to take both elements, in sequential order.
I have tried getElementsByTagName, but then you have no possibility to organize the information.
I have tried DOMXPath->query() but I found it really confusing.
Just parsing something like:
<html>
<head></head>
<body>
<h2>Title1</h2>
<p>Paragraph1</p>
<p>Paragraph2</p>
<h2>Title2</h2>
<p>Paragraph3</p>
</body>
</html>
into:
Title1
Paragraph1
Paragraph2
Title2
Paragraph3
There are a few bits of HTML code I do not need in between all of that.
Thank you. I hope the question does not look like homework.
I think DOMXPath->query() is the right approach. This XPath expression will return all nodes that are either an <h2> or a <p> on the same level (since you said they were siblings):
/html/body/*[name() = 'p' or name() = 'h2']
The nodes will be returned as a node list in the right order (document order). You can then loop over the result with foreach.
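In code, the loop might look like this (a sketch, assuming the markup from the question is in $html):
<?php
$doc = new DOMDocument();
$doc->loadHTML($html); // $html holds the markup from the question
$xpath = new DOMXPath($doc);
$nodes = $xpath->query("/html/body/*[name() = 'p' or name() = 'h2']");
foreach ($nodes as $node) {
    echo $node->nodeValue . "\n"; // prints Title1, Paragraph1, ... in document order
}
?>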
I have used Simple HTML DOM by S.C. Chen a few times.
It is a perfect class for accessing DOM elements.
Example:
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
// Find all images
foreach($html->find('img') as $element)
echo $element->src . '<br>';
// Find all links
foreach($html->find('a') as $element)
echo $element->href . '<br>';
Check it out here: simplehtmldom.
It may help with future projects.
Try having a look at this library and corresponding project:
Simple HTML DOM
This allows you to open an online webpage or an HTML page from the filesystem and access its elements via class names, tag names and IDs. If you are familiar with jQuery and its syntax, you will need no time getting used to this library.