crawling a html page using php?

This website lists over 250 courses in one list. I want to get the name of each course and insert that into my mysql database using php. The courses are listed like this:
<td> computer science</td>
<td> media studies</td>
…
Is there a way to do that in PHP, instead of me having a mad data entry nightmare?

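Regular expressions work well for markup this simple.
$page = file_get_contents($url); // $url is the address of the course list page
$lines = preg_split('/\r?\n/', $page);
foreach ($lines as $text) {
    $matches = array();
    // match a line that consists of a single <td> cell and capture its text
    if (preg_match('/^\s*<td>(.*)<\/td>\s*$/', $text, $matches)) {
        $name = trim($matches[1]);
        // insert $name into the database
    }
}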
See the documentation for preg_match.

How to parse HTML has been asked and answered countless times before. While regular expressions will work for your specific use case, it is in general better and more reliable to use a proper parser for this task. Below is how to do it with DOM:
$dom = new DOMDocument;
$dom->loadHTMLFile('http://courses.westminster.ac.uk/CourseList.aspx');
foreach($dom->getElementsByTagName('td') as $title) {
echo $title->nodeValue;
}
For inserting the data into MySQL, you should use the mysqli extension. Examples are plentiful on Stack Overflow, so please use the search function.
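A minimal sketch of the two steps above: DOM extraction followed by a mysqli insert. The function names `extractCourseNames` and `insertCourses`, the table `courses`, and its column `name` are assumptions made for illustration; nothing here is taken from the actual course site or its database.

```php
<?php
// Collect the trimmed text of every <td> cell in an HTML string.
function extractCourseNames(string $html): array
{
    // Collect parse warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $names = [];
    foreach ($dom->getElementsByTagName('td') as $cell) {
        $name = trim($cell->nodeValue);
        if ($name !== '') {
            $names[] = $name;
        }
    }
    return $names;
}

// Insert each name with a prepared statement.
// Assumes a hypothetical table `courses` with a text column `name`.
function insertCourses(mysqli $db, array $names): void
{
    $stmt = $db->prepare('INSERT INTO courses (name) VALUES (?)');
    foreach ($names as $name) {
        $stmt->bind_param('s', $name);
        $stmt->execute();
    }
    $stmt->close();
}

$html = '<table><tr><td> computer science</td></tr><tr><td> media studies</td></tr></table>';
print_r(extractCourseNames($html));
```

The prepared statement keeps course names containing quotes from breaking the query. For the live page, the HTML string would come from `file_get_contents` or `loadHTMLFile` instead of a literal.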

You can use this HTML parsing PHP library to achieve this: http://simplehtmldom.sourceforge.net/

I encountered the same problem.
Here is a good class library called Simple HTML DOM:
http://simplehtmldom.sourceforge.net/
It works like jQuery.

Just for fun, here's a quick shell script to do the same thing.
curl http://courses.westminster.ac.uk/CourseList.aspx \
| sed '/<td>\(.*\)<\/td>/ { s/.*">\(.*\)<\/a>.*/\1/; b }; d;' \
| uniq > courses.txt

Related

Simple PHP script for downloading results

I need to download results from a website using a for loop to compile them.
(Note that it's an ASP request which displays a webpage with these parameters)
I wrote the following code to get me this:
<?php
for ($i=10; $i<500; $i++) {
$m = $i*10;
$dl = $query;
$text = file_get_contents($dl);
$doc = new DOMDocument('1.0');
$doc->loadHTML($text);
$aObj = $doc->find('Academic');
if (count($aObj) > 0)
{
echo "<h4>Found</h4>";
//Don't download this
}
else
{
echo "<h4>Not found</h4>";
//Download this
}
}
?>
But it returns several errors. Apparently it can't copy the ASPX file to the HTML DOM. How do I go about doing this? Also, how can I download/save the pages where the string 'Download' is not found?
I also think my method of finding 'Download' in the document is not working. What is the correct way to do this?

The website you're attempting to parse contains a lot of errors, so you won't be able to use the standard DOMDocument object as is. You can try a library such as SimpleHTMLDOM (http://simplehtmldom.sourceforge.net/) or phpQuery (https://code.google.com/p/phpquery/) and hope that it is tolerant enough to parse the malformed document.
If you just need some information, it might be easier to use regular expressions and preg_match_all (http://www.php.net/manual/en/function.preg-match-all.php), for example to find every occurrence of 'Academic'.
Note that it is usually not advisable to use regular expressions on structured documents such as HTML, since you won't be able to take advantage of the structure. But since those documents seem to contain over 300 errors and differ from each other, it might be the only way.
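As a hedged sketch of the "does this page mention a string" check: DOMDocument is often more tolerant of malformed markup than the warnings suggest, provided the warnings are collected rather than displayed. The function name `pageMentions` is an assumption for illustration, and the raw-string fallback is a design choice of this sketch, not something the answer above prescribes.

```php
<?php
// Return true if the visible text of an HTML string contains $needle
// (case-insensitive). Falls back to searching the raw markup when the
// parser produces no document element.
function pageMentions(string $html, string $needle): bool
{
    if ($html === '' || $needle === '') {
        return false;
    }

    // Collect libxml warnings internally so malformed markup stays quiet.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded || $doc->documentElement === null) {
        return stripos($html, $needle) !== false;
    }
    return stripos($doc->documentElement->textContent, $needle) !== false;
}

$broken = '<html><body><p>Academic<b>unclosed</p></body>';
var_dump(pageMentions($broken, 'academic'));
```

In the download loop from the question, the result of such a check would decide whether to save the fetched page, for example with `file_put_contents`.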

PHP Regular expressions problem [duplicate]

This question already has answers here:
Parse HTML with PHP's HTML DOMDocument
(2 answers)
PHP Regular expressions class/id
(1 answer)
Closed 8 years ago.
I'm using regex to pull info from an HTML table.
But I'm messing up somehow, and have no idea why.
PHP CODE:
$printable = file_get_contents('./testplaylist.php', true);
if(preg_match_all('/<TR[^>]*>(.*?)<\/TR>/si', $printable, $matches, PREG_SET_ORDER)); {
foreach($matches as $match) {
$data = "$match[1]";
echo("$data <br />");
}
}
HTML DATA:
<TR class=" light ">
Stuff in here
</TR>
Any help would be appreciated,
Thanks!

Try this one instead
http://sandbox.phpcode.eu/g/bba70.php
if(preg_match_all('/<TR[^>]*>(.*?)<\/TR>/si', $printable, $matches)) {
foreach($matches[1] as $match) {
echo("$match <br />");
}
}
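A small self-contained sketch of the regex approach on inline sample data, assuming rows are not nested. The function name `extractRows` is made up for illustration. The `s` modifier lets `.` match newlines, `i` makes the tag name case-insensitive, and the lazy `.*?` stops at the first closing tag. Adding the `U` modifier on top of `.*?` would flip the quantifier back to greedy.

```php
<?php
// Return the trimmed inner content of every <tr>...</tr> pair.
function extractRows(string $html): array
{
    $rows = [];
    // Note: no semicolon between the condition and the opening brace.
    if (preg_match_all('/<tr[^>]*>(.*?)<\/tr>/si', $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $match) {
            $rows[] = trim($match[1]);
        }
    }
    return $rows;
}

$sample = "<TR class=\" light \">\nStuff in here\n</TR>\n<TR>More stuff</TR>";
print_r(extractRows($sample));
```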

I know what your first problem is. regex! I kid! but have you checked out PHP DOM?
http://www.php.net/manual/en/domdocument.loadhtmlfile.php
It would probably work in your case just fine. It would be 10x easier too.
Some people, when confronted with a problem, think “I know, I'll use
regular expressions.” Now they have two problems. -Jamie Zawinski

Works fine here. It should work unless you have nested tables.
The problem must be in your data source. Do some tracing with var_dump.

Use PHP's document object model to be safe when parsing HTML. Except for very simple regexes, HTML parsing rapidly gets out of control when you DIY. There's a bit of overhead to set it up, but once you get going it's straightforward.
See DOM for instructions on how to use it.
If you stick to the regex technique, note that '<' and '>' have no special meaning in PCRE outside of group syntax such as (?<name>...), so escaping them is optional but harmless, e.g.
/\<TR[^>]*\>(.*?)\<\/TR\>/si

using preg_match_all to get name of image

After fetching an external page with curl, I have its full source code, which contains something like this (the part I'm interested in):
(page...)<td valign='top' class='rdBot' align='center'><img src="/images/buy_tickets.gif" border="0" alt="T"></td> (page...)
I'm using preg_match_all, and I want to get only "buy_tickets.gif":
$pattern_before = "<td valign='top' class='rdBot' align='center'>";
$pattern_after = "</td>";
$pattern = '#'.$pattern_before.'(.*?)'.$pattern_after.'#si';
preg_match_all($pattern, $buffer, $matches, PREG_SET_ORDER);
Everything is fine up to now, but the problem is that the external page sometimes changes and the image I'm looking for ends up inside a link:
(page...)<td valign='top' class='rdBot' align='center'><img src="/images/buy_tickets.gif" border="0" alt="T"></td> (page...)
I don't know how to make my code work in both cases (not just when the image has no link).
I hope that makes sense.
Thanks in advance.

Don't use regex to parse HTML; use PHP's DOM extension. Try this:
$doc = new DOMDocument;
@$doc->loadHTMLFile( 'http://ventas.entradasmonumental.com/eventperformances.asp?evt=18' ); // Using the @ operator to hide parse errors
$xpath = new DOMXPath( $doc );
$img = $xpath->query( '//td[@class="BrdBot"][@align="center"][1]//img[1]')->item( 0 ); // Xpath->query returns a 'DOMNodeList', get the first item which is a 'DOMElement' (or null)
$imgSrc = $img->getAttribute( 'src' );
$imgSrcInfo = pathinfo( $imgSrc );
$imgFilename = $imgSrcInfo['basename']; // All you need

You're going to get lots of advice not to use regex for pulling stuff out of HTML code.
There are times when it's appropriate to use regex for this kind of thing, and I don't always agree with the somewhat rigid advice given on the subject here (and elsewhere). However in this case, I would say that regex is not the appropriate solution for you.
The problem with using regex for searching for things in HTML code is exactly the problem you've encountered -- HTML code can vary wildly, making any regex virtually impossible to get right.
It is just about possible to write a regex for your situation, but it will be an insanely complex regex, and very brittle -- ie prone to failing if the HTML code is even slightly outside the parameters you expect.
Contrast this with the recommended solution, which is to use a DOM parser. Load the HTML code into a DOM parser, and you will immediately have an object structure which you can query for individual elements and attributes.
The details you've given make it almost a no-brainer to go with this rather than a regex.
PHP has a built-in DOM parser, which you can call as follows:
$mydom = new DOMDocument;
$mydom->loadHTMLFile("http://....");
You can then use XPath to search the DOM for your specific element or attribute that you want:
$myxpath = new DOMXPath($mydom);
$myattr = $myxpath->query('//td[@class="rdBot"]//img[1]/@src');
Hope that helps.
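To make the DOM and XPath route concrete, here is a minimal sketch that works whether or not the image is wrapped in a link. The function name `imageFilename` and the `href` in the sample are assumptions for illustration. The class value `rdBot` is copied from the question, and XPath string comparisons are case-sensitive.

```php
<?php
// Return the file name of the first <img> found inside a td.rdBot cell,
// or null when no such image exists.
function imageFilename(string $html): ?string
{
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // "//img" matches at any depth, so an <a> wrapper around the image is fine.
    $nodes = $xpath->query('(//td[@class="rdBot"]//img)[1]/@src');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return basename($nodes->item(0)->nodeValue);
}

$plain  = "<table><tr><td valign='top' class='rdBot' align='center'>"
        . '<img src="/images/buy_tickets.gif" border="0" alt="T"></td></tr></table>';
// The link target below is a placeholder, not the real page's URL.
$linked = "<table><tr><td valign='top' class='rdBot' align='center'>"
        . '<a href="#"><img src="/images/buy_tickets.gif" border="0" alt="T"></a></td></tr></table>';

echo imageFilename($plain), "\n";
echo imageFilename($linked), "\n";
```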

function GetFilename($file) {
$filename = substr($file, strrpos($file,'/')+1,strlen($file)-strrpos($file,'/'));
return $filename;
}
echo GetFilename('/images/buy_tickets.gif');
This will output buy_tickets.gif

Do you only need images inside of the "td" tags?
$regex='/<img src="\/images\/([^"]*)"[^>]*>/im';
edit:
to grab the specific image this should work:
$regex='/<td valign=\'top\' class=\'rdBot\' align=\'center\'>.*src="\/images\/([^"]*)".*<\/td>/';

Parsing HTML with Regex is not recommended, as has been mentioned by several posters.
However, if the path of your images always follows the pattern src="/images/name.gif", you can easily extract it in Regex:
$pattern = <<<EOD
#src\s*=\s*['"]/images/(.*?)["']#
EOD;
If you are sure that the images always follow the path "/images/name.ext" and that you don't care where the image link is located in the page, this will do the job. If you have more detailed requirements (such as matching only within a specific class), forget Regex, it's not the right tool for the job.
I just read in your comments that you need to match within a specific tag. Use a parser, it will save you untold headaches.
If you still want to go through regex, try this:
\(?<=<td .*?class\s*=\s*['"]rdBot['"][^<>]*?>.*?)(?<!</td>.*)<img [^<>]*src\s*=\s*["']/images/(.*?)["']\i
This should work. It does work in C#, I am not totally sure about php's brand of regex.

PHP get external page content

I get the HTML from another site with file_get_contents. My question is: how can I get a specific tag's value?
Let's say I have:
<div id="global"><p class="paragraph">1800</p></div>
How can I get the paragraph's value? Thanks.

If the example is really that trivial you could just use a regular expression. For generic HTML parsing though, PHP has DOM support:
$dom = new domDocument();
$dom->loadHTML("<div id=\"global\"><p class=\"paragraph\">1800</p></div>");
echo $dom->getElementsByTagName('p')->item(0)->nodeValue;
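When the page holds more than one paragraph, selecting by id and class is safer than taking the first `<p>`. A minimal sketch under that assumption follows. The helper name `firstText` is hypothetical.

```php
<?php
// Return the trimmed text of the first node matching an XPath expression,
// or null when nothing matches.
function firstText(string $html, string $expression): ?string
{
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query($expression);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->nodeValue);
}

$html = '<div id="global"><p class="paragraph">1800</p></div>';
echo firstText($html, '//div[@id="global"]/p[@class="paragraph"]'), "\n";
```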

You need to parse the HTML. There are several ways to do this, including using PHP's XML parsing functions.
However, if it is just a simple value (as you asked above) I would use the following simple code:
// your content
$contents='<div id="global"><p class="paragraph">1800</p></div>';
// define start and end position
$start='<div id="global"><p class="paragraph">';
$end='</p></div>';
// find the stuff
$contents=substr($contents,strpos($contents,$start)+strlen($start));
$contents=substr($contents,0,strpos($contents,$end));
// write output
echo $contents;
Best of luck!
Christian Sciberras
(tested and works)

$input = '<div id="global"><p class="paragraph">1800</p></div>';
$output = strip_tags($input);

preg_match_all('#paragraph">(.*?)<#is', $input, $output);
print_r($output);
Untested.

Extract data from website via PHP

I am trying to create a simple alert app for some friends.
Basically I want to be able to extract the "price" and "stock availability" data from a web page like the following two:
http://www.sparkfun.com/commerce/product_info.php?products_id=5
http://www.sparkfun.com/commerce/product_info.php?products_id=9279
I have built the e-mail and SMS alert part, but now I want to get the quantity and price out of the web pages (those two or any others) so that I can compare the price and the quantity available and alert us to place an order if a product is between some thresholds.
I have tried some regexes found in tutorials (I am way too much of a n00b for this) but haven't managed to get them working. Any good tips or examples?

$content = file_get_contents('http://www.sparkfun.com/commerce/product_info.php?products_id=9279');
preg_match('#<tr><th>(.*)</th> <td><b>price</b></td></tr>#', $content, $match);
$price = $match[1];
preg_match('#<input type="hidden" name="quantity_on_hand" value="(.*?)">#', $content, $match);
$in_stock = $match[1];
echo "Price: $price - Availability: $in_stock\n";

It's called screen scraping, in case you need to google for it.
I would suggest that you use a dom parser and xpath expressions instead. Feed the HTML through HtmlTidy first, to ensure that it's valid markup.
For example:
$html = file_get_contents("http://www.example.com");
$html = tidy_repair_string($html);
$doc = new DomDocument();
$doc->loadHtml($html);
$xpath = new DomXPath($doc);
// Now query the document:
foreach ($xpath->query('//table[@class="pricing"]//th') as $node) {
echo $node->nodeValue, "\n";
}
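A sketch of the same idea for the stock figure, without the Tidy step. The field name `quantity_on_hand` comes from the regex answer above. The function names and the reorder threshold are assumptions for illustration, and the real page layout may differ.

```php
<?php
// Read the value of the hidden "quantity_on_hand" input, or null if absent
// or not a whole number.
function quantityOnHand(string $html): ?int
{
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//input[@name="quantity_on_hand"]/@value');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    $value = trim($nodes->item(0)->nodeValue);
    return preg_match('/^\d+$/', $value) === 1 ? (int) $value : null;
}

// Hypothetical alert rule: reorder when stock is known and below a minimum.
function needsReorder(?int $quantity, int $minimum): bool
{
    return $quantity !== null && $quantity < $minimum;
}

$sample = '<form><input type="hidden" name="quantity_on_hand" value="42"></form>';
var_dump(needsReorder(quantityOnHand($sample), 50));
```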

Whatever you do: don't use regular expressions to parse HTML or bad things will happen. Use a parser instead.

First, this question goes into too much detail. Second, extracting data from a website might not be legitimate. However, I have some hints:
Use Firebug or the Chrome/Safari Inspector to explore the HTML content and the pattern of the interesting information.
Test your regexes to see if they match. You may need to do it many times (multi-pass parsing/extraction).
Write a client via cURL or, even simpler, use file_get_contents (NOTE that some hosts disable loading URLs with file_get_contents).
Personally, I'd rather use Tidy to convert to valid XHTML and then use XPath to extract the data, instead of regex. Why? Because XHTML is not a regular language and XPath is very flexible. You can also learn XSLT to transform it.
Good luck!

You are probably best off loading the HTML code into a DOM parser like this one and searching for the "pricing" table. However, any kind of scraping you do can break whenever they change their page layout, and is probably illegal without their consent.
The best way, though, would be to talk to the people who run the site, and see whether they have alternative, more reliable forms of data delivery (Web services, RSS, or database exports come to mind).

The simplest method to extract data from a website. I found that all the data I needed sits inside <h3> tags, so I prepared this:
<?php
include('simple_html_dom.php');
// Create DOM from URL, paste your destined web url in $page
$page = 'http://facebook4free.com/category/facebookstatus/amazing-facebook-status/';
$html = new simple_html_dom();
//Within $html your webpage will be loaded for further operation
$html->load_file($page);
// Collect the matching elements
$links = array();
//Within find() function, I have written h3 so it will simply fetch the content from <h3> tag only. Change as per your requirement.
foreach($html->find('h3') as $element)
{
$links[] = $element;
}
reset($links);
//$out will be having each of HTML element content you searching for, within that web page
foreach ($links as $out)
{
echo $out;
}
?>

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.
