Grab content from another website [duplicate] - php

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
How to parse and process HTML/XML with PHP?
I want to grab anything that's inside the code below on a website.
<table class="eventResults byEvent vevent"></table>
How can I accomplish this?
Thanks

If you want to grab HTML from a site, you will want to use a DOM Parser. PHP has several XML processing packages to help you with this, be it DOM, SimpleXML or XMLReader. An often suggested alternative at SO is SimpleHtmlDom.
Since one of the classes on the table is vevent, the content inside the table could be an hCalendar microformat (can't tell for sure without seeing the content). If so, you can also use a microformat parser, preferably Transformr, to save you the work of manually parsing the event data.
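As a rough illustration of the DOM route, here is a minimal sketch using PHP's bundled DOMDocument and DOMXPath classes. The function name findTablesByClass and the sample markup are made up for this example, and $html stands in for a page you have already fetched. It assumes the class name contains no quote characters.

```php
<?php
// Sample markup standing in for a fetched page (hypothetical content).
$html = <<<HTML
<html><body>
<p>Intro</p>
<table class="eventResults byEvent vevent"><tr><td>Concert</td></tr></table>
</body></html>
HTML;

// Hypothetical helper: return the outer HTML of every <table> carrying $class.
function findTablesByClass(string $html, string $class): array
{
    $previous = libxml_use_internal_errors(true); // tolerate sloppy real-world markup
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Match $class as a whole word inside the space-separated class attribute.
    $query = sprintf(
        '//table[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );

    $tables = [];
    foreach ($xpath->query($query) as $table) {
        $tables[] = $doc->saveHTML($table); // serialize just the matched node
    }
    return $tables;
}

$tables = findTablesByClass($html, 'eventResults');
echo $tables[0], PHP_EOL;
```

Matching on a single class word rather than the full attribute string means the query keeps working if the site reorders or adds classes.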

You can use the file_get_contents function or the curl extension for that.

First fetch the content of the whole page from that website as a string, then use a regular expression to extract the substring inside the table tags.

Related

Getting specific HTML from a webpage with PHP [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
How to parse and process HTML with PHP?
I am learning PHP and when I have to extract (parse) some data from a webpage that does not have an available API, I use regular expressions or a function which takes the string that is between two strings.
I would like to know if there is a more "professional", easier way to do this, since regular expressions are resource-consuming and not the easiest thing for me to write right now.
You should never try to parse XML (HTML) using regular expressions; instead, get yourself a proper parser library for XML and do it the correct way. It might sound like a harder task, but you'll thank yourself in the end.
Parsing could be done using one of the below, or similar resources.
php.net - PHP: DOM - Manual
simplehtmldom.sourceforge.net - PHP Simple HTML DOM Parser
The popular and legendary answer regarding HTML and regular expressions, poetry worth reading:
stackoverflow.com - The legendary HTML+RegExp answer!
PHP comes with a default XML parsing library for you to use in this specific case. Use file_get_contents in order to retrieve the HTML page and parse accordingly.
XML: http://php.net/manual/en/book.xml.php
file_get_contents: http://php.net/manual/en/function.file-get-contents.php

How to "read" a HTML document in PHP?

I've been facing a problem for quite a long time. Unfortunately I was not able to find the solution on my own, so I have to post my question here.
I am writing a little PHP script that creates a PDF file from a dynamically created HTML file.
Now I want to "parse" the HTML file and perform an action depending on which tag comes next in the HTML.
E.g.
<div><p>Test</p></div>
My script should recognize:
First tag is a div: do function for div
Second tag is a p: do function for p
I don't know what I should search for. Regular expressions? An HTML parser?
Thanks for a hint!
Try an XML parser. In PHP the SimpleXML is probably what you are looking for.
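For the "do one function per tag" part, a parser gives you a tree you can walk in document order. The sketch below uses DOMDocument rather than SimpleXML, since it tolerates HTML that is not well-formed XML. The walkElements function and the handler closures are invented for illustration; a real PDF script would draw something instead of returning strings.

```php
<?php
// Hypothetical helper: visit every element in document order and call the
// handler registered for its tag name, collecting the results in $log.
function walkElements(DOMNode $node, array $handlers, array &$log): void
{
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMElement) {
            $tag = strtolower($child->nodeName);
            if (isset($handlers[$tag])) {
                $log[] = $handlers[$tag]($child);
            }
            walkElements($child, $handlers, $log); // descend into nested tags
        }
    }
}

$doc = new DOMDocument();
$doc->loadHTML('<div><p>Test</p></div>'); // the parser wraps this in <html><body>

// Placeholder handlers, one per tag the script cares about.
$handlers = [
    'div' => function (DOMElement $el): string { return 'div handler'; },
    'p'   => function (DOMElement $el): string { return 'p handler: ' . $el->textContent; },
];

$log = [];
walkElements($doc->documentElement, $handlers, $log);
print_r($log);
```

Tags without a registered handler (here the html and body wrappers) are simply skipped, while their children are still visited.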
I've used phpQuery several times. It's a nice solution, although it's quite big and it seems it is no longer supported (last commit > 10 months ago).
What you need to do is read the HTML file into a PHP variable/object
http://www.php-mysql-tutorial.com/wikis/php-tutorial/read-html-files-using-php.aspx
And then use RegEx to parse the HTML tags and attributes
http://www.codeproject.com/Articles/297056/Most-Important-Regular-Expression-for-parsing-HTML

Count Hyperlinks of a Website [duplicate]

This question already exists:
Closed 11 years ago.
Possible Duplicate:
How to parse HTML with PHP?
I want to write a PHP program that counts all the hyperlinks of a website that the user enters.
How can I do this? Is there a library or something I can use to parse the HTML and analyze its hyperlinks?
Thanks for your help.
Like this
<?php
// Fetch the page, then count the literal substring "<a href=" (a rough estimate).
$site = file_get_contents("someurl");
$links = substr_count($site, "<a href=");
print "There are {$links} links on that page.";
?>
Well, we won't be able to give you a definitive answer, only pointers. I once built a search engine in PHP, so the principle will be the same:
First of all, you need to code your script as a console script. A web script is not really appropriate, but it's all a question of taste.
You need to understand how to work with sockets in PHP and make requests, look at the php socket library at: http://www.php.net/manual/ref.network.php
You will need to get versed in the world of HTTP requests, learn how to make your own GET/POST requests and split the headers from the returned content.
The last part will be easy with a regexp: just preg_match the content for "#()*#i" (that expression might be wrong, I didn't test it at all, OK?)
Loop the list of found hrefs, compare to already visited hrefs (remember to take into account wildcard GET params in your stuff) and then repeat the process to load all the pages of a site.
It IS HARD WORK... good luck
You may have to use cURL to fetch the contents of the webpage. Store that in a variable, then parse it for hyperlinks. You might need a regular expression for that.
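For the counting step on a single page, a parser-based count is a possible alternative to substring or regex matching. This is a minimal sketch, not a crawler: countLinks is a made-up name, and the sample string stands in for HTML you have already fetched.

```php
<?php
// Hypothetical helper: count <a> elements that carry an href attribute.
function countLinks(string $html): int
{
    $previous = libxml_use_internal_errors(true); // tolerate sloppy markup
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $count = 0;
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        if ($anchor->hasAttribute('href')) {
            $count++;
        }
    }
    return $count;
}

// Sample input: two real links (one in upper case with extra attributes)
// and one named anchor without an href.
$html = '<p><a href="/one">One</a> <A class="x" HREF=\'/two\'>Two</A> <a name="top">Anchor</a></p>';
echo countLinks($html), PHP_EOL;                  // 2
echo substr_count($html, '<a href='), PHP_EOL;    // 1, the literal match misses the second link
```

The parser normalizes tag and attribute case, so links written as `<A HREF=...>` or with other attributes before href are still counted.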

Reliably parsing HTML elements using RegEx [duplicate]

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
Best methods to parse HTML with PHP
I'm trying to parse a webpage using RegEx, and I'm having some trouble making it work in a reliable manner.
Say I wanted to parse the code that creates a div element, and I want to extract everything between <div> and </div>. Now, this code could just be <div></div>, but it could also very well be something like:
<div class="thisIsMyDivClass"><p>This text is inside the div</p></div>
How can I make sure that, no matter how many characters there are between the angle brackets of the opening div tag and the corresponding closing div tag, I always get only the content in between them? If I specify that the number of characters following < can be anything from one to ten thousand, the match always runs to the last > within ten thousand characters, and thus (most likely, unless there is a lot of code or text in between) I retrieve a bunch of code that I don't need.
This is my code so far (not reliable for the aforementioned reason):
/<.{1,10000}>/
Regular expressions describe so-called regular languages, or Type 3 in the Chomsky hierarchy. HTML, on the other hand, is a context-free language, which is Type 2 in the Chomsky hierarchy. So there is no way to reliably parse HTML with regular expressions in general. Use an HTML parser instead. For PHP you can find some suggestions in this question: How do you parse and process HTML/XML in PHP?
You will need a lexical analyser and grammar checker to parse HTML correctly. RegEx is mainly meant for searching strings for patterns.
I would suggest using something like DOM. I am building a large-scale site and using DOM like crazy on it. It works, it works well, and with a little effort it can be extremely powerful.
http://php.net/manual/en/book.dom.php
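To make the DOM suggestion concrete for the div case from the question, here is a small sketch. The innerHtml helper is a made-up name, and the nested and second div were added to the sample to show the situation where a pattern like the one above goes wrong.

```php
<?php
// Hypothetical helper: serialize the children of a node, i.e. its inner HTML.
function innerHtml(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

$doc = new DOMDocument();
$doc->loadHTML(
    '<div class="thisIsMyDivClass"><p>This text is inside the div</p><div>nested</div></div>'
    . '<div>second</div>'
);

$xpath = new DOMXPath($doc);
// Select the outer div by its class; the parser has already paired up
// every opening tag with its own closing tag, nested divs included.
$first = $xpath->query('//div[@class="thisIsMyDivClass"]')->item(0);
echo innerHtml($first), PHP_EOL;
```

The result contains the paragraph and the nested div but nothing from the sibling div, regardless of how long the content is.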

How do I extract <input> tags from an (X)HTML input in PHP? [duplicate]

This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 3 years ago.
I need to extract all the <input> tags from an HTML file, in such a way that I would end up with either an array containing key=value for each of the attributes, or at least the raw text that makes up the tag.
I don't quite get along with regex, much less in PHP, so I would really appreciate some help in this.
P.S.: Some of the tags may span several lines and be indented with tabs and spaces on the subsequent lines.
Thanks.
You can use the DOM functions to parse an XML/XHTML document into a DOM Tree. From there it's not too hard to traverse the nodes you wish, extracting the data you're looking for.
Some people prefer the SimpleXML functions which might work equally well for you. I personally have issues with SimpleXML and prefer the more verbose, but more powerful DOM functions.
Yes, it's easy. Use PHP's DOM functions and find the nodes with XPath.
That should be the painless way.
Another option is the simplehtmldom library.
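A minimal sketch of the DOM approach for this question, producing one key => value array per input tag. The extractInputs name and the sample form are invented for the example; tags that span several lines are handled by the parser without any extra work.

```php
<?php
// Hypothetical helper: return one attribute array per <input> element.
function extractInputs(string $html): array
{
    $previous = libxml_use_internal_errors(true); // tolerate sloppy markup
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $inputs = [];
    foreach ($doc->getElementsByTagName('input') as $input) {
        $attributes = [];
        foreach ($input->attributes as $attribute) {
            $attributes[$attribute->name] = $attribute->value;
        }
        $inputs[] = $attributes;
    }
    return $inputs;
}

// Sample form (made up); the first tag spans several indented lines.
$html = <<<HTML
<form action="/login">
<input type="text"
	   name="user"
	   value="bob" />
<input type="submit" value="Go">
</form>
HTML;

$inputs = extractInputs($html);
print_r($inputs);
```

If the raw text of each tag is needed instead, `$doc->saveHTML($input)` inside the loop returns the parser's serialization of the element, which may differ in whitespace from the original source.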
