How can I screen scrape a website using cURL and show the data within a specific div?
Download the page using cURL (there are plenty of examples in the documentation). Then use a DOM parser, for example Simple HTML DOM or PHP's DOM extension, to extract the value from the div element.
After downloading with cURL use XPath to select the div and extract the content.
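A minimal sketch of that approach with PHP's built-in DOM classes; the URL and the div id are placeholders:

```php
<?php
// Fetch the page with cURL (URL is a placeholder).
$ch = curl_init('http://example.com/page.html');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);

// Parse the HTML; suppress warnings caused by malformed markup.
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

// Select the div by id with XPath and print its text content.
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//div[@id="content"]');
if ($nodes->length > 0) {
    echo $nodes->item(0)->textContent;
}
```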
A possible alternative.
# We will store the web page in a string variable.
var string page
# Read the page into the string variable.
cat "http://www.abczyx.com/path/to/page.ext" > $page
# Output the portion in the third (3rd) instance of "<div...</div>"
stex -r -c "^<div&</div\>^3" $page
This code is in biterscripting. The 3 is just a sample value to extract the 3rd div. If you instead want to extract the div that contains, say, the string "ABC", use this command syntax.
stex -r -c "^<div&ABC&</div\>^" $page
Take a look at this script: http://www.biterscripting.com/helppages/SS_ExtractTable.html. It shows how to extract an element (div, table, frame, etc.) even when the elements are nested.
Fetch the website content using a cURL GET request. There's a code sample on the curl_exec manual page.
Use a regular expression to search for the data you need. There's a code sample on the preg_match manual page, but you'll need to read up on regular expressions to build the pattern you need. As Yacoby mentioned (which I hadn't thought of), a better idea may be to examine the DOM of the HTML page using PHP's SimpleXML or DOM parser.
Output the information you've found with the regex/parser in the HTML of your page (within the required div).
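Put together, the three steps might look like this. The URL and the regex pattern are placeholders; as noted above, a DOM parser is more robust than a regex for anything non-trivial:

```php
<?php
// Step 1: fetch the page (URL is a placeholder).
$ch = curl_init('http://example.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);

// Step 2: pull out the piece you need with a regex. The pattern is a
// placeholder and only works if the markup is simple and predictable.
if (preg_match('/<span class="price">(.*?)<\/span>/s', $html, $m)) {
    $price = $m[1];

    // Step 3: output it inside your own div.
    echo '<div id="result">' . htmlspecialchars($price) . '</div>';
}
```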
Related
I'm using CURL, DOMDocument, loadHTML, DOMXPath in PHP to get the contents of URLs. In order to verify the validity of the data, I also run checks on the amount of html, head and body tags that were retrieved.
My setup works fine for the large majority of URLs I enter. However, for some URLs, an unexpected number of those tags is reported. The XPaths:
$html = $this->runXpath('/html');
$head = $this->runXpath('/html/head');
$body = $this->runXpath('/html/body');
And the check:
if($html->length > 1) {
echo 'Too many html tags';
}
https://www.chownow.com/: 2x HTML (yes, I see the iframe, but that is generated through JavaScript, which cURL shouldn't render? Also, the XPath states that the html should be a child of #document, which, according to $tag->parentNode->nodeName, both HTML elements are. The second HTML tag also doesn't show up in either 'View source' or the response body from the cURL request).
http://neilpatel.com/: 2x HTML? (Once again a video, but seemingly not even a relevant iframe tag in the DOM source).
https://www.groovehq.com/: 2x BODY? (An iframe again, but no double-html error; a double-body error instead?).
Questions
Why does XPath seem to think there are multiple instances of those tags, while I can't find them as such in the cURL response body using Ctrl-F when I output it, nor in 'View source'?
How can I "see what xpath sees" in order to debug similar cases?
It would almost seem that DOMDocument or XPath parses JavaScript. Does it? If not, how do I explain the examples above?
Any additional questions I will gladly answer. Thanks in advance!
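One way to "see what XPath sees" is to serialize the DOM back out after parsing. libxml normalizes and repairs the markup while building the tree, so the parsed document can differ from the raw response. A hedged sketch, assuming $responseBody holds the raw HTML from cURL:

```php
<?php
// $responseBody is the raw HTML returned by cURL.
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($responseBody);
libxml_clear_errors();

// This is the tree that XPath actually queries; diff it against the
// raw response body. Stray markup (e.g. an unescaped closing tag inside
// a script or comment) can make libxml open extra html/body elements
// during its error recovery.
file_put_contents('parsed.html', $dom->saveHTML());
```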
I found this library, phpQuery, and I wanted to know how I can replicate this jQuery:
var source = $('p:not(:has(iframe))').filter(function () {
    return $(this).text().length > 150;
}).slice(0, 1).parent();
It finds the first p element without an iframe that has text longer than 150 characters and takes its parent. I was wondering how I could do this with a PHP library. I found phpQuery, a PHP implementation of jQuery, but I've been confused about how to properly convert the script above.
Try using http://simplehtmldom.sourceforge.net/manual.htm
You can find tags on an HTML page with selectors, just like in jQuery.
Just read the short manual.
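With Simple HTML DOM, the selection from the question can be approximated like this. A sketch, assuming $html holds the page source; the loop stands in for jQuery's :not(:has(iframe)), filter() and slice(0,1):

```php
<?php
include 'simple_html_dom.php';

$dom = str_get_html($html);

// Find the first <p> that contains no <iframe> and has more than
// 150 characters of text, then take its parent element.
$source = null;
foreach ($dom->find('p') as $p) {
    if (count($p->find('iframe')) === 0 && strlen($p->plaintext) > 150) {
        $source = $p->parent();
        break;
    }
}
```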
G'day dear community, hello all!
Well, I am trying to select either a class or an id using PHP Simple HTML DOM Parser, with absolutely no luck. Perhaps I have to study the man pages again and again.
Well, the DOM technique somewhat goes over my head:
But my example is very simple and seems to comply with the examples given in the manual (simplehtmldom.sourceforge.net/manual.htm), yet it just won't work; it's driving me up the wall. Other example scripts that ship with Simple DOM work fine.
See the example: http://www.aktive-buergerschaft.de/buergerstiftungsfinder
This is the easiest example I have found ... The question is: how do I parse it?
Should I do it with Perl instead? The example HTML page is invalid HTML,
and I do not know whether the Simple HTML DOM Parser is able to handle badly malformed HTML
(probably not). In that case I am lost.
Well, it is pretty hard to believe, but you can get the content with file_get_contents. Afterwards, though, you still have to do the parsing job, and that is where I have some missing pieces!
Finally: if I cannot get it to run, I can try out some Perl parsers, e.g. HTML::TreeBuilder::XPath.
1. Check whether file_get_contents is working.
2. If it is not, use cURL, fopen, or telnet to read the data.
Simple HTML DOM filters out all the noise and can also process malformed tags.
The problem is more likely in your data retrieval.
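The retrieval check described above could be sketched like this; the URL is a placeholder:

```php
<?php
$url = 'http://example.com/';

// 1. Try file_get_contents first (requires allow_url_fopen=On).
$html = @file_get_contents($url);

// 2. Fall back to cURL if that failed.
if ($html === false) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($ch);
    curl_close($ch);
}

if ($html === false) {
    die('Could not retrieve the page');
}
```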
Are there built-in functions in the latest versions of PHP specially designed to aid in this task?
Use a DOM parser like SimpleXML to split the HTML code into nodes, and walk through the nodes to build the array.
For broken/invalid HTML, SimpleHTMLDOM is more lenient (but it's not built in).
String replace and explode would work if the HTML code is clean and always the same; as soon as you have new attributes it will break.
So the only dependable solution is to use regular expressions or an XML/HTML parser.
Check http://php.net/manual/en/book.dom.php
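A sketch of the walk-the-nodes approach with DOMDocument; it assumes $html holds the page source and that the first table on the page is the one you want:

```php
<?php
// Parse the page, tolerating invalid markup.
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

// Turn each table row into a sub-array of its cell texts.
$rows = [];
$table = $dom->getElementsByTagName('table')->item(0);
foreach ($table->getElementsByTagName('tr') as $tr) {
    $cells = [];
    foreach ($tr->childNodes as $cell) {
        if ($cell->nodeName === 'td' || $cell->nodeName === 'th') {
            $cells[] = trim($cell->textContent);
        }
    }
    $rows[] = $cells;
}
print_r($rows);
```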
An alternative to using a native DOM parser is YQL. That way you don't have to do the actual parsing yourself. The YQL Web Service enables applications to query, filter, and combine data from different sources across the Internet.
For instance, to grab the HTML table with the class example given at
http://www.w3schools.com/html/html_tables.asp
you can do
$yql = 'http://tinyurl.com/yql-table-grab';
$yql = json_decode(file_get_contents($yql));
print_r( $yql->query->results );
I've deliberately shortened the URL so it does not mess up the answer. $yql actually links to the YQL API, adds some options, and contains the query:
select * from html
where xpath="//table[@class='example']"
and url="http://www.w3schools.com/html/html_tables.asp"
YQL can return JSON or XML. I've made it return JSON and decoded it, which results in a nested structure of stdClass objects and arrays (so it's not all arrays). You'll have to see whether that fits your needs.
You can try out the interactive YQL console to see how it works.
I don't know if this is the fastest, but you can check this class (it uses preg_replace):
http://wonshik.com/snippet/Convert-HTML-Table-into-a-PHP-Array
If you want to convert the html-description of a table, here's how I would do it:
remove all closing tags (</...>) (http://php.net/manual/de/function.str-replace.php)
split the string at opening tags (<...>) using a regular expression (http://php.net/manual/en/function.split.php)
You'll have to work out the details on your own, since I do not know whether you want to handle different rows as subarrays, merge all rows into one big array, or something else.
You could use the explode function to turn the table columns and rows into arrays.
See: php explode
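If the markup really is clean and fixed, the strip/split/explode idea from the steps above might look like this. Fragile by design, as noted; $table is assumed to hold one table's HTML:

```php
<?php
// Works only for flat, predictable markup like
// <table><tr><td>a</td><td>b</td></tr>...</table>
$table = preg_replace('/<\/(td|th|tr|table)>/', '', $table); // drop closing tags
$rows  = explode('<tr>', $table);
array_shift($rows); // discard everything before the first <tr>

$data = [];
foreach ($rows as $row) {
    $cells = preg_split('/<t[dh][^>]*>/', $row);
    array_shift($cells); // discard text before the first cell
    $data[] = array_map('trim', $cells);
}
print_r($data);
```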
I'm trying to fetch data from a div (based on its id) using PHP's PCRE. The goal is to fetch the div's contents based on its id, using recursion/depth to get everything inside it. The main problem is getting at other divs nested inside the "main div", because the regex would stop at the first </div> it finds after the initial <div id="test">.
I've tried so many different approaches to this, and none of them worked. The best solution, in my opinion, is to use the (?R) recursion construct, but I never got it to work properly.
Any ideas?
Thanks in advance :D
You'd be much better off using some form of DOM parser; regex really isn't suited to this problem. If all you want is basic HTML DOM parsing, something like Simple HTML DOM would be right up your alley. It's trivial to install (just include a single PHP file) and trivial to use (2-3 lines will do what you need).
include('simple-html-dom.php');
$dom = str_get_html($bunchofhtmlcode);
$testdiv = $dom->find('div#test',0); // 0 for the first occurrence
$testdiv_contents = $testdiv->innertext;