I am writing a scraper that collects links from an entire site, including its subpages, and I have run into a small problem. I decided to use a recursive function because the page I want to scan has several levels. Its structure looks more or less like this:
Level 1 reference
- Second level reference
-- Third level reference
-- Third level reference
- Second level reference
-- Third level reference
-- Third level reference
-- Third level reference
--- Level four reference
It is never clear in advance whether more links are hidden under the link being tested, hence the idea of a recursive function.
It takes a link to the main page, reads the first link on it and, if that page contains more than one link, calls itself again.
Unfortunately, something goes wrong and I get an empty array. How can I fix it?
function scanWebsite($url) {
$html = file_get_contents($url);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXpath($dom);
$nodes = $xpath->query("/html/body//a");
$output = [];
foreach($nodes as $node) {
$url = $node->getAttribute("href");
if(count($nodes) > 1) {
scanWebsite("http://samplewebsite.com" .$url);
} else {
if(preg_match("/\/title\/.*\//", $url)) {
array_push($output, $url);
}
continue;
}
}
return $output;
}
echo '<pre>';
print_r(scanWebsite("http://samplewebsite.com"));
echo '</pre>';
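One likely cause of the empty result is that the return value of the recursive call is discarded, and that count($nodes) > 1 is true for every link on a page with several links, so the else branch never runs. A minimal sketch of a variant that merges what the deeper calls find is shown here. The $fetch callable, the $visited list and the $maxDepth limit are illustrative additions that are not in the original code:

```php
<?php
// Sketch only: $fetch, $visited and $maxDepth are assumptions added for illustration.
function scanWebsite(string $url, callable $fetch, array &$visited = [], int $maxDepth = 5): array
{
    // Skip pages that were already scanned, and stop at the depth limit.
    if (isset($visited[$url]) || $maxDepth < 0) {
        return [];
    }
    $visited[$url] = true;

    $html = $fetch($url);
    if ($html === false || $html === '') {
        return [];
    }

    $dom = new DOMDocument();
    @$dom->loadHTML($html); // suppress warnings from malformed markup
    $xpath = new DOMXPath($dom);

    $output = [];
    foreach ($xpath->query('/html/body//a') as $node) {
        $href = $node->getAttribute('href');
        if (preg_match('~/title/.*/~', $href)) {
            $output[] = $href; // a link we want to collect
        } else {
            // Descend and keep what the recursive call found.
            $child = 'http://samplewebsite.com' . $href;
            $output = array_merge($output, scanWebsite($child, $fetch, $visited, $maxDepth - 1));
        }
    }
    return array_values(array_unique($output));
}
```

With real pages, $fetch could simply be 'file_get_contents'; passing it in as a parameter only makes the function easy to try against in-memory HTML.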
Related
I have a PHP array containing the URLs of many XML user files:
$tab_users[0]=john.xml
$tab_users[1]=chris.xml
$tab_users[n...]=phil.xml
For each user, the <zoom> tag may or may not have content, depending on whether the user filled it in:
john.xml = <zoom>Some content here</zoom>
chris.xml = <zoom/>
phil.xml = <zoom/>
I'm trying to go through the users' data and display the first filled <zoom> tag, but in random order: each time you reload the page, the <div id="zoom"> content is different.
$rand=rand(0,$n); // $n is the number of users
$datas_zoom=zoom($n,$rand);
My PHP function
function zoom($n,$rand) {
global $tab_users;
$datas_user=new SimpleXMLElement($tab_users[$rand],null,true);
$tag=$datas_user->xpath('/user');
//if zoom found
if($tag[0]->zoom !='') {
$txt_zoom=$tag[0]->zoom;
}
... some other stuff here
// no "zoom" value found
if ($txt_zoom =='') {
echo 'RAND='.$rand.' XML='.$tab_users[$rand].'<br />';
$datas_zoom=zoom($r,$n,$rand); } // random zoom fct again and again till...
}
else {
echo 'ZOOM='.$txt_zoom.'<br />';
return $txt_zoom; // we got it!
}
}
echo '<br />Return='.$datas_zoom;
The problem is: when the first XML file explored happens to contain a "zoom" value, the function returns it, but if not, nothing is returned... An example of the results when the first one happens to be the right one:
// for RAND=0, XML=john.xml
ZOOM=Anything here
Return=Some content here // we're lucky
Unlucky:
RAND=1 XML=chris.xml
RAND=2 XML=phil.xml
// then for RAND=0 and XML=john.xml
ZOOM=Anything here
// content found but Return is empty
Return=
What's wrong?
I suggest importing the values into a database table, or generating a single local file or something similar, so that you don't have to open and parse all the XML files for each request.
Reading multiple files is a lot slower than reading a single file. And with a database, even the random logic can be moved to SQL.
You are currently using SimpleXML, but fetching a single value from an XML document is actually easier with DOM. SimpleXMLElement::xpath() only supports XPath expressions that return a node list, but DOMXpath::evaluate() can return the scalar value directly:
$document = new DOMDocument();
$document->load($xmlFile);
$xpath = new DOMXpath($document);
$zoomValue = $xpath->evaluate('string(//zoom[1])');
//zoom[1] will fetch the first zoom element node in a node list. Casting the list into a string will return the text content of the first node or an empty string if the list was empty (no node found).
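To make the empty-string behaviour concrete, here is a small self-contained sketch. The zoomValue() helper and the inline XML strings are made up for illustration; only the evaluate() call comes from the snippet above:

```php
<?php
// Returns the text of the first <zoom> element, or '' when there is none
// or when it is empty. The XML passed in below is made-up sample data.
function zoomValue(string $xml): string
{
    $document = new DOMDocument();
    $document->loadXML($xml);
    $xpath = new DOMXPath($document);
    return $xpath->evaluate('string(//zoom[1])');
}

var_dump(zoomValue('<user><zoom>Some content here</zoom></user>')); // "Some content here"
var_dump(zoomValue('<user><zoom/></user>'));                        // ""
var_dump(zoomValue('<user/>'));                                     // ""
```

An empty <zoom/> and a missing <zoom> both yield an empty string, so a single comparison against '' covers both cases.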
For the sake of this example, assume that you generated an XML document like this:
<zooms>
<zoom user="u1">z1</zoom>
<zoom user="u2">z2</zoom>
</zooms>
In this case you can use Xpath to fetch all zoom nodes and get a random node from the list.
$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXpath($document);
$zooms = $xpath->evaluate('//zoom');
$zoom = $zooms->item(mt_rand(0, $zooms->length - 1));
var_dump(
[
'user' => $zoom->getAttribute('user'),
'zoom' => $zoom->textContent
]
);
Your main issue is that you are not returning any value when there is no zoom found.
$datas_zoom=zoom($r,$n,$rand); // no return keyword here!
When you're using recursion, you usually want to "chain" return values on and on, until you find the one you need. $datas_zoom is not a global variable, so it will not "leak out" of your function. Please read PHP's variable scope documentation for more info.
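As a rough illustration of chaining the return value, the sketch below replaces the XML files with an in-memory array. The function findZoom(), the $zooms data and the $order list are all hypothetical stand-ins:

```php
<?php
// $zooms stands in for the parsed XML files: user file => zoom text.
// Sample data, not from the original post.
function findZoom(array $zooms, array $order)
{
    if (empty($order)) {
        return null; // nothing left to try
    }
    $file = array_shift($order);
    if ($zooms[$file] !== '') {
        return $zooms[$file]; // we got it!
    }
    // Pass the result of the deeper call back up to our own caller.
    return findZoom($zooms, $order);
}

$zooms = ['john.xml' => 'Some content here', 'chris.xml' => '', 'phil.xml' => ''];
echo findZoom($zooms, ['chris.xml', 'phil.xml', 'john.xml']);
```

Without the return in front of the recursive call, the value found two levels down would be computed and then thrown away, which matches the empty "Return=" seen in the question.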
Then again, you're calling the zoom function with three arguments ($r, $n, $rand) while the function only accepts two ($n and $rand). Also, $r is undefined, $n is not used at all, and you are most likely passing the same $rand value again and again, which obviously cannot work.
Also note that there are too many closing braces in your code.
I think the best approach for your problem is to shuffle the array and then take elements from it one at a time, without recursion (which should be slightly faster):
function zoom(array $tab_users) {
    // shuffle the array once
    shuffle($tab_users);
    // repeat until a zoom is found or there
    // are no more elements in the array
    while (!empty($tab_users)) {
        $file = array_pop($tab_users);
        $datas_user = new SimpleXMLElement($file, 0, true);
        $tag = $datas_user->xpath('/user');
        // if zoom found (cast to string so the text is compared, not the object)
        if (isset($tag[0]) && trim((string) $tag[0]->zoom) !== '') {
            return (string) $tag[0]->zoom; // we got it!
        }
    }
    return null; // no file contains a zoom
}
$datas_zoom = zoom($tab_users); // your zoom is here!
Please read more about php scopes, php functions and recursion.
There's no reason for recursion. A simple loop would do.
$datas_user = new SimpleXMLElement($tab_users[$rand], 0, true);
$tag = $datas_user->xpath('/user'); // xpath() returns a plain array
$max = count($tag) - 1;             // highest valid index
while (true) {
    $test_index = rand(0, $max);
    if ($tag[$test_index]->zoom != "") {
        break;
    }
}
Of course, you might want to add a bit more logic to handle the case where NO zooms have text set, in which case the above would be an infinite loop.
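One way to avoid that infinite loop, sketched under the assumption that the zoom texts have already been read into a plain array (the randomNonEmpty() helper and its input are hypothetical), is to filter out the empty entries first and only then pick at random:

```php
<?php
// $values stands in for the zoom texts read from the XML files (sample data).
function randomNonEmpty(array $values)
{
    // Keep only the entries that actually contain text.
    $filled = array_values(array_filter($values, function ($v) {
        return trim((string) $v) !== '';
    }));
    if (count($filled) === 0) {
        return null; // no zoom has text: report it instead of looping forever
    }
    return $filled[mt_rand(0, count($filled) - 1)];
}
```

The caller can then treat null as "no zoom available" and show a default instead.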
I am attempting to scrape a website using the DOMXPath query method. I have successfully scraped the 20 profile URLs of each News Anchor from this page.
$url = "http://www.sandiego6.com/about-us/meet-our-team";
$xPath = "//p[@class='bio']/a/@href";
$html = new DOMDocument();
@$html->loadHtmlFile($url);
$xpath = new DOMXPath( $html );
$nodelist = $xpath->query($xPath);
$profileurl = array();
foreach ($nodelist as $n){
$value = $n->nodeValue;
$profileurl[] = $value;
}
I used the resulting array as the URL to scrape data from each of the News Anchor's bio pages.
$imgurl = array();
for($z=0;$z<$elementCount;$z++){
$html = new DOMDocument();
@$html->loadHtmlFile($profileurl[$z]);
$xpath = new DOMXPath($html);
$nodelist = $xpath->query("//img[@class='photo fn']/@src");
foreach($nodelist as $n){
$value = $n->nodeValue;
$imgurl[] = $value;
}
}
Each News Anchor profile page has 6 xPaths I need to scrape (the $imgurl array is one of them). I am then sending this scraped data to MySQL.
So far, everything works great - except when I attempt to get the Twitter URL from each profile because this element isn't found on every News Anchor profile page. This results in MySQL receiving 5 columns with 20 full rows and 1 column (twitterurl) with 18 rows of data. Those 18 rows are not lined up with the other data correctly because if the xPath doesn't exist, it seems to be skipped.
How do I account for missing xPaths? Looking for an answer, I found someone's statement that said, "The nodeValue can never be null because without a value, the node wouldn't exist." That being considered, if there is no nodeValue, how can I programmatically recognize when these xPaths don't exist and fill that iteration with some other default value before it loops through to the next iteration?
Here's the query for the Twitter URLs:
$twitterurl = array();
for($z=0;$z<$elementCount;$z++){
$html = new DOMDocument();
@$html->loadHtmlFile($profileurl[$z]);
$xpath = new DOMXPath($html);
$nodelist = $xpath->query("//*[@id='bio']/div[2]/p[3]/a/@href");
foreach($nodelist as $n){
$value = $n->nodeValue;
$twitterurl[] = $value;
}
}
Since the twitter node appears zero or one times, change the foreach to
$twitterurl [] = $nodelist->length ? $nodelist->item(0)->nodeValue : NULL;
That will keep the contents in sync. You will, however, have to make arrangements to handle NULL values in the query you use to insert them in the database.
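The same length check can be wrapped in a small helper so that every optional field gets a default when its node is missing. This is only a sketch: first_value() is a hypothetical name, and the inline HTML is sample markup rather than the real profile page:

```php
<?php
// Hypothetical helper: value of the first node matching $expression,
// or $default when nothing matches.
function first_value(DOMXPath $xpath, string $expression, $default = null)
{
    $nodes = $xpath->query($expression);
    if ($nodes === false || $nodes->length === 0) {
        return $default;
    }
    return trim($nodes->item(0)->nodeValue);
}

// Sample markup with a Facebook link but no Twitter link.
$html = new DOMDocument();
$html->loadHTML('<html><body><p class="bio-facebook url"><a href="https://example.com/fb">fb</a></p></body></html>');
$profile = new DOMXPath($html);

var_dump(first_value($profile, '//*[@class="bio-facebook url"]/a/@href')); // "https://example.com/fb"
var_dump(first_value($profile, '//*[@class="bio-twitter url"]/a/@href'));  // NULL
```

Because exactly one value is produced per profile and field, the columns stay lined up even when a page has no Twitter link.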
I think there are multiple issues in the way you scrape the data, and I will try to outline them in my answer in the hope that this also clarifies your central question:
I found someone's statement that said, "The nodeValue can never be null because without a value, the node wouldn't exist." That being considered, if there is no nodeValue, how can I programmatically recognize when these xPaths don't exist and fill that iteration with some other default value before it loops through to the next iteration?
First of all collecting the URLs of each profile (detail) page is a good idea. You can even benefit more from it by putting this into the overall context of your scraping job:
* profile pages
`- profile page
+- name
+- role
+- img
+- email
+- facebook
`- twitter
This is the structure of the data you would like to obtain. You have already managed to obtain the URLs of all profile pages:
$url = "http://www.sandiego6.com/about-us/meet-our-team";
$xPath = "//p[@class='bio']/a/@href";
$html = new DOMDocument();
@$html->loadHtmlFile($url);
$xpath = new DOMXPath($html);
$nodelist = $xpath->query($xPath);
$profileurl = array();
foreach ($nodelist as $n) {
$value = $n->nodeValue;
$profileurl[] = $value;
}
Since you know that the next step is to load and query the 20+ profile pages, one of the very first things you could do is extract the part of your code that creates a DOMXPath from a URL into a function of its own. This also makes better error handling easy:
/**
 * @param string $url
 *
 * @throws RuntimeException
 * @return DOMXPath
 */
function xpath_from_url($url)
{
$html = new DOMDocument();
$saved = libxml_use_internal_errors(true);
$result = $html->loadHtmlFile($url);
libxml_use_internal_errors($saved);
if (!$result) {
throw new RuntimeException(sprintf('Failed to load HTML from "%s"', $url));
}
$xpath = new DOMXPath($html);
return $xpath;
}
Merely extracting (moving) the code into the xpath_from_url function already makes the main processing more compact:
$xpath = xpath_from_url($url);
$nodelist = $xpath->query($xPath);
$profileurl = array();
foreach ($nodelist as $n) {
$value = $n->nodeValue;
$profileurl[] = $value;
}
But it does also allow you another change to the code: You can now process the URLs directly in the structure of your main extraction routine:
$url = "http://www.sandiego6.com/about-us/meet-our-team";
$xpath = xpath_from_url($url);
$profileUrls = $xpath->query("//p[@class='bio']/a/@href");
foreach ($profileUrls as $profileUrl) {
$profile = xpath_from_url($profileUrl->nodeValue);
// ... extract the six (incl. optional) values from a profile
}
As you can see, this code skips creating the array of profile URLs because a collection of all profile URLs is already given by the first XPath operation.
Now the part that extracts the up to six fields from the detail page is still missing. With this new way of iterating over the profile URLs, this is pretty easy to manage: just create one XPath expression for each field and fetch the data. If you use DOMXPath::evaluate instead of DOMXPath::query, you can get string values directly. The string value of a non-existing node is an empty string. This does not really test whether the node exists; in case you need NULL instead of "" (empty string), this needs to be done differently (I can show that, too, but that's not the point right now). In the following example, the anchor's name and role are extracted:
foreach ($profileUrls as $i => $profileUrl) {
$profile = xpath_from_url($profileUrl->nodeValue);
printf(
"#%02d: %s (%s)\n", $i + 1,
$profile->evaluate('normalize-space(//h1[@class="entry-title"])'),
$profile->evaluate('normalize-space(//h2[@class="fn"])')
);
// ... extract the other four (incl. optional) values from a profile
}
I chose to output the values directly (rather than adding them to an array or a similar structure), so that it's easy to follow what happens:
#01: Marc Bailey (Morning Anchor)
#02: Heather Myers (Morning Anchor)
#03: Jim Patton (10pm Anchor)
#04: Neda Iranpour (10 p.m. Anchor / Reporter)
...
Fetching the details about email, facebook and twitter works the same:
foreach ($profileUrls as $i => $profileUrl) {
$profile = xpath_from_url($profileUrl->nodeValue);
printf(
"#%02d: %s (%s)\n", $i + 1,
$profile->evaluate('normalize-space(//h1[@class="entry-title"])'),
$profile->evaluate('normalize-space(//h2[@class="fn"])')
);
printf(
" email...: %s\n",
$profile->evaluate('substring-after(//*[@class="bio-email"]/a/@href, ":")')
);
printf(
" facebook: %s\n",
$profile->evaluate('string(//*[@class="bio-facebook url"]/a/@href)')
);
printf(
" twitter.: %s\n",
$profile->evaluate('string(//*[@class="bio-twitter url"]/a/@href)')
);
}
This now already outputs the data as you need it (I've left the images out because they can't be displayed well in text mode):
#01: Marc Bailey (Morning Anchor)
email...: m.bailey@sandiego6.com
facebook: https://www.facebook.com/marc.baileySD6
twitter.: http://www.twitter.com/MarcBaileySD6
#02: Heather Myers (Morning Anchor)
email...: heather.myers@sandiego6.com
facebook: https://www.facebook.com/heather.myersSD6
twitter.: http://www.twitter.com/HeatherMyersSD6
#03: Jim Patton (10pm Anchor)
email...: jim.patton@sandiego6.com
facebook: https://www.facebook.com/Jim.PattonSD6
twitter.: http://www.twitter.com/JimPattonSD6
#04: Neda Iranpour (10 p.m. Anchor / Reporter)
email...: Neda.Iranpour@sandiego6.com
facebook: https://www.facebook.com/lightenupwithneda
twitter.: http://www.twitter.com/#LightenUpWNeda
...
So now these little lines of code with one foreach loop already fairly well represent the original structure outlined:
* profile pages
`- profile page
+- name
+- role
+- img
+- email
+- facebook
`- twitter
All you have to do is follow that overall structure of how the data is available in your code. Then, at the end, when you see that all data can be obtained as wished, you do the store operation in the database: one insert per profile, that is, one row per profile. You don't have to keep all the data in memory; you can just insert (perhaps with a check whether it already exists) the data for each row.
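A possible shape for that "one insert per profile" step, sketched with PDO and a prepared statement. The table name anchors, its column names and the save_profile() function are assumptions for illustration and do not come from the question:

```php
<?php
// Sketch: table and column names are assumptions, not from the original.
function save_profile(PDO $pdo, array $profile): void
{
    $stmt = $pdo->prepare(
        'INSERT INTO anchors (name, role, email, facebook, twitter)
         VALUES (:name, :role, :email, :facebook, :twitter)'
    );
    foreach (['name', 'role', 'email', 'facebook', 'twitter'] as $field) {
        $value = $profile[$field] ?? '';
        if ($value === '') {
            // Field missing on the page: store NULL so the row stays complete.
            $stmt->bindValue(':' . $field, null, PDO::PARAM_NULL);
        } else {
            $stmt->bindValue(':' . $field, $value, PDO::PARAM_STR);
        }
    }
    $stmt->execute();
}
```

Inside the foreach over $profileUrls, the values returned by evaluate() would be collected into one array per profile and passed to this function, so a missing Twitter link can never shift the other columns.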
Hope that helps.
Appendix: Code in full
<?php
/**
* Scraping detail pages based on index page
*/
/**
 * @param string $url
 *
 * @throws RuntimeException
 * @return DOMXPath
 */
function xpath_from_url($url)
{
$html = new DOMDocument();
$saved = libxml_use_internal_errors(true);
$result = $html->loadHtmlFile($url);
libxml_use_internal_errors($saved);
if (!$result) {
throw new RuntimeException(sprintf('Failed to load HTML from "%s"', $url));
}
$xpath = new DOMXPath($html);
return $xpath;
}
$url = "http://www.sandiego6.com/about-us/meet-our-team";
$xpath = xpath_from_url($url);
$profileUrls = $xpath->query("//p[@class='bio']/a/@href");
foreach ($profileUrls as $i => $profileUrl) {
$profile = xpath_from_url($profileUrl->nodeValue);
printf(
"#%02d: %s (%s)\n", $i + 1, $profile->evaluate('normalize-space(//h1[@class="entry-title"])'),
$profile->evaluate('normalize-space(//h2[@class="fn"])')
);
printf(" email...: %s\n", $profile->evaluate('substring-after(//*[@class="bio-email"]/a/@href, ":")'));
printf(" facebook: %s\n", $profile->evaluate('string(//*[@class="bio-facebook url"]/a/@href)'));
printf(" twitter.: %s\n", $profile->evaluate('string(//*[@class="bio-twitter url"]/a/@href)'));
}
Using PHP 5.3.10, I created a linked-list class and am trying to save a list of football players.
After calling the add function, it seems that the object never retains any information. var_dump($playerList) returns NULL for both my head and tail pointers. Or, if I replace it with var_dump($playerList->count), it prints nothing no matter where I place the var_dump count statement.
I have been through the manual and cannot find the error in my syntax. My gut is telling me mysql_fetch_array is doing something funky. As stated below, my testing shows that values are in fact being passed around when I call $playerList->add(). Anyhow, here is my simple code:
/* Populates lists with available players. */
function populateList($sql)
{
$playerList = new PlayerList();
while ($row = mysql_fetch_array($sql, MYSQL_NUM))
{
$playerList->add(new Player($row[0], $row[1], $row[2], $row[3], $row[4]));
}
var_dump($playerList);
}
And my linked list class:
include 'PlayerNode.php';
class PlayerList
{
public $head;
public $tail;
public $count;
function PlayerList()
{
$head = null;
$tail = null;
$count = 0;
}
function add($player)
{
$count ++;
$node = new PlayerNode($player);
//First time in
if ($head == null)
{
$head = $node;
$tail = $node;
$head->nextPtr = null;
}
// All other times
else
{
$tail->nextPtr = $node;
$tail = $node;
$node->nextPtr = null;
}
$count++;
}
}
I can place var_dump($node) and echo statements in the linked list class and observe that PlayerNode is working correctly.
But, another strange observation... if($head==null) ALWAYS evaluates to true too. Could this be related?
Insertion at the head of a singly linked list:
We can easily insert elements at the head of the list. How do we do it? Create a new node, set the next pointer of the new node to the current head node, and set the head variable (in the class) to point to the new node. This method works even if the linked list is empty. Note that we point the new node's next pointer at the head node before we set the head variable to the new node.
Insertion at the tail of a singly linked list:
We can also easily insert elements at the tail of the linked list, provided we keep a reference to the tail node. Create a new node, set its next pointer to null, set the next pointer of the tail node to the new node, and set the tail variable to point to the new node. Note that we set the next pointer of the previous tail node before we change the tail variable to point to the new node.
In all other cases (when the list is not empty), add the new node at the head or at the tail.
// All other times, if inserting at the head
else {
    $temp = $this->head;
    $this->head = $node;
    $node->nextPtr = $temp;
    $this->count++;
}
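In PHP, object properties have to be accessed through $this->; a bare $head inside a method is a fresh local variable, which would explain why if ($head == null) is always true in the question and why the list never retains anything. A minimal sketch of both insertion variants follows. PlayerNode here is a stand-in for the class in PlayerNode.php, and prepend() is a hypothetical method name:

```php
<?php
// Sketch: PlayerNode is a minimal stand-in for the class in PlayerNode.php.
class PlayerNode
{
    public $player;
    public $nextPtr = null;

    public function __construct($player)
    {
        $this->player = $player;
    }
}

class PlayerList
{
    public $head = null;
    public $tail = null;
    public $count = 0;

    // Append to the tail.
    public function add($player)
    {
        $node = new PlayerNode($player);
        if ($this->head === null) {
            // First node: it is both head and tail.
            $this->head = $node;
            $this->tail = $node;
        } else {
            // Link the old tail to the new node before moving the tail.
            $this->tail->nextPtr = $node;
            $this->tail = $node;
        }
        $this->count++;
    }

    // Insert at the head (hypothetical method name).
    public function prepend($player)
    {
        $node = new PlayerNode($player);
        $node->nextPtr = $this->head; // point at the old head first
        $this->head = $node;
        if ($this->tail === null) {
            $this->tail = $node; // list was empty
        }
        $this->count++;
    }
}
```

Initialising the properties in their declarations also removes the need for the old-style PlayerList() constructor, whose assignments had the same local-variable problem.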
For the past few days, I have been trying to find a way to count all of the non-cyclic paths between two nodes. I've been working with a breadth-first search and a depth-first search. I'm fairly sure either can accomplish this task. However, I've been struggling with how to adapt the DFS code below to find all possible paths between two nodes. I've tried a few different things (remembering nodes in array, recursion), but I haven't implemented them correctly and haven't been able to output the possible paths.
Ultimately, I would like to return an array of arrays containing all possible paths between two selected nodes. Is there any simple modification I could make to accomplish this? The code below is what I'm currently working with.
function init(&$visited, &$graph){
foreach ($graph as $key => $vertex) {
$visited[$key] = 0;
}
}
/* DFS args
$graph = Node x Node sociomatrix
$start = starting node
$end = target node (currently not used)
$visited = list of visited nodes
$list = hold keys' corresponding node values, for printing path;
*/
function depth_first(&$graph, $start, $end, $visited, $list){
// create an empty stack
$s = array();
// put the starting node on the stack
array_push($s, $start);
// note that we visited the start node
$visited[$start] = 1;
// do the following while there are items on the stack
while (count($s)) {
// remove the last item from the stack
$t = array_pop($s);
// move through all connections
foreach ($graph[$t] as $key => $vertex) {
// if node hasn't been visited and there's a connection
if (!$visited[$key] && $vertex == 1) {
// note that we visited this node
$visited[$key] = 1;
// push key onto stack
array_push($s, $key);
// print the node we're at
echo $list[$key];
}
}
}
}
// example usage
$visited = array();
$visited = init($visited, $sym_sociomatrix);
breadth_first($sym_sociomatrix, 1, 3, $visited, $list);
Assuming you have a framework / library to create a graph data structure and to traverse it, you could do a backtracking depth-first search with an early return if you get a cycle. Cycle detection is easy if you store the path from the starting node. In C-style pseudo-code (sorry don't know PHP, or if it is capable of recursion):
void DFS(Vertex current, Vertex goal, List<Vertex> path) {
    // avoid cycles: current is already on the path leading here
    if (contains(path, current))
        return;
    // the path now ends at the current vertex (add_path returns a new list)
    path = add_path(path, current);
    // got it!
    if (current == goal) {
        print(path);
        return;
    }
    // keep looking
    children = successors(current); // optionally sorted from low to high cost
    for (child : children)
        DFS(child, goal, path);
}
and you can then call it as DFS(start, goal, List<Vertex>(empty))
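PHP does support recursion, so the same idea can be sketched directly against the adjacency matrix from the question. The function name all_paths() and the three-node sample graph are made up for illustration:

```php
<?php
// $graph is an adjacency matrix as in the question: $graph[$a][$b] == 1 means an edge.
function all_paths(array $graph, $current, $goal, array $path = []): array
{
    // Avoid cycles: never revisit a vertex that is already on this path.
    if (in_array($current, $path, true)) {
        return [];
    }
    $path[] = $current;
    if ($current === $goal) {
        return [$path]; // one complete path
    }
    $paths = [];
    foreach ($graph[$current] as $next => $connected) {
        if ($connected == 1) {
            // Collect every path found through this neighbour.
            $paths = array_merge($paths, all_paths($graph, $next, $goal, $path));
        }
    }
    return $paths;
}

// Sample graph: three nodes, all connected to each other.
$graph = [
    1 => [1 => 0, 2 => 1, 3 => 1],
    2 => [1 => 1, 2 => 0, 3 => 1],
    3 => [1 => 1, 2 => 1, 3 => 0],
];
print_r(all_paths($graph, 1, 3)); // the paths 1-2-3 and 1-3
```

Because $path is passed by value, each branch of the recursion works on its own copy, which is what makes the backtracking implicit. The number of simple paths can grow very quickly on dense graphs, so this is only practical for small matrices.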
I have a function below which gets name from a site. Below is the partial code, not complete. The values are passed thru a for loop using php.
function funct($$name,$page)
{
$url="http://testserver.com/client?list=$name&page=$page";
$ch=curl_init();
curl_setopt($ch,CURLOPT_URL,url);
$result=curl_exec($ch);
$dom = new DOMDocument();
@$dom->loadHTML($result);
$xpath=new DOMXPath($dom);
$elements = $xpath->evaluate("//div");
foreach ($elements as $element)
{
$name = $element->getElementsByTagName("name")->item(0)->nodeValue;
$position=$position +1;
echo $name.$position;
}
}
The code works fine, but when I get a name I need to add a position that is incremented by 1 for each name, so that the numbering is continuous. However, when the page values are passed in, for example when I move from page 1 to page 2, the count starts again from the beginning; on the next page, the same problem.
How can I make it continue across pages?
Either make $position a global variable (global $position;) or pass it to the function: function funct($name, $page, &$position). (What's with the variable variable $$name in your function signature?)
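A minimal sketch of the pass-by-reference variant, assuming the names for each page are already available as an array (printNames() and the sample names are hypothetical stand-ins for the scraping code):

```php
<?php
// Sketch: $names stands in for the names scraped from one page.
function printNames(array $names, int &$position): void
{
    foreach ($names as $name) {
        $position++; // keeps counting across calls
        echo $name . ' ' . $position . "\n";
    }
}

$position = 0;
printNames(['Ann', 'Bob'], $position); // page 1
printNames(['Cid'], $position);        // page 2
```

Because $position is declared with & in the signature, the function updates the caller's variable, so the second call continues at 3 instead of starting over.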
Use $_SESSION. It's designed specifically to maintain state.