I am trying to write a web crawler, but I have no idea how to set up the recursion for parsing a web page and collecting all the end results into one final array.
I have never worked with PHP before, but I did a lot of research online and have already figured out how to parse the page I want to scrape.
Please note that I have replaced the $url value and the array results below with made-up values.
<?php
include_once "simple_html_dom.php"; //http://simplehtmldom.sourceforge.net/
$url = "https://www.scrapesite.com/pagetoscrape/index.html";
function parseLink($link) {
$html = file_get_html($link);
$html = $html->find("/html/body/script[2]/text", 0);
preg_match('/\{(?:[^{}]|(?R))*\}/', $html, $matches); // this recursive regex extracts the first balanced {...} block (the JSON object)
$json = json_decode($matches[0]);
$data = ($json->props->contents);
return $data;
}
function getFolders($basepath, $data) {
$data = $data->folders;
$result = array();
foreach ($data as $value) {
$result[] = array("folder", $basepath . "/" . $value->filename, $value->href);
}
return $result;
}
$data = getFolders("", parseLink($url));
print_r ($data);
?>
This script works fine and it outputs the following:
Array
(
[0] => Array
(
[0] => folder
[1] => /1
[2] => https://www.scrapesite.com/pagetoscrape/sjdfi327943sad/index.html
)
[1] => Array
(
[0] => folder
[1] => /2
[2] => https://www.scrapesite.com/pagetoscrape/345fdsjjsdfsdf/index.html
)
[2] => Array
(
[0] => folder
[1] => /3
[2] => https://www.scrapesite.com/pagetoscrape/46589dsjodsiods/index.html
)
[3] => Array
(
[0] => folder
[1] => /4
[2] => https://www.scrapesite.com/pagetoscrape/345897dujfosfsd/index.html
)
[4] => Array
(
[0] => folder
[1] => /5
[2] => https://www.scrapesite.com/pagetoscrape/9dsfghshdfsds3/index.html
)
)
Now the script should execute the getFolders function for every item in the above array. Each call may return another array of folders, which should be parsed too.
I then want to build a final array that lists the ABSOLUTE path ($basepath . "/" . $value->filename) and the href link of every folder. I would really appreciate any hint.
I was able to find some examples on the web, but I can't figure out how to apply them here because I have almost no experience with programming languages in general.
Initialize an empty array and pass it by reference to the getFolders() function. Keep appending the scraping results to this array. You also need to call getFolders() again from inside its own foreach loop, so that every folder you find gets parsed in turn. Example below:
$finalResults = array();
getFolders("", parseLink($url), $finalResults);
Your getFolders() function signature will now look like below:
function getFolders($basepath, $data, &$finalResults) //notice the & before the $finalResults used for passing by reference
And, your foreach loop:
foreach ($data as $value) {
$finalResults[] = array("folder", $basepath . "/" . $value->filename, $value->href);
getFolders($basepath . "/" . $value->filename, parseLink($value->href), $finalResults); // recurse, using this folder's path as the new base path
}
The code above is only an example; adapt it to your needs.
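To see the recursion end to end without hitting a real site, here is a minimal sketch. It assumes a hypothetical fakeParseLink() that stands in for parseLink() and serves a tiny hard-coded page tree; collectFolders() and the $visited guard are also illustrative names that are not part of the original code. The guard is there because real sites can link back to pages that were already crawled.

```php
<?php
// Hypothetical stand-in for parseLink(): maps a URL to the decoded "contents"
// object. A real crawler would fetch and parse the page here instead.
function fakeParseLink(string $link): object {
    $pages = [
        'root' => ['folders' => [
            ['filename' => '1', 'href' => 'a'],
            ['filename' => '2', 'href' => 'b'],
        ]],
        'a' => ['folders' => [['filename' => 'x', 'href' => 'c']]],
        'b' => ['folders' => []],
        'c' => ['folders' => []],
    ];
    // Round-trip through JSON to get nested objects, as json_decode() would return.
    return json_decode(json_encode($pages[$link]));
}

// Appends every folder below $data to $finalResults, depth first.
function collectFolders(string $basepath, object $data, array &$finalResults, array &$visited): void {
    foreach ($data->folders as $value) {
        $path = $basepath . '/' . $value->filename;
        $finalResults[] = ['folder', $path, $value->href];
        if (isset($visited[$value->href])) {
            continue; // already crawled: avoid infinite recursion on link cycles
        }
        $visited[$value->href] = true;
        collectFolders($path, fakeParseLink($value->href), $finalResults, $visited);
    }
}

$finalResults = [];
$visited = ['root' => true];
collectFolders('', fakeParseLink('root'), $finalResults, $visited);
print_r($finalResults);
```

With this sample tree the final array lists /1, then /1/x, then /2. Swapping fakeParseLink() for the real parseLink() should be the only change needed, assuming every page exposes a folders list.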
I have a PHP Multidimensional associative array structured in this way:
Array
(
[0] => Array
(
[serverid] => 1
[ip] => localhost
[name] => Server1
)
[1] => Array
(
[serverid] => 2
[ip] => localhost
[name] => Server2
)
[2] => Array
(
[serverid] => 3
[ip] => localhost
[name] => Server3
)
)
Now I need to append this new field, with this value, to every sub-array:
['page_url'] = base_url('/server/id/') . $server['serverid'];
Here $server['serverid'] is the serverid field of the sub-array being processed.
I tried the following loop, but it doesn't seem to work:
$result = $query->result_array();
foreach($result as $server) {
$server['page_url'] = base_url('/server/id/') . $server['id'];
}
Any suggestion would be really appreciated.
If you want to modify the sub-arrays while iterating through an array with foreach, you have to take each element by reference using &.
If you change your code to the one below, it should work, because you'll be changing the original array item instead of a copy. Note that the key in your array is serverid, not id.
foreach($result as &$server) {
$server['page_url'] = base_url('/server/id/') . $server['serverid'];
}
unset($server); // break the reference that is left over after the loop
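Here is a small self-contained sketch of the by-reference loop, next to an array_map() version that avoids references altogether. The base_url_stub() function and the example.test host are assumptions that stand in for the framework's base_url() helper, and the two rows are sample data in the shape shown in the question.

```php
<?php
// Hypothetical stand-in for the framework's base_url() helper.
function base_url_stub(string $path): string {
    return 'http://example.test' . $path;
}

// Sample rows in the same shape as $query->result_array() in the question.
$rows = [
    ['serverid' => 1, 'ip' => 'localhost', 'name' => 'Server1'],
    ['serverid' => 2, 'ip' => 'localhost', 'name' => 'Server2'],
];

// Variant 1: modify the elements in place through a reference.
$result = $rows;
foreach ($result as &$server) {
    $server['page_url'] = base_url_stub('/server/id/') . $server['serverid'];
}
unset($server); // without this, a later write to $server would alter the last element

// Variant 2: no references; array_map() returns a new array.
$mapped = array_map(function (array $server): array {
    $server['page_url'] = base_url_stub('/server/id/') . $server['serverid'];
    return $server;
}, $rows);

print_r($result);
```

Both variants should produce the same array here; which one reads better is largely a matter of taste.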
This creates a temporary copy of the subarray, which you're changing and then throwing away on the next iteration:
foreach ($result as $server) {
$server['page_url'] = base_url('/server/id/') . $server['id'];
}
You want to change the original array (and read the serverid key, which is the one your array actually contains). Something like this:
foreach (array_keys($result) as $index) {
$result[$index]['page_url'] = base_url('/server/id/') . $result[$index]['serverid'];
}
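If you know you haven't mucked with the indexes in $result (that is, they are still 0, 1, 2, ...), you could also just do:
for ($index = 0; $index < count($result); $index++) {
$result[$index]['page_url'] = base_url('/server/id/') . $result[$index]['serverid'];
}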
This is my PHP code:
$res_media=mysql_query("SELECT * FROM mv_media");
$media = array();
while($resualt_media = mysql_fetch_assoc($res_media)) {
$media[]= $resualt_media['title'];
}
echo $media;
And this is my output:
["Test","Test","Test","Test","Test"]
I want to change it to this format:
["Test"],["Test"],["Test"],["Test"],["Test"]
I changed my code to this:
$res_media=mysql_query("SELECT * FROM mv_media");
$media = array();
while($resualt_media = mysql_fetch_assoc($res_media)) {
$media[]= [$resualt_media['title']];
}
echo $media;
Now my output is:
[["Test","Test","Test","Test","Test"]]
But I need this custom output:
[["Test"],["Test"],["Test"],["Test"],["Test"],["ItsMyCustomChild"]]
I want to add the custom child myself, without it coming from the database!
You can write $media[][] = <value>, which does the same as $media[] = [<value>]: each iteration appends a new one-element array to $media, so you end up with an array of arrays.
I would suggest this approach:
<?php
$input = json_decode('["Test","Test","Test","Test","Test"]');
$output = [];
array_walk($input, function($element) use (&$output) {
$output[] = [$element];
});
var_dump(json_encode($output));
Alternatively this can be simplified to just:
<?php
$data = json_decode('["Test","Test","Test","Test","Test"]');
array_walk($data, function(&$element) {
$element = [$element];
});
var_dump(json_encode($data));
The created array structure obviously is:
Array
(
[0] => Array
(
[0] => Test
)
[1] => Array
(
[0] => Test
)
[2] => Array
(
[0] => Test
)
[3] => Array
(
[0] => Test
)
[4] => Array
(
[0] => Test
)
)
Which, if you again json_encode() it, results in the desired format:
string(46) "[["Test"],["Test"],["Test"],["Test"],["Test"]]"
The specific issue you are actually dealing with is not so much the creation of the desired structure, but that you are trying to modify the JSON string instead of the actual array you want to work with. That is why a call to json_decode() is used initially.
$res_media=mysql_query("SELECT * FROM mv_media");
$media = array();
while($resualt_media = mysql_fetch_assoc($res_media)) {
array_push($media,array($resualt_media['title']));
}
print_r($media);
You just need to change the line below, and you will get your desired result.
$media[][]= $resualt_media['title'];
If you want to see what you get, call json_encode() after the while loop, like below:
echo json_encode($media);
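None of the snippets above adds the custom child the question asks for, so here is a minimal sketch of that step. The $rows array is assumed sample data that stands in for the rows fetched from mv_media; the custom entry is simply appended after the loop as one more single-element array.

```php
<?php
// Assumed sample data standing in for the rows fetched from mv_media.
$rows = [
    ['title' => 'Test'],
    ['title' => 'Test'],
    ['title' => 'Test'],
];

$media = [];
foreach ($rows as $row) {
    $media[] = [$row['title']]; // one single-element array per title
}
$media[] = ['ItsMyCustomChild']; // extra child that does not come from the database

echo json_encode($media), "\n";
// prints [["Test"],["Test"],["Test"],["ItsMyCustomChild"]]
```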
I am writing PHP code to collect all the hashtags I've used in my media posts, and to see in how many posts I've used each hashtag and how many likes the posts with that hashtag received in total.
I have collected all of the media posts in my database and am now able to export this information. Here is an example of the multidimensional array that is output:
Array
(
[0] => Array
(
[id] => 1
[caption] => #londra #london #london_only #toplondonphoto #visitlondon #timeoutlondon #londres #london4all #thisislondon #mysecretlondon #awesomepix #passionpassport #shootermag #discoverearth #moodygrams #agameoftones #neverstopexploring #beautifuldestinations #artofvisuals #roamtheplanet #jaw_dropping_shots #fantastic_earth #visualsoflife #bdteam #nakedplanet #ourplanetdaily #earthfocus #awesome_earthpix #exploretocreate #londoneye
[likesCount] => 522
)
[1] => Array
(
[id] => 2
[caption] => #londra #london #london_only #toplondonphoto #visitlondon #timeoutlondon #londres #london4all #thisislondon #mysecretlondon #awesomepix #passionpassport #shootermag #discoverearth #moodygrams #agameoftones #neverstopexploring #beautifuldestinations #artofvisuals #roamtheplanet #jaw_dropping_shots #fantastic_earth #visualsoflife #bdteam #nakedplanet #ourplanetdaily #earthfocus #awesome_earthpix #harrods #LDN4ALL_One4All
[likesCount] => 1412
)
)
I am able to separate these hashtags out using the following function:
function getHashtags($string) {
$hashtags= FALSE;
preg_match_all("/(#\w+)/u", $string, $matches);
if ($matches) {
$hashtagsArray = array_count_values($matches[0]);
$hashtags = array_keys($hashtagsArray);
}
return $hashtags;
}
Now I want to create a multidimensional array for each hashtag which should look like this:
Array
(
[0] => Array
(
[hash] => #londra
[times_used] => 2
[total_likes] => 153
)
[1] => Array
(
[hash] => #london
[times_used] => 12
[total_likes] => 195
)
)
I am quite new to this and not sure how to achieve this. Help and suggestions are appreciated!
It would be easier to use the hashtags as keys in your array. You can transform it to your final format later if you want to. The idea is to traverse your input array and, within each element, iterate over the hashtags in its caption string, increasing the counters as you go.
And if your hashtags are always in a string like that, separated by whitespace, you can also get an array of them with explode(), or with preg_split() for finer control.
# $posts is your input array
$tags = [];
foreach ($posts as $post) {
$hashtags = explode(' ', $post['caption']);
foreach ($hashtags as $tag) {
if (!key_exists($tag, $tags)) {
# first time seeing this one, initialize an entry
$tags[$tag]['counter'] = 0;
$tags[$tag]['likes'] = 0;
}
$tags[$tag]['counter']++;
$tags[$tag]['likes'] += $post['likesCount'];
}
}
Transforming to something closer to your original request:
$result = array_map(function($hashtag, $data) {
return [
'hash' => $hashtag,
'times_used' => $data['counter'],
'total_likes' => $data['likes'],
'average_likes' => $data['likes'] / $data['counter'],
];
}, array_keys($tags), $tags);
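For captions that mix hashtags with ordinary text, a variant built on preg_match_all() may be more robust than explode(). The sketch below makes two assumptions that are not in the question: hashtags are compared case-insensitively (via strtolower(), which only folds ASCII letters), and a hashtag repeated inside one caption counts once for that post. The sample posts are made up.

```php
<?php
// Assumed sample posts in the same shape as the question's array.
$posts = [
    ['id' => 1, 'caption' => '#london #londra #london', 'likesCount' => 10],
    ['id' => 2, 'caption' => 'Sunset over #London', 'likesCount' => 5],
];

$tags = [];
foreach ($posts as $post) {
    preg_match_all('/#\w+/u', $post['caption'], $matches);
    // Lower-case and de-duplicate so each post counts once per hashtag.
    $unique = array_unique(array_map('strtolower', $matches[0]));
    foreach ($unique as $tag) {
        if (!isset($tags[$tag])) {
            // First time seeing this hashtag: initialize its entry.
            $tags[$tag] = ['hash' => $tag, 'times_used' => 0, 'total_likes' => 0];
        }
        $tags[$tag]['times_used']++;
        $tags[$tag]['total_likes'] += $post['likesCount'];
    }
}

// Drop the hashtag keys to get the numerically indexed format from the question.
$result = array_values($tags);
print_r($result);
```

With the sample data, #london ends up with times_used 2 and total_likes 15, and #londra with 1 and 10.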
This is what I get after a print_r($myArray) (wrapped in pre) on my array.
Array
(
[0] => 203.143.197.254
[1] => not/available
)
Array
(
[0] => 40.190.125.166
[1] => articles/not/a/page
)
Array
(
[0] => 25.174.7.82
[1] => articles/not/a/page
)
How would I return or echo just the first two in this case (no regex), given that I only want to output an array if its [1] value has not been echoed before?
My list has far more entries, and $myArray[1] is sometimes the same; I want to skip echoing the same thing twice.
I have tried array_unique, but I can't get it to work because parameter 1 is expected to be an array.
print_r(array_unique($myArray));
This works. I didn't copy your code exactly, but hopefully you get the idea of the logic:
$echoed = array();
foreach($myArray as $arr) {
if(!in_array($arr[1],$echoed)) {
echo $arr[1];
$echoed[] = $arr[1];
}
}
$echoedBefore = array();
print_r(array_filter($myArray, function($entry) {
global $echoedBefore;
$alreadyEchoed = in_array($entry[1], $echoedBefore);
if (!$alreadyEchoed) {
$echoedBefore[] = $entry[1];
}
return !$alreadyEchoed;
}));
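A variant of the same idea that avoids both the global variable and the repeated in_array() scans is to use the [1] value as an array key and test it with isset(). This is only a sketch, and like the answers above it assumes $myArray is a single array holding all the [ip, path] pairs; $seen and $firstOccurrences are illustrative names.

```php
<?php
// Sample data in the shape assumed above: one outer array of [ip, path] pairs.
$myArray = [
    ['203.143.197.254', 'not/available'],
    ['40.190.125.166', 'articles/not/a/page'],
    ['25.174.7.82', 'articles/not/a/page'],
];

$seen = [];             // paths already emitted, stored as array keys
$firstOccurrences = []; // first entry for each distinct path
foreach ($myArray as $entry) {
    if (isset($seen[$entry[1]])) {
        continue; // this path was already emitted
    }
    $seen[$entry[1]] = true;
    $firstOccurrences[] = $entry;
}

print_r($firstOccurrences);
```

For the three sample entries this keeps the first two and skips the third, whose path duplicates the second.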
I'm taking some JSON produced by OpenLibrary.org and building a new array from the info.
Link to the OpenLibrary JSON
Here is my PHP code to decode the JSON:
$barcode = "9781599953540";
function parseInfo($barcode) {
$url = "http://openlibrary.org/api/books?bibkeys=ISBN:" . $barcode . "&jscmd=data&format=json";
$contents = file_get_contents($url);
$json = json_decode($contents, true);
return $json;
}
The new array I'm trying to make looks something like this:
$newJsonArray = array($barcode, $isbn13, $isbn10, $openLibrary, $title, $subTitle, $publishData, $pagination, $author0, $author1, $author2, $author3, $imageLarge, $imageMedium, $imageSmall);
But when I try to get the ISBN_13 to save it to $isbn13, I get an error:
Notice: Undefined offset: 0 in ... on line 38
// Line 38
$isbn13 = $array[0]['identifiers']['isbn_13'];
Even if I try $array[1], [2], [3] and so on, I get the same thing. What am I doing wrong here? I know my variable names might not match; that's because they are in different functions.
Thanks for your help.
Your array is not indexed by integers, it is indexed by ISBN numbers:
Array
(
// This is the first level of array key!
[ISBN:9781599953540] => Array
(
[publishers] => Array
(
[0] => Array
(
[name] => Center Street
)
)
[pagination] => 376 p.
[subtitle] => the books of mortals
[title] => Forbidden
[url] => http://openlibrary.org/books/OL24997280M/Forbidden
[identifiers] => Array
(
[isbn_13] => Array
(
[0] => 9781599953540
)
[openlibrary] => Array
(
[0] => OL24997280M
)
So, you need to call it by the first ISBN, and the key isbn_13 is itself an array which you must access by element:
// Gets the first isbn_13 for this item:
$isbn13 = $array['ISBN:9781599953540']['identifiers']['isbn_13'][0];
Or if you need a loop over many of them:
foreach ($array as $isbn => $values) {
$current_isbn13 = $values['identifiers']['isbn_13'][0];
}
If you expect only one each time and must be able to get its key without knowing it ahead of time but don't want a loop, you can use array_keys():
// Get all ISBN keys:
$isbn_keys = array_keys($array);
// Pull the first one:
$your_item = $isbn_keys[0];
// And use it as your index to $array
$isbn13 = $array[$your_item]['identifiers']['isbn_13'][0];
If you have PHP 5.4 or later, you can skip a step by indexing the function's return value directly (function array dereferencing):
// PHP >= 5.4 only
$your_item = array_keys($array)[0];
$isbn13 = $array[$your_item]['identifiers']['isbn_13'][0];
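On newer PHP versions the same lookup can be written a little more defensively. The sketch below assumes PHP 7.3 or later for array_key_first() and uses the null coalescing operator so that a missing field (for example a book without an isbn_10) yields null instead of a notice. The $array literal is a trimmed, assumed sample of the decoded response shown above.

```php
<?php
// Trimmed, assumed sample of the decoded response shown above.
$array = [
    'ISBN:9781599953540' => [
        'title' => 'Forbidden',
        'identifiers' => [
            'isbn_13' => ['9781599953540'],
            'openlibrary' => ['OL24997280M'],
        ],
    ],
];

$firstKey = array_key_first($array); // PHP >= 7.3; returns null for an empty array
$book = $firstKey !== null ? $array[$firstKey] : [];

// The null coalescing operator (PHP >= 7.0) avoids notices for missing fields.
$isbn13 = $book['identifiers']['isbn_13'][0] ?? null;
$isbn10 = $book['identifiers']['isbn_10'][0] ?? null;

var_dump($isbn13, $isbn10);
```

Here $isbn13 holds the string "9781599953540" and $isbn10 is null, because the sample has no isbn_10 entry.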