creating multidimensional array with two arrays

I am indexing web pages. The code scans a given web page for its links and its title, which are stored in two different arrays. I would like to create a multidimensional array that has the word Array, followed by the links, followed by the individual titles of the links. I have the code; I just don't know how to put it together.
require_once('simplehtmldom_1_5/simple_html_dom.php');
require_once('url_to_absolute/url_to_absolute.php');
//links
$links = Array();
$URL = 'http://www.youtube.com'; // change it for urls to grab
// grabs the urls from URL
$file = file_get_html($URL);
foreach ($file->find('a') as $theelement) {
    $links[] = url_to_absolute($URL, $theelement->href);
}
print_r($links);
//titles
$titles = Array();
$str = file_get_contents($URL);
preg_match_all("/<title>(.*)<\/title>/", $str, $title);
$titles = $title[1]; // the captured title text, not the match count
print_r($titles);

You should be able to do this, assuming there are the same number of links as titles, so that each link and its title share the same array key.
$newArray = array();
foreach ($links as $key => $val)
{
    $newArray[$key]['link'] = $val;
    $newArray[$key]['title'] = $titles[$key];
}
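A minimal sketch of the same idea, wrapped in a function so it can be tried in isolation. The function name pairLinksWithTitles and the sample data are assumptions made for illustration; a missing title falls back to null instead of raising a warning.

```php
<?php
// Hypothetical helper: pairs each link with the title stored under the same key.
function pairLinksWithTitles(array $links, array $titles): array
{
    $pairs = array();
    foreach ($links as $key => $link) {
        $pairs[$key] = array(
            'link'  => $link,
            // Fall back to null when there is no title for this key.
            'title' => $titles[$key] ?? null,
        );
    }
    return $pairs;
}

// Sample data (assumed, not taken from a real page).
$links  = array('http://example.com/a', 'http://example.com/b');
$titles = array('Page A');

$pairs = pairLinksWithTitles($links, $titles);
print_r($pairs);
```

The null fallback is only one possible choice; skipping links without a title would work just as well if the arrays are not guaranteed to line up.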

It is not clear what you want.
Anyway, here is how I would rewrite your code in a more organized way:
require_once('simplehtmldom_1_5/simple_html_dom.php');
require_once('url_to_absolute/url_to_absolute.php');
$info = array();
$urls = array(
    'http://www.youtube.com',
    'http://www.google.com.br'
);
foreach ($urls as $url)
{
    $str = file_get_contents($url);
    $html = str_get_html($str);
    // find() with an index returns a single element instead of an array
    $title = strval($html->find('title', 0)->plaintext);
    $links = array();
    foreach ($html->find('a') as $anchor)
    {
        $links[] = url_to_absolute($url, strval($anchor->href));
    }
    $links = array_unique($links);
    $info[$url] = array(
        'title' => $title,
        'links' => $links
    );
}
print_r($info);

Related

Check every URL in string to remove links of certain sites

I want to remove URLs of certain sites within a string
I used this:
<?php
$URLContent = '<p>Google</p><p>AnotherSite</p>';
$LinksToRemove = array('google.com', 'yahoo.com', 'msn.com');
$LinksToCheck = in_array('google.com' , $LinksToRemove);
if (strpos($URLContent, $LinksToCheck) !== 0) {
    $URLContent = preg_replace('#<a.*?>([^>]*)</a>#i', '$1', $URLContent);
}
echo $URLContent;
?>
In this example, I want to remove the links to google.com, yahoo.com and msn.com only if any of them is found in the string $URLContent, but keep any other links.
The result of the previous code is:
<p>Google</p><p>AnotherSite</p>
but I want it to be:
<p>Google</p><p>AnotherSite</p>

One solution would be to explode your $URLContent and compare each piece against every value in $LinksToRemove.
It could look like this (str_contains requires PHP 8):
<?php
$URLContent = '<p>Google</p><p>AnotherSite</p>';
$urlList = explode('</p>', $URLContent);
$LinksToRemove = array('google.com', 'yahoo.com', 'msn.com');
$urlFormat = [];
foreach ($urlList as $url) {
    if ($url === '') {
        continue; // skip the empty piece after the last </p>
    }
    $url .= '</p>'; // restore the delimiter removed by explode()
    foreach ($LinksToRemove as $link) {
        if (str_contains($url, $link)) {
            $url = '<p>' . ucfirst(str_replace('.com', '', $link)) . '</p>';
            break;
        }
    }
    $urlFormat[] = $url;
}
$result = implode('', $urlFormat);
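An alternative sketch that unwraps only the matching anchors with preg_replace_callback instead of splitting the string. The helper name removeLinksTo, the sample markup, and the href values are assumptions made for illustration; a regular expression is a rough tool for HTML, so this is meant for simple, well-formed snippets only.

```php
<?php
// Hypothetical helper: unwraps <a> elements whose href points at a listed domain.
function removeLinksTo(string $html, array $domains): string
{
    return preg_replace_callback(
        '#<a\s[^>]*href=(["\'])(.*?)\1[^>]*>(.*?)</a>#is',
        function (array $m) use ($domains) {
            $host = parse_url($m[2], PHP_URL_HOST);
            if (is_string($host)) {
                foreach ($domains as $domain) {
                    $suffix = '.' . $domain;
                    // Match the domain itself or any of its subdomains.
                    if ($host === $domain || substr($host, -strlen($suffix)) === $suffix) {
                        return $m[3]; // keep only the link text
                    }
                }
            }
            return $m[0]; // leave every other link untouched
        },
        $html
    );
}

// Assumed sample markup.
$URLContent = '<p><a href="https://www.google.com/">Google</a></p>'
            . '<p><a href="https://example.org/">AnotherSite</a></p>';
$LinksToRemove = array('google.com', 'yahoo.com', 'msn.com');

$cleaned = removeLinksTo($URLContent, $LinksToRemove);
echo $cleaned;
```

Comparing the parsed host instead of searching the whole string avoids removing a link just because a listed domain appears somewhere in its text or query string.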

From array to JSON

I am writing a parser for articles and I need to put all the parsed data into JSON. I tried putting the data into an array and then converting it to JSON, but I am having trouble. I get JSON like this:
[{"title":"title1"}][{"title":"title2"}][{"title":"title3"}]
But I want it like this:
[{"title":"title1"},{"title":"title2"},{"title":"title3"}]
How can I do this?
<?
foreach ($content_prev as $el) {
    $pq = pq($el);
    $date = $pq->find('time')->html();
    $title = $pq->find('h3 a')->html();
    $link = $pq->find('h3 a')->attr('href');
    $data_link = file_get_contents($link);
    $document_с = phpQuery::newDocument($data_link);
    $content = $document_с->find('.td-post-content');
    $arr = array (
        array(
            "title" => $title
        ),
    );
    echo json_encode($arr, JSON_UNESCAPED_UNICODE);
}

Remove one level of array nesting from $arr, append to it inside the loop, and call json_encode once after the loop.
Use the version below.
<?php
$arr = array();
foreach ($content_prev as $el) {
    $pq = pq($el);
    $date = $pq->find('time')->html();
    $title = $pq->find('h3 a')->html();
    $link = $pq->find('h3 a')->attr('href');
    $data_link = file_get_contents($link);
    $document_c = phpQuery::newDocument($data_link);
    $content = $document_c->find('.td-post-content');
    // append one entry per article
    $arr[] = array (
        "title" => $title
    );
}
echo json_encode($arr, JSON_UNESCAPED_UNICODE);
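The difference between encoding inside and after the loop can be checked without phpQuery. This is a minimal sketch in which the $parsedTitles array is an assumption standing in for the titles extracted from the articles.

```php
<?php
// Assumed sample titles standing in for the parsed articles.
$parsedTitles = array('title1', 'title2', 'title3');

$arr = array();
foreach ($parsedTitles as $title) {
    // Append one associative array per article.
    $arr[] = array('title' => $title);
}

// Encode once, after the loop, so the output is a single JSON array.
$json = json_encode($arr, JSON_UNESCAPED_UNICODE);
echo $json;
// prints [{"title":"title1"},{"title":"title2"},{"title":"title3"}]
```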

Array filter in PHP

I am using Simple HTML DOM to parse an HTML file.
I have a dynamic array called links2. It can be empty, or it can have four or more elements, depending on the case.
<?php
include('simple_html_dom.php');
$url = 'http://www.example.com/';
$html = file_get_html($url);
$doc = new DOMDocument();
@$doc->loadHTML($html);
//////////////////////////////////////////////////////////////////////////////
foreach ($doc->getElementsByTagName('p') as $link)
{
    $intro2 = $link->nodeValue;
    $links2[] = array(
        'value' => $link->textContent,
    );
    $su = count($links2);
}
$word = 'document.write(';
Assume that two elements of $links2 contain $word. I try to filter $links2 by removing the elements that contain a match:
unset( $links2[array_search($word, $links2 )] );
print_r($links2);
The filter removes only one element, and array_diff doesn't solve the problem. Any suggestions?

Solved by adding a check while building the array:
$links2 = array();
foreach ($doc->getElementsByTagName('p') as $link)
{
    $dont = $link->textContent;
    // skip paragraphs that contain the unwanted text
    if (strpos($dont, 'document') === false) {
        $links2[] = array(
            'value' => $link->textContent,
        );
    }
}
$su = count($links2);
echo $su;
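If the array has already been built, array_filter can remove every matching element in one pass. array_search returns at most one key and compares whole elements, and each element here is an array rather than a string, which is probably why only a single element disappeared. The sample data below is an assumption made for illustration.

```php
<?php
$word = 'document.write(';

// Assumed sample data in the same shape as $links2.
$links2 = array(
    array('value' => 'First paragraph'),
    array('value' => 'document.write("ad one");'),
    array('value' => 'Second paragraph'),
    array('value' => 'document.write("ad two");'),
);

// Keep only the elements whose 'value' does not contain $word.
$filtered = array_filter($links2, function (array $entry) use ($word) {
    return strpos($entry['value'], $word) === false;
});

// array_filter preserves the original keys; reindex them from 0.
$filtered = array_values($filtered);
print_r($filtered);
```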

Nested foreach loops

I know the solution is simple, but it keeps slipping my mind. When I parse the page with this code and print the $links array, all of the href parts are correct, yet the img part only contains the last src found on the page.
$doc = new DOMDocument();
$doc->loadHTML($html);
$links = array();
$images = $doc->getElementsByTagName("img");
$arr = $doc->getElementsByTagName("a");
foreach($arr as $item) {
    // get links
    $href = $item->getAttribute("href");
    // get images.
    foreach ($images as $item) {
        $img = $item->getAttribute('src');
    }
    $links[] = array(
        'href' => $href,
        'img' => $img
    );
}
print_r(array_values($links));

The foreach statement for images should build an array, whereas the final array ($links) is a multidimensional array ($img being the nested array).

Check if this works for you:
$doc = new DOMDocument();
$doc->loadHTML($html);
$links = array();
$images = $doc->getElementsByTagName("img");
$arr = $doc->getElementsByTagName("a");
foreach($arr as $item) {
    // get links
    $href = $item->getAttribute("href");
    // get images.
    foreach ($images as $item) {
        $img = $item->getAttribute('src');
        // storing the image src
        $links[] = array(
            'img' => $img
        );
    }
    $links[] = array(
        'href' => $href
    );
}
print_r(array_values($links));

You reuse the variable $item in the inner foreach, and since the inner loop always runs to the end, $img always holds the last src.
Try this version without the inner foreach:
$doc = new DOMDocument();
$doc->loadHTML($html);
$links = array();
$images = $doc->getElementsByTagName("img");
$arr = $doc->getElementsByTagName("a");
foreach ($arr as $key => $item) {
    // get links
    $href = $item->getAttribute("href");
    // get the image at the same position, if there is one
    $image = $images->item($key);
    $img = $image ? $image->getAttribute('src') : null;
    $links[] = array(
        'href' => $href,
        'img' => $img
    );
}
unset($item);
print_r(array_values($links));
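If the goal is a nested array of the images that sit inside each link, one option is to search for img elements within each anchor instead of across the whole document. This is a sketch under that assumption; the sample markup is invented for illustration.

```php
<?php
// Assumed sample markup: one link wrapping two images, one with text only.
$html = '<a href="/a"><img src="a1.png"><img src="a2.png"></a>'
      . '<a href="/b">text only</a>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$links = array();
foreach ($doc->getElementsByTagName('a') as $anchor) {
    $sources = array();
    // Only look at <img> elements nested inside this anchor.
    foreach ($anchor->getElementsByTagName('img') as $image) {
        $sources[] = $image->getAttribute('src');
    }
    $links[] = array(
        'href' => $anchor->getAttribute('href'),
        'img'  => $sources, // nested array, empty when the link has no images
    );
}
print_r($links);
```

Using distinct names for the two loop variables ($anchor and $image) also avoids the overwriting problem from the original code.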

php get bing rss feed titles into one var

With the code below I am trying to fetch a Bing RSS news feed, collect all the titles into an array, and then implode them into a single string, so that I can build a word cloud from it with another piece of code. So far it fetches the RSS feed, and print_r($doc), if you uncomment it, displays the SimpleXML object. However, my foreach loop that collects the titles into the array doesn't seem to be working, and I can't see where the error is. Thanks in advance.
$ch = curl_init("http://api.bing.com/rss.aspx?Source=News&Market=en-GB&Version=2.0&Query=web+design+uk");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HEADER, 0);
$data = curl_exec($ch);
curl_close($ch);
$doc = new SimpleXMLElement($data);
//print_r($doc);
$vals = array();
foreach ($doc->entry as $entry) {
    $vals[] = (string) $entry->title;
}
//join content nodes together for the word cloud
$vals = implode(' ', $vals);
echo($vals);

The titles are at rss/channel/item/title and not where your code looks for them, at rss/entry/title. Other than that, your way of getting the values is fine.
$rss = simplexml_load_file('http://api.bing.com/rss.aspx?Source=News&Market=en-GB&Version=2.0&Query=web+design+uk');
$titles = array();
foreach ($rss->channel->item as $item) {
    $titles[] = (string) $item->title;
}
//join content nodes together for the word cloud
$words = implode(' ', $titles);
echo $words;
A quick alternative, using XPath to get the titles, is:
$rss = simplexml_load_file('http://api.bing.com/rss.aspx?Source=News&Market=en-GB&Version=2.0&Query=web+design+uk');
$words = implode(' ', $rss->xpath('channel/item/title'));
echo $words;
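The channel/item/title path can be checked without a network request by loading a small inline feed. The XML below is an assumed sample that only mirrors the rss/channel/item/title layout; it is not real Bing output.

```php
<?php
// Assumed sample feed with the same rss/channel/item/title layout.
$xml = <<<XML
<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item><title>Web design trends</title></item>
    <item><title>UK agencies grow</title></item>
  </channel>
</rss>
XML;

$rss = simplexml_load_string($xml);

$titles = array();
foreach ($rss->channel->item as $item) {
    // Cast each title node to a plain string.
    $titles[] = (string) $item->title;
}

// Join the titles into one string for the word cloud.
$words = implode(' ', $titles);
echo $words;
```

Note that the channel's own title is not included, because the loop only visits the item elements.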
