creating multidimensional array with two arrays - php

I am indexing web pages. The code scans a given web page for its links and for the page's title. The links and the title are stored in two different arrays. I would like to create a multidimensional array whose print_r output shows the word Array, followed by each link, followed by the title that belongs to that link. I have the code, I just don't know how to put it together.
require_once('simplehtmldom_1_5/simple_html_dom.php');
require_once('url_to_absolute/url_to_absolute.php');
//links
$links = Array();
$URL = 'http://www.youtube.com'; // change it for urls to grab
// grabs the urls from URL
$file = file_get_html($URL);
foreach ($file->find('a') as $theelement) {
$links[] = url_to_absolute($URL, $theelement->href);
}
print_r($links);
//titles
$titles = Array();
$str = file_get_contents($URL);
$titles[] = preg_match_all( "/\<title\>(.*)\<\/title\>/", $str, $title );
print_r($title[1]);

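You should be able to do this, assuming there are the same number of links as titles and each title is stored under the same array key as its link. Note that in the question's code the matched titles end up in $title[1], because preg_match_all() returns the number of matches rather than the matches themselves.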
$newArray = array();
foreach ($links as $key=>$val)
{
$newArray[$key]['link'] = $val;
$newArray[$key]['title'] = $titles[$key];
}
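The pairing idea above can be tried in isolation with a minimal sketch. The sample URLs and titles below are hypothetical placeholders, not data from the question, and the sketch assumes both arrays share the same keys; a missing title is stored as null instead of raising a notice.

```php
<?php
// Hypothetical sample data; in the question these come from the scraper.
$links  = ['http://www.example.com/a', 'http://www.example.com/b'];
$titles = ['Page A', 'Page B'];

$newArray = [];
foreach ($links as $key => $link) {
    $newArray[$key] = [
        'link'  => $link,
        // fall back to null when there is no title under this key
        'title' => $titles[$key] ?? null,
    ];
}

print_r($newArray);
```

Each entry of $newArray then holds one link together with its title.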

It is not clear what you want.
Anyway, here is how I would rewrite your code in a more organized way:
require_once('simplehtmldom_1_5/simple_html_dom.php');
require_once('url_to_absolute/url_to_absolute.php');
$info = array();
$urls = array(
'http://www.youtube.com',
'http://www.google.com.br'
);
foreach ($urls as $url)
{
$str = file_get_contents($url);
$html = str_get_html($str);
$title = strval($html->find('title', 0)->plaintext);
$links = array();
foreach($html->find('a') as $anchor)
{
$links[] = url_to_absolute($url, strval($anchor->href));
}
$links = array_unique($links);
$info[$url] = array(
'title' => $title,
'links' => $links
);
}
print_r($info);

Related

Check every URL in string to remove links of certain sites

I want to remove URLs of certain sites within a string
I used this:
<?php
$URLContent = '<p>Google</p><p>AnotherSite</p>';
$LinksToRemove = array('google.com', 'yahoo.com', 'msn.com');
$LinksToCheck = in_array('google.com' , $LinksToRemove);
if (strpos($URLContent, $LinksToCheck) !== 0) {
$URLContent = preg_replace('#<a.*?>([^>]*)</a>#i', '$1', $URLContent);
}
echo $URLContent;
?>
In this example, I want to remove the links to google.com, yahoo.com and msn.com only if any of them is found in the string $URLContent, but keep all other links.
The result of the previous code is:
<p>Google</p><p>AnotherSite</p>
but I want it to be:
<p>Google</p><p>AnotherSite</p>
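One solution would be to explode your $URLContent into paragraphs and compare each one against every value in $LinksToRemove.
It could look like this (str_contains() requires PHP 8):
<?php
$URLContent = '<p>Google</p><p>AnotherSite</p>';
$urlList = explode('</p>', $URLContent);
$LinksToRemove = array('google.com', 'yahoo.com', 'msn.com');
$urlFormat = [];
foreach ($urlList as $url) {
    if ($url === '') {
        continue; // explode() leaves an empty piece after the last </p>
    }
    $url .= '</p>'; // restore the delimiter that explode() removed
    foreach ($LinksToRemove as $link) {
        if (str_contains($url, $link)) {
            // replace the whole paragraph with the plain site name
            $url = '<p>' . ucfirst(str_replace('.com', '', $link)) . '</p>';
            break;
        }
    }
    $urlFormat[] = $url;
}
$result = implode('', $urlFormat);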
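As an alternative, here is a minimal sketch that unwraps only the anchors whose href contains one of the listed domains, using preg_replace_callback(). The sample HTML and its URLs are assumptions made for illustration (the question's original markup is not shown here), and the pattern only handles simple double-quoted href attributes.

```php
<?php
// Hypothetical sample input; the URLs are assumptions for illustration.
$content = '<p><a href="https://www.google.com/">Google</a></p>'
         . '<p><a href="https://www.example.org/">AnotherSite</a></p>';
$domainsToRemove = ['google.com', 'yahoo.com', 'msn.com'];

$result = preg_replace_callback(
    '#<a\s[^>]*href="([^"]*)"[^>]*>(.*?)</a>#is',
    function (array $m) use ($domainsToRemove) {
        foreach ($domainsToRemove as $domain) {
            if (stripos($m[1], $domain) !== false) {
                return $m[2]; // listed domain: keep only the link text
            }
        }
        return $m[0]; // any other link stays untouched
    },
    $content
);

echo $result;
```

For anything beyond simple markup, a DOM parser is likely a safer choice than a regular expression.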

From array to JSON

I am writing an article parser and I need to put all the parsed data into JSON. I tried to collect the data in an array and then convert it to JSON, but I ran into trouble. I get JSON like this:
[{"title":"title1"}][{"title":"title2"}][{"title":"title3"}]
But I want it like this:
[{"title":"title1"},{"title":"title2"},{"title":"title3"}]
How can I do this?
<?
foreach ($content_prev as $el) {
$pq = pq($el);
$date = $pq->find('time')->html();
$title = $pq->find('h3 a')->html();
$link = $pq->find('h3 a')->attr('href');
$data_link = file_get_contents($link);
$document_с = phpQuery::newDocument($data_link);
$content = $document_с->find('.td-post-content');
$arr = array (
array(
"title" => $title
),
);
echo json_encode($arr, JSON_UNESCAPED_UNICODE);
}
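Remove the extra nested array in $arr. Also, json_encode() is called inside the loop, so every iteration prints its own JSON array. Append each item to $arr and encode once, after the loop.
Use the version below.
<?php
$arr = array();
foreach ($content_prev as $el) {
    $pq = pq($el);
    $date = $pq->find('time')->html();
    $title = $pq->find('h3 a')->html();
    $link = $pq->find('h3 a')->attr('href');
    $data_link = file_get_contents($link);
    $document_c = phpQuery::newDocument($data_link);
    $content = $document_c->find('.td-post-content');
    // append one item per article
    $arr[] = array(
        "title" => $title
    );
}
// encode the complete list once
echo json_encode($arr, JSON_UNESCAPED_UNICODE);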
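The difference is easier to see without the parsing code. The following is a minimal sketch in which a hypothetical $titles array stands in for the titles that phpQuery would extract; it collects the items first and encodes them once.

```php
<?php
// Hypothetical stand-in for the titles extracted from the parsed articles.
$titles = ['title1', 'title2', 'title3'];

$arr = [];
foreach ($titles as $title) {
    $arr[] = ['title' => $title]; // one object per article
}

// A single json_encode() call yields one JSON array.
$json = json_encode($arr, JSON_UNESCAPED_UNICODE);
echo $json, PHP_EOL;
```

Calling json_encode() inside the loop instead would print three separate one-element arrays back to back, which is the output described in the question.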

Array filter in PHP

I am using Simple HTML DOM to parse an HTML file.
I have a dynamic array called links2. It can be empty, or it can have four or more elements, depending on the case.
<?php
include('simple_html_dom.php');
$url = 'http://www.example.com/';
$html = file_get_html($url);
$doc = new DOMDocument();
#$doc->loadHTML($html);
//////////////////////////////////////////////////////////////////////////////
foreach ($doc->getElementsByTagName('p') as $link)
{
$intro2 = $link->nodeValue;
$links2[] = array(
'value' => $link->textContent,
);
$su=count($links2);
}
$word = 'document.write(';
Assume that two elements of the array links2 contain $word. When I try to filter links2 by removing the matching elements with
unset( $links2[array_search($word, $links2 )] );
print_r($links2);
the filter removes only one element, and array_diff doesn't solve the problem either. Any suggestions?
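Solved by adding an exception that skips the unwanted paragraphs while the array is built:
foreach ($doc->getElementsByTagName('p') as $link)
{
    $dont = $link->textContent;
    // keep only paragraphs that do not contain the unwanted text
    if (strpos($dont, 'document') === false) {
        $links2[] = array(
            'value' => $link->textContent,
        );
    }
    $su=count($links2);
}
echo $su;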
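If the array has already been built, array_filter() can remove every matching element in one pass. array_search() does not fit here: it compares whole elements for equality and returns at most one key, while each element of links2 is itself an array with a 'value' key. The sketch below uses hypothetical sample rows in place of the parsed paragraphs.

```php
<?php
// Hypothetical sample rows standing in for the parsed <p> elements.
$links2 = [
    ['value' => 'Intro text'],
    ['value' => 'document.write("ad 1");'],
    ['value' => 'More text'],
    ['value' => 'document.write("ad 2");'],
];
$word = 'document.write(';

// Keep only rows whose 'value' does not contain $word, then reindex.
$filtered = array_values(array_filter(
    $links2,
    function (array $row) use ($word) {
        return strpos($row['value'], $word) === false;
    }
));

print_r($filtered);
```

array_values() is only there to renumber the keys from zero after the filtering.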

Nested foreach loops

I know the solution is simple, but it keeps slipping my mind. When I parse the page with this code and print the $links array, all of the href parts are correct, yet the img part only contains the last src found on the page.
$doc = new DOMDocument();
$doc->loadHTML($html);
$links = array();
$images = $doc->getElementsByTagName("img");
$arr = $doc->getElementsByTagName("a");
foreach($arr as $item) {
// get links
$href = $item->getAttribute("href");
// get images.
foreach ($images as $item) {
$img = $item->getAttribute('src');
}
$links[] = array(
'href' => $href,
'img' => $img
);
}
print_r(array_values($links));
The foreach statement for the images should build an array, so that the final array ($links) is multidimensional, with $img as the nested array.
Check if this works for you:
$doc = new DOMDocument();
$doc->loadHTML($html);
$links = array();
$images = $doc->getElementsByTagName("img");
$arr = $doc->getElementsByTagName("a");
foreach($arr as $item) {
// get links
$href = $item->getAttribute("href");
// get images.
foreach ($images as $item) {
$img = $item->getAttribute('src');
// storing the image src
$links[] = array(
'img' => $img
);
}
$links[] = array(
'href' => $href
);
}
print_r(array_values($links));
You use the duplicate variable $item in the inner foreach, so it overwrites the loop variable of the outer foreach.
Try this version without the inner foreach. It assumes that the n-th link belongs to the n-th image:
$doc = new DOMDocument();
$doc->loadHTML($html);
$links = array();
$images = $doc->getElementsByTagName("img");
$arr = $doc->getElementsByTagName("a");
foreach ($arr as $key => $item) {
    // get links
    $href = $item->getAttribute("href");
    // get the image at the same position, if there is one
    $image = $images->item($key);
    $img = $image ? $image->getAttribute('src') : null;
    $links[] = array(
        'href' => $href,
        'img' => $img
    );
}
print_r(array_values($links));
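If the goal is a nested array of images per link, as the question describes, one option is to look for img elements inside each anchor instead of across the whole page. This is a minimal sketch under that assumption; the sample markup is hypothetical, and images that are not wrapped in an anchor are ignored.

```php
<?php
// Hypothetical sample markup: images nested inside their links.
$html = '<html><body>'
      . '<a href="/one"><img src="a.png"><img src="b.png"></a>'
      . '<a href="/two"><img src="c.png"></a>'
      . '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$links = array();
foreach ($doc->getElementsByTagName('a') as $anchor) {
    // collect every image found inside this particular anchor
    $srcs = array();
    foreach ($anchor->getElementsByTagName('img') as $image) {
        $srcs[] = $image->getAttribute('src');
    }
    $links[] = array(
        'href' => $anchor->getAttribute('href'),
        'img'  => $srcs,
    );
}

print_r($links);
```

Using distinct loop variables ($anchor and $image) also avoids the overwriting problem caused by reusing $item.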

php get bing rss feed titles into one var

With the code below I am trying to fetch a Bing RSS news feed, collect all the titles in an array, and then implode them into a single string, so that another piece of code can build a word cloud from it. So far it fetches the RSS feed, and print_r($doc), if you uncomment it, displays the SimpleXML object. However, my foreach loop that collects the titles into the array doesn't seem to work, and I can't see where the error is. Thanks in advance.
$ch = curl_init("http://api.bing.com/rss.aspx?Source=News&Market=en-GB&Version=2.0&Query=web+design+uk");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HEADER, 0);
$data = curl_exec($ch);
curl_close($ch);
$doc = new SimpleXMLElement($data);
//print_r($doc);
$vals = array();
foreach ($doc->entry as $entry) {
$vals[] = (string) $entry->title;
}
//join content nodes together for the word cloud
$vals = implode(' ', $vals);
echo($vals);
The titles are at rss/channel/item/title and not where your code looks for them, at rss/entry/title. Other than that, your way of getting the values is fine.
$rss = simplexml_load_file('http://api.bing.com/rss.aspx?Source=News&Market=en-GB&Version=2.0&Query=web+design+uk');
$titles = array();
foreach ($rss->channel->item as $item) {
$titles[] = (string) $item->title;
}
//join content nodes together for the word cloud
$words = implode(' ', $titles);
echo $words;
A quick alternative, using XPath to get the titles, is:
$rss = simplexml_load_file('http://api.bing.com/rss.aspx?Source=News&Market=en-GB&Version=2.0&Query=web+design+uk');
$words = implode(' ', $rss->xpath('channel/item/title'));
echo $words;
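The path rss/channel/item/title can be checked without a network request. The sketch below parses a small hand-written feed with simplexml_load_string(); the feed content is a hypothetical stand-in for the Bing response, kept only to show the element structure.

```php
<?php
// Hypothetical miniature RSS 2.0 feed standing in for the Bing response.
$xml = <<<XML
<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item><title>Web design trends</title></item>
    <item><title>UK agencies hiring</title></item>
  </channel>
</rss>
XML;

$rss = simplexml_load_string($xml);

$titles = array();
foreach ($rss->channel->item as $item) {
    $titles[] = (string) $item->title; // cast the element to plain text
}

// join the titles into one string for the word cloud
$words = implode(' ', $titles);
echo $words, PHP_EOL;
```

The channel's own title is not included, because the loop only visits the item elements.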
