Parsing html using php to an array - php

I have the below html
<p>text1</p>
<ul>
<li>list-a1</li>
<li>list-a2</li>
<li>list-a3</li>
</ul>
<p>text2</p>
<ul>
<li>list-b1</li>
<li>list-b2</li>
<li>list-b3</li>
</ul>
<p>text3</p>
Does anyone have an idea how to parse this HTML with PHP to get the output below as a nested array?
The first level is for the "p" tags
and the second for the "ul" tags, because every "p" tag is followed by a "ul" tag.
Array
(
[0] => Array
(
[value] => text1
(
[il] => list-a1
[il] => list-a2
[il] => list-a3
)
)
[1] => Array
(
[value] => text2
(
[il] => list-b1
[il] => list-b2
[il] => list-b3
)
)
)
I can't use a replace or strip all the tags, because I use
foreach ($doc->getElementsByTagName('p') as $link)
{
$dont = $link->textContent;
if (strpos($dont, 'document.') === false) {
$links2[] = array(
'value' => $link->textContent, );
}
$er=0;
foreach ($doc->getElementsByTagName('ul') as $link)
{
$dont2 = $link->nodeValue;
//echo $dont2;
if (strpos($dont2, 'favorisContribuer') === false) {
$links3[]= array(
'il' => $link->nodeValue, );
}
}
}

You could use the DOMDocument class (http://php.net/manual/en/class.domdocument.php)
You can see an example below.
<?php
$html = '
<p>text1</p>
<ul>
<li>list-a1</li>
<li>list-a2</li>
<li>list-a3</li>
</ul>
<p>text2</p>
<ul>
<li>list-b1</li>
<li>list-b2</li>
<li>list-b3</li>
</ul>
<p>text3</p>
';
$doc = new DOMDocument();
$doc->loadHTML($html);
$textContent = $doc->textContent;
$textContent = trim(preg_replace('/\t+/', '<br>', $textContent));
echo '
<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>
' . $textContent . '
</body>
</html>
';
?>
However, I would suggest using javascript to find the content and send it to php instead.
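For the nested array the question asks for, here is a minimal sketch using DOMXPath. It assumes each list belongs to the "p" directly before it, and it stores the items as a plain list under the 'il' key, since a PHP array cannot hold the same key three times. The variable names are only illustrative.

```php
<?php
$html = '<p>text1</p><ul><li>list-a1</li><li>list-a2</li><li>list-a3</li></ul>'
      . '<p>text2</p><ul><li>list-b1</li><li>list-b2</li><li>list-b3</li></ul>'
      . '<p>text3</p>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();

$xpath  = new DOMXPath($doc);
$result = [];

foreach ($xpath->query('//p') as $p) {
    // Take the next element sibling, but only if it is a <ul>.
    $ul = $xpath->query('following-sibling::*[1][self::ul]', $p)->item(0);
    if ($ul === null) {
        continue;                   // a <p> without a list, e.g. text3
    }

    $items = [];
    foreach ($xpath->query('./li', $ul) as $li) {
        $items[] = trim($li->textContent);
    }

    $result[] = ['value' => trim($p->textContent), 'il' => $items];
}

print_r($result);
```

Paragraphs that are not followed by a list are skipped here; drop the `continue` and append an empty 'il' list instead if you want to keep them.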

Related

PHP get image url from html-string using regular expression

I'm trying to get all image URLs from an HTML string with PHP,
both from img tags and from inline CSS (background-image).
<?php
$html = '
<div style="background-image : url(https://exampel.com/media/logo.svg);"></div>
<img src="https://exampel.com/media/my-photo.jpg" />
<div style="background-image:url('https://exampel.com/media/icon.png');"></div>
';
preg_match('/<img.+src=[\'"](?P<src>.+?)[\'"].*>|background-image[ ]?:[ ]?url\([ ]?[\']?["]?(.*?\.(?:png|jpg|jpeg|gif|svg))/i', $html, $image);
echo('<pre>'.print_r($image, true).'</pre>');
?>
The output from this is:
Array
(
[0] => background-image : url(https://exampel.com/media/logo.svg
[src] =>
[1] =>
[2] => https://exampel.com/media/logo.svg
)
Preferred output would be:
Array
(
[0] => https://exampel.com/media/logo.svg
[1] => https://exampel.com/media/my-photo.jpg
[2] => https://exampel.com/media/icon.png
)
I'm missing something here but I can't figure out what.
Use preg_match_all() and rearrange your result:
<?php
$html = <<<EOT
<div style="background-image : url(https://exampel.com/media/logo.svg);"></div>
<img src="https://exampel.com/media/my-photo.jpg" />
<div style="background-image:url('https://exampel.com/media/icon.png');"></div>
EOT;
preg_match_all(
'/<img.+src=[\'"](.+?)[\'"].*>|background-image ?: ?url\([\'" ]?(.*?\.(?:png|jpg|jpeg|gif|svg))/i',
$html,
$matches,
PREG_SET_ORDER
);
$image = [];
foreach ($matches as $set) {
unset($set[0]);
foreach ($set as $url) {
if ($url) {
$image[] = $url;
}
}
}
echo '<pre>' . print_r($image, true) . '</pre>' . PHP_EOL;
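As an alternative, here is a rough sketch that lets DOMDocument find the attributes and only uses a small regex on the style value. It is not part of the answer above; it assumes the markup is parseable by DOMDocument and that each style attribute holds at most one background-image URL.

```php
<?php
$html = <<<'EOT'
<div style="background-image : url(https://exampel.com/media/logo.svg);"></div>
<img src="https://exampel.com/media/my-photo.jpg" />
<div style="background-image:url('https://exampel.com/media/icon.png');"></div>
EOT;

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // tolerate imperfect markup quietly
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$urls  = [];

// Every <img> with a src, plus every element carrying a style attribute.
foreach ($xpath->query('//img[@src] | //*[@style]') as $node) {
    if ($node->nodeName === 'img') {
        $urls[] = $node->getAttribute('src');
        continue;
    }

    // Pull the URL out of background-image: url(...), quoted or not.
    $style = $node->getAttribute('style');
    if (preg_match('/background-image\s*:\s*url\(\s*[\'"]?([^\'")]+)/i', $style, $m)) {
        $urls[] = trim($m[1]);
    }
}

print_r($urls);
```

This way the regex never has to deal with tag syntax, only with the CSS value.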

php xquery parsing html

how to parse nested html tags like this structure:
<article class="tile">
<div class="tile-content">
ignore
<div class="tile-content__text tile-content__text--arrow-white">
<label class="label-date label-date--blue">01.12.2021</label>
<h4><a class="link-color-black" href="link-1">title-1</a></h4>
<p class="tile-content__paragraph tile-content__paragraph--gray pd-ver-10">​
content-1
</p>
</div>
more
</div>
</article>
<article class="tile">
<div class="tile-content">
ignore
<div class="tile-content__text tile-content__text--arrow-white">
<label class="label-date label-date--blue">02.12.2021</label>
<h4><a class="link-color-black" href="link-2">title-2</a></h4>
<p class="tile-content__paragraph tile-content__paragraph--gray pd-ver-10">​
content-2
</p>
</div>
more
</div>
</article>
to array like:
$parsedArray = [
0 =>
['title' => 'title-1',
'link' => 'link-1',
'date' => '2021-12-01',
'content' => 'content-1'],
1 =>
['title' => 'title-2',
'link' => 'link-2',
'date' => '2021-12-02',
'content' => 'content-2'],
....]
I use XPath like below, but this removes all the tags, and afterwards I only have the imploded text from all tags. I need to extract the info from each tag. Any tips?
$dom = new DOMDocument();
$dom->loadHTML($html['html']);
$xpath = new DOMXPath($dom);
$nodelist = $xpath->query("//article[contains(@class, 'tile')]");
foreach ($nodelist as $n) {
echo '<pre>';
var_dump($n);
echo '</pre>';
}
var_dump won't parse the DOM :)
You just need to re-query for your elements within the tile, then assign them to the array.
Assign a working item array to define the structure if it matters, else just build up the result as you go.
<?php
$str = '<article class="tile">
<div class="tile-content">
ignore
<div class="tile-content__text tile-content__text--arrow-white">
<label class="label-date label-date--blue">02.12.2021</label>
<h4><a class="link-color-black" href="link-2">title-2</a></h4>
<p class="tile-content__paragraph tile-content__paragraph--gray pd-ver-10">
content-2
</p>
</div>
more
</div>
</article>';
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHtml($str);
libxml_clear_errors();
$xpath = new DOMXPath($dom);
$result = [];
foreach ($xpath->query("//article[contains(@class, 'tile')]") as $tile) {
// define item structure
$item = [
'title' => '',
'link' => '',
'date' => '',
'content' => ''
];
// find date
$query = $xpath->query(".//label[contains(@class, 'label-date')][1]", $tile);
if (count($query)) {
$item['date'] = $query[0]->nodeValue;
}
// find link/title
$query = $xpath->query(".//h4/a[1]", $tile);
if (count($query)) {
$item['link'] = $query[0]->getAttribute('href');
$item['title'] = $query[0]->nodeValue;
}
// find content
$query = $xpath->query(".//p[contains(@class, 'tile-content__paragraph')][1]", $tile);
if (count($query)) {
$item['content'] = $query[0]->nodeValue;
}
// assign
$result[] = $item;
// cleanup
unset($item, $query);
}
print_r($result);
Output:
Array
(
[0] => Array
(
[title] => title-2
[link] => link-2
[date] => 02.12.2021
[content] =>
content-2
)
)
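The output above still has the date as 02.12.2021 and the content padded with whitespace, while the question asked for 2021-12-01 style dates. A small sketch of two helpers that could be applied to the nodeValue strings; the function names are made up for illustration, and stripping zero-width spaces is only a guess at what the page may contain.

```php
<?php
// Turn "02.12.2021" (day.month.year) into "2021-12-02"; returns '' if it does not parse.
function normalizeTileDate(string $raw): string
{
    $date = DateTimeImmutable::createFromFormat('!d.m.Y', trim($raw));
    return $date === false ? '' : $date->format('Y-m-d');
}

// Remove zero-width spaces (U+200B) and collapse runs of whitespace.
function cleanText(string $raw): string
{
    $raw = str_replace("\u{200B}", '', $raw);
    return trim(preg_replace('/\s+/u', ' ', $raw));
}

echo normalizeTileDate('02.12.2021'), PHP_EOL;
echo cleanText("\n    content-2\n  "), PHP_EOL;
```

Inside the loop this would be used as, for example, `$item['date'] = normalizeTileDate($query[0]->nodeValue);`.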

PHP - Read three lines of remote html

I need to read three lines of a remote page using PHP. I'm using code from Jose Vega found here to read the title:
<?php
function get_title($url){
$str = file_get_contents($url);
if(strlen($str)>0){
$str = trim(preg_replace('/\s+/', ' ', $str)); // supports line breaks inside <title>
preg_match("/\<title\>(.*)\<\/title\>/i",$str,$title); // ignore case
return $title[1];
}
}
//Example:
echo get_title("http://www.washingtontimes.com/");
?>
When I plug in a URL, I want to extract the following information:
<title>TITLE HERE</title>
<meta property="end_date" content="Tue Aug 28 2018 03:59:59 GMT+0000 (UTC)" />
<meta property="start_date" content="Mon Aug 06 2018 04:00:00 GMT+0000 (UTC)" />
Outputs: $title, $start, $end
Displayed as a title with a link to URL, followed by Starts: ____, Ends: ____, preferably converted to simple dates
Bonus Question: How can I efficiently parse dozens of sites using this script? The sites are all ascending numerically. index.php?id=103 index.php?id=104 index.php?id=105
Displaying:
ID Title Start End
#103 TitleWithLink StartDate EndDate
#104 TitleWithLink StartDate EndDate
#105 TitleWithLink StartDate EndDate
Based on your question, I guess you want to read metadata. Part of the code I suggest below is taken from http://php.net/manual/en/function.get-meta-tags.php.
It works fine for this SO page, so it should work for yours too. Of course, you will need to adapt it a little to get your task done.
function getUrlData($url, $raw=false) // $raw - enable for raw display
{
$result = false;
$contents = getUrlContents($url);
if (isset($contents) && is_string($contents))
{
$title = null;
$metaTags = null;
$metaProperties = null;
preg_match('/<title>([^>]*)<\/title>/si', $contents, $match );
if (isset($match) && is_array($match) && count($match) > 0)
{
$title = strip_tags($match[1]);
}
preg_match_all('/<[\s]*meta[\s]*(name|property)="?' . '([^>"]*)"?[\s]*' . 'content="?([^>"]*)"?[\s]*[\/]?[\s]*>/si', $contents, $match);
if (isset($match) && is_array($match) && count($match) == 4)
{
$originals = $match[0];
$names = $match[2];
$values = $match[3];
if (count($originals) == count($names) && count($names) == count($values))
{
$metaTags = array();
$metaProperties = $metaTags;
if ($raw) {
if (version_compare(PHP_VERSION, '5.4.0') == -1)
$flags = ENT_COMPAT;
else
$flags = ENT_COMPAT | ENT_HTML401;
}
for ($i=0, $limiti=count($names); $i < $limiti; $i++)
{
if ($match[1][$i] == 'name')
$meta_type = 'metaTags';
else
$meta_type = 'metaProperties';
if ($raw)
${$meta_type}[$names[$i]] = array (
'html' => htmlentities($originals[$i], $flags, 'UTF-8'),
'value' => $values[$i]
);
else
${$meta_type}[$names[$i]] = array (
'html' => $originals[$i],
'value' => $values[$i]
);
}
}
}
$result = array (
'title' => $title,
'metaTags' => $metaTags,
'metaProperties' => $metaProperties,
);
}
return $result;
}
function getUrlContents($url, $maximumRedirections = null, $currentRedirection = 0)
{
$result = false;
$contents = @file_get_contents($url);
// Check if we need to go somewhere else
if (isset($contents) && is_string($contents))
{
preg_match_all('/<[\s]*meta[\s]*http-equiv="?REFRESH"?' . '[\s]*content="?[0-9]*;[\s]*URL[\s]*=[\s]*([^>"]*)"?' . '[\s]*[\/]?[\s]*>/si', $contents, $match);
if (isset($match) && is_array($match) && count($match) == 2 && count($match[1]) == 1)
{
if (!isset($maximumRedirections) || $currentRedirection < $maximumRedirections)
{
return getUrlContents($match[1][0], $maximumRedirections, ++$currentRedirection);
}
$result = false;
}
else
{
$result = $contents;
}
}
return $result;
}
$result = getUrlData('https://stackoverflow.com/questions/51939042/php-read-three-lines-of-remote-html', true);
the output of print_r($result); is:
Array
(
[title] => file get contents - PHP - Read three lines of remote html - Stack Overflow
[metaTags] => Array
(
[viewport] => Array
(
[html] => <meta name="viewport" content="width=device-width, height=device-height, initial-scale=1.0, minimum-scale=1.0">
[value] => width=device-width, height=device-height, initial-scale=1.0, minimum-scale=1.0
)
[twitter:card] => Array
(
[html] => <meta name="twitter:card" content="summary"/>
[value] => summary
)
[twitter:domain] => Array
(
[html] => <meta name="twitter:domain" content="stackoverflow.com"/>
[value] => stackoverflow.com
)
[twitter:app:country] => Array
(
[html] => <meta name="twitter:app:country" content="US" />
[value] => US
)
[twitter:app:name:iphone] => Array
(
[html] => <meta name="twitter:app:name:iphone" content="Stack Exchange iOS" />
[value] => Stack Exchange iOS
)
[twitter:app:id:iphone] => Array
(
[html] => <meta name="twitter:app:id:iphone" content="871299723" />
[value] => 871299723
)
[twitter:app:url:iphone] => Array
(
[html] => <meta name="twitter:app:url:iphone" content="se-zaphod://stackoverflow.com/questions/51939042/php-read-three-lines-of-remote-html" />
[value] => se-zaphod://stackoverflow.com/questions/51939042/php-read-three-lines-of-remote-html
)
[twitter:app:name:ipad] => Array
(
[html] => <meta name="twitter:app:name:ipad" content="Stack Exchange iOS" />
[value] => Stack Exchange iOS
)
[twitter:app:id:ipad] => Array
(
[html] => <meta name="twitter:app:id:ipad" content="871299723" />
[value] => 871299723
)
[twitter:app:url:ipad] => Array
(
[html] => <meta name="twitter:app:url:ipad" content="se-zaphod://stackoverflow.com/questions/51939042/php-read-three-lines-of-remote-html" />
[value] => se-zaphod://stackoverflow.com/questions/51939042/php-read-three-lines-of-remote-html
)
[twitter:app:name:googleplay] => Array
(
[html] => <meta name="twitter:app:name:googleplay" content="Stack Exchange Android">
[value] => Stack Exchange Android
)
[twitter:app:url:googleplay] => Array
(
[html] => <meta name="twitter:app:url:googleplay" content="http://stackoverflow.com/questions/51939042/php-read-three-lines-of-remote-html">
[value] => http://stackoverflow.com/questions/51939042/php-read-three-lines-of-remote-html
)
[twitter:app:id:googleplay] => Array
(
[html] => <meta name="twitter:app:id:googleplay" content="com.stackexchange.marvin">
[value] => com.stackexchange.marvin
)
)
[metaProperties] => Array
(
[og:url] => Array
(
[html] => <meta property="og:url" content="https://stackoverflow.com/questions/51939042/php-read-three-lines-of-remote-html"/>
[value] => https://stackoverflow.com/questions/51939042/php-read-three-lines-of-remote-html
)
[og:site_name] => Array
(
[html] => <meta property="og:site_name" content="Stack Overflow" />
[value] => Stack Overflow
)
)
)
Then to actually use it to achieve your purpose:
How can I efficiently parse dozens of sites using this script? The
sites are all ascending numerically. index.php?id=103 index.php?id=104
index.php?id=105
You need to:
- first create an array containing your URLs:
<!DOCTYPE html>
<html>
<head>
</head>
<body>
<h2>HTML Table</h2>
<table>
<tr>
<th >Id</th>
<th >Title</th>
<th >start_date</th>
<th>end_date</th>
</tr>
<?php
$urls=array(103=>'index.php?id=103',104=> 'index.php?id=104', 105=>'index.php?id=105');
- then loop through this array:
foreach($urls as $id=>$url):
- on each iteration, call getUrlData() as shown:
$result=getUrlData($url, true);
- then retrieve the needed information, e.g.:
?><tr>
<td><?php echo $id; ?></td>
<td><?php echo $result['title']; ?></td>
<td><?php echo $result['metaProperties']['start_date']['value']; ?></td>
<td><?php echo $result['metaProperties']['end_date']['value']; ?></td>
</tr>
to build each row and cell.
At the end of the process you will have your expected table:
<?php endforeach; ?>
</table></body>
</html>
Well, you could solve your issue with the DomDocument class.
$doc = new \DomDocument();
$title = $start = $end = '';
if ($doc->loadHTMLFile($url)) {
// Get the title
$titles = $doc->getElementsByTagName('title');
if ($titles->length > 0) {
$title = $titles->item(0)->nodeValue;
}
// get meta elements
$xpath = new \DOMXPath($doc);
$ends = $xpath->query('//meta[@property="end_date"]');
if ($ends->length > 0) {
$end = $ends->item(0)->getAttribute('content');
}
$starts = $xpath->query('//meta[@property="start_date"]');
if ($starts->length > 0) {
$start = $starts->item(0)->getAttribute('content');
}
var_dump($title, $start, $end);
}
With the getElementsByTagName method of the DomDocument class you can find the title element in the whole HTML of a given URL. With the DOMXPath class you can retrieve the specific meta data you want. You don't need much code to find specific information in an HTML string.
The code shown above is not tested.
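For the "simple dates" and the numbered pages from the question, a small sketch follows. The base URL is a placeholder and the helper name simpleDate is invented here; the format string assumes the meta content always looks like the two examples in the question.

```php
<?php
// Convert "Tue Aug 28 2018 03:59:59 GMT+0000 (UTC)" into "2018-08-28".
function simpleDate(string $raw): string
{
    // Drop the trailing "(UTC)" label; the numeric offset already carries the zone.
    $raw  = trim(preg_replace('/\s*\([^)]*\)\s*$/', '', $raw));
    $date = DateTimeImmutable::createFromFormat('D M d Y H:i:s \G\M\TO', $raw);
    return $date === false ? '' : $date->format('Y-m-d');
}

// Hypothetical base URL; the ids ascend numerically as in the question.
$base = 'https://example.com/index.php?id=';
$urls = [];
foreach (range(103, 105) as $id) {
    $urls[$id] = $base . $id;
}

echo simpleDate('Tue Aug 28 2018 03:59:59 GMT+0000 (UTC)'), PHP_EOL;
echo simpleDate('Mon Aug 06 2018 04:00:00 GMT+0000 (UTC)'), PHP_EOL;
print_r($urls);
```

Each entry of $urls could then be passed to loadHTMLFile() in a loop, with simpleDate() applied to the start and end values before printing a table row.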

CURLOPT_RETURNTRANSFER returns HTML in string

I'm trying to parse HTML using cURL with DOMDocument or XPath, but with CURLOPT_RETURNTRANSFER the URL's HTML always comes back as a string, which seems to make it invalid HTML for parsing.
Returned output:
string(102736) "<!DOCTYPE html>
<html itemscope itemtype="http://schema.org/QAPage" class="html__responsive">
<head>
<title>html - PHP outputting text WITHOUT echo/print? - Stack Overflow</title>
<link rel="shortcut icon" href="https://cdn.sstatic.net/Sites/stackoverflow/img/favicon.ico?v=4f32ecc8f43d">
<link rel="apple-touch-icon image_src" href="https://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon.png?v=c78bd457575a">
<link rel="search" type="application/opensearchdescription+xml" title="Stack Overflow" href="/opensearch.xml">
<meta name="viewport" content="width=device-width, height=device-height, initial-scale=1.0, minimum-scale=1.0">"
PHP snippet used to see the output:
$cc = $http->get($url);
var_dump($cc);
CURL library used: https://github.com/seikan/HTTP/blob/master/class.HTTP.php
When I remove CURLOPT_RETURNTRANSFER I see the HTML without the string(102736), but it echoes the page even though I didn't request it (reference: curl_exec printing results when I don't want to)
Here is the PHP snippet I used to parse the HTML:
$cc = $http->get($url);
$doc = new \DOMDocument();
$doc->loadHTML($cc);
// all links in document
$links = [];
$arr = $doc->getElementsByTagName("a"); // DOMNodeList Object
foreach($arr as $item) { // DOMElement Object
$href = $item->getAttribute("href");
$text = trim(preg_replace("/[\r\n]+/", " ", $item->nodeValue));
$links[] = [
'href' => $href,
'text' => $text
];
}
Any idea?
Check the return value -
print_r($cc);
you will probably find that the output is an array (if the code ran successfully). From the library source, the return of get() is...
return [
'header' => $headers,
'body' => substr($response, $size),
];
So you will need to change the load line to be...
$doc->loadHTML($cc['body']);
Update:
as an example of the above and using this question as the page to work with...
$cc = $http->get("https://stackoverflow.com/questions/51319473/curlopt-returntransfer-returns-html-in-string/51319585?noredirect=1#comment89619183_51319585");
$doc = new \DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($cc['body']);
// all links in document
$links = [];
$arr = $doc->getElementsByTagName("a"); // DOMNodeList Object
foreach($arr as $item) { // DOMElement Object
$href = $item->getAttribute("href");
$text = trim(preg_replace("/[\r\n]+/", " ", $item->nodeValue));
$links[] = [
'href' => $href,
'text' => $text
];
}
print_r($links);
Outputs...
Array
(
[0] => Array
(
[href] => #
[text] =>
)
[1] => Array
(
[href] => https://stackoverflow.com
[text] => Stack Overflow
)
[2] => Array
(
[href] => #
[text] =>
)
[3] => Array
(
[href] => https://stackexchange.com/users/?tab=inbox
...
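A side note on the string(102736) "..." in the question: that wrapper is printed by var_dump() itself and is not part of the returned data, so a string body can be handed to loadHTML() as-is. A tiny sketch with a made-up HTML string standing in for the response body:

```php
<?php
// Stand-in for $cc['body']; the real value would come from the HTTP call.
$html = '<!DOCTYPE html><html><head><title>Demo</title></head>'
      . '<body><a href="/x">link</a></body></html>';

// Capture what var_dump() prints: it adds the string(N) "..." wrapper.
ob_start();
var_dump($html);
$dump = ob_get_clean();

// The variable itself is untouched and parses normally.
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$href = $doc->getElementsByTagName('a')->item(0)->getAttribute('href');
echo $href, PHP_EOL;
```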

How to make group (if specific field are same text) in loop?

I'm trying to render data in a loop so that rows whose extend_tag field has the same text are grouped in one container.
I only know how to hard-code it as below, looping over the data once per known extend_tag group, but the extend_tag values are actually unknown (they might be tag_ followed by any digits). Any idea how to solve it?
data
[tag] => Array (
[0] => Array (
[id] => 1
[extend_tag] => tag_0
)
[1] => Array (
[id] => 2
[extend_tag] => tag_11
)
[2] => Array (
[id] => 3
[extend_tag] => tag_4
)
)
<ul class="container">
<?php foreach($rows['tag'] as $eachRowsTag) { ?>
<?php if ($eachRowsTag['extend_tag'] == 'tag_0') { ?>
<li>><?php echo $eachRowsTag['id']; ?></li>
<?php } ?>
<?php } ?>
</ul>
<ul class="container">
<?php foreach($rows['tag'] as $eachRowsTag) { ?>
<?php if ($eachRowsTag['extend_tag'] == 'tag_1') { ?>
<li>><?php echo $eachRowsTag['id']; ?></li>
<?php } ?>
<?php } ?>
</ul>
...
Why not group them first, then iterate over the resulting array. Something like the following.
foreach ($tags as $tag) {
$grouped[$tag['extend_tag']][] = $tag;
}
// Now $grouped is something along the lines of:
// [
// 'tag_0' => [
// [ 'id' => 1, 'extend_tag' => 'tag_0'],
// ..
// ],
// ..
// ]
foreach($grouped as $extend_tag => $tags) {
echo "All tags in $extend_tag.";
foreach($tags as $tag) {
echo $tag['id'];
}
}
// For something like:
// All tags in tag_0.
// 1
// 4
// All tags in tag_1.
// ..
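Applied to the markup from the question, the same grouping could be sketched like this, with one ul per extend_tag value. The fourth sample row is added here only so that one group has two entries.

```php
<?php
$rows = ['tag' => [
    ['id' => 1, 'extend_tag' => 'tag_0'],
    ['id' => 2, 'extend_tag' => 'tag_11'],
    ['id' => 3, 'extend_tag' => 'tag_4'],
    ['id' => 4, 'extend_tag' => 'tag_0'],   // extra sample row, not in the question
]];

// Group the rows by their extend_tag value.
$grouped = [];
foreach ($rows['tag'] as $tag) {
    $grouped[$tag['extend_tag']][] = $tag;
}

// Build one <ul class="container"> per group.
$out = '';
foreach ($grouped as $extendTag => $tags) {
    $out .= '<ul class="container">' . PHP_EOL;
    foreach ($tags as $tag) {
        $out .= '<li>' . htmlspecialchars((string) $tag['id']) . '</li>' . PHP_EOL;
    }
    $out .= '</ul>' . PHP_EOL;
}

echo $out;
```

Groups come out in the order their tag first appears in the data; use ksort($grouped) before the second loop if you want them ordered by key.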
