Extract text and put into array with PHP - php

I have the following string and need to extract the text inside the divs (EDITOR'S PREFACE, MORE CONTENT, etc.) and put it into an array with PHP. How could I do this?
Thanks in advance.
<div class='classit'><a href='site.php?site=1&filename=aname4'>EDITOR'S PREFACE</a></div>
<div class='classit'><a href='site.php?site=4&filename=aname3'>MORE CONTENT</a></div>
<div class='classit'><a href='site.php?site=3&filename=aname4'>LAST LINE</a></div>

Use Simple HTML DOM (include its simple_html_dom.php file first):
$html = <<<HTML
<div class='classit'><a href='site.php?site=1&filename=aname4'>EDITOR'S PREFACE</a></div>
<div class='classit'><a href='site.php?site=4&filename=aname3'>MORE CONTENT</a></div>
<div class='classit'><a href='site.php?site=3&filename=aname4'>LAST LINE</a></div>
HTML;
$src = str_get_html($html);
$elem = $src->find("div.classit a");
foreach ($elem as $link) {
$links[] = $link->plaintext;
}
print_r($links);

You could use PHP's own DOM extension
$string = '<div><a>Elem 1</a></div><div><a>Elem 2</a></div>...etc';
$dom = new DOMDocument();
$dom->loadHTML($string);
$elements = $dom->getElementsByTagName('a');
$textElements = array();
foreach($elements as $node) {
$textElements[] = $node->nodeValue;
}
If you want to load a larger HTML extract, you could use DOMXPath to query the DOMDocument in order to just get the elements you want.
$xPathObj = new DOMXPath($dom);
$elements = $xPathObj->query("//div[@class='classit']/a");
Edit
DOMNodeList supports foreach, so I've changed for($i = 0; $i < $elements->length; $i++) {$elements->item($i)->nodeValue;} to foreach($elements as $node) {$node->nodeValue}
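Putting the two snippets above together, a minimal sketch that fills an array with the link texts from the question's markup might look like the following. The variable names ($html, $texts) are only illustrative, and the libxml_use_internal_errors() calls are an addition, on the assumption that the unescaped "&" in the URLs would otherwise produce parser warnings.

```php
<?php
// Markup from the question. The raw "&" in the URLs is not strictly valid HTML,
// so parser warnings are collected internally instead of being printed.
$html = <<<HTML
<div class='classit'><a href='site.php?site=1&filename=aname4'>EDITOR'S PREFACE</a></div>
<div class='classit'><a href='site.php?site=4&filename=aname3'>MORE CONTENT</a></div>
<div class='classit'><a href='site.php?site=3&filename=aname4'>LAST LINE</a></div>
HTML;

$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Select only the links that sit directly inside div.classit.
$xpath = new DOMXPath($dom);
$texts = array();
foreach ($xpath->query("//div[@class='classit']/a") as $node) {
    $texts[] = trim($node->nodeValue);
}
print_r($texts);
```

The XPath predicate compares the whole class attribute, so it only matches divs whose class is exactly "classit".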

You could use preg_match_all:
<?php
$html = <<<HTML
<div class='classit'><a href='site.php?site=1&filename=aname4'>EDITOR'S PREFACE</a></div>
<div class='classit'><a href='site.php?site=4&filename=aname3'>MORE CONTENT</a></div>
<div class='classit'><a href='site.php?site=3&filename=aname4'>LAST LINE</a></div>
HTML;
$result = array();
if (preg_match_all('/>([^><]+)(?=<\/a>)/', $html, $matches))
{
$result = $matches[1];
}
print_r($result);

You could also do this using strip_tags:
$s = "<div class='classit'><a href='site.php?site=1&fn=aname4'>EDITOR'S PREFACE</a></div>
<div class='classit'><a href='site.php?site=4&filename=aname3'>MORE CONTENT</a></div>
<div class='classit'><a href='site.php?site=3&filename=aname4'>LAST LINE</a></div> ";
foreach (explode("\n", $s) as $val){
$new[] = strip_tags($val);
}
var_dump($new);

Related

How to web-scrape text inside divs with DOMDocument

I am trying to get the content of a div, and to do the same for other pages inside a foreach loop.
But I am running into some trouble:
<div class="article_info">
<ul class="c-result_box">
<li>
<div class="inner cf">
<div class="c-header">
<div class="c-logo">
<img src="/e/designs/31sumai/common/img/logo_08.png" alt="#">
</div>
<p class="c-supplier">三井のマンション</p>
<p class="c-name">
パークリュクス大阪天満
</p>
I'm trying to get the text inside the <a> element. Here is my code; what am I missing?
$start_id = 1501;
while(true){
$url = 'https://www.31sumai.com/mfr/K'.$start_id.'/outline.html';
$html = file_get_contents($url);
libxml_use_internal_errors(true);
$DOMParser = new \DOMDocument();
$DOMParser->loadHTML($html);
$xpath = new \DOMXPath($DOMParser);
$classname="c-name";
$nodes = $finder->query("//*[contains(@class, '$classname')]");
$MyTable = false;
$insertData = [];
foreach($nodes as $node){
$allNames = [];
foreach($node->getElementsByTagName('a') as $a){
$name = $a->getElementsByTagName('a');
$allProperties[] = [
'names' => $name];
}
}
Thank you for helping!
You can rely on your XPath query to pull all the text nodes that you want, and then just read the nodeValue property within your loop:
$start_id = "1501";
$url = "https://www.31sumai.com/mfr/K$start_id/outline.html";
$html = file_get_contents($url);
libxml_use_internal_errors(true);
$DOMParser = new \DOMDocument();
$DOMParser->loadHTML($html);
$xpath = new \DOMXPath($DOMParser);
$classname="c-name";
$nodes = $xpath->query("//*[contains(@class, '$classname')]/a/text()");
foreach($nodes as $node){
echo $node->nodeValue;
}

Get data-filter attribute of all divs

I have following html:
<div class="" data-filter="01">
...
</div>
<div class="" data-filter="02356">
....
</div>
<div class="" data-filter="02356">
...
</div>
How can I get the data-filter attribute of all those divs in PHP?
I've tried something like
$doc = new DOMDocument();
$doc->loadHTMLFile("items.html");
foreach ($doc->childNodes as $item){
echo $item->getAttribute('data-filter');
}
but that throws Call to undefined method error.
Try this code:
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$query = "//div[@class='']";
$entries = $xpath->query($query);
foreach ($entries as $entry) {
echo "Found: " . $entry->getAttribute("data-filter");
}
Or use the Simple HTML DOM library:
http://simplehtmldom.sourceforge.net/manual.htm
Change your code as below; you need to change the method used in the foreach:
$doc = new DOMDocument();
$doc->loadHTMLFile("items.html");
foreach ($doc->getElementsByTagName('div') as $item){
echo $item->getAttribute('data-filter');
}
The following code worked for me:
$doc = new DOMDocument();
$doc->loadHTMLFile("items.html");
$elements = $doc->getElementsByTagName('div');
if (!is_null($elements)) {
foreach ($elements as $element) {
echo $element->getAttribute('data-filter').'<br>';
}
}
For more information regarding this, please see DOMDocument::loadHTMLFile in the PHP documentation.
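The original loop most likely failed because $doc->childNodes contains only the top-level nodes of the document (such as the doctype and the html element), and not every one of them has a getAttribute() method. As a hedged alternative to looping over all divs, the sketch below uses DOMXPath to keep only the divs that actually carry the attribute. The inline $html string and the $filters array are made up for illustration; with a file you would call loadHTMLFile("items.html") instead.

```php
<?php
// Illustrative markup: two divs with data-filter, one without.
$html = '<div class="" data-filter="01">a</div>'
      . '<div class="" data-filter="02356">b</div>'
      . '<div>no filter here</div>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
$filters = array();
// The predicate [@data-filter] keeps only divs that have the attribute.
foreach ($xpath->query('//div[@data-filter]') as $div) {
    $filters[] = $div->getAttribute('data-filter');
}
print_r($filters);
```

With getElementsByTagName('div'), a div without the attribute would contribute an empty string, because getAttribute() returns '' for a missing attribute.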

parse the html data to array data in php

I am trying to parse HTML data into arrays using the classes of the a tags, but I was not able to get the desired format. Below is my data:
$text ='<div class="result results_links results_links_deep web-result ">
<div class="links_main links_deep result__body">
<h2 class="result__title">
<a rel="nofollow" class="result__a" href="">Text1</a>
</h2>
<a class="result__snippet" href="">Text1</a>
<a class="result__url" href="">
example.com
</a>
</div>
</div>
<div class="result results_links results_links_deep web-result ">
<div class="links_main links_deep result__body">
<h2 class="result__title">
<a rel="nofollow" class="result__a" href="">text3</a>
</h2>
<a class="result__snippet" href="">text23</a>
<a class="result__url" href="">
text.com
</a>
</div>
</div>';
I am trying to get the result using below code
$lines = explode("\n", $text);
$out = array();
foreach ($lines as $line) {
$parts = explode(" > ", $line);
$ref = &$out;
while (count($parts) > 0) {
if (isset($ref[$parts[0]]) === false) {
$ref[$parts[0]] = array();
}
$ref = &$ref[$parts[0]];
array_shift($parts);
}
}
print_r($out);
But I need the result to be exactly like this:
array:2 [
0 => array:3 [
0 => "Text1"
1 => "Text1"
2 => "example.com"
]
1 => array:3 [
0 => "text3"
1 => "text23"
2 => "text.com"
]
]
Demo : https://eval.in/746170
I also tried DOMDocument, like below, in Laravel:
$dom = new DOMDocument;
$dom->loadHTML($text);
foreach($dom->getElementsByTagName('a') as $node)
{
$array[] = $dom->saveHTML($node);
}
print_r($array);
So how can I use the classes to separate the data the way I want? Any suggestions, please. Thank you.
Here you go, try this and tell me if you need any more help:
<?php
$test = <<<EOS
<div class="result results_links results_links_deep web-result ">
<div class="links_main links_deep result__body">
<h2 class="result__title">
<a rel="nofollow" class="result__a" href="">Text1</a>
</h2>
<a class="result__snippet" href="">Text1</a>
<a class="result__url" href="">
example.com
</a>
</div>
</div>
<div class="result results_links results_links_deep web-result ">
<div class="links_main links_deep result__body">
<h2 class="result__title">
<a rel="nofollow" class="result__a" href="">text3</a>
</h2>
<a class="result__snippet" href="">text23</a>
<a class="result__url" href="">
text.com
</a>
</div>
</div>
EOS;
$document = new DOMDocument();
$document->loadHTML($test);
// first extract all the divs with the links_main class
$divs = [];
foreach ($document->getElementsByTagName('div') as $div) {
$classes = $div->getAttribute('class'); // empty string when the attribute is missing
if (!$classes) continue;
$classes = explode(' ', $classes);
if (in_array('links_main', $classes)) {
$divs[] = $div;
}
}
// now iterate through them and retrieve all the links in order
$results = [];
foreach ($divs as $div) {
$temp = [];
foreach ($div->getElementsByTagName('a') as $link) {
$temp[] = $link->nodeValue;
}
$results[] = $temp;
}
var_dump($results);
Working version - http://sandbox.onlinephpfunctions.com/code/e7ed2615ea32c5b9f0a89e3460da28a2702343f1
I would do it using DOMDocument and DOMXPath, which make it easier to target the interesting parts. To be more precise, I register a function that checks whether a class attribute contains a set of classes:
function hasClasses($attrValue, $requiredClasses) {
$requiredClasses = explode(' ', $requiredClasses);
$classes = preg_split('~\s+~', $attrValue, -1, PREG_SPLIT_NO_EMPTY);
return array_diff($requiredClasses, $classes) ? false : true;
}
$dom = new DOMDocument;
$state = libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_use_internal_errors($state);
$xp = new DOMXPath($dom);
$xp->registerNamespace('php', 'http://php.net/xpath');
$xp->registerPhpFunctions('hasClasses');
$mainDivClasses = 'result results_links results_links_deep web-result';
$childDivClasses = 'links_main links_deep result__body';
$divNodeList = $xp->query('//div[php:functionString("hasClasses", @class, "' . $mainDivClasses . '")]
/div[php:functionString("hasClasses", @class, "' . $childDivClasses . '")]');
$results = [];
foreach ($divNodeList as $divNode) {
$results[] = [
trim($xp->evaluate('string(./h2/a[@class="result__a"])', $divNode)),
trim($xp->evaluate('string(.//a[@class="result__snippet"])', $divNode)),
trim($xp->evaluate('string(.//a[@class="result__url"])', $divNode))
];
}
print_r($results);
Without registering a function, you can also use the XPath function contains() in your predicates. It is less precise, since it only checks whether a substring occurs in a larger string (and not whether the class attribute contains a specific class, as the hasClasses function does), but it should be enough here:
$dom = new DOMDocument;
$state = libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_use_internal_errors($state);
$xp = new DOMXPath($dom);
$divNodeList = $xp->query('//div[contains(@class, "results_links_deep")]
[contains(@class, "web-result")]
/div[contains(@class, "links_main")]
[contains(@class, "links_deep")]
[contains(@class, "result__body")]');
$results = [];
foreach ($divNodeList as $divNode) {
$results[] = [
trim($xp->evaluate('string(./h2/a[@class="result__a"])', $divNode)),
trim($xp->evaluate('string(.//a[@class="result__snippet"])', $divNode)),
trim($xp->evaluate('string(.//a[@class="result__url"])', $divNode))
];
}
print_r($results);
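If the substring behaviour of contains() is a concern, a common XPath 1.0 idiom sits between the two approaches: pad the normalized class attribute with spaces and search for the class name surrounded by spaces. The sketch below compares the two tests on made-up markup; the class name links_main_extra is hypothetical and only there to show the false positive.

```php
<?php
// Illustrative markup: the second div only shares a prefix with "links_main".
$html = '<div class="links_main links_deep">keep</div>'
      . '<div class="links_main_extra">skip</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xp = new DOMXPath($dom);

// Substring test: also matches "links_main_extra".
$loose = $xp->query('//div[contains(@class, "links_main")]');

// Token test: pad the normalized attribute with spaces and look for " links_main ".
$strict = $xp->query(
    '//div[contains(concat(" ", normalize-space(@class), " "), " links_main ")]'
);

echo $loose->length, ' ', $strict->length, "\n";
```

This needs no registered PHP function, at the cost of a longer predicate for every class that has to be checked.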

extract html inside <p class=" without simplehtmldom

I want to extract the HTML inside <p class="js-swt-text" without simplehtmldom.
Below is the working example I made with simplehtmldom:
<?php
include('simple_html_dom.php');
$html = file_get_html("http://blabla.com");
foreach($html->find('.js-swt-text') as $ret) {
echo $ret;
}
?>
Please help me do the same with DOMDocument:
<?php
$html = file_get_contents("http://blabla.com");
and then echo the HTML from every <p class="js-swt-text"
You can use:
$html = file_get_contents("http://blabla.com");
$dom = new DomDocument();
$dom->loadHtml($html);
$xpath = new DomXpath($dom);
$items = $xpath->query('//p[@class="js-swt-text"]');
foreach ($items as $item) {
echo $item->textContent . "\n";
}
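Note that textContent returns only the text of each paragraph. Since the question asks for the HTML, one possible variation is to pass each node to DOMDocument::saveHTML(), which serializes the node with its tags. The sketch below uses a made-up inline string instead of the remote page, and the $fragments array is only illustrative.

```php
<?php
// Illustrative markup standing in for the downloaded page.
$html = '<p class="js-swt-text">Hello <b>world</b></p><p class="other">skip</p>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$fragments = array();
foreach ($xpath->query('//p[@class="js-swt-text"]') as $p) {
    // saveHTML($node) serializes the node itself, tags included.
    $fragments[] = $dom->saveHTML($p);
}
print_r($fragments);
```

This mirrors what echoing the element does in simplehtmldom, which prints the element's outer HTML.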

how to use dom php parser

I'm new to DOM parsing in PHP:
I have a HTML file that I'm trying to parse. It has a bunch of DIVs like this:
<div id="interestingbox">
<div id="interestingdetails" class="txtnormal">
<div>Content1</div>
<div>Content2</div>
</div>
</div>
<div id="interestingbox">
......
I'm trying to get the contents of the many div boxes using php.
How can I use the DOM parser to do this?
Thanks!
First, I have to tell you that you can't use the same id on two different divs; that is what classes are for. Every element should have a unique id.
Code to get the contents of the div with id="interestingbox"
$html = '
<html>
<head></head>
<body>
<div id="interestingbox">
<div id="interestingdetails" class="txtnormal">
<div>Content1</div>
<div>Content2</div>
</div>
</div>
<div id="interestingbox2">a link</div>
</body>
</html>';
$dom_document = new DOMDocument();
$dom_document->loadHTML($html);
//use DOMXpath to navigate the html with the DOM
$dom_xpath = new DOMXpath($dom_document);
// if you want to get the div with id=interestingbox
$elements = $dom_xpath->query("//div[@id='interestingbox']");
if (!is_null($elements)) {
foreach ($elements as $element) {
echo "\n[". $element->nodeName. "]";
$nodes = $element->childNodes;
foreach ($nodes as $node) {
echo $node->nodeValue. "\n";
}
}
}
//OUTPUT
[div] {
Content1
Content2
}
Example with classes:
$html = '
<html>
<head></head>
<body>
<div class="interestingbox">
<div id="interestingdetails" class="txtnormal">
<div>Content1</div>
<div>Content2</div>
</div>
</div>
<div class="interestingbox">a link</div>
</body>
</html>';
//the same as before.. just change the xpath
[...]
$elements = $dom_xpath->query("//div[@class='interestingbox']");
[...]
//OUTPUT
[div] {
Content1
Content2
}
[div] {
a link
}
Refer to the DOMXPath page for more details.
I got this to work using simplehtmldom as a start:
$html = file_get_html('example.com');
foreach ($html->find('div[id=interestingbox]') as $result)
{
echo $result->innertext;
}
Very nice function from http://www.sitepoint.com/forums/showthread.php?611393-php5-need-something-like-innerHTML-instead-of-nodeValue
function innerXML($node)
{
$doc = $node->ownerDocument;
$frag = $doc->createDocumentFragment();
foreach ($node->childNodes as $child)
{
$frag->appendChild($child->cloneNode(TRUE));
}
return $doc->saveXML($frag);
}
$dom = new DOMDocument();
$dom->loadXML('
<html>
<body>
<table>
<tr>
<td id="foo">
The first bit of Data I want
<br />The second bit of Data I want
<br />The third bit of Data I want
</td>
</tr>
</table>
</body>
</html>
');
$xpath = new DOMXPath($dom);
$node = $xpath->evaluate("/html/body//td[@id='foo' ]");
$dataString = innerXML($node->item(0));
$dataArr = explode("<br/>", $dataString); // saveXML() writes empty elements as <br/>
$dataUno = $dataArr[0];
$dataDos = $dataArr[1];
$dataTres = $dataArr[2];
echo "firstdata = $dataUno<br />seconddata = $dataDos<br />thirddata = $dataTres<br />";
WebExtractor: https://github.com/knyga/webextractor
It can parse a page with CSS, regex, and XPath selectors.
See the package and its tests for examples:
use WebExtractor\DataExtractor\DataExtractorFactory;
use WebExtractor\DataExtractor\DataExtractorTypes;
use WebExtractor\Client\Client;
$factory = DataExtractorFactory::getFactory();
$extractor = $factory->createDataExtractor(DataExtractorTypes::CSS);
$client = new Client;
$content = $client->get('https://en.wikipedia.org/wiki/2014_Winter_Olympics');
$extractor->setContent($content);
$h1 = $extractor->setSelector('h1')->extract();
