I need to extract only the textual part of a webpage, preserving all and only the <p>, <h2>, <h3>, <h4> and <blockquote> elements.
Using DOMXPath with $div = $xpath->query('//div[@class="story-inner"]'); returns lots of unwanted page elements, such as pictures, ad blocks and other custom markup, inside the text div.
On the other hand using the following code:
$items = $doc->getElementsByTagName('p');
for ($i = 0; $i < $items->length; $i++) {
echo $items->item($i)->nodeValue . "<p>";
}
gives a nice, clean result very close to what I wanted, but with the <h2>, <h3>, <h4> and <blockquote> elements missing.
Is there a DOM way of either (1) selecting only the desired page elements and extracting a clean result, or (2) efficiently cleaning up the output obtained with $div = $xpath->query('//div[@class="story-inner"]');?
You could use an or-expression inside your XPath query in this case: chain the tag names with or so that only the desired elements are selected.
$url = "http://www.example.com/russian/international/2015/02/150218_ukraine_debaltseve_fighting";
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$html = curl_exec($curl);
curl_close($curl);
$doc = new DOMDocument();
$html = mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8");
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$tags = array('p', 'h2', 'h3', 'h4', 'blockquote');
$children_needed = implode(' or ', array_map(function($tag){ return sprintf('name()="%s"', $tag); }, $tags));
$query = "//div[@class='story-body__inner']//*[$children_needed]";
$div_children = $xpath->query($query);
if($div_children->length > 0) {
foreach($div_children as $child) {
echo $doc->saveHTML($child);
}
}
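The same selection can also be written with the XPath self:: axis instead of name() comparisons. The following is only a minimal sketch: the sample markup, the ad div and the variable $kept are made up for illustration, and the real page will look different.

```php
<?php
// Hypothetical sample markup standing in for the downloaded page.
$html = '<div class="story-body__inner">'
      . '<h2>Headline</h2><p>First paragraph.</p>'
      . '<div class="ad"><span>Advert</span></div>'
      . '<blockquote>A quote.</blockquote>'
      . '<h3>Subhead</h3><p>Second paragraph.</p></div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Keep only the wanted element types anywhere below the story container.
$query = "//div[@class='story-body__inner']"
       . "//*[self::p or self::h2 or self::h3 or self::h4 or self::blockquote]";

$kept = array();
foreach ($xpath->query($query) as $node) {
    $kept[] = trim($doc->saveHTML($node));   // serialize each matched element
}
echo implode("\n", $kept), "\n";
```

The nodes come back in document order, so headings and paragraphs stay interleaved the way they appear on the page.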
If I understood your question correctly, is this what you are asking for?
$output1=preg_match('/^.*<tagName>(.*)<\/tagName>/', $value,$match1);
Match on the tag names and capture the data between them using preg_match.
I am trying to create a simple screen scraper that gets me the price of a specific item. Here is an example of a product I want to get the price from:
https://www.flanco.ro/telefon-mobil-apple-iphone-14-5g-128gb-purple.html
This is the portion of the html code I am interested in:
I want to get the '4699' value.
Here is what I have been trying to do but it does not seem to work:
$html = file_get_contents("https://www.flanco.ro/telefon-mobil-apple-iphone-14-5g-128gb-purple.html");
$doc = new DomDocument();
$doc->loadHtml($html);
$xpath = new DomXPath($doc);
//Now query the document:
foreach ($xpath->query('/<span class="price">[0-9]*\\.[0-9]+/i') as $node) {
echo $node, "\n";
}
You could just use standard PHP string functions to get the price out of the $html:
$url = "https://www.flanco.ro/telefon-mobil-apple-iphone-14-5g-128gb-purple.html";
$html = file_get_contents($url);
$seek = '<span class="special-price"><span class="price">';
$end = strpos($html, $seek) + strlen($seek);
$price = substr($html, $end, strpos($html, ',', $end) - $end);
Or something similar. This is all the code you need. This code returns:
4.699
My point is: in this particular case you don't need to parse the DOM or use a regular expression to get that single price.
Since there are a few price classes on the page, I would specifically target the pricesPrp class.
Also, in your foreach you are trying to convert a DOMElement object into a string, which won't work.
Update your XPath query as follows:
$query = $xpath->query('//div[@class="pricesPrp"]//span[@class="special-price"]//span[@class="price"]');
If you want to see the different nodes:
echo '<pre>';
foreach ($query as $node) {
var_dump($node);
}
And if you want to get that specific price :
$price = $query->item(0)->nodeValue;
echo $price;
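If the price is needed as a number rather than as display text, the node value can be normalized after the query. This is only a sketch under assumptions: the sample markup and the "4.699,00 lei" format (dot as thousands separator, comma as decimal separator) are guesses, not taken from the real page.

```php
<?php
// Hypothetical markup; the real page structure may differ.
$html = '<div class="pricesPrp"><span class="special-price">'
      . '<span class="price">4.699,00 lei</span></span></div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // silence warnings about imperfect markup
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//div[@class="pricesPrp"]//span[@class="special-price"]//span[@class="price"]');

$price = null;
// Guard against an empty result before touching item(0).
if ($nodes->length > 0 && preg_match('/\d[\d.]*(?:,\d+)?/', $nodes->item(0)->nodeValue, $m)) {
    // Drop thousands separators, then turn the decimal comma into a dot.
    $price = (float) str_replace(array('.', ','), array('', '.'), $m[0]);
}
var_dump($price);
```

Checking $nodes->length first avoids a fatal error when the page layout changes and the query matches nothing.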
$html = file_get_contents('PASTE_URL');
$doc = new DOMDocument();
@$doc->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8"));
$selector = new DOMXPath($doc);
$result = $selector->query('//span[@class="price"]');
foreach($result as $node) {
echo $node->nodeValue;
}
I am trying to add a comma and whitespace between some data I am scraping from a website. The data is scraped successfully, but the items are run together, and the comma and space I am trying to add only get added after the last item. Here is the code I currently have:
$html = curl_exec($ch);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$finder = new DomXPath($dom);
$class_ops = 'ipc-inline-list ';
$class_opp = 'ipc-inline ';
$node = $finder->query("//div[@class='$class_ops']//ul[@class='$class_opp']");
foreach ($node as $index => $t) {
if ($index == 3) {
$la = $t->textContent.", ";
}
}
echo $la;
Current Result
DoyleBrainDavid,
Expected Result
Doyle, Brain, David
I am using this code
$c1 = curl_init('https://stackoverflow.com/');
curl_setopt($c1, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($c1);
if (curl_error($c1))
die(curl_error($c1));
// Get the status code
$status = curl_getinfo($c1, CURLINFO_HTTP_CODE);
curl_close($c1);
preg_match_all('/<span(.*?)<\/span>/s', $html, $matches1);
foreach($matches1[0] as $k=>$v){
$enc = mb_detect_encoding($v);
$v = mb_convert_encoding($v,$enc, "UTF-8");
$match1[$k] = strip_tags ($v);
//$match1[$k] = preg_replace('/^[^A-Za-z0-9]+/', '', $match1[$k]);
}
var_dump($match1);
In your case you can change the pattern like this:
preg_match_all('/<div class="ipc-inline-list">(.*?)<\/div>/s', $html, $matches1);
This returns an array with the matches.
I hope this is helpful.
You want each li, not the ul as one block. Try:
$node = $finder->query("//div[@class='$class_ops']//ul[@class='$class_opp']/li");
Demo: https://3v4l.org/Mvfud
If that doesn't work the actual HTML content should be added to the question.
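Once each li is matched separately, the separator problem goes away if the texts are collected into an array and joined with implode. A minimal sketch, assuming markup like the hypothetical sample below (the class names are written without trailing spaces here, which may not match the real page):

```php
<?php
// Hypothetical sample markup; the real page may use different class values.
$html = '<div class="ipc-inline-list"><ul class="ipc-inline">'
      . '<li>Doyle</li><li>Brain</li><li>David</li></ul></div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$finder = new DOMXPath($dom);
$names = array();
// One node per <li>, so each name is a separate array entry.
foreach ($finder->query("//div[@class='ipc-inline-list']//ul[@class='ipc-inline']/li") as $li) {
    $names[] = trim($li->textContent);
}

// implode puts the separator between items only, never after the last one.
$joined = implode(', ', $names);
echo $joined, "\n";   // Doyle, Brain, David
```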
I have over 500 static pages whose content is structured this way:
<section>
Some text
<strong>Dynamic Title (Different on each page)</strong>
<strong>Author name (Different on each page)</strong>
<strong>Category</strong>
(<b>Content</b> <b>MORE TEXT HERE</b>)
</section>
And I need to extract the data as formatted below, using PHP Simple HTML DOM Parser
$title = <strong>Dynamic Title (Different on each page)</strong>
$author = <strong>Author name (Different on each page)</strong>
$category = <strong>Category</strong>
$content = (<b>Content</b> <b>MORE TEXT HERE</b>)
I have failed so far and can't get my head around it. I would appreciate any advice or code snippet to help me get going.
EDIT 1,
I have now solved the part with strong tags using,
$html = file_get_html($url);
$links = array();
foreach($html->find('strong') as $a) {
$content[] = $a->innertext;
}
$title= $content[0];
$author= $content[1];
The only remaining issue is how to extract the content within the parentheses using a similar method.
OK, first you want to get all of the <section> tags.
Then you want to search through those again for the <strong> tags and <b> tags.
Something like this:
// Create DOM from URL or file
$html = file_get_html('http://www.example.com/');
// Find all <section> elements
foreach ($html->find('section') as $section) {
    $strong = array();
    // Collect the inner HTML of each <strong> inside this <section>
    foreach ($section->find('strong') as $element) {
        $strong[] = $element->innertext;
    }
    $title = $strong[0];
    $author = $strong[1];
    $category = $strong[2];
}
To get the parts in parentheses - just get the b tag text and then add the () brackets.
Or if you're asking how to get parts in between the brackets - use explode then remove the closing bracket:
$pieces = explode("(", $title);
$different_on_each_page = str_replace(")","",$pieces[1]);
$html_code = 'html';
$dom = new \DOMDocument();
$dom->loadHTML($html_code);
$xpath = new \DOMXPath($dom);
$nodelist = $xpath->query("//strong");
for($i = 0; $i < $nodelist->length; $i++){
$nodelist->item($i)->nodeValue; //gives you the text inside
}
My final code that works now looks like this.
$html = file_get_html($url);
$links = array();
foreach($html->find('strong') as $a) {
$content[] = $a->innertext;
}
$title= $content[0];
$author= $content[1];
$category = $content[2];
$details = file_get_html($url)->plaintext;
$input = $details;
preg_match_all("/\(.*?\)/", $input, $matches);
print_r($matches[0]);
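For comparison, the same extraction can be sketched with PHP's built-in DOMDocument instead of Simple HTML DOM. This is only an illustration: the sample section is shortened (the "(Different on each page)" notes are left out so that the only parentheses are the ones around the content), and the variable names are made up.

```php
<?php
// Shortened, hypothetical version of the section shown in the question.
$page = '<section>Some text '
      . '<strong>Dynamic Title</strong> <strong>Author name</strong> '
      . '<strong>Category</strong> (<b>Content</b> <b>MORE TEXT HERE</b>)</section>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // <section> may trigger warnings in older libxml versions
$dom->loadHTML($page);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$section = $xpath->query('//section')->item(0);

// Text of the three <strong> elements, in document order.
$strong = array();
foreach ($xpath->query('.//strong', $section) as $node) {
    $strong[] = $node->textContent;
}
list($title, $author, $category) = $strong;

// First parenthesized run in the section's plain text.
$content = null;
if (preg_match('/\((.*?)\)/s', $section->textContent, $m)) {
    $content = $m[1];
}
echo "$title | $author | $category | $content\n";
```

If a title itself contains parentheses, the first match would be the title's, so on real pages it is safer to use preg_match_all and pick the last match.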
I have an HTML string that contains exactly one a element. Example:
test
In PHP I have to test whether rel contains external and, if so, modify href and save the string.
I have looked at DOM nodes and objects, but they seem like too much for a single a element: I have to iterate to get the HTML nodes, and I am not sure how to test whether rel exists and contains external.
$html = new DOMDocument();
$html->loadHtml($txt);
$a = $html->getElementsByTagName('a');
$attr = $a->item(0)->attributes;
...
At this point I get a DOMNamedNodeMap, which seems like overhead. Is there any simpler way to do this, or should I do it with DOM?
Is there any simpler way to do this, or should I do it with DOM?
Do it with DOM.
Here's an example:
<?php
$html = 'test';
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//a[contains(concat(' ', normalize-space(@rel), ' '), ' external ')]");
foreach($nodes as $node) {
$node->setAttribute('href', 'http://example.org');
}
echo $dom->saveHTML();
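Because the input is a bare fragment, saveHTML() without arguments also emits the doctype and the html/body wrappers that the parser adds. A minimal sketch of serializing only the link, using a hypothetical input string (the href and rel values are made up):

```php
<?php
// Hypothetical input string with a single link.
$html = '<a href="http://example.com/old" rel="nofollow external">test</a>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Match "external" as a whole space-separated token of the rel attribute.
$nodes = $xpath->query("//a[contains(concat(' ', normalize-space(@rel), ' '), ' external ')]");

$result = $html;   // unchanged if the link is not external
if ($nodes->length > 0) {
    $a = $nodes->item(0);
    $a->setAttribute('href', 'http://example.org');
    // Serialize just the <a> element, without the doctype/html/body wrappers.
    $result = $dom->saveHTML($a);
}
echo $result, "\n";
```

The token test means a value such as rel="externally" does not match, which a plain substring search would get wrong.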
I kept going with DOM. This is what I ended up with:
$html = new DOMDocument();
$html->loadHtml('<?xml encoding="utf-8" ?>' . $txt);
$nodes = $html->getElementsByTagName('a');
foreach ($nodes as $node) {
foreach ($node->attributes as $att) {
if ($att->name == 'rel') {
if (strpos($att->value, 'external') !== false) {
$node->setAttribute('href','modified_url_goes_here');
}
}
}
}
$txt = $html->saveHTML();
I did not want to load any other library for just this one string.
The best way is to use an HTML parser/DOM, but here's a regex solution:
$html = 'test<br>
<p> Some text</p>
test2<br>
<a rel="external">test3</a> <-- This won\'t work since there is no href in it.
';
$new = preg_replace_callback('/<a.+?rel\s*=\s*"([^"]*)"[^>]*>/i', function($m){
if(strpos($m[1], 'external') !== false){
$m[0] = preg_replace('/href\s*=\s*(("[^"]*")|(\'[^\']*\'))/i', 'href="http://example.com"', $m[0]);
}
return $m[0];
}, $html);
echo $new;
You could use a regular expression. If the tag matches
/\s+rel\s*=\s*".*external.*"/
then do a regex replace like
/(<a.*href\s*=\s*")([^"]*)("[^>]*>)/\1[your new href here]\3/
Though using a library that can do this kind of thing for you is much easier (like jQuery for JavaScript).
From an HTML page I need to extract the value of v from all anchor links. Each anchor link is nested inside some five div tags:
<a href="/watch?v=value to be retrieved&list=blabla&feature=plpp_play_all">
Each v value has 11 characters. For now I am trying to read it character by character, like this:
<?php
$file = fopen("xx.html", "r") or exit("Unable to open file!");
$d = 'v';
$dd = '=';
$vd = array();
while (!feof($file))
{
    $f = fgetc($file);
    if ($f == $d)
    {
        $ff = fgetc($file);
        if ($ff == $dd)
        {
            // Read the 11 characters that follow "v="
            $id = '';
            for ($i = 0; $i <= 10; $i++)
            {
                $sData = fgetc($file);
                $id = $id . $sData;
            }
            array_push($vd, $id);
        }
    }
}
fclose($file);
That is, I read each character of v into the $sData variable and append it to $id to build the 11-character string.
The problem is that searching the entire HTML file for 'v=' and then reading the 11 characters and pushing them into the $vd array takes a considerable amount of time. Please help me make this more efficient.
<?php
function substring(&$string,$start,$end)
{
$pos = strpos(">".$string,$start);
if(! $pos) return "";
$pos--;
$string = substr($string,$pos+strlen($start));
$posend = strpos($string,$end);
$toret = substr($string,0,$posend);
$string = substr($string,$posend);
return $toret;
}
$contents = @file_get_contents("xx.html");
$old="";
$videosArray=array();
while ($old <> $contents)
{
$old = $contents;
$v = substring($contents,"?v=","&");
if($v) $videosArray[] = $v;
}
//$videosArray is array of v's
?>
I would rather parse the HTML with SimpleXML and XPath:
// Get your page HTML string
$html = file_get_contents('xx.html');
// As per comment by Gordon to suppress invalid markup warnings
libxml_use_internal_errors(true);
// Create SimpleXML object
$doc = new DOMDocument();
$doc->strictErrorChecking = false;
$doc->loadHTML($html);
$xml = simplexml_import_dom($doc);
// Find <a> nodes whose href contains "v="
$anchors = $xml->xpath('//a[contains(@href, "v=")]');
foreach ($anchors as $a)
{
$href = (string)$a['href'];
$url = parse_url($href);
parse_str($url['query'], $params);
// $params['v'] contains what we need
$vd[] = $params['v']; // push into array
}
// Clear invalid markup error buffer
libxml_clear_errors();
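A slightly more defensive variant of the same idea skips links that have no query string or no v parameter, which avoids notices on unrelated anchors. This is a sketch on made-up sample markup, not the real page, using DOMDocument directly rather than SimpleXML:

```php
<?php
// Hypothetical sample markup standing in for xx.html.
$html = '<div><a href="/watch?v=abcdefghijk&amp;list=blabla&amp;feature=plpp_play_all">one</a></div>'
      . '<div><a href="/about">no video id</a></div>'
      . '<div><a href="/watch?list=x&amp;v=ABCDEFGHIJK">two</a></div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // suppress invalid-markup warnings
$doc->loadHTML($html);
libxml_clear_errors();

$vd = array();
foreach ($doc->getElementsByTagName('a') as $a) {
    // Query-string part of the href, or null when there is none.
    $query = parse_url($a->getAttribute('href'), PHP_URL_QUERY);
    if (!is_string($query)) {
        continue;
    }
    parse_str($query, $params);
    if (isset($params['v'])) {
        $vd[] = $params['v'];   // works regardless of parameter order
    }
}
print_r($vd);
```

Parsing the query string instead of counting 11 characters also keeps working if the length of the v value ever changes.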