Add space between textContent data scraped from website using PHP DOM - php

I am trying to add a comma and whitespace to some data I am scraping from a website. The data scrapes successfully, but the values are run together, and the comma and space I am trying to add only get added after the last item. Here is the code I currently have:
$html = curl_exec($ch);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$finder = new DomXPath($dom);
$class_ops = 'ipc-inline-list ';
$class_opp = 'ipc-inline ';
$node = $finder->query("//div[@class='$class_ops']//ul[@class='$class_opp']");
foreach ($node as $index => $t) {
if ($index == 3) {
$la = $t->textContent.", ";
}
}
echo $la;
Current Result
DoyleBrainDavid,
Expected Result
Doyle, Brain, David

I am using this code
$c1 = curl_init('https://stackoverflow.com/');
curl_setopt($c1, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($c1);
if (curl_error($c1))
die(curl_error($c1));
// Get the status code
$status = curl_getinfo($c1, CURLINFO_HTTP_CODE);
curl_close($c1);
preg_match_all('/<span(.*?)<\/span>/s', $html, $matches1);
foreach($matches1[0] as $k=>$v){
$enc = mb_detect_encoding($v) ?: 'UTF-8'; // fall back to UTF-8 if detection fails
$v = mb_convert_encoding($v, 'UTF-8', $enc); // argument order is (string, to, from)
$match1[$k] = strip_tags ($v);
//$match1[$k] = preg_replace('/^[^A-Za-z0-9]+/', '', $match1[$k]);
}
var_dump($match1);
In your case you can replace the pattern like this:
preg_match_all('/<div class="ipc-inline-list">(.*?)<\/div>/s', $html, $matches1);
This returns an array of matches.
I hope this is helpful.

You want each li, not the ul as one block. Try:
$node = $finder->query("//div[@class='$class_ops']//ul[@class='$class_opp']/li");
Demo: https://3v4l.org/Mvfud
If that doesn't work, the actual HTML content should be added to the question.
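For illustration, here is a minimal, self-contained sketch of the per-li approach. The HTML string is a made-up stand-in (the real page markup was not posted), and the class names have no trailing space here, so the selectors will likely need adjusting for the actual page:

```php
<?php
// Hypothetical markup standing in for the scraped page.
$html = '<div class="ipc-inline-list"><ul class="ipc-inline">'
      . '<li>Doyle</li><li>Brain</li><li>David</li></ul></div>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$finder = new DOMXPath($dom);
// Select each <li> separately instead of the whole <ul>.
$items = $finder->query("//div[@class='ipc-inline-list']//ul[@class='ipc-inline']/li");

$names = [];
foreach ($items as $li) {
    $names[] = trim($li->textContent);
}

// Join afterwards so no separator is left dangling after the last item.
$la = implode(', ', $names);
echo $la, "\n";
```

With this sample markup it prints Doyle, Brain, David. Collecting the values in an array and joining them with implode() is what avoids the trailing ", ".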

Related

How to create a simple screen scraper in PHP

I am trying to create a simple screen scraper that gets me the price of a specific item. Here is an example of a product I want to get the price from:
https://www.flanco.ro/telefon-mobil-apple-iphone-14-5g-128gb-purple.html
This is the portion of the html code I am interested in:
(screenshot of the relevant HTML; image not included)
I want to get the '4699' thing.
Here is what I have been trying to do but it does not seem to work:
$html = file_get_contents("https://www.flanco.ro/telefon-mobil-apple-iphone-14-5g-128gb-purple.html");
$doc = new DomDocument();
$doc->loadHtml($html);
$xpath = new DomXPath($doc);
//Now query the document:
foreach ($xpath->query('/<span class="price">[0-9]*\\.[0-9]+/i') as $node) {
echo $node, "\n";
}
You could just use standard PHP string functions to get the price out of the $html:
$url = "https://www.flanco.ro/telefon-mobil-apple-iphone-14-5g-128gb-purple.html";
$html = file_get_contents($url);
$seek = '<span class="special-price"><span class="price">';
$end = strpos($html, $seek) + strlen($seek);
$price = substr($html, $end, strpos($html, ',', $end) - $end);
Or something similar. This is all the code you need. This code returns:
4.699
My point is: In this particular case you don't need to parse the DOM and use a regular expression to get that single price.
Since there are a few price classes on the page, I would specifically target the pricesPrp class.
Also, in your foreach you are trying to convert a DOMElement object into a string, which won't work.
Update your XPath query as follows:
$query = $xpath->query('//div[@class="pricesPrp"]//span[@class="special-price"]//span[@class="price"]');
If you want to see the different nodes:
echo '<pre>';
foreach ($query as $node) {
var_dump($node);
}
And if you want to get that specific price:
$price = $query->item(0)->nodeValue;
echo $price;
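As a rough, self-contained sketch of that query, here it is run against a made-up fragment. The markup and the "4.699,99 lei" price format are assumptions modelled on the class names and output mentioned above; the live page may differ:

```php
<?php
// Hypothetical fragment; only the class names come from the answer above.
$html = '<div class="pricesPrp"><span class="special-price">'
      . '<span class="price">4.699,99 lei</span></span></div>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$query = $xpath->query('//div[@class="pricesPrp"]//span[@class="special-price"]//span[@class="price"]');

$price = null;
if ($query !== false && $query->length > 0) {
    $raw = $query->item(0)->nodeValue;          // the text inside the matched span
    $whole = strstr($raw, ',', true);           // part before the decimal comma
    if ($whole !== false) {
        $price = (int) str_replace('.', '', $whole); // drop the thousands separator
    }
}
var_dump($price);
```

Checking $query->length before calling item(0) avoids reading nodeValue from null when the selector matches nothing.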
$html = file_get_contents('PASTE_URL');
$doc = new DOMDocument();
@$doc->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8"));
$selector = new DOMXPath($doc);
$result = $selector->query('//span[@class="price"]');
foreach($result as $node) {
echo $node->nodeValue;
}

PHP Extracting a part of the string if found

I am scraping a website and looking for a string; when that string is found, I want to extract part of it.
I am looking for the string "twitter:image" in a page and, when it is found, I want to extract the value of its "content" attribute. Here is an example of the website that I am scraping. This is the HTML ("View Source") of that website:
Here is an example of my code:
I am using a library called "ProxyCrawl"
$ch = new ProxyCrawl();
$response = $ch->get($link, false);
if ($response->original_status == 200) {
$result = $response->body;
if (strpos($result, 'name="twitter:image"') !== false) {
Log::debug("found!");
//then extract the content
} else {
//do nothing
}
}
I already have the code for checking whether "twitter:image" exists, but I don't have the code for extracting the "content" value.
Any help is greatly appreciated. Thanks!
If <meta name="twitter:image" /> is a unique element on page then use this:
$ch = new ProxyCrawl();
$response = $ch->get($link, false);
if ($response->original_status == 200) {
$dom = new DOMDocument;
$dom->loadHTML($response->body);
$xpath = new DOMXpath($dom);
$element = $xpath->query("//meta[@name='twitter:image']/@content");
if (!empty($element->item(0))) {
$imageUrl = $element->item(0)->nodeValue;
}
}
Otherwise, if there are multiple elements of this kind, you will need to iterate:
$ch = new ProxyCrawl();
$response = $ch->get($link, false);
if ($response->original_status == 200) {
$dom = new DOMDocument;
$dom->loadHTML($response->body);
$xpath = new DOMXpath($dom);
$imageUrls = [];
$elements = $xpath->query("//meta[@name='twitter:image']");
if ($elements !== false) {
foreach ($elements as $element) {
$imageUrls[] = $element->getAttribute('content');
}
}
}
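To try the XPath part without ProxyCrawl, a small stand-alone sketch might look like this. The $body string and the example.com URL are made up; in the question the markup would come from $response->body:

```php
<?php
// Hypothetical response body with a single twitter:image meta tag.
$body = '<html><head><meta data-rl="true" name="twitter:image" '
      . 'content="https://example.com/img.png"></head><body></body></html>';

libxml_use_internal_errors(true);   // real pages often contain invalid markup
$dom = new DOMDocument();
$dom->loadHTML($body);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Select the content attribute node directly.
$attr = $xpath->query("//meta[@name='twitter:image']/@content")->item(0);
$imageUrl = $attr !== null ? $attr->nodeValue : null;

var_dump($imageUrl);
```

If the tag is missing, $imageUrl stays null, so the caller can simply check for null instead of using strpos() first.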
This is a really quick example but a regex would be the way to go:
/(name=\"twitter:image\")(.)content=\"(.+)\"/im
This would match a string that contains name="twitter:image" followed by content=". You can get the text of content from the third grouping:
$str = '<meta data-rl="true" name="twitter:image" content="testing"';
$regex = '/(name="twitter:image")(.)content="(.+)"/im';
preg_match_all($regex, $str, $results);
print_r($results);
This is a rough example, you'll have to use this as a basis for your exact implementation. There are cleaner solutions to this (and probably better regexes) but this will get you going.
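One thing to watch with the pattern above is that (.+) is greedy, so if more attributes follow on the same line it can run on to the last double quote. A slightly tighter sketch, still assuming that content comes directly after name (the sample string is made up):

```php
<?php
// Sample tag with an extra attribute after content, to show why [^"]* helps.
$str = '<meta data-rl="true" name="twitter:image" content="testing" class="x">';

// [^"]* stops at the first closing quote instead of the last one on the line.
$regex = '/name="twitter:image"\s+content="([^"]*)"/i';

$content = preg_match($regex, $str, $m) === 1 ? $m[1] : null;
var_dump($content);
```

This still breaks if the attributes appear in a different order, which is one reason the DOM-based answers are more robust.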
I don't know laravel (I use Symfony) and I am new to StackOverflow, but something like this could work:
if (strpos($result, 'name="twitter:image"') !== false) {
$namestart = strpos($result, 'name="twitter:image"');
$substr1 = substr($result, $namestart); // everything from the name attribute onwards
$contentstart = strpos($substr1, 'content="') + 9; // 9 = strlen('content="')
$substr2 = substr($substr1, $contentstart); // starts at the content value
$contentend = strpos($substr2, '"'); // closing quote of the value
$content = substr($substr2, 0, $contentend);
}
Not tested!

PHP Web Crawler doesn't crawl .php files

This is the simple webcrawler I was trying to build
<?php
$to_crawl = "http://samplewebsite.com/about.php";
function get_links($url)
{
$input = @file_get_contents($url);
$regexp = " <a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a> ";
preg_match_all("/$regexp/siU", $input, $matches);
$l = $matches[2];
foreach ($l as $link) {
echo $link."</br>";
}
}
get_links($to_crawl);
?>
When I run the script with the $to_crawl variable set to a URL ending with a file name, e.g. "facebook.com/about", it works, but for some reason it echoes nothing when the link ends with a '.php' filename. Can someone please help?
To get all links and their inner texts, you can use DOMDocument like this:
$dom = new DOMDocument;
@$dom->loadHTML($input); // Your input (HTML code)
$xp = new DOMXPath($dom);
$links = $xp->query('//a[@href]'); // XPath to get only <a> tags with a href attribute
$result = array();
foreach ($links as $link) {
$result[] = array($link->getAttribute("href"), $link->nodeValue);
}
print_r($result);
See IDEONE demo

Extract content of specific div preserving only certain elements

I need to extract only the textual part of a webpage, preserving all and only the <p>, <h2>, <h3>, <h4> and <blockquote> elements.
Now, using DOMXPath with $div = $xpath->query('//div[@class="story-inner"]'); gives lots of unwanted page elements like pictures, ad blocks, other custom markup, etc. inside the text div.
On the other hand using the following code:
$items = $doc->getElementsByTagName('p');
for ($i = 0; $i < $items->length; $i++) {
echo $items->item($i)->nodeValue . "<p>";
}
gives a very nice and clean result, very close to what I wanted, but with the <h2>, <h3>, <h4> and <blockquote> elements missing.
I wonder whether there is a DOM way of (1) selecting only the desired page elements and extracting a clean result, or (2) efficiently cleaning up the output obtained with $div = $xpath->query('//div[@class="story-inner"]');?
You could use an OR inside your XPath query in this case. Just chain the tag names with it to get only the desired elements.
$url = "http://www.example.com/russian/international/2015/02/150218_ukraine_debaltseve_fighting";
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$html = curl_exec($curl);
curl_close($curl);
$doc = new DOMDocument();
$html = mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8");
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$tags = array('p', 'h2');
$children_needed = implode(' or ', array_map(function($tag){ return sprintf('name()="%s"', $tag); }, $tags));
$query = "//div[@class='story-body__inner']//*[$children_needed]";
$div_children = $xpath->query($query);
if($div_children->length > 0) {
foreach($div_children as $child) {
echo $doc->saveHTML($child);
}
}
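Here is a small offline sketch of the same idea covering all five tags from the question. The markup is invented for illustration, and it uses self::tag tests instead of name()="tag", which should be equivalent here:

```php
<?php
// Hypothetical markup; only the class name follows the question.
$html = '<div class="story-inner"><h2>Title</h2><img src="a.png" alt="">'
      . '<p>First.</p><div class="ad">Ad</div><blockquote>Quote</blockquote>'
      . '<h3>Sub</h3><p>Second.</p></div>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$tags = ['p', 'h2', 'h3', 'h4', 'blockquote'];

// Builds "self::p or self::h2 or ..." for use inside the predicate.
$test = implode(' or ', array_map(fn($t) => "self::$t", $tags));
$nodes = $xpath->query("//div[@class='story-inner']//*[$test]");

$kept = [];
foreach ($nodes as $node) {
    $kept[] = $node->nodeName . ': ' . trim($node->textContent);
}
print_r($kept);
```

The image and the ad block are skipped, and the remaining elements come back in document order. Note that a <p> nested inside a <blockquote> would be matched twice (once on its own, once as part of the blockquote text), so real pages may need an extra filter.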
If I understood your question correctly, is this what you are asking for?
$output1=preg_match('/^.*<tagName>(.*)<\/tagName>/', $value,$match1);
Match on the tag names and get the data between them using preg_match.

extracting anchor values hidden in div tags

From an HTML page I need to extract the value of v from all anchor links. Each anchor link is nested inside some 5 div tags:
<a href="/watch?v=value to be retrieved&list=blabla&feature=plpp_play_all">
Each v value has 11 characters. For now I am trying to read it character by character, like this:
<?php
$file=fopen("xx.html","r") or exit("Unable to open file!");
$d='v';
$dd='=';
$vd=array();
while (!feof($file))
{
$f=fgetc($file);
if($f==$d)
{
$ff=fgetc($file);
if ($ff==$dd)
{
$id='';
for($i=0;$i<=10;$i++)
{
$sData = fgetc($file);
$id=$id.$sData;
}
array_push($vd, $id);
}
}
}
fclose($file);
?>
That is, I read each character of v into the sData variable and append it to id, so as to get those 11 characters as a string (id).
The problem is that searching for 'v=' through the entire HTML file and, when found, reading the 11 characters and pushing them into the vd array is slow; it takes a considerable amount of time. Please help me do this in a better way.
<?php
function substring(&$string,$start,$end)
{
$pos = strpos(">".$string,$start);
if(! $pos) return "";
$pos--;
$string = substr($string,$pos+strlen($start));
$posend = strpos($string,$end);
$toret = substr($string,0,$posend);
$string = substr($string,$posend);
return $toret;
}
$contents = @file_get_contents("xx.html");
$old="";
$videosArray=array();
while ($old <> $contents)
{
$old = $contents;
$v = substring($contents,"?v=","&");
if($v) $videosArray[] = $v;
}
//$videosArray is array of v's
?>
I would rather parse the HTML with SimpleXML and XPath:
// Get your page HTML string
$html = file_get_contents('xx.html');
// As per comment by Gordon to suppress invalid markup warnings
libxml_use_internal_errors(true);
// Create SimpleXML object
$doc = new DOMDocument();
$doc->strictErrorChecking = false;
$doc->loadHTML($html);
$xml = simplexml_import_dom($doc);
// Find <a> nodes whose href contains "v="
$anchors = $xml->xpath('//a[contains(@href, "v=")]');
foreach ($anchors as $a)
{
$href = (string)$a['href'];
$url = parse_url($href);
parse_str($url['query'], $params);
// $params['v'] contains what we need
$vd[] = $params['v']; // push into array
}
// Clear invalid markup error buffer
libxml_clear_errors();
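A variant of the same idea using DOMXPath directly, as a self-contained sketch. The sample markup and the 11-character ids are made up; the guards are there because real links may have no query string or no v parameter:

```php
<?php
// Hypothetical page content standing in for xx.html.
$html = '<div><div><a href="/watch?v=abcdefghijk&amp;list=blabla&amp;feature=plpp_play_all">One</a></div>'
      . '<a href="/watch?v=ZYXWVUTSRQP&amp;list=blabla">Two</a>'
      . '<a href="/about">No id</a></div>';

libxml_use_internal_errors(true);   // suppress warnings about invalid markup
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$vd = [];
foreach ($xpath->query('//a[contains(@href, "v=")]') as $a) {
    // Take only the query-string part of the link.
    $query = parse_url($a->getAttribute('href'), PHP_URL_QUERY);
    if (!is_string($query)) {
        continue; // link without a query string
    }
    parse_str($query, $params);
    if (isset($params['v'])) {
        $vd[] = $params['v'];
    }
}
print_r($vd);
```

Letting parse_str() split the query string means the code does not depend on v being exactly 11 characters long or on it being the first parameter.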
