Add space between textContent data scraped from website using PHP DOM

I am trying to add a comma and whitespace to some data I am scraping from a website. The data is scraped successfully, but the items run together, and the comma and space I am trying to add are only added after the last item. Here is the code I currently have:
$html = curl_exec($ch);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$finder = new DomXPath($dom);
$class_ops = 'ipc-inline-list ';
$class_opp = 'ipc-inline ';
$node = $finder->query("//div[@class='$class_ops']//ul[@class='$class_opp']");
foreach ($node as $index => $t) {
if ($index == 3) {
$la = $t->textContent.", ";
}
}
echo $la;
Current Result
DoyleBrainDavid,
Expected Result
Doyle, Brain, David

I use this code:
$c1 = curl_init('https://stackoverflow.com/');
curl_setopt($c1, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($c1);
if (curl_error($c1))
die(curl_error($c1));
// Get the status code
$status = curl_getinfo($c1, CURLINFO_HTTP_CODE);
curl_close($c1);
preg_match_all('/<span(.*?)<\/span>/s', $html, $matches1);
foreach($matches1[0] as $k=>$v){
$enc = mb_detect_encoding($v) ?: 'UTF-8';
// mb_convert_encoding takes (string, to_encoding, from_encoding)
$v = mb_convert_encoding($v, 'UTF-8', $enc);
$match1[$k] = strip_tags ($v);
//$match1[$k] = preg_replace('/^[^A-Za-z0-9]+/', '', $match1[$k]);
}
var_dump($match1);
In your case, you can change the pattern like this:
preg_match_all('/<div class="ipc-inline-list">(.*?)<\/div>/s', $html, $matches1);
This returns an array of matches.
I hope this helps.

You want each li, not the ul as one block. Try:
$node = $finder->query("//div[@class='$class_ops']//ul[@class='$class_opp']/li");
Demo: https://3v4l.org/Mvfud
If that doesn't work the actual HTML content should be added to the question.
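A minimal sketch of the per-li approach, run against made-up markup. The HTML string, the class names without trailing spaces, and the helper name joinListItems are assumptions for illustration, since the question does not show the real page:

```php
<?php
// Hypothetical markup standing in for the scraped page.
$html = '<div class="ipc-inline-list"><ul class="ipc-inline">'
      . '<li>Doyle</li><li>Brain</li><li>David</li></ul></div>';

// Collects the text of every <li> and joins the items with ", ".
function joinListItems(string $html, string $divClass, string $ulClass): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
    $dom->loadHTML($html);
    libxml_clear_errors();

    $finder = new DOMXPath($dom);
    $items  = $finder->query("//div[@class='$divClass']//ul[@class='$ulClass']/li");

    $names = [];
    foreach ($items as $li) {
        $names[] = trim($li->textContent);
    }

    // implode puts the separator between items only, never after the last one.
    return implode(', ', $names);
}

echo joinListItems($html, 'ipc-inline-list', 'ipc-inline'), "\n";
```

Collecting the items into an array and calling implode avoids the trailing ", " that string concatenation inside the loop would leave behind.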

Related

How to create a simple screen scraper in PHP

I am trying to create a simple screen scraper that gets me the price of a specific item. Here is an example of a product I want to get the price from:
https://www.flanco.ro/telefon-mobil-apple-iphone-14-5g-128gb-purple.html
This is the portion of the html code I am interested in:
[Screenshot of the relevant HTML; image not available]
I want to get the '4699' thing.
Here is what I have been trying to do but it does not seem to work:
$html = file_get_contents("https://www.flanco.ro/telefon-mobil-apple-iphone-14-5g-128gb-purple.html");
$doc = new DomDocument();
$doc->loadHtml($html);
$xpath = new DomXPath($doc);
//Now query the document:
foreach ($xpath->query('/<span class="price">[0-9]*\\.[0-9]+/i') as $node) {
echo $node, "\n";
}

You could just use standard PHP string functions to get the price out of the $html:
$url = "https://www.flanco.ro/telefon-mobil-apple-iphone-14-5g-128gb-purple.html";
$html = file_get_contents($url);
$seek = '<span class="special-price"><span class="price">';
$end = strpos($html, $seek) + strlen($seek);
$price = substr($html, $end, strpos($html, ',', $end) - $end);
Or something similar. This is all the code you need. This code returns:
4.699
My point is: In this particular case you don't need to parse the DOM and use a regular expression to get that single price.

Since there are a few price classes on the page, I would specifically target the pricesPrp class.
Also, in your foreach you are trying to convert a DOMElement object into a string, which won't work.
Update your XPath query as follows:
$query = $xpath->query('//div[@class="pricesPrp"]//span[@class="special-price"]//span[@class="price"]');
If you want to see the different nodes:
echo '<pre>';
foreach ($query as $node) {
var_dump($node);
}
And if you want to get that specific price :
$price = $query->item(0)->nodeValue;
echo $price;
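Put together, the query above might be wrapped as follows. This is only a sketch: the HTML fragment is invented to mirror the class names mentioned in this answer, and extractPrice is a hypothetical helper name; the real page's markup may differ.

```php
<?php
// Hypothetical fragment; the live page may be structured differently.
$html = '<div class="pricesPrp"><span class="special-price">'
      . '<span class="price">4.699,00 lei</span></span></div>';

// Returns the text of the first matching price node, or null if none is found.
function extractPrice(string $html): ?string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // real-world HTML is rarely valid
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query(
        '//div[@class="pricesPrp"]//span[@class="special-price"]//span[@class="price"]'
    );

    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    // nodeValue is a string, so it can be echoed, unlike the DOMElement itself.
    return trim($nodes->item(0)->nodeValue);
}

echo extractPrice($html) ?? 'price not found', "\n";
```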

$html = file_get_contents('PASTE_URL');
$doc = new DOMDocument();
@$doc->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8"));
$selector = new DOMXPath($doc);
$result = $selector->query('//span[@class="price"]');
foreach($result as $node) {
echo $node->nodeValue;
}

PHP Extracting a part of the string if found

I am scraping a website and looking for a string; when that string is found, I want to extract part of it.
I am looking for the string "twitter:image" in a website; when it is found, I want to extract its "content" value. Here's an example of the website that I'm scraping. This is the HTML or "View Source" of that website:
Here is an example of my code:
I am using a library called "ProxyCrawl"
$ch = new ProxyCrawl();
$response = $ch->get($link, false);
if ($response->original_status == 200) {
$result = $response->body;
if (strpos($result, 'name="twitter:image"') !== false) {
Log::debug("found!");
//then extract the content
} else {
//do nothing
}
}
I already have the code for checking whether "twitter:image" exists, but I don't have the code for extracting the "content" value.
Any help is greatly appreciated. Thanks!

If <meta name="twitter:image" /> is a unique element on page then use this:
$ch = new ProxyCrawl();
$response = $ch->get($link, false);
if ($response->original_status == 200) {
$dom = new DOMDocument;
$dom->loadHTML($response->body);
$xpath = new DOMXpath($dom);
$element = $xpath->query("//meta[@name='twitter:image']/@content");
if (!empty($element->item(0))) {
$imageUrl = $element->item(0)->nodeValue;
}
}
Otherwise, if there are multiple elements of this kind, you will need to iterate:
$ch = new ProxyCrawl();
$response = $ch->get($link, false);
if ($response->original_status == 200) {
$dom = new DOMDocument;
$dom->loadHTML($response->body);
$xpath = new DOMXpath($dom);
$imageUrls = [];
$elements = $xpath->query("//meta[@name='twitter:image']");
if ($elements !== false) {
foreach ($elements as $element) {
$imageUrls[] = $element->getAttribute('content');
}
}
}
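The same lookup can be tried without ProxyCrawl by feeding a string straight to DOMDocument. In this sketch the sample markup, the example.com URL, and the helper name twitterImage are all assumptions for illustration:

```php
<?php
// Returns the content attribute of the first twitter:image meta tag, or null.
function twitterImage(string $body): ?string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // keep parser warnings out of the output
    $dom->loadHTML($body);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // Selecting /@content returns the attribute node itself.
    $attrs = $xpath->query("//meta[@name='twitter:image']/@content");

    if ($attrs === false || $attrs->length === 0) {
        return null;
    }
    return $attrs->item(0)->nodeValue;
}

// Hypothetical page source.
$sample = '<html><head>'
        . '<meta name="twitter:image" content="https://example.com/img.png">'
        . '</head><body></body></html>';

echo twitterImage($sample) ?? 'not found', "\n";
```

Because the XPath query already tests for the tag, the separate strpos check from the question is no longer needed.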

This is a really quick example but a regex would be the way to go:
/(name=\"twitter:image\")(.)content=\"(.+)\"/im
This would match a string that contains name="twitter:image" followed by content=". You can get the text of content from the third grouping:
$str = '<meta data-rl="true" name="twitter:image" content="testing"';
$regex = '/(name="twitter:image")(.)content="(.+)"/im';
preg_match_all($regex, $str, $results);
print_r($results);
This is a rough example, you'll have to use this as a basis for your exact implementation. There are cleaner solutions to this (and probably better regexes) but this will get you going.
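One possible tightening, offered as a sketch rather than a drop-in replacement: the greedy (.+) keeps matching up to the last double quote on the line, so any attribute that follows content ends up in the captured value. Using [^"]* stops at the closing quote instead. The helper name twitterImageByRegex is hypothetical, and the pattern still assumes that content comes directly after name.

```php
<?php
// Captures the content value that directly follows name="twitter:image".
function twitterImageByRegex(string $html): ?string
{
    // [^"]* cannot run past the closing quote of the attribute value.
    $pattern = '/name="twitter:image"\s+content="([^"]*)"/i';

    return preg_match($pattern, $html, $m) === 1 ? $m[1] : null;
}

$str = '<meta data-rl="true" name="twitter:image" content="testing" class="x">';
echo twitterImageByRegex($str) ?? 'not found', "\n";
```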

I don't know laravel (I use Symfony) and I am new to StackOverflow, but something like this could work:
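if (strpos($result, 'name="twitter:image"') !== false) {
// Position of the name attribute
$namestart = strpos($result, 'name="twitter:image"');
// Look for content=" only after the name attribute (assumes content follows name)
$marker = 'content="';
$contentstart = strpos($result, $marker, $namestart) + strlen($marker);
// The next double quote ends the attribute value
$contentend = strpos($result, '"', $contentstart);
$content = substr($result, $contentstart, $contentend - $contentstart);
}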
Not tested!

PHP Web Crawler doesn't crawl .php files

This is the simple webcrawler I was trying to build
<?php
$to_crawl = "http://samplewebsite.com/about.php";
function get_links($url)
{
$input = @file_get_contents($url);
$regexp = " <a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a> ";
preg_match_all("/$regexp/siU", $input, $matches);
$l = $matches[2];
foreach ($l as $link) {
echo $link."</br>";
}
}
get_links($to_crawl);
?>
When I try to run the script with the $to_crawl variable set to a url ending with a file name, e.g. "facebook.com/about", it works, but for some reason, it just echo's nothing when the link is ending with a '.php' filename. Can someone please help?

To get all links and their inner texts, you can use DOMDocument like this:
$dom = new DOMDocument;
@$dom->loadHTML($input); // Your input (HTML code)
$xp = new DOMXPath($dom);
$links = $xp->query('//a[@href]'); // XPath to get only <a> tags with a href attribute
$result = array();
foreach ($links as $link) {
$result[] = array($link->getAttribute("href"), $link->nodeValue);
}
print_r($result);

Extract content of specific div preserving only certain elements

I need to extract only the textual part of a webpage, preserving all and only the <p>, <h2>, <h3>, <h4> and <blockquote> elements.
Using DOMXPath with $div = $xpath->query('//div[@class="story-inner"]'); returns lots of unwanted page elements inside the text div, such as pictures, ad blocks and other custom markup.
On the other hand, the following code:
$items = $doc->getElementsByTagName('p');
for ($i = 0; $i < $items->length; $i++) {
echo $items->item($i)->nodeValue . "<p>";
}
gives a clean result very close to what I want, but with the <h2>, <h3>, <h4> and <blockquote> elements missing.
Is there a DOM way of either (1) selecting only the desired page elements and extracting a clean result, or (2) efficiently cleaning up the output obtained with $div = $xpath->query('//div[@class="story-inner"]');?

You could use an "or" inside your XPath query in this case. Just chain the tag names with it to get only the desired ones.
$url = "http://www.example.com/russian/international/2015/02/150218_ukraine_debaltseve_fighting";
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$html = curl_exec($curl);
curl_close($curl);
$doc = new DOMDocument();
$html = mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8");
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$tags = array('p', 'h2');
$children_needed = implode(' or ', array_map(function($tag){ return sprintf('name()="%s"', $tag); }, $tags));
$query = "//div[@class='story-body__inner']//*[$children_needed]";
$div_children = $xpath->query($query);
if($div_children->length > 0) {
foreach($div_children as $child) {
echo $doc->saveHTML($child);
}
}
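The same idea can be written with the XPath self:: axis and extended to all five tags from the question. The following is a sketch on invented markup; the helper name extractAllowedTags and the sample HTML are assumptions:

```php
<?php
// Returns the HTML of every descendant of the container whose tag is in $tags.
function extractAllowedTags(string $html, string $containerClass, array $tags): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parser warnings silently
    $doc->loadHTML($html);
    libxml_clear_errors();

    // Builds e.g. "self::p or self::h2 or self::blockquote".
    $filter = implode(' or ', array_map(fn($tag) => "self::$tag", $tags));

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query("//div[@class='$containerClass']//*[$filter]");

    $fragments = [];
    foreach ($nodes as $node) {
        // saveHTML($node) serializes just that element; trim drops stray newlines.
        $fragments[] = trim($doc->saveHTML($node));
    }
    return $fragments;
}

// Hypothetical article body with an image and an ad block mixed in.
$sample = '<div class="story-body__inner"><h2>Title</h2><img src="a.png">'
        . '<p>First</p><div class="ad">Ad</div><blockquote>Quote</blockquote></div>';

print_r(extractAllowedTags($sample, 'story-body__inner', ['p', 'h2', 'h3', 'h4', 'blockquote']));
```

Note that a <p> nested inside a <blockquote> would be returned twice, once inside the blockquote and once on its own, so nested content may need de-duplicating.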

If I understood your question correctly, is this what you are asking for?
$output1=preg_match('/^.*<tagName>(.*)<\/tagName>/', $value,$match1);
Match on the tag names and capture the data between them using preg_match.

extracting anchor values hidden in div tags

From an HTML page I need to extract the value of v from every anchor link. Each anchor is nested inside about five div tags:
<a href="/watch?v=value to be retrieved&list=blabla&feature=plpp_play_all">
Each v value has 11 characters. So far I have been reading the file character by character, like this:
<?php
$file=fopen("xx.html","r") or exit("Unable to open file!");
$d='v';
$dd='=';
$vd=array();
while (!feof($file))
{
$f=fgetc($file);
if($f==$d)
{
$ff=fgetc($file);
if ($ff==$dd)
{
$idea='';
for($i=0;$i<=10;$i++)
{
$sData = fgetc($file);
$id=$id.$sData;
}
array_push($vd, $id);
That is, I read each character after v= into $sData and append it to $id, so that the 11 characters form one string.
The problem is that scanning the entire HTML file for 'v=' one character at a time, then reading the 11 characters and pushing them into an array, takes a considerable amount of time. Please help me find a better way to do this.

<?php
function substring(&$string,$start,$end)
{
$pos = strpos(">".$string,$start);
if(! $pos) return "";
$pos--;
$string = substr($string,$pos+strlen($start));
$posend = strpos($string,$end);
$toret = substr($string,0,$posend);
$string = substr($string,$posend);
return $toret;
}
$contents = @file_get_contents("xx.html");
$old="";
$videosArray=array();
while ($old <> $contents)
{
$old = $contents;
$v = substring($contents,"?v=","&");
if($v) $videosArray[] = $v;
}
//$videosArray is array of v's
?>

I would rather parse the HTML with SimpleXML and XPath:
// Get your page HTML string
$html = file_get_contents('xx.html');
// As per comment by Gordon to suppress invalid markup warnings
libxml_use_internal_errors(true);
// Create SimpleXML object
$doc = new DOMDocument();
$doc->strictErrorChecking = false;
$doc->loadHTML($html);
$xml = simplexml_import_dom($doc);
// Find <a> nodes
$anchors = $xml->xpath('//a[contains(@href, "v=")]');
foreach ($anchors as $a)
{
$href = (string)$a['href'];
$url = parse_url($href);
parse_str($url['query'], $params);
// $params['v'] contains what we need
$vd[] = $params['v']; // push into array
}
// Clear invalid markup error buffer
libxml_clear_errors();
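A variant of the same approach that stays with DOMXPath and skips the SimpleXML import might look like this. It is a sketch on invented markup: the helper name extractVideoIds and the sample links are assumptions.

```php
<?php
// Returns the v query parameter of every link whose href contains "v=".
function extractVideoIds(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // suppress invalid-markup warnings
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $ids = [];
    foreach ($xpath->query('//a[contains(@href, "v=")]') as $a) {
        // parse_url handles relative URLs such as /watch?v=...
        $query = parse_url($a->getAttribute('href'), PHP_URL_QUERY);
        if (!is_string($query)) {
            continue;   // href has no query string
        }
        parse_str($query, $params);
        if (isset($params['v']) && is_string($params['v'])) {
            $ids[] = $params['v'];
        }
    }
    return $ids;
}

// Hypothetical page: anchors nested in divs, plus one unrelated link.
$sample = '<div><div><a href="/watch?v=abcdefghijk&amp;list=PL1&amp;feature=plpp_play_all">one</a></div></div>'
        . '<div><a href="/watch?v=ABCDEFGHIJ1&amp;list=PL1">two</a></div>'
        . '<a href="/about">about</a>';

print_r(extractVideoIds($sample));
```

Letting parse_str split the query string means the code does not depend on v being exactly 11 characters long or on its position among the parameters.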
