Website Scraping from DoMDocument using php

I have PHP code that extracts the categories and displays them. However,
I still can't extract the numbers that go along with them (without the brackets).
The categories and the numbers need to be extracted separately, not together.
Maybe this needs another for loop using regex, etc.
This is the code:
<?php
$grep = new DoMDocument();
@$grep->loadHTMLFile("http://www.lelong.com.my/Auc/List/BrowseAll.asp");
$finder = new DomXPath($grep);
$class = "CatLevel1";
$nodes = $finder->query("//*[contains(@class, '$class')]");
foreach ($nodes as $node) {
    $span = $node->childNodes;
    echo $span->item(0)->nodeValue."<br>";
}
?>
Is there any way I could do that? Thanks!
This is my desired output:
Arts, Antiques & Collectibles : 9768<br>
B2B & Industrial Products : 2342<br>
Baby : 3453<br>
etc...

Just add the other sibling as well. Example:
foreach ($nodes as $node) {
    $span = $node->childNodes;
    echo $span->item(0)->nodeValue . ': ' . str_replace(array('(', ')'), '', $span->item(1)->nodeValue);
    echo '<br/>';
}
EDIT: str_replace is enough for the simple purpose of removing the parentheses.
Sidenote: Always declare UTF-8 encoding in the output of your PHP script, for example:
header('Content-Type: text/html; charset=utf-8');
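If you would rather keep the category and the number as separate values before printing them, here is a minimal sketch. The markup in $html is a made-up stand-in for the real page (I am assuming each CatLevel1 element holds the name as its first child and the bracketed count as its second), so adjust the child indexes to whatever the site actually serves.

```php
<?php
// Hypothetical markup standing in for the real page (an assumption, not the site's HTML)
$html = '<div class="CatLevel1"><a href="#">Arts, Antiques &amp; Collectibles</a><span>(9768)</span></div>'
      . '<div class="CatLevel1"><a href="#">Baby</a><span>(3453)</span></div>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$rows = [];
foreach ($xpath->query("//*[contains(@class, 'CatLevel1')]") as $node) {
    $children = $node->childNodes;
    // First child: category name. Second child: "(1234)", reduced to its digits.
    $name  = trim($children->item(0)->nodeValue);
    $count = (int) preg_replace('/\D/', '', $children->item(1)->nodeValue);
    $rows[$name] = $count;
}

foreach ($rows as $name => $count) {
    echo $name . ' : ' . $count . "<br>\n";
}
```

Keeping the pairs in $rows means the numbers stay usable as integers (for sorting or summing) instead of only as printed text.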

Related

How to create a simple screen scraper in PHP

I am trying to create a simple screen scraper that gets me the price of a specific item. Here is an example of a product I want to get the price from:
https://www.flanco.ro/telefon-mobil-apple-iphone-14-5g-128gb-purple.html
The portion of the HTML I am interested in was posted as an image, which is not reproduced here.
I want to get the '4699' thing.
Here is what I have been trying to do but it does not seem to work:
$html = file_get_contents("https://www.flanco.ro/telefon-mobil-apple-iphone-14-5g-128gb-purple.html");
$doc = new DomDocument();
$doc->loadHtml($html);
$xpath = new DomXPath($doc);
//Now query the document:
foreach ($xpath->query('/<span class="price">[0-9]*\\.[0-9]+/i') as $node) {
echo $node, "\n";
}

You could just use standard PHP string functions to get the price out of the $html:
$url = "https://www.flanco.ro/telefon-mobil-apple-iphone-14-5g-128gb-purple.html";
$html = file_get_contents($url);
$seek = '<span class="special-price"><span class="price">';
$end = strpos($html, $seek) + strlen($seek);
$price = substr($html, $end, strpos($html, ',', $end) - $end);
Or something similar. This is all the code you need. This code returns:
4.699
My point is: In this particular case you don't need to parse the DOM and use a regular expression to get that single price.

Since there are several price classes on the page, I would specifically target the pricesPrp class.
Also, in your foreach you are trying to echo a DOMElement object as a string, which won't work.
Update your XPath query like this:
$query = $xpath->query('//div[@class="pricesPrp"]//span[@class="special-price"]//span[@class="price"]');
If you want to see the different nodes:
echo '<pre>';
foreach ($query as $node) {
var_dump($node);
}
And if you want to get that specific price :
$price = $query->item(0)->nodeValue;
echo $price;
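The node value comes back as formatted text, so if you need the plain '4699' you still have to normalise it. Below is a rough sketch of that step. The HTML string is invented for illustration, and I am assuming the price uses a dot as thousands separator and a comma for decimals, which matches the "4.699" output mentioned above.

```php
<?php
// Made-up fragment imitating the structure targeted by the query above (an assumption)
$html = '<div class="pricesPrp"><span class="special-price">'
      . '<span class="price">4.699,00 lei</span></span></div>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$query = $xpath->query('//div[@class="pricesPrp"]//span[@class="special-price"]//span[@class="price"]');

$value = null;
if ($query !== false && $query->length > 0) {
    // Keep only digits and separators, e.g. "4.699,00"
    $raw = preg_replace('/[^\d.,]/', '', $query->item(0)->nodeValue);
    // Drop thousands dots, then turn the decimal comma into a dot
    $value = (float) str_replace(['.', ','], ['', '.'], $raw);
}

echo $value === null ? "price not found\n" : (int) $value . "\n";
```

Checking the result length first avoids calling nodeValue on null when the page layout changes and the query matches nothing.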

$html = file_get_contents('PASTE_URL');
$doc = new DOMDocument();
@$doc->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8"));
$selector = new DOMXPath($doc);
$result = $selector->query('//span[@class="price"]');
foreach ($result as $node) {
    echo $node->nodeValue;
}

variable echoing out the last of 3 numbers

Hi, I've currently got an output echoing 176 8 58 from a web scraping script. I want to pack this output into a variable and echo it in other places on the website.
I've packed it up like this:
ob_start();
echo $node->nodeValue. "\n";
$thenumbers = ob_get_contents();
ob_end_clean();
but when I echo it out like this
<?php echo $thenumbers ?>
my output is then 176 8 58
On the website the numbers are in spans and are separated by "/". Do I need to do anything fancy? I'm fairly new to PHP, so let me know if it's something stupid!
Would really appreciate a bit of help.
(This is the web scraping script I'm using. I had to hide the website I'm scraping, as it's in development.)
<?php
$teamlink = rwmb_meta( 'WEBSITE_HIDDEN' );
$arr = array( $teamlink );
foreach ($arr as &$value) {
$file = $DOCUMENT_ROOT. $value;
$doc = new DOMDocument();
$doc->loadHTMLFile($file);
$xpath = new DOMXpath($doc);
$elements = $xpath->query("//*[contains(@class, 'table')]/tr[3]/td[3]/span");
if (!is_null($elements)) {
foreach ($elements as $element) {
$nodes = $element->childNodes;
foreach ($nodes as $node) {
ob_start();
echo $node->nodeValue. "\n";
$win_loss = ob_get_contents();
ob_end_clean();
}
}
}
}
?>
p.s I know the script works as its currently outputting standard text fine.

My apologies if I have completely misunderstood your question.
If you want to put a "/" between the numbers, where the spaces are, you could do:
echo str_replace(' ','/',$thenumbers);
If you just want to show the last 3 digits (after stripping the spaces from the string), you could do:
echo substr(str_replace(' ','',$thenumbers),-3);
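One more thought: in the script from the question, the variable is reassigned on every pass of the inner loop, so only the last node survives, and output buffering is not needed just to capture a value. A small sketch of collecting the values instead follows. The markup and the "record" class are invented here, since the real site is hidden.

```php
<?php
// Invented markup: three numbers in spans, separated by "/" (the real site is unknown)
$html = '<span class="record"><span>176</span>/<span>8</span>/<span>58</span></span>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$numbers = [];
foreach ($xpath->query("//span[@class='record']/span") as $node) {
    // Append each value rather than overwriting a single variable
    $numbers[] = trim($node->nodeValue);
}

// Join once after the loop; the string can now be echoed anywhere
$thenumbers = implode(' / ', $numbers);
echo $thenumbers . "\n";
```

Because $numbers is an array, end($numbers) or $numbers[0] gives you a single figure if you only want one of them.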

Website Scraping Using Regex trying to extract integers

I'm having trouble extracting the integers between the brackets from this website.
Part of markup from the website:
<span class="b-label b-link-number" data-num="(322206)">Music & Video</span>
<span class="b-label b-link-number" data-num="(954218)">Toys, Hobbies & Games</span>
<span class="b-label b-link-number" data-num="(502981)">Kids, Baby & Maternity</span>
How do I extract the integers between the brackets?
Desired output:
322206
954218
502981
Should I use regex, since they all have the same class name? (A regex that simply matches between brackets won't do, since the source code has other unwanted elements inside brackets as well.)
Normally, this would be the way I use to extract information:
<?php
//header('Content-Type: text/html; charset=utf-8');
$grep = new DoMDocument();
@$grep->loadHTMLFile("http://global.rakuten.com/en/search/?tl=&k=");
$finder = new DomXPath($grep);
$class = "b-list-item";
$nodes = $finder->query("//*[contains(@class, '$class')]");
foreach ($nodes as $node) {
$span = $node->childNodes;
$search = array(0,1,2,3,4,5,6,7,8,9,'(',')');
$categories = str_replace($search, '', $span->item(0)->nodeValue);
echo '<br>' . '<font color="green">' . $categories . ' ' . '</font>' ;
}
?>
But since the data I want is inside the tag itself (in an attribute), how do I extract it?

Building on your current code, it's straightforward: just change $class to the class you want and use ->getAttribute() to read the data-num values:
$grep = new DoMDocument();
@$grep->loadHTMLFile("http://global.rakuten.com/en/search/?tl=&k=");
$finder = new DomXPath($grep);
$class = "b-link-number"; // change the span class
$nodes = $finder->query("//*[contains(@class, '$class')]"); // target those
$numbers = array();
foreach ($nodes as $node) { // for every element found
    $link_num = $node->getAttribute('data-num'); // get the attribute `data-num`
    $link_num = str_replace(['(', ')'], '', $link_num); // simply remove the parentheses
    $numbers[] = $link_num; // push it into the container
}
echo '<pre>';
print_r($numbers);

<span[^>)()]*\((\d+)\)[^>]*>
Try this and grab the first capture group. See the demo:
http://regex101.com/r/iM2wF9/10
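For completeness, this is roughly how that pattern could be applied in PHP with preg_match_all. The sample string is the markup quoted in the question; the variable names are mine.

```php
<?php
// Markup copied from the question
$html = '<span class="b-label b-link-number" data-num="(322206)">Music & Video</span>'
      . '<span class="b-label b-link-number" data-num="(954218)">Toys, Hobbies & Games</span>'
      . '<span class="b-label b-link-number" data-num="(502981)">Kids, Baby & Maternity</span>';

// The pattern from above, using ~ as delimiter; group 1 holds the digits
preg_match_all('~<span[^>)()]*\((\d+)\)[^>]*>~', $html, $matches);

$numbers = $matches[1];
echo implode("\n", $numbers) . "\n";
```

The pattern only looks inside the opening span tag, so brackets that appear in the visible text of the page are not picked up.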

How to remove invalid element from DOM?

We have the following code that lists the xpaths where $value is found.
For a given URL we have detected a non-standard tag, td1, which in addition has no closing tag. The site developers have probably put it there intentionally.
This element causes problems when identifying the correct XPath for nodes.
A broken Xpath example :
/html/body/div[2]/div[2]/table/tr[2]/td/table/tr[1]/td[2]/table/tr[2]/td[2]/table[3]/tr[2]/**td1**/td[2]/span/u[1]
(as you see td1 is identified and chained in the Xpath)
We think by removing this element it helps us to build the valid XPath we are after.
A valid example is
/html/body/div[2]/div[2]/table/tr[2]/td/table/tr[1]/td[2]/table/tr[2]/td[2]/table[3]/tr[2]/td[2]/span/u[1]
How can we remove it before loading the document into DOMXPath? Is there some other approach?
We would like to remove all invalid tags, which may be something other than td1, such as h8, diw, etc.
private function extract($url, $value) {
$dom = new DOMDocument();
$file = 'content.txt';
//$current = file_get_contents($url);
$current = CurlTool::downloadFile($url, $file);
//file_put_contents($file, $current);
@$dom->loadHTMLFile($current);
//use DOMXpath to navigate the html with the DOM
$dom_xpath = new DOMXpath($dom);
$elements = $dom_xpath->query("//*[text()[contains(., '" . $value . "')]]");
var_dump($elements);
if (!is_null($elements)) {
foreach ($elements as $element) {
var_dump($element);
echo "\n1.[" . $element->nodeName . "]\n";
$nodes = $element->childNodes;
foreach ($nodes as $node) {
if( ($node->nodeValue != null) && ($node->nodeValue === $value) ) {
echo '2.' . $node->nodeValue . "\n";
$xpath = preg_replace("/\/text\(\)/", "", $node->getNodePath());
echo '3.' . $xpath . "\n";
}
}
}
}
}

You could use XPath to find the offending nodes and remove them, while promoting its children into its place in the DOM. Then your paths will be correct.
$dom_xpath = new DOMXpath($dom);
$results = $dom_xpath->query('//td1'); // (or any offending element)
foreach ($results as $invalidNode)
{
    $parentNode = $invalidNode->parentNode;
    // childNodes is always a (truthy) object, so test firstChild instead
    while ($invalidNode->firstChild)
    {
        $firstChild = $invalidNode->firstChild;
        $parentNode->insertBefore($firstChild, $invalidNode);
    }
    $parentNode->removeChild($invalidNode);
}
EDIT:
You could also build a list of offending elements by using a list of valid elements and negating it.
// Build list manually from the HTML spec:
// See: http://www.w3.org/TR/html5/section-index.html#elements-1
$validTags = array();
// Convert list to XPath:
$validTagsStr = '';
foreach ($validTags as $tag)
{
if ($validTagsStr)
{ $validTagsStr .= ' or '; }
$validTagsStr .= 'self::'.$tag;
}
$results = $dom_xpath->query('//*[not('.$validTagsStr.')]');
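Putting the two ideas together, here is a compact sketch that unwraps every element not on a whitelist. The input string and the very short $validTags list are only for illustration; a real list would have to come from the HTML spec as noted above.

```php
<?php
// Illustrative input with a bogus <td1> wrapper (not the real page)
$html = '<div><td1><span>a</span><span>b</span></td1></div>';

libxml_use_internal_errors(true);   // unknown tags raise parser warnings
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Deliberately tiny whitelist, for the example only
$validTags = ['html', 'head', 'body', 'div', 'span'];
$tests = implode(' or ', array_map(function ($tag) {
    return 'self::' . $tag;
}, $validTags));

$xpath = new DOMXPath($dom);
foreach (iterator_to_array($xpath->query('//*[not(' . $tests . ')]')) as $invalid) {
    // Promote the children into the invalid element's place, then drop it
    while ($invalid->firstChild) {
        $invalid->parentNode->insertBefore($invalid->firstChild, $invalid);
    }
    $invalid->parentNode->removeChild($invalid);
}

$paths = [];
foreach ($xpath->query('//span') as $span) {
    $paths[] = $span->getNodePath();
}
print_r($paths);
```

After the loop the node paths no longer contain the td1 step, which is what the question was after.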

Sooo... perhaps str_replace("<td1 va-laign=\"top\">", "", $current) could do the trick?

php preg_replace need help

I have created a function to search through strings and replace keywords in those strings with links. I am using
preg_replace('~\b(?<!=")(?<!=\')(?<!=)(?<!>)' . $keyword . '(?!</a)\b~', $newString, $row);
which is working as expected. The only issue is that if someone had a link like this
Luxury Automobile sales
Automobile being our $keyword in this example.
It would end up looking like
Luxury <a href="www.domain.tdl/keywords.html">Automobile Sales</a>
You can understand my frustration.
Not being confident in regex I thought I would ask if anyone here would know a solution.
Thanks!

How about a proper HTML parser like DOMDocument?
$html = 'Luxury Automobile sales';
$dom = new DomDocument;
$dom->loadHTML($html);
$nodes = $dom->getElementsByTagName('a');
foreach ($nodes as $node)
{
$node->nodeValue = str_replace('Automobile', 'Cars', $node->nodeValue);
echo simplexml_import_dom($node)->asXML();
}
Changing an element's attribute works the same way:
foreach ($nodes as $node)
{
$attr = $node->getAttributeNode('href');
$attr->value = str_replace('Automobile', 'keyword', $attr->value);
echo simplexml_import_dom($node)->asXML();
}
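To address the original problem (linking a keyword everywhere except inside existing links), one possible approach is to walk only the text nodes that have no <a> ancestor. This is a sketch, not a drop-in replacement: linkKeyword and its parameters are hypothetical names, and it assumes the input is an HTML fragment in plain ASCII.

```php
<?php
// Hypothetical helper: wraps $keyword in a link, skipping text already inside <a>
function linkKeyword(string $html, string $keyword, string $href): string
{
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    // Wrap the fragment so there is a single container to read back from
    $dom->loadHTML('<div id="wrap">' . $html . '</div>');
    libxml_clear_errors();

    $xpath   = new DOMXPath($dom);
    $pattern = '/\b(' . preg_quote($keyword, '/') . ')\b/';

    foreach ($xpath->query('//div[@id="wrap"]//text()[not(ancestor::a)]') as $text) {
        // Odd indexes in $parts are the captured keyword occurrences
        $parts = preg_split($pattern, $text->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // keyword not present in this text node
        }
        $fragment = $dom->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {
                $a = $dom->createElement('a');
                $a->setAttribute('href', $href);
                $a->appendChild($dom->createTextNode($part));
                $fragment->appendChild($a);
            } elseif ($part !== '') {
                $fragment->appendChild($dom->createTextNode($part));
            }
        }
        $text->parentNode->replaceChild($fragment, $text);
    }

    $out = '';
    foreach ($xpath->query('//div[@id="wrap"]')->item(0)->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$out = linkKeyword(
    'Luxury Automobile sales and <a href="old.html">Luxury Automobile sales</a>',
    'Automobile',
    'keywords.html'
);
echo $out . "\n";
```

The keyword in the plain text gets linked, while the one inside the existing anchor is left alone, so no nested links are produced.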
