Website Scraping Using Regex trying to extract integers

Website Scraping Using Regex trying to extract integers - php

I'm having trouble to extract the integers between the brackets from this website.
Part of markup from the website:
<span class="b-label b-link-number" data-num="(322206)">Music & Video</span>
<span class="b-label b-link-number" data-num="(954218)">Toys, Hobbies & Games</span>
<span class="b-label b-link-number" data-num="(502981)">Kids, Baby & Maternity</span>
How do I extract the integers between the brackets?
Desired output:
322206
954218
502981
Should I use Regex since they got the same class name (but not Regex to get between brackets since there are other unwanted elements inside bracket as well from the source code).
Normally, this would be the way I use to extract information:
<?php
//header('Content-Type: text/html; charset=utf-8');
$grep = new DoMDocument();
#$grep->loadHTMLFile("http://global.rakuten.com/en/search/?tl=&k=");
$finder = new DomXPath($grep);
$class = "b-list-item";
$nodes = $finder->query("//*[contains(#class, '$class')]");
foreach ($nodes as $node) {
$span = $node->childNodes;
$search = array(0,1,2,3,4,5,6,7,8,9,'(',')');
$categories = str_replace($search, '', $span->item(0)->nodeValue);
echo '<br>' . '<font color="green">' . $categories . ' ' . '</font>' ;
}
?>
but since the data I want is inside the tag, how do I extract them?

Adding on your current code, its simply straight forward, just change that $class to that class you desire and use ->getAttribute() to get those data-num's:
$grep = new DoMDocument();
#$grep->loadHTMLFile("http://global.rakuten.com/en/search/?tl=&k=");
$finder = new DomXPath($grep);
$class = "b-link-number"; // change the span class
$nodes = $finder->query("//*[contains(#class, '$class')]"); // target those
$numbers = array();
foreach ($nodes as $node) { // for every found elemenet
$link_num = $node->getAttribute('data-num'); // get the attribute `data-num`
$link_num = str_replace(['(', ')'], '', $link_num); // simply remove those parenthesis
$numbers[] = $link_num; // push it inside the container
}
echo '<pre>';
print_r($numbers);

<span[^>)()]*\((\d+)\)[^>]*>
Try this.Grab the capture.See demo.
http://regex101.com/r/iM2wF9/10

Related

php echo "%" displaying more than once

Testing with data scraping. The output I'm scraping, is a percent. So I basically slapped on a
echo "%<br>";
At the end of the actual number output which is
echo $ret_[66];
However there's an issue where the percent is actually appearing before the number as well, which is not desirable. This is the output:
%
-0.02%
Whereas what I'm trying to get is just -0.02%
Clearly I'm doing something wrong with the PHP. I'd really appreciate any feedback/solutions. Thank you!
Full code:
<?php
error_reporting(E_ALL^E_NOTICE^E_WARNING);
include_once "global.php";
$doc = new DOMDocument;
// We don't want to bother with white spaces
$doc->preserveWhiteSpace = false;
$doc->strictErrorChecking = false;
$doc->recover = true;
$doc->loadHTMLFile('http://www.moneycontrol.com/markets/global-indices/');
$xpath = new DOMXPath($doc);
$query = "//div[#class='MT10']";
$entries = $xpath->query($query);
foreach ($entries as $entry) {
$result = trim($entry->textContent);
$ret_ = explode(' ', $result);
//make sure every element in the array don't start or end with blank
foreach ($ret_ as $key => $val){
$ret_[$key] = trim($val);
}
//delete the empty element and the element is blank "\n" "\r" "\t"
//I modify this line
$ret_ = array_values(array_filter($ret_,deleteBlankInArray));
//echo the last element
echo $ret_[66];
echo "%<br>";
}

<?php
echo "%<br>";
?>
On a seperate following PHP code. Does the same thing.

PHP Preg Replace - Match String with Space - Wordpress

I'm trying to scan my wordpress content for:
<p><span class="embed-youtube">some iframed video</span></p>
and then change it into:
<p class="img_wrap"><span class="embed-youtube">some iframed video</span></p>
using the following code in my function.php file in my theme:
$classes = 'class="img_wrap"';
$youtube_match = preg_match('/(<p.*?)(.*?><span class="embed-youtube")/', $content, $youtube_array);
if(!empty($youtube_match))
{
$content = preg_replace('/(<p.*?)(.*?><span class=\"embed-youtube\")/', '$1 ' . $classes . '$2', $content);
}
but for some reason I am not getting a match on my regex nor is the replace working. I don't understand why there isn't a match because the span with class embed-youtube exists.
UPDATE - HERE IS THE FULL FUNCTION
function give_attachments_class($content){
$classes = 'class="img_wrap"';
$img_match = preg_match("/(<p.*?)(.*?><img)/", $content, $img_array);
$youtube_match = preg_match('/(<p.*?)(.*?><span class="embed-youtube")/', $content, $youtube_array);
// $doc = new DOMDocument;
// #$doc->loadHTML($content); // load the HTML data
// $xpath = new DOMXPath($doc);
// $nodes = $xpath->query('//p/span[#class="embed-youtube"]');
// foreach ($nodes as $node) {
// $node->parentNode->setAttribute('class', 'img_wrap');
// }
// $content = $doc->saveHTML();
if(!empty($img_match))
{
$content = preg_replace('/(<p.*?)(.*?><img)/', '$1 ' . $classes . '$2', $content);
}
else if(!empty($youtube_match))
{
$content = preg_replace('/(<p.*?)(.*?><span class=\"embed-youtube\")/', '$1 ' . $classes . '$2', $content);
}
$content = preg_replace("/<img(.*?)src=('|\")(.*?).(bmp|gif|jpeg|jpg|png)(|\")(.*?)>/", '<img$1 data-original=$3.$4 $6>' , $content);
return $content;
}
add_filter('the_content','give_attachments_class');

Instead of using regex, make effective use of DOM and XPath to do this for you.
$doc = new DOMDocument;
#$doc->loadHTML($html); // load the HTML data
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//p/span[#class="embed-youtube"]');
foreach ($nodes as $node) {
$node->parentNode->setAttribute('class', 'img_wrap');
}
echo $doc->saveHTML();

Here is a quick and dirty REGEX I did for you. It finds the entire string starting with p tag, ending p tag, span also included etc. I also wrote it to include single or double quotes for you since you never know and also to include spaces in various places. Let me know how it works out for you, thanks.
(<p )+(class=)['"]+img_wrap+['"](><span)+[ ]+(class=)+['"]embed-youtube+['"]>[A-Za-z0-9='" ]+(</span></p>)
I have tested it on your code and a few other variations and it works for me.

Website Scraping from DoMDocument using php

I have a php code that could extract the categories and display them. However,
I still can't extract the numbers that goes along with it too(without the bracket).
Need to be separated between the categories and number(not extract together).
Maybe do another for loop using Regex, etc...
This is the code:
<?php
$grep = new DoMDocument();
#$grep->loadHTMLFile("http://www.lelong.com.my/Auc/List/BrowseAll.asp");
$finder = new DomXPath($grep);
$class = "CatLevel1";
$nodes = $finder->query("//*[contains(#class, '$class')]");
foreach ($nodes as $node) {
$span = $node->childNodes;
echo $span->item(0)->nodeValue."<br>";
}
?>
Is there any way I could do that? Thanks!
This is my desired output:
Arts, Antiques & Collectibles : 9768<br>
B2B & Industrial Products : 2342<br>
Baby : 3453<br>
etc...

Just add the other sibling as well. Example:
foreach ($nodes as $node) {
$span = $node->childNodes;
echo $span->item(0)->nodeValue . ': ' . str_replace(array('(', ')'), '', $span->item(1)->nodeValue);
echo '<br/>';
}
EDIT: Just use str_replace for that simple purpose of removing that parenthesis.
Sidenote: Always put the UTF-8 Encoding on your PHP file.
header('Content-Type: text/html; charset=utf-8');

PHP nodevalue stripping html tags

I have seem similar solutions else where but I haven't been able to convert to work with my own code.
I have a function that splits an html string between the paragraph tags and returns in an array. Code is as follows...
$dom = new DOMDocument();
$dom->loadHTML($string);
$domx = new DOMXPath($dom);
$entries = $domx->evaluate("//p");
$result = array();
foreach ($entries as $entry) {
$result[] = '<' . $entry->tagName . '>' . $entry->nodeValue . '</' . $entry->tagName . '>';
}
return $result;
Can someone assist me to remove the nodeValue element from this so it returns the paragraph content with html tags complete?
The html I am testing against is this: http://adam-makes-websites.com/tests/htmltest/test.html
A full test of what im doing with the code (as it stands with the suggestion to use ownerDocument->saveHTML applied) is here: http://adam-makes-websites.com/tests/htmltest/runtest.txt
The output from the test can be seen here: http://adam-makes-websites.com/tests/htmltest/runtest.php

You need to call saveHTML on the ownerDocument property:
$result[] = $entry->ownerDocument->saveHTML($entry);

$dom = new DOMDocument();
$dom->loadHTML($string);
$entries = $dom->getElementsByTagName('p');
$new_dom = new DOMDocument();
foreach ($entries as $entry) {
$new_dom->appendChild($new_dom->importNode($entry, TRUE));
}
$result = $new_dom->saveHTML()

How to remove invalid element from DOM?

We have the following code that lists the xpaths where $value is found.
We have detected for a given URL (see on picture) a non standard tag td1 which in addition doesn't have a closing tag. Probably the site developers have put that there intentionally, as you see in the screen shot below.
This element creates problems identifying the corect XPath for nodes.
A broken Xpath example :
/html/body/div[2]/div[2]/table/tr[2]/td/table/tr[1]/td[2]/table/tr[2]/td[2]/table[3]/tr[2]/**td1**/td[2]/span/u[1]
(as you see td1 is identified and chained in the Xpath)
We think by removing this element it helps us to build the valid XPath we are after.
A valid example is
/html/body/div[2]/div[2]/table/tr[2]/td/table/tr[1]/td[2]/table/tr[2]/td[2]/table[3]/tr[2]/td[2]/span/u[1]
How can we remove prior loading in DOMXpath? Do you have some other approach?
We would like to remove all the invalid tags which may be other than td1, as h8, diw, etc...
private function extract($url, $value) {
$dom = new DOMDocument();
$file = 'content.txt';
//$current = file_get_contents($url);
$current = CurlTool::downloadFile($url, $file);
//file_put_contents($file, $current);
#$dom->loadHTMLFile($current);
//use DOMXpath to navigate the html with the DOM
$dom_xpath = new DOMXpath($dom);
$elements = $dom_xpath->query("//*[text()[contains(., '" . $value . "')]]");
var_dump($elements);
if (!is_null($elements)) {
foreach ($elements as $element) {
var_dump($element);
echo "\n1.[" . $element->nodeName . "]\n";
$nodes = $element->childNodes;
foreach ($nodes as $node) {
if( ($node->nodeValue != null) && ($node->nodeValue === $value) ) {
echo '2.' . $node->nodeValue . "\n";
$xpath = preg_replace("/\/text\(\)/", "", $node->getNodePath());
echo '3.' . $xpath . "\n";
}
}
}
}
}

You could use XPath to find the offending nodes and remove them, while promoting its children into its place in the DOM. Then your paths will be correct.
$dom_xpath = new DOMXpath($dom);
$results = $dom_xpath->query('//td1'); // (or any offending element)
foreach ($results as $invalidNode)
{
$parentNode = $invalidNode->parentNode;
while ($invalidNode->childNodes)
{
$firstChild = $invalidNode->firstChild;
$parentNode->insertBefore($firstChild,$invalidNode);
}
$parentNode->removeChild($invalidNode);
}
EDIT:
You could also build a list of offending elements by using a list of valid elements and negating it.
// Build list manually from the HTML spec:
// See: http://www.w3.org/TR/html5/section-index.html#elements-1
$validTags = array();
// Convert list to XPath:
$validTagsStr = '';
foreach ($validTags as $tag)
{
if ($validTagsStr)
{ $validTagsStr .= ' or '; }
$validTagsStr .= 'self::'.$tag;
}
$results = $dom_xpath->query('//*[not('.$validTagsStr.')');

Sooo... perhaps str_replace($current, "<td1 va-laign=\"top\">", "") could do the trick?

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

Website Scraping Using Regex trying to extract integers - php

<span[^>)()]\((\d+)\)[^>]> Try this.Grab the capture.See demo. http://regex101.com/r/iM2wF9/10

Related

php echo "%" displaying more than once

PHP Preg Replace - Match String with Space - Wordpress

Website Scraping from DoMDocument using php

PHP nodevalue stripping html tags

How to remove invalid element from DOM?

Categories

Resources

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

Website Scraping Using Regex trying to extract integers - php

<span[^>)()]*\((\d+)\)[^>]*> Try this.Grab the capture.See demo. http://regex101.com/r/iM2wF9/10

Related

php echo "%" displaying more than once

PHP Preg Replace - Match String with Space - Wordpress

Website Scraping from DoMDocument using php

PHP nodevalue stripping html tags

How to remove invalid element from DOM?

Categories

Resources

<span[^>)()]\((\d+)\)[^>]> Try this.Grab the capture.See demo. http://regex101.com/r/iM2wF9/10