I have a string (not xml )
<headername>X-Mailer-Recptid</headername>
<headervalue>15772348</headervalue>
</header>
From this, I need to get the value 15772348, that is, the value of headervalue. How can I do this?
Use PHP DOM and traverse the headervalue tag using getElementsByTagName():
<?php
$doc = new DOMDocument;
@$doc->loadHTML('<headername>X-Mailer-Recptid</headername><headervalue>15772348</headervalue></header>');
$items = $doc->getElementsByTagName('headervalue');
for ($i = 0; $i < $items->length; $i++) {
echo $items->item($i)->nodeValue . "\n";
}
?>
This gives the following output:
15772348
[EDIT]: Code updated to suppress non-HTML warning about invalid headername and headervalue tags as they are not really HTML tags. Also, if you try to load it as XML, it totally fails to load.
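As an alternative to the @ operator, libxml can be told to collect parse errors internally, which keeps unrelated warnings visible. A minimal sketch of that approach; the function name get_header_value is made up for illustration and is not part of the answer above:

```php
<?php
// Hypothetical helper (not from the original answer): returns the text of the
// first <headervalue> element, or null if there is none.
function get_header_value(string $fragment): ?string
{
    $doc = new DOMDocument();

    // Collect libxml parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($fragment);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $items = $doc->getElementsByTagName('headervalue');
    return $items->length > 0 ? $items->item(0)->nodeValue : null;
}

$fragment = '<headername>X-Mailer-Recptid</headername><headervalue>15772348</headervalue></header>';
echo get_header_value($fragment), "\n";
```

Returning null when the tag is missing lets the caller tell "not found" apart from an empty value.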
This looks XML-like to me. Anyway, if you don't want to parse the string as XML (which might be a good idea), you could try something like this:
<?php
$str = "<headervalue>15772348</headervalue>";
preg_match("/<headervalue\>([0-9]+)<\/headervalue>/", $str, $matches);
print_r($matches);
?>
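If only the number is needed, it sits in the first capture group, so $matches[1] can be read directly instead of printing the whole array. A small sketch, assuming the full string from the question (the variable $value is introduced here for illustration):

```php
<?php
$str = "<headername>X-Mailer-Recptid</headername>\n"
     . "<headervalue>15772348</headervalue>\n"
     . "</header>";

// Using ~ as the delimiter avoids having to escape the slash in the closing tag.
if (preg_match('~<headervalue>\s*([0-9]+)\s*</headervalue>~', $str, $matches)) {
    $value = $matches[1]; // first capture group: the digits only
    echo $value, "\n";
} else {
    $value = null;
    echo "No headervalue found\n";
}
```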
// find string short way
function my_url_search($se_action_data)
{
// $regex = '/https?\:\/\/[^\" ]+/i';
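$regex = "/<headervalue>([0-9]+)<\/headervalue>/";
preg_match_all($regex, $se_action_data, $matches);
// $matches[1] holds the captured values; $matches[0] would hold the full tags
$get_url = array_reverse($matches[1]);
return array_unique($get_url);
}
// the function returns an array, so print it with print_r() rather than echo
print_r(my_url_search($se_action_data));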
<?php
$html = new simple_html_dom();
$html = str_get_html("<headername>X-Mailer-Recptid</headername><headervalue>15772348</headervalue></header>"); // Use Html dom here
$get_value=$html->find("headervalue", 0)->plaintext;
echo $get_value;
?>
http://simplehtmldom.sourceforge.net/manual.htm#section_find
Related
I am trying to fetch the content inside a <div> via file_get_contents. What I want to do is to fetch the content from the div resultStats on google.com. My problem is (afaik) printing it.
A bit of code:
$data = file_get_contents("https://www.google.com/?gws_rd=cr&#q=" . $_GET['keyword'] . "&gws_rd=ssl");
preg_match("#<div id='resultStats'>(.*?)<\/div>#i", $data, $matches);
Simply using
print_r($matches);
only returns Array(), but I want to preg_match the number. Any help is appreciated!
Edit: thanks for showing me the right direction! I got rid of the preg_ call and went for DOM instead. However, I am pretty new to PHP and this is giving me a headache. I found this code here on Stack Overflow and I am trying to edit it to get it to work. So far I only get a blank page, and I don't know what I am doing wrong.
$str = file_get_contents("https://www.google.com/search?source=hp&q=" . $_GET['keyword'] . "&gws_rd=ssl");
$DOM = new DOMDocument;
@$dom->loadHTML($str);
//get
$items = $DOM->getElementsByTagName('resultStats');
//print
for ($i = 0; $i < $items->length; $i++)
echo $items->item($i)->nodeValue . "<br/>";
} else { exit("No keyword!") ;}
Posted on behalf of the OP.
I decided to use the PHP Simple HTML DOM Parser and ended up with something like this:
include_once('/simple_html_dom.php');
$setDomain = "https://www.google.com/search?source=hp&q=" . $_GET['keyword'] . "&gws_rd=ssl";
$str = file_get_html($setDomain);
$html = str_get_html($str);
echo $html->find('div div[id=resultStats]', 0)->innertext . '<br>';
Problem solved!
I was trying to scrape IMDb with the following code.
$url = "http://www.imdb.com/search/title?languages=en|1&explore=year";
$html = new simple_html_dom();
$html->load(str_replace(' ','',$data = get_data($url)));
foreach($html->find('#left') as $total_movies)
{
$content = $total_movies->plaintext;
if(preg_match("/(?<total>[0-9,]+) titles/",$content,$matches))
{
print_r($matches);
}
echo $content."<br>";
}
get_data() is just a cURL function I created.
The problem is that preg_match is not working. I don't know why, because the same pattern works in the snippet below, where $content contains the text that the code above scrapes.
$content = "1-50 of 101 titles.";
if(preg_match("/(?<total>[0-9,]+) titles/",$content,$matches))
print_r($matches);
The source on the site is actually:
<div id="left">
1-50 of 564,592
titles.
</div>
Notice the \n before "titles": it needs to be stripped out or allowed for in your pattern.
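One way to allow for the line break is to replace the literal space in the pattern with \s+. A short sketch, assuming $content looks like the page source shown above (the $total variable is added here for illustration):

```php
<?php
// Text as it appears in the page source, with line breaks around the number.
$content = "\n1-50 of 564,592\ntitles.\n";

// \s+ matches any run of whitespace, including the newline before "titles".
if (preg_match('/(?<total>[0-9,]+)\s+titles/', $content, $matches)) {
    // Drop the thousands separators to get a plain integer.
    $total = (int) str_replace(',', '', $matches['total']);
    echo $total, "\n";
}
```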
Here's a method to reach your goal without using any extra library.
<?php
$url = "http://www.imdb.com/search/title?languages=en|1&explore=year";
$temp=file_get_contents($url);
$xml = new DOMDocument();
@$xml->loadHTML($temp);
foreach($xml->getElementsByTagName('div') as $div) {
if($div->getAttribute('id')=='left'){
preg_match("#of ([0-9,]+)#",$div->nodeValue,$match);
$matchs[]=preg_replace('/[^0-9]/', '', $match[0]);
}
}
echo number_format($matchs[0]); //564,592
?>
How can I use PHP to remove tags with an empty text node?
For instance,
<div class="box"></div> remove
remove
<p></p> remove
<span style="..."></span> remove
But I want to keep the tag with text node like this,
link keep
Edit:
I want to remove something messy like this too,
<p><strong></strong></p>
<p><strong></strong></p>
<p><strong></strong></p>
I tested both regex below,
$content = preg_replace('!<(.*?)[^>]*>\s*</\1>!','',$content);
$content = preg_replace('%<(.*?)[^>]*>\\s*</\\1>%', '', $content);
But they leave something like this,
<p><strong></strong></p>
<p><strong></strong></p>
<p><strong></strong></p>
One way could be:
$dom = new DOMDocument();
$dom->loadHtml(
'<p><strong>test</strong></p>
<p><strong></strong></p>
<p><strong></strong></p>'
);
$xpath = new DOMXPath($dom);
while(($nodeList = $xpath->query('//*[not(text()) and not(node())]')) && $nodeList->length > 0) {
foreach ($nodeList as $node) {
$node->parentNode->removeChild($node);
}
}
echo $dom->saveHtml();
Probably you'll have to change that a bit for your needs.
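One such change might be to treat whitespace-only elements as empty while keeping void elements such as <br>. The sketch below is only an illustration of that idea: the helper name, the wrapper div id and the default keep list are all assumptions, not part of the answer above.

```php
<?php
// Hypothetical helper: strips elements that contain neither child elements nor
// non-whitespace text, repeating until nothing more can be removed.
function remove_empty_elements(string $html, array $keep = ['br', 'hr', 'img', 'input']): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    // The wrapper div (an assumption of this sketch) makes the fragment easy to find again.
    $dom->loadHTML('<div id="fragment-wrapper">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $query = '//div[@id="fragment-wrapper"]//*[not(*) and not(normalize-space())]';

    do {
        $removed = 0;
        foreach ($xpath->query($query) as $node) {
            if (in_array($node->nodeName, $keep, true)) {
                continue; // void elements such as <br> are empty by design
            }
            $node->parentNode->removeChild($node);
            $removed++;
        }
    } while ($removed > 0);

    // Serialise only the children of the wrapper, not the html/body shell.
    $wrapper = $xpath->query('//div[@id="fragment-wrapper"]')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo remove_empty_elements('<p><strong>test</strong></p><p><strong> </strong></p><p>a<br>b</p>');
```

Note that an element holding only a comment also counts as empty here, and the serialiser may insert line breaks between block elements.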
You should buffer the PHP output, then parse that output with some regex, like this:
// start buffering output
ob_start();
// do some output
echo '<div id="non-empty">I am not empty</div><a class="empty"></a>';
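// at this point you want to capture the contents instead of sending them to the client
// get the buffered contents and end buffering without flushing,
// otherwise the unsanitized output would be sent as well
$contents = ob_get_clean();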
// replace empty html tags
$contents = preg_replace('%<(.*?)[^>]*>\\s*</\\1>%', '', $contents);
// echo the sanitized contents
echo $contents;
Let me know if this helps :)
You could do a regex replace like:
$updated="";
while($updated != $original) {
$updated = $original;
$original = preg_replace('!<(.*?)[^>]*>\s*</\1>!','',$updated);
}
Putting it in a while loop should fix it.
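To see why the loop matters, here is a self-contained sketch on a nested example. The pattern is a slightly stricter variant of the one above (it requires a tag name after the <), which is my own assumption rather than the original regex:

```php
<?php
$content = '<div><p><strong></strong></p><a href="#">link</a></div>';

// Matches an element whose opening and closing tags enclose only whitespace.
$pattern = '~<([a-z][a-z0-9]*)\b[^>]*>\s*</\1>~i';

// Each pass removes the innermost empty elements; repeat until nothing changes.
do {
    $before  = $content;
    $content = preg_replace($pattern, '', $before);
} while ($content !== $before);

echo $content, "\n";
```

The first pass removes the empty <strong>, which leaves an empty <p> for the second pass; the link survives because it contains text.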
I'm attempting to make a script that only echoes the div that encloses the image on Google.
$url = "http://www.google.com/";
$page = file($url);
foreach($page as $theArray) {
echo $theArray;
}
The problem is this echos the whole page.
I want to echo only the part between the <div id="lga"> and the next closest </div>
Note: I have tried using if's but it wasn't working so I deleted them
Thanks
Use the built-in DOM methods:
<?php
$page = file_get_contents("http://www.google.com");
$domd = new DOMDocument();
libxml_use_internal_errors(true);
$domd->loadHTML($page);
libxml_use_internal_errors(false);
$domx = new DOMXPath($domd);
$lga = $domx->query("//*[@id='lga']")->item(0);
$domd2 = new DOMDocument();
$domd2->appendChild($domd2->importNode($lga, true));
echo $domd2->saveHTML();
In order to do this you need to parse the DOM and then get the ID you are looking for. Check out a parsing library like this http://simplehtmldom.sourceforge.net/manual.htm
After feeding your html document into the parser you could call something like:
$html = str_get_html($page);
// pass index 0 to get the first matching element instead of an array
$element = $html->find('div[id=lga]', 0);
echo $element->plaintext;
That, I think, would be your quickest and easiest solution.
Just wondering if someone can help me further with the following. I want to parse the URLs on this website: http://www.directorycritic.com/free-directory-list.html?pg=1&sort=pr
I have the following code:
<?PHP
$url = "http://www.directorycritic.com/free-directory-list.html?pg=1&sort=pr";
$input = @file_get_contents($url) or die("Could not access file: $url");
$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
if(preg_match_all("/$regexp/siU", $input, $matches)) {
// $matches[2] = array of link addresses
// $matches[3] = array of link text - including HTML code
}
?>
This does nothing at present. What I need it to do is scrape all the URLs in the table across all 16 pages. I would really appreciate some help with how to amend the above to do that and write the URLs to a text file.
Use HTML Dom Parser
$html = file_get_html('http://www.example.com/');
// Find all links
$links = array();
foreach($html->find('a') as $element)
$links[] = $element->href;
Now the $links array contains all URLs of the given page, and you can use these URLs to parse further.
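For the "write to a text file" part of the question, the same idea can be sketched with PHP's built-in DOM extension instead of Simple HTML DOM. The helper name extract_links and the output path are assumptions made for this example:

```php
<?php
// Hypothetical helper: returns the href of every <a> element in an HTML string.
function extract_links(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        if ($a->hasAttribute('href')) {
            $links[] = $a->getAttribute('href');
        }
    }
    return $links;
}

// Demo on a literal string so the sketch runs without network access.
$sample = '<p><a href="a.html">A</a> <a name="x">anchor</a> <a href="http://b.test/">B</a></p>';
$links  = extract_links($sample);

// One URL per line; the output path is a placeholder.
$outFile = sys_get_temp_dir() . '/links.txt';
file_put_contents($outFile, implode(PHP_EOL, $links) . PHP_EOL);
```

To cover the 16 pages from the question, one could call extract_links() on the result of file_get_contents() for each pg value from 1 to 16 and pass FILE_APPEND to file_put_contents() so every page adds to the same file.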
Parsing HTML with regular expressions is not a good idea. Here are some related posts:
Using regular expressions to parse HTML: why not?
RegEx match open tags except XHTML self-contained tags
EDIT:
Some Other HTML Parsing tools as described by Gordon in comments below:
phpQuery
Zend_Dom
QueryPath
FluentDom
You really shouldn’t use regular expressions to parse HTML as it’s too error-prone.
Better use an HTML parser like the one of PHP’s DOM library:
$code = file_get_contents($url);
$doc = new DOMDocument();
$doc->loadHTML($code);
$links = array();
foreach ($doc->getElementsByTagName('a') as $element) {
if ($element->hasAttribute('href')) {
$links[] = $element->getAttribute('href');
}
}
Note that this will collect the URI references as they appear in the document and not as an absolute URI. You might want to resolve them before.
It seems that PHP doesn’t provide an appropriate library (or I haven’t found it yet). But see RFC 3986 – Reference Resolution and my answer on Convert a relative URL to an absolute URL with Simple HTML DOM? for further details.
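For the common cases, a relative reference can be resolved by hand. The following is a deliberately simplified sketch and not a full RFC 3986 implementation: the function name resolve_url is made up, and "./" and "../" segments are left as they are.

```php
<?php
// Hypothetical helper: a simplified resolver, NOT a full RFC 3986 implementation.
// It does not normalise "./" or "../" segments.
function resolve_url(string $base, string $ref): string
{
    // A reference that starts with a scheme (http:, mailto:, ...) is already absolute.
    if (preg_match('~^[a-z][a-z0-9+.-]*:~i', $ref)) {
        return $ref;
    }
    if ($ref === '') {
        return $base;
    }

    $parts  = parse_url($base);
    $scheme = $parts['scheme'] ?? 'http';
    $root   = $scheme . '://' . ($parts['host'] ?? '');
    if (isset($parts['port'])) {
        $root .= ':' . $parts['port'];
    }
    $path = $parts['path'] ?? '/';

    if (strpos($ref, '//') === 0) {   // protocol-relative: reuse the base scheme
        return $scheme . ':' . $ref;
    }
    if ($ref[0] === '/') {            // root-relative: keep scheme and host
        return $root . $ref;
    }
    if ($ref[0] === '?') {            // query-only: keep the base path
        return $root . $path . $ref;
    }
    if ($ref[0] === '#') {            // fragment-only: replace the base fragment
        return explode('#', $base, 2)[0] . $ref;
    }
    // Path-relative: replace everything after the last slash of the base path.
    return $root . substr($path, 0, strrpos($path, '/') + 1) . $ref;
}

echo resolve_url('http://example.com/dir/page.html', 'other.html'), "\n";
```

Applying it to each collected href, with the page URL as $base, would turn the $links array into absolute URLs.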
Try this method
function getinboundLinks($domain_name) {
ini_set('user_agent', 'NameOfAgent (http://localhost)');
$url = $domain_name;
$url_without_www=str_replace('http://','',$url);
$url_without_www=str_replace('www.','',$url_without_www);
$url_without_www= str_replace(strstr($url_without_www,'/'),'',$url_without_www);
$url_without_www=trim($url_without_www);
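// double quotes so that $url is interpolated into the error message
$input = @file_get_contents($url) or die("Could not access file: $url");
$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
// initialise the counter; it is incremented and returned below
$inbound=0;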
$outbound=0;
$nonfollow=0;
if(preg_match_all("/$regexp/siU", $input, $matches, PREG_SET_ORDER)) {
foreach($matches as $match) {
# $match[2] = link address
# $match[3] = link text
//echo $match[3].'<br>';
if(!empty($match[2]) && !empty($match[3])) {
if(strstr(strtolower($match[2]),'URL:') || strstr(strtolower($match[2]),'url:') ) {
$nonfollow +=1;
} else if (strstr(strtolower($match[2]),$url_without_www) || !strstr(strtolower($match[2]),'http://')) {
$inbound += 1;
echo '<br>inbound '. $match[2];
}
else if (!strstr(strtolower($match[2]),$url_without_www) && strstr(strtolower($match[2]),'http://')) {
echo '<br>outbound '. $match[2];
$outbound += 1;
}
}
}
}
$links['inbound']=$inbound;
$links['outbound']=$outbound;
$links['nonfollow']=$nonfollow;
return $links;
}
// ************************Usage********************************
$Domain='http://zachbrowne.com';
$links=getinboundLinks($Domain);
echo '<br>Number of inbound Links '.$links['inbound'];
echo '<br>Number of outbound Links '.$links['outbound'];
echo '<br>Number of Nonfollow Links '.$links['nonfollow'];