I am using simple_html_dom to parse a website.
Is there a way to extract the doctype?
You can use the file_get_contents function to fetch the page's raw HTML and then match the doctype with a regular expression.
For example:
<?php
$html = file_get_contents("http://google.com");
// Match the doctype declaration up to its closing ">".
// [^>]* also spans newlines and handles short forms such as <!DOCTYPE html>.
if (preg_match('/<!DOCTYPE[^>]*>/i', $html, $matches)) {
    $doctype = $matches[0];
}
?>
You can use $html->find('unknown'). This works - at least - in version 1.11 of the simplehtmldom library. I use it as follows:
function get_doctype($doc)
{
    $els = $doc->find('unknown');
    foreach ($els as $e => $el)
        if ($el->parent()->tag == 'root')
            return $el;
    return NULL;
}
That's just to handle any other 'unknown' elements which might be found; I'm assuming the first will be the doctype. You can explicitly inspect ->innertext if you want to ensure it starts with '!DOCTYPE ', though.
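For comparison, here is a minimal sketch that reads the doctype with PHP's built-in DOM extension instead of simple_html_dom. The function name doctype_of and the sample markup are made up for illustration. LIBXML_HTML_NODEFDTD is passed so that libxml does not add a default doctype when the input has none.

```php
<?php
// Sketch: read the doctype through DOMDocument (not simple_html_dom).
// doctype_of() is a hypothetical helper name.
function doctype_of($html)
{
    $doc = new DOMDocument();
    // @ silences warnings about sloppy real-world markup
    @$doc->loadHTML($html, LIBXML_HTML_NODEFDTD);
    $dt = $doc->doctype;
    if ($dt === null) {
        return null; // the document declares no doctype
    }
    return array(
        'name'     => $dt->name,
        'publicId' => $dt->publicId,
        'systemId' => $dt->systemId,
    );
}

$sample = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
        . '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
        . '<html><head><title>t</title></head><body></body></html>';
$info = doctype_of($sample);
print_r($info);
```

The array form makes it easy to compare the public identifier against known values rather than string-matching the whole declaration.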
I fetched the source code from a remote URL like this:
$f = file_get_contents("http://www.example.com/abc/");
$str=htmlspecialchars( $f );
echo $str;
In that code I want to replace any URL that looks like
href="/m/offers/"
with
href="www.example.com/m/offers/"
For that I used
$newstr=str_replace('href="/m/offers/"','href="www.example.com/m/offers/"',$str);
echo $newstr;
but this does not replace anything. Can I use str_replace on code fetched from a remote URL? If yes, how? If not, is there another solution?
There will not be any " in your $str because htmlspecialchars() will have converted them all to &quot; before the string gets to your str_replace().
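A small sketch of that effect, using a hard-coded sample string instead of a remote fetch (the sample markup and the http:// prefix in the replacement are assumptions for illustration). Replacing first and escaping afterwards avoids the problem.

```php
<?php
// Sample markup standing in for the fetched page
$raw = '<a href="/m/offers/">Offers</a>';

// After escaping, the double quotes are gone, so the search string cannot match
$escaped = htmlspecialchars($raw);
$failed  = str_replace('href="/m/offers/"', 'href="http://www.example.com/m/offers/"', $escaped, $countAfter);

// Replace on the raw HTML first, then escape for display
$fixed = str_replace('href="/m/offers/"', 'href="http://www.example.com/m/offers/"', $raw, $countBefore);

echo $countAfter, "\n";   // replacements made on the escaped string
echo $countBefore, "\n";  // replacements made on the raw string
echo htmlspecialchars($fixed), "\n";
```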
I'll start by assuming that all href attributes belong to <a> tags.
Since we can't be sure all tags are written the same way, I'll use a parser instead of regular expressions to make the extraction easier.
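<?php
// Composer autoloader; assumes symfony/dom-crawler and symfony/css-selector are installed
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$base = "http://www.example.com";
$url = $base . "/abc/";
$html = file_get_contents($url);
$crawler = new Crawler($html);
$links = array();
$raw_links = array();
$offers = array();
// Iterating a Crawler yields DOMElement nodes, so read attributes with getAttribute()
foreach ($crawler->filter('a') as $atag) {
    $raw_links[] = $raw_link = $atag->getAttribute('href');
    // Strip the base so absolute and relative links compare the same way
    $links[] = $link = str_replace($base, '', $raw_link);
    if (strpos($link, 'm/offers') !== false) {
        $offers[] = $link;
    }
}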
Now you have all the raw links, the relative links, and the offer links.
This uses Symfony's DomCrawler component.
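If installing a component is not an option, roughly the same idea can be sketched with PHP's built-in DOMDocument. This version goes the other way and turns root-relative hrefs into absolute ones, which is what the question asked for. The function name absolutize_links and the sample markup are assumptions for illustration.

```php
<?php
// Sketch: collect hrefs with DOMDocument and prefix root-relative ones
// with a base URL. absolutize_links() is a hypothetical helper name.
function absolutize_links($html, $base)
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // @ silences warnings on sloppy markup
    $result = array();
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        // "/m/offers/" is root-relative; "//host/path" is protocol-relative and left alone
        if (strpos($href, '/') === 0 && strpos($href, '//') !== 0) {
            $href = rtrim($base, '/') . $href;
        }
        $result[] = $href;
    }
    return $result;
}

$sample = '<a href="/m/offers/">Offers</a><a href="http://other.test/">Other</a>';
$hrefs = absolutize_links($sample, 'http://www.example.com');
print_r($hrefs);
```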
I'm looking for a 'smart way' of decoding multiple XML tags inside a string. I have the following function:
function b($params) {
$xmldata = '<?xml version="1.0" encoding="UTF-8" ?><root>' . html_entity_decode($params['data']) . '</root>';
$lang = ucfirst(strtolower($params['lang']));
if (simplexml_load_string($xmldata) === FALSE) {
return $params['data'];
} else {
$langxmlobj = new SimpleXMLElement($xmldata);
if ($langxmlobj -> $lang) {
return $langxmlobj -> $lang;
} else {
return $params['data'];
}
}
}
And trying out
$params['data'] = '<French>Service DNS</French><English>DNS Service</English> - <French>DNS Gratuit</French><English>Free DNS</English>';
$params['lang'] = 'French';
$a = b($params);
print_r($a);
But it outputs:
Service DNS
I want it to output the content of every matching tag, so the result should be:
Service DNS - DNS Gratuit
I'm pulling my hair out. Any quick help or directions would be appreciated.
Edit: Refine needs.
It seems that I wasn't clear enough, so let me show another example.
If I have the following string as input:
The <French>Chat</French><English>Cat</English> is very happy to stay on stackoverflow
because it makes him <French>Heureux</French><English>Happy</English> to know that it
is the best <French>Endroit</French><English>Place</English> to find good people with
good <French>Réponses</French><English>Answers</English>.
If I run the function with 'French', it should return:
The Chat is very happy to stay on stackoverflow
because it makes him Heureux to know that it
is the best Endroit to find good people with
good Réponses.
And with 'English' :
The Cat is very happy to stay on stackoverflow
because it makes him Happy to know that it
is the best Place to find good people with
good Answers.
Hope it's more clear now.
Basically, I would first match each run of adjacent language tags, such as:
<French>Chat</French><English>Cat</English>
with this pattern:
"#(<($defLangs)>.*?</\\2>)+#i"
Then pick the string for the requested language out of the match in a callback.
If you have PHP 5.3+, then:
function transLang($str, $lang, $defLangs = 'French|English')
{
return preg_replace_callback ( "#(<($defLangs)>.*?</\\2>)+#i",
function ($matches) use($lang)
{
preg_match ( "/<$lang>(.*?)<\/$lang>/i", $matches [0], $longSec );
return $longSec [1];
}, $str );
}
echo transLang ( $str, 'French' ), "\n", transLang ( $str, 'English' );
If not, it's a little more complicated:
class LangHelper
{
private $lang;
function __construct($lang)
{
$this->lang = $lang;
}
public function callback($matches)
{
$lang = $this->lang;
preg_match ( "/<$lang>(.*?)<\/$lang>/i", $matches [0], $subMatches );
return $subMatches [1];
}
}
function transLang($str, $lang, $defLangs = 'French|English')
{
$langHelper = new LangHelper ( $lang );
return preg_replace_callback ( "#(<($defLangs)>.*?</\\2>)+#i",
array (
$langHelper,
'callback'
), $str );
}
echo transLang ( $str, 'French' ), "\n", transLang ( $str, 'English' );
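Note that $str is not defined in the snippets above. Here is a self-contained variation on the same callback idea, as a sketch rather than a drop-in replacement: it matches one language element at a time and keeps or drops it, so a group that lacks the requested language simply disappears. The name pickLanguage and the sample sentence are made up for illustration; an anonymous function requires PHP 5.3+.

```php
<?php
// Sketch: keep the content of <$lang> elements and drop the other languages.
// pickLanguage() is a hypothetical helper name.
function pickLanguage($text, $lang, array $languages = array('French', 'English'))
{
    // Build "French|English", escaping any regex metacharacters in the names
    $alternation = implode('|', array_map('preg_quote', $languages));
    // \1 must repeat the opening tag name; "s" lets . span newlines
    $pattern = '#<(' . $alternation . ')>(.*?)</\1>#is';
    return preg_replace_callback($pattern, function ($m) use ($lang) {
        // $m[1] is the tag name, $m[2] its content
        return strcasecmp($m[1], $lang) === 0 ? $m[2] : '';
    }, $text);
}

$str = 'The <French>Chat</French><English>Cat</English> is '
     . '<French>Heureux</French><English>Happy</English>.';
echo pickLanguage($str, 'French'), "\n";
echo pickLanguage($str, 'English'), "\n";
```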
If I understand you correctly you would like to remove all "language" tags, but keep the contents of the provided language.
The DOM is a tree of nodes. Tags are element nodes; the text is stored in text nodes. XPath lets you select nodes using expressions. So take all the child nodes of the language elements you want to keep and copy them to just before the language node. Then remove all language nodes. This works even if the language elements contain other element nodes, like an <em>.
function replaceLanguageTags($fragment, $language) {
$dom = new DOMDocument();
$dom->loadXml(
'<?xml version="1.0" encoding="UTF-8" ?><content>'.$fragment.'</content>'
);
// get an xpath object
$xpath = new DOMXpath($dom);
// fetch all nodes with the language you like to keep
$nodes = $xpath->evaluate('//'.$language);
foreach ($nodes as $node) {
// copy all the child nodes to just before the found node
foreach ($node->childNodes as $childNode) {
$node->parentNode->insertBefore($childNode->cloneNode(TRUE), $node);
}
// remove the found node
$node->parentNode->removeChild($node);
}
// select all language nodes
$tags = array('English', 'French');
$nodes = $xpath->evaluate('//'.implode('|//', $tags));
foreach ($nodes as $node) {
// remove them
$node->parentNode->removeChild($node);
}
$result = '';
// we do not need the root node, so save all its children
foreach ($dom->documentElement->childNodes as $node) {
$result .= $dom->saveXml($node);
}
return $result;
}
$xml = <<<'XML'
The <French>Chat</French><English>Cat</English> is very happy to stay on stackoverflow
because it makes him <French>Heureux</French><English>Happy</English> to know that it
is the best <French>Endroit</French><English>Place</English> to find good people with
good <French>Réponses</French><English>Answers</English>.
XML;
var_dump(replaceLanguageTags($xml, 'English'));
var_dump(replaceLanguageTags($xml, 'French'));
Output:
string(146) "The Cat is very happy to stay on stackoverflow
because it makes him Happy to know that it
is the best Place to find good people with
good Answers."
string(153) "The Chat is very happy to stay on stackoverflow
because it makes him Heureux to know that it
is the best Endroit to find good people with
good Réponses."
What version of PHP are you on? I don't know what else could be different, but I copied & pasted your code and got the following output:
SimpleXMLElement Object
(
[0] => Service DNS
[1] => DNS Gratuit
)
Just to be sure, this is the code I copied from above:
<?php
function b($params) {
$xmldata = '<?xml version="1.0" encoding="UTF-8" ?><root>' . html_entity_decode($params['data']) . '</root>';
$lang = ucfirst(strtolower($params['lang']));
if (simplexml_load_string($xmldata) === FALSE) {
return $params['data'];
} else {
$langxmlobj = new SimpleXMLElement($xmldata);
if ($langxmlobj -> $lang) {
return $langxmlobj -> $lang;
} else {
return $params['data'];
}
}
}
$params['data'] = '<French>Service DNS</French><English>DNS Service</English> - <French>DNS Gratuit</French><English>Free DNS</English>';
$params['lang'] = 'French';
$a = b($params);
print_r($a);
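Whichever PHP version is involved, $langxmlobj->French refers to every <French> child, and print_r shows them all while echo or a string cast shows only the first. If the joined string from the question is the goal, a minimal sketch is to iterate and join. The function name join_lang is made up, and the ' - ' separator is hard-coded as an assumption, because SimpleXML child access does not return the text that sits between the elements.

```php
<?php
// Sketch: join the text of every child element named $lang.
// join_lang() is a hypothetical helper; $sep is an assumed separator.
function join_lang($fragment, $lang, $sep = ' - ')
{
    $xml = new SimpleXMLElement(
        '<?xml version="1.0" encoding="UTF-8" ?><root>' . $fragment . '</root>'
    );
    $parts = array();
    // Iterating a child property visits every element with that name
    foreach ($xml->{$lang} as $node) {
        $parts[] = (string) $node;
    }
    return implode($sep, $parts);
}

$data = '<French>Service DNS</French><English>DNS Service</English> - '
      . '<French>DNS Gratuit</French><English>Free DNS</English>';
echo join_lang($data, 'French'), "\n";
```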
Here's my suggestion. It should be fast and it is simple. You just need to strip the tags of the desired language and then remove any other tags along with their content.
The downside is that if you wish to use any tags other than the language ones, you have to make sure that the opening tag is different from the closing one (e.g. <p >Lorem</p> instead of <p>Lorem</p>). On the other hand, this allows you to add as many languages as you want without keeping a list of them. You only need to know the default one (or just throw and catch an exception) when the requested language is missing.
function only_lang($lang, $text) {
static $infinite_loop;
$result = str_replace("<$lang>", '', $text, $num_matches_open);
$result = str_replace("</$lang>", '', $result, $num_matches_close);
// Check if the text is malformed. Good place to throw an error
if($num_matches_open != $num_matches_close) {
//throw new Exception('Opening and closing tags does not match', 1);
return $text;
}
// Check if this language is present at all.
// Otherwise fallback to default language or throw an error
if( ! $num_matches_open) {
//throw new Exception('No such language', 2);
// Prevent infinite loop if even the default language is missing
if($infinite_loop) return $text;
$infinite_loop = __FUNCTION__;
return $infinite_loop('English', $text);
}
// Strip any other language and return the result
// Non-greedy .*? so each tag pair is removed on its own; "s" lets . span newlines
return preg_replace('!<([^>]+)>.*?</\\1>!s', '', $result);
}
I got a simple one using regex. Useful, if the input only contains <lang>...</lang> tags.
function to_lang($lang="", $str="") {
return strip_tags(preg_replace('~<(\w+(?<!'.$lang.'))>.*</\1>~Us',"",$str));
}
echo to_lang("English","The happy <French>Chat</French><English>Cat</English>");
Removes each <tag>...</tag>, that is not the specified one in $lang. If there could be spaces/specials inside the <tag-name> e.g. <French-1> replace \w with [^/>].
Search pattern explained a bit
1.) <(\w+(?<!'.$lang.'))
< followed by one or more Word characters,
not matching $lang (using a negative lookbehind)
and capturing the <tag_name>
2.) .* followed by anything (ungreedy: modifier U, dot matches newlines: modifier s)
3.) </\1> until the captured tag is closed
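One thing to be aware of: the lookbehind only checks how the tag name ends, so a tag such as <OldEnglish> would be kept when $lang is "English". If that matters, a variant with a negative lookahead on the whole tag name avoids it. This is a sketch under the same assumption that the input contains only language tags; keep_lang and the sample strings are made up for illustration.

```php
<?php
// Sketch: remove every <tag>...</tag> whose name is not exactly $lang,
// then strip the remaining (kept) language tags.
// keep_lang() is a hypothetical helper name.
function keep_lang($lang, $str)
{
    // (?!English>) rejects only the exact tag name, not names ending in it
    $pattern = '~<(?!' . preg_quote($lang, '~') . '>)(\w+)>.*?</\1>~s';
    return strip_tags(preg_replace($pattern, '', $str));
}

echo keep_lang('English', 'The happy <French>Chat</French><English>Cat</English>'), "\n";
echo keep_lang('English', 'A <OldEnglish>x</OldEnglish><English>y</English>'), "\n";
```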
I am trying to get the URL of a background image:
<div class="mine" style="background: url('http://www.something.com/something.jpg')"></div>
I am using find('div.mine')
$link = find('div.mine');
$link returns the html code containing all the
How do I parse so it returns only the link?
That syntax isn't quite correct. You're doing $link = find('div.mine'); but that should be $link = $yourHTML->find('div.mine'); instead.
Get all the divs with the class name mine first, loop through them, and get the style attributes. Now you'll have a string like:
background: url('http://www.something.com/something.jpg')
You could then use a CSS Parser (recommended way), or a regular expression to grab just the URL part from that string.
if(preg_match('#\bhttps?://[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))#', $link, $matches)) {
$image_url = $matches[0];
}
Full code:
$html = file_get_html('file.html');
$divs = $html->find('div.mine');
foreach ($divs as $div) {
$link = $div->style;
}
if(preg_match('#\bhttps?://[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))#', $link, $matches)) {
$image_url = $matches[0];
}
echo $image_url;
Output:
http://www.something.com/something.jpg
The URL matching regex pattern is from Wordpress' make_clickable function in wp-includes/formatting.php. See this post for the complete implementation.
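If the style value is known to use the CSS url(...) form, a narrower pattern that targets that form directly may be easier to reason about than a general URL matcher. The following is a sketch using PHP's built-in DOMDocument rather than simple_html_dom, so that it runs on its own; css_url is a made-up helper name and the markup is the sample from the question.

```php
<?php
// Sketch: pull the first url(...) value out of a CSS declaration string.
// css_url() is a hypothetical helper name.
function css_url($style)
{
    // Optional quotes around the value; capture everything up to the quote or ")"
    if (preg_match('/url\(\s*[\'"]?([^\'")]+)[\'"]?\s*\)/i', $style, $m)) {
        return trim($m[1]);
    }
    return null;
}

$html = '<div class="mine" style="background: url(\'http://www.something.com/something.jpg\')"></div>';
$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$urls = array();
// Match "mine" as a whole class token, even when other classes are present
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " mine ")]';
foreach ($xpath->query($query) as $div) {
    $url = css_url($div->getAttribute('style'));
    if ($url !== null) {
        $urls[] = $url;
    }
}
print_r($urls);
```

Collecting the result inside the loop also covers pages with more than one matching div.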
Try using the substr() function to extract the text.
I am a php newb but I am pretty sure this will be hard to accomplish and very server consuming. But I want to ask, get the opinion of much smarter users than myself.
Here is what I am trying to do:
I have a list of URLs, an array of URLs actually.
For each URL, I want to count the outgoing links - which DO NOT HAVE REL="nofollow" attribute - on that page.
So in a way, I'm afraid I'll have to make php load the page and preg match using regular expressions all the links?
Would this work if I'd had lets say 1000 links?
Here is what I am thinking, putting it in code:
$homepage = file_get_contents('http://www.site.com/');
$homepage = htmlentities($homepage);
// Do a preg_match for http:// and count the number of appearances:
$urls = preg_match();
// Do a preg_match for rel="nofollow" and count the nr of appearances:
$nofollow = preg_match();
// Do a preg_match for the number of "domain.com" appearances so we can subtract the website's internal links:
$internal_links = preg_match();
// Subtract and get the final result:
$result = $urls - $nofollow - $internal_links;
Hope you can help, and if the idea is right maybe you can help me with the preg_match functions.
You can use PHP's DOMDocument class to parse the HTML and parse_url to parse the URLs:
$url = 'http://stackoverflow.com/';
$pUrl = parse_url($url);
// Load the HTML into a DOMDocument
$doc = new DOMDocument;
@$doc->loadHTMLFile($url);
// Look for all the 'a' elements
$links = $doc->getElementsByTagName('a');
$numLinks = 0;
foreach ($links as $link) {
// Exclude if not a link or has 'nofollow'
preg_match_all('/\S+/', strtolower($link->getAttribute('rel')), $rel);
if (!$link->hasAttribute('href') || in_array('nofollow', $rel[0])) {
continue;
}
// Exclude if internal link
$href = $link->getAttribute('href');
if (substr($href, 0, 2) === '//') {
// Deal with protocol relative URLs as found on Wikipedia
$href = $pUrl['scheme'] . ':' . $href;
}
$pHref = @parse_url($href);
if (!$pHref || !isset($pHref['host']) ||
strtolower($pHref['host']) === strtolower($pUrl['host'])
) {
continue;
}
// Increment counter otherwise
echo 'URL: ' . $link->getAttribute('href') . "\n";
$numLinks++;
}
echo "Count: $numLinks\n";
You can use SimpleHTMLDOM:
// Create DOM from URL or file
$html = file_get_html('http://www.site.com/');
// Find all links
foreach($html->find('a[href][rel!=nofollow]') as $element) {
echo $element->href . '<br>';
}
As I'm not sure that SimpleHTMLDOM supports a :not selector and [rel!=nofollow] might only return a tags with a rel attribute present (and not ones where it isn't present), you may have to:
foreach($html->find('a[href][!rel][rel!=nofollow]') as $element)
Note the added [!rel]. Or, do it manually instead of with a CSS attribute selector:
// Find all links
foreach($html->find('a[href]') as $element) {
if (strtolower($element->rel) != 'nofollow') {
echo $element->href . '<br>';
}
}
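One caveat with the manual check: a rel attribute can hold several space-separated values, such as rel="external nofollow", and a plain string comparison treats that link as followed. A small token-based helper can be used inside the loop instead. This is a sketch; has_nofollow is a made-up name, and the cast to string is there because a missing attribute may come back as a non-string value.

```php
<?php
// Sketch: true if the rel value contains "nofollow" as one of its tokens.
// has_nofollow() is a hypothetical helper name.
function has_nofollow($rel)
{
    $tokens = preg_split('/\s+/', strtolower(trim((string) $rel)), -1, PREG_SPLIT_NO_EMPTY);
    return in_array('nofollow', $tokens, true);
}

var_dump(has_nofollow('external nofollow'));
var_dump(has_nofollow('NoFollow'));
var_dump(has_nofollow(''));
var_dump(has_nofollow(false));
```

Inside the loop above, the condition would then read if (!has_nofollow($element->rel)).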
Just wondering if someone can help me further with the following. I want to parse the URLs on this website: http://www.directorycritic.com/free-directory-list.html?pg=1&sort=pr
I have the following code:
<?PHP
$url = "http://www.directorycritic.com/free-directory-list.html?pg=1&sort=pr";
$input = @file_get_contents($url) or die("Could not access file: $url");
$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
if(preg_match_all("/$regexp/siU", $input, $matches)) {
// $matches[2] = array of link addresses
// $matches[3] = array of link text - including HTML code
}
?>
This does nothing at present. What I need it to do is scrape all the URLs in the table for all 16 pages and output them into a text file. I would really appreciate some help with how to amend the above to do that.
Use HTML Dom Parser
$html = file_get_html('http://www.example.com/');
// Find all links
$links = array();
foreach($html->find('a') as $element)
$links[] = $element->href;
Now the $links array contains all URLs of the given page, and you can use these URLs to parse further.
Parsing HTML with regular expressions is not a good idea. Here are some related posts:
Using regular expressions to parse HTML: why not?
RegEx match open tags except XHTML self-contained tags
EDIT:
Some Other HTML Parsing tools as described by Gordon in comments below:
phpQuery
Zend_Dom
QueryPath
FluentDom
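To cover the rest of the question (all 16 pages, output to a text file), here is a rough sketch with PHP's built-in DOMDocument. It assumes the pg query parameter simply runs from 1 to 16, which is inferred from the question and not verified, and it collects every link on each page because the table's markup is not shown. All function names and the output file name are made up for illustration.

```php
<?php
// Sketch: build the page URLs, extract hrefs from each page, write them to a file.
// page_urls(), extract_hrefs() and save_all_links() are hypothetical helper names.
function page_urls($pages = 16)
{
    $urls = array();
    for ($pg = 1; $pg <= $pages; $pg++) {
        // Assumption: pages are addressed by pg=1..16 as in the question's URL
        $urls[] = 'http://www.directorycritic.com/free-directory-list.html?pg=' . $pg . '&sort=pr';
    }
    return $urls;
}

function extract_hrefs($html)
{
    if (trim($html) === '') {
        return array(); // loadHTML rejects empty input
    }
    $doc = new DOMDocument();
    @$doc->loadHTML($html);
    $hrefs = array();
    foreach ($doc->getElementsByTagName('a') as $a) {
        if ($a->hasAttribute('href')) {
            $hrefs[] = $a->getAttribute('href');
        }
    }
    return $hrefs;
}

function save_all_links($outFile)
{
    $all = array();
    foreach (page_urls() as $url) {
        $html = @file_get_contents($url);
        if ($html === false) {
            continue; // skip pages that could not be fetched
        }
        $all = array_merge($all, extract_hrefs($html));
    }
    // One URL per line, duplicates removed
    file_put_contents($outFile, implode(PHP_EOL, array_unique($all)) . PHP_EOL);
}

// Usage (performs network requests, so it is left commented out here):
// save_all_links('links.txt');
```

To restrict the result to the table's links, extract_hrefs would need an XPath query for that table, which depends on markup not shown in the question.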
You really shouldn’t use regular expressions to parse HTML as it’s too error-prone.
Better to use an HTML parser like the one in PHP’s DOM library:
$code = file_get_contents($url);
$doc = new DOMDocument();
$doc->loadHTML($code);
$links = array();
foreach ($doc->getElementsByTagName('a') as $element) {
if ($element->hasAttribute('href')) {
$links[] = $element->getAttribute('href');
}
}
Note that this will collect the URI references as they appear in the document and not as an absolute URI. You might want to resolve them before.
It seems that PHP doesn’t provide an appropriate library (or I haven’t found it yet). But see RFC 3986 – Reference Resolution and my answer on Convert a relative URL to an absolute URL with Simple HTML DOM? for further details.
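As a stopgap, the common cases can be handled with parse_url. The following is a deliberately simplified sketch and not a full RFC 3986 implementation: it does not remove "." or ".." segments, and it ignores query and fragment handling on the base. The function name resolve_href is made up for illustration.

```php
<?php
// Sketch: resolve an href against a base URL for the common cases only.
// resolve_href() is a hypothetical helper; dot segments are NOT normalized.
function resolve_href($base, $href)
{
    $scheme = parse_url($href, PHP_URL_SCHEME);
    if (is_string($scheme)) {
        return $href; // already absolute (http:, https:, mailto:, ...)
    }
    $b = parse_url($base);
    $origin = $b['scheme'] . '://' . $b['host'] . (isset($b['port']) ? ':' . $b['port'] : '');
    if (strpos($href, '//') === 0) {
        return $b['scheme'] . ':' . $href; // protocol-relative
    }
    if (strpos($href, '/') === 0) {
        return $origin . $href; // root-relative
    }
    // Path-relative: append to the directory part of the base path
    $path = isset($b['path']) ? $b['path'] : '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $href;
}

echo resolve_href('http://www.example.com/abc/', '/m/offers/'), "\n";
echo resolve_href('http://www.example.com/abc/index.html', 'page.html'), "\n";
echo resolve_href('https://www.example.com/', '//cdn.example.com/a.js'), "\n";
```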
Try this method
function getinboundLinks($domain_name) {
ini_set('user_agent', 'NameOfAgent (http://localhost)');
$url = $domain_name;
$url_without_www=str_replace('http://','',$url);
$url_without_www=str_replace('www.','',$url_without_www);
$url_without_www= str_replace(strstr($url_without_www,'/'),'',$url_without_www);
$url_without_www=trim($url_without_www);
$input = @file_get_contents($url) or die("Could not access file: $url");
$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
$inbound=0;
$outbound=0;
$nonfollow=0;
if(preg_match_all("/$regexp/siU", $input, $matches, PREG_SET_ORDER)) {
foreach($matches as $match) {
# $match[2] = link address
# $match[3] = link text
//echo $match[3].'<br>';
if(!empty($match[2]) && !empty($match[3])) {
if(strstr(strtolower($match[2]),'URL:') || strstr(strtolower($match[2]),'url:') ) {
$nonfollow +=1;
} else if (strstr(strtolower($match[2]),$url_without_www) || !strstr(strtolower($match[2]),'http://')) {
$inbound += 1;
echo '<br>inbound '. $match[2];
}
else if (!strstr(strtolower($match[2]),$url_without_www) && strstr(strtolower($match[2]),'http://')) {
echo '<br>outbound '. $match[2];
$outbound += 1;
}
}
}
}
$links['inbound']=$inbound;
$links['outbound']=$outbound;
$links['nonfollow']=$nonfollow;
return $links;
}
// ************************Usage********************************
$Domain='http://zachbrowne.com';
$links=getinboundLinks($Domain);
echo '<br>Number of inbound Links '.$links['inbound'];
echo '<br>Number of outbound Links '.$links['outbound'];
echo '<br>Number of Nonfollow Links '.$links['nonfollow'];