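I'm making a script to fetch the content of other pages, and right now I'm working on a function that should return the content of a tag... but I'm a bit stuck :D
When the tag contains another tag of the same kind, as in the example at the bottom of the script, only this is printed:
found a new tag of same kind inside tag...
nothing found...
1111
2222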
<?php
function d($toprint)
{
echo $toprint."<br />";
}
function GetTagContents($source, $tag, $pos)
{
$startTagPos = strpos( $source, "<".$tag, $pos );
$startTagEndPos = strpos( $source, ">", $startTagPos )+1;
$endTagPos = strpos( $source, "</".$tag, $startTagEndPos);
$lastpos = $startTagPos+1;
while( $lastpos != False )
{
$newStartTagPos = strpos( $source, "<".$tag, $lastpos );
if( $newStartTagPos == False )
{
d("nothing found...");
$lastpos = False;
}
else if( $newStartTagPos > $endTagPos )
{
d("out of bounds...");
$lastpos = False;
}
else
{
d("found a new tag of same kind inside tag...");
$lastpos = $newStartTagPos+1;
$endTagPos = strpos( $source, "</".$tag, $newStartTagPos);
}
}
return substr($source, $startTagEndPos, $endTagPos-$startTagEndPos);
}
?>
<html>
<body>
<?php
d(GetTagContents('<div>1111<div>2222</div>3333</div>', "div", 0));
?>
</body>
</html>
Does anyone have any ideas?
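Using PHP DOM:
$src = new DOMDocument('1.0', 'utf-8');
$src->formatOutput = true;
$src->preserveWhiteSpace = false;
// loadHTMLFile() tolerates real-world HTML; load() expects well-formed XML
$src->loadHTMLFile('path/to/file.html');
$tagName = 'foo';
$element = $src->getElementsByTagName($tagName)->item(0);
var_dump($element->nodeValue); // text content of the first <foo> element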
strpos returns 0 when the match is at the very start of the string, and 0 == false in PHP. Compare the result with === instead, which evaluates to true only if both operands have the same value and the same type. That is, 0 == false is true, but 0 === false is not.
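A minimal sketch of that pitfall. The helper name `containsTag` and the sample string are made up for illustration and are not part of the original script:

```php
<?php
// Hypothetical helper: reports whether $needle occurs anywhere in $haystack.
function containsTag(string $haystack, string $needle): bool
{
    // strpos() returns an int offset, or false when there is no match.
    // Offset 0 is a valid match, so only the strict check is reliable.
    return strpos($haystack, $needle) !== false;
}

$source = '<div>1111<div>2222</div>3333</div>';

$pos = strpos($source, '<div');           // match at offset 0
var_dump($pos == false);                  // bool(true): loose comparison treats 0 as false
var_dump($pos === false);                 // bool(false): strict comparison sees an int
var_dump(containsTag($source, '<div'));   // bool(true)
var_dump(containsTag($source, '<span'));  // bool(false)
```

The same rule applies to every `strpos` result used as a loop condition: test it with `=== false` or `!== false`.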
You can use simplexml_load_string for this:
$xml = "<div>1111<div>2222</div>3333</div>";
$loadString = simplexml_load_string($xml);
foreach ($loadString->children() as $name => $data) {
    if ($name === 'div') { // comparison, not assignment
        echo $data . "\n";
    }
}
This is my first time actually resorting to posting on SO.
Sorry if this has been asked many times; I think I have read most of the related questions here, but still no dice.
I have a generated log file containing text I wish to extract. The line in the log file looks like this:
{22:30:47} System:"Obambivas" StarPos:(-59.938,7.375,56.813)ly Body:13 RelPos:(-0.529636,-0.130899,0.838064)km NormalFlight
So far I've managed to get the matches via preg_match_all, and that works fine.
However, I really need each System:"" only once, as the log may contain several that are exactly the same.
I've tried to use array_unique, but I'm fairly sure I'm using it wrong, as it either returns nothing or the same results, i.e. 10+ matches for each match found.
So I need just each unique match from the matches found in the log file.
My code so far is below (sorry if it's messy).
Thanks in advance.
if (is_dir($log) && is_readable($log)) {
if (!$files = scandir($log, SCANDIR_SORT_DESCENDING)) {
}
$newest_file = $files[0];
if (!$line = file($log . "/" . $newest_file)) {
} else {
foreach ($line as $line_num => $line) {
$pos = strpos($line, 'System:"');
$pos2 = strrpos($line, "ProvingGround");
if ($pos !== false && $pos2 === false) {
preg_match_all("/\System:\"(.*?)\"/", $line, $matches);
$cssystemname = $matches[1][0];
$curSys["name"] = $cssystemname;
preg_match_all("/\StarPos:\((.*?)\)/", $line, $matches2);
$curSys["coordinates"] = $matches2[1][0];
$coord_parts = explode(",", $curSys["coordinates"]);
$curSys["x"] = $coord_parts[0];
$curSys["y"] = $coord_parts[1];
$curSys["z"] = $coord_parts[2];
echo $curSys["name"].' | Coords: '.$curSys["x"].','.$curSys["y"].','.$curSys["z"].'<br />';
}
}
}
}
I added a $hash array to avoid duplicates:
if (is_dir($log) && is_readable($log)) {
    if (!$files = scandir($log, SCANDIR_SORT_DESCENDING)) {
    }
    $newest_file = $files[0];
    if (!$lines = file($log . "/" . $newest_file)) {
    } else {
        $hash = array(); // system names that were already printed
        foreach ($lines as $line_num => $line) {
            $pos = strpos($line, 'System:"');
            $pos2 = strrpos($line, "ProvingGround");
            if ($pos !== false && $pos2 === false) {
                // "\S" in a pattern means "any non-whitespace character",
                // so the literal names are matched without a backslash
                preg_match_all("/System:\"(.*?)\"/", $line, $matches);
                $cssystemname = $matches[1][0];
                if (!isset($hash[$cssystemname])) {
                    $hash[$cssystemname] = true; // remember this system
                    $curSys["name"] = $cssystemname;
                    preg_match_all("/StarPos:\((.*?)\)/", $line, $matches2);
                    $curSys["coordinates"] = $matches2[1][0];
                    $coord_parts = explode(",", $curSys["coordinates"]);
                    $curSys["x"] = $coord_parts[0];
                    $curSys["y"] = $coord_parts[1];
                    $curSys["z"] = $coord_parts[2];
                    echo $curSys["name"].' | Coords: '.$curSys["x"].','.$curSys["y"].','.$curSys["z"].'<br />';
                }
            }
        }
    }
}
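The idea of keying an array by system name can also be sketched as a standalone function. This is only an illustration: the function name `uniqueSystems`, the combined pattern, and the second log line (a copy of the sample with a made-up timestamp) are assumptions, and the ProvingGround filter from the code above is left out for brevity.

```php
<?php
// Hypothetical helper: returns each system name once, mapped to its coordinates.
function uniqueSystems(array $lines): array
{
    $systems = array();
    foreach ($lines as $line) {
        // Skip lines that do not contain a System/StarPos pair.
        if (!preg_match('/System:"(.*?)"\s+StarPos:\((.*?)\)/', $line, $m)) {
            continue;
        }
        // Array keys are unique, so a repeated name is simply ignored.
        if (!isset($systems[$m[1]])) {
            list($x, $y, $z) = explode(',', $m[2]);
            $systems[$m[1]] = array('x' => $x, 'y' => $y, 'z' => $z);
        }
    }
    return $systems;
}

$lines = array(
    '{22:30:47} System:"Obambivas" StarPos:(-59.938,7.375,56.813)ly Body:13 RelPos:(-0.529636,-0.130899,0.838064)km NormalFlight',
    // Made-up duplicate entry to show that it is reported only once.
    '{22:31:02} System:"Obambivas" StarPos:(-59.938,7.375,56.813)ly Body:13 RelPos:(-0.529636,-0.130899,0.838064)km NormalFlight',
);

foreach (uniqueSystems($lines) as $name => $c) {
    echo $name . ' | Coords: ' . $c['x'] . ',' . $c['y'] . ',' . $c['z'] . "\n";
}
```

Collecting the results first and printing afterwards keeps the de-duplication separate from the output, which makes it easier to reuse the list elsewhere.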
I have to get all links whose nodeValue is 'Download', but when I try to get all the links first and then select the ones I need, only the links inside my <header> tag are picked up. The 'Download' links are further down the page.
What am I doing wrong?
Here is the function:
<?php
function rkm_download_links_fix($current_url) {
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTMLFile($current_url);
libxml_use_internal_errors(false);
$urls = $dom->getElementsByTagName('a');
print_r($urls); // here i get only links in <header>
$url_copy = array();
foreach ($urls as $url) {
print_r($url->nodeValue);
if($url->nodeValue == 'download') {
$attributes = $url->attributes;
$url_copy[] = array('url' => $url->getAttribute('href'));
}
}
} ?>
If you need more info, please do not hesitate to ask.
Thanks in advance!
Why do you need DOM?
Plain PHP is enough:
<?php
function getLinks($url)
{
$document = file_get_contents($url);
$links = explode('<a', $document);
$resultLinks = array();
if(count($links) <= 1)
return 'no links';
for($i = 0; $i < count($links); ++$i)
{
if(mb_strpos($links[$i], '>download</a>', 0, 'UTF-8') === false &&
mb_strpos($links[$i], '>Download</a>', 0, 'UTF-8') === false)
continue;
$hrefStart = mb_strpos($links[$i], 'href', 0, 'UTF-8');
if($hrefStart === false)
continue;
$hrefStart += 4;
// keep the offset in $hrefStart so the single-quote fallback starts from the same place
$quotePos = mb_strpos($links[$i], '"', $hrefStart, 'UTF-8');
if($quotePos === false)
$quotePos = mb_strpos($links[$i], '\'', $hrefStart, 'UTF-8');
if($quotePos === false)
continue;
$hrefStart = $quotePos + 1;
$hrefEnd = mb_strpos($links[$i], '"', $hrefStart, 'UTF-8');
if($hrefEnd === false)
$hrefEnd = mb_strpos($links[$i], '\'', $hrefStart, 'UTF-8');
if($hrefEnd === false)
continue;
$resultLinks[] = mb_substr($links[$i], $hrefStart, ($hrefEnd - $hrefStart), 'UTF-8');
}
return $resultLinks;
}
$links = getLinks('http://parse/ws/try2.html');
echo '<pre>';
print_r($links);
echo '</pre>';
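For comparison, here is a DOM-based sketch of the same task. It assumes that part of the problem in the question is the exact comparison against 'download' while the link text is 'Download', possibly with surrounding whitespace; that is a guess, not something the question confirms. The function name `downloadLinks` and the sample markup are made up.

```php
<?php
// Hypothetical helper: returns the href of every <a> whose text is "download",
// ignoring case and surrounding whitespace.
function downloadLinks(string $html): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect parser warnings instead of printing them
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = array();
    foreach ($dom->getElementsByTagName('a') as $a) {
        if (strtolower(trim($a->nodeValue)) === 'download') {
            $hrefs[] = $a->getAttribute('href');
        }
    }
    return $hrefs;
}

$html = '<html><body><header><a href="/home">Home</a></header>'
      . '<p><a href="/files/a.zip"> Download </a></p>'
      . '<p><a href="/files/b.zip">download</a></p></body></html>';

print_r(downloadLinks($html));
```

Unlike the explode-based version, this does not depend on how the href attribute is quoted or on where the attribute appears inside the tag.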
I'm trying to code a PHP parser to gather professor reviews from ratemyprofessor.com. Each professor has a page with all the reviews on it; I want to parse each professor's page and extract the comments into a txt file.
This is what I have so far, but it doesn't execute properly: when I run it, the output txt file remains empty. What could be the issue?
<?php
set_time_limit(0);
$domain = "http://www.ratemyprofessors.com";
$content = "div id=commentsection";
$content_tag = "comment";
$output_file = "reviews.txt";
$max_urls_to_check = 400;
$rounds = 0;
$reviews_stack = array();
$max_size_domain_stack = 10000;
$checked_domains = array();
while ($domain != "" && $rounds < $max_urls_to_check) {
$doc = new DOMDocument();
@$doc->loadHTMLFile($domain);
$found = false;
foreach($doc->getElementsByTagName($content_tag) as $tag) {
if (strpos($tag->nodeValue, $content)) {
$found = true;
break;
}
}
$checked_domains[$domain] = $found;
foreach($doc->getElementsByTagName('a') as $link) {
$href = $link->getAttribute('href');
if (strpos($href, 'http://') !== false && strpos($href, $domain) === false) {
$href_array = explode("/", $href);
if (count($domain_stack) < $max_size_domain_stack &&
$checked_domains["http://".$href_array[2]] === null) {
array_push($domain_stack, "http://".$href_array[2]);
}
};
}
$domain_stack = array_unique($domain_stack);
$domain = $domain_stack[0];
unset($domain_stack[0]);
$domain_stack = array_values($domain_stack);
$rounds++;
}
$found_domains = "";
foreach ($checked_domains as $key => $value) {
if ($value) {
$found_domains .= $key."\n";
}
}
file_put_contents($output_file, $found_domains);
?>
The output is empty because the $domain_stack array is never initialized.
Main fix: initialize the variable:
$domain_stack = array(); // before while ($domain != ...... )
Additionally, fix the other warnings and notices:
// change this
$checked_domains["http://".$href_array[2]] === null
// into
!isset($checked_domains["http://".$href_array[2]])
// another line
// check if key exists
if (isset($domain_stack[0])) {
$domain = $domain_stack[0];
unset($domain_stack[0]);
}
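Both points (initialize the array first, test keys with isset) can be seen in a small stand-alone sketch of a crawl queue. The helper name `enqueue` and the example URLs are hypothetical and not taken from the script above:

```php
<?php
// Hypothetical helper: adds $url to the queue unless it was already seen.
function enqueue(array &$queue, array &$seen, string $url): void
{
    // isset() is safe for missing keys; reading $seen[$url] directly
    // would raise an "undefined index" / "undefined array key" message.
    if (!isset($seen[$url])) {
        $seen[$url] = true;
        $queue[] = $url;
    }
}

$queue = array(); // initialized before use, so count() and array_push() work
$seen  = array();

enqueue($queue, $seen, 'http://example.com');
enqueue($queue, $seen, 'http://example.org');
enqueue($queue, $seen, 'http://example.com'); // duplicate, ignored

// array_shift() returns null on an empty array, which ends the loop.
while (($next = array_shift($queue)) !== null) {
    echo $next . "\n";
}
```

Using array_shift also replaces the unset / array_values pair, since it removes the first element and reindexes in one call.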
When I spider a website, I get a lot of bad URLs like these:
http://example.com/../../.././././1.htm
http://example.com/test/../test/.././././1.htm
http://example.com/.//1.htm
http://example.com/../test/..//1.htm
All of these should be http://example.com/1.htm.
How can I do this in PHP? Thanks.
PS: I use http://snoopy.sourceforge.net/
I get a lot of repeated links in my database; 'http://example.com/../test/..//1.htm' should be 'http://example.com/1.htm'.
You could do it like this, assuming all the URLs you have provided are expected to be http://example.com/1.htm:
$test = array('http://example.com/../../../././.\./1.htm',
'http://example.com/test/../test/../././.\./1.htm',
'http://example.com/.//1.htm',
'http://example.com/../test/..//1.htm');
foreach ($test as $url){
$u = parse_url($url);
$path = $u['scheme'].'://'.$u['host'].'/'.basename($u['path']);
echo $path.'<br />'.PHP_EOL;
}
/* result
http://example.com/1.htm<br />
http://example.com/1.htm<br />
http://example.com/1.htm<br />
http://example.com/1.htm<br />
*/
//or as a function @lpc2138
function getRealUrl($url){
$u = parse_url($url);
$path = $u['scheme'].'://'.$u['host'].'/'.basename($u['path']);
$path .= (!empty($u['query'])) ? '?'.$u['query'] : '';
return $path;
}
You seem to be looking for an algorithm to remove the dot segments (RFC 3986, section 5.2.4 describes one):
function remove_dot_segments($abspath) {
$ib = $abspath;
$ob = '';
while ($ib !== '') {
if (substr($ib, 0, 3) === '../') {
$ib = substr($ib, 3);
} else if (substr($ib, 0, 2) === './') {
$ib = substr($ib, 2);
} else if (substr($ib, 0, 2) === '/.' && (strlen($ib) === 2 || $ib[2] === '/')) { // length check first avoids reading a missing offset
$ib = '/'.substr($ib, 3);
} else if (substr($ib, 0, 3) === '/..' && (strlen($ib) === 3 || $ib[3] === '/')) {
$ib = '/'.substr($ib, 4);
$ob = substr($ob, 0, strlen($ob)-strlen(strrchr($ob, '/')));
} else if ($ib === '.' || $ib === '..') {
$ib = '';
} else {
$pos = strpos($ib, '/', 1);
if ($pos === false) {
$ob .= $ib;
$ib = '';
} else {
$ob .= substr($ib, 0, $pos);
$ib = substr($ib, $pos);
}
}
}
return $ob;
}
This removes the . and .. segments. Removing any other segment, such as an empty one (//) or .\., is not covered by the standard, because it changes the semantics of the path.
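If collapsing empty segments is acceptable for your crawler anyway, a segment stack is one way to sketch it. Note the assumptions: dropping "//" goes beyond the standard and is only safe when the server treats "//" like "/", and the function name `normalizeUrlPath` is made up. The sketch also ignores ports, user info, fragments and trailing slashes.

```php
<?php
// Hypothetical helper: collapses ".", ".." and empty segments in a URL path.
// Dropping empty segments ("//") is an assumption, not standard behaviour.
function normalizeUrlPath(string $url): string
{
    $u = parse_url($url);
    $stack = array();
    foreach (explode('/', isset($u['path']) ? $u['path'] : '/') as $segment) {
        if ($segment === '' || $segment === '.') {
            continue;               // nothing to keep
        }
        if ($segment === '..') {
            array_pop($stack);      // go up one level; no-op at the root
            continue;
        }
        $stack[] = $segment;
    }
    $result = $u['scheme'] . '://' . $u['host'] . '/' . implode('/', $stack);
    if (!empty($u['query'])) {
        $result .= '?' . $u['query'];
    }
    return $result;
}

$urls = array(
    'http://example.com/../../.././././1.htm',
    'http://example.com/test/../test/.././././1.htm',
    'http://example.com/.//1.htm',
    'http://example.com/../test/..//1.htm',
);
foreach ($urls as $url) {
    echo normalizeUrlPath($url) . "\n"; // each line prints http://example.com/1.htm
}
```

Unlike a basename-based shortcut, this keeps real directories, so http://example.com/a/./b/../c.htm becomes http://example.com/a/c.htm rather than http://example.com/c.htm.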
You could do some fancy regex but this works just fine.
fixUrl('http://example.com/../../../././.\./1.htm');
function fixUrl($str) {
$str = str_replace('../', '', $str);
$str = str_replace('./', '', $str);
$str = str_replace('\.', '', $str);
return $str;
}
I am wondering if there is an elegant way to trim some text but while being HTML tag aware?
For example, I have this string:
$data = '<strong>some title text here that could get very long</strong>';
And let's say I need to return/output this string on a page but would like it to be no more than X characters. Let's say 35 for this example.
Then I use:
$output = substr($data,0,35);
But now I end up with:
<strong>some title text here that
As you can see, the closing strong tag is cut off, which breaks the HTML display.
Is there a way around this? Also note that it is possible to have multiple tags in the string for example:
<p>some text here <strong>and here</strong></p>
A few months ago I wrote a function that solves exactly this problem.
Here is the function:
function substr_close_tags($code, $limit = 300)
{
if ( strlen($code) <= $limit )
{
return $code;
}
$html = substr($code, 0, $limit);
preg_match_all ( "#<([a-zA-Z]+)#", $html, $result );
foreach($result[1] AS $key => $value)
{
if ( strtolower($value) == 'br' )
{
unset($result[1][$key]);
}
}
$openedtags = $result[1];
preg_match_all ( "#</([a-zA-Z]+)>#iU", $html, $result );
$closedtags = $result[1];
foreach($closedtags AS $key => $value)
{
if ( ($k = array_search($value, $openedtags)) === FALSE )
{
continue;
}
else
{
unset($openedtags[$k]);
}
}
if ( empty($openedtags) )
{
if ( strpos($code, ' ', $limit) == $limit )
{
return $html."...";
}
else
{
return substr($code, 0, strpos($code, ' ', $limit))."...";
}
}
$position = 0;
$close_tag = '';
foreach($openedtags AS $key => $value)
{
$p = strpos($code, ('</'.$value.'>'), $limit);
if ( $p === FALSE )
{
$code .= ('</'.$value.'>');
}
else if ( $p > $position )
{
$close_tag = '</'.$value.'>';
$position = $p;
}
}
if ( $position == 0 )
{
return $code;
}
return substr($code, 0, $position).$close_tag."...";
}
Here is DEMO: http://sandbox.onlinephpfunctions.com/code/899d8137c15596a8528c871543eb005984ec0201 (click "Execute code" to check how it works).
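A different way to approach the same problem is to walk the markup token by token and keep a stack of open tags. The following is only a sketch under stated assumptions, not the function above: the name `truncateHtml` is made up, the limit counts text bytes only (tags are not counted, and multibyte text would need the mb_* functions), and the list of void tags is deliberately short.

```php
<?php
// Hypothetical helper: keeps at most $limit bytes of text (tags are not
// counted) and closes every tag that is still open at the cut.
function truncateHtml(string $html, int $limit): string
{
    // Split into tags and text runs; the capture group keeps the tags.
    $tokens = preg_split('/(<[^>]+>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
    $void = array('br', 'hr', 'img', 'input', 'meta', 'link');
    $open = array();   // stack of currently open tag names
    $out = '';
    $used = 0;
    foreach ($tokens as $token) {
        if ($token[0] === '<') {
            if (preg_match('#^</([a-zA-Z0-9]+)#', $token, $m)) {
                // Closing tag: pop it if it matches the innermost open tag.
                if (end($open) === strtolower($m[1])) {
                    array_pop($open);
                }
            } elseif (preg_match('#^<([a-zA-Z0-9]+)#', $token, $m)
                && !in_array(strtolower($m[1]), $void, true)
                && substr($token, -2) !== '/>') {
                $open[] = strtolower($m[1]);
            }
            $out .= $token;
            continue;
        }
        $remaining = $limit - $used;
        if (strlen($token) > $remaining) {
            // Cut inside this text run and stop reading further tokens.
            $out .= substr($token, 0, $remaining) . '...';
            break;
        }
        $out .= $token;
        $used += strlen($token);
    }
    // Close whatever is still open, innermost first.
    while ($open) {
        $out .= '</' . array_pop($open) . '>';
    }
    return $out;
}

echo truncateHtml('<strong>some title text here that could get very long</strong>', 20);
// prints <strong>some title text here...</strong>
```

Because the cut can only happen inside a text run, the limit never lands in the middle of a tag.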
Using @newbieuser's function, I ran into the same issue as @pablo-pazos: the output broke when $limit fell inside an HTML tag (in my case inside <br />, at the r).
I fixed it with this code:
if ( strlen($code) <= $limit ){
return $code;
}
$html = substr($code, 0, $limit);
//We must find a . or > or space so we are sure not being in a html-tag!
//In my case there are only <br>
//If you have more tags, or html formatted text, you must do a little more and also use something like http://htmlpurifier.org/demo.php
$_find_last_char = strrpos($html, ".")+1;
if($_find_last_char > $limit/3*2){
$html_break = $_find_last_char;
}else{
$_find_last_char = strrpos($html, ">")+1;
if($_find_last_char > $limit/3*2){
$html_break = $_find_last_char;
}else{
$html_break = strrpos($html, " ");
}
}
$html = substr($html, 0, $html_break);
preg_match_all ( "#<([a-zA-Z]+)#", $html, $result );
......
substr(strip_tags($content), 0, 100)