How to change URL in HTML parser - PHP

I have this code that outputs all URLs with ?tid=someNumbers:
<?php
include 'simple_html_dom.php';
// Create DOM from URL or file
$html = file_get_html('http://news.sinchew.com.my/node');
// Find all links
foreach ($html->find('a') as $element) {
    $tid = '?tid';
    $url = 'news.sinchew.com.my/node';
    // strpos() can return 0, so compare against false explicitly
    if (strpos($element->href, $tid) !== false && strpos($element->href, $url) !== false) {
        echo $element->href . '<br>';
    }
}
?>
What I want to do is change ?tid=someNumbers to ?tid=1234 and then output all URLs with ?tid=1234. I've been stuck on this for hours; can someone help me with it?

Try preg_replace to perform substitutions based on regular expressions:
<?php
//...
echo preg_replace("/\\?tid=[0-9]+/", "?tid=1234", $element->href);
//...
?>
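Putting that together with the loop from the question, it could look like this (a sketch; the pattern assumes the tid value is always numeric):
<?php
include 'simple_html_dom.php';
$html = file_get_html('http://news.sinchew.com.my/node');
foreach ($html->find('a') as $element) {
    // Only touch links that contain ?tid and point at the site
    if (strpos($element->href, '?tid') !== false && strpos($element->href, 'news.sinchew.com.my/node') !== false) {
        // Replace whatever number follows ?tid= with 1234
        echo preg_replace('/\?tid=[0-9]+/', '?tid=1234', $element->href) . '<br>';
    }
}
?>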

Related

array_unique() in PHP Simple HTML DOM

I wrote the code below to get all unique links from a URL:
include_once ('simple_html_dom.php');
$html = file_get_html('http://www.example.com');
foreach ($html->find('a') as $element) {
    $input = array($element->href = $element->href . '<br />');
    print_r(array_unique($input));
}
but I really can't understand why it shows the duplicated links too!
Is there any problem with the function array_unique() and Simple HTML DOM?
And there's another thing I guess is related to the problem: when you execute this, you see that all of the links it extracted are in one key, I mean like this:
array(key => all values)
Is there anyone who can solve this?
I believe you want it more like this:
$temp = array();
foreach ($html->find('a') as $element) {
    $temp[] = $element->href;
}
echo '<pre>' . print_r(array_unique($temp), true) . '</pre>';
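The reason the original version keeps showing duplicates is that array_unique() is called on a fresh one-element array in every iteration, so there is never anything to de-duplicate. Collecting all hrefs first and de-duplicating once, as above, fixes that. If you prefer one link per line instead of the print_r() dump, a small variation (a sketch):
include_once ('simple_html_dom.php');
$html = file_get_html('http://www.example.com');
$temp = array();
foreach ($html->find('a') as $element) {
    $temp[] = $element->href; // collect every href first
}
// De-duplicate once, then print one link per line
foreach (array_unique($temp) as $href) {
    echo $href . '<br />';
}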

When parsing HTML, check if an element is present

I'm parsing the HTML of a page to get a list of the outgoing links, and I want to split them in two: the ones with rel="nofollow" / rel="nofollow me" / rel="me nofollow" and the ones without those expressions.
At the moment I'm using the code below, parsed with the PHP Simple HTML DOM Parser:
$html = file_get_html("$url");
foreach($html->find('a') as $element) {
echo $element->href; // THE LINK
}
but I'm not quite sure how to implement it. Any ideas?
Try using something like this:
$html = file_get_html("$url");
// Creating array for storing links
$arrayLinks = array(
"nofollow" => array(),
"others" => array()
);
foreach($html->find('a') as $element) {
// Search for "nofollow" expression with no case-sensitive (i flag)
if(preg_match('#nofollow#i', $element->rel)) {
$arrayLinks["nofollow"][] = $element->href;
}
else {
$arrayLinks["others"][] = $element->href;
}
}
// Display the array
echo "<pre>";
print_r($arrayLinks);
echo "</pre>";
Do a regexp on $element->rel I guess
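If you would rather not use a regex for such a simple substring test, a case-insensitive stripos() check works as well (a sketch; it treats any rel value containing "nofollow" the same way the regex above does):
$html = file_get_html("$url");
$arrayLinks = array("nofollow" => array(), "others" => array());
foreach ($html->find('a') as $element) {
    // Case-insensitive substring check on the rel attribute
    if (stripos($element->rel, 'nofollow') !== false) {
        $arrayLinks["nofollow"][] = $element->href;
    } else {
        $arrayLinks["others"][] = $element->href;
    }
}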

How to get page title in PHP?

I have this function to get the title of a website:
function getTitle($Url){
    $str = file_get_contents($Url);
    if (strlen($str) > 0) {
        preg_match("/\<title\>(.*)\<\/title\>/", $str, $title);
        return $title[1];
    }
}
However, this function makes my page take too much time to respond. Someone told me to get the title from the request to the website only, without reading the whole file, but I don't know how. Can anyone please tell me which code and function I should use to do this? Thank you very much.
Using regex is not a good idea for HTML; use the DOM parser instead:
$html = new simple_html_dom();
$html->load_file('****'); // put URL or filename
$title = $html->find('title', 0); // first <title> element
echo $title->plaintext;
or
// Create DOM from URL or file
$html = file_get_html('*****');
// Find the title element
foreach ($html->find('title') as $element)
    echo $element->plaintext . '<br>';
Good read
RegEx match open tags except XHTML self-contained tags
Use jQuery instead to get the title of your page:
$(document).ready(function() {
    alert($("title").text());
});
Demo : http://jsfiddle.net/WQNT8/1/
Try this; it will surely work:
include_once 'simple_html_dom.php';
$oHtml = file_get_html($url); // fetch and parse the page
$Title = $oHtml->find('title', 0)->innertext;
$Description = $oHtml->find("meta[name='description']", 0)->content;
$keywords = $oHtml->find("meta[name='keywords']", 0)->content;
echo $Title;
echo $Description;
echo $keywords;
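A note on the performance concern in the question: the page title is not sent in the HTTP response headers, it only exists in the HTML body, so it cannot be read from the headers alone. What you can do is read just the beginning of the body instead of downloading the whole document. A minimal sketch, assuming allow_url_fopen is enabled; the getTitleFast() helper and the 4096-byte limit are my own illustration, so raise the limit if the <title> tag appears later in the markup:
function getTitleFast($url, $maxBytes = 4096) {
    $handle = @fopen($url, 'r'); // open the URL as a stream
    if ($handle === false) {
        return null; // could not connect
    }
    $chunk = stream_get_contents($handle, $maxBytes); // read only the first $maxBytes bytes
    fclose($handle);
    // Extract the <title> contents from the partial markup
    if (preg_match('#<title[^>]*>(.*?)</title>#is', $chunk, $m)) {
        return trim($m[1]);
    }
    return null; // title not found in the first chunk
}
echo getTitleFast('http://www.example.com');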

Get a specified URL from a webpage using simplehtmldom

I am trying to build a simple PHP crawler.
For this purpose I am getting the contents of a webpage using
http://simplehtmldom.sourceforge.net/
After getting the page data, I process it as below:
include('simplehtmldom/simple_html_dom.php');
$html = file_get_html('http://www.mypage.com');
foreach ($html->find('a') as $e)
    echo $e->href . '<br>';
This works perfectly and prints all links on that page.
I only want to get URLs like
/view.php?view=open&id=
I have written a function for this purpose:
function starts_text_with($s, $prefix){
    return strpos($s, $prefix) === 0;
}
and I use this function as follows:
include('simplehtmldom/simple_html_dom.php');
$html = file_get_html('http://www.mypage.com');
foreach ($html->find('a') as $e) {
    if (starts_text_with($e->href, "/view.php?view=open&id="))
        echo $e->href . '<br>';
}
but nothing is returned.
I hope you understand what I need:
I need to print only the URLs which match that criteria.
Thanks
include('simplehtmldom/simple_html_dom.php');
$html = file_get_html('http://www.mypage.com');
foreach ($html->find('a') as $e) {
    // Match hrefs containing view.php?view=open&id=
    if (preg_match('#view\.php\?view=open&id=#', $e->href))
        echo $e->href . '<br>';
}
Try this once.
Refer to preg_match.
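One possible reason the prefix check in the question returns nothing is that the page emits absolute URLs (http://www.mypage.com/view.php?view=open&id=...), in which case the href never starts with /view.php. A sketch that matches the pattern anywhere in the href, assuming that is acceptable for your crawl:
include('simplehtmldom/simple_html_dom.php');
$html = file_get_html('http://www.mypage.com');
foreach ($html->find('a') as $e) {
    // Accept both relative (/view.php?...) and absolute (http://.../view.php?...) links
    if (strpos($e->href, 'view.php?view=open&id=') !== false) {
        echo $e->href . '<br>';
    }
}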

Simple HTML DOM Parser error handling

I'm using the SimpleHTMLDOM Parser to scrape a website and I would like to know if there is any error handling method. For example, if the link is broken there is no use in advancing in the code and searching the document.
Thank you.
<?php
$html = file_get_html('http://www.google.com/');
foreach ($html->find('a') as $element)
{
    if (empty($element->href))
    {
        continue; // will skip <a> without href
    }
    echo $element->href . "<br>\n";
}
?>
a loop and continue?
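The snippet above skips <a> tags without an href, but the question also asks about a broken link, i.e. a page that cannot be fetched at all. In that case file_get_html() should return false, so you can guard against it before searching the document (a sketch, assuming the default simple_html_dom behaviour):
<?php
include 'simple_html_dom.php';
$html = @file_get_html('http://www.example.com/'); // suppress the warning on a dead link
if ($html === false) {
    // The page could not be fetched (broken link, timeout, ...),
    // so stop here instead of searching an empty document.
    exit('Could not load the page.');
}
foreach ($html->find('a') as $element) {
    if (empty($element->href)) {
        continue; // skip <a> without href
    }
    echo $element->href . "<br>\n";
}
?>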
