I have a script that I'm trying to finish with simple_html_dom.
I can scrape the web pages I want, but the links are invalid. I want to make the links valid, so I've been trying different things without getting it to work.
I can get it to scrape, or to fix links from a previously saved page, but I can't seem to scrape the pages and fix the links so they reference the correct domain.
I might be misusing or misunderstanding simplehtmldom's "save" function.
Here is what I've got right now:
<?php
include 'simple_html_dom.php';
$file1 = "http://www.indeed.com/jobs?q=Electrician&l=maine";
$file2 = "http://www.indeed.com/jobs?q=Electronic&l=maine";
$file3 = "http://www.indeed.com/jobs?q=Electronics+Tech&l=maine";
$file4 = "http://www.indeed.com/jobs?q=Helpdesk&l=maine";
$file5 = "http://www.indeed.com/jobs?q=Trades&l=maine";
$SEARCH = array($file1, $file2, $file3, $file4, $file5);
//Fix links
$domain = "http://www.indeed.com";
$rep['/href="(?!https?:\/\/)(?!data:)(?!#)/'] = 'href="'.$domain;
$rep['/src="(?!https?:\/\/)(?!data:)(?!#)/'] = 'src="'.$domain;
$rep['/#import[\n+\s+]"\//'] = '#import "'.$domain;
$rep['/#import[\n+\s+]"\./'] = '#import "'.$domain;
//Find this: data-tn-component="organicJob"
//<div class=" row result" id="p_a8a968e2788dad48" data-jk="a8a968e2788dad48" itemscope itemtype="http://schema.org/JobPosting" data-tn-component="organicJob">
$html = new simple_html_dom();
for ($i = 0; $i<6; $i++)
{
$html->load_file($SEARCH[$i]);
foreach($html->find('div[data-tn-component="organicJob"]') as $div)
{
$str = $html->save($div);
$output = preg_replace(array_keys($rep), array_values($rep), $str);
echo $output->innertext . "\n";
}
}
?>
How can I scrape the pages and fix the links so they point to the correct domain?
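One possible direction, as a minimal sketch rather than a tested fix for the script above: as far as I know, simple_html_dom's save() takes an optional file path, not an element, so the HTML of each matched div would come from $div->outertext instead. The loop bound also looks off, since $SEARCH has five entries but the loop runs six times. The helper name absolutize_links() and the sample fragment below are made up for illustration, and only root-relative links (starting with a single /) are handled.

```php
<?php
// Hypothetical helper: prefix root-relative href/src values with a domain.
function absolutize_links(string $html, string $domain): string
{
    // Match href="/..." or src="/..." but skip protocol-relative "//..." values.
    $pattern = '/\b(href|src)=(["\'])\/(?!\/)/i';
    return preg_replace_callback($pattern, function (array $m) use ($domain): string {
        return $m[1] . '=' . $m[2] . rtrim($domain, '/') . '/';
    }, $html);
}

// Stand-in for $div->outertext from simple_html_dom (assumed, not fetched here).
$fragment = '<a href="/rc/clk?jk=abc">Job</a> <img src="/images/logo.png"> '
          . '<a href="http://example.org/x">Ext</a>';
echo absolutize_links($fragment, 'http://www.indeed.com'), "\n";
```

In the original loop this would be called as absolutize_links($div->outertext, $domain), and the result echoed directly, since it is a plain string with no innertext property.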
I got the source code from a remote URL like this:
$f = file_get_contents("http://www.example.com/abc/");
$str=htmlspecialchars( $f );
echo $str;
In that code I want to replace any URL that looks like
href="/m/offers/"
with
href="www.example.com/m/offers/"
For that I used:
$newstr=str_replace('href="/m/offers/"','href="www/exmple.com/m/offers/',$str);
echo $newstr;
But this does not replace anything. So I want to know: can I use str_replace on code that was fetched from a remote URL? If yes, how? If no, is there another solution?
There will not be any " characters in your $str, because htmlspecialchars() will have converted them all to &quot; before the string got to your str_replace().
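To make that concrete, here is a small sketch that does the replacement on the raw HTML first and only escapes it afterwards for display. The $f string is a stand-in for what file_get_contents() would return, and the http:// prefix in the replacement is my assumption about what the final link should look like.

```php
<?php
// Stand-in for the fetched page; the real code would use file_get_contents().
$f = '<a href="/m/offers/">Offers</a>';

// Escaping first turns every " into &quot;, so this search string is never found.
$escaped = htmlspecialchars($f);
$unchanged = str_replace('href="/m/offers/"', 'href="http://www.example.com/m/offers/"', $escaped);

// Replacing on the raw HTML first, then escaping for display, works.
$replaced = str_replace('href="/m/offers/"', 'href="http://www.example.com/m/offers/"', $f);
echo htmlspecialchars($replaced), "\n";
```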
I start by assuming that all href attributes belong to <a> tags.
Since we cannot know whether all tags are written in the same way, instead of opting for regular expressions I will use a parser to make the extraction easier.
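<?php
// Assumes symfony/dom-crawler and symfony/css-selector were installed with Composer
require 'vendor/autoload.php';
use Symfony\Component\DomCrawler\Crawler;
$base = "http://www.example.com";
$url = $base . "/abc/";
$html = file_get_contents($url);
$crawler = new Crawler($html);
$links = array();
$raw_links = array();
$offers = array();
// Iterating a Crawler yields DOMElement nodes, so use getAttribute()
foreach ($crawler->filter('a') as $atag) {
    $raw_links[] = $raw_link = $atag->getAttribute('href');
    // Strip the domain to get a relative link
    $links[] = $link = str_replace($base, '', $raw_link);
    if (strpos($link, 'm/offers') !== false) {
        $offers[] = $link;
    }
}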
Now you have all the raw links, the relative links, and the offers links.
I used the DomCrawler component.
I have a program that removes certain pages from a website; I then want to traverse the remaining pages and "unlink" any links to those removed pages. I'm using simplehtmldom. My function takes a source page ($source) and an array of pages ($skipList). It finds the links, and I'd then like to manipulate the DOM to convert each matching element into its $link->innertext, but I don't know how. Any help?
function RemoveSpecificLinks($source, $skipList) {
// $source is the html source file;
// $skipList is an array of link destinations (hrefs) that we want unlinked
$docHtml = file_get_contents($source);
$htmlObj = str_get_html($docHtml);
$links = $htmlObj->find('a');
if (isset($links)) {
foreach ($links as $link) {
if (in_array($link->href, $skipList)) {
$link->href = ''; // Should convert to simple text element
}
}
}
$docHtml = $htmlObj->save();
$htmlObj->clear();
unset($htmlObj);
return($docHtml);
}
I have never used simplehtmldom, but this is what I think should solve your problem:
function RemoveSpecificLinks($source, $skipList) {
// $source is the HTML source file;
// $skipList is an array of link destinations (hrefs) that we want unlinked
$docHtml = file_get_contents($source);
$htmlObj = str_get_html($docHtml);
$links = $htmlObj->find('a');
if (isset($links)) {
foreach ($links as $link) {
if (in_array($link->href, $skipList)) {
$link->outertext = $link->plaintext; // THIS SHOULD WORK
// IF THIS DOES NOT WORK TRY:
// $link->outertext = $link->innertext;
}
}
}
$docHtml = $htmlObj->save();
$htmlObj->clear();
unset($htmlObj);
return($docHtml);
}
Please let me know whether this worked and, if so, which method worked.
Update: Maybe you would prefer this:
$link->outertext = $link->href;
This way you get the link displayed, but not clickable.
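If simplehtmldom turns out to be awkward here, the same idea can be sketched with PHP's built-in DOMDocument. This is an alternative under my own assumptions, not the simplehtmldom approach above: the function name unlink_anchors() is made up, it takes an HTML string instead of a file path, and loadHTML() will wrap a fragment in html/body tags.

```php
<?php
// Hypothetical DOMDocument variant of RemoveSpecificLinks(), taking an HTML string.
function unlink_anchors(string $html, array $skipList): string
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of printing them for sloppy markup.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    // Copy to an array first: the live node list changes as nodes are replaced.
    $anchors = iterator_to_array($doc->getElementsByTagName('a'));
    foreach ($anchors as $a) {
        if (in_array($a->getAttribute('href'), $skipList, true)) {
            // Swap the <a> element for a text node holding its visible text.
            $a->parentNode->replaceChild($doc->createTextNode($a->textContent), $a);
        }
    }
    return $doc->saveHTML();
}

echo unlink_anchors(
    '<p>See <a href="old.html">the old page</a> and <a href="new.html">the new one</a>.</p>',
    ['old.html']
);
```

Note that using textContent drops any markup nested inside the link (bold text, images), which is roughly what plaintext does in the simplehtmldom version.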
I am a PHP newbie, but I am pretty sure this will be hard to accomplish and very heavy on the server. Still, I want to ask and get the opinion of users much smarter than myself.
Here is what I am trying to do:
I have a list of URLs (an array of URLs, actually).
For each URL, I want to count the outgoing links on that page which DO NOT HAVE the REL="nofollow" attribute.
So I'm afraid I'll have to make PHP load the page and match all the links using regular expressions?
Would this work if I had, let's say, 1000 links?
Here is what I am thinking, putting it in code:
$homepage = file_get_contents('http://www.site.com/');
$homepage = htmlentities($homepage);
// Do a preg_match for http:// and count the number of appearances:
$urls = preg_match();
// Do a preg_match for rel="nofollow" and count the nr of appearances:
$nofollow = preg_match();
// Do a preg_match for the number of "domain.com" appearances so we can subtract the website's internal links:
$internal_links = preg_match();
// Subtract and get the final result:
$result = $urls - $nofollow - $internal_links;
Hope you can help, and if the idea is right maybe you can help me with the preg_match functions.
You can use PHP's DOMDocument class to parse the HTML and parse_url to parse the URLs:
$url = 'http://stackoverflow.com/';
$pUrl = parse_url($url);
// Load the HTML into a DOMDocument
$doc = new DOMDocument;
@$doc->loadHTMLFile($url);
// Look for all the 'a' elements
$links = $doc->getElementsByTagName('a');
$numLinks = 0;
foreach ($links as $link) {
// Exclude if not a link or has 'nofollow'
preg_match_all('/\S+/', strtolower($link->getAttribute('rel')), $rel);
if (!$link->hasAttribute('href') || in_array('nofollow', $rel[0])) {
continue;
}
// Exclude if internal link
$href = $link->getAttribute('href');
if (substr($href, 0, 2) === '//') {
// Deal with protocol relative URLs as found on Wikipedia
$href = $pUrl['scheme'] . ':' . $href;
}
$pHref = @parse_url($href);
if (!$pHref || !isset($pHref['host']) ||
strtolower($pHref['host']) === strtolower($pUrl['host'])
) {
continue;
}
// Increment counter otherwise
echo 'URL: ' . $link->getAttribute('href') . "\n";
$numLinks++;
}
echo "Count: $numLinks\n";
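For a list of many URLs, the same logic could be wrapped in a function and called once per page. The sketch below is my own condensed variant, not the code above verbatim: the name count_followed_external_links() is hypothetical, it takes the already fetched HTML as a string so the fetching can be handled separately, and it treats links without a host as internal.

```php
<?php
// Hypothetical helper: count links that leave $pageHost and are not rel="nofollow".
function count_followed_external_links(string $html, string $pageHost): int
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate sloppy markup
    $doc->loadHTML($html);
    libxml_clear_errors();

    $count = 0;
    foreach ($doc->getElementsByTagName('a') as $a) {
        // rel can hold several space-separated tokens, e.g. "noopener nofollow".
        $rel = preg_split('/\s+/', strtolower($a->getAttribute('rel')), -1, PREG_SPLIT_NO_EMPTY);
        if (in_array('nofollow', $rel, true)) {
            continue;
        }
        $host = parse_url($a->getAttribute('href'), PHP_URL_HOST);
        // Relative links have no host and count as internal.
        if (is_string($host) && strcasecmp($host, $pageHost) !== 0) {
            $count++;
        }
    }
    return $count;
}

// Made-up sample page: two of these five links are external and followed.
$sample = '<a href="http://other.example/a">1</a>'
        . '<a href="http://other.example/b" rel="nofollow">2</a>'
        . '<a href="/local">3</a>'
        . '<a href="http://www.site.com/page">4</a>'
        . '<a href="//cdn.example.net/x" rel="noopener">5</a>';
echo count_followed_external_links($sample, 'www.site.com'), "\n";
```

With 1000 URLs, the parsing itself is unlikely to be the slow part; the time will mostly go to fetching the pages.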
You can use SimpleHTMLDOM:
// Create DOM from URL or file
$html = file_get_html('http://www.site.com/');
// Find all links
foreach($html->find('a[href][rel!=nofollow]') as $element) {
echo $element->href . '<br>';
}
As I'm not sure that SimpleHTMLDOM supports a :not selector and [rel!=nofollow] might only return a tags with a rel attribute present (and not ones where it isn't present), you may have to:
foreach($html->find('a[href][!rel][rel!=nofollow]') as $element)
Note the added [!rel]. Or, do it manually instead of with a CSS attribute selector:
// Find all links
foreach($html->find('a[href]') as $element) {
if (strtolower($element->rel) != 'nofollow') {
echo $element->href . '<br>';
}
}
I have searched for 3+ hours this morning and tried over 10 different setups for grabbing and displaying a list of images from a URL, and none of them worked correctly. I would end up with either no info displaying or a 500 error. Can someone point me to an example or help me out with how to do this properly? file_get_contents is not a viable option.
Example Directory: http://www.webtoonlive.com/webtoon/fantasy_world_survival/ch02/
Files i know that are in that directory:
001.jpg,
002.jpg,
003.jpg
I would like the output to be the exact URL of each file.
Let me know if more info is needed; I'm not 100% sure how to explain it right, lol.
Edit:
OK, so what I guess I actually want to do is check the URL for all the image tags and display a list with the full URL to each image.
I'm new to working with this URL + images + PHP stuff, so please don't hit me too hard with your downvote hammer with no comments, lol.
Code I Tried:
<?php
/*
Credits: Bit Repository
URL: http://www.bitrepository.com/
*/
$url = $location;
// Fetch page
$string = FetchPage($url);
// Regex that extracts the images (full tag)
$image_regex_src_url = '/<img[^>]*'.
'src=[\"|\'](.*)[\"|\']/Ui';
preg_match_all($image_regex, $string, $out, PREG_PATTERN_ORDER);
$img_tag_array = $out[0];
echo "<pre>"; print_r($img_tag_array); echo "</pre>";
// Regex for SRC Value
$image_regex_src_url = '/<img[^>]*'.
'src=[\"|\'](.*)[\"|\']/Ui';
preg_match_all($image_regex_src_url, $string, $out, PREG_PATTERN_ORDER);
$images_url_array = $out[1];
echo "<pre>"; print_r($images_url_array); echo "</pre>";
// Fetch Page Function
function FetchPage($path)
{
$file = fopen($path, "r");
if (!$file)
{
exit("The was a connection error!");
}
$data = '';
while (!feof($file))
{
// Extract the data from the file / url
$data .= fgets($file, 1024);
}
return $data;
}
?>
And it returned a blank page.
This is based loosely on the code you already tried (which was riddled with problems). It grabs the full contents of the URL $url, parses out the <img> src attributes, and then outputs them.
Because this particular web host uses a <base href=""/> tag to reset the base part of all URLs on the page, I've added a $base variable which you should set to the contents of the base tag.
Additionally, it looks like this particular web host has some pretty smart anti-hotlinking in place, so not all images may be visible.
But! Give it a whirl, let me know if it does what you need it to, and any questions.
<?php
$url = 'http://www.webtoonlive.com/webtoon/fantasy_world_survival/ch02/';
$base = 'http://www.webtoonlive.com/';
// Pull in the external HTML contents
$contents = file_get_contents( $url );
// Use Regular Expressions to match all <img src="???" />
preg_match_all( '/<img[^>]*src=[\"|\'](.*)[\"|\']/Ui', $contents, $out, PREG_PATTERN_ORDER);
foreach ( $out[1] as $k=>$v ){ // Step through all SRC's
// Prepend the URL with the $base URL (if needed)
// strpos() never returns true, so test for an absolute URL explicitly
if ( !preg_match( '#^https?://#i', $v ) ) $v = $base . $v;
// Output a link to the URL
echo '<a href="' . $v . '">' . $v . '</a><br/>';
}
Sample output:
http://www.webtoonlive.com/webtoon/fantasy_world_survival/ch02/000.jpg
http://www.webtoonlive.com/webtoon/fantasy_world_survival/ch02/001.jpg
http://www.webtoonlive.com/webtoon/fantasy_world_survival/ch02/002.jpg
http://www.webtoonlive.com/webtoon/fantasy_world_survival/ch02/003.jpg
http://www.webtoonlive.com/webtoon/fantasy_world_survival/ch02/004.jpg
http://www.webtoonlive.com/webtoon/fantasy_world_survival/ch02/005.jpg
http://www.webtoonlive.com/webtoon/fantasy_world_survival/ch02/006.jpg
http://www.webtoonlive.com/webtoon/fantasy_world_survival/ch02/007.jpg
http://www.webtoonlive.com/webtoon/fantasy_world_survival/ch02/008.jpg
http://www.webtoonlive.com/webtoon/fantasy_world_survival/ch02/009.jpg
http://www.webtoonlive.com/webtoon/fantasy_world_survival/ch02/010.jpg
http://www.webtoonlive.com/webtoon/fantasy_world_survival/ch02/011.jpg
http://www.webtoonlive.com/webtoon/fantasy_world_survival/ch02/012.jpg
http://www.webtoonlive.com/webtoon/fantasy_world_survival/ch02/013.jpg
http://www.webtoonlive.com/webtoon/fantasy_world_survival/ch02/014.jpg
http://www.webtoonlive.com/webtoon/fantasy_world_survival/ch02/015.jpg
http://www.webtoonlive.com/webtoon/fantasy_world_survival/ch02/016.jpg
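If the regular expression proves fragile, the src extraction can also be sketched with DOMDocument. This is only an illustration under my own assumptions: image_urls() is a made-up name, it works on an HTML string that has already been fetched by whatever means are available, and the base handling is deliberately naive (it just joins the base and the src with a single slash).

```php
<?php
// Hypothetical helper: list absolute image URLs found in an HTML string.
function image_urls(string $html, string $base): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate sloppy markup
    $doc->loadHTML($html);
    libxml_clear_errors();

    $urls = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        $src = trim($img->getAttribute('src'));
        if ($src === '') {
            continue;   // <img> without a usable src
        }
        // Leave absolute URLs alone; prefix everything else with the base.
        if (!preg_match('#^https?://#i', $src)) {
            $src = rtrim($base, '/') . '/' . ltrim($src, '/');
        }
        $urls[] = $src;
    }
    return $urls;
}

// Made-up markup standing in for the fetched page.
$page = '<img src="webtoon/fantasy_world_survival/ch02/001.jpg">'
      . '<img src="http://a.example/b.png"><img alt="no src">';
print_r(image_urls($page, 'http://www.webtoonlive.com/'));
```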
I have been researching this topic for a few days now and I'm still none the wiser as to how to do it.
I want to get an RSS feed from forexfactory.com onto my website. I want to do some formatting on what's happening, and I also want the latest information from them (although those last two points can wait as long as I have some sort of feed running).
Preferably I'd like to develop this from the ground up, if anyone knows of a tutorial or something I could use.
If not, I will settle for using a third-party API or something like that, as long as I get to do some of the work.
I'm not sure what it is, but there is something about RSS that I'm not getting, so if anyone knows of any good, probably basic, tutorials, that would help me out a lot. It's kind of hard going through page after page of Google searches.
Also, I'm not too fussed about the language it's output in; JavaScript, PHP or HTML would be great, though.
Thanks for the help.
It looks like SimplePie may be what you are looking for. It's a very basic RSS plugin which is quite easy to use and is customisable too. You can download it from the website.
You can use it at its bare bones, or you can delve deeper into the plugin if you wish. Here's a demo on their website.
index.php
include('rss_class.php');
$feedlist = new rss($feed_url);
echo $feedlist->display(2,"Feed Title");
rss_class.php
<?php
class rss {
public $feed;
// PHP 8 no longer treats a method named after the class as the constructor
public function __construct($feed){
$this->feed = $feed;
}
function parse(){
$rss = simplexml_load_file($this->feed);
//print_r($rss);die; /// Check here for attributes
$rss_split = array();
foreach ($rss->channel->item as $item) {
$title = (string) $item->title;
$link = (string) $item->link;
$pubDate = (string) $item->pubDate;
$description = (string) $item->description;
// Read the enclosure of the current item, if it has one
$image_url = isset($item->enclosure) ? (string) $item->enclosure['url'] : '';
$rss_split[] = '
<li>
<h5>'.$title.'</h5>
<span class="dateWrap">'.$pubDate.'</span>
<p>'.$description.'</p>
<a href="'.$link.'">Read Full Story</a>
</li>
';
}
return $rss_split;
}
function display($numrows,$head){
$rss_split = $this->parse();
$i = 0;
$rss_data = '<h2>'.$head.'</h2><ul class="newsBlock">';
// Stop early if the feed has fewer items than requested
while($i<$numrows && $i<count($rss_split)){
$rss_data .= $rss_split[$i];
$i++;
}
$trim = str_replace('', '',$this->feed);
$user = str_replace('&lang=en-us&format=rss_200','',$trim);
$rss_data.='</ul>';
return $rss_data;
}
}
?>
I didn't incorporate the <TABLE> tags as there might be more than one article that you would like to display.
class RssFeed
{
public $rss = "";
public function __construct($article)
{
$this->rss = simplexml_load_file($article, 'SimpleXMLElement', LIBXML_NOERROR | LIBXML_NOWARNING);
if($this->rss != false)
{
printf("<TR>\r\n");
printf("<TD>\r\n");
printf("<h3>%s</h3>\r\n", $this->rss->channel->title);
printf("</TD></TR>\r\n");
foreach($this->rss->channel->item as $value)
{
printf("<TR>\r\n");
printf("<TD id=\"feedmiddletd\">\r\n");
printf("<A target=\"_blank\" HREF=\"%s\">%s</A><BR/>\r\n", $value->link, $value->title);
// Pass the description as an argument so a % in it is not read as a format code
printf("%s", $value->description);
printf("</TD></TR>\r\n");
}
}
}
}
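For anyone who wants to see the bare mechanics without a class, here is a minimal sketch using SimpleXML on an inline feed. Everything in it is illustrative: the feed content is made up, rss_items() is a hypothetical helper name, and a real script would load the feed URL with simplexml_load_file() instead of a string.

```php
<?php
// Made-up feed used only for illustration.
$xml = <<<XML
<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item><title>First</title><link>http://www.example.com/1</link></item>
    <item><title>Second</title><link>http://www.example.com/2</link></item>
  </channel>
</rss>
XML;

// Hypothetical helper: return up to $limit items as plain arrays.
function rss_items(string $xml, int $limit): array
{
    libxml_use_internal_errors(true);   // keep parse errors out of the output
    $rss = simplexml_load_string($xml);
    libxml_clear_errors();
    if ($rss === false) {
        return [];
    }
    $items = [];
    foreach ($rss->channel->item as $item) {
        if (count($items) >= $limit) {
            break;
        }
        $items[] = ['title' => (string) $item->title, 'link' => (string) $item->link];
    }
    return $items;
}

foreach (rss_items($xml, 2) as $item) {
    // Escape feed text before putting it into HTML.
    echo '<li><a href="' . htmlspecialchars($item['link']) . '">'
       . htmlspecialchars($item['title']) . "</a></li>\n";
}
```

Returning plain arrays keeps the parsing separate from the markup, so the same items can be rendered as a list, a table, or anything else.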