trying to scrape all facebook links from a web page - php

I'm trying to scrape the page for links from Facebook. However, I get a blank page, without any error message.
My code is as follows:
<?php
error_reporting(E_ALL);
function getFacebook($html) {
$matches = array();
if (preg_match('~^https?://(?:www\.)?facebook.com/(.+)/?$~', $html, $matches)) {
print_r($matches);
}
}
$html = file_get_contents('http://curvywriter.info/contact-me/');
getFacebook($html);
What's wrong with it?

A better alternative (and more robust) would be to use DOMDocument and DOMXPath:
<?php
error_reporting(E_ALL);
function getFacebook($html) {
$dom = new DOMDocument;
@$dom->loadHTML($html);
$query = new DOMXPath($dom);
$result = $query->evaluate("(//a|//A)[contains(@href, 'facebook.com')]");
$return = array();
foreach ($result as $element) {
/** @var $element DOMElement */
$return[] = $element->getAttribute('href');
}
return $return;
}
$html = file_get_contents('http://curvywriter.info/contact-me/');
var_dump(getFacebook($html));
For your specific problem, however, I made the following changes:
Changed preg_match to preg_match_all, so that matching does not stop after the first hit.
Removed the ^ (start) and $ (end) anchors from the pattern. Your links appear in the middle of the document, not at the beginning or the end (and certainly not both).
So the corrected code is:
<?php
error_reporting(E_ALL);
function getFacebook($html) {
$matches = array();
if (preg_match_all('~https?://(?:www\.)?facebook\.com/(.+)/?~', $html, $matches)) {
print_r($matches);
}
}
$html = file_get_contents('http://curvywriter.info/contact-me/');
getFacebook($html);
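To see the effect of the anchors in isolation, here is a minimal sketch on a made-up snippet of HTML (the real page content is not shown, so $html below is an assumption). It also swaps the greedy (.+) for a character class that stops at quotes, whitespace and angle brackets. That tweak is not part of the answer above; it is only there so that each match ends where the URL ends.

```php
<?php
// Hypothetical sample markup; the real page content is not shown in the question.
$html = '<p><a href="https://www.facebook.com/example.page">FB</a> '
      . '<a href="http://facebook.com/another/">FB 2</a></p>';

// Anchored pattern: the whole string would have to be a single Facebook URL.
$anchored = preg_match('~^https?://(?:www\.)?facebook\.com/(.+)/?$~', $html);

// Unanchored pattern; the character class is an assumption that keeps
// each match from running past the closing quote of the href attribute.
preg_match_all('~https?://(?:www\.)?facebook\.com/[^"\'\s<>]+~', $html, $matches);

var_dump($anchored);   // int(0): the anchored pattern finds nothing
print_r($matches[0]);  // both Facebook URLs
```

With the original (.+) the matches would still be found, but each one would extend to the end of the line rather than stopping at the end of the link.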

Related

Add space between textContent data scraped from website using PHP DOM

I am trying to add a comma and whitespace to some data I am scraping from a website. The data is scraped successfully, but the items run together, and the comma and space I am trying to add only get added after the last item. Here is the code I currently have:
$html = curl_exec($ch);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$finder = new DomXPath($dom);
$class_ops = 'ipc-inline-list ';
$class_opp = 'ipc-inline ';
$node = $finder->query("//div[@class='$class_ops']//ul[@class='$class_opp']");
foreach ($node as $index => $t) {
if ($index == 3) {
$la = $t->textContent.", ";
}
}
echo $la;
Current Result
DoyleBrainDavid,
Expected Result
Doyle, Brain, David
I am using this code
$c1 = curl_init('https://stackoverflow.com/');
curl_setopt($c1, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($c1);
if (curl_error($c1))
die(curl_error($c1));
// Get the status code
$status = curl_getinfo($c1, CURLINFO_HTTP_CODE);
curl_close($c1);
preg_match_all('/<span(.*?)<\/span>/s', $html, $matches1);
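$match1 = array();
foreach($matches1[0] as $k=>$v){
$enc = mb_detect_encoding($v);
// mb_convert_encoding() takes the target encoding first, then the source encoding
$v = mb_convert_encoding($v, "UTF-8", $enc);
$match1[$k] = strip_tags($v);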
//$match1[$k] = preg_replace('/^[^A-Za-z0-9]+/', '', $match1[$k]);
}
var_dump($match1);
In your case, you can replace the pattern like this:
preg_match_all('/<div class="ipc-inline-list">(.*?)<\/div>/s', $html, $matches1);
This returns an array with the matches.
I hope this helps.
You want each li, not the ul as one block. Try:
$node = $finder->query("//div[@class='$class_ops']//ul[@class='$class_opp']/li");
Demo: https://3v4l.org/Mvfud
If that doesn't work the actual HTML content should be added to the question.
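As a minimal sketch of the per-li approach, the following collects each item's text into an array and joins it with implode(). The markup is hypothetical (the question does not show the real HTML), and the class names are written without the trailing spaces used in the question.

```php
<?php
// Hypothetical markup standing in for the scraped page.
$html = '<div class="ipc-inline-list"><ul class="ipc-inline">'
      . '<li>Doyle</li><li>Brain</li><li>David</li></ul></div>';

$dom = new DOMDocument();
@$dom->loadHTML($html); // @ suppresses warnings about imperfect markup
$finder = new DOMXPath($dom);

// Select every <li> separately instead of the <ul> as one block.
$names = [];
foreach ($finder->query("//div[@class='ipc-inline-list']//ul[@class='ipc-inline']/li") as $li) {
    $names[] = trim($li->textContent);
}

// implode() puts the separator between items only, never after the last one.
$joined = implode(', ', $names);
echo $joined; // Doyle, Brain, David
```

Joining at the end avoids the trailing comma that appending ", " inside the loop would leave behind.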

how to find urls under double quote

Let's say we load the source code of this question and want to find the URL next to "childUrl"
(or go to this site's source code and search for "childUrl").
<?php
$sites_html = file_get_contents("https://stackoverflow.com/questions/46272862/how-to-find-urls-under-double-quote");
$html = new DOMDocument();
@$html->loadHTML($sites_html);
foreach() {
# now i want here to echo the link alongside "childUrl"
}
?>
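Try this:
<?php
// Named extractUrl() because extract() is already a built-in PHP function.
function extractUrl($url){
$sites_html = file_get_contents($url);
$html = new DOMDocument();
@$html->loadHTML($sites_html);
// loadHTML() returns a bool, so loop over the parsed <a> elements instead.
foreach ($html->getElementsByTagName('a') as $row)
{
if($row->getAttribute('href')=="wanted_url")
{
echo $row->getAttribute('href');
}
}
}
?>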
You can use a regex.
Try this code:
$matches = [[],[]];
preg_match_all('/\"wanted_url\": \"([^\"]*?)\"/', $sites_html, $matches);
foreach($matches[1] as $match) {
echo $match;
}
This prints every URL stored under the "wanted_url" key.
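Applied to the key from the question, a minimal sketch could look like the following. The $source string is made up and assumes the page embeds the value as a JSON-like "childUrl": "..." pair; the real page source may differ.

```php
<?php
// Hypothetical page source; only the key name "childUrl" comes from the question.
$source = '{"childUrl": "https://example.com/a", "other": 1, "childUrl":"https://example.com/b"}';

// \s* tolerates optional whitespace around the colon;
// [^"]* captures everything up to the closing double quote.
preg_match_all('/"childUrl"\s*:\s*"([^"]*)"/', $source, $matches);

$urls = $matches[1];
print_r($urls);
```

If the page really does embed JSON, slashes inside the value may be escaped as \/, in which case the captured strings would still need unescaping.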

Optimize remote page retrieving and parsing

I'm retrieving a remote page with PHP, getting a few links from that page and accessing each link and parsing it.
It takes about 12 seconds, which is far too long, and I need to optimize the code somehow.
My code looks something like this:
$result = get_web_page('THE_WEB_PAGE');
preg_match_all('/<a data\-a=".*" href="(.*)">/', $result['content'], $matches);
foreach ($matches[1] as $lnk) {
$result = get_web_page($lnk);
preg_match('/<span id="tests">(.*)<\/span>/', $result['content'], $match);
$re[$index]['test'] = $match[1];
preg_match('/<span id="tests2">(.*)<\/span>/', $result['content'], $match);
$re[$index]['test2'] = $match[1];
preg_match('/<span id="tests3">(.*)<\/span>/', $result['content'], $match);
$re[$index]['test3'] = $match[1];
++$index;
}
I have some more preg_match calls inside the loop.
How can I optimize my code?
Edit:
I've changed my code to use XPath instead of regex, and it became much slower.
Edit2:
That's my full code:
<?php
$begin = microtime(TRUE);
$result = get_web_page('WEB_PAGE');
$dom = new DOMDocument();
$dom->loadHTML($result['content']);
$xpath = new DOMXPath($dom);
// Get the links
$matches = $xpath->evaluate('//li[@class = "lasts"]/a[@class = "lnk"]/@href | //li[@class=""]/a[ @class = "lnk"]/@href');
if ($matches === FALSE) {
echo 'error';
exit();
}
foreach ($matches as $match) {
$links[] = 'WEB_PAGE'.$match->value;
}
$index = 0;
// For each link
foreach ($links as $link) {
echo (string)($index).' loop '.(string)(microtime(TRUE)-$begin).'<br>';
$result = get_web_page($link);
$dom = new DOMDocument();
$dom->loadHTML($result['content']);
$xpath = new DOMXPath($dom);
$match = $xpath->evaluate('concat(//span[@id = "header"]/span[@id = "sub_header"]/text(), //span[@id = "header"]/span[@id = "sub_header"]/following-sibling::text()[1])');
if ($match === FALSE) {
exit();
}
$data[$index]['name'] = $match;
$matches = $xpath->evaluate('//li[starts-with(@class, "active")]/a/text()');
if ($matches === FALSE) {
exit();
}
foreach ($matches as $match) {
$data[$index]['types'][] = $match->data;
}
$matches = $xpath->evaluate('//span[@title = "this is a title" and @class = "info"]/text()');
if ($matches === FALSE) {
exit();
}
foreach ($matches as $match) {
$data[$index]['info'][] = $match->data;
}
$matches = $xpath->evaluate('//span[@title = "this is another title" and @class = "name"]/text()');
if ($matches === FALSE) {
exit();
}
foreach ($matches as $match) {
$data[$index]['names'][] = $match->data;
}
++$index;
}
?>
As others mentioned, use a parser instead (e.g. DOMDocument) and combine it with XPath queries. Consider the following example:
<?php
# set up some dummy data
$data = <<<DATA
<div>
<a class='link'>Some link</a>
<a class='link' id='otherid'>Some link 2</a>
</div>
DATA;
$dom = new DOMDocument();
$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
# all links
$links = $xpath->query("//a[@class = 'link']");
print_r($links);
# special id link
$special = $xpath->query("//a[@id = 'otherid']");
# and so on
$textlinks = $xpath->query("//a[starts-with(text(), 'Some')]");
?>
Consider using a DOM framework for PHP. This should be way faster.
Use PHP's DOMDocument with xpath queries:
http://php.net/manual/en/class.domdocument.php
See Jan's answer for more explanation.
The following also works but is less preferable, according to the comments.
For example:
http://simplehtmldom.sourceforge.net/
An example that gets all <a> tags on a page:
<?php
include_once('simple_html_dom.php');
$url = "http://your_url/";
$html = new simple_html_dom();
$html->load_file($url);
foreach($html->find("a") as $link)
{
// do something with the link
}
?>

PHP Web Crawler doesn't crawl .php files

This is the simple webcrawler I was trying to build
<?php
$to_crawl = "http://samplewebsite.com/about.php";
function get_links($url)
{
$input = @file_get_contents($url);
$regexp = " <a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a> ";
preg_match_all("/$regexp/siU", $input, $matches);
$l = $matches[2];
foreach ($l as $link) {
echo $link."</br>";
}
}
get_links($to_crawl);
?>
When I run the script with the $to_crawl variable set to a URL ending in a plain path, e.g. "facebook.com/about", it works, but for some reason it echoes nothing when the link ends with a '.php' filename. Can someone please help?
To get all links and their inner texts, you can use DOMDocument like this:
$dom = new DOMDocument;
@$dom->loadHTML($input); // Your input (HTML code)
$xp = new DOMXPath($dom);
$links = $xp->query('//a[@href]'); // XPath to get only <a> tags with a href attribute
$result = array();
foreach ($links as $link) {
$result[] = array($link->getAttribute("href"), $link->nodeValue);
}
print_r($result);

getting all values from h1 tags using php

I want to receive an array that contains the values of all the h1 tags in a text.
For example, if this were the given input string:
<h1>hello</h1>
<p>random text</p>
<h1>title number two!</h1>
I need to receive an array containing this:
titles[0] = 'hello',
titles[1] = 'title number two!'
I already figured out how to get the first h1 value of the string but I need all the values of all the h1 tags in the given string.
I'm currently using this to receive the first tag:
function getTextBetweenTags($string, $tagname)
{
$pattern = "/<$tagname ?.*>(.*)<\/$tagname>/";
preg_match($pattern, $string, $matches);
return $matches[1];
}
I pass it the string I want to be parsed and as $tagname I put in "h1".
I didn't write it myself though, I've been trying to edit the code to do what I want it to but nothing really works.
I was hoping someone could help me out.
Thanks in advance.
You could use simplehtmldom:
function getTextBetweenTags($string, $tagname) {
// Create DOM from string
$html = str_get_html($string);
$titles = array();
// Find all tags
foreach($html->find($tagname) as $element) {
$titles[] = $element->plaintext;
}
return $titles;
}
function getTextBetweenTags($string, $tagname){
$d = new DOMDocument();
$d->loadHTML($string);
$return = array();
foreach($d->getElementsByTagName($tagname) as $item){
$return[] = $item->textContent;
}
return $return;
}
Alternative to DOM. Use when memory is an issue.
$html = <<< HTML
<html>
<h1>hello<span>world</span></h1>
<p>random text</p>
<h1>title number two!</h1>
</html>
HTML;
$reader = new XMLReader;
$reader->xml($html);
while($reader->read() !== FALSE) {
if($reader->name === 'h1' && $reader->nodeType === XMLReader::ELEMENT) {
echo $reader->readString();
}
}
function getTextBetweenH1($string)
{
$pattern = "/<h1>(.*?)<\/h1>/";
preg_match_all($pattern, $string, $matches);
return ($matches[1]);
}
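The non-greedy (.*?) in the last pattern matters as soon as several h1 tags share one line. A small sketch with made-up input (the question's example puts each tag on its own line, so the single-line string here is an assumption):

```php
<?php
// Made-up input with two headings on a single line.
$string = '<h1>hello</h1><p>random text</p><h1>title number two!</h1>';

// Greedy: .* runs from the first <h1> to the last </h1>, giving one long match.
preg_match_all('/<h1>(.*)<\/h1>/', $string, $greedy);

// Non-greedy: .*? stops at the nearest </h1>, giving one match per heading.
preg_match_all('/<h1>(.*?)<\/h1>/', $string, $lazy);

print_r($greedy[1]);
print_r($lazy[1]);
```

Neither pattern matches headings that carry attributes or span several lines, which is one reason the DOM-based answers above tend to be more robust.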
