I am scraping a website for a particular string, and once that string is found I want to extract part of it.
Specifically, I am looking for the string "twitter:image" in a page's HTML ("View Source"), and when it is found I want to extract the value of its "content" attribute.
Here is an example of my code. I am using a library called "ProxyCrawl":
$ch = new ProxyCrawl();
$response = $ch->get($link, false);

if ($response->original_status == 200) {
    $result = $response->body;
    if (strpos($result, 'name="twitter:image"') !== false) {
        Log::debug("found!");
        // then extract the content
    } else {
        // do nothing
    }
}
I already have the code that checks whether "twitter:image" exists, but I don't have the code to extract the "content" value.
Any help is greatly appreciated. Thanks!
If <meta name="twitter:image" /> is a unique element on the page, then use this:
$ch = new ProxyCrawl();
$response = $ch->get($link, false);

if ($response->original_status == 200) {
    $dom = new DOMDocument;
    $dom->loadHTML($response->body);
    $xpath = new DOMXPath($dom);
    $element = $xpath->query("//meta[@name='twitter:image']/@content");
    if (!empty($element->item(0))) {
        $imageUrl = $element->item(0)->nodeValue;
    }
}
Otherwise, if there are multiple elements of this kind, you will need to iterate:
$ch = new ProxyCrawl();
$response = $ch->get($link, false);

if ($response->original_status == 200) {
    $dom = new DOMDocument;
    $dom->loadHTML($response->body);
    $xpath = new DOMXPath($dom);

    $imageUrls = [];
    $elements = $xpath->query("//meta[@name='twitter:image']");
    if ($elements !== false) {
        foreach ($elements as $element) {
            $imageUrls[] = $element->getAttribute('content');
        }
    }
}
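Note that real-world pages rarely validate, so loadHTML() can flood you with warnings. An optional addition (not part of the snippets above) that silences them before parsing:

libxml_use_internal_errors(true); // collect libxml warnings instead of printing them
$dom->loadHTML($response->body);
libxml_clear_errors();            // discard the collected warnings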
This is a really quick example, but a regex would be the way to go:
/(name="twitter:image")(.*?)content="(.+?)"/im
This would match a string that contains name="twitter:image" followed by content="…". You can get the text of the content attribute from the third capture group:
$str = '<meta data-rl="true" name="twitter:image" content="testing">';
$regex = '/(name="twitter:image")(.*?)content="(.+?)"/im';
preg_match_all($regex, $str, $results);
print_r($results);
This is a rough example; you'll have to use it as a basis for your exact implementation. There are cleaner solutions (and probably better regexes), but this will get you going.
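For instance, one cleaner variant (just a sketch, assuming name="twitter:image" always appears before content= inside the tag):

// Named capture plus [^"]+ so the match stops at the first closing quote
$str = '<meta data-rl="true" name="twitter:image" content="testing">';
if (preg_match('/name="twitter:image"[^>]*content="(?<content>[^"]+)"/i', $str, $m)) {
    echo $m['content']; // prints: testing
}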
I don't know Laravel (I use Symfony) and I am new to Stack Overflow, but something like this could work:
if (strstr($result, 'name="twitter:image"')) {
    $namestart = strpos($result, 'name="twitter:image"');
    $substr1 = substr($result, $namestart);
    // search after the name attribute; note 'content="' is 9 characters long
    $contentstart = strpos($substr1, 'content="') + 9;
    $substr2 = substr($substr1, $contentstart);
    $contentend = strpos($substr2, '"');
    $content = substr($substr2, 0, $contentend);
}
Not tested!
Related
I am trying to add a comma and whitespace to some data I am scraping from a website. The data scrapes successfully, but the items are muddled together, and the space and comma I am trying to add only get added to the last item. Here is the code I currently have:
$html = curl_exec($ch);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$finder = new DOMXPath($dom);

$class_ops = 'ipc-inline-list ';
$class_opp = 'ipc-inline ';
$node = $finder->query("//div[@class='$class_ops']//ul[@class='$class_opp']");

foreach ($node as $index => $t) {
    if ($index == 3) {
        $la = $t->textContent.", ";
    }
}
echo $la;
Current Result
DoyleBrainDavid,
Expected Result
Doyle, Brain, David
I am using this code
$c1 = curl_init('https://stackoverflow.com/');
curl_setopt($c1, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($c1);
if (curl_error($c1))
    die(curl_error($c1));

// Get the status code
$status = curl_getinfo($c1, CURLINFO_HTTP_CODE);
curl_close($c1);

preg_match_all('/<span(.*?)<\/span>/s', $html, $matches1);

foreach ($matches1[0] as $k => $v) {
    // convert each match to UTF-8 (mb_convert_encoding takes "to" before "from")
    $enc = mb_detect_encoding($v);
    $v = mb_convert_encoding($v, "UTF-8", $enc);
    $match1[$k] = strip_tags($v);
    //$match1[$k] = preg_replace('/^[^A-Za-z0-9]+/', '', $match1[$k]);
}
var_dump($match1);
In your case you can replace the pattern like this:
preg_match_all('/<div class="ipc-inline-list">(.*?)<\/div>/s', $html, $matches1);
This returns an array of matches.
I hope this is helpful for you.
You want each li, not the ul as one block. Try:
$node = $finder->query("//div[@class='$class_ops']//ul[@class='$class_opp']/li");
Demo: https://3v4l.org/Mvfud
If that doesn't work the actual HTML content should be added to the question.
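To spell that out, a minimal sketch (the sample HTML is assumed from the class names in the question) that collects each li and joins them with a comma:

$html = '<div class="ipc-inline-list "><ul class="ipc-inline "><li>Doyle</li><li>Brain</li><li>David</li></ul></div>';
$dom = new DOMDocument();
@$dom->loadHTML($html);
$finder = new DOMXPath($dom);

$names = [];
foreach ($finder->query("//div[@class='ipc-inline-list ']//ul[@class='ipc-inline ']/li") as $li) {
    $names[] = trim($li->textContent);
}
echo implode(', ', $names); // Doyle, Brain, David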
I'm retrieving a remote page with PHP, getting a few links from that page, and then accessing each link and parsing it.
It takes about 12 seconds, which is way too long, and I need to optimize the code somehow.
My code is something like this:
$result = get_web_page('THE_WEB_PAGE');
preg_match_all('/<a data\-a=".*" href="(.*)">/', $result['content'], $matches);

$index = 0;
foreach ($matches[1] as $lnk) {
    $result = get_web_page($lnk);
    preg_match('/<span id="tests">(.*)<\/span>/', $result['content'], $match);
    $re[$index]['test'] = $match[1];
    preg_match('/<span id="tests2">(.*)<\/span>/', $result['content'], $match);
    $re[$index]['test2'] = $match[1];
    preg_match('/<span id="tests3">(.*)<\/span>/', $result['content'], $match);
    $re[$index]['test3'] = $match[1];
    ++$index;
}
I have some more preg_match calls inside the loop.
How can I optimize my code?
Edit:
I've changed my code to use XPath instead of regex, and it became much slower.
Edit2:
That's my full code:
<?php
$begin = microtime(TRUE);
$result = get_web_page('WEB_PAGE');

$dom = new DOMDocument();
$dom->loadHTML($result['content']);
$xpath = new DOMXPath($dom);

// Get the links
$matches = $xpath->evaluate('//li[@class = "lasts"]/a[@class = "lnk"]/@href | //li[@class=""]/a[@class = "lnk"]/@href');
if ($matches === FALSE) {
    echo 'error';
    exit();
}
foreach ($matches as $match) {
    $links[] = 'WEB_PAGE'.$match->value;
}

$index = 0;
// For each link
foreach ($links as $link) {
    echo (string)($index).' loop '.(string)(microtime(TRUE) - $begin).'<br>';
    $result = get_web_page($link);
    $dom = new DOMDocument();
    $dom->loadHTML($result['content']);
    $xpath = new DOMXPath($dom);

    $match = $xpath->evaluate('concat(//span[@id = "header"]/span[@id = "sub_header"]/text(), //span[@id = "header"]/span[@id = "sub_header"]/following-sibling::text()[1])');
    if ($match === FALSE) {
        exit();
    }
    $data[$index]['name'] = $match;

    $matches = $xpath->evaluate('//li[starts-with(@class, "active")]/a/text()');
    if ($matches === FALSE) {
        exit();
    }
    foreach ($matches as $match) {
        $data[$index]['types'][] = $match->data;
    }

    $matches = $xpath->evaluate('//span[@title = "this is a title" and @class = "info"]/text()');
    if ($matches === FALSE) {
        exit();
    }
    foreach ($matches as $match) {
        $data[$index]['info'][] = $match->data;
    }

    $matches = $xpath->evaluate('//span[@title = "this is another title" and @class = "name"]/text()');
    if ($matches === FALSE) {
        exit();
    }
    foreach ($matches as $match) {
        $data[$index]['names'][] = $match->data;
    }

    ++$index;
}
?>
As others mentioned, use a parser instead (i.e. DOMDocument) and combine it with XPath queries. Consider the following example:
<?php
# set up some dummy data
$data = <<<DATA
<div>
    <a class='link'>Some link</a>
    <a class='link' id='otherid'>Some link 2</a>
</div>
DATA;

$dom = new DOMDocument();
$dom->loadHTML($data);
$xpath = new DOMXPath($dom);

# all links
$links = $xpath->query("//a[@class = 'link']");
print_r($links);

# special id link
$special = $xpath->query("//a[@id = 'otherid']");

# and so on
$textlinks = $xpath->query("//a[starts-with(text(), 'Some')]");
?>
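A small follow-up (my addition, not part of the example above): query() returns a DOMNodeList, so print_r() won't show the matched elements themselves; iterate the list instead:

foreach ($links as $link) {
    echo $link->textContent, PHP_EOL; // "Some link", "Some link 2"
}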
Consider using a DOM framework for PHP. This should be way faster.
Use PHP's DOMDocument with xpath queries:
http://php.net/manual/en/class.domdocument.php
See Jan's answer for more explanation.
The following also works but is less preferable, according to the comments.
For example:
http://simplehtmldom.sourceforge.net/
An example to get all a tags on a page:
<?php
include_once('simple_html_dom.php');

$url = "http://your_url/";
$html = new simple_html_dom();
$html->load_file($url);

foreach ($html->find("a") as $link)
{
    // do something with the link, e.g. echo $link->href;
}
?>
Ok, so I'm writing an application in PHP to check whether all the links on my sites are valid, so I can update them if I have to.
And I ran into a problem. I've tried to use SimpleXML and DOMDocument objects to extract the tags, but when I run the app on a sample site I usually get a ton of errors if I use the SimpleXML object type.
So is there a way to scan an HTML document for href attributes that's pretty much as simple as using SimpleXML?
<?php
// what I want to do is get a similar effect to the code described below:
foreach ($html->html->body->a as $link)
{
    // store the $link into a file
    foreach ($link->attributes() as $attribute => $value)
    {
        // procedure to place the href value into a file
    }
}
?>
So basically I'm looking for a way to perform the above operation. The thing is, I'm currently confused as to how I should treat the string with the HTML code in it.
Just to be clear, I'm using the following primitive way of getting the HTML file:
<?php
$target = "http://www.targeturl.com";
$file_handle = fopen($target, "r");
$a = "";
while (!feof($file_handle)) {
    $a .= fgets($file_handle, 4096);
}
fclose($file_handle);
?>
Any info would be useful, as well as any other language alternatives where the above problem is more elegantly solved (Python, C or C++).
You can use DOMDocument::loadHTML
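For example, a minimal sketch of that approach (the URL is the placeholder from the question):

libxml_use_internal_errors(true); // real-world markup rarely validates
$dom = new DOMDocument();
$dom->loadHTML(file_get_contents("http://www.targeturl.com"));
libxml_clear_errors();

foreach ($dom->getElementsByTagName('a') as $a) {
    if ($a->hasAttribute('href')) {
        echo $a->getAttribute('href'), PHP_EOL; // or write it to a file instead
    }
}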
Here's a bunch of code we use for an HTML parsing tool we wrote.
$target = "http://www.targeturl.com";
$result = file_get_contents($target);

$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
@$dom->loadHTML($result);

// getTags() returns an array of HTML strings, one per matched tag,
// so extract a link from each of them
$links = array_map('extractLink', getTags($dom, 'a'));

function extractLink( $html, $argument = 1 ) {
    $href_regex_pattern = '/<a[^>]*?href=[\'"](.*?)[\'"][^>]*?>(.*?)<\/a>/si';
    preg_match_all($href_regex_pattern, $html, $matches);
    if (count($matches)) {
        if (is_array($matches[$argument]) && count($matches[$argument])) {
            return $matches[$argument][0];
        }
        return $matches[1];
    }
    return false;
}

function getTags( $dom, $tagName, $element = false, $children = false ) {
    $html = array();
    $domxpath = new DOMXPath($dom);
    $children = ($children) ? "/".$children : '';
    $filtered = $domxpath->query("//$tagName" . $children);
    $i = 0;
    while ( $myItem = $filtered->item($i++) ) {
        // serialize each matched node back to an HTML string
        $newDom = new DOMDocument;
        $newDom->formatOutput = true;
        $node = $newDom->importNode( $myItem, true );
        $newDom->appendChild($node);
        $html[] = $newDom->saveHTML();
    }
    if ($element !== false && isset($html[$element])) {
        return $html[$element];
    }
    return $html;
}
You could just use strpos($html, 'href=') and then parse the URL. You could also search for <a or .php
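A rough sketch of that strpos() idea (my own illustration, assuming $html holds the page source; it is fragile compared to a parser):

$offset = 0;
$urls = [];
while (($pos = strpos($html, 'href="', $offset)) !== false) {
    $start = $pos + 6;                  // skip past href="
    $end = strpos($html, '"', $start);  // find the closing quote
    if ($end === false) break;
    $urls[] = substr($html, $start, $end - $start);
    $offset = $end + 1;
}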
From an HTML page I need to extract the value of v from every anchor link; each anchor is nested some five div tags deep:
<a href="/watch?v=value to be retrieved&list=blabla&feature=plpp_play_all">
Each v value has 11 characters. As of now I am trying to read it character by character, like this:
<?php
$file = fopen("xx.html", "r") or exit("Unable to open file!");
$d = 'v';
$dd = '=';
$vd = array();
while (!feof($file))
{
    $f = fgetc($file);
    if ($f == $d)
    {
        $ff = fgetc($file);
        if ($ff == $dd)
        {
            $id = '';
            for ($i = 0; $i <= 10; $i++)
            {
                $sData = fgetc($file);
                $id = $id . $sData;
            }
            array_push($vd, $id);
        }
    }
}
fclose($file);
?>
That is, I am getting each character of v, storing it in the sData variable, and pushing it into id so as to get those 11 characters as a string (id).
The problem is that searching for 'v=' through the entire HTML file and, when found, reading the 11 characters and pushing them into the array is slow; it takes a considerable amount of time. So please help me streamline this.
<?php
function substring(&$string, $start, $end)
{
    // prepend ">" so a match at position 0 isn't confused with "not found"
    $pos = strpos(">".$string, $start);
    if (!$pos) return "";
    $pos--;
    // consume the string up to and including $start, then cut at $end
    $string = substr($string, $pos + strlen($start));
    $posend = strpos($string, $end);
    $toret = substr($string, 0, $posend);
    $string = substr($string, $posend);
    return $toret;
}
$contents = @file_get_contents("xx.html");
$old = "";
$videosArray = array();
// keep extracting until a pass no longer changes $contents
while ($old <> $contents)
{
    $old = $contents;
    $v = substring($contents, "?v=", "&");
    if ($v) $videosArray[] = $v;
}
// $videosArray is an array of v's
?>
I would rather parse the HTML with SimpleXML and XPath:
// Get your page HTML string
$html = file_get_contents('xx.html');

// As per the comment by Gordon, suppress invalid markup warnings
libxml_use_internal_errors(true);

// Create a DOMDocument and import it as a SimpleXML object
$doc = new DOMDocument();
$doc->strictErrorChecking = false;
$doc->loadHTML($html);
$xml = simplexml_import_dom($doc);

// Find the anchor nodes
$anchors = $xml->xpath('//a[contains(@href, "v=")]');

foreach ($anchors as $a)
{
    $href = (string)$a['href'];
    $url = parse_url($href);
    parse_str($url['query'], $params);

    // $params['v'] contains what we need
    $vd[] = $params['v']; // push into array
}

// Clear invalid markup error buffer
libxml_clear_errors();
My application allows users to write comments on my website, and it's working fine. I also have a tool for them to insert weblinks, and I'm fine with content containing their own weblinks.
Now I want to add rel="nofollow" to every link in the content they have written.
I would like to add rel="nofollow" using PHP, i.e. while saving the data.
So what's a simple method to add rel="nofollow", or to update rel="someother" to rel="someother nofollow", using PHP?
A nice example would be much appreciated.
Regexes really aren't the best tool for dealing with HTML, especially when PHP has a pretty good HTML parser built in.
This code will handle adding nofollow even if the rel attribute is already populated.
$dom = new DOMDocument;
$dom->loadHTML($str);

$anchors = $dom->getElementsByTagName('a');

foreach ($anchors as $anchor) {
    $rel = array();
    if ($anchor->hasAttribute('rel') AND ($relAtt = $anchor->getAttribute('rel')) !== '') {
        $rel = preg_split('/\s+/', trim($relAtt));
    }
    if (in_array('nofollow', $rel)) {
        continue;
    }
    $rel[] = 'nofollow';
    $anchor->setAttribute('rel', implode(' ', $rel));
}

var_dump($dom->saveHTML());
The resulting HTML is in $dom->saveHTML(), except it will wrap the fragment with html, body elements, etc., so use this to extract just the HTML you entered:
$html = '';
foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $element) {
    $html .= $dom->saveXML($element, LIBXML_NOEMPTYTAG);
}
echo $html;
If you have >= PHP 5.3, replace saveXML() with saveHTML() and drop the second argument.
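That is, for PHP >= 5.3 the extraction loop becomes:

$html = '';
foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $element) {
    $html .= $dom->saveHTML($element);
}
echo $html;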
Example
This HTML...
<a href="http://example.com">hello</a>
<a href="http://example.com" rel="">hello</a>
<a href="http://example.com" rel="external">hello</a>
<a href="http://example.com" rel="nofollow">hello</a>
...is converted into...
<a href="http://example.com" rel="nofollow">hello</a>
<a href="http://example.com" rel="nofollow">hello</a>
<a href="http://example.com" rel="external nofollow">hello</a>
<a href="http://example.com" rel="nofollow">hello</a>
Good one, Alex. It is more useful in the form of a function, so I made one below:
function add_no_follow($str) {
    $dom = new DOMDocument;
    $dom->loadHTML($str);

    $anchors = $dom->getElementsByTagName('a');

    foreach ($anchors as $anchor) {
        $rel = array();
        if ($anchor->hasAttribute('rel') AND ($relAtt = $anchor->getAttribute('rel')) !== '') {
            $rel = preg_split('/\s+/', trim($relAtt));
        }
        if (in_array('nofollow', $rel)) {
            continue;
        }
        $rel[] = 'nofollow';
        $anchor->setAttribute('rel', implode(' ', $rel));
    }

    $html = '';
    foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $element) {
        $html .= $dom->saveXML($element, LIBXML_NOEMPTYTAG);
    }
    return $html;
}
Use it as follows:
$str = 'Some content with a <a href="http://example.com">link</a>. Some content ... ';
$str = add_no_follow($str);
I've copied Alex's answer and made it into a function that makes links nofollow and open in a new tab/window (and added UTF-8 support). I'm not sure if this is the best way to do this, but it works (constructive input is welcome):
function nofollow_new_window($str)
{
    $dom = new DOMDocument;
    $dom->loadHTML($str);

    $anchors = $dom->getElementsByTagName('a');

    foreach ($anchors as $anchor)
    {
        // add nofollow, keeping any rel values that are already there
        $rel = array();
        if ($anchor->hasAttribute('rel') AND ($relAtt = $anchor->getAttribute('rel')) !== '') {
            $rel = preg_split('/\s+/', trim($relAtt));
        }
        if (!in_array('nofollow', $rel)) {
            $rel[] = 'nofollow';
            $anchor->setAttribute('rel', implode(' ', $rel));
        }

        // add target="_blank" the same way (don't skip it when nofollow was already set)
        $target = array();
        if ($anchor->hasAttribute('target') AND ($targetAtt = $anchor->getAttribute('target')) !== '') {
            $target = preg_split('/\s+/', trim($targetAtt));
        }
        if (!in_array('_blank', $target)) {
            $target[] = '_blank';
            $anchor->setAttribute('target', implode(' ', $target));
        }
    }

    $str = utf8_decode($dom->saveHTML($dom->documentElement));
    return $str;
}
Simply use the function like this:
$str = '<html><head></head><body>fdsafffffdfsfdffff dfsdaff flkklfd aldsfklffdssfdfds <a href="https://www.google.com">Google</a></body></html>';
$str = nofollow_new_window($str);
echo $str;