I am scraping a website to find a string and, once it is found, extract part of it.
I am looking for the string "twitter:image" in a page and, when it is found, I want to extract its "content" value. Here is an example of the website I'm scraping. This is the HTML ("View Source") of that page:
Here is an example of my code:
I am using a library called "ProxyCrawl"
$ch = new ProxyCrawl();
$response = $ch->get($link, false);
if ($response->original_status == 200) {
    $result = $response->body;
    if (strpos($result, 'name="twitter:image"') !== false) {
        // then extract the content
    } else {
        // do nothing
    }
}
I already have the code that checks whether "twitter:image" exists, but I don't have the code to extract the "content" value.
Any help is greatly appreciated. Thanks!
If <meta name="twitter:image" /> is a unique element on the page, then use this:
$ch = new ProxyCrawl();
$response = $ch->get($link, false);
if ($response->original_status == 200) {
    $dom = new DOMDocument;
    libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $dom->loadHTML($response->body);
    $xpath = new DOMXpath($dom);
    // select the content attribute of the matching meta element
    $element = $xpath->query("//meta[@name='twitter:image']/@content");
    if (!empty($element->item(0))) {
        $imageUrl = $element->item(0)->nodeValue;
    }
}
Otherwise, if there are multiple elements of this kind, you will need to iterate:
$ch = new ProxyCrawl();
$response = $ch->get($link, false);
if ($response->original_status == 200) {
    $dom = new DOMDocument;
    libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $dom->loadHTML($response->body);
    $xpath = new DOMXpath($dom);
    $imageUrls = [];
    $elements = $xpath->query("//meta[@name='twitter:image']");
    if ($elements !== false) {
        foreach ($elements as $element) {
            $imageUrls[] = $element->getAttribute('content');
        }
    }
}
This is a really quick example but a regex would be the way to go:
This matches a string that contains name="twitter:image" followed by content=". You can get the text of content from the third capture group:
$str = '<meta data-rl="true" name="twitter:image" content="testing"';
// .+? is lazy, so the match stops at the first closing quote
$regex = '/(name="twitter:image")(.)content="(.+?)"/im';
preg_match_all($regex, $str, $results);
echo $results[3][0]; // testing
This is a rough example, you'll have to use this as a basis for your exact implementation. There are cleaner solutions to this (and probably better regexes) but this will get you going.
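As a slightly tighter variant of the regex idea above, here is a minimal sketch that uses preg_match with a single capture group. It assumes the content attribute directly follows the name attribute and that the value is double-quoted; if the attribute order differs on your page, the pattern will not match.

```php
<?php
// Sample input taken from the answer above.
$str = '<meta data-rl="true" name="twitter:image" content="testing">';

// [^"]+ stops at the first closing quote, so later attributes are not swallowed.
$regex = '/name="twitter:image"\s+content="([^"]+)"/i';

$content = null;
if (preg_match($regex, $str, $m)) {
    $content = $m[1]; // first capture group holds the content value
}

echo $content, PHP_EOL;
```

If nothing matches, $content stays null, which makes the "not found" case easy to test for.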
I don't know laravel (I use Symfony) and I am new to StackOverflow, but something like this could work:
$content = null;
$namestart = strpos($result, 'name="twitter:image"');
if ($namestart !== false) {
    // look for content=" only after the name attribute
    $contentstart = strpos($result, 'content="', $namestart);
    if ($contentstart !== false) {
        $contentstart += strlen('content="'); // skip past the opening quote
        $contentend = strpos($result, '"', $contentstart);
        if ($contentend !== false) {
            $content = substr($result, $contentstart, $contentend - $contentstart);
        }
    }
}
Not tested!
I am trying to add a comma and whitespace between items I am scraping from a website. The data scrapes successfully, but the items run together, and the comma and space I am trying to add only get added after the last item. Here is the code I currently have:
$html = curl_exec($ch);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$finder = new DomXPath($dom);
$class_ops = 'ipc-inline-list ';
$class_opp = 'ipc-inline ';
$node = $finder->query("//div[@class='$class_ops']//ul[@class='$class_opp']");
foreach ($node as $index => $t) {
    if ($index == 3) {
        $la = $t->textContent.", ";
        echo $la;
    }
}
Current Result
Expected Result
Doyle, Brain, David
I am using this code
$c1 = curl_init('https://stackoverflow.com/');
curl_setopt($c1, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($c1);
if (curl_error($c1))
    die(curl_error($c1));
// Get the status code
$status = curl_getinfo($c1, CURLINFO_HTTP_CODE);
curl_close($c1);
preg_match_all('/<span(.*?)<\/span>/s', $html, $matches1);
foreach($matches1[0] as $k=>$v){
    $enc = mb_detect_encoding($v);
    // mb_convert_encoding() takes the target encoding first, then the source
    $v = mb_convert_encoding($v, "UTF-8", $enc);
    $match1[$k] = strip_tags($v);
    //$match1[$k] = preg_replace('/^[^A-Za-z0-9]+/', '', $match1[$k]);
}
In your case you can replace the pattern like this:
preg_match_all('/<div class="ipc-inline-list">(.*?)<\/div>/s', $html, $matches1);
This returns an array of matches.
I hope this is helpful.
You want each li, not the ul as one block. Try:
$node = $finder->query("//div[@class='$class_ops']//ul[@class='$class_opp']/li");
Demo: https://3v4l.org/Mvfud
If that doesn't work the actual HTML content should be added to the question.
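To show how the per-li query produces the expected comma-separated output, here is a minimal sketch. The markup is a hypothetical stand-in for the scraped page (the real class values in the question carry a trailing space, which an exact @class match would need to include), and implode() is used so that no separator follows the last item.

```php
<?php
// Hypothetical markup standing in for the scraped page.
$html = '<div class="ipc-inline-list"><ul class="ipc-inline">'
      . '<li>Doyle</li><li>Brain</li><li>David</li></ul></div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$finder = new DOMXPath($dom);
$names = [];
// One node per <li>, so each name can be collected separately.
foreach ($finder->query("//div[@class='ipc-inline-list']//ul[@class='ipc-inline']/li") as $li) {
    $names[] = trim($li->textContent);
}

// implode() puts the separator only between items.
$joined = implode(', ', $names);
echo $joined, PHP_EOL;
```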
I'm retrieving a remote page with PHP, getting a few links from that page and accessing each link and parsing it.
It takes about 12 seconds, which is way too much, and I need to optimize the code somehow.
My code is something like this:
$result = get_web_page('THE_WEB_PAGE');
preg_match_all('/<a data\-a=".*" href="(.*)">/', $result['content'], $matches);
// the pattern has a single capture group, so the links are in $matches[1]
foreach ($matches[1] as $index => $lnk) {
    $result = get_web_page($lnk);
    preg_match('/<span id="tests">(.*)<\/span>/', $result['content'], $match);
    $re[$index]['test'] = $match[1];
    preg_match('/<span id="tests2">(.*)<\/span>/', $result['content'], $match);
    $re[$index]['test2'] = $match[1];
    preg_match('/<span id="tests3">(.*)<\/span>/', $result['content'], $match);
    $re[$index]['test3'] = $match[1];
}
I have some more preg_match calls inside the loop.
How can I optimize my code?
I've changed my code to use XPath instead of regex, and it became much slower.
This is my full code:
$begin = microtime(TRUE);
$result = get_web_page('WEB_PAGE');
$dom = new DOMDocument();
@$dom->loadHTML($result['content']);
$xpath = new DOMXPath($dom);
// Get the links
$matches = $xpath->evaluate('//li[@class = "lasts"]/a[@class = "lnk"]/@href | //li[@class=""]/a[ @class = "lnk"]/@href');
if ($matches === FALSE) {
    echo 'error';
    exit;
}
$links = array();
foreach ($matches as $match) {
    $links[] = 'WEB_PAGE'.$match->value;
}
$index = 0;
// For each link
foreach ($links as $link) {
    echo (string)($index).' loop '.(string)(microtime(TRUE)-$begin).'<br>';
    $result = get_web_page($link);
    $dom = new DOMDocument();
    @$dom->loadHTML($result['content']);
    $xpath = new DOMXPath($dom);
    $match = $xpath->evaluate('concat(//span[@id = "header"]/span[@id = "sub_header"]/text(), //span[@id = "header"]/span[@id = "sub_header"]/following-sibling::text()[1])');
    if ($match !== FALSE) {
        $data[$index]['name'] = $match;
    }
    $matches = $xpath->evaluate('//li[starts-with(@class, "active")]/a/text()');
    if ($matches !== FALSE) {
        foreach ($matches as $match) {
            $data[$index]['types'][] = $match->data;
        }
    }
    $matches = $xpath->evaluate('//span[@title = "this is a title" and @class = "info"]/text()');
    if ($matches !== FALSE) {
        foreach ($matches as $match) {
            $data[$index]['info'][] = $match->data;
        }
    }
    $matches = $xpath->evaluate('//span[@title = "this is another title" and @class = "name"]/text()');
    if ($matches !== FALSE) {
        foreach ($matches as $match) {
            $data[$index]['names'][] = $match->data;
        }
    }
    $index++;
}
As others mentioned, use a parser instead (i.e. DOMDocument) and combine it with XPath queries. Consider the following example:
# set up some dummy data
$data = <<<DATA
<a class='link'>Some link</a>
<a class='link' id='otherid'>Some link 2</a>
DATA;
$dom = new DOMDocument();
$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
# all links
$links = $xpath->query("//a[@class = 'link']");
# special id link
$special = $xpath->query("//a[@id = 'otherid']");
# and so on
$textlinks = $xpath->query("//a[starts-with(text(), 'Some')]");
Consider using a DOM framework for PHP. This should be way faster.
Use PHP's DOMDocument with xpath queries:
See Jan's answer for more explanation.
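As a rough illustration of that approach applied to the question's three spans, the sketch below parses a page once and then runs several XPath lookups against the same DOMXPath object. The HTML is hypothetical sample content; the span ids mirror the ones in the question. Whether this beats the regex version depends on the pages involved, and the remote fetches themselves are not addressed here.

```php
<?php
// Hypothetical page content; the span ids mirror the ones in the question.
$html = '<html><body><span id="tests">one</span><span id="tests2">two</span>'
      . '<span id="tests3">three</span></body></html>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html); // parse once per page
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$row = [];
foreach (['test' => 'tests', 'test2' => 'tests2', 'test3' => 'tests3'] as $key => $id) {
    // string() yields the text of the first matching node, or '' if there is none
    $row[$key] = $xpath->evaluate("string(//span[@id='$id'])");
}

print_r($row);
```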
The following also works but is less preferable, according to the comments.
For example, to get all a tags on a page:
$url = "http://your_url/";
$html = new simple_html_dom();
$html->load_file($url);
foreach($html->find("a") as $link) {
    // do something with the link
}
Ok, so I'm writing an application in PHP to check my sites if all the links are valid, so I can update them if I have to.
And I ran into a problem. I've tried to use SimpleXml and DOMDocument objects to extract the tags but when I run the app with a sample site I usually get a ton of errors if I use the SimpleXml object type.
So is there a way to scan the html document for href attributes that's pretty much as simple as using SimpleXml?
// what I want to do is get a similar effect to the code described below:
foreach($html->html->body->a as $link) {
    // store the $link into a file
    foreach($link->attributes() as $attribute=>$value) {
        // procedure to place the href value into a file
    }
}
So basically I'm looking for a way to perform the above operation. The thing is, I'm confused about how I should treat the string containing the HTML code that I'm getting...
Just to be clear, I'm using the following primitive way of getting the HTML file:
$target = "http://www.targeturl.com";
$file_handle = fopen($target, "r");
$a = "";
while (!feof($file_handle)) $a .= fgets($file_handle, 4096);
Any info would be useful, as well as any alternatives in other languages where the above problem is solved more elegantly (Python, C or C++).
You can use DOMDocument::loadHTML
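A minimal sketch of what that could look like for collecting href values. The HTML string here is hypothetical sample input; in practice it would be the page content fetched earlier. libxml_use_internal_errors() keeps invalid markup from producing the flood of warnings mentioned in the question.

```php
<?php
// Hypothetical sample input; replace with the fetched page content.
$html = '<html><body><a href="/one">One</a><a name="x">No href</a>'
      . '<a href="http://example.com/two">Two</a></body></html>';

libxml_use_internal_errors(true); // collect parse errors instead of printing warnings
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$hrefs = [];
foreach ($dom->getElementsByTagName('a') as $a) {
    if ($a->hasAttribute('href')) { // skip anchors without an href
        $hrefs[] = $a->getAttribute('href');
    }
}

print_r($hrefs);
```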
Here's a bunch of code we use for a HTML parsing tool we wrote.
$target = "http://www.targeturl.com";
$result = file_get_contents($target);
$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
@$dom->loadHTML($result); // suppress warnings from invalid markup
// getTags() returns one HTML snippet per <a>; extract the href from each
$links = array_map('extractLink', getTags($dom, 'a'));

function extractLink( $html, $argument = 1 ) {
    $href_regex_pattern = '/<a[^>]*?href=[\'"](.*?)[\'"][^>]*?>(.*?)<\/a>/si';
    preg_match_all($href_regex_pattern, $html, $matches);
    if (count($matches)) {
        if (is_array($matches[$argument]) && count($matches[$argument])) {
            return $matches[$argument][0];
        }
        return $matches[1];
    } else
        return false;
}

function getTags( $dom, $tagName, $element = false, $children = false ) {
    $html = array(); // one saved HTML snippet per matching node
    $domxpath = new DOMXPath($dom);
    $children = ($children) ? "/".$children : '';
    $filtered = $domxpath->query("//$tagName" . $children);
    $i = 0;
    while( $myItem = $filtered->item($i++) ){
        $newDom = new DOMDocument;
        $newDom->formatOutput = true;
        $node = $newDom->importNode( $myItem, true );
        $newDom->appendChild($node);
        $html[] = $newDom->saveHTML();
    }
    if ($element !== false && isset($html[$element])) {
        return $html[$element];
    } else
        return $html;
}
You could just use strpos($html, 'href=') and then parse the URL. You could also search for <a or .php
From an HTML page I need to extract the value of v from all anchor links. Each anchor link is nested inside some 5 div tags:
<a href="/watch?v=value to be retrived&list=blabla&feature=plpp_play_all">
Each v value has 11 characters. For now I am trying to read it character by character, like this:
$file=fopen("xx.html","r") or exit("Unable to open file!");
while (!feof($file))
if ($ff==$dd)
$sData = fgetc($file);
array_push($vd, $id);
That is, I am reading each character of v into the sData variable and appending it to id, so as to get those 11 characters as a string (id)...
The problem is that searching for 'v=' through the entire HTML file and, when it is found, reading the 11 characters and pushing them into an array is very slow; it takes a considerable amount of time. Please help me do this more efficiently.
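function substring(&$string, $start, $end)
{
    $pos = strpos($string, $start);
    if ($pos === false) return "";
    // drop everything up to and including the start marker
    $string = substr($string, $pos + strlen($start));
    $posend = strpos($string, $end);
    if ($posend === false) $posend = strlen($string); // no end marker: take the rest
    $toret = substr($string, 0, $posend);
    $string = substr($string, $posend);
    return $toret;
}
$contents = @file_get_contents("xx.html");
$videosArray = array();
$old = null;
// stop once a pass no longer consumes any of $contents
while ($old <> $contents)
{
    $old = $contents;
    $v = substring($contents, "?v=", "&");
    if ($v) $videosArray[] = $v;
}
//$videosArray is array of v's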
I would rather parse the HTML with SimpleXML and XPath:
// Get your page HTML string
$html = file_get_contents('xx.html');
// As per comment by Gordon to suppress invalid markup warnings
libxml_use_internal_errors(true);
// Create SimpleXML object
$doc = new DOMDocument();
$doc->strictErrorChecking = false;
$doc->loadHTML($html);
$xml = simplexml_import_dom($doc);
// Find a nodes
$anchors = $xml->xpath('//a[contains(@href, "v=")]');
foreach ($anchors as $a)
{
    $href = (string)$a['href'];
    $url = parse_url($href);
    parse_str($url['query'], $params);
    // $params['v'] contains what we need
    $vd[] = $params['v']; // push into array
}
// Clear invalid markup error buffer
libxml_clear_errors();
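The URL-splitting step at the end of that answer can be tried in isolation. In this small sketch the href is a hypothetical value in the shape shown in the question; parse_url() pulls out the query string and parse_str() turns it into an array.

```php
<?php
// Hypothetical href in the shape shown in the question.
$href = '/watch?v=abcdefghijk&list=blabla&feature=plpp_play_all';

// Extract only the query string part of the URL.
$query = parse_url($href, PHP_URL_QUERY);

// Split the query string into key/value pairs.
parse_str((string) $query, $params);

$v = $params['v'] ?? null; // null when the link has no v parameter
echo $v, PHP_EOL;
```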
My application allows users to write comments on my website, and it's working fine. I also have a tool that lets them insert their web links into the comments, and I'm happy for the content to include their own links.
Now I want to add rel="nofollow" to every link in the content they have written.
I would like to add rel="nofollow" using PHP, i.e. while saving the data.
So what's a simple method to add rel="nofollow", or to update rel="someother" to rel="someother nofollow", using PHP?
A nice example would be much appreciated.
Regexes really aren't the best tool for dealing with HTML, especially when PHP has a pretty good HTML parser built in.
This code will also handle adding nofollow if the rel attribute is already populated.
$dom = new DOMDocument;
$dom->loadHTML($str); // $str holds the submitted HTML
$anchors = $dom->getElementsByTagName('a');
foreach($anchors as $anchor) {
    $rel = array();
    if ($anchor->hasAttribute('rel') AND ($relAtt = $anchor->getAttribute('rel')) !== '') {
        $rel = preg_split('/\s+/', trim($relAtt));
    }
    if (in_array('nofollow', $rel)) {
        continue; // already nofollow, nothing to do
    }
    $rel[] = 'nofollow';
    $anchor->setAttribute('rel', implode(' ', $rel));
}
The resulting HTML is in $dom->saveHTML(). Except it will wrap it with html, body elements, etc, so use this to extract just the HTML you entered...
$html = '';
foreach($dom->getElementsByTagName('body')->item(0)->childNodes as $element) {
    $html .= $dom->saveXML($element, LIBXML_NOEMPTYTAG);
}
echo $html;
If you have PHP >= 5.3.6, replace saveXML() with saveHTML() and drop the second argument.
This HTML...
...is converted into...
Good one, Alex. It is more useful in the form of a function, so I made one below:
function add_no_follow($str){
    $dom = new DOMDocument;
    $dom->loadHTML($str);
    $anchors = $dom->getElementsByTagName('a');
    foreach($anchors as $anchor) {
        $rel = array();
        if ($anchor->hasAttribute('rel') AND ($relAtt = $anchor->getAttribute('rel')) !== '') {
            $rel = preg_split('/\s+/', trim($relAtt));
        }
        if (in_array('nofollow', $rel)) {
            continue; // already nofollow, nothing to do
        }
        $rel[] = 'nofollow';
        $anchor->setAttribute('rel', implode(' ', $rel));
    }
    $html = '';
    foreach($dom->getElementsByTagName('body')->item(0)->childNodes as $element) {
        $html .= $dom->saveXML($element, LIBXML_NOEMPTYTAG);
    }
    return $html;
}
Use as follows:
$str = "Some content with link Some content ... ";
$str = add_no_follow($str);
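For a quick check of the behaviour on concrete input, here is a compact, self-contained sketch of the same idea. The comment text and URLs are made up for illustration, and it uses saveHTML() with a node argument, so it assumes PHP 5.3.6 or later.

```php
<?php
// Hypothetical comment text containing two links, one with an existing rel.
$str = '<p>See <a href="http://example.com">this</a> and '
     . '<a rel="external" href="http://example.org">that</a>.</p>';

libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($str);
libxml_clear_errors();

foreach ($dom->getElementsByTagName('a') as $anchor) {
    // getAttribute() returns '' when rel is absent, which splits to an empty array
    $rel = preg_split('/\s+/', trim($anchor->getAttribute('rel')), -1, PREG_SPLIT_NO_EMPTY);
    if (!in_array('nofollow', $rel)) {
        $rel[] = 'nofollow';
    }
    $anchor->setAttribute('rel', implode(' ', $rel));
}

// Serialize only the children of <body>, not the html/body wrapper.
$out = '';
foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $node) {
    $out .= $dom->saveHTML($node);
}

echo $out, PHP_EOL;
```

Both links should end up with nofollow in their rel attribute, and the existing "external" value is kept alongside it.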
I've copied Alex's answer and made it into a function that makes links nofollow and open in a new tab/window (and added UTF-8 support). I'm not sure if this is the best way to do this, but it works (constructive input is welcome):
function nofollow_new_window($str)
{
    $dom = new DOMDocument;
    $dom->loadHTML($str);
    $anchors = $dom->getElementsByTagName('a');
    foreach($anchors as $anchor)
    {
        $rel = array();
        if ($anchor->hasAttribute('rel') AND ($relAtt = $anchor->getAttribute('rel')) !== '') {
            $rel = preg_split('/\s+/', trim($relAtt));
        }
        // add nofollow only if missing, without skipping the target handling below
        if (!in_array('nofollow', $rel)) {
            $rel[] = 'nofollow';
        }
        $anchor->setAttribute('rel', implode(' ', $rel));
        $target = array();
        if ($anchor->hasAttribute('target') AND ($relAtt = $anchor->getAttribute('target')) !== '') {
            $target = preg_split('/\s+/', trim($relAtt));
        }
        if (!in_array('_blank', $target)) {
            $target[] = '_blank';
        }
        $anchor->setAttribute('target', implode(' ', $target));
    }
    $str = utf8_decode($dom->saveHTML($dom->documentElement));
    return $str;
}
Simply use the function like this:
$str = '<html><head></head><body>fdsafffffdfsfdffff dfsdaff flkklfd aldsfklffdssfdfds Google</body></html>';
$str = nofollow_new_window($str);
echo $str;