PHP grabbing content between two strings

// get CONTENT from united domains footer
$content = file_get_contents('http://www.uniteddomains.com/index/footer/');
// remove spaces from CONTENT
$content = preg_replace('/\s+/', '', $content);
// match all tld tags
$regex = '#target="_parent">.(.*?)</a></li><li>#';
preg_match($regex, $source, $matches);
print_r($matches);
I want to match all of the TLDs.
Each TLD is preceded by target="_parent">. and followed by </a></li><li>
I want to end up with an array like array('africa','amsterdam','bnc', etc.)
What am I doing wrong here?
NOTE: The second step to remove all the spaces is just to simplify things.

Here's a regular expression that will do it for that page.
\.\w+(?=</a></li>)
In PHP:
$content = file_get_contents('http://www.uniteddomains.com/index/footer/');
preg_match_all('/\.\w+(?=<\/a><\/li>)/m', $content, $matches);
print_r($matches);
Here are the results:
.africa, .amsterdam, .bcn, .berlin, .boston, .brussels, .budapest, .gent, .hamburg, .koeln, .london, .madrid, .melbourne, .moscow, .miami, .nagoya, .nyc, .okinawa, .osaka, .paris, .quebec, .roma, .ryukyu, .stockholm, .sydney, .tokyo, .vegas, .wien, .yokohama, .africa, .arab, .bayern, .bzh, .cymru, .kiwi, .lat, .scot, .vlaanderen, .wales, .app, .blog, .chat, .cloud, .digital, .email, .mobile, .online, .site, .mls, .secure, .web, .wiki, .associates, .business, .car, .careers, .contractors, .clothing, .design, .equipment, .estate, .gallery, .graphics, .hotel, .immo, .investments, .law, .management, .media, .money, .solutions, .sucks, .taxi, .trade, .archi, .adult, .bio, .center, .city, .club, .cool, .date, .earth, .energy, .family, .free, .green, .live, .lol, .love, .med, .ngo, .news, .phone, .pictures, .radio, .reviews, .rip, .team, .technology, .today, .voting, .buy, .deal, .luxe, .sale, .shop, .shopping, .store, .eus, .gay, .eco, .hiv, .irish, .one, .pics, .porn, .sex, .singles, .vin, .vip, .bar, .pizza, .wine, .bike, .book, .holiday, .horse, .film, .music, .party, .email, .pets, .play, .rocks, .rugby, .ski, .sport, .surf, .tour, .video
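To get the bare names the question asks for (without the leading dot), one option is to capture the word after the dot rather than the dot itself. This is a minimal sketch, assuming a small hard-coded sample in place of the downloaded footer; the $sample string is made up for illustration and is not the site's real markup.

```php
<?php
// Hypothetical excerpt standing in for the downloaded footer HTML.
$sample = '<li><a href="/africa/" target="_parent">.africa</a></li>'
        . '<li><a href="/amsterdam/" target="_parent">.amsterdam</a></li>'
        . '<li><a href="/bcn/" target="_parent">.bcn</a></li>';

// Capture the word after the dot; the lookahead requires </a></li> right after it.
preg_match_all('/\.(\w+)(?=<\/a><\/li>)/', $sample, $matches);

// $matches[0] holds the full matches (".africa"), $matches[1] the captured names.
$tlds = $matches[1];
print_r($tlds);
```

Note that preg_match_all is needed here: preg_match stops after the first match.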

Using the DOM is cleaner:
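$doc = new DOMDocument();
// @ suppresses the warnings libxml emits for imperfect HTML
@$doc->loadHTMLFile('http://www.uniteddomains.com/index/footer/');
$xpath = new DOMXPath($doc);
$items = $xpath->query('/html/body/div/ul/li/ul/li[not(@class)]/a[@target="_parent"]/text()');
$result = '';
foreach ($items as $item) {
    $result .= $item->nodeValue;
}
// The concatenated text looks like ".africa.amsterdam...", so split on the dots
$result = explode('.', $result);
// Drop the empty element produced by the leading dot
array_shift($result);
print_r($result);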

Related

preg_match_all How to get all links?

I'm trying to use preg_match_all to get all image links that begin with http://i.ebayimg.com/ and end with .jpg from a page that I'm scraping, but I can't get it right. I tried this, but it is not what I need:
preg_match_all('/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/', $contentas, $img_link);
I have the same problem with normal links. I don't know how to write a preg_match_all pattern for this:
<a class="link--muted" href="http://suchen.mobile.de/fahrzeuge/details.html?id=218930381&daysAfterCreation=7&isSearchRequest=true&withImage=true&scopeId=C&categories=Limousine&damageUnrepaired=NO_DAMAGE_UNREPAIRED&zipcode=&fuels=DIESEL&ambitCountry=DE&maxPrice=11000&minFirstRegistrationDate=2006-01-01&makeModelVariant1.makeId=3500&makeModelVariant1.modelId=20&pageNumber=1" data-touch="hover" data-touch-wrapper=".cBox-body--resultitem">
Thank you very much!!!
UPDATE
I'm scraping this page:
http://suchen.mobile.de/fahrzeuge/search.html?isSearchRequest=true&scopeId=C&makeModelVariant1.makeId=1900&makeModelVariant1.modelId=10&makeModelVariant1.modelDescription=&makeModelVariantExclusions%5B0%5D.makeId=&categories=Limousine&minSeats=&maxSeats=&doorCount=&minFirstRegistrationDate=2006-01-01&maxFirstRegistrationDate=&minMileage=&maxMileage=&minPrice=&maxPrice=11000&minPowerAsArray=&maxPowerAsArray=&maxPowerAsArray=PS&minPowerAsArray=PS&fuels=DIESEL&minCubicCapacity=&maxCubicCapacity=&ambitCountry=DE&zipcode=&q=&climatisation=&airbag=&daysAfterCreation=7&withImage=true&adLimitation=&export=&vatable=&maxConsumptionCombined=&emissionClass=&emissionsSticker=&damageUnrepaired=NO_DAMAGE_UNREPAIRED&numberOfPreviousOwners=&minHu=&usedCarSeals=
I need the car links, the image links and all the other information. Extracting the information works fine; my only problem is with scraping the images and links. Here is my script:
<?php
$id= $_GET['id'];
$user= $_GET['user'];
$login=$_COOKIE['login'];
$query = mysql_query("SELECT pavadinimas,nuoroda,kuras,data,data_new from mobile where vartotojas='$user' and id='$id'");
$rezultatas=mysql_fetch_row($query);
$url = "$rezultatas[1]";
$info = file_get_contents($url);
function scrape_between($data, $start, $end){
$data = stristr($data, $start);
$data = substr($data, strlen($start));
$stop = stripos($data, $end);
$data = substr($data, 0, $stop);
return $data;
}
// cut out the content area
$turinys = scrape_between($info, '<div class="g-col-9">', '<footer class="footer">');
// filtering: remove the paid "top" ads
$contentas = preg_replace('/<div class="cBox-body cBox-body--topResultitem".*?>(.*?)<\/div>/', '' ,$turinys);
// filtering finished
preg_match_all('/<span class="h3".*?>(.*?)<\/span>/',$contentas,$pavadinimas);
preg_match_all('/<span class="u-block u-pad-top-9 rbt-onlineSince".*?>(.*?)<\/span>/',$contentas,$data);
preg_match_all('/<span class="u-block u-pad-top-9".*?>(.*?)<\/span>/',$contentas,$miestas);
preg_match_all('/<span class="h3 u-block".*?>(.*?)<\/span>/', $contentas, $kaina);
preg_match_all('/<a[A-z0-9-_:="\.\/ ]+href="(http:\/\/suchen.mobile.de\/fahrzeuge\/[^"]*)"[A-z0-9-_:="\.\/ ]\s*>\s*<div/s', $contentas, $matches);
print_r($pavadinimas);
print_r($data);
print_r($miestas);
print_r($kaina);
print_r($result);
print_r($matches);
?>
1. To capture the src attribute of all img tags when it starts with http://i.ebayimg.com/:
regex: /src=\"((?:http|https):\\/\\/i\\.ebayimg\\.com\\/.+?\\.jpg)\"/i
Here is an example:
$re = "/src=\"((?:http|https):\\/\\/i\\.ebayimg\\.com\\/.+?\\.jpg)\"/i";
$str = "codeOfHTMLPage";
preg_match_all($re, $str, $matches);
If you want to be sure that you capture this URL on an img tag, use this regex (keep in mind that performance will decrease if the page is very long):
$re = "/<img(?:.*?)src=\"((?:http|https):\\/\\/i\\.ebayimg\\.com\\/.+?\\.jpg)\"/i";
2. To capture the href attribute of all a tags when it starts with http://suchen.mobile.de/fahrzeuge/ (these links do not end in .jpg, so the pattern simply stops at the closing quote):
regex: /href=\"((?:http|https):\\/\\/suchen\\.mobile\\.de\\/fahrzeuge\\/.+?)\"/i
Here is an example:
$re = "/href=\"((?:http|https):\\/\\/suchen\\.mobile\\.de\\/fahrzeuge\\/.+?)\"/i";
$str = "codeOfHTMLPage";
preg_match_all($re, $str, $matches);
If you want to be sure that you capture this URL on an a tag, use this regex (keep in mind that performance will decrease if the page is very long):
$re = "/<a(?:.*?)href=\"((?:http|https):\\/\\/suchen\\.mobile\\.de\\/fahrzeuge\\/.+?)\"/i";
It is handier with DOMDocument:
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTMLFile($yourURL);
$imgNodes = $dom->getElementsByTagName('img');
$result = [];
foreach ($imgNodes as $imgNode) {
    $src = $imgNode->getAttribute('src');
    $urlElts = parse_url($src);
    // pathinfo() avoids passing the result of explode() by reference to array_pop()
    $ext = strtolower(pathinfo($urlElts['path'] ?? '', PATHINFO_EXTENSION));
    if ($ext == 'jpg' && ($urlElts['host'] ?? '') == 'i.ebayimg.com')
        $result[] = $src;
}
print_r($result);
To get "normal" links, proceed the same way (DOMDocument + parse_url).
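The same approach for the "normal" links might look like the following. It is only a sketch: the $html string is a made-up stand-in for the scraped page, and the host and path prefix are taken from the link shown in the question.

```php
<?php
// Hypothetical markup standing in for the scraped page.
$html = '<div>'
      . '<a class="link--muted" href="http://suchen.mobile.de/fahrzeuge/details.html?id=1&amp;pageNumber=1">Car</a>'
      . '<a href="http://example.com/other.html">Other</a>'
      . '</div>';

libxml_use_internal_errors(true); // silence warnings from imperfect HTML
$dom = new DOMDocument;
$dom->loadHTML($html);

$links = [];
foreach ($dom->getElementsByTagName('a') as $a) {
    $href  = $a->getAttribute('href');
    $parts = parse_url($href);
    // Keep only links on the wanted host whose path starts with /fahrzeuge/.
    if (is_array($parts)
        && ($parts['host'] ?? '') === 'suchen.mobile.de'
        && strpos($parts['path'] ?? '', '/fahrzeuge/') === 0) {
        $links[] = $href;
    }
}
print_r($links);
```

Checking the parsed host and path separately keeps the filter independent of the order of attributes inside the tag, which is where the regex versions tend to break.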

Preg_replace replace dashes with spaces between tags

I have some HTML and would like to replace dashes with spaces, but only between specific tags.
function getTextBetweenTags($string, $tagname) {
$pattern = "/<$tagname ?.*>(\d*)[-*](\d*)<\/$tagname>/";
$replace = " ";
$string = preg_replace($pattern, $replace, $string);
}
CODE EXAMPLE:
<div class="xxx">
start
World
Fantastic-yyy-zz
peter-hey
</div>
RESULT: Although 'peter hey' is without dashes it's more important the Tag's values.
<div class="xxx">
start
World
Fantastic yyy zz
peter-hey
</div>
You DO NOT need regular expressions for this task:
$contents = '<div class="xxx">
start
World
Fantastic-yyy-zz
peter-hey
</div>';
$doc = new DOMDocument();
$doc->loadXML($contents);
$tagName = 'a';
$tags = $doc->getElementsByTagName($tagName);
foreach ($tags as $tag) {
$newValue = str_replace('-', ' ', $tag->nodeValue);
$tag->nodeValue = $newValue;
}
echo $doc->saveHTML();
Demo: http://ideone.com/rI6k8b
@zerkms thank you for your help and patience. I tried almost exactly what you suggested, but it shows a warning and doesn't change anything.
Warning: DOMDocument::loadXML(): Extra content at the end of the document in Entity
CODE:
function process(&$vars) {
$theme = get_theme();
if ($vars['elts']['#xxx'] == 'main') {
$vars['bread'] = $theme->page['bread'];
/*add code*/
$doc = new DOMDocument();
$doc->loadXML($vars['bread']);
$tagName = 'a';
$tags = $doc->getElementsByTagName($tagName);
foreach ($tags as $tag) {
$newValue = str_replace('-', ' ', $tag->nodeValue);
$tag->nodeValue = $newValue;
}
echo $doc->saveHTML();
/*end add code*/
}
}
@zerkms, I'm accepting your answer because it really is correct. I was also surprised to find some other interesting approaches:
CODE TO FIND INFO
$tagname = 'a';
$pattern = "/<$tagname ?.*>(.*)\-+(.*)<\/$tagname>/";
$matches = "";
preg_match($pattern, $contents, $matches);
CODE TO CHANGE : As I only have a piece of code, I really don't need to check the tag is 'a'.
$pattern = "/>(.*)\-+(.*)\-+(.*)</";
$replace = ">$1 $2 $3<";
$res = preg_replace($pattern, $replace, $contents);
//$contents is my string with the code.
Hope it really helps someone.
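About the warning above: libxml reports "Extra content at the end of the document" when the string passed to loadXML has more than one top-level node, which is common for an HTML fragment such as a breadcrumb. One way around it, sketched here under assumptions, is to wrap the fragment in a single container and parse it with loadHTML. The $fragment markup and the "wrap" id are hypothetical, since the original tags are not shown in the post.

```php
<?php
// Hypothetical breadcrumb fragment with two top-level elements,
// which is not a well-formed XML document on its own.
$fragment = '<a href="/x">Fantastic-yyy-zz</a> <span>peter-hey</span>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
// Wrap the fragment in one container so it is easy to find again after parsing.
$doc->loadHTML('<div id="wrap">' . $fragment . '</div>');

// Replace dashes only inside <a> elements.
foreach ($doc->getElementsByTagName('a') as $tag) {
    $tag->nodeValue = str_replace('-', ' ', $tag->nodeValue);
}

// Serialize only the children of the wrapper, not the html/body added by the parser.
$wrap = $doc->getElementsByTagName('div')->item(0);
$out = '';
foreach ($wrap->childNodes as $child) {
    $out .= $doc->saveHTML($child);
}
echo $out;
```

Inside a function such as process() above, the result would be assigned back to the variable rather than echoed.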

php remove results of regex from file content

I want to remove the result of my regex from a file.. how should I do this...
$pattern = "/<a href=\"(.*?)a>/s";
$html = file_get_contents('content.html');
$check = preg_match_all($pattern,$html,$match);
foreach($match[1] as $result)
{
// what should I put here
}
Try this.
This regex matches the opening <a href="..."> tag and replaces it with an empty string.
$pattern = "/<a href=\"(.+?)\">/s";
$html = file_get_contents('content.html');
$subject = '<a href="content.html">';
$check = preg_replace($pattern, '', $subject); // preg_replace(pattern, replacement, subject)
I want to remove the result of my regex
Use preg_replace instead of preg_match_all.
$pattern = '~<a href=(.*?)</a>~'; // only match on same line
$html = file_get_contents('content.html');
$check = preg_replace($pattern, '', $html);
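Since the question asks about removing the matches from the file itself, the cleaned string still has to be written back. A minimal sketch, assuming a temporary file created only for the demonstration (the file contents are made up), and using the s flag so that links spanning several lines are removed as well:

```php
<?php
// Hypothetical file created only so the sketch is self-contained.
$path = tempnam(sys_get_temp_dir(), 'html');
file_put_contents($path, "<p>Keep</p>\n<a href=\"content.html\">Drop</a>\n<p>Also keep</p>\n");

$html = file_get_contents($path);
// The s flag lets . match newlines, so links broken across lines are removed too.
$clean = preg_replace('~<a href=.*?</a>~s', '', $html);

// Write the cleaned markup back to the same file.
file_put_contents($path, $clean);
echo file_get_contents($path);
unlink($path); // remove the temporary file
```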

php regular expression to match string if NOT in an HTML tag

I'm trying to solve this bug in Drupal's Hashtags module: http://drupal.org/node/1718154
I've got this function that matches every word in my text that is prefixed by "#", like #tag:
function hashtags_get_tags($text) {
$tags_list = array();
$pattern = "/#[0-9A-Za-z_]+/";
preg_match_all($pattern, $text, $tags_list);
$result = implode(',', $tags_list[0]);
return $result;
}
I need to ignore internal links in pages, such as link, or, more generally, any word prefixed by # that appears inside an HTML tag (so preceded by < and followed by >).
Any idea how I can achieve this?
Can you strip the tags first, before matching (using the strip_tags function)?
function hashtags_get_tags($text) {
$text = strip_tags($text);
$tags_list = array();
$pattern = "/#[0-9A-Za-z_]+/";
preg_match_all($pattern, $text, $tags_list);
$result = implode(',', $tags_list[0]);
return $result;
}
A regular expression is going to be tricky if you want to only match hashtags that are not inside an HTML tag.
You could throw out the tags beforehand using preg_replace:
function hashtags_get_tags($text) {
$tags_list = array();
$pattern = "/#[0-9A-Za-z_]+/";
$text=preg_replace("/<[^>]*>/","",$text);
preg_match_all($pattern, $text, $tags_list);
$result = implode(',', $tags_list[0]);
return $result;
}
I made this function using PHP DOM.
It returns all links that do not have # in the href.
If you want it to only remove internal hash tags, replace this line:
if(strpos($link->getAttribute('href'), '#') === false) {
with this:
if(strpos($link->getAttribute('href'), '#') !== 0) {
This is the function:
function no_hashtags($text) {
$doc = new DOMDocument();
$doc->loadHTML($text);
$links = $doc->getElementsByTagName('a');
$nohashes = array();
foreach($links as $link) {
if(strpos($link->getAttribute('href'), '#') === false) {
$temp = new DOMDocument();
$elem = $temp->importNode($link->cloneNode(true), true);
$temp->appendChild($elem);
$nohashes[] = $temp->saveHTML();
}
}
// return $nohashes;
return implode('', $nohashes);
// return implode(',', $nohashes);
}
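Another DOM-based option, closer to the original goal of collecting hashtags, is to run the regex over text nodes only, so that a # inside an attribute never reaches it. This is a sketch under assumptions: hashtags_from_text_nodes is a hypothetical name, not part of the Hashtags module, and the sample input is made up.

```php
<?php
// Sketch: collect hashtags from text nodes only, so "#..." inside
// attributes such as href="#top" is never seen by the regex.
function hashtags_from_text_nodes($html) {
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    // Wrap the fragment so the parser always has a single container.
    $doc->loadHTML('<div>' . $html . '</div>');
    $xpath = new DOMXPath($doc);

    $tags = array();
    foreach ($xpath->query('//text()') as $node) {
        if (preg_match_all('/#[0-9A-Za-z_]+/', $node->nodeValue, $m)) {
            $tags = array_merge($tags, $m[0]);
        }
    }
    return implode(',', $tags);
}

$found = hashtags_from_text_nodes('It&#39;s <a href="#top">here</a>: #php and #regex');
echo $found;
```

Because the parser decodes entities before the text is matched, a numeric entity such as &#39; is not picked up as a tag, which can happen when the regex runs on the raw markup.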

php associative arrays, regex, array

I currently have the following code :
$content = "
<name>Manufacturer</name><value>John Deere</value><name>Year</name><value>2001</value><name>Location</name><value>NSW</value><name>Hours</name><value>6320</value>";
I need to find a way to create an array as name => value, e.g. Manufacturer => John Deere.
Can anyone help me with a simple code snippet? I tried some regex, but it doesn't even work to extract the names or values, e.g.:
$pattern = "/<name>Manufacturer<\/name><value>(.*)<\/value>/";
preg_match_all($pattern, $content, $matches);
$st_selval = $matches[1][0];
You don't want to use regex for this. Try out something like SimpleXML
EDIT
Well, why don't you start with this:
<?php
$content = "<root>" . $content . "</root>";
$xml = new SimpleXMLElement($content);
print_r($xml);
?>
EDIT 2
Despite the fact that some of the answers posted using regular expression MAY work, you should get in the habit of using the correct tool for the job and regular expressions are not the correct tool for parsing of XML.
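Going from the parsed SimpleXML object to the name => value array could look like this. It is a sketch that assumes the name and value elements strictly alternate, as they do in the question's string; the shortened $content here is only for illustration.

```php
<?php
$content = '<name>Manufacturer</name><value>John Deere</value>'
         . '<name>Year</name><value>2001</value>';

// Wrap in a single root so the string is a well-formed XML document.
$xml = new SimpleXMLElement('<root>' . $content . '</root>');

$pairs  = array();
$values = $xml->value; // all <value> children, in document order
$i = 0;
foreach ($xml->name as $name) {
    // Pair the i-th <name> with the i-th <value>; cast both to plain strings.
    $pairs[(string) $name] = (string) $values[$i];
    $i++;
}
print_r($pairs);
```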
I'm using your $content variable:
$preg1 = preg_match_all('#<name>([^<]+)#', $content, $name_arr);
$preg2 = preg_match_all('#<value>([^<]+)#', $content, $val_arr);
$array = array_combine($name_arr[1], $val_arr[1]);
This is rather simple, can be solved by regex. Should be:
$name = '<name>\s*([^<]+)</name>\s*';
$value = '<value>\s*([^<]+)</value>\s*';
$pattern = "|$name$value|"; // no literal space between the parts: the input has none
preg_match_all($pattern, $content, $matches);
# create hash
$stuff = array_combine($matches[1], $matches[2]);
# display
var_dump($stuff);
Regards
rbo
First of all, never use regex to parse xml...
You could do this with an XPATH query...
First, wrap the content in a root tag to make the parser happy (if it doesn't already have it):
$content = '<root>' . $content . '</root>';
Then, load the document
$dom = new DomDocument();
$dom->loadXml($content);
Then, initialize the XPATH
$xpath = new DomXpath($dom);
Write your query:
$xpathQuery = '//name[text()="Manufacturer"]/following-sibling::value[1]/text()';
Then, execute it:
$manufacturer = $xpath->evaluate($xpathQuery);
If I did the XPath right, $manufacturer should be a DOMNodeList whose first item is the John Deere text node...
You can see the docs on DomXpath, a basic primer on XPath, and a bunch of XPath examples...
Edit: If the following-sibling axis gives you trouble, you could do this instead of the XPath query above:
$xpathQuery = '//name[text()="Manufacturer"]';
$elements = $xpath->query($xpathQuery);
$manufacturer = $elements->item(0)->nextSibling->nodeValue;
I think this is what you're looking for:
<?php
$content = "<name>Manufacturer</name><value>John Deere</value><name>Year</name><value>2001</value><name>Location</name><value>NSW</value><name>Hours</name><value>6320</value>";
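// The outer parentheses are the pattern delimiters; [^<]* also matches values with spaces, such as "John Deere"
$pattern = "(\<name\>([^<]*)\<\/name\>\<value\>([^<]*)\<\/value\>)";
preg_match_all($pattern, $content, $matches);
$arr = array();
// $matches[1] holds the names and $matches[2] the values, so count the names, not $matches itself
for ($i=0; $i<count($matches[1]); $i++){
$arr[$matches[1][$i]] = $matches[2][$i];
}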
/* This is an example on how to use it */
echo "Location: " . $arr["Location"] . "<br><br>";
/* This is the array */
print_r($arr);
?>
If your array has a lot of elements, don't call count() in the for loop condition; calculate the value once beforehand and reuse it.
I'll edit as my PHP is wrong, but here's some PHP (pseudo-)code to give some direction.
$pattern = '|<name>([^<]*)</name>\s*<value>([^<]*)</value>|';
preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);
$arr = array();
// With PREG_SET_ORDER each $matches[$i] is one complete match: [full, name, value]
for($i = 0; $i < count($matches); $i++) {
$arr[$matches[$i][1]] = $matches[$i][2];
}
$arr is the array you want to store the name/value pairs.
Using XMLReader:
$content = '<name>Manufacturer</name><value>John Deere</value><name>Year</name><value>2001</value><name>Location</name><value>NSW</value><name>Hours</name><value>6320</value>';
$content = '<content>' . $content . '</content>';
$output = array();
$reader = new XMLReader();
$reader->XML($content);
$currentKey = null;
$currentValue = null;
while ($reader->read()) {
switch ($reader->name) {
case 'name':
$reader->read();
$currentKey = $reader->value;
$reader->read();
break;
case 'value':
$reader->read();
$currentValue = $reader->value;
$reader->read();
break;
}
if (isset($currentKey) && isset($currentValue)) {
$output[$currentKey] = $currentValue;
$currentKey = null;
$currentValue = null;
}
}
print_r($output);
The output is:
Array
(
[Manufacturer] => John Deere
[Year] => 2001
[Location] => NSW
[Hours] => 6320
)
