preg_match_all How to get all links? - php
I'm trying to get all image links with preg_match_all, those that begin with http://i.ebayimg.com/ and end with .jpg, from a page that I'm scraping. I can't get it right. I tried this, but it is not what I need:
preg_match_all('/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/', $contentas, $img_link);
I have the same problem with normal links. I don't know how to write a preg_match_all pattern for this:
<a class="link--muted" href="http://suchen.mobile.de/fahrzeuge/details.html?id=218930381&daysAfterCreation=7&isSearchRequest=true&withImage=true&scopeId=C&categories=Limousine&damageUnrepaired=NO_DAMAGE_UNREPAIRED&zipcode=&fuels=DIESEL&ambitCountry=DE&maxPrice=11000&minFirstRegistrationDate=2006-01-01&makeModelVariant1.makeId=3500&makeModelVariant1.modelId=20&pageNumber=1" data-touch="hover" data-touch-wrapper=".cBox-body--resultitem">
Thank you very much!!!
UPDATE
I'm trying to get the car links, the image links and all the other information from this page:
http://suchen.mobile.de/fahrzeuge/search.html?isSearchRequest=true&scopeId=C&makeModelVariant1.makeId=1900&makeModelVariant1.modelId=10&makeModelVariant1.modelDescription=&makeModelVariantExclusions%5B0%5D.makeId=&categories=Limousine&minSeats=&maxSeats=&doorCount=&minFirstRegistrationDate=2006-01-01&maxFirstRegistrationDate=&minMileage=&maxMileage=&minPrice=&maxPrice=11000&minPowerAsArray=&maxPowerAsArray=&maxPowerAsArray=PS&minPowerAsArray=PS&fuels=DIESEL&minCubicCapacity=&maxCubicCapacity=&ambitCountry=DE&zipcode=&q=&climatisation=&airbag=&daysAfterCreation=7&withImage=true&adLimitation=&export=&vatable=&maxConsumptionCombined=&emissionClass=&emissionsSticker=&damageUnrepaired=NO_DAMAGE_UNREPAIRED&numberOfPreviousOwners=&minHu=&usedCarSeals=
Extracting the information works fine, but I have a problem with scraping the images and the links. Here is my script:
<?php
$id= $_GET['id'];
$user= $_GET['user'];
$login=$_COOKIE['login'];
$query = mysql_query("SELECT pavadinimas,nuoroda,kuras,data,data_new from mobile where vartotojas='$user' and id='$id'");
$rezultatas=mysql_fetch_row($query);
$url = "$rezultatas[1]";
$info = file_get_contents($url);
function scrape_between($data, $start, $end){
$data = stristr($data, $start);
$data = substr($data, strlen($start));
$stop = stripos($data, $end);
$data = substr($data, 0, $stop);
return $data;
}
// cut out the content block
$turinys = scrape_between($info, '<div class="g-col-9">', '<footer class="footer">');
// filtering: remove the paid "top" ads
$contentas = preg_replace('/<div class="cBox-body cBox-body--topResultitem".*?>(.*?)<\/div>/', '' ,$turinys);
// filtering done
preg_match_all('/<span class="h3".*?>(.*?)<\/span>/',$contentas,$pavadinimas);
preg_match_all('/<span class="u-block u-pad-top-9 rbt-onlineSince".*?>(.*?)<\/span>/',$contentas,$data);
preg_match_all('/<span class="u-block u-pad-top-9".*?>(.*?)<\/span>/',$contentas,$miestas);
preg_match_all('/<span class="h3 u-block".*?>(.*?)<\/span>/', $contentas, $kaina);
preg_match_all('/<a[A-z0-9-_:="\.\/ ]+href="(http:\/\/suchen.mobile.de\/fahrzeuge\/[^"]*)"[A-z0-9-_:="\.\/ ]\s*>\s*<div/s', $contentas, $matches);
print_r($pavadinimas);
print_r($data);
print_r($miestas);
print_r($kaina);
print_r($result);
print_r($matches);
?>
1. To capture the src attribute of every img tag when it starts with http://i.ebayimg.com/ and ends with .jpg:
regex : /src="(https?:\/\/i\.ebayimg\.com\/[^"]+\.jpg)"/i
Here is an example:
$re = '/src="(https?:\/\/i\.ebayimg\.com\/[^"]+\.jpg)"/i';
$str = "codeOfHTMLPage";
preg_match_all($re, $str, $matches);
The captured URLs are in $matches[1]. The dots are escaped so that they match only a literal dot, and [^"]+ keeps a match from running past the closing quote of the attribute.
If you want to be sure that the URL is captured from an img tag, use this regex instead (keep in mind that performance will decrease if the page is very long):
$re = '/<img\b[^>]*?\ssrc="(https?:\/\/i\.ebayimg\.com\/[^"]+\.jpg)"/i';
2. To capture the href attribute of every a tag when it starts with http://suchen.mobile.de/fahrzeuge/:
regex : /href="(https?:\/\/suchen\.mobile\.de\/fahrzeuge\/[^"]*)"/i
Here is an example:
$re = '/href="(https?:\/\/suchen\.mobile\.de\/fahrzeuge\/[^"]*)"/i';
$str = "codeOfHTMLPage";
preg_match_all($re, $str, $matches);
If you want to be sure that the URL is captured from an a tag, use this regex instead (keep in mind that performance will decrease if the page is very long):
$re = '/<a\b[^>]*?\shref="(https?:\/\/suchen\.mobile\.de\/fahrzeuge\/[^"]*)"/i';
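The two kinds of pattern discussed above can be tried without fetching a live page. This is only a minimal sketch: the markup in $html is invented for illustration, and the patterns are one possible tightened form (escaped dots, [^"] instead of .+?), not the only correct one.

```php
<?php
// Hypothetical sample markup; a real page would come from file_get_contents().
$html = <<<'HTML'
<a class="link--muted" href="http://suchen.mobile.de/fahrzeuge/details.html?id=218930381&pageNumber=1" data-touch="hover">
<img src="http://i.ebayimg.com/00/s/abc/car_1.jpg" alt="car">
<img src="http://example.com/logo.jpg">
<img src="http://i.ebayimg.com/00/s/abc/icon.png">
</a>
HTML;

// Image URLs hosted on i.ebayimg.com that end in .jpg.
preg_match_all('/src="(https?:\/\/i\.ebayimg\.com\/[^"]+\.jpg)"/i', $html, $imgMatches);

// Links that point below suchen.mobile.de/fahrzeuge/.
preg_match_all('/href="(https?:\/\/suchen\.mobile\.de\/fahrzeuge\/[^"]*)"/i', $html, $linkMatches);

// Index 1 holds the first capture group, i.e. the bare URLs.
$images = $imgMatches[1];
$links  = $linkMatches[1];

print_r($images);
print_r($links);
```

Only the first img and the a tag should be picked up; the other two images fail either the host or the extension check.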
This is more convenient with DOMDocument:
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTMLFile($yourURL);
$imgNodes = $dom->getElementsByTagName('img');
$result = [];
foreach ($imgNodes as $imgNode) {
$src = $imgNode->getAttribute('src');
$urlElts = parse_url($src);
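// pathinfo() avoids passing explode()'s result by reference to array_pop()
$ext = strtolower(pathinfo($urlElts['path'] ?? '', PATHINFO_EXTENSION));
// relative URLs have no host, so fall back to an empty string
if ($ext === 'jpg' && ($urlElts['host'] ?? '') === 'i.ebayimg.com')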
$result[] = $src;
}
print_r($result);
To get "normal" links, use the same approach (DOMDocument + parse_url).
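For the links, that approach might look like the following. It is a minimal sketch under assumptions: the markup in $html is invented, and the check for the /fahrzeuge/ path prefix is my reading of what the asker wants, not something the answer spells out.

```php
<?php
// Hypothetical sample markup standing in for the scraped page.
$html = '<div>'
    . '<a class="link--muted" href="http://suchen.mobile.de/fahrzeuge/details.html?id=218930381&amp;pageNumber=1">Car</a>'
    . '<a href="http://example.com/fahrzeuge/other.html">Other site</a>'
    . '<a href="/relative/path">Relative</a>'
    . '</div>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument;
$dom->loadHTML($html);

$links = [];
foreach ($dom->getElementsByTagName('a') as $aNode) {
    $href  = $aNode->getAttribute('href'); // entities such as &amp; are already decoded
    $parts = parse_url($href);
    // Relative URLs have no host, so fall back to empty strings.
    $host = $parts['host'] ?? '';
    $path = $parts['path'] ?? '';
    if ($host === 'suchen.mobile.de' && strpos($path, '/fahrzeuge/') === 0) {
        $links[] = $href;
    }
}
print_r($links);
```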
Related
How to extract m3u8 of youtube by regex?
I had a PHP file that used regex to extract the m3u8 link from YouTube, and it was working fine until last week. The YouTube ID used to be passed like this: http://server.com/youtube.php?id=youtubeid
$string = get_data('https://www.youtube.com/watch?v=' . $channelid);
if(preg_match('#"hlsManifestUrl.":."(.*?m3u8)#', $string, $match)) {
    $var1=$match[1];
    $var1=str_replace("\/", "/", $var1);
    $man = get_data($var1);
    //echo $man;
    preg_match_all('/(https:\/.*\/95\/.*index.m3u8)/U',$man,$matches, PREG_PATTERN_ORDER);
    $var2=$matches[1][0];
    header("Content-type: application/vnd.apple.mpegurl");
    header("Location: $var2");
}
else {
    preg_match_all('#itag.":([^,]+),."url.":."(.*?).".*?qualityLabel.":."(.*?)p."#', $string, $match);
    //preg_match_all('#itag.":([^,]+),."url.":."(.*?).".*?bitrate.":.([^,]+),#', $string, $match);
    $filter_keys = array_filter($match[3], function($element) { return $element <= 720; });
    //print_r($filter_keys);
    $max_key = array_keys($filter_keys, max($filter_keys))[0];
    //print_r($max_key);
    $urls = $match[2];
    foreach($urls as &$url) {
        $url = str_replace('\/', '/', $url);
        $url = str_replace('\\\u0026', '&', $url);
    }
    print_r($urls[$max_key]);
    header('location: ' . $urls[$max_key]);
}
How do I solve this problem?
Based on this post, I'm guessing that the desired URLs look like the test string used in the snippets here, so we can write a simple expression such as:
(.+\?v=)(.+)
We can also add more boundaries to it if necessary. If this expression isn't what you want, you can modify it on regex101.com, and you can visualize expressions on jex.im.
PHP test:
$re = '/(.+\?v=)(.+)/m';
$str = ' https://www.youtube.com/watch?v=_Gtc-GtLlTk';
$subst = '$2';
$result = preg_replace($re, $subst, $str);
echo $result;
JavaScript demo. This snippet shows that we likely have a valid expression:
const regex = /(.+\?v=)(.+)/gm;
const str = ` https://www.youtube.com/watch?v=_Gtc-GtLlTk`;
const subst = `$2`;
// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);
console.log('Substitution result: ', result);
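If the goal is only to pull the video ID out of a watch URL, PHP's built-in parse_url() and parse_str() are an alternative to a regex. This is a minimal sketch, assuming the ID is carried in the v query parameter as in the test string; the helper name youtube_id is made up for this example, and it does nothing about the m3u8 extraction itself.

```php
<?php
// Hypothetical helper: returns the "v" query parameter, or null if it is absent.
function youtube_id(string $url): ?string
{
    // PHP_URL_QUERY yields the query string, null if there is none, or false on a malformed URL.
    $query = parse_url(trim($url), PHP_URL_QUERY);
    if (!is_string($query)) {
        return null;
    }
    parse_str($query, $params);
    return isset($params['v']) && is_string($params['v']) ? $params['v'] : null;
}

$id = youtube_id(' https://www.youtube.com/watch?v=_Gtc-GtLlTk');
echo $id, PHP_EOL;
```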
How to post content between two tags PHP
I am currently using two PHP scripts: one to post content to a file and one to get the content between two tags from that file.
Post content script:
$content = $_POST["maintenancetext"];
$strNewContents = "$content";
$fileRefID = fopen("../../../maintenance.php", "w");
fwrite($fileRefID, $strNewContents);
fclose($fileRefID);
Get content from between two tags script:
$start = '<p>';
$end = '</p>';
$string = file_get_contents("../../../maintenance.php");
$output = strstr( substr( $string, strpos( $string, $start) + strlen($start)), $end, true);
echo htmlentities($output, ENT_QUOTES);
I am currently posting from a text area to the file, but I need only the content between the two tags to be changed. How can I achieve this? Thanks.
Assuming the post content script is working, this gets the content between the two tags:
$string = file_get_contents("../../../maintenance.php");
$matches = array();
$pattern = "'<p>(.*?)</p>'si";
preg_match($pattern, $string, $matches);
$output = $matches[1];
echo $output;
To replace the content, substitute the new text for the old one inside the full file contents:
$newstring = "i am new string";
$newoutput = str_replace($output, $newstring, $string);
echo $newoutput;
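Another way to change only the text between the tags and leave the rest of the file untouched is preg_replace_callback(). This is a minimal sketch, assuming a single <p>...</p> pair; the in-memory $string stands in for the contents of maintenance.php and $newText for the posted textarea value, both invented here.

```php
<?php
// Stand-in for file_get_contents("../../../maintenance.php").
$string = '<html><body><p>Old maintenance text</p><div>Footer</div></body></html>';
// Stand-in for $_POST["maintenancetext"].
$newText = 'Back online at 5 & 6 pm';

// Rewrite only the body of the first <p>...</p>. Using a callback means that
// "$1" or "\1" inside the new text is never treated as a backreference.
$updated = preg_replace_callback(
    '~(<p>).*?(</p>)~si',
    function (array $m) use ($newText) {
        return $m[1] . htmlspecialchars($newText, ENT_QUOTES) . $m[2];
    },
    $string,
    1 // limit: replace the first match only
);

echo $updated, PHP_EOL;
```

Writing $updated back with file_put_contents() would then store the change; that step is left out so the sketch does not touch any file.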
PHP grabbing content between two strings
// get CONTENT from united domains footer
$content = file_get_contents('http://www.uniteddomains.com/index/footer/');
// remove spaces from CONTENT
$content = preg_replace('/\s+/', '', $content);
// match all tld tags
$regex = '#target="_parent">.(.*?)</a></li><li>#';
preg_match($regex, $source, $matches);
print_r($matches);
I want to match all of the TLDs. Each TLD is preceded by target="_parent">. and followed by </a></li><li>
I want to end up with an array like array('africa','amsterdam','bnc', etc.)
What am I doing wrong here?
NOTE: The second step, which removes all the spaces, is just to simplify things.
Here's a regular expression that will do it for that page:
\.\w+(?=</a></li>)
PHP:
$content = file_get_contents('http://www.uniteddomains.com/index/footer/');
preg_match_all('/\.\w+(?=<\/a><\/li>)/m', $content, $matches);
print_r($matches);
Here are the results:
.africa, .amsterdam, .bcn, .berlin, .boston, .brussels, .budapest, .gent, .hamburg, .koeln, .london, .madrid, .melbourne, .moscow, .miami, .nagoya, .nyc, .okinawa, .osaka, .paris, .quebec, .roma, .ryukyu, .stockholm, .sydney, .tokyo, .vegas, .wien, .yokohama, .africa, .arab, .bayern, .bzh, .cymru, .kiwi, .lat, .scot, .vlaanderen, .wales, .app, .blog, .chat, .cloud, .digital, .email, .mobile, .online, .site, .mls, .secure, .web, .wiki, .associates, .business, .car, .careers, .contractors, .clothing, .design, .equipment, .estate, .gallery, .graphics, .hotel, .immo, .investments, .law, .management, .media, .money, .solutions, .sucks, .taxi, .trade, .archi, .adult, .bio, .center, .city, .club, .cool, .date, .earth, .energy, .family, .free, .green, .live, .lol, .love, .med, .ngo, .news, .phone, .pictures, .radio, .reviews, .rip, .team, .technology, .today, .voting, .buy, .deal, .luxe, .sale, .shop, .shopping, .store, .eus, .gay, .eco, .hiv, .irish, .one, .pics, .porn, .sex, .singles, .vin, .vip, .bar, .pizza, .wine, .bike, .book, .holiday, .horse, .film, .music, .party, .email, .pets, .play, .rocks, .rugby, .ski, .sport, .surf, .tour, .video
Using the DOM is cleaner:
$doc = new DOMDocument();
@$doc->loadHTMLFile('http://www.uniteddomains.com/index/footer/');
$xpath = new DOMXPath($doc);
$items = $xpath->query('/html/body/div/ul/li/ul/li[not(@class)]/a[@target="_parent"]/text()');
$result = '';
foreach($items as $item) {
    $result .= $item->nodeValue;
}
$result = explode('.', $result);
array_shift($result);
print_r($result);
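Since the DOM approach cannot be checked without the live page, here is a minimal offline sketch of the same XPath idea. The markup in $html is invented and much simpler than the real footer, so the query is shortened accordingly; treat it as an illustration rather than a drop-in replacement.

```php
<?php
// Invented sample markup; the real footer is structured differently.
$html = '<ul>'
    . '<li><a target="_parent" href="/a">.africa</a></li>'
    . '<li><a target="_parent" href="/b">.amsterdam</a></li>'
    . '<li class="more"><a target="_parent" href="/c">more</a></li>'
    . '<li><a href="/d">.ignored</a></li>'
    . '</ul>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$tlds = [];
// Links with target="_parent" inside <li> elements that carry no class attribute.
foreach ($xpath->query('//li[not(@class)]/a[@target="_parent"]') as $a) {
    // Drop surrounding whitespace and the leading dot of each TLD.
    $tlds[] = ltrim(trim($a->textContent), '.');
}
print_r($tlds);
```

Collecting one array entry per node avoids the concatenate-then-explode step and keeps each TLD separate from the start.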
How to get value inside <a tag using preg match all?
I have HTML content and need to extract values inside hyperlink tags using preg_match_all. I tried the following, but I don't get any data. I have included sample input data. Could you help me fix this code so that it prints all the values that follow play.asp?ID= (for example, I want to get the value 12345 from play.asp?ID=12345)?
Sample input HTML data:
<span id="Img_1"></span></TD>
And the code:
$regexp = "<A\s[^>]*HREF=\"play.asp(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/A>";
if(preg_match_all("/$regexp/siU", $input, $matches)) {
    $url=str_replace('?ID=', '', $matches[2]);
    $url2=str_replace('&Selected_ID=&PhaseID=123', '', $url);
    print_r($url2);
}
Try this:
$str = '<span id="Img_1"></span>';
preg_match_all( '/<\s*A[^>]HREF="(.*?)"\s?(.*?)>/i', $str, $match);
print_r( $match );
Don't! Regular expressions are a (bad) way of text processing. This is not text, but HTML sourcecode. The tools to cope with it are called HTML parsers. Although PHP's DOMDocument is also able to loadHTML, it may glitch on some rare cases. A poorly built regexp (and you are wrong to think there's any other) will glitch on almost any changes in the page.
Isn't this enough?
/<a href="(.*?)?"/i
EDIT: This seems to work:
'/<a href="(.*?)\?/i'
This should achieve the desired result. It's a combination of an HTML parser and a contents extraction function:
function extractContents($string, $start, $end) {
    $pos = stripos($string, $start);
    $str = substr($string, $pos);
    $str_two = substr($str, strlen($start));
    $second_pos = stripos($str_two, $end);
    $str_three = substr($str_two, 0, $second_pos);
    $extractedContents = trim($str_three);
    return $extractedContents;
}
include('simple_html_dom.php');
$html = file_get_html('http://siteyouwantlinksfrom.com');
$links = $html->find('a');
foreach($links as $link) {
    $playIDs[] = extractContents($link->href, 'play.asp?ID=', '&');
}
print_r($playIDs);
simple_html_dom.php has to be downloaded separately.
You shouldn't use regular expressions to parse HTML. This is a solution with DOMDocument:
<?php
$input = '<span id="Img_1"></span>';
// Escape bare "&" characters in href so the parser accepts them
$cleanInput = str_replace('&','&amp;',$input);
// Load HTML
$domDocument = new DOMDocument();
$domDocument->loadHTML($cleanInput);
// Retrieve <a /> tags
$aTags = $domDocument->getElementsByTagName('a');
foreach($aTags as $aTag) {
    $href = $aTag->getAttribute('href');
    $url = parse_url($href);
    $vars = array();
    parse_str($url['query'], $vars);
    var_dump($vars);
}
?>
Output:
array (size=3)
  'ID' => string '12345' (length=5)
  'Selected_ID' => string '' (length=0)
  'PhaseID' => string '123' (length=3)
replace same url in text with regex
I am using the following code to add links to URLs in text:
if (preg_match_all("#((http(s?)://)|www\.)?([a-zA-Z0-9\-\.])(\w+[^\s\)\<]+)#i", $str, $matches)) {
    ?><pre><?php print_r($matches); ?></pre><?php
    for ($i = 0; $i < count($matches[0]); $i++) {
        $url = $matches[0][$i];
        $parsed = parse_url($url);
        $prefix = '';
        if (!isset($parsed["scheme"])){
            $prefix = 'http://';
        }
        $url = $prefix.$url;
        $replace = ''.$matches[0][$i].'';
        $str = str_replace($matches[0][$i], ''.$matches[0][$i].'', $str);
    }
}
The problem comes when the same URL appears twice anywhere in the text, for example:
google.com text text google.com
It adds a link to the first one, then searches for google.com again, finds it inside the link it has just created, and tries to add another link there. How can I make sure that each URL is linked separately without this problem?
You can use preg_replace_callback() to reliably work on individual matches.
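A minimal sketch of that idea follows. The URL pattern is a simplified stand-in written for this example, not the asker's regex, and the anchor markup is an assumption about what the original code was meant to produce.

```php
<?php
// Simplified URL pattern for illustration only: optional scheme or "www.",
// a dotted host name, and an optional path.
$pattern = '~\b(?:https?://|www\.)?[a-z0-9-]+(?:\.[a-z0-9-]+)+(?:/[^\s<)]*)?~i';

$str = 'google.com text text google.com and https://example.org/page';

// Each match is replaced exactly once in a single pass over the input, so
// links that were already inserted are never scanned again.
$linked = preg_replace_callback($pattern, function (array $m) {
    $url = $m[0];
    // Add a scheme when the match has none, as the original code does.
    $href = parse_url($url, PHP_URL_SCHEME) === null ? 'http://' . $url : $url;
    return '<a href="' . htmlspecialchars($href, ENT_QUOTES) . '">'
        . htmlspecialchars($url, ENT_QUOTES) . '</a>';
}, $str);

echo $linked, PHP_EOL;
```

Because the replacement happens while the original string is being walked, both occurrences of google.com get their own link and neither is wrapped twice.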