Regex to find out specific part of an html page

I want a regex that extracts the fragment below from a page's source.
The part that I want to find:
-->Copy frame link\",\"url240\":\"http:\/\/cs534515v4.vk.me\/u163220668\/videos\/1c1b06aec9.240.mp4\",\"url360\":\"http:\/\/cs534515v4.vk.me\/u163220668\/videos\/1c1b06aec9.360.mp4\",\"jpg\"<--
This fragment is part of an HTML page, and I want to retrieve only the part shown. I am writing the code in PHP.
My complete code:
<?php
set_time_limit(0);
function get_content_of_url($url){
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);
$content = curl_exec($ch);
curl_close($ch);
return $content;
}
$plyst = get_content_of_url("http://vk.com/video56612186_167113956");
preg_match('/link\\".*"jpg\\"/', $plyst , $matches);
var_dump($matches);
//preg_match('/http:\/\/[a-zA-Z0-9\\/-_.]+/', $matches[0][0], $id);
//start_script($id[0]);
?>

How about this.
$str = "video_get_current_url\":\"Copy frame link\",\"url240\":\"http:\\\/\\\/cs534515v4.vk.me\\\/u163220668\\\/videos\\\/1c1b06aec9.240.mp4\",\"url360\":\"http:\\\/\\\/cs534515v4.vk.me\\\/u163220668\\\/videos\\\/1c1b06aec9.360.mp4\",\"jpg\":\"http:\\\/\\\/cs534515.vk.me\\\/u163220668\\\/video\\\/l_8a5b0712.jpg\",\"ip_subm\":1,\"nologo";
preg_match('/\\"Copy\sframe.*"jpg\\"/is', $str, $matches);
var_dump($matches);
Output:
array(1) {
[0]=>
string(187) ""Copy frame link","url240":"http:\\/\\/cs534515v4.vk.me\\/u163220668\\/videos\\/1c1b06aec9.240.mp4","url360":"http:\\/\\/cs534515v4.vk.me\\/u163220668\\/videos\\/1c1b06aec9.360.mp4","jpg""
}
Edit:
And then, if you want to extract the video URLs from that:
preg_match_all('/(https?:.*?\.mp4)/', $matches[0], $id);
// Then echo out the URLs
foreach ($id[0] as $url) {
// the preg_replace strips out the backslashes that escape each slash.
echo preg_replace('/\\\\/', '', $url)."<br />";
}
Output:
http://cs534515v4.vk.me/u163220668/videos/1c1b06aec9.240.mp4
http://cs534515v4.vk.me/u163220668/videos/1c1b06aec9.360.mp4
Working example: http://sandbox.onlinephpfunctions.com/code/329106d990fe8927a7670b9448770643afbd0865
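As a variation on the approach above, the two steps can be folded into one pattern that captures each urlNNN key together with its value. This is only a sketch: the function name extractVideoUrls is made up for illustration, and it assumes the page source keeps the backslash-escaped shape shown in the question.

```php
<?php
// Fragment copied from the question; single quotes keep the backslashes literal.
$fragment = 'Copy frame link\",\"url240\":\"http:\/\/cs534515v4.vk.me\/u163220668\/videos\/1c1b06aec9.240.mp4\",\"url360\":\"http:\/\/cs534515v4.vk.me\/u163220668\/videos\/1c1b06aec9.360.mp4\",\"jpg\"';

// Hypothetical helper: returns an array keyed by resolution (240, 360, ...).
function extractVideoUrls(string $text): array
{
    // \\\\? in a single-quoted PHP string is the regex \\? (an optional literal backslash).
    $pattern = '/\\\\?"url(\d+)\\\\?":\\\\?"(.*?)\\\\?"/';
    preg_match_all($pattern, $text, $sets, PREG_SET_ORDER);

    $urls = [];
    foreach ($sets as $set) {
        // Drop the backslashes that escape each forward slash.
        $urls[(int) $set[1]] = str_replace('\\', '', $set[2]);
    }
    return $urls;
}

foreach (extractVideoUrls($fragment) as $resolution => $url) {
    echo $resolution . ': ' . $url . PHP_EOL;
}
```

Keying the result by resolution makes it easy to pick one quality without relying on the order of the matches.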

Related

Regular expression to extract the content inside the script tag in php

I tried to extract the download URL from a web page.
The code I tried is below:
function getbinaryurl ($url)
{
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_FRESH_CONNECT, true);
$value1 = curl_exec($curl);
curl_close($curl);
$start = preg_quote('<script type="text/x-component">', '/');
$end = preg_quote('</script>', '/');
$rx = preg_match("/$start(.*?)$end/", $value1, $matches);
var_dump($matches);
}
$url = "https://www.sourcetreeapp.com/download-archives";
getbinaryurl($url);
This way I get the tag information, not the content inside the script tag. How do I get the content inside?
The expected result is:
https://product-downloads.atlassian.com/software/sourcetree/ga/Sourcetree_4.0.1_234.zip,
https://product-downloads.atlassian.com/software/sourcetree/windows/ga/SourceTreeSetup-3.3.6.exe,
https://product-downloads.atlassian.com/software/sourcetree/windows/ga/SourcetreeEnterpriseSetup_3.3.6.msi
I am very new to writing regular expressions. Can anyone help me, please?

Instead of using regex, using DOMDocument and XPath allows you to have more control of the elements you select.
Although XPath can be difficult (as regex can), it can look more intuitive to some. The code uses //script[@type="text/x-component"][contains(text(), "macURL")] which, broken down, is
//script = any script node
[@type="text/x-component"] = which has an attribute called type with the specific value
[contains(text(), "macURL")] = whose text contains the string macURL
The query() method returns a list of matches, so loop over them. The content is JSON, so decode it and output the values...
function getbinaryurl ($url)
{
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_FRESH_CONNECT, true);
$value1 = curl_exec($curl);
curl_close($curl);
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($value1);
libxml_use_internal_errors(false);
$xp = new DOMXPath($doc);
$srcs = $xp->query('//script[@type="text/x-component"][contains(text(), "macURL")]');
foreach ( $srcs as $src ) {
$content = json_decode( $src->textContent, true);
echo $content['params']['macURL'] . PHP_EOL;
echo $content['params']['windowsURL'] . PHP_EOL;
echo $content['params']['enterpriseURL'] . PHP_EOL;
}
}
$url = "https://www.sourcetreeapp.com/download-archives";
getbinaryurl($url);
which outputs
https://product-downloads.atlassian.com/software/sourcetree/ga/Sourcetree_4.0.1_234.zip
https://product-downloads.atlassian.com/software/sourcetree/windows/ga/SourceTreeSetup-3.3.8.exe
https://product-downloads.atlassian.com/software/sourcetree/windows/ga/SourcetreeEnterpriseSetup_3.3.8.msi
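To try the XPath query without fetching the live page, the same steps can be run against an inline document. The sample HTML, the example.com URLs and the function name componentParams are all invented for this sketch; only the query itself follows the answer above.

```php
<?php
// Made-up page: one ordinary script and one text/x-component script holding JSON.
$html = <<<'HTML'
<html><body>
<script type="text/javascript">var macURL = "ignored";</script>
<script type="text/x-component">{"params":{"macURL":"https://example.com/app.zip","windowsURL":"https://example.com/app.exe"}}</script>
</body></html>
HTML;

// Hypothetical helper: returns the decoded "params" of every matching script node.
function componentParams(string $html): array
{
    $doc = new DOMDocument();
    // Silence warnings from imperfect real-world markup, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xp = new DOMXPath($doc);
    $query = '//script[@type="text/x-component"][contains(text(), "macURL")]';

    $found = [];
    foreach ($xp->query($query) as $node) {
        $data = json_decode($node->textContent, true);
        if (is_array($data) && isset($data['params'])) {
            $found[] = $data['params'];
        }
    }
    return $found;
}

foreach (componentParams($html) as $params) {
    echo $params['macURL'] . PHP_EOL;
    echo $params['windowsURL'] . PHP_EOL;
}
```

The first script also mentions macURL, but the @type test excludes it, which is the kind of control that is awkward to get from a regex alone.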

Regex to filter captcha image src

I'm trying to parse a cURL response for a reCAPTCHA image. Sadly, my regex experience isn't that good. How could I find the src link in this code using regex?
<img id="recaptcha_challenge_image" alt="reCAPTCHA-Bild" height="57" width="300" src="https://www.google.com/recaptcha/api/image?c=03AHJ_VuvfHOCmjv_2G6NuZOIpbC-S7FGqqg-WM4ytRKoMD7oeXGHV_fvqim2YJSw3qkD7XcSxSlCOLAeBd4vy0S7KTrLiL7xSwc2YvGnc5Q9EUXllGa-56zXarU7Pq4btSzHn0anQPToXtBgDBdu69P-SwUpnzs1q08w9GTL8TEtpMAYYcW1hAXqpQEnPQO_h5tmtEDyr9WlgvVVTNcMwDsTJOnjsGmH4kg&th=,skNXQ2Kw6Nd0dqSGIn-2Ei0m-GKXNcjwAAAAcaAAAADlawORO7hNRBFlts00ZWNVrY7qTydBWIpzMqbUQsFjKKE1a8QhNJQqFgffS8OED15Kp3u0DRvaI7SCeuXwN6hkhGhMqCqgpe9cY2VyxbkFK8pRpnNcz5gnlz43NZsAyBZmvlX-BmLj6Hgq-6DzW0pWn3WFpGmM2UhRNudpy91Vyg7OwtD0xNhMGubBALOcAmR35Em3TsWYVh9mCL37V0nJ2xEBjjBQqCtn1Yoz5fLgxd3xp7229BR1zIremO1wJMCVljmT5uxYHyArQsN5PYqF-C7u4NkOSKS6PmkRTKAdV1NihEGAKxREfCsttZbVOY7MOVBF4zemvUDwIwA8pMJrnbbwwwXEinA2-w1NNS0tCxrubtqKCxu2EQsHRQ37V5NzI-Y_0qYMdxDa3wuXdS2ojVsjwXzVnBbYnG7LFJ1BJlZev4lugPtZh2siHolIbmHT8L1z3MMk0DEH0QnhNkd9x6e66tRyqYs7PsoteXmqa76sbqb945WtI5jsiJY4wo9yuKtGH03HmxdqhftgOk9OM6Gjjvhu1lxfW8tkOhehrGD5Td7z0L5fywtXmexRSlEQ5B4_OA3LEVmoCMUjW18GXDBj_lZPjAQ-mp6zV4a19ht88ilWfFTanLZ6d9FKsRrdwlNIS6cDzVBKT90mgXKARhibHrrSXujgo1l-gDbJ0o6xJBqSIugP155OVvwhJHW_ofOnvBuxgbvvsvOfskyGcFdnPoBIwrK-47AHx4H2jryUbCc3wLAtOcUicS_I2PRxKSUuUmYUk-bQq00scg0mDoI6QlD12pkPvmNA_QDyPqKjv5z8fc5HLVIAqFdBFdbWImHFKku0clxNX_qebl7r-C7e7LNBTngIFRdtFzAX_VjZHqRouemq2y89UA30WP65JSzzbUPt-z-tb6eKW3QD0eOlm28YkbYib9mdl85bIy61rS8bCHtFuKlcTMSzZyqMJhH25faKCTPkkXHhkPnO7IkMEmyll3LA5kjkc9RwTWFgF64RqLC-BqLscVi0GbVcCodMSVy1-kRGRqPr2ZaMwbLJSJq94Dy7reaex9rgiWEfpM0jEj_b1UeGUAEENhcPM3N63bPF3_F39H1YX3oBve4UXURVo7JkU2-C6o1HmB7Xr74JMEpPl8Vj1zImRk7SSB8Z6KEGv8Nj2f2Pq0hgaiehokt9I1JpnFprXtlEQW3vvgDZa01jYb6kmKVJNdaq0mIuvg">
My actual code:
<?php
class get_recaptcha {
function __construct(){
$this->website = 'http://registration.zwinky.com/registration/register.jhtml';
}
function curl_post(){
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $this->website);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
function parse_content(){
preg_match('regexFilter', $this->curl_post(), $match);
$captcha_image = $match[1];
return $captcha_image;
}
}
$get_captcha = new get_recaptcha();
$captcha_response = $get_captcha->parse_content();
echo $captcha_response;
?>
Could anyone help me set up the regex filter for this?

It appears that the captcha is inside an iframe, so I think you would need to request the iframe's URL to get the captcha. Whilst not a regex solution, this does get the desired information. If the URL of the iframe changes, you would need to adapt the code to read the iframe's src URL first. Hope it helps.
$url='http://www.google.com/recaptcha/api/noscript?k=6Ldx8OgSAAAAAOQu76OwUC1XwCxpEZU576k0gHIR';
$html=file_get_contents( $url );
$dom=new DOMDocument;
$dom->loadHTML($html);
$col=$dom->getElementsByTagName('img');
// print the src attribute of every image on the page
foreach($col as $n) echo $n->getAttribute('src');
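If the img tag from the question is present in the fetched HTML, its src can also be selected by id rather than by looping over every image. A minimal sketch, assuming the id recaptcha_challenge_image from the question; the function name captchaSrc and the shortened sample URL are placeholders.

```php
<?php
// Shortened stand-in for the markup in the question (the real src is much longer).
$html = '<div><img id="recaptcha_challenge_image" alt="reCAPTCHA-Bild" height="57" width="300" src="https://www.google.com/recaptcha/api/image?c=abc123"></div>';

// Hypothetical helper: returns the src of the captcha image, or null if it is absent.
function captchaSrc(string $html): ?string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xp = new DOMXPath($dom);
    $nodes = $xp->query('//img[@id="recaptcha_challenge_image"]');

    return $nodes->length > 0 ? $nodes->item(0)->getAttribute('src') : null;
}

echo captchaSrc($html) . PHP_EOL;
```

Returning null when nothing matches lets the caller tell "no captcha on this page" apart from an empty src.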

Pull text from another website

Is it possible to pull text data from another domain (one I don't own) using PHP? If not, is there any other method? I've tried using iframes, but because my page is a mobile website they just don't look good. I'm trying to show a marine forecast for a specific area. Here is the link I'm trying to display.
Update:
This is what I ended up using. Maybe it will help someone else. However I felt there was more than one right answer to my question.
<?php
$ch = curl_init("http://forecast.weather.gov/MapClick.php?lat=29.26034686&lon=-91.46038359&unit=0&lg=english&FcstType=text&TextType=1");
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);
$content = curl_exec($ch);
curl_close($ch);
echo $content;
?>

This works as I think you want it to, except that it depends on the weather site keeping the same format (and on the word "Outlook" being displayed).
<?php
//define the URL of the resource
$url = 'http://forecast.weather.gov/MapClick.php?lat=29.26034686&lon=-91.46038359&unit=0&lg=english&FcstType=text&TextType=1';
//function from http://stackoverflow.com/questions/5696412/get-substring-between-two-strings-php
function getInnerSubstring($string, $boundstring, $trimit=false)
{
$res = false;
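$bstart = strpos($string, $boundstring);
// strpos() returns false when nothing is found, and false >= 0 is true, so compare strictly
if($bstart !== false)
{
$bend = strrpos($string, $boundstring);
if($bend !== false && $bend > $bstart)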
{
$res = substr($string, $bstart+strlen($boundstring), $bend-$bstart-strlen($boundstring));
}
}
return $trimit ? trim($res) : $res;
}
//if the URL is reachable
if($source = file_get_contents($url))
{
$raw = strip_tags($source,'<hr>');
echo '<pre>'.substr(strstr(trim(getInnerSubstring($raw,"<hr>")),'Outlook'),7).'</pre>';
}
else{
echo 'Error';
}
?>
If you need any revisions, please comment.
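The core of the helper above, taking the text between the first and last occurrence of a marker, can be checked on a small string. This is a sketch with a made-up function name (textBetween) and made-up input; it is not the function from the linked answer.

```php
<?php
// Hypothetical helper: text between the first and the last occurrence of $marker.
function textBetween(string $haystack, string $marker): ?string
{
    $first = strpos($haystack, $marker);
    $last  = strrpos($haystack, $marker);

    // Strict comparison: strpos() returns false (not -1) when the marker is missing.
    if ($first === false || $last === $first) {
        return null;
    }

    $start = $first + strlen($marker);
    return substr($haystack, $start, $last - $start);
}

var_dump(textBetween('header<hr>Outlook: calm seas<hr>footer', '<hr>'));
var_dump(textBetween('no marker at all', '<hr>'));
```

The strict check against false matters because a marker at position 0 and a missing marker would otherwise be treated the same way.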

Try sending a User-Agent header as shown below. You can then use SimpleXML to parse the contents and extract the text you want. See the PHP manual for more on SimpleXML.
$opts = array(
'http'=>array(
'method'=>"GET",
'header'=>"User-agent: www.example.com"
)
);
$content = file_get_contents($url, false, stream_context_create($opts));
$xml = simplexml_load_string($content);

You may use cURL for that. Have a look at http://www.php.net/manual/en/book.curl.php

Preg_match_all not stopping where it should be

Update Yahoo error
OK, so I got it all working, but preg_match_all won't work against Yahoo.
If you take a look at:
http://se.search.yahoo.com/search?p=random&toggle=1&cop=mss&ei=UTF-8&fr=yfp-t
then you can see that in their html, they have
<span class="url" id="something random"> the actual link </span>
But when I run preg_match_all, I don't get any result.
preg_match_all('#<span class="url" id="(.*)">(.+?)</span>#si', $urlContents[2], $yahoo);
Anyone got an idea?
End of update
I'm trying to run preg_match_all on the results I get from Google using cURL's curl_multi_getcontent method.
I have succeeded in fetching the site, but when I try to get the links out of the result, the match takes in too much.
I'm currently using:
preg_match_all('#<cite>(.+)</cite>#si', $urlContents[0], $links);
That starts where it should, but it doesn't stop; it just keeps going.
Check the HTML at www.google.com/search?q=random for example, and you will see that all links start with <cite> and end with </cite>.
Could someone possibly help me with how I should retrieve this information?
I only need the actual link address to each result.
Update Entire PHP Script
public function multiSearch($question)
{
$sites['google'] = "http://www.google.com/search?q={$question}&gl=sv";
$sites['bing'] = "http://www.bing.com/search?q={$question}";
$sites['yahoo'] = "http://se.search.yahoo.com/search?p={$question}";
$urlHandler = array();
foreach($sites as $site)
{
$handler = curl_init();
curl_setopt($handler, CURLOPT_URL, $site);
curl_setopt($handler, CURLOPT_HEADER, 0);
curl_setopt($handler, CURLOPT_RETURNTRANSFER, 1);
array_push($urlHandler, $handler);
}
$multiHandler = curl_multi_init();
foreach($urlHandler as $key => $url)
{
curl_multi_add_handle($multiHandler, $url);
}
$running = null;
do
{
curl_multi_exec($multiHandler, $running);
}
while($running > 0);
$urlContents = array();
foreach($urlHandler as $key => $url)
{
$urlContents[$key] = curl_multi_getcontent($url);
}
foreach($urlHandler as $key => $url)
{
curl_multi_remove_handle($multiHandler, $url);
}
foreach($urlContents as $urlContent)
{
preg_match_all('/<li class="g">(.*?)<\/li>/si', $urlContent, $matches);
//$this->view_data['results'][] = "Random";
}
preg_match_all('#<div id="search"(.*)</ol></div>#i', $urlContents[0], $match);
preg_match_all('#<cite>(.+)</cite>#si', $urlContents[0], $links);
var_dump($links);
}

Run the regular expression in ungreedy mode (the U modifier):
preg_match_all('#<cite>(.+)</cite>#siU', $urlContents[0], $links);

As in Darhazer's answer you can turn on ungreedy mode for the whole regex using the U pattern modifier, or just make the pattern itself ungreedy (or lazy) by following it with a ?:
preg_match_all('#<cite>(.+?)</cite>#si', ...
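The difference between the three patterns is easy to see on a small input. The sample HTML here is invented for the sketch; the patterns are the ones from the question and the two answers.

```php
<?php
// Made-up sample with two <cite> elements and other markup between them.
$html = '<cite>a.example</cite><p>x</p><cite>b.example</cite>';

// Greedy: .+ runs to the last </cite>, so everything collapses into one match.
preg_match_all('#<cite>(.+)</cite>#si', $html, $greedy);

// Lazy quantifier: .+? stops at the first </cite> it can.
preg_match_all('#<cite>(.+?)</cite>#si', $html, $lazy);

// U modifier: inverts greediness for the whole pattern, so .+ behaves like .+? here.
preg_match_all('#<cite>(.+)</cite>#siU', $html, $ungreedy);

var_dump($greedy[1], $lazy[1], $ungreedy[1]);
```

Note that with the U modifier a trailing ? makes a quantifier greedy again, so combining U with .+? would bring back the original problem.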

How can I assign a preg_match_all result to a variable

Forgive me as I am a newbie programmer. How can I assign the resulting $matches (preg_match) value, with the first character stripped, to another variable ($funded) in php? You can see what I have below:
<?php
$content = file_get_contents("https://join.app.net");
//echo $content;
preg_match_all ("/<div class=\"stat-number\">([^`]*?)<\/div>/", $content, $matches);
//testing the array $matches
//echo sprintf('<pre>%s</pre>', print_r($matches, true));
$funded = $matches[0][1];
echo substr($funded, 1);
?>

Don't parse HTML with RegEx.
The best way is to use PHP DOM:
<?php
$handle = curl_init('https://join.app.net');
curl_setopt($handle, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
$raw = curl_exec($handle);
curl_close($handle);
$doc = new DOMDocument();
$doc->loadHTML($raw);
$elems = $doc->getElementsByTagName('div');
foreach($elems as $item) {
if($item->getAttribute('class') == 'stat-number')
if(strpos($item->textContent, '$') !== false) $funded = $item->textContent;
}
// Remove $ sign and ,
$funded = preg_replace('/[^0-9]/', '', $funded);
echo $funded;
?>
This returned 380950 at the time of posting.

I am not 100% sure, but it seems you are trying to get the current funding amount in dollars, and the character you want to strip is the dollar sign.
If that is the case, why not add the dollar sign to the regex outside the group so it isn't captured?
/<div class=\"stat-number\">\$([^`]*?)<\/div>/
Because $ means end of line in regex, you must first escape it with a backslash.
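One detail to watch when putting that pattern into PHP: inside a double-quoted string, PHP itself turns \$ into a plain $, so the regex engine would see an end-of-line anchor again. A single-quoted pattern avoids this. The sketch below uses invented sample markup and a made-up function name (fundedAmount); the amount is the figure quoted in the other answer.

```php
<?php
// Made-up markup: one stat without a dollar sign and one with it.
$content = '<div class="stat-number">1,234</div><div class="stat-number">$380,950</div>';

// Hypothetical helper: returns the digits of the first dollar amount, or null.
function fundedAmount(string $content): ?string
{
    // Single quotes keep \$ intact, so the regex matches a literal dollar sign.
    if (preg_match('/<div class="stat-number">\$([^<]*)<\/div>/', $content, $m)) {
        // Remove the thousands separators from the captured amount.
        return str_replace(',', '', $m[1]);
    }
    return null;
}

$funded = fundedAmount($content);
echo $funded . PHP_EOL;
```

Because the dollar sign sits outside the capture group, $m[1] already lacks it and no substr() call is needed.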
