Get content from html file

Get content from html file - php

I have a list of HTML files. Each file repeatedly contains the string onClick="rpd(SOME_NUMBER)". I know how to get the content from the HTML files; what I want is a list of the SOME_NUMBER values. I saw that I might need preg_match, but I'm horrible at regular expressions. I tried:
$file_content = file_get_contents($url);
$pattern= 'onClick="rpd(#);"';
preg_match($pattern, $file_content);
As you could imagine... it didn't work. What would be the best way to get this done? Thanks!

This should get it done:
$file_content ='234=fdf donClick="rpd(5);"as23 f2 onClick="rpd(7);" dff fonClick="rpd(8);"';
$pattern= '/onClick="rpd\((\d+)\);"/';
preg_match_all($pattern, $file_content,$matches);
var_dump( $matches);
The output is like this:
array (size=2)
0 =>
array (size=3)
0 => string 'onClick="rpd(5);"' (length=17)
1 => string 'onClick="rpd(7);"' (length=17)
2 => string 'onClick="rpd(8);"' (length=17)
1 =>
array (size=3)
0 => string '5' (length=1)
1 => string '7' (length=1)
2 => string '8' (length=1)

Maybe something like this?
preg_match('/onClick="rpd\((\d+)\);"/', $file_content,$matches);
print $matches[1];

I don't know PHP, but the regular expression to match that would be:
'onClick="rpd\(([0-9]+)\)"'
Note that the literal parentheses have to be escaped with \ because of their special meaning, and the digits are wrapped in one regular (capturing) pair of parentheses so they can be extracted separately.
If preg_match also supports lookahead/lookbehind expressions:
'(?<=onClick="rpd\()[0-9]+(?=\)")'
will also work.
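PHP's preg functions use PCRE, which does support lookarounds, but the pattern needs delimiters. A minimal sketch of the lookaround variant, assuming a made-up sample string and attributes written without a trailing semicolon:

```php
<?php
// Sample input, assumed for illustration.
$file_content = '<a onClick="rpd(5)">a</a> <a onClick="rpd(42)">b</a>';

// The lookbehind and lookahead assert the surrounding text without consuming it,
// so the whole match is just the digits. PCRE lookbehinds must have a fixed
// length, which this literal prefix has.
$pattern = '/(?<=onClick="rpd\()[0-9]+(?=\)")/';

preg_match_all($pattern, $file_content, $matches);
$numbers = $matches[0];
print_r($numbers);
```

Because nothing is captured in a group, the numbers are in $matches[0] rather than $matches[1].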

$file_content='blah blah onClick="rpd(56)"; blah blah\nblah blah onClick="rpd(43)"; blah blah\nblah blah onClick="rpd(11)"; blah blah\n';
$pattern= '/onClick="rpd\((\d+)\)";/';
preg_match_all($pattern, $file_content, $matches);
print_r($matches);
That outputs:
Array
(
[0] => Array
(
[0] => onClick="rpd(56)";
[1] => onClick="rpd(43)";
[2] => onClick="rpd(11)";
)
[1] => Array
(
[0] => 56
[1] => 43
[2] => 11
)
)
You can play around with my example here: http://ideone.com/TzShPG

A clean way to do this is to use DOMDocument and XPath:
$doc = new DOMDocument();
@$doc->loadHTMLFile($url); // @ suppresses the warnings libxml emits for malformed HTML
$xpath = new DOMXPath($doc);
$ress = $xpath->query("//*[contains(@onclick,'rpd(')]/attribute::onclick");
foreach ($ress as $res) {
    // Drop the leading "rpd(" and the trailing ")"
    echo substr($res->value, 4, -1) . "\n";
}
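To experiment without a file on disk, the same idea can be sketched with loadHTML() on an inline string. The markup below is invented for illustration, and the number is pulled out with a small regex instead of substr() so that a trailing semicolon in the attribute does not matter:

```php
<?php
// Sample markup, assumed for illustration; real input would come from loadHTMLFile().
$html = '<html><body>'
      . '<a href="#" onClick="rpd(5)">one</a>'
      . '<a href="#" onClick="rpd(17);">two</a>'
      . '<a href="#" onClick="other(3)">skip</a>'
      . '</body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$numbers = [];
// The HTML parser lowercases attribute names, so @onclick also finds onClick.
foreach ($xpath->query("//*[contains(@onclick, 'rpd(')]/@onclick") as $attr) {
    if (preg_match('/rpd\((\d+)\)/', $attr->value, $m)) {
        $numbers[] = $m[1];
    }
}
print_r($numbers);
```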

Related

Flatten array of regular expressions

I have an array of regular expressions -$toks:
Array
(
[0] => /(?=\D*\d)/
[1] => /\b(waiting)\b/i
[2] => /^(\w+)/
[3] => /\b(responce)\b/i
[4] => /\b(from)\b/i
[5] => /\|/
[6] => /\b(to)\b/i
)
When I'm trying to flatten it:
$patterns_flattened = implode('|', $toks);
I get a regex:
/(?=\D*\d)/|/\b(waiting)\b/i|/^(\w+)/|/\b(responce)\b/i|/\b(from)\b/i|/\|/|/\b(to)\b/i
When I'm trying to:
if (preg_match('/'. $patterns_flattened .'/', "I'm waiting for a response from", $matches)) {
print_r($matches);
}
I get an error:
Warning: preg_match(): Unknown modifier '(' in ...index.php on line
Where is my mistake?
Thanks.

You need to remove the opening and closing slashes, like this:
$toks = [
'(?=\D*\d)',
'\b(waiting)\b',
'^(\w+)',
'\b(response)\b',
'\b(from)\b',
'\|',
'\b(to)\b',
];
And then, I think you'll want to use preg_match_all instead of preg_match:
$patterns_flattened = implode('|', $toks);
if (preg_match_all("/$patterns_flattened/i", "I'm waiting for a response from", $matches)) {
print_r($matches[0]);
}
$matches[0] holds the full match of each hit (the later elements hold the individual capture groups), so printing only that element gives:
Array
(
[0] => I
[1] => waiting
[2] => response
[3] => from
)
Try it on 3v4l.org

<?php
$data = Array
(
0 => '/(?=\D*\d)/',
1 => '/\b(waiting)\b/i',
2 => '/^(\w+)/',
3 => '/\b(responce)\b/i',
4 => '/\b(from)\b/i',
5 => '/\|/',
6 => '/\b(to)\b/i'
);
$patterns_flattened = implode('|', $data);
$regex = str_replace("/i",'',$patterns_flattened);
$regex = str_replace('/','',$regex);
if (preg_match_all( '/'.$regex.'/', "I'm waiting for a responce from", $matches)) {
echo '<pre>';
print_r($matches[0]);
}
You have to remove the slashes from each regex, and also the i modifier, to make it work. The embedded delimiters were the reason it was breaking.
A really nice tool for validating a regex is:
https://regexr.com/
I always use it when I have to write a larger-than-usual regular expression.
The output of the above code is:
Array
(
[0] => I
[1] => waiting
[2] => responce
[3] => from
)

There are a few adjustments to make to your $toks array.
To remove the error, you need to remove the pattern delimiters and pattern modifiers from each array element.
None of the capture groups is necessary. In fact, they lead to a higher step count and bloat the output array.
Whatever your intention is with (?=\D*\d), it needs a rethink. If there is a number anywhere in your input string, you are potentially going to generate lots of empty elements which surely can't have any benefit for your project. Look at what happens when I put a space then 1 after from in your input string.
Here is my recommendation: (PHP Demo)
$toks = [
'\bwaiting\b',
'^\w+',
'\bresponse\b',
'\bfrom\b',
'\|',
'\bto\b',
];
$pattern = '/' . implode('|', $toks) . '/i';
var_export(preg_match_all($pattern, "I'm waiting for a response from", $out) ? $out[0] : null);
Output:
array (
0 => 'I',
1 => 'waiting',
2 => 'response',
3 => 'from',
)
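If the patterns arrive already delimited and cannot be rewritten by hand, the delimiters and modifiers can be stripped in code before joining. This is only a sketch: strip_delimiters() is a hypothetical helper, it assumes "/" is the only delimiter in use, and the lookahead element is left out as recommended above.

```php
<?php
// The delimited patterns from the question (without the lookahead).
$toks = [
    '/\b(waiting)\b/i',
    '/^(\w+)/',
    '/\b(responce)\b/i',
    '/\b(from)\b/i',
    '/\|/',
    '/\b(to)\b/i',
];

// Hypothetical helper: removes the leading "/" and the trailing "/" plus any
// modifier letters. Assumes "/" is the only delimiter used.
function strip_delimiters(string $pattern): string
{
    return preg_replace('~^/(.*)/[a-zA-Z]*$~s', '$1', $pattern);
}

// Join the bare patterns and apply one delimiter pair and one modifier set.
$pattern = '/' . implode('|', array_map('strip_delimiters', $toks)) . '/i';
preg_match_all($pattern, "I'm waiting for a responce from", $out);
print_r($out[0]);
```

Note that per-pattern modifiers are lost this way; the single /i at the end applies to every alternative.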

PHP parsing string to array with regular expressions

I have a string like this:
$msg,array('goo','gle'),000,"face",'book',['twi'=>'ter','link'=>'edin']
I want to use preg_match_all to convert this to an array that could look like this:
array(
0 => $msg,
1 => array('goo','gle'),
2 => 000,
3 => "face",
4 => 'book',
5 => ['twi'=>'ter','link'=>'edin']
);
Note that all the values are strings.
I am not very good at regular expressions, so I have just been unable to create a Pattern for this. Multiple preg calls will also do.

I suggest using preg_split with the following regex:
$re = "/([a-z]*(?:\\[[^]]*\\]|\\([^()]*\\)),?)|(?<=,)/";
$str = "\$msg,array('goo','gle'),000,\"face\",'book',['twi'=>'ter','link'=>'edin']";
print_r(preg_split($re, $str, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY));
Output of the sample program:
Array
(
[0] => $msg,
[1] => array('goo','gle'),
[2] => 000,
[3] => "face",
[4] => 'book',
[5] => ['twi'=>'ter','link'=>'edin']
)
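Since the question asked for preg_match_all, here is a rough sketch of that route as well. The pattern is my own assumption, not part of the answer above: it handles one level of array(...) or [...] and simple quoted strings, but not nested brackets or escaped quotes.

```php
<?php
$str = "\$msg,array('goo','gle'),000,\"face\",'book',['twi'=>'ter','link'=>'edin']";

// Alternatives are tried left to right at each position:
//   array(...) or (...)  |  [...]  |  "..."  |  '...'  |  any run without commas
$re = '/(?:array)?\([^()]*\)|\[[^\]]*\]|"[^"]*"|\'[^\']*\'|[^,]+/';

preg_match_all($re, $str, $matches);
$items = $matches[0];
print_r($items);
```

Unlike the preg_split version, the items come back without the trailing commas.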

I know you asked for a regular expression solution, however I'm on an eval() kick today:
eval('$array = array('.$string.');');
print_r($array);
Also note that 000 is NOT a string and will be converted to 0.

how to pull elseif - preg_match_all

I need advice on how to pull content from this string.
$string = '{elseif "xxx"=="xxx"} text {elseif "xx2"!="xx2"}
text text
text
{elseif ....} text';
//or 'xxx'=='xxx'
$regex = "??";
preg_match_all($regex, $string, $out, PREG_SET_ORDER);
var_dump($out);
And my idea of var_dump output is:
array
0 =>
array
0 => string 'xxx' (length=3)
1 => string '==' (length=2)
2 => string 'xxx' (length=3)
3 => string 'text' (length=4)
1 =>
array
1 => string 'xx2' (length=)
2 => string '!=' (length=)
3 => string 'xx2' (length=)
4 => string 'text text
text' (length=)
2 =>
array
...
The output does not have to look exactly like this, but it should carry the same content.
My attempt:
$regex = "~{elseif ([\"\'](.*)[\"\'])(!=|==|===|<=|<|>=|>)([\"\'](.*)[\"\'])}(.*)~sU";
But I get wrong output or none at all.

Do you mean something like this?
$regex = '/\{\s*elseif\s*("[^"]+")\s*([^"]+)\s*("[^"]+")\s*\}\s*([^{]*)\s*/i';
Note that PHP has no g modifier; preg_match_all already returns every match.
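Here is a runnable sketch along the same lines, under some assumptions of mine: the sample text is made up, both quote styles are accepted via a backreference, and the quote characters are dropped from the result so it resembles the layout asked for in the question.

```php
<?php
// Sample template text, assumed for illustration.
$string = '{elseif "xxx"=="xxx"} text {elseif \'xx2\'!="xx2"}' . "\ntext text\ntext\n";

// (["\']) captures the opening quote and \1 / \4 require the same closing quote.
// Longer operators are listed first so "===" is not cut short as "==".
$regex = '/\{\s*elseif\s+(["\'])(.*?)\1\s*(===|==|!=|<=|>=|<|>)\s*(["\'])(.*?)\4\s*\}\s*([^{]*)/i';

preg_match_all($regex, $string, $out, PREG_SET_ORDER);

$rows = [];
foreach ($out as $m) {
    // Groups 1 and 4 only hold the quote characters, so they are skipped.
    $rows[] = [$m[2], $m[3], $m[5], trim($m[6])];
}
print_r($rows);
```

The trailing text group stops at the next "{", so a literal brace inside the text would cut it short.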

the fastest way to replace (and store in array) links in the text with their order numbers

There is a $str string that may contain html text including <a >link</a> tags.
I want to store links in array and set the proper changes in the $str.
For example, with this string:
$str="some text <a href='/review/'>review</a> here <a class='abc' href='/about/'>link2</a> hahaha";
we get:
linkArray[0]="<a href='/review/'>review</a>";
positionArray[0] = 10;//position of the first link in the string
linkArray[1]="<a class='abc' href='/about/'>link2</a>";
positionArray[1]=45;//position of the second link in the string
$changedStr="some text [[0]] here [[1]] hahaha";
Is there any faster way (the performance) to do that, than running through the whole string using for?

This can be done with preg_match_all and the PREG_OFFSET_CAPTURE flag.
For example:
$str="some text <a href='/review/'>review</a> here <a class='abc' href='/about/'>link2</a> hahaha";
preg_match_all("|<[^>]+>(.*)</[^>]+>|U",$str,$out,PREG_OFFSET_CAPTURE);
var_dump($out);
Here the output array is $out. PREG_OFFSET_CAPTURE captures the offset in the string where the pattern starts.
The above code will output:
array (size=2)
0 =>
array (size=2)
0 =>
array (size=2)
0 => string '<a href='/review/'>review</a>' (length=29)
1 => int 10
1 =>
array (size=2)
0 => string '<a class='abc' href='/about/'>link2</a>' (length=39)
1 => int 45
1 =>
array (size=2)
0 =>
array (size=2)
0 => string 'review' (length=6)
1 => int 29
1 =>
array (size=2)
0 => string 'link2' (length=5)
1 => int 75
For more information, see http://php.net/manual/en/function.preg-match-all.php
For $changedStr:
Let $out be the output array from preg_match_all.
$count= 0;
foreach($out[0] as $result) {
$temp=preg_quote($result[0],'/');
$temp ="/".$temp."/";
$str =preg_replace($temp, "[[".$count."]]", $str,1);
$count++;
}
var_dump($str);
This gives the output :
string 'some text [[0]] here [[1]] hahaha' (length=33)
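As a variation, the replacement can be done in one call with preg_replace_callback instead of one preg_replace per link. This is only a sketch: the link pattern is my own assumption (an opening a tag up to the nearest closing one), and $linkArray, $positionArray and $changedStr are the names used in the question.

```php
<?php
$str = "some text <a href='/review/'>review</a> here <a class='abc' href='/about/'>link2</a> hahaha";

// Pattern assumed for illustration: an <a ...> tag up to the nearest </a>.
$pattern = '~<a\b[^>]*>.*?</a>~is';

// First pass: collect each link together with its byte offset in the original string.
preg_match_all($pattern, $str, $found, PREG_OFFSET_CAPTURE);
$linkArray     = array_column($found[0], 0);
$positionArray = array_column($found[0], 1);

// Second pass: replace the links in order with [[0]], [[1]], ...
$i = 0;
$changedStr = preg_replace_callback($pattern, function ($m) use (&$i) {
    return '[[' . $i++ . ']]';
}, $str);

print_r($linkArray);
print_r($positionArray);
echo $changedStr, "\n";
```

Whether this is actually faster than a manual loop depends on the input, so it is worth measuring on real data.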

I would use a regular expression for this. Check this reference:
http://weblogtoolscollection.com/regex/regex.php
Try your patterns here:
http://www.solmetra.com/scripts/regex/index.php
And use this:
http://php.net/manual/en/function.preg-match-all.php
Find the regular expression that covers every case you may encounter: preg_match_all, if you set the pattern correctly, will return an array containing every link you want.
Edit:
In your case, assuming you want to keep the "<a>", this may work:
$arr = array();
preg_match_all('/<a.*.a>/', '{{your data}}', $arr, PREG_PATTERN_ORDER);
Input example:
test
Lkdlasdk
llkdla
xx
Output with the above regexp:
Array
(
[0] => Array
(
[0] => test
[1] => Lkdlasdk
[2] => xx
)
)
Hope this helps

Split by whitespace only if not surrounded by [,<,{ or ],>,}

I have a string like this one:
traceroute <ip-address|dns-name> [ttl <ttl>] [wait <milli-seconds>] [no-dns] [source <ip-address>] [tos <type-of-service>] {router <router-instance>] | all}
I'd like to create an array like this:
$params = array(
<ip-address|dns-name>
[ttl <ttl>]
[wait <milli-seconds>]
[no-dns]
[source <ip-address>]
[tos <type-of-service>]
{router <router-instance>] | all}
);
Should I use preg_split('/someregex/', $mystring) ?
Or is there any better solution?

Use negative lookarounds. This one uses a negative lookahead for a <. This means it will not split if it finds a < ahead of the whitespace.
$regex='/\s(?!<)/';
$mystring='traceroute <192.168.1.1> [ttl <120>] [wait <1500>] [no-dns] [source <192.168.1.11>] [tos <service>] {router <instance>] | all}';
$array=array();
$array = preg_split($regex, $mystring);
var_dump($array);
And my output is
array
0 => string 'traceroute <192.168.1.1>' (length=24)
1 => string '[ttl <120>]' (length=11)
2 => string '[wait <1500>]' (length=13)
3 => string '[no-dns]' (length=8)
4 => string '[source <192.168.1.11>]' (length=23)
5 => string '[tos <service>]' (length=15)
6 => string '{router <instance>]' (length=19)
7 => string '|' (length=1)
8 => string 'all}' (length=4)

You could use preg_match_all such as:
preg_match_all("/\\[[^]]*]|<[^>]*>|{[^}]*}/", $str, $matches);
And get your result from the $matches array.
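A runnable sketch of this approach on the string from the question. The pattern is a slight variation with the braces and closing bracket escaped, which is my own preference rather than a requirement:

```php
<?php
$str = 'traceroute <ip-address|dns-name> [ttl <ttl>] [wait <milli-seconds>] [no-dns] '
     . '[source <ip-address>] [tos <type-of-service>] {router <router-instance>] | all}';

// One alternative per bracket type; the first one that matches at a position wins,
// so the <...> inside a [...] is swallowed by the outer pair.
preg_match_all('/\[[^\]]*\]|<[^>]*>|\{[^}]*\}/', $str, $matches);
$params = $matches[0];
print_r($params);
```

The leading word traceroute is not part of the result, because it is not wrapped in any of the three bracket types.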

Yes, preg_split makes sense and is probably the most efficient way to do this.
Try:
preg_split('/[\{\[<](.*?)[>\]\}]/', $mystring);
Or if you want to match rather than split, you may want to try:
$matches=array();
preg_match('/[\{\[<](.*?)[>\]\}]/',$mystring,$matches);
print_r($matches);
Updated
I missed that you're trying to get the tokens, not the content of the tokens. I think you are going to need preg_match_all. Try something like this as a starting point:
$matches = array();
preg_match_all('/(\{.*?[\}])|(\[.*?\])|(<.*?>)/', $mystring,$matches);
var_dump($matches);
I get (only $matches[0] is shown; $matches[1] to $matches[3] hold the per-group captures):
Array
(
[0] => Array
(
[0] => <ip-address|dns-name>
[1] => [ttl <ttl>]
[2] => [wait <milli-seconds>]
[3] => [no-dns]
[4] => [source <ip-address>]
[5] => [tos <type-of-service>]
[6] => {router <router-instance>] | all}
)
)
