PHP regular expression to extract html values - php

I want to extract values from the code below.
<tbody>
<tr>
<td><div class="file_pdf">note1</div></td>
<td class="textright">110 KB</td>
<td class="textright">106</td>
</tr>
<tr>
<td><div class="file_pdf">note2.pdf</div></td>
<td class="textright">44 KB</td>
<td class="textright">104</td>
</tr>
</tbody>
I want to extract the 'note1' and 'note2' strings and the numbers 1628 and 1629.
I tried
preg_match_all('~(\'\)\">(.*?)<\/a>)~', $getinside, $matches);
but the result is not what I am looking for.
Is there a simple regex to extract them? Thanks!

It should work for you:
preg_match_all("~downloadFile\('(\d+)'\)\">([^<]*)</a>~", $getinside, $matches);
Remember: if your HTML is very large or complex and you also need to parse other things from it, regex is not the best tool for the job.
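To see what the pattern above captures, here is a minimal sketch. The sample markup is an assumption: the question's HTML does not show the links, so the onclick="downloadFile('…')" anchors below are inferred from the pattern itself and may not match the real page.

```php
<?php
// Hypothetical sample input: the <a onclick="downloadFile('...')"> markup is assumed.
$getinside = <<<'HTML'
<tr><td><div class="file_pdf"><a href="#" onclick="downloadFile('1628')">note1</a></div></td></tr>
<tr><td><div class="file_pdf"><a href="#" onclick="downloadFile('1629')">note2.pdf</a></div></td></tr>
HTML;

// Group 1: the numeric id passed to downloadFile(); group 2: the link text.
preg_match_all("~downloadFile\('(\d+)'\)\">([^<]*)</a>~", $getinside, $matches);

$ids   = $matches[1];
$names = $matches[2];

print_r($ids);
print_r($names);
```

With the default PREG_PATTERN_ORDER flag, $matches[0] holds the full matches and each following index holds one capture group across all matches.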

Related

Regex not giving the results expected

I have some HTML tables that I need to extract values from, and I wrote a regular expression to get the values I want.
The HTML tables can be in one of these two formats:
<td height="20" style="width:59px;height:20px;">1</td>
<td style="width:212px;">Mendes, Paulo [AA]</td>
<td style="width:99px;">39</td>
<td>8</td>
<td style="width:85px;">$10,000</td>
</tr><tr height="20"><td height="20" style="width:59px;height:20px;">2</td>
<td style="width:212px;">Campos, Miguel [AC]</td>
<td style="width:99px;">37</td>
<td>6</td>
<td style="width:85px;">$5,000</td>
And the other one
<td>1</td>
<td>Mendes, Paulo [AA]</td>
<td>39</td>
<td>8</td>
<td>$10,000</td>
</tr><tr height="20"><td>2</td>
<td>Campos, Miguel [AC]</td>
<td>37</td>
<td>6</td>
<td>$5,000</td>
For the example without styles, I can get the values I want with this regex:
<td>(\d+)<\/td>\n+\t*<td>([\w+, ]+) \[(\w{2})\]<\/td>
It is to be used in PHP, and I have been using https://regex101.com/ to test the regex first.
For the table with styles, however, I am having no luck.
I tried an exact match with:
<td height\=\"20\" style\=\"width\:59px\;height\:20px\;\">(\d+)<\/td>\n+\t*<td style\=\"width\:212px\;\">([\w+, ]+) \[(\w{2})\]<\/td>
but it doesn't catch what I want. I even tried a negated search, but it still doesn't work. What am I doing wrong?
Why don't you use querySelectorAll()? It is a lot easier. You can use it to retrieve the inner text of the td elements and store them in an array using a loop. Once you have the td contents, you can use jQuery Ajax to send them to a .php file to process however you want.
For example:
var tdArr = [];
var tdContent = document.querySelectorAll('table tr td');
for ( let i = 0; i < tdContent.length; i++){
tdArr.push(tdContent[i].textContent);
}
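If the extraction has to stay in PHP, one possible approach is a single pattern that ignores whatever attributes the td tags carry, so that both table formats from the question match. This is only a sketch: the function name extractRows is hypothetical, and the pattern assumes the cell contents look like the samples shown.

```php
<?php
// Hypothetical helper: returns [rank, name, code] for each matching row.
function extractRows(string $html): array
{
    // <td[^>]*> accepts a td tag with or without attributes;
    // \s* tolerates any mix of newlines, tabs and spaces between cells.
    $pattern = '~<td[^>]*>(\d+)</td>\s*<td[^>]*>([\w, ]+) \[(\w{2})\]</td>~';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $rows = [];
    foreach ($matches as $m) {
        $rows[] = [$m[1], $m[2], $m[3]];
    }
    return $rows;
}

$styled = <<<'HTML'
<td height="20" style="width:59px;height:20px;">1</td>
<td style="width:212px;">Mendes, Paulo [AA]</td>
<td style="width:99px;">39</td>
HTML;

$plain = <<<'HTML'
<td>2</td>
<td>Campos, Miguel [AC]</td>
<td>37</td>
HTML;

$styledRows = extractRows($styled);
$plainRows  = extractRows($plain);
print_r($styledRows);
print_r($plainRows);
```

Using \s* instead of \n+\t* also covers Windows line endings (\r\n), which an exact \n+ would not match.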

How to get strings with Regex

I want to get the strings between the td tags, but one of the td elements has no closing tag. How can I get the text from this tag along with the other strings?
<tr>
<td class="exclass">Text 0
<td class="exclass">Text 1</td>
<td class="exclass">Text 2</td>
<td class="exclass3" >Text</td >
<td class="exclass"> Text </td>
<td class="exclass3">Text</td>
<td class="exclass">Text</td>
<td class="exclass">Text</td><td class="exclass">Text</td>
<td class="exclass2">Text</td>
<td class="exclass">Text</td>
<td class="exclass" width="20"><img src="exampleSrc"></td>
</tr>
As you can see in the code above, I want to get Text 0 and the other strings with PHP.
So far, I have tried:
<td.+?>([\w\W]*?)<\/td.+?|<td
I assume you can't use a DOM parser because one of the td elements has no closing tag.
Here is my regex solution:
(?<=>)([\s\w\n]+)(?=<)
https://regex101.com/r/BRaJAu/1
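A short sketch of how the lookaround pattern above might be applied in PHP. Because [\s\w\n]+ also matches the bare newlines between tags, the sketch trims every match and drops the empty ones; that clean-up step and the shortened sample input are additions for illustration.

```php
<?php
// Shortened sample based on the question; the first td has no closing tag.
$html = <<<'HTML'
<tr>
<td class="exclass">Text 0
<td class="exclass">Text 1</td>
<td class="exclass3" >Text</td >
</tr>
HTML;

// Match runs of word characters and whitespace that sit between a '>' and a '<'.
preg_match_all('~(?<=>)([\s\w\n]+)(?=<)~', $html, $matches);

// Whitespace-only matches come from the line breaks between tags, so trim and filter.
$texts = array_values(array_filter(
    array_map('trim', $matches[1]),
    fn (string $s): bool => $s !== ''
));

print_r($texts);
```

Note that the character class only covers word characters and whitespace, so cell text containing punctuation would be skipped by this pattern.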

echo html as text with PHP variable

I need to display a large amount of HTML as text (a table) that contains PHP variables. Is there a command that echoes the HTML as text but still evaluates the PHP variables?
I know I can replace > with &gt;. I'm just wondering if there is a better way to do this.
<table style="width:400px">
<tr>
<td><b style="font-size:27px;">$_POST['name']</b></td>
<td rowspan="2"> IMG</td>
</tr>
<tr>
<td><b style="font-size:23px;font-weight:400">$_POST['function']</b></td>
</tr>
</table>
This is only a part of it, of course.
TL;DR: I want to echo the above text (as text, not code), but the PHP variables need to be evaluated as code.
Pushing it through the htmlentities function when you output the data should work. The function will convert whatever string you put in into one that looks exactly the same when rendered in HTML, which seems to be what you want.
Use htmlspecialchars and add curly braces around PHP variables:
$string = "<table style=\"width:400px\">
<tr>
<td><b style=\"font-size:27px;\">{$_POST['name']}</b></td>
<td rowspan=\"2\"> IMG</td>
</tr>
<tr>
<td><b style=\"font-size:23px;font-weight:400\">{$_POST['function']}</b></td>
</tr>
</table>";
echo htmlspecialchars($string);
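A reduced version of the same idea, as a sketch: the variable is interpolated first, and the whole string is escaped afterwards, so the browser shows the tags as text. The $name variable is a hypothetical stand-in for $_POST['name'].

```php
<?php
// Hypothetical stand-in for $_POST['name'].
$name = 'Alice';

// Interpolation happens when the string is built ...
$markup = "<td><b style=\"font-size:27px;\">{$name}</b></td>";

// ... and escaping happens afterwards, on the finished string.
$escaped = htmlspecialchars($markup, ENT_QUOTES, 'UTF-8');

echo $escaped, "\n";
// prints: &lt;td&gt;&lt;b style=&quot;font-size:27px;&quot;&gt;Alice&lt;/b&gt;&lt;/td&gt;
```

Because the escaping runs after interpolation, any markup inside the submitted value is escaped as well, which is usually what you want for user input.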

PHP Table Reader

How can I read the ISP value from this HTML table?
<table style="padding-top:10px;">
<tbody>
<tr>
<th>ISP:</th>
<td>My Provider</td>
</tr>
<tr><th>Organization:</th><td nowrap=""></td>
</tr>
<tr><th>Connection:</th>
</tbody></table>
Given the lack of information, a regular expression would be the easiest solution.
$matches = array();
// The pattern needs delimiters (here ~); \s* covers any whitespace between the cells.
preg_match('~<th>ISP:</th>\s*<td>(.*?)</td>~', "<th>ISP:</th><td>My Provider</td>...", $matches);
var_dump($matches);
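As an alternative to the regular expression, the same value could be read with DOMDocument and an XPath query. This is a different technique from the answer above, shown only as a sketch; it assumes the DOM extension is available and that the label cell reads exactly "ISP:".

```php
<?php
// The table from the question, including its unclosed last row.
$html = <<<'HTML'
<table style="padding-top:10px;">
<tbody>
<tr>
<th>ISP:</th>
<td>My Provider</td>
</tr>
<tr><th>Organization:</th><td nowrap=""></td>
</tr>
<tr><th>Connection:</th>
</tbody></table>
HTML;

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Find the th whose text is "ISP:" and take the first td that follows it.
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//th[normalize-space(.)="ISP:"]/following-sibling::td[1]');

$isp = $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;
var_dump($isp);
```

The same query can be reused for the other rows by changing the label string.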

How Can I Get Data From HTML Source Code with PHP and RegEx?

I have some HTML source code and need to extract some text from it. I cannot use DOM because the document isn't well-formed.
The source might change later without my being aware of it, so the solution must be applicable to most situations.
I'm fetching the source with curl and plan to process it with the preg_match_all function and regular expressions.
Source :
...
<TR Class="Head1">
<TD width="15%"><font size="12">Name</font></TD>
<TD>: </TD>
<TD align="center"><font color="red">Alex</font></TD>
<TD width="25%"><b>Job</b></TD>
<TD>: </B></TD>
<TD align="center" width="25%"><font color="red">Doctor</font></TD>
</TR>
...
...
<TR Class="Head2">
<TD width="15%" align="left">Age</B></TD>
<TD>: </TD>
<TD align="center"><font color="red">32</font></TD>
<TD width="15%"><font size="10">data</TD></font>
<TD> </B></TD>
<TD width="40%"> </TD>
</TR>
...
As you can see, the source is not well-formed. In fact, it is terrible! But there is nothing I can do about it.
The source is longer than this.
How can I get the data from the source? I can strip all of the HTML tags, but then how can I know the order of the data? What can I do with preg_match_all and regex? What else can I do?
I'm waiting for your help.
If you can use the DOM, it is far better than regexes. Take a look at PHP Tidy - it's designed to manage badly formed HTML.
You can use DOMDocument to load badly formed HTML:
$doc = new DOMDocument();
@$doc->loadHTML('<TR Class="Head2">
<TD width="15%" align="left">Age</B></TD>
<TD>: </TD>
<TD align="center"><font color="red">32</font></TD>
<TD width="15%"><font size="10">data</TD></font>
<TD> </B></TD>
<TD width="40%"> </TD>
</TR>');
$tds = @$doc->getElementsByTagName('td');
foreach ($tds as $td) {
echo $td->textContent, "\n";
}
I'm suppressing warnings in the above code for brevity.
Output:
Age
:
32
data
<!-- space -->
<!-- space -->
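Instead of the @ operator, libxml_use_internal_errors() can be used to keep the parser warnings out of the output. The sketch below applies it to the first row from the question and picks the values with an XPath query; selecting the red font elements is an assumption about how the values are marked up, based only on the sample shown.

```php
<?php
// First row from the question, stray </B> tag included.
$html = <<<'HTML'
<TR Class="Head1">
<TD width="15%"><font size="12">Name</font></TD>
<TD>: </TD>
<TD align="center"><font color="red">Alex</font></TD>
<TD width="25%"><b>Job</b></TD>
<TD>: </B></TD>
<TD align="center" width="25%"><font color="red">Doctor</font></TD>
</TR>
HTML;

// Collect warnings about the malformed markup instead of emitting them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// The HTML parser lowercases tag names, so query for td/font, not TD/FONT.
$xpath = new DOMXPath($doc);
$values = [];
foreach ($xpath->query('//td/font[@color="red"]') as $node) {
    $values[] = trim($node->textContent);
}

print_r($values);
```

Since the values keep their document order, pairing them with the label cells (Name, Job) is a matter of walking the td elements in sequence.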
Using regex to parse HTML can be a futile effort as HTML is not a regular language.
Don't use RegEx. The link is funny but not informative, so the long and short of it is that HTML markup is not a regular language, hence cannot be parsed simply using regular expressions.
You could use RegEx to parse individual 'tokens' ( a single open tag; a single attribute name or value...) as part of a recursive parsing algorithm, but you cannot use a magic RegEx to parse HTML all on its own.
Or you could use a parser.
Since the markup isn't valid, maybe you could use TagSoup or PHP:Tidy.
// Nowdoc keeps the backslashes literal; the ~ characters are the pattern delimiters.
$regex = <<<'EOF'
~<TR Class="Head2">\s+<TD width="15%" align="left">Age</B></TD>\s+<TD>: </TD>\s+<TD align="center"><font color="red">(\d+)</font></TD>\s+<TD width="15%"><font size="10">(\w+)</TD></font>\s+<TD> </B></TD>\s+<TD width="40%"> </TD>\s+</TR>~
EOF;
preg_match_all($regex, $text, $result);
var_dump($result);
