Regex in PHP to extract data from website


I am new to PHP. As part of a course homework assignment, I have to extract data from a website and render a table from that data.
P.S.: I know regex is not a good choice for this, but we are not allowed to use any library such as DOM or jQuery.
The character set is UTF-8.
$searchURL = "http://www.allmusic.com/search/artists/the+beatles";
$html = file_get_contents($searchURL);
$patternform = '/<form(.*)<\/form>/sm';
preg_match_all($patternform ,$html,$matches);
Here the regex works fine, but when I apply the same regex to the table tag, it returns an empty array. Does this have something to do with whitespace in $html?
What is wrong here?

The following code produces a good result:
$searchURL = "http://www.allmusic.com/search/artists/the+beatles";
$html = file_get_contents($searchURL);
$patternform = '/(<table.*<\/table>)/sm';
preg_match_all($patternform ,$html,$matches);
echo $matches[0][0];
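A note on the quantifier: with the s modifier, `.*` is greedy, so on a page with several tables the pattern above captures everything from the first <table to the last </table> as a single match. Below is a minimal sketch of the lazy variant, followed by splitting the first table into rows and cells. It runs on a made-up HTML string rather than the live page; the sample markup and the variable names are assumptions for illustration, not taken from allmusic.com.

```php
<?php
// Made-up sample standing in for the fetched page (assumption).
$html = "<table>\n<tr><td>The Beatles</td><td>Pop/Rock</td></tr>\n</table>\n"
      . "<p>unrelated</p>\n"
      . "<table>\n<tr><td>Other</td><td>Jazz</td></tr>\n</table>";

// Greedy: a single match running from the first <table to the last </table>.
preg_match_all('/<table.*<\/table>/s', $html, $greedy);

// Lazy (.*?): one match per table.
preg_match_all('/<table.*?<\/table>/s', $html, $lazy);

// Split the first table into rows, then each row into cell texts.
$rows = [];
preg_match_all('/<tr[^>]*>(.*?)<\/tr>/s', $lazy[0][0], $trMatches);
foreach ($trMatches[1] as $rowHtml) {
    preg_match_all('/<td[^>]*>(.*?)<\/td>/s', $rowHtml, $tdMatches);
    $rows[] = array_map('trim', array_map('strip_tags', $tdMatches[1]));
}

print_r($rows);
```

The $rows array can then be looped over to echo your own <tr> and <td> markup. Nested tables would defeat this approach, which is one reason a real parser is usually preferred when it is allowed.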

Related

Preg Replace text based on string

I am trying to figure out why this has no effect.
I am fetching data from the WordPress database:
$global_notice2 = get_post_meta($post->ID,'_global_notice', true);
This contains an a href link that I wish to manipulate using preg_replace before displaying it to the user, like this:
preg_replace('/<a(.*?)href="(.*?)"(.*?)>/', '', $global_notice2 );
Now we display the data:
$notice2 = "<p>$alternative_content$global_notice2</p>";
The data is unmodified. What am I doing wrong?

preg_replace does not modify its argument; you need to capture the return value, like this:
$global_notice2 = preg_replace('/<a(.*?)href="(.*?)"(.*?)>/', '', $global_notice2);
See preg_replace documentation
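To make the difference visible, here is a small sketch on a made-up notice string (the sample text and the variable names are assumptions). It also shows that the pattern from the question removes only the opening tag, so a second, broader pattern is included as one possible way to drop the closing </a> as well.

```php
<?php
// Made-up sample value standing in for the post meta (assumption).
$notice  = 'Read <a class="x" href="http://example.com/">this</a> first.';
$pattern = '/<a(.*?)href="(.*?)"(.*?)>/';

// Return value discarded: $notice itself is left untouched.
preg_replace($pattern, '', $notice);
$unchanged = $notice;

// Return value captured: the opening <a ...> tag is gone, </a> remains.
$stripped = preg_replace($pattern, '', $notice);

// A broader pattern that removes both the opening and the closing tag.
$clean = preg_replace('/<\/?a\b[^>]*>/', '', $notice);

echo $unchanged, "\n", $stripped, "\n", $clean, "\n";
```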

preg_replace with wildcards?

I have HTML markup of the form
<div id='abcd1234A'><p id='wxyz1234A'>Hello</p></div>
which I need to change to the form
<div id='abcd1234AN'><p id='wxyz1234AN'>Hello</p></div>
where N may be 1, 2, and so on.
The best I have been able to do is as follows:
function cloneIt($a,$b)
{
    return substr_replace($a,$b,-1);
}
$ndx = "1'";
$str = "<div id='abcd1234A'><p id='wxyz1234A'>Hello</p></div>";
preg_match_all("/id='[a-z]{4}[0-9]{4}A'/",$str,$matches);
$matches = $matches[0];
$reps = array_merge($matches);
$ndxs = array_fill(0,count($reps),$ndx);
$reps = array_map("cloneIt",$reps,$ndxs);
$str = str_replace($matches,$reps,$str);
echo htmlspecialchars($str);
which works just fine. However, my regex skills are not much to write home about, so I suspect there is a better way to do this. I'd be most obliged to anyone who can suggest a neater or quicker way of accomplishing the same result.

You can shorten your regex slightly by writing \d in place of [0-9]:
/id='[a-z]{4}\d{4}A'/
Sample code
preg_match_all("/id='[a-z]{4}\\d{4}A'/",$str,$matches);
However, an alternative would be to use an HTML parser. Here I'll use Simple HTML DOM:
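// Load the HTML from URL or file
$html = file_get_html('http://www.mysite.com/');
// You can also load $html from string: $html = str_get_html($my_string);
$ndx = "1"; // suffix to append (no trailing quote is needed here)
// Find every element that has an id attribute (the div and the p alike)
foreach($html->find('[id]') as $el) {
    // ->id holds only the attribute value, e.g. abcd1234A
    if (preg_match('/^[a-z]{4}\d{4}A$/', $el->id)) {
        $el->id = $el->id . $ndx; // "." concatenates; "+" would do arithmetic
    }
}
echo $html->save();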
Did you notice how elegant, concise and clear the code becomes with an HTML parser?
References
Simple Html Dom Documentation
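If you would rather stay with regex, the whole match, map and str_replace sequence from the question can probably be collapsed into one preg_replace call with a backreference. This is only a sketch under the assumption that every id has exactly the shape shown in the question; $suffix is a name introduced here for illustration.

```php
<?php
$str    = "<div id='abcd1234A'><p id='wxyz1234A'>Hello</p></div>";
$suffix = "1"; // the N to append (assumed to be a short string)

// Group 1 captures everything up to, but not including, the closing quote.
// '${1}' is used instead of '$1' so the digit that follows is not read
// as part of the group number.
$result = preg_replace(
    "/(id='[a-z]{4}\\d{4}A)'/",
    '${1}' . $suffix . "'",
    $str
);

echo htmlspecialchars($result), "\n";
```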

Extract var value from preg_replace function

I'm trying to simulate a BBCode tag, like the code below:
[code]this is code to render[/code]
[code attributeA=arg]this is code to render[/code]
[code attribute C=arg anotherAtributte=anotherArg]this is code to render[/code]
As you can see, the code tag can take as many attributes as needed, and there can be many code tags in the same "publishment". So far I have only dealt with the easiest tags, such as img, b, a and i. For example:
$result = preg_replace('#\[link\=(.+)\](.+)\[\/link\]#iUs', '$2', $publishment);
That works fine, since it returns the final markup. But for the code tag I need the attributes and values in an array, so that I can build the markup myself according to those attributes, something like this:
$code_tag = someFunction("[code ??=?? ...] content [/code]", $array);
// build the markup myself
$attribute1 = array_key_exists("attribute1", $array) ? $array["attribute1"] : "";
echo "<pre {$attribute1}>" . $array['content'] . "</pre>";
I don't expect you to do it all for me; I just need help getting pointed in the right direction, because I have never used regex.
Thank you in advance.

I like to use preg_replace_callback for such things:
function codecb($matches)
{
    $original=$matches[0];   // the whole [code ...]...[/code] tag
    $parameters=$matches[1]; // everything between "[code" and "]"
    $content=$matches[2];    // the text between the tags
    return "<pre>". $content ."</pre>";
}
// preg_replace_callback returns the new string, so keep the result
$str = preg_replace_callback("#\[code(.*)\](.+)\[\/code\]#iUs", "codecb", $str);
So when you have [code argA=test argB=test]This is content[/code], then inside the function codecb you will have:
$original = "[code argA=test argB=test]This is content[/code]";
$parameters = " argA=test argB=test";
$content = "This is content";
and you can preg_match the arguments and return the replacement for the whole tag.
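The last step, turning the parameter string into an array, might look roughly like the sketch below. It assumes attributes are written as name=value with no spaces inside the value. parseCodeAttributes and renderCodeTag are hypothetical helper names, and emitting the attributes as data-* attributes on the <pre> is just one arbitrary choice of markup.

```php
<?php
// Hypothetical helper: ' argA=test argB=test' => ['argA' => 'test', 'argB' => 'test'].
function parseCodeAttributes(string $parameters): array
{
    $attributes = [];
    if (preg_match_all('/(\w+)=(\S+)/', $parameters, $pairs, PREG_SET_ORDER)) {
        foreach ($pairs as $pair) {
            $attributes[$pair[1]] = $pair[2];
        }
    }
    return $attributes;
}

// Callback for preg_replace_callback: builds the markup from the match.
function renderCodeTag(array $matches): string
{
    $html = '<pre';
    foreach (parseCodeAttributes($matches[1]) as $name => $value) {
        $html .= ' data-' . htmlspecialchars($name) . '="' . htmlspecialchars($value) . '"';
    }
    return $html . '>' . htmlspecialchars($matches[2]) . '</pre>';
}

$post = '[code argA=test argB=test]This is content[/code]';
// Lazy quantifiers (.*? and .+?) stand in for the U modifier used above.
$rendered = preg_replace_callback('#\[code(.*?)\](.+?)\[/code\]#is', 'renderCodeTag', $post);

echo $rendered, "\n";
```

Inside renderCodeTag you have the attribute array available, so you can decide per attribute how it should affect the generated markup.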

How to parse HTML tags to plain text? I want to achieve something like Facebook or Twitter

For example: I have this string
#[1234:peterwateber] <b>hello</b> <div>hi!</div> http://stackoverflow.com
I want to convert it into HTML like this:
#peterwateber <b>hello</b> <div>hi!</div>
http://stackoverflow.com
I'm using QueryPath, and I have this code, which extracts "1234" and "peterwateber" from "#[1234:peterwateber]".
The code to do that is:
$hidden_input = "#[1234:peterwateber] <b>hello</b> <div>hi!</div> http://stackoverflow.com";
// "/" is used as the delimiter because the pattern itself starts with "#"
preg_match('/#\[(\w+):(\w+)\]/', $hidden_input, $m); // $m[1] is "1234", $m[2] is "peterwateber"
What I'm trying to achieve is to have this kind of output:
I'm using Hawkee's plugin for jQuery autocomplete http://www.hawkee.com/snippet/9391/

I'm not entirely sure whether there is a specific function just for that, but here is what you can do,
using the link (a href) as an example:
$raw = "#[1234:peterwateber]";
// replace the markers one at a time, feeding each result into the next step
$firstpass = str_replace("#[", "<a href='", $raw);
$secondpass = str_replace(":", "'>", $firstpass);
$result = str_replace("]", "</a>", $secondpass);
// $result is now "<a href='1234'>peterwateber</a>"
I know it seems tedious, but it should do the trick. If it's not helpful, then please don't rate me down... I spent time on this.
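A regex-based variant of the same idea, as a rough sketch: escape the whole input first so that typed tags show up as plain text, then turn each #[id:name] marker into a link. renderPost is a hypothetical function name, the /user/ URL scheme is invented for illustration, and the sketch assumes the id is numeric.

```php
<?php
// Hypothetical helper; the "/user/<id>" link target is an assumption.
function renderPost(string $input): string
{
    // Escape first so <b>, <div> etc. are displayed literally.
    $escaped = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

    // Then replace each #[id:name] marker with a link showing #name.
    return preg_replace('/#\[(\d+):(\w+)\]/', '<a href="/user/$1">#$2</a>', $escaped);
}

$hidden_input = "#[1234:peterwateber] <b>hello</b> <div>hi!</div> http://stackoverflow.com";
$output = renderPost($hidden_input);
echo $output, "\n";
```

Escaping before replacing matters: the characters in the marker (#, [, ], :) are not touched by htmlspecialchars, so the marker survives the escaping step while the user-typed tags do not.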

How can I parse a very simple Table using PHP

Good day, dear community!
I need to build a function which parses the content of a very simple table
(with some labels and values); see the URL below. I have used various ways to parse HTML sources, but this one is a bit tricky. The target I want to parse has some invalid markup:
The target: http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=644.0013008534253&SchulAdresseMapDO=194190
Well, I tried it with this:
<?php
require_once('config.php'); // call config.php for db connection
$filename = "url.txt"; // the txt file which holds the urls
$each_line = file($filename);
foreach($each_line as $line_num => $line)
{
    $line = trim($line);
    $content = file_get_contents($line);
    //echo ($content)."<br>";
    $pattern = '/<td>(.*?)<\/td>/si';
    preg_match_all($pattern,$content,$matches);
    foreach ($matches[1] as $match) {
        $match = strip_tags($match);
        $match = trim($match);
        //var_dump($match);
        // $conn is assumed to be the mysqli connection created in config.php
        $sql = mysqli_query($conn, "insert into tablename(contents) values ('" . mysqli_real_escape_string($conn, $match) . "')");
        //echo $match;
    }
}
?>
Well, see the regex on lines 7-11 of the script: it does not match!
Conclusion: I have to rework the parser part of this script. I need to parse in a different way, since the parser code does not match what it is aimed at, which is the content of the table.
Can anybody help me get a better regex, or a better way to parse this site?
Any and all help will be greatly appreciated.
Regards,
zero

You could tear the table apart using
preg_split('/<td width="73%"> /', $str, -1); (note: I did not bother escaping characters)
You'll want to drop the first entry. Now you can use stripos and substr to cut away everything after the .
This is a basic setup! You will have to fine-tune it quite a bit, but I hope this gives you an idea of what would be my approach.
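Spelled out, that approach could look something like the sketch below. The sample markup is made up and much simpler than the real page, and cutting each chunk at the closing </td> is an assumption about where the value ends.

```php
<?php
// Made-up sample standing in for the real page (assumption).
$str = '<table><tr><td width="27%">Name</td><td width="73%"> Musterschule</td></tr>'
     . '<tr><td width="27%">Ort</td><td width="73%"> Essen</td></tr></table>';

// Split at the start of every value cell.
$chunks = preg_split('/<td width="73%"> /', $str);
array_shift($chunks); // drop everything before the first value cell

$values = [];
foreach ($chunks as $chunk) {
    // Assumed cut point: the closing tag of the value cell.
    $end = stripos($chunk, '</td>');
    $values[] = trim($end === false ? $chunk : substr($chunk, 0, $end));
}

print_r($values);
```

Alternatively, widening the pattern from the question to /<td[^>]*>(.*?)<\/td>/si would also match cells that carry attributes such as width, which a bare <td> does not.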

Regex does not always give a perfect result. Using an HTML parser is a good idea. There are many HTML parsers, as described in Gordon's answer.
I have used Simple HTML DOM Parser in the past and it worked for me.
For example:
// Create DOM from URL or file
$html = file_get_html('http://www.example.com/');
// Find all <td> in a <table> with class=hello
$es = $html->find('table.hello td');
// Find all td tags with attribute align=center inside table tags
$es = $html->find('table td[align=center]');
