Good day, dear community!
I need to build a function that parses the content of a very simple table
(with some labels and values); see the URL below. I have used various ways to parse HTML sources, but this one is a bit tricky. The target I want to parse has some invalid markup:
The target: http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=644.0013008534253&SchulAdresseMapDO=194190
Well, I tried it with this one:
<?php
require_once('config.php'); // call config.php for db connection
$filename = "url.txt"; // Include the txt file which have urls
$each_line = file($filename);
foreach($each_line as $line_num => $line)
{
$line = trim($line);
$content = file_get_contents($line);
//echo ($content)."<br>";
$pattern = '/<td>(.*?)<\/td>/si';
preg_match_all($pattern,$content,$matches);
foreach ($matches[1] as $match) {
$match = strip_tags($match);
$match = trim($match);
//var_dump($match);
$sql = mysqli_query("insert into tablename(contents) values ('$match')");
//echo $match;
}
}
?>
Well, see the regex in lines 7-11: it does not match!
Conclusion: I have to rework the parser part of this script. I need to parse in a different way, since the parser code does not match what I am aiming for, which is to get back the contents of the table.
Can anybody help me get a better regex, or a better way to parse this site?
Any and all help will be greatly appreciated.
regards
zero
You could tear the table apart using
preg_split('/<td width="73%"> /', $str, -1); (note; i did not bother escaping characters)
You'll want to drop the first entry. Now you can use stripos and substr to cut away everything after the .
This is a basic setup! You will have to fine-tune it quite a bit, but I hope this gives you an idea of what would be my approach.
Regex does not always give perfect results. Using an HTML parser is a good idea. There are many HTML parsers, as described in Gordon's answer.
I have used Simple HTML DOM Parser in the past and it worked for me.
For Example:
// Create DOM from URL or file
$html = file_get_html('http://www.example.com/');
// Find all <td> in <table> which class=hello
$es = $html->find('table.hello td');
// Find all td tags with attribute align=center in table tags
$es = $html->find('table td[align=center]');
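If you would rather not pull in a third-party library, PHP's bundled DOM extension can cope with sloppy markup too. Below is a minimal sketch, not a drop-in solution for the page in the question: extractTableRows() is a hypothetical helper name, and the sample HTML is made up (attributes on the cells, one missing closing tag) to mimic the kind of markup that defeats a plain <td>(.*?)</td> pattern.

```php
<?php
// Sketch: pull the cell texts out of every table row with the built-in DOM extension.
// extractTableRows() is a made-up helper name, not part of any library.
function extractTableRows(string $html): array
{
    // Collect markup warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $rows = [];
    foreach ($dom->getElementsByTagName('tr') as $tr) {
        $cells = [];
        foreach ($tr->getElementsByTagName('td') as $td) {
            // Collapse runs of whitespace inside each cell.
            $cells[] = trim(preg_replace('/\s+/', ' ', $td->textContent));
        }
        if ($cells !== []) {
            $rows[] = $cells;
        }
    }
    return $rows;
}

// Made-up sample: cells carry attributes and the last cell is never closed.
$sample = '<table>'
        . '<tr><td width="27%">Schulname</td><td width="73%"> Beispielschule </td></tr>'
        . '<tr><td width="27%">Ort</td><td width="73%">Musterstadt</tr>'
        . '</table>';
print_r(extractTableRows($sample));
```

Each row comes back as an array of cell strings, so the label/value pairs can be inserted into the database with a prepared statement instead of being concatenated into the SQL string.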
I don't know if this is the right way to go about it, but right now I am dealing with a very large text file of membership details. It is quite inconsistent, but it typically conforms to this format:
Name
School
Department
Address
Phone
Email
&&^ (indicating the end of the individual record)
What I want to do with this information is read through it, and then format it into XML.
So right now I have a foreach reading through the long file like this:
<?php
$textline = file("asrlist.txt");
foreach($textline as $showline){
echo $showline . "<br>";
}
?>
And that's where I don't know how to continue. Can anybody give me some hints on how I could organize these records into XML?
Here is a straightforward solution using SimpleXML:
$text = file_get_contents('asrlist.txt'); // read the whole file as one string
$members = explode('&&^', $text); // building array $members
$xml = new SimpleXMLElement('<?xml version="1.0" encoding="UTF-8"?><members></members>');
$fieldnames = array('name','school','department','address','phone','email');
// set $fieldsep to the character(s) that separate fields from each other in your text file
$fieldsep = "\n"; // a guess: one field per line
foreach ($members as $member) {
    $m = explode($fieldsep, trim($member)); // build array $m; $m[0] would contain "name" etc.
    if ($m[0] === '') continue; // skip the empty chunk after the last &&^
    $xmlmember = $xml->addChild('member');
    foreach ($m as $key => $data) {
        // addChild() does not escape "&", so escape the value first
        if (isset($fieldnames[$key])) $xmlmember->addChild($fieldnames[$key], htmlspecialchars(trim($data)));
    }
} // foreach $members
$xml->asXML('mymembers.xml');
For reading and parsing the text-file, CSV-related functions could be a good alternative, as mentioned by other users.
To read big files you can use fgetcsv
If &&^ works as a delimiter for records in that file, you could start by replacing it with </member><member>. Prepend <member> to the whole file and append </member> at the end. You will then have something XML-like.
How to replace?
You might find Unix tools like sed useful.
sed 's/&&^/<\/member><member>/g' <input.txt >output.xml
You can also accomplish it with PHP, using str_replace():
foreach($textline as $showline){
echo str_replace( '&&^', '</member><member>', $showline ) . "<br>";
}
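Since the file is described as very large, it may be worth avoiding loading it all at once. Here is a rough sketch that walks the input line by line and writes the XML with XMLWriter, which also takes care of escaping characters like & in the values. recordsToXml() is a made-up name, the field order is taken from the format listed in the question, and the "extra" element for surplus lines is purely my assumption about how to handle inconsistent records.

```php
<?php
// Sketch: turn "&&^"-terminated records into XML, consuming one line at a time.
// recordsToXml() and the "extra" fallback element are assumptions for illustration.
function recordsToXml(iterable $lines): string
{
    $fieldnames = ['name', 'school', 'department', 'address', 'phone', 'email'];

    $writer = new XMLWriter();
    $writer->openMemory();
    $writer->startDocument('1.0', 'UTF-8');
    $writer->startElement('members');

    $fields = [];
    // Write the fields collected so far as one <member> element, then reset.
    $flush = function () use ($writer, $fieldnames, &$fields) {
        if ($fields === []) {
            return;
        }
        $writer->startElement('member');
        foreach ($fields as $i => $value) {
            // Lines beyond the six known fields go into <extra>.
            $writer->writeElement($fieldnames[$i] ?? 'extra', $value);
        }
        $writer->endElement();
        $fields = [];
    };

    foreach ($lines as $line) {
        $line = trim($line);
        if ($line === '') {
            continue; // ignore blank lines
        }
        if (strpos($line, '&&^') === 0) {
            $flush(); // end-of-record marker
            continue;
        }
        $fields[] = $line;
    }
    $flush(); // last record, in case the final marker is missing

    $writer->endElement();
    $writer->endDocument();
    return $writer->outputMemory();
}

// Made-up sample data.
echo recordsToXml(['Ann Lee', 'Example University', 'Math & Stats', '&&^', 'Bob Ray', '&&^']);
```

For the real file you could pass new SplFileObject('asrlist.txt') instead of an array, since it iterates over the file line by line.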
I'm currently working on a script to archive an imageboard.
I'm kinda stuck on making links reference correctly, so I could use some help.
I receive this string:
>>10028949<br><br>who that guy???
In said string, I need to alter this part:
<a href="10028949#p10028949"
to become this:
<a href="#p10028949"
using PHP.
This part may appear more than once in the string, or might not appear at all.
I'd really appreciate it if you had a code snippet I could use for this purpose.
Thanks in advance!
Kenny
Disclaimer: as will be said in the comments, using a DOM parser is the better way to parse HTML.
That being said:
"/(<a[^>]*?href=")\d+(#[^"]+")/"
replaced by $1$2
So...
$myString = preg_replace("/(<a[^>]*?href=\")\d+(#[^\"]+\")/", "$1$2", $myString);
try this
>>10028949<br><br>who that guy???
Although you have the question already answered I invite you to see what would (approximately xD) be the correct approach, parsing it with DOM:
$string = '<a href="10028949#p10028949">&gt;&gt;10028949</a><br><br>who that guy???';
$dom = new DOMDocument();
$dom->loadHTML($string);
$links = $dom->getElementsByTagName('a'); // This stores all the links in an array (actually a nodeList Object)
foreach($links as $link){
$href = $link->getAttribute('href'); //getting the href
$cut = strpos($href, '#');
$new_href = substr($href, $cut); //cutting the string by the #
$link->setAttribute('href', $new_href); //setting the good href
}
$body = $dom->getElementsByTagName('body')->item(0); //selecting everything
$output = $dom->saveHTML($body); //passing it into a string
echo $output;
The advantages of doing it this way are:
More organized / Cleaner
Easier to read by others
You could for example, have mixed links, and you only want to modify some of them. Using Dom you can actually select certain classes only
You can change other attributes as well, or the selected tag's siblings, parents, children, etc...
Of course you could achieve the last 2 points with regex as well but it would be a complete mess...
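To illustrate the point about touching only certain classes, here is a small sketch using DOMXPath. Note the assumptions: localizeLinks() is a made-up function name, the class name "quotelink" is only an example (the question does not say which class the links carry), and the input is assumed to be plain ASCII or already entity-encoded, since loadHTML() does not assume UTF-8 by default.

```php
<?php
// Sketch: strip everything before "#" in the href of links that carry a given class.
// localizeLinks() and the wrapper id are made up for this example.
function localizeLinks(string $html, string $class): string
{
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    // Wrap the fragment so that it can be serialized again without <html>/<body>.
    $dom->loadHTML('<div id="__wrap">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Match the class as a whole word inside the class attribute.
    // $class is assumed to be a trusted, simple name (no quotes).
    $query = sprintf('//a[contains(concat(" ", normalize-space(@class), " "), " %s ")]', $class);
    foreach ($xpath->query($query) as $link) {
        $href = $link->getAttribute('href');
        $cut = strpos($href, '#');
        if ($cut !== false) {
            $link->setAttribute('href', substr($href, $cut));
        }
    }

    // Serialize only the children of the wrapper.
    $out = '';
    $wrap = $xpath->query('//div[@id="__wrap"]')->item(0);
    foreach ($wrap->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$post = '<a href="10028949#p10028949" class="quotelink">&gt;&gt;10028949</a><br><br>who that guy???';
echo localizeLinks($post, 'quotelink'), "\n";
```

Links without the class are left alone, which is the part that gets messy quickly with a regex.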
I am new to PHP. As part of my course homework assignment, I am required to extract data from a website and render a table from that data.
P.S.: Using regex is not a good option, but we are not allowed to use any library like DOM, jQuery, etc.
Char set is UTF-8.
$searchURL = "http://www.allmusic.com/search/artists/the+beatles";
$html = file_get_contents($searchURL);
$patternform = '/<form(.*)<\/form>/sm';
preg_match_all($patternform ,$html,$matches);
Here the regex works fine, but when I apply the same regex to the table tag, it returns an empty array. Does it have something to do with whitespace in $html?
What is wrong here?
The following code produces a good result:
$searchURL = "http://www.allmusic.com/search/artists/the+beatles";
$html = file_get_contents($searchURL);
$patternform = '/(<table.*<\/table>)/sm';
preg_match_all($patternform ,$html,$matches);
echo $matches[0][0];
I want to parse data in between brackets. Below is the code where you can see what I'm doing. I would also like to avoid using XML.
$query = '
[page:1]
<html>
all the html
</html>
[/page:1]
[page:2]
<html>
all the html
</html>
[/page:2]
';
I want to create a loop script that will use regex to find all instances of [page:x]; which in the example above is 2. And then with a get function we can specify the page we want.
if(isset($_GET['page'])) {
$page = $_GET['page'];
$regex = '\\['page':(.*?)\\';
echo preg_match($regex, $query);
}
Any thoughts?
This should find all the matching blocks at once:
preg_match_all('/\[page:([0-9]+)\](.+?)\[\/page:\1\]/s', $query, $matches);
I strongly doubt regex is the most suitable solution for what you're trying to accomplish though.
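For what it's worth, here is how the pattern could be wrapped so that a page can be picked by number. This is only a sketch: extractPages() is a made-up helper, the sample text is shortened from the question, and in a real script the requested number would come from $_GET['page'] after casting it to int.

```php
<?php
// Sketch: collect every [page:N]...[/page:N] block into an array keyed by N.
// extractPages() is a hypothetical helper name.
function extractPages(string $text): array
{
    $pages = [];
    // \1 refers back to the number captured in the opening tag;
    // the "s" modifier lets "." match newlines as well.
    $pattern = '/\[page:([0-9]+)\](.*?)\[\/page:\1\]/s';
    if (preg_match_all($pattern, $text, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            $pages[(int) $m[1]] = trim($m[2]);
        }
    }
    return $pages;
}

$query = "[page:1]\n<html>\nfirst page\n</html>\n[/page:1]\n"
       . "[page:2]\n<html>\nsecond page\n</html>\n[/page:2]\n";

$pages = extractPages($query);
$requested = 2; // e.g. (int) ($_GET['page'] ?? 1) in a web script

echo count($pages), " pages found\n";
echo $pages[$requested] ?? 'No such page', "\n";
```

Keying the array by page number means the lookup for a requested page is a plain array access, with no second regex pass.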
What would be the best way to write code in PHP that searches a webpage for a number of words stored in a file? Is it best to store the source code in a file, or is there another way? Please help.
The best way is to use Google: site:example.com word1 OR word2 OR word3
Do you want to search in ONE PAGE, or in one website with MULTIPLE PAGES?
If it's only one page, I think you can store the HTML code in memory without problems.
If you know exactly what you are searching for, strpos for each word will probably be the fastest (stripos for case-insensitive matching). You can also define your own character class and use preg_match_all or something similar. Something like this will do:
<?php
$keywords = array("word1","word2","word3");
$doc = strip_tags(file_get_contents("http://www.example.com")); // remove tags to get only text
$doc = preg_replace('/\s+/', ' ',$doc); // collapse runs of whitespace
foreach($keywords as $word) {
    $pos = stripos($doc,$word);
    if($pos !== false) {
        $start = max(0, $pos - 20); // a negative start would count from the end of the string
        echo "match: ...".str_ireplace($word,"<em>$word</em>",substr($doc,$start,50))."... \n";
    }
}
?>
Something like the following, for example, will perform much faster with many keywords: the page is split into words once, the words go into an array keyed by word, and each keyword is then a hash lookup (O(1)) instead of a scan of the whole text.
<?php
setlocale(LC_ALL, "en_US.utf8");
$keywords = array("word1","word2","word3","word4");
$doc = file_get_contents("http://www.example.com");
$doc = preg_replace('!/\*.*?\*/!s', '', $doc); // remove /* ... */ comments
$doc = preg_replace('/<!--.*?-->/s', '', $doc); // remove HTML comments (non-greedy)
$doc = preg_replace('!<script.*?</script>!is', '', $doc);
$doc = preg_replace('!<style.*?</style>!is', '', $doc);
$doc = strip_tags($doc);
$doc = iconv('UTF-8', 'ASCII//TRANSLIT', $doc); // check if encoding is really utf8
$doc = strtolower($doc);
$doc = preg_replace('/[^0-9a-z\s]/','',$doc); // only after transliteration, or accented letters are lost
//$doc = preg_replace('{(.)\1+}','$1',$doc); remove duplicate chars ... possible step to add even more fuzziness
$doc = preg_split("/\s+/",trim($doc));
$index = array(); // word => position of its first occurrence
foreach($doc as $i => $w) {
    if(!isset($index[$w])) $index[$w] = $i;
}
foreach($keywords as $word) {
    $word = strtolower(iconv('UTF-8', 'ASCII//TRANSLIT', $word));
    if(isset($index[$word])) {
        $key = $index[$word];
        echo "match: ";
        // print the keyword and the five words that follow it
        for($i=$key;$i<=$key+5 && isset($doc[$i]);$i++) {
            echo $doc[$i]." ";
        }
        echo "\n";
    }
}
?>
This code is untested.
It would, however, be more elegant to dump the text nodes from a DOMDocument.
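A rough sketch of that text-node idea, using only the bundled DOM extension. pageWords() is a made-up name, and the word splitting assumes plain ASCII text; anything else would need the transliteration step first. The XPath expression skips text inside script and style elements, and comments are not text nodes, so they drop out on their own.

```php
<?php
// Sketch: collect the words of a page from its DOM text nodes.
// pageWords() is a hypothetical helper; it returns an array keyed by word.
function pageWords(string $html): array
{
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $words = [];
    // Every text node that is not inside <script> or <style>.
    $nodes = $xpath->query('//text()[not(ancestor::script) and not(ancestor::style)]');
    foreach ($nodes as $node) {
        $parts = preg_split('/[^0-9a-z]+/', strtolower($node->nodeValue), -1, PREG_SPLIT_NO_EMPTY);
        foreach ($parts as $w) {
            $words[$w] = true; // keyed by word, so lookups are hash lookups
        }
    }
    return $words;
}

$page = '<html><head><style>p{color:red}</style></head><body>'
      . '<p>Hello <b>World</b></p><script>var secret = 1;</script><!-- hidden -->'
      . '</body></html>';
$words = pageWords($page);
foreach (['hello', 'secret'] as $keyword) {
    echo $keyword, ': ', isset($words[$keyword]) ? 'found' : 'not found', "\n";
}
```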
Simple searching is easy. If you want to search a whole website, the crawling logic is the difficult part.
I once built a backlink checker for a company that worked like a crawler.
My first advice is not to use recursion (scanning a page, following all its links, then following all links on those pages, until you reach a certain level).
Rather, do it like this:
Run a for loop once for each level you want to crawl.
Set up a site array with one entry (the start page).
Pass the array to a function that downloads every link, scans the page there, and stores the links found on it in an array.
When it is done with all links, return the new link list array.
In the for loop, update the array with the return value of the function and call the function again.
This way you avoid following nasty paths and crawl the website level by level instead.
Also store already visited links in an array so you can skip them, don't follow external links, check for weird URL parameters, etc.
For future use you can store the documents in Lucene or Solr; there are classes to turn HTML pages into meaningful Lucene objects and search within them.
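The level-by-level loop could be sketched roughly like this. Everything here is illustrative: crawlLevels() is a made-up name, and the downloading/link extraction is passed in as a callable ($fetchLinks, hypothetical) so that the loop itself can be tried out against a fake site without touching the network.

```php
<?php
// Sketch: crawl level by level instead of recursively.
// $fetchLinks is a hypothetical callable: given a URL, it returns the URLs linked from that page.
function crawlLevels(string $start, int $maxDepth, callable $fetchLinks): array
{
    $host = parse_url($start, PHP_URL_HOST);
    $visited = [$start => 0]; // url => level at which it was first seen
    $current = [$start];      // pages to scan on this level

    for ($depth = 1; $depth <= $maxDepth && $current !== []; $depth++) {
        $next = [];
        foreach ($current as $url) {
            foreach ($fetchLinks($url) as $link) {
                if (isset($visited[$link])) {
                    continue; // already seen on this or an earlier level
                }
                if (parse_url($link, PHP_URL_HOST) !== $host) {
                    continue; // do not follow external links
                }
                $visited[$link] = $depth;
                $next[] = $link;
            }
        }
        $current = $next; // the links found now are scanned in the next round
    }
    return $visited;
}

// Fake site standing in for real downloads.
$site = [
    'http://example.com/'  => ['http://example.com/a', 'http://example.com/b', 'http://other.test/x'],
    'http://example.com/a' => ['http://example.com/', 'http://example.com/c'],
    'http://example.com/c' => ['http://example.com/d'],
];
$fetch = function (string $url) use ($site): array {
    return $site[$url] ?? [];
};
print_r(crawlLevels('http://example.com/', 2, $fetch));
```

In a real crawler the callable would download the page and extract its links; checks for odd URL parameters would go next to the external-link check.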