Parsing vCards on web pages into a MySQL DB - php

I have a client who has a separate vCard on each of many separate pages, pasted into a WordPress text field. (Not the most efficient way to maintain a list of people, but I won't editorialize after the fact.) My mission is to write something that parses the addresses out of all the vCards and dumps the information into a central database. From there, each address could be geocoded into lat/lng coordinates via Google, and a front page could display a map with pins galore.
This page would show all the vcards from the rest of the pages of the site.
Oh, and here is a sanitized example of a vCard on the site; in reality it would be surrounded by a lot of dubious HTML code:
<div class="vcard">
<span class="fn org">XYZ Org Name</span><br />
<span class="url">http://www.someurl.com/</span>
<div class="adr"><span class="street-address">1234 Main Ave</span><br />
<span class="locality">Chicago</span><br />
<span class="region">IL</span><br /><span class="postal-code">60647</span></div>
</div>
Now, each page has one of these, and spidering through the entire site to collect them into an array is a bit out of my league. I can handle dumping them into a database using PHP and MySQL.
Any and all advice would be welcome!
EDIT: Not sure how important this is, but I am pulling the data from a different server.

I believe you are looking for HTML parsers. Here is an HTML parsing module for Python.
You need to parse the relevant data out of all the HTML files and then do whatever you want with it.
I have not tried any PHP HTML parsers, so I can't recommend one, but since you are working on a web server I'm hoping it has Perl? Take a look at Perl HTML parsers, e.g. HTML::Parser.
# this snippet (an HTML::Parser start handler) will get the
# opening tags of organization names
my @org_names;
sub start {
    my ($self, $tag, $attr, $attrseq, $origtext) = @_;
    if ($tag =~ /^span$/i && $attr->{'class'} =~ /^fn org$/i) {
        # see if we find <span class="fn org">
        push(@org_names, $origtext);
    }
}
Now you have an @org_names array that contains all the organization names.

Try the DOMDocument class' loadHTML method. Then you can use DOMDocument methods to select the nodes, attributes and values you want. Or if you're familiar with XPath, you can also instantiate a DOMXPath object to query against the loaded DOMDocument to select the desired data.
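Applied to the markup in the original question, that might look like the sketch below (the class names come from the example vCard; it assumes each page's HTML has already been fetched, e.g. via file_get_contents, and error handling is omitted):

```php
<?php
// Sketch: extract hCard fields from one page's HTML using DOMXPath.
// The class names match the example markup from the question.
function parse_vcards($html)
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // @ silences warnings from the "dubious" HTML
    $xpath = new DOMXPath($doc);

    $cards = array();
    foreach ($xpath->query("//div[contains(@class, 'vcard')]") as $card) {
        // helper: text of the first descendant carrying the given class
        $field = function ($class) use ($xpath, $card) {
            $n = $xpath->query(".//*[contains(@class, '$class')]", $card)->item(0);
            return $n ? trim($n->textContent) : null;
        };
        $cards[] = array(
            'org'    => $field('org'),
            'url'    => $field('url'),
            'street' => $field('street-address'),
            'city'   => $field('locality'),
            'region' => $field('region'),
            'zip'    => $field('postal-code'),
        );
    }
    return $cards; // rows ready to INSERT into MySQL (and geocode)
}
```

Each returned row maps one vCard to named columns, so inserting into the central table is a straightforward loop.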


How to traverse DOM (children/siblings) using watir-webdriver?

I'm used to using PHP's Simple HTML DOM Parser (SHDP) to access elements, but I'm using Ruby now with watir-webdriver, and I'm wondering if it can replace SHDP's functionality as far as accessing elements on pages goes.
So in SHDP I'd do this:
$ret = $html->find('div[id=foo]');
Which is an array of all instances of divs with id=foo. Oh, and $html is the HTML source of a specified URL. Anyway, so then I'd put it in a loop:
foreach($ret as $element)
echo $element->first_child ()->first_child ()->first_child ()->first_child ()->first_child ()->first_child ()->first_child ()->plaintext . '<br>';
Now, here, each ->first_child() is a child of the parent div with id=foo (notice I have seven) and then I print the plaintext of the 7th child. Something like this
<div id="foo">
<div ...>
<div ...>
<div ...>
<div ...>
<div ...>
<div ...>
<div ...>HAPPINESS</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
would get "HAPPINESS" printed. So, my question is: how can this be done using watir-webdriver (if at all possible)?
Also, and more generally, how can I get SHDP's DOM-traversing abilities (first_child, next_sibling, and so on) in watir-webdriver?
I ask because if watir-webdriver can't do this, I'm going to have to figure out a way to pipe source of a browser instance in watir-webdriver to a PHP script that uses SHDP and get it that way, and somehow get it back to ruby with the relevant information...
Watir implements an :index feature (zero-based):
browser.div(id: 'foo').divs # children
browser.div(id: 'foo').div(index: 6) # nth-child
browser.div(id: 'foo').parent # parent
browser.div(id: 'foo').div # first-child
browser.div(id: 'foo').div(index: -1) # last-child
next_sibling and previous_sibling are not currently implemented, please make a comment here if you think it is necessary for your code: https://github.com/watir/watir/pull/270
Note that in general you should prefer using indexes to using collections, but these also work:
browser.div(id: 'foo').divs.first
browser.div(id: 'foo').divs.last
Paperback code example (are you looking to select by text or obtain the text?):
browser.li(text: /Paperback/)
browser.td(class: "bucket").li
browser.table(id: 'productDetailsTable').li
We've also had requests in the past to support things like direct children instead of parsing all of the descendants: https://github.com/watir/watir/issues/329
We're actively working on how we want to improve things in the upcoming versions of Watir, so if this solution does not work for you, please post a suggestion with your ideal syntax for accomplishing what you want here: https://github.com/watir/watir/issues and we'll see how we can support it.
I don't believe there's a .child method to do this for you. If you know it will always be seven child divs in that structure, you could do the inelegant:
require 'watir-webdriver'
@browser = Watir::Browser.new
puts @browser.div(id: 'foo').div.div.div.div.div.div.div.text
You can always grab a collection of them and then address the last one, assuming it is the last one, the deepest in the stack.
puts @browser.div(id: 'foo').divs.last.text
That would also work, but assumes something absolute about the structure of the page. It's also not equivalent to the iteration of elements you've got above. As I'm not clear on the value of doing it that way I'm not comfortable taking a stab at equivalent code.
Maybe I am not giving you exactly what you were doing in PHP. However, if you know that the text of the 7th child will be HAPPINESS, then you could simply locate the element via XPath:
STEPS:
Given(/^I click the div "(.*?)" xpath$/) do |div_xpath|
  Watir::Wait.until { @browser.div(:xpath => div_xpath).exist? }
  @browser.div(:xpath => div_xpath).click
end
FEATURE:
Given I click the div "//div[@id='foo']//div[text()='HAPPINESS']" xpath

Finding part of the tag with simple html dom

Could you please help me?
I'm trying to scrape a website using the PHP Simple HTML DOM parser from here: http://simplehtmldom.sourceforge.net/
The problem is that the tags I need to identify have the same beginning but not the same ending.
For example this is the structure:
<div id="postmenu_2861574">
<div id="post_message_2861574"> one posted message </div>
</div>
<div id="postmenu_2861617">
<div id="post_message_2861617"> another posted message </div>
</div>
All the tags have IDs with the same beginning, "postmenu_" and "post_message_", but the endings differ.
Is it possible to gather all posts without knowing the tag endings?
Is there a way, like in SQL, to use a % sign at the end of the search phrase?
The simple way didn't work; it showed the variable $postmenu as empty.
foreach ($html->find('div#postmenu_') as $postmenu)
    $item['message'] = $postmenu->find('div#post_message_', 0)->plaintext;
thank you for the help
According to http://www.w3.org/TR/CSS2/selector.html what you are asking is not possible.
I would make all divs with post messages the same class, e.g. class="post_message".
Then you can find all divs with this class using:
foreach($html->find('div.post_message') ...
Since you are scraping a website, performance is probably not an issue. In this case you can simply find all divs and check each ID to see if it matches:
foreach ($html->find('div') as $div) {
    // retrieve the ID and check its prefix
    if (0 === strpos($div->id, 'post_message_')) {
        // ... handle the post message div
    }
}

i want to get data from another website and display it on mine but with my style.css

So my school has a very annoying way to view my timetable ("rooster"): you have to click through 5 links to get to it.
This is the link for my class (it updates weekly without changing the link):
https://webuntis.a12.nl/WebUntis/?school=roc%20a12#Timetable?type=1&departmentId=0&id=2147
I want to display the content from that page on my website, but with my own stylesheet.
i don't mean this:
<?php
$homepage = file_get_contents('http://www.example.com/');
echo $homepage;
?>
or an iframe....
I think this can be done better using jQuery and Ajax. You can get jQuery to load the target page, use selectors to strip out what you need, then attach it to your document tree. You should then be able to style it any way you like.
I would recommend using the cURL library: http://www.php.net/manual/en/curl.examples.php
But you will have to extract the part of the page you want to display, because you will get the whole HTML document.
You'd probably read the whole page into a string variable (using file_get_contents, as you mentioned) and parse the content. Here you have some possibilities:
Regular expressions
Walking the DOM tree (eg. using PHPs DOMDocument classes)
After that, you'd most likely replace all the style="..." or class="..." information with your own.
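A rough sketch of the DOM route (the 'content' container id below is a guess for illustration; inspect the real page for the actual selector):

```php
<?php
// Sketch: pull one container out of fetched HTML and strip its inline
// styling so your own stylesheet takes over. The 'content' id is a
// placeholder -- inspect the real page for the actual container id.
function extract_clean($html, $containerId = 'content')
{
    if (!is_string($html) || $html === '') {
        return '';
    }
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // silence warnings from messy markup
    $xpath = new DOMXPath($doc);
    $node = $xpath->query("//*[@id='$containerId']")->item(0);
    if ($node === null) {
        return '';
    }
    // drop the remote page's styling hooks
    foreach ($xpath->query('.//*[@style or @class]', $node) as $el) {
        $el->removeAttribute('style');
        $el->removeAttribute('class');
    }
    return $doc->saveHTML($node);
}

// Usage (fetch with cURL, since the page lives on another server):
// $ch = curl_init('https://webuntis.a12.nl/WebUntis/?school=roc%20a12');
// curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// echo extract_clean(curl_exec($ch));
// curl_close($ch);
```

With the style and class attributes gone, whatever rules your own stylesheet defines apply cleanly to the extracted fragment.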

Creating an inline 'Jargon' helper in PHP

I have an article formatted in HTML. It contains a whole lot of jargon words that perhaps some people wouldn't understand.
I also have a glossary of terms (MySQL table) with definitions which would be helpful to these people.
I want to go through the HTML of my article, find instances of these glossary terms, and replace them with some nice JavaScript which will show a tooltip with a definition for the term.
I've nearly done this, but I'm still having some problems:
terms are being found within words (i.e. APS is found inside Perhaps)
I have to make sure that it doesn't do this to alt attributes, titles, linked text, etc. So only text that doesn't have any formatting applied. BUT it needs to work in tables and paragraphs.
Here is the code I have:
$query_glossary = "SELECT word FROM glossary_terms WHERE status = 1 ORDER BY LENGTH(word) DESC";
$result_glossary = mysql_query_run($query_glossary);
// reset MySQL via seek so we don't have to run the query again
mysql_data_seek($result_glossary, 0);
while ($glossary = mysql_fetch_array($result_glossary)) {
    // once done we can replace the words with a nice tip
    $glossary_word = preg_quote($glossary['word'], '/');
    $article['content'] = preg_replace_callback('/[\s]('.$glossary_word.')[\s](.*?>)/i', 'article_checkOpenTag', $article['content'], 10);
}
And here is the PHP function:
function article_checkOpenTag($matches) {
    if (strpos($matches[0], '<') === false) {
        return $matches[0];
    }
    else {
        $query_term = "SELECT word,glossary_term_id,info FROM glossary_terms WHERE word = '".escape($matches[1])."'";
        $result_term = mysql_query_run($query_term);
        $term = mysql_fetch_array($result_term);
        # CREATING A RELEVANT LINK
        $glossary_id = $term['glossary_term_id'];
        $glossary_link = SITEURL.'/glossary/term/'.string_to_url($term['word']).'-'.$term['glossary_term_id'];
        # SOME DESCRIPTION STUFF FOR THE TOOLTIP
        if (strlen($term['info']) > 400) {
            $glossary_info = substr(strip_tags($term['info']), 0, 350).' ...<br /> Read More';
        }
        else {
            $glossary_info = $term['info'];
        }
        # BUILD THE TOOLTIP LINK (the JS tooltip call was mangled in the
        # original post; a plain title attribute is shown here instead)
        return ' <a href="'.$glossary_link.'" title="'.htmlspecialchars($glossary_info).'">'.$matches[1].'</a> '.$matches[2];
    }
}
Move the load from server to client. Assuming that your "dictionary of jargon" changes infrequently and that you want to add tooltips to words across a lot of articles, you can export it into a .js file and add a corresponding <script> entry to your pages, just a static file easily cacheable by a web browser.
Then write a client-side JS script that finds the DOM node containing the article content, parses out occurrences of the words from your dictionary, and wraps them in some HTML to show tooltips. Everything in JS, everything client-side.
If that method is not suitable and you're going to do the job in your PHP backend, at least consider caching the processed content.
I also see that you insert a description text for every jargon word found in the content. What if a word is very frequent across an article? You get overhead. Keep the descriptions separate and put them into JS as an object. The server-side task is then only to find the words which have a description and mark them with some short tag, for instance <em>. Your JS script should find those <em>s, pick the description from the object (an associative array of descriptions keyed by word) and construct a tooltip dynamically on the "mouse over" event.
Interestingly enough, I was not searching for a question like yours, but while reading it I realized it is one I went through quite some time ago.
It was basically a system to parse a dictionary and spit out augmented HTML.
My suggestion would instead include:
Use a database if you want, but a cached generated CSV file could be faster to use as the dictionary
Use a hook in your rendering system to parse the actual content against this dictionary
Caching of the page could be useful too
I elaborated a solution on my blog (in French, sorry for that), but it outlines something that you can actually use to do this.
I called it "ContentAbbrGenerator", a MODx plugin. But the core of the plugin can be applied outside of that structure.
Anyway, you can download the zip file, get the regexes, and find your way around it.
My objective
Use one file that is read to determine the kind of HTML decoration.
Generate HTML from author-entered content, where the author doesn't need to know about accessibility tags (dfn and/or abbr).
Make it re-usable.
Make it i18n-izable. That is, in French we use the English definition, but adaptive technology reads the English word as if it were French and it sounds weird. So we had to use the lang="" attribute to make the language explicit.
What I did
Basically, the text you give gets more semantic.
Imagine the following dictionary:
en;abbr;HTML;Hyper Text Markup Language;es
en;abbr;abbr;Abbreviation
Then, the content entered by the CMS could spit a text like this:
<p>Have you ever wanted to do not hassle with HTML abbr tags but was too lazy to hand-code them all!? That is my solution :)</p>
That gets translated into:
<p>Have you ever wanted to do not hassle with <abbr title="Hyper Text Markup Language" lang="es">HTML</abbr> <abbr title="Abbreviation">abbr</abbr> tags but was too lazy to hand-code them all!? That is my solution :)</p>
All depends from one CSV file that you can generate from your database.
The conventions I used
The file /abbreviations.txt is publicly available on the server (it could be generated); it is a dictionary with one definition per acronym
An implementation has only to read the file and apply it BEFORE sending it to the client
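A rough PHP sketch of such an implementation (the semicolon-delimited format is inferred from the example dictionary above; the function name is mine). Terms are swapped for placeholders first so a term like "abbr" doesn't match the tag names inserted for earlier terms:

```php
<?php
// Sketch: read the abbreviations.txt dictionary (inferred format:
// lang;tag;term;definition[;term-lang]) and wrap each known term
// in an <abbr> tag carrying its definition as the title.
function decorate_abbreviations($html, $dictionaryFile)
{
    $lines = file($dictionaryFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    $markup = array();
    foreach ($lines as $i => $line) {
        $parts = explode(';', $line);
        if (count($parts) < 4) {
            continue; // malformed entry
        }
        $term = $parts[2];
        $lang = !empty($parts[4]) ? ' lang="' . htmlspecialchars($parts[4]) . '"' : '';
        // placeholder pass keeps later terms from matching inserted markup
        $placeholder = "\x01$i\x01";
        $markup[$placeholder] = '<abbr title="' . htmlspecialchars($parts[3]) . '"'
            . $lang . '>' . $term . '</abbr>';
        // \b so we don't match inside longer words
        $html = preg_replace('/\b' . preg_quote($term, '/') . '\b/', $placeholder, $html);
    }
    return strtr($html, $markup);
}
```

Run on the CMS example above, this produces exactly the `<abbr title="…" lang="es">HTML</abbr>` markup shown.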
The tooltips
I strongly recommend using the tooltip tool that Twitter Bootstrap implements. It basically reads the title of any marked-up tag you want.
Have a look here: Bootstrap from Twitter with Tooltip helper.
PS: I'm very sold on the patterns Twitter put forward with this Bootstrap project; it's worth a look!

Integrating tumblr blog with website

I would like to integrate my Tumblr feed into my website. It seems that Tumblr has an API for this, but I'm not quite sure how to use it. From what I understand, I request the page, and Tumblr returns an XML file with the contents of my blog. But how do I then turn this XML into meaningful HTML? Must I parse it with PHP, turning the relevant tags into headers and so on? I tell myself it cannot be that painful. Anyone have any insights?
There's a javascript include that does this now, available from Tumblr (you have to login to see it): http://www.tumblr.com/developers
It winds up being something like this:
<script type="text/javascript" src="http://{username}.tumblr.com/js"></script>
You can use PHPTumblr, an API wrapper written in PHP which makes retrieving posts a breeze.
If you go to http://yourblog.tumblr.com/api/read where "yourblog" should be replaced with the name of your blog (be careful: if you host your Tumblr blog on a custom domain, like I do, use that), you'll see the XML version of your blog. It comes up really messy for me on Firefox for some reason, so I use Chrome; try a couple of different browsers, it helps to see the XML file well-formed, indented and such.
Once you're looking at the XML version of your blog, notice that each post has a bunch of data in an attribute="value" orientation. Here's an example from my blog:
<post id="11576453174" url="http://wamoyo.com/post/11576453174" url-with-slug="http://wamoyo.com/post/11576453174/100-year-old-marathoner-finishes-race" type="link" date-gmt="2011-10-17 18:01:27 GMT" date="Mon, 17 Oct 2011 14:01:27" unix-timestamp="1318874487" format="html" reblog-key="E2Eype7F" slug="100-year-old-marathoner-finishes-race" bookmarklet="true">
So, there's lots of ways to do this, I'll show you the one I used, and drop my code on the bottom of this post so you can just tailor that to your needs. Notice the type="link" part? Or the id="11576453174" ? These are the values you're going to use to pull data into your PHP script.
Here's the example:
<!-- The Latest Text Post -->
<?php
$request_url = "http://wamoyo.com/api/read?type=regular"; //get xml file
$xml = simplexml_load_file($request_url); //load it
$title = $xml->posts->post->{'regular-title'}; //load post title into $title
$post = $xml->posts->post->{'regular-body'}; //load post body into $post
$link = $xml->posts->post['url']; //load url of blog post into $link
$small_post = substr($post,0,350); //shorten post body to 350 characters
echo // spit that baby out with some stylish html
'<div class="panel" style="width:220px;margin:0 auto;text-align:left;">
<h1 class="med georgia bold italic black">'.$title.'</h1>'
. '<br />'
. '<span>'.$small_post.'</span>' . '...'
. '<br /><br /><div style="text-align:right;"><a class="bold italic blu georgia" href="'.$link.'">Read More...</a></div>
</div>
<img style="position:relative;top:-6px;" src="pic/shadow.png" alt="" />
';
?>
So, this is actually fairly simple. The PHP script here places data (like the post title and post text) from the XML file into PHP variables, and then echoes those variables out along with some HTML to create a div featuring a snippet from a blog post. This one features the most recent text post. Feel free to use it; just go in and change that first URL to your own blog, and then choose whatever values you want from your XML file.
For example let's say you want, not the most recent, but the second most recent "photo" post. You have to change the request_url to this:
$request_url = "http://wamoyo.com/api/read?type=photo&start=1";
Or let's say you want the most recent post with a specific tag
$request_url = "http://wamoyo.com/api/read?tagged=events";
Or let's say you want a specific post, just use the id
$request_url = "http://wamoyo.com/api/read?id=11576453174";
So all you have to do is tack on the ? with whatever parameter and use an & if you have multiple parameters.
If you want to do something fancier, you'll need the tumblr api docs here: http://www.tumblr.com/docs/en/api/v2
Hope this was helpful!
There are two main ways to do this. First, you can parse the XML, pulling out the content from the tags you need (there are a few ways to do this depending on whether you use a SAX or DOM parser). This is the quick and dirty solution.
You can also use an XSLT transformation to convert the XML source directly to the HTML you want. This is more involved, since you have to learn the syntax for XSLT templates, which is a bit verbose.
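A minimal sketch of the XSLT route with PHP's XSLTProcessor (requires the xsl extension). The element names follow the /tumblr/posts/post layout shown earlier in the thread; the stylesheet is a trimmed illustration, not a full template:

```php
<?php
// Sketch: transform the Tumblr XML feed into HTML with XSLTProcessor.
$xsl = <<<'XSL'
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/tumblr/posts">
    <div class="blog">
      <xsl:for-each select="post">
        <div class="post">
          <h2><xsl:value-of select="regular-title"/></h2>
          <p><xsl:value-of select="regular-body"/></p>
        </div>
      </xsl:for-each>
    </div>
  </xsl:template>
</xsl:stylesheet>
XSL;

$xslDoc = new DOMDocument();
$xslDoc->loadXML($xsl);

$xmlDoc = new DOMDocument();
// In production you would load the live feed instead:
// $xmlDoc->load('http://yourblog.tumblr.com/api/read');
$xmlDoc->loadXML('<tumblr version="1.0"><posts><post type="regular">'
    . '<regular-title>Hello</regular-title>'
    . '<regular-body>Body text</regular-body>'
    . '</post></posts></tumblr>');

$proc = new XSLTProcessor();
$proc->importStylesheet($xslDoc);
$html = $proc->transformToXML($xmlDoc); // the finished HTML fragment
echo $html;
```

The payoff is that the presentation lives entirely in the stylesheet, so restyling the feed never touches the PHP.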
