Mike's Page

Informatics, Development, Cycling, Data, Travel ...


Makerere Advanced Programming Course 2014

A colleague at Makerere is taking over teaching the Advanced Programming course. It needs some updating.

I based my course roughly on what John taught before. The students’ skills were incredibly varied: many hadn’t programmed before, while others had day jobs coding all day.
The course had the following parts:
1) Python and OOP
2) Regex
3) The linux bash command line
4) The LAMP model, sqlite3, XSS, etc.
5) Web APIs
I’d really like to change the course, adding some of the following:

Definitely need to add collaborative use of code repositories, e.g. git. I’d like to make this a project in which students find GitHub projects they want to help with, and make pull requests that fix bugs or add improvements. Or maybe they could work together on a project with their fellow students.

Using AWS: The cloud is where things are these days, so students should learn to use the web interface and the API. Maybe we could approach AWS to see if they’ll provide free credits so the course can use their servers? They do have a free tier that students might be able to use.

Mobile development: something that’s now ubiquitous, so it really should be in a course at Makerere. Could make it one week.

Microprocessor development: the Arduino and ATtiny

Internet of things: already tried to address this by the API work

More on security: from Let’s Encrypt to pen-testing and firewalls (iptables). I didn’t know enough about these topics to go into them in much depth. And there’s already too much on security in the course.

Internet communication: packets etc?

Coding methodologies: e.g. pair programming
Visualisation on the web: d3?

GPU programming: also something I’ve limited experience of, but could be interesting?

You can find lots of the old course here:
http://www.michaeltsmith.org.uk/other/advancedprogramming/
Includes 2014 course work, the 2014 exam and the lecture slides. Note: Definitely need to update to python3 (from python2.x).
I used my own laptop to host quite a bit of this (including the bash learning), and used my mobile as a hotspot to let the students connect.

Workshop on Big and Open Data for International Development

Yesterday I took part in the department of Development Informatics big data event at the University of Manchester.

Really interesting discussion. Made me think more carefully about what effect the blind-spots in my data will have, and how collecting data can make these blind spots worse.

At the end of the day we had a bit of a discussion. Our group was particularly concerned by the effect on power shifts or concentrations with increased data aggregation (similar to Neil Lawrence’s Digital Oligarchy).

We started with the question “What will be done with the data?” and then “How does it become information?”

This led to the obvious point that it depends on who can use it, which then reinforces the power shift that we started from. The question “How does it become information?” leads to another: “How can using data actually foster development (and avoid inequality)?” We were also concerned that the transformation from data to information is often biased, or focused on inanimate or simple things (measuring the water pump rather than the people).

We finally looked at how to change or stop the shift or concentration in power. Two options presented themselves. The first was to stop using the data and halt the path to large-scale big-data analyses; this seems implausible, given the path we are on. The second was “Can [the power shift] be mitigated by giving everyone access?” In other words, will open data save us from the digital oligarchy?

This was again criticised; how can an illiterate farmer or boda-boda driver engage or use large data sets?

My own view is that we need layers of intermediary: from the machine learning/analysis experts who can combine and use the data, and visualise it in clear ways, to journalists and civil society who effectively ‘represent’ the citizen. Our concern is that the machine learning expert comes from a very particular part of society: usually white, highly educated, young and male. We can go some way to mitigate that by investing in and supporting MSc and PhD level education in developing countries. However, I’m aware the students at Makerere (for example) were not a ‘typical’ sample of the Ugandan population. Most of the population is rural, with a good proportion unable to read or write. I suspect that the Ugandan students will represent their countrymen and women little better than a muzungu. However, it is a start down the path towards some form of democratic or universal access to the power provided by machine learning and big data.

 

I presented my crash map project, and also submitted a paper on the topic.

Differential Privacy and Gaussian Processes

Regarding our new paper:  Differentially Private Gaussian Processes.

In layman’s terms, Gaussian Processes (GPs) are usually used to describe a dataset, allowing predictions to be made. E.g. given previous patient data, what is the best treatment a new patient should receive? It’s a nice framework as it incorporates assumptions clearly and, because of its probabilistic underpinnings, gives estimates of the confidence for a particular prediction.
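To make the "prediction plus confidence" idea concrete, here is a minimal GP regression sketch using only NumPy. It is an illustration rather than the code from the paper: the RBF kernel, the lengthscale, the noise variance and the toy sine data are all assumptions chosen for the example.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential (RBF) covariance between 1-D input arrays a and b."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale ** 2)

def gp_predict(x_train, y_train, x_test, noise_var=0.01):
    """Return the GP posterior mean and variance at x_test (zero prior mean)."""
    K = rbf_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test)
    K_ss = rbf_kernel(x_test, x_test)
    # Cholesky factorisation is a numerically stable way to apply K^-1.
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = np.diag(K_ss) - np.sum(v ** 2, axis=0)
    return mean, var

# Toy data (assumed for illustration): three noisy-free samples of a sine.
x_train = np.array([0.0, 1.0, 2.0])
y_train = np.sin(x_train)
x_test = np.array([1.0, 5.0])  # one point near the data, one far away

mean, var = gp_predict(x_train, y_train, x_test)
print("mean:", mean)
print("variance:", var)
```

The predicted variance is small next to the training data and grows back towards the prior variance far from it, which is the "estimate of confidence" referred to above.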

Comparison of new 'integral' kernel with a normal RBF kernel. Note the underestimate the RBF kernel suffers between 20 and 40 years.


Differential Privacy is a method that’s recently started to go mainstream. I’ve written a brief introductory presentation here. The general idea is to add noise to any query result to mask the value of an individual row in a database, but still allow inference to be done on the whole database.

My research areas cover both Gaussian Processes and Differential Privacy, so it seemed to make sense to see if I could apply one to the other. In our latest paper we look at two ways to do this:

  • Bin the data and add differential privacy noise to the bin totals. Then use a GP to do the inference.
  • Use the raw non-private data to train a GP, then perturb the GP with structured noise to achieve DP.

For the former I developed a new kernel (a way of describing how data is correlated or structured) for binned or ‘histogram’ data. See this ipython notebook for examples. Hopefully this is useful for many applications outside the DP field, for example any inference using binned datasets. At the moment I’ve only applied it to the RBF kernel.

For the latter I used the results of [1] to determine the noise required to be added to the mean function of my GP. I found that we could considerably reduce the noise scale by using inducing (pseudo) inputs.
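The noise step of the first (binning) approach can be sketched roughly as follows. This is a simplified illustration, not the paper's code: it assumes the bin totals are simple counts, that neighbouring databases differ by adding or removing one person (so the L1 sensitivity of the histogram is 1), and it uses the standard Laplace mechanism. The `ages` data and `epsilon` value are made up, and the GP inference on the noisy totals is not shown.

```python
import numpy as np

def private_histogram(values, bin_edges, epsilon, rng):
    """Bin the data, then add Laplace noise to each bin count.

    Assumes add/remove-one-row neighbours, so one person changes
    exactly one bin count by 1 (L1 sensitivity = 1).
    """
    counts, _ = np.histogram(values, bins=bin_edges)
    scale = 1.0 / epsilon  # Laplace scale = sensitivity / epsilon
    noisy = counts + rng.laplace(loc=0.0, scale=scale, size=counts.shape)
    return counts, noisy

rng = np.random.default_rng(0)
ages = rng.uniform(0, 60, size=500)   # hypothetical ages in years
edges = np.arange(0, 70, 10)          # six 10-year bins: 0-10, ..., 50-60

counts, noisy_counts = private_histogram(ages, edges, epsilon=1.0, rng=rng)
print("true counts: ", counts)
print("noisy counts:", np.round(noisy_counts, 1))
```

A smaller `epsilon` gives stronger privacy but noisier totals, which is why a model that can reason about uncertainty in the binned values is attractive for the inference step.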

Malawi child dataset. Raw data and 2d histogram.


Both methods can be further improved, but it’s still early days for Differential Privacy. We need to look at how to apply DP to as many methods as possible, and start to incorporate it into more domains. I’ll be looking at how to apply it to the MND database. Finally, we need an easy “introduction to DP for practitioners”, although I don’t know if the field is sufficiently mature for this yet.

[1] Hall, R., Rinaldo, A., & Wasserman, L. (2013). Differential privacy for functions and functional data. The Journal of Machine Learning Research, 14 (1), 703–727.

National Biodiversity Network Python Wrapper

A few months ago I took part in the NBN R hack-evenings at the University of Sheffield. Unlike everyone else in the room I was coding in python.

There was no Python wrapper for the API, so I created one. Feel free to clone and use it (pull requests are happily accepted)! Here is an example illustrating how to use the API:

import pynbn
c = pynbn.connect('lionfish_pynbn', 'password')

# get the tvk (the "taxon version key") for buff-tailed bumblebees
sp = c.get_tvk(query='Bombus terrestris')
keys = []
for res in sp['results']:
    k = res['ptaxonVersionKey']
    keys.append(str(k))
print("%d species match this query string" % len(keys))
print(keys)
# we usually take the first item from this list (advice from the NBN hackday)
tvk = keys[0]
print("We'll use the first key (%s)" % tvk)
# get observations
obs = c.get_observations(tvks=[tvk], start_year=1990, end_year=2010)
print("There are %d records for B. terrestris between 1990 and 2010" % len(obs))

I’ll shortly be adding this to PyPI, so it can be installed with:

pip install pynbn

Kampala Air Pollution

When I lived in Uganda, I used to cycle to work every day, and was aware of the pollution building as I cycled down the hill into the city centre.

Air pollution monitor and example output. The vertical lines mark when we lit a match in the same room as the sensor.


I wanted to measure this pollution so, using the (very cheap) Shinyei sensor and a mobile phone, I investigated what we could measure, and with what accuracy. I’ve not got very far yet, but this funding application gives an idea of what we’ve done and what we hope to achieve.

Air pollution monitoring in Kampala

Trilateration Part 2

Example of trilateration on the output areas of Sheffield


In the last post, we looked at methods to find the optimum landmark. In this post we look at how to find one’s location given a set of landmarks.

Previously I’ve naively found the probability of each location on a grid, given the reported distances to the landmarks, then sampled from this grid to find the probability for each output area.
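The grid-based approach described above can be sketched like this. It is a toy version, not the notebook's code: the landmark coordinates, the Gaussian error model on the reported distances, the value of `sigma`, the flat prior and the grid resolution are all assumptions for illustration, and the final step of aggregating grid cells into output areas is left out.

```python
import numpy as np

def location_posterior(landmarks, reported, sigma, grid_x, grid_y):
    """Posterior over grid cells given reported distances to landmarks.

    Assumes a flat prior over the grid and independent Gaussian errors
    (standard deviation sigma) on each reported distance.
    """
    xx, yy = np.meshgrid(grid_x, grid_y)
    log_post = np.zeros_like(xx)
    for (lx, ly), d in zip(landmarks, reported):
        true_dist = np.hypot(xx - lx, yy - ly)
        log_post += -0.5 * ((d - true_dist) / sigma) ** 2
    post = np.exp(log_post - log_post.max())  # subtract max for stability
    return post / post.sum()

# Hypothetical landmarks and a 'true' home location (arbitrary units).
landmarks = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
true_location = np.array([3.0, 4.0])
reported = np.hypot(*(landmarks - true_location).T)  # noise-free distances

grid = np.linspace(0.0, 10.0, 101)
post = location_posterior(landmarks, reported, sigma=0.5,
                          grid_x=grid, grid_y=grid)

# Rows of 'post' index y, columns index x (meshgrid's default layout).
iy, ix = np.unravel_index(np.argmax(post), post.shape)
best = (grid[ix], grid[iy])
print("most probable grid cell:", best)
```

Summing `post` over the cells that fall inside each output area would then give a probability per area, which is the quantity the earlier approach estimated by sampling.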

In this notebook we approach the problem differently, and look for the probability of the set of distances to the landmarks given the output area. By swapping the order, we are able to use the node in the Bayesian network.

Read more (ipynb)

Trilateration

Combining two estimated distances to find your location.


Trilateration is like triangulation, but uses the distances to landmarks, rather than their angles, to determine one’s location. GPS is probably the most common example of trilateration in use at the moment.

In our problem we have a set of landmarks. We know the distance (with some uncertainty) to one, and we want to know which of the remaining landmarks we should select next to maximise the amount of information we gain about our location.

For our particular example, we ask people to estimate the distance of various landmarks from their house.

We look at how to find a good landmark quickly, by using Bayes’ rule to rearrange the expression for the entropy in the probability distribution.

Read more.

Some tools for using MVPA

During my PhD I used the Multivariate Pattern Analysis Toolbox (www.pni.princeton.edu/mvpa), a Matlab-based toolbox to facilitate multi-voxel pattern analysis of neuroimaging data. I’ve made several alterations/additions to the tool, which others might find useful:

Data preprocessing

Multiclass SVM Classification

Just a note, as lots of people have asked:

The MVPA toolbox contains a two-class SVM classifier (train_svm). Ryan Mruczek wrote a wrapper for Chih-Chung Chang and Chih-Jen Lin’s SVM library, and posted it on the MVPA group. We’ve hosted the train and test files here for ease of use. To use it, you will also need to download libSVM from the download section of Chang and Lin’s SVM website.

Boltzmann Machine pattern classifier for fMRI using the MVPA toolbox

The MVPA toolbox contains several classifiers such as linear regression, support vector machines etc. To supplement these, we have worked on a generative classifier which uses Restricted Boltzmann Machines (RBMs). The general principle behind an individual RBM is that one alters the weights to make the activity of the visible nodes similar to one of the classes of training data. A set of RBMs, one trained per class, can then be used as a generative classifier by seeing which RBM has the lowest free energy when the test data is applied. The one with the lowest free energy has (probably) been trained on data most similar to the test data. It is based on work by Tanya Schmah, Geoffrey Hinton, et al.
The training and test scripts are available here:

train_rbm.m, test_rbm.m [to add, needs to be recovered from archive]
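As a rough illustration of the free-energy comparison described above (not the train_rbm.m / test_rbm.m scripts), here is a small Python sketch. It assumes binary visible and hidden units and uses the standard free-energy expression for such an RBM; the two sets of parameters are set by hand rather than trained, purely so the example is self-contained.

```python
import numpy as np

def free_energy(v, visible_bias, hidden_bias, weights):
    """Free energy of a binary RBM for visible vector v.

    F(v) = -a.v - sum_j log(1 + exp(b_j + W_j.v))
    where a = visible_bias, b = hidden_bias, W = weights (hidden x visible).
    """
    hidden_input = hidden_bias + weights @ v
    return -visible_bias @ v - np.sum(np.logaddexp(0.0, hidden_input))

def classify(v, rbms):
    """Return the label of the RBM with the lowest free energy for v."""
    energies = {label: free_energy(v, *params) for label, params in rbms.items()}
    return min(energies, key=energies.get), energies

# Hand-set (hypothetical) parameters standing in for two trained RBMs:
# "A" prefers patterns like [1, 1, 0, 0], "B" prefers [0, 0, 1, 1].
rbms = {
    "A": (np.array([2.0, 2.0, -2.0, -2.0]),
          np.array([0.0]),
          np.array([[1.0, 1.0, -1.0, -1.0]])),
    "B": (np.array([-2.0, -2.0, 2.0, 2.0]),
          np.array([0.0]),
          np.array([[-1.0, -1.0, 1.0, 1.0]])),
}

label, energies = classify(np.array([1.0, 1.0, 0.0, 0.0]), rbms)
print(label, energies)
```

With real fMRI data the visible units would more likely be real-valued, which changes the free-energy expression, so treat this only as a picture of the decision rule.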

Psychophysics

One of the great contributions of psychophysics to psychology is the notion of measuring threshold, i.e. the signal strength required for a criterion level of response by the observer. Watson and Pelli (1983) described a maximum likelihood procedure, which they called QUEST, for estimating threshold. The QUEST algorithm is an adaptive staircase method; you can think of it as a Bayesian toolbox for testing observers and estimating their thresholds. We have written Python implementations of the QUEST algorithm that you can download here:

files: quest.py, test_quest.py [to add, need to be recovered from archive]
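Until those files are recovered, the following sketch gives a flavour of the Bayesian idea behind this kind of procedure. It is not quest.py and not Watson and Pelli's exact method: the logistic psychometric function, its `slope`, `guess` and `lapse` parameters, the Gaussian prior and the three example trials are all assumptions made for illustration.

```python
import numpy as np

def p_correct(intensity, threshold, slope=0.5, guess=0.5, lapse=0.02):
    """Assumed logistic psychometric function: P(correct response)."""
    core = 1.0 / (1.0 + np.exp(-(intensity - threshold) / slope))
    return guess + (1.0 - guess - lapse) * core

def update_posterior(posterior, thresholds, intensity, correct):
    """Apply Bayes' rule for one trial and renormalise."""
    p = p_correct(intensity, thresholds)
    likelihood = p if correct else 1.0 - p
    posterior = posterior * likelihood
    return posterior / posterior.sum()

# Candidate thresholds (arbitrary log-intensity units) and a Gaussian prior.
thresholds = np.linspace(-3.0, 3.0, 121)
prior = np.exp(-0.5 * thresholds ** 2)
prior /= prior.sum()

# Three made-up trials: (stimulus intensity, was the response correct?)
posterior = prior
for intensity, correct in [(1.0, True), (0.5, True), (-0.5, False)]:
    posterior = update_posterior(posterior, thresholds, intensity, correct)

estimate = thresholds[np.argmax(posterior)]
print("current threshold estimate:", estimate)
```

In an adaptive procedure the next stimulus would be placed at (or near) the current estimate, so each trial is spent where it is most informative about the threshold.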

Other useful hints

Structural Equation Modelling. The link from the SPM site to the SEM extension is dead. Douglas Steele kindly linked me to this tutorial. But some of the links were dead and I had trouble getting the code to compile. So I’ve produced a step-by-step summary describing how to get it working.

Failing to install the online psychic

The online psychic is running happily on my local machine, but I needed to get it onto this webserver.

Unfortunately the server doesn’t have pandas, numpy, etc installed.

I tried downloading virtualenv, which when untarred and run generates its own module collection. But I found its version of pip didn’t work on the shared host, failing with “SystemError: Cannot compile ‘Python.h’.”

So the next option: Anaconda? (I’ve only 1 GB of space; turns out that’s not enough.)

Back to virtualenv:

Install my own version of python: https://my.justhost.com/cgi/help/python-install

Then following the instructions here: http://stackoverflow.com/questions/24748084/installing-numpy-without-sudo

Combined with the help here:

http://docs.python-guide.org/en/latest/dev/virtualenvs/

(download virtualenv here: https://pypi.python.org/pypi/virtualenv#downloads )

It still didn’t work – there’s a problem with the configuration of virtualenv’s python. It might be better to scrap virtualenv and download all the modules etc that I’ll need and compile them. The only advantage of virtualenv was that it would provide pip etc.

New python executable in venv/bin/python
ERROR: The executable venv/bin/python is not functioning
ERROR: It thinks sys.prefix is u'/home/.sites/81/site18/.users/89/mts-michael/python' (should be u'/home/.sites/81/site18/.users/89/mts-michael/venv')

Lessons in Matrix Factorisation, Bayes and Collaborative Filtering

These ‘lessons’ are really notes for myself, as I become familiar with python, pandas, matrix factorisation methods, etc. I’ll add a bit of an introduction at some point, and organise them nicely, but for now you can read them in my ipython folder.

Gibbs sampling a 2d multivariate gaussian



© 2019 Mike's Page
