Mike's Page

Informatics, Development, Cycling, Data, Travel ...

Fitting models with GPy: subtract the mean

By default it seems GPy doesn’t subtract the mean of the data prior to fitting.

So it’s worth including a mean function that does this:

 m = GPy.models.GPRegression(

Then one needs to fix this value:
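
As a rough illustration of why the mean matters, here is a plain-Python sketch. It does not use GPy's API, and the helper names `rbf`, `solve` and `gp_mean` are made up for this example. A zero-mean GP's prediction falls back to zero once you move away from the data, whereas centring the targets first makes it fall back to the data mean.

```python
import math

def rbf(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential (RBF) covariance between two scalar inputs."""
    return variance * math.exp(-0.5 * ((a - b) / lengthscale) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def gp_mean(X, y, x_star, noise=0.1, centre=False):
    """Posterior mean of a GP at x_star, optionally centring y first."""
    offset = sum(y) / len(y) if centre else 0.0
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(X)]
         for i, a in enumerate(X)]
    alpha = solve(K, [yi - offset for yi in y])
    return offset + sum(rbf(x_star, xi) * ai for xi, ai in zip(X, alpha))

X = [0.0, 1.0, 2.0]
y = [100.0, 101.0, 99.0]
print(gp_mean(X, y, 50.0))               # far from the data: close to 0
print(gp_mean(X, y, 50.0, centre=True))  # far from the data: close to 100
```

Near the training points both versions give sensible answers; the difference only shows up where the kernel has decayed and the prior mean takes over.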


Air Pollution: Kampala

A few notes from my visit to the city:

Tuesday: Arrived. A brief period of moderate panic on the plane when I thought I wouldn’t be let in without an electronic visa. But as of July 2017 people can still buy a single entry visa on entry. Had dinner down at Club 5. I think maybe it’s not as good as I remember!

Sensor on Boda boda

From left to right: Ssekanjako John, the bodaboda driver; me; Engineer Bainomugisha.

Wednesday: Engineer and Joel took me on a tour ’round Kampala to visit the sites where they’ve got air pollution monitors up. We first met the bodaboda driver who’s hosting one of the sensors on his motorbike. He’s had a bit of hassle from security asking what the box is, but he’s disguised it by painting it black and half hiding it under a shredded old bin bag!

Sensor on Jinja Road

The sensor on Jinja Road looks like it’ll be measuring quite a bit – it was surrounded by traffic regularly pumping out black smoke. I suspect that, of the pollution from vehicle emissions, the majority will be from a small proportion of vehicles…

A more sobering part of the tour was to the large dump, north of the Northern Bypass. There we saw hundreds of people (some with huts built in the dump itself) sorting through the rubbish looking for recyclables. I didn’t see much evidence of PPE.


Kampala’s main dump

The main source of particulate pollution here will probably be the dirt tracks but I suspect it will be quite low (there’s very little rubbish burning apparently, when we asked around). More concerning are gas and volatile organics. I imagine ground water is contaminated too.

Thursday: Block B was shut today as the government had rented it (I wonder who got the cash??!) to do interviews for parliamentary positions. Awkward as the lab with our equipment is in there. I got to hear a few presentations at the AI Lab though, and it was good to catch up with everyone.

I took a brief bit of time from working to visit the art gallery on campus. If anyone’s visiting Kampala and has a spare half-hour, I’d recommend it!

Friday: We got a monitor working on block B outside the lab’s window. It’s having trouble with its power supply, so it’s somewhat erratic at the moment. I got the website up and running.

For old times’ sake I went down to Mediterraneo for dinner. It still seems to be going strong, and has a nice vibe in the evening.

Next: Arusha!

Back at Makerere Guest House

A Marabou Stork (Image from wikimedia)

Back at Makerere working on the air pollution monitoring project with Engineer Bainomugisha.

One of my favourite things at Makerere is sitting at a table outside the guest house, with a cup of “African Spiced Tea”, watching the Marabou storks.


GPy Hack Day

We’ve just held the July GPy hack day. Key outcome: we’re going to be building the documentation in a brand new user-friendly way, based on sk-learn’s style, and using Tania’s new system for turning a bunch of notebooks into a website. Other notes from the meet. More on this soon…

Differential Privacy for GP available via pip

I’ve finally got differential privacy for Gaussian processes on pip.

 pip install dp4gp

Details and notes are in the development repo, and the paper it is based on is here. Since then I’ve introduced inducing inputs, which appear to massively improve the results (see also presentation). The figures below demonstrate the scale of the DP noise added without and with inducing inputs.

Standard cloaking method

Cloaking method with inducing inputs

Downloading whole dataset from the Thingspeak API

A quick post to link to a jupyter notebook demonstrating how to download a whole dataset. Here’s the code. It simply hops 7999 entries at a time, downloading all the records that fall between the two ends of each step.

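import json, requests
from datetime import datetime, timedelta, timezone
apiurl = 'http://thingspeak.com/channels/241694'
nextid = 1
result = None
alldata = []
endtime = None

while result != '-1':
    print(nextid)
    # Fetch the entry at the far end of this step. The loop relies on the
    # API returning '-1' once nextid is past the last entry.
    result = json.loads(requests.post(apiurl+'/feeds/entry/%d.json' % nextid).content)

    starttime = endtime
    if result == '-1':
        # Past the last entry: close the final window at the current time
        # (in UTC, to match the 'Z' timestamps).
        endtime = datetime.now(timezone.utc).replace(tzinfo=None)
    else:
        endtime = datetime.strptime(result['created_at'],'%Y-%m-%dT%H:%M:%SZ')
    if (nextid==1):
        # First pass only records where the dataset starts.
        starttime = endtime
    else:
        # Download every record between the two ends of this step.
        start = datetime.strftime(starttime,'%Y-%m-%dT%H:%M:%SZ')
        end = datetime.strftime(endtime-timedelta(seconds=1),'%Y-%m-%dT%H:%M:%SZ')
        data = json.loads(requests.post(apiurl+'/feeds.json?start=%s&end=%s' % (start,end)).content)
        alldata.extend(data['feeds'])
        print(nextid, len(data['feeds']))
    nextid += 7999 # thought download was 8000 fields, but it's 8000 records. 8000/len(result)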

NBN using python

I finally got around to putting pynbn on pip.

pip install pynbn




Makerere Advanced Programming Course 2014

A colleague at Makerere is taking over teaching the Advanced Programming course. It needs some updating.

I based my course roughly on what John taught before. The skills of the students were incredibly varied: many hadn’t programmed before, while others had day jobs coding all day.
The course had the following parts:
1) Python and OOP
2) Regex
3) The linux bash command line
4) The LAMP model, sqlite3, XSS, etc.
5) Web APIs
I’d really like to change the course, adding some of the following:

Definitely need to add using code repos collaboratively: E.g. git. I’d like to make this a project where students find github projects they want to help, and make pull requests to provide improvements to fix bugs, etc. Or maybe work together on a project with their fellow students.

Using AWS: The cloud is where things are these days, learning to use their interface and the API. Maybe could approach AWS to see if they’ll provide free credits to allow the course to use their servers? They do have a free tariff that students could use maybe?

Mobile development: something that’s now ubiquitous, so really should be in a course maybe at Makerere? Could make it one week.

Microprocessor development: the arduino and atTiny

Internet of things: already tried to address this by the API work

more on security: from letsencrypt to pen-testing and firewalls (ip-tables) – I didn’t know enough about these topics to go into them in too much depth. And there’s already too much on security in the course.

Internet communication: packets etc?

Coding methodologies: e.g. pair programming
Visualisation on the web: d3?

GPU programming: also something I’ve limited experience of, but could be interesting?

You can find lots of the old course here:
Includes 2014 course work, the 2014 exam and the lecture slides. Note: Definitely need to update to python3 (from python2.x).
I used my own laptop to host quite a bit of this (including the bash learning), and used my mobile as the hotspot to let the students connect.

Workshop on Big and Open Data for International Development

Yesterday I took part in the department of Development Informatics big data event at the University of Manchester.

Really interesting discussion. Made me think more carefully about what effect the blind-spots in my data will have, and how collecting data can make these blind spots worse.

At the end of the day we had a bit of a discussion. Our group was particularly concerned by the effect on power shifts or concentrations with increased data aggregation (similar to Neil Lawrence’s Digital Oligarchy).

We started with the question “What will be done with the data?” and then “How does it become information?”

This led to the obvious point that it depends on who can use it, which then reinforces the power-shift that we started from. The outcome of “How does it become information” leads to the question “How can using data actually foster development? (and avoid inequality)”. We were also concerned that the transformation from data to information is often biased, or focused on inanimate or simple things (measuring the water pump rather than the people).

We finally looked at how to change or stop the shift or concentration in power. Two options presented themselves. The first was to stop using the data, and halt the path to large scale big-data analyses. This seems implausible, given the path we are on. The second was “Can [the power shift] be mitigated by giving everyone access?”, in other words, will open data save us from the digital oligarchy?

This was again criticised; how can an illiterate farmer or boda-boda driver engage or use large data sets?

My own view is that we need layers of intermediary; from the machine learning/analysis experts who can combine and use the data, and visualise it in clear ways, to journalists and civil society who effectively ‘represent’ the citizen. Our concern is that the machine learning expert is a very particular part of society: usually white, highly educated, young and male. We can go some way to mitigate that by investing in and supporting MSc and PhD level education in developing countries… however, I’m aware the students at Makerere (for example) were not a ‘typical’ sample of the Ugandan population. Most of the population is rural, with a good proportion unable to read or write. I suspect that the Ugandan students will represent their countrymen and women little better than a muzungu. However, it is a start down the path, towards some form of democratic or universal access to the power provided by machine learning and big-data.


I presented my crash map project, and also submitted a paper on the topic.

Differential Privacy and Gaussian Processes

Regarding our new paper:  Differentially Private Gaussian Processes.

In layman’s terms, Gaussian Processes (GPs) are usually used to describe a dataset, so that predictions can be made. E.g. given previous patient data, what is the best treatment a new patient should receive? It’s a nice framework as it incorporates assumptions clearly and, because of its probabilistic underpinnings, gives an estimate of the confidence in each prediction.

Comparison of new ‘integral’ kernel with a normal RBF kernel. Note the underestimate the RBF kernel suffers between 20 and 40 years.

Differential Privacy is a method that’s recently started to go mainstream. I’ve written a brief introduction presentation here. The general idea is to add noise to any query result to mask the value of an individual row in a database, but still allow inference to be done on the whole database.

My research areas cover both Gaussian Processes and Differential Privacy, so it seemed to make sense to see if I could apply one to the other. In our latest paper we look at two ways to do this:

  • Bin the data and add differential privacy noise to the bin totals. Then use a GP to do the inference.
  • Use the raw non-private data to train a GP, then perturb the GP with structured noise to achieve DP.
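
A minimal sketch of the noise step in the first approach, under some loud assumptions: the bin totals are simple counts, each person falls in exactly one bin (so adding or removing one person changes one count by at most 1), and the noise comes from the standard Laplace mechanism. The paper's actual mechanism may differ, and the helper names (`histogram`, `laplace_noise`, `private_histogram`) are invented for this illustration. The noisy counts are what would then be handed to the GP.

```python
import random

def histogram(values, edges):
    """Count how many values fall in each half-open bin [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return counts

def laplace_noise(scale, rng):
    """Draw from Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_histogram(values, edges, epsilon, rng=None):
    """Add Laplace noise of scale sensitivity/epsilon to each bin count."""
    rng = rng or random.Random()
    sensitivity = 1.0  # assumption: one person changes one count by at most 1
    scale = sensitivity / epsilon
    return [c + laplace_noise(scale, rng) for c in histogram(values, edges)]

ages = [1, 5, 12, 25, 31]
edges = [0, 10, 20, 30, 40]
print(histogram(ages, edges))
print(private_histogram(ages, edges, epsilon=1.0, rng=random.Random(0)))
```

A smaller epsilon gives stronger privacy but a larger noise scale, so the downstream GP sees noisier bin totals.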

For the former I developed a new kernel (a way of describing how data is correlated or structured) for binned or ‘histogram’ data. See this ipython notebook for examples. This hopefully is useful for many applications (outside the DP field). For example any inference using binned datasets. At the moment I’ve only applied it to the RBF kernel.

For the latter I used the results of [1] to determine the noise required to be added to the mean function of my GP. I found that we could considerably reduce the noise scale by using inducing (pseudo) inputs.

Malawi child dataset. Raw data and 2d histogram.

Both methods can be further improved, but it’s still early days for Differential Privacy. We need to look at how to apply DP to as many methods as possible, and start to incorporate it into more domains. I’ll be looking at how to apply it to the MND database. Finally, we need an easy “introduction to DP for practitioners”. Although I don’t know if the field is sufficiently mature for this yet.

[1] Hall, R., Rinaldo, A., & Wasserman, L. (2013). Differential privacy for functions and functional data. The Journal of Machine Learning Research, 14 (1), 703–727.


© 2018 Mike's Page
