With the emergence of web 2.0 practices, new participatory approaches in GIS emerged for the last decade. An underlying challenge in crowdsourcing is the strategy used to allocate the volunteers in order to optimize a set of criteria, especially the quality of generated data. In collaboration with UNOSAT analysts, I quickly investigated some participatory methods to analyze satellite imagery for humanitarian purpose.

In this blog post I do not focus on the issues about division and distribution among volunteers of a large area to analyze. I assumed the task is achievable for one expert and investigated the characteristic of a parallel model of organization involving non-experts. What mechanism at the individual and collective levels will enforce the accuracy and consistency of results? How many volunteers are needed for a given level of quality? How to avoid to waste the time of volunteers while consolidating the quality?

For instance we would like to crowdsource the geo-referencement of all the buildings on a set of maps. How doing it?

One solution is to use a parallel way to organize people $n$ volunteers perform independently the same task generating $n$ individual solutions (set of annotations) and an aggregation function is used to generate a collective output. In a future post we will see that such organization could be much more optimal by changing it into a hybrid iterative model thanks to knowledge accumulation and the measure of the difficulty of the task via the computation of the entropy of the distrubution of the contributions).

## Aggregation methods

Related crowdsourcing techniques have already been studied to classify a predefined set of images or texts, asking the opinion of several labelers for each item. In this case, volunteers play the role of (noisy) classifiers. In our case, the map is a continuous space with no predefined distinct set of items to classify. So volunteers play the role of producers of data, extracting the potential regions of interest (ROI), as well as their classical role of classifiers filtering relevant ones.

Let’s first cluster spatially the annotations to find the potential regions of interest and then use a method to filter/classify them.

### Why traditional clustering algorithms don’t work

A common spatial clustering algorithm is the density-based spatial clustering for spatial noise (DBSCAN). It requires two parameters: the minimal reachability distance $eps$ and the minimum number of points $q$ required to form a cluster. The parameter $q$ could be seen as a decision threshold in a voting process to validate or reject a cluster: if more than $q$ annotations are enough close we can consider that $q$ volunteers voting for this ROI. Once the validated clusters defined, we compute their centroids to get locate the buildings.

The main problem with DBSCAN is the no respect of the democratic aspect. Firstly several close points made by the same user could validate the area. Secondly the DBSCAN performs worse with large amount of volunteers in dense environment since it is sensitive to the “chaining-effect”.

a | b | c |

d | e | f |

### Why a grid-based algorithm is not ideal

Another solution is to divide into a grid of smaller areas and aggregate the annotations on each cell. The problem of this solution is the potential unmatching with the topology of the area, generating quite low performance results.

### democratic clustering: the DCA approach

To tackle this issue, we need to change the notion of spatial neighborhood by injecting democratic constraints: a cluster should be formed of annotations enough close and from different users. Be $g$ a group of volunteers and $A^g$ the set of annotations from $g$, we noted $N^g(a)$ the *democratic neighborhood* of the annotation $a in A^g$, defined as $N^g(a)={a_1^*, a_2^*,…,a_n^*}$ with $a_i^*=underset{a_i in A^g_i}{operatorname{argmin}} (dist(a,a_i))$ the nearest neighbor of $a$ in $A^g_i$.

Using this democratic neighborhood, I developed a simple clustering algo called *DCA* (democratic clustering algorithm): step 1- first we take a random annotation $a in A^g$ and retrieve its $N^g(a)$. we filtered $N^g(a)$ according to a maximum distance $epsilon$ to get $N^g_{epsilon}(a)={a_i^*, d(a_i^*,a)< epsilon }$ to form a cluster. We store the cluster $N^g_{epsilon}(a)$ and remove $N^g_{epsilon}(a)$ from $A^g$. We reiterate to the step 1 until $A^g$ is empty. At the end we have a set of clusters of annotations or ROIs that need to be filtered to get only the ‘consensual’ ones via a voting process (cf. next section). Once the decision taken, the centroid of the accepted ROIs are computed to get the location of the related buildings.

(a) | (b) | (c) |

#### Estimation of the density

One parameter required by the DBSCAN or the DCA is $epsilon$. How to choose the optimal $epsilon^*$ to consider 2 points being enough different? A simple solution is to crowdsource an estimation of the density using the annotations from the volunteers. To simplify here we consider that the area is enough small to have a quite stable/homogeneous density and the surfaces (e.g. buildings) to annotate are regular. For each volunteer $i in g$, we compute a distance matrix betweens all the annotations to get the minimum distance $epsilon_i=min(d(a^i_k,a^i_l))$ with $a^i_k,a^i_l in A^i$. We then collectively take the mean or the median: $epsilon^*=median({epsilon_1, epsilon_2,… ,epsilon_g})$.

#### Voting process

Once a set of ROIs found, we need to accept them or not. Let be $R={r}$ the set of ROIs found and $C={0,1}$ the binary class about the presence and absence of a building. For each $r in R$ and for each volunteer $k in N$, if $k$ has at least one annotation in $r$ he/she labeled it as $l^k_r=1$ other she/he implicitly labeled $l^k_r=0$. For further simplification, let more denote such labeling by $l^k_{cr}=1$ if the volunteer $k$ has “labeled” the ROI $r$ with the class $c$ , $l^k_{cr}=0$ otherwise. The goal is thus to find a way to aggregate all the labels of the group for a given ROI to arrive to a single ‘consensus’ label=${0,1}$.

#### Democracy & Majority rule

A common method is to apply a Majority Vote (MV) wich can often achieve relatively good empirical results depending on the accuracy of the users. MV assigns the label with the most votes: $hat{l}(r) = argmax( N(c))$ as the estimation to the true label for $r$, where $N(c)$ denotes the number of time $r$ was labeled $c in C$ from all volunteers. A limitation of a simple MV is that the consensus label is estimated locally, considering only the labels assigned to that ROI (without taking into account the error rate of the volunteers).

#### Meritocracy supervised approach

But we may want to trust a particular user more than the others.How to estimate their respective error rates? and how to aggregate the results? Evaluating the quality of workers is a common issue. A first method is to use “gold” data. by using “trap questions” we can evaluate the performance of volunteers.

Let denote $pi^k$ the latent individual error rates matrix $C times C$ of the volunteer $k$ where the $ij^th$ element $pi^k_{ij}$ denotes the probability of volunteer $k$ to ‘label’ an ROI with the class $j$ given its true label is $i$. if we have the gold standard ,we can estimate quite easily $pi^k_{ij}$

$$pi^k_{ij}= frac{text{number of r that k labeled i when j was the true label}}{text{number of r with the true label j}}$$.

However in many cases the true responses are not given or too costly to get (e.g. during a crisis in Africa, we don’t have time to get the gold data on the specific area). A second solution is to use redundancy not only for identifying the correct answers to each task but also for measuring the labeling quality of the volunteers via a latent class model ala Dawid & Skene (Expectation Maximization algorithm). here the latent/hidden parameters are $pi^k_{ij}$ and the prior probabilities $p_i$ that are unknown and we need to estimate.

The EM algorithm is an efficient iterative procedure to compute the maximum- likelihood solution in presence of missing/hidden data. Each iteration of the EM algorithm consists of two steps: an Expectation(E)-step and a Maximization(M)- step. The M-step involves maximization of a lower bound on the log-likelihood that is refined in each iteration by the E-step. In our context we follow two steps: (1) estimates the correct label for each $r in R$, using labels assigned by multiple workers, accounting for the quality of each worker; and (2) estimates the quality of the volunteers by comparing the submitted labeling to the inferred correct answers. The final output is the set of (estimated) correct answers for each ROI and an error rates matrix for each volunteer.Initially we consider all the worker are good ones by setting $pi^k_{ij}=0.9$ for $i=j$ and $pi^k_{ij}=0.1/(|C|-1)$ for $ineq j$. we also assume an equal prior probability of each class i.e. $p_c=1/|C|$.

Step 1- we would like to estimate the unknown true label $l_r$ of the ROI $r$. For each category $i$ we can estimate its probability taking into account the estimated quality of volunteers.

$$pr(l_r=i|data)=p_i.prod_{k in K}prod_{j in C}pi^k_{ij} l^k_{jr} setminus sum_{c in C} p_c.prod_{k in K} prod_{j in C}pi^k_{ic} l^k_{cr}$$

Step 2-we can then re-estimate the prior probability $p_i$:

$$p_i=sum_{r in R}pr(l_r=i|data)/|R|$$

Step 3-we now can re-estimate the error_rate $pi^k_{ij}$ of each worker by taken as gold standard the collective output (we can also remove k from K and reavaluate step 1 to avoid his/her influence).

$$pi^k_{ij}=sum_{r in R} pr(l_r=i|data) l^k_{jr} setminus sum_{c in C}sum_{r in R} pr(l_r=i|data) l^k_{cr} $$

Step 4- reiterate to the step 1 until a convergence.

At this end we choose the estimate of the true label as the label with the higher probability as $hat{l}(r) = argmax ( pr(l_r=i|data))$. While initially quite unsupervised, this method also enables a semi-supervised approach. In fact it may be practical to obtain a limited amount of gold data from experts if there is sufficient benefit to the consensus accuracy. For instance the experts can start analyzing an area but not finishing it. In this case , for some $r$ we already know $pr(l_r=i|gold+data)$ and thus influence the estimation of the prior probability, the error rate of each worker and thus the general iterative computation.

### source code

you can find an under documented source code (in R langage)

(and thanks to the Mathjax javascript for the on the fly latex interpreter used in this post)