Wikipedia - Top100 Most Controversial Topics: Contributors Co-occurrence

The network of editors shared between the Top 100 most controversial pages of a Wikipedia language edition can be represented by an adjacency matrix, in which each cell ij represents an edge from vertex i to vertex j. The vertices are the Top 100 most controversial pages in that language edition, and each edge represents the similarity of the two pages in terms of the editors they share.
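As a rough illustration of the matrix described above, the following Python sketch builds an adjacency matrix from a mapping of page titles to editor sets. The function name, the sample pages and editors, and the choice of the raw count of shared editors as the similarity value are all assumptions made for the example; the source does not state how similarity is actually computed.

```python
def shared_editor_matrix(page_editors):
    """Return (titles, matrix); matrix[i][j] counts editors common to pages i and j.

    Assumption: similarity is the raw number of shared editors.
    The diagonal is set to 0 so a page has no edge to itself.
    """
    titles = sorted(page_editors)
    matrix = [
        [0 if a == b else len(page_editors[a] & page_editors[b]) for b in titles]
        for a in titles
    ]
    return titles, matrix


# Hypothetical pages and editors, for illustration only.
page_editors = {
    "Page A": {"alice", "bob", "carol"},
    "Page B": {"bob", "carol", "dave"},
    "Page C": {"erin"},
}

titles, matrix = shared_editor_matrix(page_editors)
for title, row in zip(titles, matrix):
    print(title, row)
```

Because sharing an editor is mutual, a matrix built this way is symmetric: cell ij equals cell ji.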

Clustering algorithm used: Optimal dendrogram ordering (an implementation by Renaud Blanch of the binary tree ordering described in [Bar-Joseph et al., 2003]). Reference: Ziv Bar-Joseph, Erik D. Demaine, David K. Gifford, Angèle M. Hamel, Tommy S. Jaakkola and Nathan Srebro, K-ary Clustering with Optimal Leaf Ordering for Gene Expression Data, Bioinformatics, 19(9), pp 1070-8, 2003, http://www.cs.cmu.edu/~zivbj/compBio/k-aryBio.pdf.
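To make the goal of optimal leaf ordering concrete, here is a small Python sketch. It is not the dynamic-programming algorithm of Bar-Joseph et al. and not Renaud Blanch's implementation. It is a brute-force search, feasible only for tiny trees, that tries every way of flipping the two children of each internal node of a binary dendrogram and keeps the leaf order with the smallest sum of distances between adjacent leaves. The tree encoding (nested tuples with integer leaves), the function names, and the sample distances are assumptions for illustration.

```python
def orderings(tree):
    """Yield every leaf order reachable by swapping children of internal nodes."""
    if not isinstance(tree, tuple):  # a leaf is a plain integer index
        yield [tree]
        return
    left, right = tree
    for l in orderings(left):
        for r in orderings(right):
            yield l + r  # children in their original order
            yield r + l  # children flipped


def path_cost(order, dist):
    """Sum of distances between leaves that end up next to each other."""
    return sum(dist[a][b] for a, b in zip(order, order[1:]))


def best_leaf_order(tree, dist):
    """Brute force: return the flip-consistent order with minimal adjacent cost."""
    return min(orderings(tree), key=lambda order: path_cost(order, dist))


# Hypothetical example: four items placed on a line, distance = absolute gap.
positions = [0, 1, 5, 2]
dist = [[abs(p - q) for q in positions] for p in positions]
tree = ((0, 1), (2, 3))  # dendrogram: cluster {0, 1} merged with cluster {2, 3}

best = best_leaf_order(tree, dist)
print(best, path_cost(best, dist))
```

The clusters themselves never change; only the left-to-right order of the leaves does. In the example, flipping the second cluster places item 3 next to item 1, which lowers the adjacent-distance cost from 8 to 5.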

Distance Measures:

Manhattan: is the distance between two points in a grid based on a strictly horizontal and/or vertical path (that is, along the grid lines), as opposed to the diagonal.

Euclidean: is the "ordinary" (i.e. straight-line) distance between two points in Euclidean space.

Chebyshev: is the distance between two vectors defined as the greatest of their differences along any coordinate dimension.

Hamming: is the number of positions at which two strings of equal length differ; equivalently, it is the minimum number of substitutions required to change one string into the other.

Jaccard: measures dissimilarity between sample sets and is obtained by subtracting the Jaccard coefficient from 1, or, equivalently, by dividing the difference of the sizes of the union and the intersection of two sets by the size of the union.

Bray-Curtis: quantifies the compositional dissimilarity between two different sites, based on counts at each site.
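The six measures above can be sketched in a few lines of plain Python. This is a minimal illustration of the definitions as given, not the code used to produce the visualization; the function names and sample inputs are assumptions. Vectors are equal-length sequences of numbers, the Hamming example uses strings, the Jaccard example uses sets, and the Bray-Curtis function assumes non-negative counts.

```python
import math


def manhattan(u, v):
    """Sum of absolute coordinate differences (path along the grid lines)."""
    return sum(abs(a - b) for a, b in zip(u, v))


def euclidean(u, v):
    """Straight-line distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def chebyshev(u, v):
    """Greatest absolute difference along any single coordinate."""
    return max(abs(a - b) for a, b in zip(u, v))


def hamming(s, t):
    """Number of positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(a != b for a, b in zip(s, t))


def jaccard(a, b):
    """(|union| - |intersection|) / |union|, i.e. 1 minus the Jaccard coefficient."""
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return (len(union) - len(a & b)) / len(union)


def bray_curtis(u, v):
    """Sum of absolute count differences divided by the sum of all counts."""
    total = sum(a + b for a, b in zip(u, v))
    if total == 0:
        return 0.0  # convention chosen here for two all-zero vectors
    return sum(abs(a - b) for a, b in zip(u, v)) / total


# Hypothetical sample inputs.
u, v = [3, 0, 2], [1, 4, 2]
print(manhattan(u, v), euclidean(u, v), chebyshev(u, v), bray_curtis(u, v))
print(hamming("karolin", "kathrin"))
print(jaccard({"a", "b", "c"}, {"b", "c", "d"}))
```

For the sample vectors the coordinate differences are 2, 4 and 0, so the Manhattan distance is 6, the Chebyshev distance is 4, and the Bray-Curtis dissimilarity is 6 / 12 = 0.5.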