Arxivsorter

Jean-Philippe Magué & Brice Ménard

We present an algorithm aimed at ranking a given list of papers depending on their potential relevance to a given reader. The method is based on the connectivity of authors in the network of co-authorship in a field (similarly to a friends-of-friends technique). Given a set of favorite authors, it provides an estimate of the proximity of any author and uses the result to assign the proximity of a given paper.

The Arxivsorter algorithm is not a filter. It simply ranks a list of papers, without any loss of information. This method is can be applied to any publication database. We illustrate it here in the context of astro-ph papers from arxiv.org.





The algorithm

Author connectivity

We can characterize the existence and strength of the co-authorship between two authors ${i,j}$ by introducing the matrix of connectivity $C_{ij}$. The simplest definition of connectivity is to have $C_{ij}=1$ if authors $i$ and $j$ have at least one paper in common, and $C_{ij}=0$ otherwise. Weighted connectivities can be used in order to better reflect certain properties of co-authorships:
  • the strength of a collaboration between two authors $\{i,j\}$ is an increasing function of the number of papers they share,
  • for the vast majority of papers on astro-ph, the authorship ordering carries some information about the relative contribution of each author.
In order to include these properties, we propose to weight the connectivity of a pair of authors ${i,j}$ according to $$ C_{ij}=\sum_{papers}\frac{1}{r_i+r_j}\,, $$ where $r_i$ and $r_j$ represent are the ranks of authors $i$ and $j$ in a common paper. If two authors do not share any paper then $C_{ij}=0$. Other functional forms can be used but have a small impact on the final results as long as the weight decreases sufficiently rapidly as a function of $r_i$ and $r_j$.

Propagation

Having introduced the connectivity between pairs of authors, we can now consider the graph of co-authorship and characterize the properties of random walks starting from a given author. First, we assume that, starting from author $i$, the probability $P_{ij}$ to reach author $j$ within one step is proportional to their connectivity $C_{ij}$. We then have $$ \forall\, j,~ \mathrm{P}_{ij}=\frac{\mathrm{C}_{ij}}{\sum_{j} \mathrm{C}_{ij}} $$ where the normalisation ensures that the sum of the probabilities of moving from $i$ to all its neighbors $j$ is unity, i.e. $\sum_j \rm P_{ij}=1$. Note that this normalisation introduces a directionality in the graph, as $\rm P_{ij}\ne P_{ji}$. We now introduce the vector $\Delta$ representing a probability density in the space of authors. If we first consider a unique author $k$, we have $\Delta_i=1$ with $i=k$ and $\Delta_i=0$ otherwise. If we consider a list of authors $\Delta$ needs to be normalized such as $\sum \Delta_i =1$.  If $\Delta^{(t=0)}$ is the vector representing the initial author density, the probability to be on any author after one propagation in the network of co-authorship is simply given by $ \rm {\Delta^{(t=1)}}= \rm P\times \rm \Delta^{(t=0)}\,. $ This process can be reiterated in order to further propagate the random walks within the graph and reach authors who are not directly connected with those defined by the initial author density: $$ \mathrm{\Delta}^{(t=n)}= \mathrm{P_{}}\times \rm \mathrm{\Delta}^{(t=n-1)}\,. $$ The convergence properties of $\Delta^{(t)}$ provide us with useful information on the proximity between authors. First, when $n\rightarrow \infty$, the distribution $\Delta^{(t=n)}$ of probabilities among the authors no longer depends on the starting point $\Delta^{(t=0)}$. Considering an initial author state $\Delta^{(t=0)}$, and a given author $i$, two types of convergence can be observed as $n$ increases from 0 to $\infty$:
    1. $\Delta_i^{(t)}$ increases monotically to reach its limiting value. In this case, author $i$ does not belong to the neighborhood of authors for which $\Delta^{(0)}\ne 0$ and the proximity between $\Delta_i$ and the initial author list depends mostly on the overall structure of the network.
    2. $\Delta_i^{(t)}$ presents a peak after a few steps and then decreases monotically to reach its limiting value. In this case, author $i$ belongs to the close neighborhood of $\Delta^{(0)}$.
In order to take these properties into account, we define the proximity of author $i$ with respect to a set of authors $\Delta^{(0)}$ as $$ \mathrm{Prox}(\Delta^{(0)} \rightarrow i)=\frac{\Delta_i^{(t_{max})}}{t_{max}} $$ where $t_{max}$ is the time step for which $\Delta_i^{(t)}$ reaches its maximum value. This method allows us to estimate the proximity between a set of favorite authors and any author $i$ in a given field.

Paper proximity

We can now use the above author proximity to compute the proximity of a given paper. Motivated by our definition of connectivity, we propose to use $$ \mathrm{Paper~Proximity}= \sum_i \frac{\mathrm{Prox}(\Delta^{(0)} \rightarrow i)}{r_i+1} $$ where the sum is over all co-authors of the considered paper and $r_i$ is the rank (starting from 0) of author $i$ in the list of co-authors. The corresponding algorithm has been implemented and is available at http://arxivsorter.org.


Remarks

On arxiv.org each author is represented by an identifier of the form name_f where f is the first letter of the first name. Therefore two authors sharing these parameters will be considered as a single entity and therefore one node of the co-authorship network. As a result their connectivity $\mathrm{C}_{ij}$ becomes the sum of their individual connectivities. In practice, this affects mostly Chinese names.

The connectivity matrix $\mathrm{C}_{ij}$ encodes the properties of the co-authorship network. Considering papers on astro-ph, it turns out that the vast majority of authors are interconnected via other authors. They form the so-called giant cluster, for which it is possible to go from any author $i$ to $j$ using the random walks described above. However, about two percents of the community is not connected to the giant cluster and the proximity of these authors to a member of the giant cluster is by definition zero. Note that a future version of Arxivsorter will alleviate this limitation.


Jean-Philippe Magué (jean.philippe.mague [at] gmail.com)
Brice Ménard (menard [at] cita.utoronto.ca)

Special thanks to Mubdi Rahman for the web support.

Back to arXiv Sorter