\documentclass[11pt,a4paper]{article}
\usepackage[top=1.25in,left=1.0in,right=1.0in,footskip=0.5in]{geometry}
\usepackage[usenames,dvipsnames]{xcolor}
\usepackage[slovene,english]{babel}
\usepackage[utf8]{inputenc}
\usepackage{fancyhdr}
\usepackage{graphicx}
\usepackage{hyperref}
\usepackage{listings}
\usepackage{siunitx}
\usepackage{subfig}
\DeclareGraphicsExtensions{.pdf,.eps,.png,.jpg}
\hypersetup{colorlinks=true, linkcolor=LimeGreen, citecolor=LimeGreen, urlcolor=cyan}
\lstset{basicstyle=\scriptsize, frame=single, tabsize=12, title=Example file {\bf\lstname}}
\pagestyle{fancy}
\fancyhf{}
\lhead{\footnotesize\bf Introduction to Network Analysis ({\color{magenta}INA})}
\rhead{\footnotesize\bf Homework {\color{magenta}\#3}}
\cfoot{\thepage}
\fancypagestyle{titlestyle}{
\rhead{\footnotesize\bf Spring 2017/18}
\cfoot{}}
\definecolor{titc}{RGB}{106,40,134}
\definecolor{dfnc}{RGB}{155,187,90}
\definecolor{dfac}{RGB}{80,131,189}
\newcommand{\dfn}[1]{{\it\color{dfnc}#1}}
\newcommand{\dfa}[1]{{\color{cyan}#1}}
\newcommand{\hid}[1]{{\color{gray}#1}}
\newcommand{\cmp}[1]{\mathcal{O}(#1)}
\newcommand{\prb}[1]{\mathrm{P}(#1)}
\newcommand{\avg}[1]{\langle#1\rangle}
\newcommand{\hint}[1]{{\it (#1)}}
\newcommand{\point}{({\color{magenta}$1$~point})}
\newcommand{\points}[1]{({\color{magenta}$#1$~points})}
\newcommand{\total}{({\color{magenta}1 point})}
\newcommand{\totals}[1]{({\color{magenta}#1 points})}
\newcommand{\figref}[1]{{\color{LimeGreen}Figure~\ref{fig:#1}}}
\newcommand{\tblref}[1]{{\color{LimeGreen}Table~\ref{tbl:#1}}}
\newcommand{\eqref}[1]{{\color{LimeGreen}Eq.~(\ref{eq:#1})}}
\newcommand{\gnm}{$G(n,m)$\xspace}
\newcommand{\gnp}{$G(n,p)$\xspace}
\newcommand{\gk}{$G(\{k\})$\xspace}
\setcounter{page}{0}
\begin{document}
\thispagestyle{titlestyle}
\vspace*{0.0in}
\begin{center}
{\huge\bf Homework {\color{magenta}\#3}}
\end{center}
\vspace*{0.025in}
% \section*{Homework details}
\paragraph{} % This homework is only partial and will be updated in the coming days.
This homework is complete and will not be changed. The homework does not require a lot of writing, but may require a lot of thinking. It does not require a lot of processing power, but may require efficient programming. It accounts for $12.5\%$ of the course grade. All questions and comments regarding the homework should be directed to \href{https://piazza.com}{Piazza}.
\section*{Submission details}
\paragraph{} This homework is due on {\bf\color{magenta} May 1st} at 3:00pm, while late days expire on {\bf\color{magenta} May 4th} at 3:00pm. The homework must be submitted as a hard-copy in the submission box in front of R 2.49 and also as an electronic version to \href{https://ucilnica.fri.uni-lj.si/course/view.php?id=183}{eUcilnica}. It can be prepared in either English or Slovene and either written by hand or typed on a computer. The hard-copy should include (1)~this cover sheet with filled out time of the submission and signed honor code, (2)~short answers to the questions, which can also demand proofs, tables, plots, diagrams and other, and (3)~a printout of all the code required to complete the exercises. The electronic submission should include only (1)~answers to the questions in a single file and (2)~all the code in a format of the specific programming language. Note that hard-copies will be graded, while electronic submissions will be used for plagiarism detection. The homework is considered submitted only when both versions have been submitted. Failing to include this honor code in the submission will result in {\bf\color{LimeGreen} 10\% deduction}. Failing to submit all the developed code to \href{https://ucilnica.fri.uni-lj.si/course/view.php?id=183}{eUcilnica} will result in {\bf\color{LimeGreen} 50\%~deduction}.
\section*{Honor code}
\paragraph{} The students are strongly encouraged to discuss the homework with other classmates and form study groups. Yet, each student must then solve the homework by herself or himself without the help of others and should be able to redo the homework at a later time. In other words, the students are encouraged to collaborate, but should not copy from one another. Referring to any solutions obtained from classmates, course books, previous years, found online or other, is considered an honor code violation. Also, stating any part of the solutions in class or on \href{https://piazza.com}{Piazza} is considered an honor code violation. Finally, failing to name the correct study group members, or filling out the wrong date or time of the submission, is also considered an honor code violation. Honor code violation will not be tolerated. Any student violating the honor code will be reported to {\bf\color{LimeGreen} faculty disciplinary committee} and vice dean for education.
\vspace*{0.033in}
\paragraph{Name \& SID:} \rule{4.5in}{0.5pt}
\paragraph{Study group:} \rule{4.5in}{0.5pt}
\paragraph{Date \& time:} \rule{2.5in}{0.5pt}
\paragraph{} I acknowledge and accept the honor code.
\paragraph{Signature:} \rule{2.5in}{0.5pt}
\pagebreak
\section{Graph Laplacian \totals{0.75}}
\paragraph{} Let $n$ be the number of nodes in an undirected network and let $m$ be the number of links. Graph Laplacian $L$ is $n\times n$ matrix defined as $L=D-A$, where $A$ is the network adjacency matrix and $D$ the diagonal matrix with node degrees $\{k_i\}$ along its diagonal. Link incidence matrix $B$ is $m\times n$ matrix defined as $B_{ij}=1$ if $j$ is the first endpoint of $i$-th link, $B_{ij}=-1$ if $j$ is the second endpoint of $i$-th link, and $B_{ij}=0$ otherwise. \hint{Arbitrarily designate one endpoint to be the first one and the other to be the second one.} First show that $L=B^TB$. Using this equality further show that all eigenvalues of $L$ are non-negative and that vector of all ones is an eigenvector of $L$. These results prove useful in spectral community detection~\cite{Fie73,New10}.
\subsection*{What to submit?}
\paragraph{} Show that the equality holds \points{0.25}. Give proof of the non-negativity of the eigenvalues of $L$ \points{0.25} and show that vector of all ones is an eigenvector of $L$ \points{0.25}.
\section{Ring graph modularity \total}
\paragraph{} Imagine a graph with $n$ nodes positioned on a ring thus each node is linked to its two neighbors (see~\figref{ring}). Let the graph be partitioned into $c$ consecutive clusters with $n_c=n/c$ nodes each. Compute modularity $Q$~\cite{GN02} of such partition and express it in terms of $n_c$ and~$n$. \hint{See lecture handouts for the definition of modularity.} Find the size of clusters $n_c$ that optimizes modularity $Q$ and express it in terms of $n$.
\begin{figure}[h] \centering
\includegraphics[width=0.275\textwidth]{ring}
\caption{{\bf Ring graph with $n=36$, $n_c=4$ and $Q=0.64$}}
\label{fig:ring}
\end{figure}
\subsection*{What to submit?}
\paragraph{} Derive the expression for modularity $Q$ of a ring graph \points{0.5} and the optimal size of clusters $n_c$ according to modularity $Q$ \points{0.5}.
\section{Who's the winner? \totals{4.5}}
\paragraph{} Community detection is one of the most popular research areas of network science~\cite{New12}. Indeed, literary hundreds of community detection algorithms have been proposed in the literature in the last two decades~\cite{For10,FH16}. These include hierarchical clustering, spectral methods (e.g.\ \href{http://www.cs.utexas.edu/users/dml/Software/graclus.html}{Graclus}), modularity optimization (e.g.\ \href{https://sites.google.com/site/findcommunities/}{Louvain}), map equation algorithms (e.g.\ \href{http://mapequation.org/code.html}{Infomap}), statistical methods (e.g.\ \href{http://www.oslom.org}{OSLOM}), link clustering (e.g.\ \href{https://github.com/bagrow/linkcomm}{Links}), label propagation (e.g.\ \href{http://gregory.org/research/networks/software/copra.html}{COPRA}), random walks (e.g.\ \href{https://www-complexnetworks.lip6.fr/~latapy/PP/walktrap.html}{Walktrap}), clique percolation (e.g.\ \href{http://complex.cs.aalto.fi/resources/software/}{SCP}) and many others (e.g.\ \href{http://www.michelecoscia.com/?page_id=42}{DEMON}). Your task is to compare the accuracy, robustness and uncertainty of selected three algorithms. These should include at least \href{https://sites.google.com/site/findcommunities/}{Louvain} and \href{http://mapequation.org/code.html}{Infomap}, and also an algorithm of your own choice which should not be from the same class of methods as the mentioned two. \hint{If you are unable to compile any of the required algorithm implementations on your machine, and there is no equivalent implementation within your programming library, you should write to \href{https://piazza.com}{Piazza} and ask for an appropriate alternative. Code required to solve this exercise will likely consist of several ad hoc scripts that will have to be run sequentially.}
\begin{figure}[t] \centering
\subfloat[{\bf Girvan-Newman} benchmark graphs]{\includegraphics[width=0.425\textwidth]{GN}\label{fig:GN}}
\subfloat[{\bf Lancichinetti} benchmark graphs]{\includegraphics[width=0.475\textwidth]{LFR}\label{fig:LFR}}
\caption{{\bf Synthetic benchmark graphs with planted partition for $\mu=0.1$}}
\label{fig:benchmarks}
\end{figure}
\begin{description}
\item[(i)] Implement a variant of Girvan-Newman synthetic benchmark graphs with planted partition~\cite{GN02}. The graphs consist of three groups of $24$ nodes each, while the expected degree of each node is $20$ (see~\figref{GN}). The group structure is controlled by a mixing parameter $\mu$. For $\mu=0$, all links are placed within the groups, while for $\mu=1$, all links are placed between the groups. Apply all three community detection algorithms to $25$ benchmark graph realizations with $\mu$ equal to $0$, $0.1$, $0.2$, $0.3$, $0.4$ and $0.5$. For each algorithm and each value of $\mu$, compute normalized mutual information between the planted partitions and the detected community structures, and average the results. \hint{See lecture handouts for the definition of normalized mutual information.} Plot community detection accuracy of all three algorithms on a single plot with $\mu$ on the horizontal axis and the average normalized mutual information on the vertical axis. Which algorithm comes out on top? Briefly discuss the results by comparing the performance of different algorithms.
\item[(ii)] Consider a more realistic \href{http://lovro.lpt.fri.uni-lj.si/ina/LFR.zip}{Lancichinetti synthetic benchmark graphs} with planted partition \cite{LFR08}. \hint{See lecture handouts for the description of benchmark graphs.} The graphs consist of $1000$ nodes (see~\figref{LFR}), while the group structure is again controlled by a mixing parameter $\mu$. Apply all three community detection algorithms to $25$ benchmark graph realizations with $\mu$ equal to $0$, $0.2$, $0.4$, $0.6$ and $0.8$. Plot community detection accuracy of all three algorithms on a single plot with $\mu$ on the horizontal axis and the average normalized mutual information on the vertical axis. Which algorithm comes out on top now? Why? Briefly discuss the results by comparing the performance of different algorithms.
\item[(iii)] Consider an Erd\H{o}s-R\'{e}nyi random graph~\cite{ER59} that lacks community structure. Community detection algorithms should be robust enough to detect this and output each connected component of the graph as a single community. Apply all three community detection algorithms to $25$ random graph realizations with $1000$ nodes and the average degree equal to $8$, $16$, $24$, $32$ and $40$. Plot community detection robustness of all three algorithms on a single plot with the average degree on the horizontal axis and the average normalized variation of information on the vertical axis. \hint{See lecture handouts for the definition of normalized variation of information.} Which algorithms appear robust to random structure? Why? Briefly discuss the results by comparing the robustness of different algorithms.
\item[(iv)] Consider \href{http://lovro.lpt.fri.uni-lj.si/ina/dolphins}{Lusseau bottlenose dolphins network}~\cite{LSBHSD03} with a known sociological division into two groups. Apply each community detection algorithm $25$ times and analyze community detection uncertainty. More precisely, compute pair-wise normalized variation of information between the detected community structures and average the results. State the average normalized variation of information for all three algorithms and briefly discuss the results. Which algorithms appear most deterministic?
\item[(v)] Given all the knowledge gained above, which algorithm would you choose for your course project if needed? State the weaknesses of each algorithm and finally select the winner.
\end{description}
\subsection*{What to submit?}
\begin{description}
\item[(i)] Give a printout of the benchmark graph implementation \points{0.25}. Plot community detection accuracy of all three algorithms \points{3\times 0.25}. Give an answer to the question and briefly comment on the results \points{0.25}.
\item[(ii)] Plot community detection accuracy of all three algorithms \points{3\times 0.25}. Give answers to both questions and briefly comment on the results \points{0.25}.
\item[(iii)] Plot community detection robustness of all three algorithms \points{3\times 0.25}. Give answers to both questions and briefly comment on the results \points{0.25}.
\item[(iv)] State community detection uncertainty of all three algorithms \points{3\times 0.25}. Give an answer to the question and briefly comment on the results \points{0.25}.
\item[(v)] State the weaknesses of each algorithm and give a brief answer to the question \points{0.25}.
\end{description}
\section{Peers, ties and the Internet \totals{3.25}}
\paragraph{} Link prediction is probably the most common application of network analysis techniques. For given nodes $i$ and $j$, link prediction methods compute an index $s_{ij}$ that should be high for $i$ and $j$ that are likely to connect, and low for all other pairs of $i$ and $j$. You will be investigating three link prediction methods that are based on different structural properties of real networks.
\begin{enumerate}
\item Scale-free degree distribution of real networks is believed to be the consequence of preferential attachment~\cite{BA99} that states that nodes are more likely to connect to high degree nodes. The preferential attachment index~\cite{LK07} is thus defined as $s_{ij}=k_ik_j$, where $k_i$ is the degree of node $i$.
\item Small-world networks are characterized by an abundance of triangles~\cite{WS98}, which can be explained by triadic closure. Hence, nodes are more likely to connect if they share many common neighbors. The Adamic-Adar index~\cite{AA03} takes into account also that it is more likely to share a high degree neighbor. It is defined as $s_{ij}=\sum_{x\in \Gamma_i\cap\Gamma_j}\frac{1}{\log k_x}$, where $\Gamma_i$ is the set of neighbors of node $i$.
\item Many real networks consist of communities of densely linked nodes with only few links between the communities~\cite{GN02}. Links are thus more likely to appear within communities than between. Let $\{C\}$ be the communities revealed by \href{https://sites.google.com/site/findcommunities/}{Louvain modularity optimization}~\cite{BGLL08} and let $c_i$ be the community label of node $i$. Furthermore, let $n_c$ and $m_c$ be the number of nodes and links within the community $C$. Then, the community index is defined as $s_{ij}=m_c/{n_c \choose 2}$ for $c_i=c_j$, whereas $s_{ij}=0$ for $c_i\neq c_j$. \hint{If you are unable to compile the required algorithm implementation on your machine, and there is no equivalent implementation within your programming library, you should write to \href{https://piazza.com}{Piazza} and ask for an appropriate alternative.}
\end{enumerate}
\begin{figure}[t] \centering
\includegraphics[width=0.66\textwidth]{circles}
\caption{{\bf Communities in Facebook social circles revealed with Louvain method}}
\label{fig:circles}
\end{figure}
\begin{description}
\item[(x)] Assume that you apply a link prediction method to all unlinked pairs of nodes in a real network and later evaluate between which pairs of nodes the links actually occurred. Considering the density of real networks, what would be the expected classification accuracy of a method that simply predicts that no links will occur?
\item[(y)] Implement the following framework for evaluating link prediction methods. For a given network and link prediction index $s$, randomly sample $\frac{m}{10}$ pairs of nodes that are not yet linked and store them into $L_N$. These will serve as negative examples for the prediction. Next, randomly remove $\frac{m}{10}$ links from the network and store them into $L_P$. These will serve as positive examples for the prediction. Finally, compute the link prediction index $s$ for all pairs of nodes in $L_N\cup L_P$. Link prediction is usually evaluated using area under the ROC curve (AUC), which can be defined as the probability that a randomly chosen pair of nodes in $L_P$ has higher value of $s$ than a randomly chosen pair of nodes in $L_N$. Note that random guessing gives $\frac{1}{2}$. To compute AUC, randomly sample $\frac{m}{10}$ pairs of nodes from $L_P$ and $\frac{m}{10}$ pairs from $L_N$ with repetitions, and compare their indices $s$. Let $m'$ be the number of times when the value of $s$ for the pair of nodes from $L_P$ is larger than the value for the pair of nodes form $L_N$, and let $m''$ be the number of times when both values are equal. Then, AUC $=\frac{m'+m''/2}{m/10}$.
\item[(z)] Compute the average AUC over several runs for all three link prediction methods above applied to an Erd\H{o}s-R\'{e}nyi random graph~\cite{ER59} with $25000$ nodes and the average degree $10$, and to three real networks. These are \href{http://lovro.lpt.fri.uni-lj.si/ina/gnutella}{Gnutella peer-to-peer file sharing network}, a small sample of \href{http://lovro.lpt.fri.uni-lj.si/ina/circles}{Facebook social circles network} (see~\figref{circles}) and \href{http://lovro.lpt.fri.uni-lj.si/ina/nec}{nec overlay map} of the Internet. \hint{Although some networks are directed, treat them as undirected.} Which method comes out on top for each individual network? Why? Briefly discuss the results and compare the performance of methods on random graphs and real networks.
\end{description}
\subsection*{What to submit?}
\begin{description}
\item[(x)] Give a brief answer to the question \points{0.25}.
\item[(y)] Give a printout of the framework implementation \points{0.75}.
\item[(z)] For each link prediction method, state the average AUC obtained for random graphs and real networks \points{3\times 0.5}. Answer both questions for each individual network and briefly comment on the results \points{3\times 0.25}.
\end{description}
\section{Get at least 70\% right! \totals{1.5}}
\paragraph{} You are given a \href{http://lovro.lpt.fri.uni-lj.si/ina/aps_2008_2013}{citation network} of scientific papers published by the American Physical Society between the years $2008$ and $2013$. The papers were published in ten journals, which represent the metadata information you would like to infer from the structure of citation network. More precisely, you would like to predict the correct journal of all papers published in the year $2013$ based on their citation patterns and the journal information of all papers published between the years $2008$ and $2012$. Predicting the paper's journal to be the most frequent journal in the neighborhood of the corresponding node gives $\approx 65\%$ classification accuracy, while your task is to propose a strategy that gives at least $\approx 70\%$ classification accuracy. The strategy can use any network analysis technique as long as it scales better than $\mathcal{O}(n^2)$. \hint{To get full credit your strategy should be at least slightly different from the methods seen in lectures!}
\subsection*{What to submit?}
\paragraph{} Describe your strategy and briefly explain its rationale \points{0.25}. State the average classification accuracy obtained over several runs and compare results with the baseline \points{0.75}. Print out any code you might have used or describe how you solved the exercise \points{0.5}.
\bibliographystyle{alpha}
\bibliography{biblio}
\end{document}