\documentclass[11pt,a4paper]{article}
\usepackage[top=1.25in,left=1.0in,right=1.0in,footskip=0.5in]{geometry}
\usepackage[usenames,dvipsnames]{xcolor}
\usepackage[slovene,english]{babel}
\usepackage[utf8]{inputenc}
\usepackage{fancyhdr}
\usepackage{graphicx}
\usepackage{hyperref}
\usepackage{listings}
\usepackage{siunitx}
\usepackage{subfig}
\DeclareGraphicsExtensions{.pdf,.eps,.png,.jpg}
\hypersetup{colorlinks=true, linkcolor=LimeGreen, citecolor=LimeGreen, urlcolor=cyan}
\lstset{basicstyle=\scriptsize, frame=single, tabsize=12, title=Example file {\bf\lstname}}
\pagestyle{fancy}
\fancyhf{}
\lhead{\footnotesize\bf Introduction to Network Analysis ({\color{magenta}INA})}
\rhead{\footnotesize\bf Homework {\color{magenta}\#1}}
\cfoot{\thepage}
\fancypagestyle{titlestyle}{
\rhead{\footnotesize\bf Spring 2016/17}
\cfoot{}}
\definecolor{titc}{RGB}{106,40,134}
\definecolor{dfnc}{RGB}{155,187,90}
\definecolor{dfac}{RGB}{80,131,189}
\newcommand{\dfn}[1]{{\it\color{dfnc}#1}}
\newcommand{\dfa}[1]{{\color{cyan}#1}}
\newcommand{\hid}[1]{{\color{gray}#1}}
\newcommand{\cmp}[1]{\mathcal{O}(#1)}
\newcommand{\prb}[1]{\mathrm{P}(#1)}
\newcommand{\avg}[1]{\langle#1\rangle}
\newcommand{\hint}[1]{{\it (#1)}}
\newcommand{\point}{({\color{magenta}$1$~point})}
\newcommand{\points}[1]{({\color{magenta}$#1$~points})}
\newcommand{\total}{({\color{magenta}1 point})}
\newcommand{\totals}[1]{({\color{magenta}#1 points})}
\newcommand{\figref}[1]{{\color{LimeGreen}Figure~\ref{fig:#1}}}
\newcommand{\tblref}[1]{{\color{LimeGreen}Table~\ref{tbl:#1}}}
\newcommand{\gnm}{$G(n,m)$\xspace}
\newcommand{\gnp}{$G(n,p)$\xspace}
\newcommand{\gk}{$G(\{k\})$\xspace}
\setcounter{page}{0}
\begin{document}
\thispagestyle{titlestyle}
\vspace*{0.0in}
\begin{center}
{\huge\bf Homework {\color{magenta}\#1}}
\end{center}
\vspace*{0.025in}
% \section*{Homework details}
\paragraph{} % This homework is only partial and will be updated in the coming days.
This homework is complete and will not be changed. The homework does not require a lot of writing, but may require a lot of thinking. It does not require a lot of processing power, but may require efficient programming. It accounts for $12.5\%$ of the course grade. All questions and comments regarding the homework should be directed to \href{https://piazza.com}{Piazza}.
\section*{Submission details}
\paragraph{} This homework is due on {\bf\color{magenta} March 21st} at 3:00pm, while late days expire on {\bf\color{magenta} March 24th} at 3:00pm. The homework must be submitted as a hard-copy in the submission box in front of R 2.49 and also as an electronic version to \href{https://ucilnica.fri.uni-lj.si/course/view.php?id=183}{eUcilnica}. It can be prepared in either English or Slovene and either written by hand or typed on a computer. The hard-copy should include (1)~this cover sheet with filled out time of the submission and signed honor code, (2)~short answers to the questions, which can also demand proofs, tables, plots, diagrams and other, and (3)~a printout of all the code required to complete the exercises. The electronic submission should include only (1)~answers to the questions in a single file and (2)~all the code in a format of the specific programming language. Note that hard-copies will be graded, while electronic submissions will be used for plagiarism detection. The homework is considered submitted only when both versions have been submitted. Failing to include this honor code in the submission will result in {\bf\color{LimeGreen} 10\% deduction}. Failing to submit all the developed code to \href{https://ucilnica.fri.uni-lj.si/course/view.php?id=183}{eUcilnica} will result in {\bf\color{LimeGreen} 50\% deduction}.
\section*{Honor code}
\paragraph{} The students are strongly encouraged to discuss the homework with other classmates and form study groups. Yet, each student must then solve the homework by herself or himself without the help of others and should be able to redo the homework at a later time. In other words, the students are encouraged to collaborate, but should not copy from one another. Referring to any solutions obtained from classmates, course books, previous years, found online or other, is considered an honor code violation. Also, stating any part of the solutions in class or on \href{https://piazza.com}{Piazza} is considered an honor code violation. Finally, failing to name the correct study group members, or filling out the wrong date or time of the submission, is also considered an honor code violation. Honor code violation will not be tolerated. Any student violating the honor code will be reported to {\bf\color{LimeGreen} faculty disciplinary committee} and vice dean for education.
\vspace*{0.033in}
\paragraph{Name \& SID:} \rule{4.5in}{0.5pt}
\paragraph{Study group:} \rule{4.5in}{0.5pt}
\paragraph{Date \& time:} \rule{2.5in}{0.5pt}
\paragraph{} I acknowledge and accept the honor code.
\paragraph{Signature:} \rule{2.5in}{0.5pt}
\pagebreak
\section*{Grand graph challenge ({\bf\color{magenta}extras})}
\paragraph{} Consider an Erd\H{o}s-R\'{e}nyi random graph~\cite{ER59} with $n$ nodes and $\frac{n\ln n}{8}$ links. What can you say about the expected fraction of nodes in the largest connected component $S$? Implement an efficient algorithm that constructs such a random graph and computes the fraction $S$. The algorithm can use any network representation and any method for computing connected components. It should merely output $S$ for a given $n$. What is the size of the largest random graph in terms of $n$ you are able to analyse on your computer within about one minute (\tblref{challenge})?
\begin{table}[h] \centering
\begin{tabular}{rrcll} \hline
$n$ & $m$ & ~~~~~$S$~~~~~ & Graph & Search \\[0.5ex] \hline
\num{100000} & \num{143912} & ? & $0.06$ sec & $0.06$ sec \\
\num{1000000} & \num{1726939} & ? & $2.42$ sec & $0.44$ sec \\
\num{10000000} & \num{20147620} & ? & $26.2$ sec & $8.20$ sec \\
\dfa{\num{12500000}} & \dfa{\num{25533186}} & \dfa{?} & \dfa{$49.1$ sec} & \dfa{$10.1$ sec} \\
\num{15000000} & \num{30981676} & ? & $1.38$ min & $18.3$ sec \\
\num{25000000} & \num{53232457} & ? & $2.28$ min & $32.5$ sec \\
\num{50000000} & \num{110797085} & ? & $3.78$ min & $2.23$ min \\[1ex] \hline
\end{tabular}
\caption{{\bf Results for 2.3 GHz Intel Core i7 with 16 GB memory}}
\label{tbl:challenge}
\end{table}
\subsection*{What to submit?}
\paragraph{} Submit your answers and results through the grand graph challenge submission option at \href{https://ucilnica1516.fri.uni-lj.si/course/view.php?id=183}{eUcilnica}. The grand graph challenge ends on {\bf\color{magenta} March 28th at 11:00am}.
\section{Networkology \totals{5}}
\subsection{Node degrees \totals{0.25}}
\paragraph{} \figref{degrees} shows wiring diagrams of two networks with $256$ nodes and the same average degree $\avg{k}=4$. By observing the networks, what can you say about their degree sequences $\{k_i\}$ and degree distributions $p_k$? \hint{Either reason that the networks are the same or highlight the differences between the networks.}
\begin{figure}[h] \centering
\subfloat{\includegraphics[width=0.433\textwidth]{degrees1}}
\subfloat{\includegraphics[width=0.433\textwidth]{degrees2}}
\caption{{\bf Networks with $256$ nodes and $\avg{k}=4$}}
\label{fig:degrees}
\end{figure}
\subsection{Connected components \total}
\paragraph{} Assume a simple undirected network with $n$ nodes, $m$ links and $c$ connected components. Show that the following two inequalities hold. \hint{Use induction for the first inequality and simple reasoning for the second.} Using these inequalities give a criterion for $m$ that ensures a connected network. Is the criterion practically useful? Why?
$$n-c\leq m\leq {n-c+1 \choose 2}$$
\subsection{Weak \& strong connectivity \totals{1.5}}
\paragraph{} In labs you saw an efficient algorithm for finding connected components of undirected networks. What would the same algorithm find in a directed network if one could follow the links in any direction? What would the algorithm find in a directed network if one could follow the links only in the proper direction? What would the algorithm find in a directed network if one could follow the links only in the opposite direction? Based on your answers, design an efficient algorithm for finding strongly connected components in directed networks. Implement the algorithm and find strongly connected components in \href{http://lovro.lpt.fri.uni-lj.si/ina/enron}{Enron e-mail communication network}~\cite{KY04}. Compute the number of strongly connected components and the size of the largest one. Are the results surprising? Why?
\subsection{Node \& network clustering \totals{0.75}}
\paragraph{} In lectures you saw two measures of local density in a network, namely the average clustering coefficient $\avg{C}$~\cite{WS98} and the network clustering coefficient $C$~\cite{NSW01}. Although the measures are similar, they are not equivalent. Course book~\cite{Bar16} describes a double star network for which $\avg{C}\rightarrow 1$ and $C\rightarrow 0$ when $n\rightarrow\infty$, where $n$ is the number of nodes in the network (\figref{stars}). Think of another example network for which $\avg{C}\rightarrow\mbox{const.}$ and $C\rightarrow 0$ when $n\rightarrow\infty$. \hint{Study what gave this property in a double star network.}
\begin{figure}[h] \centering
\includegraphics[width=0.3\textwidth]{stars}
\caption{{\bf Double star with node colors corresponding to clustering coefficient}}
\label{fig:stars}
\end{figure}
\subsection{Effective diameter evolution \totals{1.5}}
\paragraph{} In labs you saw an efficient algorithm for computing the diameter $D$ of a connected undirected network.
Note that $D$ is a very sensitive measure as a single chain of nodes extending out of the main part of the network already gives large $D$. A smoothed version of $D$ is called $90$-percentile effective diameter $E_{90}$~\cite{LKF07} that measures the maximum number of hops needed to reach $90\%$ of the nodes in a network. Design an efficient algorithm for computing $E_{90}$ of a connected undirected network. Implement the algorithm and compute $E_{90}$ for \href{http://lovro.lpt.fri.uni-lj.si/ina/aps.zip}{citation networks of physics papers} published by American Physical Society in the periods \href{http://lovro.lpt.fri.uni-lj.si/ina/aps_2010_2011}{$2010$-$2011$}, \href{http://lovro.lpt.fri.uni-lj.si/ina/aps_2010_2013}{$2010$-$2012$} and \href{http://lovro.lpt.fri.uni-lj.si/ina/aps_2010_2013}{$2010$-$2013$}. \hint{Although the networks are directed, treat them as undirected graphs. Note that these computations may easily take an hour.} Are the results surprising? Why? Compute also the number of nodes $n$ and the average degree $\avg{k}$ of all three networks, and discuss the results.
\subsection*{What to submit?}
\begin{description}
\item[1.1] Briefly reason why both networks are the same or highlight the differences in $\{k_i\}$ and $p_k$~\points{0.25}.
\item[1.2] Give proofs of both inequalities \points{2\times 0.25}. Derive a criterion for $m$ \points{0.25} and provide brief answers to both question \points{0.25}.
\item[1.3] Provide brief answers to all three questions \points{0.25}. Give a pseudocode of the designed algorithm \points{0.5} and a printout of the implementation \points{0.25}. State the number of strongly connected components in Enron network and the size of the largest one, and briefly comment on the results \points{2\times 0.25}.
\item[1.4] Provide brief description or a wiring diagram of the example network \points{0.25}. Give proofs for the values of $\avg{C}$ and $C$ \points{2\times 0.25}.
\item[1.5] Give a pseudocode of the designed algorithm \points{0.5} and a printout of the implementation \points{0.25}. State $n$, $\avg{k}$ and $E_{90}$ for all three citation networks and briefly comment on the results \points{3\times 0.25}.
\end{description}
\section{Graph models \totals{3}}
\subsection{Random node selection \totals{0.75}}
\paragraph{} Erd\H{o}s-R\'{e}nyi random graph model~\cite{ER59} requires an efficient implementation of a random selection of nodes, which can be easily achieved for most network representations. On the other hand, more realistic models~\cite{BA99} require more sophisticated random selection procedures. Design an algorithm that does not select nodes uniformly at random, but proportional to their degrees. Thus, node $i$ is selected with probability $\frac{k_i}{2m}$, where $k_i$ is its degree and $m$ is the number of links. The algorithm should run in constant time $\cmp{1}$, whereas you can assume any standard network representation. \hint{Think more about the network representation than the algorithm.}
\subsection{Node linking probability \totals{0.75}}
\paragraph{} Consider a random graph model in which links are placed independently between each pair of nodes $i$ and $j$ with probability $p_{ij}$ proportional to $v_iv_j$, where $v_i$ is some non-negative number associated with node $i$. First show that the expected node degree $k_i$ is proportional to $v_i$. Next, derive an exact expression for $p_{ij}$ in terms of the degree sequence $\{k_i\}$ and discuss the result.
\subsection{Node degree distributions \totals{1.5}}
\paragraph{} Represent a small part of the \href{http://lovro.lpt.fri.uni-lj.si/ina/facebook}{Facebook social network}~\cite{VMCG09} as an undirected graph and compute its degree distribution $p_k$. Plot $p_k$ on a doubly logarithmic or log-log plot by representing each distinct $(k,p_k)$ with a single dot. \hint{Transformation to logarithmic axes should be done by your plotting software.} Next, let $n$ and $m$ be the number of nodes and links, and $\avg{k}$ the average degree in Facebook network. Construct an Erd\H{o}s-R\'{e}nyi random graph~\cite{ER59} with parameters $n$ and $m$, and again compute $p_k$. Superimpose $p_k$ on the same plot using dots of different color as before. Also, compute the theoretical degree distribution of the Erd\H{o}s-R\'{e}nyi random graph $p_k\simeq\frac{\avg{k}^ke^{-\avg{k}}}{k!}$ and plot it with a solid line. Finally, construct a random graph according to the preferential attachment model~\cite{BA99}. Start with a complete graph on $\lceil\avg{k}\rceil+1$ nodes and add the remaining $n-\lceil\avg{k}\rceil-1$ nodes one at a time. Each newly arrived node selects $\lceil\avg{k}/2\rceil$ of the existing nodes with probability proportional to their degrees and links to them. \hint{If you have not solved exercise 2.1, use linear roulette wheel algorithm for random selection by degree.} Compute $p_k$ of the preferential attachment graph and again plot it using dots of different color. Compare all four degree distributions $p_k$ and highlight the differences among~them.
\subsection*{What to submit?}
\begin{description}
\item[2.1] State the network representation and give a pseudocode of the designed algorithm \points{0.5}. Reason or prove why the algorithm gives the correct result \points{0.25}.
\item[2.2] Show that $k_i$ is proportional to $v_i$ \points{0.25}. Derive an expression for $p_{ij}$ and briefly comment on the result \points{0.5}.
\item[2.3] Draw a plot with four distributions and briefly discuss each result \points{3\times 0.25}. Give a printout of the implementation of the preferential attachment model \points{0.5} and a printout of all the code used to compute $p_k$ \points{0.25}.
\end{description}
\section{Node position \totals{1.5}}
\paragraph{} You are given \href{http://lovro.lpt.fri.uni-lj.si/ina/highways}{Slovenian highway network} from $2010$ with traffic loads at each location (\figref{highways}). For reasons of simplicity, the network is represented as a simple undirected graph. Your task is to find out which measure of node position could be utilized to best predict the traffic loads. You should consider at least three node measures, namely node degree $k_i$, node clustering coefficient $C_i=\frac{2t_i}{k_i(k_i-1)}$~\cite{WS98} and node harmonic mean distance $\ell_i^{-1}=\frac{1}{n - 1}\sum_j\frac{1}{d_{ij}}$~\cite{New10}. Possibly the simplest way to achieve this goal is to compute Pearson or Spearman correlation coefficient between the values returned by some node measure and the actual traffic loads. Compute the correlation coefficient for each of the three measures. Are the results expected? Why? List also top ten locations according to the best node measure along with the computed values and the actual traffic loads.
\begin{figure}[h] \centering
\includegraphics[width=0.5\textwidth]{highways}
\caption{{\bf Slovenian highways network with node colors corresponding to traffic loads}}
\label{fig:highways}
\end{figure}
\subsection*{What to submit?}
\paragraph{} State the values of the correlation coefficient and briefly discuss each result \points{3\times 0.25}. List top ten locations according to the best measure \points{0.25}. Print out any code you might have used or describe how you solved the exercise \points{0.5}.
\bibliographystyle{alpha}
\bibliography{biblio}
\end{document}