
Topics in Distributed Systems & Algorithms: Resilient Cloud & Network Services

Instructor: Professor Kaliappa Ravindran

Kaliappa Ravindran is a Professor of Computer Science at the City College of CUNY and a doctoral faculty member at the CUNY Graduate Center. His research areas include distributed computing over clouds, service-level network management, fault-tolerance & replication algorithms, autonomic networked systems, software-defined networks, and cyber-physical software systems. He has published over 150 papers under the broad umbrella of distributed computing and networking, keeping abreast of the technological and scientific advancements in the field. His research has been supported by many federal government agencies and industries (such as the Air Force, Navy, Cisco, and General Motors).


Distributed computing and networking can be viewed as both an art and a science that pervades our lives in many ways. It forms the foundational pillar of various technology manifestations we have seen (and continue to see), such as the Internet, social networks, crowd-sourcing, data centers & clouds, and multi-player games. For instance, a seemingly simple online purchase triggers a workflow that is executed at multiple sites to process information such as product description, store inventory, credit card charging, shipment scheduling, and the like. The purchase involves a coordinated execution of these processes through message passing (without relying on a shared memory between them), which is a complex exercise in itself. As computer scientists, we ought to understand the scientific way of dealing with the underlying technical challenges, in order to continue improving system functionality in many dimensions and to better the quality of life.

The aim of GC course 82005 is to expose students to current research and technological issues in the areas of distributed systems & algorithms, fault-tolerance, and quality-of-service (QoS) & performance. Many distributed applications (such as cloud services, e-commerce, and data centers) rely heavily on protecting critical data and computations that may be distributed across different sites and geographic regions. Here, seemingly correct but actually faulty network subsystems and physical components (say, exhibiting intermittently faulty and/or sub-optimal performance) can cause significant damage to critical information infrastructures. With subsystem-level failures on the rise (both benign and malicious), the course contents can help students understand the computer science issues underlying the design of correctly functioning networked systems (e.g., how a distributed system of computational elements achieves its goals despite the absence of a physically shared memory).

We shall begin with two fundamental concepts: Lamport's notion of logical time and global snapshots of network state. We shall examine different communication models (synchronous, partially synchronous, and asynchronous) and how they impact the design of distributed algorithms. We shall cover primitives for broadcast and gossip-based communication among the nodes of a distributed system, and provide case studies of how they are employed in cloud servers and data centers today. The course will also cover replication methods to achieve fault-tolerance and assured QoS of systems in the presence of malicious/benign errors occurring randomly at sub-system levels (e.g., majority voting among multiple components to make correct decisions in the presence of liars). Students will also undertake a programming project that involves developing and testing a distributed algorithm on a network testbed (under artificially injected failure conditions).
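To give a flavor of the first concept, the following is a minimal sketch of a Lamport-style scalar logical clock. It is an illustration only and not course material; the class and method names (`LamportClock`, `tick`, `send`, `receive`) are assumptions chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class LamportClock:
    """Scalar logical clock kept by one process (hypothetical names)."""
    time: int = 0

    def tick(self) -> int:
        # Local event: advance the clock by one.
        self.time += 1
        return self.time

    def send(self) -> int:
        # Sending is itself an event; the returned timestamp
        # is attached to the outgoing message.
        return self.tick()

    def receive(self, msg_time: int) -> int:
        # Move past the sender's timestamp, counting the receive event.
        self.time = max(self.time, msg_time) + 1
        return self.time


# Two processes, p and q, exchanging one message.
p, q = LamportClock(), LamportClock()
p.tick()                      # local event at p
stamp = p.send()              # p sends a message carrying its timestamp
q.tick()                      # unrelated local event at q
received = q.receive(stamp)   # q receives the message from p
```

The receive event always carries a larger timestamp than the matching send event, which is the property that lets timestamps respect causal order even though the processes share no physical clock.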

Prior knowledge desirable

System programming on UNIX-like systems, Java and C-like languages, interprocess communication in operating systems, and use of networked systems

Course Topics

Following is an approximate structure of the course — a total of 40 lecture+discussion hours:

  • Introductory portion (2 hours)

    • Brief coverage of basic computer network and distributed system concepts and terminology

  • Distributed algorithm structure and design (10 hours)

    • Failure models: crash failures, omission failures, Byzantine failures

    • Programming-level abstractions: fail-stop behaviors, failure detectors

    • Communication models: synchronous, partially synchronous, and asynchronous message-passing semantics

    • Causality tracking (logical clocks, vector clocks)

    • Distributed global snapshots (the Chandy-Lamport algorithm) and system monitoring (global predicates)

      The emphasis is on distributed algorithm design principles and methodology.

  • Distributed programming primitives (15 hours)

    • Service-level specification & verification of networked systems

    • Axioms of distributed consensus and agreement

    • Distributed coordination: consistency, dead-reckoning (e.g., multi-player games, shared white-board)

    • Broadcast primitives: atomic, causal, and ordered multicasts

  • Distributed system control techniques (8 hours)

    • Replication and fault-tolerance

    • Voting, primary-backup methods

    • Rollback & replay algorithms and correctness conditions

    • System resilience versus robustness

  • Case studies of distributed systems (5 hours)

    • Telcordia Resilient Clouds, IBM WebSphere

    • Cloud and network auditing tools

    • HP OpenView-based system management tools
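
As a small illustration of the voting topic listed under replication and fault-tolerance above, the sketch below picks the value reported by a strict majority of replicas. It is an assumption-laden example, not a method prescribed by the course: the function name `majority_vote` is hypothetical, and the voter itself is assumed to be correct.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(replies: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value reported by a strict majority of replicas, else None."""
    if not replies:
        return None
    # most_common(1) yields the single most frequent value and its count.
    value, count = Counter(replies).most_common(1)[0]
    # Accept only if more than half of all replies agree.
    return value if count > len(replies) // 2 else None


# Three replicas, one of which reports a wrong value.
decision = majority_vote([42, 42, 17])
# No value reaches a majority, so the voter declines to decide.
split = majority_vote([1, 2, 3])
```

With 2f+1 replicas reporting directly to a correct voter, up to f wrong replies are masked. Reaching agreement among the replicas themselves when some may lie is a harder problem and is not what this sketch addresses.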



Grading

Programming project 35%, midterm exam 25%, final exam 40% (exams are take-home).

Reference materials

  • Distributed Systems by Sape Mullender, Addison-Wesley (ACM Press)

  • Distributed Operating Systems by M. Singhal and N. Shivaratri, McGraw-Hill

  • Conference proceedings on Network Management Systems, Real-time Dependable Systems, Cloud Services and Computing, and Monitoring and Debugging of Distributed Real-time Systems — IEEE-CS and ACM publications.

  • Distributed networks journals: IEEE Transactions on Network and Service Management, IEEE/ACM Transactions on Networking, and Springer journals on clouds and network services