
Practice Exercise 1: Local Resource Metrics

Objectives

In this exercise, you will collect local resource metrics that show the health and performance of individual nodes in a distributed system. Tracking these metrics on every node is a basic step in monitoring and maintaining the reliability of the system as a whole.

Prerequisites

Exercises

Here are some important local resource metrics and sample commands to collect and monitor them.

Task 1. CPU Usage:

- Metric: Monitor CPU utilization to ensure that nodes have sufficient processing power.
- Use the top command in Linux to view CPU usage:

        top
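Because top is interactive, a script usually needs a number it can compare against a threshold. The following is a minimal sketch, assuming a Linux node where /proc/stat is available; the function name cpu_busy_percent is hypothetical and not part of any standard tool. It takes two samples of the aggregate "cpu" line and reports the share of time that was neither idle nor waiting on I/O.

```shell
# Compute the busy-CPU percentage between two samples of the aggregate
# "cpu" line from /proc/stat (Linux only).
# Fields after the label: user nice system idle iowait irq softirq steal ...
cpu_busy_percent() {
    # $1 = earlier "cpu ..." line, $2 = later "cpu ..." line
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            for (i = 2; i <= 9 && i <= NF; i++) total += $i
            idle = $5 + $6            # idle + iowait
            if (NR == 1) { t0 = total; i0 = idle }
            else         { t1 = total; i1 = idle }
        }
        END {
            dt = t1 - t0
            if (dt <= 0) { print "0.0"; exit }
            printf "%.1f\n", 100 * (dt - (i1 - i0)) / dt
        }'
}

# Example use (left as a comment so the block only defines the function):
#   first=$(head -n 1 /proc/stat); sleep 1; second=$(head -n 1 /proc/stat)
#   cpu_busy_percent "$first" "$second"
```

For a one-off snapshot without any arithmetic, top can also be run non-interactively with `top -b -n 1`.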

Task 2. Memory Usage:

- Metric: Monitor memory usage to prevent out-of-memory issues.
- Use free to check free and used memory:

        free -h
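The free output is meant for people. For an automated check, one option is to read the same counters from /proc/meminfo. This sketch assumes a Linux kernel that reports MemAvailable; the function name mem_used_percent is hypothetical.

```shell
# Print the percentage of memory that is not available to new workloads,
# based on MemTotal and MemAvailable (both reported in kB).
mem_used_percent() {
    # $1 = path to a meminfo-format file (defaults to /proc/meminfo)
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END {
            if (total <= 0) exit 1    # no MemTotal line found
            printf "%.1f\n", 100 * (total - avail) / total
        }' "${1:-/proc/meminfo}"
}

# Example use:
#   mem_used_percent
```

MemAvailable is used here rather than MemFree because it accounts for page cache that the kernel can reclaim, which tends to give a more realistic picture of memory pressure.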

Task 3. Disk Space Usage:

- Metric: Ensure that nodes have enough disk space for logs and data storage.
- Use df to check disk space:

        df -h
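To turn the df listing into an alert, a script can filter for filesystems above a chosen usage level. The sketch below reads `df -P` output (the POSIX layout, one filesystem per line) from standard input; the function name disks_over_threshold and the 80 percent figure in the usage comment are illustrative assumptions.

```shell
# Print "<mount point> <use%>" for every filesystem whose usage is at or
# above the given percentage. Expects `df -P` formatted text on stdin.
# Note: mount points containing spaces are not handled by this sketch.
disks_over_threshold() {
    # $1 = threshold in percent, e.g. 80
    awk -v limit="$1" '
        NR > 1 {                      # skip the header line
            use = $5
            sub(/%/, "", use)         # "91%" -> "91"
            if (use + 0 >= limit + 0) print $6, use "%"
        }'
}

# Example use:
#   df -P | disks_over_threshold 80
```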

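Task 4. I/O Operations:

- Metric: Monitor read and write operations on disks to prevent I/O bottlenecks.
- Use iostat (part of the sysstat package) to view I/O statistics. The -d option selects the device report, and the trailing 1 refreshes it every second:

        iostat -d 1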

Task 5. System Load:

- Metric: Check system load to ensure that the system is not overloaded.
- Use uptime to check the system load. It reports the load averages over the last 1, 5, and 15 minutes:

        uptime
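A load average is easier to judge relative to the number of CPUs on the node. The following sketch divides one by the other; the function name load_per_cpu is hypothetical, and the usage comment assumes Linux (/proc/loadavg and nproc).

```shell
# Print the load average divided by the CPU count, to two decimals.
load_per_cpu() {
    # $1 = load average (e.g. the 1-minute value), $2 = number of CPUs
    awk -v load="$1" -v cpus="$2" 'BEGIN {
        if (cpus <= 0) exit 1
        printf "%.2f\n", load / cpus
    }'
}

# Example use:
#   load_per_cpu "$(cut -d ' ' -f 1 /proc/loadavg)" "$(nproc)"
```

As a rough rule of thumb, a value that stays above 1.00 suggests that more work is queued than the CPUs can run at once. The right alert threshold depends on the workload.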

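Task 6. Swap Usage:

- Metric: Monitor swap space usage to prevent excessive swapping.
- Use swapon -s to view swap usage. Newer util-linux releases mark the -s output as deprecated and recommend swapon --show for the same information: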

        swapon -s

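Task 7. File Descriptor Usage:

- Metric: Monitor the number of open file descriptors to prevent resource exhaustion.
- Use ulimit -n to check the maximum number of file descriptors a process started from the current shell may open. This command shows the limit, not the current usage, so compare it with the number of descriptors your processes actually hold open: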

        ulimit -n
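To see how close a process is to that limit, its open descriptors can be counted. This is a sketch that assumes Linux, where each process exposes one entry per open descriptor under /proc/<pid>/fd; the function name fd_usage is hypothetical, and reading another user's process directory may require elevated privileges.

```shell
# Report how many file descriptors are open compared with a limit.
fd_usage() {
    # $1 = directory with one entry per open descriptor
    #      (defaults to /proc/$$/fd, i.e. the current shell)
    # $2 = limit to compare against (defaults to the soft limit, `ulimit -n`)
    fd_dir="${1:-/proc/$$/fd}"
    fd_limit="${2:-$(ulimit -n)}"
    fd_count=$(ls -A "$fd_dir" 2>/dev/null | wc -l | tr -d ' ')
    echo "$fd_count open of $fd_limit allowed"
}

# Example use for a specific process (1234 is a placeholder PID):
#   fd_usage /proc/1234/fd
```

Note that when a PID is given this way, the default limit is still the one of the current shell. The limit that applies to the target process itself is listed in /proc/<pid>/limits.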

Task 8. Process Monitoring:

- Metric: Monitor the number of processes to ensure that the system doesn't run out of available process slots.
- Use ps aux to list all processes on the system, whatever their state:

        ps aux
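The full ps listing can be long, so a per-state count is often a more useful signal, for example to spot a growing number of zombie (Z) processes. The sketch below summarizes process state strings read from standard input; the function name count_by_state is hypothetical, and the usage comment assumes a ps that supports the stat output column, as the procps version does.

```shell
# Read one process state string per line (e.g. "Ss", "R+", "Z") and print
# "<first state letter> <count>", sorted by letter.
count_by_state() {
    awk 'NF { n[substr($1, 1, 1)]++ }
         END { for (s in n) print s, n[s] }' | sort
}

# Example use:
#   ps -e -o stat= | count_by_state
```

Only the first letter is counted because the remaining characters are modifiers, such as s for a session leader or + for a foreground process.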

Task 9. Log Analysis:

- Metric: Regularly review system logs for errors, warnings, and critical events.
- Use cat, tail, or grep to analyze log files. For example:

        cat /var/log/syslog
        tail -n 100 /var/log/application.log
        grep "ERROR" /var/log/app.log
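For a recurring review, these commands can be combined into a short summary of the most recent entries. This sketch assumes the application writes the literal strings ERROR and WARN into its log lines; the function name log_level_summary, the WARN pattern, and the default of 1000 lines are illustrative assumptions.

```shell
# Count lines containing ERROR or WARN among the last N lines of a log file.
log_level_summary() {
    # $1 = log file path, $2 = number of trailing lines to scan (default 1000)
    tail -n "${2:-1000}" "$1" | awk '
        /ERROR/ { e++ }
        /WARN/  { w++ }
        END { printf "errors=%d warnings=%d\n", e, w }'
}

# Example use, with the log path from the commands above:
#   log_level_summary /var/log/app.log 500
```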

Conclusion

By monitoring these local resource metrics, you can proactively identify and address potential issues in your distributed system, helping to maintain its reliability and performance. Additionally, consider using monitoring tools like Prometheus, Grafana, Nagios, or Zabbix for a more comprehensive and automated approach to monitoring and alerting in distributed systems.

In addition, you can implement custom health checks to monitor specific aspects of your application or services.