Quality of Service Monitoring

The Quality of Service (QoS) system gives you visibility into your running blueprints. This guide explains how to access and interpret the QoS dashboards and metrics provided by the operators running your blueprints.

What is the QoS System?

The Quality of Service (QoS) system in Tangle Network is an observability stack that gives you insight into your running blueprints. Depending on the blueprint, it can include:

  • Real-time monitoring of blueprint health and performance
  • Centralized logs for troubleshooting and audit trails
  • Heartbeat monitoring to verify continuous operation
  • Visualization dashboards for all key metrics

Which of these features are available, and what they report, varies from blueprint to blueprint, so check the documentation of the blueprint you are using for specifics.

Accessing QoS Dashboards

When a blueprint is running for you, the operator provides access to QoS dashboards through Grafana. Here’s how to access them:

  1. In your blueprint execution details, locate the operator’s QoS endpoint (typically provided after blueprint execution begins)
  2. Navigate to the Grafana URL (default: http://[operator-endpoint]:3000). Port 3000 is the default, but the operator may specify a different one.
  3. Log in using the credentials provided by the operator (typically admin/admin for basic setups). These may also differ from blueprint to blueprint.
  4. Once logged in, navigate to the “Dashboards” section in the left sidebar
  5. Look for a dashboard with a name that corresponds to the ID of your blueprint
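The steps above can be sketched in Python. This is a minimal sketch, not an official client: the host name and blueprint ID in the usage note are placeholders, and it assumes the operator exposes Grafana's standard HTTP API (the `/api/health` and `/api/search` endpoints) on the same host and port as the web interface.

```python
import json
import urllib.request
from urllib.parse import urlencode


def grafana_base_url(operator_endpoint: str, port: int = 3000) -> str:
    """Build the Grafana base URL; 3000 is only the default port."""
    return f"http://{operator_endpoint}:{port}"


def dashboard_search_url(base_url: str, blueprint_id: str) -> str:
    """Build a Grafana dashboard-search URL filtered by blueprint ID.

    Assumption: the dashboard name contains the blueprint ID.
    The search endpoint normally requires an authenticated session.
    """
    params = urlencode({"query": blueprint_id, "type": "dash-db"})
    return f"{base_url}/api/search?{params}"


def grafana_is_healthy(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if Grafana's health endpoint reports a working database."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/health", timeout=timeout) as resp:
            body = json.load(resp)
    except (OSError, ValueError):
        # Unreachable host, timeout, HTTP error, or a non-JSON response.
        return False
    return isinstance(body, dict) and body.get("database") == "ok"
```

For example, `grafana_base_url("operator.example.com")` yields `http://operator.example.com:3000`, and `grafana_is_healthy(...)` on that URL is a quick way to confirm the endpoint is reachable before you log in through the browser.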

What You Can Monitor

The QoS dashboards provide comprehensive visibility into your blueprint’s operation:

1. System Performance

The system metrics panels show how the blueprint uses resources. Example metrics include:

  • CPU Usage: Real-time CPU utilization by your blueprint
  • Memory Consumption: RAM usage over time
  • Disk I/O: Storage activity for data-intensive operations
  • Network Traffic: Inbound/outbound network traffic

These metrics help you understand if your blueprint has adequate resources and is performing efficiently.

2. Blueprint-specific Metrics

These panels show you how your specific blueprint is performing:

  • Job Execution Frequency: How often jobs are being executed
  • Job Duration Statistics: How long jobs are taking to complete
  • Error Rates: Percentage of jobs failing or experiencing errors
  • Resource Utilization: How efficiently resources are being used

A blueprint may also expose additional metrics that are specific to it and the jobs it runs.

3. Heartbeat Monitoring

The heartbeat section shows you the operational status of your blueprint:

  • Last Heartbeat Timestamp: When the most recent heartbeat was recorded
  • Heartbeat Success Rate: Percentage of successful heartbeats
  • Chain Confirmation Status: Verification that heartbeats are being recorded on-chain

Heartbeats make it possible to penalize (slash) an operator who does not run the blueprint as required.
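To make the heartbeat metrics concrete, here is a minimal sketch of how a success rate and heartbeat gaps could be computed from a list of timestamps. The function names, the expected interval, and the 1.5x tolerance are illustrative assumptions, not values defined by the QoS system.

```python
from datetime import datetime, timedelta


def heartbeat_success_rate(received: int, expected: int) -> float:
    """Percentage of expected heartbeats that were actually recorded."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    return 100.0 * received / expected


def find_heartbeat_gaps(timestamps, expected_interval, tolerance=1.5):
    """Return (previous, next) pairs spaced more than tolerance * expected_interval apart."""
    ordered = sorted(timestamps)
    limit = expected_interval * tolerance
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > limit]


# Hypothetical data: a heartbeat every 60 seconds, with two beats missing.
start = datetime(2024, 1, 1)
beats = [start + timedelta(seconds=s) for s in (0, 60, 120, 300, 360)]
gaps = find_heartbeat_gaps(beats, timedelta(seconds=60))
```

Here `gaps` contains a single pair, covering the stretch between the beats at 120 and 300 seconds. A gap like this is what the "Last Heartbeat Timestamp" and "Heartbeat Success Rate" panels would surface.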

4. Log Visualization with Loki

Centralized logs provide detailed insights into blueprint operation:

  • Error Logs: Any errors or warnings generated by your blueprint
  • Information Logs: Standard operational logs from your blueprint
  • System Logs: Underlying system events that may affect your blueprint
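If the operator's Grafana lets you use the Explore view with the Loki data source, you can filter these logs with a LogQL query. The helper below is a small sketch that builds such a query string. The label name `blueprint_id` is hypothetical; the labels actually attached to log streams depend on the operator and the blueprint.

```python
def logql_query(labels, contains=None):
    """Build a LogQL stream selector with an optional substring line filter."""

    def quote(value):
        # Escape backslashes and double quotes for a LogQL string literal.
        return '"' + str(value).replace("\\", "\\\\").replace('"', '\\"') + '"'

    selector = ", ".join(
        f"{name}={quote(value)}" for name, value in sorted(labels.items())
    )
    query = "{" + selector + "}"
    if contains is not None:
        query += " |= " + quote(contains)
    return query


# Hypothetical label and value; check which labels your dashboards actually use.
error_query = logql_query({"blueprint_id": "42"}, contains="error")
```

The resulting string, `{blueprint_id="42"} |= "error"`, selects log lines from the matching stream that contain the word "error".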

Interpreting QoS Data

Key Performance Indicators

When monitoring your blueprints, pay attention to these important indicators:

  1. Job Success Rate: Should be close to 100% under normal conditions
  2. Response Time: How quickly jobs are being completed
  3. Resource Efficiency: Whether your blueprint is using resources as expected
  4. Heartbeat Regularity: Heartbeats should occur at consistent intervals

Warning Signs to Watch For

These patterns may indicate issues with your blueprint:

  • Increasing Error Rates: May indicate logic problems or resource constraints
  • Growing Response Times: Could suggest performance degradation
  • Missing Heartbeats: May indicate blueprint instability or network issues
  • Unexpected Resource Spikes: Could indicate inefficient operations or potential attacks
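The indicators and warning signs above can be checked with simple arithmetic once you have the numbers from a dashboard. The following sketch assumes you have job counts and a series of periodic samples (for example, error rate per hour). The window size of 3 is an arbitrary illustrative choice.

```python
def job_success_rate(succeeded: int, failed: int):
    """Percentage of jobs that succeeded, or None if no jobs ran."""
    total = succeeded + failed
    if total == 0:
        return None
    return 100.0 * succeeded / total


def is_rising(values, window: int = 3) -> bool:
    """True if the mean of the last `window` samples exceeds the mean of the `window` before them."""
    if len(values) < 2 * window:
        return False  # Not enough samples to compare two windows.
    recent = values[-window:]
    previous = values[-2 * window:-window]
    return sum(recent) / window > sum(previous) / window
```

Applied to an error-rate series, `is_rising` flags the "Increasing Error Rates" pattern. Applied to job durations, it flags "Growing Response Times". It is a coarse heuristic and will not replace looking at the graph itself.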

Troubleshooting Using QoS Data

When you encounter issues with your blueprints, the QoS dashboard provides valuable diagnostics:

For Failed Jobs

  1. Check the logs panel for specific error messages
  2. Look at resource usage at the time of failure
  3. Look for patterns in the failures (time of day, specific job types)

For Performance Issues

  1. Monitor CPU and memory usage during slow periods
  2. Look for concurrent operations that may cause contention
  3. Check network traffic for potential bottlenecks

For Stability Problems

  1. Review the heartbeat history for gaps or irregularities
  2. Examine system logs around times of instability
  3. Check for correlations between resource exhaustion and failures
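Checking for correlations between resource exhaustion and failures can be done by eye on the dashboard, but the idea is easy to express in code. This sketch assumes you have failure timestamps and (timestamp, usage percent) samples from some source. The 90% threshold and one-minute window are illustrative values only.

```python
from datetime import datetime, timedelta


def failures_near_resource_peaks(
    failure_times,
    samples,
    threshold: float = 90.0,
    window: timedelta = timedelta(minutes=1),
):
    """Return the failures that occurred within `window` of a usage sample at or above `threshold`."""
    peaks = [t for t, usage in samples if usage >= threshold]
    return [f for f in failure_times if any(abs(f - p) <= window for p in peaks)]


# Hypothetical data: one memory peak at 12:05, failures at 12:05:30 and 12:20.
t0 = datetime(2024, 1, 1, 12, 0)
memory = [(t0, 50.0), (t0 + timedelta(minutes=5), 95.0)]
failures = [t0 + timedelta(minutes=5, seconds=30), t0 + timedelta(minutes=20)]
suspect = failures_near_resource_peaks(failures, memory)
```

Only the 12:05:30 failure ends up in `suspect`, which points to memory pressure as a possible cause for that failure but not for the one at 12:20.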

Frequently Asked Questions

Q: How do I access QoS dashboards if the URL wasn’t provided?
A: The operator's endpoint is available on-chain. Once you have it, follow the steps in the Accessing QoS Dashboards section.

Q: Can I export QoS metrics for my own analysis?
A: Yes. Grafana generally lets you export the data behind a panel (for example as CSV) and export dashboards as JSON, provided the operator has not restricted your account.

Q: How long is QoS data retained?
A: QoS data is retained only for the duration of the service, unless the operator or blueprint states otherwise.
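As a follow-up to the export question above, here is a minimal sketch of analyzing an exported CSV with the Python standard library. The column names `Time` and `cpu_percent` are assumptions; the actual headers depend on the panel you export.

```python
import csv
import io
from statistics import mean


def column_mean(csv_text: str, column: str):
    """Mean of a numeric column in CSV text, or None if the column has no values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    values = [
        float(row[column])
        for row in reader
        if row.get(column) not in (None, "")
    ]
    return mean(values) if values else None


# Hypothetical export with two samples.
sample = "Time,cpu_percent\n2024-01-01 00:00:00,10\n2024-01-01 00:01:00,30\n"
average_cpu = column_mean(sample, "cpu_percent")
```

In practice you would read the text from the downloaded file instead of an inline string. Because data is retained only while the service runs, exporting it is one way to keep it for later analysis.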

Understanding how to interpret QoS metrics helps you gain insights into blueprint performance and troubleshoot issues effectively.