Grafana SRE Architect - Basking Ridge, NJ (Onsite)

Basking Ridge, NJ
Contracted
Senior Executive
 

 

Location: Basking Ridge, NJ (Onsite)

Indent: SF_OP_179718-2-2

 

Job Summary

The Grafana SRE Architect will lead the design, implementation, and management of scalable, reliable, and performant Grafana-based observability solutions. This role bridges Site Reliability Engineering (SRE) practices with Grafana’s ecosystem (Loki, Mimir, Tempo, etc.) to ensure robust monitoring, logging, tracing, and alerting for mission-critical systems. You will collaborate with DevOps, engineering, and infrastructure teams to align technical strategies with business objectives, driving automation, resilience, and cost efficiency across cloud and on-premises environments.

 

Key Responsibilities

  1. Architecture & Design
    1. Design end-to-end Grafana solutions for metrics, logs, traces, and dashboards, ensuring scalability, security, and compliance.
    2. Architect integrations with Prometheus, Loki, Mimir, Tempo, and third-party tools (e.g., AWS CloudWatch, Datadog).
    3. Define best practices for Grafana deployment (self-managed vs. Grafana Cloud) and optimize data storage/retention strategies.
  2. SRE Leadership
    1. Implement SRE principles: SLAs/SLOs/SLIs, error budgets, and blameless post-mortems.
    2. Build automated monitoring/alerting systems to preemptively identify system bottlenecks and failures.
    3. Lead incident response, root cause analysis, and remediation for observability-related outages.
  3. Collaboration & Integration
    1. Partner with DevOps teams to embed Grafana into CI/CD pipelines and automate provisioning via IaC (Terraform, Ansible).
    2. Work with developers to instrument applications for observability (OpenTelemetry, custom exporters).
    3. Advise stakeholders on cost-effective monitoring strategies and resource optimization.
  4. Performance Optimization
    1. Tune Grafana dashboards, queries, and data sources for high-performance environments.
    2. Optimize PromQL and LogQL queries and manage large-scale time-series databases (Mimir).
    3. Conduct capacity planning and disaster recovery testing for Grafana ecosystems.
  5. Governance & Security
    1. Ensure compliance with security policies (RBAC, SSO, encryption) and audit requirements.
    2. Monitor Grafana stack health, perform upgrades, and enforce version control.
  6. Mentorship & Innovation
    1. Mentor SRE/engineering teams on Grafana best practices and SRE culture.
    2. Stay current with Grafana and observability trends and pilot new tools (e.g., AI-driven anomaly detection).

 

 

Education & Experience

  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
  • 10+ years in SRE/DevOps roles, with 5+ years hands-on Grafana experience.
  • Proven track record in designing large-scale observability solutions.
  • Experience managing offshore teams.
  • Willingness to work hours that overlap with offshore teams.

 

Technical Skills

  • Expertise in Grafana: Dashboards, plugins, alerting, and integrations (Prometheus, Loki, Mimir, Tempo).
  • Cloud Platforms: AWS/GCP/Azure, Kubernetes, and serverless architectures.
  • Automation: Terraform, Ansible, Python/Go scripting.
  • Monitoring Tools: Thanos, Cortex, Jaeger, OpenTelemetry.
  • Database Optimization: Time-series data (Mimir), log management (Loki).

 

Certifications (Preferred)

  • Grafana Certified: Observability Engineer/Administrator.
  • AWS/GCP/Azure Architect or DevOps certifications.

 

Soft Skills

  • Leadership in cross-functional teams and crisis management.
  • Strong communication for technical and non-technical audiences.
  • Analytical problem-solving and strategic thinking.

 

 

Preferred Qualifications

  • Contributions to Grafana/Prometheus open-source projects.
  • Experience with AI/ML model monitoring.
  • Knowledge of regulatory frameworks (GDPR, HIPAA). 
