
Software Engineer, Systems Observability


As Airbnb's infrastructure scales from a monolithic stack to a highly concurrent, distributed one, visibility into the health and performance of our systems becomes harder in proportion to the complexity of that infrastructure. Basic metric monitoring tools can no longer capture the full depth of the increasingly unpredictable and complex interactions between services. The maturity of the monitoring and introspection tools available to our engineers will determine our ability to anticipate performance bottlenecks, identify anomalous system interactions, and diagnose the root cause of incidents, and it underpins the overall productivity of our engineering team.

The Observability team’s mission is to engineer the observability tools Airbnb engineers need to be successful in a highly distributed modern architecture. We are building a unified platform for instrumenting, processing, storing and presenting the state of our systems as metrics, traces, profiles, or call graphs. Our engineers should be able to seamlessly switch between opinionated aggregate views that help identify N+1s or performance regressions and detailed trace or profile views for closer root cause analysis. With stream processors we are correlating exceptions to deploys for automated rollbacks and hope to be able to generally surface correlated anomalous metrics to our engineers in the future.

We formed the Observability team in early 2017 to build an observability infrastructure that matches our scale, and we are already processing many billions of data points per day. We rely heavily on open source technology and standards but are not shy about researching new tracing architectures or stream processing techniques. The team focuses on both the backend collection of data and the custom interfaces and tools that reveal deeper relationships in that data. In addition to building a monitoring system more robust than the production system that we are monitoring, we must also work closely with other infra and product teams to anticipate the modern technologies being adopted throughout our engineering team--such as GraphQL, React Native, HTML streaming, and Single Page Apps--each posing new observability challenges that call for unique instrumentation and data cubes.

We are looking for new teammates who have 2+ years of industry experience with, or a strong interest in, the following:

  • Elastic Stack (Elasticsearch, Logstash, Kibana)
  • Stream processing (Flink)
  • Tracing (OpenTracing, LTTng, Chrome DevTools, Zipkin)
  • Profiling (ruby-prof, perf)
  • High-performance, column-oriented, distributed data store (Druid)
  • Event relay (Kafka)
  • Automated correlation and anomaly detection
  • Data visualization (dynamic dashboards, call graphs, flame graphs)
  • Site reliability engineering
  • Site performance tracking and management
  • Building robust distributed systems that must fail independently of our production system
  • Building high-leverage tools for engineers where engineers are our customers


  • Stock
  • Competitive salaries
  • Quarterly employee travel coupon
  • Paid time off
  • Medical, dental, & vision insurance
  • Life insurance and disability benefits
  • Fitness Discounts
  • 401K
  • Flexible Spending Accounts
  • Apple equipment
  • Commuter Subsidies
  • Community Involvement (4 hours per month to give back to the community)
  • Company-sponsored tech talks and happy hours
  • Much more...

