Lead Site Reliability Engineer
Virtual, IE Cork, C, IE
OPENTEXT - THE INFORMATION COMPANY
As the Information Company, our mission at OpenText is to create software solutions and deliver services that redefine the future of digital. Be part of a winning team that leads the way in Enterprise Information Management.
The Opportunity:
You will join a team of globally located Site Reliability Engineers to design, automate, build, operate, and continuously improve some of the services that back our customer-facing SaaS products. Your focus will be on NoSQL and Big Data technologies such as Elasticsearch, Cassandra, Kafka, Redis, and RabbitMQ, integrated with the underlying IaaS (VMware, AWS, GCP, Azure) and PaaS (Cloud Foundry, K8s, Anthos). You will be responsible for delivering and operating a highly available design that includes security, scalability, monitoring, upgradeability, and data backup and recovery, across non-production and production environments. You’ll work in a fast-paced organization while quickly learning new skills and creating ways to consistently meet service-level agreements for our global cloud services.
The best person for this role is someone who has a collaborative spirit. In our world, it’s not about being a hero and having all the answers; it’s about sometimes saying "I don't know" and working on finding solutions rather than starting with an assumption. The team needs someone who can ask questions, learn from others, and turn chaos into order. This role would be a great fit for someone with creative and innovative problem-solving skills. You will develop and implement solutions that operate at scale. Our teams are empowered and expected to improve our products to truly deliver a reliable experience to customers.
Your Responsibilities Will Include:
- Designing, automating, building, operating, and continuously improving multiple backing services including Cassandra, Elasticsearch, Kafka, RabbitMQ, Redis, and Solr
- Building software and systems to manage infrastructure and backing services for customer-facing OpenText applications through automation tools such as Terraform and Ansible
- Working closely with development and application support teams to design, build, deploy, support, and monitor new and existing deployments, including gathering requirements and documenting the solution
- Identifying tactical and strategic opportunities to improve service health, performance, reliability, and telemetry
- Contributing to capacity planning and management processes
- Supporting the migration of legacy deployments to modernized design patterns
- Supporting and responding to service requests that satisfy our OLAs
- Supporting the incident resolution process for the backing services that we are responsible for
- Participating in training and information sharing activities
- Interacting with third-party provider(s) who provide additional expertise and a layer of escalation support for our services
- Implementing best practices and operating environments for Kafka, helping with topic creation and management, owning and managing the Kafka Schema Registry, helping new teams with Kafka usage, educating teams on Kafka capabilities, and helping teams to adopt new features
- Implementing best practices and operating environments for Elasticsearch, helping with index creation and lifecycle management, shard scaling, index rollover and rollup strategies, and performance tuning
- Implementing best practices and operating environments for Cassandra, helping with node sizing, datacenter and rack topology, replication factor, quorum configuration, and driver settings
- Implementing best practices and operating environments for Redis, RabbitMQ, and Solr
- Acting as backup for other team members when necessary
- Problem solving and finding solutions to resolve issues
- Building repeatable application technology design patterns
- Learning new technology on your own or in conjunction with an online learning platform
- Creating and updating documentation such as operational procedures, change execution plans, and incident write-ups
- May require shift work
- Participating in an on-call rotation is required, as we provide 24x7x365 support
Qualifications:
- Bachelor’s Degree in Computer Engineering or related field
- 8+ years of Information Technology experience, working on large-scale enterprise systems
- 8+ years of experience working within the Linux operating system
- 3+ years of operations experience for one or more of the backing services that we are supporting (Kafka, Cassandra, Elasticsearch, RabbitMQ, Redis, Solr)
- Experience supporting and operating distributed Java applications
- Intermediate knowledge of private and public cloud infrastructure platforms (VMware/AWS/GCP)
- Basic understanding of Java memory management including garbage collection and available GC methods
- Hands-on experience with configuration of monitoring and alerting tools such as Prometheus, New Relic, Nagios, Zabbix, and/or PagerDuty
- Experience with automation or CI/CD tools, such as Terraform, Ansible, and GitLab
- Extremely detail-oriented and meticulous
- Strong written and verbal communication skills
- Ability to thrive in a fast-paced environment working on projects against strict deadlines
- Strong understanding of ITIL principles; certification is a plus
- Ability to diagnose and troubleshoot user-facing service incidents and outages
- Intermediate understanding of data streaming and/or NoSQL technology
- Understanding of availability and performance monitoring tools and concepts
Additional Value-Added Qualifications:
- Knowledge of Kafka concepts such as topics, partitions, replication factor, offsets, consumers and producers
- Knowledge of Elasticsearch concepts such as roles, shards, replicas, indexes, index patterns, index lifecycle management, and aliases
- Basic understanding of Cassandra, RabbitMQ, Redis, or Solr operations
- Application clustering / load balancing concepts
- Understanding of network topologies and common network protocols and services (DNS, HTTP(S), SSH, FTP, SMTP, DHCP, TCP, IP, etc.)
- Experience monitoring cloud services with Dynatrace, New Relic, Zabbix, Nagios, BMC, or any HPE tools
- Awareness and insight into industry trends (technology, methods and tooling)
At OpenText we understand and value diversity in our employees and are proud to be an Equal Opportunity Employer.
Subject to applicable laws and regulations, OpenText’s Global Vaccination Policy requires all employees to be fully vaccinated against COVID-19 in order to enter an OpenText office. Accommodations may be available.