Responsibilities:
- Install, configure, build and troubleshoot our production servers and services.
- Maintain 100% uptime of the production services.
- Optimize and tune our servers and services for performance, scalability, and maintainability.
- Ensure that our monitoring tools catch and generate alerts on all production issues.
- Resolve issues reported by our monitoring tools, including following through on long-term issues.
- Follow the escalation process through to issue resolution, including providing documentation after each issue is resolved.
- Supervise the junior system administrators to ensure that they are following procedures and completing tasks successfully.
- Perform root cause analysis of production issues and write reports recommending process or technology improvements that will detect future issues more quickly or prevent failures entirely.
- Send periodic NOC reports to management on system and service status.
- Manage backups and disaster recovery, including backup monitoring and verification, and leading restoration tests and disaster recovery drills.
- Serve as the technical escalation point during your shift.
Required skills and experience:
- 6+ years of experience supporting a real-time, 24×7 production web environment.
- Strong written and verbal communication skills; ability to organize and prioritize tasks.
- Experience training and mentoring more junior members of the team, and working with other departments to solve cross-departmental problems.
- Knowledge of Unix scripting (shell, Perl, Python, Ruby, or equivalent languages) for automation and monitoring.
- Ability to identify and configure add-on modules or plugins of open-source tools to effectively automate tasks and monitor production services.
- Strong knowledge of DNS, system build automation, and system configuration management tools.
- Familiarity with server virtualization technologies (Xen or equivalents).
- Experience configuring and tuning monitoring systems (Nagios, Graphite, or equivalents).
- Ability to work in fast-paced environments with weekly release schedules.
Desired skills and experience:
- Knowledge of Redis or other NoSQL databases.
- RHCE certification or equivalent experience.
- Experience with package management (preferably on Debian systems).
- Working knowledge of the following technologies (or equivalents): LDAP, NTP, DNS, FAI, DHCP, Subversion, Git, Chef, MySQL.
- Experience executing multi-week, multi-person projects according to a project plan.
To apply, send your updated resume to resumes@colligate.in, and we will get back to you.
NOC Lead / Sr. System Admin