Site Reliability Engineer


Spotlight

Similar Titles

SRE Engineer, DevOps Engineer (SRE), Systems Reliability Engineer, Operations Engineer (SRE), Infrastructure Engineer (SRE), Site Operations Engineer, Production Engineer (SRE), Platform Engineer (SRE), Site Availability Engineer, Reliability Engineer

Job Description

Before DevOps was born, Google had a problem and didn’t know how to fix it. The company was running large sites but needed to improve them and scale them even more. Its solution? Google tagged a team of software engineers to figure it out and from their efforts came the foundation of Site Reliability Engineering (SRE). Today the software giant defines SRE as “what you get when you treat operations as if it’s a software problem.”
 
SRE practices proved so beneficial that other large companies adopted them and, over time, enhanced and extended them. The result is a career field that shares many traits with today’s DevOps but has a few important distinctions. While both sit at the intersection of development and operations, SRE focuses more on automation. Indeed, Google once described the engineer’s purpose as being to “automate their way out of a job.”
 
Different organizations do SRE differently and may call it Production Engineering or Infrastructure Engineering instead. Whatever it’s labeled, at the end of the day it's an engineer's job to be a team player working continuously to improve website reliability, use incident management KPIs (Key Performance Indicators), write code, build services, and automate manual processes. Since sites stay up 24 hours a day, SREs often work on-call to respond whenever they’re needed. 
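
As one illustration of an incident management KPI, the sketch below computes mean time to resolve (MTTR) from a handful of incident records. The incident timestamps and the function name `mean_time_to_resolve` are hypothetical and exist only for this example. Real teams usually pull these figures from their incident tracking tool and may define the metric differently.

```python
from datetime import datetime, timedelta

# Hypothetical incident records as (detected_at, resolved_at) pairs.
# The timestamps are made up for illustration only.
incidents = [
    (datetime(2024, 1, 3, 2, 15), datetime(2024, 1, 3, 3, 0)),      # 45 minutes
    (datetime(2024, 1, 17, 14, 0), datetime(2024, 1, 17, 14, 30)),  # 30 minutes
    (datetime(2024, 2, 9, 23, 50), datetime(2024, 2, 10, 1, 5)),    # 75 minutes
]


def mean_time_to_resolve(records):
    """Return the average time between detection and resolution."""
    if not records:
        raise ValueError("no incidents to average")
    # Sum the individual durations, starting from a zero-length timedelta.
    total = sum((resolved - detected for detected, resolved in records), timedelta())
    return total / len(records)


mttr = mean_time_to_resolve(incidents)
print(f"MTTR: {mttr.total_seconds() / 60:.0f} minutes")  # prints "MTTR: 50 minutes"
```

Tracking a number like this over time is one way a team might check whether its automation and runbooks are actually shortening outages.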

Rewarding Aspects of Career
  • Getting a big-picture view of projects on the job
  • Serving as a vital bridge between teams 
  • Improving processes and helping boost organizational profits
  • Generous financial compensation
The Inside Scoop
Job Responsibilities

Working Schedule
 
SRE is a well-compensated career field, so expect to earn those salaries by putting in full-time hours! As ParkMyCloud explains it, site reliability essentially equates to business availability. In other words, it’s up to Site Reliability Engineers to minimize costly downtime. That can translate into working after-hours or being on-call to respond rapidly to issues. 


Typical Duties

  • Creating or improving software related to operations and support
  • Optimizing and automating processes
  • Ensuring consistent release engineering practices
  • Addressing and minimizing support escalation 
  • Capturing and documenting newly-learned information for later reference, such as by creating runbooks. Preventing “siloing” or hoarding of sharable knowledge
  • Troubleshooting issues
  • Conducting incident reviews (also known as postmortems, retrospectives, or root cause analysis) to determine why a problem occurred without placing blame 

Additional Responsibilities

  • Working on-call for troubleshooting and other incident response issues
  • Ensuring compliance with organizational protocols
  • Creating action item lists to address problems and mitigate future similar issues within the Software Development Life Cycle
Skills Needed on the Job

Soft Skills
 

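  • Ability to foster collaboration between teams
  • Analytical problem-solving
  • Attention to detail
  • Customer service
  • Empathy
  • Flexibility
  • Goal-oriented
  • Highly organized; good time management
  • Investigative and curious
  • Leadership and management skills
  • Objectivity
  • Process-oriented
  • Quality assurance mindset
  • Strong communication skills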

Technical Skills
 
SREs are required to have several skill sets related to the following:

  • Build automation tools
  • Build configuration languages
  • Compilers
  • Databases
  • Distributed systems design
  • Domain knowledge related to system administration, development, configuration management, integration testing
  • General source code management
  • Installers
  • Networking
  • Operating systems
  • Package managers
  • Security
  • Software engineering
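Different Types of Organizations
  • Computer systems design firms
  • Corporations/enterprises
  • Government/military agencies
  • Healthcare
  • Higher education institutions
  • Media and entertainment
Expectations and Sacrifices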

If an organization has a site or sites that are so important they need a Site Reliability Engineer, then expectations are going to run high. According to Netguru, the four main reasons to hire an SRE are to minimize downtime, anticipate and mitigate risks, achieve faster development, and to save money through those and other implemented processes. Clearly Site Reliability Engineers have their hands full, and while they’re trying to juggle the workload they must also keep ahead of changes to the IT world. 
 
Hours can get long when problems occur, not to mention on-call rotations, which mean that even when you’re off, you’re still technically on. Incident response times can be short, and every employer is different when it comes to compensating for work done after hours. Some may grant Paid Time Off, some might give extra pay, and some might offer a hearty “thanks very much” and nothing else.

Current Trends

SRE is still a relatively new concept for many growing organizations, and one trend is that businesses are still figuring out how best to manage it. A major factor driving the push for Site Reliability Engineering is incident resolution, which suggests that companies are simply tired of putting out fires and want to get a better handle on them.
 
Of course, this relieves stress on management by shifting it onto the SREs. This, in turn, can require employers to find ways to keep those stressed-out workers healthy and well, so the workforce can operate at peak efficiency. Some companies do this better than others, but the trend is to recognize the value of taking care of busy workers who are taking care of business!

What kinds of things did people in this career enjoy doing when they were young...

The name “Site Reliability Engineer” gives us a few clues about the type of people who work in this field. They enjoy working on websites, an interest most SREs developed in their youth. They’re responsible for ensuring sites are “reliable,” meaning everything works how it should when it should. Thus workers themselves should be reliable, which is another characteristic often honed in one’s early years. 
 
Such persons like to be punctual and prepared and likely did well academically. Indeed, to be an engineer of any type usually requires strong academic aptitudes, particularly in math and science, of course. One of the interesting things about this field, though, is how many soft skills come into play. 
 
An SRE needs to be a people person, someone comfortable working with teams, and able to foster collaboration between those teams. As a result, they may have held leadership positions in school, or perhaps simply had a lot of siblings to contend with! SREs are efficiency experts, trained to find ways to make things better by studying problems and determining solutions based on their research. This requires a creative yet analytical mindset as if both hemispheres of the brain are working in tandem. It’s possible many SREs are ambidextrous or adept at playing musical instruments. 

Education and Training Needed
  • Site Reliability Engineers need a bachelor's degree, preferably in Computer Science or a related area
  • There isn’t a set path to becoming an SRE. Some workers enter through an internship; others might do a bootcamp, then develop skills in other IT jobs while practicing on their own
    • Ample work experience is a key requirement of most employers (many SRE employees first work in DevOps, sysadmin, or as developers or software engineers)
  • Classes to become familiar with Java, Python, Ruby, or C++, as well as Linux, Kubernetes, and MySQL
  • Courses to build soft skills in English, writing, speaking, teamwork, and leadership
  • Optional certifications include:
    • American Society for Quality’s Reliability Engineer Certification
    • DevOps Institute’s SRE Foundation Certification 
    • CompTIA’s Linux+ Certification
  • Learn on your own by taking courses on:
    • edX - Introduction to DevOps and Site Reliability Engineering
    • Lynda (from LinkedIn) - DevOps Foundations: Site Reliability Engineering
    • Udemy - An Introduction to Reliability Engineering
    • Coursera - Site Reliability Engineering: Measuring and Managing Reliability
      • Note: the same course is also offered on Pluralsight
Things to Look for in a Program
  • Much of what you’ll need to know to be a successful Site Reliability Engineer will be learned outside of your college program!
  • Ideally, look for programs offering courses in the areas listed above
  • Read faculty bios to see what their areas of expertise and backgrounds are
  • What types of student clubs and organizations are available? Many soft and technical skills are most effectively learned through ample peer interactions
  • Ensure the school is accredited
  • Look for programs that publish post-graduation job stats and have a solid track record
  • Weigh the pros and cons of enrolling in an online program. On-campus engagement is very beneficial for building soft skills, so a hybrid program can sometimes be a good compromise
List of Programs

U.S. News & World Report’s Best Computer Science Programs can help you get started, but don’t rely only on one ranking. You don’t want to miss out on good opportunities, so we recommend considering lists such as Great Value College’s 50 Great Affordable Colleges for Computer Science and Engineering for 2020 or Best Value School’s Top 25 Computer Science Programs With the Best Return on Investment. 
 
College can get outrageously expensive, but keep in mind that many employers are very practical. They may be more interested in your hard technical skills than which school you graduated from. In other words, simply having a degree from a costly private college isn’t going to guarantee a job in this line of work. Focus on taking specific classes needed to build skills, and get as much hands-on experience as possible. 

Things to Do in High School and College
  • As mentioned, there’s no single path to becoming an SRE, so map out a few options
  • Look at job postings from companies you’d like to work for. Pay attention to required work and academic experiences, then reverse-engineer a career path to get there
  • In high school, build a solid foundation by taking as many IT electives as possible
  • Get as much hands-on skills practice as you can! Take courses related to the items in the Education and Training section above
  • Don’t forget to work on your writing. Technical writing is important but you’ll also need to translate complex topics into layman’s terms
  • SREs need good teamwork and leadership skills. These are often neglected traits you’ll be expected to have later, so look for ways to develop them early on
  • Nothing beats having an experienced mentor so reach out to alumni or faculty for advice 
  • Teach others. Teaching facilitates new learning experiences for both parties
  • Read and join discussions on Quora, Reddit, Dev.to, and other sites
  • When your skills are good enough, get some paid experience on Upwork
  • Find internships on Indeed, or ask your college program if they have opportunities
  • Be a leader in IT-related clubs, and build a vast network of peers and associates!
Typical Roadmap
Site Reliability Gladeo Roadmap
How to Land Your First Job
  • Put the word out! The majority of jobs are now found through networking
  • Take the TripleByte DevOps screening quiz. If you pass, you’ll get interviews with employers in their network
  • Look for openings on Indeed, Monster, USAJobs, ZipRecruiter, LinkedIn, and Glassdoor
  • Find out what employers look for! Usenix has a downloadable .pdf listing insider tips on hiring SREs
  • Some employers train their SREs internally, so you may want to start out in one job but with a plan to work your way up within the company
  • Get an internship. They don’t always pay well but you’ll get your foot in the door and they can lead to full-time jobs
  • The jury is out on how useful job fairs are, but industry-specific fairs can certainly give you some exposure to what opportunities exist and offer a chance to chat with workers
  • Have your resume in order. Job Hero has some great Site Reliability Engineer resume templates to steal ideas from
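  • Have a professional resume writer (or editor) review your document to make it the best resume it can be. But remember to tailor each resume to the specific job you’re applying for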
  • Study GitHub’s massive database of resources and interview questions!
How to Climb the Ladder
  • A lot depends on the size of the organization. Some companies promote from within; others might want external candidates. Discuss promotion opportunities with your supervisor early on
  • Be proactive. Train yourself, take courses, keep learning. When there’s a new trend in technology, find out everything you can about it and be a subject matter expert
  • Display loyalty to your company and become a trusted, invaluable asset worthy of increased responsibility. Behave in a manner that indicates you’re ready to advance
  • Always remember the soft skills. Even the most technically-skilled employee will have a hard time moving up if they don’t get along well with others
  • Be a boss. Show your competency and leadership potential. An SRE must be able to direct others in a collaborative but decisive (and when needed, firm) fashion
  • Prove you are reliable. Be punctual, and if you’re on-call respond to the incident quickly, perform the work diligently, and find ways to mitigate future similar problems 
Recommended Resources

Websites

  • Advanced Bash-Scripting 
  • Awesome Python 
  • Beej’s Guide to Network Programming 
  • Command Challenge 
  • Cyber Aces 
  • DevOps BootCamp 
  • DevOpsDays
  • Eli the Computer Guy 
  • Git
  • Git Immersion 
  • Intro to SQL: Querying and managing data
  • Katacoda
  • MIT’s Operating System Engineering
  • MongoDB University 
  • Ops School
  • Over the Wire 
  • Puppet Learning 
  • SQLZOO 
  • SREcon 
  • SRE Weekly
  • Sysadmin Casts 
  • The Big Blog Post of Information Security Training Materials
  • The Geek Stuff
  • The Google SRE Book
  • The Open Guide to Amazon Web Services 
  • The System Design Primer 
  • The Unix Workbench 
  • Unix Toolbox 

Books

Plan B

Site Reliability Engineering can be a thrilling career field with a ton of responsibility. However, the path to breaking in is not always cut-and-dried. Many people start off in other areas, and sometimes they end up staying in those areas. A few “Plan B” job options include:

  • Back-End Developer
  • Computer and Information Systems Manager
  • Computer Programmer
  • Computer Support Specialist
  • Computer Systems Analyst
  • Database Administrator
  • Process Management (DevOps)
  • Front-End Developer
  • Full-Stack Developer
  • Information Security Analyst
