Notes collected while reading The Practice of System and Network Administration (2nd Ed)

Kept on wikidot for reachability and formatting; a single note is less flexible.
Still keeping it plain-text friendly.

remarks

- writing down one's personal interpretation of some input and sharing it
  __is very helpful__, I think, because from several shared thoughts one can derive new ones.
  So please share!
- I will not take notes on everything, only on the things that were highly useful
  from my viewpoint, which is just one viewpoint.
- the reports are 'colored' with my interpretation. I often found that reading
  books about the military economy and military decisions (or problems "immediately
  behind the frontline") of the second world war is extra helpful for gaining
  insights about how (complex) organizations behave and what leads to their
  mistakes and successes. It seems totally disconnected from IT maintainers but it is not,
  because in the end we are talking about human organizations with tools.

  For example "keeping only the short-term projects,
  closing the long-term ones" (Germany 1940-1941) kills the ability
  to be competitive in the long run. I'm not sure, of course, whether my interpretations
  derived from ww2 history (or from IT oriented articles) can be applied in
  other domains, I have not strictly proven it so far, but I feel that yes, they can.
- since it is a reference book, I won't read everything. What I do not read should
  be checked again and maybe I should mention it, but for now I assume that I will
  recognize material I have read but do not find in my notes (like this file),
  and therefore will not write about it.
- furthermore, the more experience I accrue, the higher the chance I may change my view
  on some chapters or topics; as with every non-trivial piece of information,
  reviewing it over time won't hurt!

preface

- basic principles:
  simplicity: keep things simple, otherwise it is hard to do/maintain/understand them.
  clarity: keep things easy to explain, people do not read minds.
  generality: solutions should be reusable.
  automation: if possible, automated steps help the human focus elsewhere.
  communication/expectations:
    the work is between humans, for humans; machines are helpers,
    not the final goal.
  basics first: without proper foundations, sooner or later big problems
    will arise. Technical debt.

chapter 1 what to do when

- a collection of very short checklists that remind the reader what one could do
  when some action (normally big, like moving an office) should be done,
  and where to find more details in the book. Pretty nice.

chapter 2 organize

- use a trouble ticket system:
  . if there is a procedure (with a tool if possible) that allows operators
    to handle requests in order without being overwhelmed, it is better.
    Otherwise the overview of the requests, which are likely to be more than
    the team's ability to process them, will be lost, and with it the capacity to
    process those requests with higher effectiveness (not efficiency).
- manage quick requests right:
  . for a group of sysadmins, but also for a solo SA,
    try to channel all the small requests and interruptions to one person
    (or a smaller group of SAs), or try to schedule a time in the day when
    small requests can be made, otherwise they are deferred to the next day.

    This will enable the others to focus on their tasks instead of suffering a
    higher amount of interruptions, which are not helpful for complex tasks
    that require focus.
- adopt time saving policies:
  . define procedures and terms so SAs are not abused and overwhelmed.
    How does one get help? (procedure)
    What is an emergency? (priority)
    What is the SA's responsibility? (scope, otherwise they do everything)
- standardization:
  . The book talks about "putting every new system in a known state" but I interpret
    it (through my "experience glasses") as standardization. The more
    standardization one has for recurring tasks, automated if possible,
    the greater the benefits, because one knows how to support the standard
    without the trouble of analysing every possible non-standard
    situation. So a standard way to deploy the OS, the applications and the
    configuration over them makes things easier.
- keep things running for the "important persons" to gain benefits.
  . For example, a running mail system for the management allows easier resource
    allocation, because the core IT for management shows no problems.
- documentation
  . checklists (they help lower mistakes and ease delegation). Labeling resources
    helps other people help out even if they do not have domain specific skills;
    it also speeds up operations in case of changes or problems.
    Documentation helps minimize the time spent searching for solutions
    if the problem happens again in the future, especially for core procedures/systems
    not handled so frequently.
- fix the biggest time drain
  . if something is urgent and time consuming but does not get done because other,
    smaller problems keep interrupting the team: split the team. Part of the team
    works on the big problem while the others work on the smaller requests.
    This is because if the bigger problems do not get solved, they could cause
    macro problems with a fatal impact on the organization.
- prioritize for the short term
  . a team cannot handle all big problems at once. Therefore prioritize
    the solutions that can help a lot immediately and defer the others without
    forgetting them. Of course, do not pick only short-term
    solutions, because doing so leads to fatal problems in the long
    run for the company.
- do not forget physical requirements
  . like cooling, and power during power outages, for servers. Do not underestimate
    physical factors while working with something that seems not so related to the
    physical world but, in fact, is.
- monitor your resources
  . avoid being caught by surprise when core resources (servers) go down
    unexpectedly. Even a simple monitoring system helps.
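
  As a minimal sketch of such monitoring (in Python; the hostnames and ports
  below are hypothetical placeholders, not from the book), a script can
  periodically test whether core servers still accept connections:

```python
import socket

def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_all(servers):
    """Return the (host, port) pairs that are currently unreachable."""
    return [(h, p) for h, p in servers if not is_reachable(h, p)]

# Hypothetical core resources to watch; replace with your own.
CORE_SERVERS = [("mail.example.com", 25), ("www.example.com", 80)]
```

  Run it from cron and alert on a non-empty result; even this beats learning
  about an outage from the customers.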

chapter 3 workstations

- workstations
  . an IT system that enables a single person to work (let's say
    his entry point for working with the IT infrastructure).
    The IT system has a lifecycle, it is not eternal (so far).
    SAs should maximize the time the system stays properly configured (according to
    standards) to let the person work properly and receive
    support and quick fixes. The chapter explains concepts for
    keeping a system in the configured state.
    Nevertheless, due to the lifecycle of IT systems, organizations should
    have a plan ready to slowly replace IT systems, avoiding
    being caught off guard with slowdowns in their IT infrastructure or
    even critical problems due to aging systems.
    And, of course, to keep the system in the configured state without
    collecting too much entropy, users/SAs should be prevented from
    easily changing its configuration, maybe through permissions and policies.
- installing the system
  . be sure that the system is installed in a way that strictly respects
    the standards, otherwise more work will arise from the non-standard
    configuration of freshly installed systems.
    Moreover, if the standardization of a system is automated as much as possible,
    it will save time, because even little interruptions to check the
    state of the installation can be very limiting for focus and productivity.
    The installation should also report when it is finished, so it is not forgotten
    for too long.
    Of course, one of the best tests for automation or checklists is that
    even people not so familiar with the task can complete it (this means:
    better delegation, less troubleshooting, less small but constant work, higher standardization).
- automation is a form of documentation, but not the only form.
  . Because if someone not familiar with an automated procedure starts it,
    he should know how to check it.
- stop-gap measures
  . to prevent stop-gap solutions from becoming permanent, create a solid reminder
    (in a ticket, backlog, calendar, etc..) to remember them in the future and
    to find a better solution (if possible).
- checklists
  . they are very helpful for recurring tasks that may also be delegated, or
    as documentation for our future selves or for new people in the technical group.
    Assuming that (a) people cannot share their minds without some other medium
    and (b) the memory of a person is not extra reliable, checklists help
    keep consistency with the standards, or the checklist itself can be the standard.
- installing updates or changes
  . again, automation (and documentation, or partial automation plus documentation)
    saves a lot of effort. Imagine doing parts of an often-repeated activity manually,
    when letting systems do it saves much effort, apart from the control
    through checklists and quality procedures.
    Moreover, updates and changes should be done remotely when possible.
- apply/distribute changes gradually.
  . do not start distributing changes to every system at once. Start small,
    then move in batches and, when the change procedure is tested well enough
    to be reliable, distribute to the remaining systems.
    This avoids general problems when a change is applied but leads to crucial
    errors: those errors can be fixed on a small number of systems before
    they take down the entire infrastructure.
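
  The "start small, then batches" idea can be sketched as a wave generator
  (a hedged illustration of mine; the wave sizes 1 and 5 are arbitrary choices,
  not from the book):

```python
def rollout_waves(hosts, first=1, some=5):
    """Yield hosts in 'one, some, many' waves: apply the change to one host,
    then to a small batch, then to everything that remains."""
    remaining = list(hosts)
    for size in (first, some):
        wave, remaining = remaining[:size], remaining[size:]
        if wave:
            yield wave
    if remaining:
        yield remaining
```

  Between waves one would verify the change and stop (or roll back) on errors,
  so a bad change never reaches the whole fleet.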
- expectations and roll backs
  . communicate expectations, plans and timelines, since the activities are
    beneficial for humans in the end, to align expectations and keep a better mood
    between organizations.
  . prepare for a roll back scenario or a halt scenario. What is the procedure if
    things go wrong?
- it is not only about task automation; process automation/standardization is also important
  . a task can be automated, but a process can also be properly standardized to
    make the organization more efficient; of course, keep in mind the loss of
    flexibility if the process becomes too "rigid".
  . try to standardize together with the customers, to fit their needs.
- physical systems can be standardized too
  . like which vendor to use for servers, which improves, over time,
    the team's effectiveness with a certain type of hardware.

chapter 4 servers

- consider specialized server appliances
  . instead of building your own solution, server appliances (even an IP KVM)
    can free a person from caring about a custom solution and let him rely on the
    vendor's solution, which has more focus and experience in most cases.
    Therefore it is cost/effort effective.
- redundancy of servers
  . one can have servers with n+1 equipment, so if one module fails the others
    keep the server up.
  . one can have two or more servers, so if one server fails the others can
    substitute for it (this requires that the spare servers are properly configured).
  . one can have two or more servers actively used, so if one fails the others
    can absorb its load.

chapter 5 services

- services
  . in short, services are why the IT infrastructure exists: to provide
    (useful and needed) services to customers.
- focus on objectives and needs
  . when designing the infrastructure to provide a service, focus first
    on the goal to achieve and the need to fulfill, not on how
    to do it, otherwise there is the risk of losing
    the overview and solving problems other than the ones
    actually in need of a solution.
- security and the weakest link
  . an infrastructure providing services is, as a rule of thumb, as secure
    as its weakest part. If one is able to gain access to its weakest part,
    one can slowly work toward gaining access to other services and data.
- keep elements, systems, services simple
  . a server or a system should mostly provide only one service, to make
    fixes and debugging easier. Together with other systems the IT infrastructure
    can be complicated, but the single element should be simple, to avoid
    long recovery times caused by long analysis of a problem.

    A system could also be a service itself, not only a server machine.

    The service itself should be as simple as possible, to limit the work when
    one has to debug errors or change the service itself.

    Even if dedicating a system to only one service seems a waste of resources,
    it is not, because those additional resources lower the resources needed
    (manpower, time, etc..) for debugging, expansion, shrinking, capacity planning,
    monitoring, maintenance, etc.. of the service on that dedicated system.
- to improve the reliability of services, employ standards
  . the more standards used to deploy a service, the higher its reliability and
    maintainability. Otherwise further work is needed to handle
    particular configurations. Standard configs, documentation standards,
    standard hardware, etc...
- define expectations with customers
  . while asking about their needs to design proper services, also define what they
    can expect, to avoid creating false expectations and friction or
    disappointment.
  . be patient, because the customer tries to get the most out of an agreement
    but is also not familiar with the skills needed, otherwise he would be
    an SA. Try to be polite and inform him of what is possible and what is not.
- once focused on objectives and needs, try to think about future operations
  . assign proper resources to the service and also try to anticipate future
    changes (upgrades, scaling, etc..) to be a little bit prepared for them.
    Otherwise there may be more work in the future.
- try to use open standards
  . this is because closed standards make the infrastructure depend on one vendor
    and its changes; open standards instead allow (when used) several companies to
    provide their products, and those should be interoperable, allowing flexibility
    and lower costs.
    Also beware of extensions of standards that, in a few words, create a new
    standard incompatible with the original one.
- vendor relations
  . to read
- service access through naming
  . services can reside on servers/systems, but they should be accessed through
    names independent of the server. So if the service is moved to another server,
    the name of the service stays the same.
    One important thing is that the name of the system the service runs on should not
    include the name of the service, otherwise there is confusion:
    if the service is moved to another system, the name for accessing the service
    could collide with the name of the old system.
- reliability and the weakest link
  . an infrastructure providing services is, as a rule of thumb, as reliable as
    its weakest part. If that part goes down, the entire infrastructure may be disrupted
    together with the service.

    that part could be a server system, another service, the network, the provision
    of power, etc... Anything related to the infrastructure of the service.
  . the infrastructure providing a service should be reliable and have compact
    dependencies, if any. Otherwise, in case of failure, one dependency can cause
    other systems to fail in cascade. When dependencies are compact (localized
    logically, and maybe also on the same type of system and in the same physical
    place), debugging them is easier; otherwise longer recovery times (if recovery
    is possible at all) are needed.
  . in general, the simpler the service infrastructure (and therefore the service),
    the higher the reliability one can expect, and the higher the chances of
    proper recovery from errors.
- restricted access
  . to read
- services and performance
  . a service may be useful, but if it is not performant people will find
    workarounds or will stop using it unless it is absolutely needed.
    Therefore be sure that the service infrastructure has enough performance to
    handle the requests. For example, at the start it could be oversized to
    handle whatever peak or bad configuration may occur, and then it can be
    refined to avoid permanent oversizing without creating performance bottlenecks.
    Another approach is to provide the service to limited batches of users,
    using the idea "one, some, many", to see how performance scales over time
    and adapt it to the new requests.

    One can also properly communicate the performance expectations so as not to
    have disappointed customers. The first impression, unfortunately,
    counts a lot with the customer.
- service monitoring
  . obvious here: if the customer is the one always reporting
    major problems, he will not feel properly cared for.
    Instead, with a monitoring system (if possible automated), one can
    see problems before the customer does; the effect of knowing
    that support is already working on a problem the customer has
    just noticed is way more effective for good customer care.
- service rollout
  . rolling out a service not properly prepared, in terms of testing,
    documentation for customer and support teams, etc..., is counterproductive,
    because the customer will get a bad impression and not so good expectations.
    Therefore it is better to start small but properly, so the customer feels
    that the service is well supported and not just offered to make money.
- redundant dedicated systems for services
  . while keeping the service infrastructure simple, having redundant infrastructure
    helps limit the downtime, and the pressure on the support team during outages,
    and can save the day for mission critical services.

    If one uses "load balancing" as the redundant system, one should also
    foresee what happens if one or more systems fail: whether the other machines
    can handle the load or get overloaded (failing as a consequence).

    One way to select which services should be redundant is based on
    which services are more important and on their likelihood of failing.
- dataflow analysis for performance scaling
  . a way to plan the scaling of performance is to model the dataflow of a service
    (all the operations, activities and resources used by the infrastructure for a
    certain load of a service) and then scale it to the forecasted amount of load.
    It may involve a bit of statistics (average, median, mode or percentiles),
    but the point is: if one knows the amount of resources used by every
    single use of the service, one has higher chances of better forecasting the
    resources needed by a certain amount of concurrent uses of the service.
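
  The per-use scaling argument can be written down as a tiny calculation
  (a sketch of mine; the resource names, numbers and the 25% headroom are
  illustrative assumptions, not from the book):

```python
def forecast_resources(per_use, concurrent_uses, headroom=1.25):
    """Scale measured per-use resource consumption to a forecast load.
    headroom adds a safety margin on top of the linear estimate."""
    return {resource: amount * concurrent_uses * headroom
            for resource, amount in per_use.items()}

# Hypothetical measurements for a single use of the service.
one_use = {"cpu_ms": 40, "ram_mb": 12, "net_kbps": 64}
needed = forecast_resources(one_use, concurrent_uses=500)
```

  A linear model like this is only a first approximation; percentiles of real
  measurements give a safer per-use estimate than the average.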

chapter 6: data centers

- data centers are not simply rooms with computers
  . a data center stores the core IT infrastructure of a company, either in
    a proper room or in a separate building, and should
    be organized accordingly. One cannot expect good performance
    (not only in terms of reactivity, but also lower cost of maintenance,
    reliability, etc...) when the part of the infrastructure where the
    data center is located is not functional.
- delivery dock for equipment
  . do not underestimate the need for a dedicated access through which to receive
    deliveries of equipment, which is often heavy and bulky.
- security
  . important but not read so far
- cooling
  . cooling is crucial for systems that produce heat and could be
    stopped by the very heat they produce.
    Therefore it has to be properly sized. A rule of thumb: for
    every watt of energy absorbed, assume at least one watt of cooling
    (better more than one).
    Moreover, care should be taken in organizing the cabling and other
    equipment so they do not disrupt the airflow. Hot air should go outside
    while cold air should come into the room.
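
  The watt-for-watt rule of thumb is easy to turn into a sizing helper
  (the 20% extra margin is my own assumption for "better more than one";
  the BTU/h conversion factor 3.412 is the standard one):

```python
def cooling_needed_watts(it_load_watts, margin=1.2):
    """At least one watt of cooling per watt absorbed; margin > 1 adds headroom."""
    return it_load_watts * margin

def watts_to_btu_per_hour(watts):
    """Cooling equipment is often rated in BTU/h; 1 W = 3.412 BTU/h."""
    return watts * 3.412
```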
- power outages
  . the author collected studies showing that power outages are
    either very short, like under a minute, or last a not so small fraction of a day.
    Therefore, once the systems in the data center are configured to shut themselves
    down when the UPS is getting low on battery (data centers or server rooms
    without a UPS mean that the IT infrastructure is not so important for the
    company, and the company has to accept outages), if the outage lasts longer
    than one hour, one can send the staff home.
    Hence it is not so economical to buy big UPSes able to last for
    hours with a lot of load, if it is possible to shut down
    the systems properly.
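
  In practice, tools such as apcupsd or NUT implement this shutdown-on-low-battery
  behaviour; the decision logic itself is simple (a sketch; the 20% threshold and
  the one-hour limit are example values, the latter taken from the note above):

```python
def should_shut_down(on_battery, battery_percent, threshold=20.0):
    """Start a clean shutdown when running on battery and the charge is low."""
    return on_battery and battery_percent < threshold

def send_staff_home(outage_minutes, limit_minutes=60):
    """Outages are either very short or very long, so after about an hour
    it is unlikely that power returns soon."""
    return outage_minutes >= limit_minutes
```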
- UPS
  . while a UPS is crucial for a data center, if it fails it can bring down the
    data center with it. Therefore always have the possibility of switching the
    power from the UPS to a normal power source.
- power and cooling
  . skimmed, to read again.
- fire suppression
  . to read
- racks
  . do not underestimate racks, they shape all the rest of the possibilities
    (as well as cooling and power resources).
  . apart from the standard of the rack (the height in units and so on), consider
    cage-nut removal and insertion tools that facilitate the work and avoid
    small injuries that can stop the work.
  . skimmed other parts
- cabling
  . cable management is important. Having tidy cables allows quicker and more
    precise work without the risk of disturbing other equipment that depends on
    the cables.

    prewiring a rack (with documentation of how the future components have to be
    connected) helps with future installations, instead of having the first
    installations easy and the later ones hard and time consuming.

  . remember that high quantities of electricity cause electromagnetic
    interference, so run network cables and power cords separately.
  . documentation and labeling are crucial to be able to take cables away and put
    them back quickly. In general, documentation avoids the time consuming and
    error prone operation of reconstructing the cabling connections in the
    operator's mind from scratch.

    Of course the documentation has to match reality and should contain as few
    inconsistencies as possible.

    Documentation is crucial when debugging an error or recovering from outages
    or mistakes.
- rest of the chapter
  . to read

chapter 7: networks

- brief summary:
  . idf, mdf:
    wiring closets with patch panels, switches and network components around the buildings.
  . the chapter gives hints on how to design, locate and connect those idf with the mdf.
  . hints about cabling, looking to the future.
  . hints about cabling: it is useful for troubleshooting to document the ends of
    the cables, especially if the cables are bundled tightly.
  . it is suggested to use network hardware and not general purpose computers,
    especially not one with many services.
  . it is suggested to monitor the status of the network hardware.
  . it is suggested not to mix vendors for hardware, because it requires more time
    to train SAs on different vendor equipment.
  . it is suggested to avoid closed protocols, because then one is locked in with
    the group of vendors that support that non-standard protocol.
  . it is suggested that policies and administration of networks be handled by
    one team, or by many teams that coordinate with each other well, otherwise a
    mess is likely to arise.
  . better to use reliable and tested technology than too-new technology that has
    subtle errors to be debugged.
  . simple setups on the clients. Clients should not have complicated network
    rules, otherwise debugging problems is more demanding.
  . be aware of which points in the network are invariant (stable for some 5-10
    years) and which are variant (changing more frequently).

chapter 8: namespaces

- definition
  . for the book, the definition of a namespace is: a set of unique keys and
    related attributes.
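
  That definition maps directly onto a tiny data structure (a sketch only;
  the class and method names are my own, not from the book):

```python
class Namespace:
    """A set of unique keys, each with related attributes."""

    def __init__(self):
        self._entries = {}

    def add(self, key, **attributes):
        """Keys must be unique: refuse duplicates instead of overwriting."""
        if key in self._entries:
            raise KeyError(f"{key!r} already exists")
        self._entries[key] = dict(attributes)

    def attributes(self, key):
        """Return the attributes related to a key."""
        return self._entries[key]
```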
- crucial
  . namespaces seem overvalued, but they are crucial. When resources (of whatever
    type, from files to websites to hard drives, IP addresses and so on) are
    labeled but can be confused due to a not properly defined or documented
    namespace, the confusion can lead to problems, especially under pressure,
    when the shortest time to fix a problem is wanted.
- namespace policies
  . a namespace should be controlled by written policies, so they are clear, not
    easy to twist by oral tradition, and easy to revise.
    Policies are not technological systems.
  . namespace policies should answer questions like: which names are allowed and
    which are not? Can names be renamed? etc.
    Policies are subject to constraints from technological limits and from
    external policies like DNS policies.
  . the suggested naming is: name something to maximize debugging information:
    which service, which location, etc... Use aliases to map easier names
    to resources (and to move the 'easy' or 'obvious' label from one resource
    to another). For example mail-server-ny can have the alias mail, but the real
    name conveys the service on the machine, the type of the machine and the
    location of the machine.
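
    The scheme can be sketched as a helper plus an alias table (the
    service/type/location layout follows the book's mail-server-ny example;
    the code itself is my illustration):

```python
def canonical_name(service, machine_type, location):
    """Build a debugging-friendly real name: service, machine type, location."""
    return f"{service}-{machine_type}-{location}"

# Aliases carry the 'easy' label and can be repointed when a service moves.
aliases = {"mail": canonical_name("mail", "server", "ny")}
```

    Repointing the alias (in DNS, typically a CNAME) moves the 'easy' name to a
    new machine without renaming anything else.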
  . namespace policies could contain access control policies, for example:
    who is allowed to add, rename or delete keys ?
  . namespace policies could include the possibility to rollback a change through
    a sort of versioning.
  . the namespace scope defines where the namespace will be used: by geographically
    distributed entities (an office, branches in a country, everyone on the
    internet, etc..) and by services (also called scope thickness).
  . consistency between namespaces: should the same resources keep their
    attributes across different namespaces?
  . what about the reuse of names? How much time should pass before a name is
    reused? Does reusing names create dangerous confusion?
- namespace procedures
  . procedures define how to change, add, delete, renew, expand, etc... namespaces
    or names. As usual, documenting them is helpful for new employees, to lower
    misunderstandings, etc...
- centralized management
  . if possible, having centralized management of namespaces makes things
    easier to maintain, enforce, keep consistent and so on. Easier means less
    cost in case of change, debugging or training.
    For example, have a server that
    distributes copies of the actual namespaces (for example dns names) to
    production servers. The same server should have documentation and tools to
    facilitate the management of the namespaces.
  . if possible use databases, which help keep consistency high.
- further automation
  . further layers of automation can be built to combine previous layers of
    automation or to let customers interact with namespaces themselves. For
    example, a manager could have access, through a proper interface, to the
    commands to create new users or to allocate resources labeled in a namespace.
- one namespace to guide all the services
  . if the namespace is powerful enough (in terms of consistency, value, data,
    maintenance, api, etc..) it can guide all the other services. For example,
    a single account for many services.

chapter 9: documentation

- what to document?
  . it seems that being selfish helps one stay motivated to write good documentation.
    Tasks that you do not like, tasks that keep you on edge during vacation,
    tasks that are recurring and boring, etc... Document those to make your life
    easier, because documentation is about that: making life easier.
    To appreciate documentation, consider that without documentation society
    would progress way slower; it means that the more (meaningful) documentation
    there is, the higher the chances for the team to progress to more challenging
    tasks.

    Another category of things that could be documented are those that could
    lead to unpleasant situations in case of mistakes. Documentation lowers those
    mistakes.
- checklists are also documentation
  . especially for procedures that are not so engaging, or are complicated enough
    that one could skip a step, implement a checklist as documentation of how to
    do the procedure.
- templates for documents.
  . documents too can have templates, because structure helps to convey
    information, and refining an existing structure (instead of creating a new
    one each time) helps. But the structure can only be refined if there is a
    standard set of templates.
  . a possible template: title (describing the content); metadata (notes
    about the document, who the contributors are and how to give feedback about
    the document); what (what this document is about, what one can achieve with
    it); how (how to accomplish the goals); why (not mandatory: why something
    happened).
- proofreading and testing
  . proofread the document to see if it is comprehensible, and follow the steps
    yourself to see if the document describes the procedure effectively,
    without using your own memory to fill the gaps.
    Then have a peer test the document and give feedback.
- verbose and summarized documentation versions
  . a document can have two versions available for readers:
    one long and verbose, where things are explained in detail, and one
    shorter, quick to follow for those who have already read
    the verbose document and are following the steps again.
- additional ways to make documentation
  . take screenshots and explain them briefly with text.
  . take the history of a command line session, clean it up and add brief
    text descriptions.
  . if a task was explained by mail, those mails can be organized into documentation.
  . take notes from the ticket system regarding a problem and start from those
    to create documentation.
- remember, before automation: documentation
  . if one cannot repeat a task consistently from documentation, it cannot
    be automated, because one does not know how to do the task.
    Besides, automation is a sort of very rough documentation for those able
    to read the automated procedure.
- checklists are great
  . an effective form to write documentation for a procedure is to write checklists
    that explain (or remind) how to do steps in a procedure.
- central repository for documentation
  . collaborating in a team by sharing documentation helps a lot, because
    otherwise everyone has to write his own procedures and there is less
    chance of finding common standards and procedures. Moreover, peer review
    always helps.

    One repository for documents is a wiki-like system.
  . the documentation repository should provide a search feature, otherwise
    over time the content will be hard to find and update.
- documentation is useful only if people use it
  . otherwise it is a service that benefits only its author. So properly convince
    people to use it.
- documentation manager
  . without a manager or supervisor, either the people contributing to the
    documentation get self-organized quickly, or the entropy in the documentation
    will quickly grow and defeat its usefulness. Therefore there should be at
    least one person responsible for keeping the documentation usable, properly
    formatted and so on.
- structure
  . do not enforce too much structure at the start of the documentation repository.
    The structure will emerge slowly, or one can always use a staging area for
    new documentation, to be ordered slowly.
- respect the contributions of everyone
  . while maintaining a certain structure to keep the usability of the
    documentation high, try to avoid stopping people from adding further
    documentation. One could be intimidated by this and not contribute.
    One way could be: create a directory in the document repository for new
    content, or a good guide about how to document something; in the worst
    case, mistakes can be fixed through revision control.
- examples of documentation repositories used by SA organizations
  . skimmed

chapter 10: disaster recovery and data integrity

  . to read, it is crucial to lower stress
  . read but not summarized, to do.

chapter 11: security policy

  . to read later
  . check authorization matrix.

chapter 12: ethics

  . to read later
  . read privileged access

chapter 13: helpdesk

  . to read

chapter 14: customer care

- what is customer care about?
  . It is about customer requests. A customer has a need and, possibly, wants
    a solution. Taking care of the interaction between customer and
    operations team is important, because the interaction is part of the
    solution (or is itself the answer, when no viable solution exists). How
    an operations team handles the interaction will determine the future
    attitude of customers towards it. There is no future for poorly handled
    interactions.
- a model for solving customer requests
  . One well-tested method for handling customer requests is the following:
    > greet the customer on contact
    > classify the problem
    > get a proper statement of the problem
    > verify that there is a problem
    > propose solutions
    > select a solution
    > apply the solution (execution)
    > verify the solution
    > customer verification of the solution

    this method was tested and helped to reduce mistakes and improve the effectiveness
    in a number of organizations.

    The method is not one-way: it can be iterative and the flow can go
    backwards.

    For example a problem could be misclassified, and so has to be
    classified again. Or the verification shows that the problem is still
    there, so it has to be re-analyzed and fixed, or just fixed.
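
The steps above and their backward jumps can be sketched as a tiny state
machine. The step names and the fallback mapping are hypothetical, just to
make the iterative flow concrete:

```python
# A minimal sketch of the customer-request flow, assuming each phase
# reports success (move forward) or failure (jump back to an earlier phase).
STEPS = [
    "greet", "classify", "state_problem", "verify_problem",
    "propose_solutions", "select_solution", "execute",
    "verify_solution", "customer_verification",
]

# Where to jump back to when a phase fails (hypothetical mapping).
FALLBACK = {
    "verify_problem": "state_problem",         # cannot reproduce: restate
    "verify_solution": "propose_solutions",    # fix did not work: new proposal
    "customer_verification": "state_problem",  # misunderstanding: start over
}

def next_step(current, succeeded):
    """Return the next phase, moving backwards on failure."""
    if not succeeded:
        return FALLBACK.get(current, current)  # retry if no fallback defined
    i = STEPS.index(current)
    return STEPS[i + 1] if i + 1 < len(STEPS) else "done"
```
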
- handling customer requests with proper tools
  . one very helpful tool for tracking and solving customer requests is
    trouble-tracking software. If your organization does not have one,
    set one up.
- greeting the customer
  . The greeting phase shows the customer how to get help (the customer
    should already see or know where to ask for help) and responds in a
    positive and friendly way, increasing his willingness to accept the
    solution.
- identify the problem
  . In this phase the problem identification is made in three steps:
    first the problem is classified, then stated, then verified (is it
    really a problem?).
  . the classification of the problem determines who is going to handle it,
    based on several factors (technical knowledge, customer knowledge, and
    so on). If the classification is not straightforward, ask more questions
    until it is clear who can handle it.
  . After every step, to keep the customer involved without letting him feel
    "lost", always communicate the current status, like "X is going to
    handle your problem, since it seems like a problem of Y". This feedback
    also helps the customer provide more input when the classification does
    not sound correct to him.
  . statement of the problem. In this step the person in the operations team
    should collect the proper information from the customer, helping him
    (because the customer may not be an expert)
    to state the problem in a way that raises the chances of finding a
    solution. For example, information useful to reproduce the problem is
    often valuable, as is the timeline of events: when was the problem
    first detected?

    Having checklists to collect information about the problem, to state
    it as well as possible, helps operators not familiar with the
    problem area.
  . Then the problem should be verified. One step to verify the problem is
    reproducing it. If it cannot be reproduced, maybe it was not
    communicated or stated properly. Reproducing the problem helps the team
    understand it better and solve it more effectively; without reproducing
    the problem the team does not know if it is working on the right issue.

    Of course how to reproduce the problem has to be documented for later
    use; one never knows when such documentation will shorten solution
    times.

    If during the debugging of a problem another, non-critical one is
    found, it has to be documented for later resolution but set aside for
    the moment, so as not to slow down the fix of the current problem.
- design, plan and implement solutions
  . Solution proposals. Proposals have to be collected (in an appropriate
    manner, without trying to collect all of them), if possible from the
    experts in the field of the problem, or through brainstorming.

    If the proposed solutions are not cost effective, the search for
    solutions can be escalated to other members (with more experience or
    skills?) of the staff.
  . once proposals have been identified, they have to be selected for
    implementation and, if more than one is selected, prioritized to avoid
    applying several at the same time. Of course the customer should be
    involved in the selection (for example for budget reasons) and in the
    prioritization.

    The customer's participation has to be modulated according to his
    experience. If the customer has no experience in IT, it could be
    overwhelming to ask him to prioritize. At the same time, the SA team
    should keep in mind the business process of the customer and what is
    crucial for him.

    In general the customer should feel part of the solution and not an
    enemy.
  . then comes the implementation of the solution. Its success depends on
    the technical skills of the people involved and on the communication
    and coordination within the team.
- verify the solution
  . once the solution is implemented, the job is not finished: one should
    verify it.
  . After the implementation, the team involved has to verify that the
    solution fixes the problem identified in the earlier steps. If it does
    not, the problem may have to be identified again or other solutions
    applied; in short, previous steps have to be revisited.
  . after the implementation team is able to verify that the stated problem
    is solved, it is the customer's turn to verify the same. If the
    customer is not satisfied with the solution, maybe there was a
    misunderstanding in the statement of the problem, and the
    request-handling process has to be redone.
- the book also reports what can happen if some steps are skipped,
  and suggests possible ways to fix the resulting problems.
  . Write policies to set expectations in the greeting phase, to avoid
    unfriendly communications.
  . write decision trees to avoid misclassifications.
  . avoid skipping the problem statement from the customer; it is crucial
    in later steps.
  . double check before executing commands.
  . watch for wrong expectations set by management about the solution.
  . watch for wrong metrics for judging personnel that foster incorrect
    behavior, for example the wish to close tickets quickly.
- the transition from one phase to another should be smooth
  . this means that the SA team should cooperate and coordinate well,
    to avoid rough transitions (for example with missing information,
    duties or waiting time).
- inform the customer that a major outage is detected
  . so the customer will not panic, knowing that someone is working
    on a solution.
- SA familiar with the customer
  . a single SA could be assigned to a group of customers to raise
    familiarity and their knowledge of the processes behind the SA work.
    So instead of focusing on a certain technology, the SA focuses on
    customer-oriented solutions.
- collect statistics during every step of the customer request handling
  . those statistics can help you improve the method and refine ways to
    avoid or limit errors, for example better questions to ask the customer
    to improve the problem classification.
- always try to be friendly with customers, never patronizing or condescending
  . this helps the cooperation to solve a problem. Try to rephrase possibly
    insulting sentences like "is the power cord plugged in?" into "is the
    device powered? maybe there is a power defect" and so on: disguise a
    potentially offensive question behind another purpose.
- Analyse the customer requests
  . the book reports possible analyses of customer requests that could
    give insight into some problems.
  . are there customers that report more tickets than others (relative to
    their size)? Why? Could a systematic solution be found (for example
    training)?
  . are there many questions regarding one category? Maybe a service has to
    be explained better, or redesigned?
  . is it possible to automate something to make it on demand/self service?
  . further metrics can be based on the process explained above.
    For example, if requests start to take more time to solve, does it
    mean that the simple requests are already solved?
- training customers to help facilitate the solution
  . for example, when requesting help they may be required to report some
    information by default that helps to classify the problem or state it.

chapter 15: debugging (problems)

- debugging is related to customer care but with a focus on the person
  fixing the problem.
  . Debugging means: fix the problem, then understand it and fix it in a
    persistent way. A fix could also be documentation, like "known bugs".
- understand what the customer is trying to do
  . every time the customer communicates a problem, it may not be properly
    stated, so one skill to refine is understanding the customer's need:
    what is he trying to do and what does not work for him, so the real
    need gets fixed.

    For example, "print a mail before a short deadline" really points to
    using another, working printer instead of sending someone to check the
    seemingly broken one.
- apart from the quick fix, fix the cause
  . if a server hangs periodically, maybe with loss of business data, more
    than a reboot is needed. One should find the time to investigate the
    problem properly and find the root cause. And do not forget to document
    what the cause was and how to fix it; it may be useful in similar cases
    later.
- be systematic while debugging a problem
  . make hypotheses,
    test them,
    note the results,
    and if necessary repeat the previous steps until the cause of the
    problem is found and fixed.

    Use the process of elimination (removing parts of the ecosystem of the
    problematic service until the problem disappears), then successive
    refinement, adding parts back until the desired change happens.

    Another technique is following the sequence of operations (or the
    path) until one stumbles on the problem.

    Often the cause is a recent change, so documentation helps to collect
    and review the changes.
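
The "cause is often a recent change" observation pairs well with the process
of elimination: given a chronological change log and a repeatable test, the
culprit can be found by bisection. A minimal sketch, assuming a hypothetical
`is_broken(n)` hook that checks the system with the first n changes applied
(for example replayed in a lab environment):

```python
def bisect_changes(changes, is_broken):
    """Find the first change after which is_broken(...) reports failure.

    `changes` is a chronological list; `is_broken(n)` tells whether the
    system misbehaves with the first n changes applied (a hypothetical
    test hook). Assumes the system was fine with 0 changes applied and
    broken with all of them.
    """
    lo, hi = 0, len(changes)        # invariant: ok at lo, broken at hi
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_broken(mid):
            hi = mid                # culprit is at or before mid
        else:
            lo = mid                # culprit is after mid
    return changes[hi - 1]
```

Each probe halves the candidate set, so even a long change log needs only a
handful of tests.
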
- having the right tools
  . Diagnosis requires the right tools; without them it is hard to identify
    a problem. They can be bought or crafted in house.
  . Learning a tool is fairly easy compared to interpreting the data it
    reports.
  . Do not trust the tools 100%: their reports should be questioned with
    critical thinking and previous knowledge of the environment, otherwise
    they may be misleading.
- search for better tools
  . keep yourself informed through proper channels to find better debugging
    tools, but without being influenced by hype. The tools should be able
    to help with real life problems.
- ask for formal training on the tools
  . it helps in many ways; for example, not everyone has the equipment to
    tinker with certain machines.
- complete understanding of a system/ecosystem (ecosystem: a collection of
  systems)
  . having a complete understanding of a system, even a complex one, helps
    a lot to quickly select the right track for debugging a problem.

chapter 16: fixing things once

- through fixing things once one can expect three results
  . getting more time, instead of fixing the same problems over and over
  . being a better SA, since more problems are solved over time instead of
    the same ones repeatedly
  . having better arguments for customers about why fixing something takes
    a long time (this should be explained later)
- Do not spend too much time on the same action.
  . In repeatedly applying a fix (even if it does not seem so, or the fix
    is small, like setting environment variables in a shell) one loses time
    in many ways. For example, an important procedure may require this
    recurring fix, which may be forgotten under pressure and cause a big
    investment of resources (ex: time) to be unfruitful due to the small
    missing step forgotten at the start. Since it seems quite a pattern
    that things which periodically break or obstruct work will sooner or
    later cause a lot of time to be wasted (or, as I prefer, not well
    spent), if the involved services are going to be used for a long time,
    just fix those things as soon as they repeat a few times.

    Moreover some observations apply:
    . if the (relatively little) problem is fixed, it is likely fixed for a
      long time. This saves time compared to all the occurrences of that
      little problem.
    . Could the problem be fixed leveraging what others have done already?
      For example, instead of creating a user config from scratch, start
      from a similar one that already exists.
    . Could the fix be applied on all similar systems, so the problem will
      not appear elsewhere?
- the real world dictates the constraints: temporary fixes
  . The book is quite positive about the possibility of fixing things once,
    but it also acknowledges that the real world raises several different
    constraints under which an SA team may be unable to apply a permanent
    (or almost permanent) fix.
  . the main idea is that after a temporary fix the SA team puts in its
    backlog the need for a permanent fix that follows the temporary one.
    Of course the backlog should be reliable. It can be a mail, a ticket,
    a bug tracker and so on.
  . temporary fixes feel good because they let the SA feel that something
    important was achieved in little time, instead of starting a longer
    project (note from me: but for this the task-chunking idea exists!)
  . after enough temporary fixes accumulate and repeat, one is left with
    the analogy: instead of fixing the faucet, one continuously mops the
    floor. To avoid this, one has to be aware of the repetition of the
    temporary fix and break the cycle.
- learning from other fields: construction workers
  . with time, human activities start to show analogies. The administration
    of IT systems shows some analogies with the field of construction.
    Therefore some advice from that field can be imported into the younger
    one, because construction is a more "tested" activity than IT
    administration.

    Some of the advice:
    - measure twice, cut once: plan/design/check with a bit more care, then
      apply the change.
    - copy from existing pieces: learn to find already existing solutions,
      copy them and adapt them.
      - if possible the copies should be standardized, so errors in one
        copy can be fixed in all the others.
- automation is not a permanent fix
  . thinking that automating a fix makes it permanent can be wrong.

    For example, automatically deleting unwanted files when disk usage gets
    high does not address the underlying problem: the disk slowly fills
    with unsorted customer files that are difficult to triage. Which files
    can go? So in this case a policy may be the permanent fix, instead of
    automation.

    Of course there are cases where automation is a permanent (or almost
    permanent) fix, but also cases where the fix requires another approach.
- most of the time the solution is: policy and discipline
  . if possible, helped with software enforcement (automation, constraints,
    configurations and so on).
- automate a little bit every time
  . when automation can provide a permanent fix, there is no need to
    automate the whole task at once.

    One can start with documentation of the procedure, as a checklist.
    Then one can automate some parts of the checklist.

    Then some more. In the end all the parts that can easily be automated
    are in fact automated, and the task takes far less time. In the long
    run one should save time this way.
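
The gradual automation idea can be sketched as a checklist whose steps are
either manual prompts or already-automated functions; each iteration converts
one more string into a callable. All step names here are hypothetical:

```python
# Sketch: a checklist whose steps are gradually automated. Manual steps
# are strings shown to the operator; automated steps are callables.

def run_checklist(steps, ask=input):
    """Run automated steps, pause on manual ones; return what was done."""
    done = []
    for step in steps:
        if callable(step):
            step()                      # already automated: just run it
            done.append(step.__name__)
        else:
            ask(f"MANUAL: {step} -- press Enter when done")
            done.append(step)
    return done

def rotate_logs():                      # one step automated so far
    pass

steps = [
    "check disk space on the mail server",
    rotate_logs,                        # was a manual step, now a function
    "verify the mail queue is draining",
]
```

Each pass through the list runs the automated parts and pauses on the manual
ones, so the partially automated procedure stays usable the whole time.
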

chapter 17: change management

- change management is:
  . having changes on systems well documented, with a plan B, and
    reproducible; in other words, managing the risks due to a change.
- basics
  . (a) communicate and schedule: communicate with customers and the SA
        team so they know what is going to happen, and schedule the change
        so it makes the least impact.
    (b) plan and test: plan how and when to do the change, how to test
        that everything works, and a plan B to return to a working
        situation if the change is not successful.
    (c) process and documentation: changes should follow
        standardized/approved procedures with foreseeable problems
        covered. Changes must be documented and approved before their
        application.
    (d) Using revision control or other tools helps to roll back changes
        that put the system in an unwanted (non-working) condition.
        Automating changes helps to have processes that are always
        performed the same way, so they are reproducible.
- not every change has to be within change management
  . if change management includes even minor changes, it could block or
    overload the work of an SA team.

    Change management should include those changes that could be critical
    if done wrong: they can delay work, create unneeded extra work, or
    affect revenue or the effectiveness of the organization.

    Those changes should follow a standardized (and maybe continuously
    improved) change management procedure to lower the risk of a wrong
    implementation of the change.
- ITIL
  . ITIL is a framework of concepts that helps to standardize change management.
- risk management
  . An SA has to deal with risk, except in dummy positions. One type of
    risk is loss of service, which may include loss of data. One way to
    diminish this risk is to make backups.

    In general one should aim for risk discovery and quantification (of
    course with a defined metric that adapts to the needs of the
    team/organization).

    What could happen due to a change?
    What could happen in the - realistic - scenario (not the extreme one
    where the world burns)?
    What is the impact of those events?

    Then one should look for mitigation of the risk, and the mitigation
    proposal may go through the following five steps:
    (a) discuss with the others the impact of the change needed for
        mitigating the risk.
    (b) test plan: how do we know that the change is implemented
        successfully?
    (c) roll back plan: how do we roll back if the change is not
        successful?
    (d) decision point: how and when is the decision made to apply the
        roll back plan?
    (e) preparation: how can the change be tested in advance, to ensure
        that the real change will go as smoothly as possible?
- communicating
  . When the SA team and the customer know about a change (important
    enough for change management), related problems can be spotted early
    and linked to the change.
  . communicating with customers is crucial. They should know when the
    work is successfully finished, they should hear about any errors, and
    they should know in advance what is going to change. They should also
    know how to use the change (for example if a new feature is added).
  . one should care about the volume of messaging. Too many messages (or
    too-long ones) will likely be ignored by other people (who are
    normally busy too).
- scheduling
  . the implementation of the changes has to be scheduled properly
    according to the impact on the customers' processes.
  . routine change: a small change that can happen at any time.
  . major change: this could take one or more systems or important
    services offline. These should be scheduled with the customer.
  . sensitive change: a change that should not cause any long downtime
    but, if things go wrong, could; like a gateway configuration change.
    It has to be scheduled with the customer, and the team should know
    when it happens and why, so subsequent errors can be related to it.
  . sensitive and major changes require presence to fix errors as they
    arise; one cannot just stop work. Or rather, one can, but it could be
    a problem.
  . change freeze periods are periods of time, defined in various ways,
    where changes are frozen (except very minor ones) to avoid unwanted
    problems.
  . no changes before recovery time (a Friday before the weekend, before
    holidays), otherwise mistakes are discovered slowly and with slow
    reaction time, and the team has no time to recover and burns out,
    which means little ability to produce solutions (and therefore no
    help from the team).
  . it is useful to have team-wide documented guidelines on scheduling a
    change in cooperation with the rest of the team.
- process and documentation
  . it is important to follow the change management processes and produce
    the related documentation, for reviewing problems or drawing lessons
    learned later.
  . it is useful to have forms for change proposal and change control,
    where the intended (non-minor) change is described with the expected
    resources involved (systems and people) and the various phases
    explained, so:
    - detailed changes to make
    - systems, services and resources involved
    - reasons for the change
    - risks
    - test procedure
    - how to roll back the changes
    - how long the change will take to be done
    - how long the roll back will take

    the change has to be approved, and the SA cannot deviate from what was
    approved, otherwise others will have a hard time figuring out any
    problems.

    If deviating is needed, it has to be immediately documented and has to
    be reasonable.
  . this documentation should be produced for systems that are very
    critical for the business, not for every system, otherwise the
    overhead would be too high for the team.
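
The change proposal/control form above can be captured as a record, so every
request carries the same fields and the needed maintenance window can be
derived from the declared durations. Field names are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class ChangeRequest:
    """One change-control form, mirroring the fields listed above."""
    description: str
    systems_involved: list
    reason: str
    risks: str
    test_procedure: str
    rollback_procedure: str
    change_duration_min: int
    rollback_duration_min: int
    approved: bool = False

    def window_needed_min(self):
        # Reserve time for the change plus a full roll back.
        return self.change_duration_min + self.rollback_duration_min
```
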
- technical helpers
  . the SA team should have a procedure, available to everyone in the team,
    to follow for updating (sensitive) configuration files.
  . software that keeps a revision history of (sensitive) configuration
    files helps to review changes and debug any problems.
  . software that checks whether changes to a config file are
    syntactically correct helps a lot, because it is difficult for a
    human to tell, in a long file and in a short time, whether it is
    still syntactically consistent.
  . When one makes a change, it is important to exercise the processes
    that will use that change as soon as possible, to discover potential
    errors. If a change is made but will only take effect at the next
    reboot, and the reboot is delayed by months, it will be difficult to
    relate the reboot-time problems to the change. Therefore every change
    has to be tested immediately, if possible, to verify that the process
    is not affected negatively by it.
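
A minimal sketch of the "check syntax before applying" helper, assuming a
JSON-style config; real tools would call the service's own validator (a
dry-run mode) instead of `json.loads`:

```python
import json, os, tempfile

def apply_config(path, new_text):
    """Replace a JSON config file only if the new content parses.

    Sketch only: json.loads raises ValueError on bad syntax, so a broken
    edit never reaches the live file.
    """
    json.loads(new_text)                  # syntax check: raises on error
    d = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=d)     # write then rename: atomic update
    with os.fdopen(fd, "w") as f:
        f.write(new_text)
    os.replace(tmp, path)
```

The write-then-rename step also means a crash mid-update leaves the old,
working config in place.
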
- foster change management to increase stability
  . once the change management processes are defined, used and common in
    the team, they can be used to increase the stability of the systems.
  . create front ends for changes.
    Having an interface that asks questions, checks the input, and then
    modifies the configurations and applies changes on behalf of an SA
    lowers the risk of errors, because there is not only a syntax check
    on the values but, if possible, a content check as well (for example
    rejecting impossible ip addresses).
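
A front end's content checks might look like the following sketch, using
Python's stdlib `ipaddress` module to reject impossible addresses before any
configuration is touched. The validation rules themselves are hypothetical:

```python
import ipaddress

def validate_host_entry(hostname, ip, netmask_prefix):
    """Content checks a change front end might run before editing configs.

    Hypothetical rules for illustration: hostname must be non-empty
    alphanumeric/hyphen, the IP must parse and not be loopback or
    multicast, and the prefix length must be sensible for IPv4.
    """
    errors = []
    if not hostname or not all(c.isalnum() or c == "-" for c in hostname):
        errors.append("invalid hostname")
    try:
        addr = ipaddress.ip_address(ip)   # raises ValueError if impossible
        if addr.is_loopback or addr.is_multicast:
            errors.append("IP is loopback or multicast")
    except ValueError:
        errors.append("impossible IP address")
    if not 0 < netmask_prefix <= 32:
        errors.append("invalid prefix length")
    return errors                         # empty list means the input passed
```
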
  . change management meetings can be instituted to force the SA team to
    think properly about the change to be made and to plan all the steps
    (especially the phases seen before).
    Meetings should involve all the people (or representatives, like
    managers) who may be affected and should know about the
    (major/sensitive) change, so they can swiftly report problems and
    provide debug data in case of problems, or also give feedback about
    the change itself, whether it improves something or not.
    Moreover those meetings move the responsibility for the change from
    the SA team alone to all the other people who approve the change and
    the plan for it.

    Another advantage of meetings is that if the change could impact other
    deadlines, it can be deferred. But to be aware of the deadlines one
    should have the proper representatives in the meeting.

    Of course change management meetings should be scheduled properly (as
    should the changes), because with a too rare or too frequent schedule
    either problems are not resolved quickly enough, or there is no time
    to see whether a change has "long term" implications before others
    are made.
  . streamlining. Once the change management processes are mature enough,
    one can analyse them and check whether parts can be eliminated or
    made more efficient. In the end this can be done with any mature
    process.

chapter 18: server upgrades

  . I wanted to skip it, but it is better to read a couple of pages to see
    whether a chapter contains "immediately" interesting notes.

    "Server" can be extended to systems, or collections of systems, too.

    According to the checklist below, an upgrade is not only an OS upgrade
    but any sensitive configuration change that affects the system, like
    a change of ip address.
- basic goal
  . after an upgrade a system or a server should offer at least the same services
    that were offered before (unless some were willingly dismissed).
- suggested checklist
  . especially for important systems that hold crucial services.
  . # develop a service checklist
      * what services are provided by the system?
      * who are the customers (human or not) of each service?
      * what software package/s provide which service?
    # verify that each software package will work after the upgrade, or
      plan an upgrade of the package.
    # for each service, develop a test to verify that it is working
    # write a back out plan, possibly with triggers
    # select a maintenance window
    # announce the upgrade to the users and customers
    # execute the tests developed earlier on the services, to see if they
      detect that the services are working and whether the tests are
      still valid.
    # lock out users
    # do the upgrade together with someone else, for help or mentoring.
    # repeat all the tests previously mentioned.
    # if a test fails or a trigger for the back out plan is activated,
      execute the back out plan
    # let users back in
    # communicate completion or back out action to customers/users
    # analyze the process to extract lessons learned and improve the
      checklist.
- service checklist
  . write a checklist listing
    * which services on the system are affected by the upgrade
    * who the customers of the affected services are (customers can also
      be other systems)
    * which software collection provides which service

    The checklist will guide the entire process.
  . Making the checklist available is good for feedback and discussion
    that improve it. Nevertheless, write the version in the checklist so
    people will know whether they are looking at the latest copy or not.
  . Push updates of the checklist to possibly interested people: its mere
    availability on a website will not drive people to pull it from time
    to time.
  . Review the checklist with people that will somehow be affected by the
    change and with people involved in the update process. This helps to
    verify that the checklist is solid and no steps are missed, besides
    distributing the responsibility.
  . Include the customers in the reviewing process too, so they feel more
    involved in the upgrade and can give useful business hints.
  . to know which software and services are involved, one may need to
    analyze the machine itself or the related ecosystem (for example in
    the case of an update to a gateway in the middle, routing traffic to
    many systems).
  . of course you may overlook something, so keep a margin for error, or
    a plan B, in case more services are affected than the ones you
    planned for.
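
The service checklist can be kept as data, so the same structure drives
announcements, tests and the post-upgrade review. The services, packages and
customers below are hypothetical:

```python
# A service checklist kept as data, mapping each service to its customers
# and to the software packages that provide it.
SERVICE_CHECKLIST = {
    "smtp":  {"customers": ["all staff", "monitoring host"],
              "packages": ["postfix"]},
    "dns":   {"customers": ["all hosts"],
              "packages": ["bind9"]},
    "https": {"customers": ["external users"],
              "packages": ["nginx", "certbot"]},
}

def affected_customers(services):
    """Who must be notified if these services go down during the upgrade."""
    out = set()
    for s in services:
        out.update(SERVICE_CHECKLIST[s]["customers"])
    return sorted(out)
```
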
- verify software compatibility with the upgrade
  . ask the vendors
  . check if it is possible to test the upgrade or, if a failure is not
    critical, just perform the upgrade and document the result (always
    have a backup plan). If it is critical (for example an upgrade to be
    reproduced on thousands of machines), test it beforehand.
  . document the quirks of the software and where to find information or
    how to test the compatibility.
  . to read a bit better
- verification tests
  . A test should be developed to verify a minimum of functionality for
    each service (a test cannot ensure complete functionality, just the
    functionality covered by the test).
  . it is useful to document the tests, for later replication or also for
    monitoring the functionality.
  . having scripts, or strict checklists, that perform the tests is
    useful both as documentation and for reproducibility, and it speeds
    up the operation, which can then run unattended.

    For example if there are problems, each debugging change may require
    redoing the tests, and doing them manually takes time.

    Of course doing a manual test may uncover things that a script cannot
    catch.
  . other people affected by the upgrade (either support teams or
    customers) may also be involved in defining the tests, since more
    heads are likely to produce better tests than one (but they also
    generate more discussion, see
    https://en.wikipedia.org/wiki/Law_of_triviality)
  . one way to use tests is to capture the result of exercising the
    service before the upgrade and again after the upgrade, checking the
    differences for possible problems.
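
The capture-and-compare idea can be sketched as a diff of per-test results
taken before and after the upgrade (the capture mechanism itself is
assumed; here results are plain strings):

```python
def diff_results(before, after):
    """Compare per-test results captured before and after an upgrade.

    `before`/`after` map test names to captured output strings. Returns
    tests whose output changed or disappeared, for a human to review.
    """
    problems = {}
    for name, old in before.items():
        new = after.get(name)
        if new is None:
            problems[name] = "test missing after upgrade"
        elif new != old:
            problems[name] = f"output changed: {old!r} -> {new!r}"
    return problems
```

Not every difference is a failure (a version banner may legitimately
change), which is why the result is a review list rather than a verdict.
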
  . One suggestion is to introduce test driven development (in this case
    test driven upgrades) into the domain of system administration: one
    then relies on tests as documentation of "how things should work" and
    "what should be tested and how".
- back out plan
  . If something goes wrong in a non-trivial way during the upgrade, the
    services should be restored. How do you revert to a working
    situation? How do you undo the partial changes? When do you start to
    undo them? How long will it take?

  . One should consider the maintenance window in two parts: one for the
    upgrade, one for the back out plan. If the point of no return is
    reached, the back out plan is executed.
  . Since under pressure things are less clear, to enforce the "point of
    no return" one should use an alarm, or a person not directly involved
    in the action, to avoid letting the team use the whole maintenance
    window trying to fix the upgrade.
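
Splitting the maintenance window into upgrade time plus back out time gives
a concrete alarm time for the point of no return, e.g.:

```python
from datetime import datetime, timedelta

def point_of_no_return(window_start, window_len_min, backout_len_min):
    """Latest moment the back out plan can still be started.

    Past this time the team must stop upgrading and revert, or it will
    overrun the agreed maintenance window.
    """
    return window_start + timedelta(minutes=window_len_min - backout_len_min)
```

For a 02:00-06:00 window (240 minutes) with a one-hour back out, the alarm
fires at 05:00.
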
  . One way is to have backups of the systems, or replication, so when
    the upgrade fails the backup can be reinstated.
- maintenance window
  . to define the maintenance windows the SA must know (or estimate) how long the change will take
    and how much time is needed to implement the backup plan if something goes wrong.
    With those information one should agree a maintenance window with the customers
    using the services running on the server.
  . Would be helpful to fix those maintenance windows in contracts with customers,
    so no one is surprised when those are needed.
  . The mainetance window should contain all the steps planned until the users can reuse
    the services
  . Estimates can be doubled or tripled at the start; then, with real results, they
    can be refined to the really needed times. In any case, be conservative.
  . As written before, it is good to fix a time when the back out plan is triggered, to avoid
    being too late with the recovery.
  . Exaggerate the estimates but let people know when the maintenance is finished.
    Unless they are demanding and take you for dishonest, they will appreciate that
    the work was done faster. Expectations are the key.
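  The scheduling arithmetic described above can be sketched like this. All numbers are made up; the padding factor of 2 stands in for the "double or triple the estimate" advice.

```python
# Pad the raw step estimates conservatively, reserve the back out time at
# the end of the window, and derive the point of no return: the moment at
# which, if the work is not done, the back out plan must start.

def plan_window(step_estimates_min, backout_min, padding=2.0):
    """Return (total window, point-of-no-return offset) in minutes."""
    padded_change = sum(step_estimates_min) * padding
    window = padded_change + backout_min
    point_of_no_return = window - backout_min
    return window, point_of_no_return

# Three invented steps of 30, 45 and 15 minutes, plus a 60-minute back out.
window, deadline = plan_window([30, 45, 15], backout_min=60)
print(window, deadline)  # 240.0 180.0
```

  The point of no return is simply the window minus the back out time, which is why the back out plan's duration must be known before the window can be agreed with the customers.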
- announcing maintenance
  . people do not have time to read, so make short but informative messages.
  . try to use a standard format for announcements like those, so people get familiar with it.
  . try to use a channel that customers will read for sure.
  . try to fill in a blank template instead of editing an already filled one, to lower
    the chance of mistakes.
  . announce the maintenance window with enough time in advance to let the customers
    react in case they have to ask for it to be aborted.
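  A blank announcement template could look like the following sketch; the field names, times and contact address are invented placeholders.

```python
# A blank template filled fresh each time (rather than editing a previous
# announcement), so stale details from the last maintenance cannot leak in.
from string import Template

ANNOUNCEMENT = Template(
    "MAINTENANCE: $service unavailable $start - $end.\n"
    "Impact: $impact\n"
    "Contact: $contact"
)

msg = ANNOUNCEMENT.substitute(
    service="mail",
    start="Sat 22:00",
    end="Sun 02:00",
    impact="no incoming/outgoing email",
    contact="helpdesk@example.com",
)
print(msg)
```

  Template.substitute raises an error if a field is left unfilled, which is exactly the safety net one wants for announcements.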
- execute tests before the change
  . execute the tests also before the change, so that the SA does not end up chasing
    problems that were already there before the change.
- lock out customers
  . plan the time that customers have to leave the service and inform them with
    enough time in advance so they can gracefully leave the service.
- pair system administration: do not do the sensitive change alone.
  . as pair programming and other activities teach, perform a change with a colleague,
    because two minds lower the rate of mistakes; we are humans and
    we make subtle errors all the time.
    Checklists also help a lot, if strictly followed.
    Moreover, when one does the steps and the other watches, the watcher has more
    focus to identify errors or mistakes in the plan, or to think in advance about the
    next steps.
- reexecute tests after the change
  . to prove that the system/services work as intended. Ideally the tests are the same
    as the tests done before the change.
  . the customer has to be involved to verify in person that the services work
    as intended.
  . if automatic tests are done (monitoring), then those have to be reliable.
- remember the back out plan
  . if the planned triggers for starting the back out plan fire,
    the back out plan has to be executed consistently.
    This is because the integrity of the service comes first, even if maybe five more
    minutes would have been enough to finish the change. One does not know in advance.
  . discuss the timing of the back out plan with the stakeholders beforehand, so
    there is an accepted agreement that the change is rolled back if something
    does not work as planned.
  . after the back out plan is implemented, run the tests again to verify that everything
    works as intended.
  . collect information about why it was not possible to proceed with the main plan;
    that information will help the next time or when troubleshooting is needed.
- communicate to the customer that they can get back to the server
  . this has 3 goals at least:
    (a) let the customer know that the service is available again
    (b) let the customer be aware that there was a change (otherwise work is not
        appreciated)
    (c) if there is a problem, they will report it relating it to the change.
  . if a back out plan was implemented, customers should know that the system should
    work exactly as before.
  . provide information to help the customers, such as a phone number or a URL
    to refer to if they find problems.
- fresh install
  . sometimes, instead of an upgrade or a change, it would be better (if one can)
    to do a fresh install containing the change on a spare system and then swap
    the systems.
- reuse tests
  . if tests are properly defined and automated,
    they can be reused as monitoring tests.
- log changes
  . if SAs log the changes done on a server or a system, from the beginning (or from a
    certain moment), then it is much easier to build the checklists
    mentioned at the start of the chapter.
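  A change log can be as simple as one timestamped line per change. The sketch below is a hypothetical helper (host name and change text are invented); a temporary file stands in for the real log location.

```python
# Append-only change log: one timestamped "host change" line per entry.
# Grepping such a log later makes rebuilding an upgrade checklist easy.
import os
import tempfile
from datetime import datetime, timezone

def log_change(logfile, host, text):
    """Append a timestamped change entry for a host to the log file."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    with open(logfile, "a") as fh:
        fh.write(f"{stamp} {host} {text}\n")

# Demo using a temporary file instead of a real log path.
path = os.path.join(tempfile.mkdtemp(), "changes.log")
log_change(path, "web01", "upgraded nginx 1.24 -> 1.26")
entry = open(path).read()
print(entry, end="")
```

  Appending (mode "a") rather than overwriting is the important design choice: the log is history, and history is what the checklist is rebuilt from.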
- if possible simulate what is going to be done
  . if possible, rehearse the entire change as it will be done (or part of it),
    to be prepared in advance, because experience with similar activities always helps.
- if possible, keep the old version and the new version of a change on the same machine
  . sometimes it is possible to make a change while keeping the old version around,
    maybe by renaming a folder. If possible do this to make the back out plan easier.

chapter 19: service conversions

- Chapter about services that change for the end user (or also client machines)
  and how to perform those changes. A changed service could be
  a service of the same type (say, mail or APIs) but with a different interface
  from the previously used service.
  This means that major updates of services can also be seen as service changes.
- Try to roll out the change with as little impact as possible.
  For example, try to minimize downtime by analysing the work schedule,
  or provide training for a different interface.
  Training can also be needed for the operators supporting the new service.
- If possible, collecting data during the action and brainstorming improvements
  wouldn't be bad.
- layers vs pillars. A change can be broken into several steps. Each step can be
  done for all the affected entities (clients, servers, whatever) at once, like
  layering, or all the steps can be done for a few entities first, then
  moving to the next group (pillar approach).
  One should consider the pros and cons of each approach for each step.
  Sometimes steps can be done in layers without much impact, like creating
  new accounts (but not handing them over),
  although normally the pillar approach has less impact.
  Also consider that doing a step in layer mode, and realizing later that it has errors,
  may imply reverting changes or doing a lot of fixes, since it was done for many
  entities and not a few.
- reduce the risk of finding errors "too late" by using a properly sized sample of the population
  that will get the new service to test the changes.
  Then, after this sample has passed the tests, roll out the change to another, bigger sample,
  and so on until everyone is included.
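  The growing-sample rollout above can be sketched as follows. The starting size and growth factor are arbitrary choices, not prescriptions from the book.

```python
# Start with a small canary cohort and roughly double the cohort size on
# each round until the whole population has the new service.

def rollout_cohorts(population, first=1, factor=2):
    """Yield cohort sizes whose sum equals the population size."""
    remaining, size = population, first
    while remaining > 0:
        cohort = min(size, remaining)
        yield cohort
        remaining -= cohort
        size *= factor

print(list(rollout_cohorts(100)))  # [1, 2, 4, 8, 16, 32, 37]
```

  The exponential growth keeps the number of rounds small (logarithmic in the population) while still catching most errors on a handful of entities in the early cohorts.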
- communication is crucial. Don't roll out changes when there are important deadlines.
  In general it is always good to set expectations. Informing the affected
  people that a change will be done in a certain period will make them accept the
  change better than sudden changes would.
- Also define the goals of the change well. Otherwise it can happen that
  those goals get extended, or people keep a different set of goals in mind
  and then are not happy at the end.
  Defining the major goals instead helps keep the focus without overstretching
  the team performing the change, given the limited time that they have to perform
  it.
- if one has to apply a change affecting the entire userbase at once, with
  few alternatives, then try to test extensively before the global change,
  including load testing, and plan a way to roll back changes if needed,
  like leaving the previous service still running.
  Another strategy is to advertise the change widely and let the new service
  be available and working beside the old service.
- Due to possible unforeseen problems after a change, try to prepare a roll back plan
  (at least for some time after the change) and decide beforehand when to execute it.
  It is tempting to try to solve problems on the spot, but that may cost too
  much downtime for the service. Time that could have been used to roll back the
  change.
- one strategy could also be to avoid changes. For example, picking proper pieces of
  infrastructure (software/hardware and what not) that use open protocols or interfaces,
  so when one piece has to be changed, it is not necessarily true that all the others have
  to be changed as well.
- while planning for changes, you may ask the vendor for support with directions,
  guidelines and solving problems. Don't bother keeping your plans secret;
  the vendor is unlikely to sell you out.

chapter 20: maintenance windows

- maintenance is needed, especially for big changes: clean up, moving large amounts of data,
  big updates, rearranging servers and so on. Since the task is not trivial,
  it should be planned accordingly.
- 3 stages are suggested: Preparation, Execution, Resolution.
  Preparation means: scheduling the window, picking someone who will direct and supervise
  the task, preparing the changes that should be done, making a master plan.
  Execution means: disable access to the systems and services to be
  maintained; determine the sequence for shutdown of systems and services
  (if needed); execute the plan; perform testing to be sure that the functionality
  is there.
  Resolution means: inform the users (human or machine) that the maintenance has ended,
  re-enable access to the systems, be present and prepared for further bug fixing.
- maintenance windows can be intense, depending on the task. Therefore considering
  the morale and fitness of the team that is going to do the maintenance is also
  useful.
- As usual, a "cost" such as maintenance should be sold and explained to the ones going
  to "pay for it". One factor that helps explain it is measurements of availability
  or satisfaction or other measurable quantities.
  Another is the opportunity cost: what is the cost of not doing the maintenance?
  Also, one should schedule the maintenance properly.
  It is not that useful if the maintenance comes only once a year while more often than not
  there are big problems after 6 months due to the missing maintenance.
  The ecosystem of services and systems should run well until the next maintenance.
- the schedule of the maintenance should be discussed in advance with representatives
  of the affected users. One should also try to avoid hot times, to avoid, for example,
  creating further obstacles for projects that are over a due date or
  approaching it. The maintenance should also be advertised as a reminder
  to the people involved, so they can plan around it.
  Scheduling is also important for the parts of the maintenance. Scheduling maintenance
  in a way that problems will then be found when the SA team is exhausted or not
  rested is not helpful. For example, imagine a maintenance in the night,
  with the day for "real world usage and testing". The SA team may be too exhausted
  for it.
  A schedule is also important because if equipment is needed and has to be ordered,
  one needs to order it in advance to have it ready (and tested) before the maintenance.
- The planning should cover the major steps and leave little problem solving for
  the maintenance window. The maintenance period is likely short already,
  so people cannot think too long about possible solutions. Most of the steps have to
  be clarified beforehand.
- there should be one person responsible for the maintenance who can coordinate it and,
  if necessary, rework the planning if they spot dependencies and so on.
  When possible this is someone with experience of the infrastructure and
  the process. The responsible should also be good at following a plan with tight
  deadlines and at working under pressure if something goes wrong.
- Change proposals. Before the maintenance, the SA team accumulates and discusses
  the changes to make, with a list like:
  What changes are going to be made?
  Which machines will be affected and who will work on which machine?
  What preparation is needed, with or without due dates?
  What are the dependencies needed for the change to happen?
  What will be affected by the change?
  How long will the change take, including testing (total time), and how many
  people will be needed for it?
  What tests are going to be made and what are their requirements?
  What is the exit procedure if the change doesn't work? How long is it
  expected to take?
  Change proposals need to be frozen before the maintenance, so the
  responsible for the maintenance (or director) can organize them.
- the plan of the director.
  Working on the change proposals, the director develops a plan - with dependencies -
  highlighting which person is working on which task and when, and which dependencies
  each task requires and should satisfy.
  In the plan there should always be a margin of error, that is, time and resources
  that stay idle during the maintenance window, so one can cover
  unforeseen steps that could go wrong. If one tries to have the maintenance window
  full from the start, more often than not it will not be enough.
  Difficult decisions, if possible, should be made beforehand and not
  during the maintenance. During the maintenance there is time pressure and
  everyone is tired and stressed.
- some days before the maintenance starts, one should ensure that the
  facilities that will be used to perform the maintenance actually work, since they are
  not used that often. Starting the maintenance with the tools for the maintenance
  partially not working doesn't help.
- ensure a shutdown/boot sequence and know the system dependencies.
  Some systems, when they reboot (or shut down), need other
  systems to be up to do it properly. Therefore a sequence of reboot/shutdown should
  be respected to minimize unwanted and avoidable errors. Furthermore, boot
  or shutdown errors due to dependencies between systems are not so easy to
  troubleshoot. Such a sequence should be identified before the maintenance, if possible.
  Otherwise the clock runs and one may have additional problems on one's hands.
  The boot sequence can group systems in stages. Typically few machines are
  in the early (and most important) stages, like authentication servers, and then
  come all the others.

  Also a tip: systems in a cluster or grouped by topic should boot independently
  of all the others. So, for example, systems in a datacenter should not rely on
  systems outside the datacenter.

  Shutdown sequences also help in case of emergency, for example power outages and so on.
  Avoiding additional problems due to an improper shutdown or boot sequence saves
  time, stress and revenue.

  Remote keyboard/video consoles (and their switches) may help, as well as
  additional communication channels like radio, phones and so on, to stay independent
  of the infrastructure being maintained.
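  Deriving a safe boot order from declared dependencies is a topological sort, which Python's standard library provides. The host names and dependencies below are invented examples.

```python
# Each host maps to the set of hosts that must be up before it boots.
# TopologicalSorter yields an order in which every host's dependencies
# come before it; reversing that order gives a safe shutdown sequence.
from graphlib import TopologicalSorter

deps = {
    "auth01": set(),                 # authentication server: early stage
    "dns01": set(),
    "nfs01": {"auth01", "dns01"},    # file server needs auth and DNS
    "web01": {"nfs01", "auth01"},    # web server needs files and auth
}

boot_order = list(TopologicalSorter(deps).static_order())
shutdown_order = boot_order[::-1]    # shut down in reverse order
print(boot_order)
```

  graphlib also raises CycleError if the declared dependencies are circular, which is exactly the kind of problem one wants to discover before the maintenance window, not during it.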
- The flight director, the person responsible for coordinating the action (maintenance),
  should keep track of the time and the progress of the maintenance (and of how the team is
  performing; people being tired or nervous doesn't help).
  This helps in deciding either to take pauses or to roll back the changes before it
  gets too late and risky.

  If tasks have dependencies, a task that takes too much time may create a cascading catastrophe,
  because it consumes all the time that was planned as a buffer in case of need.
  In this case the responsible for the action has to be ready to guide the team
  towards a good decision (rollback instead of extended downtime is also a good decision).

- After the maintenance is applied, one should test the systems to see if they are
  working. If not, which problems arose and how can they be tackled, or is it necessary to
  roll back the changes?

  Of course the tests should be planned before the maintenance, to know what to test and what
  to expect. One cannot define meaningful tests after the maintenance, when people are tired
  and already so drained that they cannot think clearly.

  Also a tip: always reboot a system before making changes to it, so you know that it can reboot.
  Otherwise, if a system cannot reboot, you don't know whether it is due to the changes
  or due to the fact that the system was already in an unstable state.
- After the maintenance is done and the tests have passed, one can communicate to the customers
  that the maintenance was successful. If possible use standard messages, because again
  one cannot compose proper messages after an intense effort.
- After the maintenance, be visible. Show your customers that you are there, ready to handle
  problems created by the changes. This allows the customers to trust the admin team.
  After that the team can rest.
- After each non-trivial change (or group of changes), the team should do a postmortem
  to identify what was problematic and how it can be avoided, then document it.
  Over time patterns will emerge and non-trivial actions will go smoother as the team learns
  from each execution. Of course the team should also be in a situation where it can learn.
- It is useful to have more than one "flight director", or responsible for big operations, in the team.
  Otherwise the performance of the team in such big operations depends on only one
  person. Training other people helps to substitute the main responsible and also provides
  peer review that may improve the approach to such actions.
- Collect data, extract trends and patterns. Collect data about every non-trivial (or critical) or long action:
  how long it took, what was required, what the preparation was, what the errors were and so on.
  In this way one can extract patterns and better predict similar future actions.
  For example, if transferring files takes a long time, this can be predicted the next time, to plan the
  file transfer better (for example in advance or the like).
- Try to improve the availability of the systems. If systems are redundant, or you have other ways to work
  on them while they stay available, try to achieve that. You never know if the customer has
  unexpected needs and cannot wait for a long maintenance. So build increased availability
  into your systems so that you can work on various systems while the customer keeps working as well.
  When a customer cannot afford downtime, then it __can__ afford solutions that keep the
  availability high.

chapter 21: Centralization and Decentralization

to read and summarize. But other chapters are more interesting at the moment.

chapter 22: Service Monitoring

to read and summarize. But other chapters are more interesting at the moment.

chapter 23: Email Service

to read and summarize. But other chapters are more interesting at the moment.

chapter 24: Print Service

to read and summarize. But other chapters are more interesting at the moment.

chapter 25: Data Storage

to read and summarize. But other chapters are more interesting at the moment.

chapter 26: Backup and Restore

to read and summarize. But other chapters are more interesting at the moment.

chapter 27: Remote Access Service

to read and summarize. But other chapters are more interesting at the moment.

chapter 28: Software Depot Service

- to read and summarize. But other chapters are more interesting at the moment.
- it is about a repository of installable packages.

chapter 29: Web Services

- https://en.wikipedia.org/wiki/History_of_the_Internet
- https://en.wikipedia.org/wiki/Global_Internet_usage
  thus in 2007 (the second edition of the book) 
  there was already plenty of traffic and solutions.
- It is about managing webservers. For those that do it (in the late 2010s lots of services
  are delivered through https) it is interesting.

- explanation of the web service basic building blocks (URL).
- At first it explains the difference between GET and POST. Both can send inputs,
  but POST can go beyond the limit of the URL length since it carries the data in the
  body of the HTTP request.
  https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol this is enough.
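  The GET vs POST difference can be illustrated with a few lines of Python; the endpoint URL and parameters below are placeholders.

```python
# GET encodes parameters into the URL itself (subject to URL length
# limits); POST sends the same encoded parameters in the request body.
from urllib.parse import urlencode

params = {"q": "maintenance window", "page": "2"}

get_url = "http://example.com/search?" + urlencode(params)
post_body = urlencode(params).encode()  # would go in the HTTP request body

print(get_url)    # http://example.com/search?q=maintenance+window&page=2
print(post_body)  # b'q=maintenance+window&page=2'
```

  The encoding is the same in both cases; only where it travels (URL vs body) differs, which is why POST escapes the URL length limit.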
- then it explains how dynamic pages are created. Quite a nice explanation
  but people should know (or check the wiki or buy the book).
  Here https://en.wikipedia.org/wiki/Dynamic_web_page it can help.
- Then HTTP codes and their categories.
  https://en.wikipedia.org/wiki/List_of_HTTP_status_codes here is ok.
- webmasters care about the content of a website; the sysadmin about the infrastructure
  that delivers the pages. Sometimes the sysadmin has to update the data
  that the website delivers.
- Still, there should be agreements (see SLA https://en.wikipedia.org/wiki/Service-level_agreement),
  because otherwise the webmaster may demand inappropriate working hours from the sysadmin
  to update their website.
  Setting an SLA is also useful to set expectations. The SLA should include downtime for
  maintenance, unless the website is based on redundant (and well done and expensive)
  infrastructure.
- most of the requirements are: RAM for caching the served content (if it doesn't change often);
  CPU if the content is often generated on the fly;
  storage as well, in case of log files or other intensive data driven operations. Here, rather than
  pure space, what is needed is the IOPS capacity of the storage.
  Then the book goes on to analyse common types of websites (that are still common in 2019),
  suggesting possible setups.
- For media servers, for example, there are cases of many users reading the same large file;
  keeping it in memory (cache) would then be better than continually accessing the disk,
  since every user can access the media file at different positions. Take a video of 45 minutes:
  some users may be at minute 1, some others at minute 10 and so on. Thus in this case RAM is
  a strong requirement for the server.
  If only one user reads the file, it may be better not to keep it in cache.
  This, though, depends on the application managing the file.
- for multiple hosts on one system, what is suggested is to use many virtual IP addresses.
  Not really an actual solution anymore. Virtual hosts are the way to go in the late 2010s.
  https://en.wikipedia.org/wiki/Virtual_hosting
- Monitoring is always important, to notice before (or at the same time as) the users
  whether a website is still available. Via monitoring one can verify which component
  failed, or collect performance data to see if the resources are enough for the website or whether
  a different sizing of the infrastructure would help.
- about scaling.
  Horizontal scaling: the website is replicated on different servers (or pools of servers).
  Of course one needs load balancing of requests, or redirects. Load balancing comes with a
  good set of tricky problems; for example, connections should not be pruned prematurely,
  active connections should be kept, and so on.
  Vertical scaling is to keep one entry point for the website but distribute all the components,
  so that they can deliver more performance than when residing on the same machine.
  Scaling can be done slowly, part by part. What is suggested is to identify the component
  that produces the most load and scale it. Like with a profiler while coding, one identifies
  the most time intensive functions and tries to optimize those.
  As usual, don't try to optimize everything at once. One component after another.
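  The "profile first, then scale the bottleneck" idea reduces to picking the most loaded component from measurements. A trivial sketch, with invented component names and load numbers:

```python
# Measured load per component, as a fraction of its capacity (invented
# numbers); the component closest to saturation is the one to scale next.
loads = {"web": 0.35, "app": 0.80, "db": 0.55}

bottleneck = max(loads, key=loads.get)
print(bottleneck)  # app
```

  After scaling that component, the measurement and selection repeat; the bottleneck usually moves somewhere else, which is why optimizing one component at a time is the suggested loop.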
- security is important, as exposing a system to the internet means that someone
  may gain (partial) control of it, using your resources.
  The book proceeds to explain SSL certificates. https://en.wikipedia.org/wiki/Public_key_certificate
  And then other attacks common at the time.
- it is always good to limit the possibilities that attackers have once they gain
  control of the server. Thus the server itself should be quite isolated and should not
  have too easy access to all the rest.
  Of course logging helps a ton, not only to debug problems but also to log operators'
  activities.
- once again, separating duties is important. There should be a webmaster team that cares
  about the website, its content, formatting and co, and the sysadmins caring about the
  infrastructure running the site.
  Visibility here is crucial, as one should show what the sysadmin already does and what
  should be done by others. Many people can see what others do only when informed;
  otherwise they are clueless.
- To avoid pushing problems into production, create several stages of your website
  where people can test their ideas and changes. Then, once everyone approves,
  the web team and the sysadmin, the change can be scheduled to go online.
- prepare a checklist of questions to ask when a webserver is needed.
  This helps to size it properly (disk space, RAM, CPU, etc...) and to
  add and configure the proper services or arrangements (like having services
  split among several servers).
- then the chapter exposes some careful planning about DNS and folder structure
  but this belongs mostly to the time before CMS and dynamic frameworks.
- read until 29.2.1 but the rest of the chapter can wait.

chapter 30. Organizational Structures

- to read and summarize. But other chapters are more interesting at the moment.
- it is about how to structure the SysAdmin team.

chapter 31. Perception and Visibility

- to read and summarize. But other chapters are more interesting at the moment.
- it is about how to communicate to the clients of a sysadmin team that a lot of work happens
  to keep the infrastructure going and improving.
  It is like a performance in theater or in other professions: people are happy about the result
  but they do not often perceive how much effort the result needed.
- Perception is how people see you; it is a quality. Visibility is how much people see you;
  it is a measure of quantity. The chapter is about how to improve the perception of the sysadmin
  team.
  The perception that the people who work with you have of you is their reality of you. If they think
  that you don't work properly or hard, although you do, their reality is that you aren't really
  doing much. If they don't know you exist, you don't exist for them.

  If they don't know what you are doing, more often than not they will assume negative things about you.
  (Pier note: proof: see comments on social networks, millions and billions of negative comments;
  thus humans tend to be negative towards others until they know better.)

  A lot of people in technical jobs think: if the technical job is properly done,
  it will be well received; the art of selling the job is done by someone else.
  Well, experience (history) shows over and over that this is not the case.

  Pier note: a job where the result is used by people is mostly communications and interaction
  with people. The technicality covers maybe 30 or 40% of the job, while perception is the rest.
- Now the authors reuse information from psychology, experience and other fields about humans to
  convey some concepts.
- First impressions are key. Every interaction can help build others' perception of you.
  It seems that trying to act positively (with a positive attitude) helps more often than not.
  Thus try to interact with the people you work with with a positive attitude.
- One major metric for customer satisfaction is: how quick a problem is solved,
  where quick is "according to their view". If something is solved while they are busy with
  something else, even better.
- You are responsible for how well you are perceived (Pier note: within the limits of the audience; if the
  audience is ungrateful, there is only so much one can do. But it is also true that this is not often the case).
- About first impressions the authors mention the 5 to 1 ratio (it is not the first time I have heard of it).
  Ah yes, here for example: https://www.extension.purdue.edu/extmedia/cfs/cfs-744-w.pdf .
  In practice, for every 5 positive acts, one negative act will be accepted (Pier note: I think
  the ratio can be lower, but yes it should be positive, like 2 to 1 or 3 to 1).
- Perception - unfairly? - is also about dress. People assign a different value to other people
  based on their dress code alone.
- Don't yell. Well, that is easy; as Aristotle said, the first who yells in an argument has lost.
  I have met so many people doing this.
  If you are angry or frustrated and the discussion is going nowhere, just leave the interaction.
  Excuse yourself and defer the interaction, for example to the restroom.
- Interesting how the authors stress how important it is to welcome an employee on the first day.
  Make good first impressions, and not only towards the upper part of the chain. Nice.
  On the other side, sysadmins are also part of big companies, thus receiving new
  employees well means that they will remember the team.
- The sysadmins should not disrespect their customers (as long as the customers are reasonable).
  On the other side, the customer is not always right. One can politely say no.
  One crucial point is to train the customers so that they don't constantly ask about
  small things.
- Being frustrated by a constant flood of problems (hey, they mean that you are not out of work)
  can also be a point. See the problems as puzzles to solve (ever played chess? nearly endless
  puzzles with the same chessboard!) and maybe try to find a way to limit the recurrent problems,
  for example with documentation and co.
  Then don't forget to thank a customer who raises a request that lets you identify a more
  general solution (or justifies developing a more general solution, taking the needed time).
- to read: 31.1.3

additional observations (not necessarily in the book)

- having non-redundant code (scripts, configurations for servers and what not)
  is nice, with references to one central point in case of duplicates.
  But if one wants to do some testing, having a real duplicate of the main reference
  (say a configuration management manifest) helps. Why? Because on the duplicate you can
  make changes that do not affect the script that is actually in production.

  In other words, adapted to modern versioning systems: make a branch of the production code
  to improve it.
Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License