---------------------------------------------------------------
Debra Shepherd
4 May 2004

On 20 and 21 April, I visited ESO's Paranal Observatory and spoke
with Jason Spyromilio, Deputy Director of Paranal, Steffano Bagnulo,
Operations Staff Supervisor, and Dieter Neurnberger, postdoctoral
fellow. The discussions centered on 'lessons learned' during
commissioning and current Paranal operations.

Listed below are recommendations and issues I've identified for ALMA
software tests, commissioning, & operations based on these
discussions.

 - Make it clear (in the commissioning plan, in the operations plan,
   from top management, from the IPT managers) that commissioning
   has top priority. Commissioning should be able to pull any
   resources they need from any part of the project at any time. If
   the commissioning head calls someone to come to Chile, then they
   should be on the next flight. Resources should never be pushed
   on commissioning.

   + ALMA should define the rules/priorities for commissioning soon.
     This should include what is expected of the development teams
     during commissioning. The people who should do this will
     likely include: Robert Laing, Dave Silva, Dick Sramek, Software
     management. This group needs to reach a clear understanding
     ASAP and then present a proposal to ALMA management so that it
     can be distributed to the project.

 - Begin to collect operating constraints at the ATF (time to do
   software upgrade, time to do holography, resources needed for
   these types of tasks). Put them in a central repository so more
   than just the antenna IPT or control software know what is
   involved.

 - ALMA computing should ensure that code 'ownership' is minimized.
   Thus, software handed off to the commissioning team can be more
   easily modified by someone other than the person who wrote the
   code. Note: Ownership is minimized if no single person develops
   a code branch, functionality, or complete system.

 - Develop an understanding of the minimum number of people needed
   during specific times, e.g. during instrument startup or when
   testing an instrument. Compare our current staffing plans with
   the number required to support commissioning and operations to
   ensure they are compatible. To do this, there needs to be a
   description ASAP of how ALMA Operations sees the day to day
   operations of the project (On site, at the OSF, in Santiago, at
   the Regional Science Centers). This should be done for
   commissioning, early science, and full operations.

 - ALMA Computing will need to know the admin/software upgrade plan
   and security protocol for commissioning. Will it follow the VLT
   example of No remote access? This should be coordinated between
   the commissioning lead and ALMA Computing.

 - Proposal: the Pipeline subsystem takes the responsibility to
   develop 'logs' of parameters that are set in observing or test
   programs and record them in a standard format for distribution
   to the rest of the project. Pipeline should work with
   commissioning & operations to identify what needs to be logged
   in what format (this is an expansion of recording processing
   parameters for heuristics development).

 - During commissioning, ensure there is a dedicated software
   engineer to:
   1) improve tool functioning and optimize content & layout;
   2) automate critical daily tests as much as possible and generate
      automatic plots;
   3) figure out the best way to visualize the observing queue so
      AoDs can easily evaluate the scheduling subsystem choices; and
   4) develop/optimize operations software tools (problem reporting,
      action summaries, configuration lists).

 - Consider this suggestion by Jason Spyromilio, Deputy Director of
   Paranal: Reserve 2 or 3 telescopes for software and hardware
   commissioning teams. This will allow them to understand problems
   and issues to a greater depth and figure out optimal solutions.
   In the long run it will decrease maintenance and improve operations.
   Perhaps: compromise with the community: promise them more than
   50% time for science if we can reserve 2 antennas at all times
   for commissioning issues/tests. It would be ideal if we could
   have different software for these test antennas - use the
   correlator test bed? Can 2 versions of the software be set up?
   Possibility: rotate test antennas so that each new antenna added
   to the array would go through this test setup. This may not be
   optimum for the 2-antenna test facility because engineers would
   always have a new antenna to deal with and check out rather than
   figuring out the optimum software/hardware/interface system.

---------------------------------------------------------------
---------------------------------------------------------------
---------------------------------------------------------------

Visit to Paranal - detailed notes on discussions with

 - Jason Spyromilio, Deputy Director of Paranal
   Responsible for Paranal commissioning
 - Steffano Bagnulo, staff
   Responsible for current operations (day & night)
 - Dieter Neurnberger, postdoc
   does duty shift as Astronomer of the Day

---------------------------------------------------------------

Discussion with Jason Spyromilio, Deputy Director of Paranal

 - A single person needs to be responsible for driving the
   commissioning process. This person should have the ultimate
   responsibility to deliver the working system. The key benefit
   of this is that this person provides a central goal to the
   project and makes the difficult decisions. Jason was this person
   for Paranal. Whatever he said was done. Jason (the mountain)
   had total veto power of implementation for new software or
   hardware. The development center could never impose changes on
   the mountain. This gave the development teams a single, focused
   client: Paranal. They knew who they were working for.

   The distributed nature of ALMA makes this imperative. Otherwise
   the individual teams might think they have 2 clients - their home
   institution or their IPT, and/or Chajnantor.
   This will not be a good situation to be in. All development
   teams in ALMA should work toward the single goal of getting ALMA
   to work on Chajnantor.

 - Paranal commissioning details: All computer source code is in
   Garching. Development is in Garching and code is delivered to
   the mountain. Once code is on the mountain, it 'belonged to
   Jason' - he could change what he wanted to get it to work. This
   was critical to getting the entire system to work together and
   work well.

   Essentially, commissioning started with code that passed specs.
   This did not necessarily imply that it worked as it needed to or
   that it worked well. The commissioning team would take the code
   and change it until it worked as they wanted it to. Individual
   tasks were changed, and the hierarchy of how the tasks were
   implemented was changed. Anything could be done to the software
   on the mountain and the turn-around time for code changes was
   very fast. Changes could be experimented with, ideas explored,
   code branches developed and then archived as not working. In
   the end, the chosen implementation was sent to Garching for
   configuration control. Garching may want code changes at that
   point and it may take 6 months to get it approved and done
   according to CVS. In the meantime, software updates on the
   mountain could proceed.

   Note the way software was handed off from the development team
   to the commissioning team on the mountain. This could only be
   done if there was no code 'ownership.' Ownership exists if the
   code was developed by a single person and only that person can
   figure it out enough to change it. Ownership is destroyed if no
   single person develops a code branch, functionality, or complete
   system. If the code is developed by more than one person then:
   - this keeps documentation up to date
   - versatility of who can work on the code increases, e.g.
     the commissioning team can modify the code with relative ease
     without having to fly in developers at the last minute or for
     months at a time to get the code working right.
   - commissioning will have the tools they need to solve problems
     fast and find the best solutions.
   - the design document will be up to date
   - communications between teams will remain high.
   - if one person refuses to be a team player, then that person
     could be replaced (albeit perhaps with difficulty).

   If you have to bring in a single person to fix something (e.g.
   you have allowed a single point of failure in the development/
   commissioning process) then the system is broken.

   This is not to say that the commissioning team should do
   everything themselves. Commissioning should be able to call on
   anyone to help them figure something out or, if they need the
   FTE-power, then bring in a developer to solve the problem. But
   bringing someone in to fix a problem should not be the only way
   to proceed.

   If not all teams are following this now, or it is followed only
   loosely in parts of the system, then imposing a no-code-ownership
   process may set us back 6 months to a year. But it must be done
   as early in the development process as possible to allow
   commissioning to go smoothly and well.

   Note: the software commissioning team should be small (5 people?)
   and should not change if possible. Bring in software development
   folks as needed for weeks to months at a time.

   Note: development-level software engineers must do maintenance of
   their code. Otherwise the developed code will not be able to be
   maintained (developers won't appreciate the difficulty of
   maintenance so they won't write code that can be easily
   maintained).

   VLT commissioning team: Approx 10 software folks who could code.
   This allowed 5 on site at any given time. This number was
   available for about 3 years.
   Now there are 7 on site software engineers (14 total) because the
   code base has grown. Rule of thumb: 1 software engineer to
   maintain 100,000 lines of code.

   There were about 10 engineers on the commissioning team (again,
   5 on site, 5 off site). There were very few scientists -
   essentially, they were barred from the commissioning process.

   Development teams for specific systems (cameras, CCD software,
   etc...) came in for 2-3 weeks at a time to install and verify
   their system. So there was a constant stream of people into
   Paranal during the entire commissioning process.

   Planning: Paranal thought that each team would need 1 or 2 people
   to install/verify systems. Instead, teams ranged from 5 to 15
   people. This put serious pressure on the containers and living
   quarters.

   After the system was working for UT1, a select group of
   astronomers was allowed in to participate in science
   verification.

 - As mentioned above, Paranal had total veto power of
   implementation for code changes and hardware installation/
   changes. This was almost never imposed but it set the tone for
   the priorities of the project.

   Aside: The central control software CCS (ACS counterpart) had
   upgrades every 6 months. After an upgrade was done, the system
   remained constant (no patches). It was an absolute requirement
   that CCS be backward compatible. Once this was established, and
   the CCS delivery was guaranteed to be backward compatible,
   'everything calmed down.' The commissioning team wasn't fighting
   fires imposed by a new CCS version. Also, after a CCS upgrade,
   automated tests were run to ensure backward compatibility.

 - Given that there is a single head of commissioning, software
   commissioning should work for the head of commissioning. This,
   in Jason's view, is the only way to interface the current
   software tests in ALMA computing to commissioning.

 - Jason was adamant: keep the scientists away from commissioning.
   This includes the SSR.
   If you don't, then there are too many ideas floating around; you
   cannot commission by committee. Suggestion:
   - Have a 'Not Chile' acceptance by the SSR, deliver the software,
     then have the SSR be a consultant to commissioning if needed.
     SSR would review the progress after this to follow changes,
     send comments if they think things are diverging from
     scientifically acceptable performance.

   Bring in scientists at the end, when there is a working system,
   to do scientific verification. Not before.

   Jason only brought in a few people (managers/scientists/
   engineers) whom he could 'trust' to help him evaluate processes,
   progress, or his decisions. These people had to be 'trusted' to
   1) provide good input, and 2) go back to the community and say
   something like: "things are going well and it looks like it will
   be ready in 6 months" - not: "whoa, things are a mess there!"
   In Jason's view, this was critical - commissioning will be a
   messy process by nature, the community will not understand this,
   and they should not be alarmed and put pressure on commissioning
   when it is not needed.

   *** Of course, if commissioning is really not going well, then
   things can proceed to the point of no return. The key difference
   is the head of commissioning must make the correct decisions (or
   at least ones that work).

   * Counter argument: I have spoken to a couple of scientists
     since this time; it appears that the European community may be
     angry about their lack of input to the project, the fact that
     they can't come to Paranal when requested (they can apply but
     they are often turned down), and the fact that they are given
     a rigid system when they visit that does not meet their
     scientific needs as well as it could. They have little or no
     power to affect the situation and they are grumbling loudly.

   + My thoughts: in some respects, Jason is correct. Commissioning
     needs control over the system, a central decision head,
     flexibility, and the highest priority of the project.
     But there should be some way to provide scientific involvement
     in the commissioning process that would keep the community
     engaged and happy, and develop a trusting user base. I do not
     think that the best solution for ALMA should follow the Paranal
     example of closed rigidity.

 - Issue for ALMA: we must make it clear (in the commissioning plan,
   in the operations plan, from top management, from the IPT
   managers) that commissioning has top priority. Commissioning can
   pull any resources they need from any part of the project at any
   time. If the commissioning head calls someone to come to Chile,
   then they should be on the next flight. Jason did not often use
   this last resort but the rule got the attention and focus of the
   VLT development teams. Resources should never be pushed on
   commissioning.

   + ALMA should define the rules/priorities for commissioning soon.
     This should include what is expected of the development teams
     during commissioning. The people who should do this are:
     Robert Laing, Dave Silva, Dick Sramek, Software management.
     We need to have a clear understanding among ourselves ASAP and
     then present a proposal to ALMA management so this can be
     distributed to the project.

 - Note: for the VLT, it takes 24 hours to do a physical upgrade of
   the software for one UT (this takes 3 nights, and this is even
   after the pre-install verification). The reason is that the
   disks have to be built, and the software installed and verified.
   These types of processes need to be characterized for ALMA. We
   need to start collecting numbers at the ATF and put them in a
   central repository so more than just the antenna IPT or control
   software know what this involves. This will be a critical input
   to commissioning and operations.

   Along these lines: We need to understand how ALMA Operations sees
   the day to day operations of the project (On site, at the OSF, in
   Santiago, at the Regional Science Centers). This should be done
   for commissioning, early science, and full operations.
   The reason software needs this: software needs to develop an
   understanding of the minimum number of people needed during
   specific times, e.g. during instrument startup or when testing an
   instrument. Compare our current staffing plans with the number
   required to support commissioning and operations to ensure they
   are compatible.

 - VLT has no remote access to the VLT software. The firewall has
   never been opened. Local people are able to work on all code
   (never allow remote developers access to the system). This is
   critical to minimize security issues. If the firewall were
   compromised, there would have to be security patches in the
   software, and then the 6 month stability between CCS
   installations would be harder to maintain. It made it so much
   easier to work on commissioning when this policy was instituted.
   Besides, if the commissioning team couldn't fix software and
   someone had to be flown out to fix something, then the system was
   broken (as mentioned above).

 - Software system administration was outsourced to another company.
   The operating system was installed/upgraded once a year; there
   were no patches so the system was very stable. The software
   engineers did not have the root password. Only system admin had
   it. The password was sealed in an envelope in case the single
   head of admin died (this never happened).

 - What is the best way to train operators? VLT tried 2 ways:
   1. at the NTT: keep the operators involved in the process as it
      grows. They then knew every trick in the book and worked
      around problems, even when it was not safe or repeatable...
   2. VLT: keep operators out until there is a working system. Had
      to deal with the operator learning curve shortly before they
      were critical to the system. (No operator input early on, but
      they had fewer bad habits.)

   Neither way worked particularly well. The bottom line: good
   operators will learn the system no matter what. Jason thought
   that option 2. was probably best: hire operators 6 months before
   you need them.
   Thus, late introduction will imply fewer bad habits and you will
   have the same number of good operators.

 - Eric Allarert designed BOB (Broker for Observing Blocks). This
   was essentially the scheduling interface. It provided a clean
   interface to the P2PP system that was used to define parameter
   inputs (P2PP is the equivalent to our OT interface that generates
   XML to store input parameters). BOB asks P2PP what to do next.
   The system is:

   Archive -- VLT Control -- | -- BOB  <-->  P2PP (GUIs for
                             |     |          |    instruments)
                             |  executes      |  ^
                             |  scripts       |
                             |  sequencer    parameter files
                             |  code .seq    inputs

 - Pipeline heuristics development:
   * log everything into the system to try to understand what is
     done. At Paranal, logs are automatically processed and are
     used to develop maintenance programs. Jason estimates that
     this allows maintenance to be only 1/4 the amount that standard
     maintenance plans would predict.
   * The pipeline should develop 'logs' of parameters that are set
     in observing programs and record them in a standard format.
     Paranal realized this somewhat late but once they started doing
     this, they could evaluate data quality better, and this helped
     with heuristics development. Also, the logs could be shipped
     to personnel who were not in Paranal to evaluate system status
     and data quality - this limited the data rate significantly.
     The Paranal pipeline team did not want to do this at first,
     thinking it was not a true pipeline responsibility, but they
     were the best team to set this up and get it working during
     commissioning.
   * Dave Silva was responsible for data quality at the VLT so talk
     to him about details and see exactly what is done in Garching.

 - There should be total intellectual honesty within the leadership
   of the project (of course) but consider: ensure this is extended
   to the idea that there should be no work-arounds by code
   developers and subsystem leads. If something isn't correct, do
   not work around the problem - bring up the problem and fix it,
   right there.
   I think it is too easy in ALMA for subsystems to work around
   another subsystem's problems because it is easier to do this than
   coordinate the solution with the other person...

 - Suggestion by Jason (after taking the night to think about ALMA
   issues): Reserve 2 telescopes for the engineers on the software
   and hardware commissioning teams. This will allow them to
   understand problems and issues to a greater depth and figure out
   optimal solutions. In the long run it will decrease maintenance
   and improve operations.

   Perhaps: compromise with the community: promise them more than
   50% time for science if we can reserve 2 antennas at all times
   for commissioning issues/tests. It would be ideal if we could
   have different software for these 2 test antennas - use the
   correlator test bed? Can 2 versions of the software be set up?
   Possibility: rotate test antennas so that each new antenna added
   to the array would go through this test setup. This may not be
   optimum for the 2-antenna test facility because engineers would
   always have a new antenna to deal with and check out rather than
   figuring out the optimum software/hardware/interface system.

------------------------------------------------

Discussion with Dieter Neurnberger

Operations notes:

 - Paranal has daily meetings at 4:30pm for each UT team to discuss
   nightly operations and do day/night handover.

 - There are short monthly meetings with Garching folks for each
   instrument team.

 - There are monthly meetings with all operations people involved
   that can last 4 or 5 hours - all issues of concern are brought
   up.

 - Technicians and engineers have more contact with Garching because
   they need to know the instrument details. Astronomers work on
   day-to-day operations with users and on data reduction.

 - People who do software and QA support at Garching make 2 week
   visits to Paranal each year to keep them in contact.

 - Components of Paranal staff/operations:
   * 1 weather monitor each night (this is usually an operator who
     is given this duty of monitoring weather every 1/2 hr and
     making log entries for all UTs)
   * 1 telescope operator/UT every night
   * 1 AoD/UT every night to either do the service observing or help
     with the night time observations of visiting astronomers.
     Night AoD is responsible for all QA and science drivers.
   * 2 AoDs for every day - each one must do calibrations for the
     previous night for 6-7 instruments on 2 UTs. This includes
     doing flats, instrument calibrations, absolute flux
     calibrations. This can be a significant amount of work -
     especially in the winter when nights are long and days are
     short. Also, day-shift AoDs are assigned to help visiting
     astronomers with ALL their needs before their observations.
   * Visiting astronomers are required to come 2 days before their
     observations to get ready and acclimate to the altitude. They
     are given an office and all support they need to get ready for
     their observations - most of the time they write their
     observing blocks (OBs) when they come. Visiting astronomers
     can request to have a discussion with the night astronomers if
     it is possible before their observations but most preparation
     work is done with the day AoDs while actual obs are done with
     the night AoDs.
   * Night and Day astronomers overlap in the afternoon-evening to
     work out problems, details, support, handover.
   * Since all AoDs are dedicated on-site, overlap and handover is
     real - day AoDs will work until midnight if a problem needs to
     be solved. Night AoDs will come in the afternoon or stay in
     the morning if needed. It is important to have dedicated staff
     that can do what is needed when it is needed.

---------------------------------------------------

Discussion with Steffano Bagnulo
Group leader of day and night operations at Paranal

 - Principal problem with current operations: the day time
   operations are chaotic.
   One astronomer takes care of 2 UTs => 6-7 instruments/day. One
   person can do this if they are not interrupted. But problems
   arise that must be fixed and only the day-AoD can do this. This
   interrupts the flow of the work.

 - Tools to evaluate status and check calibration must be fast,
   responsive, and robust. The VLT has a problem with viewing ps
   files of calibrated frames. It takes a few minutes just to view
   a frame - machines are too slow. This is very annoying and the
   operations staff are not happy about it, but the decision was
   made to have a certain kind of machine and this apparently is not
   negotiable.

 - Tools should not crash. During commissioning, a dedicated
   software person was needed (day and night) to work on these
   issues (remember, there was no software ownership so this was
   possible).

 - Automate critical daily tests as much as possible and generate
   automatic plots. The principal reason for this (other than to
   save time) is that these tasks get boring, so people make
   mistakes and the operations history is not tracked easily.
   Commissioning should ensure that a software engineer is available
   to do this and ensure that automatic features are built into the
   system.

 - Operations: clearly define who is in charge of what, e.g.: QA on
   site, QA pipeline, QA long term monitoring.

 - If OB constraints are not adequate during service observing, the
   observation is aborted and the PI is contacted. In principle,
   OBs should be one hour long. You can ask for more than 1 hour
   but conditions are usually only granted for the first hour.
   Guarantee of a certain data quality beyond the first hour is
   rare. This appears to be the result of the fact that weather
   conditions change on 1-2 hour timescales. If bad weather or an
   instrument failure occurs in the first hour, then repeat
   observations are guaranteed.

   For the visiting astronomer: they get what weather is there when
   they visit. If the weather is bad, no repeat.

 - During the development phase of the software, figure out the best
   way to visualize the observing queue so AoDs can easily evaluate
   the scheduling subsystem choices. This is critical when the
   scheduling heuristics may not be optimum at first.

 - ALMA needs an official statement on what happens when the queue
   empties in a given time range. Can staff and postdocs provide
   scientific fillers? Or do the staff just have to twiddle their
   thumbs if the engineers don't need the telescopes?

 - Operations software tools are critical for day/night handover.
   ALMA operations needs to provide requirements on what they need
   in this area. Components of the Paranal software tools are:

   * Problem reporting system (PRS) provides a list with:
       PRS number
       creation date
       problem type
       status
       telescope name
       priority
       keywords
       description

   * Paranal Night Log System provides automatic information (e.g.
     weather, status) along with AoD comments.

   * Action Remedy System provided day/night communications to
     follow up on actions.

   * List of all possible OBs and their status (like the OVRO config
     list). This links user requirements to operations through the
     user support group. It is web based with links on the list
     that provide details (e.g. PI comments, status). Each item
     includes info on:
       Rank
       Run number
       Constraint flags
       Carry over status
       special remarks and links
       Bar showing hours completed
       brief comments on each track completed

     The basis of this list is the 'readme' files that users create,
     which are collected and manually inserted into this web page.
     This is in addition to the scheduling tool interface because
     the scheduling tool doesn't have anything to do with explaining
     the science constraints and the user readme files. This is the
     primary way the AoDs figure out what should be scheduled.
     This would be a good tool for ALMA - allow AoDs to check on
     whether the scheduler decisions are correct and link
     Astronomers at the RSCs concerned with user science inputs with
     astronomers at the OSF concerned with operations and
     guaranteeing science quality.

 - Note: each UT has a manager (a hardware or software engineer) who
   knows everything going on with that UT. They coordinate
   activities. The manager duty rotates every 2-3 days.
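
Illustration: as a starting point for ALMA requirements on
operations tools, the PRS fields noted above could be captured in a
small record type. The sketch below is only an assumption about how
such a record might look; it is not the Paranal implementation. The
field names follow the PRS list in these notes, while the status
values, the priority scale (1 = most urgent), and the helper
handover_list() are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum


class Status(Enum):
    # Hypothetical status values; the notes only say that a status is recorded.
    OPEN = "open"
    CLOSED = "closed"


@dataclass
class ProblemReport:
    # Fields mirror the PRS list in the notes above.
    prs_number: int
    creation_date: date
    problem_type: str
    status: Status
    telescope_name: str
    priority: int  # assumption: 1 is the most urgent
    keywords: list[str] = field(default_factory=list)
    description: str = ""


def handover_list(reports, telescope_name):
    """Return open reports for one telescope, most urgent first.

    Within the same priority, older reports come first.
    """
    open_reports = [
        r for r in reports
        if r.status is Status.OPEN and r.telescope_name == telescope_name
    ]
    return sorted(open_reports, key=lambda r: (r.priority, r.creation_date))


if __name__ == "__main__":
    # Made-up example entries, for illustration only.
    example = [
        ProblemReport(1, date(2004, 4, 20), "software", Status.OPEN, "UT1", 2),
        ProblemReport(2, date(2004, 4, 21), "hardware", Status.OPEN, "UT1", 1),
    ]
    for report in handover_list(example, "UT1"):
        print(report.prs_number, report.problem_type, report.priority)
```

A sorted, per-telescope view like this is one possible input to the
day/night handover meeting; the real requirements should come from
ALMA operations.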