Changes in nodehandler to address message losses and other issues
Imaging 400 nodes
- After starting nodehandler (for both imaging and experimentation), start the communication layer process (ind1).
- Four communication groups are created for imaging all nodes. Each group is responsible for a prespecified set of nodes. (This assignment could be moved to a config file.)
- The communication layer has to be started manually, but nodehandler terminates it automatically at the end of the experiment.
- Main steps:
  - 80 is the magic number for the group size.
  - Switch on nodes in groups of 80.
  - Retry up to three times.
  - Give up on nodes that do not boot into PXE.
  - Then switch on the next group of 80, and so on.
  - Repeat until the whenAll condition is met, then start the frisbee process.
  - Switch off nodes in the order of completion.
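
The power-on steps above can be sketched roughly as follows. This is only an illustration in Python, not the nodehandler code: `power_on` and `booted_into_pxe` are hypothetical callbacks standing in for whatever nodehandler uses to switch nodes on and to check that they have booted into PXE, and "retry up to three times" is read here as one initial attempt plus up to three retries.

```python
from typing import Callable, Iterator, List, Sequence, Set, Tuple

GROUP_SIZE = 80   # the "magic number" from the notes
MAX_RETRIES = 3   # assumption: retries allowed after the first attempt


def chunked(nodes: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` nodes."""
    for start in range(0, len(nodes), size):
        yield list(nodes[start:start + size])


def boot_in_groups(
    nodes: Sequence[str],
    power_on: Callable[[List[str]], None],               # hypothetical
    booted_into_pxe: Callable[[List[str]], Set[str]],    # hypothetical
    group_size: int = GROUP_SIZE,
    max_retries: int = MAX_RETRIES,
) -> Tuple[List[str], List[str]]:
    """Switch nodes on group by group; return (ready, excluded)."""
    ready: List[str] = []
    excluded: List[str] = []
    for group in chunked(nodes, group_size):
        pending = group
        for _attempt in range(1 + max_retries):
            if not pending:
                break  # whole group is up, move on to the next one
            power_on(pending)
            up = booted_into_pxe(pending)
            ready.extend(n for n in pending if n in up)
            pending = [n for n in pending if n not in up]
        # Nodes still pending after all attempts are given up on.
        excluded.extend(pending)
    return ready, excluded
```

Once every group has been processed, `ready` holds the nodes on which the frisbee process would be started and `excluded` holds the nodes that never reached PXE.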
Frisbee time is fairly constant; the main problem is the initial boot into the PXE image.
Total time taken: 49:23 (10 of the 400 nodes were excluded)
Tutorial with 100 nodes and OTG
The attached script was used to test the tutorial with 100 nodes