AWS Auto Scaling
Latest revision as of 23:18, 2 December 2021
Step-by-step instructions for creating a Slurm cluster on AWS with auto scaling
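a) Upgrade pip: pip install --upgrade pip
b) Install the AWS CLI: pip install awscli
c) Install AWS ParallelCluster: pip install aws-parallelcluster
(The commands on this page use the ParallelCluster 2.x CLI; 3.x releases changed the command syntax.)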
d) Prepare your aws_access_key_id and aws_secret_access_key.
Both can be found in the "My Security Credentials -> Access keys (access key ID and secret access key)" section.
If you do not have an access key yet, press "Create New Access Key" and follow the instructions.
aws configure
AWS Access Key ID [None]: _YOUR_ACCESS_KEY_ID_
AWS Secret Access Key [None]: _YOUR_SECRET_ACCESS_KEY_
These values are stored in ~/.aws/credentials
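Before moving on, it may be worth checking that both keys were actually written. The following is a minimal sketch, not part of the original instructions: the helper name check_aws_credentials is hypothetical, and it assumes the default credentials path mentioned above.

```shell
# Hypothetical helper: checks that a credentials file contains both expected keys.
# Argument 1: path to the credentials file (defaults to ~/.aws/credentials).
check_aws_credentials() {
    cred_file="${1:-$HOME/.aws/credentials}"
    if [ ! -f "$cred_file" ]; then
        echo "missing: $cred_file"
        return 1
    fi
    # Both keys must be present for the AWS CLI to authenticate.
    if grep -q '^aws_access_key_id' "$cred_file" &&
       grep -q '^aws_secret_access_key' "$cred_file"; then
        echo "ok: $cred_file"
    else
        echo "incomplete: $cred_file"
        return 1
    fi
}

# Check the default location; do not abort if the file is not there yet.
check_aws_credentials || true
```

If the file looks complete, running aws sts get-caller-identity is one way to confirm that AWS accepts the keys.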
e) Configure ParallelCluster: run pcluster configure and answer the prompts shown below.
NB: We use AWS Region us-east-1, which corresponds to N. Virginia.
You may choose a different region.
Allowed values for AWS Region ID:
1. ap-northeast-1
2. ap-northeast-2
3. ap-south-1
4. ap-southeast-1
5. ap-southeast-2
6. ca-central-1
7. eu-central-1
8. eu-north-1
9. eu-west-1
10. eu-west-2
11. eu-west-3
12. sa-east-1
13. us-east-1
14. us-east-2
15. us-west-1
16. us-west-2
AWS Region ID [us-east-1]:
An EC2 key pair can be created in the EC2 console under Network & Security -> Key Pairs -> Create New Pair
Allowed values for EC2 Key Pair Name:
1. EC2_v1
EC2 Key Pair Name [EC2_v1]:
Allowed values for Scheduler:
1. sge
2. torque
3. slurm
4. awsbatch
Scheduler [slurm]:
Minimum cluster size (instances) :    <------- THIS CAN BE CHANGED LATER
Maximum cluster size (instances) :    <------- THIS CAN BE CHANGED LATER
Master instance type [t2.micro]:      <------- THIS CAN BE CHANGED LATER
Compute instance type [t2.micro]:     <------- THIS CAN BE CHANGED LATER
Automate VPC creation? (y/n) [n]:
Allowed values for VPC ID:
1. vpc-579d8e2d | 0 subnets inside
VPC ID [vpc-579d8e2d]:
Allowed values for Network Configuration:
1. Master in a public subnet and compute fleet in a private subnet
2. Master and compute fleet in the same public subnet
Network Configuration [Master in a public subnet and compute fleet in a private subnet]: 1
The config file is now stored in ~/.parallelcluster/config
You may review and edit it if needed.
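To review the answers given to the wizard without opening the whole file, something like the sketch below could be used. The helper name show_cluster_settings is hypothetical, and the key names (initial_queue_size, max_queue_size, and so on) are an assumption based on the ParallelCluster 2.x INI format; adjust them if your file differs.

```shell
# Hypothetical helper: prints selected settings from a ParallelCluster config file.
# Argument 1: path to the config file (defaults to ~/.parallelcluster/config).
show_cluster_settings() {
    config_file="${1:-$HOME/.parallelcluster/config}"
    # Key names are assumed from the 2.x INI format.
    grep -E '^(aws_region_name|key_name|scheduler|master_instance_type|compute_instance_type|initial_queue_size|max_queue_size)[[:space:]]*=' "$config_file"
}

# Only print settings if the wizard has already written the file.
if [ -f "$HOME/.parallelcluster/config" ]; then
    show_cluster_settings
fi
```

The minimum and maximum cluster sizes are the values marked above as changeable later, so they are the ones most likely to need a second look.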
To create a cluster on AWS, run: pcluster create -c ~/.parallelcluster/config UCSFbeta
Beginning cluster creation for cluster: UCSFbeta
Creating stack named: parallelcluster-UCSFbeta
Status: ComputeFleet - CREATE_COMPLETE
Status: parallelcluster-UCSFbeta - CREATE_COMPLETE
ClusterUser: centos
MasterPrivateIP: 172.31.0.25
Your cluster is ready to go!
In EC2->Instances->Instances you will see your master node waiting for jobs.
You now have two launch templates: one for the master node and one for the compute nodes.
You can modify them here: EC2->Instances->Launch Templates
In EC2->Auto Scaling->Auto Scaling Groups you can modify your cluster shape parameters,
i.e., min size, max size, desired size, default cooldown (when to start terminating idle compute nodes), etc.
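The same size limits can also be changed from the command line with the AWS CLI. This is a minimal sketch under stated assumptions: the helper name resize_asg and the DRY_RUN switch are hypothetical, and YOUR_ASG_NAME is a placeholder for the group name shown in the console.

```shell
# Hypothetical helper: sets the min and max size of an Auto Scaling group.
# Arguments: group name, min size, max size.
# With DRY_RUN=1 the AWS CLI command is printed instead of executed.
resize_asg() {
    name="$1"; min="$2"; max="$3"
    if [ "$min" -gt "$max" ]; then
        echo "error: min size ($min) exceeds max size ($max)" >&2
        return 1
    fi
    run=""
    if [ "${DRY_RUN:-0}" = "1" ]; then
        run="echo"
    fi
    # $run is intentionally unquoted: empty means "execute", "echo" means "print".
    $run aws autoscaling update-auto-scaling-group \
        --auto-scaling-group-name "$name" \
        --min-size "$min" --max-size "$max"
}

# Preview the command for a placeholder group name without contacting AWS.
DRY_RUN=1 resize_asg "YOUR_ASG_NAME" 0 8
```

The current values can be listed beforehand with aws autoscaling describe-auto-scaling-groups.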
To connect to your master node via SSH, run a command similar to: ssh -i "YOUR_PRIVATE_KEY.pem" email@example.com
As sinfo -lNe shows, no compute resources are available at first (smart saving mode).
To bring the compute nodes up, it is sufficient to submit even a simple job: srun -n4 hostname
Response: srun: Required node not available (down, drained or reserved)
What happens next: within 1 minute, jobwatcher notices that there are jobs in the queue.
The system brings up extra resources (up to the max size parameter), and the queued jobs start running.
Should compute nodes become idle, the system terminates them (only those that have been idle longer than the "cooldown time").
Once the nodes are up, they appear in the list of available resources: sinfo -lNe
Tue Jun 2 14:20:55 2020
NODELIST          NODES  PARTITION  STATE  CPUS  S:C:T  MEMORY  TMP_DISK  WEIGHT  AVAIL_FE  REASON
ip-172-31-16-148  1      compute*   idle   1     1:1:1  1       0         1       (null)    none
ip-172-31-18-22   1      compute*   idle   1     1:1:1  1       0         1       (null)    none
ip-172-31-22-52   1      compute*   idle   1     1:1:1  1       0         1       (null)    none
ip-172-31-29-18   1      compute*   idle   1     1:1:1  1       0         1       (null)    none
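Instead of re-running sinfo by hand while the nodes start, one could poll until enough idle nodes are listed. The sketch below is illustrative only: the helper names count_idle_nodes and wait_for_idle_nodes, and the default timeout and polling interval, are assumptions rather than part of the cluster setup.

```shell
# Hypothetical helper: counts the nodes that sinfo reports as idle.
count_idle_nodes() {
    # -h: no header, -N: one line per node, -t idle: idle nodes only, -o '%N': node name only.
    # sort -u removes duplicates for nodes that belong to several partitions.
    sinfo -h -N -t idle -o '%N' | sort -u | grep -c .
}

# Hypothetical helper: polls until enough nodes are idle or a timeout passes.
# Arguments: wanted node count, timeout in seconds (default 600), interval in seconds (default 15).
wait_for_idle_nodes() {
    wanted="$1"; timeout="${2:-600}"; interval="${3:-15}"; waited=0
    while [ "$(count_idle_nodes)" -lt "$wanted" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "timed out after ${waited}s" >&2
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    echo "at least $wanted idle node(s) available"
}

# Only poll when run on a machine where Slurm is installed, such as the master node.
if command -v sinfo >/dev/null 2>&1; then
    wait_for_idle_nodes 4
fi
```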
Running a job on 4 CPUs:
[centos@ip-172-31-0-25 ~]$ srun -n4 hostname
ip-172-31-25-53
ip-172-31-18-252
ip-172-31-23-39
ip-172-31-16-128
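The same four-task test can also be submitted as a batch job, which returns immediately and writes its output to a file. The following is a rough sketch: the helper name write_batch_script, the job name, the script location, and the output file pattern are all assumptions chosen for illustration.

```shell
# Hypothetical helper: writes a batch script equivalent to "srun -n4 hostname".
# Argument 1: path of the script file to create.
write_batch_script() {
    cat > "$1" <<'EOF'
#!/bin/bash
#SBATCH --job-name=hostname-test
#SBATCH --ntasks=4
#SBATCH --output=hostname-%j.out
srun hostname
EOF
}

# Write the script to a temporary location (assumed path).
script="${TMPDIR:-/tmp}/hostname-test.sbatch"
write_batch_script "$script"

# Submit only when run on a machine where Slurm is installed.
if command -v sbatch >/dev/null 2>&1; then
    sbatch "$script"
else
    echo "sbatch not found; script written to $script"
fi
```

In the output file name, %j is replaced by the job ID, so repeated submissions do not overwrite each other.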