Asiaq is a collection of tools useful for running production-grade infrastructure in AWS.
- History
- Overview
- Setting up AWS accounts
- Installation
- Baking Host Images
- Environments
- Provisioning a pipeline
- Autoscaling
- Defining a new hostclass
- Image management
- Logging
- Network Configuration
- EBS Snapshots
- Identity and Access Management
- Monitoring and alerting
- Working with DynamoDB
- Elastic Load Balancing
- Route53
- Chaos
- ElastiCache
Why Asiaq? In Inuit mythology, Asiaq is a weather goddess. If she had a collection of python tools to tame the clouds, this would have been it!
Why Disco? This repository evolved from an engineering team called Disco (short for discover). As such, you will find plenty of references to "disco" throughout the code and documentation. You will probably get the right picture if you mentally translate "disco" to "asiaq" whenever you encounter it. :)
Don't Panic. Starting up infrastructure in AWS is easy. Start off by installing the Asiaq tools. If you are starting in a brand new environment a few extra steps will be required to build or promote the machines images before they can be started up to create your cloud infrastructure.
The process of creating machine images is called baking. When baking a pie, it is often required to pre-bake the crust before adding in the filling. Similarly with these tools a machine image needs to be baked two times before it can be put to use.
During the first baking we convert a vanilla third party AMI into generic, fully updated (e.g., yum update), machine image. This image can then be converted into a host-specific image by a second baking. You might be wondering how does Asiaq know what my host ought to look like? There is no magic, there is a configuration process for that.
Now that we have images ready to use we need to create environment(s) so we have a place to start them in. Creating environment sets up all required networking and gateways (as per disco_aws.ini configuration).
Finally. All that's left is to start up all the images. You could do this by drafting out the start sequence on a piece of paper and provisioning each instance as you check it off the list. But why would you? A much better method is to write a small config and spin up the whole pipeline with one command.
Oh, and at some point you probably want to stop your pipeline. Hopefully sometime before you notice a huge bill from Amazon.
If you are a developer joining an existing team using this tool you do not need to set up any AWS accounts and you can skip this section for now.
Asiaq is designed to be used with three AWS accounts: production, audit and development.
Once set up, the audit account is administered by another team (likely a Core Ops team). The CloudTrail logs for the production and development accounts are forwarded to that account and are read-only from the production and development accounts respectively.
The development account houses the build and ci VPCs as well as all sandbox VPCs.
The production account houses the deploy and prod VPCs.
Create a ~/.aws/config using base_boto.cfg as your starting point and add the [audit] section to your credentials file and the [profile audit] section to your config file (see Working with multiple accounts below if this is unfamiliar).
- Set these as your default credentials with the following command
export AWS_PROFILE=audit
Ensure IAM roles are correct
disco_iam.py update --environment audit
Create buckets with proper permissions
- Edit disco_aws.ini ensuring dev_canonical_id and prod_canonical_id initialized with the AWS Canonical account IDs of the development and production accounts.
- Create the audit account buckets
Make sure you set the project variable in disco_aws.ini with a unique project name. This will be used to create credentials buckets and those S3 buckets need to be globally unique across all AWS accounts.
./jenkins/disco_bucket_update.sh us-west-2 audit
NOTE: Before setting up a development account, make sure you have setup the audit account first.
The commands below are intended to be run from the a config directory containing your various configs (iam,app_conf,app_auth), and it is expected that the ASIAQ environment variable is set to the path to asiaq (which contains the jenkins directory).
If you haven't already, create a ~/.aws/config using base_boto.cfg as your starting point. Then add your development credentials to the [default] section of ~/.aws/credentials. Finally, make sure you are using the default profile by running this command:
unset AWS_PROFILE
Ensure EC2 limits are sufficient
Limits are per-account. This means that if you have multiple VPCs in the same account, you have to ensure enough capacity for the union of resource requirements across all VPCs.
For example, if 'X' resources are needed for a deploy environment (1 VPC), and 'Y' for a dev environment, then you'd likely need to:
- Submit instance increase limit to (X + Y * number of VPCs) in us-west-2 region from EC2 dashboard
- Submit autoscaling group increase limit to (X + Y * number of VPCs) in us-west-2 region from EC2 dashboard
- Submit launch configuration increase limit to (X + Y * number of VPCs) in us-west-2 region from EC2 dashboard
- Submit elastic ip increase limit to (X + Y * number of VPCs) in us-west-2 region from EC2 dashboard
Ensure disco_aws.ini and disco_vpc.ini has been updated with proper Elastic IPs for instances that needs Elastic IPs.
Also make sure you set the project variable in disco_aws.ini with a unique project name. This will be used to create credentials buckets and those S3 buckets need to be globally unique across all AWS accounts.
Ensure CloudTrail is enabled for the each VPC under the account.
$ASIAQ/jenkins/disco_cloudtrail_init.sh us-west-2 dev
Ensure IAM roles are correct
disco_iam.py update --environment dev
Ensure needed buckets exist
$ASIAQ/jenkins/disco_bucket_update.sh us-west-2 dev
Ensure build and ci VPCs get created
disco_vpc_ui.py create --type build --name build
disco_vpc_ui.py create --type openvpn --name ci
Ensure that Application Configuration is up to date for each VPC
- Edit and commit the configuration in disco_aws_automation/app_conf to match your VPN configuration and then apply the configuration
$ASIAQ/jenkins/disco_app_conf_update.sh build
$ASIAQ/jenkins/disco_app_conf_update.sh ci
Ensure that Application Authorization tokens exist
$ASIAQ/jenkins/disco_app_auth_update.sh us-west-2 ci .
Ensure the Build SSH/SSL keys exist (read the output and manually set any necessary keys)
$ASIAQ/jenkins/disco_keys_rotate.sh us-west-2 build
Ensure the ci SSH/SSL keys exist (read the output and manually set any necessary keys)
$ASIAQ/jenkins/disco_keys_rotate.sh us-west-2 ci
Ensure that your new development environment publish AMIs to the production account
Add the account id to disco_aws.ini in the [bake] section under
trusted_account_ids
Bake a phase 1 AMI
disco_bake.py bake
Follow the link in the error message, then hit continue, select Manual Launch, click the Accept Terms button, wait 30 seconds and then re-run the bake command.
Apin up the build environment
disco_aws.py --env build spinup --pipeline pipelines/build.csv
NOTE: Before setting up a production account, make sure you have setup the audit account first.
If you haven't already, create a ~/.aws/config using base_boto.cfg as as your starting point. Then add a [prod] section to your ~/.aws/credentials file with your credentials for that account. Add the matching the [profile prod] section to your ~/.aws/config file (see Working with multiple accounts below if this is unfamiliar). Finally, set these as your credentials for the commands that follow in this section by setting the AWS_PROFILE environment variable:
export AWS_PROFILE=prod
Ensure EC2 limits are sufficient
Limits are per-account. This means that if you have multiple VPCs in the same account, you have to ensure enough capacity for the union of resource requirements across all VPCs.
For example, if 'X' resources are needed for a deploy environment (1 VPC), and 'Y' for a prod environment, then you'd likely need to:
- Submit instance increase limit to (X + Y * number of VPCs) in us-west-2 region from EC2 dashboard
- Submit autoscaling group increase limit to (X + Y * number of VPCs) in us-west-2 region from EC2 dashboard
- Submit launch configuration increase limit to (X + Y * number of VPCs) in us-west-2 region from EC2 dashboard
- Submit elastic ip increase limit to (X + Y * number of VPCs) in us-west-2 region from EC2 dashboard
Ensure that your new production environment can see the Asiaq AMIs
- Add the account id to disco_aws.ini in the [bake] section under prod_account_ids
- Rebuild all hostclasses from the development environment’s jenkins
Ensure disco_aws.ini and disco_vpc.ini has been updated with proper Elastic IPs for instances that needs Elastic IPs.
Also make sure you set the project variable in disco_aws.ini with a unique project name. This will be used to create credentials buckets and those S3 buckets need to be globally unique across all AWS accounts.
Ensure CloudTrail is enabled for the each VPC under the account.
$ASIAQ/jenkins/disco_cloudtrail_init.sh us-west-2 prod
Ensure IAM roles are correct
disco_iam.py update --environment prod
Ensure needed buckets exist
$ASIAQ/jenkins/disco_bucket_update.sh us-west-2 prod
Ensure the staging, production and deploy VPCs get created
disco_vpc_ui.py create --type deploy --name deploy
disco_vpc_ui.py create --type staging --name staging
disco_vpc_ui.py create --type production --name production
Ensure that Application Configuration is up to date for each VPC
- Edit and commit the configuration in disco_aws_automation/app_conf to match your VPN configuration and then apply the configuration
$ASIAQ/jenkins/disco_app_conf_update.sh deploy
$ASIAQ/jenkins/disco_app_conf_update.sh staging
$ASIAQ/jenkins/disco_app_conf_update.sh production
Ensure that Application Authorization tokens exist
$ASIAQ/jenkins/disco_app_auth_update.sh us-west-2 staging .
$ASIAQ/jenkins/disco_app_auth_update.sh us-west-2 production .
Ensure the Deployment SSH/SSL keys exist (read the output and manually set any necessary keys)
$ASIAQ/jenkins/disco_keys_rotate.sh us-west-2 deploy
Ensure the Production SSH/SSL keys exist (read the output and manually set any necessary keys)
$ASIAQ/jenkins/disco_keys_rotate.sh us-west-2 staging
$ASIAQ/jenkins/disco_keys_rotate.sh us-west-2 production
Start up deployenator (a jenkins host for deployment automation) for each production VPC
#. For the moment, all production environments share the same produciton account. #. Every production VPC is managed by its own deployenator.
Get the latest copy of Asiaq tools:
git clone <repo_url>
cd asiaq
Using virtualenvwrapper, create an environment to run python in:
mkproject asiaq
Then install Ruby (for example, using rbenv).
And then install rake:
gem install rake
Finally, install the egg in your virtualenv:
rake setup:develop
Asiaq uses both boto 2 and boto 3 to communicate with the AWS APIs. Because we now use both, you should no longer be using a .boto with your credentials in them. This is because boto2 can find credentials specified in ~/.aws/credentials where boto3, but boto3 will not see credentials specified in your .boto file.
Instead create an ~/.aws directory and add a credentials and config file to it.
mkdir -p ~/.aws
cp jenkins/base_boto.cfg ~/.aws/config
ln -s ~/.aws/config ~/.boto
touch ~/.aws/credentials
The credentials file should contain default credentials for your main development account and profiles for any additional accounts you have credentials for.
[default]
aws_access_key_id = CafeFedF1d01DeadBeaf
aws_secret_access_key = My/AWS/Secret/Is/Awesome/How/Is/Yours/00
[litco-prod]
aws_access_key_id = CafeFedF1d01DeadBeaf
aws_secret_access_key = My/AWS/Secret/Is/Awesome/How/Is/Yours/00
Replace the aws_access_key_id and aws_secret_access_key with your own. If you don't know your credentials, you can create new ones in the AWS console:
- Log-in to the AWS Console as yourself (not as the root account).
- Go here https://console.aws.amazon.com/iam/home?#users
- Right click on your user name and select "Manage Access Keys"
- Then click on the "Create Access Key" button.
- Then click on the "Download Credentials" button
- Copy the access key and secret access keys from the CSV file into your boto file.
If there was an old access key you had forgotten:
- Left click on your user name and select "Manage Access Keys" (again)
- Delete the old Access Key (use the date created to disambiguate)
Now you will be able to run the disco_aws.py and friends.
You will need to create a [profile_name] section in your ~/.aws/credentials file and you will need to add a [profile profile_name] section for each profile in the ~/.aws/config with the same contents as the [default] section.
i.e. for a profile called [litco-prod] in your credentials file you will need this section in your ~/.aws/config:
[profile litco-prod]
region = us-west-2
And this section in your ~/.aws/credentials
[litco-prod]
aws_access_key_id = CafeFedF1d01DeadBeaf
aws_secret_access_key = My/AWS/Secret/Is/Awesome/How/Is/Yours/00
Then when wanting to run commands against the non-default account, use the AWS_PROFILE environment variable.
For example:
AWS_PROFILE=litco-prod disco_aws.py listhosts
The best way to have Jenkins use Asiaq is to spin up jenkins in AWS and assign proper AWS instance role that can interact with AWS. Please see disco_aws.ini mhcdiscojenkins section for 'instance_profile_name' attribute.
As a new member you need someone to initialize your AWS account.
- Create new user config by copying an existing one from 'iam/group_membership/dev' to a new file following the same naming pattern as the original.
- Run
disco_iam.py update --environment dev
to create apply the configuration to AWS. - Log in to AWS console and navigate over to IAM Users
page select newly created user and create a new password with the
Manage Password
action. - Share the new password with the user. And point them to the installation and [Adding Operator Unix Account](#adding operator unix account) section.
To be able to log into a provisioned machine you first need to add your
user credentials to an S3 bucket with the help of disco_accounts.py
To add a new operator account:
disco_account.py adduser --name foo
This will ask you to type a new password, it will then pre-generate a
user configuration, and open it with your favorite $EDITOR
. At this
point you can make any alterations to the account, including pasting in
you ssh key into the ssh_key_0
(without which you will not be able to
ssh to this account). Multiple ssh_keys are supported by adding more
ssh_key_X
fields where X is a number. When satisfied with changes save
file and close editor, configuration will now be uploaded to S3.
Should any subsequent alterations be needed to account, just run:
disco_account.py edituser --name foo
To disable a user account, open the config with edit and change active
field value to no.
To change your password, you need to paste in crypt(3) compatible hash,
which you can conveniently generate with
disco_account.py hashpassword
.
NOTE: disco_account.py
only updates configuration in s3, these changes
do not take immediate effect on any running instances. The instance
needs to be rebooted or the disco-add-operators
service is restarted.
When working with instances inside an environment you're especially forced to use sshaagent. This is because all ssh connections must go through a jump host and password based authentication is disabled. Obviously you don't want to store a private key on jump host not only because its ephemeral (in most cases anyway) but also because its insecure.
After SSH agent is running you can daisy chain ssh connections by either connecting to the jump host and then to target machine. Or using this command:
ssh -At root@$jump_host ssh root@$destination
In either case be sure to pass in -A when connecting to the jump host.
Alternatively, you can use the disco_ssh.sh
script to avoid having to
know the jump host and destination host IPs:
disco_ssh.sh $env_name $destination_instance_id
This script assumes you have the agent running, but will take care of the rest of the configuration for you.
If you prefer to use the agent manually (or if the disco_ssh.sh
script
doesn't work for you and for whatever reason you can't fix it), then
read the following section on configuring the agent.
The most simplistic (but not convenient) method of using ssh_agent:
- Start ssh-agent: 'eval $(ssh-agent)'
There is one rather annoying limitation when starting ssh-agent using this method. The agent is only active for current shell session, so sshing to VPC'd hosts will not be possible from any other terminal windows/tabs.
The convenient method of using ssh_agent is by making use of keychain(not to be confused with OSX keychain). After installing keychain the setup requires addition of the following to your ~/.bashrc:
/usr/bin/keychain -Q -q --nogui ~/.ssh/id_rsa
source ~/.keychain/$HOSTNAME-sh > /dev/null
With this configuration, when a shell is first started keychain will prompt to unlock keys. After this any subsequent sessions will re-use the unlocked key from the first session.
Before any baking can take place some infrastructure needs to be in place. Most likely it is already running, but if you are baking at odd times (late at night (maybe you should be sleeping instead?)) or re-building pipeline from scratch, then it needs to be started up.
At a minimum the build VPC needs to be up. To confirm whether this is indeed the case:
disco_vpc_ui.py list | grep build
With just an empty VPC, only base images and a few hostclasses (templates for groups of identical hosts) can be built, the ones that don't require any of our custom RPMs or eggs. List of these special hostclasses can be found under no_repo_hostclasses option of disco_aws.ini. If you are planning on baking these no additional infrastructure needs to be up.
For all other hostclasses the rpm server needs to be up, it lives on mhcdiscojenkins. Confirm that it's up:
disco_aws.py --env build listhosts --hostclass mhcdiscojenkins
If it's not, start up the build pipeline:
disco_aws.py --env build spinup --pipeline pipelines/build/build.csv
To prepare the base image we run this:
disco_bake.py bake
This takes a CentOS provided image and runs the init/phase1.sh script to perform a yum update and to install some packages we will need on all hosts, such as NTP and gmond. Because of the yum update this step takes almost 10 minutes.
To create an image with hostclass specific configuration / software:
disco_bake.py bake --hostclass mhcNameOfHostclass
This will take the latest base image and install hostclass specific software and configuration making it ready for use via provisioning process.
Much of baking involves issuing remote commands to a temporarily provisioned instance. These commands are performed over ssh, and there are two different ways we can connect to the instance over ssh. Using either its private or, its public address:
- Private address can only be used when the bake is invoked from the same metanetwork as where the instance being baked.
- Public address can be used whenever the default baking security group allows communication from the source address to the public on port 22.
Baking defaults into the mode where public address is used for ssh. But
when baking from within same metanetwork as the default bake security
group --use-local-ip
flag should be supplied to switch to private
address mode:
disco_bake.py bake --use-local-ip
Most generally, use --use-local-ip
when baking from
jenkins-in-the-cloud and don't use it when baking from your desktop.
Some information is too sensitive to be stored in git (e.g., passwords, SSL certificates, LADP password). We keep such "credentials" in S3 buckets whose permissions are more restrictive than git's.
To know what credentials need to be injected. Run the disco_keys_rotate.sh script from console or run disco-aws-key-rotate-prod. See the output for a list of keys that needs to be uploaded by disco_operator.
Here is how to use credentials injection.
Store the sensitive information in S3:
disco_creds.py set --key app_auth/my_cool_new_service/username --value "secret password"
Create the file you want to inject the information into:
vim discoroot/opt/wgen/etc/my_cool_new_service/config.ini
Include the S3 substitution string in the file:
# inside discoroot/opt/wgen/etc/my_cool_new_service/config.ini
user = username
password = {s3cred://app_auth/my_cool_new_service/username}
Note that the S3 substitution string will be baked into the AMI, and replaced by the credentials stored in S3 at provision time. This allows the same AMI to be deployed to multiple environments using different credentials. The bucket from which credentials are fetched will be chosen based on the environment in which the host is provisioned.
Environments created by Asiaq are separated out by VPCs. Each environment resides in its own VPC and has its own metanetworks, gateways, instances, and so on. All environment management is done with the disco_vpc.py tool.
List all the running VPC style environments:
disco_vpc_ui.py list
Creating an environment (a VPC) requires two pieces of information. The name of the environment, and the environment type.
The environment names must be unique, check that the selected name will not collide with any currently running environment by listing active environments.
An environment type defines network infrastructure including:
- what IPs that are used
- Whether it is unique
- Network topology
- EIP allocations
These are defined in disco_vpc.ini. Open this file to find one that fits your purpose or define a new one.
Creating environment:
disco_vpc_ui.py create --type sandbox --name test
Find the vpc id string by listing active environments.
Destroy the environment:
disco_vpc_ui.py destroy --vpc-id vpc-dc84d8b7
By default disco_aws.py selects the environment defined in disco_aws.ini. The --env parameter allows you to work in an environment other than the default one without having to alter the config.
If you will be doing a lot of work in a particular environment you can define the DEFAULT_ENV environment variable and disco_aws.py will default to working in that environment.
Once the images are baked you can spin up the pipeline using this command:
disco_aws.py spinup --pipeline \
pipelines/dev/disco_profiling_pipeline.csv
The format of the CSV file is pretty simple. Here is a short sample:
sequence,hostclass,min_size,desired_size,max_size,instance_type,extra_disk,iops,smoke_test,ami,deployable,integration_test
1,mhcdiscologger,,1,,m3.large,200,,no,ami-12345678,no,
1,mhcdiscoes,,2,,m3.large,200,,no,no,
2,mhcdiscotaskstatus,,1,,m3.large,,,no,yes,disco_profiling_task_status_service
2,mhcdiscoinferenceworer,1,1@45 19 * * *:3@33 19 * * *,,5,m3.large,,,no,yes,disco_inference_workflow
Field descriptions:
- Instance boot sequence number. The smaller the number the earlier the machines are started, relative to others. This field need not to be unique, in the example above the first three instance are started in parallel.
- Hostclass name of instance, this determines what AMI to use as well as applies host class specific configurations from disco_aws.ini (such as LB settings).
- Minimum Number of instances to use
- Desired Number instances to use.
- Maximum Number of instances to use
- Instance type
- [1-1000] Gigabytes of extra EBS storage to attach to the instances. This will need to be mounted to be of use.
- Provisioned IOPS for EBS volume
- If yes ensure instance passes smoke test before continuing on starting next sequence. By default smoke test just tries to ssh to the instance and only passes after ssh connection has bee established and authenticated.
- AMI specific AMI to use instead of latest tested AMI for hostclass
- deployable If true we can replace an instance with a newer one
- integration_test Name of the integration test to run to verify instances are in a good state
The desired_size can be either an integer or a colon (:) separated list
of integers with cron formatted times at which to apply each size. Using
the at symbol (@) to separate the desired size and the cron
specification. For example, "1@30 10 * * *:5@45 1 * * *"
says to scale
to one host at 10:30 AM UTC and scale to 5 hosts at 1:45 AM UTC.
But you can also provision machines one at a time using the provision command, for example:
disco_aws.py provision --hostclass mhcntp --min-size 1
There are a number of optional parameters as well, here is a selection:
--no-smoke-test This will skip the smoke test of newly provisioned hostclass.
--no-destroy This will forgo automatic destruction of a host that failed to provision
--ami ami-XXXX This will use a specific AMI rather than the latest for the hostclass
--extra-space This will resize the root partition on boot with the specified number of extra gigabytes of disk
--extra-disk This will attach an extra EBS volume with the specified number of gigabytes
Note: "Extra space" will automatically be added to the root partition, but this slows down provisioning. An "extra disk" has to be formatted and mounted before use and that is a much faster process. For these reasons extra disk should generally be preferred. However if you need to add storage capacity immediately to a host not already using "extra disk" then "extra space" allows that without any config or code changes, while adding "extra disk" support to an image requires that you figure out which paths need to be mounted, create a start-up script to prepare the volume, make sure it is run at the right time, and then bake and test the new image.
Similar to spin up a running pipeline can be torn down:
disco_aws.py spindown --pipeline \
pipelines/dev/disco_profiling_pipeline.csv
If you are planning to also discard the environment you can skip this step and proceed to Destroying an Active Environment_.
When disco_aws.py is used to provision a hostclass it creates an autoscaling group. You don't generally need to deal with these directly, but if you would like to see the state of autoscaling groups you can use disco_autoscale.py.
You can list autoscaling configurations:
disco_autoscale.py listconfigs
You can list autoscaling groups:
disco_autoscale.py listgroups
And you can list autoscaling policies:
disco_autoscale.py listpolicies
There are also commands to delete each of these. Generally you can simply use disco_aws.py and disco_vpc.py and they will handle the details for you.
A hostclass is our functional unit for deployment. Multiple machines may be instantiated from a single hostclass, but every machine in the hostclass has the same specific function, configuration and set of packages installed.
When defining a new hostclass use the create command to start:
disco_bake.py create --hostclass mhcbanana
Generally you will install some software in the init/mhcbanana.sh script:
yum install -y httpd
chkconfig httpd on
As you define the hostclass you will want to test your changes by baking:
disco_bake.py --debug bake --hostclass mhcbanana --no-destroy
If the bake succeeds this will print out the AMI id of the newly baked hostclass which you can provision normally. If it fails the --debug option ensures that everything is logged and the --no-destroy option ensures that you can ssh into the machine to investigate the problem.
You can find the IP of the machine by listing it:
disco_aws.py --env build listhosts --most | grep mhcbanana
Then you can log in as root:
ssh root@IP_OF_MACHINE
In addition to running commands on bake via the init/HOSTCLASS.sh script you can also add files and set their permissions via the acfg mechanism.
For a file or directory to appear on the new machine simply add it under discoroot and it will be copied over to the machine with root:root ownership. If you need special ownership or permissions add this to acfg.metadata in the form:
path/to/file 0755 user group
acfg is run before the host specific install script, so any users that might normally be created by package install scripts may need to be created in init/phase1.sh instead; the base image will also need to be regenerated after changes to phase1.sh
By default any file in discoroot will be installed on every Asiaq machine. If you would like to limit a configuration file to a specific machine append the tilde ~ and the name of the hostclass to the file. Here is an example with two different database hosts:
discoroot/etc/mongodb.conf~mhcobservationstore
discoroot/etc/mongodb.conf~mhcprofilestore
We do not run any post boot configuration such as Chef, Puppet or Ansible scripts. This means machines need to seek passwords from S3 and otherwise do any post-provision configuration via init.d scripts. You may want to look at other similar host's init.d scripts for self-discovery templates when adding a new hostclass.
Finally if this new hostclass should become part of the standard Asiaq pipeline add it to the appropriate pipeline definition CSV file. See Provisioning a pipeline_ for more details.
Images can go through various stages and vpc in the development environment can be configured to run only images in a particular stage unless the AMI is specified on provision.
The number and names of the stages are specified by the ami_stages variable in the disco_aws.ini's bake section. By default the stages are untested and tested.
Once an AMI reaches the last stage it will be shared with the accounts specified in the prod_account_ids variable in the same section of disco_aws.ini if it was baked by the prod_baker (as defined in the [bake] section of disco_aws.ini). This prevents developer created instances from accidentally being promoted to production use. We currently set this to jenkins since the jenkins user on our Jenkins machine builds all our production AMIs.
To test, promote and deploy (if deployable) the latest instance of a hostclass, run this command:
disco_deploy.py test --pipeline pipelines/ci/disco_profiling_pipeline.csv --hostclass HOSTCLASSNAME
If you leave out the --hostclass parameter this command will pick a hostclass at random for testing.
To explicitly promote an AMI without testing to the tested stage you can run this command:
disco_bake.py promote --ami ami-XXXXXXXX --stage tested
Although Amazon charges only a nominal fee of a cent per month per image, keeping many old images can lead to confusion.
An AMI can be deleted by simply running:
disco_bake.py deleteami --ami ami-XXXXXXXX
Deleting many AMIs one by one can be cumbersome. A special command exists in disco_aws called cleanupamis. It can be used to delete AMIs by age or by keep count (per hostclass). To delete all but last three tested images of each image type run:
disco_bake.py cleanupamis --stage tested --keep 3 --age 0
To delete all untested AMIs older than one month:
disco_bake.py cleanupamis --stage untested --keep 0 --age 31
The default age is 14 days and the default keep count is 3 and 14. This will delete the images older than 14 days, but keep the last three even if they are older than 14 days:
disco_bake.py cleanupamis --stage untested
disco_bake.py cleanupamis --stage tested
This is equivalent to running:
disco_bake.py cleanupamis --stage untested --keep 3 --age 14
disco_bake.py cleanupamis --stage tested --keep 3 --age 14
You can also restrict the cleanup to a particular hostclass using the --hostclass parameter. You can get the list of amis that cleanupamis would delete without actually deleting them using the --dryrun parameter.
Logging within Asiaq is all done over syslog. All hosts come
pre-configured with a local relaying rsyslog. To make use of it, send
syslog to UDP 514
, TCP 514
, or /dev/log
.
All logs should have a
syslogtag
corresponding to the application name. If its a custom Asiaq service
then syslogtag should be of the following format
disco.$app_name.$log_type
($log_type examples: app, errors, audit,
stat). This tag is used to organize log files on log aggregator as well
as makes it easy to search for logs through kibana. For services which
log to syslog with
syslogtag but don't
follow the disco.$app_name.$log_type
convention additional rule should
be added to discoroot/etc/rsyslog.conf~mhcdiscologger
. Specifying
which file the logs should be written to. For example:
# Handle mongo logs
:syslogtag, startswith, "mongod" /opt/wgen/log/mongo.log
& ~ # discard logs matching prev rule after processing
All application should write logs over syslog instead of a local file.
Options which affect network are split into disco_aws.ini and disco_vpc.ini. Former is used for all configuration that is hostclass specific while disco_vpc.ini is used for environment wide configuration.
An Asiaq environment network is composed of 4 metanetworks: intranet, tunnel, dmz, and maintenance. Each has its own Security Group (SG), Route Table, and one subnet per availability zone. This dictates the granularity of network configuration. That is, all instances in the same metanetwork will have the same SG Rules and routes. Even if those instances are of different hostclasses.
These metanetworks have their own specific purposes:
- intranet: All internal Asiaq services live here; web services, worker, rabbits & etc.
- tunnel: Used by instances that need to have full direct internet access, in any production-like environments only internet filtering proxies should be here. Such as the s3 http proxy (mhcs3proxy).
- dmz: In this metanetwork we terminate customers traffic and route it back to the services which live in the intranet. Generally there are only 2 types of hosts in this metanetworks. VPN terminator and a load balancer. This metanetwork should only have customer specific routes out to the internet.
- maintenance: Serves as an entry point, a vestibule if you will, for all administration tasks. No hosts which are necessary for the continuous running of the production pipeline should be here. In sandbox environments this metanetwork usually routes one IP back to the internet this way all the ssh https packets get back to the developer. In production no outbound routes exist in this metanetwork.
Currently these 4 metanetworks are hardcoded. Adding new ones requires changes to disco_vpc.py. However many tunables for these metanetworks are exposed through disco_vpc.ini.
What metanetwork an instance gets placed into is controlled by the
hostclass specific meta_network
option in disco_aws.ini:
[mhcmyhostclass]
meta_network=dmz
Different environments can share a large set of configuration options. To reduce duplication of configuration, all environments have their configuration split into two sections.
- environment type section, which comprises all the base options which are shared for a set of environments
- specific environment instance section, which only applies to only one environment.
These different ini-config section types are differentiated by the
envtype:
and env:
prefixes.
With this configuration you can, for example, define a set of security
group rules in the envtype
section which are shared for production and
staging but vary IP ranges that are used by each by defining VPC
IP-range in the env
section.
Both sections have the same options and any option defined in env
takes precedence over same option defined in envtype
.
The IP range for the full VPC is specified with vpc_cidr
option using
the standard CIDR format:
vpc_cidr=10.0.0.0/20
In the same way as the VPC address range, the individual metanetwork
ranges can also be defined with intranet_cidr
, tunnel_cidr
,
dmz_cidr
, and so on. There are some important facts to consider:
-
The metanetwork cidr must be a subset of the VPC range and not overlap with other metanetworks
-
The metanetworks will be automatically divided into multiple subnets, one per availability zone. Largely this is transparent to the user. With one exception: IP availability. CIDR range can only be split into 2^n fragments. And so, when disco_vpc has to divide unevenly it will discard possible IP space.
For example if there are 9 Availability zones and /24 CIDR. We need to over-divide the range into 16 ( Ceil(Log(9,2)) ) ranges but only use the first 9. So we lose almost 1/2 the original IP space. Usually the problem is not as severe as there are only 2-4 Availability zones per region.
Each metanetwork have many Security Group rules associated with it. Each rule is a space separated triplet consisting of:
- Protocol type (TCP/UDP/ICMP...)
- Traffic source. This can be either a CIDR, name of metanetwork, or all (for all metanetworks)
- Ports. Space separated list of ports to which traffic will be
allowed. The ports can also be specified in a range with
:
as a separator (Eg. 0:1024). IP protocols which do not have a notion of ports (Ie. ICMP), specify -1.
Each metanetwork have multiple rules associated with it, each one
separated from the other with a ,
.
Examples:
Allow port 0-1024, 443 and 8080 to intranet metanetwork from dmz and ICMP traffic from all of 10.0.0.0/8:
intranet_sg_rules=tcp dmz 0:1024 443 8080, icmp 10.0.0.0/8 -1
Just like with metanetwork IP range notation, each metanetwork has its
own sg_rule
option.
Opening ports for customers using {$my_metanetwork}_sg_rules
is somewhat
cumbersome. As customers can only talk to DMZ metanetwork, ports need to be
openeded on both intranet and DMZ. A shortcut is provided to make this easier:
customer_ports=443,80
customer_cidr=0.0.0.0/0
This will open port 443 and 80 to DMZ from internet, and also open the same ports from DMZ to intranet. This way DMZ load balancer can both serve customer and talk to services.
What traffic can be routed out of the metanetwork to the Internet
gateway (IGW) can be specified {$my_metanetwork}_igw_routes
option.
Any production like environments should use this option in Tunnel and
DMZ metanetworks. On Tunnel metanetwork we route all non-internal
traffic to internet:
tunnel_igw_routes=0.0.0.0/0
And for DMZ route the VPN traffic back to customer:
dmz_igw_routes=1.2.3.4/32
Multiple IGW routes can also be specified. So for example if the customer has two VPNs:
dmz_igw_routes=1.2.3.4/32 1.2.3.5/32
AWS allows to pass in some custom values to be subsequently handed out to instances on boot with DHCP. We expose a smaller subset of these:
internal_dns
Primary DNS serverexternal_dns
Secondary DNS serverdomain_name
The network domainntp_server
Address of NTP server.
We use AWS resolver by specifying AmazonProvidedDNS for internal_dns and external_dns, alternatively the APIPA address 169.254.169.253 can be used.
The important bits of DirectConnect / VPN Gateway are configured by 3rd parties. You file a request with your ISP or NOC, which establishes a connection between your data center and AWS on your behalf. From this point all we need to do is confirm the connection. There might be some other things that happen behind the scenes, but the outcome of this is a Virtual Gateway. This virtual gateway must be tagged with Name tag, with a value of the environment it belongs to. Unlike most other of our AWS, resources, this VGW is persistent. It is not deleted when the VPC is deleted but is merely detached, this way it can be re-attached when VPC is re-created without having to involve a 3rd party.
To establish a Direct Connection / VPN Gateway to a VPC, it needs to be
attached and route table entry must be created. This can be done by
enabling route propigation or for finer control, manually specified as
options in disco_vpc.ini
. The format is similar to Internet Gateway
routing.
What traffic can be routed out of the metanetwork to the Virtual
Private Gateway (VGW) can be specified {$my_metanetwork}_vgw_routes
option.
On Tunnel metanetwork we can route all 10.123.0.0/16 traffic to the
VGW:
tunnel_vgw_routes=10.123.0.0/16
Multiple VGW routes can also be specified. So for example if the customer has two networks to route via IGW:
dmz_vgw_routes=10.123.0.0/16 1.124.0.0/16
We make use of several types of addresses for our instances: * Public IP (dynamically assigned) * Elastic IP (EIP) * Dynamic / Allocated Private IP
Elastic IPs need to be pre-allocated before they can be used in any configuration file.
Allocating EIP:
disco_eip.py allocate
After an EIP is no longer needed it can be released. Before releasing EIP one should grep through all of Asiaq (at minimum the .ini) files to ensure the EIP is not being relied upon:
disco_eip.py release --eip 54.186.17.80
Listing all EIPs:
disco_eip.py list
To assign an EIP to a instance, add an eip option with the value of the desired EIP address. The option is specified for the instance hostclass in disco_aws.ini, as a result only the first started instance of the hostclass will get the EIP. Consequently, its often desirable to also specify public_ip=True so that any subsequent instances get a generic public IP when EIP is already assigned. EIPs can be assigned differently depending on environment, simply use the @ option convention. For example:
[myhostclass]
...
eip=5.5.5.5
eip@staging=1.3.3.7
eip@production=31.3.3.7
public_ip=True
...
with this configuration staging and production instances of this hostclass will be assigned EIPs 1.3.3.7 and 31.3.3.7 respectively. The first instance of any different environment will get EIP 5.5.5.5 and the rest will get a random public IP.
For the most part private IPs are assigned dynamically by DHCP: a arbitrary available IP in the metanetwork is allocated and assigned to the instance. This is the default behavior for private IPs and used in the vast majority of cases. For certain instances, such as DNS servers, static private IPs are required. To assign such an IP, the ip_address hostclass option can be used. Some care needs to be taken when choosing a private IP address; AWS reserves the first 5 IP addresses of the VPC range for its own use, the address must be within the IP range of the instance subnet, and the address must not be re-used. Also, due to the nature the way amazon maps subnets to specific availability zones any instance with a static IP gets locked into one arbitrary availability zone.
Just like with EIPs the @
notation (see above) can be used to assign
different private static IP to instance in different IPs.
An instance can be configured to become a sink for particular IP range. That is, any packets on the metanetwork will be routed back to an instance of specific hostclass. This can be used for sending VPN or NATing:
[myhostclass]
route=10.5.5.0/25
source_dest_check=False
The source_dest_check is used to allow the instance to receive packets whose destination does not match the instance IP. See SourceDestCheck for more details.
In AWS you almost always want to offload state to an AWS managed service there are times when you will want to store state on an EBS volume that is preserved across reboots. This can be done with EBS snapshots. Asiaq will automatically link the latest snapshot tagged with a hostclass name to the autoscaling group for that hostclass upon autoscaling group creation (i.e. on provision or spinup).
If you wish to update an existing autoscaling group with the latest snapshot you can run this command:
disco_snapshot.py --env build update --hostclass mhcverystateful
There are also commands to list, cleanup, delete and capture snapshots which work how you would expect them to work.
There is also a create command that allows you to create the initial EBS volume snapshot for a hostclass. This initial volume will not be formatted.
We make use of IAM for access control to AWS resources quite extensively. Despite this we treat the IAM configuration in AWS as ephemeral, it gets periodically reloaded from configuration stored in this repository. There are few reasons for this:
- Configuration can be easily copied between multiple accounts (such as audit, production, development).
- We do not accidentally leave temporary IAM changes behind when testing IAM policies and the like.
The IAM configuration is made up of several directories:
-
federation: Contains single file
AssumeRolePolicyDocument.iam
which specifies the policy to be used for federation. AWS documentation is rather scant on what the purpose ofAssumeRolePolicyDocument
/trust relationship
is, but it certainly needs it. Its unlikely this will need modification. -
group_membership: Contains set of directories, one per environment. Each of the environment directory has a file per used (with
.grp
extension) which species which groups the IAM user is part of (one group per line).Group membership is also use to create IAM users in the account. You might notice that there is no facility to specify keys or passwords, these must be created independently and will persist in IAM.
-
instance_policies: Specifies Instance Roles based on the containing IAM policy
.iam
file. The mapping between instance role and hostclass is defined in disco_aws.ini with theinstance_profile_name
parameter. -
user_policies: Specifies User Roles (Federated users) and Groups based (IAM Users) on policy
.iam
files. -
s3: Specifies policies for buckets.
There are several policy types: - .iam: These are the IAM policies themselves. They say what can be done to what resources from which locations. - .tr These determine the trust relationship between accounts. - .acp: These are S3 Access Control Lists. When IAM won't do what you need. - .lifecycle These determine when S3 buckets are sent to Glacier or expired. - .logging These determine whether and where S3 access logs are sent. - .versioning These determine whether versioning is enabled for an S3 bucket.
The syntax of the IAM policies is described in the AWS IAM Policy Reference.
In the disco_aws.ini file there is an [iam] section. This contains up to six variables: - saml_provider_name Name of SAML Identity Assertion Provider (for SSO) - saml_provider_url URL of SAML Identity Assertion Provider (for SSO) - role_prefix This plus "_" is prepended to user groups and roles. - policy_blacklist This prevents a subset of the defined groups and roles from being created. This is used with "@prod" to keep developer account policies from leaking into the production account. - prune_empty_groups When set to True any groups to which no user belongs are pruned. - naked_roles This specifies roles to which the role_prefix is not applied. This is used when policies need to have a specific name for cross account access. For example: SecurityMonkey.
To sync IAM configuration from the git repository to AWS:
disco_iam.py update --environment foo
This will sync the following:
- Federated User Roles & Policies
- Group (AKA Unfederated User) Roles & Policies
- Group membership
- Instance Roles & Policies
- SAML providers (based on disco_aws.ini's saml_provider_name and saml_provider_url parameters)
- Users (only the existence there of, no API keys / password are synced)
Synchronization deletes unmatched Groups / Roles / Policies / SAML-providers and overwrites with configuration from IAM directory. For IAM users, only existence is synchronized. They will be deleted / created as appropriate but will not be overwritten, this ensure keys and passwords persist.
You can list the IAM users and see the groups each is in:
disco_iam.py listusers
You can list the groups:
disco_iam.py listgroups
You can list the currently in force policies for a group:
disco_iam.py listgrouppolicies --group_name disco_dev
You can list keys (you can have zero, one or two API keys):
disco_iam.py listkeys --user_name disco_deployenator
You can create a new key (this is the only time the secret key is printed):
disco_iam.py createkey --user_name disco_deployenator
You can deactivate the old key:
disco_iam.py deactivatekey --user_name disco_deployenator --access_key_id AKIAJPXB3X3RVOOUSBAQ
If you have another key in your ~/.aws/credentials, you can reactivate the old key in case of an issue:
disco_iam.py activatekey --user_name disco_deployenator --access_key_id AKIAJPXB3X3RVOOUSBAQ
And, you can remove the old key:
disco_iam.py removekey --user_name disco_deployenator --access_key_id AKIAJPXB3X3RVOOUSBAQ
The work flow for rotating the Jenkins AWS API key is as follows:
- List the existing keys. There should be only one key there and it should be the same as the key in your ~/.aws/credentials file.
- Create a new key. Note the output, this is the only time the secret key is available.
- Rename the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables and create new variables with those names and the new key id and secret key.
- Deactivate the old key.
- Run a Jenkins job requiring AWS API access, assuming it works delete the old environment variables.
- Remove the old key.
The work-flow for rotating your own API key is as follows:
- List the existing keys. There should be only one key there and it should be the same as the key in your ~/.aws/credentials file.
- Create a new key. Note the output, this is the only time the secret key is available.
- Place the output of of createkey in your ~/.aws/credentials file and comment out the old key there.
- Remove the old key (using your new key in the ~/.aws/credentials file).
You can update the environment application authentication password by environment:
disco_app_auth.py --env prod update
Or by bucket name:
disco_app_auth.py --bucket <bucket_name> update
We use Cloudwatch Metrics and Alarms for monitoring and alerting. The workflow is:
- metrics and alarms are manually configured in disco_alarms.ini
- on VPC creation, SNS topics are created and subscribed to first responders
- on hostclass provisioning, Clouwatch metrics and alarms are created, and configured to push to appropriate SNS topic
- on hostclass termination, Cloudwatch metrics and alarms are deleted, but SNS topics are left intact
- on VPC destruction, SNS topics are still left intact to avoid a manual confirmation step associated with re-subscribing first responders
The config file disco_alarms.py
contains one [notifications]
section
describing the SNS topics, and multiple [<metric_name>]
sections
describing the metrics and corresponding alarms to be created, either
globally or at a hostclass level.
Notification entries in disco_alarms.py
are of the form
<env>_<severity>=<subscribers>
, where:
env
is an environment name (i.e., a VPC name)severity
is one of:info
,critical
subscribers
is either a list of comma-separated email addresses, or a callback url (useful for PagerDuty)
A new SNS topic will be created for every line in the Notifications section, and all subscribers will be subscribed to this topic.
Note that every new email subscription will automatically trigger an email from AWS to the subscriber, asking for confirmation that the subscription is indeed desired. Notifications will not be sent out to this email address until the subscriber has confirmed.
Alarms and metric fall into two categories, custom and automatic.
The metrics for the latter are automatically created with auto scaling
group. To set up alarms on these metrics. A corresponding alarm section
needs to be added to disco_alarms.py
:
[AWS/EC2.CPUUtilization]
custom_metric=False
period=300
duration=5
threshold_max=99
threshold_min=0
statistic=Average
This will automatically create the alarm for all hostclasses that are
spun up. By appending the section with hostclass name (eg:
[AWS/EC2.CPUUtilization.mhcfoo]
) the alarm will be only applied to the
one specific hostclass. This can also be used to 'override' any of the
options on specific hostclass.
Section name breaks down into 2-3 period separated fields:
- Namespace
- MetricName
- Hostclass (Optional)
Options:
custom_metric
specifies whether its a custom or automatic (autoscale) metricperiod
Interval in seconds between checks of metricduration
Number of time check needs to be in critical state before alarm is triggeredthreshold_max
Optional maximum threshold for alarmthreshold_min
Optional minimum threshold for alarmstatistic
SampleCount | Average | Sum | Minimum | Maximumlog_pattern_metric
Optional, specifies whether the metric comes from the CloudWatch Logs agent
For more information on the options refer to CloudWatch' PutMetricAlarm API
To be able to create alarms on custom metrics, the metrics must have
dimension with a single field env_hostclass
assigned with the
environment and hostclass concatenated through a underscore (eg:
ci_mhcfoo
). There are examples on how to set up alarms in
disco_aws_automation/disco_metrics.py
.
With the exception of the custom_metic option, which needs to be set to True, Custom Alarm configuration is identical to that of Automatic Metrics.
If you make any changes to SNS topic subscriptions you may want to apply them without re-creating the VPC. This will do it:
TK
If you make any changes to metrics and alarms and want to apply them without cycling any hostclasses:
TK
CloudWatch Logs can create metrics from log files. For example, sending an alert when there are exceptions in log files.
It can be controlled with disco_log_metrics.py
and is configured with disco_log_metrics.ini
[mhcdummy.ErrorCount]
log_file=/var/log/httpd/error_log
filter_pattern=error
metric_value=1
Section name breaks down into 2 period separated fields:
- Hostclass
- MetricName
Options:
log_file
The log file to readfilter_pattern
The pattern to look for Pattern Syntaxmetric_value
A value to extract from the pattern or a static number to increment the metric by
And here's a more complicated example, extracting a metric value from log lines:
[mhcdummy.StatusCode]
log_file=/var/log/httpd/access_log
filter_pattern=[ip, id, user, timestamp, request, status_code, ...]
metric_value=$status_code
The agent runs on your instance and forwards logs it's configured to monitor. Installing and enabling the agent
are common steps that you may wish to have many hostclasses perform, so it's a good idea to put those steps
in a common initialization script, such as init/common.sh
, which is then sourced by your hostclass init script
at bake time.
The agent configuration file lives at /etc/awslogs.conf
. For the file format, refer to the Amazon docs.
A sample agent configuration file for Apache could look like this:
[general]
state_file = /var/awslogs/state/agent-state
[/var/log/httpd/access_log]
datetime_format = %Y-%m-%d %H:%M:%S
file = /var/log/httpd/access_log
buffer_duration = 5000
log_stream_name = {hostname}
initial_position = start_of_file
log_group_name = {env}/{hostclass}/var/log/httpd/access_log
[/var/log/httpd/error_log]
datetime_format = %Y-%m-%d %H:%M:%S
file = /var/log/httpd/error_log
buffer_duration = 5000
log_stream_name = {hostname}
initial_position = start_of_file
log_group_name = {env}/{hostclass}/var/log/httpd/error_log
List all log metrics for a hostclass.
disco_log_metrics.py [--debug] [--env ENV] list-metrics --hostclass HOSTCLASS
List all log groups for a hostclass.
disco_log_metrics.py [--debug] [--env ENV] list-groups --hostclass HOSTCLASS
Update log metrics for a hostclass from the config file.
disco_log_metrics.py [--debug] [--env ENV] update --hostclass HOSTCLASS
Delete all log metrics for a hostclass.
disco_log_metrics.py [--debug] [--env ENV] delete --hostclass HOSTCLASS
Managing DynamoDB tables is done using the disco_dynamodb.py
script.
The currently supported commands are: list, describe, create, update,
and delete.
Because DynamoDB tables are not physically tied to any one particular VPC,
their names have to be unique across all the environments we have. As such,
disco_dynamodb.py
transforms the input table name to the actual one
it uses by postfixing it with an "_" followed by the environment name. For
example, Notices
would become Notices_ci
for the ci environment and
Notices_staging
for the staging environment.
To list all the DynamoDB tables:
disco_dynamodb.py list
To create a DynamoDB table:
disco_dynamodb.py create --config notices.json
where notices.json is a JSON file containing the definition of the DynamoDB table. For a list of properties that need and can be included in the table definition, please go here.
After issuing a create command, the describe command can be used to check on the status:
disco_dynamodb.py describe --table Notices
To update a DynamoDB table:
disco_dynamodb.py update --config notices.json
To specify the environment against which the command is run, use the --env
option:
disco_dynamodb.py create --config notices.json --env integrationtest
where notices.json is a JSON file containing the new definition of the DynamoDB table. For a list of properties that need/can be included in the table definition, please go here.
To delete a DynamoDB table:
disco_dynamodb.py delete --table Notices
Elastic Load Balancers(ELB) add load balancers to AutoScaling groups. This is useful for creating a single service endpoint for a collection of instances. ELBs are automatically given a Route53 domain name based on the hostclass and environment name of the ELB
The following configuration is available in disco_aws.ini
to configure an ELB for a hostclass
[mhcbanana]
domain_name=aws.wgen.net
elb=True
elb_health_check_url=/banana/liveops/heartbeat/status
elb_sticky_app_cookie=banana_session
elb_idle_timeout=300
elb_connection_draining=300
Options:
domain_name
Top level domain name to use as a suffix for all ELB domain nameselb
Create an ELB for this hostclasselb_meta_network
[Optional] Meta network to run ELB in, defaults to same meta network as instanceselb_health_check_url
[Default /] The heartbeat end-point to test instance healthelb_instance_port
[Default=80] The port number that your services are running onelb_instance_protocol
[Default inferred from port] HTTP | HTTPS | SSL | TCPelb_port
[Default=80] The port number to expose in the ELBelb_protocol
[Default inferred from port] HTTP | HTTPS | SSL | TCPelb_public
[Default no] yes | no Should the ELB have a publicly routable IPelb_sticky_app_cookie
[Optional] Enable sticky sessions by setting the session cookie of your applicationelb_idle_timeout
[Default=300] Timeout before ELB kills idle connectionselb_connection_draining
[Default=300] The timeout, in seconds, for requests to unhealthy or de-registering instances to complete before instance is removed from ELB
List all ELBs in the environment
disco_elb.py --env ci list
Create a new ELB for a hostclass. Update the ELB if it already exists.
disco_elb.py --env ci create --hostclass <hostclass>
Delete a ELB. No error is thrown if ELB does not exist.
disco_elb.py --env ci delete --hostclass <hostclass>
To terminate HTTPS an ELB needs a valid TLS certificate. You can use either the AWS Certificate Manager service or IAM server certificates. The ELB code will look for a cert in ACS with a DomainName of HOSTCLASS-ENV.DOMAINNAME. If it fails to find the cert in ACS to will look for an IAM Server Certificate with a ServerCertificateName of the same name.
Example commands for a self-signed cert in IAM (for testing)
openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout privateKey.key -out certificate.crt
openssl rsa -in privateKey.key -check
openssl x509 -in certificate.crt -text -noout
openssl rsa -in privateKey.key -text > private.pem
openssl x509 -inform PEM -in certificate.crt > public.pem
aws iam upload-server-certificate --server-certificate-name mhcadminproxy-ci.example.com \
--certificate-body file://`pwd`/public.pem \
--private-key file://`pwd`/private.pem
Example commands for an ACS cert
aws --region us-east-1 acm request-certificate \
--domain-name astro-jenkins.aws.wgen.net \
--domain-validation-options DomainName=mhcfoo-ci.dev.example.com,ValidationDomain=example.com \
--idempotency-token abcd1234
# An email will be sent to admin@example.com, click on the link in the e-mail.
aws --region us-east-1 acm list-certificates
Note: As of 2016-02-12 ACS is only supported in the us-east-1 region and so can only be used with ELBs in that region.
Route53 is used for managing our DNS domains and records.
It is automatically used by Asiaq to manage DNS for services
and can also be controlled with disco_route53.py
AWS Hosted Zones basically map to a DNS domain name and contain a list of DNS records. There are public and private Hosted Zones. Private Hosted Zones only work inside VPCs. For most of our needs we need to be able to use the domain names from outside a VPC so have been using Public Hosted Zones.
It is possible to create a Hosted Zone as a subdomain for a domain that is outside of Route53 (e.g wgen.net) In this case amazon gives us a set of DNS records we need to insert into our DNS when creating the Hosted Zone. This means that creating a Hosted Zone for a domain outside of Route53 needs to be a manual process. For the exact steps see CreatingNewSubdomain We have aws.wgen.net in this way.
Retrieve the list of Route53 Hosted Zones. These are our top level domain names
disco_route53.py list-zones
Retrieve the list of Route53 DNS records for all Hosted Zones. Of interest here are the CNAME records that we use to give DNS names to the Elastic Load Balancers
disco_route53.py list-records [--zone <zone-name>]
Create a new DNS record with a single value. Only non-alias records are supported.
disco_route53.py create-record <zone-name> <record-name> <type> <value>
Delete a DNS record
disco_route53.py delete-record <zone-name> <record-name> <type>
Machines do die and network partitions do happen. The more machines you run the more likely you are to experience this on any particular day. disco_chaos.py allows you to greatly increase the odds that identify recovery problems during working hours.
Modelled on the Netflix Chaos Monkey, the Asiaq chaos utility randomly kills instances for you. There is only one command and it intended to be run regularly and then followed by a run of integration tests so any problems can be identified.
For example this would terminate 0.5% of all machines in production, but maintain at least 33% of capacity for any particular hostclass.
disco_chaos.py --env production --level 0.5 --retainage 33.3
There is also a --dry-run option that will simply print out the instances it would have terminated if you hadn't used that parameter.
This Retainage number is rounded up. So if there is only one instance of a hostclass and the value is anything greater than 0 then at least one instance will be preserved.
The level is similarly rounded up to 1 instance. So if you specify a level of 1% and there are only 50 machines available for termination, one instance will be terminated.
You can opt-out of chaos. This might be done for some hostclass which you don't need to be resilient because you never intend to run it in production. In this case specify you don't wanty chaos in your disco_aws.ini, for example:
[mhccapiloadtester]
chaos=no
On the nature of Chaos. There are many ways this could be implemented, but currently we construct a list of machines eligle for termination and then select a set percentage of those machines and terminate them without any bias. A bias might make for better tests. For example, if we kill 66% of all machines of one type that will test our ability to cope with self-DDOS than killing 33% of one type and 33% of the machine that calls it. Another example is losing all the machines in a single AZ, which can happen due to network partition, natural disaster, fire, etc. We welcome improvements.
ElastiCache provides cache servers as a service. Using the ElastiCache service means
we don't have to create hostclasses in EC2. Spinning up a new cache cluster involves
editing the disco_elasticache.ini
config file and running disco_elasticache.py
Currently only Redis is supported in Asiaq.
The following configuration is available in disco_elasticache.ini
to configure an ElastiCache cluster.
[ci:dummy-redis]
instance_type=cache.m1.small
engine_version=2.8.23
port=6379
parameter_group=default.redis2.8
num_nodes=2
Options:
instance_type
The instance types to use. Must use instance types that start with "cache."engine_version
The version of Redisport
Port that Redis should be available onparameter_group
The set of Redis parameters to usenum_nodes
Number of nodes in cache cluster
ElastiCache also depends on some configuration from disco_aws.ini
[disco_aws]
default_domain_name=aws.wgen.net
default_meta_network=intranet
default_product_line=example_team
List cache clusters from ElastiCache
disco_elasticache.py [--env ENV] list
Create/update the clusters in a environment from the `disco_elasticache.ini config
disco_elasticache.py [--env ENV] update [--cluster CLUSTER]
Delete a cache cluster
disco_elasticache.py [--env ENV] delete --cluster CLUSTER