2019-03-18

The #AWS EC2 Windows Secret Sauce

Now that I have your attention with a catchy title, let me share some of my thoughts on how AWS shines and how much your experience as a customer matters.

Deploying instances in the cloud is relatively fast - at least when it comes to deploying a Linux instance.

A Windows operating system is a whole different story.

Have you ever wondered why it takes so long to deploy a Windows instance in the cloud? There are a number of reasons why it takes so much longer.

Let me count the ways:
  1. Running Windows in the cloud is a dumb idea - so you deserve it!! (just kidding :) )
  2. Seriously though - Windows images are big - absolutely massive compared to a Linux image. We are talking 30 times larger (on a good day), so copying these large images to the hypervisor nodes takes time.
  3. They are slow to start. Windows is not a thin operating system - so booting takes time.
With all that said, it seems that AWS has created a really interesting mechanism to reduce the amount of time it takes for an instance to start. Yes, AWS says it can take up to 4 minutes before you can remotely connect to the instance - but if you think about it, that is really a very short amount of time.

I started to look into the start time of Windows (for a whole different reason) and found something really interesting.

This is not documented anywhere - and I doubt I will receive any confirmation from AWS on the points in this post - but I am pretty confident that this is the way it works.


It seems that there is a standby pool of Windows instances that are just waiting in the background to be allocated to a customer - based on customer demand.

Let that sink in for a second: this means there is a powered-off Windows instance somewhere in the AZ, waiting for you.

When you request a new Windows EC2 instance, an instance is taken from the pool and allocated to you. This is some of the magic sauce that AWS does in the background.

This information is not documented anywhere - I have only found a single reference to this behavior on one of the AWS forums - "Slow Launch on Windows Instances".

[screenshot: forum_post_slow]



I did some digging of my own and went through the logs of a deployed Windows instance and this provided me with a solid picture of how this actually works. This is what I have discovered about the process (with the logs to back it up).

The instance was provisioned on the 17th of March.
  1. On the 17th I launched a Windows instance in my account at 13:46:41 through the EC2 console.

    [screenshot: ec2_launch]
  2. You can see that AWS does not make the instance available for about 4 minutes - until then you cannot log in.

    (Have you ever wondered why? Hint, hint - carry on reading.)

    [screenshot: 4_minutes]
  3. After waiting for just under 4 minutes I logged into the instance, and from the Windows event log you can see that the first entry in the System log is from February 13th at 06:52 - more than a month before I even requested the instance.

    This is the day that the AMI was released.

    [screenshot: 1st_boot]
  4. At 06:53 that same day the instance was generalized and shut down.

    [screenshot: sysprep]

    [screenshot: shutdown]
  5. The next entry in the log was at 04:55 on the 17th of March - just under 8 hours before I even started my EC2 instance!!

    [screenshot: start_in_pool]

  6. The hostname was changed at 04:56

    [screenshot: rename_generalize]
  7. And then restarted at 04:57

    [screenshot: reboot_generalize]
  8. After the instance came back up, it was shut down once more and returned to the pool at 04:59.

    [screenshot: shutdown-return-to-pool]

    [screenshot: shutdown-return-to-pool2]
  9. The instance was powered on again (from the pool) at 11:47:11 (30 seconds after my request)

    [screenshot: power-on-from-pool]

    More on what this whole process entails further down in the post.

  10. The secret-sauce service then changes the ownership of the instance - and does some magic to manipulate the instance metadata - allowing the user to decrypt the credentials with their unique key and log in.

    [screenshot: ssm_agent]
  11. The user now has access to their instance.

I wanted to dig a bit deeper into the entity that I named the "Instance Pool". I assume there is a whole process in the background that does the following (and this is where the secret sauce really lies).

This is how I assume the flow works:


There are two different entities at work here - the AWS backbone service (in orange) and the User/Customer (in blue). The two sequences work in parallel and independently of each other.

  • AWS pre-warms a number of Windows instances in what I named the "Instance Pool". It preemptively spins up instances in the background based on its predictions and the usage patterns in each region. I assume that these instances are constantly spun up and down - many times a day.
  • A notification is received that a customer has requested an instance from a specific AMI (in a specific region, in a specific AZ and of a specific instance type - because all of these have to match the customer's request).
  • The request is matched to an instance in the pool (by AMI, region, AZ and instance type).
  • The instance is then powered on (with the correct modifications of the instance flavor and disk configuration).
  • The backend then makes the necessary modifications:
    • ENI allocation (correct subnet + VPC)
    • Account association for the instance
    • Private key allocation
    • User-data script (if supplied) 
    • Password rotation
    • etc., etc.
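The matching step in the flow above can be sketched as a toy simulation. To be clear, all the names and logic here are my own guesswork about the hypothesized behavior - this is not AWS's actual implementation:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class PooledInstance:
    # The four attributes that (per the flow above) must match the request
    ami: str
    region: str
    az: str
    instance_type: str
    state: str = "stopped"  # pre-warmed instances wait powered off

def match_request(pool: List[PooledInstance], ami: str, region: str,
                  az: str, instance_type: str) -> Optional[PooledInstance]:
    """Return a powered-off pooled instance matching all four attributes, or None."""
    for inst in pool:
        if (inst.state == "stopped" and inst.ami == ami and inst.region == region
                and inst.az == az and inst.instance_type == instance_type):
            inst.state = "running"  # power on and hand the instance to the customer
            return inst
    return None  # no match (e.g. a custom AMI): fall back to a full cold provision
```

A miss here would drop through to the normal, slow provisioning path - which lines up with the observation that custom AMIs take significantly longer.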
I know that this sounds simple and straightforward - but the amount of work that goes into this "Instance Pool" is probably something we cannot fathom. The predictive analysis needed to understand how many instances should be provisioned, in which region and in which AZ, is where AWS shines - and has been shining for a significant amount of time.

This also explains why, when you deploy a custom Windows AMI, this process no longer works - it is a custom AMI, and therefore the provisioning time is significantly longer.

And why is all of this done?

To shave minutes (or seconds) off the time you wait to get access to your Windows instance. This is what it means to provide an exceptional service and to make sure that the experience you, the customer, have is the best one possible.

I started to think - could this possibly be the way that AWS provisions Linux instances as well?

Based on how I understand the cloud and how Linux works (and some digging in the instance logs), this is not needed: the image sizes are much smaller and boot times are a lot shorter as well. So it seems to me that this "Instance Pool" is only used for Windows operating systems, and only for AMIs that are owned by AWS.

Amazing what you can find from some digging - isn't it?

Please feel free to share this post and share your feedback on Twitter - @maishsk

2019-03-11

The Anatomy of an AWS Key Leak to a Public Code Repository

Many of us working with any cloud provider know that you should never, ever commit access keys to a public GitHub repo. Some really bad things can happen if you do.

AWS (and I assume all the cloud providers have their equivalent) publishes its own best practices for how you should manage access keys.

One of the items mentioned there is to never commit your credentials into your source code!!

Let me show you a real case that happened last week.
(Of course all identifiable information has been redacted - except for the specific access key that was used, which has of course been disabled.)

Someone committed an access key to a public GitHub repository.

Here is the commit message 

commit xxxxxxxx26ff48a83d1154xxxxxxxxxxxxa802
Author: SomePerson <someone@some_email.com>
Date:   Mon Mar 4 10:31:04 2019 +0200

--- (All events will be counted from this point) ---

55 seconds later - I received an email from AWS (T+55s)

From: "Amazon Web Services, Inc." <no-reply-aws@amazon.com>
To: john@doe.com
Subject: Action Required: Your AWS account xxxxxxxxxxxx is compromised
Date: Mon, 4 Mar 2019 08:31:59 +0000

1 second later (T+56s) AWS had already opened a support ticket about the incident.




Just over 1 minute later (T+2:02m) someone tried to use the key - but since the IAM policy attached to the user (and its exposed key) did not have the required permissions, the attempt failed!!

(This is why you should make sure you grant only the minimum permissions required for a specific task - and not the kitchen sink.)

Here is the access attempt that was logged in Cloudtrail




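The attempt failed because the policy attached to the user granted no IAM permissions. To illustrate the idea, here is a hypothetical least-privilege policy with a deliberately crude allow-check. Real IAM evaluation also handles explicit denies, resources, conditions and much more - this sketch only shows why an action outside the granted set is denied, which is exactly what surfaces in CloudTrail as an AccessDenied errorCode:

```python
import fnmatch

# A hypothetical least-privilege policy: it grants only what the workload
# needs (here, reading objects from one bucket) and no IAM actions at all.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow",
         "Action": ["s3:GetObject"],
         "Resource": "arn:aws:s3:::example-bucket/*"},
    ],
}

def is_allowed(policy: dict, action: str) -> bool:
    """IAM denies by default; only an explicitly allowed action passes."""
    return any(
        fnmatch.fnmatchcase(action, pattern)
        for stmt in policy["Statement"] if stmt["Effect"] == "Allow"
        for pattern in stmt["Action"]
    )
```

With a policy shaped like this, an attacker's attempt to enumerate IAM fails exactly as it did here.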
Here is where I went in and disabled the access key (T+5:58m).



Here is the notification message I received from GuardDuty, which was enabled on the account (T+24:58m).

Date: Mon, 4 Mar 2019 08:56:02 +0000
From: AWS Notifications <no-reply@sns.amazonaws.com>
To: john@doe.com
Message-ID: <0100016947eac6b1-7b5de111-502d-4988-8077-ae4fe58a87c9-000000@email.amazonses.com>
Subject: AWS Notification Message



Points for Consideration

There are a few things I would like to point out regarding the incident above (which we categorized as low severity).

  1. As you can see above, the first thing the attacker tried to do was to list the access keys. That would usually be the first thing someone tries - to understand which users are available in the account (assuming the user has the permission to perform that action).

    You can read more about how a potential hacker would exploit this in this series of posts.

  2. I assume that since the attacker saw they did not have enough permissions, they decided this was not a worthy enough target to continue trying to exploit. Why waste the time if you are going to have to work really hard to get what you want? That is why we saw only a single attempt to use the key.

    If I were the hacker, I would just wait for the next compromised key and try again.

  3. The reason this attack was not successful was that the policy attached to the user (and its access keys) was built in such a way that it did not grant permission to do anything in IAM.

    This was by design. The concept of least privilege is so important - ten times more so when you are working in the cloud - that you should implement it in every part of your design and infrastructure.

  4. AWS responded extremely fast - due to them (I assume) scraping the feed of all public GitHub commits (for example). It could be that I just caught a scan cycle, but based on my past experience the response time is usually within a minute. It would be great if they could share how they do this and how they handle the huge number of events that flow through these feeds.

    They still have to match the exact compromised key to the account and kick off the automatic process (email + ticket). All of this was done in less than 60 seconds.

    I am impressed (as we all should be).

  5. One thing I do not understand is why AWS does not immediately disable the key. The business implications of having a key out in a public repo are severe, and a use case that requires a key in the open is something I cannot fathom as valid. If AWS already finds a compromised key, knows which account it belongs to, and kicks off a process - then why not disable the key as part of that process??

    The amount of time and work that AWS has to invest (in support tickets and calls) working with a customer to clean up the account and forfeit the charges incurred because of the leak is above and beyond anything they would incur by automatically disabling the key in the first place.

    AWS has started to take a stance on some security features - disabling things by default (for example, public S3 buckets) - to protect customers from causing harm to themselves.

    I for one would welcome this change with open arms!



  6. It took me over 5 minutes to actually act on the exposed credential - and in 5 minutes a malicious actor can do some real and serious damage to your AWS account.

  7. GuardDuty was slow, but it is obvious why. It takes about 15 minutes until the event is delivered to CloudTrail, and GuardDuty then has to analyze it against previous behavior. So this product should not be used for prevention, but rather for forensic analysis after the fact. Still, there is no real way to collect this data on your own and analyze it against your behavioral baseline - so in my honest opinion the product is very valuable.

  8. How does one stop this from happening?

    There are a number of ways to tackle this question.

    In my honest opinion, it is mainly about raising awareness - from the bottom all the way to the top. The same way people know that if you leave your credit card on the floor, there is a very good chance it will be abused. Drill this into people from day 1 and hopefully it will not happen again.

    There are tools out there that you can use as part of your workflow - such as git-secrets - that prevent such incidents from even happening. But you would have to ensure that every single person, on every single computer they ever work on, has the tool installed - which is a much bigger problem to solve.

    Install your own tools to monitor your repositories - or use a service such as GitGuardian that does this for you (not only for AWS keys, but for other credentials as well).
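As an illustration of what these scanners look for: AWS access key IDs are 20 characters - a 4-character prefix (AKIA for long-lived user keys, ASIA for temporary credentials) followed by 16 uppercase letters and digits. A minimal sketch of such a check, which you could wire into a pre-commit hook:

```python
import re

# Candidate AWS access key IDs: a known prefix plus 16 uppercase letters/digits.
ACCESS_KEY_RE = re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b")

def find_access_keys(text: str) -> list:
    """Return all candidate AWS access key IDs found in a blob of text."""
    return ACCESS_KEY_RE.findall(text)
```

Tools like git-secrets scan staged diffs with patterns of this kind (plus patterns for secret keys and credential files) and refuse the commit on a match.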
As always, please feel free to share this post and leave your feedback on Twitter @maishsk