2018-10-15

How Long Until You Get the New Shiny Toys from re:Invent?

re:Invent is coming - and the frenzy of releases that will build up to the event is just around the corner.

I have always had the feeling that the products announced at re:Invent are great for the press releases and the small digs at other vendors, but it sometimes takes a while until we actually get what was announced on stage in front of ~20,000 people (and the rest of the world).

So I went out to look for some data. It is obvious that not everything we hear about on stage is baked and ready for production use.

Andy Jassy - re:invent 2017 keynote

Here are some examples from the last two re:Invent conferences.


re:Invent 2017

EKS (188 days)

https://aws.amazon.com/blogs/aws/amazon-elastic-container-service-for-kubernetes/
https://aws.amazon.com/blogs/aws/amazon-eks-now-generally-available/ (June 5, 2018)

 
Bare Metal (170 days)

https://aws.amazon.com/blogs/aws/new-amazon-ec2-bare-metal-instances-with-direct-access-to-hardware/
https://aws.amazon.com/about-aws/whats-new/2018/05/announcing-general-availability-of-amazon-ec2-bare-metal-instances/ (May 17, 2018)

 
Serverless App repo (83 days)

https://aws.amazon.com/blogs/aws/aws-serverless-app-repo/
https://aws.amazon.com/blogs/aws/now-available-aws-serverless-application-repository/ (Feb 21, 2018)

 
Neptune (183 days)

https://aws.amazon.com/about-aws/whats-new/2017/11/amazon-neptune-fast-reliable-graph-database-built-for-the-cloud/
https://aws.amazon.com/blogs/aws/amazon-neptune-generally-available/ (May 30, 2018)

 
Aurora Multi-master (Still not released)

https://aws.amazon.com/about-aws/whats-new/2017/11/sign-up-for-the-preview-of-amazon-aurora-multi-master/
Yet to be released (Oct 14, 2018)

 
Aurora Serverless (254 days)

https://aws.amazon.com/blogs/aws/in-the-works-amazon-aurora-serverless/
https://aws.amazon.com/blogs/aws/aurora-serverless-ga/ (Aug 9, 2018)

 
IoT 1-Click (169 days)

https://aws.amazon.com/about-aws/whats-new/2017/11/aws-iot-one-click-now-in-preview/
https://aws.amazon.com/about-aws/whats-new/2018/05/aws-iot-1-click-generally-available/ (May 16, 2018)

 
Translate (127 days)

https://aws.amazon.com/blogs/aws/introducing-amazon-translate-real-time-text-language-translation/
https://aws.amazon.com/blogs/aws/amazon-translate-now-generally-available/ (Apr 4, 2018)

 
Transcribe (127 days)

https://aws.amazon.com/blogs/aws/amazon-transcribe-scalable-and-accurate-automatic-speech-recognition/
https://aws.amazon.com/blogs/aws/amazon-transcribe-now-generally-available/ (Apr 4, 2018)

 
AppSync (137 days)

https://aws.amazon.com/blogs/aws/introducing-amazon-appsync/
https://aws.amazon.com/about-aws/whats-new/2018/04/aws-appsync-now-ga/ (Apr 13, 2018)

 
S3 Select (126 days)

https://aws.amazon.com/blogs/aws/s3-glacier-select/
https://aws.amazon.com/about-aws/whats-new/2018/04/amazon-s3-select-is-now-generally-available/ (Apr 3, 2018)


re:Invent 2016

Lex (141 days)

https://aws.amazon.com/blogs/aws/amazon-lex-build-conversational-voice-text-interfaces/ https://aws.amazon.com/blogs/aws/amazon-lex-now-generally-available/ (Apr 19, 2017)

 
PostgreSQL for Aurora (329 days)

https://aws.amazon.com/blogs/aws/amazon-aurora-update-postgresql-compatibility/ 
https://aws.amazon.com/blogs/aws/now-available-amazon-aurora-with-postgresql-compatibility/ (Oct 24, 2017)

 
Greengrass (190 days)

https://aws.amazon.com/blogs/aws/aws-greengrass-ubiquitous-real-world-computing/
https://aws.amazon.com/blogs/aws/aws-greengrass-run-aws-lambda-functions-on-connected-devices/ (Jun 07, 2017)

 
X-Ray (140 days)

https://aws.amazon.com/blogs/aws/aws-x-ray-see-inside-of-your-distributed-application/
https://aws.amazon.com/blogs/aws/aws-x-ray-update-general-availability-including-lambda-integration/ (Apr 19, 2017)

 
Batch (36 days)

https://aws.amazon.com/blogs/aws/aws-batch-run-batch-computing-jobs-on-aws/
https://aws.amazon.com/about-aws/whats-new/2017/01/aws-batch-now-generally-available/ (Jan 5, 2017)

 
Lambda@Edge (229 days)

https://aws.amazon.com/blogs/aws/coming-soon-lambda-at-the-edge/
https://aws.amazon.com/about-aws/whats-new/2017/07/lambda-at-edge-now-generally-available/ (Jul 17, 2017)

 
At a glance, the average time from announcement to general availability for the list above is about 5 months.

Now don’t get me wrong. For all of the above items that were not actually available at re:Invent, I would estimate that there were just as many products (if not more) that were available - at least in a limited number of regions - the same day they were announced. Above and beyond that, the problems AWS is trying to solve are really complex, and almost all of them have never been solved before - so please, AWS, take your time in developing the game-changing technology that you have been giving to the world.

So when Andy Jassy and Werner Vogels get up on stage at the end of November and announce whatever wonderful stuff they are going to announce, we should all take into account that it could take anything from 1 day to almost a year until we can actually use it in all the AWS regions we consume today.


Werner Vogels - re:invent 2017 keynote

How does this affect you? I can give an example from the EKS announcement. We were actively looking at a Kubernetes deployment on AWS and were contemplating whether we should deploy our own or wait for the managed solution that was announced at re:Invent.

Since we did not have an official release date, we decided to roll our own, and not wait for some unknown time in the future.

It is nice to know what is coming, but you will need to evaluate how long you can wait: are you ready to go with a version-one product (which will probably have a good number of limitations), or do you need a contingency plan to solve your issues?

2018-10-08

#AWS PrivateLink vs. NAT Gateway from a Pricing Perspective

A customer came to me with a request. They did not want to use a NAT gateway from their VPC to access the AWS APIs. They had a number of security concerns regarding the use of a NAT gateway (no control, logs, or auditing - but that is for a different post), and they asked for a solution.

The AWS APIs that they needed access to were:

  • S3
  • KMS
  • SSM
  • CloudWatch
  • CloudFormation

Last year at re:Invent, AWS announced the option to create VPC Interface Endpoints using PrivateLink, and it has steadily been adding more endpoints over the past year.

With these endpoints you can actually have a VPC with instances that have no internet access (at least not through AWS) and still be able to interact with all the AWS APIs.

This is technically possible - and can easily be automated - but I wanted to look at it from a cost perspective.
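For reference, creating an Interface Endpoint is a single API call per service. A minimal sketch with the aws cli (all the IDs below are placeholders, and the security group needs to allow 443 from your instances):

aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0123456789abcdef0 \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-1.ssm \
  --subnet-ids subnet-aaaa1111 subnet-bbbb2222 \
  --security-group-ids sg-0123456789abcdef0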

The VPC in us-east-1 has 2 Availability Zones (you should always have a minimum of 2).

That would mean deploying 2 NAT gateways in your VPC (Pricing)

I am going to assume that you have the same amount of data going through both options - so I will not factor this into the price.

A month usually has 730 hours.

Each NAT gateway will cost you $0.045 * 730 = ~$33.

The total for 2 NAT Gateways would be ~$66 per month (not including traffic).

What does this look like for Interface Endpoints? (Pricing)

Each Interface Endpoint needs an ENI in each of the two AZs - so they are deployed in pairs.

Each Interface Endpoint will cost $0.01 * 730 * 2 = ~$15.

The total for the endpoints above (4 Interface Endpoints - KMS, SSM, CloudWatch and CloudFormation) would be ~$60 per month.
The S3 endpoint is a Gateway Endpoint - and therefore does not cost you anything extra.

As you can see - it is not that much cheaper.

Take into account the following scenario: you need API access to 15 of the 21 possible Interface Endpoints.

This would run you the steep amount of ~$225 per month - which is a lot more than just a NAT Gateway.

Design decisions always have tradeoffs - sometimes you prefer security and other times it will be cost. I hope that this will enable you to make an informed decision in your VPC design.

2018-10-02

Bastardizing #DevOps

I have come across two separate discussions this past week where it became clear that some people have no idea what DevOps is.

The first one was a company here in Israel - https://devopsexperts.co.il/ - and their proposed syllabus is broken down below.



They are offering this course - for a fee (of course) - selling the hope that someone who graduates from the course will be able to get a position as a DevOps engineer.

Someone asked on a channel - "Was this course worthwhile?".

I would like to share with you my answer.

I do not want to take away anyone's livelihood, but there is no such thing as "teaching/learning" DevOps. There is no single course that can encompass all the capabilities that one would need to become a successful DevOps professional. Above and beyond that, in each and every organization the term DevOps will mean something completely different.

There are a number of basic topics that one can learn, and with them build up a strong foundation of skills to help your specific company. But if I were evaluating a potential candidate whose education was based mainly on this course, I would not hire them.

The demand for talented professionals is high - everyone wants DevOps engineers - and there are not many people with enough experience or know-how. Of course, where there is demand, people identify an opportunity to make money.

Looking at the syllabus, it has many flaws. The course is 45 hours (about 1 work week):

  • Scripting - which language are they going to teach you? Python? Who says the company you end up working for is not using something completely different?
  • Version Control - so this is basically git.
  • Linux fundamentals - a basic Linux course.
  • Provisioning Resources - with what? Terraform? Ansible? Something else?
  • Build Automation - building a pipeline - with which tools?
  • Continuous Monitoring - is that even a concept?
  • Working with containers - docker run, docker build, docker pull/push.
  • Configuration Management - using which technology? I can name at least 3 CM tools that you might use.

As you can see, this is a 50,000 ft. view of what you might do in your day-to-day work as a DevOps engineer - but in no way, shape or form can you learn any of these things in a course, and definitely not in 45 hours.

For me, a good candidate is someone who has the ability to learn and understands the big picture of how software is built, deployed and managed on a regular basis. There is no checklist of technologies that would qualify a candidate. Does someone know Jenkins? That might be great - but if we use something else - CircleCI, Electric Commander - how will the specific Jenkins knowledge help?

DevOps is not something that you can learn in school, or in a course. It is a collection of technologies that you collect during the years, it is a state of mind that you become accustomed to as you grow, it is a set of organizational practices that you pick up on your journey.

Not something you can learn in school.

The next one was Microsoft - who decided to rebrand VSTS as Azure DevOps. Again, a shiny buzzword which Microsoft assumes will attract people to the product and their offering.

“Azure now has a new set of five DevOps services,” Jamie Cool, Microsoft’s newly retitled director of product management for Azure DevOps, told The New Stack. “They’re going to help developers be able to ship faster, [with] higher quality. Oftentimes when I have conversations, ‘DevOps’ can mean different things to different folks. So to us in this context, we really think of DevOps as the people, the process, and the products for delivering constant value to customers.”

Here in the statement above lies the problem. Products do not deliver DevOps - at least not what Azure is offering. I do agree with the part about the people and the process - but not the products. Maybe the tools - but not products.

Had they branded the product Azure CI/CD, I would have been all for it - but to me this seems like a marketing play, trying to latch onto a goal that everyone is trying to achieve today.

2018-09-27

Replacing the AWS ELB - Final Thoughts

This is the last part in the Replacing the AWS ELB series.
1. Replacing the AWS ELB - The Problem
2. Replacing the AWS ELB - The Challenges
3. Replacing the AWS ELB - The Design
4. Replacing the AWS ELB - The Network Deep Dive
5. Replacing the AWS ELB - Automation
6. Replacing the AWS ELB - Final Thoughts (this post)

If you haven't already read the previous posts in the series - please take the time to go through them.

So here are some additional thoughts and ideas about the whole journey.

First and foremost - none of this would have been possible without the group effort of the team that worked on this.
Udi, Mark, and Mike - thank you all for your input, help and the hard work that went into this.

Was it all worth it?

Yes, yes and hell yes!! The cost of having to refactor the applications to work the way the AWS ELB works was not financially viable and would have taken far too long. There was no way we could make our delivery dates and have all the applications modify the way they worked.

So not only was it worth it - it was a necessity; without this, the project was a non-starter.

What was the hardest part of the solution?

Definitely the automation. We had the solution whiteboarded out after an hour or two, and brought up a PoC within another hour or two.

As I said elsewhere in the series - if this were a one-off, it would not have been worthwhile - but we needed about 10 pairs of haproxy instances in each deployment, and there were 10-15 deployments - so manual was not going to work here. There was a learning curve that we needed to get over, and that took some time.

This can't be all you were doing with haproxy...

Of course not. The configurations in the examples are really basic and simple. The actual haproxy.cfg was a lot more complicated and was generated on the fly using Consul and consul-template. This allows for some very interesting and wonderful things to be accomplished. The instances could be considered pets, because they were hardly ever re-provisioned, but the configuration was constantly changing based on the environment.

So did you save money?

No! This was more expensive than provisioning an ELB from AWS. The constraints dictated that this was the chosen solution - not cost. In a way these were wasted resources, because there are instances sitting idle most of the time without actually doing anything. The master-slave model is not a cost-effective solution, because you are spending money to address the scenario when (and if) you lose a node.

Does this scale? How?

We played around with this a bit and also created a prototype that provisioned an auto scaling group that would work active-active-active with multiple haproxys - but this required some changes in the way we did our service discovery. This happened a good number of months after we went live, as part of the optimization stage. Ideally, this is the way we would have gone if we could do it over again.

For this example, the only way to scale is to scale up the instance sizes - not to scale out.

So to answer the question above - in the published form - no, it does not.

Any additional benefits to rolling your own solution?

This could be ported to any and every cloud - or deployment - you would like. All you need to do is change the modules and the parts that interact directly with AWS to those of the cloud of your choice - and it would probably work. It is not a simple rip-and-replace - but the method would work; it would just take a bit of extra time and coding.

What about external facing load balancers - will this work?

Yes. All you will need to do is replace the routes with an Elastic IP, and have the keepalived script switch the EIP from one instance to another. I should really post about that as well.
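As a rough sketch of what that keepalived notify script might look like for the external-facing case (the allocation ID and region are placeholders - it is the same pattern as the route-switching script, just with an EIP):

#!/bin/bash
# move the Elastic IP to the node that has just become MASTER
aws ec2 associate-address \
  --allocation-id eipalloc-0123456789abcdef0 \
  --instance-id $(curl -s http://169.254.169.254/latest/meta-data/instance-id) \
  --allow-reassociation \
  --region us-east-1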

So why did you not use an EIP in the first place?

Because this was internal traffic. If I were to use an external facing load balancer, the traffic would essentially go out to the internet and come back in - for two instances that were in the same subnet in the same AZ. That makes no sense from either a financial or a security perspective.

Can I contact you if I have any specific questions on the implementation?

Please feel free to do so. You can either leave a comment on any of the posts in the series, ping me on Twitter (@maishsk), or use the contact form at the top.

Replacing the AWS ELB - Automation

This is Part 5 in the Replacing the AWS ELB series.
1. Replacing the AWS ELB - The Problem
2. Replacing the AWS ELB - The Challenges
3. Replacing the AWS ELB - The Design
4. Replacing the AWS ELB - The Network Deep Dive
5. Replacing the AWS ELB - Automation (this post)
6. Replacing the AWS ELB - Final Thoughts
It goes without saying that everything I have described in the previous posts can be accomplished manually - it is just really tedious work to go through all the stages by hand.
Let's have a look at the stages:
1. Create an IAM role with a specific policy that will allow you to execute commands from within the EC2 instances.
2. Create a security group that will allow the traffic to flow between and to your haproxy instances.
3. Deploy 2 EC2 instances - one in each availability zone.
4. Install haproxy and keepalived on each of the instances.
5. Configure the correct scripts on each of the nodes (one for the master and the other for the slave) and set up the correct script for transferring ownership on each instance.

If you were to do all of this manually, it could easily take you a good 2-3 hours to set up a highly-available haproxy pair. And how long does it take to set up an AWS ELB? Less than 2 minutes? That is of course not viable - especially since this should be automated and easy to use.
This will be a long post - so please bear with me - because I would like to explain in detail exactly how this works.
First and foremost - all the code for this post can be found here on GitHub - https://github.com/maishsk/replace-aws-elb (please feel free to contribute/raise issues/questions).

(Ansible was my tool of choice - because that is what I am currently working with - but this can also be done with any tool that you prefer.)

The Ansible playbook is relatively simple.

Part one has 3 roles:

1. Create the IAM role
2. Create the security group
3. Create the instances

Part two sets up the correct routing that will send the traffic to the correct instance.
Part three goes into the instances themselves and sets up all the software.

Let's dive into each of these.

Part One

In order to allow the haproxy instances to modify the route, they will need access to the AWS API - this is what you should use an IAM role for. The two policy files you will need are here. Essentially, the only permissions the instance needs are the ones to look up and replace routes.
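To illustrate - since the failover script only calls replace-route, a minimal managed policy would presumably look something like this (the policy name is mine, and you could scope the Resource down further):

aws iam create-policy \
  --policy-name haproxy-replace-route \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": ["ec2:DescribeRouteTables", "ec2:ReplaceRoute"],
      "Resource": "*"
    }]
  }'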

I chose to create this IAM role as a managed policy and not as an inline policy, for reasons that will be explained in a future blog post - both of them work, so choose whatever works for you.

Next was the security group - and the ingress rule I used here was far too permissive - it opens the SG to all ports within the VPC. The reason this was done was that the haproxy here was used to proxy a number of applications, on a significant number of ports, so the decision was to open all the ports on the instances. You should evaluate the correct security posture for your own applications.

Last but not least - deploying the EC2 instances. Pretty straightforward - except for the last part, where I preserve a few bits of instance details for future use.

Part Two

Here I get some information about all the route tables in the VPC you are currently using. This is important because you will need to update the route table entries for each of them. The reason this is done through a shell script and not an Ansible module is that the module does not support updates - only create or delete - which would have made the process of collecting all the existing entries, storing them and then adding a new one to the list far too complicated. This is an Ansible limitation - and a simple way to get around it.

Part Three

So the instances themselves have been provisioned. The whole idea of VRRP presumes that one of the nodes is the master and the other is the slave. The critical question is: how did I decide which one should be the master and which one the slave?

This was done here. When the instances are provisioned, they are provisioned in a random order, but there is a sequence in which they were provisioned - and it is possible to access this sequence as a fact. I then exposed it in a simpler form here - for easier re-use.

facts

Using this fact, I can now run some logic during the software installation based on the identity of the instance. You can see how this was done here.

identity

The other place where the identity of the node is used is in the jinja templates - the IP address of the node is injected into the file based on the identity.

And of course, the script that the instance uses to update the route table uses facts and variables collected from different places throughout the playbook.

bash_script

One more thing, of course. The instance I used was Amazon Linux - which means that the AWS cli is pre-installed. If you are using something else, you will need to install the CLI on your own. The instances get their credentials from the attached IAM role, but when running an AWS cli command you also need to provide an AWS region - otherwise the command will fail. This is done with jinja (again) here.

One last thing - in order for haproxy to expose its logs, a few short commands are necessary.
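The exact commands are in the repo, but the gist of it is the standard rsyslog dance - haproxy logs to a syslog socket, so you enable a local UDP listener and route the haproxy facility to a file (a sketch, assuming haproxy.cfg logs to 127.0.0.1 on the local2 facility):

cat <<'EOF' > /etc/rsyslog.d/haproxy.conf
$ModLoad imudp
$UDPServerAddress 127.0.0.1
$UDPServerRun 514
local2.* /var/log/haproxy.log
EOF
service rsyslog restart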
And there you have it - a fully provisioned haproxy pair that will serve traffic internally with a single virtual IP.

Here is an asciinema recording of the process - it takes just over 3 minutes.


In the last post, I will go into some of the thoughts and lessons learned during this whole exercise.

2018-09-02

Replacing the AWS ELB - The Design

This is Part 3 in the Replacing the AWS ELB series.
1. Replacing the AWS ELB - The Problem
2. Replacing the AWS ELB - The Challenges
3. Replacing the AWS ELB - The Design (this post)
4. Replacing the AWS ELB - The Network Deep Dive
5. Replacing the AWS ELB - Automation
6. Replacing the AWS ELB - Final Thoughts

So how do you go about using an IP address in a VPC and allowing it to jump between availability zones?

The solution to this problem was mentioned briefly in a slide in a re:Invent session - which for the life of me I cannot find (when I do, I will post the link).

The idea is to create an "overlay" network within the VPC, which allows you to manage IP addresses even though they don't really exist in the VPC.

A simple diagram of such a solution would look something like this:

standard_haproxy

Each instance is configured with an additional virtual interface, with an IP address that is not part of the CIDR block of the VPC - that way it is not a problem to move it from one subnet to another.

If the IP address does not actually exist inside the VPC, how do you get traffic to go to it?

That is actually a simple one to solve - by creating a specific route on each of the subnets that routes traffic for that address to a specific ENI (yes, that is possible).

add_route
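For example, with the aws cli it would be something along these lines (hypothetical IDs; the virtual IP here matches the one used later in the series):

aws ec2 create-route \
  --route-table-id rtb-0123456789abcdef0 \
  --destination-cidr-block 172.16.1.100/32 \
  --network-interface-id eni-0123456789abcdef0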

The process would be something like this:

start

An instance tries to access the virtual IP - the request goes to the route table on the subnet, and because of the specific entry it is routed to a specific instance.

The last piece of the puzzle is how you get the route to jump from one haproxy instance to the other. This is the initial state:

initial

haproxya fails, or its AZ goes down:

haproxya_fail
haproxyb recognizes this failure:
recognize_failure

And then it makes a call to the AWS API to move the route to a different ENI - located on haproxyb.

move_to_haproxyb

In the next post, we will go into a bit more detail on how the network is actually built and how the failover works.

2018-08-29

Replacing the AWS ELB - The Network Deep Dive

This is Part 4 in the Replacing the AWS ELB series.

1. Replacing the AWS ELB - The Problem
2. Replacing the AWS ELB - The Challenges
3. Replacing the AWS ELB - The Design
4. Replacing the AWS ELB - The Network Deep Dive (this post)
5. Replacing the AWS ELB - Automation
6. Replacing the AWS ELB - Final Thoughts

Why does this whole thing with the network actually work? Networking in AWS is not that complicated (sometimes it can be - but it is usually pretty simple), so why do you need to add an additional IP address into the loop - one that is not even really part of the VPC?

To answer that question, we need to understand the basic construct of the route table in an AWS VPC. Think of the route table as a road sign - one that tells you where you should go.

directions
Maybe not such a clear sign after all
(Source: https://www.flickr.com/photos/nnecapa)


Here is what a standard route table (RT) looks like:

route

The first line says that all traffic that is part of your VPC stays local - i.e. it is routed within your VPC; the second line says that all other traffic that does not belong in the VPC will be sent to another device (in this case a NAT Gateway).

You are the master of your RT - which means you can route traffic destined for any address you would like, to any destination you would like. Of course, you cannot have duplicate entries in the RT, or you will receive an error.

route_error1

And you cannot have a smaller subset of the traffic routed to a different location if a larger route already exists.

route_error2

But otherwise you can really do what you would like.
So defining an additional interface on an instance is straightforward.

For example, on a CentOS/RHEL instance you create a new file in /etc/sysconfig/network-scripts/ (conventionally named ifcfg-eth0:1 for an alias on eth0):

DEVICE="eth0:1"          # alias interface on top of eth0
BOOTPROTO="none"         # static configuration - no DHCP
MTU="1500"
ONBOOT="yes"             # bring the interface up on boot
TYPE="Ethernet"
NETMASK=255.255.255.0
IPADDR=172.16.1.100      # the virtual IP - deliberately outside the VPC CIDR
USERCTL=no               # regular users may not control this interface

This will create a second interface on your instance.

ip

Now of course, no entity in the subnet knows that this IP exists on the network - except the instance itself.
That is why you can assign the same IP address to more than a single instance.

network_4


Transferring the VIP to another instance

In the previous post, the last graphic showed that in the case of a failure, haproxyb would send an API request to transfer the route to the new instance.

keepalived has the option to run a script when its peer fails - this is called a notify:

vrrp_instance haproxy {
  [...]
  notify /etc/keepalived/master.sh
}


That notify is a regular bash script - and that bash script can do whatever you would like; luckily, that allows you to manipulate the routes through the AWS cli.

aws ec2 replace-route --route-table-id <ROUTE_TABLE> --destination-cidr-block <CIDR_BLOCK> --instance-id <INSTANCE_ID>

The parameters you will need to know are:

• The ID of the route table you need to change
• The destination network whose route you want to change
• The ID of the instance that the route should now point to
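Putting it together, the notify script ends up being little more than a wrapper around that call. A minimal sketch (the real script is generated from a jinja template with the playbook's facts; the route table ID here is a placeholder):

#!/bin/bash
# /etc/keepalived/master.sh - runs when this node becomes MASTER
ROUTE_TABLE="rtb-0123456789abcdef0"
CIDR_BLOCK="172.16.1.100/32"
# this instance's own ID, from the EC2 metadata service
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)

aws ec2 replace-route \
  --route-table-id "$ROUTE_TABLE" \
  --destination-cidr-block "$CIDR_BLOCK" \
  --instance-id "$INSTANCE_ID" \
  --region us-east-1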

Now of course, there are a lot of moving parts that need to fall into place for all of this to work - and doing it manually would be a nightmare, especially at scale. That is why automation is crucial.

In the next post, I will explain how you can achieve this with a set of Ansible playbooks.

Replacing the AWS ELB - The Challenges

This is Part 2 in the Replacing the AWS ELB series.
1. Replacing the AWS ELB - The Problem
2. Replacing the AWS ELB - The Challenges (this post)
3. Replacing the AWS ELB - The Design
4. Replacing the AWS ELB - The Network Deep Dive
5. Replacing the AWS ELB - Automation
6. Replacing the AWS ELB - Final Thoughts

Now that you know the history from the previous post, I would like to dive into the challenges I faced during the design process and how they were solved.


High Availability


One of the critical requirements was "Must not be a single point of failure" - which means that whatever solution we went with must have some kind of high availability.

Deploying a highly available haproxy cluster (well, it is a master/slave deployment - it cannot really scale) is not that hard a task to accomplish.

Here is a simple diagram to explain what is going on.

haproxy-ha

Two instances, each with the haproxy software installed - and each with its own IP address.

A virtual IP is configured for the cluster, and with keepalived we maintain the state between the two instances. Each of them is configured with a priority (to determine which one is the master and which the slave), and there is a heartbeat between them; VRRP is used to maintain a virtual router (a virtual interface between them). If the master goes down, the slave takes over. When the master comes back up, the slave relinquishes control back to the master.
This works - flawlessly.

Both haproxys have the same configuration - so if something falls over, the second instance can (almost) instantly start serving traffic.


Problem #1 - VRRP

VRRP uses multicast by default - https://serverfault.com/questions/842357/keepalived-sends-both-unicast-and-multicast-vrrp-advertisements - but that was relatively simple to overcome: you can configure keepalived to use unicast. So that was one problem solved.
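For the record, the unicast configuration in keepalived looks roughly like this (the addresses are my assumptions, based on the subnets used later in this post):

vrrp_instance haproxy {
  [...]
  unicast_src_ip 192.168.1.10     # this node's own address
  unicast_peer {
    192.168.2.10                  # the other haproxy node
  }
}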

Problem #2 - Additional IP address

In order for this solution to work, we need an additional IP address - the VIP. How do you get an additional IP address in AWS? Well, that is well documented here - https://aws.amazon.com/premiumsupport/knowledge-center/secondary-private-ip-address/. Problem solved.

Problem #3 - High Availability

So we have the option to attach an additional ENI to the cluster - which would allow us to achieve something similar to what we have above - but this introduces a bigger problem.

All of this only works in a single Availability Zone - which means the AZ is a single point of failure, in violation of requirement #2 - so it would not work.

As it states clearly in the AWS documentation, a subnet cannot span multiple AZs.

vpc-faq

Which means this will not work:

cross-az

Let me explain why not.

A subnet cannot span multiple AZs. That means that if we want the solution deployed in multiple AZs, it needs to be deployed across multiple subnets (192.168.1.0/24 and 192.168.2.0/24), each in its own AZ. The idea of taking an additional ENI from one of the subnets and using it as the VIP will only work within a single AZ - because you cannot move the ENI from a subnet in AZ1 to a subnet in AZ2.

This means that the solution of having a VIP in one of the subnets would not work.

Another solution had to be explored - because having both haproxy nodes in a single AZ was more or less the same as having a single node (not exactly the same, but still subject to a complete outage if the entire AZ went down).

Problem #4 - Creating a VIP that can traverse AZs

One of the biggest problems I had to tackle was how to get an IP address to traverse availability zones.

The way this was done can be found in the next post.

Replacing the AWS ELB - The Problem

This topic has been long overdue.

This will be a series of posts on how you can replace the AWS ELBs inside your VPCs with a self-managed load balancing solution. It is too long for a single blog post, so I decided it was best to split it up into parts.

1. Replacing the AWS ELB - The Problem (this post)
2. Replacing the AWS ELB - The Challenges
3. Replacing the AWS ELB - The Design
4. Replacing the AWS ELB - The Network Deep Dive
5. Replacing the AWS ELB - Automation
6. Replacing the AWS ELB - Final Thoughts

Let me start at the beginning.

The product I was working with had a significant number of components that communicate with each other. A long while back, the product team had decided to front all communication between components with a load balancer - for a number of good reasons, such as:

• scaling
• high availability
• smart routing

Here is a basic view of what the communication between component1 and component2 would look like (and just to clarify - there were just over 50 components in this solution - not just one or two):


simple_diagram


A simple load balancing example - something that you can see anywhere. The load balancers were HAProxy.

There was some additional logic in the traffic between the components, based on HTTP headers, which allowed us to route some traffic to specific instances and versions of the application (you can think of it as blue/green or canary routing).


A simple visualization of this would look something like this:

routing


The team had already deployed this product, and now it was time to move over to AWS.

The first part of the discovery was to identify how much of the solution we could accomplish using the services provided by AWS - instead of deploying our own. One of the items that came up for discussion was, of course, the load balancers we were using - and whether they could be replaced with AWS ELBs.

Here is the list of requirements (the current solution met all of them):

1. Must be able to serve as a load balancer
  • Define frontends
  • Define backends
2. Must not be a single point of failure
3. Provisioning will have no manual interaction
4. Must be able to route traffic based on specific custom HTTP headers

AWS met all the requirements except for #4.

There are options to route traffic based on HTTP headers in the AWS Application Load Balancer, but they are limited (to say the least): you can only use the Host header or a path in the URL.


hostname_route     path_route


This was not an option; the engineering teams were not going to refactor the applications just to adapt to the AWS ELB. This sent me back to the drawing board to see how we could still use the known HAProxy solution inside AWS - despite the well-known problems.

More about those in the next post.

2018-08-21

Scratching an itch with aws-vault-url

I think that aws-vault is a really nice tool. It prevents you from saving your AWS credentials in plain text on your machines (which is always a good thing).

Since I started using it, I have found a number of difficulties along the way.

1. aws-vault does not support aarch64 #261

  To solve this, I created my own binary - aws-vault on a Chromebook

2. aws-vault only supports storing credentials when using a fully blown GUI. Here is a really good walkthrough of how to get this working: https://www.tastycidr.net/using-aws-vault-with-linux/

3. aws-vault login will give you a URL which you can paste into a browser, and it will log you in automatically to the AWS console. My pet peeve with this was that it always brings you to the default console page.

  image

  So I was thinking - why would I not be able to open up the specific console that I would like to access - such as S3 or EC2? I mean, come on... these are just different URLs that need to be opened in the same way.


Now if I were a Go developer, I would happily have contributed this back to the original project - but I am not. I am not really a developer at all. I can play with code, I can also create stuff - but I would not dare call myself someone who can write an application.

So I wrote a small wrapper script to provide this functionality.

Say hello to aws-vault-url - an easier way to open a direct console for a specific product.

(This is in no way a robust tool - and if you would like to contribute and improve it, please feel free to submit a PR.)
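The heart of the wrapper is roughly the following (a sketch, not the actual code - it assumes aws-vault's login -s flag, which prints the federation URL instead of opening a browser, and that the URL ends with a Destination parameter):

#!/bin/bash
# usage: aws-vault-url <profile> <service>   e.g. aws-vault-url work s3
PROFILE=$1
SERVICE=$2

LOGIN_URL=$(aws-vault login "$PROFILE" -s)
# URL-encoded https://console.aws.amazon.com/<service>/home
DEST="https%3A%2F%2Fconsole.aws.amazon.com%2F${SERVICE}%2Fhome"

# swap the default destination for the service-specific console page
echo "${LOGIN_URL%Destination=*}Destination=${DEST}"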

Update - 22/08/2018

So I did some thinking about this, and came to the conclusion that it makes no sense to maintain a separate tool - so I decided to take the leap and push myself to go into the code itself. I sat for an hour or two last night and extended the current functionality of aws-vault to accommodate this.

Here is the PR - https://github.com/99designs/aws-vault/pull/278.

Once this is merged, I suggest that you move over to the complete tool.

2018-08-19

A Triangle is Not a Circle & Some Things Don't Fit in the Cloud

Baby Blocks

We all started off as babies, and I am sure that not many of you remember that one of the first toys you played with was a plastic container with different shapes on the lid and blocks made of the matching shapes (and if you do not remember, then I am sure those of you with kids have probably played the same game with your children).

A triangle will only go into the triangle, a circle into the circle, a block into the block, and so on.

This is a basic skill that teaches us that no matter how hard we try, there are some things that just do not work. Things can only work in a certain way (and of course it teaches coordination, patience, and a whole lot of other things).

It is a skill that we acquire; it takes time and patience, but everyone gets there in the end.

And why am I blogging about this, you may ask?

This analogy came up a few days ago in a discussion about a way to provide a highly available database in the cloud.

And it got me thinking...

There are certain things that are not meant to be deployed in a cloud environment, because they were never meant to be there in the first place. The application needed an Oracle database, and it was supposed to be deployed in a cloud environment.

What is the default way to deploy Oracle in a highly available configuration? Oracle RAC. There are a number of basic requirements (simplified) for Oracle RAC:

1. Shared disk between the nodes.
  That will not work in a cloud environment.
  So we can try using dNFS as the shared storage for the nodes - that might work...
  But then you have to make an NFS mount available to the nodes - in the cloud.
  So let's deploy an NFS node as part of the solution.
  But then we have to make that NFS node highly available.
2. Multicast between the nodes - that also does not work well in the cloud.
  So maybe create a networking environment in the cloud that will support multicast?
  Deploy a router appliance in the cloud.
  Now connect all the instances in the cloud to the router.
  But the router poses as a single point of failure.
  Make the router highly available.

And if not Oracle RAC, then how about Data Guard, which does not require shared storage?

But it has a steep licensing fee.
And you have to find a way to manage the virtual IP address - which you will not necessarily have control over.
But that can be overcome by deploying a VRRP solution with IP addresses that are manually managed.

ENOUGH!!!

Trying to fit a triangle into a square? Yes, if you push hard enough, it will break the lid and fit.
If you cry hard enough, Mom/Dad will come over and put it in for you.

Or you come up with a half-baked solution like the one below...

blocks

Some things will not fit. Trying to make them fit creates even more (and sometimes even bigger) problems.

In this case, the solution should have been to change the code to use a NoSQL database that can be deployed easily and reliably in a cloud environment.

As always, your thoughts and comments are welcome.

2018-08-15

Saving a Few Shekels on your AWS bill

I have a jumpbox that I use to access resources in the cloud - and I use it at work, only during work hours and only on workdays.

There are usually 720 hours in a month, or 744 in months that have 31 days. Assume I want to run the instance for 10 hours a day, 5 days a week. To calculate exactly how many hours that is, we need an example.

The month of August, 2018

image

The work week in Israel is Sunday-Thursday (yeah, I know, we are special...).

August has 22 work days. The total number of hours in August is 31*24 = 744, with 220 working hours in the month (22 working days multiplied by 10 hours per day).

The math is simple: 220/744 - I only need the instance for about 30% of the month - so why would I pay for all of it?

744 hours * $0.0464 (for a t2.medium instance in us-east-2) = ~$34.52, and if I were to pay only for the hours that I actually use the instance, that would be 220 * $0.0464 = ~$10.21. Less than a third of the cost. Simple math.

There are multiple ways to do this - a Lambda script, Cloud Custodian - each of these works very well and will work brilliantly at scale. For me it was a single machine, and honestly I could not be bothered to set up all the requirements to get everything working.

Simple solution: use cron. I don't pay for resource usage by the hour in my corporate network (if someone does, then you have my sympathies...), so I set up a simple cron job to do this.

To start up the instance:

0 8 * * 0,1,2,3,4 start-jumpbox

And to stop the instance at night:

0 18 * * 0,1,2,3,4 stop-jumpbox

And what is the start/stop-jumpbox command, you might ask? A really simple aws cli command:

aws ec2 start-instances --region <__REGION__> --instance-ids <__INSTANCE_ID__>

aws ec2 stop-instances --region <__REGION__> --instance-ids <__INSTANCE_ID__>

Of course, in the background the correct credentials and access keys are set up on my linux box - I am not going to go into how to do that here - AWS has enough documentation on that.

The last thing I needed to solve: the jumpbox has a public IP (obviously), and if I really wanted to save money, I did not want to pay for a static Elastic IP provisioned and sitting there idle for 70% of the month (because the instance is powered down).

After doing the calculation, it was chump change for my use case (524 hrs * $0.005 = $2.62), so maybe I should not have worried about it - but the resulting script is still useful.

I wanted to use the public IP address that AWS provides to the instance at no cost. The problem with this is that every time you stop the instance, the IP address is reclaimed by AWS, and when you power it on, you get a new one.

Me being the lazy bum I am, I did not want to have to look up the IP each and every time, so I went down the path of updating a DNS record on each reboot.

Take the public IP allocated to the instance and update a known FQDN that I use on a regular basis.

The script can be found here (pretty self explanatory).
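For completeness, the core of such a script would look something like this (a sketch - the hosted zone ID, instance ID and FQDN are placeholders):

#!/bin/bash
INSTANCE_ID="i-0123456789abcdef0"
ZONE_ID="Z0123456789ABC"
FQDN="jumpbox.example.com"

# the public IP the instance received on this boot
IP=$(aws ec2 describe-instances --instance-ids "$INSTANCE_ID" \
  --query 'Reservations[0].Instances[0].PublicIpAddress' --output text)

# upsert an A record so the FQDN always points at the current IP
aws route53 change-resource-record-sets --hosted-zone-id "$ZONE_ID" \
  --change-batch '{"Changes":[{"Action":"UPSERT","ResourceRecordSet":
    {"Name":"'"$FQDN"'","Type":"A","TTL":60,
     "ResourceRecords":[{"Value":"'"$IP"'"}]}}]}'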



Now of course this is only a single instance - but if you are interested in saving money, this is one of the things to think about when looking to save. (Of course, at scale this should be managed properly - a single cron job will not suffice...)

For example, say you have 1000 development machines that are not being used after working hours (and I know that not everything can be shut down after hours - there are background jobs that run 24/7), and they are not a measly t2.medium but rather an m4.large:

1000 (instances) * 220 (work hours) * $0.1 (cost per hour) = $22,000

1000 (instances) * 744 (hours in the month) * $0.1 (cost per hour) = $74,400

See how you just saved over $50k a month on your AWS bill?

You are welcome :)

(If you want to spend some of your well-saved cash on my new book - The Cloud Walkabout - feel free.)

If you have any questions about the script/solution, or just want to leave a comment - please feel free to do so below.

2018-07-23

aws-vault on a Chromebook

I have moved almost exclusively to a Chromebook for my day-to-day work
(a whole other set of blog posts on the journey and outcome is planned), and I was missing one of the tools in my belt - aws-vault.

If you look at the releases, you will see that there is no binary available for ARM.

I opened up an issue on the repository, and the answer I got was that an ARM binary is not likely to be released in the near future - I should go and compile it myself.

I did; here are the steps.
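The embedded gist holds the exact steps; the general approach is a standard Go cross-compile (assuming a machine with the Go toolchain installed):

git clone https://github.com/99designs/aws-vault.git
cd aws-vault

# build for the Chromebook's ARM processor
GOOS=linux GOARCH=arm64 go build -o aws-vault .

# put the binary somewhere on the PATH
sudo install -m 0755 aws-vault /usr/local/bin/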


Hope it is useful for someone in the future.


2018-07-19

The #AWS World Shook and Nobody Noticed

A few days ago at the AWS Summit in New York, there was an announcement which in my honest opinion went very noticeably under the radar, and I don't think many people understand exactly what it means.
The announcement I'm talking about is this one: EC2 Compute Instances for Snowball Edge.
Let's dig into the announcement. There is a new instance family, sbe1, which can be run on an AWS Snowball Edge device - essentially a computer with a lot of disks inside.

The Snowball is a service that AWS provides to enable you to upload large amounts of data from your datacenter up to S3. Since its inception it has been a very interesting concept, and to me it has always been a one-off way of enticing you to bring more of your workloads and your data to AWS in a much easier way.

I also posted this on Twitter.

Since its inception, AWS has always beaten the drum and pushed the message that everything will run in the cloud - and only there. That was the premise they built a large part of their business model upon. You don't need to run anything on-premises, because everything that you would ever want or ever need is available in the cloud, consumed as a service, through an API.

During the course of my career, the question came up a number of times: "Does AWS deploy on-prem?" Of course, the answer was always, "No, never gonna happen."

Most environments out there are datacenter snowflakes - built differently, none of them looking the same or having the same capabilities, features or functionality. They are unique, and integrating a system into different datacenters is not easy. Adapting to so many different snowflakes is a really hard job, and something we have been trying to solve for many years - by building layers of abstraction, automation and standards across the industry. In some ways we as an industry have succeeded, and in others we have failed dismally.

In June 2017, AWS announced the general availability of Greengrass - a service that allows you to run Lambda functions on connected devices wherever they are in the world (and more importantly - they are not part of the AWS cloud).

This is the first leg in the door - letting AWS into your datacenter. The first step of the transformation.

Back to the announcement.

It seems that each Snowball is a server with approximately 16 CPUs and 32GB of RAM (I assume a bit more, to manage the overhead of the background processes). So essentially a small hypervisor - most of us have servers that are much beefier than this little box as our home labs, or even our laptops. It is not a strong machine - not by any means.

But you now have the option to run pre-provisioned EC2 instances on this box. Of course it is locked down, and you have a limited set of functionality available to you (the same way that you have a set of pre-defined options available in AWS itself - yes, there are literally tens of thousands of operations you can perform, but it is not a free-for-all).

Here is what stopped me in my tracks.

EC2_endpoint
Connecting and Configuring the Device
After I create the job, I wait until my Snowball Edge device arrives. I connect it to my network, power it on, and then unlock it using my manifest and device code, as detailed in Unlock the Snowball Edge. Then I configure my EC2 CLI to use the EC2 endpoint on the device and launch an instance. Since I configured my AMI for SSH access, I can connect to it as if it were an EC2 instance in the cloud.

Did you notice what Jeff wrote?
"Then I configure my EC2 CLI to use the EC2 endpoint on the device and launch an instance"

Also this little tidbit:

S3_Endpoint
"S3 Access - Each Snowball Edge device includes an S3-compatible endpoint that you can access from your on-device code. You can also make use of existing S3 tools and applications"
That means AWS has just brought the functionality of the public cloud right into your datacenter.

Is it all the bells and whistles? Infinitely scalable, able to run complex map-reduce jobs? Hell no - that is not what this is for. (Honestly, I cannot think of a use case where I personally would want to run an EC2 instance on a Snowball - at least not yet.)

Now if you ask me, this is a trial balloon that they are putting out there to see if the solution is viable - and something that their customers are interested in using.

If this works, then to me the next step is obvious: Snowmobile.

SnowMobile

Imagine being able to run significantly more workloads on-prem - same AWS experience, same API - and seamlessly connected to the public cloud.

Ladies and gentlemen, AWS has just brought the public cloud smack bang into your datacenter.

They are no longer only a public cloud company - they provide hybrid cloud solutions as well.


If you have any ideas for a use case for running workloads on Snowball - or if you have any thoughts or comments - please feel free to leave them below.

2018-07-09

Comparing CloudFormation, Terraform and Ansible - Part #2

The feedback I received from the first comparison was great - thank you all.

Obviously, the example I used was not really something that you would use in the real world - because no one actually creates only a VPC without creating anything inside it; that is pretty futile.

So let's go to the next example.

The scenario is to create a VPC with a public presence and a private presence, deployed across two availability zones. Public subnets should be able to route to the internet through an Internet Gateway; private subnets should be able to access the internet through a NAT Gateway.

This is slightly more complicated than just creating a simple VPC with a one-liner.

So to summarize, the end state I expect to have is:

• 1x VPC (192.168.90.0/24)
• 4x Subnets
  • 2x Public
    • 192.168.90.0/26 (AZ1)
    • 192.168.90.64/26 (AZ2)
  • 2x Private
    • 192.168.90.128/26 (AZ1)
    • 192.168.90.192/26 (AZ2)
• 1x Internet Gateway
• 2x NAT Gateways (I really could do with one - but since the subnets and resources are supposed to be deployed in more than a single AZ, there will be two - this minimizes the impact of losing service if a single AZ fails)
• 1x Public Route Table
• 2x Private Route Tables (1 for each AZ)

And all of these should have simple tags to identify them.

(The code for all of these scenarios is located here: https://github.com/maishsk/automation-standoff/tree/master/intermediate)

First let's have a look at CloudFormation.


This is a bit more complicated than the previous example. I still used the native resources in CloudFormation, and set defaults for my parameters. You will see some built-in functions that are available in CloudFormation - namely !Ref, a reference function to look up a value that has previously been created/defined in the template, and !Sub, which substitutes a variable into a string in the template.

There are a few nifty things going on here.

1. You do not have to remember resource names - CloudFormation keeps all the references in check and allows you to address resources by name in other places in the template.
2. CloudFormation manages the order in which the resources are created, and takes care of all of that for you.

  For example, the route table for the private subnets will only be created after the NAT gateways have been created.
3. More importantly, when you tear everything down, CloudFormation takes care of the ordering for you; i.e., you cannot tear down a VPC while the NAT gateways and Internet Gateway are still there - so you need to delete those first, and then you can go ahead and rip everything else up.


Let's look at Ansible. There are built-in modules for this: ec2_vpc_net, ec2_vpc_subnet, ec2_vpc_igw, ec2_vpc_nat_gateway, ec2_vpc_route_table.


As you can see, this is a bit more complicated than the previous example - because the subnets have to be assigned to the correct availability zones.

There are a few extra variables that needed to be defined in order for this to work.


Last but not least - Terraform.

And a new set of variables.



First Score - number of lines of code (including all nested files):

Terraform - 164

CloudFormation - 172

Ansible - 204

(Interesting to see how the order has changed here.)

              Second Score - Ease of deployment / teardown.

              I will not give a numerical score here - just mention a basic difference between the three options.

              Each of the tools uses a simple command-line syntax to deploy:

              1. CloudFormation

                aws cloudformation create-stack --stack-name testvpc --template-body file://vpc_cloudformation_template.yml

              2. Ansible

                ansible-playbook create-vpc.yml

              3. Terraform

                terraform apply -auto-approve

              The teardown is a bit different

              1. CloudFormation stores the information as a stack - and all you need to do to remove the stack and all of its resources is to run a single command:

                aws cloudformation delete-stack --stack-name <STACKNAME>

              2. Ansible - you will need to create an additional playbook for tearing down the environment - it does not store the state locally. This is a significant drawback: you have to make sure that you have the order correct, otherwise the teardown will fail, which means you also need to understand exactly how the resources were created (see the condensed teardown sketch after this list).

                ansible-playbook remove-vpc.yml

              3. Terraform - stores the state of the deployment, so a simple run will destroy all the resources:

                terraform destroy -auto-approve
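
              To make the Ansible ordering constraint from point 2 concrete, here is a hypothetical condensed fragment of a teardown playbook – resources are removed in roughly the reverse order of creation, and the variable names are illustrative (the real remove-vpc.yml in the repository looks the IDs up first):

                - hosts: localhost
                  connection: local
                  gather_facts: no
                  tasks:
                    - name: Delete the NAT gateways first - they block the subnets and the IGW
                      ec2_vpc_nat_gateway:
                        nat_gateway_id: "{{ item }}"
                        region: "{{ aws_region }}"
                        state: absent
                        wait: yes
                      with_items: "{{ nat_gateway_ids }}"

                    - name: Then the internet gateway (subnet and route table removal omitted for brevity)
                      ec2_vpc_igw:
                        vpc_id: "{{ vpc_id }}"
                        region: "{{ aws_region }}"
                        state: absent

                    - name: Only then can the VPC itself be removed
                      ec2_vpc_net:
                        name: testvpc
                        cidr_block: "{{ vpc_cidr }}"
                        region: "{{ aws_region }}"
                        state: absent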

              You will see below that the durations of the runs are much longer than in the previous example – the main reason being that the amount of time it takes to create a NAT gateway is long – really long (at least 1 minute per NAT gateway), because AWS does a lot of grunt work in the background to provision this “magical” resource for you.

              You can find the full output of the runs below:

              Results

              Terraform
              create: 2m33s
              destroy: 1m24s

              Ansible:
              create: 3m56s
              destroy: 2m12s

              CloudFormation:
              create: 3m26s
              destroy: 2m14s

              Some interesting observations. It seems that Terraform was the fastest of the three – at least in this case.

              1. The times are all over the place – I cannot say that one tool is consistently faster than the others, because the provisioning happens in the background and you have to wait for it to complete. So I am not sure how reliable the timings are.
              2. The code for the Ansible playbook is by far the largest – mainly because tearing everything down requires going through the deployed pieces and ripping them out, which requires a complete additional set of code.
              3. I decided to compare how much more code was added from the previous create example to this one (you could equate an increase in the amount of code with an increase in complexity):

                Ansible: 14 –> 117 (~8x increase)
                CloudFormation: 24 –> 172 (~7x increase)
                Terraform: 7 –> 105 (~15x increase)
              4. It is clear to me that allowing the provisioning tool to manage the dependencies on its own is a lot simpler to handle – especially for large and complex environments.


              This is by no means a recommendation to use one tool or the other - or to say that one tool is better than another - just a simple side-by-side comparison between the three options that I have used in the past.

              Thoughts and comments are always welcome, please feel free to leave them below.

              2018-07-05

              Getting Hit by a Boat - Defensive Design

              In a group discussion last week I heard a story – I could not find the origin, so if you know where it comes from, please let me know – which I would like to share with you.
              John was floating out in the ocean, on his back, with his shades, just enjoying the sun, the quiet, the time to himself, not a care in the world.
              When all of a sudden he got bumped on the head (not hard enough to cause any serious damage) with a small rowing boat.
              John was pissed… All sorts of thoughts were running through his head.
              • Who gave the driver their license?
              • Why are they not more careful?
              • I could have been killed!
              • Why are they sailing out here – this is not even a place for boats.
              And with all that anger and emotion he pulled himself over the side of the boat, ready to give the owner/driver one hell of a mouthful.
              When he pulled himself over the side, he saw an empty boat. No-one there, no-one to scream at.
              And at that moment all the anger and rage that was building up inside – slowly went away.
              We encounter things every day – many of them we think are directly aimed at us, deliberately or not – and we immediately become all defensive, build up a bias against the other, and are ready to go ballistic – until we understand that there is no-one to direct all this emotion and energy at.
              And then we understand that sometimes things just happen – things beyond our control – and we cannot or should not put our fate into someone else’s hands.
              That was the original story – which I really can relate to.
              (Source: Flickr – Steenaire)
              But before I heard the last part of the story – my mind took this to a totally different place – which is (of course) architecture related.
              John was enjoying a great day in the sun – and all of a sudden he got hit in the head by a boat.
              Where did that boat come from?
              No-one knows… I assume the owner had tied it up properly on the dock.
              • Maybe the rope was cut.
              • Maybe someone stole it and dumped it when they were done.
              • Maybe there was a storm that set the boat loose.
              • Or maybe there was a bloopers company that was following the boat all along to see who would get hit in the head.
              There are endless options as to how the boat got there. But they all have something in common: the boat was never supposed to end up hitting John in the head. John expected to be able to bake nicely in the sun and not be hit in the head by a boat.
              But what if John had taken additional precautionary measures?
              • Set up a fence / guardrail around where he was floating
              • Put someone as a lookout to warn him about floating boats
              • Have a drone above his head hooked into a heads-up display in his sunglasses so that he can see what is going on around him
              There are endless possibilities and you can really let your imagination take you to where you want to go as to how John could have prevented this accident.
              What does this have to do with Defensive Design?
              When we design an application – we think that we are going to be ok – because we expect to be able to do what we want to do without interference.
              For example:
              My web server is supposed to serve web requests of a certain type. I did not plan for someone crafting a specific request that would crash my server, or bombarding the web server with such an influx of traffic that it would bring the application to its knees.
              But then something unexpected happens.
              When you design your application you will never be able to predict every possible attack or every esoteric way people are going to use your software. There is always something new that comes up – or someone thinks of a way to use your idea that you did not even consider.
              What you can do is put some basic guardrails into your software that will protect you from what you do know or think can happen.
              • Throttling the number of connections or requests – to prevent DDoS attacks
              • Introducing circuit breakers to prevent cascading failures
              • Opening only specific ports / sockets
              • Requiring sufficient authentication to verify that callers are doing only what they are supposed to do
              • Monitoring for weird or suspicious behavior
              Again, the options are practically endless – and you will not think of them all. You should address the issues as they happen, iterate, rinse, repeat.
              That was a 4-minute read into the kind of things that I think about during the day.
              What kind of things do you think about during your daily work? I would be interested in hearing. Please feel free to leave comments down below.