In Part 2 I explored whether I could apply some Microservice Patterns & Best Practices to a sample FaaS use case implementation using AWS Lambda.
This is the part of the series where I will discuss lessons learned and my conclusions.
As with learning any new technology, you come to the table with assumptions about what should and should not happen when applying it. You may also have a feel for what should be easy, and then find out what turns out to be difficult. And finally, you bring your prejudices about what you think should happen versus what actually does. Well, this use case covers most of the bases. Let's discuss some of the "gotchas."
One of the big complaints about Lambda is cold-start time. So what is a cold-start? A cold-start occurs when a Lambda function is invoked after not being used for an extended period of time, resulting in increased invocation latency. This "extended period of time" can range from 30 to 40 minutes of inactivity (based upon Web tribal knowledge); the idle timeout period is not a constant. When a function does have to restart, it can take 15 seconds or more, depending on many factors such as the language used. Under the hood, Lambda runs in a container, and you have no idea what class of machine it is running on or what other "tenants" are commingling with you. This is what you give up when you put your trust in your Cloud provider for the correct deployment/configuration.
There are some techniques to reduce cold-starts by keeping your functions "warm," which means an instance is kept initialized and idle between requests (this is not free). One technique is to set up a CloudWatch rule to ping your Lambda every 20 or 30 minutes. Architectural decisions (functionality/design vs. cost) need to be made to determine how "atomic" your Lambdas will be and which ones to keep warm, because warming adds cost on both the Lambda and CloudWatch sides.
Example cost estimate:
Free Tier not included + Default WarmUP options + 10 lambdas to warm, each with
memorySize = 1024 and
duration = 10:
WarmUP: runs 8640 times per month = $0.18
10 warm lambdas: each invoked 8640 times per month = $14.4
Total = $14.58/month
CloudWatch costs are generally very low.
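To make the arithmetic behind that estimate easier to check, here is a minimal sketch of the calculation. The class name and the two rates are my assumptions, not part of the original estimate: I am using $0.20 per million requests and $0.00001667 per GB-second, and I am reading "duration = 10" as seconds. Under those assumptions the compute charge for the 10 warm Lambdas lands near the $14.4 figure, and 8640 runs per month corresponds to one ping every 5 minutes over 30 days.

```java
// Hypothetical helper for sanity-checking a warm-up cost estimate.
// The rates below are assumptions; check current AWS pricing before relying on them.
final class WarmCostEstimate {
    static final double USD_PER_GB_SECOND = 0.00001667;   // assumed compute rate
    static final double USD_PER_REQUEST = 0.20 / 1_000_000; // assumed request rate

    /** Number of scheduled pings in a month of the given length. */
    static long runsPerMonth(int intervalMinutes, int daysInMonth) {
        return daysInMonth * 24L * 60 / intervalMinutes;
    }

    /** Compute (GB-second) charge for keeping several functions warm. */
    static double computeCost(int functions, long invocationsEach,
                              double durationSeconds, int memoryMb) {
        double gbSeconds = functions * invocationsEach * durationSeconds * (memoryMb / 1024.0);
        return gbSeconds * USD_PER_GB_SECOND;
    }

    /** Per-request charge for the same warm-up invocations. */
    static double requestCost(int functions, long invocationsEach) {
        return functions * invocationsEach * USD_PER_REQUEST;
    }
}
```

Calling `WarmCostEstimate.computeCost(10, 8640, 10, 1024)` gives the compute portion; `requestCost(10, 8640)` adds a couple of cents on top. Pinging every 20 or 30 minutes instead of every 5 shrinks both numbers proportionally.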
Here are some considerations if you decide to put your Lambda into a VPC.
When you add a Lambda function to a VPC, it can only access resources in that VPC. If a Lambda function needs to access both VPC resources and the public Internet, the VPC needs to have a Network Address Translation (NAT) instance for that VPC. NATs cost money.
When a Lambda function is configured to access a resource (e.g., RDS) within a VPC, it incurs an additional Elastic Network Interface (ENI) start-up penalty. This means address resolution may be delayed when trying to connect to network resources. If your VPC does not have sufficient ENIs or subnet IPs, your Lambda function will not scale as requests increase, and you will see an increase in function failures. AWS Lambda currently does not log errors to CloudWatch Logs that are caused by insufficient ENIs or IP addresses.
VPC ENI Calculator
Projected peak concurrent executions * (Memory in GB / 3GB)
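As a quick illustration of the formula above, here is a small sketch. The class and method names are hypothetical, and rounding up to a whole ENI is my own assumption (you cannot allocate a fraction of an interface).

```java
// Hypothetical helper that applies the ENI capacity formula:
// peak concurrent executions * (memory in GB / 3 GB).
final class EniEstimate {
    static final double GB_PER_ENI = 3.0;

    /** Estimated ENIs needed, rounded up to a whole interface (assumption). */
    static long requiredEnis(int peakConcurrentExecutions, double memoryGb) {
        if (peakConcurrentExecutions < 0 || memoryGb <= 0) {
            throw new IllegalArgumentException("inputs must be positive");
        }
        return (long) Math.ceil(peakConcurrentExecutions * (memoryGb / GB_PER_ENI));
    }
}
```

For example, 1,000 peak concurrent executions at 1 GB each works out to 334 ENIs, so the subnets attached to the function need at least that many free IP addresses.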
Lambdas that execute inside a VPC need a role that includes the "AWSLambdaVPCAccessExecutionRole" policy to enable detachment of unused ENIs. Detachment can take a while. Also, a VPC based Lambda function can create ENIs as it is invoked and scales, and the function's execution role must have permission to delete them. If the role does not have that permission, Lambda cannot clean up ENIs after use. This can break your delete-stack scripts: while the Lambda is being destroyed its ENIs cannot be deleted, and the CloudFormation stack deletion fails (e.g., a subnet still depends on an ENI). To avoid this, be sure that the "AWSLambdaENIManagementAccess" policy is included in your VPC based Lambda role. This gives permission to delete any Lambda ENIs.
Since one of the main reasons for putting your Lambda in a VPC is its need to interface with RDS, consider DynamoDB instead. DynamoDB doesn't require a VPC.
DynamoDB or RDS?
For this project, RDS was selected for several reasons:
- RDBMS is still more mainstream for most enterprise organizations
- Client code for JDBC is much simpler to implement than with the DynamoDB SDK (which can get complex)
- Better at handling complex joins, indexing, and queries
- Stronger typing and better data integrity
- More straightforward backup procedure
On the other hand, DynamoDB has its set of advantages:
- Generally lower cost. Burst bulk operations, however, can quickly escalate the cost beyond the RDS equivalent
- No need for VPC
- No ENI penalty
- No Connection Pool issues
- No NAT requirements
- Scales well and can handle large amounts of data
So which data service is best? As always, it "depends." There are definite cost advantages to scaling with DynamoDB if the Lambda's data structure and query demands align with DynamoDB's wheelhouse. If the application requirements put you on the fence between the two, a careful architectural assessment will be needed to evaluate total cost vs. capability/simplicity.
One of the advantages of Lambda is that scaling is transparent. But there are some tweaks you can make through concurrency (the unit of scale for concurrent executions) to improve performance.
You can set a concurrency limit on individual AWS Lambda functions (the default account-level limit is 1,000 concurrent executions per region). The concurrency limit you set reserves a portion of your account-level concurrency limit for a given function. This feature lets you throttle a function once it reaches the maximum number of concurrent executions you chose. This is useful when you want to limit traffic rates to downstream resources called by Lambda (e.g., databases) or if you want to control the consumption of elastic network interfaces (ENI) and IP addresses for functions accessing a private VPC. If a function that is invoked synchronously is throttled, Lambda returns a 429 error (Too Many Requests). If it is invoked asynchronously and is throttled, Lambda automatically retries the throttled event for up to six hours, with delays between retries. Asynchronous events are queued before they are used to invoke the Lambda function. If you need more concurrent executions, you need to submit a request to the AWS Support Center.
In a JEE application, you typically use the app server container to manage your connection pool; with Lambda you have to "roll your own." For this project an open source connection pool, HikariCP, was used. But picking the right architecture is critical for having your functions scale smoothly. It starts with the correct RDS instance size. For example, RDS MySQL comes out of the box with different max connections based upon instance size (http://tritoneco.com/2015/11/23/max_connections-at-aws-rds-mysql-instance-sizes). It ranges from 66 (t2.micro) to 2540 (R3.xLarge). You will need to understand what your maximum connection count could be, which depends upon how many calls are made to the RDS instance and how long they run. The other component is the size of the HikariCP pool: how large does it need to be for sufficient scaling? One thing you don't have real control over is how many Lambda instances AWS will spin up at runtime. When an instance hits certain thresholds, another instance may spin up and allocate another pool, so it is possible to exceed an RDS instance's max connections under certain situations. The final area to control is incoming connections, via API Gateway throttling, to prevent Lambda instance explosions.
So how do you implement the pooling code? Having each Lambda invocation create a new connection handle would be expensive for performance. The best practice is to store your connection pool reference outside the scope of your handler (RequestHandler). This allows future invocations on the same instance to reference the initialized pool. For this project, every Lambda invocation checked to see if the pool reference was null. If it was, the pool was initialized. Otherwise a connection handle was fetched from the pool.
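A back-of-the-envelope check can show how quickly pools add up. The sketch below is only an illustration: the class and method names are hypothetical, and it assumes the worst case where every Lambda instance opens its full pool against the same RDS instance.

```java
// Hypothetical sizing helper: how many Lambda instances fit within an
// RDS instance's max connections if each instance holds a full pool.
final class ConnectionBudget {

    /** Largest number of Lambda instances that stays within the connection limit. */
    static int maxLambdaInstances(int maxConnections, int reservedConnections, int poolSize) {
        if (poolSize <= 0) {
            throw new IllegalArgumentException("poolSize must be positive");
        }
        // reservedConnections covers admin tools, migrations, other services (assumption)
        int available = Math.max(0, maxConnections - reservedConnections);
        return available / poolSize;
    }

    /** True if the given number of instances could exhaust the connection limit. */
    static boolean exceedsLimit(int lambdaInstances, int poolSize, int maxConnections) {
        return (long) lambdaInstances * poolSize > maxConnections;
    }
}
```

With the 66 connections of a t2.micro and a pool of 5, only 13 concurrent Lambda instances fit; a 14th could tip the database over its limit. That number is a reasonable starting point for a function-level concurrency limit or an API Gateway throttle.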
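Here is a minimal sketch of that null-check pattern. It is not the project's actual code: `StubPool` and `PooledHandler` are hypothetical stand-ins so the example compiles without dependencies. In a real function the static field would hold a HikariCP `HikariDataSource`, and the class would implement `RequestHandler` from the AWS Lambda Java runtime library.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Stand-in for a real connection pool (e.g., HikariDataSource); assumption for illustration.
final class StubPool {
    static final AtomicInteger CREATED = new AtomicInteger(); // counts pool initializations

    StubPool() {
        CREATED.incrementAndGet(); // a real pool would open database connections here
    }

    String getConnection() {
        return "stub-connection"; // a real pool would return a java.sql.Connection
    }
}

// Hypothetical handler; a real one would implement RequestHandler<I, O>.
final class PooledHandler {
    // Declared static, outside the handler method, so it survives across
    // invocations served by the same warm Lambda instance.
    private static StubPool pool;

    private static synchronized StubPool pool() {
        if (pool == null) {        // first invocation on this instance (cold start)
            pool = new StubPool(); // initialize once
        }
        return pool;               // later invocations reuse the same pool
    }

    String handleRequest(String input) {
        String connection = pool().getConnection();
        return connection + ":" + input;
    }
}
```

Invoking `handleRequest` twice, even on two handler objects, initializes the pool only once. The `synchronized` accessor is a defensive choice on my part; the essential point is only that the reference lives outside the handler method.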
Pricing varies depending upon the tier you use and the robustness of your architecture. RDS, API Gateway, Lambda, and network components like NAT each carry their own charges. Below are some breakdowns of major components and services:
- Lambda Pricing. The pricing is based on the number of requests and the duration of the function's execution, billed in 100-millisecond increments. So, if a Lambda function runs for 15 milliseconds, it will be billed for 100. This could be an issue for very high-volume applications with lots of short-running functions. A crude hack to get the best bang for the buck would be to combine short-running Lambda operations into a larger one. Also, if you want to expose your Lambda methods as REST endpoints using AWS API Gateway, you'd incur extra costs, as API Gateway has separate pricing.
- NAT Pricing. You are charged for each "NAT Gateway-hour" that your NAT gateway is provisioned and available. Data processing charges apply for each Gigabyte processed through the NAT gateway regardless of the traffic's source or destination. Each partial NAT Gateway-hour consumed is billed as a full hour. You also incur standard AWS data transfer charges for all data transferred via the NAT gateway.
- RDS Pricing. There are a few cloud cost management points to consider, as RDS setups are more than just paying for the databases. Understanding where the major costs of RDS come from will set proper cost and usage expectations and should lower the chances of seeing unwanted surprises in next month's bill. A good way to parse out pricing across all of RDS' offerings is with this open source price comparison tool.
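The rounding rules in the Lambda and NAT items above are easy to underestimate, so here is a small sketch of both. The class and method names are hypothetical, and it assumes the 100-millisecond Lambda increment and the full-hour NAT rounding exactly as described in this list.

```java
// Hypothetical helper illustrating the two "round up" billing rules described above.
final class BillingRounding {
    static final long LAMBDA_INCREMENT_MS = 100;

    /** Billed Lambda duration: actual run time rounded up to the next 100 ms. */
    static long billedMillis(long actualMillis) {
        return ((actualMillis + LAMBDA_INCREMENT_MS - 1) / LAMBDA_INCREMENT_MS)
                * LAMBDA_INCREMENT_MS;
    }

    /** Billed NAT gateway-hours: every partial hour counts as a full hour. */
    static long billedNatHours(long provisionedMinutes) {
        return (provisionedMinutes + 59) / 60;
    }
}
```

Six separate 15 ms invocations are billed as 6 x 100 = 600 ms, while one combined 90 ms invocation is billed as 100 ms, which is the saving behind the "crude hack" of merging short operations. Likewise, a NAT gateway that lives for 61 minutes is billed for 2 gateway-hours.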
Step Functions were introduced in late 2016 and were used in a similar use case, but are out of scope for this discussion. So what are Step Functions (SF, or "State as a Service")? In a nutshell, SF provides the ability to coordinate (orchestrate) components (Lambda and other services) as a series of "steps" in a visual workflow. It is similar to AWS Simple Workflow Service (SWF). SF incorporates a state machine to specify and execute the steps of the application at scale. The state machine uses a JSON-based language called Amazon States Language (ASL). Each state machine defines a set of states and the transitions between them. States can be activated sequentially or in parallel, with retry and error trapping capabilities. Step Functions will make sure that all parallel states run to completion before moving forward. States perform work, make decisions, and control progress through the state machine. As with Lambda, SF "magically" scales under the hood ;-)
- Good for long executions (which would be prohibitively expensive with just Lambda alone)
- Deep integrations with CloudWatch Logs, Metrics, and CloudTrail
- Throttling is 2 executions/second by default. It can be raised
- When integrated as part of an API Gateway solution, use a proxy (e.g., a Lambda proxy) and consider an asynchronous API
- Step Functions are unique to AWS but provide an orchestration service to architect more complicated FaaS implementations.
FaaS is part of a continuously evolving pattern in our technology space where we are always striving for simplicity through abstraction. The Cloud space is becoming more competitive. The question is no longer whether you are in the Cloud, but which one. Cost is always a consideration, but getting your stack up and running quickly while reducing complexity (e.g., managing infrastructure) with great tools makes a Cloud provider more attractive. FaaS is another architecture (tool in the belt) to draw customers to the Cloud provider's platform.
But Is It Prime Time?
I personally like what I see (so far) because I can really focus on more of the "essence" of my application with less effort on infrastructure plumbing. However, at this point, I would not use FaaS for a highly complex architecture. There are some enterprise use cases, like this one, where the focus is strictly confined to elastic high-performance compute needs; there it works quite well and has real cost savings. Applications that don't fit well into those types of scenarios should consider other enterprise Microservice based solutions. An example would be a Kubernetes/Spring-Cloud (Java) type of architecture because it can scale, has cross-Cloud portability, and can handle complex transactional requirements if needed.
Where FaaS is ready for prime time is in handling simple event processing, where in the past you needed to spin up a compute instance. That is now overkill, and Lambda hits the mark from a cost and simplicity perspective.
The other perfect application for FaaS is a Web style architecture. You really don't need a web server anymore. By using an API Gateway (your virtual web server), Lambda, and S3 you can easily construct dynamic and static web pages. The gateway (via APIs) can authenticate requests, route/fetch static content from an S3 bucket, and invoke Lambda to build dynamic content using templated (e.g., Mustache) pages.
So on your next project, consider FaaS if it meets your requirements.