The Challenge
Pivotree had a client who wanted to autoscale ATG application servers under certain conditions. The challenge was that servers had previously been provisioned and built to fixed requirements and were static. Server configurations were baked into the EAR and selected using JVM arguments. All platform integration was done manually, and a build-out could take anywhere from several days to weeks.
The client had previously been migrated to the AWS cloud, but at the time it was a basic lift and shift: the environment was deployed using Terraform and then built and managed manually. Common patterns, tooling, and processes from the data center were migrated over as well. This was based on the client’s requirements at the time.
The ATG application consists of several layers: the client-facing application servers, commonly referred to as page or com servers; the content management system, called BCC, which is typically configured manually to register the various application servers and to publish or update content as required; and the Endeca search engine, which also requires manual updates to various configuration files to add or remove application servers.
Several things had to be considered when developing this solution: how we were going to build the images, store the runtime parameters, update the various configuration files, start the application, and register the application servers with the BCC and Endeca servers.
Centralized logging, user access, and troubleshooting were also factors. Our system integrator (SI) and operations (OP) teams were used to having SSH access to all boxes, finding logs on the boxes, and resolving issues on the boxes. PCI requirements mattered too, as we needed to be sure we were developing a solution we could certify at a later date.
On Building Immutable Infrastructure
After a careful analysis of the software stack, and drawing on our previous experience with the platform, we decided that we would need to create immutable images. Using an immutable image meant we could use the same image across environments and potentially reuse the solution with additional clients in the future. We were already working with HashiCorp Packer in other parts of the organization, and it was a natural fit for what we planned to build.
During the development of the solution, we also decided to use the AWS Systems Manager (SSM) Parameter Store to hold all the required runtime values. We developed a hierarchy of parameter paths that keeps each environment’s values separate. As part of this, we developed a process in which a “property” file is updated, then parsed and pushed into SSM.
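As a rough illustration of the property-file step, the sketch below reads `key=value` lines and writes each one to Parameter Store under an environment prefix. The `/<env>/<key>` layout, the dotted key format, and the rule that any key containing `password` is stored as a `SecureString` are assumptions made for the example, not the client's actual conventions.

```bash
# Sketch: push key=value pairs from a property file into SSM Parameter Store.
# The path layout and the SecureString rule are assumptions for illustration.

# Map one property key to its hierarchical parameter name.
# Dots in the key become path separators: rds.rds_url -> /dev/rds/rds_url
param_name() {
    env_name=$1
    key=$2
    printf '/%s/%s\n' "$env_name" "$(printf '%s' "$key" | tr '.' '/')"
}

# Read a property file and call `aws ssm put-parameter` for every entry.
push_properties() {
    env_name=$1
    file=$2
    while IFS= read -r line || [ -n "$line" ]; do
        # Skip blank lines, comments, and lines without an "=".
        case $line in
            ''|'#'*) continue ;;
            *=*) ;;
            *) continue ;;
        esac
        key=${line%%=*}      # text before the first "="
        value=${line#*=}     # text after the first "="
        # Store anything that looks like a secret encrypted.
        case $key in
            *password*) type=SecureString ;;
            *)          type=String ;;
        esac
        aws ssm put-parameter \
            --name "$(param_name "$env_name" "$key")" \
            --value "$value" \
            --type "$type" \
            --overwrite
    done < "$file"
}
```

Calling `push_properties dev app.properties` with a line such as `core.username=atg` would write the parameter `/dev/core/username`. Keeping the environment as the first path segment is what lets an instance later build its queries from its own environment tag.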
Due to the size of some of the software we install and the need for autoscaling to occur rapidly, another key decision was to separate the building of images from the loading of runtime values. Working with Packer, we built two machine images: one for the application stack, based on Java and JBoss 7.2, and one for Endeca, based on Endeca 11.3.2. We also started developing various scripts to populate our configuration and startup files.
We determined early in the process that we would use AWS resource tags to identify both the environment an instance belongs to and the role of the server. Those tags allowed us to make decisions and build our queries to SSM. We also decided to keep a single download.sh script in our user folder, which is called via cloud-init, queries the tags, and downloads the appropriate automation package from an S3 bucket. This gave us the flexibility to update those scripts and templates without requiring a new image.
Both images required EFS volumes to be mounted, so we developed a script that queries SSM for the EFS endpoints, updates our template, appends the result to fstab, and remounts everything. The Endeca image was fairly basic, with a few installations completed using silent install options and the required scripts copied into the home folder.
Once we had completed the Endeca image, the next step was to identify every value we edit in the JBoss configuration files and replace each one with a token. We also developed a script that queries SSM for each required value and then replaces the token with it. This allows settings such as the datasource URL, datasource username, and passwords to differ between environments, because they are loaded at runtime.
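A bootstrap script along these lines could look like the sketch below. The tag keys `env` and `role`, the bucket name passed in, and the `<env>/<role>.tar.gz` package layout are all hypothetical, and the metadata calls assume the instance metadata service answers plain requests.

```bash
# Sketch of a download.sh-style bootstrap. Tag keys, bucket name, and the
# package layout are hypothetical, not the client's real values.

METADATA_URL=http://169.254.169.254/latest/meta-data

# Region = availability zone minus its trailing letter (us-east-1a -> us-east-1).
instance_region() {
    curl -s "$METADATA_URL/placement/availability-zone" | sed -e 's/.$//'
}

instance_id() {
    curl -s "$METADATA_URL/instance-id"
}

# Read the value of one tag on this instance.
tag_value() {
    aws ec2 describe-tags \
        --region "$(instance_region)" \
        --filters "Name=resource-id,Values=$(instance_id)" "Name=key,Values=$1" \
        --query 'Tags[0].Value' \
        --output text
}

# Download and unpack the automation package for this environment and role.
fetch_automation() {
    bucket=$1
    dest=$2
    env_name=$(tag_value env)
    role=$(tag_value role)
    aws s3 cp "s3://$bucket/$env_name/$role.tar.gz" "$dest/automation.tar.gz" &&
        tar -xzf "$dest/automation.tar.gz" -C "$dest"
}
```

From cloud-init this would be invoked with something like `fetch_automation my-automation-bucket /home/centos/automation`. Because the package is fetched at boot, a script fix only needs a new upload to S3 rather than a new image.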
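The EFS step might be sketched as follows. The parameter name `/<env>/efs/endpoint` is an assumption, and the mount options are the commonly recommended NFS settings for EFS rather than values taken from our template.

```bash
# Sketch: fetch an EFS endpoint from SSM and add it to fstab.
# The parameter name and the mount options are assumptions.

# Print the value of one SSM parameter.
ssm_value() {
    aws ssm get-parameters --names "$1" \
        --query 'Parameters[*].Value' --output text
}

# Build one fstab entry for an NFSv4.1 mount of the EFS endpoint.
fstab_entry() {
    endpoint=$1
    mount_point=$2
    printf '%s:/ %s nfs4 nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,_netdev 0 0\n' \
        "$endpoint" "$mount_point"
}

# Append the entry once, then mount everything listed in fstab.
mount_efs() {
    env_name=$1
    mount_point=$2
    fstab=${3:-/etc/fstab}
    endpoint=$(ssm_value "/$env_name/efs/endpoint")
    [ -n "$endpoint" ] || { echo "no EFS endpoint for $env_name" >&2; return 1; }
    entry=$(fstab_entry "$endpoint" "$mount_point")
    # Only append when the exact line is not already present.
    grep -qxF -e "$entry" "$fstab" || printf '%s\n' "$entry" >> "$fstab"
    mkdir -p "$mount_point"
    mount -a
}
```

Checking for the exact line before appending keeps the script safe to run more than once on the same instance.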
```bash
#####
##### Query SSM and assign to variables
#####

# $env is expected to be set already (derived from the instance's resource tags).
# The region is the availability zone with its trailing letter removed.
region=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone | sed -e 's/.$//')

rds_url=$(aws ssm --region "$region" \
get-parameters --names "/$env/rds/rds_url" --query "Parameters[*].{Value:Value}" | grep Value | cut -d'"' -f4)

rds2_url=$(aws ssm --region "$region" \
get-parameters --names "/$env/rds/rds2_url" --query "Parameters[*].{Value:Value}" | grep Value | cut -d'"' -f4)

rds_db_name=$(aws ssm --region "$region" \
get-parameters --names rds_db_name --query "Parameters[*].{Value:Value}" | grep Value | cut -d'"' -f4)

rds_core_username=$(aws ssm --region "$region" \
get-parameters --names "/$env/core/username" --query "Parameters[*].{Value:Value}" | grep Value | cut -d'"' -f4)

# --with-decryption is needed to read the SecureString value in plain text.
rds_core_password=$(aws ssm --region "$region" \
get-parameters --names "/$env/core/password" --with-decryption --query "Parameters[*].{Value:Value}" | grep Value | cut -d'"' -f4)

#####
##### Replace tokens with assigned variables
#####

# "|" is used as the sed delimiter because datasource URLs usually contain "/".
sed -i "s|rds_url|$rds_url|g" /home/centos/automation/standalone.xml

# Use the second datasource URL when one is defined; otherwise fall back to the first.
if [ -n "$rds2_url" ]
then
 sed -i "s|rds2_url|$rds2_url|g" /home/centos/automation/standalone.xml
else
 sed -i "s|rds2_url|$rds_url|g" /home/centos/automation/standalone.xml
fi

sed -i "s|core_username|$rds_core_username|g" /home/centos/automation/standalone.xml
sed -i "s|core_password|$rds_core_password|g" /home/centos/automation/standalone.xml
```
Next, we took a look at our startup scripts. We pass in several Java arguments, including the JVM name, heap sizes, and various other runtime values that can differ depending on the role of the server or its requirements. We updated the scripts with tokens and created another script to fill in those values and then start the application server.
Once we had a viable image that was updating the required parameters and starting JBoss successfully, we turned to the application requirements. These included registering with the ATG BCC and Endeca.
Oracle ATG had developed and promoted, as part of the Oracle IaaS toolset, a BCC REST service that, when compiled into the EAR, allows external calls to register with or deregister from the topology. We took that and developed a script that is called once JBoss is determined to be online and registers the server itself. We also developed scripts to add an app server to, or remove it from, the Endeca configuration files.
Once everything is up and running, the client-facing application servers, which are called from the web tier, register with the Application Load Balancer.
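One way to handle a larger set of startup values is to load every parameter under a role-specific path and substitute tokens by name, as in the sketch below. The `@@NAME@@` token style and the `/<env>/<role>/jvm` path are assumptions for illustration, and `sed -i` assumes GNU sed as found on CentOS.

```bash
# Sketch: fill @@TOKEN@@ placeholders in a startup script from SSM.
# The token style and the parameter path are assumptions.

# Replace @@NAME@@ with a value in a file. "|" is the sed delimiter so values
# containing "/" are safe; "|", "&" and "\" inside the value are escaped.
replace_token() {
    tok=$1
    val=$(printf '%s' "$2" | sed -e 's/[|&\\]/\\&/g')
    target=$3
    sed -i "s|@@${tok}@@|${val}|g" "$target"
}

# Load every parameter under a path and substitute the matching token.
# Parameter /dev/page/jvm/HEAP_MAX fills the token @@HEAP_MAX@@.
apply_parameters() {
    path=$1
    file=$2
    aws ssm get-parameters-by-path --path "$path" --with-decryption \
        --query 'Parameters[*].[Name,Value]' --output text |
    while IFS="$(printf '\t')" read -r name value; do
        [ -n "$name" ] || continue
        # Use the last path segment of the parameter name as the token name.
        replace_token "${name##*/}" "$value" "$file"
    done
}
```

With this shape, adding a new JVM argument means adding a token to the startup script and a parameter to SSM, without touching the substitution script.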
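The "wait until JBoss is online, then register" step can be sketched as a polling loop followed by a single call. Both URLs and the JSON payload here are hypothetical placeholders; the real BCC REST service defines its own endpoint and request format.

```bash
# Sketch: wait for JBoss to answer over HTTP, then register with BCC.
# The URLs and the JSON payload are hypothetical placeholders.

# Poll a URL until it returns success or the attempts run out.
wait_for_http() {
    url=$1
    attempts=${2:-60}
    delay=${3:-10}
    i=1
    while [ "$i" -le "$attempts" ]; do
        if curl -sf -o /dev/null "$url"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Register this server once its health URL responds.
register_when_online() {
    health_url=$1
    register_url=$2
    server_name=$3
    wait_for_http "$health_url" || { echo "JBoss did not come online" >&2; return 1; }
    curl -sf -X POST -H 'Content-Type: application/json' \
        -d "{\"server\": \"$server_name\"}" "$register_url"
}
```

The default of 60 attempts at 10-second intervals is only a placeholder; ATG start-up times vary, so the limits would need tuning per environment.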
All Things Terraform
Pivotree uses Terraform for infrastructure as code (IaC), and most of our cloud solutions have been built with it. Our Terraform modules are either maintained internally or, in some cases, are internally approved community modules. We have several reference architectures, depending on the platform and solution required.
Using Terraform to stand up the environment was a matter of taking our reference architecture and updating it to use auto-scaling groups (ASGs), along with a few other variables in our main configuration file. This covers anything we want to customize within the environment: EFS, RDS, EC2 instances, ASGs with their EC2 details, and so on. We also developed a module to push any required values to SSM, such as EFS endpoints, the RDS URL, and some server instances, so that they can be loaded at runtime.
One of the challenges we recognized early on was how to approach patching and scaling events, given the way AWS manages instances provisioned by autoscaling groups. Terraform has no problem updating the launch configuration, but it does not replace the instances that are already running with the old one. The solution we found was an open-source product called Bouncer, which addresses this issue. We can call Bouncer from within Terraform, and it manages the rollout across the ASG in either serial or canary mode. Our single instances in self-healing groups are all replaced in serial mode, while our client-facing application servers are replaced in canary mode to ensure they are always available.
The user_data was updated to reference the download.sh file we had built into our Packer images. Once we had everything prepared, it was a matter of executing our Terraform plan and approving it in Terraform Cloud.
As part of standing it all up, the ASG is configured to scale with parameters that trigger a scale-out at the appropriate threshold to give the new instance time to come online before the other servers are overwhelmed. ATG does require a little more time to come online and this has to be accounted for in the scaling thresholds.
Code Releases
With immutable infrastructure, you no longer deploy an application to an instance and restart it. Instead, you deploy your code to an EFS volume and update SSM with the path. Then, using the AWS CLI, you make the required changes on the ASG to launch new instances. Each new instance queries SSM as part of its launch and symlinks the appropriate package on the EFS volume.
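Both halves of that flow can be sketched in a few lines. The parameter name `/<env>/release/path`, the directory layout, and the use of an ASG instance refresh to roll the group are assumptions for the example; other ASG operations could be used to launch the new instances.

```bash
# Sketch of both halves of a release. The parameter name, the paths, and the
# instance refresh are assumptions, not the exact commands used in production.

# Release side: record the new package path, then ask the ASG to replace
# its instances so they pick the path up at launch.
publish_release() {
    env_name=$1
    package_path=$2
    asg_name=$3
    aws ssm put-parameter --name "/$env_name/release/path" \
        --value "$package_path" --type String --overwrite &&
    aws autoscaling start-instance-refresh --auto-scaling-group-name "$asg_name"
}

# Instance side (run at launch): point a stable symlink at the package
# that SSM currently names.
link_release() {
    env_name=$1
    link=$2
    package_path=$(aws ssm get-parameter --name "/$env_name/release/path" \
        --query 'Parameter.Value' --output text) || return 1
    [ -e "$package_path" ] || { echo "missing $package_path" >&2; return 1; }
    # -n replaces an existing symlink instead of descending into its target.
    ln -sfn "$package_path" "$link"
}
```

Because the application always starts from the same symlink, rolling back is a matter of pointing the SSM parameter at the previous package and cycling the group again.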
Additional Benefits
System integrators and operators no longer require access to the instances, because all logging has been centralized in CloudWatch. Direct SSH is no longer supported: we have enabled EC2 Instance Connect and plan on migrating to AWS SSM Session Manager in the coming months. This helps address PCI requirements and simplifies the support experience.
With this solution in place, we developed a patching pipeline in Jenkins. First, a job builds the Packer image and places the new AMI ID into SSM. Another job checks out our Terraform source code, updates it with the AMI ID of the new image, and then comments and commits the code back into Git. This triggers a build in Terraform, which we have decided is our gate for the time being: an operator has to approve the update. Once the operator approves the change, Terraform and Bouncer complete it.
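The first pipeline job could be sketched roughly as below: it reads the AMI ID from Packer's machine-readable output and stores it in SSM. The template file name and parameter name are hypothetical, and the parsing assumes a single-region build that produces one AMI.

```bash
# Sketch: pull the AMI ID out of `packer build -machine-readable` output and
# store it in SSM. Assumes a single-region build; names are hypothetical.

# Machine-readable artifact lines look like:
#   1589000000,amazon-ebs,artifact,0,id,us-east-1:ami-0123456789abcdef0
# Print the AMI ID from the last such line (empty if there is none).
extract_ami_id() {
    awk -F, '$3 == "artifact" && $5 == "id" { id = $6 }
             END { sub(/^.*:/, "", id); print id }'
}

# Build the image and publish the resulting AMI ID.
build_and_publish() {
    template=$1
    param=$2
    ami_id=$(packer build -machine-readable "$template" | extract_ami_id)
    case $ami_id in
        ami-*) ;;
        *) echo "no AMI ID found in packer output" >&2; return 1 ;;
    esac
    aws ssm put-parameter --name "$param" --value "$ami_id" --type String --overwrite
}
```

Refusing to write anything that does not look like `ami-...` keeps a failed build from overwriting the last good AMI ID that later jobs read.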
What’s Next?
By developing this solution, we have given the client the ability to provision new environments rapidly, scale out production on demand without keeping additional servers idle, and follow an improved patching process that reduces risk and ensures a consistent image.
Over the next few months, we hope to extend this so that we can turn a client onboarding around in a matter of days rather than weeks or months. These are some of the areas we are looking at:
- Using the Amazon RDS for Oracle S3 integration, we are building a solution in which a Data Pump export of the database is uploaded to a known location and then imported into the RDS instance.
- Working with the system integrator and client to further integrate the EAR configuration into this model.