Autoscaling Oracle Commerce (ATG) on AWS

The Challenge

Pivotree’s client, a technology consulting and services company with 10,000 associates in 33 global locations, wanted to autoscale the Oracle Commerce (ATG) application servers it ran for a large fashion retailer so that the site could respond to increased web traffic more readily. Because ATG was designed for on-premises use, autoscaling is not available out of the box. Our team had to design a custom solution, as the client’s previous servers were static: each was provisioned and built to fixed requirements. Server configurations were “baked in” to the EAR and selected using JVM arguments. Any platform integration was manual, and a build-out could take several days to weeks to complete.

The client had previously migrated the ATG environment to AWS cloud through a basic lift and shift; we deployed the environment using Terraform and then built and managed it manually. We also migrated the common patterns, tooling, and processes from the data center to AWS, based on the client’s requirements at the time.

The ATG application consists of several layers:

  1. The client-facing application servers – commonly referred to as page or com servers.
  2. The CMS (Content Management System) called BCC (Business Control Center), typically configured manually to register the various application servers and publish/update content as required.
  3. The Endeca search engine, which also requires manual updates to various configuration files to add/remove application servers.

We considered several elements when developing this solution:

  • How we were going to build the images,
  • How we were going to store the runtime parameters,
  • Updating the various configuration files,
  • Starting the application, and
  • Registering the application servers with BCC and Endeca servers.

Additional considerations included centralized logging, user access and troubleshooting. PCI requirements were also a factor as we needed to be sure we were developing a solution we could certify for auditing purposes.

The Solution

Why AWS

AWS was chosen by the client for its versatility in hosting enterprise-grade platforms and for the number of tools and services it provides to help brands run complex technologies in the cloud. As a long-term provider of Oracle ATG services, Pivotree has developed industry-leading knowledge of running Oracle ATG on AWS, which made Pivotree a natural fit when the client came to us with their own customer’s requirements.

Using Immutable Infrastructure

Pivotree built immutable infrastructure to support the utilization of immutable images – this meant we could use the same image across environments as well as potentially re-use this solution for future clients.

Quick Summary

The following technology was part of the solution:

  • Parameter Store, a capability of AWS Systems Manager (SSM), to hold all the required runtime values.
  • HashiCorp Packer, which allowed us to build two machine images: one for the application stack, based on Java and JBoss 7.2, and one for Endeca, based on Endeca 11.3.2.
  • AWS resource tags to determine which environment we were working in during the process, as well as the role of the server.
  • A single download.sh script in the user folder, called via cloud-init, which queries the tags and downloads the appropriate automation package from an S3 bucket. This gave us the flexibility to update scripts/templates without requiring a new image.
  • A script to mount Amazon Elastic File System (Amazon EFS) volumes, which both images require. The script queries SSM for the EFS endpoints, updates our template, appends the result to fstab, and remounts everything.
  • A BCC REST service to register with the ATG BCC and Endeca. When compiled into the EAR, the service allows external calls to register/deregister a server from the topology. We built on it with a script that runs once JBoss is determined to be online and registers the server. We also developed scripts to add/remove an app server from the Endeca configuration files.
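
The launch-time steps in the list above can be sketched roughly as follows. This is a minimal illustration, not the scripts Pivotree shipped: the tag names (Environment, Role), the S3 key layout, the example mount options, and all function names are assumptions, and the tag and Parameter Store lookups are replaced by plain dictionaries so the example runs without AWS access.

```python
"""Sketch of launch-time bootstrap logic; every name here is hypothetical."""
from __future__ import annotations


def package_key(tags: dict[str, str]) -> str:
    """Map instance tags to the S3 key of an automation package.

    Assumes (hypothetically) that packages are laid out as
    automation/<environment>/<role>.tar.gz in the bucket.
    """
    env = tags["Environment"].lower()
    role = tags["Role"].lower()
    return f"automation/{env}/{role}.tar.gz"


# Assumed NFS options for illustration; consult the EFS documentation
# for the values recommended for a real deployment.
MOUNT_OPTIONS = "nfsvers=4.1,hard,timeo=600,retrans=2,noresvport,_netdev"


def fstab_lines(efs_endpoints: dict[str, str]) -> list[str]:
    """Render one fstab entry per mount point.

    efs_endpoints maps a local mount point to an EFS DNS name and stands in
    for values that would be read from Parameter Store.
    """
    return [
        f"{endpoint}:/ {mount_point} nfs4 {MOUNT_OPTIONS} 0 0"
        for mount_point, endpoint in sorted(efs_endpoints.items())
    ]


def append_missing(existing: str, new_lines: list[str]) -> str:
    """Append only the entries not already present, so re-runs are safe."""
    present = set(existing.splitlines())
    additions = [line for line in new_lines if line not in present]
    if not additions:
        return existing
    # Make sure the existing content ends with a newline before appending.
    base = existing if existing.endswith("\n") or not existing else existing + "\n"
    return base + "\n".join(additions) + "\n"
```

Keeping the tag-to-package mapping and the fstab rendering as pure functions makes them easy to test outside AWS; the real script would wrap them with the actual tag query, Parameter Store read, S3 download, and remount.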

Using Terraform for AWS Infrastructure as Code (IaC)

Most of Pivotree’s AWS cloud solutions have been built using Terraform. Terraform modules are either maintained internally or are in some cases internally approved community modules. We have several reference architectures depending on the platform and solution required.

  • Alongside Terraform we used an open-source product called Bouncer, which addresses patching/scaling events. We can call Bouncer from within Terraform, and it manages the scaling out of the Auto Scaling group (ASG) in either serial or canary mode. Our single instances in self-healing groups all use serial mode, while our client-facing application servers use canary mode so that the site is always available.
  • The user_data was updated to reference the download.sh file we had built into our Packer images. Once everything was prepared, it was a matter of executing our Terraform plan and approving it in Terraform Cloud.
  • The ASG is configured to scale with parameters that trigger a scale-out at the appropriate threshold to give the new instance time to come online before the other servers are overwhelmed. ATG does require a little more time to come online and this has to be accounted for in the scaling thresholds.
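
The relationship between warm-up time and the scale-out threshold can be illustrated with a little arithmetic. The sketch below assumes utilization grows roughly linearly during a traffic spike; the function name, the linear-growth model, and the example numbers are illustrative assumptions, not values from this engagement.

```python
def scale_out_threshold(max_safe_util: float,
                        growth_per_min: float,
                        warmup_min: float) -> float:
    """Highest utilization (percent) at which scale-out can safely start.

    A new instance needs warmup_min minutes before it serves traffic. If
    utilization climbs by growth_per_min points per minute, the existing
    fleet gains growth_per_min * warmup_min points while waiting, so the
    trigger has to sit that far below the level the fleet can sustain.
    """
    if warmup_min < 0 or growth_per_min < 0:
        raise ValueError("warmup_min and growth_per_min must be non-negative")
    threshold = max_safe_util - growth_per_min * warmup_min
    if threshold <= 0:
        raise ValueError("warm-up is too long for this growth rate")
    return threshold
```

For example, with a sustainable ceiling of 80%, growth of 2 points per minute, and a 10-minute warm-up, scale_out_threshold(80.0, 2.0, 10.0) gives 60.0: the longer the application takes to start, the earlier the trigger has to fire.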

The Benefit

With immutable infrastructure, the client no longer needs to deploy applications to an instance and restart them. Instead, they deploy their code to an EFS volume, update SSM with the path, and use the AWS Command Line Interface (CLI) to apply the required changes to the ASG. The ASG then launches new instances, each of which queries SSM as part of its launch and symlinks the appropriate package on the EFS volume.
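
The symlink step at instance launch might look something like the sketch below. It is an assumption-laden illustration rather than the client's actual script: the function name and paths are hypothetical, and the release path that would be read from SSM is passed in directly so the example runs without AWS access.

```python
import os
from pathlib import Path


def activate_release(release_path: str, current_link: str) -> None:
    """Point current_link at release_path, replacing any previous link.

    release_path stands in for the value an instance would read from
    Parameter Store during launch.
    """
    target = Path(release_path)
    if not target.is_dir():
        raise FileNotFoundError(f"release not found: {target}")

    link = Path(current_link)
    tmp = link.with_name(link.name + ".tmp")

    # Clear any leftover temporary link from an interrupted run.
    if tmp.is_symlink() or tmp.exists():
        tmp.unlink()

    # Create the new link under a temporary name, then rename it over the
    # old one so the link is never missing partway through the switch.
    os.symlink(target, tmp)
    os.replace(tmp, link)
```

Building the link under a temporary name and renaming it into place is a common pattern on POSIX systems: a process reading the link sees either the old release or the new one.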

For Streamlined System Integration

The client and operators no longer require access to the instances, as all logging has been centralized in Amazon CloudWatch. Direct SSH is no longer supported: we have enabled EC2 Instance Connect and plan to migrate to AWS Systems Manager Session Manager in the coming months.

This helps address PCI requirements and simplifies the support experience.

By developing this solution, we have given the client the ability to rapidly provision new environments and to scale out production on demand without keeping additional servers sitting idle, along with an improved patching process that reduces risk and ensures a consistent image.

It will also allow us to onboard future clients who need a similar solution in a matter of days rather than weeks or months.