Building an Oracle Commerce (ATG) Autoscaling Solution

#####
##### Query SSM and assign to variables
#####

# Derive the region once from the instance metadata (strip the AZ letter)
region=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone | sed -e 's/.$//')

get_param() {
  aws ssm get-parameters --region "$region" --names "$1" "${@:2}" \
    --query "Parameters[0].Value" --output text
}

rds_url=$(get_param "/$env/rds/rds_url")

rds2_url=$(get_param "/$env/rds/rds2_url")

rds_db_name=$(get_param rds_db_name)

rds_core_username=$(get_param "/$env/core/username")

rds_core_password=$(get_param "/$env/core/password" --with-decryption)

#####
##### Replace tokens with assigned variables
#####

# Use '|' as the sed delimiter since JDBC URLs may contain '/'
sed -i "s|rds_url|$rds_url|g" /home/centos/automation/standalone.xml

# Fall back to the primary RDS URL when no second endpoint is configured
if [ -z "$rds2_url" ]
then
 sed -i "s|rds2_url|$rds_url|g" /home/centos/automation/standalone.xml
else
 sed -i "s|rds2_url|$rds2_url|g" /home/centos/automation/standalone.xml
fi

sed -i "s|core_username|$rds_core_username|g" /home/centos/automation/standalone.xml
sed -i "s|core_password|$rds_core_password|g" /home/centos/automation/standalone.xml

Next, we took a look at our startup scripts. We pass in several Java arguments, from the JVM name and heap sizes to other runtime values that can vary depending on the role of the server or its requirements. We updated the scripts with tokens and created another script to replace those tokens with the resolved values and then start the application server.
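As a minimal sketch of that token replacement step: the token names (`jvm_name`, `heap_min`, `heap_max`), file paths, and the `standalone.sh` invocation below are illustrative, not our exact values.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the startup wrapper: replace tokens in the
# JBoss startup config with values resolved at boot, then start JBoss.
set -euo pipefail

# Replace placeholder tokens (jvm_name, heap_min, heap_max) in a config file.
render_startup_config() {
  local config=$1 jvm_name=$2 heap_min=$3 heap_max=$4
  sed -i \
    -e "s/jvm_name/$jvm_name/g" \
    -e "s/heap_min/$heap_min/g" \
    -e "s/heap_max/$heap_max/g" \
    "$config"
}

# Example usage (paths and values are illustrative):
# render_startup_config /opt/jboss/bin/standalone.conf prod-app1 2g 4g
# /opt/jboss/bin/standalone.sh -c standalone.xml &
```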

StartUp Process

Once we had a viable image that was updating the required parameters and starting JBoss successfully, we had to start solving the application requirements. This included registering with the ATG BCC and Endeca.

Oracle ATG had developed and promoted, as part of the Oracle IaaS toolset, a BCC REST service that, when compiled into the EAR, allows external calls to register with and deregister from the topology. We took that and developed a script that is called once JBoss is determined to be online and registers the instance. We also developed scripts to add or remove an app server from the Endeca configuration files.
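The "once JBoss is determined to be online" part boils down to a retry loop. A sketch of that pattern follows; the health-check and registration URLs are placeholders, not the actual BCC REST service paths.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: keep retrying a check command until the app server
# answers, then call the registration endpoint. URLs are placeholders.
set -euo pipefail

# Retry a check command until it succeeds, up to max_attempts tries,
# sleeping delay seconds between attempts. Returns 1 if it never succeeds.
wait_for() {
  local max_attempts=$1 delay=$2
  shift 2
  local i
  for ((i = 1; i <= max_attempts; i++)); do
    if "$@"; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}

# Example usage (endpoint paths are illustrative):
# wait_for 60 10 curl -sf http://localhost:8080/dyn/admin
# curl -sf -X POST "http://bcc-host:8080/rest/topology/register" \
#   -d "host=$(hostname -f)"
```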

Once everything is up and running, the instance registers with the application load balancer for the client-facing apps, which are called from the web tier.

All Things Terraform

Pivotree uses Terraform for IaC. Most of our cloud solutions have been built using Terraform. Terraform modules are either maintained internally or are, in some cases, internally approved community modules. We have several reference architectures depending on the platform and solution required.

Using Terraform to stand up the environment was a matter of taking our reference architecture and updating it to leverage auto-scaling groups, along with a few other variables in our main configuration file. This includes anything we want to customize within the environment: EFS, RDS, EC2 instances, ASGs with EC2 info, etc. We also developed a module to push any required values to SSM, such as EFS endpoints, the RDS URL, and some server instance values to load at runtime.

One of the challenges we identified early on was how to approach patching and scaling events, given the way AWS manages instances provisioned by auto-scaling groups. Terraform has no problem updating the launch configurations but doesn't do anything with the updated config. The solution we found was an open-source tool called Bouncer, which addresses this issue: we call Bouncer from within Terraform and it manages cycling the instances in the ASG in either serial or canary mode. Our single instances in self-healing groups are all rolled serially, while our client-facing application servers use canary mode to ensure they are always available.

  • The user_data was updated to reference the download.sh file we had built into our Packer images. Once we had everything prepared, it was a matter of executing our Terraform plan and approving it in Terraform Cloud.

As part of standing it all up, the ASG is configured with scaling parameters that trigger a scale-out at a threshold low enough to give the new instance time to come online before the other servers are overwhelmed. ATG requires a little more time to come online than a typical application, and this has to be accounted for in the scaling thresholds.
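One way to express that with the AWS CLI is a target-tracking policy with a generous instance warm-up; the ASG name, the 50% CPU target, and the 900-second warm-up below are illustrative, not our production values.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: attach a target-tracking scale-out policy whose
# warm-up is long enough for ATG to start. AWS can be overridden for dry runs.
set -euo pipefail

put_scaleout_policy() {
  local asg=$1 target_cpu=$2 warmup_seconds=$3
  ${AWS:-aws} autoscaling put-scaling-policy \
    --auto-scaling-group-name "$asg" \
    --policy-name cpu-target-tracking \
    --policy-type TargetTrackingScaling \
    --estimated-instance-warmup "$warmup_seconds" \
    --target-tracking-configuration "{
      \"PredefinedMetricSpecification\": {\"PredefinedMetricType\": \"ASGAverageCPUUtilization\"},
      \"TargetValue\": $target_cpu
    }"
}

# Example usage (values are illustrative):
# put_scaleout_policy prod-app-asg 50 900
```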

Code Releases

With the infrastructure being immutable, you no longer deploy applications to an instance and restart the application. Instead, you deploy your code to an EFS volume, update SSM with the path, and use the AWS CLI to execute the required changes on the ASG to launch new instances. Each new instance queries SSM as part of its launch and symlinks the appropriate package on the EFS volume.
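A sketch of the two halves of that flow follows; the SSM parameter name, the EFS paths, and the use of `start-instance-refresh` to cycle the ASG are assumptions for illustration, not our exact release tooling.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a release: publish the new code path to SSM and
# cycle the ASG; at boot, each instance symlinks that path into place.
set -euo pipefail

# Release side: record the release path, then replace the running instances.
publish_release() {
  local env=$1 release_path=$2 asg=$3
  ${AWS:-aws} ssm put-parameter --name "/$env/app/release_path" \
    --value "$release_path" --type String --overwrite
  ${AWS:-aws} autoscaling start-instance-refresh \
    --auto-scaling-group-name "$asg"
}

# Instance side (run at launch): point the app at the package on EFS.
link_release() {
  local release_path=$1 app_home=$2
  ln -sfn "$release_path" "$app_home/current"
}

# Example usage (paths are illustrative):
# publish_release prod /mnt/efs/releases/2024-06-01 prod-app-asg
```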

Code Release Process

Additional Benefits

System integrators and operators no longer require access to the instances, as all logging has been centralized in CloudWatch. Direct SSH is no longer supported; we have enabled EC2 Instance Connect and plan to migrate to AWS SSM Session Manager in the coming months. This helps address PCI requirements and simplifies the support experience.

With this solution in place, we developed a patching pipeline in Jenkins. First, a job builds the Packer image and places the new AMI ID into SSM. Another job checks out our Terraform source code, updates it with the AMI ID of the new image, and comments and commits the code back into Git. This triggers a run in Terraform Cloud, which we have decided, for the time being, is our gate: an operator has to approve the update. Once the operator approves the change, Terraform and Bouncer complete it.
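The first job's hand-off can be sketched as follows; the Packer manifest post-processor output format and the SSM parameter name are assumptions about the wiring, not our exact pipeline code.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: after `packer build` (with a manifest post-processor),
# pull the new AMI ID out of manifest.json and publish it to SSM.
set -euo pipefail

# Extract the last AMI ID from a Packer manifest.json artifact_id field.
ami_from_manifest() {
  grep -o 'ami-[0-9a-f]\{8,17\}' "$1" | tail -n 1
}

publish_ami() {
  local env=$1 ami_id=$2
  ${AWS:-aws} ssm put-parameter --name "/$env/ami/app_server" \
    --value "$ami_id" --type String --overwrite
}

# Example usage (file and parameter names are illustrative):
# packer build app-server.json
# publish_ami prod "$(ami_from_manifest manifest.json)"
```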

AMI Image Process

What’s Next?

In developing this solution, we have given the client the ability to rapidly provision new environments, scale out production on demand without having additional servers lying around in wait, and an improved patching process that reduces risk and ensures a consistent image.

Over the next few months, we hope to extend this so we can turn client onboarding around in a matter of days instead of weeks or months. These are some of the areas we are looking at:

  • Using Oracle RDS S3 integration, we are building a solution to upload a Data Pump export of the database to a known location and then import it into the RDS instance.
  • Working with the system integrator and client to further integrate the EAR configuration and improve the model.
Daniel Mulrooney is a Cloud Solutions Architect currently working for Pivotree, designing and creating cloud-based system infrastructure. He is also an ultra runner, having successfully completed several ultra-distance races.