TL; DR;

Analysing performance issues in triggers and preparing a strategy for effective improvements.

The Problem

“We can’t save more than x records at the time.”

“Every now and again we get this locking error.”

”It’s so slow today I can’t get anything done.”

I’ve been here many times before. I invite you to follow me on a journey through taming the beast that is a “complex” trigger cycle.

As I am writing this, I’m not actually finished and I don’t know yet just how much things will improve. I’ve tested enough to have an idea, but there may still be many surprises. What you will read about in the next couple of posts is my process rather than definitive answers. Let’s start with some planning and analysis.

Choosing My Battles

How do I know where the main performance bottlenecks are? How to know where I can make the most improvement? Which part is just going to have to stay slow (for now)?

There are a few angles from which we can look at transaction performance. And this will change where we need to put most attention and where we want to start when trying to improve things. It’ll also affect what options we have to deal with the problems.

Limit Exceptions

It’s very clear when there are specific errors.

If we are getting hit by Too Many SOQL Queries Limit exceptions all the time we are often dealing with repeating queries or unmanaged trigger recursion. So we might be looking at some record caching possibilities and preventing trigger recursion to improve such situations.

Non-selective Query problems are a different thing though and we need to be looking at specific SOQL statements, amounts of data and indexes.

With Apex CPU or Heap Size issues it’s more to do with code architecture and inefficient processing in Apex. Endless routines, nested loops, bloated wrappers and again trigger recursion are the usual culprits. We don’t necessarily need to deal with SOQL statements and how long they take to execute.

Unable to Lock Row problems can be tougher. Is it that each transaction takes far too long? Or is it that there are far too many transactions related to the same records? Can the data model be improved (think relationship skew and so on).

Just within Limits

When we don’t really get errors though (it’s just slow) that’s when we have to think a little more about where to start. First of all, who is complaining? Which users and what are they doing? It could be that the integrations are being slowed down and this is causing issues in other systems (so we’re focusing on custom api endpoints and triggers, other automations too) or it could be internal users (add the UI and potentially slow front end components and bloated Flexi Pages to the mix). Maybe there is a Managed Package involved too (alternatives?).

Model Scenario

I’m dealing with Apex CPU Limit exceptions as well as Unable to Lock Row at the same time. Well not strictly at the same time, but both are an issue. The former for internal users and the later for the middle-ware calling in (who’s already dealing with CPU Limit by breaking requests down to batches of smaller size).

In both cases I’m looking at creating records in the central most utilised and customised objects – something like Opportunity and Line Items. They come from a managed package which itself contains quite a bit of automation. Then there is a single custom trigger for each, but they do a lot and they don’t really have recursion protection. Finally there are dozens of Workflows, Validation Rules, even Process Builder and Flow in there. Some async processes are getting spawned as well.

There are some obvious options presenting themselves here, like unifying the different automation technologies to reduce overhead (or at least getting rid of Process Builder) or introducing more sophisticated trigger orchestration. But we also have to consider the complexity of each of the changes, including the risk it carries and amount of re-testing required.

What I want to do here is focus more on my approach in analysing where most of the time is spent and identifying the areas that could provide most improvement with least amount of work or risk.

Analysing Logs

So I’ve set the Debug Trace to FINEST, prepared some test records that I know fly quite close to the sun and hit save. The downloaded log file is just over 18MB so I think it’s not truncated (close). Overall transaction time is 111s (ouch). Let’s open Log Analyzer (link) and get to work.

Screenshot of Log Analyzer timeline view of a particularly bad trigger cycle — Log Analyzer Timeline View

If you are surprised by the colours. Go to settings and have your life improved just like I recently did. Don’t know how I missed that for so long.

This is a fantastic tool to get a general idea of what is going on. You’ll easily see all the queries conveniently grouped including how many times each of them was executed (with a bit of clicking through you see how long they took too). You can also quite nicely observe how your triggers go round and round and round. Identifying obviously slow parts of Apex is also simple, seeing again how many times methods were called and how much time was spent on each.

What the Logs Don’t Tell You

It’s important to remember that the time the logs show is actual time on the clock and not the Apex CPU Time. That’s why the time can vary wildly between different executions of the same logic. It’s a good idea to run the same test a few times to confirm the representative value of the log file you will focus on. It’s also why this tool may be not quite enough when really focusing on the CPU Time exceptions.

To get accurate CPU Time we need to look in the raw file for profiling info. It is shown every now and then like below. There are ways to get it printed more often or at specific places in your code to get more accurate info. But about that another day. For now I’m content with knowing my transaction used 14,000 ms of CPU Time.

06:35:06.765 (91765319106)|LIMIT_USAGE_FOR_NS|(default)|
 Number of SOQL queries: 45 out of 100
 Number of query rows: 113 out of 50000
 Number of SOSL queries: 0 out of 20
 Number of DML statements: 10 out of 150
 Number of Publish Immediate DML: 0 out of 150
 Number of DML rows: 30 out of 10000
 Maximum CPU time: 12144 out of 10000 ******* CLOSE TO LIMIT
 Maximum heap size: 0 out of 6000000
 Number of callouts: 0 out of 100
 Number of Email Invocations: 0 out of 10
 Number of future calls: 1 out of 50
 Number of queueable jobs added to the queue: 1 out of 50
 Number of Mobile Apex push calls: 0 out of 10

That’s a lot (over the limit in fact, this happens often), but overall only a bit over 10% of total time. In my particular case the transaction consists of 2 main insert operations into the same object and a couple of other (small in comparison) updates to other objects. The main takeaway is that for each of the main inserts, an update follows as well, twice!

Least Risk and Most Benefit

There is a comparable amount of logic on insert and update, so I can roughly divide the CPU Time by 3 to get the cost of each trip. It’s obvious that I can make huge improvements by controlling re-entry better, rather than trying to optimise code in the “trigger actions”. If I can confirm that the trigger logic doesn’t need to run again I can cut out the 2nd update cycle. That’s maybe up to 4000 ms of CPU Time, but also it’s likely to cut down queries and some other automations having a much larger overall impact. That is going to be important for my locking problems.

We have a ton of Workflows in place. Some of them fire often. But refactoring all of that would be quite difficult and they don’t seem to take much time.

I’ve added debug statements into the triggers to list out all the changed fields in all records, as the transaction goes through trigger cycles. This helps me identify what caused the re-entry and identify which parts of the trigger are safe to skip the second or third time round.

public without sharing class CompareItems {
   public static void showDiff(ADvendio__Campaign_Item__c newRecord, ADvendio__Campaign_Item__c oldRecord) {
       Map<String, Object> newValues = newRecord.getPopulatedFieldsAsMap();
       Map<String, Object> oldValues = oldRecord.getPopulatedFieldsAsMap();
       for (String fieldName : newValues.keySet()) {
           Object newValue = newValues.get(fieldName);
           Object oldValue = oldValues.get(fieldName);
           if (newValue != oldValue) {
               System.debug('DIFF ' + fieldName + ': ' + newValue + ' > ' + oldValue);
           }
       }
   }
}

Alternative tool to Log Analyzer is the Analysis view in the Developer Console. A nice thing it does is showing me the total time spent in the Database. I know many queries get repeated from previous analysis. So the next thing to focus on will be improving our pre-existing Query Result caching mechanisms. It’s not going to help CPU time, but it will help with record locking by reducing the overall transaction time (and the user experience too of course).

Screenshot of the same log as before looking at the Developer Console — Developer Console Analysis View

What’s the Plan Then

So far I’ve came up with this actions:

Get rid of Process Builder Flow. It’s not massive but it will make a difference (and it’s Process Builder – yuck!)
Improve and extend Query Caching
Introduce Trigger Recursion control – the big one

PS: Don’t forget Async

There is another route that needs to be explored as well. If you have been to my blog before you might know I am a fan of Async Apex (link). This time the actual Apex Code complexity is not the biggest problem we have, but it could be significantly improved if lots of it could be pushed to asynchronous processes. It doesn’t need to be a massive refactoring effort either. We could just take the whole trigger handler, split it in half (or thereabouts) and have the second one execute in the Change Event trigger (CDC). A fairly new approach I’d love to try out.

There is a big problem with that in my situation though. Especially if this async trigger (or a group of Async Jobs) updates records again issuing database locks. And I’ve also found that the Managed Package calls a @Future method which again updates the record and takes quite long to complete. So more Async jobs executed at round about the same time just increase the risk of hitting the locking exception. A quick PoC had shown this worry was justified so until the general update cycle gets a lot faster, fewer jobs is better. The sum of many smaller jobs will always be more than one or two big ones.

What’s Next

I’m working on various things that I want to share in this “series”. We’ll look at Trigger Frameworks and fflib Domains as well as Query Caching and fflib Selectors. We are also trying out some long-term performance monitoring and I want to give the async path another bash. A lot to look forward to.

I’ve not really done all of it yet, so .. safe harbour 🙂

Subscribe to new posts