Wednesday, November 12, 2008

Kehoe - Your Performance Is Our Business

By John Kehoe - 11 December 2008

I know of an outfit with the motto ‘Your Performance Is Our Business.’ It doesn’t matter what they sell (OK, it’s not an E.D. product, let’s get that snicker out of the way); it’s the mindset the motto instills: ‘You’ve got IT problems with a real business impact, and we have a solution focused on your business needs.’ It got me thinking: what mindset is at work in our IT shops? What would an IT shop look like with a performance mindset?

The meaning of the mindset varies by how we define performance to our customers. Some shops have a simple requirement. I can’t count the times I’ve found that the critical application is email. It is a straightforward domain of expertise, albeit with significant configuration, security and archiving requirements. The key metrics for performance are availability, responsiveness and recoverability. Not an insurmountable challenge except in the most complex of deployments.

There isn’t much performance to provide for messaging outside of availability and responsiveness. There isn’t alpha to be found. It’s a cost center, an enablement tool for information workers.
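To make that concrete, availability and responsiveness can be watched with little more than a periodic probe. A minimal sketch in Python; the mail host and SLA threshold are placeholders, not a recommendation:

    import smtplib
    import time

    # Hypothetical mail host and SLA threshold; substitute your own.
    MAIL_HOST = "mail.example.com"
    RESPONSE_SLA_SECS = 2.0

    def probe_mail_server(host):
        """Return (available, seconds for an SMTP connect plus NOOP round trip)."""
        start = time.monotonic()
        try:
            with smtplib.SMTP(host, 25, timeout=10) as smtp:
                smtp.noop()  # cheap round trip to gauge responsiveness
            return True, time.monotonic() - start
        except (OSError, smtplib.SMTPException):
            return False, time.monotonic() - start

    available, elapsed = probe_mail_server(MAIL_HOST)
    print("available=%s response=%.2fs within_sla=%s"
          % (available, elapsed, available and elapsed <= RESPONSE_SLA_SECS))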

Performance comes into play with alpha projects. These are the mission-critical, money-making monstrosities (OK, they aren’t all monstrosities, but the ones I’m invited to see are monsters) being rolled out. This is very often the realm of enterprise message buses, ERP and CRM suites, n-tier applications and n-application applications (I like the term n-application applications, I made that one up. My mother is so proud). All are interconnected, cross-connected or outside of your direct control. Today, the standard performance approach is enterprise system management (ESM): the large central alerting consoles with agents too numerous to count. ESM tools approach the problem from the building blocks and work their way up.

This is not the way to manage performance.

The problem with the low-level hardware and software monitoring approach is that performance data cannot be mapped up to the transaction level (yes, I said it; consider it flame bait). Consider the conventional capacity planning centers we have today. The typical approach is to gather system and application statistics and apply a growth factor. The more sophisticated approach uses additional algorithms, perhaps extrapolating future performance from sampled system load. Either way, the capacity planning model lacks the perspective of the end user experience.
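For the sake of argument, here is what that conventional model boils down to. A toy sketch with invented utilization numbers and an assumed flat growth factor; notice that nothing in it knows whether the end user’s transaction is fast or slow:

    # Toy version of the conventional capacity-planning model: sample system
    # load, apply a growth factor, extrapolate. All numbers are invented.
    samples_cpu_pct = [42, 38, 45, 51, 40]   # sampled CPU utilization (%)
    annual_growth = 1.25                     # assumed 25% growth per year
    ceiling_pct = 80                         # planning threshold

    avg = sum(samples_cpu_pct) / len(samples_cpu_pct)
    for year in range(1, 4):
        projected = avg * annual_growth ** year
        verdict = "buy hardware" if projected > ceiling_pct else "ok"
        print("year %d: projected CPU %.0f%% -> %s" % (year, projected, verdict))

    # Note what is absent: no end user perspective, no transactions,
    # just hardware counters multiplied forward.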

The need for Transaction Performance Management (TPM)

TPM looks at the responsiveness of discrete business transactions over a complex application landscape. Think of this in the context of n-application applications: TPM follows the flow regardless of the steps involved, the application owner or the “cloud” (did I ever say how much I loathe the term cloud?). The TPM method follows the transaction, determines the wait states and then drills into the system and application layers. Think of it as the inverse of ESM.
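A stripped-down illustration of that inversion: time the transaction per hop, rank the wait states, and only then drill into the worst offender. The tier names and sleeps below are invented stand-ins, not any vendor’s instrumentation:

    import time
    from contextlib import contextmanager

    # Invented tiers; a real TPM tool instruments these hops for us.
    waits = {}

    @contextmanager
    def hop(name):
        """Attribute wall-clock wait to one hop of the transaction."""
        start = time.monotonic()
        try:
            yield
        finally:
            waits[name] = waits.get(name, 0.0) + time.monotonic() - start

    def place_order():
        with hop("web tier"):
            time.sleep(0.02)    # stand-in for page assembly
        with hop("message bus"):
            time.sleep(0.01)
        with hop("database"):
            time.sleep(0.35)    # the hidden wait state

    place_order()
    # Top down: rank the hops by wait, then drill into the worst offender.
    for name, secs in sorted(waits.items(), key=lambda kv: -kv[1]):
        print("%-12s %6.0f ms" % (name, secs * 1000))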

To exploit the TPM model you need a dedicated performance team. This is a break from the traditional S.W.A.T. team approach to application triage and the mishmash of tools and data used to ‘solve’ the problem. In a previous column, we discussed the implications of that approach, namely long resolution times and tangible business costs.

What would a TPM team look like?

  1. We need a common analysis tool and data source monitoring the application(s) architecture. This is essential to eliminate the ‘Everybody’s data is accurate and shows no problems, yet the application is slow’ syndrome.
  2. We need to look at the wait states of the transactions. It isn’t possible to build up transaction models from low-level metrics. The ‘bottom up’ tools that exist are complex to configure and rigid in their definition of transactions. They invariably rely on a complex amalgamation of multiple vendor tools in the ESM space. Cleanliness of data and transparency of transactions are hallmarks of a TPM strategy.
  3. We need historical perspective. Historical perspective provides our definition of ‘normal’. It drives a model based on transaction activity over a continuous time spectrum, not just a performance analysis at a single point in time (see the sketch after this list).
  4. We need ownership and strong management. TPM implementations don’t exist in a vacuum; they cannot be handed to a system administrator as an ‘other duties’ priority. TPM implementations require a lot of attention. They are not a simple profiler we turn on when applications go pear-shaped. Systems, storage and headcount are needed, subject matter experts retained and architects made available. Management must set the expectation that TPM is the endorsed methodology for performance measurement, performance troubleshooting and degradation prevention.
  5. TPM must quickly show value. Our team and implementation must be driven by tactical reality with a clear relationship to strategic value. We need to show quick performance wins and build momentum. For instance, focus on a handful of core applications, systematically deployed and demonstrably adding value, and build success from that basis. Failure is guaranteed by a big-bang global deployment or a hasty deployment on the crisis du jour.
  6. There must be a core user community. Our user community consists of people up and down the line, using the tool for everything from tactical problem solving to high-level business transaction monitoring. These people know the application and the costs to the business. Training is essential to organization-wide TPM buy-in.
  7. TPM must be embraced by the whole application lifecycle. It’s not enough to have operations involved. QA, load testing and development must be active participants. They can improve the quality of software product early on using the TPM methodology.
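On the historical perspective point (item 3), the sketch below shows the idea in miniature: ‘normal’ is defined by a continuous history of transaction response times, and a degradation is anything that falls well outside it. The response times are invented:

    import statistics

    # Invented continuous history of one transaction's response times (s).
    history = [0.8, 0.9, 0.85, 0.92, 0.88, 0.87, 0.9, 0.86, 0.91, 0.89]

    def is_degraded(latest, baseline, k=3.0):
        """Flag a response that sits k standard deviations above the
        historical mean -- 'normal' is whatever the history says it is."""
        mean = statistics.fmean(baseline)
        stdev = statistics.pstdev(baseline)
        return latest > mean + k * stdev

    print(is_degraded(0.93, history))  # within normal variation -> False
    print(is_degraded(2.40, history))  # clear degradation -> True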

An example of a TPM success is a web-based CRM application I worked on recently. The global user population reported poor performance on every page. The TPM team saw isolated performance issues on the backend systems, but they did not account for the overall application malaise. The supposition was that the last mile was too slow or that the application was too fat. We deployed the TPM module that analyzes the ‘last mile’. It turned out each page referenced a single-pixel GIF used for tracking. The GIF was missing, and every page load incurred massive connect timeouts waiting for it. Putting the image back in the file system made the problem disappear. It raises the question: how would we have found this problem without the TPM approach?
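For what it’s worth, the class of check that caught it can be approximated in a few lines: fetch every resource a page references and time each one, so a missing single-pixel GIF surfaces as a timeout instead of a mystery. The URLs below are placeholders:

    import time
    import urllib.request

    # Placeholder URLs; a real last-mile check would parse the resource
    # list out of the page itself.
    resources = [
        "http://www.example.com/index.html",
        "http://www.example.com/tracking/pixel.gif",  # the missing 1x1 GIF
    ]

    for url in resources:
        start = time.monotonic()
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                status = resp.status
        except OSError as exc:
            status = "FAILED (%s)" % exc
        print("%s: %s in %.2fs" % (url, status, time.monotonic() - start))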

TPM and Alpha

This TPM model sounds great, but what is the practical impact? Has anyone done this and had any bottom-line result to show for it? Many shops have succeeded with the approach. Here are two examples.

In the first one, a major American bank (yes, one that still exists) wrote a real-time credit card fraud detection application. The application has to handle 20,000 TPS and provide split-second fraud analysis. Before we implemented the TPM model, the application couldn’t scale beyond 250 TPS. Within weeks of implementing TPM, we scaled the application to 22,500 TPS. This was a $5mm development effort that prevented $50mm in credit card fraud. The TPM implementation cost $100k and was done within development.

The second is one of the top five global ERP deployments. The capacity planning approach forecast that 800 CPUs would be needed over two years to handle the growth in application usage. The prior year, a TPM center had been established to perform ruthless application optimization based solely on end user response times and transaction performance. The planning data from the TPM center indicated a need for 300 CPUs. The gun-shy TPM center (hey, they’d only been active a year) recommended 400 CPUs. In reality, only 250 CPUs were needed once the application’s real-world performance was measured and tuned over the subsequent year. The company saved $25mm in software and hardware upgrades. The savings funded a new alpha-driving CRM system that helped close $125mm in additional revenues over two years.

What is the cost of implementing a TPM center at a large operation? A typical outlay is $2mm, including hardware, software, training, consulting and maintenance. An FTE is required, as well as time commitments from application architects and application owners. Altogether, we can have a rock-solid TPM center for the price of a single server: the very same server we think we need to buy to make that one bad application perform as it should.




About John Kehoe: John is a performance technologist plying his dark craft since the early nineties. John has a penchant for parenthetical editorializing, puns and mixed metaphors (sorry). You can reach John at exoticproblems@gmail.com.
