
Saturday, January 23, 2010

ITIL is considered essential, but does it internalize transaction cost?

By John Kehoe - 19 February 2010

I have slept through many an ITIL briefing. There is something about those process charts that just causes me to glaze over; I must have been thinking about the cost of applying ITIL. It is a business process, and process must be paid for. Does it earn its keep, or does it waste money on needless feedback loops for monitoring the quality of printing services? Certainly, it should only be applied in a supporting role for revenue-facing activities. If so, why is there no mention of application management and revenue alignment? Transaction Performance Management (TPM) fits that niche nicely.

My first encounter with ITIL occurred some years back at a companywide sales kickoff for a $5B software company. We didn’t sell ITIL software, but we wanted all the consultants, system engineers and account executives to be fluent in ITIL and to position our products in that context. We had a very nice, engaging and intelligent fellow for an instructor. He had impeccable credentials and came from a name-brand global consultancy.

He introduced the history of ITIL, which was my first warning of impending doom (and eye glazing). ITIL was created by UK bureaucrats over a decade ago. It details exacting processes and frameworks for IT organizations, the idea being that process can bring order to chaos. The process is well suited to change management and run books, but there are parts that are just plain silly. Do we need a TQM-style approach with elaborate feedback loops for printing, file and network services and availability? Most IT activities are utilities. There are two variables of concern. One, is the service available? Two, is it performing? There is no need for detailed statistics; they don’t matter.

Another concern about ITIL is cost. ITIL is a process, and process costs money. I’ve not seen ITIL performance statistics tied to revenue in any shop I’ve visited. There are claims of increased output, reports showing three-digit improvements in productivity, downtime reports to two decimal places. Well, that is great news. How much did it cost? How much revenue did it contribute? Did we apply resources efficiently? ITIL doesn’t answer those questions.

What is missing from ITIL (other than affordable documentation)? Performance management is the key piece. There is no mention of Transaction Performance Management. By using a TPM approach one can quantify end-user experience, determine the root cause of degradation and use an ingrained feedback loop. Why do this? To focus effort on real business impact.

Let’s lay out a scenario. End users are complaining about order entry. We hit our run books and start the triage. We pull the server and application stats. We review the log files. We have each team investigate its respective technology with its own tools. Maybe there are some sophisticated workflows to follow. Notice what happened? Where is the end user? Where is the business alignment? A formal framework around dodgy process is a dodgy formal framework. Worse still, it does nothing to resolve the performance issue.

Now let’s keep the eye on the transaction. We see what it is, who it belongs to and where it is coming from. Maybe it’s a slow network link; then it’s a last-mile problem and we needn’t bother the developers or DBAs. Maybe there is something unique about the transactions; we see that and notify the middleware team. You see where I’m going with this. You see all the waits for transactions down to the storage array. You see the characteristics of the transactions and where and how they break. This is Transaction Performance Management: one keeps an eye on all of the transactions over the whole of the architecture in the context of the business, the user and the technologies. This is what ITIL lacks, a tight, comprehensive approach to the performance of the business and the underlying technologies.

Recently I worked with a very clever group of consultants. Their specialty is mapping business transactions to risk, and then mapping that risk to revenue. The old style of doing that is fixing a cost to an outage. For example, I might lose $60 for every minute I’m out of service. At two nines of availability (roughly 88 hours of downtime a year), I’ll lose $315k. At four nines (about 52 minutes), I lose $3,100. Is it worth investing $300k to reduce this number? Ah, that is the question, isn’t it. We don’t have enough information to justify it. Where does the $60 come from? Is it a semi-factually derived value (a SWAG)? What transactions are in play? What happens if the transactions are slow as opposed to unavailable? What opportunity costs are incurred?
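
That back-of-the-envelope outage math is easy to mechanize. Here is a minimal sketch, assuming the flat $60-per-minute figure from the example above; a real model would weight by transaction mix and opportunity cost, which is exactly what the questions above are driving at.

```python
# A minimal sketch of the outage math above, assuming a flat $60-per-minute
# loss rate; real models would weight by transaction mix and opportunity cost.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes(availability: float) -> float:
    """Minutes of downtime per year at a given availability (e.g. 0.99)."""
    return (1.0 - availability) * MINUTES_PER_YEAR

def outage_cost(availability: float, loss_per_minute: float = 60.0) -> float:
    """Yearly revenue lost to downtime at a flat per-minute loss rate."""
    return downtime_minutes(availability) * loss_per_minute

for label, availability in [("two 9s", 0.99), ("three 9s", 0.999), ("four 9s", 0.9999)]:
    print(f"{label}: ~{downtime_minutes(availability):,.0f} min down, "
          f"~${outage_cost(availability):,.0f} lost")
# two 9s  -> ~5,256 min (about 88 hours), ~$315,360
# four 9s -> ~53 min, ~$3,154
```

The printed figures line up with the $315k and $3,100 quoted above, which is the point: the arithmetic is trivial; the hard part is defending the $60.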

Are you investing in the right application? How do you know?

To answer those questions, my consulting friends built a sophisticated model based on how business transactions are executed and the revenue associated with each. They then added their secret risk models and voilà: my customer can map her transaction data to her transaction risk model. She can see exactly where transactions are breaking down, apply sliding scales of performance impact to risk and understand the opportunity cost for a given outage or SLO (Service Level Objective) breach.

This is a compelling approach. You can now make focused expenditure judgments based on transactional risk. You can immediately see the impact of a transaction change to the business. You can’t have a tighter and more focused feedback loop than that.

This is why the TPM approach makes so much sense. It ties business transactions to risk in a way that ITIL, ITSM or Six Sigma cannot. In doing so it complements the structured approach, except that TPM is the process that pays for itself and builds business value.



About John Kehoe: John is a performance technologist plying his dark craft since the early nineties. John has a penchant for parenthetical editorializing, puns and mixed metaphors (sorry). You can reach John at exoticproblems@gmail.com.

Tuesday, March 10, 2009

Kehoe - Smug Post-Modernisms and Other Notions We Get Wrong

By John Kehoe - 10 March 2009

I was watching Gremlins 2 with my daughter this weekend (yes, I’m a bad dad, but don’t hold the sequel against me, just the fictional violence). What strikes me about the movie is how cheesy it is. Not the plot but the technology: the video conferencing system, the voice-based building controls. I particularly like the talking fire alarm system giving a history of fire, but I digress. It is a great period piece for late-’80s business and technology (did you know that you could smoke in an office in 1990?). Yes, post-modern sophistication relegated to a period piece. Such is Father Time.

It got me thinking in a broader context. What are we getting wrong today that will be revealed with the passage of time? We can look at the history of scientific progress. Examples abound in astronomy, biology and physics. The same can be said of the social sciences, economics and politics. Up until the 1950s, the universe was thought to be quite small. Up until last year, bundled mortgages looked like a good way to diversify risk.

How do we know which horse to back? The first place to look is the ecosystem (yeah, sounds touchy-feely, but it isn’t) of the technology. Diamond shipped one of the first MP3 players, a 64MB job, years before Apple. Apple won the race, but why? They created a fully contained ecosystem. It consisted of a closed DRM format, content, exclusivity of content, the blessing of the RIAA and a logo program. It didn’t hurt that they hyped the heck out of it. Microsoft tried the same with Zune, but hasn’t had anywhere near the success. Microsoft was too late to the market and didn’t have the best marketing or industrial design (people like polished plastics and nickel alloy). The same is true of the other media players.

The ecosystem became pivotal. As a consumer, do I go with another ecosystem or do I go with the iPod? My best mate abhors all things Apple (except his trusty Newton) and argues against the iPod: iTunes and iPod are closed DRM systems, the music isn’t portable to other systems, Apple locks in content providers. The arguments are similar to those of the Linux, Apple, Microsoft or [fill in the blank with a comparable technology] proponents and opponents. The fact remains that most people choose the iPod because it has the most mature ecosystem.

So what if there is no ecosystem? How do I pick the winner? I resort to need and simplicity. What do I need to accomplish? For instance, suppose I have a customer-facing application that brings in $100 per minute. When the transaction rate slows, I lose money. I can quantify "normal," define a cost of abnormal activity and prove what additional revenue I can create with further capacity. I can determine my cost for that performance delta. It is a simple model and readily understood. It tells me what the impact is, what I need and what I can afford. It’s a good way to avoid the technology weeds.
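
As a rough sketch of that simple model (the $100-per-minute baseline and the throughput ratios below are illustrative assumptions, not measured values), the cost of "abnormal" and the upside of added capacity fall out of one small function:

```python
# A sketch of the simple revenue model above. The $100/minute baseline and
# the throughput ratios are illustrative assumptions, not measured values.

def revenue_delta(baseline_per_min: float, rate_ratio: float, minutes: float) -> float:
    """Revenue gained (+) or lost (-) versus 'normal' for a period in which
    the transaction rate ran at rate_ratio times the normal rate."""
    return baseline_per_min * (rate_ratio - 1.0) * minutes

print(revenue_delta(100.0, 0.70, 60))  # an hour at 70% of normal -> -1800.0 lost
print(revenue_delta(100.0, 1.20, 60))  # an hour at 120% of normal -> 1200.0 gained
```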

Time makes fools of us all. We can use that to our advantage. If you don’t need technology XYZ, can’t afford it or can’t absorb it, then don’t buy it. The new classic example is Blu-ray v. HD-DVD. Both were expensive technologies that consumers would not absorb. A hard press by Sony led to the capitulation of HD-DVD within a two-week period in early 2008. This made winners of the people who bought Blu-ray and of the consumers who waited. Don’t mistake the initial Blu-ray owners for brilliant strategists: HD-DVD could have won as well. At any rate, the first adopters of Blu-ray paid $900 for bulky players. Better to wait for Wal*Mart to sell them for $99.95. The real winners are the consumers who sat out the battle.

So we use need and time to our advantage as best we can. We can take a contrarian perspective on the technology cycle. Think of this as the Devil’s Advocate (and yes, there is a Devil’s Advocate in the Vatican). Consider it the "B.S. detector" (a characteristic well honed by Mrs. Kehoe and applied to the auto dealer, or to me asking for a 52” big screen). This leads to a skeptical mindset, a healthy maladjustment of the trusting mind.

Consider the evolution of broadband. Fifteen years ago technologists thought it essential, but prohibitive in cost (think ISDN, a.k.a. “I Still Don’t Need”). We knew (or at least strongly suspected) what we could do with broadband communications: distribute information, telecommute (the real reason IT guys pushed broadband), new forms of communication, WebEx (which didn’t exist fifteen years ago), shopping, expansion of markets, outsourcing, offshoring, distributed teams, etc. The wheels come off the bus when we start standing up 100 Mbps internet, free municipal Wi-Fi and universal broadband. Why are they needed? Is it to keep up with Elbonia? Why should there be a government-run Wi-Fi network? If people don’t want broadband, why force the build-out of that capacity? The sixty-four-million-dollar question is: when does a technology become valuable? Fibre to the house was goofy 20 years ago. If you have to ask why, ask The Creator why he needs a starship.

Despite our best efforts, time will still embarrass us (really, the K car was a brilliant idea). What has been the long-term impact of Michael Jackson? He went from being King of Pop to Regent of Ridicule in short order. Will Miley Cyrus be the Max Headroom of today? (I do have to claim that my daughter is not a Miley fan; I can’t be that bad of a father.) So foolishness can rule the day, but I doubt that Sir Mix-A-Lot’s ‘Baby Got Back’ will be considered "classical music" in two hundred years. Nor will Sun Microsystems’ ‘We’re the dot in dot com’ commercials (’99–’00) be seen as the launch pad of corporate success, but rather as an apex of hubris signaling the impending internet bust of ’00.

By looking at the merits of a solution in the context of its ecosystem, need, simplicity, time and our return models, we minimize our risks and bring a skeptical mindset to the hype cycle. Let's not be the next "dot" in "dot bomb."




About John Kehoe: John is a performance technologist plying his dark craft since the early nineties. John has a penchant for parenthetical editorializing, puns and mixed metaphors (sorry). You can reach John at exoticproblems@gmail.com.

Wednesday, January 28, 2009

Kehoe - A DEC VMS Cluster, a Funky Chair and a Toaster Oven Walk Into a Bar...

By John Kehoe - 28 January 2009

I do not like change forced by ‘the Man’, change out of my control or change that doesn’t help. I do like change that I push, change that makes my life easy and change that puts money in my pocket. As IT professionals, we push change. We often do so without appreciating what our users require, what the users are doing or the historical context of our solutions.

What are we doing wrong? Let’s look at a manufacturing example. I once owned a nice little under-the-cabinet toaster oven, circa 1991, that worked reliably for almost two decades. When it hiccupped I simply fixed it. Alas, it eventually started belching smoke and so I replaced it.

In the IT world, this belching represents the old, unsupported application that we are afraid to touch, lest it combust. It has exceeded its carrying cost and the time has come to replace it. My quest to find a new toaster oven took a while, as the new models had been redesigned to meet current codes (think the manufacturing equivalent of SOX, HIPAA and whatever the politicians are about to do to the financial sector). My Elbonian-made Char-Nobyl 8000 toaster oven finally arrived (long past due, as often happens in software system replacement). We had to measure and mount the heat shield as well as mate the oven to the mount (application infrastructure). The instructions turned out to be wrong. I could not reuse the existing mounting equipment. Matching up the supplied components was problematic. Poor design, poor build, poor execution and a dodgy installer.

Now we get to use the oven. My first thought: why is my toaster oven plastered with English, Spanish and French instructions and warnings? I bought it in Chicago, not Paris or Buenos Aires. This is the small-electronics equivalent of the user interface. It’s annoying (sure, I’m ethnocentric, but a Frenchman and an Argentinian would say the same: they didn’t buy a toaster in Chicago).

Now I want to toast a bagel. Simple task: insert bagel; set darkness; start the job; repeat as desired (or ‘insérer bagel; set obscurité; démarrer le poste de répétition comme souhaité’). It’s not that simple with this toaster oven. This baby comes with a plethora of dials. First you set the darkness level. Then you rotate a knob clockwise, then counterclockwise. The third step is to wait for the light to turn green (which, for some reason, turns blue instead). Finally I push the button to ‘launch my bagel into orbit’. Success! I toast another and it’s burnt. Turns out I need to recalibrate my darkness level with each toasting to avoid burning. At least it doesn’t belch smoke.

This is a good metaphor for the weakness of IT applications: we toss in every conceivable feature, the UI is unnatural, the instructions are off, the business rules are mercurial… you get the idea.

How do we make a better toaster oven? Consider the past and the trade-offs. Do we need to replace the heat shield and mounting locations? Can we at least use the same bolt layout? Do we need all the languages? Do I need this degree of globalization? What about the UI: why the extra dial? Why can’t the oven maintain a regulated temperature? In IT we tend to design what we think we need, not what our customer needs.

Asking the customer is no guarantee of success. We’re still likely to end up with dozens of ideas, many of them contradictory or just plain bad. A few gems may exist amongst the cruft, but they are sometimes difficult to distinguish. We may even avoid considering customers or users who have a propensity to complain.

So where does that leave us? We must look to the past. Just because an idea is old doesn’t mean it’s stale. Consider this story in the context of today’s web 2.0. In the late nineties, I worked with Digital Equipment Corp (DEC) systems. I once chatted up a DEC field SE about the reliability of their hardware. He proudly told the story of a 911 emergency system that had been running on a VMS (a real operating system) cluster (DEC invented the technology) for 16 years with zero downtime. The design was thoughtfully centered on what the user required and what the technologist could do. Yes, they periodically refreshed the software and the operating system. All this was done without bringing the application down. Try that with today’s technology mix and applications. Would you want a 911 system running on a cloud or web 2.0?

Here’s another example. Remember the really cool chairs in 1960s sci-fi movies? In all our sophistication, we mocked the strange designs. Surely they had to be joking (OK, they were being serious, but the design did not age well) that anyone would use such a contrivance. Well, forty years later, I’m at a state-of-the-art (opened three days earlier) corporate video conference center for a $34b company. What’s the first thing I notice, after the three manned HD cameras and the thirty-foot projector setup that put IMAX to shame? Wild-looking chairs that could have come from 2001: A Space Odyssey or Logan’s Run.

So what can we glean from my tales? First, the problems we see today have analogues in the past. A lack of historical perspective in our current crop of technologists hampers good problem solving (action item: crack open a VMS manual to steal some ideas). Second, our solutions are too often divorced from the use case. Our users are not forthcoming about what they want, nor do they fully appreciate what they need. We must crawl into their heads to understand what is required and build from those use cases. I say use cases deliberately: we don’t need pretty interface ideas. We do need to value the UI, but we can't simply put out something shiny and think that's the end of it. Form must follow function, which means we must first understand the function. We need to know how users do their jobs today and how the tool we are building will help them do it. Third, we are making software that is too difficult to use. Consider the toaster oven: why the abrupt disconnect from the past? If your users are coming from 3270-terminal land or the fat-client world, the objective is to make their lives easier with technology, not simply to require them to change from one technology to another. Finally, consider the plumbing. Are we proposing a more reliable solution than the customer already has, or are we opening ourselves to reliability and performance grief?

So crack open those old manuals, play with some assembly code and enjoy some classic sci-fi movies this weekend. On Monday you can start looking at use cases. By Friday you might be able to deliver some ‘change’ that doesn’t ‘look’ like change and make change-resistant users think they came up with it.




About John Kehoe: John is a performance technologist plying his dark craft since the early nineties. John has a penchant for parenthetical editorializing, puns and mixed metaphors (sorry). You can reach John at exoticproblems@gmail.com.

Wednesday, November 12, 2008

Kehoe - Your Performance Is Our Business

By John Kehoe - 11 December 2008

I know of an outfit with the motto ‘Your Performance is our Business.’ It doesn’t matter what they sell (OK, it’s not an E.D. product; let’s get that snicker out of the way); what matters is the mindset the motto instills: ‘You’ve got IT problems with a real impact on the business, and we have a solution focused on your business needs.’ It got me thinking: what mindset is at work in our IT shops? What would an IT shop look like with a performance mindset?

The meaning of the mindset varies by how we define performance to our customers. Some shops have a simple requirement. I can’t count the times I find the critical application is Email. It is a straightforward domain of expertise, albeit with significant configuration, security and archiving requirements. The key metrics for performance are availability, responsiveness and recoverability. Not an insurmountable challenge except in the most complex of deployments.

There isn’t much performance to provide for messaging outside of availability and responsiveness. There isn’t alpha to be found. It’s a cost center, an enablement tool for information workers.

Performance comes into play with alpha projects. These are the mission-critical, money-making monstrosities (OK, they aren’t all monstrosities, but the ones I’m invited to see are monsters) being rolled out. This is very often the realm of enterprise message buses, ERP and CRM suites, n-tier applications and n-application applications (I like the term n-application applications; I made that one up. My mother is so proud). All are interconnected, cross-connected or outside of your direct control. Today, the standard performance approach is enterprise system management (ESM): the large central alerting consoles with agents too numerous to count. ESMs approach the problem from the building blocks and work their way up.

This is not the way to manage performance.

The problem with the low-level hardware and software monitoring approach is that the performance data cannot be mapped up to the transaction level (yes, I said it; consider it flame bait). Consider the conventional capacity planning centers we have today. The typical approach is to gather system and application statistics and apply a growth factor to them. The more sophisticated approach uses additional algorithms and perhaps takes sampled system load to extrapolate future performance. Either way, the capacity planning model lacks the perspective of the end-user experience.
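
For contrast, here is a minimal sketch of that growth-factor style of forecast; the utilization and growth numbers are assumptions for illustration. Notice what it never mentions: the end user.

```python
# A sketch of conventional growth-factor capacity planning as described above.
# The current utilization and growth rate are assumptions for illustration.

def forecast_utilization(current_util: float, annual_growth: float, years: int) -> float:
    """Extrapolate resource utilization by compounding an annual growth factor."""
    return current_util * (1.0 + annual_growth) ** years

# A server pool 45% busy today, with transaction volume growing 30% a year:
print(f"{forecast_utilization(0.45, 0.30, 2):.0%}")  # -> 76% after two years
# Note what is missing: nothing here describes how the end user's
# transactions actually feel, which is the gap TPM fills.
```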

The need for Transaction Performance Management (TPM)

TPM looks at the responsiveness of discrete business transactions over a complex application landscape. Think of this in the context of n-application applications: TPM follows the flow regardless of the steps involved, the application owner or the “cloud” (did I ever say how much I loathe the term cloud?). The TPM method follows the transaction, determines the wait states and then drills into the system and application layers. Think of it as the inverse of ESM.
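
As a toy illustration of that inversion (this is not any vendor's API; the tier names and wait times are invented), following one transaction and ranking where it waits looks something like this:

```python
# A toy sketch of the TPM idea: follow one business transaction across tiers,
# rank where it waits, then drill into the worst offender. Tier names and
# wait times are invented for illustration.

from collections import defaultdict

# One end-to-end transaction, recorded as (tier, wait_ms) hops.
trace = [
    ("web", 40), ("app_server", 120), ("message_queue", 2600),
    ("database", 180), ("storage", 60),
]

def wait_breakdown(trace):
    """Percentage of total response time spent waiting in each tier."""
    totals = defaultdict(float)
    for tier, wait_ms in trace:
        totals[tier] += wait_ms
    grand_total = sum(totals.values())
    return sorted(((tier, 100.0 * ms / grand_total) for tier, ms in totals.items()),
                  key=lambda item: item[1], reverse=True)

for tier, pct in wait_breakdown(trace):
    print(f"{tier:>14}: {pct:5.1f}% of response time")
# The drill-down into system and application stats starts at the top of this list.
```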

To exploit the TPM model you need a dedicated performance team. This is a break from the traditional S.W.A.T. team approach for application triage and the mishmash of tools and data used to ‘solve’ the problem. In a previous column, we discussed the implications of that approach, namely long resolution time and tangible business costs.

What would a TPM team look like?

  1. We need a common analysis tool and data source monitoring the application(s) architecture. This is essential to eliminate the ‘Everybody’s data is accurate and shows no problems, yet the application is slow’ syndrome.
  2. We need to look at the wait states of the transactions. It isn’t possible to build up transaction models from low-level metrics. The ‘bottom up’ tools that exist are complex to configure and rigid in their definition of transactions. They invariably rely on a complex amalgamation of multiple vendor tools in the ESM space. Cleanliness of data and transparency of transactions are hallmarks of a TPM strategy.
  3. We need historical perspective. Historical perspective provides our definition of ‘normal’. It drives a model based on transaction activity over a continuous time spectrum, not just a performance analysis at a single point in time. (A minimal sketch of such a baseline follows this list.)
  4. We need ownership and strong management. TPM implementations don’t exist in a vacuum, they cannot be given to a system administrator on an ‘other duties’ priority. TPM implementations require a lot of attention. They are not a simple profiler we turn on when applications go pear shaped. Systems, storage and headcount are needed, subject matter experts retained and architects available. Management must set the expectation that TPM is the endorsed methodology for performance measurement, performance troubleshooting and degradation prevention.
  5. TPM must quickly show value. Our team and implementation must be driven by tactical reality with a clear relationship to strategic value. We need to show quick performance wins and build momentum. For instance, there needs to be a focus on a handful of core applications, systematically deployed and showing value. Build success from this basis. Failure is guaranteed by the big bang global deployment or a hasty deployment on the crisis du jour.
  6. There must be a core user community. Our user community consists of people up and down the line, using the tool for tactical problem solving to high level business transaction monitoring. These people know the application or the costs to the business. Training is essential to an organization-wide TPM buy-in.
  7. TPM must be embraced by the whole application lifecycle. It’s not enough to have operations involved. QA, load testing and development must be active participants. They can improve the quality of software product early on using the TPM methodology.
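
Here is the minimal sketch of the kind of baseline item 3 describes. The rolling window and percentile are assumptions, not a standard; the point is that "normal" comes from a continuous history rather than a single point-in-time measurement.

```python
# A sketch of defining "normal" from history (item 3). The window size and
# percentile are assumptions; anything well above the baseline is abnormal.

import statistics

def baseline(samples, window=168, percentile=0.95):
    """95th-percentile response time over the trailing `window` hourly
    samples; responses well above this are treated as abnormal."""
    recent = sorted(samples[-window:])
    return recent[int(percentile * (len(recent) - 1))]

hourly_response_ms = [210, 225, 198, 240, 2300, 215, 230]  # toy data
print("baseline:", baseline(hourly_response_ms), "ms")            # -> 240 ms
print("median:  ", statistics.median(hourly_response_ms), "ms")   # -> 225 ms
```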

An example of a TPM success is a web-based CRM application I worked on recently. The global user population reported poor performance on all pages. The TPM team saw isolated performance issues on the backend systems, but they did not account for the overall application malaise. The supposition was that the last mile was too slow or that the application was too fat. We implemented the last TPM module to analyze the ‘last mile’. It turned out each page referenced a single-pixel GIF used for tracking. The GIF was missing, and the failed request caused massive connect timeouts on every page. We put the image back in the file system and the problem disappeared. This raises the question: how would we have found this problem without the TPM approach?

TPM and Alpha

This TPM model sounds great, but what is the practical impact? Has anyone done this and had any bottom-line result to show for it? Many shops have succeeded with the approach. Here are two examples.

In the first one, a major American bank (yes, one that still exists) wrote a real-time credit card fraud detection application. The application had to handle 20,000 TPS and provide split-second fraud analysis. Before we implemented the TPM model, the application couldn’t scale beyond 250 TPS. Within weeks of implementing TPM, we scaled the application to 22,500 TPS. This was a $5mm development effort that prevented $50mm in credit card fraud. The TPM implementation cost $100k and was done within development.

The second is one of the top five global ERP deployments. The capacity planning approach forecast that 800 CPUs would be needed over two years to handle the growth in application usage. The prior year, a TPM center had been established that performed ruthless application optimization based solely on end-user response times and transaction performance. The planning data from the TPM center indicated the need for 300 CPUs. The gun-shy (hey, they’d only been active a year) TPM center recommended 400 CPUs. The reality was that only 250 CPUs were needed once the real-world performance of the application was measured and tuned over the subsequent year. The company saved $25mm in software and hardware upgrades. The savings provided the funds for a new alpha-driving CRM system that helped close $125mm in additional revenues over two years.

What is the cost of implementing a TPM center at a large operation? A typical outlay is $2mm, including hardware, software, training, consulting and maintenance. An FTE is required, as well as time commitments from application architects and application owners. Altogether we can have a rock-solid TPM center for the price of a single server: the very same server we think we need to buy to make that one bad application perform as it should.




About John Kehoe: John is a performance technologist plying his dark craft since the early nineties. John has a penchant for parenthetical editorializing, puns and mixed metaphors (sorry). You can reach John at exoticproblems@gmail.com.

Wednesday, October 15, 2008

Kehoe - The German General Staff Model and IT Organizations

By John Kehoe - 15 October 2008

A couple of stories caught my eye this month and dovetail with a model I’m investigating. First, a column, Meet the IT Guy, outlines the typical, hackneyed view of the IT archetype (still funny stuff). The second, an excerpt from Numerati in Business Week, examines IBM’s effort to model consultant abilities and cost to map the right people to the right job and model how a successful expert is created (neat for high level experts, but a bit scary for lower level consultants).

This leads me to the characteristics an IT organization needs to excel. Curiously enough, they are descendants of the German General Staff (GGS) of the period between 1814 and 1900.

Staffing Model

The German General Staff (GGS) was created at the end of the Napoleonic Wars as a reaction to the military officer corps being staffed by men who purchased their positions. The "commissioned" officers (as in, paid for their commission) were on a quest for personal glory. As a result, they fought by rote, did not create new strategies, got a lot of men killed and wasted resources. The GGS mission was to create a professional officer corps to change that mindset and with it, achieve better results.

Candidates for the GGS were rigorously vetted for competency and motivation before an invitation was extended. Once in, GGS officers were categorized in two dimensions: motivation and competency. This can be represented in a 2 x 2:


This is how the GGS matched the right person to the right role.

Clearly, in any organization, a mismatch breaks the lot. Try the mental exercise of placing your own people in these quadrants. It is easy to find an implementer in a manager position or a manager in the general’s seat. Recognizing the mismatch makes it obvious why things bog down or spin out of control.

Organizational Values

The GGS instilled the following values in its members:

  • Dedication: By investing in an individual, that individual is more committed to the organization
  • Motivation: Devise good strategies that are efficient, flexible and victorious
  • Dissemination: Get the ideas into the field
  • Doctrine: Get a common process in play
  • Innovation: Think outside the box to adjust to the circumstances
  • Philosophy: How do we go about our business? What are the ways and means we choose to use or not use?

Now, why put forward the GGS as a role model? Because the Germans were the first to do it. Today, every major nation has some sort of joint military command and training structure. For a military to succeed in a campaign, it must leverage every resource to maximum productivity and align tactical activities with strategic goals. The most successful operations – the most successful businesses – have the concept ingrained from top to bottom.

This approach makes clear how important it is to place each person according to his or her abilities. Suppose a junior-level person is designing a technical architecture for performance management. This mismatch is a high risk for the organization and the individual. Don’t axe people because they perform poorly in the wrong role; move them to a task commensurate with their ability. Make clear that the move is not a punishment, just a rebalancing of the skills portfolio. If, however, a person falls into the lazy/incompetent quadrant, well, you know what to do.

Another interesting characteristic of the GGS is that it rotated people off the line into staff positions, and then back to the line. Many IT organizations have one set of people who put code into production and a cadre of architects in staff (or non-line) positions. Or they have people managing projects in their own way, ignoring the efforts of a central PMO to create consistent and professional PM practices. Either is an example of an organizational separation within IT, with “central office” people on one side and “executors” on the other. By rotating people in and out of staff positions, IT policies are more likely to be actionable and not academic. There is also more likely to be buy-in rather than resistance to IT standards. Perhaps less obviously, it contains the expansion of IT overhead, as there are few career "staffers": everybody will be back on the line before long. Finally, it makes IT less of a cowboy practice and more of a professionally executing capability.

Mapping the Concepts

How do we map the GGS to an organization to measure its potential?

  • Dedication shows in the rate of employee exit and replacement. IT is a highly mobile profession; "healthy" IT organizations will have 15% turnover or less.
  • Motivation is recognized by the creation of value, not the number of overtime hours worked. Do we deliver in a timely fashion? Are we receptive to other organizations? Do we know and appreciate the impact of our action and inaction?
  • Dissemination is judged by joint cooperation. Do tasks get done in a timely fashion across a joint team? Do people know the resources and responsibilities around them?
  • Doctrine can be summed up simply by asking: do rules define the exceptions or do exceptions define the rules? If our rules are exception based, we have a problem. We have multiple options and cannot rationally consider their impact.
  • Innovation is the game changer that gets us ahead of the competition. What can we change that drives revenue or improves margins?
  • Philosophy defines the actions we accept and reject. What motivations do we accept or reject? Are they telegraphed throughout the organization? For instance, we should have profitability as our guide. How do we achieve it: growth or cost cutting? Does advantage come from outsourcing or offshoring? How do we balance the other five values?

The GGS was not a rarefied career opportunity devoid of delivery expectations and obligations; it provided a means by which to circulate expertise and build experience. It offers IT a straightforward way to fashion planning and career development, as well as a means to incubate ideas and individuals. It starts with a clean sheet of paper, a cup of coffee and insight into your business, organization and services portfolio. Give it a try today.




About John Kehoe: John is a performance technologist plying his dark craft since the early nineties. John has a penchant for parenthetical editorializing, puns and mixed metaphors (sorry). You can reach John at exoticproblems@gmail.com.

Tuesday, August 12, 2008

Kehoe - So I Get This Call...

by John Kehoe - 12 August 2008

It’s Friday, the day before my 40th birthday (well, in fine Irish tradition, my birthday “wake”). I get a call from a customer, a major US-based air carrier. They’ve spent the last two months troubleshooting an online check-in system that powers their departure kiosks. They ask me to look at the problem.

The new check-in system was designed to complete the passenger ticketing process in 30 seconds, cut queue time and reduce counter staffing. Unfortunately, it wasn’t working out that well in production: the check-in process was taking five minutes, ten times longer than expected and longer than it would take to have an agent check in a passenger.

As a result, queues are long, customers are angry, and the customer has to increase the counter staff. Meanwhile, the airline the next counter over is fat, dumb and happy, successfully executing the business plan my customer was trying to implement. How dare they!

Each morning, my customer brings together a meeting of twenty people representing every vendor, owner and tier. Each presents the latest findings. All report acceptable metrics. Nobody can solve the end-to-end problem.

Before going any further, let’s do some math on the cost of these meetings. Sixty-three days, times twenty people, times (purely for round numbers’ sake) $100, equals $126,000 lost to just this meeting. This doesn’t include troubleshooting time, opportunity cost and the proposed expenses to fix the problem (not to mention the meetings to implement that fix). So much for the returns the customer is trying to achieve with the new system.

This is a multi-million dollar problem. It isn't a seven-digit problem; it’s an eight-digit problem. The customer has already sunk millions into developing the software and acquiring the hardware and staff to deploy the application. They are past their planned deployment date and are paying dearly for FTEs they want to shift. On top of it, they're losing the business passengers who are the target of the system (the frequent flyer miles simply aren't worth the hassle).

To be fair, this application is a bear. There are four databases, three application tiers, a data conversion tier, an application server and a remote data provider (that is, a third-party, external vendor). There is no possible way to understand what is going on by looking at the problem atomically.

Now, back to the situation at hand. I join call number sixty-three. (Did I mention it’s my day off and I’m missing my birthday party?) There are four current fronts of attack: the network load balancer is not cutting the mustard; the web servers are misconfigured; the Java guys think there might be an issue with the application server configuration; and the server pool is being tripled in size. I ask for seventy-two hours. My first – and only – act is to get two fellows from the customer to install some performance management software they bought a year earlier for a different project.

I sit back and wait.

It turns out that the team was off the mark. The Java guy was right, but for the wrong reasons.

Here is how the wait analysis breaks down. Authentication was responsible for 3% of the wait, remote vendor response for another 2%. One application component was responsible for 95% of the delay. The issue boiled down to asynchronous message-driven bean (MDB) calls.

Let’s consider the actual effort of what it took to isolate the problem.

First, we eliminated 90% of the people from the equation in two days. The network and systems were good. There was no issue with the web servers or system capacity. We could gain some single-digit improvements by tweaking the authentication process (fixing a couple of queries) and enforcing the SLA for our third-party data provider. This left only the middleware team, which reduced the meeting of 20 people down to three: a customer Java guy, a rep from the Java app server vendor and me.

Second, we eliminated a $1mm hardware “solution” that was being given serious consideration. The web team genuinely believed they were the bottleneck and that if they scaled out and tripled their footprint, all would be better. Management (perhaps in a panic) was about to give them the money. It would have made no difference.

Third, we turned around a fix within seventy-two hours.

So, let’s do the math again. One performance guy, times seventy-two hours (I really wasn’t working the whole time; I found the Scotch the family set aside for my birthday wake), times $100 (we didn’t charge, so this is a bit inflated), comes to $7,200. Compare that to the (conservatively estimated) $126,000 spent on the daily firedrill meetings.
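
For the record, here is the same arithmetic in one place; the $100 figure is the round-number assumption used above, and one hour per daily meeting is assumed.

```python
# The back-of-the-envelope comparison from above. The $100 figure is the same
# round-number assumption used in the text; one hour per daily meeting assumed.

meeting_cost = 63 * 20 * 100   # 63 daily meetings x 20 people x $100
triage_cost = 1 * 72 * 100     # one performance guy x 72 hours x $100
print(meeting_cost, triage_cost, round(meeting_cost / triage_cost, 1))
# -> 126000 7200 17.5
```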

We eliminated waste by closing up the time-wasting, money-draining, soul-sucking morning meetings; avoiding a $1mm hardware upgrade that wouldn’t fix the problem; enabling the underlying system to achieve the business operations goals (reduction of counter staff and queue time) so that it could come close to the business impact originally forecast; and providing a standard measurement system across all applications and tiers.

Consider this last point very carefully. We have to have a systematic approach to measuring the performance of applications. The approach must be holistic, i.e., capture transaction data from the desktop to the backend storage and all the tiers in between. We have to see and understand the relationships and interactions in the technology stack for our transactions. We cannot rely on free vendor-supplied tools and a "toss the spaghetti at the wall and see what sticks" approach. That gives us only isolated, uncorrelated data points that show no problems or only symptoms, but not root cause.

From an IT perspective, the cost of the path that led to the solution was negligible: the time and tools over the three days spent actually solving the problem weren’t much different from the cost of the morning meetings (except for the pounding the IT group was getting from the business owners while the application’s wings were clipped). From a business perspective, the cost of the path that led to the solution was nothing compared with the business impact: reduction of counter staff, faster check-ins, and happy customers. (Well, perhaps not "happy": this is an airline we’re talking about... perhaps customers who become disgruntled at a later point in the air travel experience.)

For all the panic and worry that it causes, a situation like this doesn’t need to be an exercise in “not my problem,” and it can bring the business and vendors into alignment. But this is true only if vendors bear in mind that a holistic performance approach has real value associated with it, and if customers bear in mind that a holistic performance measurement system will set them back little more than the cost of the futile execution it replaces.

Holistic performance management is an essential piece of successful business application deployment. Though often viewed as an afterthought, performance management is the least expensive part of application deployment. When used, it releases untapped value in applications. At the very least, it’s a cheap insurance policy for the business when the fire alarm rings.




About John Kehoe: John is a performance technologist plying his dark craft since the early nineties. John has a penchant for parenthetical editorializing, puns and mixed metaphors (sorry).

Tuesday, July 15, 2008

Kehoe - Futility Computing

By John Kehoe - 15 July 2008

For some time we’ve witnessed the push for utility computing: technologies such as server virtualization, storage virtualization, and grids that shift loads. Then there’s data source virtualization: natural language queries that retrieve a steaming heap of data from a mix of sources without being transparent about how it all got there. Sounds like the tomatoes the FDA can’t track down.

It’s best described as "Futility Computing," an idea Frank Gens of IDC came up with in 2003.

Here's why utility computing is problematic.

First, the technologies have had a long maturity curve. Remember when a certain RDBMS vendor (who shall remain anonymous because I might need a job someday) promised the first grid capable of dynamically shifting load? We've been in pursuit of heterogeneous storage virtualization for a long, long time. Has there ever been a cluster that wasn’t a cluster-[expletive]?

Second, utility computing “solutions” are money spent on the wrong problem. The argument can be made that there are savings to be had by creating a utility structure. We save rack space, fully utilize storage, cut the electric bill and reduce HVAC requirements. We even get to do a nice little PR piece about how green we are and how we're saving the polar bears because we care. But what is the real cost? Do we have the right hardware for scalability? Can our business solutions exploit virtualization, or will performance degrade under the utility approach? What is the risk of vendor lock-in? Does the utility solution support the mix of technologies we already use, or do we need separate tools? Is virtualization robust? Above all else, how much obscurity do we introduce with the utility model? Not only do we risk distorting our costs, but with all the jet fuel we'll need to burn flying consultants back and forth to keep the virtualization lights lit, we may not be doing the polar bears any great favors after all.

Most shops have a Rube Goldberg feel to them: applications are often pieced together and interconnected to the point where they make as much sense as an Escher drawing. IT doesn’t know the first place to start, let alone know what all the pieces are (which is why SOA and auto-discovery are pushed, but that is another diatribe). Any virtualization effort requires a complete understanding of the application landscape. Without it, a utility foundation can’t be established.

One byproduct of virtualization initiatives is the further stratification or isolation of expertise. The storage team is paid to maximize storage efficiency and satisfy a response time service level agreement (SLA). The Database Administrator (DBA) has to satisfy the response SLA for SQL. The middleware team (e.g., Java or .NET) has to optimize response, apply business logic and pull data from all sorts of remote databases, applications and web services calls. The web server and network teams are focused on client response time and network performance. Everybody has a tool. Everyone has accurate data. Nobody sees a problem.

Meanwhile, Rome burns.

Unfortunately, nobody is talking to one another. The root of the problem is that we have broken our teams into silos. That leads us to overly clever solutions: servers that automatically shift load while sorting this nastiness out, storage that shifts around in the background, or even virtual machines moving about. We never stop to ask: does any of this address our real problem, or is it just addressing the symptoms?

This brings to mind some metaphors. The first is an episode of the television series MacGyver (Episode 3 ‘Thief of Budapest’) where MacGyver has to rescue a diplomat. During his escape, the villains (the Elbonians) shoot and damage the engine of his getaway car. Mac goes under the hood to fix the engine with the car traveling at highway speed. Outrageous as it may sound, this is not too far from the day-to-day reality of IT. Of course, IT reality is much worse than this: the car is going downhill at 75 MPH on an unlit, twisty road, the Elbonians are still shooting away, and the car is on fire.

The second metaphor that comes to mind is an anti-drug television campaign that ran in the US during the 1980s. It opened with a voiceover, "this is your brain" accompanied by the visual of an egg. This was followed by another voiceover, "this is your brain on drugs" with a visual of the egg in a scaldingly-hot frying pan. The same formula is applicable to an application on utility computing: this is your application; this is your application on a stack of virtualization.

As we fiddle, we’re pinching pennies on the next application that our business partners believe will give it competitive advantage. Because we’ve discounted application performance, and supplemented that by having no means of finding where business transactions die, we’ve put functional requirements at significant risk. In fact, we’ve misplaced our priorities: we spend prodigiously on utility enablement, silo-ing and obscuring IT while simultaneously ignoring end-to-end performance. We do this because we take performance for granted: vmstat and vendor console tools are all we need, right?

Very few virtualization/utility models succeed. Those that do have common characteristics. There is a clean application landscape. People have a shared understanding of the applications. They have dedicated performance analysis teams staffed with highly capable people. They have very low turnover. Finally, they have a methodology in place to cut through the silos and pinpoint the cause of a performance problem. From personal experience, about 1.5% of all IT shops can say they have all of these things in place. That means 98.5% are underperforming in this model.

Without the right capability or environment, a utility approach is going to cause more harm than good. It is possible to get away with some virtualization deployments: one-off VMs are easy, and some degree of server consolidation is possible. But look out for the grand-theft-datacenter utility solution. It’s pretty violent.




About John Kehoe: John is a performance technologist plying his dark craft since the early nineties. John has a penchant for parenthetical editorializing, puns and mixed metaphors (sorry).