Wednesday, November 19, 2008

Pettit - States of IT Governance

By Ross Pettit - 19 November 2008

As we head into a period of economic uncertainty, one thing we can count on is that IT effectiveness will be called into question. This isn’t so much because IT has grown excessively large in recent years, but because of results: industry surveys still show that as many as 60% of IT projects fail outright or disappoint their sponsors. In a downturn, executives may look to IT to create business efficiency and innovation, but they won’t do so until they have scrutinised IT’s spend, practices and controls.

This makes the need for good IT governance more urgent.

Governance is a curious concept. Governance doesn’t create value, it reduces the likelihood of self-inflicted wounds. It’s taken for granted, and there isn’t a consistent means to show that it actually exists. It is conspicuous only when it is absent, as it is easier to identify lapses in governance than disasters averted. And we tend not to think of an organisation as having “great” governance, but of having “good” or “bad” governance.

This strongly suggests that “good” is the best that governance gets. It also suggests, as the nursery rhyme goes, that when it is bad, it is horrid.

Any critical examination of our governance practices will focus on what we do poorly (or not at all) more than on what we do well. But rather than cataloging shortcomings, it is better to start by characterising how we’re governing. This will give us an opportunity to not only assess our current state, but forecast the likely outcome of our governance initiatives.

To characterise how we govern, we need some definition of the different types or states of governance. To do that, we can categorise governance into one state of “good” and multiple states of “bad.”

We'll start with the bad. The most benign case of “bad” governance is unaligned behaviour. There may be guiding principles, but they're not fully embedded in day-to-day decisions. Individual actions aren't necessarily malicious so much as uninformed, although they may be intentionally uninformed. Consider an engineer in a Formula 1 team who makes a change to the car but fails to take the steps necessary to certify that the change is within regulation. This omission may be negligence at best, or a “don’t ask / don’t tell” mentality at worst. This first category of “bad” governance is a breakdown of participation.

The next category is naivety. Consider the board of directors of a bank staffed by people with no banking background. The bank enjoys outsized returns for many years, but the board fails to challenge the underlying nature of those returns.1 By not adequately questioning – in fact, by not knowing the questions that need to be asked in the first place – the bank unknowingly acquires tremendous exposure. This lapse in rigor ultimately leads to a destruction of value when markets turn. We see the same phenomenon in IT: hardware, software and services are subjected to a battery of well-intended but often lightweight and misguided selection criteria. Do we know that we're sourcing highly capable professionals and not just bodies at keyboards? How will we know that the solution delivered will not be laden with high-cost-to-maintain technical debt? Naïve governance is a failure of leadership.

Worse than being naïve is placing complete faith in models. We have all kinds of elaborate models in business, for everything from financial instruments to IT project plans. We also have extensive rule-based regulation that attempts to define and mandate behaviour. As a result, there is a tendency to place too much confidence in models. Consider the Airbus A380. No doubt the construction plan appeared very thorough when Airbus committed $12b to the program. During construction, a team in Germany and another in France each completed sections of the aircraft. Unfortunately, while those sections of the aircraft were "done", the electrical systems couldn’t be connected. This created a rather large, expensive and completely unanticipated system integration project to rewire the aircraft in the middle of the A380 program.2 We see the same phenomenon in IT: we have detailed project plans that are surrogates for on-the-spot leadership, and we organise people into work silos. While initial project status reports are encouraging, system integration or quality problems seemingly appear out of nowhere late in development. Faith in models is an abrogation of leadership, as we look to models instead of competent leaders to guide behaviour toward results.

Finally, there is wanton neglect, or the absence of governance. It is not uncommon for organisations to make optimistic assumptions and follow through with little (if any) validation of performance. Especially at the high end of IT, managers may assume that because they pay top dollar, they must have the best talent, and therefore don’t need oversight. People soon recognise the lack of accountability, and work devolves into a free-for-all. In the worst case, we end up with a corporate version of the United Nations’ oil-for-food program: there's lots of money going around, but only marginal results to show for it. Where there is wanton neglect of governance, there is a complete absence of leadership.

This brings us to a definition of good governance. The key characteristics in question are, of course, trust and competent leadership. Effective governance is a function of leadership that is engaged and competent to perform its duties, and trustworthy participation that reconciles actions with organisational expectation. Supporting this, governance must also be transparent: compliance can only be built-in when facts are visible, verifiable, easily collected and readily accessible to everybody. This means that even in a highly regulated environment, reaction can be swift because decisions can be effectively distributed. In IT this is especially important, because an IT professional – be it a developer, business analyst, QA analyst or project manager – constantly makes decisions, hundreds of times over the life of a project. Distributed responsibility enables rapid response, and it poses less of a compliance risk when there is a foundation of trust, competent leadership, and transparency.

This happy state isn’t a magical fantasy-land. This is achievable today by adhering to best practices, integrating metrics with continuous integration, using an Agile-oriented application lifecycle management process that enables localised decision-making, and applying a balanced scorecard. Good IT governance is in the realm of the possible, and there are examples of it today. It simply needs vision, discipline, and the will to execute.

In the coming months, we are likely to see new national and international regulatory agencies created. This, it is hoped, will provide greater stability and predictability in markets. But urgency for better governance doesn't guarantee effective governance, and regulation offers no solution if it is poorly implemented. The launch of new regulatory bodies – and the actions of the people who take on new regulatory roles – will offer IT a window into effective and ineffective characteristics of governance. By paying close attention, IT can get its house in order so that it can better withstand the fury of the coming economic storm. It will also allow IT leaders to emerge as business leaders who deliver operating efficiency, scalability and innovation at a time when they are needed most.

1 See Hahn, Peter. “Blame the Bank Boards.” The Wall Street Journal, 26 November 2007.

2 See Michaels, Daniel. “Airbus, Amid Turmoil, Revives Troubled Plane.” The Wall Street Journal, 15 October 2007.




About Ross Pettit: Ross has over 15 years' experience as a developer, project manager, and program manager working on enterprise applications. A former COO, Managing Director, and CTO, he also brings extensive experience managing distributed development operations and global consulting companies. His industry background includes investment and retail banking, insurance, manufacturing, distribution, media, utilities, market research and government. He has most recently consulted to global financial services and media companies on transformation programs, with an emphasis on metrics and measurement. Ross is a frequent speaker and active blogger on topics of IT management, governance and innovation. He is also the editor of alphaITjournal.com.

Thursday, November 13, 2008

Cross - Technical Equity

By Brad Cross - 13 November 2008

Recall that previously, we tallied code metrics by component to produce a table of technical liabilities. Then we scored the degree of substitutability of each component on a scale of 1 to 4 to assign each an asset value. To determine owner's equity, we need to compare asset and liability valuations, which means we need to transform the collection of metrics in the table of technical liabilities into an index comparable to our asset valuation.

Are you Equity Rich or Over-Leveraged?

There are two ways we can make this transformation. One is to devise a purely mathematical transformation that collapses all the technical scores into a single index. To do this, convert each metric to a percentage scale on which 100% is the worst case (so test coverage becomes the percentage of code that is untested), sum the scaled values and divide by 100. This gives you a number from 0 to n, where n is the number of metrics in your table. Then multiply by 4/n, which puts the aggregated metric on a scale with a maximum of 4. For example, if you have 60% test coverage, 20% duplication and 150 bug warnings from static analysis (where 200 is the maximum number of bug warnings in any of your components), you compute (1 - 60%) + 20% + 150/200 = 135%. Since we have 3 metrics and we want the final result on a 4-point scale, we multiply 1.35 by 4/3. This gives a score of 1.8 out of 4.
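Here is a minimal sketch of that transformation in Python. The metric names, the 200-warning ceiling and the sample values come straight from the worked example above; they are illustrative, not data from a real code base.

    # A minimal sketch of the liability-index transformation described above.
    def liability_index(metrics, scale_max=4.0):
        """Collapse one component's metrics (each expressed as a 0-to-1
        'badness' fraction, where 1.0 is worst) into a 0-to-scale_max score."""
        return sum(metrics.values()) * scale_max / len(metrics)

    # The worked example: 60% test coverage, 20% duplication, and 150 of a
    # possible 200 static-analysis bug warnings.
    component = {
        "untested": 1.0 - 0.60,     # fraction of code not covered by tests
        "duplication": 0.20,        # fraction of duplicated code
        "bug_warnings": 150 / 200,  # warnings relative to the worst component
    }

    print(round(liability_index(component), 1))  # 1.8 on the 0-to-4 scale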

Alternatively, we can devise a 1-to-4 scale similar to our asset valuation scale. This allows us to combine quantitative metrics with the qualitative experience we have from working with a code base.

As in finance, excessive liabilities can lead to poor credit ratings, which in turn lead to higher interest rates on borrowing. In software, we can think of the burden of interest payments on technical debt as our cost of change, something that reduces the speed of development.

Technical debt, like any form of debt, can be rated. Bond credit ratings are cool, so we will steal the idea. Consider four credit rating scores according to sustainability of debt burden, and how these apply to code:

  1. AAA - Credit risk almost zero.
  2. BBB - Medium safe investment.
  3. CCC - High likelihood of default or other business interruption.
  4. WTF - Bankruptcy or lasting inability to make payments.

A AAA rated component is in pretty good shape. Things may not be perfect, but there is nothing to worry about. There is a reasonable level of test coverage, the design is clean and the component's data is well encapsulated. There is a low risk of failure to service debt payments. Interest rates are low. The cost of change in this component is very low. Development can proceed at a fast pace.

A BBB component needs work, but is not scary. It is bad enough to warrant observation. Problems may arise that weaken the ability to service debt payments. Interest rates are a bit higher. This component is more costly to change, but not unmanageable. The pace of development is moderate.

A CCC component is pretty scary code. Sloppy and convoluted design, duplication, low test coverage, poor flexibility and testability, high bug concentration, and poor encapsulation are hallmarks of CCC-rated code. The situation is expected to deteriorate, the risk of interruption of debt payments is high and bankruptcy is a possibility. Interest rates are high. Changes in this component are lengthy, painful, and expensive.

A WTF component is the kind of code that makes you consider a new line of work. Bankruptcy or insolvency is the most likely scenario. Interest rates are astronomically high. An attempt to make a change in this component is sure to be a miserable, slow and expensive experience.
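As a rough sketch of how the quantitative and qualitative approaches meet, the 0-to-4 liability score computed above could be bucketed into these four ratings. The numeric thresholds below are my own illustrative assumption, not part of the rating scheme described here.

    # A rough sketch: bucket a 0-to-4 liability score into the four ratings.
    # The thresholds are illustrative assumptions, not a standard.
    def credit_rating(liability_score):
        if liability_score < 1.0:
            return "AAA"  # low debt burden, cheap to change
        if liability_score < 2.0:
            return "BBB"  # needs observation, moderately costly to change
        if liability_score < 3.0:
            return "CCC"  # high risk, changes are slow and painful
        return "WTF"      # effectively bankrupt code

    print(credit_rating(1.8))  # "BBB", for the example component scored earlier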

Expanding on the example we've been using, let's fill out the rest of the balance sheet and see what owner's equity looks like.

Component       Assets   Liabilities   Equity   Leverage
Brokers            2          3          -1     Infinity
Data               2          3          -1     Infinity
DataProviders      2          3          -1     Infinity
DataServer         2          2           0     Infinity
Execution          3          2           1     3
FIX                1          4          -3     Infinity
Instruments        3          3           0     Infinity
Mathematics        3          1           2     3/2
Optimization       3          1           2     3/2
Performance        3          1           2     3/2
Providers          1          1           0     Infinity
Simulation         3          1           2     3/2
Trading            3          3           0     Infinity
TradingLogic       4          3           1     4

This table naturally leads to a discussion of tradeoffs such as rewriting versus refactoring. Components with negative equity and low asset value are candidates for replacement. Components with positive equity but middling asset value are of little interest: owning something of modest value is neither exciting nor worrying, although owning something of modest value that carries a heavy debt burden is of negative utility. Components with high asset value but low equity are a big concern; these are the components we need to invest in.

In addition to thinking about how much equity we have in each component, we can also think about how leveraged each component is, i.e. how much of a given component's asset value is backed by equity and how much is backed by liability. This measure of leverage is the ratio of assets to equity. An asset value of 3 with a liability of 2 leaves you with an equity value of 1, and you are leveraged 3-to-1, i.e. your asset value of 3 is backed by only 1 point of equity. Following the table's convention, any component with zero or negative equity is shown as infinitely leveraged, which indicates a severe debt burden.
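Here is a minimal sketch of that arithmetic, using a few rows from the table above and following the table's convention that zero or negative equity is reported as infinite leverage.

    # A minimal sketch of the equity and leverage arithmetic from the table.
    import math

    def balance_sheet_row(assets, liabilities):
        equity = assets - liabilities
        leverage = assets / equity if equity > 0 else math.inf
        return equity, leverage

    components = {            # (asset score, liability score) from the table
        "Execution": (3, 2),
        "FIX": (1, 4),
        "Mathematics": (3, 1),
        "TradingLogic": (4, 3),
    }

    for name, (assets, liabilities) in components.items():
        equity, leverage = balance_sheet_row(assets, liabilities)
        print(f"{name:12} equity={equity:+d} leverage={leverage}")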

The Technical Balance Sheet Applied

This exercise makes the decisions we face easier. In the example above, there are a number of components with an asset value of 2 and liabilities of 2 or 3. This led me to replace all the custom persistence code with a thin layer of my own design on top of db4o (an embeddable object database). I deleted the components Brokers and DataProviders, then developed my own components from scratch and extracted new interfaces.

The FIX component, with an asset value of 1 and liabilities of 4, obviously needed to go. However, although the component has high negative equity, I did some experimenting and found that the cost of removing this component was actually quite high due to proliferation of trivial references to the component. I have gradually replaced references to the FIX component and chipped away at dependencies upon it, and it will soon be deleted entirely.

There are a number of components with an asset value of 3 or 4 but with liabilities of 2 or 3. These are the most valuable parts of the product, containing the core business code. However, some have 20% or less test coverage, loads of duplication, sloppy design, and many worrisome potential bugs. Due to the high asset value, these components warrant investment. I thought about rewriting them, but in these cases the best bet is most often to pay down the technical debt by incrementally refactoring the code. A subtle bonus of refactoring instead of rewriting is that each mistake in the code reveals something that doesn't work well, which is valuable information for future development. When code is rewritten, these lessons are typically lost and mistakes are repeated.

We now have a sense of the debt, asset value and technical ownership that we've accumulated in our code base. Our next step is to more fully understand trade-off decisions by weighing discounted cash flows: the cost of carry versus the cost of switching. Put another way, the cost of supporting the technical debt versus the cost of eliminating it.

Wednesday, November 12, 2008

Kehoe - Your Performance Is Our Business

By John Kehoe - 12 November 2008

I know of an outfit with the motto ‘Your Performance Is Our Business.’ It doesn’t matter what they sell (OK, it’s not an E.D. product, let’s get that snicker out of the way); what matters is the mindset the motto instills: ‘You’ve got IT problems with a real impact on the business, and we have a solution focused on your business needs.’ It got me thinking: what mindset is at work in our IT shops? What would an IT shop look like with a performance mindset?

What that mindset means varies with how we define performance for our customers. Some shops have a simple requirement. I can’t count the times I’ve found that the critical application is email. It is a straightforward domain of expertise, albeit with significant configuration, security and archiving requirements. The key metrics for performance are availability, responsiveness and recoverability. Not an insurmountable challenge, except in the most complex of deployments.

There isn’t much performance to provide for messaging outside of availability and responsiveness. There isn’t alpha to be found. It’s a cost center, an enablement tool for information workers.

Performance comes into play with alpha projects. These are the mission-critical, money-making monstrosities (OK, they aren’t all monstrosities, but the ones I’m invited to see are monsters) being rolled out. This is very often the realm of enterprise message buses, ERP and CRM suites, n-tier applications and n-application applications (I like the term n-application applications, I made that one up. My mother is so proud). All are interconnected, cross-connected or outside of your direct control. Today, the standard performance approach is enterprise system management (ESM): large central alerting consoles with agents too numerous to count. ESM tools approach the problem from the building blocks and work their way up.

This is not the way to manage performance.

The problem with the low-level hardware and software monitoring approach is that performance data cannot be mapped up to the transaction level (yes, I said it; consider it flame bait). Consider the conventional capacity planning centers we have today. The typical approach is to gather system and application statistics and apply a growth factor. The more sophisticated approach uses additional algorithms and perhaps takes sampled system load to extrapolate future performance. Either way, the capacity planning model lacks the perspective of the end user experience.

The need for Transaction Performance Management (TPM)

TPM looks at the responsiveness of discrete business transactions across a complex application landscape. Think of this in the context of n-application applications: TPM follows the flow regardless of the steps involved, the application owner or the “cloud” (did I ever mention how much I loathe the term cloud?). The TPM method follows the transaction, determines the wait states and then drills into the system and application layers. Think of it as the inverse of ESM.
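As a toy illustration of the difference in perspective, a TPM-style measurement starts from the transaction and attributes time to its steps, rather than rolling up low-level system metrics. The transaction and step names below are invented for illustration; this is a sketch of the idea, not any particular TPM product.

    # A toy illustration of the TPM perspective: time one discrete business
    # transaction top-down and attribute the wait to each step.
    import time
    from contextlib import contextmanager

    waits = {}

    @contextmanager
    def step(name):
        start = time.perf_counter()
        try:
            yield
        finally:
            waits[name] = waits.get(name, 0.0) + (time.perf_counter() - start)

    def place_order():  # one discrete business transaction
        with step("validate order"):
            time.sleep(0.01)   # stand-ins for real work
        with step("check credit"):
            time.sleep(0.05)
        with step("route to exchange"):
            time.sleep(0.02)

    place_order()
    for name, seconds in sorted(waits.items(), key=lambda kv: -kv[1]):
        print(f"{name:18} {seconds * 1000:6.1f} ms")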

To exploit the TPM model you need a dedicated performance team. This is a break from the traditional S.W.A.T. team approach for application triage and the mishmash of tools and data used to ‘solve’ the problem. In a previous column, we discussed the implications of that approach, namely long resolution time and tangible business costs.

What would a TPM team look like?

  1. We need a common analysis tool and data source monitoring the application(s) architecture. This is essential to eliminate the ‘Everybody’s data is accurate and shows no problems, yet the application is slow’ syndrome.
  2. We need to look at the wait states of the transactions. It isn’t possible to build up transaction models from low-level metrics. The ‘bottom up’ tools that exist are complex to configure and rigid in their definition of transactions. They invariably rely on a complex amalgamation of multiple vendor tools in the ESM space. Cleanliness of data and transparency of transactions are hallmarks of a TPM strategy.
  3. We need historical perspective. Historical perspective provides our definition of ‘normal’. It drives a model based on transaction activity over a continuous time spectrum, not just a performance analysis at a single point in time.
  4. We need ownership and strong management. TPM implementations don’t exist in a vacuum, they cannot be given to a system administrator on an ‘other duties’ priority. TPM implementations require a lot of attention. They are not a simple profiler we turn on when applications go pear shaped. Systems, storage and headcount are needed, subject matter experts retained and architects available. Management must set the expectation that TPM is the endorsed methodology for performance measurement, performance troubleshooting and degradation prevention.
  5. TPM must quickly show value. Our team and implementation must be driven by tactical reality with a clear relationship to strategic value. We need to show quick performance wins and build momentum. For instance, there needs to be a focus on a handful of core applications, systematically deployed and showing value. Build success from this basis. Failure is guaranteed by the big bang global deployment or a hasty deployment on the crisis du jour.
  6. There must be a core user community. Our user community consists of people up and down the line, using the tool for everything from tactical problem solving to high-level business transaction monitoring. These people know the application or the costs to the business. Training is essential to organization-wide TPM buy-in.
  7. TPM must be embraced by the whole application lifecycle. It’s not enough to have operations involved. QA, load testing and development must be active participants. They can improve the quality of software product early on using the TPM methodology.

An example of a TPM success is a web-based CRM application I worked on recently. The global user population reported poor performance on all pages. The TPM team saw isolated performance issues on the backend systems, but these did not account for the overall application malaise. The supposition was that the last mile was too slow or that the application was too fat. We implemented the last TPM module to analyze the ‘last mile’. It turned out each page had a single-pixel GIF used for tracking. The GIF was missing and caused massive connection timeouts on every page. Putting the image back in the file system made the problem disappear. This raises the question: how would we have found this problem without the TPM approach?

TPM and Alpha

This TPM model sounds great, but what is the practical impact? Has anyone done this and had any bottom-line result to show for it? Many shops have succeeded with the approach. Here are two examples.

In the first one, a major American bank (yes, one that still exists) wrote a real-time credit card fraud detection application. The application has to handle 20,000 TPS and provide split-second fraud analysis. Before we implemented the TPM model, the application couldn’t scale beyond 250 TPS. Within weeks of implementing TPM, we scaled the application to 22,500 TPS. This was a $5mm development effort that prevented $50mm in credit card fraud. The TPM implementation cost $100k and was done within development.

The second is one of the top five global ERP deployments. The capacity planning approach forecast that 800 CPUs would be needed over two years to handle the growth in application usage. The prior year, a TPM center had been established that performs ruthless application optimization based solely on end user response times and transaction performance. The planning data from the TPM center indicated the need for 300 CPUs. The gun-shy (hey, they’d only been active a year) TPM center recommended 400 CPUs. In reality, only 250 CPUs were needed once the real-world performance of the application was measured and tuned over the subsequent year. The company saved $25mm in software and hardware upgrades. The savings provided the funds for a new alpha-driving CRM system that helped close $125mm in additional revenue over two years.

What is the cost of implementing a TPM center at a large operation? A typical outlay is $2mm including hardware, software, training, consulting and maintenance. An FTE is required as well as time commitments from application architects and application owners. Altogether we can have a rock solid TPM center for the price of a single server: the very same server we think we need to buy to make that one bad application perform as it should.




About John Kehoe: John is a performance technologist plying his dark craft since the early nineties. John has a penchant for parenthetical editorializing, puns and mixed metaphors (sorry). You can reach John at exoticproblems@gmail.com.

Thursday, November 6, 2008

Cross - Discovering Assets and Valuing Intangibles

By Brad Cross - 6 November 2008

In the last article in this series, we identified our technical liabilities by component, with each component representing a functional area. Now we will do the same with the asset side of the balance sheet.

The way a code base is structured determines the way it can be valued as an asset. If a code base matches up well to a business domain, chunks of code can be mapped to the revenue streams they support. This is easier with software products than with supporting infrastructure. Sometimes it is extremely difficult to put together a straightforward chain of logic mapping revenue to chunks of the code base. In situations like this, where the mapping is not so obvious, you have to be a bit more creative. One possibility is to value the code as an intangible asset, following the practices of goodwill accounting.

An even more complicated scenario is a monolithic design with excessive interdependence. In this case, it is very difficult to even consider an asset valuation breakdown, since you cannot reliably map from functional value to isolated software components. This situation is exemplified by monolithic projects that become unwieldy and undergo a re-write. This is a case where bad design and lack of encapsulation hurt flexibility. Without a well encapsulated component-based design, you can only analyze value at the level of the entire project.

Substitutability as a Proxy for Asset Valuation

One way to think about asset valuation is substitutability. The notion of code substitutability is influenced by Michael Porter's Five Forces model, and is similar to the economic concept of a substitute good. Think of code components in a distribution according to the degree to which they are cheap or expensive to replace. "Cheap to replace" components are those that are replaceable by open source or commercial components, where customization is minimal. "Expensive to replace" components are those that are only replaceable through custom development, i.e. refactoring or rewriting.

Substitutability of code gives us four levels of asset valuation that can be rank ordered:

  1. Deletable (lowest valuation / cheapest to replace)
  2. Replaceable
  3. Valuable
  4. Critical (highest valuation / most expensive to replace)
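As a tiny sketch, the scale can be captured directly in code. The component names are taken from the table that follows, and using the rank itself as the asset score mirrors how these values appear in the asset column of the technical balance sheet in the Technical Equity article.

    # A tiny sketch: the four substitutability levels, with the rank doubling
    # as the asset score used on the technical balance sheet.
    from enum import IntEnum

    class Substitutability(IntEnum):
        DELETABLE = 1    # little or nothing worth keeping
        REPLACEABLE = 2  # swap in an open source or commercial component
        VALUABLE = 3     # supports the top-level domain logic
        CRITICAL = 4     # the domain logic itself

    asset_values = {
        "Providers": Substitutability.DELETABLE,
        "Brokers": Substitutability.REPLACEABLE,
        "Simulation": Substitutability.VALUABLE,
        "TradingLogic": Substitutability.CRITICAL,
    }

    for module, level in asset_values.items():
        print(f"{module:12} {level.name.lower():12} asset value {int(level)}")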

Consider a collection of software modules and their corresponding valuation:

Module          Substitutability
Brokers                2
Data                   2
DataProviders          2
DataServer             2
Execution              3
FIX                    1
Instruments            3
Mathematics            3
Optimization           3
Performance            3
Providers              1
Simulation             3
Trading                3
TradingLogic           4


A critical component is top-level domain logic. In the example table, only the TradingLogic component is critical. This component represents the investment strategies themselves - supporting the research, development, and execution of trading strategies is the purpose of the software.

A valuable component exists to support the top level domain logic. Here, there is a trading component that supports the trading logic by providing building blocks for developing investment strategies. There is also a simulation component that is the engine for simulating the investment strategies using historical data or using Monte Carlo methods. You can also see a handful of other components that provide specialized support to the mission of research, development and execution of investment strategies.

A replaceable component is useful, but it is infrastructure that can be replaced with an open source or commercial component. Typically these are homegrown components that may do more than their off-the-shelf alternatives, but where an off-the-shelf replacement can easily be modified to meet requirements. In the example above you can see four components in this category. They relate to broker APIs and the persistence layer of the software, both of which are replaceable by a wide variety of alternatives.

A deletable component is one from which little or no functionality is used, allowing us either to delete the entire thing, or to extract a very small part and delete the rest. This includes the case where you "might need something similar" but the current implementation is useless. In the example, one component, “Providers,” is entirely useless.

Accepting Substitution

It is important to consider psychological factors when making estimates of value. Emotionally clinging to work we have done in the past will derail this process. For example, perhaps I have written a number of custom components of which I am very proud. Perhaps I think that work is a technical masterpiece, and over time I may have convinced others that it provides a genuine business edge. If we can't put our biases aside, we talk ourselves into investing time into code of low value, and we miss the opportunity to leverage off-the-shelf or open-source alternatives that can solve the same problem in a fraction of the time.

In this article we've defined a way to identify our assets by business function and to value them either based on a relationship to revenue streams or by proxy using their degree of substitutability. In the next installment, we’ll take a look at switching costs and volatility to set the stage for thinking in terms of cash flows when making trade-off decisions in our code.




About Brad Cross: Brad is a programmer.