Object-Oriented Equality

I discussed in my previous post the fact that we need better support for immutability in object-oriented languages. The problem is that it’s not so easy to add, because the object paradigm is rooted in the concept of an object being an entity whose state is normally mutable. I discuss here one of its cousins: object equality.

Some objects are more equal than others

First a quick recap. Object equality is an equivalence relation traditionally implemented through equals and hashCode. Here is the corresponding Javadoc:

The equals method implements an equivalence relation on non-null object references:
•    It is reflexive: for any non-null reference value x, x.equals(x) should return true.
•    It is symmetric: for any non-null reference values x and y, x.equals(y) should return true if and only if y.equals(x) returns true.
•    It is transitive: for any non-null reference values x, y, and z, if x.equals(y) returns true and y.equals(z) returns true, then x.equals(z) should return true.
•    It is consistent: for any non-null reference values x and y, multiple invocations of x.equals(y) consistently return true or consistently return false, provided no information used in equals comparisons on the objects is modified.
•    For any non-null reference value x, x.equals(null) should return false.

The definition implicitly allows mutable state to be used for equality. This is however dangerous, and equals should be defined only on immutable state when possible.

This is notably a requirement for collections, which need a stronger equality contract: if an object’s equality changes while the object is stored in a collection, the behavior of the collection may become inconsistent. See pitfall #3 in How to write equality in Java.

The simplest solution to this problem is to make object equality immutable, that is, to make the fields participating in the equivalence relation final – the equality can then never change post-construction.
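
As a minimal sketch (the Isbn class below is my own illustration, not taken from any of the referenced posts), an equality based only on final, immutable state could look like this:

import java.util.Objects;

// A minimal sketch: equals and hashCode are computed only from final fields,
// so the equality can never change post-construction.
public final class Isbn {
    private final String value;   // immutable state used for equality

    public Isbn(String value) {
        this.value = Objects.requireNonNull(value);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Isbn)) return false;
        return value.equals(((Isbn) o).value);
    }

    @Override
    public int hashCode() {
        return value.hashCode();
    }
}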

This is however not always possible, especially if the object requires initialization in a post-construction phase. And if the object’s equality depends on related objects, a strictly immutable object might still see its hashCode change when one of the objects it depends on is mutated. This leads to the concept of ownership, where objects that are owned should also be immutable. Pure ownership is however an object-level/run-time concern which is not easy to ensure (class-based ownership is possible but more limited).

As presented in “Understanding the Impact of Collection Contracts on Design”, we should then consider (i) the construction of the object, (ii) the mutability of the object’s state, and (iii) the “deepness” of the equivalence relation to reason about object equality.

We then actually have three kinds of fields:

–    final: never changes post-construction and the referenced object is immutable
–    eventually final: can change post-construction but will be frozen at some point in time
–    mutable: can be mutated at any time

Objects could be frozen post-creation, as proposed in “Flexible immutability with frozen objects”. The equivalence relation could use only final and eventually final fields. Owned objects could be modified only through their parent object, so as to ensure that the equality contract is never broken.
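
A hand-rolled sketch of an “eventually final” field could rely on a freeze flag checked by every setter (the class and method names below are mine, not an existing API):

// Sketch of an "eventually final" field: the object can be mutated during a
// post-construction phase, then frozen for good.
public class FreezableNode {
    private String key;          // eventually final: frozen at some point
    private boolean frozen;      // freeze flag, set once

    public void setKey(String key) {
        if (frozen) {
            throw new IllegalStateException("object is frozen");
        }
        this.key = key;
    }

    public void freeze() {
        this.frozen = true;
    }

    @Override
    public boolean equals(Object o) {
        // safe only if callers freeze the object before using it as a key
        return o instanceof FreezableNode
                && java.util.Objects.equals(key, ((FreezableNode) o).key);
    }

    @Override
    public int hashCode() {
        return java.util.Objects.hashCode(key);
    }
}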

There is no one notion of equivalence

The problem remains that there is no single “correct” implementation of object equality. It mostly depends on the usage. You may want to compare lists based on reference identity, but sometimes also based on their content. Should we then build a notion of “deepness” right into the equals operator, similar to the shallow and deep variants of cloning?

aList.deepEquals( anotherList )

Well, that’s actually what already exists with OrderedCollection, where you specify the Comparator to be used. Is it the solution? Should we remove object equality from the object and move it to the client of the object?

Set s = new HashSet( new ReferenceEqualityContract() );
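
Such an API does not exist in the JDK, but the idea can be approximated by wrapping the elements; the EquivalenceContract interface and the Wrapper class below are hypothetical names I use for illustration only:

import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: the equivalence relation lives in the client, not in
// the object itself.
interface EquivalenceContract<T> {
    boolean equivalent(T a, T b);
    int hash(T t);
}

final class Wrapper<T> {
    private final T value;
    private final EquivalenceContract<T> contract;

    Wrapper(T value, EquivalenceContract<T> contract) {
        this.value = value;
        this.contract = contract;
    }

    @Override
    @SuppressWarnings("unchecked")
    public boolean equals(Object o) {
        return o instanceof Wrapper
                && contract.equivalent(value, ((Wrapper<T>) o).value);
    }

    @Override
    public int hashCode() {
        return contract.hash(value);
    }
}

class ReferenceEqualitySetDemo {
    public static void main(String[] args) {
        // a "reference equality" contract for strings
        EquivalenceContract<String> byReference = new EquivalenceContract<String>() {
            public boolean equivalent(String a, String b) { return a == b; }
            public int hash(String s) { return System.identityHashCode(s); }
        };

        Set<Wrapper<String>> set = new HashSet<>();
        set.add(new Wrapper<>(new String("x"), byReference));
        // a different instance with the same content is NOT in the set
        System.out.println(set.contains(new Wrapper<>(new String("x"), byReference))); // false
    }
}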

Or should we be able to instantiate an object and specify which equality contract (or more generally a comparison contract) it should fulfill?

In this case, an object is limited to having only one equivalence relation at a time. (See this post for a Haskell sample of how to do it.)

One could specify the behavior of the object at creation time and see that as an object-level behavioral variation. The types of two objects with different equivalence relations should however be different, to ensure the relation is always symmetric (objects with different equivalence relations are never equal). This means that each specific variation would correspond to a type that inherits from a common type.

– Is object equality a subset of the problem of object comparison? Or are they fundamentally different? State used for comparison shouldn’t change while the object is in a collection, but it’s not necessarily part of the “primary key”…

– Should an object have only one equivalence or comparison definition at a time? Or could an object have several at the same time? Or one equivalence relation but several comparison definitions at a time? (We could easily imagine two lists with the same objects, one list storing the objects in ascending order and the other in descending order.)

The fact is that in a pure object world we should compare objects solely by reference. There should be no two objects with the same “primary key”. In “Declarative object identity using relation types”, the authors introduce the notion of scope so as to create and obtain the same reference for a given key. The risk then is to alter objects in an unexpected way and break other objects’ invariants (dangling aliases, etc.). It is simply impractical. The world is constantly evolving, and we are sometimes forced to have two objects for the same entity: one for the old and one for the new state (see this other post on persistent data structures). Such objects somehow relate to each other and are not completely “different”, hence the notion of object equality.

But this little digression still doesn’t solve the issue that, whatever equality contract the object must fulfill, we cannot ensure that it actually does.

And it gets worse with inheritance

“There is no way to extend an instantiable class and add a value component while preserving the equals contract, unless you are willing to forgo the benefits of object-oriented abstraction.”  — Effective Java

Indeed, if we have a Point class and a ColorPoint subclass, the equivalence relation may be broken: if ColorPoint redefines equals and hashCode to take the color into account, symmetry with Point is violated, and if it doesn’t, the color is simply ignored. See pitfall #4 in How to write equality in Java. To enforce the equality contract, two objects may be considered equal only if they have the same type.
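
As a sketch of the classic symmetry violation (the code below is mine, along the lines of the usual textbook example rather than the article’s exact code):

class Point {
    final int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }

    @Override public boolean equals(Object o) {
        return o instanceof Point && ((Point) o).x == x && ((Point) o).y == y;
    }
    @Override public int hashCode() { return 31 * x + y; }
}

class ColorPoint extends Point {
    final String color;
    ColorPoint(int x, int y, String color) { super(x, y); this.color = color; }

    // Takes the color into account: symmetry with Point is now broken.
    @Override public boolean equals(Object o) {
        return o instanceof ColorPoint
                && super.equals(o)
                && ((ColorPoint) o).color.equals(color);
    }
}

class SymmetryDemo {
    public static void main(String[] args) {
        Point p = new Point(1, 2);
        ColorPoint cp = new ColorPoint(1, 2, "red");
        System.out.println(p.equals(cp));  // true  (color ignored)
        System.out.println(cp.equals(p));  // false (symmetry violated)
    }
}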

The problem is that this is too restrictive. A context that uses color points, such as a graphic editor, may want to share the points with a context that has no notion of color, such as a 2D algorithm. The graphic editor would test the equivalence of two figures according to their color as well, while the 2D algorithm would test the equivalence of two points according to their position only, for instance to prevent a division by zero when two points coincide.

Also, once a class has overridden the default equals and hashCode, which implement reference equality, it is impossible for a subclass to fall back to that mode. As a consequence, we may end up with subclasses whose equivalence relation can’t be expressed at all. In such cases, comparing the objects should be forbidden.

Should we then re-think object equality with ternary logic, so as to be able to return true, false and N/A?

EDIT

Here I will gather other links related to the subject.

Object-Oriented Immutability

There seems to be a growing awareness of the benefits of immutability. Particular cases where immutable objects are most welcome are, for instance, value objects and defensive copying. The problem is that immutable objects don’t really fit well with the object paradigm.

Let’s consider an application which needs to deal with the concept of a revision. A revision is composed of a major and a minor revision number, e.g. “1.4”. This is a typical candidate for a value object. The only sensible methods that deal with revisions are inc, which increases the minor number, and promote, which jumps to the next major version (“1.3” -> “2.0”). I will consider only inc in this post.

Here would be the (trivial) code in Java:

public class Revision
{
   private final int major;
   private final int minor;
   public Revision( int major, int minor ) { ... }

   public Revision inc() {
       return new Revision( major, minor+1 );
   }
}

This is however only an ad-hoc solution. It does not work well with polymorphism, for instance: if a subclass of Revision is created, say NameRevision, calling inc on an instance results in a Revision, not a NameRevision (see the sketch below). This problem can be solved in other ways, and would for instance not appear in Smalltalk, where instantiation would be done with super.basicNew on the class side. But this just shows how mainstream OO languages are not designed for immutability.
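
A sketch of the problem, assuming NameRevision simply adds a name to the Revision value object above (the "api" label is a made-up example):

// Sketch of the polymorphism problem with the ad-hoc immutable Revision.
public class NameRevision extends Revision {
    private final String name;

    public NameRevision(String name, int major, int minor) {
        super(major, minor);
        this.name = name;
    }

    public static void main(String[] args) {
        NameRevision r = new NameRevision("api", 1, 0);
        // The inherited inc() executes "new Revision(major, minor + 1)":
        // the result is a plain Revision and the name is silently lost.
        Revision next = r.inc();
        System.out.println(next instanceof NameRevision); // false
    }
}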

How would that look in a functional language such as Clojure? A map would probably be used to hold the major and minor numbers, and the inc function would be as simple as setting a new value for the minor key in the map. Because any data modification results in a new structure, we don’t have to explicitly make a difference between creating and modifying a structure. See my discussion on persistent data structures.

(defn create-revision [] (hash-map :major 1 :minor 0))
(defn inc-revision [r] (assoc r :minor (+ (get r :minor) 1)))

Note that the inc function is “polymorphic” in the sense that if I pass a map with :minor, :major and :name, I obtain a copy which has all three keys.

The fundamental mismatch between OO and functional programming is that assignment has different semantics. In OO, assignment changes the data in place. In functional programming, assignment results in a copy.

And what if assignment also resulted in a copy in OO? That sounds a bit contradictory to the OO spirit, but well, let’s give it a try…

If assignments result in a copy, this means that something like this.field = “newvalue” actually returns a new instance of the object, right? We could then imagine a class like this:

public class immutable Revision
{
   int major;
   int minor;

   public int getMajor() {
      return this.major;
   }

   public Revision setMajor( int value ) {
      // assignment results in a copy!
      return ( this.major = value );
   }

   // same for minor

   public Revision inc()  {
      // bumps the minor number, yielding a new Revision
      return this.setMinor( this.getMinor() + 1 );
   }
}

We can somehow see this as a generalized form of the builder pattern, and it fits well with fluent interfaces:

Revision r = new Revision( 1, 0 );
r.inc(); // returns 1.1; r itself is unchanged
r.inc(); // returns 1.1 again

But

Revision r =
new Revision( 1, 0 )
.inc()
.inc(); // returns 1.2

Note the keyword immutable that is necessary to specify the semantics of assignment for the class (there will indeed always be some mutable classes).

We now have two semantics for the assignment, one producing a copy and one changing the data in-place.

It sounds logical that all classes in a class hierarchy share the same mutability: assigning a field from a subclass or a superclass should indeed produce the same result. But at the same time, doesn’t this also mean that we would need two Object types, one mutable and one immutable? Hmm…

Another sound restriction for such a language would be to restrict assignment to this. Something like obj.field = "newvalue" would be forbidden. This is a good practice anyway.

Also, assuming that the language supports closures, do they fit into the picture? I would say yes, but I’m not sure. Given that the signature of a closure is fixed (it’s a function after all) and that it cannot access any field of the object directly, the semantics should be consistent. This would however require a bit more formalization to ensure there is no hole…

As in some other posts, I don’t know whether the idea presented here holds up for real. But I do like immutable objects even in OO, and I think OO languages lack support for such designs. The current solutions for immutable objects are ad hoc, but we could imagine blending functional programming and OO a bit more, resulting in some objects having a hybrid behavior. Such objects would essentially be immutable data structures embedding at the same time the functions that correspond to them, which still fits the equation Object = Data + Logic.

More About Immutability

  • To mutate or not to mutate?
  • Frozen object in ruby
  • Java.next : immutability “But the benefit of using immutable data increases when you compose objects together. If you wanted to compose the synchronized version with another object, you would have to dig back into the internals of both objects, study their locks, and pick a new lock strategy to cover the two objects together. Composing operations with immutable objects is much easier.”

Taming Size and Complexity

The only real problem with modern software is size and complexity. If we had a bigger brain able to apprehend and reason about software as a whole without omitting details, we wouldn’t have that many issues. Unfortunately, our mental abilities are limited, and as a consequence, we need to have ways to build software whose complexity is beyond our own analytical power. The same is true for any large scale engineering initiative. Apart from discipline, which is a prerequisite to manage size and complexity, traditional ways to address size & complexity are: abstraction, automation and intuition.

Abstraction

The traditional way to address complexity is to raise the abstraction level. Get rid of details and stay focused on the essential – complexity goes away. You can then reason about the different parts at various abstraction levels independently of each other. This is the ground-breaking argument of any modeling effort or methodology. The problem is that the whole is not equal to the sum of its parts: unforeseen interactions will emerge, resulting in a myriad of potential problems. Another major problem is the traceability between the different parts.

Automation

The traditional way to address size is through automation. A lot of the tasks that we perform are tedious not because of their very nature, but because of the effort they demand. Our concentration is also limited, which implies we will make mistakes. Automation therefore leads not only to higher productivity but also to higher quality. There are too many examples of automated tasks to list, but code formatting and refactoring, for instance, fall into this category. Even though automation is extremely effective for specific tasks, it is also impacted by the complexity of the software to produce. State explosion is, for instance, one of the main problems of a technique such as symbolic execution.

Intuition

Actually, problem solving implies not only strong analytical skills but also some form of intuition. The same goes for software and program understanding. Software exploration and visualization are powerful techniques to reason about abstract information in an intuitive way. Software is intangible and consequently has no natural representation – this leaves the door open for new visualization metaphors. Examples of interactive visual development technologies are BPEL workflows, DSM, and polymetric views.

A Simple Categorization of Unit Tests

Unit testing has become incredibly popular over the past years. As a consequence, I sometimes feel like we’ve lost the focus on why we write unit tests and what we expect as a return on investment.

The success of unit testing probably comes from the simplicity of the approach. However, we’ve learned since then that unit testing is not so easy either. A high percentage of code coverage does not necessarily mean quality software, doesn’t mean that border cases have been covered, and can even impede software maintenance if the test suite is poorly organized.

Don’t get me wrong, I do value the benefits of unit testing.  But unit testing for the sake of unit testing has no value to me. If you think another form of automated testing will be a more rewarding strategy, then do it. If you go for unit tests, the important questions to ask are:

  • Did the unit tests reveal bugs in the production code?
  • Did you organise the unit tests in a meaningful way?
  • Did you spend time to identify and test border cases?

A clear “yes” to these three questions indicates an intelligent and probably rewarding testing effort. A pertinent test suite should reveal the bugs that you introduce in the system — and don’t pretend you don’t introduce any. A well-organized test suite should give you a high-level view of the features in the module. A well-crafted test suite should cover the nasty use cases of the system before they hurt you when the system is in production.

There is an abundant literature about unit testing, but nothing that I read seemed to cover the reality of the unit tests that I write. I therefore analysed the nature of my own unit tests and came up with a personal categorization which differs from the existing ones. Here are the 4 categories I’ve identified, as well as a short description of the pros/cons of each kind.

Basic unit tests

A basic unit test is a test suite which covers one single class and tests each method in a more-or-less individual fashion. This is the core idea of unit testing, where each tiny piece of functionality is tested in full isolation, possibly with the help of mock objects to break the dependencies. As a result, the tests should be repeatable and also independent (of the environment and of other modules).
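
A minimal sketch of such a test, assuming JUnit 4 and the Revision value object from the immutability post (the getMajor()/getMinor() accessors are assumed additions):

import static org.junit.Assert.assertEquals;

import org.junit.Test;

// A basic unit test: one class under test, one method exercised in isolation,
// no environment dependency.
public class RevisionTest {

    @Test
    public void incIncreasesTheMinorNumber() {
        Revision next = new Revision(1, 0).inc();

        assertEquals(1, next.getMajor());
        assertEquals(1, next.getMinor());
    }
}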

Problems with basic unit tests are:

Unit test with indirect assertions

For basic unit tests, the subject under test (SUT) is the very object for which we assert the behavior: we perform an action on the SUT and ensure it behaves correctly. This is however sometimes no longer possible as soon as the system becomes a bit more complicated. As a consequence, the actions are still performed on the SUT, but we rely on a level of indirection for the assertions; we then assert the behavior of an object other than the SUT. This is for instance the case when we mock the database and want to ensure that the rows are correctly altered (see the sketch after the list below).

  • Coupling with the implementation is still high
  • Fake objects might be used instead of mocks — they contain state and aren’t purely hard-coded anymore
  • Behaviour of the SUT is harder to understand due to the level of indirection
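
A sketch of an indirect assertion, with Mockito assumed as the mocking library; OrderDao and OrderService are hypothetical names, not taken from the post:

import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.verify;

import org.junit.Test;

// The action is performed on the SUT (OrderService), but the assertion is
// made on the mocked collaborator (OrderDao).
public class OrderServiceTest {

    public interface OrderDao {
        void insert(String orderId);
    }

    public static class OrderService {
        private final OrderDao dao;
        public OrderService(OrderDao dao) { this.dao = dao; }
        public void placeOrder(String orderId) { dao.insert(orderId); }
    }

    @Test
    public void placingAnOrderInsertsARow() {
        OrderDao dao = mock(OrderDao.class);
        OrderService sut = new OrderService(dao);

        sut.placeOrder("42");

        // the behavior is asserted on the collaborator, not on the SUT itself
        verify(dao).insert("42");
    }
}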

Inflection point unit test

Inflection points — a term coined by Michael Feathers if I’m right — are somehow the entry points to a given software module. For utility libraries or services, the inflection points correspond to the public API. Testing these specific points is the most rewarding strategy to me, and other people think the same.

  • Inflection points are less subject to changes, and are closer to a black-box form of testing
  • Tests become the first client of the interface and give you a chance to check whether the API is practical
  • After having covered all the common use cases of the inflection point, the test coverage of the underlying classes should be close to 100%. If not, this indicates a potential design weakness or useless code.

Dynamic unit tests

I qualify tests as “dynamic” when their execution changes from one run to the next. The primary goal of such tests is to simulate the dynamicity and variability of the production system. Such tests are however quite far away from the concept of basic unit tests and could be considered to some extent as integration tests; they are however still repeatable and independent of other modules. Execution in the real system may indeed be affected by:

  • Threading issues
  • Contextual data, e.g. cache
  • Nature of the input

Most of these tests rely on randomization, for instance to generate input data or to disrupt the scheduling of threads. Though it’s more complicated, randomization and fuzzing have proved to be effective techniques to detect issues which would never arise under fixed execution conditions. Think for instance of phase-of-the-moon bugs and date problems.
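
A sketch of such a randomized test, again assuming the Revision class with a getMinor() accessor; the seed is logged so that a failing run can be reproduced:

import static org.junit.Assert.assertEquals;

import java.util.Random;

import org.junit.Test;

// A "dynamic" unit test: the input changes from one run to the next.
public class DynamicRevisionTest {

    @Test
    public void incAlwaysIncreasesMinorByOne() {
        long seed = System.currentTimeMillis();
        System.out.println("seed=" + seed); // log the seed for reproducibility
        Random random = new Random(seed);

        for (int i = 0; i < 100; i++) {
            int major = random.nextInt(100);
            int minor = random.nextInt(100);
            assertEquals(minor + 1, new Revision(major, minor).inc().getMinor());
        }
    }
}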

Abstraction-Level vs. Meta-Level

As per Wikipedia, a model is a pattern, plan, representation (especially in miniature), or description designed to show the main object or workings of an object, system, or concept. In the case of software engineering, models are used to represent specific aspects of a software system, such as static structures, communication paths, algorithms, etc.

Models can be created along two axes:

Abstraction level

An abstraction level is a way of hiding the implementation of a particular set of functionality. The representations at two abstraction levels are however two descriptions of the same reality; they can vary in their level of detail or in their expressiveness. Examples:

C code – assembler code.
Both representations describe a sequence of operations in an imperative way, but C is much more expressive than assembler.

Java code – sequence diagram.
Both representations describe the run-time behavior of the system. The sequence diagram does however frequently omit details of the call flow.

Meta-level

A meta-level (and meta-model) is a way to highlight the properties – the nature – of a particular set of elements. A meta-model then describes the constraints or the semantics of any set of such elements.

Object – class.
In OO technologies, objects and classes belong to two different meta-levels. The class is a representation of the structure (constraints) that any of its instances will fulfill.

XML – DTD.
An XML file can be validated against a DTD, which is not a representation of the particular file, but of the structure (constraints) that the file must fulfill.

Tower of models

Models themselves can be refined towards either higher abstraction levels or higher meta-levels.

Component diagram – class diagram – Java code.
All three artifacts are representations of the same reality at various levels of refinement and reasoning.

XML – Schema – Meta schema.
In this case, each model is a meta-description of the underlying one. XML Schemas themselves must conform to the schema http://www.w3.org/2001/XMLSchema.

As with the XML example, the tower of meta-models generally ends up in a model specified in itself. That is, a recursion is introduced to stop the tower of meta-models from growing infinitely.

The term “modeling” commonly refers to “raising the abstraction level”; the term “meta-modeling” is used otherwise. Different tools/technologies don’t necessarily represent different meta-levels. Conversely, one technology may be used to address different meta-levels.

With respect to OO technologies, several UML diagrams and notations work at different meta-levels. Object diagrams and sequence diagrams describe the interaction between specific instances of objects. Class diagrams specify constraints (such as cardinality) or semantic relationships between objects (such as aggregation). Stereotypes can then be used to attach constraints (such as “singleton”) or semantics (such as “entity”) to the class itself. For instance, an “entity” stereotype could be defined to flag all classes with an attribute “primary key” of type integer.

Level \ Technology | Language | UML
L0                 | Object   | Object diagram, Sequence diagram
L1                 | Class    | Class diagram
L2                 | –        | Stereotype

Examples where meta-models are useful

Having three meta-levels provides one extra level of indirection that can be very practical in some cases:

Level            | XML transformation | OO refactoring | Annotation processing
L0 – information | Data               | Object         | Class annotation (e.g. @Entity)
L1 – model       | Schema             | Class          | Annotation processor (e.g. JPA)
L2 – meta-model  | Meta schema        | Metaclass      | Classloader

Here are three common examples which actually leverage the strength of meta-models.

XML transformation

The success of XML is related to the existence of a meta-model which makes it a semi-structured format. Structured data are interpretable only by their host application, which is the only one to know the model. Unstructured data (plain text, images, etc.) have no model and are not interpretable at all. Semi-structured data have a model which is itself understandable by any other application that knows the meta-model. Thanks to this extra level, reusable parsers, validators, encryptors or even arbitrary transformers (think XSLT) can be created. The schema can also be extended without necessarily breaking backward compatibility (think XPath).
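
As an illustration of such a reusable transformer, the JAXP API shipped with the JDK can apply an arbitrary XSLT stylesheet to any XML document (the file names below are placeholders):

import java.io.File;

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// A generic transformation driver: it knows nothing about the documents it
// transforms, only the XML meta-model.
public class XsltTransformDemo {
    public static void main(String[] args) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new File("stylesheet.xsl")));
        transformer.transform(new StreamSource(new File("input.xml")),
                              new StreamResult(new File("output.xml")));
    }
}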

OO refactoring

Refactoring rules are a typical example of the usage of the OO meta-model: refactoring rules are proven to preserve the validity of the model against the meta-model, and they are reusable for any model.

Annotation processing

Annotation processing is yet another example where a meta-model is used under the hood. Just like the class loader knows how to load a class because of the meta-model of any class, it is also able to load the corresponding annotations. This relieves the application from storing application-specific semantics in separate files (think deployment descriptors). The application (annotation processor) can then query the annotations at run-time to refine the behavior of the system. The annotations enrich classes with additional constraints and behaviour in a reusable way: the same annotation processor can be used for arbitrary sets of classes.
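
A minimal sketch of this mechanism, using a hypothetical @Entity annotation (loosely inspired by JPA) that is loaded along with the class and queried reflectively at run-time:

import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

public class AnnotationDemo {

    // Hypothetical annotation carrying application-specific semantics.
    @Retention(RetentionPolicy.RUNTIME)
    @interface Entity {
        String table();
    }

    @Entity(table = "CUSTOMER")
    static class Customer { }

    public static void main(String[] args) {
        // a generic "processor" can query the annotation for any class
        Entity entity = Customer.class.getAnnotation(Entity.class);
        System.out.println("Mapped to table: " + entity.table());
    }
}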

Pharo By Example

This is an introduction book to Smalltalk and the Pharo development environment. The book is split into two parts: the first one covers the Smalltalk language, the Pharo IDE, the Morphic package for GUI applications, and the Seaside framework for web applications. The second part is relatively thin and covers more advanced topics such as reflection.

The book is easy to read, with many examples to follow, and the reader will quickly acquire a good comprehension of Smalltalk. The object-oriented concepts specific to Smalltalk are however rigorously discussed. A good example would be the paragraph about class variables in the case of class inheritance: it’s easy to follow in the book, but it is still an OO concept that could otherwise be misunderstood. These paragraphs are the most important ones for people (like me) coming from other OO languages.

The GUI and web frameworks are only briefly covered, but enough to get an idea of how they work. Both are component-based, and in both cases a concrete component is built from A to Z to gain insight into the frameworks.

The second part, about reflection, is where the beauty of Smalltalk shines. First the concepts behind meta-objects and reflection are presented, and then a few useful patterns which use reflection are described, for instance how to build a dynamic proxy in Smalltalk. This is clearly the most interesting part, where the dynamic nature of Smalltalk can be exploited in a constructive way.

Software Evolution

Software evolution studies software engineering from the perspective of software maintenance and change management. The term was coined in the 70s after the seminal work of Lehman and his laws of software evolution.

As such, software evolution is a topic as broad as software engineering itself – any aspect of software engineering (be it a process, methodology, tool, or technique) can indeed be studied from the perspective of software evolution.

The book is a comprehensive overview of the current body of knowledge in software evolution. It contains 10 chapters, each addressing a specific field of software engineering from the perspective of software evolution. The articles present recent work and open challenges; the book can thus be seen as a comprehensive survey as well as a research agenda. The book’s bibliography is impressive: more than 500 articles are referenced throughout the 10 chapters.

Apart from this selection of articles, the book has an excellent introduction to the field of software evolution, and a set of useful appendices pointing to additional resources in this field. The 10 chapters are organized into 3 parts:

Part I: Program understanding and analysis

Identifying and Removing Software Clones
Analysing Software Repositories to Understand Software Evolution
Predicting Bugs from History

Part II: Re-engineering

Object-Oriented Reengineering
Migration of Legacy Information Systems
Architectural Transformations: From Legacy to Three-Tier and Services

Part III: Novel trends in software evolution

On the Interplay Between Software Testing and Evolution and its Effect on Program Comprehension
Evolution Issues in Aspect-Oriented Programming
Software Architecture Evolution
Empirical Studies of Open Source Evolution

Chapters I enjoyed the most were “Object-oriented reengineering”, “On the interplay between…” and “Evolution Issues in AOP”.

I wish there were a chapter about evolutionary issues in model-driven engineering, which is an important area of research. I would therefore recommend “Model-Driven Software Evolution: A Research Agenda” as a complement to this book.

The book is accessible to non-experts and I learned a lot while reading it. It is definitely worth looking at for anyone interested in understanding what “software evolution” is all about.

Quotes

… mostly on software engineering, but not only.

“How does a project get to be a year late?…One day at a time.” — The Mythical Man-Month

“Complexity kills. It sucks the life out of developers, it makes products difficult to plan, build and test. […] Each of us should […] explore and embrace techniques to reduce complexity.”  — Ray Ozzie, CTO, Microsoft

“Perfection is achieved not when there is nothing more to add, but when there is nothing left to take away.” — Saint-Exupéry

“We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.” — D. Knuth

“Beware of bugs in the above code; I have only proved it correct, not tried it” — D. Knuth

“Research is less about discovering the fantastic, as it is about revealing the obvious. The obvious is always there. It needs no discovery. It only needs us to take a look from a different perspective to see it.” — On T. Girba’s blog

“It is impossible for any number which is a power greater than the second to be written as a sum of two like powers. I have a truly marvelous demonstration of this proposition which this margin is too narrow to contain.” — P. Fermat

“These are my principles. If you don’t like them, I have others.” — Groucho Marx

“if all you have is a hammer, everything looks like a nail…”  — B. Baruch

“Discipline is the bridge between goals and accomplishment.” — Jim Rohn.

“If I am to speak ten minutes, I need a week for preparation; if fifteen minutes, three days; if half an hour, two days; if an hour, I am ready now.”  —  Woodrow Wilson

“The devil is in the details” — German proverb

“The first idea is always wrong” — Popular wisdom

“What can go wrong will go wrong” — Murphy’s law

“The nice thing about standards is that there are so many of them to choose from.” — Andrew S. Tanenbaum

“The manager asks how and when; the leader asks what and why.” — Warren Bennis

“Experience is what you get when you don’t get what you want.” — Dan Stanford

“A mission is something that you do in a bounded amount of time with bounded resources. A purpose is something that you never achieve.” — William Cook

“Any fool can write code that a computer can understand. Good programmers write code that humans can understand.” — Martin Fowler

“The competent programmer is fully aware of the strictly limited size of his own skull; therefore he approaches the programming task in full humility, and among other things he avoids clever tricks like the plague.” — E.W. Dijkstra, The Humble Programmer

“There are wavelengths that people cannot see, there are sounds that people cannot hear, and maybe computers have thoughts that people cannot think” — Richard Hamming

“Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.” — Brian Kernighan

“If others would think as hard as I did, then they would get similar results.” — Newton

“Rules of thumb are sometimes what their acronym infers…ROT” — unknown

“Any intelligent fool can make things bigger and more complex. It takes a touch of genius – and a lot of courage – to move in the opposite direction.” — Albert Einstein

“When I use a word,” Humpty Dumpty said, in a rather scornful tone, “it means just what I choose it to mean — neither more nor less.” — Lewis Carroll, Through the Looking Glass

“There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult.” — Tony Hoare

“He who hasn’t hacked assembly language as a youth has no heart. He who does as an adult has no brain.” — John Moore

“If you can’t explain it simply, you don’t understand it well enough” — A. Einstein

“There are two groups of languages: those which everyone hates and those which nobody uses.” — Unknown

“Any problem in computer science can be solved with another layer of indirection” — David Wheeler

“Idiomatic usage often matters more than what is strictly possible.” — A comment from Stuart Halloway

“You learn lisp for the same reason you learn latin. It won’t get you a job but it will improve your mind” — Paul Graham

“The Internet is the world’s largest library. It’s just that all the books are on the floor.” — John Allen Paulos

“There are only two hard things in Computer Science: cache invalidation and naming things” — Phil Karlton

“Some abstraction is great, but eventually you need to write code that does something useful.” — Bill Karwin

“Tell me, I forget. Show me, I remember. Involve me, I understand.” — Chinese proverb.

“Don’t reinvent the wheel unless you want to learn about wheels.” — Popular wisdom

“I’m sorry this letter is so long, I didn’t have time to write a shorter one”  — Mark Twain

“To be trusted is a greater compliment than to be loved” — George MacDonald

“Most people work just hard enough not to get fired and get paid just enough money not to quit” — George Carlin

“Find your obsession, make it your profession, and you’ll never work a day in your life” — Unknown

“Plans are nothing; planning is everything.” – Dwight D. Eisenhower

“If you set your goals ridiculously high and it’s a failure, you will fail above everyone else’s success.” — James Cameron

“A human being should be able to change a diaper, plan an invasion, butcher a hog, conn a ship, design a building, write a sonnet, balance accounts, build a wall, set a bone, comfort the dying, take orders, give orders, cooperate, act alone, solve equations, analyze a new problem, pitch manure, program a computer, cook a tasty meal, fight efficiently, die gallantly. Specialization is for insects.” — Robert A. Heinlein

“Tools can enable change in behavior and eventually change culture” — Patrick Debois

“First you learn the value of abstraction, then you learn the cost of abstraction, then you’re ready to engineer.” — @KentBeck

“Simplicity does not precede complexity, but follows it.” — Alan Perlis

“The most important thing in communication is hearing what isn’t said.” – Peter F. Drucker

“Computer science is now about systems. It hasn’t been about algorithms since the 1960’s.” – Alan Kay

“[Some people] just can’t wrap their head around that someone might build something because they like building things.” – Mark Zuckerberg in an interview

“So much complexity in software comes from trying to make one thing do two things.” – Ryan Singer

Threat modeling: tools in practice

We’ve investigated two tools for our threat model. Here is an overview of both tools (from Microsoft) and our experience with them.

Threat Modeling Tool

The first tool supports system modelling with the definition of Entry Points, Trust Levels, Protected Resources, plus some general background information. Data Flows can be authored directly with the tool or imported from Visio. The tool’s main strength comes however from its Threat Tree modelling features. Threats can be identified and decomposed into a series of steps following the same approach as attack trees. The tool supports AND and OR semantics when constructing advanced threat trees. E.g.: “Disclose password” can be achieved with either “Brute force password” or “Use default password”. Each step in the threat tree can be rated according to DREAD, and mitigation notes can be added.

Microsoft Threat Analysis & Modeling

The second tool supports system modelling – called application decomposition in this case – with the definition of Roles, Data, Components and External Dependencies. The traditional data flows are replaced with “application use cases” that can be modelled right from the tool using the predefined components, data and roles. The threat identification and decomposition follows the threat-attack-vulnerability-countermeasure model. The tool comes with an existing base of attacks and vulnerabilities. E.g.: the attack “SQL injection” is made possible because of the “Usage of Dynamic SQL” vulnerability. Threats that have been identified can then be categorized using the CIA (confidentiality, integrity, availability) classification. Threats are not related to attacks or vulnerabilities but directly to countermeasures. This was a bit awkward to me and I don’t understand exactly the rationale behind this choice. It’s also interesting to note the existence of so-called “Relevancies” that can then be linked to attacks. E.g.: the relevancy “Component utilizes HTTP” is linked to the attacks “Response Splitting”, “Session Hijacking”, and “Repudiation Attack”.

The tool’s ambitions are much bigger and go beyond a simple documentation of the system’s security. The tool contains analytics and visualization features. Analytics features aim at assessing the system automatically (e.g. “Data access control matrix”), whereas visualization features help to understand the system from different viewpoints (e.g. “call flow”, “data flow”, etc.). These features make sense only if the complete system was modelled with the tool, especially the application use cases.

The tool is actively maintained and significant efforts have been invested in it to promote the threat modelling practice.

Conclusion

Threat modelling is not easy. Several approaches can be used and it can be a time-consuming activity. We didn’t model the system in enough detail to leverage the analytics and visualization features of the Microsoft Threat Analysis & Modeling tool, so we cannot assess the relevance of such an analysis. Speaking generally about the process, our main problem was to decide to which level of detail we wanted to go, and then how to organize our findings in a meaningful way.

The basic notions of threats, attacks, vulnerabilities and countermeasures can be vague or overlapping, and they must first be defined in the scope of the analysis so as to ensure the consistency of the threat model. Sample questions to ask prior to starting the analysis are:

  • What will be the granularity of the threats to identify? E.g.: “disclose customer information” vs. “disclose customer document”, “disclose customer number”, “disclose credit card number”. The extent to which the threat analysis will be performed must be defined in advance and will drive the complete process.
  • What will be the granularity and the nature of the attacks to identify? An attack can be concrete (e.g. “password brute force attack”) or abstract (e.g. “denial of service attack”). In the first case, an attack is a concrete technique that can be applied to exploit a vulnerability. In the latter case, an attack is an abstraction of a set of related techniques. Such an abstract attack could be refined into a list of concrete attacks (e.g. “SYN flood”, “XML bomb”, etc.).
  • How to capture generic attacks/threats? This again relates to the question of the granularity of the analysis. Consider for instance the attack “Submit HTTP form twice”. Because it can be applied to all web pages, the list of threats would be an exhaustive listing of the application features: “Place wrong order”, “Rate the item more than once”, etc. Such an attack can be captured into generic threats such as “Abuse web application” or “Disrupt system”. Similarly, all the “denial of service” attacks will lead to the generic threat “Degrade service availability”.

The threat model must not be an exhaustive but useless document. Therefore, the analyst must find a balance between generic attacks and threats (and the corresponding generic hardening best practices) and more detailed attacks and threats related to the specificities of the system under study.

Threat modeling: overview

Threat modelling is a process of assessing and documenting a system’s security risks. The threat model identifies and describes the set of possible attacks on your system, as well as mitigation strategies and countermeasures. Your threat modelling efforts also enable your team to justify security features within a system, or security practices for using the system, to protect your corporate assets.

Any threat modelling process will usually encompass the following steps:

1) A model of the system that is relevant for the threat analysis
2) A model of the potential threats
3) A categorization and rating of the threats
4) A set of countermeasures and mitigation strategies

There are however several approaches to perform each of the steps. We will now briefly give an overview of each step.

Step 1: Model your system

The system model is an abstraction of your system that fits the threat analysis. It differs from other traditional models in the sense that it is a mix between a deployment view, a data view and a use case view.

The system entry & exit points – The entry and exit points are the ways through which data enters and leaves the system, from and to the external environment.
The actors & external dependencies – The actors and the external dependencies are the entities that legitimately interact with the system. The actors tend to represent real user roles, whereas an external dependency usually refers to a third-party system. The distinction between the two can sometimes be blurry: an external system could be considered an actor when it is the active participant in the interaction.
The trust levels & boundaries – Trust levels define the minimal access granted to an actor within the system. For example, a system administrator actor may have a trust level that allows them to modify files in a certain directory on a file server, while another user entity may be restricted from modifying files.
The assets – An asset is an item of value, an item security efforts are designed to protect. It is usually the destruction or acquisition of assets that drives malicious intent. A collection of credit card numbers is a high-value asset, while a database that contains a candy store’s inventory is probably a lower-value asset. An asset is sometimes called a protected resource.
The use cases – The use cases for operating on the data that the application will facilitate.
The assumptions – All the assumptions that were driving the modelling effort. Considering the cryptographic algorithm to be either public or private is, for instance, an assumption worth mentioning.

Step 2: Model your threats

Let’s first define some concepts:

Threat – The possibility of something bad happening
Attack – A means through which a threat is realized
Vulnerability – A flaw in the product
Countermeasure – A means to mitigate a vulnerability

« Threats are realized through attacks which can materialize through certain vulnerabilities if they have not been mitigated with appropriate countermeasures »

A concrete example would be:

Threat  – Perform arbitrary query on the system
Attack – Access internal service exposed to the end-user
Vulnerabilities – (1) Firewall misconfigured (2) Lack of access control
Countermeasure – (1) Correct firewall rules (2) Secure the EJB correctly

« Arbitrary queries can be executed through access to the back-end EJB, which is possible because of a wrong firewall configuration and a lack of access control, if the infrastructure and the application server were not configured correctly »

An advanced attack is frequently composed of a series of preliminary attacks which will exploit several vulnerabilities. The attacks can be represented as a tree, called an attack tree.
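
A sketch of such an attack tree with AND/OR nodes, reusing the “Disclose password” example from the tool overview above; the likelihood figures and the propagation rule are illustrative only, not part of either tool:

import java.util.Arrays;
import java.util.List;

// Sketch of an attack tree: a threat is decomposed into AND/OR steps.
public class AttackTree {

    enum Kind { LEAF, AND, OR }

    final String name;
    final Kind kind;
    final double likelihood;          // only meaningful for leaves
    final List<AttackTree> children;

    AttackTree(String name, Kind kind, double likelihood, AttackTree... children) {
        this.name = name;
        this.kind = kind;
        this.likelihood = likelihood;
        this.children = Arrays.asList(children);
    }

    // Naive likelihood propagation: AND multiplies (all steps are needed),
    // OR takes the maximum (the easiest branch dominates).
    double estimate() {
        switch (kind) {
            case LEAF: return likelihood;
            case AND:  return children.stream().mapToDouble(AttackTree::estimate)
                                      .reduce(1.0, (a, b) -> a * b);
            default:   return children.stream().mapToDouble(AttackTree::estimate)
                                      .max().orElse(0.0);
        }
    }

    public static void main(String[] args) {
        AttackTree disclosePassword = new AttackTree("Disclose password", Kind.OR, 0,
                new AttackTree("Brute force password", Kind.LEAF, 0.1),
                new AttackTree("Use default password", Kind.LEAF, 0.3));
        System.out.println(disclosePassword.estimate()); // 0.3
    }
}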

The level of detail in the identification of threats, attacks and vulnerabilities is up to the analyst.

Step 3: Categorize and rate your threats

Once the threats, attacks and vulnerabilities have been identified, the threats can be categorized. Popular categorization schemes are STRIDE or CIA.

STRIDE:

Spoofing – To illegally acquire someone’s confidential information and use it
Tampering – To maliciously modify information that is stored, in transit or otherwise
Repudiation – A malicious user denying having committed an action that he/she is unauthorised to do or that hampers the security of an organisation, while the system keeps no trace of the action; the action cannot be proved
Information Disclosure – To view information that is not meant to be disclosed
Denial of Service – Sending or directing more network traffic to a host or network than it can handle, so that it becomes unusable to others
Elevation of Privilege – To increase the adversary’s trust level in the system, permitting additional attacks

CIA:

Confidentiality – To ensure that information is accessible only to those authorized to have access
Availability – The ratio of the total time a functional unit is capable of being used during a given interval
Integrity – To ensure that the data remain an accurate reflection of the universe of discourse they are modelling or representing, and that no inconsistencies exist

Once categorized, the threats can be rated according to the risk they represent. The total risk can be evaluated according to DREAD:

DREAD

Damage Potential – Defines the amount of potential damage that an attack may cause if successfully executed.
Reproducibility – Defines the ease with which the attack can be executed and repeated.
Exploitability – Defines the skill level and resources required to successfully execute an attack.
Affected Users – Defines the number of valid user entities affected if the attack is successfully executed.
Discoverability – Defines how quickly and easily an occurrence of an attack can be identified.

If attack trees have been modelled, the risk can be estimated based on the likelihood of each step in the attack tree.

Step 4: set up countermeasures and mitigation strategies

Once the threats, attacks and vulnerabilities have been identified and documented, a set of countermeasures can be set up. Such strategies aim at reducing the attack surface and mitigating the potential effects of an attack. If the threat cannot be removed altogether, its probability should be reduced to an acceptable threshold.

Wrap up

The following quote summarizes well the rationale behind threat modeling. « Threat modeling is not a magic process/tool where you just throw stuff in and out comes goodness. Threat modeling is a structured way of thinking about and addressing the risks to what you are about to build, rather than going about it randomly. »

References 

http://blogs.msdn.com/threatmodeling/archive/2007/10/30/a-discussion-on-threat-modeling.aspx
http://blogs.msdn.com/krishnanr/archive/2004/10/27/248780.aspx
http://www.devx.com/security/Article/37502/0/page/4
http://www.schneier.com/paper-attacktrees-ddj-ft.html#rf2
http://www.agilemodeling.com/artifacts/securityThreatModel.htm 
https://martinfowler.com/articles/agile-threat-modelling.html