SAFe: The Lean Mindset

An interesting aspect of the SAFe framework is that it tries to combine two agile mindsets. The first is the iterative mindset of methods like Scrum. It’s a cornerstone of agile development, and SAFe “scales” it from the team level to the program level, for instance with PI Planning.

Another mindset in SAFe is the lean mindset. The lean mindset is not about iteration, but about optimising the flow of value.

Lean came initially from manufacturing, where the goal is to (1) reduce the time needed to produce physical goods, (2) reduce the “inventory” needed in the process, and (3) reduce the “waste” produced during manufacturing. In manufacturing, managing inventory requires warehousing and logistics, which costs money. Materials that end up as waste cost money too but do not produce value. To reduce delivery time, each step in the delivery process must be optimised and wait time reduced to a minimum.

These ideas can be translated to the software world if we consider that features under development are “inventory” and the development process is a pipeline that can be optimised. Features under development are “inventory” since they don’t produce value yet but must be managed. Waste is a bit harder to map, but it represents all the unnecessary work that ends up not being used (think of unused design documents, analyses, etc.). The development pipeline can take many forms but is always a variation of define, build, verify, and release. The quicker a feature moves through the pipeline, the faster you produce value.

Lean in itself doesn’t require iteration. Iterations are needed to manage uncertainty and course-correct the product development in the face of new information. Lean is about optimising a delivery process. But the delivery process could be about the delivery of a similar item every time, like cars in the manufacturing world.

But Lean is also a great complement to iterative approaches like Scrum. In this case, the goal of the lean mindset is, in a way, to optimise the iteration speed. Rather than having many features with long delivery times, focus on a few features with short delivery times.

SAFe emphasises the lean mindset with concepts like the continuous delivery pipeline and value stream mapping. Besides presiding over the process, the RTE is also charged with improving the flow of value in the organisation.

The lean mindset isn’t as established as the iterative mindset. I find it interesting that SAFe integrates it and promotes it. We conducted a value stream mapping session at work, and it was very enlightening. Thinking in terms of waiting time, inventory, and waste does indeed work in the software world, too.

It’s a simple way to highlight process and organisational issues. It clarifies what should be optimised and keeps you from getting lost in organisation design. Chances are, if you want to reduce waiting time, you will have to solve a bunch of other problems first. The lean mindset positions these problems not as ends in themselves, but as bottlenecks to short delivery time. It helps you prioritise these problems. It’s a bit like test-driven development (TDD). Making things testable requires that you figure out a good design first. But assessing testability is easier than assessing “good design”. In the case of Lean, minimising “waiting time” requires that you figure out a good organisation first, but measuring “waiting time” is easier than measuring “good organisation”.
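To make the "measuring waiting time is easy" point concrete, here is a minimal sketch of what a value stream map boils down to numerically. The step names follow the define/build/verify/release pipeline mentioned earlier; the durations are invented for illustration, and "flow efficiency" is assumed to mean active work time divided by total lead time.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    work_days: float  # time actively spent on the feature in this step
    wait_days: float  # time the feature sits idle before moving on


# Hypothetical pipeline with made-up numbers, for illustration only.
PIPELINE = [
    Step("define", work_days=2, wait_days=10),
    Step("build", work_days=5, wait_days=3),
    Step("verify", work_days=2, wait_days=8),
    Step("release", work_days=1, wait_days=9),
]


def lead_time(steps):
    """Total elapsed time from idea to release (work plus waiting)."""
    return sum(s.work_days + s.wait_days for s in steps)


def flow_efficiency(steps):
    """Share of the lead time during which someone actually works on the feature."""
    return sum(s.work_days for s in steps) / lead_time(steps)


def biggest_wait(steps):
    """The step with the longest idle time, i.e. the first bottleneck to look at."""
    return max(steps, key=lambda s: s.wait_days)


if __name__ == "__main__":
    print(f"lead time: {lead_time(PIPELINE)} days")
    print(f"flow efficiency: {flow_efficiency(PIPELINE):.0%}")
    print(f"longest wait after: {biggest_wait(PIPELINE).name}")
```

The numbers say nothing about why a feature waits ten days after being defined, but they tell you where to start asking.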

Silly Product Ideas that Win

When Twitter appeared more than a decade ago, I thought it was silly. I saw little value in a service that only allowed sharing 140-character text messages. I registered on a bunch of social media platforms at the time and nevertheless created a Twitter account. Some years later, the only social media platform I’m actively using is… Twitter.

There’s a lesson here for me: it’s hard to predict what will succeed. A lot of products can appear silly or superficial at first. They may appear so in the current time frame, but this can change in the future. Initially, Twitter was full of people microblogging their lives. It was boring. But it morphed into a platform that is useful for following the news.

A startup like Mighty can look silly now: why would you stream your browser from a powerful computer in the cloud? But as applications are ported to the web, maybe the boundary between thin client and server will move again.

We prefer to endorse projects that appear profound and ethical, like supporting green energy or reducing poverty. Product ideas that are silly or superficial don’t match these criteria, and it’s easy to dismiss them. But innovation often happens because of such products. No matter how silly or superficial you think they are, if they gain traction, they need to solve their problem well at scale. These products are incubators for technologies that can be used in other contexts. Twitter, for instance, open-sourced several components. If Mighty gains traction, it might lead to new protocols for low-latency interactive streaming interfaces. An obvious candidate for such a technology could be set-top TV boxes.

These products might appear superficial at first and might lack the “credibility” of other domains, but here too, the first impression can be misleading. A platform like Twitter can support free speech and democracy (sure, there are problems with the platform, but it at least showed there are other ways to have public discourse). A product like Mighty might in turn make owning a computer more affordable for poor people, since it minimises hardware requirements. Just because these products don’t have a “noble” goal attached to them initially doesn’t mean they can’t serve a noble cause in the long term.

There are of course silly ideas that are simply silly and will fail. But the difference between products that are superficially silly and truly silly is not obvious. In this text, I used Twitter and Mighty as examples. In retrospect, the case for Twitter is clear. For Mighty, I still don’t know. The idea puzzles me because it’s at the boundary. There’s a fine line between silly and genius.

“Mostly Aligned” Is Good Enough

A few years ago, I would have described a good organization as one where everyone is on the same page. By that, I would have meant exactly on the same page. I realize now that I was wrong. You don’t need to be perfectly on the same page. Being mostly on the same page is enough, and a little bit of chaos is ok.

Engineers are very well positioned to understand why: to be on the same page you need to coordinate, and coordination is expensive. This holds for actors in a software system (threads, processes) but also for actors in an organization (people, teams, units). Coordinating between actors takes time, and as such slows the system down. You should first try to design your system so that the need for coordination is reduced, and then, if necessary, balance coordination with consistency (being on the same page).

The analogy works surprisingly well (maybe it’s not an analogy but a property of systems in general?). Take optimistic locking in software systems: it’s a tradeoff between consistency and performance. Rather than locking the resource for each change, you only check at the final write whether you’ve been working on the most up-to-date information. If not, you retry. The retry is a performance hit, but overall the system is faster this way. The equivalent in an organization would be to accept that some people somewhere have outdated information. They will work based on this outdated information until a synchronization point happens and they realize the information is outdated. Some work will have to be corrected or redone. It may be upsetting, but it should happen rarely.
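For readers less familiar with optimistic locking, here is a minimal in-memory sketch of the idea. The class and function names (VersionedStore, update_with_retry) are made up for this illustration; real systems typically do the same check with a version column in a database or an ETag over HTTP.

```python
import threading


class VersionConflict(Exception):
    """Raised when the data changed since it was read."""


class VersionedStore:
    """Toy store holding one value plus a version counter."""

    def __init__(self, value):
        self._value = value
        self._version = 0
        # Only held for the short compare-and-write, never during the work itself.
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self._value, self._version

    def write(self, new_value, expected_version):
        with self._lock:
            if self._version != expected_version:
                raise VersionConflict("someone else wrote in the meantime")
            self._value = new_value
            self._version += 1


def update_with_retry(store, compute, max_attempts=5):
    """Apply compute() to the stored value; redo the work on conflict.

    Returns the number of attempts that were needed.
    """
    for attempt in range(1, max_attempts + 1):
        value, version = store.read()
        new_value = compute(value)  # possibly long work, done without any lock
        try:
            store.write(new_value, version)
            return attempt
        except VersionConflict:
            continue  # our information was outdated: start over
    raise RuntimeError("gave up after too many conflicts")


if __name__ == "__main__":
    store = VersionedStore(10)
    attempts = update_with_retry(store, lambda v: v + 1)
    print(store.read(), "after", attempts, "attempt(s)")
```

The retry loop is the software version of "some work will have to be redone": the cost is only paid when a conflict actually happens.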

The art of organization design is to reduce coordination and, when needed, use the right synchronization points. The goal is to prevent catastrophic mistakes. Some inconsistencies here and there, if resolved quickly and with small consequences, are fine. Do not synchronize on everything (it’s way too expensive) but synchronize often enough to keep the risks small. Prefer many small risks to a few large, looming ones.

There are lots of patterns in software systems to synchronize and coordinate actors in the system. There are also a lot of patterns to synchronize and coordinate actors in an organization: all-hands sessions, company memos, internal trainings, review boards, formal processes, team meetings, etc.

Interestingly, software systems and organizations have different profiles when it comes to the tradeoffs between consistency and speed. For software systems, relaxing consistency beyond simple techniques like optimistic locking is usually hard. Transactional systems are still a lot easier to build than systems with relaxed consistency. On the other hand, an organization will always work with relaxed consistency somehow: it’s impossible for an organization to update the “collective brain” in a transaction. It’s the nature of people to misunderstand information, forget things, or simply take vacations or be sick.

Speaking of coordination and alignment, Elon Musk put it like this:

“Every person in your company is a vector. Your progress is determined by the sum of all vectors.” – Elon Musk.

What this analogy does not consider is the time needed to align. If lots of time is lost on coordination, the vectors are smaller. You then have to choose between an expensive perfect alignment and an inexpensive imperfect alignment. Given that organizations constantly course-correct, vectors accumulate project after project (or task after task) and there are plenty of opportunities to adjust the alignment, even if each adjustment is imperfect. This is why in a good organization, a little bit of chaos is ok.
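The tradeoff can be played with in a toy model. This is only a sketch under assumptions of my own: each person contributes a vector whose length shrinks with the share of time spent on coordination, and only the component pointing towards the goal counts as progress. The angles and shares are invented.

```python
import math


def progress(angles_deg, coordination_share):
    """Sum of everyone's contribution along the goal direction.

    angles_deg: how far each person is off the goal direction, in degrees.
    coordination_share: fraction of time spent aligning instead of working.
    """
    effort = 1.0 - coordination_share
    return sum(effort * math.cos(math.radians(a)) for a in angles_deg)


# Five people, perfectly aligned, but 40% of the time goes into coordination.
perfect = progress([0, 0, 0, 0, 0], coordination_share=0.4)

# Same five people, up to 20 degrees off, but only 10% coordination overhead.
rough = progress([-20, -10, 0, 10, 20], coordination_share=0.1)

if __name__ == "__main__":
    print(f"perfect alignment: {perfect:.2f}")
    print(f"rough alignment:   {rough:.2f}")
```

With these made-up numbers the roughly aligned team makes more progress. The model obviously breaks down when misalignment becomes large, which is where synchronization points come in.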

What’s My Exposure to Data Lock-out?

My computer died a few days ago. Fortunately, I had a backup and could restore my data without problem on another laptop. Still, I’ve been wondering in the meantime: what if the restore hadn’t worked? How easily could I be locked out of my data?

I have data online and data offline. My online data is mostly stored by Google. Say my account is compromised and, because of something the hacker does, gets disabled. Would I ever be able to recover my online data? Not sure.

My offline data is stored on the hard drive, which I regularly back up with Time Machine. If ransomware encrypts all my data, the backup shouldn’t be affected. Unless the ransomware encrypts slowly over months, without me noticing, and then suddenly activates the lock-out. Am I sure ransomware doesn’t work like this? Not sure.

My laptop suffered a hardware failure. It hung during boot, and no safe boot mode made it through. The “target disk” mode still seemed to work, though. It would have taken very bad luck to be unable to access both the data on the hard disk and the backup: both would have to fail simultaneously. But can we rule out this possibility? Not sure.

Hard disks and backups can be encrypted with passwords. I don’t make use of this option because I believe it could make things harder in case I have to recover the data. I could, for instance, simply have forgotten my password. Or some part of the disk could be corrupted. Without encryption, I guess the bad segment can be skipped; with encryption, I don’t know. Granted, these are speculative considerations. But are they completely irrational? Not sure.

Connecting my old backup to the new computer turned out to be more complicated than I thought. It involved two adapters: one FireWire to Thunderbolt 2 adapter and one Thunderbolt 2 to Thunderbolt 4 adapter. Protocols and hardware evolve. With even older technology, could it have turned out to be impossible to connect it to the new world? Not sure.

The probability of any of these scenarios happening is small. It would be very bad luck and in some cases would require multiple things to go wrong at once. But the impact would be very big: 20 years of memories, not lost, but inaccessible. There’s no need to be paranoid, but it’s worth reflecting on the risks and reducing the exposure.

The Superpower of Framing Problems

Some problems we work on are concrete. They have a clear scope and you know exactly what has to be solved. Sometimes, however, the problems we need to address are muddy or unclear.

When something used to work, but doesn’t work any more, the problem is clearly framed: the thing is broken and must be repaired. However, if you have something like a “software quality problem”, the problem isn’t clearly framed. Quality takes many forms. It’s unclear what you have to solve.

To explore solutions you need first to frame the problem in a meaningful way. With this frame in place, you can explore the solution space and check how well the various solutions solve the problem. Without a proper frame, you might not even be able to identify when you have solved your problem, because the problem is defined in such a muddy way.

The “quality problem” mentioned previously could be reframed more precisely, for instance as a problem of reliability, usability, or performance. It could be framed in terms of the number of tickets opened per release, or the time it takes to resolve tickets.

Depending on how you frame your problem, you will find different solutions. Using the wrong frame limits the solution space, or in the worst case, means you will solve the wrong problem. It’s worth investing the time to understand the problem and frame it correctly.

If I had an hour to solve a problem I’d spend 55 minutes thinking about the problem and five minutes thinking about solutions. – attributed to Albert Einstein

Up to now, I’ve talked about framing problems. Framing does, however, work in a broader sense and can be used each time there is a challenge or an open question. Each time you have to come up with a solution, there is some framing going on.

Something interesting about framing is that, in itself, it isn’t about proposing a solution. It’s about framing the solution space. As such, people are usually quite open to reframing problems or exploring new frames. If you propose solutions, you can expect heated discussions; when it’s only about framing, the friction with other people is usually pretty low. While framing in itself is not a solution, it does impact the solution that you will find. When people don’t agree on some solution, they usually have different implicit frames for the problem. Working on understanding the frames is sometimes more productive than debating the solutions themselves.

A second interesting thing about framing is that you don’t need to be an expert in the solution to help frame problems. You need to be an expert in the solution space, but not in the actual solution. Going back to the example of the “software quality problem”, you can help with framing if you know about software delivery in general. You don’t need to be a cloud expert or a process expert. This means that good framing skills are more transferable than skills about specific solutions.

I wrote a long time ago about using breadth & depth to assess whether a thesis was good. In essence, this is a specific frame for the problem of thesis quality. Finding good frames for problems helps in many other cases. Framing problems is a great skill to learn.

SAFe: What’s a Release Train Engineer?

SAFe introduces a new role in the industry: the release train engineer (RTE). According to the framework:

The Release Train Engineer (RTE) is a servant leader and coach for the Agile Release Train (ART). The RTE’s major responsibilities are to facilitate the ART events and processes and assist the teams in delivering value. RTEs communicate with stakeholders, escalate impediments, help manage risk, and drive relentless improvement.

The role is designed like a scrum master at the ART level. At a minimum, an RTE ensures that the process is followed. But a good RTE helps teams improve their performance, and that’s the essence of the job. An RTE doesn’t have any authority over the content of the backlog. The focus is only on improvement at the organisational level. As such, the wording “assist the teams in delivering value” leaves quite some latitude in how impactful an RTE can be.

What do you expect from an RTE? I am wondering how this role will establish itself in the industry. Here are my personal expectations.

Level I – The Organizer. The RTE ensures that the process is followed. He/she ensures that information flows between the teams using the elements of the framework. The RTE helps resolve problems related to the work environment as they appear. Examples of such problems could be: communication tools, organisation of the program backlog, running the ART events. He/she makes sure people can work.

Level II – The Moderator. The RTE is able to create platforms or use existing platforms to encourage discussions in the ART / Solution. With some moderation talent, he/she can help instill change, support improvements, or create alignment. The RTE helps resolve problems about team performance as they appear. Examples of such problems could be: interpersonal issues, improving the collaboration with a specific provider, managing morale in challenging times, ensuring transparency, suggesting a feature stop to address the existing bugs first.

Level III – The Influencer. The RTE identifies systemic performance issues in the organisation and works towards resolving them by instilling change at the organisational, technical, or product management level. Examples of such issues could be: addressing systemic quality issues due to the work culture, working with the system architects/teams/system team to make the continuous delivery pipeline faster, encouraging decentralised decision-making (while managing risks), improving feedback loops.

The higher the level, the more interdisciplinary the RTE’s work will have to be. While little knowledge of product management or architecture is needed to be proficient at level I, problems at levels II and III will require a good understanding of how engineering works and how product management, technology, and processes influence each other. On the technology front, the RTE is also a key stakeholder in supporting mindsets like DevOps, which means he/she must also have a good understanding of how technology supports delivery and operations.

The RTE role resembles that of the more established delivery manager. Both focus on similar sets of issues.

The big difference between the two roles lies, I think, in the mindset. An RTE is a coach and as such has little formal authority. He/she leads by helping others make the right call. A delivery manager will typically have more formal authority. For instance, an RTE has no authority over the prioritisation of the backlog. The PM and PO formally have this responsibility. The RTE coaches the PM/PO in prioritising work.

The higher the level, the more the RTE works at the level of the engineering culture. It’s easy to define values and visions that nobody follows. Culture is defined by how people actually behave. It’s hard to be a good RTE. Just like it’s hard to be a good scrum master. Changing how people work isn’t easy.

SAFe: Systems Thinking

I was pleasantly surprised to see Systems Thinking as principle #2 in SAFe. I recently came in contact with systems thinking when reading Limits to Growth, which explores the feedback loops in the global economy. Donella Meadows, one of its authors, also wrote Thinking in Systems, which addresses more generally how to understand complex system dynamics with such feedback loops (the book is on my to-read list).

This is the definition of systems thinking according to SAFe:

Systems thinking takes a holistic approach to solution development, incorporating all aspects of a system and its environment into the design, development, deployment, and maintenance of the system itself.

It’s quite general. But arguably, there isn’t one definition of systems thinking. If you read Tools for Systems Thinkers, the study of feedback loops is only one aspect of systems thinking. The more general theme is to understand the “interconnectedness” of the elements in the system.

A system is a set of related components that work together in a particular environment to perform whatever functions are required to achieve the system’s objective. – Donella Meadows

Principle #2 in SAFe is about realizing that the solution, but also the organisation, are complex systems that benefit from systems thinking.

Interestingly, Large Scale Scrum (LeSS) also has systems thinking as a principle. It’s more concrete than the equivalent principle in SAFe. The emphasis is on seeing system dynamics, especially with causal loop diagrams. The article is a very good introduction to such diagrams. Here’s an example of a very simple causal loop diagram:

[Figure: simple causal loop diagram (systems thinking-7.png)]

I like the emphasis on actively visualizing system dynamics:

The practical aspect of this tip (NB: visualizing) is more important than may first be appreciated. It is vague and low-impact to suggest “be a systems thinker.” But if you and four colleagues get into the habit of standing together at a large whiteboard, sketching causal loop diagrams together, then there is a concrete and potentially high-impact practice that connects “be a systems thinker” with “do systems thinking.”

The idea is that it’s only when you start visualizing the system dynamics that you also start understanding the mental models that people have, and only then can you start discussing improvements.

I like the more concrete way systems thinking is addressed in LeSS compared to SAFe. Recently, I discussed a cultural issue related to know-how sharing with our RTE. A causal loop diagram would have been a very good vehicle to brainstorm about the problem. I think I will borrow the tip from LeSS and start sketching such diagrams during conversations.
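A causal loop diagram is simple enough to capture in a few lines of code. The sketch below assumes the usual convention from system dynamics: each link is marked "+" (same direction) or "-" (opposite direction), and a loop is reinforcing if it contains an even number of "-" links, balancing otherwise. The variables and links are hypothetical and only serve as an illustration.

```python
from math import prod

# Hypothetical links: (cause, effect) -> polarity.
# +1: more of the cause leads to more of the effect.
# -1: more of the cause leads to less of the effect.
LINKS = {
    ("time pressure", "knowledge sharing"): -1,
    ("knowledge sharing", "team autonomy"): +1,
    ("team autonomy", "time pressure"): -1,
    ("open bugs", "bug fixing effort"): +1,
    ("bug fixing effort", "open bugs"): -1,
}


def loop_type(nodes, links):
    """Classify the closed loop nodes[0] -> nodes[1] -> ... -> nodes[0]."""
    pairs = zip(nodes, nodes[1:] + nodes[:1])
    sign = prod(links[pair] for pair in pairs)
    return "reinforcing" if sign > 0 else "balancing"


if __name__ == "__main__":
    print(loop_type(["time pressure", "knowledge sharing", "team autonomy"], LINKS))
    print(loop_type(["open bugs", "bug fixing effort"], LINKS))
```

The value is less in the classification than in having to write down the links: each one is a claim about the organisation that colleagues can challenge.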

SAFe: The Good Parts

The Scaled Agile Framework (SAFe) is a complex framework. I mean, just look at this picture:

Long gone is the simplicity of Scrum. Its glossary contains 102 items (I counted them!), ranging from obvious concepts like a “story” to esoteric notions like “set-based design”, with “customer centricity” in between. The framework is meant to impose some structure, but at the same time, it has so many elements that with some creativity, you can probably retrofit any organisation into it without changing anything (for instance by abusing the concept of shared services). If agile was meant to be about simplicity, then SAFe is far from it.

SAFe comes in various “configurations”. The picture above is “portfolio” SAFe. And mind you, there is a “full” SAFe configuration which is even more complicated. But the core of SAFe – the “essential” configuration – actually has good parts:

  • An agile release train (ART) is a collection of teams. They synchronize through the program backlog and the PI Planning (PI stands for “program increment”).
  • ARTs should align with value streams. You organise your company in ARTs based on how you generate value for your customers so that each ART focuses on one part of the value stream. The definition of value streams is of course complicated in the glossary, with development and operational value streams distinct from each other, but the idea is actually good. You align IT and Business this way.
  • At the ART level, the leadership is split across three roles: Product Management, System Architecture, Release Train Engineer. I think that this split is a nice point in SAFe. It creates some balance in responsibility and makes it clear that to be efficient, you need to address product features, architecture, and work culture together since they all impact each other.
  • SAFe also introduces a special terminology for things that aren’t features on their own: enablers. Chances are, you had this kind of work item already, just with a different name. But naming matters, and SAFe makes good use of the concept of “enabling” at various levels. I like it.
  • Community of Practice is the name for working groups around specific issues.
  • The System Team helps with toolchains, infrastructure, build pipelines, integration testing.

Most companies develop their own organisation when growing, which will have some similar elements. Maybe you have different roles (e.g. “engineering managers”), or different ways to synchronize, or some other way to manage architectural work. Some things are surely different, but some things are probably similar, just named or implemented differently. If you want to move to SAFe, how much you will need to adapt will depend on your starting point. But for most enterprises, the change isn’t radical.

In this sense, SAFe is a collection of patterns. What SAFe gives you is a standard frame of reference to discuss these aspects. SAFe establishes a common vocabulary to talk about the organisation and how to improve it. Where this analogy with patterns fails, though, is that you usually can decide to implement patterns individually. SAFe comes as a framework of patterns, where all of them must be implemented.

The “large solution” configuration adds an additional level of scale with product management, train engineer, and architecture at the solution level. Solution and ARTs have the same cadence and synchronize through the same PI plannings. They have the same program backlog. This makes sense. (Historical note: “Program Level” was replaced with “Essential” in SAFe 5, but the rest of the “program” terminology survived.)

With the “portfolio” configuration, you have an additional level of “lean portfolio management” (LPM) whose goal is to “align strategy, funding and execution”. This adds epic owners, enterprise architects, lean business cases, KPIs and the like to the framework. According to the framework, only with this configuration can you achieve business agility. Something I like about SAFe is the idea of funding value streams rather than projects.

I understand that this level may match well with existing organizations, with their funding systems and steering boards. But the portfolio level still gives the feeling that ARTs and Solution Trains are “factories”, divorced from real business accountability. If the goal is to bring IT and business closer to each other, why not push these elements down to the ARTs and Solutions? Make them accountable for the value their products generate. In a way, I wish that this level didn’t exist, or existed in another form, for instance not as an additional level but rather as a vertical that complements the existing levels. I understand that some initiatives will impact several steps in the value stream, and thus possibly several ARTs or Solutions. But I hope it’s the exception, not the norm. On the other hand, maybe that’s also precisely the point of the portfolio level being above the Solutions / ARTs. If your business (and thus your value streams) isn’t yet clearly established, you need another level to be able to shape the value streams based on feedback from the market. I think that the portfolio level will be used very differently from enterprise to enterprise.

In its core values, SAFe recognizes its influences: Agile development, Lean product development, systems thinking, and DevOps. The framework actively tries to combine these influences into a consistent whole. The problem is that it feels sometimes a bit too much: The SAFe core values page lists 4 values. The lean-agile mindset page lists 4 pillars. The SAFe lean-agile principles page lists 10 principles. The lean-agile leadership lists 3 dimensions. Business agility lists 7 competencies that are required (on the left in the picture, but “competency” isn’t in the glossary). I like conceptual frameworks, really. But it’s hard for me to not get lost here.

I guess that companies moving to SAFe will still need to tailor it to their needs anyway. Where I’m working, they added a “subject-matter expert”, for instance. That’s fully in the spirit of agility: tailor processes when you need to. But with this idea in mind, SAFe could have been kept smaller rather than trying to be all-encompassing.

SAFe: Evolution Over the Years

It’s very interesting to see how SAFe evolved over the years. The version 2, circa 2013, looked like this:

Some things are worth noting:

  • There is no large solution. Only Team/Program/Portfolio
  • At the program level, we find release management.
  • The symmetry between PO/Team/ScrumMaster and PM/Arch/RTE isn’t yet established
  • Spikes and Refactors, a terminology coming from eXtreme Programming
  • Epics are primarily characterized as something that spans releases, to be broken down into features that fit in releases

Interestingly, this setup is very similar to the structure I know from my work.

This is version 3, circa 2014:

There aren’t that many changes compared to v2. The biggest change seems to be the introduction of value streams at the portfolio level. With it comes the idea that we fund value streams. We also see some “principles” appear, like the House of Lean, Lean-Agile Leadership, and Built-In Quality at the team level.

Here is version 4, circa 2016:

Major changes include:

  • An additional level between program and portfolio: the value stream. What version 5 calls the “Solution Train Engineer” is a “Value Stream Engineer” here. The value stream is very present in this configuration.
  • The symmetry between PO/Team/SM – PM/Arch/RTE – SolMgmt/Arch/VSE is established
  • Release management is subsumed under shared services
  • Communities of Practice appear
  • Some additional “principles”: Economic Framework, MBSE, Set-based, Agile Architecture, Core Values, Lean-Agile Mindset, SAFe Principles.

Here’s version 4.5, circa 2018:

  • Value Stream Level is replaced with Large Solution Level. With it the Value Stream Engineer becomes a Solution Train Engineer.
  • Supporting artefacts and teams are grouped in a sidebar.

Here is the current version 5.1:

We have several major changes (here’s a detailed analysis of them):

  • The introduction of “Business Agility” as the overarching goal, to be achieved with the portfolio level.
  • Introduction of the 7 core competencies (Organizational Agility, etc.)
  • The levels Program and Team merge into “Essential”.
  • Some more “principles”: customer-centricity, design thinking.

By studying the evolution of the framework I understand some things better now.

  • The core of the framework with agile release train and portfolio levels remained quite stable over the years
  • The large-solution level appeared over time, morphing from the value-stream level. The symmetry between the ART and solution level with the 3 roles PM/Arch/RTE took some time to evolve to how it is now.
  • The term epic became more complicated to understand. It started as “something bigger than a release” and existed only at the portfolio level. In SAFe 5, epics can occur at all levels.
  • Supporting artefacts and teams evolved over time, but these were much smaller changes. The biggest change was probably the “subsumption” of release management under the shared services.
  • Principles have been generously and continuously added to the framework. There are now a lot of them.

Just as with anything that evolves, some inconsistencies accumulate. I find it interesting to observe this in domains other than code and architecture. For instance, in SAFe 5 the term “program” is still in use (Program Increment, Program Backlog), but the program level disappeared. This is due to historical reasons. Starting directly with version 5, you would probably name things differently (e.g. “Solution Increment”). Also, just like code and architecture, the framework suffers from feature creep.

Somehow I’m a bit sad that they decided to do away with the “value stream level”. The idea of the value stream is very powerful, and putting it at center stage was nice. From an engineering standpoint, version 4 has a different spin than version 4.5. With the “value stream level”, various programs deliver independent products that together realize a value stream. With the “large solution” terminology of version 4.5, you get the impression that you have one “large solution” broken down into several components delivered by various ARTs, which need to be integrated together. The difference can seem subtle, but I prefer the spin of version 4. The “large solution” terminology will tend towards centralization more than the “value stream” terminology.

As for the principles, there are simply too many of them. I believe that the signal-to-noise ratio here is too low.

Introducing business agility is an interesting move from SAFe. I expect the discussion “development agility” vs “business agility” to be all the rage in the coming years. We know how to do agile development. But we still don’t get the expected outcome at the business level. The link is not as trivial in practice as it is in theory. Version 5 recognizes this and makes it clear that agile development is a means to an end, not the end itself. It reminds us why we’re doing all this. Here there’s a clear signal without noise, and it’s valuable.

Architectures for Mobile Messaging

A project I’m working on involves changing the messaging technology for the delivery of realtime information to train drivers using iPads. This project made me interested in the various ways to design realtime messaging platforms for mobile clients.

Unlike realtime messaging systems for web or desktop applications, mobile applications have to deal with the additional concern of unreliable connectivity. This unreliable connectivity is more subtle than I thought. Here are a few points to consider, for instance:

  • no connectivity or poor connectivity (tunnel, etc.)
  • the device may switch from 5G to WLAN
  • connection breaks when the app goes into the background
  • Different WLAN hotspots (Android, iOS) result in different behavior

You need to design your application to support these use cases correctly.

Here are some aspects of the communication that you need to consider:

  • Does the client need to load some state upon connection?
  • Do updates have a TTL?
  • Are messages broadcast to several clients or unique to each client?
  • Is message loss important or not?
  • Does the server need to know which clients are connected?
  • Do you have a firewall between client and server?

Depending on the answers to these questions, you might decide to establish a point-to-point connection from the device to the backend. In this case, if you want to broadcast information to several clients, you need to do this yourself. You will also need to manage the state in the backend yourself. Tracking the presence of the client is trivial, since there is one connection per client. Several technologies exist for this use case:

  • Server-Sent Events (SSE)
  • HTTP Long Polling
  • gRPC
  • WebSocket

You might otherwise decide to rely on a messaging system with publish-subscribe. The most common protocol for mobile messaging in this case is MQTT, but there are others. With a message broker, the broker takes care of broadcasting messages and persisting the state according to the TTL. Tracking the presence of the client can be achieved with MQTT by sending a message upon connection and using MQTT’s “Last Will and Testament” upon connection loss.
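The presence pattern is easier to see in a toy model. The sketch below is not MQTT client code: ToyBroker is a made-up, in-memory stand-in that only mimics two broker behaviours the pattern relies on, namely retained messages and publishing a client’s registered will when its connection is lost without a clean disconnect. Topic names and client ids are invented.

```python
class ToyBroker:
    """In-memory stand-in for a broker, just enough to show the presence pattern."""

    def __init__(self):
        self.retained = {}  # topic -> last retained payload
        self.wills = {}     # client_id -> (topic, payload) registered at connect

    def connect(self, client_id, will_topic, will_payload):
        # The will is handed to the broker at connection time.
        self.wills[client_id] = (will_topic, will_payload)

    def publish(self, topic, payload):
        # Every message is treated as retained to keep the sketch small.
        self.retained[topic] = payload

    def disconnect(self, client_id):
        # Clean disconnect: the will is discarded and never published.
        self.wills.pop(client_id, None)

    def connection_lost(self, client_id):
        # Unexpected loss (tunnel, app killed): the broker publishes the will.
        topic, payload = self.wills.pop(client_id)
        self.publish(topic, payload)


if __name__ == "__main__":
    broker = ToyBroker()
    topic = "presence/ipad-42"  # hypothetical topic layout

    broker.connect("ipad-42", will_topic=topic, will_payload="offline")
    broker.publish(topic, "online")   # the client announces itself
    print(broker.retained[topic])

    broker.connection_lost("ipad-42")  # the train enters a tunnel
    print(broker.retained[topic])
```

Because a clean disconnect discards the will, a client that shuts down gracefully has to publish its own "offline" message before disconnecting.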

There are of course more details to take care of when comparing both approaches, especially around state management. For instance, how to make sure that outdated messages are ignored.
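One possible way to ignore outdated messages on the client, sketched under assumptions that are not from our project: each update carries a key, a sequence number assigned by the sender, a send timestamp, and a TTL in seconds, and the clocks of sender and device are roughly in sync. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    key: str        # what the update is about, e.g. a train number
    sequence: int   # increases with every new update for the same key
    sent_at: float  # sender timestamp, seconds since the epoch
    ttl: float      # how long the update stays meaningful, in seconds
    payload: str


class LatestState:
    """Keeps only the newest, still valid update per key."""

    def __init__(self):
        self._latest = {}

    def accept(self, update, now):
        """Return True if the update was applied, False if it was ignored."""
        if now - update.sent_at > update.ttl:
            return False  # expired while the device was offline
        current = self._latest.get(update.key)
        if current is not None and update.sequence <= current.sequence:
            return False  # duplicate, or older than what we already have
        self._latest[update.key] = update
        return True

    def get(self, key):
        return self._latest.get(key)


if __name__ == "__main__":
    state = LatestState()
    print(state.accept(Update("train-7", 2, sent_at=100.0, ttl=60.0, payload="delay 5 min"), now=110.0))
    # An older update delivered late, e.g. after a reconnect, is ignored.
    print(state.accept(Update("train-7", 1, sent_at=90.0, ttl=60.0, payload="on time"), now=111.0))
    print(state.get("train-7").payload)
```

The sequence number handles reordering and redelivery; the TTL handles messages that were queued while the device had no connectivity.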

We chose the latter option (MQTT) for our project, but I’m sure we could have achieved our goal with another architecture, too.

MORE

Apparently Uber and LinkedIn rely on SSE: