...the bug was in our expectations.

What is Swift?

Simply put, Swift is the S3 of OpenStack (an open source cloud platform). Like Amazon's S3 service, OpenStack's Swift is a highly available, distributed, eventually consistent object/blob store.

As an object store, Swift manages data as 'objects', as opposed to other storage architectures such as file systems, which manage data as a hierarchy of files and directories, or block storage, which manages data as blocks within sectors and tracks. Each object typically includes the data itself, some metadata, and a unique identifier.

Terraform implements numerous remote-state backends, among them Swift and S3.

The Problem

After configuring the Swift backend and running terraform init and terraform plan for the first time, subsequent Terraform actions fail to acquire a state lock, resulting in the following error message:

$ terraform plan
Acquiring state lock. This may take a few moments...

Error: Error locking state: Error acquiring the state lock: Couldn't read lock info: Resource not found

Terraform acquires a state lock to protect the state from being written
by multiple users at the same time. Please resolve the issue above and try
again. For most commands, you can disable locking with the "-lock=false"
flag, but this is not recommended.

Digging down

The DEBUG log lines show that Terraform fails to create the lock object and then tries to read back a lock it never created in the first place, so it fails again.

$ export TF_LOG=TRACE
$ terraform plan
[...]
Acquiring state lock. This may take a few moments...
2021/01/13 18:17:25 [DEBUG] Couldn't write lock 0ed1e907-6d55-4cde-46d0-61961f6afdf4. One already exists.
2021/01/13 18:17:25 [DEBUG] Getting object test-1-terraform-state/tfstate.tf.lock

Error: Error locking state: Error acquiring the state lock: Couldn't read lock info: Resource not found

Some additional custom debug logging reveals the create request is failing with a status of 412: Precondition Failed.

httpErr: Expected HTTP response code [] when accessing [PUT https://object-storage.nz-hlz-1.catalystcloud.io:443/v1/[REDACTED]/test-5-terraform-state/tfstate.tf.lock], but got 412 instead
<html><h1>Precondition Failed</h1><p>A precondition for this request was not met.</p></html>

MDN documentation advises that this status code occurs "when the condition defined by the If-Unmodified-Since or If-None-Match headers is not fulfilled."

412 Precondition Failed - HTTP | MDN

Sure enough, Terraform is setting If-None-Match: '*' as a header when uploading the lock object.

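// Lock writer: forwards ifNoneMatch to the underlying PUT request.
func (c *RemoteClient) writeLockInfo(info *statemgr.LockInfo, deleteAfter time.Duration, ifNoneMatch string) error {
	err := c.put(c.lockFilePath(), info.Marshal(), int(deleteAfter.Seconds()), ifNoneMatch)
	// ...
}

// Call site when acquiring the lock: "*" is passed as the If-None-Match value.
if err := c.writeLockInfo(info, lockTTL, "*"); err != nil {
	return "", err
}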

The If-None-Match header tells the server to process the request only if no resource has an etag matching the one(s) specified in the header's value. Setting If-None-Match: '*' tells Swift to create the object only if no object with the same name already exists, regardless of its etag value. Without the header, Swift by default overwrites any existing object with the new one - highly undesirable behaviour for a lock file.

By returning 412, Swift is telling us an object called tfstate.tf.lock already exists, yet when we attempt to retrieve the lock contents, Swift responds with 404: Not Found. A contradiction.
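To make the difference concrete, here is a minimal sketch of the two PUT behaviours described above. ToyObjectStore and PreconditionFailed are hypothetical names invented for illustration; this is a single-node, in-memory stand-in, not Swift's real implementation.

```python
class PreconditionFailed(Exception):
    """Stands in for an HTTP 412 response."""


class ToyObjectStore:
    """Hypothetical single-node store; not Swift's real implementation."""

    def __init__(self):
        self.objects = {}

    def put(self, name, data, if_none_match=None):
        # With If-None-Match: *, refuse to write when the name is taken.
        if if_none_match == "*" and name in self.objects:
            raise PreconditionFailed(name)
        # Without the header, the new data silently replaces the old.
        self.objects[name] = data


store = ToyObjectStore()
store.put("tfstate.tf.lock", "lock held by A", if_none_match="*")

try:
    store.put("tfstate.tf.lock", "lock held by B", if_none_match="*")
    second_writer_got_lock = True
except PreconditionFailed:
    second_writer_got_lock = False

# An unconditional PUT overwrites whatever was there.
store.put("tfstate.tf.lock", "lock held by C")
```

On a single node this is exactly the guarantee a lock needs: the second conditional writer is rejected, while the unconditional writer clobbers the lock.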

A Bug in Swift?

I've never experienced an issue like this until the If-None-Match header came into play. I went digging through the Swift source code to see how Swift handles that header, and confirm that it is, in fact, the cause for the 412 response.

if req.if_none_match is not None and '*' in req.if_none_match:
   statuses = [
       putter.resp.status for putter in putters if putter.resp]
   if HTTP_PRECONDITION_FAILED in statuses:
       # If we find any copy of the file, it shouldn't be uploaded
       self.app.logger.debug(
           _('Object PUT returning 412, %(statuses)r'),
           {'statuses': statuses})
       raise HTTPPreconditionFailed(request=req)

The check used for the If-None-Match header simply tries to put a copy of the file on every replica and, if any replica responds with 412, bubbles that status up to the user.

Notably, this is a significantly stricter requirement than ordinarily creating a new object, where the Swift object proxy only requires that a quorum of replicas store the object successfully.

OpenStack Swift Architecture — SwiftStack Documentation

Including the If-None-Match header when creating an object assumes a much higher level of consistency than when the header is omitted.
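A toy model shows how this asymmetry could produce the 412-then-404 contradiction. Everything here is an assumption made for illustration: the replicas are plain dictionaries, conditional_put and read are hypothetical helpers, and the majority-based read is a simplification rather than a description of Swift's real read path.

```python
def conditional_put(replicas, name, data):
    """Return 412 if any replica already has the name, else write and return 201."""
    if any(name in replica for replica in replicas):
        return 412
    for replica in replicas:
        replica[name] = data
    return 201


def read(replicas, name):
    """Toy read: 200 only when a majority of replicas hold the object."""
    holders = sum(1 for replica in replicas if name in replica)
    return 200 if holders > len(replicas) // 2 else 404


# Three replicas; only one still holds a stale copy of the lock,
# for example because a delete has not reached it yet.
replicas = [{}, {}, {"tfstate.tf.lock": "stale lock"}]

put_status = conditional_put(replicas, "tfstate.tf.lock", "new lock")
get_status = read(replicas, "tfstate.tf.lock")
```

In this model a single out-of-date replica is enough to reject the conditional write, even though a read of the same name reports that nothing is there.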

Workaround?

My initial thought was that if Swift's handling of the If-None-Match header is buggy, then maybe Terraform could simply check whether a lock file exists before trying to create one.

However, this would open up a race condition: two actors try to act on the lock file at the same time, and the state changes between checking for a lock file and creating one. For example:

  1. Actor 1 checks for a lock - no lock
  2. Actor 2 checks for a lock - no lock
  3. Actor 1 creates a lock - 1 lock exists
  4. Actor 2 creates a lock - overwrites Actor 1's lock.

Regardless of how likely this scenario is, the results could be disastrous if it occurs.
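The four steps above can be replayed deterministically. This is only a sketch: the dictionary stands in for an object store with unconditional writes, and lock_exists and create_lock are hypothetical helpers, not Terraform functions.

```python
store = {}  # hypothetical object store: name -> contents


def lock_exists(name):
    return name in store


def create_lock(name, owner):
    # Unconditional write: replaces any existing lock.
    store[name] = owner


LOCK = "tfstate.tf.lock"

# Steps 1 and 2: both actors check before either one writes.
actor1_sees_lock = lock_exists(LOCK)
actor2_sees_lock = lock_exists(LOCK)

# Steps 3 and 4: both believe the lock is free and create it.
holders = []
if not actor1_sees_lock:
    create_lock(LOCK, "actor-1")
    holders.append("actor-1")
if not actor2_sees_lock:
    create_lock(LOCK, "actor-2")
    holders.append("actor-2")
```

Both actors end up in holders, each believing it owns the lock, while the store only remembers actor-2.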

The If-None-Match header solves exactly this problem. According to MDN: "Used with the * value, If-None-Match can be used to save a file not known to exist, guaranteeing that another upload didn't happen before, losing the data of the previous put; this problem is a variation of the lost update problem."

The original author's decision to set the If-None-Match: '*' header is the right one. Maybe Swift needs some work to handle this better.

But actually there's a bigger challenge.

The CAP Theorem, Eventual Consistency and Swift

The CAP Theorem states it is impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees:

  • Consistency: Every read receives the most recent write or an error
  • Availability: Every request receives a (non-error) response, without the guarantee that it contains the most recent write
  • Partition tolerance: The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes

Swift is designed to be highly available and partition tolerant, keeping your data safe and available even if an entire datacenter were to fall into the ocean.

The trade-off, then, is consistency - at least in the short term.

OpenStack Swift eventual consistency analysis & bottlenecks

Big Picture

The issue is that locking requires a high level of consistency in order to function as intended, and object storage services are specifically designed to prioritise availability and partition tolerance over consistency.

The Terraform Swift backend is expecting too much from an object storage service alone - trying to make Swift do something it's fundamentally not designed to do.

How does the Terraform S3 backend achieve state locking with S3?

Like Swift, S3 is also designed for availability and partition tolerance - not consistency - so surely it would suffer from the same problem, right? So how does Terraform implement state locking using S3?

It doesn't.

The S3 Terraform backend additionally requires the integration of DynamoDB, a strongly consistent key-value database service in AWS.

Using the Terraform S3 backend without DynamoDB does not provide any state locking capabilities.

Why not do the same for the Swift Backend?

It would be trivial to integrate a secondary storage service to provide consistent locking for the Swift backend, so the question is not 'how' but 'what' and 'why'.

For AWS the choice is easy: they have a suitable service (probably multiple) built right into the platform.

There is no such service available in OpenStack at this time. At one stage there was a key-value database service in OpenStack called MagnetoDB, which would have been ideal for this use case - however, that project is currently not actively maintained.

Furthermore, even if it were active, most OpenStack clusters do not deploy most of the 60+ OpenStack services and projects, so there's no guarantee that MagnetoDB would be ubiquitously available.

So if there is no appropriate service in OpenStack, then what about a third-party integration? Again, the question is which service to choose and why. How do you make that decision for your users? What happens if that service goes away?

If that service can support state management in addition to state locking, then why not just use it for both - especially if it is a service already supported as a Terraform backend, such as etcd?

Maybe the best option is just to transition it to a backend that does not support locking. There's no shame in that - there are several backends which don't support state locking. What is a problem is promising users a feature which fundamentally can't be delivered.

It should not break anyone's workflow to drop locking support, as the feature is already broken. And if MagnetoDB is revived or a suitable service rears its head, the Terraform Swift backend could add back support for locking without breaking any workflows.