Reliable software: About trust in software systems

I recently heard about the following situation. The Ukrainian side published a video message and claimed that it had been recorded on a certain date. The Russian side stated that it had been recorded in advance. So I started wondering whether ordinary people could check when a video was created.

Of course, there is a very simple method that has been known for a long time. You buy a fresh newspaper and shoot your video showing the front page of the newspaper. In this case, viewers can be sure that the author of the message created it no earlier than the newspaper was published.

But in the case where the video was created by some government, there are still doubts. The government can have sufficient influence on the press and force it to create the necessary publications. Of course, this problem can be solved quite easily. You can ask your supporters in the USA or another big country to take a photo of some famous newspaper and send it to you. Then you can show your audience this photo.

But I, as a software developer, am interested to know if it is possible to solve this problem with software. Perhaps, in the process of thinking, I could find some ideas on how to increase confidence in software systems.

Description of our system

Let's imagine what a system that solves our problem might look like. Instead of the front page of the Times, our author shows a computer screen with a browser open on some website (for example, clock.org). The site shows the current time in UTC (for example, March 4, 2022 16:47) and a sequence of letters and digits (for example, 1G34HF4JH3). For brevity, let's call this sequence a time code. The code changes once a minute.

Now we turn to the viewer's side. Let's say he watches the video a week later and sees the time and the code in it. If he wants to know whether the video was actually made at that time, he goes to another page on clock.org. There he enters the time and the code, and the site tells him whether this code was actually shown at that time.
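The two pages described above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the class name `TimeCodeClock`, the 10-character code length, and the in-memory dictionary are all hypothetical, and the storage question is discussed later in the article.

```python
import secrets
from datetime import datetime, timezone

CODE_ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def minute_key(moment: datetime) -> str:
    """Truncate a timezone-aware moment to the minute in UTC."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M")


class TimeCodeClock:
    """Hypothetical in-memory model of the site: one random code per minute."""

    def __init__(self) -> None:
        self._codes = {}  # minute key -> time code

    def current_code(self, now: datetime) -> str:
        """What the author's screen shows: the code for the current minute."""
        key = minute_key(now)
        if key not in self._codes:
            # The code is random, so it cannot be computed ahead of time.
            self._codes[key] = "".join(
                secrets.choice(CODE_ALPHABET) for _ in range(10)
            )
        return self._codes[key]

    def verify(self, moment: datetime, code: str) -> bool:
        """What the viewer's page answers: was this code shown in that minute?"""
        return self._codes.get(minute_key(moment)) == code
```

A viewer who reads "March 4, 2022 16:47" and a code off the video would call `verify` with exactly those two values.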

Theoretically, you could use the website of some news agency for this purpose. But, firstly, the content of such a site may not change often enough for our purposes. And, secondly, such sites usually do not let you see their content as it was at some previous point in time. We could try to extract this content from some Google cache, but it is not designed for such tasks. It may not contain the required content, or the content may be deleted to free up disk space. Besides, today it is difficult to believe in the independence of various news agencies. I would like to avoid all this.

You may ask, "Can we trust your site? Maybe you made a deal with the author of the video, and your site shows what they want..." Well, let's figure it out.

Trust in the source code

My site has some source code. If the user doesn't know this code, they have no reason to trust it. No one can know what is actually running on the server side. Maybe I created a complex algorithm for generating codes for each moment in time and passed this algorithm to the author of the video. And now they can generate these codes at any time.

But what if the source code of the site is available to anyone? Suppose I posted it on GitHub. It's hard to suspect that I bribed GitHub. And we don't really need to trust GitHub anyway: anyone can download the source code from there to their own computer. "So what?" you might say. "How can you prove that your site executes exactly this code?" Here's how it's done.

First, I will need a compiler that converts the source code into an executable file deterministically. This means that if you take some source code and compile it 1000 times, you will get a byte-for-byte identical executable file all 1000 times. That rules out compilation timestamps, built-in GUIDs for debugging, and so on. But technically I don't see any problems here.

Also, there are some programming languages (like PHP) where you don't even need to compile anything. It is enough to have the source code that will be interpreted on the hosting provider's side.

Now let's move on to the hosting provider. We're going to need some help from him. I want to ask him to calculate a hash over several of my files, namely the files that the provider actually executes for my site. These can be compiled assemblies or source code files. The hosting provider will show this hash to anyone who wants to know it. It looks like this. You visit the hosting provider's website (not clock.org itself, but the website of the provider that hosts clock.org). There you type "clock.org" in some input field, and the provider shows you the hash and the full paths to all the files for which the hash was calculated. I'll explain later why we need these full paths.

We need one more piece of information. My website (clock.org) will publish the name of its GitHub repository and the ID of the commit it was built from.

Now let's put it all together. Let's say I want to be sure that the site clock.org executes the same code that its author posted on GitHub. I go to clock.org and get the repository name and the commit ID there. Then I go to GitHub and download this particular version of the source code to my computer. Then I compile it. Now I go to the hosting provider's website and get the hash and the list of files for which this hash was calculated. Then I calculate the hash for the same files on my machine. If the hashes are equal, then everything is fine. If not... then there is no trust.
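The hash calculation in this procedure could look roughly like the sketch below. It assumes that the provider and the visitor agree on one scheme: SHA-256 over the listed files in sorted order, with paths taken relative to the site root so that a local build and the hosted copy produce the same value. The function name `combined_hash` and this scheme are my assumptions, not something a real hosting provider offers today.

```python
import hashlib
from pathlib import Path
from typing import List


def combined_hash(root: Path, relative_paths: List[str]) -> str:
    """SHA-256 over the names and contents of the listed files, in sorted order."""
    digest = hashlib.sha256()
    for rel in sorted(relative_paths):
        data = (root / rel).read_bytes()
        # Mix in the path and the length, so that renaming a file or moving
        # bytes from one file to another changes the result.
        digest.update(rel.encode("utf-8") + b"\0")
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()
```

The visitor runs the same function over the freshly compiled files and compares the result with the value published by the provider.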

As you can see, we have replaced trust in the site owner with trust in the hosting provider. But we can further reduce the need for this trust. If I really need the trust of my users, I can host instances of my site all over the world with different hosting providers in different countries. It will be really hard to imagine that I colluded with all of them.

There are more questions. Why should we choose the files for which the provider will calculate the hash? Why can't we just calculate it for all files in the root folder and all subfolders? Usually a site writes something to disk (for example, logs). Any change to these files also changes the hash, so in this case the system will not work. It is better to specify several immutable files and calculate the hash for them. Of course, this list of files should include everything that actually runs on the provider's servers: index.php, web.config, ... This may lead to some restrictions on what these files can contain, since they must be hosted on GitHub and visible to everyone. But I think it's not a very big problem. All confidential information can be passed through environment variables.

I promised to explain why the hosting provider should show the full paths to all files involved in the hash calculation. Otherwise, I could do the following trick. I upload any arbitrary code to the provider. But I also create a separate folder where I put the compiled code from GitHub. After that, I ask the provider to calculate the hash for the files in this folder. This means that one code will be executed, and the hash will be calculated for a completely different code. Knowing the full file paths protects against this trick.
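A visitor could automate this check along the following lines. The sketch assumes that the provider also publishes the folder from which the site is actually served (here called `document_root`), which is an assumption of mine, and that paths are POSIX-style. The function name `paths_outside_root` is hypothetical.

```python
from pathlib import PurePosixPath
from typing import List


def paths_outside_root(document_root: str, hashed_paths: List[str]) -> List[str]:
    """Return the hashed files that do not live under the served folder."""
    root = PurePosixPath(document_root)
    suspicious = []
    for raw in hashed_paths:
        path = PurePosixPath(raw)
        # A ".." component could leave the root once the path is resolved,
        # so such paths are treated as suspicious as well.
        if ".." in path.parts or not path.is_relative_to(root):
            suspicious.append(raw)
    return suspicious
```

An empty result means every hashed file sits inside the served folder. Any entry in the list points to the "separate folder" trick described above.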

Now we have several copies of my site all over the world, hosted by different providers. The author of the video can open several of them to demonstrate that they show the same UTC time and the same time code. Viewers can check any of them, depending on which hosting provider they trust more.

Wait a minute! Did I say "same time code"?!

Data storage system

"So you got caught!" you might say. "Obviously, your site instances must have some kind of storage system, some kind of database in which you keep the correspondence between UTC time and time codes. And who has access to this database? You have it! This means that if you wish, you can insert any information you like there and change the data at your discretion. How can I trust this?" Yes, this is a very serious question.

First of all, each instance of my site could have its own independent storage, for example an SQLite file in the root folder. But in this case, I do not know how to establish trust. It's better to replace trust in my storage with trust in something else: the database provider.

All my site instances will work with the same database, which they access over the Internet at some database provider. The address of the database (e.g. a URL), the database name, the names of all tables/collections (and possibly the username) will be hardcoded in my source code. This lets anyone make sure that I always work with the same database and the same tables, and that I haven't replaced them on the sly. Only the password will come from an environment variable.

"Big deal! You can still make any changes to this database," you might say. Yes. Unless...

Unless the database itself restricts what I can do. Imagine a database that does not allow you to modify existing data and does not allow you to delete data. I can only add new entries. It is also not possible to create multiple records with the same key. In our case, the key will be UTC time to the minute.

But that's not enough. Let's say I made a deal with the author of the video. He wants to create a video today, but it should look as if it were created next week. To do this, I add some entries to the database for times in the next week. This does not violate the database restrictions. I don't change anything or delete anything, I just add new entries. I also don't create key conflicts. They may happen later, when my code tries to insert new entries for the same times. But such conflicts can be easily resolved by the program. So my database needs to guarantee the sequence of keys. At any moment, I can only insert the entry for the next key (for the minute following the last minute existing in the database).
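To make these restrictions concrete, here is a small in-memory model of such a database. It is only a sketch: the class `AppendOnlyStore` is hypothetical, and I don't know of a database provider that offers exactly this set of guarantees.

```python
from datetime import datetime, timedelta
from typing import Optional


class AppendOnlyStore:
    """Hypothetical model of the restricted database: insert-only and sequential."""

    def __init__(self, first_minute: datetime) -> None:
        self._next_key = first_minute
        self._records = {}  # minute -> time code

    def insert(self, minute: datetime, code: str) -> None:
        if minute in self._records:
            # Unique key: one time code per minute.
            raise KeyError(f"a record for {minute} already exists")
        if minute != self._next_key:
            # Sequence guarantee: no gaps and no entries for the distant future.
            raise ValueError(f"only {self._next_key} can be inserted now")
        self._records[minute] = code
        self._next_key = minute + timedelta(minutes=1)

    def read(self, minute: datetime) -> Optional[str]:
        return self._records.get(minute)

    # There are deliberately no update() or delete() methods.
```

With this model, an attempt to insert a code for next week fails with `ValueError`, because the only acceptable key is the minute right after the last stored one.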

There is another important issue that we need to discuss. Who inserts data into my database? Imagine that each instance of my site has code that inserts new records into a shared database. Each instance generates a new time code for the next minute once a minute and tries to insert it into the database. Only one instance will do this successfully (due to the unique key constraint), other instances will receive an error message due to a key conflict and will have to re-read the time code for the next minute from the database. This way they will all be able to show the same time code at the same time. Is this enough to be sure that our time codes are generated honestly? Unfortunately, no.
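The "insert, or re-read on conflict" logic can be sketched with SQLite standing in for the shared database. The table name `time_codes`, the function names, and the use of an in-memory SQLite database are assumptions made for illustration. A real deployment would talk to the remote database provider instead.

```python
import secrets
import sqlite3
from typing import Tuple


def open_shared_db() -> sqlite3.Connection:
    """Create a stand-in for the shared database with a unique key per minute."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE time_codes (minute TEXT PRIMARY KEY, code TEXT NOT NULL)"
    )
    return db


def publish_or_adopt(db: sqlite3.Connection, minute: str) -> Tuple[str, bool]:
    """Try to insert a fresh code; on a key conflict, re-read the winner's code.

    Returns (code, inserted_by_this_instance).
    """
    candidate = secrets.token_hex(5).upper()  # 10 random hex characters
    try:
        with db:  # commits on success, rolls back on an exception
            db.execute(
                "INSERT INTO time_codes (minute, code) VALUES (?, ?)",
                (minute, candidate),
            )
        return candidate, True
    except sqlite3.IntegrityError:
        # Another instance was faster: show its code instead of ours.
        row = db.execute(
            "SELECT code FROM time_codes WHERE minute = ?", (minute,)
        ).fetchone()
        return row[0], False
```

Whichever instance calls `publish_or_adopt` first for a given minute wins, and every later caller gets the same code back with the flag set to `False`.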

At home, I may have an Excel file in which I save time codes for many years to come. Since I have a password for the database, I can run a small program that will write these codes to the database before the instances of my site do it. This will allow me to know all the codes for many years, and I will be able to use this knowledge.

What can we do to overcome this problem? First, on the database provider's website, I can restrict the list of IP addresses from which connections to the database are allowed. Of course, the database provider should publish this list so that everyone can see that any changes to my database come from the same IP addresses as the instances of my site. But now I have to somehow guarantee that I won't be able to run another process on the same computer that my site is running on. I think that is much harder to do.

There is another approach. My site instances can provide a kind of log in which each instance records whether it was able to insert a new record into the database, or whether it encountered a key conflict and had to re-read the time code. People could (at least theoretically) visit all instances of my site and check whether one of them was able to write the time code for a given minute. If they all encountered a key conflict, it would mean that someone else created the time code. But how can I keep this log? If it's a file on disk, no one can be sure that I haven't made some changes there. Such a file will be constantly changing, and the hash will not help us. I could keep such a log in the server's memory. In this case, we could be sure that the data stored there is correct, since we trust the source code. But the size of such storage will be limited, so we'll have to clean it out from time to time. This would mean that we would not be able to check a time code after a certain period of time. And the log will also disappear every time the site instance is restarted.
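A minimal sketch of such an in-memory log, with its size limit, might look like this. The names `InsertLog` and `written_by_outsider` are hypothetical, and the capacity is an arbitrary parameter.

```python
from collections import deque
from typing import Deque, Iterable, NamedTuple, Optional


class LogEntry(NamedTuple):
    minute: str
    inserted: bool  # True if this instance won the insert, False on a conflict


class InsertLog:
    """Hypothetical in-memory log that keeps only the most recent entries."""

    def __init__(self, capacity: int) -> None:
        self._entries: Deque[LogEntry] = deque(maxlen=capacity)

    def record(self, minute: str, inserted: bool) -> None:
        self._entries.append(LogEntry(minute, inserted))

    def lookup(self, minute: str) -> Optional[bool]:
        for entry in self._entries:
            if entry.minute == minute:
                return entry.inserted
        return None  # never recorded, already evicted, or lost in a restart


def written_by_outsider(logs: Iterable[InsertLog], minute: str) -> Optional[bool]:
    """True if every instance hit a conflict, False if one of them inserted.

    Returns None when nothing can be concluded because an entry is missing.
    """
    results = [log.lookup(minute) for log in logs]
    if any(r is True for r in results):
        return False
    if not results or any(r is None for r in results):
        return None
    return True
```

The `None` result is exactly the weakness described above: once an entry has been evicted or an instance has restarted, the check can no longer be made.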

So, what can we do?

Blockchain

In fact, there is already a technology that can be used for our storage purposes. I'm talking about blockchain. It provides immutable (without changes in recorded data) storage and verification of new blocks by the entire community of participants.

For our purposes, the blockchain can be built as follows. Firstly, any participant can generate a block with a time code for the next minute. Secondly, the verification process checks that:

The block is generated for the next minute after the last block in the chain, and not for a distant time in the future.
The block contains a time code that has not been used before (or has not been used for some considerable time).
The block is generated by a participant who has not generated any of the last N blocks in the chain. This check helps to exclude situations when one person generates several consecutive blocks.
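The three checks above can be written down as one validation function. This is a simplified sketch that ignores everything a real blockchain needs (hash links between blocks, signatures, consensus). The `Block` fields and the default `n_recent=3` are my own assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class Block:
    minute: datetime
    time_code: str
    author: str


def is_valid_next_block(chain: List[Block], block: Block, n_recent: int = 3) -> bool:
    """Apply the three checks to a candidate block (assumes a non-empty chain)."""
    last = chain[-1]
    # 1. The block must be for the minute right after the last block.
    if block.minute != last.minute + timedelta(minutes=1):
        return False
    # 2. The time code must not have been used before.
    if any(b.time_code == block.time_code for b in chain):
        return False
    # 3. The author must not have generated any of the last n_recent blocks.
    recent = chain[-n_recent:] if n_recent > 0 else []
    if any(b.author == block.author for b in recent):
        return False
    return True
```

Here the second check scans the whole chain. The weaker variant ("not used for some considerable time") would scan only a recent window instead.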

This approach, of course, also has some drawbacks. I can still insert a block with an arbitrary time code into the chain, which I can know in advance. This means that I can use this time code if my video message lasts less than a minute.

Moreover, blocks should be generated once per minute. If there are many participants, this can lead to conflicts that generate temporary branching of the chain. In this case, our site may read data from the "wrong" branch of the chain, and the system will become useless. We can, of course, try to generate blocks in advance, for example, a day in advance, in the hope that during this time the community will already decide which branch is correct. But in this case, the author of the message can also record it a day in advance.

And, of course, the question arises about the incentive for generating new blocks. Why would a lot of people start generating blocks with time codes for our system? We need some kind of economic incentive, as in the case of bitcoin, but I don't see one here.

In general, there are many more questions that need to be answered.

Conclusion

My article discusses the creation of a software system that can be trusted without trusting its authors and owners. I can draw the following conclusions.

We can build trust in the source code of the website. But in fact, we replace trust in the author of the site with trust in the hosting provider. We can improve the situation by hosting multiple instances of the site with different hosting providers around the world. Maintaining such trust requires some action on the part of the hosting provider, but nothing extraordinary.

It is much more difficult to establish trust in data. It is very difficult to be sure that the owner has not made any changes to the data. Even moving some checks to the database provider's side cannot completely solve the problem. Systems such as blockchain could probably help here. But these systems show the data to everyone, which is not always acceptable.

I hope it was at least fun for you to travel around the world of trust in software systems. I will be glad if it gives you food for thought. Good luck!

Monday, March 28, 2022
