Other database content: data revision control and history

PMc · Jun 10, 2021

I'm not sure if I did talk about that already earlier, search doesn't find a thing (but my search-fu isn't currently well developed, anyway).

With our code development, we have version control systems like SVN or GIT. They record all changes to a codebase, and provide us with the ability to (at least)

see a log of concerned changes in chronological order
display the differences between any two points in time
restore any historical version

These tools work upon a filesystem tree. They are not limited to code base development, but can work for anything that can be stored in a file tree: e.g. documentation projects, or system configuration files.

Now switch to SQL: when we store data in a database, we might have the same desire to retrieve those mentioned kinds of historical information. (Specifically when trying to store systems configuration data in a database, this gets quite relevant.) But that data is not present in the filesystem, it is stored inside an RDBMS, and therefore is not accessible for GIT or SVN.

This matter gets occasional attention at places like StackOv., and gets usually answered in a way, that there are tools available to log/archive/audit modified table rows. Which brings up another issue:

In SQL there is a thing called "normalized" form - and it is a good thing, because it means that you clean up your data structure to the most logical and consistent form. In practice this means that you will get not one table full of (possibly redundant) informations, but many tables with specific informations, which are all interrelated in some way.
And this is the end of those archiving/auditing tools, because they cannot be aware of the relationships: they may still create audit logs of all the specific changes that have happened, but there is no way of getting an intellegible historical timeline or diff out of that, even less to restore the whole structure to a certain point-in-time.

What we certainly can to is restore the whole database to a certain point in time - the RDBMS usually provides for that. But that is not what is desired; usually only a specific object in a table plus all it's adjacent information in dependent tables should be restored/copied as it was at some point in time. (The RDBMS itself would have the necessary knowledge and tools: it does it's own revision management; but this is only used internally to keep the database consistent and manage transaction isolation levels. It is not accessible to the user.)

I am currently trying to devise a solution. Comments, inspirations, strategies etc are welcome, as usual.

astyle · Jul 13, 2021

PMc said:
(The RDBMS itself would have the necessary knowledge and tools: it does it's own revision management; but this is only used internally to keep the database consistent and manage transaction isolation levels. It is not accessible to the user.)

I'd recommend reading a good book on MySQL, but just about any RDBMS would have a record/log of transactions if properly configured. And yes, that is accessible to the user. In case of MySQL, you would need to log in as 'root' user of the database server (not to be confused with the 'root' user of the underlying system, those are two separate things).

In that log of transactions, every single transaction is an SQL query, even admin tasks like adding a user. And yes, you can have per-table rollback and recovery in just about any RDBMS. BTW, with an RDBMS, data in tables does not 'depend' on data in other tables in the same way that software packages have dependencies. Totally different animal. You can specify the dependency in a schema, and success of your SQL queries depends on that schema, but the data in tables doesn't care.

The database normalization - that's specified in the schema. Yes, you might end up re-organizing the tables a bit after discovering that your schema is a mess - but even then, tables are completely independent from schema.

And sorry, external archivers/auditing tools are not what you need to do a backup of a database. sqldump will take care of the backup.

Alain De Vos · Jul 13, 2021

Mariadb has a command "mysqldump" which dumps everything in sql format.
Postgresql has a command "pg_dump" which by default dumps in sql format.
[ Maybe the most important concept is that of a "foreign key", which is a "primary key" in another table. ]

Jose · Jul 14, 2021

This is a very hairy subject, but I believe what you're looking for is WAL archiving. That will help you with changes to the data in the RDBMS. I'm still looking for a good way of adding revision control to database schema changes. We use Flyway at work. It's just ok.

astyle · Jul 14, 2021

Flyway amounts to nothing more than take dump 1, dump 2, dump 3...

PMc · Jul 14, 2021

astyle said:
I'd recommend reading a good book on MySQL, but just about any RDBMS would have a record/log of transactions if properly configured. And yes, that is accessible to the user. In case of MySQL, you would need to log in as 'root' user of the database server (not to be confused with the 'root' user of the underlying system, those are two separate things).

In that log of transactions, every single transaction is an SQL query, even admin tasks like adding a user. And yes, you can have per-table rollback and recovery in just about any RDBMS.

This was already deprecated 20 years ago and is no longer the case: https://www.postgresql.org/docs/6.3/c0503.htm

astyle said:
And sorry, external archivers/auditing tools are not what you need to do a backup of a database. sqldump will take care of the backup.

Yes, but we don't do it that way. We do it by continuously streaming that beforementioned transaction log. This gives the ability to do a point-in-time recovery - but that means, the entire database cluster will be created anew. It is not possible to do that from within an application running on top of the database, or for only a subset of the data.

PMc · Jul 14, 2021

Jose said:
This is a very hairy subject, but I believe what you're looking for is WAL archiving. That will help you with changes to the data in the RDBMS. I'm still looking for a good way of adding revision control to database schema changes. We use Flyway at work. It's just ok.

It's not so hairy. It is how I learned it do be done with the Informix and Oracle clusters at the banks and insurance corps.
Then when I figured that postgres can do this as well, I did it all the same. (Thats why I started to use ZFS in 2008, because with CoW the WAL can be 10 times smaller.)
But when restoring from such a backup, the entire installation will be created anew (and then a single database could be dumped out of that restore). It is not possible to select individual tables for restore - even less could an application-level user decide that they want just the individual rows that belong to them brought back to the state of yesterday. (And that's what I want.)

astyle · Jul 14, 2021

There's a difference between basic design of the database server and tools to manage the database. Postgresql's 'Time Travel' is a tool to manage the databases. Yeah, the tool did get deprecated 20 years ago. The basic design separating the actual files on disk, tables, schemas, and keys (primary and foreign) stayed the same. I'd expect Oracle to maintain a log that is parseable to generate a per-table log.

And yes, it is possible to bring just one table back to yesterday's state. That would take a knowledgeable admin to tease out a one-table log that can be used to reconstruct the table. Realizing that some tables contain redundant info, and that the bespoke info very well can turn out to be foreign keys is one step. But beyond that - even if the tools are awkward to use, the basic design is still there.

PMc · Jul 14, 2021

astyle said:
There's a difference between basic design of the database server and tools to manage the database. Postgresql's 'Time Travel' is a tool to manage the databases. Yeah, the tool did get deprecated 20 years ago. The basic design separating the actual files on disk, tables, schemas, and keys (primary and foreign) stayed the same. I'd expect Oracle to maintain a log that is parseable to generate a per-table log.

I remember a discussion that mentioned a tool that would do such parsing on the redo-log (or WAL log, in that case). The postgres developers were not recommending it. Sadly I forgot about the specific context (it's a while back) and don't find the matter easily in my archives. What I am certain is, that grabbing things out of the main redo-logs is not a usual thing to do, as it is difficult to obtain data integrity from that source.

Because, if that were a reliable approach, people would not engage to devise other approaches to solve the history issue. While I was focusing on a solution inside the application (utilizing application tools), I just found another, more general approach here: https://clarkdave.net/2015/02/historical-records-with-postgresql-and-temporal-tables-and-sql-2011/
This does also not use the main database log, but instead implements extra tables and triggers within the database layout.

astyle · Jul 14, 2021

PMc said:
I remember a discussion that mentioned a tool that would do such parsing on the redo-log (or WAL log, in that case). The postgres developers were not recommending it. Sadly I forgot about the specific context (it's a while back) and don't find the matter easily in my archives. What I am certain is, that grabbing things out of the main redo-logs is not a usual thing to do, as it is difficult to obtain data integrity from that source.

Because, if that were a reliable approach, people would not engage to devise other approaches to solve the history issue. While I was focusing on a solution inside the application (utilizing application tools), I just found another, more general approach here: https://clarkdave.net/2015/02/historical-records-with-postgresql-and-temporal-tables-and-sql-2011/
This does also not use the main database log, but instead implements extra tables and triggers within the database layout.

Most tools are per-user anyway. And they all work with one database at a time, generally speaking. If you want to include a table from a different database, that's generally not an issue, just specify it correctly. Just take a random tool you're trying to use to roll just one table back to yesterday, and there'll be plenty of info on the Internet about how to do that. And data integrity will be right there. Rolling a single table back to yesterday is easy.

Making sure that the data rollback does not negatively affect queries - that's the hard part, that is the headache of every DBA worth their salt.

cmoerz · Jul 14, 2021

I'm unsure what features in particular you're looking for. Maybe it's worth taking a look at Sqitch (see http://sqitch.org)? They call it "Sensible database-native change management for framework-free development and dependable deployment."

There's a few alternatives, that fit different kinds of expectations, but that's the most general one, I'd say.

Sevendogsbsd · Jul 14, 2021

Just a brainstorming question: can individual tables be configured to save transaction logs so you can grab a snapshot in time of those or is it only at the DB level? I suppose that would depend on the DB brand in question too.

I assume you are talking about a DB fronted by say, a web application, where data is constantly being written and queried. You would like to roll back to a specific time, but only in certain tables. Did I understand that correctly?

astyle · Jul 14, 2021

Sevendogsbsd said:
Just a brainstorming question: can individual tables be configured to save transaction logs so you can grab a snapshot in time of those or is it only at the DB level? I suppose that would depend on the DB brand in question too.

No. that is part of server config to allow that or not. Within the server config file, you can restrict that to per-database or per-user. And tables are completely independent of the schema.

PMc · Jul 14, 2021

cmoerz said:
I'm unsure what features in particular you're looking for. Maybe it's worth taking a look at Sqitch (see http://sqitch.org)? They call it "Sensible database-native change management for framework-free development and dependable deployment."

This is database design change management. This is not what I want. I want database content change management.

Sevendogsbsd said:
I assume you are talking about a DB fronted by say, a web application, where data is constantly being written and queried. You would like to roll back to a specific time, but only in certain tables. Did I understand that correctly?

Yes!

astyle · Jul 14, 2021

Generally, rolling a table back to yesterday is only possible if that table is in your database. Most database front-ends are nothing more than forms for data entry and query. If you want to roll back - that is an admin task that requires you to use a different tool. Any RDBMS worth its salt will have per-user config that allows you to do something like 'DELETE * from table MyTable WHERE timestamp=today'. This means that as a regular user, you may or may not have access to the DELETE command. And any tool worth its salt will be able to connect to the database and show you the error if you try to delete a row when you don't have the correct database perms. Going back and forth does mean figuring out the INSERT and DELETE statements in SQL.

Alain De Vos · Jul 14, 2021

In pseudocode,
select * from table_to_backup > backupfile
cat backupfile | xargs -I {} insert {} into table_to_backup

Sevendogsbsd · Jul 14, 2021

Whew, I got it right

So, back to my original comment - are there transaction logs at a level other than the entire DB? It's been a long time since I have had to do much with databases on that level. I am pretty sure transaction logs can handle "rollbacks" that are less than the entire DB but again, I have been out of that world in a decade and a half.

astyle · Jul 14, 2021

Alain De Vos said:
select * from table_to_backup > backupfile

That part is done within the database client... you can't just issue it from bash or sh.

Alain De Vos said:
cat backupfile | xargs -I {} insert {} into table_to_backup

Won't quite work that way - the database client shell is primarily for SQL commands, not your regular UNIX command line.

PMc · Jul 14, 2021

Sevendogsbsd said:
Whew, I got it right So, back to my original comment - are there transaction logs at a level other than the entire DB? It's been a long time since I have had to do much with databases on that level. I am pretty sure transaction logs can handle "rollbacks" that are less than the entire DB but again, I have been out of that world in a decade and a half.

That is the question. In practice (meaning, current postgresql) a single transaction log is for ALL db together in the installation cluster (and it is used for crash recovery, full cluster restore, and whatever hot/cold/lukewarm standby schemes), and while it should theoretically be possible to filter any kind of information out of that, nobody wants to talk much about that, and certainly nobody would support it. For normal use-cases all the database internal stuff, including the transaction logs and all the filesystem-level files, are just blackboxes and handled by the database programs only.

Sevendogsbsd · Jul 14, 2021

I am trying to remember back to Oracle 10 and 11 and I am pretty sure the instances our DBAs managed could roll back individual transactions. The databases had millions of records so doing an entire rollback was not feasible. I can't speak to postgresql, sorry.

hardworkingnewbie · Jul 14, 2021

As far is I understood the OP wants to have a Git/RCS like versioning of changes to his database, meaning any updates/writes/deletes are being kept time stamped somewhere with the original data as well.

Then he wants to be able to say "oh, let's restore database record ID 3842 in table lusers at timestamp 20210714 1333", and the database should just do that while keeping. Basically the history track of git log and being able to view changes at the database as commits, so that he can go back in time if needed.

So without a further ado - have a look at Dolt. It's a RDBMS which has git like features builtin out of the box. Its open source.

Their own words:

Dolt is Git for Data!

Dolt is a SQL database that you can fork, clone, branch, merge, push and pull just like a git repository. Connect to Dolt just like any MySQL database to run queries or update the data using SQL commands. Use the command line interface to import CSV files, commit your changes, push them to a remote, or merge your teammate's changes.
All the commands you know for Git work exactly the same for Dolt. Git versions files, Dolt versions tables. It's like Git and MySQL had a baby!

Sevendogsbsd · Jul 14, 2021

All big DBMS's (Oracle, MSSQL) have this capability in the form of transaction logs that can be rolled back. I'll have to read up on dolt, not heard of it.

PMc · Jul 14, 2021

Sevendogsbsd said:
I am trying to remember back to Oracle 10 and 11 and I am pretty sure the instances our DBAs managed could roll back individual transactions. The databases had millions of records so doing an entire rollback was not feasible. I can't speak to postgresql, sorry.

Ah. That is interesting. Should have another look at Oracle on some occasion...

hardworkingnewbie said:
As far is I understood the OP wants to have a Git/RCS like versioning of changes to his database, meaning any updates/writes/deletes are being kept time stamped somewhere with the original data as well.

Then he wants to be able to say "oh, let's restore database record ID 3842 in table lusers at timestamp 20210714 1333", and the database should just do that while keeping. Basically the history track of git log and being able to view changes at the database as commits, so that he can go back in time if needed.

Yep, you got it.

hardworkingnewbie said:
So without a further ado - have a look at Dolt. It's a RDBMS which has git like features builtin out of the box. Its open source.

Well, that seems to swap the entire database engine!

astyle · Jul 14, 2021

hardworkingnewbie said:
As far is I understood the OP wants to have a Git/RCS like versioning of changes to his database, meaning any updates/writes/deletes are being kept time stamped somewhere with the original data as well.

Then he wants to be able to say "oh, let's restore database record ID 3842 in table lusers at timestamp 20210714 1333", and the database should just do that while keeping. Basically the history track of git log and being able to view changes at the database as commits, so that he can go back in time if needed.

So without a further ado - have a look at Dolt. It's a RDBMS which has git like features builtin out of the box. Its open source.

Their own words:

Dolt is Git for Data!
Dolt is a SQL database that you can fork, clone, branch, merge, push and pull just like a git repository. Connect to Dolt just like any MySQL database to run queries or update the data using SQL commands. Use the command line interface to import CSV files, commit your changes, push them to a remote, or merge your teammate's changes.
All the commands you know for Git work exactly the same for Dolt. Git versions files, Dolt versions tables. It's like Git and MySQL had a baby!

Had to read the github page to believe that last sentence...

My concern would be for the size of the live database. It's one thing to use git to manage the sql dumps that are used for restoring tables from backup. And - categorizing Dolt as a complete RDBMS is not technically correct - a complete RDBMS would include a database engine, server, service, and a client, like Oracle or MySQL. Dolt is a tool to do the management/data entry chores on a server it connects to. It's useless on its own, but in combination with something like MySQL or mariadb or something to connect to - just might be an interesting thing to try. I do agree that it just might serve PMc's needs well.

Alain De Vos · Jul 14, 2021

How to create a backup of a single table in a postgres database?

Is there a way to create a backup of a single table within a database using postgres? And how? Does this also work with the pg_dump command?

stackoverflow.com

Other database content: data revision control and history

Dolt is Git for Data!​

Dolt is Git for Data!​

Dolt is Git for Data!

Dolt is Git for Data!