Hacker News
PostgreSQL's fsync surprise (lwn.net)
187 points by craigkerstiens on May 2, 2018 | hide | past | favorite | 80 comments


I've said it before and I'll say it again: Modern filesystems straight out expose the wrong API for 99% of applications. App developers almost never think of data as a stream of bytes. We think of data as a set of records. Values in the set change through atomic modification events.

The fundamental API primitives should be atomic changes. Atomic write (bytes), and atomic append. The funny thing about it is that POSIX already supports basically this API (datagrams) for both IPC and networking. It just doesn't support this API in the one place it would be most useful - the filesystem.

Ideally I want:

- Write() to be blocking / atomic by default. Don't return until data is safely committed.

- A transactional API: begin(fd); write(); write(); err = commit(fd). If any error happens, commit returns the error and none of the data is stored.

- An IOCP-style API for non-blocking applications. This is the API databases want to use, with the loop being <get network request>, <write data to filesystem>, <yield>, <get write completion event>, <send confirmation to client>.

- Deprecate fsync & friends. If you don't want to wait for the data to get committed, write in non-blocking mode and ignore the completion event.

Solving this problem in end-user applications is really hard - almost no applications implement atomic write on top of filesystems correctly. And they shouldn't have to - this should be the job of the filesystem. The filesystem can do this much more safely, with better guarantees, better performance and better error handling. Modern filesystems already have journals - buggy reimplementations of journals in userland don't help anyone. "Do not turn off console while game is saving" is an embarrassment to everyone.
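
A sketch of what the proposed primitives might look like (pure pseudocode; begin()/commit() do not exist in any POSIX filesystem API today, and the names are made up for illustration):

    /* hypothetical transactional file API -- sketch only, not a real interface */
    int fd = open("accounts.db", O_RDWR);

    begin(fd);                     /* start an atomic group */
    write(fd, record_a, len_a);    /* held as part of the transaction */
    write(fd, record_b, len_b);
    err = commit(fd);              /* block until durable; on error, NOTHING
                                      from this group is visible on disk */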


This is why SQLite advertises itself as a replacement for fopen(). It's crazy hard to get right, and SQLite did the work so we don't have to: "Think of SQLite not as a replacement for Oracle but as a replacement for fopen()" - https://sqlite.org/about.html

Obviously this would be horrible for PG to use as a replacement for fopen() and other systems, but for many, many use-cases SQLite is a good replacement for fopen().


I don't understand why PG would change fopen() to SQLite calls in Arc, if Arc is supposed to be a general purpose programming language in the first place. Are there any other languages doing that?


I think the gp is referring to pg as postgres, not Paul Graham.

As in, "Obviously [it] would be horrible for [postgres to be used] as a replacement for fopen() and other systems, but for many, many use-cases SQLite is a good replacement for fopen()"


Actually I think gp meant:

"Obviously [SQLite] would be horrible for postgres to use as a replacement for fopen() and other systems, but for many, many use-cases SQLite is a good replacement for fopen()"


this :)


"Gres" isn't a word; I don't understand why you abbreviate PostgreSQL as "PG".


Gres (as in ingress / postgres) in this case comes from the Latin Gradi (Proto Indo European - ghredh) and means "to step" or "go".

Gradi / Ghredh is also the same root used in words like transgress, degrade, gradual, centigrade, congress, etc.


To clarify further, the “gres” in Postgres and Ingres (the database management systems) is from the original acronym, that sub-part being Graphics REtrieval System.


The project itself makes that abbreviation often, such as in e.g. the pg_dump command.


How would you abbreviate it? The project uses PG as its abbreviation like @harperlee mentioned. I apologize for the confusion over what I meant.


> Modern filesystems straight out expose the wrong API for 99% of applications. App developers almost never think of data as a stream of bytes. We think of data as a set of records

As I recall, older operating systems like VMS and MULTICS all did this, and it was all tremendously complex. Unix’ simple stream-of-bytes file abstraction was a reaction to this, and it worked so well that it became the prevailing model. Before doing again what didn’t work before, check up on why it failed the previous time.


Unix is older than VMS (source: programmed on VMS when it was new), so it is unlikely that the unix file abstraction is a reaction to VMS.


Mainframes came before all of them and they used the records model.


Your recollection is faulty; RMS was an absolute pleasure to work with. Fast, clean and efficient, and concurrent access was pain-free too. RMS supported streams as well as records.

Others have mentioned SQLite as replacement for fopen()... that’s a fairly RMS-like experience on modern systems

(I wonder if this is being downvoted because people think RMS means Stallman)


I think you guys are arguing worse is better versus the right thing. :)

Unix made for very simple filesystem internals, with a crappy interface (worse is better). RMS was a very complicated filesystem with a nice interface (the right thing).


> RMS was an absolute pleasure to work with. Fast, clean and efficient, and concurrent access was pain-free too. RMS supported streams as well as records.

Hummmph, it was kind of a pleasure on a VAX, but on a PDP-11 under RSX or RSTS/E the libraries for it ate so much of the (16 bit) address space that it became a real challenge to fit a useful application into memory.

Also, it only sort of supported streams, they were not really byte streams like Unix, but variable length record streams with CRLF marking the record delimiters. So porting Unix code to VMS was never quite as straightforward as one hoped.


> - An IOCP-style API for non-blocking applications.

This exists, it's called Linux AIO (distinct from Posix AIO). The problem is, when you use it, you have to reimplement caching (and buffering) in userspace instead of journaling – which is just as hard to get right and can just as easily – maybe even more easily – lead to corruption. (Postgres, as an example, relies on the OS to buffer writes.)

> - Write() to be blocking / atomic by default. Don't return until data is safely committed.

This is the wrong thing for 99% of use cases. You get terrible performance unless you batch things, which means that casual use ends up with severe performance issues. (IIRC this was actually a problem on Android devices – many apps were misusing SQLite by not using transactions, resulting in every single database update being a separate atomic write to disk. Not only did this kill performance, but it caused excessive flash wear.)

> Modern filesystems already have journals - buggy reimplementions of journals in userland doesn't help anyone.

Databases (among other systems) need features and control over the journal that a filesystem cannot provide. (Think MVCC, replication, etc.)

Beside – someone has to write the filesystem, and that filesystem uses largely the same mechanisms in the kernel that are exposed to userspace.

I agree fully that fsync ought to be deprecated for something with much more clearly-defined semantics. Both OS X and Linux have made attempts at this (F_FULLFSYNC and sync_file_range, respectively), though clearly at least Linux still has some work to do.

But – barring such unclear semantics – the general model of using fsync to guarantee ordering is not a complex one to understand, and matches most use cases well.


> (Postgres, as an example, relies on the OS to buffer writes.)

We don't really buffer writes via the OS that much, most of them are buffered in postgres' internal buffer cache. We even force the kernel's hand to write out things earlier (otherwise there often are huge latency spikes). It's more for reads that it's useful, because it allows the OS to cache as much IO data as the current memory situation on the system makes sensible.

> > Modern filesystems already have journals - buggy reimplementions of journals in userland doesn't help anyone.

You end up with journals anyway. There's no way to be performant otherwise, because you'd do hundreds of tiny writes instead of a few bigger journal writes. Imagine if your 1-row / 3 index transaction would have to write out ~7 blocks.


> You end up with journals anyway. There's no way to be performant otherwise, because you'd do hundreds of tiny writes instead of a few bigger journal writes. Imagine if your 1-row / 3 index transaction would have to write out ~7 blocks.

We can make that fast by letting the OS / libc batch write transactions. Although if `write()` blocks until the batch commits & gets called in a loop we'd have a performance problem. Non-blocking APIs don't have that problem though.

The guarantee I want is that every write moves the system between two states: (not written)->(written). It should be impossible to reach some third (partially written) state through disk failure or an inappropriate power loss.


> We can make that fast by letting the OS / libc batch write transactions.

Those are two very different things, potentially. If each write traverses the userspace-kernel boundary, there is no way to make that fast for small writes. If the batching happens entirely in userspace, but just in a standard implementation, then sure.

But people tend to severely underestimate how expensive syscalls are.


> We can make that fast by letting the OS / libc batch write transactions.

That's not the problem. (OSes already do this.) The problem is one of locality. (Data writes are spread out among many parts of the disk, causing many writes to be issued for any given transaction.) You get around this by (1) logging to a journal, then (2) batching many seconds worth of writes in the hopes that they touch fewer blocks than the sum of their parts (or, in the case of rotary media, that they can be ordered efficiently). You can't just "wait" for this to happen before responding to the client unless you like 5+ second service times.

Which brings us back to the necessity of a journal, which, as I already stated, must be managed by the application because different applications require different features and control which the filesystem doesn't (and likely can't) expose.


> > - Write() to be blocking / atomic by default. Don't return until data is safely committed.

> This is the wrong thing for 99% of use cases. You get terrible performance unless you batch things, which means that casual use ends up with severe performance issues.

The right answer would be to batch writes in the OS, but of course this would deadlock without non-blocking APIs. Maybe you're right - maybe write() should have the effectively non-blocking API it has now with fsync as a commit. But I still want my guarantee that if the system / hardware dies halfway through a write (or batch of writes) then data won't be left in a corrupt half-written state. Package installations and OS upgrades should be able to do the entirety of their work in a filesystem-level transaction. I shouldn't need to worry about my OS needing a reinstall because my laptop ran out of power halfway through an update. I just can't think of any sensible way to do this using the APIs we have today.

> Beside – someone has to write the filesystem, and that filesystem uses largely the same mechanisms in the kernel that are exposed to userspace.

But importantly not the same mechanisms. The OS has control over write ordering, and most block devices support read & write barriers. Filesystems lean heavily on these primitives for correctness - and for good reason; they're really useful primitives for building correct, performant systems. But for some reason they aren't exposed to userspace. Userspace applications are instead left scratching in the dirt with fsync. It would be much easier to build correct, performant databases in userspace if we had access to this stuff.


A filesystem-level transaction would be a good idea. Maybe some FSes (e.g. ZFS) already support such a thing. You can get this (somewhat awkwardly) for a single file with any FS that supports copy-on-write by making a (lazy) copy of the file, doing your updates there, and then renaming it to the original file name (which is guaranteed to be atomic on most FSes). With smaller files this is standard practice even on FSes without copy-on-write.

> most block devices support read & write barriers

No, they really don't. Some hardware devices might. I'm not aware of any SAN that does. (I've specifically asked this question of the developers of three.)

> Filesystems lean heavily on these primitives for correctness

In Linux, not since 2010 they don't [1], for exactly the reason that they're poorly implemented by devices.

[1] https://lwn.net/Articles/400541/


NTFS has transactions...

https://msdn.microsoft.com/en-us/library/windows/desktop/aa3...

...but sadly, they seem to be on the path of being deprecated by Microsoft...

https://msdn.microsoft.com/en-us/library/windows/desktop/hh8...


Congratulations, you've just described Oracle Database Filesystem.

https://docs.oracle.com/cd/E11882_01/appdev.112/e18294/adlob...


Microsoft introduced transactional file system (and registry) APIs in Windows Vista [1]. But it was so complex that no one ever used it, and now it is semi-officially deprecated.

[1] https://en.wikipedia.org/wiki/Transactional_NTFS


I used it, and I would disagree that it was "complex" to use. No more than non-transacted Win32 file APIs anyway.

We did a nice P/Invoke wrapper around native Win32 functions in C#, exposing FileStream and such, that the rest of the code could then easily consume.

I think the main complexity is that there are no "official" high-level wrappers the way there are for non-transacted APIs. But at the level of Win32, there is no significant difference.


No no no, we need async I/O for files. Writing should not be synchronous, but you should be able to find out about completion of each write.

Asynchrony is absolutely critical for performance.

Filesystem I/O, and, really, disk/SSD I/O is very much the same as network I/O nowadays. You might be traversing a SAN, for example, or paying HDD seek latency, which may be reasonable for an HDD but is just way too high these days.

Transactional APIs are very difficult for filesystems. Barriers are enough for many apps. A barrier is also much easier to start using in existing apps. You'd call a system call to request a write barrier, and wait, preferably asynchronously, for completion notice (whereupon you'd know all writes done before the barrier must have reached stable storage).
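
The barrier idea might look something like this (pure pseudocode; no such syscall exists today, and the names are invented for illustration):

    /* hypothetical async write-barrier API -- sketch of the idea only */
    write(fd, journal_rec, n);      /* must hit disk first */
    cookie = io_barrier(fd);        /* request barrier; returns immediately */
    write(fd, data_page, n);        /* may not be reordered before the barrier */
    wait_event(queue, cookie);      /* completion => everything written before
                                       the barrier is on stable storage */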


So, basically, we need the NT-style, completion-oriented, inherently-asynchronous I/O model? ;-) (Amen!)


You misspelled VMS as NT. And actually VMS borrowed the IO model from RSX-11.

Async QIO style IO is all good clean fun, until someone writes something like:

  fml(...) {
      char buf[1024];
      sys$qio(QIO_READ, buf, sizeof(buf));  /* async read into buf; returns immediately */
      return;                               /* buf's stack frame is gone, read still pending */
  }
  
  main() {
      fml();
      do_stuff();   //surprise package delivered into stack here!
  }


Yes. Every system call that can block should take an event queue argument. If you want the system call to block then the C library's stub should wait for it for you.


>Ideally I want:

everything you describe will murder IO performance because you're essentially implementing a database with MVCC. it makes much more sense to implement what you described as a VFS library.


ZFS gets a lot of the way there - and indeed has lots of similarities to a database in its internals.

(In fact, ZFS with `sync=always` set on the dataset would do exactly what the parent commenter is asking for without requiring a whole new API. If someone really cared about performance they could even make this perform well with a battery-backed DRAM device for their intent log... but by that point we're way outside of the discussion of a general-purpose filesystem. If you really need guaranteed consistency for every write _without_ crippling performance, ZFS can do it for you at a hardware cost. Writes get cached in RAM, batched, flushed in transaction groups, and the fast/durable SLOG protects against corruption or loss of power or device failure without harming performance much. $$$ though. You'd better have quite a bit of RAM, and need a proper enterprise-grade DRAM storage device - or probably two in a mirror if your data is this critical).


Murder the IO performance of what, exactly? For what use case is fast non-atomic writes something we care about? Atomic-append is the IO operation we should be optimizing for! What murders IO performance is trying to reimplement a second write-ahead-log in userspace on top of fsync. Which is something modern databases have to actually do in order to meet their correctness requirements. This complexity and performance cost would be much lower with knowledge of the kernel & control over filesystem internals.

And for every case of something that would have lower IO performance with atomic appends, I'm sure we can think of ways the kernel could make those atomic writes fast. Let's say you're copying a big file by wrapping a transaction around a read();write() loop. The OS could make the transaction extremely cheap by simply not changing the file's INODE list (or whatever) until the write has completed. In a filesystem like ZFS, using a transaction for this would let ZFS not update its filesystem checksums until the copy has finished, which would probably in aggregate result in improved performance.


> "ideally I want:

- Write() to be blocking / atomic by default. Don't return until data is safely committed."

How do you deal with the user pulling out a USB stick?


There's a bunch of ways to solve this. A WAL or a B+ tree both handle this sort of thing correctly.


I would love it if the OS just said ‘put it back or your files are gone’ and then let you reinsert the drive without data loss.


I've been surprised to see an apparent consensus from the filesystem developers that Postgres should be using direct IO.

I worry that if the Postgres people do make that change, they'll find themselves hearing from a different set of kernel developers that they should have known direct IO doesn't work properly and they should be using buffered IO instead.

Previously I'd thought the latter was the general view from the kernel side.

For example this message from ten years ago, and other strongly-worded views in that thread: https://lkml.org/lkml/2007/1/10/235

In particular, I'd taken this bit as a suggestion that if people found problems with buffered IO then the right thing to do is to ask the kernel side to improve things, rather than switch:

« As a result, our madvise and/or posix_fadvise interfaces may not be all that strong, because people sadly don't use them that much. It's a sad example of a totally broken interface (O_DIRECT) resulting in better interfaces not getting used, and then not getting as much development effort put into them. »


> I worry that if the Postgres people do make that change, they'll find themselves hearing from a different set of kernel developers that they should have known direct IO doesn't work properly and they should be using buffered IO instead.

That definitely will happen. But the fact remains that at the moment you'll get considerably higher performance when expertly using O_DIRECT, and there's nothing on the horizon to change that.

> For example this message from ten years ago, and other strongly-worded views in that thread: https://lkml.org/lkml/2007/1/10/235

> In particular, I'd taken this bit as a suggestion that if people found problems with buffered IO then the right thing to do is to ask the kernel side to improve things, rather than switch:

I think partially that's just been overtaken by reality. A database is guaranteed to need its own buffer pool and you're a) going to have more information about recency in there b) the OS caching adds a good chunk of additional overhead. With buffered IO we (PostgreSQL) already had to add code to manage e.g. the amount of dirty data caching the OS does. The only reason DIO isn't always going to be beneficial after doing the necessary architectural improvements, is that the OS buffer pool is more adaptive in mixed use / not as well tuned databases.


InnoDB (i.e. MySQL) has been using O_DIRECT for well over a decade. I think it’s fair to conclude it’s reliable by now.


Can we ignore for a second what the proper behavior should be, and instead focus on the documentation.

In my opinion, even a careful reading of the fsync man page does not cover what exactly happens if you close an fd, reopen the file in another process, and then call fsync. Am I supposed to read kernel source code? Ideally, after reading a man page, I should have no questions about exactly what guarantees are provided by an API.


I'm always surprised by what a mess doing what seem like simple file operations is. Maybe even more surprised that everything seems to generally work pretty well even with those issues. Even "I want to save this file" requires numerous sync operations on the file and the directory its in.

I'm certainly not qualified to criticize anyone for the current situation, and, as the article points out, even some of the more egregious sounding behavior (marking pages as clean after a write fails) has a pretty reasonable explanation. But, IIRC, as storage capacities continue to rise, error rates aren't falling nearly as fast. So, I'm left kinda wondering if there is some day in the future where the likelihood of encountering an error finally gets high enough that things don't work pretty well anymore.


Although it's not exactly associated with this as such, there is a growing understanding that SMB/CIFS shares have a nasty habit of reporting "on storage" before the data really is safe. That is a bit of a problem for many backup systems, unless you do a verify afterwards and pick up the pieces. Backups can involve massive files with odd write and read patterns, and databases generally involve quite large files with odd read and write patterns compared to say document storage.

Perhaps we need database and backup oriented filesystems and not funny looking files on top of generic filesystems.


Ironically, most sophisticated database engines do implement complete file systems, treating those "funny looking files" as little more than virtual block devices. In fact, with very little extra code, you can trivially retarget some database kernels to run directly on top of raw block devices, eliminating the redundant file system. It partly depends on the storage management requirements of the user e.g. if they expect to share block devices across unrelated applications. In my experience, the raw block device code is simpler and more reliable; there are many odd edge cases in Linux file system behavior that come up that you must account for if you require robust and reliable storage behavior on top of one.

There are some additional performance and behavioral advantages to working with the storage devices directly. Anecdotally, if you run databases on virtual machines (never recommended but many people do), using raw block devices instead of a file system often seems to eliminate much of the disk I/O weirdness that occurs under VMs.


> you can trivially retarget some database kernels to run directly on top of raw block devices, eliminating the redundant file system

e.g. in mysql: https://dev.mysql.com/doc/refman/8.0/en/innodb-raw-devices.h...
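
Going from memory of that manual page (treat the exact device name and size as illustrative), the setup is a my.cnf fragment along these lines:

```ini
[mysqld]
# Point the InnoDB system tablespace at a raw partition.
# Use "newraw" for the first start (InnoDB initializes the partition),
# then change it to "raw" for normal operation.
innodb_data_home_dir=
innodb_data_file_path=/dev/hdd1:3Gnewraw
```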


It’s getting damned hard to avoid running a database on a VM these days.


Could you expand on what you mean by "weirdness" on VM disk I/O in the context of database storage?


The storage has anomalously high latency and throughput variance with some patterns that you don't see with non-virtualized storage and a modest degradation in average performance. This is expected, but it makes it difficult to schedule I/O efficiently. This is more noticeable if you are doing direct I/O because having a VM intercept your storage access defeats the purpose.

What was surprising is that the direct I/O behavior appears to be conditional on whether you are accessing the storage through a file system. My database kernel is block device agnostic, using files and raw devices interchangeably via direct I/O. Against expectations, when we accessed the same virtualized storage as raw block devices, the behavior was like bare metal even though we are running the exact same operations over the same direct I/O interface in a VM. Basically, the only difference was the file descriptor type.

I'm guessing that file systems are virtualization aware to some extent and access through them is actively managed; raw device accesses are VM oblivious and simply passed through by the storage virtualization layer.


Agree, there's already a certain trend towards e.g. etcd/co for online configuration management.

On top of that, many issues you may be facing re. files now have already been resolved if you change the stack: you can't do transactions with fs.


> what a mess doing what seem like simple file operations is

Proper handling and reporting of hardware-level errors all the way up through the stack (driver, block layer, filesystem, C library) to the application so it can recover in a reliable way is not a simple operation!

Simple operations are open/close/read/write. Those work. Until they don't, then you need to know how far back the operations you already did and "assumed" had worked didn't. And in this case the promise made to PostgreSQL by fsync() wasn't as firm as the "obvious" interpretation of the documentation would lead one to believe.


I don't doubt it's a hard problem. If there was a simple, obvious better way to do it, I imagine we'd have it by now.


> When a buffered I/O write fails due to a hardware-level error, filesystems will respond differently, but that behavior usually includes discarding the data in the affected pages and marking them as being clean.

That behavior seems problematic.

As always, there's a great Dan Luu blog post on the subject: https://danluu.com/filesystem-errors/

> Filesystem error handling seems to have improved. Reporting an error on a pwrite if the block device reports an error is perhaps the most basic error propagation a robust filesystem should do; few filesystems reported that error correctly in 2005. Today, most filesystems will correctly report an error when the simplest possible error condition that doesn’t involve the entire drive being dead occurs if there are no complicating factors.

Emphasis added.


Now, taking the case of a user pulling out a USB thumb drive as an excuse for not keeping the dirty pages around seems ... disingenuous?

If the storage device has disappeared for good, you can just return EIO for all further I/O operations, and mark all open file descriptions for which there were dirty pages such that any further fsync() calls on the corresponding fds return an error?

I mean, either you think you can still retry, in which case you should keep the dirty page around; or you think retrying is futile, in which case feel free to drop the dirty pages, but make sure anyone who tries something that would make this loss visible gets an error. That should only require keeping flags on open file descriptions, and possibly on pages/inodes/block devices that (semi-)persist the error at the desired resolution, which you can broaden if the bookkeeping uses too much memory.


Yeah. The USB case is a cop out. For USB, keeping the pages dirty and the fsyncs erroring (as seems consistent with Postgres' needs and common sense) seems fine.

The memory can be reclaimed when 'umount --force', or something like that, discards filesystem dirty state.


> Such a change, though, would take Linux behavior further away from what POSIX mandates and would raise some other questions, including: when and how would that flag ever be cleared? So this change seems unlikely to happen.

  fcntl(fd, F_CLR_FKNG_ERR);


I've seen this kind of problem when I was writing SSD firmware 10-15 years back. The operating systems just don't do much with the hardware reported errors. There are some old research papers on "IRON filesystems" that are pretty good reading on how poor the error handling was and maybe still is.


There's no way to recover from a failed write (if the drive is still operating and could reallocate the sector, it would have already done that). So mark the pages damaged and deallocate their contents. Keep the metadata for the damaged pages around until someone tries to sync or close the associated file.


> There's no way to recover from a failed write (if the drive is still operating and could reallocate the sector, it would have already done that).

That's not exactly true. In the thin-provisioned block device case, administrator action can make it resume accepting writes.


The whole issue is that postgres wants to know about the failed write, because if it knows it can recover from it, by failing over to a slave.


A write() can fail because a device read fails when the write() is not aligned on sector/page boundaries.


If you take into consideration that there are alternatives like FreeBSD and SmartOS which do not suffer from such serious and basic functionality malfunctions, it is illogical to keep putting up with GNU/Linux on the basis of being the only thing one is comfortable with.

Comfort is of little consolation or use if the operating system is this unreliable, especially since making sure that data is safely and correctly stored is core, basic functionality.


This article is really good with lots of details in those linked discussions.

Wondering what happens with other critical libraries such as RocksDB/LevelDB: what actually happens when there is a hardware error, not limited to an unplugged USB cable?


Does anyone know if a similar conversation around this issue is needed or being had in the MySQL community?


Sorry for being snarky, but from my ops experience MySQL manages to lose data even without hardware errors[1].

[1] My last experience was due to a bug where a certain pattern of data made MySQL/MariaDB think the data page was encrypted, after which it proceeded to discard that page and crash complaining that data is corrupted, and from that point on refused to start until the data got restored.


Ah, sort of “this sequence will never appear in user data” assumption, I guess?


Here's a simpler fix: when the underlying device produces an error then mark the in-core inode (not on disk) as having an error and have all further writes return EIO. Then fsync() too can notice the error state flag being set and also return EIO.


That's similar to Willy Tarreau's suggestion and suffers from the same issue - it only works up until that inode gets evicted due to memory pressure.


PG could keep it open. I know it doesn't, but it could and should.

Also, perhaps the error flag should keep the inode in core. Note that the pages would still get thrown out, so as far as memory pressure this is not the end of the world.


Can anyone more knowledgeable than I comment on how this affects FreeBSD?


It doesn't affect FreeBSD. https://lwn.net/Articles/752388/


"The job of calling fsync(), however, is handled in a single "checkpointer" process, which is concerned with keeping on-disk storage in a consistent state that can recover from failures"

And therein lies the rub. Sybase ASE calls fsync() upon every commit, which is the reason that database devices are still mostly implemented with raw devices. Before version 11.9.2 (as far as I recall) you ran the exact same risk if you used the file system as devices. Now it's safe, but performance can get pretty heinous on write intensive systems.


> Sybase ASE calls fsync() upon every commit

Those are journal commits, not the commits that the piece you quote is talking about (actual data files).


Sybase ASE opens files with either the O_SYNC or the O_DIRECT flag, the latter being available since ASE 15. The general good practice is to use raw devices or direct i/o for write intensive workloads. BTW it uses asynchronous i/o too.


> Andres Freund, like a number of other PostgreSQL developers, has acknowledged that DIO is the best long-term solution. But he also noted that getting there is "a metric ton of work" that isn't going to happen anytime soon.

No, that is a "recipe for disaster", as they say. Not doing something that everyone acknowledges is important, because it's a "lot of work", is what makes projects a mess. I've seen that many times on various projects.


Uh. We have five years of old versions to support. We're not going to backpatch the invasive stuff necessary to make DIO performant.

And it'd not be the default anyway, as it requires more tuning.

We are working on getting there.


I don't see how you could write your post unless you're so naive you think there are no trade-offs.

You sound like a Dilbert manager's "I don't care, just have it on my desk at all costs by Friday."

That is not sound engineering.


Diving in and quickly doing something complicated without a lot of careful consideration and testing is a recipe for disaster. Especially when there may be a simpler way of accomplishing the same goals, that just isn't available yet (be it future APIs, or some better way that no one has thought of yet).


What is DIO?


Direct input/output; the software performs write(2) system calls directly and does its own input/output buffering and scheduling, bypassing the operating system and filesystem driver buffering. For this to be effective and function correctly, the filesystem driver must support mounting in direct input/output mode, or raw character devices must be presented to the software in question. Note that presenting raw block devices bypasses filesystem driver’s buffering, but does not bypass operating system (kernel) buffering, hence character devices must be used to bypass both. Software which uses raw devices usually has sophisticated input/output scheduling and buffering optimally tuned for its use case.



