
> You just tail /var/log/apache/access.log which all of them write to by virtue of PIPE_BUF semantics.

I can see how this would be convenient. That's fine and I have nothing against it. It's just not a problem I consider interesting, because there is no inherent need for all 100 writers to write to the same file. You could just as easily have each write to a separate file and have a separate process that merges them together if you want a single file for convenience.
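A rough sketch of that alternative, assuming each writer prefixes its lines with a fixed-width timestamp so that plain string order matches time order. The file names and the `write_log` / `merge_logs` helpers are hypothetical, made up for illustration:

```python
import heapq
import os
import tempfile


def write_log(path, entries):
    """One writer, one file. entries are (timestamp, message) pairs in time order."""
    with open(path, "w") as f:
        for ts, msg in entries:
            # Zero-padded fixed-width timestamp: lexicographic order == numeric order.
            f.write(f"{ts:020.6f} {msg}\n")


def merge_logs(paths, out_path):
    """Merge per-writer files into a single file ordered by timestamp prefix."""
    files = [open(p) for p in paths]
    try:
        with open(out_path, "w") as out:
            # heapq.merge streams its inputs and assumes each is already sorted.
            out.writelines(heapq.merge(*files))
    finally:
        for f in files:
            f.close()


# Two writers with their own files, then a separate merge step.
workdir = tempfile.mkdtemp()
paths = [os.path.join(workdir, name) for name in ("writer-a.log", "writer-b.log")]
write_log(paths[0], [(1.0, "a1"), (3.0, "a2")])
write_log(paths[1], [(2.0, "b1")])
merged = os.path.join(workdir, "merged.log")
merge_logs(paths, merged)
```

No writer ever touches another writer's file, so nothing here depends on how the kernel interleaves concurrent appends.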

> I really don't see how any of this is relevant to the original point of having "log" files.

Maybe you're not aware: every major database keeps a commit log, and writes it to a file called a log file (sometimes "write-ahead log file"). That is what I think of when I hear about "log" files in the context of file durability, because this is the case where file consistency and durability actually affect the consistency of the system. Here are some examples:

https://www.sqlite.org/tempfiles.html#walfile

http://www.postgresql.org/docs/9.1/static/wal-intro.html

https://leveldb.googlecode.com/svn/trunk/doc/impl.html

This is a much more interesting problem (to me) because it is harder and relevant to the consistency of any database system.

> Now you're talking about DB transaction logs and blockchains, which really don't need to have their nuances implemented in terms of POSIX file semantics. They can just have a single process that manages those expectations.

You do realize that SQLite can't just corrupt your database because you ran two SQLites concurrently, right?

SQLite has chosen to support concurrent SQLite processes all writing to the same database. Maybe you think they shouldn't do that, but they find it useful. And even in your world where they shouldn't do that, it's still not ok to just corrupt the database because the user didn't follow the rules.

> Everyone's expectations don't have to be implemented on the filesystem level

One of the kernel's key responsibilities is to arbitrate concurrent access to shared resources. It provides primitives that make it possible to build higher-level abstractions. What I am describing is a primitive that allows for lock-free appends to consistent transaction logs. It could be useful for a lot of things, and makes at least as much sense as plenty of other things that are already in syscall interfaces.
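To make the intended semantics concrete, here is a userspace emulation of such a primitive. `compare_and_append` is a hypothetical name and signature, not an existing syscall, and this sketch takes an advisory `flock` to get the effect, which is exactly the cost a kernel-level version would be meant to remove. It only works if every writer goes through the same function:

```python
import fcntl
import os
import tempfile


def compare_and_append(fd, expected_size, data):
    """Append data only if the file is still exactly expected_size bytes long.

    Returns the new file size on success, or None if the file has changed
    size since the caller last looked (the caller should re-read and retry).
    """
    fcntl.flock(fd, fcntl.LOCK_EX)
    try:
        if os.fstat(fd).st_size != expected_size:
            return None
        offset = expected_size
        view = memoryview(data)
        while view:  # pwrite may write fewer bytes than requested
            n = os.pwrite(fd, view, offset)
            offset += n
            view = view[n:]
        return offset
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)


fd, path = tempfile.mkstemp()
first = compare_and_append(fd, 0, b"abc")      # file was empty: succeeds
stale = compare_and_append(fd, 0, b"xyz")      # stale view of the log: rejected
second = compare_and_append(fd, first, b"de")  # up-to-date view: succeeds
os.close(fd)
```

A writer that gets `None` back knows its view of the log is stale, which is the same contract as a failed compare-and-swap.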



>> I really don't see how any of this is relevant to the original point of having "log" files.

> Maybe you're not aware: every major database keeps a commit log, and writes it to a file called a log file (sometimes "write-ahead log file").

> You do realize that SQLite can't just corrupt your database because you ran two SQLites concurrently, right?

Sure, but, so? I don't know sqlite's WAL logging code in detail, but I do know PostgreSQL's fairly intimately. I don't see how such an interface[1] would be relevant for WAL logging. Such logs usually have checksums and pointers to previous records in their format. For those to be correct each writing process needs to know about previous records (or at least their starting point). Thus you need locking and coordination in userspace anyway - kernel level append mechanics aren't that interesting.
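A toy illustration of why the format forces coordination. This is a made-up record layout, not PostgreSQL's actual WAL format: each record carries the start offset of the previous record and a CRC that covers that pointer, so a writer cannot even encode its record until it knows where the previous one began.

```python
import struct
import zlib

# Hypothetical header: previous record's start offset, payload length, CRC32.
HEADER = struct.Struct("<QII")
NO_PREV = 0xFFFFFFFFFFFFFFFF  # sentinel meaning "first record in the log"


def encode_record(prev_offset, payload):
    # The checksum covers the back-pointer as well as the payload.
    crc = zlib.crc32(struct.pack("<Q", prev_offset) + payload)
    return HEADER.pack(prev_offset, len(payload), crc) + payload


def append_record(log, prev_offset, payload):
    """Append to an in-memory log (a bytearray); return the new record's offset."""
    start = len(log)
    log.extend(encode_record(prev_offset, payload))
    return start


def walk_back(log, offset):
    """Follow back-pointers from the record at offset, verifying each checksum."""
    payloads = []
    while offset != NO_PREV:
        prev, length, crc = HEADER.unpack_from(log, offset)
        body = offset + HEADER.size
        payload = bytes(log[body:body + length])
        if zlib.crc32(struct.pack("<Q", prev) + payload) != crc:
            raise ValueError(f"corrupt record at offset {offset}")
        payloads.append(payload)
        offset = prev
    return payloads


log = bytearray()
first = append_record(log, NO_PREV, b"begin")
second = append_record(log, first, b"commit")
history = walk_back(log, second)
```

Two writers racing to append would both need `first` (or whatever the latest offset is) before building their record, so they have to agree on ordering before the write is issued.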

In addition to that, if you care about performance, you'll want to pre-allocate the WAL files and possibly re-use them after a checkpoint. For many filesystems overwriting files is a lot more efficient than allocating new blocks. It avoids the need for fs-internal metadata journaling, avoids fragmentation, etc. With pre-allocated files you can then use fdatasync() instead of fsync(), which can be a considerable performance benefit in our experience.
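Roughly, the pattern looks like the sketch below. It assumes Linux (for `os.fdatasync`); the segment size and the helper names are invented for illustration and do not reflect PostgreSQL's actual code. The file is zero-filled once and fully synced, so later writes overwrite existing blocks and never change the file size, which is what makes fdatasync() sufficient.

```python
import os
import tempfile

SEGMENT_SIZE = 64 * 1024  # hypothetical segment size, kept small for the example


def pwrite_all(fd, data, offset):
    """pwrite that retries on short writes; returns the offset after the data."""
    view = memoryview(data)
    while view:
        n = os.pwrite(fd, view, offset)
        offset += n
        view = view[n:]
    return offset


def preallocate_segment(path, size=SEGMENT_SIZE):
    """Create a zero-filled segment and sync data and metadata once, up front."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        pwrite_all(fd, bytes(size), 0)
        os.fsync(fd)  # the size just changed, so a full fsync is needed here
    except BaseException:
        os.close(fd)
        raise
    return fd


def write_record(fd, offset, data):
    """Overwrite part of the preallocated segment and flush only the data."""
    end = pwrite_all(fd, data, offset)
    os.fdatasync(fd)  # the file size is unchanged by an in-place overwrite
    return end


path = os.path.join(tempfile.mkdtemp(), "segment")
fd = preallocate_segment(path)
end = write_record(fd, 0, b"record-1")
size_after = os.fstat(fd).st_size
os.close(fd)
```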

There are things that'd make it easier and more efficient to write correct journaling, but imo not what you were talking about. Being able to query, and actually get guarantees about, which write sizes are atomic would be one example; without that you need rather expensive workarounds like logging full page contents to the WAL after checkpoints, or double-write buffers.

Proper asynchronous fsync(), fdatasync() would also be rather useful.

[1] > I'm basically talking about compare and swap, except instead of compare and swap it's compare and append



