
Commit f12e504

open: add close_range()
This adds the close_range() syscall. It allows a task to efficiently close a
range of file descriptors, up to all file descriptors of the calling task.

The syscall came up in a recent discussion around the new mount API and
making new file descriptor types cloexec by default. During this discussion,
Al suggested the close_range() syscall (cf. [1]). Note, a syscall in this
manner has been requested by various people over time.

First, it helps to close all file descriptors of an exec()ing task. This can
be done safely via (quoting Al's example from [1] verbatim):

        /* that exec is sensitive */
        unshare(CLONE_FILES);
        /* we don't want anything past stderr here */
        close_range(3, ~0U);
        execve(....);

The code snippet above is one way of working around the problem that file
descriptors are not cloexec by default. This is aggravated by the fact that
we can't just switch them over without massively regressing userspace. For a
whole class of programs having an in-kernel method of closing all file
descriptors is very helpful (e.g. daemons, service managers, programming
language standard libraries, container managers etc.).
(Please note, unshare(CLONE_FILES) should only be needed if the calling task
is multi-threaded and shares the file descriptor table with another thread,
in which case two threads could race with one thread allocating file
descriptors and the other one closing them via close_range(). For the
general case close_range() before the execve() is sufficient.)

Second, it allows userspace to avoid implementing closing all file
descriptors by parsing through /proc/<pid>/fd/* and calling close() on each
file descriptor. From looking at various large(ish) userspace code bases
this or similar patterns are very common in:
- service managers (cf. [4])
- libcs (cf. [6])
- container runtimes (cf. [5])
- programming language runtimes/standard libraries
  - Python (cf. [2])
  - Rust (cf. [7], [8])
As Dmitry pointed out there's even a long-standing glibc bug about missing
kernel support for this task (cf. [3]).
In addition, the syscall will also work for tasks that do not have procfs
mounted and on kernels that do not have procfs support compiled in. In such
situations the only way to make sure that all file descriptors are closed is
to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE,
OPEN_MAX trickery (cf. comment [8] on Rust).

The performance is striking. For good measure, comparing the following
simple close_all_fds() userspace implementation that is essentially just
glibc's version in [6]:

        static int close_all_fds(void)
        {
                int dir_fd;
                DIR *dir;
                struct dirent *direntp;

                dir = opendir("/proc/self/fd");
                if (!dir)
                        return -1;
                dir_fd = dirfd(dir);
                while ((direntp = readdir(dir))) {
                        int fd;
                        if (strcmp(direntp->d_name, ".") == 0)
                                continue;
                        if (strcmp(direntp->d_name, "..") == 0)
                                continue;
                        fd = atoi(direntp->d_name);
                        if (fd == dir_fd || fd == 0 || fd == 1 || fd == 2)
                                continue;
                        close(fd);
                }
                closedir(dir);
                return 0;
        }

to close_range() yields:
1. closing 4 open files:
   - close_all_fds(): ~280 us
   - close_range():    ~24 us

2. closing 1000 open files:
   - close_all_fds(): ~5000 us
   - close_range():    ~800 us

close_range() is designed to allow for some flexibility. Specifically, it
does not simply always close all open file descriptors of a task. Instead,
callers can specify an upper bound.
This is e.g. useful for scenarios where specific file descriptors are
created with well-known numbers that are supposed to be excluded from
getting closed.
For extra paranoia close_range() comes with a flags argument. This can e.g.
be used to implement extensions. One can imagine userspace wanting to stop
at the first error instead of ignoring errors under certain circumstances.
There might be other valid ideas in the future. In any case, a flag argument
doesn't hurt and keeps us on the safe side.

From an implementation side this is kept rather dumb. It saw some input
from David and Jann but all nonsense is obviously my own!
- Errors to close file descriptors are currently ignored. (Could be changed
  by setting a flag in the future if needed.)
- __close_range() is a rather simplistic wrapper around __close_fd().
  My reasoning behind this is based on the nature of how __close_fd() needs
  to release an fd. But maybe I misunderstood specifics:
  We take the files_lock and rcu-dereference the fdtable of the calling
  task, we find the entry in the fdtable, get the file and need to release
  files_lock before calling filp_close().
  In the meantime the fdtable might have been altered so we can't just
  retake the spinlock and keep the old rcu-reference of the fdtable
  around. Instead we need to grab a fresh reference to the fdtable.
  If my reasoning is correct then there's really no point in fancyfying
  __close_range(): We just need to rcu-dereference the fdtable of the
  calling task once to cap the max_fd value correctly and then go on
  calling __close_fd() in a loop.

/* References */
[1]: https://lore.kernel.org/lkml/[email protected]/
[2]: https://github.com/python/cpython/blob/9e4f2f3a6b8ee995c365e86d976937c141d867f8/Modules/_posixsubprocess.c#L220
[3]: https://sourceware.org/bugzilla/show_bug.cgi?id=10353#c7
[4]: https://github.com/systemd/systemd/blob/5238e9575906297608ff802a27e2ff9effa3b338/src/basic/fd-util.c#L217
[5]: https://github.com/lxc/lxc/blob/ddf4b77e11a4d08f09b7b9cd13e593f8c047edc5/src/lxc/start.c#L236
[6]: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/grantpt.c;h=2030e07fa6e652aac32c775b8c6e005844c3c4eb;hb=HEAD#l17
     Note that this is an internal implementation that is not exported.
     Currently, libc seems to not provide an exported version of this
     because of missing kernel support to do this.
[7]: rust-lang/rust#12148
[8]: https://github.com/rust-lang/rust/blob/5f47c0613ed4eb46fca3633c1297364c09e5e451/src/libstd/sys/unix/process2.rs#L303-L308
     Rust's solution is slightly different but performs equally poorly.
     Rust calls getdtablesize() which is a glibc library function that
     simply returns the current RLIMIT_NOFILE or OPEN_MAX values. Rust then
     goes on to call close() on each fd. That's obviously overkill for most
     tasks. Tasks, especially non-daemons, rarely hit RLIMIT_NOFILE or
     OPEN_MAX.
     Let's be nice and assume an unprivileged user with RLIMIT_NOFILE set
     to 1024. Even in this case, there's a very high chance that in the
     common case Rust is calling the close() syscall 1021 times pointlessly
     if the task just has 0, 1, and 2 open.

Suggested-by: Al Viro <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: David Howells <[email protected]>
Cc: Dmitry V. Levin <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Florian Weimer <[email protected]>
Cc: [email protected]
---
v1:
- Linus Torvalds <[email protected]>:
  - add cond_resched() to yield cpu when closing a lot of file descriptors
- Al Viro <[email protected]>:
  - add cond_resched() to yield cpu when closing a lot of file descriptors

v2: unchanged
1 parent a188339 commit f12e504


22 files changed

+99
-9
lines changed


arch/alpha/kernel/syscalls/syscall.tbl

+1
@@ -473,3 +473,4 @@
 541 common fsconfig sys_fsconfig
 542 common fsmount sys_fsmount
 543 common fspick sys_fspick
+545 common close_range sys_close_range

arch/arm/tools/syscall.tbl

+1
@@ -447,3 +447,4 @@
 431 common fsconfig sys_fsconfig
 432 common fsmount sys_fsmount
 433 common fspick sys_fspick
+435 common close_range sys_close_range

arch/arm64/include/asm/unistd32.h

+2
@@ -886,6 +886,8 @@ __SYSCALL(__NR_fsconfig, sys_fsconfig)
 __SYSCALL(__NR_fsmount, sys_fsmount)
 #define __NR_fspick 433
 __SYSCALL(__NR_fspick, sys_fspick)
+#define __NR_close_range 435
+__SYSCALL(__NR_close_range, sys_close_range)
 
 /*
  * Please add new compat syscalls above this comment and update

arch/ia64/kernel/syscalls/syscall.tbl

+1
@@ -354,3 +354,4 @@
 431 common fsconfig sys_fsconfig
 432 common fsmount sys_fsmount
 433 common fspick sys_fspick
+435 common close_range sys_close_range

arch/m68k/kernel/syscalls/syscall.tbl

+1
@@ -433,3 +433,4 @@
 431 common fsconfig sys_fsconfig
 432 common fsmount sys_fsmount
 433 common fspick sys_fspick
+435 common close_range sys_close_range

arch/microblaze/kernel/syscalls/syscall.tbl

+1
@@ -439,3 +439,4 @@
 431 common fsconfig sys_fsconfig
 432 common fsmount sys_fsmount
 433 common fspick sys_fspick
+435 common close_range sys_close_range

arch/mips/kernel/syscalls/syscall_n32.tbl

+1
@@ -372,3 +372,4 @@
 431 n32 fsconfig sys_fsconfig
 432 n32 fsmount sys_fsmount
 433 n32 fspick sys_fspick
+435 n32 close_range sys_close_range

arch/mips/kernel/syscalls/syscall_n64.tbl

+1
@@ -348,3 +348,4 @@
 431 n64 fsconfig sys_fsconfig
 432 n64 fsmount sys_fsmount
 433 n64 fspick sys_fspick
+435 n64 close_range sys_close_range

arch/mips/kernel/syscalls/syscall_o32.tbl

+1
@@ -421,3 +421,4 @@
 431 o32 fsconfig sys_fsconfig
 432 o32 fsmount sys_fsmount
 433 o32 fspick sys_fspick
+435 o32 close_range sys_close_range

arch/parisc/kernel/syscalls/syscall.tbl

+1
@@ -430,3 +430,4 @@
 431 common fsconfig sys_fsconfig
 432 common fsmount sys_fsmount
 433 common fspick sys_fspick
+435 common close_range sys_close_range

arch/powerpc/kernel/syscalls/syscall.tbl

+1
@@ -515,3 +515,4 @@
 431 common fsconfig sys_fsconfig
 432 common fsmount sys_fsmount
 433 common fspick sys_fspick
+435 common close_range sys_close_range

arch/s390/kernel/syscalls/syscall.tbl

+1
@@ -436,3 +436,4 @@
 431 common fsconfig sys_fsconfig sys_fsconfig
 432 common fsmount sys_fsmount sys_fsmount
 433 common fspick sys_fspick sys_fspick
+435 common close_range sys_close_range sys_close_range

arch/sh/kernel/syscalls/syscall.tbl

+1
@@ -436,3 +436,4 @@
 431 common fsconfig sys_fsconfig
 432 common fsmount sys_fsmount
 433 common fspick sys_fspick
+435 common close_range sys_close_range

arch/sparc/kernel/syscalls/syscall.tbl

+1
@@ -479,3 +479,4 @@
 431 common fsconfig sys_fsconfig
 432 common fsmount sys_fsmount
 433 common fspick sys_fspick
+435 common close_range sys_close_range

arch/x86/entry/syscalls/syscall_32.tbl

+1
@@ -438,3 +438,4 @@
 431 i386 fsconfig sys_fsconfig __ia32_sys_fsconfig
 432 i386 fsmount sys_fsmount __ia32_sys_fsmount
 433 i386 fspick sys_fspick __ia32_sys_fspick
+435 i386 close_range sys_close_range __ia32_sys_close_range

arch/x86/entry/syscalls/syscall_64.tbl

+1
@@ -355,6 +355,7 @@
 431 common fsconfig __x64_sys_fsconfig
 432 common fsmount __x64_sys_fsmount
 433 common fspick __x64_sys_fspick
+435 common close_range __x64_sys_close_range
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact

arch/xtensa/kernel/syscalls/syscall.tbl

+1
@@ -404,3 +404,4 @@
 431 common fsconfig sys_fsconfig
 432 common fsmount sys_fsmount
 433 common fspick sys_fspick
+435 common close_range sys_close_range

fs/file.c

+54-8
@@ -10,6 +10,7 @@
 #include <linux/syscalls.h>
 #include <linux/export.h>
 #include <linux/fs.h>
+#include <linux/kernel.h>
 #include <linux/mm.h>
 #include <linux/sched/signal.h>
 #include <linux/slab.h>
@@ -615,12 +616,9 @@ void fd_install(unsigned int fd, struct file *file)
 
 EXPORT_SYMBOL(fd_install);
 
-/*
- * The same warnings as for __alloc_fd()/__fd_install() apply here...
- */
-int __close_fd(struct files_struct *files, unsigned fd)
+static struct file *pick_file(struct files_struct *files, unsigned fd)
 {
-	struct file *file;
+	struct file *file = NULL;
 	struct fdtable *fdt;
 
 	spin_lock(&files->file_lock);
@@ -632,15 +630,63 @@ int __close_fd(struct files_struct *files, unsigned fd)
 		goto out_unlock;
 	rcu_assign_pointer(fdt->fd[fd], NULL);
 	__put_unused_fd(files, fd);
-	spin_unlock(&files->file_lock);
-	return filp_close(file, files);
 
 out_unlock:
 	spin_unlock(&files->file_lock);
-	return -EBADF;
+	return file;
+}
+
+/*
+ * The same warnings as for __alloc_fd()/__fd_install() apply here...
+ */
+int __close_fd(struct files_struct *files, unsigned fd)
+{
+	struct file *file;
+
+	file = pick_file(files, fd);
+	if (!file)
+		return -EBADF;
+
+	return filp_close(file, files);
 }
 EXPORT_SYMBOL(__close_fd); /* for ksys_close() */
 
+/**
+ * __close_range() - Close all file descriptors in a given range.
+ *
+ * @fd: starting file descriptor to close
+ * @max_fd: last file descriptor to close
+ *
+ * This closes a range of file descriptors. All file descriptors
+ * from @fd up to and including @max_fd are closed.
+ */
+int __close_range(struct files_struct *files, unsigned fd, unsigned max_fd)
+{
+	unsigned int cur_max;
+
+	if (fd > max_fd)
+		return -EINVAL;
+
+	rcu_read_lock();
+	cur_max = files_fdtable(files)->max_fds;
+	rcu_read_unlock();
+
+	/* cap to last valid index into fdtable */
+	max_fd = min(max_fd, (cur_max - 1));
+	while (fd <= max_fd) {
+		struct file *file;
+
+		file = pick_file(files, fd++);
+		if (!file)
+			continue;
+
+		filp_close(file, files);
+		cond_resched();
+	}
+
+	return 0;
+}
+
 /*
  * variant of __close_fd that gets a ref on the file for later fput
  */
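The comment "cap to last valid index into fdtable" describes a clamp: the loop should never walk past the last slot of the table, even when the caller passes ~0U as the upper bound. The following userspace toy model sketches that logic under stated assumptions. struct toy_fdtable and toy_close_range() are hypothetical names, a bool array stands in for the real fdtable, and no locking or RCU is modelled.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's fdtable. */
struct toy_fdtable {
	bool *open;            /* open[i] is true while descriptor i is in use */
	unsigned int max_fds;  /* number of slots, like fdtable->max_fds */
};

/* Closes every open slot in [fd, max_fd]; returns how many were closed. */
static unsigned int toy_close_range(struct toy_fdtable *t,
				    unsigned int fd, unsigned int max_fd)
{
	unsigned int closed = 0;

	if (fd > max_fd || t->max_fds == 0)
		return 0;

	/*
	 * Clamp to the last valid index. Taking the larger value instead
	 * would leave max_fd at ~0U, and "fd <= max_fd" could then never
	 * become false for an unsigned counter.
	 */
	if (max_fd > t->max_fds - 1)
		max_fd = t->max_fds - 1;

	for (; fd <= max_fd; fd++) {
		if (!t->open[fd])
			continue;      /* like pick_file() returning NULL */
		t->open[fd] = false;
		closed++;
	}

	return closed;
}
```

Because the clamped bound is at most max_fds - 1, the counter cannot wrap around, so the loop always terminates.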

fs/open.c

+20
@@ -1174,6 +1174,26 @@ SYSCALL_DEFINE1(close, unsigned int, fd)
 	return retval;
 }
 
+/**
+ * close_range() - Close all file descriptors in a given range.
+ *
+ * @fd: starting file descriptor to close
+ * @max_fd: last file descriptor to close
+ * @flags: reserved for future extensions
+ *
+ * This closes a range of file descriptors. All file descriptors
+ * from @fd up to and including @max_fd are closed.
+ * Currently, errors to close a given file descriptor are ignored.
+ */
+SYSCALL_DEFINE3(close_range, unsigned int, fd, unsigned int, max_fd,
+		unsigned int, flags)
+{
+	if (flags)
+		return -EINVAL;
+
+	return __close_range(current->files, fd, max_fd);
+}
+
 /*
  * This routine simulates a hangup on the tty, to arrange that users
  * are given clean terminals at login time.
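Taken together, the syscall entry point and __close_range() reject two kinds of input with -EINVAL: any non-zero flags value, and a range whose start lies above its end. A caller that wants to validate arguments before entering the kernel could mirror those checks as sketched below. check_close_range_args() is a hypothetical userspace helper and only reflects the checks visible in this patch.

```c
#include <errno.h>

/*
 * Hypothetical pre-flight check mirroring the argument validation in this
 * patch: flags are reserved and must be 0, and fd must not exceed max_fd.
 * Returns 0 if the arguments would be accepted, -EINVAL otherwise.
 */
static int check_close_range_args(unsigned int fd, unsigned int max_fd,
				  unsigned int flags)
{
	if (flags)
		return -EINVAL; /* no flags are defined by this patch */

	if (fd > max_fd)
		return -EINVAL; /* empty or inverted range */

	return 0;
}
```

A single-descriptor range (fd equal to max_fd) is valid, since the upper bound is inclusive.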

include/linux/fdtable.h

+2
@@ -121,6 +121,8 @@ extern void __fd_install(struct files_struct *files,
 		unsigned int fd, struct file *file);
 extern int __close_fd(struct files_struct *files,
 		unsigned int fd);
+extern int __close_range(struct files_struct *files, unsigned int fd,
+			 unsigned int max_fd);
 extern int __close_fd_get_file(unsigned int fd, struct file **res);
 
 extern struct kmem_cache *files_cachep;

include/linux/syscalls.h

+2
@@ -441,6 +441,8 @@ asmlinkage long sys_fchown(unsigned int fd, uid_t user, gid_t group);
 asmlinkage long sys_openat(int dfd, const char __user *filename, int flags,
 			   umode_t mode);
 asmlinkage long sys_close(unsigned int fd);
+asmlinkage long sys_close_range(unsigned int fd, unsigned int max_fd,
+				unsigned int flags);
 asmlinkage long sys_vhangup(void);
 
 /* fs/pipe.c */

include/uapi/asm-generic/unistd.h

+3-1
@@ -844,9 +844,11 @@ __SYSCALL(__NR_fsconfig, sys_fsconfig)
 __SYSCALL(__NR_fsmount, sys_fsmount)
 #define __NR_fspick 433
 __SYSCALL(__NR_fspick, sys_fspick)
+#define __NR_close_range 435
+__SYSCALL(__NR_close_range, sys_close_range)
 
 #undef __NR_syscalls
-#define __NR_syscalls 434
+#define __NR_syscalls 436
 
 /*
  * 32 bit systems traditionally used different
