09 Sep, 2020

1 commit

  • Using the read_iter/write_iter interfaces allows for in-kernel users
    to set sysctls without using set_fs(). Also, the buffer is a string,
    so give it the real type of 'char *', not void *.

    [AV: Christoph's fixup folded in]

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Matthew Wilcox (Oracle)
     

04 Jul, 2020

1 commit


11 Jun, 2020

2 commits

  • Pull sysctl fixes from Al Viro:
    "Fixups to regressions in sysctl series"

    * 'work.sysctl' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    sysctl: reject gigantic reads/write to sysctl files
    cdrom: fix an incorrect __user annotation on cdrom_sysctl_info
    trace: fix an incorrect __user annotation on stack_trace_sysctl
    random: fix an incorrect __user annotation on proc_do_entropy
    net/sysctl: remove leftover __user annotations on neigh_proc_dointvec*
    net/sysctl: use cpumask_parse in flow_limit_cpu_sysctl

    Linus Torvalds
     
  • Instead of triggering a WARN_ON deep down in the page allocator just
    give up early on allocations that are way larger than the usual sysctl
    values.
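
    The early-rejection idea can be sketched in userspace C as follows.
    This is a minimal illustration under stated assumptions, not the
    kernel code: SKETCH_MAX_SYSCTL_SIZE and sketch_alloc_sysctl_buf() are
    hypothetical names, and the cap value is arbitrary.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical cap for this sketch; the real kernel limit is not quoted here. */
#define SKETCH_MAX_SYSCTL_SIZE (4u * 1024u * 1024u)

/*
 * Allocate a bounce buffer for a sysctl read or write of `count` bytes.
 * Returns 0 and stores the buffer in *out on success, or -ENOMEM when the
 * request is implausibly large or the allocation fails.
 */
static int sketch_alloc_sysctl_buf(size_t count, char **out)
{
	*out = NULL;

	/* Give up early instead of handing a huge size to the allocator. */
	if (count >= SKETCH_MAX_SYSCTL_SIZE)
		return -ENOMEM;

	/* One extra byte so the buffer can always be NUL-terminated. */
	*out = calloc(count + 1, 1);
	return *out ? 0 : -ENOMEM;
}
```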

    Fixes: 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler")
    Reported-by: Vegard Nossum
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

09 Jun, 2020

4 commits

  • After a recent change introduced by Vlastimil's series [0], the kernel
    is now able to handle sysctl parameters on the kernel command line;
    the series also introduced a simple infrastructure to convert legacy
    boot parameters (that duplicate sysctls) into sysctl aliases.

    This patch converts the watchdog parameters softlockup_panic and
    {hard,soft}lockup_all_cpu_backtrace to use the new alias infrastructure.
    It fixes the documentation too, since the alias only accepts values 0 or
    1, not the full range of integers.

    We also took the opportunity here to improve the documentation of the
    previously converted hung_task_panic (see the patch series [0]) and put
    the alias table in alphabetical order.

    [0] http://lkml.kernel.org/r/20200427180433.7029-1-vbabka@suse.cz

    Signed-off-by: Guilherme G. Piccoli
    Signed-off-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Cc: Kees Cook
    Cc: Iurii Zaikin
    Cc: Luis Chamberlain
    Link: http://lkml.kernel.org/r/20200507214624.21911-1-gpiccoli@canonical.com
    Signed-off-by: Linus Torvalds

    Guilherme G. Piccoli
     
  • We can now handle sysctl parameters on kernel command line and have
    infrastructure to convert legacy command line options that duplicate
    sysctl to become a sysctl alias.

    This patch converts the hung_task_panic parameter. Note that the sysctl
    handler is more strict and allows only 0 and 1, while the legacy
    parameter allowed any non-zero value. But there is little reason anyone
    would not be using 1.

    Signed-off-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Reviewed-by: Kees Cook
    Acked-by: Michal Hocko
    Cc: Alexey Dobriyan
    Cc: Christian Brauner
    Cc: David Rientjes
    Cc: "Eric W . Biederman"
    Cc: Greg Kroah-Hartman
    Cc: "Guilherme G . Piccoli"
    Cc: Iurii Zaikin
    Cc: Ivan Teterevkov
    Cc: Luis Chamberlain
    Cc: Masami Hiramatsu
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20200427180433.7029-4-vbabka@suse.cz
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • We can now handle sysctl parameters on kernel command line, but
    historically some parameters introduced their own command line
    equivalent, which we don't want to remove for compatibility reasons.

    We can, however, convert them to the generic infrastructure with a table
    translating the legacy command line parameters to their sysctl names,
    and removing the one-off param handlers.

    This patch adds the support and makes the first conversion to
    demonstrate it, on the (deprecated) numa_zonelist_order parameter.
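
    A translation table of this kind might look like the userspace sketch
    below. The struct, array and function names are hypothetical, and the
    dotted sysctl names on the right are assumed from the usual
    /proc/sys layout rather than quoted from this log.

```c
#include <stddef.h>
#include <string.h>

/* Maps a legacy boot parameter to the sysctl it duplicates. */
struct sketch_sysctl_alias {
	const char *kernel_param;
	const char *sysctl_param;
};

/* Parameters named in this log; sysctl names are assumptions. Kept alphabetical. */
static const struct sketch_sysctl_alias sketch_aliases[] = {
	{ "hung_task_panic",     "kernel.hung_task_panic" },
	{ "numa_zonelist_order", "vm.numa_zonelist_order" },
	{ "softlockup_panic",    "kernel.softlockup_panic" },
};

/* Return the sysctl name for a legacy parameter, or NULL if it has no alias. */
static const char *sketch_sysctl_find_alias(const char *param)
{
	size_t i;

	for (i = 0; i < sizeof(sketch_aliases) / sizeof(sketch_aliases[0]); i++) {
		if (strcmp(sketch_aliases[i].kernel_param, param) == 0)
			return sketch_aliases[i].sysctl_param;
	}
	return NULL;
}
```

    With such a table, a one-off handler per legacy parameter is no longer
    needed: the generic code looks up the alias and proceeds as if the
    sysctl form had been given.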

    Signed-off-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Reviewed-by: Luis Chamberlain
    Acked-by: Kees Cook
    Acked-by: Michal Hocko
    Cc: Alexey Dobriyan
    Cc: Christian Brauner
    Cc: David Rientjes
    Cc: "Eric W . Biederman"
    Cc: Greg Kroah-Hartman
    Cc: "Guilherme G . Piccoli"
    Cc: Iurii Zaikin
    Cc: Ivan Teterevkov
    Cc: Masami Hiramatsu
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20200427180433.7029-3-vbabka@suse.cz
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Patch series "support setting sysctl parameters from kernel command line", v3.

    This series adds support for something that seems like many people
    always wanted but nobody added it yet, so here's the ability to set
    sysctl parameters via kernel command line options in the form of
    sysctl.vm.something=1

    The important part is Patch 1. The second, not so important part is an
    attempt to clean up legacy one-off parameters that do the same thing as
    a sysctl. I don't want to remove them completely for compatibility
    reasons, but with generic sysctl support the idea is to remove the
    one-off param handlers and treat the parameters as aliases for the
    sysctl variants.

    I have identified several parameters that mention sysctl counterparts in
    Documentation/admin-guide/kernel-parameters.txt but there might be more.
    The conversion also has varying levels of success:

    - numa_zonelist_order is converted in Patch 2 together with adding the
    necessary infrastructure. It's easy as it doesn't really do anything
    but warn on deprecated value these days.

    - hung_task_panic is converted in Patch 3, but there's a downside that
    now it only accepts 0 and 1, while previously it was any integer
    value

    - nmi_watchdog maps to two sysctls nmi_watchdog and hardlockup_panic,
    so there's no straightforward conversion possible

    - traceoff_on_warning is a flag without value and it would be required
    to handle that somehow in the conversion infrastructure, which seems
    pointless for a single flag

    This patch (of 5):

    A recently proposed patch to add vm_swappiness command line parameter in
    addition to existing sysctl [1] made me wonder why we don't have a
    general support for passing sysctl parameters via command line.

    Googling found only somebody else wondering the same [2], but I haven't
    found any prior discussion with reasons why not to do this.

    Setting the vm_swappiness issue aside (the underlying issue might be
    solved in a different way), quick search of kernel-parameters.txt shows
    there are already some that exist as both sysctl and kernel parameter -
    hung_task_panic, nmi_watchdog, numa_zonelist_order, traceoff_on_warning.

    A general mechanism would remove the need to add more of those one-offs
    and might be handy in situations where configuration by e.g.
    /etc/sysctl.d/ is impractical.

    Hence, this patch adds a new parse_args() pass that looks for parameters
    prefixed by 'sysctl.' and tries to interpret them as writes to the
    corresponding sys/ files using a temporary in-kernel procfs mount.
    This mechanism was suggested by Eric W. Biederman [3], as it handles
    all dynamically registered sysctl tables, even though we don't handle
    modular sysctls. Errors due to e.g. invalid parameter name or value
    are reported in the kernel log.

    The processing is hooked right before the init process is loaded, as
    some handlers might be more complicated than simple setters and might
    need some subsystems to be initialized. If the init process can be
    started at this point and eventually execute a process writing to
    /proc/sys/, then it should also be fine to do that from the kernel.

    Sysctls registered later on module load time are not set by this
    mechanism - it's expected that in such scenarios, setting sysctl values
    from userspace is practical enough.
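
    The name translation described above (strip the 'sysctl.' prefix, then
    treat the rest as a path below sys/) can be sketched in userspace C.
    The function and macro names are hypothetical, and the sketch covers
    only the string handling, not the procfs mount or the write itself.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_PREFIX "sysctl."

/*
 * Translate "sysctl.vm.swappiness" into "vm/swappiness", i.e. the path of
 * the file below the proc "sys/" directory. Returns 0 on success, -1 if the
 * parameter lacks the prefix, names nothing, or does not fit in `path`.
 */
static int sketch_sysctl_param_to_path(const char *param, char *path, size_t len)
{
	size_t plen = strlen(SKETCH_PREFIX);
	char *p;
	int n;

	if (strncmp(param, SKETCH_PREFIX, plen) != 0 || param[plen] == '\0')
		return -1;

	n = snprintf(path, len, "%s", param + plen);
	if (n < 0 || (size_t)n >= len)
		return -1;

	/* Dots separate sysctl name components; the filesystem uses slashes. */
	for (p = path; *p; p++)
		if (*p == '.')
			*p = '/';
	return 0;
}
```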

    [1] https://lore.kernel.org/r/BL0PR02MB560167492CA4094C91589930E9FC0@BL0PR02MB5601.namprd02.prod.outlook.com/
    [2] https://unix.stackexchange.com/questions/558802/how-to-set-sysctl-using-kernel-command-line-parameter
    [3] https://lore.kernel.org/r/87bloj2skm.fsf@x220.int.ebiederm.org/

    Signed-off-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Reviewed-by: Luis Chamberlain
    Reviewed-by: Masami Hiramatsu
    Acked-by: Kees Cook
    Acked-by: Michal Hocko
    Cc: Iurii Zaikin
    Cc: Ivan Teterevkov
    Cc: Michal Hocko
    Cc: David Rientjes
    Cc: Matthew Wilcox
    Cc: "Eric W . Biederman"
    Cc: "Guilherme G . Piccoli"
    Cc: Alexey Dobriyan
    Cc: Thomas Gleixner
    Cc: Greg Kroah-Hartman
    Cc: Christian Brauner
    Link: http://lkml.kernel.org/r/20200427180433.7029-1-vbabka@suse.cz
    Link: http://lkml.kernel.org/r/20200427180433.7029-2-vbabka@suse.cz
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

27 Apr, 2020

1 commit

  • Instead of having all the sysctl handlers deal with user pointers, which
    is rather hairy in terms of the BPF interaction, copy the input to and
    from userspace in common code. This also means that the strings are
    always NUL-terminated by the common code, making the API a little bit
    safer.

    As most handlers just pass the data through to one of the common
    handlers, a lot of the changes are mechanical.
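
    The shape of that change can be illustrated with a userspace sketch:
    the common code makes a terminated copy and the handler only ever sees
    that copy. All names here are hypothetical, and memcpy() stands in for
    whatever user-copy primitive the kernel uses.

```c
#include <stdlib.h>
#include <string.h>

/* Handler type for this sketch: it only ever sees a NUL-terminated copy. */
typedef int (*sketch_handler_t)(char *buf, size_t *lenp);

/*
 * Common write path: copy `count` bytes from the caller's buffer (standing in
 * for user memory), terminate the copy, then hand it to the handler.
 */
static int sketch_sysctl_write(sketch_handler_t handler, const void *ubuf, size_t count)
{
	char *kbuf = malloc(count + 1);
	int ret;

	if (!kbuf)
		return -1;
	memcpy(kbuf, ubuf, count);	/* stand-in for a user-copy primitive */
	kbuf[count] = '\0';		/* handlers can rely on termination */

	ret = handler(kbuf, &count);
	free(kbuf);
	return ret;
}

/* Example handler: parses a decimal integer from the terminated string. */
static int sketch_last_value;

static int sketch_int_handler(char *buf, size_t *lenp)
{
	(void)lenp;
	sketch_last_value = atoi(buf);
	return 0;
}
```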

    Signed-off-by: Christoph Hellwig
    Acked-by: Andrey Ignatov
    Signed-off-by: Al Viro

    Christoph Hellwig
     

24 Feb, 2020

1 commit

  • The function d_prune_aliases has the problem that it will only prune
    aliases that are completely unused. It will not remove aliases for
    the dcache or even think of removing mounts from the dcache. For that
    behavior d_invalidate is needed.

    To use d_invalidate replace d_prune_aliases with d_find_alias followed
    by d_invalidate and dput.

    For completeness the directory and the non-directory cases are
    separated because in theory (although not currently in practice for
    proc) directories can only ever have a single dentry while
    non-directories can have hardlinks and thus multiple dentries.
    As part of this separation use d_find_any_alias for directories
    to spare d_find_alias the extra work of doing that.

    Plus the differences between d_find_any_alias and d_find_alias make
    it clear why the directory and non-directory code do not share code.

    To make it clear these routines now invalidate dentries rename
    proc_prune_siblings_dcache to proc_invalidate_siblings_dcache, and rename
    proc_sys_prune_dcache to proc_sys_invalidate_dcache.

    V2: Split the directory and non-directory cases to make this
    code robust to future changes in proc.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

21 Feb, 2020

1 commit


20 Feb, 2020

1 commit

  • I am about to need and use the same functionality for pid based
    inodes and there is no point in adding a second field when
    this field is already here and serving the same purpose.

    Just give the field a generic name so it is clear that
    it is no longer sysctl specific.

    Also for good measure initialize sibling_inodes when
    proc_inode is initialized.

    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

04 Feb, 2020

1 commit

  • Currently core /proc code uses "struct file_operations" for custom hooks,
    however, VFS doesn't directly call them. Every time VFS expands
    file_operations hook set, /proc code bloats for no reason.

    Introduce "struct proc_ops" which contains only those hooks which /proc
    allows to call into (open, release, read, write, ioctl, mmap, poll). It
    doesn't contain module pointer as well.

    Save ~184 bytes per usage:

    add/remove: 26/26 grow/shrink: 1/4 up/down: 1922/-6674 (-4752)
    Function old new delta
    sysvipc_proc_ops - 72 +72
    ...
    config_gz_proc_ops - 72 +72
    proc_get_inode 289 339 +50
    proc_reg_get_unmapped_area 110 107 -3
    close_pdeo 227 224 -3
    proc_reg_open 289 284 -5
    proc_create_data 60 53 -7
    rt_cpu_seq_fops 256 - -256
    ...
    default_affinity_proc_fops 256 - -256
    Total: Before=5430095, After=5425343, chg -0.09%

    Link: http://lkml.kernel.org/r/20191225172228.GA13378@avx2
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

19 Jul, 2019

1 commit

  • In the sysctl code the proc_dointvec_minmax() function is often used to
    validate the user supplied value between an allowed range. This
    function uses the extra1 and extra2 members from struct ctl_table as
    minimum and maximum allowed value.

    On sysctl handler declaration, in every source file there are some
    readonly variables containing just an integer whose address is assigned
    to the extra1 and extra2 members, so the sysctl range is enforced.

    The special values 0, 1 and INT_MAX are very often used as range
    boundaries, leading to duplication of variables like zero=0, one=1,
    int_max=INT_MAX in different source files:

    $ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
    248

    Add a const int array containing the most commonly used values, some
    macros to refer more easily to the correct array member, and use them
    instead of creating a local one for every object file.
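
    The pattern can be sketched in userspace C as below. The sketch_*
    names and the cut-down table struct are hypothetical stand-ins; only
    the idea (one shared const array, macros pointing into it, extra1 and
    extra2 used as range bounds) is taken from the description above.

```c
#include <limits.h>
#include <stddef.h>

/* One shared array instead of per-file zero/one/int_max variables. */
static const int sketch_sysctl_vals[] = { 0, 1, INT_MAX };

#define SKETCH_ZERO    ((void *)&sketch_sysctl_vals[0])
#define SKETCH_ONE     ((void *)&sketch_sysctl_vals[1])
#define SKETCH_INT_MAX ((void *)&sketch_sysctl_vals[2])

/* Cut-down stand-in for struct ctl_table: only the fields the check needs. */
struct sketch_ctl_table {
	int *data;
	void *extra1;	/* minimum, or NULL for no lower bound */
	void *extra2;	/* maximum, or NULL for no upper bound */
};

/* Store `val` if it lies within [extra1, extra2]; return 0 or -1. */
static int sketch_set_minmax(const struct sketch_ctl_table *t, int val)
{
	if (t->extra1 && val < *(const int *)t->extra1)
		return -1;
	if (t->extra2 && val > *(const int *)t->extra2)
		return -1;
	*t->data = val;
	return 0;
}
```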

    This is the bloat-o-meter output comparing the old and new binary
    compiled with the default Fedora config:

    # scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
    add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
    Data old new delta
    sysctl_vals - 12 +12
    __kstrtab_sysctl_vals - 12 +12
    max 14 10 -4
    int_max 16 - -16
    one 68 - -68
    zero 128 28 -100
    Total: Before=20583249, After=20583085, chg -0.00%

    [mcroce@redhat.com: tipc: remove two unused variables]
    Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
    [akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
    [arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
    Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
    [akpm@linux-foundation.org: fix fs/eventpoll.c]
    Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
    Signed-off-by: Matteo Croce
    Signed-off-by: Arnd Bergmann
    Acked-by: Kees Cook
    Reviewed-by: Aaron Tomlin
    Cc: Matthew Wilcox
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matteo Croce
     

17 Jul, 2019

1 commit

  • Normally, the inode's i_uid/i_gid are translated relative to s_user_ns,
    but this is not a correct behavior for proc. Since sysctl permission
    check in test_perm is done against GLOBAL_ROOT_[UG]ID, it makes more
    sense to use these values in i_[ug]id of proc inodes. In other words:
    although uid/gid in the inode is not read during test_perm, the inode
    logically belongs to the root of the namespace. I have confirmed this
    with Eric Biederman at LPC and in this thread:
    https://lore.kernel.org/lkml/87k1kzjdff.fsf@xmission.com

    Consequences
    ============

    Since the i_[ug]id values of proc nodes are not used for permissions
    checks, this change usually makes no functional difference. However, it
    causes an issue in a setup where:

    * a namespace container is created without root user in container -
    hence the i_[ug]id of proc nodes are set to INVALID_[UG]ID

    * container creator tries to configure it by writing /proc/sys files,
    e.g. writing /proc/sys/kernel/shmmax to configure shared memory limit

    The kernel does not allow opening an inode for writing if its i_[ug]id
    are invalid, making it impossible to write shmmax and thus to configure
    the container.

    Using a container with no root mapping is apparently rare, but we do use
    this configuration at Google. Also, we use a generic tool to configure
    the container limits, and the inability to write any of them causes a
    failure.

    History
    =======

    The invalid uids/gids in inodes first appeared due to 81754357770e (fs:
    Update i_[ug]id_(read|write) to translate relative to s_user_ns).
    However, AFAIK, this did not immediately cause any issues. The
    inability to write to these "invalid" inodes was only caused by a later
    commit 0bd23d09b874 (vfs: Don't modify inodes with a uid or gid unknown
    to the vfs).

    Tested: Used a repro program that creates a user namespace without any
    mapping and stat'ed /proc/$PID/root/proc/sys/kernel/shmmax from outside.
    Before the change, it shows the overflow uid, with the change it's 0.
    The overflow uid indicates that the uid in the inode is not correct and
    thus it is not possible to open the file for writing.

    Link: http://lkml.kernel.org/r/20190708115130.250149-1-rburny@google.com
    Fixes: 0bd23d09b874 ("vfs: Don't modify inodes with a uid or gid unknown to the vfs")
    Signed-off-by: Radoslaw Burny
    Acked-by: Luis Chamberlain
    Cc: Kees Cook
    Cc: "Eric W . Biederman"
    Cc: Seth Forshee
    Cc: John Sperbeck
    Cc: Alexey Dobriyan
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Radoslaw Burny
     

03 May, 2019

1 commit


27 Apr, 2019

1 commit

  • Syzkaller reports this:

    sysctl could not get directory: /net//bridge -12
    kasan: CONFIG_KASAN_INLINE enabled
    kasan: GPF could be caused by NULL-ptr deref or user memory access
    general protection fault: 0000 [#1] SMP KASAN PTI
    CPU: 1 PID: 7027 Comm: syz-executor.0 Tainted: G C 5.1.0-rc3+ #8
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
    RIP: 0010:__write_once_size include/linux/compiler.h:220 [inline]
    RIP: 0010:__rb_change_child include/linux/rbtree_augmented.h:144 [inline]
    RIP: 0010:__rb_erase_augmented include/linux/rbtree_augmented.h:186 [inline]
    RIP: 0010:rb_erase+0x5f4/0x19f0 lib/rbtree.c:459
    Code: 00 0f 85 60 13 00 00 48 89 1a 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 89 f2 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 3c 02 00 0f 85 75 0c 00 00 4d 85 ed 4c 89 2e 74 ce 4c 89 ea 48
    RSP: 0018:ffff8881bb507778 EFLAGS: 00010206
    RAX: dffffc0000000000 RBX: ffff8881f224b5b8 RCX: ffffffff818f3f6a
    RDX: 000000000000000a RSI: 0000000000000050 RDI: ffff8881f224b568
    RBP: 0000000000000000 R08: ffffed10376a0ef4 R09: ffffed10376a0ef4
    R10: 0000000000000001 R11: ffffed10376a0ef4 R12: ffff8881f224b558
    R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
    FS: 00007f3e7ce13700(0000) GS:ffff8881f7300000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fd60fbe9398 CR3: 00000001cb55c001 CR4: 00000000007606e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    PKRU: 55555554
    Call Trace:
    erase_entry fs/proc/proc_sysctl.c:178 [inline]
    erase_header+0xe3/0x160 fs/proc/proc_sysctl.c:207
    start_unregistering fs/proc/proc_sysctl.c:331 [inline]
    drop_sysctl_table+0x558/0x880 fs/proc/proc_sysctl.c:1631
    get_subdir fs/proc/proc_sysctl.c:1022 [inline]
    __register_sysctl_table+0xd65/0x1090 fs/proc/proc_sysctl.c:1335
    br_netfilter_init+0x68/0x1000 [br_netfilter]
    do_one_initcall+0xbc/0x47d init/main.c:901
    do_init_module+0x1b5/0x547 kernel/module.c:3456
    load_module+0x6405/0x8c10 kernel/module.c:3804
    __do_sys_finit_module+0x162/0x190 kernel/module.c:3898
    do_syscall_64+0x9f/0x450 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    Modules linked in: br_netfilter(+) backlight comedi(C) hid_sensor_hub max3100 ti_ads8688 udc_core fddi snd_mona leds_gpio rc_streamzap mtd pata_netcell nf_log_common rc_winfast udp_tunnel snd_usbmidi_lib snd_usb_toneport snd_usb_line6 snd_rawmidi snd_seq_device snd_hwdep videobuf2_v4l2 videobuf2_common videodev media videobuf2_vmalloc videobuf2_memops rc_gadmei_rm008z 8250_of smm665 hid_tmff hid_saitek hwmon_vid rc_ati_tv_wonder_hd_600 rc_core pata_pdc202xx_old dn_rtmsg as3722 ad714x_i2c ad714x snd_soc_cs4265 hid_kensington panel_ilitek_ili9322 drm drm_panel_orientation_quirks ipack cdc_phonet usbcore phonet hid_jabra hid extcon_arizona can_dev industrialio_triggered_buffer kfifo_buf industrialio adm1031 i2c_mux_ltc4306 i2c_mux ipmi_msghandler mlxsw_core snd_soc_cs35l34 snd_soc_core snd_pcm_dmaengine snd_pcm snd_timer ac97_bus snd_compress snd soundcore gpio_da9055 uio ecdh_generic mdio_thunder of_mdio fixed_phy libphy mdio_cavium iptable_security iptable_raw iptable_mangle
    iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter bpfilter ip6_vti ip_vti ip_gre ipip sit tunnel4 ip_tunnel hsr veth netdevsim vxcan batman_adv cfg80211 rfkill chnl_net caif nlmon dummy team bonding vcan bridge stp llc ip6_gre gre ip6_tunnel tunnel6 tun joydev mousedev ppdev tpm kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel ide_pci_generic piix aes_x86_64 crypto_simd cryptd ide_core glue_helper input_leds psmouse intel_agp intel_gtt serio_raw ata_generic i2c_piix4 agpgart pata_acpi parport_pc parport floppy rtc_cmos sch_fq_codel ip_tables x_tables sha1_ssse3 sha1_generic ipv6 [last unloaded: br_netfilter]
    Dumping ftrace buffer:
    (ftrace buffer empty)
    ---[ end trace 68741688d5fbfe85 ]---

    Commit 23da9588037e ("fs/proc/proc_sysctl.c: fix NULL pointer
    dereference in put_links") forgot to handle the start_unregistering()
    case: when header->parent is NULL, it calls erase_header() and, as seen
    in the above syzkaller call trace, accessing &header->parent->root
    triggers a NULL pointer dereference.

    As that commit explained, there is also no need to call
    start_unregistering() if header->parent is NULL.

    Link: http://lkml.kernel.org/r/20190409153622.28112-1-yuehaibing@huawei.com
    Fixes: 23da9588037e ("fs/proc/proc_sysctl.c: fix NULL pointer dereference in put_links")
    Fixes: 0e47c99d7fe25 ("sysctl: Replace root_list with links between sysctl_table_sets")
    Signed-off-by: YueHaibing
    Reported-by: Hulk Robot
    Reviewed-by: Kees Cook
    Cc: Luis Chamberlain
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    YueHaibing
     

13 Apr, 2019

3 commits

  • Add file_pos field to bpf_sysctl context to read and write sysctl file
    position at which sysctl is being accessed (read or written).

    The field can be used to e.g. override whole sysctl value on write to
    sysctl even when sys_write is called by user space with file_pos > 0.
    Alternatively, a BPF program may reject such accesses.

    Signed-off-by: Andrey Ignatov
    Signed-off-by: Alexei Starovoitov

    Andrey Ignatov
     
  • Add helpers to work with new value being written to sysctl by user
    space.

    bpf_sysctl_get_new_value() copies value being written to sysctl into
    provided buffer.

    bpf_sysctl_set_new_value() overrides new value being written by user
    space with a one from provided buffer. Buffer should contain string
    representation of the value, similar to what can be seen in /proc/sys/.

    Both helpers can be used only on sysctl write.

    File position matters and can be managed by an interface that will be
    introduced separately. E.g. if user space calls sys_write to a file in
    /proc/sys/ at file position = X, where X > 0, then the value set by
    bpf_sysctl_set_new_value() will be written starting from X. If program
    wants to override whole value with specified buffer, file position has
    to be set to zero.

    Documentation for the new helpers is provided in bpf.h UAPI.

    Signed-off-by: Andrey Ignatov
    Signed-off-by: Alexei Starovoitov

    Andrey Ignatov
     
  • Containerized applications may run as root, which may create problems
    for the whole host. Specifically, such applications may change a sysctl
    and affect applications in other containers.

    Furthermore in existing infrastructure it may not be possible to just
    completely disable writing to sysctl, instead such a process should be
    gradual with ability to log what sysctl are being changed by a
    container, investigate, limit the set of writable sysctl to currently
    used ones (so that new ones can not be changed) and eventually reduce
    this set to zero.

    The patch introduces new program type BPF_PROG_TYPE_CGROUP_SYSCTL and
    attach type BPF_CGROUP_SYSCTL to solve these problems on cgroup basis.

    New program type has access to following minimal context:
    struct bpf_sysctl {
            __u32 write;
    };

    Where @write indicates whether sysctl is being read (= 0) or written (=
    1).

    Helpers to access sysctl name and value will be introduced separately.

    BPF_CGROUP_SYSCTL attach point is added to sysctl code right before
    passing control to ctl_table->proc_handler so that BPF program can
    either allow or deny access to sysctl.

    Suggested-by: Roman Gushchin
    Signed-off-by: Andrey Ignatov
    Signed-off-by: Alexei Starovoitov

    Andrey Ignatov
     

30 Mar, 2019

1 commit

  • Syzkaller reports:

    kasan: GPF could be caused by NULL-ptr deref or user memory access
    general protection fault: 0000 [#1] SMP KASAN PTI
    CPU: 1 PID: 5373 Comm: syz-executor.0 Not tainted 5.0.0-rc8+ #3
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
    RIP: 0010:put_links+0x101/0x440 fs/proc/proc_sysctl.c:1599
    Code: 00 0f 85 3a 03 00 00 48 8b 43 38 48 89 44 24 20 48 83 c0 38 48 89 c2 48 89 44 24 28 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 3c 02 00 0f 85 fe 02 00 00 48 8b 74 24 20 48 c7 c7 60 2a 9d 91
    RSP: 0018:ffff8881d828f238 EFLAGS: 00010202
    RAX: dffffc0000000000 RBX: ffff8881e01b1140 RCX: ffffffff8ee98267
    RDX: 0000000000000007 RSI: ffffc90001479000 RDI: ffff8881e01b1178
    RBP: dffffc0000000000 R08: ffffed103ee27259 R09: ffffed103ee27259
    R10: 0000000000000001 R11: ffffed103ee27258 R12: fffffffffffffff4
    R13: 0000000000000006 R14: ffff8881f59838c0 R15: dffffc0000000000
    FS: 00007f072254f700(0000) GS:ffff8881f7100000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fff8b286668 CR3: 00000001f0542002 CR4: 00000000007606e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    PKRU: 55555554
    Call Trace:
    drop_sysctl_table+0x152/0x9f0 fs/proc/proc_sysctl.c:1629
    get_subdir fs/proc/proc_sysctl.c:1022 [inline]
    __register_sysctl_table+0xd65/0x1090 fs/proc/proc_sysctl.c:1335
    br_netfilter_init+0xbc/0x1000 [br_netfilter]
    do_one_initcall+0xfa/0x5ca init/main.c:887
    do_init_module+0x204/0x5f6 kernel/module.c:3460
    load_module+0x66b2/0x8570 kernel/module.c:3808
    __do_sys_finit_module+0x238/0x2a0 kernel/module.c:3902
    do_syscall_64+0x147/0x600 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x462e99
    Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
    RSP: 002b:00007f072254ec58 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
    RAX: ffffffffffffffda RBX: 000000000073bf00 RCX: 0000000000462e99
    RDX: 0000000000000000 RSI: 0000000020000280 RDI: 0000000000000003
    RBP: 00007f072254ec70 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000246 R12: 00007f072254f6bc
    R13: 00000000004bcefa R14: 00000000006f6fb0 R15: 0000000000000004
    Modules linked in: br_netfilter(+) dvb_usb_dibusb_mc_common dib3000mc dibx000_common dvb_usb_dibusb_common dvb_usb_dw2102 dvb_usb classmate_laptop palmas_regulator cn videobuf2_v4l2 v4l2_common snd_soc_bd28623 mptbase snd_usb_usx2y snd_usbmidi_lib snd_rawmidi wmi libnvdimm lockd sunrpc grace rc_kworld_pc150u rc_core rtc_da9063 sha1_ssse3 i2c_cros_ec_tunnel adxl34x_spi adxl34x nfnetlink lib80211 i5500_temp dvb_as102 dvb_core videobuf2_common videodev media videobuf2_vmalloc videobuf2_memops udc_core lnbp22 leds_lp3952 hid_roccat_ryos s1d13xxxfb mtd vport_geneve openvswitch nf_conncount nf_nat_ipv6 nsh geneve udp_tunnel ip6_udp_tunnel snd_soc_mt6351 sis_agp phylink snd_soc_adau1761_spi snd_soc_adau1761 snd_soc_adau17x1 snd_soc_core snd_pcm_dmaengine ac97_bus snd_compress snd_soc_adau_utils snd_soc_sigmadsp_regmap snd_soc_sigmadsp raid_class hid_roccat_konepure hid_roccat_common hid_roccat c2port_duramar2150 core mdio_bcm_unimac iptable_security iptable_raw iptable_mangle
    iptable_nat nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter bpfilter ip6_vti ip_vti ip_gre ipip sit tunnel4 ip_tunnel hsr veth netdevsim devlink vxcan batman_adv cfg80211 rfkill chnl_net caif nlmon dummy team bonding vcan bridge stp llc ip6_gre gre ip6_tunnel tunnel6 tun crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel joydev mousedev ide_pci_generic piix aesni_intel aes_x86_64 ide_core crypto_simd atkbd cryptd glue_helper serio_raw ata_generic pata_acpi i2c_piix4 floppy sch_fq_codel ip_tables x_tables ipv6 [last unloaded: lm73]
    Dumping ftrace buffer:
    (ftrace buffer empty)
    ---[ end trace 770020de38961fd0 ]---

    A new dir entry can be created in get_subdir and its 'header->parent' is
    set to NULL. Only after insert_header succeeds is it set to 'dir';
    otherwise 'header->parent' stays NULL and drop_sysctl_table is called.
    However, in the error handling path of get_subdir, drop_sysctl_table is
    also called on 'new->header' regardless of the value of its parent
    pointer. Then put_links is called, which triggers a NULL-ptr deref when
    accessing a member of header->parent.

    In fact we have multiple error paths which call drop_sysctl_table() there;
    upon failure of insert_links() we also call drop_sysctl_table(). And even
    in the successful case of __register_sysctl_table() we still always call
    drop_sysctl_table(). This patch fixes it.
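
    The guard this describes can be modelled with a small userspace
    sketch. The structs and function names are simplified, hypothetical
    stand-ins for the kernel ones; the point is only that the parent is
    dereferenced solely when the header was actually inserted.

```c
#include <stddef.h>

struct sketch_dir {
	int nlinks;
};

struct sketch_header {
	struct sketch_dir *parent;	/* NULL until insertion succeeds */
	int nreg;			/* registration count */
};

/* Dereferences header->parent, so it must only run for inserted headers. */
static void sketch_put_links(struct sketch_header *h)
{
	h->parent->nlinks--;
}

/* Drop one registration; touch the parent only if the header was inserted. */
static void sketch_drop_table(struct sketch_header *h)
{
	if (--h->nreg)
		return;
	if (h->parent)
		sketch_put_links(h);
}
```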

    Link: http://lkml.kernel.org/r/20190314085527.13244-1-yuehaibing@huawei.com
    Fixes: 0e47c99d7fe25 ("sysctl: Replace root_list with links between sysctl_table_sets")
    Signed-off-by: YueHaibing
    Reported-by: Hulk Robot
    Acked-by: Luis Chamberlain
    Cc: Kees Cook
    Cc: Alexey Dobriyan
    Cc: Alexei Starovoitov
    Cc: Daniel Borkmann
    Cc: Al Viro
    Cc: Eric W. Biederman
    Cc: [3.4+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    YueHaibing
     

14 Dec, 2018

1 commit

  • proc_sys_lookup can fail with ENOMEM instead of ENOENT when the
    corresponding sysctl table is being unregistered. In our case we see
    this upon opening /proc/sys/net/*/conf files while network interfaces
    are being deleted, which confuses our configuration daemon.

    The problem was successfully reproduced and this fix tested on v4.9.122
    and v4.20-rc6.

    v2: return ERR_PTRs in all cases when proc_sys_make_inode fails instead
    of mixing them with NULL. Thanks Al Viro for the feedback.

    Fixes: ace0c791e6c3 ("proc/sysctl: Don't grab i_lock under sysctl_lock.")
    Cc: stable@vger.kernel.org
    Signed-off-by: Ivan Delalande
    Signed-off-by: Al Viro

    Ivan Delalande
     

13 Jun, 2018

1 commit

  • The kzalloc() function has a 2-factor argument form, kcalloc(). This
    patch replaces cases of:

    kzalloc(a * b, gfp)

    with:
    kcalloc(a, b, gfp)

    as well as handling cases of:

    kzalloc(a * b * c, gfp)

    with:

    kzalloc(array3_size(a, b, c), gfp)

    as it's slightly less ugly than:

    kcalloc(array_size(a, b), c, gfp)

    This does, however, attempt to ignore constant size factors like:

    kzalloc(4 * 1024, gfp)

    though any constants defined via macros get caught up in the conversion.

    Any factors with a sizeof() of "unsigned char", "char", and "u8" were
    dropped, since they're redundant.
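
    The point of the conversion is that the multiplication is checked for
    overflow instead of silently wrapping. A rough userspace sketch of that
    idea, using calloc() in place of the kernel allocator; the _sketch
    names are hypothetical and this is not the kernel implementation:

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Saturating product of three factors: returns SIZE_MAX on overflow, so a
 * following allocation fails instead of returning a buffer that is too
 * small. Modelled on the idea behind array3_size().
 */
static size_t array3_size_sketch(size_t a, size_t b, size_t c)
{
    size_t ab;

    if (a != 0 && b > SIZE_MAX / a)
        return SIZE_MAX;
    ab = a * b;
    if (ab != 0 && c > SIZE_MAX / ab)
        return SIZE_MAX;
    return ab * c;
}

/*
 * Zeroed array allocation that refuses an overflowing n * size, in the
 * spirit of kcalloc(n, size, gfp). calloc() performs the same check on
 * its own; it is spelled out here to make the behaviour visible.
 */
static void *zalloc_array_sketch(size_t n, size_t size)
{
    if (size != 0 && n > SIZE_MAX / size)
        return NULL;
    return calloc(n, size);
}
```

    With an open-coded a * b, an oversized count can wrap to a small number
    and the allocation succeeds with too little memory; with the two-factor
    form the request fails cleanly.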

    The Coccinelle script used for this was:

    // Fix redundant parens around sizeof().
    @@
    type TYPE;
    expression THING, E;
    @@

    (
    kzalloc(
    - (sizeof(TYPE)) * E
    + sizeof(TYPE) * E
    , ...)
    |
    kzalloc(
    - (sizeof(THING)) * E
    + sizeof(THING) * E
    , ...)
    )

    // Drop single-byte sizes and redundant parens.
    @@
    expression COUNT;
    typedef u8;
    typedef __u8;
    @@

    (
    kzalloc(
    - sizeof(u8) * (COUNT)
    + COUNT
    , ...)
    |
    kzalloc(
    - sizeof(__u8) * (COUNT)
    + COUNT
    , ...)
    |
    kzalloc(
    - sizeof(char) * (COUNT)
    + COUNT
    , ...)
    |
    kzalloc(
    - sizeof(unsigned char) * (COUNT)
    + COUNT
    , ...)
    |
    kzalloc(
    - sizeof(u8) * COUNT
    + COUNT
    , ...)
    |
    kzalloc(
    - sizeof(__u8) * COUNT
    + COUNT
    , ...)
    |
    kzalloc(
    - sizeof(char) * COUNT
    + COUNT
    , ...)
    |
    kzalloc(
    - sizeof(unsigned char) * COUNT
    + COUNT
    , ...)
    )

    // 2-factor product with sizeof(type/expression) and identifier or constant.
    @@
    type TYPE;
    expression THING;
    identifier COUNT_ID;
    constant COUNT_CONST;
    @@

    (
    - kzalloc
    + kcalloc
    (
    - sizeof(TYPE) * (COUNT_ID)
    + COUNT_ID, sizeof(TYPE)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(TYPE) * COUNT_ID
    + COUNT_ID, sizeof(TYPE)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(TYPE) * (COUNT_CONST)
    + COUNT_CONST, sizeof(TYPE)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(TYPE) * COUNT_CONST
    + COUNT_CONST, sizeof(TYPE)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(THING) * (COUNT_ID)
    + COUNT_ID, sizeof(THING)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(THING) * COUNT_ID
    + COUNT_ID, sizeof(THING)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(THING) * (COUNT_CONST)
    + COUNT_CONST, sizeof(THING)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(THING) * COUNT_CONST
    + COUNT_CONST, sizeof(THING)
    , ...)
    )

    // 2-factor product, only identifiers.
    @@
    identifier SIZE, COUNT;
    @@

    - kzalloc
    + kcalloc
    (
    - SIZE * COUNT
    + COUNT, SIZE
    , ...)

    // 3-factor product with 1 sizeof(type) or sizeof(expression), with
    // redundant parens removed.
    @@
    expression THING;
    identifier STRIDE, COUNT;
    type TYPE;
    @@

    (
    kzalloc(
    - sizeof(TYPE) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kzalloc(
    - sizeof(TYPE) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kzalloc(
    - sizeof(TYPE) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kzalloc(
    - sizeof(TYPE) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kzalloc(
    - sizeof(THING) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kzalloc(
    - sizeof(THING) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kzalloc(
    - sizeof(THING) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kzalloc(
    - sizeof(THING) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    )

    // 3-factor product with 2 sizeof(variable), with redundant parens removed.
    @@
    expression THING1, THING2;
    identifier COUNT;
    type TYPE1, TYPE2;
    @@

    (
    kzalloc(
    - sizeof(TYPE1) * sizeof(TYPE2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    kzalloc(
    - sizeof(TYPE1) * sizeof(TYPE2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    kzalloc(
    - sizeof(THING1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    kzalloc(
    - sizeof(THING1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    kzalloc(
    - sizeof(TYPE1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    |
    kzalloc(
    - sizeof(TYPE1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    )

    // 3-factor product, only identifiers, with redundant parens removed.
    @@
    identifier STRIDE, SIZE, COUNT;
    @@

    (
    kzalloc(
    - (COUNT) * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc(
    - COUNT * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc(
    - COUNT * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc(
    - (COUNT) * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc(
    - COUNT * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc(
    - (COUNT) * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc(
    - (COUNT) * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc(
    - COUNT * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    )

    // Any remaining multi-factor products, first at least 3-factor products,
    // when they're not all constants...
    @@
    expression E1, E2, E3;
    constant C1, C2, C3;
    @@

    (
    kzalloc(C1 * C2 * C3, ...)
    |
    kzalloc(
    - (E1) * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    |
    kzalloc(
    - (E1) * (E2) * E3
    + array3_size(E1, E2, E3)
    , ...)
    |
    kzalloc(
    - (E1) * (E2) * (E3)
    + array3_size(E1, E2, E3)
    , ...)
    |
    kzalloc(
    - E1 * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    )

    // And then all remaining 2 factors products when they're not all constants,
    // keeping sizeof() as the second factor argument.
    @@
    expression THING, E1, E2;
    type TYPE;
    constant C1, C2, C3;
    @@

    (
    kzalloc(sizeof(THING) * C2, ...)
    |
    kzalloc(sizeof(TYPE) * C2, ...)
    |
    kzalloc(C1 * C2 * C3, ...)
    |
    kzalloc(C1 * C2, ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(TYPE) * (E2)
    + E2, sizeof(TYPE)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(TYPE) * E2
    + E2, sizeof(TYPE)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(THING) * (E2)
    + E2, sizeof(THING)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(THING) * E2
    + E2, sizeof(THING)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - (E1) * E2
    + E1, E2
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - (E1) * (E2)
    + E1, E2
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - E1 * E2
    + E1, E2
    , ...)
    )

    Signed-off-by: Kees Cook

    Kees Cook
     

27 May, 2018

1 commit


12 Apr, 2018

3 commits

  • Patch series "ipc: Clamp *mni to the real IPCMNI limit", v3.

    The sysctl parameters msgmni, shmmni and semmni have an inherent limit
    of IPC_MNI (32k). However, users may not be aware of that because they
    can write a value much higher than that without getting any error or
    notification. Reading the parameters back will show the newly written
    values which are not real.

    Enforcing the limit by failing sysctl parameter write, however, can
    break existing user applications. To address this dilemma, a new flags
    field is introduced into the ctl_table. The value CTL_FLAGS_CLAMP_RANGE
    can be added to any ctl_table entries to enable a looser range clamping
    without returning any error. For example,

    .flags = CTL_FLAGS_CLAMP_RANGE,

    This flag value is now used for the range checking of shmmni, msgmni
    and semmni without breaking existing applications. If any out of range
    value is written to those sysctl parameters, the following warning will
    be printed instead.

    Kernel parameter "shmmni" was set out of range [0, 32768], clamped to 32768.

    Reading the values back will show 32768 instead of some fake values.
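
    The clamping behaviour described in this cover letter can be sketched
    in userspace as follows. struct range_param and set_clamped() are
    hypothetical names used only for illustration; the real series works on
    struct ctl_table and its proc handlers:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for a table entry with a [min, max] range. */
struct range_param {
    const char *name;
    int value;
    int min;
    int max;
};

/*
 * Store new_value, clamping it into [min, max] instead of failing the
 * write. Returns 1 if the value had to be clamped, 0 otherwise.
 */
static int set_clamped(struct range_param *p, long new_value)
{
    long stored = new_value;

    assert(p->min <= p->max);
    if (stored < p->min)
        stored = p->min;
    if (stored > p->max)
        stored = p->max;
    p->value = (int)stored;

    if (stored != new_value) {
        fprintf(stderr,
                "Kernel parameter \"%s\" was set out of range [%d, %d], clamped to %d.\n",
                p->name, p->min, p->max, p->value);
        return 1;
    }
    return 0;
}
```

    The write never fails, so existing applications keep working, and a
    later read returns the value that is really in effect.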

    This patch (of 6):

    Fix a typo.

    Link: http://lkml.kernel.org/r/1519926220-7453-2-git-send-email-longman@redhat.com
    Signed-off-by: Waiman Long
    Reviewed-by: Andrew Morton
    Acked-by: Luis R. Rodriguez
    Cc: Davidlohr Bueso
    Cc: Manfred Spraul
    Cc: Kees Cook
    Cc: Al Viro
    Cc: Matthew Wilcox
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Waiman Long
     
  • proc_sys_link_fill_cache() does not need to check whether we're called
    for a link - it's already done by scan().

    Link: http://lkml.kernel.org/r/20180228013506.4915-2-danilokrummrich@dk-develop.de
    Signed-off-by: Danilo Krummrich
    Acked-by: Kees Cook
    Reviewed-by: Andrew Morton
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Cc: "Luis R . Rodriguez"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Danilo Krummrich
     
  • proc_sys_link_fill_cache() does not take currently unregistering sysctl
    tables into account, which might result in a page fault in
    sysctl_follow_link() - add a check to fix it.

    This bug has been present since v3.4.

    Link: http://lkml.kernel.org/r/20180228013506.4915-1-danilokrummrich@dk-develop.de
    Fixes: 0e47c99d7fe25 ("sysctl: Replace root_list with links between sysctl_table_sets")
    Signed-off-by: Danilo Krummrich
    Acked-by: Kees Cook
    Reviewed-by: Andrew Morton
    Cc: "Luis R . Rodriguez"
    Cc: "Eric W. Biederman"
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Danilo Krummrich
     

12 Feb, 2018

1 commit

  • This is the mindless scripted replacement of kernel use of POLL*
    variables as described by Al, done by this script:

    for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
    L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
    for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
    done

    with de-mangling cleanups yet to come.

    NOTE! On almost all architectures, the EPOLL* constants have the same
    values as the POLL* constants do. But the keyword here is "almost".
    For various bad reasons they aren't the same, and epoll() doesn't
    actually work quite correctly in some cases due to this on Sparc et al.

    The next patch from Al will sort out the final differences, and we
    should be all done.

    Scripted-by: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

28 Nov, 2017

1 commit


02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

14 Jul, 2017

1 commit

  • Merge yet more updates from Andrew Morton:

    - various misc things

    - kexec updates

    - sysctl core updates

    - scripts/gdb updates

    - checkpoint-restart updates

    - ipc updates

    - kernel/watchdog updates

    - Kees's "rough equivalent to the glibc _FORTIFY_SOURCE=1 feature"

    - "stackprotector: ascii armor the stack canary"

    - more MM bits

    - checkpatch updates

    * emailed patches from Andrew Morton: (96 commits)
    writeback: rework wb_[dec|inc]_stat family of functions
    ARM: samsung: usb-ohci: move inline before return type
    video: fbdev: omap: move inline before return type
    video: fbdev: intelfb: move inline before return type
    USB: serial: safe_serial: move __inline__ before return type
    drivers: tty: serial: move inline before return type
    drivers: s390: move static and inline before return type
    x86/efi: move asmlinkage before return type
    sh: move inline before return type
    MIPS: SMP: move asmlinkage before return type
    m68k: coldfire: move inline before return type
    ia64: sn: pci: move inline before type
    ia64: move inline before return type
    FRV: tlbflush: move asmlinkage before return type
    CRIS: gpio: move inline before return type
    ARM: HP Jornada 7XX: move inline before return type
    ARM: KVM: move asmlinkage before type
    checkpatch: improve the STORAGE_CLASS test
    mm, migration: do not trigger OOM killer when migrating memory
    drm/i915: use __GFP_RETRY_MAYFAIL
    ...

    Linus Torvalds
     

13 Jul, 2017

3 commits

  • To keep parity with regular int interfaces, provide an unsigned int
    proc_douintvec_minmax() which allows you to specify a range of allowed
    valid numbers.

    Adding proc_douintvec_minmax_sysadmin() is easy but we can wait for an
    actual user for that.
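
    As an illustration of what a range-checked unsigned int store amounts
    to, here is a small userspace sketch. struct uint_range and
    store_uint_minmax() are hypothetical names, not the kernel's
    proc_douintvec_minmax() code:

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>

/* Hypothetical range descriptor for one unsigned int parameter. */
struct uint_range {
    unsigned int min;
    unsigned int max;
};

/*
 * Store val into *dst only if it fits in an unsigned int and lies inside
 * [min, max]. On rejection *dst is left untouched and -EINVAL is returned.
 */
static int store_uint_minmax(unsigned int *dst, unsigned long long val,
                             const struct uint_range *r)
{
    assert(r->min <= r->max);
    if (val > UINT_MAX)
        return -EINVAL;
    if (val < r->min || val > r->max)
        return -EINVAL;
    *dst = (unsigned int)val;
    return 0;
}
```

    Note that the upper bound may lie above INT_MAX, which is the case the
    plain int min/max helpers cannot express.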

    Link: http://lkml.kernel.org/r/20170519033554.18592-6-mcgrof@kernel.org
    Signed-off-by: Luis R. Rodriguez
    Acked-by: Kees Cook
    Cc: Subash Abhinov Kasiviswanathan
    Cc: Heinrich Schuchardt
    Cc: Kees Cook
    Cc: "David S. Miller"
    Cc: Ingo Molnar
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis R. Rodriguez
     
  • Commit e7d316a02f68 ("sysctl: handle error writing UINT_MAX to u32
    fields") added proc_douintvec() to start adding support for
    unsigned int; this, however, was only half the work needed. Two fixes
    have come in since then for the following issues:

    o Printing the values shows a negative value. This happens since
    do_proc_dointvec() is used, and it in turn uses proc_put_long()

    This was fixed by commit 5380e5644afbba9 ("sysctl: don't print negative
    flag for proc_douintvec").

    o We can easily wrap around the int values: UINT_MAX is 4294967295, if
    we echo in 4294967295 + 1 we end up with 0, using 4294967295 + 2 we
    end up with 1.
    o We echo negative values in and they are accepted

    This was fixed by commit 425fffd886ba ("sysctl: report EINVAL if value
    is larger than UINT_MAX for proc_douintvec").
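
    The wraparound and negative-value problems above come down to parsing
    into a wider type and rejecting anything outside [0, UINT_MAX] before
    narrowing. A userspace sketch under that assumption; parse_uint_strict()
    is a hypothetical helper, not the kernel's parser:

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Parse a decimal string into an unsigned int. Signed input and anything
 * above UINT_MAX are rejected with -EINVAL instead of wrapping around.
 * A single trailing newline is accepted, as with "echo N > file".
 */
static int parse_uint_strict(const char *s, unsigned int *out)
{
    char *end;
    unsigned long long v;

    assert(s != NULL && out != NULL);
    while (isspace((unsigned char)*s))
        s++;
    /* strtoull() would silently accept a leading '-', so refuse it here. */
    if (!isdigit((unsigned char)*s))
        return -EINVAL;

    errno = 0;
    v = strtoull(s, &end, 10);
    if (errno == ERANGE || v > UINT_MAX)
        return -EINVAL;
    if (*end != '\0' && *end != '\n')
        return -EINVAL;

    *out = (unsigned int)v;
    return 0;
}
```

    With this check, writing UINT_MAX + 1 is an error rather than a
    successful write of 0.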

    It also still failed to be added to sysctl_check_table(). Instead of
    adding it with the current implementation, just provide proper and
    simplified unsigned int support, without any array support for
    unsigned int and with no negative support at all.

    Historically sysctl proc helpers have supported arrays. Due to the
    complexity this adds, though, we've taken a step back to evaluate array
    users to determine if it's worth maintaining for unsigned int. An
    evaluation using Coccinelle has been done to perform a grammatical
    search to ask ourselves:

    o How many sysctl proc_dointvec() (int) users exist which likely
    should be moved over to proc_douintvec() (unsigned int) ?
    Answer: about 8
    - Of these how many are array users ?
    Answer: Probably only 1
    o How many sysctl array users exist ?
    Answer: about 12

    This last question gives us an idea of just how popular arrays are: they are not.
    Array support should probably just be kept for strings.

    The identified uint ports are:

    drivers/infiniband/core/ucma.c - max_backlog
    drivers/infiniband/core/iwcm.c - default_backlog
    net/core/sysctl_net_core.c - rps_sock_flow_sysctl()
    net/netfilter/nf_conntrack_timestamp.c - nf_conntrack_timestamp -- bool
    net/netfilter/nf_conntrack_acct.c nf_conntrack_acct -- bool
    net/netfilter/nf_conntrack_ecache.c - nf_conntrack_events -- bool
    net/netfilter/nf_conntrack_helper.c - nf_conntrack_helper -- bool
    net/phonet/sysctl.c proc_local_port_range()

    The only possible array user is proc_local_port_range(), but it does not
    seem worth it to add array support just for this, given that the range
    support works just as well. Unsigned int support is most desirable
    when you *need* more than INT_MAX, or when int min/max support does
    not suffice for your ranges.

    If you forget and by mistake happen to register an unsigned int proc
    entry with an array, the driver will fail and you will get something as
    follows:

    sysctl table check failed: debug/test_sysctl//uint_0002 array now allowed
    CPU: 2 PID: 1342 Comm: modprobe Tainted: G W E
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
    Call Trace:
    dump_stack+0x63/0x81
    __register_sysctl_table+0x350/0x650
    ? kmem_cache_alloc_trace+0x107/0x240
    __register_sysctl_paths+0x1b3/0x1e0
    ? 0xffffffffc005f000
    register_sysctl_table+0x1f/0x30
    test_sysctl_init+0x10/0x1000 [test_sysctl]
    do_one_initcall+0x52/0x1a0
    ? kmem_cache_alloc_trace+0x107/0x240
    do_init_module+0x5f/0x200
    load_module+0x1867/0x1bd0
    ? __symbol_put+0x60/0x60
    SYSC_finit_module+0xdf/0x110
    SyS_finit_module+0xe/0x10
    entry_SYSCALL_64_fastpath+0x1e/0xad
    RIP: 0033:0x7f042b22d119

    Fixes: e7d316a02f68 ("sysctl: handle error writing UINT_MAX to u32 fields")
    Link: http://lkml.kernel.org/r/20170519033554.18592-5-mcgrof@kernel.org
    Signed-off-by: Luis R. Rodriguez
    Suggested-by: Alexey Dobriyan
    Cc: Subash Abhinov Kasiviswanathan
    Cc: Liping Zhang
    Cc: Alexey Dobriyan
    Cc: Heinrich Schuchardt
    Cc: Kees Cook
    Cc: "David S. Miller"
    Cc: Ingo Molnar
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis R. Rodriguez
     
  • Patch series "sysctl: few fixes", v5.

    I've been working on making kmod more deterministic, and as I did that I
    couldn't help but notice a few issues with sysctl. My end goal was just
    to fix unsigned int support, which back then was completely broken.
    Liping Zhang has sent up small atomic fixes, however they still missed
    one more fix, and Alexey Dobriyan had also suggested just dropping array
    support given its complexity.

    I have inspected array support using Coccinelle and indeed it's not that
    popular, so if in fact we can avoid it for new interfaces, I agree it's
    best.

    I did develop a sysctl stress driver but will hold that off for another
    series.

    This patch (of 5):

    Commit 7c60c48f58a7 ("sysctl: Improve the sysctl sanity checks")
    improved sanity checks considerably, however the enhancements on
    sysctl_check_table() meant adding a functional change so that only the
    last table entry's sanity error is propagated. It also changed the way
    errors were propagated so that each new check reset the err value, which
    means only the last sanity check computed is used for an error. This has
    been in the kernel since v3.4 days.

    Fix this by carrying on errors from previous checks and iterations as we
    traverse the table and ensuring we keep any error from previous checks.
    We keep iterating on the table even if an error is found so we can
    complain for all errors found in one shot. This works as -EINVAL is
    always returned on error anyway, and the check for error is any non-zero
    value.
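
    The accumulation pattern described above can be sketched in a few
    lines. struct entry, complain() and check_table() are hypothetical
    names for illustration; the real checks live in sysctl_check_table():

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical table entry with two properties to validate. */
struct entry {
    const char *name;
    int has_handler;
    int maxlen;
};

/* Report one problem and return the error code for it. */
static int complain(const char *name, const char *what)
{
    fprintf(stderr, "table check failed: %s %s\n", name, what);
    return -EINVAL;
}

/*
 * Check every entry and OR the results together, so an early failure is
 * not overwritten by a later successful check. All problems are reported
 * in one pass. This relies on every failure using the same -EINVAL value.
 */
static int check_table(const struct entry *tbl, int n)
{
    int err = 0;
    int i;

    assert(tbl != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (!tbl[i].has_handler)
            err |= complain(tbl[i].name, "No proc_handler");
        if (tbl[i].maxlen <= 0)
            err |= complain(tbl[i].name, "No maxlen");
    }
    return err;
}
```

    Had the loop used plain assignment (err = ...), a bad first entry
    followed by a good last entry would have returned 0.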

    Fixes: 7c60c48f58a7 ("sysctl: Improve the sysctl sanity checks")
    Link: http://lkml.kernel.org/r/20170519033554.18592-2-mcgrof@kernel.org
    Signed-off-by: Luis R. Rodriguez
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Cc: Alexey Dobriyan
    Cc: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis R. Rodriguez
     

12 Jul, 2017

1 commit

  • Andrei Vagin writes:
    FYI: This bug has been reproduced on 4.11.7
    > BUG: Dentry ffff895a3dd01240{i=4e7c09a,n=lo} still in use (1) [unmount of proc proc]
    > ------------[ cut here ]------------
    > WARNING: CPU: 1 PID: 13588 at fs/dcache.c:1445 umount_check+0x6e/0x80
    > CPU: 1 PID: 13588 Comm: kworker/1:1 Not tainted 4.11.7-200.fc25.x86_64 #1
    > Hardware name: CompuLab sbc-flt1/fitlet, BIOS SBCFLT_0.08.04 06/27/2015
    > Workqueue: events proc_cleanup_work
    > Call Trace:
    > dump_stack+0x63/0x86
    > __warn+0xcb/0xf0
    > warn_slowpath_null+0x1d/0x20
    > umount_check+0x6e/0x80
    > d_walk+0xc6/0x270
    > ? dentry_free+0x80/0x80
    > do_one_tree+0x26/0x40
    > shrink_dcache_for_umount+0x2d/0x90
    > generic_shutdown_super+0x1f/0xf0
    > kill_anon_super+0x12/0x20
    > proc_kill_sb+0x40/0x50
    > deactivate_locked_super+0x43/0x70
    > deactivate_super+0x5a/0x60
    > cleanup_mnt+0x3f/0x90
    > mntput_no_expire+0x13b/0x190
    > kern_unmount+0x3e/0x50
    > pid_ns_release_proc+0x15/0x20
    > proc_cleanup_work+0x15/0x20
    > process_one_work+0x197/0x450
    > worker_thread+0x4e/0x4a0
    > kthread+0x109/0x140
    > ? process_one_work+0x450/0x450
    > ? kthread_park+0x90/0x90
    > ret_from_fork+0x2c/0x40
    > ---[ end trace e1c109611e5d0b41 ]---
    > VFS: Busy inodes after unmount of proc. Self-destruct in 5 seconds. Have a nice day...
    > BUG: unable to handle kernel NULL pointer dereference at (null)
    > IP: _raw_spin_lock+0xc/0x30
    > PGD 0

    Fix this by taking a reference to the super block in proc_sys_prune_dcache.

    The superblock reference is the core of the fix; however, the sysctl_inodes
    list is converted to a hlist so that hlist_del_init_rcu may be used. This
    allows proc_sys_prune_dcache to remove inodes from the sysctl_inodes list,
    while not causing problems for proc_sys_evict_inode if it later chooses to
    remove the inode from the sysctl_inodes list. Removing inodes from the
    sysctl_inodes list allows proc_sys_prune_dcache to have a progress
    guarantee, while still being able to drop all locks. The fact that
    head->unregistering is set in start_unregistering ensures that no more
    inodes will be added to the sysctl_inodes list.

    Previously the code did a dance where it delayed calling iput until the
    next entry in the list was being considered to ensure the inode remained on
    the sysctl_inodes list until the next entry was walked to. The structure
    of the loop in this patch does not need that so is much easier to
    understand and maintain.

    Cc: stable@vger.kernel.org
    Reported-by: Andrei Vagin
    Tested-by: Andrei Vagin
    Fixes: ace0c791e6c3 ("proc/sysctl: Don't grab i_lock under sysctl_lock.")
    Fixes: d6cffbbe9a7e ("proc/sysctl: prune stale dentries during unregistering")
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

06 May, 2017

1 commit

  • Pull namespace updates from Eric Biederman:
    "This is a set of small fixes that were mostly stumbled over during
    more significant development. This proc fix and the fix to
    posix-timers are the most significant of the lot.

    There is a lot of good development going on but unfortunately it
    didn't quite make the merge window"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    proc: Fix unbalanced hard link numbers
    signal: Make kill_proc_info static
    rlimit: Properly call security_task_setrlimit
    signal: Remove unused definition of sig_user_definied
    ia64: Remove unused IA64_TASK_SIGHAND_OFFSET and IA64_SIGHAND_SIGLOCK_OFFSET
    ipc: Remove unused declaration of recompute_msgmni
    posix-timers: Correct sanity check in posix_cpu_nsleep
    sysctl: Remove dead register_sysctl_root

    Linus Torvalds
     

17 Apr, 2017

1 commit

  • The function no longer does anything. There is only a single caller of
    register_sysctl_root when semantically there should be two. Remove
    this function so that if someone decides this functionality is needed
    again it will be obvious all of the callers of setup_sysctl_set need
    to be audited and modified appropriately.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

08 Apr, 2017

1 commit

  • Commit e7d316a02f68 ("sysctl: handle error writing UINT_MAX to u32
    fields") introduced the proc_douintvec helper function, but it forgot to
    add the related sanity check when doing register_sysctl_table. So add
    it now.

    Signed-off-by: Liping Zhang
    Cc: Subash Abhinov Kasiviswanathan
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Liping Zhang
     

04 Mar, 2017

1 commit

  • Pull vfs 'statx()' update from Al Viro.

    This adds the new extended stat() interface that internally subsumes our
    previous stat interfaces, and allows user mode to specify in more detail
    what kind of information it wants.

    It also allows for some explicit synchronization information to be
    passed to the filesystem, which can be relevant for network filesystems:
    is the cached value ok, or do you need open/close consistency, or what?

    From David Howells.

    Andreas Dilger points out that the first version of the extended statx
    interface was posted June 29, 2010:

    https://www.spinics.net/lists/linux-fsdevel/msg33831.html

    * 'rebased-statx' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    statx: Add a system call to make enhanced file info available

    Linus Torvalds
     

03 Mar, 2017

1 commit

  • Add a system call to make extended file information available, including
    file creation and some attribute flags where available through the
    underlying filesystem.

    The getattr inode operation is altered to take two additional arguments: a
    u32 request_mask and an unsigned int flags that indicate the
    synchronisation mode. This change is propagated to the vfs_getattr*()
    function.

    Functions like vfs_stat() are now inline wrappers around new functions
    vfs_statx() and vfs_statx_fd() to reduce stack usage.

    ========
    OVERVIEW
    ========

    The idea was initially proposed as a set of xattrs that could be retrieved
    with getxattr(), but the general preference proved to be for a new syscall
    with an extended stat structure.

    A number of requests were gathered for features to be included. The
    following have been included:

    (1) Make the fields a consistent size on all arches and make them large.

    (2) Spare space, request flags and information flags are provided for
    future expansion.

    (3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an
    __s64).

    (4) Creation time: The SMB protocol carries the creation time, which could
    be exported by Samba, which will in turn help CIFS make use of
    FS-Cache as that can be used for coherency data (stx_btime).

    This is also specified in NFSv4 as a recommended attribute and could
    be exported by NFSD [Steve French].

    (5) Lightweight stat: Ask for just those details of interest, and allow a
    netfs (such as NFS) to approximate anything not of interest, possibly
    without going to the server [Trond Myklebust, Ulrich Drepper, Andreas
    Dilger] (AT_STATX_DONT_SYNC).

    (6) Heavyweight stat: Force a netfs to go to the server, even if it thinks
    its cached attributes are up to date [Trond Myklebust]
    (AT_STATX_FORCE_SYNC).

    And the following have been left out for future extension:

    (7) Data version number: Could be used by userspace NFS servers [Aneesh
    Kumar].

    Can also be used to modify fill_post_wcc() in NFSD which retrieves
    i_version directly, but has just called vfs_getattr(). It could get
    it from the kstat struct if it used vfs_xgetattr() instead.

    (There's disagreement on the exact semantics of a single field, since
    not all filesystems do this the same way).

    (8) BSD stat compatibility: Including more fields from the BSD stat such
    as creation time (st_btime) and inode generation number (st_gen)
    [Jeremy Allison, Bernd Schubert].

    (9) Inode generation number: Useful for FUSE and userspace NFS servers
    [Bernd Schubert].

    (This was asked for but later deemed unnecessary with the
    open-by-handle capability available and caused disagreement as to
    whether it's a security hole or not).

    (10) Extra coherency data may be useful in making backups [Andreas Dilger].

    (No particular data were offered, but things like last backup
    timestamp, the data version number and the DOS archive bit would come
    into this category).

    (11) Allow the filesystem to indicate what it can/cannot provide: A
    filesystem can now say it doesn't support a standard stat feature if
    that isn't available, so if, for instance, inode numbers or UIDs don't
    exist or are fabricated locally...

    (This requires a separate system call - I have an fsinfo() call idea
    for this).

    (12) Store a 16-byte volume ID in the superblock that can be returned in
    struct xstat [Steve French].

    (Deferred to fsinfo).

    (13) Include granularity fields in the time data to indicate the
    granularity of each of the times (NFSv4 time_delta) [Steve French].

    (Deferred to fsinfo).

    (14) FS_IOC_GETFLAGS value. These could be translated to BSD's st_flags.
    Note that the Linux IOC flags are a mess and filesystems such as Ext4
    define flags that aren't in linux/fs.h, so translation in the kernel
    may be a necessity (or, possibly, we provide the filesystem type too).

    (Some attributes are made available in stx_attributes, but the general
    feeling was that the IOC flags were too ext[234]-specific and shouldn't
    be exposed through statx this way).

    (15) Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer,
    Michael Kerrisk].

    (Deferred, probably to fsinfo. Finding out if there's an ACL or
    seclabel might require extra filesystem operations).

    (16) Femtosecond-resolution timestamps [Dave Chinner].

    (A __reserved field has been left in the statx_timestamp struct for
    this - if there proves to be a need).

    (17) A set-multiple-attributes syscall to go with this.

    ===============
    NEW SYSTEM CALL
    ===============

    The new system call is:

    int ret = statx(int dfd,
                    const char *filename,
                    unsigned int flags,
                    unsigned int mask,
                    struct statx *buffer);

    The dfd, filename and flags parameters indicate the file to query, in a
    similar way to fstatat(). There is no equivalent of lstat() as that can be
    emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags. There is
    also no equivalent of fstat() as that can be emulated by passing a NULL
    filename to statx() with the fd of interest in dfd.
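
    The two emulations described above can be sketched roughly as follows.
    This assumes the glibc statx() wrapper (glibc 2.28 or later); the helper
    names are hypothetical. Note that the text above describes passing a NULL
    filename, whereas this sketch uses an empty string together with
    AT_EMPTY_PATH, which is the form documented in the statx(2) man page.

```c
#define _GNU_SOURCE
#include <fcntl.h>      /* AT_FDCWD, AT_SYMLINK_NOFOLLOW, AT_EMPTY_PATH */
#include <sys/stat.h>   /* statx(), struct statx, STATX_* (glibc >= 2.28) */

/* lstat()-like query: do not follow a trailing symlink in path. */
static int statx_like_lstat(const char *path, struct statx *stx)
{
	return statx(AT_FDCWD, path, AT_SYMLINK_NOFOLLOW,
		     STATX_BASIC_STATS, stx);
}

/*
 * fstat()-like query on an already-open fd.  Assumption: the empty
 * string plus AT_EMPTY_PATH is used here instead of a NULL filename.
 */
static int statx_like_fstat(int fd, struct statx *stx)
{
	return statx(fd, "", AT_EMPTY_PATH, STATX_BASIC_STATS, stx);
}
```

    Both helpers return 0 on success and -1 with errno set on failure, as
    the wrapper itself does.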

    Whether or not statx() synchronises the attributes with the backing store
    can be controlled by OR'ing a value into the flags argument (this typically
    only affects network filesystems):

    (1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this
    respect.

    (2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise
    its attributes with the server - which might require data writeback to
    occur to get the timestamps correct.

    (3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a
    network filesystem. The resulting values should be considered
    approximate.

    mask is a bitmask indicating the fields in struct statx that are of
    interest to the caller. The user should set this to STATX_BASIC_STATS to
    get the basic set returned by stat(). It should be noted that asking for
    more information may entail extra I/O operations.

    buffer points to the destination for the data. This must be 256 bytes in
    size.
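
    As an illustration of the flags, mask and buffer parameters working
    together, here is a minimal sketch that asks only for the file size and
    suppresses server synchronisation. It assumes the glibc statx() wrapper
    (glibc 2.28 or later); cached_file_size() is a hypothetical helper name.

```c
#define _GNU_SOURCE
#include <fcntl.h>      /* AT_FDCWD, AT_STATX_DONT_SYNC */
#include <stdint.h>
#include <sys/stat.h>   /* statx(), struct statx, STATX_SIZE */

/* The text above requires the buffer to be 256 bytes in size. */
_Static_assert(sizeof(struct statx) == 256, "struct statx must be 256 bytes");

/*
 * Hypothetical helper: fetch only the size, without making a network
 * filesystem contact its server.  Returns 0 and stores the (possibly
 * approximate) size on success; returns -1 if the call failed or the
 * filesystem did not supply a size.
 */
static int cached_file_size(const char *path, uint64_t *size)
{
	struct statx stx;

	if (statx(AT_FDCWD, path, AT_STATX_DONT_SYNC, STATX_SIZE, &stx) != 0)
		return -1;
	if (!(stx.stx_mask & STATX_SIZE))
		return -1;
	*size = stx.stx_size;
	return 0;
}
```

    Requesting a narrow mask like this is what lets a filesystem skip the
    extra I/O mentioned above.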

    ======================
    MAIN ATTRIBUTES RECORD
    ======================

    The following structures are defined in which to return the main attribute
    set:

    struct statx_timestamp {
        __s64 tv_sec;
        __s32 tv_nsec;
        __s32 __reserved;
    };

    struct statx {
        __u32 stx_mask;
        __u32 stx_blksize;
        __u64 stx_attributes;
        __u32 stx_nlink;
        __u32 stx_uid;
        __u32 stx_gid;
        __u16 stx_mode;
        __u16 __spare0[1];
        __u64 stx_ino;
        __u64 stx_size;
        __u64 stx_blocks;
        __u64 __spare1[1];
        struct statx_timestamp stx_atime;
        struct statx_timestamp stx_btime;
        struct statx_timestamp stx_ctime;
        struct statx_timestamp stx_mtime;
        __u32 stx_rdev_major;
        __u32 stx_rdev_minor;
        __u32 stx_dev_major;
        __u32 stx_dev_minor;
        __u64 __spare2[14];
    };

    The defined bits in mask (the request mask) and stx_mask are:

        STATX_TYPE          Want/got stx_mode & S_IFMT
        STATX_MODE          Want/got stx_mode & ~S_IFMT
        STATX_NLINK         Want/got stx_nlink
        STATX_UID           Want/got stx_uid
        STATX_GID           Want/got stx_gid
        STATX_ATIME         Want/got stx_atime{,_ns}
        STATX_MTIME         Want/got stx_mtime{,_ns}
        STATX_CTIME         Want/got stx_ctime{,_ns}
        STATX_INO           Want/got stx_ino
        STATX_SIZE          Want/got stx_size
        STATX_BLOCKS        Want/got stx_blocks
        STATX_BASIC_STATS   [The stuff in the normal stat struct]
        STATX_BTIME         Want/got stx_btime{,_ns}
        STATX_ALL           [All currently available stuff]

    stx_btime is the file creation time, stx_mask is a bitmask indicating the
    data provided, and the __spare*[] arrays are where as-yet undefined fields
    can be placed.

    Time fields are structures with separate seconds and nanoseconds fields
    plus a reserved field in case we want to add even finer resolution. Note
    that times will be negative if before 1970; in such a case, the nanosecond
    fields will also be negative if not zero.
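
    To illustrate the sign convention described above, the following sketch
    folds a timestamp into a single signed nanosecond count. The struct is a
    local mirror of the statx_timestamp layout shown earlier, under a
    hypothetical name, so the example does not depend on any particular
    header.

```c
#include <stdint.h>

/* Local mirror of the statx_timestamp layout above (hypothetical name). */
struct ts_sketch {
	int64_t tv_sec;
	int32_t tv_nsec;
	int32_t reserved;
};

/*
 * Nanoseconds relative to the 1970 epoch; negative for earlier times.
 * Because tv_nsec carries the same sign as tv_sec for pre-1970 times,
 * a plain sum is sufficient.  The result overflows for times more than
 * roughly 292 years from the epoch.
 */
static int64_t ts_to_ns(struct ts_sketch ts)
{
	return ts.tv_sec * INT64_C(1000000000) + ts.tv_nsec;
}
```

    For example, tv_sec = -1 with tv_nsec = -500000000 denotes a point one
    and a half seconds before the epoch.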

    The bits defined in the stx_attributes field convey information about a
    file, how it is accessed, where it is and what it does. The following
    attributes map to FS_*_FL flags and are the same numerical value:

        STATX_ATTR_COMPRESSED   File is compressed by the fs
        STATX_ATTR_IMMUTABLE    File is marked immutable
        STATX_ATTR_APPEND       File is append-only
        STATX_ATTR_NODUMP       File is not to be dumped
        STATX_ATTR_ENCRYPTED    File requires key to decrypt in fs

    Within the kernel, the supported flags are listed by:

    KSTAT_ATTR_FS_IOC_FLAGS

    [Are any other IOC flags of sufficient general interest to be exposed
    through this interface?]

    New flags include:

    STATX_ATTR_AUTOMOUNT Object is an automount trigger

    These are for the use of GUI tools that might want to mark files specially,
    depending on what they are.
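
    A tool that wants to act on these bits might decode stx_attributes along
    the following lines. This is only a sketch: describe_attrs() and the
    human-readable names are hypothetical, and the STATX_ATTR_* constants are
    assumed to come from glibc's <sys/stat.h> (glibc 2.28 or later).

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <sys/stat.h>   /* STATX_ATTR_* */

/* Attribute bits listed above, paired with illustrative names. */
static const struct {
	uint64_t bit;
	const char *name;
} attr_names[] = {
	{ STATX_ATTR_COMPRESSED, "compressed" },
	{ STATX_ATTR_IMMUTABLE,  "immutable" },
	{ STATX_ATTR_APPEND,     "append-only" },
	{ STATX_ATTR_NODUMP,     "nodump" },
	{ STATX_ATTR_ENCRYPTED,  "encrypted" },
	{ STATX_ATTR_AUTOMOUNT,  "automount" },
};

/*
 * Write a space-separated list of the attributes set in attrs into buf.
 * Returns the number of names written; stops early if buf is too small.
 */
static int describe_attrs(uint64_t attrs, char *buf, size_t len)
{
	int found = 0;
	size_t used = 0;

	if (len)
		buf[0] = '\0';
	for (size_t i = 0; i < sizeof(attr_names) / sizeof(attr_names[0]); i++) {
		if (!(attrs & attr_names[i].bit))
			continue;
		int n = snprintf(buf + used, len - used, "%s%s",
				 found ? " " : "", attr_names[i].name);
		if (n < 0 || (size_t)n >= len - used) {
			if (used < len)
				buf[used] = '\0';  /* drop the partial name */
			break;
		}
		used += (size_t)n;
		found++;
	}
	return found;
}
```

    Bits not present in the table are ignored, so the function keeps working
    if further attributes are defined later.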

    Fields in struct statx come in a number of classes:

    (0) stx_dev_*, stx_blksize.

    These are local system information and are always available.

    (1) stx_mode, stx_nlink, stx_uid, stx_gid, stx_[amc]time, stx_ino,
    stx_size, stx_blocks.

    These will be returned whether the caller asks for them or not. The
    corresponding bits in stx_mask will be set to indicate whether they
    actually have valid values.

    If the caller didn't ask for them, then they may be approximated. For
    example, NFS won't waste any time updating them from the server,
    unless as a byproduct of updating something requested.

    If the values don't actually exist for the underlying object (such as
    UID or GID on a DOS file), then the bit won't be set in the stx_mask,
    even if the caller asked for the value. In such a case, the returned
    value will be a fabrication.

    Note that there are instances where the type might not be valid, for
    instance Windows reparse points.

    (2) stx_rdev_*.

    This will be set only if stx_mode indicates we're looking at a
    blockdev or a chardev; otherwise it will be 0.

    (3) stx_btime.

    Similar to (1), except this will be set to 0 if it doesn't exist.
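
    In practice, the classes above mean a caller should consult stx_mask
    before trusting a field. A minimal sketch of that pattern, with
    hypothetical helper names and assuming the glibc definitions of struct
    statx (glibc 2.28 or later):

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <sys/stat.h>   /* struct statx, STATX_*, S_ISBLK, S_ISCHR */

/*
 * Class (3): stx_btime.  Returns 1 and stores the creation time in
 * seconds if the filesystem supplied it, otherwise returns 0.
 */
static int get_btime_sec(const struct statx *stx, int64_t *sec)
{
	if (!(stx->stx_mask & STATX_BTIME))
		return 0;
	*sec = stx->stx_btime.tv_sec;
	return 1;
}

/*
 * Class (2): stx_rdev_* is only meaningful when a valid file type
 * says the object is a block or character device.
 */
static int has_rdev(const struct statx *stx)
{
	if (!(stx->stx_mask & STATX_TYPE))
		return 0;
	return S_ISBLK(stx->stx_mode) || S_ISCHR(stx->stx_mode);
}
```

    Checking the mask rather than testing for a zero value distinguishes
    "not provided" from a genuine value of 0.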

    =======
    TESTING
    =======

    The following test program can be used to test the statx system call:

    samples/statx/test-statx.c

    Just compile and run, passing it paths to the files you want to examine.
    The file is built automatically if CONFIG_SAMPLES is enabled.

    Here's some example output. Firstly, an NFS directory that crosses to
    another FSID. Note that the AUTOMOUNT attribute is set because transiting
    this directory will cause d_automount to be invoked by the VFS.

    [root@andromeda ~]# /tmp/test-statx -A /warthog/data
    statx(/warthog/data) = 0
    results=7ff
    Size: 4096 Blocks: 8 IO Block: 1048576 directory
    Device: 00:26 Inode: 1703937 Links: 125
    Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
    Access: 2016-11-24 09:02:12.219699527+0000
    Modify: 2016-11-17 10:44:36.225653653+0000
    Change: 2016-11-17 10:44:36.225653653+0000
    Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------)

    Secondly, the result of automounting on that directory.

    [root@andromeda ~]# /tmp/test-statx /warthog/data
    statx(/warthog/data) = 0
    results=7ff
    Size: 4096 Blocks: 8 IO Block: 1048576 directory
    Device: 00:27 Inode: 2 Links: 125
    Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
    Access: 2016-11-24 09:02:12.219699527+0000
    Modify: 2016-11-17 10:44:36.225653653+0000
    Change: 2016-11-17 10:44:36.225653653+0000

    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    David Howells