19 Dec, 2012

4 commits

  • Pull (again) user namespace infrastructure changes from Eric Biederman:
    "Those bugs, those darn embarrassing bugs just don't want to get
    fixed.

    Linus I just updated my mirror of your kernel.org tree and it appears
    you successfully pulled everything except the last 4 commits that fix
    those embarrassing bugs.

    When you get a chance can you please repull my branch"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    userns: Fix typo in description of the limitation of userns_install
    userns: Add a more complete capability subset test to commit_creds
    userns: Require CAP_SYS_ADMIN for most uses of setns.
    Fix cap_capable to only allow owners in the parent user namespace to have caps.

    Linus Torvalds
     
  • Pull exofs changes from Boaz Harrosh:
    "These are just 3 patches, the last two are bug fixes on the error
    paths in exofs.

    The important patch is the one to osd_uld which adds sysfs info to osd
    devices for use by user-mode clustering discovery software. I have
    been sitting on this patch since before February this year. It is
    important for some of the big installation cluster systems, who have
    been compiling their own kernels just for that patch."

    Ugh. The osd_uld patch already went through the SCSI tree, so this was
    kind of pointless. But at least it has the two small error-path fixes..

    * 'for-linus' of git://git.open-osd.org/linux-open-osd:
    exofs: don't leak io_state and pages on read error
    exofs: clean up the correct page collection on write error
    osduld: Add osdname & systemid sysfs at scsi_osd class

    Linus Torvalds
     
  • Pull btrfs update from Chris Mason:
    "A big set of fixes and features.

    In terms of line count, most of the code comes from Stefan, who added
    the ability to replace a single drive in place. This is different
    from how btrfs normally replaces drives, and is much much much faster.

    Josef is plowing through our synchronous write performance. This pull
    request does not include the DIO_OWN_WAITING patch that was discussed
    on the list, but it has a number of other improvements to cut down our
    latencies and CPU time during fsync/O_DIRECT writes.

    Miao Xie has a big series of fixes and is spreading out ordered
    operations over more CPUs. This improves performance and reduces
    contention.

    I've put in fixes for error handling around hash collisions. These
    are going back to individual stable kernels as I test against them.

    Otherwise we have a lot of fixes and cleanups, thanks everyone!
    raid5/6 is being rebased against the device replacement code. I'll
    have it posted this Friday along with a nice series of benchmarks."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (115 commits)
    Btrfs: fix a bug of per-file nocow
    Btrfs: fix hash overflow handling
    Btrfs: don't take inode delalloc mutex if we're a free space inode
    Btrfs: fix autodefrag and umount lockup
    Btrfs: fix permissions of empty files not affected by umask
    Btrfs: put raid properties into global table
    Btrfs: fix BUG() in scrub when first superblock reading gives EIO
    Btrfs: do not call file_update_time in aio_write
    Btrfs: only unlock and relock if we have to
    Btrfs: use tokens where we can in the tree log
    Btrfs: optimize leaf_space_used
    Btrfs: don't memset new tokens
    Btrfs: only clear dirty on the buffer if it is marked as dirty
    Btrfs: move checks in set_page_dirty under DEBUG
    Btrfs: log changed inodes based on the extent map tree
    Btrfs: add path->really_keep_locks
    Btrfs: do not mark ems as prealloc if we are writing to them
    Btrfs: keep track of the extents original block length
    Btrfs: inline csums if we're fsyncing
    Btrfs: don't bother copying if we're only logging the inode
    ...

    Linus Torvalds
     
  • Pull NFS client updates from Trond Myklebust:
    "Features include:

    - Full audit of BUG_ON asserts in the NFS, SUNRPC and lockd client
    code. Remove altogether where possible, and replace with
    WARN_ON_ONCE and appropriate error returns where not.
    - NFSv4.1 client adds session dynamic slot table management. There
    is matching server side code that has been submitted to Bruce for
    consideration.

    Together, this code allows the server to dynamically manage the
    amount of memory it allocates to the duplicate request cache for
    each client. It will constantly resize those caches to reserve
    more memory for clients that are hot while shrinking caches for
    those that are quiescent.

    In addition, there are assorted bugfixes for the generic NFS write
    code, fixes to deal with the drop_nlink() warnings, and yet another
    fix for NFSv4 getacl."

    * tag 'nfs-for-3.8-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (106 commits)
    SUNRPC: continue run over clients list on PipeFS event instead of break
    NFS: Don't use SetPageError in the NFS writeback code
    SUNRPC: variable 'svsk' is unused in function bc_send_request
    SUNRPC: Handle ECONNREFUSED in xs_local_setup_socket
    NFSv4.1: Deal effectively with interrupted RPC calls.
    NFSv4.1: Move the RPC timestamp out of the slot.
    NFSv4.1: Try to deal with NFS4ERR_SEQ_MISORDERED.
    NFS: nfs_lookup_revalidate should not trust an inode with i_nlink == 0
    NFS: Fix calls to drop_nlink()
    NFS: Ensure that we always drop inodes that have been marked as stale
    nfs: Remove unused list nfs4_clientid_list
    nfs: Remove duplicate function declaration in internal.h
    NFS: avoid NULL dereference in nfs_destroy_server
    SUNRPC handle EKEYEXPIRED in call_refreshresult
    SUNRPC set gss gc_expiry to full lifetime
    nfs: fix page dirtying in NFS DIO read codepath
    nfs: don't zero out the rest of the page if we hit the EOF on a DIO READ
    NFSv4.1: Be conservative about the client highest slotid
    NFSv4.1: Handle NFS4ERR_BADSLOT errors correctly
    nfs: don't extend writes to cover entire page if pagecache is invalid
    ...

    Linus Torvalds
     

18 Dec, 2012

26 commits

  • Merge misc patches from Andrew Morton:
    "Incoming:

    - lots of misc stuff

    - backlight tree updates

    - lib/ updates

    - Oleg's percpu-rwsem changes

    - checkpatch

    - rtc

    - aoe

    - more checkpoint/restart support

    I still have a pile of MM stuff pending - Pekka should be merging
    later today after which that is good to go. A number of other things
    are twiddling thumbs awaiting maintainer merges."

    * emailed patches from Andrew Morton: (180 commits)
    scatterlist: don't BUG when we can trivially return a proper error.
    docs: update documentation about /proc/<pid>/fdinfo/<fd> fanotify output
    fs, fanotify: add @mflags field to fanotify output
    docs: add documentation about /proc/<pid>/fdinfo/<fd> output
    fs, notify: add procfs fdinfo helper
    fs, exportfs: add exportfs_encode_inode_fh() helper
    fs, exportfs: escape nil dereference if no s_export_op present
    fs, epoll: add procfs fdinfo helper
    fs, eventfd: add procfs fdinfo helper
    procfs: add ability to plug in auxiliary fdinfo providers
    tools/testing/selftests/kcmp/kcmp_test.c: print reason for failure in kcmp_test
    breakpoint selftests: print failure status instead of cause make error
    kcmp selftests: print fail status instead of cause make error
    kcmp selftests: make run_tests fix
    mem-hotplug selftests: print failure status instead of cause make error
    cpu-hotplug selftests: print failure status instead of cause make error
    mqueue selftests: print failure status instead of cause make error
    vm selftests: print failure status instead of cause make error
    ubifs: use prandom_bytes
    mtd: nandsim: use prandom_bytes
    ...

    Linus Torvalds
     
  • The kernel keeps the FAN_MARK_IGNORED_SURV_MODIFY bit separately from
    fsnotify_mark::mask|ignored_mask, so put it in the @mflags (mark flags)
    field so that a user-space reader can detect whether the bit was
    used when the mark was created.

    | pos: 0
    | flags: 04002
    | fanotify flags:10 event-flags:0
    | fanotify mnt_id:12 mflags:40 mask:38 ignored_mask:40000003
    | fanotify ino:4f969 sdev:800013 mflags:0 mask:3b ignored_mask:40000000 fhandle-bytes:8 fhandle-type:1 f_handle:69f90400c275b5b4

    Signed-off-by: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Andrey Vagin
    Cc: Al Viro
    Cc: Alexey Dobriyan
    Cc: James Bottomley
    Cc: "Aneesh Kumar K.V"
    Cc: Matthew Helsley
    Cc: "J. Bruce Fields"
    Cc: Tvrtko Ursulin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
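
As a rough illustration of how a user-space reader might consume the fanotify lines shown in the entry above, the following Python sketch splits a mark line into fields and tests the mflags value. It assumes the numeric fields are hexadecimal, as the sample output suggests, and that FAN_MARK_IGNORED_SURV_MODIFY has the value 0x40 from <linux/fanotify.h>. The helper names parse_fanotify_mark and survives_modify are hypothetical.

```python
import re

# Value of FAN_MARK_IGNORED_SURV_MODIFY in <linux/fanotify.h> (assumed here).
FAN_MARK_IGNORED_SURV_MODIFY = 0x40


def parse_fanotify_mark(line):
    """Parse one 'fanotify ...' fdinfo line into a dict of key -> string."""
    prefix = "fanotify "
    if not line.startswith(prefix):
        raise ValueError("not a fanotify line: %r" % line)
    # Each field has the form key:value with no embedded whitespace.
    return dict(re.findall(r"(\S+?):(\S*)", line[len(prefix):]))


def survives_modify(mark):
    """True if the mark's mflags field carries FAN_MARK_IGNORED_SURV_MODIFY."""
    mflags = int(mark.get("mflags", "0"), 16)  # assumption: hex-encoded
    return bool(mflags & FAN_MARK_IGNORED_SURV_MODIFY)
```

With the sample output above, the mount mark (mflags:40) would be reported as surviving modification and the inode mark (mflags:0) would not.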
     
  • This allows us to print out fsnotify details such as the watchee inode,
    device, mask and, optionally, a file handle.

    For inotify objects, if the kernel is compiled with exportfs support, the
    output will be

    | pos: 0
    | flags: 02000000
    | inotify wd:3 ino:9e7e sdev:800013 mask:800afce ignored_mask:0 fhandle-bytes:8 fhandle-type:1 f_handle:7e9e0000640d1b6d
    | inotify wd:2 ino:a111 sdev:800013 mask:800afce ignored_mask:0 fhandle-bytes:8 fhandle-type:1 f_handle:11a1000020542153
    | inotify wd:1 ino:6b149 sdev:800013 mask:800afce ignored_mask:0 fhandle-bytes:8 fhandle-type:1 f_handle:49b1060023552153

    If the kernel is compiled without exportfs support, the file handle
    won't be provided, only the inode and device.

    | pos: 0
    | flags: 02000000
    | inotify wd:3 ino:9e7e sdev:800013 mask:800afce ignored_mask:0
    | inotify wd:2 ino:a111 sdev:800013 mask:800afce ignored_mask:0
    | inotify wd:1 ino:6b149 sdev:800013 mask:800afce ignored_mask:0

    For fanotify the output is like

    | pos: 0
    | flags: 04002
    | fanotify flags:10 event-flags:0
    | fanotify mnt_id:12 mask:3b ignored_mask:0
    | fanotify ino:50205 sdev:800013 mask:3b ignored_mask:40000000 fhandle-bytes:8 fhandle-type:1 f_handle:05020500fb1d47e7

    To minimize impact on general fsnotify code the new functionality
    is gathered in fs/notify/fdinfo.c file.

    Signed-off-by: Cyrill Gorcunov
    Acked-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Andrey Vagin
    Cc: Al Viro
    Cc: Alexey Dobriyan
    Cc: James Bottomley
    Cc: "Aneesh Kumar K.V"
    Cc: Alexey Dobriyan
    Cc: Matthew Helsley
    Cc: "J. Bruce Fields"
    Cc: "Aneesh Kumar K.V"
    Cc: Tvrtko Ursulin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
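
The relation between the ino and f_handle fields in the inotify samples above can be checked with a short Python sketch. It assumes that fhandle-type:1 with fhandle-bytes:8 means a 32-bit inode number followed by a 32-bit generation, stored in little-endian byte order (which is what the sample values look like). The function names are hypothetical.

```python
import struct


def parse_inotify_line(line):
    """Split an 'inotify key:value ...' fdinfo line into a dict of strings."""
    fields = line.split()
    if not fields or fields[0] != "inotify":
        raise ValueError("not an inotify line: %r" % line)
    return dict(field.split(":", 1) for field in fields[1:])


def decode_ino32_gen(f_handle_hex):
    """Decode an 8-byte handle as (inode, generation).

    Assumption: two little-endian 32-bit integers, inode first.
    """
    raw = bytes.fromhex(f_handle_hex)
    if len(raw) != 8:
        raise ValueError("expected 8 handle bytes, got %d" % len(raw))
    return struct.unpack("<II", raw)
```

For the first sample line, the decoded inode equals the hexadecimal ino field (9e7e), which is a useful sanity check before trusting the handle.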
     
  • We will need this helper in the next patch to provide a file handle for
    inotify marks in /proc/pid/fdinfo output.

    Rather, the patch provides a way to use inodes directly when a dentry
    is not available (as in the case of the inotify subsystem).

    Signed-off-by: Cyrill Gorcunov
    Acked-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Andrey Vagin
    Cc: Al Viro
    Cc: Alexey Dobriyan
    Cc: James Bottomley
    Cc: "Aneesh Kumar K.V"
    Cc: Alexey Dobriyan
    Cc: Matthew Helsley
    Cc: "J. Bruce Fields"
    Cc: "Aneesh Kumar K.V"
    Cc: Tvrtko Ursulin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • This routine will be used to generate a file handle in fdinfo output for
    the inotify subsystem, where, if no s_export_op is present, the general
    export_encode_fh should be used. Thus add a test for whether s_export_op
    is present inside exportfs_encode_fh itself.

    Signed-off-by: Cyrill Gorcunov
    Acked-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Andrey Vagin
    Cc: Al Viro
    Cc: Alexey Dobriyan
    Cc: James Bottomley
    Cc: "Aneesh Kumar K.V"
    Cc: Alexey Dobriyan
    Cc: Matthew Helsley
    Cc: "J. Bruce Fields"
    Cc: "Aneesh Kumar K.V"
    Cc: Tvrtko Ursulin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • This allows us to print out the eventpoll target file descriptor, events
    and data. The /proc/pid/fdinfo/fd output consists of

    | pos: 0
    | flags: 02
    | tfd: 5 events: 1d data: ffffffffffffffff enabled: 1

    [avagin@: fix for uninitialized ret variable]

    Signed-off-by: Cyrill Gorcunov
    Acked-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Andrey Vagin
    Cc: Al Viro
    Cc: Alexey Dobriyan
    Cc: James Bottomley
    Cc: "Aneesh Kumar K.V"
    Cc: Alexey Dobriyan
    Cc: Matthew Helsley
    Cc: "J. Bruce Fields"
    Cc: "Aneesh Kumar K.V"
    Cc: Tvrtko Ursulin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
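
A reader of the eventpoll output above might decode each tfd line roughly as follows. This Python sketch assumes the events and data fields are hexadecimal and uses the standard low epoll event bits. The trailing "enabled" field from the sample is not required by the pattern. The names here are hypothetical.

```python
import re

# Low epoll event bits as defined in <sys/epoll.h>.
EPOLL_BITS = {
    0x001: "EPOLLIN",
    0x002: "EPOLLPRI",
    0x004: "EPOLLOUT",
    0x008: "EPOLLERR",
    0x010: "EPOLLHUP",
}

_TFD_RE = re.compile(
    r"tfd:\s*(\d+)\s+events:\s*([0-9a-fA-F]+)\s+data:\s*([0-9a-fA-F]+)"
)


def parse_epoll_targets(fdinfo_text):
    """Return one dict per 'tfd:' line found in an epoll fdinfo dump."""
    targets = []
    for match in _TFD_RE.finditer(fdinfo_text):
        events = int(match.group(2), 16)  # assumption: hex-encoded mask
        names = [name for bit, name in sorted(EPOLL_BITS.items())
                 if events & bit]
        targets.append({
            "tfd": int(match.group(1)),
            "events": events,
            "names": names,
            "data": int(match.group(3), 16),
        })
    return targets
```

Bits outside the small table are kept in the numeric events value but are not named, so the caller can still restore the exact mask.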
     
  • This allows us to print out the raw counter value. The /proc/pid/fdinfo/fd
    output is

    | pos: 0
    | flags: 04002
    | eventfd-count: 5a

    Signed-off-by: Cyrill Gorcunov
    Acked-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Andrey Vagin
    Cc: Al Viro
    Cc: Alexey Dobriyan
    Cc: James Bottomley
    Cc: "Aneesh Kumar K.V"
    Cc: Alexey Dobriyan
    Cc: Matthew Helsley
    Cc: "J. Bruce Fields"
    Cc: "Aneesh Kumar K.V"
    Cc: Tvrtko Ursulin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
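
Reading the counter back from the eventfd output above could look like this Python sketch. It assumes the counter is printed in hexadecimal, as the sample value 5a suggests. parse_fdinfo and eventfd_count are hypothetical helper names.

```python
def parse_fdinfo(text):
    """Split the 'key: value' lines of an fdinfo dump into a dict of strings."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that do not contain a colon
            info[key.strip()] = value.strip()
    return info


def eventfd_count(text):
    """Return the eventfd counter as an integer (assumption: hex-encoded)."""
    return int(parse_fdinfo(text)["eventfd-count"], 16)
```

A restore tool would then write that value into a freshly created eventfd to recreate the saved state.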
     
  • This patch brings the ability to print out auxiliary data associated with
    a file in the procfs interface /proc/pid/fdinfo/fd.

    In particular, further patches make eventfd, eventpoll, signalfd and
    fsnotify print additional information complete enough to restore
    these objects after checkpoint.

    To simplify the code we add a show_fdinfo callback inside struct
    file_operations (as Al and Pavel are proposing).

    Signed-off-by: Cyrill Gorcunov
    Acked-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Andrey Vagin
    Cc: Al Viro
    Cc: Alexey Dobriyan
    Cc: James Bottomley
    Cc: "Aneesh Kumar K.V"
    Cc: Alexey Dobriyan
    Cc: Matthew Helsley
    Cc: "J. Bruce Fields"
    Cc: "Aneesh Kumar K.V"
    Cc: Tvrtko Ursulin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
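
The optional-callback design described in the entry above can be modelled in user space. The Python sketch below is only a toy analogy, not kernel code: FileOps stands in for struct file_operations, and render_fdinfo emits the common pos/flags lines and then lets the file type append its own lines if it provides a callback. All names and the exact output spacing are assumptions made for illustration.

```python
class FileOps:
    """Toy stand-in for struct file_operations; show_fdinfo is optional."""

    def __init__(self, show_fdinfo=None):
        self.show_fdinfo = show_fdinfo


def render_fdinfo(pos, flags, fops, private=None):
    """Build fdinfo text: generic lines first, then type-specific ones."""
    lines = ["pos:\t%d" % pos, "flags:\t0%o" % flags]
    if fops.show_fdinfo is not None:
        # Only file types that registered a callback add extra lines.
        lines.extend(fops.show_fdinfo(private))
    return "\n".join(lines) + "\n"


def eventfd_show_fdinfo(count):
    """Hypothetical provider that reports an eventfd-style counter."""
    return ["eventfd-count: %x" % count]
```

Files whose operations table leaves the callback unset keep the plain two-line output, so existing readers are unaffected.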
     
  • This also converts the memory-filling loop to use memset.

    Signed-off-by: Akinobu Mita
    Cc: Artem Bityutskiy
    Cc: Adrian Hunter
    Cc: "Theodore Ts'o"
    Cc: David Laight
    Cc: David Woodhouse
    Cc: Eilon Greenstein
    Cc: Michel Lespinasse
    Cc: Robert Love
    Cc: Valdis Kletnieks
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • To avoid an explosion of request_module calls on a chain of abusive
    scripts, fail maximum recursion with -ELOOP instead of -ENOEXEC. As soon
    as maximum recursion depth is hit, the error will fail all the way back
    up the chain, aborting immediately.

    This also has the side-effect of stopping the user's shell from attempting
    to reexecute the top-level file as a shell script. As seen in the
    dash source:

    if (cmd != path_bshell && errno == ENOEXEC) {
        *argv-- = cmd;
        *argv = cmd = path_bshell;
        goto repeat;
    }

    The above logic was designed for running scripts automatically that lacked
    the "#!" header, not to re-try failed recursion. On a legitimate -ENOEXEC,
    things continue to behave as the shell expects.

    Additionally, when tracking recursion, the binfmt handlers should not be
    involved. The recursion being tracked is the depth of calls through
    search_binary_handler(), so that function should be exclusively responsible
    for tracking the depth.

    Signed-off-by: Kees Cook
    Cc: halfdog
    Cc: P J P
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
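
The behaviour described in the entry above can be modelled with a small Python sketch. It is a user-space toy, not the kernel implementation: the depth limit of 4 and the function names are assumptions, and the interpreter chain is represented as a plain dictionary. The point it shows is that the depth is tracked in one place and that -ELOOP travels back up the chain unchanged, so a shell that retries only on ENOEXEC will not re-run the file.

```python
import errno

MAX_RECURSION = 4  # assumed depth limit for this sketch


def search_binary_handler(path, interpreters, depth=0):
    """Return 0 on success or a negative errno value.

    `interpreters` maps a script path to the interpreter it names; a path
    missing from the map is treated as a directly runnable binary.
    """
    if depth > MAX_RECURSION:
        return -errno.ELOOP  # too many nested interpreters
    next_path = interpreters.get(path)
    if next_path is None:
        return 0
    # The result of the nested lookup is passed back unchanged.
    return search_binary_handler(next_path, interpreters, depth + 1)


def shell_retries_as_script(err):
    """Mirror of the shell check quoted above: retry only on ENOEXEC."""
    return err == -errno.ENOEXEC
```

A chain of ten scripts that each name the next one as interpreter yields -ELOOP, and shell_retries_as_script then returns False for that result.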
     
  • We display a list of supplementary groups for each process in
    /proc/<pid>/status. However, we show only the first 32 groups, not all of
    them.

    Although this is rare, processes sometimes do have more than 32
    supplementary groups, and this kernel limitation breaks user-space apps
    that rely on the group list in /proc/<pid>/status.

    The number 32 comes from the internal NGROUPS_SMALL macro, which defines
    the length of the internal kernel "small" groups buffer. There is no
    apparent reason to limit to this value.

    This patch removes the 32 groups printing limit.

    The Linux kernel limits the number of supplementary groups to NGROUPS_MAX,
    which is currently set to 65536. This is the maximum count of groups
    we may possibly print.

    Signed-off-by: Artem Bityutskiy
    Acked-by: Serge E. Hallyn
    Acked-by: Kees Cook
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Artem Bityutskiy
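
A user-space consumer of the group list mentioned in the entry above might read it as in this Python sketch. It assumes the line starts with "Groups:" and holds whitespace-separated decimal group IDs, and it makes no assumption about how many there are. The function name is hypothetical.

```python
def supplementary_groups(status_text):
    """Return the list of group IDs from the 'Groups:' line of a status dump."""
    for line in status_text.splitlines():
        if line.startswith("Groups:"):
            return [int(gid) for gid in line[len("Groups:"):].split()]
    return []  # no Groups line present
```

On a Linux system the input would typically come from reading /proc/<pid>/status for the process of interest.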
     
  • It is currently impossible to examine the state of seccomp for a given
    process. While attaching with gdb and attempting "call
    prctl(PR_GET_SECCOMP,...)" will work in some situations, it is not
    reliable. If the process is in seccomp mode 1, this query will kill the
    process (prctl not allowed); if the process is in mode 2 with prctl not
    allowed, it will similarly be killed; and in weird cases, if prctl is
    filtered to return errno 0, it can look like seccomp is disabled.

    When reviewing the state of running processes, there should be a way to
    externally examine the seccomp mode. ("Did this build of Chrome end up
    using seccomp?" "Did my distro ship ssh with seccomp enabled?")

    This adds the "Seccomp" line to /proc/$pid/status.

    Signed-off-by: Kees Cook
    Reviewed-by: Cyrill Gorcunov
    Cc: Andrea Arcangeli
    Cc: James Morris
    Acked-by: Serge E. Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
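
Examining the seccomp mode externally, as the entry above proposes, could be done with a sketch like the following. It assumes the line has the form "Seccomp:" followed by a small decimal number, where 0 means disabled, 1 is the strict mode the message calls mode 1, and 2 is the filter mode it calls mode 2. The names are hypothetical.

```python
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def seccomp_mode(status_text):
    """Return the numeric seccomp mode, or None if the line is absent."""
    for line in status_text.splitlines():
        if line.startswith("Seccomp:"):
            return int(line.split(":", 1)[1])
    return None  # older kernels do not print the line
```

Because the value is read from procfs, the inspected process never has to call prctl itself, which avoids the failure modes listed in the message.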
     
  • During c/r sessions we've found that there is no way at the moment to
    fetch some VMA associated flags, such as mlock() and madvise().

    This leads us to a problem -- we don't know if we should call for mlock()
    and/or madvise() after restore on the vma area we're bringing back to
    life.

    This patch introduces a new field into "smaps" output called VmFlags,
    where all set flags associated with the particular VMA are shown as
    two-letter mnemonics.

    [ Strictly speaking for c/r we only need mlock/madvise bits but it has been
    said that providing just a few flags looks somehow inconsistent. So all
    flags are here now. ]

    This feature is made available on CONFIG_CHECKPOINT_RESTORE=n kernels, as
    other applications may start to use these fields.

    The data is encoded in a somewhat awkward two-letter mnemonic form, to
    encourage userspace to be prepared for fields being added or removed in
    the future.

    [a.p.zijlstra@chello.nl: props to use for_each_set_bit]
    [sfr@canb.auug.org.au: props to use array instead of struct]
    [akpm@linux-foundation.org: overall redesign and simplification]
    [akpm@linux-foundation.org: remove unneeded braces per sfr, avoid using bloaty for_each_set_bit()]
    Signed-off-by: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Peter Zijlstra
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
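
A checkpoint/restore tool reading the VmFlags field described above might proceed as in this Python sketch. It assumes each VMA contributes one line starting with "VmFlags:" followed by space-separated two-letter mnemonics, and that "lo" marks a locked (mlock) area. Unknown mnemonics are kept rather than rejected, in line with the message's advice to expect fields to come and go. The function names are hypothetical.

```python
def vm_flags(smaps_text):
    """Return one set of mnemonics per 'VmFlags:' line, in file order."""
    return [set(line.split()[1:])
            for line in smaps_text.splitlines()
            if line.startswith("VmFlags:")]


def needs_mlock(flags):
    """True if the area was locked (assumption: 'lo' is the mlock mnemonic)."""
    return "lo" in flags
```

After restoring a mapping, the tool would call mlock() only for areas where needs_mlock returns True.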
     
  • Without this patch it is really hard to interpret a bounding set if
    CAP_LAST_CAP is unknown for the current kernel.

    Nonexistent capabilities cannot be deleted from a bounding set with the
    help of prctl.

    E.g., here are two examples, without and with this patch:

    CapBnd: ffffffe0fdecffff
    CapBnd: 00000000fdecffff

    I suggest hiding nonexistent capabilities. Here are two reasons:
    * It's logical and easier to use.
    * It helps to checkpoint-restore capabilities of tasks, because tasks
    can be restored on another kernel, where CAP_LAST_CAP is bigger.

    Signed-off-by: Andrew Vagin
    Cc: Andrew G. Morgan
    Reviewed-by: Serge E. Hallyn
    Cc: Pavel Emelyanov
    Reviewed-by: Kees Cook
    Cc: KAMEZAWA Hiroyuki
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Vagin
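
The masking shown by the two CapBnd values above amounts to clearing every bit above CAP_LAST_CAP. The Python sketch below reproduces that arithmetic. The value 36 used in the usage note is an assumption about CAP_LAST_CAP on kernels of that period, and the function name is hypothetical.

```python
def mask_nonexistent_caps(cap_hex, cap_last_cap):
    """Clear all capability bits above cap_last_cap in a 64-bit hex mask."""
    valid = (1 << (cap_last_cap + 1)) - 1  # bits 0..cap_last_cap set
    return "%016x" % (int(cap_hex, 16) & valid)
```

Under the assumption that CAP_LAST_CAP is 36, mask_nonexistent_caps("ffffffe0fdecffff", 36) gives 00000000fdecffff, matching the second line of the example.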
     
  • Option parsing code expects an unsigned integer for the codepage option,
    but prefixes and stores this option with "cp" before passing to
    load_nls(). This makes the displayed option in /proc an invalid one.
    Strip the prefix when printing so that the displayed option is valid for
    reuse.

    Signed-off-by: Dave Reisner
    Acked-by: OGAWA Hirofumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Reisner
     
  • parse_options() is supposed to return a value < 0 on error; however, we
    returned 0 (success) in a lot of cases. This actually was not a problem
    in practice because match_token(), used by parse_options(), is clever and
    catches most of the problems for us.

    Signed-off-by: Jan Kara
    Cc: OGAWA Hirofumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • So far FAT either offsets time stamps by sys_tz.minuteswest or leaves them
    as they are (when the tz=UTC mount option is used). However, in some cases
    it is useful to be able to specify the time stamp offset explicitly (e.g.
    when the time zone of the connected camera is different from the time zone
    of the computer, or when the HW clock is in UTC and thus
    sys_tz.minuteswest == 0).

    So provide a mount option time_offset= which allows the user to specify an
    offset in minutes that should be applied to time stamps on the filesystem.

    akpm: this code would work incorrectly when used via `mount -o remount',
    because cached inodes would not be updated. But fatfs's fat_remount() is
    basically a no-op anyway.

    Signed-off-by: Jan Kara
    Acked-by: OGAWA Hirofumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
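
The arithmetic behind the time_offset= option described above is a fixed shift in minutes. The Python sketch below shows the idea only. The sign convention, with the offset subtracted when converting a FAT local time stamp to UTC, is an assumption that the message itself does not spell out, and the function name is hypothetical.

```python
SECS_PER_MIN = 60


def fat_local_to_utc(local_seconds, time_offset_minutes):
    """Convert a FAT local time stamp to UTC seconds.

    Assumption: the configured offset is subtracted from the local value.
    """
    return local_seconds - time_offset_minutes * SECS_PER_MIN
```

For a camera whose clock runs two hours ahead of UTC, an offset of 120 under this convention would turn a stored 12:00 into 10:00 UTC.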
     
  • Change fatfs so that a warning is emitted when an attempt is made to mount
    a filesystem with the unsupported `discard' option.

    ext4 already does this: http://patchwork.ozlabs.org/patch/192668/

    Signed-off-by: Namjae Jeon
    Signed-off-by: Amit Sahrawat
    Acked-by: OGAWA Hirofumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namjae Jeon
     
  • If elf_core_dump() is called and fill_note_info() fails in the kmalloc()
    then it returns 0 but has not yet initialised all the needed fields. As a
    result we do a kfree(randomness) after correctly skipping the thread data.

    [akpm@linux-foundation.org: checkpatch fixes]
    Signed-off-by: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alan Cox
     
  • [yongjun_wei@trendmicro.com.cn: remove duplicated include]
    Signed-off-by: Andy Shevchenko
    Signed-off-by: Wei Yongjun
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Shevchenko
     
  • Fixes the following sparse warning:

    fs/notify/inode_mark.c:127:22: warning: symbol 'fsnotify_find_inode_mark_locked' was not declared. Should it be static?

    Signed-off-by: Tushar Behera
    Cc: Eric Paris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tushar Behera
     
  • But the kernel decided to call it "origin" instead. Fix most of the
    sites.

    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Pull user namespace changes from Eric Biederman:
    "While small this set of changes is very significant with respect to
    containers in general and user namespaces in particular. The user
    space interface is now complete.

    This set of changes adds support for unprivileged users to create user
    namespaces and as a user namespace root to create other namespaces.
    The tyranny of supporting suid root preventing unprivileged users from
    using cool new kernel features is broken.

    This set of changes completes the work on setns, adding support for
    the pid, user, mount namespaces.

    This set of changes includes a bunch of basic pid namespace
    cleanups/simplifications. Of particular significance is the rework of
    the pid namespace cleanup so it no longer requires sending out
    tendrils into all kinds of unexpected cleanup paths for operation. At
    least one case of broken error handling is fixed by this cleanup.

    The files under /proc/<pid>/ns/ have been converted from regular files
    to magic symlinks, which prevents incorrect caching by the VFS,
    ensuring the files always refer to the namespace the process is
    currently using and ensuring that the ptrace_may_access permission
    checks are always applied.

    The files under /proc/<pid>/ns/ have been given stable inode numbers,
    so it is now possible to see if different processes share the same
    namespaces.

    Coming through David Miller's net tree are changes that relax many of
    the permission checks in the networking stack, allowing the user
    namespace root to usefully use the networking stack. Similar changes
    for the mount namespace and the pid namespace are coming through my
    tree.

    Two small changes to add user namespace support were committed here and
    in David Miller's -net tree so that I could complete the work on the
    /proc/<pid>/ns/ files in this tree.

    Work remains to make it safe to build user namespaces and 9p, afs,
    ceph, cifs, coda, gfs2, ncpfs, nfs, nfsd, ocfs2, and xfs, so the
    Kconfig guard remains in place, preventing user namespaces from
    being built when any of those filesystems are enabled.

    Future design work remains to allow root users outside of the initial
    user namespace to mount more than just /proc and /sys."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (38 commits)
    proc: Usable inode numbers for the namespace file descriptors.
    proc: Fix the namespace inode permission checks.
    proc: Generalize proc inode allocation
    userns: Allow unprivilged mounts of proc and sysfs
    userns: For /proc/self/{uid,gid}_map derive the lower userns from the struct file
    procfs: Print task uids and gids in the userns that opened the proc file
    userns: Implement unshare of the user namespace
    userns: Implent proc namespace operations
    userns: Kill task_user_ns
    userns: Make create_new_namespaces take a user_ns parameter
    userns: Allow unprivileged use of setns.
    userns: Allow unprivileged users to create new namespaces
    userns: Allow setting a userns mapping to your current uid.
    userns: Allow chown and setgid preservation
    userns: Allow unprivileged users to create user namespaces.
    userns: Ignore suid and sgid on binaries if the uid or gid can not be mapped
    userns: fix return value on mntns_install() failure
    vfs: Allow unprivileged manipulation of the mount namespace.
    vfs: Only support slave subtrees across different user namespaces
    vfs: Add a user namespace reference from struct mnt_namespace
    ...

    Linus Torvalds
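
The pull message above notes that the namespace files under /proc now have stable inode numbers, so two processes can be compared. A minimal Python sketch of that comparison follows. It assumes a Linux system where the caller is permitted to stat the namespace files of both processes. The function names are hypothetical.

```python
import os


def same_inode(stat_a, stat_b):
    """Two stat results name the same object if device and inode match."""
    return (stat_a.st_dev, stat_a.st_ino) == (stat_b.st_dev, stat_b.st_ino)


def share_namespace(pid_a, pid_b, ns="net"):
    """Linux only: compare the /proc/<pid>/ns/<ns> entries of two processes."""
    path = "/proc/%d/ns/%s"
    # os.stat follows the magic symlink to the namespace object itself.
    return same_inode(os.stat(path % (pid_a, ns)), os.stat(path % (pid_b, ns)))
```

Comparing the device together with the inode number avoids a false match between unrelated files that happen to share an inode number on different filesystems.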
     
  • Users report a bug, the reproducer is:
    $ mkfs.btrfs /dev/loop0
    $ mount /dev/loop0 /mnt/btrfs/
    $ mkdir /mnt/btrfs/dir
    $ chattr +C /mnt/btrfs/dir/
    $ dd if=/dev/zero of=/mnt/btrfs/dir/foo bs=4K count=10;
    $ lsattr /mnt/btrfs/dir/foo
    ---------------C- /mnt/btrfs/dir/foo
    $ filefrag /mnt/btrfs/dir/foo
    /mnt/btrfs/dir/foo: 1 extent found ---> an extent
    $ dd if=/dev/zero of=/mnt/btrfs/dir/foo bs=4K count=1 seek=5 conv=notrunc,nocreat; sync
    $ filefrag /mnt/btrfs/dir/foo
    /mnt/btrfs/dir/foo: 3 extents found ---> with nocow, btrfs breaks the extent into three parts

    The newly created file should not only inherit the NODATACOW flag, but also
    honor the NODATASUM flag, because we must do COW on a file extent with a
    checksum.

    Signed-off-by: Liu Bo
    Signed-off-by: Chris Mason

    Liu Bo
     
  • The handling for directory crc hash overflows was fairly obscure,
    split_leaf returns EOVERFLOW when we try to extend the item and that is
    supposed to bubble up to userland. For a while it did so, but along the
    way we added better handling of errors and forced the FS readonly if we
    hit IO errors during the directory insertion.

    Along the way, we started testing only for EEXIST and the EOVERFLOW case
    was dropped. The end result is that we may force the FS readonly if we
    catch a directory hash bucket overflow.

    This fixes a few problem spots. First I add tests for EOVERFLOW in the
    places where we can safely just return the error up the chain.

    btrfs_rename is harder though, because it tries to insert the new
    directory item only after it has already unlinked anything the rename
    was going to overwrite. Rather than adding very complex logic, I added
    a helper to test for the hash overflow case early while it is still safe
    to bail out.

    Snapshot and subvolume creation had a similar problem, so they are using
    the new helper now too.

    Signed-off-by: Chris Mason
    Reported-by: Pascal Junod

    Chris Mason
     
  • Pull ext3, udf, quota fixes from Jan Kara:
    "Some ext3 & quota cleanups and couple of udf fixes"

    * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
    quota: Use the pre-processor to compile out quotactl_cmd_write when !CONFIG_BLOCK
    ext3: drop if around WARN_ON
    ext3: get rid of the duplicate code on ext3_fill_super
    udf: remove un-needed variable from inode_getblk
    udf: don't increment lenExtents while writing to a hole
    udf: fix memory leak while allocating blocks during write

    Linus Torvalds
     

17 Dec, 2012

10 commits

  • This confuses and angers lockdep even though it's ok. We don't really need
    the lock for free space inodes since only the transaction committer will be
    reserving space. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • This happens because writeback_inodes_sb_nr_if_idle does down_read. This
    doesn't work for us and it has not been fixed upstream yet, so do it
    ourselves and use that instead so we can stop having this stupid long
    standing lockup. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • When a new file is created with btrfs_create(), the inode will initially be
    created with permissions 0666 and later on in btrfs_init_acl() it will be
    adapted to mask out the umask bits. The problem is that this change won't make
    it into the btrfs_inode unless there's another change to the inode (e.g. writing
    content changing the size or touching the file changing the mtime.)

    This fix adds a call to btrfs_update_inode() to btrfs_create() to make sure that
    the change will not get lost if the in-memory inode is flushed before other
    changes are made to the file.

    Signed-off-by: Filipe Brandenburger
    Reviewed-by: Liu Bo
    Signed-off-by: Chris Mason

    Filipe Brandenburger
     
  • Raid properties can be shared among raid calculation code, we can put
    them into a global table to keep it simple.

    Signed-off-by: Liu Bo
    Signed-off-by: Chris Mason

    Liu Bo
     
  • This fixes a very special case that can be reproduced by just
    disconnecting a disk at runtime, and without unmounting the
    filesystem first, start scrub on the filesystem with the
    disconnected disk. All read and write EIOs are handled
    correctly, only the first superblock is an exception and gives
    a BUG() in a subfunction. The BUG() is correct, it would crash
    later otherwise. The subfunction must not be called for
    superblocks and this is what the fix changes.

    Reported-by: Joeri Vanthienen
    Signed-off-by: Stefan Behrens
    Signed-off-by: Chris Mason

    Stefan Behrens
     
  • This starts a transaction and dirties the inode every time we call it, which
    is super expensive if you have a write-heavy workload. We will be updating
    the inode when the IO completes and we reserve the space for the inode
    update when we reserve space for the write, so there is no chance of loss of
    information or enospc issues. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • I noticed while doing fsync tests that we were always dropping the path and
    re-searching when we first cow the log root even though we've already gotten
    the write lock on the root. That's because we don't take into account that
    there might not be a parent node, so fix the check to make sure there is
    actually a parent node before we undo all of this work for nothing. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • If we are syncing over and over, the overhead of doing all those maps in
    fill_inode_item and log_changed_extents really starts to hurt, so use map
    tokens so we can avoid all the extra mapping. Since the token maps from our
    offset to the end of the page, make sure to set the first thing in the item
    first so we really only do one map. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • This gets called at least 4 times for every level while adding an object,
    and it involves 3 kmapping calls, which on my box take about 5us apiece.
    So instead use a token, which brings us down to 1 kmap call and makes this
    function take 1/3 of the time per call. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • Our token logic depends on token->kaddr being set, and if it is not it sets
    everything properly as needed. So instead of memsetting just set
    token->kaddr to NULL. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik