09 Mar, 2013

1 commit

  • Update proc_ns_follow_link to use nd_jump_link instead of just
    manually updating nd.path.dentry.

    This fixes the BUG_ON(nd->inode != parent->d_inode) reported by Dave
    Jones and reproduced trivially with mkdir /proc/self/ns/uts/a.

    Sigh it looks like the VFS change to require use of nd_jump_link
    happend while proc_ns_follow_link was baking and since the common case
    of proc_ns_follow_link continued to work without problems the need for
    making this change was overlooked.

    Cc: stable@vger.kernel.org
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

28 Feb, 2013

3 commits

  • In read_vmcore() two `if' tests are duplicated. Change the position of
    them could reduce the duplication. This change does not affect the
    behaviour of the function.

    [akpm@linux-foundation.org: avoid `if (foo = bar)' thing, use min_t()]
    [akpm@linux-foundation.org: s/max_t/min_t/]
    Signed-off-by: Zhang Yanfei
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • - use pr_foo() throughout

    - remove a couple of duplicated KERN_WARNINGs, via WARN(KERN_WARNING "...")

    - nuke a few warnings which I've never seen happen, ever.

    Cc: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • The existing SUID_DUMP_* defines duplicate the newer SUID_DUMPABLE_*
    defines introduced in 54b501992dd2 ("coredump: warn about unsafe
    suid_dumpable / core_pattern combo"). Remove the new ones, and use the
    prior values instead.

    Signed-off-by: Kees Cook
    Reported-by: Chen Gang
    Cc: Alexander Viro
    Cc: Alan Cox
    Cc: "Eric W. Biederman"
    Cc: Doug Ledford
    Cc: Serge Hallyn
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

27 Feb, 2013

1 commit

  • Pull vfs pile (part one) from Al Viro:
    "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
    locking violations, etc.

    The most visible changes here are death of FS_REVAL_DOT (replaced with
    "has ->d_weak_revalidate()") and a new helper getting from struct file
    to inode. Some bits of preparation to xattr method interface changes.

    Misc patches by various people sent this cycle *and* ocfs2 fixes from
    several cycles ago that should've been upstream right then.

    PS: the next vfs pile will be xattr stuff."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
    saner proc_get_inode() calling conventions
    proc: avoid extra pde_put() in proc_fill_super()
    fs: change return values from -EACCES to -EPERM
    fs/exec.c: make bprm_mm_init() static
    ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
    ocfs2: fix possible use-after-free with AIO
    ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
    get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
    target: writev() on single-element vector is pointless
    export kernel_write(), convert open-coded instances
    fs: encode_fh: return FILEID_INVALID if invalid fid_type
    kill f_vfsmnt
    vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
    nfsd: handle vfs_getattr errors in acl protocol
    switch vfs_getattr() to struct path
    default SET_PERSONALITY() in linux/elf.h
    ceph: prepopulate inodes only when request is aborted
    d_hash_and_lookup(): export, switch open-coded instances
    9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
    9p: split dropping the acls from v9fs_set_create_acl()
    ...

    Linus Torvalds
     

26 Feb, 2013

4 commits

  • Make it drop the pde in *all* cases when no new reference to it is
    put into an inode - both when an inode had already been set up
    (as we were already doing) and when inode allocation has failed.
    Makes for simpler logics in callers...

    Signed-off-by: Al Viro

    Al Viro
     
  • If proc_get_inode() succeeded, but d_make_root() failed, pde_put() for
    proc_root will be called twice: the first time due to iput() called from
    d_make_root() and the second time directly in the end of
    proc_fill_super().

    Signed-off-by: Maxim Patlasov
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Maxim Patlasov
     
  • According to SUSv3:

    [EACCES] Permission denied. An attempt was made to access a file in a way
    forbidden by its file access permissions.

    [EPERM] Operation not permitted. An attempt was made to perform an operation
    limited to processes with appropriate privileges or to the owner of a file
    or other resource.

    So -EPERM should be returned if capability checks fails.

    Strictly speaking this is an API change since the error code user sees is
    altered.

    Signed-off-by: Zhao Hongjiang
    Acked-by: Jan Kara
    Acked-by: Steven Whitehouse
    Acked-by: Ian Kent
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Zhao Hongjiang
     
  • * calling conventions change - ERR_PTR() is returned on ->d_hash() errors;
    NULL is just for dcache miss now.
    * exported, open-coded instances in ncpfs and cifs converted.

    Signed-off-by: Al Viro

    Al Viro
     

24 Feb, 2013

2 commits

  • When I use several fast SSD to do swap, swapper_space.tree_lock is
    heavily contended. This makes each swap partition have one
    address_space to reduce the lock contention. There is an array of
    address_space for swap. The swap entry type is the index to the array.

    In my test with 3 SSD, this increases the swapout throughput 20%.

    [akpm@linux-foundation.org: revert unneeded change to __add_to_swap_cache]
    Signed-off-by: Shaohua Li
    Cc: Hugh Dickins
    Acked-by: Rik van Riel
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • Since MCE is an x86 concept, and this code is in mm/, it would be better
    to use the name num_poisoned_pages instead of mce_bad_pages.

    [akpm@linux-foundation.org: fix mm/sparse.c]
    Signed-off-by: Xishi Qiu
    Signed-off-by: Jiang Liu
    Suggested-by: Borislav Petkov
    Reviewed-by: Wanpeng Li
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     

23 Feb, 2013

1 commit


22 Feb, 2013

1 commit

  • Pull tty/serial patches from Greg Kroah-Hartman:
    "Here's the big tty/serial driver patches for 3.9-rc1.

    More tty port rework and fixes from Jiri here, as well as lots of
    individual serial driver updates and fixes.

    All of these have been in the linux-next tree for a while."

    * tag 'tty-3.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (140 commits)
    tty: mxser: improve error handling in mxser_probe() and mxser_module_init()
    serial: imx: fix uninitialized variable warning
    serial: tegra: assume CONFIG_OF
    TTY: do not update atime/mtime on read/write
    lguest: select CONFIG_TTY to build properly.
    ARM defconfigs: add missing inclusions of linux/platform_device.h
    fb/exynos: include platform_device.h
    ARM: sa1100/assabet: include platform_device.h directly
    serial: imx: Fix recursive locking bug
    pps: Fix build breakage from decoupling pps from tty
    tty: Remove ancient hardpps()
    pps: Additional cleanups in uart_handle_dcd_change
    pps: Move timestamp read into PPS code proper
    pps: Don't crash the machine when exiting will do
    pps: Fix a use-after free bug when unregistering a source.
    pps: Use pps_lookup_dev to reduce ldisc coupling
    pps: Add pps_lookup_dev() function
    tty: serial: uartlite: Support uartlite on big and little endian systems
    tty: serial: uartlite: Fix sparse and checkpatch warnings
    serial/arc-uart: Miscll DT related updates (Grant's review comments)
    ...

    Fix up trivial conflicts, mostly just due to the TTY config option
    clashing with the EXPERIMENTAL removal.

    Linus Torvalds
     

21 Feb, 2013

1 commit

  • Pull networking update from David Miller:

    1) Checkpoint/restarted TCP sockets now can properly propagate the TCP
    timestamp offset. From Andrey Vagin.

    2) VMWARE VM VSOCK layer, from Andy King.

    3) Much improved support for virtual functions and SR-IOV in bnx2x,
    from Ariel ELior.

    4) All protocols on ipv4 and ipv6 are now network namespace aware, and
    all the compatability checks for initial-namespace-only protocols is
    removed. Thanks to Tom Parkin for helping deal with the last major
    holdout, L2TP.

    5) IPV6 support in netpoll and network namespace support in pktgen,
    from Cong Wang.

    6) Multiple Registration Protocol (MRP) and Multiple VLAN Registration
    Protocol (MVRP) support, from David Ward.

    7) Compute packet lengths more accurately in the packet scheduler, from
    Eric Dumazet.

    8) Use per-task page fragment allocator in skb_append_datato_frags(),
    also from Eric Dumazet.

    9) Add support for connection tracking labels in netfilter, from
    Florian Westphal.

    10) Fix default multicast group joining on ipv6, and add anti-spoofing
    checks to 6to4 and 6rd. From Hannes Frederic Sowa.

    11) Make ipv4/ipv6 fragmentation memory limits more reasonable in modern
    times, rearrange inet frag datastructures for better cacheline
    locality, and move more operations outside of locking. From Jesper
    Dangaard Brouer.

    12) Instead of strict master slave relationships, allow arbitrary
    scenerios with "upper device lists". From Jiri Pirko.

    13) Improve rate limiting accuracy in TBF and act_police, also from Jiri
    Pirko.

    14) Add a BPF filter netfilter match target, from Willem de Bruijn.

    15) Orphan and delete a bunch of pre-historic networking drivers from
    Paul Gortmaker.

    16) Add TSO support for GRE tunnels, from Pravin B SHelar. Although
    this still needs some minor bug fixing before it's %100 correct in
    all cases.

    17) Handle unresolved IPSEC states like ARP, with a resolution packet
    queue. From Steffen Klassert.

    18) Remove TCP Appropriate Byte Count support (ABC), from Stephen
    Hemminger. This was long overdue.

    19) Support SO_REUSEPORT, from Tom Herbert.

    20) Allow locking a socket BPF filter, so that it cannot change after a
    process drops capabilities.

    21) Add VLAN filtering to bridge, from Vlad Yasevich.

    22) Bring ipv6 on-par with ipv4 and do not cache neighbour entries in
    the ipv6 routes, from YOSHIFUJI Hideaki.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1538 commits)
    ipv6: fix race condition regarding dst->expires and dst->from.
    net: fix a wrong assignment in skb_split()
    ip_gre: remove an extra dst_release()
    ppp: set qdisc_tx_busylock to avoid LOCKDEP splat
    atl1c: restore buffer state
    net: fix a build failure when !CONFIG_PROC_FS
    net: ipv4: fix waring -Wunused-variable
    net: proc: fix build failed when procfs is not configured
    Revert "xen: netback: remove redundant xenvif_put"
    net: move procfs code to net/core/net-procfs.c
    qmi_wwan, cdc-ether: add ADU960S
    bonding: set sysfs device_type to 'bond'
    bonding: fix bond_release_all inconsistencies
    b44: use netdev_alloc_skb_ip_align()
    xen: netback: remove redundant xenvif_put
    net: fec: Do a sanity check on the gpio number
    ip_gre: propogate target device GSO capability to the tunnel device
    ip_gre: allow CSUM capable devices to handle packets
    bonding: Fix initialize after use for 3ad machine state spinlock
    bonding: Fix race condition between bond_enslave() and bond_3ad_update_lacp_rate()
    ...

    Linus Torvalds
     

19 Feb, 2013

2 commits


28 Jan, 2013

1 commit

  • This is in preparation for the full dynticks feature. While
    remotely reading the cputime of a task running in a full
    dynticks CPU, we'll need to do some extra-computation. This
    way we can account the time it spent tickless in userspace
    since its last cputime snapshot.

    Signed-off-by: Frederic Weisbecker
    Cc: Andrew Morton
    Cc: Ingo Molnar
    Cc: Li Zhong
    Cc: Namhyung Kim
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
     

19 Jan, 2013

1 commit

  • The option allows you to remove TTY and compile without errors. This
    saves space on systems that won't support TTY interfaces anyway.
    bloat-o-meter output is below.

    The bulk of this patch consists of Kconfig changes adding "depends on
    TTY" to various serial devices and similar drivers that require the TTY
    layer. Ideally, these dependencies would occur on a common intermediate
    symbol such as SERIO, but most drivers "select SERIO" rather than
    "depends on SERIO", and "select" does not respect dependencies.

    bloat-o-meter output comparing our previous minimal to new minimal by
    removing TTY. The list is filtered to not show removed entries with awk
    '$3 != "-"' as the list was very long.

    add/remove: 0/226 grow/shrink: 2/14 up/down: 6/-35356 (-35350)
    function old new delta
    chr_dev_init 166 170 +4
    allow_signal 80 82 +2
    static.__warned 143 142 -1
    disallow_signal 63 62 -1
    __set_special_pids 95 94 -1
    unregister_console 126 121 -5
    start_kernel 546 541 -5
    register_console 593 588 -5
    copy_from_user 45 40 -5
    sys_setsid 128 120 -8
    sys_vhangup 32 19 -13
    do_exit 1543 1526 -17
    bitmap_zero 60 40 -20
    arch_local_irq_save 137 117 -20
    release_task 674 652 -22
    static.spin_unlock_irqrestore 308 260 -48

    Signed-off-by: Joe Millenbach
    Reviewed-by: Jamey Sharp
    Reviewed-by: Josh Triplett
    Signed-off-by: Greg Kroah-Hartman

    Joe Millenbach
     

03 Jan, 2013

1 commit


26 Dec, 2012

1 commit

  • While testing the pid namespace code I hit this nasty warning.

    [ 176.262617] ------------[ cut here ]------------
    [ 176.263388] WARNING: at /home/eric/projects/linux/linux-userns-devel/kernel/softirq.c:160 local_bh_enable_ip+0x7a/0xa0()
    [ 176.265145] Hardware name: Bochs
    [ 176.265677] Modules linked in:
    [ 176.266341] Pid: 742, comm: bash Not tainted 3.7.0userns+ #18
    [ 176.266564] Call Trace:
    [ 176.266564] [] warn_slowpath_common+0x7f/0xc0
    [ 176.266564] [] warn_slowpath_null+0x1a/0x20
    [ 176.266564] [] local_bh_enable_ip+0x7a/0xa0
    [ 176.266564] [] _raw_spin_unlock_bh+0x19/0x20
    [ 176.266564] [] proc_free_inum+0x3a/0x50
    [ 176.266564] [] free_pid_ns+0x1c/0x80
    [ 176.266564] [] put_pid_ns+0x35/0x50
    [ 176.266564] [] put_pid+0x4a/0x60
    [ 176.266564] [] tty_ioctl+0x717/0xc10
    [ 176.266564] [] ? wait_consider_task+0x855/0xb90
    [ 176.266564] [] ? default_spin_lock_flags+0x9/0x10
    [ 176.266564] [] ? remove_wait_queue+0x5a/0x70
    [ 176.266564] [] do_vfs_ioctl+0x98/0x550
    [ 176.266564] [] ? recalc_sigpending+0x1f/0x60
    [ 176.266564] [] ? __set_task_blocked+0x37/0x80
    [ 176.266564] [] ? sys_wait4+0xab/0xf0
    [ 176.266564] [] sys_ioctl+0x91/0xb0
    [ 176.266564] [] ? task_stopped_code+0x50/0x50
    [ 176.266564] [] system_call_fastpath+0x16/0x1b
    [ 176.266564] ---[ end trace 387af88219ad6143 ]---

    It turns out that spin_unlock_bh(proc_inum_lock) is not safe when
    put_pid is called with another spinlock held and irqs disabled.

    For now take the easy path and use spin_lock_irqsave(proc_inum_lock)
    in proc_free_inum and spin_loc_irq in proc_alloc_inum(proc_inum_lock).

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

21 Dec, 2012

3 commits

  • Merge the rest of Andrew's patches for -rc1:
    "A bunch of fixes and misc missed-out-on things.

    That'll do for -rc1. I still have a batch of IPC patches which still
    have a possible bug report which I'm chasing down."

    * emailed patches from Andrew Morton : (25 commits)
    keys: use keyring_alloc() to create module signing keyring
    keys: fix unreachable code
    sendfile: allows bypassing of notifier events
    SGI-XP: handle non-fatal traps
    fat: fix incorrect function comment
    Documentation: ABI: remove testing/sysfs-devices-node
    proc: fix inconsistent lock state
    linux/kernel.h: fix DIV_ROUND_CLOSEST with unsigned divisors
    memcg: don't register hotcpu notifier from ->css_alloc()
    checkpatch: warn on uapi #includes that #include
    mm: cma: WARN if freed memory is still in use
    exec: do not leave bprm->interp on stack
    ...

    Linus Torvalds
     
  • Lockdep found an inconsistent lock state when rcu is processing delayed
    work in softirq. Currently, kernel is using spin_lock/spin_unlock to
    protect proc_inum_ida, but proc_free_inum is called by rcu in softirq
    context.

    Use spin_lock_bh/spin_unlock_bh fix following lockdep warning.

    =================================
    [ INFO: inconsistent lock state ]
    3.7.0 #36 Not tainted
    ---------------------------------
    inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
    swapper/1/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
    (proc_inum_lock){+.?...}, at: proc_free_inum+0x1c/0x50
    {SOFTIRQ-ON-W} state was registered at:
    __lock_acquire+0x8ae/0xca0
    lock_acquire+0x199/0x200
    _raw_spin_lock+0x41/0x50
    proc_alloc_inum+0x4c/0xd0
    alloc_mnt_ns+0x49/0xc0
    create_mnt_ns+0x25/0x70
    mnt_init+0x161/0x1c7
    vfs_caches_init+0x107/0x11a
    start_kernel+0x348/0x38c
    x86_64_start_reservations+0x131/0x136
    x86_64_start_kernel+0x103/0x112
    irq event stamp: 2993422
    hardirqs last enabled at (2993422): _raw_spin_unlock_irqrestore+0x55/0x80
    hardirqs last disabled at (2993421): _raw_spin_lock_irqsave+0x29/0x70
    softirqs last enabled at (2993394): _local_bh_enable+0x13/0x20
    softirqs last disabled at (2993395): call_softirq+0x1c/0x30

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(proc_inum_lock);

    lock(proc_inum_lock);

    *** DEADLOCK ***

    no locks held by swapper/1/0.

    stack backtrace:
    Pid: 0, comm: swapper/1 Not tainted 3.7.0 #36
    Call Trace:
    [] ? vprintk_emit+0x471/0x510
    print_usage_bug+0x2a5/0x2c0
    mark_lock+0x33b/0x5e0
    __lock_acquire+0x813/0xca0
    lock_acquire+0x199/0x200
    _raw_spin_lock+0x41/0x50
    proc_free_inum+0x1c/0x50
    free_pid_ns+0x1c/0x50
    put_pid_ns+0x2e/0x50
    put_pid+0x4a/0x60
    delayed_put_pid+0x12/0x20
    rcu_process_callbacks+0x462/0x790
    __do_softirq+0x1b4/0x3b0
    call_softirq+0x1c/0x30
    do_softirq+0x59/0xd0
    irq_exit+0x54/0xd0
    smp_apic_timer_interrupt+0x95/0xa3
    apic_timer_interrupt+0x72/0x80
    cpuidle_enter_tk+0x10/0x20
    cpuidle_enter_state+0x17/0x50
    cpuidle_idle_call+0x287/0x520
    cpu_idle+0xba/0x130
    start_secondary+0x2b3/0x2bc

    Signed-off-by: Xiaotian Feng
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiaotian Feng
     
  • Removed vmtruncate

    Signed-off-by: Marco Stornelli
    Signed-off-by: Al Viro

    Marco Stornelli
     

18 Dec, 2012

9 commits

  • Merge misc patches from Andrew Morton:
    "Incoming:

    - lots of misc stuff

    - backlight tree updates

    - lib/ updates

    - Oleg's percpu-rwsem changes

    - checkpatch

    - rtc

    - aoe

    - more checkpoint/restart support

    I still have a pile of MM stuff pending - Pekka should be merging
    later today after which that is good to go. A number of other things
    are twiddling thumbs awaiting maintainer merges."

    * emailed patches from Andrew Morton : (180 commits)
    scatterlist: don't BUG when we can trivially return a proper error.
    docs: update documentation about /proc//fdinfo/ fanotify output
    fs, fanotify: add @mflags field to fanotify output
    docs: add documentation about /proc//fdinfo/ output
    fs, notify: add procfs fdinfo helper
    fs, exportfs: add exportfs_encode_inode_fh() helper
    fs, exportfs: escape nil dereference if no s_export_op present
    fs, epoll: add procfs fdinfo helper
    fs, eventfd: add procfs fdinfo helper
    procfs: add ability to plug in auxiliary fdinfo providers
    tools/testing/selftests/kcmp/kcmp_test.c: print reason for failure in kcmp_test
    breakpoint selftests: print failure status instead of cause make error
    kcmp selftests: print fail status instead of cause make error
    kcmp selftests: make run_tests fix
    mem-hotplug selftests: print failure status instead of cause make error
    cpu-hotplug selftests: print failure status instead of cause make error
    mqueue selftests: print failure status instead of cause make error
    vm selftests: print failure status instead of cause make error
    ubifs: use prandom_bytes
    mtd: nandsim: use prandom_bytes
    ...

    Linus Torvalds
     
  • This allows us to print out eventpoll target file descriptor, events and
    data, the /proc/pid/fdinfo/fd consists of

    | pos: 0
    | flags: 02
    | tfd: 5 events: 1d data: ffffffffffffffff enabled: 1

    [avagin@: fix for unitialized ret variable]

    Signed-off-by: Cyrill Gorcunov
    Acked-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Andrey Vagin
    Cc: Al Viro
    Cc: Alexey Dobriyan
    Cc: James Bottomley
    Cc: "Aneesh Kumar K.V"
    Cc: Alexey Dobriyan
    Cc: Matthew Helsley
    Cc: "J. Bruce Fields"
    Cc: "Aneesh Kumar K.V"
    Cc: Tvrtko Ursulin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • This patch brings ability to print out auxiliary data associated with
    file in procfs interface /proc/pid/fdinfo/fd.

    In particular further patches make eventfd, evenpoll, signalfd and
    fsnotify to print additional information complete enough to restore
    these objects after checkpoint.

    To simplify the code we add show_fdinfo callback inside struct
    file_operations (as Al and Pavel are proposing).

    Signed-off-by: Cyrill Gorcunov
    Acked-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Andrey Vagin
    Cc: Al Viro
    Cc: Alexey Dobriyan
    Cc: James Bottomley
    Cc: "Aneesh Kumar K.V"
    Cc: Alexey Dobriyan
    Cc: Matthew Helsley
    Cc: "J. Bruce Fields"
    Cc: "Aneesh Kumar K.V"
    Cc: Tvrtko Ursulin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • We display a list of supplementary group for each process in
    /proc//status. However, we show only the first 32 groups, not all of
    them.

    Although this is rare, but sometimes processes do have more than 32
    supplementary groups, and this kernel limitation breaks user-space apps
    that rely on the group list in /proc//status.

    Number 32 comes from the internal NGROUPS_SMALL macro which defines the
    length for the internal kernel "small" groups buffer. There is no
    apparent reason to limit to this value.

    This patch removes the 32 groups printing limit.

    The Linux kernel limits the amount of supplementary groups by NGROUPS_MAX,
    which is currently set to 65536. And this is the maximum count of groups
    we may possibly print.

    Signed-off-by: Artem Bityutskiy
    Acked-by: Serge E. Hallyn
    Acked-by: Kees Cook
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Artem Bityutskiy
     
  • It is currently impossible to examine the state of seccomp for a given
    process. While attaching with gdb and attempting "call
    prctl(PR_GET_SECCOMP,...)" will work with some situations, it is not
    reliable. If the process is in seccomp mode 1, this query will kill the
    process (prctl not allowed), if the process is in mode 2 with prctl not
    allowed, it will similarly be killed, and in weird cases, if prctl is
    filtered to return errno 0, it can look like seccomp is disabled.

    When reviewing the state of running processes, there should be a way to
    externally examine the seccomp mode. ("Did this build of Chrome end up
    using seccomp?" "Did my distro ship ssh with seccomp enabled?")

    This adds the "Seccomp" line to /proc/$pid/status.

    Signed-off-by: Kees Cook
    Reviewed-by: Cyrill Gorcunov
    Cc: Andrea Arcangeli
    Cc: James Morris
    Acked-by: Serge E. Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • During c/r sessions we've found that there is no way at the moment to
    fetch some VMA associated flags, such as mlock() and madvise().

    This leads us to a problem -- we don't know if we should call for mlock()
    and/or madvise() after restore on the vma area we're bringing back to
    life.

    This patch intorduces a new field into "smaps" output called VmFlags,
    where all set flags associated with the particular VMA is shown as two
    letter mnemonics.

    [ Strictly speaking for c/r we only need mlock/madvise bits but it has been
    said that providing just a few flags looks somehow inconsistent. So all
    flags are here now. ]

    This feature is made available on CONFIG_CHECKPOINT_RESTORE=n kernels, as
    other applications may start to use these fields.

    The data is encoded in a somewhat awkward two letters mnemonic form, to
    encourage userspace to be prepared for fields being added or removed in
    the future.

    [a.p.zijlstra@chello.nl: props to use for_each_set_bit]
    [sfr@canb.auug.org.au: props to use array instead of struct]
    [akpm@linux-foundation.org: overall redesign and simplification]
    [akpm@linux-foundation.org: remove unneeded braces per sfr, avoid using bloaty for_each_set_bit()]
    Signed-off-by: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Peter Zijlstra
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • Without this patch it is really hard to interpret a bounding set, if
    CAP_LAST_CAP is unknown for a current kernel.

    Non-existant capabilities can not be deleted from a bounding set with help
    of prctl.

    E.g.: Here are two examples without/with this patch.

    CapBnd: ffffffe0fdecffff
    CapBnd: 00000000fdecffff

    I suggest to hide non-existent capabilities. Here is two reasons.
    * It's logically and easier for using.
    * It helps to checkpoint-restore capabilities of tasks, because tasks
    can be restored on another kernel, where CAP_LAST_CAP is bigger.

    Signed-off-by: Andrew Vagin
    Cc: Andrew G. Morgan
    Reviewed-by: Serge E. Hallyn
    Cc: Pavel Emelyanov
    Reviewed-by: Kees Cook
    Cc: KAMEZAWA Hiroyuki
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Vagin
     
  • [yongjun_wei@trendmicro.com.cn: remove duplicated include]
    Signed-off-by: Andy Shevchenko
    Signed-off-by: Wei Yongjun
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Shevchenko
     
  • Pull user namespace changes from Eric Biederman:
    "While small this set of changes is very significant with respect to
    containers in general and user namespaces in particular. The user
    space interface is now complete.

    This set of changes adds support for unprivileged users to create user
    namespaces and as a user namespace root to create other namespaces.
    The tyranny of supporting suid root preventing unprivileged users from
    using cool new kernel features is broken.

    This set of changes completes the work on setns, adding support for
    the pid, user, mount namespaces.

    This set of changes includes a bunch of basic pid namespace
    cleanups/simplifications. Of particular significance is the rework of
    the pid namespace cleanup so it no longer requires sending out
    tendrils into all kinds of unexpected cleanup paths for operation. At
    least one case of broken error handling is fixed by this cleanup.

    The files under /proc//ns/ have been converted from regular files
    to magic symlinks which prevents incorrect caching by the VFS,
    ensuring the files always refer to the namespace the process is
    currently using and ensuring that the ptrace_mayaccess permission
    checks are always applied.

    The files under /proc//ns/ have been given stable inode numbers
    so it is now possible to see if different processes share the same
    namespaces.

    Through the David Miller's net tree are changes to relax many of the
    permission checks in the networking stack to allowing the user
    namespace root to usefully use the networking stack. Similar changes
    for the mount namespace and the pid namespace are coming through my
    tree.

    Two small changes to add user namespace support were commited here adn
    in David Miller's -net tree so that I could complete the work on the
    /proc//ns/ files in this tree.

    Work remains to make it safe to build user namespaces and 9p, afs,
    ceph, cifs, coda, gfs2, ncpfs, nfs, nfsd, ocfs2, and xfs so the
    Kconfig guard remains in place preventing that user namespaces from
    being built when any of those filesystems are enabled.

    Future design work remains to allow root users outside of the initial
    user namespace to mount more than just /proc and /sys."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (38 commits)
    proc: Usable inode numbers for the namespace file descriptors.
    proc: Fix the namespace inode permission checks.
    proc: Generalize proc inode allocation
    userns: Allow unprivilged mounts of proc and sysfs
    userns: For /proc/self/{uid,gid}_map derive the lower userns from the struct file
    procfs: Print task uids and gids in the userns that opened the proc file
    userns: Implement unshare of the user namespace
    userns: Implent proc namespace operations
    userns: Kill task_user_ns
    userns: Make create_new_namespaces take a user_ns parameter
    userns: Allow unprivileged use of setns.
    userns: Allow unprivileged users to create new namespaces
    userns: Allow setting a userns mapping to your current uid.
    userns: Allow chown and setgid preservation
    userns: Allow unprivileged users to create user namespaces.
    userns: Ignore suid and sgid on binaries if the uid or gid can not be mapped
    userns: fix return value on mntns_install() failure
    vfs: Allow unprivileged manipulation of the mount namespace.
    vfs: Only support slave subtrees across different user namespaces
    vfs: Add a user namespace reference from struct mnt_namespace
    ...

    Linus Torvalds
     

14 Dec, 2012

1 commit

  • Merge misc VM changes from Andrew Morton:
    "The rest of most-of-MM. The other MM bits await a slab merge.

    This patch includes the addition of a huge zero_page. Not a
    performance boost but it an save large amounts of physical memory in
    some situations.

    Also a bunch of Fujitsu engineers are working on memory hotplug.
    Which, as it turns out, was badly broken. About half of their patches
    are included here; the remainder are 3.8 material."

    However, this merge disables CONFIG_MOVABLE_NODE, which was totally
    broken. We don't add new features with "default y", nor do we add
    Kconfig questions that are incomprehensible to most people without any
    help text. Does the feature even make sense without compaction or
    memory hotplug?

    * akpm: (54 commits)
    mm/bootmem.c: remove unused wrapper function reserve_bootmem_generic()
    mm/memory.c: remove unused code from do_wp_page()
    asm-generic, mm: pgtable: consolidate zero page helpers
    mm/hugetlb.c: fix warning on freeing hwpoisoned hugepage
    hwpoison, hugetlbfs: fix RSS-counter warning
    hwpoison, hugetlbfs: fix "bad pmd" warning in unmapping hwpoisoned hugepage
    mm: protect against concurrent vma expansion
    memcg: do not check for mm in __mem_cgroup_count_vm_event
    tmpfs: support SEEK_DATA and SEEK_HOLE (reprise)
    mm: provide more accurate estimation of pages occupied by memmap
    fs/buffer.c: remove redundant initialization in alloc_page_buffers()
    fs/buffer.c: do not inline exported function
    writeback: fix a typo in comment
    mm: introduce new field "managed_pages" to struct zone
    mm, oom: remove statically defined arch functions of same name
    mm, oom: remove redundant sleep in pagefault oom handler
    mm, oom: cleanup pagefault oom handler
    memory_hotplug: allow online/offline memory to result movable node
    numa: add CONFIG_MOVABLE_NODE for movable-dedicated node
    mm, memcg: avoid unnecessary function call when memcg is disabled
    ...

    Linus Torvalds
     

13 Dec, 2012

3 commits

  • Pull networking changes from David Miller:

    1) Allow to dump, monitor, and change the bridge multicast database
    using netlink. From Cong Wang.

    2) RFC 5961 TCP blind data injection attack mitigation, from Eric
    Dumazet.

    3) Networking user namespace support from Eric W. Biederman.

    4) tuntap/virtio-net multiqueue support by Jason Wang.

    5) Support for checksum offload of encapsulated packets (basically,
    tunneled traffic can still be checksummed by HW). From Joseph
    Gasparakis.

    6) Allow BPF filter access to VLAN tags, from Eric Dumazet and
    Daniel Borkmann.

    7) Bridge port parameters over netlink and BPDU blocking support
    from Stephen Hemminger.

    8) Improve data access patterns during inet socket demux by rearranging
    socket layout, from Eric Dumazet.

    9) TIPC protocol updates and cleanups from Ying Xue, Paul Gortmaker, and
    Jon Maloy.

    10) Update TCP socket hash sizing to be more in line with current day
    realities. The existing heurstics were choosen a decade ago.
    From Eric Dumazet.

    11) Fix races, queue bloat, and excessive wakeups in ATM and
    associated drivers, from Krzysztof Mazur and David Woodhouse.

    12) Support DOVE (Distributed Overlay Virtual Ethernet) extensions
    in VXLAN driver, from David Stevens.

    13) Add "oops_only" mode to netconsole, from Amerigo Wang.

    14) Support set and query of VEB/VEPA bridge mode via PF_BRIDGE, also
    allow DCB netlink to work on namespaces other than the initial
    namespace. From John Fastabend.

    15) Support PTP in the Tigon3 driver, from Matt Carlson.

    16) tun/vhost zero copy fixes and improvements, plus turn it on
    by default, from Michael S. Tsirkin.

    17) Support per-association statistics in SCTP, from Michele
    Baldessari.

    And many, many, driver updates, cleanups, and improvements. Too
    numerous to mention individually.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1722 commits)
    net/mlx4_en: Add support for destination MAC in steering rules
    net/mlx4_en: Use generic etherdevice.h functions.
    net: ethtool: Add destination MAC address to flow steering API
    bridge: add support of adding and deleting mdb entries
    bridge: notify mdb changes via netlink
    ndisc: Unexport ndisc_{build,send}_skb().
    uapi: add missing netconf.h to export list
    pkt_sched: avoid requeues if possible
    solos-pci: fix double-free of TX skb in DMA mode
    bnx2: Fix accidental reversions.
    bna: Driver Version Updated to 3.1.2.1
    bna: Firmware update
    bna: Add RX State
    bna: Rx Page Based Allocation
    bna: TX Intr Coalescing Fix
    bna: Tx and Rx Optimizations
    bna: Code Cleanup and Enhancements
    ath9k: check pdata variable before dereferencing it
    ath5k: RX timestamp is reported at end of frame
    ath9k_htc: RX timestamp is reported at end of frame
    ...

    Linus Torvalds
     
  • N_HIGH_MEMORY stands for the nodes that has normal or high memory.
    N_MEMORY stands for the nodes that has any memory.

    The code here need to handle with the nodes which have memory, we should
    use N_MEMORY instead.

    Signed-off-by: Lai Jiangshan
    Acked-by: Hillf Danton
    Signed-off-by: Wen Congyang
    Cc: Christoph Lameter
    Cc: Lin Feng
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     
  • Pass vma instead of mm and add address parameter.

    In most cases we already have vma on the stack. We provides
    split_huge_page_pmd_mm() for few cases when we have mm, but not vma.

    This change is preparation to huge zero pmd splitting implementation.

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Andi Kleen
    Cc: "H. Peter Anvin"
    Cc: Mel Gorman
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

12 Dec, 2012

2 commits

  • Pull scheduler updates from Ingo Molnar:
    "The biggest change affects group scheduling: we now track the runnable
    average on a per-task entity basis, allowing a smoother, exponential
    decay average based load/weight estimation instead of the previous
    binary on-the-runqueue/off-the-runqueue load weight method.

    This will inevitably disturb workloads that were in some sort of
    borderline balancing state or unstable equilibrium, so an eye has to
    be kept on regressions.

    For that reason the new load average is only limited to group
    scheduling (shares distribution) at the moment (which was also hurting
    the most from the prior, crude weight calculation and whose scheduling
    quality wins most from this change) - but we plan to extend this to
    regular SMP balancing as well in the future, which will simplify and
    speed up things a bit.

    Other changes involve ongoing preparatory work to extend NOHZ to the
    scheduler as well, eventually allowing completely irq-free user-space
    execution."

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
    Revert "sched/autogroup: Fix crash on reboot when autogroup is disabled"
    cputime: Comment cputime's adjusting code
    cputime: Consolidate cputime adjustment code
    cputime: Rename thread_group_times to thread_group_cputime_adjusted
    cputime: Move thread_group_cputime() to sched code
    vtime: Warn if irqs aren't disabled on system time accounting APIs
    vtime: No need to disable irqs on vtime_account()
    vtime: Consolidate a bit the ctx switch code
    vtime: Explicitly account pending user time on process tick
    vtime: Remove the underscore prefix invasion
    sched/autogroup: Fix crash on reboot when autogroup is disabled
    cputime: Separate irqtime accounting from generic vtime
    cputime: Specialize irq vtime hooks
    kvm: Directly account vtime to system on guest switch
    vtime: Make vtime_account_system() irqsafe
    vtime: Gather vtime declarations to their own header file
    sched: Describe CFS load-balancer
    sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking
    sched: Make __update_entity_runnable_avg() fast
    sched: Update_cfs_shares at period edge
    ...

    Linus Torvalds
     
  • The maximum oom_score_adj is 1000 and the minimum oom_score_adj is -1000,
    so this range can be represented by the signed short type with no
    functional change. The extra space this frees up in struct signal_struct
    will be used for per-thread oom kill flags in the next patch.

    Signed-off-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Reviewed-by: Michal Hocko
    Cc: Anton Vorontsov
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

11 Dec, 2012

1 commit


08 Dec, 2012

1 commit