30 Apr, 2013

3 commits

  • Pull core timer updates from Ingo Molnar:
    "The main changes in this cycle's merge are:

    - Implement shadow timekeeper to shorten in kernel reader side
    blocking, by Thomas Gleixner.

    - Posix timers enhancements by Pavel Emelyanov:

    - allocate timer ID per process, so that exact timer ID allocations
    can be re-created by checkpoint/restore code.

    - debuggability and tooling (/proc/PID/timers, etc.) improvements.

    - suspend/resume enhancements by Feng Tang: on certain new Intel Atom
    processors (Penwell and Cloverview), there is a feature that the
    TSC won't stop in S3 state, so the TSC value won't be reset to 0
    after resume. The generic timekeeping code can take advantage of this via
    the CLOCK_SOURCE_SUSPEND_NONSTOP flag: instead of using the RTC to
    recover/approximate sleep time, the main (and precise) clocksource
    can be used.

    - Fix /proc/timer_list for 4096 CPUs by Nathan Zimmer: on so many
    CPUs the file goes beyond 4MB of size and thus the current
    simplistic seqfile approach fails. Convert /proc/timer_list to a
    proper seq_file with its own iterator.

    - Cleanups and refactorings of the core timekeeping code by John
    Stultz.

    - International Atomic Clock time is managed by the NTP code
    internally currently but not exposed externally. Separate the TAI
    code out and add CLOCK_TAI support and TAI support to the hrtimer
    and posix-timer code, by John Stultz.

    - Add deep idle support enhancement to the broadcast clockevents core
    timer code, by Daniel Lezcano: add an opt-in CLOCK_EVT_FEAT_DYNIRQ
    clockevents feature (which will be utilized by future clockevents
    driver updates), which allows the use of IRQ affinities to avoid
    spurious wakeups of idle CPUs - the right CPU with an expiring
    timer will be woken.

    - Add new ARM bcm281xx clocksource driver, by Christian Daudt

    - ... various other fixes and cleanups"

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
    clockevents: Set dummy handler on CPU_DEAD shutdown
    timekeeping: Update tk->cycle_last in resume
    posix-timers: Remove unused variable
    clockevents: Switch into oneshot mode even if broadcast registered late
    timer_list: Convert timer list to be a proper seq_file
    timer_list: Split timer_list_show_tickdevices
    posix-timers: Show sigevent info in proc file
    posix-timers: Introduce /proc/PID/timers file
    posix timers: Allocate timer id per process (v2)
    timekeeping: Make sure to notify hrtimers when TAI offset changes
    hrtimer: Fix ktime_add_ns() overflow on 32bit architectures
    hrtimer: Add expiry time overflow check in hrtimer_interrupt
    timekeeping: Shorten seq_count region
    timekeeping: Implement a shadow timekeeper
    timekeeping: Delay update of clock->cycle_last
    timekeeping: Store cycle_last value in timekeeper struct as well
    ntp: Remove ntp_lock, using the timekeeping locks to protect ntp state
    timekeeping: Simplify tai updating from do_adjtimex
    timekeeping: Hold timekeepering locks in do_adjtimex and hardpps
    timekeeping: Move ADJ_SETOFFSET to top level do_adjtimex()
    ...

    Linus Torvalds
     
  • Saves an ifdef, no code size changes

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Now get_vmalloc_info() is in fs/proc/mmu.c. There is no reason that this
    code must be here: its implementation needs vmlist_lock and iterates
    over vmlist, which may be an internal data structure of vmalloc.

    It is preferable that vmlist_lock and vmlist are only used in vmalloc.c
    for maintainability, so move the code to vmalloc.c.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Joonsoo Kim
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Atsushi Kumagai
    Cc: Chris Metcalf
    Cc: Dave Anderson
    Cc: Eric Biederman
    Cc: Guan Xuetao
    Cc: Ingo Molnar
    Cc: Vivek Goyal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

25 Apr, 2013

1 commit


18 Apr, 2013

2 commits

  • The previous patch added a proc file that lists the posix timers created
    by a task. Expand the information provided in this file by adding info
    about the notification method with which each timer was created. That is,
    the "ID:" line is followed by

    1. "signal:" line, that shows signal number and sigval bits;
    2. "notify:" line, that shows the timer notification method.

    Thus the timer entry would look like this:

    ID: 123
    signal: 14/0000000000b005d0
    notify: signal/pid.732

    This information is enough to understand how timer_create() was called
    for each particular timer.
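
    As a rough illustration of how an external tool might consume this
    format, the sketch below parses text shaped like the example above into
    one record per timer. The function name parse_timers and the dictionary
    layout are assumptions made for illustration; only the "ID:", "signal:"
    and "notify:" lines described here are interpreted, and any other line is
    kept as a raw key/value pair.

```python
def parse_timers(text):
    """Parse /proc/PID/timers-style text into a list of dicts, one per timer."""
    timers = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or ":" not in line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "ID":
            # Each "ID:" line starts a new timer entry.
            timers.append({"id": int(value)})
        elif timers:
            if key == "signal":
                # "signal:" holds "<signal number>/<sigval bits in hex>".
                signo, _, sigval = value.partition("/")
                timers[-1]["signo"] = int(signo)
                timers[-1]["sigval"] = int(sigval, 16)
            else:
                # Keep other lines verbatim so future fields are not lost.
                timers[-1][key] = value
    return timers


# Sample text copied from the entry shown in this commit message.
sample = """\
ID: 123
signal: 14/0000000000b005d0
notify: signal/pid.732
"""
print(parse_timers(sample))
```

    Keeping unknown keys as plain strings is a deliberate choice here, since
    the format is meant to grow by adding lines per timer.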

    Signed-off-by: Pavel Emelyanov
    Cc: Peter Zijlstra
    Cc: Michael Kerrisk
    Cc: Matthew Helsley
    Link: http://lkml.kernel.org/r/513DA024.80404@parallels.com
    Signed-off-by: Thomas Gleixner

    Pavel Emelyanov
     
  • Currently the kernel doesn't provide any API for getting info about what
    posix timers are configured by processes. It's implied that a process
    which configured some timers knows what it did. However, external
    tools cannot get this information. In particular, the
    checkpoint-restore project needs this info.

    Introduce a per-pid proc file with information about posix
    timers. Since these timers are shared between threads, this file is
    present on tgid level only, no such thing in tid subdirs.

    The file format is expected to be "/proc/<pid>/smaps"-like,
    i.e. each timer will occupy several lines to allow for future
    extension.

    Each new timer entry starts with the

    ID:

    line which is added by this patch.

    Signed-off-by: Pavel Emelyanov
    Cc: Peter Zijlstra
    Cc: Michael Kerrisk
    Cc: Matthew Helsley
    Link: http://lkml.kernel.org/r/513DA00D.6070009@parallels.com
    Signed-off-by: Thomas Gleixner

    Pavel Emelyanov
     

12 Apr, 2013

1 commit

  • The smpboot threads rely on the park/unpark mechanism which binds per
    cpu threads on a particular core. Though the functionality is racy:

    CPU0                    CPU1                    CPU2
    unpark(T)                                       wake_up_process(T)
      clear(SHOULD_PARK)    T runs
                            leave parkme() due to !SHOULD_PARK
      bind_to(CPU2)         BUG_ON(wrong CPU)

    We cannot let the tasks move themselves to the target CPU as one of
    those tasks is actually the migration thread itself, which requires
    that it starts running on the target cpu right away.

    The solution to this problem is to prevent wakeups in park mode which
    are not from unpark(). That way we can guarantee that the association
    of the task to the target cpu is working correctly.

    Add a new task state (TASK_PARKED) which prevents other wakeups and
    use this state explicitly for the unpark wakeup.
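
    The effect of a dedicated state can be illustrated with a toy model of
    state-masked wakeups: a wakeup only succeeds if the sleeper's current
    state is in the mask the waker passes. This is a simplified sketch in
    Python, not kernel code; the numeric values and the Task class are
    stand-ins chosen for illustration.

```python
# Hypothetical bit values; the real kernel constants differ.
TASK_RUNNING = 0
TASK_INTERRUPTIBLE = 1
TASK_UNINTERRUPTIBLE = 2
TASK_PARKED = 4
TASK_NORMAL = TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE


class Task:
    def __init__(self, state):
        self.state = state


def wake_up_state(task, mask):
    """Wake the task only if its current state is included in the mask."""
    if task.state & mask:
        task.state = TASK_RUNNING
        return True
    return False


def wake_up_process(task):
    # Ordinary wakeups only target the normal sleeping states.
    return wake_up_state(task, TASK_NORMAL)


parked = Task(TASK_PARKED)
stray = wake_up_process(parked)                # stray wakeup from another CPU
unparked = wake_up_state(parked, TASK_PARKED)  # the explicit unpark wakeup
print(stray, unparked)
```

    In this model the stray wakeup leaves the task parked, and only the
    wakeup that names the parked state explicitly gets it running again.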

    Peter noticed: Also, since the task state is visible to userspace and
    all the parked tasks are still in the PID space, it's a good hint in ps
    and friends that these tasks aren't really there for the moment.

    The migration thread has another related issue.

    CPU0                    CPU1
    Bring up CPU2
    create_thread(T)
    park(T)
     wait_for_completion()
                            parkme()
                            complete()
    sched_set_stop_task()
                            schedule(TASK_PARKED)

    The sched_set_stop_task() call is issued while the task is on the
    runqueue of CPU1 and that confuses the hell out of the stop_task class
    on that cpu. So we need the same synchronization before
    sched_set_stop_task().

    Reported-by: Dave Jones
    Reported-and-tested-by: Dave Hansen
    Reported-and-tested-by: Borislav Petkov
    Acked-by: Peter Zijlstra
    Cc: Srivatsa S. Bhat
    Cc: dhillf@gmail.com
    Cc: Ingo Molnar
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304091635430.21884@ionos
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

10 Apr, 2013

2 commits

  • Pull vfs fixes from Al Viro:
    "A nasty bug in fs/namespace.c caught by Andrey + a couple of less
    serious unpleasantness - ecryptfs misc device playing hopeless games
    with try_module_get() and palinfo procfs support being... not quite
    correctly done, to be polite."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    mnt: release locks on error path in do_loopback
    palinfo fixes
    procfs: add proc_remove_subtree()
    ecryptfs: close rmmod race

    Linus Torvalds
     
  • just what it sounds like; do that only to procfs subtrees you've
    created - doing that to something shared with another driver is
    not only antisocial, but might cause interesting races with
    proc_create() and its ilk.

    Signed-off-by: Al Viro

    Al Viro
     

29 Mar, 2013

1 commit

  • Pull userns fixes from Eric W Biederman:
    "The bulk of the changes are fixing the worst consequences of the user
    namespace design oversight in not considering what happens when one
    namespace starts off as a clone of another namespace, as happens with
    the mount namespace.

    The rest of the changes are just plain bug fixes.

    Many thanks to Andy Lutomirski for pointing out many of these issues."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    userns: Restrict when proc and sysfs can be mounted
    ipc: Restrict mounting the mqueue filesystem
    vfs: Carefully propogate mounts across user namespaces
    vfs: Add a mount flag to lock read only bind mounts
    userns: Don't allow creation if the user is chrooted
    yama: Better permission check for ptraceme
    pid: Handle the exit of a multi-threaded init.
    scm: Require CAP_SYS_ADMIN over the current pidns to spoof pids.

    Linus Torvalds
     

27 Mar, 2013

1 commit

  • Only allow unprivileged mounts of proc and sysfs if they are already
    mounted when the user namespace is created.

    proc and sysfs are interesting because they have content that is
    per namespace, and so fresh mounts are needed when new namespaces
    are created while at the same time proc and sysfs have content that
    is shared between every instance.

    Respect the policy of who may see the shared content of proc and sysfs
    by only allowing new mounts if there was an existing mount at the time
    the user namespace was created.

    In practice there are only two interesting cases: proc and sysfs are
    mounted at their usual places, or proc and sysfs are not mounted at all
    (some form of mount namespace jail).

    Cc: stable@vger.kernel.org
    Acked-by: Serge Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

23 Mar, 2013

1 commit

  • Dave Jones found another /proc issue with his Trinity tool: thanks to
    the namespace model, we can have multiple /proc dentries that point to
    the same inode, aliasing directories in /proc/<pid>/net/ for example.

    This ends up being a total disaster, because it acts like hardlinked
    directories, and causes locking problems. We rely on the topological
    sort of the inodes pointed to by dentries, and if we have aliased
    directories, that ordering becomes unreliable.

    In short: don't do this. Multiple dentries with the same (directory)
    inode is just a bad idea, and the namespace code should never have
    exposed things this way. But we're kind of stuck with it.

    This solves things by just always allocating a new inode during /proc
    dentry lookup, instead of using "iget_locked()" to look up existing
    inodes by superblock and number. That actually simplifies the code a bit,
    at the cost of potentially doing more inode [de]allocations.

    That said, the inode lookup wasn't free either (and did a lot of locking
    of inodes), so it is probably not that noticeable. We could easily keep
    the old lookup model for non-directory entries, but rather than try to
    be excessively clever this just implements the minimal and simplest
    workaround for the problem.

    Reported-and-tested-by: Dave Jones
    Analyzed-by: Al Viro
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

09 Mar, 2013

1 commit

  • Update proc_ns_follow_link to use nd_jump_link instead of just
    manually updating nd.path.dentry.

    This fixes the BUG_ON(nd->inode != parent->d_inode) reported by Dave
    Jones and reproduced trivially with mkdir /proc/self/ns/uts/a.

    Sigh, it looks like the VFS change to require use of nd_jump_link
    happened while proc_ns_follow_link was baking and since the common case
    of proc_ns_follow_link continued to work without problems the need for
    making this change was overlooked.

    Cc: stable@vger.kernel.org
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

28 Feb, 2013

3 commits

  • In read_vmcore() two `if' tests are duplicated. Changing their
    position reduces the duplication. This change does not affect the
    behaviour of the function.

    [akpm@linux-foundation.org: avoid `if (foo = bar)' thing, use min_t()]
    [akpm@linux-foundation.org: s/max_t/min_t/]
    Signed-off-by: Zhang Yanfei
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • - use pr_foo() throughout

    - remove a couple of duplicated KERN_WARNINGs, via WARN(KERN_WARNING "...")

    - nuke a few warnings which I've never seen happen, ever.

    Cc: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • The existing SUID_DUMP_* defines duplicate the newer SUID_DUMPABLE_*
    defines introduced in 54b501992dd2 ("coredump: warn about unsafe
    suid_dumpable / core_pattern combo"). Remove the new ones, and use the
    prior values instead.

    Signed-off-by: Kees Cook
    Reported-by: Chen Gang
    Cc: Alexander Viro
    Cc: Alan Cox
    Cc: "Eric W. Biederman"
    Cc: Doug Ledford
    Cc: Serge Hallyn
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

27 Feb, 2013

1 commit

  • Pull vfs pile (part one) from Al Viro:
    "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
    locking violations, etc.

    The most visible changes here are death of FS_REVAL_DOT (replaced with
    "has ->d_weak_revalidate()") and a new helper getting from struct file
    to inode. Some bits of preparation to xattr method interface changes.

    Misc patches by various people sent this cycle *and* ocfs2 fixes from
    several cycles ago that should've been upstream right then.

    PS: the next vfs pile will be xattr stuff."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
    saner proc_get_inode() calling conventions
    proc: avoid extra pde_put() in proc_fill_super()
    fs: change return values from -EACCES to -EPERM
    fs/exec.c: make bprm_mm_init() static
    ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
    ocfs2: fix possible use-after-free with AIO
    ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
    get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
    target: writev() on single-element vector is pointless
    export kernel_write(), convert open-coded instances
    fs: encode_fh: return FILEID_INVALID if invalid fid_type
    kill f_vfsmnt
    vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
    nfsd: handle vfs_getattr errors in acl protocol
    switch vfs_getattr() to struct path
    default SET_PERSONALITY() in linux/elf.h
    ceph: prepopulate inodes only when request is aborted
    d_hash_and_lookup(): export, switch open-coded instances
    9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
    9p: split dropping the acls from v9fs_set_create_acl()
    ...

    Linus Torvalds
     

26 Feb, 2013

4 commits

  • Make it drop the pde in *all* cases when no new reference to it is
    put into an inode - both when an inode had already been set up
    (as we were already doing) and when inode allocation has failed.
    Makes for simpler logic in callers...

    Signed-off-by: Al Viro

    Al Viro
     
  • If proc_get_inode() succeeded, but d_make_root() failed, pde_put() for
    proc_root will be called twice: the first time due to iput() called from
    d_make_root() and the second time directly in the end of
    proc_fill_super().

    Signed-off-by: Maxim Patlasov
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Maxim Patlasov
     
  • According to SUSv3:

    [EACCES] Permission denied. An attempt was made to access a file in a way
    forbidden by its file access permissions.

    [EPERM] Operation not permitted. An attempt was made to perform an operation
    limited to processes with appropriate privileges or to the owner of a file
    or other resource.

    So -EPERM should be returned if a capability check fails.

    Strictly speaking this is an API change since the error code user sees is
    altered.
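
    The distinction can be sketched as follows. This is an illustrative
    model only: check_access and its parameters are hypothetical names, not
    a kernel or library interface, and the function follows the kernel
    convention of returning a negative errno value on failure.

```python
import errno


def check_access(mode_allows, needs_capability, has_capability):
    """Return 0 on success or a negative errno value, kernel style."""
    if needs_capability and not has_capability:
        # The operation is reserved for privileged callers: EPERM.
        return -errno.EPERM
    if not mode_allows:
        # The file permission bits forbid this access: EACCES.
        return -errno.EACCES
    return 0


# A failed capability check reports EPERM even if the mode bits would allow it.
print(check_access(mode_allows=True, needs_capability=True, has_capability=False))
# A plain permission-bit denial reports EACCES.
print(check_access(mode_allows=False, needs_capability=False, has_capability=False))
```

    Callers that only test for a nonzero result are unaffected by this kind
    of change; callers that compare against a specific error code are the
    ones that notice it.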

    Signed-off-by: Zhao Hongjiang
    Acked-by: Jan Kara
    Acked-by: Steven Whitehouse
    Acked-by: Ian Kent
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Zhao Hongjiang
     
  • * calling conventions change - ERR_PTR() is returned on ->d_hash() errors;
    NULL is just for dcache miss now.
    * exported, open-coded instances in ncpfs and cifs converted.

    Signed-off-by: Al Viro

    Al Viro
     

24 Feb, 2013

2 commits

  • When I use several fast SSDs to do swap, swapper_space.tree_lock is
    heavily contended. This makes each swap partition have one
    address_space to reduce the lock contention. There is an array of
    address_space for swap. The swap entry type is the index to the array.

    In my test with 3 SSDs, this increases the swapout throughput by 20%.
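
    The data-structure change can be pictured with a small model: instead of
    one shared object with one lock, there is an array of them, indexed by
    swap type. The sketch below is a Python illustration under stated
    assumptions, not the kernel implementation; the AddressSpace class, the
    array size and the add_to_swap_cache signature are simplifications made
    up for the example.

```python
import threading

MAX_SWAPFILES = 4  # illustrative size only; not the kernel's actual limit


class AddressSpace:
    def __init__(self):
        self.tree_lock = threading.Lock()
        self.pages = {}


# One address_space per swap type instead of a single shared swapper_space.
swapper_spaces = [AddressSpace() for _ in range(MAX_SWAPFILES)]


def swap_address_space(swap_type):
    """Pick the address_space for a swap entry by its type."""
    return swapper_spaces[swap_type]


def add_to_swap_cache(swap_type, offset, page):
    space = swap_address_space(swap_type)
    with space.tree_lock:  # contention is limited to one swap device
        space.pages[offset] = page


add_to_swap_cache(0, 10, "page-a")
add_to_swap_cache(1, 10, "page-b")
print(swapper_spaces[0].pages, swapper_spaces[1].pages)
```

    Two swap devices can now insert at the same offset without touching the
    same lock, which is the contention the commit describes.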

    [akpm@linux-foundation.org: revert unneeded change to __add_to_swap_cache]
    Signed-off-by: Shaohua Li
    Cc: Hugh Dickins
    Acked-by: Rik van Riel
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • Since MCE is an x86 concept, and this code is in mm/, it would be better
    to use the name num_poisoned_pages instead of mce_bad_pages.

    [akpm@linux-foundation.org: fix mm/sparse.c]
    Signed-off-by: Xishi Qiu
    Signed-off-by: Jiang Liu
    Suggested-by: Borislav Petkov
    Reviewed-by: Wanpeng Li
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     

23 Feb, 2013

1 commit


22 Feb, 2013

1 commit

  • Pull tty/serial patches from Greg Kroah-Hartman:
    "Here's the big tty/serial driver patches for 3.9-rc1.

    More tty port rework and fixes from Jiri here, as well as lots of
    individual serial driver updates and fixes.

    All of these have been in the linux-next tree for a while."

    * tag 'tty-3.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (140 commits)
    tty: mxser: improve error handling in mxser_probe() and mxser_module_init()
    serial: imx: fix uninitialized variable warning
    serial: tegra: assume CONFIG_OF
    TTY: do not update atime/mtime on read/write
    lguest: select CONFIG_TTY to build properly.
    ARM defconfigs: add missing inclusions of linux/platform_device.h
    fb/exynos: include platform_device.h
    ARM: sa1100/assabet: include platform_device.h directly
    serial: imx: Fix recursive locking bug
    pps: Fix build breakage from decoupling pps from tty
    tty: Remove ancient hardpps()
    pps: Additional cleanups in uart_handle_dcd_change
    pps: Move timestamp read into PPS code proper
    pps: Don't crash the machine when exiting will do
    pps: Fix a use-after free bug when unregistering a source.
    pps: Use pps_lookup_dev to reduce ldisc coupling
    pps: Add pps_lookup_dev() function
    tty: serial: uartlite: Support uartlite on big and little endian systems
    tty: serial: uartlite: Fix sparse and checkpatch warnings
    serial/arc-uart: Miscll DT related updates (Grant's review comments)
    ...

    Fix up trivial conflicts, mostly just due to the TTY config option
    clashing with the EXPERIMENTAL removal.

    Linus Torvalds
     

21 Feb, 2013

1 commit

  • Pull networking update from David Miller:

    1) Checkpoint/restarted TCP sockets now can properly propagate the TCP
    timestamp offset. From Andrey Vagin.

    2) VMWARE VM VSOCK layer, from Andy King.

    3) Much improved support for virtual functions and SR-IOV in bnx2x,
    from Ariel Elior.

    4) All protocols on ipv4 and ipv6 are now network namespace aware, and
    all the compatibility checks for initial-namespace-only protocols are
    removed. Thanks to Tom Parkin for helping deal with the last major
    holdout, L2TP.

    5) IPV6 support in netpoll and network namespace support in pktgen,
    from Cong Wang.

    6) Multiple Registration Protocol (MRP) and Multiple VLAN Registration
    Protocol (MVRP) support, from David Ward.

    7) Compute packet lengths more accurately in the packet scheduler, from
    Eric Dumazet.

    8) Use per-task page fragment allocator in skb_append_datato_frags(),
    also from Eric Dumazet.

    9) Add support for connection tracking labels in netfilter, from
    Florian Westphal.

    10) Fix default multicast group joining on ipv6, and add anti-spoofing
    checks to 6to4 and 6rd. From Hannes Frederic Sowa.

    11) Make ipv4/ipv6 fragmentation memory limits more reasonable in modern
    times, rearrange inet frag datastructures for better cacheline
    locality, and move more operations outside of locking. From Jesper
    Dangaard Brouer.

    12) Instead of strict master slave relationships, allow arbitrary
    scenarios with "upper device lists". From Jiri Pirko.

    13) Improve rate limiting accuracy in TBF and act_police, also from Jiri
    Pirko.

    14) Add a BPF filter netfilter match target, from Willem de Bruijn.

    15) Orphan and delete a bunch of pre-historic networking drivers from
    Paul Gortmaker.

    16) Add TSO support for GRE tunnels, from Pravin B Shelar. Although
    this still needs some minor bug fixing before it's 100% correct in
    all cases.

    17) Handle unresolved IPSEC states like ARP, with a resolution packet
    queue. From Steffen Klassert.

    18) Remove TCP Appropriate Byte Count support (ABC), from Stephen
    Hemminger. This was long overdue.

    19) Support SO_REUSEPORT, from Tom Herbert.

    20) Allow locking a socket BPF filter, so that it cannot change after a
    process drops capabilities.

    21) Add VLAN filtering to bridge, from Vlad Yasevich.

    22) Bring ipv6 on-par with ipv4 and do not cache neighbour entries in
    the ipv6 routes, from YOSHIFUJI Hideaki.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1538 commits)
    ipv6: fix race condition regarding dst->expires and dst->from.
    net: fix a wrong assignment in skb_split()
    ip_gre: remove an extra dst_release()
    ppp: set qdisc_tx_busylock to avoid LOCKDEP splat
    atl1c: restore buffer state
    net: fix a build failure when !CONFIG_PROC_FS
    net: ipv4: fix waring -Wunused-variable
    net: proc: fix build failed when procfs is not configured
    Revert "xen: netback: remove redundant xenvif_put"
    net: move procfs code to net/core/net-procfs.c
    qmi_wwan, cdc-ether: add ADU960S
    bonding: set sysfs device_type to 'bond'
    bonding: fix bond_release_all inconsistencies
    b44: use netdev_alloc_skb_ip_align()
    xen: netback: remove redundant xenvif_put
    net: fec: Do a sanity check on the gpio number
    ip_gre: propogate target device GSO capability to the tunnel device
    ip_gre: allow CSUM capable devices to handle packets
    bonding: Fix initialize after use for 3ad machine state spinlock
    bonding: Fix race condition between bond_enslave() and bond_3ad_update_lacp_rate()
    ...

    Linus Torvalds
     

19 Feb, 2013

2 commits


28 Jan, 2013

1 commit

  • This is in preparation for the full dynticks feature. While
    remotely reading the cputime of a task running in a full
    dynticks CPU, we'll need to do some extra computation. This
    way we can account the time it spent tickless in userspace
    since its last cputime snapshot.

    Signed-off-by: Frederic Weisbecker
    Cc: Andrew Morton
    Cc: Ingo Molnar
    Cc: Li Zhong
    Cc: Namhyung Kim
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
     

19 Jan, 2013

1 commit

  • The option allows you to remove TTY and compile without errors. This
    saves space on systems that won't support TTY interfaces anyway.
    bloat-o-meter output is below.

    The bulk of this patch consists of Kconfig changes adding "depends on
    TTY" to various serial devices and similar drivers that require the TTY
    layer. Ideally, these dependencies would occur on a common intermediate
    symbol such as SERIO, but most drivers "select SERIO" rather than
    "depends on SERIO", and "select" does not respect dependencies.

    bloat-o-meter output comparing our previous minimal to new minimal by
    removing TTY. The list is filtered to not show removed entries with awk
    '$3 != "-"' as the list was very long.

    add/remove: 0/226 grow/shrink: 2/14 up/down: 6/-35356 (-35350)
    function old new delta
    chr_dev_init 166 170 +4
    allow_signal 80 82 +2
    static.__warned 143 142 -1
    disallow_signal 63 62 -1
    __set_special_pids 95 94 -1
    unregister_console 126 121 -5
    start_kernel 546 541 -5
    register_console 593 588 -5
    copy_from_user 45 40 -5
    sys_setsid 128 120 -8
    sys_vhangup 32 19 -13
    do_exit 1543 1526 -17
    bitmap_zero 60 40 -20
    arch_local_irq_save 137 117 -20
    release_task 674 652 -22
    static.spin_unlock_irqrestore 308 260 -48
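
    As a quick sanity check on the summary line above, the figure in
    parentheses is the net of the "up" and "down" byte counts. The snippet
    below is only a worked example of that arithmetic; the regular
    expression is an assumption about the layout of this one line, not a
    general bloat-o-meter parser.

```python
import re

# Summary line copied from the bloat-o-meter output above.
summary = "add/remove: 0/226 grow/shrink: 2/14 up/down: 6/-35356 (-35350)"

match = re.search(r"up/down: (-?\d+)/(-?\d+) \((-?\d+)\)", summary)
up, down, total = (int(group) for group in match.groups())

# The parenthesised figure is the net change: bytes added plus bytes removed.
print(up + down == total)
```

    The per-function rows listed here will not add up to that total, because
    the removed entries were filtered out of the list.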

    Signed-off-by: Joe Millenbach
    Reviewed-by: Jamey Sharp
    Reviewed-by: Josh Triplett
    Signed-off-by: Greg Kroah-Hartman

    Joe Millenbach
     

03 Jan, 2013

1 commit


26 Dec, 2012

1 commit

  • While testing the pid namespace code I hit this nasty warning.

    [ 176.262617] ------------[ cut here ]------------
    [ 176.263388] WARNING: at /home/eric/projects/linux/linux-userns-devel/kernel/softirq.c:160 local_bh_enable_ip+0x7a/0xa0()
    [ 176.265145] Hardware name: Bochs
    [ 176.265677] Modules linked in:
    [ 176.266341] Pid: 742, comm: bash Not tainted 3.7.0userns+ #18
    [ 176.266564] Call Trace:
    [ 176.266564] [] warn_slowpath_common+0x7f/0xc0
    [ 176.266564] [] warn_slowpath_null+0x1a/0x20
    [ 176.266564] [] local_bh_enable_ip+0x7a/0xa0
    [ 176.266564] [] _raw_spin_unlock_bh+0x19/0x20
    [ 176.266564] [] proc_free_inum+0x3a/0x50
    [ 176.266564] [] free_pid_ns+0x1c/0x80
    [ 176.266564] [] put_pid_ns+0x35/0x50
    [ 176.266564] [] put_pid+0x4a/0x60
    [ 176.266564] [] tty_ioctl+0x717/0xc10
    [ 176.266564] [] ? wait_consider_task+0x855/0xb90
    [ 176.266564] [] ? default_spin_lock_flags+0x9/0x10
    [ 176.266564] [] ? remove_wait_queue+0x5a/0x70
    [ 176.266564] [] do_vfs_ioctl+0x98/0x550
    [ 176.266564] [] ? recalc_sigpending+0x1f/0x60
    [ 176.266564] [] ? __set_task_blocked+0x37/0x80
    [ 176.266564] [] ? sys_wait4+0xab/0xf0
    [ 176.266564] [] sys_ioctl+0x91/0xb0
    [ 176.266564] [] ? task_stopped_code+0x50/0x50
    [ 176.266564] [] system_call_fastpath+0x16/0x1b
    [ 176.266564] ---[ end trace 387af88219ad6143 ]---

    It turns out that spin_unlock_bh(proc_inum_lock) is not safe when
    put_pid is called with another spinlock held and irqs disabled.

    For now take the easy path and use spin_lock_irqsave(proc_inum_lock)
    in proc_free_inum and spin_lock_irq(proc_inum_lock) in proc_alloc_inum.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

21 Dec, 2012

3 commits

  • Merge the rest of Andrew's patches for -rc1:
    "A bunch of fixes and misc missed-out-on things.

    That'll do for -rc1. I still have a batch of IPC patches which still
    have a possible bug report which I'm chasing down."

    * emailed patches from Andrew Morton: (25 commits)
    keys: use keyring_alloc() to create module signing keyring
    keys: fix unreachable code
    sendfile: allows bypassing of notifier events
    SGI-XP: handle non-fatal traps
    fat: fix incorrect function comment
    Documentation: ABI: remove testing/sysfs-devices-node
    proc: fix inconsistent lock state
    linux/kernel.h: fix DIV_ROUND_CLOSEST with unsigned divisors
    memcg: don't register hotcpu notifier from ->css_alloc()
    checkpatch: warn on uapi #includes that #include
    mm: cma: WARN if freed memory is still in use
    exec: do not leave bprm->interp on stack
    ...

    Linus Torvalds
     
  • Lockdep found an inconsistent lock state when rcu is processing delayed
    work in softirq. Currently, the kernel is using spin_lock/spin_unlock to
    protect proc_inum_ida, but proc_free_inum is called by rcu in softirq
    context.

    Use spin_lock_bh/spin_unlock_bh to fix the following lockdep warning.

    =================================
    [ INFO: inconsistent lock state ]
    3.7.0 #36 Not tainted
    ---------------------------------
    inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
    swapper/1/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
    (proc_inum_lock){+.?...}, at: proc_free_inum+0x1c/0x50
    {SOFTIRQ-ON-W} state was registered at:
    __lock_acquire+0x8ae/0xca0
    lock_acquire+0x199/0x200
    _raw_spin_lock+0x41/0x50
    proc_alloc_inum+0x4c/0xd0
    alloc_mnt_ns+0x49/0xc0
    create_mnt_ns+0x25/0x70
    mnt_init+0x161/0x1c7
    vfs_caches_init+0x107/0x11a
    start_kernel+0x348/0x38c
    x86_64_start_reservations+0x131/0x136
    x86_64_start_kernel+0x103/0x112
    irq event stamp: 2993422
    hardirqs last enabled at (2993422): _raw_spin_unlock_irqrestore+0x55/0x80
    hardirqs last disabled at (2993421): _raw_spin_lock_irqsave+0x29/0x70
    softirqs last enabled at (2993394): _local_bh_enable+0x13/0x20
    softirqs last disabled at (2993395): call_softirq+0x1c/0x30

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(proc_inum_lock);

    lock(proc_inum_lock);

    *** DEADLOCK ***

    no locks held by swapper/1/0.

    stack backtrace:
    Pid: 0, comm: swapper/1 Not tainted 3.7.0 #36
    Call Trace:
    [] ? vprintk_emit+0x471/0x510
    print_usage_bug+0x2a5/0x2c0
    mark_lock+0x33b/0x5e0
    __lock_acquire+0x813/0xca0
    lock_acquire+0x199/0x200
    _raw_spin_lock+0x41/0x50
    proc_free_inum+0x1c/0x50
    free_pid_ns+0x1c/0x50
    put_pid_ns+0x2e/0x50
    put_pid+0x4a/0x60
    delayed_put_pid+0x12/0x20
    rcu_process_callbacks+0x462/0x790
    __do_softirq+0x1b4/0x3b0
    call_softirq+0x1c/0x30
    do_softirq+0x59/0xd0
    irq_exit+0x54/0xd0
    smp_apic_timer_interrupt+0x95/0xa3
    apic_timer_interrupt+0x72/0x80
    cpuidle_enter_tk+0x10/0x20
    cpuidle_enter_state+0x17/0x50
    cpuidle_idle_call+0x287/0x520
    cpu_idle+0xba/0x130
    start_secondary+0x2b3/0x2bc

    Signed-off-by: Xiaotian Feng
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiaotian Feng
     
  • Removed vmtruncate

    Signed-off-by: Marco Stornelli
    Signed-off-by: Al Viro

    Marco Stornelli
     

18 Dec, 2012

5 commits

  • Merge misc patches from Andrew Morton:
    "Incoming:

    - lots of misc stuff

    - backlight tree updates

    - lib/ updates

    - Oleg's percpu-rwsem changes

    - checkpatch

    - rtc

    - aoe

    - more checkpoint/restart support

    I still have a pile of MM stuff pending - Pekka should be merging
    later today after which that is good to go. A number of other things
    are twiddling thumbs awaiting maintainer merges."

    * emailed patches from Andrew Morton: (180 commits)
    scatterlist: don't BUG when we can trivially return a proper error.
    docs: update documentation about /proc/<pid>/fdinfo/<fd> fanotify output
    fs, fanotify: add @mflags field to fanotify output
    docs: add documentation about /proc//fdinfo/ output
    fs, notify: add procfs fdinfo helper
    fs, exportfs: add exportfs_encode_inode_fh() helper
    fs, exportfs: escape nil dereference if no s_export_op present
    fs, epoll: add procfs fdinfo helper
    fs, eventfd: add procfs fdinfo helper
    procfs: add ability to plug in auxiliary fdinfo providers
    tools/testing/selftests/kcmp/kcmp_test.c: print reason for failure in kcmp_test
    breakpoint selftests: print failure status instead of cause make error
    kcmp selftests: print fail status instead of cause make error
    kcmp selftests: make run_tests fix
    mem-hotplug selftests: print failure status instead of cause make error
    cpu-hotplug selftests: print failure status instead of cause make error
    mqueue selftests: print failure status instead of cause make error
    vm selftests: print failure status instead of cause make error
    ubifs: use prandom_bytes
    mtd: nandsim: use prandom_bytes
    ...

    Linus Torvalds
     
  • This allows us to print out the eventpoll target file descriptor, events
    and data. The /proc/pid/fdinfo/fd output consists of

    | pos: 0
    | flags: 02
    | tfd: 5 events: 1d data: ffffffffffffffff enabled: 1
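
A minimal sketch of how a user-space tool might read the output shown above. It assumes that `tfd` is decimal and that `events` and `data` are hexadecimal, as the sample suggests. The function name `parse_epoll_fdinfo` and the local `EPOLL_BITS` table are illustrative and not part of the patch.

```python
import re

# Standard Linux epoll event bits, defined locally so the sketch runs anywhere.
EPOLL_BITS = {0x001: "EPOLLIN", 0x004: "EPOLLOUT", 0x008: "EPOLLERR", 0x010: "EPOLLHUP"}


def parse_epoll_fdinfo(text):
    """Split fdinfo text into a header dict and a list of epoll targets."""
    header, targets = {}, []
    for raw in text.splitlines():
        line = raw.strip().lstrip("|").strip()  # tolerate the "| " quoting used above
        if not line:
            continue
        if line.startswith("tfd:"):
            fields = dict(re.findall(r"(\w+):\s*(\S+)", line))
            events = int(fields["events"], 16)  # assumption: hexadecimal
            targets.append({
                "tfd": int(fields["tfd"]),      # assumption: decimal
                "events": events,
                "event_names": [n for bit, n in sorted(EPOLL_BITS.items()) if events & bit],
                "data": int(fields["data"], 16),
            })
        else:
            key, _, value = line.partition(":")
            header[key.strip()] = value.strip()
    return header, targets


SAMPLE = """\
| pos: 0
| flags: 02
| tfd: 5 events: 1d data: ffffffffffffffff enabled: 1
"""

header, targets = parse_epoll_fdinfo(SAMPLE)
```

A checkpoint tool could use the parsed `tfd`, `events` and `data` values to re-add each target with epoll_ctl() on restore.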

    [avagin@: fix for uninitialized ret variable]

    Signed-off-by: Cyrill Gorcunov
    Acked-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Andrey Vagin
    Cc: Al Viro
    Cc: Alexey Dobriyan
    Cc: James Bottomley
    Cc: "Aneesh Kumar K.V"
    Cc: Alexey Dobriyan
    Cc: Matthew Helsley
    Cc: "J. Bruce Fields"
    Cc: "Aneesh Kumar K.V"
    Cc: Tvrtko Ursulin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • This patch adds the ability to print out auxiliary data associated with
    a file in the procfs interface /proc/pid/fdinfo/fd.

    In particular, further patches make eventfd, eventpoll, signalfd and
    fsnotify print additional information complete enough to restore
    these objects after checkpoint.

    To simplify the code, we add a show_fdinfo callback inside struct
    file_operations (as Al and Pavel are proposing).

    Signed-off-by: Cyrill Gorcunov
    Acked-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Andrey Vagin
    Cc: Al Viro
    Cc: Alexey Dobriyan
    Cc: James Bottomley
    Cc: "Aneesh Kumar K.V"
    Cc: Alexey Dobriyan
    Cc: Matthew Helsley
    Cc: "J. Bruce Fields"
    Cc: "Aneesh Kumar K.V"
    Cc: Tvrtko Ursulin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • We display a list of supplementary groups for each process in
    /proc/<pid>/status. However, we show only the first 32 groups, not all of
    them.

    Although this is rare, processes sometimes do have more than 32
    supplementary groups, and this kernel limitation breaks user-space apps
    that rely on the group list in /proc/<pid>/status.

    Number 32 comes from the internal NGROUPS_SMALL macro which defines the
    length for the internal kernel "small" groups buffer. There is no
    apparent reason to limit to this value.

    This patch removes the 32 groups printing limit.

    The Linux kernel limits the number of supplementary groups to NGROUPS_MAX,
    which is currently set to 65536. This is the maximum number of groups
    we may possibly print.
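
As an illustration of the kind of user-space consumer affected, here is a small sketch that extracts the group list from status text. It assumes the list appears on a line starting with "Groups:" followed by whitespace-separated numeric IDs. The helper name `parse_groups` and the synthetic sample are hypothetical.

```python
def parse_groups(status_text):
    """Return the supplementary group IDs from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("Groups:"):
            # Everything after the colon is a whitespace-separated list of GIDs.
            return [int(gid) for gid in line.split(":", 1)[1].split()]
    return []


# Synthetic status text with 40 groups, more than the old 32-entry limit.
sample_status = "Name:\tdemo\nGroups:\t" + " ".join(str(1000 + i) for i in range(40)) + "\n"
groups = parse_groups(sample_status)
```

With the 32-group printing limit in place, a parser like this would silently see a truncated list for the process in this sample.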

    Signed-off-by: Artem Bityutskiy
    Acked-by: Serge E. Hallyn
    Acked-by: Kees Cook
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Artem Bityutskiy
     
  • It is currently impossible to examine the state of seccomp for a given
    process. While attaching with gdb and attempting "call
    prctl(PR_GET_SECCOMP,...)" will work in some situations, it is not
    reliable. If the process is in seccomp mode 1, this query will kill the
    process (prctl not allowed). If the process is in mode 2 with prctl not
    allowed, it will similarly be killed. And in weird cases, if prctl is
    filtered to return errno 0, it can look like seccomp is disabled.

    When reviewing the state of running processes, there should be a way to
    externally examine the seccomp mode. ("Did this build of Chrome end up
    using seccomp?" "Did my distro ship ssh with seccomp enabled?")

    This adds the "Seccomp" line to /proc/$pid/status.
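
A minimal sketch of how the new line could be examined externally, without attaching to the process. It assumes the line has the form "Seccomp:" followed by the numeric mode (0 for disabled, 1 and 2 for the modes described above). The helper names `seccomp_mode` and `read_status` are illustrative only.

```python
SECCOMP_MODE_NAMES = {0: "disabled", 1: "strict", 2: "filter"}


def seccomp_mode(status_text):
    """Return the numeric seccomp mode from status text, or None if absent."""
    for line in status_text.splitlines():
        if line.startswith("Seccomp:"):
            return int(line.split(":", 1)[1].strip())
    return None  # kernels without this patch have no such line


def read_status(pid):
    """Read /proc/<pid>/status; only works on Linux with procfs mounted."""
    with open(f"/proc/{pid}/status") as f:
        return f.read()


# Synthetic example so the sketch runs without a live process.
sample = "Name:\tsshd\nState:\tS (sleeping)\nSeccomp:\t2\n"
mode = seccomp_mode(sample)
mode_name = SECCOMP_MODE_NAMES.get(mode, "unknown")
```

On a live system, `seccomp_mode(read_status(pid))` answers the "did this build end up using seccomp?" question from outside the target process.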

    Signed-off-by: Kees Cook
    Reviewed-by: Cyrill Gorcunov
    Cc: Andrea Arcangeli
    Cc: James Morris
    Acked-by: Serge E. Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook