15 Dec, 2011

1 commit


09 Dec, 2011

3 commits

  • Since commit a25cac5198d4 ("proc: Consider NO_HZ when printing idle and
    iowait times") we are reporting idle/io_wait time also while a CPU is
    tickless. We rely on get_{idle,iowait}_time functions to retrieve
    proper data.

    These functions, however, use usecs_to_cputime to translate micro
    seconds time to cputime64_t. This is just an alias to usecs_to_jiffies
    which reduces the data type from u64 to unsigned int and also checks
    whether the given parameter overflows jiffies_to_usecs(MAX_JIFFY_OFFSET)
    and returns MAX_JIFFY_OFFSET in that case.

    When we overflow depends on CONFIG_HZ but especially for CONFIG_HZ_300
    it is quite low (1431649781) so we are getting MAX_JIFFY_OFFSET for
    >3000s! until we overflow unsigned int. Just for reference
    CONFIG_HZ_100 has an overflow window around 20s, CONFIG_HZ_250 ~8s and
    CONFIG_HZ_1000 ~2s.

    This results in a bug when people saw [h]top going mad reporting 100%
    CPU usage even though there was basically no CPU load. The reason was
    simply that /proc/stat stopped reporting idle/io_wait changes (and
    reported MAX_JIFFY_OFFSET) and so the only change happening was for user
    system time.

    Let's use nsecs_to_jiffies64 instead which doesn't reduce the precision
    to 32b type and it is much more appropriate for cumulative time values
    (unlike usecs_to_jiffies which intended for timeout calculations).

    Signed-off-by: Michal Hocko
    Tested-by: Artem S. Tashkinov
    Cc: Dave Jones
    Cc: Arnd Bergmann
    Cc: Alexey Dobriyan
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Fix the error message "directives may not be used inside a macro argument"
    which appears when the kernel is compiled for the cris architecture.

    Signed-off-by: Claudio Scordino
    Cc: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Claudio Scordino
     
  • kern_mount() doesn't pair with plain mntput()...

    Signed-off-by: Al Viro

    Al Viro
     

10 Nov, 2011

1 commit

  • This reverts commit aa6afca5bcaba8101f3ea09d5c3e4100b2b9f0e5.

    It escalates of some of the google-chrome SELinux problems with ptrace
    ("Check failed: pid_ > 0. Did not find zygote process"), and Andrew
    says that it is also causing mystery lockdep reports.

    Reported-by: Alex Villacís Lasso
    Requested-by: James Morris
    Requested-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

07 Nov, 2011

1 commit

  • * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
    Revert "tracing: Include module.h in define_trace.h"
    irq: don't put module.h into irq.h for tracking irqgen modules.
    bluetooth: macroize two small inlines to avoid module.h
    ip_vs.h: fix implicit use of module_get/module_put from module.h
    nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
    include: replace linux/module.h with "struct module" wherever possible
    include: convert various register fcns to macros to avoid include chaining
    crypto.h: remove unused crypto_tfm_alg_modname() inline
    uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
    pm_runtime.h: explicitly requires notifier.h
    linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
    miscdevice.h: fix up implicit use of lists and types
    stop_machine.h: fix implicit use of smp.h for smp_processor_id
    of: fix implicit use of errno.h in include/linux/of.h
    of_platform.h: delete needless include
    acpi: remove module.h include from platform/aclinux.h
    miscdevice.h: delete unnecessary inclusion of module.h
    device_cgroup.h: delete needless include
    net: sch_generic remove redundant use of
    net: inet_timewait_sock doesnt need
    ...

    Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
    - drivers/media/dvb/frontends/dibx000_common.c
    - drivers/media/video/{mt9m111.c,ov6650.c}
    - drivers/mfd/ab3550-core.c
    - include/linux/dmaengine.h

    Linus Torvalds
     

03 Nov, 2011

4 commits

  • Says Andrew:

    "60 patches. That's good enough for -rc1 I guess. I have quite a lot
    of detritus to be rechecked, work through maintainers, etc.

    - most of the remains of MM
    - rtc
    - various misc
    - cgroups
    - memcg
    - cpusets
    - procfs
    - ipc
    - rapidio
    - sysctl
    - pps
    - w1
    - drivers/misc
    - aio"

    * akpm: (60 commits)
    memcg: replace ss->id_lock with a rwlock
    aio: allocate kiocbs in batches
    drivers/misc/vmw_balloon.c: fix typo in code comment
    drivers/misc/vmw_balloon.c: determine page allocation flag can_sleep outside loop
    w1: disable irqs in critical section
    drivers/w1/w1_int.c: multiple masters used same init_name
    drivers/power/ds2780_battery.c: fix deadlock upon insertion and removal
    drivers/power/ds2780_battery.c: add a nolock function to w1 interface
    drivers/power/ds2780_battery.c: create central point for calling w1 interface
    w1: ds2760 and ds2780, use ida for id and ida_simple_get() to get it
    pps gpio client: add missing dependency
    pps: new client driver using GPIO
    pps: default echo function
    include/linux/dma-mapping.h: add dma_zalloc_coherent()
    sysctl: make CONFIG_SYSCTL_SYSCALL default to n
    sysctl: add support for poll()
    RapidIO: documentation update
    drivers/net/rionet.c: fix ethernet address macros for LE platforms
    RapidIO: fix potential null deref in rio_setup_device()
    RapidIO: add mport driver for Tsi721 bridge
    ...

    Linus Torvalds
     
  • Adding support for poll() in sysctl fs allows userspace to receive
    notifications of changes in sysctl entries. This adds a infrastructure to
    allow files in sysctl fs to be pollable and implements it for hostname and
    domainname.

    [akpm@linux-foundation.org: s/declare/define/ for definitions]
    Signed-off-by: Lucas De Marchi
    Cc: Greg KH
    Cc: Kay Sievers
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lucas De Marchi
     
  • fd* files are restricted to the task's owner, and other users may not get
    direct access to them. But one may open any of these files and run any
    setuid program, keeping opened file descriptors. As there are permission
    checks on open(), but not on readdir() and read(), operations on the kept
    file descriptors will not be checked. It makes it possible to violate
    procfs permission model.

    Reading fdinfo/* may disclosure current fds' position and flags, reading
    directory contents of fdinfo/ and fd/ may disclosure the number of opened
    files by the target task. This information is not sensible per se, but it
    can reveal some private information (like length of a password stored in a
    file) under certain conditions.

    Used existing (un)lock_trace functions to check for ptrace_may_access(),
    but instead of using EPERM return code from it use EACCES to be consistent
    with existing proc_pid_follow_link()/proc_pid_readlink() return code. If
    they differ, attacker can guess what fds exist by analyzing stat() return
    code. Patched handlers: stat() for fd/*, stat() and read() for fdindo/*,
    readdir() and lookup() for fd/ and fdinfo/.

    Signed-off-by: Vasiliy Kulikov
    Cc: Cyrill Gorcunov
    Cc:
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasiliy Kulikov
     
  • On reading sysctl dirs we should return -EISDIR instead of -EINVAL.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: Cyrill Gorcunov
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     

02 Nov, 2011

2 commits


01 Nov, 2011

4 commits

  • Some kernel components pin user space memory (infiniband and perf) (by
    increasing the page count) and account that memory as "mlocked".

    The difference between mlocking and pinning is:

    A. mlocked pages are marked with PG_mlocked and are exempt from
    swapping. Page migration may move them around though.
    They are kept on a special LRU list.

    B. Pinned pages cannot be moved because something needs to
    directly access physical memory. They may not be on any
    LRU list.

    I recently saw an mlockalled process where mm->locked_vm became
    bigger than the virtual size of the process (!) because some
    memory was accounted for twice:

    Once when the page was mlocked and once when the Infiniband
    layer increased the refcount because it needt to pin the RDMA
    memory.

    This patch introduces a separate counter for pinned pages and
    accounts them seperately.

    Signed-off-by: Christoph Lameter
    Cc: Mike Marciniszyn
    Cc: Roland Dreier
    Cc: Sean Hefty
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This removes mm->oom_disable_count entirely since it's unnecessary and
    currently buggy. The counter was intended to be per-process but it's
    currently decremented in the exit path for each thread that exits, causing
    it to underflow.

    The count was originally intended to prevent oom killing threads that
    share memory with threads that cannot be killed since it doesn't lead to
    future memory freeing. The counter could be fixed to represent all
    threads sharing the same mm, but it's better to remove the count since:

    - it is possible that the OOM_DISABLE thread sharing memory with the
    victim is waiting on that thread to exit and will actually cause
    future memory freeing, and

    - there is no guarantee that a thread is disabled from oom killing just
    because another thread sharing its mm is oom disabled.

    Signed-off-by: David Rientjes
    Reported-by: Oleg Nesterov
    Reviewed-by: Oleg Nesterov
    Cc: Ying Han
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The display of the "huge" tag was accidentally removed in 29ea2f698 ("mm:
    use walk_page_range() instead of custom page table walking code").

    Reported-by: Stephen Hemminger
    Tested-by: Stephen Hemminger
    Reviewed-by: Stephen Wilson
    Cc: KOSAKI Motohiro
    Cc: Hugh Dickins
    Acked-by: David Rientjes
    Cc: Lee Schermerhorn
    Cc: Alexey Dobriyan
    Cc: Christoph Lameter
    Cc:
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • These files were getting via an implicit include
    path, but we want to crush those out of existence since they cost
    time during compiles of processing thousands of lines of headers
    for no reason. Give them the lightweight header that just contains
    the EXPORT_SYMBOL infrastructure.

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     

26 Oct, 2011

1 commit

  • * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
    time, s390: Get rid of compile warning
    dw_apb_timer: constify clocksource name
    time: Cleanup old CONFIG_GENERIC_TIME references that snuck in
    time: Change jiffies_to_clock_t() argument type to unsigned long
    alarmtimers: Fix error handling
    clocksource: Make watchdog reset lockless
    posix-cpu-timers: Cure SMP accounting oddities
    s390: Use direct ktime path for s390 clockevent device
    clockevents: Add direct ktime programming function
    clockevents: Make minimum delay adjustments configurable
    nohz: Remove "Switched to NOHz mode" debugging messages
    proc: Consider NO_HZ when printing idle and iowait times
    nohz: Make idle/iowait counter update conditional
    nohz: Fix update_ts_time_stat idle accounting
    cputime: Clean up cputime_to_usecs and usecs_to_cputime macros
    alarmtimers: Rework RTC device selection using class interface
    alarmtimers: Add try_to_cancel functionality
    alarmtimers: Add more refined alarm state tracking
    alarmtimers: Remove period from alarm structure
    alarmtimers: Remove interval cap limit hack
    ...

    Linus Torvalds
     

22 Sep, 2011

3 commits

  • This is modeled after the smaps code.

    It detects transparent hugepages and then does a single gather_stats()
    for the page as a whole. This has two benifits:
    1. It is more efficient since it does many pages in a single shot.
    2. It does not have to break down the huge page.

    Signed-off-by: Dave Hansen
    Acked-by: Hugh Dickins
    Acked-by: David Rientjes
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • gather_pte_stats() does a number of checks on a target page
    to see whether it should even be considered for statistics.
    This breaks that code out in to a separate function so that
    we can use it in the transparent hugepage case in the next
    patch.

    Signed-off-by: Dave Hansen
    Acked-by: Hugh Dickins
    Reviewed-by: Christoph Lameter
    Acked-by: David Rientjes
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • We need to teach the numa_maps code about transparent huge pages. The
    first step is to teach gather_stats() that the pte it is dealing with
    might represent more than one page.

    Note that will we use this in a moment for transparent huge pages since
    they have use a single pmd_t which _acts_ as a "surrogate" for a bunch
    of smaller pte_t's.

    I'm a _bit_ unhappy that this interface counts in hugetlbfs page sizes
    for hugetlbfs pages and PAGE_SIZE for normal pages. That means that to
    figure out how many _bytes_ "dirty=1" means, you must first know the
    hugetlbfs page size. That's easier said than done especially if you
    don't have visibility in to the mount.

    But, that's probably a discussion for another day especially since it
    would change behavior to fix it. But, just in case anyone wonders why
    this patch only passes a '1' in the hugetlb case...

    Signed-off-by: Dave Hansen
    Acked-by: Hugh Dickins
    Acked-by: David Rientjes
    Signed-off-by: Linus Torvalds

    Dave Hansen
     

08 Sep, 2011

1 commit

  • show_stat handler of the /proc/stat file relies on kstat_cpu(cpu)
    statistics when priting information about idle and iowait times.
    This is OK if we are not using tickless kernel (CONFIG_NO_HZ) because
    counters are updated periodically.
    With NO_HZ things got more tricky because we are not doing idle/iowait
    accounting while we are tickless so the value might get outdated.
    Users of /proc/stat will notice that by unchanged idle/iowait values
    which is then interpreted as 0% idle/iowait time. From the user space
    POV this is an unexpected behavior and a change of the interface.

    Let's fix this by using get_cpu_{idle,iowait}_time_us which accounts the
    total idle/iowait time since boot and it doesn't rely on sampling or any
    other periodic activity. Fall back to the previous behavior if NO_HZ is
    disabled or not configured.

    Signed-off-by: Michal Hocko
    Cc: Dave Jones
    Cc: Arnd Bergmann
    Cc: Alexey Dobriyan
    Link: http://lkml.kernel.org/r/39181366adac1b39cb6aa3cd53ff0f7c78d32676.1314172057.git.mhocko@suse.cz
    Signed-off-by: Thomas Gleixner

    Michal Hocko
     

07 Aug, 2011

2 commits

  • The CLOEXE bit is magical, and for performance (and semantic) reasons we
    don't actually maintain it in the file descriptor itself, but in a
    separate bit array. Which means that when we show f_flags, the CLOEXE
    status is shown incorrectly: we show the status not as it is now, but as
    it was when the file was opened.

    Fix that by looking up the bit properly in the 'fdt->close_on_exec' bit
    array.

    Uli needs this in order to re-implement the pfiles program:

    "For normal file descriptors (not sockets) this was the last piece of
    information which wasn't available. This is all part of my 'give
    Solaris users no reason to not switch' effort. I intend to offer the
    code to the util-linux-ng maintainers."

    Requested-by: Ulrich Drepper
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • WARN_ONCE() is very annoying, in that it shows the stack trace that we
    don't care about at all, and also triggers various user-level "kernel
    oopsed" logic that we really don't care about. And it's not like the
    user can do anything about the applications (sshd) in question, it's a
    distro issue.

    Requested-by: Andi Kleen (and many others)
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

28 Jul, 2011

1 commit

  • Since __proc_create() appends the name it is given to the end of the PDE
    structure that it allocates, there isn't a need to store a name pointer.
    Instead we can just replace the name pointer with a terminal char array of
    _unspecified_ length. The compiler will simply append the string to statically
    defined variables of PDE type overlapping any hole at the end of the structure
    and, unlike specifying an explicitly _zero_ length array, won't give a warning
    if you try to statically initialise it with a string of more than zero length.

    Also, whilst we're at it:

    (1) Move namelen to end just prior to name and reduce it to a single byte
    (name shouldn't be longer than NAME_MAX).

    (2) Move pde_unload_lock two places further on so that if it's four bytes in
    size on a 64-bit machine, it won't cause an unused hole in the PDE struct.

    Signed-off-by: David Howells
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Linus Torvalds

    David Howells
     

27 Jul, 2011

3 commits

  • This allows us to move duplicated code in
    (atomic_inc_not_zero() for now) to

    Signed-off-by: Arun Sharma
    Reviewed-by: Eric Dumazet
    Cc: Ingo Molnar
    Cc: David Miller
    Cc: Eric Dumazet
    Acked-by: Mike Frysinger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arun Sharma
     
  • If an inode's mode permits opening /proc/PID/io and the resulting file
    descriptor is kept across execve() of a setuid or similar binary, the
    ptrace_may_access() check tries to prevent using this fd against the
    task with escalated privileges.

    Unfortunately, there is a race in the check against execve(). If
    execve() is processed after the ptrace check, but before the actual io
    information gathering, io statistics will be gathered from the
    privileged process. At least in theory this might lead to gathering
    sensible information (like ssh/ftp password length) that wouldn't be
    available otherwise.

    Holding task->signal->cred_guard_mutex while gathering the io
    information should protect against the race.

    The order of locking is similar to the one inside of ptrace_attach():
    first goes cred_guard_mutex, then lock_task_sighand().

    Signed-off-by: Vasiliy Kulikov
    Cc: Al Viro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasiliy Kulikov
     
  • Change the return value to ENOENT. This return value is then returned
    when opening the proc entry that have been removed. For example,
    open("/proc/bus/pci/XX/YY") when the corresponding device is being
    hot-removed.

    Signed-off-by: Daisuke Ogino
    Cc: Jesse Barnes
    Acked-by: Alexey Dobriyan
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Ogino
     

26 Jul, 2011

1 commit

  • /proc/pid/oom_adj is deprecated and scheduled for removal in August 2012
    according to Documentation/feature-removal-schedule.txt.

    This patch makes the warning more verbose by making it appear as a more
    serious problem (the presence of a stack trace and being multiline should
    attract more attention) so that applications still using the old interface
    can get fixed.

    Very popular users of the old interface have been converted since the oom
    killer rewrite has been introduced. udevd switched to the
    /proc/pid/oom_score_adj interface for v162, kde switched in 4.6.1, and
    opensshd switched in 5.7p1.

    At the start of 2012, this should be changed into a WARN() to emit all
    such incidents and then finally remove the tunable in August 2012 as
    scheduled.

    Signed-off-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

23 Jul, 2011

2 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (107 commits)
    vfs: use ERR_CAST for err-ptr tossing in lookup_instantiate_filp
    isofs: Remove global fs lock
    jffs2: fix IN_DELETE_SELF on overwriting rename() killing a directory
    fix IN_DELETE_SELF on overwriting rename() on ramfs et.al.
    mm/truncate.c: fix build for CONFIG_BLOCK not enabled
    fs:update the NOTE of the file_operations structure
    Remove dead code in dget_parent()
    AFS: Fix silly characters in a comment
    switch d_add_ci() to d_splice_alias() in "found negative" case as well
    simplify gfs2_lookup()
    jfs_lookup(): don't bother with . or ..
    get rid of useless dget_parent() in btrfs rename() and link()
    get rid of useless dget_parent() in fs/btrfs/ioctl.c
    fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers
    drivers: fix up various ->llseek() implementations
    fs: handle SEEK_HOLE/SEEK_DATA properly in all fs's that define their own llseek
    Ext4: handle SEEK_HOLE/SEEK_DATA generically
    Btrfs: implement our own ->llseek
    fs: add SEEK_HOLE and SEEK_DATA flags
    reiserfs: make reiserfs default to barrier=flush
    ...

    Fix up trivial conflicts in fs/xfs/linux-2.6/xfs_super.c due to the new
    shrinker callout for the inode cache, that clashed with the xfs code to
    start the periodic workers later.

    Linus Torvalds
     
  • * 'ptrace' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc: (39 commits)
    ptrace: do_wait(traced_leader_killed_by_mt_exec) can block forever
    ptrace: fix ptrace_signal() && STOP_DEQUEUED interaction
    connector: add an event for monitoring process tracers
    ptrace: dont send SIGSTOP on auto-attach if PT_SEIZED
    ptrace: mv send-SIGSTOP from do_fork() to ptrace_init_task()
    ptrace_init_task: initialize child->jobctl explicitly
    has_stopped_jobs: s/task_is_stopped/SIGNAL_STOP_STOPPED/
    ptrace: make former thread ID available via PTRACE_GETEVENTMSG after PTRACE_EVENT_EXEC stop
    ptrace: wait_consider_task: s/same_thread_group/ptrace_reparented/
    ptrace: kill real_parent_is_ptracer() in in favor of ptrace_reparented()
    ptrace: ptrace_reparented() should check same_thread_group()
    redefine thread_group_leader() as exit_signal >= 0
    do not change dead_task->exit_signal
    kill task_detached()
    reparent_leader: check EXIT_DEAD instead of task_detached()
    make do_notify_parent() __must_check, update the callers
    __ptrace_detach: avoid task_detached(), check do_notify_parent()
    kill tracehook_notify_death()
    make do_notify_parent() return bool
    ptrace: s/tracehook_tracer_task()/ptrace_parent()/
    ...

    Linus Torvalds
     

21 Jul, 2011

1 commit

  • Moving the event counter into the dynamically allocated 'struc seq_file'
    allows poll() support without the need to allocate its own tracking
    structure.

    All current users are switched over to use the new counter.

    Requested-by: Andrew Morton akpm@linux-foundation.org
    Acked-by: NeilBrown
    Tested-by: Lucas De Marchi lucas.demarchi@profusion.mobi
    Signed-off-by: Kay Sievers
    Signed-off-by: Al Viro

    Kay Sievers
     

20 Jul, 2011

4 commits


29 Jun, 2011

1 commit

  • /proc/PID/io may be used for gathering private information. E.g. for
    openssh and vsftpd daemons wchars/rchars may be used to learn the
    precise password length. Restrict it to processes being able to ptrace
    the target process.

    ptrace_may_access() is needed to prevent keeping open file descriptor of
    "io" file, executing setuid binary and gathering io information of the
    setuid'ed process.

    Signed-off-by: Vasiliy Kulikov
    Signed-off-by: Linus Torvalds

    Vasiliy Kulikov
     

23 Jun, 2011

1 commit


20 Jun, 2011

2 commits


17 Jun, 2011

1 commit