07 Jan, 2009

2 commits

  • f_op->poll is the only vfs operation which is not allowed to sleep. This
    is because the poll and select implementations used task state to
    synchronize against wake ups, which doesn't have to be the case anymore as
    the wait/wake interface can now use custom wake up functions. The
    non-sleep restriction can be a bit tricky because ->poll is not called
    from an atomic context and the result of accidentally sleeping in ->poll
    only shows up as temporary busy looping when the timing is right or
    rather wrong.

    This patch converts poll/select to use a custom wake up function and a
    separate triggered variable to synchronize against wake up events. The
    only added overhead is an extra function call during wake up, which is
    negligible.
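
    The triggered-variable idea can be illustrated outside the kernel. The
    following is only a userspace analogy built on POSIX threads, not the
    kernel code: struct poll_waiter, waiter_wake() and waiter_wait() are
    hypothetical names chosen for this sketch. Because the wake up is
    recorded in a flag instead of in the task state, a wake up that arrives
    before the waiter blocks is not lost.

    ```c
    #include <pthread.h>

    /* Hypothetical userspace analogue of the "triggered" variable. */
    struct poll_waiter {
        pthread_mutex_t lock;
        pthread_cond_t  cond;
        int             triggered;  /* set by the waker, cleared by the waiter */
    };

    #define POLL_WAITER_INIT \
        { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0 }

    /* Custom wake up function: record the event first, then wake the sleeper. */
    void waiter_wake(struct poll_waiter *w)
    {
        pthread_mutex_lock(&w->lock);
        w->triggered = 1;
        pthread_cond_signal(&w->cond);
        pthread_mutex_unlock(&w->lock);
    }

    /* Block only if no event has been recorded since the previous wait. */
    void waiter_wait(struct poll_waiter *w)
    {
        pthread_mutex_lock(&w->lock);
        while (!w->triggered)
            pthread_cond_wait(&w->cond, &w->lock);
        w->triggered = 0;           /* consume the event */
        pthread_mutex_unlock(&w->lock);
    }
    ```

    If waiter_wake() runs before waiter_wait(), the wait returns immediately;
    the code between registering for events and waiting is therefore free to
    sleep.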

    This patch removes the one non-sleep exception from vfs locking rules and
    is beneficial to userland filesystem implementations like FUSE, 9p or
    peculiar filesystems like spufs, as it's very difficult for those to
    implement a non-sleeping poll method.

    While at it, make the following cosmetic changes to make poll.h and
    select.c checkpatch friendly.

    * s/type * symbol/type *symbol/ : three places in poll.h
    * remove blank line before EXPORT_SYMBOL() : two places in select.c

    Oleg: spotted missing barrier in poll_schedule_timeout()
    Davide: spotted missing write barrier in pollwake()

    Signed-off-by: Tejun Heo
    Cc: Eric Van Hensbergen
    Cc: Ron Minnich
    Cc: Ingo Molnar
    Cc: Christoph Hellwig
    Signed-off-by: Miklos Szeredi
    Cc: Davide Libenzi
    Cc: Brad Boyer
    Cc: Al Viro
    Cc: Roland McGrath
    Cc: Mauro Carvalho Chehab
    Signed-off-by: Andrew Morton
    Cc: Davide Libenzi
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • This change introduces two new sysctls to /proc/sys/vm:
    dirty_background_bytes and dirty_bytes.

    dirty_background_bytes is the counterpart to dirty_background_ratio and
    dirty_bytes is the counterpart to dirty_ratio.

    With growing memory capacities of individual machines, it's no longer
    sufficient to specify dirty thresholds as a percentage of the amount of
    dirtyable memory over the entire system.

    dirty_background_bytes and dirty_bytes specify quantities of memory, in
    bytes, that represent the dirty limits for the entire system. If either
    of these values is set, its value represents the amount of dirty memory
    that is needed to commence either background or direct writeback.

    When a `bytes' or `ratio' file is written, its counterpart becomes a
    function of the written value. For example, if dirty_bytes is written to
    be 8192, 8K of memory is required to commence direct writeback.
    dirty_ratio is then functionally equivalent to 8K / the amount of
    dirtyable memory:

    dirtyable_memory = free pages + mapped pages + file cache

    dirty_background_bytes = dirty_background_ratio * dirtyable_memory
    -or-
    dirty_background_ratio = dirty_background_bytes / dirtyable_memory

    AND

    dirty_bytes = dirty_ratio * dirtyable_memory
    -or-
    dirty_ratio = dirty_bytes / dirtyable_memory

    Only one of dirty_background_bytes and dirty_background_ratio may be
    specified at a time, and only one of dirty_bytes and dirty_ratio may be
    specified. When one sysctl is written, the other appears as 0 when read.

    The `bytes' files operate on a page size granularity since dirty limits
    are compared with ZVC values, which are in page units.

    Prior to this change, the minimum dirty_ratio was 5 as implemented by
    get_dirty_limits() although /proc/sys/vm/dirty_ratio would show any user
    written value between 0 and 100. This restriction is maintained, but
    dirty_bytes has a lower limit of only one page.

    Also prior to this change, the dirty_background_ratio could not equal or
    exceed dirty_ratio. This restriction is maintained in addition to
    restricting dirty_background_bytes. If either background threshold equals
    or exceeds that of the dirty threshold, it is implicitly set to half the
    dirty threshold.
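
    The rules above can be sketched as a small calculation. This is an
    illustration under stated assumptions, not the kernel's
    get_dirty_limits(): struct dirty_sysctls, struct dirty_limits and
    compute_dirty_limits() are hypothetical names, a value of 0 in a `bytes'
    field is taken to mean that the `ratio' counterpart is in effect, and
    rounding byte values up to whole pages is an assumption.

    ```c
    /* Hypothetical snapshot of the four sysctls; not a kernel structure. */
    struct dirty_sysctls {
        unsigned long dirty_bytes;             /* 0: dirty_ratio is in effect */
        unsigned long dirty_ratio;             /* percent */
        unsigned long dirty_background_bytes;  /* 0: background ratio is in effect */
        unsigned long dirty_background_ratio;  /* percent */
    };

    /* Resulting thresholds, in pages. */
    struct dirty_limits {
        unsigned long background;
        unsigned long dirty;
    };

    /* Assumption: byte values are rounded up to whole pages. */
    static unsigned long bytes_to_pages(unsigned long bytes, unsigned long page_size)
    {
        return (bytes + page_size - 1) / page_size;
    }

    struct dirty_limits compute_dirty_limits(const struct dirty_sysctls *s,
                                             unsigned long dirtyable_pages,
                                             unsigned long page_size)
    {
        struct dirty_limits l;

        if (s->dirty_bytes) {
            l.dirty = bytes_to_pages(s->dirty_bytes, page_size);
        } else {
            /* the minimum dirty_ratio of 5 is maintained */
            unsigned long ratio = s->dirty_ratio < 5 ? 5 : s->dirty_ratio;
            l.dirty = ratio * dirtyable_pages / 100;
        }

        if (s->dirty_background_bytes)
            l.background = bytes_to_pages(s->dirty_background_bytes, page_size);
        else
            l.background = s->dirty_background_ratio * dirtyable_pages / 100;

        /* a background threshold that reaches the dirty threshold is halved */
        if (l.background >= l.dirty)
            l.background = l.dirty / 2;

        return l;
    }
    ```

    With 4096-byte pages, dirty_bytes = 8192 yields a dirty threshold of two
    pages; a background ratio that would exceed it is cut to one page.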

    Acked-by: Peter Zijlstra
    Cc: Dave Chinner
    Cc: Christoph Lameter
    Signed-off-by: David Rientjes
    Cc: Andrea Righi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

06 Jan, 2009

1 commit


03 Jan, 2009

2 commits

  • * 'linux-next' of git://git.infradead.org/ubifs-2.6: (33 commits)
    UBIFS: add more useful debugging prints
    UBIFS: print debugging messages properly
    UBIFS: fix numerous spelling mistakes
    UBIFS: allow mounting when short of space
    UBIFS: fix writing uncompressed files
    UBIFS: fix checkpatch.pl warnings
    UBIFS: fix sparse warnings
    UBIFS: simplify make_free_space
    UBIFS: do not lie about used blocks
    UBIFS: restore budg_uncommitted_idx
    UBIFS: always commit on unmount
    UBIFS: use ubi_sync
    UBIFS: always commit in sync_fs
    UBIFS: fix file-system synchronization
    UBIFS: fix constants initialization
    UBIFS: avoid unnecessary calculations
    UBIFS: re-calculate min_idx_size after the commit
    UBIFS: use nicer 64-bit math
    UBIFS: fix available blocks count
    UBIFS: various comment improvements and fixes
    ...

    Linus Torvalds
     
  • Changelog [v2]:
    - Add note indicating strict isolation is not possible unless all
    mounts of devpts use the 'newinstance' mount option.

    Signed-off-by: Sukadev Bhattiprolu
    Signed-off-by: Alan Cox
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     

01 Jan, 2009

3 commits


31 Dec, 2008

1 commit


29 Dec, 2008

1 commit

  • Merge git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6

    Conflicts:

    fs/xfs/linux-2.6/xfs_cred.h
    fs/xfs/linux-2.6/xfs_globals.h
    fs/xfs/linux-2.6/xfs_ioctl.c
    fs/xfs/xfs_vnodeops.h

    Signed-off-by: Lachlan McIlroy

    Lachlan McIlroy
     

23 Dec, 2008

1 commit

  • …86/debug', 'x86/defconfig', 'x86/detect-hyper', 'x86/doc', 'x86/dumpstack', 'x86/early-printk', 'x86/fpu', 'x86/idle', 'x86/io', 'x86/memory-corruption-check', 'x86/microcode', 'x86/mm', 'x86/mtrr', 'x86/nmi-watchdog', 'x86/pat2', 'x86/pci-ioapic-boot-irq-quirks', 'x86/ptrace', 'x86/quirks', 'x86/reboot', 'x86/setup-memory', 'x86/signal', 'x86/sparse-fixes', 'x86/time', 'x86/uv' and 'x86/xen' into x86/core

    Ingo Molnar
     

05 Dec, 2008

1 commit


03 Dec, 2008

1 commit


02 Dec, 2008

2 commits

  • It has been thought that the per-user file descriptor limit would also
    limit the resources that a normal user can request via the epoll
    interface. Vegard Nossum reported a very simple program (a modified
    version attached) that lets a normal user request a pretty large
    amount of kernel memory, well within its maximum number of fds. To
    solve this problem, default limits are now imposed, and /proc based
    configuration has been introduced. A new directory named
    /proc/sys/fs/epoll/ has been created, containing two configuration
    points:

    max_user_instances = Maximum number of devices - per user

    max_user_watches = Maximum number of "watched" fds - per user

    The current default for "max_user_watches" limits the memory used by epoll
    to store "watches" to 1/32 of the amount of low RAM. As an example, a
    256MB 32bit machine will have "max_user_watches" set to roughly 90000.
    That should be enough to not break existing heavy epoll users. The
    default value for "max_user_instances" is set to 128, which should be
    enough too.

    This also changes the userspace, because a new error code can now come out
    from EPOLL_CTL_ADD (-ENOSPC). The EMFILE from epoll_create() was already
    listed, so that should be ok.
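
    For userspace, the practical consequence is that EPOLL_CTL_ADD callers
    may want to treat ENOSPC separately from other failures. A minimal
    sketch, assuming a Linux system; add_read_watch() and the watch_result
    codes are hypothetical names used only for this example:

    ```c
    #include <errno.h>
    #include <stdio.h>
    #include <sys/epoll.h>

    /* Hypothetical result codes for this sketch. */
    enum watch_result { WATCH_OK, WATCH_LIMIT, WATCH_ERROR };

    /* Register fd for read events and report a per-user limit hit separately. */
    enum watch_result add_read_watch(int epfd, int fd)
    {
        struct epoll_event ev = { 0 };

        ev.events = EPOLLIN;
        ev.data.fd = fd;

        if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == 0)
            return WATCH_OK;

        if (errno == ENOSPC) {
            /* the per-user max_user_watches limit has been reached */
            fprintf(stderr, "epoll watch limit reached, see "
                    "/proc/sys/fs/epoll/max_user_watches\n");
            return WATCH_LIMIT;
        }

        perror("epoll_ctl");
        return WATCH_ERROR;
    }
    ```

    A caller that receives WATCH_LIMIT can drop idle watches or ask the
    administrator to raise the limit instead of treating it as a fatal error.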

    [akpm@linux-foundation.org: use get_current_user()]
    Signed-off-by: Davide Libenzi
    Cc: Michael Kerrisk
    Cc:
    Cc: Cyrill Gorcunov
    Reported-by: Vegard Nossum
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     
  • Remove some features from the "not-supported" list that are actually
    supported now.

    Signed-off-by: Mark Fasheh

    Mark Fasheh
     

01 Dec, 2008

1 commit


28 Nov, 2008

1 commit


13 Nov, 2008

1 commit

  • xip documentation updated:
    - change "get_xip_page" to "get_xip_mem";
    - explain changed function parameters

    Signed-off-by: Marco Stornelli
    Signed-off-by: Randy Dunlap
    Cc: Carsten Otte
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marco Stornelli
     

07 Nov, 2008

3 commits

  • Niv Sardi
     
  • FAT has the ATTR_RO (read-only) attribute. But on Windows, the ATTR_RO
    of a directory is actually just ignored, and is used only by
    applications as a flag. E.g. it's set for customized folders by
    Explorer.

    http://msdn2.microsoft.com/en-us/library/aa969337.aspx

    This adds the "rodir" option. If the user specifies it, ATTR_RO is used as
    a read-only flag even for directories. Otherwise, inode->i_mode
    is not used to hold ATTR_RO (i.e. fat_mode_can_save_ro() returns 0).
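
    The resulting behaviour can be sketched as follows. This is an
    illustration, not the fs/fat code: fat_apply_ro() is a hypothetical
    helper, and only the rule described above (ATTR_RO always applies to
    files, and to directories only with "rodir") is modelled.

    ```c
    #include <stdbool.h>
    #include <sys/stat.h>
    #include <sys/types.h>

    #define ATTR_RO 0x01    /* FAT read-only attribute bit */

    /*
     * Hypothetical helper: derive the permission bits from the FAT
     * attributes. ATTR_RO on a directory only clears the write bits
     * when the rodir option is set.
     */
    mode_t fat_apply_ro(mode_t mode, unsigned char attrs, bool rodir)
    {
        bool honor_ro = !S_ISDIR(mode) || rodir;

        if ((attrs & ATTR_RO) && honor_ro)
            mode &= ~(mode_t)(S_IWUSR | S_IWGRP | S_IWOTH);

        return mode;
    }
    ```

    Without rodir, a directory carrying ATTR_RO keeps its write bits, which
    matches the Windows behaviour of ignoring the attribute on directories.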

    Signed-off-by: OGAWA Hirofumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    OGAWA Hirofumi
     
  • While debugging a sync mount regression on vfat I noticed that there were
    mount options parsed by the driver that were not documented.

    [hirofumi@mail.parknet.co.jp: fix some parts]
    Signed-off-by: Bart Trojanowski
    Signed-off-by: OGAWA Hirofumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bart Trojanowski
     

31 Oct, 2008

2 commits


30 Oct, 2008

1 commit

  • On Linux all filesystems are supposed to operate under POSIX
    restricted chown. Restricted chown means chown is restricted to the owner
    unless you have CAP_FOWNER.

    NOTE: 2 files outside of fs/xfs have been modified too for this
    change.

    Reviewed-by: Dave Chinner

    SGI-PV: 988919

    SGI-Modid: 2.6.x-xfs-melb:linux:32413b

    Signed-off-by: Tim Shimmin
    Signed-off-by: Christoph Hellwig
    Signed-off-by: David Chinner
    Signed-off-by: Lachlan McIlroy

    Tim Shimmin
     

21 Oct, 2008

1 commit

  • * 'linux-next' of git://git.infradead.org/ubifs-2.6: (25 commits)
    UBIFS: fix ubifs_compress commentary
    UBIFS: amend printk
    UBIFS: do not read unnecessary bytes when unpacking bits
    UBIFS: check buffer length when scanning for LPT nodes
    UBIFS: correct condition to eliminate unecessary assignment
    UBIFS: add more debugging messages for LPT
    UBIFS: fix bulk-read handling uptodate pages
    UBIFS: improve garbage collection
    UBIFS: allow for sync_fs when read-only
    UBIFS: commit on sync_fs
    UBIFS: correct comment for commit_on_unmount
    UBIFS: update dbg_dump_inode
    UBIFS: fix commentary
    UBIFS: fix races in bit-fields
    UBIFS: ensure data read beyond i_size is zeroed out correctly
    UBIFS: correct key comparison
    UBIFS: use bit-fields when possible
    UBIFS: check data CRC when in error state
    UBIFS: improve znode splitting rules
    UBIFS: add no_chk_data_crc mount option
    ...

    Linus Torvalds
     

20 Oct, 2008

3 commits

  • If the journal doesn't abort when it gets an IO error in file data blocks,
    the file data corruption will spread silently. Because most
    applications and commands do buffered writes without fsync(), they don't
    notice the IO error. It's scary for mission critical systems. On the
    other hand, if the journal aborts whenever it gets an IO error in file
    data blocks, the system will easily become inoperable. So this patch
    introduces a filesystem option to determine whether it aborts the journal
    or just calls printk() when it gets an IO error in file data.

    If you mount an ext3 fs with the data_err=abort option, it aborts on file
    data write error. If you mount it with data_err=ignore, it doesn't abort,
    it just calls printk(). data_err=ignore is the default.
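
    The semantics of the option can be summarized in a few lines. The sketch
    below is not the ext3 option parser; parse_data_err() and
    should_abort_journal() are hypothetical helpers that only model the two
    documented values and the default.

    ```c
    #include <stddef.h>
    #include <string.h>

    enum data_err_policy { DATA_ERR_IGNORE, DATA_ERR_ABORT, DATA_ERR_INVALID };

    /* Hypothetical parser for one mount option token; NULL means "not given". */
    enum data_err_policy parse_data_err(const char *opt)
    {
        if (opt == NULL)
            return DATA_ERR_IGNORE;         /* data_err=ignore is the default */
        if (strcmp(opt, "data_err=abort") == 0)
            return DATA_ERR_ABORT;
        if (strcmp(opt, "data_err=ignore") == 0)
            return DATA_ERR_IGNORE;
        return DATA_ERR_INVALID;
    }

    /* Returns 1 if a file data write error should abort the journal. */
    int should_abort_journal(enum data_err_policy policy)
    {
        return policy == DATA_ERR_ABORT;
    }
    ```

    In both cases the error is reported; the policy only decides whether the
    journal is aborted in addition.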

    Signed-off-by: Hidehiro Kawai
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hidehiro Kawai
     
  • The current documentation of dirty_ratio and dirty_background_ratio is a
    bit misleading.

    In the documentation we say that they are "a percentage of total system
    memory", but the current page writeback policy, instead, is to apply the
    percentages to the dirtyable memory, which means free pages + reclaimable
    pages.

    Better to be more explicit to clarify this concept.

    Signed-off-by: Andrea Righi
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Righi
     
  • Presently a hugepage's vma has the VM_RESERVED flag so that it is not
    swapped. But a VM_RESERVED vma isn't core dumped because this flag is
    often used for some kernel vmas (e.g. vmalloc, sound related).

    Thus hugepages are never dumped and can't be debugged easily. Many
    developers want hugepages to be included in the core dump.

    However, we can't read a generic VM_RESERVED area because such an area is
    often an IO mapping area, and reading it may change device state. That is
    definitely an undesirable side effect.

    So adding a hugepage specific bit to the coredump filter is better. It
    enables hugepage core dumping and doesn't cause any side effects on
    any i/o devices.

    In addition, libhugetlbfs uses hugetlb private mapping pages as anonymous
    pages. Therefore, hugepage private mapping pages should be core dumped by
    default.

    Accordingly, /proc/[pid]/coredump_filter has two new bits.

    - bit 5 controls whether hugetlb private mapping pages are dumped. (default: yes)
    - bit 6 controls whether hugetlb shared mapping pages are dumped. (default: no)
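
    The two bits can be decoded and set from a program as sketched below.
    The macro and function names are hypothetical and chosen for this
    example; only the bit positions and the /proc/self/coredump_filter path
    come from this change.

    ```c
    #include <stdio.h>

    /* Hypothetical names for the two new filter bits. */
    #define FILTER_HUGETLB_PRIVATE (1UL << 5)
    #define FILTER_HUGETLB_SHARED  (1UL << 6)

    /* Returns nonzero if the mask asks for hugetlb private pages to be dumped. */
    int dumps_hugetlb_private(unsigned long filter)
    {
        return (filter & FILTER_HUGETLB_PRIVATE) != 0;
    }

    /* Returns nonzero if the mask asks for hugetlb shared pages to be dumped. */
    int dumps_hugetlb_shared(unsigned long filter)
    {
        return (filter & FILTER_HUGETLB_SHARED) != 0;
    }

    /* Write a new mask for the calling process; 0 on success, -1 on failure. */
    int set_coredump_filter(unsigned long filter)
    {
        FILE *f = fopen("/proc/self/coredump_filter", "w");
        int ok;

        if (f == NULL)
            return -1;
        ok = fprintf(f, "0x%lx\n", filter) > 0;
        if (fclose(f) != 0)
            ok = 0;
        return ok ? 0 : -1;
    }
    ```

    For instance, the mask 0x43 has bit 6 set and bit 5 clear, so it selects
    shared hugetlb pages and excludes private ones.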

    I tested it with the following method.

    % ulimit -c unlimited
    % ./crash_hugepage 50
    % ./crash_hugepage 50 -p
    % ls -lh
    % gdb ./crash_hugepage core
    %
    % echo 0x43 > /proc/self/coredump_filter
    % ./crash_hugepage 50
    % ./crash_hugepage 50 -p
    % ls -lh
    % gdb ./crash_hugepage core

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/mman.h>

    #include "hugetlbfs.h"

    int main(int argc, char **argv)
    {
        char *p;
        int ch;
        int mmap_flags = MAP_SHARED;
        int fd;
        int nr_pages;

        /* -p selects a private mapping instead of a shared one */
        while ((ch = getopt(argc, argv, "p")) != -1) {
            switch (ch) {
            case 'p':
                mmap_flags &= ~MAP_SHARED;
                mmap_flags |= MAP_PRIVATE;
                break;
            default:
                /* nothing */
                break;
            }
        }
        argc -= optind;
        argv += optind;

        if (argc == 0) {
            printf("need # of pages\n");
            exit(1);
        }

        nr_pages = atoi(argv[0]);
        if (nr_pages < 2) {
            printf("nr_pages must be >= 2\n");
            exit(1);
        }

        /* map nr_pages huge pages backed by an unlinked hugetlbfs file */
        fd = hugetlbfs_unlinked_fd();
        p = mmap(NULL, nr_pages * gethugepagesize(),
                 PROT_READ|PROT_WRITE, mmap_flags, fd, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            exit(1);
        }

        sleep(2);

        *(p + gethugepagesize()) = 1; /* COW */
        sleep(2);

        /* crash! */
        *(int *)0 = 1;

        return 0;
    }

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Kawai Hidehiro
    Cc: Hugh Dickins
    Cc: William Irwin
    Cc: Adam Litke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

17 Oct, 2008

6 commits


14 Oct, 2008

3 commits


11 Oct, 2008

2 commits

  • If the journal doesn't abort when it gets an IO error in file data
    blocks, the file data corruption will spread silently. Because
    most applications and commands do buffered writes without fsync(),
    they don't notice the IO error. It's scary for mission critical
    systems. On the other hand, if the journal aborts whenever it gets
    an IO error in file data blocks, the system will easily become
    inoperable. So this patch introduces a filesystem option to
    determine whether it aborts the journal or just calls printk() when
    it gets an IO error in file data.

    If you mount an ext4 fs with the data_err=abort option, it aborts on
    file data write error. If you mount it with data_err=ignore, it doesn't
    abort, it just calls printk(). data_err=ignore is the default.

    Here is the corresponding patch of the ext3 version:
    http://kerneltrap.org/mailarchive/linux-kernel/2008/9/9/3239374

    Signed-off-by: Hidehiro Kawai
    Signed-off-by: Theodore Ts'o

    Hidehiro Kawai
     
  • The ext4 filesystem is getting stable enough that it's time to drop
    the "dev" prefix. Also remove the requirement for the TEST_FILESYS
    flag.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     

10 Oct, 2008

1 commit

  • After commit 831830b5a2b5d413407adf380ef62fe17d6fcbf2 aka
    "restrict reading from /proc/<pid>/maps to those who share ->mm or can ptrace"
    the sysctl stopped being relevant because that commit moved security checks
    from ->show time to ->start time (mm_for_maps()).

    Signed-off-by: Alexey Dobriyan
    Acked-by: Kees Cook

    Alexey Dobriyan