17 Oct, 2007

4 commits

  • Implement file posix capabilities. This allows programs to be given a
    subset of root's powers regardless of who runs them, without having to use
    setuid and giving the binary all of root's powers.

    This version works with Kaigai Kohei's userspace tools, found at
    http://www.kaigai.gr.jp/index.php. For more information on how to use this
    patch, Chris Friedhoff has posted a nice page at
    http://www.friedhoff.org/fscaps.html.

    Changelog:
    Nov 27:
    Incorporate fixes from Andrew Morton
    (security-introduce-file-caps-tweaks and
    security-introduce-file-caps-warning-fix)
    Fix Kconfig dependency.
    Fix change signaling behavior when file caps are not compiled in.

    Nov 13:
    Integrate comments from Alexey: Remove CONFIG_ ifdef from
    capability.h, and use %zd for printing a size_t.

    Nov 13:
    Fix endianness warnings by sparse as suggested by Alexey
    Dobriyan.

    Nov 09:
    Address warnings of unused variables at cap_bprm_set_security
    when file capabilities are disabled, and simultaneously clean
    up the code a little, by pulling the new code into a helper
    function.

    Nov 08:
    For pointers to required userspace tools and how to use
    them, see http://www.friedhoff.org/fscaps.html.

    Nov 07:
    Fix the calculation of the highest bit checked in
    check_cap_sanity().

    Nov 07:
    Allow file caps to be enabled without CONFIG_SECURITY, since
    capabilities are the default.
    Hook cap_task_setscheduler when !CONFIG_SECURITY.
    Move capable(TASK_KILL) to end of cap_task_kill to reduce
    audit messages.

    Nov 05:
    Add secondary calls in selinux/hooks.c to task_setioprio and
    task_setscheduler so that selinux and capabilities with file
    cap support can be stacked.

    Sep 05:
    As Seth Arnold points out, uid checks are out of place
    for capability code.

    Sep 01:
    Define task_setscheduler, task_setioprio, cap_task_kill, and
    task_setnice to make sure a user cannot affect a process in which
    they called a program with some fscaps.

    One remaining question is the note under task_setscheduler: are we
    ok with CAP_SYS_NICE being sufficient to confine a process to a
    cpuset?

    It is a semantic change, as without fscaps, attach_task doesn't
    allow CAP_SYS_NICE to override the uid equivalence check. But since
    it uses security_task_setscheduler, which elsewhere is used where
    CAP_SYS_NICE can be used to override the uid equivalence check,
    fixing it might be tough.

    task_setscheduler
    note: this also controls cpuset:attach_task. Are we ok with
    CAP_SYS_NICE being used to confine to a cpuset?
    task_setioprio
    task_setnice
    sys_setpriority uses this (through set_one_prio) for another
    process. Need same checks as setrlimit

    Aug 21:
    Updated secureexec implementation to reflect the fact that
    euid and uid might be the same and nonzero, but the process
    might still have elevated caps.

    Aug 15:
    Handle endianness of xattrs.
    Enforce capability version match between kernel and disk.
    Enforce that no bits beyond the known max capability are
    set, else return -EPERM.
    With this extra processing, it may be worth reconsidering
    doing all the work at bprm_set_security rather than
    d_instantiate.
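
    The three Aug 15 checks (endianness, version match, no bits beyond the
    known maximum) can be illustrated with a small userspace sketch. The
    names FSCAP_VERSION, FSCAP_MAX_BIT, fscap_le32() and fscap_check(), the
    two-word layout, and the -EINVAL return values are assumptions made for
    this example only; they are not the kernel's actual on-disk format or
    code. Only the -EPERM return for unknown bits comes from the note above.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constants for illustration; not the kernel's values. */
#define FSCAP_VERSION 1u   /* the only on-disk version this sketch accepts */
#define FSCAP_MAX_BIT 30u  /* highest capability bit this sketch knows about */

/* Decode a little-endian 32-bit value byte by byte, so the result is the
 * same on big-endian and little-endian hosts. */
static uint32_t fscap_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* buf is assumed to hold two little-endian words: a version, then a
 * permitted-capability mask. Returns 0 and stores the mask on success,
 * -EINVAL on short input or version mismatch (this sketch's choice),
 * and -EPERM if any bit above FSCAP_MAX_BIT is set. */
static int fscap_check(const unsigned char *buf, size_t len, uint32_t *permitted)
{
    uint32_t version, mask, known;

    if (len < 8)
        return -EINVAL;

    version = fscap_le32(buf);
    mask = fscap_le32(buf + 4);

    if (version != FSCAP_VERSION)
        return -EINVAL;

    /* Mask with bits 0..FSCAP_MAX_BIT set. */
    known = (UINT32_C(1) << (FSCAP_MAX_BIT + 1)) - 1;
    if (mask & ~known)
        return -EPERM;

    *permitted = mask;
    return 0;
}
```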

    Aug 10:
    Always call getxattr at bprm_set_security, rather than
    caching it at d_instantiate.

    [morgan@kernel.org: file-caps clean up for linux/capability.h]
    [bunk@kernel.org: unexport cap_inode_killpriv]
    Signed-off-by: Serge E. Hallyn
    Cc: Stephen Smalley
    Cc: James Morris
    Cc: Chris Wright
    Cc: Andrew Morgan
    Signed-off-by: Andrew Morgan
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • * 'for-linus' of git://git.kernel.dk/data/git/linux-2.6-block: (63 commits)
    Fix memory leak in dm-crypt
    SPARC64: sg chaining support
    SPARC: sg chaining support
    PPC: sg chaining support
    PS3: sg chaining support
    IA64: sg chaining support
    x86-64: enable sg chaining
    x86-64: update pci-gart iommu to sg helpers
    x86-64: update nommu to sg helpers
    x86-64: update calgary iommu to sg helpers
    swiotlb: sg chaining support
    i386: enable sg chaining
    i386 dma_map_sg: convert to using sg helpers
    mmc: need to zero sglist on init
    Panic in blk_rq_map_sg() from CCISS driver
    remove sglist_len
    remove blk_queue_max_phys_segments in libata
    revert sg segment size ifdefs
    Fixup u14-34f ENABLE_SG_CHAINING
    qla1280: enable use_sg_chaining option
    ...

    Linus Torvalds
     
  • These are intended to replace prepare_write and commit_write with more
    flexible alternatives that are also able to avoid the buffered write
    deadlock problems efficiently (which prepare_write is unable to do).

    [mark.fasheh@oracle.com: API design contributions, code review and fixes]
    [akpm@linux-foundation.org: various fixes]
    [dmonakhov@sw.ru: new aop block_write_begin fix]
    Signed-off-by: Nick Piggin
    Signed-off-by: Mark Fasheh
    Signed-off-by: Dmitriy Monakhov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Combine the file_ra_state members
    unsigned long prev_index
    unsigned int prev_offset
    into
    loff_t prev_pos

    It is more consistent and better supports huge files.

    Thanks to Peter for the nice proposal!
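
    As a rough illustration of the combined field, the sketch below packs a
    page index and an in-page offset into one 64-bit byte position and
    unpacks them again. SKETCH_PAGE_SHIFT (4 KiB pages), sketch_loff_t and
    the helper names are assumptions for this example, not the kernel's
    definitions. The widening cast is applied before the shift, because
    shifting in a 32-bit type would lose the high bits for large files.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12                       /* assumes 4 KiB pages */
#define SKETCH_PAGE_MASK  ((1u << SKETCH_PAGE_SHIFT) - 1)

typedef int64_t sketch_loff_t;                     /* stand-in for loff_t */

/* Combine a page index and an in-page offset into one byte position.
 * The index is widened to 64 bits before shifting, so the result is
 * correct even where unsigned long is only 32 bits wide.
 * prev_offset is assumed to be smaller than the page size. */
static sketch_loff_t make_prev_pos(unsigned long prev_index,
                                   unsigned int prev_offset)
{
    return (sketch_loff_t)(((uint64_t)prev_index << SKETCH_PAGE_SHIFT) |
                           prev_offset);
}

/* Recover the page index from a combined position. */
static unsigned long prev_pos_index(sketch_loff_t pos)
{
    return (unsigned long)((uint64_t)pos >> SKETCH_PAGE_SHIFT);
}

/* Recover the in-page offset from a combined position. */
static unsigned int prev_pos_offset(sketch_loff_t pos)
{
    return (unsigned int)((uint64_t)pos & SKETCH_PAGE_MASK);
}
```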

    [akpm@linux-foundation.org: fix shift overflow]
    Cc: Peter Zijlstra
    Signed-off-by: Fengguang Wu
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     

16 Oct, 2007

1 commit

  • The out label should not include the unmap, the only way to jump
    there already has unmapped the source.

    00002000
    f7c21a00 00000000 00000000 c0489036 00018e32 00000002 00000000
    00001000
    Call Trace:
    [] pipe_to_user+0xca/0xd3
    [] __splice_from_pipe+0x53/0x1bd
    [] ------------[ cut here ]------------
    filemap_fault+0x221/0x380
    [] pipe_to_user+0x0/0xd3
    [] sys_vmsplice+0x3b7/0x422
    [] kernel BUG at mm/highmem.c:206!
    handle_mm_fault+0x4d5/0x8eb
    [] kmap_atomic+0x1c/0x20
    [] unmap_vmas+0x3d1/0x584
    [] free_pgtables+0x90/0xa0
    [] pgd_dtor+0x0/0x1
    [] audit_syscall_exit+0x2aa/0x2c6
    [] do_syscall_trace+0x124/0x169
    [] syscall_call+0x7/0xb
    =======================
    Code: 2d 00 d0 5b 00 25 00 00 e0 ff 29 invalid opcode: 0000 [#1]
    c2 89 d0 c1 e8 0c 8b 14 85 a0 6c 7c c0 4a 85 d2 89 14 85 a0 6c 7c c0 74 07
    31 c9 4a 75 15 eb 04 0b eb fe 31 c9 81 3d 78 38 6d c0 78 38 6d c0 0f
    95 c1 b0 01
    EIP: [] kunmap_high+0x51/0x8e SS:ESP 0068:f5960df0
    SMP
    Modules linked in: netconsole autofs4 hidp nfs lockd nfs_acl rfcomm l2cap
    bluetooth sunrpc ipv6 ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core
    ib_addr iscsi_tcp libiscsi scsi_transport_iscsi dm_mirror dm_multipath
    dm_mod video output sbs battery ac parport_pc lp parport sg i2c_piix4
    i2c_core floppy cfi_probe gen_probe scb2_flash mtd chipreg tg3 e1000 button
    ide_cd serio_raw cdrom aic7xxx scsi_transport_spi sd_mod scsi_mod ext3 jbd
    ehci_hcd ohci_hcd uhci_hcd
    CPU: 3
    EIP: 0060:[] Not tainted VLI
    EFLAGS: 00010246 (2.6.23 #1)
    EIP is at kunmap_high+0x51/0x8e

    Signed-off-by: Jens Axboe

    Jens Axboe
     

02 Oct, 2007

1 commit

  • Nick Piggin points out that splice isn't being good about the mmap
    semaphore: while two readers can nest inside each other, it does leave
    a possible deadlock if a writer (i.e. a new mmap()) comes in during that
    nesting.

    Original "just move the locking" patch by Nick, replaced by one by me
    based on an optimistic pagefault_disable(). And then Jens tested and
    updated that patch.

    Reported-by: Nick Piggin
    Tested-by: Jens Axboe
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

27 Jul, 2007

1 commit


21 Jul, 2007

1 commit

  • If add_to_page_cache_lru() fails, the page will not be locked. But
    splice jumps to an error path that does a page release and unlock,
    causing a BUG() in unlock_page().

    Fix this by adding one more label that just releases the page. This bug
    was actually triggered on EL5 by gurudas pai using fio.
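
    The "one more label" idea can be sketched in plain C with a mock page
    object. Everything here (struct mock_page, the mock_* helpers, the label
    names, the unlock_errors counter standing in for BUG()) is hypothetical
    and only meant to show why the failure path must skip the unlock; it is
    not the actual splice code.

```c
#include <stdbool.h>

struct mock_page {
    bool locked;
    int refcount;
};

/* Counts unlocks of a page that was not locked (a BUG() in the kernel). */
static int unlock_errors;

/* On success the page comes back locked; on failure it stays unlocked. */
static int mock_add_to_cache(struct mock_page *page, bool fail)
{
    if (fail)
        return -1;
    page->locked = true;
    return 0;
}

static void mock_unlock_page(struct mock_page *page)
{
    if (!page->locked)
        unlock_errors++;
    page->locked = false;
}

static void mock_release_page(struct mock_page *page)
{
    page->refcount--;
}

static int mock_write_step(struct mock_page *page, bool add_fails,
                           bool write_fails)
{
    int ret;

    page->refcount++;              /* reference taken at allocation time */

    ret = mock_add_to_cache(page, add_fails);
    if (ret)
        goto out_release;          /* never locked: release only */

    if (write_fails) {
        ret = -2;
        goto out;                  /* locked: unlock, then release */
    }
    ret = 0;
out:
    mock_unlock_page(page);
out_release:
    mock_release_page(page);
    return ret;
}
```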

    Signed-off-by: Jens Axboe
    Signed-off-by: Linus Torvalds

    Jens Axboe
     

20 Jul, 2007

4 commits

  • Split ondemand readahead interface into two functions. I think this makes it
    a little clearer for non-readahead experts (like Rusty).

    Internally they both call ondemand_readahead(), but the page argument is
    changed to an obvious boolean flag.
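
    A minimal sketch of that shape: two thin entry points that share one
    internal routine and differ only in a boolean. The wrapper names, the
    reduced argument lists, and the reading of the flag as "asynchronous
    request" are assumptions for illustration; the real functions take
    mapping, readahead state and file arguments that are left out here.

```c
#include <stdbool.h>

/* Records the most recent request so the effect of each wrapper is visible. */
struct ra_request {
    bool async;
    unsigned long offset;
    unsigned long req_size;
};

static struct ra_request last_request;

/* Shared internal routine: the former page argument is reduced to a flag.
 * This mock only records the request and reports its size back. */
static unsigned long ondemand_readahead_sketch(bool async,
                                               unsigned long offset,
                                               unsigned long req_size)
{
    last_request.async = async;
    last_request.offset = offset;
    last_request.req_size = req_size;
    return req_size;
}

/* Entry point for a reader that needs the pages now. */
static unsigned long sync_readahead_sketch(unsigned long offset,
                                           unsigned long req_size)
{
    return ondemand_readahead_sketch(false, offset, req_size);
}

/* Entry point for readahead issued ahead of the reader. */
static unsigned long async_readahead_sketch(unsigned long offset,
                                            unsigned long req_size)
{
    return ondemand_readahead_sketch(true, offset, req_size);
}
```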

    Signed-off-by: Rusty Russell
    Signed-off-by: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rusty Russell
     
  • Pass real splice size to page_cache_readahead_ondemand().

    The splice code works in chunks of 16 pages internally. The readahead code
    should be told of the overall splice size, instead of the internal chunk size.
    Otherwise bad things may happen. Imagine some 17-page random splice reads.
    The code before this patch will result in two readahead calls: readahead(16);
    readahead(1); That leads to one 16-page I/O and one 32-page I/O: one extra I/O
    and 31 readahead miss pages.
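
    The request sizes in the 17-page example can be reproduced with a short
    sketch. SKETCH_CHUNK_PAGES and both helper names are hypothetical; the
    code only models which sizes the readahead code is told about, not the
    I/O that results from them.

```c
#define SKETCH_CHUNK_PAGES 16u  /* internal splice chunk size, in pages */

/* Sizes reported when readahead is told about each internal chunk.
 * Stores at most max entries in sizes[] and returns how many were stored. */
static int sizes_per_chunk(unsigned int nr_pages, unsigned int *sizes, int max)
{
    int n = 0;

    while (nr_pages > 0 && n < max) {
        unsigned int chunk = nr_pages < SKETCH_CHUNK_PAGES ?
                             nr_pages : SKETCH_CHUNK_PAGES;
        sizes[n++] = chunk;
        nr_pages -= chunk;
    }
    return n;
}

/* Sizes reported when readahead is told the overall splice size once. */
static int sizes_whole_request(unsigned int nr_pages, unsigned int *sizes,
                               int max)
{
    if (nr_pages == 0 || max < 1)
        return 0;
    sizes[0] = nr_pages;
    return 1;
}
```

    For nr_pages = 17 the first helper yields two requests (16, then 1),
    while the second yields a single request of 17.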

    Signed-off-by: Fengguang Wu
    Cc: Jens Axboe
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • Move synchronous page_cache_readahead_ondemand() call out of splice loop.

    This avoids one pointless page allocation/insertion in case of non-zero
    ra_pages, or many pointless readahead calls in case of zero ra_pages.

    Note that if a user sets ra_pages to less than PIPE_BUFFERS=16 pages, he will
    not get expected readahead behavior anyway. The splice code works in batches
    of 16 pages, which can be taken as another form of synchronous readahead.

    Signed-off-by: Fengguang Wu
    Cc: Jens Axboe
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • Convert splice reads to use on-demand readahead.

    Signed-off-by: Fengguang Wu
    Cc: Steven Pratt
    Cc: Ram Pai
    Cc: Jens Axboe
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     

16 Jul, 2007

1 commit

  • OGAWA Hirofumi reported that he's noticed nfsd read corruption in
    recent kernels, and did the hard work of discovering that it's due to
    splice updating the file position twice.
    This means that the next operation would start further ahead than it
    should.

    nfsd_vfs_read()
        splice_direct_to_actor()
            while(len) {
                do_splice_to()                    [update sd->pos]
                    -> generic_file_splice_read() [read from sd->pos]
                nfsd_direct_splice_actor()
                    -> __splice_from_pipe()       [update sd->pos]

    There's nothing wrong with the core splice code, but the direct
    splicing is an addon that calls both input and output paths.
    So it has to cache the offset locally to keep it correct.
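
    The double update and the local-caching remedy can be modelled in a few
    lines. This is a simplified sketch, not the actual patch: struct
    sketch_desc, in_step(), out_step() and both driver functions are
    hypothetical stand-ins for the input and output halves of a direct
    splice.

```c
struct sketch_desc {
    long long pos;  /* stand-in for sd->pos */
};

/* Input half: reads len bytes at *ppos and advances *ppos. */
static long in_step(long long *ppos, long len)
{
    *ppos += len;
    return len;
}

/* Output half: consumes len bytes and advances sd->pos. */
static long out_step(struct sketch_desc *sd, long len)
{
    sd->pos += len;
    return len;
}

/* Buggy driver: both halves advance the same position. */
static long long direct_buggy(struct sketch_desc *sd, long total, long chunk)
{
    while (total > 0) {
        long n = total < chunk ? total : chunk;

        in_step(&sd->pos, n);
        out_step(sd, n);
        total -= n;
    }
    return sd->pos;
}

/* Driver with a locally cached offset: the input half advances only the
 * local copy, and sd->pos is reset to the chunk start before the output
 * half runs, so each chunk is counted once. */
static long long direct_cached(struct sketch_desc *sd, long total, long chunk)
{
    long long pos = sd->pos;

    while (total > 0) {
        long n = total < chunk ? total : chunk;
        long long chunk_start = pos;

        in_step(&pos, n);
        sd->pos = chunk_start;
        out_step(sd, n);
        total -= n;
    }
    return sd->pos;
}
```

    Starting from position 0 with 40 bytes in 16-byte chunks, the buggy
    driver ends at 80 and the cached one at 40.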

    Signed-off-by: Jens Axboe

    Jens Axboe
     

13 Jul, 2007

2 commits


10 Jul, 2007

7 commits


15 Jun, 2007

3 commits


08 Jun, 2007

5 commits


08 May, 2007

2 commits

  • Don't try to guess what the read-ahead logic will do, allow it
    to make its own decisions.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Eric Dumazet, thank you for disclosing this bug.

    Readahead logic somehow fails to populate the page range with data.
    It can be because

    1) the readahead routine is not always called, as in the following lines
    of fs/splice.c:

        if (!loff || nr_pages > 1)
                page_cache_readahead(mapping, &in->f_ra, in, index, nr_pages);

    2) even when called, page_cache_readahead() won't guarantee the pages are
    there. It won't submit readahead I/O for pages already in the radix tree,
    or when (ra_pages == 0), or after 256 cache hits.

    In your case, it should be because of the retried reads, which lead to
    excessive cache hits and disable readahead at some point.

    And that _one_ failure of readahead blocks the whole read process.
    The application receives EAGAIN and retries the read, but
    __generic_file_splice_read() refuses to make progress:

    - in the previous invocation, it has allocated a blank page and inserted it
    into the radix tree, but never has the chance to start I/O for it: the test
    of SPLICE_F_NONBLOCK goes before that.

    - in the retried invocation, the readahead code will neither get out of the
    cache hit mode, nor will it submit I/O for an already existing page.

    Cc: Eric Dumazet
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Fengguang Wu
     

29 Mar, 2007

1 commit


27 Mar, 2007

3 commits

  • Ocfs2 wants to implement its own splice write actor so that it can better
    manage cluster / page locks. This lets us re-use the rest of splice write
    while only providing our own code where it's actually important.

    Signed-off-by: Mark Fasheh
    Signed-off-by: Jens Axboe

    Mark Fasheh
     
  • Splice does not need to readpage to bring the page uptodate before writing
    to it, because prepare_write will take care of that for us.

    Splice is also wrong to SetPageUptodate before the page is actually uptodate.
    This results in the old uninitialised memory leak. This gets fixed as a
    matter of course when removing the readpage logic.

    Signed-off-by: Nick Piggin
    Signed-off-by: Jens Axboe

    Nick Piggin
     
  • Stealing pages with splice is problematic because we cannot just insert
    an uptodate page into the pagecache and hope the filesystem can take care
    of it later.

    We also cannot just ClearPageUptodate, then hope prepare_write does not
    write anything into the page, because I don't think prepare_write gives
    that guarantee.

    Remove support for SPLICE_F_MOVE for now. If we really want to bring it
    back, we might be able to do so with the new filesystem buffered write
    aops APIs I'm working on. If we really don't want to bring it back, then
    we should decide that sooner rather than later, and remove the flag and
    all the stealing infrastructure before anybody starts using it.

    Signed-off-by: Nick Piggin
    Signed-off-by: Jens Axboe

    Nick Piggin
     

14 Dec, 2006

1 commit

  • - pipe/splice should use const pipe_buf_operations and file_operations

    - struct pipe_inode_info has an unused field "start" : get rid of it.

    Signed-off-by: Eric Dumazet
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     

09 Dec, 2006

1 commit

  • This patch changes struct file to use struct path instead of having
    independent pointers to struct dentry and struct vfsmount, and converts all
    users of f_{dentry,vfsmnt} in fs/ to use f_path.{dentry,mnt}.

    Additionally, it adds two #defines to ease the transition for users of
    f_dentry and f_vfsmnt.
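
    A compact sketch of how such transition macros can work. The field and
    macro names follow the description above, but the struct definitions are
    reduced mock-ups for illustration (the real structures have many more
    members), and old_style_dentry() is a hypothetical caller.

```c
/* Mock-ups with a single member each, for illustration only. */
struct dentry   { int id; };
struct vfsmount { int id; };

/* The pair that used to be two independent pointers in struct file. */
struct path {
    struct vfsmount *mnt;
    struct dentry *dentry;
};

struct file {
    struct path f_path;
};

/* Transition helpers: old member names expand to the new nested ones,
 * so code that has not been converted yet keeps compiling. */
#define f_dentry f_path.dentry
#define f_vfsmnt f_path.mnt

/* An unconverted caller still written against the old name. */
static struct dentry *old_style_dentry(struct file *file)
{
    return file->f_dentry;  /* expands to file->f_path.dentry */
}
```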

    Signed-off-by: Josef "Jeff" Sipek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josef "Jeff" Sipek
     

05 Nov, 2006

1 commit


29 Oct, 2006

1 commit

  • - Consolidate page_cache_alloc

    - Fix splice: only the pagecache pages and filesystem data need to use
    mapping_gfp_mask.

    - Fix grab_cache_page_nowait: same as splice, also honour NUMA placement.

    Signed-off-by: Nick Piggin
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin