12 May, 2013

1 commit

  • Pull audit changes from Eric Paris:
    "Al used to send pull requests every couple of years but he told me to
    just start pushing them to you directly.

    Our touching outside of core audit code is pretty straightforward. A
    couple of interface changes which hit net/. A simple argument bug
    calling audit functions in namei.c and the removal of some assembly
    branch prediction code on ppc"

    * git://git.infradead.org/users/eparis/audit: (31 commits)
    audit: fix message spacing printing auid
    Revert "audit: move kaudit thread start from auditd registration to kaudit init"
    audit: vfs: fix audit_inode call in O_CREAT case of do_last
    audit: Make testing for a valid loginuid explicit.
    audit: fix event coverage of AUDIT_ANOM_LINK
    audit: use spin_lock in audit_receive_msg to process tty logging
    audit: do not needlessly take a lock in tty_audit_exit
    audit: do not needlessly take a spinlock in copy_signal
    audit: add an option to control logging of passwords with pam_tty_audit
    audit: use spin_lock_irqsave/restore in audit tty code
    helper for some session id stuff
    audit: use a consistent audit helper to log lsm information
    audit: push loginuid and sessionid processing down
    audit: stop pushing loginid, uid, sessionid as arguments
    audit: remove the old depricated kernel interface
    audit: make validity checking generic
    audit: allow checking the type of audit message in the user filter
    audit: fix build break when AUDIT_DEBUG == 2
    audit: remove duplicate export of audit_enabled
    Audit: do not print error when LSMs disabled
    ...

    Linus Torvalds
     

11 May, 2013

3 commits

  • Pull nfsd fixes from Bruce Fields:
    "Small fixes for two bugs and two warnings"

    * 'for-3.10' of git://linux-nfs.org/~bfields/linux:
    nfsd: fix oops when legacy_recdir_name_error is passed a -ENOENT error
    SUNRPC: fix decoding of optional gss-proxy xdr fields
    SUNRPC: Refactor gssx_dec_option_array() to kill uninitialized warning
    nfsd4: don't allow owner override on 4.1 CLAIM_FH opens

    Linus Torvalds
     
  • Pull stray syscall bits from Al Viro:
    "Several syscall-related commits that were missing from the original"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
    switch compat_sys_sysctl to COMPAT_SYSCALL_DEFINE
    unicore32: just use mmap_pgoff()...
    unify compat fanotify_mark(2), switch to COMPAT_SYSCALL_DEFINE
    x86, vm86: fix VM86 syscalls: use SYSCALL_DEFINEx(...)

    Linus Torvalds
     
  • …ernel/git/tyhicks/ecryptfs

    Pull eCryptfs update from Tyler Hicks:
    "Improve performance when AES-NI (and most likely other crypto
    accelerators) is available by moving to the ablkcipher crypto API.
    The improvement is more apparent on faster storage devices.

    There's no noticeable change when hardware crypto is not available"

    * tag 'ecryptfs-3.10-rc1-ablkcipher' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs:
    eCryptfs: Use the ablkcipher crypto API

    Linus Torvalds
     

10 May, 2013

10 commits

  • Pull m68knommu updates from Greg Ungerer:
    "The bulk of the changes are generalizing the ColdFire v3 core support
    and adding in 537x CPU support. Also a couple of other bug fixes, one
    to fix a reintroduction of a past bug in the romfs filesystem nommu
    support."

    * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu:
    m68knommu: enable Timer on coldfire 532x
    m68knommu: fix ColdFire 5373/5329 QSPI base address
    m68knommu: add support for configuring a Freescale M5373EVB board
    m68knommu: add support for the ColdFire 537x family of CPUs
    m68knommu: make ColdFire M532x platform support more v3 generic
    m68knommu: create and use a common M53xx ColdFire class of CPUs
    m68k: remove unused asm/dbg.h
    m68k: Set ColdFire ACR1 cache mode depending on kernel configuration
    romfs: fix nommu map length to keep inside filesystem
    m68k: clean up unused "config ROMVECSIZE"

    Linus Torvalds
     
  • Make the switch from the blkcipher kernel crypto interface to the
    ablkcipher interface.

    encrypt_scatterlist() and decrypt_scatterlist() now use the ablkcipher
    interface but, from the eCryptfs standpoint, still treat the crypto
    operation as a synchronous operation. They submit the async request and
    then wait until the operation is finished before they return. Most of
    the changes are contained inside those two functions.

    Despite waiting for the completion of the crypto operation, the
    ablkcipher interface provides performance increases in most cases when
    used on AES-NI capable hardware.
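
    The submit-then-wait pattern described above can be sketched in plain
    userspace C. This is only an illustration under stated assumptions: every
    demo_* name is hypothetical, the XOR transform is a toy and not real
    crypto, and the polling loop stands in for sleeping on a completion. It
    is not the eCryptfs or kernel crypto API code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical async request: the engine calls ->complete() when done. */
struct demo_request {
    unsigned char *buf;
    size_t len;
    unsigned char key;
    void (*complete)(struct demo_request *req, int rc);
    void *data;                 /* caller context for the callback */
};

/* Caller-side completion state, loosely analogous to a completion object. */
struct demo_result {
    int done;
    int rc;
};

static struct demo_request *demo_pending;   /* one-slot engine queue */

/* Queue the request and report that it will finish later. */
static int demo_submit(struct demo_request *req)
{
    demo_pending = req;
    return -EINPROGRESS;
}

/* Stand-in for the accelerator: process the queued request, then notify. */
static void demo_engine_poll(void)
{
    struct demo_request *req = demo_pending;

    if (!req)
        return;
    demo_pending = NULL;
    for (size_t i = 0; i < req->len; i++)
        req->buf[i] ^= req->key;    /* toy transform, not real crypto */
    req->complete(req, 0);
}

/* Completion callback: record the result and mark the request finished. */
static void demo_done(struct demo_request *req, int rc)
{
    struct demo_result *res = req->data;

    res->rc = rc;
    res->done = 1;
}

/* Synchronous wrapper: submit asynchronously, then wait for the callback. */
static int demo_crypt_sync(unsigned char *buf, size_t len, unsigned char key)
{
    struct demo_result res = { 0, 0 };
    struct demo_request req = { buf, len, key, demo_done, &res };
    int rc = demo_submit(&req);

    if (rc == -EINPROGRESS) {
        while (!res.done)
            demo_engine_poll();     /* a kernel caller would sleep here */
        rc = res.rc;
    }
    return rc;
}
```

    Callers of demo_crypt_sync() see an ordinary blocking function, which is
    why most of the change can stay inside the two wrapper functions.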

    Signed-off-by: Tyler Hicks
    Acked-by: Colin King
    Reviewed-by: Zeev Zilberman
    Cc: Dustin Kirkland
    Cc: Tim Chen
    Cc: Ying Huang
    Cc: Thieu Le
    Cc: Li Wang
    Cc: Jarkko Sakkinen

    Tyler Hicks
     
  • Pull trivial pstore update from Tony Luck:
    "Couple of pstore cleanups"

    It turns out that the kmemdup() conversion ends up being undone by the
    fact that the memory block also needed the ecc information (see commit
    bd08ec33b5c2: "pstore/ram: Restore ecc information block"), so all that
    remains after merging is the error return code change.

    * tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
    pstore/ram: fix error return code in ramoops_probe()
    fs: pstore: Replaced calls to kmalloc and memcpy with kmemdup

    Linus Torvalds
     
  • Pull more vfs fixes from Al Viro:
    "Regression fix from Geert + yet another open-coded kernel_read()"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    ecryptfs: don't open-code kernel_read()
    xtensa simdisk: Fix proc_create_data() conversion fallout

    Linus Torvalds
     
  • Pull btrfs update from Chris Mason:
    "These are mostly fixes. The biggest exceptions are Josef's skinny
    extents and Jan Schmidt's code to rebuild our quota indexes if they
    get out of sync (or you enable quotas on an existing filesystem).

    The skinny extents are off by default because they are a new variation
    on the extent allocation tree format. btrfstune -x enables them, and
    the new format makes the extent allocation tree about 30% smaller.

    I rebased this a few days ago to rework Dave Sterba's crc checks on
    the super block, but almost all of these go back to rc6, since I
    thought 3.9 was due any minute.

    The biggest missing fix is the tracepoint bug that was hit late in
    3.9. I ran into problems with that in overnight testing and I'm still
    tracking it down. I'll definitely have that fixed for rc2."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (101 commits)
    Btrfs: allow superblock mismatch from older mkfs
    btrfs: enhance superblock checks
    btrfs: fix misleading variable name for flags
    btrfs: use unsigned long type for extent state bits
    Btrfs: improve the loop of scrub_stripe
    btrfs: read entire device info under lock
    btrfs: remove unused gfp mask parameter from release_extent_buffer callchain
    btrfs: handle errors returned from get_tree_block_key
    btrfs: make static code static & remove dead code
    Btrfs: deal with errors in write_dev_supers
    Btrfs: remove almost all of the BUG()'s from tree-log.c
    Btrfs: deal with free space cache errors while replaying log
    Btrfs: automatic rescan after "quota enable" command
    Btrfs: rescan for qgroups
    Btrfs: split btrfs_qgroup_account_ref into four functions
    Btrfs: allocate new chunks if the space is not enough for global rsv
    Btrfs: separate sequence numbers for delayed ref tracking and tree mod log
    btrfs: move leak debug code to functions
    Btrfs: return free space in cow error path
    Btrfs: set UUID in root_item for created trees
    ...

    Linus Torvalds
     
  • Pull xfs update (#2) from Ben Myers:

    - add CONFIG_XFS_WARN, a step between zero debugging and
    CONFIG_XFS_DEBUG.

    - fix attrmulti and attrlist to fall back to vmalloc when kmalloc
    fails.
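
    The allocation fallback in the second item can be pictured with a small
    userspace sketch. The demo_* names and the 4096-byte limit are
    assumptions made for illustration; both stand-in allocators are backed
    by malloc(), whereas kernel code has to pair each allocator with its own
    free routine.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define DEMO_SMALL_MAX 4096     /* hypothetical limit of the fast allocator */

struct demo_buf {
    void *ptr;
    int from_fallback;          /* remembers which allocator succeeded */
};

/* Stand-in for a contiguous allocator that refuses large requests. */
static void *demo_small_alloc(size_t len)
{
    return len <= DEMO_SMALL_MAX ? malloc(len) : NULL;
}

/* Stand-in for a slower allocator that copes with large requests. */
static void *demo_large_alloc(size_t len)
{
    return malloc(len);
}

/* Try the fast allocator first and fall back only when it fails. */
static int demo_buf_alloc(struct demo_buf *b, size_t len)
{
    b->from_fallback = 0;
    b->ptr = demo_small_alloc(len);
    if (!b->ptr) {
        b->ptr = demo_large_alloc(len);
        b->from_fallback = 1;
    }
    return b->ptr ? 0 : -ENOMEM;
}

/* Both stand-ins use malloc(), so free() releases either kind here. */
static void demo_buf_free(struct demo_buf *b)
{
    free(b->ptr);
    b->ptr = NULL;
}
```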

    * tag 'for-linus-v3.10-rc1-2' of git://oss.sgi.com/xfs/xfs:
    xfs: fallback to vmalloc for large buffers in xfs_compat_attrlist_by_handle
    xfs: fallback to vmalloc for large buffers in xfs_attrlist_by_handle
    xfs: introduce CONFIG_XFS_WARN

    Linus Torvalds
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Pull more NFS client bugfixes from Trond Myklebust:

    - Ensure that we match the 'sec=' mount flavour against the server list

    - Fix the NFSv4 byte range locking in the presence of delegations

    - Ensure that we conform to the NFSv4.1 spec w.r.t. freeing lock
    stateids

    - Fix a pNFS data server connection race

    * tag 'nfs-for-3.10-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
    NFS4.1 Fix data server connection race
    NFSv3: match sec= flavor against server list
    NFSv4.1: Ensure that we free the lock stateid on the server
    NFSv4: Convert nfs41_free_stateid to use an asynchronous RPC call
    SUNRPC: Don't spam syslog with "Pseudoflavor not found" messages
    NFSv4.x: Fix handling of partially delegated locks

    Linus Torvalds
     
  • Toralf reported the following oops to the linux-nfs mailing list:

    -----------------[snip]------------------
    NFSD: unable to generate recoverydir name (-2).
    NFSD: disabling legacy clientid tracking. Reboot recovery will not function correctly!
    BUG: unable to handle kernel NULL pointer dereference at 000003c8
    IP: [] nfsd4_client_tracking_exit+0x11/0x50 [nfsd]
    *pdpt = 000000002ba33001 *pde = 0000000000000000
    Oops: 0000 [#1] SMP
    Modules linked in: loop nfsd auth_rpcgss ipt_MASQUERADE xt_owner xt_multiport ipt_REJECT xt_tcpudp xt_recent xt_conntrack nf_conntrack_ftp xt_limit xt_LOG iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_filter ip_tables x_tables af_packet pppoe pppox ppp_generic slhc bridge stp llc tun arc4 iwldvm mac80211 coretemp kvm_intel uvcvideo sdhci_pci sdhci mmc_core videobuf2_vmalloc videobuf2_memops usblp videobuf2_core i915 iwlwifi psmouse videodev cfg80211 kvm fbcon bitblit cfbfillrect acpi_cpufreq mperf evdev softcursor font cfbimgblt i2c_algo_bit cfbcopyarea intel_agp intel_gtt drm_kms_helper snd_hda_codec_conexant drm agpgart fb fbdev tpm_tis thinkpad_acpi tpm nvram e1000e rfkill thermal ptp wmi pps_core tpm_bios 8250_pci processor 8250 ac snd_hda_intel snd_hda_codec snd_pcm battery video i2c_i801 snd_page_alloc snd_timer button serial_core i2c_core snd soundcore thermal_sys hwmon aesni_intel ablk_helper cryptd lrw aes_i586 xts gf128mul cbc fuse nfs lockd sunrpc dm_crypt dm_mod hid_monterey hid_microsoft hid_logitech hid_ezkey hid_cypress hid_chicony hid_cherry hid_belkin hid_apple hid_a4tech hid_generic usbhid hid sr_mod cdrom sg [last unloaded: microcode]
    Pid: 6374, comm: nfsd Not tainted 3.9.1 #6 LENOVO 4180F65/4180F65
    EIP: 0060:[] EFLAGS: 00010202 CPU: 0
    EIP is at nfsd4_client_tracking_exit+0x11/0x50 [nfsd]
    EAX: 00000000 EBX: fffffffe ECX: 00000007 EDX: 00000007
    ESI: eb9dcb00 EDI: eb2991c0 EBP: eb2bde38 ESP: eb2bde34
    DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
    CR0: 80050033 CR2: 000003c8 CR3: 2ba80000 CR4: 000407f0
    DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
    DR6: ffff0ff0 DR7: 00000400
    Process nfsd (pid: 6374, ti=eb2bc000 task=eb2711c0 task.ti=eb2bc000)
    Stack:
    fffffffe eb2bde4c f90a3e0c f90a7754 fffffffe eb0a9c00 eb2bdea0 f90a41ed
    eb2991c0 1b270000 eb2991c0 eb2bde7c f9099ce9 eb2bde98 0129a020 eb29a020
    eb2bdecc eb2991c0 eb2bdea8 f9099da5 00000000 eb9dcb00 00000001 67822f08
    Call Trace:
    [] legacy_recdir_name_error+0x3c/0x40 [nfsd]
    [] nfsd4_create_clid_dir+0x15d/0x1c0 [nfsd]
    [] ? nfsd4_lookup_stateid+0x99/0xd0 [nfsd]
    [] ? nfs4_preprocess_seqid_op+0x85/0x100 [nfsd]
    [] nfsd4_client_record_create+0x37/0x50 [nfsd]
    [] nfsd4_open_confirm+0xfe/0x130 [nfsd]
    [] ? nfsd4_encode_operation+0x61/0x90 [nfsd]
    [] ? nfsd4_free_stateid+0xc0/0xc0 [nfsd]
    [] nfsd4_proc_compound+0x41b/0x530 [nfsd]
    [] nfsd_dispatch+0x8b/0x1a0 [nfsd]
    [] svc_process+0x3dd/0x640 [sunrpc]
    [] nfsd+0xad/0x110 [nfsd]
    [] ? nfsd_destroy+0x70/0x70 [nfsd]
    [] kthread+0x94/0xa0
    [] ret_from_kernel_thread+0x1b/0x28
    [] ? flush_kthread_work+0xd0/0xd0
    Code: 86 b0 00 00 00 90 c5 0a f9 c7 04 24 70 76 0a f9 e8 74 a9 3d c8 eb ba 8d 76 00 55 89 e5 53 66 66 66 66 90 8b 15 68 c7 0a f9 85 d2 88 c8 03 00 00 74 2c 3b 11 77 28 8b 5c 91 08 85 db 74 22 8b
    EIP: [] nfsd4_client_tracking_exit+0x11/0x50 [nfsd] SS:ESP 0068:eb2bde34
    CR2: 00000000000003c8
    ---[ end trace 09e54015d145c9c6 ]---

    The problem appears to be a regression that was introduced in commit
    9a9c6478 "nfsd: make NFSv4 recovery client tracking options per net".
    Prior to that commit, it was safe to pass a NULL net pointer to
    nfsd4_client_tracking_exit in the legacy recdir case, and
    legacy_recdir_name_error did so. After that commit, the net pointer must
    be valid.

    This patch just fixes legacy_recdir_name_error to pass in a valid net
    pointer to that function.

    Cc: # v3.8+
    Cc: Stanislav Kinsbursky
    Reported-and-tested-by: Toralf Förster
    Signed-off-by: Jeff Layton
    Signed-off-by: J. Bruce Fields

    Jeff Layton
     

09 May, 2013

4 commits

  • Pull f2fs updates from Jaegeuk Kim:
    "This patch-set includes the following major enhancement patches.
    - introduce a new global lock scheme
    - add tracepoints on several major functions
    - fix the overall cleaning process focused on victim selection
    - apply the block plugging to merge IOs as much as possible
    - enhance management of free nids and its list
    - enhance the readahead mode for node pages
    - address several critical deadlock conditions
    - reduce lock_page calls

    The other minor bug fixes and enhancements are as follows.
    - calculation mistakes: overflow
    - bio types: READ, READA, and READ_SYNC
    - fix the recovery flow, data races, and null pointer errors"

    * tag 'f2fs-for-v3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (68 commits)
    f2fs: cover free_nid management with spin_lock
    f2fs: optimize scan_nat_page()
    f2fs: code cleanup for scan_nat_page() and build_free_nids()
    f2fs: bugfix for alloc_nid_failed()
    f2fs: recover when journal contains deleted files
    f2fs: continue to mount after failing recovery
    f2fs: avoid deadlock during evict after f2fs_gc
    f2fs: modify the number of issued pages to merge IOs
    f2fs: remove useless #include as we're now using sysfs as debug entry.
    f2fs: fix inconsistent using of NM_WOUT_THRESHOLD
    f2fs: check truncation of mapping after lock_page
    f2fs: enhance alloc_nid and build_free_nids flows
    f2fs: add a tracepoint on f2fs_new_inode
    f2fs: check nid == 0 in add_free_nid
    f2fs: add REQ_META about metadata requests for submit
    f2fs: give a chance to merge IOs by IO scheduler
    f2fs: avoid frequent background GC
    f2fs: add tracepoints to debug checkpoint request
    f2fs: add tracepoints for write page operations
    f2fs: add tracepoints to debug the block allocation
    ...

    Linus Torvalds
     
  • Unlike metadata server mounts, which support multiple mount points to
    the same server via struct nfs_server, data servers support a single connection.

    Concurrent calls to set up the data server connection can race: the first
    call allocates the nfs_client struct and, before the cached struct nfs_client
    pointer can be set, a second call also tries to set up the connection, finds the
    already allocated nfs_client, bumps the reference count, re-initializes the
    session, etc. This results in a hanging data server session after umount.

    Signed-off-by: Andy Adamson
    Signed-off-by: Trond Myklebust

    Andy Adamson
     
  • Fix to return a negative error code from the error handling
    case instead of 0, as done elsewhere in this function.
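
    As a minimal sketch of the error-path convention this fix restores,
    assuming a hypothetical demo_probe() routine (not the real
    ramoops_probe()): every failure branch assigns a negative errno value
    before jumping to its cleanup label, so the caller never sees 0 on
    failure.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical probe routine with goto-based cleanup. */
static int demo_probe(size_t buf_size, int register_ok)
{
    int err;
    char *buf = malloc(buf_size ? buf_size : 1);

    if (!buf) {
        err = -ENOMEM;
        goto fail_out;
    }
    if (!register_ok) {
        /* Omitting this assignment is the bug class being fixed:
         * the function would fall through and report success. */
        err = -EINVAL;
        goto fail_free;
    }
    free(buf);      /* demo only: a real probe would keep the buffer */
    return 0;

fail_free:
    free(buf);
fail_out:
    return err;
}
```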

    Signed-off-by: Wei Yongjun
    Acked-by: Kees Cook
    Signed-off-by: Tony Luck

    Wei Yongjun
     
  • Pull block core updates from Jens Axboe:

    - Major bit is Kent's prep work for immutable bio vecs.

    - Stable candidate fix for a scheduling-while-atomic in the queue
    bypass operation.

    - Fix for the hang on exceeded rq->datalen 32-bit unsigned when merging
    discard bios.

    - Tejun's changes to convert the writeback thread pool to the generic
    workqueue mechanism.

    - Runtime PM framework; SCSI patches exist on top of these in James'
    tree.

    - A few random fixes.

    * 'for-3.10/core' of git://git.kernel.dk/linux-block: (40 commits)
    relay: move remove_buf_file inside relay_close_buf
    partitions/efi.c: replace useless kzalloc's by kmalloc's
    fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read()
    block: fix max discard sectors limit
    blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
    Documentation: cfq-iosched: update documentation help for cfq tunables
    writeback: expose the bdi_wq workqueue
    writeback: replace custom worker pool implementation with unbound workqueue
    writeback: remove unused bdi_pending_list
    aoe: Fix unitialized var usage
    bio-integrity: Add explicit field for owner of bip_buf
    block: Add an explicit bio flag for bios that own their bvec
    block: Add bio_alloc_pages()
    block: Convert some code to bio_for_each_segment_all()
    block: Add bio_for_each_segment_all()
    bounce: Refactor __blk_queue_bounce to not use bi_io_vec
    raid1: use bio_copy_data()
    pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage
    pktcdvd: use bio_copy_data()
    block: Add bio_copy_data()
    ...

    Linus Torvalds
     

08 May, 2013

22 commits

  • After build_free_nids() searches free nid candidates from nat pages and
    current journal blocks, it checks all the candidates if they are allocated
    so that the nat cache has its nid with an allocated block address.

    In this procedure, previously we used
    list_for_each_entry_safe(fnid, next_fnid, &nm_i->free_nid_list, list).
    But this walk is not covered by free_nid_list_lock, resulting in a null pointer bug.

    This patch moves this checking routine inside add_free_nid() in order not to use
    the spin_lock.

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     
  • When nm_i->fcnt > 2 * MAX_FREE_NIDS, stop scanning other NAT entries.

    Signed-off-by: Haicheng Li
    [Jaegeuk Kim: fix handling the return value of add_free_nid()]
    Signed-off-by: Jaegeuk Kim

    Haicheng Li
     
  • This patch does two cleanups:
    1. remove unused variable "fcnt" in build_free_nids().
    2. make scan_nat_page() return void and remove useless variable "fcnt".

    Signed-off-by: Haicheng Li
    Signed-off-by: Jaegeuk Kim

    Haicheng Li
     
  • Directly drop the free_nid cache when nm_i->fcnt > 2 * MAX_FREE_NIDS

    Since nm_i->free_nid_list_lock is not held between sequential calls to
    alloc_nid() and alloc_nid_failed(), other threads may already have added
    new free nids to free_nid_list during this period.

    We need to make sure nm_i->fcnt never exceeds 2 * MAX_FREE_NIDS.

    Signed-off-by: Haicheng Li
    [Jaegeuk Kim: fit the coding style]
    Signed-off-by: Jaegeuk Kim

    Haicheng Li
     
  • When recovering a journal file with fsync data for files that have
    been deleted, don't bail out on recovery.

    Signed-off-by: Chris Fries
    Reviewed-by: Russell Knize
    Reviewed-by: Jason Hrycay
    [Jaegeuk Kim: fit the coding style]
    Signed-off-by: Jaegeuk Kim

    Chris Fries
     
  • When unable to roll forward the journal, we shouldn't bail out and
    not mount; we should continue to attempt the mount. Bad recovery data
    is likely unrecoverable at this point, and requiring the user to try
    to mount again doesn't solve any issues.

    Signed-off-by: Chris Fries
    Reviewed-by: Russell Knize
    Reviewed-by: Jason Hrycay
    Signed-off-by: Jaegeuk Kim

    Chris Fries
     
  • o Deadlock case #1

    Thread 1:
    - writeback_sb_inodes
    - do_writepages
    - f2fs_write_data_pages
    - write_cache_pages
    - f2fs_write_data_page
    - f2fs_balance_fs
    - wait mutex_lock(gc_mutex)

    Thread 2:
    - f2fs_balance_fs
    - mutex_lock(gc_mutex)
    - f2fs_gc
    - f2fs_iget
    - wait iget_locked(inode->i_lock)

    Thread 3:
    - do_unlinkat
    - iput
    - lock(inode->i_lock)
    - evict
    - inode_wait_for_writeback

    o Deadlock case #2

    Thread 1:
    - __writeback_single_inode
    : set I_SYNC
    - do_writepages
    - f2fs_write_data_page
    - f2fs_balance_fs
    - f2fs_gc
    - iput
    - evict
    - inode_wait_for_writeback(I_SYNC)

    In order to avoid this, even though iput is called with the zero-reference
    count, we need to stop the eviction procedure if the inode is on writeback.
    So this patch links f2fs_drop_inode which checks the I_SYNC flag.
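
    The decision made by such a drop_inode hook can be sketched as follows.
    The struct, the DEMO_I_SYNC flag and demo_drop_inode() are hypothetical
    stand-ins chosen for illustration; the real f2fs_drop_inode() may check
    additional conditions.

```c
#include <assert.h>

#define DEMO_I_SYNC 0x1u    /* hypothetical flag: writeback in progress */

struct demo_inode {
    unsigned int nlink;     /* remaining hard links */
    unsigned int state;     /* bitmask of state flags */
};

/* Return nonzero when the final iput() may evict the inode right away. */
static int demo_drop_inode(const struct demo_inode *inode)
{
    /* Writeback still owns the inode: evicting now would wait on
     * I_SYNC while writeback waits on us, so keep it cached. */
    if (inode->state & DEMO_I_SYNC)
        return 0;
    /* Otherwise drop only inodes that have no links left. */
    return inode->nlink == 0;
}
```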

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     
  • Merge more incoming from Andrew Morton:

    - Various fixes which were stalled or which I picked up recently

    - A large rotorooting of the AIO code. Allegedly to improve
    performance but I don't really have good performance numbers (I might
    have lost the email) and I can't raise Kent today. I held this out
    of 3.9 and we could give it another cycle if it's all too late/scary.

    I ended up taking only the first two thirds of the AIO rotorooting. I
    left the percpu parts and the batch completion for later. - Linus

    * emailed patches from Andrew Morton : (33 commits)
    aio: don't include aio.h in sched.h
    aio: kill ki_retry
    aio: kill ki_key
    aio: give shared kioctx fields their own cachelines
    aio: kill struct aio_ring_info
    aio: kill batch allocation
    aio: change reqs_active to include unreaped completions
    aio: use cancellation list lazily
    aio: use flush_dcache_page()
    aio: make aio_read_evt() more efficient, convert to hrtimers
    wait: add wait_event_hrtimeout()
    aio: refcounting cleanup
    aio: make aio_put_req() lockless
    aio: do fget() after aio_get_req()
    aio: dprintk() -> pr_debug()
    aio: move private stuff out of aio.h
    aio: add kiocb_cancel()
    aio: kill return value of aio_complete()
    char: add aio_{read,write} to /dev/{null,zero}
    aio: remove retry-based AIO
    ...

    Linus Torvalds
     
  • Faster kernel compiles by way of fewer unnecessary includes.

    [akpm@linux-foundation.org: fix fallout]
    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Kent Overstreet
    Cc: Zach Brown
    Cc: Felipe Balbi
    Cc: Greg Kroah-Hartman
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Rusty Russell
    Cc: Jens Axboe
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Jeff Moyer
    Cc: Al Viro
    Cc: Benjamin LaHaise
    Reviewed-by: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kent Overstreet
     
  • Thanks to Zach Brown's work to rip out the retry infrastructure, we don't
    need this anymore - ki_retry was only called right after the kiocb was
    initialized.

    This also refactors and trims some duplicated code, as well as cleaning up
    the refcounting/error handling a bit.

    [akpm@linux-foundation.org: use fmode_t in aio_run_iocb()]
    [akpm@linux-foundation.org: fix file_start_write/file_end_write tests]
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Kent Overstreet
    Cc: Zach Brown
    Cc: Felipe Balbi
    Cc: Greg Kroah-Hartman
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Rusty Russell
    Cc: Jens Axboe
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Jeff Moyer
    Cc: Al Viro
    Cc: Benjamin LaHaise
    Reviewed-by: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kent Overstreet
     
  • ki_key wasn't actually used for anything previously - it was always 0.
    Drop it to trim struct kiocb a bit.

    Signed-off-by: Kent Overstreet
    Cc: Zach Brown
    Cc: Felipe Balbi
    Cc: Greg Kroah-Hartman
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Rusty Russell
    Cc: Jens Axboe
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Jeff Moyer
    Cc: Al Viro
    Cc: Benjamin LaHaise
    Reviewed-by: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kent Overstreet
     
  • Jiri reported a regression in auditing of open(..., O_CREAT) syscalls.
    In older kernels, creating a file with open(..., O_CREAT) created
    audit_name records that looked like this:

    type=PATH msg=audit(1360255720.628:64): item=1 name="/abc/foo" inode=138810 dev=fd:00 mode=0100640 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:default_t:s0
    type=PATH msg=audit(1360255720.628:64): item=0 name="/abc/" inode=138635 dev=fd:00 mode=040750 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:default_t:s0

    ...in recent kernels though, they look like this:

    type=PATH msg=audit(1360255402.886:12574): item=2 name=(null) inode=264599 dev=fd:00 mode=0100640 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:default_t:s0
    type=PATH msg=audit(1360255402.886:12574): item=1 name=(null) inode=264598 dev=fd:00 mode=040750 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:default_t:s0
    type=PATH msg=audit(1360255402.886:12574): item=0 name="/abc/foo" inode=264598 dev=fd:00 mode=040750 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:default_t:s0

    Richard bisected to determine that the problems started with commit
    bfcec708, but the log messages have changed with some later
    audit-related patches.

    The problem is that this audit_inode call is passing in the parent of
    the dentry being opened, but audit_inode is being called with the parent
    flag false. This causes later audit_inode and audit_inode_child calls to
    match the wrong entry in the audit_names list.

    This patch simply sets the flag to properly indicate that this inode
    represents the parent. With this, the audit_names entries are back to
    looking like they did before.

    Cc: # v3.7+
    Reported-by: Jiri Jaburek
    Signed-off-by: Jeff Layton
    Tested-by: Richard Guy Briggs
    Signed-off-by: Eric Paris

    Jeff Layton
     
  • [akpm@linux-foundation.org: make reqs_active __cacheline_aligned_in_smp]
    Signed-off-by: Kent Overstreet
    Cc: Zach Brown
    Cc: Felipe Balbi
    Cc: Greg Kroah-Hartman
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Rusty Russell
    Cc: Jens Axboe
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Jeff Moyer
    Cc: Al Viro
    Cc: Benjamin LaHaise
    Reviewed-by: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kent Overstreet
     
  • struct aio_ring_info was kind of odd, the only place it's used is where
    it's embedded in struct kioctx - there's no real need for it.

    The next patch rearranges struct kioctx and puts various things on their
    own cachelines - getting rid of struct aio_ring_info now makes that
    reordering a bit clearer.

    Signed-off-by: Kent Overstreet
    Cc: Zach Brown
    Cc: Felipe Balbi
    Cc: Greg Kroah-Hartman
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Rusty Russell
    Cc: Jens Axboe
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Jeff Moyer
    Cc: Al Viro
    Cc: Benjamin LaHaise
    Reviewed-by: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kent Overstreet
     
  • Previously, allocating a kiocb required touching quite a few global
    (well, per kioctx) cachelines... so batching up allocation to amortize
    those was worthwhile. But we've gotten rid of some of those, and in
    another couple of patches kiocb allocation won't require writing to any
    shared cachelines, so that means we can just rip this code out.

    Signed-off-by: Kent Overstreet
    Cc: Zach Brown
    Cc: Felipe Balbi
    Cc: Greg Kroah-Hartman
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Rusty Russell
    Cc: Jens Axboe
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Jeff Moyer
    Cc: Al Viro
    Cc: Benjamin LaHaise
    Reviewed-by: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kent Overstreet
     
  • The aio code tries really hard to avoid having to deal with the
    completion ringbuffer overflowing. To do that, it has to keep track of
    the number of outstanding kiocbs, and the number of completions
    currently in the ringbuffer - and it's got to check that every time we
    allocate a kiocb. Ouch.

    But - we can improve this quite a bit if we just change reqs_active to
    mean "number of outstanding requests and unreaped completions" - that
    means kiocb allocation doesn't have to look at the ringbuffer, which is
    a fairly significant win.
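
    A rough single-threaded model of the new accounting, assuming
    hypothetical demo_* names and a four-slot ring: the counter is raised at
    allocation and lowered only when a completion is reaped, so the
    allocation path never needs to inspect the ring.

```c
#include <assert.h>
#include <errno.h>

#define DEMO_NR 4   /* hypothetical ring capacity */

struct demo_ctx {
    unsigned int reqs_active;   /* outstanding requests + unreaped completions */
    unsigned int head, tail;    /* free-running ring indices */
    int ring[DEMO_NR];
};

/* Allocation consults only the counter, never the ring itself. */
static int demo_alloc(struct demo_ctx *c)
{
    if (c->reqs_active >= DEMO_NR)
        return -EAGAIN;
    c->reqs_active++;
    return 0;
}

/* Completion cannot overflow the ring, because every completion
 * belongs to a request already counted in reqs_active. */
static void demo_complete(struct demo_ctx *c, int res)
{
    c->ring[c->tail++ % DEMO_NR] = res;
}

/* Reaping a completion is what finally releases the slot. */
static int demo_reap(struct demo_ctx *c, int *res)
{
    if (c->head == c->tail)
        return 0;
    *res = c->ring[c->head++ % DEMO_NR];
    c->reqs_active--;
    return 1;
}
```

    The trade-off in this model is that a completed but unreaped request
    still occupies a slot, which is what keeps the ring from overflowing.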

    Signed-off-by: Kent Overstreet
    Cc: Zach Brown
    Cc: Felipe Balbi
    Cc: Greg Kroah-Hartman
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Rusty Russell
    Cc: Jens Axboe
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Jeff Moyer
    Cc: Al Viro
    Cc: Benjamin LaHaise
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kent Overstreet
     
  • Cancelling kiocbs requires adding them to a per kioctx linked list,
    which is one of the few things we need to take the kioctx lock for in
    the fast path. But most kiocbs can't be cancelled - so if we just do
    this lazily, we can avoid quite a bit of locking overhead.

    While we're at it, instead of using a flag bit, switch to using ki_cancel
    itself to indicate that a kiocb has been cancelled/completed. This lets
    us get rid of ki_flags entirely.
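
    Both ideas, lazy list insertion and using the cancel pointer itself as
    the state marker, can be modelled in a few lines. This is a
    single-threaded sketch with hypothetical demo_* names; it omits the
    locking and atomic exchange that concurrent kernel code would need.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct demo_req;
typedef int (*demo_cancel_fn)(struct demo_req *req);

struct demo_req {
    demo_cancel_fn cancel;      /* NULL: request is not cancellable */
    struct demo_req *next;
};

struct demo_ctx {
    struct demo_req *active;    /* only cancellable requests are linked */
    unsigned int list_ops;      /* counts list insertions, for the demo */
};

/* Sentinel stored in ->cancel once a request has been cancelled. */
static int demo_cancelled(struct demo_req *req)
{
    (void)req;
    return -EINVAL;
}

/* Called only by drivers that support cancellation: the lazy part. */
static void demo_set_cancel(struct demo_ctx *ctx, struct demo_req *req,
                            demo_cancel_fn fn)
{
    if (!req->cancel) {
        req->next = ctx->active;
        ctx->active = req;
        ctx->list_ops++;
    }
    req->cancel = fn;
}

/* Cancel at most once; the pointer value replaces a separate flag bit. */
static int demo_cancel(struct demo_req *req)
{
    demo_cancel_fn fn = req->cancel;

    if (!fn || fn == demo_cancelled)
        return -EINVAL;
    req->cancel = demo_cancelled;
    return fn(req);
}

/* Example driver callback that always succeeds. */
static int demo_driver_cancel(struct demo_req *req)
{
    (void)req;
    return 0;
}
```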

    [akpm@linux-foundation.org: remove buggy BUG()]
    Signed-off-by: Kent Overstreet
    Cc: Zach Brown
    Cc: Felipe Balbi
    Cc: Greg Kroah-Hartman
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Rusty Russell
    Cc: Jens Axboe
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Jeff Moyer
    Cc: Al Viro
    Cc: Benjamin LaHaise
    Reviewed-by: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kent Overstreet
     
  • This wasn't causing problems before because it's not needed on x86, but
    it is needed on other architectures.

    Signed-off-by: Kent Overstreet
    Cc: Zach Brown
    Cc: Felipe Balbi
    Cc: Greg Kroah-Hartman
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Rusty Russell
    Cc: Jens Axboe
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Jeff Moyer
    Cc: Al Viro
    Cc: Benjamin LaHaise
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kent Overstreet
     
  • Previously, aio_read_event() pulled a single completion off the
    ringbuffer at a time, locking and unlocking each time. Change it to
    pull off as many events as it can at a time, and copy them directly to
    userspace.

    This also fixes a bug where if copying the event to userspace failed,
    we'd lose the event.

    Also convert it to wait_event_interruptible_hrtimeout(), which
    simplifies it quite a bit.
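
    The batching and the lost-event fix can be illustrated with a small ring
    reader. Everything here is an assumption for the sake of the sketch: the
    demo_* names, the eight-slot ring, and demo_copy_out(), which stands in
    for a copy to userspace and fails when given a NULL destination. Locking
    and the timed wait are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define DEMO_RING 8     /* hypothetical ring size; head == tail means empty */

struct demo_ring {
    unsigned int head, tail;    /* both kept in [0, DEMO_RING) */
    int ev[DEMO_RING];
};

/* Stand-in for a copy to userspace; a NULL destination simulates a fault. */
static int demo_copy_out(int *dst, const int *src, unsigned int n)
{
    if (!dst)
        return -1;
    memcpy(dst, src, n * sizeof(*src));
    return 0;
}

/* Copy out as many events as possible, one contiguous run at a time. */
static int demo_read_events(struct demo_ring *r, int *out, unsigned int max)
{
    int copied = 0;

    while (max && r->head != r->tail) {
        /* The run ends at tail, or at the end of the array on wrap. */
        unsigned int end = r->tail > r->head ? r->tail : DEMO_RING;
        unsigned int n = end - r->head;

        if (n > max)
            n = max;
        if (demo_copy_out(out ? out + copied : NULL, &r->ev[r->head], n))
            return copied ? copied : -EFAULT;
        /* Consume only after the copy succeeded, so nothing is lost. */
        r->head = (r->head + n) % DEMO_RING;
        copied += (int)n;
        max -= n;
    }
    return copied;
}
```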

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Kent Overstreet
    Cc: Zach Brown
    Cc: Felipe Balbi
    Cc: Greg Kroah-Hartman
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Rusty Russell
    Cc: Jens Axboe
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Jeff Moyer
    Cc: Al Viro
    Cc: Benjamin LaHaise
    Reviewed-by: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kent Overstreet
     
  • The usage of ctx->dead was fubar - it makes no sense to explicitly check
    it all over the place, especially when we're already using RCU.

    Now, ctx->dead only indicates whether we've dropped the initial
    refcount. The new teardown sequence is:

    set ctx->dead
    hlist_del_rcu();
    synchronize_rcu();

    Now we know no system calls can take a new ref, and it's safe to drop
    the initial ref:

    put_ioctx();

    We also need to ensure there are no more outstanding kiocbs. This was
    done incorrectly - it was being done in kill_ctx(), and before dropping
    the initial refcount. At this point, other syscalls may still be
    submitting kiocbs!

    Now, we cancel and wait for outstanding kiocbs in free_ioctx(), after
    kioctx->users has dropped to 0 and we know no more iocbs could be
    submitted.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Kent Overstreet
    Cc: Zach Brown
    Cc: Felipe Balbi
    Cc: Greg Kroah-Hartman
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Rusty Russell
    Cc: Jens Axboe
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Jeff Moyer
    Cc: Al Viro
    Cc: Benjamin LaHaise
    Reviewed-by: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kent Overstreet
     
  • Freeing a kiocb needed to touch the kioctx for three things:

    * Pulling it off the reqs_active list
    * Decrementing reqs_active
    * Issuing a wakeup, if the kioctx was in the process of being freed.

    This patch moves these to aio_complete(), for a couple reasons:

    * aio_complete() already has to issue the wakeup, so if we drop the
    kioctx refcount before aio_complete does its wakeup we don't have to
    do it twice.
    * aio_complete currently has to take the kioctx lock, so it makes sense
    for it to pull the kiocb off the reqs_active list too.
    * A later patch is going to change reqs_active to include unreaped
    completions - this will mean allocating a kiocb doesn't have to look
    at the ringbuffer. So taking the decrement of reqs_active out of
    kiocb_free() is useful prep work for that patch.

    This doesn't really affect cancellation, since existing (usb) code that
    implements a cancel function still calls aio_complete() - we just have
    to make sure that aio_complete does the necessary teardown for cancelled
    kiocbs.

    It does affect code paths where we free kiocbs that were never
    submitted; they need to decrement reqs_active and pull the kiocb off the
    reqs_active list. This occurs in two places: kiocb_batch_free(), which
    is going away in a later patch, and the error path in io_submit_one.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Kent Overstreet
    Cc: Zach Brown
    Cc: Felipe Balbi
    Cc: Greg Kroah-Hartman
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Rusty Russell
    Cc: Jens Axboe
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Acked-by: Jeff Moyer
    Cc: Al Viro
    Cc: Benjamin LaHaise
    Reviewed-by: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kent Overstreet
     
  • aio_get_req() will fail if we have the maximum number of requests
    outstanding, which depending on the application may not be uncommon. So
    avoid doing an unnecessary fget().
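
    The reordering amounts to doing the step that commonly fails before
    taking a reference that would then have to be dropped again. A minimal
    sketch, with hypothetical demo_* types in which a plain counter stands
    in for the file reference taken by fget():

```c
#include <assert.h>
#include <errno.h>

struct demo_file {
    int refs;                   /* stand-in for the file reference count */
};

struct demo_ctx {
    unsigned int active;        /* requests currently outstanding */
    unsigned int max;           /* hypothetical per-context limit */
};

/* Reserve the request slot first; touch the file only on success. */
static int demo_submit(struct demo_ctx *ctx, struct demo_file *f)
{
    if (ctx->active >= ctx->max)
        return -EAGAIN;         /* common failure: no reference was taken */
    ctx->active++;
    f->refs++;                  /* stand-in for fget() */
    return 0;
}
```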

    Signed-off-by: Kent Overstreet
    Cc: Zach Brown
    Cc: Felipe Balbi
    Cc: Greg Kroah-Hartman
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Rusty Russell
    Cc: Jens Axboe
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Acked-by: Jeff Moyer
    Cc: Al Viro
    Cc: Benjamin LaHaise
    Reviewed-by: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kent Overstreet