From: Matthew Dillon
Date: Mon, 13 Sep 2010 23:41:40 +0000 (-0700)
Subject: Kernel - Implement swapoff
X-Git-Tag: v2.9.0~210^2~1
X-Git-Url: https://gitweb.dragonflybsd.org/dragonfly.git/commitdiff_plain/9f3543c6db8d588b11ad6d877f92bb3a20d7d0f4

Kernel - Implement swapoff

* General port of the swapoff implementation from FreeBSD to DragonFly,
  with major modifications.

  Modifications to handle swapcache issues (VCHR vnodes with VM objects
  can have swap associations for swapcache).

* Libkvm changes

  So there are two problems with libkvm. The first is not really
  swapoff-related - the new sysctl way of reporting numbers bzero'es
  swap_max elements in the given swap_ary array. This is in contrast to
  the old kvm way, which bzero'es only those elements that will actually
  be filled. So if we have 3 swap devices and swap_max is 16, then the
  sysctl code will zero out all 16 elements and fill the first 4 (one per
  device plus the totals entry), while the old kvm code will zero out
  exactly 4 elements and fill them. Since we want to keep the API stable
  (I learned it the hard way :-) ) I think this fix can be separated out
  and go to master as a bugfix to the newly introduced sysctl way of
  reporting things.

  The second problem only shows up if we introduce a swapoff syscall and
  force use of the old kvm way. It was written with the assumption that
  swap devices can only be added, not removed - it assumes that if I have
  a swap device with index 3, 4 swap devices are active. This is not true
  with swapoff - I can swapon A, B, C and D, then swapoff B and C and
  here we are - I have an active swap device with index 3, but only 2
  devices are active. It turned out to be easier to just rewrite it
  (based on the sysctl way), because that assumption was rather deep and
  everything was based on it. Since per-device swap accounting was
  introduced along with the sysctl way, the kvm way now uses it instead
  of scanning the blist.

  Which brings us to the last change - blist scanning code is now used
  only for debugging purposes.
  getswapinfo_radix() is now called only if the DUMP_TREE flag is set.
  Pieces that touched swap_ary entries are removed; swap_ary and swap_max
  are no longer passed to the scanning code.

  After all that, both ways now work correctly with regard to the swapoff
  call and the old kvm way (the behaviour is exactly the same, all
  boundary cases were tested, the API remains the same). The only (minor)
  difference is that swapctl numbers are a little bit bigger than the kvm
  way ones. That's because the kvm way subtracts dmmax (the assumption is
  that the first dmmax is never allocated), and the sysctl way does not.
  I tried to fix this, but it turns out that we need to introduce a dmmax
  sysctl for that. So if you want I can add it, but I want to hear from
  you first (both on this thing and my changes to libkvm in general).

* Userspace. Add swapoff & adjust manual pages.

Note: Bounty project ($300)

Submitted-by: Ilya Dryomov
---

diff --git a/include/unistd.h b/include/unistd.h index 16e7ba9068..b9efb522eb 100644 --- a/include/unistd.h +++ b/include/unistd.h @@ -557,6 +557,7 @@ int setrgid(gid_t); int setruid(uid_t); void setusershell(void); int strtofflags(char **, u_long *, u_long *); +int swapoff(const char *); int swapon(const char *); int syscall(int, ...); off_t __syscall(quad_t, ...);
diff --git a/lib/libc/sys/Makefile.inc b/lib/libc/sys/Makefile.inc index 11029b0a57..1a9c8cc8f7 100644 --- a/lib/libc/sys/Makefile.inc +++ b/lib/libc/sys/Makefile.inc @@ -167,6 +167,7 @@ MLINKS+=statfs.2 fstatfs.2 MLINKS+=statvfs.2 fstatvfs.2 MLINKS+=symlink.2 symlinkat.2 MLINKS+=syscall.2 __syscall.2 +MLINKS+=swapon.2 swapoff.2 MLINKS+=tls.2 set_tls_area.2 tls.2 get_tls_area.2 MLINKS+=truncate.2 ftruncate.2 MLINKS+=umtx.2 umtx_sleep.2 umtx.2 umtx_wakeup.2
diff --git a/lib/libc/sys/swapon.2 b/lib/libc/sys/swapon.2 index 6f06a60310..4ed3d40277 100644 --- a/lib/libc/sys/swapon.2 +++ b/lib/libc/sys/swapon.2 @@ -33,36 +33,51 @@ .\" $FreeBSD: src/lib/libc/sys/swapon.2,v 1.6.2.6 2001/12/14 18:34:01 ru Exp $ .\"
$DragonFly: src/lib/libc/sys/swapon.2,v 1.3 2006/02/17 19:35:06 swildner Exp $ .\" -.Dd June 4, 1993 +.Dd September 7, 2010 .Dt SWAPON 2 .Os .Sh NAME -.Nm swapon -.Nd add a swap device for interleaved paging/swapping +.Nm swapon , swapoff +.Nd control devices for interleaved paging/swapping .Sh LIBRARY .Lb libc .Sh SYNOPSIS .In unistd.h .Ft int .Fn swapon "const char *special" +.Ft int +.Fn swapoff "const char *special" .Sh DESCRIPTION -.Fn Swapon +The +.Fn swapon +system call makes the block device .Fa special available to the system for -allocation for paging and swapping. The names of potentially +allocation for paging and swapping. +The names of potentially available devices are known to the system and defined at system -configuration time. The size of the swap area on +configuration time. +The size of the swap area on .Fa special is calculated at the time the device is first made available for swapping. +.Pp +The +.Fn swapoff +system call disables paging and swapping on the given device. +All associated swap metadata are deallocated, and the device +is made available for other purposes. .Sh RETURN VALUES If an error has occurred, a value of -1 is returned and .Va errno is set to indicate the error. .Sh ERRORS -.Fn Swapon -succeeds unless: +Both +.Fn swapon +and +.Fn swapoff +can fail if: .Bl -tag -width Er .It Bq Er ENOTDIR A component of the path prefix is not a directory. @@ -77,19 +92,31 @@ Search permission is denied for a component of the path prefix. Too many symbolic links were encountered in translating the pathname. .It Bq Er EPERM The caller is not the super-user. +.It Bq Er EFAULT +The +.Fa special +argument +points outside the process's allocated address space. +.El +.Pp +Additionally, +.Fn swapon +can fail for the following reasons: +.Bl -tag -width Er +.It Bq Er EINVAL +The system has reached the boot-time limit on the number of +swap devices, +.Va vm.nswapdev . .It Bq Er ENOTBLK -.Fa Special +The +.Fa special +argument is not a block device. 
.It Bq Er EBUSY The device specified by .Fa special has already been made available for swapping -.It Bq Er EINVAL -The device configured by -.Fa special -was not -configured into the system as a swap device. .It Bq Er ENXIO The major device number of .Fa special @@ -97,20 +124,31 @@ is out of range (this indicates no device driver exists for the associated hardware). .It Bq Er EIO An I/O error occurred while opening the swap device. -.It Bq Er EFAULT -.Fa Special -points outside the process's allocated address space. +.El +.Pp +Lastly, +.Fn swapoff +can fail if: +.Bl -tag -width Er +.It Bq Er EINVAL +The system is not currently swapping to +.Fa special . +.It Bq Er ENOMEM +Not enough virtual memory is available to safely disable +paging and swapping to the given device. .El .Sh SEE ALSO .Xr config 8 , -.Xr swapon 8 +.Xr swapon 8 , +.Xr sysctl 8 .Sh HISTORY The .Fn swapon -function call appeared in +system call appeared in .Bx 4.0 . -.Sh BUGS -There is no way to stop swapping on a disk so that the pack may be -dismounted. -.Pp -This call will be upgraded in future versions of the system. +The +.Fn swapoff +system call appeared in +.Fx 5.0 +and was later ported to +.Dx 2.7 . 
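Note (not part of the patch): the swapoff(2) error list documented in the swapon.2 changes above could be consumed by a caller roughly as sketched here. This is a minimal illustration under stated assumptions, not code from the commit. The helper names swapoff_errno_hint() and disable_swap() are hypothetical; only swapoff() itself, its declaration in unistd.h, and the EINVAL/ENOMEM/EPERM/EFAULT meanings come from the patch. The actual call is compiled only on DragonFly, because other systems declare swapoff() elsewhere or with a different signature.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: map the swapoff(2) errors described in the
 * manual page to short hints. Anything not listed there falls back
 * to strerror().
 */
static const char *
swapoff_errno_hint(int err)
{
	switch (err) {
	case EINVAL:
		return ("system is not currently swapping to this device");
	case ENOMEM:
		return ("not enough virtual memory to safely disable swap on this device");
	case EPERM:
		return ("caller is not the super-user");
	case EFAULT:
		return ("path argument points outside the address space");
	default:
		return (strerror(err));
	}
}

#ifdef __DragonFly__
/*
 * Hypothetical wrapper: disable swapping on one device and report
 * failures. Returns 0 on success, -1 on error (errno is left as set
 * by swapoff()).
 */
static int
disable_swap(const char *special)
{
	if (swapoff(special) == -1) {
		fprintf(stderr, "swapoff %s: %s\n", special,
		    swapoff_errno_hint(errno));
		return (-1);
	}
	return (0);
}
#endif
```

The ENOMEM case is the one worth handling separately in a caller: per the manual page text it means the remaining memory and swap cannot absorb the pages on the device, so retrying without freeing memory is unlikely to help.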
diff --git a/lib/libkvm/kvm_getswapinfo.c b/lib/libkvm/kvm_getswapinfo.c index 51e3f75e60..98cb5e1bae 100644 --- a/lib/libkvm/kvm_getswapinfo.c +++ b/lib/libkvm/kvm_getswapinfo.c @@ -77,8 +77,8 @@ static int nswdev; static int unswdev; static int dmmax; -static void getswapinfo_radix(kvm_t *kd, struct kvm_swap *swap_ary, - int swap_max, int flags); +static int nlist_init(kvm_t *kd); +static void dump_blist(kvm_t *kd); static int kvm_getswapinfo_sysctl(kvm_t *kd, struct kvm_swap *swap_ary, int swap_max, int flags); @@ -102,6 +102,17 @@ static int kvm_getswapinfo_sysctl(kvm_t *kd, struct kvm_swap *swap_ary, return (0); \ } +#define GETSWDEVNAME(dev, str, flags) \ + if (dev == NODEV) { \ + strlcpy(str, "[NFS swap]", sizeof(str)); \ + } else { \ + snprintf( \ + str, sizeof(str), "%s%s", \ + ((flags & SWIF_DEV_PREFIX) ? _PATH_DEV : ""), \ + devname(dev, S_IFCHR) \ + ); \ + } + int kvm_getswapinfo( kvm_t *kd, @@ -109,7 +120,10 @@ kvm_getswapinfo( int swap_max, int flags ) { - int ti; + int i, ti, swi; + int ttl; + struct swdevt *sw; + struct swdevt swinfo; /* * clear cache @@ -119,6 +133,9 @@ kvm_getswapinfo( return(0); } + if (swap_max < 1) + return (-1); + /* * Use sysctl if possible */ @@ -131,113 +148,90 @@ kvm_getswapinfo( /* * namelist */ - if (kvm_swap_nl_cached == 0) { - struct swdevt *sw; + if (!nlist_init(kd)) + return (-1); - if (kvm_nlist(kd, kvm_swap_nl) < 0) - return(-1); + swi = unswdev; + if (swi >= swap_max) + swi = swap_max - 1; - /* - * required entries - */ + bzero(swap_ary, sizeof(struct kvm_swap) * (swi + 1)); - if ( - kvm_swap_nl[NL_SWDEVT].n_value == 0 || - kvm_swap_nl[NL_NSWDEV].n_value == 0 || - kvm_swap_nl[NL_DMMAX].n_value == 0 || - kvm_swap_nl[NL_SWAPBLIST].n_type == 0 - ) { - return(-1); - } + KGET(NL_SWDEVT, sw); + for (i = ti = 0; i < nswdev; ++i) { + KGET2(&sw[i], &swinfo, sizeof(swinfo), "swinfo"); - /* - * get globals, type of swap - */ - - KGET(NL_NSWDEV, nswdev); - KGET(NL_DMMAX, dmmax); + if (swinfo.sw_nblks == 0) + continue; /* - * 
figure out how many actual swap devices are enabled + * The first dmmax is never allocated to avoid + * trashing the disklabels. */ + ttl = swinfo.sw_nblks - dmmax; + if (ttl == 0) + continue; - KGET(NL_SWDEVT, sw); - for (unswdev = nswdev - 1; unswdev >= 0; --unswdev) { - struct swdevt swinfo; + swap_ary[swi].ksw_total += ttl; + swap_ary[swi].ksw_used += swinfo.sw_nused; - KGET2(&sw[unswdev], &swinfo, sizeof(swinfo), "swinfo"); - if (swinfo.sw_nblks) - break; + if (ti < swi) { + swap_ary[ti].ksw_total = ttl; + swap_ary[ti].ksw_used = swinfo.sw_nused; + swap_ary[ti].ksw_flags = swinfo.sw_flags; + GETSWDEVNAME(swinfo.sw_dev, swap_ary[ti].ksw_devname, + flags); + ++ti; } - ++unswdev; + } + + if (flags & SWIF_DUMP_TREE) + dump_blist(kd); + return (swi); +} - kvm_swap_nl_cached = 1; +static int +nlist_init(kvm_t *kd) +{ + int i; + struct swdevt *sw; + struct swdevt swinfo; + + if (kvm_swap_nl_cached) + return (1); + + if (kvm_nlist(kd, kvm_swap_nl) < 0) + return (0); + + /* + * required entries + */ + if (kvm_swap_nl[NL_SWDEVT].n_value == 0 || + kvm_swap_nl[NL_NSWDEV].n_value == 0 || + kvm_swap_nl[NL_DMMAX].n_value == 0 || + kvm_swap_nl[NL_SWAPBLIST].n_type == 0) { + return (0); } - { - struct swdevt *sw; - int i; + /* + * get globals, type of swap + */ + KGET(NL_NSWDEV, nswdev); + KGET(NL_DMMAX, dmmax); - ti = unswdev; - if (ti >= swap_max) - ti = swap_max - 1; + /* + * figure out how many actual swap devices are enabled + */ + KGET(NL_SWDEVT, sw); + for (i = unswdev = 0; i < nswdev; ++i) { + KGET2(&sw[i], &swinfo, sizeof(swinfo), "swinfo"); + if (swinfo.sw_nblks) + ++unswdev; - if (ti >= 0) - bzero(swap_ary, sizeof(struct kvm_swap) * (ti + 1)); - - KGET(NL_SWDEVT, sw); - for (i = 0; i < unswdev; ++i) { - struct swdevt swinfo; - int ttl; - - KGET2(&sw[i], &swinfo, sizeof(swinfo), "swinfo"); - - /* - * old style: everything in DEV_BSIZE'd chunks, - * convert to pages. - * - * new style: swinfo in DEV_BSIZE'd chunks but dmmax - * in pages. 
- * - * The first dmmax is never allocating to avoid - * trashing the disklabels - */ - - ttl = swinfo.sw_nblks - dmmax; - - if (ttl == 0) - continue; - - if (i < ti) { - swap_ary[i].ksw_total = ttl; - swap_ary[i].ksw_used = ttl; - swap_ary[i].ksw_flags = swinfo.sw_flags; - if (swinfo.sw_dev == NODEV) { - snprintf( - swap_ary[i].ksw_devname, - sizeof(swap_ary[i].ksw_devname), - "%s", - "[NFS swap]" - ); - } else { - snprintf( - swap_ary[i].ksw_devname, - sizeof(swap_ary[i].ksw_devname), - "%s%s", - ((flags & SWIF_DEV_PREFIX) ? _PATH_DEV : ""), - devname(swinfo.sw_dev, S_IFCHR) - ); - } - } - if (ti >= 0) { - swap_ary[ti].ksw_total += ttl; - swap_ary[ti].ksw_used += ttl; - } - } } - getswapinfo_radix(kd, swap_ary, swap_max, flags); - return(ti); + kvm_swap_nl_cached = 1; + return (1); } /* @@ -257,14 +251,10 @@ scanradix( kvm_t *kd, int dmmax, int nswdev, - struct kvm_swap *swap_ary, - int swap_max, - int tab, - int flags + int tab ) { blmeta_t meta; blmeta_t scan_array[BLIST_BMAP_RADIX]; - int ti = (unswdev >= swap_max) ? swap_max - 1 : unswdev; if (scan_cache) { meta = *scan_cache; @@ -282,13 +272,11 @@ scanradix( * Terminator */ if (meta.bm_bighint == (swblk_t)-1) { - if (flags & SWIF_DUMP_TREE) { - printf("%*.*s(0x%06x,%lld) Terminator\n", - TABME, - blk, - (long long)radix - ); - } + printf("%*.*s(0x%06x,%lld) Terminator\n", + TABME, + blk, + (long long)radix + ); return(-1); } @@ -296,81 +284,33 @@ scanradix( /* * Leaf bitmap */ - int i; - - if (flags & SWIF_DUMP_TREE) { - printf("%*.*s(0x%06x,%lld) Bitmap %08x big=%d\n", - TABME, - blk, - (long long)radix, - (int)meta.u.bmu_bitmap, - meta.bm_bighint - ); - } + printf("%*.*s(0x%06x,%lld) Bitmap %08x big=%d\n", + TABME, + blk, + (long long)radix, + (int)meta.u.bmu_bitmap, + meta.bm_bighint + ); - /* - * If not all allocated, count. 
- */ - if (meta.u.bmu_bitmap != 0) { - for (i = 0; i < BLIST_BMAP_RADIX && i < count; ++i) { - /* - * A 0 bit means allocated - */ - if ((meta.u.bmu_bitmap & (1 << i))) { - int t = 0; - - if (nswdev) - t = (blk + i) / dmmax % nswdev; - if (t < ti) - --swap_ary[t].ksw_used; - if (ti >= 0) - --swap_ary[ti].ksw_used; - } - } - } } else if (meta.u.bmu_avail == radix) { /* * Meta node if all free */ - if (flags & SWIF_DUMP_TREE) { - printf("%*.*s(0x%06x,%lld) Submap ALL-FREE {\n", - TABME, - blk, - (long long)radix - ); - } - /* - * Note: both dmmax and radix are powers of 2. However, dmmax - * may be larger then radix so use a smaller increment if - * necessary. - */ - { - int t; - int tinc = dmmax; - - while (tinc > radix) - tinc >>= 1; - - for (t = blk; t < blk + radix; t += tinc) { - int u = (nswdev) ? (t / dmmax % nswdev) : 0; + printf("%*.*s(0x%06x,%lld) Submap ALL-FREE {\n", + TABME, + blk, + (long long)radix + ); - if (u < ti) - swap_ary[u].ksw_used -= tinc; - if (ti >= 0) - swap_ary[ti].ksw_used -= tinc; - } - } } else if (meta.u.bmu_avail == 0) { /* * Meta node if all used */ - if (flags & SWIF_DUMP_TREE) { - printf("%*.*s(0x%06x,%lld) Submap ALL-ALLOCATED\n", - TABME, - blk, - (long long)radix - ); - } + printf("%*.*s(0x%06x,%lld) Submap ALL-ALLOCATED\n", + TABME, + blk, + (long long)radix + ); } else { /* * Meta node if not all free @@ -378,15 +318,13 @@ scanradix( int i; int next_skip; - if (flags & SWIF_DUMP_TREE) { - printf("%*.*s(0x%06x,%lld) Submap avail=%d big=%d {\n", - TABME, - blk, - (long long)radix, - (int)meta.u.bmu_avail, - meta.bm_bighint - ); - } + printf("%*.*s(0x%06x,%lld) Submap avail=%d big=%d {\n", + TABME, + blk, + (long long)radix, + (int)meta.u.bmu_avail, + meta.bm_bighint + ); radix /= BLIST_META_RADIX; next_skip = skip / BLIST_META_RADIX; @@ -406,24 +344,19 @@ scanradix( kd, dmmax, nswdev, - swap_ary, - swap_max, - tab + 4, - flags + tab + 4 ); if (r < 0) break; blk += (swblk_t)radix; } - if (flags & SWIF_DUMP_TREE) { - 
printf("%*.*s}\n", TABME); - } + printf("%*.*s}\n", TABME); } return(0); } static void -getswapinfo_radix(kvm_t *kd, struct kvm_swap *swap_ary, int swap_max, int flags) +dump_blist(kvm_t *kd) { struct blist *swapblist = NULL; struct blist blcopy = { 0 }; @@ -431,51 +364,32 @@ getswapinfo_radix(kvm_t *kd, struct kvm_swap *swap_ary, int swap_max, int flags) KGET(NL_SWAPBLIST, swapblist); if (swapblist == NULL) { - if (flags & SWIF_DUMP_TREE) - printf("radix tree: NULL - no swap in system\n"); + printf("radix tree: NULL - no swap in system\n"); return; } KGET2(swapblist, &blcopy, sizeof(blcopy), "*swapblist"); - if (flags & SWIF_DUMP_TREE) { - printf("radix tree: %d/%d/%lld blocks, %dK wired\n", - blcopy.bl_free, - blcopy.bl_blocks, - (long long)blcopy.bl_radix, - (int)((blcopy.bl_rootblks * sizeof(blmeta_t) + 1023)/ - 1024) - ); - } - - /* - * XXX Scan the radix tree in the kernel if we have more then one - * swap device so we can get per-device statistics. This can - * get nasty because swap devices are interleaved based on the - * maximum of (4), so the blist winds up not using any shortcuts. - * - * Otherwise just pull the free count out of the blist header, - * which is a billion times faster. 
- */ - if ((flags & SWIF_DUMP_TREE) || unswdev > 1) { - scanradix( - blcopy.bl_root, - NULL, - 0, - blcopy.bl_radix, - blcopy.bl_skip, - blcopy.bl_rootblks, - kd, - dmmax, - nswdev, - swap_ary, - swap_max, - 0, - flags - ); - } else { - swap_ary[0].ksw_used -= blcopy.bl_free; - } + printf("radix tree: %d/%d/%lld blocks, %dK wired\n", + blcopy.bl_free, + blcopy.bl_blocks, + (long long)blcopy.bl_radix, + (int)((blcopy.bl_rootblks * sizeof(blmeta_t) + 1023)/ + 1024) + ); + + scanradix( + blcopy.bl_root, + NULL, + 0, + blcopy.bl_radix, + blcopy.bl_skip, + blcopy.bl_rootblks, + kd, + dmmax, + nswdev, + 0 + ); } static @@ -492,9 +406,6 @@ kvm_getswapinfo_sysctl(kvm_t *kd, struct kvm_swap *swap_ary, char *xswbuf; struct xswdev *xsw; - if (swap_max < 1) - return(-1); - if (sysctlbyname("vm.swap_info_array", NULL, &bytes, NULL, 0) < 0) return(-1); if (bytes == 0) @@ -510,9 +421,6 @@ kvm_getswapinfo_sysctl(kvm_t *kd, struct kvm_swap *swap_ary, return(-1); } - bzero(swap_ary, sizeof(struct kvm_swap) * swap_max); - --swap_max; - /* * Calculate size of xsw entry returned by kernel (it can be larger * than the one we have if there is a version mismatch). @@ -530,8 +438,10 @@ kvm_getswapinfo_sysctl(kvm_t *kd, struct kvm_swap *swap_ary, continue; ++swi; } - if (swi > swap_max) - swi = swap_max; + if (swi >= swap_max) + swi = swap_max - 1; + + bzero(swap_ary, sizeof(struct kvm_swap) * (swi + 1)); /* * Accumulate results. 
If the provided swap_ary[] is too @@ -547,27 +457,12 @@ kvm_getswapinfo_sysctl(kvm_t *kd, struct kvm_swap *swap_ary, swap_ary[swi].ksw_total += xsw->xsw_nblks; swap_ary[swi].ksw_used += xsw->xsw_used; - if (ti < swap_max) { + if (ti < swi) { swap_ary[ti].ksw_total = xsw->xsw_nblks; swap_ary[ti].ksw_used = xsw->xsw_used; swap_ary[ti].ksw_flags = xsw->xsw_flags; - - if (xsw->xsw_dev == NODEV) { - snprintf( - swap_ary[ti].ksw_devname, - sizeof(swap_ary[ti].ksw_devname), - "%s", - "[NFS swap]" - ); - } else { - snprintf( - swap_ary[ti].ksw_devname, - sizeof(swap_ary[ti].ksw_devname), - "%s%s", - ((flags & SWIF_DEV_PREFIX) ? _PATH_DEV : ""), - devname(xsw->xsw_dev, S_IFCHR) - ); - } + GETSWDEVNAME(xsw->xsw_dev, swap_ary[ti].ksw_devname, + flags); ++ti; } } diff --git a/sbin/swapon/Makefile b/sbin/swapon/Makefile index e29a4065b2..14bf60caf0 100644 --- a/sbin/swapon/Makefile +++ b/sbin/swapon/Makefile @@ -4,5 +4,12 @@ PROG= swapon MAN= swapon.8 +LINKS= ${BINDIR}/swapon ${BINDIR}/swapoff +LINKS+= ${BINDIR}/swapon ${BINDIR}/swapctl +MLINKS= swapon.8 swapoff.8 +MLINKS+=swapon.8 swapctl.8 + +DPADD= ${LIBUTIL} +LDADD= -lutil .include diff --git a/sbin/swapon/swapon.8 b/sbin/swapon/swapon.8 index 95502aa9ff..9bf233865a 100644 --- a/sbin/swapon/swapon.8 +++ b/sbin/swapon/swapon.8 @@ -33,44 +33,136 @@ .\" $FreeBSD: src/sbin/swapon/swapon.8,v 1.15.2.2 2001/12/14 15:17:56 ru Exp $ .\" $DragonFly: src/sbin/swapon/swapon.8,v 1.4 2007/08/10 18:28:27 swildner Exp $ .\" -.Dd June 5, 1993 +.Dd September 7, 2010 .Dt SWAPON 8 .Os .Sh NAME -.Nm swapon -.Nd "specify additional device for paging and swapping" +.Nm swapon , swapoff , swapctl +.Nd "specify devices for paging and swapping" .Sh SYNOPSIS -.Nm -.Fl a -.Nm -.Ar special_file ... +.Nm swapon Fl aq | Ar +.Nm swapoff Fl aq | Ar +.Nm swapctl +.Op Fl AghklmsU +.Oo +.Fl a Ar +| +.Fl d Ar +.Oc .Sh DESCRIPTION -.Nm Swapon -is used to specify additional devices on which paging and swapping -are to take place. 
-The system begins by swapping and paging on only a single device -so that only one disk is required at bootstrap time. -Calls to -.Nm -normally occur in the system multi-user initialization file -.Pa /etc/rc -making all swap devices available, so that the paging and swapping -activity is interleaved across several devices. +The +.Nm swapon , swapoff +and +.Nm swapctl +utilities are used to control swap devices in the system. +At boot time all swap entries in +.Pa /etc/fstab +are added automatically when the system goes multi-user. +Swap devices use a fixed interleave; the maximum number of devices +is specified by the kernel configuration option +.Dv NSWAPDEV , +which is typically set to 4. +There is no priority mechanism. .Pp -Normally, the first form is used: -.Bl -tag -width indent -.It Fl a -All devices marked as ``sw'' -swap devices in +The +.Nm swapon +utility adds the specified swap devices to the system. +If the +.Fl a +option is used, all swap devices in .Pa /etc/fstab -are made available unless their ``noauto'' option is also set. -.El +will be added, unless their +.Dq noauto +option is also set. +If the +.Fl q +option is used informational messages will not be +written to standard output when a swap device is added. +.Pp +The +.Nm swapoff +utility removes the specified swap devices from the system. +If the +.Fl a +option is used, all swap devices in +.Pa /etc/fstab +will be removed, unless their +.Dq noauto +option is also set. +If the +.Fl q +option is used informational messages will not be +written to standard output when a swap device is removed. +Note that +.Nm swapoff +will fail and refuse to remove a swap device if there is insufficient +VM (memory + remaining swap devices) to run the system. +The +.Nm swapoff +utility +must move swapped pages out of the device being removed which could +lead to high system loads for a period of time, depending on how +much data has been swapped out to that device. 
.Pp -The second form gives individual block devices as given -in the system swap configuration table. The call makes only this space -available to the system for swap allocation. +The +.Nm swapctl +utility exists primarily for those familiar with other +.Bx Ns s +and may be +used to add, remove, or list swap devices. +Note that the +.Fl a +option is used differently in +.Nm swapctl +and indicates that a specific list of devices should be added. +The +.Fl d +option indicates that a specific list should be removed. +The +.Fl A +and +.Fl U +options to +.Nm swapctl +operate on all swap entries in +.Pa /etc/fstab +which do not have their +.Dq noauto +option set. +.Pp +Swap information can be generated using the +.Xr swapinfo 8 +utility, +.Nm pstat +.Fl s , +or +.Nm swapctl +.Fl l . +The +.Nm swapctl +utility has the following options for listing swap: +.Bl -tag -width indent +.It Fl h +Output values in human-readable form. +.It Fl g +Output values in gigabytes. +.It Fl k +Output values in kilobytes. +.It Fl m +Output values in megabytes. +.It Fl l +List the devices making up system swap. +.It Fl s +Print a summary line for system swap. +.Pp +The +.Ev BLOCKSIZE +environment variable is used if not specifically +overridden. +1K blocks are used by default. +.El .Sh FILES -.Bl -tag -width "/dev/{ad,da}?s?b" -compact +.Bl -tag -width ".Pa /dev/{ad,da}?s?b" -compact .It Pa /dev/{ad,da}?s?b standard paging devices .It Pa /dev/vn?s?b @@ -80,6 +172,9 @@ ASCII filesystem description table .It Pa /etc/vntab ASCII vnode file table .El +.Sh DIAGNOSTICS +These utilities may fail for the reasons described in +.Xr swapon 2 . .Sh SEE ALSO .Xr swapon 2 , .Xr fstab 5 , @@ -89,10 +184,14 @@ ASCII vnode file table .Xr vnconfig 8 .Sh HISTORY The -.Nm -command appeared in +.Nm swapon +utility appeared in .Bx 4.0 . -.Sh BUGS -There is no way to stop paging and swapping on a device. -It is therefore not possible to dismount swap devices which are -mounted during system operation. 
+The +.Nm swapoff +and +.Nm swapctl +utilities appeared in +.Fx 5.1 +and were later ported to +.Dx 2.7 . diff --git a/sbin/swapon/swapon.c b/sbin/swapon/swapon.c index a3e2cacf6f..8fcefeee5f 100644 --- a/sbin/swapon/swapon.c +++ b/sbin/swapon/swapon.c @@ -36,6 +36,11 @@ * $DragonFly: src/sbin/swapon/swapon.c,v 1.5 2005/11/06 12:50:21 swildner Exp $ */ +#include +#include +#include +#include + #include #include #include @@ -43,61 +48,148 @@ #include #include #include +#include +#include static void usage(void); -int add(char *name, int ignoreebusy); +static int swap_on_off(char *name, int doingall); +static void swaplist(int lflag, int sflag, int hflag); + +enum { SWAPON, SWAPOFF, SWAPCTL } orig_prog, which_prog = SWAPCTL; int main(int argc, char **argv) { struct fstab *fsp; - int stat; - int ch, doall; + char *ptr; + int ret; + int ch; + int doall, sflag, lflag, hflag, qflag; - doall = 0; - while ((ch = getopt(argc, argv, "a")) != -1) + if ((ptr = strrchr(argv[0], '/')) == NULL) + ptr = argv[0]; + if (strstr(ptr, "swapon")) + which_prog = SWAPON; + else if (strstr(ptr, "swapoff")) + which_prog = SWAPOFF; + orig_prog = which_prog; + + sflag = lflag = hflag = qflag = doall = 0; + while ((ch = getopt(argc, argv, "AadghklmqsU")) != -1) { switch((char)ch) { + case 'A': + if (which_prog == SWAPCTL) { + doall = 1; + which_prog = SWAPON; + } else { + usage(); + } + break; case 'a': - doall = 1; + if (which_prog == SWAPON || which_prog == SWAPOFF) + doall = 1; + else + which_prog = SWAPON; + break; + case 'd': + if (which_prog == SWAPCTL) + which_prog = SWAPOFF; + else + usage(); + break; + case 'g': + hflag = 'G'; + break; + case 'h': + hflag = 'H'; + break; + case 'k': + hflag = 'K'; + break; + case 'l': + lflag = 1; + break; + case 'm': + hflag = 'M'; + break; + case 'q': + if (which_prog == SWAPON || which_prog == SWAPOFF) + qflag = 1; + break; + case 's': + sflag = 1; + break; + case 'U': + if (which_prog == SWAPCTL) { + doall = 1; + which_prog = SWAPOFF; + } else { + 
usage(); + } break; case '?': default: usage(); } + } argv += optind; - stat = 0; - if (doall) { - while ((fsp = getfsent()) != NULL) { - if (strcmp(fsp->fs_type, FSTAB_SW)) - continue; - if (strstr(fsp->fs_mntops, "noauto")) - continue; - if (add(fsp->fs_spec, 1)) - stat = 1; - else - printf("swapon: adding %s as swap device\n", - fsp->fs_spec); + ret = 0; + if (which_prog == SWAPON || which_prog == SWAPOFF) { + if (doall) { + while ((fsp = getfsent()) != NULL) { + if (strcmp(fsp->fs_type, FSTAB_SW)) + continue; + if (strstr(fsp->fs_mntops, "noauto")) + continue; + if (swap_on_off(fsp->fs_spec, 1)) { + ret = 1; + } else { + if (!qflag) { + printf("%s: %sing %s as swap device\n", + getprogname(), + which_prog == SWAPOFF ? "remov" : "add", + fsp->fs_spec); + } + } + } + } else if (*argv == NULL) { + usage(); } - } else if (*argv == NULL) { - usage(); - } - while (*argv) { - stat |= add(getdevpath(*argv, 0), 0); - ++argv; + for (; *argv; ++argv) { + if (swap_on_off(getdevpath(*argv, 0), 0)) { + ret = 1; + } else if (orig_prog == SWAPCTL) { + printf("%s: %sing %s as swap device\n", + getprogname(), + which_prog == SWAPOFF ? "remov" : "add", + *argv); + } + } + } else { + if (lflag || sflag) + swaplist(lflag, sflag, hflag); + else + usage(); } - exit(stat); + exit(ret); } -int -add(char *name, int ignoreebusy) +static int +swap_on_off(char *name, int doingall) { - if (swapon(name) == -1) { - switch (errno) { + if ((which_prog == SWAPOFF ? 
swapoff(name) : swapon(name)) == -1) { + switch(errno) { case EBUSY: - if (!ignoreebusy) + if (!doingall) warnx("%s: device already in use", name); break; + case EINVAL: + if (which_prog == SWAPON) + warnx("%s: NSWAPDEV limit reached", name); + else if (!doingall) + warn("%s", name); + break; default: warn("%s", name); break; @@ -110,6 +202,137 @@ add(char *name, int ignoreebusy) static void usage(void) { - fprintf(stderr, "usage: swapon [-a] [special_file ...]\n"); + fprintf(stderr, "usage: %s ", getprogname()); + switch (orig_prog) { + case SWAPON: + case SWAPOFF: + fprintf(stderr, "-aq | file ...\n"); + break; + case SWAPCTL: + fprintf(stderr, "[-AghklmsU] [-a file ... | -d file ...]\n"); + break; + } exit(1); } + +static void +sizetobuf(char *buf, size_t bufsize, int hflag, long long val, int hlen, + long blocksize) +{ + if (hflag == 'H') { + char tmp[16]; + + humanize_number(tmp, 5, (int64_t)val, "", HN_AUTOSCALE, + HN_B | HN_NOSPACE | HN_DECIMAL); + snprintf(buf, bufsize, "%*s", hlen, tmp); + } else { + snprintf(buf, bufsize, "%*lld", hlen, val / blocksize); + } +} + +static void +swaplist(int lflag, int sflag, int hflag) +{ + size_t ksize, bytes = 0; + char *xswbuf; + struct xswdev *xsw; + int hlen, pagesize; + int i, n; + long blocksize; + long long total, used, tmp_total, tmp_used; + char buf[32]; + + pagesize = getpagesize(); + switch(hflag) { + case 'G': + blocksize = 1024 * 1024 * 1024; + strlcpy(buf, "1GB-blocks", sizeof(buf)); + hlen = 10; + break; + case 'H': + blocksize = -1; + strlcpy(buf, "Bytes", sizeof(buf)); + hlen = 10; + break; + case 'K': + blocksize = 1024; + strlcpy(buf, "1kB-blocks", sizeof(buf)); + hlen = 10; + break; + case 'M': + blocksize = 1024 * 1024; + strlcpy(buf, "1MB-blocks", sizeof(buf)); + hlen = 10; + break; + default: + getbsize(&hlen, &blocksize); + snprintf(buf, sizeof(buf), "%ld-blocks", blocksize); + break; + } + + if (sysctlbyname("vm.swap_info_array", NULL, &bytes, NULL, 0) < 0) + err(1, "sysctlbyname()"); + if (bytes 
== 0) + err(1, "sysctlbyname()"); + + xswbuf = malloc(bytes); + if (sysctlbyname("vm.swap_info_array", xswbuf, &bytes, NULL, 0) < 0) { + free(xswbuf); + err(1, "sysctlbyname()"); + } + if (bytes == 0) { + free(xswbuf); + err(1, "sysctlbyname()"); + } + + /* + * Calculate size of xsw entry returned by kernel (it can be larger + * than the one we have if there is a version mismatch). + */ + ksize = ((struct xswdev *)xswbuf)->xsw_size; + n = (int)(bytes / ksize); + + if (lflag) { + printf("%-13s %*s %*s\n", + "Device:", + hlen, buf, + hlen, "Used:"); + } + + total = used = tmp_total = tmp_used = 0; + for (i = 0; i < n; ++i) { + xsw = (void *)((char *)xswbuf + i * ksize); + + if (xsw->xsw_nblks == 0) + continue; + + if (sflag) { + tmp_total = (long long)xsw->xsw_nblks * pagesize; + tmp_used = (long long)xsw->xsw_used * pagesize; + total += tmp_total; + used += tmp_used; + } + + if (lflag) { + sizetobuf(buf, sizeof(buf), hflag, tmp_total, hlen, + blocksize); + if (xsw->xsw_dev == NODEV) { + printf("%-13s %s ", "[NFS swap]", buf); + } else { + printf("/dev/%-8s %s ", + devname(xsw->xsw_dev, S_IFCHR), buf); + } + + sizetobuf(buf, sizeof(buf), hflag, tmp_used, hlen, + blocksize); + printf("%s\n", buf); + } + } + + if (sflag) { + sizetobuf(buf, sizeof(buf), hflag, total, hlen, blocksize); + printf("Total: %s ", buf); + sizetobuf(buf, sizeof(buf), hflag, used, hlen, blocksize); + printf("%s\n", buf); + } +} diff --git a/sys/kern/subr_blist.c b/sys/kern/subr_blist.c index 9baa219ce5..670dbde344 100644 --- a/sys/kern/subr_blist.c +++ b/sys/kern/subr_blist.c @@ -141,6 +141,9 @@ static swblk_t blst_meta_alloc(blmeta_t *scan, swblk_t blk, static void blst_leaf_free(blmeta_t *scan, swblk_t relblk, int count); static void blst_meta_free(blmeta_t *scan, swblk_t freeBlk, swblk_t count, int64_t radix, int skip, swblk_t blk); +static swblk_t blst_leaf_fill(blmeta_t *scan, swblk_t blk, int count); +static swblk_t blst_meta_fill(blmeta_t *scan, swblk_t fillBlk, swblk_t count, + int64_t 
radix, int skip, swblk_t blk); static void blst_copy(blmeta_t *scan, swblk_t blk, int64_t radix, swblk_t skip, blist_t dest, swblk_t count); static swblk_t blst_radix_init(blmeta_t *scan, int64_t radix, @@ -257,6 +260,32 @@ blist_free(blist_t bl, swblk_t blkno, swblk_t count) } } +/* + * blist_fill() - mark a region in the block bitmap as off-limits + * to the allocator (i.e. allocate it), ignoring any + * existing allocations. Return the number of blocks + * actually filled that were free before the call. + */ + +swblk_t +blist_fill(blist_t bl, swblk_t blkno, swblk_t count) +{ + swblk_t filled; + + if (bl) { + if (bl->bl_radix == BLIST_BMAP_RADIX) { + filled = blst_leaf_fill(bl->bl_root, blkno, count); + } else { + filled = blst_meta_fill(bl->bl_root, blkno, count, + bl->bl_radix, bl->bl_skip, 0); + } + bl->bl_free -= filled; + return (filled); + } else { + return 0; + } +} + /* * blist_resize() - resize an existing radix tree to handle the * specified number of blocks. This will reallocate @@ -605,6 +634,111 @@ blst_meta_free(blmeta_t *scan, swblk_t freeBlk, swblk_t count, } } +/* + * BLST_LEAF_FILL() - allocate specific blocks in leaf bitmap + * + * Allocates all blocks in the specified range regardless of + * any existing allocations in that range. Returns the number + * of blocks allocated by the call. + */ +static swblk_t +blst_leaf_fill(blmeta_t *scan, swblk_t blk, int count) +{ + int n = blk & (BLIST_BMAP_RADIX - 1); + swblk_t nblks; + u_swblk_t mask, bitmap; + + mask = ((u_swblk_t)-1 << n) & + ((u_swblk_t)-1 >> (BLIST_BMAP_RADIX - count - n)); + + /* Count the number of blocks we're about to allocate */ + bitmap = scan->u.bmu_bitmap & mask; + for (nblks = 0; bitmap != 0; nblks++) + bitmap &= bitmap - 1; + + scan->u.bmu_bitmap &= ~mask; + return (nblks); +} + +/* + * BLST_META_FILL() - allocate specific blocks at a meta node + * + * Allocates the specified range of blocks, regardless of + * any existing allocations in the range. 
The range must + * be within the extent of this node. Returns the number + * of blocks allocated by the call. + */ +static swblk_t +blst_meta_fill(blmeta_t *scan, swblk_t fillBlk, swblk_t count, + int64_t radix, int skip, swblk_t blk) +{ + int i; + int next_skip = ((u_int)skip / BLIST_META_RADIX); + swblk_t nblks = 0; + + if (count == radix || scan->u.bmu_avail == 0) { + /* + * ALL-ALLOCATED special case + */ + nblks = scan->u.bmu_avail; + scan->u.bmu_avail = 0; + scan->bm_bighint = count; + return (nblks); + } + + if (scan->u.bmu_avail == radix) { + radix /= BLIST_META_RADIX; + + /* + * ALL-FREE special case, initialize sublevel + */ + for (i = 1; i <= skip; i += next_skip) { + if (scan[i].bm_bighint == (swblk_t)-1) + break; + if (next_skip == 1) { + scan[i].u.bmu_bitmap = (u_swblk_t)-1; + scan[i].bm_bighint = BLIST_BMAP_RADIX; + } else { + scan[i].bm_bighint = (swblk_t)radix; + scan[i].u.bmu_avail = (swblk_t)radix; + } + } + } else { + radix /= BLIST_META_RADIX; + } + + if (count > (swblk_t)radix) + panic("blst_meta_fill: allocation too large"); + + i = (fillBlk - blk) / (swblk_t)radix; + blk += i * (swblk_t)radix; + i = i * next_skip + 1; + + while (i <= skip && blk < fillBlk + count) { + swblk_t v; + + v = blk + (swblk_t)radix - fillBlk; + if (v > count) + v = count; + + if (scan->bm_bighint == (swblk_t)-1) + panic("blst_meta_fill: filling unexpected range"); + + if (next_skip == 1) { + nblks += blst_leaf_fill(&scan[i], fillBlk, v); + } else { + nblks += blst_meta_fill(&scan[i], fillBlk, v, + radix, next_skip - 1, blk); + } + count -= v; + fillBlk += v; + blk += (swblk_t)radix; + i += next_skip; + } + scan->u.bmu_avail -= nblks; + return (nblks); +} + /* * BLIST_RADIX_COPY() - copy one radix tree to another * @@ -914,12 +1048,21 @@ main(int ac, char **av) kprintf("?\n"); } break; + case 'l': + if (sscanf(buf + 1, "%x %d", &da, &count) == 2) { + printf(" n=%d\n", + blist_fill(bl, da, count)); + } else { + kprintf("?\n"); + } + break; case '?': case 'h': puts( "p 
-print\n" "a %d -allocate\n" "f %x %d -free\n" + "l %x %d -fill\n" "r %d -resize\n" "h/? -help" ); diff --git a/sys/kern/syscalls.master b/sys/kern/syscalls.master index 0bcc1f3650..f8dc2f2ef5 100644 --- a/sys/kern/syscalls.master +++ b/sys/kern/syscalls.master @@ -741,3 +741,4 @@ 527 STD POSIX { int readlinkat(int fd, char *path, char *buf, \ size_t bufsize); } 528 STD POSIX { int symlinkat(char *path1, int fd, char *path2); } +529 STD BSD { int swapoff(char *name); } diff --git a/sys/sys/blist.h b/sys/sys/blist.h index 9de5f7387f..1ff7e75fc9 100644 --- a/sys/sys/blist.h +++ b/sys/sys/blist.h @@ -38,6 +38,7 @@ * (void) blist_destroy(blist) * blkno = blist_alloc(blist, count) * (void) blist_free(blist, blkno, count) + * nblks = blist_fill(blist, blkno, count) * (void) blist_resize(&blist, count, freeextra) * * @@ -116,6 +117,7 @@ extern blist_t blist_create(swblk_t blocks); extern void blist_destroy(blist_t blist); extern swblk_t blist_alloc(blist_t blist, swblk_t count); extern void blist_free(blist_t blist, swblk_t blkno, swblk_t count); +extern swblk_t blist_fill(blist_t blist, swblk_t blkno, swblk_t count); extern void blist_print(blist_t blist); extern void blist_resize(blist_t *pblist, swblk_t count, int freenew); diff --git a/sys/sys/conf.h b/sys/sys/conf.h index 8726ee37d1..b437f7eb2a 100644 --- a/sys/sys/conf.h +++ b/sys/sys/conf.h @@ -190,6 +190,7 @@ struct swdevt { #define SW_FREED 0x01 #define SW_SEQUENTIAL 0x02 +#define SW_CLOSING 0x04 #define sw_freed sw_flags /* XXX compat */ #ifdef _KERNEL diff --git a/sys/vm/swap_pager.c b/sys/vm/swap_pager.c index 735033b9eb..dceacb0871 100644 --- a/sys/vm/swap_pager.c +++ b/sys/vm/swap_pager.c @@ -166,7 +166,12 @@ struct blist *swapblist; static int swap_async_max = 4; /* maximum in-progress async I/O's */ static int swap_burst_read = 0; /* allow burst reading */ -extern struct vnode *swapdev_vp; /* from vm_swap.c */ +/* from vm_swap.c */ +extern struct vnode *swapdev_vp; +extern struct swdevt *swdevt; +extern 
int nswdev; + +#define BLK2DEVIDX(blk) (nswdev > 1 ? blk / dmmax % nswdev : 0) SYSCTL_INT(_vm, OID_AUTO, swap_async_max, CTLFLAG_RW, &swap_async_max, 0, "Maximum running async swap ops"); @@ -518,12 +523,19 @@ swp_pager_getswapspace(vm_object_t object, int npages) static __inline void swp_pager_freeswapspace(vm_object_t object, swblk_t blk, int npages) { - blist_free(swapblist, blk, npages); - swapacctspace(blk, npages); + struct swdevt *sp = &swdevt[BLK2DEVIDX(blk)]; + + sp->sw_nused -= npages; if (object->type == OBJT_SWAP) vm_swap_anon_use -= npages; else vm_swap_cache_use -= npages; + + if (sp->sw_flags & SW_CLOSING) + return; + + blist_free(swapblist, blk, npages); + vm_swap_size += npages; swp_sizecheck(); } @@ -1889,6 +1901,86 @@ swp_pager_async_iodone(struct bio *bio) crit_exit(); } +/* + * Fault-in a potentially swapped page and remove the swap reference. + */ +static __inline void +swp_pager_fault_page(vm_object_t object, vm_pindex_t pindex) +{ + struct vnode *vp; + vm_page_t m; + int error; + + if (object->type == OBJT_VNODE) { + /* + * Any swap related to a vnode is due to swapcache. We must + * vget() the vnode in case it is not active (otherwise + * vref() will panic). Calling vm_object_page_remove() will + * ensure that any swap ref is removed interlocked with the + * page. clean_only is set to TRUE so we don't throw away + * dirty pages. + */ + vp = object->handle; + error = vget(vp, LK_SHARED | LK_RETRY | LK_CANRECURSE); + if (error == 0) { + vm_object_page_remove(object, pindex, pindex + 1, TRUE); + vput(vp); + } + } else { + /* + * Otherwise it is a normal OBJT_SWAP object and we can + * fault the page in and remove the swap. 
+ */ + m = vm_fault_object_page(object, IDX_TO_OFF(pindex), + VM_PROT_NONE, + VM_FAULT_DIRTY | VM_FAULT_UNSWAP, + &error); + if (m) + vm_page_unhold(m); + } +} + +int +swap_pager_swapoff(int devidx) +{ + vm_object_t object; + struct swblock *swap; + swblk_t v; + int i; + + lwkt_gettoken(&vm_token); + lwkt_gettoken(&vmobj_token); +rescan: + TAILQ_FOREACH(object, &vm_object_list, object_list) { + if (object->type == OBJT_SWAP || object->type == OBJT_VNODE) { + RB_FOREACH(swap, swblock_rb_tree, &object->swblock_root) { + for (i = 0; i < SWAP_META_PAGES; ++i) { + v = swap->swb_pages[i]; + if (v != SWAPBLK_NONE && + BLK2DEVIDX(v) == devidx) { + swp_pager_fault_page( + object, + swap->swb_index + i); + goto rescan; + } + } + } + } + } + lwkt_reltoken(&vmobj_token); + lwkt_reltoken(&vm_token); + + /* + * If we fail to locate all swblocks we just fail gracefully and + * do not bother to restore paging on the swap device. If the + * user wants to retry the user can retry. + */ + if (swdevt[devidx].sw_nused) + return (1); + else + return (0); +} + /************************************************************************ * SWAP META DATA * ************************************************************************ diff --git a/sys/vm/swap_pager.h b/sys/vm/swap_pager.h index a5543dcb59..6412b2d422 100644 --- a/sys/vm/swap_pager.h +++ b/sys/vm/swap_pager.h @@ -94,9 +94,11 @@ extern int vm_swap_anon_use; extern int vm_swapcache_read_enable; extern int vm_swapcache_inactive_heuristic; extern struct blist *swapblist; +extern int nswap_lowat, nswap_hiwat; void swap_pager_putpages (vm_object_t, struct vm_page **, int, boolean_t, int *); boolean_t swap_pager_haspage (vm_object_t object, vm_pindex_t pindex); +int swap_pager_swapoff (int devidx); int swap_pager_swp_alloc (vm_object_t, int); void swap_pager_copy (vm_object_t, vm_object_t, vm_pindex_t, int); diff --git a/sys/vm/vm_fault.c b/sys/vm/vm_fault.c index 31f9c7df34..872815944c 100644 --- a/sys/vm/vm_fault.c +++ b/sys/vm/vm_fault.c 
@@ -790,6 +790,11 @@ RetryFault: if (fault_type & VM_PROT_WRITE) vm_page_dirty(fs.m); + if (fault_flags & VM_FAULT_DIRTY) + vm_page_dirty(fs.m); + if (fault_flags & VM_FAULT_UNSWAP) + swap_pager_unswapped(fs.m); + /* * Indicate that the page was accessed. */ diff --git a/sys/vm/vm_map.h b/sys/vm/vm_map.h index 14cab11b67..ac325db539 100644 --- a/sys/vm/vm_map.h +++ b/sys/vm/vm_map.h @@ -428,6 +428,7 @@ vmspace_resident_count(struct vmspace *vmspace) #define VM_FAULT_USER_WIRE 0x02 /* Likewise, but for user purposes */ #define VM_FAULT_BURST 0x04 /* Burst fault can be done */ #define VM_FAULT_DIRTY 0x08 /* Dirty the page */ +#define VM_FAULT_UNSWAP 0x10 /* Remove backing store from the page */ #define VM_FAULT_WIRE_MASK (VM_FAULT_CHANGE_WIRING|VM_FAULT_USER_WIRE) #ifdef _KERNEL diff --git a/sys/vm/vm_swap.c b/sys/vm/vm_swap.c index e6660984ee..ae00c76afa 100644 --- a/sys/vm/vm_swap.c +++ b/sys/vm/vm_swap.c @@ -81,7 +81,7 @@ int nswdev = NSWAPDEV; /* exported to pstat/systat */ int vm_swap_size; int vm_swap_max; -static int swapdev_strategy (struct vop_strategy_args *ap); +static int swapoff_one (int index); struct vnode *swapdev_vp; /* @@ -158,13 +158,28 @@ swapdev_strategy(struct vop_strategy_args *ap) return 0; } +static int +swapdev_inactive(struct vop_inactive_args *ap) +{ + vrecycle(ap->a_vp); + return(0); +} + +static int +swapdev_reclaim(struct vop_reclaim_args *ap) +{ + return(0); +} + /* * Create a special vnode op vector for swapdev_vp - we only use * vn_strategy(), everything else returns an error. 
*/ static struct vop_ops swapdev_vnode_vops = { .vop_default = vop_defaultop, - .vop_strategy = swapdev_strategy + .vop_strategy = swapdev_strategy, + .vop_inactive = swapdev_inactive, + .vop_reclaim = swapdev_reclaim }; static struct vop_ops *swapdev_vnode_vops_p = &swapdev_vnode_vops; @@ -251,7 +266,7 @@ swaponvp(struct thread *td, struct vnode *vp, u_quad_t nblks) cdev_t dev; int index; int error; - long blk; + swblk_t blk; cred = td->td_ucred; @@ -339,7 +354,7 @@ swaponvp(struct thread *td, struct vnode *vp, u_quad_t nblks) sp->sw_vp = vp; sp->sw_dev = dev2udev(dev); sp->sw_device = dev; - sp->sw_flags |= SW_FREED; + sp->sw_flags = SW_FREED; sp->sw_nused = 0; /* @@ -371,6 +386,152 @@ swaponvp(struct thread *td, struct vnode *vp, u_quad_t nblks) return (0); } +/* + * swapoff_args(char *name) + * + * System call swapoff(name) disables swapping on device name, + * which must be an active swap device. Return ENOMEM + * if there is not enough memory to page in the contents of + * the given device. + * + * No requirements. 
+ */ +int +sys_swapoff(struct swapoff_args *uap) +{ + struct vnode *vp; + struct nlookupdata nd; + struct swdevt *sp; + int error, index; + + error = priv_check(curthread, PRIV_ROOT); + if (error) + return (error); + + mtx_lock(&swap_mtx); + get_mplock(); + vp = NULL; + error = nlookup_init(&nd, uap->name, UIO_USERSPACE, NLC_FOLLOW); + if (error == 0) + error = nlookup(&nd); + if (error == 0) + error = cache_vref(&nd.nl_nch, nd.nl_cred, &vp); + nlookup_done(&nd); + if (error) + goto done; + + for (sp = swdevt, index = 0; index < nswdev; index++, sp++) { + if (sp->sw_vp == vp) + goto found; + } + error = EINVAL; + goto done; +found: + error = swapoff_one(index); + +done: + rel_mplock(); + mtx_unlock(&swap_mtx); + return (error); +} + +static int +swapoff_one(int index) +{ + swblk_t blk, aligned_nblks; + swblk_t dvbase, vsbase; + u_int pq_active_clean, pq_inactive_clean; + struct swdevt *sp; + vm_page_t m; + + mtx_lock(&swap_mtx); + + sp = &swdevt[index]; + aligned_nblks = sp->sw_nblks; + pq_active_clean = pq_inactive_clean = 0; + + /* + * We can turn off this swap device safely only if the + * available virtual memory in the system will fit the amount + * of data we will have to page back in, plus an epsilon so + * the system doesn't become critically low on swap space. 
+ */ + lwkt_gettoken(&vm_token); + TAILQ_FOREACH(m, &vm_page_queues[PQ_ACTIVE].pl, pageq) { + if (m->flags & (PG_MARKER | PG_FICTITIOUS)) + continue; + + if (m->dirty == 0) { + vm_page_test_dirty(m); + if (m->dirty == 0) + ++pq_active_clean; + } + } + TAILQ_FOREACH(m, &vm_page_queues[PQ_INACTIVE].pl, pageq) { + if (m->flags & (PG_MARKER | PG_FICTITIOUS)) + continue; + + if (m->dirty == 0) { + vm_page_test_dirty(m); + if (m->dirty == 0) + ++pq_inactive_clean; + } + } + lwkt_reltoken(&vm_token); + + if (vmstats.v_free_count + vmstats.v_cache_count + pq_active_clean + + pq_inactive_clean + vm_swap_size < aligned_nblks + nswap_lowat) { + mtx_unlock(&swap_mtx); + return (ENOMEM); + } + + /* + * Prevent further allocations on this device + */ + sp->sw_flags |= SW_CLOSING; + for (dvbase = dmmax; dvbase < aligned_nblks; dvbase += dmmax) { + blk = min(aligned_nblks - dvbase, dmmax); + vsbase = index * dmmax + dvbase * nswdev; + vm_swap_size -= blist_fill(swapblist, vsbase, blk); + vm_swap_max -= blk; + } + + /* + * Page in the contents of the device and close it. + */ + if (swap_pager_swapoff(index)) { + mtx_unlock(&swap_mtx); + return (EINTR); + } + + VOP_CLOSE(sp->sw_vp, FREAD | FWRITE); + vrele(sp->sw_vp); + bzero(swdevt + index, sizeof(struct swdevt)); + + /* + * Resize the bitmap based on the new largest swap device, + * or free the bitmap if there are no more devices. + */ + for (sp = swdevt, aligned_nblks = 0; sp < swdevt + nswdev; sp++) { + if (sp->sw_vp) + aligned_nblks = max(aligned_nblks, sp->sw_nblks); + } + + nswap = aligned_nblks * nswdev; + + if (nswap == 0) { + blist_destroy(swapblist); + swapblist = NULL; + vrele(swapdev_vp); + swapdev_vp = NULL; + } else { + blist_resize(&swapblist, nswap, 0); + } + + mtx_unlock(&swap_mtx); + return (0); +} + /* * Account for swap space in individual swdevt's. The caller ensures * that the provided range falls into a single swdevt.