r344188 glebius Feb. 15, 2019, 11:36 p.m.
back to the lever before r343030.  For 64-bit machines reduce it slightly,
too.  Together with r343030 I bumped the limit up to the value we use at
Netflix to serve 100 Gbit/s of sendfile traffic, and it probably isn't a
good default.

Provide a loader tunable to change vnode pager pbufs count. Document it.
ViewVC
r344187 cem Feb. 15, 2019, 10:55 p.m.
The cached fvdat->filesize is indepedent of the (mostly unused)
cached_attrs, and we failed to update it when a cached (but perhaps
inactive) vnode was found during VOP_LOOKUP to have a different size than
cached.

As noted in the code comment, this can occur in distributed filesystems or
with other kinds of irregular file behavior (anything is possible in FUSE).

We do something similar in fuse_vnop_getattr already.

PR:		230258 (as reported in description; other issues explored in
			comments are not all resolved)
Reported by:	MooseFS FreeBSD Team <freebsd AT moosefs.com>
Submitted by:	Jakub Kruszona-Zawadzki <acid AT moosefs.com> (earlier version)
ViewVC
r344186 cem Feb. 15, 2019, 10:52 p.m.
At least prior to 7.23 (which adds FUSE_WRITEBACK_CACHE), the FUSE protocol
specifies only clean data to be cached.

Prior to this change, we implement and default to writeback caching.  This
is ok enough for local only filesystems without hardlinks, but violates the
general design contract with FUSE and breaks distributed filesystems or
concurrent access to hardlinks of the same inode.

In this change, add cache mode as an extension of cache enable/disable.  The
new modes are UC (was: cache disabled), WT (default), and WB (was: cache
enabled).

For now, WT caching is implemented as write-around, which meets the goal of
only caching clean data.  WT can be better than WA for workloads that
frequently read data that was recently written, but WA is trivial to
implement.  Note that this has no effect on O_WRONLY-opened files, which
were already coerced to write-around.

Refs:
  * https://sourceforge.net/p/fuse/mailman/message/8902254/
  * https://github.com/vgough/encfs/issues/315

PR:		230258 (inspired by)
ViewVC
r344185 cem Feb. 15, 2019, 10:51 p.m.
Most users of fuse_vnode_setsize() set the cached fvdat->filesize and update
the buf cache bounds as a result of either a read from the underlying FUSE
filesystem, or as part of a write-through type operation (like truncate =>
VOP_SETATTR).  In these cases, do not set the FN_SIZECHANGE flag, which
indicates that an inode's data is dirty (in particular, that the local buf
cache and fvdat->filesize have dirty extended data).

PR:		230258 (related)
ViewVC
r344184 cem Feb. 15, 2019, 10:50 p.m.
The FUSE protocol demands that kernel implementations cache user filesystem
path components (lookup/cnp data) for a maximum period of time in the range
of [0, ULONG_MAX] seconds.  In practice, typical requests are for 0, 1, or
10 seconds; or "a long time" to represent indefinite caching.

Historically, FreeBSD FUSE has ignored this client directive entirely.  This
works fine for local-only filesystems, but causes consistency issues with
multi-writer network filesystems.

For now, respect 0 second cache TTLs and do not cache such metadata.
Non-zero metadata caching TTLs in the range [0.000000001, ULONG_MAX] seconds
are still cached indefinitely, because it is unclear how a userspace
filesystem could do anything sensible with those semantics even if
implemented.

Pass fuse_entry_out to fuse_vnode_get when available and only cache lookup
if the user filesystem did not set a zero second TTL.

PR:		230258 (inspired by; does not fix)
ViewVC
r344183 cem Feb. 15, 2019, 10:49 p.m.
The FUSE protocol demands that kernel implementations cache user filesystem
file attributes (vattr data) for a maximum period of time in the range of
[0, ULONG_MAX] seconds.  In practice, typical requests are for 0, 1, or 10
seconds; or "a long time" to represent indefinite caching.

Historically, FreeBSD FUSE has ignored this client directive entirely.  This
works fine for local-only filesystems, but causes consistency issues with
multi-writer network filesystems.

For now, respect 0 second cache TTLs and do not cache such metadata.
Non-zero metadata caching TTLs in the range [0.000000001, ULONG_MAX] seconds
are still cached indefinitely, because it is unclear how a userspace
filesystem could do anything sensible with those semantics even if
implemented.

In the future, as an optimization, we should implement notify_inval_entry,
etc, which provide userspace filesystems a way of evicting the kernel cache.

One potentially bogus access to invalid cached attribute data was left in
fuse_io_strategy.  It is restricted behind the undocumented and non-default
"vfs.fuse.fix_broken_io" sysctl or "brokenio" mount option; maybe these are
deadcode and can be eliminated?

Some minor APIs changed to facilitate this:

1. Attribute cache validity is tracked in FUSE inodes ("fuse_vnode_data").

2. cache_attrs() respects the provided TTL and only caches in the FUSE
inode if TTL > 0.  It also grows an "out" argument, which, if non-NULL,
stores the translated fuse_attr (even if not suitable for caching).

3. FUSE VTOVA(vp) returns NULL if the vnode's cache is invalid, to help
avoid programming mistakes.

4. A VOP_LINK check for potential nlink overflow prior to invoking the FUSE
link op was weakened (only performed when we have a valid attr cache).  The
check is racy in a multi-writer network filesystem anyway -- classic TOCTOU.
We have to trust any userspace filesystem that rejects local caching to
account for it correctly.

PR:		230258 (inspired by; does not fix)
ViewVC
r344182 emaste Feb. 15, 2019, 10:48 p.m.
Suggested by:	kevans
Reviewed by:	kevans
ViewVC
r344181 emaste Feb. 15, 2019, 10:30 p.m.
r344180 emaste Feb. 15, 2019, 10:28 p.m.
r344179 emaste Feb. 15, 2019, 10:22 p.m.
Building binaries as PIE allows the executable itself to be loaded at a
random address when ASLR is enabled (not just its shared libraries).

With this change PIE objects have a .pieo extension and INTERNALLIB
libraries libXXX_pie.a.

MK_PIE is disabled for some kerberos5 tools, Clang, and Subversion, as
they explicitly reference .a libraries in their Makefiles.  These can
be addressed on an individual basis later.  MK_PIE is also disabled for
rtld-elf because it is already position-independent using bespoke
Makefile rules.

Currently only dynamically linked binaries will be built as PIE.

Discussed with:	dim
Reviewed by:	kib
MFC after:	1 month
Relnotes:	Yes
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D18423
ViewVC
r344162 shurd Feb. 15, 2019, 6:51 p.m.
iflib was returning FILTER_HANDLED, in cases where FILTER_STRAY was more
correct. This potentially caused issues with shared legacy interrupts.

Driver filters returning FILTER_STRAY are now properly handled.

Submitted by:	Augustin Cavalier <waddlesplash@gmail.com>
Reviewed by:	marius, gallatin
Obtained from:	Haiku (a84bb9, 4947d1)
MFC after:	1 week
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D19201
ViewVC
r344161 kevans Feb. 15, 2019, 6:28 p.m.
net_open previously casted the first vararg to a char * and this was
half-OK: at first, it is passed to netif_open, which would cast it back to
the struct devdesc * that it really is and use it properly. It is then
strdup()d and used as the netdev_name, which is objectively wrong.

Correct it so that the first vararg is properly casted to a struct devdesc *
and the netdev_name gets set properly to make it more clear at a glance that
it's not doing something horribly wrong.

Reported by:	mmel
Reviewed by:	imp, mmel, tsoome
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D19206
ViewVC
r344160 rgrimes Feb. 15, 2019, 4:48 p.m.
was incorrectly implemented leading to a possible double free.

It is possible for both the conditional free,
and the unconditional free added in r340044 to be done,
fix that by initializing uopt to NULL,
removing the conditional free,
and only using the unconditional free at the end.

Reported by:	Patrick Mooney (patrick.mooney@joyent.com)
Reviewed by:	jhb (maintainer), Patrick Mooney (joyent/illumos)
Approved by:	bde (mentor)
CID:		1357336
MFC after:	3 days
MFC with:	340044
Differential Revision:	https://reviews.freebsd.org/D19202
ViewVC
r344159 rgrimes Feb. 15, 2019, 4:20 p.m.
nopt is the only allocated space,
xopt and cp are aliases into that allocated space.
Remove the 2 unneeded free's

Reported by:	Patrick Mooney (@pmooney_pfmooney.com)
Reviewed by:	jhb (maintainer), Patrick Mooney (joyent/illumos)
Approved by:	bde (mentor)
CID:		1305412
MFC after:	3 days
MFC with:	340042
Differential Revision:	https://reviews.freebsd.org/D19200
ViewVC
r344151 mw Feb. 15, 2019, 10:40 a.m.
In out of order mode Rx buffer are accesses by req_id.
Accessing and validating mbuf using ntc is causing false error.

Increase driver revision after latest RX OOO completion fixes.

Submitted by: Rafal Kozik <rk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
MFC after: 1 week
ViewVC