762caeb4 kib Aug. 6, 2019, 5:05 p.m.
Discussed with:	bjk, emaste
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D21087
90e35b0a kib Aug. 6, 2019, 4:53 p.m.
Such speculations could use the user-controlled %gs base, especially since
FreeBSD supports the WRGSBASE instruction.

Place an LFENCE at the start of each basic block that follows the test for
previous kernel/user mode on kernel entry, which prevents the speculation.
The code accesses the %gs-based PCPU before any serializing instruction,
such as the %cr3 reload for KPTI, is executed.

With PTI disabled, on a Haswell i7-4770S machine, "syscall_timings getppid"
shows the following when no LFENCE is added to the syscall path:
test	loop	time	iterations	periteration
getppid	0	1.040918865	4643611	0.000000224
getppid	1	1.004985962	4481816	0.000000224
getppid	2	1.005196483	4482363	0.000000224
and with the LFENCE added:
getppid	0	1.043701091	4554779	0.000000229
getppid	1	1.016930328	4438094	0.000000229
getppid	2	1.023223117	4466640	0.000000229
and ministat reports 'No difference proven at 95.0% confidence.'

Security:	CVE-2019-1125
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
814f33aa glebius Aug. 6, 2019, 4:11 p.m.
54e9e4e7 emaste Aug. 6, 2019, 4:09 p.m.
admbugs:	780
Submitted by:	Guido Vranken, bz
Reported by:	Guido Vranken
Security:	CVE-2019-5610
ab7de25c kevans Aug. 6, 2019, 1:09 p.m.
Submitted by:	Augustin Cavalier <waddlesplash@gmail.com>
Obtained from:	Haiku (ec2b89264cfc63e05e611cce82cc449197403aa4)
MFC after:	3 days
12b678e9 tuexen Aug. 6, 2019, 11:26 a.m.
MFC after:		1 week
Sponsored by:		Netflix, Inc.
cd2de8b7 tuexen Aug. 6, 2019, 10:29 a.m.
PR:			238520
Reported by:		pho@
MFC after:		1 week
43ecbff2 tuexen Aug. 6, 2019, 8:33 a.m.
85dbddbe cem Aug. 6, 2019, 3:54 a.m.
PR:	239607
Reported by:	"drclaw" <drclaw AT gmail.com>
8b1531ec jhibbits Aug. 6, 2019, 3:49 a.m.
It helps if my local kernel build has INVARIANTS.
dc825fed jhibbits Aug. 6, 2019, 3:16 a.m.
There is no need for the 64-bit pmap to have a fixed number of page table
buffers.  Since the 64-bit pmap has a DMAP, we can effectively have user
page tables limited only by total RAM size.
427d205c asomers Aug. 6, 2019, 12:50 a.m.
Reported by:	glebius
Sponsored by:	The FreeBSD Foundation
e48c002f vangyzen Aug. 5, 2019, 10:59 p.m.
pthread_cond_timedwait() should wait _at least_ until the timeout,
but it might appear to wait longer due to system activity and
scheduling.  The test ignored fractional seconds when comparing the
actual and expected timeouts, so it allowed anywhere between zero
and one extra second of wait time.  Zero is a bit unreasonable.
Compare fractional seconds so we always allow up to one extra second.
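A minimal sketch of the kind of comparison described above (names and
structure are illustrative, not the test's actual code):

	#include <stdbool.h>
	#include <time.h>

	/*
	 * Illustrative sketch only: check that the observed wakeup time is
	 * at or after the requested deadline and at most one second past
	 * it, comparing at nanosecond granularity.
	 */
	static bool
	wakeup_within_slack(const struct timespec *deadline,
	    const struct timespec *actual)
	{
		/* Never wake up before the deadline. */
		if (actual->tv_sec < deadline->tv_sec ||
		    (actual->tv_sec == deadline->tv_sec &&
		    actual->tv_nsec < deadline->tv_nsec))
			return (false);

		/* Allow at most one full second past the deadline. */
		struct timespec limit = *deadline;
		limit.tv_sec += 1;
		return (actual->tv_sec < limit.tv_sec ||
		    (actual->tv_sec == limit.tv_sec &&
		    actual->tv_nsec <= limit.tv_nsec));
	}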

Reviewed by:	ngie
MFC after:	1 week
Sponsored by:	Dell EMC Isilon
ed9ffd2f jhb Aug. 5, 2019, 9:39 p.m.
When transmitting a large TCP packet, the final transmit descriptor
includes the length of the protocol headers to be duplicated on each
segment.  The device model was trusting the guest-supplied value
without validating it.  A value of zero would result in the guest
being able to indirect a garbage pointer on the stack to overwrite
arbitrary memory in the bhyve process.  A value that was non-zero but
too small for the requested parameters resulted in the device model
reading and writing values beyond the end of the on-stack buffer used
to hold the template header.

To fix, validate the supplied length and drop requests to transmit
packets that would overflow the header buffer.  While here, initialize
the header pointer to NULL as a preventive measure so that any access
to an unallocated template header crashes the hypervisor
deterministically.

While here, only read the TCP sequence number if the packet being
split is a TCP packet.  The e1000 logic supports segmentation of UDP
frames, and while UDP segmentation requires this part of the header to
be valid (so there is no buffer overflow), only reading the field when
needed is cleaner.
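
A hedged sketch of the kind of validation described (buffer size, names, and
signature are hypothetical, not bhyve's actual code):

	#include <stdint.h>
	#include <string.h>

	#define	TSO_HDR_BUF_SIZE	256	/* assumed on-stack template size */

	/*
	 * Illustrative sketch only: reject a guest-supplied header length
	 * that is smaller than the headers the request needs or larger than
	 * the on-stack template buffer, so the caller drops the transmit
	 * request instead of overflowing the buffer.
	 */
	static int
	copy_template_header(uint8_t *hdrbuf, const uint8_t *guest_hdr,
	    uint32_t hdrlen, uint32_t min_hdrlen)
	{
		if (hdrlen < min_hdrlen || hdrlen > TSO_HDR_BUF_SIZE)
			return (-1);	/* caller drops the packet */
		memcpy(hdrbuf, guest_hdr, hdrlen);
		return (0);
	}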

admbugs:	918
Reported by:	Reno Robert <renorobert@gmail.com>
Reviewed by:	markj
Approved by:	so
Security:	CVE-2019-5609
c81f1790 mahrens Aug. 5, 2019, 9:34 p.m.
When we unload metaslabs today in ZFS, the cached max_size value is
discarded. We instead use the histogram to determine whether or not we
think we can satisfy an allocation from the metaslab. This can result in
situations where, if we're doing I/Os of a size not aligned to a
histogram bucket, a metaslab is loaded even though it cannot satisfy the
allocation we think it can. For example, a metaslab with 16 entries in
the 16k-32k bucket may have entirely 16kB entries. If we try to allocate
a 24kB buffer, we will load that metaslab because we think it should be
able to handle the allocation. Doing so is expensive in CPU time, disk
reads, and average IO latency. This is exacerbated if the write being
attempted is a sync write.

This change makes ZFS cache the max_size after the metaslab is
unloaded. If we ever get a free (or a coalesced group of frees) larger
than the max_size, we will update it. Otherwise, we leave it as is. When
attempting to allocate, we use the max_size as a lower bound, and
respect it unless we are in try_hard. However, we do age the max_size
out at some point, since we expect the actual max_size to increase as we
do more frees. A more sophisticated algorithm here might be helpful, but
this works reasonably well.
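
A rough sketch of the lower-bound check described above (structure and names
are illustrative, not the actual OpenZFS metaslab code):

	#include <stdbool.h>
	#include <stdint.h>

	/*
	 * Illustrative sketch only: skip loading an unloaded metaslab when
	 * the cached maximum free-segment size says the request cannot fit,
	 * unless the allocator is in try_hard mode.
	 */
	struct metaslab_hint {
		uint64_t	cached_max_size;  /* largest free segment at unload */
	};

	static bool
	metaslab_worth_loading(const struct metaslab_hint *mh,
	    uint64_t asize, bool try_hard)
	{
		if (try_hard)
			return (true);
		return (asize <= mh->cached_max_size);
	}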

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #9055