From 72a5cbbe81da15e65490b24182907afcbf208aa3 Mon Sep 17 00:00:00 2001
From: Gray Olson <gray@grayolson.com>
Date: Mon, 21 Sep 2020 13:07:48 -0700
Subject: [PATCH 1/2] Edit documentation for `std::{f32,f64}::mul_add`.

Make it clearer that using FMA does not guarantee a performance
improvement, even when the target architecture supports it natively.
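
For reviewers, a minimal sketch of what `mul_add` does guarantee (the
single rounding), as opposed to speed, which this patch says is not
guaranteed. The helper names `unfused`, `fused`, and `demo` are
hypothetical and are not part of this patch or of `std`; the inputs are
chosen only so that the low bits of the product are visible.

```rust
/// Unfused: the product `x * a` is rounded to f64 before `b` is added.
pub fn unfused(x: f64, a: f64, b: f64) -> f64 {
    x * a + b
}

/// Fused: `(x * a) + b` is computed with a single rounding at the end.
pub fn fused(x: f64, a: f64, b: f64) -> f64 {
    x.mul_add(a, b)
}

/// Hypothetical demo: returns `(unfused, fused)` for one chosen input.
pub fn demo() -> (f64, f64) {
    let x = 1.0 + f64::EPSILON; // 1 + 2^-52
    let b = -(1.0 + 2.0 * f64::EPSILON); // -(1 + 2^-51)
    // Exactly, x * x = 1 + 2^-51 + 2^-104. The last term is lost when the
    // product is rounded on its own, but survives in the fused form.
    (unfused(x, x, b), fused(x, x, b))
}
```

This only illustrates the accuracy difference. Whether the fused form is
also faster should be measured on the specific target hardware.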
---
 library/std/src/f32.rs | 7 +++++--
 library/std/src/f64.rs | 7 +++++--
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/library/std/src/f32.rs b/library/std/src/f32.rs
index 59c2da5273bde..c97dac69634df 100644
--- a/library/std/src/f32.rs
+++ b/library/std/src/f32.rs
@@ -206,8 +206,11 @@ impl f32 {
     /// Fused multiply-add. Computes `(self * a) + b` with only one rounding
     /// error, yielding a more accurate result than an unfused multiply-add.
     ///
-    /// Using `mul_add` can be more performant than an unfused multiply-add if
-    /// the target architecture has a dedicated `fma` CPU instruction.
+    /// Using `mul_add` *can* be more performant than an unfused multiply-add if
+    /// the target architecture has a dedicated `fma` CPU instruction. However,
+    /// this is not always true, and care must be taken not to overload the
+    /// architecture's available FMA units when using many FMA instructions
+    /// in a row, which can cause a stall and performance degradation.
     ///
     /// # Examples
     ///
diff --git a/library/std/src/f64.rs b/library/std/src/f64.rs
index bd094bdb55dc3..1ef34409437f8 100644
--- a/library/std/src/f64.rs
+++ b/library/std/src/f64.rs
@@ -206,8 +206,11 @@ impl f64 {
     /// Fused multiply-add. Computes `(self * a) + b` with only one rounding
     /// error, yielding a more accurate result than an unfused multiply-add.
     ///
-    /// Using `mul_add` can be more performant than an unfused multiply-add if
-    /// the target architecture has a dedicated `fma` CPU instruction.
+    /// Using `mul_add` *can* be more performant than an unfused multiply-add if
+    /// the target architecture has a dedicated `fma` CPU instruction. However,
+    /// this is not always true, and care must be taken not to overload the
+    /// architecture's available FMA units when using many FMA instructions
+    /// in a row, which can cause a stall and performance degradation.
     ///
     /// # Examples
     ///

From a6d98d8ec918c7aa2b0712f1ff2c9b1db5924275 Mon Sep 17 00:00:00 2001
From: Gray Olson <gray@grayolson.com>
Date: Tue, 13 Oct 2020 11:03:31 -0700
Subject: [PATCH 2/2] generalize warning

---
 library/std/src/f32.rs | 7 +++----
 library/std/src/f64.rs | 7 +++----
 2 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/library/std/src/f32.rs b/library/std/src/f32.rs
index c97dac69634df..9bebf68cf3d26 100644
--- a/library/std/src/f32.rs
+++ b/library/std/src/f32.rs
@@ -206,11 +206,10 @@ impl f32 {
     /// Fused multiply-add. Computes `(self * a) + b` with only one rounding
     /// error, yielding a more accurate result than an unfused multiply-add.
     ///
-    /// Using `mul_add` *can* be more performant than an unfused multiply-add if
+    /// Using `mul_add` *may* be more performant than an unfused multiply-add if
     /// the target architecture has a dedicated `fma` CPU instruction. However,
-    /// this is not always true, and care must be taken not to overload the
-    /// architecture's available FMA units when using many FMA instructions
-    /// in a row, which can cause a stall and performance degradation.
+    /// this is not always true, and depends heavily on designing algorithms
+    /// with specific target hardware in mind.
     ///
     /// # Examples
     ///
diff --git a/library/std/src/f64.rs b/library/std/src/f64.rs
index 1ef34409437f8..860e461ec70a3 100644
--- a/library/std/src/f64.rs
+++ b/library/std/src/f64.rs
@@ -206,11 +206,10 @@ impl f64 {
     /// Fused multiply-add. Computes `(self * a) + b` with only one rounding
     /// error, yielding a more accurate result than an unfused multiply-add.
     ///
-    /// Using `mul_add` *can* be more performant than an unfused multiply-add if
+    /// Using `mul_add` *may* be more performant than an unfused multiply-add if
     /// the target architecture has a dedicated `fma` CPU instruction. However,
-    /// this is not always true, and care must be taken not to overload the
-    /// architecture's available FMA units when using many FMA instructions
-    /// in a row, which can cause a stall and performance degradation.
+    /// this is not always true, and depends heavily on designing algorithms
+    /// with specific target hardware in mind.
     ///
     /// # Examples
     ///