Skip to content

Robust statistics: trimmed mean and interquartile range #379

Open
@Beliavsky

Description

@Beliavsky

Since sorting and I assume ranking will be part of stdlib, here is some related functionality to consider. In another issue I mentioned the median. The median is a special case of a trimmed mean, where you trim the top N and bottom N values for a vector with 2N+1 or 2N+2 elements and compute the mean of the remaining 1 or 2 values. Trimmed means are a commonly used robust statistic -- one computes the mean after removing top N and bottom N. So a function

function trimmed_mean(x,ntrim_high,ntrim_low) result(xmean)
real, intent(in) :: x
integer, intent(in) :: ntrim_high,ntrim_low
real :: xmean ! mean after removing ntrim_high and ntrim_low values
end function trimmed_mean

could be created. In quantitative finance, insurance, and probably other domains, you often want to know what the mean of N largest or smallest observations were. A complement to the function above would be

function extreme_mean(x,nhigh,nlow) result(xmean)
real, intent(in) :: x
integer, intent(in), optional :: nhigh,nlow
real :: xmean ! mean of nhigh and nlow values
end function extreme_mean

Usually only one of nhigh and nlow would be specified by the user, but the case where both are specified can also be handled. In quantitative finance, the mean of the N lowest returns of a sample is an estimator of the Conditional Value at Risk.

The most commonly computed average is just the sample mean, which is in stdlib. Quantities such as the trimmed mean are less trivial to program and would perhaps provide a greater benefit by being in stdlib.

The descriptive statistics section has var to compute variance, which is the square of the standard deviation. A robust alternative to standard deviation to measure the spread of data is the interquartile range (IQR) (the difference of 75th and 25th percentiles). Having percentiles and IQR in stdlib would be nice. If you ask R to summarize 1000 normal variates, it says

> x = rnorm(1000)
> summary(x)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-2.84938 -0.65930  0.06237  0.06395  0.75097  3.22734 

So statisticians think percentile measures are as important as the standard deviation to understand spread.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions