6.7.8. Detection of Outliers

6.7.8.1. Operation


Operation name:Detection of Outliers
Algorithm name:XXX
Algorithm reference:
 XXX
Description:This Operation enables the detection of outliers within a sample.

6.7.8.2. Options


name:percentile-approach
description:identify valid values inside the range of two limiting percentiles
settings:** to be defined later **

name:threshold-approach
description:identify valid values within a given range determined by two threshold values
settings:** to be defined later **
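
Since the settings of both options are still to be defined, the following is only a minimal sketch of how each option could decide which values are valid. The function names `percentile_mask` and `threshold_mask`, their parameters, and the sample values are illustrative assumptions, not part of this specification; the default percentiles 2.5 and 97.5 follow the suggestions given in the example of this section.

```python
import numpy as np


def percentile_mask(values, lower_pct=2.5, upper_pct=97.5):
    """Hypothetical helper: True for values inside the two limiting percentiles."""
    values = np.asarray(values, dtype=float)
    lower, upper = np.percentile(values, [lower_pct, upper_pct])
    return (values >= lower) & (values <= upper)


def threshold_mask(values, lower, upper):
    """Hypothetical helper: True for values inside the two threshold values."""
    values = np.asarray(values, dtype=float)
    return (values >= lower) & (values <= upper)


# Made-up sample: summer temperatures with two obvious outliers
sample = np.array([18.2, 21.7, 19.4, 1337.0, 24.9, -273.15, 16.3])

valid_by_threshold = threshold_mask(sample, 15.0, 25.0)
valid_by_percentile = percentile_mask(sample)

print(sample[valid_by_threshold])
print(sample[valid_by_percentile])
```

Both helpers return a boolean mask instead of a filtered vector, so the caller remains free to either drop or flag the invalid values.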

6.7.8.3. Input data


name:time (steps)
type:integer or double
range:[0; +infinity]
dimensionality:vector
description:days/months since …

name:variable(s)
type:floating point number
range:[-infinity; +infinity]
dimensionality:vector
description:values of (a) certain variable(s)

6.7.8.4. Output data


name:cleaned sample
type:floating point number
range:[-infinity; +infinity]
dimensionality:vector
description:clean input after outliers have been removed
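
The example in this section produces the output in one of two forms: a vector with the outliers dropped, or a two-column array in which a second column flags outliers with 1. A minimal sketch of both forms follows; the function `clean_sample` and its `flag` argument are hypothetical names chosen for illustration.

```python
import numpy as np


def clean_sample(values, outlier_mask, flag=False):
    """Hypothetical helper: drop the outliers, or keep them and add a 0/1 flag column."""
    values = np.asarray(values, dtype=float)
    outlier_mask = np.asarray(outlier_mask, dtype=bool)
    if flag:
        # Column 0: original values, column 1: 1 marks an outlier
        return np.column_stack((values, outlier_mask.astype(int)))
    return values[~outlier_mask]


sample = np.array([20.0, 42.0, 17.5])
is_outlier = np.array([False, True, False])

dropped = clean_sample(sample, is_outlier)
flagged = clean_sample(sample, is_outlier, flag=True)
print(dropped)
print(flagged)
```

Note that the flagged form is a floating point array, because `column_stack` promotes the integer flag column to the type of the values.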

6.7.8.5. Parameters


name:lon1, x1 (longitudinal position)
type:floating point number
valid values:[-180.; +180.] or [0.; 360.]
default value:minimum longitude of input data
description:longitudinal coordinate limiting rectangular area of interest

name:lon2, x2 (longitudinal position)
type:floating point number
valid values:[-180.; +180.] or [0.; 360.]
default value:maximum longitude of input data
description:longitudinal coordinate limiting rectangular area of interest

name:lat1, y1 (latitudinal position)
type:floating point number
valid values:[-90.; +90.]
default value:minimum latitude of input data
description:latitudinal coordinate limiting rectangular area of interest

name:lat2, y2 (latitudinal position)
type:floating point number
valid values:[-90.; +90.]
default value:maximum latitude of input data
description:latitudinal coordinate limiting rectangular area of interest
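
The four parameters above delimit a rectangular area of interest and default to the extent of the input data. A minimal sketch of such a selection is given below. It assumes that every value comes with its own longitude and latitude, and that the data and the limits use the same longitude convention; neither assumption is stated in this specification, and `select_area` is a hypothetical name.

```python
import numpy as np


def select_area(lon, lat, lon1=None, lon2=None, lat1=None, lat2=None):
    """Hypothetical helper: True for points inside the rectangle [lon1, lon2] x [lat1, lat2]."""
    lon = np.asarray(lon, dtype=float)
    lat = np.asarray(lat, dtype=float)
    # Defaults as listed above: the minimum/maximum coordinate of the input data
    lon1 = lon.min() if lon1 is None else lon1
    lon2 = lon.max() if lon2 is None else lon2
    lat1 = lat.min() if lat1 is None else lat1
    lat2 = lat.max() if lat2 is None else lat2
    return (lon >= lon1) & (lon <= lon2) & (lat >= lat1) & (lat <= lat2)


# Made-up coordinates of four data points
lon = np.array([5.0, 10.0, 15.0, 20.0])
lat = np.array([45.0, 50.0, 55.0, 60.0])

inside = select_area(lon, lat, lon1=8.0, lon2=18.0, lat1=40.0)
print(inside)
```

With all four parameters left at their defaults, the mask selects every input point.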

6.7.8.6. Example

'''The following program is an example of the 'Detection of Outliers' operation. The suggested method detects outliers
 using either percentiles or threshold limits.

Step 1:
A dataset of 95 random floats between 15 and 25 is generated. Five outlier values are
added by hand.

Step 2:
Prompt:: Choose between the two approaches/methods.

Step 3:
Prompt:: Set the limits, either as percentiles [%] or as threshold values that enclose the distribution.

Step 4:
Prompt:: Flag or drop the outliers. If flagged: column_stack a new column with 0/1. '1' flags an outlier.

Step 5:
Implementation of an 'R-like' which() statement.

Step 6: Exclude or flag the values.

Return-Object: 'new_sampl' based on the prior decisions.

#Comment: This method of detecting outliers is just one of many! UC2 is a perfect example of a 'Detection of Outliers'
via two threshold values that set a rigid limit on the span of values allowed. Suppose the data are temperatures
in Celsius measured during the summer: the user could safely drop/flag all values lower than 15 and greater than 25,
since the temperature in the given period is considered to vary in that range.

02.02.2017 Stephan Herzog
'''

#import modules
import numpy as np

## - TEST DATA - ##
#Generate 95 random values between 15 and 25; store them in 'sampl'
sampl = np.random.uniform(low=15.0,high=25.0,size=95)
sampl = np.append(sampl,[-3.141,42,1337,-273.15,21122012])
np.random.shuffle(sampl)


## - Prompt for the method: percentile approach or threshold approach - ##
logical_prompt = input("Please choose a method for the detection of outliers: press (1) for the percentile "
                       "approach; press (2) for the threshold approach: ").strip()

## - Calc. of percentiles - ##
if logical_prompt == '1':
        prompt1lower = input("Please enter the lower percentile [%]: ")   ##Suggestion: 2.5
        prompt2upper = input("Please enter the upper percentile [%]: ")   ##Suggestion: 97.5

        p_lower = np.percentile(sampl, float(prompt1lower))     ##key aspect
        p_upper = np.percentile(sampl, float(prompt2upper))     ##key aspect

## - Prompt for threshold - ##
elif logical_prompt == '2':
        p_lower = float(input("Please enter the lower limit for the threshold: "))
        p_upper = float(input("Please enter the upper limit for the threshold: "))

else:
        # Without a valid choice the limits would be undefined further down
        raise ValueError("Unknown method: please enter 1 or 2.")

## - Prompt for flag or drop - ##
logical = input("Should the outliers be flagged? (Y/N) ").strip().upper()

## - Identify values outside the limits - ##
which = lambda mask: np.where(mask)[0]       ##key aspect: 'R-like' which(), returns the indices of True entries

# Boolean mask: True marks an outlier (below the lower or above the upper limit)
lst = (sampl < p_lower) | (sampl > p_upper)

print(which(lst))
## - Flag or Drop Outliers - ##
if logical == 'Y':
        # Keep all values and add a second column: 1 flags an outlier, 0 a valid value
        flag = np.zeros(len(sampl), dtype=int)
        flag[which(lst)] = 1
        new_sampl = np.column_stack((sampl, flag))
        print(new_sampl.shape)
        print(new_sampl[which(lst), :])
else:
        # Remove the outliers from the sample
        new_sampl = np.delete(sampl, which(lst))
        print(new_sampl.shape)

## - Write to Output - ## e.g. .csv or other
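
The final output step is left open above. As one possible option, a flagged result could be written to a .csv file with `numpy.savetxt`. The sketch below is self-contained and uses made-up values; the file name, the column names, and the temporary directory are assumptions made for illustration.

```python
import os
import tempfile

import numpy as np

# Made-up flagged result: column 0 holds the values, column 1 the 0/1 outlier flag
values = np.array([18.2, 1337.0, 21.7])
flags = np.array([0, 1, 0])
table = np.column_stack((values, flags))

# Hypothetical output location inside the system's temporary directory
out_path = os.path.join(tempfile.gettempdir(), "new_sampl.csv")

# One format per column; comments="" keeps the header line free of a leading '#'
np.savetxt(out_path, table, delimiter=",", header="value,outlier_flag",
           comments="", fmt=["%.6f", "%d"])

# Read the file back to check the round trip (skip the header line)
restored = np.loadtxt(out_path, delimiter=",", skiprows=1)
print(restored)
```

If the outliers were dropped instead of flagged, the result is a plain vector and can be written the same way with a single format string.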