6.7.8. Detection of Outliers

6.7.8.1. Operation


Operation name

Detection of Outliers

Algorithm name

XXX

Algorithm reference

XXX

Description

This Operation enables the detection of outliers within a sample.


6.7.8.2. Options


name

percentile-approach

description

identify valid values inside the range of two limiting percentiles

settings

** to be defined later **


name

threshold-approach

description

identify valid values within a given range determined by two threshold values

settings

** to be defined later **


6.7.8.3. Input data


name

time (steps)

type

integer or double

range

[0; +infinity]

dimensionality

vector

description

days/months since …


name

variable(s)

type

floating point number

range

[-infinity; +infinity]

dimensionality

vector

description

values of (a) certain variable(s)


6.7.8.4. Output data


name

cleaned sample

type

floating point number

range

[-infinity; +infinity]

dimensionality

vector

description

clean input after outliers have been removed


6.7.8.5. Parameters


name

lon1, x1 (longitudinal position)

type

floating point number

valid values

[-180.; +180.] respectively [0.; 360.]

default value

minimum longitude of input data

description

longitudinal coordinate limiting rectangular area of interest


name

lon2, x2 (longitudinal position)

type

floating point number

valid values

[-180.; +180.] resp. [0.; 360.]

default value

maximum longitude of input data

description

longitudinal coordinate limiting rectangular area of interest


name

lat1, y1 (latitudinal position)

type

floating point number

valid values

[-90.; +90.]

default value

minimum latitude of input data

description

latitudinal coordinate limiting rectangular area of interest


name

lat2, y2 (latitudinal position)

type

floating point number

valid values

[-90.; +90.]

default value

maximum latitude of input data

description

latitudinal coordinate limiting rectangular area of interest


6.7.8.6. Example

'''The following program is an example for the 'Detection of Outliers'. The suggested method is a detection of outliers
 based on percentiles or threshold-limitation.

 Step 1:
A random dataset with a length of 95 floats within the span of 15 and 25 is generated randomly. Five outlier values are
added by hand.

Step 2:
Prompt:: Decide between the two approaches/methods.

Step 3:
Prompt:: Set limitations either a percentage [%] or a value embracing the distribution.

Step 4:
Prompt:: Flag or drop the outliers. If falgged: column_stack a new column with 0/1. '1' flags an outlier.

Step 5:
Implemt of an 'R-like' which()-statement.

Step 6: Exclude or flag the values.

Return-Object: 'new_sampl' based on the prior decisions.

#Comment: This method of detecting outliers is just one of many! UC2 is a perfect example of a 'Detection o Outliers'
via two threshold-values giving a rigid limition for the span of values allowed. When the data is assumed to be tempera-
tures in Celius measured during the summer. I.e. the User could save drop/flag all values lower 15 and greater 25,
since the temperature in the given period is considered to vary in that range.

02.02.2017 Stephan Herzog
'''

#import modules
import numpy as np

## - TEST DATA - ##
#Generate 95 random values within 15 and 25; pass it to 'vec1'
sampl = np.random.uniform(low=15.0,high=25.0,size=95)
sampl = np.append(sampl,[-3.141,42,1337,-273.15,21122012])
np.random.shuffle(sampl)


######BEGIN: VOR DEM PROMPT DIE ABFRAGE EINBAUEN OB PERCENTIL_METHODE ODER SCHWELLWERT!!!!
logical_prompt = raw_input("Please decide between the methods for a detection of outliers: Press (1) for a percentile-"
                                                   "approach; Press (2) for a threshold-approach.")

## - Calc. of percentiles - ##
if (logical_prompt == '1') :
        prompt1lower = raw_input("Please enter the lower limit for the percentile: ")   ##Suggestion: 2.5
        prompt2upper = raw_input("Please enter the upper limit for the percentile: ")   ##Suggestion: 97.5

        p_lower = np.percentile(sampl, float(prompt1lower))     ##key aspect
        p_upper = np.percentile(sampl, float(prompt2upper))     ##key aspect

## - Prompt for threshold - ##
if (logical_prompt == '2') :
        p_lower = raw_input("Please enter the lower limit for the threshold: ")
        p_upper = raw_input("Please enter the upper limit for the threshold: ")

        p_lower = float(p_lower)
        p_upper = float(p_upper)

## - Prompt for flag or drop - ##
logical = raw_input("Should the outliers be flagged? (Y/N)")

## - Identfiy values within limits - ##
which = lambda lst:list(np.where(lst)[0])       ##key aspect

lst = map(lambda x:(x<p_lower or x>p_upper), sampl)

print(which(lst))
## - Flag or Drop Outliers - ##
if ( logical == 'Y') :
        flag = np.repeat(0,len(sampl))
        flag[which(lst)] = 1
        new_sampl = np.column_stack((sampl,flag))
        print(new_sampl.shape)
        print(new_sampl[which(lst),:])
else:
        new_sampl = np.delete(sampl,which(lst))
        print(new_sampl.shape)

## - Write to Output - ## e.g. .csv or other