RUL - Data Preparation - Second Cleaning Pass - Removing Extreme Values

The code from the earlier posts is collected here:

In [9]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
train = pd.read_csv('D:/RUL/CMAPSSData/train_FD001.txt', parse_dates=False, delimiter=" ", decimal=".", header=None)
test = pd.read_csv('D:/RUL/CMAPSSData/test_FD001.txt', parse_dates=False, delimiter=" ", decimal=".", header=None)
RUL = pd.read_csv('D:/RUL/CMAPSSData/RUL_FD001.txt', parse_dates=False, delimiter=" ", decimal=".", header=None)
table_NaN = pd.concat([train.isnull().sum(), test.isnull().sum()], axis=1)
table_NaN.columns = ['train', 'test']
# Drop unused columns
train.drop(train.columns[[-1,-2]], axis=1, inplace=True)
test.drop(test.columns[[-1,-2]], axis=1, inplace=True)
RUL.drop(RUL.columns[-1], axis=1, inplace=True)
# Name the column headers
cols = ['unit', 'cycles', 'op_setting1', 'op_setting2', 'op_setting3', 's1', 's2', 's3', 's4', 's5', 
        's6', 's7', 's8', 's9', 's10', 's11', 's12', 's13', 's14', 's15', 's16', 's17', 's18', 's19', 's20', 's21']
train.columns = cols
test.columns = cols
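
The trailing columns dropped above are presumably empty. When every line of a space-delimited file ends with extra spaces, `read_csv` with `delimiter=" "` treats each of them as a field separator and produces all-NaN columns. The sketch below reproduces this on a small synthetic string; the sample values are made up and are not taken from the CMAPSS files.

```python
import io
import pandas as pd

# Synthetic sample (made-up values): three fields followed by two trailing spaces
sample = "1 1 0.5  \n1 2 0.7  \n"

demo = pd.read_csv(io.StringIO(sample), delimiter=" ", header=None)

# Each trailing space yields one extra, entirely empty column
nan_counts = demo.isnull().sum()
print(nan_counts)

# Drop every column that contains only NaN values
demo_clean = demo.dropna(axis=1, how="all")
print(demo_clean.shape)
```

Dropping by `dropna(axis=1, how="all")` instead of by position avoids hard-coding how many trailing columns there are, at the cost of also removing any genuinely empty column.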

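Some values in the data may lie far from where most of the data is distributed. Just as a competition discards the lowest and the highest score, we also want to clean such extreme values out of the dataset.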

In [10]:
train.describe().transpose()
Out[10]:
count mean std min 25% 50% 75% max
unit 20631.0 51.506568 2.922763e+01 1.0000 26.0000 52.0000 77.0000 100.0000
cycles 20631.0 108.807862 6.888099e+01 1.0000 52.0000 104.0000 156.0000 362.0000
op_setting1 20631.0 -0.000009 2.187313e-03 -0.0087 -0.0015 0.0000 0.0015 0.0087
op_setting2 20631.0 0.000002 2.930621e-04 -0.0006 -0.0002 0.0000 0.0003 0.0006
op_setting3 20631.0 100.000000 0.000000e+00 100.0000 100.0000 100.0000 100.0000 100.0000
s1 20631.0 518.670000 0.000000e+00 518.6700 518.6700 518.6700 518.6700 518.6700
s2 20631.0 642.680934 5.000533e-01 641.2100 642.3250 642.6400 643.0000 644.5300
s3 20631.0 1590.523119 6.131150e+00 1571.0400 1586.2600 1590.1000 1594.3800 1616.9100
s4 20631.0 1408.933782 9.000605e+00 1382.2500 1402.3600 1408.0400 1414.5550 1441.4900
s5 20631.0 14.620000 1.776400e-15 14.6200 14.6200 14.6200 14.6200 14.6200
s6 20631.0 21.609803 1.388985e-03 21.6000 21.6100 21.6100 21.6100 21.6100
s7 20631.0 553.367711 8.850923e-01 549.8500 552.8100 553.4400 554.0100 556.0600
s8 20631.0 2388.096652 7.098548e-02 2387.9000 2388.0500 2388.0900 2388.1400 2388.5600
s9 20631.0 9065.242941 2.208288e+01 9021.7300 9053.1000 9060.6600 9069.4200 9244.5900
s10 20631.0 1.300000 0.000000e+00 1.3000 1.3000 1.3000 1.3000 1.3000
s11 20631.0 47.541168 2.670874e-01 46.8500 47.3500 47.5100 47.7000 48.5300
s12 20631.0 521.413470 7.375534e-01 518.6900 520.9600 521.4800 521.9500 523.3800
s13 20631.0 2388.096152 7.191892e-02 2387.8800 2388.0400 2388.0900 2388.1400 2388.5600
s14 20631.0 8143.752722 1.907618e+01 8099.9400 8133.2450 8140.5400 8148.3100 8293.7200
s15 20631.0 8.442146 3.750504e-02 8.3249 8.4149 8.4389 8.4656 8.5848
s16 20631.0 0.030000 1.387812e-17 0.0300 0.0300 0.0300 0.0300 0.0300
s17 20631.0 393.210654 1.548763e+00 388.0000 392.0000 393.0000 394.0000 400.0000
s18 20631.0 2388.000000 0.000000e+00 2388.0000 2388.0000 2388.0000 2388.0000 2388.0000
s19 20631.0 100.000000 0.000000e+00 100.0000 100.0000 100.0000 100.0000 100.0000
s20 20631.0 38.816271 1.807464e-01 38.1400 38.7000 38.8300 38.9500 39.4300
s21 20631.0 23.289705 1.082509e-01 22.8942 23.2218 23.2979 23.3668 23.6184

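Looking at the table above, some columns stand out from the rest. Sensor s16, for example, has a minimum and a maximum of 0.0300 and a standard deviation that is effectively zero (1.39e-17). Data like this is called a flat line, and it suggests the sensor may have failed. The columns s1, s5, s10, s18, s19, and op_setting3 are flat as well, so all of them need to be removed from the dataset. With these columns removed, histograms give a direct view of how the remaining columns are distributed: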

In [11]:
train.drop(['s1', 's5', 's10', 's16', 's18', 's19', 'op_setting3'], axis=1, inplace=True)
test.drop(['s1', 's5', 's10', 's16', 's18', 's19', 'op_setting3'], axis=1, inplace=True)
train.hist(bins=50, figsize=(18,16))
plt.show()
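
The flat columns dropped above were picked by reading the `describe()` table. As a rough sketch, they could also be found programmatically by thresholding the standard deviation. Note that the table reports 1.776400e-15 for s5 and 1.387812e-17 for s16 rather than exactly 0, so a strict `== 0` test would miss them. The tolerance `tol` and the synthetic frame below are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic frame (made-up values) with one varying and two flat columns
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "s2": 642.0 + rng.normal(0.0, 0.5, size=100),
    "s5": np.full(100, 14.62),
    "s16": np.full(100, 0.03),
})

# Hypothetical tolerance: treat a column as flat when its standard deviation
# is negligible; an exact == 0 check can miss float round-off such as 1e-15
tol = 1e-8
std = demo.std()
flat_cols = std[std < tol].index.tolist()
print(flat_cols)

# Remove the flat columns and keep the rest
demo_clean = demo.drop(columns=flat_cols)
```

On the real data the choice of tolerance matters: s6 has a standard deviation of about 1.4e-03 and would be kept with this threshold, as it is in the cell above.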

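The histograms give a direct view of the engine data. So far, however, we know nothing about how long each engine unit ran, that is, its maximum cycle count. Only with a rough expectation of that lifetime can we judge whether the later modeling steps and their results are plausible.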

In [12]:
# Highest recorded cycle count for each engine unit
cyclestrain = train.groupby('unit', as_index=False)['cycles'].max()
cyclestest = test.groupby('unit', as_index=False)['cycles'].max()
fig = plt.figure(figsize = (16,12))
# Left panel: training set
fig.add_subplot(1,2,1)
bar_labels = list(cyclestrain['unit'])
bars = plt.bar(list(cyclestrain['unit']), cyclestrain['cycles'], color='red')
plt.ylim([0, 400])
plt.xlabel('Units', fontsize=16)
plt.ylabel('Max. Cycles', fontsize=16)
plt.title('Max. Cycles per unit in trainset', fontsize=16)
plt.xticks(np.arange(min(bar_labels)-1, max(bar_labels)-1, 5.0), fontsize=12)
plt.yticks(fontsize=12)
# Right panel: test set, with the same y-axis limits for comparison
fig.add_subplot(1,2,2)
bars = plt.bar(list(cyclestest['unit']), cyclestest['cycles'], color='grey')
plt.ylim([0, 400])
plt.xlabel('Units', fontsize=16)
plt.ylabel('Max. Cycles', fontsize=16)
plt.title('Max. Cycles per unit in testset', fontsize=16)
plt.xticks(np.arange(min(bar_labels)-1, max(bar_labels)-1, 5.0), fontsize=12)
plt.yticks(fontsize=12)
plt.show()
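
To turn the bar charts into a few numbers, the per-unit maxima can be summarized directly. The snippet below is a minimal sketch on a synthetic stand-in for `train` (made-up values); the same two lines would apply to the real `train` and `test` frames. Keep in mind that in CMAPSS the training trajectories run to failure while the test trajectories stop earlier, so only the training figure is a full lifetime.

```python
import pandas as pd

# Synthetic stand-in for `train` (made-up values): 3 units with different lifetimes
demo = pd.DataFrame({
    "unit":   [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "cycles": [1, 2, 3, 1, 2, 1, 2, 3, 4],
})

# Last recorded cycle per unit, then summary statistics across units
max_cycles = demo.groupby("unit")["cycles"].max()
summary = max_cycles.agg(["min", "mean", "max"])
print(summary)
```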