[Big Data Career Study Group] Session 6 Meeting Notes (1)

Chipmunks 2018. 6. 6.

Meeting date: Thursday, May 24, 6 p.m.

Chapter 4. Representing Data and Engineering Features

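A particularly common type of feature: the categorical feature, also called a discrete feature

Feature engineering: finding the data representation best suited to a particular application

What helps performance more: representing the data correctly >> choosing the right parameters for a supervised learning model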

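4.1 Categorical Variables

4.1.1 One-Hot-Encoding (Dummy Variables)

The most widely used way to represent categorical variables. Also called one-out-of-N encoding or dummy variables.

Dummy variable: a categorical variable replaced by one or more new features that take the value 0 or 1. Features expressed as 0 and 1 can be plugged into the formula for linear binary classification.

A categorical feature with N possible values becomes N new features, and for each data point exactly one of them is 1.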

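One-hot encoding is similar to the dummy coding used in statistics, but not identical. In statistics, a categorical feature with k possible values is converted into k-1 features, and the last category is represented by all columns being 0. This simplifies the analysis because it avoids making the data matrix rank deficient.

※ Rank deficiency: if four categories are encoded as four features, the last feature can be predicted from the other three. For example, if three of the four are 0, the remaining one must be 1. Rank deficiency refers to cases like this, where one column depends on the other columns or a column is all zeros. Depending on the matrix factorization used, it can cause problems. It seems to be covered in detail in linear algebra.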
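The difference between the two encodings can be checked numerically. The following is a minimal sketch on a made-up `color` column (not data from this post), assuming a design matrix that also contains an intercept column, which is where the dependency between the one-hot columns shows up as a lost rank.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with four categories (not from the post's dataset).
df = pd.DataFrame({"color": ["red", "green", "blue", "black", "red", "blue"]})

# One-hot encoding: one column per category (k = 4 columns).
one_hot = pd.get_dummies(df["color"]).astype(float)
# Dummy coding as used in statistics: drop one category (k - 1 = 3 columns).
dummy = pd.get_dummies(df["color"], drop_first=True).astype(float)

# Add an intercept column of ones, as a linear model with a bias term would.
ones = np.ones((len(df), 1))
X_one_hot = np.hstack([ones, one_hot.values])
X_dummy = np.hstack([ones, dummy.values])

# Every one-hot row sums to 1, so the intercept equals the sum of the
# four one-hot columns and the matrix loses one rank.
print("columns:", X_one_hot.shape[1], "rank:", np.linalg.matrix_rank(X_one_hot))
print("columns:", X_dummy.shape[1], "rank:", np.linalg.matrix_rank(X_dummy))
```

With k-1 columns the matrix has full column rank, which is the property the statistical dummy coding is after.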

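In pandas, the get_dummies function encodes the data. It automatically transforms every column that has an object type (such as strings) or a categorical type (the pandas category dtype).

The values attribute of data_dummies converts the DataFrame into a NumPy array, which is then used to train a machine learning model. Before training, the target must be separated from the features.

※ Encoding must be done on a DataFrame that holds both the training and the test data. If the two are one-hot encoded as separate DataFrames, the following problem arises: when one of, say, three category values happens to be missing from one of the DataFrames, the two end up with different columns, and the same column position takes on a completely different meaning. So encode a DataFrame that contains every category value first, and split it afterwards.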
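The column mismatch described in the note is easy to reproduce. This is a small sketch with hypothetical toy frames (the names `train` and `test` and the `color` column are assumptions for illustration): encoding the splits separately gives different columns, while encoding the combined frame and splitting afterwards keeps them aligned.

```python
import pandas as pd

# Hypothetical toy data: the test split happens to lack the "blue" category.
train = pd.DataFrame({"color": ["red", "green", "blue", "red"]})
test = pd.DataFrame({"color": ["green", "red"]})

# Encoding each split separately yields different sets of columns.
print(list(pd.get_dummies(train).columns))
print(list(pd.get_dummies(test).columns))

# Encoding the combined frame first, then splitting, keeps the columns aligned.
combined = pd.concat([train, test], keys=["train", "test"])
encoded = pd.get_dummies(combined)
train_enc, test_enc = encoded.loc["train"], encoded.loc["test"]
print(list(train_enc.columns) == list(test_enc.columns))
```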

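4.1.2 Categorical Features Encoded as Numbers

Categorical variables may be encoded as strings, but they are also often encoded directly as integers. How, then, can we tell at a glance whether such numbers are categorical or continuous?

The pandas get_dummies function assumes that every numeric feature is continuous and does not create dummy variables for it. One alternative is scikit-learn's OneHotEncoder, with which you can control which columns are treated as categorical (in recent scikit-learn versions, typically by combining it with ColumnTransformer).

Alternatively, convert the numeric columns of the DataFrame to strings.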
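As a rough sketch of the scikit-learn route, the snippet below one-hot encodes only an integer-coded column and passes a continuous column through unchanged. The array `X` is made up for illustration, and the use of `ColumnTransformer` assumes scikit-learn 0.20 or later; it is not the code used in this post's notebook.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: column 0 is an integer-coded category, column 1 is continuous.
X = np.array([[0, 1.5], [1, 0.3], [2, 2.2], [1, 0.9]])

ct = ColumnTransformer(
    [("cat", OneHotEncoder(), [0])],  # one-hot encode column 0 only
    remainder="passthrough",          # keep column 1 unchanged
    sparse_threshold=0,               # always return a dense array
)
X_enc = ct.fit_transform(X)
print(X_enc.shape)  # three one-hot columns plus the continuous column
print(X_enc)
```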

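4.2 Binning, Discretization, Linear Models, and Trees

Binning (discretization): splitting one feature into several. It makes linear models on continuous data more powerful.

The bins can be created with np.linspace. np.linspace(-3, 3, 11) generates 11 evenly spaced points between -3 and 3, which define 10 bins (the intervals between consecutive points).

Next, record which bin each data point falls into, using np.digitize. This turns the continuous feature in the dataset into a categorical feature that encodes the bin each data point belongs to.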
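To make the bin numbering concrete, here is a tiny sketch with a few hand-picked values (the array `x` is hypothetical, not the wave dataset used in the notebook). With the default `right=False`, `np.digitize` returns the index `i` such that `bins[i-1] <= x < bins[i]`, so bins are numbered from 1.

```python
import numpy as np

bins = np.linspace(-3, 3, 11)          # 11 edges -> 10 bins
x = np.array([-2.9, -0.75, 0.1, 2.7])  # hypothetical sample values

# For each value, find the index i with bins[i-1] <= value < bins[i].
which_bin = np.digitize(x, bins=bins)
print(which_bin)  # [ 1  4  6 10]
```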

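In the example, the linear model is no longer a single straight line: it fits a separate flat segment (a constant prediction) for each bin, which makes it much more flexible.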

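4.3 Interactions and Polynomials

Adding interaction and polynomial features of the original data is another way to enrich the feature representation. This is common in statistical modeling and is also used in many practical machine learning applications.

Interaction features let the linear model learn a separate slope in each bin, and polynomial features produce a much smoother curve. A drawback is that high-degree polynomials behave in extreme ways in regions with little data.

A more complex model, the kernel SVM, can learn a prediction of similar complexity without any explicit transformation of the features.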

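4.4 Univariate Nonlinear Transformations

We saw that adding squared or cubed terms can help a linear regression model. Applying mathematical functions such as log, exp, or sin is another useful feature transformation.

Tree-based models care only about the ordering of the feature values. Linear models and neural networks, by contrast, are closely tied to the scale and distribution of each feature, and a nonlinear relationship between a feature and the target is hard for them to model.

The log and exp functions adjust the relative scales in the data, which can help linear models and neural networks perform better.

The sin and cos functions are handy for data that contains periodic patterns.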
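One common way to use sin and cos for periodic data is to map a cyclic quantity onto a circle. The sketch below does this for an hour-of-day feature; the feature and the helper `cyclic_distance` are assumptions chosen for illustration and do not appear in the notebook.

```python
import numpy as np

hours = np.arange(24)                 # hypothetical hour-of-day feature
angle = 2 * np.pi * hours / 24
hour_sin, hour_cos = np.sin(angle), np.cos(angle)

def cyclic_distance(h1, h2):
    # Euclidean distance between two hours in the (sin, cos) plane.
    return np.hypot(hour_sin[h1] - hour_sin[h2], hour_cos[h1] - hour_cos[h2])

# 23:00 and 00:00 are as close as 00:00 and 01:00, unlike the raw values 23 and 0.
print(cyclic_distance(23, 0), cyclic_distance(0, 1))
# Hours on opposite sides of the clock are the farthest apart.
print(cyclic_distance(0, 12))
```

A linear model fed the raw hour sees 23 and 0 as far apart; with the two transformed columns they become neighbours.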

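4.5 Automatic Feature Selection

Adding features makes a model more complex and more likely to overfit. When adding new features or working with high-dimensional datasets, it is a good idea to keep only the most useful features and discard the rest.

This gives simpler models that generalize better. There are three basic strategies: univariate statistics, model-based selection, and iterative selection. All of them are supervised methods, so they need the target in order to be fit. The data should be split into training and test sets first, and feature selection should be fit on the training data only.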

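4.5.1 Univariate Statistics

Computes whether there is a statistically significant relationship between each feature and the target, and selects the features judged to be most strongly related. Because the test is univariate, each feature is evaluated on its own. As a result, a feature that is informative only in combination with another feature is discarded.

Classification: uses analysis of variance (ANOVA), a method that splits the data by class and compares the class means. A high ANOVA F-value for a feature means that its class means differ from one another.

These tests are very fast to compute and do not require building a model.

For univariate selection with the sklearn.feature_selection module, you typically choose f_classif (the default) as the test for classification and f_regression for regression. Features are then discarded based on the computed p-values.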

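Both f_classif and f_regression live in the sklearn.feature_selection module. f_classif computes the F-value by dividing the variation between the class means (SS_between) by the total variation (SS_tot) minus SS_between, each scaled by its degrees of freedom. With k classes and n samples, this is:

$$F = \frac{SS_{between} / (k - 1)}{(SS_{tot} - SS_{between}) / (n - k)}$$

To speed up the computation, scikit-learn uses a simplified expression derived from this formula.

f_regression computes the correlation coefficient for each feature and uses it to compute the F-value and p-value.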
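The F-value description above can be checked against scikit-learn. The following sketch computes the one-way ANOVA F-value by hand from SS_between and SS_tot and compares it with `f_classif`; it also derives an F-value from the Pearson correlation, assuming the standard relation F = r^2 / (1 - r^2) * (n - 2), and compares it with `f_regression`. The iris data and the helper `anova_f` are choices made for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import f_classif, f_regression

X, y = load_iris(return_X_y=True)
n, k = len(y), len(np.unique(y))

def anova_f(x, y):
    # One-way ANOVA F-value for a single feature x and class labels y.
    grand_mean = x.mean()
    ss_tot = ((x - grand_mean) ** 2).sum()
    ss_between = sum(
        (y == c).sum() * (x[y == c].mean() - grand_mean) ** 2
        for c in np.unique(y)
    )
    return (ss_between / (k - 1)) / ((ss_tot - ss_between) / (n - k))

f_manual = np.array([anova_f(X[:, j], y) for j in range(X.shape[1])])
f_sklearn, _ = f_classif(X, y)
print(np.allclose(f_manual, f_sklearn))

# f_regression: F-value from the correlation r between one feature and a target.
target = X[:, 3]  # petal width, used here as a hypothetical regression target
r = np.corrcoef(X[:, 0], target)[0, 1]
f_from_r = r ** 2 / (1 - r ** 2) * (n - 2)
f_reg, _ = f_regression(X[:, [0]], target)
print(np.isclose(f_from_r, f_reg[0]))
```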

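All of these methods take a parameter that sets the threshold for discarding features with a very high p-value (which are unlikely to be related to the target).

=> Under the hypothesis that the class means are equal, the p-value is the probability of obtaining an F-value at least as large as the one observed. In the F-distribution it is the area of the right tail beyond the F-value. This area is stored in the pvalues_ attribute. Features with a large pvalues_ entry have similar class means, so they are judged to have little influence on the target.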
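The right-tail interpretation can be verified directly. This is a minimal sketch, assuming SciPy is available (it is installed alongside scikit-learn) and using the iris data purely as an example: the survival function of the F-distribution with (k - 1, n - k) degrees of freedom, evaluated at each F-value, should reproduce `pvalues_`.

```python
import numpy as np
from scipy import stats
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
n, k = len(y), len(np.unique(y))

select = SelectKBest(score_func=f_classif, k=2).fit(X, y)

# Survival function = area of the F-distribution to the right of the F-value.
p_tail = stats.f.sf(select.scores_, k - 1, n - k)
print(np.allclose(p_tail, select.pvalues_, rtol=1e-6, atol=0))

# The two features with the smallest p-values (largest F-values) are kept.
print(select.get_support())
```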

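The methods differ in how they compute the threshold. The simplest, SelectKBest, selects a fixed number k of features. SelectPercentile selects a given percentage of the features.

The get_support method returns a Boolean mask marking the selected features, so you can check which ones were chosen.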

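4.5.2 Model-Based Feature Selection

Uses a supervised machine learning model to judge the importance of each feature and keeps only the most important ones.

The supervised model used for feature selection does not have to be the same model used for the final supervised modeling.

The feature selection model must provide some measure of importance for each feature so that the features can be ranked. Decision trees and the models built on them provide a feature_importances_ attribute that holds the importance of each feature. The absolute values of a linear model's coefficients can also serve as a measure of importance. Linear models with an L1 penalty learn sparse coefficients that use only a small subset of the features, so they can be used as a preprocessing step that selects features for another model.

In contrast to univariate selection, model-based selection considers all features at once, so it can capture interactions (if the model can). Model-based feature selection is implemented in SelectFromModel.

SelectFromModel selects all features whose importance, as measured by the supervised model, is greater than the given threshold. Because it can use a very complex model, it can select features far more powerfully than univariate tests.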
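The notebook in this post uses a random forest inside SelectFromModel. As a complement, here is a sketch of the L1-based variant mentioned above. The value `C=0.1` and the use of the unmodified breast cancer data are arbitrary choices for illustration; with an L1-penalized estimator, SelectFromModel keeps the features whose coefficients are not (numerically) zero.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale using statistics from the training split only.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# C=0.1 is an arbitrary illustrative value; smaller C means a stronger L1 penalty.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
select = SelectFromModel(l1_model).fit(X_train_s, y_train)

X_train_sel = select.transform(X_train_s)
X_test_sel = select.transform(X_test_s)
print(X_train_s.shape, X_train_sel.shape)
print(np.flatnonzero(select.get_support()))  # indices of the kept features
```

The selected columns can then be passed to any other model, which is the "preprocessing step" role described above.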

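4.5.3 Iterative Feature Selection

Univariate testing used no model, and model-based selection used a single model to select features.

In iterative feature selection, by contrast, a series of models is built, each with a different number of features. There are two basic approaches.

The first starts with no features selected and adds features one by one until some stopping criterion is reached.

The second starts with all features and removes them one by one until some stopping criterion is reached.

Because a series of models is built, these methods are much more computationally expensive. Recursive feature elimination (RFE) is one example. It starts with all features and builds a model, discards the least important feature according to that model, builds a new model on the remaining features, and repeats until only a prespecified number of features is left.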
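The elimination loop described above can be written out by hand. This is only an illustrative sketch of the idea, not scikit-learn's RFE implementation; the stopping value `n_features_to_keep = 10`, the small forest size, and the breast cancer data are assumptions chosen to keep the example quick.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
remaining = list(range(X.shape[1]))  # start with all features
n_features_to_keep = 10              # hypothetical stopping condition

while len(remaining) > n_features_to_keep:
    # Build a new model on the features that are still in play.
    model = RandomForestClassifier(n_estimators=20, random_state=0)
    model.fit(X[:, remaining], y)
    # Drop the feature the current model considers least important.
    weakest = int(np.argmin(model.feature_importances_))
    del remaining[weakest]

print(remaining)  # original column indices of the surviving features
```

Each pass trains a fresh model, which is why this family of methods costs much more than a single model-based selection.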

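4.6 Utilizing Expert Knowledge

Feature engineering is an important place to apply expert knowledge for a particular application. Domain experts can often help identify features that are more useful than the initial representation of the data.

Random forests need very little preprocessing of the data, which makes them a good model to try first. As a tree-based model, however, a random forest cannot extrapolate to feature ranges outside the training set.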
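The extrapolation limit is easy to see on synthetic data. The sketch below uses a made-up, noise-free relationship y = 2x on the range [0, 10] and asks both a random forest and a linear model to predict at x = 20; the data and model settings are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free data: y = 2x on the training range [0, 10].
X_train = np.linspace(0, 10, 50).reshape(-1, 1)
y_train = 2 * X_train.ravel()
X_new = np.array([[20.0]])  # far outside the training range

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)
lin = LinearRegression().fit(X_train, y_train)

# Tree predictions are averages of training targets, so they cannot exceed
# the largest target seen during training (20 here).
print(rf.predict(X_new))
# The linear model continues the trend and predicts a value close to 40.
print(lin.predict(X_new))
```

This is the same effect that gives the random forest a poor score on the raw time feature in the Citi Bike cells further down.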

from IPython.display import display
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import mglearn

# Font setup reference
# https://programmers.co.kr/learn/courses/21/lessons/950
import matplotlib
matplotlib.rc("font", family="NanumGothicCoding")

In [2]:

import os

data = pd.read_csv(
    os.path.join(mglearn.datasets.DATA_PATH, "adult.data"),
    header=None, index_col=False,
    names=['age', 'workclass', 'fnlwgt', 'education', 'education-num',
          'marital-status', 'occupation', 'relationship', 'race', 'gender',
          'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income'])

data = data[['age', 'workclass', 'education', 'gender', 'hours-per-week', 'occupation', 'income']]

display(data.head())

	age	workclass	education	gender	hours-per-week	occupation	income
0	39	State-gov	Bachelors	Male	40	Adm-clerical	<=50K
1	50	Self-emp-not-inc	Bachelors	Male	13	Exec-managerial	<=50K
2	38	Private	HS-grad	Male	40	Handlers-cleaners	<=50K
3	53	Private	11th	Male	40	Handlers-cleaners	<=50K
4	28	Private	Bachelors	Female	40	Prof-specialty	<=50K

In [3]:

print(data.gender.value_counts())

 Male      21790
 Female    10771
Name: gender, dtype: int64

In [4]:

print("Original features:\n", list(data.columns), "\n")
data_dummies = pd.get_dummies(data)
print("Features after get_dummies:\n", list(data_dummies.columns))

Original features:
 ['age', 'workclass', 'education', 'gender', 'hours-per-week', 'occupation', 'income'] 

Features after get_dummies:
 ['age', 'hours-per-week', 'workclass_ ?', 'workclass_ Federal-gov', 'workclass_ Local-gov', 'workclass_ Never-worked', 'workclass_ Private', 'workclass_ Self-emp-inc', 'workclass_ Self-emp-not-inc', 'workclass_ State-gov', 'workclass_ Without-pay', 'education_ 10th', 'education_ 11th', 'education_ 12th', 'education_ 1st-4th', 'education_ 5th-6th', 'education_ 7th-8th', 'education_ 9th', 'education_ Assoc-acdm', 'education_ Assoc-voc', 'education_ Bachelors', 'education_ Doctorate', 'education_ HS-grad', 'education_ Masters', 'education_ Preschool', 'education_ Prof-school', 'education_ Some-college', 'gender_ Female', 'gender_ Male', 'occupation_ ?', 'occupation_ Adm-clerical', 'occupation_ Armed-Forces', 'occupation_ Craft-repair', 'occupation_ Exec-managerial', 'occupation_ Farming-fishing', 'occupation_ Handlers-cleaners', 'occupation_ Machine-op-inspct', 'occupation_ Other-service', 'occupation_ Priv-house-serv', 'occupation_ Prof-specialty', 'occupation_ Protective-serv', 'occupation_ Sales', 'occupation_ Tech-support', 'occupation_ Transport-moving', 'income_ <=50K', 'income_ >50K']

In [5]:

data_dummies.head()

Out[5]:

	age	hours-per-week	workclass_ Private	workclass_ Self-emp-not-inc	workclass_ State-gov	...	occupation_ Prof-specialty	income_ <=50K
0	39	40	0	0	1	...	0	1
1	50	13	0	1	0	...	0	1
2	38	40	1	0	0	...	0	1
3	53	40	1	0	0	...	0	1
4	28	40	1	0	0	...	1	1

5 rows × 46 columns

In [6]:

features = data_dummies.loc[:, 'age':'occupation_ Transport-moving']
# Extract NumPy arrays
X = features.values
y = data_dummies['income_ >50K'].values
print("X.shape: {} y.shape: {}".format(X.shape, y.shape))

X.shape: (32561, 44) y.shape: (32561,)

In [7]:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
print("Test score: {:.2f}".format(logreg.score(X_test, y_test)))

Test score: 0.81

In [8]:

demo_df = pd.DataFrame({'Integer Feature': [0, 1, 2, 1], 'Categorical Feature': ['socks', 'fox', 'socks', 'box']})
display(demo_df)

	Categorical Feature	Integer Feature
0	socks	0
1	fox	1
2	socks	2
3	box	1

In [9]:

display(pd.get_dummies(demo_df))

	Integer Feature	Categorical Feature_box	Categorical Feature_fox	Categorical Feature_socks
0	0	0	0	1
1	1	0	1	0
2	2	0	0	1
3	1	1	0	0

In [10]:

demo_df['Integer Feature'] = demo_df['Integer Feature'].astype(str)
display(pd.get_dummies(demo_df, columns=['Integer Feature', 'Categorical Feature']))

	Integer Feature_0	Integer Feature_1	Integer Feature_2	Categorical Feature_box	Categorical Feature_fox	Categorical Feature_socks
0	1	0	0	0	0	1
1	0	1	0	0	1	0
2	0	0	1	0	0	1
3	0	1	0	1	0	0

In [11]:

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

X, y = mglearn.datasets.make_wave(n_samples=100)
line = np.linspace(-3 ,3, 1000, endpoint=False).reshape(-1, 1)

reg = DecisionTreeRegressor(min_samples_split=3).fit(X, y)
plt.plot(line, reg.predict(line), label="decision tree")

reg = LinearRegression().fit(X, y)
plt.plot(line, reg.predict(line), '--', label="linear regression")

plt.plot(X[:, 0], y, 'o', c='k')
plt.ylabel("Regression output")
plt.xlabel("Input feature")
plt.legend(loc="best")

/usr/local/lib/python3.6/site-packages/scipy/linalg/basic.py:1226: RuntimeWarning: internal gelsd driver lwork query error, required iwork dimension not returned. This is likely the result of LAPACK bug 0038, fixed in LAPACK 3.2.2 (released July 21, 2010). Falling back to 'gelss' driver.
  warnings.warn(mesg, RuntimeWarning)

Out[11]:

<matplotlib.legend.Legend at 0x10c7a0c50>

In [12]:

bins = np.linspace(-3 ,3, 11)
print("bins: {}".format(bins))

bins: [-3.  -2.4 -1.8 -1.2 -0.6  0.   0.6  1.2  1.8  2.4  3. ]

In [13]:

which_bin = np.digitize(X, bins=bins)
print("\nData points:\n", X[:5])
print("\nBin membership of the data points:\n", which_bin[:5])

Data points:
 [[-0.75275929]
 [ 2.70428584]
 [ 1.39196365]
 [ 0.59195091]
 [-2.06388816]]

Bin membership of the data points:
 [[ 4]
 [10]
 [ 8]
 [ 6]
 [ 2]]

In [14]:

from sklearn.preprocessing import OneHotEncoder
# In scikit-learn 1.2 and later this parameter is named sparse_output
encoder = OneHotEncoder(sparse=False)
encoder.fit(which_bin)

X_binned = encoder.transform(which_bin)
print(X_binned[:5])

[[0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]]

In [15]:

print("X_binned.shape: {}".format(X_binned.shape))

X_binned.shape: (100, 10)

In [16]:

line_binned = encoder.transform(np.digitize(line, bins=bins))

reg = LinearRegression().fit(X_binned, y)
plt.plot(line, reg.predict(line_binned), label='linear regression on bins')

reg = DecisionTreeRegressor(min_samples_split=3).fit(X_binned, y)
plt.plot(line, reg.predict(line_binned), '--', label='decision tree on bins')
plt.plot(X[:, 0], y, 'o', c='k')
plt.vlines(bins, -3, 3, linewidth=1, alpha=.2)
plt.legend(loc="best")
plt.ylabel("Regression output")
plt.xlabel("Input feature")

Out[16]:

Text(0.5,0,'Input feature')

In [17]:

X_combined = np.hstack([X, X_binned])
print(X_combined.shape)

(100, 11)

In [18]:

reg = LinearRegression().fit(X_combined, y)

line_combined = np.hstack([line, line_binned])
plt.plot(line, reg.predict(line_combined), label='linear regression with original feature added')

# Draw the bin edges as dotted vertical lines
for bin_edge in bins:
    plt.plot([bin_edge, bin_edge], [-3, 3], ':', c='k', linewidth=1)
plt.legend(loc="best")
plt.ylabel("Regression output")
plt.xlabel("Input feature")
plt.plot(X[:, 0], y, 'o', c='k')

Out[18]:

[<matplotlib.lines.Line2D at 0x10d94bd68>]

In [19]:

X_product = np.hstack([X_binned, X * X_binned])
print(X_product.shape)

(100, 20)

In [20]:

reg = LinearRegression().fit(X_product, y)

line_product = np.hstack([line_binned, line * line_binned])
plt.plot(line, reg.predict(line_product), label='linear regression with original feature multiplied')

# Draw the bin edges as dotted vertical lines
for bin_edge in bins:
    plt.plot([bin_edge, bin_edge], [-3, 3], ':', c='k', linewidth=1)

plt.plot(X[:, 0], y, 'o', c='k')
plt.ylabel('Regression output')
plt.xlabel('Input feature')
plt.legend(loc="best")

Out[20]:

<matplotlib.legend.Legend at 0x113793f98>

In [21]:

from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=10, include_bias=False)
poly.fit(X)
X_poly = poly.transform(X)

In [22]:

print("X_poly.shape: {}".format(X_poly.shape))

X_poly.shape: (100, 10)

In [23]:

print("Entries of X:\n{}".format(X[:5]))
print("Entries of X_poly:\n{}".format(X_poly[:5]))

Entries of X:
[[-0.75275929]
 [ 2.70428584]
 [ 1.39196365]
 [ 0.59195091]
 [-2.06388816]]
Entries of X_poly:
[[-7.52759287e-01  5.66646544e-01 -4.26548448e-01  3.21088306e-01
  -2.41702204e-01  1.81943579e-01 -1.36959719e-01  1.03097700e-01
  -7.76077513e-02  5.84199555e-02]
 [ 2.70428584e+00  7.31316190e+00  1.97768801e+01  5.34823369e+01
   1.44631526e+02  3.91124988e+02  1.05771377e+03  2.86036036e+03
   7.73523202e+03  2.09182784e+04]
 [ 1.39196365e+00  1.93756281e+00  2.69701700e+00  3.75414962e+00
   5.22563982e+00  7.27390068e+00  1.01250053e+01  1.40936394e+01
   1.96178338e+01  2.73073115e+01]
 [ 5.91950905e-01  3.50405874e-01  2.07423074e-01  1.22784277e-01
   7.26822637e-02  4.30243318e-02  2.54682921e-02  1.50759786e-02
   8.92423917e-03  5.28271146e-03]
 [-2.06388816e+00  4.25963433e+00 -8.79140884e+00  1.81444846e+01
  -3.74481869e+01  7.72888694e+01 -1.59515582e+02  3.29222321e+02
  -6.79478050e+02  1.40236670e+03]]

In [24]:

# On scikit-learn 1.0 and later, use poly.get_feature_names_out() instead
print("Term names:\n{}".format(poly.get_feature_names()))

Term names:
['x0', 'x0^2', 'x0^3', 'x0^4', 'x0^5', 'x0^6', 'x0^7', 'x0^8', 'x0^9', 'x0^10']

In [25]:

reg = LinearRegression().fit(X_poly, y)

line_poly = poly.transform(line)
plt.plot(line, reg.predict(line_poly), label='polynomial linear regression')
plt.plot(X[:, 0], y, 'o', c='k')
plt.ylabel("Regression output")
plt.xlabel("Input feature")
plt.legend(loc="best")

Out[25]:

<matplotlib.legend.Legend at 0x113b173c8>

In [26]:

from sklearn.svm import SVR

for gamma in [1, 10]:
    svr = SVR(gamma=gamma).fit(X, y)
    plt.plot(line, svr.predict(line), label='SVR gamma={}'.format(gamma))

plt.plot(X[:, 0], y, 'o', c='k')
plt.ylabel("Regression output")
plt.xlabel("Input feature")
plt.legend(loc="best")

Out[26]:

<matplotlib.legend.Legend at 0x113a73f60>

In [27]:

# load_boston was removed in scikit-learn 1.2, so this cell needs an older version
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

boston = load_boston()
X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=0)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [28]:

poly = PolynomialFeatures(degree=2).fit(X_train_scaled)
X_train_poly = poly.transform(X_train_scaled)
X_test_poly = poly.transform(X_test_scaled)
print("X_train.shape: {}".format(X_train.shape))
print("X_train_poly.shape: {}".format(X_train_poly.shape))

X_train.shape: (379, 13)
X_train_poly.shape: (379, 105)

In [29]:

print("Polynomial feature names:\n{}".format(poly.get_feature_names()))

Polynomial feature names:
['1', 'x0', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10', 'x11', 'x12', 'x0^2', 'x0 x1', 'x0 x2', 'x0 x3', 'x0 x4', 'x0 x5', 'x0 x6', 'x0 x7', 'x0 x8', 'x0 x9', 'x0 x10', 'x0 x11', 'x0 x12', 'x1^2', 'x1 x2', 'x1 x3', 'x1 x4', 'x1 x5', 'x1 x6', 'x1 x7', 'x1 x8', 'x1 x9', 'x1 x10', 'x1 x11', 'x1 x12', 'x2^2', 'x2 x3', 'x2 x4', 'x2 x5', 'x2 x6', 'x2 x7', 'x2 x8', 'x2 x9', 'x2 x10', 'x2 x11', 'x2 x12', 'x3^2', 'x3 x4', 'x3 x5', 'x3 x6', 'x3 x7', 'x3 x8', 'x3 x9', 'x3 x10', 'x3 x11', 'x3 x12', 'x4^2', 'x4 x5', 'x4 x6', 'x4 x7', 'x4 x8', 'x4 x9', 'x4 x10', 'x4 x11', 'x4 x12', 'x5^2', 'x5 x6', 'x5 x7', 'x5 x8', 'x5 x9', 'x5 x10', 'x5 x11', 'x5 x12', 'x6^2', 'x6 x7', 'x6 x8', 'x6 x9', 'x6 x10', 'x6 x11', 'x6 x12', 'x7^2', 'x7 x8', 'x7 x9', 'x7 x10', 'x7 x11', 'x7 x12', 'x8^2', 'x8 x9', 'x8 x10', 'x8 x11', 'x8 x12', 'x9^2', 'x9 x10', 'x9 x11', 'x9 x12', 'x10^2', 'x10 x11', 'x10 x12', 'x11^2', 'x11 x12', 'x12^2']

In [30]:

from sklearn.linear_model import Ridge
ridge = Ridge().fit(X_train_scaled, y_train)
print("Score without interaction features: {:.3f}".format(ridge.score(X_test_scaled, y_test)))

ridge = Ridge().fit(X_train_poly, y_train)
print("Score with interaction features: {:.3f}".format(ridge.score(X_test_poly, y_test)))

Score without interaction features: 0.621
Score with interaction features: 0.753

In [31]:

from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train_scaled, y_train)
print("Score without interaction features: {:.3f}".format(rf.score(X_test_scaled, y_test)))

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train_poly, y_train)
print("Score with interaction features: {:.3f}".format(rf.score(X_test_poly, y_test)))

Score without interaction features: 0.795
Score with interaction features: 0.773

In [32]:

rnd = np.random.RandomState(0)
X_org = rnd.normal(size=(1000, 3))
w = rnd.normal(size=3)

X = rnd.poisson(10 * np.exp(X_org))
y = np.dot(X_org, w)
print(X[:10, 0])

[ 56  81  25  20  27  18  12  21 109   7]

In [33]:

print("Number of feature appearances:\n{}".format(np.bincount(X[:, 0])))

Number of feature appearances:
[28 38 68 48 61 59 45 56 37 40 35 34 36 26 23 26 27 21 23 23 18 21 10  9
 17  9  7 14 12  7  3  8  4  5  5  3  4  2  4  1  1  3  2  5  3  8  2  5
  2  1  2  3  3  2  2  3  3  0  1  2  1  0  0  3  1  0  0  0  1  3  0  1
  0  2  0  1  1  0  0  0  0  1  0  0  2  2  0  1  1  0  0  0  0  1  1  0
  0  0  0  0  0  0  1  0  0  0  0  0  1  1  0  0  1  0  0  0  0  0  0  0
  1  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1]

In [34]:

bins = np.bincount(X[:, 0])
plt.bar(range(len(bins)), bins, color='grey')
plt.ylabel("Number of appearances")
plt.xlabel("Value")

Out[34]:

Text(0.5,0,'Value')

In [35]:

from sklearn.linear_model import Ridge
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
score = Ridge().fit(X_train, y_train).score(X_test, y_test)
print("Test score: {:.3f}".format(score))

Test score: 0.622

In [36]:

X_train_log = np.log(X_train + 1)
X_test_log = np.log(X_test + 1)

In [37]:

plt.hist(X_train_log[:, 0], bins=25, color='gray')
plt.ylabel("Number of appearances")
plt.xlabel("Value")

Out[37]:

Text(0.5,0,'Value')

In [38]:

score = Ridge().fit(X_train_log, y_train).score(X_test_log, y_test)
print("Test score: {:.3f}".format(score))

Test score: 0.875

In [41]:

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectPercentile
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()

rng = np.random.RandomState(42)
noise = rng.normal(size=(len(cancer.data), 50))

X_w_noise = np.hstack([cancer.data, noise])
X_train, X_test, y_train, y_test = train_test_split(X_w_noise, cancer.target, random_state=0, test_size=.5)
select = SelectPercentile(percentile=50)
select.fit(X_train, y_train)
X_train_selected = select.transform(X_train)

print("X_train.shape: {}".format(X_train.shape))
print("X_train_selected.shape: {}".format(X_train_selected.shape))

X_train.shape: (284, 80)
X_train_selected.shape: (284, 40)

In [43]:

mask = select.get_support()
print(mask)

plt.matshow(mask.reshape(1, -1), cmap='gray_r')
plt.xlabel("Feature index")

[ True  True  True  True  True  True  True  True  True False  True False
  True  True  True  True  True  True False False  True  True  True  True
  True  True  True  True  True  True False False False  True False  True
 False False  True False False False False  True False False  True False
 False  True False  True False False False False False False  True False
  True False False False False  True False  True False False False False
  True  True False  True False False False False]

Out[43]:

Text(0.5,0,'Feature index')

In [48]:

from sklearn.linear_model import LogisticRegression

X_test_selected = select.transform(X_test)

lr = LogisticRegression()
lr.fit(X_train, y_train)
print("Score with all features: {:.3f}".format(lr.score(X_test, y_test)))
lr.fit(X_train_selected, y_train)
print("Score with only the selected features: {:.3f}".format(lr.score(X_test_selected, y_test)))

Score with all features: 0.930
Score with only the selected features: 0.940

In [50]:

from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
select = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold="median")

In [51]:

select.fit(X_train, y_train)

X_train_l1 = select.transform(X_train)
print("X_train.shape: {}".format(X_train.shape))
print("X_train_l1.shape: {}".format(X_train_l1.shape))

X_train.shape: (284, 80)
X_train_l1.shape: (284, 40)

In [53]:

mask = select.get_support()
plt.matshow(mask.reshape(1, -1), cmap="gray_r")
plt.xlabel("Feature index")

Out[53]:

Text(0.5,0,'Feature index')

In [54]:

X_test_l1 = select.transform(X_test)
score = LogisticRegression().fit(X_train_l1, y_train).score(X_test_l1, y_test)
print("Test score: {:.3f}".format(score))

Test score: 0.951

In [56]:

from sklearn.feature_selection import RFE
select = RFE(RandomForestClassifier(n_estimators=100, random_state=42), n_features_to_select=40)
select.fit(X_train, y_train)
mask = select.get_support()
plt.matshow(mask.reshape(1, -1), cmap='gray_r')
plt.xlabel("Feature index")

Out[56]:

Text(0.5,0,'Feature index')

In [57]:

X_train_rfe = select.transform(X_train)
X_test_rfe = select.transform(X_test)

score = LogisticRegression().fit(X_train_rfe, y_train).score(X_test_rfe, y_test)
print("Test score: {:.3f}".format(score))

Test score: 0.951

In [58]:

print("Test score: {:.3f}".format(select.score(X_test, y_test)))

Test score: 0.951

In [59]:

citibike = mglearn.datasets.load_citibike()

print("Citi Bike data:\n{}".format(citibike.head()))

Citi Bike data:
starttime
2015-08-01 00:00:00     3
2015-08-01 03:00:00     0
2015-08-01 06:00:00     9
2015-08-01 09:00:00    41
2015-08-01 12:00:00    39
Freq: 3H, Name: one, dtype: int64

In [70]:

plt.figure(figsize=(10, 3))
xticks = pd.date_range(start=citibike.index.min(), end=citibike.index.max(), freq='D')
week = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]
xticks_name = [week[int(w)]+d for w, d in zip(xticks.strftime("%w"), xticks.strftime(" %m-%d"))]
plt.xticks(xticks, xticks_name, rotation=90, ha="left")
plt.plot(citibike, linewidth=1)
plt.xlabel("Date")
plt.ylabel("Rentals")

Out[70]:

Text(0,0.5,'Rentals')

In [71]:

y = citibike.values
X = citibike.index.astype("int64").values.reshape(-1, 1)

In [77]:

n_train = 184

def eval_on_features(features, target, regressor):
    X_train, X_test = features[:n_train], features[n_train:]
    y_train, y_test = target[:n_train], target[n_train:]
    regressor.fit(X_train, y_train)
    print("Test-set R^2: {:.2f}".format(regressor.score(X_test, y_test)))
    y_pred = regressor.predict(X_test)
    y_pred_train = regressor.predict(X_train)
    plt.figure(figsize=(10, 3))

    plt.xticks(range(0, len(X), 8), xticks_name, rotation=90, ha="left")

    plt.plot(range(n_train), y_train, label="train")
    plt.plot(range(n_train, len(y_test) + n_train), y_test, '-', label="test")
    plt.plot(range(n_train), y_pred_train, '--', label="prediction train")

    plt.plot(range(n_train, len(y_test) + n_train), y_pred, '--', label="prediction test")

    plt.legend(loc=(1.01, 0))
    plt.xlabel("Date")
    plt.ylabel("Rentals")

In [78]:

from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=100, random_state=0)
eval_on_features(X, y, regressor)

Test-set R^2: -0.04

In [79]:

X_hour = citibike.index.hour.values.reshape(-1, 1)
eval_on_features(X_hour, y, regressor)

Test-set R^2: 0.60

In [81]:

X_hour_week = np.hstack([citibike.index.dayofweek.values.reshape(-1, 1), citibike.index.hour.values.reshape(-1, 1)])
eval_on_features(X_hour_week, y, regressor)

Test-set R^2: 0.84

In [82]:

from sklearn.linear_model import LinearRegression
eval_on_features(X_hour_week, y, LinearRegression())

Test-set R^2: 0.13

In [83]:

enc = OneHotEncoder()
X_hour_week_onehot = enc.fit_transform(X_hour_week).toarray()

In [84]:

eval_on_features(X_hour_week_onehot, y, Ridge())

Test-set R^2: 0.62

In [85]:

poly_transformer = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_hour_week_onehot_poly = poly_transformer.fit_transform(X_hour_week_onehot)
lr = Ridge()
eval_on_features(X_hour_week_onehot_poly, y, lr)

Test-set R^2: 0.85

In [86]:

hour = ["%02d:00" % i for i in range(0, 24, 3)]
day = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
features = day + hour

In [88]:

# On scikit-learn 1.0 and later, use get_feature_names_out(features) instead
features_poly = poly_transformer.get_feature_names(features)
features_nonzero = np.array(features_poly)[lr.coef_ != 0]
coef_nonzero = lr.coef_[lr.coef_ != 0]

In [91]:

plt.figure(figsize=(15, 2))
plt.plot(coef_nonzero, 'o')
plt.xticks(np.arange(len(coef_nonzero)), features_nonzero, rotation=90)
plt.xlabel("Feature name")
plt.ylabel("Coefficient magnitude")

Out[91]:

Text(0,0.5,'Coefficient magnitude')
