Regression Analysis for Employee Attrition Rate

2022-11-14 13 min read

Employee Attrition Rate using Regression

Introduction

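Artificial intelligence is used across many industries to automate processes, gather business insights, and speed up workflows. We will use Python to study AI in a real-world scenario and see how it actually affects industry.

Employees are the most important part of an organization, and successful employees contribute a great deal to it. In this notebook we use AI to predict the employee attrition rate, that is, how well a company is able to retain its employees.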

Context

We use a dataset containing employee attrition rates, collected by Hackerearth and uploaded to [Kaggle]. We will predict the attrition rate with regression analysis and check how well our model performs.

Use Python to open csv files

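We work with the dataset using scikit-learn and pandas. Scikit-learn is a very useful machine learning library that provides efficient tools for predictive data analysis. Pandas is a popular Python library for data science; its powerful, flexible data structures make data manipulation and analysis easier.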

Import Libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

from sklearn.metrics import mean_squared_error

Loading the Dataset

The dataset contains employee attrition rates. Let's visualize it.

# Load [Dataset]_Module11_Train_(Employee).csv into the train variable (a DataFrame)
# your code here 
df_train = pd.read_csv("./[Dataset]_Module11_Train_(Employee).csv")

Task 1: Print the columns of the training set

# Print the columns of the training set
# your code here 
df_train.columns

Index(['Employee_ID', 'Gender', 'Age', 'Education_Level',
       'Relationship_Status', 'Hometown', 'Unit', 'Decision_skill_possess',
       'Time_of_service', 'Time_since_promotion', 'growth_rate', 'Travel_Rate',
       'Post_Level', 'Pay_Scale', 'Compensation_and_Benefits',
       'Work_Life_balance', 'VAR1', 'VAR2', 'VAR3', 'VAR4', 'VAR5', 'VAR6',
       'VAR7', 'Attrition_rate'],
      dtype='object')

# Check the size and the first 5 rows of the train dataset
# your code here 
df_train.head(5)

	Employee_ID	Gender	Age	Education_Level	Relationship_Status	Hometown	Unit	Decision_skill_possess	Time_of_service	Time_since_promotion	...	Compensation_and_Benefits	Work_Life_balance	VAR1	VAR2	VAR3	VAR4	VAR5	VAR6	VAR7	Attrition_rate
0	EID_23371	F	42.0	4	Married	Franklin	IT	Conceptual	4.0	4	...	type2	3.0	4	0.7516	1.8688	2.0	4	5	3	0.1841
1	EID_18000	M	24.0	3	Single	Springfield	Logistics	Analytical	5.0	4	...	type2	4.0	3	-0.9612	-0.4537	2.0	3	5	3	0.0670
2	EID_3891	F	58.0	3	Married	Clinton	Quality	Conceptual	27.0	3	...	type2	1.0	4	-0.9612	-0.4537	3.0	3	8	3	0.0851
3	EID_17492	F	26.0	3	Single	Lebanon	Human Resource Management	Behavioral	4.0	3	...	type2	1.0	3	-1.8176	-0.4537	NaN	3	7	3	0.0668
4	EID_22534	F	31.0	1	Married	Springfield	Logistics	Conceptual	5.0	4	...	type3	3.0	1	0.7516	-0.4537	2.0	2	8	2	0.1827

5 rows × 24 columns

# Check the train dataset info
# your code here 
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7000 entries, 0 to 6999
Data columns (total 24 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Employee_ID                7000 non-null   object 
 1   Gender                     7000 non-null   object 
 2   Age                        6588 non-null   float64
 3   Education_Level            7000 non-null   int64  
 4   Relationship_Status        7000 non-null   object 
 5   Hometown                   7000 non-null   object 
 6   Unit                       7000 non-null   object 
 7   Decision_skill_possess     7000 non-null   object 
 8   Time_of_service            6856 non-null   float64
 9   Time_since_promotion       7000 non-null   int64  
 10  growth_rate                7000 non-null   int64  
 11  Travel_Rate                7000 non-null   int64  
 12  Post_Level                 7000 non-null   int64  
 13  Pay_Scale                  6991 non-null   float64
 14  Compensation_and_Benefits  7000 non-null   object 
 15  Work_Life_balance          6989 non-null   float64
 16  VAR1                       7000 non-null   int64  
 17  VAR2                       6423 non-null   float64
 18  VAR3                       7000 non-null   float64
 19  VAR4                       6344 non-null   float64
 20  VAR5                       7000 non-null   int64  
 21  VAR6                       7000 non-null   int64  
 22  VAR7                       7000 non-null   int64  
 23  Attrition_rate             7000 non-null   float64
dtypes: float64(8), int64(9), object(7)
memory usage: 1.3+ MB

# Check the data types of the train dataset
# your code here 
df_train.dtypes

Employee_ID                   object
Gender                        object
Age                          float64
Education_Level                int64
Relationship_Status           object
Hometown                      object
Unit                          object
Decision_skill_possess        object
Time_of_service              float64
Time_since_promotion           int64
growth_rate                    int64
Travel_Rate                    int64
Post_Level                     int64
Pay_Scale                    float64
Compensation_and_Benefits     object
Work_Life_balance            float64
VAR1                           int64
VAR2                         float64
VAR3                         float64
VAR4                         float64
VAR5                           int64
VAR6                           int64
VAR7                           int64
Attrition_rate               float64
dtype: object

# Count the unique items in each column of the train dataset.
# your code here 
df_train.nunique(dropna=False)

Employee_ID                  7000
Gender                          2
Age                            48
Education_Level                 5
Relationship_Status             2
Hometown                        5
Unit                           12
Decision_skill_possess          4
Time_of_service                45
Time_since_promotion            5
growth_rate                    55
Travel_Rate                     3
Post_Level                      5
Pay_Scale                      11
Compensation_and_Benefits       5
Work_Life_balance               6
VAR1                            5
VAR2                            6
VAR3                            5
VAR4                            4
VAR5                            5
VAR6                            5
VAR7                            5
Attrition_rate               3317
dtype: int64

# Plot a histogram of the Attrition_rate column
# your code here 
df_train['Attrition_rate'].plot.hist()

<AxesSubplot: ylabel='Frequency'>

# Check how gender affects employee performance (growth_rate).

# your code here 
# dft_g_gr = df_train.groupby('Gender')['growth_rate'].mean()
# dft_g_gr

dft_g_gr = df_train[['Gender','growth_rate']].groupby(['Gender']).agg('median')
dft_g_gr

	growth_rate
Gender
F	48.0
M	47.0
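A median on its own does not show how many rows stand behind each group. As a minimal sketch, using made-up toy values rather than the real dataset, `agg` can report the count, mean, and median side by side:

```python
import pandas as pd

# Toy data standing in for df_train; the values are made up for illustration.
toy = pd.DataFrame({
    "Gender": ["F", "F", "F", "M", "M"],
    "growth_rate": [40, 48, 60, 30, 64],
})

# Count, mean and median of growth_rate for each gender in one table.
summary = toy.groupby("Gender")["growth_rate"].agg(["count", "mean", "median"])
print(summary)
```

On the real data the same call would be applied to `df_train` in place of `toy`.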

# your code here 
# sns.barplot(
#     data= dft_g_gr,
#     x= "growth_rate",
#     y= 'Rate'
# )

dft_g_gr.T.plot(kind='bar', figsize=(10, 5) )
plt.ylabel('Rate')
plt.legend(loc="upper left") # legend
plt.xticks(rotation=0); # show tick labels horizontally instead of vertically
plt.show()

# Visualize the number of males and females in the dataset
# your code here 
plt.figure(figsize=(10, 5))
sns.countplot(x=df_train['Gender'], palette = 'bone')
plt.title('Comparison of Males and Females', fontweight = 30)
plt.xlabel('Gender')
plt.ylabel('Count')

Text(0, 0.5, 'Count')

# Visualize the dataset grouped by Hometown

# your code here 
plt.figure(figsize=(10, 5))
sns.countplot(x=df_train['Hometown'], palette = 'pastel')
plt.title('Comparison of various Groups', fontweight = 30)
plt.xlabel('Groups')
plt.ylabel('Count')

Text(0, 0.5, 'Count')

# Visualize the dataset by relationship status

# your code here 
plt.figure(figsize=(10, 5))
sns.countplot(x=df_train['Relationship_Status'], palette = 'pastel')
plt.title('Comparison of various Groups', fontweight = 30)
plt.xlabel('Groups')
plt.ylabel('Count')

Text(0, 0.5, 'Count')

# Check how relationship status relates to the attrition rate

# your code here 
dft_rs_ar = df_train[['Relationship_Status','Attrition_rate']].groupby(['Relationship_Status']).agg('median')
dft_rs_ar

	Attrition_rate
Relationship_Status
Married	0.14155
Single	0.14470

# your code here 
dft_rs_ar.T.plot(kind='bar', figsize=(10, 5) )
plt.ylabel('Rates')
plt.title('Relationship Status', fontweight = 30)
plt.legend(loc="upper right") # legend
plt.xticks(rotation=0); # show tick labels horizontally instead of vertically
plt.show()

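Task 2: Use the describe function to get information about the training data set

Data Description

Let's check how the data is distributed. We can look at each column's mean, maximum, and minimum alongside its other summary statistics.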

# your code here 
df_train.describe()

	Age	Education_Level	Time_of_service	Time_since_promotion	growth_rate	Travel_Rate	Post_Level	Pay_Scale	Work_Life_balance	VAR1	VAR2	VAR3	VAR4	VAR5	VAR6	VAR7	Attrition_rate
count	6588.000000	7000.000000	6856.000000	7000.000000	7000.000000	7000.000000	7000.000000	6991.000000	6989.000000	7000.000000	6423.000000	7000.000000	6344.000000	7000.000000	7000.000000	7000.000000	7000.000000
mean	39.622799	3.187857	13.385064	2.367143	47.064286	0.817857	2.798000	6.006294	2.387895	3.098571	-0.008126	-0.013606	1.891078	2.834143	7.101286	3.257000	0.189376
std	13.606920	1.065102	10.364188	1.149395	15.761406	0.648205	1.163721	2.058435	1.122786	0.836377	0.989850	0.986933	0.529403	0.938945	1.164262	0.925319	0.185753
min	19.000000	1.000000	0.000000	0.000000	20.000000	0.000000	1.000000	1.000000	1.000000	1.000000	-1.817600	-2.776200	1.000000	1.000000	5.000000	1.000000	0.000000
25%	27.000000	3.000000	5.000000	1.000000	33.000000	0.000000	2.000000	5.000000	1.000000	3.000000	-0.961200	-0.453700	2.000000	2.000000	6.000000	3.000000	0.070400
50%	37.000000	3.000000	10.000000	2.000000	47.000000	1.000000	3.000000	6.000000	2.000000	3.000000	-0.104800	-0.453700	2.000000	3.000000	7.000000	3.000000	0.142650
75%	52.000000	4.000000	21.000000	3.000000	61.000000	1.000000	3.000000	8.000000	3.000000	3.000000	0.751600	0.707500	2.000000	3.000000	8.000000	4.000000	0.235000
max	65.000000	5.000000	43.000000	4.000000	74.000000	2.000000	5.000000	10.000000	5.000000	5.000000	1.608100	1.868800	3.000000	5.000000	9.000000	5.000000	0.995900

# Check whether the training set has missing values.
# your code here 
df_train.isna().any()

Employee_ID                  False
Gender                       False
Age                           True
Education_Level              False
Relationship_Status          False
Hometown                     False
Unit                         False
Decision_skill_possess       False
Time_of_service               True
Time_since_promotion         False
growth_rate                  False
Travel_Rate                  False
Post_Level                   False
Pay_Scale                     True
Compensation_and_Benefits    False
Work_Life_balance             True
VAR1                         False
VAR2                          True
VAR3                         False
VAR4                          True
VAR5                         False
VAR6                         False
VAR7                         False
Attrition_rate               False
dtype: bool
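`isna().any()` only says whether a column has gaps, not how many. The `info()` output above already hints at the scale (for example, Age has 6588 non-null values out of 7000). A small sketch on toy data (the values are made up) that counts the missing values per column and expresses them as a percentage:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_train; the values are made up for illustration.
toy = pd.DataFrame({
    "Age": [42.0, np.nan, 58.0, 26.0],
    "VAR2": [0.75, -0.96, np.nan, np.nan],
    "VAR7": [3, 3, 3, 2],
})

# Number and share of missing values per column.
missing = toy.isna().sum()
missing_pct = (toy.isna().mean() * 100).round(1)

# Keep only the columns that actually have gaps, worst first.
report = pd.DataFrame({"missing": missing, "percent": missing_pct})
report = report[report["missing"] > 0].sort_values("missing", ascending=False)
print(report)
```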

Data Visualization

Now let's use a correlation matrix to see how strongly the features are related to one another.

plt.figure(figsize=(18,10))
# numeric_only=True restricts the correlation to numeric columns,
# which avoids the FutureWarning that pandas raises otherwise.
cor = df_train.corr(numeric_only=True)
sns.heatmap(cor, annot=True, cmap=plt.cm.Accent)
# Save before plt.show(); saving afterwards writes out an empty figure.
plt.savefig("main_correlation.png")
plt.show()
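A heatmap of this size can be hard to read. One way to summarize it, sketched here on synthetic data, is to rank the features by the absolute value of their correlation with the target. The column names below are hypothetical, and the sketch assumes pandas 1.5 or later, where `corr` accepts `numeric_only`:

```python
import numpy as np
import pandas as pd

# Synthetic data: feature_a drives the target, feature_b is noise.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "feature_a": rng.normal(size=n),
    "feature_b": rng.normal(size=n),
    "label_col": ["x", "y"] * (n // 2),  # non-numeric column, skipped by corr
})
toy["target"] = 2.0 * toy["feature_a"] + rng.normal(scale=0.1, size=n)

# Correlation matrix over the numeric columns only.
cor = toy.corr(numeric_only=True)

# Rank features by |correlation| with the target, strongest first.
ranking = cor["target"].drop("target").abs().sort_values(ascending=False)
print(ranking)
```

For the real data, `toy` would be `df_train` and `"target"` would be `"Attrition_rate"`.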

Preparing the Model

Now we finalize the data for training and prepare the model.

# Create evaluation function
from sklearn.metrics import mean_squared_log_error, mean_absolute_error
def rmsle(y_test, y_preds):
    return np.sqrt(mean_squared_log_error(y_test, y_preds))
# Create function to evaluate our model
def show_scores(y_test, val_preds):
    scores = {"Valid MAE": mean_absolute_error(y_test, val_preds),
              "Valid RMSLE": rmsle(y_test, val_preds)}
    return scores
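To make the two metrics concrete: MAE is the mean of |y - y_hat|, and RMSLE is the square root of the mean of (log(1 + y_hat) - log(1 + y))^2. The sketch below, on three made-up values, computes both by hand and checks them against scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_log_error

# Made-up true and predicted rates, for illustration only.
y_true = np.array([0.10, 0.20, 0.40])
y_hat = np.array([0.15, 0.20, 0.30])

# Manual computation of both metrics.
mae_manual = np.mean(np.abs(y_true - y_hat))
rmsle_manual = np.sqrt(np.mean((np.log1p(y_hat) - np.log1p(y_true)) ** 2))

# The same metrics via scikit-learn.
mae_sklearn = mean_absolute_error(y_true, y_hat)
rmsle_sklearn = np.sqrt(mean_squared_log_error(y_true, y_hat))

print(mae_manual, mae_sklearn)
print(rmsle_manual, rmsle_sklearn)
```

Note that `mean_squared_log_error` rejects negative values, so a model that can predict below zero may need its predictions clipped before `rmsle` is called.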

# Attrition_rate is the label (output) to predict.
# features are the columns used to predict Attrition_rate.
label = ["Attrition_rate"]
features = ['VAR7','VAR6','VAR5','VAR1','VAR3','growth_rate','Time_of_service','Time_since_promotion','Travel_Rate','Post_Level','Education_Level']

featured_data = df_train.loc[:,features+label]
# your code here 
featured_data.shape

(7000, 12)

# Use dropna to remove the rows that contain missing values.
# your code here 
featured_data = featured_data.dropna()
featured_data.shape

(6856, 12)

X = featured_data.loc[:,features]
y = featured_data.loc[:,label]

# test_size is 0.45, so the data is split 55%:45% between training and test.
# your code here 
# Keep the return order straight: x_train, x_test, y_train, y_test
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.45, random_state=110)
print( "x_train values count: " + str(x_train.shape[0]) )
print( "y_train values count: " + str(y_train.shape[0]) )
print( "x_test values count: " + str(x_test.shape[0]) )
print( "y_test values count: " + str(y_test.shape[0]) )
# print( "x_train values count: " + str(len(x_train)))
# print( "y_train values count: " + str(len(y_train)))
# print( "x_test values count: " + str(len(x_test)))
# print( "y_test values count: " + str(len(y_test)))

x_train values count: 3770
y_train values count: 3770
x_test values count: 3086
y_test values count: 3086
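The counts above follow from how `train_test_split` rounds: with a fractional `test_size`, the test set gets the ceiling of `test_size * n_samples` and the training set gets the rest. A quick check with placeholder arrays of the same length (6856 rows, as left after `dropna`):

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 6856  # rows remaining after dropna in this post

# Placeholder arrays; only their length matters here.
X_toy = np.arange(n_rows).reshape(-1, 1)
y_toy = np.zeros(n_rows)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.45, random_state=110
)

# The test size is rounded up, the training set takes what is left.
expected_test = math.ceil(0.45 * n_rows)
print(len(X_tr), len(X_te), expected_test)
```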

# Train (fit) and predict with a LinearRegression model
model = LinearRegression()
# your code here 
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

# Print the scores: error (MAE, RMSLE)
# your code here 
show_scores(y_test, y_pred)

{'Valid MAE': 0.1280385863084669, 'Valid RMSLE': 0.14088870335452292}
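An MAE of about 0.128 is hard to judge in isolation. One common reference point is a baseline that always predicts the training mean. The sketch below uses synthetic data (an assumption, since the real CSV is not reproduced here) to show the comparison with scikit-learn's `DummyRegressor`:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and the attrition rate.
rng = np.random.default_rng(110)
X_syn = rng.normal(size=(1000, 3))
y_syn = 0.2 + 0.05 * X_syn[:, 0] + rng.normal(scale=0.02, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.45, random_state=110
)

# Baseline: always predict the mean of the training targets.
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
linear = LinearRegression().fit(X_tr, y_tr)

mae_baseline = mean_absolute_error(y_te, baseline.predict(X_te))
mae_linear = mean_absolute_error(y_te, linear.predict(X_te))
print(f"baseline MAE: {mae_baseline:.4f}, linear MAE: {mae_linear:.4f}")
```

On the real data, `x_train`, `x_test`, `y_train`, and `y_test` would take the place of the synthetic split. If the two MAEs turn out to be close, the selected features probably carry little signal about `Attrition_rate`.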

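Making Predictions

# Prediction (pandas was already imported above, so this import may be unnecessary)
import pandas as pd

# Load the sample data, [Dataset]_Module11_sample_(Employee).csv
sample = pd.read_csv("./[Dataset]_Module11_sample_(Employee).csv")

# Note: y_pred holds the predictions for x_test, not for sample.
# The first 3000 values are kept only so the list matches the length of sample.
c=[]
for i in range(len(y_pred)):
    c.append((y_pred[i][0].round(5)))
pf=c[:3000]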

sample.head(5)

	Employee_ID	Gender	Age	Education_Level	Relationship_Status	Hometown	Unit	Decision_skill_possess	Time_of_service	Time_since_promotion	...	Pay_Scale	Compensation_and_Benefits	Work_Life_balance	VAR1	VAR2	VAR3	VAR4	VAR5	VAR6	VAR7
0	EID_22713	F	32.0	5	Single	Springfield	R&D	Conceptual	7.0	4	...	4.0	type2	1.0	3	-0.9612	-0.4537	2.0	1	8	4
1	EID_9658	M	65.0	2	Single	Lebanon	IT	Directive	41.0	2	...	1.0	type2	1.0	4	-0.9612	0.7075	1.0	2	8	2
2	EID_22203	M	52.0	3	Married	Springfield	Sales	Directive	21.0	3	...	8.0	type3	1.0	4	-0.1048	0.7075	2.0	1	9	3
3	EID_7652	M	50.0	5	Single	Washington	Marketing	Analytical	11.0	4	...	2.0	type0	4.0	3	-0.1048	0.7075	2.0	2	8	3
4	EID_6516	F	44.0	3	Married	Franklin	R&D	Conceptual	12.0	4	...	2.0	type2	4.0	4	1.6081	0.7075	2.0	2	7	4

5 rows × 23 columns

# your code here 
dff = pd.DataFrame({'Employee_ID':sample['Employee_ID'],'Attrition_rate':pf})
dff.head()

	Employee_ID	Attrition_rate
0	EID_22713	0.18430
1	EID_9658	0.18544
2	EID_22203	0.18532
3	EID_7652	0.20305
4	EID_6516	0.20507

Task 3: Print the 20 rows with the highest predicted Attrition_rate

# your code here 
dff.sort_values('Attrition_rate', ascending=False).head(20)

	Employee_ID	Attrition_rate
1809	EID_5873	0.21646
2036	EID_10338	0.21526
2920	EID_19140	0.21519
373	EID_4261	0.21500
2986	EID_17284	0.21488
2513	EID_24214	0.21315
281	EID_14096	0.21303
1135	EID_15641	0.21291
1534	EID_10684	0.21227
2431	EID_22724	0.21154
34	EID_19046	0.21149
51	EID_17967	0.21116
2173	EID_18844	0.21107
691	EID_14595	0.21069
2516	EID_2152	0.21055
1925	EID_23260	0.21048
1180	EID_25701	0.21040
2388	EID_6168	0.21012
1747	EID_20012	0.21011
2944	EID_1930	0.21009

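Additionally…

If the prediction is instead made the way the instructor provided, shown below, we can analyze the difference and the resulting attrition rates.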

ID          = ["Employee_ID"]
pred_data   = sample.loc[:,features+ID]
pred_data   = pred_data.dropna(axis=0)
y = pred_data.loc[:,ID]
sample_data = sample.loc[:,features]
sample_data = sample_data.dropna(axis=0)
y_hat = model.predict(sample_data)
size = len(y_hat)
c=[]
for i in range(len(y_hat)):
    c.append((y_hat[i][0].round(5)))
pf=c[:size]
sample_data.head(5)

	VAR7	VAR6	VAR5	VAR1	VAR3	growth_rate	Time_of_service	Time_since_promotion	Travel_Rate	Post_Level	Education_Level
0	4	8	1	3	-0.4537	30	7.0	4	1	5	5
1	2	8	2	4	0.7075	72	41.0	2	1	1	2
2	3	9	1	4	0.7075	25	21.0	3	0	1	3
3	3	8	2	3	0.7075	28	11.0	4	1	1	5
4	4	7	2	4	0.7075	47	12.0	4	1	3	3

y.head(5)

	Employee_ID
0	EID_22713
1	EID_9658
2	EID_22203
3	EID_7652
4	EID_6516

pf[:10]

[0.19277,
 0.17537,
 0.17751,
 0.17771,
 0.19025,
 0.19642,
 0.19457,
 0.1931,
 0.19157,
 0.17776]

dff1 = pd.DataFrame({'Employee_ID':y['Employee_ID'], 'Attrition_rate':pf})
dff1.head()

	Employee_ID	Attrition_rate
0	EID_22713	0.19277
1	EID_9658	0.17537
2	EID_22203	0.17751
3	EID_7652	0.17771
4	EID_6516	0.19025

dff1.sort_values('Attrition_rate', ascending=False).head(20)

	Employee_ID	Attrition_rate
1695	EID_21702	0.21477
2791	EID_17304	0.21387
52	EID_13270	0.21373
2546	EID_13443	0.21370
1676	EID_21042	0.21336
988	EID_23350	0.21214
631	EID_13681	0.21191
1819	EID_19366	0.21178
638	EID_7609	0.21153
2156	EID_15348	0.21148
1041	EID_5420	0.21137
846	EID_24603	0.21125
2327	EID_18242	0.21122
512	EID_16224	0.21088
578	EID_22377	0.21073
305	EID_18954	0.21072
1521	EID_23613	0.21064
1541	EID_11601	0.21055
1859	EID_15161	0.21050
561	EID_13536	0.21035
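The code above drops missing rows twice (once for the IDs, once for the features) and flattens the predictions in a loop. A more compact variant, sketched here on toy data with hypothetical feature names `f1` and `f2`, drops rows once so IDs and features stay aligned, and flattens with `np.ravel`:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

features = ["f1", "f2"]  # hypothetical feature names for this sketch

# Toy training data where Attrition_rate is exactly 0.1 * f1.
train = pd.DataFrame({
    "f1": [1.0, 2.0, 3.0, 4.0],
    "f2": [0.0, 1.0, 0.0, 1.0],
    "Attrition_rate": [0.1, 0.2, 0.3, 0.4],
})
# Toy sample with one incomplete row.
sample = pd.DataFrame({
    "Employee_ID": ["EID_1", "EID_2", "EID_3"],
    "f1": [1.5, np.nan, 3.5],
    "f2": [1.0, 0.0, 0.0],
})

# y is passed as a one-column DataFrame, as in the post.
model = LinearRegression().fit(train[features], train[["Attrition_rate"]])

# Drop incomplete rows once, keeping IDs and features together.
pred_data = sample[["Employee_ID"] + features].dropna()

# predict returns shape (n, 1) here, so flatten it before building the table.
y_hat = model.predict(pred_data[features])
result = pd.DataFrame({
    "Employee_ID": pred_data["Employee_ID"].to_numpy(),
    "Attrition_rate": np.ravel(y_hat).round(5),
})
print(result)
```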

To make the difference above easier to see, here it is as a table.

Merging the two tables…

dff = dff.rename(columns={ "Attrition_rate": "Attrition_rate1"})
dff1 = dff1.rename(columns={"Attrition_rate": "Attrition_rate2"})
dff_t = pd.concat([dff, dff1["Attrition_rate2"]], axis=1)
dff_t.sort_values('Attrition_rate1', ascending=False).head(20)

	Employee_ID	Attrition_rate1	Attrition_rate2
1809	EID_5873	0.21646	0.19128
2036	EID_10338	0.21526	0.19797
2920	EID_19140	0.21519	0.18473
373	EID_4261	0.21500	0.16753
2986	EID_17284	0.21488	0.19579
2513	EID_24214	0.21315	0.18597
281	EID_14096	0.21303	0.19472
1135	EID_15641	0.21291	0.18291
1534	EID_10684	0.21227	0.18685
2431	EID_22724	0.21154	0.19074
34	EID_19046	0.21149	0.19697
51	EID_17967	0.21116	0.19278
2173	EID_18844	0.21107	0.19377
691	EID_14595	0.21069	0.19196
2516	EID_2152	0.21055	0.19799
1925	EID_23260	0.21048	0.17965
1180	EID_25701	0.21040	0.18777
2388	EID_6168	0.21012	0.18116
1747	EID_20012	0.21011	0.18441
2944	EID_1930	0.21009	0.17772

dff_t.sort_values('Attrition_rate2', ascending=False).head(20)

	Employee_ID	Attrition_rate1	Attrition_rate2
1695	EID_21702	0.18664	0.21477
2791	EID_17304	0.18761	0.21387
52	EID_13270	0.19970	0.21373
2546	EID_13443	0.20360	0.21370
1676	EID_21042	0.18628	0.21336
988	EID_23350	0.19529	0.21214
631	EID_13681	0.17672	0.21191
1819	EID_19366	0.18475	0.21178
638	EID_7609	0.19132	0.21153
2156	EID_15348	0.19424	0.21148
1041	EID_5420	0.17756	0.21137
846	EID_24603	0.18627	0.21125
2327	EID_18242	0.19149	0.21122
512	EID_16224	0.18563	0.21088
578	EID_22377	0.17252	0.21073
305	EID_18954	0.18692	0.21072
1521	EID_23613	0.19274	0.21064
1541	EID_11601	0.17665	0.21055
1859	EID_15161	0.19136	0.21050
561	EID_13536	0.19840	0.21035
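`pd.concat(..., axis=1)` lines the two tables up by row index, which works only while both share the same index. If rows had been dropped from one of them, joining on `Employee_ID` would be the safer route. A sketch on toy values (the IDs and rates are made up):

```python
import pandas as pd

# Two toy prediction tables; the second is missing one employee.
pred1 = pd.DataFrame({
    "Employee_ID": ["EID_1", "EID_2", "EID_3"],
    "Attrition_rate": [0.21, 0.18, 0.19],
})
pred2 = pd.DataFrame({
    "Employee_ID": ["EID_3", "EID_1"],
    "Attrition_rate": [0.20, 0.17],
})

# Join on the key column; the suffixes give Attrition_rate1 / Attrition_rate2.
merged = pred1.merge(pred2, on="Employee_ID", how="left", suffixes=("1", "2"))

# Per-employee difference between the two predictions.
merged["diff"] = (merged["Attrition_rate1"] - merged["Attrition_rate2"]).round(5)
print(merged)
```

Employees missing from the second table end up with NaN instead of silently receiving another row's value.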

Could giving LinearRegression more options improve performance?

Let's go through it step by step.

# Train (fit) and predict with a LinearRegression model
# x_train, y_train and the other data are the same as above.
model = LinearRegression(copy_X=False)
# your code here 
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

# Print the scores: error (MAE, RMSLE)
# your code here 
show_scores(y_test, y_pred)

{'Valid MAE': 0.1280385863084669, 'Valid RMSLE': 0.14088870335452292}

Result of changing the LR option

Nothing really changed, so I'll wrap this part up here… next!
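The identical scores are expected: `copy_X` only controls whether scikit-learn copies the input array before fitting, not how the coefficients are computed. A small check on synthetic data (the data is an assumption made for this sketch):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic regression problem.
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(100, 4))
y_syn = X_syn @ np.array([0.5, -0.2, 0.0, 0.1]) + rng.normal(scale=0.01, size=100)

# With copy_X=False the input may be overwritten, so each model gets its own copy.
default_model = LinearRegression().fit(X_syn.copy(), y_syn)
no_copy_model = LinearRegression(copy_X=False).fit(X_syn.copy(), y_syn)

# Both fits should agree up to floating-point noise.
same_coef = np.allclose(default_model.coef_, no_copy_model.coef_)
same_intercept = np.isclose(default_model.intercept_, no_copy_model.intercept_)
print(same_coef, same_intercept)
```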

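It looks like the data needs work more than the options do..

What about applying other models…

For example.. Ridge, Lasso, RandomForest..

Hmm… time is probably short, so RandomForest only…

Compare them and give priority to whichever approach has the lowest error…

First, apply the model..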

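# A classifier raises an error because the target is continuous, so the regressor below is used instead.
# The target could also be discretized first, but for now...
from sklearn.ensemble import RandomForestRegressor

### Train (fit) and predict with a RandomForestRegressor model
# x_train, y_train and the other data are the same as above.
# Note: an integer max_samples=1 makes each tree draw a single training sample.
model = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=950, max_samples = 1 )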
# bootstrap=True, n_jobs=10, min_samples_split=10, warm_start=False
# max_leaf_nodes=2, min_weight_fraction_leaf=0.23, max_features="log2", min_impurity_decrease=0.1,
# ccp_alpha=0.05, 
# your code here 
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

# Print the scores: error (MAE, RMSLE)
# your code here 
show_scores(y_test, y_pred)

C:\Users\User\AppData\Local\Temp\ipykernel_8388\957247849.py:8: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  model.fit(x_train, y_train)

{'Valid MAE': 0.12574740764744005, 'Valid RMSLE': 0.1403597189649096}
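One possible reason the forest barely moves the score (a reading of scikit-learn's documented behaviour, not something verified on this dataset): an integer `max_samples=1` tells each tree to draw exactly one training sample, so every tree is a single leaf and the forest predicts the same value for every row. A float such as `0.8` means a fraction of the rows instead. The sketch below shows the effect on synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with a clear dependence on the first feature.
rng = np.random.default_rng(950)
X_syn = rng.normal(size=(300, 3))
y_syn = 0.2 + 0.1 * X_syn[:, 0] + rng.normal(scale=0.01, size=300)

# Integer 1: every tree is trained on a single bootstrap sample.
one_sample = RandomForestRegressor(
    n_estimators=100, max_depth=10, random_state=950, max_samples=1
).fit(X_syn, y_syn)

# Float 0.8: every tree is trained on 80% of the rows.
fraction = RandomForestRegressor(
    n_estimators=100, max_depth=10, random_state=950, max_samples=0.8
).fit(X_syn, y_syn)

# Range (max - min) of the predictions for each forest.
spread_one = np.ptp(one_sample.predict(X_syn))
spread_fraction = np.ptp(fraction.predict(X_syn))
print(spread_one, spread_fraction)
```

If the first spread is zero, the forest is effectively a constant predictor, and no amount of tuning the other options will change its score much.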

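Tinkering with the Random Forest model…

Hmm… no matter how I adjust the options, the error (the RMSLE above) will not drop below 0.14.

This needs more checking, but Random Forest seems to hit a limit on this data.

So… on to other models…

I looked at someone else's work, and their Lasso regression model brought the error down to the low 0.13 range.

It seems simpler models such as Ridge and Lasso can push the error lower.

What about the data?

How the data was collected is hard to change, so there is a limit there.

Instead, random_state is currently set to 110, so I could try raising or lowering it..

Adjusting the train/test ratio is another option, but

rather than fiddling with Random Forest this way and that, I concluded that trying Ridge, Lasso, and similar models is the more reliable check.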
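As a starting point for that comparison, here is a rough sketch that scores LinearRegression, Ridge, and Lasso with the same two metrics. The data is synthetic and the `alpha` values are arbitrary assumptions, so the numbers say nothing about the real dataset; only the structure carries over:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_log_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and a rate between 0 and 1.
rng = np.random.default_rng(110)
X_syn = rng.normal(size=(2000, 5))
y_syn = np.clip(0.19 + 0.03 * X_syn[:, 0] + rng.normal(scale=0.05, size=2000), 0, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.45, random_state=110
)

# Candidate models; the alpha values are placeholders to be tuned.
candidates = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.001),
}

results = {}
for name, estimator in candidates.items():
    # Clip at zero because the log-based metric rejects negative predictions.
    preds = np.clip(estimator.fit(X_tr, y_tr).predict(X_te), 0, None)
    results[name] = {
        "MAE": mean_absolute_error(y_te, preds),
        "RMSLE": np.sqrt(mean_squared_log_error(y_te, preds)),
    }

# Model with the lowest RMSLE on the held-out split.
best = min(results, key=lambda k: results[k]["RMSLE"])
print(results, best)
```

With the real data, the synthetic split would be replaced by `x_train`, `x_test`, `y_train`, and `y_test` from earlier in the post.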

Conclusion

In this notebook we looked at how companies can use AI to predict which employees will stay loyal. We built a linear regression model to predict the employee attrition rate.

Other links I referenced

https://minorman.tistory.com/84

For displaying all columns in pandas

https://eunjin3786.tistory.com/204

For checking pandas info and similar

https://zephyrus1111.tistory.com/163

For counting unique values

https://rfriend.tistory.com/383

For groupby

https://zephyrus1111.tistory.com/70

For printing specific columns after groupby

https://seaborn.pydata.org/generated/seaborn.color_palette.html#seaborn.color_palette

For the color palettes available for bar charts

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

For how to apply train_test_split to separate X and y

https://seaborn.pydata.org/generated/seaborn.countplot.html

For countplot

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html

For pandas dropna

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html

For renaming DataFrame columns

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

For LinearRegression options

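Wrap-up

This assignment showed me how the attrition rate of people within a company can be analyzed to deliver the information that is needed.
Lesson learned: library versions change, so keep the documentation at hand to be able to work around problems when they come up.
The growth_rate comparison by gender is one example from this data.
There were more links to consult this time than in the previous assignment.
To improve quality (error rate), I tried Random Forest among LR and other models, but could not get the error below 0.14.
Next, it would be worth trying models such as Lasso and Ridge.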
