Intel AI Capstone 2: Realizing Profit Through Sales Forecasting

2023-01-29, 22 min read

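A project to realize profit through AI-based sales forecasting

A certain (ros..) store chain wants to realize profit through sales forecasting.
To that end, the goal is to train on the sales data collected so far and then predict.
This post covers preparing the data for prediction and, this time, looking for a solution with prediction-focused (regression) models.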

Procedure

1. Read the required files
- Read the Store, Train, and Test CSV files
2. Check the state of the files
- Check the size, head, missing values, and info of each DataFrame
3. Fill missing values with 0
4. Understand how each variable in the train data affects Sales: draw graphs
5. Check per-store statistics
- 5.1 Create a new column spc = sales/customers → train.csv df
- 5.2 Use groupby on store to compute the mean of sales, customers, and spc: per-store averages
6. Merge the columns created in step 5 into the Store df → store df columns
7. Merge the store df into the Train df (keyed on Store)
8. Convert Date in the Train df into Year, Month, Day, and Week, each as its own column
9. Split the Train df into Label and Features columns: X, y
10. Split the X, y dataset into train and test
- 8:2
11. Choose regression (prediction) AI models
- e.g. Linear regression, Ridge regression, Lasso regression, Polynomial regression …
12. Train, predict, and evaluate
- fit (train), predict, score (evaluate)
13. Report the results
- Predict sales for the test.csv file → Jupyter file, Submission_OOO.csv with the predicted sales

0. Required libraries

# The basics first
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Utilities needed for training
from sklearn.model_selection import train_test_split # train/test split
from sklearn.preprocessing import StandardScaler # standardization (zero mean, unit variance)
from sklearn.metrics import classification_report # performance summary (for classifiers)

# Imports needed for exploring models
from sklearn.linear_model import Lasso # Lasso
from sklearn.linear_model import Ridge # Ridge
from sklearn.linear_model import LinearRegression # LR
from sklearn.preprocessing import PolynomialFeatures # PF
from sklearn.ensemble import RandomForestRegressor # RFR

1. Reading the required files

Read the Store, Train, and Test CSV files.

:1: DtypeWarning: Columns (7) have mixed types. Specify dtype option on import or set low_memory=False.
  df_train = pd.read_csv('./train.csv')
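
The DtypeWarning above points at column 7, which is StateHoliday in the train file. A minimal sketch of one way to avoid it, assuming the column is read as a string so that the integer 0 and the string '0' cannot mix. The helper name read_sales_csv and the inline sample rows are hypothetical stand-ins for './train.csv'.

```python
import io

import pandas as pd


def read_sales_csv(source):
    """Read a CSV, forcing StateHoliday to str so 0 and '0' do not mix."""
    return pd.read_csv(source, dtype={"StateHoliday": str})


# Inline sample standing in for './train.csv' (hypothetical rows)
sample = io.StringIO(
    "Store,Date,Sales,StateHoliday\n"
    "1,2015-07-31,5263,0\n"
    "2,2015-07-31,6064,a\n"
)
df_sample = read_sales_csv(sample)
print(df_sample["StateHoliday"].tolist())
```

Passing low_memory=False, as the warning itself suggests, is the other option; an explicit dtype has the advantage of making the later encoding step predictable.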

2. Checking the state of the files

Check the size, head, missing values, and info of each DataFrame.

Check the size, missing values, and info of df_train

	Store	DayOfWeek	Date	Sales	Customers	Open	Promo	SchoolHoliday
0	1	5	2015-07-31	5263	555	1	1	1
1	2	5	2015-07-31	6064	625	1	1	1
2	3	5	2015-07-31	8314	821	1	1	1
3	4	5	2015-07-31	13995	1498	1	1	1
4	5	5	2015-07-31	4822	559	1	1	1

Size of the train data

Info for the train data

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1017209 entries, 0 to 1017208
Data columns (total 9 columns):
 #   Column         Non-Null Count    Dtype 
---  ------         --------------    ----- 
 0   Store          1017209 non-null  int64 
 1   DayOfWeek      1017209 non-null  int64 
 2   Date           1017209 non-null  object
 3   Sales          1017209 non-null  int64 
 4   Customers      1017209 non-null  int64 
 5   Open           1017209 non-null  int64 
 6   Promo          1017209 non-null  int64 
 7   StateHoliday   1017209 non-null  object
 8   SchoolHoliday  1017209 non-null  int64 
dtypes: int64(7), object(2)
memory usage: 69.8+ MB

Summary statistics of the train data

	Store	DayOfWeek	Sales	Customers	Open	Promo	SchoolHoliday
count	1.017209e+06	1.017209e+06	1.017209e+06	1.017209e+06	1.017209e+06	1.017209e+06	1.017209e+06
mean	5.584297e+02	3.998341e+00	5.773819e+03	6.331459e+02	8.301067e-01	3.815145e-01	1.786467e-01
std	3.219087e+02	1.997391e+00	3.849926e+03	4.644117e+02	3.755392e-01	4.857586e-01	3.830564e-01
min	1.000000e+00	1.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00
25%	2.800000e+02	2.000000e+00	3.727000e+03	4.050000e+02	1.000000e+00	0.000000e+00	0.000000e+00
50%	5.580000e+02	4.000000e+00	5.744000e+03	6.090000e+02	1.000000e+00	0.000000e+00	0.000000e+00
75%	8.380000e+02	6.000000e+00	7.856000e+03	8.370000e+02	1.000000e+00	1.000000e+00	0.000000e+00
max	1.115000e+03	7.000000e+00	4.155100e+04	7.388000e+03	1.000000e+00	1.000000e+00	1.000000e+00

Missing-value check for the train data

Store            0
DayOfWeek        0
Date             0
Sales            0
Customers        0
Open             0
Promo            0
StateHoliday     0
SchoolHoliday    0
dtype: int64

So far, the train data appears to have no missing values.

Next, run the same checks on the test data.

Check the head

	Id	Store	DayOfWeek	Date	Open	Promo
0	1	1	4	2015-09-17	1.0	1
1	2	3	4	2015-09-17	1.0	1
2	3	7	4	2015-09-17	1.0	1
3	4	8	4	2015-09-17	1.0	1
4	5	9	4	2015-09-17	1.0	1

Size of the test data

Info for the test data

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41088 entries, 0 to 41087
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             41088 non-null  int64  
 1   Store          41088 non-null  int64  
 2   DayOfWeek      41088 non-null  int64  
 3   Date           41088 non-null  object 
 4   Open           41077 non-null  float64
 5   Promo          41088 non-null  int64  
 6   StateHoliday   41088 non-null  object 
 7   SchoolHoliday  41088 non-null  int64  
dtypes: float64(1), int64(5), object(2)
memory usage: 2.5+ MB

Summary statistics of the test data

	Id	Store	DayOfWeek	Open	Promo	SchoolHoliday
count	41088.000000	41088.000000	41088.000000	41077.000000	41088.000000	41088.000000
mean	20544.500000	555.899533	3.979167	0.854322	0.395833	0.443487
std	11861.228267	320.274496	2.015481	0.352787	0.489035	0.496802
min	1.000000	1.000000	1.000000	0.000000	0.000000	0.000000
25%	10272.750000	279.750000	2.000000	1.000000	0.000000	0.000000
50%	20544.500000	553.500000	4.000000	1.000000	0.000000	0.000000
75%	30816.250000	832.250000	6.000000	1.000000	1.000000	1.000000
max	41088.000000	1115.000000	7.000000	1.000000	1.000000	1.000000

Judging by this data, test has no Customers column. The idea is to predict Customers and combine it with test to produce the result…

Missing-value check for the test data

Id                0
Store             0
DayOfWeek         0
Date              0
Open             11
Promo             0
StateHoliday      0
SchoolHoliday     0
dtype: int64

Open has 11 missing values, which must be handled.
That happens in the next step…
Check the store data as well.

Head of the store data

	Store	StoreType	Assortment	CompetitionDistance	CompetitionOpenSinceMonth	CompetitionOpenSinceYear	Promo2	Promo2SinceWeek	Promo2SinceYear	PromoInterval
0	1	c	a	1270.0	9.0	2008.0	0	NaN	NaN	NaN
1	2	a	a	570.0	11.0	2007.0	1	13.0	2010.0	Jan,Apr,Jul,Oct
2	3	a	a	14130.0	12.0	2006.0	1	14.0	2011.0	Jan,Apr,Jul,Oct
3	4	c	c	620.0	9.0	2009.0	0	NaN	NaN	NaN
4	5	a	a	29910.0	4.0	2015.0	0	NaN	NaN	NaN

Size of the store data

Info for the store data

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1115 entries, 0 to 1114
Data columns (total 10 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Store                      1115 non-null   int64  
 1   StoreType                  1115 non-null   object 
 2   Assortment                 1115 non-null   object 
 3   CompetitionDistance        1112 non-null   float64
 4   CompetitionOpenSinceMonth  761 non-null    float64
 5   CompetitionOpenSinceYear   761 non-null    float64
 6   Promo2                     1115 non-null   int64  
 7   Promo2SinceWeek            571 non-null    float64
 8   Promo2SinceYear            571 non-null    float64
 9   PromoInterval              571 non-null    object 
dtypes: float64(5), int64(2), object(3)
memory usage: 87.2+ KB

Summary statistics of the store data

	Store	CompetitionDistance	CompetitionOpenSinceMonth	CompetitionOpenSinceYear	Promo2	Promo2SinceWeek	Promo2SinceYear
count	1115.00000	1112.000000	761.000000	761.000000	1115.000000	571.000000	571.000000
mean	558.00000	5404.901079	7.224704	2008.668857	0.512108	23.595447	2011.763573
std	322.01708	7663.174720	3.212348	6.195983	0.500078	14.141984	1.674935
min	1.00000	20.000000	1.000000	1900.000000	0.000000	1.000000	2009.000000
25%	279.50000	717.500000	4.000000	2006.000000	0.000000	13.000000	2011.000000
50%	558.00000	2325.000000	8.000000	2010.000000	1.000000	22.000000	2012.000000
75%	836.50000	6882.500000	10.000000	2013.000000	1.000000	37.000000	2013.000000
max	1115.00000	75860.000000	12.000000	2015.000000	1.000000	50.000000	2015.000000

Missing-value check for the store data

Store                          0
StoreType                      0
Assortment                     0
CompetitionDistance            3
CompetitionOpenSinceMonth    354
CompetitionOpenSinceYear     354
Promo2                         0
Promo2SinceWeek              544
Promo2SinceYear              544
PromoInterval                544
dtype: int64

3. Filling missing values with 0

Missing values were detected in test and store.
Fill them with 0.

Id               0
Store            0
DayOfWeek        0
Date             0
Open             0
Promo            0
StateHoliday     0
SchoolHoliday    0
dtype: int64

Do the same for the store data

Store                        0
StoreType                    0
Assortment                   0
CompetitionDistance          0
CompetitionOpenSinceMonth    0
CompetitionOpenSinceYear     0
Promo2                       0
Promo2SinceWeek              0
Promo2SinceYear              0
PromoInterval                0
dtype: int64
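
The filling step can be sketched as below. This is a minimal illustration, assuming miniature, hypothetical versions of df_test and df_store with only numeric columns. Note that filling Open with 0 marks those rows as closed, which is a modeling choice rather than a neutral default.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature versions of df_test and df_store
df_test = pd.DataFrame({"Id": [1, 2, 3], "Open": [1.0, np.nan, 0.0]})
df_store = pd.DataFrame({
    "Store": [1, 2],
    "CompetitionDistance": [1270.0, np.nan],
})

# Replace every missing value with 0, as in this step
df_test = df_test.fillna(0)
df_store = df_store.fillna(0)

# Both counts should now be all zeros
print(df_test.isnull().sum())
print(df_store.isnull().sum())
```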

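4. Drawing graphs from the data above

The goal is to understand how each variable in the train data affects Sales.
Because the data is large, the instructions were to start from one specific month, July 2015.

Start with the train data info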

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1017209 entries, 0 to 1017208
Data columns (total 9 columns):
 #   Column         Non-Null Count    Dtype 
---  ------         --------------    ----- 
 0   Store          1017209 non-null  int64 
 1   DayOfWeek      1017209 non-null  int64 
 2   Date           1017209 non-null  object
 3   Sales          1017209 non-null  int64 
 4   Customers      1017209 non-null  int64 
 5   Open           1017209 non-null  int64 
 6   Promo          1017209 non-null  int64 
 7   StateHoliday   1017209 non-null  object
 8   SchoolHoliday  1017209 non-null  int64 
dtypes: int64(7), object(2)
memory usage: 69.8+ MB

Check train by row

Store  DayOfWeek  Date        Sales  Customers  Open  Promo  StateHoliday  SchoolHoliday
1      1          2013-01-07  7176   785        1     1      0             1                1
745    5          2015-06-05  7622   711        1     1      0             0                1
                  2015-03-06  7667   738        1     1      0             0                1
                  2015-03-13  6268   668        1     0      0             0                1
                  2015-03-20  7857   725        1     1      0             0                1
                                                                                           ..
372    7          2013-03-03  0      0          0     0      0             0                1
                  2013-03-10  0      0          0     0      0             0                1
                  2013-03-17  0      0          0     0      0             0                1
                  2013-03-24  0      0          0     0      0             0                1
1115   7          2015-07-26  0      0          0     0      0             0                1
Length: 1017209, dtype: int64

Check the first rows of train

	Store	DayOfWeek	Date	Sales	Customers	Open	Promo	SchoolHoliday
0	1	5	2015-07-31	5263	555	1	1	1
1	2	5	2015-07-31	6064	625	1	1	1
2	3	5	2015-07-31	8314	821	1	1	1
3	4	5	2015-07-31	13995	1498	1	1	1
4	5	5	2015-07-31	4822	559	1	1	1

Show only the July data

	Store	DayOfWeek	Date	Sales	Customers	Open	Promo	StateHoliday	SchoolHoliday
0	1	5	2015-07-31	5263	555	1	1	0	1
1	2	5	2015-07-31	6064	625	1	1	0	1
2	3	5	2015-07-31	8314	821	1	1	0	1
3	4	5	2015-07-31	13995	1498	1	1	0	1
4	5	5	2015-07-31	4822	559	1	1	0	1
...	...	...	...	...	...	...	...	...	...
34560	1111	3	2015-07-01	3701	351	1	1	0	1
34561	1112	3	2015-07-01	10620	716	1	1	0	1
34562	1113	3	2015-07-01	8222	770	1	1	0	0
34563	1114	3	2015-07-01	27071	3788	1	1	0	0
34564	1115	3	2015-07-01	7701	447	1	1	0	0

34565 rows × 9 columns
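
Filtering down to one month and summing sales per day can be sketched as follows. This is a minimal example on a hypothetical four-row df_train; the names dates, mask, df_july, and daily_sales are illustrative and not taken from the original notebook.

```python
import pandas as pd

# Hypothetical miniature df_train
df_train = pd.DataFrame({
    "Store": [1, 2, 1, 2],
    "Date": ["2015-07-31", "2015-07-31", "2015-07-30", "2015-06-30"],
    "Sales": [5263, 6064, 5020, 5735],
})

dates = pd.to_datetime(df_train["Date"])

# Keep only the rows from July 2015
mask = (dates.dt.year == 2015) & (dates.dt.month == 7)
df_july = df_train[mask]

# Total sales per day, the kind of series a daily-sales plot would use
daily_sales = df_july.groupby("Date")["Sales"].sum()
print(daily_sales)
```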

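Daily sales graph for July

Customers vs. sales graph

This graph was hard to compare with July alone, so the full data was plotted instead.

Customers vs. sales graph by school holiday

And here, by promotion…

Conclusions from the graphs above

Each graph shows the correlation or trend between sales and customers across periods such as holidays and across events.
The Christmas holiday seems to have more influence than public holidays.
School holidays make little difference.
The last graph shows how promotions affect sales depending on the store.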

5. Per-store statistics

5.1 Create a new column spc = sales/customers → train.csv df
5.2 Use groupby on store to compute the mean of sales, customers, and spc: per-store averages

In 5.1, create a new column named spc in df_train and fill it with sales/customers.

In 5.2, use groupby to compute the mean of sales, customers, and spc, per store.

Before starting, encode train. Encoding the holiday column is a must!

First, encode train

0    855087
0    131072
a     20260
b      6690
c      4100
Name: StateHoliday, dtype: int64
0    855087
0    131072
1     20260
2      6690
3      4100
Name: StateHoliday, dtype: int64
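
The value counts above still list 0 twice, which suggests the integer 0 and the string '0' stayed distinct after encoding. A minimal sketch of one way to collapse them, assuming the mapping 0/a/b/c to 0/1/2/3 shown in the output; the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical sample mixing the integer 0 with strings, as in the raw column
holiday = pd.Series([0, "0", "a", "b", "c"], name="StateHoliday")

# Normalize to str first so 0 and '0' collapse, then map to integers
mapping = {"0": 0, "a": 1, "b": 2, "c": 3}
encoded = holiday.astype(str).map(mapping)

print(encoded.value_counts())
```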

Handle nulls in the Spc column

Spc null count : 172869
Spc null count : 0
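
The nulls come from rows where both Sales and Customers are 0, since 0/0 is NaN in pandas. A minimal sketch of the computation and the fill, on a hypothetical three-row df_train:

```python
import pandas as pd

# Hypothetical miniature df_train; the middle row is a closed day
df_train = pd.DataFrame({
    "Sales": [5263, 0, 6064],
    "Customers": [555, 0, 625],
})

# Sales per customer; 0 / 0 yields NaN on days with no customers
df_train["Spc"] = df_train["Sales"] / df_train["Customers"]
print("Spc null count :", df_train["Spc"].isnull().sum())

# Treat "no customers" as 0 sales per customer
df_train["Spc"] = df_train["Spc"].fillna(0)
print("Spc null count :", df_train["Spc"].isnull().sum())
```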

Check the data

Store  DayOfWeek  Date        Sales  Customers  Open  Promo  StateHoliday  SchoolHoliday  Spc      
1      1          2013-01-07  7176   785        1     1      0             1              9.141401     1
745    5          2015-06-05  7622   711        1     1      0             0              10.720113    1
                  2015-03-06  7667   738        1     1      0             0              10.388889    1
                  2015-03-13  6268   668        1     0      0             0              9.383234     1
                  2015-03-20  7857   725        1     1      0             0              10.837241    1
                                                                                                      ..
372    7          2013-03-03  0      0          0     0      0             0              0.000000     1
                  2013-03-10  0      0          0     0      0             0              0.000000     1
                  2013-03-17  0      0          0     0      0             0              0.000000     1
                  2013-03-24  0      0          0     0      0             0              0.000000     1
1115   7          2015-07-26  0      0          0     0      0             0              0.000000     1
Length: 1017209, dtype: int64

Mean Sales, Customers, and Spc per Store group

:2: FutureWarning: Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.
  tostore_df = en_df_train.groupby(['Store'])['Sales', 'Customers', 'Spc'].mean()

	AVG_Sales	AVG_Customers	AVG_Spc
Store
1	3945.704883	467.646497	6.958559
2	4122.991507	486.045648	6.998110
3	5741.253715	620.286624	7.539925
4	8021.769639	1100.057325	6.033827
5	3867.110403	444.360934	7.121176
...	...	...	...
1111	4342.968153	373.548832	9.549273
1112	8465.280255	693.498938	9.918483
1113	5516.180467	596.763270	7.666212
1114	17200.196391	2664.057325	5.372308
1115	5225.296178	358.687898	11.952081

1115 rows × 3 columns
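
The FutureWarning above goes away when the columns are selected with a list (double brackets). A minimal sketch on a hypothetical four-row en_df_train; using add_prefix to produce the AVG_ column names is an assumption about how the renaming could be done, not the notebook's confirmed code.

```python
import pandas as pd

# Hypothetical miniature en_df_train
en_df_train = pd.DataFrame({
    "Store": [1, 1, 2, 2],
    "Sales": [100, 300, 50, 150],
    "Customers": [10, 30, 5, 15],
    "Spc": [10.0, 10.0, 10.0, 10.0],
})

# Double brackets pass a list of columns, which avoids the FutureWarning
tostore_df = en_df_train.groupby("Store")[["Sales", "Customers", "Spc"]].mean()

# Rename to AVG_Sales, AVG_Customers, AVG_Spc
tostore_df = tostore_df.add_prefix("AVG_")
print(tostore_df)
```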

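6. Merging Spc and the other averages into the store data

Merge the columns created in step 5 into the Store df → store DataFrame columns.
The data to merge from step 5 is taken to be the averages computed in 5.2.
Before merging, however, df_store itself must be encoded, because the analysis cannot include string columns.
The columns to encode are ["StoreType", "Assortment", "PromoInterval"].
Decoding is probably unnecessary, since these columns are not part of the final result.

Encoding
StoreType and Assortment are encoded in alphabetical order: a, b, c, d become 0, 1, 2, 3.
PromoInterval is a string naming a group of months, so each group as a whole is assigned 1, 2, or 3 (0 is the filled-in missing value).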

a    602
d    348
c    148
b     17
Name: StoreType, dtype: int64
a    593
c    513
b      9
Name: Assortment, dtype: int64
0                   544
Jan,Apr,Jul,Oct     335
Feb,May,Aug,Nov     130
Mar,Jun,Sept,Dec    106
Name: PromoInterval, dtype: int64
0    602
3    348
2    148
1     17
Name: StoreType, dtype: int64
0    593
2    513
1      9
Name: Assortment, dtype: int64
0    544
1    335
2    130
3    106
Name: PromoInterval, dtype: int64
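
The before/after counts above are consistent with the following mapping. This is a minimal sketch on a hypothetical four-row df_store; the dictionary names letter_codes and interval_codes are illustrative.

```python
import pandas as pd

# Hypothetical miniature df_store, after missing values were filled with 0
df_store = pd.DataFrame({
    "StoreType": ["c", "a", "d", "b"],
    "Assortment": ["a", "c", "b", "a"],
    "PromoInterval": [0, "Jan,Apr,Jul,Oct", "Feb,May,Aug,Nov", "Mar,Jun,Sept,Dec"],
})

# Letters in alphabetical order
letter_codes = {"a": 0, "b": 1, "c": 2, "d": 3}
# Each month group as a whole; 0 is the filled-in missing value
interval_codes = {
    0: 0,
    "Jan,Apr,Jul,Oct": 1,
    "Feb,May,Aug,Nov": 2,
    "Mar,Jun,Sept,Dec": 3,
}

df_store["StoreType"] = df_store["StoreType"].map(letter_codes)
df_store["Assortment"] = df_store["Assortment"].map(letter_codes)
df_store["PromoInterval"] = df_store["PromoInterval"].map(interval_codes)
print(df_store)
```

Integer codes like these impose an order on categories that have none, which linear models will treat as meaningful; one-hot encoding is a common alternative.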

Merge the data prepared so far and check the result.

	Store	StoreType	Assortment	CompetitionDistance	CompetitionOpenSinceMonth	CompetitionOpenSinceYear	Promo2	Promo2SinceWeek	Promo2SinceYear	PromoInterval	AVG_Sales	AVG_Customers	AVG_Spc
0	1	2	0	1270.0	9.0	2008.0	0	0.0	0.0	0	3945.704883	467.646497	6.958559
1	2	0	0	570.0	11.0	2007.0	1	13.0	2010.0	1	4122.991507	486.045648	6.998110
2	3	0	0	14130.0	12.0	2006.0	1	14.0	2011.0	1	5741.253715	620.286624	7.539925
3	4	2	2	620.0	9.0	2009.0	0	0.0	0.0	0	8021.769639	1100.057325	6.033827
4	5	0	0	29910.0	4.0	2015.0	0	0.0	0.0	0	3867.110403	444.360934	7.121176

7. Merging the store data into the train data (keyed on Store)

The key is Store.
In other words, build the complete training data as Train + Store.

Merge accordingly and check the data

	Store	DayOfWeek	Date	Sales	Customers	Open	Promo	SchoolHoliday	Spc	...	CompetitionDistance	CompetitionOpenSinceMonth	CompetitionOpenSinceYear	Promo2	Promo2SinceWeek	Promo2SinceYear	PromoInterval	AVG_Sales	AVG_Customers	AVG_Spc
0	1	5	2015-07-31	5263	555	1	1	1	9.482883	...	1270.0	9.0	2008.0	0	0.0	0.0	0	3945.704883	467.646497	6.958559
1	2	5	2015-07-31	6064	625	1	1	1	9.702400	...	570.0	11.0	2007.0	1	13.0	2010.0	1	4122.991507	486.045648	6.998110
2	3	5	2015-07-31	8314	821	1	1	1	10.126675	...	14130.0	12.0	2006.0	1	14.0	2011.0	1	5741.253715	620.286624	7.539925
3	4	5	2015-07-31	13995	1498	1	1	1	9.342457	...	620.0	9.0	2009.0	0	0.0	0.0	0	8021.769639	1100.057325	6.033827
4	5	5	2015-07-31	4822	559	1	1	1	8.626118	...	29910.0	4.0	2015.0	0	0.0	0.0	0	3867.110403	444.360934	7.121176

5 rows × 22 columns
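
Steps 6 and 7 are both joins on Store. A minimal sketch with hypothetical miniature frames; the names store_full and train_full, and the choice of a left join, are assumptions for illustration.

```python
import pandas as pd

# Hypothetical miniature frames
df_store = pd.DataFrame({"Store": [1, 2], "StoreType": [2, 0]})
tostore_df = pd.DataFrame(
    {"AVG_Sales": [3945.7, 4123.0]},
    index=pd.Index([1, 2], name="Store"),
)
df_train = pd.DataFrame({"Store": [1, 2, 1], "Sales": [5263, 6064, 5020]})

# Step 6: attach the per-store averages to the store table
store_full = df_store.merge(tostore_df.reset_index(), on="Store", how="left")

# Step 7: attach the store table to every train row
train_full = df_train.merge(store_full, on="Store", how="left")
print(train_full)
```

A left join keeps every train row even if a store were missing from the store table, so the row count of the result can be checked against the original.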

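8. Splitting the date in the train data into Year, Month, Day, and Week

Split the date into four separate columns: Year, Month, Day, and Week.
Python's datetime features can be used for this.

Seeing is believing: the date is split up to make analysis easier.
For Week, the week number of the year is available through the week field of dt.isocalendar().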

......
dc_df_train.head()

	Store	DayOfWeek	Sales	Customers	Open	Promo	SchoolHoliday	Spc	StoreType	...	Promo2SinceWeek	Promo2SinceYear	PromoInterval	AVG_Sales	AVG_Customers	AVG_Spc	Year	Month	Day	Week
0	1	5	5263	555	1	1	1	9.482883	2	...	0.0	0.0	0	3945.704883	467.646497	6.958559	2015	7	31	31
1	2	5	6064	625	1	1	1	9.702400	0	...	13.0	2010.0	1	4122.991507	486.045648	6.998110	2015	7	31	31
2	3	5	8314	821	1	1	1	10.126675	0	...	14.0	2011.0	1	5741.253715	620.286624	7.539925	2015	7	31	31
3	4	5	13995	1498	1	1	1	9.342457	2	...	0.0	0.0	0	8021.769639	1100.057325	6.033827	2015	7	31	31
4	5	5	4822	559	1	1	1	8.626118	0	...	0.0	0.0	0	3867.110403	444.360934	7.121176	2015	7	31	31

5 rows × 25 columns
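
The date split can be sketched as below, assuming Date is first parsed with pd.to_datetime. The three sample dates are taken from rows shown in this post; the frame itself is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"Date": ["2015-07-31", "2013-12-30", "2015-09-17"]})
dates = pd.to_datetime(df["Date"])

df["Year"] = dates.dt.year
df["Month"] = dates.dt.month
df["Day"] = dates.dt.day
# isocalendar() returns year/week/day columns; week is the ISO week number
df["Week"] = dates.dt.isocalendar().week

# The original string column is no longer needed
df = df.drop(columns=["Date"])
print(df)
```

ISO weeks can belong to the neighboring year: 2013-12-30 falls in week 1, which matches the row with Year 2013, Month 12, Day 30, Week 1 in the split sample later in the post.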

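Up to this point

The test data needs the same treatment as train, built as test + store.
However, test has no Customers (there is no answer key…),
so the columns derived from Sales and Customers, such as Spc, are left out of the merge.
StateHoliday encoding is applied here as well.

The procedure is the same as for train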

0    40908
a      180
Name: StateHoliday, dtype: int64
0    40908
1      180
Name: StateHoliday, dtype: int64

Apply the steps above to the test data

	Id	Store	DayOfWeek	Date	Open	Promo	StoreType	Assortment	CompetitionDistance	CompetitionOpenSinceMonth	CompetitionOpenSinceYear	Promo2	Promo2SinceWeek	Promo2SinceYear	PromoInterval
0	1	1	4	2015-09-17	1.0	1	2	0	1270.0	9.0	2008.0	0	0.0	0.0	0
1	2	3	4	2015-09-17	1.0	1	0	0	14130.0	12.0	2006.0	1	14.0	2011.0	1
2	3	7	4	2015-09-17	1.0	1	0	2	24000.0	4.0	2013.0	0	0.0	0.0	0
3	4	8	4	2015-09-17	1.0	1	0	0	7520.0	10.0	2014.0	0	0.0	0.0	0
4	5	9	4	2015-09-17	1.0	1	0	2	2030.0	8.0	2000.0	0	0.0	0.0	0

Split the date in the same way to make analysis easier.

# step3

......

dc_df_test.head()

	Id	Store	DayOfWeek	Open	Promo	StoreType	Assortment	CompetitionDistance	CompetitionOpenSinceMonth	CompetitionOpenSinceYear	Promo2	Promo2SinceWeek	Promo2SinceYear	PromoInterval	Year	Month	Day	Week
0	1	1	4	1.0	1	2	0	1270.0	9.0	2008.0	0	0.0	0.0	0	2015	9	17	38
1	2	3	4	1.0	1	0	0	14130.0	12.0	2006.0	1	14.0	2011.0	1	2015	9	17	38
2	3	7	4	1.0	1	0	2	24000.0	4.0	2013.0	0	0.0	0.0	0	2015	9	17	38
3	4	8	4	1.0	1	0	0	7520.0	10.0	2014.0	0	0.0	0.0	0	2015	9	17	38
4	5	9	4	1.0	1	0	2	2030.0	8.0	2000.0	0	0.0	0.0	0	2015	9	17	38

The data below will be used as the test input when predicting with the model.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 41088 entries, 0 to 41087
Data columns (total 20 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Id                         41088 non-null  int64  
 1   Store                      41088 non-null  int64  
 2   DayOfWeek                  41088 non-null  int64  
 3   Open                       41088 non-null  float64
 4   Promo                      41088 non-null  int64  
 5   StateHoliday               41088 non-null  object 
 6   SchoolHoliday              41088 non-null  int64  
 7   StoreType                  41088 non-null  int64  
 8   Assortment                 41088 non-null  int64  
 9   CompetitionDistance        41088 non-null  float64
 10  CompetitionOpenSinceMonth  41088 non-null  float64
 11  CompetitionOpenSinceYear   41088 non-null  float64
 12  Promo2                     41088 non-null  int64  
 13  Promo2SinceWeek            41088 non-null  float64
 14  Promo2SinceYear            41088 non-null  float64
 15  PromoInterval              41088 non-null  int64  
 16  Year                       41088 non-null  int64  
 17  Month                      41088 non-null  int64  
 18  Day                        41088 non-null  int64  
 19  Week                       41088 non-null  UInt32 
dtypes: UInt32(1), float64(6), int64(12), object(1)
memory usage: 6.5+ MB

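9. Splitting the Train df into Label and Features columns

X holds the features to learn from.
y is, of course, the label.
The labels are expected to be the Spc-related data,
that is, the data test does not have: Sales and Customers are the labels.

Before splitting, check the current set of columns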

Index(['Store', 'DayOfWeek', 'Sales', 'Customers', 'Open', 'Promo',
       'StateHoliday', 'SchoolHoliday', 'Spc', 'StoreType', 'Assortment',
       'CompetitionDistance', 'CompetitionOpenSinceMonth',
       'CompetitionOpenSinceYear', 'Promo2', 'Promo2SinceWeek',
       'Promo2SinceYear', 'PromoInterval', 'AVG_Sales', 'AVG_Customers',
       'AVG_Spc', 'Year', 'Month', 'Day', 'Week'],
      dtype='object')

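From these, the values to actually predict (Sales, Customers, Spc, and the averages AVG_Sales, AVG_Customers, AVG_Spc)
are treated as prediction targets.
Everything else is input data, so it goes into X.

# The actual code is not written this way; since it has no output to show, this is a rough code representation.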
X = [ 'etc...' ]
y = ['Sales', 'Customers', 'Spc', 'AVG_Sales', 'AVG_Customers', 'AVG_Spc']

Check the separated prediction-target data

	Sales	Customers	Spc	AVG_Sales	AVG_Customers	AVG_Spc
0	5263	555	9.482883	3945.704883	467.646497	6.958559
1	6064	625	9.702400	4122.991507	486.045648	6.998110
2	8314	821	10.126675	5741.253715	620.286624	7.539925
3	13995	1498	9.342457	8021.769639	1100.057325	6.033827
4	4822	559	8.626118	3867.110403	444.360934	7.121176
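
One way to express the split is to name the label columns and drop them to get X. This is a minimal sketch on a hypothetical two-row dc_df_train with only two feature columns; it is not the notebook's actual code.

```python
import pandas as pd

label_cols = ["Sales", "Customers", "Spc", "AVG_Sales", "AVG_Customers", "AVG_Spc"]

# Hypothetical miniature dc_df_train
dc_df_train = pd.DataFrame({
    "Store": [1, 2],
    "Promo": [1, 1],
    "Sales": [5263, 6064],
    "Customers": [555, 625],
    "Spc": [9.48, 9.70],
    "AVG_Sales": [3945.7, 4123.0],
    "AVG_Customers": [467.6, 486.0],
    "AVG_Spc": [6.96, 7.00],
})

# y: the six prediction targets; X: everything else
y = dc_df_train[label_cols]
X = dc_df_train.drop(columns=label_cols)
print(X.columns.tolist())
```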

10. Splitting the X, y dataset into train and test

Split X and y 8:2.

Result of the 8:2 split

x_train values count: 813767
y_train values count: 813767
x_test values count: 203442
y_test values count: 203442

A sample of that data

	Store	DayOfWeek	Open	Promo	SchoolHoliday	StoreType	Assortment	CompetitionDistance	CompetitionOpenSinceMonth	CompetitionOpenSinceYear	Promo2	Promo2SinceWeek	Promo2SinceYear	PromoInterval	Year	Month	Day	Week
720893	274	1	1	1	0	1	1	3640.0	0.0	0.0	1	10.0	2013.0	1	2013	9	23	39
611704	355	1	1	0	1	0	2	9720.0	8.0	2013.0	0	0.0	0.0	0	2013	12	30	1
390659	5	6	1	0	0	0	0	29910.0	4.0	2015.0	0	0.0	0.0	0	2014	7	19	29
477862	313	2	1	1	1	3	2	14160.0	0.0	0.0	0	0.0	0.0	0	2014	4	29	18
374227	478	3	1	1	1	3	2	1940.0	3.0	2012.0	0	0.0	0.0	0	2014	8	6	32
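
An 8:2 split with scikit-learn can be sketched as follows. The toy arrays and random_state=42 are assumptions for reproducibility; the post does not state which seed, if any, was used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples, 2 features
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# test_size=0.2 gives the 8:2 split; rows are shuffled by default
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(x_train), len(x_test))
```

Because the rows are shuffled, dates from the whole period end up in both parts. For time-ordered sales data, holding out the most recent weeks instead is a common alternative.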

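11. Choosing regression (prediction) AI models

Try Linear regression, Ridge regression, Lasso regression, Polynomial regression, and so on.
Also check whether prediction with a neural network is feasible.

Enter the models

All of them are run together.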

# Four models, all with default options
# LinearRegression
# Lasso
# Ridge
# PolynomialFeatures (degree: 2, bias column included)
# These are the models selected for the run.
models = [
  ......
]

# Polynomial regression needs a separate model to do the actual prediction,
# because PolynomialFeatures only handles preprocessing.
......

... = [
  'Linear Regression',
  'Lasso',
  'Ridge',
  'Polynomial Regression',
]
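
The elided model list might look like the sketch below. It is not the author's code: the notebook keeps a separate submodel for the polynomial case, while this sketch swaps in a scikit-learn Pipeline so all four entries share the same fit/predict interface. The name model_names and the toy data are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

model_names = ["Linear Regression", "Lasso", "Ridge", "Polynomial Regression"]
models = [
    LinearRegression(),
    Lasso(),
    Ridge(),
    # PolynomialFeatures only expands the inputs; LinearRegression does the fitting
    make_pipeline(PolynomialFeatures(degree=2, include_bias=True), LinearRegression()),
]

# Toy data with a quadratic relationship
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = X[:, 0] ** 2

for name, model in zip(model_names, models):
    model.fit(X, y)
    print(name, round(model.score(X, y), 3))
```

On this toy data only the polynomial pipeline can fit the curve exactly, which mirrors why it scored higher than the three linear models in the post.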

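12. Training, prediction, and evaluation

Training (fit)

Prediction (predict)

Evaluation (score)

Run the three steps above and check the results.
Note that f1-score and similar metrics measure classification models, so they are skipped.
The criterion here is R2.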

# Import the evaluation metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Store the R2 score of each model
# A bar chart in order of R2 makes it possible to compare the models

# Build the error-metric list
from sklearn.metrics import mean_absolute_error, mean_squared_error, median_absolute_error, r2_score
err_list = [
    ......
]

......

# Colors for the bar chart
colors = [
  ......
]

# Function that computes R2 and plots it.
# Given the model name, predictions, and actual values, it calls the plot_predictions function above to draw a scatter plot
# and draws a bar chart of R2 per model
def R2_eval(name_, pred, actual):
    ......
Run training, prediction, and evaluation

# Every model except 'Polynomial Regression' is simply fit.
# For 'Polynomial Regression', the selected submodel does the actual training and prediction.
for idx, model in enumerate(models):
  # Train
  ......
  
  # Predict
  ......
  
  # Evaluate
  print( "Model : " + ...... )
  for ...... :
    print( "name : model name"  )
  
  R2_eval(......)
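
The five metrics printed in this section can be computed with a small helper. This is a minimal sketch, not the author's elided err_list code: the function name regression_report and the toy arrays are hypothetical. It uses a single target, where RMSE is simply the square root of MSE; with the six targets used in this post, scikit-learn averages across targets, so the printed values may combine differently.

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    median_absolute_error,
    r2_score,
)


def regression_report(actual, pred):
    """Return the regression metrics used in this post as a dict."""
    mse = mean_squared_error(actual, pred)
    return {
        "MeanAE": mean_absolute_error(actual, pred),
        "MedianAE": median_absolute_error(actual, pred),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
        "R2": r2_score(actual, pred),
    }


# Toy single-target example
actual = np.array([3.0, 5.0, 7.0, 9.0])
pred = np.array([2.0, 5.0, 8.0, 9.0])
report = regression_report(actual, pred)

for name, value in report.items():
    print(name, ":", value)
```

MSE is in squared units of the target, which is why it looks so large next to MeanAE and RMSE.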

Model : Linear Regression
MeanAE : 600.8584387741995
MedianAE : 459.8686808152799
MSE : 1776361.3675434568
RMSE : 863.3503418997265
R2 : 0.4035107433683398
               model         R2
0  Linear Regression  40.351074

<Figure size 864x648 with 0 Axes>

Model : Lasso
MeanAE : 600.8850792186124
MedianAE : 459.6573239646421
MSE : 1776504.4040635328
RMSE : 863.4085057486828
R2 : 0.39972120000178996
               model         R2
0  Linear Regression  40.351074
1              Lasso  39.972120

<Figure size 864x648 with 0 Axes>

Model : Ridge
MeanAE : 600.8565338754206
MedianAE : 459.7666609792696
MSE : 1776354.0141114946
RMSE : 863.3505996037321
R2 : 0.4034714371240787
               model         R2
0  Linear Regression  40.351074
1              Ridge  40.347144
2              Lasso  39.972120

<Figure size 864x648 with 0 Axes>

Model : Polynomial Regression
MeanAE : 567.0672357671167
MedianAE : 424.3426106043767
MSE : 1605768.94745449
RMSE : 814.6784490255732
R2 : 0.5152320248978711
                   model         R2
0  Polynomial Regression  51.523202
1      Linear Regression  40.351074
2                  Ridge  40.347144
3                  Lasso  39.972120

<Figure size 864x648 with 0 Axes>

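13. Reporting the results

For now, build the report with Polynomial Regression, the best performer.
The test data is the one prepared earlier.
By the nature of polynomial regression, LinearRegression is used along with it…

# Proceed with the best performer so far: drop Id, then train and predict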
......

[predict].shape

(41088, 6)

Add the output column names and show the data

	Sales	Customers	Spc	AVG_Sales	AVG_Customers	AVG_Spc
0	7607.752650	889.715885	8.842535	6029.960503	760.633890	6.731576
1	6120.761986	565.340003	10.511444	4590.800948	461.041049	7.980048
2	7475.062033	688.157977	10.359298	5474.933978	540.822080	8.026779
3	6533.838109	705.976699	9.304709	5040.406667	568.408090	7.396305
4	8645.445125	954.430660	9.374381	6633.371202	770.817615	7.359387
...	...	...	...	...	...	...
41083	5685.018501	642.265425	9.209060	5226.790307	607.817054	7.347172
41084	8344.936185	980.593709	8.781420	7013.484881	847.581939	6.920039
41085	7070.259068	798.421054	9.299123	6308.945487	722.586681	7.508190
41086	7876.067103	969.457115	8.611529	6923.436591	867.374927	6.852334
41087	5478.998990	431.183754	11.537297	3942.053162	292.918551	9.495523

41088 rows × 6 columns

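The answer sheet is ready!!

Now, following the sample answer sheet, it should be enough to combine the test Store id with the Sales and Customers above.
Customers is converted to an integer.

Converting the decimal values to integers gives the following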

	Sales	Customers	Spc	AVG_Sales	AVG_Customers	AVG_Spc
0	7607.752650	889	8.842535	6029.960503	760.633890	6.731576
1	6120.761986	565	10.511444	4590.800948	461.041049	7.980048
2	7475.062033	688	10.359298	5474.933978	540.822080	8.026779
3	6533.838109	705	9.304709	5040.406667	568.408090	7.396305
4	8645.445125	954	9.374381	6633.371202	770.817615	7.359387
...	...	...	...	...	...	...
41083	5685.018501	642	9.209060	5226.790307	607.817054	7.347172
41084	8344.936185	980	8.781420	7013.484881	847.581939	6.920039
41085	7070.259068	798	9.299123	6308.945487	722.586681	7.508190
41086	7876.067103	969	8.611529	6923.436591	867.374927	6.852334
41087	5478.998990	431	11.537297	3942.053162	292.918551	9.495523

41088 rows × 6 columns

For test, show only sales and customers

	Store	Sales	Customers
0	1	7607.752650	889
1	3	6120.761986	565
2	7	7475.062033	688
3	8	6533.838109	705
4	9	8645.445125	954
...	...	...	...
41083	1111	5685.018501	642
41084	1112	8344.936185	980
41085	1113	7070.259068	798
41086	1114	7876.067103	969
41087	1115	5478.998990	431

41088 rows × 3 columns

Now export this data to a CSV file and submit it to finish.
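
Assembling the submission can be sketched as below. The frames are hypothetical, and the clipping step is an assumption added for illustration: a regression model can output negative values, which cannot be real sales, so they are floored at 0 before Customers is truncated to an integer. An in-memory buffer stands in for the submission file.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical store ids and model output (Sales, Customers)
stores = pd.Series([1, 3, 7], name="Store")
predict = np.array([
    [7607.75, 889.72],
    [6120.76, 565.34],
    [-12.50, -3.10],  # a negative prediction, which cannot be a real sale
])
df_predict = pd.DataFrame(predict, columns=["Sales", "Customers"])

# Floor negatives at 0, then truncate Customers to an integer
df_predict = df_predict.clip(lower=0)
df_predict = df_predict.astype({"Customers": "int64"})

submission = pd.concat([stores, df_predict], axis=1)

# Write to a buffer here; a real run would pass a file path instead
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```

astype truncates toward zero, so 889.72 becomes 889; rounding first would be the alternative if the nearest integer is preferred.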

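RE 11~13: Redoing it because of performance

Looking at the actual output, some predictions came out negative, which is abnormal.
The cause is the low model performance…
A score of around 50% is not enough; 80~90% is required.
The conclusion was to switch to models that can perform better, such as random forest.
ARIMA, LSTM, and others are also candidates to try.
Unlike above, the model is defined and trained separately,
but of course it is compared on the same graph.

The run uses n set to just 5.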

model = RandomForestRegressor(.......)

......

# Evaluate
print( "Model : model name" )

# Print the performance metrics of the model

R2_eval(......) # plot the graph

Model : RandomForestRegressor
MeanAE : 101.26543946837957
MedianAE : 60.41119834006054
MSE : 141548.80777202942
RMSE : 191.73388691972642
R2 : 0.9776348312126862
                   model         R2
0  RandomForestRegressor  97.763483
1  Polynomial Regression  51.523202
2      Linear Regression  40.351074
3                  Ridge  40.347144
4                  Lasso  39.972120

<Figure size 864x648 with 0 Axes>

RFR is in a league of its own…

Let's build the result again with this model.

# model # RFR

# drop Id
......

[predict].shape

(41088, 6)

As before, attach column labels to the prediction output above.

	Sales	Customers	Spc	AVG_Sales	AVG_Customers	AVG_Spc
0	4305.8	490.4	8.759757	3945.704883	467.646497	6.958559
1	7649.0	767.0	9.972677	5741.253715	620.286624	7.539925
2	9212.8	968.2	9.520230	7364.866987	785.740040	7.743082
3	7302.6	811.2	9.096203	4685.878132	539.836730	7.119593
4	8240.6	647.6	12.722666	5426.816348	479.487261	9.267299
...	...	...	...	...	...	...
41083	3268.0	290.0	11.539755	4334.747082	392.967034	9.128899
41084	9438.8	792.4	11.883049	8465.280255	693.498938	9.918483
41085	6509.0	649.8	9.985102	5516.180467	596.763270	7.666212
41086	22468.0	3754.8	5.926716	17200.196391	2664.057325	5.372308
41087	8412.2	587.8	14.223439	5293.188323	373.936730	11.648778

41088 rows × 6 columns

# Integer conversion first
df_predict = df_predict.astype({'Customers': 'int64'})
df_predict

	Sales	Customers	Spc	AVG_Sales	AVG_Customers	AVG_Spc
0	4305.8	490	8.759757	3945.704883	467.646497	6.958559
1	7649.0	767	9.972677	5741.253715	620.286624	7.539925
2	9212.8	968	9.520230	7364.866987	785.740040	7.743082
3	7302.6	811	9.096203	4685.878132	539.836730	7.119593
4	8240.6	647	12.722666	5426.816348	479.487261	9.267299
...	...	...	...	...	...	...
41083	3268.0	290	11.539755	4334.747082	392.967034	9.128899
41084	9438.8	792	11.883049	8465.280255	693.498938	9.918483
41085	6509.0	649	9.985102	5516.180467	596.763270	7.666212
41086	22468.0	3754	5.926716	17200.196391	2664.057325	5.372308
41087	8412.2	587	14.223439	5293.188323	373.936730	11.648778

41088 rows × 6 columns

Customers converted to integers

	Store	Sales	Customers
0	1	4305.8	490
1	3	7649.0	767
2	7	9212.8	968
3	8	7302.6	811
4	9	8240.6	647
...	...	...	...
41083	1111	3268.0	290
41084	1112	9438.8	792
41085	1113	6509.0	649
41086	1114	22468.0	3754
41087	1115	8412.2	587

41088 rows × 3 columns

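Write the CSV file

A happy ending, but…

The results indeed look fine, with no anomalies such as negative values.
Still, this performance comes from RFR with basic options only.
Shouldn't we search for options that give even better results and produce the best answer?
So the following was prepared.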

# Create two empty lists to store each model's score and its n.
......

# Iterate ii over the values 1 to 15. This becomes the RFR-related number.
for ii in range(1,16):
    # Set n to ii
    
    # Train (fit) the model on the data
    
    # .score returns the model's R2 on the test data. Store the score in the list.
   
    # Append n to the list
    
    ......

# Show the graph
......
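
The sweep can be sketched as follows, assuming the n being varied is n_estimators (the number of trees); the original comments do not name the parameter. The synthetic data, random_state values, and variable names are all hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the real features and target
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = X[:, 0] * 3 + X[:, 1] ** 2

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

scores, n_values = [], []
for ii in range(1, 16):
    # Assumption: ii is the number of trees in the forest
    model = RandomForestRegressor(n_estimators=ii, random_state=0)
    model.fit(x_train, y_train)
    # For a regressor, .score returns R2 on the given data
    scores.append(model.score(x_test, y_test))
    n_values.append(ii)

best_n = n_values[int(np.argmax(scores))]
print(best_n, max(scores))
```

Picking n by its score on the same held-out split that is later used for reporting makes the reported score optimistic; cross-validation such as GridSearchCV is the usual way around that.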

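It would be nice to go further…

For lack of time, let's stop at n = 14.
Re-run the output based on this.

Proceed exactly as with the previous RFR. The only difference is that the best n, 14, is used.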

Model : RandomForestRegressor
MeanAE : 95.29082905903987
MedianAE : 56.82775681298893
MSE : 125750.19816998013
RMSE : 180.10760475556484
R2 : 0.9806635172796763
                   model         R2
0  RandomForestRegressor  98.066352
1  Polynomial Regression  51.523202
2      Linear Regression  40.351074
3                  Ridge  40.347144
4                  Lasso  39.972120

<Figure size 864x648 with 0 Axes>

Based on this, process the real data (integer conversion and so on), write the CSV file, and report it to finish.

Other sites referenced

Extracting rows by date condition

https://kibua20.tistory.com/195

Calculations on pandas data

https://nalara12200.tistory.com/162

pandas groupby reference

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html

About pandas merge

https://yganalyst.github.io/data_handling/Pd_12/#2-1-merge%ED%95%A8%EC%88%98%EC%9D%98-%EC%98%B5%EC%85%98

Splitting and handling dates in pandas

https://steadiness-193.tistory.com/60

Getting the week number in pandas

https://moondol-ai.tistory.com/180

A tip learned from a warning here: pandas dt.isocalendar().week gives the week number.

About polynomial regression and related topics

https://data36.com/polynomial-regression-python-scikit-learn/

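Wrap-up

I came to realize that MSE values naturally come out this large.
Even so, it was a valuable chance to experience how powerful prediction can be.
There are also better prediction models beyond RFR…

And prediction with neural networks on top of that. This is uploaded as is for now, with updates planned later.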
