[2차 프로젝트] 전자상거래 데이터분석 및 마케팅 전략 제시

journal201 2024. 9. 9. 20:48

2024. 9. 9. 20:48

A. 데이터셋

1. 데이터 소스:
https://www.kaggle.com/datasets/quangvinhhuynh/marketing-and-retail-analyst-e-comerce

Marketing and Retail Analyst E-comerce

www.kaggle.com

해당 데이터는 전자 상거래 플랫폼의 소매 부문 판매 채널과 관련된 것이다. 가장 정확한 보고서를 작성하기 위해서는 수익과 이익을 지난달 및 전년도 같은 기간과 비교하여 효과적인 마케팅 전략을 수립하는 것이 중요하다. 또한, 이익을 기준으로 상위 성과 제품을 식별하고 이러한 제품에 대한 세부 분석을 진행해야 하며, 미래 개발을 위한 성장 가능성이 있는 제품도 함께 평가해야 한다.

해당 데이터는 orders, order_items, customers, payments, products 의 총 5개 개별 파일로 하나의 데이터로 만들어 사용할 예정이다. cleaned csv 파일이 있지만 어떤 기준으로 어떤 방식으로 정제되었는지 모르기 때문에 사용하지 않는다.

2. 칼럼 확인

orders	- order_id(PK): 주문의 고유 식별자, 이 테이블의 기본 키 역할 - customer_id: 고객의 고유 식별자 - order_status: 주문 상태. 예: 배송됨, 취소됨, 처리 중 등 - order_purchase_timestamp: 고객이 주문을 한 시점의 타임스탬프 - order_approved_at: 판매자가 주문을 승인한 시점의 타임스탬프 - order_delivered_timestamp: 고객 위치에 주문이 배송된 시점의 타임스탬프 - order_estimated_delivery_date: 주문할 때 고객에게 제공된 예상 배송 날짜
order_items	- order_id(PK): 주문의 고유 식별자 - order_item_id(PK): 각 주문 내 항목 번호. 이 컬럼과 함께 order_id가 이 테이블의 기본 키 역할 - product_id: 제품의 고유 식별자 - seller_id: 판매자의 고유 식별자 - price: 제품의 판매 가격 - shipping_charges: 제품의 배송에 관련된 비용
customers	- customer_id(PK): 고객의 고유 식별자, 이 테이블의 기본 키 역할 - customer_zip_code_prefix: 고객의 우편번호 - customer_city: 고객의 도시 - customer_state: 고객의 주
payments	- order_id: 주문의 고유 식별자, 이 테이블에서 이 컬럼은 중복될 수 있다 - payment_sequential: 주어진 주문에 대한 결제 순서 정보를 제공 - payment_type: 결제 유형 예: 신용카드, 직불카드 등 - payment_installments: 신용카드 결제 시 할부 회차 - payment_value: 거래 금액
products	- product_id: 각 제품의 고유 식별자, 이 테이블의 기본 키 역할 - product_category_name: 제품이 속한 카테고리 이름 - product_weight_g: 제품 무게 (그램) - product_length_cm: 제품 길이 (센티미터) - product_height_cm: 제품 높이 (센티미터) - product_width_cm: 제품 너비 (센티미터)

3. 데이터 하나의 파일로 합치기

customers_df = customers_df.drop_duplicates()

# orders_df와 customers_df를 customer_id를 기준으로 병합
merged_df = pd.merge(orders_df, customers_df, on='customer_id', how='left')

# merged_df와 order_items_df를 order_id를 기준으로 병합
merged_df = pd.merge(merged_df, order_items_df, on='order_id', how='left')

# merged_df와 payments_df를 order_id를 기준으로 병합
merged_df = pd.merge(merged_df, payments_df, on='order_id', how='left')

# merged_df와 products_df를 product_id를 기준으로 병합
final_df = pd.merge(merged_df, products_df, on='product_id', how='left')

# 결과를 CSV 파일로 저장
# final_df.to_csv("merged_data_final.csv", index=False)
merge_df = final_df

B. Exploratory Data Analysis (EDA)

1. 데이터 톺아보기

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119160 entries, 0 to 119159
Data columns (total 24 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   order_id                       119160 non-null  object 
 1   customer_id                    119160 non-null  object 
 2   order_status                   119160 non-null  object 
 3   order_purchase_timestamp       119160 non-null  object 
 4   order_approved_at              118982 non-null  object 
 5   order_delivered_timestamp      115738 non-null  object 
 6   order_estimated_delivery_date  119160 non-null  object 
 7   customer_zip_code_prefix       119160 non-null  int64  
 8   customer_city                  119160 non-null  object 
 9   customer_state                 119160 non-null  object 
 10  order_item_id                  118325 non-null  float64
 11  product_id                     118325 non-null  object 
 12  seller_id                      118325 non-null  object 
 13  price                          118325 non-null  float64
 14  shipping_charges               118325 non-null  float64
 15  payment_sequential             119157 non-null  float64
 16  payment_type                   119157 non-null  object 
 17  payment_installments           119157 non-null  float64
 18  payment_value                  119157 non-null  float64
 19  product_category_name          117893 non-null  object 
 20  product_weight_g               118305 non-null  float64
 21  product_length_cm              118305 non-null  float64
 22  product_height_cm              118305 non-null  float64
 23  product_width_cm               118305 non-null  float64
dtypes: float64(10), int64(1), object(13)
memory usage: 21.8+ MB

- 수치형 변수 및 범주형 변수 나누기

numerical_cols = ['order_item_id', 'price', 'shipping_charges', 'payment_sequential', 'payment_installments', 'payment_value',
                           'product_weight_g', 'product_length_cm', 'product_height_cm', 'product_width_cm']
categorical_cols = ['order_status', 'order_purchase_timestamp', 'order_approved_at', 'order_delivered_timestamp', 
                             'order_estimated_delivery_date', 'customer_zip_code_prefix', 'customer_city', 'customer_state',
                             'payment_type', 'product_category_name']

2. 수치형 시각화

- 방법: histoplot(), kdeplot(),displot(), boxplot(), pairplot()

- 모든 column에서 데이터의 왼쪽 쏠림 현상 발견 → 데이터의 표준화 필요 예상

- 이상치 정제 필요 예상.

3. 범주형 시각화

- barlplot(), countplot(), boxplot()

C. 데이터 정제

1. 결측치 처리

# 각 컬럼의 중앙값으로 결측치를 채우기 [product_weight_g, product_length_cm, product_height_cm, product_width_cm]
merge_df['product_weight_g'].fillna(merge_df['product_weight_g'].median(), inplace=True)
merge_df['product_length_cm'].fillna(merge_df['product_length_cm'].median(), inplace=True)
merge_df['product_height_cm'].fillna(merge_df['product_height_cm'].median(), inplace=True)
merge_df['product_width_cm'].fillna(merge_df['product_width_cm'].median(), inplace=True)

# 결측치를 제거할 열 목록
columns_to_drop_null= [
    'order_approved_at', 
    'order_item_id', 
    'product_id', 
    'seller_id', 
    'price', 
    'shipping_charges', 
    'payment_sequential', 
    'payment_type', 
    'payment_installments', 
    'payment_value', 
    'product_category_name'
]

# 지정된 열들에서 결측치가 있는 행을 제거
merge_df = merge_df.dropna(subset=columns_to_drop_null)

# 분석 프로젝트에 필요 없다고 판단한 컬럼 제거
columns_to_drop = ['payment_sequential']
merge_df = merge_df.drop(columns=columns_to_drop)

merge_df.isna().sum()

2. 이상치 처리

# 변수 이진화 함수
def binarize(df, column_name):
    df[column_name] = df[column_name].apply(lambda x: 1 if x == 1 else 0)
    return df

# IQR 제거 방식
def remove_outliers_iqr(df, column_name):
    Q1 = df[column_name].quantile(0.25)
    Q3 = df[column_name].quantile(0.75)
    IQR = Q3 - Q1

    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    df_cleaned = df[(df[column_name] >= lower_bound) & (df[column_name] <= upper_bound)]
    
    return df_cleaned

# 빈도수 기반 필터링 함수
def remove_low_frequency_outliers(df, column_name, min_frequency=0.01):
    value_counts = df[column_name].value_counts()
    proportions = value_counts / len(df)
    
    valid_values = proportions[proportions >= min_frequency].index
    
    df_cleaned = df[df[column_name].isin(valid_values)]
    
    return df_cleaned

def remove_outliers_by_threshold(df, column_name, threshold):
    df_cleaned = df[df[column_name] <= threshold]
    
    return df_cleaned


# order_item_id 1개는 1로, 2개 이상은 0으로 이진화하는 코드
merged_df_cleaned = binarize(merge_df, 'order_item_id')

# price, shipping_charges, payment_value IQR 방법으로 처리
merged_df_cleaned = remove_outliers_iqr(merged_df_cleaned, 'price')
merged_df_cleaned = remove_outliers_iqr(merged_df_cleaned, 'shipping_charges')
merged_df_cleaned = remove_outliers_iqr(merged_df_cleaned, 'payment_value')

# payment_installments 12개월 초과는 필터링
merged_df_cleaned = remove_outliers_by_threshold(merged_df_cleaned, 'payment_installments', 12)

# product_weight_g, product_length_cm, product_height_cm, product_width_cm IQR 방법으로 처리
merged_df_cleaned = remove_outliers_iqr(merged_df_cleaned, 'product_weight_g')
merged_df_cleaned = remove_outliers_iqr(merged_df_cleaned, 'product_length_cm')
merged_df_cleaned = remove_outliers_iqr(merged_df_cleaned, 'product_height_cm')
merged_df_cleaned = remove_outliers_iqr(merged_df_cleaned, 'product_width_cm')

3. 이상치 처리 결과

4. 변수 스케일링

1) 수치형 변수 스케일링

현재 수치형 변수가 오른쪽으로 꼬리가 긴 right-skewed 분포를 가지고 있다. 이에 로그 변환 방법으로 변수들을 조금 더 정규분포에 가깝게 만들어 비대칭성을 줄이고, 정규화(Nomalization)을 통해 데이터의 범위를 일정한 구간으로 변환하여 모델의 학습속도와 성능을 높이고자 한다.

# 'product_length_cm', 'product_height_cm', 'product_width_cm' 곱해서 'volume' 컬럼 생성
merged_df_cleaned['volume'] = merged_df_cleaned['product_length_cm'] * merged_df_cleaned['product_height_cm'] * merged_df_cleaned['product_width_cm']

# 기존의 세 개의 컬럼 드랍
merged_df_cleaned = merged_df_cleaned.drop(columns=['product_length_cm', 'product_height_cm', 'product_width_cm'])

merged_df_cleaned['repeat_order'] = merged_df_cleaned.groupby('customer_id')['order_id'].transform('nunique')
merged_df_cleaned['total_price'] = merged_df_cleaned.groupby('customer_id')['price'].transform('sum')
merged_df_cleaned['avg_price'] = merged_df_cleaned.groupby('customer_id')['price'].transform('mean')
merged_df_cleaned['total_payment_value'] = merged_df_cleaned.groupby('customer_id')['payment_value'].transform('sum')
merged_df_cleaned['avg_payment_value'] = merged_df_cleaned.groupby('customer_id')['payment_value'].transform('mean')

# 수치형, 범주형 변수 재정의

numerical_cols = ['price', 'shipping_charges', 'payment_value', 'product_weight_g', 'volume']
categorical_cols = ['order_status','customer_zip_code_prefix', 'customer_city', 'customer_state',
                    'order_item_id', 'payment_type', 'payment_installments', 'product_category_name']
                    
# 수치형 변수 로그변환 함수

def log_transform(df, cols):
    df_log_transformed = df[cols].apply(lambda x: np.log1p(x))
    
    return df_log_transformed

def normalize(df, cols):
    scaler = MinMaxScaler()
    df_normalized = pd.DataFrame(scaler.fit_transform(df[cols]), columns=cols, index=df.index)
    
    return df_normalized

# 로그 변환
log_transformed_df = log_transform(merged_df_cleaned, numerical_cols)

# 정규화 수행
normalized_df = normalize(log_transformed_df, numerical_cols)

# 결과를 원래 데이터프레임에 반영
merged_df_cleaned[numerical_cols] = normalized_df

2) 범주형 변수 인코딩

해당 데이터는 범주형 변수가 많고, 해당 변수의 값들 또한 많기 때문에 One-Hot Encoding 방식은 적절하지 않을 것으로 사료된다. 이에 범주형 변수의 카테고리를 줄인 후에 Label-Encoding를 하고자 한다. 또한 위치 정보를 나타내는 칼럼들은 5개의 큰 구역으로 묶는 절차를 거친다.

# state로 5개의 지방으로 묶기
state_to_region = {
    'AC': '북부 지방', 'AL': '북동부 지방', 'AP': '북부 지방', 'AM': '북부 지방',
    'RR': '북부 지방', 'RO': '북부 지방', 'PA': '북부 지방', 'PB': '북동부 지방',
    'MA': '북동부 지방', 'PI': '북동부 지방', 'PE': '북동부 지방', 'RN': '북동부 지방',
    'CE': '북동부 지방', 'SE': '북동부 지방', 'BA': '북동부 지방', 'DF': '중서부 지방',
    'TO': '북부 지방', 'GO': '중서부 지방', 'MS': '중서부 지방', 'MT': '중서부 지방',
    'RJ': '남동부 지방', 'SP': '남동부 지방', 'MG': '남동부 지방', 'ES': '남동부 지방',
    'RS': '남부 지방', 'SC': '남부 지방', 'PR': '남부 지방'
}

# state 컬럼을 기준으로 지방으로 변환
merged_df_cleaned['region'] = merged_df_cleaned['customer_state'].map(state_to_region)

# customer_zip_code_prefix, customer_city, customer_state 컬럼 drop
columns_to_drop = ['customer_zip_code_prefix', 'customer_city', 'customer_state']
merged_df_cleaned = merged_df_cleaned.drop(columns=columns_to_drop)

# 레이블 인코딩
label_encoder = LabelEncoder()

# 'region' 컬럼을 레이블 인코딩
merged_df_cleaned['region'] = label_encoder.fit_transform(merged_df_cleaned['region'])
merged_df_cleaned['region'].value_counts()

# product_category_name 종류별로 묶어서 카테고리 줄이기

category_mapping = {
    '가구/인테리어': [
        'furniture_decor', 'furniture_living_room', 'furniture_bedroom', 
        'furniture_mattress_and_upholstery', 'kitchen_dining_laundry_garden_furniture', 
        'la_cuisine', 'flowers', 'cool_stuff', 'perfumery', 'party_supplies', 
        'bed_bath_table', 'market_place', 'home_construction', 'christmas_supplies'
    ],
    '패션': [
        'fashion_underwear_beach', 'fashion_bags_accessories', 'fashion_shoes', 
        'fashion_male_clothing', 'fashion_sport', 'fashion_childrens_clothes', 
        'fashio_female_clothing', 'housewares', 'watches_gifts'
    ],
    '전자제품': [
        'telephony', 'computers_accessories', 'audio', 'tablets_printing_image', 
        'cine_photo', 'musical_instruments', 'consoles_games', 'dvds_blu_ray', 
        'music', 'electronics', 'air_conditioning', 'small_appliances', 
        'home_appliances', 'home_appliances_2', 'small_appliances_home_oven_and_coffee', 
        'home_comfort_2', 'signaling_and_security', 'security_and_services', 
        'fixed_telephony'
    ],
    '건설/공구': [
        'construction_tools_construction', 'costruction_tools_garden', 
        'construction_tools_safety', 'construction_tools_lights', 
        'costruction_tools_tools', 'garden_tools'
    ],
    '생활용품': [
        'baby', 'diapers_and_hygiene', 'health_beauty', 'home_confort', 
        'luggage_accessories', 'auto', 'food', 'drinks', 'food_drink', 
        'sports_leisure', 'pet_shop', 'agro_industry_and_commerce'
    ],
    '문구/사무용품': [
        'stationery', 'office_furniture', 'books_technical', 
        'books_general_interest', 'books_imported', 'arts_and_craftmanship', 
        'art', 'industry_commerce_and_business'
    ],
    '장난감': ['toys']
}

# 카테고리 매핑을 수행하는 함수
def map_category(category_name):
    for main_category, subcategories in category_mapping.items():
        if category_name in subcategories:
            return main_category
    return '기타'  

# `product_category_name` 컬럼을 매핑하여 새로운 컬럼 추가
merged_df_cleaned['product_category_group'] = merged_df_cleaned['product_category_name'].apply(map_category)

# 'product_category_name' 컬럼 삭제
merged_df_cleaned = merged_df_cleaned.drop(columns='product_category_name')

merged_df_cleaned['product_category_group'].value_counts()

# product_category_name 레이블 인코딩

label_encoder = LabelEncoder()

# 'product_category_group' 컬럼을 레이블 인코딩
merged_df_cleaned['product_category_group'] = label_encoder.fit_transform(merged_df_cleaned['product_category_group'])

# 변환된 데이터 확인
merged_df_cleaned['product_category_group'].value_counts()

3) dtype 변경

날짜 변수들의 dtype을 datetime으로 변경한다.

# 범주형 변수들 처리 전 날짜 변수들의 dtype datetime으로 변경.
date_columns = ['order_purchase_timestamp', 'order_approved_at', 
                'order_delivered_timestamp', 'order_estimated_delivery_date']

for col in date_columns:
    merged_df_cleaned[col] = pd.to_datetime(merged_df_cleaned[col])

merged_df_cleaned[date_columns].dtypes

4) 2번째 인코딩

# order_status, payment_type 인코딩

# 매핑 딕셔너리 정의
order_status_mapping = {
    'delivered': 0,
    'shipped': 1,
    'canceled': 2,
    'unavailable': 3,
    'processing': 4,
    'invoiced': 5,
    'created': 6,
    'approved': 7
}

payment_type_mapping = {
    'credit_card': 0,
    'wallet': 1,
    'voucher': 2,
    'debit_card': 3,
    'not_defined': 4
}

# 직접 레이블 인코딩 수행
merged_df_cleaned['order_status_encoded'] = merged_df_cleaned['order_status'].map(order_status_mapping)
merged_df_cleaned['payment_type_encoded'] = merged_df_cleaned['payment_type'].map(payment_type_mapping)

# 원본 컬럼 제거
merged_df_cleaned = merged_df_cleaned.drop(columns=['order_status', 'payment_type'])

D. 군집 모델

1. 군집 모델 나누기

cluster_1 = merged_df_cleaned[['repeat_order','total_price', 'avg_price', 'total_payment_value', 'avg_payment_value']]

# 군집 2개로 나누기
custoemr_kmeans2 = KMeans(n_clusters=2, init='k-means++', max_iter=300, random_state=42)
custoemr_kmeans2.fit(cluster_1)

cluster_1['cluster2'] = custoemr_kmeans2.labels_

# 군집 3개로 나누기
custoemr_kmeans3 = KMeans(n_clusters=3, init='k-means++', max_iter=300, random_state=42)
custoemr_kmeans3.fit(cluster_1)

cluster_1['cluster3'] = custoemr_kmeans3.labels_

# 군집 4개로 나누기
custoemr_kmeans4 = KMeans(n_clusters=4, init='k-means++', max_iter=300, random_state=42)
custoemr_kmeans4.fit(cluster_1)

cluster_1['cluster4'] = custoemr_kmeans4.labels_

# 실루엣 계수

from sklearn.metrics import silhouette_score

labels = custoemr_kmeans2.fit_predict(cluster_1)

silhouette_1 = silhouette_score(cluster_1, custoemr_kmeans2.labels_)
print(f'클러스터 개수 2개일 때: Silhouette Score = {silhouette_1:.6f}')

from sklearn.metrics import silhouette_score

labels = custoemr_kmeans3.fit_predict(cluster_1)

silhouette_2 = silhouette_score(cluster_1, custoemr_kmeans3.labels_)
print(f'클러스터 개수 3개일 때: Silhouette Score = {silhouette_2:.6f}')

from sklearn.metrics import silhouette_score

labels = custoemr_kmeans4.fit_predict(cluster_1)

silhouette_3 = silhouette_score(cluster_1, custoemr_kmeans4.labels_)
print(f'클러스터 개수 4개일 때: Silhouette Score = {silhouette_3:.6f}')

wcss_2 = custoemr_kmeans2.inertia_
print(f'클러스터 개수 3개일 때: WCSS = {wcss_2:.4f}')


wcss_3 = custoemr_kmeans3.inertia_
print(f'클러스터 개수 3개일 때: WCSS = {wcss_3:.4f}')

wcss_4 = custoemr_kmeans4.inertia_
print(f'클러스터 개수 3개일 때: WCSS = {wcss_4:.4f}')

클러스터 개수 2개일 때: Silhouette Score = 0.765187
클러스터 개수 3개일 때: Silhouette Score = 0.575237
클러스터 개수 4개일 때: Silhouette Score = 0.559589

클러스터 개수 3개일 때: WCSS = 3528533448.0313
클러스터 개수 3개일 때: WCSS = 2497989545.3095
클러스터 개수 3개일 때: WCSS = 1931635348.2375

2. 최적의 군집 수 확인하기

def elbow(df):
    sse = []
    for i in range(1,15):
        km = KMeans(n_clusters= i, init='k-means++', random_state=42)
        km.fit(df)
        sse.append(km.inertia_)
    
    plt.plot(range(1,15), sse, marker = 'o')
    plt.xlabel('cluster count')
    plt.ylabel('SSE')
    plt.show()

elbow(cluster_1)

3. 군집 결과 해석

# 원본 자료는 살리기 위해 복사하기
merged_clu_df = merged_df_cleaned.copy()

# 클러스트 항목 설정
cluster_1 = merged_df_cleaned[['repeat_order','total_price', 'avg_price', 'total_payment_value', 'avg_payment_value']]

# 학습하기
# 군집 3개로 나누기
custoemr_kmeans3 = KMeans(n_clusters=3, init='k-means++', max_iter=300, random_state=42)
custoemr_kmeans3.fit(cluster_1)

# 군집된 결과 저장
merged_clu_df['cluster'] = custoemr_kmeans3.labels_


# 컬럼 지우는 함수
def del_cols(df):
    del df['order_id']
    del df['customer_id']
    del df['seller_id']
    del df['product_id']
    del df['order_approved_at']
    del df['order_delivered_timestamp']
    del df['order_estimated_delivery_date']
    
# 군집 특성 확인하는 함수
def character_visual(df):
    plt.figure(figsize=(20,20))

    for i in range(len(df.columns)):
    
        cols = list(df.columns)[i]
    
        plt.subplot(4,6,i+1)
        sns.histplot(df, x=cols, palette='RdYlGn')
        plt.title(cols)

1) 1번 군집

# 1번 군집만 설정
merged_clu1_df = merged_clu_df[merged_clu_df['cluster'] == 0]

del_cols(merged_clu1_df)
character_visual(merged_clu1_df)

2) 2번 군집

merged_clu2_df = merged_clu_df[merged_clu_df['cluster'] == 1]

del_cols(merged_clu2_df)
character_visual(merged_clu2_df)

3) 3번 군집

merged_clu3_df = merged_clu_df[merged_clu_df['cluster'] == 2]

del_cols(merged_clu3_df)
character_visual(merged_clu3_df)

4) 분기별 시각화

merged_df = merged_clu_df
merged_df['order_purchase_timestamp'] = pd.to_datetime(merged_df['order_purchase_timestamp'])
merged_df['quarter'] = merged_df['order_purchase_timestamp'].dt.to_period('Q')

# 모든 subplot에서 동일한 x축 레이블을 사용하기 위해 고유한 quarter 값을 얻음
unique_quarters = sorted(merged_df['quarter'].unique())

# 분기별 군집
rows = []
for quarter in sorted(merged_df['quarter'].unique()):
    row = {
        'Quarter': quarter,
        'Cluster 0 Count': merged_df[(merged_df['quarter'] == quarter) & (merged_df['cluster'] == 0)].shape[0],
        'Cluster 1 Count': merged_df[(merged_df['quarter'] == quarter) & (merged_df['cluster'] == 1)].shape[0],
        'Cluster 2 Count': merged_df[(merged_df['quarter'] == quarter) & (merged_df['cluster'] == 2)].shape[0]
    }
    rows.append(row)

# 데이터프레임으로 변환
summary_df = pd.DataFrame(rows)

# 분기 열 str로 타입 변경
summary_df['Quarter'] = summary_df['Quarter'].astype(str)

# 시각화
plt.figure(figsize=(12, 6))

plt.plot(summary_df['Quarter'], summary_df['Cluster 0 Count'], marker='o', label='Cluster 0')
plt.plot(summary_df['Quarter'], summary_df['Cluster 1 Count'], marker='o', label='Cluster 1')
plt.plot(summary_df['Quarter'], summary_df['Cluster 2 Count'], marker='o', label='Cluster 2')

plt.xlabel('Quarter')
plt.ylabel('Count')
plt.title('Quarterly Distribution of Orders by Cluster')
plt.xticks(rotation=45)
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

5) 군집 해석

1번 군집: 일회성 구매 고객이 많고, 현금 결제가 많았다. 부피는 평균치지만 무게가 적은 상품을 구매하여 배송비가 세 집단 중 가장 작은 편이며, 장난감, 생활 관련 구매 고객이 다른 집단에 비해 많은 편이다.
2번 군집: 다른 집단에 비해 재구매 고객이 많고, 특히 다른 집단에 비해 남동부 지방 외 고객이 많다. 결제 금액이 높고, 할부 횟수가 많은 편이다.
3번 군집: 가장 최근에 주문한 고객들이 많았고, 재구매 고객이 가장 적다. 또한 전자 제품 구매가 많다는 특징이 있다.

4. 월별 평균 판매 가치의 변화

plt.rcParams['font.family'] = 'AppleGothic'

daily_payment = df.groupby(df['order_purchase_timestamp'].dt.date)['payment_value'].mean()

plt.figure(figsize=(14, 7))

plt.plot(daily_payment.index, daily_payment.values, label='Average Daily Payment Value', color='skyblue')

plt.scatter(daily_payment.index, daily_payment.values, color='red', s=10, label='Payment Value Points')

plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
plt.gca().xaxis.set_major_locator(mdates.MonthLocator())
plt.gcf().autofmt_xdate()

plt.title('월별 평균 판매 가치의 변화')
plt.xlabel('Date')
plt.ylabel('Average Payment Value')
plt.legend()
plt.grid(True)
plt.show()

E. 페르소나 설정

	유아 및 생활 제품 구매 관련이 높은 자녀가 있는 가정 마케팅 전략: - 우수 리뷰자 상품 증정 이벤트 장난감 및 생활 용품 구매가 많은 것을 보아 안전한 상품을 구매하고 싶을 것으로 추정하여, 실구매자의 리뷰를 이끌어내 이를 바탕으로 그 신뢰를 쌓고자 한다. - 정액 할인 쿠폰 프로모션 현금 구매 및 할부를 이용하는 고객이 적다. 쿠폰을 통해 결제 금액을 줄여 가성비 쇼핑이라 느끼게 한다. 이를 통해 반복 구매를 유도하고자 한다.
	무겁고 부피가 큰 상품을 여러 개, 여러 번 구매하는 고객 마케팅 전략: - 배송 할인 쿠폰 제공 무겁고 부피가 큰 상품은 배송비를 많이 지불해야 함으로 더 많은 구매를 유도하기 위하여 배송 할인 쿠폰을 지급하고자 한다. - 구매 횟수에 따른 포인트 제공 재구매 횟수가 다른 집단에 비해 많은 편이다. 이들을 충성 고객으로 전환하기 위해 방문횟수가 일정 횟수 이상 늘어날 때마다 포인트를 제공하고자 한다..
	전자제품에 관심과 수요가 많은 2030 MZ세대 청년층 마케팅 전략: - 전자 제품 구매 시 악세사리 제공 전자 제품에 따른 악세사리를 제공하여 구매 결정을 쉽게 내릴 수 있도록 한다. - SNS 광고 청년들이 주로 이용하는 SNS을 통한 광고를 통해 소비를 촉진하고자 한다. - 블랙프라이데이, 크리스마스, 신년과 같은 특별한 날이 많은 4분기, 그 다음 해 1분기에 집중적으로 활용할 생각이다.

'Project' 카테고리의 다른 글

[4차 프로젝트] 식중독 발생 확률 예측 및 대시보드 생성 (1)	2024.11.07
[3차 프로젝트] 회귀 예측을 활용한 최적 승객 탑승 위치 추천 (1)	2024.09.20
[1차 프로젝트] 데이터 시각화 및 인사이트 도출 (4)	2024.08.12
[1차 프로젝트] 데이터 전처리 (0)	2024.07.31
[1차 프로젝트] 데이터 이해하기 (0)	2024.07.30

내 블로그 - 관리자 홈 전환	`Q` `Q`
새 글 쓰기	`W` `W`

글 수정 (권한 있는 경우)	`E` `E`
댓글 영역으로 이동	`C` `C`

이 페이지의 URL 복사	`S` `S`
맨 위로 이동	`T` `T`
티스토리 홈 이동	`H` `H`
단축키 안내	`Shift` + `/` `⇧` + `/`

journal201 님의 기록실