Today I Learned

ML Study_Linear Regression

by Holly Yoon


๐Ÿ’ก ์„ ํ˜•ํšŒ๊ท€ ๋ชจ๋ธ ์—ฐ์Šต

  • ์„ค๋ช… : ์„ ํ˜•ํšŒ๊ท€ ๋ชจ๋ธ์„ ๊ตฌ์ถ•ํ•˜์—ฌ ํ†ต๊ณ„์ ์œผ๋กœ ํ•ด์„
  • ์˜ˆ์ธก : ์ƒˆ๋กœ์šด ์ž…๋ ฅ ๋ฐ์ดํ„ฐ(X) → ๋ฏธ๋ž˜ ๋ฐ˜์‘๋ณ€์ˆ˜ ๊ฐ’(Y) ์˜ˆ์ธก ๋ฐ ํ‰๊ฐ€

ํšŒ๊ท€๋ชจ๋ธ์˜ ๊ฒ€์ • (์ฐธ๊ณ  ์ž๋ฃŒ)

  • Why? We have to check whether the fitted model follows the basic assumptions of linear regression. If the data violates the assumptions and we cram it into a linear regression anyway, the results are a mess. (What is the random error? Random error (residual) = target value - predicted value.)
    • Check the normality of the residuals (assumption: the errors follow a normal distribution)
    • Check the homoscedasticity of the residuals (assumption: the errors all have the same variance)
    • Independence (predictions vs. residuals, independent variables vs. residuals, autocorrelation of the residuals)
  • Interpretation (see the sketch after this list)
    • Coefficient of determination (R-squared): evaluates the model's performance
    • coef: the regression coefficients
    • A variable is significant when its p-value is at or below 0.05 (the significance level)
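
A minimal sketch of these diagnostics (my own, using statsmodels and scipy with hypothetical x/y data): summary() reports R-squared, coefficients, and p-values in one table, and the residuals can be checked for normality and autocorrelation.

import numpy as np
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.stattools import durbin_watson

x = np.array([1, 2, 3, 4, 5, 6, 7, 8])                      # hypothetical inputs
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1])   # hypothetical targets

X = sm.add_constant(x)              # add the intercept column
model = sm.OLS(y, X).fit()
print(model.summary())              # R-squared, coef, p-values in one report

residuals = model.resid             # residual = target - prediction
print(stats.shapiro(residuals))     # normality check (Shapiro-Wilk)
print(durbin_watson(residuals))     # independence check (autocorrelation)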

kaggle์— ์ฒ˜์Œ ๋…ธํŠธ๋ถ ์ƒ์„ฑํ•ด๋ดค๋‹ค >MLstudy_W3

๐Ÿ’ก ์ฑ… <ํ˜ผ์ž ๊ณต๋ถ€ํ•˜๋Š” ๋จธ์‹ ๋Ÿฌ๋‹+๋”ฅ๋Ÿฌ๋‹ ์ฑ…>

  • ์ง€๋„ํ•™์Šต์˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜ : ๋ถ„๋ฅ˜์™€ ํšŒ๊ท€
  • ํšŒ๊ท€๋Š” ํด๋ž˜์Šค ์ค‘ ํ•˜๋‚˜๋กœ ๋ถ„๋ฅ˜ํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ ์ž„์˜์˜ ์–ด๋–ค ์ˆซ์ž๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ๋ฌธ์ œ

1. K-์ตœ๊ทผ์ ‘ ์ด์›ƒ ํšŒ๊ท€


  • ๊ฒฐ์ •๊ณ„์ˆ˜ : ๋ถ„๋ฅ˜์˜ ๊ฒฝ์šฐ, ํ…Œ์ŠคํŠธ์— ์ƒ˜ํ”Œ์„ ์ •ํ™•ํ•˜๊ฒŒ ๋ถ„๋ฅ˜ํ•œ ๊ฐœ์ˆ˜์˜ ๋น„์œจ์„ ์ •ํ™•๋„๋กœ ์ธก์ •ํ•œ๋‹ค. ํšŒ๊ท€์—์„œ ์ •ํ™•ํ•œ ์ˆซ์ž๋ฅผ ๋งžํžŒ๋‹ค๋Š” ๊ฒƒ์ด ๋ถˆ๊ฐ€๋Šฅํ•˜๊ธฐ ๋•Œ๋ฌธ์—, ๊ฒฐ์ •๊ณ„์ˆ˜๋ฅผ ํ™œ์šฉํ•˜์—ฌ ํ‰๊ฐ€ํ•œ๋‹ค.
from sklearn.neighbors import KNeighborsRegressor

knr = KNeighborsRegressor()
knr.fit(train_input, train_target)
print(knr.score(test_input, test_target))
  • mean_absolute_error: an evaluation metric provided by scikit-learn. It returns the mean of the absolute errors between targets and predictions (i.e., how far the predictions are from the targets on average).
from sklearn.metrics import mean_absolute_error
test_prediction = knr.predict(test_input)
mae = mean_absolute_error(test_target, test_prediction)
print(mae)
  • ๊ณผ๋Œ€์ ํ•ฉ vs ๊ณผ์†Œ์ ‘ํ•ฉ : ํ›ˆ๋ จ ์„ธํŠธ์™€ ํ…Œ์ŠคํŠธ ์„ธํŠธ์˜ ์ ์ˆ˜๋ฅผ ๋น„๊ตํ–ˆ์„ ๋•Œ, ํ›ˆ๋ จ ์„ธํŠธ์—์„œ ๋„ˆ๋ฌด ๋†’์œผ๋ฉด ๊ณผ๋Œ€์ ‘ํ•ฉ, ๊ทธ ๋ฐ˜๋Œ€๊ฑฐ๋‚˜ ๋ชจ๋‘ ์ ์ˆ˜๊ฐ€ ๋‚ฎ์œผ๋ฉด ๊ณผ์†Œ์ ‘ํ•ฉ. ๊ณผ์†Œ์ ‘ํ•ฉ์ด ๋ฐœ์ƒํ•˜๋Š” ์ด์œ ๋Š”, ํ›ˆ๋ จ/ํ…Œ์ŠคํŠธ ์„ธํŠธ์˜ ํฌ๊ธฐ๊ฐ€ ๋งค์šฐ ์ž‘๊ธฐ ๋•Œ๋ฌธ์— ์ผ์–ด๋‚œ๋‹ค.
    • ๊ณผ๋Œ€์ ‘ํ•ฉ → ๋ชจ๋ธ์„ ๋œ ๋ณต์žกํ•˜๊ฒŒ ๋งŒ๋“ค์–ด์•ผ, k๊ฐ’ ๋Š˜๋ฆฌ๊ธฐ
    • ๊ณผ์†Œ์ ‘ํ•ฉ → ๋ชจ๋ธ์„ ๋” ๋ณต์žกํ•˜๊ฒŒ ๋งŒ๋“ค์–ด์•ผ, k๊ฐ’ ์ค„์ด๊ธฐ
  • k-์ตœ๊ทผ์ ‘ ์ด์›ƒ ๊ฐœ์ˆ˜ ์กฐ์ • : ํ›ˆ๋ จ ์„ธํŠธ๋ณด๋‹ค ํ…Œ์ŠคํŠธ ์ ์ˆ˜๊ฐ€ ๋†’์œผ๋‹ˆ ๊ณผ์†Œ์ ‘ํ•ฉ์ด ๋ฐœ์ƒํ–ˆ์„ ๋•Œ๋Š”, ๋ชจ๋ธ์„ ๋” ๋ณต์žกํ•˜๊ฒŒ ๋งŒ๋“ค๋ฉด ๋œ๋‹ค. ํ›ˆ๋ จ ์„ธํŠธ์— ๋” ์ž˜ ๋งž๊ฒŒ ๋งŒ๋“ค๋ฉด ํ…Œ์ŠคํŠธ ์„ธํŠธ์˜ ์ ์ˆ˜๋Š” ์กฐ๊ธˆ ๋‚ฎ์•„์ง„๋‹ค.
knr.n_neighbors = 3
knr.fit(train_input, train_target)
print(knr.score(train_input, train_target))
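
To make the over/underfitting trade-off concrete, a small sketch of my own sweeping k and comparing the two scores:

from sklearn.neighbors import KNeighborsRegressor

# small k -> more complex model (overfitting risk); large k -> simpler model (underfitting risk)
for k in [1, 3, 5, 10]:
    knr = KNeighborsRegressor(n_neighbors=k)
    knr.fit(train_input, train_target)
    print(k, knr.score(train_input, train_target), knr.score(test_input, test_target))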

2. ์„ ํ˜•ํšŒ๊ท€


  • k-์ตœ๊ทผ์ ‘ ์ด์›ƒ ํšŒ๊ท€์˜ ํ•œ๊ณ„ : ์ƒˆ๋กœ์šด ์ƒ˜ํ”Œ์ด ํ›ˆ๋ จ ์„ธํŠธ์˜ ๋ฒ”์œ„๋ฅผ ๋ฒ—์–ด๋‚˜๋ฉด ์—‰๋šฑํ•œ ๊ฐ’์„ ์˜ˆ์ธกํ•œ๋‹ค.
import matplotlib.pyplot as plt
distance, indexes = knr.kneighbors([[50]])
plt.scatter(train_input, train_target)
#์ด์›ƒ๋งŒ ๊ตฌํ•˜๊ธฐ
plt.scatter(train_input[indexes],train_target[indexes], marker='D')
# the 50 cm perch
plt.scatter(50, 1033, marker='^')
plt.show() 

๊ธธ์ด๊ฐ€ ์ปค์งˆ์ˆ˜๋ก ๋†์–ด์˜ ๋ฌด๊ฒŒ๋Š” ์ฆ๊ฐ€ํ•˜์ง€๋งŒ, 50cm๋†์–ด๋Š” 45cm ๊ทผ๋ฐฉ์ด๋ผ k-์ตœ๊ทผ์ ‘ ์ด์›ƒ ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ์ƒ˜ํ”Œ ๋ฌด๊ฒŒ๋ฅผ ํ‰๊ท ์„ ์ œ๊ณตํ•จ

  • ๋จธ์‹ ๋Ÿฌ๋‹ ๋ชจ๋ธ์€ ์ฃผ๊ธฐ์ ์œผ๋กœ ํ›ˆ๋ จํ•ด์•ผ ํ•œ๋‹ค. ์‹œ๊ฐ„๊ณผ ํ™˜๊ฒฝ์ด ๋ณ€ํ™”ํ•˜๋ฉด์„œ ๋ฐ์ดํ„ฐ๋„ ๋ฐ”๋€Œ๊ธฐ ๋•Œ๋ฌธ์— ์ƒˆ๋กœ์šด ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ๋กœ ๋ฐ˜๋ณต ํ›ˆ๋ จ์„ ์ง„ํ–‰ํ•ด์•ผ ํ•œ๋‹ค.
  • ์‚ฌ์ดํ‚ท๋Ÿฐ์˜ LinearRegressionํด๋ž˜์Šค, fit() score() predict()๋ฉ”์†Œ๋“œ
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(train_input, train_target)
print(lr.predict([[50]]))
>>[1241.83]

print(lr.coef_, lr.intercept_) # coefficient (weight) and y-intercept
>>[39.17] -709.01
plt.scatter(train_input, train_target)
plt.plot([15,50], [15*lr.coef_ + lr.intercept_, 50*lr.coef_ + lr.intercept_])
plt.scatter(50, 1241.8, marker='^')
plt.show()

  • ๋‹คํ•ญํšŒ๊ท€ : ์ตœ์ ์˜ ์ง์„ ์ด ์•„๋‹ˆ๋ผ ์ตœ์ ์˜ ๊ณก์„ ์„ ์ฐพ์ž! column_stack()์„ ํ™œ์šฉํ•˜์—ฌ ๊ธธ์ด๋ฅผ ์ œ๊ณฑํ•œ ํ•ญ์„ ์ถ”๊ฐ€ํ•ด์ฃผ์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.
import numpy as np

train_poly = np.column_stack((train_input**2, train_input))
test_poly = np.column_stack((test_input**2, test_input))
lr=LinearRegression()
lr.fit(train_poly, train_target)
print(lr.coef_, lr.intercept_)
>>[  1.01433211 -21.55792498] 116.0502107827827

#๊ตฌ๊ฐ„๋ณ„ ์ง์„  ๊ทธ๋ฆฌ๊ธฐ ์œ„ํ•œ ์ •์ˆ˜ ๋ฐฐ์—ด ๋งŒ๋“ค๊ธฐ
point=np.arange(15,50)
#ํ›ˆ๋ จ ์„ธํŠธ์˜ ์‚ฐ์ ๋„
plt.scatter(train_input, train_target)
#2์ฐจ ๋ฐฉ์ •์‹ ๊ทธ๋ž˜ํ”„ ๊ทธ๋ฆฌ๊ธฐ
plt.plot(point, 1.01*point**2-21.6*point+116.05)
#50cm ๋†์–ด ๋ฐ์ดํ„ฐ
plt.scatter([50], [1574], marker='^')
plt.show()
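
To check how well the curve fits (a short sketch of my own), compare the R-squared scores on both sets:

print(lr.score(train_poly, train_target))  # training-set R-squared
print(lr.score(test_poly, test_target))    # test-set R-squared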

3. ๋‹ค์ค‘ ํšŒ๊ท€


  • ์—ฌ๋Ÿฌ๊ฐœ์˜ ํŠน์„ฑ์„ ์‚ฌ์šฉํ•œ ์„ ํ˜• ํšŒ๊ท€๋ฅผ ๋‹ค์ค‘ํšŒ๊ท€๋ผ๊ณ  ํ•ฉ๋‹ˆ๋‹ค.
  • ํŠน์„ฑ ๊ณตํ•™(feature engineering) : ๊ธฐ์กด์˜ ํŠน์„ฑ์„ ์‚ฌ์šฉํ•˜์—ฌ ์ƒˆ๋กœ์šด ํŠน์„ฑ์„ ๋ฝ‘์•„๋‚ด๋Š” ์ž‘์—… (๋†์–ด์˜ ๋ฌด๊ฒŒ๋ฅผ ์˜ˆ์ธกํ•˜๊ธฐ ์œ„ํ•ด ๋†์–ด์˜ ๊ธธ์ดx๋†์–ด์˜ ๋†’์ด๋ผ๋Š” ์ƒˆ๋กœ์šด ํŠน์„ฑ์„ ๋งŒ๋“ค ์ˆ˜ ์žˆ์Œ)
  • ๋ฐ์ดํ„ฐ ์ค€๋น„ : ํŒ๋‹ค์Šค → dataframe → numpy ๋ฐฐ์—ด๋กœ ๋ณ€ํ™˜
import pandas as pd
df = pd.read_csv('https://bit.ly/perch_csv')
perch_full=df.to_numpy()

import numpy as np
perch_weight = np.array([5.9, 32.0, 40.0, 51.5, 70.0, 100.0, 78.0, 80.0, 85.0, 85.0, 110.0,
       115.0, 125.0, 130.0, 120.0, 120.0, 130.0, 135.0, 110.0, 130.0,
       150.0, 145.0, 150.0, 170.0, 225.0, 145.0, 188.0, 180.0, 197.0,
       218.0, 300.0, 260.0, 265.0, 250.0, 250.0, 300.0, 320.0, 514.0,
       556.0, 840.0, 685.0, 700.0, 700.0, 690.0, 900.0, 650.0, 820.0,
       850.0, 900.0, 1015.0, 820.0, 1100.0, 1000.0, 1100.0, 1000.0,
       1000.0])

#ํ›ˆ๋ จ์„ธํŠธ์™€ ๋ฐ์ดํ„ฐ ์„ธํŠธ ๋‚˜๋ˆ„๊ธฐ
from sklearn.model_selection import train_test_split
train_input, test_input, train_target, test_target = train_test_split(perch_full, perch_weight, random_state=42)
  • ์‚ฌ์ดํ‚ท๋Ÿฐ์˜ ๋ณ€ํ™˜๊ธฐ
    • fit() : ์ƒˆ๋กญ๊ฒŒ ๋งŒ๋“ค ํŠน์„ฑ ์กฐํ•ฉ์„ ์ฐพ์Šต๋‹ˆ๋‹ค.
    • transform() : ์‹ค์ œ๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค.
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(include_bias=False)
poly.fit([[2, 3]])
print(poly.transform([[2, 3]]))
>> [[2. 3. 4. 6. 9.]]
# by default scikit-learn adds each feature's square and the product of each feature pair

poly=PolynomialFeatures(include_bias=False)
poly.fit(train_input)
train_poly=poly.transform(train_input)
print(train_poly.shape)
>> (42,9)
  • get_feature_names(): shows which combination of inputs produced each feature.
poly.get_feature_names()
test_poly = poly.transform(test_input)
  • ๋‹ค์ค‘ ํšŒ๊ท€ ๋ชจ๋ธ๋กœ ํ›ˆ๋ จํ•˜๊ธฐ
from sklearn.linear_model import LinearRegression
lr=LinearRegression()
lr.fit(train_poly, train_target)
  • ํŠน์„ฑ ์ถ”๊ฐ€ํ•˜๊ธฐ : PolynomialFeatures ํด๋ž˜์Šค์˜ degree ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ์‚ฌ์šฉํ•ด์„œ ๊ณ ์ฐจํ•ญ์˜ ์ตœ๋Œ€ ์ฐจ์ˆ˜๋ฅผ ์ง€์ •ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. → ํŠน์„ฑ ๊ฐœ์ˆ˜๊ฐ€ ๋„ˆ๋ฌด ํฌ๋ฉด, ํ›ˆ๋ จ ์„ธํŠธ์— ๋„ˆ๋ฌด ๊ณผ๋Œ€์ ํ•ฉ๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
poly = PolynomialFeatures(degree=5, include_bias=False)
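
A sketch of my own showing what happens with these degree-5 features: the training score becomes near-perfect while the test score collapses, i.e., overfitting.

poly.fit(train_input)
train_poly = poly.transform(train_input)
test_poly = poly.transform(test_input)
print(train_poly.shape)                    # far more features than before

lr.fit(train_poly, train_target)
print(lr.score(train_poly, train_target))  # near-perfect on the training set
print(lr.score(test_poly, test_target))    # much worse on the test set: overfitting
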
  • ๊ทœ์ œ(regularization) : ๋จธ์‹ ๋Ÿฌ๋‹ ๋ชจ๋ธ์ด ํ›ˆ๋ จ ์„ธํŠธ๋ฅผ ๋„ˆ๋ฌด ๊ณผ๋„ํ•˜๊ฒŒ ํ•™์Šตํ•˜์ง€ ๋ชปํ•˜๋„๋ก ํ›ผ๋ฐฉํ•˜๋Š” ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค.

159์ชฝ๋ถ€ํ„ฐ ๊ณต๋ถ€ ์‹œ์ž‘ํ•˜์‹œ๋„๋ก ํ•˜์‹œ์˜ค.

  • ๋ฆฟ์ง€ ํšŒ๊ท€
  • ๋ผ์˜ ํšŒ๊ท€
