[R Programming] Classifying mushrooms with the Naive Bayes algorithm
Yeenn
2021. 12. 16. 01:28
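The practice dataset mushrooms.csv can be downloaded from the attachment above.
Loading and inspecting the data
# Load the data (read string columns as factors)
mushroom = read.csv("mushrooms.csv", header=TRUE, stringsAsFactors = TRUE)
# Inspect the data
View(mushroom)
str(mushroom)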
'data.frame': 8124 obs. of 23 variables:
$ type : Factor w/ 2 levels "edible","poisonous": 2 1 1 2 1 1 1 1 2 1 ...
$ cap_shape : Factor w/ 6 levels "bell","conical",..: 3 3 1 3 3 3 1 1 3 1 ...
$ cap_surface : Factor w/ 4 levels "fibrous","grooves",..: 4 4 4 3 4 3 4 3 3 4 ...
$ cap_color : Factor w/ 10 levels "brown","buff",..: 1 10 9 9 4 10 9 9 9 10 ...
$ bruises : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
$ odor : Factor w/ 9 levels "almond","anise",..: 8 1 2 8 7 1 1 2 8 1 ...
$ gill_attachment : Factor w/ 2 levels "attached","free": 2 2 2 2 2 2 2 2 2 2 ...
$ gill_spacing : Factor w/ 2 levels "close","crowded": 1 1 1 1 2 1 1 1 1 1 ...
$ gill_size : Factor w/ 2 levels "broad","narrow": 2 1 1 2 1 1 1 1 2 1 ...
$ gill_color : Factor w/ 12 levels "black","brown",..: 1 1 2 2 1 2 5 2 8 5 ...
$ stalk_shape : Factor w/ 2 levels "enlarging","tapering": 1 1 1 1 2 1 1 1 1 1 ...
$ stalk_root : Factor w/ 5 levels "bulbous","club",..: 3 2 2 3 3 2 2 2 3 2 ...
$ stalk_surface_above_ring: Factor w/ 4 levels "fibrous","scaly",..: 4 4 4 4 4 4 4 4 4 4 ...
$ stalk_surface_below_ring: Factor w/ 4 levels "fibrous","scaly",..: 4 4 4 4 4 4 4 4 4 4 ...
$ stalk_color_above_ring : Factor w/ 9 levels "brown","buff",..: 8 8 8 8 8 8 8 8 8 8 ...
$ stalk_color_below_ring : Factor w/ 9 levels "brown","buff",..: 8 8 8 8 8 8 8 8 8 8 ...
$ veil_type : Factor w/ 1 level "partial": 1 1 1 1 1 1 1 1 1 1 ...
$ veil_color : Factor w/ 4 levels "brown","orange",..: 3 3 3 3 3 3 3 3 3 3 ...
$ ring_number : Factor w/ 3 levels "none","one","two": 2 2 2 2 2 2 2 2 2 2 ...
$ ring_type : Factor w/ 5 levels "evanescent","flaring",..: 5 5 5 5 1 5 5 5 5 5 ...
$ spore_print_color : Factor w/ 9 levels "black","brown",..: 1 2 2 1 2 1 1 2 1 1 ...
$ population : Factor w/ 6 levels "abundant","clustered",..: 4 3 3 4 1 3 3 4 5 4 ...
$ habitat : Factor w/ 7 levels "grasses","leaves",..: 5 1 3 5 1 1 3 3 1 3 ...
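We will use the Naive Bayes classifier, which here expects categorical predictors stored as factors. That is why the file was read with stringsAsFactors enabled, and str() confirms that all 23 columns are factors.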
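If the file had been read without stringsAsFactors, the character columns could be converted afterwards. A minimal sketch on a small made-up data frame (the `toy` object and its values are illustrative, not taken from mushrooms.csv):

```r
# Toy data frame standing in for a CSV read with stringsAsFactors = FALSE
toy <- data.frame(
  type = c("edible", "poisonous", "edible"),
  odor = c("almond", "foul", "none"),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor, leaving other columns untouched
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], factor)

str(toy)
```

Applied to the real data, the same two lines would operate on `mushroom` instead of `toy`.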
Checking for missing values
# Count missing values per column
colSums(is.na(mushroom))
type cap_shape cap_surface cap_color bruises
0 0 0 0 0
odor gill_attachment gill_spacing gill_size gill_color
0 0 0 0 0
stalk_shape stalk_root stalk_surface_above_ring stalk_surface_below_ring stalk_color_above_ring
0 0 0 0 0
stalk_color_below_ring veil_type veil_color ring_number ring_type
0 0 0 0 0
spore_print_color population habitat
0 0 0
dim(mushroom)
[1] 8124 23
colSums(is.na(...)) counts the missing values in every column at once. Every count is 0, so the dataset contains no NA values.
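One caveat, offered as an aside: is.na() only counts real NA values. The model output later in this post shows that stalk_root has a level literally named "missing", which is an ordinary category and so is not counted here. The sketch below illustrates the difference on made-up values (the `toy` data frame is hypothetical):

```r
# Made-up data: one real NA and one row labeled with the string "missing"
toy <- data.frame(
  stalk_root = factor(c("bulbous", "missing", "club", NA)),
  odor       = factor(c("none", "foul", "none", "almond"))
)

# Real NA values per column
na_counts <- colSums(is.na(toy))
na_counts

# Rows where the category label itself is the string "missing"
label_missing <- sum(toy$stalk_root == "missing", na.rm = TRUE)
label_missing
```

Whether such a level should be treated as a category or as missing data is a modeling choice; this post keeps it as a category.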
Train/test split
# Shuffle the data (fix the seed so the split is reproducible)
set.seed(123)
train_cnt <- round(0.75*dim(mushroom)[1])
train_cnt
[1] 6093
# Draw 6093 random row indices without replacement
train_index <- sample(1:dim(mushroom)[1], train_cnt, replace=F)
train_index
[1] 2463 2511 2227 526 4291 2986 1842 1142 3371 5349 5364 5134 3446 4761 6746 1627 7936 2757 5107 5211 953 4444 1017 7817 2013 5475
[27] 2888 6170 2567 1450 5769 1790 4307 2980 1614 6737 555 5991 4469 6988 1167 2592 2538 7789 1799 905 7081 5962 1047 7067 3004 4405
[53] 3207 7989 3995 5344 166 217 1314 2629 6216 588 1599 4237 4818 3937 4089 2907 4249 294 277 5583 41 6575 6234 316 7391 6672
[79] 7284 7774 2822 2795 2504 6742 3926 7207 1183 752 3281 374 8118 6129 2082 4612 4109 2117 6134 6015 755 6553 5428 7446 5209 7072
[105] 1006 2585 7127 2339 1448 3952 3358 3980 4767 6265 1134 3230 5184 5603 1934 1501 4576 3783 6211 7831 7158 1914 5967 1109 4261 7816
[131] 1075 3146 7346 1386 2284 4706 2378 6870 4223 4044 2260 686 3857 6078 6958 7478 5027 7912 7022 6387 847 7960 7281 983 4715 6023
[157] 4573 6095 151 6810 1638 6911 2208 7559 6299 2474 1029 326 3856 7448 2837 7735 1956 5358 5884 4093 5459 985 6183 6966 986 4233
[183] 2503 1762 1584 4685 7251 7864 4084 5999 2087 2244 1793 4776 7454 5726 6644 7947 1808 344 7979 2507 2992 3092 7016 195 5981 3236
[209] 3124 4972 6678 2225 4650 2875 2132 5643 6777 4023 3464 1326 6741 3949 4802 2667 7757 3502 2758 3833 712 1452 4768 5370 1828 3501
[235] 3069 2446 3061 528 4055 6525 1569 6184 6056 6666 473 6344 6098 1149 2037 2313 2823 1927 6330 7741 3324 458 3224 4927 1078 5015
[261] 5658 6379 1313 185 413 4723 2570 1333 7222 6349 5995 4875 1561 564 794 1415 8025 3799 1370 4256 6612 3581 2968 3129 6601 4713
[287] 357 279 4366 7976 6790 6491 3201 2266 618 1905 6842 337 6941 2074 539 5077 7687 956 5877 5786 3625 6868 3462 5618 7390 6815
[313] 7005 6801 3008 1445 5177 6804 2211 2286 5793 2626 7684 7988 7393 5509 4213 7705 4744 1079 1241 7904 2605 1706 6559 7302 7101 873
[339] 6832 757 6061 988 2495 7272 2869 2879 4807 5588 1234 4445 1425 3809 7426 7913 4482 1165 2072 7089 3071 6083 7634 5250 2213 5823
[365] 1612 5497 3853 6314 5565 710 1258 2470 2556 8015 2112 6224 8007 6692 3035 6619 8017 6909 7511 1347 6623 7728 7618 7279 791 1987
[391] 5001 1341 4903 5342 5407 5854 3657 4388 5346 5910 1708 4393 4956 4701 6781 1835 3111 1261 6309 7970 1057 83 3468 866 2325 7377
[417] 5119 4172 5214 7198 7385 7090 7343 6518 2371 2163 7759 6994 2656 7633 5478 682 8016 7082 3168 1562 6541 7572 3814 7316 989 4052
[443] 1961 6536 4007 5834 2641 5567 1362 6888 4339 2154 3959 6155 4721 3436 386 2451 4557 141 31 6046 5235 7725 4190 3786 4112 3781
[469] 7588 178 3489 3536 5532 6321 4620 5557 5020 5698 204 3247 4019 2421 7834 7632 2432 5242 7567 3475 2363 259 4590 5985 5880 5409
[495] 7216 1355 7268 3064 5228 1325 4106 4266 3352 7516 4071 2450 1233 3540 1851 1673 7477 3467 4204 4104 4722 4357 2589 7474 6470 5194
[521] 2330 1609 3339 6007 5382 1760 7646 5843 5339 6840 3424 667 5086 5239 4548 856 1948 5699 2670 1960 646 1060 3127 3562 1264 4987
[547] 1656 1886 1328 8108 3082 665 3494 2035 5732 3177 4889 5508 8123 1204 4374 7541 241 4120 6823 1727 2085 2734 6710 5423 1743 3984
[573] 6163 5791 378 549 5735 1268 2096 7356 4030 6608 4489 3211 8116 1395 4766 5309 311 6333 1442 5502 2086 1108 319 6830 2894 1862
[599] 1666 120 2760 7499 1557 6585 7367 3571 599 8064 5435 6858 6821 6225 5175 134 4520 2804 6150 3200 879 668 5920 49 5596 3311
[625] 193 8107 6603 4399 7685 7042 2238 3696 191 6590 6263 3699 5747 7227 4913 3975 2109 1446 3180 1316 1397 5655 7699 2978 5720 3716
[651] 5046 5533 4794 8018 5883 5323 758 5089 3318 1464 2995 6834 251 2608 643 2593 990 2210 5442 2624 6312 3514 6932 6222 7573 1517
[677] 7520 5544 445 3043 95 3990 7653 379 1366 2269 7595 7329 620 448 242 7071 2862 6088 7070 407 6373 7392 2833 1498 2747 1465
[703] 3243 3095 4314 3556 7796 79 1531 7917 6381 4675 1831 2977 4518 4589 4826 1820 2257 4695 3765 877 2406 5161 904 7297 4944 7030
[729] 4546 1256 1358 2444 3802 1864 2687 4137 3336 3769 201 52 5345 2115 680 3842 577 4553 1015 7873 4045 5661 5140 1230 2172 2640
[755] 3847 2788 6189 3353 2139 6797 2004 7306 606 6271 3080 839 6137 6739 8043 3239 4595 7518 1658 931 2950 71 5892 4042 4177 4016
[781] 6920 7915 4992 3509 5870 7141 270 1436 5766 6281 2721 4724 5166 7750 2579 7428 5730 7399 3294 836 7416 602 7290 331 4997 6646
[807] 5907 291 4001 4839 4925 919 3241 2777 1471 6136 1365 8029 504 3148 1026 2934 786 1796 7274 3183 855 4470 1096 449 7032 3019
[833] 3301 1547 4431 3001 3128 7830 6415 4303 436 2924 3049 4029 601 7481 387 3335 3140 3192 2792 6709 5477 2840 742 7325 1859 4619
[859] 4682 256 3421 7320 4184 5536 5977 2059 586 4559 3827 2748 6986 4781 3899 6424 3584 803 2290 778 8008 1352 6316 5418 7962 1703
[885] 1553 7844 929 2789 6090 2413 1994 6018 2639 3543 3951 2024 5347 2882 5742 1339 4015 2291 3337 535 7568 982 2160 4552 3165 789
[911] 6775 7546 3017 4578 1272 2153 1195 4792 459 6660 2887 5432 5682 2940 139 1782 481 843 1852 4346 7686 5717 6474 4729 1650 8120
[937] 5219 1122 1082 7447 7132 1549 6152 2306 6779 6406 6743 7240 2976 4302 4295 5709 5960 3942 7206 7754 7177 3911 7795 6124 1916 5651
[963] 2801 5608 1295 4014 4494 5338 5275 7738 3617 6962 2834 4756 347 7993 1589 3903 3401 8012 7948 6199 5367 1394 7702 1068 860 5496
[989] 5028 7531 2583 3626 1733 2242 2508 5950 6720 3149 3691 5094
#train/test split
mushroom_train <- mushroom[train_index, ]
mushroom_test <- mushroom[-train_index, ]
nrow(mushroom_train)
[1] 6093
nrow(mushroom_test)
[1] 2031
str(mushroom_train)
'data.frame': 6093 obs. of 23 variables:
$ type : Factor w/ 2 levels "edible","poisonous": 1 1 1 1 2 1 1 1 1 2 ...
$ cap_shape : Factor w/ 6 levels "bell","conical",..: 3 4 3 1 4 4 4 3 4 3 ...
$ cap_surface : Factor w/ 4 levels "fibrous","grooves",..: 3 1 1 4 3 3 1 1 3 4 ...
$ cap_color : Factor w/ 10 levels "brown","buff",..: 4 8 8 10 4 4 9 1 1 2 ...
$ bruises : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 1 2 2 2 ...
$ odor : Factor w/ 9 levels "almond","anise",..: 7 7 7 2 5 7 7 7 7 5 ...
$ gill_attachment : Factor w/ 2 levels "attached","free": 2 2 2 2 2 2 2 2 2 2 ...
$ gill_spacing : Factor w/ 2 levels "close","crowded": 1 1 1 1 1 1 2 1 1 1 ...
$ gill_size : Factor w/ 2 levels "broad","narrow": 1 1 1 1 1 1 1 1 1 1 ...
$ gill_color : Factor w/ 12 levels "black","brown",..: 2 2 2 11 4 11 8 8 11 4 ...
$ stalk_shape : Factor w/ 2 levels "enlarging","tapering": 2 2 2 1 1 2 2 2 2 2 ...
$ stalk_root : Factor w/ 5 levels "bulbous","club",..: 1 1 1 2 1 1 3 1 1 1 ...
$ stalk_surface_above_ring: Factor w/ 4 levels "fibrous","scaly",..: 4 4 4 4 3 4 4 4 4 1 ...
$ stalk_surface_below_ring: Factor w/ 4 levels "fibrous","scaly",..: 4 4 4 4 3 4 4 4 4 1 ...
$ stalk_color_above_ring : Factor w/ 9 levels "brown","buff",..: 4 8 6 8 1 4 8 8 8 8 ...
$ stalk_color_below_ring : Factor w/ 9 levels "brown","buff",..: 8 6 6 8 2 6 8 6 4 8 ...
$ veil_type : Factor w/ 1 level "partial": 1 1 1 1 1 1 1 1 1 1 ...
$ veil_color : Factor w/ 4 levels "brown","orange",..: 3 3 3 3 3 3 3 3 3 3 ...
$ ring_number : Factor w/ 3 levels "none","one","two": 2 2 2 2 2 2 2 2 2 2 ...
$ ring_type : Factor w/ 5 levels "evanescent","flaring",..: 5 5 5 5 3 5 1 5 5 5 ...
$ spore_print_color : Factor w/ 9 levels "black","brown",..: 1 2 2 2 4 1 2 2 1 4 ...
$ population : Factor w/ 6 levels "abundant","clustered",..: 6 6 6 4 5 6 1 5 5 4 ...
$ habitat : Factor w/ 7 levels "grasses","leaves",..: 7 7 7 1 1 7 1 7 7 5 ...
After randomly sampling the row indices, I split the data into a training set and a test set.
The training set has 6093 rows and 23 columns, and the remaining 2031 rows form the test set.
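A random split does not guarantee identical class proportions in both parts, so it can be worth comparing them. A minimal sketch on simulated labels (the `labels` vector is made up; with the real data one would look at mushroom_train$type and mushroom_test$type instead):

```r
set.seed(123)

# Simulated class labels standing in for mushroom$type
labels <- factor(sample(c("edible", "poisonous"), 1000, replace = TRUE))

# Same 75/25 split logic as in the post
n <- length(labels)
train_cnt <- round(0.75 * n)
train_index <- sample(seq_len(n), train_cnt, replace = FALSE)

train_labels <- labels[train_index]
test_labels  <- labels[-train_index]

# Class proportions in each part; large differences would suggest
# trying another seed or a stratified split
prop.table(table(train_labels))
prop.table(table(test_labels))
```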
Prediction with the Naive Bayes classifier
# Fit the Naive Bayes model
library(e1071)
model1 <- naiveBayes(type~., data=mushroom_train)
model1
Naive Bayes Classifier for Discrete Predictors
Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)
A-priori probabilities:
Y
edible poisonous
0.5153455 0.4846545
Conditional probabilities:
cap_shape
Y bell conical convex flat knobbed sunken
edible 0.096815287 0.000000000 0.466242038 0.375796178 0.053821656 0.007324841
poisonous 0.011175076 0.001015916 0.438537081 0.393498137 0.155773789 0.000000000
cap_surface
Y fibrous grooves scaly smooth
edible 0.3719745223 0.0000000000 0.3585987261 0.2694267516
poisonous 0.1967490687 0.0006772773 0.4422621063 0.3603115476
cap_color
Y brown buff cinnamon gray green pink purple red white yellow
edible 0.303503185 0.009554140 0.007961783 0.244904459 0.003184713 0.014012739 0.003184713 0.144267516 0.174203822 0.095222930
poisonous 0.263122249 0.028445648 0.003047748 0.206569590 0.000000000 0.023366068 0.000000000 0.221131053 0.079918727 0.174398916
bruises
Y no yes
edible 0.3493631 0.6506369
poisonous 0.8435489 0.1564511
odor
Y almond anise creosote fishy foul musty none pungent spicy
edible 0.095222930 0.095859873 0.000000000 0.000000000 0.000000000 0.000000000 0.808917197 0.000000000 0.000000000
poisonous 0.000000000 0.000000000 0.049779885 0.143582797 0.550965120 0.009820522 0.029800203 0.065018625 0.151032848
gill_attachment
Y attached free
edible 0.044267516 0.955732484
poisonous 0.004740941 0.995259059
gill_spacing
Y close crowded
edible 0.70700637 0.29299363
poisonous 0.97155435 0.02844565
gill_size
Y broad narrow
edible 0.93184713 0.06815287
poisonous 0.43244158 0.56755842
gill_color
Y black brown buff chocolate gray green orange pink purple red
edible 0.086305732 0.221337580 0.000000000 0.047133758 0.056687898 0.000000000 0.015923567 0.198407643 0.106687898 0.023248408
poisonous 0.017609211 0.027429732 0.441584829 0.130037250 0.136471385 0.006095496 0.000000000 0.163901118 0.011852354 0.000000000
gill_color
Y white yellow
edible 0.229617834 0.014649682
poisonous 0.058923129 0.006095496
stalk_shape
Y enlarging tapering
edible 0.3834395 0.6165605
poisonous 0.4869624 0.5130376
stalk_root
Y bulbous club equal missing rooted
edible 0.45382166 0.12101911 0.20859873 0.16942675 0.04713376
poisonous 0.47375550 0.01185235 0.06501863 0.44937352 0.00000000
stalk_surface_above_ring
Y fibrous scaly silky smooth
edible 0.096178344 0.002547771 0.036624204 0.864649682
poisonous 0.034879783 0.002031832 0.574331189 0.388757196
stalk_surface_below_ring
Y fibrous scaly silky smooth
edible 0.10859873 0.04968153 0.03503185 0.80668790
poisonous 0.03555706 0.01964104 0.55265831 0.39214358
stalk_color_above_ring
Y brown buff cinnamon gray orange pink red white yellow
edible 0.002547771 0.000000000 0.000000000 0.135031847 0.044267516 0.134713376 0.020700637 0.662738854 0.000000000
poisonous 0.111412123 0.108703014 0.009820522 0.000000000 0.000000000 0.334913647 0.000000000 0.433118862 0.002031832
stalk_color_below_ring
Y brown buff cinnamon gray orange pink red white yellow
edible 0.015286624 0.000000000 0.000000000 0.137579618 0.044267516 0.134713376 0.021974522 0.646178344 0.000000000
poisonous 0.115814426 0.105993905 0.009820522 0.000000000 0.000000000 0.329156790 0.000000000 0.433118862 0.006095496
veil_type
Y partial
edible 1
poisonous 1
veil_color
Y brown orange white yellow
edible 0.019745223 0.024522293 0.955732484 0.000000000
poisonous 0.000000000 0.000000000 0.997968168 0.002031832
ring_number
Y none one two
edible 0.000000000 0.874203822 0.125796178
poisonous 0.009820522 0.971554352 0.018625127
ring_type
Y evanescent flaring large none pendant
edible 0.242675159 0.009872611 0.000000000 0.000000000 0.747452229
poisonous 0.451405350 0.000000000 0.332543176 0.009820522 0.206230952
spore_print_color
Y black brown buff chocolate green orange purple white yellow
edible 0.389808917 0.415923567 0.012101911 0.009872611 0.000000000 0.011783439 0.011146497 0.138535032 0.010828025
poisonous 0.058584490 0.056214020 0.000000000 0.403995936 0.018625127 0.000000000 0.000000000 0.462580427 0.000000000
population
Y abundant clustered numerous scattered several solitary
edible 0.09331210 0.06401274 0.09745223 0.21146497 0.28375796 0.25000000
poisonous 0.00000000 0.01320691 0.00000000 0.09414155 0.72536404 0.16728750
habitat
Y grasses leaves meadows paths urban waste woods
edible 0.341082803 0.057006369 0.060191083 0.033757962 0.022611465 0.043949045 0.441401274
poisonous 0.187267186 0.153064680 0.008465967 0.257704030 0.070436844 0.000000000 0.323061294
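# Predict from the test predictors: -1 drops column 1 (the label).
# The table printed below came from the original call with mushroom_test[ , 1],
# i.e. the label column only; with no predictors to match, prediction falls
# back to the priors, which is why every row there is classified as edible.
result1 <- predict(model1, mushroom_test[ , -1])
result1
library(gmodels)
CrossTable(mushroom_test[ ,1], result1)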
Cell Contents
|-------------------------|
| N |
| N / Table Total |
|-------------------------|
Total Observations in Table: 2031
| result1
mushroom_test[, 1] | edible | Row Total |
-------------------|-----------|-----------|
edible | 1068 | 1068 |
| 0.526 | |
-------------------|-----------|-----------|
poisonous | 963 | 963 |
| 0.474 | |
-------------------|-----------|-----------|
Column Total | 2031 | 2031 |
-------------------|-----------|-----------|
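To see how the printed tables turn into a prediction, the sketch below multiplies the prior by two of the conditional probabilities copied from the model output above (bruises = yes, odor = none) and normalizes the result. It uses only two of the 22 predictors for brevity, so it illustrates the calculation rather than reproducing what predict() returns.

```r
# A-priori probabilities, copied from the model output
prior <- c(edible = 0.5153455, poisonous = 0.4846545)

# P(bruises = "yes" | class) and P(odor = "none" | class), copied from the output
p_bruises_yes <- c(edible = 0.6506369, poisonous = 0.1564511)
p_odor_none   <- c(edible = 0.808917197, poisonous = 0.029800203)

# Naive Bayes: unnormalized score = prior * product of conditional probabilities
score <- prior * p_bruises_yes * p_odor_none

# Normalize so the two posteriors sum to 1
posterior <- score / sum(score)
posterior

# Predicted class = class with the largest posterior
predicted <- names(which.max(posterior))
predicted
```

With no predictors at all, the score is just the prior, so the larger prior (edible) always wins.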
Comparing accuracy
# Compare accuracy across laplace values
temp=c()          # accuracy for each laplace value
laplace_num=c()   # laplace values tried
for (i in 1:10) {
laplace_num = append(laplace_num, i*0.001)
# Fit a model with this laplace value (the object is a model, despite its name)
mushroom_test_pred = naiveBayes(type~., data=mushroom_train, laplace=i*0.001)
result2 <- predict(mushroom_test_pred, mushroom_test[ ,-1])
g2<-CrossTable(mushroom_test[ ,1], result2)
# Accuracy = the two diagonal cells of the 2x2 proportion table
g3<-g2$prop.tbl[1]+g2$prop.tbl[4]
temp=append(temp,g3)
}
result=data.frame("laplace" = laplace_num, "accuracy"=temp)
library(plotly)
plot_ly(x=~result[,"laplace"], y=~result[,"accuracy"], type='scatter', mode='lines') %>%
layout(xaxis=list(title="laplace value"), yaxis=list(title="accuracy"))
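The laplace argument matters because several conditional probabilities in the model output are exactly 0 (for example, no edible mushroom in the training set has a foul odor), and a single zero wipes out the whole product for that class. The sketch below shows additive smoothing on made-up counts. `counts`, `smooth_probs` and the numbers are hypothetical, and the formula is the standard add-alpha form, which as far as I know is also what e1071 applies to categorical predictors.

```r
# Made-up class-by-level counts for one categorical predictor
counts <- matrix(
  c(40,  0, 60,
     0, 30, 70),
  nrow = 2, byrow = TRUE,
  dimnames = list(class = c("edible", "poisonous"),
                  level = c("almond", "foul", "none"))
)

# Additive (Laplace) smoothing: add alpha to every cell, then normalize each row
smooth_probs <- function(counts, alpha = 0) {
  (counts + alpha) / (rowSums(counts) + alpha * ncol(counts))
}

p_raw    <- smooth_probs(counts)                 # contains exact zeros
p_smooth <- smooth_probs(counts, alpha = 0.001)  # zeros become small positive values

p_raw
p_smooth
```

Because the loop above only tries values between 0.001 and 0.01, the smoothed probabilities stay very close to the raw ones except where a count was zero.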