
[R Program] Classifying mushrooms with the Naive Bayes algorithm

Yeenn 2021. 12. 16. 01:28

mushrooms.csv
1.17MB

 

The practice dataset, mushrooms.csv, can be downloaded from the attachment above.

 

 

Loading and inspecting the data

# Load the data (string columns are read as factors)
mushroom <- read.csv("mushrooms.csv", header = TRUE, stringsAsFactors = TRUE)

# Inspect the data
View(mushroom)
str(mushroom)
'data.frame':	8124 obs. of  23 variables:
 $ type                    : Factor w/ 2 levels "edible","poisonous": 2 1 1 2 1 1 1 1 2 1 ...
 $ cap_shape               : Factor w/ 6 levels "bell","conical",..: 3 3 1 3 3 3 1 1 3 1 ...
 $ cap_surface             : Factor w/ 4 levels "fibrous","grooves",..: 4 4 4 3 4 3 4 3 3 4 ...
 $ cap_color               : Factor w/ 10 levels "brown","buff",..: 1 10 9 9 4 10 9 9 9 10 ...
 $ bruises                 : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
 $ odor                    : Factor w/ 9 levels "almond","anise",..: 8 1 2 8 7 1 1 2 8 1 ...
 $ gill_attachment         : Factor w/ 2 levels "attached","free": 2 2 2 2 2 2 2 2 2 2 ...
 $ gill_spacing            : Factor w/ 2 levels "close","crowded": 1 1 1 1 2 1 1 1 1 1 ...
 $ gill_size               : Factor w/ 2 levels "broad","narrow": 2 1 1 2 1 1 1 1 2 1 ...
 $ gill_color              : Factor w/ 12 levels "black","brown",..: 1 1 2 2 1 2 5 2 8 5 ...
 $ stalk_shape             : Factor w/ 2 levels "enlarging","tapering": 1 1 1 1 2 1 1 1 1 1 ...
 $ stalk_root              : Factor w/ 5 levels "bulbous","club",..: 3 2 2 3 3 2 2 2 3 2 ...
 $ stalk_surface_above_ring: Factor w/ 4 levels "fibrous","scaly",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ stalk_surface_below_ring: Factor w/ 4 levels "fibrous","scaly",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ stalk_color_above_ring  : Factor w/ 9 levels "brown","buff",..: 8 8 8 8 8 8 8 8 8 8 ...
 $ stalk_color_below_ring  : Factor w/ 9 levels "brown","buff",..: 8 8 8 8 8 8 8 8 8 8 ...
 $ veil_type               : Factor w/ 1 level "partial": 1 1 1 1 1 1 1 1 1 1 ...
 $ veil_color              : Factor w/ 4 levels "brown","orange",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ ring_number             : Factor w/ 3 levels "none","one","two": 2 2 2 2 2 2 2 2 2 2 ...
 $ ring_type               : Factor w/ 5 levels "evanescent","flaring",..: 5 5 5 5 1 5 5 5 5 5 ...
 $ spore_print_color       : Factor w/ 9 levels "black","brown",..: 1 2 2 1 2 1 1 2 1 1 ...
 $ population              : Factor w/ 6 levels "abundant","clustered",..: 4 3 3 4 1 3 3 4 5 4 ...
 $ habitat                 : Factor w/ 7 levels "grasses","leaves",..: 5 1 3 5 1 1 3 3 1 3 ...

 

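I will use the Naive Bayes classification algorithm, for which the categorical columns should be of type Factor. Because the file was read with stringsAsFactors enabled, str() shows that all 23 columns are already factors, so no further conversion is needed here.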
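If the file had been read without converting strings to factors, the conversion could be done afterwards. The following is a minimal sketch on a small made-up data frame (toy and its values are hypothetical, not the mushroom data):

```r
# Hypothetical toy data standing in for a CSV read with stringsAsFactors = FALSE
toy <- data.frame(
  type = c("edible", "poisonous", "edible"),
  odor = c("almond", "foul", "none"),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor, leaving other columns untouched
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], factor)

str(toy)
```

With the real data, the same two lines could be applied to mushroom instead of toy.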

 

 

Checking for missing values

# Count missing values (NA) per column
colSums(is.na(mushroom))
                    type                cap_shape              cap_surface                cap_color                  bruises 
                       0                        0                        0                        0                        0 
                    odor          gill_attachment             gill_spacing                gill_size               gill_color 
                       0                        0                        0                        0                        0 
             stalk_shape               stalk_root stalk_surface_above_ring stalk_surface_below_ring   stalk_color_above_ring 
                       0                        0                        0                        0                        0 
  stalk_color_below_ring                veil_type               veil_color              ring_number                ring_type 
                       0                        0                        0                        0                        0 
       spore_print_color               population                  habitat 
                       0                        0                        0 
                       
 dim(mushroom)
[1] 8124   23

 

colSums(is.na(...)) returns the number of NA values in every column at once. Every count is 0, so the dataset contains no NA values.
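Note that is.na() only counts true NA values. A dataset can also encode missing information as an ordinary category; the model output further down lists a level named "missing" for stalk_root, for example. A small sketch of how such a level could be counted, using made-up values (toy_root is hypothetical):

```r
# Hypothetical factor in which "missing" is an ordinary level, not NA
toy_root <- factor(c("bulbous", "missing", "club", "missing", "equal"))

sum(is.na(toy_root))          # NA count: the "missing" level is not counted here
table(toy_root)               # frequency of each level, including "missing"
mean(toy_root == "missing")   # share of rows carrying the "missing" level
```

On the real data, table(mushroom$stalk_root) would show how many rows fall into that level.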

 

 

 

Train/test split

# Shuffle the data: fix the seed, then compute the training set size (75%)
set.seed(123)
train_cnt <- round(0.75*dim(mushroom)[1])
train_cnt
[1] 6093

# Draw 6093 random row indices without replacement
train_index <- sample(1:dim(mushroom)[1], train_cnt, replace=F)
train_index
   [1] 2463 2511 2227  526 4291 2986 1842 1142 3371 5349 5364 5134 3446 4761 6746 1627 7936 2757 5107 5211  953 4444 1017 7817 2013 5475
  [27] 2888 6170 2567 1450 5769 1790 4307 2980 1614 6737  555 5991 4469 6988 1167 2592 2538 7789 1799  905 7081 5962 1047 7067 3004 4405
  [53] 3207 7989 3995 5344  166  217 1314 2629 6216  588 1599 4237 4818 3937 4089 2907 4249  294  277 5583   41 6575 6234  316 7391 6672
  [79] 7284 7774 2822 2795 2504 6742 3926 7207 1183  752 3281  374 8118 6129 2082 4612 4109 2117 6134 6015  755 6553 5428 7446 5209 7072
 [105] 1006 2585 7127 2339 1448 3952 3358 3980 4767 6265 1134 3230 5184 5603 1934 1501 4576 3783 6211 7831 7158 1914 5967 1109 4261 7816
 [131] 1075 3146 7346 1386 2284 4706 2378 6870 4223 4044 2260  686 3857 6078 6958 7478 5027 7912 7022 6387  847 7960 7281  983 4715 6023
 [157] 4573 6095  151 6810 1638 6911 2208 7559 6299 2474 1029  326 3856 7448 2837 7735 1956 5358 5884 4093 5459  985 6183 6966  986 4233
 [183] 2503 1762 1584 4685 7251 7864 4084 5999 2087 2244 1793 4776 7454 5726 6644 7947 1808  344 7979 2507 2992 3092 7016  195 5981 3236
 [209] 3124 4972 6678 2225 4650 2875 2132 5643 6777 4023 3464 1326 6741 3949 4802 2667 7757 3502 2758 3833  712 1452 4768 5370 1828 3501
 [235] 3069 2446 3061  528 4055 6525 1569 6184 6056 6666  473 6344 6098 1149 2037 2313 2823 1927 6330 7741 3324  458 3224 4927 1078 5015
 [261] 5658 6379 1313  185  413 4723 2570 1333 7222 6349 5995 4875 1561  564  794 1415 8025 3799 1370 4256 6612 3581 2968 3129 6601 4713
 [287]  357  279 4366 7976 6790 6491 3201 2266  618 1905 6842  337 6941 2074  539 5077 7687  956 5877 5786 3625 6868 3462 5618 7390 6815
 [313] 7005 6801 3008 1445 5177 6804 2211 2286 5793 2626 7684 7988 7393 5509 4213 7705 4744 1079 1241 7904 2605 1706 6559 7302 7101  873
 [339] 6832  757 6061  988 2495 7272 2869 2879 4807 5588 1234 4445 1425 3809 7426 7913 4482 1165 2072 7089 3071 6083 7634 5250 2213 5823
 [365] 1612 5497 3853 6314 5565  710 1258 2470 2556 8015 2112 6224 8007 6692 3035 6619 8017 6909 7511 1347 6623 7728 7618 7279  791 1987
 [391] 5001 1341 4903 5342 5407 5854 3657 4388 5346 5910 1708 4393 4956 4701 6781 1835 3111 1261 6309 7970 1057   83 3468  866 2325 7377
 [417] 5119 4172 5214 7198 7385 7090 7343 6518 2371 2163 7759 6994 2656 7633 5478  682 8016 7082 3168 1562 6541 7572 3814 7316  989 4052
 [443] 1961 6536 4007 5834 2641 5567 1362 6888 4339 2154 3959 6155 4721 3436  386 2451 4557  141   31 6046 5235 7725 4190 3786 4112 3781
 [469] 7588  178 3489 3536 5532 6321 4620 5557 5020 5698  204 3247 4019 2421 7834 7632 2432 5242 7567 3475 2363  259 4590 5985 5880 5409
 [495] 7216 1355 7268 3064 5228 1325 4106 4266 3352 7516 4071 2450 1233 3540 1851 1673 7477 3467 4204 4104 4722 4357 2589 7474 6470 5194
 [521] 2330 1609 3339 6007 5382 1760 7646 5843 5339 6840 3424  667 5086 5239 4548  856 1948 5699 2670 1960  646 1060 3127 3562 1264 4987
 [547] 1656 1886 1328 8108 3082  665 3494 2035 5732 3177 4889 5508 8123 1204 4374 7541  241 4120 6823 1727 2085 2734 6710 5423 1743 3984
 [573] 6163 5791  378  549 5735 1268 2096 7356 4030 6608 4489 3211 8116 1395 4766 5309  311 6333 1442 5502 2086 1108  319 6830 2894 1862
 [599] 1666  120 2760 7499 1557 6585 7367 3571  599 8064 5435 6858 6821 6225 5175  134 4520 2804 6150 3200  879  668 5920   49 5596 3311
 [625]  193 8107 6603 4399 7685 7042 2238 3696  191 6590 6263 3699 5747 7227 4913 3975 2109 1446 3180 1316 1397 5655 7699 2978 5720 3716
 [651] 5046 5533 4794 8018 5883 5323  758 5089 3318 1464 2995 6834  251 2608  643 2593  990 2210 5442 2624 6312 3514 6932 6222 7573 1517
 [677] 7520 5544  445 3043   95 3990 7653  379 1366 2269 7595 7329  620  448  242 7071 2862 6088 7070  407 6373 7392 2833 1498 2747 1465
 [703] 3243 3095 4314 3556 7796   79 1531 7917 6381 4675 1831 2977 4518 4589 4826 1820 2257 4695 3765  877 2406 5161  904 7297 4944 7030
 [729] 4546 1256 1358 2444 3802 1864 2687 4137 3336 3769  201   52 5345 2115  680 3842  577 4553 1015 7873 4045 5661 5140 1230 2172 2640
 [755] 3847 2788 6189 3353 2139 6797 2004 7306  606 6271 3080  839 6137 6739 8043 3239 4595 7518 1658  931 2950   71 5892 4042 4177 4016
 [781] 6920 7915 4992 3509 5870 7141  270 1436 5766 6281 2721 4724 5166 7750 2579 7428 5730 7399 3294  836 7416  602 7290  331 4997 6646
 [807] 5907  291 4001 4839 4925  919 3241 2777 1471 6136 1365 8029  504 3148 1026 2934  786 1796 7274 3183  855 4470 1096  449 7032 3019
 [833] 3301 1547 4431 3001 3128 7830 6415 4303  436 2924 3049 4029  601 7481  387 3335 3140 3192 2792 6709 5477 2840  742 7325 1859 4619
 [859] 4682  256 3421 7320 4184 5536 5977 2059  586 4559 3827 2748 6986 4781 3899 6424 3584  803 2290  778 8008 1352 6316 5418 7962 1703
 [885] 1553 7844  929 2789 6090 2413 1994 6018 2639 3543 3951 2024 5347 2882 5742 1339 4015 2291 3337  535 7568  982 2160 4552 3165  789
 [911] 6775 7546 3017 4578 1272 2153 1195 4792  459 6660 2887 5432 5682 2940  139 1782  481  843 1852 4346 7686 5717 6474 4729 1650 8120
 [937] 5219 1122 1082 7447 7132 1549 6152 2306 6779 6406 6743 7240 2976 4302 4295 5709 5960 3942 7206 7754 7177 3911 7795 6124 1916 5651
 [963] 2801 5608 1295 4014 4494 5338 5275 7738 3617 6962 2834 4756  347 7993 1589 3903 3401 8012 7948 6199 5367 1394 7702 1068  860 5496
 [989] 5028 7531 2583 3626 1733 2242 2508 5950 6720 3149 3691 5094

 
#train/test split
mushroom_train <- mushroom[train_index, ]
mushroom_test <- mushroom[-train_index, ]

nrow(mushroom_train)
[1] 6093
nrow(mushroom_test)
[1] 2031

str(mushroom_train)
'data.frame':	6093 obs. of  23 variables:
 $ type                    : Factor w/ 2 levels "edible","poisonous": 1 1 1 1 2 1 1 1 1 2 ...
 $ cap_shape               : Factor w/ 6 levels "bell","conical",..: 3 4 3 1 4 4 4 3 4 3 ...
 $ cap_surface             : Factor w/ 4 levels "fibrous","grooves",..: 3 1 1 4 3 3 1 1 3 4 ...
 $ cap_color               : Factor w/ 10 levels "brown","buff",..: 4 8 8 10 4 4 9 1 1 2 ...
 $ bruises                 : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 1 2 2 2 ...
 $ odor                    : Factor w/ 9 levels "almond","anise",..: 7 7 7 2 5 7 7 7 7 5 ...
 $ gill_attachment         : Factor w/ 2 levels "attached","free": 2 2 2 2 2 2 2 2 2 2 ...
 $ gill_spacing            : Factor w/ 2 levels "close","crowded": 1 1 1 1 1 1 2 1 1 1 ...
 $ gill_size               : Factor w/ 2 levels "broad","narrow": 1 1 1 1 1 1 1 1 1 1 ...
 $ gill_color              : Factor w/ 12 levels "black","brown",..: 2 2 2 11 4 11 8 8 11 4 ...
 $ stalk_shape             : Factor w/ 2 levels "enlarging","tapering": 2 2 2 1 1 2 2 2 2 2 ...
 $ stalk_root              : Factor w/ 5 levels "bulbous","club",..: 1 1 1 2 1 1 3 1 1 1 ...
 $ stalk_surface_above_ring: Factor w/ 4 levels "fibrous","scaly",..: 4 4 4 4 3 4 4 4 4 1 ...
 $ stalk_surface_below_ring: Factor w/ 4 levels "fibrous","scaly",..: 4 4 4 4 3 4 4 4 4 1 ...
 $ stalk_color_above_ring  : Factor w/ 9 levels "brown","buff",..: 4 8 6 8 1 4 8 8 8 8 ...
 $ stalk_color_below_ring  : Factor w/ 9 levels "brown","buff",..: 8 6 6 8 2 6 8 6 4 8 ...
 $ veil_type               : Factor w/ 1 level "partial": 1 1 1 1 1 1 1 1 1 1 ...
 $ veil_color              : Factor w/ 4 levels "brown","orange",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ ring_number             : Factor w/ 3 levels "none","one","two": 2 2 2 2 2 2 2 2 2 2 ...
 $ ring_type               : Factor w/ 5 levels "evanescent","flaring",..: 5 5 5 5 3 5 1 5 5 5 ...
 $ spore_print_color       : Factor w/ 9 levels "black","brown",..: 1 2 2 2 4 1 2 2 1 4 ...
 $ population              : Factor w/ 6 levels "abundant","clustered",..: 6 6 6 4 5 6 1 5 5 4 ...
 $ habitat                 : Factor w/ 7 levels "grasses","leaves",..: 7 7 7 1 1 7 1 7 7 5 ...

 

After shuffling the row indices, I split the data into a training set and a test set.

The training set has 6093 rows and 23 columns, and the remaining 2031 rows form the test set.
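After a random split it can be worth checking that both parts keep a similar class ratio. The sketch below shows the idea on a made-up label vector (toy_type and idx are hypothetical; the 75% ratio mirrors the split above):

```r
set.seed(123)
# Hypothetical labels: 60 edible and 40 poisonous
toy_type <- factor(rep(c("edible", "poisonous"), times = c(60, 40)))

# Same splitting pattern as above: 75% of the row indices go to training
n <- length(toy_type)
idx <- sample(seq_len(n), round(0.75 * n), replace = FALSE)
train_type <- toy_type[idx]
test_type <- toy_type[-idx]

# Class proportions in each part
prop.table(table(train_type))
prop.table(table(test_type))
```

With the real data, the same prop.table(table(...)) call could be applied to mushroom_train$type and mushroom_test$type.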

 

 

Prediction with the Naive Bayes classifier

# Naive Bayes algorithm
library(e1071)

model1 <- naiveBayes(type~., data=mushroom_train)
model1
Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
   edible poisonous 
0.5153455 0.4846545 

Conditional probabilities:
           cap_shape
Y                  bell     conical      convex        flat     knobbed      sunken
  edible    0.096815287 0.000000000 0.466242038 0.375796178 0.053821656 0.007324841
  poisonous 0.011175076 0.001015916 0.438537081 0.393498137 0.155773789 0.000000000

           cap_surface
Y                fibrous      grooves        scaly       smooth
  edible    0.3719745223 0.0000000000 0.3585987261 0.2694267516
  poisonous 0.1967490687 0.0006772773 0.4422621063 0.3603115476

           cap_color
Y                 brown        buff    cinnamon        gray       green        pink      purple         red       white      yellow
  edible    0.303503185 0.009554140 0.007961783 0.244904459 0.003184713 0.014012739 0.003184713 0.144267516 0.174203822 0.095222930
  poisonous 0.263122249 0.028445648 0.003047748 0.206569590 0.000000000 0.023366068 0.000000000 0.221131053 0.079918727 0.174398916

           bruises
Y                  no       yes
  edible    0.3493631 0.6506369
  poisonous 0.8435489 0.1564511

           odor
Y                almond       anise    creosote       fishy        foul       musty        none     pungent       spicy
  edible    0.095222930 0.095859873 0.000000000 0.000000000 0.000000000 0.000000000 0.808917197 0.000000000 0.000000000
  poisonous 0.000000000 0.000000000 0.049779885 0.143582797 0.550965120 0.009820522 0.029800203 0.065018625 0.151032848

           gill_attachment
Y              attached        free
  edible    0.044267516 0.955732484
  poisonous 0.004740941 0.995259059

           gill_spacing
Y                close    crowded
  edible    0.70700637 0.29299363
  poisonous 0.97155435 0.02844565

           gill_size
Y                broad     narrow
  edible    0.93184713 0.06815287
  poisonous 0.43244158 0.56755842

           gill_color
Y                 black       brown        buff   chocolate        gray       green      orange        pink      purple         red
  edible    0.086305732 0.221337580 0.000000000 0.047133758 0.056687898 0.000000000 0.015923567 0.198407643 0.106687898 0.023248408
  poisonous 0.017609211 0.027429732 0.441584829 0.130037250 0.136471385 0.006095496 0.000000000 0.163901118 0.011852354 0.000000000
           gill_color
Y                 white      yellow
  edible    0.229617834 0.014649682
  poisonous 0.058923129 0.006095496

           stalk_shape
Y           enlarging  tapering
  edible    0.3834395 0.6165605
  poisonous 0.4869624 0.5130376

           stalk_root
Y              bulbous       club      equal    missing     rooted
  edible    0.45382166 0.12101911 0.20859873 0.16942675 0.04713376
  poisonous 0.47375550 0.01185235 0.06501863 0.44937352 0.00000000

           stalk_surface_above_ring
Y               fibrous       scaly       silky      smooth
  edible    0.096178344 0.002547771 0.036624204 0.864649682
  poisonous 0.034879783 0.002031832 0.574331189 0.388757196

           stalk_surface_below_ring
Y              fibrous      scaly      silky     smooth
  edible    0.10859873 0.04968153 0.03503185 0.80668790
  poisonous 0.03555706 0.01964104 0.55265831 0.39214358

           stalk_color_above_ring
Y                 brown        buff    cinnamon        gray      orange        pink         red       white      yellow
  edible    0.002547771 0.000000000 0.000000000 0.135031847 0.044267516 0.134713376 0.020700637 0.662738854 0.000000000
  poisonous 0.111412123 0.108703014 0.009820522 0.000000000 0.000000000 0.334913647 0.000000000 0.433118862 0.002031832

           stalk_color_below_ring
Y                 brown        buff    cinnamon        gray      orange        pink         red       white      yellow
  edible    0.015286624 0.000000000 0.000000000 0.137579618 0.044267516 0.134713376 0.021974522 0.646178344 0.000000000
  poisonous 0.115814426 0.105993905 0.009820522 0.000000000 0.000000000 0.329156790 0.000000000 0.433118862 0.006095496

           veil_type
Y           partial
  edible          1
  poisonous       1

           veil_color
Y                 brown      orange       white      yellow
  edible    0.019745223 0.024522293 0.955732484 0.000000000
  poisonous 0.000000000 0.000000000 0.997968168 0.002031832

           ring_number
Y                  none         one         two
  edible    0.000000000 0.874203822 0.125796178
  poisonous 0.009820522 0.971554352 0.018625127

           ring_type
Y            evanescent     flaring       large        none     pendant
  edible    0.242675159 0.009872611 0.000000000 0.000000000 0.747452229
  poisonous 0.451405350 0.000000000 0.332543176 0.009820522 0.206230952

           spore_print_color
Y                 black       brown        buff   chocolate       green      orange      purple       white      yellow
  edible    0.389808917 0.415923567 0.012101911 0.009872611 0.000000000 0.011783439 0.011146497 0.138535032 0.010828025
  poisonous 0.058584490 0.056214020 0.000000000 0.403995936 0.018625127 0.000000000 0.000000000 0.462580427 0.000000000

           population
Y             abundant  clustered   numerous  scattered    several   solitary
  edible    0.09331210 0.06401274 0.09745223 0.21146497 0.28375796 0.25000000
  poisonous 0.00000000 0.01320691 0.00000000 0.09414155 0.72536404 0.16728750

           habitat
Y               grasses      leaves     meadows       paths       urban       waste       woods
  edible    0.341082803 0.057006369 0.060191083 0.033757962 0.022611465 0.043949045 0.441401274
  poisonous 0.187267186 0.153064680 0.008465967 0.257704030 0.070436844 0.000000000 0.323061294


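# Predict on the test set using the predictor columns only (drop column 1, the label).
# Note: the table shown below was produced by passing mushroom_test[ , 1] (the label
# column alone) as new data. In that case no predictor column is available to the model,
# which most likely explains why every row is classified as "edible", the class with the
# larger prior.
result1 <- predict(model1, mushroom_test[ , -1])
result1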

library(gmodels)
CrossTable(mushroom_test[ ,1], result1)

 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  2031 

 
                   | result1 
mushroom_test[, 1] |    edible | Row Total | 
-------------------|-----------|-----------|
            edible |      1068 |      1068 | 
                   |     0.526 |           | 
-------------------|-----------|-----------|
         poisonous |       963 |       963 | 
                   |     0.474 |           | 
-------------------|-----------|-----------|
      Column Total |      2031 |      2031 | 
-------------------|-----------|-----------|
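CrossTable() prints counts and proportions; a single accuracy figure can be derived from the same kind of table as the share of cases on the diagonal. A minimal base-R sketch with made-up vectors (actual and predicted are hypothetical, not results from the mushroom model):

```r
# Hypothetical true and predicted labels sharing the same factor levels
lv <- c("edible", "poisonous")
actual    <- factor(c("edible", "edible", "poisonous", "poisonous", "edible"), levels = lv)
predicted <- factor(c("edible", "poisonous", "poisonous", "poisonous", "edible"), levels = lv)

# Confusion matrix: rows are true labels, columns are predictions
cm <- table(actual, predicted)
cm

# Accuracy: correctly classified cases (the diagonal) over all cases
acc <- sum(diag(cm)) / sum(cm)
acc
```

Fixing the levels of both factors keeps the table square even when one class is never predicted, which is what makes the diagonal meaningful.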

 

 

 

Comparing accuracy

# Compare accuracy across laplace values
library(plotly)

laplace_num <- (1:10) * 0.001
accuracy <- numeric(length(laplace_num))
for (i in seq_along(laplace_num)) {
  # Refit the model with the current laplace smoothing value
  model2 <- naiveBayes(type ~ ., data = mushroom_train, laplace = laplace_num[i])
  result2 <- predict(model2, mushroom_test[ , -1])
  # Accuracy = share of test rows whose predicted label matches the true label
  accuracy[i] <- mean(result2 == mushroom_test[ , 1])
}

result <- data.frame(laplace = laplace_num, accuracy = accuracy)
plot_ly(result, x = ~laplace, y = ~accuracy, type = 'scatter', mode = 'lines') %>%
  layout(xaxis = list(title = "laplace value"), yaxis = list(title = "accuracy"))
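The laplace argument adds a small constant to every count before the conditional probabilities are computed, so that a category never seen within a class does not get a probability of exactly 0 (the model output above contains many such zeros). The sketch below reproduces that idea by hand on made-up data. The formula (count + laplace) / (class total + laplace * number of levels) is my understanding of how e1071 smooths categorical predictors and should be treated as an assumption; toy_y, toy_x and smooth_tab are hypothetical names.

```r
# Hypothetical class labels and one categorical predictor
toy_y <- factor(c("edible", "edible", "edible", "poisonous", "poisonous"))
toy_x <- factor(c("none", "almond", "none", "foul", "foul"),
                levels = c("almond", "foul", "none"))

# Smoothed conditional probabilities P(x | y) for a given laplace value
smooth_tab <- function(y, x, laplace = 0) {
  tab <- table(y, x)
  (tab + laplace) / (rowSums(tab) + laplace * nlevels(x))
}

p0 <- smooth_tab(toy_y, toy_x)               # unsmoothed: contains exact zeros
p1 <- smooth_tab(toy_y, toy_x, laplace = 1)  # smoothed: every cell is positive
p0
p1
```

Each row still sums to 1 after smoothing, and larger laplace values pull the probabilities closer to a uniform distribution over the levels.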
