Summary: Unless you're involved in anomaly detection you may never have heard of unsupervised decision trees. It's a very interesting approach to decision trees that on the surface doesn't sound possible, but in practice is the backbone of modern intrusion detection.

I was at a presentation recently that focused on stream processing, but the use case presented was about anomaly detection. When they started talking about unsupervised decision trees my antenna went up. What do you mean, unsupervised decision trees? What would they split on? It turns out that if you're in the anomaly detection world, unsupervised decision trees are pretty common. Since I'm not in that world, and I suspect few of us are, I thought I'd share what I found.

A few weeks back we featured techniques for dealing with imbalanced data sets. We observed that there's no standard for when the imbalance becomes so great that it ought to be considered an anomaly. Like great art, you'll know it when you see it. The anomalies we're going to use as illustrations here are about intrusion detection. Although intrusions may occur many times during a day, they are extremely rare compared to the torrent of data being produced by regular web logs and system packets. On a more human scale, anomalies could be events of credit card fraud or the detection of cancer among normal CT scans. If you think your use case is on the borderline between just being sparse and being an anomaly, do what we all do: try it both ways and see what works best.

In anomaly detection we are attempting to identify items or events that don't match the expected pattern in the data set and are by definition rare. The traditional 'signature-based' approach widely used in intrusion detection systems creates training data that can be used by normal supervised techniques: when an attack is detected, the associated traffic pattern is recorded, marked, and classified as an intrusion by humans. That data, combined with normal data, creates the supervised training set. In both the supervised and unsupervised cases, decision trees, now in the form of random forests, are the weapon of choice.

Decision trees are nonparametric: they don't make an assumption about the distribution of the data. They're great at combining numeric and categorical features, and they handle missing data like a champ. All types of anomaly data tend to be highly dimensional, and decision trees can take it all in and offer a reasonably clear guide for pruning back to just what's important.

To be complete, there is also a category of semi-supervised anomaly detection in which the training data consists only of normal transactions, without any anomalies. This is also known as 'one-class classification' and uses one-class SVMs or autoencoders in a slightly different way, not discussed here.

The fact is that supervised methods continue to be somewhat more accurate than unsupervised ones in intrusion detection, but they are completely unable to identify new zero-day attacks, which are perhaps an even more serious threat. The concept of unsupervised decision trees is only slightly misleading: an unsupervised clustering algorithm creates the first guess about what's good and what's bad, and the decision tree then splits on those cluster labels.

Step 1: Run a clustering algorithm on your data. Pretty much all the clustering techniques have been tried, and good old k-means still seems to work best. It might be tempting to set K=2, but it would be unrealistic to expect a good result given the different types of intrusion that may be present. Practical guidance is to set K at a minimum of 10 and experiment with values up to 50. Since the data is unlabeled, there is no objective function to determine an optimum clustering. It's not covered in the literature, but this might be a good point to try SMOTE (Synthetic Minority Oversampling Technique), discussed in the earlier article on imbalanced data sets, since its objective is to clarify the boundaries among clusters.

Step 2: Having recorded the cluster labels on your data, run the decision tree in the normal manner.
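The two-step procedure can be sketched in a few lines of scikit-learn. This is a minimal illustration, not a production intrusion detector: the random feature matrix stands in for unlabeled traffic data, and K=10 and the tree depth are the kind of starting values suggested above, not tuned settings.

```python
# Sketch of the "unsupervised decision tree": cluster first, then fit a
# tree on the cluster ids as pseudo-labels. Data and parameters are
# illustrative assumptions, not from the original presentation.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))  # stand-in for unlabeled traffic features

# Step 1: cluster the unlabeled data; cluster ids become pseudo-labels.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
pseudo_labels = kmeans.fit_predict(X)

# Step 2: run the decision tree in the normal manner on those labels.
tree = DecisionTreeClassifier(max_depth=5, random_state=0)
tree.fit(X, pseudo_labels)

# The tree gives an interpretable approximation of the clustering; small
# or low-agreement clusters become candidates for anomalous traffic.
agreement = (tree.predict(X) == pseudo_labels).mean()
print(f"tree/cluster agreement: {agreement:.2f}")
```

The payoff of the second step is the tree itself: its splits give human-readable rules for each cluster, which is exactly the "clear guide for pruning back to what's important" that plain clustering does not provide.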
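For comparison, the one-class (semi-supervised) setup mentioned in passing can also be sketched briefly: train only on normal traffic, then flag anything that deviates. Again the synthetic data and the `nu`/`gamma` settings are assumptions made for illustration.

```python
# Minimal one-class classification sketch: fit on normal data only,
# then predict on new traffic (+1 = normal, -1 = anomaly).
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_normal = rng.normal(size=(300, 4))            # "normal" traffic only
X_new = np.vstack([rng.normal(size=(20, 4)),    # more normal-looking points
                   rng.normal(loc=6.0, size=(5, 4))])  # obvious outliers

ocsvm = OneClassSVM(nu=0.05, gamma="scale").fit(X_normal)
pred = ocsvm.predict(X_new)
print("flagged as anomalous:", int((pred == -1).sum()))
```

Unlike the clustering-plus-tree approach, this needs a clean sample of normal behavior up front, which is why it is usually treated as a separate, semi-supervised category.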