Outlier Detection Toolbox in MATLAB


As part of my M.Sc. thesis, I have been developing an outlier detection toolbox in MATLAB.
The implemented methods are:

  • ActiveOutlier
  • Local Outlier Factor
  • Feature Bagging
  • Parzen Windows
  • Decision Tree

I will be providing more detail on the algorithms in a later post.
You can get the source code from my Bitbucket account here; it includes a script that shows how to run the algorithms. I have also written a short document that gives more detail and explains the data set format. You can get the document from here.

Here is a link to my thesis if anyone is interested.

45 comments

  1. Hi, it looks like a very good work! Great!
    My question about LOF: the division into a train set and a test set makes it a bit unclear how the computations are done. What if you want to compute the LOF of each point in a data set? Should you pass the same set as both train and test input?
    If yes, the resulting values and even the ranking are a bit different from those provided by ELKI (a Java application for outlier detection by LMU Munich). How did you verify your program?
    Lastly, you mention some other auxiliary functions in your comments (e.g. ReadDataset), are you planning to make them public as well?
    Thank you! B.

    1. Hi,
      If you’d like to get LOF values for each point in a data set, you can give the same data as training and test inputs to the method, as you said. I’m not familiar with the ELKI tool, so I cannot comment on why it gives different results. I implemented LOF from the paper in which it was first proposed. I’d recommend checking that you give the same minPts and maxPts parameters to LOF.
      As for the ReadDataset function, I originally chose not to include data sets and related functions in the toolbox, but as you’ve said, it’s a little hard to use the code without them, so I’ve now included the data set related functions and a sample data set in the zip file. If you download it again, you should be able to see those under the datasets folder.

      Thanks for the comment. I’d be glad to answer any questions you have.

      1. Thank you! However, I think it’s not correct to use the same train and test set, because then each point is its own nearest neighbor and the resulting scores are different.
        (Anyway, it appears to me that the zip-file didn’t change..)
        Regards,
        B.

      2. Hi again,
        It’s true that the neighbors of a test point will include the point itself if you give the same train and test inputs, but that is not a problem. Individual scores may change, but the overall rankings should not differ.
        For the zip file, I guess there is a problem with it, you can get the new one from http://www.gokererdogan.com/files/outlierDetection/od2.zip
        Let me know if you still can’t get the file.
        Regards,

      3. The implementation in ELKI seems to be by the authors that published LOF in the first place. If your results differ, there must be something wrong.

      4. I haven’t checked whether my algorithm produces exactly the same values as the one in ELKI. But I tested my code thoroughly during implementation, and I’m able to reach the accuracies reported on different data sets by various authors.

      5. Hi,
        Please follow the link in my post to the bitbucket repository for the MATLAB code.

        goker

  2. As you mention above, is it OK to find the LOF of the training data by giving the same train and test input?
    I guess the zero distance may have some effect, because the neighbors of a test point will include the point itself.

    1. If it really bothers you, you may modify the code to exclude the point itself when the training and test sets are the same. However, it won’t affect the results much, and what matters most is that the outlier rankings of the points won’t change.
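    A minimal sketch of that modification, assuming the same matrix is used as training and test set. It is not taken from the toolbox: the variable names (X, k, knnIdx, knnDist) and the toy data are made up for illustration, and only base MATLAB functions are used.

```matlab
% Sketch: k nearest neighbours within one data set, skipping the point itself.
% X, k, knnIdx and knnDist are illustrative names, not toolbox identifiers.
X = [0 0; 0 1; 1 0; 1 1; 5 5];   % toy data, one row per sample
k = 2;                           % number of neighbours to keep
n = size(X, 1);

% Pairwise Euclidean distances (max(..., 0) guards against tiny negative
% values caused by rounding before the square root is taken).
sq = sum(X.^2, 2);
D = sqrt(max(bsxfun(@plus, sq, sq') - 2 * (X * X'), 0));

% Put Inf on the diagonal so that a point can never select itself.
D(1:n+1:end) = Inf;

[sortedD, order] = sort(D, 2);   % ascending distances, row by row
knnIdx  = order(:, 1:k);         % indices of the k nearest other points
knnDist = sortedD(:, 1:k);       % the corresponding distances
```

    With the diagonal masked, the zero self-distance no longer enters the k-distance or reachability computations.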

      1. I just started learning LOF, so I want to be clear about how the algorithm works. Sometimes it is easier to understand an algorithm from code than from a paper.

        Thanks a lot for your reply.

  3. Hey, I am trying to run localOutlierFactor. As parameters for this function, I have to pass a dataset and other parameters. Could you please give me an example of how to run it? It is a little unclear to me.

    1. Hi, I’m sorry that I haven’t been able to reply earlier. I have been very busy with my thesis. I have updated the code with many more comments, and I also provide a script file that shows how to run the methods. I have also written a short document that gives details of the data set format. Please check the post; I have updated it.

  4. Hi,
    I am new to MATLAB. Please guide me on how I can run the LOF algorithm after downloading your zip folder. I want to run it and check the results to see how it works. Which file do I use for running it? Thanks.

    1. Hi, I’m sorry I haven’t been able to reply earlier. Please check the post, I have updated the toolbox and included a small manual document.

  5. Hello Goker,
    you do not use Euclidean distance, that is why your scores are different from ELKI! I don’t know if it is on purpose because you did not point it out anywhere. I guess it is rather a bug because it makes the ranking different. Adding square root to all distance computations fixes it.

    1. Hi Barbora,
      Thanks for pointing it out. I really never noticed it. I don’t know how much it will affect the rankings, but it’s better to fix it as you said. I will fix it when I have some time.
      Thx again.
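      A small illustration of why the missing square root can matter. Squaring keeps the order of individual distances, so each point's nearest neighbors are unchanged, but LOF averages distances, and averages of squared distances need not keep the order of averages of plain distances. The values below are made-up toy numbers, not taken from the toolbox or from ELKI.

```matlab
% Sketch: the order of mean distances can flip when distances are squared.
a = [1 5];       % toy distances from point A to its two neighbours
b = [3.5 3.5];   % toy distances from point B to its two neighbours

meanA   = mean(a);      % mean Euclidean distance for A
meanB   = mean(b);      % mean Euclidean distance for B
meanAsq = mean(a.^2);   % same quantity if the square root is left out
meanBsq = mean(b.^2);

% With Euclidean distances A looks denser than B (smaller mean distance);
% with squared distances the comparison goes the other way.
euclidSaysADenser  = meanA < meanB;
squaredSaysADenser = meanAsq < meanBsq;
```

      Here euclidSaysADenser and squaredSaysADenser disagree, which is the kind of effect that can reorder density-based scores.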

  6. Hi, I’d like to cite this toolbox in a document I am writing. Can you give me a bibtex entry for it? Preferably, it should be a formal document, such as your MSc thesis. Thanks!

    1. Hi Gaurav,
      I developed the toolbox for my M.S. thesis but my thesis is not published. You can use the following information to cite it:

      Author: Goker Erdogan
      Title: Spectral Methods for Outlier Detection in Machine Learning
      Year: 2012
      Institution: Bogazici University

    1. Hi Matthias,
      There seems to be an ongoing problem with my hosting account. I’ve updated the links in the post, you should be able to download them now. Thanks for pointing it out.

  7. Hi, thanks a lot for your work. I have some doubts about how to test the LOF function on my data. I have a matrix with 3700 instances and 63 features, and I’m not sure how to set the upper and lower limits for the number of neighbors and the theta value.

    1. Hi Samuel,
      The optimal neighbor count and theta values will depend on your data. The best way is to choose a set of values for these parameters, run the method for each of them, then compare the performances and choose the parameter values that give you the best performance. Note that you do not need to do a parameter search for theta (since it is just a threshold parameter for converting the outputs of the method to labels). You can measure the performance without the output labels, using the LOF values directly. Just use the LOF values given by the method to calculate the area under the ROC curve (which you can calculate using the croc and auroc functions in the toolbox). Hope this helps.
      Goker,
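      For readers who want to see what such a threshold-free evaluation computes, here is a minimal sketch in base MATLAB. It does not use the toolbox's croc and auroc functions (their signatures are not shown in this post); instead it computes the AUC directly as the fraction of (outlier, normal) pairs in which the outlier gets the higher score, counting ties as half. The scores and labels are hypothetical.

```matlab
% Sketch: AUC computed directly from scores, without choosing theta.
% scores and labels are hypothetical; 1 = normal, 2 = outlier.
scores = [1.0 1.1 0.9 2.5 1.2 1.05]';
labels = [1   1   1   2   1   2   ]';

pos = scores(labels == 2);   % scores of the true outliers
neg = scores(labels == 1);   % scores of the normal samples

% Every (outlier, normal) pair: positive difference = correctly ordered.
diffs = bsxfun(@minus, pos, neg');
auc = (sum(diffs(:) > 0) + 0.5 * sum(diffs(:) == 0)) / numel(diffs);
```

      An AUC near 0.5 means the scores order outliers and normal samples no better than chance.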

      1. Thank you for your help. I’m getting results like AUC = 0.55. These aren’t really good, I guess. I’m using 10/100 neighbors and theta equal to 2, with the LOF method.

      2. I got a matrix of y with half the size of my samples. How do I know which are outliers? Sorry, but I’m kind of confused.

      3. Hi Samuel,
        y should include one row for each sample. I’m not sure what is going on in your case.
        The LOF method calculates a LOF value for each sample and stores these in yprob. Any sample with a LOF value higher than the theta value you provided is then deemed an outlier, so the right theta value will change from data set to data set. The usual way to determine theta is to find the value that gives the best performance (predicts outliers best) on a training data set. Of course, this requires you to know beforehand which samples in your data are outliers. As far as I understand, this is your problem: you don’t have any labeled data that you can use for adjusting the value of theta. In that case, unfortunately, you don’t have much choice. You can assume that, say, 10% of your data are outliers, and find the theta value that labels the samples with LOF values in the top 10% as outliers.
        Hope this helps,
        Goker
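        The last suggestion could be sketched as follows. This is an illustration only: the LOF values are made up, the names lof, fraction, and nOut are not toolbox identifiers, and the threshold is placed halfway between the last score counted as an outlier and the first score counted as normal.

```matlab
% Sketch: pick theta so that an assumed fraction of samples is labelled
% as outliers. All names and values here are hypothetical.
lof = [1.0 1.1 0.9 2.5 1.2 3.0 1.05 0.95 1.3 1.15]';   % made-up LOF values
fraction = 0.10;                                       % assumed outlier rate

nOut = max(1, round(fraction * numel(lof)));   % number of samples to flag
sortedLof = sort(lof, 'descend');

% Threshold between the nOut-th and (nOut+1)-th largest scores
% (this requires nOut < numel(lof)).
theta = (sortedLof(nOut) + sortedLof(nOut + 1)) / 2;

y = ones(size(lof));    % label 1 = normal
y(lof > theta) = 2;     % label 2 = outlier
```

        The 10% figure is an assumption about the data, so the resulting labels are only as good as that guess.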

      4. I understand that. I have two classes, 1 and 2, which are normal. I want to get the outliers that do not seem to fit in either of them, but the y result only gives me labels 1 and 2.

        These labels are from the normal classes. And how come there are only half as many values in y? Hope you can help me.

      5. Hi Samuel,
        All the methods in the toolbox assume that there are only two classes: normal and outlier. The normal class has label 1 and the outlier class has label 2. You should collect all the samples in the normal classes into a single class and label them as 1. Similarly, label the outliers (if you know them) as 2.
        Samples with label 2 in the output y vector you get from LOF are the outliers in your dataset. However, remember that the labels are determined using theta, so you should first find a good theta value for your dataset. You can do this by running some kind of parameter search, as I’ve mentioned before.
        About the problem with y being smaller than you expect, I’m not sure what is happening. y should contain the same number of rows as the testx matrix in your dataset variable. I suggest checking the example code (od_script in the demo folder, for example) to see if you’re calling the LOF function correctly.
        Goker
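        A minimal sketch of the relabeling step described above. The original labels and the name normalClasses are hypothetical; here class 3 stands for samples already known to be outliers, which may not exist in your data.

```matlab
% Sketch: collapse several normal classes into label 1, everything else
% into label 2. origLabels and normalClasses are hypothetical names.
origLabels    = [1 2 1 3 2 3 1]';   % toy labels; 3 marks known outliers
normalClasses = [1 2];              % classes that count as normal

y = 2 * ones(size(origLabels));               % start with everything as outlier
y(ismember(origLabels, normalClasses)) = 1;   % normal classes become label 1
```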

  8. Hi,
    I am using your toolbox. When I run “Exp1_EvalDecisionTree” and other experiments, the following error is displayed:
    “??? Error: File: RunExperiment.m Line: 85 Column: 15
    Expression or statement is incorrect–possibly unbalanced (, {, or [.”
    May I know how to fix it?

    1. Hi Jamali,
      I’ve updated the code, please download the zip file again. Those experiment scripts were not supposed to be there. Please look at od.m or od_script.m to figure out how to run the methods. You can get more information on using the toolbox from the manual I’ve provided in the post.
      Thanks,
      goker

    1. Hi Mirco,
      I’m sorry but the zip file was not up-to-date. I’ve updated it, please download it again. You will see the demo script.
      Cheers,
      goker

      1. I ran the od_script file and got the following error:
        ??? Undefined function or method 'graphconncomp' for input arguments of type 'double'.

        Error in ==> LaplacianEigenmap at 31
        cc = graphconncomp(w);

        May I know how to fix this problem?

  9. Hi,
    I am using the outlier detection toolbox. When I run ‘ActiveOutlier’ in od_script, the following error is displayed:
    ??? Error: File: ActiveOutlierTrain.m Line: 114 Column: 16
    Expression or statement is incorrect–possibly unbalanced (, {, or [.

    Error in ==> RunExperiment at 85
    model = method.trainFcn(dataset, method.trainParams);

    Error in ==> od_script at 106
    results = RunExperiment(method, ds, expParams);

    I also tried to run od.m, but I get this error when I run LOF and other experiments:

    ??? Error: File: ActiveOutlierTrain.m Line: 114 Column: 16
    Expression or statement is incorrect–possibly unbalanced (, {, or [.

    Error in ==> RunExperiment at 85
    model = method.trainFcn(dataset, method.trainParams);

    Error in ==> od_runexperiment at 46
    results = RunExperiment(method, ds, expParams);

    Error in ==> od>btnRun_Callback at 400
    [testResults bestParam] = od_runexperiment(ds, method, expParams, dims);

    Error in ==> gui_mainfcn at 96
    feval(varargin{:});

    Error in ==> od at 31
    gui_mainfcn(gui_State, varargin{:});

    Error in ==>
    @(hObject,eventdata)od('btnRun_Callback',hObject,eventdata,guidata(hObject))

    ??? Error while evaluating uicontrol Callback

    >>
    Am I missing something? Thanks for your help.

    1. Hi Tala,
      I think you are using an older version of MATLAB. The tilde (~) operator for ignoring function outputs was introduced in R2009b. That is why you get that error. You can get around it by simply replacing that tilde with a dummy variable name.
      Cheers,
      Goker
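      As an illustration of that workaround (the line below is a generic example, not the actual line 114 of ActiveOutlierTrain.m, which is not shown in this post):

```matlab
% Sketch: ignoring a function output in MATLAB releases before R2009b.
x = [3 1 2];

% R2009b and later accept:   [~, idx] = sort(x);
% Older releases need a placeholder variable instead of the tilde:
[dummy, idx] = sort(x);   % dummy receives the sorted values, unused here
clear dummy               % optional: remove the placeholder afterwards
```

      The only cost of the dummy variable is the extra memory for an output that is never used.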

      1. Hi Goker,

        Thank you very much for your reply. Right, this is because of using the older version of MATLAB. By the way, do you have any publication for your thesis or can we access your thesis?

        Best Regards,
        Tala

      2. Hi Tala,
        I did not publish my thesis, but I’ve added a link to my thesis in the post.

        Cheers,
        Goker

  10. Thank you very much, Goker, for providing a link. It is helpful! I have a question about choosing the threshold value theta in LOF. Some papers consider a range from 0.1 to 0.9 in steps of 0.1. What range do you suggest for tuning the LOF threshold? Is it possible for the threshold parameter to be more than 1?

    Thanks for your help,
    Tala

    1. Tala,
      To be honest, I don’t really remember. It’s been a while since I worked on these algorithms. I’d suggest looking at the parameters I’ve used in the code and mentioned in my thesis.

      Goker

  11. Dear Goker, thank you for your help, but I was unable to read datasets in the outlier detection toolbox. When I browse to the dataset path and click run, the following error message pops up:
    ” Undefined function or method ‘ReadDataset’ for input arguments of type ‘char’.
    Error in ==> od_script at 25
    ds = ReadDataset(dsPath);”
    Please help me out.

    1. Hi Sam,
      It seems like the datasets folder is not on your MATLAB path. Can you make sure that all folders under odtoolbox are added to the MATLAB path? You can do this by right-clicking on the odtoolbox folder in MATLAB and selecting “Add to Path”.

      goker
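      The same thing can be done from the command line. This is a sketch that assumes the toolbox was unzipped into a folder named odtoolbox inside the current directory; adjust toolboxRoot to wherever it actually lives.

```matlab
% Sketch: add the toolbox folder and all of its subfolders to the path.
% toolboxRoot is an assumed location, not something fixed by the toolbox.
toolboxRoot = fullfile(pwd, 'odtoolbox');

if exist(toolboxRoot, 'dir') == 7
    addpath(genpath(toolboxRoot));   % genpath lists the folder recursively
end

% exist returns 2 when ReadDataset is found as a file on the path.
found = (exist('ReadDataset', 'file') == 2);
```

      The path change lasts for the current session only unless it is saved, for example with savepath.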
