As a part of my M.Sc. thesis I have been developing an outlier detection toolbox in MATLAB.

The implemented methods are:

- ActiveOutlier
- Local Outlier Factor
- Feature Bagging
- Parzen Windows
- Decision Tree

I will be providing more detail on the algorithms in a later post.

You can get the source code from my Bitbucket account here; it includes a script that shows how to run the algorithms. I have also written a short document that gives more detail and explains the data set format. You can get the document from here.

Here is a link to my thesis if anyone is interested.

Hi, this looks like very good work! Great!

My question about LOF: the division into train set and test set makes it a bit unclear how the computations are done. What if you want to compute the LOF of each point in a data set? Should you pass the same set as both the train and the test input?

If yes, the resulting values and even the ranking are a bit different from those provided by ELKI (a Java application for outlier detection by LMU Munich). How did you verify your program?

Lastly, you mention some other auxiliary functions in your comments (e.g. ReadDataset), are you planning to make them public as well?

Thank you! B.

Hi,

If you’d like to get LOF values for each point in a data set, you can give the same data as training and test inputs to the method, as you said. I’m not familiar with the ELKI tool, so I cannot comment on why it gives different results. I have implemented LOF from the paper in which it was first proposed. I’d recommend checking that you give the same minPts and maxPts parameters to LOF.

As for the ReadDataset function, I originally chose not to include data sets and related functions in the toolbox, but as you’ve said it’s a little hard to use the code without them, so I’ve now included the data set related functions and a sample data set in the zip file. If you download it again, you should be able to see them under the datasets folder.

Thanks for the comment. I’d be glad to answer any questions you have.

Thank you! However, I think it’s not correct to pass the same train and test set, because then each point is its own nearest neighbor and the resulting scores are different.

(Anyway, it appears to me that the zip-file didn’t change..)

Regards,

B.

Hi again,

It’s true that the neighbors of a test point will include the point itself if you give the same train and test inputs, but that is not a problem. Individual scores may change, but the overall rankings should not differ.

As for the zip file, I guess there is a problem with it. You can get the new one from http://www.gokererdogan.com/files/outlierDetection/od2.zip

Let me know if you still can’t get the file.

Regards,

The implementation in ELKI seems to be by the authors that published LOF in the first place. If your results differ, there must be something wrong.

I haven’t checked whether my algorithm produces exactly the same values as the one in ELKI. But I tested my code thoroughly during implementation, and I’m able to reach the accuracies reported on different data sets by various authors.

hi

Sir, please provide me the LOF code in C or MATLAB.

Hi,

Please follow the link in my post to the bitbucket repository for the MATLAB code.

goker

Hi Barbora, I have been able to run this toolbox, but I am having difficulty understanding the results and outputs, with so many MATLAB matrix files and image files. Can you help me understand what the results of the toolbox mean? Thanks in advance.

As you mention above, is it ok to find the LOF of training data by giving the same train and test input?

I guess the 0 value may have some effect, because the neighbors of a test point will include the point itself.

If it really bothers you, you may modify the code to exclude the point itself when the training and test sets are the same. However, it won’t affect the results much, and what matters most is that the outlier rankings of the points won’t change.
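To make the self-inclusion question concrete, here is a minimal sketch of LOF computed for test points with respect to a training set. It is written in Python rather than the toolbox's MATLAB and is not the toolbox's code. The function name `lof_scores` and the `exclude_self` flag are hypothetical. The sketch assumes Euclidean distance, exactly k neighbors per point (ties broken by index, instead of the full k-distance neighborhood of the original paper), and, when `exclude_self` is set, that test point i is the same sample as training point i.

```python
import math


def lof_scores(train, test, k, exclude_self=False):
    """LOF of each test point with respect to the training points (sketch)."""

    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def neighbors(point, skip=None):
        # (distance, training index) pairs of the k nearest training points.
        cands = [(dist(point, t), j) for j, t in enumerate(train) if j != skip]
        return sorted(cands)[:k]

    # Neighborhoods of the training points, needed for k-distance and density.
    train_nn = [neighbors(t, skip=j if exclude_self else None)
                for j, t in enumerate(train)]
    k_dist = [nn[-1][0] for nn in train_nn]

    def lrd(nn):
        # Local reachability density: inverse of the mean reachability distance.
        # Duplicate points can make this sum zero; this sketch does not handle that.
        reach = [max(k_dist[j], d) for d, j in nn]
        return len(reach) / sum(reach)

    train_lrd = [lrd(nn) for nn in train_nn]

    scores = []
    for i, p in enumerate(test):
        nn = neighbors(p, skip=i if exclude_self else None)
        scores.append(sum(train_lrd[j] for _, j in nn) / (len(nn) * lrd(nn)))
    return scores


# Toy data: five clustered points and one isolated point.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (5, 5)]
with_self = lof_scores(data, data, k=3)
without_self = lof_scores(data, data, k=3, exclude_self=True)
```

On this toy data the isolated point (5, 5) receives the largest score in both modes, while the individual values differ. That is a single example, not a general guarantee about rankings.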

I just started learning LOF, so I want to be clear about how the algorithm works. Sometimes it is easier to understand an algorithm from code than from a paper.

Thank you very much for your reply.

Hey, I am trying to run localOutlierFactor. As parameters for this function, I have to pass a dataset and other parameters. Could you please give me an example of how to run it? It is a little bit unclear to me.

Hi, I’m sorry that I haven’t been able to reply earlier. I have been very busy with my thesis. I have updated the code with many more comments, and I also provide a script file that shows how to run the methods. I have also written a short document that gives details of the data set format. Please check the post; I have updated it.

hi,

I am new to MATLAB. Please guide me on how I can run the LOF algorithm after downloading your zip file. I want to run it and check the results to see how it works. Which file do I use for running it? Thanks.

Hi, I’m sorry I haven’t been able to reply earlier. Please check the post, I have updated the toolbox and included a small manual document.

Hello Goker,

you do not use Euclidean distance, that is why your scores are different from ELKI! I don’t know if it is on purpose because you did not point it out anywhere. I guess it is rather a bug because it makes the ranking different. Adding square root to all distance computations fixes it.
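The effect of dropping the square root can be illustrated with a small Python sketch (not the toolbox's MATLAB code). Squaring is monotonic, so the nearest neighbors stay the same, but LOF is built from ratios of distances, and those ratios change. The score below is a simplified LOF-like ratio without reachability distances; it and all function names are assumptions made for illustration only.

```python
import math


def sq_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def euclidean(a, b):
    return math.sqrt(sq_euclidean(a, b))


def knn(data, i, k, metric):
    """Indices of the k nearest other points to data[i] (ties broken by index)."""
    others = [j for j in range(len(data)) if j != i]
    return sorted(others, key=lambda j: (metric(data[i], data[j]), j))[:k]


def mean_knn_distance(data, i, k, metric):
    return sum(metric(data[i], data[j]) for j in knn(data, i, k, metric)) / k


def density_ratio(data, i, k, metric):
    """Simplified LOF-like score: own mean k-NN distance over the neighbors' average."""
    nbrs = knn(data, i, k, metric)
    nbr_avg = sum(mean_knn_distance(data, j, k, metric) for j in nbrs) / k
    return mean_knn_distance(data, i, k, metric) / nbr_avg


data = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (5, 5)]
same_neighbors = all(
    knn(data, i, 3, euclidean) == knn(data, i, 3, sq_euclidean)
    for i in range(len(data))
)
ratio_euclidean = density_ratio(data, 5, 3, euclidean)
ratio_squared = density_ratio(data, 5, 3, sq_euclidean)
```

Here `same_neighbors` is true, yet the two ratios for the isolated point differ substantially, which is consistent with scores that do not match a Euclidean implementation.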

Hi Barbora,

Thanks for pointing it out. I really never noticed it. I don’t know how much it will affect the rankings, but it’s better to fix it as you said. I will fix it when I have some time.

Thx again.

Hi, I’d like to cite this toolbox in a document I am writing. Can you give me a bibtex entry for it? Preferably, it should be a formal document, such as your MSc thesis. Thanks!

Hi Gaurav,

I developed the toolbox for my M.S. thesis but my thesis is not published. You can use the following information to cite it:

Author: Goker Erdogan

Title: Spectral Methods for Outlier Detection in Machine Learning

Year: 2012

Institution: Bogazici University

Hi,

the download link is dead.

Hi Matthias,

There seems to be an ongoing problem with my hosting account. I’ve updated the links in the post, you should be able to download them now. Thanks for pointing it out.

Hi, thanks a lot for your work. I have some doubts about how to test the LOF function on my data. I have a matrix with 3700 instances and 63 features, and I’m not sure how to set the upper and lower limits for the neighbors and the theta value.

Hi Samuel,

The optimal neighbor count and theta values will depend on your data. The best way is to choose a set of values for these parameters, run the method for each of them, compare the performances, and choose the parameter values that give you the best performance. Note that you do not need to do a parameter search for theta, since it is just a threshold parameter for converting the outputs of the method to labels. You can measure the performance without the output labels, using the LOF values directly. Just use the LOF values given by the method to calculate the area under the ROC curve (which you can calculate using the croc and auroc functions in the toolbox). Hope this helps.
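As an illustration of scoring without a threshold, the area under the ROC curve can be computed directly from the raw scores as the fraction of (outlier, normal) pairs in which the outlier has the higher score. The Python sketch below is independent of the toolbox's croc and auroc functions; the function name `auc_from_scores` is hypothetical, and the label convention (1 for normal, 2 for outlier) follows the toolbox's.

```python
def auc_from_scores(scores, labels, outlier_label=2):
    """AUC as the probability that a random outlier outscores a random normal sample."""
    pos = [s for s, y in zip(scores, labels) if y == outlier_label]
    neg = [s for s, y in zip(scores, labels) if y != outlier_label]
    if not pos or not neg:
        raise ValueError("need at least one outlier and one normal sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical LOF values: the single outlier has the highest score.
auc = auc_from_scores([0.9, 1.0, 1.1, 3.0], [1, 1, 1, 2])
```

To compare parameter settings, one would compute this value once per neighbor count and keep the setting with the highest AUC. The pairwise loop is quadratic, which is fine for a few thousand samples.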

Goker,

thank you for your help. I’m getting results like auc = 0.55. These aren’t really good, I guess. I’m using 10/100 neighbors and theta equal to 2, with the LOF method.

I got a matrix of y with half the size of my samples. How do I know which are outliers? Sorry, but I’m kind of confused.

Hi Samuel,

y should include one row for each sample. I’m not sure what is going on in your case.

The LOF method calculates an LOF value for each sample and stores these in yprob. Any sample with an LOF value higher than the theta value you provided is then deemed an outlier, so the right theta value will change from data set to data set. The usual way to determine it is to find the theta value that gives the best performance (predicts outliers best) on a training data set. Of course, this requires you to know beforehand which samples in your data are outliers.

As far as I understand, this is your problem: you don’t have any labeled data that you can use for adjusting the value of theta. In that case, unfortunately, you don’t have much choice. You can assume that, say, 10% of your data are outliers and find the theta value that labels the top 10% of LOF values as outliers.

Hope this helps,

Goker
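The "assume a fraction of outliers" approach can be sketched as follows, in Python rather than the toolbox's MATLAB. Both function names are hypothetical. The sketch places theta halfway between the lowest score that should be labeled an outlier and the highest score that should not, then applies the same rule as described above: scores above theta get label 2, all others label 1.

```python
def theta_for_fraction(scores, fraction):
    """Threshold that labels roughly the top `fraction` of scores as outliers."""
    if len(scores) < 2:
        raise ValueError("need at least two scores")
    ranked = sorted(scores, reverse=True)
    # At least one outlier, and at least one sample left as normal.
    n_out = min(max(1, round(fraction * len(ranked))), len(ranked) - 1)
    return (ranked[n_out - 1] + ranked[n_out]) / 2.0


def label_outliers(scores, theta):
    """1 = normal, 2 = outlier, following the toolbox's label convention."""
    return [2 if s > theta else 1 for s in scores]


# Hypothetical LOF values for ten samples, assuming 10% outliers.
scores = [1.0, 1.1, 0.9, 1.05, 0.95, 1.2, 1.0, 0.98, 1.02, 4.0]
theta = theta_for_fraction(scores, 0.1)
labels = label_outliers(scores, theta)
```

If several samples share the score at the cut, fewer samples than requested end up labeled as outliers, because tied scores are never split.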

I understand that. I have two classes, 1 and 2, which are normal. I want to find the outliers that do not seem to fit in either of them, but the y result only gives me labels 1 and 2.

These labels are from the normal classes. And how come there are only half as many values in y? Hope you can help me.

Hi Samuel,

All the methods in the toolbox assume that there are only two classes: normal and outlier. The normal class has label 1 and the outlier class has label 2. You should collect all the samples in the normal classes into a single class and label them as 1. Similarly, label the outliers (if you know them) as 2.
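The relabeling step might look like the following Python sketch (the toolbox itself is MATLAB, and the function name `to_binary_labels` is hypothetical). It maps every original class listed as normal to 1 and every class listed as outlier to 2, and refuses labels that appear in neither list. The class values in the usage line are made up for illustration.

```python
def to_binary_labels(labels, normal_classes, outlier_classes):
    """Map original class labels to 1 (normal) or 2 (outlier)."""
    normal, outlier = set(normal_classes), set(outlier_classes)
    out = []
    for y in labels:
        if y in normal:
            out.append(1)
        elif y in outlier:
            out.append(2)
        else:
            raise ValueError(f"label {y!r} is in neither class list")
    return out


# Two normal classes (1 and 2) and a hypothetical known-outlier class (3).
binary = to_binary_labels([1, 2, 3, 1], normal_classes=[1, 2], outlier_classes=[3])
```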

Samples with label 2 in the output y vector you get from LOF are the outliers in your data set. However, remember that the labels are determined using theta, so you should first find a good theta value for your data set. You can do this by running some kind of parameter search, as I’ve mentioned before.

About the problem with y being smaller than you expect, I’m not sure what is happening. y should contain the same number of rows as the testx matrix in your dataset variable. I suggest checking the example code (od_script in the demo folder, for example) to see if you’re calling the LOF function correctly.

Goker

Hi,

I am using your toolbox. When I run “Exp1_EvalDecisionTree” and other experiments, the following error is displayed:

“??? Error: File: RunExperiment.m Line: 85 Column: 15

Expression or statement is incorrect–possibly unbalanced (, {, or [.”

May I know how to fix it?

Hi Jamali,

I’ve updated the code, please download the zip file again. Those experiment scripts were not supposed to be there. Please look at od.m or od_script.m to figure out how to run the methods. You can get more information on using the toolbox from the manual I’ve provided in the post.

Thanks,

goker

Hi, I have downloaded the zip file od2 and the documentation but I don’t find the file demo.m, could you help me? Thanks.

Mirco

Hi Mirco,

I’m sorry but the zip file was not up-to-date. I’ve updated it, please download it again. You will see the demo script.

Cheers,

goker

I ran the od_script file and got the following error.

??? Undefined function or method ‘graphconncomp’ for input arguments of type ‘double’.

Error in ==> LaplacianEigenmap at 31

cc = graphconncomp(w);

May I know how to fix this problem?

Hi jamali,

graphconncomp is a function from the Bioinformatics Toolbox that calculates the connected components of a given graph. If you don’t have the Bioinformatics Toolbox, you can substitute the code here for it: http://stackoverflow.com/questions/16883367/how-to-find-connected-components-in-matlab

cheers,

Goker
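For readers who want to see what such a substitute has to compute, here is a minimal breadth-first sketch in Python (not MATLAB, and not the code from the linked answer). It assumes the matrix w is a square affinity matrix in which a nonzero entry means an edge, and it treats the graph as undirected. The function name `connected_components` is hypothetical. The call `cc = graphconncomp(w)` in LaplacianEigenmap uses only the first output, the number of components, which corresponds to the first element returned here.

```python
from collections import deque


def connected_components(w):
    """Return (number of components, component label per node) for matrix w."""
    n = len(w)
    labels = [0] * n  # 0 means "not visited yet"
    count = 0
    for start in range(n):
        if labels[start]:
            continue
        count += 1
        labels[start] = count
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in range(n):
                # Treat the graph as undirected: an edge in either direction counts.
                if labels[v] == 0 and (w[u][v] != 0 or w[v][u] != 0):
                    labels[v] = count
                    queue.append(v)
    return count, labels


# Two separate pairs of connected nodes.
w = [[0, 1, 0, 0],
     [1, 0, 0, 0],
     [0, 0, 0, 2],
     [0, 0, 2, 0]]
cc, node_labels = connected_components(w)
```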

hi goker,

please provide me the lof code

Hi,

Please follow the link in my post to the bitbucket repository for the code.

goker

Hi,

I am using the outlier detection toolbox. When I run ‘ActiveOutlier’ in od_script, the following error is displayed:

??? Error: File: ActiveOutlierTrain.m Line: 114 Column: 16

Expression or statement is incorrect–possibly unbalanced (, {, or [.

Error in ==> RunExperiment at 85

model = method.trainFcn(dataset, method.trainParams);

Error in ==> od_script at 106

results = RunExperiment(method, ds, expParams);

I also tried to run od.m, but I get this error when I run LOF and other experiments:

??? Error: File: ActiveOutlierTrain.m Line: 114 Column: 16

Expression or statement is incorrect–possibly unbalanced (, {, or [.

Error in ==> RunExperiment at 85

model = method.trainFcn(dataset, method.trainParams);

Error in ==> od_runexperiment at 46

results = RunExperiment(method, ds, expParams);

Error in ==> od>btnRun_Callback at 400

[testResults bestParam] = od_runexperiment(ds, method, expParams, dims);

Error in ==> gui_mainfcn at 96

feval(varargin{:});

Error in ==> od at 31

gui_mainfcn(gui_State, varargin{:});

Error in ==>

@(hObject,eventdata)od(‘btnRun_Callback’,hObject,eventdata,guidata(hObject))

??? Error while evaluating uicontrol Callback

>>

Am I missing something? Thanks for your help.

Hi Tala,

I think you are using an older version of MATLAB. The tilde (~) operator for ignoring function outputs was introduced in R2009b, which is why you get that error. You can get around it by simply replacing the tilde with a dummy variable name, for example writing [dummy, idx] = sort(x) instead of [~, idx] = sort(x).

Cheers,

Goker

Hi Goker,

Thank you very much for your reply. Right, this is because of using the older version of MATLAB. By the way, do you have any publication for your thesis or can we access your thesis?

Best Regards,

Tala

Hi Tala,

I did not publish my thesis, but I’ve added a link to my thesis in the post.

Cheers,

Goker

Thank you very much, Goker, for providing a link. It is helpful! I have a question about choosing the threshold value theta in LOF. Some papers consider a range from 0.1 to 0.9 in steps of 0.1. What range do you suggest for tuning the LOF threshold? Can the threshold parameter be greater than 1?

Thanks for your help,

Tala

Tala,

To be honest, I don’t really remember. It’s been a while since I worked on these algorithms. I’d suggest looking at the parameters I’ve used in the code and mentioned in my thesis.

Goker

Dear Goker,

Thank you very much for your help.

Best,

Tala

Dear Goker, thank you for your help, but I was unable to read data sets in the outlier detection toolbox. When I browse to the data set path and click run, the following error message pops up:

” Undefined function or method ‘ReadDataset’ for input arguments of type ‘char’.

Error in ==> od_script at 25

ds = ReadDataset(dsPath);”

Please help me out.

Hi Sam,

It seems like the datasets folder is not in your MATLAB path. Can you make sure that all folders under odtoolbox are added to the MATLAB path? You can do this by right-clicking on the odtoolbox folder in MATLAB and selecting “Add to Path”.

goker

Hello Goker, first of all let me congratulate you on creating this excellent tool. I have to present a seminar on Local Outlier Factor, so I was searching for a suitable practical implementation and came across this. I am trying to run the Local Outlier Factor through the demo, but I have no idea how to provide the 5 files that you mention in the supporting document. You have mentioned what these files are for, but it is still unclear to me, sorry. Can you explain in more detail how to make sample data sets, or give an example that would make it clearer?

Thanks a lot

Sohan

Hi Sohan,

Thanks for your kind words. Have you checked the data sets under the datasets folder? The file structure you need is also explained in the ReadDataset function. You basically need to provide a MATLAB struct with the fields tvx (training/validation input), tvy (training/validation labels), tsx (test input), tsy (test labels), normalClass (normal class labels) and outlierClass (outlier class labels). The sample script od_script.m in the demo folder should also be useful.

Goker
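As a rough picture of that layout, the Python sketch below builds a dictionary with the same six field names and checks that inputs and labels line up. The toolbox itself expects a MATLAB struct, and the exact shapes and types it requires are documented in ReadDataset, so everything beyond the field names here (the function `make_dataset`, the consistency checks, the default class values) is an assumption for illustration.

```python
def make_dataset(tvx, tvy, tsx, tsy, normal_class=1, outlier_class=2):
    """Bundle samples and labels under the field names the toolbox uses."""
    if len(tvx) != len(tvy):
        raise ValueError("tvx and tvy must have the same number of rows")
    if len(tsx) != len(tsy):
        raise ValueError("tsx and tsy must have the same number of rows")
    widths = {len(row) for row in tvx} | {len(row) for row in tsx}
    if len(widths) != 1:
        raise ValueError("all samples must have the same number of features")
    return {
        "tvx": tvx,                     # training/validation inputs, one row per sample
        "tvy": tvy,                     # training/validation labels
        "tsx": tsx,                     # test inputs
        "tsy": tsy,                     # test labels
        "normalClass": normal_class,    # label of the normal class
        "outlierClass": outlier_class,  # label of the outlier class
    }


# Two normal training samples and one outlying test sample (made-up values).
ds = make_dataset(
    tvx=[[0.0, 1.0], [1.0, 0.0]], tvy=[1, 1],
    tsx=[[5.0, 5.0]], tsy=[2],
)
```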

Hi Goker,

I tried running it, but now I am getting the same error that Sam was getting earlier, i.e. “Undefined function or method ‘ReadDataset’ for input arguments of type ‘char’”. I am quite sure that odtoolbox is on the MATLAB path, because when I type od at the MATLAB command line the GUI comes up. I am not sure what the problem is here. Can you help me with this, please? I am kind of lost.

Are you sure that the subfolders of the odtoolbox folder are also in the path? The ReadDataset file is in a subfolder, so make sure that subfolder is added to the path as well.

goker

Hi Goker, I am a Chinese student. As far as I know, the LOF algorithm is an unsupervised learning algorithm. What do you mean by train set and test set? Thank you very much!

Hi kai li,

You are right, LOF is unsupervised. You do not need to provide the trainy or testy matrices (i.e., class labels). You need training and test samples (trainx and testx) because the method calculates LOF values for points in the test set with respect to the points in the training set. If you want to calculate the LOF values for a single data set, you can provide the same data as trainx and testx.

Goker

Can I apply this tool to any data set (like the Iris data set from the UCI Machine Learning Repository)?

Yes, nagendra, you just need to get the data set into the format the toolbox expects. Look at the example data sets provided with the toolbox.

goker