BufferedNeuralDataSet Error

nnuser's picture

When training a network using BufferedNeuralDataSet on Ubuntu 10.04 Server, Encog threw an exception: org.encog.neural.data.NeuralDataError: java.io.FileNotFoundException: exampletest_NORMALIZED_TRAINING.bin (Too many open files). I added a few lines to show where the error occurs, and it seems to be caused by BufferedNeuralDataSetIterator as it creates random access files on the normalized training file. In my case the problem occurs right after it creates 996 random access files from the bin file. This is caused by the limit on the number of open files on Linux machines. Setting a higher limit using "ulimit -n xxxx" solves the problem at times. However, it would be wiser to open and close those files as needed rather than keep opening them until the exception is thrown, since the number of random access files created varies with the data size. The bin file that I used is only 36K.
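The open-and-close approach suggested above could look roughly like the following. This is only a minimal sketch in plain Java, not Encog code: the class ScopedHandleExample and its methods are hypothetical names, and the file is assumed to hold nothing but raw 8-byte doubles.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical helper, not part of Encog: each pass over the file opens
// one handle and releases it before returning.
final class ScopedHandleExample {

    // Sums every double in the file. try-with-resources closes the
    // handle even if a read fails, so repeated passes do not pile up
    // open file descriptors.
    static double sumDoubles(File file) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            long count = raf.length() / Double.BYTES;
            double sum = 0.0;
            for (long i = 0; i < count; i++) {
                sum += raf.readDouble();
            }
            return sum;
        }
    }

    // Writes the given values as raw doubles to a temporary file
    // (only here so the sketch can be tried on its own).
    static File writeTemp(double... values) throws IOException {
        File file = File.createTempFile("scoped-handle", ".bin");
        file.deleteOnExit();
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            for (double v : values) {
                raf.writeDouble(v);
            }
        }
        return file;
    }
}
```

Because every pass closes its handle, calling sumDoubles thousands of times in a row should stay well under the per-process open file limit.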

A different problem is with EncogUtility. The convertCSV2Binary method in EncogUtility calls buffer.beginLoad(50,6), which sets NetworkInputLayerSize to 50 and NetworkOutputLayerSize to 6. If you provide a CSV file with a different input and output size, the training process (usually GA and Simulated Annealing) throws an exception complaining about the mismatch between the expected and actual network input and output layer sizes. A possible solution is to use CSVNeuralDataSet to load the data and train the network. However, with CSVNeuralDataSet, Genetic Algorithm training always throws a NumberFormatException. The same algorithm works when the file is opened as a BufferedNeuralDataSet on a Windows machine.

Encog is a good project and I would love to see it progress further. My suggestion is that, before introducing more features, the Encog core team should focus on making it stable and removing bugs.

SeemaSingh's picture

I actually found, and checked in fixes for, both of those issues. I was in the process of extending the workbench to make use of buffered files and noticed the EncogUtility issue. The files being left open were causing another issue.

It is always a balance between new features and fixing bugs. The problem with bugs is we don't always know about them. :) So bug reports are very valuable.

jeffheaton's picture

Thanks for the bug report. I looked at your changes, Seema, and they are very much a change in the right direction. The dynamic way you get the input and ideal sizes for the EncogUtility CSV-to-binary conversion should work out okay.
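One way the input and ideal sizes could be derived from the file instead of being hardcoded as 50 and 6 is sketched below. This is not the checked-in fix, whose details are not shown in this thread: CsvShapeExample is a hypothetical helper, and it assumes a comma-separated file with no header row in which the ideal (output) columns come last.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Encog.
final class CsvShapeExample {

    // Returns {inputCount, idealCount} based on the first row of the CSV.
    // Assumption: the caller knows how many trailing columns are ideal values.
    static int[] shape(Path csv, int idealCount) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(csv, StandardCharsets.UTF_8)) {
            String first = in.readLine();
            if (first == null) {
                throw new IOException("empty CSV file: " + csv);
            }
            // Limit -1 keeps trailing empty cells so the column count is exact.
            int columns = first.split(",", -1).length;
            if (idealCount < 0 || idealCount >= columns) {
                throw new IllegalArgumentException(
                        "idealCount must be in [0, " + (columns - 1) + "]");
            }
            return new int[] { columns - idealCount, idealCount };
        }
    }
}
```

The two returned values are the numbers that would be passed in place of the fixed 50 and 6.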

I want to try the buffered dataset on GAs a bit more. I can see how the heavily threaded GA might wreak some havoc on the way that the buffered dataset manages its file handles. The fix Seema made was good, because it did seem to be leaving files open.

As to bugs, we do try to balance between features and bug fixes. We fixed all of the listed issues on Google Code with 2.4, though I need to update that list. I will be testing the buffered training set quite a bit soon as we expand the workbench, because this becomes a very important feature for the workbench.

I also want to get the unit test coverage expanded. Right now it is at about 50%, which is great and does a good job of keeping changes from breaking things, but it really needs to go higher.

nnuser's picture

The same error occurs with BufferedNeuralDataSet when PruneIncremental is used to prune a feedforward network. The error is (Too many open files) for BufferedNeuralDataSet. I am using the fixed version from Cruise Control.
It also causes java.io.EOFException. I have tried BufferedNeuralDataSet with multiple training methods (feedforward, resilient, GA, Simulated Annealing, etc.) and most of them fail on Ubuntu 10.04 with exceptions such as too many open files, EOF reached, etc.
To me, it seems like Encog has too many bugs in how it handles data, and I would really like to see its data handling improved and made robust. After all, neural networks are all about data. I haven't tried every data handling feature in Encog, but I have tried a few so far and all of them are buggy in one way or another. For example, CSVNeuralDataSet fails with GA because GA can't parse a number in the CSV file, and BufferedNeuralDataSet fails with various exceptions.

Next time, I will try XMLNeuralDataSet and post any errors I find in it. For now, I suggest that the Encog team focus on getting CSVNeuralDataSet fixed before anything else, as most data is in CSV format. As for BufferedNeuralDataSet, I think a hard limit on the number of random access files that can be opened (say around 250) could solve the problem. Alternatively, rather than opening multiple random access files, couldn't it open the file once and keep a marker at its current position (maybe a synchronized marker)?
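The single-handle idea with a synchronized marker might be sketched like this. It is only an illustration in plain Java, not Encog's design: SharedRecordFile and Cursor are hypothetical names, and the file is assumed to contain fixed-size records of raw doubles.

```java
import java.io.Closeable;
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical class, not part of Encog: one shared file handle,
// any number of lightweight cursors that each remember a position.
final class SharedRecordFile implements Closeable {
    private final RandomAccessFile raf;
    private final int recordSize; // number of doubles per record

    SharedRecordFile(File file, int recordSize) throws IOException {
        this.raf = new RandomAccessFile(file, "r");
        this.recordSize = recordSize;
    }

    synchronized long recordCount() throws IOException {
        return raf.length() / ((long) recordSize * Double.BYTES);
    }

    // Seek and read happen under one lock, so threads cannot
    // move the shared file pointer out from under each other.
    synchronized double[] read(long index) throws IOException {
        raf.seek(index * recordSize * Double.BYTES);
        double[] record = new double[recordSize];
        for (int i = 0; i < recordSize; i++) {
            record[i] = raf.readDouble();
        }
        return record;
    }

    // A cursor holds only a record index, never its own file handle.
    Cursor cursor() {
        return new Cursor();
    }

    final class Cursor {
        private long position = 0;

        boolean hasNext() throws IOException {
            return position < recordCount();
        }

        double[] next() throws IOException {
            return read(position++);
        }
    }

    @Override
    public synchronized void close() throws IOException {
        raf.close();
    }
}
```

With this layout the number of open files stays at one no matter how many cursors exist, at the cost of serializing reads through a single lock.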

jeffheaton's picture

I will take a look at the open file issue. I can see where that could cause a problem with GA/Prune since it creates quite a few instances.

Thanks for the report, it is quite helpful.

nnuser's picture

I tried using CSVNeuralDataSet instead of BufferedNeuralDataSet to prune the network and got the following exception:

Exception in thread "pool-1-thread-14" org.encog.util.csv.CSVError: java.lang.NumberFormatException: For input string: ".7252508127677139747
at org.encog.util.csv.CSVFormat.parse(Unknown Source)
at org.encog.util.csv.ReadCSV.getDouble(Unknown Source)
at org.encog.neural.data.csv.CSVNeuralDataSet$CSVNeuralIterator.next(Unknown Source)
at org.encog.neural.data.csv.CSVNeuralDataSet$CSVNeuralIterator.next(Unknown Source)
at org.encog.neural.networks.training.propagation.gradient.GradientUtil.calculate(Unknown Source)
at org.encog.neural.networks.training.propagation.gradient.GradientWorker.run(Unknown Source)
at org.encog.neural.networks.training.propagation.gradient.CalculateGradient.runWorkersSingleThreaded(Unknown Source)
at org.encog.neural.networks.training.propagation.gradient.CalculateGradient.calculate(Unknown Source)
at org.encog.neural.networks.training.propagation.Propagation.iteration(Unknown Source)
at org.encog.neural.prune.PruneIncremental.performJobUnit(Unknown Source)
at org.encog.util.concurrency.job.JobUnitWorker.run(Unknown Source)
at org.encog.util.concurrency.PoolItem.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)

The data was normalized using DataNormalization. All of the columns use a range of -0.9 to 0.9. Is this really caused by a number that can't be parsed, or is it something else? If it really is a number format problem, shouldn't OutputFieldRangeMapped format the data so that any class across Encog can read it?

jeffheaton's picture

As to the problem with BufferedNeuralDataSet, it is that a new file handle is created for each iterator. This will be an easy enough fix. This does not seem to cause an issue on Windows, which is probably why it has not been run into until now. Further, the propagation trainers do not use iterators, and they seem to be the most commonly used trainers, which is another reason this went undetected. I will make a correction to BufferedNeuralDataSet to fix that issue.

As to the number format error, that is strange, because the number above looks correct. I will see if I can reproduce it.
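As a quick check that the reported string is a valid number, the sketch below parses it with standard Java calls. The second method is speculative: the stack trace shows the failure on a pool thread, and java.text.NumberFormat instances are documented as not synchronized, so one guess is that a parser shared between threads is involved. Whether Encog's CSV parsing actually shares a NumberFormat is an assumption, not something confirmed in this thread, and ThreadSafeParseExample is a hypothetical class.

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

// Hypothetical helper, not part of Encog.
final class ThreadSafeParseExample {

    // One NumberFormat per thread. NumberFormat is not synchronized,
    // so sharing a single instance across threads is unsafe.
    private static final ThreadLocal<NumberFormat> FORMAT =
            ThreadLocal.withInitial(() -> NumberFormat.getInstance(Locale.US));

    // Plain parsing: a leading "." without a zero is accepted.
    static double parsePlain(String text) {
        return Double.parseDouble(text);
    }

    // Locale-aware parsing using the calling thread's own instance.
    static double parseWithFormat(String text) throws ParseException {
        return FORMAT.get().parse(text).doubleValue();
    }
}
```

Both methods accept ".7252508127677139747" when called from a single thread, which is consistent with the number itself being well formed.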

jeffheaton's picture

Okay, I can reproduce the GA issue, when using a buffered dataset. If I use my Mac, then I get the error. So the issue does seem to be UNIX specific. However, the way the iterator is handling files is clearly bad. Windows just compensates for it somehow. Though I suspect it would give Windows an issue if it trained long enough.

Now that I can reproduce it, I can test a fix for it. I will let you know when it's in place.

I will try to reproduce the CSV dataset error. Actually, I've been meaning to revamp the CSV dataset. The way it currently works is that it "reparses" the CSV file every iteration, which is slow! The better way is to copy the CSV file to a binary file for the buffered dataset to work with. Then it is quite fast. The CSV dataset is one of the oldest files in Encog, and is really not needed anymore. I will likely make it just a thin subclass of the buffered dataset and have it automatically convert the CSV file to a temp binary file, then delete the temp binary file when done.
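The parse-once idea can be illustrated with a small standalone sketch. This is not the planned Encog implementation: CsvToBinaryExample is a hypothetical class, and it assumes a comma-separated file of plain decimal numbers with no header row.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.BufferedReader;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Encog.
final class CsvToBinaryExample {

    // Parses the text once and writes every cell as a raw 8-byte double.
    // Returns the number of data rows converted.
    static int convert(Path csv, Path bin) throws IOException {
        int rows = 0;
        try (BufferedReader in = Files.newBufferedReader(csv, StandardCharsets.UTF_8);
             DataOutputStream out = new DataOutputStream(
                     new BufferedOutputStream(Files.newOutputStream(bin)))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // skip blank lines
                }
                for (String cell : line.split(",")) {
                    out.writeDouble(Double.parseDouble(cell.trim()));
                }
                rows++;
            }
        }
        return rows;
    }

    // Later passes read doubles directly, with no text parsing at all.
    static double[] readAll(Path bin) throws IOException {
        int count = (int) (Files.size(bin) / Double.BYTES);
        double[] values = new double[count];
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(bin)))) {
            for (int i = 0; i < count; i++) {
                values[i] = in.readDouble();
            }
        }
        return values;
    }
}
```

The conversion cost is paid once, and each training iteration afterwards only reads fixed-width binary values.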

Jeff

nnuser's picture

A different problem is with how GA handles exceptions. When trained with BufferedNeuralDataSet, GA seems to log the exception and never recover from it. It continuously logs exceptions until the process is killed. I ran a system (Ubuntu 10.04) for almost 7 hours using BufferedNeuralDataSet and GA, with the system max open file limit set to 65536 and both output and error sent to a file. GA was still logging exceptions after 7 hours and the file had grown to 43 GB. The same procedure with Simulated Annealing would quit the program, but with GA, my program never caught the exception. I am assuming that this is because GA logs the exception and never recovers from it.
The way it handles exceptions should be changed, as it will fill up the available disk space.

SeemaSingh's picture

We did something similar in the multiprop training. Namely, that if a thread throws an exception, the main thread throws the same exception, and training stops. I checked this change in earlier today.
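The behavior described here, where a worker's exception resurfaces on the main thread and stops the run, can be sketched with standard java.util.concurrent classes. This is an illustration only, not the code that was checked in: FailFastWorkersExample is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper, not part of Encog.
final class FailFastWorkersExample {

    // Runs all jobs on a pool. If any job throws, the same exception is
    // rethrown on the calling thread and remaining work is cancelled.
    static void runAll(List<Callable<Void>> jobs, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Void>> futures = new ArrayList<>();
            for (Callable<Void> job : jobs) {
                futures.add(pool.submit(job));
            }
            for (Future<Void> future : futures) {
                try {
                    future.get(); // blocks until this job finishes
                } catch (ExecutionException e) {
                    Throwable cause = e.getCause();
                    if (cause instanceof Exception) {
                        throw (Exception) cause; // surface the worker's own exception
                    }
                    if (cause instanceof Error) {
                        throw (Error) cause;
                    }
                    throw e;
                }
            }
        } finally {
            // Interrupts anything still running so the failure is not
            // logged over and over by the remaining workers.
            pool.shutdownNow();
        }
    }
}
```

A caller wraps runAll in an ordinary try/catch, so a single failed worker ends the run instead of producing an endless stream of log entries.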

nnuser's picture

It seems that Jeff has fixed the code for BufferedNeuralDataSet. With the new source code, the difference in the number of open files is huge. Previously it would go over 1000, and now it is around 23 to 25 open files for GA and 1 to 3 files for Simulated Annealing. However, there is still a problem with incremental pruning. When incremental pruning is used, the number of open files reaches around 1300, and even after pruning is completed, the open files are not closed. I think incremental pruning is not closing the resource, and it will certainly crash on a Linux machine under the default configuration.

jeffheaton's picture

Those fixes will be in the next release of Encog (2.5), which should be released later in the summer.

As to incremental pruning and the buffered dataset, I really need to give the buffered dataset a bit of an overhaul. Its only real purpose is to allow Encog to use very large datasets that would not fit in memory. If you compare the performance of the buffered dataset to the memory-based BasicNeuralDataSet, the performance of the buffered dataset is terrible. It works okay for most Encog training methods. But when you put it into a really demanding training situation, like incremental pruning, it really gets bogged down. Incremental pruning works fine with a memory dataset.

I am going to modify the buffered dataset to make use of "file channels" and other Java "nio" features. This should really speed things up and decrease the need to use too many files. I plan for this to be part of Encog 2.5.
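As a rough idea of what the file channel approach can look like, the sketch below maps a binary file of doubles into memory with java.nio. It is not the planned Encog 2.5 code: MappedDatasetExample is a hypothetical class, the file is assumed to hold big-endian doubles in fixed-size records, and a single mapping is limited to 2 GB, so a very large dataset would need several mapped regions.

```java
import java.io.IOException;
import java.nio.DoubleBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of Encog.
final class MappedDatasetExample {

    // Maps the whole file read-only. The mapping stays valid after the
    // channel is closed, so no file handle is held open afterwards.
    static DoubleBuffer map(Path bin) throws IOException {
        try (FileChannel channel = FileChannel.open(bin, StandardOpenOption.READ)) {
            return channel
                    .map(FileChannel.MapMode.READ_ONLY, 0, channel.size())
                    .asDoubleBuffer();
        }
    }

    // Reads one record through a duplicate view, so each caller has its
    // own position and threads do not disturb one another.
    static double[] record(DoubleBuffer data, int recordSize, int index) {
        double[] values = new double[recordSize];
        DoubleBuffer view = data.duplicate();
        view.position(index * recordSize);
        view.get(values);
        return values;
    }
}
```

Iterators built on such a buffer would need only an index each, which avoids opening a file per iterator.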


Copyright 2005 - 2012 by Heaton Research, Inc. Heaton Research™ and Encog™ are trademarks of Heaton Research.