mheimann / BayesianOptimization / 0.1.1

README.md

This algorithm performs Bayesian optimization to automatically set the hyperparameters of a machine learning algorithm so that a model trained on a training dataset performs well, as measured on a validation set. Users must pass in three required inputs and, optionally, a fourth:

  • the name of an Algorithmia algorithm that takes in ONLY the hyperparameters to be optimized and returns a number to be minimized (for supervised learning tasks, this could be the error on the validation data after training on the training data with the supplied hyperparameters). All other parameters must be fixed inside that algorithm. Many algorithms on the system probably return a trained model or predictions for input data rather than a single number, so this will usually require writing a short wrapper function.
  • the number of jobs Bayesian optimization should run, i.e. the number of times to call the algorithm that maps hyperparameters to the function value to minimize.
  • the path to a data collection containing a config file. This is a JSON file named config_<ALGORITHM_NAME>.json (not including the username of the algorithm creator). For each variable it must contain the variable name, type (int, float, etc.), minimum value, maximum value, and number. The number is normally 1; if, say, 2 variables have exactly the same specifications, set it to 2 instead of making duplicate entries.
  • optionally, the path to a data collection in which to write a results file, titled results_<ALGORITHM_NAME>.dat, containing the results (function value and time taken) and hyperparameter settings of all jobs run, for further analysis. If this value is not supplied, the file is written to the same collection as the config file.
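
As a rough illustration of the first and third inputs, the following Python sketch shows what a wrapper algorithm and its config entries might look like. Everything specific here is an assumption made for the example: the hyperparameter name `lam`, the toy data, the way hyperparameters arrive as a dictionary, and the config key names (`name`, `type`, `min`, `max`, `number`) are not confirmed by this README. Only the list of fields each variable must carry comes from the description above.

```python
import json

# Toy data fixed inside the wrapper: the optimizer passes hyperparameters only.
TRAIN = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
VALID = [(5.0, 10.1), (6.0, 11.8)]


def apply(input):
    """Hypothetical wrapper: hyperparameters in, validation error out."""
    # "lam" is an assumed hyperparameter name; the input shape is also assumed.
    lam = float(input["lam"])
    # Closed-form 1-D ridge regression through the origin on the training data.
    w = sum(x * y for x, y in TRAIN) / (sum(x * x for x, _ in TRAIN) + lam)
    # Mean squared error on the validation data (the number to be minimized).
    return sum((y - w * x) ** 2 for x, y in VALID) / len(VALID)


def build_config():
    """Assumed layout for config_<ALGORITHM_NAME>.json; key names are illustrative."""
    return [
        {"name": "lam", "type": "float", "min": 0.0, "max": 10.0, "number": 1},
    ]


# Serialized text that would be saved as the config file in a data collection.
config_text = json.dumps(build_config(), indent=2)
```

The wrapper keeps the data and every non-tuned setting internal, so the only thing the optimizer controls is the hyperparameter itself.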

After Bayesian optimization is performed, the results of the best job (job number, time, function value, and hyperparameter settings) are returned in JSON format. As mentioned above, the results of all jobs are also written to a file in a data collection.
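
To make "best job" concrete: since the function value is minimized, the best job is the one with the smallest value. The Python sketch below picks it from a list of job records. The field names (`job`, `time`, `value`, `params`) and all numbers are made-up placeholders for illustration; the actual layout of the returned JSON and of the results file is not specified here.

```python
def best_job(jobs):
    """Return the record with the lowest function value (lower is better)."""
    return min(jobs, key=lambda job: job["value"])


# Placeholder records; field names and numbers are invented for this sketch.
jobs = [
    {"job": 1, "time": 0.8, "value": 0.42, "params": {"lam": 7.5}},
    {"job": 2, "time": 0.9, "value": 0.17, "params": {"lam": 1.2}},
    {"job": 3, "time": 0.7, "value": 0.31, "params": {"lam": 4.0}},
]

best = best_job(jobs)
```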

Note about the sample input: more jobs should probably be run to achieve better results. The number of jobs in the sample was chosen so that it finishes quickly while still illustrating the format of the sample output.