LOCOlib
LOCOlib implements the LOCO and DUAL-LOCO algorithms for distributed statistical estimation.
Given a data matrix \( \mathbf{X} \in \mathbb{R}^{n \times p} \) and response \( \mathbf{y} \in \mathbb{R}^n \), LOCO is a LOw-COmmunication distributed algorithm for \( \ell_2 \)-penalised convex estimation problems of the form
\[ \min_{\boldsymbol{\beta}} J(\boldsymbol{\beta}) = \frac{1}{n} \sum_{i=1}^n f_i(\boldsymbol{\beta}^\top \mathbf{x}_i) + \frac{\lambda}{2} \Vert \boldsymbol{\beta} \Vert^2_2 \]
LOCO is suitable in a setting where the number of features \( p \) is very large, so that splitting the design matrix across features rather than observations is reasonable; for instance, the number of observations \( n \) can be smaller than \( p \) or of the same order. LOCO is expected to yield good performance when the rank of the data matrix \( \mathbf{X} \) is much lower than the actual dimensionality \( p \) of the observations.
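As a concrete instance of the objective above (up to constant scaling factors, which may differ from LOCOlib's internals), the two estimators discussed below correspond to the loss choices

\[ f_i(u) = (u - y_i)^2 \quad \text{(ridge regression)}, \qquad f_i(u) = \max(0,\, 1 - y_i u) \quad \text{(SVM with hinge loss)}. \]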
- Examples
- LOCOlib options
- Choosing the projection dimension
- Preprocessing package
- Data Structures
One large advantage of LOCO over iterative distributed algorithms such as distributed stochastic gradient descent (SGD) is that LOCO only needs one round of communication before the final results are sent back to the driver. Therefore, the communication cost – which is generally the bottleneck in distributed computing – is very low.
LOCO proceeds as follows. As a preprocessing step, the features need to be distributed across the processing units: the columns of the data matrix are randomly partitioned into \( K \) blocks, which can be done with the “preprocessingUtils” package. After this step, each worker stores one block of roughly \( p/K \) feature vectors; we call these the “raw” features of worker \( k \). Subsequently, each worker compresses its raw features with a random projection; the resulting features are the “random” features of worker \( k \). Each worker then sends its random features to the other workers, and adds or concatenates the random features it receives, appending them to its own raw features. Under this scheme, each worker has access to its own raw features and, additionally, to a compressed version of the remaining workers’ raw features. Using this local design matrix, each worker estimates coefficients and returns the ones corresponding to its own raw features. As these were learned using the contribution of the random features, they approximate the corresponding coordinates of the optimal solution sufficiently well. The final estimate returned by LOCO is simply the concatenation of the raw feature coefficients returned by the \( K \) workers.
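The scheme can be summarised in a small single-machine sketch. The following Scala program simulates LOCO's "add" variant for ridge regression using Breeze for linear algebra; all names are ours for illustration and this is not LOCOlib's actual API, which runs the same steps distributed over Spark workers:

```scala
import breeze.linalg._
import scala.util.Random

object LocoSketch {

  // Gaussian random projection from tau to tauSubs dimensions.
  def randomProjection(tau: Int, tauSubs: Int, rng: Random): DenseMatrix[Double] =
    DenseMatrix.tabulate(tau, tauSubs)((_, _) => rng.nextGaussian() / math.sqrt(tauSubs.toDouble))

  def main(args: Array[String]): Unit = {
    val rng = new Random(42)
    val (n, p, k, tauSubs) = (200, 100, 4, 10)
    val tau = p / k                      // raw features per worker
    val lambda = 0.1

    // Synthetic data: X is n x p, y = X * betaTrue (noise omitted for brevity).
    val x = DenseMatrix.tabulate(n, p)((_, _) => rng.nextGaussian())
    val betaTrue = DenseVector.tabulate(p)(_ => rng.nextGaussian())
    val y = x * betaTrue

    // The random partition of features is assumed already done; here worker b
    // simply owns columns [b * tau, (b + 1) * tau) as its "raw" features.
    val blocks = (0 until k).map(b => x(::, b * tau until (b + 1) * tau))

    // Each worker compresses its raw features into "random" features.
    val randomFeats = blocks.map(xk => xk * randomProjection(tau, tauSubs, rng))

    val betaHat = DenseVector.zeros[Double](p)
    for (b <- 0 until k) {
      // "Add" variant: sum the random features received from all other workers.
      val others = (0 until k).filter(_ != b).map(randomFeats).reduce(_ + _)
      val zk = DenseMatrix.horzcat(blocks(b), others)   // local design matrix
      // Local ridge solution: (Z'Z + n * lambda * I)^{-1} Z'y.
      val betaLocal = (zk.t * zk + DenseMatrix.eye[Double](zk.cols) * (n * lambda)) \ (zk.t * y)
      // Keep only the coefficients of worker b's own raw features.
      betaHat(b * tau until (b + 1) * tau) := betaLocal(0 until tau)
    }
    println(s"first raw-feature coefficients: ${betaHat(0 until 5)}")
  }
}
```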
LOCO’s distribution scheme is illustrated in the following figure.
A preprocessing step is required to distribute the data across workers according to the features rather than the observations. For this step, we provide the package “preprocessingUtils”.
Building with sbt

Check out the project repository and build each package by running `sbt assembly` from the `loco-lib/LOCO` and the `loco-lib/preprocessingUtils` directories, respectively (this produces the assembly jars used with `spark-submit` below).

To install sbt on Mac OS X using Homebrew, run `brew install sbt`. On Ubuntu, run `sudo apt-get install sbt`.
Running on Windows
To run LOCO locally under Windows, we recommend using Spark 1.3.1. Download winutils.exe, move it to `DISK:\FOLDERS\bin` and set `HADOOP_CONF=DISK:\FOLDERS`.
Ridge Regression
To run ridge regression locally on the ‘climate’ regression data set provided here, unzip `climate-serialized.zip` into the `data` directory, download a pre-built binary package of Spark, set `SPARK_HOME` to the location of the Spark folder, `cd` into `loco-lib/LOCO` and run:
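The exact run command depends on your build; as a hedged sketch, a `spark-submit` call might look as follows, where the driver class name, assembly jar path and option values are assumptions to adapt to your checkout (the option names themselves are documented below):

```
$SPARK_HOME/bin/spark-submit \
  --class "LOCO.driver" \
  --master local[4] \
  --driver-memory 2G \
  target/scala-2.10/LOCO-assembly-0.2.jar \
  --classification=false \
  --numIterations=5000 \
  --projection=SDCT \
  --concatenate=false \
  --nPartitions=4 \
  --CV=false \
  --lambda=70 \
  --trainingDatafile="../data/climate-serialized/climate-train-colwise/" \
  --testDatafile="../data/climate-serialized/climate-test-colwise/" \
  --responsePathTrain="../data/climate-serialized/climate-responseTrain.txt" \
  --responsePathTest="../data/climate-serialized/climate-responseTest.txt" \
  --nFeats="../data/climate-serialized/climate-nFeats.txt"
```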
The estimated coefficients can be visualised on a map of the globe, as each feature corresponds to one grid point. For more information on the data set, see LOCO: Distributing Ridge Regression with Random Projections.
SVM
To train a binary SVM with hinge loss locally on the ‘dogs vs. cats’ classification data set provided here, first preprocess the text file with the preprocessingUtils package (see below) and run:
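Again as a hedged sketch (with the same caveats as above regarding the class name, jar path and file paths):

```
$SPARK_HOME/bin/spark-submit \
  --class "LOCO.driver" \
  --master local[4] \
  --driver-memory 2G \
  target/scala-2.10/LOCO-assembly-0.2.jar \
  --classification=true \
  --numIterations=5000 \
  --projection=sparse \
  --concatenate=true \
  --nPartitions=4 \
  --CV=false \
  --lambda=0.5 \
  --trainingDatafile="../data/dogs_vs_cats-train-colwise/" \
  --testDatafile="../data/dogs_vs_cats-test-colwise/" \
  --responsePathTrain="../data/dogs_vs_cats-responseTrain.txt" \
  --responsePathTest="../data/dogs_vs_cats-responseTest.txt" \
  --nFeats="../data/dogs_vs_cats-nFeats.txt"
```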
The following list provides a description of all options that can be provided to LOCO.
outdir
Directory where to save the summary statistics and the estimated coefficients as text files
saveToHDFS
True if output should be saved on HDFS
readFromHDFS
True if preprocessing was run with `saveToHDFS` set to true
nPartitions
Number of blocks to partition the design matrix
nExecutors
Number of executors used. This information is used to determine the tree depth in `treeReduce` when the random projections are added; a binary tree structure is used to minimise the memory requirements (see the sketch after this list).
trainingDatafile
Path to the training data files (as created by preprocessingUtils package)
testDatafile
Path to the test data files (as created by preprocessingUtils package)
responsePathTrain
Path to response corresponding to training data (as created by preprocessingUtils package)
responsePathTest
Path to response corresponding to test data (as created by preprocessingUtils package)
nFeats
Path to file containing the number of features (as created by preprocessingUtils package)
seed
Random seed
useSparseStructure
True if sparse data structures should be used
classification
True if the problem at hand is a classification task, otherwise ridge regression will be performed
numIterations
Number of iterations used in SDCA
checkDualityGap
If SDCA is chosen, true if duality gap should be computed after each iteration. Note that this is a very expensive operation as it requires a pass over the full local data sets (no communication required). Should only be used for tuning purposes.
stoppingDualityGap
If SDCA is chosen and `checkDualityGap` is set to true, the duality gap at which the optimisation should stop
projection
Random projection to use: can be either “sparse” or “SDCT”
nFeatsProj
Projection dimension
concatenate
True if random projections should be concatenated, otherwise they are added. The latter is more memory efficient.
CV
If true, performs cross validation
kfold
Number of splits to use for cross validation
lambdaSeqFrom
Start of regularisation parameter sequence to use for cross validation
lambdaSeqTo
End of regularisation parameter sequence to use for cross validation
lambdaSeqBy
Step size for regularisation parameter sequence to use for cross validation
lambda
If no cross validation is performed (`CV=false`), the regularisation parameter to use
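To illustrate how `nExecutors` is used (see the note under that option above): Spark's `treeReduce` aggregates partial results in stages rather than all at once on the driver. A minimal sketch, where `randomFeatures` is a hypothetical `RDD` of per-worker random feature blocks stored as Breeze matrices:

```scala
import breeze.linalg.DenseMatrix
import org.apache.spark.rdd.RDD

// Sum the workers' random feature blocks with a tree-structured reduce;
// the tree depth grows logarithmically with the number of executors.
def sumRandomFeatures(randomFeatures: RDD[DenseMatrix[Double]],
                      nExecutors: Int): DenseMatrix[Double] = {
  val depth = math.max(2, (math.log(nExecutors.toDouble) / math.log(2.0)).ceil.toInt)
  randomFeatures.treeReduce(_ + _, depth)
}
```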
Choosing the projection dimension
The smallest possible projection dimension depends on the rank of the data matrix \( \mathbf{X} \). If you expect your data to be of low rank, so that LOCO is suitable, we recommend a projection dimension of about 1%–10% of the number of features you are compressing; the latter depends on whether you choose to add or to concatenate the random features. This projection dimension should be used as a starting point: you can test whether your data set allows for a larger degree of compression by tuning the projection dimension together with the regularisation parameter \( \lambda \).
Concatenating the random features
As described in the original LOCO paper, the first option for collecting the random projections from the other workers is to concatenate them and append these random features to the raw features. More specifically, each worker has \( \tau = p / K \) raw features which are compressed to \( \tau_{subs} \) random features. These random features are then communicated, and concatenating the random features from the remaining workers yields random features of dimension \( (K-1) \cdot \tau_{subs} \). Finally, the full local design matrix, consisting of raw and random features, has dimension \( n \times (\tau + (K-1) \cdot \tau_{subs}) \).
Adding the random features
Alternatively, one can add the random features. This is equivalent to projecting all raw features not belonging to worker \( k \) at once. If the data set is of very low rank, this scheme may allow for a smaller dimensionality of the random features than concatenation, as we now project from \( p - p/K \) dimensions to \( \tau_{subs} \) instead of from \( \tau = p/K \) to \( \tau_{subs} \).
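For concreteness, with hypothetical values \( p = 10{,}000 \), \( K = 10 \) and \( \tau_{subs} = 200 \), each worker holds \( \tau = p/K = 1{,}000 \) raw features. The local design matrix then has \( \tau + (K-1) \cdot \tau_{subs} = 1{,}000 + 9 \cdot 200 = 2{,}800 \) columns under concatenation, but only \( \tau + \tau_{subs} = 1{,}200 \) columns under addition.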
The preprocessing package ‘preprocessingUtils’ can be used to

- center and/or scale the features and/or the response to have zero mean and unit variance, using Spark MLlib’s `StandardScaler`. This can only be done when using a dense data structure for the features (i.e. `sparse` must be set to `false`).
- save data files in serialised format using the Kryo serialisation library. This code follows the example from @phatak-dev provided here.
- convert text files of the formats “libsvm”, “comma”, and “space” (see examples under options) to object files with RDDs containing
  - observations of type `LabeledPoint` (needed for the algorithms provided in Spark’s machine learning library MLlib) or `Array[Double]`, where the first entry is the response, followed by the features
  - feature vectors of type `FeatureVectorLP` (needed for LOCO, see details below)
Data Structures
The preprocessing package defines a case class LOCO relies on:
FeatureVectorLP
The case class `FeatureVectorLP` contains all observations of a particular variable as a vector in the field `observations` (which can be sparse or dense). The field `index` serves as an identifier for the feature vector.
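A minimal sketch of what such a case class might look like; the actual definition in preprocessingUtils may differ, e.g. in the exact field types:

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Sketch only: one feature (column) of the design matrix with its identifier.
case class FeatureVectorLP(index: Int, observations: Vector)

// The same five observations stored densely and sparsely:
val dense  = FeatureVectorLP(0, Vectors.dense(1.0, 0.0, 0.0, 2.0, 0.0))
val sparse = FeatureVectorLP(1, Vectors.sparse(5, Array(0, 3), Array(1.0, 2.0)))
```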
Example
To use the preprocessing package
- to center and scale the features
- to split one data set into separate training and test sets
- to save the data sets as object files using Kryo serialisation, distributed over (a) observations and (b) features
download the ‘dogs vs. cats’ classification data set provided here, unzip `dogs_vs_cats.zip` into the `data` directory, change into the corresponding directory with `cd loco-lib/preprocessingUtils` and run:
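As with the LOCO examples above, the following `spark-submit` call is a hedged sketch; the driver class name, jar path and file names are assumptions, while the option names are documented below:

```
$SPARK_HOME/bin/spark-submit \
  --class "preprocessingUtils.preprocessing" \
  --master local[4] \
  --driver-memory 2G \
  target/scala-2.10/preprocess-assembly-0.2.jar \
  --dataFormat=text \
  --textDataFormat=spaces \
  --sparse=false \
  --dataFile="../data/dogs_vs_cats_n5000.txt" \
  --separateTrainTestFiles=false \
  --proportionTest=0.2 \
  --seed=1 \
  --centerFeatures=true \
  --scaleFeatures=true \
  --outputTrainFileName="../data/dogs_vs_cats-train" \
  --outputTestFileName="../data/dogs_vs_cats-test" \
  --outputClass=DoubleArray \
  --twoOutputClasses=true \
  --secondOutputClass=FeatureVectorLP \
  --nPartitions=4
```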
preprocessingUtils options
The following list provides a description of all options that can be provided to the package ‘preprocessingUtils’.
outdir
Directory where to save the converted data files
saveToHDFS
True if output should be saved on HDFS
nPartitions
Number of partitions to use
dataFormat
Can be either “text” or “object”
sparse
True if sparse data structures should be used
textDataFormat
If `dataFormat` is “text”, it can have the following formats:
- “libsvm” : LIBSVM format, i.e. the label followed by (1-based) index:value pairs, e.g. `1 1:2.4 3:0.5`
- “comma” : The response is separated by a comma from the features. The features are separated by spaces, e.g. `0.7,2.4 0.0 0.5`
- “spaces” : Both the response and the features are separated by spaces, e.g. `0.7 2.4 0.0 0.5`
dataFile
Path to the input data file
separateTrainTestFiles
True if (input) training and test set are provided in different files
trainingDatafile
If training and test set are provided in different files, path to the training data file
testDatafile
If training and test set are provided in different files, path to the test data file
proportionTest
If training and test set are not provided separately, proportion of data set to use for testing
seed
Random seed
outputTrainFileName
File name for folder containing the training data
outputTestFileName
File name for folder containing the test data
outputClass
Specifies the type of the elements in the output RDDs: can be `LabeledPoint` or `DoubleArray`
twoOutputClasses
True if same training/test pair should be saved in two different formats
secondOutputClass
If twoOutputClasses
is true, specifies the type of the elements in the corresponding output RDDs
centerFeatures
True if features should be centred to have zero mean
centerResponse
True if response should be centred to have zero mean
scaleFeatures
True if features should be scaled to have unit variance
scaleResponse
True if response should be scaled to have unit variance
Note that the benefit of some of these settings depends heavily on the particular architecture you will be using, i.e. we cannot guarantee that they will yield optimal performance of LOCO.
- Use Kryo serialisation
- Increase the maximum allowable size of the Kryo serialization buffer
- Use Java’s more recent “garbage first” garbage collector which was designed for heaps larger than 4GB if there are no memory constraints
- Set the total size of serialised results of all partitions large enough to allow for the random projections to be sent to the driver
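With Spark, these recommendations map to configuration settings along the following lines; the values are placeholders to adapt to your cluster, and the exact key names can vary slightly across Spark versions:

```scala
import org.apache.spark.SparkConf

// Placeholder values; tune for your cluster.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // Kryo serialisation
  .set("spark.kryoserializer.buffer.max", "512m")          // larger Kryo serialisation buffer
  .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC")  // "garbage first" garbage collector
  .set("spark.driver.maxResultSize", "2g")   // room for random projections sent to the driver
```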
The LOCO algorithm is described in the following papers:
- Heinze, C., McWilliams, B., Meinshausen, N., Krummenacher, G. and Vanchinathan, H. P. (2015). LOCO: Distributing Ridge Regression with Random Projections.
- Heinze, C., McWilliams, B. and Meinshausen, N. (2016). DUAL-LOCO: Distributing Statistical Estimation Using Random Projections. AISTATS 2016.

Further references:

- Shalev-Shwartz, S. and Zhang, T. (2013). Stochastic dual coordinate ascent methods for regularized loss minimization. JMLR, 14:567–599.