LOCOlib


LOCOlib implements the LOCO and DUAL-LOCO algorithms for distributed statistical estimation.

Given a data matrix \( \mathbf{X} \in \mathbb{R}^{n \times p} \) and response \( \mathbf{y} \in \mathbb{R}^n \), LOCO is a LOw-COmmunication distributed algorithm for \( \ell_2 \)-penalised convex estimation problems of the form

\[ \min_{\boldsymbol{\beta}} J(\boldsymbol{\beta}) = \frac{1}{n} \sum_{i=1}^n f_i(\boldsymbol{\beta}^\top \mathbf{x}_i) + \frac{\lambda}{2} \Vert \boldsymbol{\beta} \Vert^2_2 \]
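For concreteness, with squared loss \( f_i(u) = (y_i - u)^2 \) this objective is the ridge regression problem discussed below. A minimal NumPy sketch of the objective (illustrative only, not the library's code; all names are made up):

```python
import numpy as np

def ridge_objective(X, y, beta, lam):
    """J(beta) = (1/n) * sum_i (y_i - beta^T x_i)^2 + (lam/2) * ||beta||_2^2."""
    n = X.shape[0]
    residuals = y - X @ beta
    return residuals @ residuals / n + 0.5 * lam * beta @ beta

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))  # n = 100 observations, p = 20 features
y = rng.standard_normal(100)

# With beta = 0 the penalty vanishes and J reduces to the mean squared response
assert np.isclose(ridge_objective(X, y, np.zeros(20), lam=0.1), np.mean(y ** 2))
```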

LOCO is suitable in a setting where the number of features \( p \) is very large so that splitting the design matrix across features rather than observations is reasonable. For instance, the number of observations \( n \) can be smaller than \( p \) or on the same order. LOCO is expected to yield good performance when the rank of the data matrix \( \boldsymbol{X} \) is much lower than the actual dimensionality of the observations \( p \).

  • Examples
  • LOCOlib options
    • Choosing the projection dimension
  • Preprocessing package
    • Data Structures


One large advantage of LOCO over iterative distributed algorithms such as distributed stochastic gradient descent (SGD) is that LOCO only needs one round of communication before the final results are sent back to the driver. Therefore, the communication cost – which is generally the bottleneck in distributed computing – is very low.

LOCO proceeds as follows. As a preprocessing step, the features need to be distributed across processing units by randomly partitioning the data into \( K \) blocks. This can be done with the “preprocessingUtils” package. After this first step, some number of feature vectors are stored on each worker. We shall call these features the “raw” features of worker \( k \). Subsequently, each worker applies a dimensionality reduction to its raw features by using a random projection. The resulting features are called the “random features” of worker \( k \). Each worker then sends its random features to the other workers, adds or concatenates the random features it receives, and appends them to its own raw features. Using this scheme, each worker has access to its raw features and, additionally, to a compressed version of the remaining workers’ raw features. Using this design matrix, each worker estimates coefficients locally and returns the ones for its own raw features. As these were learned using the contribution from the random features, they approximate the optimal coefficients sufficiently well. The final estimate returned by LOCO is simply the concatenation of the raw feature coefficients returned by the \( K \) workers.
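The steps above can be sketched as a toy single-process NumPy simulation of the distributed scheme, here with the “add” variant for collecting random features (function and variable names are illustrative, not the library's API):

```python
import numpy as np

def loco_ridge(X, y, K, tau_subs, lam, seed=0):
    """One-round LOCO sketch for ridge regression (illustrative only).

    Features are split into K blocks; each "worker" compresses its block with
    a random projection, sums the other workers' random features, solves a
    local ridge problem, and keeps the coefficients of its own raw features.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    blocks = np.array_split(np.arange(p), K)  # random permutation omitted for brevity
    # Each worker projects its raw features down to tau_subs random features
    projections = [rng.standard_normal((len(b), tau_subs)) / np.sqrt(tau_subs)
                   for b in blocks]
    random_feats = [X[:, b] @ P for b, P in zip(blocks, projections)]

    beta = np.empty(p)
    for k, b in enumerate(blocks):
        # "Add" scheme: sum the random features of all other workers
        others = sum(rf for j, rf in enumerate(random_feats) if j != k)
        Z = np.hstack([X[:, b], others])  # raw features + compressed rest
        # Local ridge solve: minimise (1/n)||y - Z c||^2 + (lam/2)||c||^2
        coef = np.linalg.solve(Z.T @ Z / n + (lam / 2) * np.eye(Z.shape[1]),
                               Z.T @ y / n)
        beta[b] = coef[:len(b)]  # return only the raw-feature coefficients
    return beta

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 40))
y = X @ rng.standard_normal(40)
beta = loco_ridge(X, y, K=4, tau_subs=5, lam=0.1)
assert beta.shape == (40,)
```

Each local problem sees its own raw features plus one compressed summary of everyone else's, so only a single round of communication of the random features is needed.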

LOCO’s distribution scheme is illustrated in the following figure.

A preprocessing step is required to distribute the data across workers according to the features rather than the observations. For this step, we provide the package “preprocessingUtils”.

Building with sbt

Check out the project repository

and build the packages with

and

To install sbt on Mac OS X using Homebrew, run brew install sbt. On Ubuntu run sudo apt-get install sbt.

Running on Windows

To run LOCO locally under Windows, we recommend using Spark 1.3.1; download winutils.exe, move it to DISK:\FOLDERS\bin and set HADOOP_CONF=DISK:\FOLDERS.

Ridge Regression

To run ridge regression locally on the ‘climate’ regression data set provided here, unzip climate-serialized.zip into the data directory, download a pre-built binary package of Spark, set SPARK_HOME to the location of the Spark folder, cd into loco-lib/LOCO and run:

The estimated coefficients can be visualised as follows, as each feature corresponds to one grid point on the globe. For more information on the data set, see LOCO: Distributing Ridge Regression with Random Projections.

SVM

To train a binary SVM with hinge loss locally on the ‘dogs vs. cats’ classification data set provided here, first preprocess the text file with the preprocessingUtils package (see below) and run:

The following list provides a description of all options that can be provided to LOCO.

outdir Directory where to save the summary statistics and the estimated coefficients as text files

saveToHDFS True if output should be saved on HDFS

readFromHDFS True if preprocessing was run with saveToHDFS set to True

nPartitions Number of blocks to partition the design matrix

nExecutors Number of executors used. This information will be used to determine the tree depth in treeReduce when the random projections are added. A binary tree structure is used to minimise the memory requirements.

trainingDatafile Path to the training data files (as created by preprocessingUtils package)

testDatafile Path to the test data files (as created by preprocessingUtils package)

responsePathTrain Path to response corresponding to training data (as created by preprocessingUtils package)

responsePathTest Path to response corresponding to test data (as created by preprocessingUtils package)

nFeats Path to file containing the number of features (as created by preprocessingUtils package)

seed Random seed

useSparseStructure True if sparse data structures should be used

classification True if the problem at hand is a classification task, otherwise ridge regression will be performed

numIterations Number of iterations used in SDCA

checkDualityGap If SDCA is chosen, true if duality gap should be computed after each iteration. Note that this is a very expensive operation as it requires a pass over the full local data sets (no communication required). Should only be used for tuning purposes.

stoppingDualityGap If SDCA is chosen and checkDualityGap is set to true, duality gap at which optimisation should stop

projection Random projection to use: can be either “sparse” or “SDCT”

nFeatsProj Projection dimension

concatenate True if random projections should be concatenated, otherwise they are added. The latter is more memory efficient.

CV If true, performs cross validation

kfold Number of splits to use for cross validation

lambdaSeqFrom Start of regularisation parameter sequence to use for cross validation

lambdaSeqTo End of regularisation parameter sequence to use for cross validation

lambdaSeqBy Step size for regularisation parameter sequence to use for cross validation

lambda If no cross validation should be performed (CV=false), regularisation parameter to use
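As a hypothetical illustration of the three lambdaSeq* options above: setting lambdaSeqFrom=0.1, lambdaSeqTo=1.0 and lambdaSeqBy=0.1 corresponds to a grid like this (the values are made up for the example):

```python
import numpy as np

lambda_seq_from, lambda_seq_to, lambda_seq_by = 0.1, 1.0, 0.1
# Grid of regularisation parameters searched by k-fold cross validation;
# the half-step on the upper bound makes the endpoint inclusive
lambda_seq = np.arange(lambda_seq_from, lambda_seq_to + lambda_seq_by / 2,
                       lambda_seq_by)
assert len(lambda_seq) == 10 and np.isclose(lambda_seq[-1], 1.0)
```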


Choosing the projection dimension

The smallest possible projection dimension depends on the rank of the data matrix \( \boldsymbol{X} \). If you expect your data to be low-rank so that LOCO is suitable, we recommend using a projection dimension of about 1%-10% of the number of features you are compressing. The latter depends on whether you choose to add or to concatenate the random features. This projection dimension should be used as a starting point. Of course you can test whether your data set allows for a larger degree of compression by tuning the projection dimension together with the regularisation parameter \( \lambda \).

Concatenating the random features

As described in the original LOCO paper, the first option for collecting the random projections from the other workers is to concatenate them and append these random features to the raw features. More specifically, each worker has \( \tau = p / K \) raw features which are compressed to \( \tau_{subs} \) random features. These random features are then communicated, and concatenating all random features from the remaining workers results in a dimensionality of the random features of \( (K-1) \cdot \tau_{subs} \). Finally, the full local design matrix, consisting of raw and random features, has dimension \( n \times (\tau + (K-1) \cdot \tau_{subs}) \).
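With illustrative sizes, the bookkeeping for the concatenation scheme works out as follows (plain arithmetic, not library code):

```python
n, p, K = 1000, 10_000, 10
tau = p // K           # raw features per worker
tau_subs = tau // 10   # projection dimension (10% of tau)

# Each worker appends the (K - 1) other workers' random features
local_dim = tau + (K - 1) * tau_subs
assert (tau, tau_subs, local_dim) == (1000, 100, 1900)
# Local design matrix: n x 1900 instead of the full n x 10000
```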

Adding the random features

Alternatively, one can add the random features. This is equivalent to projecting all raw features not belonging to worker \( k \) at once. If the data set is very low-rank, this scheme may allow for a smaller dimensionality of the random features than concatenating them, as we can now project from \( (p - p/K) \) to \( \tau_{subs} \) dimensions instead of from \( \tau = p/K \) to \( \tau_{subs} \).
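This equivalence is easy to check numerically: summing the other workers' individually projected blocks equals projecting the concatenation of all their raw features with the stacked projection matrix (a NumPy sketch with illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(1)
n, K, tau, tau_subs = 50, 4, 30, 5
blocks = [rng.standard_normal((n, tau)) for _ in range(K)]       # raw feature blocks
projs  = [rng.standard_normal((tau, tau_subs)) for _ in range(K)]

k = 0  # worker k receives the other workers' random features
added = sum(blocks[j] @ projs[j] for j in range(K) if j != k)

# Equivalent: project all (K-1)*tau remaining raw features in one shot
others_X  = np.hstack([blocks[j] for j in range(K) if j != k])   # n x (K-1)*tau
stacked_P = np.vstack([projs[j] for j in range(K) if j != k])    # (K-1)*tau x tau_subs
assert np.allclose(added, others_X @ stacked_P)
```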

The preprocessing package ‘preprocessingUtils’ can be used to

  • center and/or scale the features and/or the response to have zero mean and unit variance, using Spark MLlib’s StandardScaler. This can only be done when using a dense data structure for the features (i.e. sparse must be set to false).
  • save data files in serialised format using the Kryo serialisation library. This code follows the example from @phatak-dev provided here.
  • convert text files of the formats “libsvm”, “comma”, and “space” (see examples under options) to object files with RDDs containing

    • observations of type

      • LabeledPoint (needed for the algorithms provided in Spark’s machine learning library MLlib)
      • Array[Double] where the first entry is the response, followed by the features
    • feature vectors of type

      • FeatureVectorLP (needed for LOCO, see details below)

Data Structures


The preprocessing package defines a case class that LOCO relies on:

FeatureVectorLP

The case class FeatureVectorLP contains all observations of a particular variable as a vector in the field observations (which can be sparse or dense). The field index serves as an identifier for the feature vector.


Example

To use the preprocessing package

  • to center and scale the features
  • to split one data set into separate training and test sets
  • to save the data sets as object files using Kryo serialisation, distributed over (a) observations and (b) features

download the ‘dogs vs. cats’ classification data set provided here, unzip dogs_vs_cats.zip into the data directory, change into the corresponding directory with cd loco-lib/preprocessingUtils and run:

preprocessingUtils options

The following list provides a description of all options that can be provided to the package ‘preprocessingUtils’.

outdir Directory where to save the converted data files

saveToHDFS True if output should be saved on HDFS

nPartitions Number of partitions to use

dataFormat Can be either “text” or “object”

sparse True if sparse data structures should be used

textDataFormat If dataFormat is “text”, it can have the following formats:

  • “libsvm” : LIBSVM format, e.g.:
  • “comma” : The response is separated by a comma from the features. The features are separated by spaces, e.g.:
  • “spaces” : Both the response and the features are separated by spaces, e.g.:

dataFile Path to the input data file

separateTrainTestFiles True if (input) training and test set are provided in different files

trainingDatafile If training and test set are provided in different files, path to the training data file

testDatafile If training and test set are provided in different files, path to the test data file

proportionTest If training and test set are not provided separately, proportion of data set to use for testing

seed Random seed

outputTrainFileName File name for folder containing the training data

outputTestFileName File name for folder containing the test data

outputClass Specifies the type of the elements in the output RDDs: can be LabeledPoint or DoubleArray

twoOutputClasses True if same training/test pair should be saved in two different formats

secondOutputClass If twoOutputClasses is true, specifies the type of the elements in the corresponding output RDDs

centerFeatures True if features should be centred to have zero mean

centerResponse True if response should be centred to have zero mean

scaleFeatures True if features should be scaled to have unit variance

scaleResponse True if response should be scaled to have unit variance

Note that the benefit of some of these settings depends heavily on the particular architecture you will be using, i.e. we cannot guarantee that they will yield optimal performance of LOCO.

  • Use Kryo serialisation
  • Increase the maximum allowable size of the Kryo serialization buffer
  • Use Java’s more recent “garbage first” garbage collector which was designed for heaps larger than 4GB if there are no memory constraints
  • Set the total size of serialised results of all partitions large enough to allow for the random projections to be sent to the driver

The LOCO algorithm is described in the following papers:

  • Heinze, C., McWilliams, B., Meinshausen, N., Krummenacher, G., Vanchinathan, H. P. (2015) LOCO: Distributing Ridge Regression with Random Projections
  • Heinze, C., McWilliams, B., Meinshausen, N. DUAL-LOCO: Distributing Statistical Estimation Using Random Projections AISTATS 2016

Further references:

  • Shalev-Shwartz, S. and Zhang, T. Stochastic dual coordinate ascent methods for regularized loss minimization. JMLR, 14:567–599, February 2013.

Optical media is a dying format, thanks in large part to services such as iTunes and the App Store, so it's no surprise that Apple has been slowly phasing out the built-in drive from their Mac lineup, leaving just the entry-level 13-inch MacBook Pro unchanged.

While OS X is compatible with a wide range of USB optical drives (including Apple's own USB SuperDrive), it includes a built-in feature that allows it to share the optical drive of another Mac or PC on the same network, called Remote Disc.

A History of Remote Disc

When the original MacBook Air launched in 2008, it was the first Mac in over a decade designed without a built-in optical drive. Although the launch was only six years ago, neither the App Store nor the Mac App Store existed, and software was still most commonly distributed on CDs and DVDs.

For MacBook Air owners who still wanted to install software from CDs and DVDs, many of whom needed a way of installing Microsoft Office, Apple introduced Remote Disc, providing local network access to another Mac or PC's optical drive.

Limitations

Despite its apparent versatility, Remote Disc is rather limited. It can only really be used with data CDs and DVDs, such as software installers. The feature doesn't support audio CDs, video DVDs, OS installations or games that require constant disc access. Writing to CDs and DVDs is also not possible. To avoid these limitations, a USB optical drive would be needed.

Prerequisites

Remote Disc hasn't really changed since it was introduced, requiring Mac OS X 10.4.11 or later on the host Mac that's equipped with an optical drive. Indeed, the Windows software that provides this functionality from a PC hasn't been updated since it was released, still sitting at v1.0.

If your Mac lacks a built-in optical drive as standard, Remote Disc will be available for you to use.

Setting Up Remote Disc

Remote Disc is rather straightforward to set up and, once configured, can be left on for continued access. It's important to note that Remote Disc isn't secure: data is not encrypted when transmitted over the network, and access control is done by requesting permission.

Host Mac

To enable Remote Disc on your host Mac, open System Preferences and select Sharing.

Provided your Mac has either a built-in optical drive or an external one attached, the option DVD or CD Sharing will be the first in the list of sharing services that OS X can provide. Enabling it will provide access to your Mac's optical drive over the local network.

As you've probably noticed, the Mac I'm using as a host is a MacBook Air. This simply has a USB optical drive attached, allowing Remote Disc functionality.

As for controlling access to your Mac's optical drive, you have the choice of requiring a remote user to ask for permission before accessing it. If you have a number of drive-less Macs within the home and a SuperDrive-equipped iMac, it makes sense not to require permission. For more public networks or work environments, requesting permission is advisable. For the purposes of this guide, enable the option so that our remote user needs to ask permission before using the optical drive.

Host PC

Nearly half of all Mac users are new to the platform, having switched to the Mac from a Windows PC. Providing the ability for new Mac users to be able to use their PC's optical drive is a smart way of ensuring that their transition is as seamless as possible.

To enable Remote Disc within Windows XP SP2 or above, you'll need to install the DVD or CD Sharing Update 1.0 for Windows.

Although the system requirements state it is compatible with either Windows XP or Vista, it also works with Windows 7 and 8.

Once the software is installed, launch DVD or CD Sharing from the Start Menu (or wherever the heck Windows 8 puts it). You'll see the same options that OS X provides, simply enabling or disabling the service and an option for requiring permission.

One common misconception is that, as Windows PCs cannot access files on a Mac-formatted drive, the same applies to a Mac software CD or DVD. This isn't the case and Windows can happily read Mac CDs or DVDs as both OS X and Windows use the same common filesystem for structuring data on an optical disc.

Remote Mac

Now that the service is enabled, we can now remotely access our host Mac's optical drive.

Your remote Mac should display an option within a Finder window's sidebar, under Devices, labelled Remote Disc. Ensure both Macs are running on the same local network and then select it.

As Remote Disc broadcasts its availability over the network, the host Mac should be seen within the window. Since it is also available on a Windows PC on the local network, it too will be displayed.

Double-click the Mac and, if you enabled the option to require permission before using, click the Ask to use... button.

A dialog box will appear on the host Mac detailing the user and Mac that is requesting permission to access its optical drive, giving you the ability to deny or allow access.

The same dialog box also appears when the request is sent to a Windows PC.

After clicking Accept, the remote Mac can now access the CD we have in the host computer's drive and will allow us to install software or copy files from the inserted disc.

Wrapping Up

Back when Remote Disc was introduced, it served a genuinely useful need that meant new MacBook Air owners could rest easy, knowing they could install software residing on DVDs with some degree of ease.

Nowadays, Remote Disc is likely a feature of OS X you'll never need to use, though it's certainly useful to have in the event that you may need to copy files or install some software from an optical disc.

But 3rd-party USB optical drives are so cheap that it's worth purchasing one just to have handy. As OS X natively supports USB optical drives, there really isn't any point in purchasing the Apple USB SuperDrive for anything other than aesthetic reasons. It's oddly limited to being compatible with Macs that don't feature a built-in optical drive, meaning it cannot be used on other Macs, such as ones with a broken SuperDrive, or PCs.

I personally own a Samsung 8x USB DVD Writer that costs less than $30 and, although I've used it only a handful of times, it has been for uses where Remote Disc just wasn't an option. It's also far more versatile than Apple's own.
