Welcome to NimbleDais!
Posted by vinod.mamtani in General on July 7, 2010
nim.ble[nim-buhl] – adjective – agile, active
da.is[dey-is, dahy-, deys] – noun – a raised platform
NimbleDais is a hub for “All Things Technology” specifically in the fields of operating systems, virtualization, data mining, machine learning, mobile and Web 2.0. My plan is to pick a subject in one of these areas and examine it under the hood. I’ll publish technical reports, block diagrams, code snippets and perform analysis on them. You’re free to use all the materials presented here unless stated otherwise. I hope to make this appealing for novices and experts alike.
If you spend some time and find this useful then I have achieved my goals. Please shoot me an email if you find something in error or require more explanation.
Last Project in Twenty Eleven
Posted by vinod.mamtani in Entrepreneur Corner, Virtualization on December 29, 2011
Cassandra Data Model
Posted by vinod.mamtani in Big Data on June 1, 2011
Cassandra is a distributed key-value store. When I hear the term ‘key-value’, I am reminded of maps or associative arrays. If you are thinking along the same lines then you are not ways off from understanding the Cassandra’s data model.
Column and Column Family
At a primitive level, the value object for the term ‘key-value’ refers to a column name-value-timestamp entity. All three elements – name, value and timestamp are provided by the applications.
Both the column names and values are byte arrays. In addition, the application also specifies the sort ordering on the columns. The columns are sorted by time or column name. Since a column name is a byte array and with a sort order on a column name, it can be useful to store a value in the column name! The column value field may not be used and can be an empty byte array. It is also possible to write a custom sort method for column names. The timestamp value is used for read consistency in case of conflicts and so it is important for applications to use synchronized clocks. Most applications do not use the timestamp values though.
A collection of columns along with the row keys forms a column family. This is analogous to a table in the RDBMS system. Within a column family, each row is uniquely identifiable by a key. Each row in a column family may have a different set of columns. This is the schema-less model in Cassandra. The schema for a column family though is defined upfront and cannot be modified without restarting the Cassandra process.
Every operation on a single row is atomic. In addition, rows are sorted by their key values and placed on different nodes in a cluster according to its Partitioner – RandomPartitioner or OrderPreservingPartitioner.
Cassandra also provides for indexes of the type KEYS on the column values called secondary indexes. This allows for range queries on the column values.
Super Column and Super Column Family
A super column is an associative array of columns. Similar to a column, it is a name-value entity but does not have any timestamps associated with it. The value for a super column is a map of columns.
A super column family is a collection of super columns along with the row keys. Similar to columns, the super column names as well as the sub-columns names inside super columns are sorted. In addition, the super column names are indexed though the sub-column names inside super columns are not indexed. Thus all the data for a row key in a super column is accessed together. A super column is typically used for de-normalizing data from other column families.
Keyspaces
A keyspace is a top-level container for column or super column families. A keyspace is analogous to a schema in the RDBMS system.
References
Cassandra Data Model
WTF is a SuperColumn? An Intro to the Cassandra Data Model
SVM From the Ground Up
Posted by vinod.mamtani in Machine Learning on May 2, 2011
A couple of months back I wrote a post on SVM but have always wanted to write one that explains SVM to people with some knowledge of linear algebra but no background on machine learning. This is my attempt to do so.
Linear Regression
Consider a training set with m training examples. The notation X denotes the input variables or features for a training set while Y denotes the output variables or target variables. An ith training example is denoted as (x(i), y(i)). The number of features in a training set is denoted by n. Define learning hypothesis as:
The function h(x) is a predictor for the corresponding value of y. When the target variable is continuous, the learning problem is called a regression. When the target variable is discrete, the learning problem is called a classification problem. When y is a linear function of x, then:
Gradient Descent
To pick the parameters, define a cost function as:
The goal is to minimize the cost function This is the least squares cost function giving rise to the ordinary least squares regression model. An algorithm to minimize the cost function, called gradient descent, works as follows. Start with some initial guess for the parameters. Plug them into the cost function. Make some changes to the parameters so as to make the cost function smaller. Continue this until the parameter values minimize the cost function. Now, a property of gradient descent is that depending on the initial guess, the end result may be different. This is called reaching the local minima. Gradient descent is formally defined as:
The learning rate controls the rate of change and is often hand-coded. For m training examples, the update function can be written as:
The meaning of convergence is that the change in the value of the parameters doesn’t change the value of the cost function by a lot between successive iterations. Now gradient descent may not seem that useful if the parameter values are different for the initial guess but in this case, the cost function is a convex quadratic function and has a single global minima with no other local minima. This means that gradient descent will always converge to the same value for these parameters. The algorithm given above is called the Batch Gradient Descent. For a very large training set, the algorithm called Stochastic Gradient Descent is used and is as given below:
The advantage of this approach is that all parameter updates happen with the first training example. This may not converge to the global minima but usually wanders around the global minima. In practice this works well especially for a large training set.
It is possible to mathematically derive the parameters by taking the derivative of the cost function and equating to 0. Doing so yields the following equation in a closed-form:
This is the non-iterative form of gradient descent. This equation is also called the normal equation. It is important to note that the term XTX may not always be invertible. This is the case when the features in the training set are dependent.
Probabilistic Interpretation
This section examines the probabilistic interpretation of linear regression. This will also explain why the cost function (of least squares) is defined as such. Define an error term that captures un-modeled effects or even random noise. In that case the relation between the input features and target variables is given as:
Due to the central limit theorem, assume that the error terms are independent and identically distributed (IID) from the Gaussian probabilistic distribution. The Gaussian distribution density of the error terms is given by:
This can be viewed as a function of the model parameters by defining the likelihood function. The likelihood of the parameters is the same as the probability of data. This can be written as:
From the principle of maximum likelihood estimation, choose the model parameters to maximize the likelihood function. In other words, choose the model parameters to make the data highly probable. For mathematical convenience, maximize the log likelihood (instead of likelihood) function defined as:
Thus maximizing the log likelihood is the same as minimizing
which is the least squares cost function defined above. Note that the model parameters do not depend on the variance as long as it is positive.
Classification and Logistic Regression
A classification problem is the one where the target variable is discrete. For simplicity, consider a binary classification problem where y can take any of the two values, 0 or 1. Correspondingly, change the hypothesis so that it outputs a value in the range [0,1]. Furthermore, instead of the output of the hypothesis being the output for the target variable, it is the probability of a value for the target variable.
Similar to the least squares regression, the model parameters can be derived using the maximum likelihood estimate of the parameters.
For convenience, maximize the log likelihood instead of the likelihood function. The log likelihood for the model parameters is given as:
In order to maximize the log likelihood, use the gradient ascent technique. The formula for gradient ascent is given as:
Therefore taking the derivative of the log likelihood for the gradient ascent rule yields:
The gradient ascent rule therefore appears as:
This derivation may appear similar to the derivation for the batch gradient descent, but that is only cosmetically true. In here, the hypothesis function is a sigmoid function as opposed to the linear function for batch gradient descent.
Discriminative, Generative and Discriminant
Posted by vinod.mamtani in General, Machine Learning on March 1, 2011
One fine morning, on my way to work, I found myself in a state of utter confusion over a set of machine learning algorithms that I had read few days back. The confusion was over the use of a certain word and its derivatives. So here it goes..
Of the many machine learning methods, there are learning algorithms that model p(y|x), the conditional distribution of y over x. These algorithms are called discriminative learning algorithms. Few examples of such algorithms are logistic regression and perceptron algorithms. There are other algorithms that try to model p(x|y) and p(y). These algorithms are called generative algorithms. An example of a generative algorithm is the Gaussian Discriminant Analysis or GDA. The confusion that crept in my head was with the use of the word ‘discriminant’ for a generative algorithm versus discriminative algorithm. I am hoping that with this post, I have set my confusion to rest.
Support Vector Machine
Posted by vinod.mamtani in Machine Learning on February 12, 2011
A support vector machine (SVM) is a supervised learning method used for classification and regression. A set of input data with labels is used to train an SVM model. Each input in the training set is marked as belonging to one of the two input classes. The model then assigns a new input to one of the two classes.
SVM as Hyperplane
An SVM model is a representation of input as points in high dimensional space. Points are mapped so that the two categories are divided by a clear gap that is as wide as possible. New input is mapped into the same space and predicted as belonging to a class based on the side of the gap they fall on. A support vector machine creates a hyperplane that provides a wide separation between the nearest training points of any class. In the figure below, the hyperplane represented by a solid line provides a better categorization than the dotted one.
Consider a formal definition of SVM. A training dataset D is a set of n points of the form:
where yi indicates the class to which a point xi belongs. The goal is to find a maximum-margin hyperplane that divides the points belonging to class yi = 1 and yi = –1. An equation for a hyperplane is given as: w.x – b = 0 where w is a vector normal to the hyperplane and b/||w|| represents the offset of the hyperplane from the origin along the normal vector.
Form two parallel hyperplanes at the margin of the data points such that there are no points between them. These hyperplanes can be described by the equations:
w.x – b = 1 and w.x – b = –1.
The distance between the hyperplanes at the margin of the data points is 2/||w||. So in order to maximize the margin, minimize ||w||. The constraints for the data points are given by the equations: w.xi – b >= 1 and w.xi – b <= –1.
Writing this as a single equation yields: yi(w.xi – b ) >= 1.
Thus the constrained optimization problem can be described as:
Minimize ||w|| (in w, b), subject to yi(w.xi – b ) >= 1.
Substituting ||w|| by 1/2||w||2 for convenience, the constrained optimization problem becomes:
Minimize 1/2||w||2 (in w, b), subject to yi(w.xi – b ) >= 1.
Primal Lagrangian Form
The constrained optimization problem can be written in a primal Lagrangian form as:
In this form, the Lagrangian multipliers play an adversarial role in enforcing the margin constraints. Taking partial derivatives of the Lagrangian function w.r.t w and b and equating them to 0 yields:
Note that w is a linear combination of the data points xi. Per the Karush-Kuhn-Tucker (KKT) complementary slackness condition:![]()
When yi(w.xi – b) <> 1 => the Lagrange multiplier is 0 and xi is irrelevant. When the Lagrange multiplier <> 0 => yi(w.xi –b) = 1 and xi is a support vector. Thus support vectors correspond to the non-zero values of Lagrange multipliers.
Substituting the above relations in the primal form gives:
Dual Lagrangian Form
A dual Lagrangian function is defined as the minimum value of the Lagrangian over primal variables. As per Slater conditions, it follows that:
The left hand side, solved above, is the primal form while the right hand side is the dual form. The dual form of the solution is obtained by:
References
Support Vector Machine in Wikipedia
Constrained Optimization Problems from Andrew Ng’s CS229 course
Design Patterns in Map-Reduce
Posted by vinod.mamtani in Machine Learning on January 11, 2011
Map-Reduce is a programming paradigm for parallel processing on large scale data. Map-Reduce programs typically run on a cluster of machines built from commodity hardware. There is a master controller that delegates data processing task to a set of mappers and reducers. The master controller keeps tab on the data portions handed to each and whether they are successfully processed. In the event that an instance of a mapper or reducer fails, the master controller clears its intermediate state and restarts one. The master controller selects a machine in a cluster to run an instance of a mapper or reducer based on its proximity to data. Typically a user specifies the number of reducers while the system selects the number of mappers.
Broadly speaking, mappers read files or chunks of files and generate a sequence of key-value pairs. This intermediate form of data is grouped such that a reducer processes all the data for a single key value. The reducer performs combination or reduction on the key-value pairs generating a single key and associated values as its output. The output from reducers may be fed into other instances of mappers and reducers. Thus it is possible to cascade a series of mappers and reducers for some implementation.
There are a number of challenges with using map-reduce for data processing. First and foremost is the problem of retrofitting a solution to the map-reduce paradigm. Second is forming a solution that is efficient and scalable. I’ve seen cases where a map-reduce solution on a cluster of machines underperforms the execution time on a single machine.
The map-reduce programming is not for any problem related to large datasets. In fact it is mostly suited for specific type of problems that deal with large amounts of mostly read-only data in large files that can be redundantly placed on a large number of machines. Further the algorithm needs to be such that it can be easily parallelized.
Over the years, some interesting design patterns have emerged as commonly applied solutions to recurring problems with large data set using map-reduce. In this blog, my goal is to provide a collection of these design patterns. I’ve another section here that provides specific examples for the use of map-reduce programming. These examples may or may not correspond to a design pattern described here. My goal is to add newer design patterns and examples as I come across them over the next several months. So keep visiting this post for more information. I will quote the sources for the material in here for your reference. If you know the techniques that are not covered here, kindly point them to me and I will update my entries. This is pretty light weight for now but stay tuned.
Design Patterns
Combiners
In the case where a reducer applies an associative and commutative operation, there is an option to push some of the reducer functionality into a mapper. This sub-task in a mapper is the combiner. This doesn’t eliminate the need for reducers but is absolutely good for reducing the communication overhead between the mappers and reducers. In the word count example, a combiner aggregates the key-value pairs for a document or an instance of a mapper.
Examples
Word Count
In this example, the input is a very large repository of documents and the goal is to count the number of occurrences for each word in the collection. A file in the repository is an input element to a mapper. The mapper parses the document and generates a sequence of key-value pairs such that the key is a word in a document and its value is 1 for frequency. Thus the output of a mapper is a sequence of key-value pairs of the form:
(w1, 1), (w2, 1),…,(wn, 1)
The master controller merges the output of mappers into as many intermediate files as the number of reducers. This is done in a way that each reducer processes all the key-value pairs for the same key.
The reducer takes the sequence of key-value pairs and generates as output a key-value pair with the same key and aggregated sum of values for the sequence.
This gives the frequency of all words in a repository using map-reduce.
Matrix-Vector Multiplication
Given a matrix M of size n x n and a vector v of length n, the matrix-vector product is a vector of length n whose ith element xi is given by:
In the first case, assume that n is a large number though vector v can still fit in main memory and is part of an input to a mapper. A mapper takes as input a row-wise chunk of matrix M of size j and vector v and produces a key-value pair of the form (i, mij * vj1). Thus, each term of sum for xi1 is output with the same key. The reducer sums all the values associated with a key. This completes the matrix-vector multiplication using map-reduce.
In the second case, assume that n is so large that vector v cannot fit in main memory. It is still possible to use the algorithm in the previous case, though this may require frequent disk accesses. A better approach is to partition the columns of a matrix and rows of a vector into k blocks such that a single block of a vector can fit in main memory. The implementation of a mapper and reducer is same as in the previous case with the differences being the size and splitting of data chunks for these processes.
In the third case, partition the matrix into k2 blocks and vector into k blocks. A mapper gets a single square block of the matrix and the corresponding vector block. A single vector block is sent to k mappers. Both the matrix and vector blocks can fit in main memory for fast computation.
Matrix Multiplication
Given a matrix M with element mij in row i and column j and matrix N with element njk in row j and column k, the product P = MN is the matrix P with element pik in row i and column k given as:
A requirement for matrix multiplication is that the number of columns of M equal the number of rows of N. In here, I have provided two somewhat similar solutions for matrix multiplication using Map-Reduce.
The first solution is a cascade of two map-reduce operations. In the first stage, the mapper generates a key-value pair of the form (j, (M, i, mij)) and (j, (N, k, njk)). The reducer examines the value sets from M and combines them with the value sets from N to produce a key-value pair of the form (j, (i, k, mij * njk)). In the second stage, the mapper groups all the value sets for a single key of the form (j, [((i1, k1), v1), ((i2, k2), v2),…,((ip, kp), vp)]) and generates a key-value pair of the form ((i1, k1), v1), ((i2, k2), v2),…,((ip, kp), vp). The reducer in the second stage produces a sum of values associated with a key. This completes the matrix multiplication using map-reduce in two stages.
The second solution performs the matrix multiplication in one map-reduce pass. The mapper produces key-value pairs for each element mij in M of the form ((i,k), (M, j, mij)) for k = 1, 2,…, up to the number of columns of N. Similarly the mapper produces key-value pairs for each element njk in N of the form ((i, k), (N, j, njk) for i = 1, 2,…, up to the number of rows of M. The reducer has two lists of values corresponding to matrices M and N for each key (i,k). The reducer matches the values on the lists for the same values of j and multiplies the third element in the value set. This is then summed to produce a value output for the key (i,k). This completes the matrix multiplication using map-reduce in one pass.
References
Mining of Massive Datasets – Anand Rajaraman and Jeffrey D. Ullman
Singular Value Decomposition (SVD)
Posted by vinod.mamtani in Machine Learning on December 29, 2010
During my friend Charles’s visit to the bay area, we had an interesting conversation on SVD as a learning technique. For those of you who are not familiar with SVD, let me first describe what SVD is and how it can be used.
In linear algebra, the Singular Value Decomposition is a factorization of a rectangular matrix of the form:
- U is an orthogonal matrix. The columns of U are orthonormal eigenvectors of AAT and also called the left singular vectors of A. UTU = I
- V is an orthogonal matrix. The columns of V are orthonormal eigenvectors of ATA and also called the right singular vectors of A. VTV = I
- S is a diagonal matrix. The diagonal entries are called the singular values of A. The non-zero singular values are the square roots of eigenvalues from U or V in descending order
- The diagonal matrix is uniquely determined by A though the matrices U and V are not.
Let’s look at an example to better understand the concept of SVD.
Define matrix A as:
[,1] [,2] [,3]
[1,] 1 2 3
[2,] -3 -2 -1
Correspondingly, AT is:
[,1] [,2]
[1,] 1 -3
[2,] 2 -2
[3,] 3 -1
In order to find U, compute AAT:
[,1] [,2]
[1,] 14 -10
[2,] -10 14
The next step is to find the eigenvalues for AAT. Solving the equation for eigenvectors gives the eigenvalues for a matrix. The equation for eigenvector is defined as:
Writing this in an equation form yields two linear equations:
14 * x1 – 10 * x2 = lambda * x1
=> (14 – lambda) * x1 – 10 * x2 = 0
-10 * x1 + 14 * x2 = lambda * x2
=> –10 * x1 + (14 – lambda) * x2 = 0
Equating the determinant of coefficient matrix to 0 gives lambda values or eigenvalues as 24 and 4. Plugging these values into the two linear equation above yields:
x1 = –x2 and
x1 = x2
Now form a set of eigenvectors, corresponding to each eigenvalue, such that the resultant matrix is easier to work with. Note that the eigenvector for the largest eigenvalue is placed in the first column, eigenvector for the second largest value is placed in the second column and so on.
[,1] [,2]
[1,] -1 -1
[2,] 1 -1
Now we need to convert the above matrix into an orthonormal matrix. This is done by applying the Gram-Schmidt orthonormalization method shown below:
Normalizing the first column vector gives:
[,1]
[1,] -0.7071068
[2,] 0.7071068
Applying the Gram-Schmidt process to the second column vector gives:
[,2]
[1,] -1
[2,] -1
Normalizing the second column vector gives:
[,2]
[1,] -0.7071068
[2,] -0.7071068
The resultant matrix U is given as:
[,1] [,2]
[1,] -0.7071068 -0.7071068
[2,] 0.7071068 -0.7071068
Similarly, it’s possible to calculate V based on ATA. The resultant matrix V is:
[,1] [,2] [,3]
[1,] -0.5773503 7.071068e-01 0.4082483
[2,] -0.5773503 6.234162e-19 -0.8164966
[3,] -0.5773503 -7.071068e-01 0.4082483
Correspondingly, VT is:
[,1] [,2] [,3]
[1,] -0.5773503 -5.773503e-01 -0.5773503
[2,] 0.7071068 6.234162e-19 -0.7071068
[3,] 0.4082483 -8.164966e-01 0.4082483
The diagonal matrix S is composed of square root of non-zero eigenvalues on the main diagonal in descending order. The eigenvalues for both U and V is the same:
[,1] [,2] [,3]
[1,] 4.898979 0 0
[2,] 0.000000 2 0
Putting all together:
U * S * VT = A
> U %*% S %*% t(V)
[,1] [,2] [,3]
[1,] 1 2 3
[2,] -3 -2 -1
Visualization of SVD
The matrix decomposition can also be visualized as three geometric transformations – a rotation followed by a stretch followed by another rotation. VT describes an orthonormal basis in the domain, U describes an orthonormal basis in the co-domain and S describes the stretch for vectors in VT that leads to vectors in U.
An example below illustrates a transformation of a circle by a matrix. This is shown in two different ways. The first part is a transformation by a matrix. The second part illustrates each stage of transformation by the ordered matrix components.
Applications of SVD
Dimensionality Reduction/Data Compression
SVD is a method for identifying and ordering the dimensions along which the data points exhibit the most variation. Once the dimensions are identified, it’s possible to find the best approximation of the original data points using fewer dimensions.
In the SVD method explained above, S is a diagonal matrix with non-zero eigenvalues in descending order. These values indicate the variance of the linearly independent components along each dimension. Restricting the number of singular values in S has an effect of dimensionality reduction on the data set. This method is also called the reduced SVD method.
In an example below, the least significant eigenvalue in S (last value along the diagonal) is dropped thereby reducing the size of matrix S as shown by the shaded portion. This results in further reduction in size of VT again shown by the shaded portion. Overall, this reduces the rank of the product matrix providing a good approximation to matrix A. This type of compression is therefore lossy unless the dropped eigenvalues are all 0s.![]()
Noise Filtering/SVD Filtering
In any experiment related to data collection, measurement errors manifest in form of noise. The technique of reduced SVD discards insignificant eigenvalues thereby not only leading to dimensionality reduction but noise filtering as well! The details on the SVD reduction technique are given in the previous sections.
Latent Semantic Indexing/Analysis
SVD can be seen as a method for transforming correlated variables into a set of uncorrelated ones that better expose the various relationships among the original data items. With SVD, the representation of items that share substructure become more similar to each other and items that were dissimilar become more dissimilar as well. In practical terms, the method of SVD on a document corpus yields various topics for those documents. This aspect is the SVD learning technique along the lines of unsupervised clustering.
References
Singular Value Decomposition in Wikipedia
Introduction to the Singular Value Decomposition
Singular Value Decomposition Tutorial

Recent Comments