SystemML Beginner Tutorial

Level: Beginner   |   Time: 20 minutes


How to set up and run Apache SystemML locally.

Assumptions

If you haven’t run Apache SystemML before, make sure to set up your environment first.

1. Install Homebrew

# macOS
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
# Linux
ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Linuxbrew/install/master/install)"

2. Install Java

brew tap caskroom/cask
brew install Caskroom/cask/java

You will also need Python, along with Jupyter, matplotlib, and NumPy, for the notebook example later in this tutorial:

brew install python
pip install jupyter matplotlib numpy

Downloads

Download Spark and SystemML.

3. Download and Install Spark 1.6

brew tap homebrew/versions
brew install apache-spark16

Alternatively, you can download Spark directly.

4. Download and Install SystemML

If you are a Python user, we recommend that you download and install Apache SystemML via pip:

# Python 2
pip install systemml
# Bleeding edge: pip install git+git://github.com/apache/incubator-systemml.git#subdirectory=src/main/python
# Python 3:
pip3 install systemml
# Bleeding edge: pip3 install git+git://github.com/apache/incubator-systemml.git#subdirectory=src/main/python

Alternatively, if you intend to use SystemML via spark-shell (or spark-submit), you only need systemml-0.12.0-incubating.jar, which is packaged in our official binary release (systemml-0.12.0-incubating.zip). Note: if you have installed SystemML via pip, you can get the location of this jar by executing the following command:

# Python 2
python -c 'import imp; import os; print os.path.join(imp.find_module("systemml")[1], "systemml-java")'
# Python 3
python3 -c 'import imp, os; print(os.path.join(imp.find_module("systemml")[1], "systemml-java"))'

Ways to Use

You can use SystemML in one of the following ways:

  1. On a cluster (using our programmatic APIs).
  2. On a cluster (command-line batch mode).
  3. On a laptop (command-line batch mode), without installing Spark or Hadoop: please see our standalone mode tutorial.
  4. In-memory (as part of another Java application, for scoring): please see our JMLC tutorial.

Note that you can also run pyspark, spark-shell, and spark-submit on your laptop by passing the "--master local[*]" option.

5. In Jupyter Notebook

Get Started

Move to the folder where you saved your notebook, then start Jupyter by copying and pasting the command below into your terminal:

# Python 2:
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark --master local[*] --driver-class-path SystemML.jar --jars SystemML.jar --conf "spark.driver.memory=12g" --conf spark.driver.maxResultSize=0 --conf spark.akka.frameSize=128 --conf spark.default.parallelism=100
# Python 3:
PYSPARK_PYTHON=python3 PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark --master local[*] --driver-class-path SystemML.jar --jars SystemML.jar --conf "spark.driver.memory=12g" --conf spark.driver.maxResultSize=0 --conf spark.akka.frameSize=128 --conf spark.default.parallelism=100

6. Run SystemML in the Spark Shell

Start Spark Shell with SystemML

To use SystemML with the Spark Shell, reference the SystemML jar using the Spark Shell’s --jars option. Start the Spark Shell with SystemML by running the following command in your terminal:

spark-shell --executor-memory 4G --driver-memory 4G --jars SystemML.jar

Create the MLContext

To begin, create an MLContext by typing the code below. Once it succeeds, you should see a “Welcome to Apache SystemML!” message.

import org.apache.sysml.api.mlcontext._
import org.apache.sysml.api.mlcontext.ScriptFactory._
val ml = new MLContext(sc)

Hello World

The ScriptFactory class allows DML and PYDML scripts to be created from Strings, Files, URLs, and InputStreams. Here, we’ll use the dml method to create a DML “hello world” script from a String. We execute the script using MLContext’s execute method, which prints “hello world” to the console. The execute method returns an MLResults object, which contains no results since the script has no outputs.

val helloScript = dml("print('hello world')")
ml.execute(helloScript)
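
The same script can also be loaded from a file with ScriptFactory’s dmlFromFile method, which we use again in the RDD example later in this tutorial. Here is a minimal sketch; the hello.dml file name is just for illustration:

// Write the script to a local file, then build a Script object from that file
scala.tools.nsc.io.File("hello.dml").writeAll("print('hello world')")
val helloFromFile = dmlFromFile("hello.dml")
ml.execute(helloFromFile)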

DataFrame Example

As an example of how to use SystemML, we’ll first use Spark to create a DataFrame called df of random doubles from 0 to 1 consisting of 10,000 rows and 1,000 columns.

import org.apache.spark.sql._
import org.apache.spark.sql.types.{StructType,StructField,DoubleType}
import scala.util.Random
val numRows = 10000
val numCols = 1000
val data = sc.parallelize(0 to numRows-1).map { _ => Row.fromSeq(Seq.fill(numCols)(Random.nextDouble)) }
val schema = StructType((0 to numCols-1).map { i => StructField("C" + i, DoubleType, true) } )
val df = sqlContext.createDataFrame(data, schema)

We’ll create a DML script using the ScriptFactory dml method to find the minimum, maximum, and mean values in a matrix. This script has one input variable, matrix Xin, and three output variables, minOut, maxOut, and meanOut. For performance, we’ll specify metadata indicating that the matrix has 10,000 rows and 1,000 columns. We execute the script and obtain the results as a Tuple by calling getTuple on the results, specifying the types and names of the output variables.

val minMaxMean =
"""
minOut = min(Xin)
maxOut = max(Xin)
meanOut = mean(Xin)
"""
val mm = new MatrixMetadata(numRows, numCols)
val minMaxMeanScript = dml(minMaxMean).in("Xin", df, mm).out("minOut", "maxOut", "meanOut")
val (min, max, mean) = ml.execute(minMaxMeanScript).getTuple[Double, Double, Double]("minOut", "maxOut", "meanOut")

Many different types of input and output variables are automatically supported. These types include Boolean, Long, Double, String, Array[Array[Double]], RDD and JavaRDD in CSV (dense) and IJV (sparse) formats, DataFrame, BinaryBlockMatrix, Matrix, and Frame. RDDs and JavaRDDs are assumed to be in CSV format unless MatrixMetadata is supplied indicating IJV format.
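
For example, a matrix stored as an RDD of strings in IJV format can be passed in by supplying a MatrixMetadata object that declares the format. The sketch below assumes a MatrixMetadata constructor that takes a MatrixFormat value along with the matrix dimensions:

// MatrixFormat comes from org.apache.sysml.api.mlcontext, already imported above
// A 2x2 sparse matrix in IJV (row, column, value) format; unspecified cells are 0
val ijvRDD = sc.parallelize(Array("1 1 1.0", "2 2 4.0"))
val ijvMetadata = new MatrixMetadata(MatrixFormat.IJV, 2, 2)
val ijvScript = dml("total = sum(M)").in("M", ijvRDD, ijvMetadata).out("total")
val total = ml.execute(ijvScript).getDouble("total")  // expected: 5.0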

RDD Example

Let’s take a look at an example of input matrices as RDDs in CSV format. We’ll create two 2x2 matrices and input these into a DML script. This script will sum each matrix and create a message based on which sum is greater. We will output the sums and the message.

val rdd1 = sc.parallelize(Array("1.0,2.0", "3.0,4.0"))
val rdd2 = sc.parallelize(Array("5.0,6.0", "7.0,8.0"))
val sums = """
s1 = sum(m1);
s2 = sum(m2);
if (s1 > s2) {
message = "s1 is greater"
} else if (s2 > s1) {
message = "s2 is greater"
} else {
message = "s1 and s2 are equal"
}
"""
scala.tools.nsc.io.File("sums.dml").writeAll(sums)
val sumScript = dmlFromFile("sums.dml").in(Map("m1"-> rdd1, "m2"-> rdd2)).out("s1", "s2", "message")
val sumResults = ml.execute(sumScript)
val s1 = sumResults.getDouble("s1")
val s2 = sumResults.getDouble("s2")
val message = sumResults.getString("message")

Alternatively, we can supply a MatrixMetadata object for each input RDD and pass the inputs as a Seq of (name, value, metadata) triples, then retrieve all three outputs at once as a Tuple:

val rdd1Metadata = new MatrixMetadata(2, 2)
val rdd2Metadata = new MatrixMetadata(2, 2)
val sumScript = dmlFromFile("sums.dml").in(Seq(("m1", rdd1, rdd1Metadata), ("m2", rdd2, rdd2Metadata))).out("s1", "s2", "message")
val (firstSum, secondSum, sumMessage) = ml.execute(sumScript).getTuple[Double, Double, String]("s1", "s2", "message")

Congratulations! You’ve now run examples in Apache SystemML!