这段Scala代码的意思?class DefaultConfig extends Config(new WithNBigCores(1) ++ new BaseConfig)

DT大数据——第二堂课:Scala面向对象彻底精通及Spark源码阅读
SparkContext.scala实现了一个SparkContext的class和object,SparkContext类似Spark的入口,负责连接Spark集群,创建RDD,累积量和广播量等。
在Spark框架下该类在一个JVM中只加载一次。在加载类的阶段,SparkContext类中定义的属性,代码块,函数均被加载。
(1):class SparkContext(config:SparkConf) extends Logging with
ExecutorAllocationClient
类SparkContext的默认构造参数为SparkConf类型,&&
SparkContext extends了Logging,以及ExecutorAllocationClient
多个trait继承采用with连接,trait没有任何类参数,trait调用的方法是动态绑定的。
(2)private val creationSite:CallSite=Utils.getCallSite()
startTime=System.currentTimeMillis()
1.未加private的变量:使用val声明的字段,只有公有的getter方法,(getter和setter分别叫做creationSite和creationSite_=)
使用var声明的字段,getter和setter方法都是公有的。
2.夹private的变量:相对应的val和var声明的字段的getter或setter方法变成私有的方法
(3):private[spark] val stopped:AtomicBoolean=new
AtomicBoolean(false)
private[class_name]指定可以访问该字段的类,class_name必须是当前定义的类,或当前定义的类的外部类。会生成getter和setter方法
private[this]:只有同一个对象中可见,类私有基础之上的对象私有
(4):private def assertNotStopped():Unit
该方法为一个过程(process),因为返回值为Unit;同时为类的私有方法
(5):def this()=this(new SparkConf())主构造器
SparkContext类的构造器,默认参数为SparkConf类型的参数
this(config:SparkConf,preferredNodeLocationData:Map[String,Set[SplitInfo]])的定义需要首先调用this(config)超方法。
(6):private[spark] def this(master:String,appName:String)
spark类的私有构造方法
(7)@volatile private var _dagScheduler:DAGScheduler=_
private var _applicationId:String=_
@volatile注解,通知编译器,被注解的变量将被多个线程使用
这些变量都将在类加载时被实例化。
(8):第396-615行的try{}catch{}代码块。
其中的各种条件语句,属性的初始化,使用master创建taskSchedule等相应的参数。
(9):private[spark] def
withScope[U](body:=&U):U=RDDOperationScop.withScope[U](this)(body)
其中U代表类型,即自定义的类,或scala固有的类,body指代operation,一段代码段,SparkContext类中多处使用了
该函数。parallelize,range,makeRDD,textFile,wholeTextFiles,binaryFiles等函数
(10):def newAPIHadoopFile[K, V, F &: NewInputFormat[K,
&&& fClass:
&&& kClass:
&&& vClass:
Configuration = hadoopConfiguration): RDD[(K, V)]
函数声明说明:调用newAPIHadoopFile[LongWritable, Text,
TextInputFormat]("hdfs://ip:port/path/to/file")
path:待读取的文件;conf:hadoop配置文件,fClass:InputFormat输入的数据格式
kClass:输入格式的key的类型;vClass:输入格式的value的类型
(11):def sequenceFile[K, V]
(path: String, minPartitions: Int = defaultMinPartitions)
(implicit km: ClassTag[K], vm: ClassTag[V],
kcf: () =& WritableConverter[K], vcf: () =&
WritableConverter[V]): RDD[(K, V)]
该函数中有默认参数设定,以及一个隐式的转换
(12):def stop() 关闭SparkContext
&runJob,submitJob,cancelJobGroup,cancelJOb,cancelStage,clean,setupAndStartListenerBus等操作
(13):类WritableConverter和object WritableConverter 中包含隐式转换操作
def stringWritableConverter():WritableConverter[String]
WritableConverter转换操作
(14):createTaskScheduler创建任务调度器
(15):类WritableFactory和object WritableFactory中包含了隐式工厂操作
implicit def longWritableFactory:WritableFactory[Long] 隐式操作
(16):object SparkMasterRegex 用于模式匹配
RDD 抽象类abstract,extends Serializable with Logging
(1):final 标示的函数和属性均不可被覆写
(2):对于继承抽象类的子类对父类中的方法进行覆写时,需要加override标示
RDD抽象类被其他的RDD类,如HadoopRDD,继承,在子类中对父类的方法进行覆写,以适用于自身的各种RDD操作
排序,map,reduce操作等
以上网友发言只代表其个人观点,不代表新浪网的观点或立场。Spark SQL and DataFrames - Spark 2.3.0 Documentation
Spark SQL, DataFrames and Datasets Guide
Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided
by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally,
Spark SQL uses this extra information to perform extra optimizations. There are several ways to
interact with Spark SQL including SQL and the Dataset API. When computing a result
the same execution engine is used, independent of which API/language you are using to express the
computation. This unification means that developers can easily switch back and forth between
different APIs based on which provides the most natural way to express a given transformation.
All of the examples on this page use sample data included in the Spark distribution and can be run in
the spark-shell, pyspark shell, or sparkR shell.
One use of Spark SQL is to execute SQL queries.
Spark SQL can also be used to read data from an existing Hive installation. For more on how to
configure this feature, please refer to the
section. When running
SQL from within another programming language the results will be returned as a .
You can also interact with the SQL interface using the
Datasets and DataFrames
A Dataset is a distributed collection of data.
Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong
typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized
execution engine. A Dataset can be
from JVM objects and then
manipulated using functional transformations (map, flatMap, filter, etc.).
The Dataset API is available in
. Python does not have the support for the Dataset API. But due to Python’s dynamic nature,
many of the benefits of the Dataset API are already available (i.e. you can access the field of a row by name naturally
row.columnName). The case for R is similar.
A DataFrame is a Dataset organized into named columns. It is conceptually
equivalent to a table in a relational database or a data frame in R/Python, but with richer
optimizations under the hood. DataFrames can be constructed from a wide array of
as: structured data files, tables in Hive, external databases, or existing RDDs.
The DataFrame API is available in Scala,
Java, , and .
In Scala and Java, a DataFrame is represented by a Dataset of Rows.
In , DataFrame is simply a type alias of Dataset[Row].
While, in , users need to use Dataset&Row& to represent a DataFrame.
Throughout this document, we will often refer to Scala/Java Datasets of Rows as DataFrames.
Getting Started
Starting Point: SparkSession
The entry point into all functionality in Spark is the
class. To create a basic SparkSession, just use SparkSession.builder():
import org.apache.spark.sql.SparkSession
val spark = SparkSession
.builder()
.appName(&Spark SQL basic example&)
.config(&spark.some.config.option&, &some-value&)
.getOrCreate()
// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._
Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala" in the Spark repo.
The entry point into all functionality in Spark is the
class. To create a basic SparkSession, just use SparkSession.builder():
import org.apache.spark.sql.SparkSession;
SparkSession spark = SparkSession
.builder()
.appName(&Java Spark SQL basic example&)
.config(&spark.some.config.option&, &some-value&)
.getOrCreate();
Find full example code at "examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLExample.java" in the Spark repo.
The entry point into all functionality in Spark is the
class. To create a basic SparkSession, just use SparkSession.builder:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName(&Python Spark SQL basic example&) \
.config(&spark.some.config.option&, &some-value&) \
.getOrCreate()
Find full example code at "examples/src/main/python/sql/basic.py" in the Spark repo.
The entry point into all functionality in Spark is the
class. To initialize a basic SparkSession, just call sparkR.session():
sparkR.session(appName = &R Spark SQL basic example&, sparkConfig = list(spark.some.config.option = &some-value&))
Find full example code at "examples/src/main/r/RSparkSQLExample.R" in the Spark repo.
Note that when invoked for the first time, sparkR.session() initializes a global SparkSession singleton instance, and always returns a reference to this instance for successive invocations. In this way, users only need to initialize the SparkSession once, then SparkR functions like read.df will be able to access this global instance implicitly, and users don’t need to pass the SparkSession instance around.
SparkSession in Spark 2.0 provides builtin support for Hive features including the ability to
write queries using HiveQL, access to Hive UDFs, and the ability to read data from Hive tables.
To use these features, you do not need to have an existing Hive setup.
Creating DataFrames
With a SparkSession, applications can create DataFrames from an ,
from a Hive table, or from .
As an example, the following creates a DataFrame based on the content of a JSON file:
val df = spark.read.json(&examples/src/main/resources/people.json&)
// Displays the content of the DataFrame to stdout
// +----+-------+
// +----+-------+
// |null|Michael|
19| Justin|
// +----+-------+
Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala" in the Spark repo.
With a SparkSession, applications can create DataFrames from an ,
from a Hive table, or from .
As an example, the following creates a DataFrame based on the content of a JSON file:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
Dataset&Row& df = spark.read().json(&examples/src/main/resources/people.json&);
// Displays the content of the DataFrame to stdout
df.show();
// +----+-------+
// +----+-------+
// |null|Michael|
19| Justin|
// +----+-------+
Find full example code at "examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLExample.java" in the Spark repo.
With a SparkSession, applications can create DataFrames from an ,
from a Hive table, or from .
As an example, the following creates a DataFrame based on the content of a JSON file:
# spark is an existing SparkSession
df = spark.read.json(&examples/src/main/resources/people.json&)
# Displays the content of the DataFrame to stdout
# +----+-------+
# +----+-------+
# |null|Michael|
19| Justin|
# +----+-------+
Find full example code at "examples/src/main/python/sql/basic.py" in the Spark repo.
With a SparkSession, applications can create DataFrames from a local R data.frame,
from a Hive table, or from .
As an example, the following creates a DataFrame based on the content of a JSON file:
df &- read.json(&examples/src/main/resources/people.json&)
# Displays the content of the DataFrame
NA Michael
# Another method to print the first few rows and optionally truncate the printing of long values
showDF(df)
## +----+-------+
## +----+-------+
## |null|Michael|
19| Justin|
## +----+-------+
Find full example code at "examples/src/main/r/RSparkSQLExample.R" in the Spark repo.
Untyped Dataset Operations (aka DataFrame Operations)
DataFrames provide a domain-specific language for structured data manipulation in , ,
As mentioned above, in Spark 2.0, DataFrames are just Dataset of Rows in Scala and Java API. These operations are also referred as “untyped transformations” in contrast to “typed transformations” come with strongly typed Scala/Java Datasets.
Here we include some basic examples of structured data processing using Datasets:
// This import is needed to use the $-notation
import spark.implicits._
// Print the schema in a tree format
df.printSchema()
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)
// Select only the &name& column
df.select(&name&).show()
// +-------+
// +-------+
// |Michael|
// | Justin|
// +-------+
// Select everybody, but increment the age by 1
df.select($&name&, $&age& + 1).show()
// +-------+---------+
name|(age + 1)|
// +-------+---------+
// |Michael|
// | Justin|
// +-------+---------+
// Select people older than 21
df.filter($&age& & 21).show()
// +---+----+
// |age|name|
// +---+----+
// | 30|Andy|
// +---+----+
// Count people by age
df.groupBy(&age&).count().show()
// +----+-----+
// | age|count|
// +----+-----+
// +----+-----+
Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala" in the Spark repo.
For a complete list of the types of operations that can be performed on a Dataset refer to the .
In addition to simple column references and expressions, Datasets also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the .
// col(&...&) is preferable to df.col(&...&)
import static org.apache.spark.sql.functions.col;
// Print the schema in a tree format
df.printSchema();
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)
// Select only the &name& column
df.select(&name&).show();
// +-------+
// +-------+
// |Michael|
// | Justin|
// +-------+
// Select everybody, but increment the age by 1
df.select(col(&name&), col(&age&).plus(1)).show();
// +-------+---------+
name|(age + 1)|
// +-------+---------+
// |Michael|
// | Justin|
// +-------+---------+
// Select people older than 21
df.filter(col(&age&).gt(21)).show();
// +---+----+
// |age|name|
// +---+----+
// | 30|Andy|
// +---+----+
// Count people by age
df.groupBy(&age&).count().show();
// +----+-----+
// | age|count|
// +----+-----+
// +----+-----+
Find full example code at "examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLExample.java" in the Spark repo.
For a complete list of the types of operations that can be performed on a Dataset refer to the .
In addition to simple column references and expressions, Datasets also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the .
In Python it’s possible to access a DataFrame’s columns either by attribute
(df.age) or by indexing (df['age']). While the former is convenient for
interactive data exploration, users are highly encouraged to use the
latter form, which is future proof and won’t break with column names that
are also attributes on the DataFrame class.
# spark, df are from the previous example
# Print the schema in a tree format
df.printSchema()
# |-- age: long (nullable = true)
# |-- name: string (nullable = true)
# Select only the &name& column
df.select(&name&).show()
# +-------+
# +-------+
# |Michael|
# | Justin|
# +-------+
# Select everybody, but increment the age by 1
df.select(df['name'], df['age'] + 1).show()
# +-------+---------+
name|(age + 1)|
# +-------+---------+
# |Michael|
# | Justin|
# +-------+---------+
# Select people older than 21
df.filter(df['age'] & 21).show()
# +---+----+
# |age|name|
# +---+----+
# | 30|Andy|
# +---+----+
# Count people by age
df.groupBy(&age&).count().show()
# +----+-----+
# | age|count|
# +----+-----+
# +----+-----+
Find full example code at "examples/src/main/python/sql/basic.py" in the Spark repo.
For a complete list of the types of operations that can be performed on a DataFrame refer to the .
In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the .
# Create the DataFrame
df &- read.json(&examples/src/main/resources/people.json&)
# Show the content of the DataFrame
NA Michael
# Print the schema in a tree format
printSchema(df)
## |-- age: long (nullable = true)
## |-- name: string (nullable = true)
# Select only the &name& column
head(select(df, &name&))
## 1 Michael
# Select everybody, but increment the age by 1
head(select(df, df$name, df$age + 1))
name (age + 1.0)
## 1 Michael
# Select people older than 21
head(where(df, df$age & 21))
# Count people by age
head(count(groupBy(df, &age&)))
Find full example code at "examples/src/main/r/RSparkSQLExample.R" in the Spark repo.
For a complete list of the types of operations that can be performed on a DataFrame refer to the .
In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the .
Running SQL Queries Programmatically
The sql function on a SparkSession enables applications to run SQL queries programmatically and returns the result as a DataFrame.
// Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView(&people&)
val sqlDF = spark.sql(&SELECT * FROM people&)
sqlDF.show()
// +----+-------+
// +----+-------+
// |null|Michael|
19| Justin|
// +----+-------+
Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala" in the Spark repo.
The sql function on a SparkSession enables applications to run SQL queries programmatically and returns the result as a Dataset&Row&.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
// Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView(&people&);
Dataset&Row& sqlDF = spark.sql(&SELECT * FROM people&);
sqlDF.show();
// +----+-------+
// +----+-------+
// |null|Michael|
19| Justin|
// +----+-------+
Find full example code at "examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLExample.java" in the Spark repo.
The sql function on a SparkSession enables applications to run SQL queries programmatically and returns the result as a DataFrame.
# Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView(&people&)
sqlDF = spark.sql(&SELECT * FROM people&)
sqlDF.show()
# +----+-------+
# +----+-------+
# |null|Michael|
19| Justin|
# +----+-------+
Find full example code at "examples/src/main/python/sql/basic.py" in the Spark repo.
The sql function enables applications to run SQL queries programmatically and returns the result as a SparkDataFrame.
df &- sql(&SELECT * FROM table&)
Find full example code at "examples/src/main/r/RSparkSQLExample.R" in the Spark repo.
Global Temporary View
Temporary views in Spark SQL are session-scoped and will disappear if the session that creates it
terminates. If you want to have a temporary view that is shared among all sessions and keep alive
until the Spark application terminates, you can create a global temporary view. Global temporary
view is tied to a system preserved database global_temp, and we must use the qualified name to
refer it, e.g. SELECT * FROM global_temp.view1.
// Register the DataFrame as a global temporary view
df.createGlobalTempView(&people&)
// Global temporary view is tied to a system preserved database `global_temp`
spark.sql(&SELECT * FROM global_temp.people&).show()
// +----+-------+
// +----+-------+
// |null|Michael|
19| Justin|
// +----+-------+
// Global temporary view is cross-session
spark.newSession().sql(&SELECT * FROM global_temp.people&).show()
// +----+-------+
// +----+-------+
// |null|Michael|
19| Justin|
// +----+-------+
Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala" in the Spark repo.
// Register the DataFrame as a global temporary view
df.createGlobalTempView(&people&);
// Global temporary view is tied to a system preserved database `global_temp`
spark.sql(&SELECT * FROM global_temp.people&).show();
// +----+-------+
// +----+-------+
// |null|Michael|
19| Justin|
// +----+-------+
// Global temporary view is cross-session
spark.newSession().sql(&SELECT * FROM global_temp.people&).show();
// +----+-------+
// +----+-------+
// |null|Michael|
19| Justin|
// +----+-------+
Find full example code at "examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLExample.java" in the Spark repo.
# Register the DataFrame as a global temporary view
df.createGlobalTempView(&people&)
# Global temporary view is tied to a system preserved database `global_temp`
spark.sql(&SELECT * FROM global_temp.people&).show()
# +----+-------+
# +----+-------+
# |null|Michael|
19| Justin|
# +----+-------+
# Global temporary view is cross-session
spark.newSession().sql(&SELECT * FROM global_temp.people&).show()
# +----+-------+
# +----+-------+
# |null|Michael|
19| Justin|
# +----+-------+
Find full example code at "examples/src/main/python/sql/basic.py" in the Spark repo.
CREATE GLOBAL TEMPORARY VIEW temp_view AS SELECT a + 1, b * 2 FROM tbl
SELECT * FROM global_temp.temp_view
Creating Datasets
Datasets are similar to RDDs, however, instead of using Java serialization or Kryo they use
a specialized
to serialize the objects
for processing or transmitting over the network. While both encoders and standard serialization are
responsible for turning an object into bytes, encoders are code generated dynamically and use a format
that allows Spark to perform many operations like filtering, sorting and hashing without deserializing
the bytes back into an object.
case class Person(name: String, age: Long)
// Encoders are created for case classes
val caseClassDS = Seq(Person(&Andy&, 32)).toDS()
caseClassDS.show()
// +----+---+
// |name|age|
// +----+---+
// |Andy| 32|
// +----+---+
// Encoders for most common types are automatically provided by importing spark.implicits._
val primitiveDS = Seq(1, 2, 3).toDS()
primitiveDS.map(_ + 1).collect() // Returns: Array(2, 3, 4)
// DataFrames can be converted to a Dataset by providing a class. Mapping will be done by name
val path = &examples/src/main/resources/people.json&
val peopleDS = spark.read.json(path).as[Person]
peopleDS.show()
// +----+-------+
// +----+-------+
// |null|Michael|
19| Justin|
// +----+-------+
Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala" in the Spark repo.
import java.util.Arrays;
import java.util.Collections;
import java.io.Serializable;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
public static class Person implements Serializable {
private String name;
private int age;
public String getName() {
return name;
public void setName(String name) {
this.name = name;
public int getAge() {
return age;
public void setAge(int age) {
this.age = age;
// Create an instance of a Bean class
Person person = new Person();
person.setName(&Andy&);
person.setAge(32);
// Encoders are created for Java beans
Encoder&Person& personEncoder = Encoders.bean(Person.class);
Dataset&Person& javaBeanDS = spark.createDataset(
Collections.singletonList(person),
personEncoder
javaBeanDS.show();
// +---+----+
// |age|name|
// +---+----+
// | 32|Andy|
// +---+----+
// Encoders for most common types are provided in class Encoders
Encoder&Integer& integerEncoder = Encoders.INT();
Dataset&Integer& primitiveDS = spark.createDataset(Arrays.asList(1, 2, 3), integerEncoder);
Dataset&Integer& transformedDS = primitiveDS.map(
(MapFunction&Integer, Integer&) value -& value + 1,
integerEncoder);
transformedDS.collect(); // Returns [2, 3, 4]
// DataFrames can be converted to a Dataset by providing a class. Mapping based on name
String path = &examples/src/main/resources/people.json&;
Dataset&Person& peopleDS = spark.read().json(path).as(personEncoder);
peopleDS.show();
// +----+-------+
// +----+-------+
// |null|Michael|
19| Justin|
// +----+-------+
Find full example code at "examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLExample.java" in the Spark repo.
Interoperating with RDDs
Spark SQL supports two different methods for converting existing RDDs into Datasets. The first
method uses reflection to infer the schema of an RDD that contains specific types of objects. This
reflection based approach leads to more concise code and works well when you already know the schema
while writing your Spark application.
The second method for creating Datasets is through a programmatic interface that allows you to
construct a schema and then apply it to an existing RDD. While this method is more verbose, it allows
you to construct Datasets when the columns and their types are not known until runtime.
Inferring the Schema Using Reflection
The Scala interface for Spark SQL supports automatically converting an RDD containing case classes
to a DataFrame. The case class
defines the schema of the table. The names of the arguments to the case class are read using
reflection and become the names of the columns. Case classes can also be nested or contain complex
types such as Seqs or Arrays. This RDD can be implicitly converted to a DataFrame and then be
registered as a table. Tables can be used in subsequent SQL statements.
// For implicit conversions from RDDs to DataFrames
import spark.implicits._
// Create an RDD of Person objects from a text file, convert it to a Dataframe
val peopleDF = spark.sparkContext
.textFile(&examples/src/main/resources/people.txt&)
.map(_.split(&,&))
.map(attributes =& Person(attributes(0), attributes(1).trim.toInt))
// Register the DataFrame as a temporary view
peopleDF.createOrReplaceTempView(&people&)
// SQL statements can be run by using the sql methods provided by Spark
val teenagersDF = spark.sql(&SELECT name, age FROM people WHERE age BETWEEN 13 AND 19&)
// The columns of a row in the result can be accessed by field index
teenagersDF.map(teenager =& &Name: & + teenager(0)).show()
// +------------+
// +------------+
// |Name: Justin|
// +------------+
// or by field name
teenagersDF.map(teenager =& &Name: & + teenager.getAs[String](&name&)).show()
// +------------+
// +------------+
// |Name: Justin|
// +------------+
// No pre-defined encoders for Dataset[Map[K,V]], define explicitly
implicit val mapEncoder = org.apache.spark.sql.Encoders.kryo[Map[String, Any]]
// Primitive types and case classes can be also defined as
// implicit val stringIntMapEncoder: Encoder[Map[String, Any]] = ExpressionEncoder()
// row.getValuesMap[T] retrieves multiple columns at once into a Map[String, T]
teenagersDF.map(teenager =& teenager.getValuesMap[Any](List(&name&, &age&))).collect()
// Array(Map(&name& -& &Justin&, &age& -& 19))
Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala" in the Spark repo.
Spark SQL supports automatically converting an RDD of
into a DataFrame.
The BeanInfo, obtained using reflection, defines the schema of the table. Currently, Spark SQL
does not support JavaBeans that contain Map field(s). Nested JavaBeans and List or Array
fields are supported though. You can create a JavaBean by creating a class that implements
Serializable and has getters and setters for all of its fields.
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
// Create an RDD of Person objects from a text file
JavaRDD&Person& peopleRDD = spark.read()
.textFile(&examples/src/main/resources/people.txt&)
.javaRDD()
.map(line -& {
String[] parts = line.split(&,&);
Person person = new Person();
person.setName(parts[0]);
person.setAge(Integer.parseInt(parts[1].trim()));
return person;
// Apply a schema to an RDD of JavaBeans to get a DataFrame
Dataset&Row& peopleDF = spark.createDataFrame(peopleRDD, Person.class);
// Register the DataFrame as a temporary view
peopleDF.createOrReplaceTempView(&people&);
// SQL statements can be run by using the sql methods provided by spark
Dataset&Row& teenagersDF = spark.sql(&SELECT name FROM people WHERE age BETWEEN 13 AND 19&);
// The columns of a row in the result can be accessed by field index
Encoder&String& stringEncoder = Encoders.STRING();
Dataset&String& teenagerNamesByIndexDF = teenagersDF.map(
(MapFunction&Row, String&) row -& &Name: & + row.getString(0),
stringEncoder);
teenagerNamesByIndexDF.show();
// +------------+
// +------------+
// |Name: Justin|
// +------------+
// or by field name
Dataset&String& teenagerNamesByFieldDF = teenagersDF.map(
(MapFunction&Row, String&) row -& &Name: & + row.&String&getAs(&name&),
stringEncoder);
teenagerNamesByFieldDF.show();
// +------------+
// +------------+
// |Name: Justin|
// +------------+
Find full example code at "examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLExample.java" in the Spark repo.
Spark SQL can convert an RDD of Row objects to a DataFrame, inferring the datatypes. Rows are constructed by passing a list of
key/value pairs as kwargs to the Row class. The keys of this list define the column names of the table,
and the types are inferred by sampling the whole dataset, similar to the inference that is performed on JSON files.
from pyspark.sql import Row
sc = spark.sparkContext
# Load a text file and convert each line to a Row.
lines = sc.textFile(&examples/src/main/resources/people.txt&)
parts = lines.map(lambda l: l.split(&,&))
people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))
# Infer the schema, and register the DataFrame as a table.
schemaPeople = spark.createDataFrame(people)
schemaPeople.createOrReplaceTempView(&people&)
# SQL can be run over DataFrames that have been registered as a table.
teenagers = spark.sql(&SELECT name FROM people WHERE age &= 13 AND age &= 19&)
# The results of SQL queries are Dataframe objects.
# rdd returns the content as an :class:`pyspark.RDD` of :class:`Row`.
teenNames = teenagers.rdd.map(lambda p: &Name: & + p.name).collect()
for name in teenNames:
print(name)
# Name: Justin
Find full example code at "examples/src/main/python/sql/basic.py" in the Spark repo.
Programmatically Specifying the Schema
When case classes cannot be defined ahead of time (for example,
the structure of records is encoded in a string, or a text dataset will be parsed
and fields will be projected differently for different users),
a DataFrame can be created programmatically with three steps.
Create an RDD of Rows from the original RDD;
Create the schema represented by a StructType matching the structure of
Rows in the RDD created in Step 1.
Apply the schema to the RDD of Rows via createDataFrame method provided
by SparkSession.
For example:
import org.apache.spark.sql.types._
// Create an RDD
val peopleRDD = spark.sparkContext.textFile(&examples/src/main/resources/people.txt&)
// The schema is encoded in a string
val schemaString = &name age&
// Generate the schema based on the string of schema
val fields = schemaString.split(& &)
.map(fieldName =& StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)
// Convert records of the RDD (people) to Rows
val rowRDD = peopleRDD
.map(_.split(&,&))
.map(attributes =& Row(attributes(0), attributes(1).trim))
// Apply the schema to the RDD
val peopleDF = spark.createDataFrame(rowRDD, schema)
// Creates a temporary view using the DataFrame
peopleDF.createOrReplaceTempView(&people&)
// SQL can be run over a temporary view created using DataFrames
val results = spark.sql(&SELECT name FROM people&)
// The results of SQL queries are DataFrames and support all the normal RDD operations
// The columns of a row in the result can be accessed by field index or by field name
results.map(attributes =& &Name: & + attributes(0)).show()
// +-------------+
// +-------------+
// |Name: Michael|
Name: Andy|
// | Name: Justin|
// +-------------+
Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala" in the Spark repo.
When JavaBean classes cannot be defined ahead of time (for example,
the structure of records is encoded in a string, or a text dataset will be parsed and
fields will be projected differently for different users),
a Dataset&Row& can be created programmatically with three steps.
Create an RDD of Rows from the original RDD;
Create the schema represented by a StructType matching the structure of
Rows in the RDD created in Step 1.
Apply the schema to the RDD of Rows via createDataFrame method provided
by SparkSession.
For example:
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
// Create an RDD
JavaRDD&String& peopleRDD = spark.sparkContext()
.textFile(&examples/src/main/resources/people.txt&, 1)
.toJavaRDD();
// The schema is encoded in a string
String schemaString = &name age&;
// Generate the schema based on the string of schema
List&StructField& fields = new ArrayList&&();
for (String fieldName : schemaString.split(& &)) {
StructField field = DataTypes.createStructField(fieldName, DataTypes.StringType, true);
fields.add(field);
StructType schema = DataTypes.createStructType(fields);
// Convert records of the RDD (people) to Rows
JavaRDD&Row& rowRDD = peopleRDD.map((Function&String, Row&) record -& {
String[] attributes = record.split(&,&);
return RowFactory.create(attributes[0], attributes[1].trim());
// Apply the schema to the RDD
Dataset&Row& peopleDataFrame = spark.createDataFrame(rowRDD, schema);
// Creates a temporary view using the DataFrame
peopleDataFrame.createOrReplaceTempView(&people&);
// SQL can be run over a temporary view created using DataFrames
Dataset&Row& results = spark.sql(&SELECT name FROM people&);
// The results of SQL queries are DataFrames and support all the normal RDD operations
// The columns of a row in the result can be accessed by field index or by field name
Dataset&String& namesDS = results.map(
(MapFunction&Row, String&) row -& &Name: & + row.getString(0),
Encoders.STRING());
namesDS.show();
// +-------------+
// +-------------+
// |Name: Michael|
Name: Andy|
// | Name: Justin|
// +-------------+
Find full example code at "examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLExample.java" in the Spark repo.
When a dictionary of kwargs cannot be defined ahead of time (for example,
the structure of records is encoded in a string, or a text dataset will be parsed and
fields will be projected differently for different users),
a DataFrame can be created programmatically with three steps.
Create an RDD of tuples or lists from the original RDD;
Create the schema represented by a StructType matching the structure of
tuples or lists in the RDD created in the step 1.
Apply the schema to the RDD via createDataFrame method provided by SparkSession.
For example:
# Import data types
from pyspark.sql.types import *
sc = spark.sparkContext
# Load a text file and convert each line to a Row.
lines = sc.textFile(&examples/src/main/resources/people.txt&)
parts = lines.map(lambda l: l.split(&,&))
# Each line is converted to a tuple.
people = parts.map(lambda p: (p[0], p[1].strip()))
# The schema is encoded in a string.
schemaString = &name age&
fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)
# Apply the schema to the RDD.
schemaPeople = spark.createDataFrame(people, schema)
# Creates a temporary view using the DataFrame
schemaPeople.createOrReplaceTempView(&people&)
# SQL can be run over DataFrames that have been registered as a table.
results = spark.sql(&SELECT name FROM people&)
results.show()
# +-------+
# +-------+
# |Michael|
# | Justin|
# +-------+
Find full example code at "examples/src/main/python/sql/basic.py" in the Spark repo.
Aggregations
provide common
aggregations such as count(), countDistinct(), avg(), max(), min(), etc.
While those functions are designed for DataFrames, Spark SQL also has type-safe versions for some of them in
to work with strongly typed Datasets.
Moreover, users are not limited to the predefined aggregate functions and can create their own.
Untyped User-Defined Aggregate Functions
Users have to extend the
abstract class to implement a custom untyped aggregate function. For example, a user-defined average
can look like:
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.expressions.MutableAggregationBuffer
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
import org.apache.spark.sql.types._
object MyAverage extends UserDefinedAggregateFunction {
// Data types of input arguments of this aggregate function
def inputSchema: StructType = StructType(StructField(&inputColumn&, LongType) :: Nil)
// Data types of values in the aggregation buffer
def bufferSchema: StructType = {
StructType(StructField(&sum&, LongType) :: StructField(&count&, LongType) :: Nil)
// The data type of the returned value
def dataType: DataType = DoubleType
// Whether this function always returns the same output on the identical input
def deterministic: Boolean = true
// Initializes the given aggregation buffer. The buffer itself is a `Row` that in addition to
// standard methods like retrieving a value at an index (e.g., get(), getBoolean()), provides
// the opportunity to update its values. Note that arrays and maps inside the buffer are still
// immutable.
def initialize(buffer: MutableAggregationBuffer): Unit = {
buffer(0) = 0L
buffer(1) = 0L
// Updates the given aggregation buffer `buffer` with new input data from `input`
def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
if (!input.isNullAt(0)) {
buffer(0) = buffer.getLong(0) + input.getLong(0)
buffer(1) = buffer.getLong(1) + 1
// Merges two aggregation buffers and stores the updated buffer values back to `buffer1`
def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
buffer1(1) = buffer1.getLong(1) + buffer2.getLong(1)
// Calculates the final result
def evaluate(buffer: Row): Double = buffer.getLong(0).toDouble / buffer.getLong(1)
// Register the function to access it
spark.udf.register(&myAverage&, MyAverage)
val df = spark.read.json(&examples/src/main/resources/employees.json&)
df.createOrReplaceTempView(&employees&)
// +-------+------+
name|salary|
// +-------+------+
// |Michael|
// | Justin|
// +-------+------+
val result = spark.sql(&SELECT myAverage(salary) as average_salary FROM employees&)
result.show()
// +--------------+
// |average_salary|
// +--------------+
// +--------------+
Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/UserDefinedUntypedAggregation.scala" in the Spark repo.
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.expressions.MutableAggregationBuffer;
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
public static class MyAverage extends UserDefinedAggregateFunction {
private StructType inputSchema;
private StructType bufferSchema;
public MyAverage() {
List&StructField& inputFields = new ArrayList&&();
inputFields.add(DataTypes.createStructField(&inputColumn&, DataTypes.LongType, true));
inputSchema = DataTypes.createStructType(inputFields);
List&StructField& bufferFields = new ArrayList&&();
bufferFields.add(DataTypes.createStructField(&sum&, DataTypes.LongType, true));
bufferFields.add(DataTypes.createStructField(&count&, DataTypes.LongType, true));
bufferSchema = DataTypes.createStructType(bufferFields);
// Data types of input arguments of this aggregate function
public StructType inputSchema() {
return inputSchema;
// Data types of values in the aggregation buffer
public StructType bufferSchema() {
return bufferSchema;
// The data type of the returned value
public DataType dataType() {
return DataTypes.DoubleType;
// Whether this function always returns the same output on the identical input
public boolean deterministic() {
return true;
// Initializes the given aggregation buffer. The buffer itself is a `Row` that in addition to
// standard methods like retrieving a value at an index (e.g., get(), getBoolean()), provides
// the opportunity to update its values. Note that arrays and maps inside the buffer are still
// immutable.
public void initialize(MutableAggregationBuffer buffer) {
buffer.update(0, 0L);
buffer.update(1, 0L);
// Updates the given aggregation buffer `buffer` with new input data from `input`
public void update(MutableAggregationBuffer buffer, Row input) {
if (!input.isNullAt(0)) {
long updatedSum = buffer.getLong(0) + input.getLong(0);
long updatedCount = buffer.getLong(1) + 1;
buffer.update(0, updatedSum);
buffer.update(1, updatedCount);
// Merges two aggregation buffers and stores the updated buffer values back to `buffer1`
public void merge(MutableAggregationBuffer buffer1, Row buffer2) {
long mergedSum = buffer1.getLong(0) + buffer2.getLong(0);
long mergedCount = buffer1.getLong(1) + buffer2.getLong(1);
buffer1.update(0, mergedSum);
buffer1.update(1, mergedCount);
// Calculates the final result
public Double evaluate(Row buffer) {
return ((double) buffer.getLong(0)) / buffer.getLong(1);
// Register the function to access it
spark.udf().register(&myAverage&, new MyAverage());
Dataset&Row& df = spark.read().json(&examples/src/main/resources/employees.json&);
df.createOrReplaceTempView(&employees&);
df.show();
// +-------+------+
name|salary|
// +-------+------+
// |Michael|
// | Justin|
// +-------+------+
Dataset&Row& result = spark.sql(&SELECT myAverage(salary) as average_salary FROM employees&);
result.show();
// +--------------+
// |average_salary|
// +--------------+
// +--------------+
Find full example code at "examples/src/main/java/org/apache/spark/examples/sql/JavaUserDefinedUntypedAggregation.java" in the Spark repo.
Type-Safe User-Defined Aggregate Functions
User-defined aggregations for strongly typed Datasets revolve around the
abstract class.
For example, a type-safe user-defined average can look like:
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator
case class Employee(name: String, salary: Long)
case class Average(var sum: Long, var count: Long)
object MyAverage extends Aggregator[Employee, Average, Double] {
// A zero value for this aggregation. Should satisfy the property that any b + zero = b
def zero: Average = Average(0L, 0L)
// Combine two values to produce a new value. For performance, the function may modify `buffer`
// and return it instead of constructing a new object
def reduce(buffer: Average, employee: Employee): Average = {
buffer.sum += employee.salary
buffer.count += 1
// Merge two intermediate values
def merge(b1: Average, b2: Average): Average = {
b1.sum += b2.sum
b1.count += b2.count
// Transform the output of the reduction
def finish(reduction: Average): Double = reduction.sum.toDouble / reduction.count
// Specifies the Encoder for the intermediate value type
def bufferEncoder: Encoder[Average] = Encoders.product
// Specifies the Encoder for the final output value type
def outputEncoder: Encoder[Double] = Encoders.scalaDouble
val ds = spark.read.json(&examples/src/main/resources/employees.json&).as[Employee]
// +-------+------+
name|salary|
// +-------+------+
// |Michael|
// | Justin|
// +-------+------+
// Convert the function to a `TypedColumn` and give it a name
val averageSalary = MyAverage.toColumn.name(&average_salary&)
val result = ds.select(averageSalary)
result.show()
// +--------------+
// |average_salary|
// +--------------+
// +--------------+
Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/UserDefinedTypedAggregation.scala" in the Spark repo.
import java.io.Serializable;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.TypedColumn;
import org.apache.spark.sql.expressions.Aggregator;
public static class Employee implements Serializable {
private String name;
private long salary;
// Constructors, getters, setters...
public static class Average implements Serializable
private long sum;
private long count;
// Constructors, getters, setters...
public static class MyAverage extends Aggregator&Employee, Average, Double& {
// A zero value for this aggregation. Should satisfy the property that any b + zero = b
public Average zero() {
return new Average(0L, 0L);
// Combine two values to produce a new value. For performance, the function may modify `buffer`
// and return it instead of constructing a new object
public Average reduce(Average buffer, Employee employee) {
long newSum = buffer.getSum() + employee.getSalary();
long newCount = buffer.getCount() + 1;
buffer.setSum(newSum);
buffer.setCount(newCount);
return buffer;
// Merge two intermediate values
public Average merge(Average b1, Average b2) {
long mergedSum = b1.getSum() + b2.getSum();
long mergedCount = b1.getCount() + b2.getCount();
b1.setSum(mergedSum);
b1.setCount(mergedCount);
return b1;
// Transform the output of the reduction
public Double finish(Average reduction) {
return ((double) reduction.getSum()) / reduction.getCount();
// Specifies the Encoder for the intermediate value type
public Encoder&Average& bufferEncoder() {
return Encoders.bean(Average.class);
// Specifies the Encoder for the final output value type
public Encoder&Double& outputEncoder() {
return Encoders.DOUBLE();
Encoder&Employee& employeeEncoder = Encoders.bean(Employee.class);
String path = &examples/src/main/resources/employees.json&;
Dataset&Employee& ds = spark.read().json(path).as(employeeEncoder);
ds.show();
// +-------+------+
name|salary|
// +-------+------+
// |Michael|
// | Justin|
// +-------+------+
MyAverage myAverage = new MyAverage();
// Convert the function to a `TypedColumn` and give it a name
TypedColumn&Employee, Double& averageSalary = myAverage.toColumn().name(&average_salary&);
Dataset&Double& result = ds.select(averageSalary);
result.show();
// +--------------+
// |average_salary|
// +--------------+
// +--------------+
Find full example code at "examples/src/main/java/org/apache/spark/examples/sql/JavaUserDefinedTypedAggregation.java" in the Spark repo.
Data Sources
Spark SQL supports operating on a variety of data sources through the DataFrame interface.
A DataFrame can be operated on using relational transformations and can also be used to create a temporary view.
Registering a DataFrame as a temporary view allows you to run SQL queries over its data. This section
describes the general methods for loading and saving data using the Spark Data Sources and then
goes into specific options that are available for the built-in data sources.
Generic Load/Save Functions
In the simplest form, the default data source (parquet unless otherwise configured by
spark.sql.sources.default) will be used for all operations.
val usersDF = spark.read.load(&examples/src/main/resources/users.parquet&)
usersDF.select(&name&, &favorite_color&).write.save(&namesAndFavColors.parquet&)
Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala" in the Spark repo.
Dataset&Row& usersDF = spark.read().load(&examples/src/main/resources/users.parquet&);
usersDF.select(&name&, &favorite_color&).write().save(&namesAndFavColors.parquet&);
Find full example code at "examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java" in the Spark repo.
df = spark.read.load(&examples/src/main/resources/users.parquet&)
df.select(&name&, &favorite_color&).write.save(&namesAndFavColors.parquet&)
Find full example code at "examples/src/main/python/sql/datasource.py" in the Spark repo.
df &- read.df(&examples/src/main/resources/users.parquet&)
write.df(select(df, &name&, &favorite_color&), &namesAndFavColors.parquet&)
Find full example code at "examples/src/main/r/RSparkSQLExample.R" in the Spark repo.
Manually Specifying Options
You can also manually specify the data source that will be used along with any extra options
that you would like to pass to the data source. Data sources are specified by their fully qualified
name (i.e., org.apache.spark.sql.parquet), but for built-in sources you can also use their short
names (json, parquet, jdbc, orc, libsvm, csv, text). DataFrames loaded from any data
source type can be converted into other types using this syntax.
To load a JSON file you can use:
val peopleDF = spark.read.format(&json&).load(&examples/src/main/resources/people.json&)
peopleDF.select(&name&, &age&).write.format(&parquet&).save(&namesAndAges.parquet&)
Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala" in the Spark repo.
Dataset&Row& peopleDF =
spark.read().format(&json&).load(&examples/src/main/resources/people.json&);
peopleDF.select(&name&, &age&).write().format(&parquet&).save(&namesAndAges.parquet&);
Find full example code at "examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java" in the Spark repo.
df = spark.read.load(&examples/src/main/resources/people.json&, format=&json&)
df.select(&name&, &age&).write.save(&namesAndAges.parquet&, format=&parquet&)
Find full example code at "examples/src/main/python/sql/datasource.py" in the Spark repo.
df &- read.df(&examples/src/main/resources/people.json&, &json&)
namesAndAges &- select(df, &name&, &age&)
write.df(namesAndAges, &namesAndAges.parquet&, &parquet&)
Find full example code at "examples/src/main/r/RSparkSQLExample.R" in the Spark repo.
To load a CSV file you can use:
val peopleDFCsv = spark.read.format(&csv&)
.option(&sep&, &;&)
.option(&inferSchema&, &true&)
.option(&header&, &true&)
.load(&examples/src/main/resources/people.csv&)
Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala" in the Spark repo.
Dataset&Row& peopleDFCsv = spark.read().format(&csv&)
.option(&sep&, &;&)
.option(&inferSchema&, &true&)
.option(&header&, &true&)
.load(&examples/src/main/resources/people.csv&);
Find full example code at "examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java" in the Spark repo.
df = spark.read.load(&examples/src/main/resources/people.csv&,
format=&csv&, sep=&:&, inferSchema=&true&, header=&true&)
Find full example code at "examples/src/main/python/sql/datasource.py" in the Spark repo.
df &- read.df(&examples/src/main/resources/people.csv&, &csv&)
namesAndAges &- select(df, &name&, &age&)
Find full example code at "examples/src/main/r/RSparkSQLExample.R" in the Spark repo.
Run SQL on files directly
Instead of using read API to load a file into DataFrame and query it, you can also query that
file directly with SQL.
val sqlDF = spark.sql(&SELECT * FROM parquet.`examples/src/main/resources/users.parquet`&)
Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala" in the Spark repo.
Dataset&Row& sqlDF =
spark.sql(&SELECT * FROM parquet.`examples/src/main/resources/users.parquet`&);
Find full example code at "examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java" in the Spark repo.
df = spark.sql(&SELECT * FROM parquet.`examples/src/main/resources/users.parquet`&)
Find full example code at "examples/src/main/python/sql/datasource.py" in the Spark repo.
df &- sql(&SELECT * FROM parquet.`examples/src/main/resources/users.parquet`&)
Find full example code at "examples/src/main/r/RSparkSQLExample.R" in the Spark repo.
Save Modes
Save operations can optionally take a SaveMode, that specifies how to handle existing data if
present. It is important to realize that these save modes do not utilize any locking and are not
atomic. Additionally, when performing an Overwrite, the data will be deleted before writing out the
Scala/JavaAny LanguageMeaning
SaveMode.ErrorIfExists (default)
"error" or "errorifexists" (default)
When saving a DataFrame to a data source, if data already exists,
an exception is expected to be thrown.
SaveMode.Append
When saving a DataFrame to a data source, if data/table already exists,
contents of the DataFrame are expected to be appended to existing data.
SaveMode.Overwrite
"overwrite"
Overwrite mode means that when saving a DataFrame to a data source,
if data/table already exists, existing data is expected to be overwritten by the contents of
the DataFrame.
SaveMode.Ignore
Ignore mode means that when saving a DataFrame to a data source, if data already exists,
the save operation is expected to not save the contents of the DataFrame and to not
change the existing data. This is similar to a CREATE TABLE IF NOT EXISTS in SQL.
Saving to Persistent Tables
DataFrames can also be saved as persistent tables into Hive metastore using the saveAsTable
command. Notice that an existing Hive deployment is not necessary to use this feature. Spark will create a
default local Hive metastore (using Derby) for you. Unlike the createOrReplaceTempView command,
saveAsTable will materialize the contents of the DataFrame and create a pointer to the data in the
Hive metastore. Persistent tables will still exist even after your Spark program has restarted, as
long as you maintain your connection to the same metastore. A DataFrame for a persistent table can
be created by calling the table method on a SparkSession with the name of the table.
For file-based data source, e.g. text, parquet, json, etc. you can specify a custom table path via the
path option, e.g. df.write.option("path", "/some/path").saveAsTable("t"). When the table is dropped,
the custom table path will not be removed and the table data is still there. If no custom table path is
specified, Spark will write data to a default table path under the warehouse directory. When the table is
dropped, the default table path will be removed too.
Starting from Spark 2.1, persistent datasource tables have per-partition metadata stored in the Hive metastore. This brings several benefits:
Since the metastore can return only necessary partitions for a query, discovering all the partitions on the first query to the table is no longer needed.
Hive DDLs such as ALTER TABLE PARTITION ... SET LOCATION are now available for tables created with the Datasource API.
Note that partition information is not gathered by default when creating external datasource tables (those with a path option). To sync the partition information in the metastore, you can invoke MSCK REPAIR TABLE.
Bucketing, Sorting and Partitioning
For file-based data source, it is also possible to bucket and sort or partition the output.
Bucketing and sorting are applicable only to persistent tables:
peopleDF.write.bucketBy(42, &name&).sortBy(&age&).saveAsTable(&people_bucketed&)
Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala" in the Spark repo.
peopleDF.write().bucketBy(42, &name&).sortBy(&age&).saveAsTable(&people_bucketed&);
Find full example code at "examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java" in the Spark repo.
df.write.bucketBy(42, &name&).sortBy(&age&).saveAsTable(&people_bucketed&)
Find full example code at "examples/src/main/python/sql/datasource.py" in the Spark repo.
CREATE TABLE users_bucketed_by_name(
name STRING,
favorite_color STRING,
favorite_numbers array&integer&
) USING parquet
CLUSTERED BY(name) INTO 42 BUCKETS;
while partitioning can be used with both save and saveAsTable when using the Dataset APIs.
usersDF.write.partitionBy(&favorite_color&).format(&parquet&).save(&namesPartByColor.parquet&)
Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala" in the Spark repo.
.partitionBy(&favorite_color&)
.format(&parquet&)
.save(&namesPartByColor.parquet&);
Find full example code at "examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java" in the Spark repo.
df.write.partitionBy(&favorite_color&).format(&parquet&).save(&namesPartByColor.parquet&)
Find full example code at "examples/src/main/python/sql/datasource.py" in the Spark repo.
CREATE TABLE users_by_favorite_color(
name STRING,
favorite_color STRING,
favorite_numbers array&integer&
) USING csv PARTITIONED BY(favorite_color);
It is possible to use both partitioning and bucketing for a single table:
.partitionBy(&favorite_color&)
.bucketBy(42, &name&)
.saveAsTable(&users_partitioned_bucketed&)
Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala" in the Spark repo.
.partitionBy(&favorite_color&)
.bucketBy(42, &name&)
.saveAsTable(&people_partitioned_bucketed&);
Find full example code at "examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java" in the Spark repo.
df = spark.read.parquet(&examples/src/main/resources/users.parquet&)
.partitionBy(&favorite_color&)
.bucketBy(42, &name&)
.saveAsTable(&people_partitioned_bucketed&))
Find full example code at "examples/src/main/python/sql/datasource.py" in the Spark repo.
CREATE TABLE users_bucketed_and_partitioned(
name STRING,
favorite_color STRING,
favorite_numbers array&integer&
) USING parquet
PARTITIONED BY (favorite_color)
CLUSTERED BY(name) SORTED BY (favorite_numbers) INTO 42 BUCKETS;
partitionBy creates a directory structure as described in the
Thus, it has limited applicability to columns with high cardinality. In contrast
bucketBy distributes
data across a fixed number of buckets and can be used when a number of unique values is unbounded.
Parquet Files
is a columnar format that is supported by many other data processing systems.
Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema
of the original data. When writing Parquet files, all columns are automatically converted to be nullable for
compatibility reasons.
Loading Data Programmatically
Using the data from the above example:
// Encoders for most common types are automatically provided by importing spark.implicits._
import spark.implicits._
val peopleDF = spark.read.json(&examples/src/main/resources/people.json&)
// DataFrames can be saved as Parquet files, maintaining the schema information
peopleDF.write.parquet(&people.parquet&)
// Read in the parquet file created above
// Parquet files are self-describing so the schema is preserved
// The result of loading a Parquet file is also a DataFrame
val parquetFileDF = spark.read.parquet(&people.parquet&)
// Parquet files can also be used to create a temporary view and then used in SQL statements
parquetFileDF.createOrReplaceTempView(&parquetFile&)
val namesDF = spark.sql(&SELECT name FROM parquetFile WHERE age BETWEEN 13 AND 19&)
namesDF.map(attributes =& &Name: & + attributes(0)).show()
// +------------+
// +------------+
// |Name: Justin|
// +------------+
Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala" in the Spark repo.
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
Dataset&Row& peopleDF = spark.read().json(&examples/src/main/resources/people.json&);
// DataFrames can be saved as Parquet files, maintaining the schema information
peopleDF.write().parquet(&people.parquet&);
// Read in the Parquet file created above.
// Parquet files are self-describing so the schema is preserved
// The result of loading a parquet file is also a DataFrame
Dataset&Row& parquetFileDF = spark.read().parquet(&people.parquet&);
// Parquet files can also be used to create a temporary view and then used in SQL statements
parquetFileDF.createOrReplaceTempView(&parquetFile&);
Dataset&Row& namesDF = spark.sql(&SELECT name FROM parquetFile WHERE age BETWEEN 13 AND 19&);
Dataset&String& namesDS = namesDF.map(
(MapFunction&Row, String&) row -& &Name: & + row.getString(0),
Encoders.STRING());
namesDS.show();
// +------------+
// +------------+
// |Name: Justin|
// +------------+
Find full example code at "examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java" in the Spark repo.
peopleDF = spark.read.json(&examples/src/main/resources/people.json&)
# DataFrames can be saved as Parquet files, maintaining the schema information.
peopleDF.write.parquet(&people.parquet&)
# Read in the Parquet file created above.
# Parquet files are self-describing so the schema is preserved.
# The result of loading a parquet file is also a DataFrame.
parquetFile = spark.read.parquet(&people.parquet&)
# Parquet files can also be used to create a temporary view and then used in SQL statements.
parquetFile.createOrReplaceTempView(&parquetFile&)
teenagers = spark.sql(&SELECT name FROM parquetFile WHERE age &= 13 AND age &= 19&)
teenagers.show()
# +------+
# +------+
# |Justin|
# +------+
Find full example code at "examples/src/main/python/sql/datasource.py" in the Spark repo.
df &- read.df(&examples/src/main/resources/people.json&, &json&)
# SparkDataFrame can be saved as Parquet files, maintaining the schema information.
write.parquet(df, &people.parquet&)
# Read in the Parquet file created above. Parquet files are self-describing so the schema is preserved.
# The result of loading a parquet file is also a DataFrame.
parquetFile &- read.parquet(&people.parquet&)
# Parquet files can also be used to create a temporary view and then used in SQL statements.
createOrReplaceTempView(parquetFile, &parquetFile&)
teenagers &- sql(&SELECT name FROM parquetFile WHERE age &= 13 AND age &= 19&)
head(teenagers)
## 1 Justin
# We can also run custom R-UDFs on Spark DataFrames. Here we prefix all the names with &Name:&
schema &- structType(structField(&name&, &string&))
teenNames &- dapply(df, function(p) { cbind(paste(&Name:&, p$name)) }, schema)
for (teenName in collect(teenNames)$name) {
cat(teenName, &\n&)
## Name: Michael
## Name: Andy
## Name: Justin
Find full example code at "examples/src/main/r/RSparkSQLExample.R" in the Spark repo.
CREATE TEMPORARY VIEW parquetTable
USING org.apache.spark.sql.parquet
path &examples/src/main/resources/people.parquet&
SELECT * FROM parquetTable
Partition Discovery
Table partitioning is a common optimization approach used in systems like Hive. In a partitioned
table, data are usually stored in different directories, with partitioning column values encoded in
the path of each partition directory. All built-in file sources (including Text/CSV/JSON/ORC/Parquet)
are able to discover and infer partitioning information automatically.
For example, we can store all our previously used
population data into a partitioned table using the following directory structure, with two extra
columns, gender and country as partitioning columns:
└── table
├── gender=male
├── ...
├── country=US
└── data.parquet
├── country=CN
└── data.parquet
└── ...
└── gender=female
├── ...
├── country=US
└── data.parquet
├── country=CN
└── data.parquet
└── ...
By passing path/to/table to either SparkSession.read.parquet or SparkSession.read.load, Spark SQL
will automatically extract the partitioning information from the paths.
Now the schema of the returned DataFrame becomes:
|-- name: string (nullable = true)
|-- age: long (nullable = true)
|-- gender: string (nullable = true)
|-- country: string (nullable = true)
Notice that the data types of the partitioning columns are automatically inferred. Currently,
numeric data types, date, timestamp and string type are supported. Sometimes users may not want
to automatically infer the data types of the partitioning columns. For these use cases, the
automatic type inference can be configured by
spark.sql.sources.partitionColumnTypeInference.enabled, which is default to true. When type
inference is disabled, string type will be used for the partitioning columns.
Starting from Spark 1.6.0, partition discovery only finds partitions under the given paths
by default. For the above example, if users pass path/to/table/gender=male to either
SparkSession.read.parquet or SparkSession.read.load, gender will not be considered as a
partitioning column. If users need to specify the base path that partition discovery
should start with, they can set basePath in the data source options. For example,
when path/to/table/gender=male is the path of the data and
users set basePath to path/to/table/, gender will be a partitioning column.
Schema Merging
Like ProtocolBuffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with
a simple schema, and gradually add more columns to the schema as needed. In this way, users may end
up with multiple Parquet files with different but mutually compatible schemas. The Parquet data
source is now able to automatically detect this case and merge schemas of all these files.
Since schema merging is a relatively expensive operation, and is not a necessity in most cases, we
turned it off by default starting from 1.5.0. You may enable it by
setting data source option mergeSchema to true when reading Parquet files (as shown in the
examples below), or
setting the global SQL option spark.sql.parquet.mergeSchema to true.
// This is used to implicitly convert an RDD to a DataFrame.
import spark.implicits._
// Create a simple DataFrame, store into a partition directory
val squaresDF = spark.sparkContext.makeRDD(1 to 5).map(i =& (i, i * i)).toDF(&value&, &

我要回帖

 

随机推荐