Spark-SparkSQL深入学习系列二（转自OopsOutOfMemory）

本文先从入口开始分析，即如何解析SQL文本生成逻辑计划的，主要设计的核心组件式SqlParser是一个SQL语言的解析器，用scala实现的Parser将解析的结果封装为Catalyst TreeNode ，关于Catalyst这个框架后续文章会介绍。

Sql Parser 其实是封装了scala.util.parsing.combinator下的诸多Parser，并结合Parser下的一些解析方法，构成了Catalyst的组件UnResolved Logical Plan。

先来看流程图：

Spark-SparkSQL深入学习系列二（转自OopsOutOfMemory）

一段SQL会经过SQL Parser解析生成UnResolved Logical Plan(包含UnresolvedRelation、 UnresolvedFunction、 UnresolvedAttribute)。

在源代码里是：

def sql(sqlText: String): SchemaRDD = new SchemaRDD(this, parseSql(sqlText))//sql("select name,value from temp_shengli") 实例化一个SchemaRDD

protected[sql] def parseSql(sql: String): LogicalPlan = parser(sql) //实例化SqlParser

class SqlParser extends StandardTokenParsers with PackratParsers {

def apply(input: String): LogicalPlan = { //传入sql语句调用apply方法，input参数即sql语句

// Special-case out set commands since the value fields can be

// complex to handle without RegexParsers. Also this approach

// is clearer for the several possible cases of set commands.

if (input.trim.toLowerCase.startsWith("set")) {

input.trim.drop(3).split("=", 2).map(_.trim) match {

case Array("") => // "set"

SetCommand(None, None)

case Array(key) => // "set key"

SetCommand(Some(key), None)

case Array(key, value) => // "set key=value"

SetCommand(Some(key), Some(value))

}

} else {

phrase(query)(new lexical.Scanner(input)) match {

case Success(r, x) => r

case x => sys.error(x.toString)

}

1. 当我们调用sql("select name,value from temp_shengli")时，实际上是new了一个SchemaRDD

2. new SchemaRDD时，构造方法调用parseSql方法，parseSql方法实例化了一个SqlParser，这个Parser初始化调用其apply方法。

3. apply方法分支：

3.1 如果sql命令是set开头的就调用SetCommand，这个类似Hive里的参数设定，SetCommand其实是一个Catalyst里TreeNode之LeafNode，也是继承自LogicalPlan，关于Catalyst的TreeNode库这个暂不详细介绍，后面会有文章来详细讲解。

3.2 关键是else语句块里，才是SqlParser解析SQL的核心代码：

phrase(query)(new lexical.Scanner(input)) match {

case Success(r, x) => r

case x => sys.error(x.toString)

}

可能 phrase方法大家很陌生，不知道是干什么的，那么我们首先看一下SqlParser的类图：

SqlParser类继承了scala内置集合Parsers,这个Parsers。我们可以看到SqlParser现在是具有了分词的功能，也能解析combiner的语句（类似p ~> q，后面会介绍）。

Phrase方法：

/** A parser generator delimiting whole phrases (i.e. programs).

* `phrase(p)` succeeds if `p` succeeds and no input is left over after `p`.

* @param p the parser that must consume all input for the resulting parser

* to succeed.

* @return a parser that has the same result as `p`, but that only succeeds

* if `p` consumed all the input.

def phrase[T](p: Parser[T]) = new Parser[T] {

def apply(in: Input) = lastNoSuccessVar.withValue(None) {

p(in) match {

case s @ Success(out, in1) =>

if (in1.atEnd)

else

lastNoSuccessVar.value filterNot { _.next.pos < in1.pos } getOrElse Failure("end of input expected", in1)

case ns => lastNoSuccessVar.value.getOrElse(ns)

}

Phrase是一个循环读取输入字符的方法，如果输入in没有到达最后一个字符，就继续对parser进行解析，直到最后一个输入字符。

我们注意到Success这个类，出现在Parser里, 在else块里最终返回的也有Success:

/** The success case of `ParseResult`: contains the result and the remaining input.

* @param result The parser's output

* @param next The parser's remaining input

case class Success[+T](result: T, override val next: Input) extends ParseResult[T] {

通过源码可知，Success封装了当前解析器的解析结果result，和还没有解析的语句。

所以上面判断了Success的解析结果中in1.atEnd？如果输入流结束了，就返回s，即Success对象，这个Success包含了SqlParser解析的输出。

在SqlParser里phrase接受2个参数：

第一个是query，一种带模式的解析规则，返回的是LogicalPlan。

第二个是lexical词汇扫描输入。

SqlParser parse的流程是，用lexical词汇扫描接受SQL关键字，使用query模式来解析符合规则的SQL。

在SqlParser里定义了KeyWord这个类：

protected case class Keyword(str: String)

在我使用的spark1.0.0版本里目前只支持了一下SQL保留字：

protected val ALL = Keyword("ALL")

protected val AND = Keyword("AND")

protected val AS = Keyword("AS")

protected val ASC = Keyword("ASC")

protected val APPROXIMATE = Keyword("APPROXIMATE")

protected val AVG = Keyword("AVG")

protected val BY = Keyword("BY")

protected val CACHE = Keyword("CACHE")

protected val CAST = Keyword("CAST")

protected val COUNT = Keyword("COUNT")

protected val DESC = Keyword("DESC")

protected val DISTINCT = Keyword("DISTINCT")

protected val FALSE = Keyword("FALSE")

protected val FIRST = Keyword("FIRST")

protected val FROM = Keyword("FROM")

protected val FULL = Keyword("FULL")

protected val GROUP = Keyword("GROUP")

protected val HAVING = Keyword("HAVING")

protected val IF = Keyword("IF")

protected val IN = Keyword("IN")

protected val INNER = Keyword("INNER")

protected val INSERT = Keyword("INSERT")

protected val INTO = Keyword("INTO")

protected val IS = Keyword("IS")

protected val JOIN = Keyword("JOIN")

protected val LEFT = Keyword("LEFT")

protected val LIMIT = Keyword("LIMIT")

protected val MAX = Keyword("MAX")

protected val MIN = Keyword("MIN")

protected val NOT = Keyword("NOT")

protected val NULL = Keyword("NULL")

protected val ON = Keyword("ON")

protected val OR = Keyword("OR")

protected val OVERWRITE = Keyword("OVERWRITE")

protected val LIKE = Keyword("LIKE")

protected val RLIKE = Keyword("RLIKE")

protected val UPPER = Keyword("UPPER")

protected val LOWER = Keyword("LOWER")

protected val REGEXP = Keyword("REGEXP")

protected val ORDER = Keyword("ORDER")

protected val OUTER = Keyword("OUTER")

protected val RIGHT = Keyword("RIGHT")

protected val SELECT = Keyword("SELECT")

protected val SEMI = Keyword("SEMI")

protected val STRING = Keyword("STRING")

protected val SUM = Keyword("SUM")

protected val TABLE = Keyword("TABLE")

protected val TRUE = Keyword("TRUE")

protected val UNCACHE = Keyword("UNCACHE")

protected val UNION = Keyword("UNION")

protected val WHERE = Keyword("WHERE")

这里根据这些保留字，反射，生成了一个SqlLexical

override val lexical = new SqlLexical(reservedWords)

SqlLexical利用它的Scanner这个Parser来读取输入，传递给query。

query的定义是Parser[LogicalPlan] 和一堆奇怪的连接符（其实都是Parser的方法啦，看上图），*，~，^^^，看起来很让人费解。通过查阅读源码，以下列出几个常用的：

| is the alternation combinator. It says “succeed if either the left or right operand parse successfully”

左边算子和右边的算子只要有一个成功了，就返回succeed，类似or

~ is the sequential combinator. It says “succeed if the left operand parses successfully, and then the right parses successfully on the remaining input”

左边的算子成功后，右边的算子对后续的输入也计算成功，就返回succeed

opt `opt(p)` is a parser that returns `Some(x)` if `p` returns `x` and `None` if `p` fails.

如果p算子成功则返回则返回Some（x）如果p算子失败，返回fails

^^^ `p ^^^ v` succeeds if `p` succeeds; discards its result, and returns `v` instead.

如果左边的算子成功，取消左边算子的结果，返回右边算子。

~> says “succeed if the left operand parses successfully followed by the right, but do not include the left content in the result”

如果左边的算子和右边的算子都成功了，返回的结果中不包含左边的返回值。

protected lazy val limit: Parser[Expression] =

LIMIT ~> expression

<~ is the reverse, “succeed if the left operand is parsed successfully followed by the right, but do not include the right content in the result”

这个和~>操作符的意思相反，如果左边的算子和右边的算子都成功了，返回的结果中不包含右边的

termExpression <~ IS ~ NOT ~ NULL ^^ { case e => IsNotNull(e) } |

^^{} 或者 ^^=> is the transformation combinator. It says “if the left operand parses successfully, transform the result using the function on the right”

rep => simply says “expect N-many repetitions of parser X” where X is the parser passed as an argument to rep

变形连接符，意思是如果左边的算子成功了，用^^右边的算子函数作用于返回的结果

接下来看query的定义：

protected lazy val query: Parser[LogicalPlan] = (

select * (

UNION ~ ALL ^^^ { (q1: LogicalPlan, q2: LogicalPlan) => Union(q1, q2) } |

UNION ~ opt(DISTINCT) ^^^ { (q1: LogicalPlan, q2: LogicalPlan) => Distinct(Union(q1, q2)) }

)

| insert | cache

)

没错，返回的是一个Parser，里面的类型是LogicalPlan。

query的定义其实是一种模式，用到了上述的诸多操作符，如|， ^^, ~> 等等

给定一种sql模式，如select，select xxx from yyy where ccc =ddd 如果匹配这种写法，则返回Success，否则返回Failure.

这里的模式是select 模式后面可以接union all 或者 union distinct。

即如下书写式合法的，否则出错。

select a,b from c

union all

select e,f from g

这个 *号是一个repeat符号，即可以支持多个union all 子句。

看来目前spark1.0.0只支持这三种模式，即select, insert, cache。

那到底是怎么生成LogicalPlan的呢？我们再看一个详细的：

protected lazy val select: Parser[LogicalPlan] =

SELECT ~> opt(DISTINCT) ~ projections ~

opt(from) ~ opt(filter) ~

opt(grouping) ~

opt(having) ~

opt(orderBy) ~

opt(limit) <~ opt(";") ^^ {

case d ~ p ~ r ~ f ~ g ~ h ~ o ~ l =>

val base = r.getOrElse(NoRelation)

val withFilter = f.map(f => Filter(f, base)).getOrElse(base)

val withProjection =

g.map {g =>

Aggregate(assignAliases(g), assignAliases(p), withFilter)

}.getOrElse(Project(assignAliases(p), withFilter))

val withDistinct = d.map(_ => Distinct(withProjection)).getOrElse(withProjection)

val withHaving = h.map(h => Filter(h, withDistinct)).getOrElse(withDistinct)

val withOrder = o.map(o => Sort(o, withHaving)).getOrElse(withHaving)

val withLimit = l.map { l => Limit(l, withOrder) }.getOrElse(withOrder)

withLimit

这里我给称它为select模式。

看这个select语句支持什么模式的写法：

select distinct projections from filter grouping having orderBy limit.

给出一个符合的该select 模式的sql, 注意到带opt连接符的是可选的，可以写distinct也可以不写。

select game_id, user_name from game_log where date<='2014-07-19' and user_name='shengli' group by game_id having game_id > 1 orderBy game_id limit 50.

projections是什么呢？

其实是一个表达式，是一个Seq类型，一连串的表达式可以使 game_id也可以是 game_id AS gmid 。

返回的确实是一个Expression，是Catalyst里TreeNode。

protected lazy val projections: Parser[Seq[Expression]] = repsep(projection, ",")

protected lazy val projection: Parser[Expression] =

expression ~ (opt(AS) ~> opt(ident)) ^^ {

case e ~ None => e

case e ~ Some(a) => Alias(e, a)()

模式里from是什么的？

其实是一个relations，就是一个关系，在SQL里可以是表，表join表

protected lazy val from: Parser[LogicalPlan] = FROM ~> relations

protected lazy val relation: Parser[LogicalPlan] =

joinedRelation |

relationFactor

protected lazy val relationFactor: Parser[LogicalPlan] =

ident ~ (opt(AS) ~> opt(ident)) ^^ {

case tableName ~ alias => UnresolvedRelation(None, tableName, alias)

} |

"(" ~> query ~ ")" ~ opt(AS) ~ ident ^^ { case s ~ _ ~ _ ~ a => Subquery(a, s) }

protected lazy val joinedRelation: Parser[LogicalPlan] =

relationFactor ~ opt(joinType) ~ JOIN ~ relationFactor ~ opt(joinConditions) ^^ {

case r1 ~ jt ~ _ ~ r2 ~ cond =>

Join(r1, r2, joinType = jt.getOrElse(Inner), cond)

}

这里看出来，其实就是table之间的操作，但是返回的Subquery确实是一个LogicalPlan

case class Subquery(alias: String, child: LogicalPlan) extends UnaryNode {

override def output = child.output.map(_.withQualifiers(alias :: Nil))

override def references = Set.empty

scala里的语法糖很多，这样写的确比较方便，但是对初学者可能有点晦涩了。

至此我们知道，SqlParser是怎么生成LogicalPlan的了。

本文从源代码剖析了Spark Catalyst 是如何将Sql解析成Unresolved逻辑计划（包含UnresolvedRelation、 UnresolvedFunction、 UnresolvedAttribute）的。

sql文本作为输入，实例化了SqlParser，SqlParser的apply方法被调用，分别处理2种输入，一种是命令参数，一种是sql。对应命令参数的会生成一个叶子节点，SetCommand，对于sql语句，会调用Parser的phrase方法，由lexical的Scanner来扫描输入，分词，最后由query这个由我们定义好的sql模式利用parser的连接符来验证是否符合sql标准，如果符合则随即生成LogicalPlan语法树，不符合则会提示解析失败。

通过对spark catalyst sql parser的解析，使我理解了，sql语言的语法标准是如何实现的和如何解析sql生成逻辑计划语法树。

——EOF——

原创文章，转载请注明：

Spark-SparkSQL深入学习系列二（转自OopsOutOfMemory）

继续阅读

Java小案例——随机数猜测随机数猜测

nginx location中斜线的位置的重要性

27 Best Free Eclipse Plug-ins for Java Developer to be ProductiveCode Quality PluginsText Editor PluginsDependency ManagementVersion Control Integration PluginsFramework Development Continuous Integration Related PluginsOther Utility Plugins

Java String.format方法的简单使用

neo4j之cypher使用文档

GitHub连夜封杀！这份阿里 10W 字内部 Java 字面试手册到底有多强？

spark/scala关于【资源文件】加载方法概述外部文件加载方案测试资源文件打包入jar包中小结

mybatis_入门程序Mybatis入门

AOP编程_Android优雅权限框架(1)概念基础，2021金三银四前言正文大纲正文

sqlServer根据经纬查距离

Effective Java 8:通用程序设计

OOM三种类型

工厂模式-三种类型

【递归】高效率求2的n次幂

win10本地scala和spark安装安装scala安装spark

scala (3) Function 和 Method