
From "No module named pyspark" to submitting Spark jobs remotely

Copyright notice: this is an original article by 半吊子全棧工匠 (wireless_com, also the name of the matching public account); reproduction without permission is prohibited. https://blog.csdn.net/wireless_com/article/details/51170246

Being able to submit Spark jobs with Python from a local Mac environment is very convenient. But after installing spark-1.6-bin-without-hadoop (spark.apache.org/download), running "import pyspark" in Python fails with a "no module named pyspark" error. Sure enough, this kind of error is always a path problem.

To use Spark locally, add two environment variables to ~/.bash_profile: SPARK_HOME and, just as essential, PYTHONPATH:

export SPARK_HOME=/Users/abc/Documents/spark-1.6.0-bin-without-hadoop  # the Spark install path

export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH
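If you would rather not touch ~/.bash_profile, the same two entries can be prepended to sys.path at the top of a script. A minimal sketch, assuming the layout above (the helper name pyspark_paths and the fallback path are mine; the py4j zip is globbed because its version changes between Spark releases):

```python
import glob
import os
import sys


def pyspark_paths(spark_home):
    """Build the sys.path entries that make `import pyspark` resolve."""
    python_dir = os.path.join(spark_home, "python")
    # Spark 1.6 bundles py4j-0.8.2.1-src.zip; glob so version bumps still work.
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))
    return [python_dir] + py4j_zips[-1:]


# Prepend so Spark's bundled py4j wins over any system-wide copy.
spark_home = os.environ.get("SPARK_HOME",
                            "/Users/abc/Documents/spark-1.6.0-bin-without-hadoop")
sys.path[:0] = pyspark_paths(spark_home)
```

This is handy for scripts that must run on machines where you cannot edit the shell profile.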

Note: Py4J is a bit like a Python counterpart to JNI. Through it, a Python program running in the Python interpreter can directly call Java objects inside a JVM, and Java can call back into Python objects as well.

Then, don't forget to source ~/.bash_profile so the changes take effect. In a Python shell, run:

from pyspark import SparkContext 

That works now. But when you run pyspark on its own, or initialize SparkConf and other classes from Python, a new error appears:

"Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream"

Spark is missing library classes when it accesses the filesystem; evidently wiring Spark to Hadoop needs more explicit links. The simple fix is to swap the Spark distribution: replace spark-1.6.0-bin-without-hadoop with spark-1.6.0-bin-hadoop2.6, then update the SPARK_HOME path in .bash_profile.
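As an aside, if you wanted to keep the "without-hadoop" build, Spark's Hadoop-free builds are documented to work by pointing SPARK_DIST_CLASSPATH at an existing Hadoop install. A sketch of the extra .bash_profile line, assuming a local Hadoop install that provides the hadoop command (not needed if you swap distributions as above):

```shell
# Only for the "without-hadoop" build: supply Hadoop's jars explicitly.
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
```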

Run pyspark directly:

$ pyspark
Python 2.7.11 (default, Mar  1 2016, 18:40:10)
[GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
16/04/16 21:41:02 INFO spark.SparkContext: Running Spark version 1.6.0
16/04/16 21:41:05 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/04/16 21:41:05 INFO spark.SecurityManager: Changing view acls to: abel,hdfs
16/04/16 21:41:05 INFO spark.SecurityManager: Changing modify acls to: abel,hdfs
16/04/16 21:41:05 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(abel, hdfs); users with modify permissions: Set(abel, hdfs)
16/04/16 21:41:06 INFO util.Utils: Successfully started service 'sparkDriver' on port 55162.
16/04/16 21:41:06 INFO slf4j.Slf4jLogger: Slf4jLogger started
16/04/16 21:41:06 INFO Remoting: Starting remoting
16/04/16 21:41:07 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@192.168.1.106:55165]
16/04/16 21:41:07 INFO util.Utils: Successfully started service 'sparkDriverActorSystem' on port 55165.
16/04/16 21:41:07 INFO spark.SparkEnv: Registering MapOutputTracker
16/04/16 21:41:07 INFO spark.SparkEnv: Registering BlockManagerMaster
16/04/16 21:41:07 INFO storage.DiskBlockManager: Created local directory at /private/var/folders/wk/fxn2zdyd7rz8rm66rst4h15w0000gn/T/blockmgr-6de54d08-31c9-430e-ac3c-9f3e0635e486
16/04/16 21:41:07 INFO storage.MemoryStore: MemoryStore started with capacity 511.5 MB
16/04/16 21:41:07 INFO spark.SparkEnv: Registering OutputCommitCoordinator
16/04/16 21:41:07 INFO server.Server: jetty-8.y.z-SNAPSHOT
16/04/16 21:41:07 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
16/04/16 21:41:07 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
16/04/16 21:41:07 INFO ui.SparkUI: Started SparkUI at http://192.168.1.106:4040
16/04/16 21:41:07 INFO executor.Executor: Starting executor ID driver on host localhost
16/04/16 21:41:07 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 55167.
16/04/16 21:41:07 INFO netty.NettyBlockTransferService: Server created on 55167
16/04/16 21:41:07 INFO storage.BlockManagerMaster: Trying to register BlockManager
16/04/16 21:41:07 INFO storage.BlockManagerMasterEndpoint: Registering block manager localhost:55167 with 511.5 MB RAM, BlockManagerId(driver, localhost, 55167)
16/04/16 21:41:07 INFO storage.BlockManagerMaster: Registered BlockManager
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/

Using Python version 2.7.11 (default, Mar  1 2016 18:40:10)
SparkContext available as sc, HiveContext available as sqlContext.
>>>

OK. At this point, pyspark is basically working in the local Mac environment.