Description
Following the official SparkXGBClassifier example shown here, I try to run XGBoost training on my own dataset and get the following error:
22/09/02 17:05:15 ERROR PythonRunner: Python worker exited unexpectedly (crashed)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/opt/spark-3.3.0-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 666, in main
eval_type = read_int(infile)
File "/opt/spark-3.3.0-bin-hadoop3/python/lib/pyspark.zip/pyspark/serializers.py", line 595, in read_int
raise EOFError
EOFError
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:559)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:765)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:747)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:512)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)
at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358)
at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358)
at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:345)
at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339)
at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1021)
at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2268)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:136)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/xgboost/spark/core.py", line 763, in _train_booster
**train_call_kwargs_params,
File "/usr/local/lib/python3.7/site-packages/xgboost/spark/data.py", line 245, in create_dmatrix_from_partitions
cache_partitions(iterator, append_fn)
File "/usr/local/lib/python3.7/site-packages/xgboost/spark/data.py", line 62, in cache_partitions
make_blob(valid, True)
File "/usr/local/lib/python3.7/site-packages/xgboost/spark/data.py", line 40, in make_blob
append(part, alias.data, is_valid)
File "/usr/local/lib/python3.7/site-packages/xgboost/spark/data.py", line 179, in append_m
array = stack_series(array)
File "/usr/local/lib/python3.7/site-packages/xgboost/spark/data.py", line 16, in stack_series
array = np.stack(array)
File "<__array_function__ internals>", line 6, in stack
File "/usr/local/lib/python3.7/site-packages/numpy/core/shape_base.py", line 422, in stack
raise ValueError('need at least one array to stack')
ValueError: need at least one array to stack
Here is my code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
from pyspark.ml.feature import VectorAssembler
from xgboost.spark import SparkXGBClassifier
spark = SparkSession.builder.master('local[*]')\
    .appName("xgboost_train")\
    .config("spark.driver.memory", '300g')\
    .config("spark.local.dir", "/mnt/spark")\
    .getOrCreate()
train = spark.read.parquet(f'{train_data_path}/*').withColumn('isVal', lit(False))
valid = spark.read.parquet(f'{valid_data_path}/*').withColumn('isVal', lit(True))
data = train.union(valid)
data = data.withColumnRenamed(name, 'label')
feature_list = [...] # a list with over 140 feature names
vector_assembler = VectorAssembler()\
.setInputCols(feature_list)\
.setOutputCol("features")
data_trans = vector_assembler.setHandleInvalid("keep").transform(data)
xgb_classifier = SparkXGBClassifier(max_depth=5, missing=0.0, eval_metric='logloss', early_stopping_rounds=1, validation_indicator_col='isVal')
xgb_clf_model = xgb_classifier.fit(data_trans)
The code runs through when I don't specify validation_indicator_col in SparkXGBClassifier, yet the same configuration works with the official example code. I also checked the data types of both datasets: my isVal and features columns have the same data types as the example data frame.
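For reference, the ValueError at the bottom of the traceback is exactly what numpy raises when np.stack is called with an empty sequence, so my guess (only a guess from reading the xgboost/spark/data.py frames in the trace) is that some task ends up with no validation rows to stack. A minimal numpy-only sketch that reproduces the message:
import numpy as np
# np.stack requires at least one array; an empty sequence reproduces the
# exact error message seen at the bottom of the traceback above
try:
    np.stack([])
except ValueError as err:
    print(err)  # need at least one array to stack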
Here is my environment setup:
- pyspark=3.3.0
- xgboost=2.0.0-dev
I built xgboost from source and installed pyspark with a standard pip install pyspark.
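Both packages expose __version__, so a quick sanity check on the driver looks like this (the printed values are what I see locally; I have not verified the executor-side environments the same way):
import pyspark
import xgboost
print(pyspark.__version__)  # 3.3.0
print(xgboost.__version__)  # 2.0.0-dev build installed from source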
If more info is needed from my side, please let me know. Thanks!