[pyspark] SparkXGBClassifier failed to train with early_stopping_rounds and validation_indicator_col #8221

@faaany

Description

Following the official SparkXGBClassifier example, I tried to run XGBoost training on my own dataset and got the following error:

22/09/02 17:05:15 ERROR PythonRunner: Python worker exited unexpectedly (crashed)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/spark-3.3.0-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 666, in main
    eval_type = read_int(infile)
  File "/opt/spark-3.3.0-bin-hadoop3/python/lib/pyspark.zip/pyspark/serializers.py", line 595, in read_int
    raise EOFError
EOFError
        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:559)
        at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:765)
        at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:747)
        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:512)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at scala.collection.Iterator.foreach(Iterator.scala:943)
        at scala.collection.Iterator.foreach$(Iterator.scala:943)
        at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
        at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
        at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
        at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
        at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
        at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
        at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)
        at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
        at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358)
        at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358)
        at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
        at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:345)
        at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339)
        at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
        at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1021)
        at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2268)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:136)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/xgboost/spark/core.py", line 763, in _train_booster
    **train_call_kwargs_params,
  File "/usr/local/lib/python3.7/site-packages/xgboost/spark/data.py", line 245, in create_dmatrix_from_partitions
    cache_partitions(iterator, append_fn)
  File "/usr/local/lib/python3.7/site-packages/xgboost/spark/data.py", line 62, in cache_partitions
    make_blob(valid, True)
  File "/usr/local/lib/python3.7/site-packages/xgboost/spark/data.py", line 40, in make_blob
    append(part, alias.data, is_valid)
  File "/usr/local/lib/python3.7/site-packages/xgboost/spark/data.py", line 179, in append_m
    array = stack_series(array)
  File "/usr/local/lib/python3.7/site-packages/xgboost/spark/data.py", line 16, in stack_series
    array = np.stack(array)
  File "<__array_function__ internals>", line 6, in stack
  File "/usr/local/lib/python3.7/site-packages/numpy/core/shape_base.py", line 422, in stack
    raise ValueError('need at least one array to stack')
ValueError: need at least one array to stack

Here is my code:

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
from pyspark.ml.feature import VectorAssembler
from xgboost.spark import SparkXGBClassifier

spark = SparkSession.builder.master('local[*]')\
        .appName("xgboost_train")\
        .config("spark.driver.memory", '300g')\
        .config("spark.local.dir", "/mnt/spark")\
        .getOrCreate()

# Tag training and validation rows, then merge them into one DataFrame
train = spark.read.parquet(f'{train_data_path}/*').withColumn('isVal', lit(False))
valid = spark.read.parquet(f'{valid_data_path}/*').withColumn('isVal', lit(True))
data = train.union(valid)
data = data.withColumnRenamed(name, 'label')  # name holds my original label column

feature_list = [...]  # a list with over 140 feature names
vector_assembler = VectorAssembler()\
                            .setInputCols(feature_list)\
                            .setOutputCol("features")
data_trans = vector_assembler.setHandleInvalid("keep").transform(data)

xgb_classifier = SparkXGBClassifier(
    max_depth=5,
    missing=0.0,
    eval_metric='logloss',
    early_stopping_rounds=1,
    validation_indicator_col='isVal',
)
xgb_clf_model = xgb_classifier.fit(data_trans)

The code runs through when I don't specify validation_indicator_col in SparkXGBClassifier, and the same configuration works with the official example code. I also checked the data types of both datasets: my isVal and features columns have the same types as in the example DataFrame.
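In case it helps with debugging, my reading of the bottom frame of the traceback is that a worker received a partition whose validation subset was empty, so the list handed to np.stack contained no arrays. This is only a guess at the failure mode; a minimal sketch of it in plain NumPy (not the actual Spark code path) would be:

```python
import numpy as np

# Hypothetical reconstruction of the failing call: stack_series() collects
# the per-row "features" arrays of the validation rows within one partition
# and stacks them into a 2-D array. If a partition happens to contain no
# rows with isVal == True, the collected list is empty and np.stack raises
# exactly the error shown in the traceback above.
validation_arrays = []  # a partition with zero validation rows

try:
    np.stack(validation_arrays)
except ValueError as err:
    print(err)  # need at least one array to stack
```

If that guess is right, the official example may only work because its validation rows happen to land in every partition.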

Here is my environment setup:

  • pyspark=3.3.0
  • xgboost=2.0.0-dev

I built xgboost from source and installed pyspark with a standard pip install pyspark.

If more info is needed from my side, please let me know. Thanks!
