Description
Following the official SparkXGBClassifier example shown here, I try to run XGBoost training on my own dataset and get the following error:
22/09/02 17:05:15 ERROR PythonRunner: Python worker exited unexpectedly (crashed)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/opt/spark-3.3.0-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 666, in main
eval_type = read_int(infile)
File "/opt/spark-3.3.0-bin-hadoop3/python/lib/pyspark.zip/pyspark/serializers.py", line 595, in read_int
raise EOFError
EOFError
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:559)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:765)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:747)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:512)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)
at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358)
at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358)
at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:345)
at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339)
at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1021)
at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2268)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:136)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/xgboost/spark/core.py", line 763, in _train_booster
**train_call_kwargs_params,
File "/usr/local/lib/python3.7/site-packages/xgboost/spark/data.py", line 245, in create_dmatrix_from_partitions
cache_partitions(iterator, append_fn)
File "/usr/local/lib/python3.7/site-packages/xgboost/spark/data.py", line 62, in cache_partitions
make_blob(valid, True)
File "/usr/local/lib/python3.7/site-packages/xgboost/spark/data.py", line 40, in make_blob
append(part, alias.data, is_valid)
File "/usr/local/lib/python3.7/site-packages/xgboost/spark/data.py", line 179, in append_m
array = stack_series(array)
File "/usr/local/lib/python3.7/site-packages/xgboost/spark/data.py", line 16, in stack_series
array = np.stack(array)
File "<__array_function__ internals>", line 6, in stack
File "/usr/local/lib/python3.7/site-packages/numpy/core/shape_base.py", line 422, in stack
raise ValueError('need at least one array to stack')
ValueError: need at least one array to stack
Here is my code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
from pyspark.ml.feature import VectorAssembler
from xgboost.spark import SparkXGBClassifier
spark = SparkSession.builder.master('local[*]')\
    .appName("xgboost_train")\
    .config("spark.driver.memory", '300g')\
    .config("spark.local.dir", "/mnt/spark")\
    .getOrCreate()
train = spark.read.parquet(f'{train_data_path}/*').withColumn('isVal', lit(False))
valid = spark.read.parquet(f'{valid_data_path}/*').withColumn('isVal', lit(True))
data = train.union(valid)
data = data.withColumnRenamed(name, 'label')
feature_list = [...] # a list with over 140 feature names
vector_assembler = VectorAssembler()\
.setInputCols(feature_list)\
.setOutputCol("features")
data_trans = vector_assembler.setHandleInvalid("keep").transform(data)
xgb_classifier = SparkXGBClassifier(max_depth=5, missing=0.0, eval_metric='logloss', early_stopping_rounds=1, validation_indicator_col='isVal')
xgb_clf_model = xgb_classifier.fit(data_trans)
The code runs through when I don't specify validation_indicator_col in SparkXGBClassifier, yet the same configuration works with the official example code. I also checked the data types of both datasets: my isVal and features columns have the same data types as the example data frame.
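For reference, the ValueError at the bottom of the traceback is exactly what numpy raises when np.stack is called with an empty sequence, so my guess (only a guess from reading the xgboost/spark/data.py frames in the trace) is that some task ends up with no validation rows to stack. A minimal numpy-only sketch that reproduces the message:
import numpy as np
# np.stack requires at least one array; an empty sequence reproduces the
# exact error message seen at the bottom of the traceback above
try:
    np.stack([])
except ValueError as err:
    print(err)  # need at least one array to stack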
Here is my environment setup:
- pyspark=3.3.0
- xgboost=2.0.0-dev
I built xgboost from source and installed pyspark with a standard pip install pyspark.
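Both packages expose __version__, so a quick sanity check on the driver looks like this (the printed values are what I see locally; I have not verified the executor-side environments the same way):
import pyspark
import xgboost
print(pyspark.__version__)  # 3.3.0
print(xgboost.__version__)  # 2.0.0-dev build installed from source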
If more info is needed from my side, please let me know. Thanks!