@@ -13,10 +13,9 @@ As you follow this walkthrough, you run Python code that calls
 [Dataproc gRPC APIs](https://cloud.google.com/dataproc/docs/reference/rpc/)
 to:
 
-* create a Dataproc cluster
-* submit a small PySpark word sort job to run on the cluster
-* get job status
-* tear down the cluster after job completion
+* Create a Dataproc cluster
+* Submit a PySpark word sort job to the cluster
+* Delete the cluster after job completion
 
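+To preview what these calls look like, here is a minimal sketch of the
+first one (cluster creation) using the `google-cloud-dataproc` client
+library. This is an illustration, not the walkthrough's actual code; the
+project, region, and cluster names are placeholders, and the
+job-submission and cluster-deletion calls follow the same pattern.
+
+```python
+# Minimal sketch of creating a Dataproc cluster; assumes the
+# google-cloud-dataproc client library and placeholder names.
+from google.cloud import dataproc_v1
+
+project_id = "my-project"    # placeholder
+region = "us-central1"       # placeholder
+cluster_name = "my-cluster"  # placeholder
+
+# The Dataproc gRPC API is regional, so point the client at the
+# regional endpoint.
+cluster_client = dataproc_v1.ClusterControllerClient(
+    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
+)
+
+cluster = {
+    "project_id": project_id,
+    "cluster_name": cluster_name,
+    "config": {
+        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
+        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
+    },
+}
+
+# create_cluster returns a long-running operation; result() blocks
+# until the cluster is ready.
+operation = cluster_client.create_cluster(
+    request={"project_id": project_id, "region": region, "cluster": cluster}
+)
+print(f"Cluster created: {operation.result().cluster_name}")
+```
+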
 ## Using the walkthrough
 
@@ -32,144 +31,127 @@ an explanation of how the code works.
 
     cloudshell launch-tutorial python-api-walkthrough.md
 
-**To copy and run commands**: Click the "Paste in Cloud Shell" button
+**To copy and run commands**: Click the "Copy to Cloud Shell" button
 (<walkthrough-cloud-shell-icon></walkthrough-cloud-shell-icon>)
 on the side of a code box, then press `Enter` to run the command.
 
 ## Prerequisites (1)
 
-<walkthrough-watcher-constant key="project_id" value="<project_id>"
-></walkthrough-watcher-constant>
+<walkthrough-watcher-constant key="project_id" value="<project_id>"></walkthrough-watcher-constant>
 
 1. Create or select a Google Cloud project to use for this
-    tutorial.
-    * <walkthrough-project-setup billing="true"></walkthrough-project-setup>
+   tutorial.
+   * <walkthrough-project-setup billing="true"></walkthrough-project-setup>
 
 1. Enable the Dataproc, Compute Engine, and Cloud Storage APIs in your
-    project.
-    ```sh
-    gcloud services enable dataproc.googleapis.com \
-      compute.googleapis.com \
-      storage-component.googleapis.com \
-      --project={{project_id}}
-    ```
+   project.
+
+   ```bash
+   gcloud services enable dataproc.googleapis.com \
+     compute.googleapis.com \
+     storage-component.googleapis.com \
+     --project={{project_id}}
+   ```
 
 ## Prerequisites (2)
 
 1. This walkthrough uploads a PySpark file (`pyspark_sort.py`) to a
    [Cloud Storage bucket](https://cloud.google.com/storage/docs/key-terms#buckets) in
    your project.
    * You can use the [Cloud Storage browser page](https://console.cloud.google.com/storage/browser)
-     in Google Cloud Platform Console to view existing buckets in your project.
+     in Google Cloud Console to view existing buckets in your project.
 
-     &nbsp;&nbsp;&nbsp;&nbsp;**OR**
+     **OR**
 
    * To create a new bucket, run the following command. Your bucket name must be unique.
-     ```bash
-     gsutil mb -p {{project-id}} gs://your-bucket-name
-     ```
 
-1. Set environment variables.
+        gsutil mb -p {{project_id}} gs://your-bucket-name
+
 
-   * Set the name of your bucket.
-     ```bash
-     BUCKET=your-bucket-name
-     ```
+2. Set environment variables.
+   * Set the name of your bucket.
+
+        BUCKET=your-bucket-name
 
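+For reference, the upload that the walkthrough performs can also be done
+with the `google-cloud-storage` client library. A minimal sketch, assuming
+a bucket you own and a local copy of `pyspark_sort.py`:
+
+```python
+# Minimal sketch: upload the PySpark file to Cloud Storage.
+# Assumes the google-cloud-storage library; the bucket name is a placeholder.
+from google.cloud import storage
+
+bucket_name = "your-bucket-name"  # placeholder; matches $BUCKET above
+
+client = storage.Client()
+bucket = client.bucket(bucket_name)
+blob = bucket.blob("pyspark_sort.py")
+blob.upload_from_filename("pyspark_sort.py")
+print(f"Uploaded to gs://{bucket_name}/{blob.name}")
+```
+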
 ## Prerequisites (3)
 
 1. Set up a Python
-   [virtual environment](https://virtualenv.readthedocs.org/en/latest/)
-   in Cloud Shell.
+   [virtual environment](https://virtualenv.readthedocs.org/en/latest/).
 
    * Create the virtual environment.
-     ```bash
-     virtualenv ENV
-     ```
+
+        virtualenv ENV
+
    * Activate the virtual environment.
-     ```bash
-     source ENV/bin/activate
-     ```
+
+        source ENV/bin/activate
 
-1. Install library dependencies in Cloud Shell.
-   ```bash
-   pip install -r requirements.txt
-   ```
+1. Install library dependencies.
+
+        pip install -r requirements.txt
 
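+As a quick sanity check that the dependencies installed into the virtual
+environment, you can try importing the client libraries the walkthrough
+code relies on; this assumes `requirements.txt` pulls in the
+`google-cloud-dataproc` and `google-cloud-storage` libraries.
+
+```python
+# Sanity check: confirm the client libraries installed correctly.
+# Assumes requirements.txt includes google-cloud-dataproc and
+# google-cloud-storage.
+import google.cloud.dataproc_v1 as dataproc
+import google.cloud.storage as storage
+
+print("Imported", dataproc.__name__, "and", storage.__name__)
+```
+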
 ## Create a cluster and submit a job
 
 1. Set a name for your new cluster.
-   ```bash
-   CLUSTER=new-cluster-name
-   ```
 
-1. Set a [zone](https://cloud.google.com/compute/docs/regions-zones/#available)
-   where your new cluster will be located. You can change the
-   "us-central1-a" zone that is pre-set in the following command.
-   ```bash
-   ZONE=us-central1-a
-   ```
+        CLUSTER=new-cluster-name
 
-1. Run `submit_job.py` with the `--create_new_cluster` flag
-   to create a new cluster and submit the `pyspark_sort.py` job
-   to the cluster.
+1. Set a [region](https://cloud.google.com/compute/docs/regions-zones/#available)
+   where your new cluster will be located. You can change the pre-set
+   "us-central1" region before you copy and run the following command.
 
-   ```bash
-   python submit_job_to_cluster.py \
-   --project_id={{project-id}} \
-   --cluster_name=$CLUSTER \
-   --zone=$ZONE \
-   --gcs_bucket=$BUCKET \
-   --create_new_cluster
-   ```
+        REGION=us-central1
+
+1. Run `submit_job_to_cluster.py` to create a new cluster and run the
+   `pyspark_sort.py` job on the cluster.
+
+        python submit_job_to_cluster.py \
+            --project_id={{project_id}} \
+            --cluster_name=$CLUSTER \
+            --region=$REGION \
+            --gcs_bucket=$BUCKET
 
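+For reference, the job-submission step inside the script boils down to
+something like the sketch below, again using the `google-cloud-dataproc`
+client library. It is an approximation of the script's behavior, not its
+exact code; all names are placeholders.
+
+```python
+# Minimal sketch of submitting the PySpark job and waiting for it to
+# finish; assumes the google-cloud-dataproc library and placeholder names.
+from google.cloud import dataproc_v1
+
+project_id = "my-project"         # placeholder
+region = "us-central1"            # placeholder
+cluster_name = "my-cluster"       # placeholder
+bucket_name = "your-bucket-name"  # placeholder
+
+job_client = dataproc_v1.JobControllerClient(
+    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
+)
+
+job = {
+    "placement": {"cluster_name": cluster_name},
+    "pyspark_job": {"main_python_file_uri": f"gs://{bucket_name}/pyspark_sort.py"},
+}
+
+# submit_job_as_operation returns a long-running operation; result()
+# blocks until the job finishes and raises if the job fails.
+operation = job_client.submit_job_as_operation(
+    request={"project_id": project_id, "region": region, "job": job}
+)
+response = operation.result()
+print(f"Job finished: {response.reference.job_id}")
+```
+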
 ## Job Output
 
-Job output in Cloud Shell shows cluster creation, job submission,
-job completion, and then tear-down of the cluster.
-
-    ...
-    Creating cluster...
-    Cluster created.
-    Uploading pyspark file to Cloud Storage.
-    new-cluster-name - RUNNING
-    Submitted job ID ...
-    Waiting for job to finish...
-    Job finished.
-    Downloading output file
-    .....
-    ['Hello,', 'dog', 'elephant', 'panther', 'world!']
-    ...
-    Tearing down cluster
-```
-## Congratulations on Completing the Walkthrough!
+Job output displayed in the Cloud Shell terminal shows cluster creation,
+job completion, sorted job output, and then deletion of the cluster.
+
+```text
+Cluster created successfully: cluster-name.
+...
+Job finished successfully.
+...
+['Hello,', 'dog', 'elephant', 'panther', 'world!']
+...
+Cluster cluster-name successfully deleted.
+```
+
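+The sorted list shown above is the job's driver output. If you want to
+fetch that output yourself, a finished job's `driver_output_resource_uri`
+field points at the output object in Cloud Storage. A hedged sketch,
+assuming `response` is the completed job returned by the
+`submit_job_as_operation(...).result()` call in the earlier sketch:
+
+```python
+# Sketch: download the driver output the job wrote to Cloud Storage.
+# Assumes `response` is a finished Job and google-cloud-storage is installed.
+import re
+from google.cloud import storage
+
+matches = re.match(r"gs://(.*?)/(.*)", response.driver_output_resource_uri)
+bucket_name, prefix = matches.group(1), matches.group(2)
+
+output = (
+    storage.Client()
+    .get_bucket(bucket_name)
+    .blob(f"{prefix}.000000000")  # driver output is written in numbered chunks
+    .download_as_bytes()
+)
+print(output.decode("utf-8"))
+```
+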
+## Congratulations on completing the walkthrough!
 <walkthrough-conclusion-trophy></walkthrough-conclusion-trophy>
 
 ---
 
 ### Next Steps:
 
-* **View job details from the Console.** View job details by selecting the
-  PySpark job from the Dataproc
-  =
+* **View job details in the Cloud Console.** View job details by selecting the
+  PySpark job name on the Dataproc
  [Jobs page](https://console.cloud.google.com/dataproc/jobs)
-  in the Google Cloud Platform Console.
+  in the Cloud Console.
 
 * **Delete resources used in the walkthrough.**
-  The `submit_job_to_cluster.py` job deletes the cluster that it created for this
+  The `submit_job_to_cluster.py` code deletes the cluster that it created for this
  walkthrough.
 
-  If you created a bucket to use for this walkthrough,
-  you can run the following command to delete the
-  Cloud Storage bucket (the bucket must be empty).
-  ```bash
-  gsutil rb gs://$BUCKET
-  ```
-  You can run the following command to delete the bucket **and all
-  objects within it. Note: the deleted objects cannot be recovered.**
-  ```bash
-  gsutil rm -r gs://$BUCKET
-  ```
+  If you created a Cloud Storage bucket to use for this walkthrough,
+  you can run the following command to delete the bucket (the bucket must be empty).
+
+      gsutil rb gs://$BUCKET
+
+* You can run the following command to **delete the bucket and all
+  objects within it. Note: the deleted objects cannot be recovered.**
+
+      gsutil rm -r gs://$BUCKET
+
 
 * **For more information.** See the [Dataproc documentation](https://cloud.google.com/dataproc/docs/)
   for API reference and product feature information.