docs/learn/documentation/versioned/connectors/hdfs.md: 23 additions & 25 deletions
@@ -47,24 +47,22 @@ While streaming sources like Kafka are unbounded, files on HDFS have finite data
#### Defining streams
-Samza uses the notion of a _system_ to describe any I/O source it interacts with. To consume from HDFS, you should create a new system that points to - `HdfsSystemFactory`. You can then associate multiple streams with this _system_. Each stream should have a _physical name_, which should be set to the name of the directory on HDFS.
In Samza's high-level API, you can use `HdfsSystemDescriptor` to create an HDFS system. The stream name should be set to the name of the directory on HDFS.
+{% highlight java %}
+HdfsSystemDescriptor hsd = new HdfsSystemDescriptor("hdfs-clickstream");
The above example defines a stream called `hdfs-clickstream` that reads data from the `/data/clickstream/2016/09/11` directory.
#### Whitelists & Blacklists
If you only want to consume from files that match a certain pattern, you can configure a whitelist. Likewise, you can configure a blacklist to skip certain files. When both are specified, the _whitelist_ first selects the candidate files and the _blacklist_ is then applied to that selection.
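The ordering described above (whitelist first, blacklist on the result) can be illustrated with a small standalone sketch. This is not Samza's implementation: `FileFilterSketch` and `select` are hypothetical names, the patterns are made-up examples, and the sketch assumes patterns are regular expressions that must match the whole file name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Samza: shows whitelist-then-blacklist ordering.
class FileFilterSketch {
    // Assumption: both patterns are regexes matched against the full file name.
    static List<String> select(List<String> fileNames, String whitelist, String blacklist) {
        Pattern allow = Pattern.compile(whitelist);
        Pattern deny = Pattern.compile(blacklist);
        List<String> selected = new ArrayList<>();
        for (String name : fileNames) {
            // Step 1: only files matching the whitelist are candidates.
            if (!allow.matcher(name).matches()) {
                continue;
            }
            // Step 2: the blacklist removes candidates that survived step 1.
            if (deny.matcher(name).matches()) {
                continue;
            }
            selected.add(name);
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList(
                "part-0001.avro", "part-0002.avro", "part-0002.avro.tmp", "_SUCCESS");
        // Example patterns only; real deployments would choose their own.
        System.out.println(select(files, ".*\\.avro.*", ".*\\.tmp"));
    }
}
```

A file that matches the blacklist but not the whitelist is never considered, so the blacklist only ever narrows the whitelisted set.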
Samza allows writing your output results to HDFS in Avro format. You can either use Avro's `GenericRecord`s or have Samza automatically infer the schema for your object using reflection.
-{% highlight jproperties %}
-# set the SystemFactory implementation to instantiate HdfsSystemProducer aliased to 'hdfs'
Samza allows you to control the base HDFS directory to write your output. You can also organize the output into sub-directories depending on the time your application ran, by configuring a date-formatter.
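To make the date-based sub-directory idea concrete, here is a minimal sketch of how a base directory and a date pattern could combine into an output path. It is an illustration under assumptions, not Samza's code: `DatedOutputPathSketch`, `outputDir`, the `/data/output` directory, and the `yyyy_MM_dd` pattern are all hypothetical.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of Samza: derives a dated sub-directory.
class DatedOutputPathSketch {
    static String outputDir(String baseDir, String datePattern, LocalDate runDate) {
        // Format the run date with the configured pattern to get the bucket name.
        String bucket = runDate.format(DateTimeFormatter.ofPattern(datePattern));
        // Join base directory and bucket without doubling the separator.
        return baseDir.endsWith("/") ? baseDir + bucket : baseDir + "/" + bucket;
    }

    public static void main(String[] args) {
        // Example values only; the base directory and pattern are assumptions.
        System.out.println(outputDir("/data/output", "yyyy_MM_dd", LocalDate.of(2016, 9, 11)));
    }
}
```

A coarser pattern (for example, month only) would group more runs into the same sub-directory.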
You can configure the maximum size of each file or the maximum number of records per file. Once either limit is reached, Samza will create a new file.
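The "whichever limit is hit first" behavior can be sketched as a small counter that decides when to start a new file. This is a hedged illustration, not Samza's HDFS writer: `RolloverSketch` and its fields are hypothetical, and it assumes the limits are checked before each write rather than after.

```java
// Hypothetical sketch, not Samza's HDFS writer: tracks when to start a new file.
class RolloverSketch {
    private final long maxBytesPerFile;
    private final long maxRecordsPerFile;
    private long bytesInFile = 0;
    private long recordsInFile = 0;
    private int fileIndex = 0;

    RolloverSketch(long maxBytesPerFile, long maxRecordsPerFile) {
        this.maxBytesPerFile = maxBytesPerFile;
        this.maxRecordsPerFile = maxRecordsPerFile;
    }

    // Accounts for one record and returns the index of the file it lands in.
    int write(long recordSizeBytes) {
        // Assumption: roll over before writing once either limit has been reached.
        if (bytesInFile >= maxBytesPerFile || recordsInFile >= maxRecordsPerFile) {
            fileIndex++;
            bytesInFile = 0;
            recordsInFile = 0;
        }
        bytesInFile += recordSizeBytes;
        recordsInFile++;
        return fileIndex;
    }

    public static void main(String[] args) {
        // Example limits only: 1024 bytes or 3 records per file.
        RolloverSketch writer = new RolloverSketch(1024, 3);
        for (int i = 0; i < 5; i++) {
            System.out.println("record " + i + " -> file " + writer.write(100));
        }
    }
}
```

With small records the record-count limit triggers first; with large records the size limit does.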
Each Kinesis stream in a given AWS [region](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concepts.RegionsAndAvailabilityZones.html) can be accessed by providing an [access key](https://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html#access-keys-and-secret-access-keys). An access key consists of two parts: an access key ID (for example, `AKIAIOSFODNN7EXAMPLE`) and a secret access key (for example, `wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY`), which you can use to send programmatic requests to AWS.
ksd.getInputDescriptor("STREAM-NAME", new NoOpSerde<byte[]>())
+    .withRegion("STREAM-REGION")
+    .withAccessKey("YOUR-ACCESS-KEY")
+    .withSecretKey("YOUR-SECRET-KEY");
{% endhighlight %}
### Advanced Configuration
#### Kinesis Client Library Configs
Samza Kinesis Connector uses the [Kinesis Client Library](https://docs.aws.amazon.com/streams/latest/dev/developing-consumers-with-kcl.html#kinesis-record-processor-overview-kcl)
(KCL) to access the Kinesis data streams. You can set any [KCL Configuration](https://github.com/awslabs/amazon-kinesis-client/blob/master/amazon-kinesis-client-multilang/src/main/java/software/amazon/kinesis/coordinator/KinesisClientLibConfiguration.java)
-for a stream by configuring it with the **systems.system-name.streams.stream-name.aws.kcl.*** prefix.
+for a stream by configuring it through `KinesisInputDescriptor`.
Samza allows you to specify any [AWS client configs](http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/ClientConfiguration.html) to connect to your Kinesis instance.
-You can configure any [AWS client configuration](http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/ClientConfiguration.html)with the `systems.your-system-name.aws.clientConfig.*` prefix.
+You can configure any [AWS client configuration](http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/ClientConfiguration.html) through `KinesisSystemDescriptor`.
Alternatively, if you want to start from a particular offset in the Kinesis stream, you can log in to the [AWS console](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/ConsoleDynamoDB.html) and edit the offsets in your DynamoDB table.