Commit 009beda

ViBiOh, ge0Aja, janine-c authored
feat(aws): Add scheduled invocation for retry event (#1027)
* feat(aws): Add scheduled invocation for retry event
  Signed-off-by: Vincent Boutour <[email protected]>
* fixup! feat(aws): Add scheduled invocation for retry event
  Signed-off-by: Vincent Boutour <[email protected]>
* docs(aws): Update the doc related to retry
  Co-authored-by: Georgi <[email protected]>
* fixup! feat(aws): Add scheduled invocation for retry event
  Signed-off-by: Vincent Boutour <[email protected]>
* docs(aws): Reword section on store failed events
  Co-authored-by: Janine Chan <[email protected]>
* docs(aws): Adding more documentation on the retry mechanism
  Signed-off-by: Vincent Boutour <[email protected]>
* feat(aws): Ensure both storage and retry are enabled for creating scheduler
  Signed-off-by: Vincent Boutour <[email protected]>

---------

Signed-off-by: Vincent Boutour <[email protected]>
Co-authored-by: Georgi <[email protected]>
Co-authored-by: Janine Chan <[email protected]>
1 parent 7d1d4de commit 009beda

3 files changed: 104 additions, 8 deletions

aws/logs_monitoring/README.md

Lines changed: 26 additions & 2 deletions
@@ -92,10 +92,14 @@ If you can't install the Forwarder using the provided CloudFormation template, y
 5. Some AWS accounts are configured such that triggers will not automatically create resource-based policies allowing Cloudwatch log groups to invoke the forwarder. Reference the [CloudWatchLogPermissions][103] to see which permissions are required for the forwarder to be invoked by Cloudwatch Log Events.
 6. [Configure triggers][104].
 7. Create an S3 bucket, and set environment variable `DD_S3_BUCKET_NAME` to the bucket name. Also provide `s3:GetObject`, `s3:PutObject`, `s3:ListBucket`, and `s3:DeleteObject` permissions on this bucket to the Lambda execution role. This bucket is used to store the different tags cache i.e. Lambda, S3, Step Function and Log Group. Additionally, this bucket will be used to store unforwarded events incase of forwarding exceptions.
-8. Set environment variable `DD_STORE_FAILED_EVENTS` to `true` to enable the forwarder to also store event data in the S3 bucket. In case of exceptions when sending logs, metrics or traces to intake, the forwarder will store relevant data in the S3 bucket. On custom invocations i.e. on receiving an event with the `retry` keyword set to a non empty string (which can be manually triggered - see below), the forwarder will retry sending the stored events. When successful it will clear up the storage in the bucket.
+8. Set the environment variable `DD_STORE_FAILED_EVENTS` to `true`, so you can enable the forwarder to also store event data in the S3 bucket. If an exception occurs when sending logs, metrics, or traces to intake, the forwarder stores relevant data in the S3 bucket. On custom invocations, such as on receiving an event with the `retry` keyword explicitly set to `true`, the forwarder retries sending the stored events. Upon a successful forwarding, the forwarder cleans up the stored logs.

 ```bash
-aws lambda invoke --function-name <function-name> --payload '{"retry":"true"}' --cli-binary-format raw-in-base64-out --log-type Tail /dev/stdout
+aws lambda invoke --function-name <function-name> \
+--payload '{"retry":true}' \
+--cli-binary-format raw-in-base64-out \
+--log-type Tail /dev/stdout |
+jq -r 'select(.LogResult) | .LogResult' | base64 -d | xargs -0 printf "%s"
 ```

 <div class="alert alert-warning">
@@ -312,6 +316,14 @@ Otherwise, if you are using Web Proxy:
 7. Set `DdNoSsl` to `true` if connecting to the proxy using `http`.
 8. Set `DdSkipSslValidation` to `true` if connecting to the proxy using `https` with a self-signed certificate.

+### Scheduled retry
+
+When you enable `DdStoreFailedEvents`, the Lambda forwarder stores any events that couldn’t be sent to Datadog in an S3 bucket. These events can be logs, metrics, or traces. They aren’t automatically re‑processed on each Lambda invocation; instead, you must trigger a [manual Lambda run](https://docs.datadoghq.com/logs/guide/forwarder/?tab=manual) to process them again.
+
+You can automate this re‑processing by enabling `DdScheduleRetryFailedEvents` parameter, creating a scheduled Lambda invocation through [AWS EventBridge](https://docs.aws.amazon.com/lambda/latest/dg/with-eventbridge-scheduler.html). By default, the forwarder attempts re‑processing every six hours.
+
+Keep in mind that log events can only be submitted with [timestamps up to 18 hours in the past](https://docs.datadoghq.com/logs/log_collection/?tab=host#custom-log-forwarding); older timestamps will cause the events to be discarded.
+
 ### Code signing

 The Datadog Forwarder is signed by Datadog. To verify the integrity of the Forwarder, use the manual installation method. [Create a Code Signing Configuration][19] that includes Datadog’s Signing Profile ARN (`arn:aws:signer:us-east-1:464622532012:/signing-profiles/DatadogLambdaSigningProfile/9vMI9ZAGLc`) and associate it with the Forwarder Lambda function before uploading the Forwarder ZIP file.
@@ -456,6 +468,15 @@ To test different patterns against your logs, turn on [debug logs](#troubleshoot
 `AdditionalTargetLambdaArns`
 : Comma separated list of Lambda ARNs that will get called asynchronously with the same `event` the Datadog Forwarder receives.

+`DdStoreFailedEvents`
+: Set to true to enable the forwarder to store events that failed to send to Datadog.
+
+`DdScheduleRetryFailedEvents`
+: Set to true to enable a scheduled forwarder invocation (via AWS EventBridge) to process stored failed events.
+
+`DdScheduleRetryInterval`
+: Interval in hours for scheduled forwarder invocation (via AWS EventBridge).
+
 `InstallAsLayer`
 : Whether to use the layer-based installation flow. Set to false to use the legacy installation flow, which installs a second function that copies the forwarder code from GitHub to an S3 bucket. Defaults to true.
@@ -622,6 +643,9 @@ To test different patterns against your logs, turn on [debug logs](#troubleshoot
 `ADDITIONAL_TARGET_LAMBDA_ARNS`
 : Comma separated list of Lambda ARNs that will get called asynchronously with the same `event` the Datadog Forwarder receives.

+`DD_STORE_FAILED_EVENTS`
+: Set to true to enable the forwarder to store events that failed to send to Datadog.
+
 `INSTALL_AS_LAYER`
 : Whether to use the layer-based installation flow. Set to false to use the legacy installation flow, which installs a second function that copies the forwarder code from GitHub to an S3 bucket. Defaults to true.
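
The new "Scheduled retry" README section pairs a default six-hour retry interval with an 18-hour limit on log timestamps. As a rough illustration that is not part of the commit, the sketch below lists how old a stored log event would be at each scheduled run in the worst case, where the failure happens just after a run. The helper name is hypothetical, and the sketch assumes the event's timestamp equals its failure time, that the interval is a whole number of hours, and that the age limit is strict (a conservative choice).

```python
MAX_LOG_AGE_HOURS = 18  # limit quoted in the README for log timestamps


def worst_case_ages_at_retry(interval_hours, max_age_hours=MAX_LOG_AGE_HOURS):
    """Return the event age (in hours) at each scheduled retry that still
    falls inside the accepted window.

    Hypothetical helper: assumes the event failed right after a scheduled
    run, so the first retry sees it at `interval_hours` old, and treats the
    age limit as strict (an assumption, chosen to be conservative).
    """
    if interval_hours <= 0:
        raise ValueError("interval_hours must be a positive integer")
    # range() excludes the stop value, which implements the strict limit.
    return list(range(interval_hours, max_age_hours, interval_hours))


print(worst_case_ages_at_retry(6))   # default DdScheduleRetryInterval
print(worst_case_ages_at_retry(24))  # interval longer than the window
```

Under these assumptions, the default interval leaves two scheduled attempts (at 6 and 12 hours) before a log event ages out, while an interval of 18 hours or more leaves none.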

aws/logs_monitoring/lambda_function.py

Lines changed: 12 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -62,6 +62,17 @@ def datadog_forwarder(event, context):
     init_cache_layer(function_prefix)
     init_forwarder(function_prefix)

+    if len(event) == 1 and str(event.get(DD_RETRY_KEYWORD, "false")).lower() == "true":
+        logger.info("Retry-only invocation")
+
+        try:
+            forwarder.retry()
+        except Exception as e:
+            if logger.isEnabledFor(logging.DEBUG):
+                logger.debug(f"Failed to retry forwarding {e}")
+
+        return
+
     parsed = parse(event, context, cache_layer)
     enriched = enrich(parsed, cache_layer)
     transformed = transform(enriched)
@@ -71,12 +82,11 @@ def datadog_forwarder(event, context):
     parse_and_submit_enhanced_metrics(logs, cache_layer)

     try:
-        if bool(event.get(DD_RETRY_KEYWORD, False)) is True:
+        if str(event.get(DD_RETRY_KEYWORD, "false")).lower() == "true":
             forwarder.retry()
     except Exception as e:
         if logger.isEnabledFor(logging.DEBUG):
             logger.debug(f"Failed to retry forwarding {e}")
-        pass


 def init_cache_layer(function_prefix):
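
The change to the retry check is easiest to see side by side. The sketch below copies the old and new conditions from the diff into standalone functions so their behavior can be compared. The function names are hypothetical, and `DD_RETRY_KEYWORD = "retry"` is an assumption based on the `{"retry": true}` payload shown in the README.

```python
DD_RETRY_KEYWORD = "retry"  # assumed value, based on the README payload


def old_retry_requested(event):
    """Condition removed by this commit: any truthy value triggers a retry."""
    return bool(event.get(DD_RETRY_KEYWORD, False)) is True


def new_retry_requested(event):
    """Condition added by this commit: only true/"true" (any case) triggers."""
    return str(event.get(DD_RETRY_KEYWORD, "false")).lower() == "true"


def is_retry_only(event):
    """Early-return path: the retry key is the event's only key."""
    return len(event) == 1 and new_retry_requested(event)


for event in ({"retry": True}, {"retry": "true"}, {"retry": "false"}, {"retry": 1}):
    print(event, old_retry_requested(event), new_retry_requested(event))
```

The notable difference is the string `"false"`: it is non-empty and therefore truthy, so the old condition treated it as a retry request, while the new one does not. The new condition also rejects other truthy values such as `1`.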

aws/logs_monitoring/template.yaml

Lines changed: 66 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -261,6 +261,17 @@ Parameters:
       - true
       - false
     Description: Set to true to enable the forwarder to store events that failed to send to Datadog.
+  DdScheduleRetryFailedEvents:
+    Type: String
+    Default: false
+    AllowedValues:
+      - true
+      - false
+    Description: Set to true to enable a scheduled forwarder invocation (via AWS EventBridge) to process stored failed events.
+  DdScheduleRetryInterval:
+    Type: Number
+    Default: 6
+    Description: Interval in hours for scheduled forwarder invocation (via AWS EventBridge).
   DdForwarderExistingBucketName:
     Type: String
     Default: ""
@@ -292,7 +303,7 @@ Parameters:
   KmsKeyList:
     Type: CommaDelimitedList
     Default: ""
-    Description: List of KMS Key ARNs the Lambda forwarder function can use to decrypt, seperated by comma
+    Description: List of KMS Key ARNs the Lambda forwarder function can use to decrypt, seperated by comma
 Conditions:
   IsAWSChina: !Equals [!Ref "AWS::Partition", aws-cn]
   IsGovCloud: !Equals [!Ref "AWS::Partition", aws-us-gov]
@@ -348,7 +359,8 @@ Conditions:
   SetLayerARN: !Not
     - !Equals [!Ref LayerARN, ""]
   SetDdForwardLog: !Equals [!Ref DdForwardLog, false]
-  SetDdStepFunctionsTraceEnabled: !Equals [!Ref DdStepFunctionsTraceEnabled, true]
+  SetDdStepFunctionsTraceEnabled:
+    !Equals [!Ref DdStepFunctionsTraceEnabled, true]
   SetDdUseCompression: !Equals [!Ref DdUseCompression, false]
   SetDdCompressionLevel: !Not
     - !Equals [!Ref DdCompressionLevel, 6]
@@ -384,6 +396,9 @@ Conditions:
     - !Equals [!Ref DdLogLevel, ""]
   SetDdForwarderDecryptKeys: !Not
     - !Equals [!Join ["", !Ref KmsKeyList], ""]
+  CreateRetryScheduler: !And
+    - !Equals [!Ref DdStoreFailedEvents, true]
+    - !Equals [!Ref DdScheduleRetryFailedEvents, true]
 Rules:
   MustSetDdApiKey:
     Assertions:
@@ -431,7 +446,10 @@ Resources:
             - !Ref DdForwarderExistingBucketName
           S3Key: !Sub
             - "aws-dd-forwarder-${DdForwarderVersion}.zip"
-            - {DdForwarderVersion: !FindInMap [Constants, DdForwarder, Version]}
+            - {
+                DdForwarderVersion:
+                  !FindInMap [Constants, DdForwarder, Version],
+              }
         - ZipFile: " "
       MemorySize: !Ref MemorySize
       Runtime: python3.13
@@ -831,7 +849,7 @@ Resources:
         - !Ref SourceZipUrl
         - !Sub
           - "https://github.com/DataDog/datadog-serverless-functions/releases/download/aws-dd-forwarder-${DdForwarderVersion}/aws-dd-forwarder-${DdForwarderVersion}.zip"
-          - {DdForwarderVersion: !FindInMap [Constants, DdForwarder, Version]}
+          - { DdForwarderVersion: !FindInMap [Constants, DdForwarder, Version] }
   # The Forwarder's source code is too big to fit the inline code size limit for CloudFormation. In most of AWS
   # partitions and regions, the Forwarder is able to load its source code from a Lambda layer attached to it.
   # In places where Datadog can't/doesn't yet publish Lambda layers, use another Lambda to copy the source code
@@ -970,6 +988,50 @@ Resources:
                   - - "arn:*:s3:::"
                     - !Select [1, !Split ["s3://", !Ref SourceZipUrl]]
               - !Ref AWS::NoValue
+  SchedulerRole:
+    Type: AWS::IAM::Role
+    Condition: CreateRetryScheduler
+    Properties:
+      AssumeRolePolicyDocument:
+        Version: "2012-10-17"
+        Statement:
+          - Action:
+              - sts:AssumeRole
+            Effect: Allow
+            Principal:
+              Service: !If
+                - IsAWSChina
+                - "scheduler.amazonaws.com.cn"
+                - "scheduler.amazonaws.com"
+      PermissionsBoundary: !If
+        - SetPermissionsBoundary
+        - !Ref PermissionsBoundaryArn
+        - !Ref AWS::NoValue
+      Policies:
+        - PolicyName: SchedulerRolePolicy0
+          PolicyDocument:
+            Version: "2012-10-17"
+            Statement:
+              - Effect: Allow
+                Action:
+                  - lambda:InvokeFunction
+                Resource:
+                  - !GetAtt
+                    - Forwarder
+                    - Arn
+  Scheduler:
+    Type: AWS::Scheduler::Schedule
+    Condition: CreateRetryScheduler
+    Properties:
+      Name: !Sub "${AWS::StackName}-retry"
+      Description: Retry the failed events from the Datadog Lambda Forwarder
+      ScheduleExpression: !Sub "rate(${DdScheduleRetryInterval} hours)"
+      FlexibleTimeWindow:
+        Mode: "OFF"
+      Target:
+        Arn: !GetAtt "Forwarder.Arn"
+        RoleArn: !GetAtt "SchedulerRole.Arn"
+        Input: '{"retry": true}'
 Outputs:
   DatadogForwarderArn:
     Description: Datadog Forwarder Lambda Function ARN
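
To make the template logic concrete, the sketch below mirrors the `CreateRetryScheduler` condition and the values the `Scheduler` resource would carry. It is only an illustration in plain Python, not how CloudFormation evaluates the template. The function name is hypothetical, and the parameters are modeled as booleans and an integer for simplicity.

```python
import json


def retry_schedule(store_failed_events, schedule_retry_failed_events, interval_hours=6):
    """Return a dict mirroring the Scheduler resource, or None when the
    CreateRetryScheduler condition would be false.

    Hypothetical helper: both flags must be true, matching the !And
    condition in the template.
    """
    if not (store_failed_events and schedule_retry_failed_events):
        return None
    return {
        # Same shape as: !Sub "rate(${DdScheduleRetryInterval} hours)"
        "ScheduleExpression": f"rate({interval_hours} hours)",
        "FlexibleTimeWindow": {"Mode": "OFF"},
        # Same payload as the template's Target.Input
        "Input": json.dumps({"retry": True}),
    }


print(retry_schedule(True, True))
print(retry_schedule(False, True))
```

Because the scheduled payload contains only the `retry` key, it takes the retry-only path added in `lambda_function.py`, which returns before any parsing or forwarding of new events.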
