Commit c85604e

Merge remote-tracking branch 'upstream/master'
2 parents 242d844 + 66d6c5d commit c85604e

11 files changed

+827
-17
lines changed

docs/_case-studies/ebay.md

Lines changed: 56 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,56 @@
1+
---
2+
layout: case-study
3+
hide_title: true # so we have control in case-study layout, but can still use page
4+
title: Low Latency Web Scale Fraud Prevention
5+
study_domain: ebay.com
6+
menu_title: eBay
7+
excerpt_separator: <!--more-->
8+
---
9+
<!--
10+
Licensed to the Apache Software Foundation (ASF) under one or more
11+
contributor license agreements. See the NOTICE file distributed with
12+
this work for additional information regarding copyright ownership.
13+
The ASF licenses this file to You under the Apache License, Version 2.0
14+
(the "License"); you may not use this file except in compliance with
15+
the License. You may obtain a copy of the License at
16+
17+
http://www.apache.org/licenses/LICENSE-2.0
18+
19+
Unless required by applicable law or agreed to in writing, software
20+
distributed under the License is distributed on an "AS IS" BASIS,
21+
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
22+
See the License for the specific language governing permissions and
23+
limitations under the License.
24+
-->
25+
26+
Low Latency Web Scale Fraud Prevention
27+
28+
<!--more-->
29+
30+
eBay Enterprise is the world’s largest omni-channel commerce provider with
31+
hundreds of millions of units shipped annually. As commerce gets more
32+
convenient and complex, so does fraud. The engineering team at eBay
33+
Enterprise selected Samza as the platform to build the horizontally
34+
scalable, realtime (sub-second) and fault-tolerant abnormality detection
35+
system. For example, the system computes and evaluates key metrics to
36+
detect abnormal behaviors:
37+
38+
- Transaction velocity (#txn/day) and change (#txn/day vs #txn/day over n days)
39+
- Amount velocity ($txn/day) and change ($txn/day vs $txn/day over n days)
40+
41+
A wide range of realtime and historical adjunct data from various sources
42+
including people, places, interests, social and connections is ingested
43+
through Kafka and stored in a local RocksDB state store with changelog
44+
enabled for recovery. Incoming transaction data is aggregated using
45+
windowing and then joined with adjunct data stores in multiple stages.
46+
The system generates potential fraud cases for review in real time. Finally,
47+
the engineering team at eBay Enterprise has built an OpenTSDB and Grafana
48+
based monitoring system using metrics collected through JMX.
49+
50+
Key Samza features: *Stateful processing*, *Windowing*, *Kafka-integration*,
51+
*JMX-metrics*
52+
53+
More information
54+
55+
- [https://www.slideshare.net/edibice/extremely-low-latency-web-scale-fraud-prevention-with-apache-samza-kafka-and-friends](https://www.slideshare.net/edibice/extremely-low-latency-web-scale-fraud-prevention-with-apache-samza-kafka-and-friends)
56+
- [http://ebayenterprise.com/](http://ebayenterprise.com/)
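
Reviewer note on the eBay case study above: the velocity and change metrics can be illustrated with a minimal Python sketch. This is not eBay's implementation. The `VelocityTracker` class, its method names, and the reading of "change" as the ratio of one day's velocity to the mean over the previous n days are all assumptions made for illustration; an in-memory dictionary stands in for the RocksDB state store.

```python
from collections import defaultdict
from datetime import timedelta


class VelocityTracker:
    """Per-account daily transaction counts and amounts (hypothetical helper)."""

    def __init__(self):
        # account -> day -> [transaction count, total amount]
        self._daily = defaultdict(lambda: defaultdict(lambda: [0, 0.0]))

    def record(self, account, day, amount):
        """Add one transaction to the account's bucket for that day."""
        bucket = self._daily[account][day]
        bucket[0] += 1
        bucket[1] += amount

    def velocity(self, account, day):
        """Return (transactions per day, amount per day) for a single day."""
        count, total = self._daily[account].get(day, (0, 0.0))
        return count, total

    def change(self, account, day, n_days):
        """Compare one day's velocity with the mean of the previous n_days.

        Returns (count ratio, amount ratio); a ratio is None when the
        baseline is zero, since the comparison is undefined there.
        """
        history = [
            self.velocity(account, day - timedelta(days=offset))
            for offset in range(1, n_days + 1)
        ]
        mean_count = sum(count for count, _ in history) / n_days
        mean_amount = sum(total for _, total in history) / n_days
        count, total = self.velocity(account, day)
        count_ratio = count / mean_count if mean_count else None
        amount_ratio = total / mean_amount if mean_amount else None
        return count_ratio, amount_ratio
```

A ratio well above 1 marks a day that is unusually busy relative to the account's recent history; what threshold counts as abnormal is left to the caller.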

docs/_case-studies/optimizely.md

Lines changed: 29 additions & 13 deletions
@@ -1,7 +1,7 @@
11
---
22
layout: case-study
33
hide_title: true # so we have control in case-study layout, but can still use page
4-
title: Real Time Session Aggregation at Optimizely
4+
title: Real Time Session Aggregation
55
study_domain: optimizely.com
66
menu_title: Optimizely
77
excerpt_separator: <!--more-->
@@ -23,18 +23,27 @@ excerpt_separator: <!--more-->
2323
limitations under the License.
2424
-->
2525

26-
Testing the excerpt
26+
Real Time Session Aggregation
2727

2828
<!--more-->
2929

30-
Optimizely is a world’s leading experimentation platform, enabling businesses to deliver continuous experimentation and personalization across websites, mobile apps and connected devices. At Optimizely, billions of events are tracked on a daily basis. Session metrics are among the key metrics provided to their end user in real time. Prior to introducing Samza for realtime computation, the engineering team at Optimizely used HBase to store and serve experimentation data, and Druid for personalization data including session metrics. As business requirements evolved, the Druid-based solution became more and more challenging.
30+
Optimizely is the world’s leading experimentation platform, enabling businesses to
31+
deliver continuous experimentation and personalization across websites, mobile
32+
apps and connected devices. At Optimizely, billions of events are tracked on a
33+
daily basis. Session metrics are among the key metrics provided to their end user
34+
in real time. Prior to introducing Samza for their realtime computation, the
35+
engineering team at Optimizely built their data-pipeline using a complex
36+
[Lambda architecture](http://lambda-architecture.net/) leveraging
37+
[Druid and HBase](https://medium.com/engineers-optimizely/building-a-scalable-data-pipeline-bfe3f531eb38).
38+
As business requirements evolved, this solution became more and more challenging.
3139

32-
- Long delays in session metrics caused by M/R jobs
33-
- Reprocessing of events due to inability to incrementally update Druid index
34-
- Difficulties in scaling dimensions and cardinality
35-
- Queries expanding long time periods are expensive
36-
37-
The engineering team at Optimizely decided to move away from Druid and focus on HBase as the store, and introduced stream processing to pre-aggregate and deduplicate session events. They evaluated multiple stream processing platforms and chose Samza as their stream processing platform. In their solution, every session event is tagged with an identifier for up to 30 minutes; upon receiving a session event, the Samza job updates session metadata and aggregates counters for the session that is stored in a local RocksDB state store. At the end of each one-minute window, aggregated session metrics are ingested to HBase. With the new solution
40+
The engineering team at Optimizely decided to move away from Druid and focus on
41+
HBase as the store, and introduced stream processing to pre-aggregate and
42+
deduplicate session events. In their solution, every session event is tagged
43+
with an identifier for up to 30 minutes; upon receiving a session event, the
44+
Samza job updates session metadata and aggregates counters for the session
45+
that is stored in a local RocksDB state store. At the end of each one-minute
46+
window, aggregated session metrics are ingested to HBase. With the new solution:
3847

3948
- The median query latency was reduced from 40+ ms to 5 ms
4049
- Session metrics are now available in realtime
@@ -44,15 +53,22 @@ The engineering team at Optimizely decided to move away from Druid and focus on
4453

4554
Here is a testimonial from Optimizely
4655

47-
“At Optimizely, we have built the world’s leading experimentation platform, which ingests billions of click-stream events a day from millions of visitors for analysis. Apache Samza has been a great asset to Optimizely's Event ingestion pipeline allowing us to perform large scale, real time stream computing such as aggregations (e.g. session computations) and data enrichment on a multiple billion events / day scale. The programming model, durability and the close integration with Apache Kafka fit our needs perfectly” said Vignesh Sukumar, Senior Engineering Manager at Optimizely”
56+
“At Optimizely, we have built the world’s leading experimentation platform,
57+
which ingests billions of click-stream events a day from millions of visitors
58+
for analysis. Apache Samza has been a great asset to Optimizely's Event
59+
ingestion pipeline allowing us to perform large scale, real time stream
60+
computing such as aggregations (e.g. session computations) and data enrichment
61+
on a multiple billion events / day scale. The programming model, durability
62+
and the close integration with Apache Kafka fit our needs perfectly,” said
63+
Vignesh Sukumar, Senior Engineering Manager at Optimizely.
4864

49-
In addition, stream processing is also applied to other use cases such as data enrichment, event stream partitioning and metrics processing at Optimizely.
65+
In addition, stream processing is also applied to other use cases such as
66+
data enrichment, event stream partitioning and metrics processing at Optimizely.
5067

51-
Key Samza features: *Stateful processing*, *Windowing*, *Kafka-integration*, *Scalability*, *Fault-tolerant*
68+
Key Samza features: *Stateful processing*, *Windowing*, *Kafka-integration*
5269

5370
More information
5471

5572
- [https://medium.com/engineers-optimizely/from-batching-to-streaming-real-time-session-metrics-using-samza-part-1-aed2051dd7a3](https://medium.com/engineers-optimizely/from-batching-to-streaming-real-time-session-metrics-using-samza-part-1-aed2051dd7a3)
56-
5773
- [https://medium.com/engineers-optimizely/from-batching-to-streaming-real-time-session-metrics-using-samza-part-2-b596350a7820](https://medium.com/engineers-optimizely/from-batching-to-streaming-real-time-session-metrics-using-samza-part-2-b596350a7820)
5874
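
Reviewer note on the Optimizely case study above: the per-event update and one-minute flush can be sketched in plain Python. This is an illustration under stated assumptions, not Optimizely's job. `SessionAggregator`, `SessionState`, and the `sink` callable are hypothetical names; a dictionary stands in for the local RocksDB store, the sink stands in for the HBase writer, and evicting state 30 minutes after a session's first event is one possible reading of the 30-minute identifier lifetime.

```python
from dataclasses import dataclass

# Session identifiers live for up to 30 minutes (from the case study).
SESSION_TTL_SECONDS = 30 * 60


@dataclass
class SessionState:
    first_seen: float
    last_seen: float
    event_count: int = 0


class SessionAggregator:
    """Hypothetical stand-in for a stateful, windowed stream task."""

    def __init__(self, sink):
        self._store = {}     # stand-in for the local RocksDB state store
        self._dirty = set()  # sessions updated since the last flush
        self._sink = sink    # stand-in for the HBase writer (any callable)

    def process(self, session_id, timestamp):
        """Update session metadata and counters for one incoming event."""
        state = self._store.get(session_id)
        if state is None:
            state = SessionState(first_seen=timestamp, last_seen=timestamp)
            self._store[session_id] = state
        state.last_seen = max(state.last_seen, timestamp)
        state.event_count += 1
        self._dirty.add(session_id)

    def window(self, now):
        """Run at the end of each one-minute window."""
        # Emit only sessions that changed during this window.
        for session_id in sorted(self._dirty):
            state = self._store[session_id]
            duration = state.last_seen - state.first_seen
            self._sink(session_id, state.event_count, duration)
        self._dirty.clear()
        # Assumption: drop state once the identifier can no longer be reused.
        expired = [
            session_id
            for session_id, state in self._store.items()
            if now - state.first_seen > SESSION_TTL_SECONDS
        ]
        for session_id in expired:
            del self._store[session_id]
```

Flushing only the sessions touched in the current window keeps each write to the downstream store proportional to recent activity rather than to the total number of open sessions.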

docs/_case-studies/redfin.md

Lines changed: 45 additions & 3 deletions
@@ -1,7 +1,7 @@
11
---
22
layout: case-study # the layout to use
33
hide_title: true # so we have control in case-study layout, but can still use page
4-
title: Totally awesome use-case of samza by Redfin # title of case study page
4+
title: Realtime Notifications at Redfin
55
study_domain: redfin.com # just the domain, not the protocol
66
menu_title: Redfin # what shows up in the menu
77
excerpt_separator: <!--more-->
@@ -23,8 +23,50 @@ excerpt_separator: <!--more-->
2323
limitations under the License.
2424
-->
2525

26-
Testing the excerpt
26+
Realtime Notifications
2727

2828
<!--more-->
2929

30-
Markdown content goes here
30+
Redfin is a leading full-service real estate brokerage that uses modern technology
31+
to help people buy and sell homes. Notifications are a critical channel for
32+
communicating with Redfin’s customers; they include recommendations, instant
33+
emails, scheduled digests and push notifications. Thousands of emails are delivered
34+
to customers every minute at peak.
35+
36+
The notification system used to be a monolithic system, which served the company
37+
well. However, as the business grew and requirements evolved, it became harder and
38+
harder to maintain and scale.
39+
40+
![Samza pipeline at Redfin](/img/case-studies/redfin.svg)
41+
42+
The engineering team at Redfin decided to replace
43+
the existing system with Samza primarily for Samza’s performance, scalability,
44+
support for stateful processing and Kafka-integration. A multi-stage stream
45+
processing pipeline was developed. At the Identify stage, external events
46+
such as new listings are identified as candidates for new notifications;
47+
then potential recipients of notifications are determined by analyzing data in
48+
events and customer profiles, results are grouped by customer at the end of
49+
each time window at the Match stage; once recipients and notification outlines are
50+
identified, the Organize stage retrieves adjunct data necessary to appear in each
51+
notification from various data sources by joining them with notification and
52+
customer profiles, results are stored/merged in a local RocksDB state store; finally,
53+
notifications are formatted at the Format stage and sent to the notification
54+
delivery system at the Notify stage.
55+
56+
With the new notification system:
57+
58+
- The system is more performant and horizontally scalable
59+
- It is now easier to add support for new use cases
60+
- Pressure on other systems is reduced due to the use of a local RocksDB state store
61+
- Processing stages can be scaled individually
62+
63+
Other engineering teams at Redfin are also using Samza for business metrics
64+
calculation, document processing and event scheduling.
65+
66+
Key Samza Features: *Stateful processing*, *Windowing*, *Kafka-integration*
67+
68+
More information
69+
70+
- [https://www.youtube.com/watch?v=cfy0xjJJf7Y](https://www.youtube.com/watch?v=cfy0xjJJf7Y)
71+
- [https://www.redfin.com/](https://www.redfin.com/)
72+
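
Reviewer note on the Redfin case study above: the Identify, Match, Organize, Format and Notify stages can be sketched as plain Python functions chained together. This is only an illustration of the staging idea. The event fields (`type`, `city`, `listing_id`), the profile and listing shapes, and all function names are hypothetical, and the time windowing and RocksDB state store described in the text are left out for brevity.

```python
from collections import defaultdict


def identify(events):
    """Identify stage: keep events that are candidates for a notification."""
    return [event for event in events if event.get("type") == "new_listing"]


def match(candidates, profiles):
    """Match stage: find recipients and group candidate listings by customer."""
    by_customer = defaultdict(list)
    for event in candidates:
        for customer, profile in profiles.items():
            if event["city"] in profile["cities"]:
                by_customer[customer].append(event["listing_id"])
    return dict(by_customer)


def organize(matches, listings):
    """Organize stage: join each outline with the adjunct data it needs."""
    return {
        customer: [listings[listing_id] for listing_id in listing_ids]
        for customer, listing_ids in matches.items()
    }


def format_stage(organized):
    """Format stage: render one message per customer."""
    return [
        f"{customer}: {', '.join(item['address'] for item in items)}"
        for customer, items in organized.items()
    ]


def notify(messages, deliver):
    """Notify stage: hand each message to the delivery system."""
    for message in messages:
        deliver(message)


def run_pipeline(events, profiles, listings, deliver):
    """Chain the five stages in the order the case study lists them."""
    matches = match(identify(events), profiles)
    notify(format_stage(organize(matches, listings)), deliver)
```

Keeping each stage a separate function mirrors the point made in the case study that stages can be developed and scaled individually; in a real deployment each stage would read from and write to its own stream.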

docs/_case-studies/tripadvisor.md

Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
1+
---
2+
layout: case-study
3+
hide_title: true # so we have control in case-study layout, but can still use page
4+
title: Hedwig - Converting Hadoop M/R ETL systems to Stream Processing
5+
study_domain: tripadvisor.com
6+
menu_title: TripAdvisor
7+
excerpt_separator: <!--more-->
8+
---
9+
<!--
10+
Licensed to the Apache Software Foundation (ASF) under one or more
11+
contributor license agreements. See the NOTICE file distributed with
12+
this work for additional information regarding copyright ownership.
13+
The ASF licenses this file to You under the Apache License, Version 2.0
14+
(the "License"); you may not use this file except in compliance with
15+
the License. You may obtain a copy of the License at
16+
17+
http://www.apache.org/licenses/LICENSE-2.0
18+
19+
Unless required by applicable law or agreed to in writing, software
20+
distributed under the License is distributed on an "AS IS" BASIS,
21+
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
22+
See the License for the specific language governing permissions and
23+
limitations under the License.
24+
-->
25+
26+
Hedwig - Converting Hadoop M/R ETL systems to Stream Processing
27+
28+
<!--more-->
29+
30+
TripAdvisor is one of the world’s largest travel websites, providing hotel
31+
and restaurant reviews, accommodation bookings and other travel-related
32+
content. It produces and processes billions of events every day,
33+
including billing records, reports, monitoring events and application
34+
notifications.
35+
36+
Prior to migrating to Samza, TripAdvisor used Hadoop to ETL its data. Raw
37+
data was rolled up into hourly and daily aggregates in a number of stages, with joins
38+
and sliding windows applied; session data was then produced from the daily data.
39+
About 300 million sessions are produced daily. With this solution, the
40+
engineering team faced a few challenges:
41+
42+
- Long lag time for business-critical downstream consumers
43+
- Difficult to debug and troubleshoot due to scripts, environments, etc.
44+
- Adding more nodes did not help the system scale
45+
46+
The engineering team at TripAdvisor decided to replace the Hadoop solution
47+
with a multi-stage Samza pipeline.
48+
49+
![Samza pipeline at TripAdvisor](/img/case-studies/trip-advisor.svg)
50+
51+
In the new solution, after raw data is first collected by Flume and ingested
52+
through a Kafka cluster, it is parsed, cleansed and partitioned by the
53+
Lookback Router; then processing logic such as windowing, grouping, joining,
54+
and fraud detection is applied by the Session Collector and the Fraud Collector;
55+
RocksDB is used as the local store for intermediate states; finally the Uploader
56+
uploads results to HDFS, ElasticSearch, RedShift and Hive.
57+
58+
The new solution achieved significant improvements:
59+
60+
- Processing time is reduced from 3 hours to 1 hour
61+
- Individual stages in the pipeline are scaled independently
62+
- Overall hardware requirement is reduced to ⅓ thanks to optimized usage
63+
- Much simpler to debug and test the solution
64+
65+
Key Samza features: *Stateful processing*, *Windowing*, *Kafka-integration*
66+
67+
More information
68+
69+
- [https://www.youtube.com/watch?v=KQ5OnL2hMBY](https://www.youtube.com/watch?v=KQ5OnL2hMBY)
70+
- [https://www.tripadvisor.com/](https://www.tripadvisor.com/)
71+
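
Reviewer note on the TripAdvisor case study above: the parse, cleanse and partition step can be sketched in Python. How the Lookback Router actually works is not described in the text, so everything here is an assumption: the `user_id,payload` record format, the `partition_for` and `route` helpers, and the choice of a CRC32 hash to keep all records for one key on the same partition.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a key to a partition with a hash that is stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(records, num_partitions):
    """Parse, cleanse and partition raw 'user_id,payload' lines."""
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        cleaned = record.strip()
        if not cleaned:
            continue  # cleanse: drop blank lines
        user_id, separator, payload = cleaned.partition(",")
        if not separator or not user_id:
            continue  # cleanse: drop lines without a usable key
        index = partition_for(user_id, num_partitions)
        partitions[index].append((user_id, payload))
    return partitions
```

Routing by key means a downstream stage that builds sessions sees every event for a given user in one place, so it can keep that user's intermediate state in its local store.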

docs/_config.yml

Lines changed: 1 addition & 1 deletion
@@ -20,7 +20,7 @@ highlighter: pygments
2020
markdown: redcarpet
2121
exclude: ['_notes']
2222
redcarpet:
23-
extensions: ['with_toc_data', 'smart', 'strikethrough']
23+
extensions: ['with_toc_data', 'smart', 'strikethrough', 'tables']
2424
exclude: [_docs]
2525
baseurl: http://samza.apache.org
2626
version: latest

docs/css/main.new.css

Lines changed: 11 additions & 0 deletions
@@ -826,9 +826,20 @@ a.side-navigation__group-title::after {
826826
}
827827

828828
table {
829+
border-collapse: collapse;
830+
margin: 1em 0;
829831
font-size: 15px;
830832
}
831833

834+
table th, table td {
835+
text-align: left;
836+
vertical-align: top;
837+
padding: 12px;
838+
border-bottom: 1px solid #ccc;
839+
border-top: 1px solid #ccc;
840+
border-left: 0;
841+
border-right: 0;
842+
}
832843
pre {
833844
padding: 20px;
834845
font-size: 15px;

docs/img/case-studies/redfin.svg

Lines changed: 1 addition & 0 deletions

docs/img/case-studies/trip-advisor.svg

Lines changed: 1 addition & 0 deletions
