docs: add snippets for Matrix Factorization tutorials #1630

Merged
merged 37 commits into from
May 8, 2025
8703eb4
docs: add matrix_factorization snippets
rey-esp Apr 1, 2025
5b71583
incomplete mf snippets
rey-esp Apr 2, 2025
edd4cd5
prep implicit
rey-esp Apr 2, 2025
5de40af
Merge branch 'main' into b338873783-mf-snippets
rey-esp Apr 8, 2025
9da5027
Merge branch 'main' into b338873783-mf-snippets
rey-esp Apr 9, 2025
9f14b7a
Merge branch 'main' into b338873783-mf-snippets
rey-esp Apr 14, 2025
2957d93
Merge branch 'main' into b338873783-mf-snippets
rey-esp Apr 17, 2025
846c421
Merge branch 'main' into b338873783-mf-snippets
rey-esp Apr 17, 2025
24efc36
Merge branch 'main' into b338873783-mf-snippets
rey-esp Apr 18, 2025
d7df5d8
Merge branch 'main' into b338873783-mf-snippets
rey-esp Apr 21, 2025
03a96a5
Merge branch 'main' into b338873783-mf-snippets
rey-esp Apr 21, 2025
a938b7d
Merge branch 'main' into b338873783-mf-snippets
rey-esp Apr 21, 2025
23286c3
Merge branch 'main' into b338873783-mf-snippets
rey-esp Apr 22, 2025
2e9a2d8
Merge branch 'b338873783-mf-snippets' of github.com:googleapis/python…
rey-esp Apr 22, 2025
08c60c4
Merge branch 'main' into b338873783-mf-snippets
rey-esp Apr 22, 2025
123f4d4
Merge branch 'main' into b338873783-mf-snippets
rey-esp Apr 23, 2025
c262f19
Merge branch 'main' into b338873783-mf-snippets
rey-esp Apr 24, 2025
ce87b14
Merge branch 'main' into b338873783-mf-snippets
rey-esp May 1, 2025
b898a76
near complete tutorial
rey-esp May 1, 2025
33f446a
implicit create
rey-esp May 1, 2025
431e9eb
add doc note
rey-esp May 1, 2025
976eda7
Merge branch 'main' into b338873783-mf-snippets
rey-esp May 2, 2025
b1a4287
Merge branch 'main' into b338873783-mf-snippets
rey-esp May 5, 2025
7380f6f
complete explicit tutorial
rey-esp May 5, 2025
74c0d85
remove implicit snippets
rey-esp May 5, 2025
699bf1e
Merge branch 'main' into b338873783-mf-snippets
tswast May 6, 2025
e845379
Merge branch 'main' into b338873783-mf-snippets
rey-esp May 7, 2025
862e118
Update samples/snippets/mf_explicit_model_test.py
tswast May 7, 2025
7f2d7f6
add snippets to create dataset and movielens tables
tswast May 7, 2025
6bdfbfb
correct the region tags
tswast May 7, 2025
1847d61
correct more region tags
tswast May 7, 2025
0ad94b2
Update samples/snippets/mf_explicit_model_test.py
rey-esp May 7, 2025
1a81d6a
Update samples/snippets/mf_explicit_model_test.py
rey-esp May 7, 2025
dbfadbd
update evaluate section
rey-esp May 7, 2025
82efd99
Merge branch 'main' into b338873783-mf-snippets
rey-esp May 7, 2025
6c3dee2
Merge branch 'main' into b338873783-mf-snippets
rey-esp May 7, 2025
c03abf2
Merge branch 'main' into b338873783-mf-snippets
rey-esp May 8, 2025
162 changes: 162 additions & 0 deletions samples/snippets/mf_explicit_model_test.py
@@ -0,0 +1,162 @@
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (t
# you may not use this file except in compliance wi
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in
# distributed under the License is distributed on a
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, eit
# See the License for the specific language governi
# limitations under the License.


def test_explicit_matrix_factorization(random_model_id: str) -> None:
    your_model_id = random_model_id

    # [START bigquery_dataframes_bqml_mf_explicit_create_dataset]
    import google.cloud.bigquery

    bqclient = google.cloud.bigquery.Client()
    bqclient.create_dataset("bqml_tutorial", exists_ok=True)
    # [END bigquery_dataframes_bqml_mf_explicit_create_dataset]

    # [START bigquery_dataframes_bqml_mf_explicit_upload_movielens]
    import io
    import zipfile

    import google.api_core.exceptions
    import requests

    try:
        # Check if you've already created the Movielens tables to avoid downloading
        # and uploading the dataset unnecessarily.
        bqclient.get_table("bqml_tutorial.ratings")
        bqclient.get_table("bqml_tutorial.movies")
    except google.api_core.exceptions.NotFound:
        # Download the https://grouplens.org/datasets/movielens/1m/ dataset.
        ml1m = requests.get("http://files.grouplens.org/datasets/movielens/ml-1m.zip")
        ml1m_file = io.BytesIO(ml1m.content)
        ml1m_zip = zipfile.ZipFile(ml1m_file)

        # Upload the ratings data into the ratings table.
        with ml1m_zip.open("ml-1m/ratings.dat") as ratings_file:
            ratings_content = ratings_file.read()

        ratings_csv = io.BytesIO(ratings_content.replace(b"::", b","))
        ratings_config = google.cloud.bigquery.LoadJobConfig()
        ratings_config.source_format = "CSV"
        ratings_config.write_disposition = "WRITE_TRUNCATE"
        ratings_config.schema = [
            google.cloud.bigquery.SchemaField("user_id", "INT64"),
            google.cloud.bigquery.SchemaField("item_id", "INT64"),
            google.cloud.bigquery.SchemaField("rating", "FLOAT64"),
            google.cloud.bigquery.SchemaField("timestamp", "TIMESTAMP"),
        ]
        bqclient.load_table_from_file(
            ratings_csv, "bqml_tutorial.ratings", job_config=ratings_config
        ).result()

        # Upload the movie data into the movies table.
        with ml1m_zip.open("ml-1m/movies.dat") as movies_file:
            movies_content = movies_file.read()

        movies_csv = io.BytesIO(movies_content.replace(b"::", b"@"))
        movies_config = google.cloud.bigquery.LoadJobConfig()
        movies_config.source_format = "CSV"
        movies_config.field_delimiter = "@"
        movies_config.write_disposition = "WRITE_TRUNCATE"
        movies_config.schema = [
            google.cloud.bigquery.SchemaField("movie_id", "INT64"),
            google.cloud.bigquery.SchemaField("movie_title", "STRING"),
            google.cloud.bigquery.SchemaField("genre", "STRING"),
        ]
        bqclient.load_table_from_file(
            movies_csv, "bqml_tutorial.movies", job_config=movies_config
        ).result()
    # [END bigquery_dataframes_bqml_mf_explicit_upload_movielens]

    # [START bigquery_dataframes_bqml_mf_explicit_create]
    from bigframes.ml import decomposition
    import bigframes.pandas as bpd

    # Load data from BigQuery
    bq_df = bpd.read_gbq(
        "bqml_tutorial.ratings", columns=("user_id", "item_id", "rating")
    )

    # Create the Matrix Factorization model
    model = decomposition.MatrixFactorization(
        num_factors=34,
        feedback_type="explicit",
        user_col="user_id",
        item_col="item_id",
        rating_col="rating",
        l2_reg=9.83,
    )
    model.fit(bq_df)
    model.to_gbq(
        your_model_id, replace=True  # For example: "bqml_tutorial.mf_explicit"
    )
    # [END bigquery_dataframes_bqml_mf_explicit_create]
    # [START bigquery_dataframes_bqml_mf_explicit_evaluate]
    # Evaluate the model using the score() function
    model.score(bq_df)
    # Output:
    # mean_absolute_error  mean_squared_error  mean_squared_log_error  median_absolute_error  r2_score  explained_variance
    #            0.485403            0.395052                0.025515               0.390573   0.68343             0.68343
    # [END bigquery_dataframes_bqml_mf_explicit_evaluate]
    # [START bigquery_dataframes_bqml_mf_explicit_recommend_df]
    # Use predict() to get the predicted rating for each movie for 5 users
    subset = bq_df[["user_id"]].head(5)
    predicted = model.predict(subset)
    print(predicted)
    # Output:
    #    predicted_rating  user_id  item_id  rating
    # 0          4.206146     4354      968     4.0
    # 1          4.853099     3622     3521     5.0
    # 2          2.679067     5543      920     2.0
    # 3          4.323458      445     3175     5.0
    # 4          3.476911     5535      235     4.0
    # [END bigquery_dataframes_bqml_mf_explicit_recommend_df]
    # [START bigquery_dataframes_bqml_mf_explicit_recommend_model]
    # import bigframes.bigquery as bbq

    # Load movies
    movies = bpd.read_gbq("bqml_tutorial.movies")

    # Merge the movies df with the previously created predicted df
    merged_df = bpd.merge(predicted, movies, left_on="item_id", right_on="movie_id")
Comment on lines +127 to +131
Collaborator

predicted here is only a subset. The SQL shows creating a table of all the predictions.

CREATE OR REPLACE TABLE `bqml_tutorial.recommend`
AS
SELECT
  *
FROM
  ML.RECOMMEND(MODEL `bqml_tutorial.mf_explicit`);

Did we implement a way for predict() to work without any inputs like in the SQL?

Suggested change
-    # Load movies
-    movies = bpd.read_gbq("bqml_tutorial.movies")
-    # Merge the movies df with the previously created predicted df
-    merged_df = bpd.merge(predicted, movies, left_on="item_id", right_on="movie_id")
+    predicted = model.predict()
+    # Load movies
+    movies = bpd.read_gbq("bqml_tutorial.movies")
+    # Merge the movies df with the previously created predicted df
+    merged_df = bpd.merge(predicted, movies, left_on="item_id", right_on="movie_id")

Contributor Author

This feature isn't supported yet; an issue will be filed shortly.
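
Until `predict()` can be called without inputs, one possible workaround is to pass it one row per distinct user. The sketch below is only an illustration: `predict_for_all_users` is a hypothetical helper, not part of bigframes, and it assumes that giving the model only the user column makes it score every item for those users, as `ML.RECOMMEND` is documented to do in SQL.

```python
def predict_for_all_users(model, ratings_df, user_col="user_id"):
    """Hypothetical helper: request predictions for every user in ratings_df.

    Assumes `model` exposes predict() and `ratings_df` supports column
    selection and drop_duplicates(), as the objects in the sample above do.
    """
    # Keep only the user column and collapse it to one row per user.
    all_users = ratings_df[[user_col]].drop_duplicates()
    # Assumption: with only the user column present, the model returns
    # predicted ratings for all items for each of these users.
    return model.predict(all_users)


# Usage with the objects from the sample (not executed here):
# predicted = predict_for_all_users(model, bq_df)
```

This may be much more expensive than the five-user subset, since the result has one row per user and item pair.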


    # Separate users and predicted data, setting the index to 'movie_id'
    users = merged_df[["user_id", "movie_id"]].set_index("movie_id")

    # Take the predicted data and sort it in descending order by 'predicted_rating', setting the index to 'movie_id'
    sort_data = (
        merged_df[["movie_title", "genre", "predicted_rating", "movie_id"]]
        .sort_values(by="predicted_rating", ascending=False)
        .set_index("movie_id")
    )

    # Re-merge the separated dfs by index
    merged_user = sort_data.join(users, how="outer")

    # Keep five rows per user and set user_id as the index. Assign the result,
    # otherwise print() shows the full, ungrouped frame.
    merged_user = (
        merged_user.groupby("user_id").head(5).set_index("user_id").sort_index()
    )
    print(merged_user)
    # Output:
    #           movie_title                  genre                   predicted_rating
    # user_id
    # 1         Saving Private Ryan (1998)   Action|Drama|War        5.19326
    # 1         Fargo (1996)                 Crime|Drama|Thriller    4.996954
    # 1         Driving Miss Daisy (1989)    Drama                   4.983671
    # 1         Ben-Hur (1959)               Action|Adventure|Drama  4.877622
    # 1         Schindler's List (1993)      Drama|War               4.802336
    # 2         Saving Private Ryan (1998)   Action|Drama|War        5.19326
    # 2         Braveheart (1995)            Action|Drama|War        5.174145
    # 2         Gladiator (2000)             Action|Drama            5.066372
    # 2         On Golden Pond (1981)        Drama                   5.01198
    # 2         Driving Miss Daisy (1989)    Drama                   4.983671
    # [END bigquery_dataframes_bqml_mf_explicit_recommend_model]
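
The last region of the sample is meant to show each user's five highest predicted ratings. As a rough, standard-library-only sketch of that selection (it is not the bigframes code path; `top_n_per_user` and the rows below are made up for illustration):

```python
import heapq
from collections import defaultdict


def top_n_per_user(rows, n=5):
    """Return {user_id: [titles]} with each user's n highest-rated titles.

    `rows` is an iterable of (user_id, movie_title, predicted_rating) tuples.
    """
    by_user = defaultdict(list)
    for user_id, title, rating in rows:
        # Store the rating first so tuples compare by rating.
        by_user[user_id].append((rating, title))
    return {
        user_id: [title for _, title in heapq.nlargest(n, scored)]
        for user_id, scored in sorted(by_user.items())
    }


# Illustrative data only; these are not real model predictions.
example_rows = [
    (1, "A", 4.0),
    (1, "B", 5.0),
    (1, "C", 3.0),
    (2, "A", 2.5),
    (2, "C", 4.5),
]
print(top_n_per_user(example_rows, n=2))
```

Selecting the largest ratings explicitly per user avoids relying on row order surviving a join, which is one reason a sketch like this can be easier to reason about than sort-then-`head`.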