Commit 6dd26b5

Content File Vector Search Endpoint (#1907)
* updating deps and adding method for getting token count
* working contentfile chunk embeds
* storing chunks and update initial resource record with embeddings from contentfile chunk
* adding management command flag to generate embeds by id
* fixing test
* ensuring we stay under token size
* removing full content from points
* moving splitter to separate function
* adding test for text splitter
* adding more tests
* fixing test
* changing chunk key name
* fix test setting
* fixing test
* adding initial serializers
* adding initial working view
* pinning onnxruntime to older version
* add setting for qdrant chunk size and some refactoring
* removing _id field
* test fix
* fix docstring
* adding more filters
* regenerate oai spec
* add view tests
* regenerate spec
* setting some indexes
* fixes
* fixing tests
* docstring fix
* fixes to filter out empty chunks and adding file extension as property
* fix test
* fix method name
* making sure we only process unique content for each run
* filtering for published
* removing next run method
* removing published field
* removed unique content id function and updating spec
* fixing issue in generate by id
* pinning litellm version
1 parent f91ac07 commit 6dd26b5

18 files changed, +1801 -479 lines changed


frontends/api/src/generated/v0/api.ts

Lines changed: 679 additions & 44 deletions
Large diffs are not rendered by default.

frontends/api/src/generated/v1/api.ts

Lines changed: 6 additions & 0 deletions
@@ -373,6 +373,12 @@ export interface ContentFile {
      * @memberof ContentFile
      */
     file_type?: string | null
+    /**
+     *
+     * @type {string}
+     * @memberof ContentFile
+     */
+    file_extension?: string | null
     /**
      *
      * @type {LearningResourceOfferor}

learning_resources/serializers.py

Lines changed: 1 addition & 0 deletions
@@ -815,6 +815,7 @@ class Meta:
             "resource_readable_id",
             "course_number",
             "file_type",
+            "file_extension",
             "offered_by",
             "platform",
             "run_readable_id",

learning_resources/serializers_test.py

Lines changed: 1 addition & 0 deletions
@@ -496,6 +496,7 @@ def test_content_file_serializer(settings, expected_types, has_channels):
                 "name": PlatformType[platform].value,
                 "code": platform,
             },
+            "file_extension": content_file.file_extension,
             "offered_by": {
                 "name": content_file.run.learning_resource.offered_by.name,
                 "code": content_file.run.learning_resource.offered_by.code,

learning_resources_search/api.py

Lines changed: 2 additions & 1 deletion
@@ -35,6 +35,7 @@
     adjust_search_for_percolator,
     document_percolated_actions,
 )
+from vector_search.constants import RESOURCES_COLLECTION_NAME
 
 log = logging.getLogger(__name__)
 
@@ -929,7 +930,7 @@ def _qdrant_similar_results(doc, num_resources):
     return [
         hit.payload
         for hit in client.query_points(
-            collection_name=f"{settings.QDRANT_BASE_COLLECTION_NAME}.resources",
+            collection_name=RESOURCES_COLLECTION_NAME,
             query=vector_point_id(doc["readable_id"]),
             limit=num_resources,
             using=encoder.model_short_name(),

main/settings.py

Lines changed: 5 additions & 0 deletions
@@ -804,6 +804,11 @@ def get_all_config_keys():
     name="QDRANT_SPARSE_MODEL", default="prithivida/Splade_PP_en_v1"
 )
 
+QDRANT_CHUNK_SIZE = get_int(
+    name="QDRANT_CHUNK_SIZE",
+    default=100,
+)
+
 QDRANT_ENCODER = get_string(
     name="QDRANT_ENCODER", default="vector_search.encoders.fastembed.FastEmbedEncoder"
 )
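
The commit message mentions a text splitter that keeps chunks under a token limit and filters out empty chunks, and this hunk adds the `QDRANT_CHUNK_SIZE` setting (default 100). The splitter itself is not visible in this view, so the following is only a minimal sketch of the idea, assuming the chunk size counts whitespace-delimited tokens. The function name `split_into_chunks` is hypothetical and the real code may use a different tokenizer and splitting strategy.

```python
def split_into_chunks(text: str, chunk_size: int = 100) -> list[str]:
    """Split text into chunks of at most `chunk_size` whitespace-delimited tokens.

    Hypothetical helper: the real splitter and tokenizer are not shown in this diff.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # str.split() with no arguments drops empty pieces, so blank input yields no chunks.
    tokens = text.split()
    return [
        " ".join(tokens[start : start + chunk_size])
        for start in range(0, len(tokens), chunk_size)
    ]
```

A whitespace-only document produces an empty list here, which is one simple way to avoid storing empty chunks.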

openapi/specs/v0.yaml

Lines changed: 293 additions & 0 deletions
@@ -314,6 +314,121 @@ paths:
               schema:
                 $ref: '#/components/schemas/CKEditorSettings'
           description: ''
+  /api/v0/content_files_vector_search/:
+    get:
+      operationId: content_files_vector_search_retrieve
+      description: Vector Search for content
+      summary: Content File Vector Search
+      parameters:
+      - in: query
+        name: content_feature_type
+        schema:
+          type: array
+          items:
+            type: string
+            minLength: 1
+        description: The feature type of the content file. Possible options are at
+          api/v1/course_features/
+      - in: query
+        name: course_number
+        schema:
+          type: array
+          items:
+            type: string
+            minLength: 1
+        description: Course number of the content file
+      - in: query
+        name: file_extension
+        schema:
+          type: array
+          items:
+            type: string
+            minLength: 1
+        description: 'The extension of the content file. '
+      - in: query
+        name: key
+        schema:
+          type: array
+          items:
+            type: string
+            minLength: 1
+        description: The filename of the content file
+      - in: query
+        name: limit
+        schema:
+          type: integer
+        description: Number of results to return per page
+      - in: query
+        name: offered_by
+        schema:
+          type: array
+          items:
+            type: string
+            minLength: 1
+        description: Offeror of the content file
+      - in: query
+        name: offset
+        schema:
+          type: integer
+        description: The initial index from which to return the results
+      - in: query
+        name: platform
+        schema:
+          type: array
+          items:
+            type: string
+            minLength: 1
+        description: platform(s) of the content file
+      - in: query
+        name: q
+        schema:
+          type: string
+          minLength: 1
+        description: The search text
+      - in: query
+        name: resource_readable_id
+        schema:
+          type: array
+          items:
+            type: string
+            minLength: 1
+        description: The readable_id value of the parent learning resource for the
+          content file
+      - in: query
+        name: run_readable_id
+        schema:
+          type: array
+          items:
+            type: string
+            minLength: 1
+        description: The readable_id value of the run that the content file belongs
+          to
+      - in: query
+        name: sortby
+        schema:
+          enum:
+          - id
+          - -id
+          - resource_readable_id
+          - -resource_readable_id
+          type: string
+          minLength: 1
+        description: |-
+          if the parameter starts with '-' the sort is in descending order
+
+          * `id` - id
+          * `-id` - -id
+          * `resource_readable_id` - resource_readable_id
+          * `-resource_readable_id` - -resource_readable_id
+      tags:
+      - content_files_vector_search
+      responses:
+        '200':
+          content:
+            application/json:
+              schema:
+                $ref: '#/components/schemas/ContentFileVectorSearchResponse'
+          description: ''
   /api/v0/learning_resources_search_admin_params/:
     get:
       operationId: learning_resources_search_admin_params_retrieve
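
The new `/api/v0/content_files_vector_search/` operation takes a free-text `q`, `limit`/`offset` paging, and several array-valued filters. As a rough illustration of how a client might assemble such a request, here is a small sketch that only builds the URL. It assumes array parameters are sent as repeated keys (the usual OpenAPI default for query arrays); the helper name `build_search_url` and the `example.com` host are placeholders, not part of this commit.

```python
from urllib.parse import urlencode

# Path taken from the OpenAPI spec in this commit.
ENDPOINT = "/api/v0/content_files_vector_search/"


def build_search_url(
    base_url: str, q: str, limit: int = 10, offset: int = 0, **filters: list[str]
) -> str:
    """Build a search URL; each filter is a list sent as repeated query keys.

    Hypothetical helper for illustration only.
    """
    params: dict[str, object] = {"q": q, "limit": limit, "offset": offset}
    for name, values in filters.items():
        if values:  # skip empty filters, since the spec requires minLength 1 items
            params[name] = values
    # doseq=True expands list values into repeated key=value pairs.
    return f"{base_url.rstrip('/')}{ENDPOINT}?{urlencode(params, doseq=True)}"


url = build_search_url(
    "https://example.com",  # placeholder host
    "linear algebra",
    limit=5,
    platform=["ocw"],
    file_extension=[".pdf", ".html"],
)
```

Sorting could be requested the same way by passing `sortby=["-id"]`, where the leading `-` selects descending order as described in the spec.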
@@ -1618,6 +1733,184 @@ components:
           default: false
       required:
       - message
+    ContentFile:
+      type: object
+      description: Serializer class for course run ContentFiles
+      properties:
+        id:
+          type: integer
+          readOnly: true
+        run_id:
+          type: integer
+        run_title:
+          type: string
+        run_slug:
+          type: string
+        departments:
+          type: array
+          items:
+            $ref: '#/components/schemas/LearningResourceDepartment'
+        semester:
+          type: string
+        year:
+          type: integer
+        topics:
+          type: array
+          items:
+            $ref: '#/components/schemas/LearningResourceTopic'
+        key:
+          type: string
+          nullable: true
+          maxLength: 1024
+        uid:
+          type: string
+          nullable: true
+          maxLength: 36
+        title:
+          type: string
+          nullable: true
+          maxLength: 1024
+        description:
+          type: string
+          nullable: true
+        url:
+          type: string
+          nullable: true
+        content_feature_type:
+          type: array
+          items:
+            type: string
+        content_type:
+          $ref: '#/components/schemas/ContentTypeEnum'
+        content:
+          type: string
+          nullable: true
+        content_title:
+          type: string
+          nullable: true
+          maxLength: 1024
+        content_author:
+          type: string
+          nullable: true
+          maxLength: 1024
+        content_language:
+          type: string
+          nullable: true
+          maxLength: 24
+        image_src:
+          type: string
+          format: uri
+          nullable: true
+          maxLength: 200
+        resource_id:
+          type: string
+        resource_readable_id:
+          type: string
+        course_number:
+          type: array
+          items:
+            type: string
+          description: Extract the course number(s) from the associated course
+          readOnly: true
+        file_type:
+          type: string
+          nullable: true
+          maxLength: 128
+        file_extension:
+          type: string
+          nullable: true
+          maxLength: 32
+        offered_by:
+          $ref: '#/components/schemas/LearningResourceOfferor'
+        platform:
+          $ref: '#/components/schemas/LearningResourcePlatform'
+        run_readable_id:
+          type: string
+      required:
+      - content_feature_type
+      - course_number
+      - departments
+      - id
+      - offered_by
+      - platform
+      - resource_id
+      - resource_readable_id
+      - run_id
+      - run_readable_id
+      - run_slug
+      - run_title
+      - semester
+      - topics
+      - year
+    ContentFileVectorSearchResponse:
+      type: object
+      description: SearchResponseSerializer with OpenAPI annotations for Content Files
+        search
+      properties:
+        count:
+          type: integer
+          readOnly: true
+        next:
+          type: string
+          nullable: true
+          readOnly: true
+        previous:
+          type: string
+          nullable: true
+          readOnly: true
+        results:
+          type: array
+          items:
+            $ref: '#/components/schemas/ContentFile'
+          readOnly: true
+        metadata:
+          type: object
+          properties:
+            aggregations:
+              type: object
+              additionalProperties:
+                type: array
+                items:
+                  type: object
+                  properties:
+                    key:
+                      type: string
+                    doc_count:
+                      type: integer
+                  required:
+                  - doc_count
+                  - key
+            suggestions:
+              type: array
+              items:
+                type: string
+          required:
+          - aggregations
+          - suggestions
+          readOnly: true
+      required:
+      - count
+      - metadata
+      - next
+      - previous
+      - results
+    ContentTypeEnum:
+      enum:
+      - page
+      - file
+      - video
+      - pdf
+      type: string
+      description: |-
+        * `page` - page
+        * `file` - file
+        * `video` - video
+        * `pdf` - pdf
+      x-enum-descriptions:
+      - page
+      - file
+      - video
+      - pdf
     Counts:
       type: object
       properties:
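
The `ContentFileVectorSearchResponse` schema above wraps the `ContentFile` results in a paginated envelope with `count`, `next`, `previous`, and a `metadata` object holding `aggregations` and `suggestions`. To make that shape concrete, here is a small sketch of how a client might read such a payload once it has been decoded from JSON. The function `summarize_response` is hypothetical and the sample payload is made up to match the schema; it is not real API output.

```python
from typing import Any


def summarize_response(payload: dict[str, Any]) -> dict[str, Any]:
    """Pull a few fields out of a ContentFileVectorSearchResponse-shaped dict.

    Hypothetical helper for illustration only.
    """
    aggregations = payload["metadata"]["aggregations"]
    return {
        "count": payload["count"],
        # `next` is nullable; a null value means there is no further page.
        "has_next": payload["next"] is not None,
        # `key` is optional and nullable on ContentFile, so use .get().
        "keys": [item.get("key") for item in payload["results"]],
        # Flatten each aggregation's bucket list into {key: doc_count}.
        "buckets": {
            name: {bucket["key"]: bucket["doc_count"] for bucket in buckets}
            for name, buckets in aggregations.items()
        },
    }


# Made-up sample payload shaped like the schema (not real API output).
sample = {
    "count": 1,
    "next": None,
    "previous": None,
    "results": [{"id": 1, "key": "lecture1.pdf", "file_extension": ".pdf"}],
    "metadata": {
        "aggregations": {"platform": [{"key": "ocw", "doc_count": 1}]},
        "suggestions": [],
    },
}
summary = summarize_response(sample)
```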

openapi/specs/v1.yaml

Lines changed: 4 additions & 0 deletions
@@ -8599,6 +8599,10 @@ components:
           type: string
           nullable: true
           maxLength: 128
+        file_extension:
+          type: string
+          nullable: true
+          maxLength: 32
         offered_by:
           $ref: '#/components/schemas/LearningResourceOfferor'
         platform:
