Skip to content

Dict implementation design #983

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
5 tasks
certik opened this issue Aug 17, 2022 · 13 comments
Open
5 tasks

Dict implementation design #983

certik opened this issue Aug 17, 2022 · 13 comments

Comments

@certik
Copy link
Contributor

certik commented Aug 17, 2022

Here are the freedoms that we have in implementation dictionaries:

  • Hash function
  • load factor / initial size of an array
  • array resizing strategy (rehashing or not)
  • key collision strategy
  • memory allocation
  • ordered vs unordered
  • hash table vs red-black tree

Here is how we can do this:

  • We start with some defaults (just one choice for all of the above)
  • We implement more (eventually all) choices, and allow the user to select using a type annotation, such as x: Dict[hash3, loadfactor(5), resizing("x"), collision_strategy("list")].
  • We can then benchmark various combinations and choose some good solid defaults
  • We can later add a user defined hash function that gets compiled and inlined in LLVM, so optimal performance.
  • Ideally the hash function should be defined in the frontend language too

Note: #975 implements unordered dict. The ordered dict implementation can be implemented on top by adding some data structure to maintain the order of key insertion nodes (such as linked list), and another hash table of key -> node, so that you can do insertion, lookup and deletion all in O(1); it seems lookup is as fast as for unordered dict, and so is unordered iteration; but insertion and deletion is slower due to updating of the two extra datastructures; and there is a new feature "ordered iteration" (that did not exist in unordered dict, but it is slower than "unordered iteration"); more details at https://github.com/Aqcurate/OrderedDict#ordered-dictionary-implementation.

@czgdp1807
Copy link
Collaborator

czgdp1807 commented Aug 17, 2022

Adding TODOs here after #975 is merged (in decreasing priority order),

@czgdp1807
Copy link
Collaborator

czgdp1807 commented Aug 24, 2022

Question - Should we allow floats/reals as keys?

The reason why I am asking this question is because floats are always approximately equal. So, converting them to int of same size by memcpy won't work well. Plus, its not a good practice to use floats as keys. 123.000 and 123.000001 are approximately same, now user may expect them to work as a single key but due to slight difference both will give totally different hash value. So disallowing these as keys will prevent a lot of bad practices.

Also, in case of collisions we will go for exact equality checks in keys but as I said above floats are approximately equal mostly, so users may get very random results due to this property of floats.

Even if we want to allow floats as keys, we should ask the user to provide an equality function in the dict type which we should use to compare floats in case of collisions.

See https://softwareengineering.stackexchange.com/a/391107 and https://diego.assencio.com/?index=67e5393c40a627818513f9bcacd6a70d for details.

@certik
Copy link
Contributor Author

certik commented Aug 24, 2022

For now I would not. Later once we allow a use defined hash, we can, but this is rarely used in Python anyway, so for now let's just give a compiler error.

@harshsingh-24
Copy link
Contributor

harshsingh-24 commented Mar 16, 2023

Hey everyone. I want to work on development and enhancements of Data Structures for LPython. Can anyone suggest a good-first-feature associated with this? Specifically, I want to explore the LLVM part of the compiler. @czgdp1807 @certik

#983 seems to be dormant from some time. So, anything which I can help with?

@czgdp1807
Copy link
Collaborator

Support passing dict as function arguments and return value of functions

This one may be.

@harshsingh-24
Copy link
Contributor

harshsingh-24 commented Mar 16, 2023

For Support of Passing Dictionary as Function Arguments and Return Value of Functions, I realized a few things which I would like to jot down. I will break the problem statement into two parts which will be independent of each other.

  • Allowing dict as function argument
def addDict(a: dict[i32, i32]):
    print(a[1])

print(addDict({1: 100, 2:400}))

For the above code, when I sequentially ran LPython on it with flags --show-tokens, show-ast and show-asr, it fails on show-asr which means there is a problem from AST to ASR conversion. Therefore, need to write logic for that first.

  • Allowing dict as function return
def addPair(a: i32, b: i32) -> dict[i32, i32]:
    return {a: b}

x : dict[i32, i32]
x = addPair(2, 1000)

print(x[2])

For the above code, when I again ran sequentially with the above mentioned flags it fails on show-llvm tag with a code generation error, which implies that I need to modify asr_to_llvm.cpp for IR generation of Dictionary Return.

@czgdp1807 Sounds good? If possible, can you point me to the associated files with above Issue?

@czgdp1807
Copy link
Collaborator

Yes. Makes sense to me.

@Thirumalai-Shaktivel
Copy link
Collaborator

If possible, can you point me to the associated files with above Issue?

This issue is to be handled in asr_to_llvm.cpp. If the ASR doesn't seem to be correct, then we have to redesign the handling of ASR in python_ast_to_asr.cpp

It seems the Dict return type is not implemented in your case:

throw CodeGenError("Type not implemented " + std::to_string(return_var_type));

@harshsingh-24
Copy link
Contributor

harshsingh-24 commented Mar 18, 2023

Hey. I was able to figure out a few things as to how to implement the solution of Return Type at least. First of all, there are two apis that have been exposed by LLVM to us - dict_api_lp (Linear Probing), dict_api_sc (Separate Chaining). set_dict_api is used for setting which API to use depending on what kind of objects have been passed.

In the above mentioned place by @Thirumalai-Shaktivel, I added the following logic for a return-type of Dictionary.

case (ASR::ttypeType::Dict) : {
                    ASR::Dict_t* asr_dict = ASR::down_cast<ASR::Dict_t>(return_var_type0);
                    std::string key_type_code = ASRUtils::get_type_code(asr_dict->m_key_type);
                    std::string value_type_code = ASRUtils::get_type_code(asr_dict->m_value_type);
                    
                    std ::cout << "DICT DEBUG: " << key_type_code << " " << value_type_code << std :: endl;
                    
                    bool is_local_array_type = false, is_local_malloc_array_type = false;
                    bool is_local_list = false;
                    ASR::dimension_t* local_m_dims = nullptr;
                    ASR::storage_typeType local_m_storage = ASR::storage_typeType::Default;
                    int local_n_dims = 0, local_a_kind = -1;

                    llvm::Type* key_llvm_type = get_type_from_ttype_t(asr_dict->m_key_type, local_m_storage,is_local_array_type, is_local_malloc_array_type,is_local_list, local_m_dims, local_n_dims,local_a_kind);
                    llvm::Type* value_llvm_type = get_type_from_ttype_t(asr_dict->m_value_type, local_m_storage,is_local_array_type, is_local_malloc_array_type,is_local_list, local_m_dims, local_n_dims,local_a_kind);
                    int32_t key_type_size = get_type_size(asr_dict->m_key_type, key_llvm_type, local_a_kind);
                    int32_t value_type_size = get_type_size(asr_dict->m_value_type, value_llvm_type, local_a_kind);

                    set_dict_api(asr_dict);

                    return_type = llvm_utils->dict_api->get_dict_type(key_type_code, value_type_code, key_type_size,value_type_size, key_llvm_type, value_llvm_type);

                    // throw CodeGenError("DICT RETURN NOT IMPLEMENTED YET");
                    break;
                }

@certik @czgdp1807 After this, what else needs to be changed? I guess visit_Variable function also needs to changed, because it throws an error - throw CodeGenError("Variable type not supported " + std::to_string(x.m_type->type), x.base.base.loc);? We might need to update llvm_symtab?

@rsaba5612
Copy link

It seems like you have provided a description of the different freedoms we have in implementing dictionaries and some suggestions on how to approach implementing them.

There are different choices that can be made in implementing dictionaries, such as the hash function, load factor, resizing strategy, collision strategy, memory allocation, ordered vs unordered, and hash table vs red-black tree.

A good starting point is to choose default options for all of the above choices.

To allow users to select their own options, a type annotation can be used, such as x: Dict[hash3, loadfactor(5), resizing("x"), collision_strategy("list")].

Benchmarking can be used to choose good solid defaults.

A user-defined hash function can be added later for optimal performance.

Ideally, the hash function should be defined in the frontend language as well.

Ordered dictionaries can be implemented on top of unordered dictionaries by maintaining a data structure to track the order of key insertion nodes and another hash table of key -> node, but this may result in slower insertion and deletion due to the additional data structures.
Overall, it seems like you have provided a good starting point for designing and implementing dictionaries with different options and flexibility.

@harshsingh-24
Copy link
Contributor

harshsingh-24 commented Apr 7, 2023

Hey!! Can you give an example of Support nested dictionaries, tuples as keys, list as values which I can initially work on and then incrementally I will add support for each such type. Basically, my doubt is in what order should I go implementing the above feature? @czgdp1807 @certik

I feel that I should go with nested dicts as value support first and then should try for other data types? As per my knowledge, A Dictionary keys of a Dictionary must be unique and of immutable data type such as Strings, Integers, Floats, tuples (Hashable Types) but the values can be repeated and be of any type.

@harshsingh-24
Copy link
Contributor

@czgdp1807 I have completed the task Support passing dict as function arguments and return value of functions in the above checklist. What should I work on next? @certik

@aryangupta701
Copy link
Contributor

Hello @czgdp1807 is this project still available for GSoC or is this completed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants