Project #2 - Extendible Hash Index - CMU 15-445/645 :: Intro to Database Systems (Spring 2024)

Do not post your project on a public Github repository.

In this programming project you will implement a disk-backed hash index in your database system. You will use a variant of extendible hashing as the hashing scheme. Unlike the two-level scheme taught in class, we add a non-resizable header page on top of the directory pages so that the hash table can hold more values and potentially achieve better multi-threaded performance.

The following diagram shows an extendible hash table with a header page of max depth 2, directory pages with max depth 2, and bucket pages holding at most two entries. The values are omitted, and the hashes of the keys are shown in the bucket pages instead of the keys themselves.

[Figure: diagram of the extendible hash table structure described above]

The index provides fast data retrieval without needing to scan every row in a database table, enabling rapid random lookups. Your implementation should support thread-safe search, insertion, and deletion (including growing/shrinking the directory and splitting/merging buckets).

You must implement the following tasks:

  • Extendible Hash Table Pages
  • Extendible Hashing Implementation
  • Concurrency Control

This is a single-person project that must be completed individually (i.e. no groups).

  • Release Date: Feb 19, 2024
  • Due Date: Mar 15, 2024 @ 11:59pm

Before starting, run git pull public master to pull the latest code from the public BusTub repo.

Your work here depends on your implementation of the buffer pool from Project 1. If your Project 1 solution was incorrect, you must fix it to successfully complete this project. We will not provide solutions for the previous programming projects.

Like the first project, we are providing stub classes that contain the API that you need to implement. You should not modify the signatures for the pre-defined functions in these classes. If you modify the signatures, our grading test code will not work and you will get no credit for the project.

If a class already contains data members, you should not remove them. These are required to implement functionality needed by the rest of the system. Unless specified otherwise, you may add data members and helper functions to these classes to correctly implement the required functionality.

You may use any built-in C++17 containers in your project unless specified otherwise. It is up to you to decide which ones you want to use. Be warned that these containers are not thread-safe; you will need to use latches to protect access to them. You may not use additional third-party libraries (e.g., Boost).

Task #1 - Extendible Hash Table Pages

You must implement three Page classes to store the data of your Extendible Hash Table.

  • Hash Table Header Page
  • Hash Table Directory Page
  • Hash Table Bucket Page

Hash Table Header Page

The header page sits at the first level of our disk-based extendible hash table, and there is only one header page per hash table. It stores the logical child pointers to the directory pages (as page ids). You can think of it as a static first-level directory page. The header page has the following fields:

Variable Name       | Size (bytes) | Description
directory_page_ids_ | 2048         | An array of directory page ids
max_depth_          | 4            | The maximum depth the header page can handle

Note that although there is a physical limit on how large a page can be, you should use max_depth_ to determine the upper bound of your directory_page_ids_ array size.

You must implement the extendible hash table header page by modifying only its header file (src/include/storage/page/extendible_htable_header_page.h) and corresponding source file (src/storage/page/extendible_htable_header_page.cpp).
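For orientation, here is a minimal layout sketch consistent with the field table above. It is not the actual stub: the 512-entry bound is an assumption derived from the listed sizes (2048 bytes of 4-byte page ids = 512 = 2^9 slots).

// Sketch only: field names follow the table above; the array bound is inferred.
#include <cstdint>

using page_id_t = int32_t;
static constexpr uint32_t HTABLE_HEADER_ARRAY_SIZE = 512;  // 2048 bytes / 4-byte page ids

class ExtendibleHTableHeaderPage {
 private:
  page_id_t directory_page_ids_[HTABLE_HEADER_ARRAY_SIZE];  // 2048 bytes
  uint32_t max_depth_;                                       // 4 bytes
};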

Hash Table Directory Page

Directory pages sit at the second level of our disk-based extendible hash table. Each of them stores the logical child pointers to the bucket pages (as page ids), as well as metadata for handling bucket mapping and dynamic directory growing and shrinking. The directory page has the following fields:

Variable Name    | Size (bytes) | Description
max_depth_       | 4            | The maximum depth the directory page can handle
global_depth_    | 4            | The current global depth of the directory
local_depths_    | 512          | An array of bucket page local depths
bucket_page_ids_ | 2048         | An array of bucket page ids

Note that although there is a physical limit on how large a page can be, you should use max_depth_ to determine the upper bound of your bucket_page_ids_ array size.

You must implement the extendible hash table directory page by modifying only its header file (src/include/storage/page/extendible_htable_directory_page.h) and corresponding source file (src/storage/page/extendible_htable_directory_page.cpp).
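A corresponding layout sketch for the directory page, again inferred from the sizes above (one uint8_t local depth and one 4-byte page id per slot yields 512 slots, so max_depth_ can be at most 9):

// Sketch only: bounds inferred from the field sizes listed above.
#include <cstdint>

using page_id_t = int32_t;
static constexpr uint32_t HTABLE_DIRECTORY_ARRAY_SIZE = 512;

class ExtendibleHTableDirectoryPage {
 private:
  uint32_t max_depth_;                                      // 4 bytes
  uint32_t global_depth_;                                   // 4 bytes
  uint8_t local_depths_[HTABLE_DIRECTORY_ARRAY_SIZE];       // 512 bytes
  page_id_t bucket_page_ids_[HTABLE_DIRECTORY_ARRAY_SIZE];  // 2048 bytes
};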

Hash Table Bucket Page

Bucket pages sit at the third level of our disk-based extendible hash table. They are the ones that are actually storing the key-value pairs. The bucket page has the following fields:

Variable Name | Size (bytes) | Description
size_         | 4            | The number of key-value pairs the bucket is holding
max_size_     | 4            | The maximum number of key-value pairs the bucket can handle
array_        | ≤ 4088       | An array of key-value pairs

Note that although there is a physical limit on how large a page can be, you should use max_size_ to determine the upper bound of your array_ key-value pair array size.

You must implement the extendible hash table bucket page by modifying only its header file (src/include/storage/page/extendible_htable_bucket_page.h) and corresponding source file (src/storage/page/extendible_htable_bucket_page.cpp).
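A sketch of how the bucket's capacity falls out of the page size; the 4096-byte page constant is an assumption consistent with the table above (4096 - 2 × 4 bytes of metadata = 4088 bytes left for array_):

// Sketch only: the capacity is derived from the page size and entry size.
#include <cstdint>
#include <utility>

static constexpr uint32_t PAGE_SIZE_BYTES = 4096;  // assumed BusTub page size

template <typename KeyType, typename ValueType, typename KeyComparator>
class ExtendibleHTableBucketPage {
  using MappedType = std::pair<KeyType, ValueType>;
  // Metadata (size_ + max_size_) takes 8 bytes; the rest holds entries.
  static constexpr uint32_t ARRAY_CAPACITY =
      (PAGE_SIZE_BYTES - 2 * sizeof(uint32_t)) / sizeof(MappedType);

 private:
  uint32_t size_;                     // 4 bytes
  uint32_t max_size_;                 // 4 bytes
  MappedType array_[ARRAY_CAPACITY];  // <= 4088 bytes
};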

Each extendible hash table header/directory/bucket page corresponds to the content (i.e., the data_ part) of a memory page fetched by the buffer pool. Every time you read or write a page, you must first fetch the page from the buffer pool (using its unique page_id), reinterpret_cast it to the corresponding type, and unpin the page after reading or writing it. We strongly encourage you to take advantage of the PageGuard APIs you implemented in Project #1 to achieve this.
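As a sketch of that pattern, assuming the ReadPageGuard/WritePageGuard API from Project #1 (with As<T>()/AsMut<T>() accessors), accessing a header page inside a DiskExtendibleHashTable member function might look like this:

// Fetch the page with a write latch; the guard unpins and unlatches on scope exit.
WritePageGuard guard = bpm_->FetchPageWrite(header_page_id_);
auto *header = guard.AsMut<ExtendibleHTableHeaderPage>();
// ... read or modify the header page through `header` ...
// No explicit UnpinPage call is needed: the guard's destructor handles it.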

Task #2 - Extendible Hashing Implementation

Your implementation needs to support insertions, point searches, and deletions. There are many helper functions either implemented or documented in the extendible hash table's header and source files. Your only strict API requirement is adhering to Insert, GetValue, and Remove. You must also leave the VerifyIntegrity function as it is. Please feel free to design and implement additional functions as you see fit.

For this semester, the hash table is intended to support only unique keys. This means that the hash table should return false if the user tries to insert a duplicate key.

Note: You should use the page classes you implemented in Task #1 to store the key-value pairs as well as the metadata that maintains the hash table (page ids, global/local depths). For instance, you should not use an in-memory data structure such as a std::unordered_map to mock the hash table.

The extendible hash table is parameterized on arbitrary key, value, and key comparator types.

template <typename KeyType, typename ValueType, typename KeyComparator>

The type parameters are:

  • KeyType: The type of each key in the index. In practice this will be a GenericKey. The actual size of a GenericKey varies, and is specified with its own template argument that depends on the type of indexed attribute.
  • ValueType: The type of each value in the index. In practice, this will be a 64-bit RID.
  • KeyComparator: A class used to compare whether two KeyType instances are less than, greater than, or equal to each other. These will be included in the KeyType implementation files.

Note: Our hash table functions also take a Transaction* with a default value of nullptr. This is intended for Project 4, in case you want to implement concurrent index lookup for concurrency control. You generally do not need to use it in this project.

You must implement the extendible hash table by modifying only its header file (src/include/container/disk/hash/disk_extendible_hash_table.h) and corresponding source file (src/container/disk/hash/disk_extendible_hash_table.cpp).

This project requires you to implement bucket splitting/merging and directory growing/shrinking. The following subsections provide a specification on the implementation details.

Empty Table

When you first create an empty hash table, it should only have the (one and only) header page. Directory pages and bucket pages should be created on demand.

Header Indexing

You will want to use the most-significant bits for indexing into the header page's directory_page_ids_ array. This involves taking the hash of your key and performing bit operations using the max depth of the header page. The header page depth will not change.
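A minimal sketch of that bit manipulation (written as a free function for illustration; in your code it belongs to the header page class):

// Select the directory slot from the top `max_depth` bits of a 32-bit hash.
auto HashToDirectoryIndex(uint32_t hash, uint32_t max_depth) -> uint32_t {
  if (max_depth == 0) {
    return 0;  // a single slot; also avoids the undefined behavior of `hash >> 32`
  }
  return hash >> (32 - max_depth);
}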

Directory Indexing

You will want to use the least-significant bits for indexing into the directory page's bucket_page_ids_ array. This involves taking the hash of your key and performing bit operations using the current global depth of the directory page.
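And the corresponding sketch for the directory, masking off the low global_depth bits:

// Select the bucket slot from the low `global_depth` bits of the hash.
auto HashToBucketIndex(uint32_t hash, uint32_t global_depth) -> uint32_t {
  return hash & ((1U << global_depth) - 1);  // mask is 0 when global_depth == 0
}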

Bucket Splitting

You must split a bucket if there is no room for insertion. Make sure you support recursive splitting, for the case where the new bucket after a split still does not have enough space, unless max_depth_ is reached. You can ostensibly split as soon as the bucket becomes full, if you find that easier; however, the reference solution splits only when an insertion would overflow a page, and you may find that the provided API is more amenable to this approach. As always, you are welcome to factor out your own internal API.
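A rough outline of the insert path's splitting loop, under the split-on-overflow approach; the helper names here (SplitBucket, FetchBucket) are illustrative, not part of the required API:

// Keep splitting while the target bucket cannot take the new entry.
while (bucket->IsFull()) {
  if (directory->GetLocalDepth(bucket_idx) == directory->GetGlobalDepth()) {
    if (directory->GetGlobalDepth() >= directory->GetMaxDepth()) {
      return false;  // directory cannot grow any further; insertion fails
    }
    directory->IncrGlobalDepth();  // double the directory first
  }
  // Allocate a new bucket page, bump both local depths, redistribute the old
  // bucket's entries by re-hashing each key, and recompute the target index.
  SplitBucket(directory, bucket_idx);
  bucket_idx = directory->HashToBucketIndex(hash);
  bucket = FetchBucket(directory->GetBucketPageId(bucket_idx));
}
return bucket->Insert(key, value, cmp_);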

Bucket Merging

Merging must be attempted when a bucket becomes empty. There are ways to merge more aggressively by checking the occupancy of buckets and their split images, but these expensive checks and extra merges can increase thrashing.

To keep things relatively simple, we provide the following rules for merging:

  1. Only empty buckets can be merged.
  2. Buckets can only be merged with their split image if their split image has the same local depth.
  3. You should keep merging recursively if the new split image of the merged bucket is empty.

If you are confused about a "split image", please review the algorithm and code documentation. The concept falls out quite naturally.
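Concretely, under the usual convention, a bucket's split image is the directory index that differs only in the highest of the bucket's local-depth bits. A sketch (assuming local_depth >= 1, since a depth-0 bucket covers the whole table and has no split image):

// Flip bit (local_depth - 1) of the bucket index to get its split image.
auto GetSplitImageIndex(uint32_t bucket_idx, uint32_t local_depth) -> uint32_t {
  return bucket_idx ^ (1U << (local_depth - 1));
}

For example, with local depth 2, buckets 01 and 11 are split images of each other.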

Directory Growing

There are no fancy rules for this part of the hash table: you either have to grow the directory, or you do not.

Directory Shrinking

Only shrink the directory if the local depth of every bucket is strictly less than the global depth of the directory.
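A sketch of that check, assuming directory accessors along the lines of Size(), GetGlobalDepth(), and GetLocalDepth():

// The directory can halve only if no bucket's local depth equals the global depth.
auto CanShrink(const ExtendibleHTableDirectoryPage *dir) -> bool {
  const uint32_t global_depth = dir->GetGlobalDepth();
  for (uint32_t i = 0; i < dir->Size(); i++) {
    if (dir->GetLocalDepth(i) >= global_depth) {
      return false;
    }
  }
  return true;
}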

Task #3 - Concurrency Control

Finally, modify your extendible hash table implementation so that it safely supports concurrent operations with multiple threads. The thread traversing the index should acquire latches on hash table pages as necessary to ensure safe concurrent operations, and should release latches on parent pages as soon as it is possible to determine that it is safe to do so.

We recommend that you complete this task by using the FetchPageWrite or FetchPageRead buffer pool API, depending on whether you want to access a page with read or write privileges. Then modify your implementation to grab and release read and write latches as necessary to implement the latch crabbing algorithm.
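For instance, the read path of GetValue might crab down the levels like this (assuming the guard Drop() helper from Project #1; the page-class methods are the ones you implement in Task #1):

// Latch the header, find the child, then release the header before descending.
ReadPageGuard header_guard = bpm_->FetchPageRead(header_page_id_);
auto *header = header_guard.As<ExtendibleHTableHeaderPage>();
page_id_t directory_page_id = header->GetDirectoryPageId(header->HashToDirectoryIndex(hash));
header_guard.Drop();  // safe: the header is never modified on the read path

ReadPageGuard directory_guard = bpm_->FetchPageRead(directory_page_id);
auto *directory = directory_guard.As<ExtendibleHTableDirectoryPage>();
// ... continue to the bucket page the same way, then probe it for the key ...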

Note: You should never acquire the same read lock twice in a single thread. It might lead to deadlock.

Note: You should make careful design decisions on latching. Always holding a global latch on the entire hash table is probably not a good idea. TAs will manually review your implementation, and a bad latching design will result in a point deduction.

Leaderboard Task (Optional)

For this project's leaderboard challenge, we benchmark your hash table with a very large buffer pool.

Optimizing for the leaderboard is optional (i.e., you can get a perfect score on the project after finishing all previous tasks). However, your solution must finish the leaderboard test with a correct result and without deadlocks or segfaults.

The leaderboard test is compiled with the release profile:

mkdir cmake-build-relwithdebinfo
cd cmake-build-relwithdebinfo
cmake .. -DCMAKE_BUILD_TYPE=RelWithDebInfo
make -j`nproc` htable-bench
./bin/bustub-htable-bench

We strongly recommend that you checkpoint your code before optimizing for the leaderboard tests, so that if these optimizations cause problems in future projects, you can always revert them.

In the leaderboard test, we will have multiple threads accessing the hash table. There are two types of threads running in the benchmark:

  1. Read threads. Each thread will get values for a range of keys.
  2. Write threads. Each thread will insert and remove keys from the hash table.

The final score is computed as the geometric mean of the read and write throughput (QPS), i.e., sqrt(read QPS × write QPS).

Recommended Optimizations

  1. Optimistic latching. For instance, during insertion, instead of taking the write latch on the header page before determining whether a new directory needs to be added, you can optimistically take the read latch on the header page, check whether a new directory needs to be added, and obtain the write latch only if you actually need to add one. The same applies to latching on directory pages. (A sketch of this pattern appears after this list.)
  2. Bloom filters stored in the header and/or directory pages. You are not allowed to create a separate page to store a bloom filter.
    • If your bloom filter functions need input keys of type GenericKey, which comes in variable sizes, template functions may be useful. For a reference on the use of templates, see src/storage/page/extendible_htable_bucket_page.cpp.
    • To derive your k hash functions, please refer to third_party/murmur3, the only supported third-party hashing library.
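A sketch of the optimistic pattern from item 1, using illustrative names; note the re-check after upgrading to a write latch, since another thread may have created the directory between the two fetches:

// Fast path: probe under a read latch.
ReadPageGuard header_read = bpm_->FetchPageRead(header_page_id_);
auto *header = header_read.As<ExtendibleHTableHeaderPage>();
uint32_t dir_idx = header->HashToDirectoryIndex(hash);
page_id_t dir_page_id = header->GetDirectoryPageId(dir_idx);
header_read.Drop();

if (dir_page_id == INVALID_PAGE_ID) {
  // Slow path: retry under a write latch and re-check before creating anything.
  WritePageGuard header_write = bpm_->FetchPageWrite(header_page_id_);
  auto *mut_header = header_write.AsMut<ExtendibleHTableHeaderPage>();
  if (mut_header->GetDirectoryPageId(dir_idx) == INVALID_PAGE_ID) {
    // ... create the new directory page and install its id here ...
  }
}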

Note: We recommend starting with optimistic latching, as this will likely yield the largest improvement in performance. You might find it difficult to achieve good performance by implementing both optimistic latching and bloom filters completely, so you may choose to stick with one approach or experiment with a mix of both.

Leaderboard Policy

  • Submissions with leaderboard bonus are subject to manual review by TAs.
    • By "review", we mean that TAs will manually look at your code; if they are unsure whether an optimization is correct just by reading it, they will make simple modifications to existing test cases to see whether your leaderboard optimization correctly handles the cases you intended to optimize.
    • One example of a simple modification: changing the buffer pool manager size for the benchmark.
  • Your optimization should not affect correctness. You can optimize for specific cases, but it should work for all inputs in your optimized cases.
  • You should not try to detect whether your submission is running the leaderboard test by using side information.
    • Unless we allow you to do so.
    • Disallowed: using #ifdef NDEBUG, etc.
  • Submissions with obvious correctness issues will not be assigned leaderboard bonus.
  • You cannot use late days for leaderboard tests.
  • If you are unsure about whether an optimization is reasonable, you should post on Piazza or visit any TA's office hour.

See the Project #0 instructions on how to create your private repository and setup your development environment.

Testing

You can test the individual components of this assignment using our testing framework. We use GTest for unit test cases. The following files contain tests for each component:

  • Hash Table Pages: test/storage/extendible_htable_page_test.cpp
  • Extendible Hash Table: test/container/disk/hash/extendible_htable_test.cpp

You can compile and run each test individually from the command-line:

$ make extendible_htable_page_test -j$(nproc)
$ ./test/extendible_htable_page_test

You can also run make check-tests to run ALL of the test cases. Note that some tests are disabled as you have not implemented future projects. You can disable tests in GTest by adding a DISABLED_ prefix to the test name.

Important: These tests are only a subset of all the tests that we will use to evaluate and grade your project. You should write additional test cases on your own to check the complete functionality of your implementation.

Formatting

Your code must follow the Google C++ Style Guide. We use Clang to automatically check the quality of your source code. Your project grade will be zero if your submission fails any of these checks.

Execute the following commands to check your syntax. The format target will automatically correct your code. The check-lint and check-clang-tidy-p2 targets will print errors and show you how to fix them to conform to our style guide.

$ make format
$ make check-clang-tidy-p2

Memory Leaks

For this project, we use LLVM Address Sanitizer (ASAN) and Leak Sanitizer (LSAN) to check for memory errors. To enable ASAN and LSAN, configure CMake in debug mode and run tests as you normally would. If there is a memory error, you will see a memory error report. Note that macOS only supports the address sanitizer, not the leak sanitizer.

In some cases, address sanitizer might affect the usability of the debugger. In this case, you might need to disable all sanitizers by configuring the CMake project with:

$ cmake -DCMAKE_BUILD_TYPE=Debug -DBUSTUB_SANITIZER= ..

Development Hints

You can use BUSTUB_ASSERT for assertions in debug mode. Note that the statements within BUSTUB_ASSERT will NOT be executed in release mode. If you have something to assert in all cases, use BUSTUB_ENSURE instead.

Post all of your questions about this project on Piazza. Do not email the TAs directly with questions.

We encourage you to use a graphical debugger to debug your project if you are having problems.

If you are having compilation problems, note that running make clean does not completely reset the compilation process. You will need to delete your build directory and run cmake .. again before you rerun make.

Each project submission will be graded based on the following criteria:

  1. Does the submission successfully execute all of the test cases and produce the correct answer?
  2. Does the submission execute without any memory leaks?
  3. Does the submission follow the code formatting and style policies?

Note that we will use additional test cases to grade your submission that are more complex than the sample test cases that we provide you.

See the late policy in the syllabus.

After completing the assignment, you can submit your implementation to Gradescope:

Running make submit-p2 in your build/ directory will generate a zip archive called project2-submission.zip under your project root directory that you can submit to Gradescope.

You can submit your answers as many times as you like and get immediate feedback.

Notes on Gradescope and Autograder

  1. If you are timing out on Gradescope, it is likely because your code deadlocks or is too slow to finish within 60 seconds. If your code is too slow, it may be because of a performance issue in the buffer pool manager you implemented in Project 1.
  2. The autograder will not work if you are printing too many logs in your submissions.
  3. If the autograder did not work properly, make sure that your formatting commands work and that you are submitting the right files.

CMU students should use the Gradescope course code announced on Piazza.

  • Every student has to work individually on this assignment.
  • Students are allowed to discuss high-level details about the project with others.
  • Students are not allowed to copy the contents of a white-board after a group meeting with other students.
  • Students are not allowed to copy the solutions from another colleague.

WARNING: All of the code for this project must be your own. You may not copy source code from other students or other sources that you find on the web. Plagiarism will not be tolerated. See CMU's Policy on Academic Integrity for additional information.
