Bulk Load Data into Cosmos DB for NoSQL

45 minutes
  • 4 Learning Objectives

About this Hands-on Lab

Bulk load refers to scenarios where you need to move a large volume of data, and you need to do it with as much throughput as possible. Workloads can be based on batch processes, such as nightly data loads, or based on streaming processes where you are receiving hundreds of thousands of documents that you need to update.

In this hands-on lab, you will use the Cosmos DB SDK along with vanilla C# code to enable bulk execution on a CosmosClient instance. Then you will generate synthetic data to test a bulk load of 1,000 JSON documents into Cosmos DB for NoSQL.

Students with solid experience coding in C# on .NET, or experience with the Cosmos DB for NoSQL SDK in any language, will be the most prepared to complete this lab without assistance. However, tips are provided for developers with less experience, and the solution videos and the lab guide contain full solutions.

Learning Objectives

Successfully complete this lab by achieving the following learning objectives:

Housekeeping
  1. Open an incognito or in-private window and log in to the Azure portal using the user name and password provided in the lab environment.
  2. From within the portal, launch Cloud Shell, select Bash (rather than PowerShell), and set it up with new backing storage.
  3. From the Bash command prompt, execute the git clone command using the URL provided in the Additional Information and Resources section of the lab, followed by DP420Labs to alias the downloaded folder to a friendly name.
  4. Once the project is downloaded, use Cloud Editor to open the Program.cs file.
  5. From the Bash command prompt, change to the working directory: cd DP420Labs/DP420/BulkLoad

NOTE: You are free to write the code for this lab in Visual Studio Code or another IDE if you have experience in that environment. Just make sure you download the GitHub project so that you have the right library references and using directives. Be aware that the lab guide and the solution video are based on working in the Cloud Shell editor, but using a different editor won’t substantially change the code you write.

Instantiating the CosmosClient Object
  1. Navigate to the Cosmos DB account that is already set up for you and copy the primary connection string to connect to Cosmos DB in your code.
  2. Navigate to Data Explorer and note the name of the database and container already deployed to your account. The partition key for the container is itemId. You will need this information later.
  3. Run a quick SQL query to confirm that the container is empty.
  4. In the Main method of the Program.cs file, write the code required to connect to your Cosmos DB account, and operate on the database and container already set up in that account. When you instantiate the CosmosClient, you will also need to enable bulk execution.

Tips:

  • You will need to instantiate a CosmosClient, a Database, and a Container using the connection string, database name, and container name you retrieved from the portal.
  • There may be ambiguity when instantiating the Database object because the Bogus library also has a Database class, so you can use the fully qualified name: Microsoft.Azure.Cosmos.Database
  • You will need to use a CosmosClientOptions object to set AllowBulkExecution to true, or you can optionally use the fluent CosmosClientBuilder class.
  • If you still need help after considering these tips, you can copy/paste the code from the lab guide or watch the solution video.

    NOTE: If you do copy/paste the code from the lab guide, be sure to first save the connection string you copied from the portal so that you do not have to retrieve it again.
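
The connection step and the tips above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the lab guide's solution: the helper class CosmosConnection, its method names, and its parameter names are hypothetical, and the connection string, database name, and container name are the values you retrieved from the portal. In the lab itself, the equivalent statements would sit directly inside the Main method.

```csharp
using Microsoft.Azure.Cosmos;
using Microsoft.Azure.Cosmos.Fluent;

// Hypothetical helper class, for illustration only.
public static class CosmosConnection
{
    // Option 1: enable bulk execution through CosmosClientOptions.
    public static CosmosClientOptions CreateBulkOptions() =>
        new CosmosClientOptions { AllowBulkExecution = true };

    public static CosmosClient CreateClient(string connectionString) =>
        new CosmosClient(connectionString, CreateBulkOptions());

    // Option 2: the same setting expressed with the fluent builder.
    public static CosmosClient CreateClientWithBuilder(string connectionString) =>
        new CosmosClientBuilder(connectionString)
            .WithBulkExecution(true)
            .Build();

    // Gets references to the existing database and container.
    // The fully qualified Database type avoids a name clash with Bogus.
    public static Container GetContainer(
        CosmosClient client, string databaseName, string containerName)
    {
        Microsoft.Azure.Cosmos.Database database = client.GetDatabase(databaseName);
        return database.GetContainer(containerName);
    }
}
```

GetDatabase and GetContainer only build local references, so they are ordinary synchronous calls. Requests are sent to the account once items are read or written.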

Loading Synthetic Data

You are not expected to write the data generation code from scratch. You can simply copy/paste the following code. However, do take a few minutes to study it, taking particular note of the Generate(1000) call, which produces 1,000 records. That is about right for our bulk load test; if you set it much higher, you are likely to receive a 429 (request rate too large) throttling error.

  1. Inside the Main method, following the Cosmos DB connection code, paste this code:
    var fruit = new[] { "apple", "peach", "lemon", "strawberry", "pear" };
    // Get items from a source; here we use a fake data generator (Bogus)
    List<GenericItem> itemsToInsert = new Faker<GenericItem>()
        .RuleFor(i => i.id, f => Guid.NewGuid())
        // itemId is the partition key
        .RuleFor(i => i.itemId, f => f.Random.Number(1, 10))
        .RuleFor(i => i.itemName, f => f.PickRandom(fruit))
        .Generate(1000); // number of documents to generate
  2. Outside of the Main method, paste this code that creates an item class for the data generator:

    public class GenericItem
    {
        public Guid id { get; set; }
        public string? itemName { get; set; }
        public int itemId { get; set; }
    }

Executing the Code

The benefit of using the SDK to batch up data for bulk load is that you do not have to write the batching and caching logic. The SDK takes care of that under the covers. You just need to write vanilla C# code to add the items to the container.

  1. In the previous objective, the code populates a List<GenericItem> object, called itemsToInsert, with synthetic JSON documents. In this objective, you need to write code that iterates over that list and asynchronously inserts the items into the Cosmos DB container.

    NOTE: Better yet, you can create another List, but this time a List<Task> object. Iterate over itemsToInsert and load up the List<Task> object with the tasks that perform the container insert. Then return a Task, which is the type the Main method expects.

  2. After you have written the code, save the changes to the Program.cs file. Then, build the code (for example, with dotnet build). Assuming it builds without error, run the code (for example, with dotnet run).

  3. Assuming the code runs successfully, go back to the Data Explorer to run the SQL query again. You should now see the documents in the container.

Tips:

  • Create a new List<Task> object and use a foreach construct to loop over the itemsToInsert list in order to build a list of tasks that insert items into the container.
  • Use the CreateItemAsync<GenericItem> method on the Container object, which you instantiated in the first code block, to add items to the container.
  • When inserting an item, you need a reference to the item and, optionally, the container partition key, which isn’t required but is more efficient for the database engine. If you decide to include it, the partition key for the container is itemId.
  • Building a list of tasks is not enough on its own; the inserts still need to be awaited. To wait for all of the work defined in the tasks, use this syntax to return a Task, which is the data type expected by the Main method:
    await Task.WhenAll([whatever you named your batch of tasks]);
    Remember: You don’t have to worry about collecting up the documents into batches before inserting. The SDK code takes care of that for you.
  • If you still need help after considering these tips, you can copy/paste the code from the lab guide or watch the solution video.
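
Taken together, the tips above can be sketched roughly as follows. This is a minimal sketch under stated assumptions rather than the lab guide's solution: the BulkLoader class and the InsertAllAsync method are hypothetical names, and the Container is assumed to come from a CosmosClient created with AllowBulkExecution set to true. GenericItem is repeated here only so the block is self-contained; omit it if the class is already defined in Program.cs.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

// Same shape as the lab's GenericItem; omit if already defined in Program.cs.
public class GenericItem
{
    public Guid id { get; set; }
    public string? itemName { get; set; }
    public int itemId { get; set; }
}

// Hypothetical helper class, for illustration only.
public static class BulkLoader
{
    public static async Task InsertAllAsync(
        Container container, IEnumerable<GenericItem> itemsToInsert)
    {
        // Collect one insert task per item.
        var tasks = new List<Task>();
        foreach (GenericItem item in itemsToInsert)
        {
            // The partition key is optional; itemId is the container's partition key.
            tasks.Add(container.CreateItemAsync<GenericItem>(
                item, new PartitionKey(item.itemId)));
        }

        // Wait for every insert to finish; the SDK handles the batching.
        await Task.WhenAll(tasks);
    }
}
```

From the Main method, this could be called as await BulkLoader.InsertAllAsync(container, itemsToInsert); where container and itemsToInsert are the objects created in the earlier objectives.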

Additional Resources

Imagine you track worldwide commodities for a financial firm. You manage a nightly batch processing job that loads tens of thousands of JSON documents related to fruit prices from around the world into your Cosmos DB for NoSQL container. The batch job is not leveraging the available throughput as efficiently as it might, so you are investigating the option of using a bulk execution feature that is available in the Cosmos DB for NoSQL SDK library. You have found a code library with sample code that can generate test data quickly. You just need to write the code to connect to your container using the bulk load feature in the SDK.

GitHub Code Repository: https://github.com/linuxacademy/DP-420-Designing-and-Implementing-Cloud-Native-Applications-Using-Microsoft-Azure-Cosmos-DB
