{"id":6813,"date":"2023-07-27T06:21:21","date_gmt":"2023-07-27T14:21:21","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=6813"},"modified":"2025-04-24T17:15:04","modified_gmt":"2025-04-24T17:15:04","slug":"optimized-deep-learning-pipelines-with-tfrecords-and-protobufs","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/optimized-deep-learning-pipelines-with-tfrecords-and-protobufs\/","title":{"rendered":"Optimized Deep Learning Pipelines"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">A Deep Dive into TFRecords and Protobufs<\/h2>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"1640\" height=\"924\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Untitled-design-2.png\" alt=\"Using TFRecords and Protobufs to optimize deep learning (computer vision) pipelines and track our results in Comet ML\" class=\"wp-image-7003\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Untitled-design-2.png 1640w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Untitled-design-2-300x169.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Untitled-design-2-1024x577.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Untitled-design-2-768x433.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Untitled-design-2-1536x865.png 1536w\" sizes=\"auto, (max-width: 1640px) 100vw, 1640px\" \/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p>Learn how to optimize your deep learning pipelines using TFRecords and Google&#8217;s Protobufs (protocol buffers) in this end-to-end tutorial.<\/p>\n\n\n\n<h2 class=\"wp-block-heading lu lv fw be lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp mq mr bj\" id=\"0d9d\">Introduction<\/h2>\n\n\n\n<p>When it comes to practicing deep learning at home vs. industry, there\u2019s a huge disconnect. 
Every course, tutorial, and YouTube video presents you with a nicely prepared dataset to feed any DL algorithm for any DL framework. TensorFlow itself comes with the Dataset API that allows you to simply download data and train on it with just a couple of lines of code. However, when it comes to real-life production work at a company, nowhere on this earth will someone just hand you a pristine dataset ready for consumption. Consideration must be given to things like:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>File format:<\/strong> Are flat files sufficient, should the data be serialized, etc.?<\/li>\n\n\n\n<li><strong>File Structure:<\/strong> Should there be a pattern to the directories for separation of training examples vs labels, some hybrid data structure, etc.?<\/li>\n\n\n\n<li><strong>Data location:<\/strong> Can the data be batch fetched from the cloud or does it need to exist locally?<\/li>\n\n\n\n<li><strong>Data Processing:<\/strong> Is there another system responsible for collecting and processing the data? And is that system in a completely different framework or programming language? If so, how much effort does it take to go from that system to a deep learning framework-ready system?<\/li>\n\n\n\n<li><strong>CPU\/GPU:<\/strong> Are you limited to only CPU processing (hopefully not) or is there GPU access?<\/li>\n<\/ul>\n\n\n\n<p class=\"pw-post-body-paragraph ms mt fw be b mu np mw mx my nq na nb nc ns ne nf ng nu ni nj nk nw nm nn no fp bj\" id=\"e21c\">Although not as sexy as model building, these items are important since time is money. Slow input pipelines mean <a href=\"https:\/\/www.comet.com\/site\/blog\/reduce-model-training-spending\/\">slow training time<\/a>, which has a few consequences. The longer it takes a model to train, the longer engineers must wait between iterations for tweaking and updating. This keeps those engineers from working on other value propositions. 
If a company is utilizing cloud resources, this means large bills for resource utilization. Also, the longer a model is in development, the more time it loses that could have been spent in production generating value.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">TFRecords<\/h2>\n\n\n\n<p class=\"pw-post-body-paragraph ms mt fw be b mu np mw mx my nq na nb nc ns ne nf ng nu ni nj nk nw nm nn no fp bj\" id=\"d563\">So today, we are going to explore how we can optimize our deep learning pipelines using TensorFlow\u2019s TFRecords. I\u2019ve seen a few blogs on this topic, and all of them fail to adequately describe what TFRecords are. They mostly regurgitate examples from the docs, which are themselves quite lacking. So today, I\u2019m going to teach you everything you wanted (and didn\u2019t want) to know about TFRecords.<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>The TFRecord format is a protobuf-backed format for storing a sequence of binary records. Protobufs are a cross-platform, cross-language library for efficient serialization of structured data. Protocol messages are defined by .proto files, these are often the easiest way to understand a message type.<\/p>\n<\/blockquote>\n\n\n\n<h2 class=\"wp-block-heading lu lv fw be lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp mq mr bj\" id=\"517c\">Protobufs?<\/h2>\n\n\n\n<p class=\"pw-post-body-paragraph ms mt fw be b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no fp bj\" id=\"3457\">So what is a protobuf (aka protocol buffer)? To answer that question, I\u2019m going to mix some technical jargon with some actual examples that explain the jargon.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph ms mt fw be b mu np mw mx my nq na nb nc ns ne nf ng nu ni nj nk nw nm nn no fp bj\" id=\"63a3\">Protobufs are a language-agnostic, platform-neutral, and extensible mechanism for serializing structured data. 
They were developed by Google and released as an open-source project. Protocol Buffers are widely used for efficient and reliable data exchange between different systems, especially in scenarios where performance, compactness, and language interoperability are important.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph ms mt fw be b mu np mw mx my nq na nb nc ns ne nf ng nu ni nj nk nw nm nn no fp bj\" id=\"1c6f\">Now all of that may or may not mean anything to you, but let\u2019s explain it by stepping through all the pros of using protobufs. We are going to touch on:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Language Interoperability<\/li>\n\n\n\n<li>Forward &amp; Backward Compatibility<\/li>\n\n\n\n<li>Efficiency<\/li>\n\n\n\n<li>Schema Validation<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading oj lv fw be lw ok ol om ma on oo op me nc oq or os ng ot ou ov nk ow ox oy oz bj\" id=\"3ba8\">1. Language Interoperability:<\/h3>\n\n\n\n<p class=\"pw-post-body-paragraph ms mt fw be b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no fp bj\" id=\"68ec\">Protocol Buffers provide support for generating code in<a class=\"af pa\" href=\"https:\/\/grpc.io\/docs\/languages\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">&nbsp;multiple programming languages<\/a>, enabling different systems written in different languages to communicate seamlessly by sharing a common data structure.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph ms mt fw be b mu np mw mx my nq na nb nc ns ne nf ng nu ni nj nk nw nm nn no fp bj\" id=\"7950\">Let\u2019s say I want to create a system that collects social media posts from people so that I can train models with the data. The backend of the site is written in Golang, the web scraper might be written in C++, the data cleaning and preparation might be written in Python. By using protobufs, we can define the schema once, and compile for all the aforementioned languages. 
This empowers a \u201cwrite once, use everywhere\u201d type of system, which drastically reduces engineering time.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph ms mt fw be b mu np mw mx my nq na nb nc ns ne nf ng nu ni nj nk nw nm nn no fp bj\" id=\"0132\">Conversely, if we were to pass around our data as, say, JSON or XML, we would have to individually write wrapper classes or objects for each language to consume that data. That opens the door to a lot of bugs, mistakes, and maintenance work. For any update to the data schema, you must update a bunch of code across a bunch of different frameworks.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Implementing Protobufs<\/h4>\n\n\n\n<p class=\"pw-post-body-paragraph ms mt fw be b mu np mw mx my nq na nb nc ns ne nf ng nu ni nj nk nw nm nn no fp bj\" id=\"2217\">We start off defining our protobufs by creating a file called &#8220;<strong class=\"be pb\"><em class=\"oi\">text_data.proto&#8221;<\/em><\/strong><strong class=\"be pb\"><em class=\"oi\">.<\/em><\/strong>&nbsp;We then define the attributes of a \u201cPost\u201d, which consists of a body of text and the time it was written.<\/p>\n\n\n\n<p><script src=\"https:\/\/gist.github.com\/anmorgan24\/72f0d16df224a959cb4eaebafa7229cf.js\"><\/script><\/p>\n\n\n\n<p>Notice that we are defining the data types for each attribute. This is because all code generated from the proto file will produce strongly, statically typed objects. Yes, even in Python (we will see this later). Next, a post would belong to a user. That user may have 0 or more posts. So let\u2019s define that.<\/p>\n\n\n\n<p><script src=\"https:\/\/gist.github.com\/anmorgan24\/abbd654d1806e3efc5791ce13a647c44.js\"><\/script><\/p>\n\n\n\n<p class=\"pw-post-body-paragraph ms mt fw be b mu np mw mx my nq na nb nc ns ne nf ng nu ni nj nk nw nm nn no fp bj\" id=\"d0e6\">We define 0 or more posts by using the \u201c<strong class=\"be pb\"><em class=\"oi\">repeated<\/em><\/strong>\u201d keyword. 
This signals to the protobuf compiler, at compile time, that the generated object should be an array of some sort that holds objects of type \u201cPost\u201d and nothing else. Finally, we just need a way to collect all the users and their posts in a single parent object.<\/p>\n\n\n\n<p><script src=\"https:\/\/gist.github.com\/anmorgan24\/0e961605acd2034186b905a7d8da7821.js\"><\/script><\/p>\n\n\n\n<p>Each of these messages defines individual objects that will be created for any language we compile for. The overall proto file should look like this:<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter wp-image-6818\"><img loading=\"lazy\" decoding=\"async\" width=\"529\" height=\"902\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/1_mDY6rqF5f2G0ePXK5L5NBQ.webp\" alt=\"Protobufs, TFRecords, Optimizing deep learning pipelines, full-code end-to-end tutorial, Python, Golang, protocol buffers\" class=\"wp-image-6818\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/1_mDY6rqF5f2G0ePXK5L5NBQ.webp 529w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/1_mDY6rqF5f2G0ePXK5L5NBQ-176x300.webp 176w\" sizes=\"auto, (max-width: 529px) 100vw, 529px\" \/><figcaption class=\"wp-element-caption\">Complete text_data.proto file<\/figcaption><\/figure>\n\n\n\n<p>In order to compile, you need to install the&nbsp;<a class=\"af pa\" href=\"https:\/\/grpc.io\/docs\/protoc-installation\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">protoc command line tool<\/a>. I won\u2019t go into much detail on this part because it\u2019s not overly important to this post. This is just a quick crash course on protobufs and what they are. You won\u2019t actually need to do this when it comes to TFRecords and training models. 
This just sets the basis for what comes later.<\/p>\n\n\n\n<h4 class=\"wp-block-heading oj lv fw be lw ok ol om ma on oo op me nc oq or os ng ot ou ov nk ow ox oy oz bj\" id=\"a2a7\">Compiling for Golang<\/h4>\n\n\n\n<p><script src=\"https:\/\/gist.github.com\/anmorgan24\/c9937a4cceab0668b2d7d7fd3e6d76b3.js\"><\/script><\/p>\n\n\n\n<p>Again, don\u2019t worry too much about this step since you won\u2019t really need it; the command just tells the compiler to generate Golang code from the protobufs defined in the proto file. Once we run this, it will generate a file called \u201ctext_data.pb.go\u201d.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter wp-image-6820 size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1506\" height=\"1462\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/2.webp\" alt=\"Protobufs, TFRecords, Optimizing deep learning pipelines, full-code end-to-end tutorial, Python, Golang, protocol buffers\" class=\"wp-image-6820\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/2.webp 1506w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/2-300x291.webp 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/2-1024x994.webp 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/2-768x746.webp 768w\" sizes=\"auto, (max-width: 1506px) 100vw, 1506px\" \/><figcaption class=\"wp-element-caption\">Compiling text_data.pb.go file in Golang<\/figcaption><\/figure>\n\n\n\n<div class=\"fp fq fr fs ft\">\n<div class=\"ab ca\">\n<div class=\"ch bg fb fc fd fe\">\n<p id=\"f60b\" class=\"pw-post-body-paragraph ms mt fw be b mu np mw mx my nq na nb nc ns ne nf ng nu ni nj nk nw nm nn no fp bj\" data-selectable-paragraph=\"\">This file is the Golang version of what we defined. It uses Golang native data types and structures. Golang doesn\u2019t have classes. Instead, it uses C-style structs. 
And you can see that there is a Post struct, representing the message \u201c<strong class=\"be pb\"><em class=\"oi\">Post<\/em><\/strong>\u201d we created. In the outline section on the left, you can see it created structs for all the messages we defined, and a bunch of other methods and goodies. Some of these goodies allow us to represent our object as a string to print to the console if we so choose. They also give us the ability to convert our protobufs to and from JSON objects.<\/p>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<h4 class=\"wp-block-heading oj lv fw be lw ok ol om ma on oo op me nc oq or os ng ot ou ov nk ow ox oy oz bj\" id=\"95df\">Compiling for Python<\/h4>\n\n\n\n<p><script src=\"https:\/\/gist.github.com\/anmorgan24\/0cd1fb43df0e82709c928b61e4868676.js\"><\/script><\/p>\n\n\n\n<p>When compiling for Python, you get something different. You don\u2019t get hand-written class implementations of our protobufs; instead, the classes are built at runtime by a metaclass.<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"ms mt oi be b mu np mw mx my nq na nb nr ns ne nf nt nu ni nj nv nw nm nn no fp bj\" id=\"f386\">\u201cMetaclasses are deeper magic than 99% of users should ever worry about. 
If you wonder whether you need them, you don\u2019t (the people who actually need them know with certainty that they need them, and don\u2019t need an explanation about why).\u201d<\/p>\n\n\n\n<p class=\"ms mt oi be b mu np mw mx my nq na nb nr ns ne nf nt nu ni nj nv nw nm nn no fp bj\" id=\"a392\">\u2014 Tim Peters<\/p>\n<\/blockquote>\n\n\n\n<figure class=\"wp-block-image aligncenter wp-image-6821 size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1400\" height=\"780\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/3.webp\" alt=\"Protobufs, TFRecords, Optimizing deep learning pipelines, full-code end-to-end tutorial, Python, Golang, protocol buffers\" class=\"wp-image-6821\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/3.webp 1400w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/3-300x167.webp 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/3-1024x571.webp 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/3-768x428.webp 768w\" sizes=\"auto, (max-width: 1400px) 100vw, 1400px\" \/><figcaption class=\"wp-element-caption\">Compiling text_data_pb2.py in Python<\/figcaption><\/figure>\n\n\n\n<p class=\"pw-post-body-paragraph ms mt fw be b mu np mw mx my nq na nb nc ns ne nf ng nu ni nj nk nw nm nn no fp bj\" id=\"153b\">You can see on line 16 in the Python generated code image above that there is a&nbsp;<strong class=\"be pb\">DESCRIPTOR&nbsp;<\/strong>variable that is being injected with a serialized definition of our proto. Although cut off in the image, that serialized string is extremely long. This descriptor is used by the metaclass to ensure that any instance of our protobufs in Python code strictly adheres to the definition in the proto file. 
We will circle back and talk about this more.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph ms mt fw be b mu np mw mx my nq na nb nc ns ne nf ng nu ni nj nk nw nm nn no fp bj\" id=\"e145\">When we are ready to use our newly compiled protos, we just import them and use them as if they were a native object for the programming language we are working with. The best way to think of protobufs while using them in your codebase is as \u201cvalue classes\u201d.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph ms mt fw be b mu np mw mx my nq na nb nc ns ne nf ng nu ni nj nk nw nm nn no fp bj\" id=\"a21f\">For those who don\u2019t come from CS or SWE backgrounds, a \u201cvalue class\u201d typically refers to a specific type of class or data structure that represents a single value or entity. A value class encapsulates a value and provides operations or methods related to that value, but it does not have identity or mutability. Tangential examples would be&nbsp;<a class=\"af pa\" href=\"https:\/\/www.baeldung.com\/introduction-to-autovalue\" target=\"_blank\" rel=\"noopener ugc nofollow\">Autovalue in Java<\/a>,&nbsp;<a class=\"af pa\" href=\"https:\/\/kotlinlang.org\/docs\/data-classes.html\" target=\"_blank\" rel=\"noopener ugc nofollow\">Data Class in Kotlin<\/a>, and the&nbsp;<a class=\"af pa\" href=\"https:\/\/realpython.com\/python-data-classes\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">Data Class decorator in Python<\/a>.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter wp-image-6822 size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1658\" height=\"830\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/4.webp\" alt=\"Protobufs, TFRecords, Optimizing deep learning pipelines, full-code end-to-end tutorial, Python, Golang, protocol buffers\" class=\"wp-image-6822\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/4.webp 1658w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/4-300x150.webp 300w, 
https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/4-1024x513.webp 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/4-768x384.webp 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/4-1536x769.webp 1536w\" sizes=\"auto, (max-width: 1658px) 100vw, 1658px\" \/><figcaption class=\"wp-element-caption\">test_data.py (Python) and test_data.go (Golang)<\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading oj lv fw be lw ok ol om ma on oo op me nc oq or os ng ot ou ov nk ow ox oy oz bj\" id=\"0f24\">2. Forward and Backward Compatibility:<\/h3>\n\n\n\n<p class=\"pw-post-body-paragraph ms mt fw be b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no fp bj\" id=\"a221\">Protocol Buffers support versioning and evolution of data structures. You can add new fields to a message without breaking existing code that was built with the previous version of the message. This allows for easier maintenance and updates in distributed systems.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph ms mt fw be b mu np mw mx my nq na nb nc ns ne nf ng nu ni nj nk nw nm nn no fp bj\" id=\"22c2\">This one is very simple in explanation. If we ever want to change or improve any of our protos we can simply add a new field. If this field is meant to replace an old field, all we have to do is mark that old field as deprecated. Any new code using our protos will be flagged to not use the deprecated field. 
Any old code that isn\u2019t aware of the new field will still work as intended because we never actually removed the old field.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter wp-image-6823 size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"665\" height=\"363\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/5.webp\" alt=\"Protobufs, TFRecords, Optimizing deep learning pipelines, full-code end-to-end tutorial, Python, Golang, protocol buffers\" class=\"wp-image-6823\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/5.webp 665w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/5-300x164.webp 300w\" sizes=\"auto, (max-width: 665px) 100vw, 665px\" \/><figcaption class=\"wp-element-caption\">Forward and backward compatibility means updating or adding a field is simple and easy.<\/figcaption><\/figure>\n\n\n\n<p>This is far better than JSON or XML. If you are expecting JSON\/XML version \u2018X\u2019 but you get version \u2018Y\u2019, your code more than likely won\u2019t work. It may fail to parse properly because there are new fields your code isn\u2019t aware of. Or worse, fields that your code expects to be there have been removed. Here, we don\u2019t have that problem. Backward compatibility will always exist as long as you don\u2019t delete the field from the proto message. There\u2019s also no penalty for not using a field.<\/p>\n\n\n\n<h3 class=\"wp-block-heading oj lv fw be lw ok ol om ma on oo op me nc oq or os ng ot ou ov nk ow ox oy oz bj\" id=\"8f7e\">3. Efficiency:<\/h3>\n\n\n\n<p class=\"pw-post-body-paragraph ms mt fw be b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no fp bj\" id=\"af3d\">Protocol Buffers are highly efficient in terms of both space and processing time. The serialized data is usually smaller than equivalent XML or JSON representations, resulting in reduced storage and transmission costs. 
Additionally, the encoding and decoding operations are faster, making them suitable for high-performance systems.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph ms mt fw be b mu np mw mx my nq na nb nc ns ne nf ng nu ni nj nk nw nm nn no fp bj\" id=\"2f73\">As a demonstration, we will create 1 million users, each of whom has written a social media message of the maximum length of 280 characters. We will then write the data from the proto in both a serialized binary format and JSON. As I said earlier, protos afford you the ability to convert to and from JSON as long as you adhere to the strict schema. We will then time the write operation, as well as inspect the overall file size written to disk.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter wp-image-6824 size-full\"><img decoding=\"async\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/6-scaled.webp\" alt=\"Results of writing 1 million users in JSON format as well as binary format. We can see it took ~0.24 seconds to write the data in a binary format. It took ~3.3 seconds to write the same data to JSON.\" class=\"wp-image-6824\"\/><figcaption class=\"wp-element-caption\">Results of writing 1 million users in JSON format as well as binary format. We can see it took ~0.24 seconds to write the data in a binary format. It took ~3.3 seconds to write the same data to JSON.<\/figcaption><\/figure>\n\n\n\n<p><script src=\"https:\/\/gist.github.com\/anmorgan24\/4c8eca29a7b1618b00f70bd18013e34a.js\"><\/script><\/p>\n\n\n\n<h3 class=\"wp-block-heading oj lv fw be lw ok ol om ma on oo op me nc oq or os ng ot ou ov nk ow ox oy oz bj\" id=\"4bf2\"><strong>4.&nbsp;Schema Validation:<\/strong><\/h3>\n\n\n\n<p>The defined message structures in Protocol Buffers act as a schema that can be used to validate the data being exchanged. This ensures that the received data adheres to the expected structure and type constraints. 
The reason for the Python metaclass (as shown earlier) is that protobufs inherently provide type safety, meaning they have defined types that must be obeyed. The structure of the class is fixed and must never change. That is, what we defined in the proto file and generated with protoc should be exactly how the class is, always.<\/p>\n\n\n\n<p>No code at runtime is allowed to change the structure of the class, only the data it contains. Python, on the other hand, is a dynamic, \u201cduck-typed\u201d language that has no true concept of static types. Nor does it have any native access modifiers that make members private or protected. The examples below show the problems this causes with respect to protobufs.<\/p>\n\n\n\n<p><script src=\"https:\/\/gist.github.com\/anmorgan24\/0eb629cf03f5d9ceedbd20df3d24f1e7.js\"><\/script><\/p>\n\n\n\n<p>Thus, the metaclass ensures we follow the exact structure and types as defined in the proto file. This type safety is what enables the platform-agnostic nature of protobufs. 
We can\u2019t alter or add anything to a protobuf at runtime that wouldn\u2019t be understood by the same protobuf running on a different platform.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"1285\" height=\"806\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/7.webp\" alt=\"Protobufs, TFRecords, Optimizing deep learning pipelines, full-code end-to-end tutorial, Python, Golang, protocol buffers\" class=\"wp-image-6825\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/7.webp 1285w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/7-300x188.webp 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/7-1024x642.webp 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/7-768x482.webp 768w\" sizes=\"auto, (max-width: 1285px) 100vw, 1285px\" \/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p>Enforcing this in statically typed languages such as Java, C++, and Go is pretty straightforward. If you define a variable as some type, it can only ever be that type. These languages also come with access modifiers so that you can make fields private and inaccessible from outside the class. This way, as you pass the protos between systems that run on different platforms, each system still knows how to handle the data since we know it adheres to the strict schema of the proto.<\/p>\n\n\n\n<h2 class=\"wp-block-heading oj lv fw be lw ok ol om ma on oo op me nc oq or os ng ot ou ov nk ow ox oy oz bj\" id=\"975c\">Protobuf Conclusion<\/h2>\n\n\n\n<p>Overall, Protocol Buffers are a powerful and flexible tool for data serialization and interchange. They are commonly used in various domains, including distributed systems, APIs, communication protocols, and data storage formats. These are only a few of the benefits of using them; we never even touched on data transmission across networks. 
As a quick aside on that point, protobufs are the backbone for:<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter wp-image-6819 size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"560\" height=\"315\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/1_cxVQsr1n4Vqo1GEBkDsoNw.webp\" alt=\"gRPC, Google Remote Procedure Call\" class=\"wp-image-6819\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/1_cxVQsr1n4Vqo1GEBkDsoNw.webp 560w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/1_cxVQsr1n4Vqo1GEBkDsoNw-300x169.webp 300w\" sizes=\"auto, (max-width: 560px) 100vw, 560px\" \/><figcaption class=\"wp-element-caption\">Protobufs are the backbone for gRPC<\/figcaption><\/figure>\n\n\n\n<p>If you are unfamiliar with gRPC, I highly suggest you&nbsp;<a class=\"af pa\" href=\"https:\/\/grpc.io\/docs\/what-is-grpc\/introduction\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">check out the docs<\/a>. In short, it is often a faster alternative to REST services. It runs over HTTP\/2 and enables full-duplex communication. This means faster network transfers and fewer network requests, all powered by protobufs! Something to think about as you\u2019re constantly requesting the next batch of data from remote storage to train your model.<\/p>\n\n\n\n<h2 class=\"wp-block-heading graf graf--h3\">Protobufs for TFRecords<\/h2>\n\n\n\n<p class=\"graf graf--p\">So why the heck did I just spend all that time covering protobufs? Well, that\u2019s because at the heart of TensorFlow\u2019s TFRecords are protobufs. You can view the actual proto file <a class=\"markup--anchor markup--p-anchor\" href=\"https:\/\/github.com\/tensorflow\/tensorflow\/blob\/master\/tensorflow\/core\/example\/feature.proto\" target=\"_blank\" rel=\"noopener\" data-href=\"https:\/\/github.com\/tensorflow\/tensorflow\/blob\/master\/tensorflow\/core\/example\/feature.proto\">here<\/a>. 
But we are going to step our way through this file, message by message.<\/p>\n\n\n\n<p class=\"graf graf--p\">The comments at the top of the proto file already provide an example data structure built from movie data. We will just use the same information to make our way through the explanation. So go ahead and open up that file now and take a look as we go through this.<\/p>\n\n\n\n<p class=\"graf graf--p\">The most basic component of a TFRecord (and by extension, the protobufs that make up TFRecords) is data that consists of one of three types:<\/p>\n\n\n\n<ul class=\"wp-block-list postList\">\n<li><strong class=\"markup--strong markup--li-strong\">bytes:<\/strong> Would be used for text, audio, or video-based features\/inputs.<\/li>\n\n\n\n<li><strong class=\"markup--strong markup--li-strong\">float:<\/strong> Would be used for features\/inputs with floating-point precision, e.g. 3.14<\/li>\n\n\n\n<li><strong class=\"markup--strong markup--li-strong\">int64:<\/strong> Would be used for features\/inputs with simple integer values, e.g. 
100<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">The basics<\/h3>\n\n\n\n<p class=\"graf graf--p\">Representing this in the TFRecord, there are three proto messages defined.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"606\" height=\"301\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/1.png\" alt=\"Protobufs, TFRecords, Optimizing deep learning pipelines, full-code end-to-end tutorial, Python, Golang, protocol buffers\" class=\"wp-image-6830\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/1.png 606w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/1-300x149.png 300w\" sizes=\"auto, (max-width: 606px) 100vw, 606px\" \/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p class=\"graf graf--p\">The initialization and utilization for any of them would require 0 or more entries of data consisting of the specified data type. This is due to the \u201crepeated\u201d keyword. This signals to the proto compiler that this field is not a single value, but it is an array of values of the defined data type.<\/p>\n\n\n\n<p class=\"graf graf--p\">I\u2019ve taken the proto file from their git repo and have compiled it. 
I\u2019m using the generated code in the examples that follow.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"771\" height=\"557\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/2.png\" alt=\"Protobufs, TFRecords, Optimizing deep learning pipelines, full-code end-to-end tutorial, Python, Golang, protocol buffers\" class=\"wp-image-6831\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/2.png 771w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/2-300x217.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/2-768x555.png 768w\" sizes=\"auto, (max-width: 771px) 100vw, 771px\" \/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p class=\"graf graf--p\">Since this is just the raw values with no labels, we need a way to itemize the data. Obviously this would be necessary for data understanding as well as feature engineering. To do this, we first create a \u201cFeature\u201d proto.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"536\" height=\"272\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/3.png\" alt=\"Protobufs, TFRecords, Optimizing deep learning pipelines, full-code end-to-end tutorial, Python, Golang, protocol buffers\" class=\"wp-image-6832\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/3.png 536w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/3-300x152.png 300w\" sizes=\"auto, (max-width: 536px) 100vw, 536px\" \/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p class=\"graf graf--p\">The \u201coneof\u201d keyword signals to a user that this proto will be a feature containing one and only one of the base proto types. When compiling a proto with protoc that contains a \u201coneof\u201d member, it also generates extra API for type-checking capabilities. 
This lets a user and\/or code inspect the proto and determine which \u201ckind\u201d it contains. It also guarantees that an instance of the Feature proto holds at most one kind at a time: setting one member of a \u201coneof\u201d automatically clears any other member that was previously set. You can read more about the <a class=\"markup--anchor markup--p-anchor\" href=\"https:\/\/protobuf.dev\/programming-guides\/proto3\/#oneof\" target=\"_blank\" rel=\"noopener\" data-href=\"https:\/\/protobuf.dev\/programming-guides\/proto3\/#oneof\">\u201coneof\u201d keyword here<\/a>.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"765\" height=\"192\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/4.png\" alt=\"Protobufs, TFRecords, Optimizing deep learning pipelines, full-code end-to-end tutorial, Python, Golang, protocol buffers\" class=\"wp-image-6833\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/4.png 765w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/4-300x75.png 300w\" sizes=\"auto, (max-width: 765px) 100vw, 765px\" \/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p class=\"graf graf--p\">But wait, why did I create the individual protos of BytesList, FloatList, and Int64List just to wrap them in yet another proto? Well, this is mostly a design choice, and whether you feel it\u2019s a good one or not simply boils down to philosophical differences. But the next part might shed some light on this design choice.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">Feature proto map<\/h1>\n\n\n\n<p class=\"graf graf--p\">After we have created our Feature protos, we still need a way to assign a label to them. We do this by aggregating all of our newly created features in a feature map proto called \u201cFeatures\u201d (unique naming, I know). In this proto map, each feature we have created is indexed by a string key. 
If you\u2019ve only been programming in Python land your whole life, and have no clue what I mean when I say map, you can think of it as no different than a Python dictionary. It\u2019s a key-value data structure.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter wp-image-6834 size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"492\" height=\"130\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/5.png\" alt=\"Protobufs, TFRecords, Optimizing deep learning pipelines, full-code end-to-end tutorial, Python, Golang, protocol buffers\" class=\"wp-image-6834\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/5.png 492w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/5-300x79.png 300w\" sizes=\"auto, (max-width: 492px) 100vw, 492px\" \/><figcaption class=\"wp-element-caption\">TensorFlow&#8217;s &#8220;Features&#8221; proto definition<\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"608\" height=\"1748\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/6.png\" alt=\"Protobufs, TFRecords, Optimizing deep learning pipelines, full-code end-to-end tutorial, Python, Golang, protocol buffers\" class=\"wp-image-6835\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/6.png 608w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/6-104x300.png 104w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/6-356x1024.png 356w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/6-534x1536.png 534w\" sizes=\"auto, (max-width: 608px) 100vw, 608px\" \/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p class=\"graf graf--p\">Because our raw data is contained as either BytesList, FloatList, or Int64List and wrapped in a \u201coneof\u201d Feature proto, that simplifies the map (and thus justifies the design choice). 
If we weren\u2019t wrapping the base protos in the Feature proto, then we would have to create an individual map member for each of the base types. For example, \u201cFeatures\u201d would have to become:<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"533\" height=\"288\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/7.png\" alt=\"Protobufs, TFRecords, Optimizing deep learning pipelines, full-code end-to-end tutorial, Python, Golang, protocol buffers\" class=\"wp-image-6836\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/7.png 533w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/7-300x162.png 300w\" sizes=\"auto, (max-width: 533px) 100vw, 533px\" \/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<section class=\"section section--body\">\n<div class=\"section-divider\"><\/div>\n<div class=\"section-content\">\n<div class=\"section-inner sectionLayout--insetColumn\">\n<p class=\"graf graf--p\">Again, if you\u2019ve only ever programmed in Python, or something of the sort, this might seem strange to you. But unlike Python, which is dynamic and doesn\u2019t enforce declared data types, protobufs are strongly typed. You can\u2019t create arrays or maps and insert mixed data types into them. Hence we would have to make our Features proto like the one above. But this would be more cumbersome to deal with in practice. 
Instead of having a single map containing all of our data, we would now have to inspect 3 separate maps for the possibility of any data.<\/p>\n<\/div>\n<\/div>\n<\/section>\n\n\n\n<h2 class=\"wp-block-heading graf graf--h3\">TFRecords in TensorFlow<\/h2>\n\n\n\n<p class=\"graf graf--p\">If you\u2019ve ever looked at the documentation in TensorFlow for utilizing TFRecords, or if you\u2019ve ever just used them in practice, you may have noticed that the API docs don\u2019t actually import any generated protobuf files, nor do they mention anything about protobufs apart from the fact that TFRecords are backed by protos. This is because the TensorFlow API has its own wrappers and abstractions around the protos. Quite conveniently, however, the TensorFlow API matches what we did with the raw protobufs almost exactly 1:1.<\/p>\n\n\n\n<h3 class=\"wp-block-heading graf graf--h4\">Writing TFRecord&nbsp;Files<\/h3>\n\n\n\n<p class=\"graf graf--p\">To demonstrate creating TFRecords using the TensorFlow API, I\u2019m going to use the <a class=\"markup--anchor markup--p-anchor\" href=\"https:\/\/www.kaggle.com\/datasets\/jessicali9530\/stanford-cars-dataset\" target=\"_blank\" rel=\"noopener\" data-href=\"https:\/\/www.kaggle.com\/datasets\/jessicali9530\/stanford-cars-dataset\">Stanford Cars Dataset<\/a>. This is a great example dataset to use in this demonstration since all the training images are contained in a single folder and their actual labels and names are in a separate \u201c.mat\u201d file. 
We can use this opportunity to not only convert these images from JPG to TFRecords, but also to write them with their appropriate label and any other metadata we wish to store with the image itself.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"1338\" height=\"1137\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/8.png\" alt=\"Protobufs, TFRecords, Optimizing deep learning pipelines, full-code end-to-end tutorial, Python, Golang, protocol buffers\" class=\"wp-image-6837\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/8.png 1338w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/8-300x255.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/8-1024x870.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/8-768x653.png 768w\" sizes=\"auto, (max-width: 1338px) 100vw, 1338px\" \/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p class=\"graf graf--p\">Before we start though, let\u2019s do some setup steps. Because the names and labels for each car are in a separate file, let\u2019s create a dictionary where the keys are the image names\u200a\u2014\u200ae.g. \u201c00151.jpg\u201d\u200a\u2014\u200aand the values are another dictionary containing the car name and the class label of the car for classification. This label is represented as an integer in the data. Some example entries of the dictionary would be:<\/p>\n\n\n\n<p><script src=\"https:\/\/gist.github.com\/anmorgan24\/4fe3d36ea088e4cc97e0c63a4ea061b5.js\"><\/script><\/p>\n\n\n\n<p class=\"graf graf--p\">Since this isn\u2019t an article on data cleaning\/preparation, for this initial step, I\u2019m just going to show my code with comments. 
I\u2019m not going to explain it.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"2092\" height=\"1907\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/9.png\" alt=\"Protobufs, TFRecords, Optimizing deep learning pipelines, full-code end-to-end tutorial, Python, Golang, protocol buffers\" class=\"wp-image-6838\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/9.png 2092w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/9-300x273.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/9-1024x933.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/9-768x700.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/9-1536x1400.png 1536w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/9-2048x1867.png 2048w\" sizes=\"auto, (max-width: 2092px) 100vw, 2092px\" \/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p class=\"graf graf--p\">To better organize the training data, each individual image will be converted from its JPEG form to a TensorFlow \u201cFeatures\u201d object. Remember that the \u201cFeatures\u201d proto is a map of string to feature. The same holds true for the TensorFlow \u201cFeatures\u201d object. Thus a single image will be represented in the following manner:<\/p>\n\n\n\n<p><script src=\"https:\/\/gist.github.com\/anmorgan24\/e024b30b82bf5978a861a1160b7bf8ac.js\"><\/script><\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Defining our helper functions<\/h3>\n\n\n\n<p class=\"graf graf--p\">Following the same pattern that we did above when we were using the actual protobufs, we first need to convert each of these features into either a TensorFlow BytesList, FloatList, or Int64List object. We then need to wrap that newly created Bytes, Float, or Int64 list in a TensorFlow \u201cFeature\u201d object (not \u201cFeatures\u201d\u200a\u2014\u200aagain, unique naming, I know). 
We will create helper functions to do so. This part is where you typically see most tutorials on TFRecords start.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"1114\" height=\"710\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/10.png\" alt=\"Protobufs, TFRecords, Optimizing deep learning pipelines, full-code end-to-end tutorial, Python, Golang, protocol buffers\" class=\"wp-image-6839\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/10.png 1114w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/10-300x191.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/10-1024x653.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/10-768x489.png 768w\" sizes=\"auto, (max-width: 1114px) 100vw, 1114px\" \/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p class=\"graf graf--p\">Next, we will create a helper function that will:<\/p>\n\n\n\n<ol class=\"wp-block-list postList\">\n<li>Read an image into memory<\/li>\n\n\n\n<li>Extract its name and label from the dictionary in the preprocessing step<\/li>\n\n\n\n<li>Convert the image to bytes<\/li>\n\n\n\n<li>Convert all features into either a bytes, int64, or float feature<\/li>\n\n\n\n<li>Create the final \u201cFeatures\u201d map object.<\/li>\n<\/ol>\n\n\n\n<figure class=\"wp-block-image aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"1083\" height=\"819\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/11.png\" alt=\"Protobufs, TFRecords, Optimizing deep learning pipelines, full-code end-to-end tutorial, Python, Golang, protocol buffers\" class=\"wp-image-6840\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/11.png 1083w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/11-300x227.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/11-1024x774.png 1024w, 
https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/11-768x581.png 768w\" sizes=\"auto, (max-width: 1083px) 100vw, 1083px\" \/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p class=\"graf graf--p\">With our helper functions done, all we need to do now is to:<\/p>\n\n\n\n<ol class=\"wp-block-list postList\">\n<li>Get a list of all the images in the train directory (their paths)<\/li>\n<\/ol>\n\n\n\n<p><script src=\"https:\/\/gist.github.com\/anmorgan24\/f22a6e882a0771a51fad2e31f3ea19ee.js\"><\/script><\/p>\n\n\n\n<p class=\"graf graf--p\">2. Shard them into a fixed number of sets\u200a\u2014\u200ae.g. a list of 50 pictures sharded into 5 sets would result in 5 lists of 10 images<\/p>\n\n\n\n<p><script src=\"https:\/\/gist.github.com\/anmorgan24\/b332f5556a8d392332ff3ccc85385066.js\"><\/script><\/p>\n\n\n\n<p class=\"graf graf--p\">3. Write each shard as a separate TFRecord file.<\/p>\n\n\n\n<p><script src=\"https:\/\/gist.github.com\/anmorgan24\/606aa902dbb874ab920237395fe4e63e.js\"><\/script><\/p>\n\n\n\n<p class=\"graf graf--p\">One last thing of note\u200a\u2014\u200abefore we write the data as a TFRecord, we will wrap the features map object in one last object, the <code class=\"markup--code markup--p-code\">tf.train.Example<\/code> object. Once in that form, we can write to disk. There is no requirement to use <code class=\"markup--code markup--p-code\"><a class=\"markup--anchor markup--p-anchor\" href=\"https:\/\/www.tensorflow.org\/api_docs\/python\/tf\/train\/Example\" target=\"_blank\" rel=\"noopener\" data-href=\"https:\/\/www.tensorflow.org\/api_docs\/python\/tf\/train\/Example\">tf.train.Example<\/a><\/code> in TFRecord files. 
<code class=\"markup--code markup--p-code\"><a class=\"markup--anchor markup--p-anchor\" href=\"https:\/\/www.tensorflow.org\/api_docs\/python\/tf\/train\/Example\" target=\"_blank\" rel=\"noopener\" data-href=\"https:\/\/www.tensorflow.org\/api_docs\/python\/tf\/train\/Example\">tf.train.Example<\/a><\/code> is just a method of serializing dictionaries to byte-strings.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"1735\" height=\"1762\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/12.png\" alt=\"Protobufs, TFRecords, Optimizing deep learning pipelines, full-code end-to-end tutorial, Python, Golang, protocol buffers\" class=\"wp-image-6841\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/12.png 1735w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/12-295x300.png 295w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/12-1008x1024.png 1008w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/12-768x780.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/12-1512x1536.png 1512w\" sizes=\"auto, (max-width: 1735px) 100vw, 1735px\" \/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"632\" height=\"381\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/13.png\" alt=\"Protobufs, TFRecords, Optimizing deep learning pipelines, full-code end-to-end tutorial, Python, Golang, protocol buffers\" class=\"wp-image-6842\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/13.png 632w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/13-300x181.png 300w\" sizes=\"auto, (max-width: 632px) 100vw, 632px\" \/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<h3 class=\"wp-block-heading graf graf--h4\">Reading TFRecords<\/h3>\n\n\n\n<p class=\"graf graf--p\">Reading the TFRecords and preparing them for model 
training is straightforward and doesn\u2019t deviate very much from all the examples in the tf.Dataset docs. We will write a couple of helper functions. The first helper function will allow us to parse a TFRecord file that is loaded into memory. Since the data is serialized, we need a way to deserialize it. We do that by first defining the expected schema of the data in a dictionary format so the parser knows what to expect.<\/p>\n\n\n\n<p><script src=\"https:\/\/gist.github.com\/anmorgan24\/a75ccab3dfe6af2f9dde833e111ffef3.js\"><\/script><\/p>\n\n\n\n<p class=\"graf graf--p\">You can see that this follows the same dictionary format we created when we first wrote the TFRecord files. This informs the parser that it should expect to be able to deserialize the data into the 5 feature fields of image, width, height, label, and class. It also informs the parser of the expected data types as well as the default value if the data is missing for that particular feature.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Decoding the binary file<\/h4>\n\n\n\n<p class=\"graf graf--p\">You might be asking yourself why you need to yet again define the structure of the data if you already did so when writing the TFRecords. This is because the data is serialized as one long string of bytes. Without giving the parser the expected structure of that information, it doesn\u2019t know how to interpret it. Think of it this way: suppose I just gave you the binary of:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote graf graf--blockquote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>01000001<\/p>\n<\/blockquote>\n\n\n\n<p class=\"graf graf--p\">In pure binary, this is just the number 65. However, I might have intended for you to parse it as ASCII, and in that case this is actually the letter <strong class=\"markup--strong markup--p-strong\">\u2018A\u2019<\/strong>. 
Without the extra information, there\u2019s ambiguity as far as what the data could actually mean.<\/p>\n\n\n\n<p class=\"graf graf--p\">Next we need to convert the serialized image back into matrix form. I\u2019m also going to take this opportunity to resize the image since I\u2019m going to use a pre-trained model that expects image input to be (299,299,3).<\/p>\n\n\n\n<p><script src=\"https:\/\/gist.github.com\/anmorgan24\/8bc78fd5b740d06ca1a7185adcb54b3e.js\"><\/script><\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"1412\" height=\"921\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/14.png\" alt=\"Protobufs, TFRecords, Optimizing deep learning pipelines, full-code end-to-end tutorial, Python, Golang, protocol buffers\" class=\"wp-image-6843\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/14.png 1412w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/14-300x196.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/14-1024x668.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/14-768x501.png 768w\" sizes=\"auto, (max-width: 1412px) 100vw, 1412px\" \/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Configuring our input pipeline<\/h4>\n\n\n\n<p class=\"graf graf--p\">For the next step we will use the TensorFlow Dataset API and configure our input pipeline. To do this, we need to tell TensorFlow where the files are and how to load them.<\/p>\n\n\n\n<p><script src=\"https:\/\/gist.github.com\/anmorgan24\/f1f02b529cf6d5edd954e189cc96cf77.js\"><\/script><\/p>\n\n\n\n<p class=\"graf graf--p\">The actual loading of the TFRecords is handled by the mapping function <code class=\"markup--code markup--p-code\">TFRecordDataset<\/code>. 
The <code class=\"markup--code markup--p-code\">interleave<\/code> method spawns as many threads as you specify, or as many as it thinks is optimal if you use the AUTOTUNE parameter. Each thread will load and process its own part of the data concurrently. So instead of processing files one at a time, many can be processed at once. As each thread finishes processing a portion of its data, TensorFlow will \u201cinterleave\u201d the processed data from various threads to make a batch of processed data. Hence instead of loading and processing a single file and making all the data from that file a part of a batch, it will gather data from across the various threads and create a batch.<\/p>\n\n\n\n<p class=\"graf graf--p\">After the data is loaded, we need to apply the parsing function we created above as well as other pipeline parameters such as batch size and prefetching.<\/p>\n\n\n\n<p><script src=\"https:\/\/gist.github.com\/anmorgan24\/c01fa9673fe4e658868098a5e93a2ae2.js\"><\/script><\/p>\n\n\n\n<p class=\"graf graf--p\">The prefetch method basically tells the CPU to prepare the next batch of data and have it ready to go while the GPU is working. 
That way, once the GPU is done with its current batch, there\u2019s minimal idle time while it waits for the CPU to prepare the next batch.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"1292\" height=\"653\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/15.png\" alt=\"Protobufs, TFRecords, Optimizing deep learning pipelines, full-code end-to-end tutorial, Python, Golang, protocol buffers\" class=\"wp-image-6844\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/15.png 1292w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/15-300x152.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/15-1024x518.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/15-768x388.png 768w\" sizes=\"auto, (max-width: 1292px) 100vw, 1292px\" \/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p class=\"graf graf--p\">And that\u2019s it for reading the data. We can now test it out and plot an image as well as inspect the image size transformation.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"1008\" height=\"1560\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/16.png\" alt=\"Protobufs, TFRecords, Optimizing deep learning pipelines, full-code end-to-end tutorial, Python, Golang, protocol buffers\" class=\"wp-image-6845\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/16.png 1008w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/16-194x300.png 194w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/16-662x1024.png 662w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/16-768x1189.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/16-992x1536.png 992w\" sizes=\"auto, (max-width: 1008px) 100vw, 1008px\" \/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<h2 class=\"wp-block-heading graf 
graf--h3\">Training with TFRecords vs Raw&nbsp;Input<\/h2>\n\n\n\n<p class=\"graf graf--p\">Most deep learning tutorials, both PyTorch and TensorFlow, typically show you how to prepare your data for model training by using simple DataGenerators which read the raw data. With this method, you are (lazily) reading the data from disk in its raw form. In our case, a DataGenerator with batch size of 10 would have to read 10 separate JPG files into memory prior to any other preprocessing in the pipeline. To compare how (in)efficient this is compared to TFRecords, let\u2019s create our own DataGenerator that still uses the same pipeline as our TFRecords. The steps for this will be:<\/p>\n\n\n\n<ul class=\"wp-block-list postList\">\n<li>Create a helper function that gets the class and label from the meta-dictionaries we made in the preprocessing step.<\/li>\n<\/ul>\n\n\n\n<p><script src=\"https:\/\/gist.github.com\/anmorgan24\/ecca7d7773ee009110ecedfd46fecc9b.js\"><\/script><\/p>\n\n\n\n<ul class=\"wp-block-list postList\">\n<li>Create a helper function that reads the image into memory and bundles it with its label and class as retrieved by the previous helper function.<\/li>\n<\/ul>\n\n\n\n<p><script src=\"https:\/\/gist.github.com\/anmorgan24\/b0fd0a6c94ea5ec488d8a3c6ef1b1bbf.js\"><\/script><\/p>\n\n\n\n<ul class=\"wp-block-list postList\">\n<li>Return the data generator with the pipeline settings of \u201cbatch size\u201d and \u201cprefetching\u201d applied.<\/li>\n<\/ul>\n\n\n\n<p><script src=\"https:\/\/gist.github.com\/anmorgan24\/6941b3eb660c152cb0507aef45267336.js\"><\/script><\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"1370\" height=\"1891\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/17.png\" alt=\"Protobufs, TFRecords, Optimizing deep learning pipelines, full-code end-to-end tutorial, Python, Golang, protocol buffers\" class=\"wp-image-6846\" 
srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/17.png 1370w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/17-217x300.png 217w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/17-742x1024.png 742w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/17-768x1060.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/17-1113x1536.png 1113w\" sizes=\"auto, (max-width: 1370px) 100vw, 1370px\" \/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<h3 class=\"wp-block-heading graf graf--h4\">Monitoring Training<\/h3>\n\n\n\n<p class=\"graf graf--p\">We don\u2019t care so much about the actual model for our testing, so we will just use transfer learning and load the InceptionResNetV2 model. What we do care about is:<\/p>\n\n\n\n<ul class=\"wp-block-list postList\">\n<li>CPU Utilization\u200a\u2014\u200aWhen is it active and how long is it active for<\/li>\n\n\n\n<li>Memory Utilization\u200a\u2014\u200aHow much memory is being used for processing of the data prior to loading to the GPU<\/li>\n\n\n\n<li>GPU Utilization\u200a\u2014\u200aHow much idle time does the GPU experience while it is waiting for the CPU to prepare the next batch<\/li>\n\n\n\n<li>Time to train between each batch<\/li>\n\n\n\n<li>Time to train between each epoch<\/li>\n<\/ul>\n\n\n\n<p class=\"graf graf--p\">To monitor the utilization, we will do two separate things that ultimately achieve the same goal. The first one requires more work by you; the second handles everything for you after minimal setup using Comet\u2019s online dashboard.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Manually logging system resources<\/h2>\n\n\n\n<p class=\"graf graf--p\">1. We will spawn a separate thread (so it\u2019s not blocking our model training) and sample utilization metrics 5x a second. Below is the code for this operation. 
Again, since this isn\u2019t an article on logging system resources, I leave it to the reader to figure out the code.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"987\" height=\"1773\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/18.png\" alt=\"Protobufs, TFRecords, Optimizing deep learning pipelines, full-code end-to-end tutorial, Python, Golang, protocol buffers\" class=\"wp-image-6847\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/18.png 987w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/18-167x300.png 167w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/18-570x1024.png 570w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/18-768x1380.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/18-855x1536.png 855w\" sizes=\"auto, (max-width: 987px) 100vw, 987px\" \/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Logging with Comet<\/h2>\n\n\n\n<p class=\"graf graf--p\">2. We will use Comet.ml to log our metrics to a web-based dashboard. We will do this so we can chart, in real time, the performance of model training with respect to both datasets. If you\u2019re not familiar with Comet, you can just head over to <a class=\"markup--anchor markup--p-anchor\" href=\"https:\/\/www.comet.com\/site\/\" target=\"_blank\" rel=\"noopener\" data-href=\"https:\/\/www.comet.com\/site\/\">Comet.ml<\/a> and sign up for free. You will need your API key. After you create your account, click your profile pic in the top right corner, then go to <code class=\"markup--code markup--p-code\">Account settings<\/code>. On the left navigation panel, you will see <code class=\"markup--code markup--p-code\">API Keys<\/code>. 
This is where you can find your key.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"1600\" height=\"685\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/19.png\" alt=\"Protobufs, TFRecords, Optimizing deep learning pipelines, full-code end-to-end tutorial, Python, Golang, protocol buffers\" class=\"wp-image-6848\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/19.png 1600w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/19-300x128.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/19-1024x438.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/19-768x329.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/19-1536x658.png 1536w\" sizes=\"auto, (max-width: 1600px) 100vw, 1600px\" \/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p class=\"graf graf--p\">To log the time between batches and epochs, we will simply have the data written to a list during training. We could also log the batch and epoch information to Comet. However, I prefer to collect the data locally so that I can do some extra analysis and plotting after the fact for this blog post. 
The entire training loop is the following:<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"941\" height=\"1789\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/20.png\" alt=\"Protobufs, TFRecords, Optimizing deep learning pipelines, full-code end-to-end tutorial, Python, Golang, protocol buffers\" class=\"wp-image-6849\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/20.png 941w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/20-158x300.png 158w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/20-539x1024.png 539w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/20-768x1460.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/20-808x1536.png 808w\" sizes=\"auto, (max-width: 941px) 100vw, 941px\" \/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Parameters<\/h4>\n\n\n\n<p class=\"graf graf--p\">System parameters:<\/p>\n\n\n\n<ul class=\"wp-block-list postList\">\n<li>RTX 3090 24GB Ram<\/li>\n\n\n\n<li>Ryzen 5950x 16 Core 32 Thread CPU<\/li>\n\n\n\n<li>64GB DDR4 Ram<\/li>\n\n\n\n<li>ASUS Crosshair III Formula X570 Motherboard<\/li>\n\n\n\n<li>2 TB Samsung EVO SSD<\/li>\n<\/ul>\n\n\n\n<p class=\"graf graf--p\">Test parameters:<\/p>\n\n\n\n<ul class=\"wp-block-list postList\">\n<li>batch size = 64<\/li>\n\n\n\n<li>epochs = 10<\/li>\n\n\n\n<li>prefetching = True<\/li>\n<\/ul>\n\n\n\n<p class=\"graf graf--p\">Below shows the setup, testing, and logging using the TFRecords first. As the model is training, Comet is logging system resources for us after we have set up the experiment as seen above. Notice that we call exp.end() after the model is done training. This will signal to Comet to stop logging. Also, our custom logging function is recording system resource utilization on a separate thread locally. 
It is recording that data to the \u201cmonitoring_data\u201d dictionary we passed into the \u201cmonitoring_func\u201d.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"877\" height=\"603\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/1_Lb9JcADd6YJ7K6lPO-ufvg.gif\" alt=\"Protobufs, TFRecords, Optimizing deep learning pipelines, full-code end-to-end tutorial, Python, Golang, protocol buffers\" class=\"wp-image-6829\"\/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter wp-image-6850 size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"2600\" height=\"1377\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/21.png\" alt=\"Protobufs, TFRecords, Optimizing deep learning pipelines, full-code end-to-end tutorial, Python, Golang, protocol buffers, Comet ML\" class=\"wp-image-6850\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/21.png 2600w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/21-300x159.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/21-1024x542.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/21-768x407.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/21-1536x813.png 1536w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/21-2048x1085.png 2048w\" sizes=\"auto, (max-width: 2600px) 100vw, 2600px\" \/><figcaption class=\"wp-element-caption\">TFRecords (left) vs Raw JPG (right) realtime system metrics logging using Comet.ml<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading graf graf--h3\">Results<\/h2>\n\n\n\n<p class=\"graf graf--p\">While monitoring the metrics during training on Comet (image above), it was already clear how much more effective TFRecords were vs raw JPG on disk. The dashboard shows that the GPU was utilized at almost 100% for the entire training loop. 
Conversely, the JPGs induced a lot of idle time. In the dashboard above, we can see that TFRecords took 18 minutes to train for 10 epochs, whereas the JPGs took about 9 minutes longer, at 27 minutes. The Comet data was aggregated and uploaded at 1-minute intervals.<\/p>\n\n\n\n<p class=\"graf graf--p\">Looking at the data logged locally at a higher resolution (5x a second), there\u2019s a lot of \u201cdead time\u201d on the GPU when training with the original JPGs, while it waits on the CPU to read each image, process it, and load it. By contrast, in the raw results of the training session with TFRecords, the GPU stayed busy almost the whole time, as we saw in real time on Comet.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter\"><img decoding=\"async\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/22-scaled.jpg\" alt=\"Protobufs, TFRecords, Optimizing deep learning pipelines, full-code end-to-end tutorial, Python, Golang, protocol buffers\" class=\"wp-image-6851\"\/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Note:<\/h4>\n\n\n\n<p class=\"graf graf--p\">As an FYI: the x-axis plots a Python DateTime as the value, so the first number on the axis is the date, not the time. Hence a value of 22:21:10 actually means day=22nd, hour=21, min=10. It also looks as if the CPU was never quite as busy as when it was training with the TFRecords. It\u2019s almost as if it was working double time to keep up with the demand of the GPU, which is why the GPU was so busy compared to training on the raw JPG files. 
Let\u2019s apply a Gaussian filter to the time series and see if the smoothed curves reveal any further trends.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter\"><img decoding=\"async\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/23-scaled.jpg\" alt=\"Protobufs, TFRecords, Optimizing deep learning pipelines, full-code end-to-end tutorial, Python, Golang, protocol buffers\" class=\"wp-image-6852\"\/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p class=\"graf graf--p\">After smoothing, it\u2019s very evident when a new epoch was starting while training with the TFRecords. There are 10 spaced-out spikes in CPU utilization that correspond with 10 spikes in GPU memory utilization. This is clearly the CPU processing and loading the GPU with the batches for the next epoch. It is all the more evident given that we trained the model for 10 epochs, and there are 10 spikes! Meanwhile, while training with the JPG images, there\u2019s a lot of downtime on the GPU at each epoch. There are 10 significant drops in GPU utilization that each appear to last for about a minute or so. It also seems that the CPU is just taking its time processing the images.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Time distributions<\/h3>\n\n\n\n<p class=\"graf graf--p\">Let\u2019s now take a look at the distributions of the time it took to process each batch. One would assume that we shouldn\u2019t see a significant difference in the time it takes to process a batch, because both formats are converted to a (None, 299, 299, 3) tensor. 
Any differences between the two inputs would be purely due to stochasticity.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"1731\" height=\"942\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/24.jpg\" alt=\"Protobufs, TFRecords, Optimizing deep learning pipelines, full-code end-to-end tutorial, Python, Golang, protocol buffers\" class=\"wp-image-6853\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/24.jpg 1731w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/24-300x163.jpg 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/24-1024x557.jpg 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/24-768x418.jpg 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/24-1536x836.jpg 1536w\" sizes=\"auto, (max-width: 1731px) 100vw, 1731px\" \/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p class=\"graf graf--p\">Sure enough, this is what we see. The time to process each batch is more or less the same. The bit we should be more interested in is the time for each epoch, since that involves the entire pipeline of CPU and GPU processing. The distribution won\u2019t be very interesting to look at since we only ran 10 epochs. 
Setting the number of histogram bins to 3, this is what we have:<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"1710\" height=\"942\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/25.jpg\" alt=\"Protobufs, TFRecords, Optimizing deep learning pipelines, full-code end-to-end tutorial, Python, Golang, protocol buffers\" class=\"wp-image-6854\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/25.jpg 1710w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/25-300x165.jpg 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/25-1024x564.jpg 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/25-768x423.jpg 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/25-1536x846.jpg 1536w\" sizes=\"auto, (max-width: 1710px) 100vw, 1710px\" \/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p class=\"graf graf--p\">Although the graph isn\u2019t rich in data, it adds more context to what we already saw on Comet. The time spent per epoch is significantly less when training with TFRecords than with the raw data on disk. The data breaks down as follows:<\/p>\n\n\n\n<p><script src=\"https:\/\/gist.github.com\/anmorgan24\/3b9b1ab364dc6beaddcbd4f3cfc59977.js\"><\/script><\/p>\n\n\n\n<p class=\"graf graf--p\">Considering that it\u2019s estimated to take 34 days to train ChatGPT with 1000 Nvidia A100 GPUs, the nearly 80s difference between the two would add up very quickly to something significant.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>A Deep Dive into TFRecords and Protobufs &nbsp; Learn how to optimize your deep learning pipelines using TFRecords and Google&#8217;s Protobufs (protocol buffers) in this end-to-end tutorial. Introduction When it comes to practicing deep learning at home vs. industry, there\u2019s a huge disconnect. 
Every course, tutorial, and YouTube video presents you with a nicely prepared [&hellip;]<\/p>\n","protected":false},"author":49,"featured_media":6979,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"customer_name":"","customer_description":"","customer_industry":"","customer_technologies":"","customer_logo":"","_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[6,7],"tags":[40,14,30,15,35,16,56,57,58,59],"coauthors":[157],"class_list":["post-6813","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-machine-learning","category-tutorials","tag-comet","tag-comet-ml","tag-deep-learning","tag-deep-learning-experiment-management","tag-image-classification","tag-ml-experiment-management","tag-optimization","tag-protobufs","tag-tensorflow","tag-tfrecords"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Optimized Deep Learning Pipelines - Comet<\/title>\n<meta name=\"description\" content=\"Learn how to optimize your deep learning pipelines using TFRecords and Google&#039;s Protobufs (protocol buffers) in this end-to-end tutorial.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/optimized-deep-learning-pipelines-with-tfrecords-and-protobufs\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Optimized Deep Learning Pipelines\" \/>\n<meta property=\"og:description\" content=\"Learn how to optimize your deep learning pipelines using TFRecords and Google&#039;s Protobufs (protocol buffers) in this end-to-end tutorial.\" \/>\n<meta property=\"og:url\" 
content=\"https:\/\/www.comet.com\/site\/blog\/optimized-deep-learning-pipelines-with-tfrecords-and-protobufs\/\" \/>\n<meta property=\"og:site_name\" content=\"Comet\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/cometdotml\" \/>\n<meta property=\"article:published_time\" content=\"2023-07-27T14:21:21+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-04-24T17:15:04+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-28-at-11.48.18-AM.png\" \/>\n\t<meta property=\"og:image:width\" content=\"300\" \/>\n\t<meta property=\"og:image:height\" content=\"304\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Kelly (Scott) Sims\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@Cometml\" \/>\n<meta name=\"twitter:site\" content=\"@Cometml\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Kelly (Scott) Sims\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"31 minutes\" \/>\n<!-- \/ Yoast SEO Premium plugin. 
-->","yoast_head_json":{"title":"Optimized Deep Learning Pipelines - Comet","description":"Learn how to optimize your deep learning pipelines using TFRecords and Google's Protobufs (protocol buffers) in this end-to-end tutorial.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.comet.com\/site\/blog\/optimized-deep-learning-pipelines-with-tfrecords-and-protobufs\/","og_locale":"en_US","og_type":"article","og_title":"Optimized Deep Learning Pipelines","og_description":"Learn how to optimize your deep learning pipelines using TFRecords and Google's Protobufs (protocol buffers) in this end-to-end tutorial.","og_url":"https:\/\/www.comet.com\/site\/blog\/optimized-deep-learning-pipelines-with-tfrecords-and-protobufs\/","og_site_name":"Comet","article_publisher":"https:\/\/www.facebook.com\/cometdotml","article_published_time":"2023-07-27T14:21:21+00:00","article_modified_time":"2025-04-24T17:15:04+00:00","og_image":[{"width":300,"height":304,"url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-28-at-11.48.18-AM.png","type":"image\/png"}],"author":"Kelly (Scott) Sims","twitter_card":"summary_large_image","twitter_creator":"@Cometml","twitter_site":"@Cometml","twitter_misc":{"Written by":"Kelly (Scott) Sims","Est. 
reading time":"31 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.comet.com\/site\/blog\/optimized-deep-learning-pipelines-with-tfrecords-and-protobufs\/#article","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/blog\/optimized-deep-learning-pipelines-with-tfrecords-and-protobufs\/"},"author":{"name":"Kelly (Scott) Sims","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/789012bde06206a2490fafac5769f667"},"headline":"Optimized Deep Learning Pipelines","datePublished":"2023-07-27T14:21:21+00:00","dateModified":"2025-04-24T17:15:04+00:00","mainEntityOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/optimized-deep-learning-pipelines-with-tfrecords-and-protobufs\/"},"wordCount":5472,"publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/optimized-deep-learning-pipelines-with-tfrecords-and-protobufs\/#primaryimage"},"thumbnailUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-28-at-11.48.18-AM.png","keywords":["Comet","Comet ML","Deep Learning","Deep Learning Experiment Management","Image Classification","ML Experiment Management","Optimization","Protobufs","TensorFlow","TFRecords"],"articleSection":["Machine Learning","Tutorials"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.comet.com\/site\/blog\/optimized-deep-learning-pipelines-with-tfrecords-and-protobufs\/","url":"https:\/\/www.comet.com\/site\/blog\/optimized-deep-learning-pipelines-with-tfrecords-and-protobufs\/","name":"Optimized Deep Learning Pipelines - 
Comet","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/optimized-deep-learning-pipelines-with-tfrecords-and-protobufs\/#primaryimage"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/optimized-deep-learning-pipelines-with-tfrecords-and-protobufs\/#primaryimage"},"thumbnailUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-28-at-11.48.18-AM.png","datePublished":"2023-07-27T14:21:21+00:00","dateModified":"2025-04-24T17:15:04+00:00","description":"Learn how to optimize your deep learning pipelines using TFRecords and Google's Protobufs (protocol buffers) in this end-to-end tutorial.","breadcrumb":{"@id":"https:\/\/www.comet.com\/site\/blog\/optimized-deep-learning-pipelines-with-tfrecords-and-protobufs\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.comet.com\/site\/blog\/optimized-deep-learning-pipelines-with-tfrecords-and-protobufs\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/blog\/optimized-deep-learning-pipelines-with-tfrecords-and-protobufs\/#primaryimage","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-28-at-11.48.18-AM.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-28-at-11.48.18-AM.png","width":300,"height":304,"caption":"An image of TensorFlow's TFRecord logo and below it, Google's Protobuf logo (protocol buffers)"},{"@type":"BreadcrumbList","@id":"https:\/\/www.comet.com\/site\/blog\/optimized-deep-learning-pipelines-with-tfrecords-and-protobufs\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.comet.com\/site\/"},{"@type":"ListItem","position":2,"name":"Optimized Deep Learning 
Pipelines"}]},{"@type":"WebSite","@id":"https:\/\/www.comet.com\/site\/#website","url":"https:\/\/www.comet.com\/site\/","name":"Comet","description":"Build Better Models Faster","publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.comet.com\/site\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.comet.com\/site\/#organization","name":"Comet ML, Inc.","alternateName":"Comet","url":"https:\/\/www.comet.com\/site\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","width":310,"height":310,"caption":"Comet ML, Inc."},"image":{"@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/cometdotml","https:\/\/x.com\/Cometml","https:\/\/www.youtube.com\/channel\/UCmN63HKvfXSCS-UwVwmK8Hw"]},{"@type":"Person","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/789012bde06206a2490fafac5769f667","name":"Kelly (Scott) Sims","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/image\/3e31e470bc031b484cdf8e1a71158950","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/cropped-Screen-Shot-2023-07-10-at-2.45.33-PM-96x96.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/cropped-Screen-Shot-2023-07-10-at-2.45.33-PM-96x96.png","caption":"Kelly (Scott) Sims"},"description":"Senior Software Engineer @Google","jobTitle":"Senior Software 
Engineer","worksFor":"Google","url":"https:\/\/www.comet.com\/site\/blog\/author\/mooseburger1msn-com\/"}]}},"jetpack_featured_media_url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-28-at-11.48.18-AM.png","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/6813","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/users\/49"}],"replies":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/comments?post=6813"}],"version-history":[{"count":1,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/6813\/revisions"}],"predecessor-version":[{"id":15596,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/6813\/revisions\/15596"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media\/6979"}],"wp:attachment":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media?parent=6813"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/categories?post=6813"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/tags?post=6813"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/coauthors?post=6813"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}