Leveraging MLOps to Speed Up Historical Document Processing
Stanley Fujimoto, a senior data scientist at Ancestry, discusses how his team leveraged MLOps and collaboration to quickly process and extract information from 6.6 million historical census images. He explains their multi-model pipeline using deep learning for document layout extraction and handwriting recognition. Key themes include the importance of infrastructure, transparency, and calibrated confidence scores for business decision making. The webinar also covers initial survey results on MLOps adoption and challenges.
A transcript is available below if you prefer to read through the interview.
Hi, everyone. Thank you so much for joining us today for the webinar. We’re super, super excited to have you here, and to talk about our 2021 Machine Learning Practitioner Survey and of course, spend some time with Stanley here from Ancestry. Before we kick it off, just to give you quick background about myself and Stanley.
Stanley Fujimoto is a senior data scientist at Ancestry, the largest for-profit genealogy company in the world, which operates a network of historical records and a genealogy website. He has used traditional computer vision, NLP and deep learning approaches for text classification, clustering, fact extraction, and handwriting recognition.
I’m Gideon Mendels, I’m the CEO and co-founder of Comet. I started my career as a software engineer about 16 years ago and shifted to working on machine learning about nine years ago, both in academia and in startups. Then in my last role before Comet, I was working on hate speech detection in YouTube comments at Google, so a lot of language modeling and NLP.
Unfortunately, that was the pre-transformer days, but Stanley, I’m sure we’re going to talk more about that today. But before we kick it off, Stanley, and hear more about what you’re working on, I want to spend just a few minutes to talk about the survey that I mentioned before. A few months ago, Comet ran a survey. We spoke to about 500 practitioners.
These are machine learning practitioners, data scientists who build models on a weekly and daily basis, from various industries working in various use cases, different types of model, different types of problems. We’ve definitely learned a lot of really, really interesting things and we’d like to share some of these results with you today.
I’ll only cover a small portion of it, but if you’re interested in digging in, the full survey results are available in the webinar. If you look on the sidebar, there should be a handout tab and you’ll be able to have full access to it. Let’s kick it off. One of the things we’re seeing, and I think for a lot of practitioners that’s not a huge surprise, is there’s too much friction and unfortunately not enough ML.
One of the things we’ve heard is almost 60% of the data scientists and practitioners out there use manual tools to track machine learning experiments. By manual, we’re typically talking about notes and spreadsheets, which is better than trying to remember everything you did. But unfortunately, as a lot of you probably know, there’s everything that goes into building a model: hyperparameters, configuration, datasets, orchestration.
It’s very, very hard to keep track of everything manually, let alone try to reproduce it later. The second part, which is closely related, is almost 70% of the people surveyed shared that they abandoned about 40% to 80% of their experiments in the past year. These are experiments that could potentially have gone somewhere and worked, but because of challenges with infrastructure, challenges with tracking, potential bugs in the process, they had to throw them away.
Of course, not only does that lead to wasted time, but in many cases also wasted GPU hours. Additionally, as we keep hearing, and obviously I think the industry has improved on this side of things significantly, there’s a huge lag between when a model’s finished training and the data science and machine learning team feels it’s ready to go to production, to when it actually happens.
43% responded that it takes about three months, and 47% said it takes longer, so four to six months. A lot of people talk about how painful that handoff is when you have your model binary and maybe some basic Python code that calls the predict function. How hard it is to get it to the team or the person that has to put it behind an API and actually in production, and everything that goes with that.
The third point, and the last one I’ll cover from the survey today, is we are seeing a lot of companies out there understand that this is something worth investing in. As you can see, 63% of teams reported they’re planning on increasing their machine learning budget. This is excluding headcount budget. This is just for platform and tooling.
But yet today if you look at the budgets, about 88% of them said that their full budget for platform and tooling is less than $75,000 a year. There’s another result in the handout that’s missing here: most of these organizations have between four to six teams, and the average team is about five people. If you spread $75,000 across that many people, it actually shows that there’s a very meaningful opportunity to make an impact here.
And allow you and your team to focus on what you’re good at: build models, focus on the science, try different things. And not be bogged down by infrastructure issues, tracking issues, deployment issues, all these things that are super critical but, from my experience as a data scientist, a necessary evil. There’s very good news here. Companies like Netflix and Uber have pioneered solutions to some of these challenges with really excellent results.
Some of them like Netflix made some of their tooling open source. Uber has a platform that’s called Michelangelo, which isn’t open source, but they are using Comet on top of it to manage all the experimentations. On top of the actual tools, a lot of the lessons that they’ve learned is documented in their engineering blogs and conference sessions. With the growing ecosystem of tools, it makes it much, much easier for machine learning practitioners to build machine learning models better and faster.
Back in the day, I don’t know, Stanley, if you were training deep learning models at the time. When I started, I was using Theano, where implementing even slightly non-trivial logic was really, really painful. These days, PyTorch, TensorFlow and some of the high-level libraries like Keras and PyTorch Lightning make the tasks we dealt with back then trivial. We’re very excited to see the ecosystem move forward and continue to improve on that.
That said, one of the things that’s important with the growing ecosystems and a lot of teams out there are trying to figure out, “Okay, what should we buy? What should we build? What open-source solution should we use? How do we even go about this?” There’s definitely a lot of noise in the MLOps ecosystem. We hear from a lot of customers and prospects, that they sometimes find it hard to understand who’s doing what and what’s important.
We collected a few of the lessons that we’ve heard from our customers that worked for them and just want to cover those points, some of which might not seem apparent at first look, but are absolutely critical to making this successful. The first thing is around integration and customizability. The MLOps stack is going to continue to grow and improve. It could be that in the future you want to introduce a new tool.
It could be that your workflow is going to change, and you’re really looking for tools that are highly customizable to your workflow. If today you’re working with PyTorch and a specific type of data, that could look very, very different. You want to have the flexibility and this future-proof feature set that allows you to continue to iterate, to update things on your side, without finding that the product, the tool, the platform that you bought is holding you back.
The second thing here is scalability. I think a lot of people are aware of some of the challenges in serving at scale, but it’s not just on the serving side. Even things like tracking and orchestration are really, really hard to do at scale. If you’re testing a product with a couple of workflows, maybe a notebook, it’s really hard to assess that. You’re looking for something that really can scale. It’s interesting here, scale doesn’t mean you have to have hundreds of data scientists working on it.
It’s obviously use case dependent, but we’ve seen some of our customers where there are small teams that are really, really pushing the envelope and training hundreds of models. Some of the solutions that I’ve used before Comet, unfortunately couldn’t meet that scale. The last thing when you’re thinking about this, is how does this fit your overall MLOps strategy? Or how do you plan on solving the complete machine learning lifecycle?
We’re huge believers that you want to have the flexibility to bring in different tools, connect them and build a stack that works well for you. That’s why customizability is super important, but it’s also important to think about the entire workflow. Not just solve a specific problem today, because when you get to deploy, and when you get to monitor, and when you get your retraining, you might find that whatever you put in place in the first place doesn’t cut it.
Thinking about these three key points is super, super critical in order to be successful with bringing in an MLOps tool, whether open source or commercial vendor, it doesn’t matter. These things come into play in both cases. All right. Thank you so much for listening to me and we want to switch gears now. We have an amazing guest, Stanley here, so definitely want to spend some time with you. I have some questions here just high level to set the discussion.
But before we jump in, Stanley, maybe if you can tell us a little bit about yourself and maybe one of the recent machine learning projects you’ve been working on.
Yeah, thanks for having me. I’ve been at Ancestry since 2019. I guess I interned and then I started full-time the year following, so I’ve been at Ancestry for a little bit. It’s been a really great experience. I’ve seen a lot of things where the data science org is new at Ancestry, so it started about five years ago. There have been a lot of different phases that we’ve been in, and a lot of things that we’ve experienced that have been interesting experiences.
One of the really cool projects that we’ve been working on recently, so the US 1950 census was released earlier this month. Ancestry, as a genealogical company, they’re very interested in the US census. I don’t know if any of you participated in the census, I think it was two years ago or something like that, or I guess it was postponed because of the pandemic.
But they gather all this information about all of the people that live in the United States, and so you can find your ancestors in it. I found my grandparents in it, and it’s this very weird feeling of seeing or feeling this connection to people in the past. The National Archives, they’re the ones that digitize all of the census forms. They’re just these really big images of scanned, machine printed forms that a person would fill out with a pen.
Then the problem is that none of those images are searchable, so we needed a way to actually index them. We used a bunch of different methods, some traditional computer vision methods, but mostly relying on deep learning techniques too. We used deep correspondence matching as a way to identify where on the forms all of the information was. Then we used a bunch of different architectures in our handwriting recognition.
So using LSTMs, transformer encoders and a full encoder-decoder transformer architecture to be able to actually read the handwriting. This was a really big and risky project for us, because there’s competition in the family history space to be able to publish these records as fast as possible. The situation that we had was: data science, you all have one shot to process all this stuff. You won’t be able to see any of the data beforehand.
You’re going to have to build all your models based off of previous censuses, and you won’t know exactly what it’s going to look like. But you have one shot to process it and if it doesn’t work out, you don’t get another chance. We’re just going to send it to vendors to transcribe them by hand.
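The pipeline Stanley describes here, layout extraction followed by per-field handwriting recognition, can be sketched roughly as follows. All function names, field names, and routing rules are hypothetical illustrations, not Ancestry’s actual code:

```python
# Hypothetical sketch of a multi-model census pipeline: locate fields on
# the form via exemplar matching, then route each crop to a field-specific
# recognizer. Field names and routing rules are illustrative only.

def locate_fields(image, exemplar):
    # Stand-in for deep correspondence matching against an exemplar form:
    # returns {field_name: crop} for each region of the layout.
    return {"name": image["name_region"], "income": image["income_region"]}

def recognize(field_name, crop):
    # Route important free-text fields to a heavier encoder-decoder model,
    # and simpler fields to cheaper models (as described in the interview).
    model = "encoder_decoder" if field_name == "name" else "lstm"
    return {"model": model, "text": crop, "confidence": 0.9}

def process_form(image, exemplar):
    return {name: recognize(name, crop)
            for name, crop in locate_fields(image, exemplar).items()}
```

The point of the sketch is the routing: one layout model feeds many recognition models, each chosen by how much accuracy that field is worth computationally.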
This was a project… What was that?
Yeah. No pressure, I said.
Yeah. It was funny because the data got released midnight April 1st in Washington, DC time. We were all in the office Thursday night waiting for all the data to release, and we were doing hyperparameter tuning as we were seeing the data come in. They release the census data every 10 years. I think every 10 years is about the right cadence to have that kind of stress in your life. But it was a really fun project.
We developed a lot of really cool models that had really high performance. We’re seeing some really great feedback on it. Yeah. Comet played a huge role in us being able to be successful. Like I said, it was a very risky project for us, and so I think that there’s been a bit of a paradigm shift within our organization as to the types of things that we need to focus on.
For myself, coming from an academic background, and a lot of people that I’ve worked with as well, we’re used to reading papers and reading about new architectures on archive. Seeing all these papers that come out so fast, you can’t even read them all. I’m getting really excited about that stuff. I guess to this first question, what are some of the challenges we face?
I was actually talking to a friend in my neighborhood, so he works as a data scientist somewhere, and I asked him if he felt a lot of pressure to keep up on all the latest methodologies and stuff. What he said is, “No, we have enough math. What we’re lacking is execution.” I thought that was very interesting. It really is the execution where in a paper, your results can be really cool.
Some of the bad results you can still show and people will be forgiving of it because they’re like, “Man, this is such a cool architecture.” But when we show our business partners bad results or bad results get shown to the customers, they don’t think about how innovative the model was. They just say they’re putting money on the line, they’re paying for a service, they don’t care how cool the model is. They want to see good results.
A lot of those good results come from infrastructure, from being able to collaborate well with your partners. From really being able to dig into your work and your outputs, and have some level of service that you can guarantee. That’s not something often focused on in academic papers, and it’s not something you find in some random GitHub repo that has research-grade code in it.
Yeah. There’s definitely a huge gap from that perspective between academia and industry. I think there’s a lot of things feeding each other, which is great, but super excited. I want to dive in, I think you mentioned a few things. But for people listening who are practitioners, and we talked a little bit yesterday about the model, the problem you’ve been working on, so want to dig in a little bit deeper into that.
You mentioned the census data is essentially images of scanned documents, but I think you said it was microfilm, which I don’t know how many data scientists could say they work the microfilm data, but that’s super exciting. If I understand correctly, the task is you’re getting this form and you’re trying to understand, essentially extract all the content from it, but in a structured way.
What I’m curious about, you mentioned you didn’t know what the form would look like beforehand. Did you know the structure in any way, while you were planning and getting ready for that dataset to drop? How did you go about preparing before that? What were the things you could do earlier that hopefully paid off when you went into training?
The model we use, so we just use an off-the-shelf model. We really were hesitant to fine-tune any model to extract information from the pages, like fine-tune on the 1940 census or some previous forms, because we really just had no idea what the 1950 census was going to look like. We actually built this really cool architecture to where we could interact with our partner teams, and they could provide inputs to the model as well.
Usually in the past when we’ve built models, we train some deep learning model and it’s this big, black box. Our partners assume that we’ve baked in all of the domain knowledge that’s necessary. Usually, we don’t understand family history and genealogical stuff as well as they do. When I hear family history, it’s not something that inspires a lot of excitement in me.
But being at Ancestry, I’ve seen a lot of really cool stuff. The types of problems that we’re facing, I’m amazed at the cutting-edge stuff that we’re doing. There’s just so many problems, so many computer vision problems that we’re facing. Specifically with the 1950 census, we wanted it so that our partners with domain knowledge could feed in information to the model, so that it could do better at production time.
Using a deep correspondence framework versus an object detection or semantic segmentation approach, we didn’t have to fine-tune a model. We just needed good exemplar images of the actual census forms, and then some additional meta information, like specific landmarks that are telling of a specific form type. In the 1950 census, there are slight variations in certain census forms. The variations are important because they keep a bunch of supplemental information at the bottom of the page.
The way that you link a person from the bottom of the page to the top of the page, is by doing this fine-grained recognition of different layouts of the forms, which otherwise looked very, very similar. Again, it was something that we don’t know a lot and it’s like, “I don’t know much about the 1950 census.” But our partners, they’re very familiar with these old historical documents.
By allowing them to as well identify and say like, “Hey, this is an important landmark within this form or feature, and it’s telling of the specific form type we’re using.” You can use this to differentiate them. It was cool too, where we used this approach that was very collaborative with our partners. Then again, Comet is this really big component for us because of the collaboration, especially where we’re working with non-data science groups.
They can give us a config file with marked landmarks and we can process data through our models, and we can log the images to Comet, and we can just look at them through a web interface. I don’t do UX stuff. I feel like making a command line interface is something that’s very beautiful, but not everyone feels that way. When we print out a confusion matrix in the terminal and show that in the slide presentation, our boss doesn’t like that very much.
Then whenever I use Matplotlib, I feel like I don’t know what’s happening. Anyways, people told me I should learn ggplot, but I just don’t feel like I should learn R at this point in my life. Having a user interface like Comet, it looks great. Our partners when they look at it, it’s always very impressive. It’s easier to take screenshots off of that and throw that into a slide deck.
That’s something that happens way more frequently than I thought would happen in my life: needing to prepare slide decks to show our results and how things are actually working to other people. Hundreds of hours of our time, I think, were taken up just making visualizations. Dumping images to disk, rsyncing or SCPing them over to our local machine, trying to find the text file with the corresponding ground truth and then what the model predicted. Then overlaying them and spacing everything out in the slide.
That sounds painful. Yeah.
Yeah. We were saved by automating all of that and visualizing it within these infrastructure tools.
Yeah. Curious, jumping a little bit ahead, I know we talked a little bit about it, but essentially you use the deep correspondence method to detect which field is which on the form, and then you use OCR to extract the actual text from it. Then one of the things you were talking about when we spoke is how to debug this, how to look at the model results.
It’s more straightforward if you’re doing something like object detection or segmentation, but in your case, you have multiple models involved. I think you mentioned in some cases there were almost six models here, and you’re trying to compare and trying to understand. I think the more interesting point is you’re also collaborating with people and partners who are not necessarily data scientists.
You started talking about this, but I’m curious, how did Comet help to do something like this that has solved this problem?
Yeah. It was a huge collaboration, this 1950s project. The data science group, we had our own set of engineers that were deploying our models. Then we had another group of engineers from several other teams that we were integrating with. That Thursday night when we were all processing everything, everyone was heads down and everyone was focusing on making sure their component of the pipeline was working correctly.
A lot of times, we’ll package up a model and then we’ll send it off to somewhere. It’s deployed and it’s processing stuff and we don’t know what’s going on with it. We hope it’s doing well, but that’s our level of involvement. But there was so much risk involved with this project, that we put a lot of emphasis into our monitoring tooling. When everything is running through inference, we didn’t have ground truth.
So we couldn’t check to see how our model’s performing in an automated way, but we could log a bunch of stuff. We were logging confidence scores, we were logging visualizations of the predictions that we were having. All of that stuff was being dumped out to Comet. It was really nice because well, we didn’t have to be doing some manual process of pulling our processing logs from S3 and doing all that stuff.
All that stuff got automated, and being able to visualize it in graphs was super helpful. One of the experiences that was really nice, because we didn’t have any ground truth, it’s hard to know what exactly is going to go wrong or why it’s going to be weird. But when you’re watching the plots and you have a feeling of how your models work, you’ll start to notice things.
It was actually by monitoring the confidence scores we were plotting that we noticed certain fields within the census weren’t performing that well, because they looked quite a bit different from what we were expecting. We didn’t think that the enumerators for the census were going to fill them out that way. Logging those confidence scores, we saw that for these fields, the confidence scores were much lower than for other fields.
We were able to inspect the results, anecdotally look at some of the images that we were visualizing, and then make modifications to our model. We were having these daily stand-ups very early in the morning. The next morning, we show up to this meeting and we say, “We noticed it dropped in performance and we have a fix already for it, so we can deploy this model if you give us the go ahead.”
That is a much better position to be in, than a week later when another team finally has time to go look at the results and say, “Hey, something was wrong. Do you know what’s happening?” It’s much better being able to present the solution because we were monitoring it, versus being told something is wrong after the fact and then not really having a good grasp on what was going on in the situation.
That was one of the biggest things that Comet helped us with during that time. It was really nice, the reporting feature. Our partners could just go in and they could poke around and see what the outputs looked like, and if they had any input, they could give it to us. I think that was one of the other things is the collaboration. We found that it can be helpful even for collaboration with people outside of the data science group.
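The per-field check that surfaced those underperforming fields can be sketched as a simple aggregate over logged confidence scores. This is a hypothetical illustration; the team read this off Comet dashboards visually, and the margin threshold here is made up:

```python
# Illustrative sketch: aggregate logged confidence scores per field and
# flag any field whose mean confidence falls well below the overall mean,
# as a prompt for manual inspection. The margin is an arbitrary example.
from statistics import mean

def flag_low_confidence_fields(scores_by_field, margin=0.2):
    means = {f: mean(s) for f, s in scores_by_field.items()}
    overall = mean(means.values())
    return sorted(f for f, m in means.items() if m < overall - margin)
```

A check like this is what lets a team walk into the next morning’s stand-up with "we noticed the drop and already have a fix," rather than hearing about it a week later.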
Awesome. No, thank you. Thank you so much for sharing. It’s interesting, I guess, from a machine learning perspective, that even though you’re using deep correspondence (I don’t know if that’s the right term), you had some ground truth of at least the locations of fields on these forms. Obviously, they look different in production, at prediction time, but the model still gave you a low confidence score. I’m curious why that happened.
Yeah. Specifically for the correspondence matching and the layout stuff, so one of the big focuses too that we’ve been having, is I think the term is calibrated confidence score. We want the confidence score to be meaningful, and so oftentimes you’ll just see like, “Well, this was the max logit of whatever,” and so we do a greedy selection of what the prediction should be. This one was the highest probability or something like that.
We’ve been focusing a lot on doing a calibrated confidence score and asking ourselves, “What does the confidence score actually mean?” One of the things that came out of some of the retrospective that we’ve had, is that we should stop using the term confidence score, and there should always be some descriptor as to what the confidence score actually is. We say confidence score, and then we ask our partners, we say, “Well, we predicted 90% confidence on this.”
Some of them interpret it as, well, out of 100 characters, maybe 10 of them are wrong. Or, out of a prediction of some transcription, we’re 90% sure that the whole thing is right and that it wouldn’t need to be fixed at all. We’ve been thinking a lot about what confidence score actually means. The confidence score for us is really important, because we have multiple steps in our pipeline for processing the data. We don’t want to just dump whatever the machine comes out with to our users.
We want the machine to be smart enough to say, “This is what I think it is and I’m not very sure about it and so you should have a human check it.” We use these calibrated confidence scores to then filter what should be funneled straight to our website or straight to vendors to check and see if it is actually correct or not. Specifically with the layout model, that was one of the things that we found that we needed to improve on.
Because we didn’t fine-tune it for the 1950s data, the confidence scores weren’t as trustworthy or weren’t as useful. We’ve found in this experience that these calibrated confidence scores, they should be more meaningful beyond just choosing what your prediction is going to be. It should be indicative to your partners and allow them to make business decisions based off of it.
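The webinar doesn’t specify which calibration method Ancestry used. Temperature scaling is one common technique for making raw softmax confidences more meaningful, sketched here purely as an illustration:

```python
# Illustrative sketch of temperature scaling, one common calibration
# technique (the interview does not specify Ancestry's exact method).
# A temperature T > 1 softens overconfident softmax outputs so the
# reported confidence better tracks how often the model is right.
import math

def softmax(logits, temperature=1.0):
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 1.0, 0.5]          # hypothetical per-class scores
raw = max(softmax(logits))         # raw, typically overconfident
calibrated = max(softmax(logits, temperature=2.5))  # softened
```

The temperature is normally fit on a held-out labeled set; the prediction (the argmax) never changes, only the probability attached to it, which is exactly what matters when that number drives a human-review-or-publish business decision.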
Yeah, yeah. No, that’s super impressive. I think being able to find that metric that explains how well the model is doing is hard in every task, but especially in this one where there are so many different models involved. It’s not just straight classification or something that’s fairly easy to explain, or at least where we have metrics that, with a little bit of explanation, people understand, like precision and recall.
These kinds of things are fairly understandable. I’m curious, you mentioned you had these two main tasks around deep correspondence and then the OCR component. I think a lot of people consider OCR a solved problem, but I’m curious, I know you mentioned you tried a few things. You actually use different models for the same tasks, so I’m curious if you can share more about that.
Yeah. Processing speed was really important to us, and so we wanted to have great results and we wanted it as fast as possible for a number of reasons.
To get it out to our users, but then the faster we could generate results, the faster we could start verifying that the results are good or bad.
The census, I can’t remember, there had to be more than 50 different fields. The different fields [inaudible].
Sorry, how big is the dataset? I’m curious.
The dataset that we processed ended up being about 6.6 million images. I think there were about 171 million individuals that we extracted from it. For each of those individuals, I can’t remember, maybe 50 or more than 80 different fields. Capturing things as diverse as how much money they made in the last month. Really weird things that you wouldn’t necessarily think to capture, but it was important at the time, I guess.
Obviously, there are some fields that are more important. The name of the person is really important, the age of the person. Having those two fields work really well, allows you to find people pretty easily. Knowing the income of a person isn’t that important for finding them, but it’s interesting to have. We found that we could use simpler models on some of these simpler fields. Some of them were just check boxes.
That was essentially just a classification model like is it checked, is it not checked? But some of the more complex models like the name field, we ended up having to do a full encoder-decoder transformer model to be able to read it. That model architecture is obviously more expensive computationally, but that field is so important to us that we’re willing to bite the cost of computation.
Additionally, building infrastructure so that we could do easy post-processing on a lot of these. Trying to bake in knowledge, there’s this huge instructions document that they gave to the enumerators who walked around and filled out this information. Trying to become familiar with that, and then using that information as a way to post-process the outputs.
If we knew a field was only supposed to be numeric, we limit the outputs there. There were a lot of different architectures that we tried. For people familiar with the handwriting recognition space, there was a paper that came out a little while ago called OrigamiNet, which was very influential for us too. That was a pure CNN method for doing multi-line reading, which was really interesting.
OCR, the connotation of that is that it’s applied to machine-printed text, and then there’s online handwriting recognition. If you can capture stroke and pixel position, and timing and stuff like that, those are fairly well-solved problems. Historical documents are quite difficult, because they can often be quite degraded and the handwriting can be… My handwriting is not as good as theirs was back then.
We found that historical documents, it really is important to have customized models, especially when high accuracy is really important to you. For personal use, you could probably get away with OCR and if there’s some mistakes in it, it’s usually you can forgive it. But again, if someone’s paying you for it, they want something really high quality. We’ve found that building these specialized approaches to historical documents, is what gets us there.
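The numeric-only constraint Stanley mentioned a moment earlier can be sketched as masking the output vocabulary at decode time. The character set and scores below are hypothetical stand-ins:

```python
# Sketch of the numeric-only post-processing constraint: mask out
# non-digit classes before taking the argmax at each decode step.
# The character set and score values here are hypothetical examples.
CHARSET = list("0123456789abcdefghijklmnopqrstuvwxyz")
DIGITS = set("0123456789")

def constrained_argmax(scores, allowed=DIGITS):
    # scores: one value per character in CHARSET for a single time step.
    # Even if a letter scores highest overall, only allowed characters
    # (here, digits for a numeric field) can be emitted.
    best = max((s, c) for s, c in zip(scores, CHARSET) if c in allowed)
    return best[1]
```

This is how domain knowledge from the enumerator instructions (e.g. "this column is always a dollar amount") can be baked into decoding without retraining the model.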
Awesome. Yeah. Definitely I want to follow up on that, but before I do, for those of you in the audience, if you have any questions for Stanley or myself or in general, feel free to drop them in the question section, it’s right next to the chat on the right. We’ll keep going and then we’ll have at the end, a dedicated session to talk about some of these questions. But if there’s something relevant for the discussion now, I’m happy to pick it up.
I think, Stanley, you mentioned something really interesting, which typically affects people maybe in production environments where inference time could matter. In your case, it sounds like it was more of an offline inference job where you’re trying to index things. You had multiple models where you have this trade-off between inference time versus accuracy or some form of quality.
You mentioned on the high end for these fields that are super important, you use encoder-decoder methods. What was the range here? For things that are less important, maybe someone’s salary last month, what other models have you used for this?
Yeah. The other handwriting recognition model architectures that we’re using?
Yeah. The classic approach that a lot of OCR engines and a lot of handwriting recognition models use is an LSTM or some variation of an RNN. That was one of the architectures that we used. We also found that we could use just a normal transformer encoder approach to doing the predictions. The full encoder-decoder one is helpful, especially for fields where the writing starts to fit multiple lines of text within a field.
Oftentimes, occupations will take up multiple lines within a single cell. LSTM models and the encoder-only transformer models usually have difficulty with multi-line text, because they essentially take each pixel column of the image as a time step and make a prediction based off of that. If you have multiple lines of text, you would need to be able to predict multiple characters per column, or per time step.
The full encoder-decoder paradigm gets away from that, so it can learn what features it should be extracting at different time points. Then, theoretically, it should be learning to read through one line, go back, and then read the next line. We found that that model was really important for us. Then there’s the name field; again, it was complex, and names can be quite unique and quite long.
We found that, again, having the full transformer encoder-decoder, having the memory that can access the full image, seemed to be important for extracting those names.
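As an aside, the pixel-column limitation described above can be sketched in a few lines. With CTC-style decoding (the usual pairing for LSTM and encoder-only models), each image column yields exactly one prediction, and repeats and blanks are collapsed afterward. That works for a single line of text but cannot emit two characters for one column when lines are stacked. A toy illustration, with made-up column predictions:

```python
# Greedy CTC-style collapse: each pixel column of a text-line image
# yields ONE predicted symbol; repeats are merged and blanks dropped.
def ctc_collapse(per_column_preds, blank="-"):
    out, prev = [], None
    for p in per_column_preds:
        if p != prev and p != blank:
            out.append(p)
        prev = p
    return "".join(out)

# Column-wise predictions sweeping left-to-right across ONE line:
print(ctc_collapse(list("ff-aa--rr--mm-eee--r")))  # reads "farmer"

# With two stacked lines (say "farm" above "hand"), a single column
# overlaps BOTH lines at once. One prediction per column cannot emit
# two characters, which is why an encoder-decoder that attends over
# the whole image handles multi-line fields better.
```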
Awesome. No, that’s super impressive. Curious, I know you mentioned you guys were on a tight timeline before this gets passed off to human annotators.
How long did it take you from the moment you got the dataset till you had, I guess, the first version of it indexed?
We were able to process it in about nine days, so that was a big difference. When the 1940 census was released 10 years ago, which was a smaller dataset, it took Ancestry, I think, about nine months using vendors to hand-key it.
Yeah. This was a big investment and a big area that we think that we can really serve users and create a cool experience for them.
That’s amazing. Everything you shared with us, obviously you’ve spent a lot of time in the data, the different fields, what they look like, and how they impact the overall quality.
But everything you were talking about, you guys did in nine days. That’s like competitive machine learning knowledge.
Yeah, it was a long nine days.
Yeah. It was a very stressful, long nine days, and I was very glad when it was over, but it was a ton of fun. It was a very exciting experience where, like I said, the night the data was released, we were pulling samples of it and running them through our development environment to try to do hyperparameter tuning.
We were all sitting in a conference room at 3:00 AM, and people from the partner team would walk by and ask how everything was going, and we would just tell them it was going okay. We weren’t sure if we were going to get a good set of hyperparameters, but luckily we were able to. Yeah, it was a really exciting experience.
Awesome. Just talking to you now, there’s so much scope in what you guys built, and the fact that you did all of it in nine days. Obviously, there was some prep work, but the vast majority of the research, the model building, and getting the results happened in nine days, which is super impressive. I know the answer, but just to give a sense for those in the audience, how big is the machine learning team that you’re working with?
Our direct team, like the handwriting recognition group, there are four of us or five of us. Someone just came on this week actually, so there are five of us. Within our data science group, I think that there are about 10 or 15 of us. Then there’s the engineering group that we work with.
I think that’s another five or eight people. Our direct team is the four, now five of us, and then we’ve worked pretty closely with three of the engineers. That was the group that did most of the data science work for this particular project.
Impressive. Yeah. This sounds like work for a team of 30 or 40 people, so super impressive. Maybe to switch gears, we covered a little bit of some of the technical aspects. But obviously in order to be successful in something like this, it’s not just about the technical aspect or figuring out the right model, or optimizing it, it’s about the process.
I think there’s a lot of teams out there that take months, if not years, to build something that works. Obviously, every dataset, every problem is different, but I’m curious if you can share a little bit about what your team’s process looked like, and maybe how it changed based on what you’ve seen in this marathon, not marathon, but sprint on the census data?
It’s related to this first question: what are some of the biggest challenges that we faced? As our data science group has grown and matured, projects were often tackled by a single person. They would own it, they would write all the code for it, and the code wasn’t necessarily shared with anybody else, so they owned that project. One of the big changes for this 1950 census project is that there were multiple people, four of us, working on it.
If it’s just you working on a project, it can seem easy to track things, but you probably think you’re doing better than you actually are. You’re probably not tracking everything as well as you think, and you’re probably forgetting things. But we found, especially with multiple people working on the same project, that collaboration, transparency, and reproducibility were super big things.
The term we started using was doing good science: making sure the results I’m generating are comparable to what the other person is generating. If we’re both trying to improve the model, can we actually compare the results that we’re getting? What are the other people doing? Constantly having to ask someone what experiments they’re running, what hyperparameters they’re using, what model architectures they’re experimenting with, is a really slow and cumbersome process.
In Comet, we had a common workspace that we were logging all of our experiments to, so we could see what the results were looking like for somebody else. We could peek at their hyperparameters and see what they were doing to try to get it to work better. There was just way more collaboration, and additionally a standardization in how we report our results, with all of us logging things the same way.
Some people dump everything to a text file, bring it into an Excel spreadsheet or a Google Sheet or something like that, and then make charts. That’s not scalable when you have really large and high-risk projects. You want something that’s automated, something that’s standardized. This infrastructure piece is so key where, like I said earlier, a lot of times the focus is the actual model architecture and innovations there.
But having a cool model architecture doesn’t make the model easier to maintain, doesn’t make it easier to know what you need to tweak to get it to work well, and doesn’t help you report to your partners how it’s doing. The amount of standardization that we’ve been able to do, just because we have this infrastructure that Comet provides, has been really great for us. There are other parts of it that we haven’t utilized yet but are really excited for.
You mentioned this in the survey results, but how do we version certain model weights that were trained on certain datasets? Maybe we’ve augmented the dataset with some new labeled data, but now we want to retain the old model. We found that keeping the model weights in an S3 bucket, where you can overwrite them and not know where the old weights went, is not a good way to version your models.
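One lightweight alternative to overwriting a single S3 key, sketched here purely as an illustration (the bucket layout, names, and manifest fields are all made up, and a model registry like Comet's does this for you), is to derive the storage key from the weight bytes themselves so every version keeps a distinct, reproducible address:

```python
import hashlib
import json
import time

def versioned_key(model_name, weights_bytes, metadata):
    """Instead of overwriting a fixed path like models/model.pt,
    derive a unique key from a hash of the weights, and keep the
    training metadata (dataset version, etc.) in a manifest."""
    digest = hashlib.sha256(weights_bytes).hexdigest()[:12]
    key = f"models/{model_name}/{digest}/weights.pt"
    manifest = {
        "key": key,
        "sha256_prefix": digest,
        "created": metadata.get("created", time.strftime("%Y-%m-%d")),
        "dataset_version": metadata["dataset_version"],
    }
    return key, json.dumps(manifest)

key, manifest = versioned_key(
    "htr-encoder-decoder",          # hypothetical model name
    b"\x00fake-weight-bytes",       # stand-in for real weight bytes
    {"dataset_version": "census1950-v2"},
)
print(key)
```

Old weights are never lost, and the manifest ties each version back to the dataset it was trained on.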
We’re really excited for that. Another thing that we’re really excited for is the artifact store, being able to version our data in the first place. We’re amassing so much labeled data that it’s becoming difficult to track all of it. I was talking to someone the other day, and they said they found a dataset sitting in an S3 bucket; they weren’t sure where it came from, but they started using it.
It happened to be the same dataset as the test set, so you can imagine the results. Yeah.
Hopefully that wasn’t the case.
Yeah, I think that happens all the time. S3 is a great object store, but if you don’t know the version of the dataset, you don’t know which code, which pipeline, which experiment generated it, you don’t know which model consumes it. If you don’t have all that information, it’s really, really hard and these problems have been going on for years.
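One small piece of that lineage, sketched here as an illustration rather than anyone's actual pipeline (the record format is made up), is to fingerprint dataset records by content so a mystery dataset can be checked against the test set regardless of its file name or bucket path:

```python
import hashlib

def fingerprints(records):
    """Hash each record so datasets can be compared by content,
    without trusting file names or S3 paths."""
    return {hashlib.sha256(r.encode()).hexdigest() for r in records}

# Hypothetical records: "line number|name|age"
test_set = ["1|John Smith|42", "2|Mary Jones|38"]
mystery_s3_dataset = ["2|Mary Jones|38", "3|A. Brown|61"]

overlap = fingerprints(test_set) & fingerprints(mystery_s3_dataset)
print(f"{len(overlap)} record(s) overlap with the test set")
```

A nonzero overlap is an immediate red flag that the found dataset may leak test data into training.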
When I was at Google working, as I mentioned, on hate speech detection in YouTube comments, one of the things when I picked up this project was that my task was to try to beat the production model. The first thing you ask is, “Okay, what’s running in production?” To my surprise, people shared slides and Excel sheets with me, so it was super hard to reproduce.
But then the second question is, what data was it trained on, and where’s the data? At Google it’s not S3, they have something called SSTables, but they shared these and I was like, “Okay, there are different subsets here, different features. What are we using? What’s working? What’s not working? Do I know I’m not mixing up test and train here?” Super, super painful.
We’ll hear from some of our customers that once you have that lineage, “Okay, this is the dataset it was trained on, this is the code that generated it, this is where it’s stored, these are the multiple versions of it, this is the experiment that consumed it, and this is the model that came out of that experiment,” it just gives you a lot more confidence in your work and in doing good science.
Yeah. Yeah, emphasizing the science part of data science is something that we found is important.
Cool, Stanley. Thank you so much by the way for answering. I know we might have some questions from the audience that I’d like to cover.
But before we move to the next section, one question I like to ask every data scientist I speak to, is with unlimited budget of GPUs, unlimited data, unlimited everything, which problem would you work on or what model would you build?
Yeah. The reason I love computers is the idea of automating the mundane stuff in our lives and freeing people to do creative stuff. I think that’s one of the reasons I really like this automated historical document processing space. It’s just something that people shouldn’t have to do. We shouldn’t have to sit here and read through documents to try to extract that information.
Something I would love to do, is I think we’ve all seen those reinforcement learning like playing Atari or AlphaGo and stuff like that, and it’s just like, “Man, that stuff is so cool.” I would love to be able to build an AI that is looking at documents and being able to automatically extract the information, tell you what information is on it. Try to link it to other documents that might exist somewhere else that it’s seen before.
Or somehow be able to link it to pictures of the person, where it knows certain things about them because of, I don’t know, contextual clues within the historical document. A lot of that logic, well, all of that logic, we have to bake in manually and think about explicitly, but I’d love for it to be able to reason about documents.
Then get it so that people aren’t sitting there scrolling through a huge pile of images just trying to find a relative’s name, but are able to construct a life story based off of all this information that was extracted automatically and fed to them.
Awesome. No, I think if you’ve done this in nine days, I don’t know if you need an unlimited budget. I’d probably get it done in two weeks in this case, but totally see the value of it. Thanks for answering. Switching gears, if you’re in the audience and you have questions for Stanley, now’s definitely the time to drop them in the question section. I’ll kick it off with some of the questions.
We have a question from, I hope I’m pronouncing the name right, Ade or Ida, I’m sorry if I’m mispronouncing it. The question is for Stanley. Stanley, you mentioned the ability to get pretty speedy feedback from your end users about model performance. Could you say more about how this was engineered? What was considered feedback in this instance and what was your process for incorporating this into your model experimentation, hyperparameter tuning, et cetera?
Yeah. We’re integrating into a legacy system where the original pipeline was all built around humans doing these things. They would have one vendor transcribe, and then another vendor check to make sure the transcriptions were correct. There was this large network, this large pipeline, of manual QA that was built out already.
They have contractors that they have longstanding relationships with, pages-long documents that indicate how they’re supposed to evaluate certain things, and they do trainings with them. That is a method they use to provide feedback to us quickly. When the data hit Thursday night and we were processing stuff, as soon as it went through the data science pipeline, it was being moved off to these QA pipelines.
They have tooling to be able to visualize and see how well things are actually performing. This is something that they do in normal projects as well, but they just scaled it up quite a bit to be able to process this volume of data quickly. We were getting periodic results from our partner teams saying how the vendors thought it was performing. I remember we were about a day into processing, and we found that some of the results were pretty good.
But there were certain census form types that we were failing on a lot, and that came back via that vendor network. A lot of this infrastructure was built around having good communication and having daily standup meetings with our partner teams so we weren’t siloed from them. Building those relationships so you’re not siloed can be really difficult, especially in your day-to-day work.
I think one of the things about Comet specifically that we’ve really liked is that it allows for an amount of transparency. We’ve also experimented with using, I think, the Gradio panels, so that our users can actually submit inputs. Not that they would ever distrust us or think we were fudging our results, but they can input things and see what the outputs look like. That just adds another level of transparency that they can look at.
That is one of the big pushes we’re making: trying to, I hesitate to use the word commoditize or commodify, but democratize the models that we’re using. They’re not sitting behind this data science firewall that people can’t access. If we have a new collection coming in and we’re not sure how it’s going to perform, someone can just send it through the pipeline and see what the results look like themselves.
If they notice something, they can give us feedback. A lot of times it’s been the case that the model itself is a black box, and partner teams don’t get to have input on how it actually works. That’s a paradigm we’re trying to change, with the infrastructure and tooling we’re using, but also with the types of model architectures we’re using. Like the deep correspondence approach, where the partner teams chose what the exemplar images were supposed to be, not us.
Because they felt like those were the best things, the best landmarks to indicate differences. I have no idea if I’m answering the question. I feel like [inaudible].
100%, the first part, 100%. Maybe let’s dive in a little bit in the second part. I think you talked about how do you get this feedback from the partner’s team?
What was your process for incorporating it back into the modeling project or trying different things, the hyperparameters, models, architectures, dataset changes? What did that look like?
Yeah, so we would get the feedback. Specifically with the 1950 census, it was a day in when we got that feedback that we were having poor performance in certain areas. All of us data scientists were just experimenting with different hyperparameters; we had standardized our code, so other people, not just the one person who developed it, could use it.
We had also built it so that we could log to Comet if our partner teams wanted to see that information. But that night it was a big scramble; I don’t know if it’s exactly representative of what the process would normally be. In some of the other projects that we’ve had, the process for updating our models usually comes down to an evaluation of how data science best uses its time.
Maybe it’s not worth us updating the model, because we can fix it easily in some post-processing step, or we can get the vendors to fix something really easily. Having them rekey someone’s name is really expensive, but having them fix the line number, which is just a sequence number of where a person was on the page, is really easy and cheap for them to do. Being able to identify what is actually causing a problem, and make business decisions that way.
Evaluating the feasibility and the cost associated with building a solution. With the models that we’ve built, we’ve tried to remove every single magic number that exists in our code. It’s all sitting in config files, and we’ve tried to modularize everything. Modification of the models is quite cheap now, and experimenting with new hyperparameters is easy. We’re also standardizing how we do our QA so that we can do it in an automated fashion.
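The config-file approach can be sketched minimally (the parameter names and values here are illustrative, not Ancestry's actual configuration): every tunable lives in one file, so experiments only edit the config, never the code.

```python
import json

# All tunables in one place rather than scattered "magic numbers".
# In practice this would live in its own file, e.g. config.json.
CONFIG_JSON = """
{
  "model": {"hidden_size": 512, "num_layers": 6},
  "training": {"learning_rate": 3e-4, "batch_size": 32}
}
"""

config = json.loads(CONFIG_JSON)

def build_model(cfg):
    # Stand-in for real model construction: every number comes
    # from the config, so nothing is hard-coded in this function.
    return (f"transformer(h={cfg['model']['hidden_size']}, "
            f"layers={cfg['model']['num_layers']})")

print(build_model(config))
print("lr =", config["training"]["learning_rate"])
```

Hyperparameter sweeps then amount to generating config variants, with each one logged alongside its experiment.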
If you ask us the same question in a year, it’s probably going to be a totally different process; data science at Ancestry is evolving very rapidly. I started just in 2019, and things are totally different now. It’s really cool to see the changes that have come, and it especially highlights the need for infrastructure, which is not something I thought much about at school.
Yeah. You shouldn’t have to think about it in school, but for a lot of people who go from academia to industry, it’s a shock how hard these things are, and in some ways how little impact the specific architecture has. Really, you can make huge, huge changes by investing in infrastructure and in things like post-processing, which might not be as sexy as coming up with a new architecture, but it works really, really well.
I think we have time for one last question before we have to wrap up. It’s in a sense similar to the last one, but let me know if you want to add anything to that. This is a question from Annette: how did you do evaluation, given that labels weren’t given? I think you touched on it earlier, but we’d love to hear your response.
Yeah. Our partner teams have a standard process for sampling from datasets and sending the samples through to their vendors to manually check. It’s a process that we’re actually trying to get away from, because having a human in the loop can cause errors, especially when we’re using vendors for whom English might not be a first language.
Even when we have an English-specific dataset that we’re working with, I have a hard time reading cursive, and I grew up here, so that can be difficult. But that was the way we were evaluating things. There were specific criteria we could use to see how well we did on the stuff we call layout detection: identifying where the form is in the image and making sure that we’re extracting each of the columns correctly.
There are specific types of errors that we had identified, and a person would check for those. Then we have the vendors, again, look at specific fields and see if a particular transcription is correct or not; it’s an all-or-nothing check. We found that we were actually getting really, really great results. We also found that we don’t actually need perfect results, because search will compensate for a lot of the missed transcriptions.
We’re actually experimenting with something right now where we try to use… I think this is another struggle that we’ve had, where each model lives in a vacuum and is unaware of what the other models are doing. We’re experimenting with using the handwriting recognition to verify that we actually extracted the layout correctly. If we are extracting the data on the form in the correct places, we should know where certain things are.
For the rows of the census form that we extract, the first one should start with the number one within the line number field. We can use the handwriting recognition to read that, and if it doesn’t match up, then we know we’ve found an error. Again, it’s about automating mundane tasks and getting humans to do the more difficult, creative tasks, so can we use our models to fix that stuff?
It’s definitely an area that we’re exploring and trying to develop, but specifically for the 1950 census, it was a lot of manual checking that luckily didn’t come out of our budget.
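The cross-model consistency check described above can be sketched simply (field names and values here are illustrative): read the line-number column with the handwriting model and flag any row whose value doesn't match its expected sequence position.

```python
def check_line_numbers(transcribed_line_numbers):
    """Sanity-check layout extraction using the handwriting model's
    reading of the 'line number' column: rows should read 1, 2, 3, ...
    and any mismatch flags a likely layout (or transcription) error."""
    errors = []
    for expected, value in enumerate(transcribed_line_numbers, start=1):
        if value != str(expected):
            errors.append((expected, value))
    return errors

# Suppose row 3's cell was mis-cropped and read as "5": that row
# gets flagged for review, with no human scanning required.
print(check_line_numbers(["1", "2", "5", "4"]))
```

One model's output becomes an automated QA signal for the other, rather than each model living in a vacuum.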
Awesome. Yeah. I think one of the things, among a lot of others, that’s most impressive to me, and maybe speaking about academia versus industry, is how deep you went into the data, not just focusing on architectures and models, transformers and metrics. You obviously understand the dataset extremely, extremely well, which it sounds like helps you come up with different ideas to build better models and overall improve the task. Yeah. Unfortunately, we have to wrap up. I feel like we could keep going for another hour easily, but Stanley, it was an absolute pleasure to have you here today. Thank you so much for joining us and sharing some of your learnings and success with our audience. Really, really appreciate it. Thanks everyone in the audience for joining us.
Like I mentioned, if you want to see the survey results, they’re available in the handout section. If you’re watching this async on YouTube or through another platform, you can always go to our website, Comet.ml, and you should have access to the survey there. Thank you so much for joining us today, and I hope everyone has a great rest of the week.