This article was initially available on our podcast. Click here to listen.
We recently did a podcast or article about malignant transformation. This is part two, talking about non-destructive change.
Check the names
First, go back really quickly to what malignant transformation is. A destructive transformation means when you’re doing a cleanup of data, when you’re doing calculations, when you’re joining fields, V-lookups, or whatever it might be, somehow you change the original data.
Let’s say, for example, you have names that are like the last comma first, and you change it to first space last. Or you have names in two separate columns, and you concatenate them (in an Excel sense) and smush those together to make the full name, and then you delete the two original columns of first and last names. That’s a destructive transformation.
All about non-destructive transformation
Why is it essential to have a non-destructive transformation, meaning that you don’t destroy the original data? Remember, we talked earlier about why you end up doing that. It’s because otherwise, you end up with crazy amounts of calculation. It takes up way too much space. In addition, you have hundreds of thousands or millions of records.
Further, your Excel can’t handle all of that. It starts to slow down or crash. So you end up effectively cutting corners and destroying some of the underlying information.
The reason why destroying information and having destructive transformation is such a problem is. First, you can’t do quality control, and you can’t quickly figure out what kind of mistakes you’ve made doing calculations because often, they’ve disappeared.
You can’t dig into them and figure out what you did wrong. Some of these things are simple calculation kinds of things. But when you’re doing transformation stuff, when you’re doing V-lookups or other types of things, it becomes challenging to figure out what’s going on. QA is just about half of the battle when it comes to data wrangling.
Determine the filters
The other issue is, you can’t identify what you’ve filtered out and what records you’ve deleted potentially incorrectly. Often, we end up doing compound filters, like I want females between this age who have this insurance. Then, when you have those multiple things, especially negatives of those words, not something, meaning not these criteria, it gets very complicated from sort of a Venn diagram standpoint, and often something gets screwed up. Then, figuring out what you did and what you got rid of becomes very, very hard.
Things end up disappearing. Worse, you don’t even know what’s gone because you can’t QA, and you can’t know what’s not somewhere.
Let’s say later, for example, if you’ve done some destructive transformation like we gave in the last podcast. We want to look at the year 2021. If you filtered out all of the records for the other seven years to make the dataset usable, now if you want to look at 2020 or compare 2020 to 2021, you can’t do that because the data is gone. You’ve deleted it. It’s gone. Even if you want to look at different CPT codes, you can’t look at this information. That’s been deleted.
Sometimes later, we even change specifications for what we’re doing in our analysis, like how we calculate something. That may be something fairly complex, or it may be something quite simple. But this happens all the time. It’s not uncommon that we realize after a few weeks or even a few months that we need to calculate something completely different. Why? Because we didn’t see some collateral issue popping up, or we didn’t consider, “Ah, if we do this, it means this, so it hadn’t considered this.”
Right now, we’re arguing because we’re redoing some datasets. We’re arguing about outer joins, which make a massive difference in what you keep or lose. We are arguing about this because there are pluses and minuses even when you have effectively unlimited data space as we do. Real analytical issues come up, not just size or slowing down the system and things like that.
Another issue that can come up is that you lose a lot of information. That means you frequently can’t do some other type of analysis.
Take the example of denials analysis
Let’s say we are just trying to do a denials analysis. You may have only pulled denials information, or maybe you pulled flat files that included charges and payments which included denials. Then you got rid of something to make it a usable dataset like, “Well, we’re only looking at denials.” Then, six months later, you say, “Ah, well, we don’t want to look at denials now. We want to look at payments.”
Maybe, that’s something as complex as, “Ah, we want to see which denials were overturned and got paid.” Perhaps, it’s even tied to the original analysis, or maybe it’s something completely new like, “We want to see just what got paid.” If that’s the case and you want to do further analyses, different types of things, you have to start all over with a completely new study. That’s a ton of work to redo. I mean, a ton of work to redo.
A bigger problem than even just the fact that it’s a ton of work is, it’s improbable that you documented every little thing that you did along the way to modify that data. So you’re not going to remember everything that you did. A lot of stuff will be different, or even there are going to be mistaken, or something’s going to change. It’s going to create all kinds of other problems.
Ensure there’s enough storage
Another problem that pops up is, if you do this a bunch of times, meaning you do a bunch of analysis and then you go, “Oh, okay, well, we made it usable, so it’s a smaller dataset, but now we’ve got to start over.” You do that repeatedly and repeatedly. Your computer starts to fill up pretty quickly with datasets, sort of modified datasets.
It’s not a storage space problem. It can be because when everybody’s now got a terabyte or terabytes or minimum like a half-terabyte, 25 meg files isn’t going to make a big difference, which is a pretty decent-sized file in Excel.
The more considerable promise, you’re not going to remember which file is which, what analysis you did in which one.
Even if you’re going to remember what analysis, you do not remember which modified datasets include what you need. You’re going to start working on one and do some new research and then realize, “Oh, whoops! I don’t have all the data that I need. Crap!” And you’ve got to start over all over again.
These are the reasons why destructive transformation is a problem. We need to make sure we’re doing the non-destructive transformation. We’ll talk in another podcast about how to make those suggestions.