Data For Social Good: People, Processes & Technology - Episode 2
This podcast features Frank Romo, founder of Detroit-based RomoGIS Enterprises: a data, design and research collaborative aimed at promoting the public good through innovative technical solutions. Frank has a long history of being a community advocate, planner and activist for public health and safety, and social justice. As the CEO of RomoGIS, Frank provides technical solutions that empower residents to effectively impact their local communities. In his work with the University of Michigan, Frank engages in community-based research and develops geospatial applications that advance equity and social justice in cities.
This is the second episode of our miniseries on data, maps, and social movements with Frank Romo, hosted by CoLab Radio Producers Emmett McKinney and Allison Lee. This interview was recorded on Oct 23, 2020 and has been lightly edited for clarity and length. A transcription of the conversation is below.
You can also check out Episode 1 here.
Race and Policing in America 2013-2015, published by RomoGIS (2021).
Emmett McKinney: Welcome to CoLab Radio. This is a podcast of the Community Innovators Lab at the MIT Department of Urban Studies and Planning. This platform is dedicated to focusing on the first person narratives of scholars, activists and researchers doing the work of social justice.
Today is the second episode in our three part series with Frank Romo, talking about how data, maps and data visualization play a role in social justice movements. Last time, we talked about the role of mapping, protests and the Black Lives Matter movement. And today, we're going to go past the discussion of just how maps actually get presented and visualized, to consider where the underlying data comes from. Data is not found; it is generated by humans, and is managed and curated. And there's an entire infrastructure of data that enables any given map to be created. Like any infrastructure, there are power dynamics reflected in that and reinforced or reconfigured through it. So we are excited today to pick up where we left off last time. If you haven't listened to that first episode, I highly recommend it. But we are so delighted to have Frank back with us. So welcome.
Frank Romo: Thank you again for having me. It's very great to be here with you both. And I'm excited to talk today about the data curation piece.
Emmett McKinney: Sure. So one of the projects that you have worked on, focused on police violence. And police data, I think is a really good example of the topics we're thinking about, because it is an artifact of the way that police behave. If we are only tracking crime by police stop data, we're only going to see where police are actually showing up to interact with, and in many cases, harass communities. But that may not be a good representation of a just world. In fact, it's a representation of an unjust world. So could you talk about your police violence project and how you thought about those power dynamics as you tried to collect data and make sense of it?
Frank Romo: Right, absolutely. So while at the University of Michigan, where I was working on a project related to race and policing, I was also a national organizer for the Million Hoodies Movement for Justice (currently Brighter Days for Justice). And that was an organization that was founded after Trayvon Martin's death, and did work around policing, over-surveillance and racialization of communities of color. And, as an activist, that's something that's very important to me. On a personal level as well, I've had my own interactions with police that have been somewhat skewed in terms of potentially violating rights, potentially being harassed, and having certain interactions that were not favorable for me. And so, the idea behind the research was to try to give a voice to those folks who, unfortunately, when we're looking at data, end up being a point on a map and a statistic.
And it's really important to recognize that data is a representation of the world, and it is just that. It is a singular representation of the world. And it is in no way the full truth. There are people who manipulate the data, who edit the data, who share the data in certain ways. There's little decisions every step of the way when data is being transacted. When we finally get to a map and a visualization, even through that process, the mapmaker has to make a lot of small decisions on how to visualize the data. So I think it's really important, the question you're asking about - where does the data come from? Who owns it, who's generating it, and what does it actually represent?
So while I was working on this project, we took data from the 2013 to 2015 calendar years, and looked at all the people who were killed by police through those three years - '13, '14 and '15 - and we analyzed it, dropped all the points on a map, geocoded them, and then looked at what counties and cities had the highest level of fatal altercations with police.
Now to take a step back, when we looked at the data, there were lots of news outlets collecting data. The Washington Post, for instance, has a dataset called "Fatal Force". The Guardian has a dataset called "The Counted". There is another group that is affiliated with the Black Lives Matter movement that are a conglomeration of researchers who do a project called "Mapping Police Violence". And since then, there's been other datasets that have come out that try to quantify these fatal altercations. One of the first steps that we took when trying to do this analysis was, grab those three datasets from The Guardian, The Washington Post, and from "Mapping Police Violence", and compare them. That was a very simple first step that we could do. And what do you know, the first thing that we did was quantify how many fatal encounters happened across those three years, and each dataset had a different number of fatal encounters that it enumerated. And that brings a question of methodology.
For instance, "Fatal Force" done by The Washington Post was a project that only focused on people who were killed by police who were victims of firearms. And so that ruled out any other things, such as people who died in custody, people who may have had injuries due to an altercation and then maybe had internal bleeding, or other things like that. So there are real nuances to how folks collect the data and the methodology they choose, that comes to show the real outcome. And so that's really one of the first things to think about is, questioning your source. That's something we learned early on - in gaining GIS data, you have to question your source. And you have to really fact check and be able to see, is this the kind of data that I want to map?
So when we first mapped those three datasets, we found that there were discrepancies in the data. Then we drilled down and looked at why there were discrepancies in each of the datasets. We recognized that methodology was a big part of it. So not only is the source a part of how valid the data is, but it's also important to recognize the methodology, because some data points were excluded from The Washington Post dataset if they didn't include death by a firearm. So if somebody was tasered, or somebody had been beaten and had sustained injuries, that didn't get classified in the dataset. Whereas in the other two datasets, those fatal encounters did get classified. So that's just a start. That's one thing that we need to do when we're looking at datasets, is recognize a source and a methodology.
Comparing different datasets shows how data can tell different stories, depending on how it is collected, processed, and represented. Included above: Fatal Encounters, Fatal Force, and Mapping Police Violence. [Source: RomoGIS]
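Editor's note: the effect Frank describes, where an inclusion rule alone changes the total, can be sketched in a few lines of Python. This is a minimal illustration, not the method used by RomoGIS or any of the outlets named. The records, field names and the `count_by_year` helper are all hypothetical.

```python
from collections import Counter

# Hypothetical toy records; not real data from any dataset discussed here.
records = [
    {"id": 1, "year": 2013, "cause": "gunshot"},
    {"id": 2, "year": 2013, "cause": "taser"},
    {"id": 3, "year": 2014, "cause": "gunshot"},
    {"id": 4, "year": 2014, "cause": "in custody"},
    {"id": 5, "year": 2015, "cause": "gunshot"},
]

def count_by_year(rows, include):
    """Count records per year, keeping only rows the methodology includes."""
    return dict(Counter(r["year"] for r in rows if include(r)))

# One methodology counts only firearm deaths; another counts every fatal encounter.
firearm_only = count_by_year(records, lambda r: r["cause"] == "gunshot")
all_causes = count_by_year(records, lambda r: True)

print("firearm only:", firearm_only)
print("all causes:  ", all_causes)
```

Both counts come from the same underlying events. Only the inclusion rule differs, which is why comparing totals across sources starts with reading each source's methodology.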
Emmett McKinney: Thanks for laying that out. It also occurs to me that while the datasets would have different ways of counting how many fatal encounters there were with police, they also, I imagine, have different ways of counting the full toll of police violence and police presence. There are plenty of folks who are not killed by police, who are still made to feel afraid, who still have their lives disrupted, who are still not treated fairly under the law. And the data that's reflected in people who were actually killed, doesn't capture the complexity of that experience. So in data curation, there's also a process of deciding what data is important to include, and what data can be left out as a parenthetical to the main discussion. I'm curious in your police violence work, how you think about the power dynamics that filter through the curation processes that you just described.
Frank Romo: So just to pick up where I left off, we had these three datasets. These are all datasets from somewhat reputable news outlets, and they were scraping data from news outlets, smaller local news outlets, to find where there were people who were killed by police, and then they would put them in a spreadsheet. And most of these were made publicly available. And then once you get into the data, you see, from my perspective, you see some of the decisions that were made, even prior to me being able to download it. There are folks, methodologically, making different decisions. But also from, what you're talking about, the power dynamics, the source of the data comes into play because certain police departments, based on the state, based on jurisdiction, may not have to report certain cases. Those cases might be closed. They might be under investigation. When you run into some of those legal parameters, things get really murky because one dataset might include fatal police encounters, but another one might not because it's still under investigation, or that case is still open.
When we're talking about power dynamics, there's a whole system that comes into play - the legal system, the shielding that some of the jurisdictions, police unions, other stakeholders, play in making sure that some of that data isn't available. We have now the Freedom of Information Act, which allows folks to pull data by requesting freedom of information, which means that they can request information from police departments directly, or any government agency to get that data. But then even that process of using a FOIA (Freedom of Information Act) to request data from a department doesn't mean you're going to get it, and it doesn't mean you're going to get it in a timely fashion. So we might ask for data now in 2020, and might not be able to see results from that till '22. And as you can see, that really throws off the numbers when we're talking on a calendar year or trying to quantify how many people in a certain region were affected by this.
RomoGIS visualization built from the “Fatal Force” dataset by The Washington Post. [Source: RomoGIS]
But to your point as well, the fatal police encounters, unfortunately like I said, when we see folks on the map, that is unfortunately the most egregious interaction that could happen, the most final or ultimate thing that can happen. As you've mentioned, people have these altercations with police every day. There are communities that have been shown to be over-policed. And you have this hyper vigilant surveillance, that affects how people interact with the police. If I live in a community that is overly-surveilled by the police, I'm already going to have a heightened sense of wariness in terms of my interaction with them, which is going to raise my blood pressure, which is going to raise my heart rate, which is going to change how I think and how I interact. Unfortunately, that could change the outcome. And that can be the difference between life and death. And I'd be remiss not to say that I have been in those same situations and felt that happen. Your body changes, your mental state changes, and you're like, "Okay, how is this going to work out? How is this going to happen?"
RomoGIS visualization built from the “Fatal Encounters” dataset by the Fatal Encounters independent research team. [Source: RomoGIS]
It's been very well said, in the Black Lives Matter movement and by other activists, that we as community members who have been over-policed, who have been over-surveilled, have to be "trained" on how to handle those situations. Where at the same time, the folks that we are asking to protect and serve us, they also need to be trained in that way to make sure that these altercations are deescalated on both sides of the fence. That folks need to learn how to deescalate these situations, because there's no reason why we should be having this many people killed. And when we look at other countries, I think you see a really stark discrepancy. You look at other countries, not just in terms of gun violence, but in terms of altercations with police. We unfortunately are right at the top of the list when it comes to people who have been killed by the police year after year. And on a recurrent basis, I would say. I've been in this data for a few years now, and I'd say you're looking at about 1,200 to 1,300 per year on average. And so there's a lot of people who could still be alive if we had better protocols in place and were able to handle those situations better.
Emmett McKinney: You're absolutely right, that there is no reason that we ought to be having this level of violence in the United States. There's nothing just and there's nothing normal about it. One thing that I think is really powerful about data is its ability to enact social change. And the ability that it lends to activists and people who have been marginalized to tell their story in a way that is more authoritative. Not to diminish the lived experience - that is absolutely vital. But data can add an additional component to it that allows that lived experience to be placed side by side with the information that folks in power have been using for a long time. I'm curious about how you use your data, and how you present it in a way that enables that social movement to happen.
Frank Romo: I think it's really about being able to harness the data and visualize it in a way that is understandable. This is already a very difficult topic. Even me talking about it right now, there's emotions involved. This is a very difficult topic. A lot of times people think data is just neutral. Data is something that is, "Hey, let's look at the data. And it tells us what we know about the world. And let's move forward." But that's never the case, as we've already talked about. You have stakeholders who don't want that data to be released. You have other stakeholders who are demanding for that data to be released. And you have this constant conflict. When it finally gets to a spreadsheet, like I said, thousands of micro decisions have already happened to determine what actually gets on that spreadsheet and is shareable to the general public.
In our last episode, we talked about the idea of open data, and what that means for the people to have the power of that data. And open data is a really good push. And I would like to commend a lot of the news outlets who are the ones who are doing this scraping and pulling data from all different sources and using these algorithms to pull this data. And we’ve got to ask ourselves, "Why are the news stations reporting this information? And why do they have to do so many backflips to get the data?" And the answer is in plain sight. It's because there are powers that be that don't want that data released. Why don't we have, on a state-by-state basis, a system that reports this in a consistent manner? From one state to another state, there are differences in whether they have to report or not. For instance, in certain states - I don't want to get into the state-by-state analysis right now - but in certain states, officers don't necessarily have to file a report if all the bullets are still left in their gun. If there was not a bullet fired, then there may not have to be a report. There are different criteria by jurisdiction and by state, which makes the data extremely complicated.
“When it finally gets to the spreadsheet ...thousands of micro decisions have already happened to determine what actually gets on that spreadsheet and is shareable to the general public.”
And so the news organizations and researchers like myself try to step in and say, "Hey, what the heck is really going on here? How do we figure out what's actually going on?" And so one of the things that I try to do in my research is give that geographic context, because one of the things that the news media outlets do very well is enumerate and get statistics by race, by gender, whether the victim was armed or unarmed. And those are all very important statistics. But one of the things that I've found in my research is that, that doesn't tell the whole story. And one of the things I try to do is say, "Well, where did this happen? Very much as a planner - where did this happen? And what does the surrounding urban context actually look like? Are people who are dying mostly in communities that have high poverty levels? Are people who are experiencing these fatal encounters in communities that are predominantly communities of color?" And it varies from county to county, from state to state. But a lot of times you see some of these trends - that they are happening in communities where a census tract is predominantly people of color. And so when you start to look at those trends in that geographic analysis of, what does the urban environment actually look like, then we start to actually understand how planning, how urban design, how policing, they all work together to create this oppressive urban environment that makes it a lot harder for folks to deescalate the situations.
Emmett McKinney: The decision about how much context to include is really vital to what story you're telling with a particular data point, like the location of a police killing. You touched on another theme, which is data cleaning. You said that journalists have to do a lot of backflips to get the data, even once it's provided, into a format that's actually valuable. Can you talk for our listeners a bit more about the data cleaning process, and what is potentially lost when data goes from messy to clean?
Frank Romo: The data cleaning process, that is where I live. I make jokes sometimes - 90% of my job is cleaning data and cleaning addresses. That's what we do as GIS folks. For instance, analyzing three datasets - again, from The Washington Post, The Guardian, as well as "Mapping Police Violence", and now we've included another one called "Fatal Encounters" - and in just trying to combine these datasets and to see where the duplicates are, is a monstrous job because you have misspellings of names, you have different attributes. One might register as "European American", one might say "White", and one might say "Black", and one might say "African American". These datasets have all kinds of different categories, different headers in their spreadsheets, that we have to clean.
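Editor's note: the two cleaning problems named above, inconsistent category labels and misspelled names that hide duplicates, could be approached roughly as follows. This is a sketch under stated assumptions, not Frank's actual workflow. The label mapping, the field names, the 0.85 similarity threshold and the sample records are all made up for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical label mapping; the real datasets' categories may differ.
RACE_LABELS = {
    "white": "White",
    "european american": "White",
    "black": "Black",
    "african american": "Black",
}

def standardize_race(label):
    """Map a source label onto one shared category, or flag it for human review."""
    return RACE_LABELS.get(label.strip().lower(), "Needs review")

def same_person(a, b, threshold=0.85):
    """Treat two records as likely duplicates if dates match and names are similar.

    The threshold is an arbitrary assumption; a real project would tune it
    and review borderline matches by hand.
    """
    if a["date"] != b["date"]:
        return False
    ratio = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return ratio >= threshold

# Made-up records showing a misspelled name and differing labels across two sources.
rec_a = {"name": "Jon Example", "date": "2014-05-01", "race": "European American"}
rec_b = {"name": "John Example", "date": "2014-05-01", "race": "White"}

print(standardize_race(rec_a["race"]), standardize_race(rec_b["race"]))
print("likely duplicate:", same_person(rec_a, rec_b))
```

Unrecognized labels are flagged rather than silently recoded, so the human decision stays visible instead of being buried in the cleaning step.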
“A lot of times people think data is just neutral. But that’s never the case. When it finally gets to a spreadsheet, thousands of micro decisions have already happened to determine what actually gets on that spreadsheet and is shareable to the general public.”
There are some things that could get lost in that process because again, at the end of the day, there's a human on the other end making that decision. Sometimes when I'm cleaning the data, I have to ask myself, "Is this the best way to do it?" And I've done it so many times now with these datasets that I have an understanding of how the data needs to be standardized. But at the same time, even in standardizing the data, I have to make certain calls and say, "Hey, that address doesn't exist. I looked for that address on Google, I've looked for that address through multiple geocoders and it doesn't exist." So what do I do at that point now? That was still a person who had this fatal interaction. That was still family that had this experience, and it deserves to be represented. However, me as the data curator, I have to make the decision to say, "Well, if it doesn't go on the map, where does it go?" So those are the things that we have to deal with every day is trying to figure out, how do we do justice to the data, while at the same time making something that is usable and understandable by our end users.
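Editor's note: one way to avoid silently losing a record whose address cannot be found is to keep the unmatched rows in their own list instead of dropping them. The sketch below assumes a hypothetical `geocode` function that returns a coordinate pair or `None`. The lookup table stands in for a real geocoding service, and the addresses and coordinates are placeholders.

```python
# Stand-in for a real geocoding service: returns (lon, lat) or None.
# The address and coordinates are placeholders, not real locations.
KNOWN_ADDRESSES = {"100 Example St, Sample City": (-83.0, 42.3)}

def geocode(address):
    """Hypothetical geocoder backed by a small lookup table."""
    return KNOWN_ADDRESSES.get(address)

def split_by_geocode(records, geocoder):
    """Separate records that can be mapped from those that cannot.

    Nothing is discarded: unmatched records are returned so they can still
    be counted, reviewed, or reported alongside the map.
    """
    mapped, unmapped = [], []
    for rec in records:
        coords = geocoder(rec["address"])
        if coords is None:
            unmapped.append(rec)
        else:
            mapped.append({**rec, "lon": coords[0], "lat": coords[1]})
    return mapped, unmapped

records = [
    {"id": 1, "address": "100 Example St, Sample City"},
    {"id": 2, "address": "No Such Rd, Nowhere"},
]
mapped, unmapped = split_by_geocode(records, geocode)
print(len(mapped), "mapped;", len(unmapped), "kept off-map for review")
```

Reporting the off-map count next to the map is one possible answer to "if it doesn't go on the map, where does it go?", though the choice remains an editorial one.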
Emmett McKinney: Building on this idea of handling messy data that is generated by humans in a non-standard way, I've had the pleasure to work with open data streams, for example, in Boston. Recently, I've been working with Bluebikes data. And what's really nice about that is it's super standard and clean. It comes prepackaged on a silver platter. And what enables that is the fact that it's generated by computers. There was some programmer who could set up a database and define the tables that each of those trips would populate, and they would do it like clockwork. Contrasting that with data that is crowd-sourced by folks at a community event, or is turned in on post-its, or is entered manually, or entered into a database that's not set up with the same level of rigor, it creates this downstream impact where it's a lot easier to analyze data that's generated by machines than it is to analyze data that's generated by humans.
And so I'm noticing this cycle where, when we build more and more machines and our cities become smarter and smarter and smarter, it also warps our focus towards particular data streams that come prepackaged. And it makes the challenge of holding a candle to that with human-generated data and lived experience all the more challenging. So I think data cleaning is this un-sexy piece of the process, but it is the filter through which knowledge is moved. How data gets cleaned, and what our definition of "clean data" is, totally impacts the way that we understand society.
Frank Romo: You make a good point. We as cities are going more to this automation and more to this generative data that is out of algorithms, that is from machine learning and things like that. And my statement to that is, question everything. You have to question everything, because even those machines, those algorithms, were built by people.
And what we see a lot with the idea of predictive policing, or machine learning - when we're talking about artificial intelligence and facial recognition - these things were still built by people. And people do have these implicit biases. And if the folks those processes are being run on are not at the table, then there's a huge opportunity for bias.
We see that a lot in the face recognition software, where there's been studies that have identified that facial recognition that is being used by police departments, along with predictive policing, these are tools that are of the 21st century, and everybody's like, "Wow, these are amazing." Yet folks that look like me, and folks that look like the communities that are most affected by these are not at the table. And when that happens, those implicit biases get embedded in those lines of code.
It makes a huge difference to what an officer sees when they pull up to the scene, and what information they're fed prior to pulling up to the scene, and it really changes their mentality and changes their emotions in terms of how they enter a situation because they are entering difficult situations.
“One of the things that the news media outlets do very well is enumerate and get statistics by race, by gender, whether the victim was armed or unarmed. And those are all very important statistics. But one of the things I’ve found in my research is that, that doesn’t tell the whole story.”
Emmett McKinney: Just to give credit where credit is due, that facial recognition research has really been led by Joy Buolamwini and the Algorithmic Justice League. This discourse raises such an important point about, even data that is generated by machines and ingested by machines is still reflective of, and impactful to, humans. Just because nobody did it by hand doesn't mean that nobody did it. And so remembering that there is a human involved at every single step of this process is absolutely vital. I'm curious how you think about the role of dashboards, and data curation - the public facing way. So we've talked a lot about the map widget itself. But there's a whole infrastructure around it whenever you see it on an online portal, or presented in a public kiosk, that helps people know what to do with the data and that actually makes that useful. So can you talk about how it gets packaged up and handed out, and the potential of that?
Frank Romo: So we start with - I'll just take us through the whole process - we start with a dataset that's a flat dataset on a spreadsheet, or Excel document or something like that. And then we turn it into a map by assigning those addresses X and Y coordinates, and plot them on the map. Once they're on the map, they still bring some of that attribute information with them - what the victim's name was, what the victim's race is - that gets plotted on a map. Then what you see a lot of now with some of the platforms is that there are these dashboards that are being created that allow for quick enumeration by boundary. So I can say, "This is how many cases were in LA County. This is how many cases were in Cook County." And you can enumerate them rather quickly. And then on top of that, you can filter the map using bar charts, using filters, to say, "Let me see all the people who died who were of a certain race, or let me see all the people who had a fatal interaction who were in LA County." And what we do on the curation side is try to provide users a step-by-step process of, "Hey, do you want to look at the data? Why don't you click here and see how you can filter down to a specific category that you might be interested in." Whether that's by race, whether that's by geography, by whether a victim was armed or unarmed - there's all these different tools that allow users to interact with the data.
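Editor's note: the dashboard operations described here, counting cases by boundary and filtering by attribute, reduce to grouping and filtering once each point carries its attributes. The sketch below assumes the county has already been assigned to each point (in a real GIS this would come from a spatial join against boundary polygons). The records, field names and helper functions are hypothetical.

```python
from collections import Counter

# Hypothetical geocoded points; names and attributes are placeholders.
points = [
    {"name": "Person A", "race": "Latino", "armed": False, "county": "Los Angeles"},
    {"name": "Person B", "race": "White", "armed": True, "county": "Los Angeles"},
    {"name": "Person C", "race": "Black", "armed": False, "county": "Cook"},
]

def filter_points(rows, **criteria):
    """Keep rows whose attributes match every keyword criterion."""
    return [r for r in rows if all(r.get(k) == v for k, v in criteria.items())]

def count_by(rows, field):
    """Enumerate rows by the value of one attribute, e.g. county or race."""
    return dict(Counter(r[field] for r in rows))

by_county = count_by(points, "county")
la_unarmed = filter_points(points, county="Los Angeles", armed=False)

print(by_county)
print([p["name"] for p in la_unarmed])
```

Which fields are offered as filters is itself a curation decision: a reader can only ask the questions the dashboard's attributes allow.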
And I think that's great. There's a good opportunity, because the more users interact with the data, the more knowledgeable they become about the data. And what I always teach in my courses is, we have to teach map readers to think critically about the data. Where did it come from? How did it get set up? Just like with anything else that is curated - just like an art show, or a museum, anything that is curated - I, as a curator, am walking you through what I want you to see. So I'm going to steer your eyes away from the dirty data. When I know that there's a dataset, or an issue in the data, I'm gonna say, "Hey, don't look over there. Look over here. Let me show you this."
And there's inherent power dynamics in that as well, where the mapmaker has the responsibility to the public to say, "How do I show this data in the most true form possible, in the most honest form possible?" Because there are many ways to - there are books about it, how to lie with maps and things like that - there are many ways to lie with data and tell false stories. And unfortunately, if our map readers aren't critical about the data, aren't critical thinkers, that becomes very problematic, because then you have people who take this data at face value and say, "Look, look at the data, it doesn't lie, it doesn't lie." Yet, if they don't know how that data was curated to them, or know how the data was cleaned on the back end, they are flying blind and don't necessarily know what they're looking at.
“How do we do justice to the data, while at the same time making something that is usable and understandable by our end users?”
Emmett McKinney: A classic example of this dynamic is that, in the United States, a lot of states that tend to vote Republican control a lot more land. So if you visualize American political dynamics by the color of states, a party which, in the last election and the election before, lost the majority vote appears to be really, really dominant. But there are a lot of other countervailing political maps that actually scale by population, and that tell a very different story about where people's political leanings lie with respect to their geography, and how many people hold a particular view. And to your point about the impact of this, it's these types of visualizations that then are used to inform discussions about who is a "real American". So data visualization, very quickly, can feed into these broader political and racial and demographic dynamics. And so the way we visualize data is not at all neutral.
Frank Romo: Absolutely. Maps are very powerful. And you can see this all the way back to the 17th, 16th century. Maps are very powerful. There are quotes that say, "The person who has the better map is the one who's going to win the war," and things like that. Maps are extremely powerful. And if you want to look at war times, you will see that maps are used very much as a form of propaganda. And are very much used to radicalize or move the base that they want to communicate to and say, "Hey, look, what you are thinking is true. Look at this map." And so maps have a long history of being used as political tools, or propaganda. So I think you're absolutely right.
A good example is in our dataset for people who've been killed by police. When you look at the raw numbers, more white people have been killed by police than any other race. That is a fact in terms of raw numbers. Now, when you look at per capita, that changes dramatically. And what that means is because the white population is much larger in the US, that on a per person basis, per thousand basis, Black and Latino people are more likely to be killed by police or have a fatal interaction than a person who is white, because of just how the numbers play out. They are over-represented. We found that in LA County, where Latinos make up a large portion - during our analysis, again, from 2013 to 2015 - Latinos make up a large portion of LA County. So then you would assume that those fatal encounters match up with that. And that is true, that more Latinos were killed in LA County than any other race. But at the same time, as it relates to the population data, they were still over-represented. And that's what we're talking about when we look at these numbers.
When you normalize the data, looking at a per capita basis, we see that Black, African American people and Latino people, they are over-represented no matter what place you look at in the country. And it's really easy to say, "Hey, no look, more white people have been killed by police throughout your study than any other race." But that's not the only factor. So we can't just cherry pick these data points and say, "Hey, look at this map, look at this dataset, you're wrong." We need to actually go further into the data and ask those questions about, well, how much of the population does that represent, and how much of the population is that of that city? And that makes a huge difference to how we actually understand the data.
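Editor's note: the difference between raw counts and per capita rates can be shown with a short calculation. The numbers below are invented purely to illustrate the arithmetic and do not come from any dataset discussed in this episode. The group names, the `rate_per` helper and the per-100,000 scale are assumptions.

```python
def rate_per(count, population, per=100_000):
    """Normalize a raw count to a rate per `per` residents."""
    return count * per / population

# Invented figures for two hypothetical groups; not real statistics.
counts = {"Group A": 60, "Group B": 40}
populations = {"Group A": 1_000_000, "Group B": 250_000}

rates = {g: rate_per(counts[g], populations[g]) for g in counts}

for group in counts:
    print(f"{group}: {counts[group]} incidents, {rates[group]:.1f} per 100,000")
```

In this toy example Group A has the higher raw count, while Group B has the higher rate once population is taken into account. That is the same reversal Frank describes between raw numbers and per capita figures.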
Number of incidents by race, as reported in the Fatal Encounters dataset. [Source: RomoGIS]
Emmett McKinney: The technology writer Ruha Benjamin has, in her book, Race After Technology, written extensively about surveillance of communities of color. And she introduces the notion of these communities being watched, but not seen. The idea being that, just because there is data about a particular community does not mean that they're seen or respected in their fullest humanity by policymakers. So with this example in mind, I am curious about your thoughts on, is there any data that we shouldn't visualize? That just because we have it and it's clean, does not mean it should be public?
Frank Romo: Again, I think that comes to the responsibility of the mapmaker and what they are trying to actually do. So for instance, with the datasets that I've been working on, I have a very close relationship with the data and sometimes I need to take a step back and say, "Okay, what are we actually doing here?" And I talk to folks who are in the movement. I talk to folks who are doing the work on the ground and say, "How can we support the projects that you're working on? How can we support the fight for justice?" And you see a lot of times that, in order to get legislation passed, in order to make changes in communities, you do need some of that data. So you're right, maybe sometimes it is potentially counterintuitive to visualize it, if it is not done properly. But that's where, again, we have a responsibility as the curators, as the makers, to say, "Hey, we know how to visualize this best because we've taken into account all these other factors, but on top of that, and more so, we've talked to the people that it affects. We know the communities who are being affected by these instances."
And to your point earlier, these points on a map, it's really important to know that, as I said before, these are all people and these are all families. And these are all people who have been affected by this. And so when you look at that point on the map, and there's another one two and a half blocks away, and there's another one less than a quarter of a mile away, you start to think about, "Well, what does this community actually look like? What does it look like to have that many people having that kind of interaction with the police?" And it's important for us to always tie it back to the folks on the ground, the folks who are in these communities, and provide tools that aren't just for us to look at and be pretty and visualize, but it's about having a purpose behind it and having a mission to actually support those communities.
“Question everything. You have to question everything, because even those machines, those algorithms, were built by people.”
Allison Lee: It's a lot to think about. I really appreciate you talking and bringing up all of the nuances of mapmaking and what goes into it from the back end and what decisions are made. I think this is something that's overlooked a lot. When we, as the public, look at maps, we think we see the whole story, when in fact, we are only seeing a portion of the story. And I really liked that you were speaking about your role as the data cleaner. And there's a lot of decisions made as that role, and also a lot of information that is potentially lost or skewed in that role.
I'm also curious, on one hand, you were talking about engaging as many people as possible - so bringing them into the map, having them interact with the map - as well as your role as this data cleaner. And I'm curious to know the line between making information intelligible to the wider population - simplifying it potentially, in that sense, talking about dashboards and quick-glance information - while at the same time giving the information justice and putting it into context and showing the complexities of the information to give it truth. And that I think, it’s a decision that you as the mapmaker probably have to ask yourself over and over as you're creating the map. So I'm curious to know, what does that process look like for you?
Frank Romo: I appreciate you acknowledging this is a difficult conversation. It really is. When you're working with data like this, that is people's lives, as I said, there's a responsibility there. And me personally, before I release a map, I go over it a million times, and I think about all the things that could come up and go wrong. Because another thing is, you clean that data so much, and one error in your data can undermine your whole dataset. And that is huge. If something is wrong in your dataset, now your whole map has been called into question. That is really a huge, important aspect when we're dealing with this kind of topic. This topic is highly politicized. This topic gets people's emotions riled up. And so it's high stakes to some degree in the way I feel about it. It's very high stakes when we're working on data like this, because, again, these are people's lives. And if we don't do justice to the data, we are undermining all the work that has been going on to try to say, "Hey, this is wrong, we shouldn't have people dying in the streets like this. We shouldn't have these altercations."
One professor told me, "This is a very highly politicized topic," and he said, "you need to think about what somebody who completely disagrees with your research is going to think about this data, and how they're going to try to poke holes in this data." Because it is such a highly politicized topic, people are looking for errors in the data. People are looking for ways to undermine the topic because they don't want to wrestle with this difficult conversation. And so again, before I click that button that shares it and releases it with everybody, I go over it a million times. And only then I'm able to say, "Okay, I gotta let it go. And we'll see what happens." And when I get feedback, I try to incorporate it as soon as possible, update the map, and say, "Hey, sorry, that was an error on our part. We fixed it." And let folks know, because I think there's also something to be said about transparency, of communicating back to folks who either point out errors on the data side, or say, "Hey, I couldn't interact with it this way, it didn't really work for me." There's a responsibility there too because folks want to feel heard, and they want to know that what they're actually seeing is the truth. And if you just put it out there and then just leave it out there and then don't touch it ever again, and folks are saying, "Hey, this data is wrong, this data is wrong," and they don't get heard, again, that's a really easy way to undermine your map and undermine the justice that we're fighting for.
Allison Lee: And from a technical standpoint, when you're creating the map, how much information do you include, or do you feel like you should include, about the data? A lot of maps that we see, whether in print or digital, essentially, they have one note at the bottom right corner saying where they got their source from, and they just link it to another dataset. But they don't go into details about that dataset, where that comes from, what that includes, what that omits. So it's a very simplified version of data collection. And we trust it. As Emmett was talking about before, there's an authority that comes with maps, a scientific security, but we really don't know any of the details. So how much information do you include, while still keeping in mind that the information needs to be accessible to the public - you want them to engage with it over long periods of time - so I'm curious to know that.
Frank Romo: I think it's a really important topic. I'm guilty of that sometimes, just linking to it and saying, "Okay, we're done with the data. Here's where the source is". That's a whole other piece to it, and that's really important. I try to put as much information in as possible. Even prior to this conversation, I had one of my analysts ask me, "What do you want me to put in the metadata?" Every map we put out, we need to have metadata. It needs to be clean. It needs to say when it was last updated, where the data came from, and what we did to clean it. It's just like any other research project, honestly, where in order for your findings to be valid, somebody needs to be able to replicate it. And that is very important.
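As a rough illustration of the metadata described here (when the map was last updated, where the data came from, and what was done to clean it), the sketch below stores those three things in a small Python record. The class name, field names, and all values are hypothetical placeholders, not RomoGIS's actual metadata format.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class MapMetadata:
    """Hypothetical minimal metadata record shipped alongside a map."""

    title: str
    source: str
    last_updated: str  # ISO 8601 date string
    cleaning_steps: list = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the record so it can be published with the map."""
        return json.dumps(asdict(self), indent=2)


# All values below are placeholders for illustration.
meta = MapMetadata(
    title="Example map title",
    source="Name and link of the source dataset",
    last_updated="2021-01-01",
    cleaning_steps=[
        "Removed records without XY coordinates",
        "Spot-checked geocoded locations row by row",
    ],
)

print(meta.to_json())
```

Writing the cleaning steps down in the same place as the source and date is one way to give someone else what they need to replicate the map.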
One of the things we focus on very heavily is data documentation, data dictionaries, and things like that. Making sure that we know what every field means, when it was last collected, if there are any nuances in the field. For instance, in our victim name field, there are tons of John Does and Jane Does. And so what does that mean, if they are around the same location, if they are on the same day - is this the same person? You have all these other questions and then you have to say, "Okay, well, we don't really know what to do with that data so we might have to omit it." And if you omit data, it's important that you say that, "Certain records were omitted, because we couldn't validate where the actual location was. Data points without geocodes or XY coordinates were not included in this dataset." That's really important to state because, again, a general reader won't know that. And again, there are only a few people who actually go into the source and actually look at it. So again, the one thing that is really important is trying to make sure that people are more map literate, and more critical when they take in maps and visualizations.
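The two cleaning decisions mentioned here, omitting records without coordinates and questioning unnamed records that share a day and a location, can be sketched in a few lines of Python. The records, field names, and rounding tolerance below are invented for illustration; this is one possible approach, not the procedure actually used on the dataset.

```python
from collections import defaultdict

# Hypothetical records; the field names and values are assumptions.
records = [
    {"id": 1, "name": "John Doe", "date": "2015-03-01", "x": -83.05, "y": 42.33},
    {"id": 2, "name": "John Doe", "date": "2015-03-01", "x": -83.05, "y": 42.33},
    {"id": 3, "name": "Jane Doe", "date": "2015-04-12", "x": None, "y": None},
    {"id": 4, "name": "A. Person", "date": "2015-05-20", "x": -118.24, "y": 34.05},
]

PLACEHOLDER_NAMES = {"john doe", "jane doe"}


def split_by_geocode(records):
    """Separate records that have XY coordinates from those that do not."""
    kept = [r for r in records if r["x"] is not None and r["y"] is not None]
    omitted = [r for r in records if r["x"] is None or r["y"] is None]
    return kept, omitted


def possible_duplicates(records, decimals=3):
    """Group placeholder-name records by date and rounded location.

    Returns lists of record ids that share a key, so a person can review
    them. It flags records only; it does not delete anything.
    """
    groups = defaultdict(list)
    for r in records:
        if r["name"].lower() in PLACEHOLDER_NAMES:
            key = (r["date"], round(r["x"], decimals), round(r["y"], decimals))
            groups[key].append(r["id"])
    return [ids for ids in groups.values() if len(ids) > 1]


kept, omitted = split_by_geocode(records)
flagged = possible_duplicates(kept)

# A plain-language note that can be published with the map.
note = (
    f"{len(omitted)} of {len(records)} records were omitted "
    "because they had no XY coordinates."
)
print(note)
print("Possible duplicates to review:", flagged)
```

Keeping the omitted records and the generated note, rather than silently dropping rows, is what lets the published map state exactly what was left out.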
Number of incidents 2015-2020, as reported by the Fatal Encounters dataset. [Source: RomoGIS]
Emmett McKinney: That's such an important and critical reflection. In addition to the piece about who is represented in the dataset, and who had the job of cleaning it, there's another group of people - whose labor went into generating that dataset? Especially in academic contexts, it can be grad students, or volunteers, or people who were brought to a community meeting and asked to chip in. They spent their time and creative energy helping to generate that data.
In the case of services, like for example, the Bluebike dataset that I was working with, people were paying to use that service, and therefore they were paying to generate some of that data. Even with services like Amazon's Mechanical Turk, there are people out there who are being paid to fill out surveys and do small tasks.
And not to mention there is all the labor of folks who go into building the actual devices that enable us to manage that data, be that a programming language that is painstakingly developed and documented, so that the novice programmer can come in and say I need a function to do this. And they go to that documentation. There's a whole chain of people whose labor goes into making data accessible, and legible. And I think it's really important to credit them as well in this pretty little map. Because if we can just draw that quickly, that's thanks to the labor of legions of people who are both seen and not.
“When you look at the raw numbers, more white people have been killed by police than any other race. That is a fact in terms of raw numbers. Now, when you look at per capita, that changes dramatically.”
Frank Romo: One of our recent projects was jumping into the coronavirus and how it has had an effect on people in prison – in the federal prison system – and immigration facilities and in youth detention centers. This data came from UCLA’s School of Law. Again, they put out a dataset, shared to the public. When I looked at that I said, “Oh this is great,” but when I actually went further down into the dataset I was like, “Whoa, we need to clean some of this stuff, and I need to make some decisions if I’m going to represent it.” So I had to make some really specific decisions on geocodes, and double check the geocoding, double check the XY coordinates, double check when the data was last updated. That was a five-month process, and every day I was wrestling with the data.
And to your point about intellectual labor, I have so many analysts who’ve done such a great job of spot checking one row at a time to make sure that that location was at the proper location, make sure that the name of the facility was correct, all of that stuff. And all of that contributes to the final map and I think you’re right that, there’s a lot of unseen labor that goes into it that we still need to acknowledge.
Emmett McKinney: Absolutely. There’s a professor in the Department [of Urban Studies and Planning] at MIT, Karilyn Crockett, who has focused a lot of her work on archival research, and creating records of what happened in the past in a way that is accessible to planners who want to really take stock of where we’ve come from and how historical legacies shape today. As it relates to this conversation, I think data cleaning is really the mechanism that can make it hard to bring the past into the present. For anybody who has had to, for example, extract tabular data from a PDF, it is mind-numbing and annoying. Not to mention, gathering tabular data that was written by hand in some ancient text, or say, a ledger or business record from the 19th century before computers even existed. So data cleaning goes beyond just dusting off the datasets that were generated digitally within the past 50 years. It’s actually this really critical mechanism for trying to understand how slavery continues to impact us today.
One of my favorite projects – it's down at the University of Richmond, I think it’s called “Mapping Inequality” - and they take these famous redline maps that were printed in the ‘30s to redline neighborhoods and deny communities of color fair financing for their homes, and they took those maps and digitized them which is a painstaking process. But it’s that type of moving from analog into the digital that also goes into the creation of these beautiful maps. And again, a lot of unseen labor.
Frank Romo: I think it comes down to a lot of what we talked about last time – accessibility. The whole point of that is to make it more accessible. I find whenever I make a map, the first thing that somebody does is zoom in to a place that they’re familiar with, or a place that they know. That’s where the power comes from; they want to see what’s happening near them. They know the data near them sometimes, and so if you’re off by one or two, they’re gonna know and they’ll let you know about it. And that’s why it’s so important to be very precise with your data when you are digitizing it, and that goes to the cleaning process – things like that are very small but it’s the details that really make the map valid. Show me a mapmaker who doesn’t pay attention to the details in their cleaning process, and I’ll show you a map that’s not telling the truth.
Emmett McKinney: It strikes me that racism and racial identities and colonialism shape the way that data is generated in the first place. It’s often said in the United States that there was some family who arrived at Ellis Island during the migration from Europe to the early United States, and they arrived at the island and changed their name to something that was a little more anglicized. And that creates this rift in the data, this breaking point, between where people came from before and the way that they are recorded now in the United States. And so data cleaning to me is this prism through which we can understand, how we understand our own history and why that can so warp the discussions that we get into today.
Frank Romo: I think that’s a great example. Even that idea, like you said, of data cleaning – it's like we need to clean the data, we need to change it. We don’t have too many terms for it. Another term that you would use is data manipulation. Before we see the data, there are a thousand things that happened to it before then, and without knowing that information we are less powerful. And so understanding how the data was cleaned, or grabbing it ourselves and learning how to make sense of it, there is a power in that as well.
Number of incidents by race, as reported in the Mapping Police Violence dataset. [Source: RomoGIS]
I’ve seen so many times where you have community organizations that I’ve worked with and I’ve trained them how to use GIS, and they do manipulate or change the data, again not in a negative way, but in a way that tells their story more firmly and can communicate better because not all data is good data and if you are able to parse out some of that bad data, you can make your point a lot stronger sometimes. Because you can say, “Hey, look we got rid of all that noise.” Sometimes there’s data in there specifically to create more noise, and you as the curator have to be able to find your thread and say, “Here’s the story I want to tell. Here’s how we’re going to show the data. Here’s what we have to do to the data before we can show it.” At each one of those points, there is an ethical decision that you have to make about what you’re doing and how you’re trying to present the data.
Allison Lee: And that data cleaning, or data manipulation, as you said, is inevitable. Something that’s not inevitable is keeping good records of how that data was changed in that transition period. And so I do really appreciate you talking about that, and communicating to us and to the world of how important that is, to maintain that process and show that transparency and have the records about the records that you are taking from one source and recommunicating in the present.
Emmett McKinney: I’ve heard it said frequently that, “Data are people.” And here we are saying that data needs to be cleaned. So what does that say about the way that we think about people, if we’re so focused on data? Does it mean that data-driven planning is less able to engage with the full complexities of people’s lived experience? People are asked to fill out a survey. What if someone’s honest opinion is, “Forget this survey. This doesn’t see me at all.” This is a constant issue, for example, with racial designations in the U.S. census. This need to clean, this need to shoehorn this complex reality into our computers, into our digital machines, I think sometimes can relay a certain disdain, a frustration for people whose realities are more complex than we know how to deal with.
Frank Romo: Absolutely. Another term that comes to mind is data standardization. That’s another way we can talk about it. Just like we talked about power last time – whose standards are we talking about? That’s the first question we should ask. If we’re going to standardize the data, then whose standards are we talking about? And you talked about it – it is a colonial standard, it is a racial standard, there’s this hegemonic standard by which we have to put people in boxes and say, “You live in this box. You live in that box. And because of that, we’re going to treat you a certain way.”
To your point about data-driven planning, I love data. I’m in it. I love data. I think it’s great and we should use it to its fullest ability, but I would slightly disagree: I don’t think “Data are people.” I think data is a representation of what we have decided works for our systems. And to your point that we need to fit people into these boxes so that we can categorize them and quantify them, behind each person, especially when we’re working with data points of people who have been killed by police or people who have been killed by coronavirus, these are beautiful visualizations. Things that bring attention to these issues. But there’s a story behind each one of those things and that story is a lot more complex than people make of it.
They can make all kinds of different statements about whether that was a justified killing, or whether there was a bad altercation, and data has definitely helped us in the dehumanization process and in not allowing us to see people in their fullness, as you said. And it does make a really big impact on how we plan our cities and what we do to try to actually serve people who fall in between the data lines or who don’t fall in the perfect category of our dataset.
“I don’t think ‘data are people.’ I think data is a representation of what we have decided works for our systems.”
Emmett McKinney: As Nikole Hannah-Jones has reported in the 1619 Project, entire new systems of accounting human beings were created for slavery. New systems of organizing data were built with the explicit purpose of dehumanizing. And that’s something that I think about often, like the very tools we use were designed not for a neutral purpose. I think they can put some blinders on us. I say this as somebody who also loves a good map. I love ripping through a spreadsheet and making it beautiful. But we have to be really critical about where that comes from and I think every so often say, “You know what, data’s not the right tool for this. I think it’s fun, but I don’t think this belongs here.” And so maybe what we’re looking for is some self-awareness.
Frank Romo: The one thing to take away is that data is not neutral, and it is very biased in its own right and it has its own power dynamics. In our next episode, we’ll talk about that. That’s where machine learning comes in, and we say, “Hey, we have a tool for that too. We can fix that too.” But again, as you pointed out Emmett, those are imperfect tools on an imperfect dataset, and if we’re using those to create policy and affect people’s lives, that could become problematic.
Emmett McKinney: Well Frank, this has been a super rich and engaging conversation. I’m really looking forward to our next segment on data analysis and analytics. Machine learning is this thing that has been unleashed on these troves of data that, as we talked about, are imperfect and are not neutral. To think about what it means when we’re handing over the reins to a machine that is built by imperfect humans and has no conscience of its own, what does it mean about how we understand these imperfect datasets that we have? Thanks for these reflections, and we’re really excited. So this has been an episode of CoLab Radio. Thank you all for listening and we hope that this conversation sparks some thoughts and perhaps some discomfort for you. I think it similarly leaves us with plenty to think about, and we are very excited to keep talking about it next time. So with that, we’ll sign off. You can find more at colab.mit.edu/colabradio.
For more of Frank’s work, or to get in touch:
About the Interviewers: Emmett McKinney is a transportation planner at the nexus of tech, equity, and decarbonization. He holds a Master in City Planning from MIT, where his research focused on the equity implications of emergent mobility technologies. He has worked in urban design and environmental policy — but these days, he manages mobility data for Superpedestrian, a mobility technology company. Find him on Twitter at @EmmettMcKinney and GitHub at @ezmckinn. Allison Lee is a Producer for CoLab Radio and a Masters student in the MIT Dept of Urban Studies and Planning. She is interested in balancing conservation and development, and places community and culture at the heart of her work.