Common data questions/issues

I have already posted my reasons for why this data should not be used in research here. For more extensive documentation on the data's issues, please see this paper by Maltz and Targonski from 2002.

Each year of UCR/NIBRS data has different agencies reporting. The overall trend is that more agencies report over time, though this is not always true (e.g. Florida stopped reporting most data in the 1990s). So if you have data that covers more than one year, make sure you are only analyzing the same agencies each year. Even within the same year some agencies don't report for 12 months, meaning they are not comparable to agencies that do. This likely means removing agencies that don't report every month of every year of your study period.
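As a rough illustration of that filtering step, here is a minimal Python sketch. It assumes a hypothetical list of (agency identifier, year, months reported) records; the record layout and the agency names are made up for the example and are not the actual column names in the data.

```python
from collections import defaultdict

def agencies_reporting_every_year(records, years, required_months=12):
    """Return the agencies that reported `required_months` in every year in `years`.

    `records` is an iterable of (ori, year, months_reported) tuples
    (a hypothetical layout used only for this sketch).
    """
    full_years = defaultdict(set)
    for ori, year, months_reported in records:
        if months_reported >= required_months:
            full_years[ori].add(year)
    wanted = set(years)
    # Keep an agency only if its fully reported years cover the whole study period.
    return {ori for ori, yrs in full_years.items() if wanted <= yrs}

# Made-up example records
records = [
    ("AGENCY_A", 2015, 12), ("AGENCY_A", 2016, 12),
    ("AGENCY_B", 2015, 12), ("AGENCY_B", 2016, 8),   # partial year in 2016
    ("AGENCY_C", 2016, 12),                          # no 2015 data at all
]
keep = agencies_reporting_every_year(records, years=[2015, 2016])
print(keep)  # {'AGENCY_A'}
```

Only the agency with 12 months in both years survives, so year-to-year comparisons are made on the same set of agencies.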

Negative numbers can happen in UCR data and are correct, not a data issue. UCR data is reported by police monthly and if they discover that a previous month was incorrect they don't alter the incorrect month to remove that reported crime, they report a negative crime in the current month so the annual crime count will be correct. For example, consider an agency that reports a burglary in January and then in June discovers that the burglary didn't actually occur. In June they will report -1 burglaries. In practice it's rare to see negative numbers since the number of actual crimes in a month often outweighs the number of corrections, but it is possible and is not an error. Converting negative numbers to missing values is not correct. See more on pages 82-83 of the FBI's Manual for UCR data.
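The burglary example can be written out as a short Python sketch. The monthly counts are made up to match the example (one burglary in January, a -1 correction in June, nothing else).

```python
# Hypothetical monthly burglary counts, January through December.
monthly_burglaries = [1, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0]

# Keeping the negative value gives the correct annual count.
annual_total = sum(monthly_burglaries)

# Dropping negatives (treating them as missing) leaves the unfounded
# January burglary in the annual count.
without_negatives = sum(count for count in monthly_burglaries if count >= 0)

print(annual_total)        # 0
print(without_negatives)   # 1
```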

 

Like all data, the data on this site and available to download on openICPSR has flaws. This is especially true of the FBI's UCR and NIBRS data available here. The main problems with the data are that it only counts reported crime and that not all agencies report (and those that do may not report all year). However, these are only the technical problems with the data - the bigger issue is in how people use it. Each of these issues can be solved - or at least avoided in conducting research - and the nuances of each dataset can be handled. However, many published articles that I've read neither address these issues nor even acknowledge that they exist. There is excellent documentation on these datasets in academic articles and in the FBI's manual for the data. If you intend to use this data you should read these documents (please don't ask me for links to them beyond what I already provide on this page) carefully and spend time exploring the data yourself. Simply running a regression on the data without fully understanding the data is bad research.

If you believe that you found an error in one of the datasets, please read the FBI's manual on that data to ensure it's not a known issue or is not an error at all. Common incorrect beliefs that something is erroneous are negative numbers in UCR data or Florida not appearing for many years in UCR data. 

I release the data as .dta (Stata) files, which have a 32-character limit for column names. Therefore I sometimes have to abbreviate column names to meet this limit. Most misspellings you see in the column names are intentional to meet this requirement.

FIPS codes are US Census unique identifiers (within state) for geographic areas including state, county, and "place" (i.e. city). They are useful when merging with other datasets such as the US Census or other government datasets. These identifiers are not in the UCR data originally so I added them by merging the UCR data to the Law Enforcement Agency Identifiers Crosswalk (LEAIC) that NACJD produced (for more info on this dataset please see the Learning Guide on it that I made for NACJD here). I merged the LEAIC and UCR data by matching on the ORI (unique agency identifier code) variable.

Crime (NIBRS)

The National Incident-Based Reporting System (NIBRS) data is a collection of data compiled by the FBI from reporting police agencies that provides detailed information about each crime that agency knows about. It's a fairly complex and detailed dataset that contains more information than is available on this site. For more information, please see my book on the data here.

I only include agencies that reported 12 months of data in a given year. If an agency reports fewer than 12 months for that year I exclude that year, but keep any other year for that agency in which they reported 12 months of data.

Alcohol

This data is from the Apparent Per Capita Alcohol Consumption: National, State, and Regional Trends 1977-2016 report produced by Sarah P. Haughwout and Dr. Megan E. Slater at the National Institute on Alcohol Abuse and Alcoholism. The Data page of this site has a link to download the data. For the original report, click here.

For a complete methodology please see the actual report here. The authors determined the amount of alcohol consumed (originally measured in gallons of ethanol - pure alcohol) through sales data or tax information for each state-year. Population data for how many people age 14 and up live in each state was acquired from the CDC WONDER data. Ethanol was divided by population to determine per capita consumption.

The original report provides an equation to convert the amount of total ethanol consumed into a "total drinks" variable so I used that to make that variable. For the individual drink categories (beer, shots of liquor, and glasses of wine), it provides an equation to convert the amount of ethanol consumed into the amount of alcohol. I then converted this to number of drinks for each category based on the National Institutes of Health's page saying how many ounces of alcohol make up a drink for those categories. Therefore, the sum of the categories is slightly different than the "total drinks" column.

Arrest

This data is from the FBI's Arrests by Age, Sex, and Race data which is part of the Uniform Crime Reporting (UCR) Program. This data provides the number of arrests that occurred in a city in any given year and breaks that down by age (adult or juvenile), race, and gender.

I try to keep data on this site consistent with the data I have published on openICPSR and I call it 'cannabis' there. I do so only because the programming language Stata limits column names to 32 characters and 'cannabis' is a shorter word than 'marijuana'.

This data does not show arrests broken down by if the arrestee is Hispanic or not.

Border Patrol

See here.

Crime (UCR)

This data is from the FBI's Offenses Known and Clearances by Arrest data which is part of the Uniform Crime Reporting (UCR) Program. This data provides the number of crimes that occurred in an agency in any given year and how many of those crimes were cleared.

Starting in 2013, rape has a new, broader definition in the UCR to include oral and anal penetration (by a body part or object) and allow men to be victims. The previous definition included only forcible intercourse against a woman. As this revised definition is broader than the original one, more rapes are reported (social changes may also be partly responsible as they could encourage victims to report more). This definitional change makes post-2013 rape data non-comparable to pre-2013 data.

The FBI does have a variable that says how many months an agency reported each year. But this variable actually just measures the last month reported. So if the last month an agency reported was December then it would say that it reported 12 months; if the last month was August it'd report 8 months. Even if the only month reported was December (August) it'd still say that there were 12 (8) months reported. I think this is a bad method - though it doesn't make too much of a difference since when an agency reports at all they usually do so for 12 months - but I have a different method for this site. I define a month as reported as long as there is at least a single crime in that month. The tradeoff to this method is that while it does prevent incorrectly keeping data that isn't really 12 months reported, it also incorrectly drops data that is. For example, a very small agency may truly have no crimes in a given month and my method incorrectly drops it.
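The difference between the two definitions can be sketched in Python. This is only an illustration under stated assumptions: the function names are hypothetical, and `None` is used here to mark a month with no report purely so the example can show the contrast (the actual data records such months as zero).

```python
def last_month_reported(monthly_counts):
    """FBI-style measure: the number of the last month that has any report."""
    last = 0
    for month_number, count in enumerate(monthly_counts, start=1):
        if count is not None:
            last = month_number
    return last

def months_with_any_crime(monthly_counts):
    """Measure described above: count the months with at least one crime."""
    return sum(1 for count in monthly_counts if count is not None and count > 0)

# Agency that only reported in December (made-up counts).
only_december = [None] * 11 + [5]
# Very small agency that reported all year but truly had no crime in February.
small_agency = [1, 0, 2, 1, 1, 3, 1, 2, 1, 1, 2, 1]

print(last_month_reported(only_december), months_with_any_crime(only_december))  # 12 1
print(last_month_reported(small_agency), months_with_any_crime(small_agency))    # 12 11
```

The first agency shows what the last-month measure gets wrong; the second shows the tradeoff, since a true zero month makes a fully reporting agency look like it reported 11 months.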

This data doesn't differentiate between a "real zero" and a "not reported zero". If an agency doesn't report any crimes (even if crimes did occur), the data will say zero crimes occurred. Even though the data indicates how many months of the year that agency reported, that doesn't necessarily mean that they reported fully. An agency that reports all 12 months of the year may still report only incomplete data. Agencies can report partial data each month and still be considered to have reported that month. Chicago, for example, reports every month but until the last few years didn't report any rapes.

No, this data only includes the most serious crime in an incident (except for motor vehicle theft which is always included). For incidents where more than one crime happens (for example, a robbery and a murder), only the more serious (murder in this case) will be counted. This is called the Hierarchy Rule. See more on pages 10-12 of the FBI's Manual for UCR data which details the Hierarchy Rule.
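A minimal Python sketch of the rule as described here. The seriousness ordering is an assumption for illustration (it follows the order in which index crimes are usually listed), and the function name is hypothetical; the FBI's manual is the authority on the exact rule and its exceptions.

```python
# Assumed ranking for this sketch, most serious first.
HIERARCHY = ["murder", "rape", "robbery", "aggravated assault", "burglary", "theft"]
# Offenses counted regardless of what else happened in the incident.
ALWAYS_COUNTED = {"motor vehicle theft"}

def offenses_counted(incident_offenses):
    """Return the offenses from one incident that would be counted."""
    ranked = [offense for offense in incident_offenses if offense in HIERARCHY]
    counted = []
    if ranked:
        # Only the most serious ranked offense is kept.
        counted.append(min(ranked, key=HIERARCHY.index))
    counted.extend(o for o in incident_offenses if o in ALWAYS_COUNTED)
    return counted

print(offenses_counted(["robbery", "murder"]))                # ['murder']
print(offenses_counted(["burglary", "motor vehicle theft"]))  # ['burglary', 'motor vehicle theft']
```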

Though the Hierarchy Rule does mean this data is an undercount, data from other sources indicate it isn't much of an undercount. The FBI's other data set, the National Incident-Based Reporting System (NIBRS), contains every crime that occurs in an incident (i.e. it doesn't use the Hierarchy Rule). Using this we can measure how many crimes the Hierarchy Rule excludes (most major cities do not report to NIBRS so what we find in NIBRS may not apply to them). In over 90% of incidents, only one crime is committed. Additionally, when people talk about "crime" they usually mean murder which, while incomplete to discuss crime, means the UCR data here is accurate on that measure.

A major limitation (in my opinion the most important limitation) to the data here is that it doesn't include crimes not reported to police. Based on victimization surveys that ask people both if they were victimized and if they reported that crime, we know that the majority of crimes are not reported. This probably won't matter when looking at a single city for a short period of time - the population won't change too much so even underreporting of crime will be consistent underreporting. The issue becomes serious when looking at a city with major population changes or comparing multiple cities as their populations may have very different reporting practices. There's no easy solution here but it is an important aspect of understanding crime data that you should keep in mind. For a full breakdown of reporting rates broken down by crime and a number of characteristics about the crime and victim (and reasons for not reporting), see Tables 91-105 (pages 98-114) in this report on the National Crime Victimization Survey from 2008.

Using the rate helps deal with population changes that could lead to changes in crime merely because of that change but it isn't without its drawbacks. The main drawback with using a rate is that it assumes equal risk of victimization, which we know isn't correct. For example, rape is a crime that affects 6 times as many women as men (according to the 2016 National Crime Victimization Survey Table 6, page 9), yet the rate is based on the total population in that city (the UCR does not differentiate victims by gender but other data sets, such as NIBRS, do, allowing for better rates). Other crimes require even more granular rates. Murder victims are predominantly young men, but this differs by type of murder - domestic violence victims are mostly women. Also, consider that population comes from those who live in the city and doesn't include people like tourists or people who work in that city but live elsewhere yet can still be victimized in the city. So while rates are probably better than counts as they let you control for population, consider exactly who that population is, and how risk changes within that population.

Index crimes (sometimes called Part I crimes) are a collection of eight crimes often divided between Violent Index Crimes (murder, rape, robbery, and aggravated assault (assault with a weapon or causing serious bodily injury)) and Property Index Crimes (burglary, theft, motor vehicle theft, and arson (however arson is not available in this data set)). When people discuss "crime" they are often referring to this collection of crimes. One major drawback of this is that it gives equal weight to each crime. For example, consider if New York City has 100 fewer murders and 100 more thefts this year than last year (and all other crimes didn't change). Their total index crimes would be the same but this year would be far safer than last year. For complete definitions of each crime, please see the FBI's definitions page.
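The equal-weighting problem can be shown with a few lines of Python. The counts below are invented purely for illustration and are not real figures for any city.

```python
# Hypothetical counts for two years; only murder and theft change.
last_year = {"murder": 300, "robbery": 10_000, "theft": 100_000}
this_year = {"murder": 200, "robbery": 10_000, "theft": 100_100}

total_last_year = sum(last_year.values())
total_this_year = sum(this_year.values())

# The simple sum is identical even though there were 100 fewer murders.
print(total_last_year, total_this_year)  # 110300 110300
print(this_year["murder"] - last_year["murder"])  # -100
```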

The biggest problem with index crimes is that it is simply the sum of 8 (or 7 since arson data usually isn't available) crimes. Index crimes have a huge range in their seriousness - it includes both murder and theft. This is clearly wrong as 100 murders is more serious than 100 thefts. This is especially a problem as less serious crimes (theft mostly) are far more common than more serious crimes (in 2017 there were 1.25 million violent index crimes in the United States. That same year had 5.5 million thefts). So index crimes undercount the seriousness of crimes. Looking at total index crimes is, in effect, mostly just looking at theft.


This is especially a problem because it hides trends in violent crimes. San Francisco, as an example, has had a huge increase in index crimes in the last several years. When looking closer, that increase is driven almost entirely by the near doubling of theft since 2011. During the same years, violent crime has stayed fairly steady. So the city isn't getting more dangerous but it appears like it is due to just looking at total index crimes.


Many researchers divide index crimes into violent and nonviolent categories, which helps, but even this isn't entirely sufficient. Take Chicago as an example. It is a city infamous for its large number of murders. But as a fraction of index crimes, Chicago has a rounding error worth of murders. Their 653 murders in 2017 are only 0.5% of total index crimes. For violent index crimes, murder makes up 2.2%. What this means is that changes in murder are very difficult to detect. If Chicago had no murders this year, but a less serious crime (such as theft) increased slightly, we couldn't tell from looking at the number of index crimes.

Death

This data comes from the Centers for Disease Control and Prevention's (CDC) WONDER data and provides the number of deaths for several cause of death categories for each state.

The following is the CDC's definition of age-adjusted rates from this page.

The rates of almost all causes of disease, injury, and death vary by age. Age adjustment is a technique for "removing" the effects of age from crude rates so as to allow meaningful comparisons across populations with different underlying age structures. For example, comparing the crude rate of heart disease in Florida with that of California is misleading, because the relatively older population in Florida leads to a higher crude death rate, even if the age-specific rates of heart disease in Florida and California were the same. For such a comparison, age-adjusted rates are preferable.

The CDC does not report death counts when there are fewer than 16 deaths in that category. They do this both for confidentiality of the deceased and to avoid the misuse of rates caused by such a small numerator.

Hate Crime

This data is from the FBI's Hate Crime dataset that they have released annually since the early 1990s. The data shown on this website is only a small subset of the variables available in the dataset.

Using it.

This dataset is the most flawed of all of the UCR datasets. While the main problems are simply an exacerbated version of problems in other UCR data - low reporting rates among agencies, only including reported crimes (and many hate crimes are not reported to police), changing definitions of what counts as a hate crime, different agencies reporting each year - the MAJOR problem is that people use this data incredibly irresponsibly. A huge amount of research using this data doesn't even acknowledge these problems and just naively uses the data as if it had no issues. The problems are discussed in more detail below but the main takeaway is that this data is not appropriate for policy analysis.

The manual is available on this page.

In this data a hate crime is defined as a "normal" crime (or in other words a crime already collected by UCR or NIBRS in their normal reporting) where the victim was chosen because of the victim's group or status. For example, vandalism is a crime already included in NIBRS data. If the vandalism occurred because of the victim's group or status, such as vandalizing a synagogue due to bias against Jews, that would be considered a hate crime. Animal cruelty was not a NIBRS crime until 2018 so for years prior to that it would not count as a hate crime. For example, if a person poisoned a Black person's dog because of their bias against Black people (and not just personal opposition towards this individual), it would not be a hate crime if committed before 2018. So this data is not inclusive of even all hate crimes reported to police, only ones that have a crime type already reported to the FBI. Note that the hate crime is for the perceived victim group even if that perception is wrong. In the FBI manual they give the example of a hate crime against an Indian man who the offenders believe is Black. This is reported as an anti-Black hate crime.

This data should not be used for anything in the following (incomplete) list:

  • Aggregating to a larger geography than the reporting agency (ESPECIALLY TO THE NATIONAL LEVEL)
  • Comparing agencies
  • Aggregating to total hate crimes per agency
  • Looking at hate crimes of a specific bias motivation within a single agency over time without verifying that this bias motivation was reported for all years
  • Imputing missing data
  • Assuming that underreporting is consistent across time and place.

The only thing you should do to use this data properly is look at hate crimes of the same bias motivation (e.g. anti-Jewish, anti-Black) in the same agency over time (assuming the agency reports the same months each year) and assuming that the bias motivation has been reported by that agency every year.

No. The problem with this is that few agencies report so aggregating would exclude a lot of agencies that are actually in that geography but don't report. In my opinion there is no good way to impute missing hate crime data so any attempt to do so will be very flawed and give incorrect results. Every year when the FBI releases the report on this data the news and many academics will just aggregate all reporting agencies to the national level (ignoring all the missing agencies) and pretend that this is an accurate count of national hate crimes. This is a terrible idea and gives very inaccurate counts of hate crimes.

No. Not all agencies report for all months of the year or even for all bias motivations so they're often not comparable. Additionally, since the opportunity to commit a hate crime (i.e. the number of people of a particular victim group) differs between cities, differences across cities in hate crimes may be due to differences in opportunity, not hate.

Generally no. The FBI has added new bias motivations over time so there is an artificial increase in hate crimes just due to this. For example, if 10 transgender people (one of the bias motivations that hasn't always been included as a reportable bias motivation) are always the victim of a hate crime in a particular agency, then in the year that this becomes an accepted motivation, the agency starts reporting an extra 10 hate crimes per year. But in reality hate crime was consistent across time.

There are three reasons this could be. First, the agency may have gotten reports but not submitted hate crimes of those bias motivations. Second, they may have never received a complaint for this bias motivation prior to this year. Third, the FBI has increased the number of bias motivations it accepts so this may be one of these bias motivations. Prior to the accepted year, these would always be marked as zero reports. Note that in these cases not all agencies start reporting these motivations in the same year as the FBI accepts them.

No, reporting this data, like all UCR data, is voluntary. Some states do require that their agencies report UCR data but there is no national requirement. And even in these mandatory states not all agencies report.

No, like all crimes hate crimes are dependent on opportunity - though hate crimes have more variation in opportunity than other crimes. Consider, for example, a city with 10% Black population and one with 50% Black population. If anti-Black sentiment and willingness to attack Black people for their race is the same in each city, in the second city there are about five times as many opportunities to offend as in the first city. So even if anti-Black hate is identical in both cities, we'd expect there to be many more anti-Black hate crimes in the second city. This can be partially alleviated by using rates per victim group using Census data but that's still flawed since you'd likely only get decennial Census data and the Census doesn't collect all victim type info.
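The two-city example can be worked through in Python. This is a deliberately simplified sketch: it assumes both cities have the same total population and that expected hate crimes scale linearly with the size of the victim group, and every number and name in it is hypothetical.

```python
def expected_hate_crimes(total_population, group_percent, offenses_per_100k_group):
    """Expected count if every group member faces the same risk (simplifying assumption)."""
    group_population = total_population * group_percent // 100
    return group_population * offenses_per_100k_group / 100_000

# Same city size and same per-person risk, different victim group share.
city_one = expected_hate_crimes(1_000_000, group_percent=10, offenses_per_100k_group=5)
city_two = expected_hate_crimes(1_000_000, group_percent=50, offenses_per_100k_group=5)

print(city_one, city_two)   # 5.0 25.0
print(city_two / city_one)  # 5.0
```

The level of hate is identical by construction, yet the second city has five times as many expected hate crimes, which is why raw counts across cities are not comparable.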

This shows the number of hate crime incidents, regardless of how many victims or offenders were involved. Each incident can have multiple offenses and multiple bias motivations. For simplicity in this site I only include the first offense and the first bias motivation reported in the data.

Police

This data is from the FBI's Law Enforcement Officers Killed and Assaulted (LEOKA) data which is part of the Uniform Crime Reporting (UCR) Program. This data provides information about how many employees (civilian and officers) are at a given agency. It also says how many officers were assaulted for a number of different categories of assault.

Prior to 1971 the data did not break down employees by gender. The years 1960-1970 put the number of total employees in the male employees column (and a value of 0 in the female employees column).

Prison

The three categories that say the inmate's Most Serious Charge come from the National Corrections Reporting Program (NCRP) which provides data on how many people are incarcerated, admitted, or released from prison that year. This is divided by the most serious crime they are convicted of, race/ethnicity, and gender. All other categories are from the National Prisoner Statistics (NPS) data which has different information than the NCRP and more years available. Unlike the NCRP, the NPS has totals for the federal prison system, the state prison system, and the combined US as a whole. Some states and some years do not have information for some variables so you will likely see many missing values in this data.

All of the population data comes from the United States Census. For the years 2001-2016, I use the annual American Community Survey which is a census data set that samples 1% of the population. For the other years I use the decennial census and linearly impute for the years between the censuses. As such, please be aware that these population values are only estimates.
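For readers unfamiliar with linear imputation, here is a minimal Python sketch of the idea. The function name and the population figures are hypothetical and only illustrate the arithmetic; they are not the values or code used for this site.

```python
def interpolate_population(year, start_year, start_pop, end_year, end_pop):
    """Estimate population in `year` on a straight line between two census counts."""
    fraction = (year - start_year) / (end_year - start_year)
    return start_pop + fraction * (end_pop - start_pop)

# Made-up census counts for one state.
pop_1990 = 1_000_000
pop_2000 = 1_200_000

estimate_1995 = interpolate_population(1995, 1990, pop_1990, 2000, pop_2000)
print(estimate_1995)  # 1100000.0
```

Because the estimate assumes steady change between censuses, any real jump or dip within the decade is smoothed away.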

I included this because most people incarcerated in prison are between these ages. However, not all are in these age groups, meaning that this is almost certainly an overestimate. As such you should use the rates as estimates, NOT precise rates.

No. This rate is the rate for 100k people of any race. For example, if you look at Black prisoners the rate per 100k people is the number of Black prisoners in that state divided by the number of people in that state (of all races) times 100,000. It is not divided only by the number of Black people in that state.
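The distinction can be made concrete with a short Python sketch. All counts below are invented for illustration, and the function name is hypothetical.

```python
def rate_per_100k(count, population):
    """Count per 100,000 people in the given population."""
    return count * 100_000 / population

# Made-up figures for one state.
black_prisoners = 5_000
state_population_all_races = 10_000_000
state_black_population = 1_000_000

# Rate as defined here: denominator is everyone in the state.
rate_shown = rate_per_100k(black_prisoners, state_population_all_races)
# Race-specific rate, which is NOT what the data shows.
race_specific_rate = rate_per_100k(black_prisoners, state_black_population)

print(rate_shown)          # 50.0
print(race_specific_rate)  # 500.0
```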

As per the National Prisoner Statistics codebook, available to download here

As states and the Federal Bureau of Prisons increased their use of local jails and interstate compacts to house inmates, NPS began asking states to report a count of inmates under the jurisdiction or legal authority of state and federal adult correctional officials in addition to their custody counts. Since 1977, the jurisdiction count has been the preferred measure. This count includes all state and federal inmates held in a public or private prison (custody) and those held in jail facilities either physically located inside or outside of the state of legal responsibility, and other inmates who may be temporarily out to court or in transit from the jurisdiction of legal authority to the custody of a confinement facility outside that jurisdiction. The difference between the total custody count and the jurisdiction count was small (approximately 7,000) when both were first collected in 1977. As more states began to report jurisdiction counts and more states began to rely on local and privately operated facilities to house inmates, the difference increased. At yearend 2016 the jurisdiction population totaled 1,506,800 while the custody population totaled 1,293,887.

School

All of this data comes from the Department of Education Office of Postsecondary Education which collects crime data from colleges and releases them publicly. Their website is here. While their site does allow you to look at a single school's data, it only shows the prior three years and only as tables. For a comprehensive look at the data codebook, please see their PDF here

As per the Department of Education definitions, available here

Not on Campus: (1) Any building or property owned or controlled by a student organization that is officially recognized by the institution; or (2) Any building or property owned or controlled by an institution that is used in direct support of, or in relation to, the institution's educational purposes, is frequently used by students, and is not within the same reasonably contiguous geographic area of the institution.
On Campus - Total: (1) Any building or property owned or controlled by an institution within the same reasonably contiguous geographic area and used by the institution in direct support of, or in a manner related to, the institution's educational purposes, including residence halls; and (2) Any building or property that is within or reasonably contiguous to paragraph (1) of this definition, that is owned by the institution but controlled by another person, is frequently used by students, and supports institutional purposes (such as a food or other retail vendor).
On Campus - Student Housing: Any student housing facility that is owned or controlled by the institution, or is located on property that is owned or controlled by the institution, and is within the reasonably contiguous geographic area that makes up the campus is considered an on-campus student housing facility.
Public Property: All public property, including thoroughfares, streets, sidewalks, and parking facilities, that is within the campus, or immediately adjacent to and accessible from the campus.
Total: This is the sum of Not on Campus, On Campus - Total, and Public Property.

There are different rules for which offenses are included when an offense is committed and a person is arrested so these categories do not necessarily overlap.

There are different rules for which offenses are included when an offense is committed and when disciplinary actions are taken so these categories do not necessarily overlap. As sexual offenses are not included in the required categories for disciplinary action, they are not available in the data.

No, if a person is arrested and then given disciplinary actions by the school, only the arrest is counted.

This is when a person is referred to the school for a "disciplinary action" though the action does not need to actually take place and the data does not specify which action is referred or the outcome of that referral.

  • Sexual Offense - Forcible is the sum of rape and fondling.
  • Sexual Offense - Non-forcible is the sum of incest and statutory rape.
  • Sexual Offense - Total is the sum of Sexual Offense - Forcible and Sexual Offense - Non-forcible.
For definitions of each individual crime please see the Department of Education's codebook here

No, the hate crime data underwent a series of changes in how the data was collected. The crimes theft, intimidation, and vandalism/destruction of property only started being reported in 2009. Starting in 2014, "gender identity" was added as a possible bias motivation while in the same year the "ethnicity or national origin" bias motivation was split into either "ethnicity" or "national origin" bias motivations. This means that you should be cautious when looking at total hate crime changes as certain crimes/bias motivations were not included until recently.

This data set did not collect information on the number of rape, fondling, incest, or statutory rape crimes until 2014. Instead, it grouped rape and fondling as Sexual Offense - Forcible, and incest and statutory rape as Sexual Offense - Non-forcible.

As per the Department of Education definitions, available here

Dating Violence: Violence committed by a person who is or has been in a social relationship of a romantic or intimate nature with the victim. The existence of such a relationship shall be determined based on the reporting party’s statement and with consideration of the length of the relationship, the type of relationship, and the frequency of interaction between the persons involved in the relationship. For the purposes of this definition—
  • Dating violence includes, but is not limited to, sexual or physical abuse or the threat of such abuse.
  • Dating violence does not include acts covered under the definition of domestic violence.
Domestic Violence: A felony or misdemeanor crime of violence committed—
  • By a current or former spouse or intimate partner of the victim;
  • By a person with whom the victim shares a child in common;
  • By a person who is cohabitating with, or has cohabitated with, the victim as a spouse or intimate partner;
  • By a person similarly situated to a spouse of the victim under the domestic or family violence laws of the jurisdiction in which the crime of violence occurred, or
  • By any other person against an adult or youth victim who is protected from that person’s acts under the domestic or family violence laws of the jurisdiction in which the crime of violence occurred.
Stalking: Engaging in a course of conduct directed at a specific person that would cause a reasonable person to—
  • Fear for the person’s safety or the safety of others; or
  • Suffer substantial emotional distress.

Using rates is useful as it removes the important influence of the number of people at that school, but has its own serious limitations. Schools with a similar number of students may still be very different in their student population and risk of victimization. Consider, for example, two schools which each have 20,000 students. If these two schools are very similar in students, then the rate per 1,000 students could be useful in comparing the schools as the groups are similar. If, however, these schools differ on factors such as if the school is urban, whether students commute or live on campus, ages of students, etc., then knowing purely the number of students is not a very useful rate. Also consider that crimes can occur against victims other than students such as faculty or staff so a per 1,000 student rate would overestimate crime by decreasing the denominator.

Like all crime data, this data has a limitation as it is reported offenses only. If likelihood of reporting changes, that will be reflected in changes of reported offenses but we will not be able to tell (based only on this data) whether it was the number of crimes or the likelihood of reporting that changed. This is especially a problem with sexual offenses as they are already unlikely to be reported and small changes in reporting likelihood can cause a seemingly large change in crimes reported. Also keep in mind that the population included (primarily college students) may have different reporting likelihoods than other populations.