I have no plans to do so at this time.

I have no plans to do so at this time.


This data is from the Apparent Per Capita Alcohol Consumption: National, State, and Regional Trends 1977-2016 report produced by Sarah P. Haughwout and Dr. Megan E. Slater at the National Institute on Alcohol Abuse and Alcoholism. The Data page of this site has a link to download the data. For the original report, click here.

For a complete methodology, please see the actual report here. The authors determined the amount of alcohol consumed (originally measured in gallons of ethanol - pure alcohol) through sales data or tax information for each state-year. Population data for how many people age 14 and up live in each state was acquired from the CDC's WONDER data. Ethanol was divided by population to determine per capita consumption.

The original report provides an equation to convert the amount of total ethanol consumed into a "total drinks" variable, so I used that to make that variable. For the individual drink categories (beer, shots of liquor, and glasses of wine), it provides an equation to convert the amount of ethanol consumed into the amount of alcohol. I then converted this to the number of drinks for each category based on the National Institutes of Health's page saying how many ounces of alcohol make up a drink for those categories. Therefore, the sum of the categories is slightly different than the "total drinks" column.
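As a rough illustration of the arithmetic, here is a minimal sketch in Python. It assumes a standard drink contains 0.6 fluid ounces of pure ethanol and that a US gallon is 128 fluid ounces. The function names and the example numbers are hypothetical, and the report's exact equation may differ.

```python
FL_OZ_PER_GALLON = 128        # US fluid ounces in one US gallon
ETHANOL_OZ_PER_DRINK = 0.6    # assumed: fl oz of pure ethanol in one standard drink


def per_capita_gallons(ethanol_gallons, population_14_plus):
    """Gallons of ethanol per person aged 14 and up."""
    return ethanol_gallons / population_14_plus


def gallons_to_drinks(ethanol_gallons):
    """Convert gallons of pure ethanol into a number of standard drinks."""
    return ethanol_gallons * FL_OZ_PER_GALLON / ETHANOL_OZ_PER_DRINK


# Hypothetical state-year: 2.5 million gallons of ethanol, 1 million people.
gallons_each = per_capita_gallons(2_500_000, 1_000_000)
drinks_each = gallons_to_drinks(gallons_each)
print(f"{gallons_each:.2f} gallons of ethanol = {drinks_each:.0f} drinks per person")
```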


This data is from the FBI's Arrests by Age, Sex, and Race data which is part of the Uniform Crime Reporting (UCR) Program. This data provides the number of arrests that occurred in a city in any given year and breaks that down by age (adult or juvenile), race, and gender.

Border Patrol

See here.


This data is from the FBI's Offenses Known and Clearances by Arrest data which is part of the Uniform Crime Reporting (UCR) Program. This data provides the number of crimes that occurred in a city in any given year and how many of those crimes were cleared.

Negative values reflect adjustments of previous returns. So if the agency reports, for example, a burglary in January then in February discovers that it wasn't actually a burglary, they will record that both as 1 unfounded burglary and -1 actual burglary in February. This is so it "deletes" the erroneously recorded burglary from January. See more on pages 82-83 of the FBI's Manual for UCR data.
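To see why the negative value works as a correction, here is a small sketch using made-up monthly counts for a single hypothetical agency. Summing across the year nets out the erroneous January burglary.

```python
# Hypothetical monthly "actual burglary" counts for one agency.
# February's -1 cancels the burglary wrongly recorded in January.
monthly_burglaries = {"January": 1, "February": -1, "March": 2}

# Summing over the year nets out the correction.
yearly_total = sum(monthly_burglaries.values())

# Months with negative values are adjustments, not negative crime.
adjustment_months = [month for month, n in monthly_burglaries.items() if n < 0]

print(yearly_total, adjustment_months)
```

This is also why negative values mostly disappear once monthly data is aggregated to the year.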

Starting in 2013, rape has a new, broader definition in the UCR to include oral and anal penetration (by a body part or object) and allow men to be victims. The previous definition included only forcible intercourse against a woman. As this revised definition is broader than the original one, more rapes are reported (social changes may also be partly responsible as they could encourage victims to report more). This definitional change makes post-2013 rape data non-comparable to pre-2013 data.

This data doesn't differentiate between a "real zero" and a "not reported zero". If an agency doesn't report any crimes (even if crimes did occur), the data will say zero crimes occurred. Even though the data indicates how many months of the year that agency reported, that doesn't necessarily mean that they reported fully. An agency that reports all 12 months of the year may still report only incomplete data. Agencies can report partial data each month and still be considered to have reported that month. Chicago, for example, reports every month but until the last few years didn't report any rapes.

No, this data only includes the most serious crime in an incident (except for motor vehicle theft which is always included). For incidents where more than one crime happens (for example, a robbery and a murder), only the more serious (murder in this case) will be counted. This is called the Hierarchy Rule. See more on pages 10-12 of the FBI's Manual for UCR data which details the Hierarchy Rule.
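The logic of the Hierarchy Rule as described here can be sketched in a few lines of Python. This is a simplified illustration, not the FBI's procedure: the ranking order, the set of always-counted offenses, and the function name are assumptions based only on the description above.

```python
# Offenses from most to least serious (order assumed for illustration).
HIERARCHY = ["murder", "rape", "robbery", "aggravated assault", "burglary", "theft"]

# Exception described in the text: counted regardless of other offenses.
ALWAYS_COUNTED = {"motor vehicle theft"}


def counted_offenses(incident):
    """Return the offenses from one incident that would be counted."""
    ranked = [offense for offense in HIERARCHY if offense in incident]
    counted = ranked[:1]  # keep only the most serious ranked offense
    counted += [offense for offense in incident if offense in ALWAYS_COUNTED]
    return counted


print(counted_offenses(["robbery", "murder"]))               # only murder is kept
print(counted_offenses(["robbery", "motor vehicle theft"]))  # both are kept
```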

Though the Hierarchy Rule does mean this data is an undercount, data from other sources indicate it isn't much of an undercount. The FBI's other data set, the National Incident-Based Reporting System (NIBRS), contains every crime that occurs in an incident (i.e. it doesn't use the Hierarchy Rule). Using this we can measure how many crimes the Hierarchy Rule excludes (most major cities do not report to NIBRS, so what we find in NIBRS may not apply to them). In over 90% of incidents, only one crime is committed. Additionally, when people talk about "crime" they usually mean murder which, while an incomplete way to discuss crime, means the UCR data here is accurate on that measure.

A major limitation (in my opinion the most important limitation) to the data here is that it doesn't include crimes not reported to police. Based on victimization surveys that ask people both if they were victimized and if they reported that crime, we know that the majority of crimes are not reported. This probably won't matter when looking at a single city for a short period of time - the population won't change too much so even underreporting of crime will be consistent underreporting. The issue becomes serious when looking at a city with major population changes or comparing multiple cities as their populations may have very different reporting practices. There's no easy solution here but it is an important aspect of understanding crime data that you should keep in mind. For a full breakdown of reporting rates broken down by crime and a number of characteristics about the crime and victim (and reasons for not reporting), see Tables 91-105 (pages 98-114) in this report on the National Crime Victimization Survey from 2008.

Using the rate helps deal with population changes that could lead to changes in crime merely because of that change, but it isn't without its drawbacks. The main drawback with using a rate is that it assumes equal risk of victimization, which we know isn't correct. For example, rape is a crime that affects 6 times as many women as men (according to the 2016 National Crime Victimization Survey, Table 6, page 9), yet the rate is based on the total population in that city (the UCR does not differentiate victims by gender, but other data sets, such as NIBRS, do, allowing for better rates). Other crimes require even more granular rates. Murder victims are predominantly young men, but this differs by type of murder - domestic violence victims are mostly women. Also, consider that population comes from those who live in the city and doesn't include people like tourists or people who work in that city but live elsewhere, yet they can still be victimized in the city. So while rates are probably better than counts as they let you control for population, consider exactly who that population is, and how risk changes within that population.
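Here is a minimal sketch of how the choice of denominator changes a rate. It assumes rates are expressed per 100,000 people, which is a common convention but an assumption here. The city, its population, and the victim counts are entirely made up (the UCR itself does not provide victim gender).

```python
def rate_per_100k(count, population):
    """Crimes per 100,000 people in the chosen population."""
    return count * 100_000 / population


# Hypothetical city: 500,000 residents, half women and half men.
# Hypothetical victims: 60 women and 10 men (a 6-to-1 ratio).
total_rate = rate_per_100k(60 + 10, 500_000)
female_rate = rate_per_100k(60, 250_000)
male_rate = rate_per_100k(10, 250_000)

print(total_rate, female_rate, male_rate)
```

The total-population rate sits between the two group-specific rates and describes the risk faced by neither group well.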

Index crimes (sometimes called Part I crimes) are a collection of eight crimes often divided between Violent Index Crimes (murder, rape, robbery, and aggravated assault (assault with a weapon or causing serious bodily injury)) and Property Index Crimes (burglary, theft, motor vehicle theft, and arson (however arson is not available in this data set)). When people discuss "crime" they are often referring to this collection of crimes. One major drawback of this is that it gives equal weight to each crime. For example, consider if New York City has 100 fewer murders and 100 more thefts this year than last year (and all other crimes didn't change). Their total index crimes would be the same but this year would be far safer than last year. For complete definitions of each crime, please see the FBI's definitions page.
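The New York City example above can be checked with a short sketch. The counts below are hypothetical and only murder and theft are shown, since all other crimes are assumed unchanged.

```python
# Hypothetical counts; every other index crime is assumed unchanged.
last_year = {"murder": 300, "theft": 40_000}
this_year = {"murder": 200, "theft": 40_100}  # 100 fewer murders, 100 more thefts

index_last_year = sum(last_year.values())
index_this_year = sum(this_year.values())

# The totals match even though the mix of crimes is very different.
print(index_last_year, index_this_year)
```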

The biggest problem with index crimes is that the measure is simply the sum of 8 (or 7 since arson data usually isn't available) crimes. Index crimes have a huge range in their seriousness - the measure includes both murder and theft and counts them equally. This is clearly wrong as 100 murders are more serious than 100 thefts. This is especially a problem as less serious crimes (theft mostly) are far more common than more serious crimes (in 2017 there were 1.25 million violent index crimes in the United States; that same year had 5.5 million thefts). So index crimes undercount the seriousness of crimes. Looking at total index crimes is, in effect, mostly just looking at theft.

This is especially a problem because it hides trends in violent crimes. San Francisco, as an example, has had a huge increase in index crimes in the last several years. When looking closer, that increase is driven almost entirely by the near doubling of theft since 2011. During the same years, violent crime has stayed fairly steady. So the city isn't getting more dangerous but it appears like it is when you look only at total index crimes.

Many researchers divide index crimes into violent and nonviolent categories, which helps, but even this isn't entirely sufficient. Take Chicago as an example. It is a city infamous for its large number of murders. But as a fraction of index crimes, Chicago has a rounding error worth of murders. Their 653 murders in 2017 are only 0.5% of total index crimes. For violent index crimes, murder makes up 2.2%. What this means is that changes in murder are very hard to see in these totals. If Chicago had no murders this year, but a less serious crime (such as theft) increased slightly, we couldn't tell from looking at the number of index crimes.


This data comes from the Centers for Disease Control and Prevention's (CDC) WONDER data and provides the number of deaths for several cause of death categories for each state.

The following is the CDC's definition of age-adjusted rates from this page.

The rates of almost all causes of disease, injury, and death vary by age. Age adjustment is a technique for "removing" the effects of age from crude rates so as to allow meaningful comparisons across populations with different underlying age structures. For example, comparing the crude rate of heart disease in Florida with that of California is misleading, because the relatively older population in Florida leads to a higher crude death rate, even if the age-specific rates of heart disease in Florida and California were the same. For such a comparison, age-adjusted rates are preferable.
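To make the idea concrete, here is a minimal sketch of direct standardization, one common way to age-adjust a rate. The two age groups, the rates, the population shares, and the "standard" population below are all made up for illustration and are not the CDC's figures.

```python
def weighted_rate(age_rates, population_shares):
    """Average the age-specific rates, weighted by each age group's share."""
    return sum(age_rates[group] * population_shares[group] for group in age_rates)


# Hypothetical deaths per 100,000, identical in both states.
age_rates = {"under 65": 100.0, "65 and over": 1000.0}

older_state = {"under 65": 0.80, "65 and over": 0.20}    # hypothetical shares
younger_state = {"under 65": 0.90, "65 and over": 0.10}  # hypothetical shares
standard = {"under 65": 0.85, "65 and over": 0.15}       # hypothetical standard

# Crude rates differ only because the age structures differ.
crude_older = weighted_rate(age_rates, older_state)
crude_younger = weighted_rate(age_rates, younger_state)

# Applying the same standard weights to both states removes that difference.
adjusted_older = weighted_rate(age_rates, standard)
adjusted_younger = weighted_rate(age_rates, standard)

print(crude_older, crude_younger, adjusted_older, adjusted_younger)
```

With identical age-specific rates, the crude rates differ but the age-adjusted rates come out the same, which is the point of the Florida and California example.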

The CDC does not report death counts when there are fewer than 16 deaths in that category. They do this both for confidentiality of the deceased and to avoid the misuse of rates caused by such a small numerator.


This data is from the FBI's Law Enforcement Officers Killed and Assaulted (LEOKA) data which is part of the Uniform Crime Reporting (UCR) Program. This data provides information about how many employees (civilian and officers) are at a given agency. It also says how many officers were assaulted for a number of different categories of assault.

Prior to 1971 the data did not break down employees by gender. The years 1960-1970 put the number of total employees in the male employees column (and a value of 0 in the female employees column).


The three categories that say the inmate's Most Serious Charge come from the National Corrections Reporting Program (NCRP) which provides data on how many people are incarcerated, admitted, or released from prison that year. This is divided by the most serious crime they are convicted of, race/ethnicity, and gender. All other categories are from the National Prisoner Statistics (NPS) data which has different information than the NCRP and more years available. Unlike the NCRP, the NPS has totals for the federal prison system, the state prison system, and the combined US as a whole. Some states and some years do not have information for some variables so you will likely see many missing values in this data.

All of the population data comes from the United States Census. For the years 2001-2016, I use the annual American Community Survey which is a census data set that samples 1% of the population. For the other years I use the decennial census and linearly impute for the years between the censuses. As such, please be aware that these population values are only estimates.
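Linear imputation between two censuses can be sketched as follows. The function name and the population figures are hypothetical. The sketch simply assumes the population changes by the same amount every year between the two census counts.

```python
def interpolate_population(year, year0, pop0, year1, pop1):
    """Estimate population in `year`, assuming steady change between two censuses."""
    fraction = (year - year0) / (year1 - year0)  # share of the gap elapsed
    return pop0 + fraction * (pop1 - pop0)


# Hypothetical state: 1,000,000 people in the 1990 census, 1,200,000 in 2000.
estimate_1995 = interpolate_population(1995, 1990, 1_000_000, 2000, 1_200_000)
print(estimate_1995)
```

Real populations rarely change this smoothly, which is one reason the values should be treated as estimates.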

I included this because most people incarcerated in prison are between these ages. However, not all are in these age groups, meaning that this is almost certainly an overestimate. As such, you should use the rates as estimates, NOT precise rates.

As per the National Prisoner Statistics codebook, available to download here:

As states and the Federal Bureau of Prisons increased their use of local jails and interstate compacts to house inmates, NPS began asking states to report a count of inmates under the jurisdiction or legal authority of state and federal adult correctional officials in addition to their custody counts. Since 1977, the jurisdiction count has been the preferred measure. This count includes all state and federal inmates held in a public or private prison (custody) and those held in jail facilities either physically located inside or outside of the state of legal responsibility, and other inmates who may be temporarily out to court or in transit from the jurisdiction of legal authority to the custody of a confinement facility outside that jurisdiction. The difference between the total custody count and the jurisdiction count was small (approximately 7,000) when both were first collected in 1977. As more states began to report jurisdiction counts and more states began to rely on local and privately operated facilities to house inmates, the difference increased. At yearend 2016 the jurisdiction population totaled 1,506,800 while the custody population totaled 1,293,887.