An estimation of the total amount of data existing in the digital universe today is 7.9 Zettabytes [1].
1 Zettabyte is 1 trillion gigabytes which is beyond comprehension of us mortals. An estimation of the total number of stars in our galaxy, the Milky Way, is 300 billion. The total number of gigabytes today is more than 2000 times the number of total stars in our galaxy (to get a feeling of a gigabyte you could say that a good quality movie file could be close to 1 GB). The digital universe size is poised to grow 5 times to a total of 35 Zettabytes (10,000 times the number of stars in our galaxy) by 2020.
So when people talk about big data and explosion of data they really mean it. But it is one thing to have data and another thing to be able to extract value from data. Crude datasets are like crude oil. You need to process them to get value. Nevertheless, one could say that we are sitting on gold mines and we only need the miners to harvest the value. Let’s take, for example, open data. Does it have value or not?
Open data describe large datasets that entities (usually governments) release online and free of charge for anyone to analyze for any purpose. While there is a huge debate regarding the pros and cons of this approach, nobody can doubt the value of these data. A plethora of applications has emerged from those data along with the added benefit of transparency into the government itself.
To demonstrate the value one can mine from data, I analyzed the Hospital Inpatient Discharges (SPARCS De-Identified) of 2012 released from New York State as an open dataset [2].
This dataset contains data regarding hospital discharges (when a patient is released from the hospital) along with demographics, etiology of admission, procedures (if any), severity and cost data. The file is close to 1 gigabyte in size, which is relatively small in terms of the files I usually work with, but as we will see there is quite a bit of value to mine even in 1 gigabyte. One thing with data is that you spend a lot of time cleaning the files (there is actually a pareto law paradigm stating that 80% of your time as a data scientist is spent on cleaning and transforming your data while only 20% is spent on getting results and visualizing). So it might take a while for you to clean and munge your data, as we say, but it is like hunting for gold, where the value far outweighs the effort, however arduous.
First contact with data is always a visual exploration to check trends, outliers and get a general impression and feel for the dataset. Let’s take a look at the age of inpatients vs. the total number of admissions.
We can easily see that most of the hospitalizations are for newborns and 50+ year olds.
Let’s take a look into the male/female distribution. Interestingly enough, female discharges are higher.
For the sake of space, we will look into only three diseases, create some insights for demonstration purposes, and show a process that can easily be generalized for the rest of the diseases.
COPD
Chronic obstructive pulmonary disease (COPD) is a type of obstructive lung disease characterized by chronically poor airflow with main symptoms that include shortness of breath, cough, and sputum production.
Let’s take a look into the COPD age distribution.
What is the distribution of the length of stay? Most of the admissions are between 2-4 days.
What about the geolocation of COPD discharges based on the first three numbers of the zip code (rest of zip codes are truncated for anonymization)? Most of the hospitalizations are of course in New York City, but it is interesting to see a high incidence of COPD discharges outside of NYC. One can create interactive maps online relatively easily for the data stakeholders to be able to “geo – visualize” their data.
What about Chronic Heart Insufficiency, a condition of the heart where the heart muscle does not work efficiently in pumping blood? Can we compare the number of discharges of CHF with those of COPD? It is obvious that for patients older than 70 the incidence of CHF is almost double in comparison with COPD.
Length of stay is more or less the same as COPD.
Now, let’s add Diabetes into the analysis and continue on from there.
Cost Data for the Three Diseases
Clearly CHF is the costliest disease among the three, with average cost per admission nearly twice as high as COPD.
More specifically, we notice that CHF, apart from being very expensive in total, in average and in maximum cost for one hospitalization, also has the highest standard deviation among the three diseases.
Costs of Hospitalization
COPD | CHF | DIABETES | |
Mean | $25,779 | $42,549 | $29,404 |
Total | $0.958 B | $2,425 B | $1,35 B |
STD | $34,534 | $84,772 | $45,483 |
25% | $9,971 | $12,389 | $9,336 |
50% | $17,329 | $22,877 | $16,918 |
75% | $30,243 | $44,266 | $32,061 |
max | $1,665,430 | $4,214,537 | $2,141,412 |
Let’s take a look into the power that predictive and preventive medicine can have on cost reduction.
Let’s say that we define the vulnerable patients and we find a magic way to reduce emergency admissions by 10%. What would be the cost savings for New York State only?
$203M for CHF, $116M for Diabetes, and $83.9M for COPD.
Is there any correlation between average cost per admissions and day of the week for 2012?
Generally speaking, there shouldn’t be any over the course of the year. But there is no free lunch in data science, so let’s dig deeper into the data to uncover the real answer.
COPD: Average Cost per Admission by Day of Week
Admissions for emergencies surprisingly cost on average the same amount for each day of the week.
Elective admissions have an almost 50% higher average cost on Wednesday but very low costs on weekends, as compared with emergency admissions.
CHF: Average Cost per Admission by Day of Week
Elective admissions cost more on average (probably due to lots of tests). We also watch this peak on electives on Wednesday.
Diabetes: Average Cost per Admission by Day of Week
There is a peak of emergency admission costs on Sunday followed by low average cost on Monday and Friday elective admissions. One guess is that some emergency admissions from Sunday worsen and account for some of the elective admissions on Monday. Moreover, there is likely a reduction in elective admissions on Friday because people don’t want to spend the weekend in the hospital.
Last but not least, we can take a look at what the insurance companies paid on average per hospitalization by type of admission (elective/emergency) for each disease.
COPD: Average Payment per Type of Admission per Insurance Coverage
Source of Payment (rows) / Type of Admission (columns) | Elective | Emergency |
Blue Cross | $32,290 | $24,371 |
CHAMPUS | $9,912 | $22,182 |
Insurance Company | $23,025 | $25,607 |
Medicaid | $13,614 | $21,689 |
Medicare | $25,057 | $26,875 |
Other Federal Program | $12,329 | $13,947 |
Other Non-Federal Program | $25,990 | $25,915 |
Self-Pay | $22,594 | $27,032 |
Unknown | $3,739 | $17,353 |
Workers Compensation | $43,622 | $25,436 |
CHF: Average Payment per Type of Admission per Insurance Coverage
Source of Payment (rows) / Type of Admission (columns) | Elective | Emergency |
Blue Cross | $78,158 | $43,927 |
CHAMPUS | $49,044 | $37,291 |
Insurance Company | $93,328 | $47,836 |
Medicaid | $75,460 | $39,134 |
Medicare | $68,665 | $37,911 |
Other Federal Program | $7,058 | $23,907 |
Other Non-Federal Program | $36,672 | $31,150 |
Self-Pay | $38,606 | $34,050 |
Workers Compensation | $75,141 | $42,851 |
Diabetes: Average Payment per Type of Admission per Insurance Coverage
Source of Payment (rows) / Type of Admission (columns) | Elective | Emergency |
Blue Cross | $27,257 | $21,206 |
Insurance Company | $32,544 | $30,226 |
Medicaid | $23,405 | $38,076 |
Medicare | $23,861 | $30,567 |
Other Non-Federal Program | $56,586 | $33,893 |
And what were the 20 top hospitalizations in terms of cost?
Disease | Cost in $ | |
1. | Eye Infection | 7,066,436 |
2. | Surgcl/Med Care Compl | 6,286,622 |
3. | Aspiration Pneumonitis | 6,230,015 |
4. | Coag/Hemrrge Disorder | 6,196,974 |
5. | Septicemia | 5,166,411 |
6. | Liveborn | 4,971,831 |
7. | Liveborn | 4,953,934 |
8. | Burns | 4,877,072 |
9. | HIV Infection | 4,839,726 |
10. | Septicemia | 4,511,673 |
11. | Encephalitis | 4,362,477 |
12. | HIV Infection | 4,324,821 |
13. | CHF | 4,214,537 |
14. | Leukemias | 4,196,908 |
15. | CHF | 4,138,746 |
16. | HIV Infection | 4,052,052 |
17. | Liveborn | 3,975,798 |
18. | Liveborn | 3,961,028 |
19. | Adult Respiratory Failure | 3,947,940 |
20. | Anemia | 3,891,765 |
We notice that 5 of them are related to births, 2 to CHF, and 3 to HIV infections.
It seems that with this dataset we might have enough data to make a prediction model for the probable cost of an admission based on gender, age, disease, area, length of stay and insurance type for the region of New York.
I’m guessing that there is a lot more value to be extracted from a dataset like this.
But that is a topic for another blog post….
[1] http://www.csc.com/insights/flxwd/78931-big_data_universe_beginning_to_explode
[2] (https://health.data.ny.gov/Health/Hospital-Inpatient-Discharges-SPARCS-De-Identified/u4ud-w55t)