Covid-19 Project – Miguel Atares

COVID-19 GLOBAL IMPACT

[PYTHON + POWER BI]

🔎 Ready to explore the report?
🚀 Go ahead, it’s all yours!

QUICK REVIEW

Before we dive into the full analysis,
here’s a quick overview.

If you’re here to evaluate my technical skills, or just want a snapshot of the project, you can explore all the core files and documentation below:

If you’re more into data storytelling, no worries — keep scrolling and enjoy the guided tour:

Highlights

Documentation

Numbers Speak

Technical information

📄 Data Sources + First Exploration
🐍 Python ETL Scripts: 5 Unique Formats
🔁 Python Automation for VAERS Folders
📊 DAX Measures + Calculated Columns
🧩 Power BI Modeling + M Code Calendar
🧠 Key Insights VS Official Data

Background, Objective & Tech Stack

❓ Why did I build this project?

As a biotechnologist, I was especially drawn to the topic of COVID-19 and vaccine-related data. It’s a globally relevant issue that also connects directly with my background. The challenge was fascinating: fragmented, inconsistent datasets from dozens of countries, each with different reporting methods.

⚙️ What technologies did I use?

I used Python (Pandas, Jupyter Notebook) for all ETL processes, and Power BI / DAX for data modeling and dashboard creation. I automated the cleaning of repetitive yearly CSVs and built a dual-star schema. This allowed me to integrate different sources (OWID + VAERS) into one unified analytical ecosystem.

🎯 What am I trying to demonstrate?

I wanted to prove I can handle full-cycle analytics: from collecting and transforming raw data to delivering insights through an interactive dashboard. I also aimed to show critical thinking, validate data externally, and contextualize the output. It wasn’t just about “analyzing COVID,” but about demonstrating end-to-end data skills.

🚨 What is the main objective?

The main goal of this project is to extract meaningful insights about the global impact of COVID-19, identifying which countries were most affected, what strategies they adopted, and how they evolved over time. It also aims to analyze the most common and severe symptoms reported post-vaccination, and explore potential patterns linked to specific vaccines or technologies.

Dimensions, Row Meaning, Column Info

Let’s start by understanding the structure and content of each dataset individually. This step is crucial to identify what each row represents, how the data is organized, and which columns are useful for analysis.
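A profiling pass of this kind can be sketched with pandas. This is a simplified illustration on a toy frame, not the project's notebook code; the helper name `profile_columns` and the sample values are assumptions.

```python
import pandas as pd


def profile_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, share of missing values, distinct count."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "pct_missing": (df.isna().mean() * 100).round(2),
            "n_unique": df.nunique(dropna=True),
        }
    )


# Toy stand-in for a few rows of an OWID-style file (values are made up).
sample = pd.DataFrame(
    {
        "country": ["Spain", "Spain", "Chile", "Chile"],
        "date": ["2021-01-01", "2021-01-02", "2021-01-01", "2021-01-02"],
        "new_cases_smoothed": [100.0, None, 50.0, 55.0],
    }
)

n_rows, n_cols = sample.shape  # the "DIMENSIONS" figures quoted for each dataset
profile = profile_columns(sample)
print(n_rows, n_cols)
print(profile)
```

The missing-value share per column is what usually drives the Keep / Drop decision recorded in the tables that follow.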

DIMENSIONS, UNDERSTANDING ROW MEANING AND COLUMN PROFILING

OWID - Compact.csv

DIMENSIONS:
Total rows = 506,842
Total columns = 60
Columns kept = 37

WHAT DOES EACH ROW MEAN?
Each row corresponds to one day for one country, summarizing all the national-level pandemic indicators on that day.

A) 🌍 Geographical and Temporal Information

Column	Description	Keep	Reason
country	Country name	✅	Essential for grouping and comparisons
code	Country code (ISO Alpha-3)	❌	Redundant unless standardization is needed
continent	Continent to which the country belongs	✅	Enables regional aggregation
date	Date of the data	✅	Core time variable for trends

B) 🦠 COVID-19 Cases

Column	Description	Keep	Reason
total_cases	Cumulative confirmed cases	✅	Shows the total evolution of the pandemic
new_cases	Daily new cases	❌	Too noisy; replaced with smoothed version
new_cases_smoothed	7-day rolling average of new cases	✅	Shows real trends, filters daily reporting fluctuations
total_cases_per_million	Total cases per million people	✅	Allows comparisons between countries regardless of population size
new_cases_per_million	Daily new cases per million	❌	Redundant; replaced by smoothed version per million
new_cases_smoothed_per_million	Smoothed new cases per million	✅	Ideal for normalized trend comparisons

C) ☠️ COVID-19 Deaths

Column	Description	Keep	Reason
total_deaths	Cumulative deaths	✅	Key indicator of pandemic impact
new_deaths	Daily new deaths	✅	Useful for daily monitoring
new_deaths_smoothed	Smoothed new deaths	❌	Redundant if using both daily and per million metrics
total_deaths_per_million	Deaths per million	✅	Enables normalized cross-country comparison
new_deaths_per_million	Daily deaths per million	✅	Adjusted by population
new_deaths_smoothed_per_million	Smoothed deaths per million	❌	Redundant with existing metrics

D) ⚰️ Excess Mortality

Column	Description	Keep	Reason
excess_mortality	Weekly % of excess mortality	✅	Indicates excess deaths beyond expectations
excess_mortality_cumulative	Cumulative % of excess mortality	✅	Shows prolonged deviation over time
excess_mortality_cumulative_absolute	Absolute number of excess deaths	✅	Gives real volume of excess mortality
excess_mortality_cumulative_per_million	Excess deaths per million	✅	Allows cross-national analysis

E) 🏥 Hospitalizations and ICU

Column	Description	Keep	Reason
hosp_patients	Current hospitalized patients	❌	Redundant; prefer normalized version
hosp_patients_per_million	Hospitalized per million people	✅	Enables fair comparison across countries
weekly_hosp_admissions	Weekly hospital admissions (raw)	❌	Not normalized; better to use per million
weekly_hosp_admissions_per_million	Weekly hospital admissions per million	✅	Reflects burden over time
icu_patients	Current ICU patients	❌	Redundant with ICU admissions
icu_patients_per_million	ICU patients per million	❌	Less useful if ICU admissions are used
weekly_icu_admissions	Weekly ICU admissions (raw)	❌	Not normalized; better to use per million
weekly_icu_admissions_per_million	ICU admissions per million	✅	Captures severe case trends

F) 🦠 Testing and Positivity

Column	Description	Keep	Reason
total_tests	Total tests performed	❌	Hard to interpret without positives
new_tests	Daily new tests	❌	Too volatile
total_tests_per_thousand	Total tests per 1000 people	✅	Shows overall testing effort
new_tests_per_thousand	New tests per 1000 people	❌	Redundant; replaced by smoothed version
new_tests_smoothed	Smoothed new tests	✅	Tracks real testing trends
new_tests_smoothed_per_thousand	Smoothed new tests per 1000 people	✅	Enables normalized comparison
positive_rate	Percentage of positive tests	✅	Key to assessing under-testing
tests_per_case	Number of tests per case detected	✅	Inversely related to positivity rate

G) 💉 Vaccination

Column	Description	Keep	Reason
total_vaccinations	Total doses administered	❌	Redundant if focusing on individuals
people_vaccinated	People with at least one dose	❌	Dropped for simplification
people_fully_vaccinated	Fully vaccinated people	❌	Dropped for simplification
total_boosters	Total booster doses	✅	Important for booster analysis
new_vaccinations	Daily new doses	❌	Replaced by smoothed version
new_vaccinations_smoothed	Smoothed new doses	✅	Shows true vaccination pace
total_vaccinations_per_hundred	Doses per 100 people	❌	Redundant with individual-based metrics
people_vaccinated_per_hundred	One dose per 100 people	❌	Redundant
people_fully_vaccinated_per_hundred	Fully vaccinated per 100 people	✅	Key indicator of national coverage
total_boosters_per_hundred	Boosters per 100 people	❌	Redundant
new_vaccinations_smoothed_per_million	Smoothed new doses per million	❌	Dropped for simplification
new_people_vaccinated_smoothed	Smoothed new individuals vaccinated	❌	Dropped by design
new_people_vaccinated_smoothed_per_hundred	Per 100 people	❌	Dropped by design

H) 🦠 Demographic and Health Indicators

Column	Description	Keep	Reason
population	Total population	✅	Needed for all per capita calculations
population_density	Population density	✅	Helps understand transmission dynamics
median_age	Median age	✅	Important for assessing mortality risk
life_expectancy	Life expectancy	✅	General health indicator
gdp_per_capita	GDP per capita (USD)	✅	Socioeconomic indicator
extreme_poverty	% living in extreme poverty	✅	Crucial for assessing access to healthcare
diabetes_prevalence	Diabetes prevalence (%)	✅	Known comorbidity for COVID-19
handwashing_facilities	% with access to handwashing	✅	Indicator of basic hygiene and prevention capacity
hospital_beds_per_thousand	Hospital beds per 1000 people	✅	Measures healthcare capacity
human_development_index	Human Development Index (HDI)	✅	Overall development and wellbeing metric

OWID - Vaccinations_manufacturer.csv

DIMENSIONS:
Total rows = 888,361
Total columns = 4
Columns kept = 4

WHAT DOES EACH ROW MEAN?
Each row represents the cumulative number of vaccine doses administered for a specific vaccine brand in a specific country on a specific day.

Column	Description	Keep	Reason
country	Country name	✅	Essential for grouping
vaccine	Vaccine name	✅	Key for comparison between vaccines
date	Date of administration	✅	Required for timeline and trends
total_vaccinations	Cumulative doses administered	✅	Core quantitative indicator of usage

VAERS - Data.csv

DIMENSIONS (2021 example):
Total rows = 768,353
Total columns = 35
Columns kept = 17

WHAT DOES EACH ROW MEAN?
Each row is an individual report submitted to the VAERS system describing a possible adverse event after receiving a COVID-19 vaccine in the United States.

A) 🦠 Identification and Dates

Column	Description	Keep	Reason
VAERS_ID	Unique report identifier	✅	Primary key to link across VAERS files
RECVDATE	Report received date	✅	Useful to analyze reporting trends
RPT_DATE	Report creation date	❌	Often empty or redundant with RECVDATE
TODAYS_DATE	File creation date (not event date)	❌	Irrelevant for analytical purposes
VAX_DATE	Date of vaccination	✅	Needed to calculate time to symptom onset
ONSET_DATE	Symptom onset date	✅	Crucial to assess reaction time
NUMDAYS	Days between vaccination and symptoms	✅	Already calculated; saves preprocessing steps
DATEDIED	Date of death (if applicable)	✅	Required for severity and fatality timeline analysis

B) 🦠 Demographics

Column	Description	Keep	Reason
STATE	US state where the event occurred	✅	Useful for regional analysis
AGE_YRS	Patient age in years	✅	Key for age-based segmentation
CAGE_YR	Age in text format (years)	❌	Redundant with numeric age (AGE_YRS)
CAGE_MO	Age in months (for infants)	❌	Not needed if analysis excludes infants or already using AGE_YRS
SEX	Patient gender (M/F/U)	✅	Fundamental for demographic breakdowns

C) 🏥 Clinical Outcome (Severity)

Column	Description	Keep	Reason
DIED	Indicates death	✅	Key indicator of event severity
L_THREAT	Life-threatening condition	✅	Useful for classification of serious cases
ER_VISIT	Emergency room visit	✅	Acts as proxy for acute impact
ER_ED_VISIT	Emergency visit (duplicate)	❌	Redundant with ER_VISIT
HOSPITAL	Was hospitalized	✅	Helps measure clinical burden
HOSPDAYS	Days spent in hospital	✅	Useful for estimating impact severity
DISABLE	Resulted in disability	✅	Captures long-term adverse effects
RECOVD	Patient recovered	✅	Allows tracking of recovery vs. severity
BIRTH_DEFECT	Birth defect caused	✅	Rare but critical for completeness

D) 🤒 Symptom Description

Column	Description	Keep	Reason
SYMPTOM_TEXT	Free text description of symptoms	❌	Unstructured and redundant with codified symptom tables

E) 🦠 Patient Medical Context

Column	Description	Keep	Reason
LAB_DATA	Lab data (free text)	❌	Unstructured; difficult to scale
V_ADMINBY	Who administered the vaccine (e.g., pharmacy)	❌	Operational, not analytically useful
V_FUNDBY	Funding source (e.g., public/private)	❌	Contextual info, not needed for clinical analysis
OTHER_MEDS	Current medication (free text)	❌	Not standardized
CUR_ILL	Current illnesses (free text)	❌	Difficult to interpret systematically
HISTORY	Medical history (free text)	❌	Lacks structure for consistent analysis
PRIOR_VAX	Previous vaccines received	❌	Inconsistent and loosely related to COVID vaccine
SPLTTYPE	Report type (e.g., VAERS internal)	❌	Administrative metadata, not useful for analysis
FORM_VERS	Form version	❌	Technical detail with no analytical value

VAERS - Symptoms

DIMENSIONS (2021 example):
Total rows = 1,030,219
Total columns = 11
Columns kept = 6

WHAT DOES EACH ROW MEAN?
Each row links up to five symptoms to a specific VAERS report, using the same unique VAERS_ID. The structure is wide, and each row represents the set of symptoms for that event.

Column	Description	Keep	Reason
VAERS_ID	Unique report identifier (shared across VAERS)	✅	Key for joining with VAERS Data and VAX tables
SYMPTOM1	First reported symptom (usually the most severe)	✅	Allows analysis of primary adverse reactions
SYMPTOMVERSION1	Dictionary version for symptom 1	❌	Internal technical detail, not analytically relevant
SYMPTOM2	Second reported symptom	✅	Expands scope of adverse event analysis
SYMPTOMVERSION2	Dictionary version for symptom 2	❌	Redundant and technical
SYMPTOM3	Third reported symptom	✅	Helps analyze co-occurring symptoms
SYMPTOMVERSION3	Dictionary version for symptom 3	❌	No analytical value
SYMPTOM4	Fourth reported symptom	✅	Adds clinical context
SYMPTOMVERSION4	Dictionary version for symptom 4	❌	Repetitive, not useful
SYMPTOM5	Fifth reported symptom	✅	Covers up to 5 symptoms per report
SYMPTOMVERSION5	Dictionary version for symptom 5	❌	Purely technical, no impact on analysis

VAERS - VAX

DIMENSIONS (2021 example):
Total rows = 813,441
Total columns = 9
Columns kept = 4

WHAT DOES EACH ROW MEAN?
Each row contains vaccine-related metadata linked to a VAERS report, such as vaccine type, manufacturer, and dose series.

Column	Description	Keep	Reason
VAERS_ID	Unique report ID	✅	Key to merge with SYMPTOMS and DATA tables
VAX_TYPE	Vaccine type (e.g., COVID19, FLU)	✅	Required to filter only COVID-related reports
VAX_MANU	Manufacturer (Pfizer, Moderna, etc.)	✅	Important for brand-level comparison of adverse effects
VAX_LOT	Vaccine lot number	❌	Highly specific, often missing, not suitable for broad analysis
VAX_DOSE_SERIES	Dose number (1st, 2nd, booster)	✅	Enables analysis by dose sequence
VAX_ROUTE	Route of administration (e.g., IM)	❌	Too technical, not useful for impact evaluation
VAX_SITE	Injection site (e.g., left/right arm)	❌	Does not influence severity of reaction
VAX_NAME	Full commercial vaccine name	❌	Redundant with VAX_TYPE and VAX_MANU

Data Cleaning, Automation, Merge

Before building the reports, I cleaned each dataset using Python + Pandas in Jupyter Notebook. This included filtering columns, fixing dates, handling missing values, and correcting inconsistencies. For VAERS data, I also automated the cleaning process for all yearly folders and then merged them. Full code is provided above.

1️⃣ FIRST CLEANING OF EACH DATASET STRUCTURE [PYTHON]

OWID Databases

Compact.csv

Column Selection
Dropped 23 columns based on prior evaluation, retaining variables for COVID trends, demographics, testing, vaccination, and mortality.
Initial Filtering by Country
Removed countries with little or poor-quality data. Selected 64 countries across all continents with meaningful data for comparison.
Date Conversion and Filtering
Converted date to datetime. Excluded:
- Dates before March 2020 (pre-pandemic noise)
- Dates beyond June 2025 (future reporting errors)
Duplicate Check
Verified there were no duplicate rows after filtering.
Missing Data Review
Identified high-NaN columns. Instead of dropping them, I flagged them for later consideration in visual analysis and comparisons.
Manual Imputation from Official Sources
Filled human_development_index, life_expectancy, and handwashing_facilities using dictionaries built from UN and World Bank data, mapped by country.
Country-Continent Consistency Check
Ensured each country was consistently mapped to only one valid continent.
Outlier Detection and Handling
- tests_per_case capped at 500 (higher values replaced with NaN)
- Negative reproduction_rate values replaced with NaN
- Reviewed extreme values across all metrics using .describe()
Categorical Field Verification
Reviewed fields like continent for category consistency. Confirmed appropriate cardinality and uniqueness.
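The date filtering, outlier handling, and dictionary-based imputation described above can be sketched as follows. This is a minimal illustration on made-up rows, not the project's notebook code: the function name `clean_compact`, the `LIFE_EXPECTANCY` lookup values, and the exact inclusive date bounds are assumptions.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for Compact.csv (values are made up).
raw = pd.DataFrame(
    {
        "country": ["Spain", "Spain", "Spain", "Chile", "Chile"],
        "date": ["2020-01-15", "2021-03-01", "2026-01-01", "2021-03-01", "2021-03-02"],
        "tests_per_case": [10.0, 750.0, 20.0, 35.0, 40.0],
        "reproduction_rate": [1.1, -0.2, 0.9, 1.3, 0.8],
        "life_expectancy": [np.nan, np.nan, np.nan, 80.0, np.nan],
    }
)

# Hypothetical lookup; the project built its dictionaries from UN / World Bank data.
LIFE_EXPECTANCY = {"Spain": 83.0, "Chile": 80.0}


def clean_compact(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Convert to datetime, then keep March 2020 through June 2025 (bounds assumed inclusive).
    out["date"] = pd.to_datetime(out["date"])
    in_range = out["date"].between(pd.Timestamp("2020-03-01"), pd.Timestamp("2025-06-30"))
    out = out[in_range].copy()
    # Outliers become NaN; the rest of the row is kept.
    out.loc[out["tests_per_case"] > 500, "tests_per_case"] = np.nan
    out.loc[out["reproduction_rate"] < 0, "reproduction_rate"] = np.nan
    # Fill gaps from the per-country lookup, leaving existing values untouched.
    out["life_expectancy"] = out["life_expectancy"].fillna(out["country"].map(LIFE_EXPECTANCY))
    # Duplicate check on the natural key: one row per country per day.
    assert not out.duplicated(subset=["country", "date"]).any()
    return out.reset_index(drop=True)


clean = clean_compact(raw)
print(clean)
```

Replacing outliers with NaN rather than dropping rows keeps the other indicators for that country and day available to the report.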

Vaccinations.csv

Country Filtering
Filtered rows to keep only the countries selected in the main OWID dataset.
Date Formatting and Filtering
Converted the date column to datetime and filtered rows between March 1, 2020 and June 1, 2025.
Duplicates Check
Verified that no duplicate rows were present.
Vaccine Categorization
Created a new column technology to classify each vaccine into one of five types: mRNA, viral vector, inactivated virus, protein subunit, or other.
Missing Data
Confirmed that there were no null values in any column after cleaning.
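The technology column can be derived with a simple lookup. The sketch below is illustrative only: the `TECHNOLOGY_BY_VACCINE` dictionary is an assumed, partial mapping (the project's own dictionary may differ), and the sample rows are made up.

```python
import pandas as pd

# Assumed partial mapping for illustration; the project's own dictionary may differ.
TECHNOLOGY_BY_VACCINE = {
    "Pfizer/BioNTech": "mRNA",
    "Moderna": "mRNA",
    "Oxford/AstraZeneca": "viral vector",
    "Johnson&Johnson": "viral vector",
    "Sputnik V": "viral vector",
    "Sinovac": "inactivated virus",
    "Sinopharm/Beijing": "inactivated virus",
    "Novavax": "protein subunit",
}

doses = pd.DataFrame(
    {
        "country": ["Chile", "Chile", "Spain", "Spain"],
        "vaccine": ["Sinovac", "Pfizer/BioNTech", "Moderna", "Some New Brand"],
        "total_vaccinations": [1000, 800, 500, 10],
    }
)

# Any vaccine missing from the lookup falls into the catch-all "other" category.
doses["technology"] = doses["vaccine"].map(TECHNOLOGY_BY_VACCINE).fillna("other")
print(doses)
```

Using `fillna("other")` also guarantees the new column has no nulls, which matches the missing-data check above.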

VAERS Databases

Data.csv (2021)

Column Filtering
Retained only the 17 relevant columns listed above; dropped the rest.
Duplicates & Formatting
Removed duplicates. Converted all date fields to datetime format.
Categorical Standardization
Normalized values in STATE, replaced "U" with NaN, and fixed inconsistent codes.
Valid Location Filtering
Kept only valid US states and territories (e.g., PR, GU), removing invalid entries.
Binary Fields Cleanup
Replaced NaN in binary fields (DIED, HOSPITAL, etc.) with "N" when absence implied “No”.
Date Range Filtering
Removed records with dates outside a reasonable range (Dec 2020 – Jan 2022).
Numerical Outlier Handling
Dropped rows with implausible hospital stays (over 365 days).
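A few of the steps above (state normalization, binary flag cleanup, hospital-stay outliers) might look like this in pandas. It is a hedged sketch on toy rows: `clean_vaers_data` is a hypothetical helper, and `VALID_STATES` is deliberately truncated for illustration.

```python
import numpy as np
import pandas as pd

VALID_STATES = {"CA", "NY", "TX", "PR", "GU"}  # truncated list, for illustration only
BINARY_FLAGS = ["DIED", "HOSPITAL"]            # the real file has more flag columns

reports = pd.DataFrame(
    {
        "VAERS_ID": [1, 2, 3, 4],
        "STATE": ["ca", "NY", "XX", "PR"],
        "DIED": ["Y", np.nan, np.nan, np.nan],
        "HOSPITAL": [np.nan, "Y", np.nan, "Y"],
        "HOSPDAYS": [np.nan, 3.0, np.nan, 900.0],
    }
)


def clean_vaers_data(df: pd.DataFrame) -> pd.DataFrame:
    out = df.drop_duplicates().copy()
    # Normalize state codes, then keep only recognized states and territories.
    out["STATE"] = out["STATE"].str.strip().str.upper()
    out = out[out["STATE"].isin(VALID_STATES)].copy()
    # In these flag columns a blank means the box was not ticked, so treat it as "N".
    out[BINARY_FLAGS] = out[BINARY_FLAGS].fillna("N")
    # Drop implausible stays; rows with a missing HOSPDAYS are kept.
    out = out[~(out["HOSPDAYS"] > 365)]
    return out.reset_index(drop=True)


cleaned = clean_vaers_data(reports)
print(cleaned)
```

The `~(... > 365)` form is used instead of `<= 365` so that reports without a hospital stay (NaN) are not dropped by accident.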

Symptoms.csv (2021)

Column Pruning
Technical and non-analytical fields such as SYMPTOMVERSION1–5 and ORDER were removed.
Duplicate Removal
Removed duplicates.
Missing Data Analysis
Nulls are expected in SYMPTOM2–5, as not all patients report multiple symptoms. SYMPTOM1 is always present.
Validation
All columns were confirmed to be in the appropriate format (int64 for VAERS_ID, object for symptoms).
Basic Statistical Review
A .describe() analysis was run to check for range anomalies or structural issues.

VAX.csv (2021)

Column Reduction
Removed irrelevant or overly specific fields that added no analytical value to the vaccine data.
COVID-19 Filtering
Only rows where VAX_TYPE contained “COVID19” were retained. Once filtered, VAX_TYPE was dropped as redundant.
Duplicate and Null Handling
Removed duplicates. Only ~0.37% of rows had a missing VAX_DOSE_SERIES, which was acceptable, so those rows were retained.
Dose Classification
VAX_DOSE_SERIES contains values like “1”, “2”, “3”, “7+”, or “UNK”. Given that “UNK” represents over 15% of the data, these entries were preserved for now.
Basic Statistical Review
A .describe() analysis was run to check for range anomalies or structural issues.
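The COVID-19 filter and the dose-label review can be sketched as below. The rows are made up and the variable names (`covid`, `dose_share`) are illustrative, not taken from the project's notebook.

```python
import pandas as pd

# Toy rows standing in for VAERSVAX.csv (values are made up).
vax = pd.DataFrame(
    {
        "VAERS_ID": [10, 10, 11, 12, 13],
        "VAX_TYPE": ["COVID19", "COVID19", "FLU3", "COVID19-2", "COVID19"],
        "VAX_MANU": ["MODERNA", "MODERNA", "SANOFI PASTEUR", "MODERNA", "JANSSEN"],
        "VAX_DOSE_SERIES": ["1", "1", "1", "UNK", None],
    }
)

covid = (
    vax[vax["VAX_TYPE"].str.contains("COVID19", na=False)]  # substring match on the type
    .drop(columns="VAX_TYPE")                               # redundant once filtered
    .drop_duplicates()
    .reset_index(drop=True)
)

# Share of each dose label, missing values included, to decide how to treat "UNK".
dose_share = covid["VAX_DOSE_SERIES"].value_counts(normalize=True, dropna=False)
print(covid)
print(dose_share)
```

A substring match (`str.contains`) rather than strict equality keeps any type label that merely includes “COVID19”, which is how the filter is described above.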

2️⃣ AUTOMATION OF VAERS DATABASES (2020-2025)

To efficiently process the 2020-2025 VAERS datasets, which share the same structure but vary in size and content, I created three Python scripts, one for each file type, to automate the cleaning. This prevents manual repetition and ensures consistency across years. Each script loads all yearly files from its respective folder and applies the exact same preprocessing steps as done manually for 2021. Code above, as always.

Data.csv

For each yearly VAERSDATA.csv file:

Selected only relevant columns (as listed above).
Removed duplicates.
Converted date fields to datetime.
Standardized categorical fields and filtered valid states/territories.
Replaced missing values in binary flags with “N” where appropriate.
Removed rows with implausible dates or hospitalization durations.

Symptoms.csv

For each VAERSSYMPTOMS.csv file:

Kept only VAERS_ID and symptom columns.
Removed duplicates.
Dropped entries with no primary symptom (SYMPTOM1 missing).

VAX.csv

For each VAERSVAX.csv file:

Selected essential columns (e.g., VAERS_ID, VAX_TYPE, VAX_MANU, VAX_DOSE_SERIES).
Removed duplicates.
Filtered only COVID-19-related rows (by VAX_TYPE).
Kept only the first vaccine entry per report (in case of multiple rows per VAERS_ID).
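The general pattern behind these scripts (one cleaning function, applied to every yearly file) can be sketched as follows, using the Symptoms steps as the example. The folder layout `<root>/<year>/<year>VAERSSYMPTOMS.csv`, the helper names, and the Latin-1 encoding are assumptions for this illustration; the demo writes throwaway files to a temporary directory.

```python
import tempfile
from pathlib import Path

import pandas as pd


def clean_symptoms(df: pd.DataFrame) -> pd.DataFrame:
    """Same steps for every year: keep ID + symptom columns, dedupe, require SYMPTOM1."""
    keep = ["VAERS_ID"] + [f"SYMPTOM{i}" for i in range(1, 6)]
    return df[keep].drop_duplicates().dropna(subset=["SYMPTOM1"])


def clean_all_years(root: Path, pattern: str, clean_fn) -> dict:
    """Apply clean_fn to every file under root matching pattern, keyed by folder name."""
    cleaned = {}
    for path in sorted(root.glob(pattern)):
        # Encoding is an assumption; adjust it to match the downloaded files.
        df = pd.read_csv(path, encoding="latin-1", low_memory=False)
        cleaned[path.parent.name] = clean_fn(df)
    return cleaned


# Demo on a throwaway folder tree with two tiny yearly files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for year in ("2020", "2021"):
        folder = root / year
        folder.mkdir()
        pd.DataFrame(
            {
                "VAERS_ID": [1, 1, 2],
                "SYMPTOM1": ["Headache", "Headache", None],
                "SYMPTOMVERSION1": [24.0, 24.0, 24.0],
                **{f"SYMPTOM{i}": [None] * 3 for i in range(2, 6)},
            }
        ).to_csv(folder / f"{year}VAERSSYMPTOMS.csv", index=False)

    by_year = clean_all_years(root, "*/*VAERSSYMPTOMS.csv", clean_symptoms)

print({year: len(df) for year, df in by_year.items()})
```

Passing the cleaning function as an argument means the same loop can serve the Data and VAX files by swapping in their own functions and glob patterns.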

3️⃣ MERGE OF ALL THE VAERS DATABASES (2020-2025)

After cleaning all 18 VAERS datasets (3 data types × 6 years), the next step was to merge them into unified files to simplify the Power BI workflow. Instead of working with 18 separate CSVs, I consolidated everything into just 3 (code above).
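Consolidating the yearly files of one type is essentially a vertical concatenation. The sketch below uses in-memory stand-ins for the cleaned frames; the `source_year` column and the output file name are assumptions added for illustration.

```python
import pandas as pd

# Stand-ins for the per-year cleaned frames of one file type (values are made up).
cleaned_by_year = {
    "2020": pd.DataFrame({"VAERS_ID": [1, 2], "DIED": ["N", "Y"]}),
    "2021": pd.DataFrame({"VAERS_ID": [3, 4, 5], "DIED": ["N", "N", "Y"]}),
}

# Stack the years; the source year is kept as a column so rows stay traceable.
merged = pd.concat(
    [df.assign(source_year=year) for year, df in cleaned_by_year.items()],
    ignore_index=True,
)

# In the Data table, VAERS_ID should remain unique after stacking.
assert merged["VAERS_ID"].is_unique

# merged.to_csv("VAERS_Data_all.csv", index=False)  # hypothetical output name
print(merged)
```

Running the same concatenation once per file type yields the three consolidated CSVs.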

Pre-Power BI: Calculated Columns, DAX

Before diving into Power BI with the dashboard design, I always take the time to plan the analytical structure of the report. Once the datasets are cleaned and I have a good understanding of their structure, I sketch out the key questions I want to answer, the potential dashboards I’ll need to build, and the types of visualizations that might best support those insights. See an example here (don’t laugh please):

As part of this planning phase, I create an initial batch of calculated columns and DAX measures for each dataset. While I often add or fine-tune additional measures later during the visual building process, this early stage preparation helps me work much faster and more consistently.

In total, I prepared 79 DAX measures and 10 calculated columns across five core datasets. These calculations cover a wide range of analytical needs: from time intelligence and cumulative indicators, to epidemiological ratios, severity metrics, symptom-based outcomes, ranking logic, and composite scores. As you may know, all the code is available above.

Dataset	Calculated Columns	DAX Measures
OWID Compact	4	50
Vaccinations Manufacturer	1	6
VAERS Data	4	13
VAERS Symptoms	0 (used unpivoting)	5
VAERS VAX	1	5
Total	10	79

Power BI: Data Modeling, Power Query

Once all datasets were cleaned in Python and exported to .csv, they were imported into Power BI for further processing. At this stage, a few final adjustments were made in Power Query. For example, region identification was refined to ensure countries were properly categorized, and decimal formats were adjusted by rounding excessively long decimals or setting percentage formats.

In the case of the VAERS_SYMPTOMS table, I applied an unpivot transformation to the five symptom columns (SYMPTOM1–5) to consolidate them into a single “Symptom” column. This was essential to rank the most frequent symptoms. I also created a custom date table (Dim_Calendar) to allow consistent time-based analysis across all datasets. M code written above.
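For readers more familiar with pandas than Power Query, the unpivot step is roughly equivalent to `DataFrame.melt`. This is an analogue for illustration only (the project did this step in Power Query), with made-up rows and only three symptom columns.

```python
import pandas as pd

# Wide layout: one row per report, one column per symptom slot (values are made up).
symptoms = pd.DataFrame(
    {
        "VAERS_ID": [1, 2, 3],
        "SYMPTOM1": ["Headache", "Pyrexia", "Headache"],
        "SYMPTOM2": ["Chills", None, "Fatigue"],
        "SYMPTOM3": [None, None, "Chills"],
    }
)

# Long layout: one row per (report, symptom); empty slots are discarded.
long_form = (
    symptoms.melt(id_vars="VAERS_ID", var_name="Slot", value_name="Symptom")
    .dropna(subset=["Symptom"])
    .drop(columns="Slot")
)

# With a single Symptom column, ranking by frequency is a one-liner.
ranking = long_form["Symptom"].value_counts()
print(ranking)
```

In the wide layout, counting a symptom would require scanning five columns; in the long layout it is a single group-and-count.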

This calendar table was linked to OWID_Compact[date] and VAERS_Data[ONSET_DATE]. The fact tables are OWID_Compact and VAERS_Data, each with many entries per day, so the relationships were one-to-many (from Dim_Calendar). To keep the data model clean, I hid all technical linking fields (e.g., country, VAERS_ID, date) from the report view except the ones from fact tables. Additional model relationships:

OWID_Compact to OWID_Vacc via country (inactive relationship to avoid ambiguity)
VAERS_Data to VAERS_Symptoms and VAERS_VAX via VAERS_ID
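The one-to-many relationship from the calendar to a fact table can be illustrated in pandas. This is a rough analogue, not the M code used in the project: `build_calendar`, its columns, and the date range are assumptions.

```python
import pandas as pd


def build_calendar(start: str, end: str) -> pd.DataFrame:
    """One row per day with a few common time attributes."""
    dates = pd.date_range(start, end, freq="D")
    return pd.DataFrame(
        {
            "Date": dates,
            "Year": dates.year,
            "Quarter": dates.quarter,
            "Month": dates.month,
        }
    )


dim_calendar = build_calendar("2020-03-01", "2025-06-30")  # range assumed

# Toy fact rows: several reports can share the same day.
facts = pd.DataFrame(
    {
        "VAERS_ID": [1, 2, 3],
        "ONSET_DATE": pd.to_datetime(["2021-01-05", "2021-01-05", "2021-02-01"]),
    }
)

# validate="many_to_one" raises if the calendar side has duplicate dates,
# which is the same guarantee a one-to-many relationship relies on.
joined = facts.merge(
    dim_calendar, left_on="ONSET_DATE", right_on="Date", how="left", validate="many_to_one"
)
print(joined)
```

Because every fact table joins to the same calendar, time filters behave identically across the OWID and VAERS sides of the model.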

Key Considerations, Project Boundaries

While analyzing the VAERS database, it is crucial to emphasize that this dataset must never be used to infer direct causality. The presence of adverse events following vaccination does not imply that the vaccine caused them. The VAERS system is designed to detect safety signals and patterns, not to confirm causal relationships between vaccines and health outcomes.

Additionally, there are important limitations in the availability and accuracy of international data:

Underreporting in low-resource countries: The WHO estimates that some African nations underreport COVID-19 deaths by a factor of 8 to 10, due to limited testing, healthcare access, and incomplete data systems.
Political or social suppression: In countries like Russia, data may be intentionally withheld or manipulated for political reasons, which prevents accurate figures from appearing in open-source platforms like OWID.
Lack of monitoring infrastructure: Many regions lack reliable epidemiological systems, making it impossible to track key metrics such as deaths or case rates.

These limitations can significantly distort indicators like “average deaths per million” or country rankings. As a result, some findings may reflect data availability rather than the true impact of the pandemic, and should therefore be interpreted with caution.

This project was designed without a specific business objective in mind, simply because no stakeholder requested a concrete analysis. Instead of answering a predefined question (e.g., “Should we invest in X?” or “How did sales evolve compared to last year?”), the goal was to build a general and flexible reporting system capable of extracting as much relevant information as possible from the datasets. This approach aims to showcase analytical depth and technical execution, while maintaining clarity and design integrity—avoiding meaningless charts or overwhelming dashboards.

Project Results, Official Cross-checking

This section presents the main findings from the OWID and VAERS datasets, along with comparisons to official sources such as WHO or NCBI. Verifying results with external data is key to detecting biases, validating patterns, and ensuring that conclusions are grounded in reality.

1️⃣ Mortality and Case Comparison by Country | OWID

While some findings may be expected (like the total number of deaths being highest in the U.S., Brazil, and India due to population size), we can start comparing them to total reported cases, where China, the U.S., and India lead. This helps identify countries where high case counts didn’t translate to equally high death counts, such as France.

Cross-checking: Correct ✅ [Our World in Data global COVID-19 death counts]

The Case Fatality Rate metric shows higher values in continents like Africa and South America, as expected given well-known issues such as underreporting, limited healthcare infrastructure, and low testing capacity. However, what stands out is the unexpectedly low fatality rate in many Asian countries, especially India, which ranks 58th out of 64 in hospital capacity, yet reports relatively few COVID deaths.

Cross-checking: Correct ✅ [Our World in Data reported case fatality rates]
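For reference, the Case Fatality Rate is cumulative deaths divided by cumulative confirmed cases. The project computes it with DAX; the pandas sketch below only illustrates the arithmetic on made-up figures, and the column name `cfr_pct` is an assumption.

```python
import pandas as pd

# Latest cumulative figures per country (values are made up).
latest = pd.DataFrame(
    {
        "country": ["A", "B", "C"],
        "total_cases": [1_000_000, 250_000, 0],
        "total_deaths": [12_000, 7_500, 0],
    }
)

# CFR (%) = cumulative deaths / cumulative confirmed cases * 100.
# Rows with zero cases are masked so the division yields NaN instead of an error or inf.
cases = latest["total_cases"].where(latest["total_cases"] > 0)
latest["cfr_pct"] = (latest["total_deaths"] / cases * 100).round(2)
print(latest)
```

Since the denominator is confirmed cases, low testing inflates CFR while underreported deaths deflate it, which is why the metric is so sensitive to reporting quality.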

This discrepancy strongly suggests systemic underreporting or data inconsistency, rather than actual low mortality. In contrast, countries like Peru (3.44), Bulgaria (2.96), and Hungary (2.64) report high daily deaths per million, which may reflect more transparent and honest reporting.

Cross-checking: Correct ✅ [Our World in Data Mortality]

2️⃣ Infection Spread and Restriction Levels | OWID

Countries like France, Germany, and Austria had the highest new cases per million, showing a strong spread of the virus in Europe, though not necessarily a corresponding death toll.

Cross-checking: Correct ✅ [Our World in Data Cases]

Regarding stringency of restrictions, countries like China, Iran, and India maintained consistently high levels of control throughout the pandemic, sometimes even beyond the critical phases.

Cross-checking: Correct ✅ [Oxford Stringency Index]

3️⃣ Vaccination and Testing Efficiency | OWID

Surprisingly, Chile had the highest number of fully vaccinated people per 100 inhabitants in 2025, followed closely by China, Malta, Vietnam, and Peru—indicating not just strong campaigns but also high compliance and completion.

Cross-checking: Correct ✅ [Our World in Data Vaccinated]

As for testing efficiency, the number of tests per confirmed case indicates how extensively a country tests its population relative to actual infections. A higher value reflects broader, more proactive testing, even for mild or asymptomatic individuals. In this metric, China clearly leads, followed by New Zealand and Denmark. On the other end, countries like Brazil, Ecuador, and Mexico show some of the lowest test-per-case ratios. Not all countries are represented in the dataset, so these may not be the worst globally.

Cross-checking: Correct ✅ [Our World in Data Testing]

4️⃣ Temporal Evolution of the Pandemic | OWID

The deadliest period was the first quarter of 2021, accounting for nearly 47% of total global deaths (2.74 million), likely due to gatherings around Christmas and the emergence of new variants.

Cross-checking: Correct ✅ [WHO Covid-19 Timeline]

Interestingly, 2022 recorded over twice as many reported cases as 2021 (394 million vs. 158 million), yet mortality rates dropped significantly. This trend reflects the impact of widespread vaccination in 2021, which helped protect against severe disease, even though it didn’t fully prevent infections. In other words, more people got infected, but far fewer died—a clear indicator of vaccine effectiveness on a global scale.

Cross-checking: Clarification ❌ It’s true 2022 saw over twice the cases of 2021, but this surge was due to the Omicron variant and relaxation of restrictions, not because prior vaccination increased infections [OpenVAERS Omicron]

5️⃣ Vaccine Types and Usage by Country | OWID

Globally, Pfizer/BioNTech was by far the most used vaccine (1.5 billion doses, ~65%), followed by Moderna and Oxford/AstraZeneca. The most used vaccines (Pfizer and Moderna) are mRNA-based, largely due to production simplicity and early rollout.

Cross-checking: Incomplete ❌ Pfizer/BioNTech was widely used, but global vaccine distribution included massive use of Sinovac, Sinopharm, AstraZeneca, and others, so Pfizer’s share was <65% [Global manufacturer dose counts]

Usage patterns varied by country:

The U.S. used almost exclusively Pfizer, Moderna, and Johnson & Johnson
Spain also used AstraZeneca alongside Pfizer
South American countries (Argentina, Peru, Chile, Ecuador) relied heavily on Sinopharm/Beijing and Sputnik V, using inactivated virus or viral vector platforms.

Cross-checking: Correct ✅ [WHO, National Vaccine rollout records]

1️⃣ General Overview and Demographics | VAERS

Disclaimer: These results do not attempt to establish any causal link between vaccines and adverse effects. This is purely a BI exercise based on voluntary reports submitted in the U.S. from 2020 to mid-2025. Interpret with caution.

From 2020 to June 2025, over 807,000 VAERS reports were filed, including 7,660 deaths (0.95%), 5.94% hospitalizations, and 1.67% cases resulting in disability. Notably, 67% of reports came from women, yet they accounted for only 43.8% of total deaths.

Cross-checking: Partially incorrect ❌ VAERS received the majority of its reports from women, which aligns with the data, but the totals here for deaths (7,660) and reports (807k) are much lower than CDC/OpenVAERS official totals (1.5 million+ reports and ~16–38k deaths) [OpenVAERS/CDC official statistics]

Age analysis shows expected trends: older individuals had higher mortality, hospitalization, and disability rates. Surprisingly, hospital stays across age groups ranged from 4 to 6.3 days, more uniform than expected. Also, the time from vaccination to death and symptom onset to death increased with age, possibly reflecting different physiological responses between age groups.

Cross-checking: Correct ✅ [VAERS demographic analysis]

2️⃣ State-Level Differences in the U.S. | VAERS

When analyzing U.S. states, some showed much higher mortality rates: Kentucky (4.05%), South Dakota (3.54%), and Tennessee (2.79%)—likely due to poorer hospitalization management or systemic issues.

Cross-checking: Unverified ❌ These state‑by‑state VAERS mortality percentages are not confirmed by official data, and attributing this to systemic hospital issues is speculative [VAERS lack of state-level corroborating reports]

3️⃣ Symptom Frequency and Severity | VAERS

The most commonly reported symptoms were headache, pyrexia, fatigue, pain, and chills, each linked to a mortality rate below 0.6%. Some rare symptoms had 100% mortality, but these were based on 1–2 reports, often marked as “death” itself, and were excluded from visualizations.

Cross-checking: Correct ✅ [VAERS symptom frequency]

Graphs focus on symptoms with more than 1,000 reports for meaningful insights:

Most lethal symptoms: “unresponsive to stimuli,” “chest X-ray abnormal,” “intensive care,” “pneumonia,” and “hypoxia.”
Most associated with disability: “cerebrovascular accident,” “MRI head abnormal,” “Guillain-Barré syndrome.”
Most linked to hospitalization: “pulmonary embolism,” “cerebrovascular accident,” “chest X-ray abnormal,” “troponin increased.”

Cross-checking: Correct ✅ [VAERS reports]
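The minimum-report threshold matters because a rate computed on one or two reports is not meaningful. The pandas sketch below illustrates the idea on made-up rows (the project implements it with DAX measures); `MIN_REPORTS` is set to 1 here only so the toy data passes the filter, whereas the report uses 1,000.

```python
import pandas as pd

# One row per (report, symptom) with the report's outcome flag (values are made up).
long_form = pd.DataFrame(
    {
        "VAERS_ID": [1, 2, 3, 4, 5, 6],
        "Symptom": ["Headache", "Headache", "Headache", "Hypoxia", "Hypoxia", "Rare event"],
        "DIED": ["N", "N", "Y", "Y", "Y", "Y"],
    }
)

MIN_REPORTS = 1  # toy threshold; the report only shows symptoms with more than 1,000 reports

stats = (
    long_form.assign(died=long_form["DIED"].eq("Y"))
    .groupby("Symptom")
    .agg(reports=("VAERS_ID", "nunique"), mortality_pct=("died", "mean"))
)
stats["mortality_pct"] = (stats["mortality_pct"] * 100).round(1)

# Keep only symptoms with enough reports, then rank by share of reports with a death.
ranked = stats[stats["reports"] > MIN_REPORTS].sort_values("mortality_pct", ascending=False)
print(ranked)
```

These percentages describe reports that mention a symptom and also record a death; as stated in the disclaimer, they do not indicate that the symptom or the vaccine caused the outcome.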

4️⃣ Mortality and Severity by Vaccine | VAERS

Most VAERS reports relate to Pfizer and Moderna, as expected from their usage rates in the U.S. However, these were not the vaccines with the highest death or disability percentages.
Some vaccines with fewer total reports had worse outcome ratios. These insights are left for personal interpretation: the visuals are provided without numeric conclusions.

Cross-checking: Correct ✅ [CDC vaccine safety findings]

Conclusions

This project has been both a technical and personal challenge. It pushed me to apply advanced data cleaning, modeling, and visualization skills using real-world health data. Despite its complexity and lack of a predefined question, I aimed to extract as much relevant insight as possible while maintaining clarity and analytical rigor. I’m proud of the result—not only as a data analyst, but as someone committed to turning raw data into knowledge.

Any feedback is welcome, and if you’d like to connect or collaborate, feel free to contact me!