eBird Data Preparation

Preparing my personal checklist data for analysis

Part 1 of several posts about my eBird data

As an avid birdwatcher, I keep detailed checklists of all of my birding outings, as well as some outings in which I wasn't birding but happened to see a bird of interest. While notebooks (especially Rite in the Rain notebooks) are easy to use, I then have the responsibility of transcribing that information into either a spreadsheet or the Cornell Lab of Ornithology's eBird website, which doubles as a personal checklist tracker and an amazing database for birders looking for birds and scientists looking to study avian trends. Luckily, eBird created an app that allows you to make checklists in the field! Now collecting data from checklists is easier than ever.

I downloaded my eBird checklists (which don't 100% reflect my birding endeavors but are close enough) in order to look at personal trends and perhaps even my progression as a birder! I'm writing the code along with this, but it is meant to be digestible for the public at large, so hopefully everyone will be able to enjoy it as I brag about the birds I've seen and the places I've been, and lament my downfalls as a birder (spoiler alert: I get lazy sometimes). Please feel free to read as much or as little as you care to; I will probably digress into irrelevant details at least a few times.

This first post is VERY skippable for most readers: I am going to break down my process for getting the data ready for exploration and analysis.

The first thing I'm going to do is import the pandas, numpy, seaborn, and os libraries/modules so I can use their awesome tools to better wrangle my data! We will call in some other tools as we need them. After my imports, I'm going to create a relative path using os.getcwd() and os.path.join(), which assumes that the Jupyter Notebook will always be in the same directory as my data. Even though all of this data is being manipulated in memory, I have still created a backup copy in the same directory just in case!

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import os
In [2]:
current_path = os.getcwd()
file_path = os.path.join(current_path, "MyEBirdData.csv")
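
Speaking of that backup copy: I made mine by hand, but if you would rather script it, here is a minimal sketch using the standard-library shutil module (the backup filename below is just a placeholder, not part of my actual workflow):

import shutil

# Copy the raw csv to a backup file before any manipulation happens
backup_path = os.path.join(current_path, "MyEBirdData_backup.csv")
shutil.copyfile(file_path, backup_path)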

We will use pandas (renamed pd) to import the csv into memory (RAM) from the directory. There are a plethora of options, but for now we are really interested in one option that will save us some headache down the road.

Dates and times are going to be very important in looking at trends in this dataset, because the suites of birds on checklists and even my birding habits change throughout the year, and I go birding at many different times of day. If we convert these dates and times as we are importing the csv, we can save ourselves some trouble later. We know that our checklist dataset has a column for Date and a column for Time, and we can parse those dates/times in the DataFrame by adding the parse_dates argument. The options for parse_dates (for our dataset) are as follows:

  1. True : Try to parse the entire index
  2. [ 'Date', 'Time' ] : Parse columns 'Date' and 'Time' each as a separate column
  3. [ [ 'Date', 'Time' ] ] : Combine columns 'Date' and 'Time' and parse as a single column
  4. { 'checklist_date_time' : [ 'Date', 'Time' ] } : Combine columns 'Date' and 'Time' and name the result 'checklist_date_time'

Let's try out options 2 through 4. We'll skip option 1 because we have values other than dates in the rest of the DataFrame, which means an error will be thrown upon execution.

Import the data and parse dates

Option 2

This option will individually take the 'Date' and 'Time' columns and convert them into datetime objects. Perhaps unexpected here is that it has evaluated the 'Time' column with today's date (the date I ran the import) but the original checklist's time. This doesn't seem like the option we want, considering the date in the 'Time' column is incorrect and we can parse both dates and times from a combined datetime object anyway. This option is more useful for DataFrames in which there are many columns with dates.

In [3]:
df = pd.read_csv(file_path, parse_dates=['Date','Time'])
print(df.dtypes, df['Date'].head(3), df['Time'].head(3), sep='\n\n')
Submission ID                     object
Common Name                       object
Scientific Name                   object
Taxonomic Order                    int64
Count                             object
State/Province                    object
County                            object
Location                          object
Latitude                         float64
Longitude                        float64
Date                      datetime64[ns]
Time                      datetime64[ns]
Protocol                          object
Duration (Min)                   float64
All Obs Reported                   int64
Distance Traveled (km)           float64
Area Covered (ha)                float64
Number of Observers                int64
Breeding Code                     object
Species Comments                  object
Checklist Comments                object
dtype: object

0   2016-05-14
1   2016-10-30
2   2017-01-16
Name: Date, dtype: datetime64[ns]

0   2019-02-01 12:20:00
1   2019-02-01 11:00:00
2   2019-02-01 11:30:00
Name: Time, dtype: datetime64[ns]

Option 3

This option will grab the 'Date' and 'Time' columns, combine them into a single column, and convert it to a datetime object. When we take a peek at our newly created column (which has noticeably been moved to the first position), we see this format is better for us.

In [4]:
df = pd.read_csv(file_path, parse_dates=[['Date','Time']])
print(df.dtypes, '\n\n', df['Date_Time'].head(3))
Date_Time                 datetime64[ns]
Submission ID                     object
Common Name                       object
Scientific Name                   object
Taxonomic Order                    int64
Count                             object
State/Province                    object
County                            object
Location                          object
Latitude                         float64
Longitude                        float64
Protocol                          object
Duration (Min)                   float64
All Obs Reported                   int64
Distance Traveled (km)           float64
Area Covered (ha)                float64
Number of Observers                int64
Breeding Code                     object
Species Comments                  object
Checklist Comments                object
dtype: object 

 0   2016-05-14 12:20:00
1   2016-10-30 11:00:00
2   2017-01-16 11:30:00
Name: Date_Time, dtype: datetime64[ns]

Option 4

This option grabbed our 'Date' and 'Time' columns and combined them into one column, this time named 'checklist_date_time'. This makes our lives easier because the column name is more descriptive and we can still parse dates and times from the column information. Let's continue on with this DataFrame option.

In [5]:
df = pd.read_csv(file_path, parse_dates={'checklist_date_time' : ['Date','Time']})
print(df.dtypes, '\n\n', df['checklist_date_time'].head(3))
checklist_date_time       datetime64[ns]
Submission ID                     object
Common Name                       object
Scientific Name                   object
Taxonomic Order                    int64
Count                             object
State/Province                    object
County                            object
Location                          object
Latitude                         float64
Longitude                        float64
Protocol                          object
Duration (Min)                   float64
All Obs Reported                   int64
Distance Traveled (km)           float64
Area Covered (ha)                float64
Number of Observers                int64
Breeding Code                     object
Species Comments                  object
Checklist Comments                object
dtype: object 

 0   2016-05-14 12:20:00
1   2016-10-30 11:00:00
2   2017-01-16 11:30:00
Name: checklist_date_time, dtype: datetime64[ns]

Dimension reduction and variable standardization

Now that we've chosen our method for consolidating the 'Date' and 'Time' columns, let's take a deeper dive into what each variable means. As we look for missing data, this will guide us as to which variables aren't necessary.

  • checklist_date_time : The date and time the checklist was started
  • Submission ID : The unique checklist ID assigned by eBird
  • Common Name : The North American common name for the species
  • Scientific Name : The genus and species name for the species
  • Taxonomic Order : The number given to the species for ordering purposes
  • Count : The number of individuals of the species reported on the checklist
  • State/Province : The U.S. State or Foreign "Province" the checklist occurs in
  • County : The county in which the checklist occurs
  • Location : The location name, sometimes called Hotspot, in which the checklist occurs
  • Latitude : The latitude value that estimates the Hotspot location
  • Longitude : The longitude value that estimates the Hotspot location
  • Protocol : The type of birding excursion, i.e. Traveling, Stationary, Historical, Incidental
  • Duration (Min) : The length of the birding excursion
  • All Obs Reported : A yes or no value for whether all bird observations were reported
  • Distance Traveled (km) : The distance traveled during the birding excursion
  • Area Covered (ha) : The area covered by the birding excursion (unclear how this calculation is done)
  • Number of Observers : The number of observers in the birding party
  • Breeding Code : Supplemental information given about the suspected breeding status of the bird
  • Species Comments : Supplemental information given about the species, especially for rare bird ID
  • Checklist Comments : Supplemental information given about the checklist as a whole, i.e. specific location, non-avian species, weather, etc.

Next let's look for missing values in the dataset. We can do this by using df.isna().sum(). This will check each column for values that are 'NaN', or not a number. I want to check for large proportions of missing values in columns because I would like to standardize column names and don't want to waste time renaming columns that I am going to drop or not use.

In [6]:
df.isna().sum()
Out[6]:
checklist_date_time          0
Submission ID                0
Common Name                  0
Scientific Name              0
Taxonomic Order              0
Count                        0
State/Province               0
County                     115
Location                     0
Latitude                     0
Longitude                    0
Protocol                     0
Duration (Min)              28
All Obs Reported             0
Distance Traveled (km)    1829
Area Covered (ha)         9006
Number of Observers          0
Breeding Code             8980
Species Comments          8557
Checklist Comments        2817
dtype: int64

It looks like Area Covered (ha) has no valid values in it, so that'll be on the chopping block. Breeding Code is close to completely empty as well, and we aren't interested in breeding codes for this exploration anyway. I'm also going to remove Species Comments and Checklist Comments because I'm not interested in them at the moment, and most of the Checklist Comments just record which app (and app version) I used to submit the checklist from my phone.
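
Since I keep talking about proportions while isna().sum() returns raw counts, here is a quick sketch (using the same df) that expresses the missingness as a fraction of rows instead; the mean of the boolean mask is the proportion of missing values:

# Fraction of missing values per column, largest first
missing_fraction = df.isna().mean()
print(missing_fraction.sort_values(ascending=False).head())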

What's important to remember going forward is that eBird has given me my data broken down by species observation, not by checklist. This means there will be many duplicates in the Submission ID column, because each checklist's ID is repeated once for every species reported on it.
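
A quick sketch to confirm that structure, comparing total rows (species observations) with the number of unique checklists; the exact numbers will depend on your own export:

# One row per species observation, so Submission IDs repeat per checklist
total_rows = len(df)
unique_checklists = df['Submission ID'].nunique()
print(total_rows, 'rows across', unique_checklists, 'unique checklists')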

Using df.drop(columns=...), I can drop multiple columns from my dataset by name. Notice that I'm not using inplace=True; reassigning the result to a new name keeps the original df around so I can keep iterating in this Notebook. I'm not sure what the official guidance on inplace=True is, but with a small dataset like this the extra copy won't matter much; avoiding multiple copies is probably only important when the dataset is massive enough that duplicates would exceed your RAM. For me the reassignment is fine!

In [7]:
df1 = df.drop(columns=['Area Covered (ha)', 'Breeding Code', 'Species Comments', 'Checklist Comments'])
df1.head(3)
Out[7]:
checklist_date_time Submission ID Common Name Scientific Name Taxonomic Order Count State/Province County Location Latitude Longitude Protocol Duration (Min) All Obs Reported Distance Traveled (km) Number of Observers
0 2016-05-14 12:20:00 S29653771 Snow Goose Anser caerulescens 239 1 US-OR Crook Houston Lakes 44.294121 -120.989342 eBird - Traveling Count 30.0 1 1.609 2
1 2016-10-30 11:00:00 S32320173 Snow Goose Anser caerulescens 239 75 US-OR Crook Houston Lakes 44.294121 -120.989342 eBird - Stationary Count 50.0 1 NaN 2
2 2017-01-16 11:30:00 S33930374 Snow Goose Anser caerulescens 239 8000 US-CA Glenn Sacramento NWR--Auto Tour Route 39.428438 -122.156500 eBird - Traveling Count 120.0 1 4.828 2
In [8]:
df1.shape
Out[8]:
(9006, 16)

As we see with the call to shape above, we've gotten rid of the four columns we didn't want, so now it's time to standardize the column names that have spaces or other characters, using all-lowercase snake_case. For this job, we can pull and alter the example from the pandas documentation: df.rename(index=str, columns={"A": "a", "B": "c"}). I'll write a little for-loop that will take care of all of the cases I have.

In [9]:
for column in df1.columns:
    if ' ' in column:
        # Replace spaces with underscores and lowercase the name
        x_joined = '_'.join(column.lower().split())
        df1 = df1.rename(index=str, columns={column : x_joined})
    elif '/' in column:
        # Replace the slash with an underscore and lowercase the name
        x_joined = column.lower().replace('/', '_')
        df1 = df1.rename(index=str, columns={column : x_joined})
    else:
        # Just lowercase the name
        df1 = df1.rename(index=str, columns={column : column.lower()})

df1.head(3)
Out[9]:
checklist_date_time submission_id common_name scientific_name taxonomic_order count state_province county location latitude longitude protocol duration_(min) all_obs_reported distance_traveled_(km) number_of_observers
0 2016-05-14 12:20:00 S29653771 Snow Goose Anser caerulescens 239 1 US-OR Crook Houston Lakes 44.294121 -120.989342 eBird - Traveling Count 30.0 1 1.609 2
1 2016-10-30 11:00:00 S32320173 Snow Goose Anser caerulescens 239 75 US-OR Crook Houston Lakes 44.294121 -120.989342 eBird - Stationary Count 50.0 1 NaN 2
2 2017-01-16 11:30:00 S33930374 Snow Goose Anser caerulescens 239 8000 US-CA Glenn Sacramento NWR--Auto Tour Route 39.428438 -122.156500 eBird - Traveling Count 120.0 1 4.828 2
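
As a side note, the same renaming could be done more compactly with a list comprehension; this is just an alternative sketch that assumes spaces and slashes are the only characters we care about (like the loop above, it leaves the parentheses alone):

# Lowercase every column name and swap spaces/slashes for underscores
df1.columns = [c.lower().replace(' ', '_').replace('/', '_') for c in df1.columns]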

To summarize what I did to prepare the data:

  1. Combine and parse the 'Date' and 'Time' columns upon import
  2. Look for and remove columns with large proportions of missing values
  3. Standardize column names to lowercase and snake_case

In my next post, I'll be exploring the data a little bit. To save having to do this cleaning every time I want to use the dataset, I put all of the steps into a script that I can import to get my cleaned df back with one line! So convenient!
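
For the curious, here is a rough sketch of what such a script could look like; the module name prepare_ebird.py and the function name load_ebird_data are placeholders rather than my actual script:

# prepare_ebird.py -- hypothetical helper module collecting the steps above
import os
import pandas as pd

def load_ebird_data(filename='MyEBirdData.csv'):
    """Read the eBird export and return a cleaned DataFrame."""
    file_path = os.path.join(os.getcwd(), filename)
    # Combine 'Date' and 'Time' into one parsed datetime column on import
    df = pd.read_csv(file_path,
                     parse_dates={'checklist_date_time': ['Date', 'Time']})
    # Drop the mostly-empty / unused columns
    df = df.drop(columns=['Area Covered (ha)', 'Breeding Code',
                          'Species Comments', 'Checklist Comments'])
    # Standardize column names to lowercase snake_case
    df.columns = [c.lower().replace(' ', '_').replace('/', '_') for c in df.columns]
    return df

Then in the next notebook it's just from prepare_ebird import load_ebird_data, followed by df = load_ebird_data().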