As an avid birdwatcher, I keep detailed checklists of all of my birding outings, as well as some outings in which I wasn't birding but happened to see a bird of interest. While notebooks (and especially Rite in the Rain notebooks) are easy to use, I then have the responsibility of transcribing that information into either a spreadsheet or the Cornell Lab of Ornithology's eBird website, which doubles as a personal checklist tracker and an amazing database for birders looking for birds and for scientists studying avian trends. Luckily, eBird created an app that allows you to make checklists in the field! Now collecting data from checklists is easier than ever.
I downloaded my eBird checklists (which don't 100% reflect my birding endeavors but are close enough) in order to look at personal trends and perhaps even my progression as a birder! I'm writing the code along with this, but it is meant to be digestible for the public "at large", so hopefully everyone will be able to enjoy it as I brag about the birds I've seen and the places I've been, but also lament my downfalls as a birder (spoiler alert: I get lazy sometimes). Please feel free to read as much or as little as you care to; I will probably digress into irrelevant details at least a few times.
This first post is VERY skippable for most readers: I am going to break down my process for getting the data ready for exploration and analysis.
The first thing I'm going to do is import the pandas, numpy, seaborn, and os libraries/modules so I can use their awesome tools to better wrangle my data! We will call in some other tools later, as we need them. After the imports, I'm going to create a relative path using os.getcwd() and os.path.join(), which assumes that the Jupyter Notebook will always be in the same directory as my data. Even though all of this data is being manipulated in memory, I have still created a backup copy of the csv in the same directory just in case!
import pandas as pd
import numpy as np
import seaborn as sns
import os
current_path = os.getcwd()
file_path = os.path.join(current_path, "MyEBirdData.csv")
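Since that path assumes the notebook and the csv live side by side, a quick sanity check (just a sketch using the file_path we built above) can save a confusing error later:
# Make sure the csv is actually sitting next to this notebook before we try to read it
if not os.path.exists(file_path):
    raise FileNotFoundError(f"Couldn't find MyEBirdData.csv at {file_path}")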
We will use pandas (imported as pd) to read the csv into memory (RAM) from the directory. There are a plethora of options for read_csv, but for now we are really interested in one option that will save us some headache down the road.
Dates and times are going to be very important in looking at trends in this dataset, because the suites of birds on checklists and even my birding habits change throughout the year, and I go birding at many different times of day. If we convert these dates and times as we are importing the csv, we can save ourselves some trouble down the road. We know that our checklist dataset has a column for Date and a column for Time, and we can parse those dates/times in the DataFrame by adding the parse_dates argument. The options for parse_dates (for our dataset) are as follows:
1. True : Try to parse the entire index
2. ['Date', 'Time'] : Parse columns 'Date' and 'Time' each as its own column
3. [['Date', 'Time']] : Combine columns 'Date' and 'Time' and parse them as a single column
4. {'Checklist_Date_Time' : ['Date', 'Time']} : Combine columns 'Date' and 'Time' and name the result Checklist_Date_Time
Let's try out options 2 through 4. We'll skip option 1 because we have values other than dates in the rest of the DataFrame, which means an error will be thrown upon execution.
This option individually takes the 'Date' and 'Time' columns and converts each one into a datetime object. Perhaps unexpected here is that it has evaluated the 'Time' column with today's date but the original checklist's time. This doesn't seem like the option we want, considering the date attached to the 'Time' column is incorrect and we can always parse dates and times back out of a combined datetime object. This option is more useful for DataFrames in which there are many columns with dates.
df = pd.read_csv(file_path, parse_dates=['Date','Time'])
print(df.dtypes, df['Date'].head(3), df['Time'].head(3), sep='\n\n')
This option will grab columns 'Date' and 'Time' and combine them into a single column, then convert them to a datetime object. When we take a peek at our newly created column (which has noticeably been moved to the first column), we see this format is better for us.
df = pd.read_csv(file_path, parse_dates=[['Date','Time']])
print(df.dtypes, '\n\n', df['Date_Time'].head(3))
This option grabbed our 'Date' and 'Time' columns and combined them into one column, then named the combined column 'checklist_date_time' (the key we supplied in the dictionary). This makes our lives easier because the column name is more descriptive and we are still able to parse dates and times from the column information. Let's continue on with this DataFrame option.
df = pd.read_csv(file_path, parse_dates={'checklist_date_time' : ['Date','Time']})
print(df.dtypes, '\n\n', df['checklist_date_time'].head(3))
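Just to illustrate that last point, here's a quick sketch (not something we need to keep) showing how pandas' .dt accessor can pull the individual pieces back out of the combined column:
# The combined column is a real datetime, so the parts are easy to recover
print(df['checklist_date_time'].dt.year.head(3))   # year of each checklist
print(df['checklist_date_time'].dt.month.head(3))  # month of each checklist
print(df['checklist_date_time'].dt.hour.head(3))   # hour the checklist started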
Now that we've chosen our method for consolidating the 'Date' and 'Time' columns, let's take a deeper dive into what each variable means. As we look for missing data, this will guide us as to which variables aren't necessary.
Next let's look for missing values in the dataset. We can do this by using df.isna().sum(), which checks each column for values that are NaN ('Not a Number', the marker pandas uses for missing data). I want to check for large proportions of missing values because I would like to standardize the column names and don't want to waste time renaming columns that I am going to drop or not use.
df.isna().sum()
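Since what I really care about is the proportion of missing values rather than the raw counts, a small extra step (just a sketch on the same DataFrame) makes the nearly-empty columns jump out:
# Percentage of missing values per column, largest first
missing_pct = (df.isna().mean() * 100).round(1).sort_values(ascending=False)
print(missing_pct)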
It looks like Area Covered (ha) has no valid values in it, so that'll be on the chopping block. Breeding Code is close to completely empty as well, and we aren't interested in Breeding Codes for this exploration anyway. I'm also going to remove Species Comments and Checklist Comments because I'm not interested in them at this moment, and most of the Checklist Comments just record which version of the mobile app I used to submit the checklist.
What's important to remember going forward is that eBird has given me my checklists broken down by species and not by checklist. This means that I will have many duplicates in the Submission ID column because of the number of species reported per checklist.
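A quick way to see that duplication (sketched here with the raw export's column names, before any renaming) is to compare the total number of rows to the number of unique submissions:
# Each row is one species on one checklist, so there are far more rows
# than there are unique checklists (Submission IDs)
print('rows (species records):', len(df))
print('unique checklists:', df['Submission ID'].nunique())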
Using df.drop(columns=...), I can drop multiple columns from my dataset using their names. Notice that I won't use inplace=True, because then I wouldn't be able to re-run and iterate on cells in this Notebook. I'm not sure about the official guidance on using inplace=True, but with a smaller dataset like this it won't matter much; it's probably only advisable when your dataset is so massive that keeping multiple copies would exceed your RAM. For me the reassignment is fine!
df1 = df.drop(columns=['Area Covered (ha)', 'Breeding Code', 'Species Comments', 'Checklist Comments'])
df1.head(3)
df1.shape
As we can see from the call to shape above, we've gotten rid of the four columns we didn't want, so now it's time to standardize the column names that have spaces or other characters: we want all-lowercase snake_case. For this job, we can pull and alter the example from the pandas documentation: df.rename(index=str, columns={"A": "a", "B": "c"}). I'll write a little for-loop that will take care of all of the cases I have.
# Standardize column names: lowercase everything and replace spaces or
# slashes with underscores (e.g. 'State/Province' -> 'state_province')
for column in df1.columns:
    if ' ' in column:
        x = column.lower().split()
        x_joined = '_'.join(x)
        df1 = df1.rename(index=str, columns={column : x_joined})
    elif '/' in column:
        x = column.lower().split('/')
        x_joined = '_'.join(x)
        df1 = df1.rename(index=str, columns={column : x_joined})
    else:
        x = column.lower()
        df1 = df1.rename(index=str, columns={column : x})
df1.head(3)
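For what it's worth, the same result could be had in one pass with a list comprehension over the column names; this is just an equivalent sketch, not what I ran above:
# Lowercase every column name and swap spaces and slashes for underscores in one go
df1.columns = [col.lower().replace(' ', '_').replace('/', '_') for col in df1.columns]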
To summarize what I did to prepare the data: I read the csv with the 'Date' and 'Time' columns combined into a single checklist_date_time column, checked each column for missing values, dropped the four columns I didn't need, and standardized the remaining column names to lowercase snake_case.
In my next post, I'll be exploring the data a little bit. To save having to do this cleaning every time I want to use the dataset, I put all of the steps into a script that I can import and get my cleaned df back with one line! So convenient!
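In case you're curious, that script is nothing fancy; a stripped-down sketch of it (the file and function names here are just placeholders, not necessarily the exact ones I used) looks roughly like this:
# clean_ebird.py -- placeholder name for the little cleaning script described above
import os
import pandas as pd

def load_clean_checklists(filename='MyEBirdData.csv'):
    """Read the raw eBird export and apply the cleaning steps from this post."""
    file_path = os.path.join(os.getcwd(), filename)
    df = pd.read_csv(file_path,
                     parse_dates={'checklist_date_time': ['Date', 'Time']})
    df = df.drop(columns=['Area Covered (ha)', 'Breeding Code',
                          'Species Comments', 'Checklist Comments'])
    df.columns = [col.lower().replace(' ', '_').replace('/', '_')
                  for col in df.columns]
    return df
Then the next notebook only needs from clean_ebird import load_clean_checklists and a single call to get the cleaned DataFrame.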