We talk about data a lot at the Sisense for Cloud Data Teams office. Recently, the discussion turned to how much time we spend managing our email inboxes. Austin, one of our engineers, decided to analyze his own email usage for answers.

Ever since I first heard the phrase Inbox Zero, I have been aggressively pursuing it with careful inbox management and quick email response times. I wanted to see how I was doing, so I downloaded a data dump of my gmail data. ​

Getting Your Data Download from Gmail

Once you have your Gmail data, you’ll need to convert your .mbox file format into a CSV. This post by Claire Willett offers some code to help use python’s mailbox library to to help you do this.

To analyze how much time I was spending on email, I decided to pull out the subject, from, to, and date fields. ​

Analyzing the Data

The first thing I do with any dataset is to look for a date column and plot the growth over time. I have been using this email account for about one year, so I decided to look at a weekly overview.

I quickly encountered dates that came with different timezones and varying formats, for example Wed, 02 Mar 2016 14:43:34 +0200 versus Tue, 1 Mar 2016 19:25:02 -0600 (CST). Redshift had some trouble with the variable timezone formats, and because I was interested in weekly granularity, I chopped off the timestamp by removing anything that came after + and -. ​

select
  date_trunc('week', 
    (regexp_replace(date, ' [+\-].*'))::timestamp
  )
  , count(1)
from
  my_email
group by
  1

Here are my results:

Email volume chart

Immediately an elbow in the growth popped out at me. In January, we changed how we were logging information from our production servers, which led to a much greater volume of emails.

The engineering team recognized that while having the alerts was great, the email flood was not. We switched to using Rollbar to aggregate emails coming from the servers, and the volume tapers off at the end of January. ​ This led me to ask who was sending me all of these emails, so I drew a pie chart of the top ten accounts sending me email.

Email pie chart

As expected, most of my volume comes from email lists that handle things like our server emails, signups from people interested in trying out Sisense for Cloud Data Teams, and messages to our all-purpose email.

I saw my coworkers Andreas and Sunny impressively show up in seventh and eighth place, respectively, and wanted to take a second look while ignoring anything coming from an automated address.

I decided to use regexp_substr to parse the sender out of the email address. Some of the emails come from our previous .io domain, some use the named noreply format, while others do not.

To do a consistent comparison I only wanted to consider any characters up to a space or @ symbol. ​

regexp_substr("from", '[^@ ]*') 

This regex says to record any characters that are not @ or space, and then to stop when it sees one of those characters. I can then use the extracted name to quickly filter out lists: ​

regexp_substr("from", '[^@ ]*') not in (
  'noreply'
  , 'leads'
  , 'signups'
  , 'hello'
  , 'alerts'
  , 'Rollbar'
  , 'Periscope'
  , 'support'
  , '"Slack"'
  , 'OpsGenie'   
)

Re-running my weekly cumulative graph with this filter dropped my total from 255,000 to 45,600. Progress! Knowing now that my analysis would not be dominated by email lists, I wanted to ask a more involved question: whose emails do I respond to most quickly, and whose sit in my inbox the longest?

To start answering this question, I decided to do a self-join on my email. Since different email servers make different modifications to the subject line, I first I stripped the Re: out of the subjects so that I could match emails by subject:

And simplified the problem by only looking at emails with a single response: ​

with
  replied_once as (
    select
      subject
    from
      (
        select
          subject, count(1)
        from
          [austins_email_no_lists]
        group by
          subject
      ) where count = 2
  )
  
select * from response_times, replied_once 
where response_times.subject = replied_once.subject 

Then I plotted the average over time: ​

select
  date_trunc('week', (recv_time)::timestamp)::date
  , avg(datediff(minute, recv_time, send_time))
from
  response_times
group by
  1
Pie chart of email response times

My response times have been pretty dynamic! I spent a lot of time in August, September, and October building charts for new Sisense for Cloud Data Teams trials, and would have to spend time running queries before I could respond.

In February, I stepped up my recruiting efforts and my email response times have definitely seen the effects of coordinating schedules since.

Finally, I wanted to know how much time I was spending communicating with Sisense for Cloud Data Teams teams over email. We use Slack for most of our internal communications, so I wanted to see where we were using email to communicate.

I mapped each of the internal email addresses to their Sisense for Cloud Data Teams team, and looked at the sum and average number of hours it takes me to reply: ​

Pie chart of email sources
graph of email response time

I am clearly on top of my marketing email response time, but it looks like I have some work to do with the sales team. If you want to see where you are spending your time with email, download your own Gmail archive and start exploring!

Tags: