Usually, the easiest way to upload a CSV is to use Sisense's CSV functionality. It's fast, easy, lets you join the data with all your databases, and automatically casts types. However, sometimes it's useful to interact directly with a Redshift cluster, usually for complex data transformations and modeling in Python. When interacting directly with a database, it can be a pain to write a create table statement and load your data. When the table is wide, you have two choices when writing your create table statement: spend the time to figure out the correct data types, or lazily import everything as text and deal with the type casting in SQL. The first is slow, and the second will get you in trouble down the road.

You can see a great example of this by looking at Stack Overflow's survey results. They're available in a CSV format that's a daunting 158 columns wide. If you want to load the data into Redshift with proper column types rather than being generous with the data types, you can speed up the process by writing a Python script.

Importing Libraries and Reading Data in Python

The first step is to import the libraries and load the data into a CSV reader object. The CSV library will be used to iterate over the data, and the AST library will be used to determine data types.

The process requires a few lists. “Longest” will track the longest value, in characters, for each column, to size the varchar columns; “headers” will be a list of the column names; and “type_list” will be the running list of column types as we iterate over our data.

import csv, ast, psycopg2

f = open('/path/to/survey/data/survey_data.csv', 'r', newline='')
reader = csv.reader(f)

longest, headers, type_list = [], [], []

Finding the Data Type

Once the data is in, we need to find the data type for each column. This means evaluating every value and keeping the most restrictive type that still fits everything seen so far: integers are preferred to decimals, and decimals to strings.

The function dataType does this. First, it evaluates whether the value is text or a number, and then, for numbers, which numeric type is appropriate. The function takes both the new value and the current best type to evaluate against.

def dataType(val, current_type):
    try:
        # literal_eval parses numeric strings; plain text raises an error
        t = ast.literal_eval(val)
    except (ValueError, SyntaxError):
        return 'varchar'
    if type(t) is int:
        # A column already seen as decimal or varchar can't narrow back to an int
        if current_type in ['decimal', 'varchar']:
            return current_type
        # Otherwise use the smallest integer type that fits the value
        if -32768 <= t <= 32767 and current_type not in ['int', 'bigint']:
            return 'smallint'
        elif -2147483648 <= t <= 2147483647 and current_type != 'bigint':
            return 'int'
        else:
            return 'bigint'
    elif type(t) is float and current_type != 'varchar':
        return 'decimal'
    else:
        return 'varchar'
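
As a quick sanity check, you can watch the function promote a column's type as it sees new values (these values are illustrative, not from the survey):

print(dataType('3', ''))               # smallint
print(dataType('100000', 'smallint'))  # int, too large for smallint
print(dataType('2.5', 'int'))          # decimal, a float widens any integer type
print(dataType('hello', 'decimal'))    # varchar, text wins over everything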

The loop below iterates over the rows in our CSV, calls the function above, and populates the lists.

for row in reader:
    if len(headers) == 0:
        headers = row
        for col in row:
            longest.append(0)
            type_list.append('')
    else:
        for i in range(len(row)):
            # NA is the csv null value
            if type_list[i] == 'varchar' or row[i] == 'NA':
                pass
            else:
                var_type = dataType(row[i], type_list[i])
                type_list[i] = var_type
            # Track the longest value seen, to size the varchar columns
            if len(row[i]) > longest[i]:
                longest[i] = len(row[i])
f.close()
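
Before writing any SQL, it can be worth spot-checking what was inferred. A quick loop like this (illustrative; the output depends on your file) prints the first few columns:

for name, col_type, width in list(zip(headers, type_list, longest))[:5]:
    print(name, col_type, width)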

Then use those lists to write the SQL statement:

statement = 'create table stack_overflow_survey_data ('

for i in range(len(headers)):
    if type_list[i] == 'varchar':
        statement += '\n{} varchar({}),'.format(headers[i].lower(), longest[i])
    else:
        statement += '\n{} {},'.format(headers[i].lower(), type_list[i])

# Replace the trailing comma with the closing parenthesis
statement = statement[:-1] + ');'
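
To double-check the generated DDL before running it, print the statement:

print(statement)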

Finally, the output!

create table stack_overflow_survey_data (
respondent int,
professional varchar(56),
programhobby varchar(45),
country varchar(34),
....
expectedsalary decimal);

Finishing the Job

Of course, the job isn’t done—the data needs to get into Redshift! This can be done using the psycopg2 library (imported above). To use the copy command, first load the data to S3. The access key ID and secret access key can be found under users in your AWS console.

(Find additional details about the copy command used below in the post How to ETL Data into and out of Amazon Redshift.)
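
The copy command reads from S3, so the file needs to be uploaded first. That step isn't shown here; if you want to script it as well, a minimal sketch using boto3 might look like this (the bucket name is the same placeholder used in the copy command below, and credentials are assumed to come from your AWS configuration):

import boto3

# Upload the CSV so the Redshift copy command can read it from S3
# (bucket name is a placeholder; swap in your own)
s3 = boto3.client('s3')
s3.upload_file('/path/to/survey/data/survey_data.csv', 'an-example-bucket', 'survey_data.csv')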

conn = psycopg2.connect(
    host='mydb.mydatabase.us-west-2.redshift.amazonaws.com',
    user='user',
    port=5439,
    password='password',
    dbname='example_db')

cur = conn.cursor()

cur.execute(statement)
conn.commit()

sql = """copy stack_overflow_survey_data from 's3://an-example-bucket/survey_data.csv'
    access_key_id '<access_key_id>'
    secret_access_key '<secret_access_key>'
    region 'us-west-1'
    ignoreheader 1
    null as 'NA'
    removequotes
    delimiter ',';"""
cur.execute(sql)
conn.commit()
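
A quick row count confirms the load worked (the exact count depends on your copy of the survey file):

cur.execute('select count(*) from stack_overflow_survey_data;')
print(cur.fetchone()[0])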

Now you can begin examining the Stack Overflow data.

Start with a few quick charts: Stack Overflow asked people what attributes are the most important when making technical hires, including knowledge of algorithms, communication skills, titles, and education.

Compare a couple of those questions with the average salary of the respondents:

select
  importantHiringAlgorithms
  , avg(salary) as salary
  , count(*) as respondents
from
  stack_overflow_survey_data
group by
  1
order by
  1

You can see that the average respondent’s salary actually decreases with a focus on data structures and algorithms when hiring! In contrast, a heavier emphasis on communication skills is associated with higher salaries.

[Salary table]

Looking deeper, you can see why: students aren't worried about communication skills yet.
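
A similar query that also groups by professional status produces the breakdown below (the importantHiringCommunication column name is an assumption based on the survey's schema):

select
  professional
  , importantHiringCommunication
  , avg(salary) as salary
  , count(*) as respondents
from
  stack_overflow_survey_data
group by
  1, 2
order by
  1, 2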

[Salary table 2]

While Stack Overflow didn't ask for respondents' titles, it's reasonable to suspect that an emphasis on communication skills grows as people advance their careers. It turns out communication skills are highly valued. See what else you can find out using the neatly ordered data you've imported.