Loading and Unloading
A key question about any data warehouse, often overlooked in the purchasing process, is how the data gets in and out. While performance is obviously critical, you will be loading data in and dumping data out all day long. A simple, scalable process is critical.
Amazon Redshift makes a high-speed cache for lots of different types of data, so it’s become very popular. Check out these recommendations for a silky-smooth, terabyte-scale pipeline into and out of Redshift.
Getting Data In: The COPY Command
To get data into Redshift, start by staging it in Amazon S3. Once your data’s there, you’ll take advantage of the all-purpose COPY command. Here’s an example:
COPY my_schema.my_table FROM 's3://bucket_name/path/to/my_manifest' WITH CREDENTIALS 'aws_access_key_id=<my_access_key>; aws_secret_access_key=<my_secret_key>' REGION 'us-east-1' MANIFEST GZIP ACCEPTANYDATE TRUNCATECOLUMNS ACCEPTINVCHARS
Let’s take this step by step:
The first couple of lines are straightforward. Immediately after COPY comes the schema name and table name. Immediately after FROM comes the path to your data’s manifest file in S3.
The WITH CREDENTIALS line contains your AWS account credentials. These credentials must have read access to the S3 bucket in question.
The next three keywords clarify some things about the data:
- REGION specifies the AWS region of your S3 bucket. The default is the AWS region of your Redshift cluster.
- MANIFEST specifies that the path after FROM is to a manifest file.
- GZIP indicates that the data is gzipped.
The final three keywords serve to maximize the success rate of the import:
- ACCEPTANYDATE allows any date format to be parsed in datetime columns.
- TRUNCATECOLUMNS will truncate text columns rather than error if the data is too long.
- ACCEPTINVCHARS will replace invalid Unicode characters with ? rather than erroring.
You can adjust these toggles to taste, but in our experience, failed loads are quite frustrating. We recommend some flexibility on the data rather than endless ETL headaches.
Getting Data Out: The UNLOAD Command
Nearly as common as getting data in is getting data out. Sometimes, the results of hard computations done in Redshift are necessary for serving systems. Other times, a large export is needed for analysis in Excel or other tools.
To get lots of data out, you’ll want the UNLOAD command. Here’s an example:
UNLOAD ('select * from my_table') TO 's3://bucket_name/path/to/my_filename_prefix' WITH CREDENTIALS 'aws_access_key_id=<my_access_key>; aws_secret_access_key=<my_secret_key>' MANIFEST GZIP ALLOWOVERWRITE ESCAPE NULL AS '\\N'
Immediately after the UNLOAD keyword, enter the query whose results you want to export, as a string surrounded by parentheses.
Similarly to COPY, you must use WITH CREDENTIALS to specify credentials that may write to your S3 bucket.
The next three keywords modify the format of the export itself:
- MANIFEST includes a file listing the dumped files. Redshift will export two files per node (one per slice), and a master list can be helpful for reloading via COPY as well as for other programs reading the data.
- GZIP compresses the files, making them much easier to work with.
- ALLOWOVERWRITE proceeds with the export even if the file already exists.
The last two keywords deal with the data:
- ESCAPE puts an escape backslash before newlines, quotes, etc.
- NULL AS uses a special format for null values instead of writing whitespace.
That’s all there is to it! We hope this has been helpful. You’ve got some more skills to help you copy large amounts of data.