amazon s3 - Simple way to load new files only into Redshift from S3?


The documentation for the Redshift COPY command specifies two ways to choose which files to load from S3: you either provide a base path, and it loads all the files under that path, or you specify a manifest file listing the specific files to load.

However, in our case, which I imagine is pretty common, the S3 bucket periodically receives new files with more recent data. We'd like to be able to load only the files that haven't already been loaded.

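Given that there is a table, stl_file_scan, that logs the files that have been loaded from S3, it would be nice to somehow exclude those that have already been loaded. This seems like an obvious feature, but I can't find anything in the docs or online about how to do this.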

Even the Redshift S3 loading template in AWS Data Pipeline appears to manage this scenario by loading all the data (new and old) into a staging table, and then comparing/upserting into the target table. This seems like an insane amount of overhead when we can tell up front from the filenames that a file has already been loaded.

I know we could move the files that have already been loaded out of the bucket, but we can't do that: this bucket is the final storage place for another process which is not our own.

The only alternative I can think of is to have some other process running that tracks which files have been loaded into Redshift, periodically compares that against the S3 bucket to determine the differences, and then writes a manifest file somewhere before triggering the COPY process. But what a pain! We'd need a separate EC2 instance to run the process, which would have its own management and operational overhead.
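For what it's worth, the comparison step itself is small. Here is a minimal sketch, assuming you already have the list of object keys in the bucket and the list of file paths Redshift reports as loaded. The function names `build_manifest` and `manifest_json` and the bucket name are hypothetical; the `entries` / `url` / `mandatory` layout follows the COPY manifest format. The `strip()` is there on the assumption that the system table returns fixed-width, space-padded names.

```python
import json


def build_manifest(bucket, keys_in_bucket, loaded_urls):
    """Return a COPY manifest (as a dict) listing only objects not yet loaded.

    bucket         -- bucket name, e.g. "my-bucket" (hypothetical)
    keys_in_bucket -- iterable of object keys currently in the bucket
    loaded_urls    -- iterable of full s3:// paths already loaded
    """
    # Assumption: names read back from the log table may carry trailing spaces.
    loaded = {url.strip() for url in loaded_urls}

    entries = []
    for key in sorted(keys_in_bucket):
        url = f"s3://{bucket}/{key}"
        if url not in loaded:
            # "mandatory": True makes COPY fail if the file is missing.
            entries.append({"url": url, "mandatory": True})
    return {"entries": entries}


def manifest_json(manifest):
    """Serialize the manifest dict to the JSON text that gets uploaded to S3."""
    return json.dumps(manifest, indent=2)
```

The key listing could come from something like boto3's `list_objects_v2`, and the loaded paths from a query against stl_file_scan; neither is shown here, and this does nothing about where the process would run.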

There must be a better way!

Here are the steps involved in the process of loading data into Redshift:

  1. Export the local RDBMS data to flat files (make sure you remove invalid characters and apply escape sequences during the export).
  2. Split the files into 10-15 MB each to get optimal performance during the upload and the final data load.
  3. Compress the files to *.gz format so you don’t end up with a $1000 surprise bill :) .. in my case the text files were compressed 10-20 times.
  4. List all the file names in a manifest file, so that when you issue the COPY command to Redshift they are treated as one unit of load.
  5. Upload the manifest file to the Amazon S3 bucket.
  6. Upload the local *.gz files to the Amazon S3 bucket.
  7. Issue the Redshift COPY command with the options you need.
  8. Schedule file archiving from on-premises and from the S3 staging area on AWS.
  9. Capture errors and set up restartability in case something fails. For an easy way of doing this, you can follow this link.
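Steps 2, 3 and 7 above can be sketched roughly as follows. This is only an illustration under some assumptions: the chunk size is measured before compression, files are split on line boundaries, and the helper names `split_and_compress` and `copy_statement`, the part-file naming, and the IAM role argument are all hypothetical. `GZIP` and `MANIFEST` are standard COPY parameters; any other options depend on your data format.

```python
import gzip
import os


def split_and_compress(src_path, out_dir, max_bytes=10 * 1024 * 1024):
    """Split a flat file on line boundaries into gzip-compressed parts.

    Each part holds at most max_bytes of uncompressed data, unless a single
    line is longer than that. Returns the list of part paths, in order.
    """
    os.makedirs(out_dir, exist_ok=True)
    base = os.path.basename(src_path)
    parts = []
    out = None
    written = 0
    with open(src_path, "rb") as src:
        for line in src:
            # Start a new part when the current one would overflow.
            if out is None or (written > 0 and written + len(line) > max_bytes):
                if out is not None:
                    out.close()
                part_path = os.path.join(out_dir, f"{base}.part{len(parts):04d}.gz")
                out = gzip.open(part_path, "wb")
                parts.append(part_path)
                written = 0
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return parts


def copy_statement(table, manifest_url, iam_role):
    """Build a COPY statement that loads gzip files listed in a manifest."""
    return (
        f"COPY {table} FROM '{manifest_url}' "
        f"IAM_ROLE '{iam_role}' "
        "GZIP MANIFEST;"
    )
```

The resulting part files would then be uploaded (steps 5 and 6) and listed in the manifest that `manifest_url` points to; the upload itself is left out here.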
