How can I list all the files on a SFTP server

Hi All,

 

I'm receiving CSV files from a third-party vendor onto our SFTP server on an hourly basis. Each of the files is pretty large, so I'd like to only process files that haven't yet been processed. Each of the filenames has a timestamp (e.g. "data_20210308_150322.csv") so I know when they were uploaded.

 

In Matillion, is there a way for me to list all the files on the SFTP server and write them to a table? This would allow me to compare the files on the SFTP with the files that I've already processed to let me know which files have and haven't been processed already. I can then use this table with the File Iterator component to grab the the correct files from SFTP.

 

Thanks!

Hi @NT567373​ ,

I had a similar challenge (transfer only files from SFTP to S3 which are not in the bucket, yet) and I have solved this with the following job:

First, there is a file iterator to query all the file names and file paths on the SFTP server. The “Append to grid” component appends all the file names to a grid variable which has to be defined before. Next, there is another file iterator that fetches all existing files from S3. These files are removed from the same grid variable that was defined and loaded before. Now, the grid variable only contains file names which are only on the SFTP server and not in S3. Finally, the real data Transfer can start. We only transfer “new” files.

The Python Script at the very beginning generates a string containing the last x day, e.g., “2021-03-09|2021-03-08|…”. I can use this pattern for the Regex filters in the file iterator components. You will need this if you have a large number of files (the file iterator is limited to handle only a certain amount of files.

Thanks @Michael​ ! I appreciate the response. This looks great, I'll give this a shot!

hi Michael,

 

Do you mind sharing your python script on how to connect to the SFTP source?

 

Thanks,

Good luck and let me know how this worked for you!

@Michael​ I ended up sticking a SQL component under the File Iterator that iterated through my SFTP files. In the SQL component, I used an `INSERT INTO SFTP_FILENAMES_TABLE` statement, to insert the filename into a table. Then, I compared the files I already processed to the filenames in SFTP_FILENAMES_TABLE. I then processed the files that hadn't yet been processed.

I'm also running into an issue with the File Iterators Max Iterations limit. Can you elaborate on your Python script and how you got around the limit?

Forget my last question; it took me a while to understand what you were doing, but I get it now!