Delete files from fileshare after Data Transfer to Blob

The Data Transfer component doesn't have an option to delete files after a successful move. Does anyone have an example script to purge the source files from the file server?

Hi @Treaders​,

There are a couple ways to accomplish this if the Data Transfer component doesn't delete the files after moving them.

If you are leveraging an SMB file share as the source for loading the Blob destination, one way would be to get a list of the paths of the files that will be (or have been) moved and push those into a Grid Variable. You could then iterate over the Grid Variable and execute a Python script that deletes the files. Here is a good example of connecting to an SMB file share and deleting files in the share using Python: https://stackoverflow.com/questions/44871317/pysmb-delete-recursively-folder-sub-folder-and-files

I hope this helps in some way. Please feel free to let us know if you have more information that will help us give you a more direct answer. Thanks for contributing!

Hi @Bryan​ , Just to update you that after enabling the firewall rules between source and target, the pysmb process works for clearing up on-prem files. I've also created a separate Bash script utilising the Azure CLI to run some maintenance scripts on Blob as well.

Thanks for your initial response.

@Treaders​ I am also trying to do the same thing but am stuck. Could you please tell me how you resolved this?

Hi,

 

I've learnt quite a bit since this initial question and my initial approach. To answer your question first, though: I used a Python script to delete the files after transferring.

 

###
# Job deletes files from the on-premises file share after ingestion into the Blob store.
# No archive is required as a copy is kept in Blob.
###

from smb.SMBConnection import SMBConnection

# The ev_* and jv_* names below are Matillion environment/job variables,
# injected into the script's scope at runtime.
dry_run = False  # Set to True to test whether all files/folders can be "walked"; set to False to perform the deletion.
userID = ev_onprem_irvm_serviceaccount
password = jv_decrypt
client_machine_name = 'testclient'  # Usually safe to use 'testclient'
server_name = ev_onprem_irvm_name  # Must match the NetBIOS name of the remote server
server_ip = ev_onprem_irvm_ip  # Must point to the correct IP address
domain_name = ev_onprem_irvm_auth_domain  # Safe to leave blank, or fill in the domain used for your remote server
shared_folder = ev_onprem_irvm_sharename  # Set to the shared folder name

conn = SMBConnection(userID, password, client_machine_name, server_name,
                     domain=domain_name, use_ntlm_v2=True, is_direct_tcp=True)
conn.connect(server_ip, 445)

def walk_path(path):
    print('Scanning path:', path)
    for p in conn.listPath(shared_folder, path):
        if p.filename != '.' and p.filename != '..':
            parentPath = path
            if not parentPath.endswith('/'):
                parentPath += '/'

            if p.isDirectory:
                print('Ignoring subdirectories')
                # To recurse into (and optionally delete) subdirectories instead, uncomment:
                # print('Scanning folder (%s) in %s' % (p.filename, path))
                # walk_path(parentPath + p.filename)
                # if not dry_run:
                #     conn.deleteDirectory(shared_folder, parentPath + p.filename)
            elif p.filename == jv_filename:
                print('Deleting file %s from directory %s' % (p.filename, path))
                if not dry_run:
                    conn.deleteFiles(shared_folder, parentPath + p.filename)

# Start at the environment/source-system folder and delete the matching file(s)
walk_path('/' + ev_onprem_irvm_envname + '/' + jv_source_system)

 

You need to ensure TCP ports 139/445 are open between the file server(s) and the Matillion VM(s). (The script above uses direct TCP over port 445.)

 

If you have a lot of files you're looking to transfer to Blob concurrently, then you may want to set up a Storage Integration (Snowflake) or the equivalent in your target cloud DWH.

 

The second learning was that using SMB in the Data Transfer component was not very performant. For example, a 350 MB file was taking an hour just to copy over to Blob. So instead of using the SMB Data Transfer component, I mounted the Windows file share on the Matillion Linux VM and now use a Bash script to copy the data directly from the mount to Blob. The same file now transfers to Blob in 15 seconds.

 

Mount command:

sudo mount -t cifs //<FILE SERVER IP ADDRESS>/<FILE SERVER SHARE NAME>/ /mnt/win_share -o sec=ntlmv2,domain=<AD AUTH DOMAIN>,username=<AD USER FOR MOUNT>

 

Bash script to copy a file to Blob:

az storage blob upload --account-name <storage account> --container-name <CONTAINER/DIRECTORY> --file "/mnt/win_share/<path/to/file/filename.csv>" --name "blobname.csv"
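If you have many files under the mount, the Azure CLI also has an `az storage blob upload-batch` command that uploads a whole directory in one call, which may help with the concurrency concern above. A sketch only; the account, container, and path values are placeholders to substitute with your own:

```shell
# Upload every CSV under the mounted share directory to the container in one call.
az storage blob upload-batch \
  --account-name "<storage account>" \
  --destination "<container>" \
  --source "/mnt/win_share/<path/to/files>" \
  --pattern "*.csv"
```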

 

If you take this approach, then you can use another Bash script (rm command) to remove the files from the mount, if you have permission.
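Putting the two steps together, a copy-then-delete loop might look like the sketch below. This is not the exact script from the thread; the function name, directory, account, and container values are all placeholders, and it deletes each file only after its upload reports success:

```shell
#!/usr/bin/env bash
# Sketch: upload each CSV from the mounted share to Blob, then remove the
# local copy once the upload succeeds. All names here are placeholders.
upload_and_clean() {
  local dir="$1" account="$2" container="$3" f name
  for f in "$dir"/*.csv; do
    [ -e "$f" ] || continue                 # no matching files: do nothing
    name="$(basename "$f")"
    if az storage blob upload \
         --account-name "$account" \
         --container-name "$container" \
         --file "$f" \
         --name "$name"; then
      rm -f -- "$f"                         # delete only after a successful upload
    fi
  done
}

# Example call (placeholders):
# upload_and_clean /mnt/win_share "<storage account>" "<container>"
```

Guarding the rm behind the upload's exit status means a failed upload leaves the source file in place for the next run.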

 

Hope this helps.

 

 

sudo pip3.6 install pysmb, and got our networking team to ensure TCP ports 139/445 were open between the Matillion VM(s) and file server(s).

Np. So we have two file servers where we ask for all files to be landed, which is where we load from. On each file server, the directories are split by environment: on one file server we have Dev and Test directories, on the other we have Pre and Prod. These environment directories are set as environment variables (ev_onprem_irvm_envname) on each Matillion instance.

 

Under each environment directory, we then have a sub-directory for the source system (this could be HR, Finance, etc.), so the values will be determined by however you structure your directories. We have a parameterised, metadata-driven approach to extracting and loading data from a multitude of sources, file servers being just one route.
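To make those two variables concrete: the path the deletion script walks is just the environment directory joined to the source-system directory. With illustrative values (substitute your own directory names):

```shell
# Illustrative values only -- these mirror the Matillion variables.
ev_onprem_irvm_envname="Dev"   # environment directory (Dev/Test/Pre/Prod)
jv_source_system="HR"          # source-system sub-directory
share_path="/${ev_onprem_irvm_envname}/${jv_source_system}"
echo "$share_path"             # prints /Dev/HR
```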

 

You could probably simplify this if you wanted and remove these variables or hardcode them.

 

Hope this makes sense.

Thank you, Treaders. I am choosing to go with first approach. The files are small in my case so there is no impact on performance.

In this approach, how did you install smb? I asked the admin to install it as root and the install completed successfully. However, I still get the error: ModuleNotFoundError: No module named 'smb'.

Thank you, Treaders. This helped me to get rid of the error.

One more question: what needs to be put in place of these: ev_onprem_irvm_envname and jv_source_system?

Thank you.