When looking at a transformation, I would like to know from what sources the data come from and where they will go.
For that, I would like to extract the source and destination datasets of that transformation.
I am using a Redshift environment, and when looking for instance at a Table Input component, I can find in its properties a schema, a table name and its column names (https://documentation.matillion.com/docs/1991918), but I don’t know the database they all come from.
I can see that the available tables and views to choose from are provided from the selected environment, but there is no concept of « selected environment » when using the API.
How can I know what is the selected environment to deduce the source database? Is that a good approach to obtain the source database or is there a better one?
Thank you in advance for the time reading and any help,
Cyrille Dakhlia
Hi Cyrille,
Is our Data Lineage feature useful in this respect?
If there's additional information you need which is not covered by the existing functionality, please could you add it in our Ideas Portal? I had a check through the existing ideas and there's nothing currently about source database name in the lineage.
In your related question, you can get the default schema in the way that @DataGuy suggested.
Best regards,
Ian
Hi Ian,
Thank you for your answer and sorry for the time I took to reply, I needed some time to reflect on the topic to ask the good questions.
I have read the documentation about the Data Lineage feature you shared, and it seems to me that it provides the lineage of data inside the transformation, with the aim of understanding the journey of a specific data from the beginning of the transformation, to the point of inspection.
Thus, we know what are the data at the entrance of the transformation (= input data) (please correct me if I am wrong so far).
What I would know is how are those input data made.
To explain my point, I have understood that data are originally retrieved at the orchestration level, and can then be composed to create a new table, to finally be passed to a transformation. In that case, the input data of a transformation may be the composition of multiple data sources (for instance PgSQL and Amazon S3).
What I am looking for, is to be able to look at a transformation, and be able to say that the input data of that transformation are coming from the PgSQL database "DBpg" and the Amazon S3 Bucket "Bucky" (I was very inspired for the name examples).
Do I necessarily have to look at the orchestration and deduce this information, or is there a way to know it at the transformation level?
I hope I could explain clearer my goal. If not, please do not hesitate to ask me for more explanations. It would challenge my understanding and either help you to understand better my goal, or help me to discover where my understanding is failing.
Kind regards,
Cyrille Dakhlia
Hi @ian.funnell,
I don't know if you have missed my previous message but, in doubt, I hope you do not mind that I tag you for confirmation and, ideally, a possible answer.
The main points were:
- Does the data lineage feature only provide the lineage of data inside the transformation (from the beginning to the point of inspection), meaning that we don't know anything about the data prior the entrance of the transformation?
- Since data are originally retrieve at the orchestration level, is there a way to obtain the data's sources with the API when requesting the transformation, or do I necessarily need to look at the orchestration level?
-
If there's no answer, I will write in the Ideas Portal to submit my points.
Thank you in advance,
Cyrille Dakhlia