WARNING: WALL OF TEXT
Let's talk about how we're moving and storing data at a high level.
I've been charged with creating a Data Warehouse™ at my company.
Due to the nebulous definition of the concept, here is what the Data Warehouse™ (henceforth known as DW) means to me, and how I plan to tackle it:
[Source Data] → [Landing Database] → [Persistent Staging Database] → [Presentation Layer]
[Source Data] is any source of information (RDBMS, API, NoSQL, etc.) that I don't have control over. I'll take this data as-is and duplicate the data as tables. These tables will be truncated before data is loaded. If the source table is from a SQL Server and is called ftp.SR_Files
The [Landing Database] is all of the information we care about from the [Source Data] taken as-is and duplicated as tables. These tables will be truncated before more data is loaded. If the source table is from a SQL Server and is called
ftp.SR_Files from the AdventureWorks database, the landing table will be called something like
ADWKS_ftp_SR_Files. I'm not really sure of the best way to name source tables from various schemas on various databases within one single schema.
The [Persistent Staging Database] is essentially a historical dumping ground for all of the data we've ever thought was useful. It is populated from the [Landing Database] using UPSERT logic to add missing rows, update existing rows, and drop rows that no longer exist in the source. However, these tables will be system versioned on a daily basis, so that we have historics that are NOT provided by the source applications.
Finally, we have the...rest of it. This is where I'm a bit fuzzy. The [Presentation Layer] is where we store information about business-level objects. This will essentially be star-schema'd, physical manifestations of what once was a myriad of reporting views that existed on the source databases. This will allow for extremely fast queries from front-end BI tools like Power BI and other internal business applications.
And now for the ETL stuff:
SSIS. Oh boy.
I spent a few days watching videos, following guides, and really trying to dig into SSIS. The entire time, I couldn't help but think: "Man, this would be so much easier in C#."
So that's what I did. I wrote my own custom ETL tool that runs as a service on a Windows Server. You can add "jobs" that can be scheduled to be intermittent or via cron. There are many considerations to handle graceful shutdowns, data "pipes" that avoid excessive RAM usage and hosing the source databases, email notifications, logging, etc.
So here's my question:
Is SSIS really the only out-of-the-box option for ETL?
I'd like some feedback on my stream of consciousness here, and what everyone is using to move data from any given source into their data warehouse.
I'll edit this post with additional information as it's discovered.