Managing the Big Data project
Everything ‘expands’ in a Big Data project. There are many more decision points, even before you draw the first entity in your ERD design tool. A typical IT lifecycle, of any type, consists of:
- Analysis (“You start coding and I’ll go upstairs and see what the business wants”);
- Design (“Is this a pattern that has previously been designed?”)
- Coding (“Is this a pattern that has previously been coded?”)
- Testing (“Have we used tests like these, before?”)
- Feedback (all positive, right?)
All of the above happens in Agile development as well as in Waterfall development, but for this discussion, I’ll ignore the details of Agile work sessions.
Let’s dive into the detailed lifecycle of Big Data projects:
First, Analysis and Design…
Analysis in a Big Data project includes not only the exploration and documentation of what the (Mathematical) Analysts/Advanced Business People want and need, but also the nature and situation of the data upon which they are going to work.
You will need to answer: What is the estimated current size of the data, and how fast is it expected to grow? Will the application continue to collect data at an exponential rate; if so, how will it be treated, where will it be stored, and how do the Analysts/Business People plan to use it? You will need to have conversations, with a “chalkboard” for drawing samples and examples. For example: Suppose we are collecting data which is growing exponentially over an 18-month period (pick any time frame). Is it important to apply the Analysts’ calculations over the entire 18 months of data, or can we select a sample, say a time slice of data from a particular 3-month period?
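To make the “chalkboard” conversation concrete, here is a minimal Python sketch of a growth projection. The starting size, the monthly growth rate, and the function name `project_monthly_sizes` are all assumptions invented for illustration; substitute your own estimates.

```python
def project_monthly_sizes(initial_tb, monthly_growth_rate, months):
    """Return the projected total size (in TB) at the end of each month,
    assuming the same compound growth rate every month."""
    sizes = []
    size = initial_tb
    for _ in range(months):
        size *= 1 + monthly_growth_rate
        sizes.append(size)
    return sizes


# Hypothetical inputs: 10 TB today, growing 10% per month, over 18 months.
sizes = project_monthly_sizes(initial_tb=10.0, monthly_growth_rate=0.10, months=18)

# Data added during a 3-month slice (months 7 through 9):
# total at the end of month 9 minus total at the end of month 6.
slice_added_tb = sizes[8] - sizes[5]

# Share of the full 18-month volume that this slice represents.
slice_fraction = slice_added_tb / sizes[-1]

print(f"Projected size after 18 months: {sizes[-1]:.1f} TB")
print(f"Months 7-9 slice: {slice_added_tb:.1f} TB ({slice_fraction:.1%} of total)")
```

Numbers like these give the Analysts something specific to react to when deciding whether a time slice is a large enough sample.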
Additionally, can an amount of data be purged after a certain time period? Can a particular type of data be purged? Do the Managers understand cloud costs of exabyte volumes of data? Does some/all of the volume of data need to be saved for either historical and/or regulatory purposes? If we need to save all of the data, can we compress some of it by saving calculation results or trend results?
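The purge and cost questions above can be explored with a rough sketch like the following. The ingest volumes, the price per TB-month, and the function name `monthly_storage_cost` are hypothetical placeholders, not real cloud pricing; the point is only to show Managers how a retention window changes the bill.

```python
def monthly_storage_cost(monthly_ingest_tb, retention_months, price_per_tb_month):
    """Return the storage cost for each month.

    monthly_ingest_tb  -- TB of new data arriving in each month
    retention_months   -- months of data to keep before purging
                          (None means keep everything)
    price_per_tb_month -- assumed flat price; a placeholder, not a real quote
    """
    costs = []
    for month_index in range(len(monthly_ingest_tb)):
        if retention_months is None:
            first_kept = 0
        else:
            first_kept = max(0, month_index - retention_months + 1)
        # Everything from the first retained month through the current month.
        stored_tb = sum(monthly_ingest_tb[first_kept:month_index + 1])
        costs.append(stored_tb * price_per_tb_month)
    return costs


# Hypothetical ingest that doubles every month.
ingest = [1.0, 2.0, 4.0, 8.0]

keep_all = monthly_storage_cost(ingest, None, 20.0)  # never purge
keep_two = monthly_storage_cost(ingest, 2, 20.0)     # purge data older than 2 months

for month, (full, purged) in enumerate(zip(keep_all, keep_two), start=1):
    print(f"Month {month}: keep all = {full:.0f}, keep 2 months = {purged:.0f}")
```

A real estimate would also need to account for storage tiers, compression, and any regulatory minimum retention, which this sketch deliberately leaves out.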
You can see from the above questions that the source(s), use(s), and short-/medium-/long-term disposition of the data must be comprehensively understood and documented in the Analysis Phase.
This is necessary to not only document the requirements but also to be able to estimate project cost, length/duration of project phases, and the staffing that the project will require. Thus, going forward, if there are departures from this understanding, you have the original documentation which can be acknowledged and amended with Management approval.
The deliverable of Analysis is a BRD – Business Requirements Document. At the end of Analysis, it is recommended that a Conceptual Model of the project (block diagrams and/or subject area circles) for Data and for Processing be included in the BRD.
Design time and questions also expand in a Big Data project. Design can be addressed in 3 concurrent sections – Data, Operational Code and Interfacing – BUT the Data design must itself be addressed in 3 subsections:
- What are the business’s/Analyst’s initial requirements (i.e., what problems do they intend to solve)? This will give us the shape of the initial data model(s).
- What will the requirements/needs be 1 to 2 years into the operation of the system?
- What will the requirements/formation of the data be towards the end of this system’s life? That is, if this database or data structure is growing at an exponential rate, at what point (given current technology) does it become unmanageable and require mitigation or a new design?
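For the third question, a back-of-the-envelope calculation can tell you roughly when the data crosses a size you consider unmanageable. This is a sketch under assumptions: the function name `months_until_limit`, the growth rate, and the limit are all hypothetical, and it assumes a constant compound monthly growth rate.

```python
import math


def months_until_limit(current_tb, monthly_growth_rate, limit_tb):
    """Return the number of whole months until the data reaches limit_tb,
    assuming constant compound growth. Returns 0 if already at or over the
    limit, and None if the data is not growing."""
    if current_tb >= limit_tb:
        return 0
    if monthly_growth_rate <= 0:
        return None
    # Solve current * (1 + rate)^n >= limit for n, then round up.
    months = math.log(limit_tb / current_tb) / math.log(1 + monthly_growth_rate)
    return math.ceil(months)


# Hypothetical: 10 TB today, 10% monthly growth, 1,000 TB judged unmanageable.
print(months_until_limit(current_tb=10.0, monthly_growth_rate=0.10, limit_tb=1000.0))
```

If the answer falls inside the expected life of the system, the mitigation or redesign belongs in the plan now rather than as a surprise later.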
Now, the design itself (depending on the requirements) may be OLTP, a Data Vault, Data Warehouse(s), Data Marts, or practically speaking, a combination of two or more of these underlying designs. You must take the purpose(s) of the desired system and select the design that best facilitates processing in the Big Data project. I could write buckets on selecting an appropriate design, and also much about not reinventing the wheel, but maybe in another blog post. Look at the 3 scenarios above. You may recommend different designs for each one, but the usual experience is that an initial design is drawn, and then in subsequent years, the initial design is augmented to support items 2 and 3 above.
For both Design and Coding, please don’t reinvent the wheel. From your experience in previous projects, and from the Internet and a library’s worth of published models, there are models and patterns that can be incorporated into your new model. Use them and save some time. For example, in the business world there is essentially one model: “Customer purchases Product.” Of course there are all sorts of extensions to this, but how many models of ‘Customer’ and ‘Product’ are there in the public domain? Maybe 50? Pick the ones which match your requirements and use them.
The deliverable of the Design phase is Design Documentation, which includes the Data Logical Design (at a minimum), an outline of Operational Code and designs for all necessary interfaces – from source(s) to end user tools. The Design Documentation of the data piece should include the conceptual and detailed Data architecture, ERDs or equivalent detailed models of the data, Data Dictionaries, Data Lineage, and any related information to help the Coding Team produce Physical Data Models and DDL which will satisfy the requirements established in Analysis.
A word about Testing – The easiest way to develop tests and test plans is to develop and document them in your DESIGN phase…or at least towards the end of your design phase, when you know what the system is going to look like. This is also a good time to estimate what/which data will be used to test and how long testing will take.
Approach and Team Composition in the Big Data project
Identify all of the project team members in advance and get commitment for their time with both the person and their management stream. You will need: an Enterprise Architect, Analysts, Data Modelers, DBAs, Testing Specialists, Interface Engineers, Implementation Engineers, Training Personnel, and Technical and/or Business and Management review personnel at each level. Forgetting or ignoring one of these specialties could put your project in trouble.
Don’t wait until the end of a phase of work to start planning the next phase. The experienced Project Manager will plan overlapping phases of work – sort of a combo between Agile and Waterfall. A certain amount of Logical/Physical database design.
- Little or no CIO backing – You didn’t go high enough in the chain or else the CIO is not really committed to the concept and implementation
- Middle management distracted with multiple projects
- Insufficient training for staff
- Project Manager (PM) with little or no IT familiarity
There is a tendency, especially among large corporations, to establish a separate department called “Project Management.” This department is staffed with people who are highly trained in leadership, negotiation, scheduling, planning, monitoring of deadlines, PM software, etc. Some PMs have a coding background; some have a testing background. Personally, I have never met one with database development experience. Many do not even know the business very well.
Because complexity can be a factor in Big Data projects, it is essential that the PM be a person who is exceptionally experienced in the business that is being serviced, and also very experienced in the development lifecycles of databases, operational code, APIs, BI, testing, and possibly AI. It will not do to have a PM produce a schedule which, for example, covers every task but is missing the conceptual, logical and physical modeling of the data. I know it’s hard to find people with these qualifications…too bad. Keep looking. There might even be one in-house who has been overlooked for years while they have quietly gathered credentials.
- Poor understanding of the scope of a Big Data project even after you assemble the team – It is essential to have presentations of the concepts at all levels of design. In other words, more presentations, walk-throughs, and trial examples for people to work through as a team.