my data set has a member id, and from one to N diagnoses on the same line as 1 to N procedures. there is no mapping between a diagnosis and its procedures, so i cannot tell which dx belongs to which proc. My thought was to reengineer the table so that it was transformed from multiple bundled facts per row into one fact pair per row, so to get to there, do i think i need to pivot all possible dxs for a member id on a discrete rows and then pivot (or cross join?) on the procedures, to get them matched one proc per dx per member id per date, and potentially create an accumulator to count the number of dx-proc combinations per member id per month.
is it the appropriate non-lossy data transformation to prepare the data for clustering? i propose changing YearMonth, memberid, dx1, dx2, dx3, dxN, proc1, proc2, proc3, procN into [YearMonth], [memberid], 1{dx}N, 1{proc}N, dxproccount? (what i mean is N rows for a date-member-dx-proc-dxproccount). i have a file with a characterization of each member id, and i want to build a model to see if i can predict the characterization for the remnant 1/3 of the file not used for training. i do have the population's dx and proc history which i think should tell the story. is this a good transformation of the data?
thanks for bearing with me.