[S#3543] Piece-wise linear compression of column groups first working prototype #2415 #2420
Conversation
# Conflicts:
#	src/main/java/org/apache/sysds/runtime/compress/CompressionSettings.java
#	src/main/java/org/apache/sysds/runtime/compress/colgroup/scheme/ColGroupPiecewiseLinearCompressed.java
#	src/test/java/org/apache/sysds/runtime/compress/colgroup/ColGroupPiecewiseLinearCompressedTest.java
Thank you for your first contribution @mori49, this is a good start. I left some comments in the code. You used segmented least squares, which is a fine approach (even though control over the actual loss is quite limited). One limiting factor is the compression complexity of O(n³), which is not viable for production compression. This particular approach can be optimized to O(n²) by precomputing prefix sums or an SSE matrix; see the sketch below (please first address the smaller formatting issues and other code suggestions before approaching that optimization).
In general, we may think of a more lightweight and accurate method that preserves targetLoss as an upper bound (this will be part of the work after the first submission deadline).
Also, please avoid German comments or variable/method names in your contribution.
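For illustration, here is a minimal sketch of the prefix-sum idea. The method names buildPrefixSums and segmentSSE are hypothetical (not existing SystemDS API); the O(n²) dynamic program would call segmentSSE instead of the current O(n) loop in computeSegmentCost.

// Precompute once per column in O(n): prefix sums of x, y, x*x, x*y, y*y.
private static double[][] buildPrefixSums(double[] column) {
	int n = column.length;
	double[][] p = new double[5][n + 1]; // 0:x, 1:y, 2:xx, 3:xy, 4:yy
	for(int i = 0; i < n; i++) {
		double x = i, y = column[i];
		p[0][i + 1] = p[0][i] + x;
		p[1][i + 1] = p[1][i] + y;
		p[2][i + 1] = p[2][i] + x * x;
		p[3][i + 1] = p[3][i] + x * y;
		p[4][i + 1] = p[4][i] + y * y;
	}
	return p;
}

// SSE of the least-squares line on [start, end) in O(1).
private static double segmentSSE(double[][] p, int start, int end) {
	double n = end - start;
	if(n <= 1)
		return 0.0;
	double sx = p[0][end] - p[0][start], sy = p[1][end] - p[1][start];
	double sxx = p[2][end] - p[2][start], sxy = p[3][end] - p[3][start], syy = p[4][end] - p[4][start];
	double denom = n * sxx - sx * sx;
	double slope = (denom == 0) ? 0.0 : (n * sxy - sx * sy) / denom;
	double intercept = (sy - slope * sx) / n;
	// SSE = sum(y^2) - slope*sum(xy) - intercept*sum(y), which follows from the normal equations
	return Math.max(0.0, syy - slope * sxy - intercept * sy);
}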
Move this class from colgroup/scheme package to colgroup/.
In general, all methods that are currently unimplemented should throw new NotImplementedException()
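As a hedged illustration of that pattern (the method name below is a placeholder, not an actual AColGroup method, and the exception type is assumed to be org.apache.commons.lang3.NotImplementedException as used elsewhere in the code base):

public AColGroup exampleUnimplementedOp(AColGroup other) { // illustrative placeholder signature
	throw new NotImplementedException();
}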
This file should not be part of the PR. You can keep it locally but you should untrack it and not add it to your commits. You could use git rm --cached bin/systemds-standalone.sh.
It seems like you reformatted the file to revert the tabs -> spaces conversion, which is good. However, there are still many unnecessary changes. I would recommend reverting that file to the original state of this repository and then only adding the new PiecewiseLinear value to the CompressionType enum.
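A hedged sketch of that single change (the existing enum values are elided here and the exact declaration may differ from the actual CompressionType enum):

public enum CompressionType {
	// ... existing values stay untouched ...
	PiecewiseLinear; // new entry for the piecewise linear column group
}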
public static AColGroup compressPiecewiseLinearFunctional(IColIndex colIndexes, MatrixBlock in,
	CompressionSettings cs) {

	// First, store the contents of one column
	int numRows = in.getNumRows();
	int colIdx = colIndexes.get(0); // the first column
	double[] column = getColumn(in, colIdx);

	// Set the target loss

	// Determine breakpoints: partitioning into segments
	List<Integer> breakpointsList = computeBreakpoints(cs, column);
	int[] breakpoints = breakpointsList.stream().mapToInt(Integer::intValue).toArray();

	// For each segment, linear regression is used as the compression scheme
	// 3) Per-segment regression -> a, b
	int numSeg = breakpoints.length - 1;
	double[] slopes = new double[numSeg];
	double[] intercepts = new double[numSeg];

	for(int s = 0; s < numSeg; s++) {
		int start = breakpoints[s];
		int end = breakpoints[s + 1];

		double[] ab = regressSegment(column, start, end); // uses the same statistics as computeSegmentCost
		slopes[s] = ab[0];
		intercepts[s] = ab[1];
	}

	// Build the data structure: ColGroupPiecewiseLinearCompressed
	return ColGroupPiecewiseLinearCompressed.create(colIndexes, breakpoints, slopes, intercepts, numRows);
}

public static double[] getColumn(MatrixBlock in, int colIndex) {
	int numRows = in.getNumRows(); // number of rows [web:16]
	double[] column = new double[numRows]; // buffer for the column values

	for(int r = 0; r < numRows; r++) {
		column[r] = in.get(r, colIndex); // read value at (r, colIndex) [web:16][web:25]
	}
	return column;
}

public static List<Integer> computeBreakpoints(CompressionSettings cs, double[] column) {
	int n = column.length;
	double targetMSE = cs.getPiecewiseTargetLoss();
	// Case A: no target loss given -> simple variant with a fixed λ
	if(Double.isNaN(targetMSE) || targetMSE <= 0) {
		double lambda = 5.0;
		return computeBreakpointsLambda(column, lambda);
	}

	// Case B: target loss set -> respect the global error budget
	double sseMax = n * targetMSE; // MSE -> SSE budget

	double lambdaMin = 0.0; // many segments, minimal error
	double lambdaMax = 1e6; // few segments, more error

	List<Integer> bestBreaks = null;

	for(int it = 0; it < 20; it++) { // binary search over λ
		double lambda = 0.5 * (lambdaMin + lambdaMax);

		List<Integer> breaks = computeBreakpointsLambda(column, lambda);
		double totalSSE = computeTotalSSE(column, breaks);

		if(totalSSE <= sseMax) {
			// budget satisfied: try a larger λ to obtain even fewer segments
			bestBreaks = breaks;
			lambdaMin = lambda;
		}
		else {
			// error too large: decrease λ to allow more segments
			lambdaMax = lambda;
		}
	}

	if(bestBreaks == null)
		bestBreaks = computeBreakpointsLambda(column, lambdaMin);

	return bestBreaks;
}

public static List<Integer> computeBreakpointsLambda(double[] column, double lambda) {
	int sizeColumn = column.length;
	double[] dp = new double[sizeColumn + 1];
	int[] prev = new int[sizeColumn + 1];

	dp[0] = 0.0;

	for(int index = 1; index <= sizeColumn; index++) {
		dp[index] = Double.POSITIVE_INFINITY;
		for(int i = 0; i < index; i++) { // segment [i, index)
			double costCurrentSegment = computeSegmentCost(column, i, index); // SSE
			double candidateCost = dp[i] + costCurrentSegment + lambda;
			if(candidateCost < dp[index]) {
				dp[index] = candidateCost;
				prev[index] = i;
			}
		}
	}

	List<Integer> segmentLimits = new ArrayList<>();
	int breakpointIndex = sizeColumn;
	while(breakpointIndex > 0) {
		segmentLimits.add(breakpointIndex);
		breakpointIndex = prev[breakpointIndex];
	}
	segmentLimits.add(0);
	Collections.sort(segmentLimits);
	return segmentLimits;
}

public static double computeSegmentCost(double[] column, int start, int end) {
	int n = end - start;
	if(n <= 1)
		return 0.0;

	double[] ab = regressSegment(column, start, end);
	double slope = ab[0];
	double intercept = ab[1];

	double sse = 0.0;
	for(int i = start; i < end; i++) {
		double x = i;
		double y = column[i];
		double yhat = slope * x + intercept;
		double diff = y - yhat;
		sse += diff * diff;
	}
	return sse; // or sse / n as MSE
}

public static double computeTotalSSE(double[] column, List<Integer> breaks) {
	double total = 0.0;
	for(int s = 0; s < breaks.size() - 1; s++) {
		int start = breaks.get(s);
		int end = breaks.get(s + 1);
		total += computeSegmentCost(column, start, end); // SSE of this segment
	}
	return total;
}

public static double[] regressSegment(double[] column, int start, int end) {
	int n = end - start;
	if(n <= 0)
		return new double[] {0.0, 0.0};

	double sumX = 0, sumY = 0, sumXX = 0, sumXY = 0;
	for(int i = start; i < end; i++) {
		double x = i;
		double y = column[i];
		sumX += x;
		sumY += y;
		sumXX += x * x;
		sumXY += x * y;
	}

	double nD = n;
	double denom = nD * sumXX - sumX * sumX;
	double slope, intercept;
	if(denom == 0) {
		slope = 0.0;
		intercept = sumY / nD;
	}
	else {
		slope = (nD * sumXY - sumX * sumY) / denom;
		intercept = (sumY - slope * sumX) / nD;
	}
	return new double[] {slope, intercept};
}
To keep this file clean, I recommend that you create a new class called PiecewiseLinearUtils in the package functional. Your compressPiecewiseLinearFunctional(...) then just calls PiecewiseLinearUtils.compressSegmentedLeastSquares(...).
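A hedged sketch of the resulting delegation (the class, method, and package names follow the suggestion above and do not exist yet):

public static AColGroup compressPiecewiseLinearFunctional(IColIndex colIndexes, MatrixBlock in,
	CompressionSettings cs) {
	return PiecewiseLinearUtils.compressSegmentedLeastSquares(colIndexes, in, cs);
}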
Here please revert the file. Did you change anything in this file (except the tabs -> spaces conversion, which should be reverted)?
You might consider creating a variable double targetLoss and a method public CompressionSettingsBuilder setTargetLoss(double loss) {...}. If you then add targetLoss as a parameter of the CompressionSettings constructor, you can set the target loss directly via the CompressionSettingsBuilder.
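A hedged sketch of that builder extension (the field name, default value, and how create() forwards the value are assumptions):

private double targetLoss = Double.NaN; // NaN = no loss bound configured

public CompressionSettingsBuilder setTargetLoss(double loss) {
	this.targetLoss = loss;
	return this;
}

The builder's create() call would then pass targetLoss through to the extended CompressionSettings constructor.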
import static org.apache.sysds.runtime.compress.colgroup.ColGroupFactory.computeSegmentCost;
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.fail;
import static org.junit.jupiter.api.Assertions.assertTrue;
Remove the Jupiter assertions; they will cause the build to fail, as we don't use Jupiter.
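Concretely, the last import above would become the JUnit 4 variant that the other imports already use:

// import static org.junit.jupiter.api.Assertions.assertTrue;  // remove
import static org.junit.Assert.assertTrue;                     // JUnit 4 equivalent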
There should be no underscores in method names.
Move this test file to test/component/compress/colgroup.
You have a lot of isolated tests (which also look like autogenerated tests and not handwritten). It would be nice to have more tests. Please remove some redundant ones, and add tests on randomly generated data (with a fixed seed) where you create a ColGroupPiecewiseLinearCompressed and then decompressToDenseBlock. You then compare it to the original data and compute a loss (which should be no more than some upper bound).
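A hedged sketch of such a test (create(...), decompressToDenseBlock(...), and setTargetLoss(...) follow the PR code and the suggestions above; the exact signatures and the ColIndexFactory helper are assumptions):

@Test
public void testRandomColumnRespectsLossBound() {
	final int n = 1000;
	final double targetMSE = 0.01;
	Random rand = new Random(42); // fixed seed for reproducibility
	MatrixBlock in = new MatrixBlock(n, 1, false);
	for(int r = 0; r < n; r++)
		in.set(r, 0, 0.5 * r + rand.nextGaussian()); // roughly linear data plus noise

	CompressionSettings cs = new CompressionSettingsBuilder().setTargetLoss(targetMSE).create();
	AColGroup g = ColGroupFactory.compressPiecewiseLinearFunctional(ColIndexFactory.create(1), in, cs);

	MatrixBlock out = new MatrixBlock(n, 1, false);
	out.allocateDenseBlock();
	g.decompressToDenseBlock(out.getDenseBlock(), 0, n, 0, 0);

	double sse = 0;
	for(int r = 0; r < n; r++) {
		double d = in.get(r, 0) - out.get(r, 0);
		sse += d * d;
	}
	assertTrue("MSE above configured bound", sse / n <= targetMSE + 1e-8);
}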
/**
 * Target total loss for piecewise linear compression. Interpretation: the maximum allowed global MSE per value in
 * the column. 0.0 ~ effectively lossless, many segments; >0 ~ more approximation allowed, fewer segments.
 */
Weird comment
// First, store the contents of one column

int numRows = in.getNumRows();
int colIdx = colIndexes.get(0); // the first column
You take the first column, which is fine for now, but in a finished implementation you would either repeat compression on every column or do a multidimensional regression, where you treat a 'row' of all indices as a vector.
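For example, a hedged sketch of the per-column variant (reusing the existing single-column path; ColIndexFactory.create is assumed to accept an int[] of column indices):

AColGroup[] groups = new AColGroup[colIndexes.size()];
for(int c = 0; c < colIndexes.size(); c++) {
	IColIndex single = ColIndexFactory.create(new int[] {colIndexes.get(c)});
	groups[c] = compressPiecewiseLinearFunctional(single, in, cs);
}
// alternatively, fit one multidimensional regression per segment, treating each row of the selected columns as a vector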
@Override
public double getIdx(int r, int colIdx) {
	// ✅ CRUCIAL: bounds check for colIdx!
Avoid emojis. Also, they are usually a hint of LLM-generated code (which is strictly forbidden for your submissions).