
Conversation


@mori49 mori49 commented Jan 29, 2026

No description provided.

Contributor

@janniklinde janniklinde left a comment


Thank you for your first contribution @mori49, this is a good start. I left some comments in the code. You used segmented least squares, which is a fine approach (even though control over the actual loss is quite limited). One limiting factor is the compression complexity of O(n³), which is not viable for production compression. This particular approach can be optimized to O(n²) by precomputing prefix sums for the SSE computation (please first address the smaller formatting issues and other code suggestions before approaching that optimization).
In general, we may think of a more lightweight and accurate method to preserve targetLoss as an upper bound (this will come after the first submission deadline).
Also, please avoid German comments or variable/method names in your contribution.
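For illustration, the prefix-sum optimization could be sketched like this (a hedged sketch on a plain double[] column, outside of SystemDS; buildPrefix and segmentSSE are hypothetical helper names). The idea: accumulate prefix sums of x, y, x², xy, and y² once in O(n); then the least-squares SSE of any segment [start, end) follows from centered sums in O(1), so the O(n) inner cost loop of the DP disappears and the overall complexity drops from O(n³) to O(n²).

```java
public class SegmentCost {
	// Prefix sums over x, y, x*x, x*y, y*y, with x = row index.
	static double[][] buildPrefix(double[] col) {
		int n = col.length;
		double[][] p = new double[5][n + 1]; // 0:Sx 1:Sy 2:Sxx 3:Sxy 4:Syy
		for(int i = 0; i < n; i++) {
			double x = i, y = col[i];
			p[0][i + 1] = p[0][i] + x;
			p[1][i + 1] = p[1][i] + y;
			p[2][i + 1] = p[2][i] + x * x;
			p[3][i + 1] = p[3][i] + x * y;
			p[4][i + 1] = p[4][i] + y * y;
		}
		return p;
	}

	// O(1) SSE of the best-fit line on [start, end), via centered sums.
	static double segmentSSE(double[][] p, int start, int end) {
		double n = end - start;
		if(n <= 1)
			return 0.0;
		double sx = p[0][end] - p[0][start];
		double sy = p[1][end] - p[1][start];
		double sxx = p[2][end] - p[2][start];
		double sxy = p[3][end] - p[3][start];
		double syy = p[4][end] - p[4][start];
		double cxx = sxx - sx * sx / n;
		double cxy = sxy - sx * sy / n;
		double cyy = syy - sy * sy / n;
		if(cxx == 0)
			return Math.max(cyy, 0.0); // degenerate segment
		return Math.max(cyy - cxy * cxy / cxx, 0.0); // clamp round-off
	}

	public static void main(String[] args) {
		double[] col = new double[100];
		for(int i = 0; i < col.length; i++)
			col[i] = 2.0 * i + 3.0; // exactly linear -> SSE ~ 0
		double[][] p = buildPrefix(col);
		System.out.println(segmentSSE(p, 10, 90) < 1e-6);
	}
}
```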

Contributor


Move this class from the colgroup/scheme package to colgroup/.
In general, all methods that are currently unimplemented should throw new NotImplementedException()

Contributor


This file should not be part of the PR. You can keep it locally but you should untrack it and not add it to your commits. You could use git rm --cached bin/systemds-standalone.sh.

Contributor


It seems like you reformatted the file to revert the tabs -> spaces conversion, which is good. However, there are still many unnecessary changes. I would recommend you revert that file to the original state of this repository and then only add the new CompressionType enum value PiecewiseLinear.

Comment on lines +1077 to +1250
public static AColGroup compressPiecewiseLinearFunctional(IColIndex colIndexes, MatrixBlock in,
	CompressionSettings cs) {

	// First, extract the contents of one column
	int numRows = in.getNumRows();
	int colIdx = colIndexes.get(0); // the first column
	double[] column = getColumn(in, colIdx);

	// Determine breakpoints: partitioning into segments
	List<Integer> breakpointsList = computeBreakpoints(cs, column);
	int[] breakpoints = breakpointsList.stream().mapToInt(Integer::intValue).toArray();

	// Per segment, fit a linear regression -> slope a, intercept b
	int numSeg = breakpoints.length - 1;
	double[] slopes = new double[numSeg];
	double[] intercepts = new double[numSeg];

	for(int s = 0; s < numSeg; s++) {
		int start = breakpoints[s];
		int end = breakpoints[s + 1];

		double[] ab = regressSegment(column, start, end); // uses the same stats as computeSegmentCost
		slopes[s] = ab[0];
		intercepts[s] = ab[1];
	}

	// Build the data structure: ColGroupPiecewiseLinearCompressed
	return ColGroupPiecewiseLinearCompressed.create(colIndexes, breakpoints, slopes, intercepts, numRows);
}

public static double[] getColumn(MatrixBlock in, int colIndex) {
	int numRows = in.getNumRows(); // number of rows
	double[] column = new double[numRows]; // buffer for the column

	for(int r = 0; r < numRows; r++) {
		column[r] = in.get(r, colIndex); // read the value at (r, colIndex)
	}
	return column;
}

public static List<Integer> computeBreakpoints(CompressionSettings cs, double[] column) {
	int n = column.length;
	double targetMSE = cs.getPiecewiseTargetLoss();

	// Case A: no target loss given -> simple variant with a fixed lambda
	if(Double.isNaN(targetMSE) || targetMSE <= 0) {
		double lambda = 5.0;
		return computeBreakpointsLambda(column, lambda);
	}

	// Case B: target loss set -> respect the global error budget
	double sseMax = n * targetMSE; // MSE -> SSE budget

	double lambdaMin = 0.0; // many segments, minimal error
	double lambdaMax = 1e6; // few segments, more error

	List<Integer> bestBreaks = null;

	for(int it = 0; it < 20; it++) { // binary search on lambda
		double lambda = 0.5 * (lambdaMin + lambdaMax);

		List<Integer> breaks = computeBreakpointsLambda(column, lambda);
		double totalSSE = computeTotalSSE(column, breaks);

		if(totalSSE <= sseMax) {
			// budget met: try a larger lambda to get even fewer segments
			bestBreaks = breaks;
			lambdaMin = lambda;
		}
		else {
			// error too large: decrease lambda, allow more segments
			lambdaMax = lambda;
		}
	}

	if(bestBreaks == null)
		bestBreaks = computeBreakpointsLambda(column, lambdaMin);

	return bestBreaks;
}

public static List<Integer> computeBreakpointsLambda(double[] column, double lambda) {
	int sizeColumn = column.length;
	double[] dp = new double[sizeColumn + 1];
	int[] prev = new int[sizeColumn + 1];

	dp[0] = 0.0;

	for(int index = 1; index <= sizeColumn; index++) {
		dp[index] = Double.POSITIVE_INFINITY;
		for(int i = 0; i < index; i++) { // segment [i, index)
			double costCurrentSegment = computeSegmentCost(column, i, index); // SSE
			double candidateCost = dp[i] + costCurrentSegment + lambda;
			if(candidateCost < dp[index]) {
				dp[index] = candidateCost;
				prev[index] = i;
			}
		}
	}

	List<Integer> segmentLimits = new ArrayList<>();
	int breakpointIndex = sizeColumn;
	while(breakpointIndex > 0) {
		segmentLimits.add(breakpointIndex);
		breakpointIndex = prev[breakpointIndex];
	}
	segmentLimits.add(0);
	Collections.sort(segmentLimits);
	return segmentLimits;
}

public static double computeSegmentCost(double[] column, int start, int end) {
	int n = end - start;
	if(n <= 1)
		return 0.0;

	double[] ab = regressSegment(column, start, end);
	double slope = ab[0];
	double intercept = ab[1];

	double sse = 0.0;
	for(int i = start; i < end; i++) {
		double x = i;
		double y = column[i];
		double yhat = slope * x + intercept;
		double diff = y - yhat;
		sse += diff * diff;
	}
	return sse; // or sse / n as MSE
}

public static double computeTotalSSE(double[] column, List<Integer> breaks) {
	double total = 0.0;
	for(int s = 0; s < breaks.size() - 1; s++) {
		int start = breaks.get(s);
		int end = breaks.get(s + 1);
		total += computeSegmentCost(column, start, end); // SSE of the segment
	}
	return total;
}

public static double[] regressSegment(double[] column, int start, int end) {
	int n = end - start;
	if(n <= 0)
		return new double[] {0.0, 0.0};

	double sumX = 0, sumY = 0, sumXX = 0, sumXY = 0;
	for(int i = start; i < end; i++) {
		double x = i;
		double y = column[i];
		sumX += x;
		sumY += y;
		sumXX += x * x;
		sumXY += x * y;
	}

	double nD = n;
	double denom = nD * sumXX - sumX * sumX;
	double slope, intercept;
	if(denom == 0) {
		slope = 0.0;
		intercept = sumY / nD;
	}
	else {
		slope = (nD * sumXY - sumX * sumY) / denom;
		intercept = (sumY - slope * sumX) / nD;
	}
	return new double[] {slope, intercept};
}

Contributor


To keep this file clean, I recommend creating a new class called PiecewiseLinearUtils in the functional package. Your compressPiecewiseLinearFunctional(...) then just calls PiecewiseLinearUtils.compressSegmentedLeastSquares(...).
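The suggested refactor could be sketched roughly as below; the class and method names follow the comment above, but the parameter types here are simplified stand-ins for the real SystemDS types, and the body is a placeholder.

```java
// Hypothetical sketch of the delegation refactor; types are simplified stand-ins.
class PiecewiseLinearUtils {
	static double[] compressSegmentedLeastSquares(double[] column, double targetLoss) {
		// ... segmented least-squares DP would live here ...
		return new double[] {0.0, 0.0}; // placeholder result: {slope, intercept}
	}
}

public class FactorySketch {
	// The factory method then only delegates to the utils class:
	static double[] compressPiecewiseLinearFunctional(double[] column, double targetLoss) {
		return PiecewiseLinearUtils.compressSegmentedLeastSquares(column, targetLoss);
	}

	public static void main(String[] args) {
		System.out.println(compressPiecewiseLinearFunctional(new double[] {1, 2, 3}, 0.1).length);
	}
}
```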

Contributor


Here, please revert the file. Did you change anything in this file (other than the tabs -> spaces conversion, which should be reverted)?

You might consider creating a variable double targetLoss and a method public CompressionSettingsBuilder setTargetLoss(double loss) {...}. If you then add targetLoss as a parameter of the CompressionSettings constructor, you can set the target loss directly via the CompressionSettingsBuilder.
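The builder wiring could look roughly like this (a simplified stand-in; the real CompressionSettings and CompressionSettingsBuilder carry many more fields and this only shows the targetLoss path):

```java
// Hypothetical sketch of threading targetLoss through the settings builder.
class CompressionSettings {
	final double targetLoss;

	CompressionSettings(double targetLoss) {
		this.targetLoss = targetLoss;
	}

	double getPiecewiseTargetLoss() {
		return targetLoss;
	}
}

class CompressionSettingsBuilder {
	private double targetLoss = Double.NaN; // NaN = no loss bound requested

	public CompressionSettingsBuilder setTargetLoss(double loss) {
		this.targetLoss = loss;
		return this; // fluent builder style
	}

	public CompressionSettings create() {
		return new CompressionSettings(targetLoss);
	}
}

public class BuilderSketch {
	public static void main(String[] args) {
		CompressionSettings cs = new CompressionSettingsBuilder().setTargetLoss(0.5).create();
		System.out.println(cs.getPiecewiseTargetLoss());
	}
}
```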

import static org.apache.sysds.runtime.compress.colgroup.ColGroupFactory.computeSegmentCost;
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.fail;
import static org.junit.jupiter.api.Assertions.assertTrue;
Contributor


Remove the Jupiter assertions; they will cause the build to fail, as we don't use Jupiter.

Contributor


There should be no underscores in method names.

Move this test file to test/component/compress/colgroup.

You have a lot of isolated tests (which also look auto-generated rather than handwritten). It would be nice to have more meaningful tests. Please remove some redundant ones, and add tests on randomly generated data (with a fixed seed) where you create a ColGroupPiecewiseLinearCompressed and then decompressToDenseBlock. Then compare the result to the original data and compute a loss (which should be no more than some upper bound).
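The suggested round-trip test pattern could be sketched like this (fixed-seed random data, compress, decompress, bound the loss). The per-segment least-squares fit below is only a stand-in for the real ColGroupPiecewiseLinearCompressed + decompressToDenseBlock path, and the 0.05 bound is an illustrative choice:

```java
import java.util.Random;

public class RoundTripLossSketch {
	public static void main(String[] args) {
		Random rnd = new Random(42); // fixed seed -> reproducible test data
		int n = 1000, segLen = 50;
		double[] data = new double[n];
		for(int i = 0; i < n; i++)
			data[i] = 0.01 * i + rnd.nextGaussian() * 0.1; // linear trend + noise

		// "Compress" each fixed-size segment to a line, "decompress", accumulate SSE.
		double sse = 0.0;
		for(int s = 0; s < n; s += segLen) {
			int end = Math.min(s + segLen, n);
			double[] ab = fit(data, s, end);
			for(int i = s; i < end; i++) {
				double diff = data[i] - (ab[0] * i + ab[1]); // reconstructed value
				sse += diff * diff;
			}
		}
		double mse = sse / n;
		System.out.println(mse <= 0.05); // loss must stay below the chosen bound
	}

	static double[] fit(double[] y, int start, int end) {
		double n = end - start, sx = 0, sy = 0, sxx = 0, sxy = 0;
		for(int i = start; i < end; i++) {
			sx += i;
			sy += y[i];
			sxx += (double) i * i;
			sxy += (double) i * y[i];
		}
		double denom = n * sxx - sx * sx;
		double a = denom == 0 ? 0 : (n * sxy - sx * sy) / denom;
		double b = (sy - a * sx) / n;
		return new double[] {a, b};
	}
}
```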

Comment on lines +139 to +142
/**
* Ziel-Gesantverlust für piecewise Lineace Komocession• Interpretation: maximal entaubter Alobaler MSE pro Went in
* der Sealte. O.O ~ quasi verlustfrei, viele Segmente >0 ~ mehr Approximation entaubt, weniger Segmente
*/
Contributor


Weird comment

// First, extract the contents of one column

int numRows = in.getNumRows();
int colIdx = colIndexes.get(0); // the first column
Contributor


You take the first column, which is fine for now, but in a finished implementation you would either repeat compression on every column or do a multidimensional regression, where you treat a 'row' of all indices as a vector.
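The per-column variant could be sketched like this (a hypothetical outline; compressOne stands in for the existing single-column piecewise-linear path, and the parameter arrays are simplified stand-ins for the SystemDS types):

```java
// Hypothetical sketch: repeat single-column compression for every column.
public class PerColumnSketch {
	static double[] compressOne(double[][] in, int col) {
		// ... the existing single-column piecewise-linear compression would run here ...
		return new double[] {0.0, 0.0}; // placeholder {slope, intercept}
	}

	public static void main(String[] args) {
		double[][] in = new double[10][3];
		int numCols = in[0].length;
		double[][] perColParams = new double[numCols][];
		for(int c = 0; c < numCols; c++) // one compression pass per column
			perColParams[c] = compressOne(in, c);
		System.out.println(perColParams.length);
	}
}
```

A multidimensional regression over row vectors would instead fit one model across all column indices at once.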


@Override
public double getIdx(int r, int colIdx) {
// ✅ CRUCIAL: bounds check for colIdx!
Contributor


Avoid emojis. They are also usually a hint of LLM-generated code (which is strictly forbidden for your submissions).
