One of the big problems with code generation is unreadable output. Very long lines are easy to generate because bit-by-bit operations are easy to describe algorithmically; this is especially true of LDPC and other ECC codes. I decided to write a line-breaking script. Overall it works OK. The problem itself turns out to be interesting, and is more of an art form than it first appears.
My goal was a script that breaks a line roughly the way a human would. For example:
x = a + b + c + ... + z;
could be expressed as:
x = a + b + c + d + e + f + g + ...
But nothing lines up nicely. A person might write the same code as follows:
x = a + b + c
+ d + e + f ...
or
x = a + b + c +
d + e + f ...
My script attempts a kind of "string correlation" between successive lines. This turns out to be fairly difficult. The line-breaking function looks at word boundaries before a maximum line length, then uses a scoring function to count how many characters would line up with the line above if the string were broken at that point. The break point is then shifted several times to see if the score improves, similar to a cross-correlation.
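As a sketch of the idea (hypothetical names and scoring, not the script's exact code), each candidate word boundary is scored by how many characters of the wrapped tail would sit directly under identical characters on the line above:

```python
def alignment_score(prev_line, next_line):
    # Count positions where the wrapped line's characters match the
    # characters directly above them (whitespace matches don't count).
    return sum(1 for a, b in zip(prev_line, next_line)
               if a == b and not a.isspace())

def best_break(line, max_len):
    # Try each word boundary before max_len and keep the break point
    # whose alignment score against the text above is highest.
    best, best_score = max_len, -1
    for pos in (i for i, ch in enumerate(line[:max_len]) if ch == ' '):
        head, tail = line[:pos], line[pos:].lstrip()
        score = alignment_score(head, tail)
        if score > best_score:
            best, best_score = pos, score
    return best
```

For instance, `best_break("x = a + b + c + d", 10)` returns 7, which breaks after the first `+` so the next `+` lands directly beneath it.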
I ended up adding a factor that reduces the score for large extra indentation; otherwise the algorithm tries too hard to line up characters, and successive line lengths shrink quickly. I also favor runs of successive matching characters. This has the side effect that "std_logic_vector" will try very hard to line up with "std_logic_vector" if it appears twice on a long line.
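Both adjustments can be folded into the score roughly like this (the weights here are illustrative guesses, not the script's actual values):

```python
def adjusted_score(prev_line, next_line, extra_indent):
    # Runs of consecutive matches score super-linearly, so identical
    # substrings such as "std_logic_vector" attract each other strongly...
    run, score = 0, 0.0
    for a, b in zip(prev_line, next_line):
        if a == b and not a.isspace():
            run += 1
            score += run  # a run of n matches contributes n*(n+1)/2
        else:
            run = 0
    # ...while a penalty on extra indentation keeps successive lines
    # from marching steadily to the right.
    return score - 0.5 * extra_indent  # illustrative penalty weight
```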
Another improvement, though a costly one, is to recalculate the score of the previous line-break decision and add it to the score for choosing the current one. This is similar to a fixed-lag smoother, but for text. It prevents output like the following:
x = a + b + c +
d + e +
f + g + h;
This occurs because the score is tied between this break and one that puts d + e + f on its own line. The f + g + h case is tested first, so it gets priority. If the leftmost break got priority instead, it would place +g+h; on the last line, making it the only line starting with "+".
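A sketch of that smoothing step (hypothetical helpers, not the script's code): score the previous break together with the current one, and let a strict comparison give earlier candidates priority on ties:

```python
def align(a, b):
    # Characters that line up exactly (whitespace matches don't count).
    return sum(1 for x, y in zip(a, b) if x == y and not x.isspace())

def two_break_score(line, b1, b2):
    # Combined score of breaking at b1, then at b2 within the remainder.
    # Re-scoring the first decision alongside the second is the
    # fixed-lag smoothing step.
    l1, rest = line[:b1], line[b1:].lstrip()
    l2, l3 = rest[:b2], rest[b2:].lstrip()
    return align(l1, l2) + align(l2, l3)

def best_pair(line, firsts, seconds):
    # Iterating candidates left-to-right with a strict '>' means the
    # leftmost first break wins ties.
    best, best_score = None, -1
    for b1 in firsts:
        for b2 in seconds:
            s = two_break_score(line, b1, b2)
            if s > best_score:
                best, best_score = (b1, b2), s
    return best
```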
My algorithm isn’t perfect. Currently the largest issue is runtime and scaling: it takes around thirty seconds to process a file like gf256_pkg.vhd when most of the line breaks are removed first. The results are pretty good, though.
The code also checks __name__ == '__main__', so the file can be used as a library as well. I included a few options, mainly for adjusting the maximum line length and doing an in-place replacement on files.
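A minimal sketch of that entry point, with hypothetical flag names (the script's actual options may differ):

```python
import argparse

def build_parser():
    # Hypothetical option names, for illustration only.
    p = argparse.ArgumentParser(description='Re-break long generated lines.')
    p.add_argument('files', nargs='*', help='files to process')
    p.add_argument('--max-len', type=int, default=80,
                   help='maximum line length before breaking')
    p.add_argument('--in-place', action='store_true',
                   help='rewrite the files instead of printing to stdout')
    return p

if __name__ == '__main__':
    args = build_parser().parse_args()
    # ...re-break the lines of each file here...
```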
At this time, comments will not break. A line break in the middle of a comment would move the trailing text onto a new line, where it would be interpreted as code. Comments often have a good reason to exceed the line-length limit anyway, so my script simply stops breaking a line once a comment is found.
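The comment handling amounts to something like this sketch ("--" is VHDL's comment marker; the real script may key on other tokens):

```python
def split_at_comment(line, marker='--'):
    # Only the text before the comment marker is eligible for breaking;
    # moving commented text to a new line would turn it into live code.
    # (Naive: a marker inside a string literal would also stop breaking.)
    pos = line.find(marker)
    return (line, '') if pos < 0 else (line[:pos], line[pos:])
```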
Here is the script.