This document discusses the spreadsheet software that is available, notably running on Linux. Spreadsheets are a historically interesting and useful class of applications. Early in my career, when working as an accountant, I used them extensively and intensively to collect and summarize financial information.
There exists an assortment of "free" spreadsheet packages, most not considered to be terribly "credible" as they do not read or write the data formats used by Microsoft Excel or Lotus 123, which are considered "industry standard."
There are a number of both free and commercial packages that do a good job of "understanding" various proprietary spreadsheet formats.
This document also includes some historical background about spreadsheets.
Based on the history of personal computers, spreadsheets may fairly readily be argued to be the most important application area on personal computers.
Something that isn't questionable is that the entry of personal computers into business began as a specific result of the advent of spreadsheet software.
Businesses bought Apple II computers because they wanted to use VisiCalc. (Which, interestingly to operating system aficionados, was originally developed on the Multics platform, according to VisiCalc coauthor Bob Frankston. Yet another argument for the notion that virtually all important modern computing innovations took place 20 years ago on the Multics platform...) More details may be found at VisiCalc: Information from its creators, Dan Bricklin and Bob Frankston
Thanks are in order to Bob Frankston for some corrections he has provided to this history.
When Lotus 123 became available for IBM PCs, the cycle continued, and IBM PC sales took off. Then Microsoft got into the picture, and things have unfortunately degraded considerably...
Here is another Brief History of Spreadsheets. The straight facts are pretty accurate; I'll just take a bit of issue with a couple points that represent "editorial opinion" that we can probably agree to disagree over...
The authors suggested that Lotus 123's "A1" referencing system was "more intuitive" than the "R1C1" system used by various other spreadsheets notably including Microsoft's MultiPlan.
The A1 notation originated in VisiCalc , not 1-2-3. The reasoning for it was simple: it was simple to type. Reversing standard notation, we used the letters for the columns because there were fewer columns than rows.
|-- Bob Frankston|
They stated that: The spreadsheet instantly became easier to use than the archaic interface of PC-DOS products...
The text-based user interfaces were hardly "archaic" at the time; they were as up to date at the time as anything was. It is indeed fairly convenient to select "blocks" using a mouse, and it is fair to claim this is easier to learn than the (quite functional) keyboard-based ways for doing this that were and likely still are faster. (Except with Excel , where the keyboard interface appears to have been made deliberately arcane, but I digress...)
Spreadsheets have provided for (without most of the users having any conscious awareness of this), the large scale deployment of Cellular Automata. One [FAQ on Cellular Automata] defined Cellular Automata thus:
A cellular automaton is a discrete dynamical system. Space, time, and the states of the system are discrete. Each point in a regular spatial lattice, called a cell, can have any one of a finite number of states. The states of the cells in the lattice are updated according to a local rule. That is, the state of a cell at a given time depends only on its own state one time step previously, and the states of its nearby neighbors at the previous time step. All cells on the lattice are updated synchronously. Thus the state of the entire lattice advances in discrete time steps.
Spreadsheets satisfy these requirements, with a few "bits of fuzziness," notably that:
"Nearby neighbors" don't have to be terribly nearby with a spreadsheet.
There are some questions of synchronicity.
Spreadsheet packages have often provided some configurability to indicate update policies that are not synchronous, but which rather involve some ordering of updates.
Pure CA systems tend to apply a single update rule to many cells in the lattice; in contrast, the spreadsheet software in common use attach individual formulae to each and every cell.
Improv and related packages would apply rules to different regions of cells; it seems to me that a more intelligent use of spreadsheets should involve some sort of "pattern matching" to help discover such rules for the user.
I've been using spreadsheets of various sorts on a wide variety of platforms since the mid-'80s. The first one which I used any substantial amount was SuperCalc, running under ZR-DOS (an enhanced CP/M "clone" that grew into ZCPR). Others have included (in rough chronological order):
Lotus Symphony, which integrated a wordprocessor of dubious functionality, a simple "form" processor, and a telecom module into the mix;
A 6502 Atari 8-bit based spreadsheet called SpeedCalc that was published in Compute! Magazine;
LDW Power, a Lotus 123 "clone" for the Atari ST;
Borland's Quattro Pro, used in grad school primarily for its linear programming capabilities;
The spreadsheet built into the TRS-80 Model 100 portable computer (a very small, early version of Microsoft Multiplan that eventually grew into the monster now known as Excel)
As Easy As (guess who they were cloning?);
MS-DOS-based and available as shareware, this is still a featureful spreadsheet. It doesn't allow heavy-duty GUI formatting of spreadsheets, but only people with far too much time on their hands do so...
The first version of Lucid 3D ran on TRS-80 Model 100 laptop computers, and the necessity for frugality on that platform resulted in a design that was sparse, frugal, and indeed, extremely lucid.
The MS-DOS version was my favorite MS-DOS program of any sort; its user interface is a wonderful model of integration of powerful use of both keyboard and mouse; the program was still frugal in its use of disk/RAM and yet provided excellent overall functionality.
SC ("Spreadsheet Calculator");
VC (Enhanced, "more visual" version of SC);
The much despised memory hog, Microsoft Excel ;
The spreadsheet built into the Psion 3 handheld computer;
with bits of playing around with others such as Lotus Improv , the Microsoft Works spreadsheet for MS-DOS, the FSF's Oleo, an entirely-custom one that I wrote in LISP (mostly just as a programming exercise), and sundry fiddling around with a pretty wide variety of MS-DOS, MS-Windows, and Unix-based "integrated packages." The most notable spreadsheet of which I have never made significant use is VisiCalc, the program that popularized the whole idea of software that that allow interactive entry of numbers, text, and formulae arranged in rows and columns.
Tasks I've done with spreadsheets have included:
Preparation of accounting working papers
Preparation of financial statements
Economic simulations via difference equations/"cellular automata"
Mathematical modelling, solving linear and nonlinear programs
In short, I've done enough work of enough various types with a large enough variety of spreadsheet packages that I figure I'm entitled to rant a little bit about their proper use.
Here are some further useful resources on the ancient and more modern history of spreadsheets:
"Enhancements" of spreadsheets over the last few years have not involved substantive improvements in functionality, but have primarily just involved enhancing their "typesetting" capabilities, that is, the ability to change fonts, insert special formatting, and to otherwise make tables look "pretty."
I put "enhancements" in quotes because I am skeptical that this actually represents a true improvement of either the quality of the information or user efficiency in finding and using information.
These so-called improvements gloss over the continuing problems that plague spreadsheet users:
Spreadsheet models encourage the use of "spaghetti" logic, where cells point to cells that point to cells, and can grow into random networks of calculation logic;
They permit lots of easy off-by-one errors;
They generally are difficult to verify/audit;
They do not provide good tools for managing data either in terms of consolidation or searching for specific detail;
Perhaps most importantly, despite their convenience, spreadsheets are not a robust repository for information.
I have seen one multinational enterprise that (believe it or not) built a budgeting system atop sets of dozens of departmental spreadsheets that they would roll up into a master budget; while it's a neat extension of the technology, only a fool would try to use this to run a large enterprise. One bad link in one subsheet, and the whole house of cards could fall down. (And the "top" vendor these days, Microsoft, isn't noted for building products that are of industrial grade robustness.)
The last few points point towards where I would like to see spreadsheets go. They have been, and are very good at producing ad-hoc, one-off reports. This is a proper use of spreadsheets.
They are often being used instead as repositories for information that really ought to be managed by a database management system of some sort.
What spreadsheets should do is to allow, nay encourage, the use of data extracts from external sources, notably relational databases. The use of named ranges (which are a venerable feature from at least as early as Lotus 123 v2.01) is of assistance; Lotus Improv was a rather complex-to-use test platform for improved "modelling" whose functionality included database extraction.
Using external repositories permits the benefits of:
A single repository that can be kept correct, rather than a multitude of mutually incompatible data stores;
Data synchronization (a restatement of the last);
All the good RDBMS "stuff" like:
Maintaining field relationships,
Data modelling and
In effect, the real point I would propose is that the task of building a spreadsheet should involve some data modelling, with thought not just about the report at hand, but also about where the data comes from and perhaps should go to.
Microsoft Excel does have a capability to connect to databases as does LibreOffice ; the support is yet a bit on the primitive side.
I would suggest that what happened in the history of the developments is that, for "political" reasons, the developments attributable to Lotus Improv (originally developed using NeXtstep ) were lost, and that its better model of spreadsheet construction/management was thereby lost.
Improv provided an interface that actively encouraged, nay required the user to add additional structure to spreadsheet models.
It provided the ability to define a variety of "categories" to provide multidimensional analysis, as well as "groups" to allow the grouping of data that is not so readily decomposed.
Every time some form of categorization is defined, this defines a sequence of "items," whether that be a list of months ("January, February, March, ..."), a list of countries ("Canada, United States, United Kingdom, France, Germany, ..."), continents ("North America, South America, Asia, Europe, Africa, Australia, Antarctica,"), or whatever.
Each "item" represents a row or column, and thus may contain many cells.
In a company selling things internationally, it would be unambiguously valuable to set up Country as a category, as you would certainly need to analyze data based on that. Currency exchange rates are based on countries' currencies, the set of laws that apply depend on the country, and so forth.
Periods of time, such as months, quarters, and years, also tend to unambiguously reflect a "dimension," in this case that of time.
On the other hand, reports might need to group countries together into "regions" or "continents," depending on who is looking at the data. Those groupings are likely to be fuzzier, whether we're talking about grouping several countries together to represent a Continent or Sales Region, or if there is need to have smaller regions (such as states, provinces, or counties, or shires) to decompose the activities within a country.
In both cases, it would be fairly appropriate to define a less-structured "group" that does not add an extra dimension to the hierarchy, and thereby to the complexity of the data model. Thus, a set of related items are collected together to represent a "Group."
The behaviour of formulae in Improv is exceedingly different from that of traditional spreadsheets.
In a traditional spreadsheet, a formula is associated with a cell, and in order to have a particular formula apply to many cells, you must copy the formula into that range of cells.
In Improv, on the other hand, formulae are not associated with cells, but are "first class" objects associated with the spreadsheet, and rather than representing a mere single cell, are applied to an entire range/group of cells, and may thereby operate on items, groups, and categories.
Instead of operating on cryptic "cell ranges," they operate on named ranges, and thereby tend to be more readable than the traditional spreadsheet alternative.
Improv formulae almost
always represent vector operations, providing values for multiple
cells at once. Thus, a formula that computes monthly totals across
category "Fruit" might look like
SUM(Fruit). The crosscheck formula, computing annual
totals for each variety of fruit, might be
This example displays the "overlap" issue; the pair of formulae overlap in the cell that contains the total for all fruit for the whole year. Improv discloses this overlap, and allows formulae to be placed into a priority order where "later" formulae override "earlier" formulae.
Here are links to historical information about Improv as well as about other packages that might be considered to be "successors."
This was created as something of a successor to Improv; it might conceivably be ported to Linux, if they get suitable interest.
Advance Planning Solutions Advance had a spreadsheet-like system rather like Improv.
The company was acquired by PeopleSoft in 2000, so parts of this may have been integrated into their applications, but it is not likely still available as a separate product.
... It's worth noting that Improv flattered an earlier program, Javelin (by a company of the same name located in Cambridge MA).
|-- Bob Frankston|
I was aware of the release of Javelin; it was an MS-DOS-based software package that had an unfortunately-brief flash of fame.
See also the web page of one of Javelin's authors, John R. Levine.
It appears that Javelin may have been an early victim of Microsoft Predatory Marketing; an InfoWorld article (now gone) indicates that when Javelin won the InfoWorld "Product of the Year" award, beating out Excel, Bill Gates "got up and stomped out of the room in front of everybody in a spectacularly rude manner."
Another report suggests that they got overambitious, planning to try to dominate Lotus 123 when they really needed to grow their niche. And then were "done in" by bad timing on an IPO, scheduled just a week after the market crash of October 1987.
Other comments suggest that the failure had to do with the software being difficult to use. It was powerful, but business GUI software was in its infancy at the time, and the implication is that Javelin did not have a sufficiently "user-friendly" interface to permit widespread adoption.
These may all have been contributing factors.
You may still be able to get copies of Javelin; I'm told that it was bought out by a "venture capital" group; they integrated it into some data retrieval tools, and you may be able to get a copy of World Bank Indicators - World*Data 1995 which included a copy of Javelin.
Spreadsheet models can be difficult to read and maintain. Spreadsheets provide few facilities for documentation, and although the structure of a spreadsheet program is implicit in the cell equations, it cannot be made explicit as it would if programmed in a conventional programming language. To make spreadsheets easier to use, we are developing Model Master (MM), a compiler that generates spreadsheet equations from textual specifications of models.
An MM program consists of one or more object specifications. To specify single objects, the user describes their attributes or properties, together with equations stating how these depend on one another's present and past values. To specify a complete model, the user describes how these objects are to be connected together, by writing extra equations that say how their attributes are interrelated. MM compiles these specifications into cell equations. It automatically allocates attributes to columns and time points to rows: the user can override these allocations, but will not usually need to do so.
MM is based on a new programming paradigm, System Limit Programming, also used in the development of the Web authoring tool Web-O-Matic.
Furthermore, the paper references some of the preliminary research that has been done on the issue of the correctness of spreadsheets. They are highly dependent on user input, there are several serious vulnerabilities from which spreadsheet tools suffer. The use of a tool like Model Master to construct a spreadsheet allows conscious validation of more of the spreadsheet model which can't but be helpful.
There have been other presentations on Model Master, and it has been augmented to include a decompiler so that a spreadsheet may be turned into a concise set of equation specifications. Several interesting things pop out of that:
The Model Master program may be a more attractive interchange format than raw spreadsheets themselves.
This has the various merits that:
What is transferred is effectively a description of the spreadsheet model; that may be more usefully readable than the spreadsheet itself;
It is likely to be more compact than the "binary dumps" that commercial spreadsheets generate.
The model program can't contain the "macro viruses" that MS Office has been plagued with of late.
Any "nefarious" calculations will be visibly described in the model's text.
For instance, suppose I were to have a special formula for the line calculating my payroll amounts, that would show up.
The decompiler can readily pick up on which calculations are being run "hard-coded," and which are using formulae.
The "tough part," at this point, which warrants considerable additional research, is the notion of doing some searching to find repeated patterns of formulae.
For instance, it is very common for there to be a column of cells that computes some sort of "cost," by multiplying a quantity cell by a price cell, perhaps adding in taxes or other costs, coming up with a "total cost."
It would be very valuable to recognize the repetition of that formula, and essentially present the formula only once in the model. Note that this is exactly the way Improv treated formulae; they were not defined merely for a cell, but rather for a whole range of cells.
In a traditional spreadsheet, the repetition is done by hand, which is one of the major areas that modelling errors creep in. By "pattern searching," such errors may both be found (when decompiling) and avoided altogether (when compiling).
Spreadsheet algebra describes notation to use to allow manipulating Model Master spreadsheet models.
One point of view is to treat spreadsheets as a form of "database," and, in regarding them from that perspective, to use database-oriented tools to work with data.
www.DigDB.com - DIG
DIG is billed as a "Personal Data Warehouse" or "End User Query Tool." It may be used to extract data from spreadsheets, pushing it into a database, to be analyzed using a set of database tools, in this case, written in Java .
By having a "real" database there (albeit a somewhat weak one), they can certainly overcome the common deficiencies where a spreadsheet "chokes" if presented with too much data, or if you try to have too many elements involved in a Pivot Table.
EuSpRIG was founded in March 1999 when researchers from ISACA (Northern UK Chapter), University of Wales Institute Cardiff and the University of Greenwich came together to discuss the ever increasing problem of business risk associated with spreadsheet errors.
Dan Bricklin's Wiki tool that will store web-updatable spreadsheet-structured data
A Windows-only "table-oriented spreadsheet" system
A number of "dead ends" have been encountered in the ongoing development of spreadsheets.
The "traditional" spreadsheet systems went through a process of "racing for successive refinements" through the late 1980s and early 1990s, largely seeking to add "feature points" to win the contests for "most features counted in the reviews."
Unfortunately, this means that the current code bases are not terribly amenable to more radical evolution, and many of the "improvements" are merely cosmetic, in providing more and more sophisticated tools for managing the appearance of the data.
The major problem with the "traditional" spreadsheet system is that it does not provide much in the way of "structuring tools" to recognize and enforce the structure of the data model, as described in Problems with Modern Spreadsheet Developments.
Improv proposed better ways of building models, and essentially mandated constructing spreadsheets as a process of constructing a system model.
This had the unfortunate, and, as it turns out, unacceptable effect of preventing the "free form" spreadsheet construction that traditional spreadsheets encouraged.
In a traditional spreadsheet, you have a set of rows and columns in which you are free to put anything. It is easy to prototype something up, throwing values here and there wherever it seems convenient to throw them.
In Improv, elements could not be added to the matrix without defining the nature of the row and column. This eliminates "doodling around." On the one hand, there is good to it, in that spreadsheets commonly turn into undefined spaghetti. On the other hand, if you have to be able to "theoretically" justify every change you make to the shape of the spreadsheet, then the only people that will be using the tool will be "theorists," and to a great extent, that's what happened with Improv.
Model Master provides considerable power in defining models, provides all sorts of "strong typing" options, and provides the logical extension of having the language specify access to robust data sources like relational databases, but has two substantial demerits:
It altogether rules out "free form" construction of data into sheets
Although this is changing, as construction of a "decompiler" is underway.
It mandates using a declarative programming language to describe the model.
The implicit "programmability" of the cellular automata means that the average user doesn't need to know about programming; unfortunately, Model Manager pushes programming in their face.
It seems to me that a "step forward" is to try to take the merits of each of these approaches, whilst seeking to avoid the demerits.
It's not a mechanism of infinite analytical power; it likely will only be helpful to find some limited bits of structure. Of course, "limited" may still be sufficient to actually provide some useful added functionality to relatively unsophisticated users, and forcing people to start from a data modelling perspective, as with Improv , hasn't proven terribly popular.
The strength of the traditional spreadsheet is in providing a "free form" medium where users may construct models without directly having to program.
So, we start with a front end that is a very "traditional" sort of spreadsheet. Rows, columns, cells, formulae.
It would be nice to get the benefits of Model Master, in providing the ability to attach fairly strong "rules" to portions of the spreadsheet, whether to enforce the use of common formulae or to enforce "strong typing" of the data types used in those regions. For instance, a region that represents "dates" should contain nothing other than legitimate dates.
The route to this is to use some "artificial intelligence-like" techniques to search for patterns in the data, and to write up rules to propose to the user. I will call these "Pattern Wizards."
This could include patterns such as the following:
Detecting sequences of cell contents that look, for instance, like dates.
The proposal would then offer to:
Give the region a name indicating that it is a "date" region;
Attach "type" information to the region, requiring that all cells contain dates;
If a clear sequence is indicated, offer a formula that would compute the contents of the cells.
Detecting that a region contains a Price/Quantity formula.
For instance, there may be a column that multiplies the contents of a cell in another column of the current row (indicating quantity?) by either a specific value, or by the contents of one cell somewhere in the sheet (perhaps containing a constant price?), or by the contents of a cell in another column (indicating a price for the current row?).
Such a pattern is suggestive of a price/quantity relationship, and the system could offer to:
To establish all three regions (price, quantity, total cost) as having names;
To attach "type" information to all three regions;
To name and attach the single formula to all of the formulae cells.
Detecting a "running balance" formula.
If there is commonality that a sequence of cells add together the "cell above" with "values to the left," this looks like a running balance.
A similar set of proposals could be generated, to attach "type" information, to attach names, and to attach the single formula across all the "balance" cells.
In order to provide goodly flexibility in offering "abstractions" that make for convenient formulae, the system would need to allow construction of "user-defined" functions.
My preference would be for this to be a dynamic language such as Lisp; a critical factor is for the language to be quite functional, where cells receive one value that is solely based on the input parameters. Other interesting alternatives would include:
ML, which makes major use of the notion of strong static typing; this would address the issue of the way cells can contain different "types" of data, and prevent many classes of runtime errors.
Lazy evaluation is a particularly useful notion for efficient implementation of spreadsheet calculation, as it allows deferring calculations until they are actually needed. Mind you, this only particularly applies to order of calculations, which isn't really a good enough for picking an implementation language.
It would be valuable for the "extension language" to be the language in which the "pattern wizards" described above are constructed so that it is possible to augment the patterns without a need to deploy a whole new system.
The point here is not to expect users to write their own patterns, as most won't be able to cope with this. The "average user" is not going to be writing "Pattern Wizards," but will rather use "pre-canned" ones.
However, an organization might hire a programmer that looks around to find the organization's "favorite patterns," and create Wizards to detect them.
Note that the "pattern wizards" are given the ability to decide that certain regions of the spreadsheet are to have certain names, and are to use common types/formats/formulae.
A logical extension to this would be to allow attaching database tables to regions, so that you might have columns that look up database information based on either static SQL SELECTs, or look up data based on other cells.
Indeed, if a good database schema can be arrived at for the data that does not represent computed information, it would be very nice to offer the ability to push the data conformant to the schema into a relational database. That provides the merits of:
Eliminating arbitrary size limits, as databases are commonly able to quite readily cope with managing tables containing millions of tuples.
Improving robustness, as good database systems support transactional updates, and generally don't crash very much.
Having the database manage data validation.
For instance, a relational database will reject attempts to add data with invalid dates, where, for instance, an attempt to do INSERT INTO TRANS (ACCT, AMOUNT, DATE) VALUES (101, 255.71, '2002-02-29') will elicit an error message such as ERROR: Bad date - external representation '2002-02-29'.
ZigZag, an invention of Ted Nelson, is a new type of data structure. For mathematicians, the key words would be discrete, multidimensional, locally euclidean, with global directions (coordinate axes). Nonmathematicians can find an explanation in the FAQ at the GZigZag project website, but as a short and very inadequate summary, it is a spreadsheet on steroids. Or a database on acid. Or a filesystem on ... whatever.
There was a free implementation, GZigZag, written in Java using Swing.
Major properties of the system are thus:
the information is stored in cells, kind of like a spreadsheet.
instead of two dimensions, X and Y, that a spreadsheet has, a ZigZag space can have any number of dimensions which are distinguished by strings.
a spreadsheet is globally euclidean, i.e. it is a lattice. ZigZag is only locally euclidean, so the neighborhood of (i.e. the cells next to) a given cell looks euclidean: if you go up and come down, you are back where you were before.
But if you start from location 1, and go up, left, down and right, you might not get back to where you were, let's say you are at location 2. But if you then go left, up, down and right from location 2, you get back to location 1.
The connections in ZigZag are user-alterable so you can connect any two cells along any given dimension, but because of the local euclidean constraint, each cell can be connected to only one cell in the positive and one cell in the negative direction on each dimension.
Computer scientists might note that ZigZag is an interesting special case of graphs.
Information on Excel's Pivot Table facility.
This contains a tool to generate PostgreSQL database views that correspond to Pivot Tables.
Here's an auditing tool for Excel spreadsheets...