If one really thinks about it, text files are a kind of virtual memory. You can add strings, retrieve strings. And if the strings are of a fixed length, then you can even replace strings in a text file.
But I found the text file commands of DOS3.3 and Basic.system very slow. Some recommendations were to save your strings to a BIN file instead. So I combined the two and created a sort of semi-Virtual memory for strings
Applesoft is limited with what it can do with variables. The maximum value for any dimension is 32767, and the actual string space available for strings is far less.
Using my virtual string array add on, you can define up to 65536 strings including string zero. And the maximum length of any string is 255 characters (not 256 because string zero cannot be used since it is used as an end of string marker) Strings used this way can reach Prodos’s maximum file size of 16 Mb limit (65536 x 255 = 16711680)
Only one virtual string array variable is created in memory with 255 characters, one master block and one single byte block is also loaded with the pointers (the 1 block that is loaded with the actual text uses basic.systems buffer so does not take any additional space. If the string spans 2 blocks, then the first part of the string is moved to the string space, then the 2nd block is loaded)
Text files of Prodos’s maximum files size can be created
Multiple Virtual String Array files can be created and used within the same program
Virtual Array Text files can be created using any Word Processor that can create large files (I wrote a short applesoft program that creates a header from the text files so any string can be jumped to automatically as if it were the first string in the list)
Disadvantages – every string must be retrieved from disk although with the advent of flash cards has made the speed of drives more tolerable.
One extra file is created for the pointers.
The commands to use
& DEF VA$(0) – creates a virtual string variable and sets its length to $FF (255) and makes space in string space
& GET VA$(0-65535),”Virtual String Array filename” – gets string from file
& STORE X$,”Virtual String Array filename” – appends the string X$ to the end of the file. If file does not exist then create it. If file is not a Virtual Array file then return with error
Also a separate short applesoft program was created to make a header for the pointers to every string. I used a pretty neat trick to compress the header so that only one byte is needed to point to each string instead of three and a one block master header to point to the start of each string block.
Long story short, hopefully. One block can point to 512 strings. so only 128 blocks would be needed to point to 65535 strings plus one master block. The master block holds the two high bytes of the start of each of the 128 blocks so only one master block is needed. Once one of the 128 blocks has been chosen by a little math (# of the string divided by 512 gives one of the 128 blocks)(coincidently, this same result also points to the hi bytes in the master block), a short search is then made in the found block so that each time the next byte is less than the preceding byte, a counter advances the high bytes until the start of the correct string is found.
Remember, these blocks can point to 512 strings of 255 characters, so the master pointers may also need to be incremented. There, short and sweet.
A three byte pointer header would take 384 blocks to point to each string to reach the 16 MB limit.
One last thought for a use of Virtual String Arrays. One String Array can have 65535 strings and multiple Virtual files can be created.
Each file can be designated as a column. Strings can become formulas. Are you getting the picture.
A Virtual Spreadsheet with 65535 columns by 65535 rows. There is a good program in Nibble magazine Vol6 No8. Dynamic functions. It puts formulas into a string and then uses redefined DEF and VAL commands to evaluate the formula.
To recap, I have created a couple short programs to turn a Random Access Text File (RATF) into a sequential text file to save space that a RATF would need since each line of text is padded to be all the same
length. But a sequential file has all of its text lines in a sequential order of varying length and with a terminator character to mark the end of the line. The only way to get to each line of text is by reading each previous line to get to the line that you may want.
Question: How do you access each line quickly without the time consuming operation of reading each previous line?
Answer: With pointers to each line.
This has the advantages of allowing one to create sequential files with a text editor but using the text file as a RATF. And the file size is substantially reduced due to the removal of padded characters.
To continue on with this topic of creating even smaller RATF files.
A set of short programs can compress a large text file even further.
How, you ask?
Good question. Here’s how.
By creating a library of words from the text file and having a pointer, called a descriptor, to each word from the library. The average word length in our English language is between 4-5 characters. There are only a couple of single letter words (I and a) and also not too many two letter words. (an at is if as ad ah am be by do go my no so to we). At the worst even with a 2 byte descriptor, we are basically at a break even point for compression.
Now keep in mind, that the first occurrence of a word is stored in a library, and you don’t get any compression until that word shows up again. And also a one, two or three byte descriptor takes up memory as well. So on the first occurrence of a word, you actually may lose a couple of bytes. If a document only contained one occurrence of each word, then this compression technique would be an utter disaster.
Originally, only one byte is needed to start, to point to up to 256 words from a library. This would soon be used up as most text documents easily have more than 256 distinct words. So once the value 255 ($FF) is encountered, this would signal the decompression program that it now requires 2 bytes to point to a word from the library.
This would give us access of up to 65536 words from our library. This is still pretty good compression if the majority of the words stored in the library are 3 characters or more.
For the most part, a two-byte descriptor and 65536 words should cover most of the words in a reasonable sized text file. And if needed a third byte can be used with the descriptor to access up to 16,777,216 words in our library. Even the use of this third byte can generally reduce a file by 25% or more since a three-byte descriptor would be used to point to an average 4 to 5 byte word.
Now, punctuation is a slightly different story. Punctuation would have to be considered one-byte words due to their usage and they may follow any character of the alphabet or any word in the library. You can not add punctuation to a word since that would make a new word, and the occurrence of that same word with the added punctuation mark would rarely be used again throughout a document. So, for the best compression, punctuation should be added to the library first so as to be used by the single-byte descriptor without any loss of hard drive space since a one-byte descriptor uses one byte and the punctuation mark takes up one byte.
But, there may be one boon to saving space with punctuation.
Normally, spaces are not compressed and are not needed since pointers only point to a whole word, and it is assumed that a word is followed by a space. If a word is followed by a punctuation mark, then that space would have to be recanted and the punctuation mark inserted.
Also, usually, a punctuation mark is followed by one space, and that space would not need to be included as part of the punctuation word either, but some punctuation marks may have 2 or more spaces following it. Since spaces are not captured as a word unless there is more than one consecutive space, you could then include those extra spaces as part of the punctuation word. The chance of a re-occurrence of the punctuation mark with extra spaces would be quite high, therefore space can be saved and the punctuation word can be compressed, or at the very least a break even point.
Another instance of repetitive punctuation is at the end of each paragraph. A period can be combined with one or two carriage returns to form a word. Space savings can be fairly good when punctuation marks are repeated frequently in a document. A one-byte descriptor with a three-byte punctuation is a 66% savings.
So for the most part, compression of a fair size text document should exceed the 50% mark and may reach as high as 75%. The document would have to be quite large to recognize any significant savings. For example, this short lecture would not see that much of a savings in compression but it may be worth to compress even it as there are some quite large words that are used a number of times. (Compression, characters, descriptor, document, library, punctuation). If each of these words had a two-byte descriptor would result in 60 – 80% compression for those words.
One last note is with capital letters. Words that contain capitals would not have to be distinguished apart from its lower case counterpart if the word is at the beginning of a sentence, but if the word falls in the middle of a sentence, and contains any capital letters, then it will have to be added to the library as a separate word.
-A RATF file with padded characters and no compression can be quite large.
-Close to a 50% reduction in file size can be recognized by being saved as a sparse file.
-Another 50% or more can be seen by removing padded characters and saving as a sequential file and using pointers.
-And lastly, the text document may be compressed to save another 50-75% by using a library of words and having a one, two or three byte descriptor to describe the word being used from the library.
All-in-all. A 300 block RATF file could potentially be reduced down to about 30 blocks. That is a 10 times savings. A must need on a computer with limited memory or hard drive space. A lot of information can be stored in a small amount of space.
That’s it for now