What are other types of data except files?

for.ggame.playing · Jul 4, 2020

What are link, pipe and socket files and how do I use them?
And what non-text types of files are there that I don't know of?

mickey · Jul 4, 2020

for.ggame.playing said:
What are link, pipe and socket files and how do I use them?
And what non-text types of files are there that I don't know of?

A link references an already existing file, either by referring to the same inode (and therefore the same blocks on disk) as the original file (hard link) or by storing a path that points to the original file (symbolic link). Symbolic links may cross filesystem boundaries whereas hard links may not. See ln(1).

(Named) pipes and (UNIX domain) sockets are methods of interprocess communication, used to exchange information between processes running on the same machine (in contrast to processes running on separate machines using the network to communicate). See pipe(2), socket(2) and associated manual pages for more in-depth information on how to use these.

As for the other non-text file types: plenty. Almost any file format carrying data of some sort that is not text qualifies, e.g. executable files, audio, video and all kinds of other data files.
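To make the hard link versus symbolic link difference concrete, here is a minimal Python sketch. The function name link_demo and the file names are made up for illustration, and it assumes a POSIX system whose temporary directory supports both kinds of link.

```python
import os
import tempfile


def link_demo():
    """Create a file plus a hard link and a symbolic link to it, then remove the original name."""
    with tempfile.TemporaryDirectory() as d:
        original = os.path.join(d, "original.txt")
        hard = os.path.join(d, "hard.txt")
        soft = os.path.join(d, "soft.txt")

        with open(original, "w") as f:
            f.write("hello\n")

        os.link(original, hard)     # a second name for the same inode
        os.symlink(original, soft)  # a separate object that stores a path

        result = {
            "same_inode": os.stat(original).st_ino == os.stat(hard).st_ino,
            "link_count": os.stat(original).st_nlink,
            "soft_is_link": os.path.islink(soft),
            "soft_target": os.readlink(soft) == original,
        }

        # Removing the original name leaves the data reachable through the
        # hard link, while the symbolic link now points at nothing.
        os.remove(original)
        with open(hard) as f:
            result["hard_still_readable"] = f.read() == "hello\n"
        result["soft_dangles"] = not os.path.exists(soft)  # exists() follows the link
        return result


if __name__ == "__main__":
    for key, value in link_demo().items():
        print(key, value)
```

The point of the last step is that a hard link is just another name for the same file, whereas a symbolic link depends on the name it points to.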

Alain De Vos · Jul 4, 2020

& devices

Mjölnir · Jul 4, 2020

When I wrote an awk(1) script to summarize all file types of a filesystem (or part of it), I learned there is a rarely used file type called whiteout. These are deleted files in an overlay filesystem, e.g. unionfs (mount_unionfs(8)).
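This is not the awk script mentioned above, just a rough Python sketch of the same idea: walk a directory tree and count objects by type. The names classify, summarize and demo are invented here. Note that stat.S_ISWHT only reports whiteouts on platforms that define them, and an ordinary directory walk may not list whiteouts at all.

```python
import os
import stat
import tempfile
from collections import Counter


def classify(mode):
    """Map an st_mode value to a human-readable file type."""
    checks = [
        (stat.S_ISREG, "regular file"),
        (stat.S_ISDIR, "directory"),
        (stat.S_ISLNK, "symbolic link"),
        (stat.S_ISFIFO, "named pipe"),
        (stat.S_ISSOCK, "socket"),
        (stat.S_ISCHR, "character device"),
        (stat.S_ISBLK, "block device"),
        (stat.S_ISWHT, "whiteout"),  # always false where unsupported
    ]
    for test, name in checks:
        if test(mode):
            return name
    return "unknown"


def summarize(root):
    """Count file system objects below root by type, without following symlinks."""
    counts = Counter()
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # vanished or unreadable, skip it
            counts[classify(st.st_mode)] += 1
    return counts


def demo():
    """Build a tiny tree with four different object types and summarize it."""
    with tempfile.TemporaryDirectory() as d:
        open(os.path.join(d, "file.txt"), "w").close()
        os.mkdir(os.path.join(d, "subdir"))
        os.symlink("file.txt", os.path.join(d, "link"))
        os.mkfifo(os.path.join(d, "pipe"))
        return summarize(d)


if __name__ == "__main__":
    for kind, n in sorted(demo().items()):
        print(f"{n:6d} {kind}")
```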

ralphbsz · Jul 4, 2020

One way of looking at it: what data is stored on a computer? To begin with, nearly all of it is stored in a file system, on disks. Most of the data in a file system is in files, which are ordered (and perhaps sparse) arrays of bytes, which can be read and written.

Hard links are not a different type of file. The correct way to interpret them is that a file may have 0 or more names. What is a name? A text string that is stored in a directory. Typically, files have 1 name. Some files have zero names; those will vanish as soon as the last program that has them open closes them (they are typically called temporary files, for that reason). Sometimes files have multiple names, by having multiple hard links.
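The "file with zero names" case can be tried out directly. The following Python sketch (the function name nameless_file_demo is made up) creates a temporary file, removes its only name, and keeps using it through the open descriptor. On common local file systems the link count drops to 0 after the unlink; network file systems may behave differently.

```python
import os
import tempfile


def nameless_file_demo():
    """Unlink a file while it is open and show that its data is still accessible."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"still here")
        links_before = os.fstat(fd).st_nlink  # one name so far
        os.unlink(path)                       # remove the only name
        links_after = os.fstat(fd).st_nlink   # typically 0 on local file systems
        os.lseek(fd, 0, os.SEEK_SET)
        data = os.read(fd, 100)               # the bytes are still there
    finally:
        os.close(fd)                          # now the storage is released
    return links_before, links_after, data, os.path.exists(path)


if __name__ == "__main__":
    print(nameless_file_demo())
```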

Soft links are indeed a different type of file system object. They were described above.

Directories are file system objects. They store data, most visibly the names of files, directories, and other file system objects. They also store attributes, such as modification time and permissions. Today, they can also store arbitrary extended attributes, which are just strings (sequences of bytes). In some systems, extended attributes are limited in size (often to less than 4 KiB), but in other systems, they can be of arbitrary length. This becomes related to the concept of a file having multiple "forks", meaning multiple sets of bytes. This used to be the hallmark of the Macintosh file system a while ago.

All other file system objects (fifos, sockets, pipes, devices, reparse points, whiteouts, ...) have no permanent data content. For example, you can only read from a named pipe what the other process just wrote.
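A small Python sketch of the named pipe case, assuming a POSIX system with mkfifo(2); the function name fifo_demo and the file name are invented for illustration. The writer runs in a thread here only to keep the example in one process; normally it would be a separate program.

```python
import os
import stat
import tempfile
import threading


def fifo_demo(message=b"ping"):
    """Pass a message through a named pipe and return what the reader saw."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "demo.fifo")
        os.mkfifo(path)
        is_fifo = stat.S_ISFIFO(os.stat(path).st_mode)

        def writer():
            # open() blocks until a reader has opened the other end
            with open(path, "wb") as w:
                w.write(message)

        t = threading.Thread(target=writer)
        t.start()
        with open(path, "rb") as r:
            received = r.read()  # returns once the writer has closed its end
        t.join()
        return is_fifo, received


if __name__ == "__main__":
    print(fifo_demo())
```

Once the reader has consumed the message, nothing of it remains in the pipe: the name in the directory is only a rendezvous point for the two processes.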

Now, what are other ways in which data is stored? Disks also contain data that is not organized in a file system. In today's common usage, the only other organized form of information on disks is the partition table (GPT or MBR). Until about 15-20 years ago, databases commonly used disks (or disk partitions) directly, without going through a file system. That usage is historic today. Lots of data is still stored on tape, typically NOT in file systems (although there are tape file systems).

Most computers today have a small amount of read/write storage that contains hardware configuration data (the BIOS configuration), and some security information (the TPM or trusted platform module). In theory one could refer to the ROM that the boot code is in as a form of storage; in practice, the only thing that goes in there today is executable boot code. There are forms of storage that are halfway between RAM and storage, for example NVDIMMs, which during operation look and feel like RAM, but don't lose their content when power goes away. One interesting form of storage that's used in specialized settings is to just use RAM, with multiple machines; I've worked with compute clusters that had persistent information that was only in RAM, stored under the assumption that the power system (consisting of utility power, battery-powered UPSes, and generators, in multiple geographic locations) would never go completely down.

ralphbsz · Jul 4, 2020

As to the content of the data: That's open to the user. The Unix tradition is that a file (or a similar thing, like a socket or pipe) is simply an ordered array of bytes. The meaning of the bytes is up to the user who reads and writes them. Now, what these bytes mean is a very complex technical and philosophical question.

Clearly, the most common type of file is text. The Unix tradition is that text files contain printable characters (in US-ASCII those are from 0x20 to 0x7E inclusive), plus newlines. Some control characters (like backspace, tab and form feed) are commonly interpreted the same way (for example "x BS _" means an underlined x). But even text files have a terribly nasty hidden secret: What character set is to be used? In the old days, this was easy, and computers used only US-ASCII (or EBCDIC, typically on IBM mainframes, and a few other character sets that are only of historic significance). But beginning in the 80s, the upper half of the 8-bit value range of bytes started being used. First for accented European characters and line drawing, for boxes. Then for other character sets (like Cyrillic). Then came character sets with more than 256 different characters, which have to be mapped to bytes in some way (that is called an encoding). Today, we use Unicode for that, and most often the UTF-8 encoding. But this means that data is no longer just a stream of bytes, which can all be understood individually; to understand it, you have to know something about the syntax of the encoding and the semantics of the encoded data, even just for a text string.
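To illustrate the encoding problem, here is a short Python sketch (the function name encoding_demo and the sample word are arbitrary choices). The same five-character string becomes a different byte sequence depending on the encoding, and reading the bytes with the wrong assumption either garbles the text or fails outright.

```python
def encoding_demo():
    """Encode one string two ways and show what happens when the reader guesses wrong."""
    text = "Gr\u00fc\u00dfe"             # "Gruesse" spelled with u-umlaut and sharp s
    latin1 = text.encode("latin-1")      # one byte per character
    utf8 = text.encode("utf-8")          # the two non-ASCII characters take two bytes each

    # UTF-8 bytes read as Latin-1: every byte decodes, but the text is garbled.
    misread = utf8.decode("latin-1")

    # Latin-1 bytes read as UTF-8: the bytes are not valid UTF-8 at all.
    try:
        latin1.decode("utf-8")
        latin1_valid_utf8 = True
    except UnicodeDecodeError:
        latin1_valid_utf8 = False

    return {
        "chars": len(text),
        "latin1_bytes": len(latin1),
        "utf8_bytes": len(utf8),
        "misread_differs": misread != text,
        "latin1_valid_utf8": latin1_valid_utf8,
    }


if __name__ == "__main__":
    print(encoding_demo())
```

Nothing in the bytes themselves says which encoding was used, which is why the reader has to know it (or guess).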

One thing that has gone away with the passage of time is richer file formats. For example, many operating systems didn't use to store individual characters and rely on the "newline" character to end a line; they used to store lines of text. Lines could be fixed length (padded with spaces or other invisible characters at the end), or variable length. As computers became more comfortable, a variety of other information was packed into that, for example how to handle the paper when printing the text file (carriage control, typically with a Fortran carriage control character in the first position of each line, but handled separately by the operating system). Good operating systems (such as the various mainframe OSes and VMS on the VAX) had a huge variety of record formats, and rich conversion utilities, which made operation easy, efficient, and convenient. Unix instead went for the lowest common denominator and quick hack, relying on separate utility programs (such as tr, cut and pr).

And in reality, having to interpret the syntax and semantics of bytes has always been necessary. That's even true for text files: if you write a script and don't put the #! line correctly on the first line, that script won't run. Beyond that, files have always also stored forms of data that are not clear text, often called binary data. And there is an uncountably large number of such formats. The ones that are commonly seen today in consumer (amateur) computer usage are sound files (mp3), images (jpg) and video (mov). There are lots of others, and making a complete list is somewhere between impossible and laughable. The "magic" file in /usr/share/misc/magic helps define many commonly used file formats, but there are zillions of others.
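The idea behind the magic file can be sketched in a few lines of Python. This is a toy, not how file(1) is implemented: the table SIGNATURES and the functions guess_format and guess_file are made up here, and only a handful of well-known leading byte sequences are listed.

```python
# A few well-known leading byte sequences ("magic numbers").
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"\x7fELF", "ELF executable or object file"),
    (b"ID3", "MP3 audio with ID3v2 tag"),
    (b"%PDF-", "PDF document"),
    (b"#!", "script with interpreter line"),
]


def guess_format(header):
    """Guess a file format from the first bytes of its content."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


def guess_file(path):
    """Read the start of a file and guess its format."""
    with open(path, "rb") as f:
        return guess_format(f.read(16))


if __name__ == "__main__":
    import sys
    for arg in sys.argv[1:]:
        print(f"{arg}: {guess_file(arg)}")
```

The real magic database goes much further (offsets, byte order, nested tests), but the principle is the same: the bytes only mean something once you know which format to read them as.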

Mjölnir · Jul 4, 2020

ralphbsz said:
[...] Until about 15-20 years ago, databases commonly used disks (or disk partitions) directly, without going through a file system. That usage is historic today. [...]

What? That's new to me. I still read the recommendation to give the DB access to the raw disk/SSD (if speed is not sufficient w/ the standard file method) quite often.

Jose · Jul 5, 2020

Oracle has gone back and forth at least three times in the past 20 years. One of my first big mistakes as a young sysadmin was to newfs the apparently empty partition the Oracle database was using. The Andersen consultants were not amused.

This is the latest emanation from Redwood Shores:

Introduction to Automatic Storage Management (ASM)

ralphbsz · Jul 5, 2020

They're still recommending to use raw disk? Seriously? I stand corrected. Those fools. For a small performance gain (we measured it in the late 90s, and it was a handful of percent), you lose a lot of convenience. But it's a free country ... do as you please.

Mjölnir · Jul 5, 2020

ralphbsz said:
They're still recommending to use raw disk? Seriously? I stand corrected. Those fools. For a small performance gain (we measured it in the late 90s, and it was a handful of percent), you lose a lot of convenience. But it's a free country ... do as you please.

Iff the amount of data is so huge that you need several physical media to store it anyway, I do not see any loss of convenience. Just mirror the whole disks in a RAID-10 (e.g. graid(8) or gvinum(8)), insert a gsched(8) I/O scheduler (a geom_cksum(4) is not available yet), done. On your measurements: the performance of a DB differs significantly with the use case. In general, when latency is of primary concern, using raw disks can make a big difference.

Jose · Jul 5, 2020

ralphbsz said:
They're still recommending to use raw disk? Seriously? I stand corrected. Those fools. For a small performance gain (we measured it in the late 90s, and it was a handful of percent), you lose a lot of convenience. But it's a free country ... do as you please.

It's a real barrel of laughs when the ignorant Oracle DBAs insist on using ASM on top of virtual disks.