Friday, March 22, 2019

Data Structure Unit 5 : File Structure and Hasing for B.C.A.,M.C.A. and all IT Students


--------------------------------------------------------------------------------------------------------------------------
Prepared By : Uday Shah (HOD - IT)
E-Mail : rupareleducation@gmail.com
Contact No : 7600044051

Data Structure Unit 5 : File Structure and Hasing

Basic concepts of File and file systems:
1.     File System Service.
·        A file system or file system controls how data is stored and retrieved.
·        Without a file system, information placed in a storage medium would be one large body of data with no way to tell where one piece of information stops and the next begins.
·        By separating the data into pieces and giving each piece a name, the information is easily isolated and identified.
·        Taking its name from the way paper-based information systems are named, each group of data is called a "file".
·        The structure and logic rules used to manage the groups of information and their names are called a "file system".
·        There are many different kinds of file systems.
·        Each one has different structure and logic, properties of speed, flexibility, security, size and more.
·        Some file systems have been designed to be used for specific applications.
·        File systems can be used on different types of storage devices that use different kinds of media.

Following are some Attributes of a File

·        Name:  It is the only information which is in human-readable form.
·        Identifier: The file is identified by a unique tag number within file system.
·        Type: It is needed for systems that support different types of files.
·        Location: Pointer to file location on device.
·        Size: The current size of the file.
·        Protection: This controls and assigns the power of reading, writing, executing.
·        Time, date, and user identification. This is the data for protection, security, and usage monitoring.
2.  Disk Space Allocation

The Disk Space Allocation methods define how the files are stored in the disk blocks. There are three main disk space or file allocation methods.
·         Contiguous Allocation
·         Linked Allocation
·         Indexed Allocation
The main idea behind these methods is to provide:
·         Efficient disk space utilization.
·         Fast access to the file blocks.

1. Contiguous Allocation
In this each file occupies a contiguous set of blocks on the disk. For example, if a file requires n blocks and is given a block b as the starting location, then the blocks assigned to the file will be: b, b+1, b+2,……b+n-1. 
This means that given the starting block address and the length of the file in terms of blocks required.
The directory entry for a file with contiguous allocation contains
·         Address of starting block
·         Length of the allocated portion.
Following is figure to show Contiguous Allocation.



Advantages:
·         Both the Sequential and Direct Accesses are supported by this.
·         Direct access, the address of the kth block of the file which starts at block b can easily be obtained as.
·         This is extremely fast since the number of seeks are minimal because of contiguous allocation of file blocks.
Disadvantages:
·        This method suffers from both internal and external fragmentation. This makes it inefficient in terms of memory utilization.
·        Increasing file size is difficult because it depends on the availability of contiguous memory at a particular instance.

2. Linked List Allocation
·        In this allocation system each file is a linked list of disk blocks which need not be contiguous.
·        The disk blocks can be scattered anywhere on the disk.
The directory entry contains a pointer to the starting and the ending file block.
·        Each block contains a pointer to the next block occupied by the file.


Advantages:
·         This is very flexible in terms of file size. File size can be increased easily since the system does not have to look for a contiguous chunk of memory.
·         This method does not suffer from external fragmentation. This makes it relatively better in terms of memory utilization.
Disadvantages:
·         Because the file blocks are distributed randomly on the disk, a large number of seeks are needed to access every block individually. This makes linked allocation slower.
·         It does not support random or direct access. We cannot directly access the blocks of a file.
·         A block k of a file can be accessed by traversing k blocks sequentially sequential access from the starting block of the file via block pointers.
·         Pointers required in the linked allocation incur some extra overhead.
3. Indexed Allocation
·        In this scheme, a special block known as the Index block contains the pointers to all the blocks occupied by a file.
·        Each file has its own index block. The ith entry in the index block contains the disk address of the ith file block.
·        The directory entry contains the address of the index block as shown in the image:



Advantages:
·         This supports direct access to the blocks occupied by the file and therefore provides fast access to the file blocks.
·         It overcomes the problem of external fragmentation.
Disadvantages:
·         The pointer overhead for indexed allocation is greater than linked allocation.
·         For very small files, say files that expand only 2-3 blocks, the indexed allocation would keep one entire block (index block) for the pointers which is inefficient in terms of memory utilization. However, in linked allocation we lose the space of only 1 pointer per block.
·         For files that are very large, single index block may not be able to hold all the pointers.


MS Dos Fat File System

·         The File Allocation Table (FAT) file system is a simple file system originally designed for small disks and simple folder structures.
·         The FAT file system is named for its method of organization, the file allocation table, which resides at the beginning of the volume. To protect the volume, two copies of the table are kept, in case one becomes damaged.
·         In addition, the file allocation tables and the root folder must be stored in a fixed location so that the files needed to start the system can be correctly located.
·         A volume formatted with the FAT file system is allocated in clusters. The default cluster size is determined by the size of the volume. For the FAT file system, the cluster number must fit in 16 bits and must be a power of two.
·         In Short file allocation table, FAT is a method of keeping track of the contents of a Hard Drive used by early Microsoft operating systems that was first introduced in 1977.
File Allocation Table
·         A file allocation table (FAT) is a file system developed for hard drives that originally used 12 or 16 bits for each cluster entry into the file allocation table.
·         It is used by the operating system (OS) to manage files on hard drives and other computer systems. It is often also found on in flash memory, digital cameras and portable devices.
·         It is used to store file information and extend the life of a hard drive.
·         Most hard drives require a process known as seeking, this is the actual physical searching and positioning of the read/write head of the drive.
·         The FAT file system was designed to reduce the amount of seeking and thus minimize the wear and tear on the hard disc.
·         FAT was designed to support hard drives and subdirectories.

A FAT file system has four different sections, each as a structure in the FAT partition. The four sections are:
  • Boot Sector:
  • This is also known as the reserved sector it is located on the first part of the disc.
  • It contains the OS's necessary boot loader code to start a PC system, the partition table known as the master boot record (MRB) that describes how the drive is organized, and the BIOS parameter block (BPB) which describes the physical outline of the data storage volume.
  • FAT Region:
  • This region generally encompasses two copies of the File Allocation Table which is for redundancy checking and specifies how the clusters are assigned.
  • Data Region: This is where the directory data and existing files are stored. It uses up the majority of the partition.
  • Root Directory Region: This region is a directory table that contains the information about the directories and files.
  • FAT32 usually stores the root directory in the data region so it can be expanded if needed.
Tree Structure Directory System:
·        In Tree structured directory system, any directory entry can either be a file or sub directory.
·        Tree structured directory system overcomes the drawbacks of two level directory system.
·        The similar kind of files can now be grouped in one directory.
·        Each user has its own directory and it cannot enter in the other user's directory.
·        However, the user has the permission to read the root's data but he cannot write or modify this.
·        Only administrator of the system has the complete access of root directory.
·        Searching is more efficient in this directory structure.
·        A file can be accessed by two types of path, either relative or absolute.
·        Absolute path is the path of the file with respect to the root directory of the system while relative path is the path with respect to the current working directory of the system.
·        In tree structured directory systems, the user is given the privilege to create the files as well as directories.

File Organization

File organization ensures that records are available for processing. It is used to determine an efficient file organization for each base relation. 

For example, if we want to retrieve employee records in alphabetical order of name. Sorting the file by employee name is a good file organization. However, if we want to retrieve all employees whose marks are in a certain range, a file is ordered by employee name would not be a good file organization.

Types of File Organization

There are three types of organizing the file:

1. Sequential access file organization
2. Direct access file organization
3. Indexed sequential access files organization


Concept of Field , Record and File
Field: A field is an item of stored data. A field could be a name, a date, an address, a description, a quantity, etc. When a field is defined it is given a name (identifier) and a type, that defines the type of data that will be stored in that field. This is exactly the same as defining a variable within a program.
For ex.
          int rno;
          char name [20];
          float amt;

According to above rno, name and amt is field of different type which helps to create a record.
Record: A record is the collection of fields that relate to a single entity. For example, we could have a student record that includes fields for the student’s name, address, homeroom, date of birth, etc. A product record could include fields for the serial number, description, cost price, quantity in stock, etc.
For ex.
          structure stud
          {
                   int rno;
                   char name [20];
                   float amt;
};

According to above structure stud is on record and rno, name and amt is field of structure stud with different type which helps to create a record.

File: A file is a collection of related records. For example, a student file might include all of the records of students enrolled at a school. A police department might keep a file of criminal records, which includes details about all known criminals. Files are stored on secondary storage devices such as hard disks, CD-ROMs etc.
Within a file, all records usually have the same structure. That is, every record in the file contains the same fields. Only the data stored in the fields of different record will be different.

Indexes: An indexed file is a computer file with an index that allows easy random access to any record given its file key.
The key must be such that it uniquely identifies a record. If more than one index is present the other ones are called alternate indexes. The indexes are created with the file and maintained by the system.

Sequential access file organization : Storing and sorting in contiguous block within files on tape or disk is called as sequential access file organization. In sequential access file organization, all records are stored in a sequential order. The records are arranged in the ascending or descending order of a key field. Sequential file search starts from the beginning of the file and the records can be added at the end of the file. In sequential file, it is not possible to add a record in the middle of the file without rewriting the file.


Advantages of sequential file
·        It is simple to program and easy to design.
·        Sequential file is best use if storage space.
Disadvantages of sequential file
·        Sequential file is time consuming process.
·        It has high data redundancy.
·        Random searching is not possible.

Indexed sequential access file organization : Indexed sequential access file combines both sequential file and direct access file organization. In indexed sequential access file, records are stored randomly on a direct access device such as magnetic disk by a primary key. This file have multiple keys. These keys can be alphanumeric in which the records are ordered is called primary key. The data can be access either sequentially or randomly using the index. The index is stored in a file and read into memory when the file is opened.


Advantages of Indexed sequential access file organization

·        In indexed sequential access file, sequential file and random file access is possible.
·        It accesses the records very fast if the index table is properly organized.
·        The records can be inserted in the middle of the file.
·        It provides quick access for sequential and direct processing.
·        It reduces the degree of the sequential search.

Disadvantages of Indexed sequential access file organization

·        Indexed sequential access file requires unique keys and periodic reorganization.
·        Indexed sequential access file takes longer time to search the index for the data access or retrieval.
·        It requires more storage space.
·        It is expensive because it requires special software.
·        It is less efficient in the use of storage space as compared to other file organizations.

 What is Hashing?

·        Hashing is the process of mapping large amount of data item to smaller table with the help of hashing function.
·        Hashing is also known as Hashing Algorithm or Message Digest Function.
·        It is a technique to convert a range of key values into a range of indexes of an array.
·        It is used to facilitate the next level searching method when compared with the linear or binary search.
·        Hashing allows updating and retrieving any data entry in a constant time.
·        Constant time means the operation does not depend on the size of the data.
·        Hashing is used with a database to enable items to be retrieved more quickly.
·        It is used in the encryption and decryption of digital signatures.


What is Hash Table?

·         Hash table or hash map is a data structure used to store key-value pairs.
·         It is a collection of items stored to make it easy to find them later.
·         It uses a hash function to compute an index into an array of buckets or slots from which the desired value can be found.
·         It is an array of list where each list is known as bucket.
·         It contains value based on the key.
·         Hash table is used to implement the map interface and extends Dictionary class.
·         Hash table is synchronized and contains only unique elements.



The above figure shows the hash table with the size of n = 10. Each position of the hash table is called as Slot. In the above hash table, there are n slots in the table, names = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}. Slot 0, slot 1, slot 2 and so on. Hash table contains no items, so every slot is empty.


Liner Hashing:

·        Linear Hashing is a dynamically updateable disk-based index structure which implements a hashing scheme and which grows or shrinks one bucket at a time.
·        The index is used to support exact match queries, i.e., find the record with a given key.
·        Compared with the B Tree index which also supports exact match queries, Linear Hashing has better expected query cost.
·        Compared with Extendable Hashing, Linear Hashing does not use a bucket directory, and when an overflow occurs, it is not always the overflow n bucket that is split.
·        The name Linear Hashing is used because the number of buckets grows or shrinks in a linear fashion.
·        Overflows are handled by creating a chain of pages under the overflow n bucket.

The hashing function changes dynamically and at any given instant there can be at most two hashing functions used by the scheme.

:: Best Of Luck :: 

https://myshopprime.com/uday.shah3/shop

Please share and comment it... Thank you.