How to choose field types appropriately when creating a table

Foreword

When we create a table, there are several types of people who choose field types:

  1. Rigorous
    People who strictly investigate the possible size of each field and then make selections based on the limitations of different field types will have no problem creating relational data tables.
  2. Try to save yourself the trouble
    Set all fields to String so that he can import all data at once. Such people will be scolded when using relational databases (such as mysql). In big data (such as hive), it may be possible to do this in the ODS layer. In principle, it is feasible. After all, there is still modeling to handle these. Data types, but modelers complain too.
  3. Screening types based on different scenarios
    (1) Big data field
    Since the performance impact of differences in big data field types is far outweighed by the amount of data, it does not need to be so rigorous.
    (2) Relational database
    It is necessary to provide users with the ultimate experience as much as possible, and field types should be selected as reasonably as possible.

Commonly used field types

Data type Description
TINYINT 1-byte signed integer, from -128 to 127
SMALLINT 2-byte signed integer, from -32,768 to 32,767
INT/INTEGER 4-byte signed integer, from -2,147,483,648 to 2,147,483,647
BIGINT 8-byte signed integer, from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807
FLOAT 4-byte single precision floating point number
DOUBLE 8-byte double precision floating point number
DECIMAL The DECIMAL type in Hive is based on Java’s BigDecimal and is used to represent immutable arbitrary-precision decimal numbers in Java. All regular numeric operations (such as +, -, *, /) and related UDFs (such as Floor, Ceil, Round, etc.) can handle the DECIMAL type. You can type cast the DECIMAL type just like other numeric types. The DECIMAL type persistence format supports both scientific and non-scientific notation. So whether your data set contains data like 4.004E+3 (scientific notation) or 4004 (non-scientific notation) or a combination of both, you can use DECIMAL to handle it.
STRING String, variable length, string literals can use single quotes (‘) or double quotes Quotation marks (“) indicate
VARCHAR Varchar type needs to specify the length when creating (from 1 to 65535 ), which defines the maximum number of characters allowed in a character string. If the string value converted/assigned to a varchar value exceeds the length specified value, the string will be silently truncated
CHAR Char types are similar to Varchar, but their length is fixed, which means shorter than the specified length value The value will be padded with spaces, but trailing spaces are not important when comparing. The maximum length is fixed at 255
TIMESTAMP Supports legacy UNIX timestamps with optional nanosecond precision
DATE DATE values Describes a specific year/month/day in the format YYYY-MM-DD. For example, the date is “January 01, 2013”. The date type does not have a time of day component. The value range supported by the Date type is 0000-01- 01 to 9999-12-31, depending on original Java Date type support
BOOLEAN true /false
INTERVAL Time frequency interval
ARRAY Ordered collection of the same type
MAP key-value, key must be a primitive type, value can be of any type
STRUCT Field collection, types can be different
UNION within a limited value range a value

Big data (Hive) field type selection

The field length in the data warehouse Hive should try to meet the maximum length requirements of the corresponding source system fields. Of course, the business meaning of the field will also be considered. For some source systems, the definition is too long, and it is impossible to have such a long field based on the actual business meaning. It is up to the warehouse to choose an appropriate length definition; in order to maintain the consistency and standardization of data types in the warehouse as much as possible, the data type definitions in the data warehouse should not be too complicated. It is recommended to only define string, bigint, double type to keep the field types in the warehouse neat.

Relational database (Mysql) field type selection

  1. in principle
    Choose the right type whenever possible. For example, if a field only has 0 and 1, tinyint will definitely take precedence over int.
  2. tinyint and int performance
  • storage
    The INT field occupies 4 bytes, while the TINYINT field only occupies 1 byte. Therefore, if your data volume is large, using TINYINT can save storage space and reduce disk IO overhead.
  • memory consumption
    INT fields will take up more memory space when stored. If there are many INT fields in the table and the amount of data is large, the memory consumption of the MySQL server will increase.
  • Index efficiency
    Using TINYINT fields to create indexes is more efficient than using INT fields to create indexes. Because TINYINT occupies a small space during the index creation and query process, it can reduce the number of IO operations and improve query efficiency.
  • Operation and comparison speed
    INT fields may be slightly slower in operations and comparisons than TINYINT fields because more bytes need to be processed.

If your data range can be within the value range of TINYINT (-128 to 127 or 0 to 255), and you have high requirements for storage space and index efficiency, using the TINYINT field will be more appropriate. If the data range exceeds the value range of TINYINT, or if the requirements for memory consumption and operation speed are higher, you can consider using the INT field.

  1. What is the difference between int(1) and int(10)
    There is actually no difference between INT(1) and INT(10). In MySQL, when defining an integer type (such as INT), the number in parentheses represents the display width, not the storage size or range of values.
    Display width only affects the width of the values displayed in the query result set. For example, if you use INT(10) and have a value of 123 in the query results, it will still appear as 123 instead of being padded to 10 with leading zeros. Display width does not limit the range of values or storage size.
    In fact, for integer types such as INT, the storage size and range of values are determined by the type itself, not the display width. The INT type always uses 4 bytes (32 bits) of storage space, and the range of values is always from -2147483648 to 2147483647 (signed) or from 0 to 4294967295 (unsigned).
    Therefore, whether you use INT(1) or INT(10), their storage size and range of values are the same. Choosing an appropriate display width is simply to better format the displayed values in the query results.
    To summarize, there is no real difference between INT(1) and INT(10) in MySQL 8, they are just used to specify the width of the value displayed in the query results.
    For a more intuitive understanding:
  • Create a table:

    create table intVsIntAnyThingDemo
    (
        Number1 int(1) unsigned zerofill,
        Number int(8) unsigned zerofill
    );
    

    Detailed explanation of table creation statement:
    In MySQL, unsigned and zerofill are two attributes that can be used to modify integer types (such as int).
    unsigned: Indicates that this field can only store non-negative integers. In other words, this field cannot store negative numbers.
    zerofill: means that if the number of digits in the value is less than the specified number of integer digits, zeros will be padded to the left of the value.
    For the field Number1 int(1) unsigned zerofill, since int(1) is specified, this means that this field can only display one digit. But due to the use of unsigned zerofill, when you insert a number, it will automatically be filled to two digits. For example, inserting 5 will become 05.
    For the field Number int(8) unsigned zerofill, it can display 8 digits. So, inserting 42 will become 00000042.
    Summary: In this example, unsigned ensures that the field only stores non-negative numbers, while zerofill ensures that when the field is displayed, if the actual number of digits is less than the specified number of digits, it will be padded with zeros on the left. But it should be noted that although it appears that Number1 can only store numbers from 0 to 9, in fact it can store integer values from 0 to 255 due to its unsigned attribute. This is because int(1) is still treated as the underlying storage of an integer when it is unsigned, and its range is 0 to 255.

  • Insert data

insert into intVsIntAnyThingDemo values(11,22);
  • View data

Reference documentation

Hive data types
stackoverflow
int(5)vsint(10)

syntaxbug.com © 2021 All Rights Reserved.