Hive configuration file hive-site.xml: parameter descriptions and usage

Parameter description and usage in the Hive configuration file hive-site.xml

Article directory

  • Parameter description and usage in the Hive configuration file hive-site.xml
  • Parameter description
  • Parameter examples
  • Specific use

Parameter description

| Parameter name | Default value | Usage |
|---|---|---|
| hive.metastore.uris | (empty) | Comma-separated Thrift URIs of the remote Hive metastore. |
| hive.metastore.client.socket.timeout | 600 | Metastore client socket timeout, in seconds. |
| hive.metastore.warehouse.dir | /user/hive/warehouse | Default directory of the Hive data warehouse. |
| hive.warehouse.subdir.inherit.perms | true | Whether newly created table and partition directories inherit the permissions of their parent directory. |
| hive.auto.convert.join | true | Whether to automatically convert a common join into a map join based on input size. |
| hive.auto.convert.join.noconditionaltask.size | 10000000 | Size threshold in bytes: if n-1 of the tables in an n-way join together are smaller than this, the join is converted directly to a map join without a conditional task. |
| hive.optimize.bucketmapjoin.sortedmerge | false | Whether to try a sort-merge bucket (SMB) map join when the tables are bucketed and sorted on the join keys. |
| hive.smbjoin.cache.rows | 10000 | Number of rows with the same key cached in memory per table in an SMB join. |
| hive.server2.logging.operation.enabled | true | Whether HiveServer2 saves operation logs and makes them available to clients. |
| hive.server2.logging.operation.log.location | ${system:java.io.tmpdir}/${system:user.name}/operation_logs | Directory where HiveServer2 stores operation logs. |
| mapred.reduce.tasks | -1 | Number of reduce tasks per job; -1 lets Hive estimate the number. |
| hive.exec.reducers.bytes.per.reducer | 256000000 | Amount of input data per reducer, in bytes, used to estimate the number of reducers. |
| hive.exec.copyfile.maxsize | 33554432 | Largest file, in bytes, that Hive copies in a single operation; larger files are copied with distcp. |
| hive.exec.reducers.max | 1009 | Upper limit on the number of reducers Hive uses when mapred.reduce.tasks is -1. |
| hive.vectorized.groupby.checkinterval | 100000 | Number of entries added to the vectorized GROUP BY hash table between memory checks. |
| hive.vectorized.groupby.flush.percent | 0.1 | Fraction of hash table entries flushed when the vectorized GROUP BY memory threshold is exceeded. |
| hive.compute.query.using.stats | true | Whether to answer queries such as count(*), min and max from metastore statistics alone. |
| hive.vectorized.execution.enabled | false | Whether to enable vectorized query execution. |
| hive.vectorized.execution.reduce.enabled | true | Whether to enable vectorized execution on the reduce side. |
| hive.vectorized.use.vectorized.input.format | true | Whether to use vectorized input formats when they are available. |
| hive.vectorized.use.checked.expressions | false | Whether to use overflow-checked vectorized expressions. |
| hive.vectorized.use.vector.serde.deserialize | false | Whether to vectorize by deserializing rows directly into vectorized batches. |
| hive.vectorized.adaptor.usage.mode | all | How the vectorization adaptor is used for UDFs without a native vectorized class: none, chosen or all. |
| hive.vectorized.input.format.excludes | (empty) | Comma-separated list of input format classes excluded from vectorized input. |
| hive.merge.mapfiles | true | Whether to merge small files at the end of a map-only job. |
| hive.merge.mapredfiles | false | Whether to merge small files at the end of a MapReduce job. |
| hive.cbo.enable | false | Whether to enable the cost-based optimizer (CBO). |
| hive.fetch.task.conversion | more | Which simple queries run as a single fetch task instead of a job: none, minimal or more. |
| hive.fetch.task.conversion.threshold | -1 | Maximum input size, in bytes, for which fetch task conversion is applied. |
| hive.limit.pushdown.memory.usage | 0.1 | Maximum fraction of heap memory used for the top-K hash when pushing down LIMIT. |
| hive.merge.sparkfiles | false | Whether to merge small files at the end of a Spark job. |
| hive.merge.smallfiles.avgsize | 16000000 | If the average output file size, in bytes, is below this value, Hive starts an extra job to merge the files. |
| hive.merge.size.per.task | 256000000 | Target size, in bytes, of the merged files. |
| hive.optimize.reducededuplication | true | Whether to remove redundant shuffle stages when data is already clustered by the same key. |
| hive.optimize.reducededuplication.min.reducer | 4 | Reduce deduplication is skipped if it would leave fewer reducers than this value. |
| hive.map.aggr | true | Whether to enable map-side aggregation for GROUP BY queries. |
| hive.map.aggr.hash.percentmemory | 0.5 | Fraction of memory that the map-side aggregation hash table may use. |
| hive.optimize.sort.dynamic.partition | false | Whether to sort rows by dynamic partition columns so that each reducer keeps only one partition writer open. |
| hive.execution.engine | mr | Execution engine: mr, tez or spark. |
| spark.executor.memory | 1g | Memory per Spark executor. |
| spark.driver.memory | 1g | Memory for the Spark driver. |
| spark.executor.cores | 1 | Number of cores per Spark executor. |
| spark.yarn.driver.memoryOverhead | 10% of driver memory, minimum 384 | Off-heap memory overhead for the Spark driver, in MiB. |
| spark.yarn.executor.memoryOverhead | 10% of executor memory, minimum 384 | Off-heap memory overhead per Spark executor, in MiB. |
| spark.dynamicAllocation.enabled | false | Whether to enable dynamic resource allocation. |
| spark.dynamicAllocation.initialExecutors | spark.dynamicAllocation.minExecutors | Initial number of executors under dynamic allocation. |
| spark.dynamicAllocation.minExecutors | 0 | Minimum number of executors under dynamic allocation. |
| spark.dynamicAllocation.maxExecutors | infinity | Maximum number of executors under dynamic allocation. |
| hive.metastore.execute.setugi | false | In unsecured mode, whether the metastore performs file system operations as the client's user and group. |
| hive.support.concurrency | false | Whether Hive supports concurrency through locking. |
| hive.zookeeper.quorum | (empty) | Comma-separated list of ZooKeeper servers. |
| hive.zookeeper.client.port | 2181 | ZooKeeper client port. |
| hive.zookeeper.namespace | hive_zookeeper_namespace | Parent znode under which Hive creates its ZooKeeper nodes. |
| hive.cluster.delegation.token.store.class | org.apache.hadoop.hive.thrift.MemoryTokenStore | Implementation class of the delegation token store. |
| hive.server2.enable.doAs | true | Whether HiveServer2 runs queries as the connected user (impersonation) instead of the HiveServer2 service user. |
| hive.metastore.sasl.enabled | false | Whether the metastore Thrift interface is secured with SASL (Kerberos). |
| hive.server2.authentication | NONE | HiveServer2 authentication method, for example NONE, LDAP or KERBEROS. |
| hive.metastore.kerberos.principal | hive-metastore/_HOST@EXAMPLE.COM | Kerberos principal of the Hive metastore service. |
| hive.server2.authentication.kerberos.principal | (empty) | Kerberos principal of HiveServer2. |
| spark.shuffle.service.enabled | false | Whether to enable the external Spark shuffle service. |
| hive.strict.checks.orderby.no.limit | false | Whether to reject ORDER BY queries that have no LIMIT. |
| hive.strict.checks.no.partition.filter | false | Whether to reject queries on partitioned tables that have no partition filter. |
| hive.strict.checks.type.safety | true | Whether to reject unsafe comparisons, such as bigint with string or bigint with double. |
| hive.strict.checks.cartesian.product | false | Whether to reject queries that produce a Cartesian product. |
| hive.strict.checks.bucketing | true | Whether to reject LOAD DATA into bucketed tables. |
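
How mapred.reduce.tasks, hive.exec.reducers.bytes.per.reducer and hive.exec.reducers.max interact is easier to see with numbers. The following is a minimal sketch of the commonly described estimate, not Hive's actual code: the function name is hypothetical, and the default arguments are the values used in the example configuration in this article (67108864 bytes per reducer, at most 1099 reducers).

```python
def estimate_reducers(total_input_bytes, bytes_per_reducer=67108864,
                      max_reducers=1099, fixed_reducers=-1):
    """Simplified model of how the reducer count is chosen (an assumption,
    not Hive's real implementation)."""
    # A non-negative mapred.reduce.tasks value overrides the estimate.
    if fixed_reducers >= 0:
        return fixed_reducers
    # Ceiling division with integers: input size / bytes per reducer.
    wanted = -(-total_input_bytes // bytes_per_reducer)
    # Use at least one reducer and never more than hive.exec.reducers.max.
    return max(1, min(max_reducers, wanted))


# 10 GiB of input at 64 MiB per reducer.
print(estimate_reducers(10 * 1024**3))
```

Lowering hive.exec.reducers.bytes.per.reducer raises parallelism until the hive.exec.reducers.max cap is reached; setting mapred.reduce.tasks to a fixed number bypasses the estimate entirely.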

Parameter examples

<configuration>
  <!-- URI of Hive metadata storage -->
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://myhost:9083</value>
  </property>

  <!-- Hive metastore client socket timeout (in seconds) -->
  <property>
    <name>hive.metastore.client.socket.timeout</name>
    <value>300</value>
  </property>

  <!-- Hive data warehouse directory -->
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>

  <!-- Whether subdirectories inherit permissions -->
  <property>
    <name>hive.warehouse.subdir.inherit.perms</name>
    <value>true</value>
  </property>

  <!-- Whether to automatically convert a common join into a map join -->
  <property>
    <name>hive.auto.convert.join</name>
    <value>true</value>
  </property>

  <!-- Size threshold (in bytes): an n-way join is converted directly to a map join when n-1 of its tables together are smaller than this -->
  <property>
    <name>hive.auto.convert.join.noconditionaltask.size</name>
    <value>20971520</value>
  </property>

  <!-- Whether to optimize Sorted Merge of Bucket Map Join -->
  <property>
    <name>hive.optimize.bucketmapjoin.sortedmerge</name>
    <value>false</value>
  </property>

  <!-- Number of rows cached for SMB Join operation -->
  <property>
    <name>hive.smbjoin.cache.rows</name>
    <value>10000</value>
  </property>

  <!-- Whether to enable HiveServer2 operation logging -->
  <property>
    <name>hive.server2.logging.operation.enabled</name>
    <value>true</value>
  </property>

  <!-- Storage location of Hive Server2 operation log -->
  <property>
    <name>hive.server2.logging.operation.log.location</name>
    <value>/var/log/hive/operation_logs</value>
  </property>

  <!-- Number of Reduce tasks of MapReduce job -->
  <property>
    <name>mapred.reduce.tasks</name>
    <value>-1</value>
  </property>

  <!-- The amount of data for each Reduce task (in bytes) -->
  <property>
    <name>hive.exec.reducers.bytes.per.reducer</name>
    <value>67108864</value>
  </property>

  <!-- Maximum size of files allowed to be copied (in bytes) -->
  <property>
    <name>hive.exec.copyfile.maxsize</name>
    <value>33554432</value>
  </property>

  <!-- Upper limit on the number of reducers Hive uses for a job -->
  <property>
    <name>hive.exec.reducers.max</name>
    <value>1099</value>
  </property>

  <!-- Check interval for Vectorized Group By operation -->
  <property>
    <name>hive.vectorized.groupby.checkinterval</name>
    <value>4096</value>
  </property>

  <!-- Flush ratio of Vectorized Group By operation -->
  <property>
    <name>hive.vectorized.groupby.flush.percent</name>
    <value>0.1</value>
  </property>

  <!-- Whether to use statistics to optimize query plans -->
  <property>
    <name>hive.compute.query.using.stats</name>
    <value>false</value>
  </property>

  <!-- Whether to enable vectorized execution engine -->
  <property>
    <name>hive.vectorized.execution.enabled</name>
    <value>true</value>
  </property>

  <!-- Whether to enable vectorized execution in the Reduce phase -->
  <property>
    <name>hive.vectorized.execution.reduce.enabled</name>
    <value>true</value>
  </property>

  <!-- Whether to use vectorized input format -->
  <property>
    <name>hive.vectorized.use.vectorized.input.format</name>
    <value>true</value>
  </property>

  <!-- Whether to use overflow-checked vectorized expressions -->
  <property>
    <name>hive.vectorized.use.checked.expressions</name>
    <value>true</value>
  </property>

  <!-- Whether to use vectorized serialization and deserialization -->
  <property>
    <name>hive.vectorized.use.vector.serde.deserialize</name>
    <value>false</value>
  </property>

  <!-- Usage mode of vectorized adapter -->
  <property>
    <name>hive.vectorized.adaptor.usage.mode</name>
    <value>chosen</value>
  </property>

  <!-- List of excluded vectorized input formats -->
  <property>
    <name>hive.vectorized.input.format.excludes</name>
    <value>org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat</value>
  </property>

  <!-- Whether to merge small files output by Map -->
  <property>
    <name>hive.merge.mapfiles</name>
    <value>true</value>
  </property>

  <!-- Whether to merge small files output by MapReduce -->
  <property>
    <name>hive.merge.mapredfiles</name>
    <value>false</value>
  </property>

  <!-- Whether to enable CBO optimization -->
  <property>
    <name>hive.cbo.enable</name>
    <value>false</value>
  </property>

  <!-- Fetch task conversion level -->
  <property>
    <name>hive.fetch.task.conversion</name>
    <value>minimal</value>
  </property>

  <!-- Data volume threshold that triggers Fetch task conversion (in bytes) -->
  <property>
    <name>hive.fetch.task.conversion.threshold</name>
    <value>268435456</value>
  </property>

  <!-- Memory usage percentage of Limit operation -->
  <property>
    <name>hive.limit.pushdown.memory.usage</name>
    <value>0.1</value>
  </property>

  <!-- Whether to merge small files output by Spark tasks -->
  <property>
    <name>hive.merge.sparkfiles</name>
    <value>true</value>
  </property>

  <!-- Average size (in bytes) when merging small files -->
  <property>
    <name>hive.merge.smallfiles.avgsize</name>
    <value>16777216</value>
  </property>

  <!-- The amount of data merged per task (in bytes) -->
  <property>
    <name>hive.merge.size.per.task</name>
    <value>268435456</value>
  </property>

  <!-- Whether to enable reduce deduplication (removing redundant shuffle stages) -->
  <property>
    <name>hive.optimize.reducededuplication</name>
    <value>true</value>
  </property>

  <!-- Reduce deduplication is skipped if it would leave fewer reducers than this -->
  <property>
    <name>hive.optimize.reducededuplication.min.reducer</name>
    <value>4</value>
  </property>

  <!-- Whether to enable Map-side aggregation -->
  <property>
    <name>hive.map.aggr</name>
    <value>true</value>
  </property>

  <!-- Hash table memory ratio aggregated on Map side -->
  <property>
    <name>hive.map.aggr.hash.percentmemory</name>
    <value>0.5</value>
  </property>

  <!-- Whether to optimize dynamic partition sorting -->
  <property>
    <name>hive.optimize.sort.dynamic.partition</name>
    <value>false</value>
  </property>

  <!-- Hive execution engine type (mr, tez, spark) -->
  <property>
    <name>hive.execution.engine</name>
    <value>mr</value>
  </property>

  <!-- Memory size of Spark Executor -->
  <property>
    <name>spark.executor.memory</name>
    <value>2572261785b</value>
  </property>

  <!-- Memory size of Spark Driver -->
  <property>
    <name>spark.driver.memory</name>
    <value>3865470566b</value>
  </property>

  <!-- Number of cores for each Spark Executor -->
  <property>
    <name>spark.executor.cores</name>
    <value>4</value>
  </property>

  <!-- Spark Driver's memory Overhead -->
  <property>
    <name>spark.yarn.driver.memoryOverhead</name>
    <value>409m</value>
  </property>

  <!-- Memory Overhead of Spark Executor -->
  <property>
    <name>spark.yarn.executor.memoryOverhead</name>
    <value>432m</value>
  </property>

  <!-- Whether to enable dynamic resource allocation -->
  <property>
    <name>spark.dynamicAllocation.enabled</name>
    <value>true</value>
  </property>

  <!-- The initial number of Executors for dynamic resource allocation -->
  <property>
    <name>spark.dynamicAllocation.initialExecutors</name>
    <value>1</value>
  </property>

  <!-- Minimum number of Executors for dynamic resource allocation -->
  <property>
    <name>spark.dynamicAllocation.minExecutors</name>
    <value>1</value>
  </property>

  <!-- The maximum number of Executors for dynamic resource allocation -->
  <property>
    <name>spark.dynamicAllocation.maxExecutors</name>
    <value>2147483647</value>
  </property>

  <!-- In unsecured mode, whether the metastore acts as the client's user and group -->
  <property>
    <name>hive.metastore.execute.setugi</name>
    <value>true</value>
  </property>

  <!-- Whether to support concurrent operations -->
  <property>
    <name>hive.support.concurrency</name>
    <value>true</value>
  </property>

  <!-- ZooKeeper server list -->
  <property>
    <name>hive.zookeeper.quorum</name>
    <value>myhost04,myhost03,myhost02</value>
  </property>

  <!-- ZooKeeper client port number -->
  <property>
    <name>hive.zookeeper.client.port</name>
    <value>2181</value>
  </property>

  <!-- ZooKeeper namespace used by Hive -->
  <property>
    <name>hive.zookeeper.namespace</name>
    <value>hive_zookeeper_namespace_hive</value>
  </property>

  <!-- Cluster delegation token storage class -->
  <property>
    <name>hive.cluster.delegation.token.store.class</name>
    <value>org.apache.hadoop.hive.thrift.MemoryTokenStore</value>
  </property>

  <!-- Whether HiveServer2 runs queries as the connected user (impersonation) -->
  <property>
    <name>hive.server2.enable.doAs</name>
    <value>true</value>
  </property>

  <!-- Whether to enable SASL authentication for Hive metadata storage -->
  <property>
    <name>hive.metastore.sasl.enabled</name>
    <value>true</value>
  </property>

  <!-- Hive Server2 authentication method -->
  <property>
    <name>hive.server2.authentication</name>
    <value>kerberos</value>
  </property>

  <!-- Kerberos principal name for Hive metadata storage -->
  <property>
    <name>hive.metastore.kerberos.principal</name>
    <value>hive/[email protected]</value>
  </property>

  <!-- Kerberos principal name of Hive Server2 -->
  <property>
    <name>hive.server2.authentication.kerberos.principal</name>
    <value>hive/[email protected]</value>
  </property>

  <!-- Whether to enable Spark Shuffle service -->
  <property>
    <name>spark.shuffle.service.enabled</name>
    <value>true</value>
  </property>

  <!-- Whether to reject ORDER BY queries that have no LIMIT -->
  <property>
    <name>hive.strict.checks.orderby.no.limit</name>
    <value>false</value>
  </property>

  <!-- Whether to reject queries on partitioned tables that have no partition filter -->
  <property>
    <name>hive.strict.checks.no.partition.filter</name>
    <value>false</value>
  </property>

  <!-- Whether to perform strict type safety checks -->
  <property>
    <name>hive.strict.checks.type.safety</name>
    <value>true</value>
  </property>

  <!-- Whether to perform strict Cartesian product checking -->
  <property>
    <name>hive.strict.checks.cartesian.product</name>
    <value>false</value>
  </property>

  <!-- Whether to reject LOAD DATA into bucketed tables -->
  <property>
    <name>hive.strict.checks.bucketing</name>
    <value>true</value>
  </property>
</configuration>
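
When checking which values a given hive-site.xml actually sets, it can help to load the file into a dictionary. The following is a minimal sketch using only the Python standard library; the function name load_hive_site and the short SAMPLE excerpt are hypothetical and stand in for a real file read from disk.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt; in practice you would read the real file from disk.
SAMPLE = """<configuration>
  <!-- Hive execution engine type (mr, tez, spark) -->
  <property>
    <name>hive.execution.engine</name>
    <value>mr</value>
  </property>
  <property>
    <name>hive.exec.reducers.max</name>
    <value>1099</value>
  </property>
</configuration>"""


def load_hive_site(xml_text):
    """Return {name: value} for every <property> under <configuration>."""
    root = ET.fromstring(xml_text)
    if root.tag != "configuration":
        raise ValueError("expected <configuration> as the root element")
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name is None:
            continue  # skip malformed entries without a <name>
        # A later duplicate overwrites an earlier one in this sketch.
        props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


props = load_hive_site(SAMPLE)
for name, value in sorted(props.items()):
    print(f"{name} = {value}")
```

All values come back as strings, so numeric settings such as hive.exec.reducers.max need an explicit int() conversion before they are compared.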

Specific use:

  • hive.metastore.uris: URI of Hive metadata store.
  • hive.metastore.client.socket.timeout: Hive metadata client socket timeout.
  • hive.metastore.warehouse.dir: Hive data warehouse directory.
  • hive.warehouse.subdir.inherit.perms: Whether subdirectories inherit permissions.
  • hive.auto.convert.join: Whether to automatically convert a common join into a map join based on input size.
  • hive.auto.convert.join.noconditionaltask.size: Size threshold in bytes; if n-1 of the tables in an n-way join together are smaller than this, the join is converted directly to a map join.
  • hive.optimize.bucketmapjoin.sortedmerge: Whether to optimize the Sorted Merge of Bucket Map Join.
  • hive.smbjoin.cache.rows: The number of rows cached by the SMB Join operation.
  • hive.server2.logging.operation.enabled: Whether HiveServer2 saves operation logs and makes them available to clients.
  • hive.server2.logging.operation.log.location: The storage location of Hive Server2 operation logs.
  • mapred.reduce.tasks: The number of Reduce tasks for the MapReduce job.
  • hive.exec.reducers.bytes.per.reducer: The amount of input data per reducer, in bytes, used to estimate the number of reducers.
  • hive.exec.copyfile.maxsize: The largest file, in bytes, that Hive copies in a single operation; larger files are copied with distcp.
  • hive.exec.reducers.max: The upper limit on the number of reducers Hive uses for a job.
  • hive.vectorized.groupby.checkinterval: Check interval for Vectorized Group By operation.
  • hive.vectorized.groupby.flush.percent: Flush proportion of Vectorized Group By operation.
  • hive.compute.query.using.stats: Whether to use statistics to optimize query plans.
  • hive.vectorized.execution.enabled: Whether to enable the vectorized execution engine.
  • hive.vectorized.execution.reduce.enabled: Whether to enable vectorized execution in the Reduce phase.
  • hive.vectorized.use.vectorized.input.format: Whether to use vectorized input format.
  • hive.vectorized.use.checked.expressions: Whether to use overflow-checked vectorized expressions.
  • hive.vectorized.use.vector.serde.deserialize: Whether to use vectorized serialization and deserialization.
  • hive.vectorized.adaptor.usage.mode: How the vectorization adaptor is used for UDFs without a native vectorized class (none, chosen or all).
  • hive.vectorized.input.format.excludes: List of excluded vectorized input formats.
  • hive.merge.mapfiles: Whether to merge small files output by Map.
  • hive.merge.mapredfiles: Whether to merge small files output by MapReduce.
  • hive.cbo.enable: Whether to enable CBO optimization.
  • hive.fetch.task.conversion: Which simple queries run as a single fetch task instead of a job (none, minimal or more).
  • hive.fetch.task.conversion.threshold: The maximum input size, in bytes, for which fetch task conversion is applied.
  • hive.limit.pushdown.memory.usage: The maximum fraction of heap memory used for the top-K hash when pushing down LIMIT.
  • hive.merge.sparkfiles: Whether to merge small files output by Spark tasks.
  • hive.merge.smallfiles.avgsize: The average size when merging small files.
  • hive.merge.size.per.task: The amount of data merged by each task.
  • hive.optimize.reducededuplication: Whether to remove redundant shuffle stages when data is already clustered by the same key.
  • hive.optimize.reducededuplication.min.reducer: Reduce deduplication is skipped if it would leave fewer reducers than this value.
  • hive.map.aggr: Whether to enable Map-side aggregation.
  • hive.map.aggr.hash.percentmemory: The proportion of hash table memory aggregated on the Map side.
  • hive.optimize.sort.dynamic.partition: Whether to optimize dynamic partition sorting.
  • hive.execution.engine: Hive execution engine type.
  • spark.executor.memory: The memory size of Spark Executor.
  • spark.driver.memory: Spark Driver memory size.
  • spark.executor.cores: The number of cores for each Spark Executor.
  • spark.yarn.driver.memoryOverhead: Spark Driver’s memory Overhead.
  • spark.yarn.executor.memoryOverhead: Spark Executor’s memory Overhead.
  • spark.dynamicAllocation.enabled: Whether to enable dynamic resource allocation.
  • spark.dynamicAllocation.initialExecutors: The initial number of Executors for dynamic resource allocation.
  • spark.dynamicAllocation.minExecutors: The minimum number of Executors for dynamic resource allocation.
  • spark.dynamicAllocation.maxExecutors: The maximum number of Executors for dynamic resource allocation.
  • hive.metastore.execute.setugi: In unsecured mode, whether the metastore performs file system operations as the client's user and group.
  • hive.support.concurrency: Whether Hive supports concurrency through locking.
  • hive.zookeeper.quorum: ZooKeeper server list.
  • hive.zookeeper.client.port: ZooKeeper client port number.
  • hive.zookeeper.namespace: ZooKeeper namespace used by Hive.
  • hive.cluster.delegation.token.store.class: cluster delegation token storage class.
  • hive.server2.enable.doAs: Whether HiveServer2 runs queries as the connected user (impersonation) instead of the HiveServer2 service user.
  • hive.metastore.sasl.enabled: Whether to enable SASL authentication for Hive metadata storage.
  • hive.server2.authentication: Hive Server2 authentication method.
  • hive.metastore.kerberos.principal: Kerberos principal name of Hive metadata store.
  • hive.server2.authentication.kerberos.principal: Kerberos principal name of Hive Server2.
  • spark.shuffle.service.enabled: Whether to enable the Spark Shuffle service.
  • hive.strict.checks.orderby.no.limit: Whether to reject ORDER BY queries that have no LIMIT.
  • hive.strict.checks.no.partition.filter: Whether to reject queries on partitioned tables that have no partition filter.
  • hive.strict.checks.type.safety: Whether to reject unsafe comparisons, such as bigint with string or bigint with double.
  • hive.strict.checks.cartesian.product: Whether to reject queries that produce a Cartesian product.
  • hive.strict.checks.bucketing: Whether to reject LOAD DATA into bucketed tables.
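
The effect of hive.auto.convert.join.noconditionaltask.size is easiest to see with concrete table sizes. The sketch below is a simplified model under stated assumptions, not Hive's planner: the function name is hypothetical, the largest table is assumed to be the streamed side, and the default threshold is the 20971520 bytes used in the example configuration.

```python
def converts_to_map_join(table_sizes_bytes, threshold_bytes=20971520):
    """Simplified check: do all tables except the largest fit under the
    threshold together? (Assumption: the largest table is streamed.)"""
    if len(table_sizes_bytes) < 2:
        return False  # a join needs at least two inputs
    small_side = sum(table_sizes_bytes) - max(table_sizes_bytes)
    return small_side < threshold_bytes


MB = 1024 * 1024
# One large fact table joined with two small dimension tables.
print(converts_to_map_join([50_000 * MB, 5 * MB, 10 * MB]))
print(converts_to_map_join([50_000 * MB, 15 * MB, 10 * MB]))
```

In this model the first join has 15 MiB of small tables and stays under the 20 MiB threshold, while the second has 25 MiB and falls back to a common join. Real decisions also depend on hive.auto.convert.join being enabled and on the size information Hive has for each table.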

The specific values of these parameters can be modified and configured according to actual needs to meet the requirements of your Hive and Spark environments.
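
As an illustration of adjusting values for a particular environment, the following sketch renders a dictionary of overrides into the property layout that hive-site.xml uses. It relies only on the Python standard library; the function name render_hive_site and the chosen override values are hypothetical.

```python
import xml.etree.ElementTree as ET


def render_hive_site(overrides):
    """Build a <configuration> document from a {name: value} mapping."""
    root = ET.Element("configuration")
    for name, value in sorted(overrides.items()):
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        # Hadoop-style booleans are lowercase "true"/"false".
        if isinstance(value, bool):
            value = "true" if value else "false"
        ET.SubElement(prop, "value").text = str(value)
    # ET.indent is only available on Python 3.9 and later.
    if hasattr(ET, "indent"):
        ET.indent(root, space="  ")
    return ET.tostring(root, encoding="unicode")


# Hypothetical overrides for one environment.
print(render_hive_site({
    "hive.execution.engine": "spark",
    "hive.exec.reducers.max": 1099,
    "hive.map.aggr": True,
}))
```

Keeping overrides in one mapping makes it straightforward to review what differs between environments before the file is written out.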