3. Advanced operations and techniques of awk arrays

Article directory

  • Foreword
  • 1. Multidimensional array operations with awk
      • 1. Create and initialize awk’s multi-dimensional array
      • 2. Access multi-dimensional array elements
  • 2. Advanced operations on arrays
      • 1. Use arrays to implement counters
      • 2. AWK for data grouping and statistics
      • 3. Sorting of arrays in awk
        • 1. awk built-in function sorting
        • 2. awk custom sorting function
          • 1. awk’s bubble sort method
          • 2. awk’s Shell sort method
          • 3. Quick sort with awk
  • Summary

Foreword

This chapter covers awk array operations and a few practical techniques: creating pseudo-multidimensional arrays, deleting array elements, sorting arrays, applying associative arrays, and optimizing performance. Everything here centers on awk arrays.

1. Multidimensional array operations with awk

From here on, the code is written in awk script files, which by convention end in .awk.
For example, I created a file named test.awk.
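The listing for test.awk is not reproduced here. As an assumption, a one-line script like the following would produce the hello world! output shown in the run below; the heredoc is just one way to create the file from a shell.

```shell
# Assumed contents of test.awk (the original listing is not shown)
cat > test.awk <<'EOF'
BEGIN { print "hello world!" }
EOF
```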


Run the script with awk; the -f option specifies the script file we wrote.

awk -f test.awk
hello world!

awk does not support true multidimensional arrays, but you can use associative arrays to simulate them.
By building each key from a row part and a column part, you get a data structure that behaves like a table indexed by row and column.
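As a rough sketch of the idea (the grade and total arrays are made up for illustration), awk also accepts a comma-separated subscript such as grade["Bob", "Math"]. It joins the parts with the built-in SUBSEP variable to form a single string key, so the array is still an ordinary associative array. The snippet is wrapped in a shell function, subsep_demo, only so that it can be pasted into a terminal.

```shell
subsep_demo() {
    awk 'BEGIN {
        # grade[row, col] is stored under the single string key: row SUBSEP col
        grade["Bob", "Math"] = 78
        grade["Bob", "English"] = 92

        # A membership test on a two-part key does not create the element
        if (("Bob", "Math") in grade)
            print "Bob has a Math score"
        if (!(("Bob", "History") in grade))
            print "Bob has no History score"

        # Recover the two parts of each key by splitting on SUBSEP
        for (key in grade) {
            split(key, part, SUBSEP)
            total[part[1]] += grade[key]
        }
        print "Bob total:", total["Bob"]
    }'
}

subsep_demo
```

Using the (row, col) in array form to test for an element is worth the extra parentheses: simply reading grade["Bob", "History"] would silently create an empty element.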

1. Create and initialize awk’s multi-dimensional array.

  • Suppose you have a table of student information that contains each student’s name, student ID, and course grades, and you want to process it with AWK.

  • First, here is the two-dimensional table, named students, that we want to store in awk:

    Student name    Student ID    Math score    English score
    Alice           101           95            88
    Bob             102           78            92
    Charlie         103           88            75

    Initialize it in awk:

    #!/usr/bin/awk -f
    BEGIN {
        # Initialize the students table; each key has the form "name, column"
        students["Alice, ID"] = 101
        students["Alice, Math"] = 95
        students["Alice, English"] = 88

        students["Bob, ID"] = 102
        students["Bob, Math"] = 78
        students["Bob, English"] = 92

        students["Charlie, ID"] = 103
        students["Charlie, Math"] = 88
        students["Charlie, English"] = 75
    }
    

    The two-dimensional table is now initialized: each row is identified by a student’s name, and the columns hold the student ID, math score, and English score.

2. Access multi-dimensional array elements.

  • To access an element of a simulated multidimensional array, use the key that combines its row and column.

  • For example, to look up Bob’s English score and Charlie’s math score, we can read the data like this:

    #!/usr/bin/awk -f
    BEGIN {
        # The initialization code of the array is omitted
        # ...
        printf("Bob's English score is: %d\n", students["Bob, English"])
        printf("Charlie's math score is: %d\n", students["Charlie, Math"])
    }

    Output:
    Bob's English score is: 92
    Charlie's math score is: 88
    
  • Next, let’s use a for loop to print the whole students array as a formatted table.

    BEGIN {
        # Initialize the array
        students["Alice, ID"] = 101
        students["Alice, Math"] = 95
        students["Alice, English"] = 88

        students["Bob, ID"] = 102
        students["Bob, Math"] = 78
        students["Bob, English"] = 92

        students["Charlie, ID"] = 103
        students["Charlie, Math"] = 88
        students["Charlie, English"] = 75

        # Print the header
        printf("%-15s%-10s%-10s%-10s\n", "Name", "ID", "Math", "English")
        print "--------------------------------------------------"

        # Create two separate arrays:
        # student_names stores the students' names
        # student_data stores each student's value for every column
        # split("", arr) creates arr if necessary and clears it
        split("", student_data)
        split("", student_names)

        for (key in students) {
            # A key looks like "Alice, Math": split it into name and column
            split(key, data, ", ")
            name = data[1]
            subject = data[2]
            student_data[name, subject] = students[key]

            if (!(name in student_names)) {
                student_names[name] = 1
            }
        }
        for (name in student_names) {
            printf("%-15s%-10d%-10d%-10d\n", name, student_data[name, "ID"], student_data[name, "Math"], student_data[name, "English"])
        }
    }
    
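    Run our awk script file:

    awk -f test.awk
    Output:

    Name           ID        Math      English
    --------------------------------------------------
    Bob            102       78        92
    Alice          101       95        88
    Charlie        103       88        75

    The row order may differ on your system, because awk does not define the order in which for (key in array) visits the keys.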
    
  • Let’s look at the split function first.

  • The split() function in awk splits a string into substrings and stores those substrings in an array.
    Syntax:

    split(string, array, separator)

  • string: The string to be split.

  • array: The name of the array that receives the substrings.

  • separator: The separator used to split the string. It is interpreted the same way as FS: a single space means "split on runs of whitespace", and a string longer than one character is treated as a regular expression. If the separator argument is omitted, the current value of FS is used, which by default splits on runs of whitespace.

  • The split() function works as follows:

  1. It clears the array, then divides the content of string into substrings at each occurrence of the separator.
  2. The substrings are stored in array under the numeric indexes 1, 2, 3, and so on.
  3. The function returns the number of substrings.

    Here is an example that demonstrates how to use the split() function:
    #!/usr/bin/awk -f

    BEGIN {
        # Example string
        my_string = "Alice,Bob,Charlie,David"

        # Split the string into an array, using a comma as the separator.
        # split() returns the number of substrings, which is also the number of array elements.
        num_substrings = split(my_string, my_array, ",")

        # Print the number of substrings
        printf("The number of substrings after splitting is: %s\n", num_substrings)

        # Loop through the array and print each substring
        for (i = 1; i <= num_substrings; i++) {
            printf("Substring %d: %s\n", i, my_array[i])
        }
    }
    
    Execute the awk script file:

    awk -f test.awk
    The number of substrings after splitting is: 4
    Substring 1: Alice
    Substring 2: Bob
    Substring 3: Charlie
    Substring 4: David
    

In this example, the split() function splits my_string at each comma and stores the resulting substrings in the array my_array. We then print the number of substrings and the content of each one.

The split() function is a useful tool for working with text data, splitting it into more manageable parts.
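To round out the description of the separator argument, here is a small sketch of three less obvious cases: omitting the separator, passing a regular expression, and splitting an empty string to clear an array. The sample strings and the shell function name split_demo are made up for illustration.

```shell
split_demo() {
    awk 'BEGIN {
        # Separator omitted: FS is used, so runs of blanks count as one
        # separator and leading or trailing blanks are ignored
        n = split("  alpha   beta gamma ", a)
        print n, a[1], a[3]

        # A separator longer than one character is a regular expression
        n = split("2022-01-03 10:15", b, "[- :]")
        print n, b[1], b[5]

        # Splitting an empty string empties the array and returns 0
        n = split("", a)
        print n, ((1 in a) ? "a[1] still set" : "a is empty")
    }'
}

split_demo
```

With a typical awk this prints 3 alpha gamma, then 5 2022 15, then 0 a is empty.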

2. Advanced operations on arrays

1. Use arrays to implement counters:

  • awk arrays are particularly well suited to counting how many times each value appears in the data. Here is an example that uses an array to count the occurrences of each element in a data set.
  • Consider a text file containing some words, where we want to count the number of times each word appears.
    Suppose the text file text.txt contains the following content:
    apple banana apple orange banana apple


    Here is an example AWK script that counts the occurrences of each word:

    #!/usr/bin/awk -f

    # Use an array as a counter
    {
        # Split each line on whitespace; split() returns the number of words
        n = split($0, words, " ")

        # Iterate through each word and increment its counter
        for (i = 1; i <= n; i++) {
            word = words[i]
            count[word]++
        }
    }
    END {
        # Print each word and its number of occurrences
        for (word in count) {
            print word, "Number of occurrences:", count[word]
        }
    }
    

    Execute the awk script file we wrote:

    echo "apple banana apple orange banana apple" | awk -f test.awk
    apple Number of occurrences: 3
    banana Number of occurrences: 2
    orange Number of occurrences: 1
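Because awk already splits every input line into the fields $1 to $NF, the same counter can be written without calling split(). The following is a minimal sketch of that variant; the shell function name count_words is hypothetical, and the result is piped through the external sort command so that the output order is predictable.

```shell
count_words() {
    awk '{
        # Each whitespace-separated field is one word
        for (i = 1; i <= NF; i++)
            count[$i]++
    }
    END {
        for (word in count)
            print count[word], word
    }' "$@" | sort -k1,1nr -k2,2
}

printf 'apple banana apple orange banana apple\n' | count_words
```

This prints the most frequent word first: 3 apple, 2 banana, 1 orange.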
    

2. AWK for data grouping and statistics

  • In AWK, arrays are also commonly used to group data and compute statistics for each group, for example:

    1. Group data by field value: Use an array to group records by the value of a field, then compute statistics for each group.
      For example, group data by date, product name, or region, then calculate the total sales or the average for each group.

    2. Aggregate data: Use arrays to accumulate values across records.
      For example, aggregate monthly sales data and calculate total sales or average sales volume.

    3. Count the occurrences of elements: Use arrays as counters for elements such as words or events.

    4. Compute a frequency distribution: Use arrays to count how many values fall into each range.
      For example, count the number of students in each score range of a grade distribution.

    5. Calculate statistical indicators: Use arrays to compute the mean, median, mode, and similar measures.

    6. Filter data: Use arrays to keep only the records that meet certain conditions.
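As a small illustration of item 4, a frequency distribution can be built by turning each value into a bucket key and counting per bucket. This is only a sketch: the 10-point band width, the one-score-per-line input format, and the function name score_histogram are assumptions; the sample scores are the six grades from the students table earlier in this chapter.

```shell
score_histogram() {
    awk '{
        # Map each score to the lower bound of its 10-point band
        low = int($1 / 10) * 10
        band[low]++
    }
    END {
        for (low in band)
            printf "%d-%d %d\n", low, low + 9, band[low]
    }' "$@" | sort -n
}

printf '95\n88\n78\n92\n88\n75\n' | score_histogram
```

For this input, each of the bands 70-79, 80-89, and 90-99 contains two scores.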

    Here is an example showing how to use AWK to group data and calculate the total value for each group:

    Suppose there is a text file data.txt containing date and sales data:

    2022-01-01 100
    2022-01-01 150
    2022-01-02 120
    2022-01-03 200
    2022-01-03 180
    

    Here is an example AWK script to group data by date and calculate the total sales for each date:

    #!/usr/bin/awk -f

    # Use arrays for data grouping and statistics
    {
        date = $1
        sales = $2

        # Use the date as the key and add this record to its running total
        total_sales[date] += sales
    }
    END {
        # Print each date and its total sales
        for (date in total_sales) {
            print "Date:", date, "Total sales:", total_sales[date]
        }
    }
    

    Run the awk file we wrote:

    awk -f test.awk data.txt
    Date: 2022-01-01 Total sales: 250
    Date: 2022-01-02 Total sales: 120
    Date: 2022-01-03 Total sales: 380
    

    In this example, we use the array total_sales to group the data by date and calculate the total sales for each date. Running the script will produce the total sales for each date.

    This example demonstrates how to use arrays in AWK for data grouping and statistics, but AWK’s capabilities go far beyond that. You can perform various statistical and data processing operations based on specific needs.
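For instance, keeping a second array with the number of records per date is enough to report an average next to the total. The sketch below assumes the same two-column layout as data.txt; the function name sales_stats and the output format are made up for illustration.

```shell
sales_stats() {
    awk '{
        total[$1] += $2    # running total per date
        n[$1]++            # number of records per date
    }
    END {
        for (date in total)
            printf "%s count=%d total=%d mean=%.1f\n", date, n[date], total[date], total[date] / n[date]
    }' "$@" | sort
}

printf '2022-01-01 100\n2022-01-01 150\n2022-01-02 120\n2022-01-03 200\n2022-01-03 180\n' | sales_stats
```

For the sample data, the means are 125.0, 120.0, and 190.0 for the three dates.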

3. Sorting of arrays in awk

1. awk built-in function sorting
  • When writing AWK scripts to process data, you often need to sort the data in an array before analyzing it. gawk (GNU awk) provides two built-in sorting functions for this, asort() and asorti(). Both are gawk extensions: they are not part of POSIX awk and may be missing in other implementations such as mawk.

  • asort(array [, dest [, how] ]) function

    • array: The array to be sorted. All awk arrays are associative, so any array can be passed.
    • dest (optional): An array that receives the sorted result. If dest is provided, the sorted values are stored in dest under the indexes 1 to n and the original array is not modified. If dest is not provided, the original array is overwritten: its values are sorted and its original indexes are replaced by 1 to n.
    • how (optional): A string that controls the comparison. It is either the name of a user-defined comparison function or one of gawk’s predefined orderings, for example:
      • "@val_num_asc": Sort values numerically in ascending order.
      • "@val_num_desc": Sort values numerically in descending order.
      If how is omitted, values are compared with awk’s usual rules and sorted in ascending order.
  • The asort() function returns the number of elements in the sorted array. It sorts the values of an array and is suitable for situations where sorting by value is required.

    • Here’s an example:

    #!/usr/bin/awk -f
    # Use the asort() function (gawk) to sort an array in ascending order
    BEGIN {
        data[1] = 5
        data[2] = 2
        data[3] = 8
        data[4] = 3
        data[5] = 1

        # Sort the values of data in ascending order and store the result in sorted_data
        count = asort(data, sorted_data)

        # Print the sorted array
        for (i = 1; i <= count; i++) {
            print "value:", sorted_data[i]
        }
    }
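To sort in descending order, the third argument can name one of gawk’s predefined orderings. The following sketch assumes gawk 4.0 or later is installed as gawk; the array name sorted and the shell function name sort_desc_demo are made up for illustration.

```shell
sort_desc_demo() {
    gawk 'BEGIN {
        data[1] = 5; data[2] = 2; data[3] = 8; data[4] = 3; data[5] = 1

        # "@val_num_desc" compares the values as numbers, largest first
        n = asort(data, sorted, "@val_num_desc")

        for (i = 1; i <= n; i++)
            printf "%s%s", sorted[i], (i < n ? " " : "\n")
    }'
}

# Run the demo only where gawk is available
if command -v gawk >/dev/null 2>&1; then
    sort_desc_demo
fi
```

With gawk this prints 8 5 3 2 1. Because dest is given, data itself is left unchanged.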
    
  • asorti(array [, dest [, how] ]) function

    • array: The array whose indexes (keys) are to be sorted.
    • dest (optional): An array that receives the sorted result. If dest is provided, the sorted keys are stored as the values of dest under the indexes 1 to n and the original array is not modified. If dest is not provided, the original array is overwritten with the sorted keys, so its original values are lost.
    • how (optional): A string that controls the comparison. It is either the name of a user-defined comparison function or one of gawk’s predefined orderings, for example:
      • "@ind_str_asc": Sort keys as strings in ascending order.
      • "@ind_num_asc": Sort keys numerically in ascending order.
      • "@ind_num_desc": Sort keys numerically in descending order.
      If how is omitted, keys are compared as strings in ascending order, so the key "10" sorts before "2".
  • The asorti() function returns the number of elements in the sorted array. It sorts the keys of an array and is suitable for situations where sorting by key is required.
  • Here’s an example:

    #!/usr/bin/awk -f

    # Use the asorti() function (gawk) to sort an array in ascending order by key
    BEGIN {
        data[2] = "banana"
        data[8] = "cherry"
        data[3] = "date"
        data[1] = "fig"

        # Sort the keys of data in ascending order and store them in sorted_keys
        count = asorti(data, sorted_keys)

        # Print the sorted keys and their corresponding values
        for (i = 1; i <= count; i++) {
            key = sorted_keys[i]
            value = data[key]
            print "key:", key, "value:", value
        }
    }
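Where gawk is not available, one common alternative is to let the external sort command do the ordering. This is a different technique from asorti(), shown here only as a sketch: it writes "key value" lines into a pipe opened from inside awk. The function name portable_key_sort is hypothetical.

```shell
portable_key_sort() {
    awk 'BEGIN {
        data[2] = "banana"; data[8] = "cherry"; data[3] = "date"; data[1] = "fig"

        # Send "key value" lines to an external sort process,
        # which orders them numerically by the leading key
        cmd = "sort -n"
        for (key in data)
            print key, data[key] | cmd

        # Closing the pipe lets sort finish and emit its output
        close(cmd)
    }'
}

portable_key_sort
```

The output lists the entries in key order: 1 fig, 2 banana, 3 date, 8 cherry. The trade-off is that the sorted data goes straight to the output instead of back into an awk array.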

2. awk custom sorting function

1. awk’s bubble sort method

#!/usr/bin/awk -f

# Bubble sort: print the entries of arr (name -> age) in ascending order of age.
# The names after "arr" are local variables. asorti() requires gawk.
function bubble_sort(arr,    i, j, temp, n, sorted_array) {
    # Copy the keys (names) of arr into sorted_array[1..n]
    n = asorti(arr, sorted_array)
    for (i = 1; i < n; i++) {
        for (j = 1; j <= n - i; j++) {
            # Swap neighboring names when their ages are out of order
            if (arr[sorted_array[j]] > arr[sorted_array[j + 1]]) {
                temp = sorted_array[j]
                sorted_array[j] = sorted_array[j + 1]
                sorted_array[j + 1] = temp
            }
        }
    }
    for (i = 1; i <= n; i++) {
        print "Name: " sorted_array[i] ", Age: " arr[sorted_array[i]]
    }
}

# Initialize a one-dimensional array: the key is the name, the value is the age
BEGIN {
    my_array["Alice"] = 25
    my_array["Bob"] = 30
    my_array["Charlie"] = 22
    my_array["wf"] = 44
    my_array["lcx"] = 18

    # Call the bubble sort function
    bubble_sort(my_array)
}

Execute the awk file:

awk -f 1.awk
Name: lcx, Age: 18
Name: Charlie, Age: 22
Name: Alice, Age: 25
Name: Bob, Age: 30
Name: wf, Age: 44

2. awk’s Shell sort method

#!/usr/bin/awk -f

# Shell sort: print the entries of arr (name -> age) in ascending order of age.
# The names after "arr" are local variables. asorti() requires gawk.
function shell_sort(arr,    i, j, n, gap, current_name, current_age, sorted_array) {
    # Copy the keys (names) of arr into sorted_array[1..n]
    n = asorti(arr, sorted_array)
    # Run a gapped insertion sort, halving the gap on each pass
    for (gap = int(n / 2); gap > 0; gap = int(gap / 2)) {
        for (i = gap + 1; i <= n; i++) {
            current_name = sorted_array[i]
            current_age = arr[current_name]
            j = i
            while (j > gap && arr[sorted_array[j - gap]] > current_age) {
                sorted_array[j] = sorted_array[j - gap]
                j = j - gap
            }
            sorted_array[j] = current_name
        }
    }
    for (i = 1; i <= n; i++) {
        print "Name: " sorted_array[i] ", Age: " arr[sorted_array[i]]
    }
}

# Initialize a one-dimensional array: the key is the name, the value is the age
BEGIN {
    my_array["Alice"] = 25
    my_array["Bob"] = 30
    my_array["Charlie"] = 22
    my_array["wf"] = 44
    my_array["lcx"] = 18

    # Call the Shell sort function
    shell_sort(my_array)
}

Execute the awk file:

awk -f 1.awk
Name: lcx, Age: 18
Name: Charlie, Age: 22
Name: Alice, Age: 25
Name: Bob, Age: 30
Name: wf, Age: 44

3. Quick sort with awk

#!/usr/bin/awk -f

# Quick sort: reorder sorted_array[left..right] (names) by ascending age in arr.
# sorted_array is a global array filled in BEGIN. The names after "right" are
# local variables; recursion needs them so that nested calls do not overwrite
# the caller's pivot_index.
function quick_sort(arr, left, right,    i, j, temp, pivot_name, pivot_age, pivot_index, current_name, current_age) {
    if (left < right) {
        # Use the last element as the pivot
        pivot_name = sorted_array[right]
        pivot_age = arr[pivot_name]
        i = left - 1
        for (j = left; j < right; j++) {
            current_name = sorted_array[j]
            current_age = arr[current_name]
            if (current_age <= pivot_age) {
                i++
                temp = sorted_array[i]
                sorted_array[i] = sorted_array[j]
                sorted_array[j] = temp
            }
        }
        # Move the pivot into its final position
        temp = sorted_array[i + 1]
        sorted_array[i + 1] = sorted_array[right]
        sorted_array[right] = temp
        pivot_index = i + 1

        quick_sort(arr, left, pivot_index - 1)
        quick_sort(arr, pivot_index + 1, right)
    }
}

# Initialize a one-dimensional array: the key is the name, the value is the age
BEGIN {
    my_array["Alice"] = 25
    my_array["Bob"] = 30
    my_array["Charlie"] = 22
    my_array["wf"] = 44
    my_array["lcx"] = 18

    # Use the asorti function (gawk) to copy the names into sorted_array
    n = asorti(my_array, sorted_array)

    # Call the quick sort function
    quick_sort(my_array, 1, n)

    # Print the array sorted by age in ascending order
    for (i = 1; i <= n; i++) {
        name = sorted_array[i]
        age = my_array[name]
        print "Name: " name ", Age: " age
    }
}

Execute the awk file:

awk -f 1.awk
Name: lcx, Age: 18
Name: Charlie, Age: 22
Name: Alice, Age: 25
Name: Bob, Age: 30
Name: wf, Age: 44
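All three functions above rely on gawk’s asorti() to copy the names into an indexed array. As a sketch of how the same result can be reached in any POSIX awk, the keys can be collected with a for-in loop and then ordered with an insertion sort. Insertion sort is a different algorithm from the three shown above, and the function names sort_by_age and insertion_sort, as well as the "name age" input format, are assumptions made for this example.

```shell
sort_by_age() {
    awk '
    # Insertion sort of keys[1..n] by the numeric value stored in arr.
    # i, j and k are local variables.
    function insertion_sort(arr, keys, n,    i, j, k) {
        for (i = 2; i <= n; i++) {
            k = keys[i]
            # Shift keys with a larger value one slot to the right
            for (j = i - 1; j >= 1 && arr[keys[j]] > arr[k]; j--)
                keys[j + 1] = keys[j]
            keys[j + 1] = k
        }
    }

    # Each input line is "name age"; adding 0 forces a numeric value
    { age[$1] = $2 + 0 }

    END {
        # Collect the names into keys[1..n] without asorti()
        n = 0
        for (name in age)
            keys[++n] = name

        insertion_sort(age, keys, n)

        for (i = 1; i <= n; i++)
            print "Name: " keys[i] ", Age: " age[keys[i]]
    }' "$@"
}

printf 'Alice 25\nBob 30\nCharlie 22\nwf 44\nlcx 18\n' | sort_by_age
```

For the same five people this produces the same listing as the scripts above, from lcx (18) up to wf (44).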

Summary

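This section covered simulating multidimensional arrays, counting and grouping data with associative arrays, and sorting arrays. The next section covers performance optimization of awk arrays.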