Article directory
- Foreword
- 1. Multidimensional array operations with awk
  - 1. Create and initialize awk's multidimensional array
  - 2. Access multidimensional array elements
- 2. Advanced operations on arrays
  - 1. Use arrays to implement counters
  - 2. AWK for data grouping and statistics
- 3. Sorting of arrays in awk
  - 1. awk built-in function sorting
  - 2. awk custom sorting function
    - 1. awk's bubble sort method
    - 2. awk's Shell sort method
    - 3. Quick sort with awk
- Summary
Foreword
This chapter covers awk array operations and some related tips: creating pseudo-multidimensional arrays, deleting array elements, sorting arrays, applying associative arrays, performance optimization, and so on. The common theme is the awk array.
1. Multidimensional array operations with awk
From here on, the code is written in awk script files, which by convention end in .awk.
For example, I created a file named test.awk.
To run the script, pass it to awk with -f, which names the script file:
awk -f test.awk hello world!
Standard awk does not support true two-dimensional arrays (gawk 4.0 and later adds arrays of arrays as an extension), but you can use associative arrays to simulate the behavior of multidimensional arrays.
This yields a multidimensional array-like data structure in which row and column indexing is simulated by building appropriate keys.
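A minimal sketch of how such keys can be built with awk's comma subscript syntax, which joins the subscripts with the built-in SUBSEP variable. The names grid, r, c and parts are illustrative and not part of the original scripts.

```shell
# Sketch: simulate a 2-D grid with comma subscripts in POSIX awk.
grid_demo=$(awk 'BEGIN {
    # grid[r, c] is stored under the single string key: r SUBSEP c
    for (r = 1; r <= 2; r++)
        for (c = 1; c <= 3; c++)
            grid[r, c] = r * 10 + c

    # Membership test for a (row, column) pair
    if ((2, 3) in grid)
        print "grid[2,3] = " grid[2, 3]

    # Recover the two subscripts from a stored key
    key = 1 SUBSEP 2
    split(key, parts, SUBSEP)
    print "row " parts[1] ", column " parts[2] " holds " grid[key]
}')
printf '%s\n' "$grid_demo"
```

Because the "two-dimensional" key is really one string, any separator works as long as it cannot appear in the data. The student examples in this article use ", " inside a single quoted key for the same purpose.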
1. Create and initialize awk’s multi-dimensional array.
- Suppose you have a table of student information containing each student's name, student ID, and course grades, and you want to process this table with AWK.
- First, create a 2D table named students. The table we want to store in awk looks like this:

Student name   Student number   Math score   English score
Alice          101              95           88
Bob            102              78           92
Charlie        103              88           75

Use awk to initialize it:
#!/usr/bin/awk -f
BEGIN {
    # Initialization of the students table
    students["Alice, ID"] = 101
    students["Alice, Math"] = 95
    students["Alice, English"] = 88
    students["Bob, ID"] = 102
    students["Bob, Math"] = 78
    students["Bob, English"] = 92
    students["Charlie, ID"] = 103
    students["Charlie, Math"] = 88
    students["Charlie, English"] = 75
}
Now a two-dimensional table has been initialized, in which the rows represent the students’ names and the columns represent the students’ student numbers, math scores, and English scores.
2. Access multi-dimensional array elements.
- To access an element of a simulated multidimensional array, use the corresponding key, which specifies both the row and the column.
- For example, suppose we want to know Bob's English score and Charlie's math score. We can get the data like this:

#!/usr/bin/awk -f
BEGIN {
    # The initialization code of the array is omitted
    # ...
    printf("Bob's English score is: %d\n", students["Bob, English"])
    printf("Charlie's math score is: %d\n", students["Charlie, Math"])
}

Output:
Bob's English score is: 92
Charlie's math score is: 88
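One detail worth keeping in mind when accessing elements: merely referencing a key that does not exist creates it with an empty value, so a lookup for a missing student is better guarded with the in operator. A small sketch, using hypothetical entries for Dave and Eve that are not in the table above:

```shell
lookup_demo=$(awk 'BEGIN {
    students["Bob, English"] = 92

    # Test membership with "in"; this does not create the element
    if (!("Dave, English" in students))
        print "no English score for Dave"

    # Merely referencing a missing element creates it with an empty value
    unused = students["Eve, English"]
    if ("Eve, English" in students)
        print "Eve entry now exists"
}')
printf '%s\n' "$lookup_demo"
```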
- Let's use a for loop to print the students array as a table.

BEGIN {
    # Initialize the array
    students["Alice, ID"] = 101
    students["Alice, Math"] = 95
    students["Alice, English"] = 88
    students["Bob, ID"] = 102
    students["Bob, Math"] = 78
    students["Bob, English"] = 92
    students["Charlie, ID"] = 103
    students["Charlie, Math"] = 88
    students["Charlie, English"] = 75

    # Print header
    printf("%-15s%-10s%-10s%-10s\n", "Name", "ID", "Math", "English")
    print "--------------------------------------------------"

    # Create two separate arrays:
    # student_names stores the students' names
    # student_data stores each student's data for the various subjects
    # Calling split() on an empty string creates (or clears) an array
    split("", student_data)
    split("", student_names)

    for (key in students) {
        split(key, data, ", ")
        name = data[1]
        subject = data[2]
        student_data[name, subject] = students[key]
        if (!(name in student_names)) {
            student_names[name] = 1
        }
    }

    for (name in student_names) {
        printf("%-15s%-10d%-10d%-10d\n", name, student_data[name, "ID"], student_data[name, "Math"], student_data[name, "English"])
    }
}

Run our awk script file:
awk -f test.awk

Output (the order in which for-in visits the names is unspecified, so the rows may appear in a different order):
Name           ID        Math      English
--------------------------------------------------
Bob            102       78        92
Alice          101       95        88
Charlie        103       88        75
- Let's talk about the split function first.
The split() function in awk splits a string into substrings and stores these substrings in an array.
Syntax: split(string, array, separator)
- string: The string to be split.
- array: The name of the array used to store the substrings.
- separator: The separator used to split the string; it can be a single character or a regular expression. If the separator is omitted, the current value of FS is used, which by default splits on runs of whitespace.
The split() function works as follows:
- It separates the content of string into multiple substrings according to the specified separator.
- The substrings are stored in array, indexed from 1 in the order they appear. Any previous contents of array are cleared first.
- The function returns the number of substrings.
Here is an example that demonstrates how to use the split() function:

#!/usr/bin/awk -f
BEGIN {
    # Example string
    my_string = "Alice,Bob,Charlie,David"

    # Split the string into an array using a comma as the separator.
    # The return value of split() is the number of substrings, which is the number of array elements.
    num_substrings = split(my_string, my_array, ",")

    # Print the number of substrings
    printf("The number of substrings after splitting is: %s\n", num_substrings)

    # Loop through the array and print each substring
    for (i = 1; i <= num_substrings; i++) {
        printf("Substring %d: %s\n", i, my_array[i])
    }
}

Execute the awk script file:
$ awk -f test.awk
The number of substrings after splitting is: 4
Substring 1: Alice
Substring 2: Bob
Substring 3: Charlie
Substring 4: David

In this example, the split() function splits the comma-separated parts of my_string into substrings and stores them in the array my_array. Then we print the number of substrings and the content of each one.
The split() function is a useful tool for working with text data, splitting it into more manageable parts.
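The other forms of the separator argument can be sketched the same way: omitting it, passing a regular expression, and splitting an empty string to clear an array. The sample strings and the array names words and letters are made up for illustration.

```shell
split_demo=$(awk 'BEGIN {
    # No third argument: split on FS, which by default means runs of blanks
    n = split("  alpha   beta gamma ", words)
    print n ": " words[1] "|" words[2] "|" words[3]

    # A regular expression as the separator: comma or semicolon
    m = split("a,b;c", letters, /[,;]/)
    print m ": " letters[1] "|" letters[2] "|" letters[3]

    # Splitting an empty string leaves the array empty and returns 0
    k = split("", words)
    print k ": " (1 in words ? "still set" : "cleared")
}')
printf '%s\n' "$split_demo"
```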
2. Advanced operations on arrays
1. Use arrays to implement counters:
awk's arrays are particularly well suited for use as counters that track how many times each value appears. Here's an example of using an array to count the number of occurrences of each element in a set of data:
- Consider a text file containing some words; we want to count the number of times each word appears.
Suppose the text file text.txt contains the following content:
apple banana apple orange banana apple
Here is an example AWK script that counts the occurrences of each word:

#!/usr/bin/awk -f
# Use an array as a counter
{
    # Split each line using spaces as separators; split() returns the number of words
    n = split($0, words, " ")
    # Iterate through each word and increment its counter
    for (i = 1; i <= n; i++) {
        word = words[i]
        count[word]++
    }
}
END {
    # Print each word and the number of occurrences
    for (word in count) {
        print word, "Number of occurrences:", count[word]
    }
}

Execute the awk script file we wrote:
$ echo "apple banana apple orange banana apple" | awk -f test.awk
apple Number of occurrences: 3
banana Number of occurrences: 2
orange Number of occurrences: 1
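Since for-in visits the keys in an unspecified order, the counter's output order can differ between runs and awk implementations. One possible variant, sketched here, prints the count first and pipes the result through the external sort utility so the most frequent word comes first. The function name count_words is hypothetical, and the sketch uses the fields $1..$NF instead of split().

```shell
count_words() {
    awk '{
        for (i = 1; i <= NF; i++)   # NF fields per line, split on whitespace
            count[$i]++
    }
    END {
        for (word in count)
            print count[word], word
    }' "$@" | sort -k1,1nr -k2,2
}

printf 'apple banana apple\norange banana apple\n' | count_words
```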
2. AWK for data grouping and statistics
In AWK, using arrays to group data and compute statistics is also a common operation, for example:
- Group data by field value: You can use an array to group data according to the value of a certain field, and then compute statistics for each group. For example, you can group data by fields such as date, product name, or region, and then calculate the total sales, the average, and so on for each group.
- Data aggregation: You can use arrays to aggregate data. For example, you can aggregate monthly sales data and calculate total sales, average sales volume, and so on.
- Count the occurrences of elements: You can use arrays as counters to count the occurrences of elements (such as words or events).
- Compute frequency distributions: You can use arrays to calculate frequency distributions. For example, count the number of students in each score range of a grade distribution.
- Calculate mean, median, mode, etc.: You can use arrays to calculate various statistical indicators, such as the mean, median, and mode.
- Filter data: You can use arrays to filter data and retain only the data that meets certain conditions.
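As an illustration of the frequency distribution item, the sketch below counts scores in bands of ten, one score per input line. The function name score_bands, the band width, and the sample scores are assumptions made for this example.

```shell
score_bands() {
    awk '{
        band = int($1 / 10) * 10      # 87 -> 80, 95 -> 90
        if (band > 90) band = 90      # fold a score of 100 into the top band
        freq[band]++
    }
    END {
        # Walk the bands in numeric order instead of relying on for-in order
        for (band = 0; band <= 90; band += 10)
            if (band in freq)
                printf "%d-%d: %d\n", band, (band == 90 ? 100 : band + 9), freq[band]
    }'
}

printf '95\n88\n78\n92\n75\n100\n' | score_bands
```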
Here is an example showing how to use AWK to group data and calculate the total value for each group.
Suppose there is a text file data.txt containing date and sales data:
2022-01-01 100
2022-01-01 150
2022-01-02 120
2022-01-03 200
2022-01-03 180
Here is an example AWK script that groups the data by date and calculates the total sales for each date:

#!/usr/bin/awk -f
# Use arrays for data grouping and statistics
{
    date = $1
    sales = $2
    # Use the date as the key and add to its running total
    total_sales[date] += sales
}
END {
    # Print each date and its total sales
    for (date in total_sales) {
        print "Date:", date, "Total sales:", total_sales[date]
    }
}

Run the awk file we wrote:
$ awk -f test.awk data.txt
Date: 2022-01-01 Total sales: 250
Date: 2022-01-02 Total sales: 120
Date: 2022-01-03 Total sales: 380

In this example, we use the array total_sales to group the data by date and calculate the total sales for each date. Running the script prints the total sales for each date.
This example demonstrates how to use arrays in AWK for data grouping and statistics, but AWK's capabilities go well beyond that. You can perform various statistical and data processing operations based on your specific needs.
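One way the grouping example could be extended is sketched below: a second array keeps a record count per date, so the END block can report an average as well as the total. The function name sales_summary and the output layout are illustrative assumptions. The input format is the same date and amount pair used in data.txt.

```shell
sales_summary() {
    awk '{
        total[$1] += $2     # running sum per date
        n[$1]++             # number of records per date
    }
    END {
        for (date in total)
            printf "%s total=%d count=%d avg=%.1f\n", date, total[date], n[date], total[date] / n[date]
    }' "$@" | sort
}

printf '2022-01-01 100\n2022-01-01 150\n2022-01-02 120\n2022-01-03 200\n2022-01-03 180\n' | sales_summary
```

Piping through sort gives a stable, date-ordered report regardless of the order in which for-in visits the keys.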
3. Sorting of arrays in awk
1. awk built-in function sorting
- When writing AWK scripts to process data, you often need to sort the data in an array in order to analyze and process it more easily.
- gawk provides two main built-in sorting functions, asort() and asorti(), which can be used to sort the data in an array. Both are gawk extensions and are not available in every awk implementation.
asort(array [, dest [, how] ]) function
- array: The array to be sorted, which can be an associative array or a numerically indexed array.
- dest (optional): An array used to store the sorted results. If the dest parameter is provided, the sorted results are stored in dest and the original array is not modified. If the dest parameter is not provided, the original array is sorted in place and its original indices are replaced by 1 through n.
- how (optional): A string that controls the ordering. It is either the name of a user-defined comparison function or one of gawk's predefined orderings, for example:
  - "@val_num_asc": Compare values as numbers, in ascending order.
  - "@val_num_desc": Compare values as numbers, in descending order.
- The asort() function returns the number of elements in the sorted array. It sorts the values in an array and is suitable for situations where sorting by value is required.
- Here's an example:
#!/usr/bin/awk -f
# Example: use the asort() function to sort an array in ascending order
BEGIN {
    data[1] = 5
    data[2] = 2
    data[3] = 8
    data[4] = 3
    data[5] = 1

    # Sort the values of data in ascending order and store the result in sorted_data
    count = asort(data, sorted_data)

    # Print the sorted array
    for (i = 1; i <= count; i++) {
        print "value:", sorted_data[i]
    }
}
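For a descending sort, gawk's third argument to asort() accepts a predefined ordering string such as "@val_num_desc". The sketch below assumes gawk 4.0 or later is installed under the name gawk and does nothing otherwise. The array name sorted is illustrative.

```shell
# Requires gawk; asort() and its third argument are gawk extensions.
if command -v gawk >/dev/null 2>&1; then
    desc_values=$(gawk 'BEGIN {
        data[1] = 5; data[2] = 2; data[3] = 8; data[4] = 3; data[5] = 1
        # "@val_num_desc" orders by value, compared numerically, largest first
        n = asort(data, sorted, "@val_num_desc")
        for (i = 1; i <= n; i++)
            printf "%s%s", sorted[i], (i < n ? " " : "\n")
    }')
    printf '%s\n' "$desc_values"
fi
```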
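asorti(array [, dest [, how] ]) function
- array: The array whose indices (keys) are to be sorted, which can be an associative array or a numerically indexed array.
- dest (optional): An array used to store the sorted results. If the dest parameter is provided, the sorted indices are stored in dest and the original array is not modified. If the dest parameter is not provided, the original array is overwritten: its values are replaced by the sorted indices and the original values are lost.
- how (optional): A string that controls the ordering. It is either the name of a user-defined comparison function or one of gawk's predefined orderings, for example:
  - "@ind_str_asc": Compare indices as strings, in ascending order.
  - "@ind_num_desc": Compare indices as numbers, in descending order.
- The asorti() function returns the number of elements in the sorted array. It sorts the keys of an array and is suitable for situations where sorting by key is required. By default the indices are compared as strings, so "10" sorts before "2"; pass "@ind_num_asc" as how to compare them numerically.
- Here's an example: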
#!/usr/bin/awk -f
# Example: use the asorti() function to sort an array in ascending order by key
BEGIN {
    data[2] = "banana"
    data[8] = "cherry"
    data[3] = "date"
    data[1] = "fig"

    # Sort the keys of data in ascending order and store the result in sorted_keys
    count = asorti(data, sorted_keys)

    # Print the sorted keys and their corresponding values
    for (i = 1; i <= count; i++) {
        key = sorted_keys[i]
        value = data[key]
        print "key:", key, "value:", value
    }
}
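Because asort() and asorti() exist only in gawk, a common fallback on other awk implementations is to let the external sort utility do the ordering. The sketch below assumes sort is available on the PATH and uses only POSIX awk features. The variable name cmd is illustrative.

```shell
sorted_values=$(awk 'BEGIN {
    data[1] = 5; data[2] = 2; data[3] = 8; data[4] = 3; data[5] = 1

    cmd = "sort -n"              # external command that does the sorting
    for (i in data)
        print data[i] | cmd      # every print feeds the same sort process
    close(cmd)                   # flush the pipe and wait for sort to finish
}')
printf '%s\n' "$sorted_values"
```

The sorted values go straight to standard output rather than back into an array, so this suits scripts that only need to print in order.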
2. awk custom sorting function
1. Awk’s bubble sorting method
#!/usr/bin/awk -f
# Bubble sort function: orders the names (keys) of a one-dimensional array by age (value), ascending.
# i, j, temp and n are extra parameters used as local variables; sorted_array is global.
function bubble_sort(arr, i, j, temp, n) {
    n = asorti(arr, sorted_array)
    for (i = 1; i < n; i++) {
        for (j = 1; j <= n - i; j++) {
            if (arr[sorted_array[j]] > arr[sorted_array[j + 1]]) {
                temp = sorted_array[j]
                sorted_array[j] = sorted_array[j + 1]
                sorted_array[j + 1] = temp
            }
        }
    }
    for (i = 1; i <= n; i++) {
        print "Name: " sorted_array[i] ", Age: " arr[sorted_array[i]]
    }
}

# Initialize a one-dimensional array: the key is the name, the value is the age
BEGIN {
    my_array["Alice"] = 25
    my_array["Bob"] = 30
    my_array["Charlie"] = 22
    my_array["wf"] = 44
    my_array["lcx"] = 18

    # Call the bubble sort function
    bubble_sort(my_array)
}

Execute the awk file:
$ awk -f 1.awk
Name: lcx, Age: 18
Name: Charlie, Age: 22
Name: Alice, Age: 25
Name: Bob, Age: 30
Name: wf, Age: 44
2. awk's Shell sort method

#!/usr/bin/awk -f
# Shell sort function: orders the names (keys) of a one-dimensional array by age (value), ascending
function shell_sort(arr, i, j, temp, n, gap, current_name, current_age) {
    n = asorti(arr, sorted_array)
    for (gap = int(n / 2); gap > 0; gap = int(gap / 2)) {
        for (i = gap + 1; i <= n; i++) {
            current_name = sorted_array[i]
            current_age = arr[current_name]
            j = i
            while (j > gap && arr[sorted_array[j - gap]] > current_age) {
                sorted_array[j] = sorted_array[j - gap]
                j = j - gap
            }
            sorted_array[j] = current_name
        }
    }
    for (i = 1; i <= n; i++) {
        print "Name: " sorted_array[i] ", Age: " arr[sorted_array[i]]
    }
}

# Initialize a one-dimensional array: the key is the name, the value is the age
BEGIN {
    my_array["Alice"] = 25
    my_array["Bob"] = 30
    my_array["Charlie"] = 22
    my_array["wf"] = 44
    my_array["lcx"] = 18

    # Call the Shell sort function
    shell_sort(my_array)
}

Execute the awk file:
$ awk -f 1.awk
Name: lcx, Age: 18
Name: Charlie, Age: 22
Name: Alice, Age: 25
Name: Bob, Age: 30
Name: wf, Age: 44
3. Quick sort with awk

#!/usr/bin/awk -f
# Quick sort function: orders sorted_array[left..right] (names) by age in arr, ascending.
# The names after right are extra parameters used as local variables, so that
# recursive calls do not overwrite each other's pivot_index.
function quick_sort(arr, left, right,    pivot_name, pivot_age, pivot_index, current_name, current_age, i, j, temp) {
    if (left < right) {
        pivot_name = sorted_array[right]
        pivot_age = arr[pivot_name]
        i = left - 1
        for (j = left; j < right; j++) {
            current_name = sorted_array[j]
            current_age = arr[current_name]
            if (current_age <= pivot_age) {
                i++
                temp = sorted_array[i]
                sorted_array[i] = sorted_array[j]
                sorted_array[j] = temp
            }
        }
        temp = sorted_array[i + 1]
        sorted_array[i + 1] = sorted_array[right]
        sorted_array[right] = temp
        pivot_index = i + 1
        quick_sort(arr, left, pivot_index - 1)
        quick_sort(arr, pivot_index + 1, right)
    }
}

# Initialize a one-dimensional array: the key is the name, the value is the age
BEGIN {
    my_array["Alice"] = 25
    my_array["Bob"] = 30
    my_array["Charlie"] = 22
    my_array["wf"] = 44
    my_array["lcx"] = 18

    # Use the asorti function to collect the names, sorted by name
    n = asorti(my_array, sorted_array)

    # Call the quick sort function
    quick_sort(my_array, 1, n)

    # Print the array sorted by age in ascending order
    for (i = 1; i <= n; i++) {
        name = sorted_array[i]
        age = my_array[name]
        print "Name: " name ", Age: " age
    }
}

Execute the awk file:
$ awk -f 1.awk
Name: lcx, Age: 18
Name: Charlie, Age: 22
Name: Alice, Age: 25
Name: Bob, Age: 30
Name: wf, Age: 44
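The three scripts above call asorti() to build the initial list of names, which ties them to gawk. A possible portable variant is sketched below: it collects the keys with a plain for-in loop and orders them with an insertion sort, a different algorithm from the ones shown above, chosen here only for brevity. The names sort_by_age and sort_keys_by_value are hypothetical, and the input is assumed to be one "name age" pair per line.

```shell
sort_by_age() {
    awk '
    # Insertion sort of keys[1..n] by the values stored in arr.
    # i, j and k are extra parameters, used as local variables.
    function sort_keys_by_value(arr, keys, n,    i, j, k) {
        for (i = 2; i <= n; i++) {
            k = keys[i]
            for (j = i - 1; j >= 1 && arr[keys[j]] > arr[k]; j--)
                keys[j + 1] = keys[j]
            keys[j + 1] = k
        }
    }
    { age[$1] = $2 }
    END {
        n = 0
        for (name in age)          # collect the keys; order here is unspecified
            names[++n] = name
        sort_keys_by_value(age, names, n)
        for (i = 1; i <= n; i++)
            print "Name: " names[i] ", Age: " age[names[i]]
    }' "$@"
}

printf 'Alice 25\nBob 30\nCharlie 22\nwf 44\nlcx 18\n' | sort_by_age
```

Passing the key list and its length as parameters, instead of relying on a global sorted_array, keeps the function reusable for more than one array in the same script.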
Summary
The next section covers performance optimization of awk arrays and related topics.