Computing the union, intersection, and difference of two files in the Linux shell

Suppose we have two files, a.txt and b.txt.

The content in a.txt is as follows:

a
c
1
3
d
4

The content in b.txt is as follows:

a
b
e
2
1
5

#Example 01

Compute the union:

[root@VM_81_181_centos ~]# sort -u a.txt b.txt
1
2
3
4
5
a
b
c
d
e
[root@VM_81_181_centos ~]#

#Example 02

Compute the intersection:

[root@VM_81_181_centos ~]# grep -F -f a.txt b.txt | sort | uniq
1
a
[root@VM_81_181_centos ~]#

#Example 03

Compute the difference set (a – b):

[root@VM_81_181_centos ~]# grep -F -v -f b.txt a.txt | sort | uniq
3
4
c
d
[root@VM_81_181_centos ~]#

#Example 04

Calculate the difference set (b – a):

[root@VM_81_181_centos ~]# grep -F -v -f a.txt b.txt | sort | uniq
2
5
b
e
[root@VM_81_181_centos ~]#

---------- Manual dividing line ----------

2018/09/30 Update

The sections above use the grep command to compute the intersection and difference of two files, but in practice the results can be wrong.

[root@VM_81_181_centos ~]# grep -F -f a.txt b.txt | sort | uniq | wc -l
4095
[root@VM_81_181_centos ~]# grep -F -f b.txt a.txt | sort | uniq | wc -l
4729
[root@VM_81_181_centos ~]#

I used the command above to find the intersection of two files a and b, but when I swapped the order of the two files, the result was different.

That cannot be right: the intersection of two sets does not depend on their order.

After thinking about it carefully, the reason is that grep is a search command: it looks for matching text inside lines rather than comparing whole lines. For example:

The c.txt file contains the line 1122.

The contents of the d.txt file are as follows:

11223344

Execute grep command:

[root@VM_81_181_centos ~]# grep -F -f c.txt d.txt | sort | uniq
11223344
[root@VM_81_181_centos ~]# grep -F -f d.txt c.txt | sort | uniq
[root@VM_81_181_centos ~]#

Based on the results, the first command can be interpreted as follows:

It searches d.txt for lines that contain any line of c.txt as a substring. The line 11223344 in d.txt contains the string 1122 from c.txt, so grep prints 11223344 as if it were common to both files.

Second command:

It searches c.txt for lines that contain any line of d.txt as a substring. No line in c.txt contains 11223344, so the result is empty.

Now, add the line 1122334455 to the c.txt file. The operation and result are as follows:

c.txt file content:

Execute grep command:

[root@VM_81_181_centos ~]# grep -F -f d.txt c.txt | sort | uniq
1122334455
[root@VM_81_181_centos ~]#

In conclusion:

grep -F -f fileA fileB | sort | uniq

With fileA first, the command searches fileB for lines that contain any line of fileA as a substring, and prints those lines of fileB.

In the same way, with fileB first, it searches fileA and prints the matching lines of fileA.
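If you still want to use grep, the substring problem described above can be avoided with the -x option, which only accepts matches that cover the whole line. The following is a minimal sketch, not a replacement for the method chosen below; the scratch directory and the sample contents are made up for illustration.

```shell
#!/bin/sh
# Scratch files for the demonstration (hypothetical contents).
workdir=$(mktemp -d)
printf '%s\n' 1122 1133 > "$workdir/c.txt"
printf '%s\n' 11223344 1122 > "$workdir/d.txt"

# Substring matching: 11223344 is reported because it contains 1122.
substring_hits=$(grep -F -f "$workdir/c.txt" "$workdir/d.txt" | sort -u)

# Whole-line matching (-x): only lines identical in both files are reported,
# so the result no longer depends on the order of the two files.
c_in_d=$(grep -F -x -f "$workdir/c.txt" "$workdir/d.txt" | sort -u)
d_in_c=$(grep -F -x -f "$workdir/d.txt" "$workdir/c.txt" | sort -u)

printf 'substring match:\n%s\n' "$substring_hits"
printf 'whole-line match (c in d):\n%s\n' "$c_in_d"
printf 'whole-line match (d in c):\n%s\n' "$d_in_c"

rm -rf "$workdir"
```

With -x both orders give the same answer, because a pattern now has to equal the whole line instead of appearing somewhere inside it.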

However, this is not the result we want. What we want is the intersection as we learned it in mathematics: the result is the same whichever set comes first, and it contains exactly the elements common to both sets. I tried several methods and finally chose to use the cat command.

The command format is as follows:

cat fileA fileB | sort | uniq -d # Find intersection
cat fileA fileB | sort | uniq -u # Find the symmetric difference (lines that appear in only one file)

This command is easier to understand. The cat command concatenates the two files, and sort puts identical lines next to each other. uniq -d then prints only the lines that occur more than once, and uniq -u prints only the lines that occur exactly once. The result is the same whether fileA or fileB comes first.

Note that this assumes neither file contains duplicate lines of its own; a line repeated inside one file would be counted as if it were shared.
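A line that is repeated inside a single file would confuse uniq -d and uniq -u. One way to guard against that is to de-duplicate each file before concatenating. The sketch below wraps this in two helper functions; the function names and the sample data are my own choices for illustration, not part of any standard tool.

```shell
#!/bin/sh
export LC_ALL=C   # fixed collation so the sort order is predictable

# Print each file with its own duplicate lines removed, one after the other.
dedup_concat() {
    sort -u "$1"
    sort -u "$2"
}

# Lines present in both files.
intersection() {
    dedup_concat "$1" "$2" | sort | uniq -d
}

# Lines present in exactly one of the two files.
symmetric_difference() {
    dedup_concat "$1" "$2" | sort | uniq -u
}

workdir=$(mktemp -d)
printf '%s\n' a c 1 3 d 4 a > "$workdir/a.txt"   # "a" appears twice on purpose
printf '%s\n' a b e 2 1 5 > "$workdir/b.txt"

both=$(intersection "$workdir/a.txt" "$workdir/b.txt")
only_one=$(symmetric_difference "$workdir/a.txt" "$workdir/b.txt")

printf 'intersection:\n%s\n' "$both"
printf 'symmetric difference:\n%s\n' "$only_one"

rm -rf "$workdir"
```

Because both helpers sort the combined stream again, swapping the two arguments does not change the output.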

An example:

[root@VM_81_181_centos ~]# cat c.txt
1122
1133
1144
1155
1122334455
[root@VM_81_181_centos ~]# cat d.txt
11223344
1122
[root@VM_81_181_centos ~]#

The contents of c.txt and d.txt are shown above.

Execute the cat command to find the intersection:

[root@VM_81_181_centos ~]# cat c.txt d.txt | sort | uniq -d
1122
[root@VM_81_181_centos ~]# cat d.txt c.txt | sort | uniq -d
1122
[root@VM_81_181_centos ~]#

Execute the cat command to find the symmetric difference (lines that appear in only one file):

[root@VM_81_181_centos ~]# cat c.txt d.txt | sort | uniq -u
11223344
1122334455
1133
1144
1155
[root@VM_81_181_centos ~]# cat d.txt c.txt | sort | uniq -u
11223344
1122334455
1133
1144
1155
[root@VM_81_181_centos ~]#
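The uniq -u output above mixes lines that exist only in c.txt with lines that exist only in d.txt. If you need a one-sided difference (c minus d), a common trick is to feed the second file twice, so that none of its lines can be unique. This is a sketch under the assumption that whole lines are compared; the function name is hypothetical.

```shell
#!/bin/sh
export LC_ALL=C   # fixed collation so the sort order is predictable

# Print the lines of $1 that do not appear in $2.
# $1 is de-duplicated first; $2 is read twice, so every one of its lines
# occurs at least twice and is dropped by uniq -u.
difference() {
    { sort -u "$1"; cat "$2" "$2"; } | sort | uniq -u
}

workdir=$(mktemp -d)
printf '%s\n' 1122 1133 1144 1155 1122334455 > "$workdir/c.txt"
printf '%s\n' 11223344 1122 > "$workdir/d.txt"

c_minus_d=$(difference "$workdir/c.txt" "$workdir/d.txt")
d_minus_c=$(difference "$workdir/d.txt" "$workdir/c.txt")

printf 'c - d:\n%s\n' "$c_minus_d"
printf 'd - c:\n%s\n' "$d_minus_c"

rm -rf "$workdir"
```

Unlike the intersection, this operation is not symmetric, so here the order of the two arguments does matter.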

But the cat approach also has a shortcoming: when the files are relatively large, an error can occur. In that case we can use the split command to split the files, process the pieces separately (divide and conquer), and then merge the results.

For how to use the split command, you can refer to this article of mine.

Link: https://www.cnblogs.com/leeyongbard/p/9594439.html

---------- 2019/04/27 ----------

paste command

Merge files by columns

The paste format is:

paste [-d delimiter] [-s] [-] file1 file2

The options have the following meanings:

-d specifies a delimiter other than the default tab. For example, to use @ as the delimiter, use -d @

-s merges each file into a single line, instead of pasting the files line by line

- reads from standard input. For example: ls -l | paste - displays the output in a single column
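To make the - option more concrete, here is a small sketch. Each - takes the next line from standard input in turn, so writing it twice folds a single stream into two columns. The input lines are sample data fed in with printf.

```shell
#!/bin/sh
# Two "-" operands: line 1 goes to column 1, line 2 to column 2, and so on.
# -d@ replaces the default tab delimiter with "@".
two_cols=$(printf '%s\n' ID897 P.Jones ID666 S.Round | paste -d@ - -)

printf '%s\n' "$two_cols"
```

With a single - the stream would be printed unchanged, one line per row.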

Example:

# cat pas1
ID897
ID666
ID982
# cat pas2
P.Jones
S.Round
L.Clip

Use the paste command to paste the two files pas1 and pas2 side by side as two columns:

# paste pas1 pas2
ID897 P.Jones
ID666 S.Round
ID982 L.Clip

You can choose which column comes first by swapping the file names:

# paste pas2 pas1
P.Jones ID897
S.Round ID666
L.Clip ID982

To use a separator other than the default tab, use the -d option. For example, with a colon as the separator:

# paste -d: pas2 pas1
P.Jones:ID897
S.Round:ID666
L.Clip:ID982

To turn each file into one row instead of one column, use the -s option, as in the following example:

# paste -s pas1 pas2
ID897 ID666 ID982
P.Jones S.Round L.Clip
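The -s and -d options can also be combined, which is a convenient way to join all lines of a file into one delimited line. A minimal sketch, using a scratch copy of pas1 created just for the demonstration:

```shell
#!/bin/sh
workdir=$(mktemp -d)
printf '%s\n' ID897 ID666 ID982 > "$workdir/pas1"

# -s alone joins the lines of the file with tabs.
tabbed=$(paste -s "$workdir/pas1")

# -s with -d, joins them with commas instead.
joined=$(paste -s -d, "$workdir/pas1")

printf '%s\n' "$tabbed"
printf '%s\n' "$joined"

rm -rf "$workdir"
```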

If you have different opinions, please share your opinions ^_^
