Linux: The Three Swordsmen of Text Processing

2022-12-07

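awk, grep, and sed are the three main tools for working with text on Linux, often called the "three swordsmen" of text processing.

Each has its own focus:

awk: the most powerful and also the most complex. Best suited to formatting text and to complex, structured text processing
grep: best suited to searching and matching
sed: best suited to editing the text it matches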

grep

grep was already covered in the Linux pipes post, so it is skipped here.

sed

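sed is a stream editor that processes its input one line at a time. Each line is read into a temporary buffer called the "pattern space", the sed commands are applied to that buffer, and the result is written to standard output. sed then reads the next line and starts the next cycle. Unless a special command such as "D" is used, the pattern space is cleared between cycles, but the hold space is not. This repeats until the end of the input. The input file itself is not modified unless you redirect the output to a file or use -i.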

Options

sed --help
Usage: sed [OPTION]... {script-only-if-no-other-script} [input-file]...

  -n, --quiet, --silent # suppress automatic output; only lines printed explicitly (e.g. with p) are shown
                 suppress automatic printing of pattern space
  -e script, --expression=script
                 add the script to the commands to be executed
  -f script-file, --file=script-file  # read the script from a file; takes a file name
                 add the contents of script-file to the commands to be executed
  --follow-symlinks
                 follow symlinks when processing in place
  -i[SUFFIX], --in-place[=SUFFIX]  # write the result directly back to the file
                 edit files in place (makes backup if SUFFIX supplied)
  -c, --copy
                 use copy instead of rename when shuffling files in -i mode
  -b, --binary
                 does nothing; for compatibility with WIN32/CYGWIN/MSDOS/EMX (
                 open files in binary mode (CR+LFs are not treated specially))
  -l N, --line-length=N
                 specify the desired line-wrap length for the `l' command
  --posix
                 disable all GNU extensions.
  -r, --regexp-extended  # enable extended regular expressions
                 use extended regular expressions in the script.
  -s, --separate
                 consider files as separate rather than as a single continuous
                 long stream.
  -u, --unbuffered
                 load minimal amounts of data from the input files and flush
                 the output buffers more often
  -z, --null-data
                 separate lines by NUL characters
  --help
                 display this help and exit
  --version
                 output version information and exit

If no -e, --expression, -f, or --file option is given, then the first
non-option argument is taken as the sed script to interpret.  All
remaining arguments are names of input files; if no input files are
specified, then the standard input is read.
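
As a rough illustration of the -n, -e, and -r options listed above, the following sketch runs GNU sed on a two-line sample. The sample input and the variable names are made up for this example, and -r is assumed to be available as shown in the help output.

```shell
# Sample input used only for this illustration.
input=$(printf 'abcdefg\n123456\n')

# -n suppresses automatic printing; the p flag prints only lines where s succeeded.
only_changed=$(printf '%s\n' "$input" | sed -n 's/cde/111/p')

# -e can be repeated; the expressions run in order on each line.
two_steps=$(printf '%s\n' "$input" | sed -e 's/abc/X/' -e 's/123/Y/')

# -r enables extended regular expressions, so + and () need no backslashes.
extended=$(printf '%s\n' "$input" | sed -r 's/([0-9]+)/<\1>/')

printf '%s\n' "$only_changed" "$two_steps" "$extended"
```

Without -n, the first command would also print the unchanged second line.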

sed commands

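a\ Append text below the current line.
i\ Insert text above the current line.
c\ Replace the selected lines with new text.
d Delete the selected lines.
D Delete the first line of the pattern space (up to the first newline).
s Substitute the matched text.
h Copy the pattern space to the hold space.
H Append the pattern space to the hold space.
g Copy the hold space to the pattern space, replacing its current content.
G Append the hold space to the end of the pattern space.
l Print the pattern space with non-printable characters shown in a visible form.
n Read the next input line and continue with the next command instead of restarting from the first command.
N Append the next input line to the pattern space with a newline between the two, and advance the current line number.
p Print the pattern space.
P (uppercase) Print the first line of the pattern space.
q Quit sed.
b label Branch to the given label in the script; if the label is omitted, branch to the end of the script.
r file Read lines from file and output them after the current line.
t label Conditional branch: if a substitution has succeeded since the last input line was read or since the last t or T command, branch to the label, or to the end of the script if the label is omitted.
T label Inverse of t: branch to the label (or to the end of the script if the label is omitted) only if no substitution has succeeded since the last input line was read or since the last t or T command.
w file Write the pattern space to file (the file is truncated when sed starts; each write during the run is then appended).
W file Write the first line of the pattern space to file.
! Apply the following command to all lines that are not selected.
= Print the current line number.
# Start a comment that extends to the next newline.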

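sed substitution flags

g Replace every match on the line, not only the first.
p Print the line if a substitution was made.
w Write the line to a file if a substitution was made.
x Exchange the pattern space and the hold space (a standalone command rather than a flag of s).
y Transliterate one character into another (a standalone command; it does not use regular expressions).
\1 Back-reference to a captured substring.
& Reference to the whole matched string.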

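sed metacharacters

^ Matches the start of a line, e.g. /^sed/ matches every line that starts with sed.
$ Matches the end of a line, e.g. /sed$/ matches every line that ends with sed.
. Matches any single character except a newline, e.g. /s.d/ matches s, then any one character, then d.
* Matches zero or more of the preceding character, e.g. / *sed/ matches lines where sed is preceded by zero or more spaces.
[] Matches one character from the given set, e.g. /[sS]ed/ matches sed and Sed.
[^] Matches one character not in the given set, e.g. /[^A-RT-Z]ed/ matches any character other than A-R and T-Z followed by ed.
\(..\) Captures a substring for later reference, e.g. s/\(love\)able/\1rs/ turns loveable into lovers.
& Stands for the matched text in the replacement, e.g. s/love/**&**/ turns love into **love**.
\< Matches the start of a word, e.g. /\<love/ matches lines containing a word that starts with love.
\> Matches the end of a word, e.g. /love\>/ matches lines containing a word that ends with love.
x\{m\} Matches x repeated exactly m times, e.g. /0\{5\}/ matches lines containing five consecutive 0s.
x\{m,\} Matches x repeated at least m times, e.g. /0\{5,\}/ matches lines containing at least five consecutive 0s.
x\{m,n\} Matches x repeated at least m and at most n times, e.g. /0\{5,10\}/ matches lines containing 5 to 10 consecutive 0s.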
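
A few of these metacharacters can be tried out directly. The sketch below assumes GNU sed (word-boundary anchors are a GNU extension), and the sample strings are invented for the example.

```shell
# Word boundary: \<love matches "lovely" but not the "love" inside "glove".
words=$(printf 'lovely day\nglove box\n' | sed -n '/\<love/p')

# Interval: 0\{3\} needs three consecutive zeros.
zeros=$(printf '1000\n100\n' | sed -n '/0\{3\}/p')

# Group and back-reference: \1 holds the text captured by \(love\).
lovers=$(echo 'loveable' | sed 's/\(love\)able/\1rs/')

printf '%s\n' "$words" "$zeros" "$lovers"
```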

Common usage

1. Substitution (s)

# 1. sed 's/xxx/yyy/' prints the result of the substitution
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ cat 1.txt
abcdefg
123456
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ sed 's/cde/111/' 1.txt # prints the result of the substitution
ab111fg
123456
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ cat 1.txt  # the original file is unchanged
abcdefg
123456
# 2. sed -n 's/xxx/yyy/p' prints only the lines where a substitution occurred
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ sed -n 's/cde/111/p' 1.txt # prints only the lines where a substitution occurred
ab111fg
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ cat 1.txt
abcdefg
123456
# 3. sed -i 's/xxx/yyy/g' edits the file in place, replacing every match on every matching line
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ cat 1.txt
abcdefg
123456
abcdefg
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ sed -i 's/cde/111/g' 1.txt # edit the file in place, replacing on every matching line
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ cat 1.txt
ab111fg
123456
ab111fg
# 4. sed 's/xxx/yyy/' without the g flag replaces only the first match on each line
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ cat 2.txt
abcabc
abc
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ sed -i 's/abc/123/' 2.txt
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ cat 2.txt # only the first match on each line was replaced
123abc
123
# 5. sed 's/xxx/yyy/g' with the g flag replaces every match on each line
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ cat 2.txt
abcabc
abc
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ sed 's/abc/123/g' 2.txt # compare with example 4
123123
123
# 6. sed 's/xxx/yyy/Ng' replaces from the Nth match of xxx onward (GNU sed)
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ echo "123123123" | sed 's/123/456/2g'
123456456

2. Usage conventions

# 1. Delimiters
# The examples above use / as the delimiter, but any other character can be used instead, for example:
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ echo "abab" | sed 's:ab:12:g'
1212
# Note that if the delimiter appears inside the pattern or the replacement, it must be escaped
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ echo "1/2/3/" | sed 's/\/2/\/5/'
1/5/3/
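# 2. Chaining commands (the three forms below are equivalent ways to run several commands; the third example uses a different second expression, so its output differs)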
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ echo "abc" | sed 's/a/1/' | sed 's/b/2/'
12c
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ echo "abc" | sed 's/a/1/; s/b/2/'
12c
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ echo "abc" | sed -e 's/a/1/' -e 's/1/2/'
2bc
# 3. Using shell variables in sed (note: the script must be wrapped in double quotes)
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ cat 1.sh
var=good
echo "good morning" | sed "s/$var/bad/"
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ sh 1.sh
bad morning
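
The delimiter and quoting rules above combine naturally when a variable holds a path. The following is a small sketch under that assumption; `new_dir` and the sample strings are hypothetical values chosen for illustration.

```shell
# Hypothetical value that contains the usual / delimiter.
new_dir=/usr/local/bin

# Double quotes let the shell expand $new_dir; using | as the delimiter
# avoids having to escape every / in the pattern and the replacement.
expanded=$(echo 'PATH=/usr/bin' | sed "s|/usr/bin|$new_dir|")

# Single quotes pass $var to sed literally, so nothing matches here.
var=good
literal=$(echo 'good morning' | sed 's/$var/bad/')

printf '%s\n' "$expanded" "$literal"
```

If the variable could contain the chosen delimiter or characters such as &, it would need escaping first; that case is not handled in this sketch.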

3. Deletion (d)

# ^ matches the start of a line and $ the end; used as an address, $ means the last line
# 1. sed '/^$/d' deletes blank lines
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ cat 3.txt
1
2
3
4

5
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ sed '/^$/d' 3.txt
1
2
3
4
5
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ cat 3.txt # the original file is unchanged
1
2
3
4

5
# 2. sed '2d' deletes line 2
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ sed '2d' 3.txt
1
3
4

5
# 3. sed '2,$d' deletes from line 2 through the last line
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ sed '2,$d' 3.txt
1
# 4. sed '$d' deletes the last line
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ sed '$d' 3.txt
1
2
3
4

# 5. sed '/^ab/d' deletes lines that start with ab
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ cat 4.txt
ab ccc
ab ddd
a eee
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ sed '/^ab/d' 4.txt
a eee

4. Match references (&, \N)

# 1. & refers to the matched string
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ echo "abab" | sed 's/ab/[&]/g'
[ab][ab]
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ echo "this is good" | sed 's/\w\+/*&*/g'
*this* *is* *good*
# 2. \N refers to the Nth captured substring
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ echo "aaa bbb" | sed 's/\([a-z]\+\) \([a-z]\+\)/\2 \1/'
bbb aaa
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ echo "aaa 123 bbb" | sed 's/\([a-z]\+\) \([0-9]\+\) \([a-z]\+\)/\2 \1/'
123 aaa
# In the previous example, \1=aaa, \2=123, \3=bbb
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ sed 's/\(bad\)/\1?/' 4.txt 
i am bad?
i am bad?
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ cat 4.txt
i am bad
i am bad

5. Selecting a range of lines (,)

# 1. sed -n '/bad/,/good/p' prints the lines from the first match of bad through the next match of good
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ cat 4.txt
i am bad
i am bad
i am good
i am nobody
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ sed -n '/bad/,/good/p' 4.txt
i am bad
i am bad
i am good
# 2. sed -n '2,/good$/p' prints from line 2 through the next line that ends with good
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ vi 4.txt
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ cat 4.txt
i am bad
i am bad
i am good
i am nobody
i am good
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ sed -n '2,/good$/p' 4.txt
i am bad
i am good
# 3. sed '/bad/,/nobody/s/$/.../' appends ... to the end of every line from the match of bad through the match of nobody
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ sed '/bad/,/nobody/s/$/.../' 4.txt
i am bad...
i am bad...
i am good...
i am nobody...
i am good

6. Read, write, append, and related commands

# 1. Read (r): sed '/good/r 2.txt' 1.txt prints 1.txt and inserts the content of 2.txt below each line matching good
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ cat 2.txt
abcabc
abc
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ cat 1.txt
i am good
test1
test2
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ sed '/good/r 2.txt' 1.txt
i am good
abcabc
abc
test1
test2
# 2. Write (w): sed '/good/w 2.txt' 1.txt writes the lines of 1.txt that match good into 2.txt (the file really is overwritten)
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ cat 1.txt
i am good
test1
test2
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ cat 2.txt
abcabc
abc
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ sed '/good/w 2.txt' 1.txt
i am good
test1
test2
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ cat 2.txt # the file has actually been changed
i am good
# 3. Append below a line (a)
# sed '/am/a\haha' 2.txt inserts haha below each line matching am
# sed -i '2a\oh haha' 2.txt inserts oh haha below line 2
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ cat 2.txt
i am good
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ sed '/am/a\haha' 2.txt
i am good
haha
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ cat 2.txt
oh no
i am good
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ sed -i '2a\oh haha' 2.txt
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ cat 2.txt
oh no
i am good
oh haha
# 4. Insert above a line (i): like a, except that the text goes above the line
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ cat 2.txt
i am good
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ sed -i '1i\oh no' 2.txt
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ cat 2.txt
oh no
i am good
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ sed '/am/i\oh yes' 2.txt
oh no
oh yes
i am good

# 5. Process the next line (n): sed '/test/{ n; s/aa/bb/;}' applies s/aa/bb/ to the line after each line matching test (then continues)
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ cat 1.txt
i am good
test1
1aa
test2
1aa
1aa
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ sed '/test/{ n; s/aa/bb/;}' 1.txt
i am good
test1
1bb
test2
1bb
1aa # the line above it does not match test, so this line is not changed to 1bb
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ cat 1.txt
i am good
test1aa
1aa
test2
1aa
1aa
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ sed '/test/{ n; s/aa/bb/;}' 1.txt
i am good
test1aa # the aa on the matching test line itself is not replaced with bb
1bb
test2
1bb
1aa
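# 6. Transliterate characters (y): sed '1,2y/abc/ABC/' maps a, b, c to A, B, C on lines 1-2
# Differences between y and s:
# 1. y maps individual characters wherever they occur on the line, while s replaces text that matches a pattern;
# 2. s replaces the match as a whole, whereas y replaces each character with its corresponding character (see the examples below)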
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ cat 2.txt
abc
1abc1
abc
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ sed '1,2y/abc/ABC/' 2.txt
ABC
1ABC1
abc
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ sed '1,2s/and/123/' 2.txt
abc
1abc1
abc
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ sed '1,2y/abc/12/' 2.txt # y requires source and target sets of equal length and reports an error otherwise; s has no such restriction
sed: -e expression #1, char 12: strings for `y' command are different lengths
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ sed '1,2s/abc/12/' 2.txt
12
1121
abc

7. Printing odd/even lines

# See the commands in the examples below
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ cat 1.txt
1
2
3
4
5
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ sed -n 'p;n' 1.txt
1
3
5
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ sed -n 'n;p' 1.txt
2
4
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ sed -n '1~2p' 1.txt
1
3
5
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ sed -n '2~2p' 1.txt
2
4

8. Printing the line after a match

[cindy@iZbp15qc4wmx335c268l5mZ sed]$ cat 1.txt
1
2
3
4
5
[cindy@iZbp15qc4wmx335c268l5mZ sed]$ sed -n '/1/{n;p}' 1.txt # print the line after the line matching 1
2

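9. Advanced sed

sed can edit text line by line because it works with two buffers: the active "pattern space" and an auxiliary "hold space".

Pattern space: think of it as an assembly line on which the data is processed directly. Hold space: think of it as a warehouse, a temporary storage area used while processing.

Normally sed reads a line into the pattern space and applies the commands in the script one after another until the script ends. The line is then printed and the pattern space is cleared; the same steps repeat with the next line until the whole file has been processed.

Most tasks need only the pattern space and follow the logic above. The hold space is not used unless you explicitly invoke one of the advanced commands, but using it can sometimes produce surprisingly useful results.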

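+ g: [address[,address]]g copies the hold space into the pattern space, replacing the previous content of the pattern space.
+ G: [address[,address]]G appends a newline and the content of the hold space to the pattern space.
+ h: [address[,address]]h copies the pattern space into the hold space, replacing the previous content of the hold space.
+ H: [address[,address]]H appends a newline and the content of the pattern space to the hold space.
+ d: [address[,address]]d deletes the whole pattern space and starts the next cycle with the next input line.
+ D: [address[,address]]D deletes the first line of a multi-line pattern space and restarts the cycle without reading a new line.
+ x: exchanges the content of the hold space and the pattern space.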
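
Two well-known one-liners show how these commands work together. This is only a sketch on invented sample input; the variable names are not part of the original notes.

```shell
# Reverse the order of lines (similar to tac):
#   1!G  on every line except the first, append the hold space to the pattern space
#   h    copy the accumulated pattern space back into the hold space
#   $p   on the last line, print the result
reversed=$(printf '1\n2\n3\n' | sed -n '1!G;h;$p')

# Join all lines with commas:
#   H    append each line to the hold space (this leaves a leading newline)
#   on the last line: x swaps the buffers, the s commands turn newlines
#   into commas and drop the leading comma, and p prints the result
joined=$(printf 'a\nb\nc\n' | sed -n 'H;${x;s/\n/,/g;s/^,//;p;}')

printf '%s\n' "$reversed" "$joined"
```

Both commands rely on the hold space surviving from one cycle to the next while the pattern space is refilled with each new line.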

More advanced usage will be covered later.

awk

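awk takes its name from the initials of its creators' surnames: Alfred Aho, Peter Weinberger, and Brian Kernighan. AWK is in fact a language in its own right, the AWK programming language, which its three creators formally defined as a "pattern scanning and processing language". It lets you write short programs that read input files, sort data, process data, perform calculations on the input, generate reports, and much more.

awk is well suited to text processing and report generation. Its syntax feels familiar because it borrows from other languages such as C. It plays an important role in day-to-day work on Linux systems, and of the three tools it is the most powerful.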

https://www.cnblogs.com/ginvip/p/6352157.html

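sed is oriented toward string processing, while awk supports a wider range of data types; for example, awk can do integer arithmetic, which sed cannot. In sed, everything is a string.

awk is not just a command but a programming language. On most Linux systems the awk command is GNU awk (gawk).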

[cindy@iZbp15qc4wmx335c268l5mZ awk]$ which awk
/usr/bin/awk
[cindy@iZbp15qc4wmx335c268l5mZ awk]$ ls -l /usr/bin/awk
lrwxrwxrwx. 1 root root 4 Sep 14  2020 /usr/bin/awk -> gawk # most Linux distributions default to gawk, the free, open-source GNU implementation of AWK

[cindy@iZbp15qc4wmx335c268l5mZ ~]$ awk -h
awk: option requires an argument -- h
Usage: awk [POSIX or GNU style options] -f progfile [--] file ...
Usage: awk [POSIX or GNU style options] [--] 'program' file ...
POSIX options:          GNU long options: (standard)
        -f progfile             --file=progfile
        -F fs                   --field-separator=fs  # set the field separator
        -v var=val              --assign=var=val  # assign a variable
Short options:          GNU long options: (extensions)
        -b                      --characters-as-bytes
        -c                      --traditional
        -C                      --copyright
        -d[file]                --dump-variables[=file]
        -e 'program-text'       --source='program-text'
        -E file                 --exec=file
        -g                      --gen-pot
        -h                      --help
        -L [fatal]              --lint[=fatal]
        -n                      --non-decimal-data
        -N                      --use-lc-numeric
        -O                      --optimize
        -p[file]                --profile[=file]
        -P                      --posix
        -r                      --re-interval
        -S                      --sandbox
        -t                      --lint-old
        -V                      --version

To report bugs, see node `Bugs' in `gawk.info', which is
section `Reporting Problems and Bugs' in the printed version.

gawk is a pattern scanning and processing language.
By default it reads standard input and writes standard output.

Examples:
        gawk '{ sum += $1 }; END { print sum }' file
        gawk -F: '{ print $1 }' /etc/passwd

Basic usage:

awk 'pattern { action }' filenames
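
To make the pattern and action parts concrete, here is a minimal sketch. The two-line sample data and the variable names are assumptions made for the example.

```shell
# Sample data used only for this illustration.
data=$(printf '1 2 3\n4 5 6\n')

# Comparison pattern: the action runs only for records whose first field is > 2.
third_of_big=$(printf '%s\n' "$data" | awk '$1 > 2 { print $3 }')

# Regular-expression pattern: the action runs only for records starting with 1.
second_of_one=$(printf '%s\n' "$data" | awk '/^1/ { print $2 }')

# No pattern: the action runs for every record.
first_fields=$(printf '%s\n' "$data" | awk '{ print $1 }')

printf '%s\n' "$third_of_big" "$second_of_one" "$first_fields"
```

When the action is omitted instead, awk prints every record that matches the pattern.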

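awk built-in variables

# NR: number of records, i.e. the current line number
# RS: record separator (default: \n)
# FS: field separator (default: whitespace)
# NF: number of fields in the current record
# $0: the current record
# $1~$n: fields 1 through n of the current record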
[cindy@iZbp15qc4wmx335c268l5mZ awk]$ cat text1
1 2 3
4 5 6
[cindy@iZbp15qc4wmx335c268l5mZ awk]$ awk '{print NR}' text1
1
2
[cindy@iZbp15qc4wmx335c268l5mZ awk]$ awk '{print RS}' text1




[cindy@iZbp15qc4wmx335c268l5mZ awk]$ awk '{print FS}' text1
 
 
[cindy@iZbp15qc4wmx335c268l5mZ awk]$ awk '{print NF}' text1
3
3
# These variables are rarely printed on their own like this; they are normally combined with other statements
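
A sketch of how these variables are more typically combined in a program. The input lines and variable names here are made up for the example.

```shell
# NR numbers each record, $NF is the last field, NF is the field count.
numbered=$(printf '1 2 3\n4 5 6\n' | awk '{ print NR ": " $NF " (" NF " fields)" }')

# FS can be set in a BEGIN block instead of with the -F option.
second=$(printf 'a:bb:c\n' | awk 'BEGIN { FS = ":" } { print $2 }')

printf '%s\n' "$numbered" "$second"
```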

awk options

# -F sets the field separator
# -v assigns a variable
[cindy@iZbp15qc4wmx335c268l5mZ awk]$ cat test2
a:bb:c
1:2:100
[cindy@iZbp15qc4wmx335c268l5mZ awk]$ awk '{print $1}' test2 # the default separator is whitespace, so $1 is the whole line
a:bb:c
1:2:100
[cindy@iZbp15qc4wmx335c268l5mZ awk]$ awk -F":" '{print $1}' test2 # use : as the separator and print the first field
a
1
[cindy@iZbp15qc4wmx335c268l5mZ awk]$ awk -v var="hello" -F":" '{print var, $2}' test2 # set the variable var, then use it directly in the program (no $ prefix). Items separated by commas in print are output separated by a space
hello bb
hello 2

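awk built-in functions

Arithmetic functions
String functions
Time functions: systime and strftime
Other functions

The string functions are covered in more detail below:

Substitution functions gsub and sub

The difference between gsub and sub: gsub replaces every match within the target, whereas sub replaces only the first match.

# gsub(r,s,t): within t, replace every match of r with s; sub works the same way but replaces only the first match. If t is omitted, the whole record ($0) is searched.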
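
For a quick feel of the first two categories, the sketch below calls a few standard arithmetic and string functions from a BEGIN block, so no input file is needed. The time functions are left out here; the variable names are invented for the example.

```shell
# Arithmetic: int() truncates, sqrt() takes the square root, ^ is exponentiation.
arith=$(awk 'BEGIN { print int(3.9), sqrt(16), 2^10 }')

# String: toupper() converts to upper case, length() counts characters.
strings=$(awk 'BEGIN { print toupper("awk"), length("hello") }')

printf '%s\n' "$arith" "$strings"
```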
[cindy@iZbp15qc4wmx335c268l5mZ awk]$ cat text3
hello how are you
hi i am happy
[cindy@iZbp15qc4wmx335c268l5mZ awk]$ awk '{gsub("h","H")}' text3 # the substitution itself prints nothing, so no result is shown
[cindy@iZbp15qc4wmx335c268l5mZ awk]$ awk '{gsub("h","H"); print $0}' text3 # replace every h with H and print
Hello How are you
Hi i am Happy
[cindy@iZbp15qc4wmx335c268l5mZ awk]$ awk '{sub("h","H"); print $0}' text3 # sub replaces only the first matching h
Hello how are you
Hi i am happy
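[cindy@iZbp15qc4wmx335c268l5mZ awk]$ awk '{gsub("h","H",$1); print $0}' text3 # restrict the match to $1 (the output implies text3 had been edited so that its first fields were hhhhh and hhhh)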
HHHHH how are you
HHHH i am happy
[cindy@iZbp15qc4wmx335c268l5mZ awk]$ awk '{sub("h","H",$1); print $0}' text3
Hhhhh how are you
Hhhh i am happy
[cindy@iZbp15qc4wmx335c268l5mZ awk]$ cat test2
a:bb:c
1:2:100
[cindy@iZbp15qc4wmx335c268l5mZ awk]$ awk '{gsub("[0-9]+","x"); print $0}' test2 # match with a regular expression, replacing every run of digits with x
a:bb:c
x:x:x

String length: length

[cindy@iZbp15qc4wmx335c268l5mZ awk]$ cat test2
a:bb:c
1:2:100
[cindy@iZbp15qc4wmx335c268l5mZ awk]$ awk -F":" '{ for(i=1;i<=NF;i++){print $i, length($i)} }' test2
a 1
bb 2
c 1
1 1
2 1
100 3
[cindy@iZbp15qc4wmx335c268l5mZ awk]$ awk '{print length()}' test2  # with no argument, length uses $0 by default
6
7

Position of a substring: index

# index(s,t) returns the position of string t within s
[cindy@iZbp15qc4wmx335c268l5mZ awk]$ cat test2
apple is good
i like apple
[cindy@iZbp15qc4wmx335c268l5mZ awk]$ awk '{print index($0,"apple")}' test2
1
8

Creating an array dynamically: split

# split(r,s,t) splits the string r on the separator t into the array s
[cindy@iZbp15qc4wmx335c268l5mZ awk]$ echo "a:bb:ccc" | awk -F":" '{split($0,arr,":");for(i in arr){print i,arr[i]}}' # note that array indices produced by split start at 1
1 a
2 bb
3 ccc
# A variable passed in with -v can be split as well
[cindy@iZbp15qc4wmx335c268l5mZ awk]$ awk -v var="h1 h2 h3" 'BEGIN{split(var,arr," "); for(i in arr){print i,arr[i]}}'  
1 h1
2 h2
3 h3
# split returns the number of elements in the array
[cindy@iZbp15qc4wmx335c268l5mZ awk]$ awk -v var="h1 h2 h3" 'BEGIN{print split(var,arr," ")}'
3
# Note that for(i in arr) may not iterate in index order; to iterate in order, use a counted loop as follows
[cindy@iZbp15qc4wmx335c268l5mZ awk]$ awk -v var="h1 h2 h3" 'BEGIN{ arrlen=split(var,arr," "); for(i=1;i<=arrlen;i++){print i, arr[i]}}'
1 h1
2 h2
3 h3

Matching function: match

To be covered later.
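
Until then, here is a brief sketch of the standard behaviour: match(s, r) returns the position where the regular expression r first matches in s (0 if it does not match) and sets the built-in variables RSTART and RLENGTH. The sample strings and variable names are invented for the example.

```shell
# On a match, RSTART is the start position and RLENGTH the length of the match.
found=$(echo 'foo123bar' | awk '{
    if (match($0, /[0-9]+/))
        print RSTART, RLENGTH, substr($0, RSTART, RLENGTH)
}')

# Without a match, the return value and RSTART are 0 and RLENGTH is -1.
missing=$(awk 'BEGIN { r = match("abc", /[0-9]+/); print r, RSTART, RLENGTH }')

printf '%s\n' "$found" "$missing"
```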

Substring function: substr

https://www.cnblogs.com/irockcode/p/6880597.html

substr(s, index): the part of s from character index through the end of the string

substr(s, index1, length): length characters of s starting at character index1

[cindy@iZbp15qc4wmx335c268l5mZ awk]$ cat test2
apple is good
i like apple
[cindy@iZbp15qc4wmx335c268l5mZ awk]$ awk '{ print substr($0,1) }' test2  # starting at character 1 with no length returns the whole line
apple is good
i like apple
[cindy@iZbp15qc4wmx335c268l5mZ awk]$ awk '{ print substr($0,3) }' test2
ple is good
like apple
[cindy@iZbp15qc4wmx335c268l5mZ awk]$ awk '{ print substr($1,3) }' test2  # $1 of the second line is "i", which has no third character, so an empty string is printed
ple

[cindy@iZbp15qc4wmx335c268l5mZ awk]$ awk '{ print substr($3,1,2) }' test2
go
ap

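awk execution flow

awk [options] 'BEGIN{actions} pattern{actions} END{actions}' filename

BEGIN: actions run before any input is read; executed only once
pattern: the normal actions, run once for each matching line
END: the counterpart of BEGIN; actions run after all input has been read; executed only once

Example: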

[cindy@iZbp15qc4wmx335c268l5mZ awk]$ cat text1
1 2 3
4 5 6
[cindy@iZbp15qc4wmx335c268l5mZ awk]$ awk 'BEGIN{print "开始awk"} {print $2} END{print "结束awk"}' text1
开始awk
2
5
结束awk
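
A common use of this structure is to accumulate a value across lines and report it in END. The sketch below assumes the same two-line content as text1; the variable names are made up for the example.

```shell
# BEGIN initialises the total, the main block adds the second field of each
# record, and END reports the result once all input has been read.
summary=$(printf '1 2 3\n4 5 6\n' | awk '
BEGIN { sum = 0 }
      { sum += $2 }
END   { print "rows:", NR, "sum:", sum }')

echo "$summary"
```

In END, NR still holds the number of the last record read, which is why it can serve as a row count.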

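awk flow-control statements

https://blog.51cto.com/linux2023/5016231

awk's flow-control statements include:

if-else statement
for statement
while statement
do-while statement
break statement
continue statement
next statement
nextfile statement
exit statement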
if-else

if (condition 1) {
    action 1
}
else if (condition 2) {
    action 2
}
else {
    action 3
}

[cindy@iZbp15qc4wmx335c268l5mZ awk]$ cat text1
1 2 3
4 5 6
[cindy@iZbp15qc4wmx335c268l5mZ awk]$ cat 1.sh
awk '{
        if ( $2 == 2 ){
                print "$1 is ",$1;
        }
        else {
                print "$3 is ",$3;
        }
}' ./text1
[cindy@iZbp15qc4wmx335c268l5mZ awk]$ sh 1.sh
$1 is  1
$3 is  6