8.2 Introduction to Object Detection
This article introduces object detection.
Object detection determines where targets are located in an image. It has two elements (a toy illustration follows the list):
  • Classification: a classification vector $P_0, P_1, P_2, \dots$ with shape $[N, c+1]$
  • Regression: regression bounding boxes $[x_1, y_1, x_2, y_2]$ with shape $[N, 4]$
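As a toy illustration (the tensors below are made up, not taken from a real model), for $N$ detected boxes and $c$ foreground classes the two outputs look like this:

```python
import torch

N, c = 3, 90  # 3 detected boxes, 90 foreground classes (plus 1 background)
cls_scores = torch.softmax(torch.rand(N, c + 1), dim=1)  # classification, shape [N, c+1]
boxes = torch.rand(N, 4)                                 # regression, shape [N, 4]
print(cls_scores.shape, boxes.shape)  # torch.Size([3, 91]) torch.Size([3, 4])
```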
The code below loads a pretrained FasterRCNN_ResNet50_fpn. This model was trained on the COCO dataset and covers 91 categories. Here the input is no longer a BCHW tensor but a list whose elements are images. The output is also a list; each element is a dict, and each dict contains three entries: boxes, scores, and labels. Each entry holds one value per detected object, since a single image may contain multiple targets. The drawing code that follows skips any box whose score falls below a given threshold.
```python
import os
import time
import torch.nn as nn
import torch
import numpy as np
import torchvision.transforms as transforms
import torchvision
from PIL import Image
from matplotlib import pyplot as plt

BASE_DIR = os.path.dirname(os.path.abspath(__file__))
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


# classes_coco
COCO_INSTANCE_CATEGORY_NAMES = [
    '__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
    'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A', 'stop sign',
    'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
    'elephant', 'bear', 'zebra', 'giraffe', 'N/A', 'backpack', 'umbrella', 'N/A', 'N/A',
    'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
    'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket',
    'bottle', 'N/A', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl',
    'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',
    'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'N/A', 'dining table',
    'N/A', 'N/A', 'toilet', 'N/A', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone',
    'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'N/A', 'book',
    'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'
]


if __name__ == "__main__":

    path_img = os.path.join(BASE_DIR, "demo_img1.png")
    # path_img = os.path.join(BASE_DIR, "demo_img2.png")

    # config
    preprocess = transforms.Compose([
        transforms.ToTensor(),
    ])

    # 1. load data & model
    input_image = Image.open(path_img).convert("RGB")
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
    model.eval()

    # 2. preprocess
    img_chw = preprocess(input_image)

    # 3. to device
    if torch.cuda.is_available():
        img_chw = img_chw.to('cuda')
        model.to('cuda')

    # 4. forward
    # the input is not a BCHW tensor but a list whose elements are images
    input_list = [img_chw]
    with torch.no_grad():
        tic = time.time()
        print("input img tensor shape:{}".format(input_list[0].shape))
        output_list = model(input_list)
        # the output is also a list; each element is a dict
        output_dict = output_list[0]
        print("pass: {:.3f}s".format(time.time() - tic))
        for k, v in output_dict.items():
            print("key:{}, value:{}".format(k, v))

    # 5. visualization
    out_boxes = output_dict["boxes"].cpu()
    out_scores = output_dict["scores"].cpu()
    out_labels = output_dict["labels"].cpu()

    fig, ax = plt.subplots(figsize=(12, 12))
    ax.imshow(input_image, aspect='equal')

    num_boxes = out_boxes.shape[0]
    # draw at most 40 boxes
    max_vis = 40
    thres = 0.5

    for idx in range(0, min(num_boxes, max_vis)):

        score = out_scores[idx].numpy()
        bbox = out_boxes[idx].numpy()
        class_name = COCO_INSTANCE_CATEGORY_NAMES[out_labels[idx]]
        # skip boxes whose score is below the threshold
        if score < thres:
            continue

        ax.add_patch(plt.Rectangle((bbox[0], bbox[1]), bbox[2] - bbox[0], bbox[3] - bbox[1], fill=False,
                                   edgecolor='red', linewidth=3.5))
        ax.text(bbox[0], bbox[1] - 2, '{:s} {:.3f}'.format(class_name, score), bbox=dict(facecolor='blue', alpha=0.5),
                fontsize=14, color='white')
    plt.show()
    plt.close()


# appendix
classes_pascal_voc = ['__background__',
                      'aeroplane', 'bicycle', 'bird', 'boat',
                      'bottle', 'bus', 'car', 'cat', 'chair',
                      'cow', 'diningtable', 'dog', 'horse',
                      'motorbike', 'person', 'pottedplant',
                      'sheep', 'sofa', 'train', 'tvmonitor']

# classes_coco
COCO_INSTANCE_CATEGORY_NAMES = [
    '__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
    'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A', 'stop sign',
    'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
    'elephant', 'bear', 'zebra', 'giraffe', 'N/A', 'backpack', 'umbrella', 'N/A', 'N/A',
    'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
    'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket',
    'bottle', 'N/A', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl',
    'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',
    'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'N/A', 'dining table',
    'N/A', 'N/A', 'toilet', 'N/A', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone',
    'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'N/A', 'book',
    'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'
]
```
The output prints the input tensor shape, the forward-pass time, and the boxes, labels, and scores dict, and the image is displayed with the detected objects outlined in red boxes.
One of the hard problems in object detection is determining the number of bounding boxes $N$.
The traditional approach is the sliding-window strategy; its drawbacks are heavy redundant computation and the difficulty of choosing the window size.
By replacing the fully-connected layers with convolutional layers, each pixel of the final feature map corresponds to a region of the original image, so the sliding window can be implemented with convolution operations (see the sketch below).
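A minimal sketch of that idea (the layer sizes here are made up for illustration): copy a fully-connected classifier's weights into a convolution with the same receptive field, and the resulting network evaluates every window position of a larger input in a single convolution.

```python
import torch
import torch.nn as nn

# a classifier over a 256x7x7 window, first as an FC layer...
fc = nn.Linear(256 * 7 * 7, 10)
# ...then as an equivalent 7x7 convolution carrying the same weights
conv = nn.Conv2d(256, 10, kernel_size=7)
conv.weight.data = fc.weight.data.view(10, 256, 7, 7)
conv.bias.data = fc.bias.data

x = torch.rand(1, 256, 7, 7)
print(torch.allclose(fc(x.flatten(1)), conv(x).flatten(1), atol=1e-5))  # True
# on a larger input, each output pixel is one sliding-window position
print(conv(torch.rand(1, 256, 15, 15)).shape)  # torch.Size([1, 10, 9, 9])
```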
Object detection models can be divided into one-stage and two-stage.
One-stage models include:
  • YOLO
  • SSD
  • Retina-Net
Two-stage models include:
  • RCNN
  • SPPNet
  • Fast RCNN
  • Faster RCNN
  • Pyramid Network
A one-stage model directly divides the feature map into multiple grid cells, and each cell performs classification and regression.
A two-stage model adds a proposal-generation step that outputs multiple candidate boxes, typically 2000 by default.
In Faster RCNN, proposal generation is done by the RPN (Region Proposal Network), which generates hundreds of thousands of candidate boxes from the feature map and uses NMS to select the 2000 boxes with the highest foreground probability. Since the candidate boxes vary in size, ROI pooling produces a fixed-size output for each box, so the first dimension of the pooled tensor equals the number of boxes. The hallmark of ROI pooling is that the input feature-map size is not fixed but the output size is (see the sketch below). Finally, fully-connected layers produce the regression and classification outputs.
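torchvision exposes this operation as `torchvision.ops.roi_pool`. The feature map and boxes below are made up, but they demonstrate the fixed-size-output property:

```python
import torch
from torchvision.ops import roi_pool

fmap = torch.rand(1, 256, 50, 50)                 # a feature map of arbitrary size
boxes = [torch.tensor([[0., 0., 20., 14.],        # a 20x14 region
                       [10., 10., 45., 45.]])]    # a 35x35 region
out = roi_pool(fmap, boxes, output_size=(7, 7), spatial_scale=1.0)
print(out.shape)  # torch.Size([2, 256, 7, 7]) -- one fixed-size feature per box
```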
`fasterrcnn_resnet50_fpn` returns a `FasterRCNN` instance. `FasterRCNN` inherits from `GeneralizedRCNN`, and `GeneralizedRCNN`'s `forward()` function comprises the following 3 modules:
  • backbone: `features = self.backbone(images.tensors)`
    When the backbone is constructed, the backbone matching `backbone_name` is selected; here resnet50 is used.
  • rpn: `proposals, proposal_losses = self.rpn(images, features, targets)`
  • roi_heads: `detections, detector_losses = self.roi_heads(features, proposals, images.image_sizes, targets)`
`GeneralizedRCNN`'s `forward()` function is as follows:
```python
def forward(self, images, targets=None):
    ...
    ...
    ...
    features = self.backbone(images.tensors)
    if isinstance(features, torch.Tensor):
        features = OrderedDict([('0', features)])
    proposals, proposal_losses = self.rpn(images, features, targets)
    detections, detector_losses = self.roi_heads(features, proposals, images.image_sizes, targets)
    detections = self.transform.postprocess(detections, images.image_sizes, original_image_sizes)

    losses = {}
    losses.update(detector_losses)
    losses.update(proposal_losses)

    if torch.jit.is_scripting():
        if not self._has_warned:
            warnings.warn("RCNN always returns a (Losses, Detections) tuple in scripting")
            self._has_warned = True
        return (losses, detections)
    else:
        return self.eager_outputs(losses, detections)
```
The `features` returned by `self.backbone(images.tensors)` is a dict; each element is a feature map, and each feature map's width and height are half those of the previous one.
These 5 feature maps correspond to the 5 feature maps in ResNet.
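This can be checked by running the backbone alone on a dummy tensor (a quick inspection sketch; the 800x800 input size is arbitrary, and the weights do not matter for shapes):

```python
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=False)
features = model.backbone(torch.rand(1, 3, 800, 800))
for name, fmap in features.items():
    print(name, fmap.shape)
# keys '0'-'3' plus 'pool', each level halving the previous spatial size
```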
Next comes the RPN network, whose code is as follows.
```python
def forward(self, images, features, targets=None):
    ...
    ...
    ...
    features = list(features.values())
    objectness, pred_bbox_deltas = self.head(features)
    anchors = self.anchor_generator(images, features)

    num_images = len(anchors)
    num_anchors_per_level = [o[0].numel() for o in objectness]
    objectness, pred_bbox_deltas = \
        concat_box_prediction_layers(objectness, pred_bbox_deltas)
    # apply pred_bbox_deltas to anchors to obtain the decoded proposals
    # note that we detach the deltas because Faster R-CNN do not backprop through
    # the proposals
    proposals = self.box_coder.decode(pred_bbox_deltas.detach(), anchors)
    proposals = proposals.view(num_images, -1, 4)
    boxes, scores = self.filter_proposals(proposals, objectness, images.image_sizes, num_anchors_per_level)
```
`self.head(features)` calls `RPNHead`. The returned `objectness` and `pred_bbox_deltas` have the same per-level structure as `features`, just with different channel counts: `objectness` has 3 channels, meaning each feature-map pixel proposes 3 candidate boxes, while `pred_bbox_deltas` has 12 channels, the coordinate offsets for each pixel's 3 boxes ($3 \times 4 = 12$). A small sketch follows.
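A sketch instantiating `RPNHead` directly on made-up FPN features (256 channels, 3 anchors per location, two fake levels):

```python
import torch
from torchvision.models.detection.rpn import RPNHead

head = RPNHead(in_channels=256, num_anchors=3)
feats = [torch.rand(1, 256, 50, 50), torch.rand(1, 256, 25, 25)]  # two fake FPN levels
objectness, pred_bbox_deltas = head(feats)
print(objectness[0].shape)        # torch.Size([1, 3, 50, 50])  -- 3 scores per pixel
print(pred_bbox_deltas[0].shape)  # torch.Size([1, 12, 50, 50]) -- 3 boxes x 4 offsets
```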
The output of `self.anchor_generator(images, features)` is `anchors`, with shape $242991 \times 4$; these are the actual boxes.
`self.filter_proposals()` corresponds to NMS and is used to select a subset of the boxes.
```python
def filter_proposals(self, proposals, objectness, image_shapes, num_anchors_per_level):
    ...
    ...
    ...
    # select top_n boxes independently per level before applying nms
    top_n_idx = self._get_top_n_idx(objectness, num_anchors_per_level)

    image_range = torch.arange(num_images, device=device)
    batch_idx = image_range[:, None]

    objectness = objectness[batch_idx, top_n_idx]
    levels = levels[batch_idx, top_n_idx]
    proposals = proposals[batch_idx, top_n_idx]

    final_boxes = []
    final_scores = []
    for boxes, scores, lvl, img_shape in zip(proposals, objectness, levels, image_shapes):
        boxes = box_ops.clip_boxes_to_image(boxes, img_shape)
        keep = box_ops.remove_small_boxes(boxes, self.min_size)
        boxes, scores, lvl = boxes[keep], scores[keep], lvl[keep]
        # non-maximum suppression, independently done per level
        keep = box_ops.batched_nms(boxes, scores, lvl, self.nms_thresh)
        # keep only topk scoring predictions
        keep = keep[:self.post_nms_top_n()]
        boxes, scores = boxes[keep], scores[keep]
        final_boxes.append(boxes)
        final_scores.append(scores)
    return final_boxes, final_scores
```
The `self._get_top_n_idx()` function fetches the indices of the 4000 highest-scoring boxes. The final `for` loop maps boxes on the feature maps back to boxes on the original image and keeps only the top boxes (2000 during training, 1000 at test time). The per-level suppression relies on `batched_nms`, demonstrated in the sketch below.
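`torchvision.ops.batched_nms` only suppresses boxes that share the same group id (here, the FPN level). A small made-up example:

```python
import torch
from torchvision.ops import batched_nms

boxes = torch.tensor([[0., 0., 10., 10.],
                      [1., 1., 11., 11.],   # heavily overlaps box 0
                      [0., 0., 10., 10.]])  # identical to box 0, but another level
scores = torch.tensor([0.9, 0.8, 0.7])
levels = torch.tensor([0, 0, 1])
keep = batched_nms(boxes, scores, levels, iou_threshold=0.5)
print(keep)  # tensor([0, 2]) -- box 1 suppressed by box 0; box 2 kept (other level)
```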
Then, back in `GeneralizedRCNN`'s `forward()`, `roi_heads()` is called. This actually invokes `RoIHeads`, whose `forward()` function is as follows:
```python
def forward(self, features, proposals, image_shapes, targets=None):
    ...
    ...
    ...
    box_features = self.box_roi_pool(features, proposals, image_shapes)
    box_features = self.box_head(box_features)
    class_logits, box_regression = self.box_predictor(box_features)
```
Here `box_roi_pool()` calls `MultiScaleRoIAlign` to pool feature maps of different scales to the same size, returning `box_features` of shape $[1000, 256, 7, 7]$; the 1000 means 1000 boxes (during training, 512 of the 2000 proposals are sampled; at test time all are kept, hence 1000). `box_head()` is two fully-connected layers and returns a tensor of shape $[1000, 1024]$, so each proposal is represented by a 1024-dimensional vector. `box_predictor()` produces the final classification and bounding boxes: `class_logits` has shape $[1000, 91]$ and `box_regression` has shape $[1000, 364]$, where $364 = 91 \times 4$. A sketch of the pooling step follows.
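A sketch of this step using `torchvision.ops.MultiScaleRoIAlign` with made-up feature maps and proposals, reproducing the $[1000, 256, 7, 7]$ shape quoted above:

```python
import torch
from collections import OrderedDict
from torchvision.ops import MultiScaleRoIAlign

pooler = MultiScaleRoIAlign(featmap_names=['0', '1'], output_size=7, sampling_ratio=2)
features = OrderedDict([('0', torch.rand(1, 256, 100, 100)),
                        ('1', torch.rand(1, 256, 50, 50))])
# 1000 made-up but valid proposals (x1 < x2, y1 < y2) on a 400x400 image
xy1 = torch.rand(1000, 2) * 150
wh = torch.rand(1000, 2) * 40 + 10
proposals = [torch.cat([xy1, xy1 + wh], dim=1)]
box_features = pooler(features, proposals, [(400, 400)])
print(box_features.shape)  # torch.Size([1000, 256, 7, 7])
```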
Then, back in `GeneralizedRCNN`'s `forward()`, `transform.postprocess()` post-processes the outputs, mapping them back onto the dimensions of the original image.
To summarize, the main components of Faster RCNN are:
  1. backbone
  2. rpn
  3. filter_proposals (NMS)
  4. roi_heads
The following example finetunes Faster RCNN for pedestrian detection. The dataset, available at https://www.cis.upenn.edu/~jshi/ped_html/, contains 170 pedestrian photos with 345 pedestrian labels.
The data is organized as follows:
  • PennFudanPed
    • Annotation: annotation files in txt format
    • PedMasks: segmentation masks, not used in this example
    • PNGImages: the image data
In the Dataset, the constructor stores the file names of all images, which are later used to look up the corresponding txt label files. In `__getitem__()`, the image and the txt file are retrieved by index; every line of the txt file that contains numbers is a labeled line, and parsing those lines yields the boxes and labels. Finally the target is built: a dict containing boxes and labels. A minimal sketch of such a Dataset follows.
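The imported `my_dataset.PennFudanDataset` is not shown in this article; here is a minimal sketch of what it might look like, assuming the PennFudan txt format where each "Bounding box" line ends with the four box coordinates:

```python
import os
import re
import torch
from PIL import Image
from torch.utils.data import Dataset


class PennFudanDataset(Dataset):
    def __init__(self, data_dir, transforms=None):
        self.transforms = transforms
        self.img_dir = os.path.join(data_dir, "PNGImages")
        self.ann_dir = os.path.join(data_dir, "Annotation")
        # store the image file names; used later to find the matching txt files
        self.names = [n[:-4] for n in sorted(os.listdir(self.img_dir)) if n.endswith(".png")]

    def __getitem__(self, index):
        name = self.names[index]
        image = Image.open(os.path.join(self.img_dir, name + ".png")).convert("RGB")
        boxes = []
        with open(os.path.join(self.ann_dir, name + ".txt")) as f:
            for line in f:
                if "Bounding box" in line:  # labeled lines carry the coordinates
                    x1, y1, x2, y2 = map(int, re.findall(r"\d+", line)[-4:])
                    boxes.append([x1, y1, x2, y2])
        boxes = torch.tensor(boxes, dtype=torch.float)
        labels = torch.ones((boxes.shape[0],), dtype=torch.int64)  # 1 = pedestrian
        target = {"boxes": boxes, "labels": labels}
        if self.transforms is not None:
            image, target = self.transforms(image, target)
        return image, target

    def __len__(self):
        return len(self.names)
```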
When constructing the DataLoader, a `collate_fn()` function must also be passed in. This is because in object detection the images may differ in width and height, so a batch of images cannot be stacked into a 4D tensor; a tuple is used to assemble the batch instead.
```python
# function that collates the batch data
def collate_fn(batch):
    return tuple(zip(*batch))
```
The input of `collate_fn` is a list whose elements are tuples; each tuple is the `(image, target)` pair returned by the Dataset's `__getitem__()`.
For example:
```python
image = [1, 2, 3]
target = [4, 5, 6]
batch = list(zip(image, target))
print("batch:")
print(batch)
collate_result = tuple(zip(*batch))
print("collate_result:")
print(collate_result)
```
The output is:
```
batch:
[(1, 4), (2, 5), (3, 6)]
collate_result:
((1, 2, 3), (4, 5, 6))
```
In the code, data augmentation is first applied to the data and labels simultaneously, because when the image changes, the positions of the boxes change too; here the augmentation consists of flipping the image together with its bounding boxes. A quick check of the box arithmetic follows.
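The box update used below is `bbox[:, [0, 2]] = width - bbox[:, [2, 0]]`: after a horizontal flip of an image of width $W$, a box's x-range $[x_1, x_2]$ becomes $[W - x_2, W - x_1]$. A quick check with made-up numbers:

```python
import torch

width = 100
bbox = torch.tensor([[10., 20., 40., 60.]])   # [x1, y1, x2, y2]
bbox[:, [0, 2]] = width - bbox[:, [2, 0]]
print(bbox)  # tensor([[60., 20., 90., 60.]])
```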
When building the model, the number of output classes must be changed to 2: one class for background and one for pedestrians.
```python
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
```
There is no need to construct a Loss here, because Faster RCNN already builds its own. During training, the tuples of images and targets must be converted to lists before being fed to the model. The model does not return actual labels; it returns the Loss directly, so we can use this Loss for backpropagation straight away, as the sketch below demonstrates.
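A minimal demonstration of this behavior (untrained weights and a single made-up target), showing the loss keys the model returns in training mode:

```python
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=False, num_classes=2)
model.train()  # training mode: forward returns losses, not detections
images = [torch.rand(3, 300, 400)]
targets = [{"boxes": torch.tensor([[50., 50., 150., 200.]]),
            "labels": torch.tensor([1])}]
loss_dict = model(images, targets)
print(sorted(loss_dict.keys()))
# ['loss_box_reg', 'loss_classifier', 'loss_objectness', 'loss_rpn_box_reg']
```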
The full code is as follows:
```python
import os
import time
import torch.nn as nn
import torch
import random
import numpy as np
import torchvision.transforms as transforms
import torchvision
from PIL import Image
import torch.nn.functional as F
from my_dataset import PennFudanDataset
from common_tools import set_seed
from torch.utils.data import DataLoader
from matplotlib import pyplot as plt
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.transforms import functional as F
import enviroments

set_seed(1)  # set the random seed

BASE_DIR = os.path.dirname(os.path.abspath(__file__))
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# classes_coco
COCO_INSTANCE_CATEGORY_NAMES = [
    '__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
    'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A', 'stop sign',
    'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
    'elephant', 'bear', 'zebra', 'giraffe', 'N/A', 'backpack', 'umbrella', 'N/A', 'N/A',
    'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
    'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket',
    'bottle', 'N/A', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl',
    'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',
    'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'N/A', 'dining table',
    'N/A', 'N/A', 'toilet', 'N/A', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone',
    'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'N/A', 'book',
    'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'
]


def vis_bbox(img, output, classes, max_vis=40, prob_thres=0.4):
    fig, ax = plt.subplots(figsize=(12, 12))
    ax.imshow(img, aspect='equal')

    out_boxes = output["boxes"].cpu()
    out_scores = output["scores"].cpu()
    out_labels = output["labels"].cpu()

    num_boxes = out_boxes.shape[0]
    for idx in range(0, min(num_boxes, max_vis)):

        score = out_scores[idx].numpy()
        bbox = out_boxes[idx].numpy()
        class_name = classes[out_labels[idx]]

        if score < prob_thres:
            continue

        ax.add_patch(plt.Rectangle((bbox[0], bbox[1]), bbox[2] - bbox[0], bbox[3] - bbox[1], fill=False,
                                   edgecolor='red', linewidth=3.5))
        ax.text(bbox[0], bbox[1] - 2, '{:s} {:.3f}'.format(class_name, score), bbox=dict(facecolor='blue', alpha=0.5),
                fontsize=14, color='white')
    plt.show()
    plt.close()


class Compose(object):
    def __init__(self, transforms):
        self.transforms = transforms

    def __call__(self, image, target):
        for t in self.transforms:
            image, target = t(image, target)
        return image, target


class RandomHorizontalFlip(object):
    def __init__(self, prob):
        self.prob = prob

    def __call__(self, image, target):
        if random.random() < self.prob:
            height, width = image.shape[-2:]
            image = image.flip(-1)
            bbox = target["boxes"]
            bbox[:, [0, 2]] = width - bbox[:, [2, 0]]
            target["boxes"] = bbox
        return image, target


class ToTensor(object):
    def __call__(self, image, target):
        image = F.to_tensor(image)
        return image, target


if __name__ == "__main__":

    # config
    LR = 0.001
    num_classes = 2
    batch_size = 1
    start_epoch, max_epoch = 0, 5
    train_dir = enviroments.pennFudanPed_data_dir
    train_transform = Compose([ToTensor(), RandomHorizontalFlip(0.5)])

    # step 1: data
    train_set = PennFudanDataset(data_dir=train_dir, transforms=train_transform)

    # function that collates the batch data
    def collate_fn(batch):
        return tuple(zip(*batch))

    train_loader = DataLoader(train_set, batch_size=batch_size, collate_fn=collate_fn)

    # step 2: model
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)  # replace the pre-trained head with a new one

    model.to(device)

    # step 3: loss
    # in lib/python3.6/site-packages/torchvision/models/detection/roi_heads.py
    # def fastrcnn_loss(class_logits, box_regression, labels, regression_targets)

    # step 4: optimizer scheduler
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(params, lr=LR, momentum=0.9, weight_decay=0.0005)
    lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

    # step 5: Iteration

    for epoch in range(start_epoch, max_epoch):

        model.train()
        for iter, (images, targets) in enumerate(train_loader):

            images = list(image.to(device) for image in images)
            targets = [{k: v.to(device) for k, v in t.items()} for t in targets]

            # if torch.cuda.is_available():
            #     images, targets = images.to(device), targets.to(device)

            loss_dict = model(images, targets)  # images is list; targets is [ dict["boxes":**, "labels":**], dict[] ]

            losses = sum(loss for loss in loss_dict.values())

            print("Training:Epoch[{:0>3}/{:0>3}] Iteration[{:0>3}/{:0>3}] Loss: {:.4f} ".format(
                epoch, max_epoch, iter + 1, len(train_loader), losses.item()))

            optimizer.zero_grad()
            losses.backward()
            optimizer.step()

        lr_scheduler.step()

    # test
    model.eval()

    # config
    vis_num = 5
    vis_dir = os.path.join(BASE_DIR, "..", "..", "data", "PennFudanPed", "PNGImages")
    img_names = list(filter(lambda x: x.endswith(".png"), os.listdir(vis_dir)))
    random.shuffle(img_names)
    preprocess = transforms.Compose([transforms.ToTensor(), ])

    for i in range(0, vis_num):

        path_img = os.path.join(vis_dir, img_names[i])
        # preprocess
        input_image = Image.open(path_img).convert("RGB")
        img_chw = preprocess(input_image)

        # to device
        if torch.cuda.is_available():
            img_chw = img_chw.to('cuda')
            model.to('cuda')

        # forward
        input_list = [img_chw]
        with torch.no_grad():
            tic = time.time()
            print("input img tensor shape:{}".format(input_list[0].shape))
            output_list = model(input_list)
            output_dict = output_list[0]
            print("pass: {:.3f}s".format(time.time() - tic))

        # visualization
        vis_bbox(input_image, output_dict, COCO_INSTANCE_CATEGORY_NAMES, max_vis=20, prob_thres=0.5)  # for 2 epoch for nms
```