Development of Counting Based Visual Question Answering System by Using Transformer and Pyramid Networks with Hybrid Deep Learning Model

Publication Date : 2023-08-05
Author(s) :

S. Nigisha, K. Anugirba
Conference Name :

International Conference on Scientific Innovations in Science, Technology, and Management (NGCESl-2023)
Abstract :

Visual Question Answering (VQA) combines image understanding with natural language processing, enabling machines to answer questions about visual content by interpreting both the visual features of an image and the contextual cues in the question text. VQA aims to bridge the gap between human-like understanding and visual comprehension. Counting-based VQA is a subfield of VQA that focuses on questions about the number of objects or quantities in an image; its objective is to develop algorithms and models that answer such questions accurately. Our model uses Bidirectional Encoder Representations from Transformers (BERT) to extract textual features from the question and a Feature Pyramid Network (FPN) to extract deep features from the image. The textual and visual features are fused into a combined feature set, which is fed into a hybrid model for answer prediction. This hybrid model integrates a Gated Recurrent Unit (GRU) and a One-Dimensional Convolutional Neural Network (1DCNN).
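
The fusion and hybrid prediction stages described in the abstract can be illustrated with a minimal PyTorch sketch. The abstract does not specify how the features are fused or how the GRU and 1DCNN are combined, so everything below is an assumption made for illustration: fusion by concatenation, two parallel branches, counting treated as classification over counts 0 to 10, and random tensors standing in for BERT token features (768-dimensional) and a pooled FPN feature vector (256-dimensional). The class name `HybridCountingHead` and all dimensions are hypothetical and are not taken from the paper.

```python
import torch
import torch.nn as nn


class HybridCountingHead(nn.Module):
    """Hypothetical GRU + 1D-CNN answer predictor over fused features.

    Assumptions (not from the paper): concatenation fusion, parallel
    GRU and Conv1d branches, and a classifier over count classes.
    """

    def __init__(self, text_dim=768, visual_dim=256, hidden_dim=128, num_answers=11):
        super().__init__()
        fused_dim = text_dim + visual_dim
        self.gru = nn.GRU(fused_dim, hidden_dim, batch_first=True)
        self.conv = nn.Conv1d(fused_dim, hidden_dim, kernel_size=3, padding=1)
        self.classifier = nn.Linear(2 * hidden_dim, num_answers)

    def forward(self, text_feats, visual_feats):
        # text_feats: (batch, seq_len, text_dim), e.g. BERT token embeddings
        # visual_feats: (batch, visual_dim), e.g. a pooled FPN feature vector
        seq_len = text_feats.size(1)
        # Repeat the image vector at every token position, then concatenate.
        visual_seq = visual_feats.unsqueeze(1).expand(-1, seq_len, -1)
        fused = torch.cat([text_feats, visual_seq], dim=-1)

        # GRU branch: keep the final hidden state of the last layer.
        _, last_hidden = self.gru(fused)
        gru_out = last_hidden[-1]  # (batch, hidden_dim)

        # 1D-CNN branch: convolve along the token axis, then max-pool over it.
        conv_out = torch.relu(self.conv(fused.transpose(1, 2)))
        cnn_out = conv_out.max(dim=2).values  # (batch, hidden_dim)

        # Combine both branches and score each candidate count.
        return self.classifier(torch.cat([gru_out, cnn_out], dim=-1))


torch.manual_seed(0)
model = HybridCountingHead()
text_feats = torch.randn(2, 12, 768)   # stand-in for BERT output
visual_feats = torch.randn(2, 256)     # stand-in for pooled FPN output
logits = model(text_feats, visual_feats)
predicted_counts = logits.argmax(dim=-1)
print(tuple(logits.shape), predicted_counts.tolist())
```

In a real pipeline the random tensors would be replaced by features from a pretrained BERT encoder and an FPN backbone; the untrained model above only demonstrates the tensor shapes and data flow, not meaningful counts.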
