Center for Research in Intelligent Systems
Room 216, Winston Chung Hall, University of California at Riverside,
Riverside, CA 92521,
Tel. 951-827-3954, Fax. 951-827-2425
Xiu Zhang, Graduate Student (research assistant)
Linan Feng, Graduate Student
Raj Theagarajan (Graduate Student)
Federico Pala (Researcher)
- Major Goals of the Project:
Motivated by the fact that multiple concepts that frequently co-occur across images form patterns which could provide contextual cues for individual concept inference, the objectives of the proposed EAGER project are:
- (a) Develop a social network inspired formal framework for finding hierarchical co-occurrence correlation among concepts, and use these patterns of co-occurrence as contextual cues to improve the detection of individual concepts in multimedia databases.
- (b) Develop algorithms to select visually consistent semantic concepts.
- (c) Develop an image content descriptor called concept signature that can record both the semantic concept and the corresponding confidence value inferred from low-level image features.
- (d) Evaluate the effectiveness of the proposed approach in application domains such as automatic image annotation and concept-based image/video retrieval. The validation of the proposed techniques will be carried out by performing experiments on multiple databases using a variety of quantitative measures.
- Accomplishments under these goals:
Major Activities (2016-17):
- PI Bir Bhanu worked with his researchers/students Linan Feng, Xiu Zhang, Federico Pala and Raj Theagarajan to perform the proposed research, carry out the experiments and publish the research work. Xiu Zhang completed her coursework, passed the PhD qualifying examinations, and advanced to candidacy for the PhD degree in Computer Science.
- During the project period Xiu Zhang, Federico Pala and Bir Bhanu wrote a paper on Attributes Co-occurrence Pattern Mining, which has been accepted at a premier IEEE conference. This paper shows results on several large benchmark datasets which indicate that attributes can provide improvements both in accuracy and generalization capabilities.
- Raj Theagarajan, Federico Pala and Bir Bhanu participated in a competition and workshop (held in conjunction with IEEE Conference on Computer Vision and Pattern Recognition) on identifying 10 vehicle classes using deep ensemble learning techniques that exploit logical reasoning as semantic information. A detailed paper on this work is forthcoming. The dataset for this work includes over ¾ million diverse images resembling a real-world environment.
- PI Bir Bhanu worked with his students Linan Feng and Xiu Zhang to perform the proposed research, carry out the experiments and publish the research work. Linan Feng and Bir Bhanu completed and revised a journal paper. Xiu Zhang worked on developing network-based hierarchical co-occurrence algorithms and exploiting correlation structure for available large image datasets for re-identification.
- Specific Objectives:
Unlike the widely used low-level descriptors, visual attributes (e.g., hair and shirt color) offer a human-understandable way to recognize objects such as people. In this work, a new way to take advantage of them is proposed for person re-identification, where the challenges include illumination, pose and viewpoint changes among non-overlapping camera views.
- First, detect the attributes in images/videos by using deep learning-based convolutional neural networks.
- Second, compute the dependencies among attributes by mining association rules, which are then used to refine the attribute classification results.
- Third, transfer the attribute learning task to person re-identification in video by using a metric learning technique.
- Finally, integrate the attributes-based approach into an appearance-based method for video-based person re-identification and evaluate the results on benchmark datasets.
- 1) Discover and represent the co-occurrence patterns as hierarchical communities by graph modularity maximization in a network whose nodes and edges represent concepts and co-occurrence relationships, respectively.
2) Generate intermediate image descriptors by exploiting concept co-occurrence patterns in the pre-labeled training set, which makes it possible to depict complex scene images semantically.
3) Evaluate the effectiveness for automated image annotation and image retrieval.
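Objective 1) above can be illustrated with a small sketch. It computes the standard Newman modularity of a partition and grows communities by greedy pairwise merging. This is only one simple way to maximize modularity and is not necessarily the procedure used in the project; the concept names, edge weights and function names are made up for illustration, and the sketch returns a flat partition rather than a hierarchy.

```python
from collections import defaultdict
from itertools import combinations

def modularity(edges, community_of):
    """Newman modularity Q of a partition of a weighted, undirected graph."""
    total = sum(w for _, _, w in edges)      # m: total edge weight
    degree = defaultdict(float)              # weighted degree of each node
    internal = defaultdict(float)            # edge weight inside each community
    for u, v, w in edges:
        degree[u] += w
        degree[v] += w
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += w
    comm_degree = defaultdict(float)         # summed degree of each community
    for node, d in degree.items():
        comm_degree[community_of[node]] += d
    return sum(internal[c] / total - (comm_degree[c] / (2 * total)) ** 2
               for c in comm_degree)

def greedy_communities(edges):
    """Start from singletons; repeatedly apply the merge that raises Q the most."""
    nodes = sorted({n for u, v, _ in edges for n in (u, v)})
    community_of = {n: i for i, n in enumerate(nodes)}
    best_q = modularity(edges, community_of)
    while True:
        best_merge = None
        for a, b in combinations(sorted(set(community_of.values())), 2):
            trial = {n: (a if c == b else c) for n, c in community_of.items()}
            q = modularity(edges, trial)
            if q > best_q + 1e-12:           # keep only strict improvements
                best_q, best_merge = q, trial
        if best_merge is None:               # no merge improves Q: stop
            return community_of, best_q
        community_of = best_merge

# Hypothetical co-occurrence network: (concept, concept, co-occurrence weight).
edges = [
    ("sky", "cloud", 1.0), ("sky", "sun", 1.0), ("cloud", "sun", 1.0),
    ("desk", "chair", 1.0), ("desk", "monitor", 1.0), ("chair", "monitor", 1.0),
    ("sun", "desk", 1.0),                    # weak link between the two scenes
]
communities, q = greedy_communities(edges)
```

On this toy network the outdoor concepts and the office concepts end up in separate communities. A hierarchy could be obtained, for instance, by re-running the procedure inside each community, but that step is an assumption and is not shown here.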
- Significant Results:
We validate our approach on two of the most important benchmark datasets for video-based person re-identification. The first dataset is iLIDS-VID, which consists of 2 acquisitions of 300 pedestrians at an airport arrival hall. The length of the videos varies from 23 to 192 frames, with an average of 73 frames. This dataset is very challenging due to the changing illumination conditions and viewpoints, complex backgrounds and occlusions. The other dataset is PRID 2011, which includes 200 pairs of image sequences taken from two adjacent camera views. The length of the image sequences varies from 5 to 675 frames, with an average of 100 frames. Compared with the iLIDS-VID dataset, this is less challenging because of the relatively simple backgrounds and the rare presence of occlusions.
PEdesTrian Attribute (PETA) is the dataset we used to learn attributes. It is a large-scale surveillance dataset of 19000 attribute-labeled images taken from 8707 persons. Each image is annotated with 61 binary and 4 multi-class attributes, such as hair style, clothing color and accessories. The images in this dataset are from 10 different datasets, including 477 images from iLIDS-VID and 1134 images from PRID. The images from the iLIDS-VID and PRID 2011 datasets are handled appropriately in performing experiments with PETA.
For the attributes detection network, we use an NVIDIA DIGITS DevBox, which comes with four TITAN X GPUs (7 TFLOPS of single-precision performance, 336.5 GB/s of memory bandwidth, and 12 GB of memory per board). For the co-occurrence pattern mining, we use the Weka 3 package.
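To make the co-occurrence pattern mining step more concrete, the following sketch mines pairwise association rules (antecedent implies consequent) by support and confidence from attribute annotations. It is a plain-Python illustration and not the Weka configuration used in the experiments; the attribute names, thresholds and the `refine_scores` helper are hypothetical.

```python
from itertools import permutations

def mine_pairwise_rules(records, min_support=0.3, min_confidence=0.8):
    """Return rules (a, b, support, confidence) meaning 'a implies b'."""
    n = len(records)
    attributes = sorted(set().union(*records))
    rules = []
    for a, b in permutations(attributes, 2):
        count_a = sum(1 for r in records if a in r)
        count_ab = sum(1 for r in records if a in r and b in r)
        support = count_ab / n               # fraction of records with both a and b
        if count_a == 0 or support < min_support:
            continue
        confidence = count_ab / count_a      # estimate of P(b | a)
        if confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    return rules

def refine_scores(scores, rules):
    """Hypothetical refinement: raise a consequent's score to at least
    confidence * antecedent score (an assumption, not the report's rule)."""
    refined = dict(scores)
    for a, b, _, confidence in rules:
        refined[b] = max(refined.get(b, 0.0), confidence * scores.get(a, 0.0))
    return refined

# Hypothetical attribute annotations, one set of attributes per image.
records = [
    {"long_hair", "skirt"},
    {"long_hair", "skirt"},
    {"long_hair", "skirt", "backpack"},
    {"short_hair", "trousers"},
    {"short_hair", "trousers", "backpack"},
    {"long_hair", "trousers"},
]
rules = mine_pairwise_rules(records)
refined = refine_scores({"skirt": 0.9, "long_hair": 0.4}, rules)
```

With these toy records, only "skirt implies long_hair" and "short_hair implies trousers" pass both thresholds, so a confident skirt detection lifts a weak long_hair score.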
For each dataset, we randomly extract two equal subsets, one for training and one for testing. During the testing stage, for each query sequence, we compute the distance against each identity in the gallery set and return the top n identities. To measure the performance, the Cumulative Match Characteristic (CMC) plot is used, which represents the percentage of the test sequences that are correctly matched within the specified rank. The experiments are repeated 10 times and the average CMC plot is reported. For evaluating the attributes detection, the corresponding 477 and 1134 images of related person identities in iLIDS-VID and PRID 2011, respectively, are removed from the PETA dataset. Then we randomly select 16000 images from the remaining PETA dataset for training and leave the remaining 2523 and 1866 images, respectively, for validation. We test the performance on two datasets of 150 and 100 videos from iLIDS-VID and PRID 2011, respectively. For all the experiments, the same parameters are used.
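The CMC measure described above can be sketched as follows. The function name and the toy distance matrix are illustrative only, and the sketch assumes that every query identity appears exactly once in the gallery.

```python
def cmc_curve(distances, query_ids, gallery_ids, max_rank):
    """Fraction of queries whose true identity is within the top-k matches,
    for k = 1 .. max_rank. distances[i][j] is query i versus gallery item j."""
    hits = [0] * max_rank
    for row, qid in zip(distances, query_ids):
        # Gallery indices ordered from most to least similar (smallest distance first).
        ranked = sorted(range(len(gallery_ids)), key=lambda j: row[j])
        # Zero-based position at which the correct identity is retrieved.
        rank = next(i for i, j in enumerate(ranked) if gallery_ids[j] == qid)
        for r in range(rank, max_rank):
            hits[r] += 1                     # a match at rank k counts for all ranks >= k
    return [h / len(query_ids) for h in hits]

# Toy example: three query sequences against a gallery of three identities.
gallery_ids = [1, 2, 3]
query_ids = [1, 2, 3]
distances = [
    [0.20, 0.50, 0.90],   # identity 1 is retrieved first
    [0.30, 0.40, 0.10],   # identity 2 is retrieved third
    [0.60, 0.70, 0.65],   # identity 3 is retrieved second
]
curve = cmc_curve(distances, query_ids, gallery_ids, max_rank=3)
```

The first entry of `curve` is the rank 1 identification rate. Averaging such curves over repeated random train/test splits gives an averaged CMC plot of the kind reported here.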
We compare our results with the following state-of-the-art methods: Recurrent Convolutional Neural Networks (RCNN), Top-Push (TDL), Temporally Aligned Pooling Representation (TAPR) and Simultaneously learning Intra-Video and Inter-video Distance Learning (SI2DL). We achieve rank 1 identification rates of 60.3% and 73.2%, which are improvements of 2% and 2.6% with respect to [McLaughlin et al. CVPR 2016] for the iLIDS-VID and PRID 2011 datasets, respectively. For the iLIDS-VID dataset, our algorithm achieves the best rank 1 performance. For the PRID 2011 dataset, our results approach the best result, obtained by [Zhu et al. IJCAI 2016]. However, on the more challenging iLIDS-VID dataset, the method of Zhu et al. obtains the lowest recognition rates among all the listed results. In contrast, our method performs consistently well on both datasets. It is fair to say that attribute information and co-occurrence patterns are complementary to RCNN [McLaughlin et al. CVPR 2016]. Further details are given in the paper attached with this report.
We carried out experiments for automatic image annotation and semantic image retrieval on several challenging datasets. We use three datasets: 10,000 images and 2500 concepts from the LabelMe dataset, 12,000 images and 5800 concepts from the SUN09 dataset, and 2682 images and 520 concepts from the OSR dataset. We use a variety of features (color, histogram of oriented gradients, etc.). We evaluate the results for automated image annotation using various measures, including the F1 measure and precision measures, and for retrieval using mean average precision. The key results are:
- Co-occurrence Pattern Detection: Our combined co-occurrence measure of Normalized Google Distance, normalized tag distance, and automated local analysis is more effective than each of the individual measures in co-occurrence network construction as well as co-occurrence pattern detection. The combined measure gives the best performance in the modularity measure.
- Automated Image Annotation: To analyze the scalability of our approach, we compare the results on the three datasets with increasing complexity (OSR < SUN09 < LabelMe), evaluated by the total number of concepts in the datasets and the number of concepts per image. Our results show that generally, when the images are complex, the performance of the approaches drops. In particular, we observe that our approach achieves a better maximum performance gain when the images have higher complexity. For example, LabelMe usually has more than 10 concepts per image, and the maximum performance gain reaches 20.59 percent when the training set contains 80 percent of the images. SUN09 contains on average 5-10 concepts per image, and the maximum performance gain is between 11.29 and 14.00 percent. OSR has the fewest concepts per image, and its maximum gain is also the lowest, at approximately 10.00 percent. This indicates that our approach is well suited for understanding images with complex scenes.
- Image Retrieval: The proposed hierarchical concept co-occurrence patterns can boost the individual concept inference. In particular, we observe that when using only a small fraction of the dataset for training, our method can still achieve comparatively good performance. Further, we observe that the returned images are more semantically related to the scene concept reflected in the query images rather than just visually related.
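Of the three co-occurrence measures named in the first item above, the Normalized Google Distance has a standard closed form based on occurrence counts. The sketch below implements that standard definition; the count values are invented, and how the project combines this measure with the normalized tag distance and automated local analysis is not shown.

```python
import math

def normalized_google_distance(count_x, count_y, count_xy, total):
    """Standard NGD from occurrence counts.

    count_x, count_y: number of documents (or images) containing each concept
    count_xy:         number containing both concepts
    total:            size of the indexed collection
    """
    if count_xy == 0:
        return float("inf")                  # concepts never co-occur
    log_x, log_y = math.log(count_x), math.log(count_y)
    return (max(log_x, log_y) - math.log(count_xy)) / (math.log(total) - min(log_x, log_y))

# Invented counts: two concepts that always appear together, and a looser pair.
always_together = normalized_google_distance(100, 100, 100, 10_000)
loosely_related = normalized_google_distance(1_000, 100, 10, 100_000)
```

Smaller values indicate concepts that co-occur more often than their individual frequencies would suggest, so such a distance can be turned into an edge weight when building a co-occurrence network.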
These detailed quantified experimental results (in the attached paper) demonstrate the following:
- (a) The importance of the hierarchy of co-occurrence patterns and its representation as a network structure, and (b) The effectiveness of the approach for building individual concept inference models and the utilization of co-occurrence patterns for refinement of concept signature as a way to encode both visual and semantic information.
- Key Outcomes or other Achievements:
As compared to the state-of-the-art, our contribution can be summarized as follows:
- We developed a novel framework that takes into account attributes and their co-occurrence.
- We performed experiments that highlight the generalization capabilities of the framework. We train on a large independent attribute dataset and then test on two different re-id benchmarks. Unlike the work of Zhu et al. (IJCAI 2016), our approach performs consistently on both testing datasets. Experimental results on the two benchmark datasets indicate that attributes can provide improvements both in accuracy and generalization capabilities.
Developed algorithms to represent the co-occurrence patterns as hierarchical communities by graph modularity maximization in a network whose nodes and edges represent concepts and co-occurrence relationships, respectively.
Developed algorithms for a random walk process that works on the inferred concept probabilities with the discovered co-occurrence patterns to acquire the refined concept signature representation.
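The random walk refinement mentioned above can be pictured with a generic random walk with restart over a co-occurrence graph. This is a minimal sketch under assumed settings (the restart weight, iteration count, concept names and scores are all hypothetical) and should not be read as the exact formulation in the attached paper.

```python
def random_walk_refine(initial, cooccurrence, restart=0.5, iterations=50):
    """Propagate concept scores along co-occurrence edges.

    initial:      {concept: score from an individual concept detector}
    cooccurrence: {concept: {neighbor: weight}}; neighbors must be keys of initial
    """
    # Row-normalize the weights so each concept distributes its score to neighbors.
    transition = {}
    for concept, neighbors in cooccurrence.items():
        total = sum(neighbors.values())
        transition[concept] = {n: w / total for n, w in neighbors.items()} if total else {}
    scores = dict(initial)
    for _ in range(iterations):
        spread = {c: 0.0 for c in initial}
        for concept, score in scores.items():
            for neighbor, t in transition.get(concept, {}).items():
                spread[neighbor] += score * t
        # Mix the detector output (restart term) with the propagated context.
        scores = {c: restart * initial[c] + (1 - restart) * spread[c] for c in initial}
    return scores

# Hypothetical detector scores: "sea" and "keyboard" are equally uncertain.
initial = {"beach": 0.8, "sea": 0.3, "sand": 0.7, "keyboard": 0.3}
cooccurrence = {
    "beach": {"sea": 1.0, "sand": 1.0},
    "sea": {"beach": 1.0, "sand": 1.0},
    "sand": {"beach": 1.0, "sea": 1.0},
    "keyboard": {},                          # no co-occurrence support in this scene
}
refined = random_walk_refine(initial, cooccurrence)
```

In this toy case "sea" gains support from the confidently detected "beach" and "sand", while "keyboard" receives none. The refined values are not renormalized to sum to one, which would be a natural extra step if probabilities are required.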
What opportunities for training and professional development has the project provided?
How have the results been disseminated to communities of interest?
- The project provided opportunities for research on large image databases, machine learning and data mining, and for the development of algorithms/tools. It provided many situations for the improvement and refinement of oral/written communication skills.
What do you plan to do during the next reporting period to accomplish the goals?
- Publish in top journals and workshops/conferences.
- Develop and refine the approach and its experimental validation on re-identification datasets.
Bir Bhanu and Ajay Kumar (Eds.) (2017). Deep Learning for Biometrics, Springer International, ISBN: 978-3-319-61656-8.
Journals or Juried Conference Papers
L. Feng and B. Bhanu, “Semantic concept co-occurrence patterns for image annotation and retrieval,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 38, No. 4, April 2016.
X. Zhang, F. Pala and B. Bhanu, “Attributes co-occurrence pattern mining for video-based person re-identification,” 14th IEEE International Conference on Advanced Video and Signal Based Surveillance, 2017.
R. Theagarajan, F. Pala and B. Bhanu (2017). EDeN: Ensemble of deep networks for vehicle classification. Traffic Surveillance Workshop and Challenge (TSWC-2017) held in conjunction with IEEE Conference on Computer Vision and Pattern Recognition, July 21, 2017.
Other Conference Papers and Presentations
- L. Feng and B. Bhanu (2015). Co-occurrence Patterns for Image Annotation and Retrieval. 3rd Workshop on Web-scale Vision and Social Media (VSM), International Conference on Computer Vision (ICCV). Santiago, Chile. Status = PUBLISHED; Acknowledgement of Federal Support = Yes
This material is based upon work supported by the National Science Foundation Project ID No. IIS-1552454. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.