I attended Kevin Murphy's talk at Columbia's Data Science Institute (DSI) - http://datascience.columbia.edu/events/calendar - on September 5th. Kevin Murphy is a renowned machine learning researcher and currently a Research Scientist at Google.
In his talk, he presented some recent work on image and text analysis. As the title of his seminar, “Towards Machines that Perceive and Communicate”, implies, he discussed machine learning techniques his Google research team has been using to perceive images and to communicate about them in language. He covered (1) for perception, their approaches to image understanding; and (2) for communication, methods for description (image to text) and for comprehension (text to image).
The talk was quite technical overall, and this note is just a brief summary of what I understood from it.
The first part was about methods for “Image Understanding”. Basically, he explained (1) how to recognize stuff, (2) how to detect objects, and (3) how to detect people and estimate their pose. To recognize stuff, “Semantic Segmentation” is used: the idea is to label each pixel with an object class, so the whole image is parsed densely. He described specific methods including an encoder-decoder CNN for pixel classification (U-Net) as a standard approach, and atrous (dilated) convolution as an alternative. Next, to detect objects, a method called “Instance Segmentation” can be used.
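To make the atrous idea concrete, here is a minimal NumPy sketch (my own toy code, not from the talk) of a single dilated convolution: inserting gaps between kernel taps enlarges the receptive field without adding parameters or downsampling, which is why it is attractive for dense per-pixel labeling.

```python
import numpy as np

def dilated_conv2d(image, kernel, dilation=1):
    """Valid 2D convolution with an atrous (dilated) kernel.

    With dilation d, the kernel taps are spaced d pixels apart,
    so a 3x3 kernel at d=2 covers a 5x5 region of the input.
    """
    kh, kw = kernel.shape
    # Effective (dilated) kernel footprint.
    eh = (kh - 1) * dilation + 1
    ew = (kw - 1) * dilation + 1
    H, W = image.shape
    out = np.zeros((H - eh + 1, W - ew + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Strided slicing picks out the dilated taps.
            patch = image[i:i + eh:dilation, j:j + ew:dilation]
            out[i, j] = np.sum(patch * kernel)
    return out

img = np.arange(36, dtype=float).reshape(6, 6)
k = np.ones((3, 3))
print(dilated_conv2d(img, k, dilation=1).shape)  # (4, 4)
print(dilated_conv2d(img, k, dilation=2).shape)  # (2, 2)
```

Note how the d=2 output is smaller only because the same 3x3 kernel now "sees" a 5x5 window; in a real network, padding keeps the resolution fixed while the receptive field grows.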
The figure above (source: Nathan Silberman) shows the difference between Semantic Segmentation (second panel) and Instance Segmentation (fourth panel). Methods including SSD, Faster R-CNN, and R-FCN were mentioned. In addition to these segmentation methods, people’s 2D pose can be estimated: once people are detected, key points are found for each person and heat maps are created, and the pose is then estimated using a deep part-based model (related to R-CNN and, if I understood correctly, similar to mixture regression).
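A toy sketch of the heat-map step, with made-up data, might look like this: for each detected person, the network predicts one score map per key point (wrist, elbow, ...), and the argmax of each map gives that key point's 2D location. This is a much-simplified stand-in for the actual pipeline, not Google's code.

```python
import numpy as np

def keypoints_from_heatmaps(heatmaps):
    """Pick the highest-scoring pixel in each per-keypoint heat map.

    heatmaps: array of shape (K, H, W), one score map per key point.
    Returns a list of (row, col) locations, one per key point.
    """
    K, H, W = heatmaps.shape
    coords = []
    for k in range(K):
        flat_idx = int(np.argmax(heatmaps[k]))
        r, c = np.unravel_index(flat_idx, (H, W))
        coords.append((int(r), int(c)))
    return coords

# Two fake 5x5 heat maps with known peaks.
hm = np.zeros((2, 5, 5))
hm[0, 1, 3] = 1.0   # key point 0 peaks at (1, 3)
hm[1, 4, 0] = 1.0   # key point 1 peaks at (4, 0)
print(keypoints_from_heatmaps(hm))  # [(1, 3), (4, 0)]
```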
The second part was about methods for “Image Captioning”, i.e., going from image to text. The basic idea is to extract features from the image with a deep vision CNN and then use a language-generating RNN to produce a caption describing the image. Some more advanced topics related to image captioning were also discussed, e.g., evaluating image captions (they proposed using referring expressions), discriminative image captioning, and optimizing semantic metrics directly.
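As a rough illustration of the encode-then-decode idea, greedy caption generation can be sketched as follows. All weights and the tiny vocabulary here are random placeholders, not a trained model, so the output words are meaningless; the point is the control flow: the image feature initializes the RNN state, and at each step the argmax word is fed back in until an end token appears.

```python
import numpy as np

def greedy_caption(image_feature, Wx, Wh, Wo, vocab, max_len=10):
    """Greedy decoding sketch of the CNN -> RNN captioning pipeline.

    image_feature stands in for the vision CNN's output vector; it
    initializes the RNN hidden state. Each step scores the vocabulary
    and the argmax word is fed back as a one-hot vector.
    """
    h = np.tanh(image_feature)              # init hidden state from image
    word = np.zeros(len(vocab))
    word[vocab.index("<start>")] = 1.0
    caption = []
    for _ in range(max_len):
        h = np.tanh(Wx @ word + Wh @ h)     # one RNN step
        logits = Wo @ h                     # scores over the vocabulary
        w = int(np.argmax(logits))
        if vocab[w] == "<end>":
            break
        caption.append(vocab[w])
        word = np.zeros(len(vocab))
        word[w] = 1.0
    return caption

rng = np.random.default_rng(0)
vocab = ["<start>", "<end>", "a", "dog", "on", "grass"]
V, H = len(vocab), 4
Wx = rng.normal(size=(H, V))
Wh = rng.normal(size=(H, H))
Wo = rng.normal(size=(V, H))
cap = greedy_caption(rng.normal(size=H), Wx, Wh, Wo, vocab)
print(cap)
```

In a real captioner, the weights come from training the CNN and RNN jointly on paired image-caption data, and beam search is often used instead of pure greedy decoding.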
The third part was about “Generative Models”. The goal of this line of research is to find the underlying (latent) context behind images and texts, so that images can be generated from a text query (which corresponds to inference). This is their ongoing project, and the basic idea is to use a Joint Variational Autoencoder (JVAE). A method called “triple ELBO” was discussed for inference.
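For reference (this is standard VAE material, not specific to the talk), a VAE with encoder $q_\phi$ and decoder $p_\theta$ maximizes the evidence lower bound (ELBO) on the data likelihood:

```latex
\log p_\theta(x) \;\ge\;
\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]
\;-\;
\mathrm{KL}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right)
```

My understanding is that the “triple” refers to combining such bounds for the joint (image, text) pair and for each modality on its own, so that the shared latent $z$ can be inferred from either an image or a text alone, but I may be misremembering the details.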
Kevin mentioned they plan to extend these methods from static images to video. I really enjoyed Kevin’s talk and found it very interesting and useful in practice (I recently discovered Google’s image search feature, and it is really cool!). I think such methods have a lot of potential to be applied to educational research, particularly in the context of online learning that uses image and video materials. What do you think about potential applications of these methods in education?
As mentioned above, this note is based on my understanding of Kevin’s seminar, so I’d appreciate any feedback, comments, and corrections if you notice something off. Thanks!