Reflections on ICCV over 31 years
Computer vision from 1990 to 2021
14 October 2021 Michael J. Black 8 minute read
The International Conference on Computer Vision feels like home, even though, in 2021, that home is virtual.
My first ICCV was in 1990 in Osaka. I was a relatively new PhD student and I had an oral presentation on "A model for the detection of motion over time". My advisor, Anandan, said to me before the conference, "This is a big conference for you. After this talk, you're going to be famous." Frankly, I was terrified.
Terrible photo of my ICCV'90 talk.
My wife was with me, and we took a week before the conference to tour around Japan. I had an upset stomach the whole time. I thought the Japanese food wasn't agreeing with me, but it was the stress! The moment my talk was over, my stomach was perfectly fine. In the end, I'm sure that nobody remembers that talk and I wasn't famous after it. Fame doesn't come from a paper. If it comes at all, it can only come from a reputation that is built from years of contributions.
If you look at the proceedings today, some things are striking.
The papers seem to vary from 4 to 12 pages; I don't remember why. They also had many fewer authors than today's papers and often just one author. Back then, even a senior professor like Sandy Pentland could write a single-author paper. Many of the authors are still active today (Trevor Darrell, Luc van Gool, Pietro Perona, Jitendra Malik, Bill Freeman, etc.). There were only a handful of women and, sadly, we haven't made much progress there. The papers also contained many fewer references than today because the field was young and not so big. It was far easier than today to stay on top of the literature; there were only 126 papers in the conference, after all! Of these 126 papers, just one was on neural networks: "Object recognition by a Hopfield neural network".
There was no WWW, no GitHub, no GPUs, and very little existing work to build on. I started my PhD with an empty Emacs buffer and started typing (Lisp and C). There were a small number of videos that people had digitized at great expense and these few sequences were used by everyone. Imagine doing computer vision research with no way to get image and video data into the computer!
What was most amazing to me in 1990 was that my advisor would pay for me to fly to Japan! I hadn't travelled much and I was immediately hooked on it. All I needed to do was to keep writing papers and I'd get to travel the world for free! Travel was a major motivator and kept me productive during my PhD. I still feel that I can't attend a conference unless I have a paper.
ICCV 1990 Reception -- Free food. Life is good!
The community was pretty small back then. ICCV 1990 was the third ICCV and had 419 attendees, which was a pretty large turnout. It was my first computer vision conference and I was struck by how friendly and welcoming everyone was. I met a young Andrew Zisserman there and walked with him between our hotel and the conference center. His book with Andrew Blake on Visual Reconstruction was like a bible to me. I was amazed at how smart he was, how quickly he thought, and how he was teaching himself to read Japanese while walking between the conference and his hotel.
What really impressed me about ICCV then and still impresses me today is how accessible and open people are. Andrew was already well known in 1990, yet he had the time to talk to a new graduate student like me. The people I met in these early conferences, like Andrew, have been friends and colleagues throughout my career. I hope young researchers today feel as welcomed as I did.
At conferences today, many young people come up to me and introduce themselves and I love meeting them; they are the future of the field. I also go to a lot of posters because that's where I meet the new people and get to know how they think. Sometimes at a poster we are having a good discussion and then they read my badge and find out who I am. I guess I'm now an intimidating old-timer because they're like "Oh, you're Michael Black!". I try to reassure them that I'm just another researcher who is interested in understanding their paper.
Of course, there are many big differences between ICCV 1990 and ICCV 2021. In addition to the massive growth of the field, the largest change is in the tools we use. The problems are similar but today our tools are based on neural networks.
After AlexNet took first place in the ImageNet Challenge, everybody of a certain age went through the five stages of grief. First, there was shock and denial. It felt like the world was upended and nothing was going to be the same again. There was no denying the result, however.
Then came anger and bargaining. "Sure, these things are great at classification. It's obvious that they should be good at classification. But my problems are all regression problems, which involve predicting continuous numbers, and these things are never going to be good at predicting continuous numbers." I definitely said that. But then, of course, deep nets turned out to be very good at solving regression problems.
So then came depression. Many of the older researchers thought: "Oh gosh, my career is over. Everything I've done is useless. Nothing I've written in the last five years is ever going to be cited again. And I'm not interested in this new thing. I like calculus, I like manifolds, I like geometry, I like linear algebra. I like thinking about problems in a particular way. I'm good at it. And this new thing involves thinking about problems in a different way, and it's just not interesting to me. So what am I going to do?"
Some people stopped there. If you were a certain age, maybe it was a good time to retire. But many people stuck it out. As you work through your grief, at some point, there's an upturn, and you begin to accept that this is it. And then hope comes. And for those of us that went through that, you come out on the other side very hopeful about the field. You have a whole new set of tools, you have a new outlook on the problems that you were interested in, and life goes on.
Now, that said, eight or nine years into this revolution, it's clearer that some problems aren't that interesting anymore, at least to me. But other problems are now open to me that weren't open before.
I started my career doing work on optical-flow estimation; that was what my 1990 ICCV paper was on. My group at Max Planck had a paper last year, looking at adversarial attacks on optical-flow networks. But that may be my last paper on optical flow. I’ll never say the problem is solved, but given sufficient data — synthetic data as well as unlabeled video data — this is a problem that is solvable up to the point that somebody would want it to be solvable.
In fact, stereo, albedo, and surface normal estimation — any of these problems that are what I would call low-level problems, in the sense that you do the same thing at every pixel in the image, and you can measure your accuracy in terms of a number — those problems are not so interesting to me anymore because they are so well solved by sufficient data.
But there’s a class of mid-level problems that is currently interesting. I would consider 3-D human pose and shape estimation to be in this class. These mid-level problems involve nonuniform processing in the image but still use metric accuracy for evaluation. If I'm estimating 3-D human pose and shape, I'm estimating something that is not completely tied to the pixels. I'm estimating something in the 3-D world. But the accuracy of the result is still something you can evaluate quantitatively. This metric quality means that it's relatively easy to train neural networks to solve this problem if you have labeled data. What makes this interesting is that it's still hard to get such labeled data. Progress on 3-D human pose and shape estimation is pretty rapid, and self-supervised methods are improving, so this problem may also not be interesting in five years.
Beyond these mid-level problems, however, there's a whole wonderful world of things that we don't understand. I call those high-level problems. Computer vision has always had a broader mission of trying to "see what's not in the image." That's the real goal of computer vision. What caused the image? What's going to happen in the future? In my case, I'm interested in people and their movement, their behavior. What are they doing? Why are they doing it? What is their emotional state? What might they do next? These are things that are not observable directly in the pixels. There's no pixel in the image that you can measure that will tell you what's inside somebody's head. Progress on these problems will be slower because the output of our methods will not be so easily measurable. I think this is going to keep me busy for many more ICCVs.
Another big difference between 1990 and today is the participation of industry. Many of the papers at ICCV are from companies. Back in 1990, not many of the things we did worked well enough to be useful. Fortunately, that's not true today and many ideas from ICCV 2021 will make their way into products.
Getting computer vision into the hands of real users is exciting, but I think many people underestimate what it takes to go from a research paper at ICCV through to something that's actually being used by customers on a daily basis. By the time a product is in use, the DNA of the original research is still there, but so many people have contributed to it, and so many ideas have come to it — and so much of what makes it actually a good customer experience has absolutely nothing to do with the original technology — that you can feel like, "I contributed, but it's just a piece of a big puzzle."
I've been fortunate to attend every ICCV since my first one and it has let me see the world and make great friends. As for the future of ICCV, I'm excited to get back to an in-person meeting in 2023 in Paris. The virtual meetings don't even begin to capture the feeling of the real thing. I look forward to seeing you there!
A wide-eyed graduate student, living the dream, in Japan, 1990.
Credit: Inspired by a conversation with Larry Hardesty.
The Perceiving Systems Department is a leading Computer Vision group in Germany.
We are part of the Max Planck Institute for Intelligent Systems in Tübingen — the heart of Cyber Valley.
We use Machine Learning to train computers to recover human behavior in fine detail, including face and hand movement. We also recover the 3D structure of the world, its motion, and the objects in it to understand how humans interact with 3D scenes.
By capturing human motion and modeling behavior, we contribute realistic avatars to Computer Graphics.
To have an impact beyond academia we develop applications in medicine and psychology, spin off companies, and license technology. We make most of our code and data available to the research community.