you can watch the whole event (1.5hr long) at SW WELCOMES GIRLS 8TH – YouTube. i’m sharing the script i used to record my talk below:

hello!

thank you for inviting me to such a wonderful event.

first, let me briefly introduce myself.

i’m Kyunghyun Cho, currently a professor at New York University’s Courant Institute of Mathematical Sciences and Center for Data Science. since august of this year, i have also been serving concurrently as Senior Director of Frontier Research at Genentech.

my research area is machine learning, and within it i study a variety of problems that use artificial neural networks. for the past seven years or so i have worked on applying machine learning to natural language processing and machine translation, and more recently i have been looking into a broader set of problems.

over the past decade or so i have given talks and presentations in many settings, including the conferences of my field. unfortunately, i have hardly ever presented to an audience whose specialty doesn’t overlap with mine, so when i received the invitation to this event, i agonized over what i could possibly talk about.

so i decided to collect questions in advance through the organizers. i worried whether anyone would actually submit questions ahead of time, but thankfully many of you left questions, and i prepared a short message based on them.

before that, let me first tell you about the path that brought me here.

i graduated from Dongjak Middle School and Kyungmoon High School, both within walking distance of home, and then entered KAIST.

at KAIST i studied all sorts of things, took part in all sorts of extracurricular activities, and had a great time. i have never taught in Korea, but watching the undergraduate and graduate students here at NYU all working so hard, i sometimes think i may have belonged to the last generation that could enjoy college life so leisurely and comfortably.

perhaps because of that, as graduation drew nearer i had no particular plan, or even much thought, about what to do afterwards. knowing nothing about artificial intelligence or machine learning, i left for Finland for a master’s degree after seeing a pamphlet that a senior classmate, with whom i was taking a course, happened to pick up in front of the department office.

if i had agonized over the decision, carefully and thoroughly investigated the various options, and made the best possible choice at the time, i don’t think i would have gone to Finland. and if i hadn’t gone to Finland, what choices would i have made, and where would i be now? i simply cannot imagine.

a month or two after starting my master’s, i began spending one day a week participating in research at a lab, not one i volunteered for but one the department assigned me to. that lab, which no longer exists, worked on artificial neural networks, and that is where i too started participating in machine learning and deep learning research. if the department had assigned me to a different lab, what would i be doing now? again, i cannot imagine.

after finishing my PhD, i went to the University of Montreal as a postdoctoral researcher. when i finally arrived at the lab in Montreal and settled into my seat, Yoshua Bengio, the professor who had hired me, came over and asked me what research i wanted to do. until then i had vaguely assumed i would simply continue the research from my PhD, but there was in fact no reason to.

one of the four research topics Yoshua proposed to me was machine translation. it was the most unfamiliar of them, a field i hadn’t even known existed as a research area… it just sounded like so much fun. what if Yoshua hadn’t proposed machine translation, or what if i had picked a more familiar topic instead? once again, i cannot quite imagine where i would be.

usually, when starting research in a new field, one carefully studies the existing results and methods and figures out what has been done well and what is lacking. but having just finished my PhD, one i had completed almost without peers and with the help of two postdocs, i had a confidence and courage that i simply cannot understand today. while reading textbooks on the side now and then, i spent most of my time thinking about how to build a machine translation system from scratch with neural nets, and actually implementing it.

looking back now, it may seem as if i had a grand vision at the time and, by moving forward along that vision, applied a new approach to machine translation and thereby made a small contribution to many recent advances. but of course that cannot be true. having decided to take on a field i knew nothing about, and drunk on useless courage and confidence, i never examined the problem itself carefully, and so every day was a string of trial and error.

having never handled text data before, i struggled even with how to store it and load it back. in the end i settled on compressing plain text with gzip and reading it back line by line.

until my postdoc i had implemented everything myself, using matlab and a python library that a friend in Germany had built as a hobby. in Montreal everyone used Theano, now discontinued, and wanting to learn something new, i moved to Theano as well. it was a completely new paradigm, and i struggled in all sorts of ways. when an experiment went badly, it turned out to be because of a bug i had created by not understanding Theano well; when an experiment went suspiciously well, it also turned out to be because of a bug i had created by not understanding Theano well. and just as i had finally grown comfortable with Theano and begun to gain confidence, Theano was discontinued in 2016..

after two years in Montreal, i moved to New York University in the fall of 2015 and have lived in New York ever since. i had no particular intention of becoming a professor. in fact, people studying deep learning were still a minority back then, and companies like Google and DeepMind were aggressively hiring that minority, so i took it completely for granted that i, like my friends, would go work at one of those companies.

what changed this was a chance encounter on the way to a conference with Nando de Freitas, who had chaired my PhD defense. he asked whether i had thought about a faculty position, and i realized i could become a university professor rather than join a corporate research lab. until then, beyond having shown up regularly at the lab in Montreal for about a year, i had no idea how a graduate research lab is run or what a professor must do to build and run one. but… since i was told it was possible, and since it seemed like a path somewhat different from that of my friends who had joined Google, DeepMind, Facebook and the like, i impulsively decided to apply for faculty positions.

of course, getting a job wasn’t easy. i applied to 40 universities in the US, Canada, the UK, Finland, Switzerland and elsewhere, received interview requests from 6-8 of them, and actual offers from 3. NYU was one of those, and i chose NYU because i wanted to try living in New York (and i love city life), and because NYU looked like the most fun at the time.

along the way i also worked, part-time, as a research scientist at Facebook AI Research for about three years, and having recently sold a protein design company to Genentech, i now also serve as Genentech’s Senior Director of Frontier Research. but since 2015 i have continuously been a professor at NYU, living in New York.

what was meant to be a brief self-introduction has run rather long. but in introducing myself this way, i think i have already answered many of the questions you sent in.

some of you asked about the trial and error i have gone through, in particular after starting research on neural machine translation. it seems i have already answered that. yes, there was an enormous amount of trial and error, and there still is.

everyone listening to this talk, everyone working in fields similar to mine, and me… we are engineers, and an engineer’s job is to create what does not yet exist in the world. that new thing may be a new research field, a new product, or a way to improve an existing product. to accomplish something that does not yet exist, something humanity has not yet figured out, i think trial and error is only natural.

i often tell my PhD students: if you come up with 100 ideas, perhaps one or two of them will be correct, feasible, researchable ideas. if you come up with 100 ideas and every single one of them is correct, feasible and researchable, then it is probably one of three things. first, you may be a genius the likes of which the world has never seen; i hear the probability is very low, but it isn’t impossible. second, you may be looking only at easy, simple, frankly obvious ideas. third, and i hope this isn’t it, you are committing fraud.

unfortunately, trial and error seems unavoidable when pioneering a path no one has taken. but with good communities like the one i see today, where we support one another and understand one another’s trials and errors, i think it becomes more and more doable.

several of you asked what led me, and with what mindset, to leave for Finland and to take on the field i now work in. unfortunately, i have no better answer than that i was really lucky, and that it was a chain of coincidences. if my senior, Yong-Wook, hadn’t brought me that pamphlet, i wouldn’t even have thought of Finland. if the department at Aalto University in Finland hadn’t assigned me to the group doing neural net research, i wouldn’t have thought of studying deep learning. if Yoshua Bengio hadn’t happened to be thinking about machine translation at the very moment i arrived in Montreal, i couldn’t even have imagined doing machine translation research. and if i hadn’t done machine translation research, i wouldn’t have gone to the natural language processing conference held in Doha in 2014, and then Nando wouldn’t have asked me whether i had thought about becoming a professor.

and at every one of those moments (there are in fact far too many coincidences i haven’t mentioned), it’s not as if i had the confidence that i would do well. i simply felt intrigued whenever i heard about such a new option and wanted to give it a try.

one of the most important principles in reinforcement learning, one of the major subfields of machine learning, is “optimism in the face of uncertainty”: when you face an uncertain situation, make the optimistic choice. looking back now, it was precisely because i knew so little that i naturally made optimistic choices, and with plenty of luck and coincidence i ended up here.

having said all that, it isn’t really an answer. i’m sorry.

some of you were curious about what led me to do the kind of work i do now and to think the way i do.

i cannot point to a single moment; then again, every moment since i was born has made who i am now, so perhaps the answer is “all of it.”

that said, when i think hard about when the big changes in my thinking and my life happened, they were mostly the moments when i stepped out of my familiar, comfortable space.

a few weeks after arriving in Finland, as i gradually got used to Finnish university life, learned about Finnish society, and became friends with Finnish students, students from across Europe, and students who had come to Finland from all over the world, i realized that the center of the world shifts depending on where you live. had i not gone to Finland, i probably would never have known how much Estonia and Sweden matter in international affairs.

when i told Yoshua in Montreal that i wanted to do machine translation research… i didn’t even know what machine translation was, but as i said those words and began that research, i felt my narrow view of machine learning, and more broadly of artificial intelligence, open wide.

Timnit Gebru once posted on twitter a photo from the NeurIPS conference of 2014 or 2015. before reading her tweet, i stared at the photo for a good while, without much thought, looking for myself in it. and the moment i read her short tweet, i suddenly began to see what i had not been seeing, or rather, to see what was not in the photo. only then did i realize that among that huge crowd of attendees, there were hardly any women and no Black researchers at all.

it may sound like a non sequitur, but in this sense i am always strongly in favor of going abroad.

finally… i saw one really fun question.

“you have become a world-famous scientist. i am curious how that feels.”

thank you for seeing me so favorably. through a chain of coincidences, i happened to be in good places with good people at good times, and so arrived at this position rather comfortably.

but there is a thought that comes to me every single day.

when i talk with the deep learning researchers of the generation before mine (for example, geoff hinton, yann lecun, yoshua, juergen schmidhuber) and follow their research, i see that they had a true vision: even in the days when the environment was far too poor for deep learning to show real results, they never stopped doing deep learning research, and they pioneered the field of deep learning as we know it. compared with them, i didn’t even know what AI or deep learning was; i started deep learning research because, one way or another, a senior handed me a pamphlet and the department i enrolled in assigned me to a lab; and with lucky timing i finished my PhD just as deep learning was taking off and comfortably landed a professorship.

luck has brought me this far… but can i, like those pioneers, keep looking ahead and keep doing research steadily? it worries me a great deal.

of course, it is not only the generation before me. early in my graduate years, at machine learning conferences like NeurIPS you could count the deep learning papers on your fingers; that is how few students were working on artificial neural networks. now, not only at machine learning conferences but at the conferences of any field even remotely related to artificial intelligence, most of the papers involve deep learning. that is how tremendously many students are researching fiercely in this field.

supervising students at NYU and taking part in admissions, i have come to know quite a lot about these students.

they are truly formidable. i started studying deep learning knowing nothing at all, without even the underlying mathematics and statistics, whereas the students studying today are so well prepared academically and even come with all kinds of research and development experience. and yet, because of the fierce competition, they all struggle a great deal, researching and studying under far harsher conditions than in my day.

honestly… when i look at these students and my juniors, i feel nothing but apologetic. i strut around with the title of professor, but do i really deserve it? if i started my master’s and PhD over again today, could i make it back to this position? probably not.

how does it feel…? it is agonizing.

once again, thank you for inviting me to such a wonderful event. it is a bit late here, but i will see you all shortly, if only online.

first, CIFAR started a program named “Neural Computation & Adaptive Perception” (NCAP) in 2004, supporting research in artificial neural networks, which has since become a dominant paradigm in machine learning and, more broadly, in artificial intelligence and all adjacent areas, including natural language processing and computer vision. i started my graduate study in 2009 with a focus on restricted Boltzmann machines and graduated in 2014 with a PhD degree, which makes me perhaps *the* one who has benefited *most* from this success of deep learning. since this success was fostered by CIFAR’s NCAP program starting already in 2004, i could even attribute a large part of my career to CIFAR and its NCAP program. i often wonder what would’ve happened to me and my career post-graduate school, had CIFAR decided to start and support another program instead. i can only guess it would’ve been very different and that i certainly would’ve been worse off.^{@}

second, CIFAR sponsored the very first publicly open summer school on deep learning, hosted by UCLA IPAM in 2012. i was a graduate student at Aalto University in Finland back then. for a number of reasons, political, financial and technical, the Bayes group, to which i belonged back then and which was actually a “neural net” group despite its name, had by then pretty much stopped taking in new students or postdocs. i was in desperate need of meeting peers and talking with them about neural net research (i still wasn’t too familiar with the term “deep learning”, just like many others back then), not to mention that i really needed to take some courses and learn about various technical aspects of deep learning beyond the limited selection of courses offered at Aalto at the time (i mean… the neural net group was essentially at the brink of being dissolved, although that is for another post.) i then learned about this “Graduate Summer School: Deep Learning, Feature Learning” and did not hesitate a second to apply for a seat. it was a three-week-long program filled with a series of amazing lectures and lab sessions, allowing me to finally get the bigger picture and learn the technical details behind various algorithms and paradigms.* it was pretty intense, but it was just the right level of intensity i needed back then. i wonder what my PhD thesis would’ve looked like had i not attended this summer school, or even worse, had CIFAR not sponsored it in the first place. what a scary thought!

third, as a postdoc at the University of Montreal, i attended the 2014 edition of the annual summer school organized by CIFAR NCAP (now called Learning in Machines and Brains (LMB)), hosted at the University of Toronto. it was a very exciting summer school, following up on the series of CIFAR NCAP summer schools organized ever since NCAP was created in 2004. the entire summer school fit into one reasonably small lecture room at U. Toronto, with a series of lectures and student talks. because we were all crammed into a single lecture room (talk about pre-pandemic!) it was intensely interactive, and i was just learning so much during those 2-3 days. at this summer school, i presented ongoing work on machine translation (so did Ilya Sutskever, who gave a much better, slicker and more prophetic talk). this is where i coined the term “*neural machine translation*“, which i believe may be the only lasting contribution i’ve made to the field of machine translation (and i’m proud of myself for it!) in fact, after the school that day, we all went to one of the dive bars where UT grad students used to hang out (i can’t really recall the name anymore..) and toasted to “neural machine translation”.^{#}

finally, CIFAR has been running a number of programs aimed at the scientific and social aspects of research, such as a global scholar program sponsored by the Azrieli Foundation, called the CIFAR Azrieli Global Scholars Program, and an AI Catalyst program. the Global Scholars program provides a set of opportunities for early-career scholars from a diverse set of disciplines, spanning from political science all the way to cosmology, to not only advance their science but also interact with peers from various disciplines to build up a broader view both within science and across society. the AI Catalyst program, on the other hand, provides funding for proof-of-concept, exploratory and blue-sky projects in order to continue to fuel scientific & societal innovation. i’ve benefited from both of these programs. i was a CIFAR Azrieli Global Scholar from 2017 to 2019 and thoroughly enjoyed my interactions with peer Global Scholars from a diverse set of disciplines, including cosmology, quantum physics, journalism, biology, etc. i received a Catalyst grant last year (2020), which has allowed me to work with Prof. Jimmy Lin at U Waterloo to build Neural Covidex, a specialized search engine for COVID-19-related literature, and make it publicly available at https://covidex.ai/. truly, these programs have enabled me to go above and beyond my comfort zone both scientifically and socially.

it’s pretty clear i have tremendously benefited from CIFAR over the past decade or so, and perhaps only naturally i want others to experience and benefit from being part of CIFAR both scientifically and socially. in particular, i want scientists from a diverse set of backgrounds and disciplines to enjoy such opportunities, in line with how CIFAR is “*committed to creating a more diverse, equitable, and inclusive environment*.”

going beyond wanting and wishing this, i’ve decided to contribute to this cause more directly by donating $50,000 USD to CIFAR so that CIFAR can “*provide funding resources in support of women and researchers from underrepresented groups to attend professional development opportunities.*” it is certainly not a lot, and the impact of this donation on its own will be quite limited. i only wish it would nudge people, including organizations such as governments and companies, to think once more about the important roles played by CIFAR and organizations like it in supporting innovation and promoting diversity, inclusion and equity in science.

P.S. my little birds told me that my co-conspirator in Prescient Design, Richard Bonneau, is planning to make a similar donation to support CIFAR’s commitment to improving diversity in science. thanks, Rich!

(@) well.. perhaps most objectively, i wouldn’t have been a Fellow of the Learning in Machines and Brains (LMB) program of CIFAR

(*) oh, i forgot to mention this even more important tidbit: Geoff Hinton “pronounced” the success of deep convolutional nets for ImageNet and “described” dropout at this summer school approximately five months ahead of NeurIPS 2012.

(#) these toasts were mainly led by Jamie Kiros who has become my dear friend ever since.

this time, the random stuff is contrastive learning. my thoughts on this were sparked by Lerrel Pinto’s message on #random in our group’s Slack, responding to the question “*What is wrong with contrastive learning?*” thrown out by Andrew Gordon Wilson. Lerrel said,

Lerrel Pinto (2021)

My understanding is that getting negatives for contrastive learning is difficult.

i haven’t worked on the (post-)modern version of contrastive learning, but every time i hear of “*negative samples*” i am reminded of my phd years, during which i mainly worked on restricted Boltzmann machines. a restricted Boltzmann machine defines a distribution over the observation space as

$$p(x; W, b, c) \propto \exp(x^\top b) \prod_{j=1}^J (1+\exp(x^\top w_{\cdot, j} + c_j)),$$

where $W$, $b$ and $c$ are the weight matrix, the visible bias and the hidden bias, respectively. for simplicity, i’ll assume the visible bias is $0$, which is equivalent to saying that the input is in expectation an all-zero vector. this makes the definition above a bit simpler, especially when we look at the log-probability:

$$\log p(x; W, c) = \sum_{j=1}^J \log (1+\exp(x^\top w_{\cdot, j} + c_j)) - \log Z,$$

where $\log Z$ is the log-partition function or log-normalization constant.
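the hidden-unit sum above (with the visible bias dropped) is easy to compute; here is a minimal numpy sketch of the unnormalized log-probability, with made-up toy sizes and weights:

```python
import numpy as np

def softplus(a):
    # numerically stable log(1 + exp(a))
    return np.logaddexp(0.0, a)

def unnorm_log_prob(x, W, c):
    # sum_j log(1 + exp(x^T w_j + c_j)): the unnormalized log p(x; W, c),
    # i.e. everything except the intractable -log Z term
    return softplus(x @ W + c).sum(axis=-1)

# toy setup: 5-dimensional binary inputs, 3 hidden units (experts)
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))
c = np.zeros(3)
x = rng.integers(0, 2, size=5).astype(float)
print(unnorm_log_prob(x, W, c))
```

as a sanity check: with $x = 0$ and $c = 0$, every expert contributes $\log 2$, so the value is $J \log 2$.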

the goal of learning with a restricted Boltzmann machine is then to maximize the log-probabilities of the observations (training examples):

$$\max_{W, c} \mathbb{E}_{x \sim D} [\log p(x; W, c)],$$

using stochastic gradient descent, with the stochastic gradient given by

$$g_{\theta} = \sum_{j=1}^J \nabla_\theta \log (1+\exp(x^\top w_{\cdot,j} + c_j)) - \mathbb{E}_{x_- \sim p(x; W,c)} \left[\sum_{j=1}^J \nabla_\theta \log (1+\exp({x_-}^\top w_{\cdot,j} + c_j))\right].$$

the first term ensures that each hidden unit (or expert) $j$ is well aligned with the correct observation $x$ drawn from the data distribution (or training set.) not too surprising, since the alignment (dot product) between the expert weight $w_{\cdot, j}$ and a given observation gives rise to the probability of $x$.

the second term corresponds to computing the expected negative energy (ugh, i hate this discrepancy; we maximize the probability but we minimize the energy) over all possible observations according to the model distribution. what this term does is to look for all input configurations $x_-$ that are good under our current model and to make sure the hidden units (or experts) are not well aligned with them.

you can imagine this as playing whac-a-mole. we try to pull out our favourite moles, while we “whac” any mole that’s favoured by the whac-a-mole machine.

in training a restricted boltzmann machine, the major difficulty lies in how to efficiently and effectively draw negative samples from the model distribution. a lot of bright minds at the University of Toronto and the University of Montreal back then (somewhere between 2006 and 2013) spent years figuring this out. unfortunately, we (as a field) never got it to work well, which is probably not surprising since we’re talking about sampling from an unnormalized (often discrete) distribution over hundreds if not thousands of dimensions. if it were easy, we would’ve solved most of the problems in ML already.
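for concreteness, the standard way to attempt those negative samples is block Gibbs sampling, alternating between the hidden and visible conditionals (the chain underlying contrastive divergence); a short sketch for a binary restricted Boltzmann machine with zero visible bias and made-up toy weights:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gibbs_negative_sample(x0, W, c, n_steps=10, rng=None):
    # approximate x_- ~ p(x; W, c) by alternating
    #   h ~ p(h | v) = sigmoid(v^T W + c)   (hidden given visible)
    #   v ~ p(v | h) = sigmoid(W h)         (visible given hidden, zero visible bias)
    if rng is None:
        rng = np.random.default_rng()
    v = x0.copy()
    for _ in range(n_steps):
        h = (rng.random(W.shape[1]) < sigmoid(v @ W + c)).astype(float)
        v = (rng.random(W.shape[0]) < sigmoid(W @ h)).astype(float)
    return v

rng = np.random.default_rng(1)
W = rng.normal(size=(5, 3))
x_neg = gibbs_negative_sample(np.zeros(5), W, np.zeros(3), n_steps=50, rng=rng)
```

the catch is exactly the one above: in hundreds or thousands of dimensions this chain mixes far too slowly for the resulting $x_-$ to be a faithful sample from the model.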

let’s consider a stochastic transformation $T: \mathcal{X} \to \mathcal{X}$, where $\mathcal{X}$ is the input space. given any input $x \in \mathcal{X}$, this transformation outputs $\tilde{x} \sim T(x)$ that very likely maintains the same semantics as the original $x$. this is often used for data augmentation, which has been found to be a critical component of contrastive learning (or, as a matter of fact, of any so-called self-supervised learning algorithm).

imagine a widely used set of input transformations in e.g. computer vision. $T$ would include (limited) translation, (limited) rotation, (limited) color distortion, (limited) elastic distortion, etc. we know these transformations often in advance, and these are often domain/problem-specific.

what we will now do is to create a very large set of hidden units (or experts) by drawing transformed inputs from the stochastic transformation $T$ for one particular input $x$. that is, we have $J$-many $\tilde{x}_j \sim T(x)$. in the case of computer vision, we’ll have $J$-many possible distortions of $x$ that largely maintain the semantics of $x$.

these hidden units then define a restricted Boltzmann machine and allow us to compute the probability of any input $x'$:

$$\log p(x' | \tilde{x}_1, \ldots, \tilde{x}_J) = \sum_{j=1}^J \log (1+\exp(s(x',\tilde{x}_j))) - \log Z,$$

where i’m now using a compatibility function $s: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ instead of the dot-product for more generality.

starting from here, we’ll make two changes (one relaxation and one restriction). first, we don’t want to use only $J$ many transformed copies of the input $x$; we want to use all possible transformed versions of $x$ out of $T$. in other words, we want to relax the construction so that this restricted Boltzmann machine is no longer limited to a finite number of hidden units. this turns the equation above into:

$$\log p(x' | x, T) = \mathbb{E}_{\tilde{x} \sim T(x)}\left[ \log (1+\exp(s(x',\tilde{x})))\right] - \log Z.$$

second, we will assume that the input space $\mathcal{X}$ coincides with the training set $D$, which has a finite number of training examples, i.e., $D=\left\{ x_1, \ldots, x_N \right\}$. this second change affects only the second term (the log-partition function):

$$\log p(x' | T(x)) = \mathbb{E}_{\tilde{x} \sim T(x)}\left[ \log (1+\exp(s(x',\tilde{x})))\right] - \log \sum_{n=1}^N \exp\left(\mathbb{E}_{\tilde{x} \sim T(x)}\left[ \log (1+\exp(s(x_n,\tilde{x})))\right]\right).$$
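since the support is now just the $N$ training examples, this log-probability is directly computable. here is a small numpy transcription (a sketch with made-up toy data), using a Monte Carlo estimate over $M$ transformed copies in place of the expectation and a plain dot product as a placeholder compatibility function:

```python
import numpy as np

def softplus(a):
    # numerically stable log(1 + exp(a))
    return np.logaddexp(0.0, a)

def log_prob_given_transforms(x_prime, x_tilde, data, s=np.dot):
    # log p(x' | T(x)): x_tilde is an (M, d) array of transformed copies of x,
    # a Monte Carlo stand-in for the expectation over T(x); data is the (N, d)
    # training set serving as the model's support; s is the compatibility function
    def unnorm(y):
        return np.mean([softplus(s(y, t)) for t in x_tilde])
    scores = np.array([unnorm(x_n) for x_n in data])
    # log-partition: log-sum-exp over the finite support (the training set)
    log_z = scores.max() + np.log(np.exp(scores - scores.max()).sum())
    return unnorm(x_prime) - log_z

rng = np.random.default_rng(0)
data = rng.normal(size=(8, 4))                 # toy training set
x = data[0]
x_tilde = x + 0.1 * rng.normal(size=(5, 4))    # 5 "transformed" copies of x
print(log_prob_given_transforms(x, x_tilde, data))
```

a useful sanity check: exponentiating this log-probability and summing over all $N$ training examples gives one, confirming the normalization over the restricted support.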

to summarize what we’ve done so far: we build one restricted Boltzmann machine for a given input $x \in \mathcal{X}$ by drawing its hidden units (or experts) from the transformation distribution $\tilde{x} \sim T(x)$. the support of this restricted Boltzmann machine is restricted (pun intended) to the training set.

what would be a good training criterion for one such restricted Boltzmann machine? the answer is almost always maximum likelihood! in this particular case, we want to ensure that the original example $x$ is most likely under the restricted Boltzmann machine induced by itself:

$$\max_{\theta} \log p(x | T(x)),$$

where $\theta$ is the parameters for defining the compatibility function $s$ from above.

we do so for all $N$ restricted Boltzmann machines induced from $N$ training examples:

$$\max_{\theta} \frac{1}{N} \sum_{n=1}^N \log p(x_n | T(x_n)).$$

since the objective decomposes over the training examples, let’s consider only one example $x \in D$. we then train the induced restricted Boltzmann machine with stochastic gradient descent, following

$$\frac{1}{M} \sum_{m=1}^M \nabla_{\theta} \log (1+\exp(s(x, \tilde{x}_m; \theta))) - \frac{1}{M} \sum_{m=1}^M \sum_{n=1}^N p(x_n|T(x)) \nabla_{\theta} \log (1+\exp(s(x_n, \tilde{x}_m; \theta))),$$

where we use $M$ transformed copies to approximate the two expectations over $T(x)$, but not $p(x_n|T(x))$. we probably should use another, independent set of $M$ transformed copies to get an unbiased estimate.
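to make the update concrete, here is a numpy sketch of one ascent step, using a bilinear compatibility $s(a, b; W) = a^\top W b$ (my choice purely for illustration, since the gradient of $\log(1+\exp(s))$ then has the closed form $\sigma(s)\, a b^\top$), with the negative phase weighted by $p(x_n|T(x))$ as in the formula:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softplus(a):
    return np.logaddexp(0.0, a)

def log_prob(x, x_tilde, data, W):
    # log p(x | T(x)) with the support restricted to the training set,
    # for the bilinear compatibility s(a, b; W) = a^T W b
    unnorm = lambda y: np.mean([softplus(y @ W @ t) for t in x_tilde])
    scores = np.array([unnorm(x_n) for x_n in data])
    return unnorm(x) - (scores.max() + np.log(np.exp(scores - scores.max()).sum()))

def sgd_step(x, x_tilde, data, W, lr=0.01):
    # one ascent step on log p(x | T(x)); for the bilinear compatibility,
    # grad_W log(1 + exp(s(a, b))) = sigmoid(s(a, b)) * outer(a, b)
    M = len(x_tilde)
    # positive phase: align x with its own transformed copies
    pos = sum(sigmoid(x @ W @ t) * np.outer(x, t) for t in x_tilde) / M
    # weights p(x_n | T(x)): softmax of unnormalized log-probs over the training set
    scores = np.array([np.mean([softplus(x_n @ W @ t) for t in x_tilde])
                       for x_n in data])
    p = np.exp(scores - scores.max())
    p /= p.sum()
    # negative phase: push every training example away, weighted by p(x_n | T(x))
    neg = sum(p_n * sigmoid(x_n @ W @ t) * np.outer(x_n, t)
              for p_n, x_n in zip(p, data) for t in x_tilde) / M
    return W + lr * (pos - neg)

rng = np.random.default_rng(0)
data = rng.normal(size=(6, 3))                 # toy training set
x = data[0]
x_tilde = x + 0.05 * rng.normal(size=(4, 3))   # 4 "augmented" copies of x
W = 0.01 * rng.normal(size=(3, 3))
W_new = sgd_step(x, x_tilde, data, W)
```

because this is exact gradient ascent on the training-set-supported log-probability, a small enough learning rate increases $\log p(x | T(x))$ at every step.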

this looks quite similar to the more recently popular variants of contrastive learning. we start from a training example $x$, generate a transformed version $\tilde{x}$, maximize the compatibility between $x$ and $\tilde{x}$, and minimize the compatibility between $\tilde{x}$ and all the training examples (including $x$). there are minor differences, such as the choice of nonlinearity, but at a high level, it turns out we can derive contrastive learning from the restricted Boltzmann machine.

perhaps the only major difference is that this formulation gives us a clear guideline on how we should pick the negative examples. that is, according to this formula, we should either use all the training examples weighted according to how likely they are under this $x$-induced restricted Boltzmann machine or use a subset of training examples drawn according to the $x$-induced restricted Boltzmann machine without further weighting. of course, another alternative is to use uniformly-selected training examples as negative samples but weight them according to their probabilities under the $x$-induced restricted Boltzmann machine, *à la* importance sampling.

so, yes, contrastive learning can be derived from restricted Boltzmann machines, and this is advantageous, because it tells us how we should pick negative examples. in fact, as i was writing this blog post (and an earlier internal slack message), i was reminded of a recent workshop i attended together with Yoshua Bengio. there was a talk on how to choose *hard* negative samples for contrastive learning (or representation learning) on graphs, and after the talk was over, Yoshua raised his hand and made this remark:

Yoshua Bengio (2019, paraphrased)

That’s called Boltzmann machine learning!

Indeed…

based on this exercise of deriving modern contrastive learning from restricted Boltzmann machines, we now have a meta-framework for coming up with contrastive learning recipes. any recipe must consist of three major ingredients:

- **A per-example density estimator**: i used the restricted Boltzmann machine, but you may very well use variational autoencoders, independent component analysis, principal component analysis, sparse coding, etc. these will give rise to different variants of self-supervised learning. the latter three are particularly interesting, because they are fully described by a set of basis vectors and don’t require any negative samples for learning. i’m almost 100% certain you can derive all these non-contrastive learning algorithms by choosing one of these three.
- **A compatibility function** $s$: this is the part where we design a network “architecture” and decide how the output from this network is used to compute a scalar that indicates how similar a pair of examples is. it looks like the current practice is to use a deep neural net with a cosine similarity to implement this compatibility function.
- **A stochastic transformation generator**: this generator effectively generates a density estimator for each example. this is very important, since it defines the set of bases used by these density estimators. no aspect of the data can be modelled if the generated bases do not cover it.
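as a concrete instance of the second ingredient, here is what the current practice mentioned above might look like, sketched with a toy two-layer encoder (the weights, sizes and temperature are all made up for illustration) and a temperature-scaled cosine similarity:

```python
import numpy as np

def encoder(x, W1, W2):
    # a tiny two-layer net standing in for the deep "architecture"
    return W2 @ np.tanh(W1 @ x)

def compatibility(a, b, W1, W2, tau=0.1):
    # s(a, b): cosine similarity between the encoded views, divided by a
    # temperature tau, as in common contrastive-learning recipes
    za, zb = encoder(a, W1, W2), encoder(b, W1, W2)
    cos = za @ zb / (np.linalg.norm(za) * np.linalg.norm(zb))
    return cos / tau

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(6, 8))
a, b = rng.normal(size=4), rng.normal(size=4)
print(compatibility(a, b, W1, W2))
```

note that, unlike the unbounded dot product of the original restricted Boltzmann machine, this $s$ is symmetric and bounded in $[-1/\tau, 1/\tau]$.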

we have a pretty good idea of what kind of density estimator is suitable for various purposes. we have a pretty good idea of the best way to measure the similarity between two highly complex, high-dimensional inputs (thanks, deep learning!) but we cannot know what the right stochastic transformation generator should be, because it is heavily dependent on the problem and domain. for instance, the optimal transformation generator for static, natural images won’t be optimal for, e.g., natural language text.

so, my sense is that the success of using contrastive learning (or any self-supervised learning) for any given problem will ultimately boil down to the choice and design of stochastic transformation, since there’s a chance that we may find a near-optimal pair of the first two (density estimator and compatibility function) that works well across multiple problems and domains.

- Ho-Am Prize & Scholarship for Macademia at Aalto University
- Ho-Am Prize & 백규고전학술상 (Baek-Gyu Scholarly Award for Classics)
- Ho-Am Prize & Lim Mi-Sook Scholarship (임미숙 장학금) at KAIST

i graduated from the Korea Advanced Institute of Science and Technology (KAIST) with a Bachelor of Science (B.Sc.) degree. i majored in computer science, a subject i’ve never left since, having become a professor of computer science (and data science) in 2015. although my undergraduate years were, in terms of education, closer to failure than success (which is extremely visible on my transcript), i thoroughly enjoyed my days at KAIST and have fond memories of the years i spent there.

although the whole field, including myself, has become much more aware of the issue of gender imbalance in computer science in recent years, the issue was already super-clear back when i was in my undergraduate years. my memory is definitely failing me, but i recall there were fewer than five, if not four, female students out of the approximately 60-70 students in my cohort. of course, this awareness did not mean that i felt any issue with it or felt compelled to do something about it. it just felt only natural back then that boys majored in computer science and girls in biology (yes, i’m simplifying quite a bit here, but this is how it seemed to me back then.)

perhaps this is precisely what my mom and others in the family felt back when i was born. before i was born, my mom was a teacher in a (junior) high school, teaching Korean. my mom and dad graduated from the same university for their undergraduate degrees, after which my mom became a teacher and my dad decided to pursue higher degrees, eventually becoming a professor of korean literature. clearly both of them had the same level of education up to a certain point, but at that point mom gave up her career to raise me and my younger brother, who was born less than 2 years after me. again, i’m sure this was a choice that felt only natural back then.

unfortunately, it’s been about 20 years since i started my undergrad years at KAIST, and the issue of gender balance in computer science hasn’t gotten any better. in fact, this issue, which i didn’t even realize existed back then, turned out to be just the tip of the iceberg. the field of computer science, or perhaps more narrowly machine learning, is riddled with imbalances: gender imbalance, geographical imbalance (over-representation of north america, europe and east asia over other parts of the world), imbalance across races (6 black researchers out of more than 5,000 attendees of NeurIPS 2017, as noticed by Timnit Gebru), and many more.†

these issues are somehow “discovered” each day, but the truth is that we are barely freeing ourselves from the social constructs that have blinded us, or have convinced us that these imbalances are only natural. this is just like how i never thought it was an issue that boys majored in computer science while girls majored in biology when i was a sophomore. this is just like how my mom quit her job to raise me and my brother more than 35 years ago, a choice i’m sure no one questioned then.

i don’t have any solution to this issue of social blindness, but one thing i have become aware of is that one cannot see what is not there for them to see. when i was one of the 90% or more of boys who majored in computer science 18 or so years ago, i couldn’t see the problem. when i was one of the 90% or so of non-black, male researchers attending ICML and NeurIPS over many years, i couldn’t see the problem. i mean, i was having beer, tequila, etc. non-stop together with Yann Dauphin, but i couldn’t see this near-complete lack of black researchers as a troubling trend at all. i only started to see these problems of equal access, equity, etc. when i started to see people raising these issues and bringing them to my attention. in other words, the one remedy i know and have experienced myself is to create a diverse environment in which each individual can see and interact with diverse individuals and hear their stories.

so, as a small effort toward helping build such diverse environments, i have decided to donate approximately ₩100,000,000 KRW (≈ \$91,000 USD) to the Department of Computer Science, School of Computing at KAIST to create a small scholarship named after my mom (Lim Mi-Sook 임미숙) that will provide a small amount of supplement (≈ \$900) each to a small group of female students who major in computer science, at the beginning of each semester, until the fund runs out.^{∘} it’s not a lot, but it never hurts to have some extra allowance at the beginning of each semester. they might use it for buying a new iPad for either taking better notes in their classes or watching Netflix more comfortably. they might use it to hang out with their friends and have some nice meals. they might use it to pay for their hobbies.^{⊚} however they spend it, i only hope this would encourage them to continue their study in computer science and to encourage others to join computer science in the future, thereby contributing toward building a more diverse community of computer scientists (so that my little niece will eventually want to study computer science and be a computer scientist.) furthermore, i wish this will help us, including myself, more easily and readily see and break ourselves free from these social constructs/biases that unfairly disadvantage and harm subsets of population.

finally, here’s why i named it after my mom: although i structured this scholarship to be from my mom, it won’t let me or my mom answer how her career would’ve turned out had she not given up on it when i was born. it will, however, make all of us think more about the burden of raising children that is so often placed disproportionately on mothers, and how it should be better distributed among parents, relatives and society, in order to ensure and maximize equity in education, career development and advancement.

† more and more organizations and initiatives are founded to address these challenges, including Women in Machine Learning, Black in AI, etc. (see e.g. the Diversity, Equity and Inclusion page of ICLR’21.) these are organizations that make me proud to be a part of this research community.

∘ oh, and i asked the department to arrange a lunch between my parents and these students each semester. i think my parents will love talking with them, and i hope the students will also enjoy the lunch.

⊚ see my earlier post <Giving thanks: Samsung AI Researcher of the Year Award and Donation to Mila> for more of my thoughts on this.

- Ho-Am Prize & Scholarship for Macadamia at Aalto University
- Ho-Am Prize & 백규고전학술상 (Baek-Gyu Scholarly Award for Classics)
- Ho-Am Prize & Lim Mi-Sook Scholarship (임미숙 장학금) at KAIST

i’ve rarely mentioned my father in this blog, for no particular reason, but perhaps this post is a good time to talk about him briefly.

his name is Kyu-Ick Cho (조규익), and he’s a professor of Korean Language and Literature at Soong-sil University in Seoul, Korea. perhaps unsurprisingly, i don’t know much about Korean language or literature, not to mention Korean *classical* literature and art, in which he is one of the world’s leading experts. i only know the few things i picked up here and there about his research as i grew up. unfortunately, i’m way out of my depth & breadth to even list what he has worked on, done and continues to work on, although i can point you to his homepage (http://kicho.pe.kr/), where you can find the ever-growing list of books and papers he has authored (warning: all in Korean).

one thing i can talk about is how, just watching my father from the side, i came to see the stark difference between how things work in engineering/science and in the humanities. when it comes to research in Korean classical literature and art, intellectual curiosity and perhaps intellectual responsibility are what truly matter. you do not build anything new that may change the world. you do not discover something that may change the world. you do not learn skills that may make you valuable to for-profit organizations. your research is probably not supported by deep-pocketed industry, and if it is supported by the government, it is at a level that barely keeps you alive. it’s pretty much all about fulfilling your intellectual curiosity and carrying out your duty and responsibility as an academic.

although the korean economy has grown tremendously, this doesn’t necessarily translate into increased investment in humanities research, especially in those areas of the humanities that do not translate immediately into economic value. korean classical literature and art is clearly one such area, where no one expects any *return* on investment at any time. after all, it is *literature* and *art*, and perhaps worse yet, it is *classical*.

there are many negative consequences of such plateaued or shrinking investment that i’d love to talk about at length. in this post, however, let me stick to just one particular consequence: such lack of investment discourages (if not outright prevents) researchers from pursuing their intellectual curiosity and responsibility, thereby effectively serving as a death sentence for the field. to see what i mean immediately, imagine how you’d react if your kid announced they’d pursue a PhD in Korean Literature.

perhaps surprisingly, i find it quite disturbing that we may be looking at a serious chance that, at some point in the not-too-distant future, there won’t be anyone left to study and research korean classical literature and art. out of the few things that set us (humans) apart from other intelligent species, literature and art, closely related to each other with their boundary growing fuzzier the further back in time we go, are clearly at the forefront of these unique features of ours. if we can’t afford to spare our effort & time on creating, enjoying and preserving these artifacts ourselves, what are we really doing here?

of course, despite this shrinking investment in korean classical literature & art research, researchers in this field have not given up, including my father. in order to build an environment to accommodate more junior and less established researchers in the field, he founded a research center at Soong-Sil University, named the Center for Korean Literature & Art, in 2006 and has run it ever since. this research center has its own journal that publishes 3-4 issues each year. it hosts annual conferences that gather a small number of researchers dedicated to korean literature & art. it publishes many books each year. as far as i can tell, the center is not growing in headcount, but its activities, as well as the coverage of research areas within korean classical literature and art, have steadily grown over the past decades.

so, yes, he is really trying hard together with a small number of his colleagues and peers. in fact, he’s been doing so ever since he began his career as a professor of korean language and literature in the mid-80’s, although from what little i’ve seen from the side, this has been an uphill battle. and, with his retirement now a year away, the future of korean classical literature and art does not look particularly bright.

when i was a kid, i recall one year (1996) when my father received two highly respected awards. one was the Do-Nam Award for Korean Literature Research (도남국문학상), and the other was the Seong-San Award ~~for Korean Classical Poetry Research~~ (성산~~시조~~학술상). obviously i wasn’t aware of how big a deal these awards were back then, nor do i know how big a deal they are even now. i could however tell that they must be big deals, because i could sense the pride in my father’s eyes when he broke the news. i even remember attending the ceremony for one of these awards (not sure if i attended both, though. my memory is failing me here.)

that was 25 years ago, when my father was still considered junior (i mean… it’s the field of Korean *classical* literature and art, where everyone stays junior forever.) these prizes must’ve meant quite a bit to him, in that they not only recognized his research but also encouraged him to advance it further. noticing that these two awards are always mentioned in his bios as well as his CVs, i presume i’m not too wrong about this.

unfortunately, it doesn’t look like either of these awards exists anymore. i could trace the Do-Nam award up to 2008, but i couldn’t find any information about it since then. in fact, i couldn’t even find the list of awardees from a few minutes of Googling (and Navering). the same goes for the Seong-San award: i could trace it up to 2003 or so, but again i can’t find anything substantial about it. it’s quite a shame. two prominent ways to recognize and encourage researchers in this relatively narrow field of korean classical literature and art seem to have been lost over time (although these awards were not only for classical literature & art but recognized achievements in the broader field of korean literature.)

no individual will be able to save the whole field of korean classical literature and art. it’ll have to be the whole society’s effort to save this field, and along the way our soul as well. my father has devoted his entire career to this cause and will continue to do so even after his retirement, although his forecast grows gloomier each time i talk with him. to this end, i’ve decided to contribute just a little myself to this effort of saving, and perhaps even growing, research in Korean classical literature and art by donating ₩100,000,000 (approx. $90,000 USD) to the Center for Korean Literature and Art with the stipulation that it be used to create an award for Korean classical literature and art.

this award will be given to 1-2 researchers each year, each with approximately $2,000-5,000 (to be determined by the Center’s Board each year), until the fund runs out, with the hope that it can be used to recognize the achievements of, and encourage the future endeavors of, researchers in the field of Korean classical literature and art, just as those two awards above did for my father and as the Ho-Am Prize is doing for me.

oh, right, i almost forgot to mention: i’ve also added one small condition, that this award be named after my father’s pen name^{*} 백규 (Baek-Gyu, 白圭). so this award, which will hopefully start being awarded next year (2022), will be called the Baek-Gyu Award in the field of Korean Classical Literature and Art (백규고전학술상).

* 호; i’m not sure what the right translation of this is in English. it’s a kind of nickname given by another person, often a teacher or a fatherly figure.

**Note**: This is the first in a series of up to three posts related to the Ho-Am Prize I was awarded this year.

- Ho-Am Prize & Scholarship for Macadamia at Aalto University
- Ho-Am Prize & 백규고전학술상 (Baek-Gyu Scholarly Award for Classics)
- Ho-Am Prize & Lim Mi-Sook Scholarship (임미숙 장학금) at KAIST

What an honour it has been to be a recipient of the Samsung Ho-Am Prize in Engineering this year (2021)! The Ho-Am Prize is one of the biggest and perhaps most recognized awards in Korea. Quoting the Ho-Am Foundation directly:

The Prize is presented each year to individuals who have contributed to academics, the arts, and social development, or who have furthered the welfare of humanity through distinguished accomplishments in their respective professional fields.

In particular, the Ho-Am Prize in Engineering is awarded to “*people of Korean heritage whose accomplishments have contributed to the development of industry for greater prosperity for humanity.*“

I’m quite certain that nothing i’ve done so far comes anywhere close to contributing to either the development of industry or greater prosperity for humanity. but, i take it that this Prize was awarded to me not for my individual achievement but to recognize “*what we have been able to collectively achieve over many decades in the field of deep learning and more broadly artificial intelligence and data science.*“^{*}

regardless of whether the Prize celebrates my own achievement or the set of achievements we have made collectively, it turned out that i am the one who receives “*cash prize of KRW 300million (approx. 275,000 USD)*“. I KNOW! this is the biggest cash prize i’ve ever received. in fact, i could even say this is by far the biggest chunk of money i’ve received at once, and the second largest one does not even come close to it.

since i take it that this Prize recognizes our field rather than myself as an individual, i’ve decided to use this enormous cash prize not for myself but to serve a broader society. because it’s a pretty hefty prize, i’ll spend it in 2-4 distinct ways over the next few months, and in this post, i’ll share with you my first attempt at giving away this cash prize.

one of the most fortunate moments in my career so far was one day in Fall 2008. my friend (Yongwook) and i were taking a course designed for freshman students in a non-computer-science major, when both of us were very, very, very far past our freshman years. perhaps obviously, we were always sitting at the very back of a large lecture hall with the sole goal of finally graduating from the university at some point. one day, Yongwook showed up a bit late, rushed into the lecture hall and sat down next to me. he then showed me a (possibly the ugliest) brochure he had picked up in front of the department office on his way to the lecture hall. it was a brochure sent to KAIST Computer Science by Aalto University (back then Helsinki University of Technology) about the (relatively) new international master’s program in **mac**hine learning **a**nd **da**ta **mi**ning. the program was named “**Macadami**a” (no idea where the final “a” comes from.)^{∘}

until then, i never planned to continue my study beyond my undergraduate degree, i never thought of going abroad for studying further, and i never even imagined moving to Finland. but, somehow, there it was: the pamphlet from Finland, telling me about this master’s program in machine learning and data mining. within a few months, i was on a Finnair flight on my way to Helsinki (though, i’ve never “lived” in Helsinki but only in Espoo ever.) and, until now, this was one of the best decisions, if not *the* best one, i’ve ever made in my whole life.

i still cherish the years i spent in Finland.

internationalization matters. just by talking with, hanging out with and simply listening to people from all over the world, we not only learn how others live; we ourselves experience, understand and accept how others live all over the world. in doing so, we become more tolerant and open-minded. so, yes, internationalization matters, and we must strive to actively create an environment in which no group of people is marginalized and in which everyone is welcome and can interact with each other.

representation matters, from at least two aspects. first, representation self-reinforces. for instance, it’s quite difficult for me to imagine my little niece dreaming of becoming an AI researcher, because it’s not easy for me to see how she would find the field of artificial intelligence welcoming, when the whole field is pretty much dominated by men. the only way to break this is to make sure all, truly all, are represented. second, representation is a path toward safety, equity and fairness in engineering and science. i might sound a bit like a broken record at this point, but for instance quite a few issues arising from deploying AI/ML systems could have been caught before deployment had those systems been developed and vetted by a team of developers that properly represents the diversity of the society (see here for a few examples and pointers to original sources.) so, yes, representation matters in ensuring safe, equitable and fair development and deployment of the systems we build.

compared to my experience back in Korea prior to joining Aalto University, Aalto University provided me an environment that was much better internationalized and generally had better representation across various aspects. this greatly helped me broaden my view and perspective on a diverse set of topics, and really changed how i perceive the world in general. looking back, however, i must unfortunately say that my bar was very low.

Aalto University, and the Finnish society more broadly, also suffers from a (relative) lack of internationalization and diversity. i was in the “international” master’s program, which was taught (almost) entirely in English (if i recall correctly, Finnish 1 was required, which was, perhaps unsurprisingly, taught in a mix of English & Finnish) and attracted talents from all over the world. indeed, in my cohort, if i recall correctly, either all but one or all of my peers were from abroad, which allowed me to interact with them, learn from them and become friends with them. outside this program, along with a few other international master’s programs, however, it was a fairly rare sight to find non-Finnish students at Aalto University (well, at least in the School of Science and Engineering back then.) there were certainly more non-Finnish European students spending their exchange years, although there weren’t too many of them either.

Furthermore, within my cohort of Macadamia, if i recall correctly, there was one female student out of 12 or so students.^{#} this balance seems particularly bad, but the balance wasn’t great among students or faculty members in computer science in general either. i have no statistics at hand, but my personal experience tells me that gender balance was definitely better at Aalto CS than at KAIST CS, where i studied computer science in my undergrad years. this did not mean it was any good at Aalto CS, however, just that my bar was very low.

as i’ve explored beyond Finland, i’ve seen, experienced and enjoyed places that are more internationalized and have a more balanced representation of a diverse population. Aalto University can and should do better, by further improving its internationalization and building an even more diverse campus, to better serve its students as well as Finland and, more broadly, the world.

here are two sides of my feeling toward Aalto University:

- my experience at Aalto University and Finland was simply amazing, and i want to contribute to making this experience available to a broader group of students from all over the world.
- Aalto, and more broadly Finland, could benefit even more from having a more diverse set of students so that the whole society, and its members, continue to stay (and become even more) open-minded and tolerant.^{@}

these are not mutually exclusive nor mutually independent. in fact, one may say that these are essentially the same thing.

to this end, i’ve decided to donate €30,000, out of the prize money from the Ho-Am Award, to Aalto University School of Science with the condition that it be used to support *female* students from *non-EU countries* who are entering the *Macadamia* program.^{$} similarly to my earlier donation to Mila, i’ve asked Aalto University to provide a one-time supplement of €1,000 each to approximately five such students each year. see here for the official announcement from Aalto University.

this is my small gesture of thanks to them for coming to Aalto University and Finland to study, which in turn improves internationalization and diversity at Aalto University and in Finland and makes this place even more awesome. €1,000 in Finland is definitely not much (hence a small, if not tiny, gesture), but i hope it helps these students, even a tiny bit, enjoy Aalto University and Finland, just like i did many years back.

hello,

until i took the <Probabilistic Robotics> and <Artificial Intelligence> courses for the first time 13 years ago, i had honestly never even heard the words machine learning, natural language processing, machine translation or artificial intelligence. then a senior student who happened to be taking a course with me handed me a brochure, which had happened to be lying in front of the department office, for a machine learning master’s program at Helsinki University of Technology in Finland, and i just up and left to study in Finland.

the master’s program i entered randomly assigned incoming students, without their asking, to labs within the department so that they could gain research experience one day a week. i happened to be placed in a group then called the Bayes’ Group which, despite its name, did neural network research. at the time i didn’t even know what a neural net was, let alone what one could do with it. still, i was thrilled simply to belong to a lab even one day a week, learning how to do research and watching the other researchers work over their shoulders.

perhaps because deep learning and artificial intelligence weren’t nearly as hot then as they are now, even after finishing my master’s on this topic and continuing on to a PhD in the same department and the same group, i spent my graduate years happily and at ease, studying whatever i was curious about and trying new things myself, without any grand ambition to do great research, write great papers or make great inventions.

near the end of those graduate years, i happened to attend a newly founded AI conference called ICLR. as i recall, it was a modest gathering of only 40-60 attendees. at breakfast on the first day of the conference, i happened to sit next to Prof. Yoshua Bengio of Montreal, and that breakfast led to my joining the University of Montreal as a postdoctoral researcher.

the day after i arrived in Montreal, Yoshua came over to where i was sitting and tossed out four research topics. one of them was machine translation, and knowing nothing whatsoever about it, i said i’d work on machine translation purely on the thought that it sounded fun. eight years have passed since then, and all of these chance choices have piled up, one on top of another, to bring me here today to receive an award far greater than i deserve.

while preparing these remarks, it struck me how much “chance” and “luck” have shaped my studies and my research career.

what if Yongwook hadn’t happened to pick up that pamphlet and hand it to me 13 years ago? what if i hadn’t happened to be assigned to the Bayes group 12 years ago? what if Yoshua Bengio hadn’t happened to be sitting next to the seat i took at breakfast 8 years ago? what if, 8 years ago, i had chosen a more familiar topic instead of machine translation out of the blue?

whenever i ponder the answers to these questions, i come to think that what i’ve achieved so far is not my individual accomplishment. i keep arriving at the conclusion that my results are merely one tiny piece among the countless results achieved together by all the researchers in the field that goes by many names: artificial intelligence, machine learning, data science and so on.

when i consider that, within the great current of AI research, i have merely produced results that stand out slightly more than others thanks to a string of happy coincidences, yet remain infinitely small, i can only feel endlessly apologetic toward the senior, fellow and junior scientists of AI research that i should personally receive such an undeserved award.

the ultimate goal of AI research is to find answers to questions we hardly dared believe science could answer: what is intelligence? what is reason? it may look as though tremendous achievements have come out of AI over the past years or decades, but there is still a long way to go before we can answer these fundamental questions, and often we don’t even know in which direction to push forward.

nevertheless, i thank the Ho-Am Foundation for giving such a great prize to our field, the field of AI research.

i take it not so much as a celebration of what has been achieved so far, but as encouragement and support, for me and for all the professors, researchers, engineers and students devoting themselves day and night to AI research, to keep moving forward.

daring to speak for everyone in the field of AI, i would like to thank the Ho-Am Foundation once again.

thank you.

* i know it’s weird to quote myself from another blog post, but i think i said it pretty well when i was asked about how i feel about this Prize earlier.

∘ if you want to know more about the origin and original design of the Macadamia program, see this report.

# i might be off by $N$ here. if any of my peers remembers the correct number, drop me a line so that i can fix it.

@ sadly, Finland as a whole does not seem to share this goal with me. a few years back (a few years after i left Finland), Finland introduced tuition for non-EU students enrolled in programs mainly taught in English, breaking its amazing tradition of providing free education to *all*. i seriously believe this was a mistake.

$ the name of the Macadamia program has been changed to the “Master’s Programme in Computer, Communication and Information Sciences – Machine Learning, Data Science and Artificial Intelligence”. yes, they changed the name of Macadamia to include “Artificial Intelligence”, which would be the least surprising decision ever.

this is a slightly expanded version of my fb post: https://www.facebook.com/cho.k.hyun/posts/10216267975445626.

i’ve lived in three countries (finland, canada and the US) over the past 12 years as an expat/immigrant myself, which makes me pretty well aware of the issues and challenges faced by immigrants, in particular east asian ones, in these countries. this made me *incorrectly* believe that i knew the challenges and issues faced by immigrants everywhere beyond these three countries, including korea, where i was born and raised as a korean national and had lived for 20+ years. this was until i saw this post by Alice, where she shared a link to the homepage of the “*Hanmaum Education Volunteer Corps, which helps children of immigrant families in challenging environments by providing free education*” (my own translation of an excerpt from the original post.)

how did i miss this? this glaringly obvious absence of immigrant kids from all those years i was growing up in korea. somehow i never had a chance to have even a single peer, in any of the schools i attended, who was a kid of an immigrant family. realizing this was, and still is, quite a shock, considering that the number of immigrants, immigrant families and their children has only been growing over the past decades.

then i realized it’s because i was born and raised near the center of the society. this has made me pretty much blind to the corners of the society, and all these immigrant moms (it’s also a bit concerning that it’s disproportionately immigrant “moms”) and their children were, and are, in those corners. it was this post by Alice and this effort by Emeritus professor Byung-Gyu Choi of KAIST that barely made me take a glimpse at this corner. what a blind fool i have been, and what else am i being blind to…?

last november (2020), i was invited to give an opening talk at SK ICT Tech Summit 2020, perhaps unsurprisingly together with Alice (i’m a huge fan!), and talked about my on-going project on breast cancer screening (see the recording of the talk here). SKT generously paid me a $6,000 lecture fee (and yes, it was super-generous; i rarely receive any lecture fee for my invited talks.) i’ve been thinking about how to spend it, and have decided to donate the entire sum to the Hanmaum education volunteer corps.

it’s not a lot, and it doesn’t come anywhere close to the contributions of the students and other volunteers who are on the ground providing education to these moms and kids. i hope, however, that this small gesture of mine will help immigrant parents & kids receive the education they truly deserve.

p.s. i’m quite proud to see my former visiting student and current good friend, Keunwoo, following my lead and showing others what to do.

NYU ran blended instruction this fall. before the semester began, each course was classified as remote, in-person or blended based on its size (number of students and lectures per week) and its nature (whether in-person attendance was essential), and i taught a blended-mode course. in blended mode, lectures were held in person, and lab sessions were multiplied to 2-3 times their usual number so that both in-person and remote sessions were offered. every lecture and lab was livestreamed over zoom, so that students who couldn’t come to New York and enrolled at one of NYU’s global campuses instead had no trouble following the course. students attending any lecture or lab session in person were assigned, before the semester began, to pre-assigned weeks and pre-assigned seats, and masks were mandatory in all NYU facilities. each lecture admitted only 1/4-1/3 of the lecture hall’s maximum capacity, and professors, without exception, always lectured wearing masks. my lectures could admit at most 25-30 students at a time, but in practice about 3-10 came in person while the rest attended via the zoom livestream.

at the same time, each department reorganized its office assignments and desk arrangements so that faculty, postdocs and PhD students could return to their offices as needed. the NYU Center for Data Science reassigned offices and meeting rooms so that every faculty member, postdoc and PhD student had a room to themselves, in an effort to provide an environment where students with relatively poor housing situations could focus on research in peace.

undergraduates who wished to could also move into the residence halls for the semester, in which case residence hall reconfiguration minimized unnecessary contact between students. all on-campus dining (used mostly by undergraduates) switched to pick-up only, and a system was put in place so that no desk or study space on campus could be used without a prior reservation.

as professor Jihoon Jeong wrote in his post, to build this environment in the middle of Manhattan and avoid a covid-19 outbreak, NYU had every student and staff member coming to the New York campus take a PCR test 1-2 times during the two weeks before the semester started. faculty and staff living in New York were tested at NYU’s Langone medical center, and students were tested en masse in tents set up near the residence halls.

once the semester started, every member of the community was required to take a saliva-based test once every two weeks (https://www.nyu.edu/life/safety-health-wellness/coronavirus-information/safety-and-health/coronavirus-testing/ongoing-testing.html). every two weeks a reminder email arrives; during that week you visit one of the 4-5 test collection points set up on campus, pick up a test kit, provide a saliva sample at home or in your office, and return it to a collection point. 1-3 days later the result can be checked online, and if no result has been recorded, keycard access to NYU buildings is suspended.

anyone who tests positive through this process immediately goes into isolation, and the university begins contact tracing. unfortunately, the university’s contact tracing is limited to campus, and New York State handles contact tracing beyond it. of course, it’s an open secret that the latter doesn’t work at all. early in the semester there were signs of an outbreak in a residence hall, so 2-3 entire floors were quarantined and everyone was tested twice, and a larger outbreak was averted.

through this process, about 15,000 of the roughly 60,000 members of the community returned to campus this semester, and as of last week the semester ended without interruption. according to the dashboard that has been updated in real time (https://www.nyu.edu/life/safety-health-wellness/coronavirus-information/nyc-covid-19-testing-data.html), a total of 199,870 tests have been conducted since august 1, with 758 positive cases, or roughly 1,000 positive cases if campuses outside New York are included. the positivity rate of 0.38% is dramatically lower than that of New York City itself.

this past spring in New York City, the big hospitals were hell, and outside the hospitals it was a ghost town. even now, New York State records more than ten thousand confirmed cases and more than a hundred deaths every day. nevertheless, it gives me some small comfort that NYU educated its students without interrupting the semester, and that New York City’s public schools also ran their semester apart from one temporary two-week closure. this coming spring, and next fall as well, i hope the university will do whatever it takes to stay open and keep teaching its students properly, and as one member of the university i intend to do my best.

New York, other US states, Korea, Canada, Europe: in most of the regions whose news i follow to some degree, much of the worry goes to large corporations, landlords, real estate and the rich. these economic considerations, together with the political calculations we’ve seen in the US, have had a great influence on how well, or how badly, each place has weathered this pandemic. sadly, under such complicated considerations, education is easily buried. the pandemic will end, but how long will the effects last on the generation that, during these 1-3 years, missed out on a proper education compared to other generations?

Aalto University (in particular School of Science within) and Finland just keep on giving, and I feel like I continue to receive without giving anything back. I will have to think of some way for me to pay back all that I have received from them.

Kiitos paljon!

Of course, the whole event was virtual, and due to the time difference, I could not attend myself. Instead, I sent the video recording of my greetings. You can watch it at https://youtu.be/074nhA9SQvA. I’m also attaching the script I used for recording this video below.

hi,

i received admission to the international master’s program in machine learning and data mining, which was called back then Macadamia, from Helsinki University of Technology in the spring of 2009.

although i applied to the program myself, finland was largely a land of mystery to me. perhaps this mysterious nature of the country may have been one of the major motivations for me to apply for this program in the first place. in my mind back then, finland was associated with just a couple of things, such as Nokia and Helsinki Olympics. i must confess that i wasn’t even aware that finland shared a border with russia. unsurprisingly, going to finland to study was definitely not what i had in my mind until one of my friends then handed me the brochure of the Macadamia program in the winter of 2008.

the very first lecture i attended at helsinki university of technology, which was about to be merged with two other universities to form aalto university, was for the course “Machine Learning: Basic Principles”. this course was taught by Tapani Raiko, who went on to advise and mentor me for the next five years and whom i still admire and keep in touch with. in that very first lecture, i could immediately tell that i had made the right choice to be there to study machine learning. and, to this day, i still believe i made the right choice to be there at Aalto University to study machine learning and data mining.

as a part of the Macadamia program, some students were assigned to some of the labs within the department, then called information and computer science (ICS), to assist in research one day a week with a small stipend. back then in finland, the master’s program was still free to anyone from anywhere in the world, which, i sadly learned recently, is no longer the case. without tuition-free education, my decision to come to finland to study at Aalto University may have taken a very different course.

anyways, i was assigned to the Bayes group, which i do not believe exists anymore and which, despite its name, had a longer history of research in neural networks. the group was then led by Prof. Juha Karhunen, who i believe has recently retired, together with Tapani and Prof. Alexander Ilin, who recently made a comeback to Aalto to re-build the Bayes group, though under the new name “Deep Learning”. this part-time research gig at the then-Bayes group, which started in September 2009, was the beginning of my research career, which is still on-going.

i often wonder what i would’ve become had it not been for this program (called the “honours program” then, if i remember correctly), had i not been assigned to the Bayes group, or had i not been advised by Tapani and Alexander. it’s simply unimaginable. five years later, in March 2014, i defended my doctoral dissertation against my “opponent” Prof. Nando de Freitas, in front of my friends, colleagues and supervisors from the then-newly-formed Department of Computer Science of Aalto University School of Science.

over those five years, i spent many days and nights in Maarintalo, studying for exams and working on projects. over those five years, i spent many days and nights in the computer science building, working toward my dissertation. over those five years, i had an uncountable number of lunches at the cafeterias in the computer science building as well as the main building. over those five years, i met so many friends and colleagues, many of whom i still keep in touch with.

Aalto University gave me an enormous opportunity by bringing me to Finland and giving me rigorous education on machine learning. Furthermore, Aalto University had successfully created an international environment in which I could immerse myself among talents from all over the world and be inspired by them. These were just the beginning of the series of opportunities Aalto University School of Science had given me over those five years.

my phd years were generously supported by FICS (the finnish doctoral programme in computational sciences), which has since been discontinued and, i believe, replaced by HICT. near the end of my phd programme, i was given a chance, supported by FICS and Prof. Erkki Oja, to spend six months visiting the University of Montreal to broaden my view and to learn further from the very best in the world.

this research visit opened my eyes to a broader set of topics in machine learning, and in particular this visit was how and when i began to seriously delve into studying how machine learning and more broadly AI could be used for and improve natural language processing and machine translation. this research visit led me to join the University of Montreal as a postdoc in a lab which was called Lisa back then and is now called Mila, immediately after i defended my dissertation.

And, now, i am an associate professor of computer science & data science at New York University, running my own research lab and teaching machine learning to aspiring students from all over the world.

in my opinion, one of the most important roles of higher education is to bring the best out of each student. what this implies is that higher education cannot simply shove knowledge into students, and it cannot simply show students easy, comfortable and convenient ways forward. education must strive to provide as diverse and broad a set of opportunities and perspectives as possible in order to ensure each and every student has a chance to discover their own way forward.

What i experienced during my years at Helsinki University of Technology, which became Aalto University School of Science and Technology and eventually Aalto University School of Science, was precisely this: rigorous and thorough education, and a string of educational and extra-curricular opportunities within and beyond the walls of the university and even the borders of the country.

It is truly my honour to be named the alumnus of the year, and to be frank I am quite unsure whether i deserve it. off the top of my head, i can think of Prof. Alexander Ilin, who is now back at Aalto University. Dr. Tapani Raiko, who is now at Apple, is another obvious candidate. and, no, it’s a totally objective list. they just happened to have mentored me throughout my years at Aalto.

let me wrap it up by dusting off my finnish: Kiitos paljon!

i enjoyed answering those questions, because they made me think quite a bit about them myself. of course, as usual i ended up leaving only a short answer to each, but i thought i’d share them here in case any students run into the same questions in the future. although my answers are all quite speculative and based on experience rather than rigorously justified, what’s the fun in rigorously proven and well-known answers?

of course, many more questions were asked and answered during the live lectures and in the chatrooms, but i just cannot recall all of them easily, nor am i energetic enough after this unprecedented semester to go through the whole chat log to dig out interesting questions. i just ask you to trust me that the list of questions below is a tiny subset of the interesting ones.

i will paraphrase/shorten the answers below and remove any identifying information (if any):

- Why was backprop controversial? Yann mentioned that one of the big things that made the use of ConvNets in various applications controversial was the use of backpropagation. backprop is just an application of the chain rule, so why would anyone be suspicious of using it?
- Professor LeCun said that mini-batch has no advantage over single-batch SGD besides being easier to parallelize, and online SGD is actually superior. Is there any other theoretical reason why single-batch is preferable?
- Why would we do batch normalization instead of normalizing the whole dataset all at once at first? Is it for when normalizing the whole dataset is too computationally expensive? I understood that normalization makes the optimization process easier by making the eigenvalues equal. However, if you’re only normalizing over the batch, your normalization for each batch is subject to noise and might still lead to bad learning rates for each dimension.
- Batch normalization in VAE: While implementing the convolutional VAE model, I noticed that removing the BatchNorm layers enabled the model to train as expected. I was wondering why BatchNorm causes this issue in the VAE model.
- In semi-supervised VAE, how do we decide the embedding dimensions for the class? Also, BERT used position embedding to represent the position, so how do we determine the position embedding dimensions in BERT?
- Why do we divide the input to the softmax in dot product attention by the square root of the dimensionality?
- DL appears to add double descent as a caveat in addition to bias-variance tradeoff learned earlier. Do you have any insights on how we should think about double-descent?
- In your opinion, will we achieve AGI?

**1. Why was backprop controversial? Yann mentioned that one of the big things that made the use of ConvNets in various applications controversial was the use of backpropagation. backprop is just an application of the chain rule, so why would anyone be suspicious of using it?**

when yann said it was controversial to use backprop earlier, i believe he meant it in two different ways: (1) backprop itself to compute the gradient of the loss function w.r.t. the parameters and (2) backprop to refer to gradient-based optimization. i’ll explain a bit of each below, but neither of them is considered a serious argument against using backprop anymore.

(1) backprop was controversial, and remains under great scrutiny, when artificial neural nets (what we learn) are compared against biological neural nets (what we have). it’s quite clear, due to biological constraints, that backprop is not implemented in brains the way it is in our deep learning toolkits (see e.g. https://openreview.net/forum?id=HJgPEXtIUS for some of the interesting biological constraints/properties that should be satisfied by any biologically plausible learning algorithm.) to some people, this is a make-or-break issue, because there seems to exist a learning algorithm that results in a superior neural net (human brains!) of course, this could just mean that a biological brain approximates the gradient computation as well as it can under those constraints, but it’s not easy to verify this (see, e.g., https://www.youtube.com/watch?v=VIRCybGgHts for how a brain might implement backprop.)

another criticism or objection along this line is that biological brains seem to have either zero or multiple objectives that are being optimized simultaneously. this is unlike our usual practice in deep learning where we start by defining one clear objective function to minimize.

(2) gradient-based optimization often refers to a set of techniques developed for (constrained/unconstrained) convex optimization. when such a technique is used for a non-convex problem, we are often working with a local quadratic approximation; that is, around any point in the space, the underlying non-convex objective function can be approximated by a convex quadratic function ($\theta^\top H \theta + g^\top \theta + c$.) under this assumption, gradient-based optimization is attracted toward the minimum of this local quadratic approximation, regardless of whether there exists a better minimum far away from the current point in the space. this is often used as a reason to criticize the use of gradient-based optimization with a non-convex objective function, and thereby the use of backprop. see e.g. http://leon.bottou.org/publications/pdf/online-1998.pdf for an extensive study of the convergence properties of SGD.
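
as a concrete illustration of this attraction to a nearby minimum, here is a minimal sketch (a made-up 1-d double-well objective, not from any reference above) where plain gradient descent lands in whichever minimum lies on its side of the barrier:

```python
def f(x):
    """A 1-d double-well objective with two minima of different quality."""
    return (x ** 2 - 1) ** 2 + 0.3 * x

def grad(x):
    """Analytic derivative of f."""
    return 4.0 * x * (x ** 2 - 1) + 0.3

def gradient_descent(x0, lr=0.01, steps=2000):
    """Run plain gradient descent from x0 and return the final iterate."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

left, right = gradient_descent(-2.0), gradient_descent(2.0)
print(left, f(left))    # near -1.04: the better minimum in this toy example
print(right, f(right))  # near +0.96: a worse local minimum
```

the initialization alone decides which minimum we end up in; the method never “sees” the better minimum across the barrier.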

this criticism however relies on one big assumption: that there is a big gap in quality between the nearby local minima (we’ll talk about them in a few weeks in the course) and the global minimum. if there is a big gap, this would indeed be trouble, but what if there isn’t?

it turns out we’ve known for a few decades already that most local minima are of reasonable quality (in terms of both training and test accuracies) as long as we make the neural net larger than necessary. let me quote Rumelhart, Hinton & Williams (1986):

> The most obvious drawback of the learning procedure is that the error-surface may contain local minima so that gradient descent is not guaranteed to find a global minimum. However, experience with many tasks shows that the network very rarely gets stuck in poor local minima that are significantly worse than the global minimum. We have only encountered this undesirable behaviour in networks that have just enough connections to perform the task. Adding a few more connections creates extra dimensions in weight-space and these dimensions provide paths around the barriers that create poor local minima in the lower dimensional subspaces.
>
> — <Learning representations by back-propagating errors> by Rumelhart, Hinton & Williams (1986)

this phenomenon has been and is being studied quite extensively from various angles. if you’re interested in this topic, see e.g. http://papers.nips.cc/paper/5486-identifying-and-attacking-the-saddle-point-problem-in-high-dimensional-non-convex-optimization and https://arxiv.org/abs/1803.03635 for some recent directions. or, if you feel lazy, you can see my slides at https://drive.google.com/file/d/1YxHbQ0NeSaAANaFEmlo9H5fUsZRsiGJK/view which i prepared recently.

**2. Professor LeCun said that mini-batch has no advantage over single-batch SGD besides being easier to parallelize, and online SGD is actually superior. Is there any other theoretical reason why single-batch is preferable?**

this is an interesting & important question, and the answer varies from one expert to another, Yann and myself included, depending on what is implicitly assumed and what criteria are used to tell which is preferred (computational efficiency, generalization accuracy, etc.)

Yann’s view is that noise in SGD greatly helps generalization: it prevents learning from getting stuck at a sharp local minimum and drives learning toward a flatter local minimum. a flatter minimum implies that the final neural net is more robust to perturbations of the parameters, which naturally translates to robustness to perturbations of the input, implying better generalization. under this perspective, you want to maximize the level of noise, as long as the stochastic gradients computed from the training examples roughly cancel out on average. that corresponds to using just one training example for computing each stochastic gradient.

of course, the amount of noise, which is proportional to the variance of the stochastic gradient, does impact the speed at which learning happens. in recent years, we (as the community of deep learning researchers) have found that certain network architectures require stochastic gradients computed using large minibatches (though it’s unclear what “large” means, as it’s quite relative to the size of the training set) to be trained at all. in these cases, it looks like a high level of noise sometimes prevents any progress in learning, especially in the early stage.
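
to make the noise-vs-batch-size relation concrete, here is a small numpy sketch (a made-up 1-d regression problem; none of the numbers come from the lecture) estimating how the variance of the minibatch gradient shrinks roughly as $1/B$:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy 1-d linear regression: y = 2x + noise, per-example loss 0.5 * (w*x - y)^2
n = 10_000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=0.5, size=n)
w = 0.0  # evaluate gradient noise at this (arbitrary) parameter value

def minibatch_grad_variance(batch_size, n_trials=2_000):
    """Empirical variance of the minibatch gradient of the loss w.r.t. w."""
    grads = []
    for _ in range(n_trials):
        idx = rng.choice(n, size=batch_size, replace=False)
        grads.append(np.mean((w * x[idx] - y[idx]) * x[idx]))
    return np.var(grads)

v1, v64 = minibatch_grad_variance(1), minibatch_grad_variance(64)
print(v1 / v64)  # close to 64: gradient noise shrinks roughly as 1/B
```

so choosing the minibatch size is effectively choosing a point on this noise-vs-computation trade-off.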

so, in short, it’s still an open question. yann’s perspective may turn out to be the correct one (and that wouldn’t be the first time that happened,) or we may find a completely different explanation in the future.

**3. Why would we do batch normalization instead of normalizing the whole dataset all at once at first? Is it for when normalizing the whole dataset is too computationally expensive? I understood that normalization makes the optimization process easier by making the eigenvalues equal. However, if you’re only normalizing over the batch, your normalization for each batch is subject to noise and might still lead to bad learning rates for each dimension.**

there are three questions/points here. let me address each separately below:

“*normalization makes the optimization process easier through making the eigenvalues equal*“

we need to specify what kind of normalization you refer to, but in general, it’s not possible to make the hessian the identity by simply normalizing the input. this is only possible when we are considering a linear network with a specific loss function (e.g., l2 loss for regression and cross-entropy for classification.) however, it is known empirically, and for some cases rigorously as well, that normalizing the input variables to be zero-mean and unit-variance brings the condition number (the ratio between the largest and smallest eigenvalues of the hessian matrix) close to 1 (which is good.)
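
this effect is easy to check numerically in the linear case, where the hessian of the l2 loss is $X^\top X / n$. a small sketch (the two input scales are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# two input dimensions on wildly different scales (made-up numbers)
n = 5_000
raw = np.stack([rng.normal(100.0, 1.0, size=n),   # large mean, unit variance
                rng.normal(0.0, 1.0, size=n)], axis=1)

def condition_number(x):
    """Condition number of the hessian X^T X / n of the l2 loss of w^T x."""
    h = x.T @ x / len(x)
    eig = np.linalg.eigvalsh(h)  # eigenvalues in ascending order
    return eig[-1] / eig[0]

standardized = (raw - raw.mean(axis=0)) / raw.std(axis=0)
print(condition_number(raw))           # huge: dominated by the mean offset
print(condition_number(standardized))  # close to 1
```

with a condition number near 1, a single learning rate works about equally well in every direction of the parameter space.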

“*why we would do batch normalization instead of normalizing the whole dataset all at once at first?*“

now, in the case of a network with multiple layers, it turns out that we can maximize the benefit of normalization by normalizing the input to each layer to be zero-mean and unit-variance. unfortunately, this is not trivial, because the input to each layer changes as the lower layers’ weights and biases evolve. in other words, if we wanted to normalize the input to each layer exactly, we would need to sweep through the entire dataset every time we update the weight matrices and bias vectors, which would be intolerably slow. furthermore, renormalizing the input at a lower layer changes the input to the upper layers, ultimately causing the loss function to change dramatically each time we renormalize all the layers, likely making learning impossible. this is, though, addressable to a certain degree (see http://www.jmlr.org/proceedings/papers/v22/raiko12/raiko12.pdf by Tapani Raiko, my phd advisor, and Yann LeCun.)

“*your normalization for each batch is subject to noise*“

this is indeed true, and that’s precisely why it’s a customary practice to keep the running averages of the mean and variance of each dimension in batch normalization. assuming that the parameters of the network evolve slowly, such practice ultimately converges to the population mean and variance.
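
the running-average practice can be sketched as follows (the momentum value, population parameters and batch size below are all made up; real batchnorm layers track these statistics per channel):

```python
import numpy as np

rng = np.random.default_rng(0)

# track running statistics the way batchnorm layers commonly do:
# an exponential moving average over noisy per-minibatch estimates.
momentum = 0.1
running_mean, running_var = 0.0, 1.0

for _ in range(2_000):
    batch = rng.normal(loc=3.0, scale=2.0, size=64)  # population: mean 3, var 4
    running_mean = (1 - momentum) * running_mean + momentum * batch.mean()
    running_var = (1 - momentum) * running_var + momentum * batch.var()

print(running_mean, running_var)  # close to the population values 3 and 4
```

at test time it is these slowly-converging running estimates, not the noisy per-batch statistics, that are used for normalization.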

**4. Batch normalization in VAE: While implementing the convolutional VAE model, I noticed that removing the BatchNorm layers enabled the model to train as expected. I was wondering why BatchNorm causes this issue in the VAE model.**

i don’t have a clear answer unfortunately, but can speculate a bit on why this is the case. my answer will depend on where batchnorm was used. of course, before reading the answer below, make sure your implementation of batchnorm doesn’t have a bug.

if batchnorm was used in the approximate posterior (encoder), it shouldn’t really matter, since the approximate posterior can be anything by definition. it can depend not only on the current observation $x$, but on anything else that helps minimize the KL divergence from this approximate posterior to the true posterior. so, i wouldn’t be surprised if it’s totally fine leaving batchnorm in the encoder.

if batchnorm was used in the decoder, it may matter, as the likelihood distribution (generative distribution) is over the observation space $\mathcal{X}$ conditioned on the latent variable configuration $z$. with batchnorm, the decoder is instead conditioned on the entire minibatch of latent variable configurations, that is, also on the latent variable configurations of the other examples. this may hinder optimization in the early stage of learning (in the later stage of learning, it shouldn’t really matter much, though.)

in general, batchnorm is a tricky technique and makes it difficult to analyze SGD, because it introduces correlation across per-example stochastic gradients within each minibatch.

**5. In semi-supervised VAE, how do we decide the embedding dimensions for the class? Also, BERT used position embedding to represent the position, so how do we determine the position embedding dimensions in BERT?**

this question can be answered from two angles.

a. network size

the embedding dimensionality is a part of a neural net, and choosing it can be thought of as part of determining the size of your neural network. it’s a good rule of thumb to use as large a neural net as you can within your computational and financial budget to maximize your gain in generalization. this might sound counter-intuitive if you have learned in earlier courses that we want to choose the most succinct model (following the principle of occam’s razor,) but with neural nets it’s not simply the size of the model that matters; the choice of optimization and regularization matters perhaps even more. in particular, as we will learn next week, because SGD inherently works in a low-dimensional subspace of the parameter space and cannot explore the whole space of parameters, a larger network does not imply that it’s more prone to overfitting.

b. why more than one dimension?

let’s think of the class embedding (though the same argument applies to the positional embedding.) take as an example handwritten digit classification, where our classes consist of 0, 1, 2, …, 9. it seems quite natural that there’s a clear one-dimensional structure behind these classes, so that we would only need a one-dimensional embedding. why then do we need a multi-dimensional class embedding?

it turns out that there are multiple degrees of similarity among these classes, and that the similarity among them is context-dependent. that is, depending on what we see as an input, the class similarity changes. for instance, when the input is a slanted 3 (a 3 significantly rotated clockwise), it looks like either 3 or 2, but not 8 nor 0. when the input is an upright 3, it looks like either 3 or 8, but not 2. in other words, the classes 3 and 2 are similar to each other when the input is a slanted 3, while the classes 3 and 8 are similar to each other when the input is an upright 3.

having multiple dimensions to represent each class allows us to capture these different degrees of similarity among classes. a few dimensions in the class embeddings of 3 and 2 will point toward a similar direction, while a few other dimensions in the class embeddings of 3 and 8 will point toward another similar direction. when the input is a slanted 3, the feature extractor (a convolutional net) will output a vector that will emphasize the first few dimensions and suppress the other dimensions to exploit the similarity between 3 and 2. a similar mechanism would lead to a feature vector of an upright 3 that would suppress the first few dimensions and emphasize the latter few to exploit the similarity between 3 and 8.

it’s impossible to tell in advance how many such degrees of similarity exist and how to encode them. that’s why we need to use as high-dimensional an embedding as possible for encoding any discrete, one-hot input.
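
here is a toy numerical sketch of this argument; the 4-d class embeddings and the two feature vectors below are entirely made up for illustration (dims 0–1 carry the “3 is like 2” similarity, dims 2–3 the “3 is like 8” similarity):

```python
import numpy as np

# hypothetical 4-d class embeddings for the digits 2, 3 and 8
emb = {
    2: np.array([1.0, 1.0, 0.0, 0.0]),
    3: np.array([1.0, 1.0, 1.0, 1.0]),
    8: np.array([0.0, 0.0, 1.0, 1.0]),
}

def logits(feature):
    """Dot product of an input feature vector with each class embedding."""
    return {c: float(feature @ e) for c, e in emb.items()}

slanted_3 = np.array([1.0, 1.0, 0.1, 0.1])  # emphasizes the 2-like dimensions
upright_3 = np.array([0.1, 0.1, 1.0, 1.0])  # emphasizes the 8-like dimensions

print(logits(slanted_3))  # 3 wins; 2 scores high and 8 scores low
print(logits(upright_3))  # 3 wins; 8 scores high and 2 scores low
```

a one-dimensional embedding could not express both similarity patterns at once, since the same input-dependent switch between “close to 2” and “close to 8” needs separate directions to live in.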

**6. Why do we divide the input to the softmax in dot product attention by the square root of the dimensionality?**

This question was asked at one of the office hours, and Richard Pang (one of the TA’s) and i attempted to reverse-engineer the motivation behind the scaled dot-product attention from the transformer.

assume each key vector $k \in \mathbb{R}^d$ is a sample drawn from a multivariate, standard Normal distribution, i.e., $k_i \sim \mathcal{N}(0, 1^2).$ given a query vector $q \in \mathbb{R}^d$, we can now compute the variance of the dot product between the query and key vectors as $\mathbb{V}[q^\top k] = \mathbb{V}[\sum_{i=1}^d q_i k_i] = \sum_{i=1}^d q_i^2 \mathbb{V}[k_i] = \sum_{i=1}^d q_i^2$. in other words, the variance of each logit is the squared norm of the query vector.

assume the query vector $q$ is also a sample drawn from a multivariate, standard Normal distribution, i.e., $q_i \sim \mathcal{N}(0, 1^2)$. in other words, $\mathbb{E}[q_i]=0$ and $\mathbb{V}[q_i]=\mathbb{E}_{q_i} \left[(q_i - \mathbb{E}[q_i])^2\right] = \mathbb{E}_{q_i} \left[ q_i^2 \right] = 1$. then, the expected variance of the logit ends up being $\mathbb{E}_{q} \left[ \mathbb{V}[q^\top k] \right] = \mathbb{E}_{q} \left[ \sum_{i=1}^d q_i^2 \right] = \sum_{i=1}^d \mathbb{E}_{q_i} \left[ q_i^2 \right] = \sum_{i=1}^d 1 = d.$

we can now standardize the logit to be $0$-mean and unit-variance (or more precisely, we make the logit’s scale to be invariant to the dimensionality of the key and query vectors) by dividing it with the standard deviation $\sqrt{\mathbb{E}_q \mathbb{V}[q^\top k]}=\sqrt{d}.$
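
this derivation is easy to check numerically; a small sketch (the dimensionality and sample count are chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)

d, n = 64, 50_000
q = rng.standard_normal((n, d))  # queries, each entry ~ N(0, 1)
k = rng.standard_normal((n, d))  # keys, each entry ~ N(0, 1)

logits = np.einsum("nd,nd->n", q, k)  # one dot product q^T k per (q, k) pair
scaled = logits / np.sqrt(d)          # the scaled dot-product attention logit

print(logits.var())  # close to d = 64, as derived above
print(scaled.var())  # close to 1, independent of d
```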

these assumptions of Normality do not hold in reality, but as we talked about earlier, Normality is one of the safest things to assume when we don’t know much about the underlying process.

As Ilya Kulikov kindly pointed out, this explanation doesn’t answer “why” and instead answers “what” scaling does. “why” is a bit more difficult to answer (perhaps unsurprisingly,) but one answer is that softmax saturates as the logits (the input to softmax) grow in magnitude, which may slow down learning due to the vanishing gradient. though, it’s unclear what the right way to quantify this is.

**7. DL appears to add double descent as a caveat in addition to the bias-variance tradeoff learned early on. Do you have any insights on how we should think about double descent?**

The so-called double descent phenomenon is a relatively recently popularized concept that’s still being studied heavily (though it was observed and reported by Yann already in the early 90s; see, e.g., https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.66.2396 and also https://iopscience.iop.org/article/10.1088/0305-4470/25/5/020 by Krogh and Hertz.) The issue I have with double descent in deep neural nets is that it’s unclear how we should define model capacity. the # of parameters is certainly not the best proxy, because the parameters are all heavily correlated and redundant. perhaps it should be the number of SGD steps, because we learned that the size of the hypothesis space is in fact a function of the number of SGD steps.

One particular proxy I find interesting and convincing is the fraction of positive eigenvalues of the Hessian at a solution. With this proxy, it looks like the apparent double descent phenomenon often lessens. see e.g. https://arxiv.org/abs/2003.02139.

So, in short, the model capacity is a key to understanding the bias-variance trade-off or more generally generalization in machine learning, but is not a simple concept to grasp with deep neural networks.

**8. In your opinion, will we achieve AGI?**

Of course, I’m far from being qualified to answer this question well. Instead, let me quote Yann:

<An executive primer on artificial general intelligence> by Federico Berruti, Pieter Nel, and Rob Whiteman

Yann LeCun, a professor at the Courant Institute of Mathematical Sciences at New York University (NYU), is much more direct: “It’s hard to explain to non-specialists that AGI is not a ‘thing’, and that most venues that have AGI in their name deal in highly speculative and theoretical issues…