2024 was the first year I got involved with deep learning, and also the first year I entered the field of large models. Perhaps one day, looking back, this will turn out to be a year in which many important choices were made.
First, a summary of my year in coding:
I'm based in Beijing, studying artificial intelligence at an ordinary 211 university. I spent my first two years mostly chasing ranks in FPS games. Later I developed a strong interest in front-end development, working my way from JS to React, and eventually became interested in CS as a whole and started taking online courses. I plan to write a memoir of my four years at school, but so that I can speak freely, I'll wait another six months before starting; for now, I will only record 2024. In short, before this year my plan was to work hard on my development skills, aiming either to pursue a master's degree abroad after my undergraduate studies or to become an SDE. Because of how our major's courses were arranged, AI and academia looked like a rather chaotic field to me. Besides using GPT-4 to help write code, I did not plan to have any contact with AI.
Winter Break: Getting into Deep Learning
Since my school offers a DL course in the second semester of junior year, I decided to get a head start during the winter break. After struggling with the "watermelon book" (Zhou Zhihua's Machine Learning) while learning ML, I found that foreign online courses like CS229 and YouTube tutorials suited me better. After some Googling, I discovered Andrej Karpathy's nn-zero-to-hero tutorial, which was truly delightful! The first lesson helped me understand the core concepts of deep learning by implementing automatic differentiation, and I highly recommend it as an introductory course; there's also a fan-subtitled version on Bilibili.
Under his guidance, I quickly learned about n-grams, word2vec, and CNNs. I had already heard of convolutional neural networks during a group meeting in my freshman year, but no one could explain what these models were doing in a way that a freshman could understand. Finally, the moment I implemented the GPT-2 model, the door to a brilliant world slowly opened before me. My attitude towards AI changed to genuine fascination: so I really can glimpse a corner of the inner workings of the killer app, ChatGPT.
One intuition I have is that the toolchain and mindset I picked up while learning front-end and back-end development help with studying DL. The most basic benefit is knowing roughly what to expect when setting up an environment, which lets me carry over intuitions from other fields. So when junior students ask me about learning paths, I generally advise them to start with some front-end or back-end development, both to make up for the practical skills that school does not teach and to broaden their horizons.
Around this time, I also came across @DIYgod's blogging platform xLog, and without hesitation I moved my personal blog from Hexo + gh-pages to xLog. Besides being good-looking and easy to use, with comprehensive customization support, its most important feature is the community feed, which gives me positive feedback 🥹.
Feel free to visit! After finishing MyGO this year, both my avatar and website icon feature Chihaya Anon 🥰.
Internship: An Outsider's View on the Large Model Landscape
This year, there were numerous LLM-related competitions on Kaggle. I chose to focus on the Google-sponsored LLM Prompt Recovery competition, mainly because everyone seemed equally lost on it, and I was betting that brute effort would work miracles. Later, I put some effort into the data and, with a bit of luck, reached second place on the public leaderboard. However, under the dual pressures of internship and life, I cut back my time investment and only just secured my first competition silver medal.
By March, the pressure shifted to the summer internship credits. Based on experience, I inferred that the organizations arranged by the school would likely be of lower quality (the results later confirmed this; there was little to boast about). So I started submitting resumes on BOSS Zhipin at the end of March, and I did not consider small and medium-sized companies that had nothing to do with large models. At that time, I naively thought that after completing Karpathy's course and hand-writing the Transformer and GPT-2 from scratch, I could find an internship in large models. I didn't expect that my school's name would not carry enough weight, and I had no other highlights or connections. I fell into a painful spiral, feeling that the large-model path was closed to me.
At this point, a team at Pony.ai, the company of Lou Tiancheng ("Lou Jiaozhu"), wanted to recruit an intern with knowledge of the field. The job involved following an internal project from start to finish, and I thought this might be the best opportunity for me at this stage. It would also enrich my resume, so I accepted the offer. The courses in the second semester of junior year were still not worth studying, but fortunately the teachers generally had low attendance requirements. After some friendly communication, I went all in on the internship, attending about 4-5 days a week, with a round-trip commute of nearly 3 hours. Looking back, it was a painful yet joyful period. My colleagues and mentor were very nice, and I also earned my first 10,000 yuan using my professional knowledge, which helped me regain some confidence in my future.
My work mainly involved researching workflows for using large models in internal data annotation tasks. Initially, it was primarily about experimenting with a large number of APIs. Besides improving the workflow, the most important thing was to find the best model. During this time, APIs from various companies were readily available, and with the company covering the costs I could make multi-threaded calls painlessly. Among others, I remember: ByteDance's Skylark and Doubao, Kimi, Alibaba's open-source and closed-source Qwen models, Baidu's Ernie, and Tencent's Hunyuan, with GPT-3.5 as the baseline (the data was not allowed to leave the country, hence the need for domestic models or private deployment). Among these models, the only one whose task performance came close to OpenAI's was Kimi, but its cost was relatively high, so I began to look at other models. During a Kaggle AI math competition, I had a great experience with DeepSeekMath-7B, which led me to learn about the company DeepSeek. At that time, they had just launched DeepSeek V2, and the pricing of 3 RMB per million tokens for input and output immediately attracted me. In my benchmark tests, I found its performance far exceeded GPT-3.5, apart from occasional minor issues with instruction following. Thus, the model running in the business pipeline during that time became DeepSeek.
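For illustration, the kind of model comparison described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the actual internal pipeline: each model is represented by a plain Python callable (the `ModelFn` type is hypothetical), scoring is a simple exact match against a labeled answer, and the helper names `evaluate`, `rank_models`, and `cost_rmb` are invented for this example. Real API clients would replace the stub functions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

# Hypothetical abstraction: a "model" is any function from prompt to reply.
# In practice each one would wrap a vendor's API client.
ModelFn = Callable[[str], str]
Case = Tuple[str, str]  # (prompt, expected label)


def evaluate(model: ModelFn, cases: List[Case], workers: int = 8) -> float:
    """Return the fraction of cases whose reply exactly matches the label."""
    if not cases:
        raise ValueError("cases must not be empty")
    prompts = [prompt for prompt, _ in cases]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so replies line up with cases.
        replies = list(pool.map(model, prompts))
    hits = sum(reply.strip() == expected
               for reply, (_, expected) in zip(replies, cases))
    return hits / len(cases)


def rank_models(models: Dict[str, ModelFn],
                cases: List[Case]) -> List[Tuple[str, float]]:
    """Score every model on the same cases, best first."""
    scores = {name: evaluate(fn, cases) for name, fn in models.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def cost_rmb(total_tokens: int, price_per_million: float) -> float:
    """Estimated spend given a flat price per million tokens."""
    return total_tokens / 1_000_000 * price_per_million


if __name__ == "__main__":
    # Stub "models" standing in for real API calls.
    answers = {"color of the sky?": "blue", "color of grass?": "green"}
    cases = list(answers.items())
    models = {
        "always_blue": lambda prompt: "blue",
        "lookup": lambda prompt: answers[prompt],
    }
    print(rank_models(models, cases))
    print(cost_rmb(2_000_000, 3.0))
```

Threads fit here because the work is I/O-bound (waiting on HTTP responses), and keeping the test cases fixed across models is what makes the resulting ranking comparable.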
The reason I say I viewed the development of the large model field as an outsider is that my work did not involve the core algorithms of LLMs, but rather focused on their specific applications. However, I could see the true capabilities of various models and my own real strength from my experimental results (this business genuinely tests the models' instruction following, information extraction, and reasoning abilities, and the rankings were quite similar to the LMSYS leaderboard), which made me immune to the marketing of various vendors. I could even dismiss some companies' later high-scoring models out of hand, purely on prior impressions.
As for what followed, domestic large model vendors began to follow DeepSeek into a price war, which was a huge boon for our business at the time. Of course, I didn't care about the price cuts of Doubao or Ernie (Wenxin), but Qwen's price drop brought it into consideration. In June, Tongyi Qianwen launched the open-source Qwen2 models. After testing, I began to wonder why Alibaba's open-source models were far better than the closed-source series such as qwen-plus; Qwen2 was even better than the expensive qwen-max, and I remember this reversal only disappeared after August. From here, I developed a favorable impression of these two domestic companies' models, and DeepSeek also became my dream company (which is very close to my high school, so I can often visit 🥹). Later events proved that both companies kept up excellent momentum and were widely discussed in foreign open-source communities, with Qwen2.5/QwQ and the recent DeepSeek V3 leading the way. The fact that many people are happy to use these two models in Cursor speaks volumes.
During this time, I met up with @JerryYin777, who was back home for an internship. Talking to him was incredibly informative, and I learned a lot about the LLM field, learning paths, and interesting stories. Let's grab a meal again when we have the chance!
I originally planned to find a summer research opportunity, but for various reasons, it didn't materialize, so I continued my internship until August.
A keepsake from my last day at work. The company environment and atmosphere were great; I’ll come back next time :)
Knowledge Supplement and Research
During the internship, besides Python back-end work, LLM application implementation, and performance evaluation, I also learned about model architecture, training processes, and so on. Looking back now, these are all fundamentals, and at the time I was satisfied just to run an Unsloth LoRA fine-tuning script on a 7B model. Additionally, there was a period when the company needed to research private deployment, so I started studying vLLM and related topics, pushing myself to learn by producing output. This led me to the DataWhale llm-deploy project, where I was responsible for the concurrency tutorial, which helped me fill a significant knowledge gap in the MLSys field.
However, I was still quite confused about my learning and development path. Fortunately, while browsing Zhihu, I came across this article:
Quokka: I don't have experience with large models, can I get a chance?
Coincidentally, the author was working at my dream company, DeepSeek, so I took the advice he mentioned at the end of the article as a bitter lesson and read it daily to motivate myself. Now, looking back at my progress on each item:
- A. I have implemented and compared the performance of different pipeline algorithms on two 2080 Ti cards: after learning the principles of DeepSpeed and Megatron, I am preparing to produce some related content soon;
- B. I have implemented some operators using Triton: including but not limited to Nagiovo: Triton First Try: Implementing Softmax Forward Kernel, and I plan to organize this into a GitHub repo after deepening my learning;
- C. I can explain the differences in tokenizers used by different large models: my blog notes on Karpathy's course, The Evolution of LLMs (Part 6): Unveiling the Mystery of Tokenizers, cover implementing BPE and understanding some characteristics of tiktoken;
- D. I have decent development skills in languages other than Python (e.g., endorsements from certain open-source projects): perhaps? I’ve gone through Modern C++ and have some front-end and back-end development skills, mainly being willing to learn hands-on 🥹.
- E. I have implemented an outstanding Gomoku AI (preferably using RL algorithms): Nagiovo: From MCTS to AlphaZero.
My roommate was a great help in testing the AI's playing strength.
I replicated AlphaZero, and the training results were good. The project was mainly inspired by o1-style reasoning, and I also have a strong interest in RL itself (when I chose the AI major, I thought artificial intelligence just meant game AI built with things like reinforcement learning).
I also worked through the basics of reinforcement learning to prepare for RLHF and other topics.
Having experienced industry, I also wanted a formal research experience. A new AP (assistant professor) at my school offered me an opportunity in the CV field (which also serves as my graduation project), and I have been working on it ever since. After graduation, I will definitely write about this experience and promote my super nice teacher 🥳.
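As a small illustration of the step from MCTS to AlphaZero mentioned above, here is a sketch of the PUCT rule that AlphaZero-style search uses to choose which child to explore. It is not the code from my replication: the `Node` class, the function names, and the default `c_puct = 1.5` are assumptions made for this example, and values are assumed to be stored from the point of view of the player choosing at the parent.

```python
import math
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Node:
    prior: float                 # P(s, a) from the policy network
    visit_count: int = 0         # N(s, a)
    value_sum: float = 0.0       # sum of backed-up values
    children: Dict[int, "Node"] = field(default_factory=dict)

    def q(self) -> float:
        """Mean value Q(s, a); 0 for an unvisited node."""
        return self.value_sum / self.visit_count if self.visit_count else 0.0


def puct_score(parent: Node, child: Node, c_puct: float = 1.5) -> float:
    """Q + U, where U = c_puct * P * sqrt(sum_b N(s, b)) / (1 + N(s, a))."""
    total_visits = sum(c.visit_count for c in parent.children.values())
    u = c_puct * child.prior * math.sqrt(total_visits) / (1 + child.visit_count)
    return child.q() + u


def select_child(parent: Node, c_puct: float = 1.5) -> int:
    """Return the action whose child has the highest PUCT score."""
    return max(parent.children,
               key=lambda a: puct_score(parent, parent.children[a], c_puct))


if __name__ == "__main__":
    root = Node(prior=1.0)
    root.children[0] = Node(prior=0.5, visit_count=10, value_sum=2.0)
    root.children[1] = Node(prior=0.5)  # never visited yet
    print(select_child(root))
```

The exploration term U shrinks as a move gets visited, so a move with a decent prior but no visits is tried before the search keeps deepening an already well-explored one.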
After starting my senior year, I realized that I had always exhibited symptoms consistent with ADHD. After getting a basic diagnosis at the hospital, I felt much more at ease, as if I had a manual for my body. Looking on the bright side, this is also an advantage that allows me to maintain hyperfocus in areas of interest.
At the same time, regarding further studies, I shifted from considering a one-year master's program to contemplating a PhD, but I am uncertain whether this choice will provide me with enough motivation. Therefore, I plan to take a gap year for internships/RA positions to explore slowly (by the way, my publications are also insufficient, so I may need to settle down a bit). Of course, based on my understanding of myself, whether in research or industry internships, I only want to engage in meaningful work. Therefore, if I could have an internship opportunity at my dream company, I might go all in (dreams are still important, as they remain one of my driving forces).
Looking Ahead to 2025
- My goal for this year is to refine my character. After seeing many peers and seniors in the same field, anxiety is inevitable. Based on my personal experience, comparing myself only to yesterday's self makes it easy to fall into local minima, so I need to balance chasing role models with self-improvement.
- As with Agents, there is a trade-off between long-term planning and fixed local workflows. I hope to look far ahead while taking each step well, making sure that even small tasks are done quickly and well.
- As for other matters, alongside writing blog posts I will also keep up my Zhihu, Xiaohongshu, and X accounts. Maintaining the ability to express myself in the era of GenAI is a challenge in its own right, as daily life cannot be completely entrusted to LLMs 🥰.
- Solidify the basics and maintain curiosity.
Thus, the annual recollection ends. Thank you for reading, and I wish everyone a Happy 2025 ~