How to start working on a large code base
Recently I started working on a large Python code base. It has been in production for the last several years. As usual, the architecture has taken on many responsibilities that were not intended at its inception, and the original authors have moved on. Sound familiar?
The code is heavily multiprocess and multithreaded. It uses REST calls to communicate with the main process.
Here are some lessons learned from dealing with this codebase and from my past experience:
1. Understand functionality before jumping into the code
It is very important to understand the functionality before jumping into the code. For example, I did not realize at first that the same codebase is used both for training custom models and for serving inference requests.
If the service exposes an API, read the documentation and try out as many calls as you can to understand what it does and what it does not do.
Also look at the call flow: how calls move between components, which component manages state, and so on. Try to match what you see with the architecture diagrams, and if there are any discrepancies, raise them in team meetings.
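Trying out API calls can be scripted so the results are easy to repeat and compare. The following is only a minimal sketch using the standard library; the base URL and the paths in the usage comment are hypothetical, so take the real ones from the service's API documentation.

```python
import urllib.error
import urllib.request


def build_url(base_url, path):
    """Join a base URL and a path with exactly one slash between them."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def probe(base_url, path, timeout=5.0):
    """GET the URL and return (status_code, body_text).

    HTTP error responses (4xx/5xx) are returned instead of raised, because
    what a service rejects is as informative as what it accepts.
    If the service cannot be reached at all, the status is None.
    """
    url = build_url(base_url, path)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # HTTPError must be caught before URLError (it is a subclass).
        return err.code, err.read().decode("utf-8", errors="replace")
    except urllib.error.URLError as err:
        return None, str(err.reason)


# Usage (hypothetical host and paths):
# for path in ("/health", "/models", "/does-not-exist"):
#     print(path, *probe("http://localhost:8080", path))
```

Keeping the calls in a small script like this also gives you something to paste into team discussions when the behaviour does not match the documentation.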
2. Setting up a local development environment
I would say this is the most important part. No matter how painful it is, it is a worthy investment for long-term gains.
While setting up, you will often find that the README is not up to date. Please update the documentation to reflect what you did to make things work; future generations will thank you for it. For example, in my case one of the test configs was pointing to an old version of the model. It took me a long time to download and set up that model, only to realize it was old and no longer required.
I use a local Docker container to set up the local development environment. Mounting the local code as a volume in the container is the best way to modify code locally without building a new Docker image.
If you don’t know what the docker run command should look like, start with deployment.yaml, which contains the environment variables, and also look at the cmd section to understand the container’s starting script.
Most of the time, my first PR includes changes to the README and documentation, plus the code changes required to set up the local dev environment.
3. Take stock of unit and functional tests
Unit tests give you the opportunity to understand the code in small chunks. Functional tests provide a baseline that you can use to verify your changes.
If there are no tests, life is going to be hard :) In that case, consider writing tests as your first task.
Always look for unit and functional tests for the part you are going to touch first, and consider writing more tests to cover all aspects of the functionality.
I would say keeping tests up to date is the most sacred responsibility of every developer.
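When there are no tests for the part you are about to touch, one possible first step is a test that simply pins down what the code does today. The sketch below uses the standard unittest module; normalize_label is a hypothetical stand-in for whatever legacy function you are working on, not something from the codebase described here.

```python
import unittest


def normalize_label(raw):
    """Hypothetical stand-in for a legacy function you are about to change."""
    return raw.strip().lower().replace(" ", "_")


class NormalizeLabelCharacterization(unittest.TestCase):
    """Records what the code does today, not what it ideally should do."""

    def test_known_inputs(self):
        # Expected values were obtained by running the current code once.
        cases = {
            "Cat": "cat",
            "  Polar Bear ": "polar_bear",
            "": "",
        }
        for raw, expected in cases.items():
            with self.subTest(raw=raw):
                self.assertEqual(normalize_label(raw), expected)
```

Saved in a test module, this runs with python -m unittest. If a later change makes it fail, you know the behaviour changed, and you can decide whether that was intended.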
4. Understand the Dockerfile and deployment manifest
If you run the code in a Docker container, reading the Dockerfile is the easiest way to know what is going on inside the container.
For example, after looking at the Dockerfile I realized there is a cron job that runs periodically to clean up the logs. Pay special attention to the ENTRYPOINT and CMD instructions to learn which script starts the main process and how arguments are passed to it.
If you have a k8s cluster, check the deployment YAML. It gives a fair idea of what runs alongside this code, what the environment variables are, and how configuration is provided at runtime.
I use the deployment manifest to design the docker run command for my local development environment.
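The mapping from a deployment manifest to a docker run command can be sketched in a few lines. This is an illustration under assumptions, not the exact procedure used here: the manifest is passed as an already-parsed dict (loading YAML needs a third-party parser such as PyYAML), only the first container is considered, and the image name, variables and /app mount point in EXAMPLE are made up.

```python
import shlex


def docker_run_args(deployment, code_dir, mount_at="/app"):
    """Build a `docker run` argument list from a parsed Deployment manifest.

    Returns (args, skipped), where skipped lists env vars that have no
    literal value (e.g. secret references) and must be supplied by hand.
    """
    container = deployment["spec"]["template"]["spec"]["containers"][0]
    args = ["docker", "run", "--rm", "-it"]
    skipped = []
    for env in container.get("env", []):
        if "value" in env:
            args += ["-e", "%s=%s" % (env["name"], env["value"])]
        else:
            skipped.append(env["name"])
    for port in container.get("ports", []):
        number = port["containerPort"]
        args += ["-p", "%d:%d" % (number, number)]
    # Mount the local checkout so code edits do not need an image rebuild.
    args += ["-v", "%s:%s" % (code_dir, mount_at)]
    # In Kubernetes, `command` replaces ENTRYPOINT and `args` replaces CMD.
    command = container.get("command", [])
    if command:
        args += ["--entrypoint", command[0]]
    args.append(container["image"])
    args += command[1:] + container.get("args", [])
    return args, skipped


# Hypothetical manifest fragment, reduced to the fields used above.
EXAMPLE = {"spec": {"template": {"spec": {"containers": [{
    "image": "example/service:1.0",
    "command": ["python", "serve.py"],
    "args": ["--port", "8080"],
    "env": [
        {"name": "LOG_LEVEL", "value": "debug"},
        {"name": "API_KEY", "valueFrom": {"secretKeyRef": {"name": "s", "key": "k"}}},
    ],
    "ports": [{"containerPort": 8080}],
}]}}}}

run_args, missing = docker_run_args(EXAMPLE, "/home/me/src")
print(shlex.join(run_args))  # shlex.join needs Python 3.8+
print("set by hand:", missing)
```

The right mount point depends on where the Dockerfile copies the code (often its WORKDIR), so check that before relying on the default.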
5. Focus on one aspect of the code
It is impractical to set out to understand everything; it might take forever. I usually focus on the functionality that is easiest. For example, understanding the code for a GET call is easier than understanding a POST call.
Another way to choose is based on the existing unit and functional tests; you can use them to understand the flow.
Focus first on the area you want to modify, and ask for previously recorded knowledge-sharing sessions pertaining to that area.
I use the “Go to definition” shortcut in my code editor to navigate through the code.
6. Debuggers are overhyped and logs are your best friend
Code debuggers are great for understanding a small code base. However, I have rarely used a debugger in large code bases, and things get even more complicated if the code is multithreaded.
I have found that using logs to read code is the best method. Go through the log and try to match it with the code. For example, if a log entry says “server is ready”, I search for that string in the code to see where it comes from. The reverse also helps: copy a log message from the code and search for it in the logs.
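Searching the code for a log message is usually a job for grep or the editor's search, but it can also be sketched in Python. The helper below is a minimal, assumed example (find_in_code is not part of any tool mentioned here); it walks a directory and reports every source line containing the text.

```python
from pathlib import Path


def find_in_code(root, needle, suffixes=(".py",)):
    """Yield (path, line_number, line) for each source line containing needle."""
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                yield path, number, line.strip()


# Usage: search the current checkout for the message seen in the log.
# for path, number, line in find_in_code(".", "server is ready"):
#     print("%s:%d: %s" % (path, number, line))
```

Log messages are often built from format strings, so search for the constant part of the message rather than the values that were filled in at runtime.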
Bumping the log level also helps, as long as it does not open a floodgate of logs.
Adding output to stdout (messages like print("Here 1") and print("opts: %s" % opts)) is still my favorite debugging method, irrespective of the size of the code.
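For multiprocess and multithreaded code, it helps if every log line says which process and thread produced it. Here is a minimal sketch with the standard logging module: it raises the log level for your own code while keeping a chatty library quiet. The logger names "demo.worker" and "urllib3" and the thread name are only examples.

```python
import logging
import threading

# processName and threadName are standard LogRecord attributes.
LOG_FORMAT = (
    "%(asctime)s %(levelname)s %(processName)s/%(threadName)s "
    "%(name)s: %(message)s"
)


def configure_logging(level=logging.INFO, noisy=("urllib3",)):
    """Set the root log level, but keep known-noisy loggers at WARNING."""
    # force=True (Python 3.8+) replaces handlers installed earlier.
    logging.basicConfig(level=level, format=LOG_FORMAT, force=True)
    for name in noisy:
        logging.getLogger(name).setLevel(logging.WARNING)


def worker(opts):
    log = logging.getLogger("demo.worker")
    # Unlike print, logging fills in %s from the extra arguments itself.
    log.debug("opts: %s", opts)


configure_logging(logging.DEBUG)
thread = threading.Thread(target=worker, args=({"batch": 8},), name="loader-1")
thread.start()
thread.join()
```

With this format, each line carries names such as MainProcess/loader-1, which makes interleaved output from several workers much easier to untangle.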
7. Do not start with learning the programming language
With due respect, I will add “and reading good code”.
Instead of starting with the programming language documentation, google things as you get stuck. For example, I recently looked up the special Python method __call__. It provides syntactic sugar for calling an instance directly: calling obj() actually calls obj.__call__(). I did the same for **kwargs (I found this great link, which explains it nicely: https://realpython.com/python-kwargs-and-args/).
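Both __call__ and **kwargs can be tried out in a few lines. The Greeter class below is a made-up example, not code from the project.

```python
class Greeter:
    """An instance of this class can be called like a function."""

    def __init__(self, greeting):
        self.greeting = greeting

    def __call__(self, name, **kwargs):
        # kwargs is a plain dict holding any extra keyword arguments.
        punctuation = kwargs.get("punctuation", "!")
        return "%s, %s%s" % (self.greeting, name, punctuation)


greet = Greeter("Hello")

plain = greet("world")                 # calling the instance...
explicit = greet.__call__("world")     # ...is the same as calling __call__
question = greet("world", punctuation="?")

options = {"punctuation": "."}
unpacked = greet("world", **options)   # ** unpacks a dict into keyword arguments

print(plain, explicit, question, unpacked)
```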
I also try writing snippets of code to understand more. For example, the difference between a list and a tuple can be easily understood by writing a few lines of code.
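A snippet of that kind might look like this. It shows two practical differences: a list can be changed in place while a tuple cannot, and a tuple of hashable items can be used as a dictionary key while a list cannot.

```python
numbers_list = [1, 2, 3]
numbers_tuple = (1, 2, 3)

# Lists are mutable: they can grow and change in place.
numbers_list.append(4)
numbers_list[0] = 99

# Tuples are immutable: item assignment raises TypeError.
try:
    numbers_tuple[0] = 99
    tuple_assignment_failed = False
except TypeError:
    tuple_assignment_failed = True

# A tuple of hashable items can be a dict key; a list cannot.
lookup = {numbers_tuple: "found"}
try:
    lookup[numbers_list] = "never stored"
    list_key_failed = False
except TypeError:
    list_key_failed = True

print(numbers_list, numbers_tuple, tuple_assignment_failed, list_key_failed)
```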
8. Commit History
As described in my previous post, Sanctity of Git Repo, commit history provides a great deal of information: which parts of the code are active, who is actively modifying the code, which features were implemented recently, which bugs were fixed and when, when the last refactoring was done, and so on. The commit authors are the people whose brains you should pick.
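One way to find out who is active in a part of the code is to count recent commits per author. The sketch below shells out to git log with its --since and --format=%an options; the function names, the one-year window and the path in the usage comment are arbitrary choices for illustration.

```python
import subprocess
from collections import Counter


def count_authors(log_output):
    """Count commits per author from `git log --format=%an` output."""
    names = (line.strip() for line in log_output.splitlines())
    return Counter(name for name in names if name)


def recent_authors(repo_dir, path=".", since="1 year ago"):
    """Return [(author, commit_count), ...] for a path, most active first."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "log", "--since=" + since,
         "--format=%an", "--", path],
        check=True, capture_output=True, text=True,
    )
    return count_authors(result.stdout).most_common()


# Usage (hypothetical path inside the repository):
# for author, commits in recent_authors(".", "service/inference"):
#     print(commits, author)
```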
9. Learning from code comments
Code comments provide valuable information about the assumptions previous developers made. Please consider enhancing existing comments or adding new ones to reflect your understanding of the code; that way the next person after you can stand on your shoulders and move up quickly.
10. Start with refactoring
I usually start with refactoring, no matter how small or large it is.
You don’t need to understand the full picture before you start making changes. Refactoring is the best way to start contributing while building your confidence.
For example, here is a change I made a few years ago in real production code.
Always make sure your changes do not break anything by running the functional and unit tests. Do not mix functional changes and refactoring in one PR; keep them separate.
Hope this helps. Please let me know in the comments what works for you.