Do I need to write CUDA myself for algorithm jobs?



Link: https://www.zhihu.com/question/436008648

Disclaimer: shared for academic purposes only; in case of infringement, contact us for deletion.

Author: Zhihu user
https://www.zhihu.com/question/436008648/answer/1683251210

95% of algorithm positions do not require it.

A true story: before joining NVIDIA, a colleague of mine interviewed at Company G, but was rejected in the second phone round because he could not implement a B-tree on the spot. At the time his background was assistant professor at a university in Germany; he had published two books on CUDA and parallel programming and was fluent in PTX. He even argued with Company G that he could provide a CUDA implementation of a binary tree with comparable performance, but they turned him down and said that was not what they needed.

I can only say that 95% of the algorithm and programming positions in the world do not require you to program GPUs. Grinding interview problems well matters more than any of this.

Of course, if Company G had circled back to him a while ago and he had still failed to turn the tables, that would be another story.

Some people in the comments said that if you chose to interview at Company G, you should have known coding questions were coming, and that this was simply poor preparation. Indeed, my colleague later said the same: he had not prepared for the coding questions at all, because he had no plans to work as a coder, and went in with a "give it a try" mentality.

As for these data-structure and algorithm "wheels", don't assume that anything outside interview prep is useless. Even if you spend your days training models in PyTorch, you still need to pay attention to them: for example, under what circumstances new memory is allocated, when a tensor is copied, and so on. The deeper the foundations of the wheels, the better the comfort and performance of the finished car; that much is certain.
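For instance, a few lines of PyTorch make the allocation-versus-view distinction concrete (a minimal sketch; the pointer checks assume a contiguous starting tensor):

```python
import torch

x = torch.arange(12)
v = x.view(3, 4)                       # view: shares storage, nothing is copied
print(v.data_ptr() == x.data_ptr())    # True

t = v.t()                              # transpose: still a view, but non-contiguous
c = t.contiguous()                     # forces a copy: new memory is allocated
print(c.data_ptr() == x.data_ptr())    # False

x.add_(1)                              # in-place: no new allocation
y = x + 1                              # out-of-place: allocates a fresh tensor
```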

Author: DLing
https://www.zhihu.com/question/436008648/answer/1746022828

By default I'll assume we are talking about CV (computer vision)!

Generally we don't need to write CUDA in our work, but this heavyweight is always around us: we see it every day inside various third-party libraries, even though we rarely call CUDA functions directly. When you hit a performance bottleneck, however, you may have to ask the CUDA master to take the stage.

For example, suppose we need to optimize a model's inference performance. After a round of pruning, distillation, and quantization, pure inference time drops from 30 ms to 15 ms: performance has doubled, which feels great. But then we profile the pipeline: data preprocessing takes 10 ms and postprocessing takes 15 ms, so more time is spent handling data than on inference itself. This is where the CUDA master takes the stage. We move the pre- and post-processing to CUDA, and in testing they now finish in 5 ms combined. The report writes itself: end-to-end time has gone from 55 ms down to 20 ms. The leader reads it and is full of praise; promotion and a raise are no longer a dream.
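The answer above means hand-written CUDA kernels; as a rough illustration of the same idea (keeping preprocessing on the GPU instead of bouncing through the CPU), here is a minimal PyTorch sketch. It assumes a CUDA device is available, and the shapes and normalization constants are just for the example:

```python
import torch

def preprocess_on_gpu(frames_u8: torch.Tensor) -> torch.Tensor:
    """Toy preprocessing done entirely on the GPU:
    NHWC uint8 frames -> normalized NCHW float32, no CPU round trip."""
    x = frames_u8.to("cuda", non_blocking=True)      # single host-to-device copy
    x = x.permute(0, 3, 1, 2).float().div_(255.0)    # layout change + rescale
    mean = torch.tensor([0.485, 0.456, 0.406], device="cuda").view(1, 3, 1, 1)
    std = torch.tensor([0.229, 0.224, 0.225], device="cuda").view(1, 3, 1, 1)
    return (x - mean) / std

# e.g. a batch of 8 RGB frames at 224x224
out = preprocess_on_gpu(torch.randint(0, 256, (8, 224, 224, 3), dtype=torch.uint8))
```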

For another example, suppose the model's metrics have plateaued for a while. Looking at the three-year-old model you maintain, you feel it is time to embrace something new. After a round of conferences and blog reading, you find a paper published just three months ago; the gorgeous prose and SOTA numbers make it irresistible, and you want it in your own project right away, but there is no open-source code on GitHub. The CUDA master takes the stage again, staying up all night to turn the equations dancing across the paper into clean CUDA operators. The model is trained, the numbers are measured, and the metric goes up 3 points. The boss reads the report and adds a compliment; once again, promotion and a raise are not just a dream.

CUDA has many uses, and the water runs deep. In the CV direction there are generally not many chances to write CUDA by hand, but when it is needed, it is usually to solve a major problem. Learn a bit about it: wielding it well can indeed be a big plus for you.

Author: Chan Yu
https://www.zhihu.com/question/436008648/answer/1649590705

Conclusion first: basically not required, but it is a nice plus!

In a pure algorithm role, you generally only come into contact with CUDA when you need to implement some unconventional operator to support an experimental idea.

At this stage, most custom operators, whether in TensorFlow or PyTorch, can be worked around by composing the rich library of basic operators, as in the sketch below.
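For example, an activation with no dedicated kernel in your framework can often be assembled from stock primitives, with autograd deriving the backward pass for free (a sketch using Mish, which is simply x·tanh(softplus(x))):

```python
import torch
import torch.nn.functional as F

def mish(x: torch.Tensor) -> torch.Tensor:
    # A "custom" activation composed entirely from built-in operators:
    # no CUDA kernel to write, and backward comes from autograd.
    return x * torch.tanh(F.softplus(x))

x = torch.randn(4, requires_grad=True)
mish(x).sum().backward()   # gradients just work
```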

The fallback is to exploit dynamic graphs: implement the op with numpy, py_func, and the like, then manually define its backward gradient function.

What matters more here is the algorithm engineer's ability to define forward computations and their backward passes. It is not particularly difficult work, but many people in algorithm roles may never have thought carefully about how to define the gradient of something as basic as matrix multiplication, or may not know where to start with the autograd mechanisms of TensorFlow or PyTorch.
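As a concrete exercise, here is a minimal sketch of matrix multiplication as a custom torch.autograd.Function with a hand-written backward; for Y = XW, the gradients are dL/dX = dL/dY·Wᵀ and dL/dW = Xᵀ·dL/dY:

```python
import torch

class MatMul(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, w):
        ctx.save_for_backward(x, w)
        return x @ w

    @staticmethod
    def backward(ctx, grad_out):
        x, w = ctx.saved_tensors
        # Y = X W  =>  dL/dX = dL/dY @ W^T,  dL/dW = X^T @ dL/dY
        return grad_out @ w.t(), x.t() @ grad_out

x = torch.randn(2, 3, dtype=torch.double, requires_grad=True)
w = torch.randn(3, 4, dtype=torch.double, requires_grad=True)
MatMul.apply(x, w).sum().backward()
torch.autograd.gradcheck(MatMul.apply, (x, w))   # numerically verify the backward
```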

As mentioned, CUDA's biggest draw is performance. Both speed and GPU memory usage can be improved substantially through CUDA development, typically via memory-access efficiency, data-structure design, operator fusion, and similar techniques. In my own experience, a hand-written operator library can easily run a model two to three times faster than native TensorFlow, in special cases more than ten times, and in extreme cases hundreds of times. Sensible design can also cut GPU memory overhead dramatically compared with rigidly stacking native operators, which benefits both training and inference.
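The fusion point is easy to see: in eager mode, a chain of elementwise ops launches one kernel per op and materializes every intermediate tensor, while a fused kernel reads and writes memory once. A rough sketch, using torch.compile (PyTorch 2.x) as a stand-in for a hand-fused kernel; it assumes a CUDA device, and speedups will vary:

```python
import torch

def gelu_tanh(x):
    # ~7 elementwise ops: eager mode launches a kernel per op and
    # allocates an intermediate tensor at each step
    return 0.5 * x * (1.0 + torch.tanh(0.7978845608 * (x + 0.044715 * x * x * x)))

fused = torch.compile(gelu_tanh)   # may fuse the whole chain into one kernel

x = torch.randn(1 << 24, device="cuda")
fused(x)                           # warm-up call triggers compilation
# Timing both versions with torch.cuda.Event would show fewer kernel
# launches and less memory traffic for the fused one.
```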

However, CUDA development carries considerable cost, especially in algorithm research, where flexibility matters: you may need to try many different custom operators in a short time. If throwing more machines at the problem works, deep optimization is usually not urgent; and before the network structure is settled, final optimization is hard to do anyway. Unless an operator is used particularly often and the before/after difference is dramatic (say, around 5x overall, in memory or in speed), optimizing individual operators is rarely worth it. If the model does not improve significantly after all the CUDA work, the time and labor may not pay off. I prefer to verify the algorithm first and optimize afterwards.

Besides, when the research goes deep enough that CUDA optimization is truly necessary, companies generally have a dedicated HPC group or team to take over; or it may be time to ask why general-purpose operators cannot meet the need.

Now for the cost of CUDA development. Setting aside the difficulty of writing and debugging C/C++ itself, CUDA C is basically half a new language, and its debugging logic is entirely its own. If you make aggressive modifications for extreme performance and then hit illegal memory accesses, things get even more painful, although with NVIDIA's visual debugging tools out now, the experience should be much better. Newcomers may think it is enough to swap operators for cuDNN API calls, but the bottleneck is often memory-access efficiency, which usually means writing a new kernel that fuses operators and manages GPU memory and cache by hand. The workload, especially the debugging, is much larger than imagined; naively dropping in cuDNN calls can even add data-copy overhead and make performance worse. All of this experience takes a long time to accumulate: either leave it to a professional team, or commit to learning and practicing it over the long haul. It is basically not a required skill for an algorithm engineer.

Still, it is very good for an algorithm engineer to have this perspective. Even without writing CUDA, you can think about problems in terms of performance and efficiency when designing models, rather than being limited to model quality. Industry cares a great deal about cost.

Author: OLDPAN
https://www.zhihu.com/question/436008648/answer/1707546242

It depends on the business direction. I, for example, am responsible for both model development and model deployment: the former is mostly Python, the latter C++ and CUDA.

There are also people in our group who only write Python: they train models, research models, and improve model-serving, leaving the acceleration work to others or to me.

For pure model development we generally use a framework such as PyTorch, and you most likely won't need to write CUDA yourself. It is enough to be able to read some CUDA code and to use, with the correct posture, the CUDA wheels others have built. Most ops already have CUDA versions, DCN for example, so you can use a trained model off the shelf for free.
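Deformable convolution (DCN), for instance, already ships as a ready-made CUDA operator in torchvision, so you can call someone else's kernel without writing any CUDA yourself (a minimal sketch; the shapes are illustrative):

```python
import torch
from torchvision.ops import deform_conv2d

x = torch.randn(1, 3, 8, 8)
weight = torch.randn(6, 3, 3, 3)        # 3x3 kernel -> 9 sampling points
offset = torch.zeros(1, 2 * 9, 6, 6)    # (dy, dx) per sampling point, per output pixel
out = deform_conv2d(x, offset, weight)  # runs the CUDA kernel if tensors are on GPU
print(out.shape)                        # torch.Size([1, 6, 6, 6])
```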

But if the work involves deployment and model acceleration, or the model has to go up on a server, then as long as your model needs to run fast on a GPU, you will need to write some CUDA yourself: pre- or post-processing, or some op inside the model, again a custom op like DCN. TensorRT and Triton Inference Server, for example, require you to write custom plugins in CUDA and C++ for the sake of acceleration.

Even if your business never uses CUDA, it is still worth learning the ideas behind CUDA's parallel style. The gap between parallel thinking and ordinary serial thinking is quite large; there is a real generation gap, and crossing it takes time.
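A toy example of that mindset gap: a serial sum walks the array in n dependent steps, while the parallel version halves the problem each round, which is roughly how a CUDA thread block reduces a sum (a sketch in plain PyTorch):

```python
import torch

x = torch.randn(1 << 20)
total = x.sum()                  # reference answer

# Parallel thinking: tree reduction, log2(n) dependent rounds,
# every pair within a round can be summed simultaneously.
y = x.clone()
while y.numel() > 1:
    if y.numel() % 2:            # pad odd lengths with a zero
        y = torch.cat([y, y.new_zeros(1)])
    y = y[0::2] + y[1::2]        # one "round" of pairwise sums
print(y.item(), total.item())    # agree up to float rounding
```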
