Hi Clint,
The requirements are quite clear and straightforward to implement. The right tools for the job, though, depend on the output quality you expect and how much the system needs to scale. JSON doesn't really play a role here, at least not in that sense.
We can use ffmpeg and simply place the text on the designated areas of each video. The possible downside of that approach is that the output quality may not be up to your standard, since we'd be leaving the Adobe ecosystem. It's hard to say more without taking a look at some of your videos and the fonts you want to use for the text.
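To give a rough idea of the ffmpeg route, here is a minimal sketch in Python (for illustration only, not the final implementation) that builds an ffmpeg call using the drawtext filter. The file names, font path, coordinates and the helper names are placeholders I made up, and the escaping is deliberately simplified; real input would need more careful handling.

```python
def escape_drawtext(text):
    # Simplified escaping: backslash-prefix characters that are special
    # inside a drawtext option value. Not a complete escaping scheme.
    return "".join("\\" + ch if ch in "\\':" else ch for ch in text)


def build_drawtext_command(src, dst, text, fontfile, x, y,
                           fontsize=48, fontcolor="white"):
    # Hypothetical helper: returns the argument list for one overlay job.
    drawtext = (
        f"drawtext=fontfile={fontfile}:text={escape_drawtext(text)}"
        f":x={x}:y={y}:fontsize={fontsize}:fontcolor={fontcolor}"
    )
    return [
        "ffmpeg", "-y",      # overwrite the output file if it exists
        "-i", src,           # input video
        "-vf", drawtext,     # draw the text onto every frame
        "-c:a", "copy",      # keep the audio stream untouched
        dst,
    ]
```

The list can be handed to subprocess.run as is, e.g. build_drawtext_command("in.mp4", "out.mp4", "Hello", "font.ttf", 100, 200).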
The other option, and the more complex one, is to use aerender, possibly together with scripting. With this option an actual After Effects instance renders the video, just as you do manually in After Effects, so the quality will be top notch and no different from the manual output.
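For comparison, the aerender route would boil down to a call along these lines. Again this is just a Python sketch under assumptions: the project path, composition name and output path are invented, aerender is assumed to be on the PATH, and replacing the text inside the project (the scripting part) is not shown.

```python
def build_aerender_command(project, comp, output):
    # Hypothetical helper: renders one composition of an existing
    # After Effects project to the given output file.
    return [
        "aerender",
        "-project", project,  # path to the .aep project file
        "-comp", comp,        # name of the composition to render
        "-output", output,    # path of the rendered file
    ]
```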
In either case, the output video will not be ready immediately, so the user will have to wait for the process to finish. The time it takes depends mostly on the size of the video, but ffmpeg would certainly be faster and less resource intensive. I'm not sure how many users you expect, but you should consider setting up a queue system. Resource-intensive jobs like these have the potential to get out of hand very quickly. For the sake of clarity I'm leaving out the details of the infrastructure you'd need to run this; maybe you don't need scalability at all.
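To illustrate what I mean by a queue, here is a minimal in-process sketch in Python. The function name and the render callback are hypothetical; a production setup would more likely use a persistent queue, so that jobs survive a restart. The point is only that a fixed number of workers caps how many renders run at once.

```python
import queue
import threading


def run_render_queue(jobs, render, workers=1):
    # Process jobs with a fixed number of worker threads.
    # Returns (job, result) pairs in completion order.
    pending = queue.Queue()
    for job in jobs:
        pending.put(job)

    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                job = pending.get_nowait()
            except queue.Empty:
                return  # nothing left to do
            outcome = render(job)
            with lock:
                results.append((job, outcome))

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With workers=1 the jobs are rendered strictly one after another, which is the safest starting point on a single server.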
Anyway, my current bid is for an entry-level system using ffmpeg. Per your request I'll implement it in PHP 7.2. It'll take 5 days. Thanks.