Great point! The article assumes the raw data is already stored somewhere and can be downloaded in any pipeline task for further processing, but you're right that we should be more explicit. We're planning a more detailed series and will make sure to cover this. Thanks for the feedback!
Regarding data privacy: this largely depends on the data you're working with. If it isn't sensitive, storing it in S3 is convenient. If it is, you need to ensure that there are proper access controls in place and be extra careful with pipeline artifacts such as notebooks and logs, as they may leak sensitive data.
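
To make the point about logs a bit more concrete, here is a minimal sketch of one way to mask sensitive values before they reach a log file. It uses only Python's standard `logging` module. The `RedactEmails` filter, the `make_logger` helper, and the email pattern are all hypothetical names and choices made for illustration, and it assumes email addresses are the sensitive field. What actually counts as sensitive depends on your data.

```python
import logging
import re

# Hypothetical pattern: treats anything that looks like an email address as sensitive.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


class RedactEmails(logging.Filter):
    """Mask email-like substrings in log messages before they are emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Render the message with its arguments first, then mask the result.
        record.msg = EMAIL_RE.sub("[REDACTED]", record.getMessage())
        record.args = ()  # arguments are already merged into msg
        return True  # keep the record, only its text has changed


def make_logger(name: str = "pipeline") -> logging.Logger:
    """Return a logger that masks email-like values in its own records."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addFilter(RedactEmails())
    return logger
```

Calling `make_logger().info("loaded record for %s", email)` in a task would then log the masked text instead of the address. A filter attached to a logger only applies to records logged through that logger, so attaching it to the handler instead may be a better fit if several loggers write to the same file. This doesn't replace access controls on the bucket, and it doesn't cover notebook outputs, which need their own review before being stored or shared.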