PySpark Developer
Infosys
Key Responsibilities
- Develop and maintain data pipelines using PySpark
- Process and analyze large-scale datasets in distributed environments
- Design and implement ETL/ELT workflows
- Optimize Spark jobs for performance and scalability
- Work with data stored in HDFS, Hive, or cloud storage (S3, ADLS)
- Collaborate with data engineers, analysts, and business teams
- Ensure data quality, integrity, and governance
- Debug and troubleshoot data processing issues
- Automate workflows using scheduling tools (Airflow, Oozie, etc.)
- Write clean, scalable, and efficient code
Required Skills & Qualifications
Technical Skills
- Strong proficiency in Python and PySpark
- Good experience with Apache Spark (RDDs, DataFrames, Spark SQL)
- Knowledge of the Hadoop ecosystem (HDFS, Hive)
- Experience in ETL pipeline development
- Familiarity with SQL and database concepts
- Experience with data formats (Parquet, ORC, JSON, CSV)
- Basic understanding of distributed computing concepts
- Exposure to version control tools (Git)
Preferred Skills (Nice-to-Have)
- Experience with cloud platforms (AWS, Azure, GCP)
- Knowledge of Databricks or EMR environments
- Familiarity with workflow orchestration tools (Airflow)
- Exposure to Kafka or real-time data streaming
- Understanding of Delta Lake / Lakehouse architecture
- Experience with NoSQL databases (MongoDB, Cassandra)
- Knowledge of CI/CD pipelines and DevOps practices
- Basic understanding of machine learning workflows