Senior Data Engineer
Are you ready to join a cutting-edge digital solutions company and help shape the future of business IT solutions?
Our client is a leading global provider of IT solutions and services, known for their customer-centric approach to digital transformation. With a rich history dating back to 1996, they have continually evolved to meet the changing needs of their customers. Their services encompass consulting, technology, and outsourcing, delivering innovative solutions to complex challenges. They have also been honored multiple times as a top employer, including being named a Great Place To Work from 2015 to 2024.
Responsibilities:
- Design and implement scalable, efficient data pipelines to support analytical and operational use cases.
- Architect and maintain robust data processing workflows, ensuring high availability, performance, and reliability across development, staging, and production environments.
- Collaborate with cross-functional teams to define data requirements, validate quality, and deliver consistent, timely data products.
- Establish and follow best practices in code development, testing, and deployment of data solutions.
- Maintain clear documentation and operational procedures to ensure maintainability, transparency, and business continuity.
Primary Skills:
- 5–8 years of professional experience.
- PySpark & Python Expertise (see the PySpark sketch after this skills list):
  - Advanced PySpark programming: window functions, UDFs, mapPartitions.
  - Performance tuning: caching, partitioning, broadcast joins, shuffle optimization.
  - Python best practices: modular/testable code (pytest, mocks), typing, linting (flake8), formatting (black).
- Databricks & JupyterLab:
  - Workspace management: notebooks, repos, Unity Catalog, Git integration.
  - Notebook parameterization: widgets, papermill; testing via Databricks Connect.
  - Cluster configuration: autoscaling, spot/on-demand, instance pools, init scripts, cluster policies.
- AWS EMR (EC2 & Serverless):
  - EC2-based: bootstrap actions, S3 deployments, EMR Steps API.
  - Serverless: job orchestration, monitoring with CloudWatch, cost control.
- AWS Glue, Athena, DynamoDB:
  - Glue: ETL job development, catalog integration, schema management, crawler automation.
  - Athena: schema design (Parquet/ORC, partitioning), efficient SQL queries, CTAS transformations.
  - DynamoDB: single-table design, GSIs, Streams + Lambda for CDC/incremental ingestion.
- Orchestration & CI/CD:
  - Airflow pipelines (see the Airflow sketch after this skills list):
    - DAGs with retries, SLAs, branching.
    - AWS integration: GlueJobOperator, EmrCreateJobFlowOperator, AthenaOperator.
    - Monitoring: Airflow UI, alerting via SNS/PagerDuty.
  - CI/CD (GitLab CI):
    - Pipelines for testing and linting PySpark/notebook code.
    - Deployment automation for Databricks jobs, Glue jobs, EMR clusters.
    - Secure secrets management: Vault, AWS Secrets Manager.
- Data Architecture:
  - Batch & streaming design: Lambda/Kappa, Spark Structured Streaming, Kafka/Kinesis.
  - Schema evolution & governance: Delta Lake/Iceberg, data contracts, metadata management.
- Strong documentation habits: architecture diagrams, failure recovery runbooks.
- Ability to translate business requirements into technical roadmaps and data product plans.
- Effective communication with both technical and non-technical stakeholders.
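
To give a flavor of the PySpark work behind the skills list above, here is a minimal, illustrative sketch (not part of the role description) combining a window function with a broadcast join; the SparkSession and the `orders`/`customers` DataFrames are invented for the example:

```python
# Illustrative sketch only: assumes a local SparkSession and small,
# made-up `orders` / `customers` DataFrames with the columns used below.
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("skills-sketch").getOrCreate()

orders = spark.createDataFrame(
    [(1, "c1", 120.0), (2, "c1", 80.0), (3, "c2", 45.0)],
    ["order_id", "customer_id", "amount"],
)
customers = spark.createDataFrame(
    [("c1", "PL"), ("c2", "DE")],
    ["customer_id", "country"],
)

# Window function: rank each customer's orders by amount.
w = Window.partitionBy("customer_id").orderBy(F.desc("amount"))
ranked = orders.withColumn("rank", F.row_number().over(w))

# Broadcast join: avoid shuffling the small dimension table.
enriched = ranked.join(F.broadcast(customers), "customer_id")

enriched.show()
```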
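
Likewise, a brief, hypothetical Airflow sketch of the orchestration skills listed above: a DAG with retries and an SLA that triggers a Glue job via GlueJobOperator. The DAG id, Glue job name, region, and schedule are placeholders, not actual project details.

```python
# Illustrative sketch only: DAG id, job name, region and schedule are hypothetical.
# Requires the apache-airflow-providers-amazon package; uses Airflow 2.4+ `schedule`.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

default_args = {
    "retries": 3,                          # retry transient failures
    "retry_delay": timedelta(minutes=5),
    "sla": timedelta(hours=1),             # flag tasks that overrun their SLA
}

with DAG(
    dag_id="nightly_orders",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    run_glue_etl = GlueJobOperator(
        task_id="run_glue_etl",
        job_name="orders-etl",             # name of an existing Glue job (placeholder)
        region_name="eu-central-1",
    )
```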
Nice-to-Have:
- Observability & Cost Management:
  - Logging and monitoring: CloudWatch, ELK, Ganglia.
  - Custom dashboards, anomaly detection for pipeline health and data quality.
- Security:
  - IAM roles and access control following least privilege.
  - Encryption at rest/in-transit: SSE-S3, KMS, TLS.
You will love joining this company for:
- B2B contract
- Remote work
- Work-life balance
- Agile work environment
- Support for further learning and development, along with multi-year career opportunities
- Department: Data
- Locations: Warsaw
- Remote status: Fully Remote