By 2025, global data creation is expected to exceed 180 zettabytes. In 2020, we had already created and consumed more than 64 zettabytes. This data carries information that can put any business ahead of its competition. But how can organizations take advantage of data effectively and use it to improve products and services and meet modern customers' needs?
In this interview, we'll discuss the importance of data engineering with Tomasz Jędrośka, Head of Data Engineering at CodiLime.
Tomasz's role is to grow this relatively new division of the company and make sure it delivers competence synergies with our Networking, Cloud, and Security services, thereby further extending CodiLime’s competitive advantage.
As a hands-on practitioner, he also makes sure that CodiLime keeps pace with ongoing changes in the data environment - the blockchain, IoT, and AI developments that are revolutionizing the world today.
What is the role of data engineering in the modern tech industry?
Tomasz Jędrośka: Data engineering (DE) plays a crucial role in the modern tech industry by enabling organizations to effectively manage and leverage their data assets. Data engineers are responsible for designing, building, and maintaining the infrastructure and systems required to collect, store, process, and analyze vast amounts of data.
They work closely with data scientists, analysts, and other stakeholders to understand their data needs and develop robust data pipelines.
Moreover, by ensuring data quality, completeness, and security, data engineers enable organizations to make informed decisions and derive valuable insights from their data. They optimize data workflows and implement automation, allowing for efficient data processing and faster time to value. Additionally, those efforts often culminate in feeding that information into AI solutions to squeeze even more competitive advantage out of the data. Furthermore, data engineers are involved in building data architectures and data governance frameworks, ensuring compliance with regulations and standards.
In summary, data engineering empowers organizations to unlock the potential of their data, enabling data-driven decision-making and innovation in the modern tech industry.
What are the key responsibilities and challenges associated with DE?
TJ: I’d say to learn, learn, learn - the tech stack and frameworks in the data world change incredibly fast, so data engineers need to regularly spend cycles to ‘sharpen their saw’. Other than that, I’d distinguish these key responsibilities of data engineering:
- Data architecture design: data engineers design and build scalable data architectures that accommodate the organization's data requirements. They choose appropriate technologies and frameworks, considering factors like data volume, velocity, and variety. They create data models, define schemas, and implement storage solutions to support efficient data processing and analytics.
- Data pipeline development: data engineers are responsible for designing and developing efficient data pipelines. This involves tasks such as data ingestion, data transformation, and data integration from various sources. They ensure the smooth flow of data from its source to the destination, making it accessible for analysis and processing.
- Data quality and governance: data engineers ensure data quality by implementing data cleansing, validation, and enrichment processes. They establish data governance frameworks and enforce data standards to maintain data accuracy, consistency, and security. They collaborate with stakeholders to understand data requirements and establish data governance policies and procedures.
And also a few main challenges:
- Scalability and performance: managing large volumes of data and ensuring high-performance data processing can be challenging. Data engineers need to design scalable systems that can handle increasing data volumes and support efficient data processing to meet business requirements.
- Data integration and complexity: integrating data from multiple sources with different formats and structures can be complex. Data engineers need to tackle data integration challenges by implementing efficient ETL (extract, transform, load) processes and handling data inconsistencies and discrepancies; a minimal sketch of such an ETL flow follows this answer.
- Data security and privacy: with the increasing importance of data privacy and security, data engineers face challenges in ensuring data protection. They need to implement robust security measures, encryption techniques, access controls, and data anonymization methods to safeguard sensitive data and comply with regulations such as GDPR and HIPAA.
In summary, data engineers have the critical responsibilities of developing data pipelines, designing data architectures, and ensuring data quality and governance. They also face challenges related to scalability and performance, data integration complexities, and data security and privacy. Overcoming these challenges is essential for successful data-driven decision-making and effective utilization of data assets in the modern tech industry.
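To make the ETL idea above concrete, here is a minimal Python sketch of an extract-transform-load flow. The file name, field names, and the SQLite target are illustrative assumptions for this article, not part of any specific CodiLime project:

```python
import csv
import sqlite3

def extract(csv_path):
    # Extract: read raw order records from a CSV export (hypothetical source).
    with open(csv_path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    # Transform: drop incomplete rows, normalize types, add a light enrichment.
    for row in rows:
        if not row.get("order_id") or not row.get("amount"):
            continue  # basic data-quality filter
        yield {
            "order_id": row["order_id"],
            "amount": float(row["amount"]),
            "country": (row.get("country") or "unknown").lower(),
        }

def load(records, db_path="warehouse.db"):
    # Load: write cleaned records into a local SQLite table standing in for a warehouse.
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS orders (order_id TEXT, amount REAL, country TEXT)")
    conn.executemany("INSERT INTO orders VALUES (:order_id, :amount, :country)", list(records))
    conn.commit()
    conn.close()

if __name__ == "__main__":
    load(transform(extract("orders.csv")))
```

Real pipelines add orchestration, retries, and monitoring on top of this skeleton, but the extract-transform-load shape stays the same.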
How can data engineering positively influence the growth of an organization?
TJ: I believe I’ve covered a bit of that before - by building robust data architectures and pipelines, and by applying analytics and AI-based forecasting, data engineering departments can enable organizations to derive valuable insights and make informed decisions based purely on data and not on assumptions. Moreover, by implementing automation and efficient data integration processes, data engineering streamlines operations, reduces manual effort, and improves overall efficiency.
Ultimately, the organization can leverage data-driven insights to enhance productivity, identify new opportunities, optimize business processes, and drive innovation, leading to sustainable growth and a competitive edge in the modern tech industry.
Can you share some examples of how data engineering has made a significant impact on a client's product/project development?
TJ: Yeah, sure - there were a few that gave great business benefits to our clients:
Alerting and monitoring of a telco data center project. It was a two-year project with multiple vendors and collaborating parties. What we recognized as vital from the outset in our responsibility for monitoring and alerting was the establishment of a message-based control flow. This control flow would encompass all events occurring in the ecosystem and use a standardized format to connect the entire program, clients, and other participating vendors. The goal was to create a universal data pool accessible across all client systems. This functionality was based on Apache Kafka and further enriched with Elasticsearch and Kibana, which made it possible to distribute and search for events, alarms, and KPI information. With up to 50k messages being exchanged back and forth every minute, this component was the beating heart of the entire ecosystem.
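For illustration, publishing such standardized events to Kafka can look like the minimal sketch below, using the kafka-python client; the broker address, topic name, and message schema are assumptions made for this example, not the actual project's format:

```python
import json
from datetime import datetime, timezone

from kafka import KafkaProducer  # pip install kafka-python

# Hypothetical broker and topic; the real project details are not public.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

# A standardized event envelope so every vendor publishes alarms and KPIs the same way.
event = {
    "source": "dc-monitoring-agent",
    "type": "ALARM",
    "severity": "critical",
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "payload": {"component": "leaf-switch-42", "metric": "link_down"},
}

producer.send("datacenter.events", value=event)
producer.flush()
```

Consumers (dashboards, alerting services, other vendors' systems) subscribe to the same topics, which is what turns the message bus into a shared data pool.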
The other aspect, in the testing area, was establishing trigger- or schedule-based E2E validation cycles that executed particular groups of tests based on events occurring in the system (a component upload, or a code or configuration change in Git repositories); execution time had to match strict project constraints and was measured and reported.
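The exact tooling isn't described here, but the core idea of mapping system events to groups of tests could be sketched like this; the event names and the pytest-based runner are purely hypothetical:

```python
import subprocess

# Hypothetical mapping of ecosystem events to the E2E test groups they should trigger.
EVENT_TO_TEST_SUITES = {
    "component_upload": ["smoke", "integration"],
    "code_change": ["unit", "integration"],
    "config_change": ["smoke", "regression"],
}

def run_suite(name):
    # Placeholder: in a real setup this would invoke the project's actual test runner.
    return subprocess.run(["pytest", "-m", name], check=False).returncode

def handle_event(event_type):
    # Run every suite mapped to the incoming event and collect return codes.
    suites = EVENT_TO_TEST_SUITES.get(event_type, [])
    return {suite: run_suite(suite) for suite in suites}

if __name__ == "__main__":
    print(handle_event("code_change"))
```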
Last but not least, to achieve immediate and reliable real-time monitoring of the whole infrastructure for the data center operators, we recommended using time series databases (InfluxDB, TimeScale) as a base for the controlling application.
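For illustration, writing one such infrastructure metric to InfluxDB with the official Python client might look like the sketch below; the connection details, measurement, and tag names are assumptions made for this example:

```python
from influxdb_client import InfluxDBClient, Point  # pip install influxdb-client
from influxdb_client.client.write_api import SYNCHRONOUS

# Hypothetical connection details; the real deployment parameters are not public.
client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")
write_api = client.write_api(write_options=SYNCHRONOUS)

# One metric sample: tags identify the device, fields hold the measured values.
point = (
    Point("interface_stats")
    .tag("device", "spine-switch-01")
    .tag("interface", "eth0")
    .field("rx_bytes", 123456789)
    .field("tx_bytes", 98765432)
)
write_api.write(bucket="datacenter-metrics", record=point)
client.close()
```

The controlling application then queries recent samples (e.g. the last few minutes per interface) to drive real-time dashboards and alerts.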
Those technical pieces and our DevOps expertise helped set this all up in an IaC (Infrastructure as Code), fully automated manner, resulting in significant cost and time savings for our customer. One telling consequence of the program: before it took place, the telco company needed between six months and two years to update or set up a new data center; afterwards, this time was reduced to one to two weeks.
And another example?
TJ: Another project was a cloud-native telemetry data source integration for a publicly listed technology corporation specializing in application security, multi-cloud management, online fraud prevention, etc.
In this case, using our experience in creating data pipeline solutions that exchange massive sets of telemetry data, CodiLime was able to step into the project with proven recommendations and deliver a solid, scalable solution. We also leveraged our previous experience to establish close working relations with the stakeholders and made sure our architecture team was present and active during the decision-making process.
In the course of implementation, our customer was immediately able to see the first benefits:
- Improved business insights: with a joint ecosystem gathering product and client metrics, our customer gained a holistic view of the usage of their services and their forecasted utilization, and therefore a perspective on potential business growth and areas to invest in, as well as a good chance to optimize their infrastructure spending.
- Increased data reliability: previously separated pieces of information were gathered in one place, validated for internal consistency, and cross-checked against each other to eliminate potential mismatches, allowing for more trustworthy analysis and more informed decision-making.
- A more modern business model, probably the most important outcome for our customer: with a close to real-time reporting system, they were able to switch their modus operandi from monthly subscriptions to a pay-as-you-go model based on actual usage.
There are many other good examples, both from a purely data engineering standpoint and in synergy with other domains like Networking or Data Science. I won’t describe them in detail, but such synergies are what we champion at CodiLime. For a brief overview, I’d like to point our readers to our webinar repository and, more specifically, to the one about using AI in networks to detect and prevent faults.
What are some common challenges faced by data engineering teams, and how do you address them?
TJ: Today’s data world evolves very quickly; new data engineering solutions and AI models are being released every week, but still, different data engineering teams often face common challenges, including data quality issues, data integration complexities, and scalability concerns. In the rush of the day, pursuing new business opportunities, business stakeholders might forget that to have a properly working product or service, the underlying data has to be accurate and consistent. To address these challenges, teams employ several strategies which shouldn’t be avoided or postponed.
Firstly, they implement data quality frameworks, including data cleansing, validation, and enrichment processes, to ensure accurate and reliable data. They also establish data governance policies and procedures to maintain data consistency and security. Secondly, they leverage integration techniques like ETL (Extract, Transform, Load) processes and data integration tools to handle diverse data sources and formats, ensuring seamless data integration.
Lastly, teams employ scalable infrastructure and distributed computing frameworks such as Apache Spark or its cloud-based wrappers like EMR and Dataproc, together with warehousing solutions like Redshift or Snowflake, to handle large data volumes and ensure efficient data processing and analysis.
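As a small illustration of how those pieces fit together, here is a minimal PySpark job that ingests raw events, applies a couple of basic quality rules, and writes curated output for a warehouse to pick up; the paths and column names are assumptions made for this sketch:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("quality-and-scale-sketch").getOrCreate()

# Ingest raw, semi-structured events (hypothetical bucket/path).
raw = spark.read.json("s3://raw-bucket/events/")

cleaned = (
    raw.dropDuplicates(["event_id"])                      # deduplicate
       .filter(F.col("user_id").isNotNull())              # basic validation rule
       .withColumn("event_date", F.to_date("timestamp"))  # light enrichment
)

# Write partitioned Parquet; a warehouse such as Snowflake or Redshift can load from here,
# or a dedicated connector could write to it directly.
cleaned.write.mode("overwrite").partitionBy("event_date").parquet("s3://curated-bucket/events/")

spark.stop()
```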
The above-mentioned strategies enable data engineering teams to overcome challenges and ensure the smooth functioning of data pipelines, reliable data integration, and scalable data infrastructure which eventually lead to better products and customer experience.
What emerging technologies or trends do you find most exciting in the field of data engineering?
TJ: What I fancy most in the data engineering world are data warehousing solutions, so I’m mostly interested in the growth of projects like Snowflake or Starburst. Because as a company we deal a lot with data streaming and time series-based workloads (e.g. for network monitoring), I’ve also been investigating Apache Druid heavily lately. Basically, whatever helps me deliver scalable storage and processing capabilities and enables my customers to handle massive amounts of data efficiently is what draws my attention.
Obviously, nowadays, we can’t escape the topic of ChatGPT and other emerging AI technologies. Data engineering teams are slowly but surely leveraging AI for tasks like automated data cleansing, anomaly detection, and data quality assessment. AI also plays a crucial role in improving data integration processes (e.g. matching pieces of connected information between heterogeneous data sources), leading to more efficient and accurate data processing. At the moment, my competency center team is working on taking advantage of frameworks like LlamaHub and LangChain and adapting them to real-life corporate reality, which, in my opinion, can give organizations a huge performance boost.
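As a deliberately simplistic stand-in for the AI-based anomaly detection mentioned above (production systems would use learned models rather than a fixed statistic), a rolling z-score check over a metric stream can be sketched in a few lines of Python:

```python
from collections import deque
from statistics import mean, stdev

def detect_anomalies(values, window=20, threshold=3.0):
    """Flag points deviating more than `threshold` standard deviations from the
    rolling-window mean. A toy statistical baseline, not a learned model."""
    history = deque(maxlen=window)
    anomalies = []
    for i, value in enumerate(values):
        if len(history) >= 2:
            mu, sigma = mean(history), stdev(history)
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                anomalies.append((i, value))
        history.append(value)
    return anomalies

# Example: a flat metric with one obvious spike at index 6.
print(detect_anomalies([10, 11, 10, 12, 11, 10, 95, 11, 10]))  # -> [(6, 95)]
```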
How do you incorporate these technologies into your work?
TJ: Obviously, it works best if we have a chance to take advantage of a particular technology with a concrete client - then the solution combines the benefits of making a proof of concept and fulfilling actual business needs. It’s worth mentioning that solutions that work well in the lab sometimes need to be heavily customized and fine-tuned in real-life scenarios. That’s why validation with the customers is so important for us at CodiLime.
On the other hand, some frameworks are very fresh and not mature enough to be used in products that have a firm position on the market and cannot risk downtime or a data breach due to software issues. In those cases, to stay on top of the newest technologies, we run internal projects within CodiLime competency centers. Those initiatives usually address needs that we believe our existing customers ‘might have soon enough’, so that when the time comes, we are in a position to present a recommendation and a working piece of code. This way, our engineers gain a way to sharpen their saw, and the company gains a framework for growing our business.
How do you see the role of data engineering evolving in the coming years?
TJ: I expect that in the near future, the role of data engineering will evolve significantly. With the increasing volume, velocity, and variety of data, data engineering will continue to play a crucial role in enabling organizations to effectively manage and extract value from their data assets. There will be a growing emphasis on real-time data processing, cloud-based solutions, and automation through AI and machine learning.
Data engineers will be at the forefront of designing scalable data architectures, implementing advanced data integration techniques, and ensuring data quality, security, and governance.
I strongly believe our role will become more strategic as I imagine we, as data engineers, together with data scientists, will deepen our collaboration with business stakeholders to drive innovation, derive actionable insights, make data-driven decisions, and bring corporate efficiency to a whole new level.
Tomasz Jędrośka, Head of the Data Engineering department at CodiLime
His journey in the data space started at university, where he wrote a master's thesis on database replication while already working as a full-time DBA for data-heavy applications. Back then, the main focus areas were modeling and normalizing corporate databases, fine-tuning batch procedures, optimizing indexes, and making sure hardware was configured correctly (in today's cloud era, not many DBAs need to care about RAID configuration or operating system boot sector offsets).
Later on, already a team manager, he was engaged in application development for the biggest retail and investment banks, where he entered areas connected with distributed high-performance computing on-premises and in the cloud, data migrations, and regulatory reporting at the corporate level.
Currently, at CodiLime, Tomasz is (besides programming Excel sheets, as he sometimes jokes) responsible for setting up and maintaining collaboration with customers, advising them on data architecture design relevant to their business case, and leading or supervising the delivery teams. Besides that, he is actively growing the Data Engineering department by extending our collective knowledge and skillset in modern data solutions, as well as the services offered to our customers.
Along with his professional responsibilities, he is an avid volleyball player and loves to take cross-country cycling trips.
—
Interested in other DE-related topics and challenges? If yes, definitely check out the From raw data to insights: Effective data processing techniques webinar, where Tomasz, along with other CodiLime experts, covers the most important data engineering issues.