About me

My name is Heitor Felix

I hold a degree in Data Science from Uninter and currently work as a Data Engineer at NTT Data, building and optimizing data pipelines in Databricks for the financial sector. I have over 3 years of experience in Data Engineering, with strong expertise in Data Lakehouse architectures, AI solutions development (RAG/LLMs), and cloud infrastructure implementation. I hold Databricks certifications (Associate and Professional) and Azure certifications (DP-203, AZ-900, DP-900). Explore my projects below and feel free to reach out.

Professional Experience

Data Engineer at NTT Data

2025 - Present

Building and optimizing data pipelines in Databricks for financial sector clients. Working on migration projects to modern Data Lakehouse architectures, ensuring performance, scalability, and data governance.

Data Engineer II at Sapiensia Tecnologia

2022 - 2025

Technical leadership in data engineering projects, developing critical pipelines on Azure and Databricks. Implemented AI solutions with RAG and LLMs for process automation. Architected disaster recovery strategies and serverless infrastructure. Responsible for analytical dashboards and Python automations that directly impacted business decisions.

Data Science Intern at 027capital

2022

Development of churn prediction models and data ingestion pipelines using Python and Google Cloud. Created software for financial data processing and analysis.

Projects

Deputies project diagram

Brazilian Congress Deputies Data Pipeline

Snowflake · dbt · Airflow · AWS S3 · Python

Complete data engineering pipeline with automated ingestion of public information on all federal deputies, extracted via the official API: biographies, mandates, expenses, and parliamentary activity. The architecture implements a modern ELT approach with Snowflake and dbt. Daily incremental ingestion is orchestrated with Airflow and stored in S3 in Parquet format, and transformations follow robust dimensional modeling patterns (SCD Type 2). The project ensures end-to-end scalability, automation, and data quality.

Tools used

  • Python, Pandas and requests
  • Apache Airflow
  • Amazon S3 and SQS
  • Snowflake, Snowpipe
  • dbt Core
  • Streamlit and Jupyter Notebook
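
As an illustration of the dimensional modeling pattern above, here is a minimal SCD Type 2 merge sketched in plain Python. All names are hypothetical; the project itself expresses this logic in dbt models.

```python
from datetime import date

# Minimal Slowly Changing Dimension (Type 2) merge: when a tracked
# attribute changes, the current row is closed and a new version opened.
# Hypothetical example; the real pipeline implements this in dbt.

def scd2_merge(dimension, incoming, today):
    """Apply one batch of changes to an SCD Type 2 dimension table."""
    current = {r["id"]: r for r in dimension if r["valid_to"] is None}
    for row in incoming:
        existing = current.get(row["id"])
        if existing is None:
            # New entity: open its first version.
            dimension.append({**row, "valid_from": today, "valid_to": None})
        elif existing["party"] != row["party"]:
            # Attribute changed: close the old version, open a new one.
            existing["valid_to"] = today
            dimension.append({**row, "valid_from": today, "valid_to": None})
    return dimension

dim = [{"id": 1, "party": "A", "valid_from": date(2023, 1, 1), "valid_to": None}]
dim = scd2_merge(dim, [{"id": 1, "party": "B"}], date(2024, 1, 1))
# The deputy now has two versions: the closed "A" row and the open "B" row.
```

This keeps the full history of each deputy's attributes, so queries can reconstruct the dimension as of any date.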
Olist project diagram

Data Lakehouse: Olist

Databricks · Delta Lake · Spark · Azure

This project used the Databricks Data Lakehouse architecture to manage data in layers (Raw, Bronze, Silver, and Gold) and simulate ingestion scenarios with CDC (Change Data Capture). The data, from a Kaggle dataset, was enriched to create a complete pipeline, from ingestion to business analysis. I implemented data governance with Unity Catalog, orchestration with Databricks Workflows, and continuous integration via GitHub Actions. The project consolidated skills in data pipelines, automation, and analysis with the Medallion architecture, optimizing the use of data for insights and analytical applications.

Tools used

  • Pandas
  • Git, GitHub, GitHub Actions
  • Azure Blob Storage, Parquet
  • Databricks, Unity Catalog
  • Spark, Delta Lake
  • Databricks Workflows
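
The CDC ingestion scenario can be sketched in plain Python: replay a batch of ordered change events into a Silver table, mimicking what a Delta Lake MERGE does in the real pipeline. Event shapes and values below are illustrative.

```python
# Simplified CDC (Change Data Capture) apply step: events are applied in
# timestamp order; inserts and updates upsert the latest row image, and
# deletes remove the key. The real project does this with Delta Lake MERGE.

def apply_cdc(silver, events):
    """Apply ordered CDC events (op: insert/update/delete) to a dict keyed by id."""
    for e in sorted(events, key=lambda e: e["ts"]):
        if e["op"] == "delete":
            silver.pop(e["id"], None)
        else:  # insert and update both upsert the latest image of the row
            silver[e["id"]] = e["data"]
    return silver

silver = {}
events = [
    {"id": 1, "op": "insert", "ts": 1, "data": {"status": "created"}},
    {"id": 1, "op": "update", "ts": 2, "data": {"status": "shipped"}},
    {"id": 2, "op": "insert", "ts": 3, "data": {"status": "created"}},
    {"id": 2, "op": "delete", "ts": 4, "data": None},
]
apply_cdc(silver, events)
# silver → {1: {"status": "shipped"}}
```

Ordering by timestamp matters: applying the delete before the insert for id 2 would leave a ghost row.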
Diagram of the chatbot project

Chatbot with GPT-4 and Azure

GPT-4 · RAG · Azure OpenAI · Azure AI Search

In this project, I explored Azure Artificial Intelligence tools to build a chatbot specialized in Azure, powered by GPT-4. I copied the Azure documentation from GitHub to a Storage Account, used Azure AI Search to embed and index the content, and used Azure OpenAI to serve the chatbot from a web app on Azure. The goal is to provide accurate, contextualized answers about Azure services and functionality.

Tools used

  • Python
  • Azure Blob Storage
  • Azure AI Search
  • Azure OpenAI
  • Git, GitHub
  • Bicep template (IaC)
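
The core retrieval step of a RAG chatbot can be sketched without any cloud services: embed the question, rank indexed chunks by cosine similarity, and hand the top hits to the LLM as context. The toy 3-dimensional "embeddings" below are made up; the real project uses Azure AI Search and Azure OpenAI embeddings.

```python
import math

# Toy RAG retrieval: rank document chunks by cosine similarity to the
# query vector. Vectors and texts here are illustrative placeholders.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def retrieve(index, query_vec, k=1):
    """Return the k chunks most similar to the query vector."""
    ranked = sorted(index, key=lambda c: cosine(c["vec"], query_vec), reverse=True)
    return [c["text"] for c in ranked[:k]]

index = [
    {"text": "Blob Storage stores unstructured data.", "vec": [0.9, 0.1, 0.0]},
    {"text": "AKS runs Kubernetes clusters.",          "vec": [0.0, 0.2, 0.9]},
]
context = retrieve(index, [0.8, 0.2, 0.1])  # a question about storage
# context → ["Blob Storage stores unstructured data."]
```

The retrieved context is then prepended to the user's question in the prompt sent to GPT-4, which is what keeps the answers grounded in the documentation.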
Diagram of the OCR project

IN PROGRESS: Telegram Bot: Text Recognition (Computer Vision)

Computer Vision · Azure AI · Python

In this project, I explored Azure Artificial Intelligence tools for optical character recognition (OCR), such as Azure Computer Vision and Azure AI Document Intelligence. I used Python to develop a Telegram bot that processes images sent by the user, returning the extracted text and the confidence score for each recognized word. I implemented dynamic settings in the bot, allowing users to adjust parameters such as the minimum confidence level for accepting words and whether image pre-processing is applied. This project demonstrates skills in API integration, image processing, and building interactive bot interfaces.

Tools used

  • Python
  • Telegram API
  • Azure Computer Vision
  • Azure AI Document Intelligence
  • Git, GitHub
  • Bicep template (IaC)
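
The bot's minimum-confidence setting boils down to a simple filter: keep only the OCR words whose confidence score meets the user-configured threshold. The word list below mirrors the shape of word-level OCR responses; the values are made up.

```python
# Keep only recognized words at or above the minimum confidence the user
# configured in the bot. Illustrative data; real scores come from Azure OCR.

def filter_words(words, min_confidence=0.8):
    """Drop recognized words below the confidence threshold."""
    kept = [w["text"] for w in words if w["confidence"] >= min_confidence]
    return " ".join(kept)

ocr_result = [
    {"text": "Invoice", "confidence": 0.98},
    {"text": "t0tal",   "confidence": 0.41},  # low-confidence misread
    {"text": "R$120",   "confidence": 0.91},
]
print(filter_words(ocr_result, min_confidence=0.8))  # → "Invoice R$120"
```

Raising the threshold trades recall for precision: fewer misreads in the output, at the cost of dropping genuinely correct low-confidence words.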

Older Projects (2021 - 2022)

Stone Data Challenge 2022

Python · Power BI · Pandas

I was a semifinalist in the Stone Data Challenge 2022. The challenge provided historical data from a loan program, covering 14,700 clients from 2019 to April 2022. The business problem concerned contacting clients who were behind on payments: what is the ideal number of times to contact a client? I answered this question with data analysis in Python and Power BI.

Tools used

  • Git, GitHub, Git LFS
  • Python, Pandas, Seaborn, Plotly
  • Power BI

Sales Prediction

Scikit-Learn · Python · Heroku · Flask

I used Python to create a Machine Learning model to predict the sales of each of the 3,000 registered stores over the next 6 weeks. The model was put into production and can be queried through a Telegram bot, requiring only internet access. Its predictions reached about 90% of the actual values, allowing the CFO to make decisions based on the future revenue of each store unit and invest without losses.

Tools used

  • Git, GitHub
  • Python, Pandas, Seaborn, Boruta
  • Scikit-Learn and Scipy
  • Flask
  • Heroku Cloud
  • Telegram API
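
One common way to express "predictions reach about 90% of the actual values" is accuracy defined as 1 minus the mean absolute percentage error (MAPE). The sketch below assumes that reading and uses made-up numbers; it is not the project's actual metric computation.

```python
# Accuracy as 1 - MAPE: average the relative errors, subtract from 1.
# Illustrative values only; the real evaluation used the project's data.

def mape_accuracy(actual, predicted):
    errors = [abs(a - p) / a for a, p in zip(actual, predicted)]
    return 1 - sum(errors) / len(errors)

actual    = [1000.0, 2000.0, 1500.0]
predicted = [ 950.0, 2150.0, 1400.0]
round(mape_accuracy(actual, predicted), 3)
```

A metric like this is easy for a CFO to read: 0.9 means forecasts land, on average, within 10% of realized revenue.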

Classification of customers most likely to buy

I used Python to create a Machine Learning model to rank the customers most likely to purchase a new product (cross-sell strategy). With 33.5% accuracy among the top 20,000 ranked customers in the database, the sales team can reach interested customers at a much lower cost.

Tools used

  • Git, GitHub
  • Python, Pandas, Seaborn, Extra Tree Classifier
  • Scikit-Learn, Scipy and Scikit-Plot
  • Flask
  • Heroku Cloud
  • Google Sheets API with Google Scripts
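
The top-20,000 figure above is naturally read as precision-at-k: of the k highest-ranked customers, what fraction actually bought? Here is a minimal sketch of that metric with an illustrative ranking (the real scores came from the trained classifier).

```python
# Precision-at-k for a propensity ranking. The label list is a toy example:
# 1 = bought, 0 = did not buy, sorted by predicted score (descending).

def precision_at_k(ranked_labels, k):
    """Fraction of buyers among the k highest-scored customers."""
    top = ranked_labels[:k]
    return sum(top) / len(top)

ranked = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]  # sorted by predicted propensity
precision_at_k(ranked, k=5)  # → 0.6
```

Comparing this against the base purchase rate of the whole database is what shows how much cheaper a targeted campaign becomes.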

Customer loyalty with clustering

I used Python to create a Machine Learning model to find the "Insiders," the best customers of the company. The objective of this project was to group customers with similar behaviors so that the business team can build personalized actions, based on the characteristics of each cluster.

Tools used

  • Git, GitHub
  • Python, Pandas, Seaborn, GMM
  • Scikit-Learn, Scipy and Yellowbrick
  • SQLite
  • Metabase
  • Papermill

House Rocket Data Analysis

I used Python and Power BI to perform exploratory data analysis, confirming or refuting hypotheses about the business and generating insights for better performance. The analysis aimed to increase the revenue of the fictitious company House Rocket, which buys and sells real estate, by finding the best times to buy or sell each property.

Tools used

  • Git, GitHub
  • Python, Pandas, Seaborn, Plotly
  • Geopy API
  • Power BI
  • SQLite