Week 1
15/12/25
Our first week of development greatly focused on exploratory data analysis, cleaning data and laying the foundations for our web application. Our first task was to clean our car insurance premium dataset and ensure everything was ready for the EDA. During this process we found that there were many missing values for variables such as the vehicle’s fuel type as well as the number of doors the vehicle has.
Machine Learning
As the ultimate goal of our project is to make a program that can estimate car insurance premiums. So since we both had limited experience in developing a machine learning model, we did some research on machine learning and asked our supervisor for advice on how to approach it. After collecting information on how to approach building the model, we came across recources and frameworksbut the ones that sttod out to us was the CRISP-DM framwork and geeksfor geeks guide guiding us on how to build a model in steps like this one. We found that it was quite helpful during our development process.
we began first understanding what we needed to do, simple we need to predict insurance cost based on the driver and vehicel information provided to us quickly
second we had too look for real insurance data to provide accurate estimates. unfortunatlty it was very difficult to get a hold of real insurance data with good information as they can contain senstive information and can be very valuable asseets to insurance company’s. we then came acrosswe wanted to prove it cam work any wherre in the world a data set with real historical insurance data from spain
our EDA by plotting the premium values against various other metrics to try and find correlations and give us a better idea of what machine learning model would be the best to create an insurance estimator. Since insurance premiums are calculated based on many different important factors and each case can be vastly different from another, it was difficult to find easy correlation between insurance premiums and one other factor, this was also an indicator that a random forest or gradient boosting model may be a good choice.
One discovery that we found was that the “Type_risk” variable that was an integer ranging from 1-4 split our dataset into 4 different types of vehicles that are passenger cars, motorbikes, vans and agricultural vehicles. In the interest of our project, we chose to filter the dataset to only contain passenger cars.

Our main references in getting started and getting used to working with pandas dataframes was the pandas documentation https://pandas.pydata.org/docs/
https://www.geeksforgeeks.org/machine-learning/steps-to-build-a-machine-learning-model/
And our mentor helping us explore the fundamentals of data analysis and machine learning.