From personal curiosity to public dataset

When I look back at this, my first dataset, I’ll probably cringe and think about all the “bad things” I did while creating it.

But there’s also something beautiful in the naivety of working on something for the first time. You don’t have the experience to judge yourself too harshly. You come with a beginner’s mind, and that’s valuable.

I had just started my first course in Data Analysis (the basics, you know) and they suggested we work on a project for our portfolio. Even though I knew I wanted to work with something familiar, I wasn’t sure what to do or how to tackle it.

Then I thought, okay, I have to keep it simple. I don’t want to overcomplicate things. I was learning, and this was a new field for me. Even though I had some background, I didn’t want to pressure myself.

That’s how I landed on this idea: create a dataset and share it publicly (in case people wanted to give feedback), and make it about something I love. So, logically, the first thing that popped into my mind was to work with astrology and K-pop.

Where astrology meets data: my first K-pop dataset

I already had a project where I’d compiled a huge spreadsheet with astrological info for over 400 idols. It started as an exercise for me, then became a website. But life happened and I didn’t have time to maintain the site, so I decided to close it.

I thought maybe it was a good time to revive it, but it felt like a huge task. In the end, I decided to do something simpler. I already had information on many groups and their debuts; all I had to do was structure it into a clean dataset and add the astrological details.

At first, I wasn’t familiar with dataset standards: how to name columns, format dates, handle numbers, and so on. But as I learned more, I decided it was the perfect time to take all that information and organize it the “right” way.
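To make that concrete, here is a minimal sketch of the kind of tidying I mean, assuming snake_case column names, ISO 8601 dates, and counts stored as integers. The column names, the sample row, and the day/month/year input format are all made up for illustration; they are not taken from the actual dataset.

```python
import re
from datetime import datetime


def to_snake_case(name: str) -> str:
    """Turn a free-form header such as 'Debut Date' into 'debut_date'."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return cleaned.strip("_").lower()


def to_iso_date(text: str, input_format: str = "%d/%m/%Y") -> str:
    """Parse a date written in input_format and return it as YYYY-MM-DD."""
    return datetime.strptime(text, input_format).date().isoformat()


# Hypothetical raw spreadsheet row (illustrative values only).
raw_row = {
    "Group Name": "Example Group",
    "Debut Date": "05/03/2015",
    "Music Show Wins": "12",
}

# Rename the columns, then fix the date format and the number type.
clean_row = {to_snake_case(key): value for key, value in raw_row.items()}
clean_row["debut_date"] = to_iso_date(clean_row["debut_date"])
clean_row["music_show_wins"] = int(clean_row["music_show_wins"])

print(clean_row)
```

Consistent names and one date format may look like small details, but they are what lets someone else load the file without guessing what each column means.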

Learning to structure my Data

In my first attempt, I thought about making both a light and a complete version of the dataset. But once I started filling in the information and deciding what was relevant and what wasn’t, I realized it was cleaner to just have one version.

I initially included a lot of data that felt valuable but didn’t add much in the end. For example, to get the exact debut time for the astrological charts, I pulled metadata from YouTube videos. For older groups without a YouTube debut, I had to leave it blank. At first, I included the actual YouTube links, but I realized that was unnecessary. In the final version, I reduced it to a single column that indicates whether the time data is reliable or not. This made the dataset cleaner and more focused on what truly mattered for analysis.
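That simplification can be sketched in a few lines. This is only an illustration, assuming a row counts as reliable when a YouTube source existed for the debut time; the names `youtube_link`, `debut_time`, and `debut_time_reliable` are hypothetical, and the rows are invented.

```python
def add_time_reliability(rows):
    """Drop the link column and keep only a true/false reliability flag.

    Assumption for this sketch: a debut time is "reliable" when the row
    had a non-empty YouTube link to take the time from.
    """
    result = []
    for row in rows:
        link = (row.get("youtube_link") or "").strip()
        cleaned = {key: value for key, value in row.items() if key != "youtube_link"}
        cleaned["debut_time_reliable"] = bool(link)
        result.append(cleaned)
    return result


# Invented example rows: one with a video source, one older group without.
rows = [
    {"group": "Group A", "debut_time": "18:00", "youtube_link": "https://example.com/video"},
    {"group": "Group B", "debut_time": "", "youtube_link": ""},
]

flagged = add_time_reliability(rows)
print(flagged)
```

The link itself is gone, but anyone using the data can still filter out the charts that rest on a missing or uncertain time.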

I also included less relevant info early on, like the group’s Korean name, which I got rid of too. But one thing I almost left out turned out to be important: whether a planet was retrograde or not. From an astrological perspective, that can add meaningful layers to the analysis.

I ended up with a 23.73 kB dataset with 27 columns and 120 groups. I know, it doesn’t sound simple, but I tried my best to clean it and keep what was relevant for future use.

What made the cut and what didn’t

This exercise taught me a lot about data gathering. I used sites like Kpopping and SoriData to collect group info: names, debut dates, companies, group type, status, and success markers like PAKs, music show wins, year-end awards, physical sales, and organic YouTube views.

Once I had the debut date, I extracted metadata from YouTube videos to get the exact debut time (when available), then calculated all the astrological details with astro-seek: sun sign, moon sign, rising sign, and planetary positions.

This gave me a rich dataset that can be used for multiple purposes. You can cross-reference debut dates, group success, and astrological data to find patterns. That was actually the second step of this project, but I’ll save that for another article.
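As a taste of what that cross-referencing could look like, here is a small sketch that groups rows by sun sign and averages a success marker. The rows, the column names, and the numbers are invented for illustration and are not results from the dataset.

```python
from collections import defaultdict
from statistics import mean

# Invented sample rows; the real dataset has its own columns and values.
groups = [
    {"group": "Group A", "sun_sign": "Aries", "music_show_wins": 10},
    {"group": "Group B", "sun_sign": "Aries", "music_show_wins": 4},
    {"group": "Group C", "sun_sign": "Leo", "music_show_wins": 7},
]


def average_by(rows, key, value):
    """Average the numeric column `value` for each distinct entry in `key`."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[key]].append(row[value])
    return {label: mean(values) for label, values in buckets.items()}


wins_by_sign = average_by(groups, "sun_sign", "music_show_wins")
print(wins_by_sign)
```

With only 120 groups spread over twelve signs, any pattern found this way is something to be curious about, not a conclusion.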

Through this simple exercise, I learned a lot and I also got over the shyness of sharing something that reveals my interests, even while I’m still learning. The funny thing is, I thought no one would care about a dataset like this. Astrology and K-pop? Really?

But to my surprise, even though it’s not one of those famous, heavily downloaded datasets, people actually engaged with it. I thought maybe three people would see it. But as of today, the dataset has 2,811 views, has been downloaded 412 times, and, most exciting, someone other than me created a notebook using the data.

This experience reminded me that passion projects are worth sharing, even if they feel niche or imperfect. You never know who might find them useful, interesting, or inspiring. What began as a simple learning exercise became a small but meaningful contribution to a community I care about. And that alone made every hour spent on it worthwhile.

