YouTube Data Influencers

Description

In this project, I aim to gain a comprehensive understanding of the YouTube API and the process of gathering video-related data. The analysis will focus on verifying common myths about video performance on YouTube, such as the influence of likes and comments on view count, the impact of video duration, the correlation between title length and view count, and the role of tags and upload frequency. Additionally, I will explore trending topics using NLP techniques to identify prevalent themes through word clouds and analyze comment sections to uncover frequently asked questions.

Tech Stack

NumPy

pandas

Matplotlib

TextBlob

Tools Used

Tableau

Figma

YouTube Data API v3




My Role



1- Steps of the project:


Obtain video metadata via the YouTube API for the top 20-30 channels in the data science niche. This involves several small steps:


  • Create a developer key, request data and transform the responses into a usable data format.

  • Preprocess data and engineer additional features for analysis.

  • Exploratory data analysis.

  • Build a Tableau dashboard.

  • Conclusions.
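
The transformation and feature engineering steps above can be sketched roughly as follows. This is a minimal illustration, not the exact code used in the project. It assumes each item has the shape returned by the API's videos.list method when the snippet, statistics and contentDetails parts are requested, and that publishedAt has no fractional seconds. The function names and output column names are my own placeholders.

```python
import re
from datetime import datetime

# ISO 8601 durations such as "PT1H2M3S" or "P1DT5M", the format of contentDetails.duration.
_DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def duration_to_seconds(iso_duration):
    """Convert an ISO 8601 duration string to a number of seconds."""
    match = _DURATION_RE.match(iso_duration)
    if match is None:
        raise ValueError(f"Unsupported duration: {iso_duration!r}")
    parts = {name: int(value or 0) for name, value in match.groupdict().items()}
    return (parts["days"] * 86400 + parts["hours"] * 3600
            + parts["minutes"] * 60 + parts["seconds"])


def flatten_video(item):
    """Flatten one video item into a single row with engineered features."""
    snippet = item.get("snippet", {})
    stats = item.get("statistics", {})
    details = item.get("contentDetails", {})

    title = snippet.get("title", "")
    tags = snippet.get("tags", [])  # the key is absent when a video has no tags
    # Assumed timestamp format: "2021-03-14T15:00:00Z"
    published = datetime.strptime(snippet["publishedAt"], "%Y-%m-%dT%H:%M:%SZ")

    def to_int(key):
        # Counts arrive as strings and can be missing (e.g. hidden likes).
        value = stats.get(key)
        return int(value) if value is not None else None

    return {
        "video_id": item["id"],
        "channel_title": snippet.get("channelTitle"),
        "title": title,
        "view_count": to_int("viewCount"),
        "like_count": to_int("likeCount"),
        "comment_count": to_int("commentCount"),
        "duration_seconds": duration_to_seconds(details["duration"]),
        "title_word_count": len(title.split()),
        "tag_count": len(tags),
        "publish_weekday": WEEKDAYS[published.weekday()],
    }
```

A list of such rows can be passed directly to pandas.DataFrame(rows) for the exploratory analysis.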



2- Dataset:


  • Data selection: As this project focuses specifically on data science channels, I found few readily available datasets online that suit this purpose. I therefore created my own dataset using the Google YouTube Data API version 3.0. The exact steps of data creation are presented below.

  • Data limitations: The dataset is a real-world dataset and suitable for the research. However, the selection of the top 28 YouTube channels is based purely on my own knowledge of the data science field and might not be accurate. My definition of "popular" is based only on subscriber count, but other metrics could be taken into consideration as well (e.g. views, engagement). The cutoff at 28 channels is also somewhat arbitrary given the plethora of channels on YouTube. Smaller channels might also be very interesting to look into, which could be the next step of this project.

  • Ethics of data source: According to the YouTube API guide, usage of the YouTube API is free of charge as long as your application sends requests within a quota limit. "The YouTube Data API uses a quota to ensure that developers use the service as intended and do not create applications that unfairly reduce service quality or limit access for others." The default quota allocation for each application is 10,000 units per day, and you can request additional quota by submitting a form to YouTube API Services if you reach the limit.


Since all data requested from the YouTube API is public data (which everyone on the Internet can see on YouTube), there are no particular privacy issues as far as I am concerned. In addition, the data is obtained only for research purposes in this case and not for any commercial interests.
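
To put the 10,000-unit daily quota in perspective, here is a rough back-of-the-envelope estimate for a dataset of about 5,779 videos. It is only a sketch under my own assumptions: each list-style call costs 1 unit, each call returns up to 50 items, and the data is collected in two passes (one to list video IDs, one to fetch video details). It also ignores the few extra calls caused by paginating each channel separately. The function name and parameters are hypothetical.

```python
import math

DAILY_QUOTA_UNITS = 10_000  # default allocation mentioned above


def estimate_quota_units(video_count, page_size=50, units_per_call=1, passes=2):
    """Estimate quota units needed to collect data for `video_count` videos.

    Assumes every call costs `units_per_call` units and returns at most
    `page_size` items, and that the data is gathered in `passes` passes.
    """
    calls_per_pass = math.ceil(video_count / page_size)
    return calls_per_pass * units_per_call * passes


units = estimate_quota_units(5779)
print(f"Estimated units: {units} of {DAILY_QUOTA_UNITS} available per day")
```

Under these assumptions, a full collection run stays far below the default daily allocation.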



3- Conclusions and future research ideas:


In this project, we have explored the video data of the 28 most popular data science and data analyst channels and revealed many interesting findings for anyone who is starting out with a YouTube channel in data science or another topic:


  • The more likes and comments a video has, the more views it gets. This is a correlation, not necessarily a causal relationship, and it can work both ways. Likes seem to be a better indicator of interaction than comments, and the number of likes seems to follow "social proof": the more views a video has, the more people like it.

  • Most videos have between 5 and 30 tags.

  • The most-viewed videos tend to have a title length of 6-12 words. Titles that are too short or too long seem to harm viewership.

  • There is a small variation in the number of video uploads on weekdays.

  • Comments on videos are generally positive. We also noticed that the word "please" appears frequently, which suggests viewer requests and potential market gaps in content that could be filled.
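
The relationship between likes and views in the first finding is a correlation, and one common way to quantify such a relationship is the Pearson correlation coefficient. The sketch below is a plain-Python illustration of that calculation, not the code used for the analysis. The sample numbers are made up for demonstration and are not taken from the project dataset.

```python
import math


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("Need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("Correlation is undefined for a constant sequence")
    return covariance / math.sqrt(var_x * var_y)


# Made-up values for illustration only.
likes = [10, 20, 30, 40]
views = [100, 210, 290, 405]
print(f"Pearson r between likes and views: {pearson_r(likes, views):.3f}")
```

A value close to 1 indicates a strong positive linear relationship, but it says nothing about which variable drives the other. With the data in a pandas DataFrame, DataFrame.corr() computes the same coefficient by default.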



3.1- Project limitation:


The findings should also be taken with a grain of salt for a number of reasons:


  • The number of videos is quite small (the dataset has only ~5,779 videos)

  • There are many other factors that haven't been taken into the analysis, including the marketing strategy of the creators and many random effects that would affect how successful a video is.



3.2- Ideas for future research:


To expand and build on this research project, one can:


  • Expand the dataset to include smaller channels in the data science niche

  • Do market research by analyzing the comment threads and identifying common questions or market gaps that could potentially be filled

  • Conduct this research for other niches (e.g. vlogs or beauty channels) and compare the niches to see how viewership patterns and video characteristics differ.

  • Explore the impact of collaborations between data science creators on video performance. Investigate whether cross-promotions and joint content contribute to increased views and audience reach.

  • Investigate audience engagement dynamics, such as viewer retention throughout videos, click-through rates, and user interactions. Understanding how viewers engage with videos on a granular level could reveal valuable strategies for optimizing video content.

  • Extend the analysis of upload timing by considering the time of day for video uploads. Determine whether specific hours of the day result in higher interaction and views.
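
As a possible starting point for the comment-based market research idea above, the sketch below flags comments that look like questions or requests. It is a deliberately simple heuristic that I am assuming for illustration (a question mark marks a question, the word "please" marks a request), and the function names are hypothetical. A real analysis would likely need more careful NLP.

```python
import re
from collections import Counter

# Whole-word, case-insensitive match for "please".
REQUEST_RE = re.compile(r"\bplease\b", re.IGNORECASE)


def classify_comment(text):
    """Return the set of labels ("question", "request") that apply to a comment."""
    labels = set()
    if "?" in text:
        labels.add("question")
    if REQUEST_RE.search(text):
        labels.add("request")
    return labels


def summarize_comments(comments):
    """Count how many comments carry each label."""
    counts = Counter()
    for comment in comments:
        counts.update(classify_comment(comment))
    return counts


# Invented example comments, for demonstration only.
sample = [
    "Could you cover SQL next?",
    "Please make a video on A/B testing",
    "Great video!",
    "Can you please explain this again?",
]
print(summarize_comments(sample))
```

Comments flagged this way could then be grouped by topic to surface recurring questions.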