Soccer Intelligence: How to win the game?

Posted on Jul 2, 2019


Although soccer is one of the most popular competitive sports in the world, relatively little statistical and analytical work has been done on it. With the advent of big data, however, the trend of using data to improve performance is becoming more and more obvious. Soccer analysis companies such as OPTA and Prozone are already growing very fast. Motivated by the 2018-2019 UEFA Champions League, I set out not only to collect data, but also to analyze it in a way that serves the sport.

Here, you can find my Shiny app and GitHub repository.

Data Collection

The first part of the dataset came from Kaggle. It contains more than 25 thousand matches from season 2008/2009 to 2015/2016, 10 thousand players, and 11 European leagues, each represented by its top-division championship. After calculating the number of goals and grouping by country or season, we can get an overall picture of these 11 leagues. Within a league, every pair of teams plays each other every season. On the left, we can see that Spain, France and England have the largest number of games, which means they have the most teams and therefore more intense internal competition. On the right is the average number of goals in every league in every season. We can compare the leagues' performance year by year by adding or dropping the bars that represent their average goal numbers. In fact, the majority of previous UEFA Champions League winners came from the five leagues with the highest average goal numbers.
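
The grouping step described above can be sketched in a few lines of Python. This is only a minimal illustration, not the code behind the app: the sample rows and the tuple layout (league, season, home goals, away goals) are made up for the example and do not reflect the actual Kaggle schema.

```python
from collections import defaultdict

# Hypothetical sample rows: (league, season, home_goals, away_goals).
matches = [
    ("England", "2008/2009", 2, 1),
    ("England", "2008/2009", 0, 0),
    ("Spain", "2008/2009", 3, 2),
    ("Spain", "2008/2009", 1, 0),
    ("Spain", "2009/2010", 4, 1),
]


def average_goals(rows):
    """Return {(league, season): mean total goals per match}."""
    totals = defaultdict(lambda: [0, 0])  # key -> [goal sum, match count]
    for league, season, home_goals, away_goals in rows:
        entry = totals[(league, season)]
        entry[0] += home_goals + away_goals
        entry[1] += 1
    return {key: goals / count for key, (goals, count) in totals.items()}


avg = average_goals(matches)
for (league, season), value in sorted(avg.items()):
    print(f"{league} {season}: {value:.2f} goals per match")
```

Each (league, season) pair corresponds to one bar in a plot like the one on the right.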

In order to figure out how the value of players influences the performance of a team, I also scraped data from Transfermarkt using Python. It contains basic information on each year's 250 most valuable players from 2008 to 2015.

Application Features

Comparison between teams

Users can compare the win percentage of the same team across seasons or of different teams. For example, let's compare three of the most famous teams from England, Germany and Spain: Liverpool, Bayern Munich and Barcelona. Interestingly, one of the best teams in England does not have as high a win percentage as the teams from the other leagues. Remember that this dataset only contains matches between teams from the same league, so this might be because there are so many strong teams in England.
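
As a rough sketch of how a win percentage like this can be computed, the snippet below counts wins and games played per team. The results are invented for illustration, and the tuple layout (home team, away team, home goals, away goals) is an assumption rather than the app's real data structure.

```python
from collections import defaultdict

# Hypothetical results: (home_team, away_team, home_goals, away_goals).
matches = [
    ("Liverpool", "Everton", 2, 0),
    ("Everton", "Liverpool", 1, 1),
    ("Liverpool", "Chelsea", 0, 1),
    ("Chelsea", "Liverpool", 1, 3),
]


def win_percentage(rows):
    """Return {team: percentage of its matches that it won}."""
    played = defaultdict(int)
    won = defaultdict(int)
    for home, away, home_goals, away_goals in rows:
        played[home] += 1
        played[away] += 1
        if home_goals > away_goals:
            won[home] += 1
        elif away_goals > home_goals:
            won[away] += 1
        # A draw counts as a game played but not as a win for either side.
    return {team: 100.0 * won[team] / played[team] for team in played}


percentages = win_percentage(matches)
for team, pct in sorted(percentages.items()):
    print(f"{team}: {pct:.1f}%")
```

Note that draws lower the win percentage here; a different convention (for example, counting a draw as half a win) would give different numbers.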

Comparison between players

This application also gives users access to a table of player information, in particular each player's rating and potential at every match. Users can search by typing the names of one or more players and get a line plot of their overall ratings over the years.

Intuition 1: Home or Away?

It is interesting to find that the home team wins almost twice as many games as the away team. This is reasonable, because at a home game most of the fans and supporters are there for the home team. This is called home advantage, and it has been attributed to the psychological effects that supporting fans have on the competitors or referees.
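
The home versus away comparison comes down to classifying each result and counting. Here is a small sketch under the assumption that each match is reduced to a (home goals, away goals) pair; the scores are invented and are not taken from the dataset.

```python
from collections import Counter


def outcome(home_goals, away_goals):
    """Classify a single match from the home team's point of view."""
    if home_goals > away_goals:
        return "home win"
    if home_goals < away_goals:
        return "away win"
    return "draw"


# Hypothetical (home_goals, away_goals) pairs.
scores = [(2, 1), (0, 0), (1, 3), (3, 0), (1, 0), (2, 2)]

counts = Counter(outcome(h, a) for h, a in scores)
for label in ("home win", "draw", "away win"):
    print(f"{label}: {counts[label]}")
```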

Intuition 2: How to be a good player?

This dataset also contains each player's overall rating and 36 parameters used to evaluate players. I calculated the correlation between every parameter and the overall rating, and treated a larger correlation as a sign that the parameter is more important. From this we can conclude that reaction is the factor most strongly related to a player's performance. Age is also a useful parameter when a coach is choosing players for a team.
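
To make the ranking step concrete, here is a minimal sketch that computes the Pearson correlation between each attribute and the overall rating and sorts the attributes by absolute correlation. The numbers are invented, only two of the 36 parameters are shown, and the attribute names are placeholders; the post does not say which correlation measure was used, so Pearson is an assumption.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical values for four players; not taken from the dataset.
overall_rating = [60, 70, 80, 90]
attributes = {
    "reactions": [58, 69, 82, 91],
    "age": [30, 22, 27, 25],
}

# Rank attributes by the strength (absolute value) of their correlation.
ranking = sorted(
    ((name, pearson(values, overall_rating)) for name, values in attributes.items()),
    key=lambda item: abs(item[1]),
    reverse=True,
)
for name, r in ranking:
    print(f"{name}: r = {r:.3f}")
```

A high correlation shows that an attribute moves together with the overall rating; it does not by itself show that the attribute causes better performance.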

Intuition 3: The importance of market value.

Finally, the plot on the left is a scatter plot of each team's win percentage against its market value. Market value here means how much the team spent to buy its players, so it is effectively the value of the players in the team. The value of the players reflects not only their ability; it can also attract more fans, resources and sponsors. The points are concentrated in the upper-left area: a team with low value can still achieve a high win percentage, but a team with high value is less likely to perform badly. On the right, the blue line is the win percentage of Manchester City in each year, which fits well with the trend of its market value.

Future Work

This application can be improved in the following aspects:

  1. Adding data about matches between teams from different leagues.
  2. Using every player's performance and the results of previous matches to train a machine learning algorithm that predicts the win probability for any two teams in a game.
