BUS4 118D - Big Data
This course counts as an MIS elective for purposes of the MIS concentration. In the past it has also counted as an elective in the Business Analytics concentration, but be sure to check with your advisor in the Jack Holland Student Success Center to ensure it will count towards your concentration. For Fall 2021 there are two sections offered:
Section 1:
Course number: 43278
Class Time: Tuesday & Thursday 12:30pm - 1:45pm
Hybrid: Online (Tuesdays) and In-Person (Thursdays)
Location (Thursdays): BBC 103
Section 2:
Course number: 43279
Class Time: Tuesday & Thursday 2:15pm - 3:30pm
Hybrid: Online (Tuesdays) and In-Person (Thursdays)
Location (Thursdays): BBC 103
For the detailed syllabus and weekly schedule, please see Canvas.
Course Format: The course is very hands-on using Big Data tools to wrangle, analyze, and visualize
a Yelp dataset. We will discuss how companies approach data wrangling (working with
Big Data), a framework for data projects, how companies are using Big Data, issues
of ethics and privacy (the CCPA went into effect in 2020), and data visualization.
Since this class is hands-on, you will need a computer. If you don't have a laptop or desktop, you can check one out from the University; other students have done this in the past, and we will discuss the details in Canvas.
Since we are hybrid this semester, some of the hands-on exercises will have videos
posted that walk you through the exercise. When we meet in person, we will
do these in class. The exercises and labs are designed to get everyone up-to-speed
and comfortable with the tools, since you will use them on a team project. If you have
taken BUS4 92 (Introduction to Business Programming, which uses Python) or BUS4 112 (Database Management Systems),
that background will be helpful. During the scheduled class times we will have
discussions, exercises, and team meetings. The class lectures will be recorded and
you will be expected to watch them prior to attending class. After the first few weeks, each team will select a project question
and we will alternate between weeks of class discussion and weeks where I
meet with each team to discuss their progress. Attendance for the class discussions
is required. For the team meetings, you are required to be online with your team
the day we are meeting (Tuesday or Thursday), and contributing to the team discussion.
We will form teams early in the semester. Your team will work together to pose
a potential business question about a real-world dataset provided by Yelp and then apply
the framework we learn in class along with the Big Data tools to answer that question.
Since you do not know the answer to your question at the start, you are graded on
how you apply the process, how you document your work, your identification of issues
in the data, and whether you are curious about your data - not on getting a specific
result. The team will be responsible for deliverables throughout the semester and
each team presents their results at the end of the semester. To encourage everyone
to contribute to their team, part of your team deliverable grade is based on your contribution
to your team. Each deliverable requires a team discussion as to the contribution of
the team members. Based on the team's assessment, the instructor will allocate the
team points - depending on your contribution, you may earn more or less than what
the team earned overall.
Course Goals and Description: Data Science is currently a hot topic in industry and Big Data is the fuel of data
science. In the early years, data scientists were often Ph.D.s from the hard sciences
(such as astrophysics), but increasingly data science is a team sport. The aim of
this course is to prepare you for the aspects of data science that consume most of
the team's effort and give you skills that can help you enter this exciting field.
Across many industries, by an often-cited estimate, about 80% of a data scientist's day is spent wrangling data. This includes getting data, formatting it, transforming it, and profiling it (asking questions of the data to understand it). Developing complex models, the "sexy" aspect of the job, is a small part of the work, and being able to visualize and communicate the results to upper management is required for businesses to get any value out of the analysis. In this course we will focus on the data wrangling aspect using a dataset provided by Yelp. You will ask questions of the data and do your data wrangling in Apache Spark using a notebook interface. The tools are web-based, and both Spark and Jupyter notebooks are among the hottest tools in Big Data and data science. Your team will then create a visualization using Tableau that is embedded back into your notebook.
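To make the wrangling steps concrete (getting data, formatting it, transforming it, and profiling it), here is a minimal sketch in plain Python. It is only an illustration, not the course's actual exercises: in class this work is done in Apache Spark, and the sample records and field names below (review_id, business_id, stars, date, text) are hypothetical stand-ins rather than the real Yelp schema.

```python
import json
from collections import Counter
from datetime import datetime

# Hypothetical sample records, one JSON object per line.
# Field names are assumptions made for this illustration.
RAW_LINES = [
    '{"review_id": "r1", "business_id": "b1", "stars": 5, "date": "2020-03-01", "text": "Great tacos"}',
    '{"review_id": "r2", "business_id": "b1", "stars": 2, "date": "2020-04-15", "text": ""}',
    '{"review_id": "r3", "business_id": "b2", "stars": 4, "date": "2021-01-09", "text": "Solid coffee"}',
    '{"review_id": "r4", "business_id": "b2", "stars": null, "date": "2021-02-20", "text": "No rating given"}',
]


def load_records(lines):
    """Get the data: parse one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def transform(record):
    """Format and transform: parse the date string and derive a year field."""
    out = dict(record)
    out["date"] = datetime.strptime(record["date"], "%Y-%m-%d").date()
    out["year"] = out["date"].year
    return out


def profile(records):
    """Profile: ask simple questions of the data to understand it."""
    rated = [r["stars"] for r in records if r["stars"] is not None]
    return {
        "row_count": len(records),
        "missing_stars": len(records) - len(rated),
        "empty_text": sum(1 for r in records if not r["text"]),
        "stars_distribution": dict(Counter(rated)),
        "reviews_per_year": dict(Counter(r["year"] for r in records)),
    }


records = [transform(r) for r in load_records(RAW_LINES)]
summary = profile(records)
print(summary)
```

Even on four made-up rows, profiling surfaces the kind of issues a team would need to document: one review has no star rating and one has empty text.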
The importance of data wrangling was summed up by DJ Patil, the first Chief Data Scientist for the U.S. Government (in the Obama administration), who stated that: "Good data scientists understand, in a deep way, that the heavy lifting of cleanup and preparation isn't something that gets in the way of solving the problem: it is the problem."
The goal is for every team to create a notebook that team members can discuss and show to recruiters if they are interested in becoming a data analyst or pursuing a data-related career. The notebook and visualizations your team creates can be published and shared with recruiters, and you can include a link on your website or LinkedIn profile. Keep in mind that you need to do the exciting but hard work of producing a team project you understand and are proud of. If you do not understand your team project, a recruiter will not be impressed no matter how well your team did.
The Yelp data is available for academic use and a new dataset was posted in March
of 2021. Last summer, Yelp also published a supplemental dataset related to how businesses
are reacting to the COVID crisis and we may also use that data. The current dataset
contains data on reviews, businesses, users, tips (mini-reviews), and user check-ins.
Each time Yelp publishes a new version of the dataset, it grows. The version we will
be using covers 160K businesses in 8 metro areas in the United States. The current
version contains over 8 million reviews in total.
The tools in this field continue to evolve, but we will be using the following:
Apache Spark: Currently one of the fastest-growing Big Data tools. We will use Spark hosted by Databricks on
Amazon Web Services (AWS), working in Jupyter notebooks, which are currently popular with data scientists.
Databricks was founded by members of Berkeley's AMPLab who developed and open sourced
Apache Spark.
Tableau: One of the most popular visualization tools and a skill that prior students have
found to be in demand by recruiters.
Textbooks and Materials: We will be using selected chapters from a number of excellent books that are
available online for free from the MLK Library through the Safari Online database. The
cost for books and materials in this class is $0, but there is a heavy time commitment
expected.
Prerequisites: Both BUS4 92, Introduction to Business Programming, and BUS4 112, Database Management
Systems, provide helpful background for this course but are not required. The exercises
will walk you through the tools step by step. We will also have a couple of
sessions where we do a hands-on review of topics you may have covered in more detail
in those courses.
Curiosity is a greater asset than specific technical skills.