0:06
Hi.
0:06
Welcome to our webcast of Building a Bridge to Data Science presented by Jim Stern and Ian Thomas.
0:13
Thank you for joining us.
0:14
My name is Deanna Morrison and I will be the moderator for the presentation today.
0:19
Before we get started, I’d like to take a moment to acquaint you with a few features of this web event technology.
0:26
On the right hand of your screen, you’ll see a chat window.
0:30
To send a question,
0:31
click in the text box and type your question.
0:34
All questions that you submit are seen only by today’s presenter.
0:38
Your questions will be responded to in the order in which they were received and will be addressed at the end of the presentation.
0:46
At the conclusion of today’s program, we ask that you would complete a brief webinar evaluation.
0:52
Please take a moment to complete this evaluation as it will help Digital Analytics Association to plan future web events.
1:00
We are joined today by our speakers, Jim Stern, President of Rising Media Incorporated, and Ian Thomas, former data and analytics leader of digital marketing, data science and operations for Microsoft.
1:14
At this time, I’ll turn the microphone over to Ian.
1:19
Thanks very much, Deanna, and thank you everybody for joining this webinar.
1:25
I’m going to kick off and introduce us to the topic, then Jim’s going to pick up a big chunk in the middle of our section, and then I’m going to come back a little bit later on.
1:36
So I’ll just talk to you for a few moments about our agenda this morning or this afternoon.
1:43
We’re going to start out by asking really why we care about data science.
1:47
I assume that most of you are on this call because you either care about data science, or somebody has told you you should care about data science, or you need to find out more about it.
2:00
It’s important, I think, for us to understand why data science is important to us as digital analytics professionals.
2:07
So we’ll start by looking at that and looking at the definition of data science.
2:13
And once you start looking at data science, it becomes quite clear that there’s a big portion of it which actually really is about data engineering.
2:22
And so really, there are three foundational roles that organisations are evolving: data science, data engineering and analytics.
2:30
And I’ll be handing to Jim, who’s going to take you through some more detail on that: how to run those three roles, how they relate to one another and to the stakeholder role in practice, and how projects actually get started and get concluded.
2:48
I’ll then come back and offer some more practical suggestions on how to become what we call kind of more data sciency.
2:56
So there’s a big question people have about how to become a data scientist.
3:01
And one of the things that I’ve learned in this area is there’s no bright line where you suddenly magically become a data scientist.
3:07
It’s more about moving in a direction of data science.
3:10
And we thank Tim Wilson for this term, data sciency.
3:15
And so we’ll talk about some practical tips there and then how to grow data science in a team.
3:20
If you’re involved in managing an analytics team and you want to add data science to it, we’ll cover some of that and then some recommended resources and then we’ll finish up and hopefully have a little time for some questions.
3:33
Jim, could we move to the next slide?
3:34
Thank you.
3:35
So why care about data science?
3:37
Well, data science is in the news all the time.
3:40
Of course, now everybody’s talking about data science and machine learning.
3:44
So it’s really becoming a very mainstream business analytics activity, and it’s driven really by a couple of trends.
3:54
The first is that business stakeholders are looking for more actionability in the data and the analytics that is done.
4:05
There’s been a lot of progress made in the last 20 years or so in being able to extract insights from data and perform regular analysis on complex data sets.
4:21
But still, there’s frequently a gap between that analysis and its actionability.
4:27
And some organisations, particularly digital marketing organisations or digital marketing teams, are really looking to turn data directly into action.
4:38
And the second trend is that the tools and technologies that underpin data science have been dramatically improving over the last few years, and they’re starting to really democratise the field of data science.
4:53
So data science really did used to require a postgraduate degree in a mathematical discipline and a lot of underlying coding skills and statistics skills and so on.
5:08
Which is why data scientists tend to be these unicorns that command astronomical salaries.
5:14
But now the evolution of tools is opening the field up to a broader group of people, and that really means that analysts can start thinking about building out their own data science skills and teams can think about a more multidisciplinary approach.
5:33
And so analytics teams need to include data science, data engineering and data analytics disciplines.
5:38
And it’s important to grow analytics skills, because if you don’t help your analysts grow their skills towards data science, they will likely leave for somewhere that can help them grow in that direction.
5:55
So what is data science?
5:56
Well, there are many, many definitions of data science.
6:00
The one that I like is this one, which is based on something called CRISP-DM, which is a process for data mining that was actually written down back in the 1990s.
6:13
And it contains this set of linked activities, which I’ve bucketed into three broad phases.
6:21
The first is a phase of stakeholder engagement and storytelling.
6:25
So this is really the phase that bookends a typical project, where the data scientist starts with understanding business goals and working with stakeholders to evolve and define the goals of the project and the analytical outcomes that they’re looking for.
6:44
And then at the end of the project comes back to those stakeholders and presents the results using various techniques, including visualisations and so on.
6:55
The second key phase is data analysis and preparation.
7:00
Data science is different from
7:07
the traditional digital analytics that many of you may be more familiar with, in that the data that is used for data science is often much less well formed, less well structured and much less clean than, say, web analytics data.
7:25
But at the same time, even when data is relatively well structured, there’s a lot of data prep that a data scientist does in order to make that data ready to build predictive models and machine learning algorithms against.
7:42
And so there’s a significant amount of data prep and data munging and wrangling that goes on in this phase.
7:51
And then the third phase is the building of predictive models or machine learning algorithms and the evaluation and deployment of those algorithms.
8:00
And this is where I think a lot of people feel like the magic of data science happens, as the man behind the curtain comes out with amazing algorithms that are fantastically predictive of customer loyalty or purchase propensity or whatever.
8:16
The practical reality here is there’s a lot of hard work of trying out different models, running the data through those models, seeing how well they perform in the real world in a predictive context, and going back and making changes to the data preparation step in order to make the models perform better.
8:37
And then crucially, once those models have been built, actually figuring out how they get deployed.
8:42
If you want to build a model that you want to deploy into a mobile app, for example, you have to partner with the developers of the mobile app to deploy that into the app.
8:57
So this cycle of phases of data science really constitutes the whole activity.
9:05
Jim, if you move to the next slide. And so if we look at those three big phases, you actually find that digital analysts perform a lot of activities in those same broad areas, which I’ve called here storytelling, data wrangling, and analytics and modelling.
9:30
And as you can see here, the key message of this slide is that if you’re a digital analyst you’re already doing many activities that overlap with data science.
9:42
There’s no bright line between the two disciplines.
9:47
For example, a lot of analysts spend a fair amount of time making sure data is well structured for analytics.
9:54
They spend a lot of time doing visualisation, and increasingly experimentation and causality is a key thing that analysts do. But on the right hand side here you can see the skills and the activities that are more explicitly around data science, particularly advanced statistical modelling, the building of predictive models and so on.
10:16
And so for an analyst looking to grow into a data scientist or data science role, it’s really about building out the skills on the right hand side here.
10:25
But as I mentioned at the beginning, a key portion of this activity ends up being data engineering.
10:33
And so I’m going to hand over to Jim now.
10:35
He’s going to talk a little bit more about that and about the way these roles can fit together in real work.
10:41
Thank you, Ian.
10:42
So the data engineer, you know we’ve got analyst on one side and data scientist on the other and then there’s this weird data engineering thing.
10:51
So I wanted to take this perspective that Ian’s provided and turn it a little bit, you know, do the pivot on it and say here are the things the analyst does.
11:05
Here’s what the data engineer does and here’s what the data scientist does.
11:08
Now I’m going to repeat no bright line.
11:12
Imagine this is all a Venn diagram and it all gets mushed together.
11:16
Pretty straightforward.
11:17
But essentially the analyst is responsible for trying to figure out what the goals of the organization are.
11:24
What problem are we solving for?
11:27
What are we trying to accomplish? And then do the analysis.
11:32
Put two and two together and see if you can come up with five.
11:35
Try to figure out what information is valuable and meaningful and then report it back out to the decision makers.
11:46
The data engineer is going to be responsible for data stewardship.
11:51
So we’re going to collect, we’re going to manage, and we’re going to make sure that it’s trustworthy.
11:58
We’re going to keep an eye on it to make sure that it is consistent, that it is trustworthy.
12:05
And when models are created, that they can be operationalized, which means building the pipelines that take in the raw data automatically, do the transformation, load it into the proper systems and let those systems then operate automatically.
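A minimal sketch of such a pipeline, assuming a small CSV export as the raw input and an in-memory SQLite database as the "proper system". The column names, the `daily_traffic` table, and the rule of dropping rows with missing visit counts are all hypothetical choices made for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical raw export with inconsistent channel names and a missing value.
RAW_CSV = """date,channel,visits,orders
2024-01-01,Email,120,6
2024-01-01,search,300,9
2024-01-02,EMAIL,,4
2024-01-02,Search,280,14
"""


def extract(text):
    """Read the raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Normalise channel names, cast numbers, and drop incomplete rows."""
    clean = []
    for row in rows:
        if not row["visits"]:
            continue  # assumption: rows without a visit count are unusable
        clean.append((
            row["date"],
            row["channel"].strip().lower(),
            int(row["visits"]),
            int(row["orders"]),
        ))
    return clean


def load(rows, conn):
    """Write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_traffic "
        "(date TEXT, channel TEXT, visits INTEGER, orders INTEGER)"
    )
    conn.executemany("INSERT INTO daily_traffic VALUES (?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)

totals = conn.execute(
    "SELECT channel, SUM(visits) FROM daily_traffic "
    "GROUP BY channel ORDER BY channel"
).fetchall()
print(totals)
```

In a real setting each stage would run on a schedule against live sources; the point of the sketch is only the extract, transform, load shape.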
12:23
So this is where we can really start stepping toward machine learning.
12:29
Well, for that to happen, we need the people in the back room in the white lab coats figuring out what is the best next step.
12:41
I think of them as raw materials scientists: “I’ve come up with a new transparent aluminum that we can use in order to transport whales.”
12:53
That’s great.
12:54
But now we need the designer and the architect and everybody else to participate. The data scientist, again in this pure, classical, bright line kind of definition, is doing that research, is figuring out what’s next, is playing with machine learning, is figuring out which algorithms might work best in order to create models, which the analysts can then assess: are these models actually going to be useful for solving the problem for the end user?
13:26
Well, the end user: what do we mean by that? The business stakeholder.
13:32
And we’ll get to them in just a second.
13:34
But first, let’s talk about reality a little bit, which is not a clean definition of here’s a data analyst, here’s a data engineer, and here’s a data scientist.
13:44
There’s a whole lot of overlap.
13:46
Now this is a nice, clean, even overlap.
13:50
And if you are that person in the middle, congratulations, you are a unicorn and you don’t actually exist.
13:58
So in your organization, chances are you’re doing the data engineering yourself.
14:05
And maybe there’s somebody over in IT that’s helping you with a Hadoop cluster or doing some heavy lifting to bring data into your system from systems you actually aren’t allowed to have access to.
14:18
But mostly you’re doing the engineering yourself, which is, you know, page tagging. In other organizations, it’s IT.
14:25
Or maybe there’s an analytics center of excellence that does have some data scientists, and you as a digital analyst are doing some of that data science. That’s great.
14:36
But most of it lives over on the other side of the wall and you’re watching them saying, gee, I’d be interested in doing some of that.
14:45
So that’s what we want to talk about: building that bridge from digital analytics to data science. And in a lot of organizations at the moment, you are the only one doing the work.
14:58
You are the digital analyst who, oh by the way, is responsible for all of your own data, and you’re spending all of your time collecting and cleaning and managing and piping, and God bless you, and there is no data science.
15:09
That’s just work that’s not on offer yet.
15:13
So you will have to be the one who brings it into the organization.
15:17
In many companies you can see off in the distance there are data scientists, and they’re probably connected to manufacturing and shop floor control, supply chain management, human resources, certainly finance, and they’re off in the corner.
15:38
They have nothing to do with marketing, and so all you can do is watch them wistfully.
15:45
So in reality, when we look at this, you know, create these bright lines, what’s the difference?
15:51
You’re actually responsible for some chunk of it.
15:56
Yes, everything on the left hand side is your purview.
15:59
But then also, yes, you have to do the page tagging.
16:02
Yes, you have to clean the data.
16:04
Yes, you have to figure out how to make it all work together happily, or you’re only responsible for certain bits of it.
16:12
This is the “your mileage will vary” part.
16:15
Every organization does it differently, which brings us to the stakeholder.
16:20
So if, instead of thinking about it in terms of discrete functions, we look at it in terms of process:
16:30
How do we go about solving problems with data?
16:34
It begins with the business stakeholder and whether that is the CEO or the CMO or somebody who’s just responsible for advertising spend who wants to know where to move their money.
16:47
Somebody has the decision making responsibility and we as digital analysts support them.
16:55
So they’re using their domain knowledge which is knowledge of the industry, the product, the customers, the company, the competition to figure out what are we trying to accomplish.
17:08
And that could be as simple as more e-mail opens or it could be as complex as improve shareholder value in the next quarter.
17:18
So: goals, spotting opportunities, seeing where we might take a step we hadn’t thought of before, and what’s standing in our way.
17:27
So they’re going to turn to the analyst and say, OK, here are my needs. How can you use data to help me?
17:34
And the analyst is going to come up with hypotheses and do some data diving and come up with insights.
17:39
And for that, they’re going to rely on the data engineer to have trustworthy data to work with.
17:46
Because we’ve all been in that position where the numbers didn’t add up.
17:49
And when you went back up the chain, back up the pipeline, you found out that the numbers you’re working with are pretty solid, except that every Tuesday at 2:00 in the afternoon everything collapses and then it’s not up and running again until Wednesday morning.
18:08
And nobody knew that.
18:10
So somebody needs to be the data engineer, the data steward, for each and every data stream you’re using.
18:17
How are we collecting it?
18:19
How are we managing it? What kind of preprocessing are we doing?
18:23
How are the metrics that we’re counting on modified in order to reach us?
18:31
And not only is the collection solid, but is the preprocessing still working OK?
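One way a data steward might catch the "every Tuesday at 2:00" kind of outage is to scan a feed's timestamps for gaps longer than the expected interval. The sketch below assumes an hourly feed and uses made-up timestamps; the `find_gaps` helper is hypothetical, not part of any tool named in the talk.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, expected=timedelta(hours=1)):
    """Return (last_seen, next_seen) pairs where the feed went quiet
    for longer than the expected interval between records."""
    ordered = sorted(timestamps)
    gaps = []
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > expected:
            gaps.append((previous, current))
    return gaps


# Made-up hourly feed starting Tuesday 2 January 2024 at 10:00 ...
start = datetime(2024, 1, 2, 10)
all_hours = [start + timedelta(hours=i) for i in range(27)]

# ... with everything from Tuesday 14:00 until Wednesday 09:00 missing.
outage_start = datetime(2024, 1, 2, 14)
outage_end = datetime(2024, 1, 3, 9)
feed = [t for t in all_hours if not (outage_start <= t < outage_end)]

gaps = find_gaps(feed)
for last_seen, next_seen in gaps:
    print(f"No data between {last_seen} and {next_seen}")
```

A check like this only covers collection; a similar comparison of row counts or metric totals before and after preprocessing would be needed to watch the transformation step.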
18:39
And then the data scientist: well, they’re going to take the advice from the analyst and figure out what methods and algorithms to use in order to build models.
18:51
And building a model is an iterative process.
18:54
It is, you know, “Is this model helpful?”, and the analyst has to look at that and figure it out.
19:02
Now the reality is that this is one big Venn diagram that’s all mushed together.
19:09
And this list on this screen, the collaboration list.
19:15
This is why you will always have a job, whether you are a business stakeholder or an analyst.
19:23
You will always have a job, regardless of how much machine learning comes in, because we as humans are going to be responsible for working together in collaboration to figure out what problem to solve.
19:36
What are we trying to achieve?
19:37
Where do we want to point this incredibly powerful data tool?
19:46
What do we want to accomplish?
19:49
The next step is figuring out what data might be interesting. If we say, oh well, let’s just throw everything in there, the machine’s going to say, yeah, I can consider all of this and give you
20:02
an output that is 50/50.
20:06
It’s like, it’s just noise.
20:07
I don’t know.
20:08
And if you give me too little, the machine will say, oh, well, you flipped a coin five times, it came up heads five times.
20:15
I can tell you with 100% confidence it’s going to be heads next time.
20:19
Well, we know that’s not true.
20:20
That’s just statistics at play and a bad sample size.
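To make the coin example concrete, here is a minimal sketch comparing the naive estimate (observed heads divided by flips) with a Wilson score interval, one common way to express how uncertain a proportion is at a given sample size. The use of the Wilson interval and the 95% level are assumptions chosen for illustration.

```python
from math import sqrt


def wilson_interval(successes, trials, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    centre = (p_hat + z**2 / (2 * trials)) / denom
    half_width = (
        z * sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2))
        / denom
    )
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


heads, flips = 5, 5

# The naive estimate says heads is certain.
naive_estimate = heads / flips

# The interval says the true rate could plausibly be far lower.
low, high = wilson_interval(heads, flips)
print(f"naive: {naive_estimate:.2f}, plausible range: {low:.2f} to {high:.2f}")

# With far more flips, the same observed rate becomes much more convincing.
low_big, high_big = wilson_interval(500, 500)
print(f"500 of 500: {low_big:.3f} to {high_big:.3f}")
```

With only five flips the lower end of the range sits not far above a fair coin, which is the sample-size problem the speaker is describing.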
20:25
So we have to have the right problem to solve, and then the right data and not just which data is informative, but not too much and not too little, and then figure out the best model to run.
20:36
Well, the data scientist is going to figure out what the possible models are, what the possible methodologies are, and, frankly, new stuff.
20:45
New papers are being written on this every week.
20:48
Keeping up with the advances in machine learning is a challenge.
20:54
It is a job unto itself, and figuring out which model might be the most creative and predictive is something that the data scientist and the analyst have to work on together, because the analyst is the one to figure out whether the output makes sense.
21:12
Now the output is going to be mathematically, logically correct.
21:19
The machine can do that in a heartbeat.
21:22
But the machine might say mathematically, logically, if the goal is more interactivity or more conversion or more customer lifetime value, then the correct recommendation, the mathematically logical recommendation, is send everybody on your e-mail list 27 emails an hour, and it will produce results.
21:48
That is mathematically true.
21:50
It is not rational.
21:52
It’s not practical.
21:53
It’s not wise.
21:55
And those are things the machine doesn’t understand.
21:59
That’s why you will always have a job.
22:01
So we’re working in collaboration and that’s why this is one big fuzzy Venn diagram, because you might actually be the stakeholder, you might be responsible for conversion rate.
22:13
You are definitely the analyst.
22:14
You are probably doing most of the heavy lifting for collecting and cleaning the data and they’re asking you to build models and you have some resources in IT and you have some resources in the analytics centre to help you with this.
22:27
But it’s all kind of mushed together.
22:30
So if we look at these fine lines and we see that as we move from left to right, we become more data sciency.
22:38
You know the analyst, yes, you live in Google Analytics.
22:41
The data engineer is writing R and Python scripts and actually physically manipulating the data.
22:51
And then over on the far side, we have the research into machine learning.
22:54
And so that’s becoming more data sciency.
22:57
Thank you, Tim Wilson.
22:59
But as the digital analyst who’s responsible for the engineering today, how do you get there?
23:04
You can see it over in the corner, but how do you get there?
23:09
And for that I get to relinquish the microphone back to Ian to give us some perspective because he’s actually been there and done that.
23:20
Thank you, Jim.
23:22
Just as I take the mic back, I notice we now have a bunch of new people who joined a little bit after the beginning of the webinar.
23:31
If you joined late, you may not have heard that you are welcome to type questions into the chat window in the bar on the right hand side of your screen.
23:41
And I would appreciate that because otherwise I have no idea that anybody is listening to this webinar.
23:46
So it will reassure me that there’s somebody out there.
23:49
I’m listening, Ian, I’m listening.
23:50
Thank you, Jim.
23:52
But seriously, we will be happy to take your questions once we get through this final section, so feel free to type away and we’ll get to those in a few minutes.
24:03
So, yeah, as Jim says, in my role at Microsoft I was part of a fairly substantial data science and data analytics team, essentially a data analytics centre of excellence.
24:20
And we had about 30 data analysts and about another 30 data scientists.
24:27
And the most common question from a career development perspective from the data analysts was how they could become more like the data scientists.
24:38
And a seemingly equally common question from the data scientists was how they could stop their ivory tower from being invaded by these dirt-covered data analysts.
24:53
I exaggerate obviously for effect, but only a little.
24:57
As I mentioned at the beginning, the discipline of data science historically has really required you to understand how the whole thing is put together end to end.
25:09
Including a great deal of detailed maths about how these data science algorithms are created and the linear algebra that goes into them and so on.
25:21
And that is becoming less and less the case.
25:26
Nevertheless, there’s still a slightly intimidating breadth of capability that data scientists, or certainly good data scientists, tend to have.
25:38
And so there is a good question for analytics professionals as to really where to start.
25:46
And the good news is there’s a lot of online training material, which I’ll come on to in a few minutes, which can help you along this path.
25:59
But the key skill areas for building out data science capabilities I’ve highlighted on the right hand side here.
26:08
Everything in the right hand column is really foundational, particularly the bolded items.
26:17
A key thing to do is to start understanding a little bit more about some core statistical concepts.
26:24
In fact, starting towards the bottom, even in a day-to-day analytics context, understanding the statistical significance of a control group versus a holdout group and so on, and understanding the foundations of distributions and sampling errors and so on, is a good foundation to get into. And as I started to deepen my skills in data science, it was very interesting to have that back to school feeling of dusting off my statistics background from my college years to look into that.
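As a minimal sketch of what checking statistical significance between two groups can look like, the snippet below runs a two-proportion z-test on conversion counts. The group names and counts are made up, and the z-test itself is an assumed choice of method; the talk does not name a specific test.

```python
from math import erf, sqrt


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates.

    Returns the z statistic and the p-value under the null hypothesis
    that both groups share the same underlying rate.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Standard normal CDF via the error function.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value


# Made-up example: 120 of 1,000 converted in the exposed group,
# 100 of 1,000 in the holdout group.
z, p_value = two_proportion_z_test(120, 1000, 100, 1000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

A two-point lift that looks meaningful on a dashboard may not clear a conventional 5% threshold at this sample size, which is the kind of intuition the speaker is encouraging analysts to build.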
27:08
The next key piece really is data visualisation and preparation, and because of the way that some of the key technologies, particularly R and Python, work in data science, these are quite closely linked together.
27:19
They go hand in hand because, as I mentioned earlier, a lot of data science projects start with a data set that may or may not be clean in a strict sense
27:34
but will certainly need manipulation in order to be good for building models against.
27:41
And that includes various kinds of normalization of the data, turning it into a format that a computer can understand from a model building perspective, which is not trivial sometimes.
27:55
And removing things like outliers.
27:57
If you have a lot of outlying values in a data set, then it can make it very difficult for a machine to build a prediction around the key attributes of that data set.
28:11
So there’s quite a lot of preparation and then also examining and understanding of the data in the context of building models.
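The three kinds of preparation mentioned here (removing outliers, normalising values, and turning categories into a numeric format) can be sketched with the standard library alone. The order values and channel labels are made up, and the specific techniques (the 1.5 x IQR rule, min-max scaling, one-hot encoding) are common conventions assumed for illustration, not methods named by the speaker.

```python
from statistics import quantiles


def remove_outliers_iqr(values, k=1.5):
    """Drop values more than k interquartile ranges outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lower <= v <= upper]


def min_max_scale(values):
    """Rescale values to the 0..1 range."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


def one_hot(labels):
    """Turn text categories into 0/1 columns, one per distinct category."""
    categories = sorted(set(labels))
    return [[1 if label == c else 0 for c in categories] for label in labels]


# Made-up order values with one obvious outlier.
order_values = [20, 22, 25, 27, 30, 31, 33, 35, 900]
cleaned = remove_outliers_iqr(order_values)
scaled = min_max_scale(cleaned)

# Made-up traffic channels encoded as numeric columns (email, search).
encoded = one_hot(["email", "search", "email"])

print(cleaned)
print([round(s, 2) for s in scaled])
print(encoded)
```

Without the outlier step, the single 900 order would squash every other scaled value close to zero, which is one way outliers make it hard for a model to learn from the typical cases.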
28:21
There’s obviously a lot of talk in the press these days about the way that we’re all going to be replaced by machines and AI.
28:30
And one of the most reassuring ways to learn that that’s not likely to happen too soon is to actually do a little bit of data science and machine learning.
28:39
Because the way that this data has to be spoon fed to the machine in order to make a particular model work well makes you realise that machines are not close to rising up and taking over just yet, which is reassuring.
28:55
So then, with a reasonable understanding of the foundations of statistics and the ability to prep data effectively for model building, you can go on to build your own models using one or more of the many model algorithms that have already been written, such as linear regression, k-means clustering, collaborative filtering and other kinds of models.
29:22
And those terms, by the way, are the famous names of models.
29:32
Those are the first thing that aspiring data scientists learn mostly so they can bandy them about in presentations like this and sound like they know what they’re talking about.
29:39
So don’t be intimidated by all these blog posts that say, you know, should I use k-means clustering or another clustering algorithm?
29:48
There’s a lot of good debate, but the key thing is these algorithms have already been built.
29:54
You do not need to reinvent the math and statistics behind those algorithms in order to use them effectively for building predictive models.
30:05
So Jim, if we could go to the next slide.
30:13
Jim, taking a minute to show up.
30:16
There we go.
30:16
Thank you.
30:17
Yeah, so this piece around stats and model building is the part of data science that most intimidates non data scientists.
30:25
It certainly intimidated me when I started looking into this field, and as I say, it does contain a great deal of complexity, and people have founded entire academic careers and certainly got PhDs and tenure and so on on the back of original research in the area of predictive modelling.
30:47
And I was reminded the other day of the Netflix Prize from, it was probably about, say, almost 10 years ago, which was a competition to produce a newer, better predictor for movies on Netflix.
31:02
And you know, the prize for that was $1,000,000.
31:05
And there’s a lot of genuine model innovation and algorithmic innovation behind that.
31:12
But it’s important to remember: you don’t have to understand everything about how your car was built in order, firstly, to drive the car, or even to do a certain amount of maintenance on it.
31:29
It’s not necessary to understand every detail of how a machine learning model works in order to use it.
31:38
It is important to understand the principles of how it works, just as it’s useful to understand the principles of how a car works.
31:47
If you have no concept of how a manual gearbox works, for example, and you drive a stick shift, then you will struggle to drive that car effectively.
31:57
But it’s not essential to be able to,
32:01
you know, at the drop of a hat, draw an accurate diagram of the gearbox of your car in order to drive it.
32:08
And so if you’re planning to come up with the next great model algorithm yourself, then yes, by all means go away and do a PhD in it.
32:18
But otherwise it’s perfectly possible to use stuff that’s come before.
32:23
And as I have mentioned a couple of times, the way the industry has evolved is the data scientists who came up, you know, 5-10 years ago in this field did have to learn a lot more of that stuff.
32:36
And so there’s a little bit of an understandable irritation among more established data scientists that all this stuff is coming on that’s making their field easier.
32:47
Many of us who have been in the IT industry for a long time are familiar with that, because for those of us who had to install operating systems from floppy disk and configure network stacks manually, somehow it always feels too easy these days.
33:02
And so that’s just a natural kind of evolution.
33:06
So a big question a lot of people have is should you learn R or Python?
33:11
And the answer to this question is yes, you should learn R or Python.
33:18
Oh, Jim’s backing up.
33:21
The point being that it doesn’t matter that much which one of these two you learn.
33:29
But it does matter, for reasons which I’ll explain, that you get into this.
33:37
There are basically three real reasons why this is not crazy and you shouldn’t get freaked out about it.
33:50
The first thing is you don’t have to go and spend 12 months learning every detail of R or Python.
33:58
They are
33:59
what’s called statistical programming languages, and they are specifically designed to allow you to manipulate data in a tabular format.
34:13
In R, these are called data frames.
34:15
And for anybody who’s worked with data for a reasonable period of time, once you start getting into how these languages work, they’re kind of a revelation, because it’s like, oh my God, somebody wrote a language specifically to work programmatically with data.
34:29
And yes, SQL is of course a data language, but R and Python are much more familiar to people who have more of a coding background.
34:41
You don’t have to learn the whole thing straight out of the gate.
34:45
In fact, I would actively recommend you don’t do that.
34:49
The second thing to say about this is the reason to learn these is they enable pretty much all the capabilities that we saw on the slide before last: the data visualisation, the data manipulation and in fact the ability to build models.
35:07
And so you can actually get the whole of a data science project done just using R and Python in terms of building the models.
35:17
I’m not suggesting you necessarily want to do that, but it actually is possible.
35:21
So they are very foundational technologies.
35:24
And then the third thing is that, as I mentioned, if you work with data a lot, you will find these languages to be a real revelation for their ability to express very succinctly what previously would take a lot of steps in Excel or Google Analytics or even SQL, and they will help you build a deeper relationship with the data that you’re working with.
35:49
They really help you to understand the schema and the structure of data as you build out these data sets and models in this environment.
36:02
That said, Stack Overflow is your friend.
36:04
If you go on Stack Overflow and ask questions about these languages, you will get a lot of answers, and that’s really what everybody does.
36:10
If you’re trying to decide which to choose, Python is a little more popular.
36:16
It’s becoming very popular.
36:17
In fact it’s almost overtaking Java in its popularity and it’s somewhat more general purpose.
36:25
R is more specialised for stats work and has a larger set of model libraries.
36:30
So R is more beloved by the traditional stats community.
36:36
Whereas Python is a bit more general purpose, and one way of choosing is that if your organization already has an orientation towards one cloud platform or another, or one tool set in this area or another, that may guide you.
36:51
So for example, Google loves Python.
36:53
Microsoft has a bit more of an R orientation, though they support both Python and R, and so that can help you to choose.
37:05
OK, so next slide.
37:10
So as I say, the great thing about R or Python is you can use them to write code like the code on the left hand side.
37:24
And because it’s code, it does exactly what you need it to do.
37:29
Also though, because it’s code, it takes time to learn, and of course one spends a certain amount of time debugging things when they don’t work.
37:41
And particularly if you don’t have a coding background, and I’m not saying you have to have been a professional programmer.
37:47
But if you haven’t really studied coding, then that learning curve can be steeper. So the good news is that there are a lot of tools coming along that make it increasingly possible to build out data prep, model training and model evaluation data flows without actually having to write any, or very much, code.
38:20
And so the screenshot on the right is actually from Azure Machine Learning.
38:26
And as I say, that allows you to build all these linked steps to take a data source, join it with other data, clean it up, remove some fields, change the metadata, normalize it, remove outliers and so on.
38:42
And then train and evaluate a series of models and compare the results of the models.
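The linked steps described here (clean, remove outliers, normalize, then train and compare models) can be sketched in a few lines of code. The following is a minimal Python sketch using only the standard library, not a reproduction of what Azure Machine Learning does internally. The data, the field names (`ad_spend`, `revenue`), the outlier rule and the two toy models are all assumptions made for illustration.

```python
from statistics import mean, median

# Invented (ad_spend, revenue) rows; None marks a missing value.
raw = [(1, 12), (2, 19), (3, None), (4, 41), (5, 52), (6, 58),
       (7, 71), (8, 500), (9, 88), (10, 101), (11, 112), (12, 118)]

# Clean: drop rows with a missing field.
clean = [(x, y) for x, y in raw if x is not None and y is not None]

# Remove outliers: a crude rule that drops revenue values more than
# 3 median absolute deviations away from the median.
med = median(y for _, y in clean)
mad = median(abs(y - med) for _, y in clean)
kept = [(x, y) for x, y in clean if abs(y - med) <= 3 * mad]

# Normalize: min-max scale ad_spend into the range [0, 1].
lo, hi = min(x for x, _ in kept), max(x for x, _ in kept)
scaled = [((x - lo) / (hi - lo), y) for x, y in kept]

# Split: hold out every third row for evaluation.
train = [row for i, row in enumerate(scaled) if i % 3 != 2]
holdout = [row for i, row in enumerate(scaled) if i % 3 == 2]


def fit_baseline(rows):
    """Model 1: always predict the average revenue seen in training."""
    avg = mean(y for _, y in rows)
    return lambda x: avg


def fit_line(rows):
    """Model 2: ordinary least-squares straight line."""
    mx, my = mean(x for x, _ in rows), mean(y for _, y in rows)
    slope = (sum((x - mx) * (y - my) for x, y in rows)
             / sum((x - mx) ** 2 for x, _ in rows))
    return lambda x: my + slope * (x - mx)


def mean_abs_error(model, rows):
    return mean(abs(model(x) - y) for x, y in rows)


# Train each model on the training rows and compare them on the holdout rows.
results = {name: mean_abs_error(fit(train), holdout)
           for name, fit in [("baseline", fit_baseline), ("line", fit_line)]}
for name, err in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name}: holdout MAE = {err:.1f}")
```

Each block of the sketch corresponds to one box in a drag-and-drop flow, which is why the visual tools can be a useful way to learn the concepts before writing this kind of code by hand.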
38:50
There are plenty of other tools coming along that do the same kinds of things, such as RapidMiner, Google’s Cloud AutoML, MLJAR, and IBM Watson Studio.
39:03
So Microsoft is by no means the only player in this game.
39:07
But this is the key reason why it’s becoming easier and more necessary at the same time to grasp some of these data science concepts.
39:16
But the other good news is that if you do some online learning that uses some of these tools, they are a really great educational tool. I’ve spent a lot of time in Azure ML now, and really it’s helped me to understand a lot of core concepts around data mining much more quickly than if I’d had to crank everything out using code.
39:36
OK, so next slide please Jim.
39:41
So here are some learning resources in the right hand panel.
39:44
This presentation is available for download in the handouts.
39:49
There’s a box called Handouts.
39:51
Click on that, load the presentation and these are all clickable links.
39:57
And if you’re targeted the way that I’m targeted, based upon keywords and so on in various social media, you’ve probably already received many ads for online data science courses or two-year master’s programs and so on.
40:16
The key information I have for you is that there are a lot of good free resources for learning the concepts of data science, on edX and DataCamp.
40:25
I’ve used both.
40:26
They have a lot of really great data science courses from, for example, Harvard and UC San Diego.
40:33
The Microsoft Professional Program in Data Science is a whole curriculum that gets you from nought to 60 in data science, and it is actually in partnership with edX and DataCamp.
40:45
And so there’s a lot of content on both those environments.
40:49
And again, I’m obviously more familiar with Microsoft because I was working there, so I’m not explicitly plugging that.
40:57
But my very strong recommendation is for you to find a course that takes you through a curriculum of learning about this area rather than, for example, just trying to learn all you can about R.
41:11
And so the Microsoft Professional Program and these courses on Coursera, and there’s similar content on Udacity, have these data science curricula which take you through the core process and the key activities and then go through these various topics of statistics, data manipulation, data visualisation, model building and evaluation, and so on.
41:41
And that’s a really great way to do this, and I strongly recommend that you do some of these.
41:48
Most of these, as I say, you can do for free and you only pay if you actually want to get the certificate.
41:53
And so they’re a great way of getting learning in this area.
41:59
There’s also even a Machine Learning Guide podcast from OCDevel, which, if you want to be driving in your car and attempting to visualize data science models in your head while driving, is a great podcast.
42:15
It’s a pretty interesting intellectual challenge.
42:18
OK, if we go to the next slide.
42:21
So I’m going to finish out the presentation just by talking a little bit about, if you’re not so much thinking about how you become a data scientist, but instead thinking about how you build data science into your team.
42:35
Perhaps you run an analytics team and you’re being told or perceiving and learning that you need to have more data science capability.
42:46
You know, what would be my tips based upon my time at Microsoft.
42:51
This really builds on a lot of what Jim was saying about the three pillars of data science, data analytics and data engineering.
43:01
And those will come from different places and you have different blends.
43:05
And in fact the first bullet here really captures that: as you build out your data science skills in your team, think about complementing existing skill sets.
43:16
So for example, data scientists are expected to interact with stakeholders, and certainly it’s our experience at Microsoft that data scientists don’t want to be solely back-office people who never touch the stakeholders for their work.
43:36
That said, there is a reasonably strong negative correlation between technical data science skill and softer communication skills, and that is a gross generalisation for which, to any data scientists on this call, I apologise, but certainly it can be.
43:56
It is, it is.
43:56
It is often the case.
43:57
Or it can be the case that you find a great data scientist who doesn’t have such great communication skills.
44:02
And that can be OK if you have other great communicators and storytellers on your team.
44:06
You just need to pair them together to take the story back to the stakeholder.
44:14
Similarly, if you have a relationship with a great data engineering team or you have great data engineers within your organization who are really good at doing a lot of heavy lifting around data and building reusable assets,
44:27
then as you build out your data science capability, you don’t have to worry so much about people who are great at that part of data science and can instead focus on people who are really good at model innovation and perhaps stakeholder management.
44:40
So look at the existing skills you have in that matrix that we presented and think about where you need to build out skills, rather than always looking for the unicorn, because the unicorns will be very, very expensive and you will have a real hard time holding on to them.
44:58
And the second recommendation is make smart choices for projects.
45:02
One of the things I’ve seen happen is people hire data scientists and then they kind of let them loose and just say, well you know, do interesting stuff, come back in three months and tell me an amazing thing that changes my life.
45:14
And there is room for that kind of 20% time type work, speculative work.
45:24
But it can’t be everything, because it’s very easy for business stakeholders to become disillusioned with these expensive and sometimes prickly data scientists.
45:36
So make smart choices for projects.
45:39
As Jim said, you know, it’s really about working closely with stakeholders to make sure they really are clear about what they’re trying to achieve and that that thing they’re trying to achieve is a good thing to achieve.
45:50
And it’s not likely to be gamed by some automated system sending 1000 emails a day.
45:56
And so there’s a real interaction there.
46:00
You should balance geniuses with journeymen.
46:02
This again is a little bit of a statement about the kinds of people one tends to find doing data science.
46:08
So I don’t want to kind of over egg this, but again it it it.
46:14
There can sometimes be a choice to be made between a brilliant but sometimes difficult to work with individual and a much easier to work with individual who may be less brilliant day-to-day.
46:29
One is not better than the other, but you should not have all of one because otherwise you’ll have a lot of very nice people who don’t do really inspiring work or a lot of people who are all amazing but spend all their time fighting with each other.
46:45
And then, in a lot of organisations, as Jim again alluded, data scientists are getting hired into different parts of the organization and there’s a little bit of an arms race going on sometimes.
47:02
And so it can be very easy to get into that arms race and feel like it’s a zero-sum game.
47:12
As if, unless model A from team A is chosen over model B from team B, model A has failed.
47:20
And my strong recommendation is to treat it not as a zero-sum game but as a raising-all-boats type of game and encourage collaboration.
47:33
But also, if there are other people across the organization who are leaning in on this area, then just get on with your own work and the value of what you do will come through,
47:43
as long as you have good stakeholder engagement. You should also plan for operationalisation.
47:49
At Microsoft, we had too many instances where a data scientist would produce an amazing piece of work, an amazing model, and they would consider their job done when they had produced the model and presented it to a very senior person.
48:05
There’s a lot of currency attached to presenting to senior people at Microsoft.
48:08
That’s not unique to the company.
48:11
But we struggled sometimes to actually get real business value out of these models because the data scientists tended to lose interest once they’d actually written the code for the model.
48:23
And there’s a big step about operationalization, which you really do need to pay attention to. And then finally, always be hiring.
48:34
Data scientists have a huge amount of choice in their career these days, and they can command very high salaries.
48:44
And we had people at Microsoft, relatively junior people, tempted away to other companies for seven-figure sums.
48:52
And I’m not exaggerating.
48:57
And the best way we found to address that was to always be hiring, to always have a pipeline of good people that you’re talking to, so that as you have people churn out, you have somebody else to take their place.
49:11
Or at least you’re starting to warm people up.
49:13
So instead of it being, you know, oh crap, my best girl or guy just left and now I have to start a four-month process to find a replacement,
49:20
you’re already part way down the line with that. And by the way, we also found that relatively junior but super smart and enthusiastic people are really worth having on the team, particularly as long as you have more experienced people to mentor and grow them.
49:44
So next slide and I’ll let Jim chime in here as well.
49:49
But here are some resources.
49:51
My blog, Jim’s blog, and Tim Wilson, who we mentioned a couple of times, is speaking on this same topic at the DAA Atlanta Symposium.
50:00
And also Jim will be covering the topic in his events, the Marketing Evolution Experience in London and Berlin.
50:10
Jim, any other comments to make before we open it up for questions?
50:14
Yeah. Thank you very much for that, Ian.
50:15
I threw in Data Science Central, which is a newsletter with resources and news of serious data science stuff, if you want to dive deep.
50:25
The other one that I’ve personally found fascinating is This Week in Machine Learning and AI.
50:31
It is not a how to, but it is.
50:34
It is a data scientist who interviews other data scientists who are doing actual work.
50:39
Now only some of it is in marketing; a lot of it is in visualization or self-driving cars or cancer research or whatever.
50:47
But it gives you a really clear picture of the state-of-the-art, which is amazing.
50:53
But Oh my God, we have so many problems still to be solved.
50:56
So it’s a real eye opener. And yes, if you’re in Europe, coming up next month and then in November is the Marketing Evolution Experience in London and Berlin. End of commercial.
51:08
Open it up for questions.
51:13
Unfortunately, I don’t think we can hear people if you ask questions; I think we can only see this.
51:19
Sorry, this is Deanna, I had to unmute myself.
51:22
Hi, this was great stuff guys.
51:24
Thank you, Jim and Ian, we do have a couple of questions.
51:29
What are the biggest challenges that you see facing data scientists today and in the future?
51:36
Well, the short answer is hype and over-expectation. What’s the rest of the answer?
51:43
Yeah, I think it depends on the definition of data scientists.
51:51
As I mentioned, I think existing data scientists are, paradoxically, challenged by the democratisation of their field.
52:04
And I think a challenge there is that we could see a lot of poor quality work done with the tools that are emerging.
52:19
And just as, for many years, web analytics was plagued by people who were not terribly skilled drawing conclusions from the descriptive data in web analytics, I think we may see a lot of poor quality work being done.
52:40
And so it’s the paradoxical flip side of the democratization that it is important to learn the basics.
52:49
But fortunately, you don’t have to learn all of the code all of the time.
52:54
It’s such a moving target, too, when you’re learning any of this stuff.
52:59
There is no end to it.
53:00
You don’t, you don’t learn it.
53:01
Tick a box and you’re done.
53:03
You dive in and you hold your breath because there’s new stuff every day. Yeah, that’s very true, Jim.
53:11
And actually that’s something that I omitted to mention earlier on about this model building process for data science.
53:21
There is no point in the process where a bell rings and a light goes on and a big sign flashes up saying you’re done.
53:31
It is entirely human judgement about whether the model that has been built is essentially good enough and meets various other criteria, like it’s not over-complicated and overfitting to the data, and can therefore work in the real world and be deployed.
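One common signal that feeds into that judgement is the gap between how a model performs on the data it was trained on and how it performs on data it has not seen. The sketch below is a toy Python illustration with invented numbers: a "memorizer" that simply looks up the nearest training point is compared with a plain straight-line fit. Neither model nor the data comes from the presentation.

```python
from statistics import mean

# Invented data: the underlying trend is roughly y = 2x, and the
# training observations are noisy.
train = [(1, 5), (2, 1), (3, 9), (4, 5), (5, 13), (6, 9), (7, 17), (8, 13)]
holdout = [(1.4, 2.8), (2.4, 4.8), (3.4, 6.8), (4.4, 8.8),
           (5.4, 10.8), (6.4, 12.8), (7.4, 14.8)]


def fit_memorizer(rows):
    """Over-complicated model: return the y of the nearest training x."""
    return lambda x: min(rows, key=lambda p: abs(p[0] - x))[1]


def fit_line(rows):
    """Simple model: ordinary least-squares straight line."""
    mx, my = mean(x for x, _ in rows), mean(y for _, y in rows)
    slope = (sum((x - mx) * (y - my) for x, y in rows)
             / sum((x - mx) ** 2 for x, _ in rows))
    return lambda x: my + slope * (x - mx)


def mean_abs_error(model, rows):
    return mean(abs(model(x) - y) for x, y in rows)


errors = {}
for name, fit in [("memorizer", fit_memorizer), ("line", fit_line)]:
    model = fit(train)
    errors[name] = {"train": mean_abs_error(model, train),
                    "holdout": mean_abs_error(model, holdout)}
    print(name, errors[name])
```

The memorizer looks perfect on its own training data and much worse on the held-out rows, which is the overfitting pattern. Deciding how much of a gap is acceptable is still the human call described above.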
53:48
Certainly, before I started to lean in on data science, I had the sense that these data scientists did come to a much more definitive kind of conclusion that their work was done.
54:03
And that’s not true.
54:04
And there’s that artificial kind of confidence that sometimes comes through when people are in that environment.
54:10
So the job is not a definitive job and the activity is not definitive either.
54:17
So helping people, helping stakeholders to get comfortable with that is another key challenge of data science for sure.
54:27
OK, so this question kind of ties into what you were just saying then.
54:32
So in your professional opinions then, what does the job forecast look like for data scientists?
54:39
The future is so bright, I have to wear shades.
54:42
I’ve heard you say that before, Jim.
54:46
Well, it’s true.
54:47
I mean everybody, everybody wants a data scientist.
54:51
And you have to decide; it’s a personal decision whether you call yourself a data scientist.
54:57
If you are comfortable in R and Python, you can say I am a data scientist to some degree.
55:05
If you are really well read on all things machine learning, then by golly you are a data scientist and no one will take that away from you.
55:16
But everybody needs help in this area.
55:20
Everybody still needs analysts.
55:22
So if you don’t really care for deep mathematics and you’re really happy solving business problems with data and being an analyst, you are needed.
55:34
Doesn’t matter if I have 1000 data scientists, if I don’t have any analysts, I’m not getting any work done.
55:42
Yeah, that’s absolutely right, Jim.
55:43
And again, I was going to mention this earlier: with all the material that Jim and I presented,
55:52
we’re certainly not trying to place data science above data analysts or the data analyst’s role.
56:00
That said, I do know, because I’ve spoken to various folks in organisations, in Microsoft and elsewhere, that data science is the sexier job title at the moment.
56:10
And so you’re seeing all sorts of people retitle themselves data scientist when they, you know, they may not really have the chops to do that.
56:19
And that’s a big challenge of title inflation in the area. The future for people who can do good data science, regardless of whether they call themselves data scientists or not, is in fact very bright.
56:37
And in fact my strongest recommendation, if you’re looking to grow your skills in this area, is simply to do projects.
56:45
A lot of these online learning environments have these capstone projects.
56:52
There are lots of sample data sets out there to work with, and the real way to get comfortable with this field is to do projects and get a sense of the complexity of this area, and that will stand you in good stead.