Wonder. Explore. Model. Scale. Observe. Repeat.
So you are a Data Scientist. What a wonderful thing to be. The world has always been data at its fundamental level. You were always convinced of that, but now physicists declare it too. While an older generation wandered into data science from disparate fields in the applied and pure sciences, with a shared love of problem solving, statistical reasoning, and turning problems into solutions by flowing data through code, you are different. You are exquisitely trained in one of the emerging graduate-level programs in Data Science. Or you come from one of the applied statistical or social sciences, and have pursued your fascination with analysis beyond the techniques of your original discipline. You may have started in one of the closely allied fields required for large-scale analytical applications: Software and Data Engineering, Quality Control and DevOps.
You have augmented your original skills through self-study, coding, and additional courses. Whatever your path to data science, here you are, a young data scientist (experientially young, if not chronologically so). In your back pocket you have techniques from a confluence of disciplines: statistics, AI, machine learning, data mining, distributed systems…. You can code analyses in several languages. You tackle data with gusto in its myriad forms: structured, unstructured, big, small, simple, complex, streaming and at rest. I have only two words of advice for you: “Stretch Yourself”. More specifically: first, stretch yourself beyond your technical skill set as a data scientist; second, learn how to combine with and incorporate closely allied skills: DevOps, Data & Software Engineering, Quality Control and Observability, Product Design. Below are some suggestions for how to do this.
UNDERSTAND THE ORIGINS OF TECHNIQUE
Technique has two aspects: the specific analytical techniques you use, and the best practices you have incorporated into your personal analytical style. Analytical techniques have a point of origin, often tied to a specific problem that led to the invention of the technique. While “Data Science” is a recent term and occupational description, data science is old. Indeed, without data there is no empirical science. Kepler was a data scientist. Darwin was a data scientist.
Discriminant Analysis in its myriad forms is often listed as a machine learning technique. However, it originated in a particular taxonomic problem: how to assign individual samples to their correct species. The Fast Fourier Transform, associated with John Tukey (the father of exploratory data analysis), first originated in Gauss's analysis of asteroid orbits. Reinforcement learning was inspired by models of animal behaviour. Neural networks have come in and out of vogue multiple times. What is the utility in knowing the history and origins of these techniques, beyond their being interesting anecdotes? First, they provide intuition into a technique, which often helps one reason about other areas of application. Second, the assumptions inherent in a technique are often tied to the abstraction of the original problem being solved, as well as to the computational limitations at the time of origin. Understanding these assumptions is critical to correct application.
Technique also comes into the best practices of a domain. As a data scientist you may cross many domains in biology, engineering, and the environmental and social sciences over your career. As you cross these domains, you need to be aware of the standard best practices in each field you enter. That is your minimal standard. Is a notched boxplot considered sufficient to compare groups? Do you need an ANOVA, with terms significant at a particular level? What sample size is considered sufficient? Are the industry norms weighted towards assuming a known distribution, towards using resampling techniques, or towards using robust statistics whose performance “decays gracefully” as distributional assumptions are not met? Are the relationships of seemingly disparate techniques like K Nearest Neighbours (used in developing production inventories in forestry) and Spatial Kriging (used in developing production inventories in oil and gas) well understood in the industry? It is possible to produce a technically sound application of data science and still stumble over a normative best practice in a particular industry. Be aware of best practices in the domain you are working in. One of the best ways to develop that awareness is to allow yourself to be mentored by someone steeped in those best practices.
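To make the last of those trade-offs concrete, here is a minimal sketch (the numbers are invented) of how a robust location and scale estimate “decays gracefully” under contamination, while the classical estimates do not:

```python
import numpy as np

rng = np.random.default_rng(42)
clean = rng.normal(loc=50.0, scale=5.0, size=200)   # a well-behaved sample
polluted = np.append(clean, [500.0, 650.0, 800.0])  # a few gross outliers

for label, sample in [("clean", clean), ("polluted", polluted)]:
    med = np.median(sample)
    mad = np.median(np.abs(sample - med))           # median absolute deviation
    print(f"{label:>8}: mean={sample.mean():7.1f} sd={sample.std():7.1f} "
          f"median={med:5.1f} mad={mad:4.1f}")
```

Three outliers drag the mean and standard deviation far from the bulk of the data, while the median and MAD barely move. Whether that robustness is the industry norm or an exotic choice is exactly the kind of thing a domain mentor can tell you.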
LET YOURSELF BE MENTORED; BE A MENTOR
As a data scientist, you are likely to work in industries far beyond that of your original training, work on problems that initially look unfamiliar, and work within business constraints very specific to the industry you are in. You may be nested in a functional business unit, be part of a technology group, be within an AI centre of excellence, or simply be on a multi-disciplinary product team. Each of these contexts will push your core analytical, modelling and data transformation skills in different directions. The best way to get up to speed fast is to be mentored. As a starting point, be mentored in the best practices of a subject domain by a seasoned expert. How do meteorologists downscale climate data to make local predictions? How do foresters use inventory data to make a harvest plan? How do geologists use their experience to find sites rich in a particular mineral resource? How do you combine weather and topographical information to make a good short-term solar generation forecast? What are the key performance metrics an experienced marketer looks to first? Why is Net Promoter Score (NPS) important to your CEO? If you don’t know, ask. Don’t just ask for the “answer”. Ask to understand the thought process. Be open to asking “dumb questions”. Politely challenge if there appear to be holes in the reasoning. You may be missing something. Or you may be onto something!
Many technical/analytical best practices are built around issues that can be put in quantitative form. You also need to understand how such problems are nested within the business context, where best practices may be more qualitative, and difficult to design and measure. Which problems are highest priority to the business? Does the business own, or have access to, all the data it needs? What are the low-hanging fruit — problems where a little analytics could go a long way? Are there problems whose solution could disrupt industry business models? What are the areas of inefficiency in the business? Such qualitative insights come out of conversations, asking the right questions, or having mentors lead you towards the right questions. A good “best practice” for all seasons is to build systems as simply as possible — the fewer the moving parts, the easier a system is to understand yourself and to explain to others. Systems accrue complexity over time. So start simply and focus on reliability. Things will break. Features will creep. Bugs will appear and require tracing back to root cause.
As you allow yourself to be mentored, be prepared to be a mentor. Find ways to explain your thought process and analytical techniques to others in simple sketches. Partner with domain experts to do exploratory analyses and model development collaboratively. Bring their domain understanding into the analysis from the beginning. One way to do this is via collaborative sketching, where you layer both their domain knowledge and your understanding of analytical options. This is a good way to learn each other’s thinking styles and go from “their knowledge” and “my knowledge” to “our knowledge”.
Another technique — assuming both parties are willing — is pair programming, or in this case pair data analysis. If your partner in pair analysis is a non-programmer, you have to take responsibility for both the analysis and the programming. The benefit is that you get instant feedback and interpretation from your domain partner, as well as insight into data anomalies. If you both program, take turns throughout the analysis. You will both learn something about each other’s thinking, analysis and programming styles. The code produced in such sessions should usually be considered prototype code, not production code. At the point where you and your mentor (or mentee) can comfortably sketch, analyze and code together, you are well on your way towards a mentoring relationship becoming a problem-solving team.
Mentoring is a two-way street. It takes willingness to invest in the relationship on both sides, and flexibility to pitch to each other’s learning and teaching styles. It takes respect for each other’s strengths and weaknesses, the confidence to challenge each other, and the will to work through differences in perspective and training. As the relationship develops, who is mentor and who is mentee may undergo transformation; the classic “student becomes the teacher”.
BE A TEAM PLAYER
The most exhilarating moments of my career have been working on gelled teams firing on all cylinders. Consider the co-mentoring relationship described above as the nucleus of a team. Now add more people, with different functional skills and different perspectives. Teams are mechanisms to harness disparate talents to a common purpose. Critical to being a team player is the ability to lead and to follow. Teams usually have an official leader: the person who decides when there is an impasse. But such official leadership is often most effective when worn lightly. This creates myriad opportunities for leadership within the team. Lead when the problem is in your domain. Lead when you have a new idea. Follow when someone else has a good idea. Particularly if it’s in your domain. In a gelled team, conversations flow and rapidly lead to building things that work. There is passion, and with passion dissension, but also coherence. When a decision is made, everyone commits. The essence of being a good team player is the ability to both follow and lead, and to know which is required in a particular moment. When everyone on the team can do this, creativity is unleashed, but so is productivity. Bad ideas get shut down early; good ideas go forward with the least fuss. People feel confident arguing for their own ideas, but have enough trust in each other to follow a different path. Everyone has a strong sense of ownership in, and responsibility for, the outcome.

Growing a team with these features requires discipline and subtlety. The first step is to be comfortable with leading and following, a close parallel to the learning and teaching relationship in mentoring, but now with a larger group. Being a team player comes naturally to some, but requires practice for others. If you struggle to be a team player, begin by listening to other team members’ conversations. Which conversations go forward into working processes and products? What is different about those conversations? Practice following. Practice leading. If some members of the team seem more effective at this, let them mentor you.
LEARN TO TALK, TO WRITE, TO SKETCH
As a data scientist, you are a master of a rather arcane technical skill set. The tools data scientists use are both technical and abstract. Outside of a data science team, most of these methods will not be deeply understood. When working with domain experts to develop an analysis, you need to be able to speak simply but accurately about how a technique works and the assumptions it makes. Often a sketch helps illustrate the main points. After an analysis or analytical product is completed, you need to be able to communicate its essence in writing.
Every analysis is an argument, a seeking of substantiated truth in the data. Clear tables of summary statistics and well-designed data visualization highlight the nuggets you wish to convey. But it is words that provide context and explore consequence. Be as careful with the words that bind your argument as you are in picking analytical techniques, summary statistics and visuals. You are making these arguments to the key decision makers in your organization, or to customers, or to investors. Don’t let all the effort in analysis go to waste by choosing the wrong words.
Sketches do not have to be elegant. They have to be effective in carrying an idea forward into other minds. I prefer very simple sketches, but this may be due to my total lack of artistic talent. Your sketches are likely very different from mine. Just as everyone has a unique speaking voice, they have a unique “sketching voice”. Practice speaking and sketching at the same time. Be comfortable with others marking up your sketches, or speaking to them — collaborative sketching. The process of having a conversation in words and sketches is part of the creative flow. When a team learns to speak and sketch together, they develop a shorthand for idea flow that often looks, to an outsider, like a foreign language. But what is happening is that a set of different minds, with different skills and perspectives, is becoming entrained on collectively solving an analytical problem. Working through problems by talking and sketching is a good way to get to gelled teams. Conversation and gesture — sketching and talking about operations on data — can often be the informal birth of new algorithms.
SEARCH FOR GOOD PROBLEMS
But don’t go looking for new algorithms. Go looking for good problems, and new algorithms will come. What is a good problem? Something that stretches you beyond your current skill set. Something that forces you to learn a new domain. Something that takes a new angle on a well-known concept. For example, in what ways are sorting data and the concept of correlation similar? Could the connection be used to generalize the notion of correlation from variables to complex objects? Is this a good problem? It depends. If your task is to calculate the similarity between two objects, it may very well be.
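As a minimal sketch of that connection (an illustration, not a production implementation): Kendall’s rank correlation counts concordant and discordant pairs, and the discordant count is exactly the number of exchanges a bubble sort of y would make once the data are ordered by x.

```python
from itertools import combinations

def kendall_tau(x, y):
    # Kendall's tau (no ties): (concordant - discordant) / total pairs.
    # The discordant count equals the swaps a bubble sort of y performs
    # after sorting by x; replace the sign test with any pairwise ordering
    # to generalize from numeric variables to more complex objects.
    pairs = list(combinations(range(len(x)), 2))
    concordant = sum((x[i] - x[j]) * (y[i] - y[j]) > 0 for i, j in pairs)
    discordant = sum((x[i] - x[j]) * (y[i] - y[j]) < 0 for i, j in pairs)
    return (concordant - discordant) / len(pairs)

print(kendall_tau([1, 2, 3, 4], [1, 3, 2, 4]))  # 0.667: one discordant pair
```

Sorting compares an ordering against an ideal; correlation, in rank form, compares two orderings against each other.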
A frequent source of good problems is trying to develop quantitative techniques that capture domain knowledge. You walk with a forester through several conifer stands, and she describes one as “complex and multi-storey”, while another stand is described as “simple and even-aged”. How can you statistically capture the structure of a forest stand with trees of different species, ages, heights, and diameters? An epidemiologist describes how flu outbreaks start in terms of a contact network. How can you evaluate the stability of an arbitrary network to viral spread? An insurer explains that their payouts on flood damage have increased tenfold in the last decade. How can you assess the likelihood of a building withstanding a flood, or the risk it won’t? For each of these problems there are domain experts. Can you develop a method that gets to the same place as experience and expertise, and whose effectiveness you can verify? Try to work out how to solve the problem from first principles. The algorithm you develop may be new. Or you may re-discover an ancient algorithm. Both are discoveries. Both are exciting personal triumphs you had to stretch to reach.
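For the forest-stand problem, a hedged first attempt (the entropy index and the 10 cm diameter classes are assumptions for illustration, not an established forestry standard) might score structure as diversity over joint species and size classes:

```python
from collections import Counter
import math

def structural_diversity(species, dbh_cm, bin_cm=10):
    # Shannon entropy over joint (species, diameter-class) groups.
    # A 'simple, even-aged' stand concentrates into few groups (low
    # entropy); a 'complex, multi-storey' stand spreads over many (high).
    groups = Counter((sp, int(d // bin_cm)) for sp, d in zip(species, dbh_cm))
    n = sum(groups.values())
    return -sum((c / n) * math.log(c / n) for c in groups.values())

# invented data: an even-aged monoculture vs. a mixed, layered stand
print(structural_diversity(["fir"] * 4, [22, 24, 25, 23]))                     # 0.0
print(structural_diversity(["fir", "pine", "fir", "cedar"], [8, 35, 62, 90]))  # ~1.39
```

Whether this gets to the same place as the forester's judgment is exactly the verification question posed above.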
GO BEYOND TECHNIQUE AND STATE OF THE ART
Finding good problems is an efficient approach to innovating beyond current technique and the state of the art. Particularly if you think of problems that are not effectively solved by current techniques. Many techniques that are “new” in industry have twenty or more years of development in academia. So a second way to go beyond technique is to read the primary literature on the origin and initial application of a technique. This is why understanding the origins of specific analytical techniques is useful in understanding their strengths and limitations. A third way is to relate different techniques to each other. Under which conditions will a polynomial regression outperform a feedforward neural network? Under which conditions would the neural network outperform polynomial regression (a small sketch below makes this concrete)? In what ways are fuzzy logic and neural-network-based techniques similar, and in which ways different?
For a specific problem, what are the strengths of a Frequentist approach versus a Bayesian approach? Can you summarize the differences between supervised and unsupervised learning in a sketch? How about the differences (and similarities) between genetic algorithms, reinforcement learning, and multi-agent systems? In what ways can statistical learning algorithms be related to search procedures? A fourth method is to question the current state of the art. Innovation comes in waves. With each wave there is a tendency, given a new hammer, to see all past problems as nails. This is the basis of hype cycles. Seeing through the limitations of the current state of the art is the beginning of seeing new opportunities and innovations. What is a fifth method to go beyond current technique and state of the art? What is a sixth?
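Here is the promised sketch for the polynomial-versus-network question: a small experiment harness. The data-generating function, sample size, and network architecture are all invented for illustration; the point is the harness, not a verdict.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=60)
y = 1.5 * x**3 - 2 * x + rng.normal(scale=0.3, size=x.size)  # cubic signal + noise
x_test = np.linspace(-2, 2, 200)
y_test = 1.5 * x_test**3 - 2 * x_test                        # noiseless truth

# Polynomial regression: few parameters, strong (here, correct) shape assumption.
coefs = np.polyfit(x, y, deg=3)
poly_mse = mean_squared_error(y_test, np.polyval(coefs, x_test))

# Small feedforward network: flexible, but data-hungry on only 60 points.
net = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=5000, random_state=0)
net.fit(x.reshape(-1, 1), y)
net_mse = mean_squared_error(y_test, net.predict(x_test.reshape(-1, 1)))

print(f"polynomial MSE: {poly_mse:.3f}   network MSE: {net_mse:.3f}")
# Vary the sample size, noise level, and true function to map out when each wins.
```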
STAND BACK AND LET THE DATA SPEAK
Once, data was like gold: precious and rare; a luxury. Increasingly, data is becoming like oxygen: commonplace, polluted, and absolutely necessary. As a data scientist, you have always believed in the veracity of data — the signal and story — hidden in the noise. At times you may find yourself under pressure to find a particular signal or story, regardless of what the data is telling you. Resist. More commonly, the danger is reifying our own expectations via analysis — finding one signal but missing others; prompting ourselves into a hallucination.

There are several approaches to guarding against this very human tendency. One is to cross-validate any data-driven conclusion using techniques that make different assumptions about the data: does the same signal keep appearing, again and again, under different methods? (A sketch follows below.) A second corrective is detailed analysis of the errors from a model. Having pulled out one signal, look for others, via both exploratory and confirmatory methods. A third corrective is the use of sampling and experimental design techniques to control for potential sources of bias. A fourth corrective is to explicitly develop hypotheses that would hold if other factors were of influence. Then go search for evidence. Closely related is to ask if more than one hypothesis could equally explain the data before you. What additional data would be needed to distinguish between these alternative hypotheses? Even when you are sure you are right, ask yourself what would be a signal that you are wrong. This is psychologically difficult. It is often useful to have your analysis critiqued by someone with a different perspective, either from within your team or without.

Hopeful skepticism is at the heart of empirical science. We build theories which most simply embody a set of facts. Every product begins with a theory about a customer: what they need, what will delight them, what can make them immersively badass at their daily work. We then look for anomalies, outliers, facts that may point towards a more inclusive theory. Pat yourself on the back when you achieve a good predictive model, classification, or simulation. Then look to the residuals, and let the chase begin; “the game is afoot”. Happy data hunting!
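As a parting sketch of the first corrective, a minimal, invented example that checks one correlation signal under methods making different distributional assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.normal(size=200)
y = 0.4 * x + rng.normal(size=200)         # invented data for illustration
y[:3] += 8                                 # a few contaminating outliers

# The same question asked under different assumptions about the data.
print("pearson :", stats.pearsonr(x, y))   # linear association, outlier-sensitive
print("spearman:", stats.spearmanr(x, y))  # monotone association, rank-based
print("kendall :", stats.kendalltau(x, y)) # rank-based, a different statistic
# If the signal survives all three, trust it more; if not, find out why.
```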
This letter begins with the origin story of a young data scientist, at that time, a forester.