Chapter 1: The basics

The very basics of AI and machine learning, and what they have to do with you.

The phrase “Artificial Intelligence” is used to refer to many different types of technology. It can be confusing to know where to start.

I’ll give a very basic definition: AI is a field of computer science that aims to use computers to do complex things that have historically been considered to require human intelligence. This includes tasks like prediction, reasoning, making decisions, and solving complex problems.

One of the most common types of AI is machine learning, or ML. In very general terms, ML teaches computers to “learn” through “models,” which are programs that use algorithms to recognize patterns in data and then make predictions based on those patterns. The goal of building an ML system is to “teach” a computer to complete tasks it was not explicitly programmed to do.

Here’s a hypothetical example: imagine an ML model trained to predict whether or not a photo has a dog in it. We could “train” this model by having it process one million images of dogs, each labeled “dog,” and one million images that do not have dogs, each labeled “no dog.”

We would do this through algorithms that recognize patterns in the images that have dogs and in the images that don’t. If the patterns the model learns are detailed enough, we might then be able to “show” it a photo it has never seen before and have it apply those patterns to predict whether the photo contains a dog.
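
If you’re curious what that looks like in code, here is a minimal sketch in Python using the scikit-learn library. The tiny fake “images” (random numbers standing in for pixel values) are made up for illustration; a real system would train on millions of actual labeled photos:

    # A toy "dog / no dog" classifier. Real systems train on millions of
    # labeled photos; here, random numbers stand in for pixel values just
    # to show the train-then-predict workflow.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    # Pretend each "photo" is 64 pixel values. Label 1 = dog, 0 = no dog.
    dog_photos = rng.normal(loc=1.0, size=(100, 64))
    no_dog_photos = rng.normal(loc=-1.0, size=(100, 64))
    photos = np.vstack([dog_photos, no_dog_photos])
    labels = np.array([1] * 100 + [0] * 100)

    model = LogisticRegression().fit(photos, labels)  # the "training" step

    # Show the model a "photo" it has never seen before.
    new_photo = rng.normal(loc=1.0, size=(1, 64))
    print("dog" if model.predict(new_photo)[0] == 1 else "no dog")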

When a model is trained on data that has been labeled with predetermined categories (like dog/no dog), this is called supervised machine learning. Other models are not given predetermined labels to apply to the data, and instead create new classifications based on how their algorithms process it. This is called unsupervised machine learning, and it can allow people to use ML to find patterns in data they wouldn’t have otherwise noticed.
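
Here is the unsupervised case sketched the same way, again with made-up data: the clustering algorithm is handed points with no labels at all and invents its own groupings.

    # Unsupervised learning: KMeans gets no labels, only raw data points,
    # and groups them into clusters based on patterns it finds itself.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)

    # Two hidden groups of points -- but we never tell the model that.
    group_a = rng.normal(loc=0.0, size=(50, 2))
    group_b = rng.normal(loc=5.0, size=(50, 2))
    data = np.vstack([group_a, group_b])

    model = KMeans(n_clusters=2, n_init=10).fit(data)
    print(model.labels_)  # cluster IDs the model invented on its own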

Another kind of machine learning responds to feedback. This is called reinforcement learning. Reinforcement learning models receive feedback (a “reward”) and “learn” to recognize which patterns lead to the rewarded outcome. An example could be a model that plays chess: through playing thousands of games, the model comes to recognize which patterns (moves and strategies) lead to the rewarded outcome (winning).
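
That feedback loop can be sketched in miniature. A real chess model is far beyond a few lines, so this toy learner just tries three made-up “moves,” receives a win or a loss as feedback, and gradually learns which move tends to win:

    # Reinforcement learning in miniature: learn, from win/lose feedback
    # alone, which of three moves is best. The true win rates are hidden
    # from the learner; it only ever sees the outcomes of its own play.
    import random

    random.seed(0)
    win_rate = {"move_a": 0.2, "move_b": 0.8, "move_c": 0.5}  # unknown to the learner
    value = {move: 0.0 for move in win_rate}  # learner's running estimates

    for game in range(1000):
        if random.random() < 0.1:              # occasionally explore...
            move = random.choice(list(value))
        else:                                  # ...otherwise play the current best
            move = max(value, key=value.get)
        reward = 1.0 if random.random() < win_rate[move] else 0.0
        value[move] += 0.1 * (reward - value[move])  # nudge estimate toward feedback

    print(value)  # move_b should end up with the highest learned value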

Machine learning provides the foundation for much of the AI technology that has recently become commercially accessible. Layered systems of simple pattern-recognizing units, called “neural networks” (loosely named after the neurons in our brains), can be stacked into what are called “deep learning models,” which power generative systems like ChatGPT and DALL-E. These systems create text and images by predicting the most likely output for a user-submitted prompt.
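
That “predict the most likely output” idea can be demonstrated without any neural network at all. The sketch below builds a tiny word-prediction model by counting which word follows which in one sentence, then “generates” text one predicted word at a time. Systems like ChatGPT do essentially this at vastly larger scale, with deep neural networks instead of a counting table:

    # Generative AI in miniature: count which word follows which, then
    # "write" by repeatedly sampling a likely next word.
    import random
    from collections import Counter, defaultdict

    random.seed(0)
    text = "the dog chased the cat and the cat chased the dog".split()

    # "Training": tally what tends to come after each word.
    next_words = defaultdict(Counter)
    for current, following in zip(text, text[1:]):
        next_words[current][following] += 1

    # "Generation": start from a prompt word and keep predicting.
    word = "the"
    output = [word]
    for _ in range(8):
        counts = next_words[word]
        word = random.choices(list(counts), weights=counts.values())[0]
        output.append(word)

    print(" ".join(output))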

So what does this all mean? Do you have to learn how to build these systems yourself? You probably do not. Understanding the intricate details of these models is not important for the average person. 

But what is important is knowing that they work by making predictions based on heaps of data, and that data is often produced by us.

AI models can only make predictions as accurate as the data they are trained on. This means an AI system can only understand “reality” to the extent that humans are able to turn reality into data points for it to ingest. If the data is inaccurate, or not appropriate for the task, you won’t get much use out of a model. As the saying goes, “garbage in, garbage out.”
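
One way to see this is to train the same model twice: once on labels that reflect a real pattern, and once on “garbage” labels with no connection to reality. This sketch uses scikit-learn and made-up data:

    # "Garbage in, garbage out": the same model, trained once on accurate
    # labels and once on labels that have nothing to do with the data.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 10))
    good_labels = (X[:, 0] > 0).astype(int)          # reflect a real pattern
    garbage_labels = rng.integers(0, 2, size=1000)   # pure noise

    X_test = rng.normal(size=(500, 10))
    y_test = (X_test[:, 0] > 0).astype(int)

    good_model = LogisticRegression().fit(X, good_labels)
    garbage_model = LogisticRegression().fit(X, garbage_labels)
    print("trained on good data:   ", good_model.score(X_test, y_test))
    print("trained on garbage data:", garbage_model.score(X_test, y_test))  # ~coin flip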

This is fundamentally different from the way humans understand the world. We express empathy, and feel responsible for other beings. We have bodies that grow, age, and feel pain. We have relationships, families, and friends. A computer does not have any of this! It’s very important to remember that when AI systems “act” like humans, it's because we trained them on data that was created by humans. 

Data, even data generated by computers, is fundamentally a human product because we decided to build computers, design interfaces, write software, and distribute computing power in a particular way. Personal computers did not come out of nowhere, and there is no mystical force of nature pushing technological innovation towards any given outcome. As computer scientist and data journalist Meredith Broussard writes in her book Artificial Unintelligence, “data always comes down to people counting things.” 

Training data for generative AI models is often sourced from massive scrapes of public data on the web. These scrapes regularly include copyrighted materials like news articles, photos, and artwork, as well as people’s tweets, selfies, Reddit posts, and other social media content. These datasets usually must be annotated, which takes a lot of work. That work is frequently outsourced to millions of remote workers across the world, many of whom are kept in the dark about what exactly they are working on. Ignoring this social context and placing blind faith in the power of predictive automation can have dangerous consequences.

For example, ML systems built on arrest data have been used by police departments to “optimize” their policing strategies, using data to predict where criminal activity is more likely to occur. These models are advertised as data-driven “force multipliers,” promising a more efficient approach to policing. But because they can only learn from where criminal activity has been reported in the past, they will always predict crime in areas that have historically had a greater police presence. In areas where Black and brown communities are already overpoliced, these systems tend to deepen existing police discrimination.

The bedrock of data that enables AI systems is the reason why these systems are universally relevant: just about everyone produces data, whether they mean to or not. Our phones know where we go, when we usually wake up, how much time we spend online, what websites and apps we use, what we write in our notes, who we are texting, how long it takes us to respond to emails, etc. Our human behaviors are constantly being observed and chopped up into data by tech companies, social media platforms, our employers, and our governments. 

Despite what the hype might tell us, it’s unlikely that AI systems will become sentient and destroy humanity as we know it. But they will create a shift in the technological power structures that journalists and their audiences must know how to navigate. 

At the risk of further anthropomorphizing AI, learning about it matters because it's already learning about you.


Chapter 1: Homework!
Here are some resources for further understanding what AI and ML are, how they work, and what they do for society:

If you want a simple guide to machine learning:
  • Check out Normal AI, a guide to machine learning and modern AI by Jonathan Soma, the director of the Lede Program at Columbia’s Journalism School. This guide covers the very basics of machine learning and includes detailed yet easy-to-grasp explanations of how ML models are trained and fine-tuned to do text analysis and image recognition.
  • The London School of Economics, in collaboration with the Google News Initiative, also created a short beginner’s course on machine learning and how it can be used in journalism.

If you want to read more about AI and its implications for society:
  • Read Artificial Unintelligence by computer scientist and data journalist Meredith Broussard. This book covers the limitations of AI technology, and makes a case against technochauvinism, which Broussard defines as the belief that technology is always the best solution to a problem. The book also explains concepts like artificial intelligence and machine learning in detailed yet digestible terms. 

If you want to read more about algorithmic bias:
  • Read “A people’s guide to finding algorithmic bias,” a project from the Center for Critical Race + Digital Studies. This guide gives a clear definition of algorithmic bias under an intersectional social justice framework, and details the ways algorithms are imbued with human values in every step of machine learning development.

If you’re interested in running a large language model locally on your own computer:
  • Check out this guide to running local LLMs, from Sharon Machlis at InfoWorld. Running an LLM locally means you aren’t handing your input data over to an external platform, which may be useful for newsrooms dealing with sensitive information. It takes a little coding experience, but the guide provides step-by-step instructions on how to set one up (see the sketch below for a taste of what querying a local model looks like).
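
As one illustration of what “local” means in practice, this minimal sketch queries a model served by Ollama (one popular tool for this) at its default local endpoint. The model name llama3 is just an example; substitute whichever model you have pulled:

    # A minimal sketch of querying a local LLM through Ollama's HTTP API.
    # Assumes Ollama is installed and running on its default port, 11434,
    # and that a model has been pulled. The prompt never leaves your machine.
    import json
    import urllib.request

    payload = json.dumps({
        "model": "llama3",  # example model name; use whatever you pulled
        "prompt": "Summarize the attached notes in two sentences.",
        "stream": False,
    }).encode("utf-8")

    request = urllib.request.Request(
        "http://localhost:11434/api/generate",  # Ollama's default local endpoint
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        print(json.loads(response.read())["response"])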

If you want to read more about the hidden workforce behind the data training AI models:
  • Read “AI Is a Lot of Work” by Josh Dzieza, a Verge/New York Magazine feature about the millions of workers who annotate training data for AI companies. This article highlights the sheer amount of labor that goes into making tech seem human, profiling multiple workers who label datasets for hourly wages, often without knowing exactly what the data will be used for.

If you want to read about what the US Government sees as AI’s greatest risks and prospects:
  • Read the Biden administration’s recent executive order on the use and development of AI systems. Here is the full version, formatted to be more readable, and here is a much shorter, easier-to-digest fact sheet. The administration seems most concerned about the potential for AI to create dangerous biological materials and enable new cybersecurity and fraud threats; it also seeks to attract AI talent to the US.