Chapter 16: Data Science and Machine Learning with Python
Data science and machine learning are transformative fields where Python excels due to its extensive libraries and frameworks. This chapter introduces core libraries like numpy, pandas, and scikit-learn to process data, analyze it, and build predictive models.
Data Science with Python
Numpy: Numerical Computing
numpy is a library for efficient numerical computations, particularly with large datasets.
Key Features:
Multi-dimensional arrays (ndarray).
Mathematical operations.
Example:
import numpy as np
# Create an array
array = np.array([1, 2, 3, 4])
# Perform operations
print(array + 10) # Output: [11 12 13 14]
print(np.mean(array)) # Output: 2.5
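The two key features above, multi-dimensional arrays and element-wise math, can be sketched together on a small 2-D array. The values and variable names here are illustrative assumptions, not part of the original example.

```python
import numpy as np

# A 2-D ndarray with 2 rows and 3 columns (illustrative values)
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

print(matrix.shape)   # (rows, columns)
print(matrix * 2)     # element-wise multiplication, no explicit loop

# axis=0 aggregates down the rows (one result per column),
# axis=1 aggregates across the columns (one result per row)
col_means = matrix.mean(axis=0)   # column means: 2.5, 3.5, 4.5
row_sums = matrix.sum(axis=1)     # row sums: 6, 15
print(col_means)
print(row_sums)
```

The axis argument is the part that most often trips people up: it names the dimension that is collapsed, not the one that is kept.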
Pandas: Data Manipulation
pandas is a library for working with structured data using DataFrames.
Key Features:
Reading and writing data files (CSV, Excel).
Data cleaning and transformation.
Example:
import pandas as pd
# Create a DataFrame
data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}
df = pd.DataFrame(data)
# Access data
print(df['Name']) # Output: Series of names
# Filter data
filtered = df[df['Age'] > 25]
print(filtered)
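The "data cleaning and transformation" feature listed above might look like the following minimal sketch. The records, the choice to fill a missing age with the column mean, and the IsOver25 column are all assumptions made for illustration; the right strategy depends on the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical records with one missing age
raw = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Carol'],
    'Age': [25, np.nan, 35],
})

# Work on a copy so the raw data stays untouched
cleaned = raw.copy()

# Cleaning: replace the missing age with the mean of the known ages
cleaned['Age'] = cleaned['Age'].fillna(cleaned['Age'].mean())

# Transformation: derive a boolean column from an existing one
cleaned['IsOver25'] = cleaned['Age'] > 25
print(cleaned)
```

Dropping incomplete rows with dropna() is an equally valid alternative when the missing values are few and not informative.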
Matplotlib and Seaborn: Data Visualization
matplotlib: Basic plotting.
seaborn: Advanced statistical visualizations.
Example:
import matplotlib.pyplot as plt
import seaborn as sns
# Plot with matplotlib
plt.plot([1, 2, 3], [4, 5, 6])
plt.title("Basic Plot")
plt.show()
# Plot with seaborn
sns.barplot(x=['A', 'B', 'C'], y=[4, 5, 6])
plt.show()
Machine Learning with Python
Scikit-Learn: Core ML Library
scikit-learn is a comprehensive library for implementing machine learning models.
Steps in Machine Learning:
Load Data:
from sklearn.datasets import load_iris
data = load_iris()
print(data['feature_names'])
Preprocess Data (split into training and test sets):
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data['data'], data['target'], test_size=0.2)
Build Model:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(X_train, y_train)
Evaluate Model:
from sklearn.metrics import accuracy_score
predictions = model.predict(X_test)
print(accuracy_score(y_test, predictions))
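The four steps above can be combined into one runnable script. This is a sketch rather than a prescribed workflow: the random_state=42 seeds and the stratify argument are additions not present in the original steps, included so that repeated runs give the same split and the class balance is preserved in both sets.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 1. Load data
data = load_iris()

# 2. Split into training and test sets (seeded and stratified: assumptions)
X_train, X_test, y_train, y_test = train_test_split(
    data['data'], data['target'],
    test_size=0.2, random_state=42, stratify=data['target'],
)

# 3. Build and fit the model
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# 4. Evaluate on data the model has not seen
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")
```

Without a fixed seed, both the split and the forest are randomized, so the reported accuracy varies slightly from run to run.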
Hands-On Exercises
Exercise 1: Analyze a CSV File
Load a CSV file with pandas and display basic statistics.
Solution:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.describe())
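To try this solution without a file on disk, one option is to read the CSV from an in-memory string. The column names and values below are hypothetical stand-ins for the contents of data.csv.

```python
import io

import pandas as pd

# In-memory stand-in for 'data.csv' (hypothetical contents)
csv_text = "Name,Age,Score\nAlice,25,88.5\nBob,30,92.0\nCarol,35,79.5\n"
df = pd.read_csv(io.StringIO(csv_text))

# describe() summarizes numeric columns only by default,
# so the text column 'Name' is left out of the result
stats = df.describe()
print(stats)
print(df.dtypes)
```

Passing include='all' to describe() would also summarize non-numeric columns such as Name.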
Exercise 2: Train a Linear Regression Model
Use scikit-learn to predict housing prices.
Solution:
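from sklearn.linear_model import LinearRegression
# Assumes X_train holds housing features and y_train the matching prices,
# split with train_test_split as in the steps above
model = LinearRegression()
model.fit(X_train, y_train)
print(model.coef_) # One coefficient per feature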
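Because the exercise does not name a dataset, the following sketch generates synthetic housing data so the solution can be run end to end. The features (floor area and room count), the pricing rule, and the noise level are all made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic features (made-up data): floor area in square meters, room count
area = rng.uniform(50, 200, size=200)
rooms = rng.integers(1, 6, size=200)
X = np.column_stack([area, rooms])

# Made-up pricing rule plus random noise
y = 1000 * area + 5000 * rooms + rng.normal(0, 2000, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression()
model.fit(X_train, y_train)

print(model.coef_)           # learned price change per unit of each feature
r2 = model.score(X_test, y_test)   # R^2 on the held-out set
print(r2)
```

Since the data was generated from a known linear rule, the learned coefficient for area should land close to 1000, which makes this a convenient sanity check before moving to real data.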
Exercise 3: Visualize Data
Plot a histogram of ages using matplotlib.
Solution:
import matplotlib.pyplot as plt
# Assumes df is a DataFrame with a numeric 'Age' column
plt.hist(df['Age'], bins=10)
plt.title("Age Distribution")
plt.show()
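A self-contained variant of this solution is sketched below. The ages are hypothetical, and the non-interactive Agg backend and the output filename age_distribution.png are assumptions chosen so the script also runs where no display is available.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: render to a file, not a window

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical ages
df = pd.DataFrame({'Age': [22, 25, 27, 30, 31, 35, 38, 41, 45, 52]})

fig, ax = plt.subplots()
# hist() returns the per-bin counts, the bin edges, and the drawn patches
counts, edges, _ = ax.hist(df['Age'], bins=5)
ax.set_title("Age Distribution")
ax.set_xlabel("Age")
ax.set_ylabel("Count")
fig.savefig("age_distribution.png")
plt.close(fig)

print(counts)
print(edges)
```

The returned counts and edges are useful for checking a plot numerically: the counts always sum to the number of values plotted, and there is one more edge than there are bins.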
Best Practices
Data Cleaning: Handle missing values and outliers before analysis.
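Feature Scaling: Normalize or standardize features for models that are sensitive to feature scale, such as linear models, k-nearest neighbors, and support vector machines. Tree-based models like random forests generally do not need it.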
Validation: Use cross-validation to assess model performance.
Documentation: Annotate data analysis steps for reproducibility.
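The scaling and validation practices above can be sketched together with a scikit-learn pipeline. LogisticRegression is swapped in here as an example of a scale-sensitive model; it does not appear elsewhere in this chapter, and max_iter=1000 and cv=5 are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Putting the scaler inside the pipeline means it is re-fit on the
# training portion of every fold, so no test-fold statistics leak in
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross-validation: one accuracy score per fold
scores = cross_val_score(pipeline, X, y, cv=5)
print(scores)
print(scores.mean())
```

Scaling the whole dataset before splitting is a common mistake; wrapping the scaler and model in one pipeline avoids it without extra bookkeeping.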
In the next chapter, we will dive deeper into working with APIs, including consuming and building REST APIs with Python.