Chapter 16: Data Science and Machine Learning with Python

Python is widely used for data science and machine learning, largely because of its mature libraries and frameworks. This chapter introduces numpy, pandas, matplotlib, seaborn, and scikit-learn to process data, visualize and analyze it, and build predictive models.


Data Science with Python

Numpy: Numerical Computing

numpy is a library for efficient numerical computations, particularly with large datasets.

Key Features:

  • Multi-dimensional arrays (ndarray).

  • Mathematical operations.

Example:

import numpy as np

# Create an array
array = np.array([1, 2, 3, 4])

# Perform operations
print(array + 10)  # Output: [11 12 13 14]
print(np.mean(array))  # Output: 2.5
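
The multi-dimensional arrays listed under Key Features can be illustrated with a small two-dimensional example. This is a minimal sketch; the names `matrix` and `scaled` are chosen here for illustration only.

```python
import numpy as np

# A 2x3 array: two rows, three columns
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

print(matrix.shape)         # (2, 3)
print(matrix.mean(axis=0))  # column means: [2.5 3.5 4.5]
print(matrix.mean(axis=1))  # row means: [2. 5.]

# Broadcasting: the 1-D array is multiplied into every row
scaled = matrix * np.array([10, 100, 1000])
print(scaled)
```

The `axis` argument selects the dimension that is collapsed: `axis=0` averages down each column, `axis=1` averages across each row.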

Pandas: Data Manipulation

pandas is a library for working with structured data using DataFrames.

Key Features:

  • Reading and writing data files (CSV, Excel).

  • Data cleaning and transformation.

Example:

import pandas as pd

# Create a DataFrame
data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}
df = pd.DataFrame(data)

# Access data
print(df['Name'])  # Output: Series of names

# Filter data
filtered = df[df['Age'] > 25]
print(filtered)
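
The data cleaning and transformation feature can be sketched in the same way. The sample data below is hypothetical, and filling with the column mean is only one of several reasonable ways to handle a missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing age
raw = pd.DataFrame({'Name': ['Alice', 'Bob', 'Carol'],
                    'Age': [25, np.nan, 35]})

# Option 1: drop rows that contain missing values
dropped = raw.dropna()

# Option 2: fill the missing age with the column mean
cleaned = raw.copy()
cleaned['Age'] = cleaned['Age'].fillna(cleaned['Age'].mean())

# Transformation: derive a new boolean column from an existing one
cleaned['Over30'] = cleaned['Age'] > 30

print(dropped)
print(cleaned)
```

Dropping rows is the simpler choice but discards data; filling keeps every row at the cost of introducing an estimated value.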

Matplotlib and Seaborn: Data Visualization

  • matplotlib: Basic plotting.

  • seaborn: Advanced statistical visualizations.

Example:

import matplotlib.pyplot as plt
import seaborn as sns

# Plot with matplotlib
plt.plot([1, 2, 3], [4, 5, 6])
plt.title("Basic Plot")
plt.show()

# Plot with seaborn
sns.barplot(x=['A', 'B', 'C'], y=[4, 5, 6])
plt.show()

Machine Learning with Python

Scikit-Learn: Core ML Library

scikit-learn is a comprehensive library for implementing machine learning models.

Steps in Machine Learning:

  1. Load Data:

    from sklearn.datasets import load_iris
    
    data = load_iris()
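    print(data['feature_names'])

  2. Preprocess Data (split into training and test sets):

    from sklearn.model_selection import train_test_split
    
    # Hold out 20% of the samples for testing; random_state makes the split repeatable
    X_train, X_test, y_train, y_test = train_test_split(data['data'], data['target'], test_size=0.2, random_state=42)

  3. Build Model: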

    from sklearn.ensemble import RandomForestClassifier
    
    model = RandomForestClassifier()
    model.fit(X_train, y_train)

  4. Evaluate Model:

    from sklearn.metrics import accuracy_score
    
    predictions = model.predict(X_test)
    print(accuracy_score(y_test, predictions))
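
The four steps above can be combined into a single script. The following is a minimal sketch, not a definitive implementation: the function name `train_and_evaluate`, the `seed` parameter, and the use of `stratify` are assumptions added here so that repeated runs give the same result and each class is represented in the test set.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def train_and_evaluate(seed=42):
    """Load, split, fit, and evaluate; return accuracy on the test set."""
    data = load_iris()
    X_train, X_test, y_train, y_test = train_test_split(
        data['data'], data['target'], test_size=0.2,
        random_state=seed, stratify=data['target'])
    model = RandomForestClassifier(random_state=seed)
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


accuracy = train_and_evaluate()
print(f"Test accuracy: {accuracy:.2f}")
```

Fixing the seed in both the split and the model is a design choice for reproducibility; without it, the reported accuracy varies slightly from run to run.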

Hands-On Exercises

Exercise 1: Analyze a CSV File

Load a CSV file with pandas and display basic statistics.

Solution:

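import pandas as pd

df = pd.read_csv('data.csv')  # assumes data.csv is in the working directory
print(df.describe())  # summary statistics for numeric columns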
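
To try the exercise without a file on disk, the CSV content can be supplied from a string. This is a sketch under stated assumptions: the column names and values in `csv_text` are hypothetical, and an in-memory buffer stands in for `data.csv`.

```python
import io
import pandas as pd

# Hypothetical CSV content; an in-memory buffer stands in for data.csv
csv_text = "Name,Age,Salary\nAlice,25,50000\nBob,30,60000\nCarol,35,70000\n"

df = pd.read_csv(io.StringIO(csv_text))
stats = df.describe()  # numeric columns only by default

print(stats.loc['mean', 'Age'])    # 30.0
print(stats.loc['max', 'Salary'])  # 70000.0
```

The result of `describe()` is itself a DataFrame, so individual statistics can be read with `.loc` by row label and column name.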

Exercise 2: Train a Linear Regression Model

Use scikit-learn to predict housing prices.

Solution:

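from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# One possible housing dataset (an assumption; downloaded on first use)
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
print(model.coef_)  # one coefficient per feature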
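
Printing the coefficients shows what the model learned but not how well it predicts. A minimal sketch of evaluating a regression model on held-out data follows. It uses synthetic data from `make_regression` as a stand-in for housing features and prices (an assumption, chosen so the example runs without downloading a dataset), and the name `reg` is illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for housing features and prices (assumption)
X, y = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

reg = LinearRegression()
reg.fit(X_train, y_train)

# R^2 compares predictions with the true values on data the model has not seen
r2 = r2_score(y_test, reg.predict(X_test))
print(f"R^2 on held-out data: {r2:.3f}")
```

An R^2 close to 1 means the model explains most of the variance in the target; real housing data will typically score lower than this clean synthetic example.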

Exercise 3: Visualize Data

Plot a histogram of ages using matplotlib.

Solution:

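import matplotlib.pyplot as plt

# Assumes df is a DataFrame with an 'Age' column, such as the one created earlier
plt.hist(df['Age'], bins=10)
plt.title("Age Distribution")
plt.show()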

Best Practices

  1. Data Cleaning: Handle missing values and outliers before analysis.

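  2. Feature Scaling: Normalize or standardize features for models that are sensitive to feature scale, such as linear models, k-nearest neighbors, and support vector machines. Tree-based models such as random forests generally do not need it.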

  3. Validation: Use cross-validation to assess model performance rather than relying on a single train/test split.

  4. Documentation: Annotate data analysis steps for reproducibility.
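
Feature scaling and cross-validation from the list above can be combined in one short script. This is a sketch under assumptions: the Iris dataset, `LogisticRegression`, and five folds are illustrative choices, not requirements.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# The scaler sits inside the pipeline, so it is refit on each training fold
# and never sees the fold that is being used for validation
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scores = cross_val_score(pipeline, X, y, cv=5)
print(scores)         # one accuracy value per fold
print(scores.mean())  # average accuracy across folds
```

Placing the scaler in the pipeline, rather than scaling the full dataset up front, avoids leaking information from the validation folds into training.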

In the next chapter, we will dive deeper into working with APIs, including consuming and building REST APIs with Python.
