DO: Specify dtypes for better performance and memory usage, use chunksize for large files, set index_col when appropriate
❌ Common Mistake:
DON'T: Load entire large files into memory without chunking, ignore encoding issues (use encoding="utf-8"), forget to handle missing values with na_values parameter
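The reading advice above can be sketched as follows. This is a minimal illustration, assuming a small in-memory CSV in place of a large file on disk; the column names and the "?" missing-value marker are hypothetical.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a large file on disk.
csv_text = "id,city,price\n1,Oslo,10.5\n2,?,20.0\n3,Rome,?\n4,Oslo,7.25\n"

# Read two rows at a time, declaring dtypes and an extra missing-value marker up front.
chunks = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"id": "int32", "price": "float64"},
    na_values=["?"],
    chunksize=2,
)

# Aggregate per chunk so the whole file never has to sit in memory at once.
total_price = 0.0
row_count = 0
for chunk in chunks:
    total_price += chunk["price"].sum()  # sum() skips NaN values by default
    row_count += len(chunk)
```

When reading from a real path rather than a buffer, encoding="utf-8" would be passed in the same call.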
Create DataFrame
Create DataFrame from dictionary, list, or array
pd.DataFrame(data, columns=cols)
Tags: basic, creation, basics
Example:
df = pd.DataFrame({"A": [1,2,3], "B": [4,5,6]})
Quick Tip:
Can create from dict, list of lists, or numpy arrays
✅ Best Practice:
DO: Use dictionaries for column-oriented data, specify index when meaningful, set appropriate dtypes during creation
❌ Common Mistake:
DON'T: Grow a DataFrame row by row inside a loop (collect the pieces in a list and call pd.concat once), mix data types in one column without considering memory usage, forget to set meaningful column names
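A short sketch of the three input types mentioned above, plus the concat-once pattern. All names and values here are illustrative, not taken from a real dataset.

```python
import numpy as np
import pandas as pd

# From a dictionary: keys become column names.
df_dict = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# From a list of lists: column names are supplied separately.
df_rows = pd.DataFrame([[1, 4], [2, 5], [3, 6]], columns=["A", "B"])

# From a NumPy array, with an explicit dtype and a meaningful index.
df_array = pd.DataFrame(
    np.array([[1, 4], [2, 5], [3, 6]]),
    columns=["A", "B"],
    index=["x", "y", "z"],
    dtype="int32",
)

# Instead of growing a DataFrame inside a loop, collect the pieces and concatenate once.
pieces = [pd.DataFrame({"A": [i], "B": [i * 10]}) for i in range(3)]
combined = pd.concat(pieces, ignore_index=True)
```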
Preview Data
View first or last n rows of DataFrame
df.head(n) / df.tail(n)
Tags: basic, exploration, preview
Example:
df.head(10) # First 10 rows
df.tail(5) # Last 5 rows
Quick Tip:
Default n=5, use df.sample(n) for random rows
✅ Best Practice:
DO: Use head() for initial data inspection, combine with info() and describe() for comprehensive overview, use sample() for random inspection of large datasets
❌ Common Mistake:
DON'T: Rely only on head() for data understanding (may miss patterns), use large n values that clutter output, forget that head/tail may not be representative of entire dataset
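To see why head() and tail() alone can be unrepresentative, here is a small sketch on a made-up 100-row frame, comparing them with a random sample.

```python
import pandas as pd

# Illustrative frame: 100 rows with values 0..99.
df = pd.DataFrame({"value": range(100)})

first_rows = df.head()    # default n=5, always the top of the frame
last_rows = df.tail(3)    # the last 3 rows
random_rows = df.sample(n=5, random_state=0)  # random_state makes the draw repeatable
```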
Data Overview
Get DataFrame structure and statistical summary
df.info() / df.describe()
Tags: basic, exploration, analysis
Example:
df.info() # Data types, null counts
df.describe() # Statistical summary
Quick Tip:
Use df.describe(include="all") for all columns
✅ Best Practice:
DO: Use info() to check data types and missing values, describe(include="all") for comprehensive statistics, check memory usage with info(memory_usage="deep")
❌ Common Mistake:
DON'T: Skip data type verification before analysis, ignore missing value counts, use describe() without understanding what statistics are meaningful for your data
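A minimal sketch of the overview calls above on a tiny illustrative frame. info() normally prints to the console; capturing it in a buffer is only done here so the text can be inspected.

```python
import io
import pandas as pd

# Illustrative frame with one missing name.
df = pd.DataFrame({"name": ["Ann", "Bob", None], "age": [31, 45, 28]})

# info() prints to stdout by default; buf= captures the report as text instead.
buffer = io.StringIO()
df.info(buf=buffer, memory_usage="deep")
info_text = buffer.getvalue()

numeric_summary = df.describe()            # numeric columns only
full_summary = df.describe(include="all")  # also summarizes non-numeric columns
missing_per_column = df.isna().sum()       # explicit missing-value count per column
```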
DO: Use index=False unless index is meaningful, specify encoding explicitly, handle special characters with proper encoding
❌ Common Mistake:
DON'T: Save index when not needed, ignore encoding issues, forget to handle file paths with spaces or special characters
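The export advice above might look like this in practice. The file name (with a space) and the sample data are hypothetical, and a temporary directory is used so the sketch leaves nothing behind.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Illustrative frame containing a non-ASCII character.
df = pd.DataFrame({"name": ["Zoë", "Bob"], "age": [31, 45]})

with tempfile.TemporaryDirectory() as tmp:
    # pathlib builds the path safely even when the file name contains spaces.
    path = Path(tmp) / "my report.csv"

    # index=False drops the default 0..n-1 index; the encoding is stated explicitly.
    df.to_csv(path, index=False, encoding="utf-8")

    first_line = path.read_text(encoding="utf-8").splitlines()[0]
    round_trip = pd.read_csv(path, encoding="utf-8")
```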
2. Data Selection & Indexing
Select Columns
Select single column or multiple columns
df["column"] / df[["col1", "col2"]]
Tags: basic, selection, basics
Example:
name = df["name"] # Single column (Series)
subset = df[["name", "age"]] # Multiple columns (DataFrame)
Quick Tip:
Single brackets return Series, double brackets return DataFrame
✅ Best Practice:
DO: Use double brackets when you need DataFrame output, check if columns exist before selection, use .loc for more explicit selection
❌ Common Mistake:
DON'T: Confuse Series vs DataFrame output, select non-existent columns without error handling, use chained selection that can cause SettingWithCopyWarning
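A brief sketch of the Series versus DataFrame distinction and one way to guard against missing columns. The column names, including the absent "salary", are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob"], "age": [31, 45], "city": ["Oslo", "Rome"]})

name = df["name"]                    # single brackets -> Series
name_frame = df[["name"]]            # double brackets -> one-column DataFrame
subset = df.loc[:, ["name", "age"]]  # explicit .loc form of multi-column selection

# Keep only the requested columns that actually exist, avoiding a KeyError.
wanted = ["name", "salary"]
present = [col for col in wanted if col in df.columns]
safe_subset = df[present]
```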
Label & Position Indexing
Select data by labels (loc) or positions (iloc)
df.loc[rows, cols] / df.iloc[rows, cols]
Tags: intermediate, selection, indexing
Example:
df.loc[0:5, "name":"age"] # By labels
df.iloc[0:5, 1:3] # By positions
Quick Tip:
loc is inclusive of end, iloc is exclusive
✅ Best Practice:
DO: Use .loc for label-based selection, .iloc for position-based selection, understand inclusive vs exclusive behavior
❌ Common Mistake:
DON'T: Mix up loc and iloc behavior, use chained indexing instead of .loc/.iloc, forget that loc includes both endpoints while iloc excludes the end
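The inclusive versus exclusive difference is easiest to see by counting rows. The sketch below assumes an illustrative frame with the default integer index, so labels and positions happen to coincide.

```python
import pandas as pd

# Eight rows with the default RangeIndex (labels 0..7).
df = pd.DataFrame(
    {"name": list("abcdefgh"), "age": range(20, 28), "city": ["X"] * 8}
)

by_label = df.loc[0:5, "name":"age"]  # labels 0 through 5 inclusive -> 6 rows
by_position = df.iloc[0:5, 1:3]       # positions 0 through 4 -> 5 rows, columns 1 and 2
```

The same slice bounds return six rows with loc and five with iloc, which is a common source of off-by-one errors.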
Use & (and), | (or), ~ (not) for multiple conditions
✅ Best Practice:
DO: Use parentheses for complex conditions, combine conditions with & and |, use .query() for complex string-based filtering
❌ Common Mistake:
DON'T: Use "and"/"or" instead of "&"/"|" for DataFrame conditions, forget parentheses in complex conditions, chain boolean operations without parentheses
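A minimal sketch of combining conditions, using made-up age and salary values. The parentheses matter because & and | bind more tightly than comparison operators in Python.

```python
import pandas as pd

df = pd.DataFrame({"age": [22, 30, 41, 35], "salary": [40000, 55000, 70000, 58000]})

# Each comparison is wrapped in parentheses before combining.
both = df[(df["age"] > 25) & (df["salary"] < 60000)]    # and
either = df[(df["age"] > 40) | (df["salary"] < 45000)]  # or
negated = df[~(df["age"] > 25)]                         # not
```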
Query Method
Filter data using string expressions
df.query("condition")
Tags: intermediate, filtering, query
Example:
df.query("age > 25 and salary < 60000")
df.query("name.str.contains('John')", engine="python") # String methods; engine="python" is the safe choice here
Quick Tip:
More readable for complex conditions
✅ Best Practice:
DO: Use for complex conditions, reference external variables with @, combine with string methods for text filtering
❌ Common Mistake:
DON'T: Reach for query() on simple conditions (plain boolean indexing is usually clearer and avoids expression-parsing overhead on small DataFrames), forget @ for external variables, ignore performance implications for large datasets
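The sketch below compares query() with the equivalent boolean mask and shows the @ prefix for an external variable. The data and the min_age variable are hypothetical, and engine="python" is passed for the string-method call as a conservative assumption about engine support.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["John Doe", "Ann Lee", "Johnny Ray"],
        "age": [30, 24, 41],
        "salary": [50000, 45000, 65000],
    }
)

min_age = 25  # external Python variable, referenced with @ inside the expression

with_query = df.query("age > @min_age and salary < 60000")
with_mask = df[(df["age"] > min_age) & (df["salary"] < 60000)]  # same filter as a mask

# String methods inside query(); the python engine evaluates the .str call directly.
johns = df.query("name.str.contains('John')", engine="python")
```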