Python API reference

hhpy.main Module

hhpy.main.py

Contains basic calculation functions that are used by the more specialized modules of the package but can also be used on their own.

Functions

today(date_format) Returns today’s date as a string
size(byte, unit, dec) Formats bytes as a human-readable string
mem_usage(pandas_obj, *args, **kwargs) Get memory usage of a pandas object
tprint(*args, sep, **kwargs) Wrapper for print() but with a carriage return at the end.
fprint(*args, file, sep, mode, append_sep, …) Write the output of print to a file instead.
elapsed_time_init() Resets reference time for elapsed_time()
elapsed_time(do_return, ref_t) Get the elapsed time since reference time ref_time.
total_time(i, i_max) Estimates total time of running operation by linear extrapolation using iteration counters.
remaining_time(i, i_max) Estimates remaining time of running operation by linear extrapolation using iteration counters.
progressbar(i, i_max, symbol, empty_symbol, …) Prints a progressbar for the currently running process based on iteration counters.
time_to_str(t, time_format) Wrapper for strftime
cf_vec(x, func, *args, **kwargs) Pandas compatible vectorize function.
round_signif_i(x, digits) Round to significant number of digits
round_signif(x, *args, **kwargs) Round to significant number of digits
floor_signif(x, digits) Floor to significant number of digits
ceil_signif(x, digits) Ceil to significant number of digits
concat_cols(df, columns, sep, to_int) Concat a number of columns of a pandas DataFrame
list_unique(lst) Returns unique elements from a list
list_flatten(lst) Flatten a list of lists
list_merge(*args[, unique, flatten]) Merges n lists together
list_intersection(lst, *args) Returns common elements of n lists
list_exclude(lst, *args) Returns a list that includes only those elements from the first list that are not in any subsequent list.
rand(shape, lower, upper, step, seed) A seedable wrapper for numpy.random.random_sample that allows for boundaries and steps
dict_list(*args) Creates a dictionary of empty named lists.
append_to_dict_list(dct, append, inplace) Appends to a dictionary of named lists.
is_list_like(obj) Checks any python object to see if it is list like
force_list(*args) Takes any python object and turns it into an iterable list.
qformat(value, int_format, float_format, …) Creates a human readable representation of a generic python object
to_hdf(df, file, groupby, …) Saves a pandas DataFrame as an h5 file; if groupby is supplied, each group is saved under a different key.
get_hdf_keys(file) Reads all keys from an hdf file and returns them as a list
read_hdf(file, key, …) Reads a DataFrame from an hdf file
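
Two of the ideas listed above, rounding to significant digits and estimating remaining time by linear extrapolation over iteration counters, can be illustrated in a few lines. The following is a minimal standard-library sketch of the techniques only; it is not hhpy's implementation, and the names round_to_significant and estimate_remaining are hypothetical.

```python
import math


def round_to_significant(x: float, digits: int = 2) -> float:
    """Round x to `digits` significant digits (hypothetical helper, not hhpy code)."""
    if x == 0 or not math.isfinite(x):
        return x
    # Position of the leading digit: 3 for 1234.5, -3 for 0.001234
    magnitude = math.floor(math.log10(abs(x)))
    return round(x, digits - 1 - magnitude)


def estimate_remaining(elapsed: float, i: int, i_max: int) -> float:
    """Linearly extrapolate the remaining run time after i of i_max iterations."""
    if i <= 0:
        raise ValueError("at least one finished iteration is required")
    # Assumes every iteration takes roughly the same time
    total = elapsed * i_max / i
    return total - elapsed


print(round_to_significant(1234.5, 2))     # 1200.0
print(estimate_remaining(10.0, 25, 100))   # 30.0
```

The extrapolation is only as good as the assumption that iterations take roughly equal time; early iterations that include warm-up work will skew the estimate.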

hhpy.ds Module

hhpy.ds.py

Contains DataScience functions extending on pandas and sklearn

Functions

optimize_pd(df, c_int, c_float, c_cat, cat_frac) optimizes memory usage of a pandas DataFrame: automatically downcasts all variable types and converts objects to categories
get_df_corr(df, target, groupby) returns a pandas DataFrame containing all Pearson correlations in a melted format
drop_zero_cols(df) Drop columns containing only 0 or None values from a DataFrame.
get_duplicate_indices(df) Returns duplicate indices from a pandas DataFrame
get_duplicate_cols(df) Returns names of duplicate columns from a pandas DataFrame
drop_duplicate_indices(df) Drop duplicate indices from pandas DataFrame
drop_duplicate_cols(df) Drop duplicate columns from pandas DataFrame
change_span(s, steps) returns a True/False series around a changepoint, used for filtering stepwise data series in a pandas DataFrame. The data must be properly sorted.
outlier_to_nan(df, col, groupby, …) cuts off all points whose delta (average difference to the previous and next point) lies outside the n standard deviation range
butter_pass_filter(data, cutoff, fs, order, …) Implementation of a highpass / lowpass filter using scipy.signal.butter
pass_by_group(df, col, groupby, …) applies butter_pass_filter by group
lfit(x, y, w, …) quick linear fit with numpy
qf(df, fltr, …) quickly filters a DataFrame based on equality criteria.
quantile_split(s, n, signif, na_to_med) splits a numerical column into n quantiles.
acc(y_true, y_pred, df) calculates accuracy for a categorical label
rel_acc(y_true, y_pred, df, …) relative accuracy of the prediction compared to predicting everything as the most common group. Parameters: y_true (true values, as a column name of df or as vector data), y_pred (predicted values, as a column name of df or as vector data), df (pandas DataFrame containing true and predicted values, optional), target_class (name of the target class; by default the most common one is used, optional). Returns the accuracy difference as percent.
cm(y_true, y_pred, df) confusion matrix from a pandas DataFrame. Parameters: y_true (true values, as a column name of df or as vector data), y_pred (predicted values, as a column name of df or as vector data), df (pandas DataFrame containing true and predicted values, optional). Returns the confusion matrix as a pandas DataFrame.
f1_pr(y_true, y_pred, df, …) get f1 score, true positive, true negative, missed positive and missed negative rate
f_score(y_true, y_pred, df, …) generic scoring function based on pandas DataFrame.
r2(*args, **kwargs) wrapper for f_score using sklearn.metrics.r2_score
rmse(*args, **kwargs) wrapper for f_score using numpy.sqrt(sklearn.metrics.mean_squared_error)
mae(*args, **kwargs) wrapper for f_score using sklearn.metrics.mean_absolute_error
stdae(*args, **kwargs) wrapper for f_score using the standard deviation of the absolute error
medae(*args, **kwargs) wrapper for f_score using sklearn.metrics.median_absolute_error
corr(*args, **kwargs) wrapper for f_score using pandas.Series.corr
df_score(df, y_true, pred_suffix, …) creates a DataFrame displaying various kinds of scores
rmsd(x, df, group, return_df_paired, …) calculates the weighted root mean squared difference for a reference column x by a specific group
df_rmsd(x, df, groups, hue, …) calculates rmsd for reference column x with multiple other columns and returns it as a DataFrame
df_p(x, group, df, hue, agg_func, agg, …) returns a DataFrame with the p value.
df_split(df, split_by, return_type, …) Splits a pandas DataFrame by column value and returns a list or dict
mahalanobis(point, …) Calculates the Mahalanobis distance for a single point or a DataFrame of points
top_n(s, n, w) selects the n elements with the highest counts from a categorical pandas Series
top_n_coding(s, n, other_name, na_to_other, w) returns a modified version of the pandas Series where all elements not in top_n are recoded as ‘other’
k_split(df, k, groupby, sortby, …) splits a DataFrame into k (equally sized) parts that can be used for train test splitting or k_cross splitting
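
The top-n recoding described for top_n and top_n_coding can be illustrated without pandas. The sketch below works on a plain list and uses collections.Counter; it shows the general technique under that assumption and is not hhpy's implementation. The helper name top_n_recode and its defaults are hypothetical, and weights and NA handling are left out.

```python
from collections import Counter


def top_n_recode(values, n=2, other_name="other"):
    """Keep the n most frequent labels and recode the rest (hypothetical helper)."""
    counts = Counter(values)
    # Labels with the highest counts; ties are resolved in first-seen order
    keep = {label for label, _ in counts.most_common(n)}
    return [v if v in keep else other_name for v in values]


labels = ["a", "b", "a", "c", "b", "a", "d"]
print(top_n_recode(labels, n=2))
# ['a', 'b', 'a', 'other', 'b', 'a', 'other']
```

Recoding rare categories this way keeps the number of levels bounded, which is useful before one-hot encoding or plotting by category.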

hhpy.ipython Module

hhpy.ipython.py

Contains convenience wrappers for ipython

Functions

wide_notebook(width) makes the Jupyter notebook wider by appending HTML code to change the width
hide_code() hides the code and introduces a toggle button
display_full(*args, **kwargs) wrapper to display a pandas DataFrame with all rows and columns
pd_display(*args[, number_format, full]) wrapper to display a pandas DataFrame with a specified number format
display_df(df[, int_format, float_format, …]) Wrapper to display a pandas DataFrame with separate options for int / float, also adds an option to exclude columns
highlight_max(df, color) highlights the largest value in each column of a pandas DataFrame
highlight_min(df, color) highlights the smallest value in each column of a pandas DataFrame
highlight_max_min(df, max_color, min_color) highlights the largest and smallest value in each column of a pandas DataFrame

hhpy.modelling Module

hhpy.modelling.py

Contains a model class that is based on pandas DataFrames and wraps around sklearn and other frameworks to provide convenient train test functions.

Functions

dict_to_model(dic) restore a Model object from a dictionary
force_model(model) takes any Model, model object or dictionary and converts to Model
get_coefs(model, y) get coefficients of a linear regression in a sorted data frame
get_feature_importance(model, predictors, …) get feature importance of a decision tree like model in a sorted data frame

Classes

Model(model, name, X_ref, …) A unified modeling class that is extended from sklearn, accepts any model that implements .fit and .predict
Models(*args, name, df, X_ref, …) Collection of Models that allow for fitting and predicting with multiple Models at once, comparing accuracy and creating Ensembles
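
The phrase "accepts any model that implements .fit and .predict" describes duck typing: the wrapper relies on two methods rather than on a base class. The following standard-library sketch shows one way such a check can be expressed with typing.Protocol. It is an illustration under stated assumptions, not the Model class itself; FitPredict, MeanRegressor and fit_and_predict are hypothetical names.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class FitPredict(Protocol):
    """Structural type: anything with .fit and .predict qualifies."""

    def fit(self, X, y): ...

    def predict(self, X): ...


class MeanRegressor:
    """Toy model that always predicts the mean of the training targets."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def fit_and_predict(model, X_train, y_train, X_test):
    """Train any fit/predict-style model and return its predictions."""
    if not isinstance(model, FitPredict):
        raise TypeError("model must implement .fit and .predict")
    model.fit(X_train, y_train)
    return model.predict(X_test)


print(fit_and_predict(MeanRegressor(), [[1], [2], [3]], [1.0, 2.0, 3.0], [[4], [5]]))
# [2.0, 2.0]
```

A runtime-checkable protocol only verifies that the methods exist, not their signatures, so a wrapper built this way still needs to handle errors raised during fitting.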

Class Inheritance Diagram

Inheritance diagram of hhpy.modelling.Model, hhpy.modelling.Models

hhpy.plotting Module

hhpy.plotting.py

Contains plotting functions

Functions

heatmap(x, y, z, data, ax, cmap, agg_func, …) Wrapper for seaborn heatmap in x-y-z format
corrplot(data, annotations, number_format[, ax]) function to create a correlation plot using a seaborn heatmap based on: https://www.linkedin.com/pulse/generating-correlation-heatmaps-seaborn-python-andrew-holt
corrplot_bar(data, target, columns, …) correlation plot as barchart
pairwise_corrplot(data, corr_cutoff, …) prints a pairwise corrplot for all variables in the df, by default only plots those with a correlation coefficient of >= corr_cutoff
distplot(x, data, hue, hue_order, …) Similar to seaborn.distplot but supports hues and some other things.
hist_2d(x, y, data, bins, std_cutoff, …) generic 2d histogram created by splitting the 2d area into equal sized cells, counting data points in them and drawn using pyplot.pcolormesh
paired_plot(data, cols, color, cmap, alpha, …) create a facet grid to analyze various aspects of correlation between two variables using seaborn.PairGrid
q_plim(s, q_min, q_max, offset_perc, …[, …]) returns quick x limits for plotting (cut off data not in q_min to q_max quantile)
levelplot(data, level, cols, hue, …) Creates a plot for each specified column for each level of a given column, plus a summary plot
get_legends(ax) returns all legends on a given axis, useful if you have a secaxis
facet_wrap(func, data, facet, *args, …) modeled after R’s facet_wrap function.
get_subax(ax, row, col, …) shorthand to get around the fact that ax can be a 1D array or a 2D array (for subplots that can be 1x1,1xn,nx1)
ax_as_list(ax) takes any Axes and turns them into a list
ax_as_array(ax) takes any Axes and turns them into a numpy 2D array
rmsdplot(x, data, groups, hue, …) creates a seaborn.barplot showing the rmsd calculated using hhpy.ds.df_rmsd
insert_linebreak(s, pos, frac, max_breaks) used to insert linebreaks in strings, useful for formatting axes labels
ax_tick_linebreaks(ax, x, y, **kwargs) uses insert_linebreak to insert linebreaks into the axes ticklabels
annotate_barplot(ax, x, y, ci, ci_newline, …) automatically annotates a barplot with bar values and error bars (if present).
animplot(data, x, y, t, lines, …) wrapper for FuncAnimation to be used with pandas DataFrames.
legend_outside(ax, width, loc, legend_space, …) draws a legend outside of the subplot
set_ax_sym(ax, x, y) automatically sets the selected axes to be symmetrical
custom_legend(colors, labels[, …]) uses patches to create a custom legend with the specified colors
stemplot(x, y[, data, ax, color, baseline, …]) modeled after pyplot.stem but more customizable
get_twin(ax) get the twin axis from an Axes object
get_axlim(ax, xy) Wrapper function to get x limits, y limits or both with one function call
set_axlim(ax, lim, xy) Wrapper function to set both x and y limits with one call
share_xy(ax, x, y, mode, adj_twin_ax) set the subplots on the Axes to share x and/or y limits WITHOUT sharing x and y legends.
share_legend(ax, keep_i) removes all legends except for i from an Axes object
barplot_err(x, y, xerr, yerr, data, **kwargs) extension on seaborn barplot that allows for plotting errorbars with preprocessed data. The idea is based on a StackOverflow question.
countplot(x, data, hue, ax, …) Based on seaborn barplot but with a few more options
quantile_plot(x, data, qs, …) plots the specified quantiles of a Series using seaborn.barplot
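
Inserting linebreaks into long axis labels, as insert_linebreak and ax_tick_linebreaks are described as doing, can be approximated with the standard library alone. The sketch below swaps in textwrap.fill, which breaks at whitespace once a maximum line width is reached. It does not reproduce the pos, frac or max_breaks parameters listed above, whose exact semantics are not documented here, and wrap_label is a hypothetical name.

```python
import textwrap


def wrap_label(label: str, width: int = 14) -> str:
    """Break a label at whitespace so that lines stay within `width` characters."""
    return textwrap.fill(label, width=width)


print(repr(wrap_label("average daily temperature")))
# 'average daily\ntemperature'
```

The wrapped strings can then be passed to matplotlib as tick or axis labels, which renders "\n" as a line break.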