Python API reference

hhpy.main Module

hhpy.main.py

Contains basic calculation functions that are used in the more specialized versions of the package but can also be used on their own

Functions

today(date_format) Returns today’s date as string
size(byte, unit, dec) Formats bytes as human readable string
mem_usage(pandas_obj, *args, **kwargs) Get memory usage of a pandas object
tprint(*args, sep, r_loc, **kwargs) Wrapper for print() but with a carriage return at the end.
fprint(*args, file, sep, mode, append_sep, …) Write the output of print to a file instead.
elapsed_time_init() Resets reference time for elapsed_time()
elapsed_time(do_return, ref_t) Get the elapsed time since reference time ref_t.
total_time(i, i_max) Estimates total time of running operation by linear extrapolation using iteration counters.
remaining_time(i, i_max) Estimates remaining time of running operation by linear extrapolation using iteration counters.
progressbar(i, i_max, symbol, empty_symbol, …) Prints a progressbar for the currently running process based on iteration counters.
time_to_str(t, time_format) Wrapper for strftime
cf_vec(x, func, to_list, *args, **kwargs) Pandas compatible vectorize function.
round_signif_i(x, digits) Round to significant number of digits for a Scalar number
round_signif(x, *args, **kwargs) Round to significant number of digits
floor_signif(x, digits) Floor to significant number of digits
ceil_signif(x, digits) Ceil to significant number of digits
concat_cols(df, columns, sep, to_int) Concat a number of columns of a pandas DataFrame
list_unique(lst) Returns unique elements from a list (dropping duplicates)
list_duplicate(lst) Returns only duplicate elements from a list
list_flatten(lst) Flatten a list of lists
list_merge(*args, unique, flatten) Merges n lists together
list_intersection(lst, int, float, str, …) Returns common elements of n lists
list_exclude(lst, int, float, str, bytes, …) Returns a list that includes only those elements from the first list that are not in any subsequent list.
rand(shape, lower, upper, step, seed) A seedable wrapper for numpy.random.random_sample that allows for boundaries and steps
dict_list(*args, dict_type) Creates a dictionary of empty named lists.
append_to_dict_list(dct, …) Appends to a dictionary of named lists.
is_scalar(obj) Checks if a given python object is scalar.
is_list_like(obj) Checks if a given python object is list like.
assert_list(*args, default, int, float, str, …) Takes any python object(s) and turns them into an iterable list.
assert_tuple(*args, **kwargs) Takes any python object(s) and turns them into an iterable tuple.
assert_scalar(obj, warn, default, float, …) Takes any python object and turns it into a scalar object.
qformat(value, int_format, float_format, …) Creates a human readable representation of a generic python object
to_hdf(df, file, groupby, List[str]] = None, …) saves a pandas DataFrame as an h5 file; if groupby is supplied, each group is saved under a different key.
get_hdf_keys(file) Reads all keys from an hdf file and returns as list
read_hdf(file, key, List[str]] = None, …) read a DataFrame from an hdf file; based on pandas.read_hdf but with a default option to read all keys
rounddown(x, digits) convenience wrapper for np.floor with digits option
roundup(x, digits) convenience wrapper for np.ceil with digits option
reformat_string(string, case, replace, …) Function to quickly reformat a string to a specific convention.
dict_inv(dct, VT_co], key_as_str, duplicates) Returns an inverted copy of a given dictionary (if it is invertible)
copy_function(f) return a copy of a function, based on this StackOverflow answer https://stackoverflow.com/questions/13503079/how-to-create-a-copy-of-a-python-function
get_else_key(dct, VT_co], key, exclude, int, …) Returns a value from a dictionary if the key is present, if not returns the key
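
Two of the ideas listed above, rounding to significant digits (round_signif, floor_signif, ceil_signif) and estimating remaining time by linear extrapolation (total_time, remaining_time), can be illustrated with a short standard-library sketch. The function names and signatures below are hypothetical and are not taken from hhpy; they only show one plausible way to get the described behaviour.

```python
import math


def round_signif_sketch(x: float, digits: int = 3) -> float:
    """Round x to `digits` significant digits (hypothetical helper, not hhpy's code)."""
    if x == 0 or not math.isfinite(x):
        return x
    # Exponent of the leading digit, e.g. 3 for 1234.5 and -3 for 0.00123
    magnitude = math.floor(math.log10(abs(x)))
    # Shift the rounding position so that `digits` leading digits survive
    return round(x, digits - 1 - magnitude)


def remaining_time_sketch(elapsed: float, i: int, i_max: int) -> float:
    """Estimate the seconds left after i of i_max iterations by linear extrapolation."""
    if i <= 0:
        raise ValueError("at least one iteration must be finished")
    # Assumption: every iteration takes the same time on average
    total = elapsed * i_max / i
    return total - elapsed
```

For example, round_signif_sketch(1234.5, 2) keeps the two leading digits, and remaining_time_sketch(10.0, 2, 10) extrapolates from 2 finished iterations out of 10.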

Classes

BaseClass Base class for various classes deriving from this.

Class Inheritance Diagram

Inheritance diagram of hhpy.main.BaseClass

hhpy.ds Module

hhpy.ds.py

Contains DataScience functions extending on pandas and sklearn

Functions

assert_df(df, groupby, int, float, str, …) assert that input is a pandas DataFrame, raise ValueError if it cannot be cast to DataFrame
optimize_pd(df, c_int, c_float, c_cat, …) optimizes memory usage of a pandas df: automatically downcasts all var types and converts objects to categories
get_df_corr(df, columns, target, groupby, …) Calculate Pearson Correlations for numeric columns, extends on pandas.DataFrame.corr but automatically melts the output.
drop_zero_cols(df) Drop columns with all 0 or None Values from DataFrame.
get_duplicate_indices(df) Returns duplicate indices from a pandas DataFrame
get_duplicate_cols(df) Returns names of duplicate columns from a pandas DataFrame
drop_duplicate_indices(df, warn) Drop duplicate indices from pandas DataFrame
drop_duplicate_cols(df, warn) Drop duplicate columns from pandas DataFrame
change_span(s, steps) return a True/False series around a changepoint, used for filtering stepwise data series in a pandas df; the data must be properly sorted!
outlier_to_nan(df, col, groupby, …) this algorithm cuts off all points whose DELTA (avg diff to the prev and next point) is outside of the n std range
butter_pass_filter(data, cutoff, fs, order, …) Implementation of a highpass / lowpass filter using scipy.signal.butter
pass_by_group(df, col, groupby, list], …) allows applying a butter_pass filter by group
lfit(x, int, float, str, bytes, None, …) quick linear fit with numpy
rolling_lfit(x, int, float, str, bytes, …) Rolling version of lfit: for each row of the DataFrame / Series look at the previous window rows, then perform an lfit and use this value as a prediction for this row.
qf(df, fltr, pandas.core.series.Series, …) quickly filter a DataFrame based on equal criteria.
quantile_split(s, n, signif, na_to_med) splits a numerical column into n quantiles.
acc(y_true, str], y_pred, str], df) calculate accuracy for a categorical label
rel_acc(y_true, str], y_pred, str], df, …) relative accuracy of the prediction in comparison to predicting everything as the most common group. Parameters: y_true, true values as a column name of df or as vector data; y_pred, predicted values as a column name of df or as vector data; df, pandas DataFrame containing true and predicted values [optional]; target_class, name of the target class, by default the most common one is used [optional]. Returns the accuracy difference as percent.
cm(y_true, str], y_pred, str], df) confusion matrix from a pandas df. Parameters: y_true, true values as a column name of df or as vector data; y_pred, predicted values as a column name of df or as vector data; df, pandas DataFrame containing true and predicted values [optional]. Returns the confusion matrix as a pandas DataFrame.
f1_pr(y_true, str], y_pred, str], df, …) get f1 score, true positive, true negative, missed positive and missed negative rate
f_score(y_true, str], y_pred, str], df, …) generic scoring function based on a pandas DataFrame.
r2(*args, **kwargs) wrapper for f_score using sklearn.metrics.r2_score
rmse(*args, **kwargs) wrapper for f_score using numpy.sqrt(sklearn.metrics.mean_squared_error)
mae(*args, **kwargs) wrapper for f_score using sklearn.metrics.mean_absolute_error
stdae(*args, **kwargs) wrapper for f_score using the standard deviation of the absolute error
medae(*args, **kwargs) wrapper for f_score using sklearn.metrics.median_absolute_error
pae(*args, times_hundred, pmax, **kwargs) wrapper for f_score using percentage absolute error
corr(*args, **kwargs) wrapper for f_score using pandas.Series.corr
df_score(df, y_true, int, float, str, bytes, …) creates a DataFrame displaying various kind of scores
rmsd(x, df, group, return_df_paired, …) calculates the weighted root mean squared difference for a reference column x by a specific group.
df_rmsd(x, df, groups, str] = None, hue, …) calculate rmsd() for reference column x with multiple other columns and return as DataFrame.
df_p(x, group, df, hue, agg_func, agg, …) returns a DataFrame with the p value.
col_to_front(df, cols, float, str, bytes, …) Brings one or more columns to the front (first n positions) of a DataFrame
df_split(df, split_by, str], return_type, …) Split a pandas DataFrame by column value and returns a list or dict
rank(df, rankby, int, float, str, bytes, …) creates a ranking (without duplicate ranks) based on columns of a DataFrame
mahalanobis(point, …) Calculates the Mahalanobis distance for a single point or a DataFrame of points
df_count(x, df, hue, sort_by_count, top_nr, …) Create a DataFrame of value counts.
top_n(s, n, str], w, n_max) Select n elements from a categorical pandas series with the highest counts. Ties are broken by sorting
top_n_coding(s, n, other_name, na_to_other, …) Returns a modified version of the pandas series where all elements not in top_n become recoded as ‘other’
k_split(df, k, groupby, str] = None, sortby, …) Splits a DataFrame into k (equal sized) parts that can be used for train test splitting or k_cross splitting
remove_unused_categories(df, inplace) Remove unused categories from all categorical columns in the DataFrame
read_csv(path, nrows, encoding, errors, …) wrapper for pandas.read_csv that reads the file into an IOString first.
get_columns(df, dtype, int, float, str, …) A quick way to get the columns of a certain dtype.
reformat_columns(df, printf, **kwargs) A quick way to clean the column names of a DataFrame
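
The relationship between acc and rel_acc described above (accuracy compared against always predicting the most common group) can be sketched without pandas. The helper names are hypothetical, the inputs are plain sequences rather than DataFrame columns, and returning the difference as a fraction instead of a scaled percentage is an assumption of this sketch, not a statement about hhpy.

```python
from collections import Counter


def acc_sketch(y_true, y_pred) -> float:
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    hits = sum(t == p for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


def rel_acc_sketch(y_true, y_pred, target_class=None) -> float:
    """Accuracy gain over a baseline that always predicts one class."""
    if target_class is None:
        # Default baseline: the most common true label
        target_class = Counter(y_true).most_common(1)[0][0]
    baseline = acc_sketch(y_true, [target_class] * len(y_true))
    return acc_sketch(y_true, y_pred) - baseline
```

A value of zero means the predictions are no better than the constant baseline; a negative value means they are worse.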

Classes

DFMapping(df, dict, str] = None, **kwargs) Mapping object bound to a pandas DataFrame that standardizes column names and values according to the chosen conventions.

Class Inheritance Diagram

Inheritance diagram of hhpy.ds.DFMapping

hhpy.ipython Module

hhpy.ipython.py

Contains convenience wrappers for ipython

Functions

wide_notebook(width) makes the jupyter notebook wider by appending html code to change the width
hide_code() hides the code and introduces a toggle button
display_full(*args[, rows, cols]) wrapper to display a pandas DataFrame with all rows and columns
pd_display(*args[, number_format, full]) wrapper to display a pandas DataFrame with a specified number format
display_df(df[, int_format, float_format, …]) Wrapper to display a pandas DataFrame with separate options for int / float, also adds an option to exclude columns
highlight_max(df, color) highlights the largest value in each column of a pandas DataFrame
highlight_min(df, color) highlights the smallest value in each column of a pandas DataFrame
highlight_max_min(df, max_color, min_color) highlights the largest and smallest value in each column of a pandas DataFrame
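
The approach described for wide_notebook, appending HTML to change the notebook width, is commonly done by injecting a small style rule. The sketch below is an assumption about how such a helper could look: the function names are hypothetical, and the `.container` CSS selector is the one typically used for the classic Jupyter Notebook interface, so it may have no effect in other front ends.

```python
def notebook_width_style(width: int = 90) -> str:
    """Build a <style> snippet that sets the notebook container width in percent."""
    if not 0 < width <= 100:
        raise ValueError("width must be a percentage in (0, 100]")
    return f"<style>.container {{ width: {width}% !important; }}</style>"


def widen_notebook_sketch(width: int = 90) -> str:
    """Display the style snippet if IPython is available and return it either way."""
    style = notebook_width_style(width)
    try:
        # IPython.display.HTML and display are part of IPython's public API
        from IPython.display import HTML, display
    except ImportError:
        return style
    display(HTML(style))
    return style
```

Keeping the string construction separate from the display call makes the snippet easy to inspect outside a notebook.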

hhpy.modelling Module

hhpy.modelling.py

Contains a model class that is based on pandas DataFrames and wraps around sklearn and other frameworks to provide convenient train test functions.

Functions

assert_array(a, return_name, name_default) Take any python object and turn it into a 2d numpy array (if possible).
dict_to_model(dic, VT_co]) restore a Model object from a dictionary
assert_model(model) takes any Model, model object or dictionary and converts to Model
get_coefs(model, y, int, float, str, bytes, …) get coefficients of a linear regression in a sorted data frame
get_feature_importance(model, predictors, …) get feature importance of a decision tree like model in a sorted data frame
to_keras_3d(x, numpy.ndarray], window, y, …) reformat a DataFrame / 2D array to become a keras compatible 3D array.
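
The reshaping that to_keras_3d describes, turning a 2D array into a 3D array using a window, can be illustrated with sliding windows in NumPy. This is a sketch under assumptions: the helper name is hypothetical, and hhpy may handle padding, targets (y) and DataFrames differently. The output layout (samples, timesteps, features) is the one Keras recurrent layers expect.

```python
import numpy as np


def to_windows_3d(x, window: int) -> np.ndarray:
    """Stack overlapping windows of rows into shape (samples, window, features)."""
    x = np.asarray(x)
    if x.ndim != 2:
        raise ValueError("x must be 2D: (rows, features)")
    n_rows = x.shape[0]
    if not 1 <= window <= n_rows:
        raise ValueError("window must be between 1 and the number of rows")
    # One sample per start row; the last window ends at the last row
    return np.stack([x[i:i + window] for i in range(n_rows - window + 1)])
```

With 5 rows and a window of 3 this yields 3 samples, each holding 3 consecutive rows.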

Classes

Model(model, name, X_ref, int, float, str, …) A unified modeling class that is extended from sklearn, accepts any model that implements .fit and .predict
Models(*args, df, X_ref, int, float, str, …) Collection of Models that allow for fitting and predicting with multiple Models at once, comparing accuracy and creating Ensembles
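
The statement that Model accepts any model implementing .fit and .predict describes duck typing. The sketch below shows that idea with a hypothetical wrapper class and a toy estimator; neither is part of hhpy, and the real Model class takes more arguments than shown here.

```python
class MeanRegressor:
    """Toy estimator that always predicts the mean of the training targets."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_] * len(X)


class ModelWrapperSketch:
    """Hypothetical wrapper: accepts any object with callable fit and predict."""

    def __init__(self, model, name=None):
        for method in ("fit", "predict"):
            if not callable(getattr(model, method, None)):
                raise TypeError(f"model must implement .{method}()")
        self.model = model
        # Fall back to the class name when no explicit name is given
        self.name = name or type(model).__name__

    def fit(self, X, y):
        self.model.fit(X, y)
        return self

    def predict(self, X):
        return self.model.predict(X)
```

Because the check is on behaviour rather than on inheritance, estimators from sklearn or any other framework with the same two methods would pass it as well.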

Class Inheritance Diagram

Inheritance diagram of hhpy.modelling.Model, hhpy.modelling.Models

hhpy.plotting Module

hhpy.plotting.py

Contains plotting functions using matplotlib.pyplot

Functions

heatmap(x, y, z, data, ax, cmap, agg_func, …) Wrapper for seaborn heatmap in x-y-z format
corrplot(data, annotations, number_format[, ax]) function to create a correlation plot using a seaborn heatmap based on: https://www.linkedin.com/pulse/generating-correlation-heatmaps-seaborn-python-andrew-holt
corrplot_bar(data, target, columns, …) Correlation plot as barchart based on get_df_corr()
pairwise_corrplot(data, corr_cutoff, …) print a pairwise_corrplot for all variables in the df, by default only plots those with a correlation coefficient of >= corr_cutoff
distplot(x, str], data, hue, hue_order, …) Similar to seaborn.distplot but supports hues and some other things.
hist_2d(x, y, data, bins, std_cutoff, …) generic 2d histogram created by splitting the 2d area into equal sized cells, counting data points in them and drawn using pyplot.pcolormesh
paired_plot(data, cols, color, cmap, alpha, …) create a facet grid to analyze various aspects of correlation between two variables using seaborn.PairGrid
q_plim(s, q_min, q_max, offset_perc, …[, …]) returns quick x limits for plotting (cut off data not in q_min to q_max quantile)
levelplot(data, level, cols, str], hue, …) Plots a plot for each specified column for each level of a certain column plus a summary plot
get_legends(ax) returns all legends on a given axis, useful if you have a secaxis
facet_wrap(func, data, facet, str], *args, …) modeled after R’s facet_wrap function.
get_subax(ax, numpy.ndarray], row, col, …) shorthand to get around the fact that ax can be a 1D array or a 2D array (for subplots that can be 1x1,1xn,nx1)
ax_as_list(ax, numpy.ndarray]) takes any Axes and turns them into a list
ax_as_array(ax, numpy.ndarray]) takes any Axes and turns them into a numpy 2D array
rmsdplot(x, data, groups, str] = None, hue, …) creates a seaborn.barplot showing the rmsd calculated by df_rmsd()
insert_linebreak(s, pos, frac, max_breaks) used to insert linebreaks in strings, useful for formatting axes labels
ax_tick_linebreaks(ax, x, y, **kwargs) uses insert_linebreak to insert linebreaks into the axes ticklabels
annotate_barplot(ax, x, y, ci, ci_newline, …) automatically annotates a barplot with bar values and error bars (if present).
animplot(data, x, y, t, lines, …) wrapper for FuncAnimation to be used with pandas DataFrames.
legend_outside(ax, width, loc, legend_space, …) draws a legend outside of the subplot
set_ax_sym(ax, x, y) automatically sets the select axes to be symmetrical
custom_legend(colors, str], labels, str][, …]) uses patches to create a custom legend with the specified colors
stemplot(x, y[, data, ax, color, baseline, …]) modeled after pyplot.stem but more customizable
get_twin(ax) get the twin axis from an Axes object
get_axlim(ax, xy) Wrapper function to get x limits, y limits or both with one function call
set_axlim(ax, lim, Mapping[KT, VT_co]], xy) Wrapper function to set both x and y limits with one call
share_xy(ax, x, y, mode, adj_twin_ax) set the subplots on the Axes to share x and/or y limits WITHOUT sharing x and y legends.
share_legend(ax, keep_i) removes all legends except for the one at index keep_i from an Axes object
barplot_err(x, y, xerr, yerr, data, **kwargs) extension on seaborn barplot that allows for plotting errorbars with preprocessed data. The idea is based on this StackOverflow question.
countplot(x, str] = None, data, hue, ax, …) Based on seaborn barplot but with a few more options, uses df_count()
quantile_plot(x, str], data, qs, …) plots the specified quantiles of a Series using seaborn.barplot
plotly_aggplot(data, x, float, str, bytes, …) create a (grouped) plotly aggplot that lets you select the groupby categories
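
The idea behind insert_linebreak, breaking long strings so that axis labels stay compact, can be sketched with plain string handling. The helper below is hypothetical and simplified: it inserts a single break at the space nearest to a fractional position, whereas the listed function also takes pos and max_breaks, whose exact semantics are not documented here.

```python
def break_label(s: str, frac: float = 0.5) -> str:
    """Replace the space closest to position frac * len(s) with a newline."""
    spaces = [i for i, ch in enumerate(s) if ch == " "]
    if not spaces:
        # Nothing to break on, return the label unchanged
        return s
    target = frac * len(s)
    cut = min(spaces, key=lambda i: abs(i - target))
    return s[:cut] + "\n" + s[cut + 1:]
```

The resulting strings can be passed to matplotlib as tick or axis labels, since matplotlib renders "\n" as a line break in text.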