Python API reference
hhpy.main Module
hhpy.main.py
Contains basic calculation functions that are used in the more specialized versions of the package but can also be used on their own.
Functions
today (date_format) |
Returns today’s date as string |
size (byte, unit, dec) |
Formats bytes as human readable string |
mem_usage (pandas_obj, *args, **kwargs) |
Get memory usage of a pandas object |
tprint (*args, sep, r_loc, **kwargs) |
Wrapper for print() but with a carriage return at the end. |
fprint (*args, file, sep, mode, append_sep, …) |
Write the output of print to a file instead. |
elapsed_time_init () |
Resets reference time for elapsed_time() |
elapsed_time (do_return, ref_t) |
Get the elapsed time since reference time ref_t. |
total_time (i, i_max) |
Estimates total time of running operation by linear extrapolation using iteration counters. |
remaining_time (i, i_max) |
Estimates remaining time of running operation by linear extrapolation using iteration counters. |
progressbar (i, i_max, symbol, empty_symbol, …) |
Prints a progressbar for the currently running process based on iteration counters. |
time_to_str (t, time_format) |
Wrapper for strftime |
cf_vec (x, func, to_list, *args, **kwargs) |
Pandas compatible vectorize function. |
round_signif_i (x, digits) |
Round to significant number of digits for a Scalar number |
round_signif (x, *args, **kwargs) |
Round to significant number of digits |
floor_signif (x, digits) |
Floor to significant number of digits |
ceil_signif (x, digits) |
Ceil to significant number of digits |
concat_cols (df, columns, sep, to_int) |
Concat a number of columns of a pandas DataFrame |
list_unique (lst) |
Returns unique elements from a list (dropping duplicates) |
list_duplicate (lst) |
Returns only duplicate elements from a list |
list_flatten (lst) |
Flatten a list of lists |
list_merge (*args, unique, flatten) |
Merges n lists together |
list_intersection (lst, int, float, str, …) |
Returns common elements of n lists |
list_exclude (lst, int, float, str, bytes, …) |
Returns a list that includes only those elements from the first list that are not in any subsequent list. |
rand (shape, lower, upper, step, seed) |
A seedable wrapper for numpy.random.random_sample that allows for boundaries and steps |
dict_list (*args, dict_type) |
Creates a dictionary of empty named lists. |
append_to_dict_list (dct, …) |
Appends to a dictionary of named lists. |
is_scalar (obj) |
Checks if a given python object is scalar, i.e. |
is_list_like (obj) |
Checks if a given python object is list like. |
assert_list (*args, default, int, float, str, …) |
Takes any python object(s) and turns them into an iterable list. |
assert_tuple (*args, **kwargs) |
Takes any python object(s) and turns them into an iterable tuple. |
assert_scalar (obj, warn, default, float, …) |
Takes any python object and turns it into a scalar object. |
qformat (value, int_format, float_format, …) |
Creates a human readable representation of a generic python object |
to_hdf (df, file, groupby, List[str]] = None, …) |
Saves a pandas DataFrame as an h5 file. If groupby is supplied, each group is saved under a different key. |
get_hdf_keys (file) |
Reads all keys from an hdf file and returns as list |
read_hdf (file, key, List[str]] = None, …) |
read a DataFrame from hdf file based on pandas.read_hdf but with default option to read all keys (since we’re |
rounddown (x, digits) |
convenience wrapper for np.floor with digits option |
roundup (x, digits) |
convenience wrapper for np.ceil with digits option |
reformat_string (string, case, replace, …) |
Function to quickly reformat a string to a specific convention. |
dict_inv (dct, VT_co], key_as_str, duplicates) |
Returns an inverted copy of a given dictionary (if it is invertible) |
copy_function (f) |
return a copy of a function, based on this StackOverflow answer https://stackoverflow.com/questions/13503079/how-to-create-a-copy-of-a-python-function |
get_else_key (dct, VT_co], key, exclude, int, …) |
Returns a value from a dictionary if the key is present, if not returns the key |
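The summaries above name the techniques but do not show them on concrete input. The following is a minimal pure-Python sketch of two of the described ideas: rounding to significant digits (as in round_signif_i and floor_signif) and estimating remaining time by linear extrapolation (as in remaining_time). It does not call hhpy. The names `round_signif_sketch`, `floor_signif_sketch` and `remaining_time_sketch`, their default arguments, the explicit `elapsed` argument and the edge-case handling are assumptions made for illustration and may differ from the package.

```python
import math


def _magnitude(x: float) -> int:
    """Decimal exponent of the leading digit, e.g. 1234.5 -> 3, 0.012 -> -2."""
    return math.floor(math.log10(abs(x)))


def round_signif_sketch(x: float, digits: int = 1) -> float:
    """Round a scalar to `digits` significant digits (hypothetical helper)."""
    if x == 0:
        return 0.0
    return float(round(x, digits - 1 - _magnitude(x)))


def floor_signif_sketch(x: float, digits: int = 1) -> float:
    """Floor a scalar to `digits` significant digits (hypothetical helper)."""
    if x == 0:
        return 0.0
    exponent = digits - 1 - _magnitude(x)
    if exponent >= 0:
        factor = 10 ** exponent
        return math.floor(x * factor) / factor
    # Negative exponent: work with an integer factor to limit float error.
    factor = 10 ** (-exponent)
    return float(math.floor(x / factor) * factor)


def remaining_time_sketch(elapsed: float, i: int, i_max: int) -> float:
    """Linear extrapolation: i of i_max iterations took `elapsed` seconds."""
    if i <= 0:
        raise ValueError("need at least one finished iteration")
    total = elapsed * i_max / i  # estimated total run time
    return total - elapsed


print(round_signif_sketch(1234.5, 2))
print(floor_signif_sketch(1299, 2))
print(remaining_time_sketch(10.0, i=2, i_max=10))
```

In the package itself the elapsed time is taken from an internal reference time (see elapsed_time_init), so the explicit `elapsed` argument here only keeps the sketch self-contained.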
hhpy.ds Module
hhpy.ds.py
Contains DataScience functions extending on pandas and sklearn
Functions
assert_df (df, groupby, int, float, str, …) |
assert that input is a pandas DataFrame, raise ValueError if it cannot be cast to DataFrame |
optimize_pd (df, c_int, c_float, c_cat, …) |
optimizes memory usage of a pandas df: automatically downcasts all var types and converts objects to categories |
get_df_corr (df, columns, target, groupby, …) |
Calculate Pearson Correlations for numeric columns, extends on pandas.DataFrame.corr but automatically melts the output. |
drop_zero_cols (df) |
Drop columns with all 0 or None Values from DataFrame. |
get_duplicate_indices (df) |
Returns duplicate indices from a pandas DataFrame |
get_duplicate_cols (df) |
Returns names of duplicate columns from a pandas DataFrame |
drop_duplicate_indices (df, warn) |
Drop duplicate indices from pandas DataFrame |
drop_duplicate_cols (df, warn) |
Drop duplicate columns from pandas DataFrame |
change_span (s, steps) |
returns a True/False series around a changepoint, used for filtering stepwise data series in a pandas df. The data must be properly sorted! |
outlier_to_nan (df, col, groupby, …) |
this algorithm cuts off all points whose DELTA (avg diff to the prev and next point) is outside of the n std range |
butter_pass_filter (data, cutoff, fs, order, …) |
Implementation of a highpass / lowpass filter using scipy.signal.butter |
pass_by_group (df, col, groupby, list], …) |
allows applying a butter_pass filter by group |
lfit (x, int, float, str, bytes, None, …) |
quick linear fit with numpy |
rolling_lfit (x, int, float, str, bytes, …) |
Rolling version of lfit: for each row of the DataFrame / Series look at the previous window rows, then perform an lfit and use this value as a prediction for this row. |
qf (df, fltr, pandas.core.series.Series, …) |
quickly filter a DataFrame based on equal criteria. |
quantile_split (s, n, signif, na_to_med) |
splits a numerical column into n quantiles. |
acc (y_true, str], y_pred, str], df) |
calculate accuracy for a categorical label |
rel_acc (y_true, str], y_pred, str], df, …) |
relative accuracy of the prediction in comparison to predicting everything as the most common group. Parameters: y_true (true values as name of df or vector data), y_pred (predicted values as name of df or vector data), df (pandas DataFrame containing true and predicted values, optional), target_class (name of the target class, by default the most common one is used, optional). Returns the accuracy difference as percent. |
cm (y_true, str], y_pred, str], df) |
confusion matrix from pandas df. Parameters: y_true (true values as name of df or vector data), y_pred (predicted values as name of df or vector data), df (pandas DataFrame containing true and predicted values, optional). Returns the confusion matrix as pandas DataFrame. |
f1_pr (y_true, str], y_pred, str], df, …) |
get f1 score, true positive, true negative, missed positive and missed negative rate |
f_score (y_true, str], y_pred, str], df, …) |
generic scoring function based on pandas DataFrame. |
r2 (*args, **kwargs) |
wrapper for f_score using sklearn.metrics.r2_score |
rmse (*args, **kwargs) |
wrapper for f_score using numpy.sqrt(sklearn.metrics.mean_squared_error) |
mae (*args, **kwargs) |
wrapper for f_score using sklearn.metrics.mean_absolute_error |
stdae (*args, **kwargs) |
wrapper for f_score using the standard deviation of the absolute error |
medae (*args, **kwargs) |
wrapper for f_score using sklearn.metrics.median_absolute_error |
pae (*args, times_hundred, pmax, **kwargs) |
wrapper for f_score using percentage absolute error |
corr (*args, **kwargs) |
wrapper for f_score using pandas.Series.corr |
df_score (df, y_true, int, float, str, bytes, …) |
creates a DataFrame displaying various kind of scores |
rmsd (x, df, group, return_df_paired, …) |
calculates the weighted root mean squared difference for a reference column x by a specific group. |
df_rmsd (x, df, groups, str] = None, hue, …) |
calculate rmsd() for reference column x with multiple other columns and return as DataFrame. |
df_p (x, group, df, hue, agg_func, agg, …) |
returns a DataFrame with the p value. |
col_to_front (df, cols, float, str, bytes, …) |
Brings one or more columns to the front (first n positions) of a DataFrame |
df_split (df, split_by, str], return_type, …) |
Split a pandas DataFrame by column value and returns a list or dict |
rank (df, rankby, int, float, str, bytes, …) |
creates a ranking (without duplicate ranks) based on columns of a DataFrame |
mahalanobis (point, …) |
Calculates the Mahalanobis distance for a single point or a DataFrame of points |
df_count (x, df, hue, sort_by_count, top_nr, …) |
Create a DataFrame of value counts. |
top_n (s, n, str], w, n_max) |
Select n elements from a categorical pandas series with the highest counts. Ties are broken by sorting |
top_n_coding (s, n, other_name, na_to_other, …) |
Returns a modified version of the pandas series where all elements not in top_n become recoded as ‘other’ |
k_split (df, k, groupby, str] = None, sortby, …) |
Splits a DataFrame into k (equal sized) parts that can be used for train test splitting or k_cross splitting |
remove_unused_categories (df, inplace) |
Remove unused categories from all categorical columns in the DataFrame |
read_csv (path, nrows, encoding, errors, …) |
wrapper for pandas.read_csv that reads the file into an IOString first. |
get_columns (df, dtype, int, float, str, …) |
A quick way to get the columns of a certain dtype. |
reformat_columns (df, printf, **kwargs) |
A quick way to clean the column names of a DataFrame |
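Several of the scoring functions listed above are described only by the metric they wrap. As a rough illustration of what those metrics compute, here is a minimal pure-Python sketch on plain lists. It does not use hhpy, pandas or sklearn. The `*_sketch` names are hypothetical, and the reading of rel_acc as "accuracy minus the accuracy of always predicting the most common label, in percentage points" is an assumption based on the summary text.

```python
import math
from collections import Counter
from typing import Sequence


def acc_sketch(y_true: Sequence, y_pred: Sequence) -> float:
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def rel_acc_sketch(y_true: Sequence, y_pred: Sequence) -> float:
    """Accuracy gain over always predicting the most common true label.

    Assumption: the result is expressed in percentage points.
    """
    most_common_label, _ = Counter(y_true).most_common(1)[0]
    baseline = acc_sketch(y_true, [most_common_label] * len(y_true))
    return 100 * (acc_sketch(y_true, y_pred) - baseline)


def rmse_sketch(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mae_sketch(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


labels = ["a", "a", "a", "b"]
print(acc_sketch(labels, ["a", "a", "a", "b"]))      # perfect prediction
print(rel_acc_sketch(labels, ["a", "a", "a", "b"]))  # gain over baseline "a"
print(rmse_sketch([1, 2], [4, 6]), mae_sketch([1, 2], [4, 6]))
```

A relative accuracy near zero means the predictions are no better than the trivial "always the majority class" baseline, which is why it can be more informative than plain accuracy on imbalanced labels.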
Classes
DFMapping (df, dict, str] = None, **kwargs) |
Mapping object bound to a pandas DataFrame that standardizes column names and values according to the chosen conventions. |
hhpy.ipython Module
hhpy.ipython.py
Contains convenience wrappers for ipython
Functions
wide_notebook (width) |
makes the jupyter notebook wider by appending html code to change the width, |
hide_code () |
hides the code and introduces a toggle button |
display_full (*args[, rows, cols]) |
wrapper to display a pandas DataFrame with all rows and columns |
pd_display (*args[, number_format, full]) |
wrapper to display a pandas DataFrame with a specified number format |
display_df (df[, int_format, float_format, …]) |
Wrapper to display a pandas DataFrame with separate options for int / float, also adds an option to exclude columns |
highlight_max (df, color) |
highlights the largest value in each column of a pandas DataFrame |
highlight_min (df, color) |
highlights the smallest value in each column of a pandas DataFrame |
highlight_max_min (df, max_color, min_color) |
highlights the largest and smallest value in each column of a pandas DataFrame |
hhpy.modelling Module
hhpy.modelling.py
Contains a model class that is based on pandas DataFrames and wraps around sklearn and other frameworks to provide convenient train test functions.
Functions
assert_array (a, return_name, name_default) |
Take any python object and turn it into a 2d numpy array (if possible). |
dict_to_model (dic, VT_co]) |
restore a Model object from a dictionary |
assert_model (model) |
takes any Model, model object or dictionary and converts to Model |
get_coefs (model, y, int, float, str, bytes, …) |
get coefficients of a linear regression in a sorted data frame |
get_feature_importance (model, predictors, …) |
get feature importance of a decision tree like model in a sorted data frame |
to_keras_3d (x, numpy.ndarray], window, y, …) |
reformat a DataFrame / 2D array to become a keras compatible 3D array. |
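The summary of to_keras_3d mentions turning a 2D table into a 3D array using a window. One common way to do this for sequence models is to stack every run of `window` consecutive rows into one sample, giving the shape (samples, time steps, features). The sketch below shows that idea with nested lists only. Whether hhpy uses exactly this windowing (for example, how it treats the first rows or the target y) is not stated in the source, so the function `to_windows_3d` and its behaviour are assumptions.

```python
from typing import List, Sequence


def to_windows_3d(
    rows: Sequence[Sequence[float]], window: int
) -> List[List[List[float]]]:
    """Stack every run of `window` consecutive rows into one sample.

    Hypothetical helper: the result has len(rows) - window + 1 samples,
    each holding `window` time steps with the original features per step.
    """
    if window < 1 or window > len(rows):
        raise ValueError("window must be between 1 and the number of rows")
    return [
        [list(row) for row in rows[start:start + window]]
        for start in range(len(rows) - window + 1)
    ]


data = [[1, 10], [2, 20], [3, 30], [4, 40]]  # 4 rows, 2 features
samples = to_windows_3d(data, window=2)

# Report the resulting (samples, time steps, features) shape.
print(len(samples), len(samples[0]), len(samples[0][0]))
```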
Classes
Model (model, name, X_ref, int, float, str, …) |
A unified modeling class that is extended from sklearn, accepts any model that implements .fit and .predict |
Models (*args, df, X_ref, int, float, str, …) |
Collection of Models that allow for fitting and predicting with multiple Models at once, comparing accuracy and creating Ensembles |
hhpy.plotting Module
hhpy.plotting.py
Contains plotting functions using matplotlib.pyplot
Functions
heatmap (x, y, z, data, ax, cmap, agg_func, …) |
Wrapper for seaborn heatmap in x-y-z format |
corrplot (data, annotations, number_format[, ax]) |
function to create a correlation plot using a seaborn heatmap based on: https://www.linkedin.com/pulse/generating-correlation-heatmaps-seaborn-python-andrew-holt |
corrplot_bar (data, target, columns, …) |
Correlation plot as barchart based on get_df_corr() |
pairwise_corrplot (data, corr_cutoff, …) |
plots a pairwise corrplot for all variables in the df, by default only plots those with a correlation coefficient of >= corr_cutoff |
distplot (x, str], data, hue, hue_order, …) |
Similar to seaborn.distplot but supports hues and some other things. |
hist_2d (x, y, data, bins, std_cutoff, …) |
generic 2d histogram created by splitting the 2d area into equal sized cells, counting data points in them and drawn using pyplot.pcolormesh |
paired_plot (data, cols, color, cmap, alpha, …) |
create a facet grid to analyze various aspects of correlation between two variables using seaborn.PairGrid |
q_plim (s, q_min, q_max, offset_perc, …[, …]) |
returns quick x limits for plotting (cut off data not in q_min to q_max quantile) |
levelplot (data, level, cols, str], hue, …) |
Plots a plot for each specified column for each level of a certain column plus a summary plot |
get_legends (ax) |
returns all legends on a given axis, useful if you have a secaxis |
facet_wrap (func, data, facet, str], *args, …) |
modeled after R’s facet_wrap function. |
get_subax (ax, numpy.ndarray], row, col, …) |
shorthand to get around the fact that ax can be a 1D array or a 2D array (for subplots that can be 1x1,1xn,nx1) |
ax_as_list (ax, numpy.ndarray]) |
takes any Axes and turns them into a list |
ax_as_array (ax, numpy.ndarray]) |
takes any Axes and turns them into a numpy 2D array |
rmsdplot (x, data, groups, str] = None, hue, …) |
creates a seaborn.barplot showing the rmsd calculated by df_rmsd() |
insert_linebreak (s, pos, frac, max_breaks) |
used to insert linebreaks in strings, useful for formatting axes labels |
ax_tick_linebreaks (ax, x, y, **kwargs) |
uses insert_linebreak() to insert linebreaks into the axes ticklabels |
annotate_barplot (ax, x, y, ci, ci_newline, …) |
automatically annotates a barplot with bar values and error bars (if present). |
animplot (data, x, y, t, lines, …) |
wrapper for FuncAnimation to be used with pandas DataFrames. |
legend_outside (ax, width, loc, legend_space, …) |
draws a legend outside of the subplot |
set_ax_sym (ax, x, y) |
automatically sets the select axes to be symmetrical |
custom_legend (colors, str], labels, str][, …]) |
uses patches to create a custom legend with the specified colors |
stemplot (x, y[, data, ax, color, baseline, …]) |
modeled after pyplot.stem but more customizable |
get_twin (ax) |
get the twin axis from an Axes object |
get_axlim (ax, xy) |
Wrapper function to get x limits, y limits or both with one function call |
set_axlim (ax, lim, Mapping[KT, VT_co]], xy) |
Wrapper function to set both x and y limits with one call |
share_xy (ax, x, y, mode, adj_twin_ax) |
set the subplots on the Axes to share x and/or y limits WITHOUT sharing x and y legends. |
share_legend (ax, keep_i) |
removes all legends except for i from an Axes object |
barplot_err (x, y, xerr, yerr, data, **kwargs) |
extension on seaborn barplot that allows for plotting errorbars with preprocessed data. The idea is based on this StackOverflow question. |
countplot (x, str] = None, data, hue, ax, …) |
Based on seaborn barplot but with a few more options, uses df_count() |
quantile_plot (x, str], data, qs, …) |
plots the specified quantiles of a Series using seaborn.barplot |
plotly_aggplot (data, x, float, str, bytes, …) |
create a (grouped) plotly aggplot that lets you select the groupby categories |
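The idea behind q_plim, cutting off data outside a quantile range to get sensible axis limits, can be sketched without any plotting library. The example below computes a lower and an upper quantile with linear interpolation and widens the interval by a fraction of its width. The helper names `quantile` and `quantile_limits`, the default values, and the interpretation of offset_perc as "fraction of the quantile range added on each side" are assumptions and not taken from hhpy.

```python
from typing import Sequence, Tuple


def quantile(values: Sequence[float], q: float) -> float:
    """Linearly interpolated quantile of `values` for 0 <= q <= 1."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    weight = position - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


def quantile_limits(
    values: Sequence[float],
    q_min: float = 0.01,
    q_max: float = 0.99,
    offset_perc: float = 0.05,
) -> Tuple[float, float]:
    """Axis limits spanning the q_min..q_max quantile range plus a margin."""
    low, high = quantile(values, q_min), quantile(values, q_max)
    margin = (high - low) * offset_perc  # assumed meaning of offset_perc
    return low - margin, high + margin


# The outlier 1000 does not influence the limits.
data = [0, 1, 2, 3, 4, 5, 6, 7, 1000]
print(quantile_limits(data, q_min=0.25, q_max=0.75, offset_perc=0.25))
```

The returned pair could then be passed to a call such as matplotlib's `Axes.set_xlim`, so that a few extreme values do not compress the rest of the plot.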