Python API reference

hhpy.main Module

hhpy.main.py

Contains basic calculation functions that are used in the more specialized parts of the package but can also be used on their own.

Functions

`today`(date_format)	Returns today’s date as a string
`size`(byte, unit, dec)	Formats bytes as a human-readable string
`mem_usage`(pandas_obj, *args, **kwargs)	Get memory usage of a pandas object
`tprint`(*args, sep, **kwargs)	Wrapper for print() but with a carriage return at the end.
`fprint`(*args, file, sep, mode, append_sep, …)	Write the output of print to a file instead.
`elapsed_time_init`()	Resets reference time for elapsed_time()
`elapsed_time`(do_return, ref_t)	Get the elapsed time since reference time ref_time.
`total_time`(i, i_max)	Estimates the total time of a running operation by linear extrapolation using iteration counters.
`remaining_time`(i, i_max)	Estimates the remaining time of a running operation by linear extrapolation using iteration counters.
`progressbar`(i, i_max, symbol, empty_symbol, …)	Prints a progressbar for the currently running process based on iteration counters.
`time_to_str`(t, time_format)	Wrapper for strftime
`cf_vec`(x, func, *args, **kwargs)	Pandas compatible vectorize function.
`round_signif_i`(x, digits)	Round to significant number of digits
`round_signif`(x, *args, **kwargs)	Round to significant number of digits
`floor_signif`(x, digits)	Floor to significant number of digits
`ceil_signif`(x, digits)	Ceil to significant number of digits
`concat_cols`(df, columns, sep, to_int)	Concat a number of columns of a pandas DataFrame
`list_unique`(lst)	Returns unique elements from a list
`list_flatten`(lst)	Flatten a list of lists
`list_merge`(*args[, unique, flatten])	Merges n lists together
`list_intersection`(lst, *args)	Returns common elements of n lists
`list_exclude`(lst, *args)	Returns a list that includes only those elements from the first list that are not in any subsequent list.
`rand`(shape, lower, upper, step, seed)	A seedable wrapper for numpy.random.random_sample that allows for boundaries and steps
`dict_list`(*args)	Creates a dictionary of empty named lists.
`append_to_dict_list`(dct, append, inplace)	Appends to a dictionary of named lists.
`is_list_like`(obj)	Checks any python object to see if it is list like
`force_list`(*args)	Takes any python object and turns it into an iterable list.
`qformat`(value, int_format, float_format, …)	Creates a human readable representation of a generic python object
`to_hdf`(df, file, groupby, …)	Saves a pandas DataFrame as an h5 file; if groupby is supplied, each group is saved under a different key.
`get_hdf_keys`(file)	Reads all keys from an hdf file and returns them as a list
`read_hdf`(file, key, …)	Reads a DataFrame from an hdf file
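The rounding helpers in the table above (`round_signif`, `floor_signif`, `ceil_signif`) all operate on significant digits rather than decimal places. The following is a minimal standalone sketch of that idea, not hhpy's implementation: the names `round_sig`, `floor_sig` and `ceil_sig`, the default of 3 digits, and the handling of zero are assumptions made for illustration.

```python
import math


def _apply_signif(x, digits, op):
    """Apply op (round, math.floor or math.ceil) at `digits` significant digits."""
    if digits < 1:
        raise ValueError("digits must be at least 1")
    if x == 0 or not math.isfinite(x):
        return x  # zero, inf and nan have no leading digit to anchor on
    # Exponent of the leading digit, e.g. 3 for 1234.5 and -3 for 0.00123
    magnitude = math.floor(math.log10(abs(x)))
    shift = digits - 1 - magnitude
    if shift >= 0:
        factor = 10 ** shift  # integer factor keeps the scaling exact
        return float(op(x * factor)) / factor
    factor = 10 ** (-shift)
    return float(op(x / factor) * factor)


def round_sig(x, digits=3):
    # Hypothetical name; the built-in round() rounds ties to the even neighbour
    return _apply_signif(x, digits, round)


def floor_sig(x, digits=3):
    # Hypothetical name; always rounds towards negative infinity
    return _apply_signif(x, digits, math.floor)


def ceil_sig(x, digits=3):
    # Hypothetical name; always rounds towards positive infinity
    return _apply_signif(x, digits, math.ceil)
```

With two significant digits, `floor_sig(1234.5, 2)` and `ceil_sig(1234.5, 2)` bracket the value from below and above, which is handy for choosing axis limits or bin edges.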

hhpy.ds Module

hhpy.ds.py

Contains data science functions that extend pandas and sklearn.

Functions

`optimize_pd`(df, c_int, c_float, c_cat, cat_frac)	optimizes memory usage of a pandas DataFrame: automatically downcasts all variable types and converts objects to categories
`get_df_corr`(df, target, groupby)	returns a pandas DataFrame containing all Pearson correlations in a melted format
`drop_zero_cols`(df)	Drop columns with all 0 or None Values from DataFrame.
`get_duplicate_indices`(df)	Returns duplicate indices from a pandas DataFrame
`get_duplicate_cols`(df)	Returns names of duplicate columns from a pandas DataFrame
`drop_duplicate_indices`(df)	Drop duplicate indices from pandas DataFrame
`drop_duplicate_cols`(df)	Drop duplicate columns from pandas DataFrame
`change_span`(s, steps)	returns a True/False series around a changepoint, used for filtering stepwise data series in a pandas DataFrame; the data must be properly sorted!
`outlier_to_nan`(df, col, groupby, …)	cuts off all points whose delta (average difference to the previous and next point) lies outside the n standard deviation range
`butter_pass_filter`(data, cutoff, fs, order, …)	Implementation of a highpass / lowpass filter using scipy.signal.butter
`pass_by_group`(df, col, groupby, …)	allows applying a butter_pass filter by group
`lfit`(x, y, w, …)	quick linear fit with numpy
`qf`(df, fltr, …)	quickly filter a DataFrame based on equality criteria.
`quantile_split`(s, n, signif, na_to_med)	splits a numerical column into n quantiles.
`acc`(y_true, y_pred, df)	calculate accuracy for a categorical label
`rel_acc`(y_true, y_pred, df, …)	relative accuracy of the prediction compared to predicting everything as the most common group; y_true and y_pred are the true and predicted values (column names of df or vector data), df is an optional pandas DataFrame containing them, and target_class optionally names the target class (by default the most common one); returns the accuracy difference as percent
`cm`(y_true, y_pred, df)	confusion matrix from a pandas DataFrame; y_true and y_pred are the true and predicted values (column names of df or vector data) and df is an optional pandas DataFrame containing them; returns the confusion matrix as a pandas DataFrame
`f1_pr`(y_true, y_pred, df, …)	get f1 score, true positive, true negative, missed positive and missed negative rate
`f_score`(y_true, y_pred, df, …)	generic scoring function based on a pandas DataFrame.
`r2`(*args, **kwargs)	wrapper for f_score using sklearn.metrics.r2_score
`rmse`(*args, **kwargs)	wrapper for f_score using numpy.sqrt(sklearn.metrics.mean_squared_error)
`mae`(*args, **kwargs)	wrapper for f_score using sklearn.metrics.mean_absolute_error
`stdae`(*args, **kwargs)	wrapper for f_score using the standard deviation of the absolute error
`medae`(*args, **kwargs)	wrapper for f_score using sklearn.metrics.median_absolute_error
`corr`(*args, **kwargs)	wrapper for f_score using pandas.Series.corr
`df_score`(df, y_true, pred_suffix, …)	creates a DataFrame displaying various kinds of scores
`rmsd`(x, df, group, return_df_paired, …)	calculates the weighted root mean squared difference for a reference column x by a specific group
`df_rmsd`(x, df, groups, hue, …)	calculate rmsd for reference column x with multiple other columns and return as DataFrame
`df_p`(x, group, df, hue, agg_func, agg, …)	returns a DataFrame with the p value.
`df_split`(df, split_by, return_type, …)	Split a pandas DataFrame by column value and returns a list or dict
`mahalanobis`(point, …)	Calculates the Mahalanobis distance for a single point or a DataFrame of points
`top_n`(s, n, w)	select n elements from a categorical pandas series with the highest counts
`top_n_coding`(s, n, other_name, na_to_other, w)	returns a modified version of the pandas series where all elements not in top_n become recoded as ‘other’
`k_split`(df, k, groupby, sortby, …)	splits a DataFrame into k (equal sized) parts that can be used for train test splitting or k_cross splitting
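The summary of `rel_acc` above describes comparing a prediction against the baseline of always predicting the most common class. A minimal sketch of that comparison on plain Python sequences follows. It is not hhpy's code: the function names are hypothetical, no DataFrame handling is shown, and the result is returned as a fraction rather than a percentage.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot score empty input")
    hits = sum(t == p for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


def relative_accuracy(y_true, y_pred, target_class=None):
    """Accuracy gain over always predicting target_class (hypothetical helper)."""
    if target_class is None:
        # Default baseline: the most frequent true label
        target_class = Counter(y_true).most_common(1)[0][0]
    baseline = accuracy(y_true, [target_class] * len(y_true))
    return accuracy(y_true, y_pred) - baseline


y_true = ["a", "a", "a", "b"]
perfect = relative_accuracy(y_true, ["a", "a", "a", "b"])  # beats the baseline
trivial = relative_accuracy(y_true, ["a", "a", "a", "a"])  # equals the baseline
```

A value at or below zero signals that the model does no better than the trivial majority-class guess, which plain accuracy hides on imbalanced data.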

hhpy.ipython Module

hhpy.ipython.py

Contains convenience wrappers for ipython

Functions

`wide_notebook`(width)	makes the Jupyter notebook wider by appending HTML code to change the width
`hide_code`()	hides the code and introduces a toggle button
`display_full`(*args, **kwargs)	wrapper to display a pandas DataFrame with all rows and columns
`pd_display`(*args[, number_format, full])	wrapper to display a pandas DataFrame with a specified number format
`display_df`(df[, int_format, float_format, …])	Wrapper to display a pandas DataFrame with separate options for int / float, also adds an option to exclude columns
`highlight_max`(df, color)	highlights the largest value in each column of a pandas DataFrame
`highlight_min`(df, color)	highlights the smallest value in each column of a pandas DataFrame
`highlight_max_min`(df, max_color, min_color)	highlights the largest and smallest value in each column of a pandas DataFrame

hhpy.modelling Module

hhpy.modelling.py

Contains a model class that is based on pandas DataFrames and wraps around sklearn and other frameworks to provide convenient train test functions.

Functions

`dict_to_model`(dic)	restore a Model object from a dictionary
`force_model`(model)	takes any Model, model object or dictionary and converts to Model
`get_coefs`(model, y)	get coefficients of a linear regression in a sorted data frame
`get_feature_importance`(model, predictors, …)	get feature importance of a decision tree like model in a sorted data frame

Classes

`Model`(model, name, X_ref, …)	A unified modeling class that is extended from sklearn, accepts any model that implements .fit and .predict
`Models`(*args, name, df, X_ref, …)	Collection of Models that allow for fitting and predicting with multiple Models at once, comparing accuracy and creating Ensembles
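The `Model` summary above says the class accepts any model that implements .fit and .predict. One way to express such a duck-typed contract in Python is a runtime-checkable protocol. The sketch below illustrates only that idea under stated assumptions: `FitPredict`, `ModelWrapper` and `MeanRegressor` are hypothetical names and do not reflect how hhpy's `Model` is written.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class FitPredict(Protocol):
    """Anything exposing fit(X, y) and predict(X) satisfies this protocol."""

    def fit(self, X, y): ...

    def predict(self, X): ...


class MeanRegressor:
    """Toy estimator: always predicts the mean of the training targets."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class ModelWrapper:
    """Hypothetical wrapper that stores a name and guards against unfit use."""

    def __init__(self, model, name="unnamed"):
        # isinstance on a runtime-checkable protocol only checks that the
        # methods exist, not their signatures
        if not isinstance(model, FitPredict):
            raise TypeError("model must implement .fit and .predict")
        self.model = model
        self.name = name
        self.is_fit = False

    def fit(self, X, y):
        self.model.fit(X, y)
        self.is_fit = True
        return self

    def predict(self, X):
        if not self.is_fit:
            raise RuntimeError(f"model '{self.name}' has not been fit yet")
        return self.model.predict(X)


wrapped = ModelWrapper(MeanRegressor(), name="baseline").fit([[1], [2], [3]], [2.0, 4.0, 6.0])
predictions = wrapped.predict([[10], [20]])
```

Because the check is structural, an sklearn estimator or any other object with the two methods would pass without inheriting from a common base class.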

Class Inheritance Diagram

Inheritance diagram of hhpy.modelling.Model, hhpy.modelling.Models

hhpy.plotting Module

hhpy.plotting.py

Contains plotting functions

Functions

`heatmap`(x, y, z, data, ax, cmap, agg_func, …)	Wrapper for seaborn heatmap in x-y-z format
`corrplot`(data, annotations, number_format[, ax])	function to create a correlation plot using a seaborn heatmap based on: https://www.linkedin.com/pulse/generating-correlation-heatmaps-seaborn-python-andrew-holt
`corrplot_bar`(data, target, columns, …)	correlation plot as barchart
`pairwise_corrplot`(data, corr_cutoff, …)	prints a pairwise correlation plot for all variables in the DataFrame; by default only plots those with a correlation coefficient of >= corr_cutoff
`distplot`(x, data, hue, hue_order, …)	Similar to seaborn.distplot but supports hues and some other things.
`hist_2d`(x, y, data, bins, std_cutoff, …)	generic 2d histogram created by splitting the 2d area into equal sized cells, counting data points in them and drawn using pyplot.pcolormesh
`paired_plot`(data, cols, color, cmap, alpha, …)	create a facet grid to analyze various aspects of correlation between two variables using seaborn.PairGrid
`q_plim`(s, q_min, q_max, offset_perc, …[, …])	returns quick x limits for plotting (cut off data not in q_min to q_max quantile)
`levelplot`(data, level, cols, hue, …)	Draws a plot for each specified column for each level of a certain column plus a summary plot
`get_legends`(ax)	returns all legends on a given axis, useful if you have a secaxis
`facet_wrap`(func, data, facet, *args, …)	modeled after R’s facet_wrap function.
`get_subax`(ax, row, col, …)	shorthand to get around the fact that ax can be a 1D array or a 2D array (for subplots that can be 1x1,1xn,nx1)
`ax_as_list`(ax)	takes any Axes and turns them into a list
`ax_as_array`(ax)	takes any Axes and turns them into a numpy 2D array
`rmsdplot`(x, data, groups, hue, …)	creates a seaborn.barplot showing the rmsd calculated by hhpy.ds.df_rmsd
`insert_linebreak`(s, pos, frac, max_breaks)	used to insert linebreaks in strings, useful for formatting axes labels
`ax_tick_linebreaks`(ax, x, y, **kwargs)	uses insert_linebreak to insert linebreaks into the axes ticklabels
`annotate_barplot`(ax, x, y, ci, ci_newline, …)	automatically annotates a barplot with bar values and error bars (if present).
`animplot`(data, x, y, t, lines, …)	wrapper for FuncAnimation to be used with pandas DataFrames.
`legend_outside`(ax, width, loc, legend_space, …)	draws a legend outside of the subplot
`set_ax_sym`(ax, x, y)	automatically sets the selected axes to be symmetrical
`custom_legend`(colors, labels[, …])	uses patches to create a custom legend with the specified colors
`stemplot`(x, y[, data, ax, color, baseline, …])	modeled after pyplot.stem but more customizable
`get_twin`(ax)	get the twin axis from an Axes object
`get_axlim`(ax, xy)	Wrapper function to get x limits, y limits or both with one function call
`set_axlim`(ax, lim, xy)	Wrapper function to set both x and y limits with one call
`share_xy`(ax, x, y, mode, adj_twin_ax)	set the subplots on the Axes to share x and/or y limits WITHOUT sharing x and y legends.
`share_legend`(ax, keep_i)	removes all legends except for i from an Axes object
`barplot_err`(x, y, xerr, yerr, data, **kwargs)	extension on seaborn barplot that allows for plotting errorbars with preprocessed data. The idea is based on a StackOverflow question.
`countplot`(x, data, hue, ax, …)	Based on seaborn barplot but with a few more options
`quantile_plot`(x, data, qs, …)	plots the specified quantiles of a Series using seaborn.barplot
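The `q_plim` summary above describes deriving plot limits from a quantile range so that outliers do not stretch the axis. The sketch below shows one way to compute such limits with the standard library only. It is an illustration, not hhpy's implementation: the name `quantile_limits`, the default values, linear interpolation between order statistics, and the reading of `offset_perc` as padding relative to the quantile range are all assumptions.

```python
import math


def _quantile(sorted_vals, q):
    """Quantile by linear interpolation between neighbouring order statistics."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    pos = q * (len(sorted_vals) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] + frac * (sorted_vals[hi] - sorted_vals[lo])


def quantile_limits(values, q_min=0.01, q_max=0.99, offset_perc=0.05):
    """Return (lower, upper) axis limits covering the q_min..q_max quantile range.

    offset_perc is assumed here to be extra padding on each side, expressed
    as a fraction of the width of the quantile range.
    """
    vals = sorted(values)
    if not vals:
        raise ValueError("values must not be empty")
    lower = _quantile(vals, q_min)
    upper = _quantile(vals, q_max)
    pad = (upper - lower) * offset_perc
    return lower - pad, upper + pad


# One extreme outlier no longer dictates the upper limit
data = list(range(100)) + [10_000]
limits = quantile_limits(data, q_min=0.05, q_max=0.95, offset_perc=0.0)
```

The returned pair could then be passed to an axis limit setter such as matplotlib's `Axes.set_xlim`.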