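Python HaloNotebook.text Examples

Programming language: Python

Namespace/package name: halo

Class/type: HaloNotebook

Method/function: text

Examples on hotexamples.com: 7

Python HaloNotebook.text - 7 examples found. These are the top-rated real-world Python examples of halo.HaloNotebook.text extracted from open source projects.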

Example #1

    def test_spinner_getters_setters(self):
        """Test spinner getters and setters.
        """
        spinner = HaloNotebook()
        self.assertEqual(spinner.text, '')
        self.assertEqual(spinner.color, 'cyan')
        self.assertIsNone(spinner.spinner_id)

        spinner.spinner = 'dots12'
        spinner.text = 'bar'
        spinner.color = 'red'

        self.assertEqual(spinner.text, 'bar')
        self.assertEqual(spinner.color, 'red')

        if is_supported():
            self.assertEqual(spinner.spinner, Spinners['dots12'].value)
        else:
            self.assertEqual(spinner.spinner, default_spinner)

        spinner.spinner = 'dots11'
        if is_supported():
            self.assertEqual(spinner.spinner, Spinners['dots11'].value)
        else:
            self.assertEqual(spinner.spinner, default_spinner)

        spinner.spinner = 'foo_bar'
        self.assertEqual(spinner.spinner, default_spinner)

        # Color is None
        spinner.color = None
        spinner.start()
        spinner.stop()
        self.assertIsNone(spinner.color)

Example #2

def model_explanation(data_df,
                      prediction_column,
                      problem_type,
                      snr='auto',
                      file_name=None):
    """
	.. _model-explanation:
	Analyzes the variables that a model relies on the most in a brute-force fashion.
	
	The first variable is the variable the model relies on the most. The second variable is the variable that complements the first variable the most in explaining model decisions etc.

	Running performances should be understood as the performance achievable when trying to guess model predictions using variables with selection order smaller or equal to that of the row.

	When :code:`problem_type=None`, the nature of the supervised learning problem (i.e. regression or classification) is inferred from whether or not :code:`prediction_column` is categorical.


	Parameters
	----------
	data_df : pandas.DataFrame
		The pandas DataFrame containing the data.
	prediction_column : str
		The name of the column containing model predictions.
	problem_type : None | 'classification' | 'regression'
		The type of supervised learning problem. When None, it is inferred from the column type and the number of distinct values.
	file_name : None | str
		A unique identifier characterizing data_df in the form of a file name. Do not set this unless you know why.


	Returns
	-------
	result : pandas.DataFrame
		The result is a pandas.DataFrame with columns (where applicable):

		* :code:`'Selection Order'`: The order in which the associated variable was selected, starting at 1 for the most important variable.
		* :code:`'Variable'`: The column name corresponding to the input variable.
		* :code:`'Running Achievable R-Squared'`: The highest :math:`R^2` that can be achieved by a regression model using all variables selected so far, including this one.
		* :code:`'Running Achievable Accuracy'`: The highest classification accuracy that can be achieved by a classification model using all variables selected so far, including this one.
		* :code:`'Running Achievable RMSE'`: The lowest Root Mean Square Error that can be achieved by a regression model using all variables selected so far, including this one.


	.. admonition:: Theoretical Foundation

		Section :ref:`a) Model Explanation`.

	"""
    assert prediction_column in data_df.columns, 'The label column should be a column of the dataframe.'
    assert problem_type.lower() in ['classification', 'regression']
    if problem_type.lower() == 'regression':
        assert np.can_cast(data_df[prediction_column],
                           float), 'The prediction column should be numeric'

    k = 0
    kp = 0
    max_k = 100

    file_name = upload_data(data_df, file_name=file_name)
    spinner = Halo(text='Waiting for results from the backend.',
                   spinner='dots')
    spinner.start()

    if file_name:
        job_id = EXPLANATION_JOB_IDS.get(
            (file_name, prediction_column, problem_type), None)
        if job_id:
            api_response = APIClient.route(
             path='/wk/variable-selection', method='POST', \
             file_name=file_name, target_column=prediction_column, \
             problem_type=problem_type, timestamp=int(time()), job_id=job_id, \
             snr=snr)
        else:
            api_response = APIClient.route(
             path='/wk/variable-selection', method='POST', \
             file_name=file_name, target_column=prediction_column, \
             problem_type=problem_type, timestamp=int(time()), snr=snr)

        initial_time = time()
        while api_response.status_code == requests.codes.ok and k < max_k:
            if kp % 2 != 0:
                sleep(2 if kp < 5 else 10 if k < max_k - 4 else 300)
                kp += 1
                k = kp // 2

            else:
                try:
                    response = api_response.json()
                    if 'job_id' in response:
                        job_id = response['job_id']
                        EXPLANATION_JOB_IDS[(file_name, prediction_column,
                                             problem_type)] = job_id
                        sleep(2 if kp < 5 else 10 if k < max_k - 4 else 300)
                        kp += 1
                        k = kp // 2

                        # Note: it is important to pass the job_id to avoid being charged twice for the work.
                        api_response = APIClient.route(
                         path='/wk/variable-selection', method='POST', \
                         file_name=file_name, target_column=prediction_column, \
                         problem_type=problem_type, timestamp=int(time()), job_id=job_id, \
                         snr=snr)

                        try:
                            response = api_response.json()
                            if 'eta' in response:
                                progress_text = '%s%% Completed.' % response[
                                    'progress_pct'] if 'progress_pct' in response else ''
                                spinner.text = 'Waiting for results from the backend. ETA: %s. %s' % (
                                    response['eta'], progress_text)
                        except:
                            pass

                    if ('job_id' not in response) or ('selection_order'
                                                      in response):
                        duration = int(time() - initial_time)
                        duration = str(
                            duration) + 's' if duration < 60 else str(
                                duration // 60) + 'min'

                        result = {}

                        if 'selection_order' in response:
                            result['Selection Order'] = response[
                                'selection_order']

                        if 'variable' in response:
                            result['Variable'] = response['variable']

                        if 'r-squared' in response:
                            result['Running Achievable R-Squared'] = response[
                                'r-squared']

                        if 'log-likelihood' in response:
                            result[
                                'Running Achievable Log-Likelihood Per Sample'] = response[
                                    'log-likelihood']

                        if 'rmse' in response and problem_type.lower(
                        ) == 'regression':
                            result['Running Achievable RMSE'] = response[
                                'rmse']

                        if 'accuracy' in response and problem_type.lower(
                        ) == 'classification':
                            result['Running Achievable Accuracy'] = response[
                                'accuracy']

                        result = pd.DataFrame.from_dict(result)

                        if 'selection_order' in response:
                            result.set_index('Selection Order', inplace=True)

                        spinner.text = 'Received results from the backend after %s.' % duration
                        spinner.succeed()
                        return result

                except:
                    logging.exception(
                        '\nModel explanation failed. Last HTTP code: %s, Content: %s'
                        % (api_response.status_code, api_response.content))
                    spinner.text = 'The backend encountered an unexpected error we are looking into. Please try again later.'
                    spinner.fail()
                    return None

        if api_response.status_code != requests.codes.ok:
            spinner.text = 'The backend is taking longer than expected. Please try again later'
            spinner.fail()
            try:
                response = api_response.json()
                if 'message' in response:
                    logging.error('\n%s' % response['message'])
            except:
                logging.error(
                    '\nModel explanation failed. Last HTTP code: %s, Content: %s'
                    % (api_response.status_code, api_response.content))

    raise LongerThanExpectedException(
        'The backend is taking longer than expected, but rest assured your task is still running. Please try again later to retrieve your results.'
    )

    return None
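
The polling loop above chooses its sleep time with the nested conditional `2 if kp < 5 else 10 if k < max_k - 4 else 300`. The sketch below isolates that expression and replays the loop counters for a job that stays pending until the limit is reached. The names `poll_delay` and `schedule` are hypothetical helpers introduced for illustration; they are not part of kxy or halo.

```python
def poll_delay(kp, k, max_k=100):
    """Seconds to sleep before the next poll.

    Equivalent to: 2 if kp < 5 else 10 if k < max_k - 4 else 300
    """
    if kp < 5:
        return 2      # first few polls: short wait
    if k < max_k - 4:
        return 10     # steady state
    return 300        # last few polls: long wait


def schedule(max_k=100):
    """Replay the counters (kp += 1, k = kp // 2) and collect each delay.

    Assumes the backend keeps reporting a pending job on every poll.
    """
    delays = []
    k = kp = 0
    while k < max_k:
        delays.append(poll_delay(kp, k, max_k))
        kp += 1
        k = kp // 2
    return delays


delays = schedule()
total_wait_seconds = sum(delays)
```

Under these assumptions the loop sleeps once per iteration, so `len(delays)` is the number of waits before the function gives up and raises.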

Example #3

def data_valuation(data_df,
                   target_column,
                   problem_type,
                   snr='auto',
                   include_mutual_information=False,
                   file_name=None):
    """
	.. _data-valuation:
	Estimate the highest performance metrics achievable when predicting the :code:`target_column` using all other columns.

	When :code:`problem_type=None`, the nature of the supervised learning problem (i.e. regression or classification) is inferred from whether or not :code:`target_column` is categorical.


	Parameters
	----------
	data_df : pandas.DataFrame
		The pandas DataFrame containing the data.
	target_column : str
		The name of the column containing true labels.
	problem_type : None | 'classification' | 'regression'
		The type of supervised learning problem. When None, it is inferred from the column type and the number of distinct values.
	include_mutual_information : bool
		Whether to include the mutual information between target and explanatory variables in the result.
	file_name : None | str
		A unique identifier characterizing data_df in the form of a file name. Do not set this unless you know why.



	Returns
	-------
	achievable_performance : pandas.DataFrame
		The result is a pandas.DataFrame with columns (where applicable):

		* :code:`'Achievable Accuracy'`: The highest classification accuracy that can be achieved by a model using provided inputs to predict the label.
		* :code:`'Achievable R-Squared'`: The highest :math:`R^2` that can be achieved by a model using provided inputs to predict the label.
		* :code:`'Achievable RMSE'`: The lowest Root Mean Square Error that can be achieved by a model using provided inputs to predict the label.		
		* :code:`'Achievable Log-Likelihood Per Sample'`: The highest true log-likelihood per sample that can be achieved by a model using provided inputs to predict the label.


	.. admonition:: Theoretical Foundation

		Section :ref:`1 - Achievable Performance`.
	"""
    assert target_column in data_df.columns, 'The label column should be a column of the dataframe.'
    assert problem_type.lower() in ['classification', 'regression']
    if problem_type.lower() == 'regression':
        assert np.can_cast(data_df[target_column],
                           float), 'The target column should be numeric'

    k = 0
    max_k = 100

    file_name = upload_data(data_df, file_name=file_name)
    spinner = Halo(text='Waiting for results from the backend.',
                   spinner='dots')
    spinner.start()

    if file_name:
        job_id = VALUATION_JOB_IDS.get(
            (file_name, target_column, problem_type, snr), None)

        if job_id:
            api_response = APIClient.route(
             path='/wk/data-valuation', method='POST',
             file_name=file_name, target_column=target_column, \
             problem_type=problem_type, \
             timestamp=int(time()), job_id=job_id, \
             snr=snr)
        else:
            api_response = APIClient.route(
             path='/wk/data-valuation', method='POST', \
             file_name=file_name, target_column=target_column, \
             problem_type=problem_type, timestamp=int(time()), \
             snr=snr)

        initial_time = time()
        while api_response.status_code == requests.codes.ok and k < max_k:
            try:
                response = api_response.json()
                if 'eta' in response:
                    progress_text = '%s%% Completed.' % response[
                        'progress_pct'] if 'progress_pct' in response else ''
                    spinner.text = 'Waiting for results from the backend. ETA: %s. %s' % (
                        response['eta'], progress_text)

                if ('job_id' in response) and ('r-squared' not in response):
                    job_id = response['job_id']
                    VALUATION_JOB_IDS[(file_name, target_column, problem_type,
                                       snr)] = job_id
                    k += 1
                    sleep(15.)

                    # Note: it is important to pass the job_id to avoid being charged twice for the same work.
                    api_response = APIClient.route(
                     path='/wk/data-valuation', method='POST',
                     file_name=file_name, target_column=target_column, \
                     problem_type=problem_type, \
                     timestamp=int(time()), job_id=job_id, \
                     snr=snr)

                    try:
                        response = api_response.json()
                        if 'eta' in response:
                            progress_text = '%s%% Completed.' % response[
                                'progress_pct'] if 'progress_pct' in response else ''
                            spinner.text = 'Waiting for results from the backend. ETA: %s. %s' % (
                                response['eta'], progress_text)
                    except:
                        pass

                if ('job_id' not in response) or ('r-squared' in response):
                    duration = int(time() - initial_time)
                    duration = str(duration) + 's' if duration < 60 else str(
                        duration // 60) + 'min'

                    result = {}
                    if 'r-squared' in response:
                        result['Achievable R-Squared'] = [
                            response['r-squared']
                        ]

                    if 'log-likelihood' in response:
                        result['Achievable Log-Likelihood Per Sample'] = [
                            response['log-likelihood']
                        ]

                    if 'rmse' in response and problem_type.lower(
                    ) == 'regression':
                        result['Achievable RMSE'] = [response['rmse']]

                    if 'accuracy' in response and problem_type.lower(
                    ) == 'classification':
                        result['Achievable Accuracy'] = [response['accuracy']]

                    if include_mutual_information and 'mi' in response:
                        result['Mutual Information'] = [response['mi']]

                    result = pd.DataFrame.from_dict(result)

                    spinner.text = 'Received results from the backend after %s.' % duration
                    spinner.succeed()

                    return result

            except:
                logging.exception(
                    '\nData valuation failed. Last HTTP code: %s' %
                    api_response.status_code)
                spinner.text = 'The backend encountered an unexpected error we are looking into. Please try again later.'
                spinner.fail()
                return None

        if api_response.status_code != requests.codes.ok:
            spinner.text = 'The backend is taking longer than expected. Try again later.'
            spinner.fail()
            try:
                response = api_response.json()
                if 'message' in response:
                    logging.error('\n%s' % response['message'])
            except:
                logging.error('\nData valuation failed. Last HTTP code: %s' %
                              api_response.status_code)

    raise LongerThanExpectedException(
        'The backend is taking longer than expected, but rest assured your task is still running. Please try again later to retrieve your results.'
    )

    return None
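
Each example assigns a fresh status string to `spinner.text` whenever the backend response contains an `'eta'` key. The sketch below pulls that formatting into a standalone function so the two cases (with and without `'progress_pct'`) are easy to see. `progress_message` is a hypothetical helper, and the sample values `'2min'` and `40` are made up for illustration; the real backend's value formats are not shown in the source.

```python
def progress_message(response):
    """Build the spinner text from a backend response dict.

    Returns None when there is no 'eta' key, in which case the examples
    leave the spinner text unchanged.
    """
    if 'eta' not in response:
        return None
    # The conditional applies to the whole '%' expression, as in the examples:
    # without 'progress_pct' the progress part is an empty string.
    progress_text = ('%s%% Completed.' % response['progress_pct']
                     if 'progress_pct' in response else '')
    return 'Waiting for results from the backend. ETA: %s. %s' % (
        response['eta'], progress_text)


with_progress = progress_message({'eta': '2min', 'progress_pct': 40})
without_progress = progress_message({'eta': '2min'})
```

The result would then be assigned with `spinner.text = ...` only when it is not `None`. Note that the message ends with a trailing space when no percentage is available.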

Example #4

File: improvability.py Project: kxytechnologies/kxy-python

def data_driven_improvability(data_df, target_column, new_variables, problem_type, snr='auto', file_name=None):
	"""
	.. _data-driven-improvability:
	Estimate the potential performance boost that a set of new explanatory variables can bring about.


	Parameters
	----------
	data_df : pandas.DataFrame
		The pandas DataFrame containing the data.
	target_column : str
		The name of the column containing true labels.
	new_variables : list
		The names of the columns to use as new explanatory variables.
	problem_type : None | 'classification' | 'regression'
		The type of supervised learning problem. When None, it is inferred from whether or not :code:`target_column` is categorical.
	file_name : None | str
		A unique identifier characterizing data_df in the form of a file name. Do not set this unless you know why.



	Returns
	-------
	result : pandas.DataFrame
		The result is a pandas.DataFrame with columns (where applicable):

		* :code:`'Accuracy Boost'`: The classification accuracy boost that the new explanatory variables can bring about.
		* :code:`'R-Squared Boost'`: The :math:`R^2` boost that the new explanatory variables can bring about.
		* :code:`'RMSE Reduction'`: The reduction in Root Mean Square Error that the new explanatory variables can bring about.
		* :code:`'Log-Likelihood Per Sample Boost'`: The boost in log-likelihood per sample that the new explanatory variables can bring about.


	.. admonition:: Theoretical Foundation

		Section :ref:`3 - Model Improvability`.
		
	"""
	assert target_column in data_df.columns, 'The label column should be a column of the dataframe.'
	assert problem_type.lower() in ['classification', 'regression']
	assert len(new_variables) > 0, 'New variables should be provided'
	for col in new_variables:
		assert col in data_df.columns, '%s should be a column in the dataframe' % col
	if problem_type.lower() == 'regression':
		assert np.can_cast(data_df[target_column], float), 'The target column should be numeric'

	k = 0
	kp = 0
	max_k = 100

	file_name = upload_data(data_df, file_name=file_name)
	spinner = Halo(text='Waiting for results from the backend.', spinner='dots')
	spinner.start()

	if file_name:
		job_id = DD_IMPROVABILITY_JOB_IDS.get((file_name, target_column, str(new_variables), problem_type, snr), None)

		if job_id:
			api_response = APIClient.route(
				path='/wk/data-driven-improvability', method='POST', \
				file_name=file_name, target_column=target_column, \
				problem_type=problem_type, new_variables=json.dumps(new_variables), \
				job_id=job_id, timestamp=int(time()), snr=snr)
		else:
			api_response = APIClient.route(
				path='/wk/data-driven-improvability', method='POST', \
				file_name=file_name, target_column=target_column, \
				problem_type=problem_type, new_variables=json.dumps(new_variables), \
				timestamp=int(time()), snr=snr)


		initial_time = time()
		while api_response.status_code == requests.codes.ok and k < max_k:
			if kp%2 != 0:
				sleep(2 if kp<5 else 10 if k < max_k-4 else 300)
				kp += 1
				k = kp//2

			else:
				try:
					response = api_response.json()
					if 'job_id' in response:
						job_id = response['job_id']
						DD_IMPROVABILITY_JOB_IDS[(file_name, target_column, str(new_variables), problem_type, snr)] = job_id
						sleep(2 if kp<5 else 10 if k < max_k-4 else 300)
						kp += 1
						k = kp//2
						# Note: pass the job_id so the backend resumes the same job instead of starting a new one.
						api_response = APIClient.route(
							path='/wk/data-driven-improvability', method='POST', \
							file_name=file_name, target_column=target_column, \
							problem_type=problem_type, new_variables=json.dumps(new_variables), \
							job_id=job_id, timestamp=int(time()), snr=snr)

						try:
							response = api_response.json()
							if 'eta' in response:
								progress_text = '%s%% Completed.' % response['progress_pct'] if 'progress_pct' in response else ''
								spinner.text = 'Waiting for results from the backend. ETA: %s. %s' % (response['eta'], progress_text)
						except:
							pass

					if ('job_id' not in response) or ('r-squared-boost' in response):
						duration = int(time()-initial_time)
						duration = str(duration) + 's' if duration < 60 else str(duration//60) + 'min'
						result = {}
						if 'r-squared-boost' in response:
							result['R-Squared Boost'] = [response['r-squared-boost']]

						if 'log-likelihood-boost' in response:
							result['Log-Likelihood Per Sample Boost'] = [response['log-likelihood-boost']]

						if 'rmse-reduction' in response and problem_type.lower() == 'regression':
							result['RMSE Reduction'] = [response['rmse-reduction']]

						if 'accuracy-boost' in response and problem_type.lower() == 'classification':
							result['Accuracy Boost'] = [response['accuracy-boost']]

						result = pd.DataFrame.from_dict(result)
						spinner.text = 'Received results from the backend after %s' % duration
						spinner.succeed()
						return result

				except:
					spinner.text = 'The backend encountered an unexpected error we are looking into. Please try again later.'
					spinner.fail()
					return None

		if api_response.status_code != requests.codes.ok:
			spinner.text = 'The backend is taking longer than expected. Try again later.'
			spinner.fail()
			try:
				response = api_response.json()
				if 'message' in response:
					logging.error('\n%s' % response['message'])
			except:
				logging.error('\nData-driven improvability failed. Last HTTP code: %s' % api_response.status_code)

	return None
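
The functions in these examples remember the backend job ID in a module-level dictionary keyed by the request parameters, and only add `job_id` to the request when one is cached. A minimal sketch of that idea follows. `JOB_IDS`, `job_key`, and `request_kwargs` are hypothetical names standing in for `DD_IMPROVABILITY_JOB_IDS` and the inline logic above; no real API client is called.

```python
JOB_IDS = {}  # hypothetical stand-in for DD_IMPROVABILITY_JOB_IDS


def job_key(file_name, target_column, new_variables, problem_type, snr='auto'):
    """Build a hashable cache key for one request.

    Lists are unhashable, so the list of new variables is stringified,
    as the example does with str(new_variables).
    """
    return (file_name, target_column, str(new_variables), problem_type, snr)


def request_kwargs(key, **params):
    """Return the request parameters, adding job_id only when one is cached."""
    job_id = JOB_IDS.get(key)
    if job_id:
        params['job_id'] = job_id
    return params


key = job_key('data.csv', 'y', ['a', 'b'], 'regression')
first_call = request_kwargs(key, file_name='data.csv')

JOB_IDS[key] = 'job-123'  # what the loop stores after the first response
second_call = request_kwargs(key, file_name='data.csv')
```

One consequence of stringifying the list is that the key depends on element order: `['a', 'b']` and `['b', 'a']` map to different cache entries.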

Example #5

File: improvability.py Project: kxytechnologies/kxy-python

def model_driven_improvability(data_df, target_column, prediction_column, problem_type, snr='auto', file_name=None):
	"""
	.. _model-driven-improvability:
	Estimate the extent to which a trained supervised learner may be improved in a model-driven fashion (i.e. without resorting to additional explanatory variables).


	Parameters
	----------
	data_df : pandas.DataFrame
		The pandas DataFrame containing the data.
	target_column : str
		The name of the column containing true labels.
	prediction_column : str
		The name of the column containing model predictions.
	problem_type : None | 'classification' | 'regression'
		The type of supervised learning problem. When None, it is inferred from whether or not :code:`target_column` is categorical.
	file_name : None | str
		A unique identifier characterizing data_df in the form of a file name. Do not set this unless you know why.


	Returns
	-------
	result : pandas.DataFrame
		The result is a pandas.DataFrame with columns (where applicable):

		* :code:`'Lost Accuracy'`: The amount of classification accuracy that was irreversibly lost when training the supervised learner.
		* :code:`'Lost R-Squared'`: The amount of :math:`R^2` that was irreversibly lost when training the supervised learner.
		* :code:`'Lost RMSE'`: The amount of Root Mean Square Error that was irreversibly lost when training the supervised learner.		
		* :code:`'Lost Log-Likelihood Per Sample'`: The amount of true log-likelihood per sample that was irreversibly lost when training the supervised learner.

		* :code:`'Residual R-Squared'`: For regression problems, this is the highest :math:`R^2` that may be achieved when using explanatory variables to predict regression residuals.
		* :code:`'Residual RMSE'`: For regression problems, this is the lowest Root Mean Square Error that may be achieved when using explanatory variables to predict regression residuals.
		* :code:`'Residual Log-Likelihood Per Sample'`: For regression problems, this is the highest log-likelihood per sample that may be achieved when using explanatory variables to predict regression residuals.


	.. admonition:: Theoretical Foundation

		Section :ref:`3 - Model Improvability`.

	"""
	assert target_column in data_df.columns, 'The label column should be a column of the dataframe.'
	assert prediction_column in data_df.columns, 'The prediction column should be a column of the dataframe.'
	assert problem_type.lower() in ['classification', 'regression']
	if problem_type.lower() == 'regression':
		assert np.can_cast(data_df[target_column], float), 'The target column should be numeric'
		assert np.can_cast(data_df[prediction_column], float), 'The prediction column should be numeric'

	k = 0
	kp = 0
	max_k = 100

	file_name = upload_data(data_df, file_name=file_name)
	spinner = Halo(text='Waiting for results from the backend.', spinner='dots')
	spinner.start()

	if file_name:
		job_id = MD_IMPROVABILITY_JOB_IDS.get((file_name, target_column, prediction_column, problem_type, snr), None)

		if job_id:
			api_response = APIClient.route(
				path='/wk/model-driven-improvability', method='POST', \
				file_name=file_name, target_column=target_column, \
				problem_type=problem_type, prediction_column=prediction_column, \
				job_id=job_id, timestamp=int(time()), snr=snr)
		else:
			api_response = APIClient.route(
				path='/wk/model-driven-improvability', method='POST', \
				file_name=file_name, target_column=target_column, \
				problem_type=problem_type, prediction_column=prediction_column, \
				timestamp=int(time()), snr=snr)

		# Record the start time so the elapsed duration can be reported on success.
		initial_time = time()
		while api_response.status_code == requests.codes.ok and k <= max_k:
			if kp%2 != 0:
				sleep(2 if kp<5 else 10 if k < max_k-4 else 300)
				kp += 1
				k = kp//2

			else:
				try:
					response = api_response.json()
					if 'job_id' in response:
						job_id = response['job_id']
						MD_IMPROVABILITY_JOB_IDS[(file_name, target_column, prediction_column, problem_type, snr)] = job_id
						sleep(2 if kp<5 else 10 if k < max_k-4 else 300)
						kp += 1
						k = kp//2
						api_response = APIClient.route(
							path='/wk/model-driven-improvability', method='POST', \
							file_name=file_name, target_column=target_column, \
							problem_type=problem_type, prediction_column=prediction_column, \
							job_id=job_id, timestamp=int(time()), snr=snr)

						try:
							response = api_response.json()
							if 'eta' in response:
								progress_text = '%s%% Completed.' % response['progress_pct'] if 'progress_pct' in response else ''
								spinner.text = 'Waiting for results from the backend. ETA: %s. %s' % (response['eta'], progress_text)
						except:
							pass

					if ('job_id' not in response) or ('lost-r-squared' in response):
						# duration is used in the success message below.
						duration = int(time()-initial_time)
						duration = str(duration) + 's' if duration < 60 else str(duration//60) + 'min'
						result = {}

						if 'lost-r-squared' in response:
							result['Lost R-Squared'] = [response['lost-r-squared']]			

						if 'lost-log-likelihood' in response:
							result['Lost Log-Likelihood Per Sample'] = [response['lost-log-likelihood']]

						if 'lost-rmse' in response and problem_type.lower() == 'regression':
							result['Lost RMSE'] = [response['lost-rmse']]

						if 'lost-accuracy' in response and problem_type.lower() == 'classification':
							result['Lost Accuracy'] = [response['lost-accuracy']]


						if problem_type.lower() == 'regression':
							if 'residual-r-squared' in response:
								result['Residual R-Squared'] = [response['residual-r-squared']]			

							if 'residual-log-likelihood' in response:
								result['Residual Log-Likelihood Per Sample'] = [response['residual-log-likelihood']]

							if 'residual-rmse' in response:
								result['Residual RMSE'] = [response['residual-rmse']]

						result = pd.DataFrame.from_dict(result)
						spinner.text = 'Received results from the backend after %s' % duration
						spinner.succeed()
						return result

				except:
					spinner.text = 'The backend encountered an unexpected error we are looking into. Please try again later.'
					spinner.fail()
					return None

		if api_response.status_code != requests.codes.ok:
			spinner.text = 'The backend is taking longer than expected. Please try again later.'
			spinner.fail()
			try:
				response = api_response.json()
				if 'message' in response:
					logging.error('\n%s' % response['message'])
			except:
				logging.error('\nModel-driven improvability failed. Last HTTP code: %s' % api_response.status_code)

	raise LongerThanExpectedException('The backend is taking longer than expected, but rest assured your task is still running. Please try again later to retrieve your results.')

	return None
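
The success message in these examples reports elapsed time as whole seconds below one minute and as whole minutes otherwise. The sketch below isolates that one-line conditional; `format_duration` is a hypothetical helper name, not part of kxy or halo.

```python
def format_duration(seconds):
    """Format elapsed time the way the examples do.

    Under 60 seconds: '<n>s'. Otherwise: whole minutes as '<n>min',
    with leftover seconds dropped by integer division.
    """
    seconds = int(seconds)
    # The conditional expression binds more loosely than '+', so each branch
    # is a complete string concatenation.
    return str(seconds) + 's' if seconds < 60 else str(seconds // 60) + 'min'


short = format_duration(59.9)
one_minute = format_duration(60)
longer = format_duration(125)
```

In the examples the input would be `time() - initial_time`, and the result is interpolated into `spinner.text` before `spinner.succeed()` is called.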

Example #6

File: corr.py Project: kxytechnologies/kxy-python

def information_adjusted_correlation(data_df, market_column, asset_column):
    """
	Estimate the information-adjusted correlation between an asset return :math:`r` and the market return :math:`r_m`: :math:`\\text{IA-Corr}\\left(r, r_m \\right) := \\text{sgn}\\left(\\text{Corr}\\left(r, r_m \\right) \\right) \\left[1 - e^{-2I(r, r_m)} \\right]`, where :math:`\\text{sgn}\\left(\\text{Corr}\\left(r, r_m \\right) \\right)` is the sign of the Pearson correlation coefficient.

	Unlike Pearson's correlation coefficient, which is 0 if and only if asset return and market return are **decorrelated** (i.e. they exhibit no linear relation), information-adjusted correlation is 0 if and only if market and asset returns are **statistically independent** (i.e. they exhibit no relation, linear or nonlinear).


	Parameters
	----------
	data_df : pandas.DataFrame
		The pandas DataFrame containing the data.
	market_column : str
		The name of the column containing market returns.
	asset_column : str
		The name of the column containing asset returns.


	Returns
	-------
	result : float
		The information-adjusted correlation.

	"""
    assert market_column in data_df.columns, 'The market column should be a column of the dataframe.'
    assert asset_column in data_df.columns, 'The asset column should be a column of the dataframe.'
    assert np.can_cast(data_df[market_column],
                       float), 'The market return column should be numeric'
    assert np.can_cast(data_df[asset_column],
                       float), 'The asset return column should be numeric'

    k = 0
    kp = 0
    max_k = 100
    spinner = Halo(text='Waiting for results from the backend.',
                   spinner='dots')
    spinner.start()

    df = data_df[[market_column, asset_column]]
    file_name = upload_data(df)
    if file_name:
        job_id = IACORR_JOB_IDS.get(file_name, None)

        if job_id:
            api_response = APIClient.route(
             path='/wk/ia-corr', method='POST',
             file_name=file_name, market_column=market_column, \
             asset_column=asset_column, \
             timestamp=int(time()), job_id=job_id)
        else:
            api_response = APIClient.route(
             path='/wk/ia-corr', method='POST', \
             file_name=file_name, market_column=market_column, \
             asset_column=asset_column, \
             timestamp=int(time()))

        initial_time = time()
        while api_response.status_code == requests.codes.ok and k < max_k:
            if kp % 2 != 0:
                sleep(2 if kp < 5 else 5 if k < max_k - 4 else 300)
                kp += 4
                k = kp // 2
            else:
                try:
                    response = api_response.json()
                    if 'job_id' in response:
                        job_id = response['job_id']
                        IACORR_JOB_IDS[file_name] = job_id
                        sleep(2 if kp < 5 else 5 if k < max_k - 4 else 300)
                        kp += 4
                        k = kp // 2

                        # Note: it is important to pass the job_id to avoid being charged twice for the same work.
                        api_response = APIClient.route(
                            path='/wk/ia-corr', method='POST',
                            file_name=file_name, market_column=market_column,
                            asset_column=asset_column,
                            timestamp=int(time()), job_id=job_id)

                        try:
                            response = api_response.json()
                            if 'eta' in response:
                                progress_text = '%s%% Completed.' % response[
                                    'progress_pct'] if 'progress_pct' in response else ''
                                spinner.text = 'Waiting for results from the backend. ETA: %s. %s' % (
                                    response['eta'], progress_text)
                        except Exception:
                            # The ETA update is best-effort; ignore malformed responses.
                            pass

                    if 'job_id' not in response:
                        duration = int(time() - initial_time)
                        duration = str(
                            duration) + 's' if duration < 60 else str(
                                duration // 60) + 'min'
                        spinner.text = 'Received results from the backend in %s' % duration
                        spinner.succeed()

                        if 'ia-corr' in response:
                            return response['ia-corr']
                        else:
                            return np.nan

                except Exception:
                    spinner.text = 'The backend encountered an unexpected error we are looking into. Please try again later.'
                    spinner.fail()
                    logging.exception(
                        '\nInformation-adjusted correlation failed. Last HTTP code: %s'
                        % api_response.status_code)
                    return None

        if api_response.status_code != requests.codes.ok:
            spinner.text = 'The backend is taking longer than expected. Please try again later.'
            spinner.fail()
            try:
                response = api_response.json()
                if 'message' in response:
                    logging.error('\n%s' % response['message'])
            except Exception:
                logging.error(
                    '\nInformation-adjusted correlation failed. Last HTTP code: %s'
                    % api_response.status_code)

    return None
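
The formula in the docstring above can be checked numerically without calling the backend. The following is only an illustrative sketch: the helper names `ia_corr_from_mi` and `gaussian_mutual_info` are hypothetical and not part of the library shown above, and the sketch assumes that :math:`I(r, r_m)` denotes the mutual information measured in nats. Under that assumption, a bivariate Gaussian pair with correlation rho has I = -0.5 ln(1 - rho^2), so the formula as written reduces to sgn(rho) * rho^2.

```python
import math


def ia_corr_from_mi(pearson_corr, mutual_info):
    """Apply the docstring's formula: sgn(Corr) * (1 - exp(-2 * I)).

    Hypothetical helper for illustration; `mutual_info` is assumed to be in nats.
    """
    if mutual_info < 0:
        raise ValueError('Mutual information cannot be negative.')
    # sgn(0) is taken to be 0, following the usual sign convention.
    sign = math.copysign(1.0, pearson_corr) if pearson_corr != 0 else 0.0
    return sign * (1.0 - math.exp(-2.0 * mutual_info))


def gaussian_mutual_info(rho):
    """Mutual information (in nats) of a bivariate Gaussian with correlation rho."""
    return -0.5 * math.log(1.0 - rho ** 2)


rho = -0.5
value = ia_corr_from_mi(rho, gaussian_mutual_info(rho))
print(value)  # close to sgn(rho) * rho ** 2 in the Gaussian case
```

When the mutual information is zero the helper returns 0 regardless of the sign, which matches the independence property stated in the docstring.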

Example #7

    def _additive_fit(self, obj, target_column, learner_func, problem_type=None, snr='auto', train_frac=0.8, random_state=0,
                      force_redo=False, max_n_features=None, min_n_features=None, start_n_features=None, anonymize=False,
                      benchmark_feature=None, missing_value_imputation=False, score='auto', n_down_perf_before_stop=3,
                      regression_baseline='mean', regression_error_type='additive', return_scores=False, start_n_features_perf_frac=0.9,
                      val_performance_buffer=0.0, path=None, file_name=None):
        # A base learner here is fitted to the residuals of the best model so far.
        assert inspect.isfunction(
            learner_func), 'learner_func should be a function'
        assert target_column in obj.columns, 'The target column should be a valid column'
        if problem_type is None:
            problem_type = 'classification' if obj.kxy.is_discrete(
                target_column) else 'regression'
        assert problem_type in ('classification', 'regression')
        self.problem_type = problem_type
        self.additive_learning = True
        assert regression_error_type in ('additive', 'multiplicative')
        self.regression_error_type = regression_error_type

        for col in obj.columns:
            assert not obj.kxy.is_categorical(
                col), 'All columns should be numeric'

        x_columns = [_ for _ in obj.columns if _ != target_column]
        if self.problem_type == 'classification':
            labels = set(list(obj[target_column].values.astype(int)))
            binary_labels = {0, 1}
            assert labels.issubset(
                binary_labels), 'Classification labels should either be 0 or 1'

        if benchmark_feature:
            assert benchmark_feature in obj.columns, 'The benchmark feature should be a valid column'
        self.benchmark_feature = benchmark_feature
        if callable(score):
            score_func = score
        else:
            if score == 'auto':
                score = 'r2_score' if problem_type == 'regression' else 'accuracy_score'
            score_func = eval(score)
        score_name = score_func.__name__
        self.val_scores = []

        if getattr(self, 'models', None) is None or force_redo:
            # 0. Train/Validation split
            self.target_column = target_column
            if return_scores:
                # Reserve 1-train_frac for testing, and train_frac for training and validation
                self.test_df = obj.sample(frac=1. - train_frac,
                                          random_state=random_state)
                self.train_val_df = obj.drop(self.test_df.index)
            else:
                self.train_val_df = obj
            # Reserve train_frac*train_frac for training [...]
            self.train_df = self.train_val_df.sample(frac=train_frac,
                                                     random_state=random_state)
            # [...] and train_frac*(1-train_frac) for validation.
            self.val_df = self.train_val_df.drop(self.train_df.index)

            if missing_value_imputation:
                # Basic missing value imputation
                self.train_df.fillna(self.train_df.median(), inplace=True)
                self.val_df.fillna(self.train_df.median(), inplace=True)
                if return_scores:
                    self.test_df.fillna(self.train_df.median(), inplace=True)

            # 1. Model-free variable selection
            vs_accessor = PreLearningAccessor(obj)
            self.variable_selection_results = vs_accessor.variable_selection(
                self.target_column, problem_type=self.problem_type,
                snr=snr, anonymize=anonymize, file_name=file_name)

            self.variables = [
                _ for _ in self.variable_selection_results['Variable'].values
                if _.lower() != 'no variable'
            ]
            n_variables = len(self.variables)
            if max_n_features:
                n_variables = min(n_variables, max_n_features)

            if start_n_features is None:
                perfs = [
                    _ for _ in self.
                    variable_selection_results['Running Achievable R-Squared'].
                    astype(float)
                ]
                max_perf = np.max(perfs)
                perf_threshold = start_n_features_perf_frac * max_perf
                start_n_features = n_variables - len(
                    [_ for _ in perfs if _ > perf_threshold]) + 1

            # 2. Sequentially add variables in decreasing order of importance.
            # 2.1 Baseline performance
            y_train = self.train_df[[self.target_column]].values
            y_val = self.val_df[[self.target_column]].values
            if return_scores:
                y_test = self.test_df[[self.target_column]].values
            x_train = self.train_df[self.variables[:1]].values
            x_val = self.val_df[self.variables[:1]].values

            base_m = BaselineRegressor(
                baseline=regression_baseline
            ) if problem_type == 'regression' else BaselineClassifier()
            base_m.fit(x_train, y_train)
            y_val_pred = base_m.predict(x_val)
            previous_score = score_func(y_val, y_val_pred)

            spinner = Halo(text='Lean Boosting:', spinner='dots')
            spinner.start()
            logging.info('Baseline score (%s): %.4f' %
                         (score_name, previous_score))
            spinner.text = 'Lean Boosting -- Baseline %s: %.4f' % (
                score_name, previous_score)

            self.start_n_features = min(start_n_features, n_variables)
            n_down_perf = 0
            target_train = y_train.copy()
            y_val_pred = None
            self.models = []
            self.max_var_ixs = []
            for i in range(self.start_n_features, n_variables + 1):
                gc.collect()
                vs = self.variables[:i]
                x_train = self.train_df[vs].values
                x_val = self.val_df[vs].values
                n_vars = x_train.shape[1] if len(x_train.shape) > 1 else 1

                # Create the new model
                model_path = path + '-model-%d-LeanMLPredictor' % i if path else None
                m = learner_func(n_vars=n_vars, path=model_path)

                # Fit the new model
                m.fit(x_train, target_train)

                # New validation score
                target_val_pred = m.predict(x_val)
                target_val_pred = target_val_pred if len(
                    target_val_pred.shape) > 1 else target_val_pred[:, None]
                target_train_pred = m.predict(x_train)
                target_train_pred = target_train_pred if len(
                    target_train_pred.shape) > 1 else target_train_pred[:,
                                                                        None]

                if y_val_pred is None:
                    y_val_pred = target_val_pred
                else:
                    if self.problem_type == 'regression':
                        y_val_pred = y_val_pred+target_val_pred if self.regression_error_type == 'additive' else \
                         y_val_pred*target_val_pred

                    if self.problem_type == 'classification':
                        y_val_pred = np.abs(y_val_pred - target_val_pred)

                val_score = score_func(y_val, y_val_pred)
                if val_score > previous_score + val_performance_buffer or (
                        min_n_features and i <= min_n_features):
                    n_down_perf = 0
                    logging.info(
                        'Variable #%d (%s) increased validation performance from %.4f to %.4f'
                        %
                        (i, self.variables[i - 1], previous_score, val_score))
                    spinner.text = 'Lean Boosting -- %d Variables, Validation %s: %.4f' % (
                        i, score_name, val_score)
                    previous_score = val_score

                    if self.problem_type == 'regression':
                        target_train = target_train - target_train_pred if self.regression_error_type == 'additive' else target_train / target_train_pred

                    if self.problem_type == 'classification':
                        target_train = np.logical_not(
                            target_train == target_train_pred).astype(int)

                    self.models = self.models + [m]
                    if path:
                        self.predictor_paths = self.predictor_paths + [
                            model_path
                        ]
                    self.max_var_ixs = self.max_var_ixs + [i]
                    self.val_scores = self.val_scores + [(i, val_score)]

                else:
                    n_down_perf += 1
                    logging.info(
                        'Validation performance did not increase for the %d-th consecutive time. Old: %.4f, New: %.4f, Variable: %s'
                        % (n_down_perf, previous_score, val_score,
                           self.variables[i - 1]))
                    if n_down_perf >= n_down_perf_before_stop:
                        # Only stop after a certain number of consecutive down performance
                        logging.info(
                            'Stopping training as validation performance did not increase %d consecutive times.'
                            % n_down_perf_before_stop)
                        spinner.succeed()
                        break

                if max_n_features and (i == max_n_features):
                    logging.info(
                        'Stopping training as the maximum number of variables (%d) has been reached'
                        % max_n_features)
                    spinner.succeed()
                    break
            else:
                # The loop ran out of variables without an early stop; stop the spinner here too.
                spinner.succeed()

            self.selected_variables = self.variables[:self.max_var_ixs[
                -1]] if self.models else []
            if self.models == []:
                self.models = [base_m]
                # Only build the model path when one was provided (path defaults to None).
                if path:
                    model_path = path + '-base-model-LeanMLPredictor'
                    self.predictor_paths = self.predictor_paths + [model_path]
                self.max_var_ixs = [1]
                self.val_scores = [(0, previous_score)]

            results = {'Selected Variables': self.selected_variables}
            if return_scores:
                # Inputs
                x_train = self.train_df[self.selected_variables].values
                x_val = self.val_df[self.selected_variables].values
                x_test = self.test_df[self.selected_variables].values

                # Predictions
                self.y_train_pred = self.predict(self.train_df)
                self.y_train_pred = self.y_train_pred.values.flatten()

                self.y_val_pred = self.predict(self.val_df)
                self.y_val_pred = self.y_val_pred.values.flatten()

                self.y_test_pred = self.predict(self.test_df)
                self.y_test_pred = self.y_test_pred.values.flatten()

                # Scores
                self.train_score = score_func(y_train, self.y_train_pred)
                self.val_score = score_func(y_val, self.y_val_pred)
                self.test_score = score_func(y_test, self.y_test_pred)

                results['Training Score'] = '%.5f' % self.train_score
                results['Validation Score'] = '%.5f' % self.val_score
                results['Testing Score'] = '%.5f' % self.test_score

                if self.problem_type == 'regression':
                    results['Training R-Squared'] = '%.3f' % r2_score(
                        y_train.flatten(), self.y_train_pred.flatten())
                    results['Validation R-Squared'] = '%.3f' % r2_score(
                        y_val.flatten(), self.y_val_pred.flatten())
                    results['Testing R-Squared'] = '%.3f' % r2_score(
                        y_test.flatten(), self.y_test_pred.flatten())

                    results['Training RMSE'] = '%.5f' % mean_squared_error(
                        y_train.flatten(),
                        self.y_train_pred.flatten(),
                        squared=False)
                    results['Validation RMSE'] = '%.5f' % mean_squared_error(
                        y_val.flatten(),
                        self.y_val_pred.flatten(),
                        squared=False)
                    results['Testing RMSE'] = '%.5f' % mean_squared_error(
                        y_test.flatten(),
                        self.y_test_pred.flatten(),
                        squared=False)

                if self.problem_type == 'classification':
                    results['Training Accuracy'] = '%.3f' % self.train_score
                    results['Validation Accuracy'] = '%.3f' % self.val_score
                    results['Testing Accuracy'] = '%.3f' % self.test_score

            return results
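
The residual-fitting loop above can be illustrated with a much smaller, NumPy-only sketch. This is not the method's actual implementation: the names `additive_fit`, `fit_linear`, `predict_linear` and `r2` are hypothetical, plain least squares stands in for `learner_func`, only the additive regression case is covered, and the data are synthetic. One deliberate simplification is that a rejected stage is dropped from the running validation prediction.

```python
import numpy as np


def r2(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


def fit_linear(x, y):
    """Least-squares fit with an intercept (stand-in for `learner_func`)."""
    design = np.column_stack([x, np.ones(len(x))])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict_linear(coef, x):
    return np.column_stack([x, np.ones(len(x))]) @ coef


def additive_fit(x_train, y_train, x_val, y_val, patience=3):
    """Fit each stage to the residuals left by the stages accepted so far."""
    residual = y_train.copy()
    val_pred = None
    # Baseline: predict the training mean everywhere.
    best_score = r2(y_val, np.full(len(y_val), np.mean(y_train)))
    stages, n_down = [], 0
    # Columns are assumed to be sorted by decreasing importance.
    for i in range(1, x_train.shape[1] + 1):
        coef = fit_linear(x_train[:, :i], residual)
        stage_val = predict_linear(coef, x_val[:, :i])
        candidate = stage_val if val_pred is None else val_pred + stage_val
        score = r2(y_val, candidate)
        if score > best_score:
            # Accept the stage and update the residual target.
            residual = residual - predict_linear(coef, x_train[:, :i])
            val_pred, best_score, n_down = candidate, score, 0
            stages.append((i, coef))
        else:
            n_down += 1
            if n_down >= patience:
                break
    return stages, best_score


rng = np.random.default_rng(0)
x = rng.normal(size=(400, 3))
y = 2.0 * x[:, 0] - x[:, 1] + 0.1 * rng.normal(size=400)
stages, score = additive_fit(x[:300], y[:300], x[300:], y[300:])
print([i for i, _ in stages], round(score, 3))
```

The final prediction is the sum of the accepted stages' predictions, each computed on its own leading subset of columns.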