Calculating Pearson correlation and significance in Python


procodes 2020. 5. 17. 21:45


I am looking for a function that takes two lists as input and returns the Pearson correlation and the significance of that correlation.

You can have a look at scipy.stats:

from pydoc import help
from scipy.stats import pearsonr
help(pearsonr)

>>>
Help on function pearsonr in module scipy.stats.stats:

pearsonr(x, y)
 Calculates a Pearson correlation coefficient and the p-value for testing
 non-correlation.

 The Pearson correlation coefficient measures the linear relationship
 between two datasets. Strictly speaking, Pearson's correlation requires
 that each dataset be normally distributed. Like other correlation
 coefficients, this one varies between -1 and +1 with 0 implying no
 correlation. Correlations of -1 or +1 imply an exact linear
 relationship. Positive correlations imply that as x increases, so does
 y. Negative correlations imply that as x increases, y decreases.

 The p-value roughly indicates the probability of an uncorrelated system
 producing datasets that have a Pearson correlation at least as extreme
 as the one computed from these datasets. The p-values are not entirely
 reliable but are probably reasonable for datasets larger than 500 or so.

 Parameters
 ----------
 x : 1D array
 y : 1D array the same length as x

 Returns
 -------
 (Pearson's correlation coefficient,
  2-tailed p-value)

 References
 ----------
 http://www.statsoft.com/textbook/glosp.html#Pearson%20Correlation

The Pearson correlation can be calculated with numpy's corrcoef.

import numpy
numpy.corrcoef(list1, list2)[0, 1]

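An alternative is the native scipy function linregress, which calculates:

slope : slope of the regression line

intercept : intercept of the regression line

r-value : correlation coefficient

p-value : two-sided p-value for a hypothesis test whose null hypothesis is that the slope is zero

stderr : standard error of the estimate

And here is an example: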

a = [15, 12, 8, 8, 7, 7, 7, 6, 5, 3]
b = [10, 25, 17, 11, 13, 17, 20, 13, 9, 15]
from scipy.stats import linregress
linregress(a, b)

which returns:

LinregressResult(slope=0.20833333333333337, intercept=13.375, rvalue=0.14499815458068521, pvalue=0.68940144811669501, stderr=0.50261704627083648)
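
The slope, intercept and rvalue reported above can be checked by hand from the sums of squared deviations. The following is a minimal pure-Python sketch under that assumption, not a description of how linregress is implemented; the helper name simple_linregress is made up for illustration, and it leaves out pvalue and stderr.

```python
import math

def simple_linregress(x, y):
    """Return (slope, intercept, r) for two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squared deviations and of cross products
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each contain at least two distinct values")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r

a = [15, 12, 8, 8, 7, 7, 7, 6, 5, 3]
b = [10, 25, 17, 11, 13, 17, 20, 13, 9, 15]
slope, intercept, r = simple_linregress(a, b)
print(slope, intercept, r)
```

Up to floating-point rounding, the three numbers should match the slope, intercept and rvalue fields of the LinregressResult shown above.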

If you don't feel like installing scipy, I've used this quick hack, slightly modified from Programming Collective Intelligence:

(Edited for correctness.)

def pearsonr(x, y):
  # Returns only the correlation coefficient (no p-value)
  # Assume len(x) == len(y)
  n = len(x)
  sum_x = float(sum(x))
  sum_y = float(sum(y))
  sum_x_sq = sum(xi * xi for xi in x)
  sum_y_sq = sum(yi * yi for yi in y)
  # Sum of the element-wise products (zip replaces Python 2's itertools.imap)
  psum = sum(xi * yi for xi, yi in zip(x, y))
  num = psum - (sum_x * sum_y/n)
  den = pow((sum_x_sq - pow(sum_x, 2) / n) * (sum_y_sq - pow(sum_y, 2) / n), 0.5)
  if den == 0: return 0
  return num / den

The following code is a straightforward interpretation of the definition:

import math

def average(x):
    assert len(x) > 0
    return float(sum(x)) / len(x)

def pearson_def(x, y):
    assert len(x) == len(y)
    n = len(x)
    assert n > 0
    avg_x = average(x)
    avg_y = average(y)
    diffprod = 0
    xdiff2 = 0
    ydiff2 = 0
    for idx in range(n):
        xdiff = x[idx] - avg_x
        ydiff = y[idx] - avg_y
        diffprod += xdiff * ydiff
        xdiff2 += xdiff * xdiff
        ydiff2 += ydiff * ydiff

    return diffprod / math.sqrt(xdiff2 * ydiff2)

Test:

print("%.12f" % pearson_def([1,2,3], [1,5,7]))

returns

0.981980506062

This agrees with Excel, this calculator, and SciPy (also NumPy), which return 0.981980506, 0.9819805060619657, and 0.98198050606196574, respectively.

R :

> cor( c(1,2,3), c(1,5,7))
[1] 0.9819805

EDIT: Fixed a bug pointed out by a commenter.

You can do this with pandas.DataFrame.corr, too:

import pandas as pd
a = [[1, 2, 3],
     [5, 6, 9],
     [5, 6, 11],
     [5, 6, 13],
     [5, 3, 13]]
df = pd.DataFrame(data=a)
df.corr()

This gives

          0         1         2
0  1.000000  0.745601  0.916579
1  0.745601  1.000000  0.544248
2  0.916579  0.544248  1.000000

Rather than relying on numpy/scipy, I think the easiest approach is to understand and code the steps for calculating the Pearson Correlation Coefficient (PCC) yourself.

import math

# calculates the mean
def mean(x):
    sum = 0.0
    for i in x:
         sum += i
    return sum / len(x) 

# calculates the sample standard deviation
def sampleStandardDeviation(x):
    sumv = 0.0
    for i in x:
         sumv += (i - mean(x))**2
    return math.sqrt(sumv/(len(x)-1))

# calculates the PCC using both the 2 functions above
def pearson(x,y):
    # compute each mean and standard deviation once, not once per element
    mean_x, mean_y = mean(x), mean(y)
    sd_x, sd_y = sampleStandardDeviation(x), sampleStandardDeviation(y)

    # standard scores of x and y
    scorex = [(i - mean_x)/sd_x for i in x]
    scorey = [(j - mean_y)/sd_y for j in y]

    # multiplies both lists together into 1 list (hence zip) and sums the whole list
    return (sum([i*j for i,j in zip(scorex,scorey)]))/(len(x)-1)

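The PCC basically shows how strongly the two variables/lists are linearly correlated. Its value ranges from -1 to 1. A value between 0 and 1 denotes a positive correlation, a value of 0 denotes no linear correlation at all, and a value between -1 and 0 denotes a negative correlation. Note that this describes the strength of the correlation; the function above does not compute a p-value.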
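
To make these ranges concrete, here is a small pure-Python sketch; the helper pearson_r and the sample data are made up for illustration. It also suggests why a value of 0 only rules out a linear relationship: in the last case y is fully determined by x, yet the coefficient is 0.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2 * xi + 1 for xi in x])        # exact increasing line
r_down = pearson_r(x, [10 - 3 * xi for xi in x])     # exact decreasing line
r_curve = pearson_r(x, [(xi - 3) ** 2 for xi in x])  # parabola symmetric about x = 3
print(r_up, r_down, r_curve)
```

The three results are +1, -1 and 0 (up to floating-point rounding), so a coefficient near 0 should be read as "no linear trend", not as "no relationship".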

Hmm, many of these responses have long and hard-to-read code...

I'd suggest using numpy, with its nifty features for working with arrays:

import numpy as np
def pcc(X, Y):
   ''' Compute Pearson Correlation Coefficient. '''
   # Centre X and Y (new arrays, so the caller's data is not modified in place)
   X = X - X.mean(0)
   Y = Y - Y.mean(0)
   # Standardise X and Y
   X = X / X.std(0)
   Y = Y / Y.std(0)
   # Compute mean product
   return np.mean(X*Y)

# Using it on a random example
from random import random
X = np.array([random() for x in range(100)])
Y = np.array([random() for x in range(100)])
pcc(X, Y)

This is an implementation of the Pearson correlation function using numpy:


def corr(data1, data2):
    "data1 & data2 should be numpy arrays."
    mean1 = data1.mean() 
    mean2 = data2.mean()
    std1 = data1.std()
    std2 = data2.std()

#     corr = ((data1-mean1)*(data2-mean2)).mean()/(std1*std2)
    corr = ((data1*data2).mean()-mean1*mean2)/(std1*std2)
    return corr

Here is a variant of mkh's answer that uses numba and runs much faster than both that answer and scipy.stats.pearsonr.

import numba

@numba.jit
def corr(data1, data2):
    M = data1.size

    sum1 = 0.
    sum2 = 0.
    for i in range(M):
        sum1 += data1[i]
        sum2 += data2[i]
    mean1 = sum1 / M
    mean2 = sum2 / M

    var_sum1 = 0.
    var_sum2 = 0.
    cross_sum = 0.
    for i in range(M):
        var_sum1 += (data1[i] - mean1) ** 2
        var_sum2 += (data2[i] - mean2) ** 2
        cross_sum += (data1[i] * data2[i])

    std1 = (var_sum1 / M) ** .5
    std2 = (var_sum2 / M) ** .5
    cross_mean = cross_sum / M

    return (cross_mean - mean1 * mean2) / (std1 * std2)

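Calculating the Pearson coefficient in Python using pandas: since your data consists of lists, I would suggest trying this approach. It makes it easy to interact with your data and manipulate it from the console, because you can inspect the data structure and update it as you wish. You can also export the data set, save it, and add new data from the Python console for later analysis. The code is simpler and takes fewer lines. I am assuming you need a few quick lines of code to screen your data before further analysis.

Example: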

data = {'list 1':[2,4,6,8],'list 2':[4,16,36,64]}

import pandas as pd # To convert your lists into a pandas DataFrame

df = pd.DataFrame(data, columns = ['list 1','list 2'])

from scipy import stats # For in-built method to get PCC

pearson_coef, p_value = stats.pearsonr(df["list 1"], df["list 2"]) #define the columns to perform calculations on
print("Pearson Correlation Coefficient: ", pearson_coef, "and a P-value of:", p_value) # Results

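However, you did not post your data, so I cannot see the size of the data set or any transformations that might be needed before the analysis.

Here is an implementation of Pearson correlation based on sparse vectors. Each vector is expressed as a list of (index, value) tuples. The two sparse vectors can have different numbers of entries, but the overall vector size must be the same for both. This is useful for text mining applications, where most features are bag-of-words counts and the vector size is therefore extremely large, so calculations are usually performed on sparse vectors.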

from math import sqrt

def get_pearson_corelation(self, first_feature_vector=[], second_feature_vector=[], length_of_featureset=0):
    indexed_feature_dict = {}
    if first_feature_vector == [] or second_feature_vector == [] or length_of_featureset == 0:
        raise ValueError("Empty feature vectors or zero length of featureset in get_pearson_corelation")

    sum_a = sum(value for index, value in first_feature_vector)
    sum_b = sum(value for index, value in second_feature_vector)

    avg_a = float(sum_a) / length_of_featureset
    avg_b = float(sum_b) / length_of_featureset

    mean_sq_error_a = sqrt((sum((value - avg_a) ** 2 for index, value in first_feature_vector)) + ((
        length_of_featureset - len(first_feature_vector)) * ((0 - avg_a) ** 2)))
    mean_sq_error_b = sqrt((sum((value - avg_b) ** 2 for index, value in second_feature_vector)) + ((
        length_of_featureset - len(second_feature_vector)) * ((0 - avg_b) ** 2)))

    covariance_a_b = 0

    #calculate covariance for the sparse vectors
    for item in first_feature_vector:
        if len(item) != 2:
            # format the message before raising (the % must be inside the call)
            raise ValueError("Invalid feature frequency tuple in featureVector: %s" % (item,))
        indexed_feature_dict[item[0]] = item[1]
    count_of_features = 0
    for item in second_feature_vector:
        count_of_features += 1
        if len(item) != 2:
            raise ValueError("Invalid feature frequency tuple in featureVector: %s" % (item,))
        if item[0] in indexed_feature_dict:
            covariance_a_b += ((indexed_feature_dict[item[0]] - avg_a) * (item[1] - avg_b))
            del (indexed_feature_dict[item[0]])
        else:
            covariance_a_b += (0 - avg_a) * (item[1] - avg_b)

    for index in indexed_feature_dict:
        count_of_features += 1
        covariance_a_b += (indexed_feature_dict[index] - avg_a) * (0 - avg_b)

    #adjust covariance with rest of vector with 0 value
    covariance_a_b += (length_of_featureset - count_of_features) * -avg_a * -avg_b

    if mean_sq_error_a == 0 or mean_sq_error_b == 0:
        return -1
    else:
        return float(covariance_a_b) / (mean_sq_error_a * mean_sq_error_b)

Unit tests:

def test_get_get_pearson_corelation(self):
    vector_a = [(1, 1), (2, 2), (3, 3)]
    vector_b = [(1, 1), (2, 5), (3, 7)]
    self.assertAlmostEqual(self.sim_calculator.get_pearson_corelation(vector_a, vector_b, 3), 0.981980506062, 3, None, None)

    vector_a = [(1, 1), (2, 2), (3, 3)]
    vector_b = [(1, 1), (2, 5), (3, 7), (4, 14)]
    self.assertAlmostEqual(self.sim_calculator.get_pearson_corelation(vector_a, vector_b, 5), -0.0137089240555, 3, None, None)
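
One way to sanity-check the sparse implementation is to expand each sparse vector to its dense form and apply the textbook formula. This is a minimal sketch that assumes the indices are 1-based, as they appear to be in the unit tests; to_dense and dense_pearson are hypothetical helpers, not part of the original class.

```python
import math

def to_dense(sparse_vector, length):
    """Expand [(index, value), ...] (1-based indices assumed) into a dense list."""
    dense = [0.0] * length
    for index, value in sparse_vector:
        dense[index - 1] = float(value)
    return dense

def dense_pearson(x, y):
    """Textbook Pearson correlation of two equal-length dense lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Same inputs as the two unit-test cases
r1 = dense_pearson(to_dense([(1, 1), (2, 2), (3, 3)], 3),
                   to_dense([(1, 1), (2, 5), (3, 7)], 3))
r2 = dense_pearson(to_dense([(1, 1), (2, 2), (3, 3)], 5),
                   to_dense([(1, 1), (2, 5), (3, 7), (4, 14)], 5))
print(r1, r2)
```

Both values should agree with the expected numbers in the unit tests, which suggests the zero-padding terms in the sparse version account correctly for the missing entries.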

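If you are looking for a correlation in a particular direction (negative or positive), you may wonder how to interpret the probability. Here is a function I wrote to help with that. It might even be right!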

It's based on info I gleaned from http://www.vassarstats.net/rsig.html and http://en.wikipedia.org/wiki/Student%27s_t_distribution, thanks to other answers posted here.

from scipy import stats

# Given (possibly random) variables, X and Y, and a correlation direction,
# returns:
#  (r, p),
# where r is the Pearson correlation coefficient, and p is the p-value of
# the test for a correlation in the given direction.
#
# direction:
#  if positive, p is the one-tailed p-value for a positive correlation in
#    the population sampled by X and Y
#  if negative, p is the one-tailed p-value for a negative correlation
#  if 0, p is the two-tailed p-value (correlation in either direction)
def probabilityNotCorrelated(X, Y, direction=0):
    x = len(X)
    if x != len(Y):
        raise ValueError("variables not same len: " + str(x) + ", and " + \
                         str(len(Y)))
    if x < 6:
        raise ValueError("must have at least 6 samples, but have " + str(x))
    (corr, prb_2_tail) = stats.pearsonr(X, Y)

    if not direction:
        return (corr, prb_2_tail)

    prb_1_tail = prb_2_tail / 2
    if corr * direction > 0:
        return (corr, prb_1_tail)

    return (corr, 1 - prb_1_tail)
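
If SciPy is not available, or the sample is very small, one alternative is an exact permutation test. This is a different technique from the function above: it shuffles Y into every possible order and counts how often the shuffled correlation is at least as extreme as the observed one. The sketch below is only practical for small samples, since it enumerates all n! orderings; the function names and the tolerance constant are made up for illustration.

```python
import itertools
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def permutation_p_values(x, y, tol=1e-12):
    """Return (r, p_positive, p_negative, p_two_sided) by exact enumeration."""
    r_obs = pearson_r(x, y)
    total = at_least = at_most = as_extreme = 0
    for perm in itertools.permutations(y):
        r = pearson_r(x, perm)
        total += 1
        if r >= r_obs - tol:            # evidence for a positive correlation
            at_least += 1
        if r <= r_obs + tol:            # evidence for a negative correlation
            at_most += 1
        if abs(r) >= abs(r_obs) - tol:  # either direction
            as_extreme += 1
    return r_obs, at_least / total, at_most / total, as_extreme / total

x = [1, 2, 3, 4, 5, 6]
y = [2, 4, 6, 8, 10, 12]
r, p_pos, p_neg, p_two = permutation_p_values(x, y)
print(r, p_pos, p_neg, p_two)
```

For this perfectly linear sample only 1 of the 720 orderings reaches the observed r in the positive direction and 2 reach it in absolute value, so the one-sided p-value is half the two-sided one, mirroring the prb_2_tail / 2 step above.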

You can take a look at this article. This is a well-documented example for calculating correlation based on historical forex currency pairs data from multiple files using pandas library (for Python), and then generating a heatmap plot using seaborn library.

http://www.tradinggeeks.net/2015/08/calculating-correlation-in-python/

I have a very simple and easy to understand solution for this. For two arrays of equal length, Pearson coefficient can be easily computed as follows:

import numpy as np

def manual_pearson(a,b):
  """
  Accepts two arrays of equal length, and computes correlation coefficient.
  Numerator is the sum of product of (a - a_avg) and (b - b_avg),
  while denominator is the product of a_std and b_std multiplied by
  length of array.
  """
  # convert to float arrays so that plain Python lists are accepted too
  a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
  a_avg, b_avg = np.average(a), np.average(b)
  a_stdev, b_stdev = np.std(a), np.std(b)
  n = len(a)
  denominator = a_stdev * b_stdev * n
  numerator = np.sum(np.multiply(a-a_avg, b-b_avg))
  p_coef = numerator/denominator
  return p_coef

def pearson(x,y):
  n=len(x)
  vals=range(n)

  sumx=sum([float(x[i]) for i in vals])
  sumy=sum([float(y[i]) for i in vals])

  sumxSq=sum([x[i]**2.0 for i in vals])
  sumySq=sum([y[i]**2.0 for i in vals])

  pSum=sum([x[i]*y[i] for i in vals])
  # Calculating Pearson correlation
  num=pSum-(sumx*sumy/n)
  den=((sumxSq-pow(sumx,2)/n)*(sumySq-pow(sumy,2)/n))**.5
  if den==0: return 0
  r=num/den
  return r

Reference URL: https://stackoverflow.com/questions/3949226/calculating-pearson-correlation-and-significance-in-python
