How to append with a loop in Python

I have been searching for hours. I have a pivot table with 190 columns that I need to loop over in my script.
I have this script:
corr = pg.pairwise_corr(df_pvt, columns=[[df_pvt.columns[0]], list(df_pvt.columns)], method='pearson')[['X','Y','r']]
This produces the following output:
X ... r
0 CORSEC_Mainstream Media_Negative Count ... 1.000
1 CORSEC_Mainstream Media_Negative Count ... 0.960
2 CORSEC_Mainstream Media_Negative Count ... -0.203
3 CORSEC_Mainstream Media_Negative Count ... -0.446
4 CORSEC_Mainstream Media_Negative Count ... 0.488
.. ... ... ...
179 CORSEC_Mainstream Media_Negative Count ... -0.483
180 CORSEC_Mainstream Media_Negative Count ... -0.487
181 CORSEC_Mainstream Media_Negative Count ... 0.145
182 CORSEC_Mainstream Media_Negative Count ... 0.128
183 CORSEC_Mainstream Media_Negative Count ... 0.520
[184 rows x 3 columns]
I want to append the results for the other 189 columns, but this loop only ever produces two appended variables, replacing the result on every iteration up to the 189th:
for var in list(range(1,189)):
    corr_all = corr.append(pg.pairwise_corr(df_pvt, columns=[[df_pvt.columns[var]], list(df_pvt.columns)], method='pearson')[['X','Y','r']])
print(corr_all)
Any advice?
Edit:
It works like this:
corr = pg.pairwise_corr(df_pvt, columns=[[df_pvt.columns[0]], list(df_pvt.columns)], method='pearson')[['X','Y','r']]
corr_1 = corr.append(pg.pairwise_corr(df_pvt, columns=[[df_pvt.columns[1]], list(df_pvt.columns)], method='pearson')[['X','Y','r']])
corr_2 = corr_1.append(pg.pairwise_corr(df_pvt, columns=[[df_pvt.columns[2]], list(df_pvt.columns)], method='pearson')[['X','Y','r']])
But how do I loop it all the way to corr_189?

You can try building a list of Pearson coefficients for each of your 189 columns, then concatenating each one onto df_final, the dataframe that will hold all 190 columns:
corr = pd.DataFrame(corr)
df_final = pd.DataFrame(corr)
for k in range(189):
    list_Pearson_k = 'formula to compute a list of pearson values'
    df_list_k = pd.DataFrame(list_Pearson_k)
    df_final = pd.concat([df_final, df_list_k], axis=1)

Python's list append method mutates the list in place (and returns None), so you can accumulate all the results instead of overwriting corr_all on every iteration. Change your code to this:
corr_all = []
for var in range(1, 190):  # columns 1 through 189
    corr_all.append(pg.pairwise_corr(df_pvt, columns=[[df_pvt.columns[var]], list(df_pvt.columns)], method='pearson')[['X','Y','r']])
print(corr_all)
This should help.
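If the goal is one long DataFrame like in the edit, the list of per-column frames can then be stacked with pd.concat (DataFrame.append was removed in pandas 2.0, so concat is the durable way to do this). A minimal sketch, with toy frames standing in for the pg.pairwise_corr output:

```python
# Stack per-column result frames into one DataFrame with pd.concat.
# The toy frames below stand in for pg.pairwise_corr(...)[['X','Y','r']].
import pandas as pd

frames = []
for var in range(3):  # in the real script: range(190) over df_pvt.columns
    frames.append(pd.DataFrame({'X': ['col_%d' % var], 'Y': ['other'], 'r': [0.5]}))

# ignore_index=True gives a clean 0..N-1 index on the combined frame
corr_all = pd.concat(frames, ignore_index=True)
print(corr_all.shape)
```

Building the list first and concatenating once at the end is also much faster than concatenating inside the loop, since each concat copies all the data accumulated so far.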

Related

Create a nested dictionary using columns and row names as keys with dictionary comprehension

Context: I have the following dataframe:
gene_id Control_3Aligned.sortedByCoord.out.gtf Control_4Aligned.sortedByCoord.out.gtf ... NET_101Aligned.sortedByCoord.out.gtf NET_103Aligned.sortedByCoord.out.gtf NET_105Aligned.sortedByCoord.out.gtf
0 ENSG00000213279|Z97192.2 0 0 ... 3 2 7
1 ENSG00000132680|KHDC4 625 382 ... 406 465 262
2 ENSG00000145041|DCAF1 423 104 ... 231 475 254
3 ENSG00000102547|CAB39L 370 112 ... 265 393 389
4 ENSG00000173826|KCNH6 0 0 ... 0 0 0
And I'd like to get a nested dictionary as this example:
{Control_3Aligned.sortedByCoord.out.gtf:
{ENSG00000213279|Z97192.2:0,
ENSG00000132680|KHDC4:625,...},
Control_4Aligned.sortedByCoord.out.gtf:
{ENSG00000213279|Z97192.2:0,
ENSG00000132680|KHDC4:382,...}}
So the general format would be:
{column_name : {row_name:value,...},...}
I was trying something like this:
sample_dict = {}
for column in df.columns[1:]:
    for index in range(0, len(df.index) + 1):
        sample_dict.setdefault(column, {row_name: value for row_name, value in zip(df.iloc[index, 0], df.loc[index, column])})
        sample_dict[column] += {row_name: value for row_name, value in zip(df.iloc[index, 0], df.loc[index, column])}
But I keep getting TypeError: 'numpy.int64' object is not iterable. The problem seems to be in the zip(), since zip only takes iterables and I'm passing it scalars, and most likely also in the way I'm populating the dictionary.
Any help is very welcome! Thank you in advance
Managed to do it like this:
sample_dict = {}
gene_list = []
for index in range(0, len(df.index)):
    temp_data = df.loc[index, 'gene_id']
    gene_list.append(temp_data)
for column in df.columns[1:]:
    column_list = df.loc[:, column]
    gene_dict = {}
    for index in range(0, len(df.index)):
        if gene_list[index] not in gene_dict:
            gene_dict[gene_list[index]] = df.loc[index, column]
    sample_dict[column] = gene_dict

dict_pairs = sample_dict.items()
pairs_iterator = iter(dict_pairs)
first_pair = next(pairs_iterator)
first_pair
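For reference, pandas can build the same {column_name: {row_name: value}} nesting in one step: set gene_id as the index and call to_dict(), whose default orientation is exactly column -> {index: value}. A sketch on toy data standing in for the real counts table:

```python
# One-step nested dict: to_dict() with the default orient returns
# {column: {index_label: value}}, so making gene_id the index gives
# the desired {column_name: {row_name: value}} shape directly.
import pandas as pd

df = pd.DataFrame({
    'gene_id': ['ENSG01|A', 'ENSG02|B'],   # toy stand-ins for real gene ids
    'Control_3': [0, 625],
    'Control_4': [0, 382],
})

sample_dict = df.set_index('gene_id').to_dict()
print(sample_dict['Control_3'])
```

This also avoids the row-by-row df.loc lookups, which get slow on large expression matrices.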

Pandas replace not working on known values in a Dataframe

I have a quick question about pandas replace.
import pandas as pd
import numpy as np
infile = pd.read_csv('sum_prog.csv')
df = pd.DataFrame(infile)
df_no_na = df.replace({0, np.nan})
df_no_na = df_no_na.dropna()
print(df_no_na.head())
print(df.head())
This code will return:
Cell ID Duration ... Overall Angle Median Overall Euclidean Median
0 372003 148 ... 0.0 1.9535615635898635
1 372005 536 ... 45.16432169606084 37.85959470668756
2 372006 840 ... 0.0 1.0821891332154392
3 372010 840 ... 0.0 1.4200380286464513
4 372011 840 ... 0.0 1.0594536197046835
[5 rows x 20 columns]
Cell ID Duration ... Overall Angle Median Overall Euclidean Median
0 372003 148 ... 0.0 1.9535615635898635
1 372005 536 ... 45.16432169606084 37.85959470668756
2 372006 840 ... 0.0 1.0821891332154392
3 372010 840 ... 0.0 1.4200380286464513
4 372011 840 ... 0.0 1.0594536197046835
I have done this exact same thing before and it worked; I have no idea why it won't now. Any help would be awesome, thanks!
You are passing df.replace() a set instead of a dictionary. You need to replace {0, np.nan} with {0: np.nan}:
import pandas as pd
import numpy as np
infile = pd.read_csv('sum_prog.csv')
df = pd.DataFrame(infile)
print(df)
df_no_na = df.replace({0: np.nan}) # change this line
print(df_no_na)
df_no_na = df_no_na.dropna()
print(df_no_na)
index Cell_ID Duration Overall_Angle_Median Overall_Euclidean_Median
1 1.0 372005.0 536.0 45.164322 37.859595
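The difference is easy to demonstrate on toy data: {0, np.nan} is a Python set literal, while {0: np.nan} is a dict that tells replace() to substitute NaN for every 0, after which dropna() removes the affected rows:

```python
# {0, np.nan} is a set; {0: np.nan} is a dict mapping old -> new values.
# Only the dict form tells replace() what to put in place of 0.
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 3], 'b': [2.0, 0.0, 4.0]})
out = df.replace({0: np.nan}).dropna()
print(out)  # only the row with no zeros (a=3, b=4.0) survives
```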

Why is pandas.join() not merging correctly along index?

I'm trying to merge two dataframes with identical indices into a single dataframe, but I can't seem to get it working. I expect the repeated values due to the resample function, and the final dataframe seems to have sorted the indices in ascending order, which is fine. But why is it now 2x as long?
Here is the code:
Original dataframe:
default student balance income
0 No No 729.526495 44361.625074
1 No Yes 817.180407 12106.134700
2 No No 1073.549164 31767.138947
3 No No 529.250605 35704.493935
4 No No 785.655883 38463.495879
... ... ... ... ...
9995 No No 711.555020 52992.378914
9996 No No 757.962918 19660.721768
9997 No No 845.411989 58636.156984
9998 No No 1569.009053 36669.112365
9999 No Yes 200.922183 16862.952321
10000 rows × 4 columns
X = default[['balance','income']]
y = default['default']
boot = resample(X,y,replace=True,n_samples = len(X),random_state=1)
#convert to dataframe
boot = np.array(boot)
X = np.array(boot)[0]
y = np.array(boot)[1]
df = pd.DataFrame(X,index = X.index)
dfy = pd.DataFrame(y,index=y.index)
df = df.join(dfy)
X dataframe:
balance income
235 964.820253 34390.746035
5192 0.000000 29322.631394
905 1234.476479 31313.374575
7813 1598.020831 39163.361056
2895 1270.092810 16809.006452
... ... ...
7920 761.988491 39172.945235
1525 916.536937 20130.915258
4981 1037.573018 18769.579024
8104 912.065531 62142.061061
6990 1341.615739 26319.015588
[10000 rows x 2 columns]
Y dataframe
default
235 No
5192 No
905 No
7813 Yes
2895 No
... ...
7920 No
1525 No
4981 No
8104 No
6990 No
[10000 rows x 1 columns]
Combine to give this for some reason:
balance income default
0 729.526495 44361.625074 No
0 729.526495 44361.625074 No
0 729.526495 44361.625074 No
0 729.526495 44361.625074 No
1 817.180407 12106.134700 No
... ... ... ...
9998 1569.009053 36669.112365 No
9999 200.922183 16862.952321 No
9999 200.922183 16862.952321 No
9999 200.922183 16862.952321 No
9999 200.922183 16862.952321 No
20334 rows × 3 columns
Can someone explain where I'm going wrong?
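The extra rows come from how join aligns on index labels: sampling with replacement repeats labels, and join matches every occurrence of a label on the left with every occurrence on the right. A minimal sketch of the effect, with toy values standing in for the resampled frames:

```python
# join aligns on index labels; when a label appears k times on both
# sides (as it does after resampling with replacement), the joined
# result gets k * k rows for that label -- which is how 10000 rows
# can balloon to 20334.
import pandas as pd

X = pd.DataFrame({'balance': [729.5, 729.5]}, index=[0, 0])  # label 0 twice
y = pd.DataFrame({'default': ['No', 'No']}, index=[0, 0])

joined = X.join(y)
print(len(joined))  # 4 rows: each left 0 pairs with each right 0
```

Since the resampled X and y are already aligned row by row, one way out is to combine them positionally instead, e.g. pd.concat([X.reset_index(drop=True), y.reset_index(drop=True)], axis=1).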

Sort letters in ascending order ('a-z') in Python after using value_counts

I imported my data file and isolated the first letter of each word, and provided the count of the word. My next step is to sort the letters in ascending order 'a-z'. This is the code that I have right now:
import pandas as pd
df = pd.read_csv("text.txt", names=['FirstNames'])
df
df['FirstLetter'] = df['FirstNames'].str[:1]
df
df['FirstLetter'] = df['FirstLetter'].str.lower()
df
df['FirstLetter'].value_counts()
df
df2 = df['FirstLetter'].index.value_counts()
df2
Using .index.value_counts() wasn't working for me. It returned this output:
Out[72]:
2047 1
4647 1
541 1
4639 1
2592 1
545 1
4643 1
2596 1
549 1
2600 1
2612 1
553 1
4651 1
2604 1
557 1
4655 1
2608 1
561 1
2588 1
4635 1
..
How can I fix this?
You can use the sort_index() function. This should work: df['FirstLetter'].value_counts().sort_index()
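To see why sort_index() is needed: value_counts() orders its result by count, descending, while sort_index() re-sorts it by the letters themselves. A small sketch with made-up letters:

```python
# value_counts() sorts by frequency (largest first); sort_index()
# re-sorts the resulting Series alphabetically by its index (the letters).
import pandas as pd

letters = pd.Series(['b', 'a', 'c', 'a', 'b', 'a'])
counts = letters.value_counts().sort_index()
print(counts)  # a: 3, b: 2, c: 1, in alphabetical order
```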

Compare some columns from some tables using python

I need to compare two values MC and JT from 2 tables:
EID MolIdx TEStart TEEnd TE TZone TBulkBE TBulkAE MC JT zavg vabs vzavg xyd.x xyd.y xydist nnbw vabsprev midhb
0 370 36700 36800 110 20 36150 37090 0 0 -8.25705 0.219113 -0.000800014 20.8926 41.4347 5.75852 0 4.13067 0
1 423 17950 18150 210 180 17400 18430 1 0 -4.26426 0.586578 -0.053 77.22 85.2104 22.0534 0 3.551 0
2 468 41790 42020 240 50 41360 42380 0 0 7.82681 0.181248 -0.00269566 90.0646 92.7698 5.0841 0 4.19304 0
and
EID MolIdx TEStart TEEnd TE TZone TBulkBE TBulkAE MC JT zavg vabs vzavg xyd.x xyd.y xydist nnbw vabsprev midhb
0 370 36700 36800 110 20 36150 37090 0 0 -0.846655 0.0218695 2.59898e-05 2.0724 4.1259 0.583259 10 0.412513 0
1 423 17950 18150 210 180 17400 18780 1 0 -0.453311 0.058732 -0.00526783 7.7403 8.52544 2.19627 0 0.354126 0
2 468 41790 42020 240 70 41360 42380 0 0 0.743716 0.0181613 -0.000256186 9.08777 9.21395 0.502506 0 0.419265 0
I need to do it using the csv module. I know how to do it with pandas and xlrd, but I don't know how with csv.
Desired output:
Number_of_strings MC JT
and print the rows where the values differ.
import csv
old = csv.reader(open('old.csv', 'rb'), delimiter=',')
row1 = old.next()
new = csv.reader(open('new.csv', 'rb'), delimiter=',')
row2 = new.next()
if (row1[8] == row2[8]) and (row1[9] == row2[9]):
    continue
else:
    print row1[0] + ':' + row1[8] + '!=' + row2[8]
You can try something like the following:
import csv

old = list(csv.reader(open('old.csv', newline=''), delimiter=','))
new = list(csv.reader(open('new.csv', newline=''), delimiter=','))
old = list(zip(*old))  # old[i] is now column i
new = list(zip(*new))
print(['%s-%s-%s' % (a, b, c) for a, b, c in zip(old[0], new[8], old[8]) if b != c])
First, we get a list of lists. zip(*x) will transpose a list of lists. The rest should be easy to decipher ...
You can actually put whatever you want within the string ...
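A Python 3, row-by-row variant of the same idea: walk both files in step and report rows whose MC (column 8) or JT (column 9) differ. Here io.StringIO stands in for opening old.csv and new.csv, the data rows are trimmed copies of the tables above, and one JT value is deliberately changed so the sketch has a difference to find:

```python
# Row-by-row comparison: print rows whose MC (col 8) or JT (col 9)
# differ between the two files. Assumes matching row order and one
# header row in each file.
import csv
import io

old_text = (
    "EID,MolIdx,TEStart,TEEnd,TE,TZone,TBulkBE,TBulkAE,MC,JT\n"
    "0,370,36700,36800,110,20,36150,37090,0,0\n"
    "1,423,17950,18150,210,180,17400,18430,1,0\n"
)
new_text = (
    "EID,MolIdx,TEStart,TEEnd,TE,TZone,TBulkBE,TBulkAE,MC,JT\n"
    "0,370,36700,36800,110,20,36150,37090,0,0\n"
    "1,423,17950,18150,210,180,17400,18780,1,1\n"  # JT changed 0 -> 1
)

old_rows = csv.reader(io.StringIO(old_text), delimiter=',')
new_rows = csv.reader(io.StringIO(new_text), delimiter=',')
next(old_rows)  # skip the header rows
next(new_rows)

diffs = []
for row1, row2 in zip(old_rows, new_rows):
    if row1[8] != row2[8] or row1[9] != row2[9]:
        diffs.append(row1[0])
        print(row1[0] + ': MC/JT ' + row1[8] + ',' + row1[9] + ' != ' + row2[8] + ',' + row2[9])
```

With real files, replace the io.StringIO objects with open('old.csv', newline='') and open('new.csv', newline='').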
