Split dataframe in pandas - python

I have a CSV file that I read using pandas. I want to split the dataframe into chunks based on a specified column:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
list_of_classes = []
# Reading file
fileName = 'Training.csv'
df = pd.read_csv(fileName)
classID = df.iloc[:, -2]
len(classID)
df.iloc[0, -2]
# collect the unique class labels
for i in range(len(classID)):
    print(classID[i])
    if classID[i] not in list_of_classes:
        list_of_classes.append(classID[i])
for i in range(len(df)):
    ...
UPDATE
Say the dataframe looks like:
Feature0  Feature1  Feature2  Feature3  .........  classID  lastColum
190       565       35474     0.336283  2.973684   255      0
311       984       113199    0.316057  3.163987   155      0
310       984       94197     0.315041  3.174194   1005     0
280       984       116359    0.284553  3.514286   255      18
249       984       107482    0.253049  3.951807   1005     0
283       984       132343    0.287602  3.477032   155      0
213       984       88244     0.216463  4.619718   255      0
839       984       203139    0.852642  1.172825   255      0
376       984       105133    0.382114  2.617021   1005     0
324       984       129209    0.329268  3.037037   1005     0
In this example, the result I'm aiming for is 3 dataframes, each of which contains only one classID: 155, 1005, or 255.
My question is: is there a cleaner way to do this?

Split to 3 separate CSV files:
df.groupby('classID') \
  .apply(lambda x: x.to_csv(r'c:/temp/{}.csv'.format(x.name), index=False))
Generate a dictionary of "split" DataFrames:
In [210]: dfs = {g:x for g,x in df.groupby('classID')}
In [211]: dfs.keys()
Out[211]: dict_keys([155, 255, 1005])
In [212]: dfs[155]
Out[212]:
   Feature0  Feature1  Feature2  Feature3  classID  lastColum
1       311       984    113199  0.316057      155          0
5       283       984    132343  0.287602      155          0
In [213]: dfs[255]
Out[213]:
   Feature0  Feature1  Feature2  Feature3  classID  lastColum
0       190       565     35474  0.336283      255          0
3       280       984    116359  0.284553      255         18
6       213       984     88244  0.216463      255          0
7       839       984    203139  0.852642      255          0
In [214]: dfs[1005]
Out[214]:
   Feature0  Feature1  Feature2  Feature3  classID  lastColum
2       310       984     94197  0.315041     1005          0
4       249       984    107482  0.253049     1005          0
8       376       984    105133  0.382114     1005          0
9       324       984    129209  0.329268     1005          0
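As a follow-up, the CSV-per-class split from the first snippet can also be written as an explicit loop, which some find easier to read. This is a minimal sketch; the c:/temp output directory is carried over from the answer above and is an assumption:
import pandas as pd
df = pd.read_csv('Training.csv')
# groupby iterates over (group key, sub-DataFrame) pairs,
# so each classID gets its own CSV file
for class_id, group in df.groupby('classID'):
    group.to_csv(r'c:/temp/{}.csv'.format(class_id), index=False)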

Here is an example of how you can do it:
import pandas as pd
df = pd.DataFrame({'A': list('abcdef'), 'part': [1, 1, 1, 2, 2, 2]})
parts = df.part.unique()
for part in parts:
    print(df.loc[df.part == part])
The point is that you take all the unique parts by calling unique() on the series you want to split on.
After that, you can iterate over those parts and do whatever you need with each of them.
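If you want to keep the pieces around instead of just printing them, a common idiom is to materialize the groups into a dictionary; a minimal sketch on the same toy data:
import pandas as pd
df = pd.DataFrame({'A': list('abcdef'), 'part': [1, 1, 1, 2, 2, 2]})
# groupby yields (key, sub-DataFrame) pairs; dict() collects them,
# so parts[1] and parts[2] are the two sub-DataFrames
parts = dict(tuple(df.groupby('part')))
print(parts[1])
print(parts[2])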

Related

sort pivot/dataframe without All row pandas/python

I created a dataframe with the help of a pivot, and I have:
name     x    y    z   All
A      155  202  218   575
C      206  149   45   400
B      368  215  275   858
Total  729  566  538  1833
I would like to sort by column "All", not taking into account row "Total". I am using:
df.sort_values(by = ["All"], ascending = False)
Thank you in advance!
If the Total row is the last one, you can sort the other rows and then concat the last row back:
df = pd.concat([df.iloc[:-1, :].sort_values(by="All"), df.iloc[-1:, :]])
print(df)
Prints:
name     x    y    z   All
C      206  149   45   400
A      155  202  218   575
B      368  215  275   858
Total  729  566  538  1833
You can try the following, although it raises a FutureWarning you should be careful of:
df = df.iloc[:-1,:].sort_values('All',ascending=False).append(df.iloc[-1,:])
This outputs:
    name    x    y    z   All
2  B      368  215  275   858
0  A      155  202  218   575
1  C      206  149   45   400
3  Total  729  566  538  1833
You can get the sorted order without Total (assuming here the last row), then index by position:
import numpy as np
idx = np.argsort(df['All'].iloc[:-1])
df2 = df.iloc[np.r_[idx[::-1], len(df)-1]]
NB: since we are only sorting an indexer here, this should be very fast.
output:
    name    x    y    z   All
2  B      368  215  275   858
0  A      155  202  218   575
1  C      206  149   45   400
3  Total  729  566  538  1833
You can just ignore the last row:
df.iloc[:-1].sort_values(by = ["All"], ascending = False)
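If Total is not guaranteed to be the last row, you can select it by label instead of by position. A sketch, assuming the row labels live in a column called name as in the example above:
import pandas as pd
mask = df['name'].eq('Total')
# sort everything except Total, then put Total back at the end
df2 = pd.concat([df[~mask].sort_values('All', ascending=False), df[mask]])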

Python - Plot linear percentage graph

I have this numpy array:
[
[ 0 0 0 0 0 0 2 0 2 0 0 1 26 0]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 4]
[21477 61607 21999 17913 22470 32390 11987 41977 81676 20668 17997 15278 46281 19884]
[ 5059 13248 5498 3866 2144 6161 2361 8734 16914 3724 4614 3607 11305 2880]
[ 282 1580 324 595 218 525 150 942 187 232 430 343 524 189]
[ 1317 6416 1559 882 599 2520 525 2560 19197 729 1391 1727 2044 1198]
]
I've just created a logarithmic heatmap, which works as intended. However, I would like to create another heatmap that represents a linear scale across rows and shows the percentage value of each position in the matrix, so that each row sums to 100%. Without using seaborn or pandas.
Here you go:
import matplotlib.pyplot as plt
import numpy as np
a = np.array([[0, 0, 0, 0, 0, 0, 2, 0, 2, 0, 0, 1, 26, 0],
              [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4],
              [21477, 61607, 21999, 17913, 22470, 32390, 11987, 41977, 81676, 20668, 17997, 15278, 46281, 19884],
              [5059, 13248, 5498, 3866, 2144, 6161, 2361, 8734, 16914, 3724, 4614, 3607, 11305, 2880],
              [282, 1580, 324, 595, 218, 525, 150, 942, 187, 232, 430, 343, 524, 189],
              [1317, 6416, 1559, 882, 599, 2520, 525, 2560, 19197, 729, 1391, 1727, 2044, 1198]])
# normalize each row so it sums to 1
normalized_a = a / np.sum(a, axis=1)[:, None]
# plot
plt.imshow(normalized_a)
plt.show()
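Since the question also asks to show the percentage value at each position, you can annotate the cells on top of imshow. A sketch reusing the normalized_a array from the answer above (the font size is an arbitrary choice):
fig, ax = plt.subplots()
ax.imshow(normalized_a)
# write the percentage of the row total into each cell
for i in range(normalized_a.shape[0]):
    for j in range(normalized_a.shape[1]):
        ax.text(j, i, '{:.1%}'.format(normalized_a[i, j]),
                ha='center', va='center', fontsize=6)
plt.show()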

Python: replace a column from a dataframe based on another column from another dataframe over an index column

I have a dataframe DF1:
1     2     3  4         ID
1   121  1313  +  102466751
2   112   133  +       6147
3   122   313  -      55207
4   212   413  -     113655
5  1012   343  +      79501
and another dataframe DF2:
no    Ensmbl           ID
1212  ENSG00000146083  22838
1512  ENSG00000198242  6147
1262  ENSG00000134108  55207
1219  ENSG00000167700  113655
1512  ENSG00000070087  521
I am trying to get the following final dataframe DF3, which should look like:
1     2     3  4               ID
1   121  1313  +        102466751
2   112   133  +  ENSG00000198242
3   122   313  -  ENSG00000134108
4   212   413  -  ENSG00000167700
5  1012   343  +              521
where DF3 contains DF2.Ensmbl if and only if DF1.ID == DF2.ID; otherwise DF1.ID remains unchanged.
I wrote in Python:
DF3['ID'] = DF1['ID'].apply(lambda x: DF2['Ensembl'] if DF1['ID'] == DF2['ID'] else DF1['ID'])
The ValueError was:
ValueError: Can only compare identically-labeled Series objects
Any help?
You can merge df2 into df1 and then replace ID with the non-NaN values from the Ensmbl column.
df3 = pd.merge(df1, df2, on="ID", how="left")
m = ~df3["Ensmbl"].isna()
df3.loc[m, "ID"] = df3.loc[m, "Ensmbl"]
print(df3[df1.columns])
Prints:
   1     2     3  4               ID
0  1   121  1313  +        102466751
1  2   112   133  +  ENSG00000198242
2  3   122   313  -  ENSG00000134108
3  4   212   413  -  ENSG00000167700
4  5  1012   343  +            79501
Note: I'm assuming the last ID is 79501 and not 521 (probably a typo.)
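An alternative that avoids the merge entirely is to build a lookup Series and use map with fillna; a minimal sketch under the same column names:
import pandas as pd
# map each ID to its Ensmbl value; unmatched IDs become NaN,
# which fillna replaces with the original ID
lookup = df2.set_index('ID')['Ensmbl']
df1['ID'] = df1['ID'].map(lookup).fillna(df1['ID'])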

Make a new column based on other columns id values - Pandas

How can I make new columns based on another column's id values?
The data looks like this:
value     id
  551  54089
   12  54089
   99  54089
   55  73516
  123  73516
  431  73516
  742  74237
  444  74237
  234  74237
I want the dataset to look like this:
       v1   v2   v3
54089  551   12   99
73516   55  123  431
74237  742  444  234
Use groupby with unstack:
df = df.groupby('id')['value'].apply(lambda x: pd.Series(x.tolist(),
                                                         index=['v1', 'v2', 'v3']))\
       .unstack()
# or
df.groupby('id')['value'].apply(lambda x: pd.DataFrame(x.tolist(),
                                                       index=['v1', 'v2', 'v3']).T)
print(df)
        v1   v2   v3
id
54089  551   12   99
73516   55  123  431
74237  742  444  234
If you have more than 3 values, you can create a little helper that adapts to the size of your dataframe.
import pandas as pd
import numpy as np
# Dummy dataframe
np.random.seed(2016)
df = pd.DataFrame({'id': [54089, 54089, 54089, 73516, 73516, 73516, 73516, 74237, 74237, 74237],
                   'value': np.random.randint(1, 100, 10)})
# Create group
grp = df.groupby('id')
# Create helper column
df['ID_Count'] = grp['value'].cumcount() + 1
# Pivot dataframe using the helper column and add the 'value' column to the pivoted output.
df_out = df.pivot(index='id', columns='ID_Count', values='value').add_prefix('v')
An addition to the excellent answers already provided:
(df.astype({'value': str})
   .groupby('id')
   .agg(','.join)
   .value.str.split(',', expand=True)
   .set_axis(['v1', 'v2', 'v3'], axis=1)
   .astype(int)
)
        v1   v2   v3
id
54089  551   12   99
73516   55  123  431
74237  742  444  234
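A further alternative that avoids the string round-trip entirely is to unstack a cumcount-based index. A sketch, assuming df is still the original value/id frame and every id has the same number of values:
import pandas as pd
# cumcount numbers the rows within each id (0, 1, 2, ...);
# unstacking that level turns the row numbers into columns v1..v3
out = (df.set_index([df.groupby('id').cumcount() + 1, 'id'])['value']
         .unstack(0)
         .add_prefix('v'))
print(out)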

Format Pandas Pivot Table

I ran into a problem formatting a pivot table created by pandas.
I made a matrix table between two columns (A, B) from my source data, using pandas.pivot_table with A as the columns and B as the index.
>> df = PD.read_excel("data.xls")
>> table = PD.pivot_table(df, index=["B"],
                          values='Count', columns=["A"], aggfunc=[NUM.sum],
                          fill_value=0, margins=True, dropna=True)
>> table
It returns as:
     sum
A      1   2   3  All
B
1     23  52   0   75
2     16  35  12   65
3     56   0   0   56
All   95  87  12  196
And I hope to have a format like this:
A All_B
1 2 3
1 23 52 0 75
B 2 16 35 12 65
3 56 0 0 56
All_A 95 87 12 196
How should I do this? Thanks very much in advance.
The table returned by pd.pivot_table is very convenient to work on (it has a single-level index and columns) and normally does NOT require any further format manipulation. But if you insist on changing the format to the one you mentioned in the post, then you need to construct a multi-level index/column using pd.MultiIndex. Here is an example of how to do it.
Before manipulation:
import pandas as pd
import numpy as np
np.random.seed(0)
a = np.random.randint(1, 4, 100)
b = np.random.randint(1, 4, 100)
df = pd.DataFrame(dict(A=a,B=b,Val=np.random.randint(1,100,100)))
table = pd.pivot_table(df, index='A', columns='B', values='Val', aggfunc=sum, fill_value=0, margins=True)
print(table)
B       1     2     3   All
A
1     454   649   770  1873
2     628   576   467  1671
3     376   247   481  1104
All  1458  1472  1718  4648
After:
multi_level_column = pd.MultiIndex.from_arrays([['A', 'A', 'A', 'All_B'], [1,2,3,'']])
multi_level_index = pd.MultiIndex.from_arrays([['B', 'B', 'B', 'All_A'], [1,2,3,'']])
table.index = multi_level_index
table.columns = multi_level_column
print(table)
            A                All_B
            1     2     3
B      1  454   649   770    1873
       2  628   576   467    1671
       3  376   247   481    1104
All_A    1458  1472  1718    4648
