Format Pandas Pivot Table - python

I ran into a problem formatting a pivot table created with pandas.
I built a matrix table between two columns (A, B) of my source data using pandas.pivot_table, with A as the columns and B as the index.
>>> df = PD.read_excel("data.xls")
>>> table = PD.pivot_table(df, index=["B"], values="Count",
...                        columns=["A"], aggfunc=[NUM.sum],
...                        fill_value=0, margins=True, dropna=True)
>>> table
It returns as:
     sum
A      1   2   3  All
B
1     23  52   0   75
2     16  35  12   65
3     56   0   0   56
All   95  87  12  196
And I would like to have a format like this:
        A            All_B
        1   2   3
    1  23  52   0       75
B   2  16  35  12       65
    3  56   0   0       56
All_A  95  87  12      196
How can I do this? Thanks very much in advance.

The table returned by pd.pivot_table is very convenient to work with (it has a single-level index and columns) and normally does NOT require any further format manipulation. But if you insist on changing the format to the one shown in the post, then you need to construct a multi-level index/columns using pd.MultiIndex. Here is an example of how to do it.
Before manipulation:
import pandas as pd
import numpy as np
np.random.seed(0)
a = np.random.randint(1, 4, 100)
b = np.random.randint(1, 4, 100)
df = pd.DataFrame(dict(A=a,B=b,Val=np.random.randint(1,100,100)))
table = pd.pivot_table(df, index='A', columns='B', values='Val', aggfunc=sum, fill_value=0, margins=True)
print(table)
B       1     2     3   All
A
1     454   649   770  1873
2     628   576   467  1671
3     376   247   481  1104
All  1458  1472  1718  4648
After:
multi_level_column = pd.MultiIndex.from_arrays([['A', 'A', 'A', 'All_B'], [1,2,3,'']])
multi_level_index = pd.MultiIndex.from_arrays([['B', 'B', 'B', 'All_A'], [1,2,3,'']])
table.index = multi_level_index
table.columns = multi_level_column
print(table)
          A              All_B
          1     2     3
B     1   454   649   770   1873
      2   628   576   467   1671
      3   376   247   481   1104
All_A    1458  1472  1718   4648
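If the flat single-level table is needed again later (e.g. for further computation), dropping the outer level on both axes undoes the relabeling. A minimal sketch, rebuilding a small table with the same values and outer labels as the example above:

```python
import pandas as pd

# Rebuild a relabeled table like the one above (values and the
# 'A'/'All_B' and 'B'/'All_A' outer labels taken from the example).
table = pd.DataFrame(
    [[454, 649, 770, 1873],
     [628, 576, 467, 1671],
     [376, 247, 481, 1104],
     [1458, 1472, 1718, 4648]],
    index=pd.MultiIndex.from_arrays([['B', 'B', 'B', 'All_A'], [1, 2, 3, '']]),
    columns=pd.MultiIndex.from_arrays([['A', 'A', 'A', 'All_B'], [1, 2, 3, '']]))

# Dropping the outer level restores the flat layout pivot_table produced.
flat = table.copy()
flat.index = flat.index.droplevel(0)
flat.columns = flat.columns.droplevel(0)
```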

Related

text file rows into CSV column python

I have a question.
I have a text file containing data like this:
A 34 45 7789 3475768 443 67 8999 3343 656 8876 802 383358 873 36789 2374859 485994 86960 32838459 3484549 24549 58423
T 3445 574649 68078 59348604 45959 64585304 56568 595 49686 656564 55446 665 677 778 433 545 333 65665 3535
and so on
I want to make a CSV file from this text file, with A and T as the column headings followed by the numbers:
A T
34 3445
45 574649
7789 68078
3475768 59348604
443 45959
EDIT (A lot simpler solution inspired by Michael Butscher's comment):
import pandas as pd
df = pd.read_csv("filename.txt", delimiter=" ")
df.T.to_csv("filename.csv", header=False)
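One caveat with this one-liner: read_csv sizes the frame from the first line, so it only works when no later line has more fields than the first. A hedged sketch that also handles that case by pre-computing the widest line (toy data below stands in for the real file):

```python
import pandas as pd
from pathlib import Path

# Toy stand-in for the question's file (second row shorter than the first).
Path("filename.txt").write_text("A 34 45 7789\nT 3445 574649\n")

# Find the widest line so read_csv allocates enough columns, even if a
# later row were longer than the first.
with open("filename.txt") as f:
    width = max(len(line.split()) for line in f if line.strip())

# dtype=str keeps the numbers exactly as written (no float upcast from
# the NaNs that pad the shorter rows).
df = pd.read_csv("filename.txt", sep=r"\s+", header=None,
                 names=list(range(width)), dtype=str)
df = df.set_index(0).T  # the first token of each line becomes a column header
df.to_csv("filename.csv", index=False)
```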
Here is the code:
import pandas as pd

# Read file
with open("filename.txt", "r") as f:
    data = f.read()

# Split data by lines and remove empty lines
columns = data.split("\n")
columns = [x.split() for x in columns if x != ""]

# Row sizes are different in your example so find max number of rows
column_lengths = [len(x) for x in columns]
max_col_length = max(column_lengths)

data = {}
for i in columns:
    # Add None to end for columns that have less values
    if len(i) < max_col_length:
        i += [None] * (max_col_length - len(i))
    data[i[0]] = i[1:]

# Create dataframe
df = pd.DataFrame(data)

# Create csv
df.to_csv("filename.csv", index=False)
Output should look like this:
A T
0 34 3445
1 45 574649
2 7789 68078
3 3475768 59348604
4 443 45959
5 67 64585304
6 8999 56568
7 3343 595
8 656 49686
9 8876 656564
10 802 55446
11 383358 665
12 873 677
13 36789 778
14 2374859 433
15 485994 545
16 86960 333
17 32838459 65665
18 3484549 3535
19 24549 None
20 58423 None
Here is my code:
import pandas as pd

data = pd.read_csv("text (3).txt", header=None)
Our_Data = pd.DataFrame(data)
for rows in Our_Data:
    New_Data = pd.DataFrame(Our_Data[rows].str.split(' ').tolist()).T
New_Data.columns = New_Data.iloc[0]
New_Data = New_Data[1:]
New_Data.to_csv("filename.csv", index=False)
The output:
A T
1 34 3445
2 45 574649
3 7789 68078
4 3475768 59348604
5 443 45959
6 67 64585304
7 8999 56568
8 3343 595
9 656 49686
10 8876 656564
11 802 55446
12 383358 665
13 873 677
14 36789 778
15 2374859 433
16 485994 545
17 86960 333
18 32838459 65665
19 3484549 3535
20 24549 None
21 58423 None

Python: replace a column from a dataframe based on another column from another dataframe over an index column

I have a dataframe DF1:
1     2     3     4    ID
1     121   1313  +    102466751
2     112   133   +    6147
3     122   313   -    55207
4     212   413   -    113655
5     1012  343   +    79501
and another dataframe DF2:
no     Ensmbl           ID
1212   ENSG00000146083  22838
1512   ENSG00000198242  6147
1262   ENSG00000134108  55207
1219   ENSG00000167700  113655
1512   ENSG00000070087  521
I am trying to get the following final dataframe DF3, which should look like:
1     2     3     4    ID
1     121   1313  +    102466751
2     112   133   +    ENSG00000198242
3     122   313   -    ENSG00000134108
4     212   413   -    ENSG00000167700
5     1012  343   +    521
where DF3 contains DF2.Ensmbl if and only if DF1.ID == DF2.ID; otherwise DF1.ID remains unchanged.
I wrote in Python:
DF3['ID'] = DF1['ID'].apply(lambda x: DF2['Ensembl'] if DF1['ID'] == DF2['ID'] else DF1['ID'])
Value Error was:
ValueError: Can only compare identically-labeled Series objects
Any help?
You can merge into df1 and then replace ID with the non-NaN values from the Ensmbl column.
df3 = pd.merge(df1, df2, on="ID", how="left")
m = ~df3["Ensmbl"].isna()
df3.loc[m, "ID"] = df3.loc[m, "Ensmbl"]
print(df3[df1.columns])
Prints:
1 2 3 4 ID
0 1 121 1313 + 102466751
1 2 112 133 + ENSG00000198242
2 3 122 313 - ENSG00000134108
3 4 212 413 - ENSG00000167700
4 5 1012 343 + 79501
Note: I'm assuming the last ID is 79501 and not 521 (probably a typo.)
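For what it's worth, a merge-free sketch of the same idea: build an ID-to-Ensmbl lookup from df2 with set_index and use Series.map, falling back to the original IDs where there is no match (toy frames below, with the column names assumed from the question):

```python
import pandas as pd

# Toy versions of the question's frames.
df1 = pd.DataFrame({"col1": [1, 2, 3],
                    "ID": [102466751, 6147, 55207]})
df2 = pd.DataFrame({"Ensmbl": ["ENSG00000198242", "ENSG00000134108"],
                    "ID": [6147, 55207]})

# ID -> Ensmbl lookup; map() returns NaN for IDs absent from df2,
# which fillna() then replaces with the original ID.
lookup = df2.set_index("ID")["Ensmbl"]
df1["ID"] = df1["ID"].map(lookup).fillna(df1["ID"])
```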

How can I loop through a pandas groupby and manipulate the data?

I am trying to work out the time delta between values in a grouped pandas df.
My df looks like this:
Location ID Item Qty Time
0 7 202545942 100130 1 07:19:46
1 8 202545943 100130 1 07:20:08
2 11 202545950 100130 1 07:20:31
3 13 202545955 100130 1 07:21:08
4 15 202545958 100130 1 07:21:18
5 18 202545963 100130 3 07:21:53
6 217 202546320 100130 1 07:22:43
7 219 202546324 100130 1 07:22:54
8 229 202546351 100130 1 07:23:32
9 246 202546376 100130 1 07:24:09
10 273 202546438 100130 1 07:24:37
11 286 202546464 100130 1 07:24:59
12 296 202546490 100130 1 07:25:16
13 297 202546491 100130 1 07:25:24
14 310 202546516 100130 1 07:25:59
15 321 202546538 100130 1 07:26:17
16 329 202546549 100130 1 07:28:09
17 388 202546669 100130 1 07:29:02
18 420 202546717 100130 2 07:30:01
19 451 202546766 100130 1 07:30:19
20 456 202546773 100130 1 07:30:27
(...)
42688 458 202546777 999969 1 06:51:16
42689 509 202546884 999969 1 06:53:09
42690 567 202546977 999969 1 06:54:21
42691 656 202547104 999969 1 06:57:27
I have grouped this using the following method:
ndf = df.groupby(['ID','Location','Time'])
If I add .size() to the end of the above and print(ndf) I get the following output:
(...)
ID Location Time
995812 696 07:10:36 1
730 07:11:41 1
761 07:12:30 1
771 07:20:49 1
995820 381 06:55:07 1
761 07:12:44 1
(...)
This is as desired.
My challenge is that I need to work out the time delta between each time per Item and add this as a column in the dataframe grouping. It should give me the following:
ID Location Time Delta
(...)
995812 696 07:10:36 0
730 07:11:41 00:01:05
761 07:12:30 00:00:49
771 07:20:49 00:08:19
995820 381 06:55:07 0
761 07:12:44 00:17:37
(...)
I am pulling my hair out trying to work out a method of doing this, so I'm turning to the greats.
Please help. Thanks in advance.
Convert the Time column to timedeltas with to_timedelta, sort by all three columns with DataFrame.sort_values, get the difference per group with DataFrameGroupBy.diff, and replace the missing first values with a 0 timedelta using Series.fillna:
# astype(str) can be omitted if the values are already strings
df['Time'] = pd.to_timedelta(df['Time'].astype(str))
df = df.sort_values(['ID','Location','Time'])
df['Delta'] = df.groupby('ID')['Time'].diff().fillna(pd.Timedelta(0))
Also is possible convert timedeltas to seconds - add Series.dt.total_seconds:
df['Delta_sec'] = df.groupby('ID')['Time'].diff().dt.total_seconds().fillna(0)
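Putting the pieces together on a toy slice of the question's data (column names as posted) to show the resulting Delta values:

```python
import pandas as pd

df = pd.DataFrame({
    "ID":       [995812, 995812, 995812, 995820, 995820],
    "Location": [696, 730, 761, 381, 761],
    "Time":     ["07:10:36", "07:11:41", "07:12:30", "06:55:07", "07:12:44"],
})

df["Time"] = pd.to_timedelta(df["Time"])
df = df.sort_values(["ID", "Location", "Time"])
# Delta restarts at 0 for each ID, matching the question's expected output:
# 0, 00:01:05, 00:00:49, 0, 00:17:37
df["Delta"] = df.groupby("ID")["Time"].diff().fillna(pd.Timedelta(0))
```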
If you just wanted to iterate over the groupby object, based on your original question title, you can do it like this:
for (x, y) in df.groupby(['ID','Location','Time']):
    print("{0}, {1}".format(x, y))
    # your logic
However, while this works for 10,000 or 100,000 rows, it does not scale well to 10^6 rows or more.

How to apply a function to all the columns in a data frame and take output in the form of dataframe in python

I have two functions that do some calculations and give me results. For now, I am able to apply one to a single column and get the result as a dataframe.
I need to know how I can apply the function to all the columns in the dataframe and get the results as a dataframe as well.
Say I have a data frame as below; I need to apply the function to each column and get a dataframe with the corresponding results for all the columns.
A B C D E F
1456 6744 9876 374 65413 1456
654 2314 674654 2156 872 6744
875 653 36541 345 4963 9876
6875 7401 3654 465 3547 374
78654 8662 35 6987 6874 65413
658 94512 687 489 8756 5854
Results
A B C D E F
2110 9058 684530 2530 66285 8200
1529 2967 711195 2501 5835 16620
7750 8054 40195 810 8510 10250
85529 16063 3689 7452 10421 65787
Here is a simple example:
df
A B C D
0 10 11 12 13
1 20 21 22 23
2 30 31 32 33
3 40 41 42 43
# Assume your user defined function is
def mul(x, y):
    return x * y
which will multiply the values.
Let's say you want to multiply the first column 'A' by 3:
df['A'].apply(lambda x: mul(x, 3))
0 30
1 60
2 90
3 120
Now, to apply the mul function to all columns of the dataframe and create a new dataframe with the results (note: in pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map):
df1 = df.applymap(lambda x: mul(x, 3))
df1
A B C D
0 30 33 36 39
1 60 63 66 69
2 90 93 96 99
3 120 123 126 129
pd.DataFrame objects also have their own apply method.
From the example given in the documentation linked above:
>>> df = pd.DataFrame([[4, 9],] * 3, columns=['A', 'B'])
>>> df
A B
0 4 9
1 4 9
2 4 9
>>> df.apply(np.sqrt)
A B
0 2.0 3.0
1 2.0 3.0
2 2.0 3.0
Conclusion: you should be able to apply your function to the whole dataframe.
It looks like this is what you are trying to do in your output:
df = pd.DataFrame(
[[1456, 6744, 9876, 374, 65413, 1456],
[654, 2314, 674654, 2156, 872, 6744],
[875, 653, 36541, 345, 4963, 9876],
[6875, 7401, 3654, 465, 3547, 374],
[78654, 8662, 35, 6987, 6874, 65413],
[658, 94512, 687, 489, 8756, 5854]],
columns=list('ABCDEF'))
def fn(col):
    return col[:-2].values + col[1:-1].values
Apply the function as mentioned in previous answers:
>>> df.apply(fn)
A B C D E F
0 2110 9058 684530 2530 66285 8200
1 1529 2967 711195 2501 5835 16620
2 7750 8054 40195 810 8510 10250
3 85529 16063 3689 7452 10421 65787
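For this particular fn, the same pairwise sum can also be computed in one vectorized expression over the whole frame, with no apply at all (a sketch on the question's data):

```python
import pandas as pd

df = pd.DataFrame(
    [[1456, 6744, 9876, 374, 65413, 1456],
     [654, 2314, 674654, 2156, 872, 6744],
     [875, 653, 36541, 345, 4963, 9876],
     [6875, 7401, 3654, 465, 3547, 374],
     [78654, 8662, 35, 6987, 6874, 65413],
     [658, 94512, 687, 489, 8756, 5854]],
    columns=list('ABCDEF'))

# Each output row i is input row i plus input row i+1, for i = 0..n-3,
# exactly what fn computes column by column.
result = pd.DataFrame(df.values[:-2] + df.values[1:-1], columns=df.columns)
```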

pandas: how to run a pivot with a multi-index?

I would like to run a pivot on a pandas DataFrame, with the index being two columns, not one. For example, one field for the year, one for the month, an 'item' field which shows 'item 1' and 'item 2' and a 'value' field with numerical values. I want the index to be year + month.
The only way I managed to get this to work was to combine the two fields into one, then separate them again. Is there a better way?
Minimal code copied below. Thanks a lot!
PS Yes, I am aware there are other questions with the keywords 'pivot' and 'multi-index', but I did not understand if/how they can help me with this question.
import pandas as pd
import numpy as np
df= pd.DataFrame()
month = np.arange(1, 13)
values1 = np.random.randint(0, 100, 12)
values2 = np.random.randint(200, 300, 12)
df['month'] = np.hstack((month, month))
df['year'] = 2004
df['value'] = np.hstack((values1, values2))
df['item'] = np.hstack((np.repeat('item 1', 12), np.repeat('item 2', 12)))
# This doesn't work:
# ValueError: Wrong number of items passed 24, placement implies 2
# mypiv = df.pivot(['year', 'month'], 'item', 'value')
# This doesn't work, either:
# df.set_index(['year', 'month'], inplace=True)
# ValueError: cannot label index with a null key
# mypiv = df.pivot(columns='item', values='value')
# This below works but is not ideal:
# I have to first concatenate then separate the fields I need
df['new field'] = df['year'] * 100 + df['month']
mypiv = df.pivot('new field', 'item', 'value').reset_index()
mypiv['year'] = mypiv['new field'].apply( lambda x: int(x) / 100)
mypiv['month'] = mypiv['new field'] % 100
You can group and then unstack.
>>> df.groupby(['year', 'month', 'item'])['value'].sum().unstack('item')
item item 1 item 2
year month
2004 1 33 250
2 44 224
3 41 268
4 29 232
5 57 252
6 61 255
7 28 254
8 15 229
9 29 258
10 49 207
11 36 254
12 23 209
Or use pivot_table:
>>> df.pivot_table(
...     values='value',
...     index=['year', 'month'],
...     columns='item',
...     aggfunc=np.sum)
item item 1 item 2
year month
2004 1 33 250
2 44 224
3 41 268
4 29 232
5 57 252
6 61 255
7 28 254
8 15 229
9 29 258
10 49 207
11 36 254
12 23 209
I believe if you include item in your MultiIndex, then you can just unstack:
df.set_index(['year', 'month', 'item']).unstack(level=-1)
This yields:
value
item item 1 item 2
year month
2004 1 21 277
2 43 244
3 12 262
4 80 201
5 22 287
6 52 284
7 90 249
8 14 229
9 52 205
10 76 207
11 88 259
12 90 200
It's a bit faster than using pivot_table, and about the same speed or slightly slower than using groupby.
The following worked for me (it requires pandas 1.1 or later, where pivot accepts lists for index and columns):
mypiv = df.pivot(index=['year', 'month'], columns='item')[['values1', 'values2']]
Thanks to gmoutso's comment, you can use this:
def multiindex_pivot(df, index=None, columns=None, values=None):
    if index is None:
        names = list(df.index.names)
        df = df.reset_index()
    else:
        names = index
    list_index = df[names].values
    tuples_index = [tuple(i) for i in list_index]  # hashable
    df = df.assign(tuples_index=tuples_index)
    df = df.pivot(index="tuples_index", columns=columns, values=values)
    tuples_index = df.index  # reduced
    index = pd.MultiIndex.from_tuples(tuples_index, names=names)
    df.index = index
    return df
usage:
df.pipe(multiindex_pivot, index=['idx_column1', 'idx_column2'], columns='foo', values='bar')
If you want a simple flat column structure, with the columns coerced to their intended type, simply add this:
(df
 .infer_objects()              # coerce to the intended column type
 .rename_axis(None, axis=1))   # flatten column headers
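For completeness: in pandas 1.1 and later, DataFrame.pivot itself accepts lists for index and columns, so a helper like the one above is no longer necessary. A sketch on the question's data:

```python
import pandas as pd
import numpy as np

np.random.seed(0)
month = np.arange(1, 13)
df = pd.DataFrame({
    "month": np.hstack((month, month)),
    "year": 2004,
    "value": np.hstack((np.random.randint(0, 100, 12),
                        np.random.randint(200, 300, 12))),
    "item": np.repeat(["item 1", "item 2"], 12),
})

# pivot with a list index builds the (year, month) MultiIndex directly.
mypiv = df.pivot(index=["year", "month"], columns="item", values="value")
```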
