I have a data-frame which looks similar to this (has about 300k rows):
import pandas as pd

df = pd.DataFrame(dict(name = ['jon', 'jon', 'dany', 'dany', 'mindy', 'mindy', 'mindy'],
                       power = [1, 2, 2, 4, 5, 5, 7],
                       rank = ['a', 'b', 'c', 'd', 'r', 'a', 'g']))
What I want is a list of DataFrames (one subset per name), like this:
df_list = [df_1, df_2, df_3]
where df_1, df_2, df_3 are essentially these:
df_1 = df.query("name == 'jon'")
df_2 = df.query("name == 'dany'")
df_3 = df.query("name == 'mindy'")
In the dataset I'm working with there are 500+ names, so how do I do this efficiently?
Here's one way.
In [1497]: df_list = [x[1] for x in df.groupby('name', sort=False)]
In [1498]: df_list[0]
Out[1498]:
name power rank
0 jon 1 a
1 jon 2 b
In [1499]: df_list[1]
Out[1499]:
name power rank
2 dany 2 c
3 dany 4 d
In [1500]: df_list[2]
Out[1500]:
name power rank
4 mindy 5 r
5 mindy 5 a
6 mindy 7 g
But it's better to store them as a dict:
In [1501]: {g: v for g, v in df.groupby('name', sort=False)}
Out[1501]:
{'dany': name power rank
2 dany 2 c
3 dany 4 d, 'jon': name power rank
0 jon 1 a
1 jon 2 b, 'mindy': name power rank
4 mindy 5 r
5 mindy 5 a
6 mindy 7 g}
In [1502]: df_dict = {g: v for g, v in df.groupby('name', sort=False)}
In [1503]: df_dict['jon']
Out[1503]:
name power rank
0 jon 1 a
1 jon 2 b
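If you don't actually need every subset materialised at once, groupby.get_group lets you pull out a single name's rows on demand; a small sketch using the same frame as above:
import pandas as pd

df = pd.DataFrame(dict(name=['jon', 'jon', 'dany', 'dany', 'mindy', 'mindy', 'mindy'],
                       power=[1, 2, 2, 4, 5, 5, 7],
                       rank=['a', 'b', 'c', 'd', 'r', 'a', 'g']))

gb = df.groupby('name', sort=False)
print(gb.get_group('jon'))   # just the rows where name == 'jon', nothing else is built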
You can try doing this
import pandas as pd
df = pd.DataFrame(dict(name = ['jon', 'jon', 'dany', 'dany', 'mindy', 'mindy', 'mindy'],
power = [1, 2, 2 ,4 ,5 ,5, 7],
rank = ['a', 'b', 'c', 'd', 'r', 'a', 'g']))
dfs = []
for each in df.name.unique():
    dfs.append(df.loc[df.name == each, :])
Alternatively, you can use NumPy to do this:
import numpy as np
dfs2 = []
array = df.values
for each in np.unique(array[:,0]):
    dfs2.append(pd.DataFrame(array[array[:,0] == each, :]))
Speed comparison between the above two methods -
import pandas as pd
import numpy as np
from time import time
df = pd.DataFrame(dict(name = ['jon', 'jon', 'dany', 'dany', 'mindy', 'mindy', 'mindy'],
power = [1, 2, 2 ,4 ,5 ,5, 7],
rank = ['a', 'b', 'c', 'd', 'r', 'a', 'g']))
t0 = time()
dfs = []
for each in df.name.unique():
    dfs.append(df.loc[df.name == each, :])
t1 = time()
dfs2 = []
array = df.values
for each in np.unique(array[:,0]):
    dfs2.append(pd.DataFrame(array[array[:,0] == each, :]))
t2 = time()
t1 - t0 #0.003524303436279297
t2 - t1 #0.0016787052154541016
NumPy is faster on this small example, and that may help in your case since you have a large dataset.
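That said, the timing above is on a seven-row frame, so it's worth re-measuring at something closer to your real size; a single groupby pass avoids re-scanning the name column once per unique value. A rough timeit sketch, where the row count and number of names are made-up stand-ins for the question's data:
import numpy as np
import pandas as pd
from timeit import timeit

# synthetic stand-in: ~300k rows, ~500 names (assumed sizes, not the asker's data)
rng = np.random.default_rng(0)
big = pd.DataFrame({'name': rng.integers(0, 500, 300_000).astype(str),
                    'power': rng.integers(0, 10, 300_000)})

mask_loop = lambda: [big.loc[big.name == n] for n in big.name.unique()]
groupby_split = lambda: [g for _, g in big.groupby('name', sort=False)]

print(timeit(mask_loop, number=3))      # boolean mask per name
print(timeit(groupby_split, number=3))  # one groupby pass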
Related
I have a table that looks like:
Group Name
1 A
1 B
2 R
2 F
3 B
3 C
And I need to group these records by the following rule:
If a group contains at least one Name that also appears in another group, then those two groups must end up in the same result group. In my case Group 1 contains A and B, and Group 3 contains B and C. They have the common name B, so they must be in the same group.
As a result I want to get something like this:
Group Name ResultGroup
1 A 1
1 B 1
2 R 2
2 F 2
3 B 1
3 C 1
I already found a solution, but my table has about 200k records, so it takes too much time (more than 12 hours). Is there a way to optimize it? Maybe using pandas or something like that?
def printList(l, head=""):
    if(head!=""):
        print(head)
    for i in l:
        print(i)

def find_group(groups, vals):
    for k in groups.keys():
        for v in vals:
            if v in groups[k]:
                return k
    return 0

task = [ [1, "AAA"], [1, "BBB"], [3, "CCC"], [4, "DDD"], [5, "JJJ"], [6, "AAA"], [6, "JJJ"], [6, "CCC"], [9, "OOO"], [10, "OOO"], [10, "DDD"], [11, "LLL"], [12, "KKK"] ]
ptrs = {}
groups = {}
group_id = 1

printList(task, "Initial table")

for i in range(0, len(task)):
    itask = task[i]
    resp = itask[1]
    val = [ x[0] for x in task if x[1] == resp ]
    minval = min(val)
    for v in val:
        if not v in ptrs.keys(): ptrs[v] = minval
    myGroup = find_group(groups, val)
    if(myGroup == 0):
        groups[group_id] = list(set(val))
        myGroup = group_id
        group_id += 1
    else:
        groups[myGroup].extend(val)
        groups[myGroup] = list(set(groups[myGroup]))
    itask.append(myGroup)
    task[i] = itask

print()
printList(task, "Result table")
You can groupby 'Name' and keep the first Group:
df = pd.DataFrame({'Group': [1, 1, 2, 2, 3, 3], 'Name': ['A', 'B', 'R', 'F', 'B', 'C']})
df2 = df.groupby('Name').first().reset_index()
Then merge with the original data-frame and drop duplicates of the original group:
df3 = df.merge(df2, on='Name', how='left')
df3 = df3[['Group_x', 'Group_y']].drop_duplicates('Group_x')
df3.columns = ['Group', 'ResultGroup']
One more merge will give you the result:
df.merge(df3, on='Group', how='left')
Group Name ResultGroup
1 A 1
1 B 1
2 R 2
2 F 2
3 B 1
3 C 1
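One caveat: the merge above links groups that directly share a Name with the first group seen. If groups can chain transitively (1 shares a name with 3, 3 shares another name with 5, and so on), this becomes a connected-components problem; here is a minimal union-find sketch over the same example, assuming that transitive behaviour is what's wanted:
import pandas as pd

df = pd.DataFrame({'Group': [1, 1, 2, 2, 3, 3],
                   'Name':  ['A', 'B', 'R', 'F', 'B', 'C']})

parent = {}

def find(g):
    # walk up to the representative of g's set
    while parent.get(g, g) != g:
        g = parent[g]
    return g

def union(a, b):
    ra, rb = find(a), find(b)
    if ra != rb:
        parent[max(ra, rb)] = min(ra, rb)  # keep the smaller Group id as representative

# any two Groups that share a Name end up in the same set
for _, groups in df.groupby('Name')['Group']:
    groups = list(groups)
    for g in groups[1:]:
        union(groups[0], g)

df['ResultGroup'] = df['Group'].map(find)
print(df)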
How do I retrieve the values of column Z and their average if any value is > 1?
import numpy as np
import pandas as pd

data = [9,2,3,4,5,6,7,8]
df = pd.DataFrame(np.random.randn(8, 5), columns=['A', 'B', 'C', 'D', 'E'])
fd = pd.DataFrame(data, columns=['Z'])
df = pd.concat([df, fd], axis=1)
l = []
for x, y in df.iterrows():
    for i, s in y.iteritems():
        if s > 1:
            l.append(x)
print(df['Z'])
The expected output will most likely be a dictionary with the column name as key and the average of Z as its value.
Using a dictionary comprehension:
res = {col: df.loc[df[col] > 1, 'Z'].mean() for col in df.columns[:-1]}
# {'A': 9.0, 'B': 5.0, 'C': 8.0, 'D': 7.5, 'E': 6.666666666666667}
Setup used for above:
np.random.seed(0)
data = [9,2,3,4,5,6,7,8]
df = pd.DataFrame(np.random.randn(8, 5),columns=['A', 'B', 'C', 'D','E'])
fd = pd.DataFrame(data, columns=['Z'])
df = pd.concat([df, fd], axis=1)
Do you mean this?
df[df['Z']>1].loc[:,'Z'].mean(axis=0)
or
df[df['Z']>1]['Z'].mean()
I don't know if I understood your question correctly but do you mean this:
import pandas as pd
import numpy as np
data=[9,2,3,4,5,6,7,8]
columns = ['A', 'B', 'C', 'D','E']
df = pd.DataFrame(np.random.randn(8, 5),columns=columns)
fd=pd.DataFrame(data,columns=['Z'])
df=pd.concat([df,fd], axis=1)
print('df = \n', str(df))
anyGreaterThanOne = (df[columns] > 1).any(axis=1)
print('anyGreaterThanOne = \n', str(anyGreaterThanOne))
filtered = df[anyGreaterThanOne]
print('filtered = \n', str(filtered))
Zmean = filtered['Z'].mean()
print('Zmean = ', str(Zmean))
Result:
df =
A B C D E Z
0 -2.170640 -2.626985 -0.817407 -0.389833 0.862373 9
1 -0.372144 -0.375271 -1.309273 -1.019846 -0.548244 2
2 0.267983 -0.680144 0.304727 0.302952 -0.597647 3
3 0.243549 1.046297 0.647842 1.188530 0.640133 4
4 -0.116007 1.090770 0.510190 -1.310732 0.546881 5
5 -1.135545 -1.738466 -1.148341 0.764914 -1.140543 6
6 -2.078396 0.057462 -0.737875 -0.817707 0.570017 7
7 0.187877 0.363962 0.637949 -0.875372 -1.105744 8
anyGreaterThanOne =
0 False
1 False
2 False
3 True
4 True
5 False
6 False
7 False
dtype: bool
filtered =
A B C D E Z
3 0.243549 1.046297 0.647842 1.188530 0.640133 4
4 -0.116007 1.090770 0.510190 -1.310732 0.546881 5
Zmean = 4.5
Let's say I have three lists:
listA = ['a','b','c', 'd']
listP = ['p', 'q', 'r']
listX = ['x', 'z']
So the DataFrame will have 4*3*2 = 24 rows.
Now, the simplest way to solve this problem is to do this:
df = pd.DataFrame(columns=['A','P','X'])
for val1 in listA:
    for val2 in listP:
        for val3 in listX:
            df.loc[<indexvalue>] = [val1, val2, val3]
Now, in the real scenario I will have about 800k rows and 12 columns (so 12 nested loops). Is there any way I can create this DataFrame much faster?
Similar discussion here. Apparently np.meshgrid is more efficient for large data (as an alternative to itertools.product).
Application:
import numpy as np
import pandas as pd

# np.stack needs a sequence rather than a generator in recent NumPy versions
v = np.stack([i.ravel() for i in np.meshgrid(listA, listP, listX)]).T
df = pd.DataFrame(v, columns=['A', 'P', 'X'])
>> A P X
0 a p x
1 a p z
2 b p x
3 b p z
4 c p x
You could use itertools.product:
import pandas as pd
from itertools import product
listA = ['a', 'b', 'c', 'd']
listP = ['p', 'q', 'r']
listX = ['x', 'z']
df = pd.DataFrame(data=list(product(listA, listP, listX)), columns=['A','P','X'])
print(df.head(10))
Output
A P X
0 a p x
1 a p z
2 a q x
3 a q z
4 a r x
5 a r z
6 b p x
7 b p z
8 b q x
9 b q z
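A pandas-native variant of the same idea, not mentioned in the answers above but worth knowing, is pd.MultiIndex.from_product, which builds the cartesian product directly and can then be flattened into columns:
import pandas as pd

listA = ['a', 'b', 'c', 'd']
listP = ['p', 'q', 'r']
listX = ['x', 'z']

# cartesian product as a MultiIndex, then turned into ordinary columns
idx = pd.MultiIndex.from_product([listA, listP, listX], names=['A', 'P', 'X'])
df = idx.to_frame(index=False)
print(df.head())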
I am doing an analysis of a dataset with 6 classes, zero based. The dataset is many thousands of items long.
I need two dataframes with classes 0 & 1 for the first data set and 3 & 5 for the second.
I can get 0 & 1 together easily enough:
mnist_01 = mnist.loc[mnist['class']<= 1]
However, I am not sure how to get classes 3 & 5... so what I would like to be able to do is:
mnist_35 = mnist.loc[mnist['class'] == (3 or 5)]
...rather than doing:
mnist_3 = mnist.loc[mnist['class'] == 3]
mnist_5 = mnist.loc[mnist['class'] == 5]
mnist_35 = pd.concat([mnist_3,mnist_5],axis=0)
You can use isin, passing a set so that each membership check is an O(1) operation:
mnist = pd.DataFrame({'class': [0, 1, 2, 3, 4, 5],
'val': ['a', 'b', 'c', 'd', 'e', 'f']})
>>> mnist.loc[mnist['class'].isin({3, 5})]
class val
3 3 d
5 5 f
>>> mnist.loc[mnist['class'].isin({0, 1})]
class val
0 0 a
1 1 b
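If you prefer the query style from the question, the same selection can be written with an "in" expression; since class is a Python keyword, the column name needs backticks, which assumes a reasonably recent pandas (0.25 or later):
import pandas as pd

mnist = pd.DataFrame({'class': [0, 1, 2, 3, 4, 5],
                      'val':   ['a', 'b', 'c', 'd', 'e', 'f']})

# backticks let query() refer to a column whose name is a reserved word
mnist_35 = mnist.query("`class` in [3, 5]")
mnist_01 = mnist.query("`class` in [0, 1]")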
I fetch a spreadsheet into a Python DataFrame named df.
Let's give a sample:
df=pd.DataFrame({'a': np.random.rand(10), 'b': np.random.rand(10)})
df.columns=['a','a']
a a
0 0.973858 0.036459
1 0.835112 0.947461
2 0.520322 0.593110
3 0.480624 0.047711
4 0.643448 0.104433
5 0.961639 0.840359
6 0.848124 0.437380
7 0.579651 0.257770
8 0.919173 0.785614
9 0.505613 0.362737
When I run df.columns.is_unique I get False
I would like to automatically rename the second column 'a' to 'a_2' (or something like that).
I don't expect a solution like df.columns=['a','a_2'].
I'm looking for a solution that could be used for several columns!
You can uniquify the columns manually:
df_columns = ['a', 'b', 'a', 'a_2', 'a_2', 'a', 'a_2', 'a_2_2']

def uniquify(df_columns):
    seen = set()
    for item in df_columns:
        fudge = 1
        newitem = item
        while newitem in seen:
            fudge += 1
            newitem = "{}_{}".format(item, fudge)
        yield newitem
        seen.add(newitem)

list(uniquify(df_columns))
#>>> ['a', 'b', 'a_2', 'a_2_2', 'a_2_3', 'a_3', 'a_2_4', 'a_2_2_2']
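To actually rename a frame like the one in the question, something along these lines should work (a small usage sketch that assumes the uniquify generator defined above):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 2))
df.columns = ['a', 'a']

df.columns = list(uniquify(df.columns))  # uniquify as defined above
print(df.columns.tolist())   # ['a', 'a_2']
print(df.columns.is_unique)  # True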
I fetch a spreadsheet into a Python DataFrame named df... I would like
to automatically rename [duplicate] column [names].
Pandas does that automatically for you without you having to do anything...
test.xls:
import pandas as pd
import numpy as np
df = pd.io.excel.read_excel(
"./test.xls",
"Sheet1",
header=0,
index_col=0,
)
print df
--output:--
a b c b.1 a.1 a.2
index
0 10 100 -10 -100 10 21
1 20 200 -20 -200 11 22
2 30 300 -30 -300 12 23
3 40 400 -40 -400 13 24
4 50 500 -50 -500 14 25
5 60 600 -60 -600 15 26
print df.columns.is_unique
--output:--
True
If for some reason you are being given a DataFrame with duplicate columns, you can do this:
import pandas as pd
import numpy as np
from collections import defaultdict
df = pd.DataFrame(
{
'k': np.random.rand(10),
'l': np.random.rand(10),
'm': np.random.rand(10),
'n': np.random.rand(10),
'o': np.random.rand(10),
'p': np.random.rand(10),
}
)
print df
--output:--
k l m n o p
0 0.566150 0.025225 0.744377 0.222350 0.800402 0.449897
1 0.701286 0.182459 0.661226 0.991143 0.793382 0.980042
2 0.383213 0.977222 0.404271 0.050061 0.839817 0.779233
3 0.428601 0.303425 0.144961 0.313716 0.244979 0.487191
4 0.187289 0.537962 0.669240 0.096126 0.242258 0.645199
5 0.508956 0.904390 0.838986 0.315681 0.359415 0.830092
6 0.007256 0.136114 0.775670 0.665000 0.840027 0.991058
7 0.719344 0.072410 0.378754 0.527760 0.205777 0.870234
8 0.255007 0.098893 0.079230 0.225225 0.490689 0.554835
9 0.481340 0.300319 0.649762 0.460897 0.488406 0.166047
df.columns = ['a', 'b', 'c', 'b', 'a', 'a']
print df
--output:--
a b c b a a
0 0.566150 0.025225 0.744377 0.222350 0.800402 0.449897
1 0.701286 0.182459 0.661226 0.991143 0.793382 0.980042
2 0.383213 0.977222 0.404271 0.050061 0.839817 0.779233
3 0.428601 0.303425 0.144961 0.313716 0.244979 0.487191
4 0.187289 0.537962 0.669240 0.096126 0.242258 0.645199
5 0.508956 0.904390 0.838986 0.315681 0.359415 0.830092
6 0.007256 0.136114 0.775670 0.665000 0.840027 0.991058
7 0.719344 0.072410 0.378754 0.527760 0.205777 0.870234
8 0.255007 0.098893 0.079230 0.225225 0.490689 0.554835
9 0.481340 0.300319 0.649762 0.460897 0.488406 0.166047
print df.columns.is_unique
--output:--
False
name_counts = defaultdict(int)
new_col_names = []
for name in df.columns:
    new_count = name_counts[name] + 1
    new_col_names.append("{}{}".format(name, new_count))
    name_counts[name] = new_count
print new_col_names
--output:--
['a1', 'b1', 'c1', 'b2', 'a2', 'a3']
df.columns = new_col_names
print df
--output:--
a1 b1 c1 b2 a2 a3
0 0.264598 0.321378 0.466370 0.986725 0.580326 0.671168
1 0.938810 0.179999 0.403530 0.675112 0.279931 0.011046
2 0.935888 0.167405 0.733762 0.806580 0.392198 0.180401
3 0.218825 0.295763 0.174213 0.457533 0.234081 0.555525
4 0.891890 0.196245 0.425918 0.786676 0.791679 0.119826
5 0.721305 0.496182 0.236912 0.562977 0.249758 0.352434
6 0.433437 0.501975 0.088516 0.303067 0.916619 0.717283
7 0.026491 0.412164 0.787552 0.142190 0.665488 0.488059
8 0.729960 0.037055 0.546328 0.683137 0.134247 0.444709
9 0.391209 0.765251 0.507668 0.299963 0.348190 0.731980
print df.columns.is_unique
--output:--
True
In case anyone needs this in Scala->
def renameDup(Header: String): String = {
  val trimmedList: List[String] = Header.split(",").toList
  var fudge = 0
  var newitem = ""
  var seen = List[String]()
  for (item <- trimmedList) {
    fudge = 1
    newitem = item
    for (newitem2 <- seen) {
      if (newitem2 == newitem) {
        fudge += 1
        newitem = item + "_" + fudge
      }
    }
    seen = seen :+ newitem
  }
  return seen.mkString(",")
}
Here's a solution that uses pandas all the way through.
import pandas as pd
# create data frame with duplicate column names
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df.rename({'a': 'col', 'b': 'col'}, axis=1, inplace=True)
df
---output---
col col
0 1 4
1 2 5
2 3 6
# make a new data frame of column headers and number sequentially
dfcolumns = pd.DataFrame({'name': df.columns})
dfcolumns['counter'] = dfcolumns.groupby('name').cumcount().apply(str)
# remove counter for first case (optional) and combine suffixes
dfcolumns.loc[dfcolumns.counter=='0', 'counter'] = ''
df.columns = dfcolumns['name'] + dfcolumns['counter']
df
---output---
col col1
0 1 4
1 2 5
2 3 6
I ran into this problem when loading DataFrames from Oracle tables. 7stud is right that pd.read_excel() automatically designates duplicated columns with a .1 suffix, but not all of the read functions do this. One workaround is to save the DataFrame to a csv (or excel) file and then reload it to re-designate the duplicated columns.
data = pd.read_sql(SQL, connection)
data.to_csv(r'C:\temp\temp.csv')
data = pd.read_csv(r'C:\temp\temp.csv')
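If writing a temporary file is undesirable, the same round-trip can be done in memory with io.StringIO; this is just a sketch of the same idea, relying on read_csv's default handling of duplicate headers:
import io
import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=['a', 'a', 'b'])

buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)

df = pd.read_csv(buf)        # duplicate headers come back as 'a', 'a.1'
print(df.columns.tolist())   # ['a', 'a.1', 'b']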