As with most pandas problems, I am guessing this one has been dealt with before, but I can't find a direct answer and I'm also worried about performance. My dataset is large, so I'm hoping to find the most efficient way of doing this.
The Problem
I have two dataframes - dfA contains a list of IDs from dfB. I'd like to:
transpose those IDs as columns
replace the IDs with a value looked up from dfB
collapse repeated columns and aggregate with sum
Here's an illustration:
dfA
dfA = pd.DataFrame({'a_id': ['0000001', '0000002', '0000003', '0000004'],
                    'list_of_b_id': [['2', '3', '7'], [], ['1', '2', '3', '4'], ['6', '7']]
                    })
+------+--------------+
| a_id | list_of_b_id |
+------+--------------+
| 1 | [2, 3, 7] |
+------+--------------+
| 2 | [] |
+------+--------------+
| 3 | [1, 2, 3, 4] |
+------+--------------+
| 4 | [6, 7] |
+------+--------------+
dfB
dfB = pd.DataFrame({'b_id': ['1', '2', '3', '4', '5', '6', '7'],
                    'replacement': ['Red', 'Red', 'Blue', 'Red', 'Green', 'Blue', 'Red']
                    })
+------+-------------+
| b_id | replacement |
+------+-------------+
| 1 | Red |
+------+-------------+
| 2 | Red |
+------+-------------+
| 3 | Blue |
+------+-------------+
| 4 | Red |
+------+-------------+
| 5 | Green |
+------+-------------+
| 6 | Blue |
+------+-------------+
| 7 | Red |
+------+-------------+
Goal (Final Result)
Here is what I'm hoping to eventually get to, in the most efficient way possible.
In reality, I may have over 5M observations in both dfA and dfB, and ~50 unique values for replacement in dfB, which explains why I need to do this in a dynamic fashion and not just hard-code it.
+------+-----+------+
| a_id | Red | Blue |
+------+-----+------+
| 1 | 2 | 1 |
+------+-----+------+
| 2 | 0 | 0 |
+------+-----+------+
| 3 | 3 | 1 |
+------+-----+------+
| 4 | 1 | 1 |
+------+-----+------+
First, all lists are flattened with numpy.repeat and numpy.concatenate:
df = pd.DataFrame({'id': np.repeat(dfA['a_id'], dfA['list_of_b_id'].str.len()),
                   'b': np.concatenate(dfA['list_of_b_id'])})
print (df)
b id
0 2 0000001
0 3 0000001
0 7 0000001
2 1 0000003
2 2 0000003
2 3 0000003
2 4 0000003
3 6 0000004
3 7 0000004
Then map the b values with a Series created from dfB, group by id and the mapped values to get counts, reshape with unstack, and add the missing ids with reindex:
df = (df.groupby(['id',df['b'].map(dfB.set_index('b_id')['replacement'])])
.size()
.unstack(fill_value=0)
.reindex(dfA['a_id'].unique(), fill_value=0))
print (df)
b Blue Red
id
0000001 1 2
0000002 0 0
0000003 1 3
0000004 1 1
For reference, the mapped Series used inside the groupby (built from the flattened df) looks like this:
print (df['b'].map(dfB.set_index('b_id')['replacement']))
0 Red
0 Blue
0 Red
2 Red
2 Red
2 Blue
2 Red
3 Blue
3 Red
Name: b, dtype: object
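For reference, here is a single runnable version of the steps above (imports included, same toy data as in the question):
import numpy as np
import pandas as pd

dfA = pd.DataFrame({'a_id': ['0000001', '0000002', '0000003', '0000004'],
                    'list_of_b_id': [['2', '3', '7'], [], ['1', '2', '3', '4'], ['6', '7']]})
dfB = pd.DataFrame({'b_id': ['1', '2', '3', '4', '5', '6', '7'],
                    'replacement': ['Red', 'Red', 'Blue', 'Red', 'Green', 'Blue', 'Red']})

# flatten the lists, map b_id -> replacement, then count per a_id
flat = pd.DataFrame({'id': np.repeat(dfA['a_id'], dfA['list_of_b_id'].str.len()),
                     'b': np.concatenate(dfA['list_of_b_id'])})
lookup = dfB.set_index('b_id')['replacement']
result = (flat.groupby(['id', flat['b'].map(lookup)])
              .size()
              .unstack(fill_value=0)
              .reindex(dfA['a_id'].unique(), fill_value=0))
print(result)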
A plain-Python alternative counts directly from the lists (note it relies on b_id being the consecutive integers 1..7, so b[int(ele)-1] is the replacement):
a = [['2','3','7'],[],['1','2','3','4'],['6','7']]
b = ['Red','Red','Blue','Red','Green','Blue','Red']
res = []
for line in a:
    tmp = {}
    for ele in line:
        tmp[b[int(ele)-1]] = tmp.get(b[int(ele)-1], 0) + 1
    res.append(tmp)
print(pd.DataFrame(res).fillna(0))
Blue Red
0 1.0 2.0
1 0.0 0.0
2 1.0 3.0
3 1.0 1.0
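A small follow-up sketch, if you also want the original a_id column and integer counts (not part of the answer above, assuming the same res and dfA):
out = pd.DataFrame(res).fillna(0).astype(int)   # integer counts instead of floats
out.insert(0, 'a_id', dfA['a_id'])              # reattach the original IDs
print(out)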
Use
In [5611]: dft = (dfA.set_index('a_id')['list_of_b_id']
.apply(pd.Series)
.stack()
.replace(dfB.set_index('b_id')['replacement'])
.reset_index())
In [5612]: (dft.groupby(['a_id', 0]).size().unstack()
.reindex(dfA['a_id'].unique(), fill_value=0))
Out[5612]:
0 Blue Red
a_id
0000001 1 2
0000002 0 0
0000003 1 3
0000004 1 1
Details
In [5613]: dft
Out[5613]:
a_id level_1 0
0 0000001 0 Red
1 0000001 1 Blue
2 0000001 2 Red
3 0000003 0 Red
4 0000003 1 Red
5 0000003 2 Blue
6 0000003 3 Red
7 0000004 0 Blue
8 0000004 1 Red
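A hedged side note: for a pure label lookup on large data, Series.map with a lookup Series is usually faster than Series.replace, so an equivalent variant would be:
lookup = dfB.set_index('b_id')['replacement']
dft = (dfA.set_index('a_id')['list_of_b_id']
          .apply(pd.Series)
          .stack()
          .map(lookup)
          .reset_index())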
You can try the code below:
pd.concat([dfA, dfA.list_of_b_id.apply(lambda x: dfB[dfB.b_id.isin(x)].replacement.value_counts())], axis=1)
d = dfB.set_index('b_id')['replacement'].to_dict()
dfA['list_of_b_id'] = dfA['list_of_b_id'].apply(lambda x: [d.get(k, k) for k in x])
pd.concat([dfA, pd.get_dummies(dfA['list_of_b_id'].apply(pd.Series).stack()).groupby(level=0).sum()], axis=1)
Out[66]:
a_id list_of_b_id Blue Red
0 0000001 [Red, Blue, Red] 1.0 2.0
1 0000002 [] NaN NaN
2 0000003 [Red, Red, Blue, Red] 1.0 3.0
3 0000004 [Blue, Red] 1.0 1.0
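Note that a_id 0000002 has an empty list, so its dummy columns come out as NaN; to match the goal table exactly, a small follow-up (my addition, not part of the answer above) can fill and cast them:
counts = (pd.get_dummies(dfA['list_of_b_id'].apply(pd.Series).stack())
            .groupby(level=0).sum())
out = pd.concat([dfA, counts], axis=1)
out[['Blue', 'Red']] = out[['Blue', 'Red']].fillna(0).astype(int)   # 0 instead of NaN for empty lists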
Related
Currently I am working on a clustering problem, and I have trouble copying values from one dataframe to the original dataframe.
   | CustomerID | Date       | Time     | TotalSum | CohortMonth | CohortIndex
---+------------+------------+----------+----------+-------------+------------
 0 | 17850.0    | 2017-11-29 | 08:26:00 | 15.30    | 2017-11-01  | 1
 1 | 17850.0    | 2017-11-29 | 08:26:00 | 20.34    | 2017-11-01  | 1
 2 | 17850.0    | 2017-11-29 | 08:26:00 | 22.00    | 2017-11-01  | 1
 3 | 17850.0    | 2017-11-29 | 08:26:00 | 20.34    | 2017-11-01  | 1
And the dataframe with values (clusters) to copy:
CustomerID|Cluster
------------------
12346.0 | 1
------------------
12346.0 | 1
------------------
12346.0 | 1
------------------
Please help me with this: how do I copy values from the second dataframe to the first, matching on CustomerID?
I tried code like this:
df.merge(ic,left_on='CustomerID',right_on='Cluster',how='left').drop('CustomerID',1).fillna('')
But it doesn't work and I get an error...
Besides that, I tried a version like:
df, ic = [d.reset_index(drop=True) for d in (df, ic)]
ic.join(df[['CustomerID']])
But I get the same error, or an error like 'CustomerID' not in df...
Sorry if this is an unclear and badly formatted question... it is my first question on Stack Overflow. Thank you all.
UPDATE
I have tried this
df1=df.merge(ic,left_on='CustomerID',right_on='Cluster',how='left')
if ic['CustomerID'].values != df1['CustomerID_x'].values:
    df1.Cluster = ic.Cluster
else:
    df1.Cluster = 'NaN'
But I get different clusters for the same customer:
   | CustomerID_x | Date       | Time     | TotalSum | CohortMonth | CohortIndex | CustomerID_y | Cluster
---+--------------+------------+----------+----------+-------------+-------------+--------------+--------
 0 | 17850.0      | 2017-11-29 | 08:26:00 | 15.30    | 2017-11-01  | 1           | NaN          | 1.0
 1 | 17850.0      | 2017-11-29 | 08:26:00 | 20.34    | 2017-11-01  | 1           | NaN          | 0.0
 2 | 17850.0      | 2017-11-29 | 08:26:00 | 22.00    | 2017-11-01  | 1           | NaN          | 1.0
 3 | 17850.0      | 2017-11-29 | 08:26:00 | 20.34    | 2017-11-01  | 1           | NaN          | 2.0
 4 | 17850.0      | 2017-11-29 | 08:26:00 | 20.34    | 2017-11-01  | 1           | NaN          | 1.0
Given what you've written, I think you want:
>>> df1 = pd.DataFrame({"CustomerID": [17850.0] * 4, "CohortIndex": [1,1,1,1] })
>>> df1
CustomerID CohortIndex
0 17850.0 1
1 17850.0 1
2 17850.0 1
3 17850.0 1
>>> df2
CustomerID Cluster
0 12346.0 1
1 17850.0 1
2 12345.0 1
>>> pd.merge(df1, df2, 'left', 'CustomerID')
CustomerID CohortIndex Cluster
0 17850.0 1 1
1 17850.0 1 1
2 17850.0 1 1
3 17850.0 1 1
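One hedged caveat: if the cluster table contains repeated CustomerID rows (as the ic table in the question does), a plain left merge will duplicate rows in the result, so it may be worth de-duplicating first:
clusters = df2.drop_duplicates(subset='CustomerID')   # one cluster row per customer
pd.merge(df1, clusters, how='left', on='CustomerID')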
I have a problem in Python working with a pandas dataframe: I'm trying to build a machine learning model predicting the surface. I have the surface column in the train dataframe but I don't have it in the test dataframe. So I would like to create some features based on the surface in train, like:
train['error_cat1'] = abs(train.groupby(train['cat1'])['surface'].transform('mean') - train.surface.mean())
Here I have, for each group of the "cat1" feature, the absolute difference between the group mean of surface and the overall mean. Cool.
Now I must add it to the test too. So I use this method to map the value from train for each group onto the test rows:
mp = {k: g['error_cat1'].tolist()[0] for k,g in train.groupby('cat1')}
test['error_cat1'] = test['cat1'].map(mp)
So far there is no problem. Now I would like to use two columns in the groupby:
train['error_cat1_cat2'] = abs(train.groupby(train[['cat1','cat2']])['surface'].transform('mean') - train.surface.mean())
but I don't know how to map it onto the test dataframe. Can you please help me handle this problem, or suggest some other method I can use?
Thanks
For example, my train is:
+------+------+-------+
| Cat1 | Cat2 | surface |
+------+------+-------+
| 1 | 3 | 10 |
+------+------+-------+
| 2 | 2 | 12 |
+------+------+-------+
| 3 | 1 | 12 |
+------+------+-------+
| 1 | 3 | 5 |
+------+------+-------+
| 2 | 2 | 10 |
+------+------+-------+
| 3 | 2 | 13 |
+------+------+-------+
my test is
+------+------+
| Cat1 | Cat2 |
+------+------+
| 1 | 2 |
+------+------+
| 2 | 1 |
+------+------+
| 3 | 1 |
+------+------+
| 1 | 3 |
+------+------+
| 2 | 3 |
+------+------+
| 3 | 1 |
+------+------+
Now I would do a groupby mean of surface on cat1 and cat2. For example, the mean surface for (cat1, cat2) = (1, 3) is (10 + 5) / 2 = 7.5.
Then I must go to the test and map this value onto the (cat1, cat2) = (1, 3) rows.
I hope that makes sense.
You can use
groupby().mean() to calculate the means
reset_index() to convert the indexes Cat1, Cat2 back into columns
merge(how='left', ...) to join the two dataframes like tables in a database (LEFT JOIN in SQL).
headers = ['Cat1', 'Cat2', 'surface']
train_data = [
[1, 3, 10],
[2, 2, 12],
[3, 1, 12],
[1, 3, 5],
[2, 2, 10],
[3, 2, 13],
]
test_data = [
[1, 2],
[2, 1],
[3, 1],
[1, 3],
[2, 3],
[3, 1],
]
import pandas as pd
train = pd.DataFrame(train_data, columns=headers)
test = pd.DataFrame(test_data, columns=headers[:-1])
print('--- train ---')
print(train)
print('--- test ---')
print(test)
print('--- means ---')
means = train.groupby(['Cat1', 'Cat2']).mean()
print(means)
print('--- means (dataframe) ---')
means = means.reset_index(level=['Cat1', 'Cat2'])
print(means)
print('--- result ----')
result = pd.merge(test, means, on=['Cat1', 'Cat2'], how='left')
print(result)
print('--- result (fillna)---')
result = result.fillna(0)
print(result)
Result:
--- train ---
Cat1 Cat2 surface
0 1 3 10
1 2 2 12
2 3 1 12
3 1 3 5
4 2 2 10
5 3 2 13
--- test ---
Cat1 Cat2
0 1 2
1 2 1
2 3 1
3 1 3
4 2 3
5 3 1
--- means ---
surface
Cat1 Cat2
1 3 7.5
2 2 11.0
3 1 12.0
2 13.0
--- means (dataframe) ---
Cat1 Cat2 surface
0 1 3 7.5
1 2 2 11.0
2 3 1 12.0
3 3 2 13.0
--- result ----
Cat1 Cat2 surface
0 1 2 NaN
1 2 1 NaN
2 3 1 12.0
3 1 3 7.5
4 2 3 NaN
5 3 1 12.0
--- result (fillna)---
Cat1 Cat2 surface
0 1 2 0.0
1 2 1 0.0
2 3 1 12.0
3 1 3 7.5
4 2 3 0.0
5 3 1 12.0
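If the end goal is the asker's original feature (absolute deviation of the per-group mean from the global train mean), a hedged follow-up to the merge above could look like this (the names global_mean and group_mean are mine; unseen (Cat1, Cat2) pairs fall back to the global mean, i.e. an error of 0):
global_mean = train['surface'].mean()

# train: per-(Cat1, Cat2) mean via transform, as in the single-column case
train['error_cat1_cat2'] = abs(
    train.groupby(['Cat1', 'Cat2'])['surface'].transform('mean') - global_mean)

# test: bring the per-group means over with the merge, then build the feature
test = pd.merge(test, means.rename(columns={'surface': 'group_mean'}),
                on=['Cat1', 'Cat2'], how='left')
test['error_cat1_cat2'] = abs(test['group_mean'].fillna(global_mean) - global_mean)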
Hi, as I am new to Python, a friend recommended that I seek help on Stack Overflow, so I decided to give it a shot. I'm currently using Python 3.x.
I have a dataset of over 100k rows in a CSV file with no column header, and I have loaded the data into a pandas DataFrame.
Because the documents are confidential I can't display the real data here,
but here is an example of the data, with columns that can be defined as below:
("id", "name", "number", "time", "text_id", "text", "text")
1 | apple | 12 | 123 | 2 | abc | abc
1 | apple | 12 | 222 | 2 | abc | abc
2 | orange | 32 | 123 | 2 | abc | abc
2 | orange | 11 | 123 | 2 | abc | abc
3 | apple | 12 | 333 | 2 | abc | abc
3 | apple | 12 | 443 | 2 | abc | abc
3 | apple | 12 | 553 | 2 | abc | abc
As you can see from the name column, I have two duplicate clusters of "apple", but with different IDs.
so my question is:
How do I drop the entire cluster (all of its rows) that has the higher mean value based on "time"?
Example: if (cluster with ID: 1).mean(time) < (cluster with ID: 3).mean(time) then drop all the rows in cluster with ID: 3
Desired output:
1 | apple | 12 | 123 | 2 | abc | abc
1 | apple | 12 | 222 | 2 | abc | abc
2 | orange | 32 | 123 | 2 | abc | abc
2 | orange | 11 | 123 | 2 | abc | abc
I need all the help I can get, and I'm running out of time. Thanks in advance!
You can use groupby and apply to get the rows that you want to remove first.
Then you can use take to obtain the final result.
import pandas as pd
## define the rows with time above the (name-group) mean
def my_func(df):
    return df[df['time'] > df['time'].mean()]

## get the rows to be removed
df1 = df.groupby(by='name', group_keys=False).apply(my_func)

## take only the rows we want
index_to_keep = set(range(df.shape[0])) - set(df1.index)
df2 = df.take(list(index_to_keep))
Example:
## df
id name number time text_id text text1
0 1 apple 12 123 2 abc abc
1 1 apple 12 222 2 abc abc
2 2 orange 32 123 2 abc abc
3 2 orange 11 123 2 abc abc
4 3 apple 12 333 2 abc abc
5 3 apple 12 444 2 abc abc
6 3 apple 12 553 2 abc abc
df1 = df.groupby(by='name', group_keys=False).apply(my_func)
## df1
id name number time text_id text text1
5 3 apple 12 444 2 abc abc
6 3 apple 12 553 2 abc abc
index_to_keep = set(range(df.shape[0])) - set(df1.index)
df2 = df.take(list(index_to_keep))
#index_to_keep
{0, 1, 2, 3, 4}
# df2
id name number time text_id text text1
0 1 apple 12 123 2 abc abc
1 1 apple 12 222 2 abc abc
2 2 orange 32 123 2 abc abc
3 2 orange 11 123 2 abc abc
4 3 apple 12 333 2 abc abc
P.S. I took the usage of take from another answer.
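If the intent is really to drop the whole higher-mean cluster per name (as the desired output shows), here is a hedged, more vectorized sketch, assuming the column names from the example:
cluster_means = df.groupby(['id', 'name'])['time'].mean()   # mean time per (id, name) cluster
keep = cluster_means.groupby(level='name').idxmin()         # (id, name) pair with the lowest mean, per name
df_kept = df[df.set_index(['id', 'name']).index.isin(keep.tolist())]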
What you need are these things:
groupby
mean
min
Try the following:
import pandas as pd
df = pd.read_csv('filename.csv', header=None)
df.columns = ['id', 'name', 'number', 'time', 'text_id', 'text', 'text']
print(df)
for eachname in df.name.unique():
    eachname_df = df.loc[df['name'] == eachname]
    grouped_df = eachname_df.groupby(['id', 'name'])
    avg_name = grouped_df['time'].mean()
    for a, b in grouped_df:
        if b['time'].mean() != avg_name.min():
            indextodrop = b.index.values
            for eachindex in indextodrop:
                df = df.drop([eachindex])
print(df)
Result:
id name number time text_id text text
0 1 apple 12 123 2 abc abc
1 1 apple 12 222 2 abc abc
2 2 orange 32 123 2 abc abc
3 2 orange 11 123 2 abc abc
4 3 apple 12 333 2 abc abc
5 3 apple 12 443 2 abc abc
6 3 apple 12 553 2 abc abc
id name number time text_id text text
0 1 apple 12 123 2 abc abc
1 1 apple 12 222 2 abc abc
2 2 orange 32 123 2 abc abc
3 2 orange 11 123 2 abc abc
Is there a better way to create a contingency table in pandas with pd.crosstab() or pd.pivot_table() to generate counts and percentages?
Current solution
cat=['A','B','B','A','B','B','A','A','B','B']
target = [True,False,False,False,True,True,False,True,True,True]
import pandas as pd
df=pd.DataFrame({'cat' :cat,'target':target})
using crosstab
totals=pd.crosstab(df['cat'],df['target'],margins=True).reset_index()
percentages = pd.crosstab(df['cat'],
df['target']).apply(lambda row: row/row.sum(),axis=1).reset_index()
and a merge
summaryTable=pd.merge(totals,percentages,on="cat")
summaryTable.columns = ['cat', '#False', '#True', 'All',
                        'percentFalse', 'percentTrue']
output
+---+-----+--------+-------+-----+--------------+-------------+
|   | cat | #False | #True | All | percentFalse | percentTrue |
+---+-----+--------+-------+-----+--------------+-------------+
| 0 | A   | 2      | 2     | 4   | 0.500000     | 0.500000    |
| 1 | B   | 2      | 4     | 6   | 0.333333     | 0.666667    |
+---+-----+--------+-------+-----+--------------+-------------+
You can do the following:
In [131]: s = df.groupby('cat').agg({'target': ['sum', 'count']}).reset_index(level=0)
In [132]: s.columns
Out[132]:
MultiIndex(levels=[['target', 'cat'], ['sum', 'count', '']],
labels=[[1, 0, 0], [2, 0, 1]])
Let's bring order to column names:
In [133]: s.columns = [col[1] if col[1] else col[0] for col in s.columns.tolist()]
In [134]: s
Out[134]:
cat sum count
0 A 2.0 4
1 B 4.0 6
In [135]: s['pctTrue'] = s['sum']/s['count']
In [136]: s['pctFalse'] = 1 - s.pctTrue
In [137]: s
Out[137]:
cat sum count pctTrue pctFalse
0 A 2.0 4 0.500000 0.500000
1 B 4.0 6 0.666667 0.333333
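A hedged alternative that stays entirely within crosstab, using the normalize parameter available in recent pandas versions (the 'All' margin row then has no percentages and shows NaN):
counts = pd.crosstab(df['cat'], df['target'], margins=True)
pcts = pd.crosstab(df['cat'], df['target'], normalize='index')   # row percentages
summary = counts.join(pcts, rsuffix='_pct').reset_index()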
I am trying to work through a problem in pandas, being more accustomed to R.
I have a data frame df with three columns: person, period, value
df.head() or the top few rows look like:
| person | period | value
0 | P22 | 1 | 0
1 | P23 | 1 | 0
2 | P24 | 1 | 1
3 | P25 | 1 | 0
4 | P26 | 1 | 1
5 | P22 | 2 | 1
Notice the last row records a value for period 2 for person P22.
I would now like to add a new column that provides the value from the previous period. So if for P22 the value in period 1 is 0, then this new column would look like:
| person | period | value | lastperiod
5 | P22 | 2 | 1 | 0
I believe I need to do something like the following command, having loaded pandas:
for p in df.period.unique():
    df['lastperiod'] == [???]
How should this be formulated?
You could groupby person and then apply a shift to the values:
In [11]: g = df.groupby('person')
In [12]: g['value'].apply(lambda s: s.shift())
Out[12]:
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 0
dtype: float64
Adding this as a column:
In [13]: df['lastPeriod'] = g['value'].apply(lambda s: s.shift())
In [14]: df
Out[14]:
person period value lastPeriod
1 P22 1 0 NaN
2 P23 1 0 NaN
3 P24 1 1 NaN
4 P25 1 0 NaN
5 P26 1 1 NaN
6 P22 2 1 0
Here the NaN signify missing data (i.e. there wasn't an entry in the previous period).
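As a hedged aside, recent pandas versions also allow the shift directly on the groupby object, which avoids the apply:
df['lastPeriod'] = df.groupby('person')['value'].shift()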