I have two data frames. They are structured like this:
df a

Letter  ID
A       3
B       4

df b

Letter  ID  Value
A       3   100
B       4   300
B       4   100
B       4   150
A       3   200
A       3   400
For each combination of Letter and ID in df a, I need to take the matching values from df b and run an outlier function on them.
Currently df a has over 40,000 rows and df b has around 4,500,000 rows.
a['Results'] = a.apply(lambda x: outliers(b[(b['Letter']==x['Letter']) & (b['ID']==x['ID'])]['Value'].to_list()), axis=1)
As you can imagine, this is taking forever. Is there some mistake I'm making, or something that can improve this code?
I'd first aggregate every combination of [Letter, ID] in df_b into a list using .groupby, then merge with df_a and apply your outliers function afterwards. Should be faster:
df_a["results"] = df_a.merge(
df_b.groupby(["Letter", "ID"])["Value"].agg(list),
left_on=["Letter", "ID"],
right_index=True,
how="left",
)["Value"].apply(outliers)
print(df_a)
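Note that outliers is the asker's own function and isn't shown anywhere. For a self-contained run, a minimal IQR-based placeholder (purely an assumption about what it might do) could be:

import numpy as np

def outliers(values):
    # Placeholder (assumption): return the values that fall outside 1.5 * IQR.
    arr = np.asarray(values, dtype=float)
    if arr.size == 0:
        return []
    q1, q3 = np.percentile(arr, [25, 75])
    lower, upper = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    return [v for v in arr if v < lower or v > upper]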
You can first merge the datasets a and b, then group by Letter and ID and aggregate Value with the outlier function.
pd.merge(a, b, how="inner", on=['Letter', 'ID']).groupby(['Letter', 'ID'])['Value'].agg(outliers).reset_index()
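A self-contained sketch of that approach on the sample frames (column names matched to the question; .apply is used here instead of .agg so that a list-valued result from the placeholder outliers above is allowed):

import pandas as pd

df_a = pd.DataFrame({"Letter": ["A", "B"], "ID": [3, 4]})
df_b = pd.DataFrame({
    "Letter": ["A", "B", "B", "B", "A", "A"],
    "ID": [3, 4, 4, 4, 3, 3],
    "Value": [100, 300, 100, 150, 200, 400],
})

result = (
    pd.merge(df_a, df_b, how="inner", on=["Letter", "ID"])
      .groupby(["Letter", "ID"])["Value"]
      .apply(lambda s: outliers(s.tolist()))  # one call per group
      .reset_index()
)
print(result)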
I'm new to Python, and I need to gather many variables from multiple dataframes:
I wrote this code, but it takes me a long time to configure it for many exercises.
This is the code:
import pandas as pd

df = pd.concat([df1[df1.columns[0]], df2[df1.columns[0]], df1[df1.columns[1]],
                df2[df1.columns[1]], df1[df1.columns[2]], df2[df1.columns[2]],
                df1[df1.columns[3]], df2[df1.columns[3]], df1[df1.columns[4]],
                df2[df1.columns[4]], df1[df1.columns[5]], df2[df1.columns[5]],
                df1[df1.columns[6]], df2[df1.columns[6]]], axis=1)
The number of dataframes and columns can be much bigger. Thanks.
It looks like what you're trying to do is: for each column in one dataframe, pair it with the same column from another dataframe, ending up with a single dataframe that has two copies of every column, in the original column order.
In your case:
from pandas import DataFrame, concat

df1 = DataFrame([['a', 'b', 'c'], ['d', 'e', 'f']])
df2 = DataFrame([['g', 'h', 'i'], ['j', 'k', 'l']])

# Interleave each column of df1 with the matching column of df2
df = concat([s for ss in [(df1[c], df2[c]) for c in df1.columns] for s in ss], axis=1)
print(df)
Result:
0 0 1 1 2 2
0 a g b h c i
1 d j e k f l
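Since you mention that the number of dataframes can be much bigger, the same idea generalizes to any list of dataframes that share the same columns (the list name frames below is just illustrative):

from pandas import DataFrame, concat

df1 = DataFrame([['a', 'b', 'c'], ['d', 'e', 'f']])
df2 = DataFrame([['g', 'h', 'i'], ['j', 'k', 'l']])

frames = [df1, df2]  # any number of dataframes with matching columns
# For every column, take that column from each frame in turn, then concat.
df = concat([f[c] for c in frames[0].columns for f in frames], axis=1)
print(df)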
I have a big dataframe with many duplicates in it. I want to keep the first and last entry of each duplicate but drop every duplicate in between.
I've already tried to do this with df.drop_duplicates, using keep='first' and keep='last' to get two dataframes and then merging them back into one so that I have the first and last entries, but that didn't work.
df_first = df
df_last = df
df_first['Path'].drop_duplicates(keep='first', inplace=True)
df_last['Path'].drop_duplicates(keep='last', inplace=True)
Thanks for your help in advance!
Use GroupBy.nth, which avoids duplicate rows when a group only has one entry:
import pandas as pd

df = pd.DataFrame({
    'a': [5, 3, 6, 9, 2, 4],
    'Path': list('aaabbc')
})
print(df)
a Path
0 5 a
1 3 a
2 6 a
3 9 b
4 2 b
5 4 c
df = df.groupby('Path').nth([0, -1])
print (df)
a
Path
a 5
a 6
b 9
b 2
c 4
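Alternatively, here is a sketch of how the drop_duplicates idea from the question can be made to work (assuming duplicates are defined by the 'Path' column, and reusing df from above):

import pandas as pd

first = df.drop_duplicates(subset='Path', keep='first')
last = df.drop_duplicates(subset='Path', keep='last')

# Stack both, then drop the second copy of rows from groups that only have one entry.
both = pd.concat([first, last])
result = both[~both.index.duplicated()].sort_index()
print(result)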
Using groupby.nth, updated from the previous solution, to grab the nth entry of each group:

def keep_second_dup(df, column_name):
    df = df.copy()
    # Count how many times each value in column_name occurs
    df['Count'] = df[column_name].map(df[column_name].value_counts())
    duplicated = df[df['Count'] > 1]
    residual = df[df['Count'] == 1]
    # Take the second entry of every duplicated group
    sec = duplicated.groupby(column_name).nth(1).reset_index()
    final_data = pd.concat([sec, residual])
    final_data.drop('Count', axis=1, inplace=True)
    return final_data
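For example, on the sample frame from the answer above, this keeps the second 'a' row, the second 'b' row and the single 'c' row:

print(keep_second_dup(df, 'Path'))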
I'm just getting into pandas and I am trying to add a new column to an existing dataframe.
I have two dataframes where the index of one dataframe links to a column in the other dataframe. Where these values are equal, I need to put the value of another column from the source dataframe into a new column of the destination dataframe.
The code section below illustrates what I mean. The commented part is what I need as output.
I guess I need the .loc[] function.
Another, minor, question: is it bad practice to have a non-unique index?
import pandas as pd

d = {'key': ['a', 'b', 'c'],
     'bar': [1, 2, 3]}
d2 = {'key': ['a', 'a', 'b'],
      'other_data': ['10', '20', '30']}
df = pd.DataFrame(d)
df2 = pd.DataFrame(data=d2)
df2 = df2.set_index('key')
print(df2)
## other_data new_col
##key
##a 10 1
##a 20 1
##b 30 2
Use rename with a Series as the index mapping:
df2['new'] = df2.rename(index=df.set_index('key')['bar']).index
print (df2)
other_data new
key
a 10 1
a 20 1
b 30 2
Or map:
df2['new'] = df2.index.to_series().map(df.set_index('key')['bar'])
print (df2)
other_data new
key
a 10 1
a 20 1
b 30 2
If you want better performance, it is best to avoid duplicates in the index. Also, some functions such as reindex fail with a duplicated index.
You can use join
df2.join(df.set_index('key'))
other_data bar
key
a 10 1
a 20 1
b 30 2
One way to rename the column in the process
df2.join(df.set_index('key').bar.rename('new'))
other_data new
key
a 10 1
a 20 1
b 30 2
Another, minor, question: is it bad practice to have a non-unique index?
It is not great practice, but depends on your needs and can be okay in some circumstances.
Issue 1: join operations
A good place to start is to think about what makes an Index different from a standard DataFrame column. This engenders the question: if your Index has duplicate values, does it really need to be specified as an Index, or could it just be another column in a RangeIndex-ed DataFrame? If you've ever used SQL or any other DBMS and want to mimic join operations in pandas with functions such as .join or .merge, you'll lose the functionality of a primary key if you have duplicate index values. A merge will give you what is basically a Cartesian product, probably not what you're looking for.
For example:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 2),
                  index=2*list('abcde'))
df2 = df.rename(columns={0: 'a', 1: 'b'})
print(df.merge(df2, left_index=True, right_index=True).head(7))
0 1 a b
a 0.73737 1.49073 0.73737 1.49073
a 0.73737 1.49073 -0.25562 -2.79859
a -0.25562 -2.79859 0.73737 1.49073
a -0.25562 -2.79859 -0.25562 -2.79859
b -0.93583 1.17583 -0.93583 1.17583
b -0.93583 1.17583 -1.77153 -0.69988
b -1.77153 -0.69988 -0.93583 1.17583
Issue 2: performance
Unique-valued indices make certain operations efficient, as explained in this post.
When the index is unique, pandas uses a hash table to map key to value, O(1). When the index is non-unique but sorted, pandas uses binary search, O(log N); when the index is in random order, pandas needs to check all the keys in the index, O(N).
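A rough way to see the difference yourself (the timings are only illustrative and depend on machine and pandas version):

import numpy as np
import pandas as pd
from timeit import timeit

n = 100000
unique_ser = pd.Series(np.arange(n), index=np.arange(n))
dup_ser = pd.Series(np.arange(n), index=np.zeros(n, dtype=int))  # every label is 0

print(timeit(lambda: unique_ser.loc[n - 1], number=100))  # hash-table lookup
print(timeit(lambda: dup_ser.loc[0], number=100))         # has to collect every matching row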
A word on .loc
Using .loc will return all instances of the label. This can be a blessing or a curse depending on what your objective is. For example,
df = pd.DataFrame(np.random.randn(10, 2),
                  index=2*list('abcde'))
print(df.loc['a'])
0 1
a 0.73737 1.49073
a -0.25562 -2.79859
With the help of .loc:
df2['new'] = df.set_index('key').loc[df2.index, 'bar']
Output:
other_data new
key
a 10 1
a 20 1
b 30 2
Using combine_first
In [442]: df2.combine_first(df.set_index('key')).dropna()
Out[442]:
bar other_data
key
a 1.0 10
a 1.0 20
b 2.0 30
Or, using map
In [461]: df2.assign(bar=df2.index.to_series().map(df.set_index('key')['bar']))
Out[461]:
other_data bar
key
a 10 1
a 20 1
b 30 2
I have two separate pandas dataframes (df1 and df2) which have multiple columns, but only one in common ('text').
I would like to find every row in df2 whose value in the common column does not match any row of df1.
df1
A B text
45 2 score
33 5 miss
20 1 score
df2
C D text
.5 2 shot
.3 2 shot
.3 1 miss
Result df (remove row containing miss since it occurs in df1)
C D text
.5 2 shot
.3 2 shot
Is it possible to use the isin method in this scenario?
As you asked, you can do this efficiently using isin (without resorting to expensive merges).
>>> df2[~df2.text.isin(df1.text.values)]
C D text
0 0.5 2 shot
1 0.3 2 shot
You can merge them and keep only the lines that have a NaN.
df2[pd.merge(df2, df1.drop_duplicates('text'), how='left').isnull().any(axis=1)]
or you can use isin:
df2[~df2.text.isin(df1.text)]
EDIT:
import numpy as np
mergeddf = pd.merge(df2,df1, how="left")
result = mergeddf[(np.isnan(mergeddf['A']))][['C','D','text']]
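Another option is merge with indicator=True, which makes the anti-join explicit; the _merge column it adds marks where each row came from:

merged = df2.merge(df1[['text']].drop_duplicates(), on='text',
                   how='left', indicator=True)
result = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
print(result)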
So I created two dataframes from existing CSV files, both consisting entirely of numbers. The second dataframe has an index from 0 to 8783 and one column of numbers, and I want to add it as a new column to the first dataframe, whose index consists of a month, day and hour. I tried using append, merge and concat and none of them worked, so then I tried simply using:
x1GBaverage['Power'] = x2_cut
where x1GBaverage is the first dataframe and x2_cut is the second. When I did this it added x2_cut properly, but all the values were entered as NaN instead of the numerical values that they should be. How should I be approaching this?
x1GBaverage['Power'] = x2_cut.values
problem solved :)
The thing about pandas is that values are implicitly linked to their indices unless you deliberately specify that you only need the values to be transferred over.
If the row counts are the same and you just want to tack it on the end, the indexes either need to match, or you need to pass just the underlying values. In the example below, columns 3 and 5 are the index-matching and values versions, and 4 is what you're running into now:
In [58]: df = pd.DataFrame(np.random.random((3,3)))
In [59]: df
Out[59]:
0 1 2
0 0.670812 0.500688 0.136661
1 0.185841 0.239175 0.542369
2 0.351280 0.451193 0.436108
In [61]: df2 = pd.DataFrame(np.random.random((3,1)))
In [62]: df2
Out[62]:
0
0 0.638216
1 0.477159
2 0.205981
In [64]: df[3] = df2
In [66]: df.index = ['a', 'b', 'c']
In [68]: df[4] = df2
In [70]: df[5] = df2.values
In [71]: df
Out[71]:
0 1 2 3 4 5
a 0.670812 0.500688 0.136661 0.638216 NaN 0.638216
b 0.185841 0.239175 0.542369 0.477159 NaN 0.477159
c 0.351280 0.451193 0.436108 0.205981 NaN 0.205981
If the row counts differ, you'll need to use df.merge and let it know which columns it should be using to join the two frames.
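For example, something along these lines, where the column names are only illustrative:

import pandas as pd

left = pd.DataFrame({'hour': [0, 1, 2, 3], 'demand': [10.0, 12.5, 11.0, 9.5]})
right = pd.DataFrame({'hour': [1, 2, 3], 'power': [5.0, 6.0, 7.0]})

# Join on the shared 'hour' column instead of relying on the index
combined = left.merge(right, on='hour', how='left')
print(combined)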