I just asked a similar question, but then realized it wasn't the right question.
What I'm trying to accomplish is to combine two data frames that have the same columns but may or may not have common rows (indices of a MultiIndex). I'd like to combine them, taking the sum of one of the columns while leaving the other columns untouched.
According to the accepted answer, the approach may be something like:
def mklbl(prefix, n):
    try:
        return ["%s%s" % (prefix, i) for i in range(n)]
    except TypeError:
        # n is already an iterable of suffixes rather than a count
        return ["%s%s" % (prefix, i) for i in n]
mi1 = pd.MultiIndex.from_product([mklbl('A', 4), mklbl('C', 2)])
mi2 = pd.MultiIndex.from_product([mklbl('A', [2, 3, 4]), mklbl('C', 2)])
df2 = pd.DataFrame({'a': np.arange(len(mi1)), 'b': np.arange(len(mi1)),
                    'c': np.arange(len(mi1)), 'd': np.arange(len(mi1))[::-1]},
                   index=mi1).sort_index().sort_index(axis=1)
df1 = pd.DataFrame({'a': np.arange(len(mi2)), 'b': np.arange(len(mi2)),
                    'c': np.arange(len(mi2)), 'd': np.arange(len(mi2))[::-1]},
                   index=mi2).sort_index().sort_index(axis=1)
df1 = df1.add(df2.pop('b'))
but the problem is that this fails because the indices don't align.
This is close to what I'm trying to achieve, except that I lose rows that are not common to the two dataframes:
df1['b'] = df1['b'].add(df2['b'], fill_value=0)
But this gives me:
Out[197]:
a b c d
A2 C0 0 4 0 5
C1 1 6 1 4
A3 C0 2 8 2 3
C1 3 10 3 2
A4 C0 4 4 4 1
C1 5 5 5 0
When I want:
In [197]: df1
Out[197]:
a b c d
A0 C0 0 0 0 7
C1 1 2 1 6
A1 C0 2 4 2 5
C1 3 6 3 4
A2 C0 0 4 0 5
C1 1 6 1 4
A3 C0 2 8 2 3
C1 3 10 3 2
A4 C0 4 4 4 1
C1 5 5 5 0
Note: in response to #RandyC's comment about the XY problem... the specific problem is that I have a class which reads data and returns a dataframe of 1e9 rows. The columns of the data frame are latll, latur, lonll, lonur, concentration, elevation. The data frame is indexed by a MultiIndex (lat, lon, time), where time is a datetime. The rows of the two dataframes may or may not be the same (if they exist for a given date, the lat/lon values will be the same... they are grid cell centers). latll, latur, lonll, and lonur are calculated from lat/lon. I want to sum the concentration column as I add two data frames, but not change the others.
Self-answering: there was an error in the comment above that caused double adding. This is correct:
newdata = df2.pop('b')                                 # remove 'b' from df2 and keep it
result = df1.combine_first(df2)                        # union of rows; df1 values win where both exist
result['b'] = result['b'].add(newdata, fill_value=0)   # sum 'b' across both frames
seems to provide the solution to my use-case.
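For the record, here is how the same pattern maps onto the lat/lon/time use-case described in the note above (a minimal sketch; df_a and df_b are placeholder names for the two frames indexed by (lat, lon, time), not tested code):

conc_b = df_b.pop('concentration')                 # take the column to be summed out of df_b
combined = df_a.combine_first(df_b)                # union of rows; non-summed columns kept as-is
combined['concentration'] = combined['concentration'].add(conc_b, fill_value=0)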
I'm working with a DataFrame that looks like this:
ID SEQ BEG_GAP END_GAP
0 A1 ABCDEFG 2 4
1 B1 YUUAAMN 4 6
2 C1 LKHUTYYYYA 7 9
What I'm trying to do is basically first replace the part of the string in "SEQ" that lies between the values of "BEG_GAP" and "END_GAP", and then explode the two remaining pieces of the string into two different lines (probably using pandas' explode).
I.e: First expected result:
ID SEQ BEG_GAP END_GAP
0 A1 AB---FG 2 4
1 B1 YUUA--- 4 6
2 C1 LKHUTY--YA 7 8
To then get:
ID SEQ BEG_GAP END_GAP
0 A1 AB 2 4
1 A1 FG 2 4
2 B1 YUUA 4 6
3 C1 LKHUTY 7 8
4 C1 YA 7 8
I'm trying to use the following code:
import pandas as pd

df = pd.read_csv("..\path_to_the_csv.csv")

for index, rows in df.iterrows():
    start = df["BEG_GAP"].astype(float)
    stop = df["END_GAP"].astype(float)
    df["SEQ"] = df["SEQ"].astype(str)
    df['SEQ'] = df['SEQ'].str.slice_replace(start=start, stop=stop, repl='-')
But the column "SEQ" that I'm getting is full of NaN. I suppose it has to do with how I'm using start and stop. I could use some help with this, and also with how to later divide the rows according to the gaps.
I hope I was clear enough, thanks in advance!
Let's try:
df['SEQ'] = df.apply(lambda x: [x.SEQ[:x.BEG_GAP], x.SEQ[x.END_GAP+1:]], axis=1)
output = df.explode('SEQ').query('SEQ!=""')
Output:
ID SEQ BEG_GAP END_GAP
0 A1 AB 2 4
0 A1 FG 2 4
1 B1 YUUA 4 6
2 C1 LKHUTYY 7 8
2 C1 A 7 8
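One small follow-up: explode keeps the original index (0, 0, 1, 2, 2 above). If you want it renumbered as in the expected output, resetting the index should do it (a minor addition, not part of the answer above):

output = df.explode('SEQ').query('SEQ != ""').reset_index(drop=True)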
Let's say I have a pandas dataframe
rid category
0 0 c2
1 1 c3
2 2 c2
3 3 c3
4 4 c2
5 5 c2
6 6 c1
7 7 c3
8 8 c1
9 9 c3
I want to add 2 columns, pid and nid, such that for each row, pid contains a random id (other than rid) that belongs to the same category as rid, and nid contains a random id that belongs to a different category than rid.
An example dataframe would be:
rid category pid nid
0 0 c2 2 1
1 1 c3 7 4
2 2 c2 0 1
3 3 c3 1 5
4 4 c2 5 7
5 5 c2 4 6
6 6 c1 8 5
7 7 c3 9 8
8 8 c1 6 2
9 9 c3 1 2
Note that pid should not be the same as rid. Right now, I am just brute forcing it by iterating through the rows and sampling each time, which seems very inefficient.
Is there a better way to do this?
EDIT 1: For simplicity let us assume that each category is represented at least twice, so that at least one id can be found that is not rid but has the same category.
EDIT 2: For further simplicity let us assume that in a large dataframe the probability of ending up with the same id as rid is zero. If that is the case I believe the solution should be easier. I would prefer not to make this assumption though
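For reference, a rough sketch of the brute-force approach I'm currently using (the exact code here is illustrative, not my actual implementation):

import numpy as np
import pandas as pd

# sample per row: once inside the same category (excluding rid), once outside it
pids, nids = [], []
for _, row in df.iterrows():
    same = df.loc[(df['category'] == row['category']) & (df['rid'] != row['rid']), 'rid']
    other = df.loc[df['category'] != row['category'], 'rid']
    pids.append(np.random.choice(same))
    nids.append(np.random.choice(other))
df['pid'] = pids
df['nid'] = nids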
For the pid column use Sattolo's algorithm (a cyclic shuffle in which no element stays in its original position); for the nid column, sample with numpy.random.choice from the set difference between all rid values and the values of the current group:
import numpy as np
from random import randrange

# https://stackoverflow.com/questions/7279895
def sattoloCycle(items):
    items = list(items)
    i = len(items)
    while i > 1:
        i = i - 1
        j = randrange(i)  # 0 <= j <= i-1
        items[j], items[i] = items[i], items[j]
    return items

def outsideGroupRand(x):
    # sample from rids that are NOT in the current group, without replacement
    return np.random.choice(list(set(df['rid']).difference(x)),
                            size=len(x),
                            replace=False)

df['pid1'] = df.groupby('category')['rid'].transform(sattoloCycle)
df['nid1'] = df.groupby('category')['rid'].transform(outsideGroupRand)
print (df)
rid category pid nid pid1 nid1
0 0 c2 2 1 4 6
1 1 c3 7 4 7 4
2 2 c2 0 1 5 3
3 3 c3 1 5 1 0
4 4 c2 5 7 2 9
5 5 c2 4 6 0 8
6 6 c1 8 5 8 3
7 7 c3 9 8 9 5
8 8 c1 6 2 6 5
9 9 c3 1 2 3 6
import pandas as pd
import numpy as np

## generate dummy data
raw = {
    "rid": range(10),
    "cat": np.random.choice("c1,c2,c3".split(","), 10)
}
df = pd.DataFrame(raw)

def get_random_ids(x):
    pids, nids = [], []
    sh = x.copy()
    for _ in x:
        ## do circular shift, then choose a random value except cur_value
        cur_value = sh.iloc[0]
        sh = sh.shift(-1)
        sh[-1:] = cur_value
        pids.append(np.random.choice(sh[:-1]))
    ## randomly choose nids from values of the other categories
    nids = np.random.choice(df[df["cat"] != x.name]["rid"], len(x))
    return pd.DataFrame({"pid": pids, "nid": nids}, index=x.index)

new_ids = df.groupby("cat")["rid"].apply(lambda x: get_random_ids(x))
df.join(new_ids).sort_values("cat")
output
rid cat pid nid
5 5 c1 8.0 9
8 8 c1 5.0 6
0 0 c2 6.0 1
2 2 c2 0.0 8
3 3 c2 0.0 9
6 6 c2 2.0 4
7 7 c2 3.0 1
1 1 c3 9.0 5
4 4 c3 9.0 0
9 9 c3 4.0 2
Start by defining a function computing pid:
def getPid(elem, grp):
    return grp[grp != elem].sample().values[0]
Parameters:
elem - the current rid from the group,
grp - the whole group of rid values.
The idea is to:
select the "other" elements from the current group (for some category),
call sample,
return the single value from the Series that sample returns.
Then define a second function, generating both new ids:
def getIds(grp):
    pids = grp.rid.apply(getPid, grp=grp.rid)
    rowNo = grp.rid.size
    currGrp = grp.name
    nids = df.query('category != @currGrp').rid\
        .sample(rowNo, replace=True)
    return pd.DataFrame({'pid': pids, 'nid': nids.values}, index=grp.index)
Note that:
all nid values for the current group can be computed with a single call to sample,
from a Series of rids for the "other" categories.
But pid values must be computed separately, applying getPid to each element (rid) of the current group.
The reason is that each time a different element must be eliminated from the current group before sample is called.
And to get the result, run a single instruction:
pd.concat([df, df.groupby('category').apply(getIds)], axis=1)
I have a dataframe from which I want to create subsets in a loop, according to the values of one column.
Here is an example df :
c1 c2 c3
A 1 2
A 2 2
B 0 2
B 1 1
I would like to create subsets like so in a loop:
first iteration, select all rows in which C1 == A, and only columns C2 and C3; second iteration, all rows in which C1 == B, and only C2 and C3.
I've tried the following code :
for level in enumerate(df.loc[:, "C1"].unique()):
    df_s = df.loc[df["C1"] == level].iloc[:, 1:len(df.columns)]
    # other actions on the subsetted dataframe
but the subset isn't performed.
How can I iterate through the levels of a column?
For instance in R it would be
for (le in levels(df$C1)) {
    dfs <- df[df$C1 == le, 2:ncol(df)]
}
Thanks
There is no need for enumerate, which yields both the index and the value; just loop through the c1 column directly:
for level in df.c1.unique():
    df_s = df.loc[df.c1 == level].drop('c1', axis=1)
    print(level + ":\n", df_s)
#A:
# c2 c3
#0 1 2
#1 2 2
#B:
# c2 c3
#2 0 2
#3 1 1
Most likely, what you need is df.groupby('c1').apply(lambda g: ...), which should be a more efficient approach; here g is the sub-DataFrame with a unique c1 value.
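For example, a possible shape of that groupby-based version (illustrative only; the process function is a placeholder for whatever per-subset work you need, and the column names come from the question's example):

def process(g):
    # g is the sub-DataFrame for one unique value of c1
    subset = g[['c2', 'c3']]
    return subset.sum()          # placeholder for the real per-group work

result = df.groupby('c1').apply(process)
print(result)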
for level in df.loc[:, "c1"].unique():
    print(level)
    df_s = df.loc[df["c1"] == level, :].iloc[:, 1:len(df.columns)]
    print(df_s)
A
c2 c3
0 1 2
1 2 2
B
c2 c3
2 0 2
3 1 1
Or (this one is more like R):
for level in df.loc[:, "c1"].unique():
    print(level)
    df_s = df.loc[df["c1"] == level, df.columns[1:len(df.columns)]]
    print(df_s)
Question:
How can one use pandas df.groupby() function to create randomly selected groups of groups?
Example:
I would like to group a dataframe into random groups, where each group contains at most n unique values of a given column.
I have a dataframe with a variety of columns including "id". Some rows have unique ids whereas others may have the same id. For example:
c1 id c2
0 a 1 4
1 b 2 6
2 c 2 2
3 d 5 7
4 y 9 3
In reality this dataframe can have up to 1000 or so rows.
I would like to be able to group this dataframe using the following criteria:
each group should contain at most n unique ids
no id should appear in more than one group
the specific ids in a given group should be selected randomly
each id should appear in exactly one group
For example the example dataframe (above) could become:
group1:
c1 id c2
0 a 1 4
4 y 9 3
group2:
c1 id c2
1 b 2 6
2 c 2 2
3 d 5 7
where n = 2
Thanks for your suggestions.
This seems difficult to do with a single groupby statement. One way to do it:
import random
import numpy as np

uniq = df['id'].unique()
random.shuffle(uniq)              # shuffle the unique ids in place
groups = np.split(uniq, 2)        # split into 2 groups here
dfr = df.set_index(df['id'])
for gp in groups:
    print(dfr.loc[gp])
For
c1 id c2
id
9 y 9 3
1 a 1 4
c1 id c2
id
5 d 5 7
2 b 2 6
2 c 2 2
If the group size n doesn't divide len(uniq), you can use np.split(uniq, range(n, len(uniq), n)) instead.
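A tiny illustration of that uneven split (the values here are made up, with 5 unique ids and n = 2):

import numpy as np

uniq = np.array([1, 2, 5, 9, 11])
n = 2
print(np.split(uniq, range(n, len(uniq), n)))
# [array([1, 2]), array([5, 9]), array([11])]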
Here's a way to do it:
import numpy as np
import pandas as pd

df = pd.DataFrame({'c1': list('abcdy'), 'id': [1, 2, 2, 5, 9], 'c2': [4, 6, 2, 7, 3]})
n = 2
shuffled_ids = np.random.permutation(df['id'].unique())
id_groups = [shuffled_ids[i:i+n] for i in range(0, len(shuffled_ids), n)]
groups = [df['id'].apply(lambda x: x in g) for g in id_groups]
Output:
In [1]: df[groups[0]]
Out[1]:
c1 c2 id
1 b 6 2
2 c 2 2
3 d 7 5
In [2]: df[groups[1]]
Out[2]:
c1 c2 id
0 a 4 1
4 y 3 9
This approach doesn't involve changing the index, in case you need to keep it.
I have two dataframes; one is pretty big, the other is really huge.
df1: "classid"(text), "c1" (numeric), "c2"(numeric)
df2: "classid"(text), "c3" (numeric), "c4"(numeric)
I want to filter df2 based on values on df1. In pseudocode one would formulate it like this:
df2[(df2.classid == df1.classid) & (df2.c3 < df1.c1) & (df2.c4 < df1.c2)]
Right now I do this by iterating over the rows of df1 and doing some 40k filter calls on df2, which is a 3-million-row table. Obviously this is too slow.
df = dataframe()
for row in df1:
    dft = df2[(df2.classid == row.classid) & (df2.c3 < row.c1) & (df2.c4 < row.c2)]
    df.add(dft)
I guess the best option is to do an inner join and then apply the (df2.c3 < df1.c1) & (df2.c4 < df1.c2) filtering, but the problem is that the inner join would create a huge table, since classid is not an index and not a unique row identifier. If the filtering could be applied at the same time, that might just work. Any ideas?
Iterating should be a last resort. I'd merge the other df's columns c1 and c2 into df:
df = df.merge(df1, on='classid', how='left')
Then I would group by classid and filter the rows, as in the following example:
In [95]:
df = pd.DataFrame({'classid':[0,0,1,1,1,2,2], 'c1':np.arange(7), 'c2':np.arange(7), 'c3':3, 'c4':4})
df
Out[95]:
c1 c2 c3 c4 classid
0 0 0 3 4 0
1 1 1 3 4 0
2 2 2 3 4 1
3 3 3 3 4 1
4 4 4 3 4 1
5 5 5 3 4 2
6 6 6 3 4 2
In [100]:
df.groupby('classid').filter(lambda x: len(x[x['c3'] < x['c1']] & x[x['c4'] < x['c2']]) > 0)
Out[100]:
c1 c2 c3 c4 classid
2 2 2 3 4 1
3 3 3 3 4 1
4 4 4 3 4 1
5 5 5 3 4 2
6 6 6 3 4 2
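Alternatively, once df1's columns have been merged in, the question's own pseudocode can be applied directly as a boolean mask (a sketch, assuming the merged frame keeps the column names c1..c4; note this still materializes the join, which can get large if classid repeats on both sides, as the question points out):

merged = df2.merge(df1, on='classid', how='inner')
result = merged[(merged['c3'] < merged['c1']) & (merged['c4'] < merged['c2'])]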