I have a dataframe df. I want to add two new columns, 0 and 1, and fill them one row at a time rather than assigning the complete column at once. When I build a pd.Series for each row, every value in the new columns except the last row ends up as NaN. How can I fix this?
I need to add the data one row at a time, so please suggest a solution that works that way.
df
val
1
2
3
code
for j in range(len(df)):
    for i in range(2):
        cal = df.val.iloc[j] + 10
        df[i] = pd.Series(cal, index=df.index[[j]])
output
val | 0 | 1
1 | NaN | NaN
2 | NaN | NaN
3 | 13.0 | 13.0
expected output
val | 0 | 1
1 | 11.0 | 11.0
2 | 12.0 | 12.0
3 | 13.0 | 13.0
EDIT
I originally asked this question on Stack Overflow but did not get an answer there, which is why I condensed it and presented it this way. If possible, you can check the original question here
It is not clear why you want to add the values one row at a time with such inefficient methods, so I suggest not using this code and relying on a vectorized solution instead.
However, if you really need to do it row by row for some reason, modify your loop like this:
for j in range(len(df)):
    for i in range(2):
        cal = df.val.iloc[j] + 10
        df.loc[j, i] = cal
# val 0 1
# 0 1 11.0 11.0
# 1 2 12.0 12.0
# 2 3 13.0 13.0
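For context, the NaN values in the original attempt appear because df[i] = pd.Series(cal, index=df.index[[j]]) replaces the whole column on every iteration with a Series that has only one index label, so every other row aligns to NaN and only the last assignment survives. A minimal vectorized sketch that produces the same two columns without any loop (assuming the same toy df as above):
for i in range(2):
    df[i] = df["val"] + 10
#    val   0   1
# 0    1  11  11
# 1    2  12  12
# 2    3  13  13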
Use apply function
In [29]: df
Out[29]:
val
0 1
1 2
2 3
In [13]: df[0] = df["val"].apply(lambda x: x + 10)
In [14]: df[1] = df["val"].apply(lambda x: x + 10)
In [15]: df
Out[15]:
val 0 1
0 1 11 11
1 2 12 12
2 3 13 13
Or use iterrows
In [21]: temp = []
In [22]: for index, row in df.iterrows():
...: temp.append(row["val"] + 10)
...:
In [23]: temp
Out[23]: [11, 12, 13]
In [24]: df[0] = temp
In [25]: df[1] = temp
In [26]: df
Out[26]:
val 0 1
0 1 11 11
1 2 12 12
2 3 13 13
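As a side note, itertuples is generally faster than iterrows if an explicit loop is really needed; a small sketch under the same df, building the list once and assigning it to both columns:
temp = [row.val + 10 for row in df.itertuples()]
df[0] = temp
df[1] = temp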
Disclaimer - you should NOT use this code. It's the wrong way. But - given that you want to do it row by row, here's a solution:
df = pd.DataFrame({"val": [1,2, 3]})
for i in df.index:
    val = df.loc[i, "val"]
    for j in [0, 1]:
        df.loc[i, j] = val + 10
print(df)
==>
val 0 1
0 1 11.0 11.0
1 2 12.0 12.0
2 3 13.0 13.0
The proper way would be to do something like:
df = pd.DataFrame({"val": [1,2, 3]})
df[0] = df.val + 10
df[1] = df.val + 10
Same results basically, much better when it comes to pandas.
maybe (using .loc so the assignment writes back to df rather than to a temporary slice):
for i in range(len(df)):
    for j in range(2):
        df.loc[df.index[i], j] = df.val.iloc[i] + 10
Related
I have a CSV file with many columns in it. Let me give you people an example.
A B C D
1 1 0
1 1 1
0 0 0
I want to do this.
if col-A first row value == 1 AND col-B first row value == 1 AND col-C first row value == 1:
    then put "FIC" in first row of Col-D
else:
    enter "PI"
I am using pandas.
There are more than 1500 rows and I want to do this for every row. How can I do this? Please help
If you need to test whether all values are 1 in the selected columns, use:
df['D'] = np.where(df[['A','B','C']].eq(1).all(axis=1), 'FIC','PI')
Or, if the selected columns contain only 0/1 values:
df['D'] = np.where(df[['A','B','C']].all(axis=1), 'FIC','PI')
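A minimal runnable sketch, assuming the three rows shown in the question (imports included for completeness):
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 0], 'B': [1, 1, 0], 'C': [0, 1, 0]})
df['D'] = np.where(df[['A', 'B', 'C']].eq(1).all(axis=1), 'FIC', 'PI')
print(df)
#    A  B  C    D
# 0  1  1  0   PI
# 1  1  1  1  FIC
# 2  0  0  0   PI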
EDIT:
print (df)
A B C D
0 1 1 NaN NaN
1 1 1 1.0 NaN
2 0 0 0.0 NaN
m1 = df[['A','B','C']].all(axis=1)
m2 = df[['A','B','C']].isna().any(axis=1)
df['D'] = np.select([m2, m1], ['ZD', 'FIC'],'PI')
print (df)
A B C D
0 1 1 NaN ZD
1 1 1 1.0 FIC
2 0 0 0.0 PI
Without numpy you can use:
df['D'] = df[['A', 'B', 'C']].astype(bool).all(1).replace({True: 'FIC', False: 'PI'})
print(df)
# Output
A B C D
0 1 1 0 PI
1 1 1 1 FIC
2 0 0 0 PI
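A map-based variant of the same idea, just another way to translate the boolean result (a sketch, assuming the same columns):
df['D'] = df[['A', 'B', 'C']].astype(bool).all(axis=1).map({True: 'FIC', False: 'PI'})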
I have this pandas dataframe:
id A B
1 nan 0
2 nan 1
3 6 0
4 nan 1
5 12 1
6 14 0
I want to replace every NaN in 'A' based on the value of 'B':
for example, if B = 0, A should be a random number in [0, 1];
if B = 1, A should be a random number in [1, 3].
How do I do this?
A solution if performance is important: generate random values for the whole length of the DataFrame up front, then assign them by condition.
Use numpy.random.randint to generate the random values and pass the conditions to numpy.select, chaining them with & for bitwise AND; the comparisons use Series.isna and Series.eq:
a = np.random.randint(0,2, size=len(df)) #generate 0,1
b = np.random.randint(1,4, size=len(df)) #generate 1,2,3
m1 = df.A.isna()
m2 = df.B.eq(0)
m3 = df.B.eq(1)
df['A'] = np.select([m1 & m2, m1 & m3],[a, b], df.A)
print (df)
id A B
0 1 1.0 0
1 2 3.0 1
2 3 6.0 0
3 4 3.0 1
4 5 12.0 1
5 6 14.0 0
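An alternative sketch with plain boolean indexing and .loc, filling each condition separately (assumes the same df; np.nan marks the missing values):
m = df['A'].isna()
m0 = m & df['B'].eq(0)
m1 = m & df['B'].eq(1)
df.loc[m0, 'A'] = np.random.randint(0, 2, size=m0.sum())
df.loc[m1, 'A'] = np.random.randint(1, 4, size=m1.sum())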
tt = pd.DataFrame({'a':[1,2,None,3],'b':[None,3,4,5]})
bb=pd.DataFrame(pd.isnull(tt).astype(int), index = tt.index, columns=map(lambda x: x + '_'+'NA',tt.columns))
bb
I want to create this dataframe from pd.isnull(tt), with column names that carry the NA suffix, but why does this fail?
Using values
tt = pd.DataFrame({'a':[1,2,None,3],'b':[None,3,4,5]})
bb=pd.DataFrame(data=pd.isnull(tt).astype(int).values, index = tt.index, columns=list(map(lambda x: x + '_'+'NA',tt.columns)))
The reason why:
pandas carries over the column labels and index of the data you pass in. pd.isnull(tt).astype(int) already has columns named a and b, so the a_NA/b_NA names you ask for do not match and the constructor fills those columns with NaN. Passing .values drops the labels, so the new column names are applied directly.
More information
bb=pd.DataFrame(data=pd.isnull(tt).astype(int), index = tt.index,columns=['a','b', 'a_NA','b_NA'] )
bb
Out[399]:
a b a_NA b_NA
0 0 1 NaN NaN
1 0 0 NaN NaN
2 1 0 NaN NaN
3 0 0 NaN NaN
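If the goal is simply the _NA column names, renaming after the computation sidesteps the alignment issue entirely; a short sketch using DataFrame.add_suffix (same tt as above):
bb = pd.isnull(tt).astype(int).add_suffix('_NA')
#    a_NA  b_NA
# 0     0     1
# 1     0     0
# 2     1     0
# 3     0     0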
I have the following dataframe. Notice that Column B is a series of lists; this is what is giving me trouble.
Dataframe 1:
Column A Column B
0 10 [X]
1 20 [X,Y]
2 15 [X,Y,Z]
3 25 [A]
4 60 [B]
I want to take all of the values in Column C (below), check if they exist in Column B, and then sum their values from Column A.
DataFrame 2: (Desired Output)
Column C Sum of Column A
0 X 45
1 Y 35
2 Z 15
3 Q 0
4 R 0
I know this can be accomplished using a for-loop, but I am looking for the "pandonic method" to solve this.
update
Here is a shorter and faster answer beginning with your second dataframe
df2['C'].apply(lambda x: df.loc[df['B'].apply(lambda y: x in y), 'A'].sum())
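To store the result, you can assign it back to the second dataframe (a usage sketch, keeping the short column names used in this answer):
df2['Sum of A'] = df2['C'].apply(
    lambda x: df.loc[df['B'].apply(lambda y: x in y), 'A'].sum())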
original answer
You can first 'normalize' the data in Column B.
df_normal = pd.concat([df.A, df.B.apply(pd.Series)], axis=1)
A 0 1 2
0 10 X NaN NaN
1 20 X Y NaN
2 15 X Y Z
3 25 A NaN NaN
4 60 B NaN NaN
And then stack and groupby to get the lookup table.
df_lookup = df_normal.set_index('A') \
                     .stack() \
                     .reset_index(name='group') \
                     .groupby('group')['A'].sum()
group
A 25
B 60
X 45
Y 35
Z 15
Name: A, dtype: int64
And then join to df2.
df2.join(df_lookup, on='C').fillna(0)
C A
0 X 45.0
1 Y 35.0
2 Z 15.0
3 Q 0.0
4 R 0.0
And in one line
df2.join(
    df.set_index('A')['B']
      .apply(pd.Series)
      .stack()
      .reset_index('A', name='group')
      .groupby('group')['A']
      .sum(), on='C') \
   .fillna(0)
And if you wanted to loop, which isn't that bad in this situation:
d = {}
for _, row in df.iterrows():
    for var in row['B']:
        if var in d:
            d[var] += row['A']
        else:
            d[var] = row['A']
df2.join(pd.Series(d, name='Sum of A'), on='C').fillna(0)
Based on your example data:
df1 = df.set_index('Column A')['Column B'].\
      apply(pd.Series).stack().reset_index().\
      groupby([0])['Column A'].sum().to_frame()
df2['Sum of Column A']=df2['Column C'].map(df1['Column A'])
df2.fillna(0)
Out[604]:
Column C Sum of Column A
0 X 45.0
1 Y 35.0
2 Z 15.0
3 Q 0.0
4 R 0.0
Data input :
df = pd.DataFrame({'Column A':[10,20,15,25,60],'Column B':[['X'],['X','Y'],['X','Y','Z'],['A'],['B']]})
df2 = pd.DataFrame({'Column C':['X','Y','Z','Q','R']})
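With those frames, pandas 0.25+ also offers DataFrame.explode, which replaces the apply(pd.Series)/stack normalization; a sketch:
lookup = df.explode('Column B').groupby('Column B')['Column A'].sum()
df2['Sum of Column A'] = df2['Column C'].map(lookup).fillna(0)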
I'd use a list comprehension like
df2['Sum of Column A'] = [sum(a for a, b in zip(df['Column A'], df['Column B']) if c in b)
                          for c in df2['Column C']]
I have a sample table like this:
Dataframe: df
Col1 Col2 Col3 Col4
A 1 10 i
A 1 11 k
A 1 12 a
A 2 10 w
A 2 11 e
B 1 15 s
B 1 16 d
B 2 21 w
B 2 25 e
B 2 36 q
C 1 23 a
C 1 24 b
I'm trying to get all records/rows of the (Col1, Col2) group that has the smaller number of records within each Col1, skipping over Col1 values that have only one such group (in this example Col1 = 'C'). So, the output would be as follows:
A 2 10 w
A 2 11 e
B 1 15 s
B 1 16 d
since group (A,2) has 2 records compared to group (A,1) which has 3 records.
I tried to approach this issue from different angles but just can't seem to get the result that I need. I am able to find the groups that I need using a combination of groupby, filter and agg but how do I now use this as a select filter on df? After spending a lot of time on this, I wasn't even sure that the approach was correct as it looked overly complicated. I am sure that there is an elegant solution but I just can't see it.
Any advice on how to approach this would be greatly appreciated.
I had this to get the groups for which I wanted the rows displayed:
groups = df.groupby(["Col1", "Col2"])["Col2"].agg(no='count')
filteredGroups = groups.groupby(level=0).filter(lambda group: group.size > 1)
print(filteredGroups.groupby(level=0).agg('idxmin'))
The second line was to account for groups that may have only one sub-group, as I don't want to consider those. Honestly, I tried so many variations and approaches, and eventually none gave me the result I wanted. I see that none of the answers are one-liners, so at least I don't feel like I was overthinking the problem.
df['sz'] = df.groupby(['Col1','Col2'])['Col3'].transform("size")
df['rnk'] = df.groupby('Col1')['sz'].rank(method='min')
df['rnk_rev'] = df.groupby('Col1')['sz'].rank(method='min',ascending=False)
df.loc[ (df['rnk'] == 1.0) & (df['rnk_rev'] != 1.0) ]
Col1 Col2 Col3 Col4 sz rnk rnk_rev
3 A 2 10 w 2 1.0 4.0
4 A 2 11 e 2 1.0 4.0
5 B 1 15 s 2 1.0 4.0
6 B 1 16 d 2 1.0 4.0
Edit: changed "count" to "size" (as in #Marco Spinaci's answer) which doesn't matter in this example but might if there were missing values.
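A quick illustration of that difference on a hypothetical group containing a missing value:
g = pd.DataFrame({'k': ['x', 'x'], 'v': [1, None]}).groupby('k')['v']
g.transform('size')   # 2, 2 -> counts rows, NaN included
g.transform('count')  # 1, 1 -> counts non-NaN values only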
And for clarity, here's what the df looks like before dropping the selected rows.
Col1 Col2 Col3 Col4 sz rnk rnk_rev
0 A 1 10 i 3 3.0 1.0
1 A 1 11 k 3 3.0 1.0
2 A 1 12 a 3 3.0 1.0
3 A 2 10 w 2 1.0 4.0
4 A 2 11 e 2 1.0 4.0
5 B 1 15 s 2 1.0 4.0
6 B 1 16 d 2 1.0 4.0
7 B 2 21 w 3 3.0 1.0
8 B 2 25 e 3 3.0 1.0
9 B 2 36 q 3 3.0 1.0
10 C 1 23 a 2 1.0 1.0
11 C 1 24 b 2 1.0 1.0
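An equivalent selection can also be written from the group sizes directly, without the helper rank columns; a sketch based on idxmin, starting from the original df before the helper columns were added (assumes ties should go to the first smallest group):
sizes = df.groupby(['Col1', 'Col2']).size()
smallest = sizes.groupby(level=0).idxmin()      # (Col1, Col2) pair with fewest rows per Col1
multi = sizes.groupby(level=0).size().gt(1)     # Col1 values with more than one sub-group
df.set_index(['Col1', 'Col2']).loc[smallest[multi].tolist()].reset_index()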
Definitely not a nice answer, but it should work:
tmp = df.groupby(['Col1', 'Col2']).size()
df['occurrences'] = pd.Series(df.index).apply(lambda i: tmp[df.Col1[i]][df.Col2[i]])
df['min_occurrences'] = pd.Series(df.index).apply(lambda i: tmp[df.Col1[i]].min())
df[df.occurrences == df.min_occurrences]
But there must be a more clever way to use groupby than creating an auxiliary data frame...
The following solution is based on the groupby.apply methodology. Simpler methods are available by creating helper Series, as in JohnE's answer, which I would say is superior.
The solution works by grouping the dataframe at the Col1 level and then passing to apply a function that further groups the data by Col2. Each sub-group is then assessed to find the smallest one. Note that ties in size are resolved in favor of whichever group is evaluated first, which may not be desirable.
#create data
import pandas as pd
df = pd.DataFrame({
    "Col1": ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"],
    "Col2": [1, 1, 1, 2, 2, 1, 1, 2, 2, 2],
    "Col3": [10, 11, 12, 10, 11, 15, 16, 21, 25, 36],
    "Col4": ["i", "k", "a", "w", "e", "s", "d", "w", "e", "q"]
})
Grouped = df.groupby("Col1")
def transFunc(x):
    smallest = [None, None]
    sub_groups = x.groupby("Col2")
    for group, data in sub_groups:
        if not smallest[1] or len(data) < smallest[1]:
            smallest[0] = group
            smallest[1] = len(data)
    return sub_groups.get_group(smallest[0])
Grouped.apply(transFunc).reset_index(drop=True)
Edit to assign the result
result = Grouped.apply(transFunc).reset_index(drop = True)
print(result)
I would like to add a shorter yet readable version of JohnE's solution (note that the two rank conditions have to be combined element-wise with &, since and does not work on Series):
df['sz'] = df.groupby(['Col1', 'Col2'])['Col3'].transform('size')
df[df.groupby('Col1')['sz'].rank(method='min').eq(1) & df.groupby('Col1')['sz'].rank(method='min', ascending=False).ne(1)]