I have a basic dataframe which is the result of a groupby on unclean data:
df:
Name1 Value1 Value2
A 10 30
B 40 50
I have created a list as follows:
Segment_list = df['Name1'].unique()
Segment_list
array(['A', 'B'], dtype=object)
Now I want to traverse the list and find the amount in Value1 for each iteration, so I am using:
for Segment_list in enumerate(Segment_list):
    print(df['Value1'])
But I am getting both values instead of one at a time. I just need one value per iteration. Is this possible?
Expected output:
10
40
I recommend using pandas.DataFrame.groupby to get the values for each group.
For the most part, using a for-loop with pandas is an indication that it's probably not being done correctly or efficiently.
Additional resources:
Fast, Flexible, Easy and Intuitive: How to Speed Up Your Pandas Projects
Stack Overflow Pandas Tag Info Page
Option 1:
import pandas as pd
import numpy as np
import random
np.random.seed(365)
random.seed(365)
rows = 25
data = {'n': [random.choice(['A', 'B', 'C']) for _ in range(rows)],
        'v1': np.random.randint(40, size=(rows)),
        'v2': np.random.randint(40, size=(rows))}
df = pd.DataFrame(data)
# groupby n
for g, d in df.groupby('n'):
    # print(g)  # use or not, as needed
    print(d.v1.values[0])  # selects the first value of each group and prints it
[out]: # first value of each group
5
33
18
Option 2:
dfg = df.groupby(['n'], as_index=False).agg({'v1': list})
# display(dfg)
n v1
0 A [5, 26, 39, 39, 10, 12, 13, 11, 28]
1 B [33, 34, 28, 31, 27, 24, 36, 6]
2 C [18, 27, 9, 36, 35, 30, 3, 0]
Option 3:
As stated in the comments, your data is already the result of groupby, and it will only ever have one value in the column for each group.
dfg = df.groupby('n', as_index=False).sum()
# display(dfg)
n v1 v2
0 A 183 163
1 B 219 188
2 C 158 189
# print the value for each group in v1
for v in dfg.v1.to_list():
    print(v)
[out]:
183
219
158
Option 4:
Print all rows for each column
dfg = df.groupby('n', as_index=False).sum()
for col in dfg.columns[1:]:  # selects all columns after n
    for v in dfg[col].to_list():
        print(v)
[out]:
183
219
158
163
188
189
I agree with @Trenton's comment that the whole point of using DataFrames is to avoid looping through them like this. Rethink this using a function. However, the closest way to make what you've written work is something like this:
Segment_list = df['Name1'].unique()
for Index in Segment_list:
    # .iloc[0] must be applied to the filtered Series, inside print()
    print(df['Value1'][df['Name1'] == Index].iloc[0])
Depending on what you want to happen if there are two entries for a Name (presumably this can happen, because you use .unique()), this will print the sum of the Values:
df.groupby('Name1').sum()['Value1']
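Since the frame in the question already holds one row per Name1, a loop is only needed for the printing itself. A minimal sketch, assuming the sample data from the question (the `seen` list is only there to make the result easy to inspect):

```python
import pandas as pd

# Sample frame mirroring the one in the question (one row per Name1).
df = pd.DataFrame({"Name1": ["A", "B"], "Value1": [10, 40], "Value2": [30, 50]})

# Pair each name with its Value1 and walk the pairs one at a time.
seen = []
for name, value in zip(df["Name1"], df["Value1"]):
    print(value)
    seen.append((name, int(value)))
```

This prints one value per iteration without filtering the frame on every pass.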
Related
Say I have a dataframe like so that I have read in from a file (note: *.ene is a txt file)
df = pd.read_fwf('filename.ene')
TS DENSITY STATS
1
2
3
1
2
3
I would like to only change the TS column. I wish to replace all the column values of 'TS' with the values from range(0,751,125). The desired output should look like so:
TS DENSITY STATS
0
125
250
500
625
750
I'm a bit lost and would like some insight regarding the code to do such a thing in a general format.
I used a for loop to store the values into a list:
K=(6*125)+1
m = []
for i in range(0,K,125):
    m.append(i)
I thought to use .replace like so:
df['TS']=df['TS'].replace(old_value, m, inplace=True)
but was not sure what to put in place of old_value to select all the values of the 'TS' column or if this would even work as a method.
It's pretty straightforward: if you're replacing all the data, you just need to assign the list to the column (its length must match the number of rows):
df['TS'] = m
Example:
import pandas as pd
data = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
df = pd.DataFrame(data, index=[0, 1, 2], columns=['a', 'b', 'c'])
print(df)
# a b c
# 0 10 20 30
# 1 40 50 60
# 2 70 80 90
df['a'] = [1,2,3]
print(df)
# a b c
# 0 1 20 30
# 1 2 50 60
# 2 3 80 90
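One thing to watch with direct assignment: range(0, 751, 125) yields seven values, while the sample frame in the question has six rows, so the lengths would not match. A sketch that derives the values from the row count instead, assuming a hypothetical six-row stand-in for the file and the step of 125 from the question:

```python
import pandas as pd

# Hypothetical stand-in for the frame read from the file: six rows, TS repeating 1..3.
df = pd.DataFrame({"TS": [1, 2, 3, 1, 2, 3]})

step = 125  # spacing taken from the question
# One value per row (0, 125, 250, ...), so the length always matches len(df).
df["TS"] = list(range(0, step * len(df), step))
print(df["TS"].tolist())
```

Whether the last value should be 625 or 750 depends on the real file, which is not shown here.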
Say I have a dataframe
A B C D
2019-01-01 1 10 100 12
2019-01-02 2 20 200 23
2019-01-03 3 30 300 34
And an array to group the columns by
array([0, 1, 0, 2])
I wish to group the dataframe by the array (on the column axis), apply a function, then return a Series with length of the number of columns, containing the result of the applied function on each column.
So, for the above (with the applied function taking the group's sum), would want to output:
A 606
B 60
C 606
D 69
dtype: int64
My best attempt:
func = lambda a: np.full(a.shape[1], np.sum(a.values))
df.groupby(groups, axis=1).apply(func)
0 [606, 606]
1 [60]
2 [69]
dtype: object
(in this example the applied function returns equal values inside a group, but this can't be guaranteed for the real case)
I can not see how to do this with pandas grouping syntax, unless I am missing something. Could anyone lend a hand, thanks!
Try this:
import numpy as np
import pandas as pd
groups = [0, 1, 0, 2]
df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [10, 20, 30],
                   'C': [100, 200, 300],
                   'D': [12, 23, 34]})
temp = df.apply(sum).to_frame()
temp.index = pd.MultiIndex.from_arrays(
    np.stack([temp.index, groups]),
    names=("df columns", "groups")
)
temp_filter = temp.groupby(level=1).agg(sum)
result = temp.join(temp_filter, rsuffix='0'). \
    set_index(temp.index.get_level_values(0))["00"]
# df columns
# A 606
# B 60
# C 606
# D 69
# Name: 00, dtype: int64
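For the sum case specifically, a shorter route may be to sum each column first and then broadcast the per-group totals back with transform. This is only a sketch under that assumption; it does not cover the general case mentioned in the question, where the applied function may return different values within a group.

```python
import numpy as np
import pandas as pd

groups = np.array([0, 1, 0, 2])  # one group label per column
df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [10, 20, 30],
                   'C': [100, 200, 300],
                   'D': [12, 23, 34]})

# Per-column totals, indexed by column name.
col_sums = df.sum()
# Sum the totals within each group and broadcast back to every column.
result = col_sums.groupby(groups).transform('sum')
print(result)
```

The result keeps the original column names as its index, which matches the shape of the desired output.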
The title may come across as confusing (honestly, not quite sure how to summarize it in a sentence), so here is a much better explanation:
I'm currently handling a DataFrame A with different attributes, and I used a .groupby[].count() call on the data column age to create a list of occurrences:
A_sub = A.groupby(['age'])['age'].count()
A_sub returns a Series similar to the following (the values are randomly modified):
age
1 316
2 249
3 221
4 219
5 262
...
59 1
61 2
65 1
70 1
80 1
Name: age, dtype: int64
I would like to plot a list of values from element-wise division. The division I would like to perform is an element's value divided by the sum of all the elements whose index is greater than or equal to that element's index. In other words, for an age of 3, for example, it should return
221/(221+219+262+...+1+2+1+1+1)
The same calculation should apply to all the elements. Ideally, the outcome should be in the similar type/format so that it can be plotted.
Here is a quick example using numpy. A similar approach can be used with pandas. The for loop can most likely be replaced by something smarter and more efficient to compute the coefficients.
import numpy as np
ages = np.asarray([316, 249, 221, 219, 262])
coefficients = np.zeros(ages.shape)
for k, a in enumerate(ages):
    coefficients[k] = sum(ages[k:])
output = ages / coefficients
Output:
array([0.24940805, 0.26182965, 0.31481481, 0.45530146, 1. ])
EDIT: The initialization of coefficients to zeros and the for loop can be replaced with:
coefficients = np.flip(np.cumsum(np.flip(ages)))
You can use the function cumsum() in pandas to get accumulated sums:
A_sub = A['age'].value_counts().sort_index(ascending=False)
(A_sub / A_sub.cumsum()).iloc[::-1]
No reason to use numpy for the calculation itself; pandas already includes everything we need.
A_sub seems to return a Series where age is the index. That's not ideal, but it should be fine. The code below therefore operates on a Series, but can easily be modified to work with DataFrames.
import numpy as np  # only used to generate the sample data
import pandas as pd
s = pd.Series(data=np.random.randint(low=1, high=10, size=10), index=[0, 1, 3, 4, 5, 8, 9, 10, 11, 13], name="age")
print(s)
res = s / s[::-1].cumsum()[::-1]
res = res.rename("cumsum div")
I saw your comment about missing ages in the index. Here is how you would add the missing indexes in the range from min to max index, and then perform the division.
import numpy as np  # only used to generate the sample data
import pandas as pd
s = pd.Series(data=np.random.randint(low=1, high=10, size=10), index=[0, 1, 3, 4, 5, 8, 9, 10, 11, 13], name="age")
s_all_idx = s.reindex(index=range(s.index.min(), s.index.max() + 1), fill_value=0)
print(s_all_idx)
res = s_all_idx / s_all_idx[::-1].cumsum()[::-1]
res = res.rename("all idx cumsum div")
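As a quick sanity check of the reverse, accumulate, reverse-back idea, here is a small sketch using only the first five counts shown in the question (so the numbers differ from the full data, which is not available here):

```python
import pandas as pd

# First five counts shown in the question, indexed by age.
s = pd.Series([316, 249, 221, 219, 262], index=[1, 2, 3, 4, 5], name="age")

# Reverse, accumulate, reverse back: each entry is the sum from that age onward.
tail_sums = s[::-1].cumsum()[::-1]
res = s / tail_sums
print(res)
```

The values agree with the numpy output listed earlier, for example 221 / 702 for age 3.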
I need to start adding values in one of the columns in my df and return a row where the sum reaches a certain threshold. What is the easiest way to do it?
e.g.
threshold = 86
values ID
1 42 xxxxx
2 34 yyyyy
3 29 vvvvv
4 28 eeeee
should return line 3
import pandas as pd
df = pd.DataFrame(dict(values=[42, 34, 29, 28], ID=['x', 'y', 'z', 'e']))
threshold = 86
idx = df['values'].cumsum().searchsorted(threshold)
print(df.iloc[idx])
Output:
values 29
ID z
Name: 2, dtype: object
Note that df.values is a DataFrame attribute (it returns the underlying data as a NumPy array), so the column named 'values' must be accessed as df['values'].
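This should work:
df['new_values'] = df['values'].cumsum()
# use >= so a row is found even when the running total skips past the threshold
rows = df[df['new_values'] >= threshold].index.to_list()  # rows[0] is the first row reaching it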
Another way
df['values'].cumsum().ge(threshold).idxmax()
Out[131]: 3
df.loc[df['values'].cumsum().ge(threshold).idxmax()]
Out[133]:
values 29
ID vvvvv
Name: 3, dtype: object
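One caveat with idxmax on a boolean Series: if the running total never reaches the threshold, every entry is False and idxmax still returns the first label. A minimal sketch that guards against this, using a hypothetical helper name `first_row_reaching` and the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame(
    {"values": [42, 34, 29, 28], "ID": ["xxxxx", "yyyyy", "vvvvv", "eeeee"]},
    index=[1, 2, 3, 4],
)

def first_row_reaching(frame, threshold):
    """Return the first row whose running total reaches threshold, or None."""
    reached = frame["values"].cumsum().ge(threshold)
    if not reached.any():
        return None  # threshold is never reached
    return frame.loc[reached.idxmax()]

row = first_row_reaching(df, 86)
print(row)
```

Returning None is just one possible convention; raising an error may suit some callers better.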
I have a dataframe df (see program below) whose column names and number are not fixed.
However, there is a list ls which will have the list of columns of df that needs to be appended together.
I tried
df['combined'] = df[ls].apply(lambda x: '{}{}{}'.format(x[0], x[1], x[2]), axis=1)
but here I am assuming that the list ls has 3 elements, which is hard-coded and incorrect. What if the list has 10 elements? I want to read the list dynamically and append the corresponding columns of the dataframe.
import pandas as pd
def main():
    df = pd.DataFrame({
        'col_1': [0, 1, 2, 3],
        'col_2': [4, 5, 6, 7],
        'col_3': [14, 15, 16, 19],
        'col_4': [22, 23, 24, 25],
        'col_5': [30, 31, 32, 33],
    })
    ls = ['col_1','col_4', 'col_3']
    df['combined'] = df[ls].apply(lambda x: '{}{}'.format(x[0], x[1]), axis=1)
    print(df)
if __name__ == '__main__':
    main()
You can use ''.join after converting the columns' data type to str:
df[ls].astype(str).apply(''.join, axis=1)
#0 02214
#1 12315
#2 22416
#3 32519
#dtype: object
For more speed, you can use a cumulative sum over the strings, i.e.:
df[ls].astype(str).cumsum(1).iloc[:,-1].values
Output :
0 02214
1 12315
2 22416
3 32519
Name: combined, dtype: object
If you need a space between the values, first append ' ' to each value and then take the row-wise sum (note that this leaves a trailing space), i.e.:
n = (df[ls].astype(str)+ ' ').sum(1)
0 0 22 14
1 1 23 15
2 2 24 16
3 3 25 19
dtype: object
Timings :
ndf = pd.concat([df]*10000)
%%timeit
ndf[ls].astype(str).cumsum(1).iloc[:,-1].values
1 loop, best of 3: 538 ms per loop
%%timeit
ndf[ls].astype(str).apply(''.join, axis=1)
1 loop, best of 3: 1.93 s per loop
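If a separator is wanted between the values, one option is to join with the separator directly, which avoids a trailing space. A small sketch, assuming the sample frame and list from the question (the `sep` variable is introduced here for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'col_1': [0, 1, 2, 3],
    'col_2': [4, 5, 6, 7],
    'col_3': [14, 15, 16, 19],
    'col_4': [22, 23, 24, 25],
    'col_5': [30, 31, 32, 33],
})
ls = ['col_1', 'col_4', 'col_3']

sep = ' '  # any separator string works here
# Convert to str, then join each row's values with the separator.
combined = df[ls].astype(str).agg(sep.join, axis=1)
print(combined)
```

This works for any number of columns in ls; it was not timed against the approaches above.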