This question already has answers here:
How can I pivot a dataframe?
(5 answers)
Closed 1 year ago.
I have a Pandas data frame with three columns, two of which have identifiers (date and id) and one with the values I actually care about (value). It looks like this:
,date,id,value
0,20210801,269277473,-1114389.6
1,20210802,269277473,-1658061.0
2,20210803,269277473,-1338010.2
3,20210804,269277473,-475779.6
4,20210805,269277473,-1417980.0
5,20210806,269277473,-1673400.6
6,20210807,269277473,-1438969.8
12,20210801,269277476,504300.0
13,20210802,269277476,519889.8
14,20210803,269277476,513899.4
15,20210804,269277476,526258.8
16,20210805,269277476,524730.0
17,20210806,269277476,548010.6
18,20210807,269277476,539031.0
24,20210801,269277480,477399.0
25,20210802,269277480,443499.0
26,20210803,269277480,394801.2
27,20210804,269277480,440100.0
28,20210805,269277480,455499.6
29,20210806,269277480,441100.2
30,20210807,269277480,438899.4
I want to roll the values into a table in which the date is the index, the columns are the ids, and the content is the values, like the following:
date,269277473,269277476,269277480
20210801,-1114389.6,504300.0,477399.0
20210802,-1658061.0,519889.8,443499.0
20210803,-1338010.2,513899.4,394801.2
20210804,-475779.6,526258.8,440100.0
20210805,-1417980.0,524730.0,455499.6
20210806,-1673400.6,548010.6,441100.2
20210807,-1438969.8,539031.0,438899.4
Given that my table is huge (hundreds of millions of values), what is the most efficient way to accomplish this?
You need to apply a pivot:
df.pivot(*df)
which is equivalent in your case (as the columns are in order) to:
df.pivot(index='date', columns='id', values='value')
output:
id 269277473 269277476 269277480
date
20210801 -1114389.6 504300.0 477399.0
20210802 -1658061.0 519889.8 443499.0
20210803 -1338010.2 513899.4 394801.2
20210804 -475779.6 526258.8 440100.0
20210805 -1417980.0 524730.0 455499.6
20210806 -1673400.6 548010.6 441100.2
20210807 -1438969.8 539031.0 438899.4
Use pivot:
>>> df.pivot(index='date', columns='id', values='value')
id 269277473 269277476 269277480
date
20210801 -1114389.6 504300.0 477399.0
20210802 -1658061.0 519889.8 443499.0
20210803 -1338010.2 513899.4 394801.2
20210804 -475779.6 526258.8 440100.0
20210805 -1417980.0 524730.0 455499.6
20210806 -1673400.6 548010.6 441100.2
20210807 -1438969.8 539031.0 438899.4
If you have other columns, you can use pivot_table instead to apply an aggregation function (mean, sum, ...) to the values of each column.
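For example, a minimal sketch (the duplicated rows below are hypothetical) showing how pivot_table aggregates when a (date, id) pair appears more than once, which plain pivot would reject:
import pandas as pd

# hypothetical frame with a duplicated (date, id) pair
df = pd.DataFrame({
    'date':  [20210801, 20210801, 20210802],
    'id':    [269277473, 269277473, 269277473],
    'value': [-1114389.6, -500000.0, -1658061.0],
})

# pivot_table averages the duplicates instead of raising an error
print(df.pivot_table(index='date', columns='id', values='value', aggfunc='mean'))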
This question already has answers here:
What is the difference between using loc and using just square brackets to filter for columns in Pandas/Python?
(4 answers)
Closed 1 year ago.
Both the following lines seem to give the same output:
df1 = df[df['MRP'] > 1500]
df1 = df.loc[df['MRP'] > 1500]
Is loc an optional attribute when searching dataframe?
From the pandas.DataFrame.loc documentation:
Access a group of rows and columns by label(s) or a boolean array.
.loc[] is primarily label based, but may also be used with a boolean array.
When you use a boolean array to filter data, .loc is optional: in your example, df['MRP'] > 1500 produces a boolean Series, so it's not necessary to use .loc in that case.
df[df['MRP']>15]
MRP cat
0 18 A
3 19 D
6 18 C
But if you want to access other columns for the rows where this boolean Series is True, then you need .loc:
df.loc[df['MRP']>15, 'cat']
0 A
3 D
6 C
Or, if you want to change the values where the condition is True:
df.loc[df['MRP']>15, 'cat'] = 'found'
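For completeness, a minimal, self-contained sketch; the sample data below is an assumption chosen to match the outputs shown above:
import pandas as pd

# hypothetical data: rows 0, 3 and 6 have MRP > 15
df = pd.DataFrame({'MRP': [18, 12, 13, 19, 10, 11, 18],
                   'cat': ['A', 'B', 'C', 'D', 'E', 'F', 'C']})

df[df['MRP'] > 15]                        # boolean filter alone, .loc optional
df.loc[df['MRP'] > 15, 'cat']             # filter plus column selection requires .loc
df.loc[df['MRP'] > 15, 'cat'] = 'found'   # assignment under a condition also requires .loc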
This question already has answers here:
pandas add column to groupby dataframe
(3 answers)
Closed 2 years ago.
I have a pandas dataframe df that contains a column, say x, and I would like to create another column out of x that holds the value count of each item in x.
Here is my approach
x_counts = []
for item in df['x']:
    item_count = len(df[df['x'] == item])
    x_counts.append(item_count)
df['x_count'] = x_counts
This works, but it is far from efficient. I am looking for a more efficient way to handle this. Your approach and recommendations are highly appreciated.
It sounds like you are looking for the groupby function to get the count of items in x.
There are many other function-driven methods, but they may differ across versions.
I suppose that you are looking to group the same elements and find their sum:
df.loc[:, 'x_count'] = 1  # add a new x_count column with value 1 in every row
aggregate_functions = {"x_count": "sum"}
# as_index=False and sort=False keep x as a regular column instead of turning it into the index
df = df.groupby(["x"], as_index=False, sort=False).aggregate(aggregate_functions)
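If you want to keep every original row and simply attach the count as a new column (as in the question), a minimal sketch using groupby with transform (the sample data is hypothetical):
import pandas as pd

df = pd.DataFrame({'x': ['a', 'b', 'a', 'c', 'a', 'b']})   # hypothetical data
df['x_count'] = df.groupby('x')['x'].transform('count')    # per-row count of each value of x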
Hope it helps.
This question already has an answer here:
How do you shift Pandas DataFrame with a multiindex?
(1 answer)
Closed 4 years ago.
So, I was wondering if I am doing this correctly, because maybe there is a much better way to do this and I am wasting a lot of time.
I have a 3 level index dataframe, like this:
IndexA IndexB IndexC ColumnA ColumnB
A B C1 HiA HiB
A B C2 HiA2 HiB2
I need to do a search for every row, saving data from other rows. I know this sounds strange, but it makes sense with my data. For example:
I want to add ColumnB data from my second row to the first one, and vice-versa, like this:
IndexA IndexB IndexC ColumnA ColumnB NewData
A B C1 HiA HiB HiB2
A B C2 HiA2 HiB2 HiB
In order to do this search, I do an apply on my df, like this:
df['NewData'] = df.apply(lambda r: my_function(df, r.IndexA, r.IndexB, r.IndexC), axis=1)
Where my function is:
def my_function(df, indexA, indexB, indexC):
    idx = pd.IndexSlice
    # Here I do calculations (subtraction) to know exactly which C I want
    # newIndexC = C - someConstantValue
    try:
        res = df.loc[idx[indexA, indexB, newIndexC], 'ColumnB']
        return res
    except KeyError:
        return -1
I tried to simplify a lot of this problem, sorry if it sounds confusing. Basically my data frame has 20 million rows, and this search takes 2 hours. I know it has to take a lot, because there are a lot of accesses, but I wanted to know if there could be a faster way to do this search.
More information:
On indexA I have different groups of values. Example: Countries.
On indexB I have different groups of dates.
On indexC I have different groups of values.
Answer:
df['NewData'] = df.groupby(level=['IndexA', 'IndexB'])['ColumnB'].shift(7)
All you're really doing is a shift. You can speed it up 1000x like this:
df['NewData'] = df['ColumnB'].shift(-someConstantValue)
You'll need to roll the data from the top someConstantValue number of rows around to the bottom--I'm leaving that as an exercise.
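A minimal sketch of that wrap-around, assuming someConstantValue is a positive integer offset already defined:
import numpy as np

# shift ColumnB up by someConstantValue rows, wrapping the displaced top rows to the bottom
df['NewData'] = np.roll(df['ColumnB'].to_numpy(), -someConstantValue)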
My DataFrame has a string in the first column, and a number in the second one:
GEOSTRING IDactivity
9 wydm2p01uk0fd2z 2
10 wydm86pg6r3jyrg 2
11 wydm2p01uk0fd2z 2
12 wydm80xfxm9j22v 2
39 wydm9w92j538xze 4
40 wydm8km72gbyuvf 4
41 wydm86pg6r3jyrg 4
42 wydm8mzt874p1v5 4
43 wydm8mzmpz5gkt8 5
44 wydm86pg6r3jyrg 5
45 wydm8w1q8bjfpcj 5
46 wydm8w1q8bjfpcj 5
What I want to do is to manipulate this DataFrame in order to have a list object that contains, for each different "IDactivity" value, a string made out of the 5th character of each "GEOSTRING" value.
So in this case, I have 3 different "IDactivity" values, and my list object will contain 3 strings that look like this:
['2828', '9888', '8888']
where, again, the characters you see in each string are the 5th character of each "GEOSTRING" value.
What I'm asking for is a solution, or an approach, that doesn't involve an overly complicated for loop and is as efficient as possible, since I have to manipulate lots of data. I'd like it to be clean and fast.
I hope it's clear enough.
This can be done easily as a one-liner (considered to be pretty fast too):
result = df.groupby('IDactivity')['GEOSTRING'].apply(lambda x:''.join(x.str[4])).tolist()
This groups the dataframe by the values of IDactivity, then selects from each corresponding string in the GEOSTRING column the 5th character (index 4) and joins it with the other corresponding characters. Finally, the tolist() method returns the output as a list rather than a pandas Series.
output:
['2828', '9888', '8888']
Documentation:
pandas.groupby
pandas.apply
Here's a solution involving a temp column, and taking inspiration for the key operation from this answer:
# create a temp column with the character we want from each string
dframe['Temp'] = dframe['GEOSTRING'].apply(lambda x: x[4])
# groupby ID and then concatenate using a sneaky call to .sum()
dframe.groupby('IDactivity')['Temp'].sum().tolist()
Result:
['2828', '9888', '8888']