I want to repeat a specific row of a pandas DataFrame a given number of times.
For example, this is my data frame
import pandas as pd

df = pd.DataFrame({
    'id': ['1', '1', '2', '2', '2', '3'],
    'val': ['2015_11', '2016_2', '2011_9', '2011_11', '2012_2', '2018_2'],
    'data': ['a', 'a', 'b', 'b', 'b', 'c']
})
print(df)
Here, the "val" column contains dates as strings with the pattern 'Year_month'. For each "id", I want the rows repeated as many times as the month difference between the given "val" values. All columns other than "val" should carry the duplicated value of the previous row.
The output should be:
Using resample:
df.val = pd.to_datetime(df.val, format='%Y_%m')
out = df.set_index('val').groupby('id').data.resample('M').ffill().reset_index()
out = out.assign(val=out.val.dt.strftime('%Y_%m'))
id val data
0 1 2015_11 a
1 1 2015_12 a
2 1 2016_01 a
3 1 2016_02 a
4 2 2011_09 b
5 2 2011_10 b
6 2 2011_11 b
7 2 2011_12 b
8 2 2012_01 b
9 2 2012_02 b
10 3 2018_02 c
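If resample feels indirect, the same result can be sketched by building each id's month range explicitly with pd.date_range. This is only a sketch and assumes, as in the example, that data is constant within an id:

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['1', '1', '2', '2', '2', '3'],
    'val': ['2015_11', '2016_2', '2011_9', '2011_11', '2012_2', '2018_2'],
    'data': ['a', 'a', 'b', 'b', 'b', 'c'],
})
df['val'] = pd.to_datetime(df['val'], format='%Y_%m')

# For each id, span every month from its earliest to its latest date.
pieces = []
for key, g in df.groupby('id'):
    months = pd.date_range(g['val'].min(), g['val'].max(), freq='MS')
    pieces.append(pd.DataFrame({'id': key,
                                'val': months.strftime('%Y_%m'),
                                'data': g['data'].iloc[0]}))
out = pd.concat(pieces, ignore_index=True)
print(out)
```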
I have a DataFrame that looks like this:
Image of DataFrame
What I would like to do is compare the values in all four columns (A, B, C, and D) for every row, count the number of columns among A, B, and C in which the row's D value is smaller, and put that count in a 'Count' column. For instance, 'Count' should be 1 for the second and third rows and 2 for the last row.
Thank you in advance!
You can vectorize the operation using the gt and sum methods along an axis:
df['Count'] = df[['A', 'B', 'C']].gt(df['D'], axis=0).sum(axis=1)
print(df)
# Output
A B C D Count
0 1 2 3 4 0
1 4 3 2 1 3
2 2 1 4 3 1
In the future, please do not post data as an image.
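For reference, here is a self-contained version of the snippet above; the values are invented to reproduce the printed output:

```python
import pandas as pd

# Made-up data matching the output shown above.
df = pd.DataFrame({'A': [1, 4, 2], 'B': [2, 3, 1],
                   'C': [3, 2, 4], 'D': [4, 1, 3]})

# gt broadcasts column D against A, B and C; sum counts the Trues per row.
df['Count'] = df[['A', 'B', 'C']].gt(df['D'], axis=0).sum(axis=1)
print(df)
```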
Use a lambda function and compare across all columns, then sum across the columns.
data = {'A': [1, 47, 4316, 8511],
        'B': [4, 1, 3, 4],
        'C': [2, 7, 9, 1],
        'D': [32, 17, 1, 0]}
df = pd.DataFrame(data)
# Comparing D against every column is safe here: D < D is always False.
df['Count'] = df.apply(lambda x: x['D'] < x, axis=1).sum(axis=1)
Output:
A B C D Count
0 1 4 2 32 0
1 47 1 7 17 1
2 4316 3 9 1 3
3 8511 4 1 0 3
I have a dataframe where there are duplicate values in column A that have different values in column B.
I want to delete all rows for a duplicated column A value if any of those rows has a value higher than 15 in column B.
Original dataframe:

A Column  B Column
1         10
1         14
2         10
2         20
3         5
3         10
Desired dataframe:

A Column  B Column
1         10
1         14
3         5
3         10
This works:
dfnew = df.groupby('A Column').filter(lambda x: x['B Column'].max() <= 15)
dfnew.reset_index(drop=True, inplace=True)
dfnew = dfnew[['A Column','B Column']]
print(dfnew)
output:
A Column B Column
0 1 10
1 1 14
2 3 5
3 3 10
Here is another way, using groupby() and transform():
df.loc[~df['B Column'].gt(15).groupby(df['A Column']).transform('any')]
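A runnable sketch of the transform('any') approach, using the question's data:

```python
import pandas as pd

df = pd.DataFrame({'A Column': [1, 1, 2, 2, 3, 3],
                   'B Column': [10, 14, 10, 20, 5, 10]})

# Flag every row whose group contains any B value above 15, then keep the rest.
mask = df['B Column'].gt(15).groupby(df['A Column']).transform('any')
dfnew = df.loc[~mask].reset_index(drop=True)
print(dfnew)
```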
I have a large dataset where every two rows need to be grouped together and combined into one longer row, basically duplicating the headers and appending the 2nd row to the 1st. Here is a small sample:
df = pd.DataFrame({'ID' : [1,1,2,2],'Var1': ['A', 2, 'C', 7], 'Var2': ['B', 5, 'D', 9]})
print(df)
ID Var1 Var2
1 A B
1 2 5
2 C D
2 7 9
I will have to group the rows by 'ID', so I ran:
grouped = df.groupby(['ID'])
grp_lst = list(grouped)
This resulted in a list of tuples grouped by id, where the second element of each tuple is the grouped dataframe I would like to combine.
The desired result is a dataframe that looks something like this:
ID Var1 Var2 ID.1 Var1.1 Var2.1
1 A B 1 2 5
2 C D 2 7 9
I have to do this over a large dataset, where "ID" is used to group the rows; I then want to append the bottom row of each group to the end of the top row.
Any help would be appreciated, and I assume there is a much easier way to do this than what I am doing.
Thanks in advance!
Let us try:
i = df.groupby('ID').cumcount().astype(str)
df_out = df.set_index([df['ID'].values, i]).stack().unstack([2, 1])
df_out.columns = df_out.columns.map('.'.join)
Details:
Group the dataframe on ID and use cumcount to create a sequential counter that uniquely identifies the rows per ID:
>>> i
0 0
1 1
2 0
3 1
dtype: object
Create multilevel index in the dataframe with the first level set to ID values and second level set to the above sequential counter, then use stack followed by unstack to reshape the dataframe in the desired format:
>>> df_out
ID Var1 Var2 ID Var1 Var2 #---> Level 0 columns
0 0 0 1 1 1 #---> Level 1 columns
1 1 A B 1 2 5
2 2 C D 2 7 9
Finally flatten the multilevel columns using Index.map with join:
>>> df_out
ID.0 Var1.0 Var2.0 ID.1 Var1.1 Var2.1
1 1 A B 1 2 5
2 2 C D 2 7 9
Here is another way using numpy to reshape the dataframe first then tile the columns and create new dataframe from reshape values and tiled columns:
s = df.shape[1]
c = np.tile(df.columns, 2) + '.' + (np.arange(s * 2) // s).astype(str)
df_out = pd.DataFrame(df.values.reshape(-1, s * 2), columns=c)
>>> df_out
ID.0 Var1.0 Var2.0 ID.1 Var1.1 Var2.1
0 1 A B 1 2 5
1 2 C D 2 7 9
Note: This method is only applicable if there are exactly two rows per ID and the ID column is already sorted.
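A pivot-based alternative reaches a similar wide shape. Note this sketch keeps a single ID column rather than duplicating it per block:

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 2],
                   'Var1': ['A', 2, 'C', 7],
                   'Var2': ['B', 5, 'D', 9]})

# Number the rows within each ID, then pivot that counter into column blocks.
df['seq'] = df.groupby('ID').cumcount().astype(str)
wide = df.pivot(index='ID', columns='seq')
wide.columns = [f'{name}.{seq}' for name, seq in wide.columns]
wide = wide.reset_index()
print(wide)
```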
I have the following dataset:
d = {'id': [1,1,1,1,3,3,3,4,4,4], 'number': [3,3,3,1,4,6,4,5,5,3]}
df = pd.DataFrame(data=d)
I want to get a new dataframe with the columns "id" and "final_number", where each id is assigned the most "popular" number within its group of ids from the table above. How can I do it?
The result should be:
The most "popular" number within each group is the mode:
df.groupby('id').number.apply(lambda x : x.mode()[0]).reset_index()
Out[1499]:
id number
0 1 3
1 3 4
2 4 5
Using groupby + value_counts + head -
df.groupby('id')\
.number.value_counts()\
.groupby(level=0)\
.head(1)\
.reset_index(name='count')\
.drop(columns='count')
id number
0 1 3
1 3 4
2 4 5
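As a self-contained sketch of the same chain on current pandas (where drop's positional axis argument has been removed):

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 1, 3, 3, 3, 4, 4, 4],
                   'number': [3, 3, 3, 1, 4, 6, 4, 5, 5, 3]})

# value_counts sorts each group's counts in descending order,
# so head(1) keeps the most frequent number per id.
out = (df.groupby('id')['number'].value_counts()
         .groupby(level=0).head(1)
         .reset_index(name='count')
         .drop(columns='count'))
print(out)
```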
I have a data frame with two columns, 'id' and 'time'. I need to compute the mean time for each id and put the result into a new data frame with a new column name. Input data frame:
id time
0 1 1
1 1 1
2 1 1
3 1 1
4 1 2
5 1 2
6 2 1
7 2 1
8 2 2
9 2 2
10 2 2
11 2 2
My code:
import pandas as pd
my_dict = {
'id': [1,1,1, 1,1,1, 2,2,2, 2,2,2],
'time':[1,1,1, 1,2,2, 1,1,2, 2,2,2]
}
df = pd.DataFrame(my_dict)
x = df.groupby(['id'])['time'].mean()
# x is a pandas.core.series.Series
type(x)
y = x.to_frame()
# y is pandas.core.frame.DataFrame
type(y)
list(y)
Running this code results in:
In [14]: y
Out[14]:
time
id
1 1.333333
2 1.666667
Groupby returns the pandas Series 'x', which I then convert to the data frame 'y'.
How can I change the column name in the output data frame 'y' from 'time' to something else, for example 'mean'? Ideally I need an output data frame with two columns: 'id' and 'mean'. How can I do this?
Update2:
y = x.to_frame('mean').reset_index()
Solves the problem!
You can pass a name through agg via named aggregation: the keyword is the output column name and the value is the aggregate function. (Passing a dict of output names, as older answers show, is no longer supported in recent pandas.) as_index=False keeps id as a column:
df.groupby('id', as_index=False)['time'].agg(mean='mean')
Out:
id mean
0 1 1.333333
1 2 1.666667
Using your Series x, this would also have worked:
x.to_frame('mean').reset_index()
Out:
id mean
0 1 1.333333
1 2 1.666667
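For reference, the tuple form of named aggregation works at the DataFrame level as well; a self-contained sketch with the question's data:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
                   'time': [1, 1, 1, 1, 2, 2, 1, 1, 2, 2, 2, 2]})

# Keyword = output column name, value = (source column, aggregate function).
y = df.groupby('id', as_index=False).agg(mean=('time', 'mean'))
print(y)
```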