Repeating items in a data frame using pandas - python

I have the following dataFrame:
id z2 z3 z4
1 2 a fine
2 7 b good
3 9 c delay
4 30 d cold
I want to generate a data frame by repeating each row twice, except for the items in column z4 (those should not be repeated). How can I do it using Python and pandas?
The output should be like this:
id z2 z3 z4
1 2 a fine
1 2 a
1 2 a
2 7 b good
2 7 b
2 7 b
3 9 c delay
3 9 c
3 9 c
4 30 d cold
4 30 d
4 30 d

Another way to do this is to use indexing:
Notice that df.iloc[[0, 1, 2, 3]*2, :3] will give you two copies of each row, restricted to the first three columns.
This can then be appended to the original df. Replace the resulting NaN values with empty strings, then sort on the index values and reset the index (dropping the old one). All of which can be chained:
df.append(df.iloc[[0, 1, 2, 3]*2, :3]).fillna('').sort_index().reset_index(drop=True)
which produces:
id z2 z3 z4
0 1 2 a fine
1 1 2 a
2 1 2 a
3 2 7 b good
4 2 7 b
5 2 7 b
6 3 9 c delay
7 3 9 c
8 3 9 c
9 4 30 d cold
10 4 30 d
11 4 30 d
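Note that DataFrame.append was removed in pandas 2.0; the same chain can be written with pd.concat (a minimal sketch, assuming the df defined in the question):
import pandas as pd

# two extra copies of the first three columns, stacked under the original rows
extra = df.iloc[[0, 1, 2, 3] * 2, :3]
out = (pd.concat([df, extra])
         .fillna('')                 # blank out the missing z4 values
         .sort_index(kind='stable')  # stable sort keeps the original row first
         .reset_index(drop=True))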

groupby and apply will do the trick:
def func(group):
    copy = group.copy()
    copy['z4'] = ""                       # blank z4 in the repeated rows
    return pd.concat((group, copy, copy))

df.groupby('id').apply(func).reset_index(drop=True)
id z2 z3 z4
0 1 2 a fine
1 1 2 a
2 1 2 a
3 2 7 b good
4 2 7 b
5 2 7 b
6 3 9 c delay
7 3 9 c
8 3 9 c
9 4 30 d cold
10 4 30 d
11 4 30 d
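Since the blanked copies don't actually depend on the grouping, the same output can be produced without groupby (a sketch under the same assumptions):
import pandas as pd

# one copy of the frame with z4 blanked, stacked twice after the originals
blank = df.assign(z4='')
out = (pd.concat([df, blank, blank])
         .sort_index(kind='stable')   # keep the filled row first within each id
         .reset_index(drop=True))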

Related

Pandas - replicate rows with new column value from a list for each replication

So I have a data frame that has two columns, State and Cost, and a separate list of new "what-if" costs
State Cost
A 2
B 9
C 8
D 4
New_Cost_List = [1, 5, 10]
I'd like to replicate all the rows in my data set for each value in New_Cost_List, adding the list value as a new column for each state:
State Cost New_Cost
A 2 1
B 9 1
C 8 1
D 4 1
A 2 5
B 9 5
C 8 5
D 4 5
A 2 10
B 9 10
C 8 10
D 4 10
I thought a for loop might be appropriate to iterate through, replicating my dataset for the length of the list and adding the values of the list as a new column:
for v in New_Cost_List:
    df_new = pd.DataFrame(np.repeat(df.values, len(New_Cost_List), axis=0))
    df_new.columns = df.columns
    df_new['New_Cost'] = v
The output of this gives me the correct replication of State and Cost, but the New_Cost value is 10 for every row. Clearly I'm not seeing how to step through the list for each replicated set, so any suggestions? Or is there a better way to approach this?
EDIT 1
Reducing the number of values in New_Cost_List from 4 to 3, so that the row count and the length of the list differ.
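For reference, the loop in the question can be made to work by building one frame per cost and concatenating at the end, instead of overwriting df_new on every pass (a sketch, assuming df and New_Cost_List as defined above):
import pandas as pd

frames = []
for v in New_Cost_List:
    tmp = df.copy()          # one full copy of the data per cost value
    tmp['New_Cost'] = v      # tag the whole copy with this cost
    frames.append(tmp)
df_new = pd.concat(frames, ignore_index=True)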
Here is a way using the keys parameter of pd.concat():
(pd.concat([df] * len(New_Cost_List),
           keys=New_Cost_List,
           names=['New_Cost', None])
   .reset_index(level=0))
Output:
New_Cost State Cost
0 1 A 2
1 1 B 9
2 1 C 8
3 1 D 4
0 5 A 2
1 5 B 9
2 5 C 8
3 5 D 4
0 10 A 2
1 10 B 9
2 10 C 8
3 10 D 4
If I understand your question correctly, this should solve your problem:
df['New Cost'] = new_cost_list
df = pd.concat([df]*len(new_cost_list), ignore_index=True)
Output:
State Cost New Cost
0 A 2 1
1 B 9 5
2 C 8 10
3 D 4 15
4 A 2 1
5 B 9 5
6 C 8 10
7 D 4 15
8 A 2 1
9 B 9 5
10 C 8 10
11 D 4 15
12 A 2 1
13 B 9 5
14 C 8 10
15 D 4 15
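Note that this assigns the costs positionally, one per row, so it requires the list length to match the row count, and it produces a different pairing from the all-combinations output requested in the question.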
You can use index.repeat and numpy.tile:
df2 = (df
   .loc[df.index.repeat(len(New_Cost_List))]           # each row once per cost value
   .assign(New_Cost=np.tile(New_Cost_List, len(df)))   # cycle the costs across the copies
)
or, simply, with a cross merge:
df2 = df.merge(pd.Series(New_Cost_List, name='New_Cost'), how='cross')
output:
State Cost New_Cost
0 A 2 1
0 A 2 5
0 A 2 10
1 B 9 1
1 B 9 5
1 B 9 10
2 C 8 1
2 C 8 5
2 C 8 10
3 D 4 1
3 D 4 5
3 D 4 10
To get the order shown in the question (rows grouped by cost), sort on New_Cost with a stable sort so the original row order is preserved within each cost:
(df
.merge(pd.Series(New_Cost_List, name='New_Cost'), how='cross')
.sort_values(by='New_Cost', kind='stable')
.reset_index(drop=True)
)
output:
State Cost New_Cost
0 A 2 1
1 B 9 1
2 C 8 1
3 D 4 1
4 A 2 5
5 B 9 5
6 C 8 5
7 D 4 5
8 A 2 10
9 B 9 10
10 C 8 10
11 D 4 10

I want to generate a new column in a pandas dataframe, counting "edges" in another column

I have a dataframe that looks like this:
A B....X
1 1 A
2 2 B
3 3 A
4 6 K
5 7 B
6 8 L
7 9 M
8 1 N
9 7 B
1 6 A
7 7 A
That is, some "rising edges" occur from time to time in column X (in this example the edge is X == 'B').
What I need is a new column Y that increments every time the value B occurs in X:
A B....X Y
1 1 A 0
2 2 B 1
3 3 A 1
4 6 K 1
5 7 B 2
6 8 L 2
7 9 M 2
8 1 N 2
9 7 B 3
1 6 A 3
7 7 A 3
In SQL I would use a trick like sum(case when X = 'B' then 1 else 0) over ... rows between first and previous row. How can I do it in pandas?
Use cumsum on a boolean mask; comparing the column against 'B' yields True/False, which cumsum adds up as 1/0:
df['Y'] = (df.X == 'B').cumsum()
Out[8]:
A B X Y
0 1 1 A 0
1 2 2 B 1
2 3 3 A 1
3 4 6 K 1
4 5 7 B 2
5 6 8 L 2
6 7 9 M 2
7 8 1 N 2
8 9 7 B 3
9 1 6 A 3
10 7 7 A 3
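A side note, going slightly beyond the sample data: if X could contain consecutive 'B' rows that should count as a single rising edge, compare against the shifted column so only transitions into 'B' are counted:
# count a transition only where a 'B' run starts (previous row was not 'B')
df['Y'] = (df['X'].eq('B') & df['X'].shift().ne('B')).cumsum()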

Construct a df such that every number within a range gets the value 'A' assigned, knowing the start and end values of the range that belongs to 'A'

Suppose I have the following Pandas dataframe:
In[285]: df = pd.DataFrame({'Name':['A','B'], 'Start': [1,6], 'End': [4,12]})
In [286]: df
Out[286]:
Name Start End
0 A 1 4
1 B 6 12
Now I would like to construct the dataframe as follows:
Name Number
0 A 1
1 A 2
2 A 3
3 A 4
4 B 6
5 B 7
6 B 8
7 B 9
8 B 10
9 B 11
10 B 12
My biggest struggle is in getting the 'Name' column right. Is there a smart way to do this in Python?
I would do pd.concat on a generator expression:
pd.concat(pd.DataFrame({'Number': np.arange(s, e + 1)})
            .assign(Name=n)
          for n, s, e in zip(df['Name'], df['Start'], df['End']))
Output:
Number Name
0 1 A
1 2 A
2 3 A
3 4 A
0 6 B
1 7 B
2 8 B
3 9 B
4 10 B
5 11 B
6 12 B
Update: as commented by @rafaelc:
pd.concat(pd.DataFrame({'Number': np.arange(s, e + 1), 'Name': n})
          for n, s, e in zip(df['Name'], df['Start'], df['End']))
works just fine.
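On pandas 0.25+, the same expansion can also be written with explode (a sketch using the same df):
out = (df.assign(Number=[list(range(s, e + 1)) for s, e in zip(df['Start'], df['End'])])
         .explode('Number')[['Name', 'Number']]
         .reset_index(drop=True))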
Let us do it with this example (with 3 names):
import pandas as pd
df = pd.DataFrame({'Name':['A','B','C'], 'Start': [1,6,18], 'End': [4,12,20]})
You may create the target columns first, using list comprehensions:
name = [row.Name for i, row in df.iterrows() for _ in range(row.End - row.Start + 1)]
number = [k for i, row in df.iterrows() for k in range(row.Start, row.End + 1)]
And then you can create the target DataFrame:
expanded = pd.DataFrame({"Name": name, "Number": number})
You get:
Name Number
0 A 1
1 A 2
2 A 3
3 A 4
4 B 6
5 B 7
6 B 8
7 B 9
8 B 10
9 B 11
10 B 12
11 C 18
12 C 19
13 C 20
I'd take advantage of loc and index.repeat for a vectorized solution.
base = df.loc[df.index.repeat(df['End'] - df['Start'] + 1), ['Name', 'Start']]
base['Start'] += base.groupby(level=0).cumcount()
Name Start
0 A 1
0 A 2
0 A 3
0 A 4
1 B 6
1 B 7
1 B 8
1 B 9
1 B 10
1 B 11
1 B 12
Of course, we can rename the columns and reset the index at the end for a nicer display.
base.rename(columns={'Start': 'Number'}).reset_index(drop=True)
Name Number
0 A 1
1 A 2
2 A 3
3 A 4
4 B 6
5 B 7
6 B 8
7 B 9
8 B 10
9 B 11
10 B 12

Pandas: Add rows in the groups of a dataframe

I have a data frame as follows:
df = pd.DataFrame({"date": [1,2,5,6,2,3,4,5,1,3,4,5,6,1,2,3,4,5,6],
                   "variable": ["A","A","A","A","B","B","B","B","C","C","C","C","C","D","D","D","D","D","D"]})
date variable
0 1 A
1 2 A
2 5 A
3 6 A
4 2 B
5 3 B
6 4 B
7 5 B
8 1 C
9 3 C
10 4 C
11 5 C
12 6 C
13 1 D
14 2 D
15 3 D
16 4 D
17 5 D
18 6 D
In this data frame, there are 4 values in the variable column: A, B, C, D. My goal is that each variable should contain the dates 1 through 6 in the date column.
But currently, a few values in the date column are missing for some variables. I tried grouping them and filling each value with a counter, but sometimes more than one date is missing (for example, in variable A the dates 3 and 4 are missing). The counter also made my code terribly slow, as I have a couple of thousand rows.
Is there a faster and smarter way to do this without using a counter?
The desired output should be as follows:
date variable
0 1 A
1 2 A
2 3 A
3 4 A
4 5 A
5 6 A
6 1 B
7 2 B
8 3 B
9 4 B
10 5 B
11 6 B
12 1 C
13 2 C
14 3 C
15 4 C
16 5 C
17 6 C
18 1 D
19 2 D
20 3 D
21 4 D
22 5 D
23 6 D
itertools.product
from itertools import product
pd.DataFrame([*product(
    range(df.date.min(), df.date.max() + 1),
    sorted({*df.variable})
)], columns=df.columns)
date variable
0 1 A
1 1 B
2 1 C
3 1 D
4 2 A
5 2 B
6 2 C
7 2 D
8 3 A
9 3 B
10 3 C
11 3 D
12 4 A
13 4 B
14 4 C
15 4 D
16 5 A
17 5 B
18 5 C
19 5 D
20 6 A
21 6 B
22 6 C
23 6 D
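The output above is ordered by date; to match the variable-major order requested in the question, one option (a sketch) is to put the variables first in product and then restore the column order:
pd.DataFrame([*product(sorted({*df.variable}),
                       range(df.date.min(), df.date.max() + 1))],
             columns=['variable', 'date'])[['date', 'variable']]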
Using groupby + reindex:
(df.groupby('variable', as_index=False)
   .apply(lambda g: g.set_index('date').reindex([1,2,3,4,5,6]).ffill().bfill())
   .reset_index(level=1))
Output:
date variable
0 1 A
0 2 A
0 3 A
0 4 A
0 5 A
0 6 A
1 1 B
1 2 B
1 3 B
1 4 B
1 5 B
1 6 B
2 1 C
2 2 C
2 3 C
2 4 C
2 5 C
2 6 C
3 1 D
3 2 D
3 3 D
3 4 D
3 5 D
3 6 D
This is more of a workaround, but it should work:
(df.groupby('variable')
   .agg({'date': lambda s: list(range(1, 7))})   # one full 1-6 list per group
   .explode('date')
   .reset_index())

Pandas reverse column values groupwise

I want to reverse a column's values in my dataframe, but only at an individual "groupby" level. Below is a minimal demonstration example, where I want to "flip" the values that belong to the same letter A, B or C:
df = pd.DataFrame({"group": ["A","A","A","B","B","B","B","C","C"],
                   "value": [1,3,2,4,4,2,3,2,5]})
group value
0 A 1
1 A 3
2 A 2
3 B 4
4 B 4
5 B 2
6 B 3
7 C 2
8 C 5
My desired output looks like this (a column is added instead of replaced, purely for brevity):
group value value_desired
0 A 1 2
1 A 3 3
2 A 2 1
3 B 4 3
4 B 4 2
5 B 2 4
6 B 3 4
7 C 2 5
8 C 5 2
As always, when I don't see a proper vector-style approach, I end up messing with loops just for the sake of the final output, but my current code hurts me very much:
for i in list(set(df["group"].values.tolist())):
    reversed_group = df.loc[df["group"] == i, "value"].values.tolist()[::-1]
    df.loc[df["group"] == i, "value_desired"] = reversed_group
Pandas gurus, please show me the way :)
You can use transform
In [900]: df.groupby('group')['value'].transform(lambda x: x[::-1])
Out[900]:
0 2
1 3
2 1
3 3
4 2
5 4
6 4
7 5
8 2
Name: value, dtype: int64
Details
In [901]: df['value_desired'] = df.groupby('group')['value'].transform(lambda x: x[::-1])
In [902]: df
Out[902]:
group value value_desired
0 A 1 2
1 A 3 3
2 A 2 1
3 B 4 3
4 B 4 2
5 B 2 4
6 B 3 4
7 C 2 5
8 C 5 2
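Depending on the pandas version, a Series returned from transform may be realigned on its index, which would undo the reversal; returning a plain array sidesteps any alignment (a defensive variant):
# return a bare array so no index alignment can take place
df['value_desired'] = df.groupby('group')['value'].transform(lambda x: x.to_numpy()[::-1])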
