I am trying to select single rows from a bunch of dataframes and make a new dataframe by concatenating them together.
Here is a simple example
x=pd.DataFrame([[1,2,3],[1,2,3]],columns=["A","B","C"])
A B C
0 1 2 3
1 1 2 3
a=x.loc[0,:]
A 1
B 2
C 3
Name: 0, dtype: int64
b=x.loc[1,:]
A 1
B 2
C 3
Name: 1, dtype: int64
c=pd.concat([a,b])
I end up with this:
A 1
B 2
C 3
A 1
B 2
C 3
Name: 0, dtype: int64
Whereas I would expect the original data frame:
A B C
0 1 2 3
1 1 2 3
I can get the values and create a new dataframe, but this doesn't seem like the way to do it.
If you want to concat two series vertically (vertical stacking), then one option is a concat and transpose.
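A minimal sketch of the concat-and-transpose option, reusing x, a and b from the question. Concatenating along axis=1 puts each Series in its own column, and transposing turns those columns back into rows.

```python
import pandas as pd

x = pd.DataFrame([[1, 2, 3], [1, 2, 3]], columns=["A", "B", "C"])
a = x.loc[0, :]
b = x.loc[1, :]

# axis=1 places each Series side by side as a column (named 0 and 1);
# .T then flips the result so each Series becomes a row again.
c = pd.concat([a, b], axis=1).T
print(c)
```

The row labels of c come from the Series names (0 and 1 here), so the original index labels are preserved.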
Another is using np.vstack:
import numpy as np
pd.DataFrame(np.vstack([a, b]), columns=a.index)
A B C
0 1 2 3
1 1 2 3
Since you are selecting rows by integer position, I'd use .iloc. Note the difference between [[]] and [], which return a DataFrame and a Series respectively.*
a = x.iloc[[0]]
b = x.iloc[[1]]
pd.concat([a, b])
# A B C
#0 1 2 3
#1 1 2 3
To still use .loc, you'd do something like
a = x.loc[[0,]]
b = x.loc[[1,]]
*There's a small caveat that if index 0 is duplicated in x then x.loc[0,:] will return a DataFrame and not a Series.
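To make the distinction and the caveat concrete, here is a small sketch that only prints the resulting types. The frame dup, with a duplicated index label, is a hypothetical example added for illustration and is not part of the question.

```python
import pandas as pd

x = pd.DataFrame([[1, 2, 3], [1, 2, 3]], columns=["A", "B", "C"])

row_as_series = x.iloc[0]     # scalar selector -> Series
row_as_frame = x.iloc[[0]]    # list selector   -> one-row DataFrame
print(type(row_as_series).__name__, type(row_as_frame).__name__)

# Hypothetical frame whose index label 0 appears twice, to illustrate the caveat:
# a scalar label that matches several rows yields a DataFrame, not a Series.
dup = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["A", "B", "C"], index=[0, 0])
print(type(dup.loc[0, :]).__name__)
```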
It looks like you want to make a new dataframe from a collection of records. There's a method for that:
import pandas as pd
x = pd.DataFrame([[1,2,3],[1,2,3]], columns=["A","B","C"])
a = x.loc[0,:]
b = x.loc[1,:]
c = pd.DataFrame.from_records([a, b])
print(c)
# A B C
# 0 1 2 3
# 1 1 2 3
I've had issues finding a concise way to append a series to each row of a dataframe, with the series labels becoming new columns in the df. All the values will be the same on each of the dataframes' rows, which is desired.
I can get the effect by doing the following:
df["new_col_A"] = ser["new_col_A"]
.....
df["new_col_Z"] = ser["new_col_Z"]
But this is so tedious there must be a better way, right?
Given:
# df
A B
0 1 2
1 1 3
2 4 6
# ser
C a
D b
dtype: object
Doing:
df[ser.index] = ser
print(df)
Output:
A B C D
0 1 2 a b
1 1 3 a b
2 4 6 a b
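An alternative sketch, not taken from the answer above: DataFrame.assign accepts keyword arguments, so unpacking the Series as a dict adds one broadcast column per label and returns a new frame. This assumes the Series labels are strings (they become keyword names); the df and ser below are rebuilt from the printed example.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 4], "B": [2, 3, 6]})
ser = pd.Series({"C": "a", "D": "b"})

# Each label/value pair of ser becomes a new column; the scalar value
# is broadcast to every row. df itself is left unchanged.
out = df.assign(**ser.to_dict())
print(out)
```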
I have a bunch of rows which I want to rearrange one after the other based on a particular column.
df
B/S
0 B
1 B
2 S
3 S
4 B
5 S
I have thought about doing a loc based on B and S and then adding them all together in a new dataframe but that doesn't seem like good practice for pandas.
Is there a pandas centric approach to this?
Output required
B/S
0 B
2 S
1 B
3 S
4 B
5 S
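We can achieve this by resetting the index of each group, so that the n-th B and the n-th S share the same label, and then sorting on that label:
m = df['B/S'].eq('B')
b = df[m].reset_index(drop=True)
s = df[~m].reset_index(drop=True)
# DataFrame.append was removed in pandas 2.0, so pd.concat is used here;
# a stable sort keeps each B ahead of the S that shares its label
out = pd.concat([b, s]).sort_index(kind='mergesort').reset_index(drop=True)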
B/S
0 B
1 S
2 B
3 S
4 B
5 S
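If you want to keep your index information, we can slightly adjust our approach and keep the old index as a column until the end:
m = df['B/S'].eq('B')
b = df[m].reset_index()
s = df[~m].reset_index()
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
out = pd.concat([b, s]).sort_index(kind='mergesort').set_index('index')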
B/S
index
0 B
2 S
1 B
3 S
4 B
5 S
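A different technique that gives the same interleaving, offered only as a sketch and not as the answer's method: number the rows within each B/S group with groupby(...).cumcount() and reorder by that number with a stable sort. The df below is rebuilt from the question's example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"B/S": ["B", "B", "S", "S", "B", "S"]})

# Position of each row inside its own group: the 1st B and 1st S get 0,
# the 2nd B and 2nd S get 1, and so on.
rank_in_group = df.groupby("B/S").cumcount()

# A stable argsort keeps the original relative order among equal ranks,
# so each B stays ahead of the S that shares its rank.
order = np.argsort(rank_in_group.to_numpy(), kind="stable")
out = df.iloc[order]
print(out)
```

Because rows are only reordered, the original index labels are retained, which matches the "Output required" layout in the question.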
I am trying to rename a level name in a pandas MultiIndex, but it doesn't work. Here you can see my Series object. By the way, why does the DataFrame df_injury_record become a Series object in this call?
Frequency_BodyPart = df_injury_record.groupby(["Surface","BodyPart"]).size()
In the next line you can see my attempt to rename the column.
Frequency_BodyPart.rename_axis(index={'Surface': 'Class'})
But after this, the column still has the same name.
Regards
One possible problem is a pandas version below 0.24; another is that you forgot to assign the result back, as mentioned by @anky_91:
df_injury_record = pd.DataFrame({'Surface':list('aaaabbbbddd'),
'BodyPart':list('abbbbdaaadd')})
Frequency_BodyPart = df_injury_record.groupby(["Surface","BodyPart"]).size()
print (Frequency_BodyPart)
Surface  BodyPart
a        a           1
         b           3
b        a           2
         b           1
         d           1
d        a           1
         d           2
dtype: int64
Frequency_BodyPart = Frequency_BodyPart.rename_axis(index={'Surface': 'Class'})
print (Frequency_BodyPart)
Class  BodyPart
a      a           1
       b           3
b      a           2
       b           1
       d           1
d      a           1
       d           2
dtype: int64
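A short sketch of the two points raised in the question, using the same sample data as above. groupby(...).size() produces one count per group, so the result is a Series indexed by the group keys, and rename_axis returns a new object rather than modifying the original.

```python
import pandas as pd

df_injury_record = pd.DataFrame({"Surface": list("aaaabbbbddd"),
                                 "BodyPart": list("abbbbdaaadd")})

freq = df_injury_record.groupby(["Surface", "BodyPart"]).size()
print(type(freq).__name__)            # one value per group -> a Series

# The returned object is discarded here, so freq keeps its old level names.
freq.rename_axis(index={"Surface": "Class"})
print(list(freq.index.names))

# Assigning the result back is what makes the new name stick.
renamed = freq.rename_axis(index={"Surface": "Class"})
print(list(renamed.index.names))
```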
If you want a three-column DataFrame, this also works on older pandas versions:
df = Frequency_BodyPart.reset_index(name='count').rename(columns={'Surface': 'Class'})
print (df)
Class BodyPart count
0 a a 1
1 a b 3
2 b a 2
3 b b 1
4 b d 1
5 d a 1
6 d d 2
I am using apply to leverage one dataframe to manipulate a second dataframe and return results. Here is a simplified example that I realize could be more easily answered with "in" logic, but for now let's keep the use of .apply() as a constraint:
import pandas as pd
df1 = pd.DataFrame({'Name':['A','B'],'Value':range(1,3)})
df2 = pd.DataFrame({'Name':['A']*3+['B']*4+['C'],'Value':range(1,9)})
def filter_df(x, df):
    return df[df['Name']==x['Name']]
df1.apply(filter_df, axis=1, args=(df2, ))
Which is returning:
0 Name Value
0 A 1
1 A 2
2 ...
1 Name Value
3 B 4
4 B 5
5 ...
dtype: object
What I would like to see instead is one formatted DataFrame with Name and Value headers. All advice appreciated!
Name Value
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
6 B 7
In my opinion, this cannot be done with apply alone; you also need pandas.concat:
result = pd.concat(df1.apply(filter_df, axis=1, args=(df2,)).to_list())
print(result)
Output
Name Value
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
6 B 7
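For comparison only, the "in" logic that the question sets aside can be sketched with Series.isin. It does not satisfy the .apply() constraint, but it is a handy way to check that the concatenated result is the one expected.

```python
import pandas as pd

df1 = pd.DataFrame({"Name": ["A", "B"], "Value": range(1, 3)})
df2 = pd.DataFrame({"Name": ["A"] * 3 + ["B"] * 4 + ["C"], "Value": range(1, 9)})

# Keep the rows of df2 whose Name also appears in df1.
result = df2[df2["Name"].isin(df1["Name"])]
print(result)
```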
In the following dataset, what's the best way to duplicate rows so that every Type whose groupby(['Type']) count is below 3 ends up with 3 rows? df is the input, and df1 is my desired outcome. You can see that row 3 from df was duplicated twice at the end. This is only a small example; the real data has approximately 20 million rows and 400K unique Types, so a method that does this efficiently is desired.
>>> df
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
>>> df1
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
Thought about using something like the following but do not know the best way to write the func.
df.groupby('Type').apply(func)
Thank you in advance.
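Use value_counts with map and repeat:
counts = df.Type.value_counts()
repeat_map = 3 - counts[counts < 3]
# Types that already have 3 or more rows are missing from repeat_map, so they get 0
df['repeat_num'] = df.Type.map(repeat_map).fillna(0).astype(int)
extra = df.set_index('Type')['Val'].repeat(df['repeat_num']).reset_index()
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
df = pd.concat([df, extra], sort=False, ignore_index=True)[['Type','Val']]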
print(df)
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
Note: the sort=False argument is available in pandas>=0.23.0; remove it if you are using a lower version.
EDIT: If the data contains multiple value columns, set all columns except one as the index, repeat, and then reset_index:
extra = df.set_index(['Type','Val_1','Val_2'])['Val'].repeat(df['repeat_num']).reset_index()
df = pd.concat([df, extra], sort=False, ignore_index=True)
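One thing to watch in the approach above: every row of a short group receives the same repeat count, so a Type that appears twice would, as far as I can tell, end up with four rows rather than three. The sketch below is a variation that repeats only the first row of each short group. Which row to duplicate is not specified in the question, so picking the first one is an assumption, and the Type "d" rows are hypothetical data added to exercise the two-row case.

```python
import numpy as np
import pandas as pd

# Example data from the question, plus a hypothetical Type "d" with two rows.
df = pd.DataFrame({
    "Type": ["a", "a", "a", "b", "c", "c", "c", "d", "d"],
    "Val":  [1, 2, 3, 1, 3, 2, 1, 5, 6],
})

target = 3
size = df.groupby("Type")["Type"].transform("size")  # group size on every row
is_first = ~df.duplicated("Type")                     # True on the first row of each Type

# Only the first row of a short group is repeated, once per missing row.
repeats = np.where(is_first, (target - size).clip(lower=0), 0)

extra = df.iloc[np.repeat(np.arange(len(df)), repeats)]
out = pd.concat([df, extra], ignore_index=True)
print(out)
```

Everything here is vectorised (one groupby transform, one np.repeat, one concat), which should matter at the 20-million-row scale mentioned in the question, though I have not benchmarked it.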