I need to replace all the spaces in a dataframe column with a period, i.e.:
Original df:
symbol
0 AEC
1 BRK A
2 BRK B
3 CTRX
4 FCE A
Desired result df:
symbol
0 AEC
1 BRK.A
2 BRK.B
3 CTRX
4 FCE.A
Is there a way to do this without iterating through each row, replacing the spaces one at a time? I'd prefer a vectorized approach if one exists.
Use vectorised str.replace:
In [95]:
df['symbol'] = df['symbol'].str.replace(' ','.')
df
Out[95]:
symbol
0 AEC
1 BRK.A
2 BRK.B
3 CTRX
4 FCE.A
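Note that str.replace treats the pattern as a regular expression by default in older pandas, while pandas 2.0 changed the default to a literal replacement. A space has no special regex meaning, so it works either way here, but passing regex explicitly makes the intent unambiguous:
df['symbol'] = df['symbol'].str.replace(' ', '.', regex=False)  # literal, non-regex replacement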
My dataset is as the example below:
Index ID
0 1.4A
1 1.4D
2 5B
3 D6C
4 ZG67A
5 ZG67C
I want to add a "-" before the last position of the values in my column. The values don't have a consistent length, so I cannot pick a fixed position to place the "-" at, as in this helpful post.
One good solution in the related post is to use pd.Series.str and choose a position:
df["ID"].str[:2] + "-" + df["ID"].str[2:]
I somehow need to address the position before the last letter in every row of my 'ID' column. Later I want to apply split, but as far as I understand it, split needs a delimiter.
Best Outcome:
Index ID
0 1.4-A
1 1.4-D
2 5-B
3 D6-C
4 ZG67-A
5 ZG67-C
Thanks
Try a regex replace: the pattern captures everything before a trailing run of capital letters and reinserts the two groups with a "-" between them:
df["ID"] = df["ID"].str.replace(r"(.*)([A-Z]+)$", r"\1-\2", regex=True)
print(df)
Prints:
Index ID
0 0 1.4-A
1 1 1.4-D
2 2 5-B
3 3 D6-C
4 4 ZG67-A
5 5 ZG67-C
You can reference positions relative to the end of a string using negative indices, just as with normal list or string indexing:
df['ID'].str[:-1] + "-" + df["ID"].str[-1:]
If you're hoping to split out the last character in each string, you could split on a zero-width regular expression that matches the position just before the final character, so no delimiter is needed:
In [9]: df.ID.str.split(r'(?=.$)', regex=True)
Out[9]:
Index
0 [1.4, A]
1 [1.4, D]
2 [5, B]
3 [D6, C]
4 [ZG67, A]
5 [ZG67, C]
Name: ID, dtype: object
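If the end goal is two separate columns rather than lists, the same zero-width split can be expanded; a small sketch (the column names prefix and last are my own, hypothetical choice):
# split each string just before its final character, into two new columns
df[['prefix', 'last']] = df['ID'].str.split(r'(?=.$)', regex=True, expand=True)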
You can use a regex lookahead to match the position before the last character:
df['ID2'] = df['ID'].str.replace(r'(?=.$)', '-', regex=True)
output (written to a new column ID2 for comparison):
Index ID ID2
0 0 1.4A 1.4-A
1 1 1.4D 1.4-D
2 2 5B 5-B
3 3 D6C D6-C
4 4 ZG67A ZG67-A
5 5 ZG67C ZG67-C
Let's say I have 3 different columns
Column1 Column2 Column3
0 a 1 NaN
1 NaN 3 4
2 b 6 7
3 NaN NaN 7
and I want to create 1 final column that would take first value that isn't NA, resulting in:
Column1
0 a
1 3
2 b
3 7
I would usually do this with a custom apply function:
df.apply(lambda x: ...)
I need to do this for many different cases with millions of rows and this becomes very slow. Are there any operations that would take advantage of vectorization to make this faster?
Back-fill missing values across the columns, then select the first column: use [0] in a list for a one-column DataFrame, or a plain 0 for a Series:
df1 = df.bfill(axis=1).iloc[:, [0]]
s = df.bfill(axis=1).iloc[:, 0]
You can use fillna() for this, chaining one call per fallback column:
df['Column1'].fillna(df['Column2']).fillna(df['Column3'])
output:
0 a
1 3
2 b
3 7
For more than 3 columns, this can be placed in a for loop as below, with new_col being your output:
new_col = df['Column1']
for col in df.columns:
    new_col = new_col.fillna(df[col])
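For reference, the same chaining can be written without an explicit loop; a minimal, functionally equivalent sketch using functools.reduce:
from functools import reduce

# start from the first column and fill its gaps from each subsequent column in turn
new_col = reduce(lambda acc, col: acc.fillna(df[col]), df.columns[1:], df[df.columns[0]])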
I am looping over groups of a pandas dataframe:
for group_number, group in df.groupby("group_number"):
Within this loop the rows are ordered by date, and I want to access values in the first and the last rows (the start and end of the records in the group).
Unfortunately first() and last() don't work for the groups in this loop.
Can I do that with dataframes, or do I have to loop over a list of lists of tuples?
Thanks for your help
Get your first and last from your groupby, using take, and then operate on that:
for group_number, group in df.groupby("group number"):
    group.take([0, -1])
For example, using a filler df:
>>> df = pd.DataFrame({'group number': np.repeat(np.arange(1, 4), 4),
...                    'data': list('abcd1234wxyz')})
>>> df
group number data
0 1 a
1 1 b
2 1 c
3 1 d
4 2 1
5 2 2
6 2 3
7 2 4
8 3 w
9 3 x
10 3 y
11 3 z
>>> for group_number, group in df.groupby('group number'):
...     print(group.take([0, -1]))
group number data
0 1 a
3 1 d
group number data
4 2 1
7 2 4
group number data
8 3 w
11 3 z
@Joshua Voskamp: I didn't see this possibility :) Also, it returns a dataframe, so if I want to access specific row/column values in my loop I have to work on it a bit more:
for group_number, group in df.groupby('group number'):
    print(group.take([0, -1]).iloc[0].values)
In my case it's simpler to access specific values with:
for group_number, group in df.groupby('group number'):
    print(group.iloc[0].data, group.iloc[-1].data)
Maybe I should have used these functions (first, last, nth, take, ...) and merged/joined the resulting dataframes into one at the end (not looping at all)!
Thanks :)
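For the record, a non-looping variant along those lines could look like the sketch below, since GroupBy.nth accepts a list of positions:
# first and last row of every group in a single frame, no Python-level loop
first_last = df.groupby('group number').nth([0, -1])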
Someone asked how to select the first observation per group in a pandas df. I am interested in both the first and the last, and I don't know an efficient way of doing it other than writing a for loop.
I am going to modify their example to show what I am looking for;
basically there is a df like this:
group_id
1
1
1
2
2
2
3
3
3
I would like to have a variable that indicates the last observation in a group:
group_id indicator
1 0
1 0
1 1
2 0
2 0
2 1
3 0
3 0
3 1
Using pandas.shift, you can do something like:
df['group_indicator'] = df.group_id != df.group_id.shift(-1)
(or
df['group_indicator'] = (df.group_id != df.group_id.shift(-1)).astype(int)
if it's actually important for you to have it as an integer.)
Note:
For large datasets, this should be much faster than a list comprehension (not to mention loops).
As Alexander notes, this assumes the DataFrame is sorted as it is in the example.
First, we'll create a list of the index locations containing the last element of each group. You can see the elements of each group as follows:
>>> df.groupby('group_id').groups
{1: [0, 1, 2], 2: [3, 4, 5], 3: [6, 7, 8]}
We use a list comprehension to extract the last index location (idx[-1]) of each of these group index values.
We assign the indicator to the dataframe by using a list comprehension and a ternary operator (i.e. 1 if condition else 0), iterating across each element in the index and checking if it is in the idx_last_group list.
idx_last_group = [idx[-1] for idx in df.groupby('group_id').groups.values()]
df['indicator'] = [1 if idx in idx_last_group else 0 for idx in df.index]
>>> df
group_id indicator
0 1 0
1 1 0
2 1 1
3 2 0
4 2 0
5 2 1
6 3 0
7 3 0
8 3 1
Use the .tail method:
df = df.groupby('group_id').tail(1)
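Note that tail(1) selects the last row of each group rather than building the indicator column from the question; if the indicator is what you need, one possible sketch reuses the tail index:
df['indicator'] = 0
# mark the index positions of each group's last row
df.loc[df.groupby('group_id').tail(1).index, 'indicator'] = 1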
You can group by 'group_id' and call nth(-1) to get the last entry for each group, then use its index to mask the df, setting 'indicator' to 1 for those rows and filling the rest with 0 using fillna:
In [21]:
df.loc[df.groupby('group_id')['group_id'].nth(-1).index,'indicator'] = 1
df['indicator'].fillna(0, inplace=True)
df
Out[21]:
group_id indicator
0 1 0
1 1 0
2 1 1
3 2 0
4 2 0
5 2 1
6 3 0
7 3 0
8 3 1
Here is the output from the groupby:
In [22]:
df.groupby('group_id')['group_id'].nth(-1)
Out[22]:
2 1
5 2
8 3
Name: group_id, dtype: int64
One line:
data['indicator'] = (data.groupby('group_id').cumcount()
                     == data.groupby('group_id')['any_other_column'].transform('size') - 1).astype(int)
What we do is check whether the cumulative count (which returns a vector the same size as the dataframe) is equal to the "size of the group minus 1", which we calculate using transform so that it also returns a vector the same size as the dataframe.
We need to use some other column for the transform because it won't let you transform the .groupby() column, but this can literally be any other column, and it won't be affected since it's only used to calculate the new indicator. Use .astype(int) to make it binary, and done.
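A possible simplification (my own sketch, not from the answers above) avoids the helper column entirely by counting from the end of each group, where the last row gets cumulative count 0:
data['indicator'] = (data.groupby('group_id').cumcount(ascending=False) == 0).astype(int)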
I have a dataframe like this:
A B C
0 1 0.749065 This
1 2 0.301084 is
2 3 0.463468 a
3 4 0.643961 random
4 1 0.866521 string
5 2 0.120737 !
Calling
In [10]: print(df.groupby("A")["B"].sum())
will print
A
1 1.615586
2 0.421821
3 0.463468
4 0.643961
Now I would like to do "the same" for column "C". Because that column contains strings, sum() doesn't work (although you might think that it would concatenate the strings). What I would really like to see is a list or set of the strings for each group, i.e.
A
1 {This, string}
2 {is, !}
3 {a}
4 {random}
I have been trying to find ways to do this.
Series.unique() (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.unique.html) doesn't work, although
df.groupby("A")["B"]
is a
pandas.core.groupby.SeriesGroupBy object
so I was hoping any Series method would work. Any ideas?
In [4]: df = read_csv(StringIO(data), sep=r'\s+')  # 'data' holds the table text shown below
In [5]: df
Out[5]:
A B C
0 1 0.749065 This
1 2 0.301084 is
2 3 0.463468 a
3 4 0.643961 random
4 1 0.866521 string
5 2 0.120737 !
In [6]: df.dtypes
Out[6]:
A int64
B float64
C object
dtype: object
When you apply your own function, there is no automatic exclusion of non-numeric columns. This is slower, though, than applying .sum() directly to the groupby:
In [8]: df.groupby('A').apply(lambda x: x.sum())
Out[8]:
A B C
A
1 2 1.615586 Thisstring
2 4 0.421821 is!
3 3 0.463468 a
4 4 0.643961 random
sum by default concatenates the strings:
In [9]: df.groupby('A')['C'].apply(lambda x: x.sum())
Out[9]:
A
1 Thisstring
2 is!
3 a
4 random
dtype: object
You can produce pretty much exactly what you want:
In [11]: df.groupby('A')['C'].apply(lambda x: "{%s}" % ', '.join(x))
Out[11]:
A
1 {This, string}
2 {is, !}
3 {a}
4 {random}
dtype: object
To do this on a whole frame, one group at a time, the key is to return a Series:
def f(x):
    return Series(dict(A=x['A'].sum(),
                       B=x['B'].sum(),
                       C="{%s}" % ', '.join(x['C'])))
In [14]: df.groupby('A').apply(f)
Out[14]:
A B C
A
1 2 1.615586 {This, string}
2 4 0.421821 {is, !}
3 3 0.463468 {a}
4 4 0.643961 {random}
You can use the apply method to apply an arbitrary function to the grouped data. So if you want a set, apply set. If you want a list, apply list.
>>> d
A B
0 1 This
1 2 is
2 3 a
3 4 random
4 1 string
5 2 !
>>> d.groupby('A')['B'].apply(list)
A
1 [This, string]
2 [is, !]
3 [a]
4 [random]
dtype: object
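The set variant mentioned above is the analogous call (set display order may vary from run to run):
>>> d.groupby('A')['B'].apply(set)
A
1    {This, string}
2    {is, !}
3    {a}
4    {random}
dtype: object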
If you want something else, just write a function that does what you want and then apply that.
You may be able to use the aggregate (or agg) function to concatenate the values. (Untested code)
df.groupby('A')['B'].agg(lambda col: ''.join(col))
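Since ''.join is itself a callable, it may also work passed directly, without the lambda:
df.groupby('A')['B'].agg(''.join)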
You could try this:
df.groupby('A').agg({'B':'sum','C':'-'.join})
Named aggregations with pandas >= 0.25.0
Since pandas version 0.25.0 we have named aggregations where we can groupby, aggregate and at the same time assign new names to our columns. This way we won't get the MultiIndex columns, and the column names make more sense given the data they contain:
aggregate and get a list of strings
grp = df.groupby('A').agg(B_sum=('B', 'sum'),
                          C=('C', list)).reset_index()
print(grp)
A B_sum C
0 1 1.615586 [This, string]
1 2 0.421821 [is, !]
2 3 0.463468 [a]
3 4 0.643961 [random]
aggregate and join the strings
grp = df.groupby('A').agg(B_sum=('B', 'sum'),
                          C=('C', ', '.join)).reset_index()
print(grp)
A B_sum C
0 1 1.615586 This, string
1 2 0.421821 is, !
2 3 0.463468 a
3 4 0.643961 random
A simple solution would be:
>>> df.groupby('A')['C'].unique().reset_index()
If you'd like to overwrite column B in the dataframe (as in the two-column example d above, where B holds the strings), this should work:
df = df.groupby('A',as_index=False).agg(lambda x:'\n'.join(x))
Following @Erfan's good answer: most of the time, in an analysis of aggregated values you want the unique possible combinations of the existing character values:
unique_chars = lambda x: ', '.join(x.unique())
(df
 .groupby(['A'])
 .agg({'C': unique_chars}))