I have a dataframe
df = pd.DataFrame([
['2', '3', 'nan'],
['0', '1', '4'],
['5', 'nan', '7']
])
print(df)
0 1 2
0 2 3 nan
1 0 1 4
2 5 nan 7
I want to convert these strings to numbers and sum the columns and convert back to strings.
Using astype(float) seems to get me to the number part. Then summing is easy with sum(). Then converting back to strings should be easy too with astype(str):
df.astype(float).sum().astype(str)
0 7.0
1 4.0
2 11.0
dtype: object
That's almost what I wanted. I wanted the string version of integers. But floats have decimals. How do I get rid of them?
I want this
0 7
1 4
2 11
dtype: object
Converting to int (i.e. with .astype(int).astype(str)) won't work if your column contains nulls. It's often a better idea to use string formatting to explicitly specify the format of your string column; you can set this in pd.options:
>>> pd.options.display.float_format = '{:,.0f}'.format
>>> df.astype(float).sum()
0 7
1 4
2 11
dtype: float64
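Note that this only changes how floats are printed; the underlying dtype is still float64. A minimal sketch (values made up to match the example above):
import pandas as pd
pd.options.display.float_format = '{:,.0f}'.format
s = pd.Series([7.0, 4.0, 11.0])
print(s)        # displays 7, 4, 11 with no decimals
print(s.dtype)  # float64 -- only the display changed, not the data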
Add an astype(int) to the mix:
df.astype(float).sum().astype(int).astype(str)
0 7
1 4
2 11
dtype: object
Demonstration with empty cells. This was not a requirement from the OP, but I include it to satisfy the detractors:
df = pd.DataFrame([
['2', '3', 'nan', None],
[None, None, None, None],
['0', '1', '4', None],
['5', 'nan', '7', None]
])
df
0 1 2 3
0 2 3 nan None
1 None None None None
2 0 1 4 None
3 5 nan 7 None
Then
df.astype(float).sum().astype(int).astype(str)
0 7
1 4
2 11
3 0
dtype: object
Because the OP didn't specify what they'd like to happen when a column was all missing, presenting zero is a reasonable option.
However, we could also drop those columns
df.dropna(axis=1, how='all').astype(float).sum().astype(int).astype(str)
0 7
1 4
2 11
dtype: object
For pandas >= 1.0:
The <NA> value was introduced for the nullable 'Int64' dtype. You can now do this:
df['your_column'].astype('Int64').astype('str')
And it will properly convert 1.0 to 1.
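A minimal sketch of how this behaves on a column with missing values (the sample values here are made up, and this assumes pandas >= 1.0):
import numpy as np
import pandas as pd
s = pd.Series([1.0, np.nan, 3.0])
s.astype('Int64').astype('str')
# 0       1
# 1    <NA>
# 2       3
# dtype: object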
Alternative:
If you do not want to change the display options for all of pandas (as @maxymoo's solution does), you can use apply:
df['your_column'].apply(lambda x: f'{x:.0f}')
Add astype(int) right before conversion to a string:
print (df.astype(float).sum().astype(int).astype(str))
Generates the desired result.
Based on toto_tico's alternative solution, with minor changes so that nulls don't become the string 'nan':
df['your_column'].apply(lambda x: f'{x:.0f}' if not pd.isnull(x) else '')
The above didn't work for me, so I'm going to add my solution.
Convert to a string and strip away the .0:
db['a'] = db['a'].astype(str).str.rstrip('.0')
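Note that rstrip strips any trailing run of the characters '.' and '0', not the literal suffix '.0', so values like 100.0 lose their zeros. A small sketch of the difference, with a regex-based alternative (sample values made up):
import pandas as pd
s = pd.Series([7.0, 100.0]).astype(str)
s.str.rstrip('.0')                       # ['7', '1']   -- '100.0' is mangled
s.str.replace(r'\.0$', '', regex=True)   # ['7', '100'] -- only a literal '.0' suffix is removed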
The above solutions, when converting to string, will turn NaN into a string as well. To get around that and retain NaN, use:
c = ...  # your column
np.where(
    df[c].isnull(), np.nan,
    df[c].apply('{:.0f}'.format)
)
Retaining NaN allows you to do stuff like convert a nullable column of integers like 19991231, 20000101, np.nan, 20000102 into date time without triggering date parsing errors.
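A quick sketch of the date use case just mentioned (the column name and values are hypothetical):
import numpy as np
import pandas as pd
df = pd.DataFrame({'ymd': [19991231.0, 20000101.0, np.nan, 20000102.0]})
c = 'ymd'
df['ymd_str'] = np.where(df[c].isnull(), np.nan, df[c].apply('{:.0f}'.format))
pd.to_datetime(df['ymd_str'], format='%Y%m%d', errors='coerce')
# 0   1999-12-31
# 1   2000-01-01
# 2          NaT
# 3   2000-01-02
# Name: ymd_str, dtype: datetime64[ns]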
I have a data frame like this:
df:
C1 C2 C3
1 4 6
2 NaN 9
3 5 NaN
NaN 7 3
I want to concatenate the 3 columns into a single column with a comma as a separator.
But I want the comma (",") only between non-null values.
I tried this, but it doesn't handle the null values:
df['New_Col'] = df[['C1','C2','C3']].agg(','.join, axis=1)
This gives me the output:
New_Col
1,4,6
2,,9
3,5,
,7,3
This is my ideal output:
New_Col
1,4,6
2,9
3,5
7,3
Can anyone help me with this?
Judging by your (incorrect) output, you have a dataframe of strings, and the NaN values are actually empty strings (otherwise it would throw TypeError: expected str instance, float found, because NaN is a float).
Since you're dealing with strings, which pandas is not optimized for, a vanilla Python list comprehension is probably the most efficient choice here.
df['NewCol'] = [','.join([e for e in x if e]) for x in df.values]
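A minimal, self-contained sketch (assuming, as above, that the missing values are empty strings):
import pandas as pd
df = pd.DataFrame({'C1': ['1', '2', '3', ''],
                   'C2': ['4', '', '5', '7'],
                   'C3': ['6', '9', '', '3']})
df['New_Col'] = [','.join([e for e in row if e]) for row in df.values]
# New_Col: '1,4,6', '2,9', '3,5', '7,3'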
In your case, use stack:
df['new'] = df.stack().astype(int).astype(str).groupby(level=0).agg(','.join)
Out[254]:
0 1,4,6
1 2,9
2 3,5
3 7,3
dtype: object
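A sketch of the same idea on a reconstructed frame (this assumes the missing values are real NaNs rather than empty strings, since astype(int) would fail on ''):
import numpy as np
import pandas as pd
df = pd.DataFrame({'C1': [1, 2, 3, np.nan],
                   'C2': [4, np.nan, 5, 7],
                   'C3': [6, 9, np.nan, 3]})
df['new'] = df.stack().astype(int).astype(str).groupby(level=0).agg(','.join)
# df['new']: '1,4,6', '2,9', '3,5', '7,3'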
You can use filter to get rid of NaNs:
df['New_Col'] = df.apply(lambda row: ','.join(filter(lambda v: v is not np.nan, list(row))), axis=1)
I have a dataframe with codes like the following and would like to create a new column that has the last sequence of numbers parsed out.
array(['K9ADXXL2', 'K9ADXL2', 'K9ADXS2', 'IVERMAXSCM12', 'HPDMUDOGDRYL'])
So the new column would contain the following:
array([2,2,2,12,None])
Sample data
df:
codes
0 K9ADXXL2
1 K9ADXL2
2 K9ADXS2
3 IVERMAXSCM12
4 HPDMUDOGDRYL
Use str.extract to get the digits at the end of the string, and pass the result to pd.to_numeric:
pd.to_numeric(df.codes.str.extract(r'(\d+$)')[0], errors='coerce')
Out[11]:
0 2.0
1 2.0
2 2.0
3 12.0
4 NaN
Name: 0, dtype: float64
If you want to get the value as a string of digits, you may use str.extract or str.findall as follows:
df.codes.str.findall(r'\d+$').str[0]
or
df.codes.str.extract(r'(\d+$)')[0]
Out[20]:
0 2
1 2
2 2
3 12
4 NaN
Name: codes, dtype: object
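If you want integers with a proper missing value (closer to the array([2, 2, 2, 12, None]) the OP asked for) rather than floats, one option, assuming pandas >= 0.24 for the nullable Int64 dtype, is:
pd.to_numeric(df.codes.str.extract(r'(\d+$)')[0], errors='coerce').astype('Int64')
# 0       2
# 1       2
# 2       2
# 3      12
# 4    <NA>
# Name: 0, dtype: Int64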
import re
import pandas as pd
def get_trailing_digits(s):
    match = re.search("[0-9]+$", s)
    return match.group(0) if match else None
original_column = pd.array(['K9ADXXL2', 'K9ADXL2', 'K9ADXS2', 'IVERMAXSCM12', 'HPDMUDOGDRYL'])
new_column = pd.array([get_trailing_digits(s) for s in original_column])
# ['2', '2', '2', '12', None]
[0-9] means any digit
+ means one or more times
$ means only at the end of the string
You can use the apply function of a series/data frame with get_trailing_digits as the function.
e.g.
my_df["new column"] = my_df["old column"].apply(get_trailing_digits)
I have a DataFrame with many missing values in columns which I wish to groupby:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': ['1', '2', '3'], 'b': ['4', np.NaN, '6']})
In [4]: df.groupby('b').groups
Out[4]: {'4': [0], '6': [2]}
See that pandas has dropped the rows with NaN target values. (I want to include these rows!)
Since I need many such operations (many columns have missing values), and use more complicated functions than just medians (typically random forests), I want to avoid writing overly complicated pieces of code.
Any suggestions? Should I write a function for this or is there a simple solution?
pandas >= 1.1
From pandas 1.1 you have better control over this behavior; NA values are now allowed in the grouper using dropna=False:
pd.__version__
# '1.1.0.dev0+2004.g8d10bfb6f'
# Example from the docs
df
a b c
0 1 2.0 3
1 1 NaN 4
2 2 1.0 3
3 1 2.0 2
# without NA (the default)
df.groupby('b').sum()
a c
b
1.0 2 3
2.0 2 5
# with NA
df.groupby('b', dropna=False).sum()
a c
b
1.0 2 3
2.0 2 5
NaN 1 4
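For reference, a small sketch that reconstructs the frame shown above so both calls can be run directly:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1, 1, 2, 1], 'b': [2.0, np.nan, 1.0, 2.0], 'c': [3, 4, 3, 2]})
df.groupby('b').sum()                # NaN group dropped (the default, dropna=True)
df.groupby('b', dropna=False).sum()  # NaN group kept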
This is mentioned in the Missing Data section of the docs:
NA groups in GroupBy are automatically excluded. This behavior is consistent with R
One workaround is to use a placeholder before doing the groupby (e.g. -1):
In [11]: df.fillna(-1)
Out[11]:
a b
0 1 4
1 2 -1
2 3 6
In [12]: df.fillna(-1).groupby('b').sum()
Out[12]:
a
b
-1 2
4 1
6 3
That said, this feels like a pretty awful hack... perhaps there should be an option to include NaN in groupby (see this GitHub issue, which uses the same placeholder hack).
However, as described in another answer, "from pandas 1.1 you have better control over this behavior, NA values are now allowed in the grouper using dropna=False"
Ancient topic, but in case someone still stumbles over this: another workaround is to convert to string via .astype(str) before grouping. That will preserve the NaNs.
df = pd.DataFrame({'a': ['1', '2', '3'], 'b': ['4', np.NaN, '6']})
df['b'] = df['b'].astype(str)
df.groupby(['b']).sum()
a
b
4 1
6 3
nan 2
I am not able to add a comment to M. Kiewisch's answer since I do not have enough reputation points (I only have 41 but need more than 50 to comment).
Anyway, just want to point out that M. Kiewisch's solution does not work as is and may need more tweaking. Consider for example
>>> df = pd.DataFrame({'a': [1, 2, 3, 5], 'b': [4, np.NaN, 6, 4]})
>>> df
a b
0 1 4.0
1 2 NaN
2 3 6.0
3 5 4.0
>>> df.groupby(['b']).sum()
a
b
4.0 6
6.0 3
>>> df.astype(str).groupby(['b']).sum()
a
b
4.0 15
6.0 3
nan 2
which shows that for group b=4.0, the corresponding value is 15 instead of 6. Here it is just concatenating 1 and 5 as strings instead of adding them as numbers.
All answers provided thus far result in potentially dangerous behavior as it is quite possible you select a dummy value that is actually part of the dataset. This is increasingly likely as you create groups with many attributes. Simply put, the approach doesn't always generalize well.
A less hacky solution is to use DataFrame.drop_duplicates() to create a unique index of value combinations, each with its own ID, and then group on that ID. It is more verbose, but it does get the job done:
def safe_groupby(df, group_cols, agg_dict):
    # set name of group col to unique value
    group_id = 'group_id'
    while group_id in df.columns:
        group_id += 'x'
    # get final order of columns
    agg_col_order = (group_cols + list(agg_dict.keys()))
    # create unique index of grouped values
    group_idx = df[group_cols].drop_duplicates()
    group_idx[group_id] = np.arange(group_idx.shape[0])
    # merge unique index on dataframe
    df = df.merge(group_idx, on=group_cols)
    # group dataframe on group id and aggregate values
    df_agg = df.groupby(group_id, as_index=True)\
               .agg(agg_dict)
    # merge grouped value index to results of aggregation
    df_agg = group_idx.set_index(group_id).join(df_agg)
    # rename index
    df_agg.index.name = None
    # return reordered columns
    return df_agg[agg_col_order]
Note that you can now simply do the following:
import numpy as np
import pandas as pd
from collections import OrderedDict
data_block = [np.tile([None, 'A'], 3),
              np.repeat(['B', 'C'], 3),
              [1] * (2 * 3)]
col_names = ['col_a', 'col_b', 'value']
test_df = pd.DataFrame(data_block, index=col_names).T
grouped_df = safe_groupby(test_df, ['col_a', 'col_b'],
                          OrderedDict([('value', 'sum')]))
This will return the successful result without having to worry about overwriting real data that is mistaken as a dummy value.
One small point to Andy Hayden's solution – it doesn't work (anymore?) because np.nan == np.nan yields False, so the replace function doesn't actually do anything.
What worked for me was this:
df['b'] = df['b'].apply(lambda x: x if not np.isnan(x) else -1)
(At least that's the behavior for Pandas 0.19.2. Sorry to add it as a different answer, I do not have enough reputation to comment.)
I answered this already, but for some reason the answer was converted to a comment. Nevertheless, this is the most efficient solution:
Not being able to include (and propagate) NaNs in groups is quite aggravating. Citing R is not convincing, as this behavior is not consistent with a lot of other things. Anyway, the dummy hack is also pretty bad. However, the size (includes NaNs) and the count (ignores NaNs) of a group will differ if there are NaNs.
dfgrouped = df.groupby(['b']).a.agg(['sum', 'size', 'count'])
dfgrouped.loc[dfgrouped['size'] != dfgrouped['count'], 'sum'] = None
When these differ, you can set the value back to None for the result of the aggregation function for that group.
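A quick sketch of the size/count trick on made-up data (note this handles NaNs inside the aggregated column, not in the grouping keys):
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1.0, np.nan, 3.0, 4.0], 'b': ['x', 'x', 'y', 'y']})
dfgrouped = df.groupby(['b']).a.agg(['sum', 'size', 'count'])
dfgrouped.loc[dfgrouped['size'] != dfgrouped['count'], 'sum'] = None
# group 'x' had a NaN (size 2, count 1), so its sum is reset to None; 'y' keeps 7.0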
I want to remove all numbers within the entries of a certain column in a Python pandas dataframe. Unfortunately, commands like .join() and .find() are not iterable (when I define a function to iterate on the entries, it gives me a message that floating variables do not have .find and .join attributes). Are there any commands that take care of this in pandas?
def remove(data):
    for i in data:
        if not i.isdigit():
            data = ''
            data = data.join(i)
    return data
myfile['column_name'] = myfile['column_name'].apply(remove())
You can remove all numbers like this:
import pandas as pd
df = pd.DataFrame({'x': ['1', '2', 'C', '4']})
df[df["x"].str.isdigit()] = "NaN"
Impossible to know for sure without a data sample, but your code implies data contains strings since you call isdigit on the elements.
Assuming the above, there are many ways to do what you want. One of them is conditional list comprehension:
import pandas as pd
s = pd.DataFrame({'x':['p','2','3','d','f','0']})
out = [ x if x.isdigit() else '' for x in s['x'] ]
# Output: ['', '2', '3', '', '', '0']
Or look at using pd.to_numeric with errors='coerce' to cast the column as numeric and eliminate non-numeric values:
Using @Raidex's setup:
s = pd.DataFrame({'x':['p','2','3','d','f','0']})
pd.to_numeric(s['x'], errors='coerce')
Output:
0 NaN
1 2.0
2 3.0
3 NaN
4 NaN
5 0.0
Name: x, dtype: float64
EDIT: to handle either situation.
s['x'].where(~s['x'].str.isdigit())
Output:
0 p
1 NaN
2 NaN
3 d
4 f
5 NaN
Name: x, dtype: object
OR
s['x'].where(s['x'].str.isdigit())
Output:
0 NaN
1 2
2 3
3 NaN
4 NaN
5 0
Name: x, dtype: object
In my application, I receive a pandas DataFrame (say, block), that has a column called est. This column can contain a mix of strings or floats. I need to convert all values in the column to floats and have the column type be float64. I do so using the following code:
block[est].convert_objects(convert_numeric=True)
block[est].astype('float')
This works for most cases. However, in one case, est contains all empty strings. In this case, the first statement executes without error, but the empty strings in the column remain empty strings. The second statement then causes an error: ValueError: could not convert string to float:.
How can I modify my code to handle a column with all empty strings?
Edit: I know I can just do block[est].replace("", np.NaN), but I was wondering if there's some way to do it with just convert_objects or astype that I'm missing.
Clarification: For project-specific reasons, I need to use pandas 0.16.2.
Here's an interaction with some sample data that demonstrates the failure:
>>> block = pd.DataFrame({"eps":["", ""]})
>>> block = block.convert_objects(convert_numeric=True)
>>> block["eps"]
0
1
Name: eps, dtype: object
>>> block["eps"].astype('float')
...
ValueError: could not convert string to float:
It's easier to do it using pandas.to_numeric:
http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.to_numeric.html
import pandas as pd
df = pd.DataFrame({'eps': ['1', 1.6, '1.6', 'a', '', 'a1']})
df['eps'] = pd.to_numeric(df['eps'], errors='coerce')
errors='coerce' will convert any value that cannot be parsed into NaN.
df['eps'].astype('float')
0 1.0
1 1.6
2 1.6
3 NaN
4 NaN
5 NaN
Name: eps, dtype: float64
Then you can apply other functions without getting errors:
df['eps'].round()
0 1.0
1 2.0
2 2.0
3 NaN
4 NaN
5 NaN
Name: eps, dtype: float64
import numpy as np
import pandas as pd
def convert_float(val):
    try:
        return float(val)
    except ValueError:
        return np.nan
df = pd.DataFrame({'eps': ['1', 1.6, '1.6', 'a', '', 'a1']})
df.eps.apply(convert_float)
0 1.0
1 1.6
2 1.6
3 NaN
4 NaN
5 NaN
Name: eps, dtype: float64