I read from a csv file. My dataframe contains strings that are really floats, and it also has NaN values.
Basically I want to turn the NaNs into the mean and the strings into floats.
There are methods that could help, like fillna, which replaces NaN values, but for it I can't get the mean (because the values are strings).
There is also float(), but applied to NaN it gives 0, which is not good for me.
Is there a good way to replace the NaN values with the mean and convert the strings into floats?
Example of dataframe:
1 9,5 50,6 45,75962845 2,6 6,5 11 8,9 NaN
2 10,5 59,9 74,44538987 0 4,5 8,9 NaN NaN
3 20,1 37,7 NaN 0,8 2,5 9,7 6,7 4,2
4 10,7 45,2 10,9710853 0,4 3,1 6,9 5,5 4,7
5 13,2 39,9 9,23393302 0 5,8 9,2 7,4 4,3
P.S. As A. Leistra proposed, I used
for col in df.columns:
    df[col] = pd.to_numeric(df[col], errors='coerce')
    df[col].fillna(df[col].mean())
to_numeric with errors='coerce' creates a lot of new NaNs (my values use commas as decimal separators, so they are not convertible as-is). The errors='ignore' parameter seemed promising, but then the columns stay as strings, and the line df[col].fillna(df[col].mean()) raises TypeError: Can't convert 'int' object to str implicitly.
P.S.2 As piRSquared advised, I tried adding decimal=',' to the read_csv call, but it still gives the same error: TypeError: Can't convert 'int' object to str implicitly
You should have read in the data using a decimal=',' argument if you used pd.read_csv. Otherwise, if you're stuck with this data frame, you can dump it out to a csv and try again.
pd.read_csv(pd.io.common.StringIO(df.to_csv(index=False)), decimal=',')
0 1 2 3 4 5 6 7 8
0 1 9.5 50.6 45.759628 2.6 6.5 11.0 8.9 NaN
1 2 10.5 59.9 74.445390 0.0 4.5 8.9 NaN NaN
2 3 20.1 37.7 NaN 0.8 2.5 9.7 6.7 4.2
3 4 10.7 45.2 10.971085 0.4 3.1 6.9 5.5 4.7
4 5 13.2 39.9 9.233933 0.0 5.8 9.2 7.4 4.3
Filling in missing data becomes easy.
d = pd.read_csv(pd.io.common.StringIO(df.to_csv(index=False)), decimal=',')
d.fillna(d.mean())
0 1 2 3 4 5 6 7 8
0 1 9.5 50.6 45.759628 2.6 6.5 11.0 8.900 4.4
1 2 10.5 59.9 74.445390 0.0 4.5 8.9 7.125 4.4
2 3 20.1 37.7 35.102509 0.8 2.5 9.7 6.700 4.2
3 4 10.7 45.2 10.971085 0.4 3.1 6.9 5.500 4.7
4 5 13.2 39.9 9.233933 0.0 5.8 9.2 7.400 4.3
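If the data is still on disk, the same argument works at the source; a minimal sketch, assuming the file name (illustrative here) and that all columns use comma decimals:
d = pd.read_csv('data.csv', decimal=',')
d = d.fillna(d.mean())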
First you need to convert the strings to floats using to_numeric:
for col in df.columns:
    df[col] = pd.to_numeric(df[col], errors='coerce')
(Using errors='coerce' replaces non-convertible values with NaN, which is what you want here.) Then you will be able to use fillna:
df.fillna(df.mean())
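Note that with comma decimal separators, as in the sample data, to_numeric alone coerces every such value to NaN. A minimal sketch of a combined fix, assuming the columns are still strings (the str.replace step is the addition here):
for col in df.columns:
    # normalize '45,75962845' to '45.75962845' before parsing; real NaNs survive the coercion
    df[col] = pd.to_numeric(df[col].astype(str).str.replace(',', '.'), errors='coerce')
df = df.fillna(df.mean())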
I happen to have a dataset that looks like this:
A-B A-B A-B A-B A-B B-A B-A B-A B-A B-A
2 3 2 4 5 3.1 3 2 2.5 2.6
NaN 3.2 3.3 3.5 5.2 NaN 4 2.7 3.2 5
NaN NaN 4.1 4 6 NaN NaN 4 4.1 6
NaN NaN NaN 4.2 5.1 NaN NaN NaN 3.5 5.2
NaN NaN NaN NaN 6 NaN NaN NaN NaN 5.7
It's very bad, I know. But what I would like to obtain is:
A-B B-A
2 3.1
3.2 4
4.1 4
4.2 3.5
6 5.7
These are the values on the "diagonals".
Is there a way I can get something like this?
You could use groupby and a dictionary comprehension with numpy.diag:
import numpy as np
import pandas as pd

df2 = pd.DataFrame({x: np.diag(g) for x, g in df.groupby(level=0, axis=1)})
output:
A-B B-A
0 2.0 3.1
1 3.2 4.0
2 4.1 4.0
3 4.2 3.5
4 6.0 5.7
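For reference, the input frame with its duplicated column names can be rebuilt from the question's values like this:
import numpy as np
import pandas as pd

cols = ['A-B'] * 5 + ['B-A'] * 5
df = pd.DataFrame([
    [2, 3, 2, 4, 5, 3.1, 3, 2, 2.5, 2.6],
    [np.nan, 3.2, 3.3, 3.5, 5.2, np.nan, 4, 2.7, 3.2, 5],
    [np.nan, np.nan, 4.1, 4, 6, np.nan, np.nan, 4, 4.1, 6],
    [np.nan, np.nan, np.nan, 4.2, 5.1, np.nan, np.nan, np.nan, 3.5, 5.2],
    [np.nan, np.nan, np.nan, np.nan, 6, np.nan, np.nan, np.nan, np.nan, 5.7],
], columns=cols)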
Another option is to convert to long form and then drop duplicates; this can be achieved with pivot_longer from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
(
    df
    .pivot_longer(names_to=".value",
                  names_pattern=r"(.+)",
                  ignore_index=False)
    .dropna()
    .loc[lambda df: ~df.index.duplicated()]
)
A-B B-A
0 2.0 3.1
1 3.2 4.0
2 4.1 4.0
3 4.2 3.5
4 6.0 5.7
@mozway's solution should be faster though, as it avoids building a large number of rows only to prune them, which is what this option does.
Assuming that the value exists, how can I, for example, create another column "testFinal" in the dataframe containing the absolute value of df["test"] minus the value of df["test"] 0.2 seconds later?
For example, the first value of testFinal is the absolute difference between 2 and the value 0.2 seconds later, which is 8, so the result is abs(2-8) = 6.
My goal is to calculate "testFinal".
I don't know if it's clear, so here is the example.
NB: the Timestamp is not homogeneous, so the interval between two values can differ over time.
Thanks a lot
Here is the code for the dataframe
import numpy as np
import pandas as pd

df = pd.DataFrame({'Timestamp': [11.1, 11.2, 11.3, 11.4, 11.5, 11.6, 11.7, 11.8, 11.9, 12.0, 12.1],
                   'test': [2, 22, 8, 4, 5, 4, 5, 3, 54, 23, 89],
                   'testFinal': [6, 18, 3, 0, 0, 1, 49, 20, 35, np.nan, np.nan]})
First, create a temporary column temp by converting the Timestamp column to timedelta with pd.to_timedelta, and set it as the dataframe index. Then create a new column testFinal holding the values of this index plus 0.2 seconds. Using Series.map, map testFinal to the values of the df['test'] column, so that testFinal now holds the value of test 0.2 s later. Finally, subtract the test column and take the absolute value:
df['temp'] = pd.to_timedelta(df['Timestamp'], unit='s')  # seconds -> timedelta
df = df.set_index('temp')
df['testFinal'] = df.index + pd.Timedelta(seconds=0.2)   # lookup time for each row
df['testFinal'] = df['testFinal'].map(df['test']).sub(df['test']).abs()
df = df.reset_index(drop=True)
# print(df)
Timestamp test testFinal
0 11.1 2 6.0
1 11.2 22 18.0
2 11.3 8 3.0
3 11.4 4 0.0
4 11.5 5 0.0
5 11.6 4 1.0
6 11.7 5 49.0
7 11.8 3 20.0
8 11.9 54 35.0
9 12.0 23 NaN
10 12.1 89 NaN
You could use numpy as follows; I created a new column test_final to compare with the expected testFinal column. Note that this shifts by exactly two rows, so it relies on the samples here being 0.1 s apart; it does not handle irregular intervals.
import numpy as np
test = df.test.values
df['test_final'] = np.abs(test - np.concatenate((test[2:], np.array([np.nan]*2)), axis=0))
print(df)
Output:
Timestamp test testFinal test_final
0 11.1 2 6.0 6.0
1 11.2 22 18.0 18.0
2 11.3 8 3.0 3.0
3 11.4 4 0.0 0.0
4 11.5 5 0.0 0.0
5 11.6 4 1.0 1.0
6 11.7 5 49.0 49.0
7 11.8 3 20.0 20.0
8 11.9 54 35.0 35.0
9 12.0 23 NaN NaN
10 12.1 89 NaN NaN
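Under the same fixed-interval assumption, the two-row shift can also be written in pandas, without the numpy concatenation:
df['test_final'] = (df['test'] - df['test'].shift(-2)).abs()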
I have a dataframe that contains values for individual days:
day value
1 10.1
2 15.4
3 12.1
4 14.1
5 -9.7
6 2.0
8 3.4
There is not necessarily a value for each day (day 7 is missing in my example), but there is never more than one value per day.
I want to add additional columns to this dataframe, containing per row the value of the day before, the value of two days ago, the value of three days ago etc. The result would be:
day value value-of-1 value-of-2 value-of-3
1 10.1 NaN NaN NaN
2 15.4 10.1 NaN NaN
3 12.1 15.4 10.1 NaN
4 14.1 12.1 15.4 10.1
5 -9.7 14.1 12.1 15.4
6 2.0 -9.7 14.1 12.1
8 3.4 NaN 2.0 -9.7
At the moment, I add to the original dataframe a column containing the required day and then merge the original dataframe with itself, using this new column as the join condition. After some reorganizing of the columns, I get my result:
data = [[1, 10.1], [2, 15.4], [3, 12.1], [4, 14.1], [5, -9.7], [6, 2.0], [8, 3.4]]
df = pd.DataFrame(data, columns = ['day', 'value'])
def add_column_for_prev_day(df, day):
    df[f"day-{day}"] = df["day"] - day
    df = df.merge(df[["day", "value"]], how="left", left_on=f"day-{day}", right_on="day", suffixes=("", "_r")) \
           .drop(["day_r", f"day-{day}"], axis=1) \
           .rename({"value_r": f"value-of-{day}"}, axis=1)
    return df
df = add_column_for_prev_day(df, 1)
df = add_column_for_prev_day(df, 2)
df = add_column_for_prev_day(df, 3)
I wonder if there is a better and faster way to get the same result, especially without having to merge the dataframe over and over again.
A simple shift does not help as there are days without data.
You can use:
# fill the gaps so that one row corresponds to one day, making shift a per-day shift
m = df.set_index('day').reindex(range(df['day'].min(), df['day'].max() + 1))
for i in [1, 2, 3]:
    m[f"value_of_{i}"] = m['value'].shift(i)
# keep only the days that actually exist in the original data
m = m.reindex(df.day).reset_index()
day value value_of_1 value_of_2 value_of_3
0 1 10.1 NaN NaN NaN
1 2 15.4 10.1 NaN NaN
2 3 12.1 15.4 10.1 NaN
3 4 14.1 12.1 15.4 10.1
4 5 -9.7 14.1 12.1 15.4
5 6 2.0 -9.7 14.1 12.1
6 8 3.4 NaN 2.0 -9.7
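A minimal alternative sketch using a Series.map lookup instead of repeated merges (this also keeps the question's value-of-N column names):
s = df.set_index('day')['value']                 # day -> value lookup table
for i in [1, 2, 3]:
    df[f'value-of-{i}'] = (df['day'] - i).map(s)  # missing days map to NaN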
I have a data set in which the number of columns is a multiple of 3 (excluding the index column [0]).
I am new to Python.
Here there are 9 columns excluding the index. I want to append the 4th column to the 1st, the 5th to the 2nd, the 6th to the 3rd, then the 7th to the 1st, the 8th to the 2nd, the 9th to the 3rd, and so on for a large data set. My large data set will always have a multiple of 3 columns (excluding the index column).
Also, I want the index values to repeat in the same order; in this case 6, 9, 4, 3 repeats 3 times.
import pandas as pd
import io
data = io.StringIO("""
6,5.6,4.6,8.2,2.5,9.4,7.6,9.3,4.1,1.9
9,2.3,7.8,1,4.8,6.7,8.4,45.2,8.9,1.5
4,4.8,9.1,0,7.1,5.6,3.6,63.7,7.6,4
3,9.4,10.6,7.5,1.5,4.3,14.3,36.1,6.3,0
""")
df = pd.read_csv(data, index_col=[0], header=None)
Expected Output:
df
6,5.6,4.6,8.2
9,2.3,7.8,1
4,4.8,9.1,0
3,9.4,10.6,7.5
6,2.5,9.4,7.6
9,4.8,6.7,8.4
4,7.1,5.6,3.6
3,1.5,4.3,14.3
6,9.3,4.1,1.9
9,45.2,8.9,1.5
4,63.7,7.6,4
3,36.1,6.3,0
The idea is to reshape with stack, sorting by the second level of the MultiIndex; for correct row ordering, first create an ordered CategoricalIndex:
import numpy as np

a = np.arange(len(df.columns))
# make the index ordered-categorical so sorting preserves the original row order 6, 9, 4, 3
df.index = pd.CategoricalIndex(df.index, ordered=True, categories=df.index.unique())
# label the columns as (chunk number, position within chunk)
df.columns = [a // 3, a % 3]
# stack the chunk level into rows, then sort chunk by chunk
df = df.stack(0).sort_index(level=1).reset_index(level=1, drop=True)
print(df)
0 1 2
0
6 5.6 4.6 8.2
9 2.3 7.8 1.0
4 4.8 9.1 0.0
3 9.4 10.6 7.5
6 2.5 9.4 7.6
9 4.8 6.7 8.4
4 7.1 5.6 3.6
3 1.5 4.3 14.3
6 9.3 4.1 1.9
9 45.2 8.9 1.5
4 63.7 7.6 4.0
3 36.1 6.3 0.0
Split the data frame horizontally and concatenate the components vertically:
# give every 3-column chunk the same column names so that concat aligns them vertically
df.columns = [1, 2, 3] * (len(df.columns) // 3)
rslt = pd.concat([df.iloc[:, i:i+3] for i in range(0, len(df.columns), 3)])
1 2 3
0
6 5.6 4.6 8.2
9 2.3 7.8 1.0
4 4.8 9.1 0.0
3 9.4 10.6 7.5
6 2.5 9.4 7.6
9 4.8 6.7 8.4
4 7.1 5.6 3.6
3 1.5 4.3 14.3
6 9.3 4.1 1.9
9 45.2 8.9 1.5
4 63.7 7.6 4.0
3 36.1 6.3 0.0
I have two columns with data which overlap for some entries (and are almost similar when they do).
df = pd.DataFrame(
    {'x': [2.1, 3.1, 5.4, 1.9, np.nan, 4.3, np.nan, np.nan, np.nan],
     'y': [np.nan, np.nan, 5.3, 1.9, 3.2, 4.2, 9.1, 7.8, 4.1]}
)
I want the result to be a column 'xy' which contains the average of x and y when both have values, and x or y when only one of them has a value, like this:
df['xy']=[2.1,3.1,5.35,1.9,3.2,4.25,9.1,7.8,4.1]
Here you go:
Solution
df['xy'] = df[['x','y']].mean(axis=1)
This works because mean skips NaN by default (skipna=True), so rows where only one column has a value pass through unchanged, while rows with both values get the average.
Output
print(df.to_string())
x y xy
0 2.1 NaN 2.10
1 3.1 NaN 3.10
2 5.4 5.3 5.35
3 1.9 1.9 1.90
4 NaN 3.2 3.20
5 4.3 4.2 4.25
6 NaN 9.1 9.10
7 NaN 7.8 7.80
8 NaN 4.1 4.10