I have two columns whose data overlap for some entries (and the values are nearly identical when they do).
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'x': [2.1, 3.1, 5.4, 1.9, np.nan, 4.3, np.nan, np.nan, np.nan],
     'y': [np.nan, np.nan, 5.3, 1.9, 3.2, 4.2, 9.1, 7.8, 4.1]}
)
I want the result to be a column 'xy' that contains the average of x and y when both have values, and x or y when only one of them has a value, like this:
df['xy'] = [2.1, 3.1, 5.35, 1.9, 3.2, 4.25, 9.1, 7.8, 4.1]
Here you go:
Solution
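# mean(axis=1) skips NaN by default (skipna=True), so a row with only x or only y keeps that value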
df['xy'] = df[['x','y']].mean(axis=1)
Output
print(df.to_string())
x y xy
0 2.1 NaN 2.10
1 3.1 NaN 3.10
2 5.4 5.3 5.35
3 1.9 1.9 1.90
4 NaN 3.2 3.20
5 4.3 4.2 4.25
6 NaN 9.1 9.10
7 NaN 7.8 7.80
8 NaN 4.1 4.10
Related
I happen to have a dataset that looks like this:
A-B A-B A-B A-B A-B B-A B-A B-A B-A B-A
2 3 2 4 5 3.1 3 2 2.5 2.6
NaN 3.2 3.3 3.5 5.2 NaN 4 2.7 3.2 5
NaN NaN 4.1 4 6 NaN NaN 4 4.1 6
NaN NaN NaN 4.2 5.1 NaN NaN NaN 3.5 5.2
NaN NaN NaN NaN 6 NaN NaN NaN NaN 5.7
It's very bad, I know. But what I would like to obtain is:
A-B B-A
2 3.1
3.2 4
4.1 4
4.2 3.5
6 5.7
These are the values on the "diagonals".
Is there a way I can get something like this?
You could use groupby and a dictionary comprehension with numpy.diag:
df2 = pd.DataFrame({x: np.diag(g) for x, g in df.groupby(level=0, axis=1)})
Output:
A-B B-A
0 2.0 3.1
1 3.2 4.0
2 4.1 4.0
3 4.2 3.5
4 6.0 5.7
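Note that grouping on axis=1 is deprecated in recent pandas versions (2.1+). As a sketch of an alternative under that assumption, you can rely on the fact that selecting a duplicated column label returns every matching column, so numpy.diag can be applied per label:

# df[x] with a duplicated label x returns all matching columns as a square block;
# np.diag then takes that block's diagonal
df2 = pd.DataFrame({x: np.diag(df[x]) for x in df.columns.unique()})

The output is the same as above.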
Another option is to convert to long form, and then drop duplicates: this can be achieved with pivot_longer from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
(
    df
    .pivot_longer(names_to=".value",
                  names_pattern=r"(.+)",
                  ignore_index=False)
    .dropna()
    .loc[lambda df: ~df.index.duplicated()]
)
A-B B-A
0 2.0 3.1
1 3.2 4.0
2 4.1 4.0
3 4.2 3.5
4 6.0 5.7
@mozway's solution should be faster though, as it avoids building a large number of rows only to prune them, which is what this option does.
I have a data set in which the columns are in multiples of 3 (excluding index column[0]).
I am new to python.
Here there are 9 columns excluding the index. I want to append the 4th column to the 1st, the 5th to the 2nd, the 6th to the 3rd, then the 7th to the 1st, the 8th to the 2nd, the 9th to the 3rd, and so on for a large data set. My data set will always have columns in multiples of 3 (excluding the index column).
Also I want the index values to repeat in the same order; in this case 6, 9, 4, 3 repeated 3 times.
import pandas as pd
import io
data =io.StringIO("""
6,5.6,4.6,8.2,2.5,9.4,7.6,9.3,4.1,1.9
9,2.3,7.8,1,4.8,6.7,8.4,45.2,8.9,1.5
4,4.8,9.1,0,7.1,5.6,3.6,63.7,7.6,4
3,9.4,10.6,7.5,1.5,4.3,14.3,36.1,6.3,0
""")
df = pd.read_csv(data,index_col=[0],header = None)
Expected Output:
df
6,5.6,4.6,8.2
9,2.3,7.8,1
4,4.8,9.1,0
3,9.4,10.6,7.5
6,2.5,9.4,7.6
9,4.8,6.7,8.4
4,7.1,5.6,3.6
3,1.5,4.3,14.3
6,9.3,4.1,1.9
9,45.2,8.9,1.5
4,63.7,7.6,4
3,36.1,6.3,0
The idea is to reshape with stack, sorting by the second level of the MultiIndex; to keep the original row order through the sort, first create an ordered CategoricalIndex:
import numpy as np

a = np.arange(len(df.columns))
# ordered categories keep the original row order (6, 9, 4, 3) through the later sort
df.index = pd.CategoricalIndex(df.index, ordered=True, categories=df.index.unique())
# first level: which block of 3 a column belongs to; second level: position within the block
df.columns = [a // 3, a % 3]
df = df.stack(0).sort_index(level=1).reset_index(level=1, drop=True)
print(df)
0 1 2
0
6 5.6 4.6 8.2
9 2.3 7.8 1.0
4 4.8 9.1 0.0
3 9.4 10.6 7.5
6 2.5 9.4 7.6
9 4.8 6.7 8.4
4 7.1 5.6 3.6
3 1.5 4.3 14.3
6 9.3 4.1 1.9
9 45.2 8.9 1.5
4 63.7 7.6 4.0
3 36.1 6.3 0.0
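If you prefer to avoid the MultiIndex machinery, here is a minimal NumPy-only sketch of the same reshape, starting from the original df as read and assuming the column count is an exact multiple of 3, as the question states:

import numpy as np

n_blocks = df.shape[1] // 3
# (rows, blocks, 3) -> (blocks, rows, 3) -> stack the blocks vertically
out = pd.DataFrame(
    df.to_numpy().reshape(len(df), n_blocks, 3).transpose(1, 0, 2).reshape(-1, 3),
    index=np.tile(df.index, n_blocks),
)

The index repeats 6, 9, 4, 3 once per block, as requested.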
Split the data frame horizontally and concatenate the pieces vertically; giving every three-column slice the same column labels first is what lets pd.concat stack them:
df.columns = [1, 2, 3] * (len(df.columns) // 3)
rslt = pd.concat([df.iloc[:, i:i+3] for i in range(0, len(df.columns), 3)])
1 2 3
0
6 5.6 4.6 8.2
9 2.3 7.8 1.0
4 4.8 9.1 0.0
3 9.4 10.6 7.5
6 2.5 9.4 7.6
9 4.8 6.7 8.4
4 7.1 5.6 3.6
3 1.5 4.3 14.3
6 9.3 4.1 1.9
9 45.2 8.9 1.5
4 63.7 7.6 4.0
3 36.1 6.3 0.0
I have two .csv files, "train_id.csv" and "train_ub.csv", that I want to load as pandas dataframes. Their dimensions are different, but they have only one column in common, let's say:
train_id:
ID id_01 id_02 id_03 id_04
1 0.0 1.0 5.2 7.1
2 0.5 7.7 3.4 9.8
5 1.5 0.8 1.6 2.5
7 3.0 0.2 3.4 6.3
8 5.5 1.8 7.5 7.0
9 7.2 2.6 9.1 1.1
11 9.5 3.5 2.2 0.3
while train_ub:
ID ub_01 ub_02 ub_03 ub_04
1 0.0 1.0 9.2 8.3
2 1.5 2.7 0.4 4.9
3 2.7 4.8 7.6 3.7
4 4.8 9.2 2.4 5.4
6 6.0 5.8 5.5 0.6
10 9.1 3.6 4.1 2.0
11 7.3 7.5 0.2 9.5
One can see that they share the first column, but each dataframe has IDs the other is missing. Is there a way in pandas to merge them column-wise to get a dataframe of the form:
ID id_01 id_02 id_03 id_04 ub_01 ub_02 ub_03 ub_04
1 0.0 1.0 5.2 7.1 0.0 1.0 9.2 8.3
2 0.5 7.7 3.4 9.8 1.5 2.7 0.4 4.9
3 NaN NaN NaN NaN 2.7 4.8 7.6 3.7
4 NaN NaN NaN NaN 4.8 9.2 2.4 5.4
5 1.5 0.8 1.6 2.5 NaN NaN NaN NaN
6 NaN NaN NaN NaN 6.0 5.8 5.5 0.6
7 3.0 0.2 3.4 6.3 NaN NaN NaN NaN
8 5.5 1.8 7.5 7.0 NaN NaN NaN NaN
9 7.2 2.6 9.1 1.1 NaN NaN NaN NaN
10 NaN NaN NaN NaN 9.1 3.6 4.1 2.0
11 9.5 3.5 2.2 0.3 9.5 3.5 2.2 0.3
PS: Notice that this is an oversimplified example; the real databases have shapes id(144233, 41) and ub(590540, 394).
You could accomplish this using an outer join. Here is the code for it:
import pandas as pd

train_id = pd.read_csv("train_id.csv")
train_ub = pd.read_csv("train_ub.csv")
train_merged = train_id.merge(train_ub, on=["ID"], how="outer")
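Given the real shapes mentioned in the question, it can also help to see where each row of the result came from. A small sketch using merge's indicator flag ("_merge" is pandas' default name for the indicator column):

train_merged = train_id.merge(train_ub, on=["ID"], how="outer", indicator=True)
# _merge is 'left_only', 'right_only', or 'both' for each ID
print(train_merged["_merge"].value_counts())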
I read from a CSV file. My dataframe contains strings that are really floats, and there are also NaN values.
Basically I want to replace the NaNs with the column mean and convert the strings into floats.
There are methods that could help, like fillna, which replaces NaN values, but for that I can't get the mean (because the values are strings).
There is also the float() method, but if it's applied to NaN it will give 0, which is not good for me.
Is there any good way to replace the NaN values with the mean and convert the strings into floats?
Example of dataframe:
1 9,5 50,6 45,75962845 2,6 6,5 11 8,9 NaN
2 10,5 59,9 74,44538987 0 4,5 8,9 NaN NaN
3 20,1 37,7 NaN 0,8 2,5 9,7 6,7 4,2
4 10,7 45,2 10,9710853 0,4 3,1 6,9 5,5 4,7
5 13,2 39,9 9,23393302 0 5,8 9,2 7,4 4,3
P.S. As A. Leistra proposed, I used
for col in df.columns:
    df[col] = pd.to_numeric(df[col], errors='coerce')
    df[col].fillna(df[col].mean())
to_numeric with errors='coerce' creates a lot of new NaNs. The errors='ignore' parameter seems better, but it gives TypeError: Can't convert 'int' object to str implicitly on the line df[col].fillna(df[col].mean()).
P.S.2 As piRSquared advised, I tried adding decimal=',' to the read_csv call, but it still gives the same error: TypeError: Can't convert 'int' object to str implicitly.
You should have read in the data using a decimal=',' argument if you used pd.read_csv. Otherwise, if you're stuck with this data frame, you can dump it out to a csv and try again.
from io import StringIO

pd.read_csv(StringIO(df.to_csv(index=False)), decimal=',')
0 1 2 3 4 5 6 7 8
0 1 9.5 50.6 45.759628 2.6 6.5 11.0 8.9 NaN
1 2 10.5 59.9 74.445390 0.0 4.5 8.9 NaN NaN
2 3 20.1 37.7 NaN 0.8 2.5 9.7 6.7 4.2
3 4 10.7 45.2 10.971085 0.4 3.1 6.9 5.5 4.7
4 5 13.2 39.9 9.233933 0.0 5.8 9.2 7.4 4.3
Filling in missing data becomes easy.
d = pd.read_csv(StringIO(df.to_csv(index=False)), decimal=',')
d.fillna(d.mean())
0 1 2 3 4 5 6 7 8
0 1 9.5 50.6 45.759628 2.6 6.5 11.0 8.900 4.4
1 2 10.5 59.9 74.445390 0.0 4.5 8.9 7.125 4.4
2 3 20.1 37.7 35.102509 0.8 2.5 9.7 6.700 4.2
3 4 10.7 45.2 10.971085 0.4 3.1 6.9 5.500 4.7
4 5 13.2 39.9 9.233933 0.0 5.8 9.2 7.400 4.3
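If you still have the original file, the round trip is unnecessary; a minimal sketch, assuming a headerless file with the hypothetical name "data.csv":

df = pd.read_csv("data.csv", decimal=',', header=None)  # parses '9,5' as 9.5 directly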
First you need to convert the strings to floats using to_numeric:
for col in df.columns:
    df[col] = pd.to_numeric(df[col], errors='coerce')
(Using 'coerce' replaces non-convertible values with NaN, which is what you want here.) Then you will be able to use fillna:
df.fillna(df.mean())
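As the P.S. in the question shows, errors='coerce' also wipes out values like '9,5' that merely use a comma as the decimal separator. A hedged sketch that normalizes the separator first (assuming ',' only ever appears as a decimal mark):

for col in df.columns:
    # turn '9,5' into '9.5' before the numeric conversion
    df[col] = pd.to_numeric(df[col].astype(str).str.replace(',', '.', regex=False),
                            errors='coerce')
df = df.fillna(df.mean())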
I'm looking to transpose a column within a dataframe so that it becomes a row, while using another column as the index. Specifically, I need all ColB values where ColA == '1' to become the values for RowA, and all the ColB values where ColA == '2' to become the values for RowB.
i.e. I need to turn:
index ColA ColB
0 1.0 1.1
1 1.0 12.2
2 1.0 4.5
3 2.0 5.1
4 2.0 7.7
5 2.0 9.5
into ...
ColB
0 1 2
ColA
1.0 1.1 12.2 4.5
2.0 5.1 7.7 9.5
------ Update #1 --------
In reference to the answer provided by @Scott_Boston:
df.groupby('ColA').apply(lambda x: x.reset_index().ColB)
seems to give me:
ColA
1.0 0 1.1
1 12.2
2 4.5
2.0 0 5.1
1 7.7
2 9.5
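Alternatively, turn each group's ColB into a list, expand the lists into columns, and label the column axis: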
df.groupby('ColA').ColB.apply(list).apply(pd.Series).rename_axis('ColB', axis=1)
Out[113]:
ColB 0 1 2
ColA
1.0 1.1 12.2 4.5
2.0 5.1 7.7 9.5
Let's use groupby, apply, and reset_index:
df.groupby('ColA').apply(lambda x: x.reset_index().ColB)
Output:
ColB 0 1 2
ColA
1.0 1.1 12.2 4.5
2.0 5.1 7.7 9.5
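For completeness, here is a sketch of the same reshape without apply: number the rows within each group with cumcount, then pivot (the column name 'pos' is just an illustrative helper):

# cumcount gives 0, 1, 2 within each ColA group; pivot spreads ColB across those positions
out = (df.assign(pos=df.groupby('ColA').cumcount())
         .pivot(index='ColA', columns='pos', values='ColB'))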