I want to ask about a data-cleansing task for which I suppose Python may be more efficient. The data has a lot of misplaced columns, and I have to use characteristics of certain columns to move values to their correct positions. Below is an example in Stata code:
forvalues i = 20(-1)2 {
    local j = `i' + 25
    local k = `j' - 2
    replace v`j' = v`k' if substr(v23, 1, 4) == "1980"
}
That is, if the observation in column v23 starts with "1980", I shift the contents of columns v25-v43 over by 2 columns. Otherwise, the columns are already correct.
Any help is appreciated.
The following is a simplified example to show it works:
In [65]:
# create some dummy data
import pandas as pd
import io
pd.set_option('display.notebook_repr_html', False)
temp = """v21 v22 v23 v24 v25 v28
1 1 19801923 1 5 8
1 1 20003 1 5 8
1 1 9129389 1 5 8
1 1 1980 1 5 8
1 1 1923 2 5 8
1 1 9128983 1 5 8"""
df = pd.read_csv(io.StringIO(temp), sep=r'\s+')
df
Out[65]:
v21 v22 v23 v24 v25 v28
0 1 1 19801923 1 5 8
1 1 1 20003 1 5 8
2 1 1 9129389 1 5 8
3 1 1 1980 1 5 8
4 1 1 1923 2 5 8
5 1 1 9128983 1 5 8
In [68]:
# I have to convert my data to a string in order for this to work; it may
# not be necessary for you, in which case the commented-out line below would work:
#df.v23.str.startswith('1980')
df.v23.astype(str).str.startswith('1980')
Out[68]:
0 True
1 False
2 False
3 True
4 False
5 False
Name: v23, dtype: bool
In [70]:
# now we can call shift by 2 along the column axis to assign the values back
df.loc[df.v23.astype(str).str.startswith('1980'), ['v25','v28']] = df.shift(2, axis=1)
df
Out[70]:
v21 v22 v23 v24 v25 v28
0 1 1 19801923 1 19801923 1
1 1 1 20003 1 5 8
2 1 1 9129389 1 5 8
3 1 1 1980 1 1980 1
4 1 1 1923 2 5 8
5 1 1 9128983 1 5 8
So what you need to do is define the list of columns up front:
In [72]:
target_cols = ['v' + str(x) for x in range(25,44)]
print(target_cols)
['v25', 'v26', 'v27', 'v28', 'v29', 'v30', 'v31', 'v32', 'v33', 'v34', 'v35', 'v36', 'v37', 'v38', 'v39', 'v40', 'v41', 'v42', 'v43']
Now substitute this back into my method and I believe it should work:
df.loc[df.v23.astype(str).str.startswith('1980'), target_cols] = df.shift(2, axis=1)
See the documentation for shift to understand the parameters.
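Putting the pieces together, a self-contained sketch of the whole approach on the toy data (with target_cols shortened to the two relevant columns present here):

```python
import io
import pandas as pd

temp = """v21 v22 v23 v24 v25 v28
1 1 19801923 1 5 8
1 1 20003 1 5 8
1 1 1980 1 5 8"""
df = pd.read_csv(io.StringIO(temp), sep=r'\s+')

mask = df['v23'].astype(str).str.startswith('1980')
target_cols = ['v25', 'v28']  # in the real data: ['v' + str(x) for x in range(25, 44)]
# shift the whole frame two columns to the right, then write only the
# masked rows / target columns back; .loc aligns on both index and columns
df.loc[mask, target_cols] = df.shift(2, axis=1)
```

Because the assignment aligns on column labels, only the shifted values for the target columns land in the frame; the untouched rows keep their original values.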
My input dataframe:
MinA MinB MaxA MaxB
0 1 2 5 7
1 1 0 8 6
2 2 15 15
3 3
4 10
I want to merge the "min" columns and the "max" columns amongst themselves, with priority (the A columns take priority over the B columns).
If both columns are null, they should get default values: 0 for min and 100 for max.
The desired output is:
MinA MinB MaxA MaxB Min Max
0 1 2 5 7 1 5
1 1 0 8 6 1 8
2 2 15 15 2 15
3 3 3 100
4 10 0 10
Could you please help me with this?
This can be accomplished using mask. With your data that would look like the following:
df = pd.DataFrame({
'MinA': [1,1,2,None,None],
'MinB': [2,0,None,3,None],
'MaxA': [5,8,15,None,None],
'MaxB': [7,6,15,None,10],
})
# Create the new column using A as the base; where it is NaN, use B.
# Then do the same again with the specified defaults.
df['Min'] = df['MinA'].mask(pd.isna, df['MinB']).mask(pd.isna, 0)
df['Max'] = df['MaxA'].mask(pd.isna, df['MaxB']).mask(pd.isna, 100)
The above would result in the desired output:
MinA MinB MaxA MaxB Min Max
0 1 2 5 7 1 5
1 1 0 8 6 1 8
2 2 NaN 15 15 2 15
3 NaN 3 NaN NaN 3 100
4 NaN NaN NaN 10 0 10
Just using fillna() will be fine:
df['Min'] = df['MinA'].fillna(df['MinB']).fillna(0)
df['Max'] = df['MaxA'].fillna(df['MaxB']).fillna(100)
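For completeness, a self-contained sketch showing that the fillna chain produces the desired output on the same data (the mask version gives identical results):

```python
import pandas as pd

df = pd.DataFrame({
    'MinA': [1, 1, 2, None, None],
    'MinB': [2, 0, None, 3, None],
    'MaxA': [5, 8, 15, None, None],
    'MaxB': [7, 6, 15, None, 10],
})
# prefer A, fall back to B, then to the defaults (0 for min, 100 for max)
df['Min'] = df['MinA'].fillna(df['MinB']).fillna(0)
df['Max'] = df['MaxA'].fillna(df['MaxB']).fillna(100)
```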
I have a pandas DataFrame (ignore the indices of the DataFrame)
Tab Ind Com Val
4 BAS 1 1 10
5 BAS 1 2 5
6 BAS 2 1 20
8 AIR 1 1 5
9 AIR 1 2 2
11 WTR 1 1 2
12 WTR 2 1 1
And a pandas series
Ind
1 1.208333
2 0.857143
dtype: float64
I want to multiply each element of the Val column of the DataFrame with the element of the series that has the same Ind value. How would I approach this? pandas.DataFrame.mul only matches on index, but I don't want to transform the DataFrame.
Looks like pandas.DataFrame.join could solve your problem:
temp = df.join(the_series,on='Ind', lsuffix='_orig')
df['ans'] = temp.Val*temp.Ind
Output
Tab Ind Com Val ans
4 BAS 1 1 10 12.083330
5 BAS 1 2 5 6.041665
6 BAS 2 1 20 17.142860
8 AIR 1 1 5 6.041665
9 AIR 1 2 2 2.416666
11 WTR 1 1 2 2.416666
12 WTR 2 1 1 0.857143
Or another way to achieve the same using a more compact syntax (thanks W-B):
df['ans'] = df.Ind.map(the_series).values * df.Val
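As a runnable sketch of the map-based version on a subset of the data above:

```python
import pandas as pd

df = pd.DataFrame({
    'Tab': ['BAS', 'BAS', 'BAS', 'WTR'],
    'Ind': [1, 1, 2, 2],
    'Com': [1, 2, 1, 1],
    'Val': [10, 5, 20, 1],
})
the_series = pd.Series({1: 1.208333, 2: 0.857143})
# map replaces each Ind value with its entry in the series (lookup by value,
# not by row label), so no join or index manipulation is needed
df['ans'] = df['Ind'].map(the_series) * df['Val']
```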
I have this groupby dataframe (I actually don't know what to call this type of table):
A B C
1 1 124284.312500
2 64472.187500
4 32048.910156
8 16527.763672
16 8841.874023
2 1 61971.035156
2 31569.882812
4 16000.071289
8 7904.339844
16 4046.967041
4 1 31769.435547
2 15804.815430
4 7917.609375
8 4081.160400
16 2034.404541
8 1 15738.752930
2 7907.003418
4 3972.494385
8 1983.464478
16 1032.913574
I want to plot a graph that has A as the x-axis, C as the y-axis, and B as separate series with a legend.
In the pandas documentation I found the kind of graph I'm trying to make, but no luck yet.
==========edited ===============
This is the original dataframe:
A B C
0 1 1 122747.722000
1 1 2 61839.731000
2 1 2 61839.762000
3 1 4 31736.405000
4 1 4 31736.559000
5 1 4 31787.312000
6 1 4 31787.833000
7 1 8 15872.596000
8 1 8 15865.406000
9 1 8 15891.001000
I have df = df.groupby(['A', 'B']).C.mean()
How can I plot the graph from this stacked table?
Thanks!
Use unstack:
df.unstack().plot()
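To see why unstack helps: after the groupby, the data is a Series with a two-level index, and unstacking moves the B level into columns, which is the wide shape that plot expects (one line per column). A minimal sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 1, 1, 2, 2, 2],
    'B': [1, 2, 4, 1, 2, 4],
    'C': [122747.7, 61839.7, 31736.4, 61971.0, 31569.9, 16000.1],
})
s = df.groupby(['A', 'B']).C.mean()  # MultiIndex Series, as in the question
wide = s.unstack()                   # B levels become columns, A stays the index
# wide.plot() now draws one line per B value, with A on the x-axis
```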
I have a multi-indexed pandas dataframe that looks like this:
Antibody Time Repeats
Akt 0 1 1.988053
2 1.855905
3 1.416557
5 1 1.143599
2 1.151358
3 1.272172
10 1 1.765615
2 1.779330
3 1.752246
20 1 1.685807
2 1.688354
3 1.614013
..... ....
0 4 2.111466
5 1.933589
6 1.336527
5 4 2.006936
5 2.040884
6 1.430818
10 4 1.398334
5 1.594028
6 1.684037
20 4 1.529750
5 1.721385
6 1.608393
(Note that I've only posted one antibody; there are many analogous entries under the Antibody index, and they all have the same format.) Despite leaving out the entries in the middle for the sake of space, you can see that I have 6 experimental repeats, but they are not organized properly. My question is: how would I get the DataFrame to aggregate all the repeats, so that the output looks something like this:
Antibody Time Repeats
Akt 0 1 1.988053
2 1.855905
3 1.416557
4 2.111466
5 1.933589
6 1.336527
5 1 1.143599
2 1.151358
3 1.272172
4 2.006936
5 2.040884
6 1.430818
10 1 1.765615
2 1.779330
3 1.752246
4 1.398334
5 1.594028
6 1.684037
20 1 1.685807
2 1.688354
3 1.614013
4 1.529750
5 1.721385
6 1.608393
..... ....
Thanks in advance
I think you need sort_index:
df = df.sort_index(level=[0,1,2])
print (df)
Antibody Time Repeats
Akt 0 1 1.988053
2 1.855905
3 1.416557
4 2.111466
5 1.933589
6 1.336527
5 1 1.143599
2 1.151358
3 1.272172
4 2.006936
5 2.040884
6 1.430818
10 1 1.765615
2 1.779330
3 1.752246
4 1.398334
5 1.594028
6 1.684037
20 1 1.685807
2 1.688354
3 1.614013
4 1.529750
5 1.721385
6 1.608393
Name: col, dtype: float64
Or you can omit the level parameter:
df = df.sort_index()
print (df)
Antibody Time Repeats
Akt 0 1 1.988053
2 1.855905
3 1.416557
4 2.111466
5 1.933589
6 1.336527
5 1 1.143599
2 1.151358
3 1.272172
4 2.006936
5 2.040884
6 1.430818
10 1 1.765615
2 1.779330
3 1.752246
4 1.398334
5 1.594028
6 1.684037
20 1 1.685807
2 1.688354
3 1.614013
4 1.529750
5 1.721385
6 1.608393
Name: col, dtype: float64
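A small sketch showing the effect on a scrambled MultiIndex Series (with hypothetical values):

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('Akt', 0, 4), ('Akt', 5, 1), ('Akt', 0, 1), ('Akt', 5, 4), ('Akt', 0, 2)],
    names=['Antibody', 'Time', 'Repeats'],
)
s = pd.Series([2.11, 1.14, 1.99, 2.01, 1.86], index=idx)  # out of order
tidy = s.sort_index()  # sorts lexicographically across all three levels
```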
I am filling dataframes from two different files. While those two files should have the same structure (though with different values), the resulting dataframes look different. When printing them I get:
a b c d
0 70402.14 70370.602112 0.533332 98
1 31362.21 31085.682726 1.912552 301
... ... ... ... ...
753919 64527.16 64510.008206 0.255541 71
753920 58077.61 58030.943621 0.835758 152
a b c d
index
0 118535.32 118480.657338 0.280282 47
1 49536.10 49372.999416 0.429902 86
... ... ... ... ...
753970 52112.95 52104.717927 0.356051 116
753971 37044.40 36915.264944 0.597472 165
So in the second dataframe there is that "index" row, which doesn't make any sense to me and causes trouble in my subsequent code. I neither wrote the code that fills the dataframes from the files nor created the files myself, so I am interested in checking whether such a row exists and how I might remove it. Does anyone have an idea?
The second dataframe's index has the name "index".
You can remove the name with
df.index.name = None
For example,
In [126]: df = pd.DataFrame(np.arange(15).reshape(5,3))
In [128]: df.index.name = 'index'
In [129]: df
Out[129]:
0 1 2
index
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
In [130]: df.index.name = None
In [131]: df
Out[131]:
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
The dataframe might have picked up the name "index" if you used reset_index and set_index like this:
In [138]: df.reset_index()
Out[138]:
index 0 1 2
0 0 0 1 2
1 1 3 4 5
2 2 6 7 8
3 3 9 10 11
4 4 12 13 14
In [140]: df.reset_index().set_index('index')
Out[140]:
0 1 2
index
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
The index is just the first column: it numbers the rows by default, but you can change it in a number of ways (e.g. by filling it with values from one of the columns).
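To detect and strip the name programmatically, a minimal sketch (rename_axis(None) is the copy-returning equivalent of setting df.index.name = None):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(15).reshape(5, 3))
df.index.name = 'index'        # reproduce the problem
if df.index.name is not None:  # check whether the index carries a name
    df = df.rename_axis(None)  # drop the name; returns a new frame
```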