Select series based on index in a DataFrame in python

I have a pd.Series like
myS = pd.Series(np.arange(1,11,1))
I also have a pd.DataFrame like
mydf = pd.DataFrame([[1,2,3],[7,8,9]])
I would like to look up each value of mydf in the index of myS, and have the results stored in a dataframe with the same shape as mydf.
So the desired resulting dataframe is pd.DataFrame([[2,3,4],[8,9,10]])
What is the best way to achieve this?

Using replace
yourdf=mydf.replace(myS)
yourdf
Out[174]:
0 1 2
0 2 3 4
1 8 9 10
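If you prefer a more explicit lookup, a minimal alternative sketch (plain pandas/NumPy only, not the approach of the answer above, and the variable name result is just illustrative) is to reindex myS by the flattened values of mydf and restore mydf's shape:

import numpy as np
import pandas as pd

myS = pd.Series(np.arange(1, 11, 1))
mydf = pd.DataFrame([[1, 2, 3], [7, 8, 9]])

# Look up every cell of mydf in the index of myS, then restore mydf's shape.
result = pd.DataFrame(
    myS.reindex(mydf.values.ravel()).to_numpy().reshape(mydf.shape),
    index=mydf.index,
    columns=mydf.columns,
)
print(result)

Both give the same frame; replace is shorter, while the reindex version makes the index lookup explicit.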

Related

Calculate quantile for each observation in a dataframe

I am new to Python and I have the following dataframe structure:
data = {'name': ["a","b","c","d","e","f","g","h"], 'value1': [1,2,3,4,5,6,7,8],'value2': [1,2,3,4,5,6,7,8]}
data = pd.DataFrame.from_dict(data)
data = data.transpose()
What I want to calculate is a new dataframe where each entry is the quantile of that value within its column.
In other words, I am trying to understand how to apply DataFrame.quantile to return a dataframe with each entry equal to the quantile value of the column in that row.
I tried the following, but I don't think it works:
data.quantile(q=0.9, axis=0)
or:
data.apply(quantile, axis=0)
Many thanks in advance.
This is because you transpose your data. As per the pandas documentation here https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.transpose.html:
When the DataFrame has mixed dtypes, we get a transposed DataFrame
with the object dtype
Your dataframe after loading looks like the one below, which means it has mixed dtypes (one column is object and the other two are integers).
name value1 value2
0 a 1 1
1 b 2 2
2 c 3 3
3 d 4 4
4 e 5 5
5 f 6 6
6 g 7 7
7 h 8 8
When you transpose, everything is converted to the object dtype, which means the quantile function no longer treats the values as numbers.
Try removing the transposing step and use the axis argument to decide in which direction you want to calculate quantiles.
By the way, you can do transposition with:
df = df.T
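A minimal sketch of that suggestion, using the data from the question (and, assuming the goal is a per-entry percentile rank with the same shape as the input, rank(pct=True) is one way to get it):

import pandas as pd

data = pd.DataFrame({'name': ["a", "b", "c", "d", "e", "f", "g", "h"],
                     'value1': [1, 2, 3, 4, 5, 6, 7, 8],
                     'value2': [1, 2, 3, 4, 5, 6, 7, 8]})

# Without the transpose the numeric columns keep their dtype, so quantile works.
print(data[['value1', 'value2']].quantile(q=0.9, axis=0))

# If the goal is the quantile of each entry within its own column,
# rank(pct=True) returns a dataframe of the same shape.
print(data[['value1', 'value2']].rank(pct=True))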

Erro using groupby with a transform in a column (pandas)

Using groupby in a df to transform a column is giving me a different result.
I have a df where column data is a float column and group holds the categories of my data.
I want to transform my data column, stratified by category, with .pct_change().rolling(2).sum(),
like this:
df[df.group == 'D'].data.pct_change().rolling(2).sum()
That gives me:
data
0 NaN
2 NaN
3 -0.604782
5 -0.298356
6 1.036227
8 -0.008092
9 396.681408
16 397.087583
17 -0.427873
23 0.253185
29 0.040770
But when I try to automate things using groupby,
like this:
df_modif = pd.concat([df.groupby(by='group').data.pct_change().rolling(2).sum(), df.group], axis=1)
I get a different result.
Can someone help me understand what is going on?
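A hedged guess at the cause, sketched on made-up data (the real df is not shown in the question): only pct_change() runs inside the groupby there; rolling(2).sum() is then applied to the concatenated, ungrouped result, so a window can straddle two groups. Keeping the whole chain inside each group avoids that:

import pandas as pd

# Hypothetical data with the layout described in the question.
df = pd.DataFrame({'group': ['D', 'A', 'D', 'A', 'D', 'A'],
                   'data': [10.0, 5.0, 12.0, 6.0, 9.0, 7.0]})

# Here only pct_change() is per group; rolling(2).sum() runs over the
# ungrouped result, so a window can mix rows from different groups.
mixed = df.groupby(by='group').data.pct_change().rolling(2).sum()

# Applying the whole chain per group keeps every window inside one group.
per_group = df.groupby(by='group').data.apply(
    lambda s: s.pct_change().rolling(2).sum())

print(mixed)
print(per_group)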

selecting rows with min and max values of a defined column in pandas

I have the following dataframe:
A,B,C,D
10,1,2,3
1,4,7,3
10,5,2,3
40,7,9,3
9,9,9,9
I would like to create another dataframe, starting from the previous one, which has only two rows. The selection of these two rows is based on the minimum and maximum values in column "A". I would like to get:
A,B,C,D
1,4,7,3
40,7,9,3
Do you think I should work with something like index.min and index.max, select only those two rows, and append them to a new dataframe? Do you have any other suggestions?
Thanks for any kind of help,
Best
IIUC you can simply subset the dataframe with an OR condition on df.A.min() and df.A.max():
df = df[(df.A==df.A.min())|(df.A==df.A.max())]
df
A B C D
1 1 4 7 3
3 40 7 9 3
Yes, you can use idxmin/idxmax and then use loc:
df.loc[df['A'].agg(['idxmin', 'idxmax'])]
Output:
A B C D
1 1 4 7 3
3 40 7 9 3
Note that this only returns one row for the min and one for the max. If you want all matching rows, you should use #CHRD's solution (the boolean mask above).
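A small sketch on made-up data with a duplicated minimum (the extra rows are hypothetical, not from the question), to illustrate the difference described in that note:

import pandas as pd

df = pd.DataFrame({'A': [10, 1, 10, 40, 9, 1],
                   'B': [1, 4, 5, 7, 9, 2],
                   'C': [2, 7, 2, 9, 9, 3],
                   'D': [3, 3, 3, 3, 9, 3]})

# The boolean mask keeps every row matching the min or max of A
# (here both rows with A == 1, plus the row with A == 40).
print(df[(df.A == df.A.min()) | (df.A == df.A.max())])

# idxmin/idxmax each return only the first matching label,
# so this always keeps exactly two rows.
print(df.loc[df['A'].agg(['idxmin', 'idxmax'])])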

Transpose all rows in one column of dataframe to multiple columns based on certain conditions

I would like to convert one column of data to multiple columns in dataframe based on certain values/conditions.
Please find the code to generate the input dataframe
df1 = pd.DataFrame({'VARIABLE': ['studyid', 1, 'age_interview', 65,
                                 'Gender', '1.Male', '2.Female',
                                 'Ethnicity', '1.Chinese', '2.Indian', '3.Malay']})
That produces a single VARIABLE column which mixes the intended column names with their values.
Please note that I may not know the column names in advance, but the data usually follows this format. What I have shown above is sample data; the real data might have around 600-700 columns, arranged in the same fashion.
What I would like to do is turn the values which start with non-digits (characters) into new columns in the dataframe. It can be a new dataframe.
I attempted to write a for loop, but it failed with an error. Can you please help me achieve this outcome?
for i in range(3, len(df1)):
    # str(df1['VARIABLE'][i].contains('^\d'))
    if (df1['VARIABLE'][i].astype(str).contains('^\d') == True):
With the above loop I was trying to check whether the first character is a digit; if yes, retain it as a value (e.g. 1, 2, 3, etc.), and if it's a character (e.g. gender, ethnicity, etc.), create a new column. But I guess this is an incorrect and lengthy approach.
In the above example, the resulting columns would be studyid, age_interview, Gender and Ethnicity, with their values beneath them.
Can you please let me know if there is an elegant approach to do this?
You can use groupby to do something like:
m = ~df1['VARIABLE'].str[0].str.isdigit().fillna(True)
new_df = (pd.DataFrame(df1.groupby(m.cumsum()).VARIABLE.apply(list)
                       .values.tolist())
          .set_index(0).T)
print(new_df.rename_axis(None, axis=1))
studyid age_interview Gender Ethnicity
1 1 65 1.Male 1.Chinese
2 None None 2.Female 2.Indian
3 None None None 3.Malay
Explanation: m is a helper series which helps separate the groups:
print(m.cumsum())
0 1
1 1
2 2
3 2
4 3
5 3
6 3
7 4
8 4
9 4
10 4
Then we group this helper series and apply list:
df1.groupby(m.cumsum()).VARIABLE.apply(list)
VARIABLE
1 [studyid, 1]
2 [age_interview, 65]
3 [Gender, 1.Male, 2.Female]
4 [Ethnicity, 1.Chinese, 2.Indian, 3.Malay]
Name: VARIABLE, dtype: object
At this point we have each group as a list with the column name as the first entry.
So we create a dataframe from these lists, set the first column as the index, and transpose to get the desired output.
Use itertools.groupby and then construct pd.DataFrame:
import pandas as pd
import itertools
l = ['studyid', 1, 'age_interview', 65, 'Gender', '1.Male', '2.Female',
     'Ethnicity', '1.Chinese', '2.Indian', '3.Malay']
l = list(map(str, l))
grouped = [list(g) for k, g in itertools.groupby(l, key=lambda x: x[0].isnumeric())]
d = {k[0]: v for k, v in zip(grouped[::2], grouped[1::2])}
pd.DataFrame.from_dict(d, orient='index').T
Output:
Gender studyid age_interview Ethnicity
0 1.Male 1 65 1.Chinese
1 2.Female None None 2.Indian
2 None None None 3.Malay

Only getting relevant data from Pandas Dataframe

Brief background: I recently started using Pandas to read in a csv file of data. I'm able to create a dataframe by reading the csv, but now I want to do some calculations using only specific columns of the dataset.
Is there a way to create a new dataframe where I only use rows where the relevant columns are not NA or 0? For example imagine an array that looks like:
blah blah1 blah2 blah3
0 1 1 1 1
1 NA NA 1 NA
2 1 1 1 1
So say I want to do things with data under columns "blah1" and "blah2", but I want to only use rows 0 and 2 because 1 has an NA under the column "blah".
Is there a simple way of doing this? Thanks!
Edit (Clarifications):
- I don't know ahead of time that I want to drop row 1, thus I need to be able to check for a NA value (and possibly any other placeholder value beyond just whether it is null).
Yes, you can use dropna to drop the rows that contain NA:
df = df.dropna(axis=0)
and to select columns use this:
df = df[["blah1", "blah2"]]
Now df contains only cols "blah1" and "blah2" and rows 0 and 2
EDIT 1
To limit the NaN check to some columns you can use isnull(); the mask below drops only rows where all of the selected columns are null:
mask = df[["blah1", "blah2"]].isnull().all(axis=1)
df = df[~mask]
EDIT 2: to also filter out rows holding some placeholder value in a given column (B here is a stand-in for your column name):
mask = df.B == 'placeholder'
df = df[~mask]
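A small self-contained sketch tying these pieces together, built on the example frame from the question (and treating 0 as a placeholder too, since the question says "not NA or 0"):

import numpy as np
import pandas as pd

df = pd.DataFrame({'blah':  [1, np.nan, 1],
                   'blah1': [1, np.nan, 1],
                   'blah2': [1, 1, 1],
                   'blah3': [1, np.nan, 1]})

# Keep rows where 'blah' is present, then work with just the two columns of interest.
subset = df.loc[df['blah'].notna(), ['blah1', 'blah2']]

# Also drop rows holding 0 in either column, per the "not NA or 0" requirement.
subset = subset[(subset != 0).all(axis=1)]
print(subset)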
