I am new to pandas and have a problem that I could not solve by myself.
This is my column:
0 12.000000
1 21.540659
2 19.122413
3 16.568042
4 17.082154
5 15.932148
6 15.226856
7 14.400521
8 17.900962
9 17.169741
10 NaN
and I want to shift it down by one row. The expected result should look like this:
0 NaN
1 12.000000
2 21.540659
3 19.122413
4 16.568042
5 17.082154
6 15.932148
7 15.226856
8 14.400521
9 17.900962
10 17.169741
Here is my code:
data['A']=pd.Series(a).shift(periods=1)
a is a list, and I convert it to a pandas Series to add it as a new column "A" in my dataframe. However, I need to shift the rows without losing the last value.
Wait, doesn't this work?
data["A"] = data["A"].shift(1)
IIUC,
You could pad the list with one extra NaN before creating the Series, then shift:
import numpy as np  # pd.np is deprecated and was removed in pandas 2.0
data['A'] = pd.Series(a + [np.nan]).shift(1)
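For what it's worth, here is a minimal runnable sketch of the pad-then-shift idea. The short list `a` and the empty frame `data` are made-up stand-ins for the asker's objects, and I'm assuming `data` has exactly one more row than `a`, as in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the asker's objects (not from the question).
a = [12.0, 21.5, 19.1]
data = pd.DataFrame(index=range(len(a) + 1))  # one more row than len(a)

# Pad with a trailing NaN so the Series is as long as the frame,
# then shift down by one: the NaN moves to the top and nothing is lost.
data["A"] = pd.Series(a + [np.nan]).shift(1)
print(data)
```

Without the padding, the shifted Series has only len(a) rows, so its last value is pushed off the end before the column is aligned with the frame.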
You can use np.roll for that.
Answer found here, check the link for details.
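A rough sketch of what the np.roll suggestion might look like, assuming the column already ends in NaN as in the question (the values here are shortened and made up). Note that np.roll wraps around: whatever is in the last position ends up in the first one, so this only matches the expected result when the last value is the NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical column ending in NaN, mimicking the question's data.
col = pd.Series([12.0, 21.5, 19.1, np.nan])

# np.roll moves every element one position down and wraps the
# last element (the NaN here) around to the front.
rolled = pd.Series(np.roll(col.to_numpy(), 1), index=col.index)
print(rolled)
```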
I know this is a simple question, but I couldn't find the answer anywhere: I need to compute the product of all values in a single column in Python.
Here's the dataframe:
VALUE
0 2
1 3
2 1
3 3
4 1
The output should give me 2 * 3 * 1 * 3 * 1 = 18
Try prod
df.VALUE.prod()
Out[345]: 18
To add to the previous answer you can use df.product(axis=0) as well.
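Putting both suggestions side by side on the frame from the question, as a small self-contained sketch:

```python
import pandas as pd

# The VALUE column from the question.
df = pd.DataFrame({"VALUE": [2, 3, 1, 3, 1]})

# Product of a single column (a scalar).
column_product = df["VALUE"].prod()

# Product of every column at once (a Series indexed by column name).
all_products = df.product(axis=0)

print(column_product)
print(all_products)
```

The first form returns a single number; the second returns one product per column, which is handy when the frame has several numeric columns.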
I have the following dataframe:
A,B,C,D
10,1,2,3
1,4,7,3
10,5,2,3
40,7,9,3
9,9,9,9
I would like to create another dataframe from the previous one that has only two rows. These two rows are selected based on the minimum and maximum values in column "A". I would like to get:
A,B,C,D
1,4,7,3
40,7,9,3
Do you think I should work with something like index.min and index.max, then select only those two rows and append them to a new dataframe? Do you have any other suggestions?
Thanks for any kind of help,
Best
IIUC you can simply subset the dataframe with an OR condition on df.A.min() and df.A.max():
df = df[(df.A==df.A.min())|(df.A==df.A.max())]
df
A B C D
1 1 4 7 3
3 40 7 9 3
Yes, you can use idxmin/idxmax and then use loc:
df.loc[df['A'].agg(['idxmin','idxmax']) ]
Output:
A B C D
1 1 4 7 3
3 40 7 9 3
Note that this only gives one row for the min and one for the max. If you want all rows that match the minimum or the maximum, you should use @CHRD's solution.
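To compare the two approaches, here is a small sketch on the data from the question. Instead of agg(['idxmin','idxmax']) it calls idxmin() and idxmax() directly, which should be equivalent for this purpose.

```python
import pandas as pd

# The frame from the question.
df = pd.DataFrame({
    "A": [10, 1, 10, 40, 9],
    "B": [1, 4, 5, 7, 9],
    "C": [2, 7, 2, 9, 9],
    "D": [3, 3, 3, 3, 9],
})

# Boolean mask: keeps every row whose A equals the min or the max.
mask = (df["A"] == df["A"].min()) | (df["A"] == df["A"].max())
all_matches = df[mask]

# idxmin/idxmax: exactly one row each (the first occurrence on ties).
one_each = df.loc[[df["A"].idxmin(), df["A"].idxmax()]]

print(all_matches)
print(one_each)
```

On this data the two results coincide, because the minimum and the maximum of A each occur only once.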
I have a certain condition (Incident = yes) and I want to know the values in two columns fulfilling this condition. I have a very big data frame (many rows and many columns) and I am looking for a "screening" function.
To illustrate, consider the following example df (which has many more columns than shown):
Repetition Step Incident Test1 Test2
1 1 no 10 20
1 1 no 9 11
1 2 yes 9 19
1 2 yes 11 20
1 2 yes 12 22
1 3 yes 9 18
1 3 yes 8 18
What I would like to get as an answer is
Repetition Step
1 2
1 3
If I only wanted to know the Step, I would use the following command:
df[df.Incident == 'yes'].Step.unique()
Is there a similar command to get the values of two columns for a specific condition?
Thanks for the help! :-)
You could use query for the condition, select the columns of interest, and finally remove the duplicate rows:
df.query('Incident=="yes"').filter(['Repetition','Step']).drop_duplicates()
OR
you could use pandas' loc indexer: pass the condition as the row selector and the columns you are interested in as the column selector, then drop the duplicates.
df.loc[df.Incident=="yes",['Repetition','Step']].drop_duplicates()
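Here is a self-contained sketch of both variants on the example frame from the question, in case you want to try them out:

```python
import pandas as pd

# The example frame from the question.
df = pd.DataFrame({
    "Repetition": [1, 1, 1, 1, 1, 1, 1],
    "Step": [1, 1, 2, 2, 2, 3, 3],
    "Incident": ["no", "no", "yes", "yes", "yes", "yes", "yes"],
    "Test1": [10, 9, 9, 11, 12, 9, 8],
    "Test2": [20, 11, 19, 20, 22, 18, 18],
})

# Variant 1: query + filter + drop_duplicates.
via_query = (
    df.query('Incident == "yes"')
    .filter(["Repetition", "Step"])
    .drop_duplicates()
)

# Variant 2: loc with a boolean row selector and a column list.
via_loc = df.loc[df["Incident"] == "yes", ["Repetition", "Step"]].drop_duplicates()

print(via_loc)
```

Both keep the original row labels of the first occurrence of each pair; add reset_index(drop=True) if you want a fresh 0..n index.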
What I want is this:
visit_id atc_1 atc_2 atc_3 atc_4 atc_5 atc_6 atc_7
48944282 A02AG J01CA04 J095AX02 N02BE01 R05X NaN NaN
48944305 A02AG A03AX13 N02BE01 R05X NaN NaN NaN
I don't know how many atc_1...atc_7...?atc_100 columns there will need to be in advance. I just need to gather all associated atc_codes into one row with each visit_id.
This seems like a group_by and then a pivot but I have tried many times and failed. I also tried to self-join a la SQL using pandas' merge() but that doesn't work either.
The end result is that I will paste together atc_1, atc_7, ... atc_100 to form one long atc_code. This composite atc_code will be my "Y" or "labels" column of my dataset that I am trying to predict.
Thank you!
First use cumcount to number the values within each group; these numbers become the column labels created by pivot. Then add the missing columns by reindexing, rename the columns with add_prefix, and finally call reset_index:
g = df.groupby('visit_id').cumcount() + 1
print (g)
0 1
1 2
2 3
3 4
4 5
5 1
6 2
7 3
8 4
dtype: int64
df = (df.assign(pos=g)
        .pivot(index='visit_id', columns='pos', values='atc_code')
        .reindex(columns=range(1, 8))  # reindex_axis was removed in pandas 1.0
        .add_prefix('atc_')
        .rename_axis(columns=None)
        .reset_index())
print (df)
visit_id atc_1 atc_2 atc_3 atc_4 atc_5 atc_6 atc_7
0 48944282 A02AG J01CA04 J095AX02 N02BE01 R05X NaN NaN
1 48944305 A02AG A03AX13 N02BE01 R05X None NaN NaN
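A self-contained sketch of the same idea for current pandas versions. The long-format input is not shown in the question, so the frame below is my reconstruction from the expected output (an assumption), with one row per visit_id/atc_code pair. This variant skips the reindex step, so the number of atc_ columns is simply whatever the longest visit needs.

```python
import pandas as pd

# Assumed long-format input, reconstructed from the expected output.
long_df = pd.DataFrame({
    "visit_id": [48944282] * 5 + [48944305] * 4,
    "atc_code": ["A02AG", "J01CA04", "J095AX02", "N02BE01", "R05X",
                 "A02AG", "A03AX13", "N02BE01", "R05X"],
})

# Position of each code within its visit: 1, 2, 3, ...
pos = long_df.groupby("visit_id").cumcount() + 1

wide = (
    long_df.assign(pos=pos)
    .pivot(index="visit_id", columns="pos", values="atc_code")
    .add_prefix("atc_")          # 1 -> atc_1, 2 -> atc_2, ...
    .rename_axis(columns=None)   # drop the leftover "pos" axis name
    .reset_index()
)
print(wide)
```

Visits with fewer codes than the longest one get NaN in the trailing columns.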
Here is an example of data I'm working on. (as a pandas df)
index inv Rev_stream Bill_type Net_rev
1 1 A Original -24.77
2 1 B Original -24.77
3 2 A Original -409.33
4 2 B Original -409.33
5 2 C Original -409.33
6 2 D Original -409.33
7 3 A Original -843.11
8 3 A Rebill 279.5
9 3 B Original -843.11
10 4 A Rebill 279.5
11 4 B Original -843.11
12 5 B Rebill 279.5
How could I filter this df to get only the rows where the inv/Rev_stream combination has both an Original and a Rebill Bill_type? In the example above, that would be only the rows with index 7 and 8.
Is there an easy way to do it, without iterating over the whole dataframe and building dictionaries of invoice+RevStream : Bill_type?
What I'm looking for is some kind of
df = df[df[['inv','Rev_stream']]['Bill_type'].unique().len() == 2]
Unfortunately the code above doesn't work.
Thanks in advance.
You can group your data by inv and Rev_stream columns and then check for each group if both Original and Rebill are in the Bill_type values and filter based on the condition:
(df.groupby(['inv', 'Rev_stream'])
.filter(lambda g: 'Original' in g.Bill_type.values and 'Rebill' in g.Bill_type.values))
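A runnable sketch of this on the example data from the question. It uses a set comparison inside the filter instead of two `in` checks, which should behave the same way here.

```python
import pandas as pd

# The example data from the question, with its 1-based index.
df = pd.DataFrame(
    {
        "inv": [1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5],
        "Rev_stream": ["A", "B", "A", "B", "C", "D", "A", "A", "B", "A", "B", "B"],
        "Bill_type": ["Original"] * 7 + ["Rebill", "Original", "Rebill",
                                         "Original", "Rebill"],
        "Net_rev": [-24.77, -24.77, -409.33, -409.33, -409.33, -409.33,
                    -843.11, 279.5, -843.11, 279.5, -843.11, 279.5],
    },
    index=range(1, 13),
)

required = {"Original", "Rebill"}

# Keep only the groups whose Bill_type values include both required kinds.
result = df.groupby(["inv", "Rev_stream"]).filter(
    lambda g: required <= set(g["Bill_type"])
)
print(result)
```

filter returns the surviving rows with their original index and order, so no dictionaries or explicit loops are needed.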