New column in dataset based on last value of item - python

I have this dataset:
In [4]: df = pd.DataFrame({'A':[1, 2, 3, 4, 5]})
In [5]: df
Out[5]:
   A
0  1
1  2
2  3
3  4
4  5
I want to add a new column to the dataset based on the last value of the item, like this:
A   New Column
1   NaN
2   1
3   2
4   3
5   4
I tried to use apply with iloc, but it didn't work.
Can you help?
Thank you

With your shown samples, please try the following. You can use the shift function to create the new column; it moves every element of the given column down by one position, leaving a NaN in the first element.
import pandas as pd
df['New_Col'] = df['A'].shift()
OR
In case you would like to fill the NaN with zero, try the following; the approach is the same as above.
import pandas as pd
df['New_Col'] = df['A'].shift().fillna(0)
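For reference, here is the first variant run end to end on the sample frame (a minimal sketch; the printed values follow directly from how shift works):

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5]})

# shift() moves every value down one row; the first row becomes NaN,
# which upcasts the new column to float
df['New_Col'] = df['A'].shift()
print(df)
#    A  New_Col
# 0  1      NaN
# 1  2      1.0
# 2  3      2.0
# 3  4      3.0
# 4  5      4.0

The fillna(0) variant simply replaces that leading NaN with 0.0.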

Related

pandas groupby dataframes, calculate diffs between consecutive rows

Using pandas, I open some csv files in a loop and set the index to the cycleID column, except the cycleID column is not unique. See below:
for filename in all_files:
    abfdata = pd.read_csv(filename, index_col=None, header=0)
    abfdata = abfdata.set_index("cycleID", drop=False)

for index, row in abfdata.iterrows():
    print(row['cycleID'], row['mean'])
This prints the 2 columns (cycleID and mean) of the dataframe I am interested in for further computations:
1 1.5020712104685252e-11
1 6.56683605063102e-12
2 1.3993315187144084e-11
2 -8.670502467042485e-13
3 7.0270625256163566e-12
3 9.509995221868016e-12
4 1.2901435995915644e-11
4 9.513106448422182e-12
The objective is to use the rows corresponding to the same cycleID and calculate the difference between the mean column values. So, if there are 8 rows in the table, the final array or list would store 4 values.
I want to make it scalable as well, so there can be 3 or more rows with the same cycleID; in that case, each cycleID could have 2 or more mean differences.
Update: Instead of creating a new question about it, I thought I'd add it here.
I used the diff and groupby approach as mentioned in the solution. It works great, but I also need to save one of the mean values (odd row or even row, it doesn't matter) in a new column and make that part of the new data frame as well. How do I do that?
You can use groupby:
s2 = df.groupby(['cycleID'])['mean'].diff()
s2.dropna(inplace=True)
Output:
1 -8.453876e-12
3 -1.486037e-11
5 2.482933e-12
7 -3.388330e-12
UPDATE
d = [[1, 1.5020712104685252e-11],
     [1, 6.56683605063102e-12],
     [2, 1.3993315187144084e-11],
     [2, -8.670502467042485e-13],
     [3, 7.0270625256163566e-12],
     [3, 9.509995221868016e-12],
     [4, 1.2901435995915644e-11],
     [4, 9.513106448422182e-12]]
df = pd.DataFrame(d, columns=['cycleID', 'mean'])
df2 = df.groupby(['cycleID']).diff().dropna().rename(columns={'mean': 'difference'})
df2['mean'] = df['mean'].iloc[df2.index]
      difference          mean
1  -8.453876e-12  6.566836e-12
3  -1.486037e-11 -8.670502e-13
5   2.482933e-12  9.509995e-12
7  -3.388330e-12  9.513106e-12
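On the scalability point from the question: the groupby/diff pattern extends unchanged to 3 or more rows per cycleID, since diff produces one difference per consecutive pair within each group. A minimal sketch with hypothetical values where cycleID 1 has three rows:

import pandas as pd

d = [[1, 1.0e-11], [1, 2.0e-11], [1, 5.0e-11],
     [2, 3.0e-12], [2, 7.0e-12]]
df = pd.DataFrame(d, columns=['cycleID', 'mean'])

# one difference per consecutive pair inside each cycleID group
diffs = df.groupby('cycleID')['mean'].diff().dropna()
print(diffs)
# 1    1.000000e-11    (2e-11 - 1e-11, within cycleID 1)
# 2    3.000000e-11    (5e-11 - 2e-11, within cycleID 1)
# 4    4.000000e-12    (7e-12 - 3e-12, within cycleID 2)
# Name: mean, dtype: float64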

Method to sort DataFrame's values

I don't understand this code:
d = {'col1': [5, 6, 4, 1, 2, 9, 15, 11]}
df = pd.DataFrame(data=d)
df.head(10)
df['col1'] = df.sort_values('col1')['col1']
print(df.sort_values('col1')['col1'])
This is what is printed:
3 1
4 2
2 4
0 5
1 6
5 9
7 11
6 15
My df doesn't change at all.
Why doesn't this code, df.sort_values('col1')['col1'], rearrange my dataframe?
Thanks
If you want to assign the sorted column back, it is necessary to convert the output to a numpy array to prevent index alignment. That is, df.sort_values('col1')['col1'] on its own does sort correctly and changes the index order, but in the assignment step pandas aligns on the index and restores the original order, so the order of values does not change.
df['col1'] = df.sort_values('col1')['col1'].to_numpy()
If the DataFrame has a default index, another idea is to reset the sorted output to a default index (the same as the original), so alignment assigns by the new index values:
df['col1'] = df.sort_values('col1')['col1'].reset_index(drop=True)
If you want to sort the whole DataFrame by the col1 column:
df = df.sort_values('col1')
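A minimal sketch of the alignment effect described above, using the sample data from the question:

import pandas as pd

df = pd.DataFrame({'col1': [5, 6, 4, 1, 2, 9, 15, 11]})

sorted_col = df.sort_values('col1')['col1']  # values sorted, index shuffled
df['aligned'] = sorted_col                   # assigned back by index: order unchanged
df['as_array'] = sorted_col.to_numpy()       # assigned by position: stays sorted
print(df)
#    col1  aligned  as_array
# 0     5        5         1
# 1     6        6         2
# 2     4        4         4
# 3     1        1         5
# 4     2        2         6
# 5     9        9         9
# 6    15       15        11
# 7    11       11        15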

Numeric differences between two different dataframes in python

I would like to find the numeric difference between two or more columns of two different dataframes.
The following would be the starting table (Table 1). Table 2 contains the single row of values that I need to subtract from Table 1.
I would like to get a third table with the numeric differences between each row of Table 1 and the single row of Table 2. Any help?
Try
df.subtract(df2.values)
with df being your starting table and df2 being Table 2.
Can you try this and see if this is what you need:
import pandas as pd
df = pd.DataFrame({'A':[5, 3, 1, 2, 2], 'B':[2, 3, 4, 2, 2]})
df2 = pd.DataFrame({'A':[1], 'B':[2]})
pd.DataFrame(df.values-df2.values, columns=df.columns)
Out:
   A  B
0  4  0
1  2  1
2  0  2
3  1  0
4  1  0
You can just do df1 - df2.values, as below. This uses numpy broadcasting to subtract df2 from all rows, but df2 must have only one row.
Example:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(15).reshape(-1, 3), columns="A B C".split())
df2 = pd.DataFrame(np.ones(3).reshape(-1, 3), columns="A B C".split())
df1 - df2.values
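For reference, the result of that subtraction (df1 holds 0 through 14 and df2 is a single row of ones, so every element drops by 1; the float dtype comes from np.ones):

print(df1 - df2.values)
#       A     B     C
# 0  -1.0   0.0   1.0
# 1   2.0   3.0   4.0
# 2   5.0   6.0   7.0
# 3   8.0   9.0  10.0
# 4  11.0  12.0  13.0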

Python pandas question matching rows, finding indexes and creating column

I am learning Python and trying to solve a problem but got stuck here. I would like to do the following:
The dataframe is called: df_cleaned_sessions
It contains two columns with timestamps:
datetime_only_first_engagement
datetime_sessions
For your information, the datetime_only_first_engagement column has a lot fewer timestamps than datetime_sessions; the sessions column has a lot of NA values, as this dataframe is the result of a left join.
I would like to do the following:
Find the rows where the datetime_only_first_engagement timestamp equals the timestamp from datetime_sessions, save the index of those rows, create a new column in the dataframe called 'is_conversion', and set those (matching timestamp) indexes to True. The other indexes should be set to False.
Hope someone can help me!
Thanks a lot.
It would have been easier if you had provided sample code and an expected output; however, reading your question, I feel you want to do the following:
import pandas as pd
Let's build a sample df:
df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8], [10, 11]], columns=["A", "B"])
print(df)
    A   B
0   1   2
1   3   4
2   5   6
3   7   8
4  10  11
Let's assume df1 to be:
df1 = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]], columns=["D", "E"])
print(df1)
   D   E
0  1   2
1  3   4
2  5   6
3  7   8
4  9  10
Apply the code below to check whether each value of column A in df exists in column D of df1:
df['is_conversion'] = df['A'].isin(df1['D']).astype(bool)
print(df)
    A   B  is_conversion
0   1   2           True
1   3   4           True
2   5   6           True
3   7   8           True
4  10  11          False
Similarly for your question, you can apply the same logic when matching different columns of the same dataframe. I think you need:
df_cleaned_sessions['is_conversion'] = df_cleaned_sessions['datetime_only_first_engagement'].isin(df_cleaned_sessions['datetime_sessions']).astype(bool)
Based on the comments, add this below the above code:
df_cleaned_sessions['is_conversion'] = df_cleaned_sessions['is_conversion'].replace({True:1, False:0})
Alternative answer using np.where:
import numpy as np
df_cleaned_sessions['is_conversion'] = np.where(df_cleaned_sessions['datetime_only_first_engagement'].isin(df_cleaned_sessions['datetime_sessions']), True, False)
Hope that helps..!
From what I understand, you need numpy.where:
import numpy as np
df_cleaned_sessions['is_conversion'] = np.where(df_cleaned_sessions['datetime_only_first_engagement'] == df_cleaned_sessions['datetime_sessions'], True, False)
Or simply, since the comparison already returns booleans:
df_cleaned_sessions['is_conversion'] = df_cleaned_sessions['datetime_only_first_engagement'] == df_cleaned_sessions['datetime_sessions']
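Note that the two suggestions differ: isin tests whether a value appears anywhere in the other column, while == compares the two columns row by row. A minimal sketch with hypothetical values showing where they diverge:

import pandas as pd

df = pd.DataFrame({'engagement': ['A', 'B'],
                   'sessions':   ['B', 'C']})

df['row_equal'] = df['engagement'] == df['sessions']      # row-wise comparison
df['membership'] = df['engagement'].isin(df['sessions'])  # anywhere in the column
print(df)
#   engagement sessions  row_equal  membership
# 0          A        B      False       False
# 1          B        C      False        True

Which one is right depends on whether a conversion means the two timestamps match on the same row or merely appear somewhere in both columns.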

Iterating over each element in pandas DataFrame

So I have a pandas DataFrame with a single column and a lot of data.
I need to access each of the elements, not to change them (with apply()) but to parse them into another function.
When looping through the DataFrame it always stops after the first one.
If I convert it to a list first, then my numbers are all in brackets (e.g. [12] instead of 12), thus breaking my code.
Does anyone see what I am doing wrong?
import pandas as pd

def go_trough_list(df):
    for number in df:
        print(number)

df = pd.read_csv("my_ids.csv")
go_trough_list(df)
df looks like:
   1
0  2
1  3
2  4
dtype: object
[Finished in 1.1s]
Edit: I found one mistake. My first value is recognized as a header.
So I changed my code to:
df = pd.read_csv("my_ids.csv",header=None)
But with
for ix in df.index:
    print(df.loc[ix])
I get:
0 1
Name: 0, dtype: int64
0 2
Name: 1, dtype: int64
0 3
Name: 2, dtype: int64
0 4
Name: 3, dtype: int64
Edit: here is my solution, thanks to jezrael and Nick!
First I added header=None because my data has no header.
Then I changed my function to:
def go_through_list(df):
    new_list = df[0].apply(my_function, parameter=par1)
    return new_list
And it works perfectly! Thank you again guys, problem solved.
You can use the index as in other answers, and also iterate through the df and access the row like this:
for index, row in df.iterrows():
    print(row['column'])
However, I suggest solving the problem differently if performance is of any concern. Also, if there is only one column, it is more correct to use a pandas Series.
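The loop in the question stopped after one item because iterating a DataFrame yields its column labels, not its values; iterating a single column (a Series) yields the values themselves. A minimal sketch with a hypothetical column name:

import pandas as pd

# hypothetical single-column frame standing in for my_ids.csv
df = pd.DataFrame({'ids': [12, 34, 56]})

for number in df['ids']:  # a Series yields plain scalars
    print(number)         # 12, then 34, then 56 -- no brackets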
What do you mean by parse it into another function? Perhaps take the value, and do something to it and create it into another column?
I need to access each of the element, not to change it (with apply()) but to parse it into another function.
Perhaps this example will help:
import pandas as pd

df = pd.DataFrame([20, 21, 12])

def square(x):
    return x**2

df['new_col'] = df[0].apply(square)  # can use a lambda here nicely
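Continuing that example, printing the frame shows the squares landing in the new column:

print(df)
#     0  new_col
# 0  20      400
# 1  21      441
# 2  12      144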
You can convert the column to a list with tolist:
for x in df['Colname'].tolist():
    print(x)
Sample:
import pandas as pd

df = pd.DataFrame({'a': pd.Series([1, 2, 3]),
                   'b': pd.Series([4, 5, 6])})
print(df)
   a  b
0  1  4
1  2  5
2  3  6

for x in df['a'].tolist():
    print(x)
1
2
3
If you have only one column, use iloc for selecting the first column:
for x in df.iloc[:, 0].tolist():
    print(x)
Sample:
import pandas as pd

df = pd.DataFrame({1: pd.Series([2, 3, 4])})
print(df)
   1
0  2
1  3
2  4

for x in df.iloc[:, 0].tolist():
    print(x)
2
3
4
This can work too, but it is not a recommended approach, because 1 can be a number or a string and it can raise a KeyError:
for x in df[1].tolist():
    print(x)
2
3
4
Say you have one column named 'myColumn', and you have an index on the dataframe (which is automatically created with read_csv). Try using the .loc function:
for ix in df.index:
    print(df.loc[ix]['myColumn'])  # df.loc[ix, 'myColumn'] also works and avoids chained indexing
