Hi all and thanks for help in advance.
The problem I am trying to solve is as follows:
I have two columns within one CSV file: Column A and Column B.
There are certain patterns that need to be present in my data in each row underneath column A and B.
For example, if there is a "1" in row 1 of column A , there must be a "5" adjacent to it in row 1 of column B.
If there was a "1" in row two of Column A, and a "2" adjacent to it in row 2 of column B I would need this to be flagged and printed out as "does not follow pattern"
The rules go as follows:
Any time there's a 1 in column A, there should be a 5 next to it in column B.
Any time there's a 3 in column A, there should be a 6 next to it in column B.
Any time there's a 2 in column A, there should be a 4 next to it in column B.
Any time these rules are not followed, the output should say "pattern not followed".
Here is where I am at on the code, I can't seem to figure out a way of doing this data check.
import numpy as np
import pandas as pd
import os # filepaths
import glob
import getpass # Login information
unane = getpass.getuser()
# Paths:
path2proj = os.path.join('C:', os.sep, 'Users', unane, 'Documents', 'Home','Expts','TSP', '')
path2data = os.path.join(path2proj,'Data','')
path2asys = os.path.join(path2proj,'Analysis', '')
path2figs = os.path.join(path2asys, 'figures', '')
path2hddm = os.path.join(path2asys, 'modeling', '')
df = pd.read_csv(path2data + '001_2012_Dec_19_0932_PST_train.csv')
os.chdir(path2data)
# extension = 'csv'
# all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
all_filenames = glob.glob("*.csv")
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames])
combined_csv.to_csv("combined_csv.csv",index=False, encoding='utf-8-sig')
df = pd.read_csv(path2data + 'combined_csv.csv')
df['left_stim_number'].equals(df['right_stim_number'])
df = pd.read_csv(path2data + 'combined_csv.csv')
df1 = pd.DataFrame(df, columns=['left_stim_number'])
df2 = pd.DataFrame(df, columns=['right_stim_number'])
df1['match'] = np.where(df1['left_stim_number']== df2['right_stim_number'], True, False)
# Checking to see if there are any errors as all should add up to 7
df1['add'] = np.where(df1['left_stim_number']== df2['right_stim_number'], 0, df1['left_stim_number'] + df2['right_stim_number'])
# def see_correct(df):
# if df1['add'] == ['7']:
# return 1
# else:
# return 0
# df1.tail(10)
combined_csv.isna().sum()
combined_csv.dropna()
df.loc[df['left_stim_number'] != df['right_stim_number'],:]
---
Example of CSV data
A(left_stim_number) Column B (Right_stim_number)
1 5
1 5
3 6
1 5
3 6
2 4
2 4
2 4
1 5
Since we don't have an example I'll make one up - a pandas DataFrame with two columns of integers.
import numpy as np
import pandas as pd
np.random.seed(2)
df = pd.DataFrame({'colA':np.random.randint(0,10,100),
'colB':np.random.randint(0,10,100)})
>>> df.head()
colA colB
0 8 7
1 8 1
2 6 9
3 2 2
4 8 1
There might be more concise ways to do this, but this one makes it pretty clear what is happening. It uses a lot of boolean indexing.
Your rules exclude any row where colA is not 1, 2, or 3. Rows where colB is not 4, 5, or 6 are also excluded. You can make a mask for all the excluded rows.
mask = ~df.colA.isin([1,2,3]) | ~df.colB.isin([4,5,6])
>>> df[mask].head()
colA colB
0 8 7
1 8 1
2 6 9
3 2 2
4 8 1
>>>
You can use the mask to assign "pattern not followed" to a new column for all those rows.
df.loc[mask,'colC'] = 'pattern not followed'
>>> df.head()
colA colB colC
0 8 7 pattern not followed
1 8 1 pattern not followed
2 6 9 pattern not followed
3 2 2 pattern not followed
4 8 1 pattern not followed
You can also use the mask to find all the rows that might match your criteria. Notice colC is NaN for these rows.
>>> df[~mask]
colA colB colC
13 3 5 NaN
35 2 6 NaN
39 1 5 NaN
61 2 5 NaN
62 1 5 NaN
65 1 6 NaN
69 1 5 NaN
70 2 4 NaN
77 3 5 NaN
92 1 6 NaN
98 2 5 NaN
>>>
Set colC to True for the rows that meet the criteria.
df.loc[(df.colA == 1) & (df.colB == 5),'colC'] = True
df.loc[(df.colA == 3) & (df.colB == 6),'colC'] = True
df.loc[(df.colA == 2) & (df.colB == 4),'colC'] = True
That leaves some outliers.
>>> df.loc[df.colC.isna()]
colA colB colC
13 3 5 NaN
35 2 6 NaN
61 2 5 NaN
65 1 6 NaN
77 3 5 NaN
92 1 6 NaN
98 2 5 NaN
These can be flagged with:
df.loc[df.colC.isna(),'colC'] = 'pattern not followed'
Looking back at that, only the last four operations are needed.
df.loc[(df.colA == 1) & (df.colB == 5),'colC'] = True
df.loc[(df.colA == 3) & (df.colB == 6),'colC'] = True
df.loc[(df.colA == 2) & (df.colB == 4),'colC'] = True
df.loc[df.colC.isna(),'colC'] = 'pattern not followed'
>>> df.loc[df.colC == True]
colA colB colC
39 1 5 True
62 1 5 True
69 1 5 True
70 2 4 True
>>>
If the text in csv file looks like this-
4,9
8,3
4,6
2,4
7,5
1,3
.
.
.
The Dataframe can be made with -
df = pd.read_csv('data.csv',names=['colA','colB'])
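The same check can also be written as a lookup table instead of one line per rule. This is only a sketch under assumptions: the sample data is made up, and the names rules, expected and follows are illustrative rather than anything from the question. Values of colA that have no rule map to NaN, so they are flagged as well.

```python
import numpy as np
import pandas as pd

# Made-up sample data using the colA/colB names from the answer above.
df = pd.DataFrame({
    "colA": [1, 1, 3, 2, 2, 8],
    "colB": [5, 2, 6, 4, 5, 7],
})

# Required colB value for each colA value (the three rules from the question).
rules = {1: 5, 3: 6, 2: 4}

# Look up the expected partner for every row; unknown colA values become NaN.
expected = df["colA"].map(rules)

# True where colB holds the expected value, False otherwise (including NaN).
follows = expected.eq(df["colB"])

df["colC"] = np.where(follows, "ok", "pattern not followed")
violations = df[~follows]
print(violations)
```

Adding a rule then only means adding an entry to the dictionary.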
I'd like to concatenate two dataframes A, B to a new one without duplicate rows (if rows in B already exist in A, don't add):
Dataframe A:
I II
0 1 2
1 3 1
Dataframe B:
I II
0 5 6
1 3 1
New Dataframe:
I II
0 1 2
1 3 1
2 5 6
How can I do this?
The simplest way is to just do the concatenation, and then drop duplicates.
>>> df1
A B
0 1 2
1 3 1
>>> df2
A B
0 5 6
1 3 1
>>> pandas.concat([df1,df2]).drop_duplicates().reset_index(drop=True)
A B
0 1 2
1 3 1
2 5 6
The reset_index(drop=True) is to fix up the index after the concat() and drop_duplicates(). Without it you will have an index of [0,1,0] instead of [0,1,2]. This could cause problems for further operations on this dataframe down the road if it isn't reset right away.
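A small sketch of what the index looks like with and without the reset, using hand-built frames that mirror the df1 and df2 shown above (the variable names combined and clean are just for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 3], "B": [2, 1]})
df2 = pd.DataFrame({"A": [5, 3], "B": [6, 1]})

# concat keeps each frame's own labels, so label 0 ends up appearing twice.
combined = pd.concat([df1, df2]).drop_duplicates()
print(combined.index.tolist())

# A label lookup on the un-reset frame now returns two rows instead of one.
print(combined.loc[0])

# After the reset, the labels are unique again.
clean = combined.reset_index(drop=True)
print(clean.index.tolist())
```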
If DataFrame A already contains a duplicate row, concatenating and then dropping duplicates will also remove rows from DataFrame A that you might want to keep.
In that case, you will need to create a new column with a cumulative count and then drop duplicates. Whether you need this depends on your use case, but it is common in time-series data.
Here is an example:
df_1 = pd.DataFrame([
{'date':'11/20/2015', 'id':4, 'value':24},
{'date':'11/20/2015', 'id':4, 'value':24},
{'date':'11/20/2015', 'id':6, 'value':34},])
df_2 = pd.DataFrame([
{'date':'11/20/2015', 'id':4, 'value':24},
{'date':'11/20/2015', 'id':6, 'value':14},
])
df_1['count'] = df_1.groupby(['date','id','value']).cumcount()
df_2['count'] = df_2.groupby(['date','id','value']).cumcount()
df_tot = pd.concat([df_1,df_2], ignore_index=False)
df_tot = df_tot.drop_duplicates()
df_tot = df_tot.drop(['count'], axis=1)
>>> df_tot
date id value
0 11/20/2015 4 24
1 11/20/2015 4 24
2 11/20/2015 6 34
1 11/20/2015 6 14
I'm surprised that pandas doesn't offer a native solution for this task.
I don't think that it's efficient to just drop the duplicates if you work with large datasets (as Rian G suggested).
It is probably most efficient to use sets to find the non-overlapping indices, then use a list comprehension to translate from index to 'row location' (boolean), which you need to access rows using iloc[,]. Below is a function that performs the task. If you don't choose a specific column (col) to check for duplicates, the indexes will be used, as you requested. If you choose a specific column, be aware that existing duplicate entries in 'a' will remain in the result.
import pandas as pd
def append_non_duplicates(a, b, col=None):
    if ((a is not None and type(a) is not pd.core.frame.DataFrame) or (b is not None and type(b) is not pd.core.frame.DataFrame)):
        raise ValueError('a and b must be of type pandas.core.frame.DataFrame.')
    if (a is None):
        return(b)
    if (b is None):
        return(a)
    if(col is not None):
        # Compare on the values of one column (given by position)
        aind = a.iloc[:,col].values
        bind = b.iloc[:,col].values
    else:
        # Compare on the index labels
        aind = a.index.values
        bind = b.index.values
    # Keys that occur in b but not in a
    take_rows = set(bind)-set(aind)
    # Boolean mask over the rows of b
    take_rows = [i in take_rows for i in bind]
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    return(pd.concat([a, b.iloc[take_rows,:]]))
# Usage
a = pd.DataFrame([[1,2,3],[1,5,6],[1,12,13]], index=[1000,2000,5000])
b = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], index=[1000,2000,3000])
append_non_duplicates(a,b)
# 0 1 2
# 1000 1 2 3 <- from a
# 2000 1 5 6 <- from a
# 5000 1 12 13 <- from a
# 3000 7 8 9 <- from b
append_non_duplicates(a,b,0)
# 0 1 2
# 1000 1 2 3 <- from a
# 2000 1 5 6 <- from a
# 5000 1 12 13 <- from a
# 2000 4 5 6 <- from b
# 3000 7 8 9 <- from b
Another option:
concatenation = pd.concat([
    dfA,
    dfB[~dfB['I'].isin(dfA['I'])],  # <-- get all the rows in dfB whose 'I' value doesn't show up in dfA
], ignore_index=True)  # renumber the rows 0..n-1 so the index matches the output below
The object concatenation will be:
I II
0 1 2
1 3 1
2 5 6
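Note that the isin check on column 'I' only compares that one column, so a row of dfB that shares its 'I' value with dfA but differs in 'II' would be dropped too. If whole rows should be compared, one possible sketch is an anti-join with merge(..., indicator=True). The extra row (3, 9) and the names marked, only_in_b and result are assumptions added for illustration.

```python
import pandas as pd

dfA = pd.DataFrame({"I": [1, 3], "II": [2, 1]})
# (3, 9) shares its 'I' value with dfA but is not a duplicate row.
dfB = pd.DataFrame({"I": [5, 3, 3], "II": [6, 1, 9]})

# Left-merge on all shared columns; the _merge column records whether
# each dfB row was also found in dfA ("both") or not ("left_only").
marked = dfB.merge(dfA.drop_duplicates(), how="left", indicator=True)

# Keep only the rows that exist in dfB alone, with dfB's original columns.
only_in_b = marked.loc[marked["_merge"] == "left_only", dfB.columns]

result = pd.concat([dfA, only_in_b], ignore_index=True)
print(result)
```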
I need to sum up the values of column 'D' for every row with the same combination of values in columns 'A', 'B' and 'C'. Eventually I need to create a DataFrame with the unique combinations of values from columns 'A', 'B' and 'C' and the corresponding sum in column 'D'.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,3,size=(10,4)),columns=list('ABCD'))
df
OT:
A B C D
0 0 2 0 2
1 0 1 2 1
2 0 0 2 0
3 1 2 2 2
4 0 2 2 2
5 0 2 2 2
6 2 2 2 1
7 2 1 1 1
8 1 0 2 0
9 1 2 0 0
I've tried to create temporary data frame with empty cells
D = pd.DataFrame([i for i in range(len(df))]).rename(columns = {0:'D'})
D['D'] = ''
D
OT:
D
0
1
2
3
4
5
6
7
8
9
Then I use apply() to sum up all 'D' column values for each unique combination of columns 'A', 'B' and 'C'. For example, the line below returns the sum of the values in column 'D' for 'A'=0, 'B'=2, 'C'=2:
df[(df['A']==0) & (df['B']==2) & (df['C']==2)]['D'].sum()
OT:
4
The function:
def Sumup(cols):
    A = cols[0]
    B = cols[1]
    C = cols[2]
    D = cols[3]
    sum = df[(df['A']==A) & (df['B']==B) & (df['C']==C)]['D'].sum()
    return sum
I applied it to df and saved the result in the temporary DataFrame column D['D']:
D['D'] = df[['A','B','C','D']].apply(Sumup)
Later I wanted to use drop_duplicates, but I get a DataFrame consisting of NaNs.
D
OT:
D
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
Could anyone give me a hint on how to handle the NaN problem, or on what other approach I could use to solve the original problem?
df.groupby(['A','B','C']).sum()
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,3,size=(10,4)),columns=list('ABCD'))
df.groupby(["A", "B", "C"])["D"].sum()
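Since the question asks for a DataFrame with one row per unique combination, here is a sketch that passes as_index=False so that 'A', 'B' and 'C' stay as ordinary columns. The small hand-written frame and the name summed are assumptions for illustration. As far as I can tell, the NaNs in the question come from apply() running column-wise by default (axis=0): the result is labelled 'A' to 'D', which does not align with the 0 to 9 index of the temporary frame.

```python
import pandas as pd

# Small hand-written frame instead of random data, so the result is predictable.
df = pd.DataFrame({
    "A": [0, 0, 0, 1],
    "B": [2, 2, 1, 2],
    "C": [2, 2, 2, 0],
    "D": [2, 2, 1, 0],
})

# as_index=False keeps the group keys as regular columns in the result.
summed = df.groupby(["A", "B", "C"], as_index=False)["D"].sum()
print(summed)
```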
I have a Pandas dataframe that contains a grouping variable. I would like to merge each group with other dataframes based on the contents of one of the columns. So, for example, I have a dataframe, dfA, which can be defined as:
dfA = pd.DataFrame({'a':[1,2,3,4,5,6],
'b':[0,1,0,0,1,1],
'c':['a','b','c','d','e','f']})
a b c
0 1 0 a
1 2 1 b
2 3 0 c
3 4 0 d
4 5 1 e
5 6 1 f
Two other dataframes, dfB and dfC, contain a common column ('a') and an extra column ('d') and can be defined as:
dfB = pd.DataFrame({'a':[1,2,3],
'd':[11,12,13]})
a d
0 1 11
1 2 12
2 3 13
dfC = pd.DataFrame({'a':[4,5,6],
'd':[21,22,23]})
a d
0 4 21
1 5 22
2 6 23
I would like to be able to split dfA based on column 'b' and merge one of the groups with dfB and the other group with dfC to produce an output that looks like:
a b c d
0 1 0 a 11
1 2 1 b 12
2 3 0 c 13
3 4 0 d 21
4 5 1 e 22
5 6 1 f 23
In this simplified version, I could concatenate dfB and dfC and merge with dfA without splitting into groups as shown below:
dfX = pd.concat([dfB,dfC])
dfA = dfA.merge(dfX,on='a',how='left')
print(dfA)
a b c d
0 1 0 a 11
1 2 1 b 12
2 3 0 c 13
3 4 0 d 21
4 5 1 e 22
5 6 1 f 23
However, in the real-world situation, the smaller dataframes will be generated from multiple different complex sources; generating the dataframes and combining into a single dataframe beforehand may not be feasible because there may be overlapping data on the column that will be used for merging the dataframes (but this will be avoided if the dataframe can be split based on the grouping variable). Is it possible to use Pandas groupby() method to do this instead? I was thinking of something like the following (which doesn't work, perhaps because I'm not combining the groups into a new dataframe correctly):
grouped = dfA.groupby('b')
for name, group in grouped:
    if name == 0:
        group = group.merge(dfB,on='a',how='left')
    elif name == 1:
        group = group.merge(dfC,on='a',how='left')
Any thoughts would be appreciated.
This will fix your code
l=[]
grouped = dfA.groupby('b')
for name, group in grouped:
    if name == 0:
        group = group.merge(dfB,on='a',how='left')
    elif name == 1:
        group = group.merge(dfC,on='a',how='left')
    l.append(group)
pd.concat(l)
Out[215]:
a b c d
0 1 0 a 11.0
1 3 0 c 13.0
2 4 0 d NaN
0 2 1 b NaN
1 5 1 e 22.0
2 6 1 f 23.0
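The NaN values in that output appear because, in the sample data, a=4 sits in group 0 but only occurs in dfC, and a=2 sits in group 1 but only occurs in dfB. If more groups are involved, the if/elif chain can be replaced by a dictionary from group key to lookup frame. This is a sketch only; the names lookup, parts and result are made up for illustration.

```python
import pandas as pd

dfA = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6],
                    "b": [0, 1, 0, 0, 1, 1],
                    "c": ["a", "b", "c", "d", "e", "f"]})
dfB = pd.DataFrame({"a": [1, 2, 3], "d": [11, 12, 13]})
dfC = pd.DataFrame({"a": [4, 5, 6], "d": [21, 22, 23]})

# Which frame each value of the grouping column 'b' should be merged with.
lookup = {0: dfB, 1: dfC}

# Merge every group with its own frame, then stack the pieces back together.
parts = [group.merge(lookup[key], on="a", how="left")
         for key, group in dfA.groupby("b")]
result = pd.concat(parts, ignore_index=True)
print(result)
```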
I am using a pandas/python dataframe. I am trying to do a lag subtraction.
I am currently using:
newCol = df.col - df.col.shift()
This leads to a NaN in the first spot:
NaN
45
63
23
...
First question: Is this the best way to do a subtraction like this?
Second: if I want to add another column (same number of rows) to this new column, is there a way to treat all the NaNs as 0 for the calculation?
Ex:
col_1 =
NaN
45
63
23
col_2 =
10
10
10
10
new_col =
10
55
73
33
and NOT
NaN
55
73
33
Thank you.
I think your method of computing lags is just fine:
import pandas as pd
df = pd.DataFrame(range(4), columns = ['col'])
print(df['col'] - df['col'].shift())
# 0 NaN
# 1 1
# 2 1
# 3 1
# Name: col
print(df['col'] + df['col'].shift())
# 0 NaN
# 1 1
# 2 3
# 3 5
# Name: col
If you wish NaN plus (or minus) a number to be the number (not NaN), use the add (or sub) method with fill_value = 0:
print(df['col'].sub(df['col'].shift(), fill_value = 0))
# 0 0
# 1 1
# 2 1
# 3 1
# Name: col
print(df['col'].add(df['col'].shift(), fill_value = 0))
# 0 0
# 1 1
# 2 3
# 3 5
# Name: col
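For the example in the question, a sketch along the same lines: Series.diff() gives the same lag subtraction as col - col.shift(), and fillna(0) replaces the leading NaN before the second column is added. The input values are made up so that the differences come out as NaN, 45, 63, 23 like in the question; the names lag, col_2 and new_col are illustrative.

```python
import pandas as pd

# Made-up values whose successive differences are 45, 63 and 23.
df = pd.DataFrame({"col": [10, 55, 118, 141]})

# diff() is the built-in form of df["col"] - df["col"].shift().
lag = df["col"].diff()

# Second column to add; the same length as lag.
col_2 = pd.Series([10, 10, 10, 10])

# Treat the missing first lag as 0 before adding.
new_col = lag.fillna(0) + col_2
print(new_col)
```

lag.add(col_2, fill_value=0) gives the same result without the separate fillna step.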