Say I have a data frame like the following:
A B C D E
a1 b1 c1 d1 e1
a2 a1 c2 d2 e2
a3 a1 a2 d3 e3
a4 a1 a2 a3 e4
I want to create a new column that flags, for each row, which of the other columns contain a value that also appears in column A.
Something like this:
A B C D E F
a1 b1 c1 d1 e1 NA
a2 a1 c2 d2 e2 in_B
a3 a1 a2 d3 e3 in_B, in_C
a4 a1 a2 a3 e4 in_B, in_C, in_D
The in_B, in_C labels could be any other strings of choice. If matching values are present in multiple columns, then F holds multiple labels — for example, rows 3 and 4 of column F (row 3 has two labels and row 4 has three). So far, I have tried the below:
DF.F=np.where(DF.A.isin(DF.B), DF.A,'in_B')
But it does not give the expected result. Any help?
STEPS:
Stack the dataframe.
Check for duplicate values.
Unstack to get the same structure back.
Use dot to get the required result.
df['new_col'] = df.stack().duplicated().unstack().dot(
    'In ' + df.columns + ',').str.strip(',')
OUTPUT:
A B C D E new_col
0 a1 b1 c1 d1 e1
1 a2 a1 c2 d2 e2 In B
2 a3 a1 a2 d3 e3 In B,In C
3 a4 a1 a2 a3 e4 In B,In C,In D
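The steps above can be run end-to-end as a self-contained sketch (data copied from the question):

```python
import pandas as pd

# Build the example frame from the question.
df = pd.DataFrame({'A': ['a1', 'a2', 'a3', 'a4'],
                   'B': ['b1', 'a1', 'a1', 'a1'],
                   'C': ['c1', 'c2', 'a2', 'a2'],
                   'D': ['d1', 'd2', 'd3', 'a3'],
                   'E': ['e1', 'e2', 'e3', 'e4']})

df['new_col'] = (df.stack()       # long Series with a (row, column) MultiIndex
                   .duplicated()  # True for every value already seen earlier
                   .unstack()     # back to the original shape, now boolean
                   .dot('In ' + df.columns + ',')  # concatenate labels per row
                   .str.strip(','))
```

The dot product works because `True * 'In B,'` is `'In B,'` and `False * 'In B,'` is `''`, so each row sums to a comma-joined string of the flagged column labels.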
I have a string dataframe that I would like to modify. I need to cut off each row of the dataframe at a given value, say A4, and replace the values after A4 with -- (or remove them). I would like to create a new dataframe that has values only up to the string "A4". How would I do this?
import pandas as pd
columns = ['c1','c2','c3','c4','c5','c6']
values = [['A1', 'A2','A3','A4','A5','A6'],['A1','A3','A2','A5','A4','A6'],['A1','A2','A4','A3','A6','A5'],['A2','A1','A3','A4','A5','A6'], ['A2','A1','A3','A4','A6','A5'],['A1','A2','A4','A3','A5','A6']]
input = pd.DataFrame(values, columns)
columns = ['c1','c2','c3','c4','c5','c6']
values = [['A1', 'A2','A3','A4','--','--'],['A1','A3','A2','A5','A4','--'],['A1','A2','A4','--','--','--'],['A2','A1','A3','A4','--','--'], ['A2','A1','A3','A4','--','--'],['A1','A2','A4','--','--','--']]
output = pd.DataFrame(values, columns)
You can make a small function, that will take an array, and modify the values after your desired value:
def myfunc(x, val):
    # Find the first occurrence of val, then blank out everything after it.
    # Assumes val occurs in the row; otherwise the row is returned unchanged.
    for i in range(len(x)):
        if x[i] == val:
            break
    x[(i+1):] = '--'
    return x
Then you need to apply the function to the dataframe in a rowwise (axis = 1) manner:
input.apply(lambda x: myfunc(x, 'A4'), axis = 1)
0 1 2 3 4 5
c1 A1 A2 A3 A4 -- --
c2 A1 A3 A2 A5 A4 --
c3 A1 A2 A4 -- -- --
c4 A2 A1 A3 A4 -- --
c5 A2 A1 A3 A4 -- --
c6 A1 A2 A4 -- -- --
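A vectorized alternative sketch (not part of the answer above): build a boolean mask that is True strictly after the first A4 in each row, then mask those cells. The two-row frame here is a hypothetical subset of the question's data.

```python
import pandas as pd

df = pd.DataFrame([['A1', 'A2', 'A3', 'A4', 'A5', 'A6'],
                   ['A1', 'A3', 'A2', 'A5', 'A4', 'A6']],
                  index=['c1', 'c2'])

# True at and after the first 'A4'; shift right by one so 'A4' itself survives.
after = df.eq('A4').cummax(axis=1).shift(1, axis=1, fill_value=False)
result = df.mask(after, '--')
```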
I assume the values after A4 will always be greater than A4:
input.replace('A([5-9])', '--', regex=True)
0 1 2 3 4 5
c1 A1 A2 A3 A4 -- --
c2 A1 A3 A2 -- A4 --
c3 A1 A2 A4 A3 -- --
c4 A2 A1 A3 A4 -- --
c5 A2 A1 A3 A4 -- --
c6 A1 A2 A4 A3 -- --
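A runnable sketch of the replace idea (single hypothetical row). Note that this maps every A5–A9 cell to -- wherever it occurs, so values after A4 that are smaller than A5 are left untouched:

```python
import pandas as pd

df = pd.DataFrame([['A1', 'A2', 'A4', 'A3', 'A6', 'A5']], index=['c3'])

# Any cell matching A5..A9 becomes '--', regardless of its position in the row.
out = df.replace('A([5-9])', '--', regex=True)
```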
I need to get the difference between 2 CSV files, dropping duplicates and NaN fields.
I am trying the following, but it adds them together instead of subtracting.
df1 = pd.concat([df,cite_id]).drop_duplicates(keep=False)[['id','website']]
df is the main dataframe; cite_id is the dataframe to be subtracted.
You can do this efficiently using isin (note that dropna and drop_duplicates return new frames, so the results must be assigned back):
df = df.dropna().drop_duplicates()
cite_id = cite_id.dropna().drop_duplicates()
df[~df.id.isin(cite_id.id.values)]
Or you can merge them with indicator=True and keep only the rows found in df alone:
merged = df.merge(cite_id, how='outer', indicator=True)
merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
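A runnable sketch of the merge idea using merge's indicator parameter (the id/website column names come from the question; the values are made up):

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'website': ['w1', 'w2', 'w3', 'w4']})
cite_id = pd.DataFrame({'id': [2, 4],
                        'website': ['w2', 'w4']})

# Rows tagged 'left_only' exist in df but not in cite_id.
merged = df.merge(cite_id, how='outer', indicator=True)
diff = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
```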
import pandas as pd
df1 = pd.read_csv("1.csv")
df2 = pd.read_csv("2.csv")
df1 = df1.dropna().drop_duplicates()
df2 = df2.dropna().drop_duplicates()
df = df2.loc[~df2.id.isin(df1.id)]
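With in-memory frames instead of CSV files (hypothetical values), the same flow looks like:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 1, 2, None],
                    'website': ['w1', 'w1', 'w2', 'w4']})
df2 = pd.DataFrame({'id': [1, 2, 3],
                    'website': ['w1', 'w2', 'w3']})

# Clean both frames, then keep only df2 ids that are absent from df1.
df1 = df1.dropna().drop_duplicates()
df2 = df2.dropna().drop_duplicates()
df = df2.loc[~df2.id.isin(df1.id)]
```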
You can concatenate the two dataframes into one, and after that remove all duplicates:
df1
ID B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
cite_id
ID B C D
4 A2 B4 C4 D4
5 A3 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7
pd.concat([df1,cite_id]).drop_duplicates(subset=['ID'], keep=False)
Out:
ID B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
6 A6 B6 C6 D6
7 A7 B7 C7 D7
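Reproducing the example above as a runnable sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']})
cite_id = pd.DataFrame({'ID': ['A2', 'A3', 'A6', 'A7'],
                        'B': ['B4', 'B5', 'B6', 'B7'],
                        'C': ['C4', 'C5', 'C6', 'C7'],
                        'D': ['D4', 'D5', 'D6', 'D7']},
                       index=[4, 5, 6, 7])

# keep=False drops every row whose ID appears in both frames.
out = pd.concat([df1, cite_id]).drop_duplicates(subset=['ID'], keep=False)
```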
I am trying to sample the resulting data after doing a groupby on multiple columns. If the respective group has more than 2 elements, I want to take a sample of 2 records; otherwise, take all the records.
df:
col1 col2 col3 col4
A1 A2 A3 A4
A1 A2 A3 A5
A1 A2 A3 A6
B1 B2 B3 B4
B1 B2 B3 B5
C1 C2 C3 C4
target df:
col1 col2 col3 col4
A1 A2 A3 A4 or A5 or A6
A1 A2 A3 A4 or A5 or A6
B1 B2 B3 B4
B1 B2 B3 B5
C1 C2 C3 C4
I have mentioned A4 or A5 or A6 because, when we take a sample, any of the three might be returned.
This is what I have tried so far:
trial = pd.DataFrame(df.groupby(['col1', 'col2','col3'])['col4'].apply(lambda x: x if (len(x) <=2) else x.sample(2)))
However, with this I do not get col1, col2 and col3 as columns.
I think you need a double reset_index - first to remove the 3rd level of the MultiIndex, and second to convert the MultiIndex to columns:
trial= (df.groupby(['col1', 'col2','col3'])['col4']
.apply(lambda x: x if (len(x) <=2) else x.sample(2))
.reset_index(level=3, drop=True)
.reset_index())
Or use reset_index followed by drop to remove the level_3 column:
trial= (df.groupby(['col1', 'col2','col3'])['col4']
.apply(lambda x: x if (len(x) <=2) else x.sample(2))
.reset_index()
.drop(columns='level_3'))
print (trial)
col1 col2 col3 col4
0 A1 A2 A3 A4
1 A1 A2 A3 A6
2 B1 B2 B3 B4
3 B1 B2 B3 B5
4 C1 C2 C3 C4
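The double reset_index can be verified end-to-end with the question's data (random_state is an addition here, not in the original, to pin the sample for reproducibility):

```python
import pandas as pd

df = pd.DataFrame({'col1': ['A1', 'A1', 'A1', 'B1', 'B1', 'C1'],
                   'col2': ['A2', 'A2', 'A2', 'B2', 'B2', 'C2'],
                   'col3': ['A3', 'A3', 'A3', 'B3', 'B3', 'C3'],
                   'col4': ['A4', 'A5', 'A6', 'B4', 'B5', 'C4']})

# Sample at most 2 rows per group, then flatten the 4-level MultiIndex.
trial = (df.groupby(['col1', 'col2', 'col3'])['col4']
           .apply(lambda x: x if len(x) <= 2 else x.sample(2, random_state=0))
           .reset_index(level=3, drop=True)
           .reset_index())
```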
There is no need to wrap this in pd.DataFrame - the grouped result already supports reset_index:
trial=df.groupby(['col1', 'col2','col3'])['col4'].apply(lambda x: x if (len(x) <=2) else x.sample(2))
And this should add col1, col2 and col3 back (along with a level_3 column of the original row labels):
trial.reset_index(inplace=True,drop=False)
I have a dataframe like the following:
df =
A B D
a1 b1 9052091001A
a2 b2 95993854906
a3 b3 93492480190
a4 b4 93240941993
What I want:
df_resp =
A B D
a1 b1 001A
a2 b2 4906
a3 b3 0190
a4 b4 1993
What I tried:
for i in (0,len(df['D'])):
    df['D'][i] = df['D'][i][-4:]
Error I got:
KeyError: 4906
Also, it takes a really long time and I think there should be a quicker way with pandas.
Use the pd.Series.str string accessor for vectorized string operations. These are preferred over using apply.
If the D elements are already strings:
df.assign(D=df.D.str[-4:])
A B D
0 a1 b1 001A
1 a2 b2 4906
2 a3 b3 0190
3 a4 b4 1993
If not
df.assign(D=df.D.astype(str).str[-4:])
A B D
0 a1 b1 001A
1 a2 b2 4906
2 a3 b3 0190
3 a4 b4 1993
You can change in place with
df['D'] = df.D.str[-4:]
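Putting the accessor approach together as a runnable sketch (data copied from the question):

```python
import pandas as pd

df = pd.DataFrame({'A': ['a1', 'a2', 'a3', 'a4'],
                   'B': ['b1', 'b2', 'b3', 'b4'],
                   'D': ['9052091001A', '95993854906',
                         '93492480190', '93240941993']})

# astype(str) guards against non-string dtypes before slicing the last 4 chars.
df['D'] = df['D'].astype(str).str[-4:]
```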
Use the apply() method of pandas.Series; it will be much faster than iterating with a for loop.
This should work (provided the column contains only strings):
df_resp = df.copy()
df_resp['D'] = df_resp['D'].apply(lambda x : x[-4:])
As for the KeyError, it probably comes from your DataFrame's index: df['D'][i] looks up i as an index label, not a position (it is roughly equivalent to df.loc[i, 'D']). It would work if you used df['D'].iloc[i], which refers to the element at position i.
I hope this helps!