My dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame({'c1': [10, 11, 12, 13], 'c2': [100, 110, 120, 130], 'c3': [100, 110, 120, 130], 'c4': ['A', np.nan, np.nan, 'B']})
I need to replace the values in columns c2 and c3 with values from another dataframe, matching on column 'c4'.
replacer df:
df_replacer = pd.DataFrame({'c2': [11, 22], 'c3': [99, 299], 'c4': ['A', 'B']})
Below is how I am doing it (is there a cleaner way to do this?):
df = df.merge(df_replacer, on=['c4'], how='left')
df.loc[~df.c4.isna(), 'c2_x'] = df['c2_y']
df.loc[~df.c4.isna(), 'c3_x'] = df['c3_y']
df = df.rename({'c2_x': 'c2', 'c3_x':'c3'}, axis=1)
df = df[['c1', 'c2', 'c3', 'c4']]
I don't see a way to do it without the merge, but you could tidy it up like this:
df = df.merge(df_replacer, on='c4', how='left', suffixes=('', '_replacer'))
df['c2'] = np.where(df['c2_replacer'].notnull(), df['c2_replacer'], df['c2'])
df['c3'] = np.where(df['c3_replacer'].notnull(), df['c3_replacer'], df['c3'])
df = df.drop(['c2_replacer', 'c3_replacer'], axis=1)
# list of columns to update
cols=['c2', 'c3']
# set the index on column to use for matching the two DF
df.set_index('c4', inplace=True)
df_replacer.set_index('c4', inplace=True)
# use update to replace value in DF
df.update(df_replacer[cols])
# reset the index
df.reset_index()
c4 c1 c2 c3
0 A 10 11.0 99.0
1 NaN 11 110.0 110.0
2 NaN 12 120.0 120.0
3 B 13 22.0 299.0
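If you want to skip the merge entirely, a map-based lookup is another option. A minimal sketch, assuming the c4 values in df_replacer are unique (rows whose c4 is NaN keep their original values):
import pandas as pd
import numpy as np

df = pd.DataFrame({'c1': [10, 11, 12, 13],
                   'c2': [100, 110, 120, 130],
                   'c3': [100, 110, 120, 130],
                   'c4': ['A', np.nan, np.nan, 'B']})
df_replacer = pd.DataFrame({'c2': [11, 22], 'c3': [99, 299], 'c4': ['A', 'B']})

rep = df_replacer.set_index('c4')
for col in ['c2', 'c3']:
    # look up each c4 key; keep the original value where there is no match
    df[col] = df['c4'].map(rep[col]).fillna(df[col])
print(df)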
Consider these two dataframes:
index = [0, 1, 2, 3]
columns = ['col0', 'col1']
data = [['A', 'D'],
['B', 'E'],
['C', 'F'],
['A', 'D']
]
df1 = pd.DataFrame(data, index, columns)
df2 = pd.DataFrame(data = [10, 20, 30, 40], index = pd.MultiIndex.from_tuples([('A', 'D'), ('B', 'E'), ('C', 'F'), ('X', 'Z')]), columns = ['col2'])
I want to add a column to df1 that holds the value obtained by looking up each (col0, col1) pair in df2's index. The expected result would look like this:
index = [0, 1, 2, 3]
columns = ['col0', 'col1', 'col2']
data = [['A', 'D', 10],
['B', 'E', 20],
['C', 'F', 30],
['A', 'D', 10]
]
df3 = pd.DataFrame(data, index, columns)
What is the best way to achieve this? I am wondering whether it should be done with a dictionary and map, or whether there is something simpler. I'm unsure.
Merge normally:
pd.merge(df1, df2, left_on=["col0", "col1"], right_index=True, how="left")
Output:
col0 col1 col2
0 A D 10
1 B E 20
2 C F 30
3 A D 10
try this:
indexes = list(map(tuple, df1.values))
df1["col2"] = df2.loc[indexes].values
Output:
#print(df1)
col0 col1 col2
0 A D 10
1 B E 20
2 C F 30
3 A D 10
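The question also mentions a dictionary + map approach; a minimal sketch of that idea (pairs missing from df2 simply come out as NaN, matching the left merge):
import pandas as pd

index = [0, 1, 2, 3]
columns = ['col0', 'col1']
data = [['A', 'D'], ['B', 'E'], ['C', 'F'], ['A', 'D']]
df1 = pd.DataFrame(data, index, columns)
df2 = pd.DataFrame(data=[10, 20, 30, 40],
                   index=pd.MultiIndex.from_tuples([('A', 'D'), ('B', 'E'), ('C', 'F'), ('X', 'Z')]),
                   columns=['col2'])

# build a {(col0, col1): col2} dict from df2 and map each row's pair through it
lookup = df2['col2'].to_dict()   # MultiIndex entries become tuple keys
pairs = pd.Series(list(zip(df1['col0'], df1['col1'])), index=df1.index)
df1['col2'] = pairs.map(lookup)
print(df1)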
I have two dataframes df1 and df2
df1 = pd.DataFrame({'name': ['A', 'B', 'C'],
'value': [100, 300, 150]})
df2 = pd.DataFrame({'name': ['A', 'B', 'D'],
'value': [20, 50, 7]})
I want to combine these two dataframes into a new dataframe df3 that contains all rows from both.
Then I want a fourth dataframe df4 where the rows are aggregated into per-name sums, like this:
df4 = pd.DataFrame({'name': ['A', 'B', 'C', 'D'],
                    'value': [120, 350, 150, 7]})
How can I do this?
You can concatenate the DataFrames together then use a groupby and sum:
df3 = pd.concat([df1, df2])
df4 = df3.groupby('name').sum().reset_index()
Result of df4:
name value
0 A 120
1 B 350
2 C 150
3 D 7
Another way is to just use append (note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0):
df1.append(df2, ignore_index=True).groupby('name')['value'].sum().to_frame()
value
name
A 120
B 350
C 150
D 7
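On pandas 2.x, where append is gone, the same one-liner can be written with pd.concat; a minimal sketch:
import pandas as pd

df1 = pd.DataFrame({'name': ['A', 'B', 'C'], 'value': [100, 300, 150]})
df2 = pd.DataFrame({'name': ['A', 'B', 'D'], 'value': [20, 50, 7]})

# pd.concat replaces the removed DataFrame.append
df4 = pd.concat([df1, df2], ignore_index=True).groupby('name')['value'].sum().to_frame()
print(df4)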
I have a CSV file with only a single data row, but with many columns that share the same header (NOT duplicates). My final goal is to analyze the value of a given column depending on the value of the previous column with the same name (which is not the column adjacent to it).
My data might look like this:
| ***start block*** | stimulus | words.RT | words.ACC | ***end block*** | ***start block*** | stimulus | words.RT | words.ACC | ***end block*** |
|-------------------|----------|----------|-----------|-----------------|-------------------|----------|----------|-----------|-----------------|
|                   | pic1.png | 2300     | 1         |                 |                   | pic2.png | 2401     | 0         |                 |
and so forth.
Now, I would like to be able to analyze the values of e.g. words.RT depending on the value of words.ACC in the previous block.
I'm not sure what the best approach to this is. I tried loading the CSV into a pandas-dataframe:
import pandas as pd
file = "01.csv"
df = pd.read_csv(file, delimiter=";")
df.columns = df.columns.str.strip("\t")
df.columns = df.columns.str.strip(".34")
df = df.iloc[[0]]
which basically gives me a table looking like the one I showed before. Is it possible to split the row into multiple rows according to the blocks? To me, it looks like I would need a three-dimensional array to encode the blocks. Is that even possible with pandas?
You can create
df1 = df.iloc[ : , 0:4]
df2 = df.iloc[ : , 4:8]
and append them (on pandas 2.x, use pd.concat instead, since append was removed):
df = df1.append(df2)
import pandas as pd
data = {
'A1': [1,2],
'B1': [3,4],
'C1': [5,6],
'D1': [7,8],
'A2': [1,2],
'B2': [3,4],
'C2': [5,6],
'D2': [7,8],
}
df = pd.DataFrame(data)
print(df)
df1 = df.iloc[: , 0:4]
df1.columns = ['A', 'B', 'C', 'D']
df2 = df.iloc[: , 4:8]
df2.columns = ['A', 'B', 'C', 'D']
df = df1.append(df2)
df = df.reset_index(drop=True)
print(df)
If you have more blocks then you can use for-loop and
df.iloc[ : , i:i+4]
import pandas as pd
data = {
'A1': [1,2],
'B1': [3,4],
'C1': [5,6],
'D1': [7,8],
'A2': [1,2],
'B2': [3,4],
'C2': [5,6],
'D2': [7,8],
'A3': [1,2],
'B4': [3,4],
'C5': [5,6],
'D6': [7,8],
}
df = pd.DataFrame(data)
print(df)
# get first block
new_df = df.iloc[:, 0:4]
new_df.columns = ['A', 'B', 'C', 'D']
# get other blocks
for i in range(4, len(df.columns), 4):
    temp_df = df.iloc[:, i:i+4]
    temp_df.columns = ['A', 'B', 'C', 'D']
    new_df = new_df.append(temp_df)
new_df = new_df.reset_index(drop=True)
print(new_df)
EDIT:
The same but with variable block_size and numbers as column's names.
import pandas as pd
data = {
'A1': [1,2],
'B1': [3,4],
'C1': [5,6],
'D1': [7,8],
'A2': [1,2],
'B2': [3,4],
'C2': [5,6],
'D2': [7,8],
'A3': [1,2],
'B3': [3,4],
'C3': [5,6],
'D3': [7,8],
'A4': [1,2],
'B4': [3,4],
'C4': [5,6],
'D4': [7,8],
}
df = pd.DataFrame(data)
print(df)
block_size = 4
# get first block
new_df = df.iloc[:, 0:block_size]
# set numbers for columns
new_df.columns = list(range(block_size))
# get other blocks
for i in range(block_size, len(df.columns), block_size):
    temp_df = df.iloc[:, i:i+block_size]
    # set the same numbers for columns
    temp_df.columns = list(range(block_size))
    new_df = new_df.append(temp_df)
# after loop reset rows numbers (indexes)
new_df = new_df.reset_index(drop=True)
print(new_df)
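Note that DataFrame.append was removed in pandas 2.0. On newer versions, the same block-stacking loop can collect the pieces in a list and concatenate once at the end; a sketch under the same assumptions (equal-width blocks laid out side by side):
import pandas as pd

# same toy layout as above: four blocks of four columns each
data = {f'{col}{block}': vals
        for block in range(1, 5)
        for col, vals in zip('ABCD', ([1, 2], [3, 4], [5, 6], [7, 8]))}
df = pd.DataFrame(data)

block_size = 4
blocks = []
for i in range(0, len(df.columns), block_size):
    block = df.iloc[:, i:i + block_size].copy()
    block.columns = list(range(block_size))   # shared column names so the rows line up
    blocks.append(block)

# one concat at the end replaces the repeated DataFrame.append calls
new_df = pd.concat(blocks, ignore_index=True)
print(new_df)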
I want to assign to a new column called 'new_col' a CSV-like string built from the values of other columns.
Currently I do as follows :
df['new_col'] = (df['a'].map(str) + ',' + df['b'].map(str))
This works perfectly fine, but I want it to be generic: I want to feed a function a list of columns and let it build the string.
Of course I could loop through the list as follows :
lstColumns = ['a','b']
lstItems = []
for item in lstColumns:
    lstItems.append(df[item])
szChain = (',').join(lstItems)
But that's quite ugly, and I might have to use it on dataframes with more columns.
So is there any way to simplify this ?
You can use something like this:
df['new_col'] = df[df.columns].apply(
lambda x: ','.join(x.dropna().astype(str)),
axis=1
)
Apply a function row-wise (axis=1) to the dataframe. The function maps each value to a string and joins them with ", ":
cols = ["a", "b"]
df.apply(lambda x: ", ".join(map(str, x[cols])), axis=1)
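If there are no NaNs to worry about, another option (a sketch, not part of the answers above) is to cast only the requested columns to strings and aggregate row-wise:
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y'], 'c': [True, False]})
cols = ['a', 'b']

# stringify the selected columns and join each row with a comma
df['new_col'] = df[cols].astype(str).agg(','.join, axis=1)
print(df)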
You can use the version proposed by @Anshul Jindal, but there is also another alternative, which differs significantly in its output and may be useful if you have NaNs in your data.
import io
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['a', 'b', np.nan],
                   'b': [np.nan, 'e', 'f'],
                   'c': ['g', 'h', 'i'],
                   'd': ['j', np.nan, 'l']})
cols = ['a', 'b' ,'d']
# another approach, using temporary text buffer
with io.StringIO() as output:
    df[cols].to_csv(output, sep=',', index=False, header=False)
    output.seek(0)
    df = df.assign(new_col=output.readlines())
df.new_col = df.new_col.str.strip()
# approach proposed earlier
df = df.assign(new_col_2 = df[cols].apply(
lambda x: ','.join(x.dropna().astype(str)),
axis=1
))
print(df)
a b c d new_col new_col_2
0 a NaN g j a,,j a,j
1 b e h NaN b,e, b,e
2 NaN f i l ,f,l f,l
Plus quite surprising timing of the approaches:
import io
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['a', 'b', np.nan],
                   'b': [np.nan, 'e', 'f'],
                   'c': ['g', 'h', 'i'],
                   'd': ['j', np.nan, 'l']})
cols = ['a', 'b' ,'d']
def buffer_approach(df, cols_to_merge):
    with io.StringIO() as output:
        df[cols_to_merge].to_csv(output, sep=',', index=False, header=False)
        output.seek(0)
        df = df.assign(new_col=output.readlines())
    df.new_col = df.new_col.str.strip()
    return df
def pandas_approach(df, cols_to_merge):
    df = df.assign(new_col=df[cols_to_merge].apply(
        lambda x: ','.join(x.dropna().astype(str)),
        axis=1
    ))
    return df
print(timeit.repeat("buffer_approach(df, cols)", globals=globals(), repeat=5, number=1000))
print(timeit.repeat("pandas_approach(df, cols)", globals=globals(), repeat=5, number=1000))
[2.5745794447138906, 2.556944037321955, 2.5482078031636775, 2.2512022089213133, 2.0038619451224804]
[3.6452969149686396, 3.326099018100649, 3.5136850751005113, 3.9479835461825132, 3.4149401267059147]
Maybe I didn't understand your question correctly, but if you have a lot of columns you could do this:
cols_a = ['a1', 'a2', 'a3']
cols_b = ['b1', 'b2', 'b3']
cols_res = ['res1', 'res2', 'res3']
df = pd.DataFrame({i:[i, i] for i in (cols_a+cols_b+ cols_res)})
print(df)
a1 a2 a3 b1 b2 b3 res1 res2 res3
0 a1 a2 a3 b1 b2 b3 res1 res2 res3
1 a1 a2 a3 b1 b2 b3 res1 res2 res3
df[cols_res] = (df[cols_a].astype(str).values + ',' + df[cols_b].astype(str).values)
print(df)
a1 a2 a3 b1 b2 b3 res1 res2 res3
0 a1 a2 a3 b1 b2 b3 a1,b1 a2,b2 a3,b3
1 a1 a2 a3 b1 b2 b3 a1,b1 a2,b2 a3,b3
I have this data frame:
>> df = pd.DataFrame({'Place' : ['A', 'A', 'B', 'B', 'C', 'C'], 'Var' : ['All', 'French', 'All', 'German', 'All', 'Spanish'], 'Values' : [250, 30, 120, 12, 200, 112]})
>> df
Place Values Var
0 A 250 All
1 A 30 French
2 B 120 All
3 B 12 German
4 C 200 All
5 C 112 Spanish
It has a repeating pattern of two rows for every Place. I want to reshape it so it's one row per Place and the Var column becomes two columns, one for "All" and one for the other value.
Like so:
Place All Language Value
A 250 French 30
B 120 German 12
C 200 Spanish 112
A pivot table would make a column for each unique value, and I don't want that.
What's the reshaping method for this?
Because the data appears in an alternating pattern, we can conceptualize the transformation in two steps.
Step 1:
Go from
a,a,a
b,b,b
To
a,a,a,b,b,b
Step 2: drop redundant columns.
The following solution applies reshape to the values of the DataFrame; the arguments to reshape are (-1, df.shape[1] * 2), which says 'give me an array with twice as many columns and as many rows as needed'.
Then I hardwired the column indexes for the filter, [0, 1, 4, 5], based on your data layout. The resulting NumPy array has 4 columns, so we pass it into a DataFrame constructor along with the correct column names.
It is an unreadable solution that depends on the df layout and produces the columns in the wrong order:
import pandas as pd
df = pd.DataFrame({'Place' : ['A', 'A', 'B', 'B', 'C', 'C'], 'Var' : ['All', 'French', 'All', 'German', 'All', 'Spanish'], 'Values' : [250, 30, 120, 12, 200, 112]})
df = pd.DataFrame(df.values.reshape(-1, df.shape[1] * 2)[:,[0,1,4,5]],
columns = ['Place', 'All', 'Value', 'Language'])
A different approach:
df = pd.DataFrame({'Place' : ['A', 'A', 'B', 'B', 'C', 'C'], 'Var' : ['All', 'French', 'All', 'German', 'All', 'Spanish'], 'Values' : [250, 30, 120, 12, 200, 112]})
df1 = df.set_index('Place').pivot(columns='Var')
df1.columns = df1.columns.droplevel()
df1 = df1.set_index('All', append=True).stack().reset_index()
print(df1)
Output:
Place All Var 0
0 A 250.0 French 30.0
1 B 120.0 German 12.0
2 C 200.0 Spanish 112.0
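For comparison, here is a sketch (not from the answers above) that produces the exact column order from the question by splitting the 'All' rows from the language rows and merging them back on Place:
import pandas as pd

df = pd.DataFrame({'Place': ['A', 'A', 'B', 'B', 'C', 'C'],
                   'Var': ['All', 'French', 'All', 'German', 'All', 'Spanish'],
                   'Values': [250, 30, 120, 12, 200, 112]})

# rows with Var == 'All' carry the totals; the rest carry the language counts
totals = df.loc[df['Var'] == 'All', ['Place', 'Values']].rename(columns={'Values': 'All'})
langs = df.loc[df['Var'] != 'All'].rename(columns={'Var': 'Language', 'Values': 'Value'})
result = totals.merge(langs, on='Place')[['Place', 'All', 'Language', 'Value']]
print(result)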