Consider the following CSV file, where there is a duplicate name in the "Name" column:
ID,Name,T,CA,I,C,IP
129,K1,1.2,64,386,5522,0.07
6,K1,1.1,3072,28800,6485,4.44
157,K2,1.1,512,1204,3257,0.37
I want to group the rows by name and record the I and C columns like this:
K1:
0 I 386 28800
1 C 5522 6485
K2:
0 I 1204
1 C 3257
I have written this code, which groups the rows by the Name column and builds a dictionary.
data = {'Value':[0,1]}
kernel_df = pd.DataFrame(data, index=['C','I'])
my_dict = {'dummy':kernel_df}
df = pd.read_csv('test.csv', usecols=['Name', 'I', 'C'])
for name, df_group in df.groupby('Name'):
    my_dict[name] = pd.DataFrame(df_group)
print(my_dict)
But the output is
{'dummy': Value
C 0
I 1, 'K1': Name I C
0 K1 386 5522
1 K1 28800 6485, 'K2': Name I C
2 K2 1204 3257}
As you can see, I and C end up as columns, so the number of rows for each key grows instead. That is the transpose of what I want. How can I fix that?
I think you need to select the columns and transpose. I don't use a dict comprehension, because in your code new DataFrames are added to an existing dict:
data = {'Value':[0,1]}
kernel_df = pd.DataFrame(data, index=['C','I'])
my_dict = {'dummy':kernel_df}
for name, df_group in df.groupby('Name'):
    my_dict[name] = df_group[['I', 'C']].T
print(my_dict['K1'])
0 1
I 386 28800
C 5522 6485
If a new column is necessary:
data = {'Value':[0,1]}
kernel_df = pd.DataFrame(data, index=['C','I'])
my_dict = {'dummy':kernel_df}
for name, df_group in df.groupby('Name'):
    my_dict[name] = df_group[['I', 'C']].T.rename_axis('g').reset_index()
print(my_dict['K1'])
g 0 1
0 I 386 28800
1 C 5522 6485
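By the way, if you do prefer a comprehension, you can still keep the existing entries by updating the dict in place; a minimal sketch under the same setup:
# builds the same transposed frames and merges them into my_dict,
# preserving the existing 'dummy' entry
my_dict.update({name: g[['I', 'C']].T for name, g in df.groupby('Name')})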
I have two dataframes.
first one: df1
df1 = pd.DataFrame({
    'Sample': ['Sam1', 'Sam2', 'Sam3'],
    'Value': ['ak,b,c,k', 'd,k,e,b,f,a', 'am,x,y,z,a']
})
df1
looks as:
Sample Value
0 Sam1 ak,b,c,k
1 Sam2 d,k,e,b,f,a
2 Sam3 am,x,y,z,a
second one: df2
df2 = pd.DataFrame({
    'Remove': ['ak', 'b', 'k', 'a', 'am']})
df2
Looks as:
Remove
0 ak
1 b
2 k
3 a
4 am
I want to remove the strings from df1['Value'] that match df2['Remove'].
Expected output is:
Sample Value
Sam1 c
Sam2 d,e,f
Sam3 x,y,z
This code did not help me
Any help, thanks
Using apply as a one-liner:
df1['Value'] = df1['Value'].str.split(',').apply(lambda x:','.join([i for i in x if i not in df2['Remove'].values]))
Output:
>>> df1
Sample Value
0 Sam1 c
1 Sam2 d,e,f
2 Sam3 x,y,z
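A small variant of the same idea, assuming the values carry no stray whitespace: converting df2['Remove'] to a set first avoids scanning the whole array once per element:
remove = set(df2['Remove'])  # O(1) membership tests
df1['Value'] = df1['Value'].str.split(',').apply(
    lambda x: ','.join(i for i in x if i not in remove))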
You can use apply() to remove items from the df1 Value column if they appear in the df2 Remove column.
import pandas as pd
df1 = pd.DataFrame({
    'Sample': ['Sam1', 'Sam2', 'Sam3'],
    'Value': ['ak,b,c,k', 'd,k,e,b,f,a', 'am,x,y,z,a']
})
df2 = pd.DataFrame({'Remove': ['ak', 'b', 'k', 'a', 'am']})
remove_list = df2['Remove'].values.tolist()
def remove_value(row, remove_list):
    keep_list = [val for val in row['Value'].split(',') if val not in remove_list]
    return ','.join(keep_list)
df1['Value'] = df1.apply(remove_value, axis=1, args=(remove_list,))
print(df1)
Sample Value
0 Sam1 c
1 Sam2 d,e,f
2 Sam3 x,y,z
This script will help you. Just iterate the data frame and take the set difference with the remove list, like this (note that the set difference preserves neither the original order nor the comma-joined string format):
for index, elements in enumerate(df1['Value']):
    elements = elements.split(',')
    # .at writes a list into a single cell without chained assignment
    df1.at[index, 'Value'] = list(set(elements) - set(df2['Remove']))
The complete code will be something like this:
import pandas as pd
df1 = pd.DataFrame({
    'Sample': ['Sam1', 'Sam2', 'Sam3'],
    'Value': ['ak,b,c,k', 'd,k,e,b,f,a', 'am,x,y,z,a']
})
df2 = pd.DataFrame({
    'Remove': ['ak', 'b', 'k', 'a', 'am']})
for index, elements in enumerate(df1['Value']):
    elements = elements.split(',')
    df1.at[index, 'Value'] = list(set(elements) - set(df2['Remove']))
print(df1)
output
Sample Value
0 Sam1 [c]
1 Sam2 [e, d, f]
2 Sam3 [y, x, z]
I have many DataFrames that I need to merge.
Let's say:
base: id constraint
1 'a'
2 'b'
3 'c'
df_1: id value constraint
1 1 'a'
2 2 'a'
3 3 'a'
df_2: id value constraint
1 1 'b'
2 2 'b'
3 3 'b'
df_3: id value constraint
1 1 'c'
2 2 'c'
3 3 'c'
If I try and merge all of them (it'll be in a loop), I get:
a = pd.merge(base, df_1, on=['id', 'constraint'], how='left')
b = pd.merge(a, df_2, on=['id', 'constraint'], how='left')
c = pd.merge(b, df_3, on=['id', 'constraint'], how='left')
id constraint value_x value_y value
1 'a' 1 NaN NaN
2 'b' NaN 2 NaN
3 'c' NaN NaN 3
The desired output would be:
id constraint value
1 'a' 1
2 'b' 2
3 'c' 3
I know about combine_first and it works, but I can't take this approach because it is thousands of times slower.
Is there a merge that can replace values in case of columns overlap?
It's somewhat similar to this question, with no answers.
Given your MCVE:
import pandas as pd
base = pd.DataFrame([1,2,3], columns=['id'])
df1 = pd.DataFrame([[1,1]], columns=['id', 'value'])
df2 = pd.DataFrame([[2,2]], columns=['id', 'value'])
df3 = pd.DataFrame([[3,3]], columns=['id', 'value'])
I would suggest concatenating your dataframes first (using a loop if needed):
df = pd.concat([df1, df2, df3])
And then merge:
pd.merge(base, df, on='id')
It yields:
id value
0 1 1
1 2 2
2 3 3
Update
Running the code with the new version of your question and the input provided by @Celius Stingher:
a = {'id':[1,2,3],'constrains':['a','b','c']}
b = {'id':[1,2,3],'value':[1,2,3],'constrains':['a','a','a']}
c = {'id':[1,2,3],'value':[1,2,3],'constrains':['b','b','b']}
d = {'id':[1,2,3],'value':[1,2,3],'constrains':['c','c','c']}
base = pd.DataFrame(a)
df1 = pd.DataFrame(b)
df2 = pd.DataFrame(c)
df3 = pd.DataFrame(d)
We get:
id constrains value
0 1 a 1
1 2 b 2
2 3 c 3
Which seems to be compliant with your expected output.
You can use ffill() for the purpose: concat with axis=1 aligns on the index, leaving the values on the diagonal and NaN elsewhere, so forward-filling across the columns lets the last column collect each row's value:
df_1 = pd.DataFrame({'val':[1]}, index=[1])
df_2 = pd.DataFrame({'val':[2]}, index=[2])
df_3 = pd.DataFrame({'val':[3]}, index=[3])
(pd.concat((df_1, df_2, df_3), axis=1)
   .ffill(axis=1)
   .iloc[:, -1]
)
Output:
1 1.0
2 2.0
3 3.0
Name: val, dtype: float64
For your new data:
base.merge(pd.concat((df1, df2, df3)),
           on=['id', 'constraint'],
           how='left')
output:
id constraint value
0 1 'a' 1
1 2 'b' 2
2 3 'c' 3
Conclusion: you are actually looking for the how='left' option in merge.
Based on the edit: if you must merge all the dataframes with base only:
import pandas as pd
a = {'id':[1,2,3],'constrains':['a','b','c']}
b = {'id':[1,2,3],'value':[1,2,3],'constrains':['a','a','a']}
c = {'id':[1,2,3],'value':[1,2,3],'constrains':['b','b','b']}
d = {'id':[1,2,3],'value':[1,2,3],'constrains':['c','c','c']}
base = pd.DataFrame(a)
df_1 = pd.DataFrame(b)
df_2 = pd.DataFrame(c)
df_3 = pd.DataFrame(d)
dataframes = [df_1,df_2,df_3]
for i in dataframes:
    base = base.merge(i, how='left', on=['id', 'constrains'])
# the merges leave value_x, value_y and value columns, each mostly NaN
summation = [col for col in base if col.startswith('value')]
# the row-wise sum collects the single non-null value per row, overwriting 'value'
base['value'] = base[summation].sum(axis=1)
# drop the leftover value_x/value_y columns, which still contain NaN
base = base.dropna(how='any', axis=1)
print(base)
Output:
id constrains value
0 1 a 1.0
1 2 b 2.0
2 3 c 3.0
Those who want to simply do a merge that overrides the values (which is my case) can achieve that using this function, which is really similar to Celius Stingher's answer.
The documented version is in the original gist.
import pandas as pa

def rmerge(left, right, **kwargs):
    # Function to flatten lists from http://rosettacode.org/wiki/Flatten_a_list#Python
    def flatten(lst):
        return sum(([x] if not isinstance(x, list) else flatten(x) for x in lst), [])

    # Set default for removing overlapping columns in "left" to be true
    myargs = {'replace': 'left'}
    myargs.update(kwargs)

    # Remove the replace key from the argument dict to be sent to
    # the pandas merge command
    kwargs = {k: v for k, v in myargs.items() if k != 'replace'}

    if myargs['replace'] is not None:
        # Generate a list of overlapping column names not associated with the join
        skipcols = set(flatten([v for k, v in myargs.items() if k in ['on', 'left_on', 'right_on']]))
        leftcols = set(left.columns)
        rightcols = set(right.columns)
        dropcols = list((leftcols & rightcols).difference(skipcols))
        # Remove the overlapping column names from the appropriate DataFrame
        if myargs['replace'].lower() == 'left':
            left = left.copy().drop(dropcols, axis=1)
        elif myargs['replace'].lower() == 'right':
            right = right.copy().drop(dropcols, axis=1)

    df = pa.merge(left, right, **kwargs)
    return df
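A usage sketch with hypothetical frames (the default replace='left' drops the overlapping non-key column from left, so right's values win instead of spawning _x/_y suffixes):
left = pa.DataFrame({'id': [1, 2, 3], 'value': [10, 20, 30]})
right = pa.DataFrame({'id': [2, 3], 'value': [200, 300]})
print(rmerge(left, right, on='id', how='left'))
#    id  value
# 0   1    NaN
# 1   2  200.0
# 2   3  300.0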
I have a data frame with a column called "Input", consisting of various numbers.
I created a dictionary that looks like this:
sampleDict = {
    "a": ["123", "456"],
    "b": ["789", "272"]
}
I am attempting to loop through column "Input" against this dictionary. If any of the values in the dictionary are found (123, 789, etc), I would like to create a new column in my data frame that signifies where it was found.
For example, I would like to create a column called "found" where the value is "a" when 456 was found in "Input", and "b" when 789 was found.
I tried the following code but my logic seems to be off:
for key in sampleDict:
    for p_key in df['Input']:
        if code in p_key:
            if code in sampleDict[key]:
                df = print(code)
print(df)
Use map by a dictionary built from the flattened lists; it is only necessary that all values in the lists are unique:
d = {k: oldk for oldk, oldv in sampleDict.items() for k in oldv}
print (d)
{'123': 'a', '456': 'a', '789': 'b', '272': 'b'}
df = pd.DataFrame({'Input':['789','456','100']})
df['found'] = df['Input'].map(d)
print (df)
Input found
0 789 b
1 456 a
2 100 NaN
If duplicated values in the lists are possible, use aggregation, e.g. by join in a first step, and map by the Series:
sampleDict = {
    "a": ["123", "456", "789"],
    "b": ["789", "272"]
}
df1 = pd.DataFrame([(k, oldk) for oldk, oldv in sampleDict.items() for k in oldv],
                   columns=['a', 'b'])
s = df1.groupby('a')['b'].apply(', '.join)
print (s)
a
123 a
272 b
456 a
789 a, b
Name: b, dtype: object
df = pd.DataFrame({'Input':['789','456','100']})
df['found'] = df['Input'].map(s)
print (df)
Input found
0 789 a, b
1 456 a
2 100 NaN
You can use collections.defaultdict to construct a mapping of list values to key(s). Data from @jezrael.
from collections import defaultdict
d = defaultdict(list)
for k, v in sampleDict.items():
    for w in v:
        d[w].append(k)
print(d)
defaultdict(list,
{'123': ['a'], '272': ['b'], '456': ['a'], '789': ['a', 'b']})
Then use pd.Series.map to map inputs to keys in a new series:
df = pd.DataFrame({'Input':['789','456','100']})
# use a plain dict so unmatched inputs map to NaN (a defaultdict would insert its default)
df['found'] = df['Input'].map(dict(d))
print(df)
Input found
0 789 [a, b]
1 456 [a]
2 100 NaN
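If you prefer comma-joined strings over lists, matching the format of the earlier answer, .str.join can be applied afterwards; missing inputs stay NaN:
df['found'] = df['Input'].map(dict(d)).str.join(', ')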
Create a mask using a list comprehension, then convert the list to an array and mask the true values in the search array:
import numpy as np

sampleDict = {
    "a": ["123", "456"],
    "b": ["789", "272"]
}
search = ['789', '456', '100']
# https://www.techbeamers.com/program-python-list-contains-elements/
# https://stackoverflow.com/questions/10274774/python-elegant-and-efficient-ways-to-mask-a-list
for key, item in sampleDict.items():
    print(item)
    mask = [x in search for x in item]  # True where the value appears in search
    arr = np.array(item)
    print(arr[mask])                    # the matched values for this key
Let's say I have a data frame with such column names:
['a','b','c','d','e','f','g']
And I would like to change the names from 'c' to 'f' (actually, add a string to each column name), so the whole data frame's column names would look like this:
['a','b','var_c_equal','var_d_equal','var_e_equal','var_f_equal','g']
Well, firstly I made a function that changes column names with the string I want:
df.rename(columns=lambda x: 'or_'+x+'_no', inplace=True)
But now I really want to understand how to implement something like this:
df.loc[:,'c':'f'].rename(columns=lambda x: 'var_'+x+'_equal', inplace=True)
You can use a list comprehension for that:
Code:
new_columns = ['var_{}_equal'.format(c) if c in 'cdef' else c for c in df.columns]
Test Code:
import pandas as pd
df = pd.DataFrame({'a':(1,2), 'b':(1,2), 'c':(1,2), 'd':(1,2)})
print(df)
df.columns = ['var_{}_equal'.format(c) if c in 'cdef' else c
              for c in df.columns]
print(df)
Results:
a b c d
0 1 1 1 1
1 2 2 2 2
a b var_c_equal var_d_equal
0 1 1 1 1
1 2 2 2 2
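One caveat: c in 'cdef' is a substring test, so it only behaves for single-character column names (a column named 'cd' would also match). For multi-character names, a set is a safer guard:
targets = {'c', 'd', 'e', 'f'}
df.columns = ['var_{}_equal'.format(c) if c in targets else c
              for c in df.columns]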
One way is to use a dictionary instead of an anonymous function. The first two variations below assume the columns you need to rename are contiguous.
Contiguous columns by position
d = {k: 'var_'+k+'_equal' for k in df.columns[2:6]}
df = df.rename(columns=d)
Contiguous columns by name
If you need to calculate the numerical indices:
cols = df.columns.get_loc
d = {k: 'var_'+k+'_equal' for k in df.columns[cols('c'):cols('f')+1]}
df = df.rename(columns=d)
Specifically identified columns
If you want to provide the columns explicitly:
d = {k: 'var_'+k+'_equal' for k in 'cdef'}
df = df.rename(columns=d)
I have a csv file.
index value d F
0 975 25.35 5
1 976 26.28 4
2 977 26.24 1
3 978 25.76 0
4 979 26.08 0
I created a dataframe from the CSV file this way:
df = pd.read_csv("ThisFileL.csv")
I want to construct a new DataFrame my own way, by copying the 2nd column ("value") three times.
data = pd.DataFrame()
data.add(df.value)
data.add(df.value)
data.add(df.value)
But it didn't work out. How can I do that?
Have you tried data['value1'] = data['value'], data['value2'] = data['value'], etc.? It should create new columns holding the numbers in value.
You can do it by assigning the 'value' column data to new columns of the DataFrame.
df = pd.read_csv("ThisFileL.csv" , sep=' ')
df['value1'] = df.value
df['value2'] = df.value
The output of this would have the following column headings.
index | value | d | F | value1 | value2
Creating a new column in a DataFrame is pretty straightforward. df[column_label] = values
What you're going to have to do is come up with some good names for your columns. I'll use a, b and c in this example.
df = pd.read_csv("ThisFileL.csv")
new_df = pd.DataFrame()
for key in ('a', 'b', 'c'):
    new_df[key] = df['value']
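An equivalent one-step alternative is pd.concat with keys (a sketch, keeping the same hypothetical column names):
new_df = pd.concat([df['value']] * 3, axis=1, keys=['a', 'b', 'c'])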