I have a dataframe that contains two empty columns, and the following list of tuples:
l = [('l1', 0.966797), ('l1', 0.998047), ('l2', 0.978516), ('l2', 0.998047), ('l3', 0.972656)]
I want to add these values to the two empty columns of the dataframe. One way I know is to create a dataframe from the list, like below:
d = pd.DataFrame(l,columns=['s','p'])
But that creates a new dataframe; I need to add those values to an existing dataframe. Any suggestion is appreciated.
You can assign with double [] and the column names, or use join or concat:
import pandas as pd

df = pd.DataFrame({'column': range(5)})
print(df)
   column
0       0
1       1
2       2
3       3
4       4
l = [('l1', 0.966797), ('l1', 0.998047),
     ('l2', 0.978516), ('l2', 0.998047), ('l3', 0.972656)]

# use any ONE of the following (they are alternatives):
df[['s','p']] = pd.DataFrame(l, columns=['s','p'])
df = df.join(pd.DataFrame(l, columns=['s','p']))
df = pd.concat([df, pd.DataFrame(l, columns=['s','p'])], axis=1)
print(df)
   column   s         p
0       0  l1  0.966797
1       1  l1  0.998047
2       2  l2  0.978516
3       3  l2  0.998047
4       4  l3  0.972656
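One caveat worth adding (a minimal sketch, not from the answer above): both the double-[] assignment and join align on the index, so if the existing dataframe does not use the default RangeIndex, assign positionally via the underlying array instead:

```python
import pandas as pd

l = [('l1', 0.966797), ('l1', 0.998047),
     ('l2', 0.978516), ('l2', 0.998047), ('l3', 0.972656)]

# Existing dataframe with a non-default index; plain index alignment
# against a fresh RangeIndex would produce NaN for every row.
df = pd.DataFrame({'column': range(5)}, index=[10, 11, 12, 13, 14])

# Assign the raw values positionally to sidestep index alignment.
new = pd.DataFrame(l, columns=['s', 'p'])
df[['s', 'p']] = new.values
print(df)
```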
I've got a dataframe column colA containing lists, and I'm trying to populate a new column with the values from colA that aren't present in a secondary list.
d = {'colA': [['UVB', 'NER', 'GGR'], ['KO'], ['ERK1', 'ERK2'], []]}
df = pd.DataFrame(data=d)
The code I've tried is:
finaldf['colB'] = [i for i in list(finaldf.AllGenes) if i not in List]
But this just populates colB with the same list of values that's in colA.
Not totally clear what you want, but you can filter each row's list with apply:
d = {'colA': [['UVB', 'NER', 'GGR'], ['KO'], ['ERK1', 'ERK2'], []]}
df = pd.DataFrame(data=d)
"""
colA
0 [UVB, NER, GGR]
1 [KO]
2 [ERK1, ERK2]
3 []
"""
# filter
dont_include = ["NER", "ERK2"]
df["colB"] = df["colA"].apply(
    lambda col_a: [e for e in col_a if e not in dont_include]
)
"""
colA colB
0 [UVB, NER, GGR] [UVB, GGR]
1 [KO] [KO]
2 [ERK1, ERK2] [ERK1]
3 [] []
"""
Try using this.
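If the exclusion list is long, membership tests against a set are faster than against a list; a small variation on the apply above, using the same example data:

```python
import pandas as pd

d = {'colA': [['UVB', 'NER', 'GGR'], ['KO'], ['ERK1', 'ERK2'], []]}
df = pd.DataFrame(data=d)

# A set gives O(1) membership checks instead of O(n) list scans.
dont_include = {"NER", "ERK2"}
df["colB"] = df["colA"].apply(lambda xs: [e for e in xs if e not in dont_include])
print(df)
```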
How to convert following list to a pandas dataframe?
my_list = [["A","B","C"],["A","B","D"]]
And as an output I would like to have a dataframe like:
Index  A  B  C  D
1      1  1  1  0
2      1  1  0  1
You can craft Series and concatenate them:
my_list = [["A","B","C"],["A","B","D"]]

df = (pd.concat([pd.Series(1, index=l, name=i+1)
                 for i,l in enumerate(my_list)], axis=1)
        .T
        .fillna(0, downcast='infer')  # optional
      )
or with get_dummies:
df = pd.get_dummies(pd.DataFrame(my_list))
df = df.groupby(df.columns.str.split('_', n=1).str[-1], axis=1).max()
output:
   A  B  C  D
1  1  1  1  0
2  1  1  0  1
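On pandas 0.25+ the same reshaping can also be done with explode plus get_dummies (a sketch, not part of the original answer):

```python
import pandas as pd

my_list = [["A", "B", "C"], ["A", "B", "D"]]

# explode turns each inner list into one row per element,
# all sharing that list's index label (1 or 2).
s = pd.Series(my_list, index=range(1, len(my_list) + 1)).explode()

# One-hot encode the elements, then collapse back to one row per list.
df = pd.get_dummies(s).groupby(level=0).max().astype(int)
print(df)
```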
I'm unsure how those two structures relate: my_list is a list of two lists, ["A","B","C"] and ["A","B","D"].
If you want a data frame like the table you have, I would suggest making a dictionary of the values first, then converting it into a pandas dataframe.
my_dict = {"A":[1,1], "B":[1,1], "C": [1,0], "D":[0,1]}
my_df = pd.DataFrame(my_dict)
print(my_df)
Output:
   A  B  C  D
0  1  1  1  0
1  1  1  0  1
I just asked a similar question, rename columns according to list, which has a correct answer for how to add suffixes to column names. But I have a new issue: I want to rename the actual columns index name per dataframe. I have three lists of data frames, and some of the data frames share duplicated columns index names (some frames also appear in more than one list, but that's not the issue; the issue is the duplicated original columns.name values). I simply want to append a suffix from the suffix list to each dataframe's columns.name within each list, based on the list's numeric order.
Here is an example of the data and the output I would like:
# add string to end of x in list of dfs
import numpy as np
import pandas as pd

df1, df2, df3, df4 = (pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=('a', 'b')),
                      pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=('c', 'd')),
                      pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=('e', 'f')),
                      pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=('g', 'h')))
df1.columns.name = 'abc'
df2.columns.name = 'abc'
df3.columns.name = 'efg'
df4.columns.name = 'abc'
cat_a = [df2, df1]
cat_b = [df3, df2, df1]
cat_c = [df1]
dfs = [cat_a, cat_b, cat_c]
suffix = ['group1', 'group2', 'group3']
# expected output =
#for df in cat_a: df.columns.name = df.columns.name + 'group1'
#for df in cat_b: df.columns.name = df.columns.name + 'group2'
#for df in cat_c: df.columns.name = df.columns.name + 'group3'
And here is some code I have written that doesn't work: where columns.name values are duplicated across data frames, multiple suffixes get appended.
for x, df in enumerate(dfs):
    for i in df:
        n = [(i.columns.name + '_' + str(suffix[x])) for out in i.columns.name]
        i.columns.name = n[x]
Thank you for looking, I really appreciate it.
Your current code is not working as you have multiple references to the same df in your lists, so only the last change matters. You need to make copies.
Assuming you want to change the columns index name for each df in dfs, you can use a list comprehension:
dfs = [[d.rename_axis(suffix[i], axis=1) for d in group]
       for i, group in enumerate(dfs)]
output:
>>> dfs[0][0]
group1 c d
0 5 0
1 9 3
2 3 9
3 4 2
4 1 0
5 7 6
6 5 2
7 8 0
8 1 2
9 7 2
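If the goal is to append the suffix to the existing name (the expected output above shows 'abc' plus 'group1') rather than replace it, a copy-then-rename loop is one way. A sketch with just two frames, assuming the same shared-frame setup as the question:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=('a', 'b'))
df2 = pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=('c', 'd'))
df1.columns.name = 'abc'
df2.columns.name = 'abc'

cat_a = [df2, df1]
cat_b = [df1]          # df1 is shared between the two lists
dfs = [cat_a, cat_b]
suffix = ['group1', 'group2']

# Copy each frame so a frame shared between lists is not suffixed twice,
# then replace the columns Index with a renamed one (keeps the old name).
dfs = [[d.copy() for d in group] for group in dfs]
for group, suf in zip(dfs, suffix):
    for d in group:
        d.columns = d.columns.rename(f"{d.columns.name}_{suf}")

print(dfs[0][0].columns.name)
```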
I have a dataframe such as:
label column1
a 1
a 2
b 6
b 4
I would like to make a dataframe with a new column, with the opposite value from column1 where the labels match. Such as:
label column1 column2
a 1 2
a 2 1
b 6 4
b 4 6
I know this is probably very simple to do with a groupby command but I've been searching and can't find anything.
The following uses groupby and apply and seems to work okay:
import numpy as np
import pandas as pd

x = pd.DataFrame({'label': ['a', 'a', 'b', 'b'],
                  'column1': [1, 2, 6, 4]})
y = x.groupby('label').apply(
    lambda g: g.assign(column2=np.asarray(g.column1[::-1])))
y = y.reset_index(drop=True)  # optional: drop the weird MultiIndex
print(y)
You can try the code block below:
# create the dataframe
df = pd.DataFrame({'label': ['a', 'a', 'b', 'b'],
                   'column1': [1, 2, 6, 4]})

# group by label
a = df.groupby('label').first().reset_index()
b = df.groupby('label').last().reset_index()

# concat those groups to create column2
df2 = (pd.concat([b, a])
         .sort_values(by='label')
         .rename(columns={'column1': 'column2'})
         .reset_index()
         .drop('index', axis=1))

# merge with the original dataframe on row position (index); 'on' cannot
# be combined with index-based merging, so keep the left 'label' column
df = (df.merge(df2, left_index=True, right_index=True, suffixes=('', '_y'))
        [['label', 'column1', 'column2']])
Hope this helps
Assuming there are only pairs of labels, you could use the following as well:

# create dataframe
df = pd.DataFrame(data={'label': ['a', 'a', 'b', 'b'],
                        'column1': [1, 2, 6, 4]})

# iterate over the dataframe, identify the matching label and opposite value
for index, row in df.iterrows():
    newvalue = int(df[(df.label == row.label) & (df.column1 != row.column1)].column1.values[0])
    # set the value in the new column
    df.at[index, 'column2'] = newvalue

df.head()
You can use groupby with apply to create a new Series in reverse order:
df['column2'] = df.groupby('label')["column1"] \
                  .apply(lambda x: pd.Series(x[::-1].values)).reset_index(drop=True)
print(df)
   column1 label  column2
0        1     a        2
1        2     a        1
2        6     b        4
3        4     b        6
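The same reversal can also be written with transform, which returns values already aligned to the original index (a sketch of an alternative, not from the answer above):

```python
import pandas as pd

df = pd.DataFrame({'label': ['a', 'a', 'b', 'b'],
                   'column1': [1, 2, 6, 4]})

# transform keeps the original index, so no reset_index is needed;
# .values[::-1] reverses each group's values positionally.
df['column2'] = (df.groupby('label')['column1']
                   .transform(lambda s: s.values[::-1]))
print(df)
```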
I have a data frame that looks like this:
P Q L
1 2 3
2 3
4 5 6,7
The objective is to check whether there is any value in L; if so, extract the values in the L and P columns:
P  L
1  3
4  6
4  7
Note there might be more than one value in L; in the case of more than one value, I would need two rows.
Below is my current script; it cannot generate the expected result.
df2 = []
ego = None
other = None
newrow = []
for item in data_DF.iterrows():
    if item[1]["L"] is not None:
        ego = item[1]['P']
        other = item[1]['L']
        newrow = ego + other + "\n"
        df2.append(newrow)
data_DF2 = pd.DataFrame(df2)
First, you can extract all rows of the L and P columns where L is not missing like so:
df2 = df[~pd.isnull(df.L)].loc[:, ['P', 'L']].set_index('P')
Next, you can deal with the multiple values in some of the remaining L rows as follows:
df2 = df2.L.str.split(',', expand=True).stack()
df2 = df2.reset_index().drop('level_1', axis=1).rename(columns={0: 'L'}).dropna()
df2.L = df2.L.str.strip()
To explain: with P as index, the code splits the string content of the L column on ',' and distributes the individual elements across various columns. It then stacks the various new columns into a single new column, and cleans up the result.
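On pandas 0.25+ the split/stack steps above can be collapsed with explode (a sketch, assuming L holds comma-separated strings):

```python
import pandas as pd

df = pd.DataFrame({'P': [1, 2, 4], 'Q': [2, 3, 5],
                   'L': ['3', None, '6,7']})

# Drop rows with missing L, split on ',', then one row per element.
df2 = (df.dropna(subset=['L'])
         .assign(L=lambda d: d['L'].str.split(','))
         .explode('L')[['P', 'L']]
         .reset_index(drop=True))
print(df2)
```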
First I extract the multiple values of column L into a new Series s, with index values duplicated from the original index. Then I remove the unnecessary columns L and Q, join the output to the original df, and drop the rows with NaN values.
print(df)
   P  Q    L
0  1  2    3
1  2  3  NaN
2  4  5  6,7

s = df['L'].str.split(',').apply(pd.Series).stack()
s.index = s.index.droplevel(-1)  # to line up with df's index
s.name = 'L'
print(s)
0    3
2    6
2    7
Name: L, dtype: object

df = df.drop(['L', 'Q'], axis=1)
df = df.join(s)
print(df)
   P    L
0  1    3
1  2  NaN
2  4    6
2  4    7

df = df.dropna().reset_index(drop=True)
print(df)
   P  L
0  1  3
1  4  6
2  4  7
I was solving a similar issue when I needed to create a new dataframe as a subset of a larger dataframe. Here's how I went about generating the second dataframe:
import pandas as pd

df2 = pd.DataFrame(columns=['column1', 'column2'])
for i, row in df1.iterrows():
    if row['company_id'] == 12345 or row['company_id'] == 56789:
        df2 = pd.concat([df2, row.to_frame().T], ignore_index=True)
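For larger frames, the row-by-row loop can be replaced by boolean indexing; a sketch with hypothetical data matching the company_id values above:

```python
import pandas as pd

# Hypothetical stand-in for the larger dataframe; only company_id
# and the column names are taken from the example above.
df1 = pd.DataFrame({'company_id': [12345, 99999, 56789, 11111],
                    'column1': [1, 2, 3, 4],
                    'column2': [5, 6, 7, 8]})

# isin builds a boolean mask; .copy() detaches the subset from df1.
df2 = df1[df1['company_id'].isin([12345, 56789])].copy()
print(df2)
```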