Creating lists out of multiple duplicate strings in pandas dataframe - python

I am trying to de-duplicate rows in pandas. I have millions of duplicated rows, which makes the data unsuitable for what I'm trying to do.
From this:
   col1  col2
0     1    23
1     1    47
2     1    58
3     1     9
4     1     4
I want to get this:
   col1                col2
0     1  [23, 47, 58, 9, 4]
I have managed to do it manually by writing individual scripts for each spreadsheet, but it would really be great to have a more generalized way of doing it.
So far I've tried:
def remove_duplicates(self, df):
    ids = df[self.key_field].unique()
    numdicts = []
    for i in ids:
        instdict = {self.key_field: i}
        for col in self.deduplicate_fields:
            # All rows sharing this key; the column values become one string.
            xf = df.loc[df[self.key_field] == i]
            instdict[col] = str(list(xf[col]))
        numdicts.append(instdict)
    for n in numdicts:
        print(pd.DataFrame(data=n, index=self.key_field))
    # Note: the collapsed frames are only printed; the original df is returned.
    return df
But unbelievably, this returns the same thing I started with.
The only way I've managed it so far is to create lists for each column manually, loop through the unique index keys of the dataframe, append all of the duplicates to a list, then zip the lists and build a dataframe from them.
However, this doesn't seem to work when there is an unknown number of columns that need to be de-duplicated.
Any better way of doing this would be appreciated!
Thanks in advance!

Is this what you are looking for when you need one column only:
df.groupby('col1')['col2'].apply(list).reset_index()
And for all other columns, use agg:
df.groupby('col1').agg(list).reset_index()
With agg you can also specify which columns to use:
df.groupby('col1')[['col2', 'col3']].agg(list).reset_index()

You can try the following:
df.groupby('col1').agg(lambda x: list(x))

For multiple columns it should look like this instead to avoid errors:
df.groupby('col1')[['col2','col3']].agg(lambda x: list(x)).reset_index()
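For reference, a minimal runnable sketch using the sample data from the question (the construction of df here is my assumption):

import pandas as pd

# Sample data from the question.
df = pd.DataFrame({'col1': [1, 1, 1, 1, 1],
                   'col2': [23, 47, 58, 9, 4]})

# Collapse duplicate keys into one row per key, gathering col2 into a list.
out = df.groupby('col1')['col2'].agg(list).reset_index()
print(out)
#    col1                col2
# 0     1  [23, 47, 58, 9, 4]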

Related

Store nth row elements in a list pandas dataframe

I am new to Python. Could you help with the following?
I have a dataframe as follows; a, d, f and g are column names, and the dataframe can be named df1:
    a   d    f    g
0  20  30   20   20
1   0   1  NaN  NaN
I need to put the second row of df1 into a list, without NaNs.
Ideally as follows:
x = [0, 1]
Select the second row using df1.iloc[1], then remove the NaN values with .dropna(), and finally convert the Series into a Python list with the .tolist() method.
Use:
x = df1.iloc[1].dropna().astype(int).tolist()
# x = [0, 1]
Check itertuples(). So you would have something like that:
for row in df1.itertuples():
    row[0]  # the row's index; the whole row is now a tuple you can work with
You can also use iloc and dropna() like this:
row_2 = df1.iloc[1].dropna().to_list()
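Putting it together, a minimal runnable sketch (the reconstructed df1 is an assumption based on the table above):

import numpy as np
import pandas as pd

# Rebuild the sample frame from the question.
df1 = pd.DataFrame({'a': [20, 0], 'd': [30, 1],
                    'f': [20, np.nan], 'g': [20, np.nan]})

# Second row -> drop NaNs -> cast back to int -> plain Python list.
x = df1.iloc[1].dropna().astype(int).tolist()
print(x)  # [0, 1]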

Transpose all rows in one column of dataframe to multiple columns based on certain conditions

I would like to convert one column of data to multiple columns in dataframe based on certain values/conditions.
Please find the code to generate the input dataframe
df1 = pd.DataFrame({'VARIABLE': ['studyid', 1, 'age_interview', 65,
                                 'Gender', '1.Male', '2.Female',
                                 'Ethnicity', '1.Chinese', '2.Indian', '3.Malay']})
The data looks like this (a single VARIABLE column):
         VARIABLE
0         studyid
1               1
2   age_interview
3              65
4          Gender
5          1.Male
6        2.Female
7       Ethnicity
8       1.Chinese
9        2.Indian
10        3.Malay
Please note that I may not know the column names in advance, but the data usually follows this format. What I have shown above is sample data; the real data might have around 600-700 columns arranged in this fashion.
What I would like to do is turn the values that start with a non-digit (a character) into new columns of a dataframe. It can be a new dataframe.
I attempted to write a for loop but it failed due to an error. Can you please help me achieve this outcome?
for i in range(3, len(df1)):
    # str(df1['VARIABLE'][i].contains('^\d'))
    if (df1['VARIABLE'][i].astype(str).contains('^\d') == True):
Through the above loop, I was trying to check whether the first character is a digit: if yes, retain it as a value (e.g. 1, 2, 3), and if it is a character (e.g. Gender, Ethnicity), create a new column. But I guess this is an incorrect and lengthy approach.
In the above example, the columns would be studyid, age_interview, Gender and Ethnicity.
The final output would look like this:
  studyid age_interview    Gender  Ethnicity
1       1            65    1.Male  1.Chinese
2    None          None  2.Female   2.Indian
3    None          None      None    3.Malay
Can you please let me know if there is an elegant approach to do this?
You can use groupby to do something like:
m = ~df1['VARIABLE'].str[0].str.isdigit().fillna(True)
new_df = (pd.DataFrame(df1.groupby(m.cumsum()).VARIABLE.apply(list)
                          .values.tolist())
            .set_index(0).T)
print(new_df.rename_axis(None, axis=1))
  studyid age_interview    Gender  Ethnicity
1       1            65    1.Male  1.Chinese
2    None          None  2.Female   2.Indian
3    None          None      None    3.Malay
Explanation: m is a helper series which helps separate the groups:
print(m.cumsum())
0     1
1     1
2     2
3     2
4     3
5     3
6     3
7     4
8     4
9     4
10    4
Then we group this helper series and apply list:
df1.groupby(m.cumsum()).VARIABLE.apply(list)
VARIABLE
1                                  [studyid, 1]
2                           [age_interview, 65]
3                   [Gender, 1.Male, 2.Female]
4    [Ethnicity, 1.Chinese, 2.Indian, 3.Malay]
Name: VARIABLE, dtype: object
At this point we have each group as a list with the column name as the first entry.
So we create a dataframe with this and set the first column as index and transpose to get our desired output.
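Put together as one runnable script (a sketch; casting to str before testing the first character is a slight variation on the code above that keeps the mask a plain boolean column):

import pandas as pd

df1 = pd.DataFrame({'VARIABLE': ['studyid', 1, 'age_interview', 65,
                                 'Gender', '1.Male', '2.Female',
                                 'Ethnicity', '1.Chinese', '2.Indian', '3.Malay']})

# True where a value starts with a non-digit, i.e. at a column name.
starts_block = ~df1['VARIABLE'].astype(str).str[0].str.isdigit()

# Each cumulative-sum value labels one "column name + its values" block.
blocks = df1.groupby(starts_block.cumsum())['VARIABLE'].apply(list)

# The first element of each block becomes the column name, the rest the rows.
new_df = pd.DataFrame(blocks.values.tolist()).set_index(0).T
print(new_df.rename_axis(None, axis=1))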
Use itertools.groupby and then construct pd.DataFrame:
import pandas as pd
import itertools
l = ['studyid', 1, 'age_interview', 65, 'Gender', '1.Male', '2.Female',
     'Ethnicity', '1.Chinese', '2.Indian', '3.Malay']
l = list(map(str, l))
# Runs alternate between non-digit-led names and digit-led values.
grouped = [list(g) for k, g in itertools.groupby(l, key=lambda x: x[0].isnumeric())]
# Pair each name run with the run of values that follows it.
d = {k[0]: v for k, v in zip(grouped[::2], grouped[1::2])}
pd.DataFrame.from_dict(d, orient='index').T
Output:
     Gender studyid age_interview  Ethnicity
0    1.Male       1            65  1.Chinese
1  2.Female    None          None   2.Indian
2      None    None          None    3.Malay

How to efficiently select the minimum value in a column grouped by index pandas?

I am trying to find the smallest value per index in pandas; however, the data set I am using is quite large. How would I most efficiently accomplish the following:
given a DataFrame:
index  col1  col2
i1        1     5
          2     6
i2        3     7
          4     8
How would I most efficiently get the smallest value in col1 grouped by index, i.e. [1, 3] or {'i1': 1, 'i2': 3}? The following is quite obviously a substandard implementation:
min_time = [frame.loc[index_val]['timestamp_ms'].min() for index_val in ['i1','i2']]
Thanks
Use groupby with aggregate min, then convert output to list or dictionary:
out = df.groupby(level=0)['col1'].min().tolist()
Or:
out = df.groupby(level=0)['col1'].min().to_dict()
A shorter option uses Series.min with the level argument (note: the level keyword in reductions was deprecated in pandas 1.3 and removed in 2.0, so prefer the groupby version above on recent versions):
out = df['col1'].min(level=0).tolist()
out = df['col1'].min(level=0).to_dict()
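A runnable sketch of the groupby route (the reconstructed frame is an assumption based on the question's table):

import pandas as pd

# An index with repeated labels, as in the question.
df = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [5, 6, 7, 8]},
                  index=['i1', 'i1', 'i2', 'i2'])

mins = df.groupby(level=0)['col1'].min()
print(mins.tolist())   # [1, 3]
print(mins.to_dict())  # {'i1': 1, 'i2': 3}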

Reindexing a pandas DataFrame using a dict (python3)

Is there a way, without use of loops, to reindex a DataFrame using a dict? Here is an example:
df = pd.DataFrame([[1,2], [3,4]])
dic = {0:'first', 1:'second'}
I want to apply something efficient to df for obtaining:
        0  1
first   1  2
second  3  4
Speed is important, as the index in the actual DataFrame I am dealing with has a huge number of unique values. Thanks
You need the rename function:
df.rename(index=dic)
#         0  1
# first   1  2
# second  3  4
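A short sketch of both routes; the index-mapping alternative is my addition for the speed concern, not part of the original answer (it assumes every label appears in the dict, since missing labels map to NaN):

import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]])
dic = {0: 'first', 1: 'second'}

# rename returns a new frame; it does not modify df in place.
df2 = df.rename(index=dic)

# Alternative: overwrite the index directly via Index.map.
df.index = df.index.map(dic)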

Apply function to pandas dataframe that returns multiple rows

I would like to apply a function to a pandas DataFrame that splits some of the rows into two. So for example, I may have this as input:
df = pd.DataFrame([{'one': 3, 'two': 'a'}, {'one': 5, 'two': 'b,c'}], index=['i1', 'i2'])
    one  two
i1    3    a
i2    5  b,c
And I want something like this as output:
      one  two
i1      3    a
i2_0    5    b
i2_1    5    c
My hope was that I could just use apply() on the data frame, calling a function that returns a dataframe with 1 or more rows itself, which would then get merged back together. However, this does not seem to work at all. Here is a test case where I am just trying to duplicate each row:
dfa = df.apply(lambda s: pd.DataFrame([s.to_dict(), s.to_dict()]), axis=1)
    one  two
i1  one  two
i2  one  two
So if I return a DataFrame, the column names of that DataFrame seem to become the contents of the rows. This is obviously not what I want.
There is another question on here that was solved by using .groupby(); however, I don't think that applies to my case, since I don't actually want to group by anything.
What is the correct way to do this?
You have a messed-up database (a comma-separated string where you should have separate columns). We first fix this:
df2 = pd.concat([df['one'],
                 pd.DataFrame(df.two.str.split(',').tolist(), index=df.index)],
                axis=1)
Which gives us something neater:
In [126]: df2
Out[126]:
    one  0     1
i1    3  a  None
i2    5  b     c
Now, we can just do
In [125]: df2.set_index('one').unstack().dropna()
Out[125]:
   one
0  3      a
   5      b
1  5      c
Adjusting the index (if desired) is trivial and left to the reader as an exercise.
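For completeness, a sketch of the index adjustment, plus an explode-based alternative that assumes pandas 0.25 or newer:

import pandas as pd

df = pd.DataFrame([{'one': 3, 'two': 'a'}, {'one': 5, 'two': 'b,c'}],
                  index=['i1', 'i2'])

# pandas 0.25+: split the comma string into a list, then give each
# list element its own row, repeating the other columns.
out = df.assign(two=df['two'].str.split(',')).explode('two')

# Disambiguate repeated index labels (i2 -> i2_0, i2_1).
counts = out.groupby(level=0).cumcount()
dup = out.index.duplicated(keep=False)
out.index = [f'{i}_{c}' if d else i
             for i, c, d in zip(out.index, counts, dup)]
print(out)
#       one two
# i1      3   a
# i2_0    5   b
# i2_1    5   c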
