Combining a list of tuple dataframes in Python

I have a large dataset where every two rows need to be grouped together and combined into one longer row, essentially duplicating the headers and appending the 2nd row to the 1st. Here is a small sample:
df = pd.DataFrame({'ID' : [1,1,2,2],'Var1': ['A', 2, 'C', 7], 'Var2': ['B', 5, 'D', 9]})
print(df)
   ID Var1 Var2
0   1    A    B
1   1    2    5
2   2    C    D
3   2    7    9
I will have to group the rows by 'ID', so I ran:
grouped = df.groupby(['ID'])
grp_lst = list(grouped)
This resulted in a list of tuples grouped by ID, where the second element of each tuple is the grouped dataframe I would like to combine.
The desired result is a dataframe that looks something like this:
ID Var1 Var2  ID.1 Var1.1 Var2.1
 1    A    B     1      2      5
 2    C    D     2      7      9
I have to do this over a large dataset, where 'ID' is used to group the rows and I then essentially want to append each group's bottom row to the end of its top row.
Any help would be appreciated; I assume there is a much easier way to do this than what I am doing.
Thanks in advance!

Let us try:
i = df.groupby('ID').cumcount().astype(str)
df_out = df.set_index([df['ID'].values, i]).stack().unstack([2, 1])
df_out.columns = df_out.columns.map('.'.join)
Details:
Group the dataframe on ID and use cumcount to create a sequential counter that uniquely identifies the rows per ID:
>>> i
0    0
1    1
2    0
3    1
dtype: object
Create a multilevel index in the dataframe, with the first level set to the ID values and the second level set to the above sequential counter, then use stack followed by unstack to reshape the dataframe into the desired format:
>>> df_out
  ID Var1 Var2 ID Var1 Var2   #---> Level 0 columns
   0    0    0  1    1    1   #---> Level 1 columns
1  1    A    B  1    2    5
2  2    C    D  2    7    9
Finally, flatten the multilevel columns using Index.map with '.'.join:
>>> df_out
  ID.0 Var1.0 Var2.0 ID.1 Var1.1 Var2.1
1    1      A      B    1      2      5
2    2      C      D    2      7      9
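For completeness, the list-of-tuples route from the question can also be finished directly. A minimal sketch, assuming exactly two rows per ID, that stacks each group's rows into one long row:
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 2],
                   'Var1': ['A', 2, 'C', 7],
                   'Var2': ['B', 5, 'D', 9]})

# Each groupby tuple is (key, group); stack each group into one long Series,
# then place the groups side by side and transpose so each group is a row.
wide = pd.concat(
    [g.reset_index(drop=True).stack() for _, g in df.groupby('ID')],
    axis=1).T
wide.columns = [f'{col}.{row}' for row, col in wide.columns]
print(wide)
#   ID.0 Var1.0 Var2.0 ID.1 Var1.1 Var2.1
# 0    1      A      B    1      2      5
# 1    2      C      D    2      7      9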

Here is another way: use numpy to reshape the dataframe's values first, then tile the columns, and create a new dataframe from the reshaped values and tiled columns:
import numpy as np

s = df.shape[1]
c = np.tile(df.columns, 2) + '.' + (np.arange(s * 2) // s).astype(str)
df_out = pd.DataFrame(df.values.reshape(-1, s * 2), columns=c)
>>> df_out
  ID.0 Var1.0 Var2.0 ID.1 Var1.1 Var2.1
0    1      A      B    1      2      5
1    2      C      D    2      7      9
Note: this method is only applicable if you have exactly two rows per ID and the ID column is already sorted.
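If those preconditions are not guaranteed in your data, a guard can be put in front of the reshape; the sort and the assertion below are assumptions added for illustration, not part of the original answer:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 2],
                   'Var1': ['A', 2, 'C', 7],
                   'Var2': ['B', 5, 'D', 9]})

df = df.sort_values('ID', kind='stable')    # make equal IDs contiguous
assert (df.groupby('ID').size() == 2).all(), 'need exactly two rows per ID'

s = df.shape[1]
c = np.tile(df.columns, 2) + '.' + (np.arange(s * 2) // s).astype(str)
df_out = pd.DataFrame(df.values.reshape(-1, s * 2), columns=c)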

Related

Compare values across multiple columns in pandas and count the instances in which the value in the last column is higher than the others

I have a DataFrame that looks like this:
[Image of DataFrame]
What I would like to do is compare the values in all four columns (A, B, C, and D) for every row, count the number of times D has a smaller value than A, B, or C in each row, and add that number to the 'Count' column. So, for instance, 'Count' should be 1 for the second and third rows, and 2 for the last row.
Thank you in advance!
You can vectorize the operation using the gt and sum methods along an axis:
df['Count'] = df[['A', 'B', 'C']].gt(df['D'], axis=0).sum(axis=1)
print(df)
# Output
   A  B  C  D  Count
0  1  2  3  4      0
1  4  3  2  1      3
2  2  1  4  3      1
In the future, please do not post data as an image.
Use a lambda function to compare 'D' against all the columns, then sum across the columns.
import pandas as pd

data = {'A': [1, 47, 4316, 8511],
        'B': [4, 1, 3, 4],
        'C': [2, 7, 9, 1],
        'D': [32, 17, 1, 0]}
df = pd.DataFrame(data)
# 'D' < 'D' is always False, so comparing 'D' against every column still
# yields the count of columns strictly greater than 'D'.
df['Count'] = df.apply(lambda x: x['D'] < x, axis=1).sum(axis=1)
Output:
      A  B  C   D  Count
0     1  4  2  32      0
1    47  1  7  17      1
2  4316  3  9   1      3
3  8511  4  1   0      3

Compare columns in Pandas between two unequal size Dataframes for condition check

I have two pandas DataFrames of unequal sizes. For example:
Df1
id  value
a       2
b       3
c      22
d       5
Df2
id  value
c      22
a       2
Now I want to extract from DF1 those rows which have the same id as in DF2. My first approach is to run 2 for loops, with something like:
x = []
for i in range(len(DF2)):
    for j in range(len(DF1)):
        if DF2['id'][i] == DF1['id'][j]:
            x.append(DF1.iloc[j])
Now this is okay, but for 2 files with 400,000 lines in one and 5,000 in the other, I need an efficient Pythonic+Pandas way.
import pandas as pd

data1 = {'id': ['a', 'b', 'c', 'd'],
         'value': [2, 3, 22, 5]}
data2 = {'id': ['c', 'a'],
         'value': [22, 2]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
finaldf = pd.concat([df1, df2], ignore_index=True)
Output after concat
  id  value
0  a      2
1  b      3
2  c     22
3  d      5
4  c     22
5  a      2
Final Output:
finaldf.drop_duplicates()
  id  value
0  a      2
1  b      3
2  c     22
3  d      5
You can concat the dataframes, then check which elements are duplicated, then drop_duplicates to keep just the first occurrence:
m = pd.concat((df1, df2))
m[m.duplicated('id', keep=False)].drop_duplicates()
  id  value
0  a      2
2  c     22
You can try this:
df = df1[df1.set_index(['id']).index.isin(df2.set_index(['id']).index)]
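The same idea also works without building indexes at all; a minimal sketch using isin on the column directly (a merge on 'id' would work as well):
import pandas as pd

df1 = pd.DataFrame({'id': ['a', 'b', 'c', 'd'], 'value': [2, 3, 22, 5]})
df2 = pd.DataFrame({'id': ['c', 'a'], 'value': [22, 2]})

# Keep only the rows of df1 whose id also appears in df2.
out = df1[df1['id'].isin(df2['id'])]
print(out)
#   id  value
# 0  a      2
# 2  c     22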

Adding rows to a Dataframe to unify the length of groups

I would like to add elements to specific groups in a Pandas DataFrame in a selective way. In particular, I would like to add zeros so that all groups have the same number of elements. The following is a simple example:
import pandas as pd
df = pd.DataFrame([[1,1], [2,2], [1,3], [2,4], [2,5]], columns=['key', 'value'])
df
   key  value
0    1      1
1    2      2
2    1      3
3    2      4
4    2      5
I would like to have the same number of elements per group (where grouping is by the key column). Group 2 has the most elements: three. However, group 1 has only two elements, so a zero should be added as follows:
   key  value
0    1      1
1    2      2
2    1      3
3    2      4
4    2      5
5    1      0
Note that the index does not matter.
You can create a new level of a MultiIndex with cumcount and then add the missing values by unstack/stack or reindex:
df = (df.set_index(['key', df.groupby('key').cumcount()])['value']
        .unstack(fill_value=0)
        .stack()
        .reset_index(level=1, drop=True)
        .reset_index(name='value'))
Alternative solution:
df = df.set_index(['key', df.groupby('key').cumcount()])
mux = pd.MultiIndex.from_product(df.index.levels, names=df.index.names)
df = df.reindex(mux, fill_value=0).reset_index(level=1, drop=True).reset_index()
print(df)
   key  value
0    1      1
1    1      3
2    1      0
3    2      2
4    2      4
5    2      5
If the order of the values is important:
df1 = df.set_index(['key', df.groupby('key').cumcount()])
mux = pd.MultiIndex.from_product(df1.index.levels, names=df1.index.names)
# get appended values
miss = mux.difference(df1.index).get_level_values(0)
# create helper df and add 0 to all columns of original df
df2 = pd.DataFrame({'key': miss}).reindex(columns=df.columns, fill_value=0)
# append to original df
df = pd.concat([df, df2], ignore_index=True)
print(df)
   key  value
0    1      1
1    2      2
2    1      3
3    2      4
4    2      5
5    1      0
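An alternative sketch that keeps the original row order and simply appends the missing zero rows at the end, using value_counts to see how many rows each key is short:
import pandas as pd

df = pd.DataFrame([[1, 1], [2, 2], [1, 3], [2, 4], [2, 5]],
                  columns=['key', 'value'])

# For each key, append (max group size - group size) zero-valued rows.
sizes = df['key'].value_counts()
pad = pd.DataFrame({'key': sizes.index.repeat(sizes.max() - sizes),
                    'value': 0})
df = pd.concat([df, pad], ignore_index=True)
print(df)
#    key  value
# 0    1      1
# 1    2      2
# 2    1      3
# 3    2      4
# 4    2      5
# 5    1      0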

Pandas Insert a row above the Index and the Series data in a Dataframe

I've been through several trials; nothing seems to work so far.
I have tried df.insert(0, "XYZ", 555), which seemed to work until it did not, for reasons I am not certain of.
I understand that the issue is that the Index is not considered a Series, and so df.iloc[0] does not allow you to insert data directly above the Index column.
I've also tried manually adding a first index with the value "XYZ" to the list of indices in the definition of the dataframe, but nothing has worked.
Thanks for your help.
A, B, C, D are my columns. range(5) is my index. I am trying to obtain the result below, with an arbitrary row starting with type and then a list of strings. Thanks.
      A          B          C          D
type  'string1'  'string2'  'string3'  'string4'
0
1
2
3
4
If you use Timestamps as the Index, adding a custom single row with its own custom index will throw an error:
ValueError: Cannot add integral value to Timestamp without offset. I am guessing it's due to the difference in operand types, as when subtracting an integer from a Timestamp? How could I fix this in a generic manner? Thanks!
If you want to insert a row before the first row, you can do it this way:
data:
In [57]: df
Out[57]:
  id  var
0  a    1
1  a    2
2  a    3
3  b    5
4  b    9
adding one row:
In [58]: df.loc[df.index.min() - 1] = ['z', -1]
In [59]: df
Out[59]:
   id  var
 0  a    1
 1  a    2
 2  a    3
 3  b    5
 4  b    9
-1  z   -1
sort index:
In [60]: df = df.sort_index()
In [61]: df
Out[61]:
   id  var
-1  z   -1
 0  a    1
 1  a    2
 2  a    3
 3  b    5
 4  b    9
optionally reset your index:
In [62]: df = df.reset_index(drop=True)
In [63]: df
Out[63]:
  id  var
0  z   -1
1  a    1
2  a    2
3  a    3
4  b    5
5  b    9
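Regarding the Timestamp comment above: df.index.min() - 1 relies on integer arithmetic on the index, which a DatetimeIndex rejects. A generic sketch that prepends a labelled row without any index arithmetic (the 'type' label mirrors the question's example):
import pandas as pd

df = pd.DataFrame({'id': ['a', 'a', 'a', 'b', 'b'],
                   'var': [1, 2, 3, 5, 9]})

# Build a one-row frame with the desired label and concatenate it in front;
# nothing is added to or subtracted from the existing index, so this also
# works when the index holds Timestamps.
new_row = pd.DataFrame([['z', -1]], columns=df.columns, index=['type'])
df = pd.concat([new_row, df])
print(df)
#      id  var
# type  z   -1
# 0     a    1
# 1     a    2
# 2     a    3
# 3     b    5
# 4     b    9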

Group by value of sum of columns with Pandas

I got lost in the Pandas docs and features trying to figure out a way to group a DataFrame by the values of the sums of its columns.
For instance, let's say I have the following data:
In [2]: dat = {'a':[1,0,0], 'b':[0,1,0], 'c':[1,0,0], 'd':[2,3,4]}
In [3]: df = pd.DataFrame(dat)
In [4]: df
Out[4]:
   a  b  c  d
0  1  0  1  2
1  0  1  0  3
2  0  0  0  4
I would like columns a, b, and c to be grouped, since they all have a sum equal to 1. The resulting DataFrame would have column labels equal to the common sum of the columns it grouped, like this:
   1  9
0  2  2
1  1  3
2  0  4
Any idea to point me in the right direction? Thanks in advance!
Here you go:
In [57]: df.groupby(df.sum(), axis=1).sum()
Out[57]:
   1  9
0  2  2
1  1  3
2  0  4

[3 rows x 2 columns]
df.sum() is your grouper. It sums over the 0 axis (the index), giving you the two groups: 1 (columns a, b, and c) and 9 (column d). You then group the columns (axis=1) and take the sum of each group.
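To make the grouper concrete, this is what df.sum() evaluates to for the sample data (the prompt number is illustrative):
In [58]: df.sum()
Out[58]:
a    1
b    1
c    1
d    9
dtype: int64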
Because pandas is designed with database concepts in mind, it really expects information to be stored together in rows, not in columns. Because of this, it's usually more elegant to do things row-wise. Here's how to solve your problem row-wise:
dat = {'a': [1, 0, 0], 'b': [0, 1, 0], 'c': [1, 0, 0], 'd': [2, 3, 4]}
df = pd.DataFrame(dat)
df = df.transpose()
df['totals'] = df.sum(1)
print(df.groupby('totals').sum().transpose())
# totals  1  9
# 0       2  2
# 1       1  3
# 2       0  4
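A note for newer pandas: DataFrame.groupby(..., axis=1) has been deprecated in recent versions, so the column-wise answer above can be rewritten with a transpose instead; a minimal sketch:
import pandas as pd

df = pd.DataFrame({'a': [1, 0, 0], 'b': [0, 1, 0],
                   'c': [1, 0, 0], 'd': [2, 3, 4]})

# Group the transposed rows (the original columns) by the column sums,
# then transpose back.
out = df.T.groupby(df.sum()).sum().T
print(out)
#    1  9
# 0  2  2
# 1  1  3
# 2  0  4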
