efficient solution for reshaping a dataframe in pandas - python

I have a dataframe like
id col1 col2 col3 ......col25
1 a b c d ...........
2 d e f NA ........
3 a NA NA NA .......
What I want is:
id start end
1 a b
1 b c
1 c d
2 d e
2 e f
for names, row in data_final.iterrows():
    for i in range(0, 26):
        try:
            x = pd.Series([row["id"], row[i], row[i + 1]],
                          index=['id', 'start', 'end'])
            df1 = df1.append(x, ignore_index=True)
        except:
            break
This works, but it is definitely not the best solution, as its time complexity is too high.
I need a better, more efficient solution for this.

One way could be to stack to remove missing values, then groupby and zip to pair each element with the succeeding one. Then we just need to flatten the result with itertools.chain and create a dataframe:
from itertools import chain
import pandas as pd

l = [list(zip(v.values[:-1], v.values[1:])) for _, v in df.stack().groupby(level=0)]
pd.DataFrame(chain.from_iterable(l), columns=['start', 'end'])
start end
0 a b
1 b c
2 c d
3 d e
4 e f
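The desired output in the question also carries the id column, which the snippet above drops. A minimal variation of the same idea, assuming id is a regular column of df, could be:
from itertools import chain
import pandas as pd

# set id as the index so stack() only pairs the value columns,
# then carry the id along in each (id, start, end) tuple
s = df.set_index('id').stack()
l = [[(k, a, b) for a, b in zip(v.values[:-1], v.values[1:])]
     for k, v in s.groupby(level=0)]
pd.DataFrame(chain.from_iterable(l), columns=['id', 'start', 'end'])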

Related

Filter Pandas Dataframe using list, but making sure count of elements matches count in list

so I have a list
my_list = [1,1,2,3,4,4]
I have a dataframe that looks like this
col_1 col_2
a 1
b 1
c 2
d 3
e 3
f 4
g 4
h 4
I basically want a final dataframe like
col_1 col_2
a 1
b 1
c 2
d 3
f 4
g 4
Basically I can't use
my_df[my_df['col_2'].isin(my_list)]
since this will include all the rows. I want one matching row for each item on the list, so that the row counts match the counts in the list.
Use GroupBy.cumcount to build a counter on both the original and a helper DataFrame, then filter by inner join with DataFrame.merge:
my_list = [1,1,2,3,4,4]
df1 = pd.DataFrame({'col_2':my_list})
df1['g'] = df1.groupby('col_2').cumcount()
my_df['g'] = my_df.groupby('col_2').cumcount()
df = my_df.merge(df1).drop('g', axis=1)
print (df)
col_1 col_2
0 a 1
1 b 1
2 c 2
3 d 3
4 f 4
5 g 4
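For intuition, cumcount simply numbers the repeated occurrences of each value, so the inner join can pair the first 1 in my_df with the first 1 in the list, the second with the second, and so on. A minimal sketch of the counter alone:
df1 = pd.DataFrame({'col_2': [1, 1, 2, 3, 4, 4]})
# one counter per occurrence within each value
print (df1.groupby('col_2').cumcount().tolist())   # [0, 1, 0, 0, 0, 1]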

Copy above row values below

Probably easy, but I couldn't find it.
I want to copy a large chunk of data within a column where a concat wasn't filled in properly, so the NaN values sit below the real values.
Small example:
import numpy as np

df1 = {'col1': ['a', 'b', 'c', 'd', 'e', 'f', 'g', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]}
df1 = pd.DataFrame(data=df1)
Did this:
df1['col1'][7:14] = df1['col1'][0:7]
Worked fine.
But what about larger data sets where I don't know the index slicing? Is there a built-in function for this?
Try 1) not chaining the indexing, and 2) passing a NumPy array on assignment. Note that loc slices are label-inclusive, so both sides must cover exactly seven rows:
df1.loc[7:13, 'col1'] = df1.loc[0:6, 'col1'].values
Output:
col1
0 a
1 b
2 c
3 d
4 e
5 f
6 g
7 a
8 b
9 c
10 d
11 e
12 f
13 g
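For the larger-data case where the slice boundaries are not known in advance, one sketch (assuming a default RangeIndex and a single block of NaNs at the bottom that is no longer than the filled part) is to derive the boundaries from last_valid_index:
import numpy as np

df1 = pd.DataFrame({'col1': list('abcdefg') + [np.nan] * 7})

# locate the NaN tail instead of hard-coding the slice
n_valid = df1['col1'].last_valid_index() + 1   # rows holding real values
n_missing = len(df1) - n_valid                 # rows that are NaN
df1.loc[n_valid:, 'col1'] = df1['col1'].iloc[:n_missing].values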

creating a column in a pandas dataframe based on a condition

I have two lists
A = ['a','b','c','d','e']
B = ['c','e']
A dataframe with column
A
0 a
1 b
2 c
3 d
4 e
I wish to create an additional column marking the rows where the value in A matches an element of B.
A M
0 a
1 b
2 c match
3 d
4 e match
You can use loc, or numpy.where, with an isin condition:
df.loc[df.A.isin(B), 'M'] = 'match'
print (df)
A M
0 a NaN
1 b NaN
2 c match
3 d NaN
4 e match
Or:
import numpy as np

df['M'] = np.where(df.A.isin(B), 'match', '')
print (df)
A M
0 a
1 b
2 c match
3 d
4 e match
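Both variants rely on the boolean mask produced by isin; for intuition:
df = pd.DataFrame({'A': list('abcde')})
B = ['c', 'e']
# True for the rows whose value appears in B
print (df.A.isin(B).tolist())   # [False, False, True, False, True]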

Adding rows to a Pandas dataframe based on a list in a column (and vice versa)

I have this dataframe:
dfx = pd.DataFrame([[1,2],['A','B'],[['C','D'],'E']],columns=list('AB'))
A B
0 1 2
1 A B
2 [C, D] E
... that I want to transform in ...
A B
0 1 2
1 A B
2 C E
3 D E
... adding a row for each value contained in column A if it's a list.
What is the most pythonic way?
And vice versa: what if I want to group by a column (let's say B) and have in column A a list of the grouped values? (so the opposite of the example above)
Thanks in advance,
Gianluca
You have a mixed dataframe - int, str and list values together (very problematic, because many functions raise errors) - so first convert all the numeric values to str with where; the mask comes from to_numeric with errors='coerce', which converts non-numeric values to NaN:
dfx.A = dfx.A.where(pd.to_numeric(dfx.A, errors='coerce').isnull(), dfx.A.astype(str))
print (dfx)
A B
0 1 2
1 A B
2 [C, D] E
and then create a new DataFrame with np.repeat, flattening the list values with chain.from_iterable:
from itertools import chain
import numpy as np

df = pd.DataFrame({
    "B": np.repeat(dfx.B.values, dfx.A.str.len()),
    "A": list(chain.from_iterable(dfx.A))})
print (df)
A B
0 1 2
1 A B
2 C E
3 D E
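To see what np.repeat does here: each value of B is repeated as many times as the corresponding entry of A has elements (str.len returns the length of both strings and lists), so B stays aligned with the flattened A:
import numpy as np

# repeat counts per element: '1' and 'A' have length 1, ['C', 'D'] length 2
print (np.repeat(['2', 'B', 'E'], [1, 1, 2]))   # ['2' 'B' 'E' 'E']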
A pure pandas solution: convert column A to lists and build a new DataFrame with DataFrame.from_records. Then drop the original column A and join the stacked df:
df = pd.DataFrame.from_records(dfx.A.values.tolist(), index=dfx.index)
df = dfx.drop('A', axis=1).join(df.stack().rename('A')
                                  .reset_index(level=1, drop=True))[['A', 'B']]
print (df)
A B
0 1 2
1 A B
2 C E
2 D E
If you need lists, use groupby and apply with tolist:
print (df.groupby('B')['A'].apply(lambda x: x.tolist()).reset_index())
B A
0 2 [1]
1 B [A]
2 E [C, D]
but if you need a list only when the group has more than one value, an if..else is necessary:
print (df.groupby('B')['A'].apply(lambda x: x.tolist() if len(x) > 1 else x.values[0])
         .reset_index())
B A
0 2 1
1 B A
2 E [C, D]
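On newer pandas versions the same aggregation is commonly written with agg; a minimal equivalent of the tolist form above:
# equivalent to apply(lambda x: x.tolist())
print (df.groupby('B')['A'].agg(list).reset_index())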

pandas: dataframe from dict with comma separated values

I'm trying to create a DataFrame from a nested dictionary, where the values are in comma separated strings.
Each value is nested in a dict, such as:
dict = {"1":{
"event":"A, B, C"},
"2":{
"event":"D, B, A, C"},
"3":{
"event":"D, B, C"}
}
My desired output is:
A B C D
0 A B C NaN
1 A B C D
2 NaN B C D
All I have so far is converting the dict to dataframe and splitting the items in each list. But I'm not sure this is getting me any closer to my objective.
df = pd.DataFrame(dict)
Out[439]:
1 2 3
event A, B, C D, B, A, C D, B, C
In [441]: df.loc['event'].str.split(',').apply(pd.Series)
Out[441]:
0 1 2 3
1 A B C NaN
2 D B A C
3 D B C NaN
Any help is appreciated. Thanks
You can use a couple of comprehensions to massage the nested dict into a better format for creating a DataFrame that flags whether an entry for the column exists or not:
the_dict = {"1":{
"event":"A, B, C"},
"2":{
"event":"D, B, A, C"},
"3":{
"event":"D, B, C"}
}
df = pd.DataFrame([[{z:1 for z in y.split(', ')} for y in x.values()][0] for x in the_dict.values()])
>>> df
A B C D
0 1.0 1 1 NaN
1 1.0 1 1 1.0
2 NaN 1 1 1.0
Once you've made the DataFrame, you can simply loop through the columns and convert the existence flags into letters using the where method (below, where the value is NaN it is left as NaN; otherwise the column's letter is inserted):
for col in df.columns:
    df_mask = df[col].isnull()
    df[col] = df[col].where(df_mask, col)
>>> df
A B C D
0 A B C NaN
1 A B C D
2 NaN B C D
Based on #merlin's suggestion you can go straight to the answer within the comprehension:
df = pd.DataFrame([[{z:z for z in y.split(', ')} for y in x.values()][0] for x in the_dict.values()])
>>> df
A B C D
0 A B C NaN
1 A B C D
2 NaN B C D
From what you have (with the split modified a little to strip the extra spaces) as df1, you can probably just stack the result and use pd.crosstab() on the index and value columns:
df1 = df.loc['event'].str.split(r'\s*,\s*').apply(pd.Series)
df2 = df1.stack().rename('value').reset_index()
pd.crosstab(df2.level_0, df2.value)
# value A B C D
# level_0
# 1 1 1 1 0
# 2 1 1 1 1
# 3 0 1 1 1
This is not exactly as you asked for, but I imagine you may prefer this to your desired output.
To get exactly what you are looking for, you can add an extra column which is equal to the value column above and then unstack the index that contains the values:
df2 = df1.stack().rename('value').reset_index()
df2['value2'] = df2.value
df2.set_index(['level_0', 'value']).drop('level_1', axis = 1).unstack(level = 1)
# value2
# value A B C D
# level_0
# 1 A B C None
# 2 A B C D
# 3 None B C D
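Another compact route to the 0/1 indicator form, assuming the same the_dict as above, is Series.str.get_dummies, which splits each string on a separator and one-hot encodes the parts:
s = pd.Series({k: v['event'] for k, v in the_dict.items()})
print (s.str.get_dummies(sep=', '))
#    A  B  C  D
# 1  1  1  1  0
# 2  1  1  1  1
# 3  0  1  1  1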
