Can't create jagged dataframe in pandas? - python

I have a simple dataframe with 2 columns and 2 rows.
I also have a list of 4 numbers.
I want to concatenate this list to the FIRST column of the dataframe, and only the first, so that the dataframe has 6 rows in the first column and 2 in the second.
I wrote this code:
import pandas as pd
df1 = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
numbers = [5, 6, 7, 8]
for i in range(0, 4):
    df1['A'].loc[i + 2] = numbers[i]
print(df1)
Oddly enough, it prints the original dataframe. But when I debug and evaluate the expression df1['A'], it does show the new numbers. What's going on here?
It's not just that it prints the original df; it also writes the original df to CSV when I use the to_csv method.

It seems you need:
for i in range(0, 4):
    df1.loc[0, i] = numbers[i]
print(df1)
   A  B    0    1    2    3
0  1  2  5.0  6.0  7.0  8.0
1  3  4  NaN  NaN  NaN  NaN
Or, starting again from the original df1, with concat:
df1 = pd.concat([df1, pd.DataFrame([numbers], index=[0])], axis=1)
print(df1)
   A  B    0    1    2    3
0  1  2  5.0  6.0  7.0  8.0
1  3  4  NaN  NaN  NaN  NaN
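The answer above places the numbers in new columns. If the goal is the layout described in the question (six values in A, two in B), a different construction is needed. The loop in the question enlarges the Series returned by df1['A'] rather than df1 itself, which is presumably why printing df1 shows no change. The following is a minimal sketch, not the answerer's method: it extends column A as a standalone Series and lets a column-wise concat pad B with NaN. The names col_a and result are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4]], columns=list("AB"))
numbers = [5, 6, 7, 8]

# Extend column A with the extra values; ignore_index renumbers the rows 0..5.
# The concatenated Series loses its name, so restore it with rename.
col_a = pd.concat([df1["A"], pd.Series(numbers)], ignore_index=True).rename("A")

# A column-wise concat aligns on the row index, so B is padded with NaN
# for the rows that exist only in the extended column A.
result = pd.concat([col_a, df1["B"]], axis=1)
print(result)
```

A DataFrame is always rectangular, so the "jagged" shape is represented by NaN in the shorter column, which also turns B into a float column.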

Related

How to replace DataFrame.append with pd.concat to append a Series as row?

I have a data frame with numeric values, such as
import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])
and I append a single row with all the column sums
totals = df.sum()
totals.name = 'totals'
df_append = df.append(totals)
Simple enough.
Here are the values of df, totals, and df_append
>>> df
A B
0 1 2
1 3 4
>>> totals
A 4
B 6
Name: totals, dtype: int64
>>> df_append
A B
0 1 2
1 3 4
totals 4 6
Unfortunately, in newer versions of pandas the method DataFrame.append is deprecated and will be removed in some future version of pandas. The advice is to replace it with pandas.concat.
Now, using pd.concat naively as follows
df_concat_bad = pd.concat([df, totals])
produces
>>> df_concat_bad
A B 0
0 1.0 2.0 NaN
1 3.0 4.0 NaN
A NaN NaN 4.0
B NaN NaN 6.0
Apparently, with df.append the Series object got interpreted as a row, but with pd.concat it got interpreted as a column.
You cannot fix this with something like calling pd.concat with axis=1, because that would add the totals as column:
>>> pd.concat([df, totals], axis=1)
A B totals
0 1.0 2.0 NaN
1 3.0 4.0 NaN
A NaN NaN 4.0
B NaN NaN 6.0
(In this case, the result looks the same as using the default axis=0, because the indexes of df and totals are disjoint, as are their column names.)
How to handle this (elegantly and efficiently)?
The solution is to convert totals (a Series object) to a DataFrame (which will then be a column) using to_frame() and next transpose it using T:
df_concat_good = pd.concat([df, totals.to_frame().T])
yields the desired
>>> df_concat_good
A B
0 1 2
1 3 4
totals 4 6
I prefer to use df.loc rather than pd.concat to solve this problem:
df.loc["totals"]=df.sum()
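As a self-contained check of the two suggestions above (pd.concat with to_frame().T, and assignment through df.loc), the sketch below builds both results from the same input. The names df_concat and df_loc are illustrative, and the copy is made only so that the original df is left unchanged.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])
totals = df.sum().rename("totals")

# Series -> one-column DataFrame -> transpose into a one-row DataFrame,
# whose row label is the Series name.
df_concat = pd.concat([df, totals.to_frame().T])

# Assigning to a new label with .loc enlarges the frame in place,
# so work on a copy here.
df_loc = df.copy()
df_loc.loc["totals"] = df.sum()

print(df_concat)
print(df_loc)
```

pd.concat returns a new object, while the .loc assignment mutates the frame it is applied to, which may matter if the original data is needed afterwards.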

align two pandas dataframes on values in one column, otherwise insert NA to match row number

I have two pandas DataFrames (df1, df2) with a different number of rows and columns and some matching values in a specific column in each df, with caveats (1) there are some unique values in each df, and (2) there are different numbers of matching values across the DataFrames.
Baby example:
df1 = pd.DataFrame({'id1': [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 6, 6]})
df2 = pd.DataFrame({'id2': [1, 1, 2, 2, 2, 2, 3, 4, 5],
'var1': ['B', 'B', 'W', 'W', 'W', 'W', 'H', 'B', 'A']})
What I am seeking to do is create df3 where df2['id2'] is aligned/indexed to df1['id1'], such that:
NaN is added to df3[id2] when df2[id2] has fewer (or missing) matches to df1[id1]
NaN is added to df3[id2] & df3[var1] if df1[id1] exists but has no match to df2[id2]
'var1' is filled in for all cases of df3[var1] where df1[id1] and df2[id2] match
rows are dropped when df2[id2] has more matching values than df1[id1] (or no matches at all)
The resulting DataFrame (df3) should look as follows (Notice id2 = 5 and var1 = A are gone):
id1  id2  var1
1    1    B
1    1    B
1    NaN  B
2    2    W
2    2    W
3    3    H
3    NaN  H
3    NaN  H
3    NaN  H
4    4    B
6    NaN  NaN
6    NaN  NaN
I cannot find a combination of merge/join/concatenate/align that correctly solves this problem. Currently, everything I have tried stacks the rows in sequence without adding NaN in the proper cells/rows and instead adds all the NaN values at the bottom of df3 (so id1 and id2 never align). Any help is greatly appreciated!
You can first assign a helper column for id1 and id2 based on groupby.cumcount, then merge. Finally, fill in var1 for every row by mapping id1 to the matching var1 value in df2:
def helper(data, col): return data.groupby(col).cumcount()
out = df1.assign(k=helper(df1, ['id1'])).merge(df2.assign(k=helper(df2, ['id2'])),
    left_on=['id1', 'k'], right_on=['id2', 'k'], how='left').drop(columns='k')
out['var1'] = out['id1'].map(dict(df2[['id2','var1']].drop_duplicates().to_numpy()))
Or, similarly but without assign, as HenryEcker suggests:
out = df1.merge(df2, left_on=['id1', helper(df1, ['id1'])],
    right_on=['id2', helper(df2, ['id2'])], how='left').drop(columns='key_1')
out['var1'] = out['id1'].map(dict(df2[['id2','var1']].drop_duplicates().to_numpy()))
print(out)
id1 id2 var1
0 1 1.0 B
1 1 1.0 B
2 1 NaN B
3 2 2.0 W
4 2 2.0 W
5 3 3.0 H
6 3 NaN H
7 3 NaN H
8 3 NaN H
9 4 4.0 B
10 6 NaN NaN
11 6 NaN NaN
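The steps above can also be written out as a self-contained script. This is a sketch of the same cumcount-then-merge idea, with one swapped-in detail: var1 is filled through a lookup Series (the name lookup is illustrative) instead of a dict built with to_numpy.

```python
import pandas as pd

df1 = pd.DataFrame({"id1": [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 6, 6]})
df2 = pd.DataFrame({"id2": [1, 1, 2, 2, 2, 2, 3, 4, 5],
                    "var1": ["B", "B", "W", "W", "W", "W", "H", "B", "A"]})

# Number each repeat of an id: the first occurrence gets 0, the second 1, ...
left = df1.assign(k=df1.groupby("id1").cumcount())
right = df2.assign(k=df2.groupby("id2").cumcount())

# A left merge on (id, occurrence number) keeps every row of df1 and pairs
# the n-th occurrence in df1 with the n-th occurrence in df2, if there is one.
out = left.merge(right, left_on=["id1", "k"], right_on=["id2", "k"],
                 how="left").drop(columns="k")

# Fill var1 for every id1 that appears anywhere in df2.
lookup = df2.drop_duplicates("id2").set_index("id2")["var1"]
out["var1"] = out["id1"].map(lookup)
print(out)
```

Because the merge is a left join, surplus rows in df2 (the extra id2 = 2 rows and id2 = 5) never reach the result.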

Append list to pandas DataFrame as new row with index

Despite the numerous Stack Overflow questions on appending data to a dataframe, I could not really find an answer to the following.
I am looking for a straightforward solution to append a list as the last row of a dataframe.
Imagine I have a simple dataframe:
indexlist=['one']
columnList=list('ABC')
values=np.array([1,2,3])
# take care: values is a 1-D array of shape (3,).
# a row has to be 1x3, so we have to reshape it
values=values.reshape(1,3)
df3=pd.DataFrame(values,index=indexlist,columns=columnList)
print(df3)
A B C
one 1 2 3
After some operations I get the following list:
listtwo=[4,5,6]
I want to append it at the end of the dataframe.
I change that list into a series:
oseries=pd.Series(listtwo)
print(type(oseries))
oseries.name="two"
now, this does not work:
df3.append(oseries)
since it gives:
       A    B    C    0    1    2
one  1.0  2.0  3.0  NaN  NaN  NaN
two  NaN  NaN  NaN  4.0  5.0  6.0
I would like to have the values under A B and C.
I also tried:
df3.append(oseries, columns=list('ABC'))   # not working
df3.append(oseries, ignore_index=True)     # works, but wrong result
df3.append(oseries, ignore_index=False)    # works, but wrong result
df3.loc[oseries.name] = oseries            # adds a row with NaN values
What I am looking for is:
a) how can I add a list under a particular index name?
b) how can I simply add a row of values from a list even if I don't have a name for the index (leave it empty)?
Either assign in-place with loc:
df.loc['two'] = [4, 5, 6]
# df.loc['two', :] = [4, 5, 6]
df
A B C
one 1 2 3
two 4 5 6
Or, use df.append with the second argument being a Series object having appropriate index and name:
s = pd.Series(dict(zip(df.columns, [4, 5, 6]))).rename('two')
df2 = df.append(s)
df2
A B C
one 1 2 3
two 4 5 6
If you are appending to a DataFrame without an index (i.e., having a numeric index), you can use loc after finding the max of the index and incrementing by 1:
df4 = pd.DataFrame(np.array([1,2,3]).reshape(1,3), columns=list('ABC'))
df4
A B C
0 1 2 3
df4.loc[df4.index.max() + 1, :] = [4, 5, 6]
df4
A B C
0 1.0 2.0 3.0
1 4.0 5.0 6.0
Or, using append with ignore_index=True:
df4.append(pd.Series(dict(zip(df4.columns, [4, 5, 6]))), ignore_index=True)
A B C
0 1 2 3
1 4 5 6
Without index
lst1 = [1,2,3]
lst2 = [4,5,6]
p1 = pd.DataFrame([lst1])
p2 = p1.append([lst2], ignore_index = True)
p2.columns = list('ABC')
p2
A B C
0 1 2 3
1 4 5 6
With index
lst1 = [1,2,3]
lst2 = [4,5,6]
p1 = pd.DataFrame([lst1], index = ['one'], columns = list('ABC'))
p2 = p1.append(pd.DataFrame([lst2], index = ['two'], columns = list('ABC')))
p2
A B C
one 1 2 3
two 4 5 6
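DataFrame.append was removed in pandas 2.0, so the append-based answers above no longer run on current versions. The following is a minimal sketch of the same two cases using pd.concat; the names row, with_label and no_label are illustrative. The key point carries over from the answers: the new row must carry the same column labels as the target frame, otherwise the values land in new columns.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3]], index=["one"], columns=list("ABC"))
values = [4, 5, 6]

# a) Append under a particular index name: wrap the list in a one-row
#    DataFrame that reuses the target's column labels.
row = pd.DataFrame([values], index=["two"], columns=df.columns)
with_label = pd.concat([df, row])

# b) Append without a name: ignore_index=True discards the existing labels
#    and renumbers the rows 0, 1, ...
no_label = pd.concat([df, pd.DataFrame([values], columns=df.columns)],
                     ignore_index=True)

print(with_label)
print(no_label)
```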

Pandas: move row (index and values) from last to first [duplicate]

I would like to move an entire row (index and values) from the last row to the first row of a DataFrame. Every other example I can find either uses an ordered row index (to be specific - my row index is not a numerical sequence - so I cannot simply add at -1 and then reindex with +1) or moves the values while maintaining the original index. My DF has descriptions as the index and the values are discrete to the index description.
I'm adding a row and then would like to move that into row 1. Here is the setup:
df = pd.DataFrame({
'col1' : ['A', 'A', 'B', 'F', 'D', 'C'],
'col2' : [2, 1, 9, 8, 7, 4],
'col3': [0, 1, 9, 4, 2, 3],
}).set_index('col1')
#output
In [7]: df
Out[7]:
col2 col3
col1
A 2 0
A 1 1
B 9 9
F 8 4
D 7 2
C 4 3
I then add a new row as follows:
df.loc["Deferred Description"] = pd.Series([''])
In [9]: df
Out[9]:
col2 col3
col1
A 2.0 0.0
A 1.0 1.0
B 9.0 9.0
F 8.0 4.0
D 7.0 2.0
C 4.0 3.0
Deferred Description NaN NaN
I would like the resulting output to be:
In [9]: df
Out[9]:
col2 col3
col1
Deferred Description NaN NaN
A 2.0 0.0
A 1.0 1.0
B 9.0 9.0
F 8.0 4.0
D 7.0 2.0
C 4.0 3.0
I've tried using df.shift(), but only the values shift. I've also tried df.sort_index(), but that requires the index to be ordered (there are several SO examples using df.loc[-1] = ... and then reindexing with df.index = df.index + 1). In my case I need the Deferred Description row to be the first row.
Your problem is not one of cyclic shifting, but a simpler one—one of insertion (which is why I've chosen to mark this question as duplicate).
Construct an empty DataFrame and then concatenate the two using pd.concat.
pd.concat([pd.DataFrame(columns=df.columns, index=['Deferred Description']), df])
col2 col3
Deferred Description NaN NaN
A 2 0
A 1 1
B 9 9
F 8 4
D 7 2
C 4 3
If this were columns, it'd have been easier. Funnily enough, pandas has a DataFrame.insert function that works for columns, but not rows.
Generalized Cyclic Shifting
If you are curious how to cyclically shift a DataFrame, you can use np.roll.
# apply this fix to your existing DataFrame
pd.DataFrame(np.roll(df.values, 1, axis=0),
index=np.roll(df.index, 1), columns=df.columns
)
col2 col3
Deferred Description NaN NaN
A 2 0
A 1 1
B 9 9
F 8 4
D 7 2
C 4 3
This, thankfully, also works when you have duplicate index values. If the index or columns aren't important, then pd.DataFrame(np.roll(df.values, 1, axis=0)) works well enough.
You can use append:
pd.DataFrame({'col2':[np.nan],'col3':[np.nan]},index=["Deferred Description"]).append(df)
Out[294]:
col2 col3
Deferred Description NaN NaN
A 2.0 0.0
A 1.0 1.0
B 9.0 9.0
F 8.0 4.0
D 7.0 2.0
C 4.0 3.0
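If the row has already been added at the bottom, a positional reorder avoids rebuilding the frame from its values. The sketch below rolls the integer positions with np.roll and passes them to iloc, which keeps the column dtypes and tolerates duplicate index labels. The sample frame is constructed with the extra row already in last place, an assumption made only to keep the example self-contained; order and moved are illustrative names.

```python
import numpy as np
import pandas as pd

# The question's data, with the new row already appended as the last row.
df = pd.DataFrame(
    {"col2": [2, 1, 9, 8, 7, 4, np.nan],
     "col3": [0, 1, 9, 4, 2, 3, np.nan]},
    index=["A", "A", "B", "F", "D", "C", "Deferred Description"],
)

# Shift the positions cyclically by one: the last position moves to the front.
order = np.roll(np.arange(len(df)), 1)

# iloc selects by position, so the duplicate "A" labels are not a problem.
moved = df.iloc[order]
print(moved)
```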

How to plot one column in different graphs?

I have the following problem. I have this kind of a dataframe:
f = pd.DataFrame([['Meyer', 2], ['Mueller', 4], ['Radisch', math.nan], ['Meyer', 2],['Pavlenko', math.nan]])
Is there an elegant way to split the DataFrame into several DataFrames by the first column? I would like to get one DataFrame where the first column is 'Mueller' and another where it is 'Radisch'.
Thanks in advance,
Erik
You can loop by unique values of column A with boolean indexing:
df = pd.DataFrame([['Meyer', 2], ['Mueller', 4],
['Radisch', np.nan], ['Meyer', 2],
['Pavlenko', np.nan]])
df.columns = list("AB")
print (df)
A B
0 Meyer 2.0
1 Mueller 4.0
2 Radisch NaN
3 Meyer 2.0
4 Pavlenko NaN
print (df.A.unique())
['Meyer' 'Mueller' 'Radisch' 'Pavlenko']
for x in df.A.unique():
    print(df[df.A == x])
A B
0 Meyer 2.0
3 Meyer 2.0
A B
1 Mueller 4.0
A B
2 Radisch NaN
A B
4 Pavlenko NaN
Then use dict comprehension - get dictionary of DataFrames:
dfs = {x:df[df.A == x].reset_index(drop=True) for x in df.A.unique()}
print (dfs)
{'Meyer': A B
0 Meyer 2.0
1 Meyer 2.0, 'Radisch': A B
0 Radisch NaN, 'Mueller': A B
0 Mueller 4.0, 'Pavlenko': A B
0 Pavlenko NaN}
print (dfs.keys())
dict_keys(['Meyer', 'Radisch', 'Mueller', 'Pavlenko'])
print (dfs['Meyer'])
A B
0 Meyer 2.0
1 Meyer 2.0
print (dfs['Pavlenko'])
A B
0 Pavlenko NaN
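The same dictionary can be built with groupby, which yields one (key, sub-frame) pair per distinct value and so avoids filtering the whole frame once per name. This is a sketch of an alternative, not the answerer's code. Note that groupby sorts the keys by default, so the dictionary order may differ from the order of first appearance.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([["Meyer", 2], ["Mueller", 4],
                   ["Radisch", np.nan], ["Meyer", 2],
                   ["Pavlenko", np.nan]], columns=list("AB"))

# Iterating over a groupby object yields (group key, sub-DataFrame) pairs.
dfs = {name: group.reset_index(drop=True) for name, group in df.groupby("A")}

print(dfs["Meyer"])
```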
