I've searched several books and sites and I can't find anything that quite matches what I'm trying to do. I would like to create itemized lists from a dataframe and reconfigure the data like so:
   A   B                 A  B   C   D
0  1  aa              0  1  aa
1  2  bb              1  2  bb
2  3  bb              2  3  bb  aa
3  3  aa      --\     3  4  aa  bb  dd
4  4  aa      --/     4  5  cc
5  4  bb
6  4  dd
7  5  cc
I've experimented with grouping, stacking, unstacking, etc., but nothing I've attempted has produced the desired result. If it's not obvious, I'm very new to Python; a solution would be great, but an understanding of the process I need to follow would be perfect.
Thanks in advance
Using pandas you can query all results e.g. where A=4.
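For example, a boolean mask selects every row where A equals 4 (a minimal sketch, assuming the dataframe above is named df):

matches = df.loc[df['A'] == 4]  # keeps only rows whose 'A' value is 4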
A crude but working method would be to iterate through the various index values and gather all 'like' results into a numpy array and convert this into a new dataframe.
A sketch to demonstrate the approach:

rows = []
for value in sorted(df['A'].unique()):   # one output row per distinct 'A' value
    matched = df.loc[df['A'] == value, 'B'].tolist()
    rows.append([value] + matched)
df = pd.DataFrame(rows)
# or something of the sort
I hope that helps.
Update from comments:
animal_list = []
for animal in ['cat', 'dog']:  # extend with the remaining animals
    newdf = df[[x == animal for x in df['A']]]
    body = [animal]
    for item in newdf['B']:
        body.append(item)
    animal_list.append(body)
df = pandas.DataFrame(animal_list)
A quick and dirty method that will work with strings. Customize the column naming as needed.
import pandas as pd

data = {'A': [1, 2, 3, 3, 4, 4, 4, 5],
        'B': ['aa', 'bb', 'bb', 'aa', 'aa', 'bb', 'dd', 'cc']}
df = pd.DataFrame(data)

maxlen = df.A.value_counts().values[0]  # size of the largest group; used to pad the lists to equal length
newdata = {}
for n, gdf in df.groupby('A'):
    newdata[n] = list(gdf.B.values) + [''] * (maxlen - len(gdf.B))

# recreate DF with Col 'A' as index; experiment with other orientations
newdf = pd.DataFrame.from_dict(newdata, orient='index')

# customize this section
newdf.columns = list('BCD')
newdf['A'] = newdf.index
newdf.index = range(len(newdf))
newdf = newdf.reindex(columns=list('ABCD'))  # set the desired column order
print(newdf)
The result is:
   A   B   C   D
0  1  aa
1  2  bb
2  3  bb  aa
3  4  aa  bb  dd
4  5  cc
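For reference, the same reshaping can also be done with a groupby/apply idiom in more recent pandas; a minimal sketch (the B/C/D names are just one choice, as above):

import pandas as pd

data = {'A': [1, 2, 3, 3, 4, 4, 4, 5],
        'B': ['aa', 'bb', 'bb', 'aa', 'aa', 'bb', 'dd', 'cc']}
df = pd.DataFrame(data)

# collect each group's B values into a list, then spread the lists into columns
lists = df.groupby('A')['B'].apply(list)
newdf = pd.DataFrame(lists.tolist(), index=lists.index).fillna('')
newdf.columns = list('BCD')   # assumes at most three values per group
newdf = newdf.reset_index()   # turn the 'A' index back into a column
print(newdf)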
I'm creating a DataFrame with pandas. The source is multiple arrays, but I want to build the DataFrame column by column, not row by row as the default pandas.DataFrame() constructor does.
pd.DataFrame seems to lack an 'axis=' parameter; how can I achieve this goal?
You might use Python's built-in zip for that in the following way:
import pandas as pd
arrayA = ['f','d','g']
arrayB = ['1','2','3']
arrayC = [4,5,6]
df = pd.DataFrame(zip(arrayA, arrayB, arrayC), columns=['AA','NN','gg'])
print(df)
Output:
AA NN gg
0 f 1 4
1 d 2 5
2 g 3 6
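Note that in Python 3 zip returns an iterator. Recent pandas versions accept it directly in the DataFrame constructor, but if this raises an error on an older version, wrapping it as list(zip(arrayA, arrayB, arrayC)) is a safe alternative.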
Zip is a great solution in this case, as pointed out by Daweo, but alternatively you can use a dictionary for readability:
import pandas as pd
arrayA = ['f','d','g']
arrayB = ['1','2','3']
arrayC = [4,5,6]
my_dict = {
'AA': arrayA,
'NN': arrayB,
'gg': arrayC
}
df = pd.DataFrame(my_dict)
print(df)
Output:
AA NN gg
0 f 1 4
1 d 2 5
2 g 3 6
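If you want to match the column-by-column wording literally, you can also start from an empty frame and assign one column at a time; a minimal sketch:

import pandas as pd

df = pd.DataFrame()  # start empty, then add one column at a time
df['AA'] = ['f', 'd', 'g']
df['NN'] = ['1', '2', '3']
df['gg'] = [4, 5, 6]
print(df)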
I need to make a function that can act on any dataframe and perform an action on it.
To clarify, for example let's say I have this sample dataframe here:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=['a', 'b', 'c'])
Which looks like this.
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9
I have created a function that does something of this sort:
def ColDrop(df, collist):
    df = df.drop(columns=collist)
    return df
What I'd like is for it to accept a list as the 'collist' variable and drop all of those from the dataframe stated as 'df', so...
col = ['a', 'b']
ColDrop(df, col)
Would look like...
c
0 3
1 6
2 9
However, it doesn't seem to work. Similarly, I want to remove rows from any dataframe based on the values in a column, for example...
def rowvaluedrop(df, column, pattern):
    filter = df[column].str.contains(pattern)
    df = df[~filter]
    return df
rowvaluedrop(df, a, 4)
Would look like...
a b c
0 1 2 3
2 7 8 9
(I realise this second example may not work since the values are integers rather than strings, but I hope my point gets across regardless.)
Thanks in advance.
You need to assign the returned dataframe back to df explicitly, since the functions return new dataframes rather than modifying the original in place:
df = rowvaluedrop(df, 'a', '4')
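Putting it all together, here is a self-contained sketch; the astype(str) cast is an addition so the contains filter also works on the integer columns the asker mentioned:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=['a', 'b', 'c'])

def ColDrop(df, collist):
    return df.drop(columns=collist)

def rowvaluedrop(df, column, pattern):
    # astype(str) so .str.contains also works on numeric columns
    mask = df[column].astype(str).str.contains(pattern)
    return df[~mask]

df = rowvaluedrop(df, 'a', '4')   # removes the middle row (a == 4)
df = ColDrop(df, ['a', 'b'])      # note the assignment back to df
print(df)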
I want to add a multi-index column to an existing pandas dataframe df. An example:
d = {('a','b'):[1,2,3], ('c', 'd'): [4,5,6]}
df = pd.DataFrame(d)
The resulting dataframe is:
   a  c
   b  d
0  1  4
1  2  5
2  3  6
Now I want to add a new column to the dataframe. The correct way to do that would be to use df['e', 'f'] = [7,8,9]. However, I would like to use the list new_key as the key. Normally I could unpack it with the asterisk *, but apparently that cannot be used in a subscript. So I get the following errors.
new_key = ['e','f']
df[new_key] = [7,8,9]
> KeyError: "['e' 'f'] not in index"
df[*new_key] = [7,8,9]
> SyntaxError: invalid syntax
Does anyone know how to solve this?
Cast to a tuple first; a list is interpreted as several column labels, whereas a tuple is treated as a single MultiIndex key:
df[tuple(new_key)] = [7,8,9]
   a  c  e
   b  d  f
0  1  4  7
1  2  5  8
2  3  6  9
I am trying to transform a data matrix in Python.
I want to change from :
Well        A  B  C  D
Production  1  2  3  4

to

Well  Production
A              1
B              2
C              3
D              4
It is a simple task in Excel but I would like to know how to do it in Python.
How do I do it? I am sure there is a very simple way to do it, but I just have not come across it.
I recommend setting the index to Well before transposing. If you transpose first, you'll be left with a meaningless column 0, and Production will become an observation in the dataframe.
df.T

                0   # this is your column
Well   Production   # this becomes an observation
A               1
B               2
C               3
D               4
Do this:
df.set_index('Well').T

Well  Production
A              1
B              2
C              3
D              4
If your data is contained in a dataframe you can simply transpose it.
data = data.transpose()
or equivalently
data = data.T
Alternatively, convert your records into plain Python lists and print them:
l1 = ['Well', 'A', 'B', 'C', 'D']
l2 = ['Production', '1', '2', '3', '4']
for i, j in zip(l1, l2):
    print('%s %4s' % (i, j))
Output:
Well Production
A    1
B    2
C    3
D    4
Consider the dataframes d1 and d2
import pandas as pd

d1 = pd.DataFrame(dict(
    A=list('111222'),
    B=list('xyzxyz'),
    C=range(6)
))
d2 = pd.DataFrame(dict(
    A=list('111222'),
    B=list('xyzxyz'),
    C=range(6)
))
I want to concatenate these and perform a groupby
df = pd.concat([d.set_index('A') for d in [d1, d2]], keys=['d1', 'd2'])
print(df)
      B  C
   A
d1 1  x  0
   1  y  1
   1  z  2
   2  x  3
   2  y  4
   2  z  5
d2 1  x  0
   1  y  1
   1  z  2
   2  x  3
   2  y  4
   2  z  5
However, when I do a groupby and sum
df.groupby(level='A').C.sum()
A
1     0
1     2
1     4
2     6
2     8
2    10
Name: C, dtype: int64
Which isn't at all what I was expecting.
However, if I take df apart and piece it back together again, then perform the same groupby, I get what I expected:
pd.DataFrame(
    df.values,
    pd.MultiIndex.from_tuples(df.index.values, names=df.index.names),
    df.columns.values
).groupby(level='A').C.sum()
A
1     6
2    24
Name: C, dtype: int64
Can anyone explain what is going wrong?
I believe it is a bug. Making your index a MultiIndex is a small hack that works:

import numpy as np

df = pd.concat([d.set_index(['A', [np.nan] * len(d)]) for d in [d1, d2]],
               keys=['d1', 'd2'])
Another solution would be to reverse one of the DataFrames:

df = pd.concat([d.set_index('A') for d in [d1, d2.sort_index(ascending=False)]],
               keys=['d1', 'd2'])
Specifically, when dataframes that share the same non-MultiIndex index are concatenated with the keys argument, the new MultiIndex is built with codes 0, ..., len(d) - 1 that bear no relation to the original labels. (If you inspect the index, you will see several copies of each label, each with a different code.)
It comes down to the following piece of code in pandas.core.reshape.concat:

def _make_concat_multiindex(indexes, keys, levels=None, names=None):
    ...
    ...  # somewhere here we treat the non identical axis
    ...
    if isinstance(new_index, MultiIndex):
        new_levels.extend(new_index.levels)
        new_labels.extend([np.tile(lab, kpieces) for lab in new_index.labels])
    else:
        new_levels.append(new_index)
        new_labels.append(np.tile(np.arange(n), kpieces))

So, if the index is not already a MultiIndex, the codes assigned are np.arange(n), tiled once per concatenated piece.
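You can see this directly by inspecting the concatenated index; a small check (the attribute is called codes in recent pandas, labels in older versions):

# each original 'A' label is kept once per row, with its own code,
# instead of equal labels being merged into one level entry
print(df.index.levels[-1])   # Index(['1', '1', '1', '2', '2', '2'], dtype='object', name='A')
print(df.index.codes[-1])    # [0 1 2 3 4 5 0 1 2 3 4 5]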
Removing the keys argument from concat() will allow your desired groupby() to succeed:
df = pd.concat([d.set_index('A') for d in [d1, d2]])
df.groupby(level='A').C.sum()
Alternately, if keys needs to stay, you can get there with reset_index() and a repeat groupby():
df = pd.concat([d.set_index('A') for d in [d1, d2]], keys=['d1', 'd2'])
(df.groupby(level='A').sum()
   .reset_index()
   .groupby('A').sum()
)
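The first groupby merges the rows from d1 and d2 that share the same (buggy) index code, reset_index() turns A back into an ordinary column, and the second groupby then aggregates on the actual A values.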