I'm creating a DataFrame with pandas. The source data is in several arrays, and I want to build the DataFrame column by column rather than row by row, which is what the default pandas.DataFrame() constructor does.
pd.DataFrame seems to lack an 'axis=' parameter; how can I achieve this?
You can use Python's built-in zip for that in the following way:
import pandas as pd
arrayA = ['f','d','g']
arrayB = ['1','2','3']
arrayC = [4,5,6]
df = pd.DataFrame(zip(arrayA, arrayB, arrayC), columns=['AA','NN','gg'])
print(df)
Output:

  AA NN  gg
0  f  1   4
1  d  2   5
2  g  3   6
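One caveat worth knowing (not from the question itself): zip stops at the shortest input, so arrays of unequal length silently lose rows. If padding is wanted instead, itertools.zip_longest from the standard library is a drop-in sketch:

```python
import pandas as pd
from itertools import zip_longest

arrayA = ['f', 'd', 'g']
arrayB = ['1', '2']   # deliberately shorter

# zip() would yield only two rows here; zip_longest pads the gap
# with None, which shows up as a missing value in the column.
df = pd.DataFrame(zip_longest(arrayA, arrayB), columns=['AA', 'NN'])
```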
zip is a great solution in this case, as pointed out by Daweo, but alternatively you can use a dictionary for readability:
import pandas as pd
arrayA = ['f','d','g']
arrayB = ['1','2','3']
arrayC = [4,5,6]
my_dict = {
    'AA': arrayA,
    'NN': arrayB,
    'gg': arrayC
}
df = pd.DataFrame(my_dict)
print(df)
Output:

  AA NN  gg
0  f  1   4
1  d  2   5
2  g  3   6
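The two approaches combine naturally when the column names live in their own list; a small sketch, using the same example data as above:

```python
import pandas as pd

names = ['AA', 'NN', 'gg']
arrays = [['f', 'd', 'g'], ['1', '2', '3'], [4, 5, 6]]

# zip the names against the arrays to build the column dict in one step
df = pd.DataFrame(dict(zip(names, arrays)))
```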
Related
How do I convert the following list to a pandas DataFrame?
my_list = [["A","B","C"],["A","B","D"]]
And as an output I would like to have a dataframe like:
Index  A  B  C  D
1      1  1  1  0
2      1  1  0  1
You can craft Series and concatenate them:
import pandas as pd

my_list = [["A","B","C"],["A","B","D"]]
df = (pd.concat([pd.Series(1, index=l, name=i+1)
                 for i, l in enumerate(my_list)], axis=1)
        .T
        .fillna(0, downcast='infer')  # optional
      )
or with get_dummies:
df = pd.get_dummies(pd.DataFrame(my_list))
df = df.groupby(df.columns.str.split('_', n=1).str[-1], axis=1).max()
output:
   A  B  C  D
1  1  1  1  0
2  1  1  0  1
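Another way to build the same indicator table, sketched with pandas' explode and crosstab (the 1/2 row labels are assumed from the desired output above):

```python
import pandas as pd

my_list = [["A", "B", "C"], ["A", "B", "D"]]

# One Series entry per sub-list, exploded to one row per value,
# then cross-tabulated into a 0/1 indicator table.
s = pd.Series(my_list, index=[1, 2]).explode()
df = pd.crosstab(s.index, s)
```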
I'm unsure how those two structures relate. my_list is a list of two lists, ["A","B","C"] and ["A","B","D"].
If you want a data frame like the table you have, I would suggest making a dictionary of the values first, then converting it into a pandas dataframe.
my_dict = {"A":[1,1], "B":[1,1], "C": [1,0], "D":[0,1]}
my_df = pd.DataFrame(my_dict)
print(my_df)
Output:

   A  B  C  D
0  1  1  1  0
1  1  1  0  1
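If writing the dictionary out by hand is too manual, it can also be derived from my_list; a small sketch of that step:

```python
import pandas as pd

my_list = [["A", "B", "C"], ["A", "B", "D"]]

# Collect every label that appears, then mark membership per sub-list.
cols = sorted({v for row in my_list for v in row})
my_dict = {c: [int(c in row) for row in my_list] for c in cols}
my_df = pd.DataFrame(my_dict)
```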
I have two different dataframes and I want to get the sorted values of two columns.
Setup
import numpy as np
import pandas as pd
df1 = pd.DataFrame({
    'id': range(7),
    'c': list('EDBBCCC')
})
df2 = pd.DataFrame({
    'id': range(8),
    'c': list('EBBCCCAA')
})
Desired Output
# notice that ABCDE appear in alphabetical order
c_first c_second
    NaN        A
      B        B
      C        C
      D      NaN
      E        E
What I've tried
pd.concat([df1.c.sort_values().drop_duplicates().rename('c_first'),
           df2.c.sort_values().drop_duplicates().rename('c_second')],
          axis=1)
How can I get the output in the required format?
Here is one possible way to achieve it:
t1 = df1.c.drop_duplicates()
t2 = df2.c.drop_duplicates()
tmp1 = pd.DataFrame({'id':t1, 'c_first':t1})
tmp2 = pd.DataFrame({'id':t2, 'c_second':t2})
result = pd.merge(tmp1,tmp2, how='outer').sort_values('id').drop('id', axis=1)
result
  c_first c_second
4     NaN        A
0       B        B
1       C        C
2       D      NaN
3       E        E
https://pandas.pydata.org/pandas-docs/version/0.25.0/reference/api/pandas.concat.html
The concat function has a sort argument.
Try adding sort=True.
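For sort=True to help, concat also has to align on the letter values rather than on the original row positions, so the de-duplicated values need to become the index first; a minimal sketch of that combination:

```python
import pandas as pd

df1 = pd.DataFrame({'id': range(7), 'c': list('EDBBCCC')})
df2 = pd.DataFrame({'id': range(8), 'c': list('EBBCCCAA')})

s1 = df1.c.drop_duplicates().rename('c_first')
s2 = df2.c.drop_duplicates().rename('c_second')

# Index each series by its own values so concat aligns on them,
# then let sort=True order the union of labels alphabetically.
out = pd.concat([s1.set_axis(s1.values), s2.set_axis(s2.values)],
                axis=1, sort=True)
```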
Let's say I have a data frame with such column names:
['a','b','c','d','e','f','g']
And I would like to change the names from 'c' to 'f' (actually, add a string to each of those column names), so that the whole data frame's column names would look like this:
['a','b','var_c_equal','var_d_equal','var_e_equal','var_f_equal','g']
Well, first I made a function that changes column names with the string I want:
df.rename(columns=lambda x: 'or_'+x+'_no', inplace=True)
But now I really want to understand how to implement something like this:
df.loc[:,'c':'f'].rename(columns=lambda x: 'var_'+x+'_equal', inplace=True)
You can use a list comprehension for that, like:
Code:
new_columns = ['var_{}_equal'.format(c) if c in 'cdef' else c for c in df.columns]
Test Code:
import pandas as pd
df = pd.DataFrame({'a':(1,2), 'b':(1,2), 'c':(1,2), 'd':(1,2)})
print(df)
df.columns = ['var_{}_equal'.format(c) if c in 'cdef' else c
              for c in df.columns]
print(df)
Results:
   a  b  c  d
0  1  1  1  1
1  2  2  2  2

   a  b  var_c_equal  var_d_equal
0  1  1            1            1
1  2  2            2            2
One way is to use a dictionary instead of an anonymous function. Both the below variations assume the columns you need to rename are contiguous.
Contiguous columns by position
d = {k: 'var_'+k+'_equal' for k in df.columns[2:6]}
df = df.rename(columns=d)
Contiguous columns by name
If you need to calculate the numerical indices:
cols = df.columns.get_loc
d = {k: 'var_'+k+'_equal' for k in df.columns[cols('c'):cols('f')+1]}
df = df.rename(columns=d)
Specifically identified columns
If you want to provide the columns explicitly:
d = {k: 'var_'+k+'_equal' for k in 'cdef'}
df = df.rename(columns=d)
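The label slice from the question can also drive the rename directly: take the columns that df.loc[:, 'c':'f'] selects and build the mapping from them, sketched here on a small hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({c: (1, 2) for c in 'abcdefg'})

# .loc with a label slice is inclusive on both ends, so 'c':'f'
# picks exactly the columns c, d, e, f.
to_rename = df.loc[:, 'c':'f'].columns
df = df.rename(columns={c: 'var_' + c + '_equal' for c in to_rename})
```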
I've searched several books and sites and I can't find anything that quite matches what I'm trying to do. I would like to create itemized lists from a dataframe and reconfigure the data like so:
   A   B             A   B   C   D
0  1  aa          0  1  aa
1  2  bb          1  2  bb
2  3  bb          2  3  bb  aa
3  3  aa   --\    3  4  aa  bb  dd
4  4  aa   --/    4  5  cc
5  4  bb
6  4  dd
7  5  cc
I've experimented with grouping, stacking, unstacking, etc. but nothing that I've attempted has produced the desired result. If it's not obvious, I'm very new to python and a solution would be great but an understanding of the process I need to follow would be perfect.
Thanks in advance
Using pandas you can query all results e.g. where A=4.
A crude but working method would be to iterate through the various index values and gather all 'like' results into a numpy array and convert this into a new dataframe.
Pseudo code to demonstrate my example:
(will need rewriting to actually work)
l = [None] * df['A'].max()
for item in range(df['A'].max()):
    l[item] = df.loc[df['A'].isin([item])]
df = pd.DataFrame(l)
# or something of the sort
I hope that helps.
Update from comments:
animal_list = []
for animal in ['cat', 'dog', ...]:
    newdf = df[[x == animal for x in df['A']]]
    body = [animal]
    for item in newdf['B']:
        body.append(item)
    animal_list.append(body)
df = pandas.DataFrame(animal_list)
A quick and dirty method that will work with strings. Customize the column naming as per needs.
import pandas as pd

data = {'A': [1, 2, 3, 3, 4, 4, 4, 5],
        'B': ['aa', 'bb', 'bb', 'aa', 'aa', 'bb', 'dd', 'cc']}
df = pd.DataFrame(data)

maxlen = df.A.value_counts().values[0]  # this helps with creating
                                        # lists of the same size
newdata = {}
for n, gdf in df.groupby('A'):
    newdata[n] = list(gdf.B.values) + [''] * (maxlen - len(gdf.B))

# recreate DF with Col 'A' as index; experiment with other orientations
newdf = pd.DataFrame.from_dict(newdata, orient='index')

# customize this section
newdf.columns = list('BCD')
newdf['A'] = newdf.index
newdf.index = range(len(newdf))
newdf = newdf.reindex(columns=list('ABCD'))  # to set the desired order
print(newdf)
The result is:
   A   B   C   D
0  1  aa
1  2  bb
2  3  bb  aa
3  4  aa  bb  dd
4  5  cc
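A more idiomatic sketch of the same reshape, using groupby/cumcount to number the items within each 'A' group and pivot to spread them into columns (the B/C/D column names are chosen to match the example):

```python
import pandas as pd

data = {'A': [1, 2, 3, 3, 4, 4, 4, 5],
        'B': ['aa', 'bb', 'bb', 'aa', 'aa', 'bb', 'dd', 'cc']}
df = pd.DataFrame(data)

# Position of each item within its 'A' group: 0, 1, 2, ...
wide = (df.assign(pos=df.groupby('A').cumcount())
          .pivot(index='A', columns='pos', values='B')
          .fillna(''))
wide.columns = list('BCD')[:len(wide.columns)]
wide = wide.reset_index()
```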
I have the following DataFrame:
   a  b  c
b
2  1  2  3
5  4  5  6
As you can see, column b is used as an index. I want to get the ordinal number of the row fulfilling ('b' == 5), which in this case would be 1.
The column being tested can be either an index column (as with b in this case) or a regular column, e.g. I may want to find the index of the row fulfilling ('c' == 6).
Use Index.get_loc instead.
Reusing @unutbu's setup code, you'll achieve the same results.
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(np.arange(1,7).reshape(2,3),
...                   columns=list('abc'),
...                   index=pd.Series([2,5], name='b'))
>>> df
   a  b  c
b
2  1  2  3
5  4  5  6
>>> df.index.get_loc(5)
1
You could use np.where like this:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(1,7).reshape(2,3),
                  columns=list('abc'),
                  index=pd.Series([2,5], name='b'))
print(df)
#    a  b  c
# b
# 2  1  2  3
# 5  4  5  6

print(np.where(df.index==5)[0])
# [1]

print(np.where(df['c']==6)[0])
# [1]
The value returned is an array since there could be more than one row with a particular index or value in a column.
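To see that multiple-match case in action, here is the same idea on a small frame with a duplicated index label and a duplicated column value (the data is made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'c': [6, 3, 6]}, index=pd.Series([2, 5, 5], name='b'))

# Both lookups now return two positions.
print(np.where(df.index == 5)[0])   # [1 2]
print(np.where(df['c'] == 6)[0])    # [0 2]
```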
With Index.get_loc and general condition:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(np.arange(1,7).reshape(2,3),
...                   columns=list('abc'),
...                   index=pd.Series([2,5], name='b'))
>>> df
   a  b  c
b
2  1  2  3
5  4  5  6
>>> df.index.get_loc(df.index[df['b'] == 5][0])
1
The other answers based on Index.get_loc() do not give a consistent result: that function returns an integer when the index consists of all unique values, but a boolean mask array when it does not. A more consistent approach, which returns a list of integer positions every time, is the following, shown here for an index with non-unique values:
df = pd.DataFrame([
    {"A": 1, "B": 2}, {"A": 2, "B": 2},
    {"A": 3, "B": 4}, {"A": 1, "B": 3}
], index=[1, 2, 3, 1])
If searching based on index value:
[i for i,v in enumerate(df.index == 1) if v]
[0, 3]
If searching based on a column value:
[i for i,v in enumerate(df["B"] == 2) if v]
[0, 1]
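np.flatnonzero expresses the same lookups a bit more directly than the enumerate comprehension, returning the integer positions as a NumPy array:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    {"A": 1, "B": 2}, {"A": 2, "B": 2},
    {"A": 3, "B": 4}, {"A": 1, "B": 3}
], index=[1, 2, 3, 1])

idx_positions = np.flatnonzero(df.index == 1)   # positions by index label
col_positions = np.flatnonzero(df["B"] == 2)    # positions by column value
```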