I want to build a DataFrame from all the outputs of a Python function, i.e. collect the results into a dataset df. Any ideas?
import pandas as pd
def test(input):
    kl = len(input)
    return kl
test("kilo")
test("pound")
# initialize list of lists
data = [[4], [5]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ["Name"])
Assuming this input and function:
words = [['abc', 'd'], ['efgh', '']]
def test(input):
    kl = len(input)
    return kl
You can create the DataFrame:
df = pd.DataFrame(words)
      0  1
0   abc  d
1  efgh
Then apply your function element-wise with applymap (deprecated since pandas 2.1 in favor of DataFrame.map). Warning: applymap is slow; for this particular case (getting the length), there are much faster vectorized methods:
df2 = df.applymap(test)
0 1
0 3 1
1 4 0
Or run your function in python before creating the DataFrame:
df = pd.DataFrame([[test(x) for x in l] for l in words])
0 1
0 3 1
1 4 0
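The faster vectorized route mentioned in the warning could look like the sketch below. It assumes every cell is a string and uses the .str.len() accessor column by column instead of calling test on each element.

```python
import pandas as pd

words = [['abc', 'd'], ['efgh', '']]
df = pd.DataFrame(words)

# Apply the vectorized string-length accessor to each column in turn.
lengths = df.apply(lambda col: col.str.len())
print(lengths)
```

This only replaces test because test happens to be len; an arbitrary Python function would still need applymap or a comprehension.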
A related approach would be to repeatedly call your function to make a list and then form the dataframe from it:
import pandas as pd
words = ['kilo', 'pound', 'ton']
def test(input):
    kl = len(input)
    return kl
data = [] # create empty list
for entry in words:
    data.append(test(entry))
df = pd.DataFrame(data, columns = ['names'])
Related
How to convert following list to a pandas dataframe?
my_list = [["A","B","C"],["A","B","D"]]
And as an output I would like to have a dataframe like:
Index  A  B  C  D
1      1  1  1  0
2      1  1  0  1
You can craft Series and concatenate them:
my_list = [["A","B","C"],["A","B","D"]]
df = (pd.concat([pd.Series(1, index=l, name=i+1)
                 for i,l in enumerate(my_list)], axis=1)
        .T
        .fillna(0, downcast='infer') # optional
     )
or with get_dummies:
df = pd.get_dummies(pd.DataFrame(my_list))
df = df.groupby(df.columns.str.split('_', n=1).str[-1], axis=1).max()
output:
A B C D
1 1 1 1 0
2 1 1 0 1
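For comparison, the same indicator table can be built in plain Python before handing it to pandas. This is only a sketch: it assumes the labels should appear in sorted order and that the index should start at 1, as in the desired output.

```python
import pandas as pd

my_list = [["A", "B", "C"], ["A", "B", "D"]]

# Collect every distinct label, sorted so the column order is stable.
labels = sorted({item for row in my_list for item in row})

# One row per inner list: 1 if the label is present, 0 otherwise.
rows = [[int(label in row) for label in labels] for row in my_list]

df = pd.DataFrame(rows, columns=labels, index=range(1, len(my_list) + 1))
print(df)
```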
I'm unsure how those two structures relate. my_list is a list of two lists, ["A","B","C"] and ["A","B","D"].
If you want a DataFrame like the table you show, I would suggest building a dictionary of the values first, then converting it into a pandas DataFrame.
my_dict = {"A":[1,1], "B":[1,1], "C": [1,0], "D":[0,1]}
my_df = pd.DataFrame(my_dict)
print(my_df)
Output:
   A  B  C  D
0  1  1  1  0
1  1  1  0  1
I have the following dataset
df = pd.DataFrame([[1,1000],[2,1000],[3,1000]])
df.columns = ["A","B"]
df
A B
0 1 1000
1 2 1000
2 3 1000
I would like to create a new column C that calculates:
if A = 1 then C = B*.8
if A = 2 then C = B*.1
if A = 3 then C = B*.05
if A = 4 then C = B*.025
...
...(going up to 10)
Is it best to create a function?
def calculation(x):
    if x == 1:
        return y*0.75
    elif...
But I'm not quite sure how to work with multiple columns. Any help would be appreciated! Thanks
Use Series.map with a dictionary to look up the multiplier for each value of A, then multiply by column B:
d = {1:.8, 2:.1, 3:.05, 4:.025}
df['C'] = df['A'].map(d).mul(df.B)
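Put together with the sample data from the question, the approach might look like the sketch below. The dictionary only contains the four multipliers given in the question; any value of A that is missing from it would map to NaN, so C would be NaN for that row.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [1000, 1000, 1000]})

# Multiplier for each value of A (only the four listed in the question).
d = {1: .8, 2: .1, 3: .05, 4: .025}

# Look up the multiplier per row, then multiply element-wise by column B.
df['C'] = df['A'].map(d).mul(df['B'])
print(df)
```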
I'm creating a DataFrame with pandas. The source is multiple arrays, but I want to build the DataFrame column by column, not row by row as the pandas.DataFrame() constructor does by default.
pd.DataFrame seems to lack an 'axis=' parameter. How can I achieve this?
You can use Python's built-in zip for this in the following way:
import pandas as pd
arrayA = ['f','d','g']
arrayB = ['1','2','3']
arrayC = [4,5,6]
df = pd.DataFrame(zip(arrayA, arrayB, arrayC), columns=['AA','NN','gg'])
print(df)
Output:
AA NN gg
0 f 1 4
1 d 2 5
2 g 3 6
Zip is a great solution in this case as pointed out by Daweo, but alternatively you can use a dictionary for readability purposes:
import pandas as pd
arrayA = ['f','d','g']
arrayB = ['1','2','3']
arrayC = [4,5,6]
my_dict = {
    'AA': arrayA,
    'NN': arrayB,
    'gg': arrayC
}
df = pd.DataFrame(my_dict)
print(df)
Output:
AA NN gg
0 f 1 4
1 d 2 5
2 g 3 6
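Since the question asks about an axis-like option, another possibility is to build the frame row by row and transpose it. This is a rough sketch rather than a recommendation: transposing rows of mixed types leaves every column with object dtype, so infer_objects() is called afterwards to try to recover more specific dtypes.

```python
import pandas as pd

arrayA = ['f', 'd', 'g']
arrayB = ['1', '2', '3']
arrayC = [4, 5, 6]

# Each array is a row at first; the index labels become column names after .T
df = pd.DataFrame([arrayA, arrayB, arrayC], index=['AA', 'NN', 'gg']).T

# Transposing mixed-type rows yields object columns; attempt a soft conversion.
df = df.infer_objects()
print(df)
```

The zip and dictionary versions avoid this dtype detour, which is one reason to prefer them.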
I'm trying to split a column and run some checks on it. My objective is to return 9 boolean columns in a DataFrame, but the actual result is 9 columns of lists.
d = {'Posicao': ['MO (DC), PL (C)', 'GR, PL (C)', 'MO (DEC), PL (C)']}
dfF = pd.DataFrame(d)
def definir_pos(x):
    GR = []
    AT = []
    GR_t = 0
    AT_t = 0
    splited = x.split(',')
    for splite in splited:
        if 'GR' in splite:
            GR_t = 1
        if 'PL' in splite:
            AT_t = 1
    GR.append(GR_t)
    AT.append(AT_t)
    dataf = [GR,AT]
    return dataf
posicoes = dfF.Posicao.apply(definir_pos)
posicoesF = pd.DataFrame(posicoes.tolist(), columns = ['GR','AT'])
I only included 2 options in the function here, but the original code has 9.
Output:
GR AT
0 [0] [1]
1 [1] [1]
2 [0] [1]
Expected output:
GR AT
0 0 1
1 1 1
2 0 1
In case you were not aware of some of these pandas methods, here is a refactored version of your code that returns the desired output. It is much simpler and should also perform better. Use str.split() with expand=True to split the column into separate columns, then use str.contains() to search each resulting column for 'GR' or 'PL' at once, separating the two strings with the regex "or" operator |:
d = {'Posicao': ['MO (DC), PL (C)', 'GR, PL (C)', 'MO (DEC), PL (C)']}
dfF = pd.DataFrame(d)
dfF = dfF['Posicao'].str.split(',',expand=True).rename({0:'GR',1:'AT'}, axis=1)
dfF['GR'] = dfF['GR'].str.contains('GR|PL').astype(int)
dfF['AT'] = dfF['AT'].str.contains('GR|PL').astype(int)
dfF
Out[1]:
GR AT
0 0 1
1 1 1
2 0 1
If you want this as a function, you can write:
def definir_pos(df):
    cols = ['GR', 'AT']
    df = df['Posicao'].str.split(',',expand=True).rename({0:'GR',1:'AT'}, axis=1)
    for col in cols:
        df[col] = df[col].str.contains('GR|PL').astype(int)
    return df
definir_pos(dfF)
Or, you can include a cols parameter:
def definir_pos(df, cols):
    df = df['Posicao'].str.split(',',expand=True).rename({0:'GR',1:'AT'}, axis=1)
    for col in cols:
        df[col] = df[col].str.contains('GR|PL').astype(int)
    return df
definir_pos(dfF, cols=['GR', 'AT'])
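The code above ties each flag to the position of a piece after the split. If a flag should instead be 1 whenever its code appears anywhere in the string, as in the question's original loop, a sketch along these lines may be closer to the intent. The codes mapping is a hypothetical helper; the question only shows 2 of its 9 options, so only those two are included.

```python
import pandas as pd

d = {'Posicao': ['MO (DC), PL (C)', 'GR, PL (C)', 'MO (DEC), PL (C)']}
dfF = pd.DataFrame(d)

# Hypothetical mapping: output column name -> code to look for in 'Posicao'.
codes = {'GR': 'GR', 'AT': 'PL'}

# One 0/1 column per code, checked against the whole string (plain substring match).
posicoesF = pd.DataFrame({
    col: dfF['Posicao'].str.contains(code, regex=False).astype(int)
    for col, code in codes.items()
})
print(posicoesF)
```

With this layout, adding the remaining options would only mean adding entries to codes.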
I have a result that looks like this when I print it:
>>> print(result)
[[0]
 [1]
 [0]
 [0]
 [1]
 [0]]
I guess that's about the same as [ [0][1][0][0][1][0] ], which seems a bit weird; [0,1,0,0,1,0] seems a more logical representation, but somehow it's not like that.
I would like these values to be added as a single column to a pandas DataFrame df.
I tried several ways to join it to my dataframe:
df = pd.concat(df,result)
df = pd.concat(df,{'result' =result})
df['result'] =pd.aply(result, axis=1)
with no luck. How can I do it?
There are multiple ways to flatten your data:
import numpy as np
import pandas as pd
df = pd.DataFrame(data=np.random.rand(6,2))
result = np.array([0,1,0,0,1,0])[:, None]
print (result)
[[0]
[1]
[0]
[0]
[1]
[0]]
df['result'] = result[:,0]
df['result1'] = result.ravel()
#df['result1'] = np.concatenate(result)
print (df)
0 1 result result1
0 0.098767 0.933861 0 0
1 0.532177 0.610121 1 1
2 0.288742 0.718452 0 0
3 0.520980 0.367746 0 0
4 0.253658 0.011994 1 1
5 0.662878 0.846113 0 0
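If you want to put that array into a DataFrame column in flat form, the following is the simplest way, provided result is a plain Python list of lists (for a NumPy array, convert it first with result.tolist()):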
df["result"] = sum(result, [])
As long as the number of data points in this list is the same as the number of rows of the dataframe this should work:
import pandas as pd
your_data = [[0],[1],[0],[0],[1],[0]]
df = pd.DataFrame() # skip and use your own dataframe with len(df) == len(your_data)
df['result'] = [i[0] for i in your_data]
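The answers above differ mainly in whether result is a NumPy array or a list of lists. A minimal sketch that should cope with either is to pass it through np.asarray and flatten it, checking the length before assigning. The one-column DataFrame here is only a stand-in for your own.

```python
import numpy as np
import pandas as pd

result = np.array([[0], [1], [0], [0], [1], [0]])  # a list of lists works too
df = pd.DataFrame({'x': range(6)})  # stand-in for your own DataFrame

# Convert to an array (a no-op if it already is one) and flatten to 1-D.
flat = np.asarray(result).ravel()

# Column assignment needs one value per row, so fail early with a clear message.
if len(flat) != len(df):
    raise ValueError("result and df must have the same number of rows")

df['result'] = flat
print(df)
```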