How do I append a list of integers as new columns to each row in a dataframe in Pandas?
I have a dataframe to which I need to append a 20-column sequence of integers as new columns. The use case is that I'm translating the natural-language text in a cell of each row into a sequence of vectors for some NLP with TensorFlow.
But to illustrate, I create a simple data frame to append:
import pandas as pd

df = pd.DataFrame([(1, 2, 3), (11, 12, 13)])
df.head()
Which generates the output:

    0   1   2
0   1   2   3
1  11  12  13
And then, for each row, I need to pass a function that takes in a particular value from column 2 and returns an array of integers that need to be appended as columns in the dataframe - not as an array in a single cell:
def foo(x):
    return [x+1, x+2, x+3]
Ideally, I'd run something like:
df[3, 4, 5] = df['2'].applyAsColumns(foo)
The only solution I can think of is to create the dataframe with 3 blank columns [3, 4, 5], and then use a for loop to iterate through the blank columns and fill in the values.
Is this the best way to do it, or is there a function built into pandas that would do this? I've tried checking the documentation, but haven't found anything.
Any help is appreciated!
If I understand correctly (IIUC), have foo return a pd.Series; apply will then expand it into columns:
import pandas as pd

def foo(x):
    return pd.Series([x+1, x+2, x+3])

df = pd.DataFrame([(1, 2, 3), (11, 12, 13)])
df[[3, 4, 5]] = df[2].apply(foo)
df
Output:
    0   1   2   3   4   5
0   1   2   3   4   5   6
1  11  12  13  14  15  16
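If you'd rather keep foo returning a plain list, a minimal sketch (assuming the same df and foo as in the question) expands the lists into their own frame and concatenates it on:

import pandas as pd

df = pd.DataFrame([(1, 2, 3), (11, 12, 13)])

def foo(x):
    return [x + 1, x + 2, x + 3]

# expand the returned lists into a frame, then attach as new columns
new_cols = pd.DataFrame(df[2].apply(foo).tolist(), index=df.index, columns=[3, 4, 5])
df = pd.concat([df, new_cols], axis=1)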
Using pandas, I open some csv files in a loop and set the index to the cycleID column, except the cycleID column is not unique. See below:
for filename in all_files:
    abfdata = pd.read_csv(filename, index_col=None, header=0)
    abfdata = abfdata.set_index("cycleID", drop=False)
    for index, row in abfdata.iterrows():
        print(row['cycleID'], row['mean'])
This prints the 2 columns (cycleID and mean) of the dataframe I am interested in for further computations:
1 1.5020712104685252e-11
1 6.56683605063102e-12
2 1.3993315187144084e-11
2 -8.670502467042485e-13
3 7.0270625256163566e-12
3 9.509995221868016e-12
4 1.2901435995915644e-11
4 9.513106448422182e-12
The objective is to use the rows corresponding to the same cycleID and calculate the difference between the mean column values. So, if there are 8 rows in the table, the final array or list would store 4 values.
I want to make it scalable as well, so that there can be 3 or more rows with the same cycleID. In that case, each cycleID could have 2 or more mean differences.
Update: Instead of creating a new question about it, I thought I'd add it here.
I used the diff and groupby approach as mentioned in the solution. It works great, but I have an extra need: to save one of the mean values (odd row or even row, it doesn't matter) in a new column and make that part of the new dataframe as well. How do I do that?
You can use groupby:
s2 = df.groupby(['cycleID'])['mean'].diff()
s2.dropna(inplace=True)
Output:
1   -8.453876e-12
3   -1.486037e-11
5    2.482933e-12
7   -3.388330e-12
UPDATE
import pandas as pd

d = [[1, 1.5020712104685252e-11],
     [1, 6.56683605063102e-12],
     [2, 1.3993315187144084e-11],
     [2, -8.670502467042485e-13],
     [3, 7.0270625256163566e-12],
     [3, 9.509995221868016e-12],
     [4, 1.2901435995915644e-11],
     [4, 9.513106448422182e-12]]
df = pd.DataFrame(d, columns=['cycleID', 'mean'])
df2 = df.groupby(['cycleID']).diff().dropna().rename(columns={'mean': 'difference'})
df2['mean'] = df['mean'].iloc[df2.index]
      difference          mean
1  -8.453876e-12  6.566836e-12
3  -1.486037e-11 -8.670502e-13
5   2.482933e-12  9.509995e-12
7  -3.388330e-12  9.513106e-12
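On the scalability point from the question: groupby().diff() already generalizes, producing n-1 differences for a group of n rows. A small sketch with made-up values and three rows for cycleID 1:

import pandas as pd

df3 = pd.DataFrame({'cycleID': [1, 1, 1, 2, 2],
                    'mean': [1.0, 2.5, 3.0, 4.0, 4.5]})
# the first row of each group becomes NaN and is dropped
print(df3.groupby('cycleID')['mean'].diff().dropna())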
I have an Excel spreadsheet with raw data in it:
demo-data:
1
2
3
4
5
6
7
8
9
How do I combine all the numbers into one series, so I can start doing math on them? They are all just numbers of the same "kind".
Given your dataframe as df, df.values.flatten() may help.
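For example, wrapped in a Series so you can do math on it (a sketch with a hypothetical 3x3 grid standing in for the spreadsheet):

import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
s = pd.Series(df.values.flatten())  # one flat Series of all the cells
print(s.sum())  # 45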
You can convert your dataframe to a list and iterate through it, extracting the values into a 1D list:

import pandas as pd

df = pd.read_excel("data.xls")
lst = df.to_numpy().tolist()
result = []
for row in lst:
    for item in row:
        result.append(item)
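The nested loop above is equivalent to a one-liner, if you prefer (using the same df):

result = df.to_numpy().ravel().tolist()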
I have a column of integers, some are unique and some are the same. I want to add a column of random floats between 0 and 1 per row, but I want all of the floats to be the same per integer.
The code I'm providing shows a column of ints and a second column of random floats, but I need the floats for equal ints, like 1, 1, and 1, or 6 and 6, to all be the same, while still having the float assigned to each int be randomly generated. The ints I'm working with are 8 digits, and the data set I am using is about 500,000 lines, so I am trying to be as efficient as possible.
I've created a working solution that iterates through the data frame after it has been created, but generating the random column and then iterating through it to check for matching ints takes long. I wasn't sure if there was a more efficient method.
import numpy as np
import pandas as pd
col1 = [1,1,1,2,3,3,3,4,5,6,6,7]
col2 = np.random.uniform(0,1,12)
data = np.array([col1, col2])
df1 = pd.DataFrame(data=data)
df1 = df1.transpose()
Use transform after a groupby:
import numpy as np
import pandas as pd

col1 = [1,1,1,2,3,3,3,4,5,6,6,7]
df = pd.DataFrame(col1, columns=['Col1'])
df['Col2'] = df.groupby('Col1')['Col1'].transform(lambda x: np.random.rand())
Result:
Col1 Col2
0 1 0.304472
1 1 0.304472
2 1 0.304472
3 2 0.883114
4 3 0.381417
5 3 0.381417
6 3 0.381417
7 4 0.668433
8 5 0.365895
9 6 0.484803
10 6 0.484803
11 7 0.403913
This takes about 200 ms for 600K rows on my old laptop computer.
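If that is still too slow, a vectorized sketch (my own variation, not from the original answer) uses pd.factorize to draw one float per unique integer and index into it:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': [1, 1, 1, 2, 3, 3, 3, 4, 5, 6, 6, 7]})
codes, uniques = pd.factorize(df['Col1'])         # one integer code per unique value
df['Col2'] = np.random.rand(len(uniques))[codes]  # equal ints share a float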
This isn't totally iteration-free, but you're still only iterating over groups rather than every single row, so it's a touch better:
import numpy as np
import pandas as pd

col1 = [1,1,1,2,3,3,3,4,5,6,6,7]
col2 = np.random.uniform(0, 1, len(set(col1)))
data = np.array([col1])
df1 = pd.DataFrame(data=data)
df1 = df1.transpose()
df2 = df1.groupby(0)
counter = 0
final_df = pd.DataFrame(columns=[0, 1])
for key, item in df2:
    temp_df = df2.get_group(key).copy()
    # broadcast this group's random float across all of its rows
    temp_df[1] = [col2[counter]] * temp_df.shape[0]
    counter += 1
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    final_df = pd.concat([final_df, temp_df])
final_df should be the result you're looking for.
Create a dictionary with random floats for each integer key, and then map Column 2 to the dictionary.
For integers already in Column1, start by making the dictionary:
import random

myInts = df.Column1.unique().tolist()
myFloats = [random.uniform(0, 1) for i in range(len(myInts))]
myDictionary = dict(zip(myInts, myFloats))
This will give you:
{0: 0.7361124230574458,
1: 0.8039650720388128,
2: 0.7474880952026456,
3: 0.06792890878546265,
4: 0.4765215518349696,
5: 0.8058550699163101,
6: 0.8865969467094966,
7: 0.251791893958454,
8: 0.42261798056239686,
9: 0.03972320851777933,
....
}
Then map the dictionary keys to Column 1 so that each identical integer gets the same float. Something like:
df.Column2 = df.Column1.map(myDictionary)
More info on how to map a series to a dictionary is here:
Using if/else in pandas series to create new series based on conditions
In this way you can get the desired results without rearranging your dataframe or iterating through it.
Cheers!
I am new to Python and have a basic question. I have an empty dataframe, Resulttable, with columns A, B, and C, which I want to keep filling with the answers of some calculations that I run in a loop represented by the loop index n. For example, I want to store the value 12 in the nth row of column A, 35 in the nth row of column B, and so on for the whole range of n.
I have tried something like
Resulttable['A'].iloc[n] = 12
Resulttable['B'].iloc[n] = 35
I get an error, "single positional indexer is out-of-bounds", for the first value of n, n=0.
How do I resolve this? Thanks!
You can first create an empty pandas dataframe and then append rows one by one as you calculate. In your range you need to specify one above the highest value you want, i.e. range(0, 13) if you want to iterate over 0-12.
import pandas as pd

df = pd.DataFrame([], columns=["A", "B", "C"])
for i in range(0, 13):
    x = i**1
    y = i**2
    z = i**3
    df_tmp = pd.DataFrame([(x, y, z)], columns=["A", "B", "C"])
    df = df.append(df_tmp)  # note: DataFrame.append was removed in pandas 2.0
df = df.reset_index()
This will result in a DataFrame as follows:
df.head()
index A B C
0 0 0 0 0
1 0 1 1 1
2 0 2 4 8
3 0 3 9 27
4 0 4 16 64
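Since DataFrame.append no longer exists in pandas 2.0 and later, a sketch of the same loop for newer versions collects the pieces and concatenates once at the end:

import pandas as pd

frames = []
for i in range(0, 13):
    frames.append(pd.DataFrame([(i, i**2, i**3)], columns=["A", "B", "C"]))
df = pd.concat(frames).reset_index()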
There is no way of filling an empty dataframe like that. Since there are no entries in your dataframe, something like
Resulttable['A'].iloc[n]
will always result in the IndexError you described.
Instead of trying to fill the dataframe like that, it is better to store the results from your loop in a list, which you could call result_list. Then you can create a dataframe from your list like this:
Resulttable = pd.DataFrame({"A": result_list})
If you've got another list of results you want to store in another column of your dataframe, let's say result_list2, then you can create your dataframe like this:
Resulttable = pd.DataFrame({"A": result_list, "B": result_list2})
If Resulttable has already been created, you can add column B like this:
Resulttable["B"] = result_list2
I hope I could help you.
Let's say I have a list of 6 integers named ‘base’ and a dataframe of 100,000 rows with 6 columns of integers as well.
I need to create an additional column which shows the frequency of occurrences of the list ‘base’ against each row in the dataframe.
The order of the integers, both in the list ‘base’ and in the dataframe, is to be ignored in this case.
The occurrence frequency can have a value ranging from 0 to 6.
0 means that none of the 6 integers in the list ‘base’ match any of the 6 columns of a row in the dataframe.
Can anyone shed some light on this please ?
You can try this:
import pandas as pd

# create a frame with six columns of ints
df = pd.DataFrame({'a': [1, 2, 3, 4, 10],
                   'b': [8, 5, 3, 2, 11],
                   'c': [3, 7, 1, 8, 8],
                   'd': [3, 7, 1, 8, 8],
                   'e': [3, 1, 1, 8, 8],
                   'f': [7, 7, 1, 8, 8]})

# list of ints
base = [1, 2, 3, 4, 5, 6]

# define a function to count membership in the list
def base_count(y):
    return sum(True for x in y if x in base)

# apply the function row-wise using the axis=1 parameter
df.apply(base_count, axis=1)
outputs:
0 4
1 3
2 6
3 2
4 0
dtype: int64
then assign it to a new column:
df['g'] = df.apply(base_count, axis=1)
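A vectorized alternative (same df and base as above) skips the Python-level generator inside apply:

# elementwise membership test, then count the True values per row
df['g'] = df.isin(base).sum(axis=1)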