I am fairly new to pandas and have created a data frame called rollParametersDf:
rollParametersDf = pd.DataFrame(columns=['insampleStart','insampleEnd','outsampleStart','outsampleEnd'], index=[])
with the 4 column headings given, which I would like to hold the reference dates for a study I am running. I want to add rows of data (one at a time) with the index names roll1, roll2, ..., rolln, created using the following code:
outsampleEnd = customCalender.iloc[[totalDaysAvailable]]
outsampleStart = customCalender.iloc[[totalDaysAvailable-outsampleLength+1]]
insampleEnd = customCalender.iloc[[totalDaysAvailable-outsampleLength]]
insampleStart = customCalender.iloc[[totalDaysAvailable-outsampleLength-insampleLength+1]]
print('roll',rollCount,'\t',outsampleEnd,'\t',outsampleStart,'\t',insampleEnd,'\t',insampleStart,'\t')
rollParametersDf.append({insampleStart,insampleEnd,outsampleStart,outsampleEnd})
I have tried using append but cannot get an individual row to append.
I would like the final dataframe to look like:
insampleStart insampleEnd outsampleStart outsampleEnd
roll1 1 5 6 8
roll2 2 6 7 9
:
rolln
Pass append a dict of key-value pairs, one scalar value per column, to add a single row (this applies to pandas versions before 2.0, where DataFrame.append was removed):
df = pd.DataFrame({'insampleStart':[], 'insampleEnd':[], 'outsampleStart':[], 'outsampleEnd':[]})
df = df.append({'insampleStart':1, 'insampleEnd':5, 'outsampleStart':6, 'outsampleEnd':8}, ignore_index=True)
The pandas documentation has an example of appending rows to a DataFrame. This appending action is different from that of a list in that this appending action generates a new DataFrame. This means that for each append action you are rebuilding and reindexing the DataFrame which is pretty inefficient. Here is an example solution:
import numpy as np
import pandas as pd

# create empty dataframe
columns = ['insampleStart', 'insampleEnd', 'outsampleStart', 'outsampleEnd']
rollParametersDf = pd.DataFrame(columns=columns)
# loop through 5 rows and append them to the dataframe
for i in range(5):
    # create some artificial data
    data = np.random.normal(size=(1, len(columns)))
    # append creates a new dataframe which makes this operation inefficient
    # ignore_index causes reindexing on each call.
    # (DataFrame.append was removed in pandas 2.0; there, use pd.concat instead)
    rollParametersDf = rollParametersDf.append(pd.DataFrame(data, columns=columns),
                                               ignore_index=True)
print(rollParametersDf)
insampleStart insampleEnd outsampleStart outsampleEnd
0 2.297031 1.792745 0.436704 0.706682
1 0.984812 -0.417183 -1.828572 -0.034844
2 0.239083 -1.305873 0.092712 0.695459
3 -0.511505 -0.835284 -0.823365 -0.182080
4 0.609052 -1.916952 -0.907588 0.898772
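Because every append rebuilds the frame, a common alternative is to collect the rows in a plain Python list and construct the DataFrame once at the end. The following is only a sketch: the integer values stand in for the calendar dates, and the window lengths and snake_case names are assumptions chosen to reproduce the table requested in the question.

```python
import pandas as pd

columns = ['insampleStart', 'insampleEnd', 'outsampleStart', 'outsampleEnd']

# Hypothetical window lengths, picked only to match the question's table
insample_length = 5
outsample_length = 3

rows = []    # one dict per roll
labels = []  # index labels: roll1, roll2, ...
for roll_count in range(1, 4):
    insample_start = roll_count
    insample_end = insample_start + insample_length - 1
    outsample_start = insample_end + 1
    outsample_end = outsample_start + outsample_length - 1
    rows.append({
        'insampleStart': insample_start,
        'insampleEnd': insample_end,
        'outsampleStart': outsample_start,
        'outsampleEnd': outsample_end,
    })
    labels.append(f'roll{roll_count}')

# Build the frame once, after the loop, instead of growing it row by row
roll_parameters_df = pd.DataFrame(rows, columns=columns, index=labels)
print(roll_parameters_df)
```

The loop only touches Python lists, so its cost does not grow with the size of the frame built so far.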
I have a very simple for loop problem and I haven't found a solution in any of the similar questions on Stack. I want to use a for loop to create values in a pandas dataframe. I want the values to be strings that contain a numerical index. I can make the correct value print, but I can't make this value get saved in the dataframe. I'm new to python.
# reproducible example
import pandas as pd
df1 = pd.DataFrame({'x':range(5)})
# for loop to add a row with an index
for i in range(5):
    print("data_{i}.txt".format(i=i)) # this prints the value that I want
    df1['file'] = "data_{i}.txt".format(i=i)
This loop prints the exact value that I want to put into the 'file' column of df1, but when I look at df1, it only uses the last value for the index.
x file
0 0 data_4.txt
1 1 data_4.txt
2 2 data_4.txt
3 3 data_4.txt
4 4 data_4.txt
I have tried using enumerate, but can't find a solution with this. I assume everyone will yell at me for posting a duplicate question, but I have not found anything that works and if someone points me to a solution that solves this problem, I'll happily remove this question.
There are better ways to create a DataFrame, but to answer your question:
Replace the last line in your code:
df1['file'] = "data_{i}.txt".format(i=i)
with:
df1.loc[i, 'file'] = "data_{0}.txt".format(i)
For more information, read about the .loc here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html
On the same page, you can read about accessors like .at and .iloc as well.
You can use a list comprehension:
df1['file'] = ["data_{i}.txt".format(i=i) for i in range(5)]
print(df1)
Prints:
x file
0 0 data_0.txt
1 1 data_1.txt
2 2 data_2.txt
3 3 data_3.txt
4 4 data_4.txt
Or when creating the DataFrame:
df1 = pd.DataFrame({'x':range(5), 'file': ["data_{i}.txt".format(i=i) for i in range(5)]})
print(df1)
OR:
df1 = pd.DataFrame([{'x':i, 'file': "data_{i}.txt".format(i=i)} for i in range(5)])
print(df1)
I've found success with the .at method:
for i in range(5):
    print("data_{i}.txt".format(i=i)) # this prints the value that I want
    df1.at[i, 'file'] = "data_{i}.txt".format(i=i)
Returns:
x file
0 0 data_0.txt
1 1 data_1.txt
2 2 data_2.txt
3 3 data_3.txt
4 4 data_4.txt
When you assign a value to a dataframe column the way you do,
using df['colname'] = 'val', it assigns val across all rows.
That is why you are seeing only the last value.
Change your code to:
import pandas as pd
df1 = pd.DataFrame({'x':range(5)})
# for loop to collect one value per row
to_assign = []
for i in range(5):
    print("data_{i}.txt".format(i=i)) # this prints the value that I want
    to_assign.append("data_{i}.txt".format(i=i))
# outside of the loop - only once - assign to all dataframe rows
df1['file'] = to_assign
As a thought, pandas has a great API for performing these types of actions without for loops.
You should start practicing those.
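As one illustration of a loop-free approach, the file names can be derived directly from the existing 'x' column with string concatenation on the whole Series. This is a sketch that assumes the numeric part of each name should come from 'x', which happens to equal the row position in the question's example.

```python
import pandas as pd

df1 = pd.DataFrame({'x': range(5)})

# Convert the integers to strings, then concatenate element-wise
df1['file'] = 'data_' + df1['x'].astype(str) + '.txt'
print(df1)
```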
I have indexes in the _range_ data frame. I want to use them to look up rows in the test data frame and extract those rows into a new data frame.
My current code is:
d = []
for index in _range_.index:
    d.append(test.loc[[index], :])
_range_ data set:
a
2334 0.097946
3345 0.098201
3357 0.091249
3486 0.098214
5862 0.097946
6873 0.098201
6885 0.091249
7014 0.098214
_test_ data set:
0 1 2 3 4 5
0 4.187268 4.261664 4.329495 4.458864 3.071192 3.652938
You could join the two dataframes together on their common index using how='inner', then keep only the test columns.
cols = test.columns
df = _range_.join(test, how='inner')
df = df[cols]
If the two dataframes have overlapping column names, pass lsuffix='_l' or something similar so that the range columns get a suffix and can be ignored.
I'm unable to test this code on your example though; it might be worth reading over this for future posts: https://stackoverflow.com/help/minimal-reproducible-example
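As an alternative to a join, the rows can be selected directly by label. The sketch below uses small made-up frames (range_df and the values in test are assumptions, not the asker's data) and takes the intersection of the two indexes first, so that labels missing from test do not raise a KeyError.

```python
import pandas as pd

# Hypothetical stand-ins for the two frames in the question
range_df = pd.DataFrame({'a': [0.097946, 0.098201, 0.091249]}, index=[2, 5, 9])
test = pd.DataFrame(
    {0: [4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.7],
     1: [3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7]},
    index=range(7),
)

# Keep only the labels that exist in both frames (9 is absent from test)
common = test.index.intersection(range_df.index)

# One label-based lookup replaces the row-by-row loop
selected = test.loc[common]
print(selected)
```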
I have a pandas dataframe df_causation which I have created as an empty dataframe with a corresponding column name.
df_causation = pd.DataFrame(columns=['Question'])
I have a for loop, in which for each iteration of the loop, I get a new string called cause_str like this:-
for i in range(len(X_test)):
    cause_str = hyp.join(f_imp) #cause_str is a new string obtained for each iteration
(Ignore the method on how this is obtained, I just gave an example)
I would like to append these new strings (cause_str) (all of them) to each successive row in my Pandas dataframe df_causation's Question column. Any suitable way for doing this?
EDIT: EXPECTED OUTPUT
df_causation. **Causation**
Row 0 cause_str from i = 0 th iteration in loop
Row 1 cause_str from i = 1 th iteration in loop etc.
IIUC, this should work:
df['**Causation**'] = df['df_causation.'].apply(lambda x: f'cause str from i = {x.split(" ")[1]} th iteration in loop')
df
df_causation. **Causation**
0 Row 0 cause str from i = 0 th iteration in loop
1 Row 1 cause str from i = 1 th iteration in loop
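Another way to read the question is that each loop iteration produces one string that should land in the next row of the Question column. A minimal sketch of that reading, where make_cause_str is a hypothetical placeholder for however cause_str is actually computed:

```python
import pandas as pd

def make_cause_str(i):
    # Hypothetical stand-in for hyp.join(f_imp) in the question
    return f'cause for sample {i}'

cause_strings = []
for i in range(3):  # range(len(X_test)) in the question
    cause_strings.append(make_cause_str(i))

# Build the column once, after the loop; row i holds the string from iteration i
df_causation = pd.DataFrame({'Question': cause_strings})
print(df_causation)
```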
I am trying to compare two columns (key.response and corr_answer) in a csv file using pandas and creating a new column "Correct_or_Not" that will contain a 1 in the cell if the key.response and corr_answer column are equal and a 0 if they are not. When I evaluate on their own outside of the loop they return the truth value I expect. The first part of the code is just me formatting the data to remove some brackets and apostrophes.
I tried using a for loop, but for some reason it puts a 0 in every row of the 'Correct_or_Not' column.
import pandas as pd
df= pd.read_csv('exptest.csv')
df['key.response'] = df['key.response'].str.replace(']','')
df['key.response'] = df['key.response'].str.replace('[','')
df['key.response'] = df['key.response'].str.replace("'",'')
df['corr_answer'] = df['corr_answer'].str.replace(']','')
df['corr_answer'] = df['corr_answer'].str.replace('[','')
df['corr_answer'] = df['corr_answer'].str.replace("'",'')
for i in range(df.shape[0]):
    if df['key.response'][i] == df['corr_answer'][i]:
        df['Correct_or_Not']=1
    else:
        df['Correct_or_Not']=0
df.head()
key.response corr_answer Correct_or_Not
0 1 1 0
1 2 2 0
2 1 2 0
You can generate the Correct_or_Not column all at once without the loop:
df['Correct_or_Not'] = df['key.response'] == df['corr_answer']
If you need the results as integers, follow it with:
df['Correct_or_Not'] = df['Correct_or_Not'].astype(int)
In your loop you forgot the index [i] when assigning the result. Without it, each iteration overwrites the whole column, so the last row's result ends up applied everywhere.
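Putting the cleanup and the comparison together on a small made-up sample (the values are assumptions in the same shape as the question's columns), a sketch might look like this. It uses str.strip to drop the surrounding brackets and apostrophes, which is a swapped-in alternative to the chained str.replace calls in the question.

```python
import pandas as pd

# Made-up sample shaped like the CSV columns in the question
df = pd.DataFrame({
    'key.response': ["['1']", "['2']", "['1']"],
    'corr_answer': ["['1']", "['2']", "['2']"],
})

# Strip leading/trailing brackets and apostrophes from both columns
for col in ['key.response', 'corr_answer']:
    df[col] = df[col].str.strip("[]'")

# Element-wise comparison of the two columns, converted to 1/0
df['Correct_or_Not'] = (df['key.response'] == df['corr_answer']).astype(int)
print(df)
```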
You can also keep the loop and assign to one row at a time with .loc:
df['Correct_or_Not'] = 0
for i in range(df.shape[0]):
    if df['key.response'][i] == df['corr_answer'][i]:
        df.loc[i, 'Correct_or_Not'] = 1
I tried to modify the dataframe through function by looping through rows and return the modified dataframe. In the below code, I pass a dataframe 'ding' to function 'test' and create a new column 'C' by iterating through every row and return the modified dataframe. I expected the test_ding df to have 3 columns but could see only two columns. Any help is highly appreciated.
P.S. There may be easier ways to accomplish this small task, but I am looking to iterate over rows and would like the modifications made to the dataframe to be reflected outside of the function.
s1 = pd.Series([1,3,5,6,8,10,1,1,1,1,1,1])
s2 = pd.Series([4,5,6,8,10,1,7,1,6,5,4,3])
ding=pd.DataFrame({'A':s1,'B':s2})
def test(ding):
    for index,row in ding.iterrows():
        row['C']=row.A+row.B
    return ding
test_ding=test(ding)
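Set the value on the original data frame instead of on row; row is a separate Series, so adding 'C' to it does not add a column to ding. The set_value method was removed in pandas 1.0; .at is its replacement and is pretty fast if you want to set values cell by cell:
def test(ding):
    for index, row in ding.iterrows():
        ding.at[index, 'C'] = row.A + row.B
    return ding
test_ding=test(ding)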
test_ding
# A B C
#0 1 4 5.0
#1 3 5 8.0
#2 5 6 11.0
# ...
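For comparison, when row-by-row iteration is not strictly required, the same column can be created in one vectorized assignment, and the change is still visible outside the function because the function mutates the frame it receives. A short sketch, where add_sum_column is a hypothetical name and the data is a shortened version of the question's:

```python
import pandas as pd

def add_sum_column(frame):
    # Column assignment modifies the frame passed in, so the caller sees 'C' too
    frame['C'] = frame['A'] + frame['B']
    return frame

ding = pd.DataFrame({'A': [1, 3, 5], 'B': [4, 5, 6]})
test_ding = add_sum_column(ding)
print(test_ding)
```

Here test_ding and ding refer to the same object; pass frame.copy() into the function if the original should stay unchanged.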