Python for loop using an index to create values in a pandas dataframe

I have a very simple for loop problem and I haven't found a solution in any of the similar questions on Stack Overflow. I want to use a for loop to create values in a pandas dataframe: the values should be strings that contain a numerical index. I can make the correct value print, but I can't make this value get saved in the dataframe. I'm new to Python.
# reproducible example
import pandas as pd
df1 = pd.DataFrame({'x': range(5)})
# for loop to add a row with an index
for i in range(5):
    print("data_{i}.txt".format(i=i))  # this prints the value that I want
    df1['file'] = "data_{i}.txt".format(i=i)
This loop prints the exact value that I want to put into the 'file' column of df1, but when I look at df1, it only uses the last value for the index.
   x        file
0  0  data_4.txt
1  1  data_4.txt
2  2  data_4.txt
3  3  data_4.txt
4  4  data_4.txt
I have tried using enumerate but couldn't find a solution with it. I assume everyone will yell at me for posting a duplicate question, but I have not found anything that works; if someone points me to a solution that solves this problem, I'll happily remove this question.

There are better ways to create a DataFrame, but to answer your question:
Replace the last line in your code:
df1['file'] = "data_{i}.txt".format(i=i)
with:
df1.loc[i, 'file'] = "data_{0}.txt".format(i)
For more information, read about .loc here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html
On the same page, you can read about accessors like .at and .iloc as well.
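Putting the fix together, the full corrected loop would look like this (a sketch assuming df1 keeps its default RangeIndex, so the counter i doubles as the row label):
import pandas as pd

df1 = pd.DataFrame({'x': range(5)})
for i in range(5):
    # .loc writes into df1 itself, at row label i and column 'file'
    df1.loc[i, 'file'] = "data_{0}.txt".format(i)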

You can use a list comprehension:
df1['file'] = ["data_{i}.txt".format(i=i) for i in range(5)]
print(df1)
Prints:
x file
0 0 data_0.txt
1 1 data_1.txt
2 2 data_2.txt
3 3 data_3.txt
4 4 data_4.txt
Or at DataFrame creation:
df1 = pd.DataFrame({'x':range(5), 'file': ["data_{i}.txt".format(i=i) for i in range(5)]})
print(df1)
Or:
df1 = pd.DataFrame([{'x':i, 'file': "data_{i}.txt".format(i=i)} for i in range(5)])
print(df1)
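On Python 3.6+ you can also write the same comprehension with an f-string (a small modernization, not in the original answer; the output is identical):
df1['file'] = [f"data_{i}.txt" for i in range(5)]
print(df1)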

I've found success with the .at method
for i in range(5):
    print("data_{i}.txt".format(i=i))  # this prints the value that I want
    df1.at[i, 'file'] = "data_{i}.txt".format(i=i)
Returns:
   x        file
0  0  data_0.txt
1  1  data_1.txt
2  2  data_2.txt
3  3  data_3.txt
4  4  data_4.txt

When you assign a scalar to a dataframe column the way you do - using df['colname'] = val - pandas broadcasts that single value across all rows.
That is why you are seeing only the last value.
Change your code to:
import pandas as pd

df1 = pd.DataFrame({'x': range(5)})
# for loop to build the values with an index
to_assign = []
for i in range(5):
    print("data_{i}.txt".format(i=i))  # this prints the value that I want
    to_assign.append("data_{i}.txt".format(i=i))
# outside of the loop - only once - assign to all dataframe rows
df1['file'] = to_assign
As a thought, pandas has a great API for performing these types of operations without for loops.
You should start practicing those.
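For instance, one possible vectorized version of this exact task (a sketch, building the strings from the existing x column instead of looping):
df1['file'] = 'data_' + df1['x'].astype(str) + '.txt'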

Related

Count the number of elements in a list where the list contains the empty string

I'm having difficulties counting the number of elements in a list within a DataFrame's column. My problem comes from the fact that, after importing my input csv file, the rows that are supposed to contain an empty list [] are actually parsed as lists containing the empty string [""]. Here's a reproducible example to make things clearer:
import pandas as pd
df = pd.DataFrame({"ID": [1, 2, 3], "NETWORK": [[""], ["OPE", "GSR", "REP"], ["MER"]]})
print(df)
   ID          NETWORK
0   1               []
1   2  [OPE, GSR, REP]
2   3            [MER]
Even though one might think that the list for the row where ID = 1 is empty, it's not. It actually contains the empty string [""] which took me a long time to figure out.
So whatever standard method I try to use to calculate the number of elements within each list, I get a wrong value of 1 for the rows that are supposed to be empty:
df["COUNT"] = df["NETWORK"].str.len()
print(df)
   ID          NETWORK  COUNT
0   1               []      1
1   2  [OPE, GSR, REP]      3
2   3            [MER]      1
I searched and tried a lot of things before posting here but I couldn't find a solution to what seems to be a very simple problem. I should also note that I'm looking for a solution that doesn't require me to modify my original input file nor modify the way I'm importing it.
You just need to write a custom apply function that ignores the empty string '':
df['COUNT'] = df['NETWORK'].apply(lambda x: sum(1 for w in x if w!=''))
Another way:
df['NETWORK'].apply(lambda x: len([y for y in x if y]))
Using apply is probably more straightforward. Alternatively, explode, filter, then group by count.
_s = df['NETWORK'].explode()
_s = _s[_s != '']
df['count'] = _s.groupby(level=0).count()
This yields:
   ID          NETWORK  count
0   1               []    NaN
1   2  [OPE, GSR, REP]    3.0
2   3            [MER]    1.0
Fill NA with zeroes if needed.
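For example, a one-line sketch:
df['count'] = df['count'].fillna(0).astype(int)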
df["COUNT"] = df["NETWORK"].apply(lambda x: len(x))
Use a lambda function on each row and in the lambda function return the length of the array

Save a pandas dataframe inside another dataframe

I have a tricky case. Can't wrap my head around it.
I have a pandas dataframe like below:
In [3]: df = pd.DataFrame({'stat_101':[31937667515, 47594388534, 43568256234], 'group_id_101':[1,1,1], 'level_101':[1,2,2], 'stat_102':['00005#60-78','00005#60-78','00005#60-78'], 'avg_104':[27305.34552, 44783.49401, 22990.77442]})
In [4]: df
Out[4]:
      stat_101  group_id_101  level_101     stat_102      avg_104
0  31937667515             1          1  00005#60-78  27305.34552
1  47594388534             1          2  00005#60-78  44783.49401
2  43568256234             1          2  00005#60-78  22990.77442
I want to group this on the 'group_id_101' and 'stat_102' columns and create another dataframe that stores the result of the grouped dataframe inside it.
Expected output:
In [27]: res = pd.DataFrame({'new_stat_101':[1], 'stat_102':['00005#60-78'], 'new_avg':['Dataframe_obj']})
In [28]: res
Out[28]:
   new_stat_101     stat_102        new_avg
0             1  00005#60-78  Dataframe_obj
Where the Dataframe_obj will be another dataframe with rows like below:
      stat_101  level_101      avg_104
0  31937667515          1  27305.34552
1  47594388534          2  44783.49401
2  43568256234          2  22990.77442
What is the best way to do this? Should I be saving a dataframe inside another dataframe or there's a more cleaner way of doing it?
Hope my question is clear.
Let's try:
g = ['group_id_101', 'stat_102']
idx, dfs = zip(*df.groupby(g))
pd.DataFrame({'new_avg': dfs}, index=pd.MultiIndex.from_tuples(idx, names=g))
                                                                new_avg
group_id_101 stat_102
1            00005#60-78  stat_101 group_id_101 level_101 st...
"new_avg" is a column of DataFrames accessible by index.
Obligatory disclaimer: This is blatant abuse of DataFrames, you should typically not store objects that cannot take advantage of pandas vectorization.
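For completeness, a sketch of getting one nested frame back out (res is a name introduced here for the result above):
res = pd.DataFrame({'new_avg': dfs}, index=pd.MultiIndex.from_tuples(idx, names=g))
inner = res.loc[(1, '00005#60-78'), 'new_avg']  # the stored sub-DataFrame for that group
print(inner)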

Creating a new column but creates copy of dataframe

I would like to check the value of the row above and see if it is the same as the current row. I found a great answer here: df['match'] = df.col1.eq(df.col1.shift()), where col1 is the column you are comparing.
However, when I tried it, I received the warning SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. My col1 is a string. I know you can suppress warnings, but how would I check the same row above while making sure that I am not creating a copy of the dataframe? Even with the warning I do get my desired output, but I was curious whether a better way exists.
import pandas as pd

data = {'col1': ['a','a','a','b','b','c','c','c','d','d'],
        'week': [1,1,1,1,1,2,2,2,2,2]}
df = pd.DataFrame(data, columns=['col1','week'])
df['check_condition'] = 1
while sum(df.check_condition) != 0:
    for week in df.week:
        wk = df.loc[df.week == week]
        wk['match'] = wk.col1.eq(wk.col1.shift())  # <-- where the warning occurs
        # fix the repetitive value...which I have not done yet
        # for now just exit out of the while loop
        df.loc[df.week == week, 'check_condition'] = 0
You can't ignore a pandas SettingWithCopyWarning!
It's 100% telling you that your code is not going to work as intended, if at all. Stop, investigate and fix it. (It's not an ignorable thing you can filter out, like a pandas FutureWarning nagging about deprecation.)
Multiple issues with your code:
You're trying to iterate over a dataframe (but not with groupby()) and take slices of it (the sub-dataframe wk, which yes, is a copy of a slice)...
then assign to the (nonexistent) new column wk['match']. This is bad, and you shouldn't do it. (You could initialize df['match'] = np.nan, but it would still be wrong to assign to the copy in wk)...
SettingWithCopyWarning is triggered when you try to assign to wk['match']. It's telling you wk is a copy of a slice from dataframe df, not df itself; hence the message: A value is trying to be set on a copy of a slice from a DataFrame. Any such assignment would get thrown away the next time your loop overwrites wk, so even if you could force it to work on wk it would be wrong. That's why SettingWithCopyWarning is a code smell: you shouldn't be making a copy of a slice of df in the first place.
Later on, you also try to assign to column df['check_condition'] while iterating over df; that's also bad.
Solution:
df['check_condition'] = df['col1'].eq(df['col1'].shift()).astype(int)
df
  col1  week  check_condition
0    a     1                0
1    a     1                1
2    a     1                1
3    b     1                0
4    b     1                1
5    c     2                0
6    c     2                1
7    c     2                1
8    d     2                0
9    d     2                1
More generally, for more complicated code where you want to iterate over each group of a dataframe according to some grouping criteria, you'd use groupby() and split-apply-combine instead.
Here you're effectively grouping on wk.col1.eq(wk.col1.shift()), i.e. rows where the col1 value doesn't change from the preceding row,
and (matching the solution's output) you want check_condition to be 1 on those repeated rows
and 0 on rows where the col1 value did change from the preceding row.
But in this simpler case you can skip groupby() and do a direct assignment, as the sketch below shows for the grouped variant.
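A minimal sketch of that grouped version, assuming the comparison should be restricted to each week (which is what the original loop attempted):
df['check_condition'] = (
    df.groupby('week')['col1']
      .transform(lambda s: s.eq(s.shift()))  # per-week comparison to the previous row
      .astype(int)
)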

Creating New Column in Pandas Data frame with For Loop Issue

I am trying to compare two columns (key.response and corr_answer) in a csv file using pandas, creating a new column "Correct_or_Not" that will contain a 1 if the key.response and corr_answer columns are equal and a 0 if they are not. When I evaluate the comparison on its own outside of the loop, it returns the truth values I expect. The first part of the code is just me formatting the data to remove some brackets and apostrophes.
I tried using a for loop, but for some reason it puts a 0 in every row of "Correct_or_Not".
import pandas as pd

df = pd.read_csv('exptest.csv')
# regex=False (added here) so '[' is treated literally; the default regex parsing rejects a lone '['
df['key.response'] = df['key.response'].str.replace(']', '', regex=False)
df['key.response'] = df['key.response'].str.replace('[', '', regex=False)
df['key.response'] = df['key.response'].str.replace("'", '', regex=False)
df['corr_answer'] = df['corr_answer'].str.replace(']', '', regex=False)
df['corr_answer'] = df['corr_answer'].str.replace('[', '', regex=False)
df['corr_answer'] = df['corr_answer'].str.replace("'", '', regex=False)
for i in range(df.shape[0]):
    if df['key.response'][i] == df['corr_answer'][i]:
        df['Correct_or_Not'] = 1
    else:
        df['Correct_or_Not'] = 0
df.head()
  key.response corr_answer  Correct_or_Not
0            1           1               0
1            2           2               0
2            1           2               0
You can generate the Correct_or_Not column all at once without the loop:
df['Correct_or_Not'] = df['key.response'] == df['corr_answer']
and df['Correct_or_Not'] = df['Correct_or_Not'].astype(int) if you need the results as integers.
In your loop you forgot the index [i] when assigning the result, so each iteration overwrites the entire column; the last row's comparison (1 vs 2, i.e. False) is what ends up applied everywhere.
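The two lines above can also be condensed into one (equivalent):
df['Correct_or_Not'] = (df['key.response'] == df['corr_answer']).astype(int)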
You can also do this:
df['Correct_or_not'] = 0
for i in range(df.shape[0]):
    if df['key.response'][i] == df['corr_answer'][i]:
        df.loc[i, 'Correct_or_not'] = 1  # .loc, not chained df[...][i], so the write hits df itself

Pandas - Add column containing metadata about the row

I want to add a column to a Dataframe that will contain a number derived from the number of NaN values in the row, specifically: one less than the number of non-NaN values in the row.
I tried:
for index, row in df.iterrows():
    count = row.value_counts()
    val = sum(count) - 1
    df['Num Hits'] = val
Which raises a warning:
-c:4: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
and puts the first val value into every cell of the new column. I've tried reading about .loc and indexing in the Pandas documentation but failed to make sense of it. I gather that .loc wants a row_index and a column_index, but I don't know whether these are pre-defined in every dataframe (so I just have to specify them somehow) or whether I need to set an index on the dataframe before telling the loop where to place the new value, val.
You can totally do it in a vectorized way without using a loop, which is likely to be faster than the loop version:
In [89]:
print df
          0         1         2         3
0  0.835396  0.330275  0.786579  0.493567
1  0.751678  0.299354  0.050638  0.483490
2  0.559348  0.106477  0.807911  0.883195
3  0.250296  0.281871  0.439523  0.117846
4  0.480055  0.269579  0.282295  0.170642
In [90]:
#number of valid numbers - 1
df.apply(lambda x: np.isfinite(x).sum()-1, axis=1)
Out[90]:
0    3
1    3
2    3
3    3
4    3
dtype: int64
@DSM brought up a good point that the above solution is still not fully vectorized. A fully vectorized form is simply (~df.isnull()).sum(axis=1) - 1.
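Applied to the question's new column, that becomes (a one-line sketch):
df['Num Hits'] = (~df.isnull()).sum(axis=1) - 1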
You can use the index variable that you define as part of the for loop as the row_index that .loc is looking for:
for index, row in df.iterrows():
    count = row.value_counts()
    val = sum(count) - 1
    df.loc[index, 'Num Hits'] = val
