assign column values to variable out of data frame - python

I read an excel sheet into a data frame. Is there a possibility to loop over the columns and assign a list with the values of each column to a variable that has the column name as a variable name?
So as a simple example I have the data frame
val = [[1,396,29],[2,397,29],[3,395,29],[4,393,29],[5,390,29],[6,398,29]]
df=pd.DataFrame(val,columns=['Hours','T1T_in','T1p_in'])
df
Hours T1T_in T1p_in
0 1 396 29
1 2 397 29
2 3 395 29
3 4 393 29
4 5 390 29
5 6 398 29
So I would like a loop that creates lists, with each column name used as the variable name:
Hours = [1,2,3,4,5,6]
T1T_in = [396,397,395,393,390,398]
T1p_in = [29,29,29,29,29,29]
I found a way to get the column names out, but I cannot assign the values. Thank you for your help.

The simplest way to get a column's elements as a list is pandas.Series.to_list():
Hours = df.Hours.to_list()
T1T_in = df.T1T_in.to_list()
T1p_in = df.T1p_in.to_list()
You can also use a for loop (as you mentioned you wanted) to go through the columns and collect the values of each one:
data = {}
for column_name, rows in df.items():  # df.iteritems() was removed in pandas 2.0
    data[column_name] = rows.to_list()
print(data)
output:
{'Hours': [1, 2, 3, 4, 5, 6],
'T1T_in': [396, 397, 395, 393, 390, 398],
'T1p_in': [29, 29, 29, 29, 29, 29]}
the above result can also be achieved with :
df.to_dict('list')
as @anky_91 pointed out.

It works so far, but I have one question: I still do not understand how I can access the values of each variable now. Next, I want to work with these lists and calculate some items. A simplified example is below.
new = []
for i in range(len(Hours)):
    new.append(T1T_in[i] + T1p_in[i])
Thank you once again for your help.
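One way to access the values is to look each list up by its column name in the dictionary built in the answer above. The following is a minimal sketch under that assumption; it reuses the sample data from the question, and the name vectorized is introduced here only for illustration.

```python
import pandas as pd

# Sample data from the question
val = [[1, 396, 29], [2, 397, 29], [3, 395, 29],
       [4, 393, 29], [5, 390, 29], [6, 398, 29]]
df = pd.DataFrame(val, columns=['Hours', 'T1T_in', 'T1p_in'])

# One dictionary holding a list per column, keyed by column name
data = df.to_dict('list')

# Access a column's list through its name
T1T_in = data['T1T_in']
T1p_in = data['T1p_in']

# Element-wise sum of the two lists
new = [t + p for t, p in zip(T1T_in, T1p_in)]
print(new)

# The same calculation done directly on the DataFrame columns
vectorized = (df['T1T_in'] + df['T1p_in']).to_list()
print(vectorized)
```

Keeping the lists in a dictionary avoids creating variables dynamically, and for simple arithmetic the column-wise version usually makes the intermediate lists unnecessary.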

Related

Store values from rows of a DataFrame and use them in another operation

I have a DataFrame I read from a CSV file and I want to store the individual values from the rows in the DataFrame in some variables. I want to use the values from the DataFrame in another step to perform another operation. Note that I do not want the result as series but values such as integers. I am still learning but I could not understand those resources I have consulted. Thank you in advance.
X Y Z
1 2 3
3 2 1
4 5 6
I want the values in a variable as x=1,3,4 and so on, as stated above.
There are many ways to do this, but one simple method is to iterate over the DataFrame's index. To illustrate it, I will create a dictionary, convert it to a DataFrame, and then iterate over its rows.
# Start by importing pandas as pd
import pandas as pd
# Define a dictionary that contains a player's stats (just for illustration, not real data)
myData = {'Football Club': ['Chelsea', 'Man Utd', 'Inter Milan', 'Everton'],
          'Matches Played': [2, 32, 36, 37],
          'Goals Scored': [1, 12, 24, 25],
          'Assist Given': [0, 0, 11, 6],
          'Red card': [0, 0, 0, 0],
          'Yellow Card': [0, 4, 4, 3]}
# Next, create a DataFrame from the dictionary (the 'Assist Given' column is left out here)
df = pd.DataFrame(myData, columns=['Football Club', 'Matches Played', 'Goals Scored', 'Red card', 'Yellow Card'])
# See what the data look like
print("This is the created Dataframe from the dictionary:\n", df)
print("\n Now, you can iterate over selected rows or all the rows using index attribute as follows:\n")
# Store the values of each row in variables
for indIte in df.index:
    clubs = df.loc[indIte, 'Football Club']
    goals = df.loc[indIte, 'Goals Scored']
    matches = df.loc[indIte, 'Matches Played']
    # The values can now be used later in the same program
    print(clubs, matches, goals)
You will get the following results:
This is the created Dataframe from the dictionary:
   Football Club  Matches Played  Goals Scored  Red card  Yellow Card
0        Chelsea               2             1         0            0
1        Man Utd              32            12         0            4
2    Inter Milan              36            24         0            4
3        Everton              37            25         0            3
Now, you can iterate over selected rows or all the rows using index
attribute as follows:
Chelsea 2 1
Man Utd 32 12
Inter Milan 36 24
Everton 37 25
Use:
x, y, z = df.to_dict(orient='list').values()
>>> x
[1, 3, 4]
>>> y
[2, 2, 5]
>>> z
[3, 1, 6]
df.values returns the contents of the DataFrame as a NumPy array, so you can work with df.values in subsequent processing.
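As a rough illustration of the array approach, the sketch below rebuilds the X, Y, Z table from the question and pulls the columns out of the array. It uses DataFrame.to_numpy(), which returns the same kind of array as df.values; the variable names are chosen here for illustration.

```python
import pandas as pd

# The table from the question
df = pd.DataFrame({'X': [1, 3, 4], 'Y': [2, 2, 5], 'Z': [3, 1, 6]})

# 2-D NumPy array with one row per DataFrame row
arr = df.to_numpy()

# Transpose so that each row of the array is one column, then convert to plain lists
x, y, z = arr.T.tolist()
print(x, y, z)

# A single cell as a plain Python integer (row 0, column 'X')
first_x = int(arr[0, 0])
print(first_x)
```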

Get indices of given rows present in a dataframe

I have a dataframe that looks like this:
data = [[1, 10,100], [1.5, 15, 25], [7, 14, 70], [33,44,55]]
df = pd.DataFrame(data, columns = ['A', 'B','C'])
It looks like this:
A B C
1 10 100
1.5 15 25
7 14 70
33 44 55
I have other data, that is a random subset of rows from the dataframe, so something like this
set_of_rows = [[1,10,100], [33,44,55]]
I want to get the indices indicating the location of each row in set_of_rows inside df. So I need a function that does something like this:
indeces = func(subset=set_of_rows, dataframe=df)
In [1]: print(indeces)
Out[1]: [0, 3]
What function can do this? Thanks.
Try the following:
[i for i in df.index if df.loc[i].to_list() in set_of_rows]
#[0, 3]
If you want it as a function:
def func(set_of_rows, df):
    return [i for i in df.index if df.loc[i].to_list() in set_of_rows]
You can check this thread out;
Python Pandas: Get index of rows which column matches certain value
As far as I know, there is no built-in pandas function dedicated to this task, so iterating over the rows is a straightforward way to go about it. If you are concerned about rows that are not found, you can handle that case in the loop; the function here simply returns an empty list when nothing matches.
def find_indices(set_of_rows, df):
    indices = []
    for i in df.index:
        lst = df.loc[i].to_list()
        if lst in set_of_rows:
            indices.append(i)
    return indices
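A variant of the row-by-row check, sketched under the assumption that rows can be compared as tuples: putting the wanted rows into a set avoids scanning set_of_rows once per DataFrame row. The helper name row_indices_by_tuple is made up for this example, and matching relies on exact equality, which can be fragile for computed floating-point values.

```python
import pandas as pd

# Sample data from the question
data = [[1, 10, 100], [1.5, 15, 25], [7, 14, 70], [33, 44, 55]]
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
set_of_rows = [[1, 10, 100], [33, 44, 55]]


def row_indices_by_tuple(set_of_rows, df):
    """Return the index labels of df rows that appear in set_of_rows."""
    # Tuples are hashable, so membership tests against the set are fast
    wanted = {tuple(row) for row in set_of_rows}
    # itertuples(index=False, name=None) yields each row as a plain tuple
    rows = df.itertuples(index=False, name=None)
    return [i for i, row in zip(df.index, rows) if row in wanted]


indices = row_indices_by_tuple(set_of_rows, df)
print(indices)
```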

For loop in pandas dataframe using enumerate

I have a basic dataframe which is the result of a groupby on unclean data:
df:
Name1 Value1 Value2
A 10 30
B 40 50
I have created a list as follows:
Segment_list = df['Name1'].unique()
Segment_list
array(['A', 'B'], dtype=object)
Now I want to traverse the list and find the amount in Value1 for each iteration, so I am using:
for Segment_list in enumerate(Segment_list):
    print(df['Value1'])
But I am getting both values instead of one at a time. I just need one value per iteration. Is this possible?
Expected output:
10
40
I recommend using pandas.DataFrame.groupby to get the values for each group.
For the most part, using a for-loop with pandas is an indication that it's probably not being done correctly or efficiently.
Additional resources:
Fast, Flexible, Easy and Intuitive: How to Speed Up Your Pandas Projects
Stack Overflow Pandas Tag Info Page
Option 1:
import pandas as pd
import numpy as np
import random
np.random.seed(365)
random.seed(365)
rows = 25
data = {'n': [random.choice(['A', 'B', 'C']) for _ in range(rows)],
        'v1': np.random.randint(40, size=(rows)),
        'v2': np.random.randint(40, size=(rows))}
df = pd.DataFrame(data)
# groupby n
for g, d in df.groupby('n'):
    # print(g)  # use or not, as needed
    print(d.v1.values[0])  # selects the first value of each group and prints it
[out]: # first value of each group
5
33
18
Option 2:
dfg = df.groupby(['n'], as_index=False).agg({'v1': list})
# display(dfg)
n v1
0 A [5, 26, 39, 39, 10, 12, 13, 11, 28]
1 B [33, 34, 28, 31, 27, 24, 36, 6]
2 C [18, 27, 9, 36, 35, 30, 3, 0]
Option 3:
As stated in the comments, your data is already the result of groupby, and it will only ever have one value in the column for each group.
dfg = df.groupby('n', as_index=False).sum()
# display(dfg)
n v1 v2
0 A 183 163
1 B 219 188
2 C 158 189
# print the value for each group in v1
for v in dfg.v1.to_list():
    print(v)
[out]:
183
219
158
Option 4:
Print all rows for each column
dfg = df.groupby('n', as_index=False).sum()
for col in dfg.columns[1:]:  # selects all columns after n
    for v in dfg[col].to_list():
        print(v)
[out]:
183
219
158
163
188
189
I agree with @Trenton's comment that the whole point of using data frames is to avoid looping through them like this. Re-think this using a function. However, the closest way to make what you've written work is something like this:
Segment_list = df['Name1'].unique()
for Index in Segment_list:
    print(df.loc[df['Name1'] == Index, 'Value1'].iloc[0])
Depending on what you want to happen if there are two entries for a name (presumably this can happen, because you use .unique()), this will print the sum of the values:
df.groupby('Name1')['Value1'].sum()
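For the two-row frame shown in the question, a loop that yields one value per iteration can also be written by pairing the two columns. This is only a sketch, assuming the frame has exactly the columns Name1, Value1 and Value2 with the values shown; the list values is added here so the result can be inspected afterwards.

```python
import pandas as pd

# The grouped frame from the question
df = pd.DataFrame({'Name1': ['A', 'B'],
                   'Value1': [10, 40],
                   'Value2': [30, 50]})

values = []
# zip pairs each name with the Value1 entry from the same row
for name, value in zip(df['Name1'], df['Value1']):
    print(value)
    values.append(value)
```

The original loop printed the whole column because df['Value1'] does not depend on the loop variable; pairing the columns ties each printed value to its own row.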

Trying to call a cell value, why is my list of column values being interpreted as index values?

I need to do some maths using the following dataframe. In a for loop iterating through VALUE column cells, I need to grab the corresponding FracDist.
VALUE FracDist
0 11 0.022133
1 21 0.021187
2 22 0.001336
3 23 0.000303
4 24 0.000015
5 31 0.000611
6 41 0.040523
7 42 0.285630
8 43 0.161956
9 52 0.296993
10 71 0.160705
11 82 0.008424
12 90 0.000130
13 95 0.000053
First I made a list of VALUE values which I can use in a for loop, which worked as expected:
IN: LCvals = df['VALUE'].tolist()
    print(LCvals)
OUT: [11, 21, 22, 23, 24, 31, 41, 42, 43, 52, 71, 82, 90, 95]
When I try to grab a cell from the dataframe's FracDist column based on which VALUE row the for loop is on, that is where a problem comes up. Instead of looking up rows using VALUE from the VALUE column, the code is trying to lookup rows using VALUE as the index. So what I get:
IN: for val in LCvals:
        print(val)
        print(LCdf.loc[val]['FracDist'])
OUT: 11
0.00842444155517
21
KeyError: 'the label [21] is not in the [index]'
Note that the FracDist row that is grabbed for VALUE=11 is from index 11, not VALUE 11.
What needs to change in that for loop code to query rows based on VALUE in the VALUE column rather than VALUE as a spot in the index?
Here pd.DataFrame.loc will index first by row label and then, if a second argument is supplied, by column label. This is by design. See also Indexing and Selecting Data.
Don't, under any circumstances, use chained indexing. For example, Boolean indexing followed by column label selection via LCdf.loc[LCdf['VALUE']==val]['FracDist'] is not recommended.
If you wish to iterate a single series, you can use pd.Series.items. But here you are using 'VALUE' as if it were an index, so you can use set_index first:
for val, dist in df.set_index('VALUE')['FracDist'].items():
    print(val, dist)
11 0.022133
21 0.021187
...
90 0.00013
95 5.3e-05
If you pass an integer to .loc, it returns the row whose index label is that integer (in this case the row at index 11, not the row where VALUE is 11). You could instead filter the rows with a Boolean mask: LCdf.loc[LCdf['VALUE']==val]['FracDist'].
Edit: Here is a better (more efficient) answer:
for index, row in LCdf.iterrows():
    print(row['VALUE'])
    print(row['FracDist'])
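If the original loop over LCvals needs to stay, one option is to build a Series indexed by VALUE once and look values up by label. The sketch below assumes the VALUE entries are unique and uses only the first three rows of the question's table; the names frac_by_value and lookups are introduced here for illustration.

```python
import pandas as pd

# First three rows of the table from the question
LCdf = pd.DataFrame({'VALUE': [11, 21, 22],
                     'FracDist': [0.022133, 0.021187, 0.001336]})
LCvals = LCdf['VALUE'].tolist()

# Series whose index labels are the VALUE entries
frac_by_value = LCdf.set_index('VALUE')['FracDist']

lookups = {}
for val in LCvals:
    # .loc now matches on VALUE, because VALUE is the index of this Series
    lookups[val] = frac_by_value.loc[val]
    print(val, lookups[val])
```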

Finding duplicates in a column, setting conditions, summing values from another column

I have a csv file and I'm currently using pandas module. Have not found the solution for my problem. Here is the sample, problem, and desired output csv.
Sample csv:
project, id, sec, code
1, 25, 50, 01
1, 25, 50, 12
1, 25, 45, 07
1, 5, 25, 03
1, 25, 20, 06
Problem:
I do not want to drop the duplicated ids. Instead, when an id is duplicated, I want to add the sec values of its other codes (such as 12, 07, and 06) to the sec of the code 01 row. I also need to know how to set conditions: if the sec for code 07 is less than 60, it should not be included in the sum. I have used the following code to sort by columns. However, the .isin filter drops id 5. In a larger file there will be other duplicated ids with similar codes.
df = df.sort_values(by=['id'], ascending=[True])
df2 = df.copy()
sort1 = df2[df2['code'].isin(['01', '07', '06', '12'])]
Desired Output:
project, id, sec, code
1, 5, 25, 03
1, 25, 120, 01
1, 25, 50, 12
1, 25, 45, 07
1, 25, 20, 06
I have thought of parsing through the file but I'm stuck on the logic.
def edit_data(df):
    sum = 0
    with open(df) as file:
        next(file)
        for line in file:
            parts = line.split(',')
            code = float(parts[3])
            id = float(parts[1])
            sec = float(parts[2])
    return ?
I appreciate any help, as I'm new to Python (about three months of experience). Thanks!
Let's try this:
import numpy as np
df = df.sort_values('id')
# Use boolean indexing to eliminate unwanted records, then groupby and sum, and convert the result to a dataframe indexed by the groups.
sumdf = df[~((df.code == 7) & (df.sec < 60))].groupby(['project','id'])['sec'].sum().to_frame()
# Find the first record of each group using duplicated and, again with boolean indexing, set the sec column for those records to NaN.
df.loc[~df.duplicated(subset=['project','id']),'sec'] = np.nan
# Set the index of the original dataframe and use combine_first to replace those NaN with values from the summed, grouped dataframe.
df_out = df.set_index(['project','id']).combine_first(sumdf).reset_index().astype(int)
df_out
Output:
project id code sec
0 1 5 3 25
1 1 25 1 120
2 1 25 12 50
3 1 25 7 45
4 1 25 6 20
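A different way to reach the desired output, sketched under two assumptions that are not stated in the answer above: the total always belongs on the code 01 row of each (project, id) group, and the code column was read as integers (so 01 becomes 1). The names counted, totals, is_target and out are introduced here for illustration.

```python
import pandas as pd

# Sample data from the question, with code read as integers
df = pd.DataFrame({'project': [1, 1, 1, 1, 1],
                   'id': [25, 25, 25, 5, 25],
                   'sec': [50, 50, 45, 25, 20],
                   'code': [1, 12, 7, 3, 6]})

# Rows that count toward the total: everything except code 7 rows with sec < 60
counted = ~((df['code'] == 7) & (df['sec'] < 60))

# Per-group total of the counted sec values, repeated on every row of the group
totals = (df['sec'].where(counted, 0)
          .groupby([df['project'], df['id']])
          .transform('sum'))

# Write the total onto the code 1 rows only; all other rows keep their own sec
out = df.copy()
is_target = out['code'] == 1
out.loc[is_target, 'sec'] = totals[is_target]

print(out.sort_values('id', kind='stable'))
```

Groups without a code 1 row, such as id 5 here, are left unchanged, which matches the desired output in the question.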
