Looping to recode variables in Python

I'm fairly new to programming, and I have a question about using loops to recode variables in a pandas data frame that I was hoping to get some help with.
I want to recode multiple columns in a pandas data frame from units of seconds to minutes. I've written a simple function in Python, which I can copy and repeat for each column, and that works, but I wanted to automate this. I appreciate the help.
The ivf.secondsUntilCC.xxx column contains the number of seconds until something happens. I want the new column ivf.minsUntilCC.xxx to be the number of minutes. The data frame name is data.
def f(x, y):
    return x[y] / 60

data['ivf.minsUntilCC.500'] = f(data, 'ivf.secondsUntilCC.500')
data['ivf.minsUntilCC.1000'] = f(data, 'ivf.secondsUntilCC.1000')
data['ivf.minsUntilCC.2000'] = f(data, 'ivf.secondsUntilCC.2000')
data['ivf.minsUntilCC.3000'] = f(data, 'ivf.secondsUntilCC.3000')
data['ivf.minsUntilCC.4000'] = f(data, 'ivf.secondsUntilCC.4000')

I would use a vectorized approach:
In [27]: df
Out[27]:
X ivf.minsUntilCC.500 ivf.minsUntilCC.1000 ivf.minsUntilCC.2000 ivf.minsUntilCC.3000 ivf.minsUntilCC.4000
0 191365 906395 854268 701859 979647 914942
1 288577 300394 577555 880370 924162 897984
2 66705 493545 232603 682509 794074 204429
3 747828 504930 379035 29230 410390 287327
4 926553 913360 657640 336139 210202 356649
In [28]: df.loc[:, df.columns.str.startswith('ivf.minsUntilCC.')] /= 60
In [29]: df
Out[29]:
X ivf.minsUntilCC.500 ivf.minsUntilCC.1000 ivf.minsUntilCC.2000 ivf.minsUntilCC.3000 ivf.minsUntilCC.4000
0 191365 15106.583333 14237.800000 11697.650000 16327.450000 15249.033333
1 288577 5006.566667 9625.916667 14672.833333 15402.700000 14966.400000
2 66705 8225.750000 3876.716667 11375.150000 13234.566667 3407.150000
3 747828 8415.500000 6317.250000 487.166667 6839.833333 4788.783333
4 926553 15222.666667 10960.666667 5602.316667 3503.366667 5944.150000
Setup:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 10**6, (5, 6)),
                  columns=['X', 'ivf.minsUntilCC.500', 'ivf.minsUntilCC.1000',
                           'ivf.minsUntilCC.2000', 'ivf.minsUntilCC.3000',
                           'ivf.minsUntilCC.4000'])
Explanation:
In [26]: df.loc[:, df.columns.str.startswith('ivf.minsUntilCC.')]
Out[26]:
ivf.minsUntilCC.500 ivf.minsUntilCC.1000 ivf.minsUntilCC.2000 ivf.minsUntilCC.3000 ivf.minsUntilCC.4000
0 906395 854268 701859 979647 914942
1 300394 577555 880370 924162 897984
2 493545 232603 682509 794074 204429
3 504930 379035 29230 410390 287327
4 913360 657640 336139 210202 356649
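If you'd rather keep the original seconds columns and add new minutes columns, as in the question, a minimal sketch (assuming the columns follow the ivf.secondsUntilCC.xxx naming pattern from the question) is:
seconds_cols = data.columns[data.columns.str.startswith('ivf.secondsUntilCC.')]
for col in seconds_cols:
    # Derive the new column name and convert seconds to minutes
    data[col.replace('ivf.secondsUntilCC.', 'ivf.minsUntilCC.')] = data[col] / 60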

Cannot set a DataFrame with multiple columns to the single column total_servings

I am a beginner getting familiar with pandas.
I get an error when I try to create a new column this way:
drinks['total_servings'] = drinks.loc[: ,'beer_servings':'wine_servings'].apply(calculate,axis=1)
Below is my code, and I get the following error for line number 9:
"Cannot set a DataFrame with multiple columns to the single column total_servings"
Any help or suggestion would be appreciated :)
import pandas as pd

drinks = pd.read_csv('drinks.csv')

def calculate(drink):
    return drinks['beer_servings'] + drinks['spirit_servings'] + drinks['wine_servings']

print(drinks)
drinks['total_servings'] = drinks.loc[:, 'beer_servings':'wine_servings'].apply(calculate, axis=1)
drinks['beer_sales'] = drinks['beer_servings'].apply(lambda x: x*2)
drinks['spirit_sales'] = drinks['spirit_servings'].apply(lambda x: x*4)
drinks['wine_sales'] = drinks['wine_servings'].apply(lambda x: x*6)
drinks
In your code, when the function calculate is called with axis=1, each row of the DataFrame is passed as an argument. Here, because calculate sums columns of the global drinks rather than of the row it receives, it returns a whole Series for every row, so apply produces a DataFrame with multiple columns, but you are trying to assign that to a single column, which is not possible. You can try updating your code to this:
def calculate(each_row):
    return each_row['beer_servings'] + each_row['spirit_servings'] + each_row['wine_servings']

drinks['total_servings'] = drinks.apply(calculate, axis=1)
drinks['beer_sales'] = drinks['beer_servings'].apply(lambda x: x*2)
drinks['spirit_sales'] = drinks['spirit_servings'].apply(lambda x: x*4)
drinks['wine_sales'] = drinks['wine_servings'].apply(lambda x: x*6)
print(drinks)
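As a side note, a plain vectorized sum avoids apply entirely; a minimal sketch, assuming the three serving columns are named as above:
# Sum the three columns row-wise without apply
drinks['total_servings'] = drinks[['beer_servings', 'spirit_servings', 'wine_servings']].sum(axis=1)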
I suppose the reason is the wrong argument name inside the calculate method: the given argument is drink, but drinks is used to calculate the sum of the columns.
This matters because drink is a Series object that represents a row, and the sum of its elements is a scalar, while drinks is a DataFrame, so the sum of its columns is a Series object.
The sample code below shows that this method works.
import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 1, 1, 1],
    "B": [2, 2, 2, 2, 2],
    "C": [3, 3, 3, 3, 3]
})

def calculate(to_calc_df):
    return to_calc_df["A"] + to_calc_df["B"] + to_calc_df["C"]

df["total"] = df.loc[:, "A":"C"].apply(calculate, axis=1)
print(df)
Result
A B C total
0 1 2 3 6
1 1 2 3 6
2 1 2 3 6
3 1 2 3 6
4 1 2 3 6

Filter CSV table to have just 2 columns. Python pandas

I have a .csv file with lines like this:
result,table,_start,_stop,_time,_value,_field,_measurement,device
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:12:35Z,44.61,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:12:40Z,17.33,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:12:45Z,41.2,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:12:51Z,33.49,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:12:56Z,55.68,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:12:57Z,55.68,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:13:02Z,25.92,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:13:08Z,5.71,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
I need to make them look like this:
time value
0 2022-10-24T12:12:35Z 44.61
1 2022-10-24T12:12:40Z 17.33
2 2022-10-24T12:12:45Z 41.20
3 2022-10-24T12:12:51Z 33.49
4 2022-10-24T12:12:56Z 55.68
I will need this for my anomaly detection code, so I don't have to manually delete columns and so on, at least not all of them. I can't do it with the program that works with the machine that collects the wattage info.
I tried this, but it doesn't work well enough:
df = pd.read_csv('coffee_machine_2022-11-22_09_22_influxdb_data.csv')
df['_time'] = pd.to_datetime(df['_time'], format='%Y-%m-%dT%H:%M:%SZ')
df = pd.pivot(df, index = '_time', columns = '_field', values = '_value')
df.interpolate(method='linear')  # not necessary
It gives this output:
0
9 83.908
10 80.342
11 79.178
12 75.621
13 72.826
... ...
73522 10.726
73523 5.241
Here is the canonical way to project down to a subset of columns in the pandas ecosystem.
df = df[['_time', '_value']]
You can simply use the keyword argument usecols of pandas.read_csv:
df = pd.read_csv('coffee_machine_2022-11-22_09_22_influxdb_data.csv', usecols=["_time", "_value"])
NB: If you need to read the entire data of your .csv and only then select a subset of columns, pandas core developers suggest using pandas.DataFrame.loc. Otherwise, with the df = df[subset_of_cols] syntax, the moment you start doing operations on the sub-dataframe you'll get a warning:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
So, in your case you can use:
df = pd.read_csv('coffee_machine_2022-11-22_09_22_influxdb_data.csv')
df = df.loc[:, ["_time", "_value"]] #instead of df[["_time", "_value"]]
Another option is pandas.DataFrame.copy,
df = pd.read_csv('coffee_machine_2022-11-22_09_22_influxdb_data.csv')
df = df[["_time", "_value"]].copy()
.read_csv has a usecols parameter to specify which columns you want in the DataFrame.
df = pd.read_csv(f, header=0, usecols=['_time', '_value'])
print(df)
_time _value
0 2022-10-24T12:12:35Z 44.61
1 2022-10-24T12:12:40Z 17.33
2 2022-10-24T12:12:45Z 41.20
3 2022-10-24T12:12:51Z 33.49
4 2022-10-24T12:12:56Z 55.68
5 2022-10-24T12:12:57Z 55.68
6 2022-10-24T12:13:02Z 25.92
7 2022-10-24T12:13:08Z 5.71
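If you also want the headers to match the time/value layout shown in the question, a small follow-up sketch (reusing the same file name) is:
df = pd.read_csv('coffee_machine_2022-11-22_09_22_influxdb_data.csv',
                 usecols=['_time', '_value'])
# Rename to the headers from the desired output
df = df.rename(columns={'_time': 'time', '_value': 'value'})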

apply function takes a long time to run

I'm working with a dataset of about 32,000,000 rows:
RangeIndex: 32084542 entries, 0 to 32084541
df.head()
time device kpi value
0 2020-10-22 00:04:03+00:00 1-xxxx chassis.routing-engine.0.cpu-idle 100
1 2020-10-22 00:04:06+00:00 2-yyyy chassis.routing-engine.0.cpu-idle 97
2 2020-10-22 00:04:07+00:00 3-zzzz chassis.routing-engine.0.cpu-idle 100
3 2020-10-22 00:04:10+00:00 4-dddd chassis.routing-engine.0.cpu-idle 93
4 2020-10-22 00:04:10+00:00 5-rrrr chassis.routing-engine.0.cpu-idle 99
My goal is to create one additional column named role, filled in based on a regex.
This is my approach
def router_role(row):
    if row["device"].startswith("1"):
        row["role"] = '1'
    if row["device"].startswith("2"):
        row["role"] = '2'
    if row["device"].startswith("3"):
        row["role"] = '3'
    if row["device"].startswith("4"):
        row["role"] = '4'
    return row
then,
df = df.apply(router_role, axis=1)
However, it's taking a lot of time... any ideas for another possible approach?
Thanks
Apply is very slow and never very good. Try something like this instead:
df['role'] = df['device'].str[0]
Using apply is notoriously slow because it doesn't take advantage of multithreading (see, for example, pandas multiprocessing apply). Instead, use built-ins:
>>> import pandas as pd
>>> df = pd.DataFrame([["some-data", "1-xxxx"], ["more-data", "1-yyyy"], ["other-data", "2-xxxx"]])
>>> df
0 1
0 some-data 1-xxxx
1 more-data 1-yyyy
2 other-data 2-xxxx
>>> df["Derived Column"] = df[1].str.split("-", expand=True)[0]
>>> df
0 1 Derived Column
0 some-data 1-xxxx 1
1 more-data 1-yyyy 1
2 other-data 2-xxxx 2
Here, I'm assuming that you might have multiple digits before the hyphen (e.g. 42-aaaa), hence the extra work to split the column and get the first value of the split. If you're just getting the first character, do what #teepee did in their answer with just indexing into the string.
You can trivially convert your code to use np.vectorize().
See here:
Performance of Pandas apply vs np.vectorize to create new column from existing columns
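For illustration, a minimal np.vectorize sketch of the same idea (assuming the role is simply the digits before the hyphen, as in '42-aaaa') might be:
import numpy as np

def role_from_device(device):
    # Everything before the first hyphen, e.g. '42-aaaa' -> '42'
    return device.split('-', 1)[0]

df['role'] = np.vectorize(role_from_device)(df['device'])
Note that np.vectorize is still a Python-level loop under the hood, so the pandas .str methods shown above are usually the better choice.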

Optimization of date subtraction on large dataframe - Pandas

I'm a beginner learning Python. I have a very large dataset - I'm having trouble optimizing my code to make this run faster.
My goal is to optimize all of this (my current code works, but slowly):
Subtract two date columns
Create new column with the result of that subtraction
Remove original two columns
Do all of this in a fast manner
Random finds:
Thinking about changing the initial file read method...
https://softwarerecs.stackexchange.com/questions/7463/fastest-python-library-to-read-a-csv-file
I have parse_dates=True when reading the CSV file - so could this be a slowdown? I have 50+ columns but only 1 timestamp column and 1 year column.
This column:
saledate
1 3/26/2004 0:00
2 2/26/2004 0:00
3 5/19/2011 0:00
4 7/23/2009 0:00
5 12/18/2008 0:00
Subtracted by (Should this be converted to a format like 1/1/1996?):
YearMade
1 1996
2 2001
3 2001
4 2007
5 2004
Current code:
mean_YearMade = dfx[dfx['YearMade'] > 1000]['YearMade'].mean()

def age_at_sale(df, mean_YearMade):
    '''
    INPUT: DataFrame
    OUTPUT: DataFrame
    Add a column called Age_at_Sale
    '''
    df.loc[:, 'YearMade'][df['YearMade'] == 1000] = mean_YearMade
    # Column has tons of erroneous years with 1000
    df['saledate'] = pd.to_datetime(df['saledate'])
    df['saleyear'] = df['saledate'].dt.year
    df['Age_at_Sale'] = df['saleyear'] - df['YearMade']
    df = df.drop('saledate', axis=1)
    df = df.drop('YearMade', axis=1)
    df = df.drop('saleyear', axis=1)
    return df
Any optimization tricks would be much appreciated...
You can use sub for the subtraction, and to select by condition use loc with a mask like dfx['YearMade'] > 1000. Also, creating the saleyear column is not necessary.
dfx['saledate'] = pd.to_datetime(dfx['saledate'])
mean_YearMade = dfx.loc[dfx['YearMade'] > 1000, 'YearMade'].mean()

def age_at_sale(df, mean_YearMade):
    '''
    INPUT: DataFrame
    OUTPUT: DataFrame
    Add a column called Age_at_Sale
    '''
    df.loc[df['YearMade'] == 1000, 'YearMade'] = mean_YearMade
    df['Age_at_Sale'] = df['saledate'].dt.year.sub(df['YearMade'])
    df = df.drop(['saledate', 'YearMade'], axis=1)
    return df
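A quick usage sketch, assuming dfx is the DataFrame read from the CSV:
dfx = age_at_sale(dfx, mean_YearMade)
print(dfx['Age_at_Sale'].head())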

Summing 3 columns in a dataframe

This should be easy:
I have a data frame with the following columns
a,b,min,w,w_min
all I want to do is sum up the columns min, w, and w_min and read that result into another data frame.
I've looked, but I cannot find a previously asked question that directly relates to this. Everything I've found seems much more complex than what I'm trying to do.
You can just pass a list of cols and select these to perform the summation on:
In [64]:
df = pd.DataFrame(columns=['a','b','min','w','w_min'], data = np.random.randn(10,5) )
df
Out[64]:
a b min w w_min
0 0.626671 0.850726 0.539850 -0.669130 -1.227742
1 0.856717 2.108739 -0.079023 -1.107422 -1.417046
2 -1.116149 -0.013082 0.871393 -1.681556 -0.170569
3 -0.944121 -2.394906 -0.454649 0.632995 1.661580
4 0.590963 0.751912 0.395514 0.580653 0.573801
5 -1.661095 -0.592036 -1.278102 -0.723079 0.051083
6 0.300866 -0.060604 0.606705 1.412149 0.916915
7 -1.640530 -0.398978 0.133140 -0.628777 -0.464620
8 0.734518 1.230869 -1.177326 -0.544876 0.244702
9 -1.300137 1.328613 -1.301202 0.951401 -0.693154
In [65]:
cols=['min','w','w_min']
df[cols].sum()
Out[65]:
min -1.743700
w -1.777642
w_min -0.525050
dtype: float64
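Since the question asks to read that result into another data frame, a minimal follow-up sketch is to wrap the resulting Series of sums:
cols = ['min', 'w', 'w_min']
# One-row DataFrame holding the three column totals
result = df[cols].sum().to_frame().T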
