I am trying to split a csv file of temperature data into smaller dictionaries so I can calculate the mean temperature of each month. The csv file is of the format below:
AirTemperature AirHumidity SoilTemperature SoilMoisture LightIntensity WindSpeed Year Month Day Hour Minute Second TimeStamp MonthCategorical
12 68 19 65 60 2 2016 1 1 0 1 1 10100 January
18 34 14 42 19 0 2016 1 1 1 1 1 10101 January
19 98 14 41 30 4 2016 1 1 2 1 1 10102 January
16 88 16 68 54 4 2016 1 1 3 1 1 10103 January
16 44 20 41 10 1 2016 1 1 4 1 1 10104 January
22 54 18 65 94 0 2016 1 1 5 1 1 10105 January
18 84 17 41 40 4 2016 1 1 6 1 1 10106 January
20 88 22 92 31 0 2016 1 1 7 1 1 10107 January
23 1 22 59 3 0 2016 1 1 8 1 1 10108 January
23 3 22 72 41 4 2016 1 1 9 1 1 10109 January
24 63 23 83 85 0 2016 1 1 10 1 1 10110 January
29 73 27 50 1 4 2016 1 1 11 1 1 10111 January
28 37 30 46 29 3 2016 1 1 12 1 1 10112 January
30 99 32 78 73 4 2016 1 1 13 1 1 10113 January
32 72 31 80 80 1 2016 1 1 14 1 1 10114 January
There are 24 readings per day over a 6-month period.
I can get half way there with the following code:
for row in df['AirTemperature']:
    for equivalentRow in df['MonthCategorical']:
        if equivalentRow == "January":
            JanuaryAirTemperatures.append(row)
But the output of this has every AirTemperature value duplicated by the number of rows containing the value January, i.e. rather than 12, 18, 19, etc. it goes 12, 12, 12, 12, 12, 18, 18, 18, 18, 18, 19, 19, 19, 19.
I tried the following:
for row in df['AirTemperature']:
    if df['MonthCategorical'] == "January":
        JanuaryAirTemperatures.append(row)
But I get the following error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). I assume this is because it is trying to look at the whole column rather than the equivalent row.
IIUC, you can group by month and get the mean air temperature per month with:
g = df.groupby('MonthCategorical')['AirTemperature'].mean().reset_index(name='MeanAirTemperature')
this returns:
MonthCategorical MeanAirTemperature
0 January 22
Then you can choose which columns to group by (e.g. instead of MonthCategorical you can group by Month only...).
EDIT:
You can also use transform to get a new column to append to the original dataframe with:
df['MeanAirTemperature'] = df.groupby('MonthCategorical')['AirTemperature'].transform('mean')
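For a quick check, here is a minimal, self-contained sketch of both approaches on a couple of toy rows (the values are made up, not the questioner's data):

import pandas as pd

# Toy frame with only the two columns that matter here (values assumed)
df = pd.DataFrame({
    'AirTemperature': [12, 18, 19, 25, 27],
    'MonthCategorical': ['January', 'January', 'January', 'February', 'February'],
})

# One mean per month
g = df.groupby('MonthCategorical')['AirTemperature'].mean().reset_index(name='MeanAirTemperature')
print(g)

# Or broadcast the per-month mean back onto every row
df['MeanAirTemperature'] = df.groupby('MonthCategorical')['AirTemperature'].transform('mean')
print(df)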
Related
I have a multi-index dataframe of timeseries data which looks like the following:
A B C
1 1 21 32 4
2 4 2 23
3 12 9 10
4 1 56 37
.
.
.
.
30 63 1 27
31 32 2 32
.
.
.
12 1 2 3 23
2 23 1 12
3 32 3 23
.
.
.
31 23 2 32
It is essentially a multi-index of month and dates with three columns.
I need to turn this into daily data: essentially a dataframe with a single date index, where each value in the above dataframe corresponds to its respective date over 10 years.
For example, the desired output:
A B C
01/01/2017 21 32 4
.
.
31/12/2017 23 2 32
.
.
01/01/2022 21 32 4
.
.
31/12/2022 23 2 32
I hope this is clear! It's essentially turning daily/monthly data into daily/monthly/yearly data.
You can use:
df.index = pd.to_datetime(df.index.rename(['month', 'day']).to_frame().assign(year=2022))
Output:
A B C
2022-01-01 21 32 4
2022-01-02 4 2 23
2022-01-03 12 9 10
2022-01-04 1 56 37
2022-01-30 63 1 27
2022-01-31 32 2 32
2022-12-01 2 3 23
2022-12-02 23 1 12
2022-12-03 32 3 23
2022-12-31 23 2 32
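As a self-contained sketch, with the (month, day) MultiIndex rebuilt by hand (a hypothetical reconstruction, not the questioner's actual frame):

import pandas as pd

# Hypothetical (month, day) MultiIndex matching the question's sample
idx = pd.MultiIndex.from_tuples([(1, 1), (1, 2), (12, 31)])
df = pd.DataFrame({'A': [21, 4, 23], 'B': [32, 2, 2], 'C': [4, 23, 32]}, index=idx)

# Name the levels, turn them into columns, add a fixed year, assemble datetimes
df.index = pd.to_datetime(df.index.rename(['month', 'day']).to_frame().assign(year=2022))
print(df)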
spanning several years
There is no absolutely foolproof way to handle years if those are missing. What we can do is infer the year change when a date goes back in the past, and add 1 year in that case:
# let's assume the starting year is 2017
date = pd.to_datetime(df.index.rename(['month', 'day']).to_frame().assign(year=2017))
df.index = date + date.diff().lt('0').cumsum().mul(pd.DateOffset(years=1))
output:
A B C
2017-01-01 21 32 4
2017-01-02 4 2 23
2017-06-03 12 9 10
2017-06-04 1 56 37
2018-01-30 63 1 27 # added 1 year
2018-01-31 32 2 32
2018-12-01 2 3 23
2018-12-02 23 1 12
2018-12-03 32 3 23
2018-12-31 23 2 32
used input:
A B C
1 1 21 32 4
2 4 2 23
6 3 12 9 10
4 1 56 37
1 30 63 1 27 # here we go back from month 1 after month 6
31 32 2 32
12 1 2 3 23
2 23 1 12
3 32 3 23
31 23 2 32
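The same idea can also be expressed by deriving the year column up front instead of adding offsets afterwards; this is a sketch of that variation (not the code above), under the same assumption that the year rolls over exactly when the month value decreases:

import pandas as pd

# Hypothetical reconstruction of the "used input" above
idx = pd.MultiIndex.from_tuples(
    [(1, 1), (1, 2), (6, 3), (6, 4), (1, 30), (1, 31), (12, 1), (12, 2), (12, 3), (12, 31)])
df = pd.DataFrame({'A': [21, 4, 12, 1, 63, 32, 2, 23, 32, 23],
                   'B': [32, 2, 9, 56, 1, 2, 3, 1, 3, 2],
                   'C': [4, 23, 10, 37, 27, 32, 23, 12, 23, 32]}, index=idx)

# Turn the index levels into columns and bump the year whenever the month goes backwards
parts = df.index.rename(['month', 'day']).to_frame(index=False)
parts['year'] = 2017 + parts['month'].diff().lt(0).cumsum()
df.index = pd.to_datetime(parts)
print(df)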
I have a range of values i iterating through the number of hours in a year (8760), starting at 1. For every hour, the variable hour increments by 1 until it reaches 24, where it restarts. The variable year_day increments by 1 after every 24 hours. E.g.
i hour year_day
1 1 1
2 2 1
3 3 1
...
23 23 1
24 1 2
25 2 2
...
47 24 2
48 1 3
49 2 3
I'm struggling to make it so that when i = 24, hour is also 24 and year_day remains at 1. Then, when i is the next value directly after a multiple of 24, the hour restarts at 1 and year_day increments by 1. In other words, every time it reaches midnight, hour = 24 and year_day is still the previous day. E.g.
i hour year_day
23 23 1
24 24 1
25 1 2
...
47 23 2
48 24 2
49 1 3
Here is the code:
hour = 0
year_day = 1
for i in range(1, 8761):
    hour = hour + 1
    if i % 24 == 0:
        hour = 1
        year_day = year_day + 1
    print(i, hour, year_day)
Your code is OK, you just need to start with hour = 1 and print at the top of the loop, before incrementing hour and before the if statement. Try the following:
hour = 1
year_day = 1
for i in range(1, 8761):
    print(i, hour, year_day)
    hour += 1
    if i % 24 == 0:
        hour = 1
        year_day = year_day + 1
Output:
...
21 21 1
22 22 1
23 23 1
24 24 1
25 1 2
26 2 2
27 3 2
...
I have used a pandas approach to this question. The code is as follows:
import pandas as pd

i = list(range(1, 50))
df = pd.DataFrame(i, columns=["i"])
df["hours"] = df["i"] % 24
df.loc[df["hours"] == 0, "hours"] = 24   # midnight shows as hour 24, not 0
df["days"] = (df["i"] - 1) // 24 + 1     # 1-based day number
display(df)
The output is:
i hours days
0 1 1 1
1 2 2 1
2 3 3 1
3 4 4 1
4 5 5 1
5 6 6 1
6 7 7 1
7 8 8 1
8 9 9 1
9 10 10 1
10 11 11 1
11 12 12 1
12 13 13 1
13 14 14 1
14 15 15 1
15 16 16 1
16 17 17 1
17 18 18 1
18 19 19 1
19 20 20 1
20 21 21 1
21 22 22 1
22 23 23 1
23 24 24 1
24 25 1 2
25 26 2 2
26 27 3 2
27 28 4 2
28 29 5 2
29 30 6 2
30 31 7 2
31 32 8 2
32 33 9 2
33 34 10 2
34 35 11 2
35 36 12 2
36 37 13 2
37 38 14 2
38 39 15 2
39 40 16 2
40 41 17 2
41 42 18 2
42 43 19 2
43 44 20 2
44 45 21 2
45 46 22 2
46 47 23 2
47 48 24 2
48 49 1 3
hour = 0
year_day = 1
for i in range(1, 8761):
    if i % 24 == 0:
        hour = 0
        year_day += 1
    hour += 1
    print(i, hour, year_day)
Returns:
20 20 1
. . .
24 1 2
25 2 2
. . .
46 23 2
47 24 2
48 1 3
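For what it's worth, the desired hour/year_day mapping can also be computed directly from i with divmod, without tracking state across iterations (a sketch, not taken from the answers above):

for i in range(1, 8761):
    year_day, hour = divmod(i - 1, 24)   # 0-based day and hour within the day
    print(i, hour + 1, year_day + 1)     # i = 24 prints "24 24 1", i = 25 prints "25 1 2"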
Here I'm sharing sample data (I'm dealing with big data); the "counts" value varies from 1 to 3000+, sometimes more than that.
Sample data looks like:
ID counts
41 44 17 16 19 52 6
17 30 16 19 4
52 41 44 30 17 16 6
41 44 52 41 41 41 6
17 17 17 17 41 5
I was trying to split the "ID" column into multiple columns and get the counts that way:
data = ...  # reading the csv file
split_data = data.ID.apply(lambda x: pd.Series(str(x).split(" ")))  # separating columns
As I mentioned, I'm dealing with big data, so this method is not very efficient, and I'm having trouble getting the "ID" counts.
I want to collect the total count of each ID and map it to the corresponding ID column.
Expected output:
ID counts 16 17 19 30 41 44 52
41 41 17 16 19 52 6 1 1 1 0 2 0 1
17 30 16 19 4 1 1 1 1 0 0 0
52 41 44 30 17 16 6 1 1 0 1 1 1 1
41 44 52 41 41 41 6 0 0 0 0 4 1 1
17 17 17 17 41 5 0 4 0 0 1 0 0
If you have any ideas, please let me know.
Thank you
Use Counter to get counts of the values split by space, in a list comprehension:
from collections import Counter
L = [{int(k): v for k, v in Counter(x.split()).items()} for x in df['ID']]
df1 = pd.DataFrame(L, index=df.index).fillna(0).astype(int).sort_index(axis=1)
df = df.join(df1)
print (df)
ID counts 16 17 19 30 41 44 52
0 41 44 17 16 19 52 6 1 1 1 0 1 1 1
1 17 30 16 19 4 1 1 1 1 0 0 0
2 52 41 44 30 17 16 6 1 1 0 1 1 1 1
3 41 44 52 41 41 41 6 0 0 0 0 4 1 1
4 17 17 17 17 41 5 0 4 0 0 1 0 0
Another idea, but I guess slower:
df1 = df.assign(a = df['ID'].str.split()).explode('a')
df1 = df.join(pd.crosstab(df1['ID'], df1['a']), on='ID')
print (df1)
ID counts 16 17 19 30 41 44 52
0 41 44 17 16 19 52 6 1 1 1 0 1 1 1
1 17 30 16 19 4 1 1 1 1 0 0 0
2 52 41 44 30 17 16 6 1 1 0 1 1 1 1
3 41 44 52 41 41 41 6 0 0 0 0 4 1 1
4 17 17 17 17 41 5 0 4 0 0 1 0 0
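To try either snippet locally, the sample frame can be rebuilt like this (the string values are my reconstruction of the question's data); both approaches above should then run on it as-is:

import pandas as pd

# Sample frame reconstructed from the question (values are assumptions)
df = pd.DataFrame({
    'ID': ['41 44 17 16 19 52', '17 30 16 19', '52 41 44 30 17 16',
           '41 44 52 41 41 41', '17 17 17 17 41'],
    'counts': [6, 4, 6, 6, 5],
})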
I have a dataframe with two columns, like this:
ID LIST_OF_TUPLE (2col)
1 [('2012','12'), ('2012','33'), ('2014', '82')]
2 NA
3 [('2012','12')]
4 [('2012','12'), ('2012','33'), ('2014', '82'), ('2022', '67')]
Desired result:
ID TUP_1 TUP_2(3col)
1 2012 12
1 2012 33
1 2014 82
3 2012 12
4 2012 12
4 2012 33
4 2014 82
4 2022 67
Thanks in advance.
This is explode, then create a dataframe, then join:
s = df['LIST_OF_TUPLE'].explode()
out = (df[['ID']].join(pd.DataFrame(s.tolist(), index=s.index)
                         .add_prefix("TUP_"))
                 .reset_index(drop=True))  # you can chain a dropna if required
print(out)
ID TUP_0 TUP_1
0 1 2012 12
1 1 2012 33
2 1 2014 82
3 2 NaN None
4 3 2012 12
5 4 2012 12
6 4 2012 33
7 4 2014 82
8 4 2022 67
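For reference, a frame matching the question's input could be built like the sketch below (representing NA as np.nan is an assumption on my part):

import numpy as np
import pandas as pd

# Reconstruction of the question's two-column frame
df = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'LIST_OF_TUPLE': [
        [('2012', '12'), ('2012', '33'), ('2014', '82')],
        np.nan,                                            # the "NA" row
        [('2012', '12')],
        [('2012', '12'), ('2012', '33'), ('2014', '82'), ('2022', '67')],
    ],
})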
I have some data that looks like:
key DATE - DAY DATE - MONTH DATE - YEAR GMT HRS GMT MINUTES
1 2 29 2 2016 2 2
2 3 29 2 2016 2 2
3 4 29 2 2016 2 2
4 5 29 2 2016 2 2
5 6 29 2 2016 2 2
6 7 29 2 2016 2 2
7 8 29 2 2016 2 3
8 9 29 2 2016 2 3
9 10 29 2 2016 2 3
GMT SECONDS
1 54
2 55
3 56
4 57
5 58
6 59
7 0
8 1
9 2
At first the data was of type float and the year was in two-digit format (e.g. 16), so I did:
t['DATE - MONTH'] = t['DATE - MONTH'].astype(int)
t['DATE - YEAR'] = t['DATE - YEAR'].astype(int)
t['DATE - YEAR'] = t['DATE - YEAR']+2000
t['DATE - DAY'] = t['DATE - DAY'].astype(int)
^Note: I was also confused why, when using an index number rather than the column name, you only seem to work on a temporary table, i.e. you can print the desired result but it didn't change the dataframe.
Then I tried two methods:
t['Date'] = pd.to_datetime(dict(year=t['DATE - YEAR'], month = t['DATE - MONTH'], day = t['DATE - DAY']))
t['Date'] = pd.to_datetime((t['DATE - YEAR']*10000+t['DATE - MONTH']*100+t['DATE - DAY']).apply(str),format='%Y%m%d')
Both return:
ValueError: cannot assemble the datetimes: time data 20000000 does not match format '%Y%m%d' (match)
I'd like to create a date column (and then after use a similar logic for a datetime column with the additional 3 columns).
What is the problem?
EDIT: I had bad data and added errors='coerce' to handle those rows
First rename all columns, select them by the values of the dict, and use to_datetime:
Assembling a datetime from multiple columns of a DataFrame. The keys can be common abbreviations like ['year', 'month', 'day', 'minute', 'second', 'ms', 'us', 'ns'] or plurals of the same.
d = {'DATE - YEAR': 'year', 'DATE - MONTH': 'month', 'DATE - DAY': 'day',
     'GMT HRS': 'hour', 'GMT MINUTES': 'minute', 'GMT SECONDS': 'second'}
df['datetime'] = pd.to_datetime(df.rename(columns=d)[list(d.values())])
print (df)
key DATE - DAY DATE - MONTH DATE - YEAR GMT HRS GMT MINUTES \
1 2 29 2 2016 2 2
2 3 29 2 2016 2 2
3 4 29 2 2016 2 2
4 5 29 2 2016 2 2
5 6 29 2 2016 2 2
6 7 29 2 2016 2 2
7 8 29 2 2016 2 3
8 9 29 2 2016 2 3
9 10 29 2 2016 2 3
GMT SECONDS datetime
1 54 2016-02-29 02:02:54
2 55 2016-02-29 02:02:55
3 56 2016-02-29 02:02:56
4 57 2016-02-29 02:02:57
5 58 2016-02-29 02:02:58
6 59 2016-02-29 02:02:59
7 0 2016-02-29 02:03:00
8 1 2016-02-29 02:03:01
9 2 2016-02-29 02:03:02
Detail:
print (df.rename(columns=d)[list(d.values())])
day month second year minute hour
1 29 2 54 2016 2 2
2 29 2 55 2016 2 2
3 29 2 56 2016 2 2
4 29 2 57 2016 2 2
5 29 2 58 2016 2 2
6 29 2 59 2016 2 2
7 29 2 0 2016 3 2
8 29 2 1 2016 3 2
9 29 2 2 2016 3 2
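As a minimal, self-contained illustration of the same pattern (toy values, not the questioner's data), errors='coerce' can be added, as mentioned in the question's edit, so that rows that cannot form a valid date (such as an all-zero one, presumably what produced the 20000000 error above) become NaT:

import pandas as pd

# Toy frame: the second row mimics a bad all-zero record
t = pd.DataFrame({'DATE - YEAR': [2016, 2000], 'DATE - MONTH': [2, 0],
                  'DATE - DAY': [29, 0], 'GMT HRS': [2, 0],
                  'GMT MINUTES': [2, 0], 'GMT SECONDS': [54, 0]})

d = {'DATE - YEAR': 'year', 'DATE - MONTH': 'month', 'DATE - DAY': 'day',
     'GMT HRS': 'hour', 'GMT MINUTES': 'minute', 'GMT SECONDS': 'second'}

# errors='coerce' turns rows that cannot be assembled into NaT instead of raising
t['datetime'] = pd.to_datetime(t.rename(columns=d)[list(d.values())], errors='coerce')
print(t['datetime'])
# 0   2016-02-29 02:02:54
# 1                   NaT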