I have a number of dataframes consisting of date time and precipitation data that I would like to later merge and plot by station ID. The station ID is present in the "header" of the df but I am not able to reason out how to assign the entire df a unique ID based on it. The .csv file name that the df is constructed from also has the station ID in the name.
For clarification, I am reading in each .csv file from path using an if statement within a loop to differentiate between the different file names and adjust the column headers as needed.
Reading in each file, where 8085 is part of the filename, looks like
for file in files:
    if '8085' in file:
        df = pd.read_csv(path + file, usecols=['Variable', 'Date', 'Time', 'FPR-D Oil'],
                         names=['Variable', 'Date', 'Time', 'FPR-D Oil'],
                         header=None, parse_dates=[[1, 2]], skiprows=[0, 1])
        df_fprd_oil = df[df['Variable'].str.contains('Precip')]
Example of the .csv file before reading in:
Station ID,Sensor Serial Num,
12345678,sn123456789,
Precip,02/01/2020,09:45:00,-2.19,
Batt Voltage,02/01/2020,09:45:00,13.4,
Temp In Box,02/01/2020,09:45:00,-2.58,
Precip,02/01/2020,10:00:00,-2.19,
Batt Voltage,02/01/2020,10:00:00,13.6,
Temp In Box,02/01/2020,10:00:00,-2.17,
Example of the df after reading in:
Date_Time Variable FPR-D Oil
0 2020-02-01 09:45:00 Precip -2.19
3 2020-02-01 10:00:00 Precip -2.19
6 2020-02-01 10:15:00 Precip -2.19
What (I think) is desired
Date_Time Station ID Variable FPR-D Oil
0 2020-02-01 09:45:00 12345678 Precip -2.19
3 2020-02-01 10:00:00 12345678 Precip -2.19
6 2020-02-01 10:15:00 12345678 Precip -2.19
Or maybe even
Date_Time Variable FPR-D Oil Unique ID
0 2020-02-01 09:45:00 Precip -2.19 1
3 2020-02-01 10:00:00 Precip -2.19 1
6 2020-02-01 10:15:00 Precip -2.19 1
If you want to add a UniqueID to your dataframe and want it to be constant for one dataframe, then you can simply do this:
df["UniqueID"] = 1
It will add a column named UniqueID in your existing df with the assigned value.
Hope it helps.
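If you are reading several files in a loop and want each dataframe to carry a different constant, a counter can hand out the IDs. A minimal sketch, assuming files is your list of csv names and path is as in your question:
import pandas as pd

dfs = []
for uid, file in enumerate(files, start=1):
    df = pd.read_csv(path + file, usecols=['Variable', 'Date', 'Time', 'FPR-D Oil'],
                     names=['Variable', 'Date', 'Time', 'FPR-D Oil'],
                     header=None, skiprows=[0, 1])
    df['UniqueID'] = uid  # constant within each file's dataframe
    dfs.append(df)
merged = pd.concat(dfs, ignore_index=True)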
I am assuming, based on your comments above, that the station id is always located at the second row of the first column in all the csv files.
import pandas as pd
from io import StringIO
# sample data
s = """Station ID,Sensor, Serial Num,
12345678,123456789,
Precip,02/01/2020,09:45:00,-2.19,
Batt Voltage,02/01/2020,09:45:00,13.4,
Temp In Box,02/01/2020,09:45:00,-2.58,
Precip,02/01/2020,10:00:00,-2.19,
Batt Voltage,02/01/2020,10:00:00,13.6,
Temp In Box,02/01/2020,10:00:00,-2.17,"""
# read your file
df = pd.read_csv(StringIO(s), usecols=['Variable', 'Date', 'Time', 'FPR-D Oil'],
skiprows=[0,1], names=['Variable', 'Date', 'Time', 'FPR-D Oil'])
# read it again but only get the first value of the second row
sid = pd.read_csv(StringIO(s), skiprows=1, nrows=1, header=None)[0].iloc[0]
# filter and copy so you are not assigning to a slice of the frame
new_df = df[df['Variable'] == 'Precip'].copy()
# assign sid to a new column
new_df.loc[:, 'id'] = sid
print(new_df)
Variable Date Time FPR-D Oil id
0 Precip 02/01/2020 09:45:00 -2.19 12345678
3 Precip 02/01/2020 10:00:00 -2.19 12345678
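Folded back into the loop from the question, it might look like this (a sketch; files, path, and the '8085' check are taken from the question):
import pandas as pd

for file in files:
    if '8085' in file:
        # read only the second row to grab the station id
        sid = pd.read_csv(path + file, skiprows=1, nrows=1, header=None)[0].iloc[0]
        df = pd.read_csv(path + file, usecols=['Variable', 'Date', 'Time', 'FPR-D Oil'],
                         names=['Variable', 'Date', 'Time', 'FPR-D Oil'],
                         header=None, skiprows=[0, 1])
        df_fprd_oil = df[df['Variable'] == 'Precip'].copy()
        df_fprd_oil['id'] = sid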
Try this:
df['Station ID'] = 12345678
Where df is your dataframe
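Since the station ID is also in the .csv filename, you could pull it from there instead of hard-coding it. A minimal sketch, assuming the ID is the only 8-digit run in the name (the filename here is made up):
import re

file = 'precip_12345678.csv'  # hypothetical filename
match = re.search(r'\d{8}', file)
if match:
    df['Station ID'] = match.group()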
I have a summary dataframe. I want to extract quarterly data and export it to quarterly folders that have already been created.
My code:
ad = pd.DataFrame({"sensor_value":[10,20]},index=['2019-01-01 05:00:00','2019-06-01 05:00:00'])
ad =
sensor_value
2019-01-01 05:00:00 10
2019-06-01 05:00:00 20
ad.index = pd.to_datetime(ad.index,format = '%Y-%m-%d %H:%M:%S')
# create quarter column
ad['quarter'] = ad.index.to_period('Q')
ad =
sensor_value quarter
2019-01-01 05:00:00 10 2019Q1
2019-06-01 05:00:00 20 2019Q2
# quarters list
qt_list = ad['quarter'].unique()
# extract data for quarter and store it in the corresponding folder that already exist
fold_location = 'C:\\Data\\'
for i in qt_list:
    auxdf = ad[ad['quarter']=='%s'%(i)]
    save_loc = fold_location+'\\'+str(i)
    auxdf.to_csv(save_loc+'\\'+'Sensor_1minData_%s.csv'%(i))
Is there a better way of doing it?
Thanks
You can use groupby with something like:
for quarter, df in ad.groupby('quarter'):
    df.to_csv(f"C:\\Data\\{quarter}\\Sensor_1minData_{quarter}.csv")
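If some of the quarter folders might not exist yet, they can be created on the fly. A sketch along the same lines:
import os

for quarter, df in ad.groupby('quarter'):
    folder = os.path.join('C:\\Data', str(quarter))
    os.makedirs(folder, exist_ok=True)  # no-op when the folder already exists
    df.to_csv(os.path.join(folder, f'Sensor_1minData_{quarter}.csv'))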
I have a dataset / CSV with hourly prices in columns for each day in the rows. It looks something like this:
Date        01:00:00  02:00:00  ...  00:00:00
01.01.2019  348,87    340,83    ...  343,38
02.01.2019  ...       ...       ...  ...
I would need the dataset to be like this:
Date                 Price
01.01.2019 01:00:00  348,87
01.01.2019 02:00:00  340,83
...                  ...
02.01.2019 00:00:00  ...
And all the way to 01.01.2022.
I'm using pandas dataframes in Python. Could anyone help me with this?
Edit: this is how I'm currently reading the file in:
df1 = pd.read_csv('Hourly_prices1.csv', delimiter = ';', index_col = ['Date'])
So basically, I want each row to contain the price for each hour of each day, going chronologically from 01.01.2019 01:00:00 all the way until 01.01.2022 00:00:00.
I need this to create a time series analysis, and amongst other things plot daily changes in the price of each hour of each day.
Assuming your dataset is df, please try this:
hours_ls = df.columns[1:]
df['Date'] = df['Date'].astype(str)
df_new = pd.DataFrame()
for date in df['Date'].values:
    price_ls = []
    date_hr = []
    for hour in hours_ls:
        date_hr.append(date + ' ' + str(hour))
        price_ls.append(df[df['Date']==date][hour].iloc[0])
    # DataFrame.append was removed in pandas 2.0, so concatenate instead
    df_new = pd.concat([df_new, pd.DataFrame(data={'Date':date_hr,'Price':price_ls})])
df_new will be the formatted dataframe required
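A vectorized alternative is DataFrame.melt, which unpivots all the hour columns in one call. A sketch, assuming df has a 'Date' column of strings like '01.01.2019' plus one column per hour:
import pandas as pd

long_df = df.melt(id_vars='Date', var_name='Hour', value_name='Price')
long_df['Date'] = pd.to_datetime(long_df['Date'] + ' ' + long_df['Hour'],
                                 format='%d.%m.%Y %H:%M:%S')
long_df = long_df.drop(columns='Hour').sort_values('Date').reset_index(drop=True)
Since the prices use decimal commas, passing decimal=',' to read_csv converts them to floats on the way in.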
I have a column in a pandas dataframe that contains two types of information: 1. date and time, 2. company name. I have to split the column into two (date_time, full_company_name). First I tried to split the column based on character count (the first 19 characters to one column, the rest to the other), but then I realized that sometimes the date is missing, so the split might not work. Then I tried using regex, but I can't seem to extract it correctly.
If the dates are all properly formatted, maybe you don't have to use regex
df = pd.DataFrame({"A": ["2021-01-01 05:00:00Acme Industries",
"2021-01-01 06:00:00Acme LLC"]})
df["date"] = pd.to_datetime(df.A.str[:19])
df["company"] = df.A.str[19:]
df
# A date company
# 0 2021-01-01 05:00:00Acme Industries 2021-01-01 05:00:00 Acme Industries
# 1 2021-01-01 06:00:00Acme LLC 2021-01-01 06:00:00 Acme LLC
OR
df.A.str.extract(r"(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})(.*)")
Note:
If you have an option to avoid concatenating those strings to begin with, please do so. This is not a healthy habit.
Solution (not that pretty, but it gets the job done):
import pandas as pd
from datetime import datetime
import re
df = pd.DataFrame()
# creating a list of companies
companies = ['Google', 'Apple', 'Microsoft', 'Facebook', 'Amazon', 'IBM',
'Oracle', 'Intel', 'Yahoo', 'Alphabet']
# creating a list of random datetime objects
dates = [datetime(year=2000 + i, month=1, day=1) for i in range(10)]
# creating the column named 'date_time/full_company_name'
df['date_time/full_company_name'] = [f'{str(dates[i])}{companies[i]}' for i in range(len(companies))]
# Before:
# date_time/full_company_name
# 2000-01-01 00:00:00Google
# 2001-01-01 00:00:00Apple
# 2002-01-01 00:00:00Microsoft
# 2003-01-01 00:00:00Facebook
# 2004-01-01 00:00:00Amazon
# 2005-01-01 00:00:00IBM
# 2006-01-01 00:00:00Oracle
# 2007-01-01 00:00:00Intel
# 2008-01-01 00:00:00Yahoo
# 2009-01-01 00:00:00Alphabet
new_rows = []
for row in df['date_time/full_company_name']:
    # extract the date_time from the row using regex
    date_time = re.search(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}', row)
    # handle case of empty date_time
    date_time = date_time.group() if date_time else ''
    # extract the company name from the row from where the date_time ends
    company_name = row[len(date_time):]
    # create a new row with the extracted date_time and company_name
    new_rows.append([date_time, company_name])
# drop the column 'date_time/full_company_name'
df = df.drop(columns=['date_time/full_company_name'])
# add the new columns to the dataframe: 'date_time' and 'company_name'
df['date_time'] = [row[0] for row in new_rows]
df['company_name'] = [row[1] for row in new_rows]
# After:
# date_time company_name
# 2000-01-01 00:00:00 Google
# 2001-01-01 00:00:00 Apple
# 2002-01-01 00:00:00 Microsoft
# 2003-01-01 00:00:00 Facebook
# 2004-01-01 00:00:00 Amazon
# 2005-01-01 00:00:00 IBM
# 2006-01-01 00:00:00 Oracle
# 2007-01-01 00:00:00 Intel
# 2008-01-01 00:00:00 Yahoo
# 2009-01-01 00:00:00 Alphabet
Append ? to the date group to make it optional (so rows with a missing date still match), and leave the trailing .* uncaptured instead of using (.*):
df = pd.DataFrame({"A": ["2021-01-01 05:00:00Acme Industries",
"2021-01-01 06:00:00Acme LLC"]})
df.A.str.extract(r"(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})?.*")
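If you also want to keep the company when the date may be missing, the optional date group and a captured remainder can be combined; the date column gets NaN where there is no timestamp. A sketch:
parts = df.A.str.extract(r"(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})?(.*)")
parts.columns = ['date', 'company']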
We have a CSV file and wrote the code below to do a group by, get the max value, and create an output file. But when reading the final output file with read_csv, the dataframe shows as empty.
Input file:
Manoj,2020-01-01 01:00:00
Manoj,2020-02-01 01:00:00
Manoj,2020-03-01 01:00:00
Rajesh,2020-01-01 01:00:00
Rajesh,2020-05-01 01:00:00
Suresh,2020-04-01 01:00:00
Final output file:
Manoj,2020-03-01 01:00:00
Rajesh,2020-05-01 01:00:00
Suresh,2020-04-01 01:00:00
And then when I try to read the above final output file using pd.read_csv, it shows an empty dataframe.
import os
import re
import pandas as pd
z=open('outfile.csv','w')
fin=[]
k=open('j.csv','r')
for m in k:
    d=m.split(',')[0]
    if d not in fin:
        fin.append(d.strip())
for p in fin:
    gg=[]
    g=re.compile(r'{0}'.format(p))
    y=open('j.csv','r')
    for b in y:
        if re.search(g,b):
            gg.append(b)
    z.write(gg[-1].strip())
    z.write('\n')
df = pd.read_csv("outfile.csv", delimiter=',', names=['Col1','Col2'], header=0)
print(df)
Final output: Empty DataFrame, Index: []
Is there anything I missed? Please suggest.
It's not necessary to use the for-loop to process the file. The data aggregation is more easily completed in pandas.
Your csv is shown without headers, so read the file in with pandas.read_csv, header=None, and use parse_dates to correctly format the datetime column.
The column with datetimes is shown at column index 1, therefore parse_dates=[1].
This assumes the data begins on row 0 in the file and has no headers, as shown in the OP.
Create headers for the columns
As per a comment, the date component of 'datetime' can be accessed with the .dt accessor.
.groupby on name and aggregate .max()
import pandas as pd
# read the file j.csv
df = pd.read_csv('j.csv', header=None, parse_dates=[1])
# add headers
df.columns = ['name', 'datetime']
# select only the date component of datetime
df.datetime = df.datetime.dt.date
# display(df)
name datetime
0 Manoj 2020-01-01
1 Manoj 2020-02-01
2 Manoj 2020-03-01
3 Rajesh 2020-01-01
4 Rajesh 2020-05-01
5 Suresh 2020-04-01
# groupby
dfg = df.groupby('name')['datetime'].max().reset_index()
# display(dfg)
name datetime
0 Manoj 2020-03-01
1 Rajesh 2020-05-01
2 Suresh 2020-04-01
# save the file. If the headers aren't wanted, use `header=False`
dfg.to_csv('outfile.csv', index=False)
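As to why outfile.csv read back empty in the original code: z is never closed, so the buffered writes may still be sitting in memory when read_csv runs. A minimal sketch of the same plain-file approach with context managers (the string comparison works here because the timestamps are ISO-formatted):
import pandas as pd

latest = {}
with open('j.csv') as src:
    for line in src:
        name, stamp = line.strip().split(',', 1)
        if name not in latest or stamp > latest[name]:
            latest[name] = stamp  # lexicographic max == chronological max here
with open('outfile.csv', 'w') as out:
    for name, stamp in latest.items():
        out.write(f'{name},{stamp}\n')
# both files are closed (and flushed) before reading the result back
df = pd.read_csv('outfile.csv', names=['Col1', 'Col2'], header=None)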
Create dataframe
import pandas as pd
df = pd.DataFrame(zip(
    ['Manoj','Manoj','Manoj','Rajesh','Rajesh','Suresh'],
    ['2020-01-01','2020-02-01','2020-03-01','2020-01-01','2020-05-01','2020-04-01'],
    ['01:00:00','01:00:00','01:00:00','01:00:00','01:00:00','01:00:00']),
    columns=['name','date','time'])
Convert date and time from strings to date and time objects
df['date']=pd.to_datetime(df['date'], infer_datetime_format=True).dt.date
df['time']=pd.to_datetime(df['time'],format='%H:%M:%S').dt.time
Group by and take the max
out=df.groupby(by=['name','time']).max().reset_index()
You can save and load it again
out.to_csv('out.csv',index=False)
df1=pd.read_csv('out.csv')
result
name time date
0 Manoj 01:00:00 2020-03-01
1 Rajesh 01:00:00 2020-05-01
2 Suresh 01:00:00 2020-04-01
Sorry, I created two separate columns for date and time, but I hope you can understand it
I have a dataframe that contains NaN values, and I want to fill the missing data using information from the same month.
The dataframe looks like this:
data = {'x':[208.999,-894.0,-171.0,108.999,-162.0,-29.0,-143.999,-133.0,-900.0],
'e':[0.105,0.209,0.934,0.150,0.158,'',0.333,0.089,0.189],
}
df = pd.DataFrame(data)
df = pd.DataFrame(data, index =['2020-01-01', '2020-02-01',
'2020-03-01', '2020-01-01',
'2020-02-01','2020-03-01',
'2020-01-01','2020-02-01',
'2020-03-01'])
df.index = pd.to_datetime(df.index)
df['e'] =df['e'].apply(pd.to_numeric, errors='coerce')
Now I'm using df = df.fillna(df['e'].mean()) to fill the NaN value, but it takes the data of the whole column and gives me 0.27. Is there a way to use only the data of the same month? The result should be 0.56.
Try grouping on index.month, taking the (transformed) mean, then fillna:
df.index = pd.to_datetime(df.index)
out = df.fillna({'e':df.groupby(df.index.month)['e'].transform('mean')})
print(out)
x e
2020-01-01 208.999 0.1050
2020-02-01 -894.000 0.2090
2020-03-01 -171.000 0.9340
2020-01-01 108.999 0.1500
2020-02-01 -162.000 0.1580
2020-03-01 -29.000 0.5615
2020-01-01 -143.999 0.3330
2020-02-01 -133.000 0.0890
2020-03-01 -900.000 0.1890
Maybe you could use interpolate() instead of fillna(), but you have to sort the index first, i.e.:
df.e.sort_index().interpolate()
Output:
2020-01-01 0.1050
2020-01-01 0.1500
2020-01-01 0.3330
2020-02-01 0.2090
2020-02-01 0.1580
2020-02-01 0.0890
2020-03-01 0.9340
2020-03-01 0.5615
2020-03-01 0.1890
Name: e, dtype: float64
By default, linear interpolation is used, so in the case of a single NaN you get the mean of its two neighbours; the missing value was replaced by 0.5615 like you expected. However, if the NaN were the first sample of a month after sorting, the result would be the mean of the previous month's last value and this month's next value. On the other hand, it works even when a whole month is NaN and there is nothing to average, so depending on how strict you are about the same-month requirement and how the missing values are spread across the dataframe, you may or may not accept this solution.