How to append multiple pandas DataFrames in a loop? - python

I've been banging my head against this Python problem for a while and am stuck. I am looping over several CSV files and want one data frame that combines them so that each CSV contributes one column (named after its pod), all sharing a common date_time index.
There are 11 CSV files that look like this data frame, each with different value and pod numbers, but the time_stamp column is the same for all of them.
data
pod time_stamp value
0 97 2016-02-22 3.048000
1 97 2016-02-29 23.622001
2 97 2016-03-07 13.970001
3 97 2016-03-14 6.604000
4 97 2016-03-21 NaN
And this is the for-loop that I have so far:
import glob
import pandas as pd
filenames = sorted(glob.glob('*.csv'))
new = []
for f in filenames:
    data = pd.read_csv(f)
    time_stamp = [pd.to_datetime(d) for d in time_stamp]
    new.append(data)

my_df = pd.DataFrame(new, columns=['pod','time_stamp','value'])
What I want is a data frame that looks like this where each column is the result of value from each of the csv files.
time_stamp 97 98 99 ...
2016-02-22 3.04800 4.20002 3.5500
2016-02-29 23.62201 24.7392 21.1110
2016-03-07 13.97001 11.0284 12.0000
But right now the output of my_df is very wrong and looks like this. Any idea where I went wrong?
0
0 pod time_stamp value 0 22 2016-...
1 pod time_stamp value 0 72 2016-...
2 pod time_stamp value 0 79 2016-0...
3 pod time_stamp value 0 86 2016-...
4 pod time_stamp value 0 87 2016-...
5 pod time_stamp value 0 88 2016-...
6 pod time_stamp value 0 90 2016-0...
7 pod time_stamp value 0 93 2016-0...
8 pod time_stamp value 0 95 2016-...

I'd recommend first concatenating all your dataframes together with pd.concat, and then doing one final pivot operation.
filenames = sorted(glob.glob('*.csv'))
new = [pd.read_csv(f, parse_dates=['time_stamp']) for f in filenames]
df = pd.concat(new) # omit axis argument since it is 0 by default
df = df.pivot(index='time_stamp', columns='pod')
Note that I'm forcing read_csv to parse time_stamp when loading the dataframe, so parsing after loading is no longer required.
MCVE
df
pod time_stamp value
0 97 2016-02-22 3.048000
1 97 2016-02-29 23.622001
2 97 2016-03-07 13.970001
3 97 2016-03-14 6.604000
4 97 2016-03-21 NaN
df.pivot(index='time_stamp', columns='pod')
value
pod 97
time_stamp
2016-02-22 3.048000
2016-02-29 23.622001
2016-03-07 13.970001
2016-03-14 6.604000
2016-03-21 NaN
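If you want the pod numbers alone as the column names, as in the desired output above, a small extra sketch: pass values='value' so the result has a single column level instead of a ('value', pod) MultiIndex.
df.pivot(index='time_stamp', columns='pod', values='value')  # columns become 97, 98, 99, ...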

Related

Convert a pandas column of int to timestamp datatype when column name is not known

Hi, my use case is that my datetime fields have dynamic names and the dates are stored as something like Unix timestamps. The column name can be different every time, and there can be multiple date fields. How can I handle this? For now, hardcoding the column name works for me:
df['date'] = pandas.to_datetime(df['date'], unit='s')
but I'm not sure how to make this work for dynamic names and multiple fields with pandas.
As suggested in my comment, you can try converting all columns to datetime and keep the one where the conversion succeeded:
# Inspired from @mozway, https://stackoverflow.com/a/75106101/15239951
def find_datetime_col(df):
    mask = df.astype(str).apply(pd.to_datetime, errors='coerce').notna()
    return mask.mean().idxmax()

col = find_datetime_col(df)
print(col)
# Output
Heading 1
Input dataframe:
>>> df
Heading 1 Heading 2 Heading 3 Heading 4
0 2023-01-01 34 12 34
1 2023-01-02 42 99 42
2 2023-01-03 42 99 42
3 2023-01-04 42 99 42
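Once the datetime-like column has been found, you can apply the same conversion the question used for the hardcoded name. A minimal sketch, assuming the values really are Unix seconds as described in the question:
col = find_datetime_col(df)
df[col] = pd.to_datetime(df[col], unit='s')  # convert the detected column from Unix seconds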

appending pandas columns data

Why doesn't the pandas data frame append properly to form one data frame in this loop?
# Produce the overall data frame
def processed_data(data1_, f_loc, open, close):
    """data1_: is the csv file to be modified
    f_loc: is the location of csv files to be processed
    open and close: are the columns to undergo computations
    returns a new dataframe of modified columns"""
    main_file = drop_col(data1_)  # Dataframe to append more data columns to
    for i in files_path(f_loc):
        data = get_data_frame(i[0])  # takes the file path of the csv file and returns the data frame
        perc = perc_df(data, open, close, i[1])  # Dataframe to append
        copy_data = main_file.append(perc)
    return copy_data
Here's the output:
Date WTRX-USD
0 2021-05-27 NaN
1 2021-05-28 NaN
2 2021-05-29 NaN
3 2021-05-30 NaN
4 2021-05-31 NaN
.. ... ...
79 NaN -2.311576
80 NaN 5.653349
81 NaN 5.052950
82 NaN -2.674435
83 NaN -3.082957
[450 rows x 2 columns]
My intention is to return something like this (where each append operation adds a column):
Date Open High Low Close Adj Close Volume
0 2021-05-27 0.130793 0.136629 0.124733 0.128665 0.128665 70936563
1 2021-05-28 0.128659 0.129724 0.111244 0.113855 0.113855 71391441
2 2021-05-29 0.113752 0.119396 0.108206 0.111285 0.111285 62049940
3 2021-05-30 0.111330 0.115755 0.107028 0.112185 0.112185 70101821
4 2021-05-31 0.112213 0.126197 0.111899 0.125617 0.125617 83502219
.. ... ... ... ... ... ... ...
361 2022-05-23 0.195637 0.201519 0.185224 0.185231 0.185231 47906144
362 2022-05-24 0.185242 0.190071 0.181249 0.189553 0.189553 33312065
363 2022-05-25 0.189550 0.193420 0.183710 0.183996 0.183996 33395138
364 2022-05-26 0.184006 0.186190 0.165384 0.170173 0.170173 57218888
365 2022-05-27 0.170636 0.170660 0.165052 0.166864 0.166864 63560568
[366 rows x 7 columns]
pandas.concat
pandas.DataFrame.append has been deprecated. Use pandas.concat instead.
Combine DataFrame objects horizontally along the x-axis by passing in axis=1:
copy_data=pd.concat([copy_data,perc], axis=1)
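As a sketch, the loop from the question could be rewritten like this (drop_col, files_path, get_data_frame and perc_df are the question's own helpers and are assumed to behave as documented there):
import pandas as pd

def processed_data(data1_, f_loc, open, close):
    copy_data = drop_col(data1_)  # base dataframe to grow column by column
    for i in files_path(f_loc):
        data = get_data_frame(i[0])
        perc = perc_df(data, open, close, i[1])  # one new column per file
        copy_data = pd.concat([copy_data, perc], axis=1)  # add it side by side
    return copy_data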

column not found while renaming in pandas dataframe

I have this pandas dataframe:
timestamp EG2021 EGH2021
2021-01-04 33 NaN
2021-02-04 45 65
And I am trying to replace the column names with new names as mapped in an Excel file like this:
OldId NewId
EG2021 LER_EG2021
EGH2021 LER_EGH2021
I tried the code below but it's not working; I get this error:
KeyError: "None of [Index(['LER_EG2021',LER_EGH2021'],\n
dtype='object', length=186)] are in the [columns]
Code:
df = pd.ExcelFile('ids.xlsx').parse('Sheet1')
x=[]
x.append(df['external_ids'].to_list())
dtest_df = (my panda dataframe as mentioned above)
mapper = df.set_index(df['oldId'])[df['NewId']]
dtest_df.columns = dtest_df.columns.Series.replace(mapper)
Any idea what I'm doing wrong?
You need:
mapper = df.set_index('oldId')['NewId']
dtest_df.columns = dtest_df.columns.map(mapper.to_dict())
Or:
dtest_df = dtest_df.rename(columns=df.set_index('oldId')['NewId'].to_dict())
dtest_df output:
timestamp LER_EG2021 LER_EGH2021
0 2021-01-04 33 NaN
1 2021-02-04 45 65
Another way: build a dict by zipping the old and new id columns of the mapping dataframe:
dtest_df.rename(columns=dict(zip(df['OldId'], df['NewId'])), inplace=True)
timestamp LER_EG2021 LER_EGH2021
0 2021-01-04 33 NaN
1 2021-02-04 45 65
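For a self-contained sketch, assuming the mapping sheet really has the columns OldId and NewId shown above:
import pandas as pd

ids = pd.read_excel('ids.xlsx', sheet_name='Sheet1')  # columns: OldId, NewId
mapping = dict(zip(ids['OldId'], ids['NewId']))
dtest_df = dtest_df.rename(columns=mapping)  # columns missing from the mapping are left unchanged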

How to export a dictionary with multiple values to an Excel file

I have a dictionary with multiple values to a key. For ex:
dict = {u'Days': [u'Monday', u'Tuesday', u'Wednesday', u'Thursday'],u'Temp_value':[54,56,57,45], u'Level_value': [30,34,35,36] and so on...}
I want to export this data to Excel in the format below:
Column 1 Column 2 Column 3 ...
Days Temp_value Level_value
Monday 54 30
Tuesday 56 34
Wednesday 57 35
Thursday 45 36
How can I do that?
Use pandas
import pandas as pd
df = pd.DataFrame(your_dict)
df.to_excel('your_file.xlsx', index=False)
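A minimal end-to-end sketch with the sample data (note that to_excel needs an Excel writer engine such as openpyxl installed):
import pandas as pd

data = {'Days': ['Monday', 'Tuesday', 'Wednesday', 'Thursday'],
        'Temp_value': [54, 56, 57, 45],
        'Level_value': [30, 34, 35, 36]}

df = pd.DataFrame(data)                      # each key becomes a column, each list its values
df.to_excel('your_file.xlsx', index=False)   # index=False drops the 0..3 row labels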

Splitting, merging, sorting CSV

I have several CSV files containing measurements from several sensors
s1.CSV :
date;hour;source;values
01/25/12;10:20:00;a; 88 -84 27
01/25/12;10:30:00;a; -80
01/25/12;10:50:00;b; -96 3 -88
01/25/12;09:00:00;b; -97 101
01/25/12;09:10:00;c; 28
s2.CSV :
date;hour;source;values
01/25/12;10:20:00;a; 133
01/25/12;10:25:00;a; -8 -5
I'd like to create one CSV per source (a/b/c) with each sensor's measurements in a separate column, sorted by date and hour:
a.CSV :
date;hour;source;s1;s2
01/25/12;10:20:00;a; 88 -84 27; 133
01/25/12;10:25:00;a; ; -8 -5
01/25/12;10:30:00;a; -80;
...
I'm stuck here:
import glob
import csv
import os
os.system('cls')
sources = dict()
sensor = 0
filelist = glob.glob("*.csv")
for f in filelist:
    reader = csv.DictReader(open(f), delimiter=";")
    for row in reader:
        # date = row['date'] # date later
        hour = row['hour']
        val = row['values']
        source = row['source']
        if not sources.has_key(source):  # new source
            sources[source] = list()
        #
        sources[source].append({'hour': hour, 'sensor'+`sensor`: val})
    sensor += 1
I'm not sure this data structure is good for sorting. I also feel like I'm repeating the column names.
Using the data you provided, I cooked up something using Pandas. Please see the code below.
The output, granted, is non-ideal, as the hour and source get repeated within a column. As I am learning too, I'd also welcome any expert input on whether Pandas can do what the OP is asking for!
In [1]: import pandas as pd
In [2]: s1 = pd.read_csv('s1.csv', delimiter=';', parse_dates=True)
In [3]: s1
Out[3]:
date hour source values
0 01/25/12 10:20:00 a 88 -84 27
1 01/25/12 10:30:00 a -80
2 01/25/12 10:50:00 b -96 3 -88
3 01/25/12 09:00:00 b -97 101
4 01/25/12 09:10:00 c 28
In [4]: s2 = pd.read_csv('s2.csv', delimiter=';', parse_dates=True)
In [5]: s2
Out[5]:
date hour source values
0 01/25/12 10:20:00 a 133
1 01/25/12 10:25:00 a -8 -5
In [6]: joined = s1.append(s2)
In [7]: joined
Out[7]:
date hour source values
0 01/25/12 10:20:00 a 88 -84 27
1 01/25/12 10:30:00 a -80
2 01/25/12 10:50:00 b -96 3 -88
3 01/25/12 09:00:00 b -97 101
4 01/25/12 09:10:00 c 28
0 01/25/12 10:20:00 a 133
1 01/25/12 10:25:00 a -8 -5
In [8]: grouped = joined.groupby('hour').sum()
In [9]: grouped.to_csv('a.csv')
In [10]: grouped
Out[10]:
date source values
hour
09:00:00 01/25/12 b -97 101
09:10:00 01/25/12 c 28
10:20:00 01/25/1201/25/12 aa 88 -84 27 133
10:25:00 01/25/12 a -8 -5
10:30:00 01/25/12 a -80
10:50:00 01/25/12 b -96 3 -88
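As a rough follow-up sketch that gets closer to the requested one-file-per-source layout (the s*.csv filename pattern and taking the sensor tag from the filename are assumptions on my part):
import glob
import os
import pandas as pd

frames = []
for fn in sorted(glob.glob('s*.csv')):          # assumes the sensor files are named s1.csv, s2.csv, ...
    frame = pd.read_csv(fn, delimiter=';')
    frame['sensor'] = os.path.splitext(fn)[0]   # tag every row with the sensor it came from
    frames.append(frame)

combined = pd.concat(frames)
for source, group in combined.groupby('source'):
    # assumes at most one reading per (date, hour, source) in each sensor file
    wide = (group.set_index(['date', 'hour', 'source', 'sensor'])['values']
                 .unstack('sensor')             # one column per sensor
                 .reset_index()
                 .sort_values(['date', 'hour']))
    wide.to_csv('{}.csv'.format(source), sep=';', index=False)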
If I understand correctly, you have multiple files, each corresponding to a given "sensor", with the identity of the sensor in the filename. You want to read the files, then write them out into separate files again, this time divided by "source", with the data from the different sensors combined in the final rows.
Here's what I think you want to do:
Read the data in, and build a nested dictionary data structure, as follows:
The top level key would be the source (e.g. 'a').
The second level will be keyed by a (date, time) tuple.
The innermost level will be keyed by sensor, taken from the filename, and will have the actual sensor readings as values.
You'd also want to keep track of all the sensors that have been seen.
To write the data out, you'd loop over the items of the outermost dictionary, creating a new output file for each one.
The rows of each file would be determined by sorting the keys of the next dictionary.
The last value of each row would be formed by concatenating the values of the innermost dict, filling in an empty string for any missing values.
Here's some code:
from collections import defaultdict
from datetime import datetime
import csv
import glob
import os
# data structure is data[source][date, time][sensor] = value, with "" as default value
data = defaultdict(lambda: defaultdict(lambda: defaultdict(str)))
sensors = []
filelist = glob.glob("*.csv")
# read old files
for fn in filelist:
    sensor = os.path.splitext(fn)[0]
    sensors.append(sensor)
    with open(fn, 'rb') as f:
        reader = csv.DictReader(f, delimiter=";")
        for row in reader:
            date = datetime.strptime(row['date'], '%m/%d/%y')
            data[row['source']][date, row['hour']][sensor] = row['values']

sensors.sort()  # note, this may not give the best sort order
header = ['date', 'hour', 'source'] + sensors

for source, source_data in data.iteritems():
    fn = "{}.csv".format(source)
    with open(fn, 'wb') as f:
        writer = csv.writer(f, delimiter=";")
        writer.writerow(header)
        for (date, time), hour_data in sorted(source_data.items()):
            values = [hour_data[sensor] for sensor in sensors]
            writer.writerow([date.strftime('%m/%d/%y'), time, source] + values)
I only convert the date field to an internal type because otherwise sorting based on dates won't work correctly (dates in January 2013 would appear before those in February 2012). In the future, consider using ISO 8601 style date formatting, YYYY-MM-DD, which can be safely sorted as a string. The rest of the values are handled only as strings with no interpretation.
The code assumes that the sensor names can be ordered lexicographically. This is likely if you only have a few of them, e.g. s1 and s2. However, if you have an s10, it will be sorted ahead of s2. To solve this you'll need a "natural" sort, which is more complicated than I can solve here (but see this recent question for more info).
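As a rough sketch of such a natural sort key (one common approach, not taken from the linked question):
import re

def natural_key(name):
    # split 's10' into ['s', 10, ''] so that numeric parts compare as numbers
    return [int(part) if part.isdigit() else part for part in re.split(r'(\d+)', name)]

sensors.sort(key=natural_key)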
One final warning: this solution may do bad things if you run it multiple times in the same folder. That's because the output files, e.g. a.csv, will be seen by glob.glob('*.csv') as input files when you run again.
