Generate XML file from pandas with python

I have a pandas dataframe that looks like this:
Date Min Max C
01.01.2003 01.01.2003 NaN NaT
02.01.2003 NaN NaN NaT
03.01.2003 NaN NaN NaT
04.01.2003 NaN 04.01.2003 NaT
06.01.2003 06.01.2003 NaN NaT
07.01.2003 NaN NaN NaT
08.01.2003 NaN 08.01.1993 NaT
09.01.2003 NaN NaN 09.01.2003
14.01.2003 14.01.2003 NaN NaT
15.01.2003 NaN NaN NaT
16.01.2003 NaN 16.01.2003 NaT
29.01.2003 NaN NaN 29.01.2003
And I want to get XML output that looks like this:
<Or>
  <Date Source="test" test2="test3">
    <Min>01.01.2003</Min>
    <Max>04.01.2003</Max>
  </Date>
  <Date Source="test" test2="test3">
    <Min>06.01.2003</Min>
    <Max>08.01.2003</Max>
  </Date>
  <Date Source="test" test2="test3">
    <Min>14.01.2003</Min>
    <Max>16.01.2003</Max>
  </Date>
  <Date Source="test" test2="test3">
    09.01.2003 29.01.2003
    <Dates></Dates>
  </Date>
</Or>
This is the code:
import xml.etree.ElementTree as gfg

data = gfg.Element("Or")
# df is the dataframe shown above
for idx, row in df.iterrows():
    element1 = gfg.SubElement(data, "Test")
    element2 = gfg.SubElement(data, "Test2")
    s_elem1 = gfg.SubElement(element1, 'Min')
    s_elem2 = gfg.SubElement(element1, 'Max')
    s_elem1.text = row['Min']
    s_elem2.text = row['Max']
b_xml = gfg.tostring(data)
Because it loops over every row, there will be rows where Min or Max ends up empty/blank. What should I modify so that I get output like the example above?
Thanks

As commented, consider aggregating data before exporting to XML:
Data
from io import StringIO
import pandas as pd
txt = '''\
Date Min Max C
01.01.2003 01.01.2003 NaT NaT
02.01.2003 NaT NaT NaT
03.01.2003 NaT NaT NaT
04.01.2003 NaT 04.01.2003 NaT
06.01.2003 06.01.2003 NaT NaT
07.01.2003 NaT NaT NaT
08.01.2003 NaT 08.01.1993 NaT
09.01.2003 NaT NaT 09.01.2003
14.01.2003 14.01.2003 NaT NaT
15.01.2003 NaT NaT NaT
16.01.2003 NaT 16.01.2003 NaT
29.01.2003 NaT NaT 29.01.2003
'''
with StringIO(txt) as f:
    dates_df = pd.read_csv(f, sep=r"\s+", parse_dates=["Min", "Max", "C"], dayfirst=True)
Cleaning / Grouping / Aggregation
agg_dates_df = (
    dates_df.dropna(how="all", axis="rows", subset=["Min", "Max"])
            .drop(["Date", "C"], axis="columns")
            .assign(
                Group = lambda df: pd.notnull(df["Min"]).astype('int').cumsum()
            )
            .groupby(["Group"])[["Min", "Max"]].max()
            .apply(lambda col: col.dt.strftime("%d.%m.%Y"), axis="columns")
            .assign(Source = "test", test2 = "test3")
)
print(agg_dates_df)
# Min Max Source test2
# Group
# 1 01.01.2003 04.01.2003 test test3
# 2 06.01.2003 08.01.1993 test test3
# 3 14.01.2003 16.01.2003 test test3
XML
output = agg_dates_df.to_xml(
    index = False,
    root_name = "Or",
    row_name = "Date",
    attr_cols = ["Source", "test2"],
    elem_cols = ["Min", "Max"]
)
print(output)
# <?xml version='1.0' encoding='utf-8'?>
# <Or>
#   <Date Source="test" test2="test3">
#     <Min>01.01.2003</Min>
#     <Max>04.01.2003</Max>
#   </Date>
#   <Date Source="test" test2="test3">
#     <Min>06.01.2003</Min>
#     <Max>08.01.1993</Max>
#   </Date>
#   <Date Source="test" test2="test3">
#     <Min>14.01.2003</Min>
#     <Max>16.01.2003</Max>
#   </Date>
# </Or>
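If you want the XML written to disk rather than returned as a string, to_xml also accepts a path as its first argument. A minimal sketch (the file name output.xml is only an illustration, not something from the question):

agg_dates_df.to_xml(
    "output.xml",              # hypothetical output path
    index=False,
    root_name="Or",
    row_name="Date",
    attr_cols=["Source", "test2"],
    elem_cols=["Min", "Max"]
)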

Related

Best way to merge multiple csv files starting with different timestamps using pandas concat

The dataset:
I have a collection of csv files inside a folder, each with two columns: timestamp and close. Each file is saved as {symbol}.csv, where the symbols come from a list such as ['ADAUSDT', 'MAGICUSDT', 'LUNCUSDT', 'LINAUSDT', 'LEVERUSDT', 'BUSDUSDT', 'BTSUSDT', 'ALGOUSDT', ...]. In reality I have over 100 symbols.
Here's the link to sample csv files in case you need them.
What I would like to do:
I want to merge all the close prices inside these files into one data frame using pd.concat without losing much data. Most of the files start at a similar date, but some of them don't have a full year of history (e.g. LUNCUSDT). In those cases I want to find a way to either drop those files and merge the rest, depending on whether the remaining start dates all fall within a close range.
If that is complicated, I would instead like to try to align them all based on the most recent data. However, the DateTime stamps in the last rows are not all in the same range either.
I would appreciate any help on how I can approach this logic. Thanks in advance.
Here's my attempt:
symbols = pd.read_csv('symbols.csv')
symbols = symbols.symbols.to_list()
merged_df = pd.DataFrame()
for symbol in symbols:
    df = pd.read_csv(f"OHLC/5m/{symbol}.csv", usecols=[0, 4])
    df = df.rename(columns={'close': symbol})
    #df['timestamp'] = pd.to_datetime(df['timestamp'], unit='ms')
    merged_df = pd.concat([merged_df, df], axis=1)
merged_df
This unfortunately produces an uneven dataframe with a repeated timestamp column for every symbol, so I cannot figure out the latest and earliest time in each row:
timestamp ADAUSDT timestamp XRPUSDT timestamp XLMUSDT timestamp TRXUSDT timestamp VETUSDT ... timestamp LEVERUSDT timestamp STGUSDT timestamp LUNCUSDT timestamp HFTUSDT timestamp MAGICUSDT
0 2022-02-14 17:35:00 1.048 2022-02-14 17:35:00 0.7989 2022-02-14 17:35:00 0.2112 2022-02-14 17:35:00 0.06484 2022-02-14 17:35:00 0.05662 ... 2022-07-13 04:00:00 0.001252 2022-08-19 09:00:00 0.4667 2022-09-09 08:00:00 0.000529 2022-11-07 13:00:00 3.6009 2022-12-12 08:00:00 0.7873
1 2022-02-14 17:40:00 1.047 2022-02-14 17:40:00 0.7986 2022-02-14 17:40:00 0.2111 2022-02-14 17:40:00 0.06482 2022-02-14 17:40:00 0.05665 ... 2022-07-13 04:05:00 0.001249 2022-08-19 09:05:00 0.5257 2022-09-09 08:05:00 0.000522 2022-11-07 13:05:00 2.9160 2022-12-12 08:05:00 0.8116
2 2022-02-14 17:45:00 1.048 2022-02-14 17:45:00 0.7981 2022-02-14 17:45:00 0.2111 2022-02-14 17:45:00 0.06488 2022-02-14 17:45:00 0.05668 ... 2022-07-13 04:10:00 0.001320 2022-08-19 09:10:00 0.5100 2022-09-09 08:10:00 0.000517 2022-11-07 13:10:00 2.6169 2022-12-12 08:10:00 0.8064
3 2022-02-14 17:50:00 1.047 2022-02-14 17:50:00 0.7980 2022-02-14 17:50:00 0.2109 2022-02-14 17:50:00 0.06477 2022-02-14 17:50:00 0.05658 ... 2022-07-13 04:15:00 0.001417 2022-08-19 09:15:00 0.5341 2022-09-09 08:15:00 0.000520 2022-11-07 13:15:00 2.4513 2022-12-12 08:15:00 0.8035
4 2022-02-14 17:55:00 1.047 2022-02-14 17:55:00 0.7969 2022-02-14 17:55:00 0.2108 2022-02-14 17:55:00 0.06474 2022-02-14 17:55:00 0.05656 ... 2022-07-13 04:20:00 0.001400 2022-08-19 09:20:00 0.6345 2022-09-09 08:20:00 0.000527 2022-11-07 13:20:00 2.5170 2022-12-12 08:20:00 0.8550
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
105123 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
105124 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
105125 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
105126 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
105127 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Any help would be appreciated. Thankyou!
Option 1
Your current implementation attempts a horizontal concat. However, I think you should consider a vertical concat instead - and indeed, the title of your question is simply "best way to merge multiple csv files with different timestamps".
The advantage of the standard vertical concat is that, for time series datasets that arrive at irregular intervals, it is far more memory efficient. You've already touched on the issues you run into with a horizontal concat in your example - multiple timestamp columns, one for each csv.
A better approach might be to add an additional column called symbol to differentiate each row, and do a vertical concat like so:
symbols = pd.read_csv('symbols.csv')
symbols = symbols.symbols.to_list()
dfs = []
for symbol in symbols:
    df = pd.read_csv(f"OHLC/5m/{symbol}.csv", usecols=[0, 4])
    df['timestamp'] = pd.to_datetime(df['timestamp'], unit='ms')
    # adds symbol as a column
    df['symbol'] = symbol
    dfs.append(df)
merged_df = pd.concat(dfs).set_index('timestamp').sort_index()
Now, your merged df will look something like this
+-----------+-------+--------+
| timestamp | close | symbol |
+-----------+-------+--------+
|...........|.......|........|
+----------------------------+
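If you later need one close column per symbol again (for example to compute correlations), the long frame can be pivoted back out. A sketch, assuming each symbol has at most one row per timestamp:

# wide view: one column per symbol, rows aligned on a single timestamp index
wide_df = merged_df.pivot(columns='symbol', values='close')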
Option 2
If you still want to do a horizontal concat, you could try to take advantage of merge_asof instead to get things on the same row.
To get back into the realm of regular intervals, you could use pandas' date_range to generate a periodic column, and then merge_asof every csv file back into it.
symbols = pd.read_csv('symbols.csv')
symbols = symbols.symbols.to_list()
df_merged = pd.DataFrame(
    data=pd.date_range(start='1/1/2023', end='1/2/2023', freq='T'),
    columns=['time']
).set_index('time')  # required for merge_asof
for symbol in symbols:
    df = pd.read_csv(f"OHLC/5m/{symbol}.csv", usecols=[0, 4])
    df['timestamp'] = pd.to_datetime(df['timestamp'], unit='ms')
    df = df.set_index('timestamp')  # required for merge_asof
    df_merged = pd.merge_asof(
        left=df_merged,
        right=df.rename(columns={'close': symbol}),
        left_index=True,
        right_index=True,
    )
Option 3
If you're OK with using a huge amount of memory, you can simply concat every dataframe horizontally as you are already doing, and then fill the gaps with a backward fill (fillna(method='bfill') or .bfill()).
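A minimal sketch of that approach, reusing the file layout from the question (symbols.csv and OHLC/5m/{symbol}.csv are taken from the question, not verified here):

import pandas as pd

symbols = pd.read_csv('symbols.csv').symbols.to_list()
frames = []
for symbol in symbols:
    df = pd.read_csv(f"OHLC/5m/{symbol}.csv", usecols=[0, 4])
    df['timestamp'] = pd.to_datetime(df['timestamp'], unit='ms')
    # index by timestamp so concat aligns rows instead of repeating the column
    frames.append(df.set_index('timestamp').rename(columns={'close': symbol}))
merged_df = pd.concat(frames, axis=1).sort_index()
# back-fill (then forward-fill) the holes created by the union of all timestamps
merged_df = merged_df.bfill().ffill()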

How to create a time matrix full of NaT in python?

I would like to create an empty 3D time matrix (with known size) that I will later populate in a loop with either a pd.DatetimeIndex or a list of pd.Timestamp. Is there a simple method?
This does not work:
timeMatrix = np.empty( shape=(100, 1000, 2) )
timeMatrix[:] = pd.NaT
I can do without the second line, but then the values in timeMatrix come out as raw numbers on the order of 10^18 instead of dates.
timeMatrix = np.empty( shape=(100, 1000, 2) )
for pressureLevel in levels:
    timeMatrix[ i_airport, 0:varyingNumberBelow1000, pressureLevel ] = dates_datetimeindex
Thank you
import pandas as pd

df = pd.DataFrame(index=range(10), columns=range(10), dtype="datetime64[ns]")
print(df)
Prints:
0 1 2 3 4 5 6 7 8 9
0 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
1 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
2 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
3 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
4 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
5 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
6 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
7 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
8 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
9 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
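A DataFrame is inherently 2D; for the 3D case in the question, a NumPy datetime64 array pre-filled with NaT behaves the same way. A sketch, reusing the shape from the question:

import numpy as np

# 3D array of datetime64 values, initialised to NaT instead of uninitialised floats
timeMatrix = np.full(shape=(100, 1000, 2), fill_value=np.datetime64('NaT'), dtype='datetime64[ns]')
# slices can later be filled with a DatetimeIndex or a list of timestamps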

in a pandas DF with 'season' (season1, season2...) columns, 6 months or ~182 days needs to be added to the last season that's not null

I have a pandas DF with multiple seasons and, for each row, I need to add 6 months (~182 days) to the last season that's not null. The dates are dtype: datetime64[ns].
df:
S1 S2 S3
2020-12-31 naT naT
2020-12-31 naT naT
2020-12-31 2020-12-31 naT
2020-12-31 2020-12-31 2021-01-31
Desired Output:
S1 S2 S3
2021-06-30 naT naT
2021-06-30 naT naT
2020-12-31 2021-06-30 naT
2020-12-31 2020-12-31 2021-07-31
Use .shift() to find if the next cell in the row is NaT and then use pd.DateOffset() to add extra months to those cells:
import pandas as pd
from io import StringIO
text = """
S1 S2 S3
2020-12-31 naT naT
2020-12-31 naT naT
2020-12-31 2020-12-31 naT
2020-12-31 2020-12-31 2021-01-31
"""
df = pd.read_csv(StringIO(text), header=0, sep=r'\s+')
df = df.apply(pd.to_datetime, errors='coerce')
# find in which cells the next value is na
next_value_in_row_na = df.shift(-1, axis=1).isna()
# for each cell where the next value is na, try to add 6 months
df = df.mask(next_value_in_row_na, df + pd.DateOffset(months=6))
Resulting dataframe:
S1 S2 S3
0 2021-06-30 NaT NaT
1 2021-06-30 NaT NaT
2 2020-12-31 2021-06-30 NaT
3 2020-12-31 2020-12-31 2021-07-31

How to fill NaT and NaN values separately

My dataframe contains both NaT and NaN values
Date/Time_entry Entry Date/Time_exit Exit
0 2015-11-11 10:52:00 19.9900 2015-11-11 11:30:00 20.350
1 2015-11-11 11:36:00 20.4300 2015-11-11 11:38:00 20.565
2 2015-11-11 11:44:00 21.0000 NaT NaN
3 2009-04-20 10:28:00 13.7788 2009-04-20 10:46:00 13.700
I want to fill NaT with dates and NaN with numbers. Fillna(4) method replaces both NaT and NaN with 4. Is it possible to differentiate between NaT and NaN somehow?
My current workaround is to call fillna() on each column separately, e.g. df[column].fillna(...).
Since NaTs pertain to datetime columns, you can exclude them when applying your filling operation.
u = df.select_dtypes(exclude=['datetime'])
df[u.columns] = u.fillna(4)
df
Date/Time_entry Entry Date/Time_exit Exit
0 2015-11-11 10:52:00 19.9900 2015-11-11 11:30:00 20.350
1 2015-11-11 11:36:00 20.4300 2015-11-11 11:38:00 20.565
2 2015-11-11 11:44:00 21.0000 NaT 4.000
3 2009-04-20 10:28:00 13.7788 2009-04-20 10:46:00 13.700
Similarly, to fill NaT values only, change "exclude" to "include" in the code above.
u = df.select_dtypes(include=['datetime'])
df[u.columns] = u.fillna(pd.to_datetime('today'))
df
Date/Time_entry Entry Date/Time_exit Exit
0 2015-11-11 10:52:00 19.9900 2015-11-11 11:30:00.000000 20.350
1 2015-11-11 11:36:00 20.4300 2015-11-11 11:38:00.000000 20.565
2 2015-11-11 11:44:00 21.0000 2019-02-17 16:11:09.407466 4.000
3 2009-04-20 10:28:00 13.7788 2009-04-20 10:46:00.000000 13.700
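Alternatively, fillna accepts a per-column mapping, so both kinds of missing values can be filled in a single pass. A sketch using the same fill values as above:

import pandas as pd

# map each column to its fill value based on dtype
fill_values = {
    col: pd.Timestamp('today') if pd.api.types.is_datetime64_any_dtype(df[col]) else 4
    for col in df.columns
}
df = df.fillna(value=fill_values)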
Try something like this, using pandas.DataFrame.select_dtypes:
>>> import pandas as pd, datetime, numpy as np
>>> df = pd.DataFrame({'a': [datetime.datetime.now(), np.nan], 'b': [5, np.nan], 'c': [1, 2]})
>>> df
a b c
0 2019-02-17 18:06:15.231557 5.0 1
1 NaT NaN 2
>>> fill_dt = datetime.datetime.now()
>>> fill_value = 4
>>> dt_filled_df = df.select_dtypes('datetime').fillna(fill_dt)
>>> dt_filled_df
a
0 2019-02-17 18:06:15.231557
1 2019-02-17 18:06:36.040404
>>> value_filled_df = df.select_dtypes('int').fillna(fill_value)
>>> value_filled_df
c
0 1
1 2
>>> dt_filled_df.columns = [col + '_notnull' for col in dt_filled_df]
>>> value_filled_df.columns = [col + '_notnull' for col in value_filled_df]
>>> df = df.join(value_filled_df)
>>> df = df.join(dt_filled_df)
>>> df
a b c c_notnull a_notnull
0 2019-02-17 18:06:15.231557 5.0 1 1 2019-02-17 18:06:15.231557
1 NaT NaN 2 2 2019-02-17 18:06:36.040404

removing rows with any column containing NaN, NaTs, and nans

Currently I have data as below:
df_all.head()
Out[2]:
Unnamed: 0 Symbol Date Close Weight
0 4061 A 2016-01-13 36.515889 (0.000002)
1 4062 AA 2016-01-14 36.351784 0.000112
2 4063 AAC 2016-01-15 36.351784 (0.000004)
3 4064 AAL 2016-01-19 36.590483 0.000006
4 4065 AAMC 2016-01-20 35.934062 0.000002
df_all.tail()
Out[3]:
Unnamed: 0 Symbol Date Close Weight
1252498 26950320 nan NaT 9.84 NaN
1252499 26950321 nan NaT 10.26 NaN
1252500 26950322 nan NaT 9.99 NaN
1252501 26950323 nan NaT 9.11 NaN
1252502 26950324 nan NaT 9.18 NaN
df_all.dtypes
Out[4]:
Unnamed: 0 int64
Symbol object
Date datetime64[ns]
Close float64
Weight object
dtype: object
As can be seen, I am getting values of nan in Symbol, NaT in Date and NaN in Weight.
MY GOAL: I want to remove any row that has ANY column containing nan, NaT or NaN, and have a new df_clean as the result.
I don't seem to be able to apply the appropriate filter. I am not sure if I have to convert the datatypes first (although I tried this as well).
You can use:
clean = df_all.replace({'nan': None})
df_clean = clean[~pd.isnull(clean).any(axis=1)]
This works because isnull recognizes both NaN and NaT as "null" values, and the replace turns the 'nan' strings into None so they are caught as well.
The string 'nan' in Symbol is not caught by dropna() or isnull(), so you need to cast it to np.nan first.
Try this:
import numpy as np

df_all["Symbol"] = np.where(df_all["Symbol"] == 'nan', np.nan, df_all["Symbol"])
df_clean = df_all.dropna()
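Putting both ideas together, a compact sketch that turns the 'nan' strings into real missing values and then drops every row containing NaN or NaT anywhere:

import numpy as np

# 'nan' strings become np.nan, then dropna removes rows with NaN or NaT in any column
df_clean = df_all.replace('nan', np.nan).dropna()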
