Changing date column of csv with Python Pandas - python

I have a csv file like this:
Tarih, Şimdi, Açılış, Yüksek, Düşük, Hac., Fark %
31.05.2022, 8,28, 8,25, 8,38, 8,23, 108,84M, 0,61%
(more than a thousand lines)
I want to change it like this:
Tarih, Şimdi, Açılış, Yüksek, Düşük, Hac., Fark %
5/31/2022, 8.28, 8.25, 8.38, 8.23, 108.84M, 0.61%
In particular, the date ("Tarih") column is in Day.Month.Year format and I need it in Month/Day/Year format.
I wrote the code like this:
import pandas as pd
import numpy as np
import datetime
data=pd.read_csv("qwe.csv", encoding= 'utf-8')
df.Tarih=df.Tarih.str.replace(".","/")
df.Şimdi=df.Şimdi.str.replace(",",".")
df.Açılış=df.Açılış.str.replace(",",".")
df.Yüksek=df.Yüksek.str.replace(",",".")
df.Düşük=df.Düşük.str.replace(",",".")
for i in df['Tarih']:
    q = 1
    datetime_obj = datetime.datetime.strptime(i, "%d/%m/%Y")
    df['Tarih'].loc[df['Tarih'].values == q] = datetime_obj
But the "for" loop in my code doesn't work. I need help on this. Thank you

Just looking at converting the date, you can import to a datetime object with arguments for pd.read_csv, then convert to your desired format by applying strftime to each entry.
If I have the following tmp.csv:
date, value
30.05.2022, 4.2
31.05.2022, 42
01.06.2022, 420
import pandas as pd
df = pd.read_csv('tmp.csv', parse_dates=['date'], dayfirst=True)
df['date'] = df['date'].dt.strftime('%m/%d/%Y')
print(df)
output:
date value
0 05/30/2022 4.2
1 05/31/2022 42.0
2 06/01/2022 420.0
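For the file in the question, the decimal commas can be handled at read time as well. The following is a minimal sketch, assuming the real file quotes each field (the sample row cannot be split unambiguously otherwise). The inline `raw` text is a hypothetical stand-in for qwe.csv, and the quoting is an assumption that the question does not confirm.

```python
import io
import pandas as pd

# Hypothetical stand-in for qwe.csv; assumes every field is quoted.
raw = (
    '"Tarih","Şimdi","Açılış","Yüksek","Düşük","Hac.","Fark %"\n'
    '"31.05.2022","8,28","8,25","8,38","8,23","108,84M","0,61%"\n'
)

# decimal=',' lets pandas read "8,28" as the float 8.28
df = pd.read_csv(io.StringIO(raw), decimal=',')

# Parse Day.Month.Year explicitly, then rebuild Month/Day/Year without zero padding
dates = pd.to_datetime(df['Tarih'], format='%d.%m.%Y')
df['Tarih'] = (dates.dt.month.astype(str) + '/'
               + dates.dt.day.astype(str) + '/'
               + dates.dt.year.astype(str))

# "108,84M" and "0,61%" are not numeric, so they are fixed as plain strings
for col in ['Hac.', 'Fark %']:
    df[col] = df[col].str.replace(',', '.', regex=False)

print(df)
```

Building the date from its parts avoids platform-specific strftime flags for dropping the leading zero; if 05/31/2022 is acceptable, `dates.dt.strftime('%m/%d/%Y')` is shorter.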

Related

python pandas converting UTC integer to datetime

I am calling some financial data from an API which stores the time values as (I think) UTC timestamps, integers such as 1645804609719.
I cannot seem to convert the entire column into a usable date. I can do it for a single value using the following code, so I know the conversion works, but I have thousands of rows with this problem and thought pandas would offer an easier way to update all the values.
from datetime import datetime
tx = int('1645804609719')/1000
print(datetime.utcfromtimestamp(tx).strftime('%Y-%m-%d %H:%M:%S'))
Any help would be greatly appreciated.
Simply use pandas.DataFrame.apply:
df['date'] = df.date.apply(lambda x: datetime.utcfromtimestamp(int(x)/1000).strftime('%Y-%m-%d %H:%M:%S'))
Another way to do it is by using pd.to_datetime as recommended by Panagiotos in the comments:
df['date'] = pd.to_datetime(df['date'],unit='ms')
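A small sketch of the pd.to_datetime route, assuming the column arrives as strings of millisecond timestamps (the sample frame below is made up for illustration and reuses the value from the question). Converting with pd.to_numeric first keeps the behavior the same whether the API delivers strings or integers.

```python
import pandas as pd

# Illustrative data: millisecond timestamps stored as strings
df = pd.DataFrame({'date': ['1584199972000', '1645804609719']})

# Vectorized conversion; utc=True marks the result as UTC
timestamps = pd.to_datetime(pd.to_numeric(df['date']), unit='ms', utc=True)

# Optional: render as text in the same layout as the single-value example
df['date'] = timestamps.dt.strftime('%Y-%m-%d %H:%M:%S')
print(df)
```

Keeping `timestamps` as a datetime column rather than formatted text is usually more convenient if you need to sort, filter, or resample later.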
You can use "to_numeric" to convert the column to integers, "div" to divide it by 1000, and finally a loop over the column with datetime to get the format you want.
import pandas as pd
from datetime import datetime
df = pd.DataFrame({'date': ['1584199972000', '1645804609719'], 'values': [30,40]})
# milliseconds since the epoch -> seconds
seconds = pd.to_numeric(df['date']).div(1000)
# format each value as a UTC date string
df['date'] = [datetime.utcfromtimestamp(s).strftime('%Y-%m-%d %H:%M:%S') for s in seconds]
print(df)
Output:
date values
0 2020-03-14 15:32:52 30
1 2022-02-25 15:56:49 40

Sum timedeltas in python

I have multiple sets of dates/times I'm trying to manipulate in python, imported from a csv file using the pandas module. I've converted each entry from a string to datetime, and I can manipulate the data with + and -, but I get an error when trying to use 'sum()'. Specifically: "TypeError: 'Timedelta' object is not iterable".
Here is the code I'm using:
import pandas as pd
import numpy as np
from datetime import datetime
A = pd.read_csv('filename')
B = A['Start Time (UTCG)']
C = A['Stop Time (UTCG)']
DT_B = pd.to_datetime(B) #converting from string
DT_C = pd.to_datetime(C)
timediff = DT_C - DT_B
diffsum = sum(timediff)
where 'Start time' and 'Stop time' are in the format "11 Mar 2017 10:37:12.330" and B and C are lists.
I'm pretty new to python, so apologies if I'm overlooking something simple. If there's an easier way to manipulate strings of dates/times without datetime, that would be good too. Any help in getting "sum" to work would be appreciated. Thanks!
You might try using the sum method that ships with a pandas series which should handle this correctly.
>>> import pandas as pd
>>> from datetime import timedelta
>>> data = [timedelta(i) for i in range(10)]
>>> a = pd.Series(data)
>>> a.sum()
Timedelta('45 days 00:00:00')
Note that I say Series and not dataframe. When you pull the exact column out of the dataframe as you did like this C = A['Stop Time (UTCG)'] the type of C is a Series.
It might be cleaner to just create a new column from the other two in the first data frame, and then just aggregate or call the sum method on that column. Something like this:
>>> import pandas as pd
>>> from datetime import datetime, timedelta
>>> data1 = [datetime.now() for i in range(5)]
>>> data2 = [datetime.now() for i in range(5)]
>>> data = {'start': data1, 'stop': data2}
>>> df = pd.DataFrame(data)
>>> df
start stop
0 2017-03-11 22:38:11.606500 2017-03-11 22:38:37.474962
1 2017-03-11 22:38:11.606509 2017-03-11 22:38:37.474971
2 2017-03-11 22:38:11.606510 2017-03-11 22:38:37.474973
3 2017-03-11 22:38:11.606511 2017-03-11 22:38:37.474974
4 2017-03-11 22:38:11.606512 2017-03-11 22:38:37.474975
>>> df.dtypes  # use dtypes to make sure the types are what you think they are
start datetime64[ns]
stop datetime64[ns]
dtype: object
>>> df['diff'] = df['stop'] - df['start']
>>> df['diff'].sum()
Timedelta('0 days 00:02:09.342313')
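Applied to the column names from the question, the same idea might look like the sketch below. The two rows are invented for illustration, and the explicit format string assumes every entry looks like "11 Mar 2017 10:37:12.330" with English month abbreviations.

```python
import pandas as pd

# Made-up rows in the question's layout
A = pd.DataFrame({
    'Start Time (UTCG)': ['11 Mar 2017 10:37:12.330', '11 Mar 2017 11:00:00.000'],
    'Stop Time (UTCG)':  ['11 Mar 2017 10:47:12.830', '11 Mar 2017 11:30:00.000'],
})

fmt = '%d %b %Y %H:%M:%S.%f'
start = pd.to_datetime(A['Start Time (UTCG)'], format=fmt)
stop = pd.to_datetime(A['Stop Time (UTCG)'], format=fmt)

# Subtracting two datetime Series gives a timedelta Series
timediff = stop - start

# Use the Series method rather than the built-in sum()
diffsum = timediff.sum()
print(diffsum, diffsum.total_seconds())
```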

pandas save date in ISO format?

I'm trying to generate a Pandas DataFrame where date_range is an index. Then save it to a CSV file so that the dates are written in ISO-8601 format.
import pandas as pd
import numpy as np
from pandas import DataFrame, Series
NumberOfSamples = 10
dates = pd.date_range('20130101',periods=NumberOfSamples,freq='90S')
df3 = DataFrame(index=dates)
df3.to_csv('dates.txt', header=False)
The current output to dates.txt is:
2013-01-01 00:00:00
2013-01-01 00:01:30
2013-01-01 00:03:00
2013-01-01 00:04:30
...................
I'm trying to get it to look like:
2013-01-01T00:00:00Z
2013-01-01T00:01:30Z
2013-01-01T00:03:00Z
2013-01-01T00:04:30Z
....................
Use datetime.strftime and call map on the index:
In [72]:
NumberOfSamples = 10
import datetime as dt
dates = pd.date_range('20130101',periods=NumberOfSamples,freq='90S')
df3 = pd.DataFrame(index=dates)
df3.index = df3.index.map(lambda x: dt.datetime.strftime(x, '%Y-%m-%dT%H:%M:%SZ'))
df3
Out[72]:
Empty DataFrame
Columns: []
Index: [2013-01-01T00:00:00Z, 2013-01-01T00:01:30Z, 2013-01-01T00:03:00Z, 2013-01-01T00:04:30Z, 2013-01-01T00:06:00Z, 2013-01-01T00:07:30Z, 2013-01-01T00:09:00Z, 2013-01-01T00:10:30Z, 2013-01-01T00:12:00Z, 2013-01-01T00:13:30Z]
Alternatively, and better in my view (thanks to @unutbu), you can pass a format specifier to to_csv:
df3.to_csv('dates.txt', header=False, date_format='%Y-%m-%dT%H:%M:%SZ')
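To check what date_format produces without touching the disk, one option is to write into an in-memory buffer. This is only a sketch: the `value` column is added for illustration, and the 90-second step is passed as a Timedelta so the example does not depend on a particular frequency alias spelling.

```python
import io
import pandas as pd

dates = pd.date_range('2013-01-01', periods=3, freq=pd.Timedelta(seconds=90))
df3 = pd.DataFrame({'value': [0, 1, 2]}, index=dates)  # illustrative column

# Write to a string buffer instead of dates.txt
buffer = io.StringIO()
df3.to_csv(buffer, header=False, date_format='%Y-%m-%dT%H:%M:%SZ')

lines = buffer.getvalue().splitlines()
print(lines)
```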
With pd.Index.strftime:
If you're sure that all your dates are UTC, you can hardcode the format:
df3.index = df3.index.strftime('%Y-%m-%dT%H:%M:%SZ')
which gives you 2013-01-01T00:00:00Z and so on. Note that the "Z" denotes UTC!
With pd.Timestamp.isoformat and pd.Index.map:
df3.index = df3.index.map(lambda timestamp: timestamp.isoformat())
This gives you 2013-01-01T00:00:00. If you attach a timezone to your dates first (e.g. by passing tz="UTC" to date_range), you'll get: 2013-01-01T00:00:00+00:00 which also conforms to ISO-8601 but is a different notation. This should work for any dateutil or pytz timezone, leaving no room for ambiguity when clocks switch from daylight saving to standard time.
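The two notations can be compared side by side with a timezone-aware index. A minimal sketch, assuming the data really is in UTC:

```python
import pandas as pd

# tz='UTC' attaches the timezone at creation time
dates = pd.date_range('2013-01-01', periods=3,
                      freq=pd.Timedelta(seconds=90), tz='UTC')

# isoformat() includes the numeric UTC offset
iso_offset = [ts.isoformat() for ts in dates]

# A hardcoded "Z" suffix is only correct because the index is known to be UTC
iso_z = list(dates.strftime('%Y-%m-%dT%H:%M:%SZ'))

print(iso_offset[1])
print(iso_z[1])
```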

pandas: convert datetime to end-of-month

I have written a function to convert pandas datetime dates to month-end:
import pandas
import numpy
import datetime
from pandas.tseries.offsets import Day, MonthEnd
def get_month_end(d):
    month_end = d - Day() + MonthEnd()
    if month_end.month == d.month:
        return month_end # 31/March + MonthEnd() returns 30/April
    else:
        raise ValueError("Something went wrong while converting dates to EOM: %s was converted to %s" % (d, month_end))
This function seems to be quite slow, and I was wondering if there is any faster alternative? The reason I noticed it's slow is that I am running this on a dataframe column with 50'000 dates, and I can see that the code is much slower since introducing that function (before I was converting dates to end-of-month).
df = pandas.read_csv(inpath, na_values = nas, converters = {open_date: read_as_date})
df[open_date] = df[open_date].apply(get_month_end)
I am not sure if that's relevant, but I am reading the dates in as follows:
def read_as_date(x):
    return datetime.datetime.strptime(x, fmt)
Revised, converting to period and then back to timestamp does the trick
In [104]: df = DataFrame(dict(date = [Timestamp('20130101'),Timestamp('20130131'),Timestamp('20130331'),Timestamp('20130330')],value=randn(4))).set_index('date')
In [105]: df
Out[105]:
value
date
2013-01-01 -0.346980
2013-01-31 1.954909
2013-03-31 -0.505037
2013-03-30 2.545073
In [106]: df.index = df.index.to_period('M').to_timestamp('M')
In [107]: df
Out[107]:
value
2013-01-31 -0.346980
2013-01-31 1.954909
2013-03-31 -0.505037
2013-03-31 2.545073
Note that this type of conversion can also be done as follows, although the approach above would be slightly faster.
In [85]: df.index + pd.offsets.MonthEnd(0)
Out[85]: DatetimeIndex(['2013-01-31', '2013-01-31', '2013-03-31', '2013-03-31'], dtype='datetime64[ns]', name=u'date', freq=None, tz=None)
If the date column is in datetime format and is set to the first day of the month, this will roll it forward to the last day of that same month:
df['date1']=df['date'] + pd.offsets.MonthEnd(0)
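The difference between MonthEnd(0) and MonthEnd(1) is easy to miss: MonthEnd(0) rolls a date forward to the end of its own month and leaves dates that are already a month end unchanged, while MonthEnd(1) always moves to the next month end after the date. A short sketch with made-up dates:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2013-01-01', '2013-01-31', '2013-03-30']))

# Roll forward to the end of the current month (month ends stay put)
rolled = dates + pd.offsets.MonthEnd(0)

# Move to the next month end strictly after each date
shifted = dates + pd.offsets.MonthEnd(1)

print(rolled.dt.strftime('%Y-%m-%d').tolist())
print(shifted.dt.strftime('%Y-%m-%d').tolist())
```

This is also why the function in the question subtracts a day before adding MonthEnd(): without that step a date already at month end jumps to the following month.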
import pandas as pd
import numpy as np
import datetime as dt
df0['Calendar day'] = pd.to_datetime(df0['Calendar day'], format='%m/%d/%Y')
df0['Calendar day'] = df0['Calendar day'].dt.normalize()  # pd.datetools.normalize_date is gone from current pandas
df0['Month Start Date'] = df0['Calendar day'].dt.to_period('M').apply(lambda r: r.start_time)
This code should work. Calendar Day is a column in which the date is given in the format %m/%d/%Y. For example, 12/28/2014 is 28 December 2014. The output is 2014-12-01 as a pandas Timestamp (reported as 'pandas.tslib.Timestamp' in older versions). Note that this yields the first day of the month rather than the last.
You can also use NumPy to do it faster (this example computes the month start):
import numpy as np
date_array = np.array(['2013-01-01', '2013-01-15', '2013-01-30']).astype('datetime64[ns]')
month_start_date = date_array.astype('datetime64[M]')
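The NumPy snippet truncates to the start of the month. To reach the month end with the same approach, one possibility is to step to the first day of the next month and go back one day. A minimal sketch with illustrative dates:

```python
import numpy as np

date_array = np.array(['2013-01-01', '2013-02-15', '2013-03-30'],
                      dtype='datetime64[D]')

# Truncate to month precision (the first of each month)
month_start = date_array.astype('datetime64[M]')

# First day of the following month, then back one day
next_month_start = (month_start + np.timedelta64(1, 'M')).astype('datetime64[D]')
month_end = next_month_start - np.timedelta64(1, 'D')

print(month_end)
```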
In case the date is not in the index but in another column (works for Pandas 0.25.0):
import pandas as pd
import numpy as np
df = pd.DataFrame(dict(date = [pd.Timestamp('20130101'),
pd.Timestamp('20130201'),
pd.Timestamp('20130301'),
pd.Timestamp('20130401')],
value = np.random.rand(4)))
print(df.to_string())
df.date = df.date.dt.to_period('M').dt.to_timestamp('M')
print(df.to_string())
Output:
date value
0 2013-01-01 0.295791
1 2013-02-01 0.278883
2 2013-03-01 0.708943
3 2013-04-01 0.483467
date value
0 2013-01-31 0.295791
1 2013-02-28 0.278883
2 2013-03-31 0.708943
3 2013-04-30 0.483467
What you are looking for might be:
df.resample('M').last()
The other method, as mentioned earlier by @Jeff:
df.index = df.index.to_period('M').to_timestamp('M')

Getting a time index in python for pandas dataframe

I'm having a bit of trouble getting the right time index for my pandas dataframe.
import pandas as pd
from datetime import datetime
import numpy as np
stockdata = pd.read_csv("/home/stff/symbol_2012-02.csv", parse_dates =[[0,1,2]])
stockdata.columns = ['date_time','ticker','exch','salcond','vol','price','stopstockind','corrind','seqnum','source','trf','symroot','symsuffix']
I think the problem is that the time stuff comes in the first three columns: year/month/date, hour/minute/second, millisecond. Also, the hour/minute/second column drops the first zero if its before noon.
print(stockdata['date_time'][0])
20120201 41206 300
print(stockdata['date_time'][50000])
20120201 151117 770
Ideally, I would like to define my own function that could be called by the converters argument in the read_csv function.
Suppose you have a csv file that looks like this:
date,time,milliseconds,value
20120201,41206,300,1
20120201,151117,770,2
Then, using the parse_dates, index_col and date_parser parameters of the read_csv method, one can construct a pandas DataFrame with a time index like this:
import datetime as dt
import pandas as pd
parse = lambda x: dt.datetime.strptime(x, '%Y%m%d %H%M%S %f')
df = pd.read_csv('test.csv', parse_dates=[['date', 'time', 'milliseconds']],
index_col=0, date_parser=parse)
This yields:
value
date_time_milliseconds
2012-02-01 04:12:06.300000 1
2012-02-01 15:11:17.770000 2
And df.index:
<class 'pandas.tseries.index.DatetimeIndex'>
[2012-02-01 04:12:06.300000, 2012-02-01 15:11:17.770000]
Length: 2, Freq: None, Timezone: None
This answer is based on a similar solution proposed here.
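As far as I know, date_parser and the nested-list form of parse_dates are deprecated in recent pandas releases, so a version that avoids both may be useful. The sketch below reads the three columns as strings, restores the leading zeros explicitly, and parses the result with pd.to_datetime. The padding matters: without it, a time such as 11206 may be read as 11:20:06 rather than 01:12:06. The inline `raw` text stands in for the CSV file, and the index name `date_time` is my own choice.

```python
import io
import pandas as pd

# Stand-in for test.csv, same sample rows as above
raw = (
    "date,time,milliseconds,value\n"
    "20120201,41206,300,1\n"
    "20120201,151117,770,2\n"
)

# Keep the time parts as strings so no digits are reinterpreted
df = pd.read_csv(io.StringIO(raw),
                 dtype={'date': str, 'time': str, 'milliseconds': str})

# Restore dropped leading zeros: HHMMSS has 6 digits, milliseconds 3
stamp = (df['date'] + ' '
         + df['time'].str.zfill(6) + ' '
         + df['milliseconds'].str.zfill(3))

parsed = pd.to_datetime(stamp, format='%Y%m%d %H%M%S %f').rename('date_time')
df = df.set_index(parsed).drop(columns=['date', 'time', 'milliseconds'])
print(df)
```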
