Reassign Pandas DataFrame column values - python

I have a csv that has incorrect datetime formatting. I've worked out how to convert those values into the format I need, but now I need to reassign all the values in a column to the new, converted values.
For example, I'm hoping there's something I can put into the following for loop that will insert the values back into the dataframe at the correct locations:
for i in df[df.columns[1]]:
    t = pd.Timestamp(i)
    short_date = t.date().strftime('%m/%d/%Y').lstrip('0')
    # Insert back into dataframe?
As always, your help is very much appreciated!
Part of the column in question:
Created Date
2019-02-27 22:55:16
2019-01-29 22:57:12
2018-11-29 00:13:31
2019-01-30 21:35:15
2018-12-20 21:14:45
2018-11-01 16:20:15
2019-04-11 16:38:07
2019-01-24 00:23:17
2018-12-21 19:30:10
2018-12-19 22:33:04
2018-11-07 19:54:19
2019-05-10 21:15:00

In the simplest, but most instructive, possible terms:
df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})
df
# x y
# 0 1 4
# 1 2 5
# 2 3 6
df[:] = df[:].astype(float)
df
# x y
# 0 1.0 4.0
# 1 2.0 5.0
# 2 3.0 6.0
Let pandas do the work for you.
Or, for only one column:
df.x = df.x.astype(float)
df
# x y
# 0 1.0 4
# 1 2.0 5
# 2 3.0 6
You'll, of course, replace astype(float) with your own conversion; for datetimes that means working through the .dt accessor rather than the per-value .date().strftime('%m/%d/%Y').lstrip('0') calls from the loop.
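As a concrete sketch of that substitution (the column name 'Created Date' is taken from the sample data above):
import pandas as pd
df = pd.DataFrame({'Created Date': ['2019-02-27 22:55:16', '2019-01-29 22:57:12']})
# Parse the strings into timestamps, format back to m/d/Y strings,
# then strip the leading zero from the month, as the loop did.
df['Created Date'] = pd.to_datetime(df['Created Date']).dt.strftime('%m/%d/%Y').str.lstrip('0')
print(df)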

To reassign a column, no need for a loop. Something like this should work:
df["column"] = new_column
new_column is either a Series of matching length, or something that can be broadcast¹ to that length. You can find more details in the docs.
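For instance, a minimal sketch of both cases:
import pandas as pd
df = pd.DataFrame({'x': [1, 2, 3]})
df['y'] = df['x'] * 10   # a Series of matching length
df['flag'] = True        # a scalar, broadcast to every row
print(df)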
That said, if pd.Timestamp can already parse your data, there is no need for "formatting". The formatting is not associated with a timestamp instance. You can choose a particular formatting when you are converting to string with something like df["timestamp"].dt.strftime("%m/%d/%Y").
On the other hand, if you want to change the precision of your timestamp, you can do something like this:
df["timestamp"] = df["timestamp"].astype("datetime64[D]")
Here, all time information is truncated to a resolution of days. The letter between the [ and ] is the resolution. Again, all this and more is discussed in the docs.
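One caveat worth noting: recent pandas versions only accept s/ms/us/ns units when casting with astype, so a day-level truncation is more portably done through the .dt accessor, for example:
import pandas as pd
ts = pd.Series(pd.to_datetime(['2019-02-25 11:49:50', '2019-02-25 23:59:59']))
# Drop the time-of-day component while keeping a datetime dtype.
print(ts.dt.normalize())   # equivalently: ts.dt.floor('D')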
¹ Broadcasting is a concept from numpy where you can operate on arrays of different but compatible shapes. Again, everything is covered in the docs.

Thank you all for your help. All of the answers were helpful, but the answer I ended up using was as follows:
import pandas as pd
df[df.columns[0]] = pd.to_datetime(df[df.columns[0]]).dt.strftime('%m/%d/%Y')

Related

Merge the all elements of multiple columns into one column in series while keeping NaNs

Context: I have 5 years of weight data. The first column is the date (month and day), the succeeding columns are the years with corresponding weight for each day of the month. I want to have a full plot of all of my data among other things and so I want to combine all into just two columns. First column is the dates from 2018 to 2022, then the second column is the corresponding weight to each date. I have managed the date part, but can't combine the weight data. In essence, I want to turn ...
0 1
0 1 4.0
1 2 NaN
2 3 6.0
Into ...
0
0 1
1 2
2 3
3 4
4 NaN
5 6.0
pd.concat only puts the year columns next to each other. .join, .merge, melt, stack, and agg don't work either. How do I do this?
sample code:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'2018': [1, 2, 3]})
df2 = pd.DataFrame({'2019': [4, np.NaN, 6]})
merged_df = pd.concat([df1, df2], axis=1, ignore_index=True, levels=0)
print(merged_df)
P.S. I particularly don't want to input any index names (like id_vars="2018") because I want this process to be automated as the years go by with more data.
concat, merge, melt, join, stack, agg: I want to combine all the column data into just one series.
I think np.ravel(merged_df, order='F') will do the job for you.
If you want it in the form of a dataframe, then pd.DataFrame(np.ravel(merged_df, order='F')).
It's not fully clear what your input/output is, but based on your first example, you can use concat like this:
pd.concat([df["0"], df["1"].rename("0")], ignore_index=True)
Output :
0 1.0
1 2.0
2 3.0
3 4.0
4 NaN
5 6.0
Name: 0, dtype: float64
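If the year columns should be stacked without naming any of them, as the question asks, another sketch is to concatenate every column in order:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'2018': [1, 2, 3]})
df2 = pd.DataFrame({'2019': [4, np.nan, 6]})
merged_df = pd.concat([df1, df2], axis=1)
# Stack every column end-to-end without hard-coding any year names.
stacked = pd.concat([merged_df[c] for c in merged_df.columns], ignore_index=True)
print(stacked)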

How can I remove the .0 after floats in a DataFrame without changing the object type? (I have NANs and require numerical sorting)

I need to remove the .0s after floats in a column of a DataFrame, made from a dictionary.
For example, the dictionary might be:
mydict = { "part1" : [1,2,None,4,5] "part2" : [6,7,None,9,10] }
and then when I run mydf = pd.DataFrame(mydict), the generated DataFrame is as follows:
part1 part2
0 1.0 6.0
1 2.0 7.0
2 NaN NaN
3 4.0 9.0
4 5.0 10.0
This happens because every column in a DataFrame must hold objects of the same type. But I want no trailing .0s on my data, purely for looks. Obviously, I can't make them integers, because integers have no NaN. I also can't make them strings, because of the numerical sorting. I also wouldn't want "01","02","03"…"10", again for looks.
Because this project is really serious, the looks matter, so please don't accuse me of overthinking the appearance of the data.
The comments point out a couple of solutions.
I prefer to cast with .astype('Int64'), which retains the NaN as <NA>.
Here is my solution (with your data):
import pandas as pd
mydict = {"part1":[1,2,None,4,5], "part2":[6,7,None,9,10]}
df = pd.DataFrame(mydict)
df['part2'] = df['part2'].astype('Int64')
print(df)
returns this:
part1 part2
0 1.0 6
1 2.0 7
2 NaN <NA>
3 4.0 9
4 5.0 10
You can apply the above to one (or many) columns of your choice.
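For example, to apply it to every column at once (a sketch, assuming every column holds whole numbers):
import pandas as pd
mydict = {"part1": [1, 2, None, 4, 5], "part2": [6, 7, None, 9, 10]}
df = pd.DataFrame(mydict)
# Nullable Int64 keeps the columns numeric (so sorting still works)
# and prints missing values as <NA> instead of NaN.
df = df.astype('Int64')
print(df)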

Pandas subtracting rows gives wrong result

My pandas dataframe has a column "timeStamp" whose elements are of type datetime.datetime. I'm trying to obtain the difference between two consecutive rows of this column to get the time spent, in seconds. I use the following piece of code for it.
df["Time"] = df["timeStamp"].diff(0).dt.total_seconds()
Generally it's working fine; however, in quite a few instances I keep getting 0.0 from this operation even when the timestamps differ.
Example values that result in 0.0:
import pandas as pd
import datetime
import numpy as np
df = pd.DataFrame({'S.No.': [1, 2, 3, 4],
                   'ABC': [datetime.datetime(2019, 2, 25, 11, 49, 50),
                           datetime.datetime(2019, 2, 25, 11, 50, 0),
                           datetime.datetime(2019, 2, 25, 11, 50, 7),
                           datetime.datetime(2019, 2, 25, 11, 50, 12)]})
df["Time"] = df["ABC"].diff(0).dt.seconds
print df
Note: using python2.7
Try this:
print(df["timestamp"].diff().fillna(0).dt.seconds)
0 0
1 10
2 7
3 5
df['difference'] = df["timestamp"].diff().fillna(0).dt.seconds
print(df)
timestamp difference
0 2019-02-25 11:49:50 0
1 2019-02-25 11:50:00 10
2 2019-02-25 11:50:07 7
3 2019-02-25 11:50:12 5
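A caveat on this approach: .dt.seconds returns only the seconds component of each timedelta (it wraps around at one day), whereas .dt.total_seconds() returns the full duration; also, on recent pandas versions filling a timedelta column may require pd.Timedelta(0) rather than 0. A quick sketch of the difference:
import pandas as pd
td = pd.Series(pd.to_timedelta(['1 days 00:00:05']))
print(td.dt.seconds)          # 5: the seconds component only
print(td.dt.total_seconds())  # 86405.0: the full duration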
Use
df["Time"] = df["timeStamp"].diff().dt.total_seconds()
instead.
The argument to diff specifies how many rows back to look for the value you subtract. You were passing 0, so you subtracted each value from itself, which always gives 0. Leaving it empty uses the default of 1, i.e. the difference with the row directly above.
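A minimal sketch of that behaviour, using the timestamps from the question:
import pandas as pd
ts = pd.Series(pd.to_datetime(['2019-02-25 11:49:50', '2019-02-25 11:50:00',
                               '2019-02-25 11:50:07', '2019-02-25 11:50:12']))
print(ts.diff(0).dt.total_seconds())  # 0.0 everywhere: each row minus itself
print(ts.diff().dt.total_seconds())   # NaN, 10.0, 7.0, 5.0: row minus previous row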

Pandas series inserted into dataframe are read as NaN

I'm finding that when I add a series, based on the same time period, to an existing dataframe, it gets imported as NaNs. The dataframe has a Field column, but I don't understand why that should change anything. To see the steps of my code, you can review the attached image. I hope someone can help!
[Image: the dataframe the series is inserted into, with the inserted values showing as NaN]
Assuming that the value in the Field index column is "Actual" for every row, a solution could be the following:
test.reset_index().set_index('Date').assign(m1=m1)
That solution works, but it can be done more concisely:
days = pd.to_datetime(['2018-01-31', '2018-02-28', '2018-03-31'])
df = pd.DataFrame({'Field': ['Actual']*3, 'Date': days, 'Val':[1, 2, 3]}).set_index(['Field', 'Date'])
m1 = pd.Series([0, 2, 4], index=days)
df.reset_index(level='Field').assign(m1=m1)
Field Val m1
Date
2018-01-31 Actual 1 0
2018-02-28 Actual 2 2
2018-03-31 Actual 3 4
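For context, a minimal sketch of why the NaNs appear in the first place: assignment aligns on the index, and a series indexed only by Date never matches a (Field, Date) MultiIndex:
import pandas as pd
days = pd.to_datetime(['2018-01-31', '2018-02-28', '2018-03-31'])
df = pd.DataFrame({'Field': ['Actual'] * 3, 'Date': days, 'Val': [1, 2, 3]}).set_index(['Field', 'Date'])
m1 = pd.Series([0, 2, 4], index=days)
# No label in m1's Date index matches a (Field, Date) tuple, so every value is NaN.
df['m1'] = m1
print(df)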
By the way, that would have made a nice MCVE.

Pandas- set values to an empty dataframe

I have initialized an empty pandas dataframe that I am now trying to fill, but I keep running into the same error. This is the (simplified) code I am using:
import pandas as pd
cols = list("ABC")
df = pd.DataFrame(columns=cols)
# set the values for the first two rows
df.loc[0:2,:] = [[1,2],[3,4],[5,6]]
On running the above code I get the following error:
ValueError: cannot copy sequence with size 3 to array axis with dimension 0
I am not sure what's causing this. I tried the same thing one row at a time and it works (df.loc[0,:] = [1,2,3]). I thought this should be the logical extension when I want to handle more than one row, but clearly I am wrong. What's the correct way to do this? I need to enter values for multiple rows and columns at once. I can do it with a loop, but that's not what I am looking for.
Any help would be great. Thanks
Since you have the columns from the empty dataframe, use them in the DataFrame constructor, i.e.
import pandas as pd
import numpy as np
cols = list("ABC")
df = pd.DataFrame(columns=cols)
df = pd.DataFrame(np.array([[1,2],[3,4],[5,6]]).T, columns=df.columns)
A B C
0 1 3 5
1 2 4 6
Well, if you want to use loc specifically, then reindex the dataframe first and then assign, i.e.
arr = np.array([[1,2],[3,4],[5,6]]).T
df = df.reindex(np.arange(arr.shape[0]))
df.loc[0:arr.shape[0],:] = arr
A B C
0 1 3 5
1 2 4 6
How about adding data by index, as below? You can call the function externally as and when you receive data.
def add_to_df(index, data):
    for idx, i in zip(index, zip(*data)):
        df.loc[idx] = i
# Set values for the first two rows
data1 = [[1,2],[3,4],[5,6]]
index1 = [0,1]
add_to_df(index1, data1)
print(df)
print()
# Set values for the next three rows
data2 = [[7,8,9],[10,11,12],[13,14,15]]
index2 = [2,3,4]
add_to_df(index2, data2)
print(df)
Result
>>>
A B C
0 1.0 3.0 5.0
1 2.0 4.0 6.0
A B C
0 1.0 3.0 5.0
1 2.0 4.0 6.0
2 7.0 10.0 13.0
3 8.0 11.0 14.0
4 9.0 12.0 15.0
>>>
Looking through the documentation and from some experiments, my guess is that loc only allows you to insert one key at a time. However, you can insert multiple keys first with reindex, as @Dark shows.
The .loc/[] operations can perform enlargement when setting a non-existent key for that axis.
http://pandas-docs.github.io/pandas-docs-travis/indexing.html#setting-with-enlargement
Also, with loc[0:2, :] you mean to select the rows labelled 0 through 2. However, there is nothing in the empty df to select: it has no rows yet, while you are trying to insert 3. Thus the message:
ValueError: cannot copy sequence with size 3 to array axis with dimension 0
BTW, [[1,2],[3,4],[5,6]] will be 3 rows rather than 2.
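A minimal sketch of that enlargement behaviour, one label at a time:
import pandas as pd
df = pd.DataFrame(columns=list("ABC"))
# Setting a non-existent row label enlarges the frame one row at a time.
df.loc[0] = [1, 3, 5]
df.loc[1] = [2, 4, 6]
print(df)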
Does this get the output you're looking for?
import pandas as pd
df=pd.DataFrame({'A':[1,2],'B':[3,4],'C':[5,6]})
Output :
A B C
0 1 3 5
1 2 4 6
