I have the following frame:
USERID, EVENT1TIME, EVENT2TIME, MISC1, MISC2
123,45,,,
123,,46,,
123,,47,,
123,,48,,
123,,49,,
123,,51,,
124,45,,,
124,,46,,
124,,47,,
124,,48,,
124,,49,,
124,,51,,
I'd like to add a new DELTA column that is (EVENT2TIME - EVENT1TIME):
USERID, EVENT1TIME, EVENT2TIME, MISC1, MISC2, DELTA
123,45,,,,
123,,46,,,1
123,,47,,,2
123,,48,,,3
123,,49,,,4
123,,51,,,6
124,45,,,,
124,,46,,,1
124,,47,,,2
124,,48,,,3
124,,49,,,4
124,,51,,,6
I think the first thing to do is to copy the value from the row where EVENT1TIME is populated into the other instances of that USERID. But I suspect there may be a better way.
I am making some assumptions:
You want to calculate the difference between EVENT2TIME and the first row's EVENT1TIME.
You want to store the result in DELTA.
You can do this as follows:
import pandas as pd
df = pd.read_csv('abc.txt')
print(df)
df['DELTA'] = df.iloc[:, 2] - df.iloc[0, 1]  # EVENT2TIME minus the EVENT1TIME in row 0
print(df)
The output of this will be:
USERID EVENT1TIME EVENT2TIME MISC1 MISC2 DELTA
0 123 45.0 NaN NaN NaN NaN
1 123 NaN 46.0 NaN NaN 1.0
2 123 NaN 47.0 NaN NaN 2.0
3 123 NaN 48.0 NaN NaN 3.0
4 123 NaN 49.0 NaN NaN 4.0
5 123 NaN 51.0 NaN NaN 6.0
If you know EVENT1TIME is always and only in the first row, just store it as a variable and subtract it.
val = df.EVENT1TIME[0]
df['DELTA'] = df.EVENT2TIME - val
If EVENT1TIME is populated only every so often, forward or back fill the empty rows for EVENT1TIME. The fill below happens on the fly and is not stored in the final output df.
df['DELTA'] = df.EVENT2TIME - df.EVENT1TIME.ffill() # forward fill (down) all nan values
# OR
df['DELTA'] = df.EVENT2TIME - df.EVENT1TIME.bfill() # back fill (up) all nan values
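Note that a plain ffill/bfill can carry one USERID's EVENT1TIME into another user's rows (bfill in particular pulls values up from the next user). A safer per-user sketch, grouping before filling (same column names as above):
# Fill EVENT1TIME forward within each USERID only, so one user's value
# never leaks into another user's rows, then subtract row-wise.
df['DELTA'] = df['EVENT2TIME'] - df.groupby('USERID')['EVENT1TIME'].ffill()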
EDIT: Keeping this for continuity despite how hacky it is.
import numpy as np
locations = list(df[~np.isnan(df.EVENT1TIME)].index)  # rows where EVENT1TIME is populated
vals = df.EVENT1TIME.loc[locations]  # all EVENT1TIME values
locations.append(df.index[-1] + 1)  # one past the last row index
last_loc = locations[0]
for next_loc in locations[1:]:
    # assign through df.loc (not chained indexing) so the write sticks
    df.loc[last_loc:next_loc - 1, 'DELTA'] = df.loc[last_loc:next_loc - 1, 'EVENT2TIME'] - vals[last_loc]
    last_loc = next_loc
I am making a small mistake somewhere and I'm not sure how to merge two DataFrames correctly. I want to merge on IBES_cusip to get gvkey into df1.
I try the following, but it just expands the dataset out and does not match correctly:
df1 = df1.merge(df2, how = 'left', on =['IBES_cusip'])
df1
IBES_cusip pends pdicity ... ltg_eps futepsgrowth
0 00036110 1983-05-31 ANN ... NaN NaN
1 00036110 1983-05-31 ANN ... NaN NaN
2 00036110 1983-05-31 ANN ... NaN NaN
3 98970110 1983-05-31 ANN ... NaN NaN
4 98970110 1983-05-31 ANN ... NaN NaN
... ... ... ... ... ...
373472 98970111 2018-12-31 ANN ... 10.00 0.381119
373473 98970111 2018-12-31
df2
gvkey IBES_cusip
0 024538 86037010
1 004678 33791510
2 066367 26357810
3 137024 06985P20
4 137024 06985P20
... ...
833796 028955 33975610
833797 061676 17737610
833798 011096 92035510
833799 005774 44448210
833800 008286 69489010
Your main problem is that df2 contains duplicate values in the IBES_cusip column.
From the sample you gave I can see that
3 137024 06985P20
4 137024 06985P20
are the same values; this causes you to get unwanted results (duplicate rows in the output).
Try this:
df1 = df1.merge(df2.drop_duplicates(subset=['IBES_cusip']), how='left', on='IBES_cusip')
This should just add a gvkey column to your df1.
It assumes you are fairly sure that you don't have rows with the same IBES_cusip matched with different gvkey values; otherwise you need to sort that out first.
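One way to check that up front (a quick sanity sketch, using the column names from your frames):
# Count the distinct gvkey values per IBES_cusip; any count above 1
# means the same cusip maps to more than one gvkey and needs cleaning.
conflicts = df2.groupby('IBES_cusip')['gvkey'].nunique()
print(conflicts[conflicts > 1])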
I use read_csv to fill a pandas DataFrame. The DataFrame has a column that is entirely NaN, and this becomes a problem when I use pivot_table.
Here my situation:
import numpy as np
import pandas as pd
d = {'dates': ['01/01/20','01/02/20','01/03/20'], 'country': ['Fra','Fra','Fra'], 'val': [np.nan,np.nan,np.nan]}
df = pd.DataFrame(data=d)
piv = df.pivot_table(index='country', values='val', columns='dates')
print(piv)
Empty DataFrame
Columns: []
Index: []
I would like to have this :
dates 01/01/20 01/02/20 01/03/20
country
Fra NaN NaN NaN
From the docs, set dropna=False in DataFrame.pivot_table:
piv = df.pivot_table(index='country',values='val',columns='dates', dropna=False)
dates 01/01/20 01/02/20 01/03/20
country
Fra NaN NaN NaN
Just use the dropna argument of pivot_table:
df.pivot_table(index='country',columns='dates', values='val', dropna = False)
The output is:
dates 01/01/20 01/02/20 01/03/20
country
Fra NaN NaN NaN
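If you would rather see zeros than NaN in the kept cells, pivot_table also takes a fill_value argument (a sketch on the same toy frame; note it fills every missing cell in the result, not just this column):
# dropna=False keeps the all-NaN column; fill_value then replaces the
# remaining NaN cells in the result with 0.
piv = df.pivot_table(index='country', values='val', columns='dates', dropna=False, fill_value=0)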
I have a pandas dataframe whose index is created by pd.bdate_range. The index consists of business days (Monday through Friday) starting 1993-01-05. The first 12 rows are:
df_xx[0:12]
Out[163]:
aaa aaa_f
1993-01-05 125.25 NaN
1993-01-06 124.84 NaN
1993-01-07 125.09 NaN
1993-01-08 125.42 NaN
1993-01-11 125.36 NaN
1993-01-12 125.05 NaN
1993-01-13 125.87 NaN
1993-01-14 125.65 NaN
1993-01-15 126.05 NaN
1993-01-18 125.82 NaN
1993-01-19 125.46 NaN
1993-01-20 125.39 NaN
How can I create a subset with only Friday data?
Get the day names with DatetimeIndex.day_name and filter the DataFrame by boolean indexing:
df = df[df.index.day_name() == 'Friday']
print(df)
aaa aaa_f
1993-01-08 125.42 NaN
1993-01-15 126.05 NaN
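An equivalent filter, assuming the index really is a DatetimeIndex, compares day numbers instead of strings via DatetimeIndex.dayofweek:
# Monday is 0, so Friday is 4
df = df[df.index.dayofweek == 4]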
When I run the code below I get the error:
TypeError: 'NoneType' object has no attribute '__getitem__'
import pyarrow
import pandas
import pyarrow.parquet as pq
df = pq.read_table("file.parquet").to_pandas()
df = df.iloc[1:,:]
df = df.dropna (how="any", inplace = True) # modifies it in place, creates new dataset without NAN
average_age = df["_c2"].mean()
print average_age
The dataframe looks like this:
_c0 _c1 _c2
0 RecId Class Age
1 1 1st 29
2 2 1st NA
3 3 1st 30
If I print the df after calling the dropna method, I get 'None'.
Shouldn't it be creating a new dataframe without the 'NA' in it, which would then allow me to get the average age without throwing an error?
As per OP’s comment, the NA is a string rather than NaN. So dropna() is no good here. One of many possible options for filtering out the string value ‘NA’ is:
df = df[df["_c2"] != "NA"]
A better option to catch inexact matches (e.g. with trailing spaces), as suggested by @DJK in the comments:
df = df[~df["_c2"].str.contains('NA')]
This one removes any non-numeric strings, not just 'NA':
df = df[df["_c2"].apply(lambda x: x.isnumeric())]
This will work even if the NA in your df is NaN (np.nan); NaN would not stop you from getting the mean of the column. The problem only arises when the NA is the string 'NA'.
df.apply(pd.to_numeric, errors='coerce', axis=1).describe()
Out[9]:
_c0 _c1 _c2
count 3.0 0.0 2.000000
mean 2.0 NaN 29.500000
std 1.0 NaN 0.707107
min 1.0 NaN 29.000000
25% 1.5 NaN 29.250000
50% 2.0 NaN 29.500000
75% 2.5 NaN 29.750000
max 3.0 NaN 30.000000
More info
df.apply(pd.to_numeric, errors='coerce', axis=1)  # all non-numeric values become NaN and do not affect the mean
Out[10]:
_c0 _c1 _c2
0 NaN NaN NaN
1 1.0 NaN 29.0
2 2.0 NaN NaN
3 3.0 NaN 30.0
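Following the same idea, a sketch that coerces only the one column and then takes the mean (names as in the question):
# 'NA' and the stray header value 'Age' become NaN; mean() skips NaN.
df["_c2"] = pd.to_numeric(df["_c2"], errors='coerce')
average_age = df["_c2"].mean()
print(average_age)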