Python: reorganize a dataframe with repeated values appearing in one column

I have a dataframe that looks like this:
Instrument Date Total Return
0 KYG2615B1014 2017-11-29T00:00:00Z 0.000000
1 KYG2615B1014 2017-11-28T00:00:00Z -10.679612
2 KYG2615B1014 2017-11-27T00:00:00Z -8.035714
3 JP3843250006 2017-11-29T00:00:00Z 0.348086
4 JP3843250006 2017-11-28T00:00:00Z 0.349301
5 JP3843250006 2017-11-27T00:00:00Z 0.200000
Given that dataframe, I would like to make it look like this:
11/27/2017 11/28/2017 11/29/2017
KYG2615B1014 -8.035714 -10.679612 0.000000
JP3843250006 0.200000 0.349301 0.348086
Basically what I want is to turn every date into a new column and place the corresponding value inside it. I wouldn't call this "filtering" or "deleting" duplicates; it's much more like rearranging.
Both dataframes were generated by me, but to acquire this data I have to call an API. For the 1st dataframe I make only one call and pull all of this data, while for the 2nd I make one call per date. The 1st approach is therefore much more efficient, and I figured it was the right call, but I'm stuck on reorganizing the dataframe into the shape I need.
I thought of creating an empty dataframe and then populating it: picking the indexes of repeated elements in the 'Instrument' column, using those indexes to get the elements from the 'Total Return' column, and then placing the elements from that chunk of data accordingly, but I don't know how to do it.
If someone can help me, I'll be happy to know.
Not sure if it's useful at this point, but this is how I generated the dataframe (before populating it) in the 2nd version:
import pandas as pd
import datetime

# Getting a list of dates
start = datetime.date(2017, 11, 27)
end = datetime.date.today() - datetime.timedelta(days=1)
row_dates = [x.strftime('%m/%d/%Y') for x in pd.bdate_range(start, end).tolist()]

# Getting identifiers to be used on Eikon
csv_data = pd.read_csv('171128.csv', header=None)
identifiers = csv_data[0].tolist()

df = pd.DataFrame(index=identifiers, columns=row_dates)

You can use pd.crosstab:
pd.crosstab(df.Instrument, df['Date'], values=df['Total Return'], aggfunc='mean')
Output:
Date 2017-11-27T00:00:00Z 2017-11-28T00:00:00Z 2017-11-29T00:00:00Z
Instrument
JP3843250006 0.200000 0.349301 0.348086
KYG2615B1014 -8.035714 -10.679612 0.000000
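If you also want the columns to appear as mm/dd/yyyy dates instead of the raw ISO timestamps, one option (a sketch, assuming the Date column can be parsed by pd.to_datetime) is to reformat the dates before cross-tabulating:
import pandas as pd

# Parse the ISO timestamps and reformat them as mm/dd/yyyy strings,
# then build the same cross-tabulation as above
df['Date'] = pd.to_datetime(df['Date']).dt.strftime('%m/%d/%Y')
out = pd.crosstab(df['Instrument'], df['Date'], values=df['Total Return'], aggfunc='mean')
print(out)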

This looks like a job for pandas.pivot_table() to me; note you can add an aggregation function if you think there will be duplicates (from the example it looks like there is only one reading per day).
import pandas as pd

instrument = ['KYG2615B1014', 'KYG2615B1014', 'KYG2615B1014', 'JP3843250006', 'JP3843250006', 'JP3843250006']
date = ['11/29/2017', '11/28/2017', '11/27/2017', '11/29/2017', '11/28/2017', '11/27/2017']
total_return = [0.0, -10.679612, -8.035714, 0.348086, 0.349301, 0.200000]

stacked = pd.DataFrame(dict(Instrument=instrument, Date=date, Total_return=total_return))
pd.pivot_table(stacked, values='Total_return', index='Instrument', columns='Date')
This returns the following:
Date 11/27/2017 11/28/2017 11/29/2017
Instrument
JP3843250006 0.200000 0.349301 0.348086
KYG2615B1014 -8.035714 -10.679612 0.000000
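Since each (Instrument, Date) pair occurs only once in the sample, a plain DataFrame.pivot would give the same result; this is just a sketch of the alternative (pivot raises an error if duplicate pairs do exist, whereas pivot_table silently aggregates them):
# Reshape without aggregation; fails loudly on duplicate (Instrument, Date) pairs
wide = stacked.pivot(index='Instrument', columns='Date', values='Total_return')
print(wide)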

Related

How do I create custom calculations on Groupby objects in pandas with a new row below each object

Thank you for taking the time to read through my question. I hope you can help.
I have a large DataFrame with loads of columns. One column is an ID with multiple classes on which I would like to calculate totals and other custom calculations based on the columns above it.
The DataFrame columns look something like this:
I would like to calculate the Total AREA for each ID for all the CLASSES. Then I need to calculate the custom totals for the VAR columns using the variables from the other columns. In the end I would like to have a series of grouped IDs that look like this:
I hope that this make sense. The current thinking I have applied is to use the following code:
df = pd.read_csv('data.csv')
df.groupby('ID').apply(lambda x: x['AREA'].sum())
This provides me with a list of all the summed areas, which I can store in a variable to append back to the original dataframe through the ID and CLASS column. However, I am unsure how I get the other calculations done, as shown above. On top of that, I am not sure how to get the final DataFrame to mimic the above table format.
I am just starting to understand Pandas and constantly having to teach myself and ask for help when it gets rough.
Some guidance would be greatly appreciated. I am open to providing more information and clarity on the problem if this question is insufficient. Thank you.
I am not sure if I understand your formulas correctly.
First you can simplify your formula by using the built-in sum() function:
df = pd.DataFrame({'ID': [1.1, 1.1, 1.2, 1.2, 1.2, 1.3, 1.3],
                   'Class': [1, 2, 1, 2, 3, 1, 2],
                   'AREA': [350, 200, 15, 5000, 65, 280, 70],
                   'VAR1': [24, 35, 47, 12, 26, 12, 78],
                   'VAR2': [1.5, 1.2, 1.1, 1.4, 2.3, 4.5, 0.8],
                   'VAR3': [200, 300, 400, 500, 600, 700, 800]})
df.groupby(['ID']).sum()['AREA']
This will give the list mentioned in the question:
ID
1.1 550
1.2 5080
1.3 350
Name: AREA, dtype: int64
For the AREA per Class you just have to add another key to the groupby() command:
df.groupby(['ID', 'Class']).sum()['AREA']
Resulting in:
ID Class
1.1 1 350
2 200
1.2 1 15
2 5000
3 65
1.3 1 280
2 70
Name: AREA, dtype: int64
Since you want to sum up the square of the sum over each Class, we can put both approaches together:
df.groupby(['ID', 'Class']).apply(lambda x: x['AREA'].sum()**2).groupby('ID').sum()
With the result
ID
1.1 162500
1.2 25004450
1.3 83300
dtype: int64
I recommend taking the command apart and trying to understand each step. If you need further assistance, just ask.
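For reference, here is the same chained command taken apart into intermediate steps, using the sample df defined above (purely a restatement of the one-liner):
# Sum AREA within each (ID, Class) pair
area_per_class = df.groupby(['ID', 'Class'])['AREA'].sum()
# Square each per-Class sum
squared = area_per_class ** 2
# Add the squares back up per ID
result = squared.groupby('ID').sum()
print(result)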

How do I group a pandas DataFrame using one column or another

Dear pandas DataFrame experts,
I have been using pandas DataFrames to help with re-writing the charting code in an open source project (https://openrem.org/, https://bitbucket.org/openrem/openrem).
I've been grouping and aggregating data over fields such as study_name and x_ray_system_name.
An example dataframe might contain the following data:
study_name request_name total_dlp x_ray_system_name
head head 50.0 All systems
head head 100.0 All systems
head NaN 200.0 All systems
blank NaN 75.0 All systems
blank NaN 125.0 All systems
blank head 400.0 All systems
The following line calculates the count and mean of the total_dlp data grouped by x_ray_system_name and study_name:
df.groupby(["x_ray_system_name", "study_name"]).agg({"total_dlp": ["count", "mean"]})
with the following result:
total_dlp
count mean
x_ray_system_name study_name
All systems blank 3 200.000000
head 3 116.666667
I now have a need to be able to calculate the mean of the total_dlp data grouped over entries in study_name or request_name. So in the example above, I'd like the "head" mean to include the three study_name "head" entries, and also the single request_name "head" entry.
I would like the results to look something like this:
total_dlp
count mean
x_ray_system_name name
All systems blank 3 200.000000
head 4 187.500000
Does anyone know how I can carry out a groupby based on categories in one field or another?
Any help you can offer will be very much appreciated.
Kind regards,
David
Your (groupby) data is essentially the union of:
rows where study_name == request_name, taken once
rows where study_name != request_name, duplicated: once for study_name and once for request_name
We can duplicate the data with melt:
(pd.concat([df.query('study_name==request_name') # equal part
.drop('request_name', axis=1), # remove so `melt` doesn't duplicate this data
df.query('study_name!=request_name')]) # not equal part
.melt(['x_ray_system_name','total_dlp']) # melt to duplicate
.groupby(['x_ray_system_name','value'])
['total_dlp'].mean()
)
Update: editing the above code made me realize that we could simplify it to:
# mask `request_name` with `NaN` where they equal `study_name`
# so they are ignored when duplicate/mean
(df.assign(request_name=df.request_name.mask(df.study_name==df.request_name))
.melt(['x_ray_system_name','total_dlp'])
.groupby(['x_ray_system_name','value'])
['total_dlp'].mean()
)
Output:
x_ray_system_name value
All systems blank 200.0
head 187.5
Name: total_dlp, dtype: float64
I have a similar approach to that of @QuangHoang, but with a different order of operations.
I am using the original (range) index here to choose which duplicated data to drop.
You can melt, drop_duplicates and dropna and groupby:
(df.reset_index()
.melt(id_vars=['index', 'total_dlp', 'x_ray_system_name'])
.drop_duplicates(['index', 'value'])
.dropna(subset=['value'])
.groupby(["x_ray_system_name", 'value'])
.agg({"total_dlp": ["count", "mean"]})
)
output:
total_dlp
count mean
x_ray_system_name value
All systems blank 3 200.0
head 4 187.5

Group by ids, sort by date and get values as list on big data python

I have big data (30 million rows).
Each table has id, date and value columns.
I need to go over each id and, for each id, get a list of values sorted by date, so that the first value in the list is the one for the oldest date.
Example:
ID DATE VALUE
1 02/03/2020 300
1 04/03/2020 200
2 04/03/2020 456
2 01/03/2020 300
2 05/03/2020 78
Desire table:
ID VALUE_LIST_ORDERED
1 [300,200]
2 [300,456,78]
I can do it with a for loop or with apply, but it's not efficient, and with millions of users it's not feasible.
I thought about using groupby and sorting the dates, but I don't know how to build the list, or whether groupby on a pandas df is the best way to do it.
I would love to get some suggestions on how to do it and which kind of df/technology to use.
Thank you!
What you need to do is order your data using pandas.DataFrame.sort_values and then apply the groupby method.
I don't have a huge data set to test this code on, but I believe this would do the trick:
import numpy as np

sorted_data = data.sort_values('DATE')
result = sorted_data.groupby('ID').VALUE.apply(np.array)
and since it's Python you can always put everything in one statement:
print(data.sort_values('DATE').groupby('ID').VALUE.apply(np.array))
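If you want plain Python lists rather than NumPy arrays, as in the desired table, groupby(...).agg(list) works the same way. A minimal sketch on the sample data from the question (the dayfirst parse is an assumption about the DD/MM/YYYY format; skip it if the dates are already datetimes):
import pandas as pd

data = pd.DataFrame({'ID': [1, 1, 2, 2, 2],
                     'DATE': ['02/03/2020', '04/03/2020', '04/03/2020', '01/03/2020', '05/03/2020'],
                     'VALUE': [300, 200, 456, 300, 78]})
# Parse the dates so the sort is chronological rather than lexicographic
data['DATE'] = pd.to_datetime(data['DATE'], dayfirst=True)

result = data.sort_values('DATE').groupby('ID')['VALUE'].agg(list)
print(result)
# ID
# 1        [300, 200]
# 2    [300, 456, 78]
# Name: VALUE, dtype: object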

Pandas adding rows to df in loop

I'm parsing data in a loop and once it's parsed and structured I would like to then add it to a data frame.
The end format of the data frame I would like is something like the following:
df:
id 2018-01 2018-02 2018-03
234 2 1 3
345 4 5 1
534 5 3 4
234 2 2 3
When I iterate through the data in the loop I have a dictionary with the id, the month and the value for the month, for example:
{'id':234,'2018-01':2}
{'id':534,'2018-01':5}
{'id':534,'2018-03':4}
.
.
.
What is the best way to take an empty data frame and add rows and columns with their values to it in a loop?
Essentially as I iterate it would look something like this
df:
id 2018-01
234 2
then
df:
id 2018-01
234 2
534 5
then
df:
id 2018-01 2018-03
234 2
534 5 4
and so on...
IIUC, you need to convert each single dict to a dataframe first and then append it; since the same 'id' can show up more than once, we group by the index and take the first non-null value so the rows get merged:
df = pd.DataFrame()
l = [{'id': 234, '2018-01': 2},
     {'id': 534, '2018-01': 5},
     {'id': 534, '2018-03': 4}]
for x in l:
    # DataFrame.append was removed in pandas 2.0, so pd.concat is used here instead
    df = pd.concat([df, pd.Series(x).to_frame().T.set_index('id')]).groupby(level=0).first()
    print(df)
2018-01
id
234 2
2018-01
id
234 2
534 5
2018-01 2018-03
id
234 2.0 NaN
534 5.0 4.0
It is not advisable to generate a new data frame at each iteration and append it; this is quite expensive. If your data is not too big and fits into memory, you can make a list of dictionaries first, and then pandas allows you to simply do:
df = pd.DataFrame(your_list_of_dicts)
df = df.set_index('id')
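For the OP's case, where the same id appears in several dicts, the rows still need to be merged after building the frame; a minimal sketch (the list below just mirrors the dicts from the question):
import pandas as pd

your_list_of_dicts = [{'id': 234, '2018-01': 2},
                      {'id': 534, '2018-01': 5},
                      {'id': 534, '2018-03': 4}]
# One row per dict, NaN where a month is missing, then merge rows that share an id
df = pd.DataFrame(your_list_of_dicts).groupby('id').first()
print(df)
#      2018-01  2018-03
# id
# 234      2.0      NaN
# 534      5.0      4.0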
If making a list is too expensive (because you'd like to save memory for the data frame), consider using a generator instead of a list. The basic anatomy of a generator function is this:
def datagen(your_input):
    for item in your_input:
        d = ...  # your code to make a dict from item
        yield d
The generator object data = datagen(input) will not store the dicts, but yields a dict at each iteration; it can generate items on demand. When you do pd.DataFrame(data), pandas will stream all the data and make a data frame. Generators can be used for data pipelines (like pipes in UNIX) and are very powerful for big data workflows. Be aware, however, that a generator object can be consumed only once; that is, if you run pd.DataFrame(data) again, you will get an empty data frame.
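A minimal usage sketch of that pattern (datagen, raw_records and the parsing inside the loop are placeholders standing in for the OP's own parsing code):
import pandas as pd

raw_records = [('234', '2018-01', 2), ('534', '2018-01', 5), ('534', '2018-03', 4)]

def datagen(records):
    for rec_id, month, value in records:
        # build one dict per parsed record
        yield {'id': rec_id, month: value}

data = datagen(raw_records)
df = pd.DataFrame(data)   # the generator is consumed here, exactly once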
The easiest way I've found in Pandas (although not intuitive) to iteratively append new data rows to a dataframe is using df.loc[ ] to reference the last (nonexistent) row, with len(df) as the index:
df.loc[ len(df) ] = [new, row, of, data]
This will "append" the new data row to the end of the dataframe in-place.
The above example is for an empty DataFrame with exactly 4 columns, such as:
df = pandas.DataFrame(columns=["col1", "col2", "col3", "col4"])
df.loc[ ] indexing can insert data at any row at all, whether or not it exists yet. It seems it will never give an IndexError, like a numpy.array or a list would if you tried to assign to a nonexistent row.
For a brand-new, empty DataFrame, len(df) returns 0, and thus references the first, blank row; it then increases by one each time you add a row.
–––––
I do not know the speed/memory efficiency cost of this method, but it works great for my modest datasets (a few thousand rows). At least from a memory perspective, I imagine that a large loop appending data to the target DataFrame directly would use less memory than generating an intermediate list of duplicate data first and then generating a DataFrame from that list. Time "efficiency" could be a different question entirely, one for the other SO gurus to comment on.
–––––
However, for the OP's specific case, where you also requested to combine the columns if the data is for an existing identically-named column, you'd need some logic during your for loop.
Instead I would make the DataFrame "dumb" and just import the data as-is, repeating dates as they come, e.g. your post-loop DataFrame would look like this, with simple column names describing the raw data:
df:
id date data
234 2018-01 2
534 2018-01 5
534 2018-03 4
(has two entries for the same date).
Then I would use the DataFrame's database-like functions to organize this data how you like, probably using some combination of drop_duplicates(), sort_values() and a pivot. Will look into that more later.
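A sketch of that last reorganization step, turning the "dumb" long-format frame above into the wide layout from the question (pivot_table assumes at most one value per id/date pair here, otherwise it will average the duplicates):
import pandas as pd

long_df = pd.DataFrame({'id': [234, 534, 534],
                        'date': ['2018-01', '2018-01', '2018-03'],
                        'data': [2, 5, 4]})
# One column per date, one row per id
wide = long_df.pivot_table(index='id', columns='date', values='data')
print(wide)
# date  2018-01  2018-03
# id
# 234       2.0      NaN
# 534       5.0      4.0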

Pandas.DataFrame - find the oldest date for which a value is available

I have a pandas.DataFrame object containing 2 time series. One series is much shorter than the other.
I want to determine the oldest date for which data is available in the shorter series, and remove the data in the 2 columns before that date.
What is the most pythonic way to do that?
(I apologize that I don't really follow the SO guideline for submitting questions)
Here is a fragment of my dataframe:
osr go
Date
1990-08-17 NaN 239.75
1990-08-20 NaN 251.50
1990-08-21 352.00 265.00
1990-08-22 353.25 274.25
1990-08-23 351.75 290.25
In this case, I want to get rid of all rows before 1990-08-21 (I should add that there may be NAs in one of the columns for more recent dates).
You can use idxmax on the inverted series s = df['osr'][::-1] and then take a subset of df:
print(df)
# osr go
#Date
#1990-08-17 NaN 239.75
#1990-08-20 NaN 251.50
#1990-08-21 352.00 265.00
#1990-08-22 353.25 274.25
#1990-08-23 351.75 290.25
s = df['osr'][::-1]
print(s)
#Date
#1990-08-23 351.75
#1990-08-22 353.25
#1990-08-21 352.00
#1990-08-20 NaN
#1990-08-17 NaN
#Name: osr, dtype: float64
maxnull = s.isnull().idxmax()
print(maxnull)
#1990-08-20 00:00:00
print(df[df.index > maxnull])
# osr go
#Date
#1990-08-21 352.00 265.00
#1990-08-22 353.25 274.25
#1990-08-23 351.75 290.25
EDIT: New answer based upon comments/edits
It sounds like the data is sequential and once you have lines that don't have data you want to throw them out. This can be done easily with dropna.
df = df.dropna()
This answer assumes that once you are past the bad rows, they stay good. Or maybe you don't care about dropping rows in the middle... it depends on how sequential you need the data to be. If the data needs to be sequential and your input is well formed, jezrael's answer is good.
Original answer
You haven't given much here by way of structure in your dataframe, so I am going to make assumptions. I'm going to assume you have many columns, two of which, time_series_1 and time_series_2, are the ones you referred to in your question, and that this is all stored in df.
First we can find the shorter series (the one with fewer non-NaN values) by just using
shorter_col = df['time_series_1'] if df['time_series_1'].count() < df['time_series_2'].count() else df['time_series_2']
Now we want the first date for which that column actually has data
remove_date = shorter_col.first_valid_index()
Now we want to remove the data before that date
df = df[df.index >= remove_date]
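A more concise route to the same result (a sketch, assuming the index is sorted by date and the shorter column is known, here 'osr' as in the fragment above): slice from that column's first valid index.
# Keep everything from the first date where 'osr' has data onwards
df = df.loc[df['osr'].first_valid_index():]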
