I would like to store some multidimensional/nested data in a pandas dataframe or panel so that I can return, for example:
All the times for Runner A, Race A
All the times (and names) for Race A for a certain year, say 2015
Split 2 Time for Race A 2015 for all runners
Example data would look something like this; note that not all runners will have data for all years or all races. I have a fair amount of data in the runner profile which I'd prefer not to store on every line.
In addition, I have another level of data for certain races. For Race A/2015, for example, I would like to have another level of data for split times, average paces, etc.
Could anyone suggest a good way to do this with pandas, or any other way?
Name     | Gender | Age
Runner A | Male   | 35
    Race A
        Year | Time
        2015 | 2:35:09
            Split 1 Distance | Split 1 Time | Split 1 Pace | etc...
        2014 | 2:47:34
        2013 | 2:50:12
    Race B
        Year | Time
        2013 | 1:32:07
Runner B | Male | 29
    Race A
        Year | Time
        2015 | 3:05:56
            Split 1 Distance | Split 1 Time | Split 1 Pace | etc...
Runner C | Female | 32
    Race B
        Year | Time
        1998 | 1:29:43
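One possible approach (a minimal sketch, not a definitive design): keep the race results in a long-format DataFrame with a (runner, race, year) MultiIndex, and keep the runner profiles in a separate frame keyed by name so the profile data is stored only once; split-level details could live in extra columns of the results frame. All names below come from the example data above.
import pandas as pd

# Long-format results: one row per (runner, race, year); split columns
# can be added alongside "time" for races that have them.
results = pd.DataFrame({
    "runner": ["Runner A", "Runner A", "Runner A", "Runner A", "Runner B", "Runner C"],
    "race":   ["Race A", "Race A", "Race A", "Race B", "Race A", "Race B"],
    "year":   [2015, 2014, 2013, 2013, 2015, 1998],
    "time":   ["2:35:09", "2:47:34", "2:50:12", "1:32:07", "3:05:56", "1:29:43"],
}).set_index(["runner", "race", "year"]).sort_index()

# Runner profiles stored once, keyed by name.
profiles = pd.DataFrame({
    "runner": ["Runner A", "Runner B", "Runner C"],
    "gender": ["Male", "Male", "Female"],
    "age":    [35, 29, 32],
}).set_index("runner")

print(results.loc[("Runner A", "Race A")])                   # all times for Runner A, Race A
print(results.xs(("Race A", 2015), level=("race", "year")))  # all times (and names) for Race A, 2015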
Related
I have a model that produces output in csv. The columns are as follows (just a fictitious example):
| Car | Price | Year |
The Car column contains different car manufacturers, with an average car price for each year in the Price column.
Example
| Car | Price | Year |
| BMW | 34000 | 1990 |
| BMW | 35000 | 1991 |
| BMW | 37000 | 1993 |
| AUDI | 32000 | 1991 |
| AUDI | 33500 | 1992 |
| AUDI | 34000 | 1993 |
| AUDI | 35500 | 1994 |
| SEAT | 25600 | 1994 |
...
I would like to be able to plot:
An area chart with all the prices for each car manufacturer in the years that prices are available, within a 20-year period (for example 1990-2010).
Some years there is no price available for some of the car manufacturers, so not every manufacturer has 20 rows of data in the csv; the output just skips the whole year and row. See BMW in the example, which lacks 1992.
Since I run the model with different inputs, the actual names of the "Cars" change (and so do the prices), so I need the code to pick up a given car name and then plot the available values for each run.
This is just a simplified example, but the layout of the actual data is the same. I would much appreciate some help on this one!
Try this; I think it might work. (I am not a pro, just a beginner.)
import pandas as pd
import matplotlib.pyplot as plt

med_path = "path for csv file"
med = pd.read_csv(med_path)

# Pivot so each car manufacturer becomes a column indexed by Year;
# years with no price for a manufacturer appear as NaN.
area = med.pivot(index="Year", columns="Car", values="Price")

# A stacked area plot cannot handle NaN, so fill the gaps first.
area = area.fillna(0)

fig, ax = plt.subplots(dpi=120)
area.plot(kind="area", ax=ax)
plt.title("Graph for Area plot")
plt.show()
Hardcoding the column names would not be ideal, so I pivot the csv instead; that way the code picks up whatever car names appear in each run.
I have a dataframe which looks like:
| User    | Text  | Effort  |
|---------|-------|---------|
| user122 | TextA | 2 Weeks |
| user124 | TextB | 2 Weeks |
| user125 | TextC | 3 Weeks |
| user126 | TextD | 2 Weeks |
| user126 | TextE | 2 Weeks |
My goal is to group the table by Effort and get the number of unique users per group. I am able to do that with:
df.groupby(['Effort']).agg({"User": pd.Series.nunique})
And this results in the following table:
| Effort  | User |
|---------|------|
| 2 Weeks | 3    |
| 3 Weeks | 1    |
However, by doing so I am losing my Text column information. Another solution I tried was to keep the first occurrence of that column, but I am still unhappy because I lose something along the way.
Question
Is there any way I can keep my initial dataframe, without losing any rows or columns, but at the same time still group by Effort?
The best option you have is using transform if you ask me. This way you keep the shape of your original data, but still get the results of a groupby.
df['Nunique'] = df.groupby('Effort')['User'].transform('nunique')
User Text Effort Nunique
0 user122 TextA 2 Weeks 3
1 user124 TextB 2 Weeks 3
2 user125 TextC 3 Weeks 1
3 user126 TextD 2 Weeks 3
4 user126 TextE 2 Weeks 3
I have a dataframe of the following structure (the time columns are actual time, in case it matters):
group | attr1 | attr2 | time1 | time2
--------------------------------------------
1 | 1 | 7 | 1 | 2
1 | 4 | 4 | 4 | 7
1 | 3 | 3 | 6 | 9
2 | 2 | 2 | 2 | 5
2 | 2 | 5 | 3 | 6
2 | 1 | 6 | 4 | 7
2 | 4 | 2 | 5 | 8
3 | 6 | 7 | 6 | 10
What I would like to do is the following:
Group by group
For every group data frame:
2.1. Apply expanding window on the whole dataframe (all columns)
2.2. For every 'expanding' dataframe
2.2.1. Filter the 'expanding' dataframe using time1 & time2 columns (e.g. `df[df[time1]<df[time2]]`)
2.2.2. Perform various aggregations (ideally using `.agg` with `dict` argument, as there are many different aggregations for many columns)
The output has basically the same number of rows as the input
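For concreteness, a naive (and deliberately slow) sketch of the procedure above, using the sample frame's column names and a toy agg_dict, just to pin down the desired behaviour:
import pandas as pd

df = pd.DataFrame({
    'group': [1, 1, 1, 2, 2, 2, 2, 3],
    'attr1': [1, 4, 3, 2, 2, 1, 4, 6],
    'attr2': [7, 4, 3, 2, 5, 6, 2, 7],
    'time1': [1, 4, 6, 2, 3, 4, 5, 6],
    'time2': [2, 7, 9, 5, 6, 7, 8, 10],
})

agg_dict = {'attr1': ['mean', 'max'], 'attr2': ['sum']}  # toy example

def expanding_filtered_agg(group_df, agg_dict):
    rows = []
    for end in range(1, len(group_df) + 1):
        window = group_df.iloc[:end]                        # expanding window
        window = window[window['time1'] < window['time2']]  # time filter
        rows.append({(col, fn): window[col].agg(fn)
                     for col, fns in agg_dict.items() for fn in fns})
    return pd.DataFrame(rows, index=group_df.index)

result = df.groupby('group', group_keys=False).apply(expanding_filtered_agg, agg_dict=agg_dict)
print(result)
This gives exactly one output row per input row; the question is how to get this behaviour without the per-row Python overhead.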
My problems are:
I don't see a way to specify 'expanding grouping column'. If that was possible, then I could do something like:
def func(group_df, agg_dict):
    group_df_filtered = ...  # filter the dataframe on the time columns
    return group_df_filtered.agg(agg_dict)

df.groupby(['group', expanding(1)]).apply(func, agg_dict=agg_dict)
I don't see a way to perform an expanding operation on the whole dataframe. If that were possible, I could do:
def func(group_df, agg_dict):
    for col, funcs in agg_dict.items():
        # bind f as a default argument so each lambda keeps its own function
        agg_dict[col] = [lambda df, f=f: f(df[df['time1'] < df['time2']]) for f in funcs]
    return group_df.expanding(1).agg(agg_dict)

df.groupby('group').apply(func, agg_dict=agg_dict)
I found a workaround that works similarly to the second approach, except that I pass whole columns to the function and do the subsetting and filtering inside it (since I have the whole column instead of just the expanding part), but it's terribly slow, mostly because I'm wrapping functions together and have a lot of custom code.
Is there a nice and, most importantly, fast way to achieve this? I suspect it needs as little pure Python code as possible to run reasonably fast. That is one reason I use agg with a dict instead of, e.g., an apply row by row, which would kill performance; the other reason is that I have multiple functions for different columns, so implementing them by hand every time would be far too verbose.
I am matching two large data-sets and trying to perform update, remove, and create operations on the original data-set by comparing it with the other data-set. How can I update 2 or 3 columns out of 10 in the original data-set while keeping the other columns' values the same as before?
I tried merge, but to no avail.
Original data:
id | full_name | date
1 | John | 02-23-2006
2 | Paul Elbert | 09-29-2001
3 | Donag | 11-12-2013
4 | Tom Holland | 06-17-2016
other data:
id | full_name | date
1 | John | 02-25-2018
2 | Paul | 03-09-2001
3 | Donag | 07-09-2017
4 | Tom | 05-09-2016
Is it possible to update date column of original data on the basis of ID?
Answering your question:
"When IDs match, update all values in the date column without changing any value in the name column of the original data set"
import pandas as pd

original = pd.DataFrame({'id': ['1', '2'], 'full_name': ['John', 'Paul Elbert'],
                         'date': ['02-23-2006', '09-29-2001']})
other = pd.DataFrame({'id': ['1', '2'], 'full_name': ['John', 'Paul'],
                      'date': ['02-25-2018', '03-09-2001']})
# keep full_name from the original frame, take date from the other
original = original[['id', 'full_name']].merge(other[['id', 'date']], on='id')
print(original)
id full_name date
0 1 John 02-25-2018
1 2 Paul Elbert 03-09-2001
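If the real frames have many more columns, an alternative (assuming the id values are unique) is to align both frames on id and overwrite only the columns you want with DataFrame.update, leaving everything else untouched. Starting again from the two frames defined above:
# align both frames on id, then overwrite only the date column in place
updated = original.set_index('id')
updated.update(other.set_index('id')[['date']])
updated = updated.reset_index()
print(updated)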
I have a problem with one of my projects at school: I am attempting to change the order of my data.
[picture: a sample of the data as it is currently arranged]
This is the format I am attempting to reach:
Company name | activity description | year | variable 1 | variable 2 | ......
company 1    |                      | 2011 |            |            |
company 1    |                      | 2012 |            |            |
.....   (one row for every year, from 2011 to 2015 inclusive)
company 2    |                      | 2011 |            |            |
company 2    |                      | 2012 |            |            |
.....   (one row for every year, from 2011 to 2015 inclusive)
for every single one of the 10 companies. This is a sample of my whole data-set, which contains more than 15,000 companies. I attempted creating a dataframe of the size I want, but I have problems filling it with the data I want in the format I want. I am fairly new to Python. Could anyone help me, please?
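Without the original picture, the wide layout can only be guessed; assuming one row per company with one column per variable/year combination (all column names below are hypothetical), pd.wide_to_long would do the reshaping:
import pandas as pd

# Hypothetical wide layout: one row per company, one column per
# variable/year combination (the real column names may differ).
wide = pd.DataFrame({
    'Company name': ['company 1', 'company 2'],
    'activity description': ['desc 1', 'desc 2'],
    'variable 1_2011': [10, 20], 'variable 1_2012': [11, 21],
    'variable 2_2011': [5, 6],   'variable 2_2012': [7, 8],
})

# wide_to_long splits "variable 1_2011" into a stub ("variable 1") and a
# numeric suffix (the year), yielding one row per company/year.
long_df = pd.wide_to_long(wide,
                          stubnames=['variable 1', 'variable 2'],
                          i=['Company name', 'activity description'],
                          j='year', sep='_')
print(long_df.reset_index())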