Grouper and axis must be same length in Python

I am a beginner in Python, and I am studying a textbook to learn the Pandas module.
I have a dataframe called Berri_bike, created by the following code:
bike_df = pd.read_csv(os.path.join(path, 'comptagevelo2012.csv'), parse_dates=['Date'],
                      encoding='latin1', dayfirst=True, index_col='Date')
Berri_bike = bike_df['Berri1'].copy()  # get only the column 'Berri1'
Berri_bike['Weekday'] = Berri_bike.index.weekday
weekday_counts = Berri_bike.groupby('Weekday').aggregate(sum)
weekday_counts
I have 3 columns in my Berri_bike: a date index from 1/1/2012 to 12/31/2012, a value column with a number for each date, and a Weekday column I assigned to it. But when I want to group by the values, I get the error: ValueError: Grouper and axis must be same length. I am not sure what this means. What I want to do is very simple, like in SQL: sum(value) grouped by weekday. Can anyone please let me know what happened here?

You copied your column into a pandas Series instead of a new DataFrame, hence the following operations behave differently. You can see this if you print Berri_bike: it doesn't show a column name.
Instead, you should copy the column into a new DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 30, size=(70, 2)),
                  columns=["A", "B"],
                  index=pd.date_range("20180101", periods=70))
Berri_bike = df[["A"]]  # double brackets keep a DataFrame rather than a Series
Berri_bike['Weekday'] = Berri_bike.index.weekday
weekday_counts = Berri_bike.groupby("Weekday").sum()
print(weekday_counts)
# sample output
           A
Weekday
0        148
1        101
2        127
3        139
4        163
5         74
6        135
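Alternatively, a minimal sketch that avoids the copy issue altogether, assuming the bike_df from the question is available, is to group the 'Berri1' Series directly by the weekday of its DatetimeIndex:

# group the Series by an aligned array of weekday numbers (0 = Monday)
weekday_counts = bike_df['Berri1'].groupby(bike_df.index.weekday).sum()
print(weekday_counts)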

Related

Using prior column to create new column in DataFrame creation

I know how to create a new column based on another column in Pandas. What I'm trying to do is create a new column based on another column at the time of DataFrame creation. Here is the code I have now:
import numpy as np
import pandas as pd

rng = np.random.default_rng()
number_of_trials = float('10E+06')
simulations = pd.DataFrame({'true_average': rng.beta(81, 219, size=int(number_of_trials))})
simulations = simulations.assign(hits=lambda x: rng.binomial(300, x.true_average, size=int(number_of_trials)))
Instead of using two lines to create the true_average and hits columns, I would like to do it in the DataFrame instantiation itself, if possible. Everything I've found only shows how to do it in two steps, which is fine, but I know this is possible in R, so I wondered whether Pandas has the same functionality.
I've tried creating the column with a lambda function that accesses the true_average column, but that just stores the function itself as the value in the DataFrame.
I think you can just use the logic you use to create the original column (true_average) as the second parameter in rng.binomial:
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)
number_of_trials = float('10E+06')
simulations = pd.DataFrame({'true_average': rng.beta(81, 219, size=int(number_of_trials)),
                            'hits': rng.binomial(300, rng.beta(81, 219, size=int(number_of_trials)), size=int(number_of_trials))})
print(simulations)
print(simulations)
Yields:
         true_average  hits
0            0.248803    65
1            0.253768    99
2            0.242576    67
3            0.277595    78
4            0.335829    80
...               ...   ...
9999995      0.267265    66
9999996      0.308596   100
9999997      0.279287    88
9999998      0.247802    79
9999999      0.269566    67

[10000000 rows x 2 columns]
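Note that rng.beta is called twice here, so hits is drawn against a fresh beta sample rather than the stored true_average values. If both columns need to share a single draw, a possible one-statement sketch (assuming Python 3.8+ for the assignment expression; true_avg is just an illustrative name) is:

import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)
n = int(10e6)
# the assignment expression keeps a single beta draw shared by both columns
simulations = pd.DataFrame({'true_average': (true_avg := rng.beta(81, 219, size=n)),
                            'hits': rng.binomial(300, true_avg, size=n)})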

Index returned by np.argmax of a series within a dataframe slice points to wrong value when used as index into same dataframe

I have a dataframe created from collected sampled data. I then manipulate the dataframe to remove duplicates, sort, and remove saturated values:
df = pd.read_csv(path + newfilename, header=0, usecols=[0, 1, 2, 3, 5, 7, 10],
                 names=['ch1_real', 'ch1_imag', 'ch2_real', 'ch2_imag', 'ch1_log_mag', 'ch1_phase',
                        'ch2_log_mag', 'ch2_phase', 'pr_sample_real', 'pr_sample_imag', 'distance'])
tmp = df.drop_duplicates(subset='distance', keep='first').copy()
tmp.sort_values("distance", inplace=True)
dfUnique = tmp[tmp.distance < 65000].copy()
I also add two calculated values (with help from @Stef):
dfUnique['ch1_log_mag'] = 20 * np.log10((dfUnique.ch1_real + 1j * dfUnique.ch1_imag).abs())
dfUnique['ch2_log_mag'] = 20 * np.log10((dfUnique.ch2_real + 1j * dfUnique.ch2_imag).abs())
The problem arises when I try to find the index of the maximum magnitude. It turns out (unexpectedly to me) that dataframes keep their original indices. So, after sorting and removing rows, the index of a given row is not its position in the new ordered dataframe, but its row index within the original dataframe:
     ch1_real  ch1_imag  ch2_real  ...  distance  ch1_log_mag  ch2_log_mag
79   0.011960 -0.003418  0.005127  ...       0.0   -38.104414   -33.896518
78  -0.009766 -0.005371 -0.015870  ...       1.0   -39.058001   -34.533870
343  0.002197  0.010990  0.003662  ...       2.0   -39.009865   -37.278737
80  -0.002686  0.010740  0.011960  ...       3.0   -39.116435   -34.902513
341 -0.007080  0.009033  0.016600  ...       4.0   -38.803434   -35.582833
81  -0.004883 -0.008545 -0.016850  ...      12.0   -40.138523   -35.410047
83  -0.009277  0.004883 -0.000977  ...      14.0   -39.589769   -34.848170
84   0.006592 -0.010250 -0.009521  ...      27.0   -38.282239   -33.891250
85   0.004395  0.010010  0.017580  ...      41.0   -39.225735   -34.890353
86  -0.007812 -0.005127 -0.015380  ...      53.0   -40.589187   -35.625615
When I then use:
np.argmax(dfUnique.ch1_log_mag)
to find the index of maximum magnitude, this returns the index in the new ordered dataframe series. But, when I use this to index into the dataframe to extract other values in that row, I get elements from the original dataframe at that row index.
I exported the dataframe to Excel to observe more easily what was happening. Column 1 is the dataframe index; notice that it is different from the row number in the spreadsheet.
The np.argmax command above returns 161. If I look at the new ordered dataframe, index 161 is the row highlighted in the spreadsheet (data starts on row two there, and indices start at 0 in Python), and it is correct. However, in the original dataframe's order, this row was at index 238. When I then try to access ch1_log_mag[161],
dfUnique.ch1_log_mag[161]
I get -30.9759 instead of -11.453. It grabbed the value using 161 as the index into the original dataframe.
This is pretty scary: two functions use two different reference frames (at least to a novice Python user). How do I avoid this? (How) do I reindex the dataframe? Or should I be using an equivalent pandas way of finding the maximum in a series within a dataframe (assuming the issue is due to how pandas and numpy operate on data)? Is the issue the way I'm creating copies of the dataframe?
If you sort a dataframe, it preserves indices.
import numpy as np
import pandas as pd

a = pd.DataFrame(np.random.randn(24).reshape(6, 4), columns=list('abcd'))
a.sort_values(by='d', inplace=True)
print(a)
>>>
          a         b         c         d
2 -0.553612  1.407712 -0.454262 -1.822359
0 -1.046893  0.656053  1.036462 -0.994408
5 -0.772923 -0.554434 -0.254187 -0.948573
4 -1.660773  0.291029  1.785757 -0.457495
3  0.128831  1.399746  0.083545 -0.101106
1 -0.250536 -0.045355  0.072153  1.871799
In order to reset index, you can use .reset_index(drop=True):
b = a.sort_values(by='d').reset_index(drop=True)
print(b)
>>>
          a         b         c         d
0 -0.553612  1.407712 -0.454262 -1.822359
1 -1.046893  0.656053  1.036462 -0.994408
2 -0.772923 -0.554434 -0.254187 -0.948573
3 -1.660773  0.291029  1.785757 -0.457495
4  0.128831  1.399746  0.083545 -0.101106
5 -0.250536 -0.045355  0.072153  1.871799
To find the original index of the max value, you can use .idxmax() and then .loc[]:
ix_max = a.d.idxmax()
# or ix_max = np.argmax(a.d)
print(f"ix_max = {ix_max}")
a.loc[ix_max]
>>>
ix_max = 1
a   -0.250536
b   -0.045355
c    0.072153
d    1.871799
Name: 1, dtype: float64
Or, if you are working with positions in the new order, you can use .iloc:
iix = np.argmax(a.d.values)
print(f"iix = {iix}")
print(a.iloc[iix])
>>>
iix = 5
a   -0.250536
b   -0.045355
c    0.072153
d    1.871799
Name: 1, dtype: float64
You can have a look at https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
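Applied to the question's dataframe (a sketch assuming the dfUnique and ch1_log_mag names from the post), the two styles would look like:

# label-based: idxmax returns the index label of the maximum, .loc retrieves by label
peak_row = dfUnique.loc[dfUnique['ch1_log_mag'].idxmax()]

# position-based: argmax returns an integer position, .iloc retrieves by position
peak_row = dfUnique.iloc[dfUnique['ch1_log_mag'].to_numpy().argmax()]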

Compare two rows in a data frame after groupby and perform operations

I have two different csv files. I have merged them into a single data frame and grouped it by the 'class_name' column. The groupby works as intended, but I don't know how to perform an operation that compares the groups against one another. From r1.csv the class algebra has gone down by 5 students, so I want -5; calculus has increased by 5, so it should be +5. This has to be added as a new column in a separate data frame. The same goes for the date arithmetic.
This is what I have tried so far:
import pandas as pd

report_1_df = pd.read_csv('r1.csv')
report_2_df = pd.read_csv('r2.csv')

for group, elements in pd.concat([report_1_df, report_2_df], axis=0, sort=False).groupby('class_name'):
    print(elements)
I can see that my groupby works. I tried .sum() and .diff(), but neither seems to do what I want. What can I do here? Thanks.
r1.csv
class_name,student_count,start_time,end_time
algebra,15,"2019,Dec,08","2019,Dec,09"
calculus,10,"2019,Dec,08","2019,Dec,09"
statistics,12,"2019,Dec,08","2019,Dec,09"
r2.csv
class_name,student_count,start_time,end_time
calculus,15,"2019,Dec,09","2019,Dec,10"
algebra,10,"2019,Dec,09","2019,Dec,10"
trigonometry,12,"2019,Dec,09","2019,Dec,10"
Needed
class_name,student_count,student_count_change,start_time,start_time_delay,end_time,end_time_delay
algebra,10,-5,"2019,Dec,09",1,"2019,Dec,10",1
calculus,15,5,"2019,Dec,09",1,"2019,Dec,10",1
statistics,12,-12,"2019,Dec,08",0,"2019,Dec,09",0
trigonometry,12,12,"2019,Dec,09",0,"2019,Dec,10",0
Not sure if there's a more direct way, but you can start by appending the missing classes to both of your dfs:
import numpy as np
import pandas as pd

classes = (df1["class_name"].append(df2["class_name"])).unique()

def fill_data(df):
    for i in np.setdiff1d(classes, df["class_name"].values):
        df.loc[df.shape[0]] = [i, 0, *df.iloc[0, 2:].values]
    return df

df1 = fill_data(df1)
df2 = fill_data(df2)
With the missing classes filled, now you can use groupby to assign a new column for the difference and lastly drop_duplicates:
df = pd.concat([df1, df2], axis=0).reset_index(drop=True)
df["diff"] = df.groupby("class_name")["student_count"].diff().fillna(df["student_count"])
print(df.drop_duplicates("class_name", keep="last"))
     class_name  student_count   start_time     end_time  diff
4      calculus             15  2019,Dec,09  2019,Dec,10   5.0
5       algebra             10  2019,Dec,09  2019,Dec,10  -5.0
6  trigonometry             12  2019,Dec,09  2019,Dec,10  12.0
7    statistics              0  2019,Dec,09  2019,Dec,10 -12.0
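The same per-group diff pattern could be extended to the date columns. A rough sketch, assuming the '2019,Dec,08' strings parse with the format '%Y,%b,%d' and that a delay of 0 is acceptable for classes present in only one report:

# parse the date strings so per-class differences can be taken in whole days
for col, new in [("start_time", "start_time_delay"), ("end_time", "end_time_delay")]:
    parsed = pd.to_datetime(df[col], format="%Y,%b,%d")
    df[new] = parsed.groupby(df["class_name"]).diff().dt.days.fillna(0).astype(int)

print(df.drop_duplicates("class_name", keep="last"))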

Pandas is printing more rows than expected

I am currently working on a database and I am trying to sort my rows with pandas. I have a column called 'sessionkey' which refers to a session, so each row can be assigned to a session. I tried to separate the data into these sessions.
Furthermore, there can be duplicated rows. I tried to drop those with the drop_duplicates function from pandas.
import numpy as np
import pandas as pd

df = pd.read_csv(path_of_data + 'part-00000-9d3e32a7-87f8-4218-bed1-e30855ce6f0c-c000.csv',
                 keep_default_na=False, engine='python')
tmp = df['sessionkey'].values  # I want to split the data into different sessions
tmp = np.unique(tmp)
df.set_index('sessionkey', inplace=True)
watching = df.loc[tmp[10]].drop_duplicates(keep='first')  # here I pick one example
print(watching.sort_values(by=['eventTimestamp', 'eventClickSequenz']))
print(watching.info())
I would have thought that this works fine, but when I checked my results by printing out the split dataframe, the output looked very odd to me. For example, the dataframe reports its size as 38 rows x 4 columns, but when I print that same dataframe there are clearly more than 38 rows, and there are still duplicates in it.
I already tried to split the data by using unique indices:
comparison = pd.DataFrame()
for index, item in enumerate(df['sessionkey'].values):
    if item == tmp:
        comparison = comparison.append(df.iloc[index])
comparison.drop_duplicates(keep='first', inplace=True)
print(comparison.sort_values(by=['eventTimestamp']))
But the problem is still the same.
The output also seems to follow a pattern. Let's say we have 38 entries. Then pandas returns entries 1-37 and then appends entries 2-38, so the last one is left out, and then the whole list is shifted and printed again.
When I return the numpy values, there are just 38 different rows. So is this a problem with pandas' print function? Is there an error in my code? Does pandas have a problem with non-unique indexes?
EDIT:
Okay, I figured out what the problem is. I wanted to look at a long dataframe, so I used:
pd.set_option('display.max_rows', -1)
Now we can use some sample data:
import numpy as np
import pandas as pd

data = np.array([[119, 0], [119, 1], [119, 2]])
columns = ['sessionkey', 'event']
df = pd.DataFrame(data, columns=columns)
print(df)
When printed, it now looks like this:
   sessionkey  event
0         119      0
1         119      1
1         119      1
2         119      2
Although I expected it to look like this:
   sessionkey  event
0         119      0
1         119      1
2         119      2
I thought my dataframe had the wrong shape, but this is not the case.
So the entry in the middle gets printed twice. Is this a bug or the intended output?
drop_duplicates() doesn't look at the index when getting rid of rows; instead it looks at the whole row. But it does have a useful subset kwarg which allows you to specify which columns to use.
You can try the following
df = pd.read_csv((path_of_data+'part-00000-9d3e32a7-87f8-4218-bed1-e30855ce6f0c-c000.csv'), keep_default_na=False, engine='python')
print(df.shape)
print(df["session"].nunique()) # number of unique sessions
df_unique = df.drop_duplicates(subset=["session"],keep='first')
# these two numbers should be the same
print(df_unique.shape)
print(df_unique["session"].nunique())
It sounds like you want to drop_duplicates based on the index; by default, drop_duplicates drops based on the column values. To do that, try
df.loc[~df.index.duplicated()]
This selects only the rows whose index values are not duplicated.
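A tiny self-contained demo of the difference, using made-up data rather than the data from the question:

import pandas as pd

df = pd.DataFrame({'value': [1, 2, 3]}, index=[119, 119, 120])

# compares column values, so nothing is dropped here
print(df.drop_duplicates(keep='first'))

# compares index labels, so the second row labelled 119 is dropped
print(df.loc[~df.index.duplicated()])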
I used your sample code.
data = np.array([[119, 0], [119, 1], [119, 2]])
columns = ['sessionkey', 'event']
df = pd.DataFrame(data, columns = columns)
print(df)
And I got your expected outcome.
   sessionkey  event
0         119      0
1         119      1
2         119      2
After I set the max_rows option, as you did:
pd.set_option('display.max_rows', -1)
I got the incorrect outcome.
   sessionkey  event
0         119      0
1         119      1
1         119      1
2         119      2
The problem might be the -1 setting. The docs state that None will set the maximum number of rows to unlimited; I am unsure what -1 does in a parameter that accepts positive integers or None.
Try
pd.set_option('display.max_rows', None)

Creating a New Pandas Grouped Object

In some transformations, I seem to be forced to break out of the Pandas grouped-dataframe object, and I would like a way to return to it.
Given a dataframe of time series data, if one groups by one of the values in the dataframe, we effectively get a mapping from each key to a dataframe.
If I am forced to make a Python dict from this, the structure cannot be converted back into a DataFrame using .from_dict(), because the structure maps keys to dataframes.
The only way to go back to Pandas without some hacky column renaming is, to my knowledge, to convert it back into a grouped object.
Is there any way to do this?
If not, how would I convert a dictionary mapping keys to dataframes back into a Pandas data structure?
EDIT ADDING SAMPLE:
import pandas as pd
from numpy.random import randn

rng = pd.date_range('1/1/2000', periods=10, freq='10m')
df = pd.DataFrame({'a': pd.Series(randn(len(rng)), index=rng),
                   'b': pd.Series(randn(len(rng)), index=rng)})
# now we have a dataframe with 'a's and 'b's in a time series

df_dict = {}
for k, v in df.groupby('a'):
    df_dict[k] = v
# now we apply some transformation that cannot be applied via aggregate, transform, or apply
# how do we get this back into a grouped-by object?
If I understand the OP's question correctly, you want to group a dataframe by some key(s), do different operations on each group (possibly generating new columns, etc.), and then go back to the original dataframe.
Modifying your example (grouping by random integers instead of floats, which are usually unique):
import numpy as np
import pandas as pd

np.random.seed(200)
rng = pd.date_range('1/1/2000', periods=10, freq='10m')
df = pd.DataFrame({'a': pd.Series(np.random.randn(len(rng)), index=rng),
                   'b': pd.Series(np.random.randn(len(rng)), index=rng)})
df['group'] = np.random.randint(3, size=(len(df)))
Usually, if I need a single value for each column per group, I'll do this (for example, sum of 'a', mean of 'b'):
In [10]: df.groupby('group').aggregate({'a': np.sum, 'b': np.mean})
Out[10]:
              a         b
group
0     -0.214635 -0.319007
1      0.711879  0.213481
2      1.111395  1.042313

[3 rows x 2 columns]
However, if I need a series for each group:
In [19]: def func(sub_df):
    ...:     sub_df['c'] = sub_df['a'] * sub_df['b'].shift(1)
    ...:     return sub_df
    ...:
In [20]: df.groupby('group').apply(func)
Out[20]:
                   a         b  group         c
2000-01-31 -1.450948  0.073249      0       NaN
2000-11-30  1.910953  1.303286      2       NaN
2001-09-30  0.711879  0.213481      1       NaN
2002-07-31 -0.247738  1.017349      2 -0.322874
2003-05-31  0.361466  1.911712      2  0.367737
2004-03-31 -0.032950 -0.529672      0 -0.002414
2005-01-31 -0.221347  1.842135      2 -0.423151
2005-11-30  0.477257 -1.057235      0 -0.252789
2006-09-30 -0.691939 -0.862916      2 -1.274646
2007-07-31  0.792006  0.237631      0 -0.837336

[10 rows x 4 columns]
I'm guessing you want something like the second example, but the original question wasn't very clear even with your sample.
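If you really do end up holding a plain dict mapping keys to dataframes (as in the question's df_dict), one possible way back, sketched here under the assumption that every value is a dataframe with the same columns and a reasonably recent pandas, is pd.concat, which stacks the dict into a single dataframe with the keys as the outer index level; from there you can drop that level or group on it again:

# rebuild one dataframe from the dict; the keys become the outer level of a MultiIndex
rebuilt = pd.concat(df_dict)

# recover the original row index, or return to a grouped object
rebuilt_flat = rebuilt.droplevel(0)
regrouped = rebuilt.groupby(level=0)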
