I know how to create a new column based on another column in Pandas. What I'm trying to do is create a new column based on another column at the time of DataFrame creation. Here is the code I have now:
rng = np.random.default_rng()
number_of_trials = float('10E+06')
simulations = pd.DataFrame({'true_average': rng.beta(81, 219, size=int(number_of_trials))})
simulations = simulations.assign(hits=lambda x: rng.binomial(300, x.true_average, size=int(number_of_trials)))
Instead of taking two lines to create the true_average and hits columns, I would like to do it in the DataFrame instantiation itself, if possible. Everything I've searched for just tells me how to do it in two steps, which is fine, but I know this is possible in R, so I wondered whether pandas has the same functionality.
I've tried creating a column with a lambda function that accesses the true_average column, but it just stores the function itself as the value in the DataFrame.
I think you can just use the logic you use to create the original column (true_average) as the second parameter in rng.binomial:
rng = np.random.default_rng(seed=42)
number_of_trials = float('10E+06')
simulations = pd.DataFrame({'true_average': rng.beta(81, 219, size=int(number_of_trials)),
                            'hits': rng.binomial(300, rng.beta(81, 219, size=int(number_of_trials)), size=int(number_of_trials))})
print(simulations)
Yields:
true_average hits
0 0.248803 65
1 0.253768 99
2 0.242576 67
3 0.277595 78
4 0.335829 80
... ... ...
9999995 0.267265 66
9999996 0.308596 100
9999997 0.279287 88
9999998 0.247802 79
9999999 0.269566 67
[10000000 rows x 2 columns]
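Note that the two rng.beta calls above produce independent sets of draws, so hits is not simulated from the values stored in true_average. If the two columns need to share the same draws within a single pd.DataFrame(...) call, one option (a minimal sketch, assuming Python 3.8+ for the walrus operator) is:
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)
n = int(10e6)

# Bind the beta draws to a name inside the dict literal and reuse them,
# so each row's hits is generated from that row's true_average.
simulations = pd.DataFrame({
    'true_average': (p := rng.beta(81, 219, size=n)),
    'hits': rng.binomial(300, p, size=n),
})
This relies on dict values being evaluated left to right, so p is already bound when the hits expression runs.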
I have a dataframe created from collected sampled data. I then manipulate the dataframe to remove duplicates, sort, and remove saturated values:
df = pd.read_csv(path + newfilename, header=0, usecols=[0,1,2,3,5,7,10],
                 names=['ch1_real', 'ch1_imag', 'ch2_real', 'ch2_imag', 'ch1_log_mag', 'ch1_phase',
                        'ch2_log_mag', 'ch2_phase', 'pr_sample_real', 'pr_sample_imag', 'distance'])
tmp = df.drop_duplicates(subset='distance', keep='first').copy()
tmp.sort_values("distance", inplace=True)
dfUnique = tmp[tmp.distance < 65000].copy()
I also add two calculated values (with help from @Stef):
dfUnique['ch1_log_mag'] = 20*np.log10((dfUnique.ch1_real + 1j*dfUnique.ch1_imag).abs())
dfUnique['ch2_log_mag'] = 20*np.log10((dfUnique.ch2_real + 1j*dfUnique.ch2_imag).abs())
The problem arises when I try to find the index of the maximum magnitude. It turns out (unexpectedly to me) that dataframes keep their original indices. So, after sorting and removing rows, the index of a given row is not its position in the new, ordered dataframe, but its row index within the original dataframe:
ch1_real ch1_imag ch2_real ... distance ch1_log_mag ch2_log_mag
79 0.011960 -0.003418 0.005127 ... 0.0 -38.104414 -33.896518
78 -0.009766 -0.005371 -0.015870 ... 1.0 -39.058001 -34.533870
343 0.002197 0.010990 0.003662 ... 2.0 -39.009865 -37.278737
80 -0.002686 0.010740 0.011960 ... 3.0 -39.116435 -34.902513
341 -0.007080 0.009033 0.016600 ... 4.0 -38.803434 -35.582833
81 -0.004883 -0.008545 -0.016850 ... 12.0 -40.138523 -35.410047
83 -0.009277 0.004883 -0.000977 ... 14.0 -39.589769 -34.848170
84 0.006592 -0.010250 -0.009521 ... 27.0 -38.282239 -33.891250
85 0.004395 0.010010 0.017580 ... 41.0 -39.225735 -34.890353
86 -0.007812 -0.005127 -0.015380 ... 53.0 -40.589187 -35.625615
When I then use:
np.argmax(dfUnique.ch1_log_mag)
to find the index of the maximum magnitude, it returns the position within the new, ordered series. But when I use that number to index into the dataframe to extract other values from that row, I get elements from the original dataframe at that row index.
I exported the dataframe to Excel to more easily observe what was happening. Column 1 is the dataframe index; notice that it is different from the row number in the spreadsheet.
The np.argmax command above returns 161. If I look at the new, ordered dataframe, position 161 is indeed the correct row (data starts on row two in the spreadsheet, and indices start at 0 in Python). However, per the original dataframe's order, that row was at index 238. When I then try to access ch1_log_mag[161],
dfUnique.ch1_log_mag[161]
I get -30.9759 instead of -11.453. It grabbed the value using 161 as a label into the original dataframe's index.
This is pretty scary (at least to a novice Python user): two functions use two different reference frames. How do I avoid this? Do I reindex the dataframe, and how? Or should I be using an equivalent pandas way of finding the maximum of a series within a dataframe (assuming the issue is due to how pandas and numpy operate on data)? Is the issue the way I'm creating copies of the dataframe?
If you sort a dataframe, it preserves the original index labels.
import numpy as np
import pandas as pd

a = pd.DataFrame(np.random.randn(24).reshape(6, 4), columns=list('abcd'))
a.sort_values(by='d', inplace=True)
print(a)
>>>
a b c d
2 -0.553612 1.407712 -0.454262 -1.822359
0 -1.046893 0.656053 1.036462 -0.994408
5 -0.772923 -0.554434 -0.254187 -0.948573
4 -1.660773 0.291029 1.785757 -0.457495
3 0.128831 1.399746 0.083545 -0.101106
1 -0.250536 -0.045355 0.072153 1.871799
To reset the index, you can use .reset_index(drop=True):
b = a.sort_values(by='d').reset_index(drop=True)
print(b)
>>>
a b c d
0 -0.553612 1.407712 -0.454262 -1.822359
1 -1.046893 0.656053 1.036462 -0.994408
2 -0.772923 -0.554434 -0.254187 -0.948573
3 -1.660773 0.291029 1.785757 -0.457495
4 0.128831 1.399746 0.083545 -0.101106
5 -0.250536 -0.045355 0.072153 1.871799
To find the original index label of the max value, you can use .idxmax() and then .loc[]:
ix_max = a.d.idxmax()
# (np.argmax(a.d) is version-dependent: older pandas returned the label, newer versions return the position)
print(f"ix_max = {ix_max}")
a.loc[ix_max]
>>>
ix_max = 1
a -0.250536
b -0.045355
c 0.072153
d 1.871799
Name: 1, dtype: float64
Or, if you want the position in the new index order, you can use .iloc:
iix = np.argmax(a.d.values)
print(f"iix = {iix}")
print(a.iloc[iix])
>>>
iix = 5
a -0.250536
b -0.045355
c 0.072153
d 1.871799
Name: 1, dtype: float64
You can have a look at https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
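Applied back to the dfUnique frame from the question (a sketch, reusing the column name given there):
ix_max = dfUnique['ch1_log_mag'].idxmax()   # index label of the row with the max magnitude
max_row = dfUnique.loc[ix_max]              # label-based lookup works even after sorting and filtering
# Alternatively, reset the index after filtering so positions and labels coincide again:
dfUnique = dfUnique.reset_index(drop=True)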
Currently I work on a database and I am trying to sort my rows with pandas. I have a column called 'sessionkey' which refers to a session, so each row can be assigned to a session. I tried to separate the data into these sessions.
Furthermore, there can be duplicated rows. I tried to drop those with the drop_duplicates function from pandas.
df = pd.read_csv((path_of_data+'part-00000-9d3e32a7-87f8-4218-bed1-e30855ce6f0c-c000.csv'), keep_default_na=False, engine='python')
tmp = df['sessionkey'].values #I want to split data into different sessions
tmp = np.unique(tmp)
df.set_index('sessionkey', inplace=True)
watching = df.loc[tmp[10]].drop_duplicates(keep='first') #here I pick one example
print(watching.sort_values(by =['eventTimestamp', 'eventClickSequenz']))
print(watching.info())
I would have thought that this works fine, but when I check my results by printing out my split dataframe, the output looks very odd to me. For example, the printed shape says 38 rows x 4 columns, but when I print the same dataframe there are clearly more than 38 rows, and there are still duplicates in it.
I already tried to split the data by using unique indices:
comparison = pd.DataFrame()
for index, item in enumerate(df['sessionkey'].values):
    if item == tmp: comparison = comparison.append(df.iloc[index])
comparison.drop_duplicates(keep='first', inplace=True)
print(comparison.sort_values(by=['eventTimestamp']))
But the Problem is still the same.
The output also seems to follow a pattern. Let's say we have 38 entries. Then pandas prints entries 1-37 and then appends entries 2-38: the last one is left out, and then the whole list is shifted by one and printed again.
When I return the numpy values there are just 38 different rows. So is this a problem with pandas' print function? Is there an error in my code? Does pandas have a problem with non-unique indexes?
EDIT:
Okay, I figured out what the problem is. I wanted to look at a long dataframe, so I used:
pd.set_option('display.max_rows', -1)
Now we can use some sample data:
data = np.array([[119, 0], [119, 1], [119, 2]])
columns = ['sessionkey', 'event']
df = pd.DataFrame(data, columns = columns)
print(df)
Printed it now looks like this:
sessionkey event
0 119 0
1 119 1
1 119 1
2 119 2
Although I expected it to look like this:
sessionkey event
0 119 0
1 119 1
2 119 2
I thought my dataframe had the wrong shape, but this is not the case.
The row in the middle gets printed twice. Is this a bug or the intended output?
drop_duplicates() doesn't look at the index when getting rid of rows; instead it compares whole rows. But it does have a useful subset kwarg which allows you to specify which columns to use.
You can try the following
df = pd.read_csv((path_of_data+'part-00000-9d3e32a7-87f8-4218-bed1-e30855ce6f0c-c000.csv'), keep_default_na=False, engine='python')
print(df.shape)
print(df["sessionkey"].nunique())  # number of unique sessions
df_unique = df.drop_duplicates(subset=["sessionkey"], keep='first')
# these two numbers should be the same
print(df_unique.shape[0])
print(df_unique["sessionkey"].nunique())
It sounds like you want to drop_duplicates based on the index - by default drop_duplicates drops based on the column values. To do that try
df.loc[~df.index.duplicated()]
This selects only the rows whose index values are not duplicated.
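For example, a small sketch on data shaped like the sample in the question (repeated sessionkey used as the index), contrasting the two approaches:
import numpy as np
import pandas as pd

data = np.array([[119, 0], [119, 1], [119, 1], [119, 2]])
df = pd.DataFrame(data, columns=['sessionkey', 'event']).set_index('sessionkey')

# keep only the first row for each duplicated index label
print(df.loc[~df.index.duplicated(keep='first')])

# keep the first of each set of rows with identical values (ignores the index)
print(df.drop_duplicates(keep='first'))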
I used your sample code.
data = np.array([[119, 0], [119, 1], [119, 2]])
columns = ['sessionkey', 'event']
df = pd.DataFrame(data, columns = columns)
print(df)
And I got your expected outcome.
sessionkey event
0 119 0
1 119 1
2 119 2
After I set the max_rows option, as you did:
pd.set_option('display.max_rows', -1)
I got the incorrect outcome.
sessionkey event
0 119 0
1 119 1
1 119 1
2 119 2
The problem might be the "-1" setting. The doc states that "None" will set max rows to unlimited. I am unsure what "-1" will do in a parameter that takes positive integers or None as acceptable values.
Try
pd.set_option('display.max_rows', None)
I am a beginner in Python, and I am studying a textbook to learn the pandas module.
I have a dataframe called Berri_bike, created by the following code:
bike_df=pd.read_csv(os.path.join(path,'comptagevelo2012.csv'),parse_dates=['Date'],\
encoding='latin1',dayfirst=True,index_col='Date')
Berri_bike=bike_df['Berri1'].copy() # get only the column='Berri1'
Berri_bike['Weekday']=Berri_bike.index.weekday
weekday_counts = Berri_bike.groupby('Weekday').aggregate(sum)
weekday_counts
I have 3 columns in my Berri_bike: a date index from 1/1/2012 to 12/31/2012, a value column with a number for each date, and the Weekday column I assigned to it. But when I want to group by the values, I get the error ValueError: Grouper and axis must be same length. I am not sure what this means. What I want to do is very simple, like in SQL: sum(value) grouped by weekday. Can anyone please let me know what happened here?
You copied your column into a pandas Series instead of a new dataframe, hence the following operations behave differently. You can see this if you print out Berri_bike: it doesn't show a column name.
Instead, you should copy the column into a new dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 30, size=(70, 2)),
                  columns=["A", "B"],
                  index=pd.date_range("20180101", periods=70))
Berri_bike = df[["A"]].copy()  # double brackets keep a DataFrame; .copy() avoids a SettingWithCopyWarning
Berri_bike['Weekday'] = Berri_bike.index.weekday
weekday_counts = Berri_bike.groupby("Weekday").sum()
print(weekday_counts)
#sample output
A
Weekday
0 148
1 101
2 127
3 139
4 163
5 74
6 135
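Applied to the original code from the question (a sketch, assuming bike_df is the frame read from comptagevelo2012.csv with the Date index, as above):
Berri_bike = bike_df[['Berri1']].copy()          # double brackets keep a one-column DataFrame
Berri_bike['Weekday'] = Berri_bike.index.weekday
weekday_counts = Berri_bike.groupby('Weekday').sum()
print(weekday_counts)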
I have the following Dataframe:
Rec    Channel  Value1  Value2
Pre             10      20
Pre             35      42
Event  A        23      39
FF              50      75
Post   A        79      11
Post   B        88      69
And have got to the point where with the following code:
res = df[df['Channel'].isin({'A', 'B'})]
I am able to find all the rows in the dataframe where the 'Channel' column has a value of either A or B. I am now trying to write a for loop that goes through and prints each row where A or B is found, separately.
The reasoning for a for loop is that this is just a sample dataframe; my application will have a varying number of A and B rows depending on the dataframe, and I would like to be able to call upon each one individually regardless of the number of instances.
Additionally, I would like an easy way to index the first and last instance where an A or B is found (again, the location will change from dataframe to dataframe), so I can't just do:
res1 = res.loc[4]
to identify the first one in this case. I need something more robust, so that I can call upon the first and last instance regardless of the index. Can someone please assist?
It would go something like this:
res = df[df.Channel.isin(['A', 'B'])]
for row_index, row in res.iterrows():
    print(row_index)
    print(row)
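To get the first and last matching rows without relying on specific index labels (a sketch using positional indexing on the filtered frame):
first_row = res.iloc[0]     # first row where Channel is A or B
last_row = res.iloc[-1]     # last such row
first_label, last_label = res.index[0], res.index[-1]   # their original index labels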
Let's suppose that I have a big data in a csv file:
frame.number  frame.len  frame.cap_len  frame.Type
1             100        100            ICMP
2             64         64             UDP
3             100        100            ICMP
4             87         64             ICMP
I want to change the type of the frame based on its length.
The first problem is that I don't know how to pick out the relevant rows of the column and then change the frame type, like this:
if frame.len == 100 then set frame.Type = ICMP_tt, else if frame.len == 87 then set frame.Type = ICMP_nn
I would like that it looks like this:
frame.number  frame.len  frame.cap_len  frame.Type
1             100        100            ICMP_tt
2             64         64             UDP
3             100        100            ICMP_tt
4             87         64             ICMP_nn
I tried this code, but it doesn't make any modification:
import pandas
df = pandas.read_csv('Test.csv')
if df['frame.len'] == 100:
    df['frame.type'].replace("ICMP_tt")
I would be very grateful if you could help me please.
Similar question: How to conditionally update DataFrame column in Pandas
import pandas
df = pandas.read_csv('Test.csv')
df.loc[df['frame.len'] == 100, 'frame.Type'] = "ICMP_tt"
df.loc[df['frame.len'] == 87, 'frame.Type'] = "ICMP_nn"
df
Result:
   frame.number  frame.len  frame.cap_len frame.Type
0             1        100            100    ICMP_tt
1             2         64             64        UDP
2             3        100            100    ICMP_tt
3             4         87             64    ICMP_nn
This should do the trick. The first item given to df.loc[] is a boolean array telling it which rows to update (it will also accept a single number as a row label, if I recall correctly), and the second item specifies which column to update.
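If there are many length-to-type rules, an alternative (a sketch, assuming the same Test.csv columns as above) is to keep them in a dict and use Series.map, keeping the original type for lengths that are not in the mapping:
import pandas as pd

df = pd.read_csv('Test.csv')
type_by_len = {100: 'ICMP_tt', 87: 'ICMP_nn'}
# map returns NaN for lengths not in the dict; fillna keeps the original frame.Type there
df['frame.Type'] = df['frame.len'].map(type_by_len).fillna(df['frame.Type'])
print(df)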